# Module 1 Project: Working with FASTA sequences

## Learning Objectives

Time to put all your new skills into practice! You will download two gene sequence files from the NCBI database and align the two genes using the Biopython package.

## Objectives

You will compare the human alcohol dehydrogenase 1A gene (NM_000667.4) to a similar gene sequence from the American mink (*Neovison vison*, XM_044226065.1), which is the ADH E chain.
1. Determine the lengths of the two DNA sequences (hint: `len(X)` gives you the length of any sequence or string).
2. Calculate the GC% of each sequence (hint: we wrote a function called `count_base` that you could copy or recreate).
3. Perform a pairwise global alignment and obtain the score (hint: use the tools within `pairwise2`, imported from `Bio`).

<div class="alert alert-block alert-info"> <b>Tip:</b> If you need help, the solutions begin in the next box. </div>

In [None]:
from Bio import Entrez, SeqIO

Entrez.email = "yourname@gmail.com"

handle = Entrez.efetch(db="nucleotide", id="NM_000667.4", rettype="fasta")

humanADH1A = SeqIO.read(handle, "fasta")
handle.close()

print(humanADH1A.id)
print(humanADH1A.description)
print(humanADH1A.seq)
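
To make it clearer what `SeqIO.read` hands back for a FASTA file, here is a minimal sketch that splits one FASTA record into its ID, description, and sequence using only the Python standard library. The `parse_single_fasta` helper and the example record are hypothetical and made up for illustration; Biopython's own parser handles many more cases.

```python
def parse_single_fasta(text):
    """Split one FASTA record into (id, description, sequence).

    Hypothetical helper for illustration; not part of Biopython.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")
    description = lines[0][1:]            # header without the leading '>'
    parts = description.split()
    record_id = parts[0] if parts else ""  # first word of the header
    sequence = "".join(lines[1:])         # sequence lines joined into one string
    return record_id, description, sequence


# Made-up record, not real NCBI data
example = """>demo_1 hypothetical example sequence
ATGGCC
TTAGGA
"""

record_id, description, sequence = parse_single_fasta(example)
print(record_id)
print(description)
print(sequence, len(sequence))
```

As with the Biopython record above, the description here is the whole header line, so it begins with the ID.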


Now, carry out the same process with the similar ADH (E chain) from *Neovison vison*, XM_044226065.1. This time we'll use the GenBank format to show that the steps are nearly identical.

In [None]:
Entrez.email = "yourname@gmail.com"

handle = Entrez.efetch(db="nucleotide", id="XM_044226065.1", rettype="gb", retmode="text")

minkADH = SeqIO.read(handle, "genbank")
handle.close()

print(minkADH.id)
print(minkADH.description)
print(minkADH.seq)

Next, determine the sequence lengths and GC percentages:

In [None]:
def count_base(dna, base):  #takes 2 inputs: a sequence and the letter to look for
    dna = dna.upper()    #convert all letters in the sequence to uppercase
    base = base.upper()  #convert the letter provided to uppercase
    return dna.count(base)

humADH_GC = count_base(humanADH1A.seq, "C") + count_base(humanADH1A.seq, "G")
humGC_percent = 100 * humADH_GC / len(humanADH1A.seq)  #multiply by 100 to get a percentage
print(humGC_percent)
print(len(humanADH1A.seq))
minkADH_GC = count_base(minkADH.seq, "C") + count_base(minkADH.seq, "G")
minkGC_percent = 100 * minkADH_GC / len(minkADH.seq)  #multiply by 100 to get a percentage
print(minkGC_percent)
print(len(minkADH.seq))
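
The same calculation can be wrapped in a single reusable function. The sketch below is one possible way to do it; the name `gc_percent` and the toy sequence are assumptions made for illustration and are not part of the course code. Calling `str()` on the input means it should also accept a Biopython `Seq` object.

```python
def gc_percent(sequence):
    """Return the percentage (0 to 100) of G and C bases in a sequence."""
    sequence = str(sequence).upper()  # accept mixed-case input
    if not sequence:
        return 0.0                    # avoid dividing by zero on empty input
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


# Made-up toy sequence, not real gene data
toy = "atgcGGcc"
print(len(toy))
print(gc_percent(toy))
```

With the records fetched earlier, this would be called as `gc_percent(humanADH1A.seq)` and `gc_percent(minkADH.seq)`.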

Align the two sequences. Suggestion: to check that the code works, you may want to align and display only a smaller portion of each sequence, e.g., `...(humanADH1A.seq[0:50], minkADH.seq[100:200])`.

In [None]:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
alignments = pairwise2.align.globalxx(humanADH1A.seq, minkADH.seq)

#option: print just a portion to view
#alignments = pairwise2.align.globalxx(humanADH1A.seq[0:50], minkADH.seq[100:200])

print(format_alignment(*alignments[0]))
print(len(alignments))  #bonus: how many best-scoring alignments were returned
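
To see where the alignment score comes from, it helps to know that `globalxx` awards 1 point for each matching position and applies no penalty for mismatches or gaps. The sketch below computes a score under those same rules with a small dynamic-programming table. The function name `global_match_score` and the toy sequences are assumptions for illustration; this is not how Biopython implements the alignment, and it returns only the score, not the aligned sequences.

```python
def global_match_score(seq_a, seq_b):
    """Best global alignment score with match = 1, mismatch = 0, gaps = 0."""
    cols = len(seq_b) + 1
    previous = [0] * cols  # scores for the row above the current one
    for i in range(1, len(seq_a) + 1):
        current = [0] * cols
        for j in range(1, cols):
            match = 1 if seq_a[i - 1] == seq_b[j - 1] else 0
            current[j] = max(
                previous[j - 1] + match,  # align the two letters
                previous[j],              # gap in seq_b
                current[j - 1],           # gap in seq_a
            )
        previous = current
    return previous[-1]


# Made-up toy sequences, not real gene data
print(global_match_score("ACGT", "AGT"))  # prints 3
```

The comparison is case-sensitive, so convert both sequences to the same case first if they might differ.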

## Conclusion
By now, you should have a basic grasp of how variables hold information that Python scripts can process.
With that foundation, we can look at more advanced data structures, which will be necessary for bioinformatics.

## Clean up
Remember to shut down your Jupyter Notebook instance when you are done for the day to avoid unnecessary charges. You can do this by stopping the notebook instance from the Cloud console.