# Big Data for Biologists: Decoding Genomic Function- Class 4

## How can we compare two or more DNA sequences or compare a DNA sequence to a reference genome? 

##  Learning Objectives
***Students should be able to***
 <ol>
   <li><a href=#SeqAlignIntro>Identify ways that DNA sequence alignments can provide insights into human biology</a></li>
 <li><a href=#Import>Import a module into a Python script</a></li>
 <li><a href=#Package>Explain what a Python package is and how to import modules from a package </a></li>
<li><a href=#Align2>Align two sequences using modules from the BioPython package </a></li>
 <li><a href=#AlignMuscle>Align multiple sequences using modules from the BioPython package</a></li>
 </ol>


# How can DNA sequence alignments provide insights into human biology?<a name='SeqAlignIntro' />


<i>
    * "What model organism can I use to study a gene that has been associated with a human disease?"
    
    * "I made a discovery about how a gene works in fruit flies, could my finding also be relevant in humans?"  
    
    * "How can I analyze my DNA sequencing results to determine if I am at risk of a disease?"  
    
    * "How different are humans from Neanderthals or other ancient humans?"
</i>


**ALL of these questions can be addressed using the tools of sequence alignment**   


For today's class we will be looking at the very important procedure of DNA sequence alignment. 

We will look at examples of two types of sequence alignments that can be performed:

* Comparing two sequences: **pairwise sequence alignment**  
* Comparing three or more sequences: **multiple sequence alignment**

In our examples today we will use DNA sequences, but there are also algorithms that can be applied to aligning protein sequences. 

We will be showing you how to perform alignments in Python to continue building your skills in Python, but there are a number of web-based tools for performing both pairwise and multiple sequence alignments such as [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) for pairwise alignments and [CLUSTAL Omega](http://www.ebi.ac.uk/Tools/msa/clustalo/) for multiple sequence alignments. 


# Import a module into a Python script<a name='Import' />

Writing the algorithms for sequence alignments is beyond the scope of this class. However, we can perform sequence alignments with the help of algorithms that have been developed and shared by others. 

In order to use code that has been shared by others, we first need to learn how to import the code into Python. 

As a starting example, we can look first at how to import the module that we created in the last class, central_dogma_helpers.py, into a Python script.

Remember that we defined two functions in central_dogma_helpers.py: "write_RNA_from_DNA" and "write_protein_from_RNA". 

Once the module is imported, we will be able to call these two functions by name in our code. We will not have to write out the entire function.  

In our example, we will also use the sys module that comes with the Python distribution. 

We've seen a few examples of the import command already in earlier classes, and now you should have a better understanding of what that command means. 

As a reminder, the box below has the helper functions that we wrote in the last class. We are going to make a few edits to this file. 

In [None]:
%%writefile ../helpers/central_dogma_helpers.py

#define a function to write the RNA nucleotide sequence
#from a DNA sequence in FASTA format

def write_RNA_from_DNA(FASTAsequence):
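    #opens the FASTA file and closes it again once the lines have been read
    with open(FASTAsequence,'r') as FASTAfile:
        #skips the header line (the first line, which starts with >)
        DNAsequence=FASTAfile.readlines()[1:]
    DNAsequence=''.join(DNAsequence)
    DNAsequence=DNAsequence.replace('\n','')
    RNAsequence='' #this defines the variable 'RNAsequence'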
    for i in DNAsequence:
        if i=='T':
            RNAsequence=RNAsequence+'U'
        else:
            RNAsequence=RNAsequence+ i
    return(RNAsequence)

#define a function to write the protein amino acid sequence
#from an mRNA coding sequence in FASTA format


def write_protein_from_RNA(RNAsequence,start,stop):

    #calls the function defined above called write_RNA_from_DNA
    #note: this replaces the RNAsequence argument with the human insulin sequence
    RNAsequence=write_RNA_from_DNA('../class_01_gene_sequences/data/Human-Insulin NM_000207.2.txt')

    #defines the python dictionary for the one letter genetic code 
    geneticcode1let={'UUU':'F','UUC':'F','UUA':'L','UUG':'L',
         'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
         'AUU':'I','AUC':'I','AUA':'I','AUG':'M',
         'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
         'UCU':'S','UCC':'S','UCA':'S','UCG':'S',
         'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
         'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
         'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
         'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*',
         'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q',
         'AAU':'N','AAC':'N','AAA':'K','AAG':'K',
         'GAU':'D','GAC':'D','GAA':'E','GAG':'E',
         'UGU':'C','UGC':'C','UGA':'*','UGG':'W',
         'CGU':'R','CGC':'R','CGA':'R','CGG':'R',
         'AGU':'S','AGC':'S','AGA':'R','AGG':'R',
         'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

    #defines the string variable proteinseq
    proteinseq=''

    #range command (start,stop(not included),step)

    for i in range(start-1,stop-1,3): 
        proteinseq=proteinseq+str(geneticcode1let[RNAsequence[i:i+3]])

    return (proteinseq)


In [None]:
#Tells Python where to look for .py files
#by adding ../helpers to the list of directories that Python searches for .py files. 
#The list of directories to look in is called the "path". 
#sys is a pre-installed module that comes with standard Python distributions (https://docs.python.org/3.6/library/sys.html)
 
import sys
sys.path.append('../helpers')

#Imports the module central_dogma_helpers.py
import central_dogma_helpers

#Imports the names of all the functions in central_dogma_helpers.py
#The names of all the functions are denoted by the *. 
#You could also import each function by its individual name. 
#Or you could call each function by using the syntax central_dogma_helpers.function_name

from central_dogma_helpers import *

#Runs the two functions in central_dogma_helpers

#Prints the output 
print('RNAsequence:  '+ )
print('\n'+'Protein Sequence:'+ )
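
The import pattern used above can be tried out on its own with a throwaway module. The sketch below is only an illustration: the module name `demo_helpers` and its function `write_RNA_from_DNA_string` are made up for this example and are not part of the class files. It writes a tiny module to a temporary directory, adds that directory to the path, and then calls the function using two different import styles.

```python
import importlib
import os
import sys
import tempfile

# Create a temporary directory holding a tiny module.
# demo_helpers.py is a hypothetical module used only for this illustration.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'demo_helpers.py'), 'w') as module_file:
    module_file.write(
        "def write_RNA_from_DNA_string(DNAsequence):\n"
        "    return DNAsequence.replace('T', 'U')\n"
    )

# Add the directory to the path so Python knows where to look for the module
sys.path.append(module_dir)
importlib.invalidate_caches()

# Style 1: import the module, then call the function as module.function
import demo_helpers
rna_style1 = demo_helpers.write_RNA_from_DNA_string('ATGCTT')

# Style 2: import one function by name, then call it directly
from demo_helpers import write_RNA_from_DNA_string
rna_style2 = write_RNA_from_DNA_string('ATGCTT')

print(rna_style1, rna_style2)
```

Both styles call the same function. The first keeps the module name visible at each call, which can make it easier to see where a function came from.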

## What are Python packages and how can I import packages?<a name='Package' />

There are many publicly available modules that can be imported into Python. Often, modules are made available as part of **packages** which are sets of module files that can be installed by users to expand Python functionality. 

To use a package, the package first needs to be installed. How you install a package will depend on the system that you are using. 

For this class we have pre-installed the packages that you will need. Today we will be using a package called Biopython which you can learn more about [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html). 

You can check to see if a package has been installed by running the import command and seeing if there is an error. 

In the code below, we are checking to be sure that the Bio package from BioPython has been installed. We can also check the version number. 

In [1]:
import Bio 
print(Bio.__version__)

1.70
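
If you would rather check for a package without triggering an error, one option is to ask Python whether it can find the module at all. This is a minimal sketch using the standard library; the helper name `is_installed` and the made-up package name `not_a_real_package_xyz` are assumptions for illustration only.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# json ships with Python; Bio is only found if Biopython is installed;
# not_a_real_package_xyz is a made-up name that should not be found.
for name in ['json', 'Bio', 'not_a_real_package_xyz']:
    print(name, is_installed(name))
```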


As a summary, we've looked today at how to import three types of modules into Python:

* modules that you wrote and saved as .py files (e.g. central_dogma_helpers.py from the last class) 
* modules that came with the Python distribution 
* modules that come from packages that you install 

One final note is that you may hear packages being referred to as **libraries**.  

Now that we've set our system up to use the BioPython package we are going to look at ways that we can use it. 

## Align two sequences using BioPython<a name='Align2' />

In the last class we looked at the sequence for human insulin. 

Mice have two copies of the insulin gene. How similar are they to the human gene? Which one is more similar? 

This is an example question that we'll look at as we learn about pairwise sequence alignments.    

We saved the FASTA sequences for the mouse genes in two files, Mouse Insulin GeneID 16333.txt and Mouse Insulin GeneID 16334.txt, in the data directory for this class.  

In this example we are going to use two modules:

* SeqIO, a convenient tool for reading FASTA sequences into Python. 
* pairwise2, a module that provides algorithms for aligning two sequences. 

If you remember back to the first class, we wrote some code to read a FASTA sequence into Python, but we had to separate the header (the first line starting with >) from the actual sequence. 

SeqIO is a Biopython module that conveniently reads in a file and separates (or **parses**) a FASTA sequence into its ID, name, description, features and the sequence.

If the variable name defined when you call SeqIO is seq1 then you can refer to the id by seq1.id, the description by seq1.description, or the sequence by seq1.seq
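
To make the idea of parsing concrete, here is a simplified sketch in plain Python of what separating a single FASTA record into an id, a description, and a sequence involves. It is not how SeqIO is implemented, and the function name `parse_single_fasta` and the example record are made up for illustration.

```python
def parse_single_fasta(fasta_text):
    """Split one FASTA record into its id, description and sequence."""
    lines = fasta_text.strip().splitlines()
    header = lines[0]
    if not header.startswith('>'):
        raise ValueError('A FASTA record must start with >')
    # The description is the header line without the leading >
    description = header[1:].strip()
    # The id is the first word of the description
    record_id = description.split()[0]
    # The sequence is every remaining line joined together
    sequence = ''.join(line.strip() for line in lines[1:])
    return {'id': record_id, 'description': description, 'seq': sequence}

# A made-up record used only for this example
example_fasta = """>demo_seq_1 made-up example record
ATGGCC
CTGTGG
"""

record = parse_single_fasta(example_fasta)
print(record['id'])
print(record['seq'])
```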

Before we run SeqIO, let's take a look at the source code. 

In [None]:
from IPython.display import IFrame

IFrame("http://biopython.org/DIST/docs/api/Bio.SeqIO-pysrc.html",height=1000,width=1000)

In [None]:
# note: 'from Bio import SeqIO' works on its own; the earlier 'import Bio' is not required for it.

#imports the sequence reading module SeqIO from the Bio package and prints the sequence identifier.  
from Bio import SeqIO

#Reads the FASTA sequence 
seq=SeqIO.read('../class_01_gene_sequences/data/Human-Insulin-NG_007114.1.txt',"fasta")

print(seq.id)


 In the box below, revise the code above to print out the gene sequence instead of the identifier. 


In [None]:
#imports the sequence reading module SeqIO from the Bio package and prints the nucleotide sequence. 

Now we are ready to use the SeqIO module with the pairwise2 module to run the sequence alignment. 

In [None]:
# Write code to do a pairwise sequence alignment between the Human-Insulin Gene NG_007114.1
# and the Mouse Insulin GeneID 16333

# note we already ran 'from Bio import SeqIO' above; otherwise we would need to have it here.

#imports the pairwise sequence alignment module pairwise2 from the Bio package. 
from Bio import pairwise2 

#import the sequence_alignment_helpers.py file in the helpers directory


#Reads the FASTA sequences 
seq1=SeqIO.read('../class_01_gene_sequences/data/Human-Insulin-NG_007114.1.txt',"fasta") 
seq2=SeqIO.read('data/Mouse Insulin GeneID 16333.txt',"fasta")

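#Conducts a global pairwise alignment between the two sequences  
#the xx gives instructions about how to calculate the alignment score:
#the first x means identical characters score 1 and all others score 0; the second x means no gap penalties
#http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html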

alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)
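
The `xx` in `globalxx` selects the simplest scoring scheme: identical characters score 1, everything else scores 0, and gaps cost nothing. The sketch below shows how a global alignment score can be computed under that kind of scheme with a standard dynamic programming table. It is a teaching illustration, not Biopython's implementation, and the function name `global_alignment_score` and the short test sequences are made up for this example.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Return the best global alignment score for two sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing means inserting only gaps
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            gap_in_b = score[i - 1][j] + gap
            gap_in_a = score[i][j - 1] + gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    return score[-1][-1]

# With the default scheme the score is the number of matched positions
print(global_alignment_score('ACCGT', 'ACG'))
# Penalizing gaps lowers the score when the lengths differ
print(global_alignment_score('ACCGT', 'ACG', gap=-1))
```

With no gap penalty, the score simply counts how many characters can be matched up in order, which is why long sequences with many gaps can still score highly under this scheme.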

In [None]:
#uses the sequence_alignment_helper functions to print the alignments with a nice format
align1_linebreaks=insert_newlines(alignments[0][0])
align2_linebreaks=insert_newlines(alignments[0][1])
 

#format_alignment_linebreak inputs are: align1_linebreaks,align2_linebreaks,score,begin,end,seq1,seq2
print(format_alignment_linebreak(align1_linebreaks,align2_linebreaks,alignments[0][2],alignments[0][3],
                                 alignments[0][4],seq1,seq2))
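
The helper functions used above live in sequence_alignment_helpers.py and are not shown in this notebook. As a rough idea of what breaking a long aligned sequence into lines can involve, here is a hypothetical sketch. The function name `wrap_sequence` and the default width of 60 are assumptions, and the course's own helpers may work differently.

```python
def wrap_sequence(sequence, width=60):
    """Insert a newline after every `width` characters of a sequence string."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return '\n'.join(chunks)

# A short made-up aligned sequence (the - marks a gap), wrapped at 4 characters
print(wrap_sequence('ACGT-ACGTAC', width=4))
```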

In the space below, write the code to run a pairwise sequence alignment with the Mouse Insulin GeneID 16334.txt file. 

In [None]:
# Write code to do a pairwise sequence alignment between the Human-Insulin Gene NG_007114.1
# and the Mouse Insulin GeneID 16334

## Align multiple sequences using modules from the BioPython package and the MUSCLE algorithm<a name='AlignMuscle' />

If we now want to compare the three sequences together (i.e., the two insulin sequences from mouse and the insulin sequence from human), we can perform a multiple sequence alignment. 

Today we are just going to use the example of looking at three sequences, but we will see more examples later where multiple sequence alignments are particularly useful for looking at how well sequences are conserved across species or within populations. 

For the multiple sequence alignments today, we are going to be using an algorithm called MUSCLE. We have preinstalled MUSCLE on the Jupyter Notebook server that we are using. Documentation and information on downloading the program can be found [here](http://www.drive5.com/muscle/). 

The input for the multiple sequence alignment is a list of sequences in FASTA format. We have created this file for you at data/Human and Mouse Insulin Genes.fa, and it is used as the input in the code below. 
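
A file with several FASTA sequences is just the individual records placed one after another, each starting with its own header line. The sketch below writes three short records to a temporary file and counts the headers. The record names and sequences are made up for illustration and are not the real insulin genes.

```python
import os
import tempfile

# Made-up names and sequences, used only to show the file layout
records = [
    ('demo_human', 'ATGGCCCTGTGG'),
    ('demo_mouse_1', 'ATGGCCCTGTTG'),
    ('demo_mouse_2', 'ATGGCCTTGTGG'),
]

# Write each record as a header line (starting with >) followed by its sequence
fasta_path = os.path.join(tempfile.mkdtemp(), 'demo_sequences.fa')
with open(fasta_path, 'w') as fasta_file:
    for name, sequence in records:
        fasta_file.write('>' + name + '\n')
        fasta_file.write(sequence + '\n')

# Count the records by counting the header lines
with open(fasta_path) as fasta_file:
    header_count = sum(1 for line in fasta_file if line.startswith('>'))

print(header_count)
```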

In [2]:
# Runs a multiple sequence alignment between the Human-Insulin Gene NG_007114.1
# the Mouse Insulin GeneID 16333 and the Mouse Insulin GeneID 16334

# note: the import below works on its own; the earlier 'import Bio' is not required for it.

# we are importing the MuscleCommandline here from Bio.Align.Applications 
from Bio.Align.Applications import MuscleCommandline

# defines a variable with the path of the executable for the MUSCLE algorithm program 
muscle_exe="/usr/bin/muscle"

#runs the multiple sequence alignment and writes to an out file in ClustalW format 
muscle_cline = MuscleCommandline(muscle_exe,input="data/Human and Mouse Insulin Genes.fa",out="Human_and_Mouse_Insulin_Genes.aln",clw=True)
muscle_cline()

('',
 '\nMUSCLE v3.8.31 by Robert C. Edgar\n\nhttp://www.drive5.com/muscle\nThis software is donated to the public domain.\nPlease cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.\n\nHuman and Mouse Insulin Genes 3 seqs, max length 1430, avg  length 1073\n00:00:00    22 MB(-2%)  Iter   1   16.67%  K-mer dist pass 1\n00:00:00    22 MB(-2%)  Iter   1  100.00%  K-mer dist pass 1\n00:00:00    22 MB(-2%)  Iter   1   16.67%  K-mer dist pass 2\n00:00:00    22 MB(-2%)  Iter   1  100.00%  K-mer dist pass 2\n00:00:00    23 MB(-3%)  Iter   1   50.00%  Align node       \n00:00:00    28 MB(-3%)  Iter   1  100.00%  Align node\n00:00:00    28 MB(-3%)  Iter   1  100.00%  Align node\n00:00:00    28 MB(-3%)  Iter   1   33.33%  Root alignment\n00:00:00    28 MB(-3%)  Iter   1   66.67%  Root alignment\n00:00:00    28 MB(-3%)  Iter   1  100.00%  Root alignment\n00:00:00    28 MB(-3%)  Iter   1  100.00%  Root alignment\n00:00:00    28 MB(-3%)  Iter   2  100.00%  Root alignment\n00:00:00    28 MB(-3%)  It

Thought questions:

* What does your analysis tell you about the similarities between human and mouse insulin genes? 
* Which insulin gene from mouse is more similar to the human insulin gene? 
* What do you think the alignment would look like if you were using processed mRNA or protein sequences instead of gene sequences?
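
One way to put a number on "more similar" is percent identity: the share of aligned positions where two sequences carry the same character. There are several definitions in use; the sketch below assumes the denominator is every alignment column except those where both sequences have a gap. The function name `percent_identity` and the example strings are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of compared alignment columns where the two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError('Aligned sequences must have the same length')
    matches = 0
    compared = 0
    for char_a, char_b in zip(aligned_a, aligned_b):
        # Skip columns where both sequences have a gap
        if char_a == '-' and char_b == '-':
            continue
        compared += 1
        if char_a == char_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared

# Two made-up aligned sequences: 4 of the 6 columns match
print(round(percent_identity('ATG-CC', 'ATGACT'), 1))
```

Computing this value for each mouse sequence against the human sequence, using the aligned strings from your alignments, gives a simple basis for comparing them.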