# Big Data for Biologists: Decoding Genomic Function- Class 4

## How can we compare two or more DNA sequences or compare a DNA sequence to a reference genome? 
## Introduction to high-throughput DNA sequencing and read alignment

##  Learning Objectives
***Students should be able to***
 <ol>
 <li>Identify ways that DNA sequence alignments can provide insights into human biology</li>
 <li>Import a module into a Python script</li>
 <li>Explain what a Python package is and how to import modules from a package </li>
 <li>Align two sequences using modules from the BioPython package </li>
 <li>Align multiple sequences using modules from the BioPython package</li>
 <li>Explain what a reference genome is and why it is important for analysis of DNA sequences </li>
 <li>Recognize FASTQ file format</li>
 <li>Align a sequence to the human reference genome </li>
 </ol>


# How can DNA sequence alignments provide insights into human biology?


<i>
    * "What model organism can I use for experiments to study a gene that has been associated with a human disease?"
    
    * "I made a discovery about how a gene works in fruit flies, could my finding also be relevant in humans?"  
    
    * "How can I analyze my DNA sequencing results to determine if I am at risk of a disease?"  
    
    * "How different are humans from Neanderthals or other ancient humans?"
    
    * "I just finished a sequencing experiment, how can I align my data with the human reference genome?"
</i>


**ALL of these questions utilize the tools of sequence alignment**   


For today's class we will be looking at the very important procedure of DNA sequence alignment. 

We will look at examples of three types of sequence alignments that can be performed:

* Comparing two sequences: **pairwise sequence alignment**
* Comparing three or more sequences: **multiple sequence alignment**
* Comparing sequences to a reference genome: **short-read sequence alignment**

In our examples today we will use DNA sequences, but the methods you will be learning can also be applied to aligning protein sequences. 

We will be showing you how to perform alignments in Python to continue building your skills in Python, but there are also a number of web-based tools for performing both pairwise and multiple sequence alignments such as [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) for pairwise alignments and [CLUSTAL Omega](http://www.ebi.ac.uk/Tools/msa/clustalo/) for multiple sequence alignments. 


# Import a module into a Python script

Writing the algorithms for sequence alignments is beyond the scope of this class. However, we can perform sequence alignments with the help of algorithms that have been developed and shared by others. 

In order to use code that has been shared by others, we need to learn first how to import the code into Python. 

As a starting example, we can look first at how to load the module central_dogma_helpers.py that we created in the last class.

Remember that central_dogma_helpers.py defined two functions, "write_mRNA_from_DNA" and "write_protein_from_mRNA".

Once the module is loaded, we will be able to call these two functions by name in our code. We will not have to write out the entire function.  

In our example, we will also use one of the modules that comes with the Python distribution. We've seen a few examples of the import command already in earlier classes, and now you should have a better understanding of what that command means. 


In [7]:
# Tell Python where to look for .py files by adding ../helpers to the
# list of directories it searches. That list is called the "path".
# sys is a pre-installed module that comes with standard Python distributions.
 
import sys
sys.path.append('../helpers')

#Imports the module central_dogma_helpers.py
import central_dogma_helpers

# Import the names of all the functions in central_dogma_helpers.py.
# The * stands for "all names".
# You could also import each function by its individual name,
# or call each function with the syntax central_dogma_helpers.function_name

from central_dogma_helpers import *

#Runs the two functions in central_dogma_helpers
RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt')
proteinsequence=central_dogma_helpers.write_protein_from_mRNA(RNAsequence)

#Prints the output  
print(proteinsequence)

MetAlaLeuTrpMetArgLeuLeuProLeuLeuAlaLeuLeuAlaLeuTrpGlyProAspProAlaAlaAlaPheValAsnGlnHisLeuCysGlySerHisLeuValGluAlaLeuTyrLeuValCysGlyGluArgGlyPhePheTyrThrProLysThrArgArgGluAlaGluAspLeuGlnValGlyGlnValGluLeuGlyGlyGlyProGlyAlaGlySerLeuGlnProLeuAlaLeuGluGlySerLeuGlnLysArgGlyIleValGluGlnCysCysThrSerIleCysSerLeuTyrGlnLeuGluAsnTyrCysAsnStop
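
The different ways of reaching a function in an imported module, described in the comments above, can be sketched with `math`, a module that ships with Python. This is only an illustration: `math` stands in for central_dogma_helpers, which we assume is not available outside the class server.

```python
# Style 1: import the whole module, then call module.function
import math
root_a = math.sqrt(16)

# Style 2: import one function by name, then call it without the module prefix
from math import sqrt
root_b = sqrt(16)

# Style 3 (not run here): "from math import *" would import every public name
# at once. It is convenient, but it can silently replace names you already use.

print(root_a, root_b)
```

Style 1 makes it obvious where each function comes from, which tends to help once a script imports several modules.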


## What are Python packages and how can I import packages?

There are many publicly available modules that can be imported into Python. Often, modules are made available as part of **packages** which are sets of module files that can be installed by users to expand Python functionality. 

To use a package, you first need to install it. How you install a package depends on the system you are using.

For this class we have pre-installed the packages that you will need. Today we will be using the Biopython package which you can learn more about [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html). 

You can check to see if a package has been installed by running the import command and seeing if there is an error. 

In the code below, we are checking to be sure that the Bio package from BioPython has been installed. We can also check the version number. 

In [8]:
import Bio 
print(Bio.__version__)

1.68
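
If you would rather check for a package without triggering an error, one possible approach is to ask Python whether it can find the package at all. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `is_installed` is our own and is not part of Python or Biopython.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can find a top-level package or module with this name."""
    return importlib.util.find_spec(package_name) is not None

# "Bio" is the import name of Biopython; "sys" always ships with Python.
for name in ["sys", "Bio"]:
    print(name, "installed" if is_installed(name) else "not installed")
```

On a system without Biopython, the second line of output would read "Bio not installed" instead of raising an ImportError.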


As a summary, we've looked today at how to import three types of modules into Python:

* modules that you wrote and saved as .py files (eg. central_dogma_helpers.py from the last class) 
* modules that came with the Python distribution 
* modules that come from packages that you install 

Now that we've set our system up to use the BioPython package we are going to look at ways that we can use it! 

## Align two sequences using BioPython

Now that we have the system set up we are ready to look at a biological question. 

In the last class we looked at the sequence for human insulin. 

In mice, there are two copies of the insulin gene. Which one is more similar to the human gene? 

This is a type of question that we can investigate using pairwise sequence alignments and is the question we'll be looking at today.  

We saved the FASTA sequences for the mouse genes in two files, Mouse Insulin GeneID 16333.txt and Mouse Insulin GeneID 16334.txt, in the data directory for this class.  

In this example we are going to use two modules. One of the modules, pairwise2, implements the algorithm for aligning two sequences. 

The other, SeqIO, is a convenient tool for reading FASTA sequences into Python. 

If you remember back to the first class, we wrote some code to read a FASTA sequence into Python, but we had to separate the header (the first line starting with >) from the actual sequence. 

SeqIO is a Biopython module that conveniently reads in a file and separates (or **parses**) a FASTA record into its ID, name, description, features, and sequence.

If the variable you assign the result of SeqIO.read to is named seq1, you can refer to the ID as seq1.id, the description as seq1.description, and the sequence as seq1.seq.
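
To make "parsing" concrete, here is a rough pure-Python sketch of the kind of splitting that SeqIO does for us. It is not Biopython's implementation: `parse_fasta_record` is a hypothetical helper, the record is made up, and it returns a plain dictionary rather than the record object that SeqIO returns.

```python
def parse_fasta_record(text):
    """Split the text of a single-record FASTA file into its parts."""
    lines = [line.strip() for line in text.strip().splitlines()]
    header = lines[0]
    if not header.startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    description = header[1:]               # whole header line without '>'
    parts = description.split()
    record_id = parts[0] if parts else ""  # first word of the header
    sequence = "".join(lines[1:])          # sequence lines joined together
    return {"id": record_id, "description": description, "seq": sequence}

# Toy record made up for illustration (not a real sequence)
toy = ">toy_gene made-up example\nATGGCC\nCTGTGG\n"
record = parse_fasta_record(toy)
print(record["id"])
print(record["seq"])
```

The real SeqIO module handles many more cases (multiple records, other file formats), which is why we import it instead of writing this ourselves.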

In [62]:
# Note: "from Bio import SeqIO" works on its own; the earlier "import Bio" is not required for it.

# Imports the sequence-reading module SeqIO from the Bio package.
from Bio import SeqIO

#Reads the FASTA sequences 
seq=SeqIO.read('../class_1/data/Human-Insulin-NG_007114.1.txt',"fasta")

print(seq.id)


NG_007114.1:4986-6416


 In the box below, revise the code above to print out the gene sequence instead of the identifier. 


In [63]:
##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 
# Note: "from Bio import SeqIO" works on its own; the earlier "import Bio" is not required for it.

# Imports the sequence-reading module SeqIO from the Bio package.
from Bio import SeqIO

#Reads the FASTA sequences 
seq=SeqIO.read('../class_1/data/Human-Insulin-NG_007114.1.txt',"fasta")

print(seq.seq)

AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGTGGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCGTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGCCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCACCCAGCATGGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTTTTAAAAAGAAGTTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAGCCTGAGGACGGTGTTGGCTTCGGCAGCCCCGAGATACATCAGAGGGTGGGCACGCTCCTCCCTCCACTCGCCCCTCAAACAAATGCCCCGCAGCCCATTTCTCCACCCTCATTTGATGACCGCAGATTCAAGTGTTTTGTTAAGTAAAGTCCTGGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGGGCGAACACCCCATCACGCCCGGAGGAGGGCGTGGCTGCCTGCCTGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGCAGCTCCATAGTCAGGAGATGGGGAAGATGCTGGGGACAGGCCCTGGGGAGAAGTACTGGGATCACCTGTTCAGGCTCCCACTGTGACGCTGCCCCGGGGCGGGGGAA

Now we are ready to use the SeqIO module together with the pairwise2 module to run the sequence alignment. 

In [54]:
# Note: SeqIO was imported in an earlier cell; otherwise we would need to import it here.

# Imports the pairwise sequence alignment module pairwise2 from the Bio package.
from Bio import pairwise2 

#Reads the FASTA sequences 
seq1=SeqIO.read('../class_1/data/Human-Insulin-NG_007114.1.txt',"fasta") 
seq2=SeqIO.read('data/Mouse Insulin GeneID 16333.txt',"fasta")

# Conducts a global pairwise alignment between the two sequences.
# The "xx" suffix sets the scoring: a match scores 1, a mismatch scores 0,
# and gaps carry no penalty.
# http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)

print(alignments[0])

('AGCCCTCCAGGACAG-GCTGCATCA--G-AAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCG-TCAGGTGGGCTCAGGATTCCAGGGTGGC-TGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCGTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCA-GCCTGCCTCAGCCCTGCCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCT-GT-GGATGCGCCTCCTGCCCCTG-CTGGCGCTGCTGGCCCT-CTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGC-TCACACCTGGTGGAAGCTCTCTACCTAG---TGTGCGG-G-GAACG-AGGCTTCTTCTAC-ACACC-CAAGACCCG-CCGGGAGG-CAG-AGGACCTGCAGGGTGA-GCCA-ACTGC-CCATTGCTGCCCCTGGCC-GCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCACCCAGCATGGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTTTTAAAAAGAAGTTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAGCCTGAGGACGGTGTTGGCTTCGGCAGCCCCGA-G-ATACATCAGAGGGTGGGCACGCTCCTCC-CTCCACTCGCCCCTCAAACAAATGCCCCGCA-GCCCATTTCTCCACCCTCATTTGATGACCGCAGAT-TCAAGTGTTTTGTTAAGTAAAGTCCTGGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGGGC-GAACACC--CCATCACGCCCGGAGGAGG-GCGTGGCTGCCTGCCTGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGC-AGCTCCATAGTCAGGAGATGGGGAAGATGCTGGGGAC-AGGCCCTGGGGAGAAGTACTGGGATCACCT
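
For intuition about what globalxx optimizes: with a match worth 1 and mismatches and gaps worth 0, the best global score is simply the largest number of matching bases that can be lined up in order. The sketch below computes only that score with a small dynamic-programming table. `global_xx_score` is a hypothetical helper written for this illustration; it is not part of Biopython and it does not rebuild the gapped strings.

```python
def global_xx_score(a, b):
    """Best global alignment score when a match scores 1 and
    mismatches and gaps score 0."""
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            match = 1 if a[i - 1] == b[j - 1] else 0
            score[i][j] = max(
                score[i - 1][j - 1] + match,  # align a[i-1] with b[j-1]
                score[i - 1][j],              # gap in b
                score[i][j - 1],              # gap in a
            )
    return score[len(a)][len(b)]

print(global_xx_score("ACCGT", "ACG"))  # prints 3
```

Because gaps are free under this scoring, the alignments it favors can contain many short gaps, which is one reason other scoring schemes are often preferred for real analyses.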

In the space below, write the code to run a pairwise sequence alignment with the Mouse Insulin GeneID 16334.txt file. 

In [45]:


##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 
import Bio
from Bio import pairwise2
from Bio import SeqIO
seq1=SeqIO.read('../class_1/data/Human-Insulin-NG_007114.1.txt',"fasta")
seq2=SeqIO.read('data/Mouse Insulin GeneID 16334.txt',"fasta")
alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)
print(alignments[0])

('AG----CCC--T--CCAGG--A-C---A-G-GC-T--GC-A---TCAGAAGAGG-CCATCAAGCAG-GTCT-GTTCC-AAGGGC--CT-TTGCGTCAGGTGGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCC-A-G-CTCTGCAGCAGGGAGGACGTGGCTGG-GCTCGTGAAGCATG-TGGGGG-TGAGCCCAGGGG-CC-C-CA--AGGCAGG-GCAC--CTGGCCTTCA-GCCTG--CCTCAG-CCCT-GCCTG-TCTC-CCAGA-TC-ACTGTCC-TTCTGC--CATGGCCCTGTGGATGCGCCT-CCTGCCCCTGCTGGCGC-TGCTGGC--CCTCTGGGGA--CCTG-ACCC-AGCCGCAGC-CTTTGTG--AAC-CAA-CACCTGT-GC-GGCT-CAC-ACCTGGTGGAAG-CTCTCTACCTAG-TGTGC-GGGGAA-CGA-GGCTTCTTCTACACACCCAA-GA-CCCGCCGG-GA-G-GCAGAGGACCTGCAGGGTGAGCCAAC--TGCCCA-TTGCTGCCC-CTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCACCCAGCATGGGCAGAAGGGGGC-AGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTT-TTAAAAAGA-AGTTCT-CTTGGTC-ACG-TCCTAAAAGTGACCAGC-TCCCTGTGGCCCAGTCAGAATCTCAGCC-TGA-GGACGGTGTTGGCTTCGGCAGCCCCGAGATACATC-AGAGG-GTGGG--CACGCTCCTCCCTCCACTCGCCCCTCAAACA----AATGCCCCGC-AGCC-CATT-TC-TCCACCCT--CATTTGATGAC--C-GC-AGATTC-AAGTGTTTT-GTTAAGTAAAGTCCTGGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGG-GCGAACACCCCATCACGCCCGG-AGGAGGGCGT-GGCTGC-C-TGCCTGAGTGGGCC-AGACCCCTG
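
One way to turn two alignments into a comparable number is percent identity: the fraction of alignment columns in which both sequences carry the same base. The sketch below works on any pair of gapped strings of equal length. `percent_identity` is a hypothetical helper, and we assume (based on the tuple printed above) that the two gapped strings would be available as `alignments[0][0]` and `alignments[0][1]`.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences have the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b)
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)

# Toy gapped strings made up for illustration
print(percent_identity("ACC-GT", "AC-TGT"))
```

Running this on the human alignment with each mouse gene would give two percentages to compare. Dividing by the alignment length is only one of several conventions, so treat the numbers as a rough guide.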

## Align multiple sequences using modules from the BioPython package and the MUSCLE algorithm

If we now want to compare the three sequences together (ie. the two insulin sequences from mouse and the insulin sequence from human), we can perform a multiple sequence alignment. 

Today we are just going to use the example of looking at three sequences, but we will see more examples later where multiple sequence alignments are particularly useful for looking at how well sequences are conserved across species or within populations. 

For the multiple sequence alignments today, we are going to be using an algorithm called MUSCLE. We have preinstalled MUSCLE on the Jupyter Notebook server that we are using. Documentation and information on downloading the program can be found [here](http://www.drive5.com/muscle/). 

In [25]:
# Note: MuscleCommandline is imported directly below, so a separate "import Bio" is not required.

# Import MuscleCommandline from Bio.Align.Applications
from Bio.Align.Applications import MuscleCommandline

# Path to the MUSCLE executable (adjust for your own system)
muscle_exe="/Users/annettesalmeen/Documents/CompBio Programs/muscle3.8.31_i86darwin64"

# Build the MUSCLE command; clw=True requests output in ClustalW format
muscle_cline = MuscleCommandline(muscle_exe,input="data/Human and Mouse Insulin Genes.fa",out="Human and Mouse Insulin Genes.aln",clw=True)

# Creating muscle_cline only builds the command; calling it runs MUSCLE and writes the out file
stdout, stderr = muscle_cline()

Here's what the output file looks like: 

MUSCLE (3.8) multiple sequence alignment


NG_007114.1:4986-6416                 ------------------------------------------------------------
NC_000085.6:52264297-52265015         ACCAGGCAAGTGTTTGGAAACTGCAGCTTCAGCCCCTCTGGCCATCTGCCTACCCACCCC
NC_000073.6:c142679726-142678656      ------------------------------------------------------------
                                                                                                  

NG_007114.1:4986-6416                 ------------------------------------------------------------
NC_000085.6:52264297-52265015         ACCTGGAGACCTTAATGGGCCAAACAGCAAAGTCCAGGGGGCAGAGAGGAGGTACTTTGG
NC_000073.6:c142679726-142678656      ------------------------------------------------------------
                                                                                                  

NG_007114.1:4986-6416                 --------------------------------AGCCCTCCAGGACAGGCTGC-ATCAGAA
NC_000085.6:52264297-52265015         ACTATAAAGCTGGTGGGCATCCAGTAACCCCCAGCCCTTAGTGACCAGCTATAATCAGAG
NC_000073.6:c142679726-142678656      --------------GGGGACCCAGTAACCACCAGCCCTAAGTGATCCGCTACAATCAAAA
                                                                      ******    **   ***   **** * 

NG_007114.1:4986-6416                 GAGGCCATCAAGCAGG-------TCTGTTCCAAGGGCCTTTGCGTCAGGTGGGCTCAGGA
NC_000085.6:52264297-52265015         ACCATCAGCAAGCAGGTATGTACTCTCCTCTTTGGGCCT------------GGCTC----
NC_000073.6:c142679726-142678656      ACCATCAGCAAGCAGGAAGGTACTCTTCTCAGTGGGCCT------------GGCTC----
                                           ** ********       ***  **   ******            *****    

NG_007114.1:4986-6416                 TTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
NC_000085.6:52264297-52265015         -CCCAG-----------CCAAGACTCCAG--CGACTTTAGGGAGAATG----TGGGCTCC
NC_000073.6:c142679726-142678656      -CCCAG-----------CTAAGACCTCAG--GGACTTGAGGTAGGATA----TAGCCTCC
                                        ****           *  ** *  ***     *   *** ** *      * * *** 

NG_007114.1:4986-6416                 TGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCC
NC_000085.6:52264297-52265015         TCTCTTACATG----------------------------------GATCTTTTGCTAGCC
NC_000073.6:c142679726-142678656      TCTCTTACGTG----------------------------------AAACTTTTGCTATCC
                                      *     *  **                                     ***  **   **

NG_007114.1:4986-6416                 TCAGCCCTGCCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCC
NC_000085.6:52264297-52265015         TCAACCCTGCCTATCTTTCAGGTCATTG---TTTCAACATGGCCCTGTTGGTGCACTTCC
NC_000073.6:c142679726-142678656      TCAACCCAGCCTATCTTCCAGGTTATTG---TTTCAACATGGCCCTGTGGATGCGCTTCC
                                      *** *** **** ***  *** * * **   **    *********** * *** * ***

NG_007114.1:4986-6416                 TGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACC
NC_000085.6:52264297-52265015         TACCCCTGCTGGCCCTGCTTGCCCTCTGGGAGCCCAAACCCACCCAGGCTTTTGTCAAAC
NC_000073.6:c142679726-142678656      TGCCCCTGCTGGCCCTGCTCTTCCTCTGGGAGTCCCACCCCACCCAGGCTTTTGTCAAGC
                                      * *********** *****   ********   *  * **  **   ** ***** ** *

NG_007114.1:4986-6416                 AACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCT
NC_000085.6:52264297-52265015         AGCATCTTTGTGGTCCCCACCTGGTAGAGGCTCTCTACCTGGTGTGTGGGGAGCGTGGCT
NC_000073.6:c142679726-142678656      AGCACCTTTGTGGTTCCCACCTGGTGGAGGCTCTCTACCTGGTGTGTGGGGAGCGTGGCT
                                      * ** ** ** **  * ******** ** *********** ***** ***** ** ****

NG_007114.1:4986-6416                 TCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGGTGAGCCAACTGCCC
NC_000085.6:52264297-52265015         TCTTCTACACACCCAAGTCCCGCCGTGAAGTGGAGGACCCACA-----------------
NC_000073.6:c142679726-142678656      TCTTCTACACACCCATGTCCCGCCGTGAAGTGGAGGACCCACAAGGTGAG----------
                                      *************** * ******* ** *  *******  **                 

NG_007114.1:4986-6416                 ATTGCTGCCCCTGG---CCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCACCCAGCAT
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      --TTCTGCCACTGAATTCTGTCCCCAG---------------------------------
                                                                                                  

NG_007114.1:4986-6416                 GGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTTTTAAAAAG
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      ---------------------TGCTAACTACCCTGGTTTTCTTCACACTT---------G
                                                                                                  

NG_007114.1:4986-6416                 AAGTTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAG
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      GGACATTGTAAATTGTGTCCTAGGTGTG--------------------------------
                                                                                                  

NG_007114.1:4986-6416                 CCTGAGGACGGTGTTGGCTTCGGCAGCCCCGAGATACATCAGAGGGTGGGCACGCTCCTC
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      ---------------------GAGGGTCTCGGGATA-ACCAGGGAGTGGGGACAC-----
                                                                                                  

NG_007114.1:4986-6416                 CCTCCACTCGCCCCTCAAACAAATGCCCCGCAGCCCATTTCTCCACCCTCATTTGATGAC
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      ------------------------------------------------------------
                                                                                                  

NG_007114.1:4986-6416                 CGCAGATTCAAGTGTTTTGTTAAGTAAAGTCCTGGGTGACCTGGGGTCACAGGGTGCCCC
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      ------------------------------------------------------------
                                                                                                  

NG_007114.1:4986-6416                 ACGCTGCCTGCCTCTGGGCGAACACCCCATCACGCCCGGAGGAGGGCGTGGCTGCCTGCC
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      ---------GTTTCTGGGGGAA--------GCTAGACATATGTAAACATGGCAGCTGCCA
                                                                                                  

NG_007114.1:4986-6416                 TGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGCAGCTCCATAGTCAGGAGATGGGGA
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      GGAATGAGTAAGAATCCTGCCTTAAGGGGTCCTTGGTGGTAGTAACTTGGGACATGTGAC
                                                                                                  

NG_007114.1:4986-6416                 AGATGCTGGGGACAGGCCCTGGGGAGAAGTACTGGGATCACCTGTTCAGGCTCCCACTGT
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      TAGATCCCAGGATAGG----------------------TACCTATTTAGGGCCCTCATAG
                                                                                                  

NG_007114.1:4986-6416                 GACGCTGCCCCGGGGCGGGGGAAGGAGGT------GGGACATGTGGGCGTTGGGGCCTGT
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      AGCACTGCACTGACTGAAGATGAGTAGGCTTTAGAGGCCCATGTGTCCATCCATGACCAG
                                                                                                  

NG_007114.1:4986-6416                 AGGTCCACACCCAGTGTGGGTGACCCTCCCTCTAACCTGGGTCCAGCCCGGCTGGAGATG
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      TGACTTGTCCCACAGGCATGCAACCCCTGCC---------------ACCTGCAGGGGTTA
                                                                                                  

NG_007114.1:4986-6416                 GGTGGGAGTGCGACCTAGGGCTGGCGGGCAGGCGGGCACTGTGTCTC-CCTGACTGTGTC
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      AGGGGCGAGAAAACCTGGGGT-----AGTAGGAGGTTGCTCAGCTACTCCTGACTGGATT
                                                                                                  

NG_007114.1:4986-6416                 CTCCTGTGTCCCTCTGCCTCGCCGCTGTT----CCGGAACCTGCTCTGCGCGGCACGTCC
NC_000085.6:52264297-52265015         ------------------------------------------------------------
NC_000073.6:c142679726-142678656      TTCCTATGTGTCTTTGCTTCTGTGCTGCTGATGCCCTGGCCTGCTCTGACACAACCTCCC
                                                                                                  

NG_007114.1:4986-6416                 TGGCAGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGG
NC_000085.6:52264297-52265015         ----AGTGGAACAACTGGAGCTGGGAGGAAGCCCCGGGG------ACCTTCAGACCTTGG
NC_000073.6:c142679726-142678656      TGGCAGTGGCACAACTGGAGCTGGGTGGAGGCCCGGGAGCAGGTGACCTTCAGACCTTGG
                                          *****  **  ********** **  **** ** *       *** *** ******

NG_007114.1:4986-6416                 CCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCT
NC_000085.6:52264297-52265015         CGTTGGAGGTGGCCCGGCAGAAGCGTGGCATTGTGGATCAGTGCTGCACCAGCATCTGCT
NC_000073.6:c142679726-142678656      CACTGGAGGTGGCCCAGCAGAAGCGTGGCATTGTAGATCAGTGCTGCACCAGCATCTGCT
                                      *  ****** * *** ****************** ** ** ***** *************

NG_007114.1:4986-6416                 CCCTCTACCAGCTGGAGAACTACTGCAACT-AGACGCAGCCCGCAGGCAGCCCCACACCC
NC_000085.6:52264297-52265015         CCCTCTACCAGCTGGAGAACTACTGCAACTAAGGCCCA--CCTCGACCCGCCCCAC----
NC_000073.6:c142679726-142678656      CCCTCTACCAGCTGGAGAACTACTGCAACT-AGACCCA--CCACTACCCAGCCTAC----
                                      ****************************** ** * **  ** *   *   ** **    

NG_007114.1:4986-6416                 GCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAG------------
NC_000085.6:52264297-52265015         ----CCCTCTGCAAT----------GAATAAAACTTTTGAATAAGCACCAAAAAAAA
NC_000073.6:c142679726-142678656      ----CCCTCTGCAAT----------GAATAAAACCTTTGAATGAGCACAA-------
                                          **  *****            ******* *  *****  **            
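
The line of `*` characters under each block marks the columns where all three sequences carry the same base. As a rough sketch of how such a line could be derived from the aligned rows (this is our own illustration with a hypothetical helper, `conservation_line`, not code taken from MUSCLE):

```python
def conservation_line(rows):
    """Return a string with '*' under columns where every row has the
    same base, and a space everywhere else."""
    marks = []
    for column in zip(*rows):
        same = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if same else " ")
    return "".join(marks)

# Toy alignment rows made up for illustration
rows = [
    "ACGT-A",
    "ACCT-A",
    "ACGTTA",
]
print(repr(conservation_line(rows)))  # prints '** * *'
```

Long runs of `*` in the real output, such as those near the end of the alignment, point to regions that are identical in the human gene and both mouse genes.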


Thought questions:

* What does your analysis tell you about the similarities between human and mouse insulin genes?
* Which insulin gene from mouse is more similar to the human insulin gene?
* What would the alignment look like if you were using processed mRNA or protein sequences instead of gene sequences?

## What is a reference genome and why is it important for analysis of DNA sequences?

In the last part of today's class we are going to look at one of the questions we introduced earlier: 

 "I just finished a sequencing experiment, how can I align my data with the human reference genome?"
 
We are going to introduce the concept of a reference genome and short-read alignment today. In the next class you will be using the reference genome and doing short-read alignment with RNA-seq data. 

First we need to define reference genomes. 

While the genomes of humans are very similar, they still differ in many positions. 
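
To give a feel for what aligning a short read to a reference involves, here is a deliberately naive sketch: slide the read along the reference and report every position where it fits with only a few differing bases. Real short-read aligners use indexed, far faster methods, and they also handle insertions and deletions. The function name `find_read` and both sequences are made up for illustration.

```python
def find_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) for every place the read fits the
    reference with at most max_mismatches differing bases."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for x, y in zip(window, read) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

reference = "TTGACCGTAAGCTAGG"   # toy "reference genome"
read = "CCGTTAG"                 # toy read that differs from the reference at one base
print(find_read(reference, read))  # prints [(4, 1)]
```

The single mismatch in the hit is the kind of position where an individual's sequence can differ from the reference.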




## Introduction to FASTA and FASTQ data formats

You have already seen data in the FASTA format. The first line contains the sequence label, preceded by ">". The second line contains the actual sequence bases (A,C,G,T): 

**>FORJUSP02AJWD1** 

**CCGTCAATTCATTTAAGTTTTAACCTT**

FASTQ format takes this a step further by adding a quality score for each base, encoded as an ASCII character. 
<img src="images/fastq_fig.jpg" align="center"/>



In [None]:
## You can convert the ASCII-encoded quality values to numeric Q scores with the 'ord' function. You must subtract 33
## from the converted value to obtain a Q score

quality_ascii='A:99@::??@@::FFAA'
numerical=[ord(c)-33 for c in quality_ascii]
print(numerical)
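
Putting the pieces together, a FASTQ record spans four lines: a name line starting with `@`, the bases, a separator line starting with `+`, and the quality string. The sketch below parses one such record and also converts each Q score to an error probability using the Phred definition Q = -10 * log10(P). It assumes the offset of 33 used above; `parse_fastq_record` and the toy record are made up for illustration.

```python
def parse_fastq_record(lines):
    """Turn the four lines of one FASTQ record into a dictionary."""
    name_line, bases, separator, quality_ascii = [line.strip() for line in lines]
    if not name_line.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality_ascii):
        raise ValueError("need one quality character per base")
    q_scores = [ord(c) - 33 for c in quality_ascii]
    # Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)
    error_probs = [10 ** (-q / 10) for q in q_scores]
    return {"id": name_line[1:], "seq": bases, "q": q_scores, "p_error": error_probs}

# Toy record made up for illustration
toy = ["@read1", "ACGT", "+", "I5+!"]
record = parse_fastq_record(toy)
print(record["q"])  # prints [40, 20, 10, 0]
```

A Q score of 40 corresponds to roughly a 1 in 10,000 chance that the base call is wrong, while a Q score of 0 means the call carries no information.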