# Big Data for Biologists: Decoding Genomic Function - Class 3

## How can we predict the amino acid sequence for the protein product made from an mRNA sequence? 
## What are genomic coordinates and how are they used to designate the position of a gene in the human genome? 

##  Learning Objectives
***Students should be able to***
 <ol>
 <li>Use the genetic code to determine the amino acid sequence of the protein product of an mRNA sequence </li>
 <li>Make a Python dictionary for the genetic code (also called a look up table) </li>
 <li>Define and call a function in a Python script </li>
 <li>Predict a protein sequence from a processed mRNA sequence using a Python dictionary</li>
 <li>Explain how genomic coordinates are used to designate the position of a gene or feature in the human genome </li>
 <li>Use genomic coordinates to make a file in BED format </li>
 <li>Use the genomic analysis package Bedtools to extract exon regions from a gene sequence</li>
 </ol>


## How can we use the genetic code to determine the amino acid sequence of the protein product made from an mRNA sequence?

In the last class, we wrote code to transcribe DNA to pre-mRNA and concluded by finding the start and stop codons in an mRNA sequence. Today we are going to look at the next step in gene expression, the translation of an mRNA sequence into protein. 

<img src="../Images/1-CentralDogma.png" style="width: 40%; height: 50%" align="center"//>

As a reminder, during translation, every three bases in an mRNA sequence, beginning with the start codon, code for one amino acid. These three-base sequences are called codons. 
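
To make the idea of reading a sequence three bases at a time concrete, here is a minimal sketch that splits an mRNA string into codons. The function name `split_into_codons` is our own, chosen for illustration, and the sketch assumes that reading begins at the first base of the string.

```python
def split_into_codons(mrna):
    """Return the consecutive three-base codons in an mRNA string.

    One or two leftover bases at the end of the string are ignored.
    """
    codons = []
    # Step through the string three bases at a time.
    for i in range(0, len(mrna) - 2, 3):
        codons.append(mrna[i:i + 3])
    return codons


print(split_into_codons("AUGGCCCUG"))  # ['AUG', 'GCC', 'CUG']
```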

The start codon, as we saw previously, is AUG in mRNA (ATG in the corresponding DNA sequence), which codes for the amino acid methionine. Below is the **genetic code** for the rest of the amino acids as well as the three stop codons. 


<img src="../Images/3-Genetic Code.png" style="width: 40%; height: 40%" align="center"//>

 

In the final step of our last class we found three possible combinations of start and stop codons: 

* start codon: 60 stop codon: 390 orf length: 332
* start codon: 72 stop codon: 390 orf length: 320
* start codon: 442 stop codon: 448 orf length: 8

The actual combination of start and stop codons is often the combination that results in the longest sequence, but the true start and stop codons need to be determined experimentally. 
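
As a rough illustration of the "longest sequence" rule of thumb, the sketch below picks the candidate with the largest distance between start and stop positions. The variable and function names are our own, and the simple `stop - start` measure is an assumption used only for ranking; it is not the same calculation as the "orf length" values listed above.

```python
# (start, stop) base positions of the three candidates listed above
candidates = [(60, 390), (72, 390), (442, 448)]


def span(candidate):
    """Distance between the start and stop positions of one candidate."""
    start, stop = candidate
    return stop - start


# max() with key=span returns the candidate with the largest span.
longest = max(candidates, key=span)
print(longest)  # (60, 390)
```

Remember that this only suggests the most likely candidate; the true start and stop codons still have to be confirmed experimentally.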

In this case, the actual start codon for human insulin begins at base 60 in the mRNA sequence that we were using in the last class. 

Here is the mRNA sequence for the first seven codons (21 bases), starting at base 60, of the insulin sequence from the previous class: 

AUGGCCCUGUGGAUGCGCCUC 

As an exercise, write out the amino acid sequence corresponding to the mRNA sequence. 

The amino acid sequence should be: 

MALWMRL

We are now going to learn additional Python tools to help us create the code to write out the amino acid sequence of a protein product that is made from an mRNA sequence.

## Making a Python dictionary for the genetic code

The Python code we will be writing today has some similarities to scripts that we looked at in the last class. 

Last time, when we wrote out the complementary DNA sequence we made four substitutions and when we wrote out the mRNA sequence we made one substitution. 

To write out the amino acid sequence that will be produced from an mRNA sequence, we will need to make sixty-four substitutions. That's a lot! 

We could write code similar to what we used in the last class, but it would be long and messy.

A helpful Python tool to know about is the dictionary, also known as a look-up table. 

Python dictionaries let you define a number of substitutions in a single statement rather than as a series of if statements. 

There are a few different ways to define a dictionary in Python. If you are interested, this  [link](https://docs.python.org/2/library/stdtypes.html#typesmapping) gives the complete syntax options. 

Before we define the dictionary for translation, we are going to practice writing a dictionary for writing out a complementary DNA sequence.

We'll use the same code as last time, but we will use a Python dictionary to do the substitutions instead of using if statements. 

The syntax for creating the dictionary is: 

DNAdict={'A':'T','T':'A','G':'C','C':'G'}

DNAdict is the name of the dictionary.

For the first entry, the 'A' before the : is the element in the original sequence. The 'T' after the : is the element in the new sequence. 
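
Before running the full script, it may help to try a few look-ups by hand. The sketch below uses the same `DNAdict`; the short test sequence and the use of `.get` with a fallback value of `'N'` are our own additions for illustration.

```python
DNAdict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

# Square brackets look up the value stored under a key.
print(DNAdict['A'])  # T
print(DNAdict['G'])  # C

# Build the complement of a short sequence one base at a time.
# .get(base, 'N') returns 'N' for any character that is not a key
# (an assumption made for this sketch; DNAdict[base] would raise a KeyError).
short_sequence = 'AGCCN'
complement = ''
for base in short_sequence:
    complement = complement + DNAdict.get(base, 'N')
print(complement)  # TCGGN
```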


In [8]:
#Write out the complementary sequence for a DNA sequence using a look up table
FASTAgenesequence=open('../class_1/data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)
genesequence=genesequence.replace('\n','') #this line removes the \n values from the genesequence
DNAdict={'A':'T','T':'A','G':'C','C':'G'}
complementarysequence='' #this defines the variable 'complementarysequence'
for i in genesequence:
    complementarysequence=complementarysequence+str(DNAdict[i])
print (complementarysequence)

TCGGGAGGTCCTGTCCGACGTAGTCTTCTCCGGTAGTTCGTCCAGACAAGGTTCCCGGAAACGCAGTCCACCCGAGTCCTAAGGTCCCACCGACCTGGGGTCCGGGGTCGAGACGTCGTCCCTCCTGCACCGACCCGAGCACTTCGTACACCCCCACTCGGGTCCCCGGGGTTCCGTCCCGTGGACCGGAAGTCGGACGGAGTCGGGACGGACAGAGGGTCTAGTGACAGGAAGACGGTACCGGGACACCTACGCGGAGGACGGGGACGACCGCGACGACCGGGAGACCCCTGGACTGGGTCGGCGTCGGAAACACTTGGTTGTGGACACGCCGAGTGTGGACCACCTTCGAGAGATGGATCACACGCCCCTTGCTCCGAAGAAGATGTGTGGGTTCTGGGCGGCCCTCCGTCTCCTGGACGTCCCACTCGGTTGACGGGTAACGACGGGGACCGGCGGGGGTCGGTGGGGGACGAGGACCGCGAGGGTGGGTCGTACCCGTCTTCCCCCGTCCTCCGACGGTGGGTCGTCCCCCAGTCCACGTGAAAAAATTTTTCTTCAAGAGAACCAGTGCAGGATTTTCACTGGTCGAGGGACACCGGGTCAGTCTTAGAGTCGGACTCCTGCCACAACCGAAGCCGTCGGGGCTCTATGTAGTCTCCCACCCGTGCGAGGAGGGAGGTGAGCGGGGAGTTTGTTTACGGGGCGTCGGGTAAAGAGGTGGGAGTAAACTACTGGCGTCTAAGTTCACAAAACAATTCATTTCAGGACCCACTGGACCCCAGTGTCCCACGGGGTGCGACGGACGGAGACCCGCTTGTGGGGTAGTGCGGGCCTCCTCCCGCACCGACGGACGGACTCACCCGGTCTGGGGACAGCGGTCCGGAGTGCCGTCGAGGTATCAGTCCTCTACCCCTTCTACGACCCCTGTCCGGGACCCCTCTTCATGACCCTAGTGGACAAGTCCGAGGGTGACACTGCGACGGGGCCCCGCCCCCTT

Using the space below, start writing a python dictionary for the genetic code starting with the four entries for the upper left hand corner of the table in the figure above. 

In [None]:

##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 
geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
      }

We will use a complete dictionary for the genetic code below when we write out the protein sequence for an mRNA sequence. First, though, we will cover one other tool for writing more complex scripts: creating functions. 

## Defining and calling functions in Python scripts 

As you start to write more complex Python scripts, it is very helpful to define a function for any set of commands that accomplishes a particular task and is used repeatedly, either within a script or between scripts. 

Once the function is defined, the series of commands can be run with one line of code rather than multiple lines. 

An example of a task that we will be using more than once is converting between DNA and RNA sequences, that is, converting the 'T's in a sequence to 'U's. 

We wrote the code for the conversion of Ts to Us in a sequence in the last class. In this class we are going to write a function.   

Functions are defined using the **def** keyword followed by the name of the function and then, in parentheses, any necessary inputs separated by commas. 

In the example below: 

    def write_mRNA_from_DNA(filename_FASTAgenesequence):
 
defines a function called "write_mRNA_from_DNA". 

The input to the function is assigned the variable name "filename_FASTAgenesequence", and will be defined when the function is called. 

In the example below the line: 

RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt')

calls the function "write_mRNA_from_DNA". 

The input sequence '../class_1/data/Human-Insulin NM_000207.2.txt' is assigned the name 'filename_FASTAgenesequence' within the function. 

At the end, the "return" command defines the output of the function. 
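
The same three pieces (the definition, the call, and the returned value) can be seen in a much smaller function. This is a minimal sketch: the name `convert_T_to_U` is hypothetical, and unlike the function in this lesson it takes a sequence string directly rather than a file name.

```python
def convert_T_to_U(sequence):
    """Return a copy of sequence with every 'T' replaced by 'U'."""
    converted = sequence.replace('T', 'U')
    return converted


# Calling the function: 'ATGGCC' is assigned to 'sequence' inside the function,
# and the returned value is stored in 'short_RNA'.
short_RNA = convert_T_to_U('ATGGCC')
print(short_RNA)  # AUGGCC
```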

In [14]:
#define a function to write mRNA from DNA sequence in FASTA format

def write_mRNA_from_DNA(filename_FASTAgenesequence):
    FASTAgenesequence=open(filename_FASTAgenesequence,'r')
    genesequence=(FASTAgenesequence.readlines()[1:])
    genesequence=''.join(genesequence)
    genesequence=genesequence.replace('\n','')
    RNAsequence='' #this defines the variable 'RNAsequence'
    for i in genesequence:
        if i=='T':
            RNAsequence=RNAsequence+'U'
        else:
            RNAsequence=RNAsequence+ i
    return(RNAsequence)

RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt')
print(RNAsequence)

AGCCCUCCAGGACAGGCUGCAUCAGAAGAGGCCAUCAAGCAGAUCACUGUCCUUCUGCCAUGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUGGCCCUCUGGGGACCUGACCCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAGCUCUCUACCUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGACCUGCAGGUGGGGCAGGUGGAGCUGGGCGGGGGCCCUGGUGCAGGCAGCCUGCAGCCCUUGGCCCUGGAGGGGUCCCUGCAGAAGCGUGGCAUUGUGGAACAAUGCUGUACCAGCAUCUGCUCCCUCUACCAGCUGGAGAACUACUGCAACUAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCUCCUGCACCGAGAGAGAUGGAAUAAAGCCCUUGAACCAGCAAAA


## Predict a protein sequence from a processed RNA sequence using a Python dictionary

We are now ready to write our script to predict a protein sequence from a processed mRNA sequence. We will both call a function and use a dictionary in the code. 

The code has three main sections. 

First, since we wrote our genetic code using uracil (reflecting the correct biology!), we will need to convert our processed mRNA sequence, which is stored in the file with thymines (T), to the actual mRNA sequence with uracils (U). 

For that, we are going to use the write_mRNA_from_DNA function that we defined above. 

Second, we will define a complete python dictionary for the three letter genetic code. 

Finally, we will iterate over the RNAsequence from the start to the stop codon and will convert each codon to the corresponding amino acid defined in the dictionary. 

We will use base 60 as the position of the start codon and base 390 as the position of the stop codon, since those are the start and stop codons in the actual insulin sequence.  

Complete the code below to write out the protein sequence for an mRNA sequence. 
 

In [19]:
#Write out the protein sequence for a mRNA sequence

#calls the function defined above called write_mRNA_from_DNA
RNAsequence=

#defines the python dictionary for the three letter genetic code 
geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
     'CUU':'Leu','CUC':'Leu','CUA':'Leu','CUG':'Leu',
     'AUU':'Ile','AUC':'Ile','AUA':'Ile','AUG':'Met',
     'GUU':'Val','GUC':'Val','GUA':'Val','GUG':'Val',
     'UCU':'Ser','UCC':'Ser','UCA':'Ser','UCG':'Ser',
     'CCU':'Pro','CCC':'Pro','CCA':'Pro','CCG':'Pro',
     'ACU':'Thr','ACC':'Thr','ACA':'Thr','ACG':'Thr',
     'GCU':'Ala','GCC':'Ala','GCA':'Ala','GCG':'Ala',
     'UAU':'Tyr','UAC':'Tyr','UAA':'Stop','UAG':'Stop',
     'CAU':'His','CAC':'His','CAA':'Gln','CAG':'Gln',
     'AAU':'Asn','AAC':'Asn','AAA':'Lys','AAG':'Lys',
     'GAU':'Asp','GAC':'Asp','GAA':'Glu','GAG':'Glu',
     'UGU':'Cys','UGC':'Cys','UGA':'Stop','UGG':'Trp',
     'CGU':'Arg','CGC':'Arg','CGA':'Arg','CGG':'Arg',
     'AGU':'Ser','AGC':'Ser','AGA':'Arg','AGG':'Arg',
     'GGU':'Gly','GGC':'Gly','GGA':'Gly','GGG':'Gly'}

#translates the RNAsequence into protein 

#In the range command: 
#The first number is the position of the first base of the start codon, using numbering starting at zero (base 60 is index 59).
#The second number is where the range ends. The range stops before this number, so 390 makes index 389, the first base of the stop codon, the last value of i, and the stop codon is included. 
#The third number is the step size: i advances by three bases every iteration since codons come in threes. 

proteinseq=''

for i in range(59,390,3): 
    proteinseq=proteinseq+str(geneticcode3let[])
print ( )



##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 

#Write out the protein sequence for a mRNA sequence

#calls the function defined above called write_mRNA_from_DNA
RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt') 

#defines the python dictionary for the three letter genetic code 
geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
     'CUU':'Leu','CUC':'Leu','CUA':'Leu','CUG':'Leu',
     'AUU':'Ile','AUC':'Ile','AUA':'Ile','AUG':'Met',
     'GUU':'Val','GUC':'Val','GUA':'Val','GUG':'Val',
     'UCU':'Ser','UCC':'Ser','UCA':'Ser','UCG':'Ser',
     'CCU':'Pro','CCC':'Pro','CCA':'Pro','CCG':'Pro',
     'ACU':'Thr','ACC':'Thr','ACA':'Thr','ACG':'Thr',
     'GCU':'Ala','GCC':'Ala','GCA':'Ala','GCG':'Ala',
     'UAU':'Tyr','UAC':'Tyr','UAA':'Stop','UAG':'Stop',
     'CAU':'His','CAC':'His','CAA':'Gln','CAG':'Gln',
     'AAU':'Asn','AAC':'Asn','AAA':'Lys','AAG':'Lys',
     'GAU':'Asp','GAC':'Asp','GAA':'Glu','GAG':'Glu',
     'UGU':'Cys','UGC':'Cys','UGA':'Stop','UGG':'Trp',
     'CGU':'Arg','CGC':'Arg','CGA':'Arg','CGG':'Arg',
     'AGU':'Ser','AGC':'Ser','AGA':'Arg','AGG':'Arg',
     'GGU':'Gly','GGC':'Gly','GGA':'Gly','GGG':'Gly'}

#translates the RNAsequence into protein 

#In the range command: 
#The first number is the position of the first base of the start codon, using numbering starting at zero (base 60 is index 59).
#The second number is where the range ends. The range stops before this number, so 390 makes index 389, the first base of the stop codon, the last value of i, and the stop codon is included. 
#The third number is the step size: i advances by three bases every iteration since codons come in threes. 

proteinseq=''

for i in range(59,390,3): 
    proteinseq=proteinseq+str(geneticcode3let[RNAsequence[i:i+3]])
print (proteinseq)


In the code below we have started a script to write out the same protein sequence using one-letter amino acid symbols. Add in the missing lines to write out the protein sequence using the one-letter amino acid symbols. 

In [17]:
#calls the function defined above called write_mRNA_from_DNA
RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt') 

#defines the python dictionary for the one letter genetic code 
geneticcode1let={'UUU':'F','UUC':'F','UUA':'L','UUG':'L',
     'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
     'AUU':'I','AUC':'I','AUA':'I','AUG':'M',
     'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
     'UCU':'S','UCC':'S','UCA':'S','UCG':'S',
     'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
     'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
     'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
     'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*',
     'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q',
     'AAU':'N','AAC':'N','AAA':'K','AAG':'K',
     'GAU':'D','GAC':'D','GAA':'E','GAG':'E',
     'UGU':'C','UGC':'C','UGA':'*','UGG':'W',
     'CGU':'R','CGC':'R','CGA':'R','CGG':'R',
     'AGU':'S','AGC':'S','AGA':'R','AGG':'R',
     'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

 

##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 

geneticcode1let={'UUU':'F','UUC':'F','UUA':'L','UUG':'L',
     'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
     'AUU':'I','AUC':'I','AUA':'I','AUG':'M',
     'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
     'UCU':'S','UCC':'S','UCA':'S','UCG':'S',
     'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
     'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
     'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
     'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*',
     'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q',
     'AAU':'N','AAC':'N','AAA':'K','AAG':'K',
     'GAU':'D','GAC':'D','GAA':'E','GAG':'E',
     'UGU':'C','UGC':'C','UGA':'*','UGG':'W',
     'CGU':'R','CGC':'R','CGA':'R','CGG':'R',
     'AGU':'S','AGC':'S','AGA':'R','AGG':'R',
     'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

proteinseq=''
for i in range(59,390,3):
    proteinseq=proteinseq+str(geneticcode1let[RNAsequence[i:i+3]])
print (proteinseq)

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
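
The loop above relies on knowing the start and stop positions in advance. As an alternative, translation can be wrapped in a function that stops as soon as it reaches a stop codon. The sketch below is one possible way to do this, not the method used in this lesson: the names `translate`, `CODON_TABLE`, `BASES`, and `AMINO_ACIDS` are our own, and the table is built compactly with `itertools.product` so that the example stays short. It should contain the same entries as `geneticcode1let`.

```python
from itertools import product

BASES = 'UCAG'
# One-letter symbols for the 64 codons in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = ('FFLLSSSSYY**CC*W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')
CODON_TABLE = {''.join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna, start=0):
    """Translate mrna from index 'start' until a stop codon or the end."""
    protein = ''
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == '*':
            break  # stop codon reached; do not add it to the protein
        protein = protein + amino_acid
    return protein


print(translate('AUGGCCCUGUGGAUGCGCCUC'))  # MALWMRL
```

With the insulin sequence from this class, a call such as `translate(RNAsequence, start=59)` would be expected to give the same protein as the loop above, without the trailing `*`.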


## What are genomic coordinates and how are they used to designate the position of a gene in the human genome?

In the last class, you had the opportunity to practice finding the position of start and stop codons for a gene in a single DNA sequence. 

As we move forward with the class, we are going to be working with genomic datasets with many sequences.

In these cases, the genomic coordinates need to specify the start and stop positions as well as the chromosome number. 

Human genomes typically have 23 pairs of chromosomes. 

<img src="../Images/3-Human Chromosomes.jpg" style="width: 40%; height: 50%" align="center"//>


## Use genomic coordinates to make a file in BED format


The format for specifying genomic coordinates depends on the particular programs that you plan to use. There are several formats, including BED, VCF, and GFF/GTF. For this class, we will be using a suite of programs called Bedtools, so we will be using the BED format for genomic coordinates. 

BED files are text files that contain genomic coordinate information, and are typically given the .bed extension. The format of a BED file is specified here: https://genome.ucsc.edu/FAQ/FAQformat.html#format1. Only the first three columns are mandatory for BED files, and they contain the following information:

Columns in a BED file:
- Column 1: chromosome (for example, chr1 to chr22 for the numbered chromosomes and chrX or chrY for the sex chromosomes) 
- Column 2: start position (the beginning of the first base is indicated by the start position 0; the beginning of the 5th base is indicated by the start position "4")
- Column 3: end position (the end of the first base is indicated by the end position "1"; the end of the 5th base is indicated by the end position "5")

People are often confused by the fact that the same "base" is referred to by a different number depending on whether you are referring to the start or the end. A simple way to understand this is to realize that the positions are not referring to the numbering of the bases themselves, but to the boundary between bases, as illustrated in the figure below:

<img src="./array_slice_indexing.png">

As a reminder from our previous class, this convention is also consistent with how slicing in python works, as illustrated below:

In [1]:
dna_string = "ACCTG"
print(dna_string[0:4])
print(dna_string[1:5])
print(dna_string[0:5])

ACCT
CCTG
ACCTG
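
Positions are often reported by counting bases from 1 and including both ends (for example, "bases 2 through 5"). A minimal sketch of converting such positions to the BED convention is shown below; the function name `one_based_to_bed` is hypothetical. Only the start changes (it is reduced by one), and the length of the interval is simply the end minus the start.

```python
def one_based_to_bed(first_base, last_base):
    """Convert 1-based, inclusive positions to a BED-style (start, end) pair."""
    return (first_base - 1, last_base)


dna_string = "ACCTG"
start, end = one_based_to_bed(2, 5)  # bases 2 through 5
print(start, end)                    # 1 5
print(dna_string[start:end])         # CCTG
print(end - start)                   # 4
```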


We can use Python to make a file in BED format. 

For this example, we will be making a .bed file that specifies the exon regions in the human insulin gene. 

Originally the exon boundaries would be determined experimentally, but we can obtain them from genomics resources such as this [link](https://www.ncbi.nlm.nih.gov/nuccore/161086962?report=genbank&from=4986&to=6416) from NCBI.

The exon boundaries for the human insulin gene that we have used in the preceding two classes are: 

- 1-42
- 222-425
- 1213-1431
 

In [18]:
#Note that the exon locations have been adjusted to the zero-based numbering system described above
#\t is a tab character, and \n is a newline character
file1 = open('human_insulin_exons.bed', 'w') #defines a file with the name "human_insulin_exons.bed"
#writes the lines of the file
file1.write("chr11\t0\t42\n") 
file1.write("chr11\t221\t425\n")
file1.write("chr11\t1212\t1431\n")
file1.close()
 

Let's view the created file:

In [13]:
!echo "human_insulin_exons.bed" #print the file name to the screen
!cat "human_insulin_exons.bed" #display the contents of human_insulin_exons.bed

human_insulin_exons.bed
chr11	0	42
chr11	221	425
chr11	1212	1431
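
When there are more than a few regions, writing each line by hand becomes error-prone. The sketch below writes the same three lines from a list of 1-based exon boundaries and does the zero-based adjustment in the loop. The function name `write_bed` and the file name `example_exons.bed` are our own, and the file is written to a temporary directory so that it does not overwrite the file created above.

```python
import os
import tempfile

# Exon boundaries as 1-based, inclusive positions (from the list above)
exons = [(1, 42), (222, 425), (1213, 1431)]


def write_bed(path, chromosome, regions):
    """Write one BED line per region, converting each start to zero-based."""
    with open(path, 'w') as bed_file:  # the file is closed automatically
        for first_base, last_base in regions:
            bed_file.write(f"{chromosome}\t{first_base - 1}\t{last_base}\n")


example_path = os.path.join(tempfile.gettempdir(), 'example_exons.bed')
write_bed(example_path, 'chr11', exons)

# Read the file back to check its contents.
with open(example_path) as bed_file:
    bed_text = bed_file.read()
print(bed_text)
```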


In [None]:
#Transcription: Writes out the pre-mRNA sequence that will be made from a DNA sequence 
FASTAgenesequence=open('../data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)

RNAsequence='' #this defines the variable 'RNAsequence'

To write the output of any command to a file, use the > operator which redirects the output to a text file

In [6]:
!bedtools intersect -a file1.bed -b file2.bed > intersection_results.bed

The contents can then be read back from the file, which could then be used as an input file for subsequent commands.

In [7]:
!cat intersection_results.bed

chr1	0	50
chr1	50	90
chr1	90	100
chr1	50	90
chr1	90	110
chr1	110	150


The --help option can be used to view the available options. We recommend that you read through all of these to know what is possible.