# Big Data for Biologists: Decoding Genomic Function - Class 2

## How can we predict the protein product of a gene? 

##  Learning Objectives
 ***Students should be able to***
 <ol>
 <li><a href=#CentralDogma>Explain the Central Dogma and what it means for a gene to be expressed</a></li>
 <li><a href=#ComplementarySequence>Describe what a complementary DNA sequence is</a></li>
 <li><a href=#Directionality>Recognize conventions for designating DNA sequence directionality </a></li>
 <li><a href=#ForloopsandIf>Print a complementary DNA sequence using "for loops" and "if" statements</a></li>
 <li><a href=#Stringslice>Find the index of characters in a string variable and slice a string </a></li>
  <li><a href=#Mutations>Use string slicing to make a substitution, deletion, insertion or inversion in a DNA, RNA or protein sequence</a></li>
 <li><a href=#WriteSequence>Write out a complementary DNA sequence to a file</a></li>
 <li><a href=#Transcription>Write out the RNA transcription product for a DNA sequence.</a></li>
 <li><a href=#ExonandIntron>Define RNA splicing, exon and intron </a></li> 
 <li><a href=#FindStartandStopCodons>Use Python to find possible translation start and stop codons in a mRNA sequence </a></li>
 </ol>

 ## What is the Central Dogma and what does it mean for a gene to be expressed?<a name='CentralDogma' />
 
As we discussed before, typically the flow of information in cells follows what is known as 
**The Central Dogma** where **DNA makes RNA makes protein**. 

A gene is said to be **expressed** if it is turned into a functional product. Proteins are one example of functional products from gene expression. DNA can also code for RNA that has a function on its own and is not translated into protein.  
 
<img src="../Images/1-CentralDogma.png" style="width: 50%; height: 60%" align="center"//>

Today, we are going to take a closer look at replication and transcription. We'll use Python to:

* Write the complementary strand from a DNA sequence.
* Write the RNA sequence that gets produced from a DNA sequence.

In the next class we'll look at translation. We'll use Python to: 

* Write the protein sequence that gets produced from an RNA sequence. 


## What is a complementary DNA sequence?<a name='ComplementarySequence' />

Before one cell divides to become two, the process of **DNA replication** occurs. 

During DNA replication the two strands of DNA (review figure 1 from the last class) unwind and a new, **complementary strand** is created. 

The composition of the complementary strand is determined from the original DNA sequence based on the base pairing rules that we discussed in the last class and are reviewed in the figure below. 

<img src="../Images/2-BasePairs.png" style="width: 40%; height: 50%" align="center"//>

Without using a computer, what would be the complementary DNA sequence for AGCCCTCCA?


## How do you designate DNA sequence Directionality?<a name='Directionality' />

When working with DNA sequences, it's important to keep track of the directionality.  

A DNA molecule has two ends, a **5'end** that has a phosphate group and a **3'end** that has an OH group.

By convention, DNA sequences are written 5' to 3'. Our original sequence would be: 

5'-AGCCCTCCA-3'.

The complementary sequence would be: 

3'-TCGGGAGGT-5' 

If we don't include the information about directionality as we did above, the sequence would need to be reversed and written: 

TGGAGGGCT  

The 5' to 3' sequence is also referred to as the + strand 

The 3' to 5' sequence is also referred to as the - strand 


Predicting the complementary DNA sequence for a short sequence can be performed by hand quickly, but what if you needed the complementary strand for a larger segment of DNA or if you want to write out the complementary strand for a lot of sequences? 


##  How can I use for loops and if statements to print a complementary DNA Sequence? <a name='ForloopsandIf' />


Using Python (or another programming language), it's possible to quickly and accurately write out a complementary DNA sequence. 

Keep in mind that the overall structure of the program will be similar to the program that we looked at for the first class, but this time instead of asking the program to calculate the length of the sequence, we'll be asking it to write out the complementary DNA strand for a sequence. 
 
<img src="../Images/2-ComplementaryDNAProgram.png" style="width: 50%; height: 60%" align="center"//>

The first part of the program will require similar code to what you learned last time to read the sequence into the computer. 

For the second part, we will need to introduce some new programming concepts: **for loops** and **if statements**. 

To simplify things, let's first think through how you wrote out the complementary sequence in the 'AGCCCTCCA' example above. 

You probably started with the first letter, A, decided what the complementary base pair should be, a T, and then moved onto the second letter. 

The programming concept that lets you repeat the same process for each letter is what's called a **for loop**. Let's look at a simple example of a for loop first. 

In this first case we will just have the computer write out the original sequence so you can see clearly what the for statement does. 


In [1]:
for i in 'AGCCCTCCA':
    print (i)

A
G
C
C
C
T
C
C
A
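As a second small sketch of the same idea, the loop below keeps a running count while it visits each letter. The variable names `sequence`, `position` and `base` are chosen here for illustration only; any names would work.

```python
sequence = 'AGCCCTCCA'
position = 0   # running count of the letters visited so far

# The indented lines run once for every letter in the sequence
for base in sequence:
    position = position + 1
    print(position, base)

# This line is not indented, so it runs once, after the loop has finished
print('letters visited:', position)
```

The indentation is what tells Python which lines belong to the loop.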


To write out the complementary sequence instead of the original sequence, we want the computer to check, for each letter, whether the original base is an A, T, G or C, and then write out T for an A, A for a T, C for a G and G for a C. 

In Python you can code this decision making process using **if** statements. 
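Before putting if statements inside a loop, it may help to see them applied to a single base. This is a minimal sketch; the variable names `base` and `partner` are made up for this example.

```python
base = 'A'      # the single base we want to pair
partner = ''    # will hold the complementary base

# == asks "are these two values equal?", while = stores a value in a variable
if base == 'A':
    partner = 'T'
if base == 'G':
    partner = 'C'

print(partner)
```

Only the indented line under a true condition runs, so here `partner` ends up as 'T'.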

In [2]:
#Write out the complementary sequence for a DNA sequence
for i in 'AGCCCTCCA':
    if  i=='A':
        print ('T')
    if i=='T':
        print ('A')
    if i=='G':
        print ('C')
    if i=='C':
        print ('G') 

T
C
G
G
G
A
G
G
T


The code above successfully prints out the complementary sequence, but our goal is not just to print out the sequence to the screen, but to print out the sequence into a file. 

We will need to introduce a variable where we store the complementary sequence so we can call it later to print the sequence or write to a file. 

We'll print the sequence in the first example to make sure that the code is set up properly, then we'll add the code to write to a textfile. 

In [3]:
#Write out the complementary sequence for a DNA sequence
complementarysequence='' #this defines the variable 'complementarysequence'
for i in 'AGCCCTCCA':
    if  i=='A':
        complementarysequence=complementarysequence+'T'
    if i=='T':
        complementarysequence=complementarysequence+'A'
    if i=='G':
        complementarysequence=complementarysequence+'C'
    if i=='C':
        complementarysequence=complementarysequence+'G'
print (complementarysequence)


TCGGGAGGT


What is a problem with the output of the code above?

## Find the index of characters in a string variable and slice a string<a name='Stringslice' />

The complementary sequence that we wrote out above is the sequence in the 3' to 5' direction instead of the 5' to 3' direction. 

We are going to change the direction using a method called string slicing. 

**Strings** are composed of text or characters and are not numbers which can be added, multiplied or divided. The sequences we have been looking at are strings with all letters, but strings can have numbers or special characters. 

To slice a string, it's helpful first to know how to index the characters in a sequence. 

In Python, the first letter of the sequence gets an index of "0" (in other programming languages indexing may start with "1"). 

To print the first character in a variable called "sequence" you would write: 

print(sequence[0])

To slice a string (or obtain a range of characters) you can specify the beginning index and the end index: 

sequence[0:2]

Note [0:2] includes zero and goes up to everything before 2. The character in the [2] position is not included.  

Test yourself by predicting the output using the code below. 

In [4]:
sequence='TCGGGAGGT'
print (sequence[0])
print (sequence[0:2])
print (sequence[0:3])
print (sequence[1])
print (sequence[1:3])
print (sequence[:4])

T
TC
TCG
C
CG
TCGG


If you want to skip characters in a string, or reverse the order, you can add a stride. 

For example, if you want to write out every other character between two positions, you could write: 

In [5]:
sequence='TCGGGAGGT'

#slice the sequence string from [lowerbound:upperbound:stride]
print (sequence[2:5:2])

GG


If the lower and upper bound are not specified, by default the program will use the first character for the lower bound and the last character for the upper bound. This is helpful when you are working with longer sequences and don't know the length ahead of time. 

In our example, sequence [0:9:2] gives the same output as [::2]. 

In [6]:
sequence='TCGGGAGGT'

#slice the sequence string from [lowerbound:upperbound:stride]
print (sequence[0:9:2])
print (sequence[::2])

TGGGT
TGGGT


In [7]:
sequence='TCGGGAGGT'

#slice the sequence string from [lowerbound:upperbound:stride] a negative value for the stride reverses the direction. 
print(sequence[::-1])

TGGAGGGCT
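A short sketch to experiment with other stride values. The variable names are chosen for this example only.

```python
sequence = 'TCGGGAGGT'

# A stride of 3 keeps the characters at indexes 0, 3 and 6
every_third = sequence[::3]
print(every_third)

# Reversing a string twice gives back the original string
reversed_sequence = sequence[::-1]
restored = reversed_sequence[::-1]
print(restored == sequence)
```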


## Use string slicing to make a substitution, deletion, insertion or inversion in a DNA, RNA or protein sequence <a name='Mutations' />

String slicing can also be used to write out a DNA, RNA or protein sequence with a substitution, deletion or insertion. 

Later in the class, you will learn more about the types of variation that can take place in DNA from different individuals or populations. 

In cases where variation leads to a disease, it is often referred to as a **mutation**. 

These mutations are often substitutions, deletions or insertions.  

Below are examples of how you can use string slicing to make a substitution, deletion or insertion. 

In [8]:
#Make a single substitution in a sequence 
#the character at index 4 (a G) is replaced by an A; the sequence length stays the same
sequence='TCGGGAGGT'
mutated_sequence= sequence[0:4] + 'A' + sequence[5:]
print(mutated_sequence)

TCGGAAGGT


In [9]:
#Delete part of a sequence
sequence='TCGGGAGGT'
mutated_sequence= sequence[0:4] + sequence[6:]
print(mutated_sequence)

TCGGGGT


In [10]:
#Add an insertion in a sequence 
#'AT' is inserted before index 4; no original characters are removed
sequence='TCGGGAGGT'
mutated_sequence= sequence[0:4] + 'AT' + sequence[4:]
print(mutated_sequence)

TCGGATGAGGT


A negative "stride" notation can be used to invert (or reverse) a sequence. We can use a negative stride to write out the complementary DNA sequence from the example above in the 5' to 3' direction. Edit the print statement in the box below to write out the complementary sequence in the 5' to 3' direction.  

In [11]:
#Write out the complementary sequence for a DNA sequence
complementarysequence='' #this defines the variable 'complementarysequence'
for i in 'AGCCCTCCA':
    if  i=='A':
        complementarysequence=complementarysequence+'T'
    if i=='T':
        complementarysequence=complementarysequence+'A'
    if i=='G':
        complementarysequence=complementarysequence+'C'
    if i=='C':
        complementarysequence=complementarysequence+'G'
print (complementarysequence)

#ANSWER REMOVE BEFORE GIVING TO STUDENTS. 
print (complementarysequence[::-1])

TCGGGAGGT
TGGAGGGCT
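The same steps can be packaged so they are easy to reuse. The sketch below is one possible alternative, not the approach used in this class: it stores the base pairing rules in a dictionary and wraps the loop in a function. Both `pairing` and `reverse_complement` are names invented for this example, and dictionaries and functions have not been covered yet.

```python
# Base pairing rules: original base -> complementary base
pairing = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reverse_complement(sequence):
    """Return the complementary strand, written in the 5' to 3' direction."""
    complementarysequence = ''
    for base in sequence:
        # Look up the partner of each base instead of using four if statements
        complementarysequence = complementarysequence + pairing[base]
    # Reverse with a negative stride so the result reads 5' to 3'
    return complementarysequence[::-1]

print(reverse_complement('AGCCCTCCA'))
```

Note that this sketch assumes the sequence contains only A, T, G and C; any other character would raise a `KeyError`.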


A negative stride can also be used to invert part of a sequence

In [12]:
#Add an inversion in a sequence 
sequence='TCGGGAGGT'
print(sequence[7:4:-1])
mutated_sequence= sequence[0:5] + sequence[7:4:-1] + sequence[8:]
print(mutated_sequence)

GGA
TCGGGGGAT
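The four kinds of edits can be summarized side by side. This is a sketch under the assumption that positions are given as 0-based Python indexes; the function names `substitute`, `delete`, `insert` and `invert` are hypothetical helpers written for this example, and functions have not been covered in class yet.

```python
def substitute(sequence, position, new_base):
    """Replace the single character at index `position`."""
    return sequence[:position] + new_base + sequence[position + 1:]

def delete(sequence, start, end):
    """Remove the characters from index `start` up to, but not including, `end`."""
    return sequence[:start] + sequence[end:]

def insert(sequence, position, new_bases):
    """Add `new_bases` before index `position` without removing anything."""
    return sequence[:position] + new_bases + sequence[position:]

def invert(sequence, start, end):
    """Reverse the characters from index `start` up to, but not including, `end`."""
    return sequence[:start] + sequence[start:end][::-1] + sequence[end:]

sequence = 'TCGGGAGGT'
print(substitute(sequence, 4, 'A'))
print(delete(sequence, 4, 6))
print(insert(sequence, 4, 'AT'))
print(invert(sequence, 5, 8))
```

A quick sanity check is the length of each result: a substitution and an inversion keep the length at 9, the deletion shortens it by 2, and the insertion lengthens it by 2.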


## How can I write out a complementary DNA sequence to a file? <a name='WriteSequence' />

Now that the complementary sequence is being written out correctly in the 5' to 3' direction, we can use the code below to write out a sequence to a file. 

In [13]:
#Write out the complementary sequence for a DNA sequence to a file
complementarysequence='' #this defines the variable 'complementarysequence'
for i in 'AGCCCTCCA':
    if  i=='A':
        complementarysequence=complementarysequence+'T'
    if i=='T':
        complementarysequence=complementarysequence+'A'
    if i=='G':
        complementarysequence=complementarysequence+'C'
    if i=='C':
        complementarysequence=complementarysequence+'G'

#create a file object f and open a writeable file called 'complementarysequence' in the working directory
f = open('complementarysequence', 'w')

#write the complementary sequence variable to the 'complementarysequence' file
f.write(complementarysequence[::-1])

#close the file object f so it does not take resources in the program
f.close()

If you list the files in your working directory now, what do you expect to see? Try this!

You can also look at the complementary sequence in a text editor. 

In [14]:
#list the files in your working directory

In [15]:
#ANSWER REMOVE BEFORE GIVING TO STUDENTS. 
import os
os.listdir('.')

['.DS_Store',
 '.ipynb_checkpoints',
 'complementarysequence',
 '2-Complementary Sequences and Transcription.ipynb']

For short sequences like the example we've been using, having a computer doesn't save significant amounts of time, but for longer sequences it can be a big help. 
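As an aside on the file-writing step, Python also offers a `with` block, which closes the file automatically when the block ends. The sketch below is an alternative to calling `f.close()` yourself, not a replacement for the code in this class. It writes to a temporary folder so that it does not clutter your working directory; the file name `complementarysequence_demo.txt` is made up for this example.

```python
import os
import tempfile

complementarysequence = 'TGGAGGGCT'

# Build a path inside a fresh temporary folder (an assumption of this sketch)
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'complementarysequence_demo.txt')

# The file is closed automatically at the end of the with block
with open(path, 'w') as f:
    f.write(complementarysequence)

# Read the file back in to confirm what was written
with open(path, 'r') as f:
    contents = f.read()
print(contents)
```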

Now try combining the commands that we used yesterday to read in the insulin sequence and print out the complementary DNA sequence for insulin. 

In [16]:
#Write out the complementary sequence for a DNA sequence
FASTAgenesequence=open('../class_01_gene_sequences/data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)
genesequence=genesequence.replace('\n','')
complementarysequence='' #this defines the variable 'complementarysequence'


#ANSWER REMOVE BEFORE GIVING TO STUDENTS.
for i in genesequence:
    if i=='A':
        complementarysequence=complementarysequence+'T'
    if i=='T':
        complementarysequence=complementarysequence+'A'
    if i=='G':
        complementarysequence=complementarysequence+'C'
    if i=='C':
        complementarysequence=complementarysequence+'G'
print (complementarysequence[::-1])

GCTGGTTCAAGGGCTTTATTCCATCTCTCTCGGTGCAGGAGGCGGCGGGTGTGGGGCTGCCTGCGGGCTGCGTCTAGTTGCAGTAGTTCTCCAGCTGGTAGAGGGAGCAGATGCTGGTACAGCATTGTTCCACAATGCCACGCTTCTGCAGGGACCCCTCCAGGGCCAAGGGCTGCAGGCTGCCTGCACCAGGGCCCCCGCCCAGCTCCACCTGCCCCACTGCCAGGACGTGCCGCGCAGAGCAGGTTCCGGAACAGCGGCGAGGCAGAGGGACACAGGAGGACACAGTCAGGGAGACACAGTGCCCGCCTGCCCGCCAGCCCTAGGTCGCACTCCCACCCATCTCCAGCCGGGCTGGACCCAGGTTAGAGGGAGGGTCACCCACACTGGGTGTGGACCTACAGGCCCCAACGCCCACATGTCCCACCTCCTTCCCCCGCCCCGGGGCAGCGTCACAGTGGGAGCCTGAACAGGTGATCCCAGTACTTCTCCCCAGGGCCTGTCCCCAGCATCTTCCCCATCTCCTGACTATGGAGCTGCCGTGAGGCCTGGCGACAGGGGTCTGGCCCACTCAGGCAGGCAGCCACGCCCTCCTCCGGGCGTGATGGGGTGTTCGCCCAGAGGCAGGCAGCGTGGGGCACCCTGTGACCCCAGGTCACCCAGGACTTTACTTAACAAAACACTTGAATCTGCGGTCATCAAATGAGGGTGGAGAAATGGGCTGCGGGGCATTTGTTTGAGGGGCGAGTGGAGGGAGGAGCGTGCCCACCCTCTGATGTATCTCGGGGCTGCCGAAGCCAACACCGTCCTCAGGCTGAGATTCTGACTGGGCCACAGGGAGCTGGTCACTTTTAGGACGTGACCAAGAGAACTTCTTTTTAAAAAAGTGCACCTGACCCCCTGCTGGGTGGCAGCCTCCTGCCCCCTTCTGCCCATGCTGGGTGGGAGCGCCAGGAGCAGGGGGTGGCTGGGGGCGGCCAGGGGCAGCAATGGGCAGTTG
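One thing to keep in mind with real sequence files is that they can contain characters other than A, T, G and C, for example N for an unknown base, or lowercase letters. The four separate if statements silently skip such characters, which makes the complementary sequence shorter than the original. The sketch below shows one possible way to handle this with `elif` and `else`; the input sequence is made up, and writing 'N' for anything unrecognized is an assumption of this example.

```python
sequence = 'AGCNCTcCA'   # made-up example with an 'N' and a lowercase 'c'
complementarysequence = ''

# .upper() converts lowercase letters to uppercase before they are compared
for i in sequence.upper():
    if i == 'A':
        complementarysequence = complementarysequence + 'T'
    elif i == 'T':
        complementarysequence = complementarysequence + 'A'
    elif i == 'G':
        complementarysequence = complementarysequence + 'C'
    elif i == 'C':
        complementarysequence = complementarysequence + 'G'
    else:
        # Any other character is kept as an unknown base
        complementarysequence = complementarysequence + 'N'

print(complementarysequence[::-1])
print(len(complementarysequence) == len(sequence))
```

With `elif`, Python stops checking as soon as one condition is true, and `else` catches everything that matched none of them.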

## How can I write out the RNA transcription product of a DNA sequence?<a name='Transcription' />

During the process of transcription (see Central Dogma Figure above), DNA is transcribed into RNA. There are a few structural differences between RNA and DNA molecules that we will not cover in detail. 

For the purposes of this class, the main difference between RNA and DNA that you will need to be aware of is that RNA has the base **Uracil (U)** instead of Thymine (T).   

By convention, most of the DNA sequences that are listed in gene databases are written in the 5' to 3' direction and are known as the **coding strand**. Unless there is some type of error in the transcription process, the coding strand of DNA will have the same sequence as the 5' to 3' RNA that is made in transcription except the Ts will become Us. 

There are a number of different types of RNA in a cell. For this class, we will be focusing on messenger RNA or **mRNA**, which is the RNA that codes for proteins. 

Using what you learned above, complete the script below so that it converts the DNA sequence for insulin to the corresponding RNA sequence.  

In [17]:
#Write out the pre-mRNA sequence for a DNA sequence
FASTAgenesequence=open('../class_01_gene_sequences/data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)
genesequence=genesequence.replace('\n','')
RNAsequence='' #this defines the variable 'RNAsequence'


#loops over each character in genesequence and converts Ts to Us in the RNAsequence variable.
for i in genesequence:
    if i=='A':
        RNAsequence=RNAsequence+' '
    if i=='T':
        RNAsequence=RNAsequence+' '
    if i=='G':
        RNAsequence=RNAsequence+' '
    if i=='C':
        RNAsequence=RNAsequence+' '
print ( )


##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 
#Write out the pre-mRNA sequence for a DNA sequence
FASTAgenesequence=open('../class_01_gene_sequences/data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)
genesequence=genesequence.replace('\n','')
RNAsequence='' #this defines the variable 'RNAsequence'
for i in genesequence:
    if i=='A':
        RNAsequence=RNAsequence+'A'
    if i=='T':
        RNAsequence=RNAsequence+'U'
    if i=='G':
        RNAsequence=RNAsequence+'G'
    if i=='C':
        RNAsequence=RNAsequence+'C'
print (RNAsequence)


AGCCCUCCAGGACAGGCUGCAUCAGAAGAGGCCAUCAAGCAGGUCUGUUCCAAGGGCCUUUGCGUCAGGUGGGCUCAGGAUUCCAGGGUGGCUGGACCCCAGGCCCCAGCUCUGCAGCAGGGAGGACGUGGCUGGGCUCGUGAAGCAUGUGGGGGUGAGCCCAGGGGCCCCAAGGCAGGGCACCUGGCCUUCAGCCUGCCUCAGCCCUGCCUGUCUCCCAGAUCACUGUCCUUCUGCCAUGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUGGCCCUCUGGGGACCUGACCCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAGCUCUCUACCUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGACCUGCAGGGUGAGCCAACUGCCCAUUGCUGCCCCUGGCCGCCCCCAGCCACCCCCUGCUCCUGGCGCUCCCACCCAGCAUGGGCAGAAGGGGGCAGGAGGCUGCCACCCAGCAGGGGGUCAGGUGCACUUUUUUAAAAAGAAGUUCUCUUGGUCACGUCCUAAAAGUGACCAGCUCCCUGUGGCCCAGUCAGAAUCUCAGCCUGAGGACGGUGUUGGCUUCGGCAGCCCCGAGAUACAUCAGAGGGUGGGCACGCUCCUCCCUCCACUCGCCCCUCAAACAAAUGCCCCGCAGCCCAUUUCUCCACCCUCAUUUGAUGACCGCAGAUUCAAGUGUUUUGUUAAGUAAAGUCCUGGGUGACCUGGGGUCACAGGGUGCCCCACGCUGCCUGCCUCUGGGCGAACACCCCAUCACGCCCGGAGGAGGGCGUGGCUGCCUGCCUGAGUGGGCCAGACCCCUGUCGCCAGGCCUCACGGCAGCUCCAUAGUCAGGAGAUGGGGAAGAUGCUGGGGACAGGCCCUGGGGAGAAGUACUGGGAUCACCUGUUCAGGCUCCCACUGUGACGCUGCCCCGGGGCGGGGGA

A slightly shorter alternative for the code above is to use an if/else statement. 

**If** the original base is a T then it needs to be changed to a U. 
**Else** keep the original base. 

Here's how the **if/else** statement looks in Python. 

In [18]:
#Transcription: Writes out the pre-mRNA sequence that will be made from a DNA sequence 
FASTAgenesequence=open('../class_01_gene_sequences/data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)
genesequence=genesequence.replace('\n','')
RNAsequence='' #this defines the variable 'RNAsequence'

#loops over each character in genesequence and converts Ts to Us in the RNAsequence variable. 
for i in genesequence:
    if i==' ':
        RNAsequence=RNAsequence+' '
    else:
        RNAsequence=RNAsequence+ i
print ( )


##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 
#Write out the pre-mRNA sequence for a DNA sequence
FASTAgenesequence=open('../class_01_gene_sequences/data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)
RNAsequence='' #this defines the variable 'RNAsequence'
for i in genesequence:
    if i=='T':
        RNAsequence=RNAsequence+'U'
    else:
        RNAsequence=RNAsequence+ i
print (RNAsequence)


AGCCCUCCAGGACAGGCUGCAUCAGAAGAGGCCAUCAAGCAGGUCUGUUCCAAGGGCCUUUGCGUCAGGU
GGGCUCAGGAUUCCAGGGUGGCUGGACCCCAGGCCCCAGCUCUGCAGCAGGGAGGACGUGGCUGGGCUCG
UGAAGCAUGUGGGGGUGAGCCCAGGGGCCCCAAGGCAGGGCACCUGGCCUUCAGCCUGCCUCAGCCCUGC
CUGUCUCCCAGAUCACUGUCCUUCUGCCAUGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUG
GCCCUCUGGGGACCUGACCCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAG
CUCUCUACCUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGACCU
GCAGGGUGAGCCAACUGCCCAUUGCUGCCCCUGGCCGCCCCCAGCCACCCCCUGCUCCUGGCGCUCCCAC
CCAGCAUGGGCAGAAGGGGGCAGGAGGCUGCCACCCAGCAGGGGGUCAGGUGCACUUUUUUAAAAAGAAG
UUCUCUUGGUCACGUCCUAAAAGUGACCAGCUCCCUGUGGCCCAGUCAGAAUCUCAGCCUGAGGACGGUG
UUGGCUUCGGCAGCCCCGAGAUACAUCAGAGGGUGGGCACGCUCCUCCCUCCACUCGCCCCUCAAACAAA
UGCCCCGCAGCCCAUUUCUCCACCCUCAUUUGAUGACCGCAGAUUCAAGUGUUUUGUUAAGUAAAGUCCU
GGGUGACCUGGGGUCACAGGGUGCCCCACGCUGCCUGCCUCUGGGCGAACACCCCAUCACGCCCGGAGGA
GGGCGUGGCUGCCUGCCUGAGUGGGCCAGACCCCUGUCGCCAGGCCUCACGGCAGCUCCAUAGUCAGGAG
AUGGGGAAGAUGCUGGGGACAGGCCCUGGGGAGAAGUACUGGGAUCACCUGUUCAGGCUCCCACUGUGAC
GCUGC
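For comparison, the same T to U conversion can be sketched with the string `replace` method, which was used earlier to remove the newline characters. The short sequence here stands in for the insulin sequence so the result is easy to check by eye.

```python
genesequence = 'AGCCCTCCA'   # short stand-in for the insulin coding strand

# replace() returns a new string with every 'T' swapped for a 'U'
RNAsequence = genesequence.replace('T', 'U')
print(RNAsequence)

# The loop with if/else gives the same answer
loop_version = ''
for i in genesequence:
    if i == 'T':
        loop_version = loop_version + 'U'
    else:
        loop_version = loop_version + i
print(loop_version == RNAsequence)
```

The loop is more typing, but it is the same pattern you will need when the rule for each character is more complicated than a single swap.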

## What are RNA splicing, Exons and Introns? <a name='ExonandIntron' />

After RNA is made, it undergoes a processing step called **RNA splicing**. During RNA splicing, parts of the sequence, the **introns**, are removed before the RNA gets translated into protein. 

The process of splicing takes the RNA from what is known as precursor or pre-mRNA to an mRNA. 

The regions of RNA that are retained in the mRNA after splicing are called **exons**. Exons contain the sequences that code for proteins, but also contain untranslated regions. 

There are computer programs that can predict where splicing occurs, but they are beyond the scope of what we will be covering in this class.

Instead, for the next exercise we are going to get the mRNA sequence for insulin from the NCBI Nucleotide database at the following [link](https://www.ncbi.nlm.nih.gov/nuccore/109148525?report=fasta).
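If the exon positions are already known, the splicing step itself is just string slicing. The sketch below uses a made-up pre-mRNA and made-up exon coordinates (0-based, upper bound excluded, like Python slices); neither comes from the insulin gene.

```python
# Made-up pre-mRNA: exon 1 (6 bases) + intron (12 bases) + exon 2 (6 bases)
pre_mrna = 'AUGGCCGUAAGUUUUCAGCUCUGA'

# Hypothetical exon coordinates as (start, end) pairs
exons = [(0, 6), (18, 24)]

# Join the exon slices in order; everything between them (the intron) is left out
mrna = ''
for start, end in exons:
    mrna = mrna + pre_mrna[start:end]

print(mrna)
```

The hard part in practice is finding the coordinates, which is why we download the already processed mRNA instead.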



## Use Python to find possible translation start and stop codons in a mRNA sequence <a name='FindStartandStopCodons' />

**Translation**, the process of converting RNA to protein, usually starts at 'AUG' sequences. 

'AUG' codes for the amino acid Methionine. 

During translation, each set of three consecutive bases in an mRNA sequence codes for one amino acid. Each set of three bases is called a **codon**.

Looking at the RNA sequence above, how do we know which 'AUG' is the start codon? 

Determining the start codon definitively ultimately has to be supported by experimental data. However, we can use programming to predict possible start and stop sites. Often, the actual start site is the combination of start and stop sites that results in the longest gene. 

The region between the start and stop site is referred to as the **open reading frame**, sometimes abbreviated ORF. 

PAUSE: Think for a few minutes about how you might identify the possible start codons before looking at the code below.


To find the start codons in an mRNA sequence, we are going to combine for loops, if statements and string indexing, the three programming concepts we've discussed today. 

We will also introduce the **range()** function. 

In the processed mRNA sequence that we downloaded from the NCBI Nucleotide database, the sequence was written out with T (thymine) rather than U (uracil). While the bases in the actual mRNA would be uracils, computational biologists and bioinformatics databases often simplify the conversion between mRNA and DNA sequences by listing RNA sequences with thymine. 

Therefore, in this example we are going to look for 'ATG' as the start codon rather than 'AUG'. 

We are going to loop over every position in the sequence, ask whether the three bases starting at that position are ATG, and have the script print out the position number if they are. 

Notice that for the print statement we are printing a mixture of the string 'candidate start codon site: ' with the base pair number. In order to print the base pair number we need to use the "str" command to convert it from a number to a string so it will print properly. 
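A minimal sketch of the conversion, using a made-up index value in the variable `position`:

```python
position = 59   # hypothetical 0-based index where a match was found

# str() turns the number into text so it can be joined to the label with +
message = 'candidate start codon site: ' + str(position + 1)
print(message)
```

Leaving out `str()` and writing `'candidate start codon site: ' + position` would raise a `TypeError`, because Python does not join text and numbers with `+` automatically.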

Fill in the code below. 
 

In [20]:
#Find possible start codons in an mRNA sequence
FASTAgenesequence=open('../class_01_gene_sequences/data/Human-Insulin NM_000207.2.txt','r')
mRNAsequence=(FASTAgenesequence.readlines()[1:])
mRNAsequence=''.join(mRNAsequence)
mRNAsequence=mRNAsequence.replace('\n','')


#loops over every set of three consecutive basepairs in the mRNAsequence and looks to see if it is an ATG.
for i in mRNAsequence:
    if mRNAsequence[ : ]==' ':
#prints the possible start codon starting position
        print('candidate start codon site: '+ str(i+1))



##ANSWER -- ## 


candidate start codon site: 60
candidate start codon site: 72
candidate start codon site: 341
candidate start codon site: 442


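If you filled in the code following the template above, with `for i in mRNAsequence:`, it gives an error message such as:

"TypeError: Can't convert 'int' object to str implicitly"

(The exact wording of the message depends on your version of Python.)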

What is different here from above is that rather than using i to go through each character in a string, i is an index for slicing a string from the mRNA sequence. 

We need i to be a number! 

To fix the code we can use the **range** command. 

Instead of having i go through each element of the string, we will tell the code that i should take on each value in a range from 0 to the length of the mRNA sequence. 
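Here is a minimal sketch of what `range` produces, using the short example sequence rather than the insulin mRNA. The variable name `window` is chosen for this example.

```python
sequence = 'AGCCCTCCA'

# range(0, len(sequence)) yields the whole numbers 0, 1, ..., 8
for i in range(0, len(sequence)):
    # i is a number, so it can be used as an index and in arithmetic like i+3
    window = sequence[i:i+3]
    print(str(i) + ' ' + sequence[i] + ' ' + window)
```

Near the end of the sequence the slice is shorter than three characters. Python does not raise an error for this, and a shorter slice can never be equal to 'ATG', so the comparison is simply false there.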



In [23]:
#Find possible start codons in an mRNA sequence
FASTAmRNAsequence=open('../class_01_gene_sequences/data/Human-Insulin NM_000207.2.txt','r')
mRNAsequence=(FASTAmRNAsequence.readlines()[1:])
mRNAsequence=''.join(mRNAsequence)
mRNAsequence=mRNAsequence.replace('\n','')

#loops over every position in the mRNAsequence and looks to see if the three consecutive bases starting there are an ATG.
for i in range(0,len(mRNAsequence)):
    if mRNAsequence[i:i+3]=='ATG':
        #prints the possible start codon starting position; i+1 converts Python's 0-based index to a 1-based position
        print('candidate start codon site: '+ str(i+1))

candidate start codon site: 60
candidate start codon site: 72
candidate start codon site: 341
candidate start codon site: 442
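Another way to search for 'ATG', sketched here as an alternative rather than the method used in this class, is the string `find` method. It returns the index of the next match at or after a given starting index, or -1 when there are no more matches. The short sequence is made up, and the `while` loop and the list `sites` are concepts we have not covered yet.

```python
mRNAsequence = 'CCATGGCATGTAA'   # made-up sequence containing two ATGs
sites = []                       # collects the 1-based positions that are found

# find() returns -1 when 'ATG' does not occur
position = mRNAsequence.find('ATG')
while position != -1:
    sites.append(position + 1)
    print('candidate start codon site: ' + str(position + 1))
    # Continue the search one character after the previous match
    position = mRNAsequence.find('ATG', position + 1)
```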


In [24]:
FASTAgenesequence=open('../class_01_gene_sequences/data/Human-Insulin NM_000207.2.txt','r')
mRNAsequence=(FASTAgenesequence.readlines()[1:])
mRNAsequence=''.join(mRNAsequence)
mRNAsequence=mRNAsequence.replace('\n','')

#loops over each position in the mRNAsequence and looks to see if the three bases starting there are an ATG.
for i in range(0,len(mRNAsequence)):
    if mRNAsequence[i:i+3]=='ATG':
        #steps through the sequence three bases at a time, in frame with the ATG, and looks for a stop codon TAA, TAG or TGA.
        for j in range(i+3,len(mRNAsequence),3):
            if mRNAsequence[j:j+3]=='TAA' or mRNAsequence[j:j+3]=='TAG' or mRNAsequence[j:j+3]=='TGA':
                #orf length counts from the first base of the start codon through the last base of the stop codon
                print ('candidate start codon site: ' + str(i+1) + ' candidate stop codon site: ' + str(j+1)
                       + ' orf length: '+ str(j+3-i))
                #exits the loop after the first stop codon is found, avoids finding > 1 stop codon per start codon
                break

candidate start codon site: 60 candidate stop codon site: 390 orf length: 333
candidate start codon site: 72 candidate stop codon site: 390 orf length: 321
candidate start codon site: 442 candidate stop codon site: 448 orf length: 9


Thought Question: Why does the ATG from position 341 no longer show up in the output?

Thought Question: Why is the start codon at 442 unlikely to be the actual start codon?
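Since the actual start site is often the one that gives the longest open reading frame, a natural next step is to have the script pick the longest candidate. The sketch below shows one way this might be done on a made-up sequence. The function name `find_orfs`, the list of tuples, and the convention of measuring length from the first base of the start codon through the last base of the stop codon are all assumptions of this example.

```python
def find_orfs(sequence):
    """Return a list of (start site, stop site, length) with 1-based sites."""
    orfs = []
    for i in range(0, len(sequence)):
        if sequence[i:i+3] == 'ATG':
            # Step through the codons that are in frame with this ATG
            for j in range(i + 3, len(sequence), 3):
                if sequence[j:j+3] in ('TAA', 'TAG', 'TGA'):
                    orfs.append((i + 1, j + 1, j + 3 - i))
                    break   # keep only the first stop codon for this start
    return orfs

toy_sequence = 'ATGAAACCCATGGGTTAGCC'   # made-up sequence with two in-frame ATGs
orfs = find_orfs(toy_sequence)
print(orfs)

# Keep whichever candidate has the largest length (the third item in each tuple)
longest = orfs[0]
for orf in orfs:
    if orf[2] > longest[2]:
        longest = orf
print('longest candidate: ' + str(longest))
```

Keep in mind that the longest candidate is only a prediction; as noted earlier, the real start codon has to be supported by experimental data.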

Congratulations, that concludes class 2. Next time we'll start looking at translation, the final step in the Central Dogma for making proteins. 