# BF527 Lab 13

## Open Reading Frames and Virtual Ribosome

## Background 
Given a DNA coding sequence, we often want to see the protein sequence it encodes. We can do this by writing a Python program that translates the DNA sequence codon‐by‐codon into a protein (amino acid) sequence, somewhat like a "virtual ribosome." Such a program is also useful for gene finding: we can translate a given DNA sequence in all six possible reading frames and look for an open reading frame (ORF). Recall from the lecture that an ORF is a stretch of DNA sequence that begins with a start/methionine codon (ATG), proceeds for some significant length without a stop codon, and then terminates with a stop codon (TAA, TAG, or TGA in DNA, which correspond to UAA, UAG, and UGA in the mRNA).

## Task 
Build a virtual ribosome program that will translate the three positive reading frames of a given DNA sequence. Find the true reading frame and identify the corresponding protein using BLAST.

### Part A: Load the DNA sequence

In order to use a script to translate the DNA sequence into an amino acid sequence, first we need to load the sequence. The DNA sequence is provided for you in FASTA format as part of the BF527 Lab 13 materials on `scc2` (`dna.fasta`). If you followed the instructions, this file should be in the same directory as the current notebook.

Write python code to read in the `dna.fasta` file and create a variable that contains the concatenated DNA sequence as a string. You may use or adapt your solution from BF527 Lab 9 to do this. 

In [1]:
import colorama
from colorama import Fore
def get_color(dna):
    console_colors = {'A' : Fore.BLUE, 'C' : Fore.RED, 'G' : Fore.GREEN, 'T' : Fore.BLACK}
    colored_dna=''
    for nt in dna: 
        colored_dna += console_colors[nt] + nt
    print(colored_dna)

In [2]:
# Write your python code here:    

# Open the FASTA file and read all of its lines; the file is closed automatically
with open('dna.fasta') as dnafile:
    save_file = dnafile.readlines()

# Concatenate the sequence lines into one string, skipping the header line
dna = ''
for line in save_file[1:]:
    dna += line.strip()  # strip off the newline character

get_color(dna)

[31mC[30mT[34mA[32mG[32mG[31mC[30mT[34mA[34mA[30mT[32mG[31mC[34mA[34mA[34mA[30mT[30mT[30mT[30mT[30mT[32mG[30mT[31mC[34mA[34mA[32mG[34mA[31mC[30mT[30mT[30mT[32mG[34mA[31mC[30mT[32mG[32mG[30mT[34mA[34mA[32mG[34mA[31mC[31mC[34mA[30mT[31mC[34mA[31mC[30mT[30mT[30mT[32mG[32mG[34mA[34mA[32mG[30mT[30mT[32mG[34mA[34mA[30mT[31mC[30mT[30mT[31mC[30mT[32mG[34mA[31mC[34mA[31mC[30mT[34mA[30mT[30mT[32mG[34mA[31mC[34mA[34mA[30mT[32mG[30mT[31mC[34mA[34mA[32mG[30mT[31mC[34mA[34mA[34mA[32mG[34mA[30mT[30mT[31mC[34mA[34mA[32mG[34mA[31mC[34mA[34mA[32mG[32mG[34mA[34mA[32mG[32mG[30mT[34mA[30mT[31mC[31mC[31mC[34mA[31mC[31mC[30mT[32mG[34mA[31mC[31mC[34mA[34mA[31mC[34mA[34mA[34mA[32mG[34mA[30mT[30mT[32mG[34mA[30mT[31mC[30mT[30mT[30mT[32mG[31mC[30mT[32mG[32mG[30mT[34mA[34mA[32mG[31mC[34mA[34mA[30mT[30mT[32mG[32mG[34mA[34mA[32mG[34mA[31mC[32mG[32mG[30

### Part B: Load and Store the Genetic Code

In order to translate from DNA to protein, we must know which codons code for which amino acids, and this is best accomplished by storing the information in a dictionary. We have provided the genetic code as a separate file on [learn.bu.edu] `universal_genetic_code.tab`. Each of the 64 lines in the file looks like `AAA\tK\n`: a three‐letter codon, a tab `\t`, a single‐letter amino acid designation, and a newline character `\n`.

1. Read this file line‐by‐line.
2. Split each line into a codon string and a one‐letter amino acid string.
3. Store this pair in a dictionary, with the codon being the key, and the amino acid being the value.

A `*` is used to represent a translated **STOP** codon.

**Hint:** We performed a similar task of splitting a string into substrings and then using those substrings to build a dictionary in BF527 Lab 9.

In [3]:
# Write your python code here:
def get_genecode():
    code_dic = {}  # dictionary mapping each codon to its amino acid

    # open the tab-delimited genetic code file and read it line by line
    with open('universal_genetic_code.tab') as genetic_file:
        for line in genetic_file:
            line = line.strip()  # strip the trailing newline
            if not line:
                continue  # skip blank lines
            fields = line.split('\t')
            code_dic[fields[0]] = fields[1]  # codon is the key, amino acid is the value

    return code_dic
get_genecode()

{'AAA': 'K',
 'AAC': 'N',
 'AAG': 'K',
 'AAT': 'N',
 'ACA': 'T',
 'ACC': 'T',
 'ACG': 'T',
 'ACT': 'T',
 'AGA': 'R',
 'AGC': 'S',
 'AGG': 'R',
 'AGT': 'S',
 'ATA': 'I',
 'ATC': 'I',
 'ATG': 'M',
 'ATT': 'I',
 'CAA': 'Q',
 'CAC': 'H',
 'CAG': 'Q',
 'CAT': 'H',
 'CCA': 'P',
 'CCC': 'P',
 'CCG': 'P',
 'CCT': 'P',
 'CGA': 'R',
 'CGC': 'R',
 'CGG': 'R',
 'CGT': 'R',
 'CTA': 'L',
 'CTC': 'L',
 'CTG': 'L',
 'CTT': 'L',
 'GAA': 'E',
 'GAC': 'D',
 'GAG': 'E',
 'GAT': 'D',
 'GCA': 'A',
 'GCC': 'A',
 'GCG': 'A',
 'GCT': 'A',
 'GGA': 'G',
 'GGC': 'G',
 'GGG': 'G',
 'GGT': 'G',
 'GTA': 'V',
 'GTC': 'V',
 'GTG': 'V',
 'GTT': 'V',
 'TAA': '*',
 'TAC': 'Y',
 'TAG': '*',
 'TAT': 'Y',
 'TCA': 'S',
 'TCC': 'S',
 'TCG': 'S',
 'TCT': 'S',
 'TGA': '*',
 'TGC': 'C',
 'TGG': 'W',
 'TGT': 'C',
 'TTA': 'L',
 'TTC': 'F',
 'TTG': 'L',
 'TTT': 'F'}
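
A quick sanity check on the parsed dictionary can catch a mis-split line early. The sketch below is an illustration under stated assumptions: `load_genetic_code` is a hypothetical helper name, and the three-line sample stands in for `universal_genetic_code.tab` so that the example is self-contained.

```python
import io


def load_genetic_code(handle):
    """Build a codon -> amino acid dictionary from tab-delimited lines."""
    code = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        codon, amino_acid = line.split("\t")[:2]
        code[codon] = amino_acid
    return code


# Three made-up lines in the same format as the provided file
sample = io.StringIO("AAA\tK\nATG\tM\nTAA\t*\n")
code = load_genetic_code(sample)
stop_codons = sorted(codon for codon, aa in code.items() if aa == "*")

print(code)         # {'AAA': 'K', 'ATG': 'M', 'TAA': '*'}
print(stop_codons)  # ['TAA']
```

Run against the full file, the dictionary printed above suggests the checks to expect: 64 keys, each three letters long, with `TAA`, `TAG`, and `TGA` mapping to `*`.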

### Part C: Translating from DNA to Protein in 3 Frames

We have loaded our DNA sequence and stored the genetic code in a dictionary. Now we need to split the DNA into codons and use the dictionary to translate them into amino acids. However, we need to do this in three reading frames!

1. Use a `for` loop and the functions `range` and `len` to split the DNA into codons. **Hint:** In BF527 Lab 6, we used `range` and `len` to build a matrix. Although we don’t want to build a matrix here, we want to use the step argument of `range` to move through the sequence three nucleotides at a time, and the start argument to change the reading frame.
2. Translate the codons of the DNA sequence by looking each codon up in the dictionary and printing the corresponding amino acid.
3. Print the translated sequence to the screen, or save it to a variable to be printed later.
4. Visually inspect the amino acid sequence to see whether it corresponds to an **ORF**: does it begin with a start/methionine codon (ATG)?
5. Use BLAST to identify the protein.

The `range` function generates arithmetic progressions and is often used in `for` loops. The arguments must be integers. If the step argument is omitted, it defaults to 1. If the start argument is omitted, it defaults to 0. The full form yields the integers `start, start + step, start + 2*step, ...`, stopping before `stop`. In Python 3, `range` returns a lazy `range` object rather than a list; wrap it in `list()` to see the values.

```
range(start, stop, step)
```
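
To make the role of the start and step arguments concrete, here is a small sketch that splits a made-up 10-nucleotide sequence into codons in each forward frame. The helper name `split_codons` is an assumption for illustration, as is the choice to drop a trailing partial codon.

```python
def split_codons(dna, frame):
    """Return the complete codons of `dna` read from offset `frame` (0, 1, or 2)."""
    # Stopping at len(dna) - 2 drops a trailing partial codon of 1 or 2 bases.
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]


example_dna = "ATGGCCTAAG"  # made-up example sequence

print(list(range(0, len(example_dna), 3)))  # [0, 3, 6, 9]
for frame in range(3):
    print(frame, split_codons(example_dna, frame))
# 0 ['ATG', 'GCC', 'TAA']
# 1 ['TGG', 'CCT', 'AAG']
# 2 ['GGC', 'CTA']
```

The step of 3 moves from one codon to the next, while the start value (0, 1, or 2) selects the reading frame.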

In [21]:
# Add your python code here:
def get_orfs(dna):
    code = get_genecode()  # load the genetic code once instead of once per codon
    stops = ("TAA", "TAG", "TGA")

    # split the dna into codons in each of the three forward reading frames
    frames = [[dna[i:i + 3] for i in range(start, len(dna), 3)] for start in range(3)]

    orflist = []  # each ORF is stored as a list of codons
    for codons in frames:  # loop through the frames
        for j in range(len(codons)):
            if codons[j] == "ATG":  # find a start codon
                for n in range(j + 1, len(codons)):
                    if codons[n] in stops:  # reached the first in-frame stop codon
                        orflist.append(codons[j:n])  # start codon up to, but not including, the stop
                        break

    # finally, translate the codons of each ORF into amino acids
    proteins = [''.join(code[codon] for codon in orf) for orf in orflist]
    print(proteins)

    # join the codons of each ORF into one continuous sequence and print it in color
    print("Open reading frames:")
    for orf in orflist:
        get_color(''.join(orf))

get_orfs(dna)

['MSSQRFKTRKVSHLTNKD', 'MLDCHQELPTVERESVVTPTNCVQRRS', 'MQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGIIEPSLKALASKYNCDKSVCRKCYARLPPRATNCRKRKCGHTNQLRPKKKLK']
Open reading frames:
[34mA[30mT[32mG[30mT[31mC[34mA[34mA[32mG[30mT[31mC[34mA[34mA[34mA[32mG[34mA[30mT[30mT[31mC[34mA[34mA[32mG[34mA[31mC[34mA[34mA[32mG[32mG[34mA[34mA[32mG[32mG[30mT[34mA[30mT[31mC[31mC[31mC[34mA[31mC[31mC[30mT[32mG[34mA[31mC[31mC[34mA[34mA[31mC[34mA[34mA[34mA[32mG[34mA[30mT
[34mA[30mT[32mG[31mC[30mT[34mA[32mG[34mA[30mT[30mT[32mG[31mC[31mC[34mA[31mC[31mC[34mA[34mA[32mG[34mA[32mG[31mC[30mT[34mA[31mC[31mC[34mA[34mA[31mC[30mT[32mG[30mT[34mA[32mG[34mA[34mA[34mA[32mG[34mA[32mG[34mA[34mA[34mA[32mG[30mT[32mG[30mT[32mG[32mG[30mT[31mC[34mA[31mC[34mA[31mC[31mC[34mA[34mA[31mC[31mC[34mA[34mA[30mT[30mT[32mG[31mC[32mG[30mT[31mC[31mC[34mA[34mA[34mA[32mG[34mA[34mA[32mG[34mA

### EXTRA: Translating from DNA to Protein in 6 Frames

This is **NOT REQUIRED** for completion of BF527 Lab 13. However, if you would like a personal challenge, translate the DNA sequence in the full 6 frames. There are an additional 3 reading frames in the reverse complement of the DNA sequence that could also code for a protein. The "reverse complement" of a sequence is the sequence read backwards, with each nucleotide replaced by its complement (A pairs with T, and C pairs with G). For example, the reverse complement of `ATTTGC` is `GCAAAT`.

1. Build the reverse complement in a new variable. Use a `for` loop to read the original DNA sequence, and concatenate or add the complementary nucleotide. Make sure the output is reversed!
2. Now you can use the code you wrote for Part C to also translate this extra sequence.
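
As a cross-check for the `for`-loop approach described in step 1, the reverse complement can also be computed with Python's built-in `str.maketrans` and `str.translate`. This is an alternative sketch, not the method the lab asks for; the function name `reverse_complement` is an illustrative choice, and the code assumes the sequence contains only uppercase `A`, `C`, `G`, and `T`.

```python
# Translation table mapping each nucleotide to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every nucleotide, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATTTGC"))  # GCAAAT
```

Applying the function twice should return the original sequence, which makes a convenient test for either implementation.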

In [22]:
# Write your python code here.
def revcomp_dna(dna):
    # first, build the complementary dna sequence
    comp = {"A": "T", "C": "G", "T": "A", "G": "C"}  # base pairs
    comp_dna = ''
    for nt in dna:
        comp_dna += comp[nt]
    # then reverse the complementary sequence
    rev_comp = comp_dna[::-1]

    return rev_comp
revcomp_dna(dna)

'CTAGCGCGAGATCCGGAATCCGTCATTTTAACTTCTTCTTTGGACGCAATTGGTTGGTGTGACCACACTTTCTCTTTCTACAGTTGGTAGCTCTTGGTGGCAATCTAGCATAACACTTACGGCAAACAGATTTGTCACAGTTGTACTTGGAAGCCAAAGCTTTCAAAGATGGTTCAATGATACCACCTCTCAATCTCAAGACTAAGTGCAAAGTGGATTCTTTTTGAATGTTGTAGTCAGACAAGGTTCTACCGTCTTCCAATTGCTTACCAGCAAAGATCAATCTTTGTTGGTCAGGTGGGATACCTTCCTTGTCTTGAATCTTTGACTTGACATTGTCAATAGTGTCAGAAGATTCAACTTCCAAAGTGATGGTCTTACCAGTCAAAGTCTTGACAAAAATTTGCATTAGCCTAG'

In [23]:
get_orfs(revcomp_dna(dna))

['MVQ', 'MIPPLNLKTKCKVDSF', 'ML']
Open reading frames:
[34mA[30mT[32mG[32mG[30mT[30mT[31mC[34mA[34mA
[34mA[30mT[32mG[34mA[30mT[34mA[31mC[31mC[34mA[31mC[31mC[30mT[31mC[30mT[31mC[34mA[34mA[30mT[31mC[30mT[31mC[34mA[34mA[32mG[34mA[31mC[30mT[34mA[34mA[32mG[30mT[32mG[31mC[34mA[34mA[34mA[32mG[30mT[32mG[32mG[34mA[30mT[30mT[31mC[30mT[30mT[30mT[30mT
[34mA[30mT[32mG[30mT[30mT[32mG
