<a href="https://colab.research.google.com/github/Balaji-0-5/Python/blob/main/DNA_translation_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Before running the program in Google Colab, upload the DNA and protein text files to the
session storage.

Download the data from: https://www.ncbi.nlm.nih.gov/

In [1]:
def read_seq(inputfile):

    """Reads and returns the input sequence with newline and carriage-return characters removed"""

    with open(inputfile, encoding='utf-8') as f:
        seq = f.read()
    # Strip line breaks so the sequence is one continuous string
    seq = seq.replace("\n", "")
    seq = seq.replace("\r", "")
    return seq

In [2]:
def translate(seq):

    """Translate a string containing a nucleotide sequence into a string containing the
    corresponding sequence of amino acids. Nucleotides are translated in triplets using
    the table dictionary; each amino acid is encoded with a string of length 1."""

    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }

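    protein = ""
    # Translate only when the sequence splits evenly into codons;
    # otherwise the empty string is returned.
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i+3]
            protein += table[codon]

    return protein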

In [3]:
protein = read_seq("protein.txt.txt")
dna = read_seq("dna.txt.txt")

Passing the full DNA sequence directly to the translate function returns an empty string, because the length of the sequence is not divisible by 3. To find the correct positions (indices) of the part of the DNA sequence required for translation, look for the entry named CDS (coding sequence) on the record's webpage and note its indices.

In our example for NM_207618.2, the webpage shows  CDS             21..938

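This means that the coding sequence starts at position 21 and ends at position 938. But in Python the index starts from zero, and the end index of a slice is excluded. So position 21 becomes index 20, while the end stays at 938, and the slice we need is dna[20:938].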
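
The coordinate conversion described here can be sketched as a small helper. The name `cds_to_slice` is hypothetical (it is not part of the original notebook), and the sketch assumes the CDS is a single 1-based, inclusive `start..end` range like the one in this example.

```python
def cds_to_slice(start, end):
    """Convert 1-based, inclusive CDS coordinates (e.g. 21..938)
    into a 0-based, end-exclusive Python slice.

    Hypothetical helper for illustration; assumes a simple start..end range.
    """
    # Shift the start down by one; the end stays the same because
    # Python slices exclude the end index.
    return slice(start - 1, end)


cds = cds_to_slice(21, 938)
print(cds.start, cds.stop)    # 20 938
print(cds.stop - cds.start)   # 918 nucleotides, i.e. 306 codons
```

With this helper, `dna[cds_to_slice(21, 938)]` selects the same characters as `dna[20:938]`.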

In [4]:
codon_seq = translate(dna[20:938])
codon_seq

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC_'

In [5]:
protein

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

Comparing the two sequences above, we can see that the translated sequence has an extra underscore '_' at the end. This is because '_' represents the stop codon, which is left out of the protein sequence shown on the website.
So to get the matching sequence, we drop the last 3 nucleotides (the stop codon) of the DNA sequence before translation.
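
As a minimal sketch of why this works, the toy example below drops the stop codon in two equivalent ways. It uses a three-codon subset of the table rather than the full one, and `toy_table`, `toy_translate`, and the sample sequence are made up for illustration.

```python
# Toy subset of the codon table; '_' marks a stop codon as in the notebook.
toy_table = {'ATG': 'M', 'GCC': 'A', 'TAA': '_'}


def toy_translate(seq):
    """Translate seq codon by codon with toy_table (illustration only)."""
    return "".join(toy_table[seq[i:i+3]] for i in range(0, len(seq), 3))


cds = "ATGGCCTAA"  # made-up coding sequence ending in the stop codon TAA

with_stop = toy_translate(cds)            # stop codon included
trimmed_dna = toy_translate(cds[:-3])     # drop the last 3 nucleotides first
trimmed_protein = with_stop.rstrip('_')   # or strip the trailing '_' afterwards

print(with_stop, trimmed_dna, trimmed_protein)   # MA_ MA MA
```

Trimming the DNA slice is what the next cell does: `dna[20:935]` is `dna[20:938]` without its final three nucleotides.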

In [6]:
codon_seq = translate(dna[20:935])

In [7]:
codon_seq == protein

True