# DNA Translation Case Study
#### DNA is a discrete code physically present in almost every cell of an organism.
#### We can think of DNA as a one-dimensional string of characters drawn from a four-letter alphabet, {A, C, G, T}. These letters are the first letters of the nucleotides used to construct DNA.
#### Each unique three-character sequence of nucleotides, sometimes called a nucleotide triplet, corresponds to one amino acid. The sequence of amino acids is unique for each type of protein, and all proteins in all living things are built from the same set of just 20 amino acids.

#### Read as a sequence of three-letter words, DNA can be thought of as a dictionary of life.


### Our Program
#### The input to our program is a DNA sequence written in a four-letter alphabet. We read this sequence three letters at a time, translate each triplet into a single letter that stands for a specific amino acid, and then proceed to the next set of three letters. We repeat this until we reach the end of the sequence.
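
The reading scheme described above can be tried on a toy input first. The sketch below is only an illustration: `toy_table` and `toy_seq` are made-up names, and the table holds just three entries rather than the full genetic code.

```python
# Toy illustration only: a three-entry lookup table, not the full genetic code.
toy_table = {"ATG": "M", "GCC": "A", "TAA": "_"}  # "_" marks a stop codon

toy_seq = "ATGGCCTAA"

# Split the sequence into consecutive, non-overlapping triplets.
codons = [toy_seq[i:i + 3] for i in range(0, len(toy_seq), 3)]

# Translate each triplet into its one-letter amino acid code.
toy_protein = "".join(toy_table[codon] for codon in codons)

print(codons)       # ['ATG', 'GCC', 'TAA']
print(toy_protein)  # MA_
```

Stepping by 3 in `range` is what keeps the triplets from overlapping.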




## Remark:
#### In general, before you start writing any code, make sure to really understand the problem. Sometimes it's helpful to run through a simple example on a piece of paper before even thinking about the code itself.

## Reading a file in python
#### The best way to read a file depends on what we want to do with it. If the file is large and we need only part of it, we can read it line by line with a for loop. This keeps memory use low, because only one line is held in memory at a time, and avoids reading data we do not need.
#### Here we will read the whole file in one go.
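
For comparison, the line-by-line approach might look like the sketch below. It writes a small throwaway file first so that it runs on its own; the file name `example_dna.txt` and its contents are assumptions made for illustration, not the `dna.txt` used in this case study.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
example_path = os.path.join(tempfile.mkdtemp(), "example_dna.txt")
out = open(example_path, "w")
out.write("GGTCAG\nAAAAAG\nCCCTCT\n")
out.close()

# Iterating over the file object yields one line at a time,
# so only the current line needs to be held in memory.
pieces = []
f = open(example_path, "r")
for line in f:
    pieces.append(line.strip())  # drop the trailing newline
f.close()

line_by_line_seq = "".join(pieces)
print(line_by_line_seq)  # GGTCAGAAAAAGCCCTCT
```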

In [1]:
pwd() # First check working directory and make sure files to be read are in the working directory

'D:\\Mtech\\Optimization in Chemical Engineering\\Jupyter\\SciPyLearn\\CaseStudies\\DNA Translation'

### Open function in python
#### open() function opens a file and returns it as a file object
#### Syntax: open(file,mode)
- file: path and name of the file
- mode: a string that defines the mode in which the file is opened
    - "r": read
    - "a": append
    - "w": write
    - "x": create

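The four modes can be compared side by side in a short sketch. The file `modes_demo.txt` is a hypothetical scratch file created in a temporary directory, and the variable names are chosen only for this illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")  # hypothetical scratch file

f = open(path, "x")   # "x": create the file; fails if it already exists
f.write("ACG")
f.close()

f = open(path, "a")   # "a": append to the end of the existing content
f.write("TTT")
f.close()

f = open(path, "r")   # "r": read
content_after_append = f.read()
f.close()

f = open(path, "w")   # "w": discard the existing content, then write
f.write("GGG")
f.close()

f = open(path, "r")
content_after_write = f.read()
f.close()

# Opening an existing file with "x" raises FileExistsError.
try:
    open(path, "x")
    x_failed = False
except FileExistsError:
    x_failed = True

print(content_after_append)  # ACGTTT
print(content_after_write)   # GGG
print(x_failed)              # True
```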



In [2]:
input_file = 'dna.txt'
f = open(input_file,"r")
seq = f.read()
f.close()  # release the file handle once the contents have been read

In [3]:
# Remove the newline (\n) and carriage-return (\r) characters
seq = seq.replace("\n","")
seq = seq.replace("\r","")

In [3]:
seq

'GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA\nGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT\nCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT\nTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT\nCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG\nAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA\nACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA\nGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT\nTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA\nGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA\nCCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT\nTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT\nGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG\nTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGA

### Translating DNA sequence
- Step 1: Check that the length of the sequence is divisible by 3
- Step 2: Look up each 3-letter string in the table and store the result
- Step 3: Continue the lookups until reaching the end of the sequence

In [24]:
## Pseudo code
# Check that length of sequence is divisible by 3
    #loop over the sequence
        #extract a single codon
        # look up the codon and store the result


In [5]:
def translate(seq):
    """Translate a string containing a nucleotide sequence into a string
    containing the corresponding sequence of amino acids. Nucleotides are
    translated in triplets using the table dictionary; each amino
    acid is encoded with a string of length 1. An empty string is
    returned if the length of seq is not divisible by 3."""

    ## Define the lookup table as a dictionary
    table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',}
    protein = ""
    if len(seq)%3 == 0:
        for i in range(0,len(seq),3):
            codon = seq[i:i+3]
            protein += table[codon]
    return protein


In [6]:
translate(seq)

''
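
The empty string is what `translate` returns when the length of its input is not a multiple of three, because the translation loop is then skipped entirely. This suggests that the full sequence read from `dna.txt` does not split evenly into triplets. A small check along the following lines can make the cause visible; `check_frame` is a hypothetical helper, not part of the original notebook, and the example inputs are made up.

```python
def check_frame(seq):
    """Return the number of leftover nucleotides after grouping seq
    into triplets (0 means the length is divisible by 3)."""
    return len(seq) % 3

print(check_frame("ATGGCC"))   # 0: two complete codons
print(check_frame("ATGGCCT"))  # 1: one nucleotide left over
```

A non-zero result means the sequence has to be trimmed to a region whose length is a multiple of three before it can be translated.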

#### There is another, preferred way to read a file: the with statement. It is preferred because the file is closed automatically when the block ends, even if something goes wrong while reading.
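
The following sketch illustrates that behaviour by raising an error on purpose inside a with block. The scratch file `with_demo.txt` and the simulated error are assumptions made for this illustration.

```python
import os
import tempfile

# Create a small scratch file so the example runs on its own.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")  # hypothetical file
f = open(path, "w")
f.write("ATG\n")
f.close()

try:
    with open(path, "r") as f:
        first_line = f.readline()
        # Simulate something going wrong while the file is still open.
        raise ValueError("simulated problem while processing the file")
except ValueError:
    pass

# The with statement closed the file even though an error was raised.
print(f.closed)  # True
```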

In [12]:
# Let's see how it works
inputfile = "dna.txt"
def read_seq(inputfile):
    """Reads and returns the input sequence with special
    characters removed."""
    with open(inputfile,"r") as f:
        seq = f.read()
        seq = seq.replace("\n","")
        seq = seq.replace("\r","")
    return seq


In [30]:
prt = read_seq("protein.txt")

In [14]:
dna = read_seq("dna.txt")

In [23]:
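# Check whether the length of the DNA sequence is divisible by 3
len(dna)%3
# Translate only the slice dna[20:938]: it starts at the ATG start codon,
# ends with a stop codon, and its length (918) is divisible by 3
translate(dna[20:938])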

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC_'

In [26]:
prt

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

In [31]:
# The translated sequence ends with an underscore, which stands for the
# stop codon; drop it before comparing with the protein read from the file
prt == translate(dna[20:938])[:-1]

True