# CS 525 exercise 1:  the central dogma of molecular biology

The objective of this notebook is to understand DNA strand complementarity and the concepts of transcription of DNA to RNA and translation of RNA to protein.

### Problem 0 - a discussion of Python dictionaries

Python dictionaries are a highly effective and efficient way to map keys to values.  Which underlying data structure is used in implementing Python dictionaries?  Given that information what can you say about the computational time required to insert or retrieve a value from a dictionary?

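Python dictionaries are implemented as hash tables, so insertion and lookup take O(1) time on average. In the worst case, when many keys collide in the table, a single operation can take O(n) time.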
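As a small illustration of the hash-table behavior described above, the sketch below counts bases with a dictionary and shows that keys must be hashable. The variable names (`base_counts`, `unhashable_rejected`) are made up for this example and are not part of the exercise.

```python
# Count bases with a dictionary: each update is one hash lookup,
# which takes O(1) time on average.
sequence = "GATTACA"
base_counts = {}
for base in sequence:
    base_counts[base] = base_counts.get(base, 0) + 1

print(base_counts)

# Keys are located through their hash value, so they must be hashable.
try:
    {["A", "T"]: 1}  # a list is mutable and therefore unhashable
    unhashable_rejected = False
except TypeError:
    unhashable_rejected = True

print(unhashable_rejected)
```

Strings and tuples are hashable, which is why codon strings such as `'AUG'` work well as dictionary keys later in this notebook.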

### Problem 1 - reverse complement

In this problem, we will write a function that is given a string representing one strand of a DNA molecule and returns a string representing the complementary strand. Both strings are written 5' to 3', so the result is the reverse complement of the input.


In [1]:
def reverse_complement(sequence):  # input strand is written 5' -> 3'
    """Returns the reverse complement of the input sequence"""
    complement_lookup = {'A': 'T',
                         'T': 'A',
                         'G': 'C',
                         'C': 'G'}
    # complement each base while reading the input backwards, so the result is also 5' -> 3'
    return ''.join(complement_lookup[base] for base in sequence.upper()[::-1])

In [2]:
# tests for reverse_complement
assert reverse_complement("A") == "T"
assert reverse_complement("ATCG") == "CGAT"
assert reverse_complement("") == ""
assert reverse_complement("GAATTC") == "GAATTC", "Failed on palindromic EcoRI recognition sequence"
print("SUCCESS: all tests for reverse_complement passed!")

SUCCESS: all tests for reverse_complement passed!
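An alternative way to write the same operation, shown here only as a sketch, uses the built-in `str.maketrans` and `str.translate` methods instead of a dictionary lookup per base. The names `COMPLEMENT_TABLE` and `reverse_complement_translate` are hypothetical and chosen for this example.

```python
# Translation table mapping each base to its complement (built once, reused on every call).
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(sequence):
    """Complement every base with str.translate, then reverse the result."""
    return sequence.upper().translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("ATCG"))
print(reverse_complement_translate("GAATTC"))
```

One difference in behavior is worth noting: characters that are not in the table, such as the ambiguity code `N`, pass through unchanged, whereas the dictionary-based version raises a `KeyError`. Which behavior is preferable depends on how strictly the input should be validated.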


### Problem 2 - transcription

Write a function that takes as input a string representing a DNA sequence and outputs the string representing the RNA that would result from transcribing this DNA sequence.

As a hint, there is a [string method](https://docs.python.org/3/library/stdtypes.html#string-methods) that will allow you to do this in one line.

In [3]:
def transcribe(sequence):  # assumes the coding (sense) strand, written 5' -> 3'
    """
    Returns the RNA sequence that would result from transcription
    of a DNA sense strand sequence
    """
    # the mRNA matches the sense strand, with uracil (U) in place of thymine (T)
    return sequence.upper().replace('T', 'U')  # the RNA is also 5' -> 3'
transcribe('ATCG')

'AUCG'
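The relationship between the two DNA strands and the transcript can be checked with a short sketch. It assumes both strands are written 5' to 3'; the function names `transcribe_coding` and `transcribe_template` are hypothetical and used only here. RNA polymerase reads the template strand and builds the complementary RNA, which ends up matching the coding (sense) strand with U in place of T.

```python
def transcribe_coding(coding_strand):
    """mRNA from the coding (sense) strand: same bases, with U replacing T."""
    return coding_strand.upper().replace("T", "U")

def transcribe_template(template_strand):
    """mRNA from the template strand (written 5' -> 3'): pair each base, reading backwards."""
    pairing = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in reversed(template_strand.upper()))

coding = "ATGGCC"
template = "GGCCAT"  # reverse complement of the coding strand

print(transcribe_coding(coding))
print(transcribe_template(template))
```

Both calls produce the same RNA, which is why transcription of a sense strand reduces to a single `str.replace` call.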

### Problem 3 - translation

In this problem you will write a function that translates an RNA sequence into the protein sequence for which it codes.  Recall that each codon is translated into an amino acid, where codons are nonoverlapping substrings of length three.
To assist you in this task, here is a Python dictionary that represents the standard genetic code:

In [4]:
genetic_code = {
 'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAU': 'N',
 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACU': 'T',
 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGU': 'S',
 'AUA': 'I', 'AUC': 'I', 'AUG': 'M', 'AUU': 'I',
 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAU': 'H',
 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCU': 'P',
 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGU': 'R',
 'CUA': 'L', 'CUC': 'L', 'CUG': 'L', 'CUU': 'L',
 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAU': 'D',
 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCU': 'A',
 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGU': 'G',
 'GUA': 'V', 'GUC': 'V', 'GUG': 'V', 'GUU': 'V',
 'UAA': '*', 'UAC': 'Y', 'UAG': '*', 'UAU': 'Y',
 'UCA': 'S', 'UCC': 'S', 'UCG': 'S', 'UCU': 'S',
 'UGA': '*', 'UGC': 'C', 'UGG': 'W', 'UGU': 'C',
 'UUA': 'L', 'UUC': 'F', 'UUG': 'L', 'UUU': 'F'}

In [5]:
def translate(rna_sequence): # this will take in 5' -> 3' RNA
    """
    Returns the protein sequence resulting from translation of 
    a given RNA sequence
    """
    # read nonoverlapping codons starting at the first base; stop codons appear as '*'
    return ''.join(genetic_code[rna_sequence[i:i+3]] for i in range(0, len(rna_sequence), 3)) # return N to C protein

In [6]:
assert translate("UUUGCGACUUAU") == "FATY", "Failed on input 'UUUGCGACUUAU'"
assert translate("ACG") == "T", "Failed on input 'ACG'"
assert translate("") == "", "Failed on the empty string"
print("SUCCESS: all tests for translate passed!")

SUCCESS: all tests for translate passed!
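The codon-by-codon reading described in this problem can also be made explicit. The sketch below is one possible variation, not part of the exercise: it splits an RNA string into complete codons and stops translating at the first stop codon. To keep the block self-contained it rebuilds the standard genetic code compactly under the name `CODON_TABLE`; that name, `split_codons`, and `translate_until_stop` are all hypothetical.

```python
from itertools import product

# Standard genetic code, listed in UCAG order with the third base varying fastest.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def split_codons(rna_sequence):
    """Return the complete, nonoverlapping codons, dropping any trailing partial codon."""
    usable = len(rna_sequence) - len(rna_sequence) % 3
    return [rna_sequence[i:i + 3] for i in range(0, usable, 3)]

def translate_until_stop(rna_sequence):
    """Translate codon by codon and stop at the first stop codon ('*')."""
    protein = []
    for codon in split_codons(rna_sequence):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(split_codons("AUGUUUUAAGGGA"))
print(translate_until_stop("AUGUUUUAAGGGA"))
```

Dropping a trailing partial codon and halting at a stop codon are design choices made for this sketch; the `translate` function in the exercise instead translates every codon and reports stop codons as `*`.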


### Problem 4 - Consequences of the deltaF508 mutation in CFTR 

One of the most famous disease-causing mutations in humans is the deltaF508 mutation in the *CFTR* gene.  This is the most common mutation among people with cystic fibrosis.  The mutation occurs in the gene fragment shown below and corresponds to the deletion of 3 consecutive bases, starting at base 129 (using 1-based indexing).  The code further below shows how to generate the mutated sequence.

We will now examine how the deltaF508 mutation impacts the resulting amino acid sequence of the encoded protein.  Here is the sequence of the CFTR gene fragment:

In [7]:
cftr_gene_fragment = ("ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAA"
                      "AATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTA"
                      "TGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAA"
                      "TATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG")
print(cftr_gene_fragment)
print(translate(transcribe(cftr_gene_fragment)))

ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG
TSLLMVIMGELEPSEGKIKHSGRISFCSQFSWIMPGTIKENIIFGVSYDEYRYRSVIKACQLEE


And here is the sequence of the mutated CFTR gene fragment, which has bases 129, 130, and 131 (1-based coordinates) removed:

In [8]:
deltaf508_fragment = cftr_gene_fragment[:128] + cftr_gene_fragment[131:]
print(deltaf508_fragment)
print(translate(transcribe(deltaf508_fragment)))

ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG
TSLLMVIMGELEPSEGKIKHSGRISFCSQFSWIMPGTIKENIIGVSYDEYRYRSVIKACQLEE


What are the consequences of this mutation on the protein sequence of this gene?
Keep in mind that the first three bases of the fragment are a codon.  **Note that it's possible to answer this simply by considering the coordinates of the mutation and the number of bases removed by the mutation**.  You can then verify your answer using your translation function.

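The deletion removes exactly three bases, so the reading frame is preserved and no frameshift occurs. Because the first base of the fragment starts a codon, bases 127-129 (ATC) form codon 43 and bases 130-132 (TTT) form codon 44. Removing bases 129-131 joins the remaining AT of codon 43 to the remaining T of codon 44, giving ATT, which still codes for isoleucine (I). The net effect is the loss of a single phenylalanine (F), the residue at position 508 of the full-length protein, which is where the name deltaF508 comes from. Every amino acid before and after it is unchanged, but losing even one residue can still disrupt how the protein folds and functions. By contrast, deleting one or two bases would shift the reading frame and change every codon downstream of the mutation, because the ribosome reads the mRNA three bases at a time.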
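The coordinate-based reasoning suggested in the question can be sketched directly on the DNA, without translating anything. The block below copies the gene fragment from the exercise so that it runs on its own; the helper names `codon_number` and `codon_at` are hypothetical and assume the reading frame starts at base 1.

```python
# CFTR gene fragment copied from the exercise (coding strand, 5' -> 3').
fragment = ("ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAA"
            "AATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTA"
            "TGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAA"
            "TATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG")

first_deleted, last_deleted = 129, 131  # 1-based, inclusive

def codon_number(position):
    """1-based number of the codon containing a 1-based base position."""
    return (position - 1) // 3 + 1

def codon_at(sequence, number):
    """The codon with the given 1-based number."""
    start = (number - 1) * 3
    return sequence[start:start + 3]

# A deletion whose length is a multiple of 3 keeps the reading frame.
n_deleted = last_deleted - first_deleted + 1
in_frame = n_deleted % 3 == 0

first_codon = codon_number(first_deleted)
last_codon = codon_number(last_deleted)

# Same slicing as in the exercise, written in terms of the 1-based coordinates.
mutant = fragment[:first_deleted - 1] + fragment[last_deleted:]

print("in frame:", in_frame)
print("affected codons:", first_codon, codon_at(fragment, first_codon),
      last_codon, codon_at(fragment, last_codon))
print("merged codon in mutant:", codon_at(mutant, first_codon))
```

The affected codons are ATC (isoleucine) and TTT (phenylalanine), and the merged codon ATT is again isoleucine, so only the phenylalanine is lost.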