# Problem 1: Use a dictionary to translate a DNA sequence to protein sequence (5 points)


This class is focused on statistics, and so we don't spend much time learning how to parse DNA and protein sequences. But many of the Python skills that we apply to statistics in this course are also useful for sequence analysis.

In this problem, you're going to translate a coding DNA sequence into a protein sequence, using a dictionary containing the genetic code. This problem is very similar to the Gene Ontology and Restriction Enzyme problems. It combines elements of both those problems, and introduces a new argument for `range()`.

## 1.1 Loop over a coding DNA sequence and print out the codons.

In the problem below, I give you a DNA sequence. Your task is to write a **`for`** loop that grabs three bases at a time from that DNA sequence and appends them to a list.

To loop over a string (such as a DNA sequence), we use two tools: slices and `range()`. Let's look at these one at a time:

**Grabbing three bases with a slice - COPIED FROM LAST PROBLEM FOR YOUR CONVENIENCE**

Strings, like lists, can be accessed with indexing and slices, like this:

```python
dna = 'ATGAGCAGGTCAGTGACTGAT' # a DNA sequence
dna[0] # the first base
dna[0:3] # a slice that grabs the first three bases
i = 3 # any valid starting index
dna[i:i+3] # a slice that grabs three bases beginning at index i
```

**Looping over a string with range() - IMPORTANT NEW APPLICATION OF `range()`**

The function `range()` can take up to three arguments:
`range(start, stop, step size)`

Counting begins at `start` and ends just before `stop`, so `stop` must be one more than the last number you want. The start position defaults to 0, and the step size defaults to 1. So these three statements are equivalent:

```python
# All three count from 0 to (length of dna - 1), by ones
range(len(dna))

range(0, len(dna)) 

range(0, len(dna), 1)
```
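
To see what the optional arguments do, it can help to convert a `range` to a list and look at the numbers it produces. The sketch below uses small made-up numbers rather than a DNA sequence, and the variable names are only for illustration.

```python
# range() produces its numbers on demand, so wrap it in list() to see them
by_ones = list(range(0, 10, 1))   # step size 1 (the default)
by_twos = list(range(0, 10, 2))   # step size 2 skips every other number
late_start = list(range(4, 10))   # start at 4 instead of the default 0

print(by_ones)     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(by_twos)     # [0, 2, 4, 6, 8]
print(late_start)  # [4, 5, 6, 7, 8, 9]
```

Notice that the stop value (10 here) never appears in the output.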

So if you wanted to print out each base of a DNA sequence, you would use a **`for`** loop with `range()` and string indexing, as shown in the cell below. Run the cell to see how it works.

In [None]:
# Run this cell
dna = 'ATGAGCAGGTCAGTGACTGAT' # a partial protein coding sequence

for i in range(len(dna)): # looping over the sequence one base at a time
    print(dna[i])

Using a similar approach, write a **`for`** loop that iterates over the following DNA sequence **three** bases (i.e., one codon) at a time. In the block of the **`for`** loop, use a slice to grab the codon, and append it to the list `codons`.

**HINT:** In your `for` statement, call `range()` with *all three arguments* - start, stop, and step size, as in the last example in the markdown cell above.

In [None]:
# Below is the coding sequence of yeast CYC1
dna = 'ATGACTGAATTCAAGGCCGGTTCTGCTAAGAAAGGTGCTACACTTTTCAAGACTAGATGTCTACAATGCCACACCGTGGAAAAGGGTGGCCCACATAAGGTTGGTCCAAACTTGCATGGTATCTTTGGCAGACACTCTGGTCAAGCTGAAGGGTATTCGTACACAGATGCCAATATCAAGAAAAACGTGTTGTGGGACGAAAATAACATGTCAGAGTACTTGACTAACCCAAAGAAATATATTCCTGGTACCAAGATGGCCTTTGGTGGGTTGAAGAAGGAAAAAGACAGAAACGACTTAATTACCTACTTGAAAAAAGCCTGTGAGTAA'
codons = [] # append codons to this list

# YOUR ANSWER HERE

print(codons[0:10]) # display some of the results

In [None]:
# Run to check your answers
assert len(codons) == 110
assert codons[59] == 'AAG'
assert codons[33] == 'GTT'

## 1.2 Write a `for` loop to translate a DNA sequence using a dictionary of the genetic code.

Now that you know how to extract codons from a DNA sequence, it's easy to look up codons in a genetic code dictionary. Your task is to write code to translate a DNA sequence into protein sequence:

1. Loop over the DNA sequence one codon at a time, building your protein sequence as you go.
2. Use the dictionary to find the amino acid corresponding to that codon.
3. Add that amino acid to the end of the protein sequence.
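
As a reminder of how the dictionary lookup in step 2 works, here is a minimal sketch using a small hypothetical dictionary (`aa_names` is made up for this example and is not part of the assignment). It maps one-letter amino acid codes to three-letter codes.

```python
# Hypothetical dictionary: one-letter amino acid codes -> three-letter codes
aa_names = {'M': 'Met', 'T': 'Thr', 'A': 'Ala', 'S': 'Ser'}

one_letter = 'T'                      # the key we want to look up
three_letter = aa_names[one_letter]   # square brackets return the value stored under that key
print(three_letter)                   # Thr
```

Any string can serve as the key, including a three-base codon.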

I've saved you some typing by creating a dictionary with the genetic code below. Run the cell.

In [None]:
# Run this cell. Written one entry per line for legibility.

genetic_code = {
    'TTT':'F',
    'TTC':'F',
    'TTA':'L',
    'TTG':'L',
    'CTT':'L',
    'CTC':'L',
    'CTA':'L',
    'CTG':'L',
    'ATT':'I',
    'ATC':'I',
    'ATA':'I',
    'ATG':'M',
    'GTT':'V',
    'GTC':'V',
    'GTA':'V',
    'GTG':'V',
    'TCT':'S',
    'TCC':'S',
    'TCA':'S',
    'TCG':'S',
    'CCT':'P',
    'CCC':'P',
    'CCA':'P',
    'CCG':'P',
    'ACT':'T',
    'ACC':'T',
    'ACA':'T',
    'ACG':'T',
    'GCT':'A',
    'GCC':'A',
    'GCA':'A',
    'GCG':'A',
    'TAT':'Y',
    'TAC':'Y',
    'TAA':'STOP',
    'TAG':'STOP',
    'CAT':'H',
    'CAC':'H',
    'CAA':'Q',
    'CAG':'Q',
    'AAT':'N',
    'AAC':'N',
    'AAA':'K',
    'AAG':'K',
    'GAT':'D',
    'GAC':'D',
    'GAA':'E',
    'GAG':'E',
    'TGT':'C',
    'TGC':'C',
    'TGA':'STOP',
    'TGG':'W',
    'CGT':'R',
    'CGC':'R',
    'CGA':'R',
    'CGG':'R',
    'AGT':'S',
    'AGC':'S',
    'AGA':'R',
    'AGG':'R',
    'GGT':'G',
    'GGC':'G',
    'GGA':'G',
    'GGG':'G',
}

Now write the code to translate the DNA sequence below into a protein sequence. Don't worry about handling stop codons yet. Just write a loop that looks up dictionary entries to create the protein sequence.

Assign the protein sequence to the variable `protein`.

**HINT:** Your loop needs to build your protein sequence, which is a string. You've learned how to create empty lists and `.append()` new values to them in a `for` loop. Conceptually, you can do the same thing with a string. You can use the `+=` operator to add characters to a string, much like `.append()` for a list. Also, just like you define an empty list with a pair of brackets, you can define an empty string with quotes like this:

`my_string = ''`
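
The sketch below shows the `+=` pattern on some made-up data (the list `words` and the variable `initials` are for illustration only). It builds a string one character at a time inside a `for` loop.

```python
# Build a string of initials from a list of words (illustrative data only)
words = ['deoxyribo', 'nucleic', 'acid']
initials = ''                     # start with an empty string

for word in words:
    initials += word[0].upper()   # add one character to the end of the string

print(initials)                   # DNA
```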

In [None]:
sequence = 'ATGACAGCCAGTTTAACTACCAAGTTCTTGAACAATACCTATGAAAACCCATTTATGAATTGAGGG'

# YOUR ANSWER HERE

print(protein)

In [None]:
# Run to check your answer
assert protein == 'MTASLTTKFLNNTYENPFMNSTOPG'

## 1.3 Modify the code to handle stop codons

The translation code above has two flaws in it. First, it adds the string 'STOP' to our translated sequence. (Notice the `STOP` in the output above?) Second, the loop continues past the stop codon. Your task is to modify the code so that the `for` loop ends when it encounters a stop codon.

**Breaking out of a `for` loop with `break`**

The way to end a `for` loop early is to use `break` with an `if` statement:

```python
# A for loop that ends if it encounters a number larger than 10
my_list = [1,5,2,6,75,3,5,2] 

for n in my_list:
    if n > 10:
        break
    else:
        print(n)  # placeholder: code to do whatever you want to do
```
When `break` is triggered, the loop stops immediately, and the Python interpreter then moves on to any subsequent code. (Another command, `continue`, doesn't end the loop but immediately moves on to the next iteration without executing further code.)
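
To make the difference between `break` and `continue` concrete, here is a small sketch using the same list of numbers. The result lists `before_break` and `skipped` are names invented for this example.

```python
my_list = [1, 5, 2, 6, 75, 3, 5, 2]

# break: stop the loop entirely at the first number larger than 10
before_break = []
for n in my_list:
    if n > 10:
        break
    before_break.append(n)

# continue: skip numbers larger than 10, but keep looping
skipped = []
for n in my_list:
    if n > 10:
        continue
    skipped.append(n)

print(before_break)  # [1, 5, 2, 6]
print(skipped)       # [1, 5, 2, 6, 3, 5, 2]
```

With `break`, nothing after 75 is ever examined. With `continue`, only 75 itself is left out.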

In the cell below, write a modified version of your code above that includes a `break` statement. If the code encounters a stop codon, it should break out of the loop without adding any further characters to the protein sequence. Again, the protein sequence should be assigned to the variable `protein`.

In [None]:
sequence = 'ATGACAGCCAGTTTAACTACCAAGTTCTTGAACAATACCTATGAAAACCCATTTATGAATTGAGGG'

# YOUR ANSWER HERE
print(protein)

In [None]:
# Run this to check your answer
assert protein == 'MTASLTTKFLNNTYENPFMN'

Along with `break` there are two other control words that are helpful with loops and conditionals: `pass` and `continue`. These other control words are useful in some special situations and you should be aware that they exist. To read more about them, check out sections 4.4 and 4.5 of the official Python page on flow control: https://docs.python.org/3/tutorial/controlflow.html
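
As a quick illustration of `pass`, the sketch below uses it as a do-nothing placeholder so that an otherwise empty block is still valid Python. The data and variable names are made up for this example.

```python
# pass does nothing; it simply fills a block that Python requires to be non-empty
values = [3, 12, 7]
small = []

for v in values:
    if v > 10:
        pass              # nothing to do (yet) for large values
    else:
        small.append(v)   # keep the values that are 10 or less

print(small)              # [3, 7]
```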