# Chapter 9: Decisions and loops, using http://rosalind.info

Go to this link to enroll in the Rosalind class for this course: http://rosalind.info/classes/enroll/10cba3c42b/

The class contains seven exercises; the first five are for today (the fifth is optional).

### Counting DNA nucleotides
We will start with the first exercise, counting the DNA nucleotides. Rosalind provides a sample data set, in this case a short DNA sequence, that you can use to develop the algorithm. If the code works, click the **Download dataset** button; this gives you a **rosalind_dna.txt** file in the Downloads folder. That file contains a longer DNA sequence on which you can run your algorithm. After downloading you have five minutes to submit the solution.

Let's start with creating a variable *dna* that contains the sample sequence.

In [1]:
dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

The next step is counting each of the four nucleotides. Chapter 9 of the book already gives most of the solution to this exercise. You can use a **string method** for this; first let's have a look at the methods that are available by using the **dir** function on the dna string.

In [2]:
#dir(dna)

Find the appropriate method and add the proper code below to get the counts for each of the nucleotides.

In [3]:
NumberA = dna.count("A")
NumberC = dna.count("C")
NumberG = dna.count("G")
NumberT = dna.count("T")

For Rosalind we have to make sure the counts are printed on one line, for instance like this:

In [4]:
print NumberA,NumberC,NumberG,NumberT

20 12 17 21
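
The same counts can also be obtained with the tools this chapter is about, a for-loop and an if-statement. The sketch below is only an illustration: the counter names are made up for this example, and `print(...)` is written with parentheses so the code runs under both Python 2 and Python 3.

```python
dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

# one counter per nucleotide, all starting at zero
count_a = 0
count_c = 0
count_g = 0
count_t = 0

# look at every letter once and increase the matching counter
for nucleotide in dna:
    if nucleotide == "A":
        count_a = count_a + 1
    elif nucleotide == "C":
        count_c = count_c + 1
    elif nucleotide == "G":
        count_g = count_g + 1
    elif nucleotide == "T":
        count_t = count_t + 1

print("%d %d %d %d" % (count_a, count_c, count_g, count_t))
```

This should print the same four numbers as the version that uses the count method, which makes it an easy way to check one approach against the other.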


If the result matches the sample output of the Rosalind exercise you could try the exercise with 'real' data, but for now let's take the six lines of code for this algorithm out of the Notebook and first run them through the interactive Python prompt and then as a script.

### The interactive Python prompt
Start a new terminal window and type **python**. You should see some text, and then a line with three **>** characters followed by a blinking cursor. Type the first line defining the *dna* variable and press Enter. Now type dna and press Enter to see the value of the dna variable.
```BASH
>>> dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
>>> dna
'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
```

Do the same for the four lines where you count each of the nucleotides and lastly type the print line.
```BASH
>>> dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
>>> NumberA = dna.count("A")
>>> NumberC = dna.count("C")
>>> NumberG = dna.count("G")
>>> NumberT = dna.count("T")
>>> print NumberA,NumberC,NumberG,NumberT
20 12 17 21
```

### A Python script
In the **~/exercises/Week3/Day1** folder create a new text file called **dna.py** using nano (or another text editor). In this file, copy the six lines of code for this exercise. Save the file and run it like this: 
```BASH
python dna.py
```

Does it produce the output that you expect?

Now add this line as the first line in the file:
```BASH
#!/usr/bin/env python
```
and make the file executable with:
```BASH
chmod a+x dna.py
```
Now you should be able to run the script as:
```BASH
./dna.py
```

### Reading data from a pipe

Now if you want to test the script with 'real' data, you can click the **Download dataset** button, which will give you a **rosalind_dna.txt** file in the Downloads folder. You could copy the DNA sequence from this file into your script as the new value of the *dna* variable, but it would be easier if you could use **rosalind_dna.txt** as input for your script. We have not covered reading files from Python yet, but it would already help if the script could read from a pipe **|**, as we saw last week for the **grep** shell command.
```BASH
cat rosalind_dna.txt | ./dna.py
```
The Python function for this is called *raw_input()*; it reads one line of text from standard input, which is where the pipe delivers the file contents. Change your script like this:
```python
#dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
dna = raw_input()
```
Test that it works using the **dna.txt** file.
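
Note that *raw_input()* reads a single line. If an input file ever contains several lines, a sketch like the following could read everything from the pipe instead. `read_sequence` is a hypothetical helper name, and `io.StringIO` is used here only to imitate a pipe so that the example can run on its own; the code is written to run under both Python 2 and Python 3.

```python
import io

def read_sequence(stream):
    # read all text, split it on any whitespace (including line breaks)
    # and glue the pieces back together into one string
    return "".join(stream.read().split())

# an in-memory stream standing in for the pipe (illustration only)
fake_pipe = io.StringIO(u"AGCT\nTTCA\n")
dna = read_sequence(fake_pipe)
print(dna)
```

In a script you would add `import sys` and call `read_sequence(sys.stdin)` instead of using the fake pipe.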

Now you are ready to do the Rosalind exercise. Remember, after clicking the **Download dataset** button you have five minutes to give the answer. Good luck!

### Running Python code

So far we have used three ways to run Python code:
1. the Notebook
2. the interactive prompt
3. a script

When would you use a Notebook, when the interactive prompt and when would a script be more appropriate?

### Transcribing DNA into RNA

Congratulations, you completed your first Rosalind exercise. In this next exercise the aim is to transcribe DNA into RNA, which in practice means replacing all Ts with Us. This can be solved directly using the replace method, like so:

In [5]:
dna = "GATGGAACTTGACTACGTAAATT"
rna = dna.replace("T","U")
print rna

GAUGGAACUUGACUACGUAAAUU


But that is too easy, so let's try to solve it using a **for-loop** and an **if** statement. We will read each nucleotide in the DNA sequence, starting with the first, and grow the RNA sequence one letter at a time. For this we use a variable *rna* that is an empty string before the for-loop starts. Inside the for-loop we add each nucleotide to the *rna* string (so this string grows by one nucleotide in each iteration of the loop), until every nucleotide is copied to the RNA sequence.

In [1]:
dna = "GATGGAACTTGACTACGTAAATT"
rna = ""
for nucleotide in dna:
    rna = rna + nucleotide
    print rna

G
GA
GAT
GATG
GATGG
GATGGA
GATGGAA
GATGGAAC
GATGGAACT
GATGGAACTT
GATGGAACTTG
GATGGAACTTGA
GATGGAACTTGAC
GATGGAACTTGACT
GATGGAACTTGACTA
GATGGAACTTGACTAC
GATGGAACTTGACTACG
GATGGAACTTGACTACGT
GATGGAACTTGACTACGTA
GATGGAACTTGACTACGTAA
GATGGAACTTGACTACGTAAA
GATGGAACTTGACTACGTAAAT
GATGGAACTTGACTACGTAAATT


By moving the print statement to the left, we take it out of the for-loop. So now we only print the value of *rna* after the loop is finished.

In [7]:
dna = "GATGGAACTTGACTACGTAAATT"
rna = ""
for nucleotide in dna:
    rna = rna + nucleotide

print rna

GATGGAACTTGACTACGTAAATT


But now we only copy the DNA, so we should add the functionality to replace every **T** with a **U**. This can be done inside the loop, using an if-statement.

In [8]:
dna = "GATGGAACTTGACTACGTAAATT"
rna = ""
for nucleotide in dna:
    if nucleotide == "T":
        nucleotide = "U"         
    rna = rna + nucleotide

print rna

GAUGGAACUUGACUACGUAAAUU
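
Because the replace method solves the same problem, it can serve as a check on the loop. The sketch below compares the two results; the variable names `loop_rna` and `replace_rna` are made up for this example, and `print(...)` is written so that it runs under both Python 2 and Python 3.

```python
dna = "GATGGAACTTGACTACGTAAATT"

# the loop version: copy each letter, but swap T for U
loop_rna = ""
for nucleotide in dna:
    if nucleotide == "T":
        nucleotide = "U"
    loop_rna = loop_rna + nucleotide

# the one-line version using a string method
replace_rna = dna.replace("T", "U")

# True means both approaches agree
print(loop_rna == replace_rna)
```

Comparing a hand-written loop with a built-in method like this is a cheap way to gain confidence before the five-minute timer starts.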


Fix the if-block so that it does the right thing (which does not include printing messages). If the result matches the Rosalind sample solution, create a Python script **rna.py** with the required lines of code. Test it and when you are confident that it works, do the exercise:
```BASH
cat rosalind_rna.txt | ./rna.py
```

### Complementing a strand of DNA

In this exercise we have to produce the reverse complement of a DNA sequence. The first thing we do is recognize that this problem consists of two smaller problems: making the complement of a sequence and reversing a sequence. Again there is a 'simple' solution that is only a few lines. For the complement we could use the translate() string method; for the reverse we could use a so-called slicing operation that reverses the characters in the string.

In [9]:
from string import maketrans

dna = "AAAACCCGGT"

trans = maketrans("ACGT","TGCA")
complement = dna.translate(trans)

revc = complement[::-1]

print revc

ACCGGGTTTT
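
The slice `[::-1]` used above can be read as "take the whole string, in steps of -1", that is, backwards. A minimal sketch of the general `[start:stop:step]` form follows; the variable names are chosen for this example only, and the code runs under both Python 2 and Python 3.

```python
dna = "AAAACCCGGT"

first_four = dna[0:4]      # positions 0, 1, 2 and 3
every_second = dna[::2]    # whole string, taking every second letter
backwards = dna[::-1]      # whole string, walking from the end to the start

print(first_four)
print(every_second)
print(backwards)
```

Leaving out start and stop means "from the beginning" and "to the end"; a negative step reverses the direction.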


Unfortunately this code uses some concepts that we have not yet covered... However, as before this can also be solved using for-loops and if-statements. Below is some code to get started; with the solution for the previous exercise you should have the tools to make it work.

In [10]:
dna = "AAAACCCGGT"

#create the complement, change T -> A, A ->T, C ->G, G->C
complement = ""
for nucleotide in dna:
    if nucleotide == "T":
        nucleotide = "A"
    elif nucleotide == "A":
        nucleotide = "T"
    elif nucleotide == "C":
        nucleotide = "G"
    elif nucleotide == "G":
        nucleotide = "C"
    else:
        #should not get here
        pass
    
    complement = complement + nucleotide

#check that it is correct
print complement

#now reverse the sequence
revc = ""
for nucleotide in complement:
    #revc = revc + nucleotide
    revc = nucleotide + revc
    
print revc

TTTTGGGCCA
ACCGGGTTTT
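
A useful sanity check for this exercise is that taking the reverse complement twice must give back the original sequence. The sketch below wraps both steps in a function (functions appear near the end of this chapter) so that it can be applied twice; `reverse_complement` is a name chosen for this example, and keeping unknown letters unchanged is an assumption, not something Rosalind requires. It runs under both Python 2 and Python 3.

```python
def reverse_complement(dna):
    revc = ""
    for nucleotide in dna:
        if nucleotide == "T":
            partner = "A"
        elif nucleotide == "A":
            partner = "T"
        elif nucleotide == "C":
            partner = "G"
        elif nucleotide == "G":
            partner = "C"
        else:
            # assumption: letters other than A, C, G, T are kept as they are
            partner = nucleotide
        # putting the partner in front reverses the order as we go
        revc = partner + revc
    return revc

dna = "AAAACCCGGT"
once = reverse_complement(dna)
twice = reverse_complement(once)
print(once)
print(twice == dna)
```

Here the complement and the reversal happen in a single loop, because each complemented letter is placed in front of the result instead of behind it.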


If you have a working solution, create a Python script (**revc.py**) with the code and try to solve the Rosalind exercise.

### Translating RNA into protein
Now it gets a little more complicated: in this exercise an RNA sequence should be translated to a protein sequence. Again we can split the task into smaller separate problems. First we have to separate the RNA sequence into triplets to get the individual codons; then we can translate each codon to the corresponding amino acid.

For finding the codons, let's consider this fragment:
```python
rna = "AUGGCCAUG"
```
The first codon is AUG, the second is GCC, the third codon is AUG. If we look at the start positions in the string, the first codon starts at position 0 (the first position in a string/list has the index 0 instead of 1), the second starts at position 3 and the third starts at position 6:
```
AUGGCCAUG
012345678
```
or:
```
AUG GCC AUG
012 345 678
```
So each codon is a substring of *rna* with a length of three. The way to get these substrings is similar to getting a slice of a list, which will be covered in more detail tomorrow:
```python
rna[start_position:start_position+3]
```
The starting positions are multiples of three; see the code below.

In [11]:
rna = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"

#first we need to know the number of codons
codon_count = len(rna)/3
print "We have %d codons"%codon_count

#then we loop through the codons, starting with codon 0 (typical for Python)
for codon_number in range(0,codon_count):
    codon_start = codon_number * 3
    codon = rna[codon_start:codon_start+3]
    print "codon %d: %s"%(codon_number, codon)


We have 17 codons
codon 0: AUG
codon 1: GCC
codon 2: AUG
codon 3: GCG
codon 4: CCC
codon 5: AGA
codon 6: ACU
codon 7: GAG
codon 8: AUC
codon 9: AAU
codon 10: AGU
codon 11: ACC
codon 12: CGU
codon 13: AUU
codon 14: AAC
codon 15: GGG
codon 16: UGA
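
An alternative to multiplying the codon number by three is to let `range` produce the start positions directly, using its third argument as the step size. The sketch below assumes that an incomplete codon at the end should simply be skipped; it collects the codons in a list (lists are covered in more detail tomorrow) and runs under both Python 2 and Python 3.

```python
rna = "AUGGCCAUGGC"  # 11 letters: three full codons plus two extra letters

codons = []
# range(0, 11, 3) gives the start positions 0, 3, 6 and 9
for codon_start in range(0, len(rna), 3):
    codon = rna[codon_start:codon_start + 3]
    # the last slice is shorter than three letters, so it is skipped
    if len(codon) == 3:
        codons.append(codon)

print(codons)
```

Slicing past the end of a string does not raise an error, it just returns a shorter string, which is why the length check is needed here.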


### Using a dictionary to translate codons to amino acids
Now that we have the codons, we can translate each codon to the corresponding amino acid. A very useful data structure for this is the dictionary. A dictionary consists of key-value combinations. A way to define an empty dictionary is like this:
```python
codon_table = dict()
```
We can then set key-value combinations like this:
```python
codon_table["UUU"] = "F"
```
Now we can use the key to get the corresponding value:
```python
print codon_table["UUU"]
```

In [12]:
codon_table = dict()
codon = "UUU"
amino_acid = "F"
codon_table[codon] = amino_acid
print codon_table[codon]

F
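
Asking a dictionary for a key that it does not contain raises a `KeyError`. Two common ways around that are testing with `in` first, or using the `get` method with a default value. The sketch below uses a deliberately tiny table and "?" as a made-up placeholder for unknown codons; it runs under both Python 2 and Python 3.

```python
codon_table = dict()
codon_table["UUU"] = "F"
codon_table["AUG"] = "M"

# 'in' tells us whether a key is present
has_aug = "AUG" in codon_table
has_ggg = "GGG" in codon_table
print(has_aug)
print(has_ggg)

# get() returns the value, or the given default when the key is missing
found = codon_table.get("AUG", "?")
missing = codon_table.get("GGG", "?")
print(found + missing)
```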


Now to define all 64 codons in one go:

In [13]:
codon_table = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",
              "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S",
              "UAU":"Y", "UAC":"Y", "UAA":"*", "UAG":"*",
              "UGU":"C", "UGC":"C", "UGA":"*", "UGG":"W",
              "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
              "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
              "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
              "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
              "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
              "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
              "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
              "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
              "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
              "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
              "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
              "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G"}

Notice that the stop codons have "\*" as the value.
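
Typing 64 entries by hand is error-prone. Because the table above lists the codons in a regular order (U, C, A, G for each of the three positions), it could also be built with three nested for-loops. The sketch below is one possible way to do that; the `amino_acids` string was written out in that same order for this example and should be checked against the table above. It runs under both Python 2 and Python 3.

```python
bases = "UCAG"
# one letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ... GGG
amino_acids = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

codon_table = {}
index = 0
for first in bases:
    for second in bases:
        for third in bases:
            codon_table[first + second + third] = amino_acids[index]
            index = index + 1

print(len(codon_table))
print(codon_table["AUG"])
```

The innermost loop changes fastest, so the third letter of the codon cycles through U, C, A, G before the second letter moves on.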

Now combine the codons with the codon lookup table to translate the RNA sequence to a protein sequence:

In [14]:
rna = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"

#first we need to know the number of codons
codon_count = len(rna)/3

prot = ""
#then we loop through the codons, starting with codon 0 (typical for Python)
for codon_number in range(codon_count):
    codon_start = codon_number * 3
    codon = rna[codon_start:codon_start+3]
    prot = prot + codon_table[codon]
    
print prot


MAMAPRTEINSTRING*
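
The result above still ends with the "\*" of the stop codon. If the expected answer should not contain it (compare with the sample output on Rosalind), one option is to leave the loop with `break` as soon as a stop codon is found. The sketch below uses an abbreviated three-entry table and a made-up sequence, just enough to show the idea; `//` is used for the division so that it gives a whole number under both Python 2 and Python 3.

```python
# abbreviated table for illustration only, not the full 64 codons
codon_table = {"AUG": "M", "GCC": "A", "UGA": "*"}
rna = "AUGGCCUGAAUG"

codon_count = len(rna) // 3

prot = ""
for codon_number in range(codon_count):
    codon_start = codon_number * 3
    codon = rna[codon_start:codon_start + 3]
    amino_acid = codon_table[codon]
    if amino_acid == "*":
        # stop codon: leave the loop without adding anything
        break
    prot = prot + amino_acid

print(prot)
```

The fourth codon (AUG) is never translated, because the loop has already ended at the stop codon before it.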


If your code is ready, create the Python file **prot.py** and try to complete the exercise.

### Open Reading Frames (optional)
By combining the ideas of the previous exercises you can get quite far in completing this exercise, but there is one catch: the sequence is not on a single line, but in (multiline) FASTA format.
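
In a FASTA file the first line of a record starts with ">" and holds a name; the lines after it hold the sequence, split over several lines. A minimal sketch of gluing those lines back together is shown below. The record name and the list `fasta_lines` are made up for this example (in a real script the lines would come from the pipe or a file), and the code runs under both Python 2 and Python 3.

```python
# hypothetical FASTA record, already split into lines
fasta_lines = [">Example_record\n",
               "AGCCATGTAGCTAACTCAGG\n",
               "TTACATGGGGATGACCCCGC\n"]

dna = ""
for line in fasta_lines:
    line = line.strip()          # remove the line break at the end
    if line.startswith(">"):
        continue                 # skip the name line
    dna = dna + line             # add this piece of the sequence

print(dna)
```

This sketch assumes a single record; a file with several records would need the sequences to be kept apart.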

In [15]:
dna = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"

#transcribe to rna
rna = dna.replace("T","U")

#translate frame 1: AGC CAT GTA etc.
codon_count = len(rna)/3

prot = ""
for codon_number in range(codon_count):
    codon_start = codon_number * 3
    codon = rna[codon_start:codon_start+3]
    prot = prot + codon_table[codon]

#find all sequences that start with M and end with the first "*"
for AA_pos in range(len(prot)):
    if prot[AA_pos] == "M":
        stop_pos = prot.find("*",AA_pos)
        if stop_pos != -1:
            print prot[AA_pos:stop_pos]
            
#do the same with frame 2 and frame 3
#and for the three frames in the reverse strand

MGMTPRLGLESLLE
MTPRLGLESLLE


These are the protein strings in the first frame of the sense strand  
```
MGMTPRLGLESLLE
MTPRLGLESLLE
```
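
Frames 2 and 3 differ from frame 1 only in where reading starts: one or two letters further along. A possible way to handle all three is a loop over the offsets 0, 1 and 2, slicing the sequence before it is split into codons. The sketch below uses a short made-up sequence and only shows the codons per frame, without translating them; the list `frames` is a name chosen for this example, and the code runs under both Python 2 and Python 3.

```python
rna = "AUGGCCAUGA"

frames = []
for offset in range(3):
    shifted = rna[offset:]            # drop the first 0, 1 or 2 letters
    codon_count = len(shifted) // 3   # whole codons only

    codons = ""
    for codon_number in range(codon_count):
        codon_start = codon_number * 3
        codons = codons + shifted[codon_start:codon_start + 3] + " "

    frames.append(codons.strip())
    print("frame %d: %s" % (offset + 1, codons.strip()))
```

The same loop could be run on the reverse complement to cover the three frames of the other strand.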

Now a version that uses functions:

In [16]:
from string import maketrans #needed for revcomp

def translate(rna):
    codon_count = len(rna)/3
    
    prot = ""
    for codon_number in range(codon_count):
        codon_start = codon_number * 3
        codon = rna[codon_start:codon_start+3]
        prot = prot + codon_table.get(codon,"")
    return prot

def revcomp(rna):
    trans = maketrans("ACGU","UGCA")
    complement = rna.translate(trans)
    revc = complement[::-1]
    return revc
    
def get_valid_prots(prot):
    #collect every sequence that starts with M and ends before the first "*" after it
    valid_prots = []
    for AA_pos in range(len(prot)):
        if prot[AA_pos] == "M":
            stop_pos = prot.find("*",AA_pos)
            if stop_pos != -1:
                valid_prots.append(prot[AA_pos:stop_pos])
    return valid_prots

                
dna = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"

#transcribe to rna
rna = dna.replace("T","U")
revc = revcomp(rna)

#frame -3
prot = translate(revc[2:])
for valid_prot in get_valid_prots(prot):
    print valid_prot

MLLGSFRLIPKETLIQVAGSSPCNLS
