# Python introduction - session 2
## Loops and control statements

## Loops

Computing is mostly about doing the same thing again and again in an automated fashion.

An example task that we might want to repeat is printing each character in a word or DNA sequence on a line of its own. One way to do this would be to use a series of print statements:

In [1]:
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'
print(DNAseq[0])
print(DNAseq[1])
print(DNAseq[2])
print(DNAseq[3])
print(DNAseq[4])

a
t
g
t
a


Now suppose you want to do the same for another sequence, so you copy and paste your existing code.

But that’s a bad approach for two reasons:

1. It doesn't scale: imagine wanting to print the characters in a string that is hundreds of letters long.

2. It is fragile: if we give it a longer string, it only prints part of the data, and if we give it a shorter one, it produces an error because we’re asking for characters that don't exist.
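
The second point can be reproduced with a very short string. The sketch below is only an illustration: `short_seq` is a made-up three-base sequence, and `try`/`except` (not covered in this session) is used solely so that the error message is printed instead of stopping the program.

```python
short_seq = 'atg'   # hypothetical sequence with only three characters
failed = False

try:
    print(short_seq[4])   # index 4 does not exist in a three-character string
except IndexError as error:
    failed = True
    print('Indexing failed:', error)
```
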


In [2]:
DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'
print(DNAseq2[0])
print(DNAseq2[1])
print(DNAseq2[2])
print(DNAseq2[3])
print(DNAseq2[4])

c
c
g
t
a


A better approach is to use **for loops** in Python.

In [3]:
for base in DNAseq:
    print(base)

a
t
g
t
a
t
a
a
c
a
t
t
g
g
c
c
a
t
a
c
c
c
c
g
t
a
t
a
c
c
c
a
t
g
c
g
a
a
c
c
a
t
a
t
t
g
g
c
c
a
t
t
a
a


In [4]:
for base in DNAseq2:
    print(base)

c
c
g
t
a
t
a
c
c
c
a
t
g
c
g
a
a
c
a
t
g
g
c
g
a
a
a
g
a
a
a
g
c
t
t
t
g
c
g
a
g
c
a
c
c
t
a
a


In [30]:
x = 1
print(x**2)
x = 2
print(x**2)
x = 3
print(x**2)

1
4
9


##### In Python, a for loop is structured like this.

for variable in collection:
    do things with variable

We can call the loop variable anything we like, but there must be a **colon** at the end of the line starting the loop, and we must **indent** anything we want to run inside the loop. Unlike many other languages, there is no command to start/end a loop (e.g. do/done in **bash**); what is indented after the for statement belongs to the loop.
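
As a small illustration of the indentation rule, the sketch below uses a made-up three-letter string (`word`) and a helper variable (`seen`) that are not part of the original examples. The indented lines run once per character; the unindented `print` runs once, after the loop has finished.

```python
word = 'atg'   # hypothetical short sequence
seen = ''      # collects the characters the loop has visited

for base in word:
    # indented: part of the loop, runs once per character
    seen = seen + base
    print('inside the loop:', base)

# not indented: runs once, after the loop has ended
print('after the loop, seen =', seen)
```
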

### Exercise
Write a loop that counts the number of bases in your DNA sequence variable DNAseq.

### Solution

In [5]:
number_of_bases = 0

for base in DNAseq: ### you might want to do this as before with 'atgcg' here and DNAseq below
    number_of_bases = number_of_bases + 1
    
print('There are', number_of_bases, 'bases in our DNAsequence.')

There are 54 bases in our DNAsequence.


Now make it print the DNA sequence as well. Notice that by default Python prints a space between the arguments of the print statement; passing sep="" switches this off.

In [6]:
number_of_bases = 0


for base in DNAseq2:
    number_of_bases = number_of_bases + 1
    
print('There are', number_of_bases, 'bases', DNAseq2,'.')

print('There are ', number_of_bases, ' letters in the word ', DNAseq2,'.',sep="")


There are 48 bases ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa .
There are 48 letters in the word ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa.
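
The effect of `sep` can be compared side by side in a short sketch. The variable `name` is a placeholder introduced only for this illustration, and the f-string in the last lines is an alternative that this session does not otherwise use.

```python
number_of_bases = 48   # the value counted for DNAseq2
name = 'DNAseq2'       # hypothetical label, used only in this example

# default: one space is printed between every argument
print('There are', number_of_bases, 'bases in', name, '.')

# sep="" joins the arguments directly, so the spaces are written out by hand
print('There are ', number_of_bases, ' bases in ', name, '.', sep="")

# an f-string builds the whole sentence as one string
sentence = f'There are {number_of_bases} bases in {name}.'
print(sentence)
```
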


Note also that finding the length of a string is such a common operation that Python actually has a built-in function to do it called len:

In [7]:
print(len(DNAseq))

54


len is much faster than any function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven’t met yet, so we should always use it when we can.
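
A brief sketch of `len` applied to a few different kinds of object; the values are made up for the illustration.

```python
lengths = [
    len('atg'),                  # a string with 3 characters
    len(['atg', 'tat', 'aac']),  # a list with 3 items
    len(range(10)),              # a range of 10 numbers
]
print(lengths)
```
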

Python has a built-in function called range that produces a sequence of numbers. range accepts one to three arguments. With one argument, it starts at zero and counts up in steps of one, stopping just before that number. With two arguments, it starts at the first and stops just before the second, again in steps of one. With three arguments, it starts at the first, stops before the second, and increments by the third.

In [35]:
for i in range(3):
    print(i)

0
1
2


In [36]:
for i in range(2,5):
    print(i)

2
3
4


In [37]:
for i in range(10,30,5):
    print(i)

10
15
20
25
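
To see all the numbers of a range at once, it can be wrapped in `list`. The sketch below repeats the three calls from above and adds one assumed extra case, a negative step, which counts downwards.

```python
one_arg = list(range(3))             # [0, 1, 2]
two_args = list(range(2, 5))         # [2, 3, 4]
three_args = list(range(10, 30, 5))  # [10, 15, 20, 25]
downwards = list(range(10, 0, -3))   # [10, 7, 4, 1]

print(one_arg, two_args, three_args, downwards)
```
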


### Exercise
Write a loop that prints all the even numbers in the range between 1 and 10 (inclusive)

In [39]:
### Solution
for i in range(2,11,2):
    print(i)

2
4
6
8
10


You can make the solution more general: loop over every number in the range and print only those that leave no remainder when divided by 2.

In [40]:
start, end = 1, 10

for i in range(start, end + 1):
    if i % 2 == 0:
        print(i)

2
4
6
8
10


### Subscripting
Remember: you can use indexes to subset strings. We used this in the initial examples above.

In [8]:
DNAseq[0]

'a'

In [10]:
DNAseq[0:3]

'atg'
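
A few more forms of subscripting, sketched on a made-up six-base string (`seq`). The general pattern is `string[start:stop:step]`, and any of the three parts can be left out.

```python
seq = 'atgtat'   # hypothetical short sequence

last_base = seq[-1]      # negative indexes count from the end: 't'
second_codon = seq[3:]   # from index 3 to the end: 'tat'
every_third = seq[::3]   # every third character: 'at'
backwards = seq[::-1]    # a step of -1 reverses the string: 'tatgta'

print(last_base, second_codon, every_third, backwards)
```
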

You can use subsetting to print the first base of every codon.

In [11]:
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i])

a
t
a
a
g
c
a
c
t
a
c
g
a
c
a
g
c
t


### Exercise
Write a loop that prints all codons (non-overlapping 3-mers) of DNAseq

In [12]:
### Solution
### stop at len(DNAseq) + 1 so that the final codon is included
for i in range(3, len(DNAseq) + 1, 3):
    print(DNAseq[i-3:i])

atg
tat
aac
att
ggc
cat
acc
ccg
tat
acc
cat
gcg
aac
cat
att
ggc
cat
taa


## Control flow - more ways to affect the order in which statements run

So far we've seen "for" loops in both shell programs and a few of our Python examples.

There are a few other "control flow" statements that affect the order in which statements run.

We'll start first with the "if statement" ... and its variants.

In [41]:
x = 5

print("step 1")
if x < 10:
    print("x is small ... just", x)

x = 11
print("step 2")
if x < 10:
    print("x is small ... just", x)
if x >= 10:
    print("x is 10 or bigger. It's", x)


step 1
x is small ... just 5
step 2
x is 10 or bigger. It's 11


In [42]:
if x < 10:
    print("x is small ... just", x)
else:
    print("x is 10 or bigger. It's", x)

x is 10 or bigger. It's 11


An "if statement" can be quite complex.

Note the use of "`elif`" below, which means "else if"

In [43]:
if x < 10:
    print("x is small ... just", x)
elif x == 10:
    print("medium x. It's", x)
elif x == 11:
    print("medium x. It's", x)
elif x == 12:
    print("medium x. It's", x)
else:
    print("x bigger than 12. It's", x)

medium x. It's 11


Multiple conditions can be tested in the same `if`-statement. You can use "`and`" if both predicates need to be true, or "`or`" if only one of them needs to be true.

In [44]:
if x < 10:
    print("x is small ... just", x)
elif x >= 10 and x <= 12:
    print("medium x. It's", x)
else:
    print("x bigger than 12. It's", x)

medium x. It's 11
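
The two keywords can be compared directly by storing the results of the tests. This is a small sketch; the variable names are chosen for the illustration, and the chained comparison `10 <= x <= 12` is an equivalent Python shorthand for the `and` version.

```python
x = 11

in_range_and = x >= 10 and x <= 12   # both comparisons must be true
in_range_chained = 10 <= x <= 12     # same test written as a chained comparison
outside_range = x < 10 or x > 12     # one true comparison would be enough

print(in_range_and, in_range_chained, outside_range)
```
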


Statements can be nested in other complex statements. So we can put an `if`-statement inside a `for`-statement, as follows.

In [45]:
for i in range(8, 15):
    if i < 10:
        print(i, "is smaller.")
    elif i >= 10 and i <= 12:
        print(i, "is medium size :)")
    else:
        print("i is", i)

8 is smaller.
9 is smaller.
10 is medium size :)
11 is medium size :)
12 is medium size :)
i is 13
i is 14


Then there is the `while`-loop ... also known as the wheel of death.

In [None]:
x = 1
while x <= 50:
    print("x =", x)
    x *= 2
print("Done.")

Be careful with `while`-statements. 

*What would happen if you left the line that says* `x *= 2` *out of the program fragment above?* 

*What would happen if the first line said `x = 0` (you can try this out yourself)?*
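
One defensive pattern, sketched here under the assumption that a fixed upper limit on the number of iterations is acceptable, is to count the steps and stop when the limit is reached. The names `steps` and `max_steps` are introduced only for this example.

```python
x = 1
steps = 0
max_steps = 20   # hypothetical safety limit

# the loop ends when x exceeds 50 OR when the step limit is reached
while x <= 50 and steps < max_steps:
    x *= 2
    steps += 1

print('x =', x, 'after', steps, 'steps')
```

With `x = 1` the loop ends on its own after six doublings; the step limit only matters when the condition `x <= 50` never becomes false.
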

We can also use loops to count how often each base occurs in our DNAseq.

In [14]:
a_count = 0
t_count = 0
c_count = 0
g_count = 0
for base in DNAseq:
    if base == 'a':
        a_count += 1
    elif base == 't':
        t_count += 1
    elif base == 'c':
        c_count += 1
    elif base == "g":
        g_count += 1
    else:
        print(base, 'is not a regular base [a, t, c, g]')
print('We have the following base count.\n','a:', a_count, 't:', t_count, 'c:', c_count, 'g:', g_count, sep="")

We have the following base count.
a:17t:14c:15g:8


### Exercise
What's the count of each regular base [a, t, c, g] in DNAseq2?

In [15]:
a_count = 0
t_count = 0
c_count = 0
g_count = 0
for base in DNAseq2:
    if base == 'a':
        a_count += 1
    elif base == 't':
        t_count += 1
    elif base == 'c':
        c_count += 1
    elif base == "g":
        g_count += 1
    else:
        print(base, 'is not a regular base [a, t, c, g]')
print('We have the following base count.\n','a:', a_count, 't:', t_count, 'c:', c_count, 'g:', g_count, sep="")

We have the following base count.
a:16t:8c:13g:11
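
As with `len`, counting characters is common enough that strings have a built-in method for it, `.count()`. The sketch below is an alternative to the if/elif loop, not the approach used in the cells above; unlike that loop, it does not report unexpected characters.

```python
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

# loop over the four regular bases and ask the string how often each occurs
for base in 'atcg':
    print(base, DNAseq.count(base))

a_count = DNAseq.count('a')
```
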


## Lists, sets, and dictionaries

Python has a set of standard 'containers' in which you can store information.


### Lists

Lists are ordered collections of items, e.g. the codons within a DNA sequence could be stored in a list of 3-mers.

In [17]:
DNAseq

'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [18]:
### initially you can copy and paste the codons over and make a list
DNAseq_codons = ['atg', 'tat', 'aac' ]
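
A minimal sketch of the basic list operations, using a separate made-up list (`codons`) so that `DNAseq_codons` is left untouched.

```python
codons = ['atg', 'tat', 'aac']   # hypothetical list of three codons

first = codons[0]        # items are read by index, starting at 0
codons.append('att')     # append adds an item at the end
codons[1] = 'tac'        # an item can be replaced in place

print(first, len(codons), codons)
```
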

In [19]:
### we can also make a list more easily by looping over the string (copy code from above)

In [20]:
### stop at len(DNAseq) + 1 so that the final codon is included
for i in range(3, len(DNAseq) + 1, 3):
    print(DNAseq[i-3:i])

atg
tat
aac
att
ggc
cat
acc
ccg
tat
acc
cat
gcg
aac
cat
att
ggc
cat
taa


In [22]:
DNAseq_codons = []
### stop at len(DNAseq) + 1 so that the final codon is included
for i in range(3, len(DNAseq) + 1, 3):
    DNAseq_codons.append(DNAseq[i-3:i])
    print(DNAseq[i-3:i])
print(DNAseq_codons)

atg
tat
aac
att
ggc
cat
acc
ccg
tat
acc
cat
gcg
aac
cat
att
ggc
cat
taa
['atg', 'tat', 'aac', 'att', 'ggc', 'cat', 'acc', 'ccg', 'tat', 'acc', 'cat', 'gcg', 'aac', 'cat', 'att', 'ggc', 'cat', 'taa']


### Lists are
* subscriptable
* mutable

In [23]:
DNAseq_codons[4]

'ggc'

In [24]:
DNAseq_codons[4] = 'aat'

In [25]:
DNAseq_codons

['atg',
 'tat',
 'aac',
 'att',
 'aat',
 'cat',
 'acc',
 'ccg',
 'tat',
 'acc',
 'cat',
 'gcg',
 'aac',
 'cat',
 'att',
 'ggc',
 'cat',
 'taa']

### Sets

Sets are unordered collections of items in which each item occurs only once.

In [26]:
### we can use sets to get all the distinct codons in a gene

In [27]:
DNAseq_codons

['atg',
 'tat',
 'aac',
 'att',
 'aat',
 'cat',
 'acc',
 'ccg',
 'tat',
 'acc',
 'cat',
 'gcg',
 'aac',
 'cat',
 'att',
 'ggc',
 'cat',
 'taa']

In [28]:
set(DNAseq_codons)

{'aac', 'aat', 'acc', 'atg', 'att', 'cat', 'ccg', 'gcg', 'ggc', 'taa', 'tat'}

In [30]:
set(DNAseq_codons)[0]

TypeError: 'set' object is not subscriptable

### Sets are
* not subscriptable
* mutable: items can be added and removed, but there is no index with which to replace one in place
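
Although a set cannot be indexed, it can be measured with `len` and searched with `in`. A small sketch with a made-up codon list (`codons`) that contains repeats:

```python
codons = ['atg', 'tat', 'aac', 'tat', 'atg']   # hypothetical list with repeated codons

unique_codons = set(codons)   # duplicates are dropped

print(len(codons), 'codons,', len(unique_codons), 'distinct')
print('tat' in unique_codons)   # membership test
print('ggg' in unique_codons)
```
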

### Dictionaries

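Dictionaries are containers of key:value pairs in which each key can occur only once. Values are looked up by key rather than by position (since Python 3.7 a dictionary also remembers the order in which its keys were inserted).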

In [32]:
### we can use a dictionary to reverse complement a DNA sequence

In [33]:
base_pair_dict = {'a' : 't', 't' : 'a', 'g' : 'c', 'c' : 'g'}

In [34]:
base_pair_dict

{'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}

In [35]:
base_pair_dict['g']

'c'
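
Asking a dictionary for a key it does not contain raises a `KeyError`. The sketch below shows two ways to guard against that, the `in` test and the `.get()` method with a fallback value. The unknown base `'n'` and the fallback `'?'` are assumptions made for the illustration.

```python
base_pair_dict = {'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}

has_n = 'n' in base_pair_dict                 # False: 'n' is not a key
pair_of_n = base_pair_dict.get('n', '?')      # fallback '?' instead of a KeyError
pair_of_g = base_pair_dict.get('g', '?')      # existing key: returns 'c'

print(has_n, pair_of_n, pair_of_g)
```
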

In [36]:
###so now reverse complement our DNA sequence
reverse_comp_DNAseq = ''
for base in DNAseq[::-1]:
    paired_base = base_pair_dict[base]
    reverse_comp_DNAseq = reverse_comp_DNAseq + paired_base
print(reverse_comp_DNAseq, 'is the reverse complement of', DNAseq)

ttaatggccaatatggttcgcatgggtatacggggtatggccaatgttatacat is the reverse complement of atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa
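
The same result can be built by collecting the paired bases in a list and gluing them together at the end with the string method `.join()`. This is an alternative sketch, not the approach of the cell above; the list name `paired_bases` is introduced only here.

```python
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'
base_pair_dict = {'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}

paired_bases = []
for base in DNAseq[::-1]:                      # walk through the sequence backwards
    paired_bases.append(base_pair_dict[base])  # look up the complementary base

reverse_comp_DNAseq = ''.join(paired_bases)    # join the list into one string
print(reverse_comp_DNAseq)
```
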


### Exercise: decode the hidden message in the DNA sequences

Use the coding table dictionary to decode the hidden message!  
Hint: you can convert a string to upper case with the built-in string method .upper()


In [40]:
'agt'.upper()

'AGT'

In [37]:
coding_table_dict = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 

In [43]:
### hint: you can uppercase a string with the .upper() method
hidden_message = ''
### stop at len(DNAseq) + 1 so that the final codon (the stop codon, '_') is translated too
for i in range(3, len(DNAseq) + 1, 3):
    codon = DNAseq[i-3: i]
    amino_acid = coding_table_dict[codon.upper()]
    hidden_message = hidden_message + amino_acid

print("This is the hidden message in", DNAseq, ':\n',hidden_message )

This is the hidden message in atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa :
 MYNIGHTPYTHANHIGH_


In [44]:
### hint: you can uppercase a string with the .upper() method
hidden_message = ''
### stop at len(DNAseq2) + 1 so that the final codon (the stop codon, '_') is translated too
for i in range(3, len(DNAseq2) + 1, 3):
    codon = DNAseq2[i-3: i]
    amino_acid = coding_table_dict[codon.upper()]
    hidden_message = hidden_message + amino_acid

print("This is the hidden message in", DNAseq2, ':\n',hidden_message )

This is the hidden message in ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa :
 PYTHANMAKESFAST_
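
Both sequences can also be decoded in one go by looping over a list of sequences, with an `if` that leaves stop codons (written as '_' in the coding table) out of the message. This is only a sketch: `codon_excerpt` is a shortened copy of `coding_table_dict` that holds just the codons occurring in the two sequences, and `messages` is a name introduced for the example.

```python
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'
DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'

# excerpt of coding_table_dict: only the codons found in the two sequences
codon_excerpt = {
    'ATG': 'M', 'TAT': 'Y', 'AAC': 'N', 'ATT': 'I', 'GGC': 'G',
    'CAT': 'H', 'ACC': 'T', 'CCG': 'P', 'GCG': 'A', 'AAA': 'K',
    'GAA': 'E', 'AGC': 'S', 'TTT': 'F', 'TAA': '_',
}

messages = []
for seq in [DNAseq, DNAseq2]:
    message = ''
    for i in range(0, len(seq), 3):
        amino_acid = codon_excerpt[seq[i:i + 3].upper()]
        if amino_acid != '_':   # skip stop codons
            message = message + amino_acid
    messages.append(message)

print(messages)
```
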
