# Python introduction - session 2
## Loops and control statements & data structures

If you get stuck in an endless loop hit the **"STOP" button (black square)** above or our good friend from bash, **ctrl+c**  
You know you are stuck in a loop if you see **In \[\*\]:** forever

## Loops

Computing is mostly about doing the same thing again and again in an automated fashion.

An example task that we might want to repeat is printing each character in a word or DNA sequence on a line of its own. One way to do this would be to use a series of print statements:

In [45]:
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'
print(DNAseq[0])
print(DNAseq[1])
print(DNAseq[2])
print(DNAseq[3])
print(DNAseq[4])

a
t
g
t
a


Now you might want to repeat this for another word, so you copy & paste your existing code.

But that’s a bad approach for two reasons:

1. It doesn't scale: imagine wanting to print the characters in a string that is hundreds of letters long.

2. It is fragile: if we give it a longer string, it only prints part of the data, and if we give it a shorter one, it produces an error because we’re asking for characters that don't exist.


In [46]:
DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'
print(DNAseq2[0])
print(DNAseq2[1])
print(DNAseq2[2])
print(DNAseq2[3])
print(DNAseq2[4])

c
c
g
t
a


In [47]:
DNAseq3 = 'ccgt'
print(DNAseq3[0])
print(DNAseq3[1])
print(DNAseq3[2])
print(DNAseq3[3])
print(DNAseq3[4])

c
c
g
t


IndexError: string index out of range

A better approach is to use **for loops** in Python.

In [75]:
for base in DNAseq:
    print(base)

a
t
g
t
a
t
a
a
c
a
t
t
g
g
c
c
a
t
a
c
c
c
c
g
t
a
t
a
c
c
c
a
t
g
c
g
a
a
c
c
a
t
a
t
t
g
g
c
c
a
t
t
a
a


In [76]:
for base in DNAseq2:
    print(base)

c
c
g
t
a
t
a
c
c
c
a
t
g
c
g
a
a
c
a
t
g
g
c
g
a
a
a
g
a
a
a
g
c
t
t
t
g
c
g
a
g
c
a
c
c
t
a
a


In [77]:
for base in DNAseq3:
    print(base)

c
c
g
t


##### In Python, a for loop is structured like this.

We can call the loop variable anything we like, but there must be a **colon** at the end of the line starting the loop, and we must **indent** anything we want to run inside the loop. Unlike many other languages, there is no command to start/end a loop (e.g. do/done in **bash**); what is indented after the for statement belongs to the loop.

**Indentation errors are THE MOST COMMON MISTAKE made when writing loops**  
In this course I will exclusively use the "tab" key to make my indents.  
I find it best to build my loop slowly and use the print() function a lot, to show myself that it is working correctly.  
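
As a minimal sketch of that structure (the names `collection`, `item`, and `seen` are placeholders chosen for this illustration, not part of the course material):

```python
# General shape of a for loop:
#
#   for <loop variable> in <collection>:    <- the line ends with a colon
#       <indented statements>               <- run once per item
#   <unindented statements>                 <- run once, after the loop
#
collection = 'atg'   # placeholder data; any string works here
seen = ''            # collects the items so we can inspect them afterwards

for item in collection:       # 'item' takes each character in turn
    print('inside:', item)    # indented, so it runs three times
    seen = seen + item

print('after the loop')       # not indented, so it runs once
```

Removing the colon or the indentation from this sketch is a quick way to see what the resulting error messages look like.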

### Exercise
Write a loop that counts the number of bases in your DNA sequence (variable DNAseq).

In [78]:
## start a counter at 0
number_of_bases = 0

# Here goes a loop that iterates over DNAseq and counts the bases
for base in DNAseq: ### you might want to do this as before with 'atgcg' here and DNAseq below
    number_of_bases = number_of_bases + 1



print('There are', number_of_bases, 'bases in our DNAsequence.')

There are 54 bases in our DNAsequence.


### Solution

In [79]:
number_of_bases = 0

for base in DNAseq: ### you might want to do this as before with 'atgcg' here and DNAseq below
    number_of_bases = number_of_bases + 1
    
print('There are', number_of_bases, 'bases in our DNAsequence.')

There are 54 bases in our DNAsequence.


Now make your final print statement print the DNA sequence as well. Notice that Python will print spaces between the variables in the print statement by default.

In [80]:
number_of_bases = 0

for base in DNAseq:
    number_of_bases = number_of_bases + 1
    
print('There are', number_of_bases, 'bases in the sequence', DNAseq,'.')

print('There are ', number_of_bases, ' letters in the sequence ', DNAseq, '.', sep="")


There are 54 bases in the sequence atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa .
There are 54 letters in the sequence atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa.


Note also that finding the length of a string is such a common operation that Python actually has a built-in function to do it called len:

In [81]:
print(len(DNAseq))

54


len is much faster than any function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven’t met yet, so we should always use it when we can.
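
For instance, `len` also works on lists and `range` objects, both of which appear later in this session. A small sketch, where the `codons` list is a made-up example value:

```python
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'
codons = ['atg', 'tat', 'aac']   # a short example list

n_bases = len(DNAseq)              # number of characters in the string
n_codons = len(codons)             # number of items in the list
n_steps = len(range(0, 54, 3))     # number of values the range would produce

print(n_bases, n_codons, n_steps)
```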

### How do we iterate over numbers?
Python has a built-in function called range that produces a sequence of numbers. Range can accept 1-3 parameters. 
- If one parameter is input, range produces that many numbers, starting at zero and incrementing by 1. 
- If 2 parameters are input, range starts at the first and stops just before the second, incrementing by one. 
- If range is passed 3 parameters, it starts at the first one, stops before reaching the second one, and increments by the third one.
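
A quick way to check what a given call to `range` produces is to wrap it in `list()`. The sketch below does that for each of the three forms (the variable names are only for illustration); note that the stopping value itself is never included.

```python
# range() produces its numbers on demand;
# list() collects them so we can see every value at once.
one_arg = list(range(3))              # start defaults to 0, step to 1
two_args = list(range(2, 5))          # start at 2, stop before 5
three_args = list(range(10, 30, 5))   # start at 10, stop before 30, step 5

print(one_arg)
print(two_args)
print(three_args)
```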

In [82]:
for i in range(3):
    print(i)

0
1
2


In [83]:
for i in range(2,5):
    print(i)

2
3
4


In [84]:
for i in range(10,30,5):
    print(i)

10
15
20
25


### Exercise
Write a loop that prints all the even numbers in the range between 1 and 10 (inclusive)

In [85]:
### Solution
for i in range(2,11,2):
    print(i)

2
4
6
8
10


You can make it more generic and use what we learnt before.

In [86]:
start, end = 1, 10

for i in range(start, end + 1):
    if i % 2 == 0:
        print(i)

2
4
6
8
10


### Subscripting
Remember: you can use indexes to subset strings. We used this in the initial examples above.

In [87]:
DNAseq

'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [88]:
DNAseq[0]

'a'

In [89]:
DNAseq[0:3]

'atg'

You can use subsetting to print out all the first bases of all codons.

In [90]:
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i])

a
t
a
a
g
c
a
c
t
a
c
g
a
c
a
g
c
t


### Exercise
Write a loop that prints all the codons (non-overlapping 3-mers) of our variable DNAseq

In [91]:
## Hint: use string slicing. DNAseq[i] gives one letter; how could you slice more letters?

In [92]:
### Solution
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i:i+3])

atg
tat
aac
att
ggc
cat
acc
ccg
tat
acc
cat
gcg
aac
cat
att
ggc
cat
taa


In [93]:
for i in range(0, len(DNAseq), 3):
    print(i,DNAseq[i:i+3])

0 atg
3 tat
6 aac
9 att
12 ggc
15 cat
18 acc
21 ccg
24 tat
27 acc
30 cat
33 gcg
36 aac
39 cat
42 att
45 ggc
48 cat
51 taa


## Control flow - more ways to affect the order in which statements run

So far we've seen "for" loops in both shell programs and a few of our Python examples.

There are a few other "control flow" statements that affect the order in which statements run.

We'll start first with the "if statement" ... and its variants.

In [94]:
## Using if
x = 5

print("step 1")
if x < 10:
    print("x is small ... just", x)

## Using multiple ifs
x = 11
print("step 2")
if x < 10:
    print("x is small ... just", x)
if x >= 10:
    print("x is 10 or bigger. It's", x)


step 1
x is small ... just 5
step 2
x is 10 or bigger. It's 11


In [95]:
## if else
x = 11
if x < 10:
    print("x is small ... just", x)
else:
    print("x is 10 or bigger. It's", x)

x is 10 or bigger. It's 11


An "if statement" can be quite complex.

Note the use of "`elif`" below, which means "else if"

In [96]:
#elif
x = 11

if x < 10:
    print("x is small ... just", x)
elif x == 10:
    print("medium x. It's", x)
elif x == 11:
    print("medium x. It's", x)
elif x == 12:
    print("medium x. It's", x)
else:
    print("x bigger than 12. It's", x)

medium x. It's 11


Multiple conditions can be tested in the same `if`-statement. You can use "`and`" if both predicates need to be true, or "`or`" if only one of them needs to be true.

In [97]:
x=13

if x < 10:
    print("x is small ... just", x)
elif x >= 10 and x <= 12:
    print("medium x. It's", x)
else:
    print("x bigger than 12. It's", x)


x bigger than 12. It's 13
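
Python also lets you chain comparisons, so a test such as `x >= 10 and x <= 12` can be written as `10 <= x <= 12`. A minimal sketch, where the test values and the `labels` list are made up for this illustration:

```python
labels = []   # remembers the label chosen for each value

for x in [5, 10, 12, 13]:
    if x < 10:
        label = 'small'
    elif 10 <= x <= 12:          # same meaning as: x >= 10 and x <= 12
        label = 'medium'
    else:
        label = 'bigger than 12'
    labels.append(label)
    print(x, 'is', label)
```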


In [98]:
## multiple conditions with OR
x=30

if x <10:
    print("x is small ... just", x)
elif x == 10 or x%10 == 0:
    print("x appears to be a multiple of 10")
else:
    print("x is bigger than 10 and not a multiple of 10. It's", x)

x appears to be a multiple of 10


Statements can be nested in other complex statements. So we can put an `if`-statement inside a `for`-statement, as follows.

In [99]:
for i in range(8, 15):
    if i < 10:
        print(i, "is smaller.")
    elif i >= 10 and i <= 12:
        print(i, "is medium size :)")
    else:
        print("i is", i)

8 is smaller.
9 is smaller.
10 is medium size :)
11 is medium size :)
12 is medium size :)
i is 13
i is 14


We can also use loops to count how often the base 'a' occurs in our variable DNAseq.

In [100]:
a_count = 0

for base in DNAseq:
    ##start conditional statement
    if base == 'a':
        ## use the Python shortcut += to add to a variable and reassign it in one step
        a_count += 1
        
print('We have the following base counts:')
print('a:', a_count)    

We have the following base counts:
a: 17


### Exercise
What are the counts of each regular base [a, t, c, g] in DNAseq?

In [101]:
## extra cell to play with 

In [102]:
a_count = 0
t_count = 0
c_count = 0
g_count = 0

for base in DNAseq:
    ###add your loop here to count the number of a's, t's, c's and g's
    if base == 'a':
        a_count += 1
    elif base == 't':
        t_count += 1
    elif base == 'c':
        c_count += 1
    elif base == "g":
        g_count += 1
    else:
        print(base, 'is not a regular base [a, t, c, g]')
        
print('We have the following base counts:')
print('a:', a_count)
print('t:', t_count)
print('c:', c_count)
print('g:', g_count)   

We have the following base counts:
a: 17
t: 14
c: 15
g: 8


### Now that we have written this really awesome loop...we're going to show you how Python could have done this for you...faster

#### Object Methods (a.k.a. functions)

You can also perform certain operations on objects. These are called methods (functions that belong to an object). You call a method in a similar way to accessing an attribute, by typing **objectname.method( )**  
(NOTE: We can tell this is a method because of the **( )**!!)

We can use the built-in function **help( )** to learn more about a method, BUT if you want help you need to leave the parentheses off the method: **help(objectname.method)**

In [103]:
help(DNAseq.count)

Help on built-in function count:

count(...) method of builtins.str instance
    S.count(sub[, start[, end]]) -> int
    
    Return the number of non-overlapping occurrences of substring sub in
    string S[start:end].  Optional arguments start and end are
    interpreted as in slice notation.



In [104]:
help(DNAseq.count())

TypeError: count() takes at least 1 argument (0 given)

In [105]:
DNAseq.count("a")

17

In [106]:
DNAseq.count("b")

0

In [107]:
DNAseq.isalpha()

True
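
Combining the `count` method with a `for` loop reproduces the four base counts from the earlier exercise in a few lines. This is only a sketch of one way to do it; the `total` variable is an extra check added for this example.

```python
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

total = 0
print('We have the following base counts:')
for base in 'atcg':
    n = DNAseq.count(base)   # how often this base occurs in DNAseq
    total += n
    print(base + ':', n)

# If only a, t, c and g occur, the four counts add up to the length
print('total:', total, 'of', len(DNAseq))
```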

## Data structures

### Lists, sets, and dictionaries

Python has a set of standard 'containers' where you can store information. Each data structure has its purpose and advantages.


### Lists

Lists are ordered sequences of elements. Each element or value that is inside of a list is called an item. Just as strings are defined as characters between quotes, lists are defined by having values between square brackets []. 

For example, codons within a DNA sequence could be stored in a list of 3-mers.

In [108]:
DNAseq

'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [109]:
### initially you can copy and paste the codons over and make a list
DNAseq_codons = ['atg', 'tat', 'aac' ]

We can also make a list more easily by looping over the string.

In [110]:
DNAseq_codons = []

## Use a for loop on a range object to extract codons
for i in range(0, len(DNAseq), 3):
    DNAseq_codons.append(DNAseq[i:i+3])
    print(DNAseq[i:i+3])
print(DNAseq_codons)

atg
tat
aac
att
ggc
cat
acc
ccg
tat
acc
cat
gcg
aac
cat
att
ggc
cat
taa
['atg', 'tat', 'aac', 'att', 'ggc', 'cat', 'acc', 'ccg', 'tat', 'acc', 'cat', 'gcg', 'aac', 'cat', 'att', 'ggc', 'cat', 'taa']


### Lists are
* subscriptable  (i.e. you can access the elements with [ ])
* mutable (i.e. they can be altered without making a new object)
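
Strings, by contrast, are subscriptable but not mutable. The sketch below uses a short made-up list and string to show item assignment working on a list and raising a `TypeError` on a string. It relies on `try`/`except`, which has not been covered yet, purely so that the example keeps running after the error.

```python
codons = ['atg', 'tat', 'aac']
codons[1] = 'ggc'            # lists can be changed in place
print(codons)

seq = 'atgtat'
try:
    seq[0] = 'c'             # strings cannot be changed in place
except TypeError as err:
    print('TypeError:', err)

# To "change" a string, build a new one instead
new_seq = 'c' + seq[1:]
print(new_seq)
```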

In [111]:
## subscript or slice my list
DNAseq_codons[4]

'ggc'

In [112]:
## mutate/alter my list
DNAseq_codons[4] = 'aat'

In [113]:
## look at the changes
DNAseq_codons

['atg',
 'tat',
 'aac',
 'att',
 'aat',
 'cat',
 'acc',
 'ccg',
 'tat',
 'acc',
 'cat',
 'gcg',
 'aac',
 'cat',
 'att',
 'ggc',
 'cat',
 'taa']

### Common Methods for Lists

In [114]:
DNAseq_codons.count("acc")

2

In [115]:
DNAseq_codons.reverse()

In [116]:
DNAseq_codons

['taa',
 'cat',
 'ggc',
 'att',
 'cat',
 'aac',
 'gcg',
 'cat',
 'acc',
 'tat',
 'ccg',
 'acc',
 'cat',
 'aat',
 'att',
 'aac',
 'tat',
 'atg']

In [117]:
DNAseq_codons.reverse()

In [118]:
DNAseq_codons

['atg',
 'tat',
 'aac',
 'att',
 'aat',
 'cat',
 'acc',
 'ccg',
 'tat',
 'acc',
 'cat',
 'gcg',
 'aac',
 'cat',
 'att',
 'ggc',
 'cat',
 'taa']

### Sets

Sets are unordered collections of items in which each item occurs only once, so a set has no duplicated elements.

In [119]:
print(DNAseq_codons)
type(DNAseq_codons)

['atg', 'tat', 'aac', 'att', 'aat', 'cat', 'acc', 'ccg', 'tat', 'acc', 'cat', 'gcg', 'aac', 'cat', 'att', 'ggc', 'cat', 'taa']


list

In [120]:
## Create a set with set()
set(DNAseq_codons)

{'aac', 'aat', 'acc', 'atg', 'att', 'cat', 'ccg', 'gcg', 'ggc', 'taa', 'tat'}

In [121]:
## Sets have no order so they are NOT indexable/sliceable
set(DNAseq_codons)[0]

TypeError: 'set' object is not subscriptable

### Sets are
* not subscriptable (i.e. you cannot do my_set[0], because sets have no order)
* mutable (there exists a special type of set called a "frozenset" which is immutable)

Because sets cannot contain the same element more than once, they are an efficient way to remove duplicate values from a list and to perform common math operations like unions and intersections.
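
As a rough sketch of both uses, the example below removes duplicates from a short made-up list and then applies the operator forms of intersection, union, and difference (`&`, `|`, `-`). `sorted()` is used only to give a predictable printing order.

```python
codons = ['atg', 'tat', 'aac', 'tat', 'atg']   # made-up list with repeats

unique_codons = set(codons)          # duplicates collapse to one entry
print(len(codons), 'codons,', len(unique_codons), 'unique')
print(sorted(unique_codons))         # sorted() returns an ordered list

other = {'aac', 'gtt'}
print(sorted(unique_codons & other))   # intersection: in both sets
print(sorted(unique_codons | other))   # union: in either set
print(sorted(unique_codons - other))   # difference: only in the first set
```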

### Common Methods for Sets

In [122]:
## create a set object
codon_set = set(DNAseq_codons)

In [123]:
## How many things are in my set object
len(codon_set)

11

In [124]:
## Look at set object
codon_set

{'aac', 'aat', 'acc', 'atg', 'att', 'cat', 'ccg', 'gcg', 'ggc', 'taa', 'tat'}

In [125]:
## Add a new value to my set
codon_set.add('ctt')

In [126]:
## Look at len(set) did I successfully add a value?
len(codon_set)

12

In [127]:
## Create a new set with two objects that are in set 1 and one new object
codon_set2 = set(["aac","att","gtt"])

In [128]:
codon_set2.intersection(codon_set)

{'aac', 'att'}

In [129]:
codon_set.union(codon_set2)

{'aac',
 'aat',
 'acc',
 'atg',
 'att',
 'cat',
 'ccg',
 'ctt',
 'gcg',
 'ggc',
 'gtt',
 'taa',
 'tat'}

In [130]:
codon_set

{'aac',
 'aat',
 'acc',
 'atg',
 'att',
 'cat',
 'ccg',
 'ctt',
 'gcg',
 'ggc',
 'taa',
 'tat'}

### Dictionaries

Dictionaries are containers of key:value pairs where each key can only occur once. Items are looked up by key rather than by position (since Python 3.7 a dictionary does remember the order in which keys were added).

They are one of the most useful Python data structures. Looking up elements in a dictionary is really fast and there are lots of built-in functions to use and manipulate dictionaries.

We can use a dictionary to reverse complement a DNA sequence.

In [131]:
## create a dictionary in two ways
## dict_name = {'key':value}
## dict_name = dict(key1= value1,key2 = value2, key3 = value3)

age_dict = dict(alex = 5, 
                tim = 7, 
                alexa = 5, 
                sandy = 8)



In [132]:
## What is the value of the key "tim"
age_dict["tim"]

7

In [133]:
age_dict["jessica"]=6

In [134]:
age_dict

{'alex': 5, 'tim': 7, 'alexa': 5, 'sandy': 8, 'jessica': 6}

In [135]:
base_pair_dict = {'a' : 't', 
                  't' : 'a', 
                  'g' : 'c', 
                  'c' : 'g'}

In [136]:
# look at all values of dictionary base_pair_dict
base_pair_dict

{'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}

In [137]:
# look at the value of the key "g"
base_pair_dict['g']

'c'

In [138]:
###so now reverse complement our DNA sequence

reverse_comp_DNAseq = ''

for base in DNAseq[::-1]:
    paired_base = base_pair_dict[base]
    reverse_comp_DNAseq += paired_base
    
print(reverse_comp_DNAseq, 'is the reverse complement of', DNAseq)

ttaatggccaatatggttcgcatgggtatacggggtatggccaatgttatacat is the reverse complement of atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa
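
The same result can be reached without an explicit loop. The sketch below swaps in a different technique from the cell above: the string methods `str.maketrans` and `translate`, which replace every base in one call. The variable names are chosen for this example only.

```python
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

# Map each base to its partner: a<->t, g<->c
complement_table = str.maketrans('atgc', 'tacg')

# Reverse with a slice, then swap every base in one call
reverse_comp = DNAseq[::-1].translate(complement_table)
print(reverse_comp)

# Reverse-complementing twice should give back the original sequence
round_trip = reverse_comp[::-1].translate(complement_table)
print(round_trip == DNAseq)
```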


### Dictionaries are
* not indexable by position (i.e. base_pair_dict[0] raises a KeyError unless 0 is a key); values are looked up by key, as in base_pair_dict['g']
* mutable (you can alter the value of a key after it has been created, or add new keys)

Dictionaries can only have a given key once BUT the values of those keys can be the same
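
A short sketch of what that means in practice, using made-up names and ages: assigning to an existing key replaces its value, while assigning to a new key adds a pair.

```python
age_dict = {'alex': 5, 'tim': 7, 'alexa': 5}   # alex and alexa share a value

age_dict['tim'] = 8       # existing key: the old value is replaced
age_dict['sandy'] = 8     # new key: a new pair is added

print(age_dict)
print(len(age_dict))      # four keys, but only two distinct values (5 and 8)
```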

### Common Methods for Dictionaries

In [139]:
base_pair_dict.get("a")

't'
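
One reason to use `.get()` rather than square brackets is that it does not raise an error when a key is missing: it returns `None`, or a default that you supply as a second argument. A small sketch, where the key `'x'` and the `'n'` default are choices made for this example:

```python
base_pair_dict = {'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}

known = base_pair_dict.get('a')          # key exists
missing = base_pair_dict.get('x')        # key missing: returns None
fallback = base_pair_dict.get('x', 'n')  # key missing: returns the default

print(known, missing, fallback)

# Square brackets raise KeyError for a missing key
try:
    base_pair_dict['x']
except KeyError as err:
    print('KeyError:', err)
```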

In [140]:
base_pair_dict.keys()

dict_keys(['a', 't', 'g', 'c'])

In [141]:
base_pair_dict.values()

dict_values(['t', 'a', 'c', 'g'])

In [142]:
base_pair_dict["u"] = "a"

In [143]:
base_pair_dict.keys()

dict_keys(['a', 't', 'g', 'c', 'u'])

In [144]:
base_pair_dict.values()

dict_values(['t', 'a', 'c', 'g', 'a'])

### Exercise: decode the hidden message in the DNA sequences

Use the coding table dictionary to translate your DNAseq into amino acids and decode the hidden message!!!!  
Hint: you can convert a string to upper case with the method x.upper(), where x is your string variable.


In [145]:
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'
DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'

In [146]:
coding_table_dict = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 

### HINT
You need to convert our DNAseq to UPPERCASE letters
Perhaps there is a handy method to do that ???

In [147]:
"atg".upper()

'ATG'

### Solution

In [148]:
hidden_message = ''

for i in range(0, len(DNAseq), 3):
    ## slice DNAseq and keep your codon
    ## look up which AA is encoded by your codon
    ## write your AA to your hidden_message string
    codon = DNAseq[i:i+3]
    amino_acid = coding_table_dict[codon.upper()]
    hidden_message = hidden_message + amino_acid

print("This is the hidden message in", DNAseq, ':\n', hidden_message)

This is the hidden message in atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa :
 MYNIGHTPYTHANHIGH_


In [149]:
hidden_message = ''

for i in range(0, len(DNAseq2), 3):
    codon = DNAseq2[i:i+3]
    amino_acid = coding_table_dict[codon.upper()]
    hidden_message = hidden_message + amino_acid

print("This is the hidden message in", DNAseq2, ':\n',hidden_message)

This is the hidden message in ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa :
 PYTHANMAKESFAST_
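
The solution above assumes that every slice is a complete codon found in the table. As a hedged sketch of one way to cope with sequences where that is not true, the example below uses `.get()` with a `'?'` placeholder. The four-entry `small_table`, the sequence `seq`, and the `'?'` marker are all made up for this illustration.

```python
# Tiny subset of the coding table, just enough for this example
small_table = {'ATG': 'M', 'TAT': 'Y', 'AAC': 'N', 'TAA': '_'}

seq = 'atgtataactaaxx'   # made-up sequence with two stray trailing letters

protein = ''
for i in range(0, len(seq), 3):
    codon = seq[i:i+3].upper()
    # .get() returns '?' instead of raising KeyError for unknown codons
    protein += small_table.get(codon, '?')

print(protein)
```

With plain square brackets, the final slice `'XX'` would stop the loop with a `KeyError`.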
