# Python introduction - session 2
## Loops and control statements & data structures
### Loops

Computing is mostly about doing the same thing again and again in an automated fashion.

An example task that we might want to repeat is printing each character in a DNA sequence on a line of its own. 

In [3]:
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

One way to do this would be to use a series of print statements: 

In [3]:
print(DNAseq[0])
print(DNAseq[1])
print(DNAseq[2])
print(DNAseq[3])
print(DNAseq[4])

a
t
g
t
a


__Python indexing__ starts from 0, a convention shared by many programming languages, including C and Java. An index counts how far an element is from the start of the sequence, so the first element is at offset 0. This makes it easy to calculate memory addresses, which matters in low-level programming.

Do people need me to explain more about indexing? It can be difficult to understand at first, but you can also just remember that Python indexing starts from 0.
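As a small illustration of zero-based indexing, here is a sketch using a made-up four-base sequence (the name `seq` is only for this example and is not one of the session's variables):

```python
# A short made-up sequence, just for illustration
seq = 'atgc'

# The first character is at index 0, not index 1
print(seq[0])  # a

# A sequence of 4 characters has the indexes 0, 1, 2 and 3
print(seq[3])  # c

# seq[4] would raise an IndexError, because there is no fifth character
```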

Okay, let's get back to printing our bases. But I'm getting a little bit tired... let me check how many bases are left for me to print.

In [4]:
len(DNAseq)

54

The len() function returns the length of a string.

The sequence is 54 bases long, so to print all of them I would have to write the print() function 54 times.

Obviously this is not a good way to solve the problem:
* What if I need to print a sequence with 20K bases? 
* What if I have multiple sequences I need to print?

A better approach is to use **for loops** in Python.

In [5]:
# create two more variables for us to practice 

DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'
DNAseq3 = 'ccgt'

In [7]:
for base in DNAseq:
    print(base)

a
t
g
t
a
t
a
a
c
a
t
t
g
g
c
c
a
t
a
c
c
c
c
g
t
a
t
a
c
c
c
a
t
g
c
g
a
a
c
c
a
t
a
t
t
g
g
c
c
a
t
t
a
a


In [8]:
for base in DNAseq2:
    print(base)

c
c
g
t
a
t
a
c
c
c
a
t
g
c
g
a
a
c
a
t
g
g
c
g
a
a
a
g
a
a
a
g
c
t
t
t
g
c
g
a
g
c
a
c
c
t
a
a


In [9]:
for base in DNAseq3:
    print(base)

c
c
g
t


We can see that no matter how long our sequence is, we can always use two lines of code to perform the same task. 

__In Python, the syntax of a for loop is:__

In [7]:
for variable in collection:
    do things on variable

SyntaxError: invalid syntax (1047291770.py, line 2)

We can call the loop variable anything we like, but there must be a **colon** at the end of the line starting the loop, and we must **indent** anything we want to run inside the loop. 

Like the for loop we ran for printing bases:

In [None]:
for base in DNAseq3:
    print(base)

"base" is our loop variable. We can change its name to anything we want, but the code inside the loop must then use the same name. For example, we can change the loop to:

In [13]:
for i in DNAseq3:
    print(i)

c
c
g
t


Unlike many other languages, there is no command to start/end a loop (e.g. do/done in **bash**). What is indented after the for statement belongs to the loop.
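To see how indentation alone decides what belongs to the loop, here is a minimal sketch (the variable `seq` is made up for this example and has the same content as DNAseq3):

```python
seq = 'ccgt'  # made-up variable, same content as DNAseq3

for base in seq:
    print(base)        # indented: runs once for every base
print('Loop is done')  # not indented: runs only once, after the loop ends
```

If the last line were indented as well, 'Loop is done' would be printed four times, once per base.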

### Exercise

Write a loop that counts the number of bases in your DNA sequence (variable DNAseq).

In [14]:
number_of_bases = 0

# Here goes a loop that iterates over DNAseq and counts the bases



print('There are', number_of_bases, 'bases in our DNAsequence.')

There are 0 bases in our DNAsequence.


### Solution

In [15]:
number_of_bases = 0

for base in DNAseq:
    number_of_bases = number_of_bases + 1
    
print('There are', number_of_bases, 'bases in our DNAsequence.')

There are 54 bases in our DNAsequence.


Now make it print the DNA sequence too.

In [15]:
number_of_bases = 0

for base in DNAseq:
    number_of_bases = number_of_bases + 1
    
print('There are', number_of_bases, 'bases in the sequence', DNAseq, '.')

There are 54 bases in the sequence atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa .


__Notice__ that there is a space between the sequence and the dot in the above output. That's because when you pass several comma-separated values to print(), it puts a space between them by default.

To change this default, we can use the "sep" argument to specify the separator we want. Below, we specify sep="", which means no separator is used, so the values are printed right next to each other. This means you need to include any spaces you want in the strings themselves; otherwise the values will run together.

In [16]:
print('There are ', number_of_bases, ' letters in the sequence ', DNAseq, '.', sep="")

There are 54 letters in the sequence atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa.


But finding the length of a string is such a common operation that Python actually has a built-in function to do it called len:

In [21]:
len(DNAseq)

54

len() is much faster than any function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven’t met yet, so we should always use it when we can.

### How do we iterate over numbers?
Python has a built-in function called range that produces a sequence of numbers. In Python 3 it returns a range object that generates the numbers on demand rather than storing them in a list. Range accepts 1-3 parameters, and the end value is never included.
- If one parameter is given, range produces that many numbers, starting at zero and incrementing by 1.
- If 2 parameters are given, range starts at the first and stops just before the second, incrementing by 1.
- If 3 parameters are given, range starts at the first, stops before reaching the second, and increments by the third.

In [26]:
for i in range(3):
    print(i)

0
1
2


In [27]:
for i in range(2,5):
    print(i)

2
3
4


In [28]:
for i in range(10,30,5):
    print(i)

10
15
20
25
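One way to inspect the numbers a range call will produce, without writing a loop, is to wrap it in list(). This is only a sketch for checking your ranges; the name `numbers` is made up for this example:

```python
# Printing a range directly shows the range object, not its numbers
numbers = range(10, 30, 5)
print(numbers)        # range(10, 30, 5)

# list() collects all the numbers so we can look at them
print(list(numbers))  # [10, 15, 20, 25]

# The end value is never included
print(list(range(3)))     # [0, 1, 2]
print(list(range(2, 5)))  # [2, 3, 4]
```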


### Exercise
Write a loop that prints all the even numbers in the range between 1 and 10 (inclusive)

In [None]:
# write your code here:




In [17]:
# solution 

for i in range(2,11,2):
    print(i)

2
4
6
8
10


Above, we used the range() function to print only the numbers that start at 2 and end at 10 with an increment of 2, which are all the even numbers between 1 and 10 (inclusive).

We can make it more generic and use what we have learned before:

In [18]:
start, end = 1, 10

for i in range(start, end + 1):
    if i % 2 == 0:
        print(i)

2
4
6
8
10


Here, we used an if statement and the modulo operator (%) to check whether each number is even, and printed only those that are.
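The modulo operator gives the remainder of a division, so a number is even when the remainder after dividing by 2 is 0. A few quick checks, with numbers and the name `is_even` chosen only for illustration:

```python
# % returns the remainder after division
print(7 % 2)   # 1, so 7 is odd
print(8 % 2)   # 0, so 8 is even
print(10 % 3)  # 1, because 10 = 3 * 3 + 1

# The comparison itself is either True or False
is_even = (8 % 2 == 0)
print(is_even)  # True
```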

### Subscripting

In Python, subscripting refers to accessing individual elements or slices of a sequence or collection, such as strings, lists, tuples, or arrays, using square brackets [] and an index or slice notation. It allows you to extract or modify specific elements of a sequence. For example:

In [19]:
DNAseq

'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [25]:
# access the element at index 0
DNAseq[0] 

'a'

In [26]:
# access the element at index 5
DNAseq[5] 

't'

In [27]:
# access the elements from index 0 to 3 (exclusive)
DNAseq[0:3] 

'atg'

__Note__ that when you slice a sequence with square brackets, the start index is always inclusive and the end index is always exclusive.

So for `DNAseq[0:3]`, it includes index 0, 1, 2 but not 3.

In [28]:
# access the elements from the beginning to index 6 (exclusive)
DNAseq[:6]

'atgtat'

In [29]:
# access the elements from index 2 (inclusive) to the end
DNAseq[2:]

'gtataacattggccataccccgtatacccatgcgaaccatattggccattaa'
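A few more slicing cases may help here. This sketch uses a made-up variable `seq` holding the first nine bases of DNAseq. Negative indexes, which count from the end of the sequence, are not covered elsewhere in this session and are shown only as an extra:

```python
seq = 'atgtataac'  # first nine bases of DNAseq, made-up variable name

# A slice from 3 to 6 contains 6 - 3 = 3 characters
print(seq[3:6])       # tat
print(len(seq[3:6]))  # 3

# Negative indexes count from the end of the sequence
print(seq[-1])   # c
print(seq[-3:])  # aac

# A slice that runs past the end is simply cut short, with no error
print(seq[6:100])  # aac
```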

You can use subscripting to print out the first base of every codon.

In [30]:
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i])

a
t
a
a
g
c
a
c
t
a
c
g
a
c
a
g
c
t


For the above loop, consider the question:
* Which indexes are being used in the loop? Obviously it's not all the indexes in the variable DNAseq.

### Exercise

Write a loop that prints all the codons (non-overlapping 3-mers) of our variable DNAseq.

In [None]:
# write your code here:




In [31]:
# solution
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i:i+3])

atg
tat
aac
att
ggc
cat
acc
ccg
tat
acc
cat
gcg
aac
cat
att
ggc
cat
taa


In [32]:
# this one also prints the start index of the codon
for i in range(0, len(DNAseq), 3):
    print(i,DNAseq[i:i+3])

0 atg
3 tat
6 aac
9 att
12 ggc
15 cat
18 acc
21 ccg
24 tat
27 acc
30 cat
33 gcg
36 aac
39 cat
42 att
45 ggc
48 cat
51 taa


## Control flow - more ways to affect the order in which statements run

So far we've seen the "for" loop in a few of our Python examples.

There are a few other "control flow" statements that affect the order in which statements run.

We'll start first with the "if statement" ... and its variants.

In [33]:
x = 5

print("step 1")

if x < 10:
    print("x is small ... just", x)

x = 11

print("step 2")

if x < 10:
    print("x is small ... just", x)
if x >= 10:
    print("x is 10 or bigger. It's", x)

step 1
x is small ... just 5
step 2
x is 10 or bigger. It's 11


In [34]:
if x < 10:
    print("x is small ... just", x)
else:
    print("x is 10 or bigger. It's", x)

x is 10 or bigger. It's 11


An __if statement__ can be quite complex.

Note the use of `elif` below, which means "else if".


In [None]:
if x < 10:
    print("x is small ... just", x)
elif x == 10:
    print("medium x. It's", x)
elif x == 11:
    print("medium x. It's", x)
elif x == 12:
    print("medium x. It's", x)
else:
    print("x bigger than 12. It's", x)

Multiple conditions can be tested in the same `if` statement. You can use `and` if both predicates need to be true, or `or` if only one of them needs to be true.

In [None]:
if x < 10:
    print("x is small ... just", x)
elif x >= 10 and x <= 12:
    print("medium x. It's", x)
else:
    print("x bigger than 12. It's", x)
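Python also lets you chain comparisons, so the range check above can be written more compactly. The sketch below shows this alongside an `or` condition; the variable names `size`, `base` and `is_g_or_c` are made up for the example:

```python
x = 11

# 10 <= x <= 12 means the same as: x >= 10 and x <= 12
if x < 10:
    size = 'small'
elif 10 <= x <= 12:
    size = 'medium'
else:
    size = 'big'
print(size)  # medium

# With "or", only one of the two conditions needs to be true
base = 'g'
is_g_or_c = (base == 'g' or base == 'c')
print(is_g_or_c)  # True
```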

Statements can be nested inside each other. So we can put an if statement inside a for loop, for example:

In [1]:
for i in range(8, 15):
    if i < 10:
        print(i, "is smaller.")
    elif i >= 10 and i <= 12:
        print(i, "is medium size :)")
    else:
        print("i is", i)

8 is smaller.
9 is smaller.
10 is medium size :)
11 is medium size :)
12 is medium size :)
i is 13
i is 14


We can also use loops to count how often the base 'a' occurs in our variable DNAseq.

In [4]:
a_count = 0

for base in DNAseq:
    if base == 'a':
        a_count += 1
        
print('We have the following base counts:')
print('a:', a_count)    

We have the following base counts:
a: 17


### Exercise
What are the counts of each regular base [a, t, c, g] in DNAseq?

In [None]:
# write your code here: 



In [5]:
a_count = 0
t_count = 0
c_count = 0
g_count = 0

for base in DNAseq:
    if base == 'a':
        a_count += 1
    elif base == 't':
        t_count += 1
    elif base == 'c':
        c_count += 1
    elif base == "g":
        g_count += 1
    else:
        print(base, 'is not a regular base [a, t, c, g]')
        
print('We have the following base counts:')
print('a:', a_count)
print('t:', t_count)
print('c:', c_count)
print('g:', g_count)   

We have the following base counts:
a: 17
t: 14
c: 15
g: 8
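As with len(), counting characters is common enough that strings have a built-in method for it, count(). The following sketch should give the same numbers as the loop, although it does not report unexpected characters the way the else branch does:

```python
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'

# str.count() returns how often a substring occurs in the string
print('a:', DNAseq.count('a'))  # a: 17
print('t:', DNAseq.count('t'))  # t: 14
print('c:', DNAseq.count('c'))  # c: 15
print('g:', DNAseq.count('g'))  # g: 8
```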


## Data structures

### Lists, tuples, and sets

Python has a set of standard 'containers' in which you can store information. Each data structure has its purpose and advantages.


### Lists

Lists are ordered sequences of elements. Each element or value that is inside of a list is called an item. Just as strings are defined as characters between quotes, lists are defined by having values between square brackets []. 

For example, the codons within a DNA sequence could be stored in a list of 3-mers.

In [None]:
DNAseq

In [None]:
### initially you can copy and paste the codons over and make a list
DNAseq_codons = ['atg', 'tat', 'aac' ]

We can also make a list more easily by looping over the string.

In [None]:
DNAseq_codons = []

for i in range(0, len(DNAseq), 3):
    DNAseq_codons.append(DNAseq[i:i+3])
    print(DNAseq[i:i+3])
print(DNAseq_codons)

### Lists are
* subscriptable
* mutable

In [None]:
DNAseq_codons[4]

In [None]:
DNAseq_codons[4] = 'aat'

In [None]:
DNAseq_codons
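To make "subscriptable" and "mutable" concrete, here is a small sketch on a made-up list called `codons` (so it does not change DNAseq_codons):

```python
codons = ['atg', 'tat', 'aac']  # made-up example list

# Subscriptable: single items and slices work just like with strings
print(codons[0])    # atg
print(codons[0:2])  # ['atg', 'tat']

# Mutable: we can add items and replace existing ones in place
codons.append('att')
codons[1] = 'tac'
print(codons)       # ['atg', 'tac', 'aac', 'att']
print(len(codons))  # 4

# Strings, in contrast, are immutable:
# 'atg'[0] = 'c' would raise a TypeError
```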

### Sets

Sets are unordered collections of items where each item occurs only once. A set has no duplicated elements.

In [None]:
print(DNAseq_codons)
type(DNAseq_codons)

In [None]:
set(DNAseq_codons)

In [None]:
# raises TypeError: 'set' object is not subscriptable
set(DNAseq_codons)[0]

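### Sets are
* not subscriptable (items have no position, so asking for item 0 raises a TypeError)
* mutable (items can be added or removed), but they can only contain hashable items, such as strings or numbers

Because sets cannot contain the same element more than once, they are an efficient way to remove duplicate values from a list and to perform common math operations like unions and intersections.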
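A short sketch of removing duplicates and of intersection and union, using made-up codon collections (`codons`, `unique_codons` and `other` are names chosen for this example):

```python
codons = ['atg', 'tat', 'aac', 'tat', 'atg']  # made-up list with duplicates

# Converting to a set keeps each codon only once
unique_codons = set(codons)
print(len(codons))         # 5
print(len(unique_codons))  # 3

other = {'tat', 'ggc'}  # curly braces create a set directly

# Intersection (&): items that are in both sets
print(unique_codons & other)  # {'tat'}

# Union (|): items that are in either set; sorted() gives a stable order
print(sorted(unique_codons | other))  # ['aac', 'atg', 'ggc', 'tat']
```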

### Dictionaries

Dictionaries are containers of key:value pairs where each key can only occur once. Since Python 3.7 they keep their items in insertion order, but items are looked up by key, not by position.

They are one of the most useful Python data structures. Looking up elements in a dictionary is really fast and there are lots of built-in methods to use and manipulate dictionaries.

We can use a dictionary to reverse complement a DNA sequence.

In [None]:
base_pair_dict = {'a' : 't', 
                  't' : 'a', 
                  'g' : 'c', 
                  'c' : 'g'}

In [None]:
base_pair_dict

In [None]:
base_pair_dict['g']
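A few more common dictionary operations, sketched on the same base pairing table. The 'n' entry and the '?' fallback are made up for this example:

```python
base_pair_dict = {'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}

# "in" checks whether a key exists
print('a' in base_pair_dict)  # True
print('n' in base_pair_dict)  # False

# .get() returns a fallback instead of raising a KeyError for missing keys
print(base_pair_dict.get('n', '?'))  # ?

# .items() lets us loop over keys and values together
for base, partner in base_pair_dict.items():
    print(base, 'pairs with', partner)

# Dictionaries are mutable: assigning to a new key adds an entry
base_pair_dict['n'] = 'n'
print(len(base_pair_dict))  # 5
```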

In [None]:
###so now reverse complement our DNA sequence

reverse_comp_DNAseq = ''

for base in DNAseq[::-1]:
    paired_base = base_pair_dict[base]
    reverse_comp_DNAseq += paired_base
    
print(reverse_comp_DNAseq, 'is the reverse complement of', DNAseq)
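The loop above relies on the slice `DNAseq[::-1]`. A slice can take a third value, the step, which works like the third parameter of range; a step of -1 walks through the sequence backwards. A minimal sketch on a made-up sequence:

```python
seq = 'atgc'  # made-up example sequence

# A step of 2 takes every second character
every_second = seq[::2]
print(every_second)  # ag

# A step of -1 goes from the end to the start, which reverses the string
reversed_seq = seq[::-1]
print(reversed_seq)  # cgta
```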

### Exercise: decode the hidden message in the DNA sequences

Use the coding table dictionary to decode the hidden message!  
Hint: you can convert a string to upper case with the string method x.upper(), where x is your string variable.


In [None]:
'agt'.upper()

In [None]:
coding_table_dict = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 

### Solution

In [None]:
hidden_message = ''

for i in range(0, len(DNAseq), 3):
    codon = DNAseq[i:i+3]
    amino_acid = coding_table_dict[codon.upper()]
    hidden_message = hidden_message + amino_acid

print("This is the hidden message in", DNAseq, ':\n', hidden_message)

In [None]:
hidden_message = ''

for i in range(0, len(DNAseq2), 3):
    codon = DNAseq2[i:i+3]
    amino_acid = coding_table_dict[codon.upper()]
    hidden_message = hidden_message + amino_acid

print("This is the hidden message in", DNAseq2, ':\n',hidden_message)
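Both solutions assume the sequence length is a multiple of 3 and that every codon is in the table; otherwise the dictionary lookup raises a KeyError (DNAseq3, with 4 bases, would fail this way). One possible way to handle that, sketched with a made-up mini table, a made-up sequence and an arbitrary 'X' placeholder:

```python
# A small subset of the coding table, only for this sketch
mini_table = {'CCG': 'P', 'TAT': 'Y', 'TAA': '_'}

seq = 'ccgtatn'  # made-up: 7 bases, so the last "codon" is just 'n'
message = ''

for i in range(0, len(seq), 3):
    codon = seq[i:i+3].upper()
    # .get() returns 'X' for unknown or incomplete codons
    # instead of raising a KeyError
    message = message + mini_table.get(codon, 'X')

print(message)  # PYX
```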