# Python introduction - session 2
## Loops and control statements & data structures

If you get stuck in an endless loop hit the **"STOP" button (black square)** above or our good friend from bash, **ctrl+c**  
You know you are stuck in a loop if you see **In \[\*\]:** forever

## Indexing
Each position in this string has an "index"

Getting the data that you want from a string

In [None]:
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

Python is ZERO indexed: this means you always start counting at 0, not 1

**We use square brackets to divide a string up by index positions**

- The first index is INCLUSIVE
- The second index is EXCLUSIVE
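
As a minimal sketch of the rules above (the variable names `first_base`, `first_codon` and `second_codon` are made up for illustration), indexing and slicing DNAseq might look like this:

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

# A single index returns one character; counting starts at 0
first_base = DNAseq[0]
print(first_base)      # a

# Two indexes separated by a colon return a slice:
# position 0 is included, position 3 is excluded
first_codon = DNAseq[0:3]
print(first_codon)     # atg

# The next slice starts where the last one stopped: positions 3, 4 and 5
second_codon = DNAseq[3:6]
print(second_codon)    # tct
```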

**Slice from the beginning to a set end index position**

This is called "slicing": we access parts of our string using their index positions.

**Slice from a set index position to the end of the string object, no matter how long it is**
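
A short sketch of the two open-ended slices described above; the cut positions 6 and 48 are arbitrary choices for illustration:

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

# Leave out the first index to start at the beginning of the string
start_of_seq = DNAseq[:6]
print(start_of_seq)    # atgtct

# Leave out the second index to run to the end of the string,
# however long the string is
end_of_seq = DNAseq[48:]
print(end_of_seq)      # cattaa
```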

## Loops

Computing is mostly about doing the same thing again and again in an automated fashion.

An example task that we might want to repeat is printing each character in a word or DNA sequence on a line of its own. One way to do this would be to use a series of print statements:

In [None]:
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'


Now you might want to repeat this for another word and you copy & paste your existing code.

But that’s a bad approach for two reasons:

1. It doesn't scale: imagine wanting to print the characters in a string that is hundreds of letters long.

2. It is fragile: if we give it a longer string, it only prints part of the data, and if we give it a shorter one, it produces an error because we’re asking for characters that don't exist.


In [None]:
DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'


In [None]:
DNAseq3 = 'ccgt'
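
To see the fragility for yourself, here is a sketch of the copy & paste approach applied to the short string. The `try`/`except` wrapper is not part of this session; it is only there (as an assumption about how you might want to run it) so that the cell reports the error instead of stopping.

```python
DNAseq3 = 'ccgt'

# Hard-coded positions work only while the string is long enough
print(DNAseq3[0])
print(DNAseq3[1])
print(DNAseq3[2])
print(DNAseq3[3])

# Asking for a fifth character that does not exist raises an IndexError
try:
    print(DNAseq3[4])
except IndexError as error:
    print('IndexError:', error)
```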


A better approach is to use **for loops** in Python.

##### In Python, a for loop is structured like this.

We can call the loop variable anything we like, but there must be a **colon** at the end of the line starting the loop, and we must **indent** anything we want to run inside the loop. Unlike many other languages, there is no command to start/end a loop (e.g. do/done in **bash**); what is indented after the for statement belongs to the loop.
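
A minimal sketch of that structure, using `base` as the (freely chosen) loop variable name:

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

# "base" is the loop variable: it takes each character of DNAseq in turn.
# Note the colon at the end of the line.
for base in DNAseq:
    # this indented line belongs to the loop and runs once per character
    print(base)

# this line is not indented, so it runs once, after the loop has finished
print('done')
```

After the loop ends, `base` still holds the last character it was given.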

**Indentation errors are THE MOST COMMON MISTAKE made when writing loops**  
In this course I will exclusively use the "tab" key to make my indents.  
I find it best to build my loop slowly and use the print() function a lot, to show myself that it is working correctly.  


### Remember from yesterday our variable x

In [1]:
x=5

Say we want to increment x by 2 and assign it to the same variable, how do we do this?
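
Two equivalent ways to do this, shown as a short sketch:

```python
x = 5

# Long form: compute x + 2, then assign the result back to x
x = x + 2
print(x)   # 7

# Shortcut: += adds and assigns in one step
x += 2
print(x)   # 9
```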

### Exercise
Write a loop that counts the number of bases in your DNA sequence (variable DNAseq).

In [4]:
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [None]:
## start a counter at 0
number_of_bases = 0

# Here goes a loop that iterates over DNAseq and counts the bases



# print out the results of your loop counter
print('There are', number_of_bases, 'bases in our DNAsequence.')

### Solution
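
One possible solution, as a sketch (there are other ways to write it): add 1 to the counter every time the loop hands us a base.

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

# start a counter at 0
number_of_bases = 0

# every pass through the loop sees one base, so add 1 each time
for base in DNAseq:
    number_of_bases += 1

# print out the results of the loop counter
print('There are', number_of_bases, 'bases in our DNAsequence.')
```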

### Exercise

Now make your final print statement print DNAseq. Notice that Python will print spaces between the variables in the print statement by default.

In [None]:
number_of_bases = 0



### Exercise

Note also that finding the length of a string is such a common operation that Python actually has a built-in function to do it called len( )

Can you use len( ) to get the length of DNAseq?

len( ) is much faster than any function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven’t met yet, so we should always use it when we can.
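
For example (the short list in the second call is made up for illustration; lists are covered later in this session):

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

# len() on a string counts its characters
seq_length = len(DNAseq)
print(seq_length)      # 54

# len() also works on other objects, for example a list
list_length = len(['atg', 'tct'])
print(list_length)     # 2
```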

### How do we iterate over numbers?
Python has a built-in function called range that creates a sequence of numbers (in Python 3 this is a range object rather than a list; wrap it in list( ) to see the numbers). Range can accept 1-3 parameters. 
- If one parameter is input, range creates a sequence of that length, starting at zero and incrementing by 1. 
- If 2 parameters are input, range starts at the first and stops just before the second, incrementing by one. 
- If range is passed 3 parameters, it starts at the first one, stops before reaching the second one, and increments by the third one.

range(0, 3)

*NOTE which numbers are included and which are excluded!*
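
A small sketch of the three forms. The variable names are made up, and `list()` is used only to make the numbers visible when printing:

```python
# One parameter: start at 0, stop before 3
one_param = list(range(3))
print(one_param)       # [0, 1, 2]

# Two parameters: start at 0, stop before 3 (3 itself is excluded)
two_params = list(range(0, 3))
print(two_params)      # [0, 1, 2]

# Three parameters: start at 1, stop before 10, step by 2
three_params = list(range(1, 10, 2))
print(three_params)    # [1, 3, 5, 7, 9]

# In a for loop we can use range directly, without list()
for i in range(0, 3):
    print(i)
```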

### Exercise
Write a loop that prints all the even numbers in the range between 1 and 10 (inclusive)

### Solution
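
One possible solution, sketched with the three-parameter form of range (the loop variable name `number` is a free choice): start at the first even number, and stop at 11 so that 10 is still included.

```python
# Start at 2, stop before 11, step by 2
for number in range(2, 11, 2):
    print(number)
```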


Using variables we can make our loop very generic and flexible

In [None]:
start, end = 1, 10


*NOTE, I have assigned two variables to two numbers on one line, using a comma*
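
A sketch of what such a generic loop might look like. This is an assumption about the loop that goes with `start` and `end`, not necessarily the one used in class:

```python
start, end = 1, 10

# range() excludes its second parameter, so use end + 1 to include end
for number in range(start, end + 1):
    print(number)
```

Changing `start` or `end` on the first line is now enough to print a different range.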

### Subscripting
Remember: you can use indexes to subset strings. We used this in the initial examples above.

In [5]:
DNAseq

'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

You can use subsetting to print out all the first bases of all codons.
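
One way to do this, as a sketch: step through the index positions three at a time with range and print the single character found at each position.

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

# i takes the values 0, 3, 6, ... which are the start positions of the codons
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i])
```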

### Exercise
Write a loop that prints all the codons (non-overlapping 3-mers) of our variable DNAseq

In [None]:
## Hint use your string slicing DNAseq[], how could you slice more letters?

### Solution
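
One possible solution, as a sketch: use the same start positions as before, but slice three letters instead of picking one.

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

# i is the start of each codon; i + 3 is excluded, so we get 3 letters
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i:i + 3])
```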

## Control flow - more ways to affect the order in which statements run

So far we've seen "for" loops in both shell programs and a few of our Python examples.

There are a few other "control flow" statements that affect the order in which statements run.

We'll start first with the "if statement" ... and its variants.

In [None]:
## Using if
x = 5

print("step 1")
if x < 10:
    print("x is small ... just", x)
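
Two variants, sketched under the assumption that they match what the next two cells are meant to hold: several independent `if` statements are each tested on their own, while `if`/`else` picks exactly one of two branches. The name `size` and the thresholds are made up for illustration.

```python
x = 5

# Multiple ifs: every condition is checked, so more than one can run
if x < 10:
    print("x is smaller than 10")
if x < 100:
    print("x is smaller than 100")

# if/else: exactly one of the two branches runs
if x < 3:
    size = "small"
else:
    size = "not small"
print("x is", size)
```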

In [None]:
## Using multiple ifs


In [None]:
## if else


An "if statement" can be quite complex.

Note the use of "`elif`" below, which means "else if"

In [None]:
## elif 
x = 11

if x < 10:
    print("x is small ... just", x)
elif x < 20:
    print("x is medium ... it is", x)
else:
    print("x is large ... it is", x)


Multiple conditions can be tested in the same `if`-statement. You can use "`and`" if both predicates need to be true, or "`or`" if only one of them needs to be true.
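
A short sketch of both keywords; the thresholds 10 and 20 and the names `in_range` and `out_of_range` are arbitrary choices for illustration:

```python
x = 13

# "and": both comparisons must be true
in_range = x >= 10 and x < 20
if in_range:
    print("x is between 10 and 20 ... it is", x)

x = 30

# "or": one true comparison is enough
out_of_range = x < 10 or x > 20
if out_of_range:
    print("x is outside 10 to 20 ... it is", x)
```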

In [None]:
## multiple conditions with AND
x=13

if x < 10:
    print("x is small ... just", x)


In [None]:
## multiple conditions with OR
x=30


Statements can be nested in other complex statements. So we can put an `if`-statement inside a `for`-statement, as follows.

In [None]:
for i in range(8, 15):
    if i < 10:
        print(i, "is smaller.")


We can also use loops to count how often the base 'a' occurs in our variable DNAseq.

In [None]:
a_count = 0

for base in DNAseq:
    ## start conditional statement
    if base == 'a':
        ## use the Python shortcut += to add and assign a variable at the same time
        a_count += 1
        
print('We have the following base counts:')
print('a:', a_count)    

### Exercise
What are the counts of each regular base [a, t, c, g] in DNAseq?

In [None]:
## extra cell to play with 

In [None]:
a_count = 0
t_count = 0
c_count = 0
g_count = 0

for base in DNAseq:
    ###add your loop here to count the number of a's, t's, c's and g's

        
print('We have the following base counts:')
print('a:', a_count)
print('t:', t_count)
print('c:', c_count)
print('g:', g_count)   

### Now that we have written this really awesome loop... we're going to show you how Python could have done this for you... faster

#### Object Methods (a.k.a. functions attached to objects)

You can also perform certain operations on objects; these are called methods (sometimes still called "functions"). You can use an object method in a similar way to an attribute, by typing **objectname.method( )**  
(NOTE: We can tell this is a method because of the **( )**!!)

We can use the built-in function **help( )** to learn more about a method, BUT if you want help you need to leave the parentheses off the method: **help(objectname.method)**

In [None]:
help(DNAseq.count)
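
As a sketch of how the string method `count` could replace the counting loop above (the `*_count` names are reused from the exercise):

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

# str.count(substring) returns how often the substring occurs in the string
a_count = DNAseq.count('a')
t_count = DNAseq.count('t')
c_count = DNAseq.count('c')
g_count = DNAseq.count('g')

print('We have the following base counts:')
print('a:', a_count)
print('t:', t_count)
print('c:', c_count)
print('g:', g_count)
```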

## Data structures

### Lists, sets, and dictionaries

Python has a set of standard 'containers' where you can store information. Each data structure has its purpose and advantages.


### Lists

Lists are ordered sequences of elements. Each element or value that is inside of a list is called an item. Just as strings are defined as characters between quotes, lists are defined by having values between square brackets []. 

For example, the codons within a DNA sequence could be stored in a list of 3-mers.

In [None]:
DNAseq

In [None]:
### initially you can copy and paste the codons over and make a list
DNAseq_codons = []

We can also make a list more easily by looping over the string.
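
A sketch of one way to do this, combining the codon slicing from earlier with the list method `append`, which adds one item to the end of a list:

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

DNAseq_codons = []

# slice out each codon and add it to the end of the list
for i in range(0, len(DNAseq), 3):
    DNAseq_codons.append(DNAseq[i:i + 3])

print(DNAseq_codons)
```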

In [None]:
DNAseq_codons = []

## Use a for loop on a range object to extract codons


print(DNAseq_codons)

### Lists are
* subscriptable  (i.e. you can access the elements with [ ])
* mutable (i.e. they can be altered without making a new object)
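
A small sketch of both properties, using only the first four codons of DNAseq to keep the output short. The replacement codon 'tat' is an arbitrary choice.

```python
DNAseq_codons = ['atg', 'tct', 'aac', 'att']

# subscript: one item by index, or several items with a slice
print(DNAseq_codons[0])      # atg
print(DNAseq_codons[1:3])    # ['tct', 'aac']

# mutate: replace an item in place and add one to the end
DNAseq_codons[1] = 'tat'
DNAseq_codons.append('ggc')

# look at the changes: it is still the same list object
print(DNAseq_codons)         # ['atg', 'tat', 'aac', 'att', 'ggc']
```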

In [None]:
## subscript or slice my list


In [None]:
## mutate/alter my list


In [None]:
## look at the changes


### Common Methods for Lists

In [None]:
## Use tab to look at what methods are available


### Sets

Sets are unordered collections of items where each item occurs only once, so a set has no duplicated elements.

In [None]:
print(DNAseq_codons)
type(DNAseq_codons)

In [None]:
## Create a set with set()
set(DNAseq_codons)

In [None]:
## Sets have no order so they are NOT indexable/sliceable
## (running this line raises a TypeError)
set(DNAseq_codons)[0]

### Sets are
* not subscriptable (i.e. you cannot index a set with [0], because sets have no order)
* mutable (there exists a special type of set called a "frozen set", frozenset, which is immutable)

Because sets cannot contain the same element more than once, they are highly useful for efficiently removing duplicate values from a list and for performing common math operations like unions and intersections.
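
A sketch of these operations on a shortened codon list. The list contents and the name `codon_set` are made up for illustration, and `sorted()` is used only so that the printed order is predictable.

```python
# a short list with two repeated codons
DNAseq_codons = ['atg', 'tct', 'aac', 'att', 'aac', 'att']

# converting to a set removes the duplicates
codon_set = set(DNAseq_codons)
print(len(DNAseq_codons), len(codon_set))   # 6 4

# add() inserts a new value; adding an existing value changes nothing
codon_set.add('ggc')
codon_set.add('atg')
print(len(codon_set))                       # 5

codon_set2 = set(["aac", "att", "gtt"])

# intersection: values found in both sets
shared = sorted(codon_set.intersection(codon_set2))
print(shared)                               # ['aac', 'att']

# union: values found in either set
combined = sorted(codon_set.union(codon_set2))
print(combined)    # ['aac', 'atg', 'att', 'ggc', 'gtt', 'tct']
```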

### Common Methods for Sets

In [None]:
## create a set object


In [None]:
## How many things are in my set object


In [None]:
## Look at set object


In [None]:
## Add a new value to my set


In [None]:
## Look at len(set) did I successfully add a value?


In [None]:
## Create a new set with two objects that are in set 1 and one new object
codon_set2 = set(["aac","att","gtt"])

### Dictionaries

Dictionaries are containers of key:value pairs where each key can only occur once. Since Python 3.7 they remember the order in which keys were inserted, but items are looked up by key, not by position.

They are one of the most useful Python data structures. Looking up elements in a dictionary is really fast and there are lots of built-in functions to use and manipulate dictionaries.

We can use a dictionary to reverse complement a DNA sequence.

In [None]:
## create a dictionary in two ways
## dict_name = {'key':value}
## dict_name = dict(key1= value1,key2 = value2, key3 = value3)

age_dict = dict(alex = 5, 
                tim = 7, 
                alexa = 5, 
                sandy = 8)
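
The same dictionary written with the curly-brace syntax, followed by a lookup and two changes, as a sketch. The new key 'kim' and the new value for 'tim' are made up for illustration.

```python
# the {'key': value} way of creating the same dictionary
age_dict = {'alex': 5, 'tim': 7, 'alexa': 5, 'sandy': 8}

# look up a value by putting its key in square brackets
print(age_dict['tim'])    # 7

# assigning to a new key adds a key:value pair
age_dict['kim'] = 6

# assigning to an existing key replaces its value
age_dict['tim'] = 8

print(age_dict)
```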


In [None]:
## What is the value of the key "tim"

In [None]:
## Add a new value to my dictionary


In [None]:
base_pair_dict = {'a' : 't', 
                  't' : 'a', 
                  'g' : 'c', 
                  'c' : 'g'}
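
A sketch of two ways to look inside this dictionary: the method `values()` returns all values, and square brackets return the value stored under one key. `list()` is used only to make the printed output easy to read.

```python
base_pair_dict = {'a': 't',
                  't': 'a',
                  'g': 'c',
                  'c': 'g'}

# all values, in the order the keys were inserted
all_values = list(base_pair_dict.values())
print(all_values)             # ['t', 'a', 'c', 'g']

# the value of the key "g"
print(base_pair_dict['g'])    # c
```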

In [None]:
# look at all values of dictionary base_pair_dict


In [None]:
# look at the value of the key "g"


In [None]:
###so now reverse complement our DNA sequence

reverse_comp_DNAseq = ''

for base in DNAseq[::-1]:  ## loop over DNAseq backwards and build the RC in reverse_comp_DNAseq
    ## look up the complement of base in your dictionary
    ## and add it to the end of the string reverse_comp_DNAseq
    reverse_comp_DNAseq += base_pair_dict[base]
    
print(reverse_comp_DNAseq, 'is the reverse complement of', DNAseq)

### Dictionaries are
* not indexable by position (i.e. you cannot use base_pair_dict[0] to get the "first" item, because values are looked up by their key)
* mutable (you can alter the value of a key after it has been created, or add new keys)

Dictionaries can only have a given key once, BUT the values of those keys can be the same

### Common Methods for Dictionaries

### Exercise: decode the hidden message in the DNA sequences

Use the coding table dictionary to translate your DNAseq into amino acids and decode the hidden message!!!!  
Hint: you can convert a string to upper case with the method x.upper(), where x is your string variable.


In [None]:
DNAseq = 'atgtataacattggccataccccgtatacccatgcgaaccatattggccattaa'
DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'

In [None]:
coding_table_dict = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 

### HINT
You need to convert our DNAseq to UPPERCASE letters
Perhaps there is a handy method to do that ???

In [None]:
"atg".upper()

### Solution

In [None]:
hidden_message = ''

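for i in range(0, len(DNAseq), 3):
    ## slice DNAseq and keep your codon
    codon = DNAseq[i:i+3]
    ## look up which AA is encoded by your codon (the table uses UPPERCASE keys)
    amino_acid = coding_table_dict[codon.upper()]
    ## write your AA to your hidden_message string
    hidden_message += amino_acid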

print("This is the hidden message in", DNAseq, ':\n', hidden_message)

In [None]:
hidden_message = ''

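for i in range(0, len(DNAseq2), 3):
    ## translate each codon of DNAseq2 and add the AA to hidden_message
    hidden_message += coding_table_dict[DNAseq2[i:i+3].upper()]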

print("This is the hidden message in", DNAseq2, ':\n',hidden_message)