# Python introduction - session 2
## Loops and control statements & data structures

If you get stuck in an endless loop, hit the **"STOP" button (black square)** above or use our good friend from bash, **ctrl+c**  
You know you are stuck in a loop if you see **In \[\*\]:** forever.

## Indexing
Each position in this string has an "index"

Getting the data that you want from a string

In [202]:
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

Python is ZERO indexed: you always start counting at 0, not 1

**We use square brackets to divide a string up by index positions**

The first index is INCLUSIVE
The second index is EXCLUSIVE

In [203]:
DNAseq[0:4]

'atgt'
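
Because the start index is included and the end index is excluded, a slice `[start:end]` always contains `end - start` characters. A minimal sketch of that rule (the names `start`, `end` and `piece` are only used for this illustration):

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

start = 0
end = 4
piece = DNAseq[start:end]   # the characters at index 0, 1, 2 and 3

print(piece)                # atgt
print(len(piece))           # 4, which is end - start
print(DNAseq[end])          # c, the character at index 4 is NOT part of the slice
```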

**Slice from the beginning to a set end index position**

This is called "Slicing", where we access parts of our string using their index

In [204]:
DNAseq[:4]

'atgt'

**Slice from a set index position to the end of the string object, no matter how long it is**

In [205]:
DNAseq[3:]

'tctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

#### Create slices of intervals

In [206]:
print(DNAseq)
DNAseq[:6:2]

atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa


'agc'

#### Slice backwards

In [207]:
DNAseq[::-1]

'aattaccggttataccaagcgtacccatatgccccataccggttacaatctgta'
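
The general form of a slice is `[start:end:step]`. Here is a small sketch of how the third number behaves, using the same sequence (the names `every_third` and `backwards` are only for this illustration):

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

every_third = DNAseq[0:9:3]   # picks the characters at index 0, 3 and 6
print(every_third)            # ata

backwards = DNAseq[::-1]      # a step of -1 walks through the string from the end
print(backwards[:4])          # aatt
```

Reversing a reversed string gives back the original, which is a quick way to convince yourself the slice did what you expected.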

#### Your turn! Make your own slice. Do you get the letters you expected?


## Loops

Computing is mostly about doing the same thing again and again in an automated fashion.

An example task that we might want to repeat is printing each character in a word or DNA sequence on a line of its own. One way to do this would be to use a series of print statements:

In [208]:
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'
print(DNAseq[0])
print(DNAseq[1])
print(DNAseq[2])
print(DNAseq[3])
print(DNAseq[4])

a
t
g
t
c


Now you might want to repeat this for another word and you copy & paste your existing code.

But that’s a bad approach for two reasons:

1. It doesn't scale: imagine wanting to print the characters in a string that is hundreds of letters long.

2. It is fragile: if we give it a longer string, it only prints part of the data, and if we give it a shorter one, it produces an error because we’re asking for characters that don't exist.


In [209]:
DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'
print(DNAseq2[0])
print(DNAseq2[1])
print(DNAseq2[2])
print(DNAseq2[3])
print(DNAseq2[4])

c
c
g
t
a


In [210]:
DNAseq3 = 'ccgt'
print(DNAseq3[0])
print(DNAseq3[1])
print(DNAseq3[2])
print(DNAseq3[3])
print(DNAseq3[4])

c
c
g
t


IndexError: string index out of range

A better approach is to use **for loops** in Python.

In [None]:
for base in DNAseq:
    print(base)

In [None]:
for base in DNAseq2:
    print(base)

In [None]:
for base in DNAseq3:
    print(base)

##### In Python, a for loop is structured like this.

We can call the loop variable anything we like, but there must be a **colon** at the end of the line starting the loop, and we must **indent** anything we want to run inside the loop. Unlike many other languages, there is no command to start/end a loop (e.g. do/done in **bash**); what is indented after the for statement belongs to the loop.

**Indentation errors are THE MOST COMMON MISTAKE made when writing loops**  
In this course I will exclusively use the "tab" key to make my indents.  
I find it best to slowly build my loop and use the print( ) function a lot, to show myself that it is working correctly.  
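
To see how indentation decides what belongs to the loop, here is a minimal sketch with a short, made-up sequence (`word` is not part of the lesson data):

```python
word = 'atg'

for base in word:
    print('inside the loop:', base)   # indented: runs once for every character
print('after the loop')               # not indented: runs once, when the loop is done
```

The first message is printed three times (once each for a, t and g) and the second message only once.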


### Remember from yesterday our variable x

In [211]:
x=5

Say we want to increment x by 2 and assign it to the same variable, how do we do this?

In [212]:
x= x+2

In [213]:
print(x)

7


### Exercise
Write a loop that counts the number of bases in your DNA sequence (variable DNAseq).

In [214]:
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [215]:
## start a counter at 0
number_of_bases = 0

# Here goes a loop that iterates over DNAseq and counts the bases




print('There are', number_of_bases, 'bases in our DNAsequence.')

There are 0 bases in our DNAsequence.


### Solution

In [216]:
number_of_bases = 0

for base in DNAseq: ### you might want to do this as before with 'atgcg' here and DNAseq below
    number_of_bases = number_of_bases + 1
    
print('There are', number_of_bases, 'bases in our DNAsequence.')

There are 54 bases in our DNAsequence.


### Exercise
Now make your final print statement print DNAseq. Notice that Python will print spaces between the variables in the print statement by default.

Let's re-write our print statement by changing the default spacing of print( )

In [217]:
number_of_bases = 0
help(print)
for base in DNAseq:
    number_of_bases = number_of_bases + 1
    
print('There are', number_of_bases, 'bases in the sequence', DNAseq,'.')

print('There are ', number_of_bases, ' letters in the sequence ', DNAseq, '.', sep="")


Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.

There are 54 bases in the sequence atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa .
There are 54 letters in the sequence atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa.


### Exercise

Note also that finding the length of a string is such a common operation that Python actually has a built-in function to do it called len( )

Can you use len( ) to get the length of DNAseq?

In [218]:
print(len(DNAseq))

54


len( ) is much faster than any function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven’t met yet, so we should always use it when we can.

### How do we iterate over numbers?
Python has a built-in function called range that produces a sequence of numbers. Range can accept 1-3 parameters. 
- If one parameter is input, range counts from zero up to (but not including) that number, incrementing by 1. 
- If 2 parameters are input, range starts at the first and stops just before the second, incrementing by one. 
- If range is passed 3 parameters, it starts at the first one, stops before reaching the second one, and increments by the third one.
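
In Python 3, range( ) does not print its numbers by itself. A quick way to look at them is to wrap the call in list( ), as in this small sketch (the variable names are only for illustration):

```python
one_parameter = list(range(3))             # start defaults to 0, step defaults to 1
two_parameters = list(range(2, 5))         # the stop value 5 is excluded
three_parameters = list(range(10, 30, 5))  # count in steps of 5, stop before 30

print(one_parameter)      # [0, 1, 2]
print(two_parameters)     # [2, 3, 4]
print(three_parameters)   # [10, 15, 20, 25]
```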

In [219]:
range(3)

range(0, 3)

In [220]:
for i in range(3):
    print(i)

0
1
2


In [221]:
for i in range(2,5):
    print(i)

2
3
4


*NOTE which numbers are included and which are excluded!*

In [222]:
for i in range(10,30,5):
    print(i)

10
15
20
25


### Exercise
Write a loop that prints all the even numbers in the range between 1 and 10 (inclusive)

### Solution

In [223]:
for i in range(2,11,2):
    print(i)

2
4
6
8
10


You can make it more generic and use variables to specify your start and stop.

Setting all your important variables at the very beginning (i.e. at the top!) is especially useful in big scripts.

In [224]:
start, end = 1, 10

for i in range(start, end + 1):
    if i % 2 == 0:
        print(i)

2
4
6
8
10


*NOTE, I have assigned two variables to two numbers on one line, using a comma*

### Subscripting
Remember: you can use indexes to subset strings. We used this in the initial examples above.

In [225]:
DNAseq

'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [226]:
DNAseq[0]

'a'

In [227]:
DNAseq[0:3]

'atg'

In [228]:
DNAseq[3:6]

'tct'

You can use subsetting to print out all the first bases of all codons.

In [229]:
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i])

a
t
a
a
g
c
a
c
t
a
c
g
a
c
a
g
c
t


### Exercise
Write a loop that prints all the codons (non-overlapping 3-mers) of our variable DNAseq

So we want to print:
    
    atg
    tct
    aac
    ...

In [230]:
## Hint use your string slicing DNAseq[]
## How could you slice more letters at once?

In [231]:
### Solution
for i in range(0, len(DNAseq), 3):
    print(DNAseq[i:i+3])

atg
tct
aac
att
ggc
cat
acc
ccg
tat
acc
cat
gcg
aac
cat
att
ggc
cat
taa


In [232]:
for i in range(0, len(DNAseq), 3):
    print(i,DNAseq[i:i+3])

0 atg
3 tct
6 aac
9 att
12 ggc
15 cat
18 acc
21 ccg
24 tat
27 acc
30 cat
33 gcg
36 aac
39 cat
42 att
45 ggc
48 cat
51 taa
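
One detail worth knowing: a slice that runs past the end of a string does not raise an error, it simply returns whatever is left. A minimal sketch with a made-up sequence (`short_seq` is not part of the lesson data) whose length is not a multiple of 3:

```python
short_seq = 'atgtctaa'   # 8 bases, so the last "codon" is incomplete

for i in range(0, len(short_seq), 3):
    print(i, short_seq[i:i+3])
```

The last line printed is `6 aa`, an incomplete codon, so it is worth checking the length of your sequence before trusting the final 3-mer.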


## Control flow - more ways to affect the order in which statements run

So far we've seen "for" loops in both shell programs and a few of our Python examples.

There are a few other "control flow" statements that affect the order in which statements run.

We'll start first with the "if statement" ... and its variants.

In [233]:
# Using if
x = 5

print("step 1")
if x < 10:
    print("x is small ... just", x)

# Using multiple ifs
x = 11
print("step 2")
if x < 10:
    print("x is small ... just", x)
if x >= 10:
    print("x is 10 or bigger. It's", x)


step 1
x is small ... just 5
step 2
x is 10 or bigger. It's 11


In [234]:
#if else
if x < 10:
    print("x is small ... just", x)
else:
    print("x is 10 or bigger. It's", x)

x is 10 or bigger. It's 11


An "if statement" can be quite complex.

Note the use of "`elif`" below, which means "else if"

In [235]:
x=13
if x < 10:
    print("x is small ... just", x)
elif x == 10:
    print("medium x. It's", x)
elif x == 11:
    print("medium x. It's", x)
elif x == 12:
    print("medium x. It's", x)
else:
    print("x bigger than 12. It's", x)

x bigger than 12. It's 13


Multiple conditions can be tested in the same `if`-statement. You can use "`and`" if both predicates need to be true, or "`or`" if only one of them needs to be true.

In [236]:
## multiple conditions with AND
x=12

if x < 10:
    print("x is small ... just", x)
elif x >= 10 and x <= 12:
    print("x is medium x. It's", x)
else:
    print("x bigger than 12. It's", x)

x is medium x. It's 12


In [237]:
## multiple conditions with OR
x="horse"

if x == "dog" or x=="puppy":
    print("woof")
elif x == "kitten" or x == "cat":
    print("meow")
else:
    print("I cannot make a sound because I do not know which animal I am.")

I cannot make a sound because I do not know which animal I am.


Statements can be nested in other complex statements. So we can put an `if`-statement inside a `for`-statement, as follows. Watch out for the indentation!!!

In [238]:
for i in range(8, 15):
    if i < 10:
        print(i, "is smaller.")
    elif i >= 10 and i <= 12:
        print(i, "is medium size :)")
    else:
        print("i is big", i)

8 is smaller.
9 is smaller.
10 is medium size :)
11 is medium size :)
12 is medium size :)
i is big 13
i is big 14


We can also use loops to count how often the base 'a' occurs in our variable DNAseq.

In [239]:
a_count = 0

for base in DNAseq:
    if base == 'a':
        a_count += 1  # NOTE: += is a Python shortcut
        # it adds to a variable and assigns the result back in one step
print('We have the following base counts:')
print('a:', a_count)    

We have the following base counts:
a: 16


In [319]:
a_count = 0

for base in DNAseq:
    if base == 'a':
        a_count = a_count + 1  # the long form, written without the shortcut
        # this does exactly the same as a_count += 1
print('We have the following base counts:')
print('a:', a_count)    

We have the following base counts:
a: 16


### Exercise
What are the counts of each regular base [a, t, c, g] in DNAseq?

In [320]:
DNAseq

'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [None]:
## extra cell to play with 

In [240]:
a_count = 0
t_count = 0
c_count = 0
g_count = 0

for base in DNAseq:
    if base == 'a':
        a_count += 1
    elif base == 't':
        t_count += 1
    elif base == 'c':
        c_count += 1
    elif base == "g":
        g_count += 1
    else:
        print(base, 'is not a regular base [a, t, c, g]')
        
print('We have the following base counts:')
print('a:', a_count)
print('t:', t_count)
print('c:', c_count)
print('g:', g_count)   

We have the following base counts:
a: 16
t: 14
c: 16
g: 8


### Now that we have written this really awesome loop...we're going to show you how python could have done this for you...faster

#### Object Methods (a.k.a. functions attached to objects)

You can also perform certain operations on objects; these are called methods (sometimes still called "functions"). You use an object method in a similar way to an attribute, by typing **objectname.method( )**  
(NOTE: We can tell this is a method because of the **( )**!!)

We can use the built-in function **help( )** to learn more about a method, BUT if you want help you need to leave the parentheses off the method: **help(objectname.method)**

In [241]:
help(DNAseq.count)

Help on built-in function count:

count(...) method of builtins.str instance
    S.count(sub[, start[, end]]) -> int
    
    Return the number of non-overlapping occurrences of substring sub in
    string S[start:end].  Optional arguments start and end are
    interpreted as in slice notation.



In [242]:
DNAseq.count("a")

16

In [243]:
DNAseq.count("A")

0

In [244]:
DNAseq.upper().count("A")

16
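
With `.count( )`, the base-counting loop from the exercise above shrinks to a few lines. A minimal sketch, assuming we only care about the four regular bases (the string `'atcg'` is just the list of bases to look up):

```python
DNAseq = 'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

print('We have the following base counts:')
for base in 'atcg':
    print(base + ':', DNAseq.count(base))
```

This prints the same counts as the loop with four counters (a: 16, t: 14, c: 16, g: 8). Unlike that loop, it does not warn about characters that are not regular bases.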

## Data structures

### Lists, sets, and dictionaries

Python has a set of standard 'containers' where you can store information. Each data structure has its purpose and advantages.


### Lists

Lists are ordered sequences of elements. Each element or value that is inside of a list is called an item. Just as strings are defined as characters between quotes, lists are defined by having values between square brackets []. 

For example, codons within DNA sequence could be stored in a list of 3-mers.

In [245]:
DNAseq

'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [246]:
### initially you can copy and paste the codons over and make a list
DNAseq_codons = ['atg', 'tct', 'aac' ]

We can also make a list more easily by looping over the string.

In [247]:
DNAseq_codons = []

for i in range(0, len(DNAseq), 3):
    DNAseq_codons.append(DNAseq[i:i+3])
    print(DNAseq[i:i+3])
print(DNAseq_codons)

atg
tct
aac
att
ggc
cat
acc
ccg
tat
acc
cat
gcg
aac
cat
att
ggc
cat
taa
['atg', 'tct', 'aac', 'att', 'ggc', 'cat', 'acc', 'ccg', 'tat', 'acc', 'cat', 'gcg', 'aac', 'cat', 'att', 'ggc', 'cat', 'taa']


### Lists are
* subscriptable (each item has an index)
* mutable
* you can loop over items in a list

In [248]:
DNAseq_codons[4]

'ggc'

In [249]:
DNAseq_codons[4] = 'aat'

In [250]:
DNAseq_codons[4]

'aat'

In [251]:
DNAseq_codons.sort()

In [252]:
DNAseq_codons

['aac',
 'aac',
 'aat',
 'acc',
 'acc',
 'atg',
 'att',
 'att',
 'cat',
 'cat',
 'cat',
 'cat',
 'ccg',
 'gcg',
 'ggc',
 'taa',
 'tat',
 'tct']

### Exercise
Look up some common methods for lists

In [None]:
## Use tab to look at what methods are available


### Exercise
Write a loop that prints each item of your list

In [253]:
# Just like a string we can loop over lists
for i in DNAseq_codons:
    print(i)

aac
aac
aat
acc
acc
atg
att
att
cat
cat
cat
cat
ccg
gcg
ggc
taa
tat
tct


### Sets

Sets are unordered collections of items where each item occurs only once, so a set has no duplicated elements.

In [254]:
print(DNAseq_codons)
type(DNAseq_codons)

['aac', 'aac', 'aat', 'acc', 'acc', 'atg', 'att', 'att', 'cat', 'cat', 'cat', 'cat', 'ccg', 'gcg', 'ggc', 'taa', 'tat', 'tct']


list

In [255]:
set(DNAseq_codons)

{'aac',
 'aat',
 'acc',
 'atg',
 'att',
 'cat',
 'ccg',
 'gcg',
 'ggc',
 'taa',
 'tat',
 'tct'}

In [256]:
set(DNAseq_codons)[0]

TypeError: 'set' object is not subscriptable

### Sets are
* not subscriptable (i.e. you cannot do my_set[0], because sets have no order)
* mutable (There exists a special type of set called a "frozen set" which is immutable)

Because sets cannot contain the same element more than once, they are highly useful for efficiently removing duplicate values from a list and for performing common math operations like unions and intersections.
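
As a quick illustration of removing duplicates, here is a minimal sketch with a short, made-up list of codons (`codon_list` is not part of the lesson data):

```python
codon_list = ['atg', 'cat', 'atg', 'taa', 'cat']

unique_codons = set(codon_list)    # duplicates disappear when the set is built

print(len(codon_list))             # 5
print(len(unique_codons))          # 3
print(sorted(unique_codons))       # ['atg', 'cat', 'taa']
```

sorted( ) is used for printing because a set itself has no guaranteed order.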

In [257]:
## create a set object
codon_set1= set(["aac","atc","gcc"])

In [258]:
## How many things are in my set object
len(codon_set1)

3

In [259]:
## Look at set object
codon_set1

{'aac', 'atc', 'gcc'}

In [260]:
## Add a new value to my set
codon_set1.add("gtt")

In [261]:
## Look at len(set) did I successfully add a value?
len(codon_set1)

4

In [262]:
codon_set1

{'aac', 'atc', 'gcc', 'gtt'}

In [263]:
## Create a new set with two objects that are in set 1 and one new object
codon_set2 = set(["aac","att","gtt"])

In [264]:
codon_set1.intersection(codon_set2)

{'aac', 'gtt'}

### Exercise 
Explore some of the methods associated with the set object using the tab key

Can you find a method that will create a union of two sets?

In [265]:
codon_set1.union(codon_set2)

{'aac', 'atc', 'att', 'gcc', 'gtt'}

In [266]:
# union( ) did not change codon_set1, so save the union to a new variable
codon_set1

{'aac', 'atc', 'gcc', 'gtt'}

In [267]:
codon_set3 = codon_set1.union(codon_set2)

In [268]:
codon_set3

{'aac', 'atc', 'att', 'gcc', 'gtt'}

### Dictionaries

Dictionaries are containers of key:value pairs where each key can only occur once. Values are looked up by key rather than by position (since Python 3.7 a dictionary does remember the order in which its keys were added).

They are one of the most useful Python data structures. Looking up elements in a dictionary is really fast and there are lots of built-in functions to use and manipulate dictionaries.

In [269]:
## create a dictionary in two ways
## dict_name = {'key':value}
## dict_name = dict(key1= value1,key2 = value2, key3 = value3)

age_dict = dict(alex = 5, 
                tim = 7, 
                alexa = 5, 
                sandy = 8)

In [270]:
age_dict = {"tim":7,"alex":5,"alexa":5, "sandy":8}

In [271]:
## What is the value of the key "tim"
age_dict["tim"]

7

In [272]:
age_dict["sandy"]

8
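
Looking up a key that is not in the dictionary raises a KeyError. Two common ways around that are the `in` keyword and the `.get( )` method. A minimal sketch ("bob" is a made-up name that is deliberately not a key):

```python
age_dict = {"tim": 7, "alex": 5, "alexa": 5, "sandy": 8}

print("tim" in age_dict)               # True
print("bob" in age_dict)               # False

bob_age = age_dict.get("bob")          # .get( ) returns None instead of raising a KeyError
print(bob_age)                         # None

bob_default = age_dict.get("bob", 0)   # an optional second argument is used as the fallback
print(bob_default)                     # 0
```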

In [273]:
age_dict.keys()

dict_keys(['tim', 'alex', 'alexa', 'sandy'])

In [274]:
## Add a new value to my dictionary
age_dict["jessica"]=9

In [275]:
age_dict.keys()

dict_keys(['tim', 'alex', 'alexa', 'sandy', 'jessica'])

We can use a dictionary to reverse complement a DNA sequence.

In [276]:
base_pair_dict = {'a' : 't', 
                  't' : 'a', 
                  'g' : 'c', 
                  'c' : 'g'}

In [277]:
# look at all values of dictionary base_pair_dict
base_pair_dict

{'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}

In [278]:
# look at the value of the key "g"
base_pair_dict['g']

'c'

### Exercise 
Can you write a loop over DNAseq that uses the dictionary to reverse complement the sequence?

\## Hints for what each line should contain in your loop

In [279]:
DNAseq

'atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa'

In [280]:
###so now reverse complement our DNA sequence

reverse_comp_DNAseq = ''

for base in DNAseq[::-1]:
    paired_base = base_pair_dict[base]
    reverse_comp_DNAseq += paired_base
    
print(reverse_comp_DNAseq, 'is the reverse complement of', DNAseq)

ttaatggccaatatggttcgcatgggtatacggggtatggccaatgttagacat is the reverse complement of atgtctaacattggccataccccgtatacccatgcgaaccatattggccattaa


### Dictionaries are
* not indexed by position (i.e. base_pair_dict[0] fails with a KeyError, because 0 is not one of the keys); you look up values by key instead
* mutable (you can alter the value of a key after it has been created, or add new keys)

Dictionaries can only have a given key once BUT the values of those keys can be the same
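
A minimal sketch of what "a key can only occur once" means in practice: assigning to a key that already exists replaces its value instead of creating a second entry.

```python
age_dict = {"tim": 7, "alex": 5, "alexa": 5}

age_dict["tim"] = 8                            # "tim" already exists, so the old value is replaced

print(len(age_dict))                           # 3, no new entry was added
print(age_dict["tim"])                         # 8
print(age_dict["alex"] == age_dict["alexa"])   # True, two keys may share the same value
```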

### Common Methods for dictionaries

In [281]:
age_dict.items()

dict_items([('tim', 7), ('alex', 5), ('alexa', 5), ('sandy', 8), ('jessica', 9)])

In [282]:
age_dict.pop("tim")

7

In [283]:
age_dict.items()

dict_items([('alex', 5), ('alexa', 5), ('sandy', 8), ('jessica', 9)])

In [284]:
age_dict.values()

dict_values([5, 5, 8, 9])

### Exercise: decode the hidden message in the DNA sequences

Use the coding table dictionary to decode the hidden message!  
Hint: you can convert a string to upper case with the method x.upper(), where x is your string variable.


In [321]:
DNAseq = 'atgattggccatacctatccgtatacccatgcgaactcagctgtggaaagcgacgcatactaa'
DNAseq2 = 'ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa'

In [322]:
coding_table_dict = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 

### HINT
You need to convert our DNAseq to UPPERCASE letters, perhaps there is a handy method to do that??

In [323]:
'agt'.upper()

'AGT'

### Solution

In [324]:
hidden_message1 = ''

for i in range(0, len(DNAseq), 3):
    codon = DNAseq[i:i+3]
    amino_acid = coding_table_dict[codon.upper()]
    hidden_message1 = hidden_message1 + amino_acid

print("This is the hidden message in", DNAseq, ':\n', hidden_message1)

This is the hidden message in atgattggccatacctatccgtatacccatgcgaactcagctgtggaaagcgacgcatactaa :
 MIGHTYPYTHANSAVESDAY_


In [325]:
hidden_message2 = ''

for i in range(0, len(DNAseq2), 3):
    codon = DNAseq2[i:i+3]
    amino_acid = coding_table_dict[codon.upper()]
    hidden_message2 = hidden_message2 + amino_acid

print("This is the hidden message in", DNAseq2, ':\n',hidden_message2)

This is the hidden message in ccgtatacccatgcgaacatggcgaaagaaagctttgcgagcacctaa :
 PYTHANMAKESFAST_


### Exercise (Hard)
Write a loop that iterates over our age dictionary and adds 1 to the age of each child

*Hint use dictionary methods .keys( ) and .items( ) in your loop*

In [290]:
age_dict = {"tim":7,"alex":5,"alexa":5, "sandy":8}

#### Start simple: figure out how to loop over the keys and values of the dictionary first, use your dictionary methods!

In [291]:
for key in age_dict.keys():
    print(key)

tim
alex
alexa
sandy


In [292]:
for item in age_dict.items():
    print(item)

('tim', 7)
('alex', 5)
('alexa', 5)
('sandy', 8)


In [293]:
for key in age_dict.keys():
    print(age_dict[key])

7
5
5
8


In [294]:
## Hint, you can loop over multiple variables in a for loop
for key, value in ######:
    print(key,value)

SyntaxError: invalid syntax (<ipython-input-294-573d07873ca7>, line 2)

In [295]:
for key, value in age_dict.items():
    print(key,value)

tim 7
alex 5
alexa 5
sandy 8


In [296]:
for key, value in age_dict.items():
    age_dict[key]=value+1
    print(key,age_dict[key])

tim 8
alex 6
alexa 6
sandy 9


In [297]:
age_dict.items()

dict_items([('tim', 8), ('alex', 6), ('alexa', 6), ('sandy', 9)])

## Learning more about objects with dir( )
Strings, lists, dictionaries, sets are all python objects with special rules and methods.

We can learn more about these objects with the built-in function dir( ).
One of the most important things to know is can I iterate over an object or not?

In other words can I use my object directly in a loop?

There are many other things we can discover with dir( ), but we do not have time
to cover them all in this course.

NOTE: Here is where we see names of the form `__variable__`, with two underscores on each side. These are only used for **special methods**, which Python uses internally to perform some operations. This naming convention should only be used for **special methods**. Unless you become a Python developer, you are very unlikely to use these methods in your own code. 

In [298]:
dir(age_dict)

['__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'clear',
 'copy',
 'fromkeys',
 'get',
 'items',
 'keys',
 'pop',
 'popitem',
 'setdefault',
 'update',
 'values']

In [299]:
dir(set())

['__and__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__iand__',
 '__init__',
 '__init_subclass__',
 '__ior__',
 '__isub__',
 '__iter__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__or__',
 '__rand__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__ror__',
 '__rsub__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__xor__',
 'add',
 'clear',
 'copy',
 'difference',
 'difference_update',
 'discard',
 'intersection',
 'intersection_update',
 'isdisjoint',
 'issubset',
 'issuperset',
 'pop',
 'remove',
 'symmetric_difference',
 'symmetric_difference_update',
 'union',
 'update']

In [300]:
dir(str())

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']


In [301]:
dir(int())

['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'as_integer_ratio',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']
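
To answer the question "can I loop over this object?" without reading the whole listing, one option is to check whether `__iter__` appears in the output of dir( ). A minimal sketch with a few made-up example objects:

```python
examples = ['atg', ['atg', 'taa'], {'a': 't'}, {'atg', 'taa'}, 5]

for obj in examples:
    can_loop = '__iter__' in dir(obj)   # True if the object can be used directly in a for loop
    print(type(obj), can_loop)
```

The string, list, dictionary and set all report True, while the integer reports False, which matches the listings above: `__iter__` is missing from dir(int()).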

### Writing your own functions
We have used a lot of built-in functions throughout these notebooks: print( ), len( ), dir( ), divmod( ), min( ), max( )

However, with data we commonly want to perform an analysis on multiple datasets, for example take the mean of several treatments.
Unlike R, base Python does not have a built-in function for mean, so let's make one. (There is a python library that has a mean function but you will see this in the next lesson).

In [3]:
Experiment_1= (35,37,42,25,26,38)
Experiment_2= (32,37,32,24,28)


In [4]:
round(sum(Experiment_1)/len(Experiment_1),1)

33.8

In [5]:
round(sum(Experiment_2)/len(Experiment_2),1)

30.6

### Let's define a function
Define a function using def with a name, parameters, and a block of code.

1. Begin the definition of a new function with def.
2. Followed by the name of the function, which must obey the same rules as variable names.
3. Then parameters in parentheses. Use empty parentheses if the function doesn’t take any inputs. We will discuss this in detail in a moment.
4. Then a colon.
5. Then an indented block of code.
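
The parts listed above can be matched to code like this. It is only a sketch, and the function name `show_length` and its parameter `sequence` are made up for this illustration:

```python
def show_length(sequence):                            # def, the function name, one parameter, a colon
    print(sequence, 'has', len(sequence), 'bases')    # the indented block is the body of the function

show_length('atgtct')                                 # calling the function runs the indented block
```

Calling it prints `atgtct has 6 bases`.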

#### First we will make a function that does not take any inputs, it just prints a greeting



In [6]:
def greeting():
    print("Hello from Python")

In [7]:
greeting()

Hello from Python


#### Now we will re-create our mean calculation from above as a function, so we will need to put something in the parentheses to tell Python our function should have an input

In [10]:
def mean(values):
    round(sum(values)/len(values),1)

In [11]:
mean(Experiment_1)

#### What went wrong? Why doesn't our function work?
You need to tell python we want to "return" the value of our calculation.

To do this we need to use the return statement

In [309]:
def mean(values):
    return round(sum(values)/len(values),1)


In [310]:
mean(Experiment_1)

33.8
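
To make the difference visible, the sketch below prints what each version hands back. The two function names are made up for this comparison so that both versions can exist side by side:

```python
def mean_no_return(values):
    round(sum(values) / len(values), 1)          # the result is calculated, then thrown away

def mean_with_return(values):
    return round(sum(values) / len(values), 1)   # the result is handed back to the caller

Experiment_1 = (35, 37, 42, 25, 26, 38)

print(mean_no_return(Experiment_1))     # None
print(mean_with_return(Experiment_1))   # 33.8
```

A function without a return statement always gives back None, which is why the notebook showed no output for the first version.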

#### Functions can return multiple values, for example

In [311]:
def mean(values):
    x = round(sum(values)/len(values),1)
    y = len(values)
    return x, y

In [312]:
mean(Experiment_1)

(33.8, 6)

In [313]:
mean(Experiment_2)

(30.6, 5)
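
The two returned values arrive together in parentheses (a tuple). They can be split into separate variables with the same comma notation we used earlier for `start, end = 1, 10`. A minimal sketch, using the made-up name `mean_and_count` for the function:

```python
def mean_and_count(values):
    average = round(sum(values) / len(values), 1)
    return average, len(values)

Experiment_2 = (32, 37, 32, 24, 28)

average, n = mean_and_count(Experiment_2)   # one variable per returned value

print(average)   # 30.6
print(n)         # 5
```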

### Exercise

Take our reverse complement dictionary and loop from above and turn it into a new function rev_comp( ) that takes one input called DNAstring.


In [314]:
base_pair_dict = {'a' : 't', 
                  't' : 'a', 
                  'g' : 'c', 
                  'c' : 'g'}

In [315]:
###so now reverse complement our DNA sequence
reverse_comp_DNAseq = ''

for base in DNAseq[::-1]:
    paired_base = base_pair_dict[base]
    reverse_comp_DNAseq += paired_base
    
print(reverse_comp_DNAseq, 'is the reverse complement of', DNAseq)

ttagctttccacagctgagttcgcatgggtatacggataggtatggccaatcat is the reverse complement of atgattggccatacctatccgtatacccatgcgaactcagctgtggaaagctaa


In [316]:
def rev_comp(DNAstring):
    reverse_comp_DNAseq=""
    base_pair_dict = {'a' : 't', 
                  't' : 'a', 
                  'g' : 'c', 
                  'c' : 'g'}
    for base in DNAstring[::-1]:
        paired_base = base_pair_dict[base]
        reverse_comp_DNAseq += paired_base
    return reverse_comp_DNAseq

In [317]:
rev_comp("atgctt")

'aagcat'

In [318]:
rev_comp(DNAseq)

'ttagctttccacagctgagttcgcatgggtatacggataggtatggccaatcat'
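
For comparison, the same result can be reached without a loop by using two string methods, `str.maketrans( )` and `.translate( )`. This is an alternative sketch, not the solution used in this lesson, and the function name `rev_comp_translate` is made up:

```python
def rev_comp_translate(DNAstring):
    # build a translation table that maps a->t, t->a, g->c and c->g
    table = str.maketrans('atgc', 'tacg')
    # complement every base, then reverse the result with a slice
    return DNAstring.translate(table)[::-1]

print(rev_comp_translate('atgctt'))   # aagcat
```

One difference in behaviour: characters that are not in the table (for example an "n") are passed through unchanged here, whereas the dictionary version above stops with a KeyError.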