# Python Recap

In this notebook you'll find a recap of the most important Python programming essentials. 

## Variable types, and assigning values to variables

Assign a value to a variable, like this:

In [2]:
answer = 42
type(answer)

int

Note that `type()` is a built-in function. Python has a relatively small number of built-in functions, which include `len()`, `open()`, etc. (In Python 2, `print` is a statement; it became a built-in function in Python 3.)

<a href=https://en.wikipedia.org/wiki/42_(number)#The_Hitchhiker.27s_Guide_to_the_Galaxy>42</a> , really....? 🙄

To verify that the value of the variable `answer` is indeed 42:

In [4]:
print answer

42


In [5]:
GC_perc = 36.5
type(GC_perc)

float

In [6]:
DNA = 'ATGCAG'
type(DNA)

str

In [7]:
DNA = u'AGGCAG'
type(DNA)

unicode

This last one was just for good measure. Note that in Python 3 every text string is Unicode by default, so the `u` prefix is not needed.

## Python operations
Python has a complete set of mathematical and logical operators, such as `+, -, *, **, /, //` and many more. Although these operations look straightforward, their result depends on the types of the operands, as the cells below show.

In [8]:
# let's create a few variables
my_int1 = 4
my_int2 = 5
my_float = 4.0
my_string = 'abc'


In [9]:
# addition
my_int1 + my_int2

9

In [10]:
# addition; What is different from the previous addition?
my_int2 + my_float

9.0

In [11]:
# multiplication
my_int1 * my_int2

20

In [12]:
# exponentiation
my_int1 ** my_int2

1024

In [13]:
# division; note the Python 2-specific behaviour: int / int floors the result (5 / 4 gives 1, not 1.25)
my_int2 / my_int1

1

In [14]:
# division; 
my_int2 / my_float

1.25

In [15]:
# floor division; the result is floored but stays a float because one operand is a float
my_int2 // my_float

1.0

In [16]:
# multiplication; string context
my_string * my_int1

'abcabcabcabc'
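
The integer division cell above is the one that surprises people most. As a minimal sketch (the names `a` and `b` are placeholders mirroring `my_int2` and `my_int1`), these are ways to ask for the kind of division you actually want. The calls below behave the same under Python 2 and Python 3:

```python
# Placeholder values mirroring my_int2 and my_int1 from the cells above
a = 5
b = 4

true_quotient = float(a) / b     # convert one operand to float to get true division
floor_quotient = a // b          # floor division, regardless of the Python version
whole, remainder = divmod(a, b)  # floor quotient and remainder in one call

print(true_quotient)                   # 1.25
print(floor_quotient)                  # 1
print('%d %d' % (whole, remainder))    # 1 1
```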


## Objects, such as variables, have built-in methods

The number of built-in functions in Python is limited because the preferred way of doing things with an object is to use object-specific methods. This has a few great advantages. Among the most important ones: if you are working with some kind of object, whether a simple one such as a variable or something more complex, you can just ask it: "What can I do with you?" The built-in `dir()` does exactly that.

In [17]:
DNA = 'ATGCAG'
dir(DNA)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getslice__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_formatter_field_name_split',
 '_formatter_parser',
 'capitalize',
 'center',
 'count',
 'decode',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'index',
 'isalnum',
 'isalpha',
 'isdigit',
 'islower',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
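
Most of the names in that listing start with an underscore and are meant for Python's internal use. A small sketch of how you might filter the list down to the methods intended for everyday use (the variable name `public_methods` is just an illustration):

```python
DNA = 'ATGCAG'

# Keep only the names that do not start with an underscore
public_methods = [name for name in dir(DNA) if not name.startswith('_')]
print(public_methods)
```

To read what a single method does, `help(DNA.count)` prints its documentation.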

In [18]:
DNA = 'ATGCAG'
DNA.lower()

'atgcag'

In [19]:
DNA

'ATGCAG'

In [20]:
DNA.count('A')

2

## Printing strings, variables


In [21]:
DNA = 'ATGCAGA'
GC_perc = 42.9
print DNA
print GC_perc

ATGCAGA
42.9


In [22]:
print 'The DNA sequence %s has gc content %f' % (DNA,GC_perc)

The DNA sequence ATGCAGA has gc content 42.900000


In [23]:
print 'The DNA sequence %s has gc content %.1f' % (DNA,GC_perc)

The DNA sequence ATGCAGA has gc content 42.9
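
The `%` operator is not the only way to format strings. As a sketch, the same message can be built with the `str.format()` method, which is available in Python 2.7 and Python 3 (the variable name `message` is only for illustration):

```python
DNA = 'ATGCAGA'
GC_perc = 42.9

# {} is replaced by the next argument; {:.1f} formats a float with one decimal
message = 'The DNA sequence {} has gc content {:.1f}'.format(DNA, GC_perc)
print(message)  # The DNA sequence ATGCAGA has gc content 42.9
```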


## Indexes and slicing of strings

In [24]:
DNA = 'ATGCAG'
DNA[0]

'A'

In [25]:
DNA[0:3]

'ATG'

In [26]:
DNA[2:]

'GCAG'

In [27]:
DNA[::-1]

'GACGTA'

In [28]:
DNA[1] = 'U' # strings are 'immutable'!

TypeError: 'str' object does not support item assignment
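
Because strings are immutable, the usual workaround is to build a new string from slices of the old one. A minimal sketch (the names `new_DNA` and `codons` are placeholders), which also shows slicing with a step to cut a sequence into codons:

```python
DNA = 'ATGCAG'

# Build a new string: everything before index 1, the new base, everything after index 1
new_DNA = DNA[:1] + 'U' + DNA[2:]
print(new_DNA)  # AUGCAG
print(DNA)      # ATGCAG, the original is unchanged

# Slices of length 3, starting at index 0, 3, ...
codons = [DNA[i:i + 3] for i in range(0, len(DNA), 3)]
print(codons)   # ['ATG', 'CAG']
```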

## Lists

In [29]:
DNA = 'ATGCAC'
list(DNA)

['A', 'T', 'G', 'C', 'A', 'C']

In [30]:
DNA_list = list(DNA)
type(DNA_list)

list

In [31]:
DNA_list[1] = 'U' # This works! lists are 'mutable', i.e. can be changed in-place
DNA = ''.join(DNA_list) # join on empty string. '.join' is a string method
type(DNA)

str

In [32]:
DNA

'AUGCAC'

In [33]:
codons = ['ATG','CAC']
type(codons)

list

In [34]:
codons[1]

'CAC'

In [35]:
codons.insert(1,'GGG') # again, lists are 'mutable'
codons

['ATG', 'GGG', 'CAC']

In [36]:
del codons[1] # 'del' is a statement, so no parentheses are needed
# This would also work:
# codons = codons[:1]+codons[2:]
# or
# codons.pop(1)
codons

['ATG', 'CAC']

In [37]:
codons[0] = 'AUG'
codons

['AUG', 'CAC']

A few words on tuples: they look like lists and mostly behave like lists, but they are not mutable (i.e., they can't be changed in-place).

In [38]:
codons = ('ATG','CAC')
type(codons)

tuple

In [39]:
codons[0] = 'AUG'

TypeError: 'tuple' object does not support item assignment
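
Tuples can still be read, unpacked and combined into new tuples. A short sketch (the names `first`, `second` and `rna_codons` are made up for this example):

```python
codons = ('ATG', 'CAC')

# Tuple unpacking: one variable per element
first, second = codons
print(first)   # ATG
print(second)  # CAC

# "Changing" a tuple means building a new one; note the comma in the one-element tuple
rna_codons = ('AUG',) + codons[1:]
print(rna_codons)  # ('AUG', 'CAC')
```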

## Dictionaries
Of all the built-in variable types that Python provides (well, of the ones that you are likely to use on a regular basis), dictionaries are probably the most challenging to learn.
The key concept is the lookup nature of a dictionary: it uses unique keys to retrieve values, much like looking up a word in a real dictionary.

In [40]:
codon_table = {'ATG':'M','CAC':'H'}
type(codon_table)

dict

In [41]:
len(codon_table)

2

In [42]:
codon_table['ATG']

'M'

In [43]:
codon_table['TAG'] = '*'
codon_table

{'ATG': 'M', 'CAC': 'H', 'TAG': '*'}

In [44]:
len(codon_table)

3

In [45]:
codon_table.keys()

['CAC', 'ATG', 'TAG']

In [46]:
codon_table.values()

['H', 'M', '*']

In [47]:
codon_table['UUU'] = 'F'
len(codon_table)

4
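
Looking up a key that is not in the dictionary with square brackets raises a `KeyError`. A small sketch of two ways to guard against that; the fallback value `'X'` is an assumption chosen for this example, not a convention from this notebook:

```python
codon_table = {'ATG': 'M', 'CAC': 'H', 'TAG': '*'}

# Membership test on the keys
has_start = 'ATG' in codon_table
has_gly = 'GGG' in codon_table
print(has_start)  # True
print(has_gly)    # False

# .get() returns the fallback instead of raising KeyError
aa = codon_table.get('GGG', 'X')
print(aa)         # X
```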

## Logic and tests
True or False?

In [48]:
DNA = 'AAGT'
len(DNA)

4

In [49]:
len(DNA) < 4

False

In [50]:
if len(DNA) < 4:
    print 'DNA strand %s has fewer than 4 bases' % (DNA)

In [51]:
if len(DNA) < 4:
    print 'DNA strand %s has fewer than 4 bases' % (DNA)
else: 
    print 'DNA strand %s has 4 or more bases' % (DNA)

DNA strand AAGT has 4 or more bases
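
When there are more than two cases, `elif` chains extra tests between `if` and `else`. A minimal sketch (the variable `label` is only for illustration):

```python
DNA = 'AAGT'

# Only the first branch whose test is True is executed
if len(DNA) < 4:
    label = 'fewer than 4'
elif len(DNA) == 4:
    label = 'exactly 4'
else:
    label = 'more than 4'

print('DNA strand %s has %s bases' % (DNA, label))  # DNA strand AAGT has exactly 4 bases
```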


## Loops

In [52]:
DNA = 'ATGCAC'
for nuc in DNA:
    print nuc

A
T
G
C
A
C


In [53]:
for i in range(len(DNA)):
    print DNA[i]

A
T
G
C
A
C


In [54]:
for i in range(len(DNA)):
    print i,DNA[i]

0 A
1 T
2 G
3 C
4 A
5 C


In [55]:
i = 0
while i < len(DNA):
    print DNA[i]
    i += 1

A
T
G
C
A
C


In [56]:
for c in codon_table.keys():
    print "%s: %s" % (c, codon_table[c])

CAC: H
ATG: M
TAG: *
UUU: F


In [57]:
for c in sorted(codon_table.keys()):
    print "%s: %s" % (c, codon_table[c])

ATG: M
CAC: H
TAG: *
UUU: F
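
The `range(len(...))` and `.keys()` loops above work fine, but Python also offers shortcuts that hand you the index or the value directly. A sketch using the built-in `enumerate()` and the dictionary method `.items()` (the small `codon_table` here is redefined so the cell stands on its own):

```python
DNA = 'ATGCAC'

# enumerate() yields (index, element) pairs
indexed = list(enumerate(DNA))
for i, nuc in indexed:
    print('%d %s' % (i, nuc))

codon_table = {'ATG': 'M', 'CAC': 'H', 'TAG': '*', 'UUU': 'F'}

# .items() yields (key, value) pairs; sorted() orders them by key
pairs = sorted(codon_table.items())
for codon, aa in pairs:
    print('%s: %s' % (codon, aa))
```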


## Reading files

In [58]:
%%bash
ls -l

total 21504
drwxr-xr-x 2 lin029 domain users        0 Sep  5 16:38 AppData
drwxr-xr-x 2 lin029 domain users        0 Sep  4 12:10 AppData2
drwxr-xr-x 2 lin029 domain users        0 Sep 22 13:09 bio
-rwxr-xr-x 1 lin029 domain users 20318373 Sep 20 12:05 Marketa Zveibil & Jeremy O. Baum - Understanding Bioinformatics.pdf
drwxr-xr-x 2 lin029 domain users        0 Sep 25 09:31 MMap
dr-xr-xr-x 2 lin029 domain users        0 Sep 25 08:43 My Documents
-rwxr-xr-x 1 lin029 domain users    22804 Sep 25 14:18 W4D1 - Python Recap.ipynb
drwxr-xr-x 2 lin029 domain users        0 Sep 25 13:36 week


In [59]:
codon_table = dict()
len(codon_table)

0

In [61]:
codon_file = open("codon_table.txt", 'r') # creates a filehandle object

In [62]:
for line in codon_file:
    line = line.strip('\n')
    (codon,aa) = line.split('\t')
    codon_table[codon] = aa
    
codon_file.close()
codon_table

{'AAA': 'K',
 'AAC': 'N',
 'AAG': 'K',
 'AAU': 'N',
 'ACA': 'T',
 'ACC': 'T',
 'ACG': 'T',
 'ACU': 'T',
 'AGA': 'R',
 'AGC': 'S',
 'AGG': 'R',
 'AGU': 'S',
 'AUA': 'I',
 'AUC': 'I',
 'AUG': 'M',
 'AUU': 'I',
 'CAA': 'Q',
 'CAC': 'H',
 'CAG': 'Q',
 'CAU': 'H',
 'CCA': 'P',
 'CCC': 'P',
 'CCG': 'P',
 'CCU': 'P',
 'CGA': 'R',
 'CGC': 'R',
 'CGG': 'R',
 'CGU': 'R',
 'CUA': 'L',
 'CUC': 'L',
 'CUG': 'L',
 'CUU': 'L',
 'GAA': 'E',
 'GAC': 'D',
 'GAG': 'E',
 'GAU': 'D',
 'GCA': 'A',
 'GCC': 'A',
 'GCG': 'A',
 'GCU': 'A',
 'GGA': 'G',
 'GGC': 'G',
 'GGG': 'G',
 'GGU': 'G',
 'GUA': 'V',
 'GUC': 'V',
 'GUG': 'V',
 'GUU': 'V',
 'UAA': '*',
 'UAC': 'Y',
 'UAG': '*',
 'UAU': 'Y',
 'UCA': 'S',
 'UCC': 'S',
 'UCG': 'S',
 'UCU': 'S',
 'UGA': '*',
 'UGC': 'C',
 'UGG': 'W',
 'UGU': 'C',
 'UUA': 'L',
 'UUC': 'F',
 'UUG': 'L',
 'UUU': 'F'}
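
The same reading loop can be written with a `with` block, which closes the file for you, even if an error occurs halfway. This is a sketch under assumptions: it first writes a tiny two-line demo file (`codon_table_demo.txt`, a made-up name) to a temporary directory so that it runs without the real `codon_table.txt`.

```python
import os
import tempfile

# Create a small demo file so this sketch is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'codon_table_demo.txt')
with open(path, 'w') as out:
    out.write('AUG\tM\nCAC\tH\n')

codon_table = {}
with open(path) as codon_file:      # the file is closed when the block ends
    for line in codon_file:
        line = line.rstrip('\n')
        if not line:                # skip empty lines
            continue
        codon, aa = line.split('\t')
        codon_table[codon] = aa

print(codon_table)
print(codon_file.closed)  # True
```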

In [63]:
codon_file = open("codon_table.txt", 'r')
cf_RNA = open("RNA_codons.txt",'w')
for line in codon_file:
    line = line.strip('\n')
    (codon,aa) = line.split('\t')
    c = codon.replace('T','U') # note, codon is string: immutable, no in-place replacement!
    cf_RNA.write("%s\t%s\n" % (c, aa)) # write to file
    print "%s\t%s" % (c, aa) # print to screen
    
codon_file.close()
cf_RNA.close()

GGC	G
GCG	A
CCG	P
ACC	T
CCA	P
CAG	Q
UAU	Y
UCC	S
CCC	P
GCC	A
UCA	S
AGC	S
CGU	R
CGA	R
UAG	*
AGG	R
AAG	K
UUU	F
UUA	L
GUA	V
AUG	M
UGC	C
CGG	R
GUC	V
CUC	L
CAA	Q
GGG	G
GAA	E
GUG	V
UGA	*
AAU	N
ACG	T
AGA	R
AAA	K
CUG	L
CUU	L
AUC	I
AUU	I
GAU	D
ACU	T
GCU	A
CAC	H
UUG	L
AAC	N
AGU	S
UAC	Y
GAG	E
GUU	V
UGU	C
UUC	F
CUA	L
CAU	H
UCU	S
CGC	R
CCU	P
GGA	G
UAA	*
GAC	D
GGU	G
GCA	A
UGG	W
ACA	T
UCG	S
AUA	I


## Functions

In [67]:
def calc_GC_content(dna):
    GC = dna.count("G") + dna.count("C")
    GC_content = 1.0 * GC/len(dna) # multiply by 1.0 to force float division; in Python 2, int / int truncates
    return GC_content

In [69]:
def transcribe(dna):
    return dna.replace('T','U')

In [70]:
# main
DNA = "ATGCACTAG"
GC_content = calc_GC_content(DNA)
RNA = transcribe(DNA)

print '%.2f' % GC_content
print RNA

0.44
AUGCACUAG
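
A slightly more defensive variant of `calc_GC_content`, as a sketch. The name `calc_gc_content_safe` and the choice to return `0.0` for an empty sequence are assumptions made for this example; raising an error would be an equally reasonable choice.

```python
def calc_gc_content_safe(dna):
    """Return the fraction of G and C bases in dna (0.0 for an empty string)."""
    if len(dna) == 0:
        return 0.0                    # assumption: avoid dividing by zero
    gc = dna.count('G') + dna.count('C')
    return float(gc) / len(dna)       # float() gives true division in Python 2 and 3

print('%.2f' % calc_gc_content_safe('ATGCACTAG'))  # 0.44
print(calc_gc_content_safe(''))                    # 0.0
```

The docstring (the text between triple quotes) is what `help(calc_gc_content_safe)` shows.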


## Importing and using modules

In [71]:
import re
dir(re)

['DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 '__version__',
 '_alphanum',
 '_cache',
 '_cache_repl',
 '_compile',
 '_compile_repl',
 '_expand',
 '_locale',
 '_pattern_type',
 '_pickle',
 '_subx',
 'compile',
 'copy_reg',
 'error',
 'escape',
 'findall',
 'finditer',
 'match',
 'purge',
 'search',
 'split',
 'sre_compile',
 'sre_parse',
 'sub',
 'subn',
 'sys',
 'template']

In [72]:
re.sub(r'(\wll)','all', 'hello, hello!', 1) # raw string (r'...') keeps the backslash intact for the regex engine

'hallo, hello!'

In [73]:
import random
random.random()

0.7583085836493165

In [74]:
somelist = ['a','b','c','d','e']
random.choice(somelist)

'c'

Creating a random DNA string:

In [75]:
bases = ['A','C','G','T']
dna = ''
for i in range(40):
    dna = dna + random.choice(bases)
print dna

GTATAGAAAGCAGGAAGCTCGGGTAACAGAGATTGATATC
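
Since strings are immutable, the loop above creates a new string on every pass. A common alternative, sketched here, collects the random bases and joins them once. The seed value 42 is an arbitrary choice that makes the result repeatable between runs; leave the `random.seed()` line out to get a different sequence each time.

```python
import random

random.seed(42)  # arbitrary seed, only to make the example repeatable
bases = 'ACGT'   # random.choice() also works on a string

# Pick 40 random bases and join them on the empty string
dna = ''.join(random.choice(bases) for _ in range(40))
print(dna)
```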


The `sys` module lets a Python script capture command-line parameters through `sys.argv`. This is similar to shell scripts, which capture parameters as `$1`, `$2`, and so on; `argv[0]` holds the script name, like `$0`.

In [None]:
import re
from sys import argv

pattern = argv[1]
filename = argv[2]

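with open(filename) as handle: # 'handle' avoids shadowing the built-in 'file'; 'with' closes the file
    for line in handle:
        if re.search(pattern, line):
            print line, # trailing comma: the line already ends with a newline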

# python grep.py ATG codon_table.txt
# ATG M
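
If the script is started with too few parameters, `argv[1]` raises an `IndexError`. A sketch of how the script could check its parameters first; the function names `grep_lines` and `main` and the usage message are made up for this example:

```python
import re
import sys

def grep_lines(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    return [line for line in lines if re.search(pattern, line)]

def main(argv):
    # argv[0] is the script name, so two parameters means three items
    if len(argv) != 3:
        sys.stderr.write('usage: python grep.py PATTERN FILENAME\n')
        return 1
    pattern, filename = argv[1], argv[2]
    with open(filename) as handle:
        for line in grep_lines(pattern, handle):
            sys.stdout.write(line)
    return 0

# In a real script, the last line would be: sys.exit(main(sys.argv))
# Here the matching function is tried on a small list instead of a file
matches = grep_lines('ATG', ['ATG\tM\n', 'CAC\tH\n'])
print(matches)
```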