# Python

Dynamically typed language created by Guido van Rossum (BDFL) and first released in 1991.

- code blocks are defined by INDENTATION
- everything is an object
- everything is a reference
- first class functions
- garbage collected
- batteries included

OBJECT:
> A data structure that holds both data and actions that can be performed on these data

REFERENCE:
> A pointer to specific memory location that holds the actual value (object)

FIRST-CLASS FUNCTIONS:
> Functions (reusable blocks of code that compute something) can be manipulated as any other object
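
A minimal sketch of what "manipulated as any other object" means in practice. The names `shout`, `apply_twice` and `handlers` are made up for illustration:

```python
def shout(text):
    return text.upper() + '!'

def apply_twice(func, value):
    # func is an ordinary argument that happens to be callable
    return func(func(value))

yell = shout                              # a second name for the same function object
handlers = {'loud': shout, 'size': len}   # functions stored in a dict

print(yell('hi'))                  # HI!
print(apply_twice(shout, 'hi'))    # HI!!
print(handlers['size']('abc'))     # 3
```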

BATTERIES:
> Python's standard library (things that come with the language) and docs _are_ tremendous resources

To save later aggravation, set your editor to indent with 4 spaces and _NEVER EVER_ mix tabs and spaces.

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### The takehome message

Python was _designed_ to be a concise, high-level, multi-paradigm language. You can write Java- or C++-like code in Python, but it'll look ugly. There is a _pythonic_ way of doing things. Often it _is_ the best way, but not always.

The tradeoffs:

- speed
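- lack of easy (true) concurrency: in CPython the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time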

## Code is written once but read many times

### You're collaborating with yourself in the future

>**Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.**

_John F Woods_

## Let's dive right in

Python is an interpreted language. When you launch it you find yourself in an interactive Python shell which, like `bash`, reads a command, executes it, and prints the result.

## Types

### Numerical types

### `int` - integer

In [2]:
a = 1
a

1

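### Integer division gotcha (Python 2 specific)

In [1]:
# In Python 2, `/` between two ints floors the result (1 / 2 == 0).
# This import switches `/` to true division, as in Python 3.
from __future__ import division, print_function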

In [10]:
b = 2

a / b

0.5

In [4]:
a / float(b)

0.5

In [5]:
float(a) / b

0.5
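
A short sketch of the two division operators side by side. It assumes Python 3, or Python 2 with the `from __future__ import division` shown above already in effect:

```python
a, b = 1, 2

true_div = a / b           # true division: always a float
floor_div = a // b         # floor division: what Python 2's plain `/` does for two ints
neg_floor = -7 // 2        # floors toward negative infinity, not toward zero
quot, rem = divmod(7, 2)   # quotient and remainder in one call

print(true_div, floor_div, neg_floor, quot, rem)   # 0.5 0 -4 3 1
```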

### `float` - floating point (real) number

In CPython this is a 64-bit IEEE 754 double:

1-bit sign + 11-bit exponent + 52-bit fraction (53 bits of precision, counting the implicit leading bit)

In [6]:
# Float can be written in two ways
f = 1.23456
g = 1e-8
f + g

1.23456001
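
The finite precision has visible consequences. A small sketch, assuming the usual IEEE 754 double behind `float`; the `1e-9` tolerance is an arbitrary choice for the example:

```python
import sys

print(sys.float_info.mant_dig)   # 53 bits of precision
print(sys.float_info.epsilon)    # gap between 1.0 and the next representable float

total = 0.1 + 0.2
exactly_equal = (total == 0.3)          # False: neither 0.1 nor 0.2 is exactly representable
close = abs(total - 0.3) < 1e-9         # compare with a tolerance instead
print(exactly_equal, close)             # False True
```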

### `complex` - complex number

In [7]:
c = 1j # same as (0 + 1j)
i = 1.092 - 2.98003j
c - i

(-1.092+3.98003j)

### Sequence (iterable) types

### `str` - string

In [2]:
s = 'I am a Python string. I can be "manipulated" in many ways.'
s

'I am a Python string. I can be "manipulated" in many ways.'

In [3]:
seq = '>RNA\tAGUUCGAGGCUUAAACGGGCCUUAU'
seq

'>RNA\tAGUUCGAGGCUUAAACGGGCCUUAU'

In [4]:
print(seq)

>RNA	AGUUCGAGGCUUAAACGGGCCUUAU


In [5]:
len(seq)

30

In [6]:
print(s[0])
print(s[5])

I
a


In [7]:
# Strings are immutable!!!
s[3] = 'n'

TypeError: 'str' object does not support item assignment
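
Since a string cannot be changed in place, the usual approach is to build a new one. A minimal sketch with a shortened example string:

```python
s = 'I am a Python string.'

t = s[:3] + 'n' + s[4:]            # slice around index 3 and splice in the new character
u = s.replace('Python', 'short')   # replace() also returns a new string

print(t)   # I an a Python string.
print(u)   # I am a short string.
print(s)   # I am a Python string.  (the original is untouched)
```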

In [8]:
# Let's have a look inside:
print(dir(s))

['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']


In [9]:
s.split()

['I',
 'am',
 'a',
 'Python',
 'string.',
 'I',
 'can',
 'be',
 '"manipulated"',
 'in',
 'many',
 'ways.']

In [10]:
s.split('p')

['I am a Python string. I can be "mani', 'ulated" in many ways.']

In [11]:
s.upper()

'I AM A PYTHON STRING. I CAN BE "MANIPULATED" IN MANY WAYS.'

In [12]:
seq.split()

['>RNA', 'AGUUCGAGGCUUAAACGGGCCUUAU']

In [13]:
seq

'>RNA\tAGUUCGAGGCUUAAACGGGCCUUAU'

In [14]:
# Unpacking
name, seq = seq.split()
print('Name: {}\tSequence: {}'.format(name,seq))

Name: >RNA	Sequence: AGUUCGAGGCUUAAACGGGCCUUAU


In [15]:
s = '               i am  a really    stupid    long string   '.strip()
s.strip('i')

' am  a really    stupid    long string'

In [16]:
sl = s.split()
print(' '.join(sl))

i am a really stupid long string


### `list` - (ordered) collection of elements

In [17]:
sl, len(sl)

(['i', 'am', 'a', 'really', 'stupid', 'long', 'string'], 7)

In [21]:
sl[2:]

['a', 'really', 'stupid', 'long', 'string']
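
A few more slice forms, sketched on a copy of the same word list (the general shape is `[start:stop:step]`, with `stop` excluded; strings slice the same way):

```python
words = ['i', 'am', 'a', 'really', 'stupid', 'long', 'string']

first_two = words[:2]          # ['i', 'am']
last_two = words[-2:]          # ['long', 'string']
middle = words[2:5]            # ['a', 'really', 'stupid']
every_other = words[::2]       # ['i', 'a', 'stupid', 'string']
backwards = words[::-1]        # reversed copy; words itself is unchanged

print(first_two, last_two, middle, every_other, backwards)
```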

In [22]:
print(dir(sl))

['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']


In [24]:
sl.append('another')     # add to the end
sl.insert(2, 'another')  # insert before index 2
sl

['i', 'am', 'another', 'a', 'really', 'stupid', 'long', 'string', 'another']

In [25]:
sl.extend(['another', 'item'])
sl

['i',
 'am',
 'another',
 'a',
 'really',
 'stupid',
 'long',
 'string',
 'another',
 'another',
 'item']

In [32]:
# Lists are mutable!!!
lst = ['item', 'another item']
lst1 = [item.upper() for item in lst]
print(lst1)

lst.append('add to lst')
print(lst1)

['ITEM', 'ANOTHER ITEM']
['ITEM', 'ANOTHER ITEM']
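
The list comprehension above builds a brand-new list, which is why `lst1` did not change. Plain assignment behaves differently because everything is a reference. A small sketch (the names `original`, `alias` and `copy` are illustrative):

```python
original = ['item', 'another item']
alias = original           # second name for the SAME list object
copy = list(original)      # new list with the same elements (shallow copy)

original.append('add to lst')

print(alias)   # ['item', 'another item', 'add to lst']
print(copy)    # ['item', 'another item']
print(alias is original, copy is original)   # True False
```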


In [30]:
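# (This cell was run before lst was redefined above; at that point lst was
# ['item', 1, 2, 'another item', 'add to lst'].)
lst.append('YAY')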
print(lst)
print(lst1)

['item', 1, 2, 'another item', 'add to lst', 'YAY']
['item', 1, 2, 'another item']


In [28]:
# Lists are ordered and iterable:
lst1 = []
for item in lst:
    lst1.append(item)
    
print(lst1)

['item', 1, 2, 'another item', 'add to lst']


In [34]:
for c in seq:
    print(c)

A
G
U
U
C
G
A
G
G
C
U
U
A
A
A
C
G
G
G
C
C
U
U
A
U


In [33]:
# Don't do this:
for i in range(len(lst)):
    print(lst[i])
    
# If you need index of each element:
for ind,item in enumerate(lst):
    print(ind, item)

item
another item
add to lst
0 item
1 another item
2 add to lst


In [35]:
# To check that an element is in list:

print('item' in lst)
print('Item' in lst)

if 'item' in lst:
    print('Item in list!')
else:
    print('Not in list')

True
False
Item in list!


### `tuple` is an immutable list

In [36]:
t = ('name', 'position')
name, position = t
print(name, position)

name position
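
A quick sketch of the immutability and of the packing/unpacking idiom, using a made-up `point` tuple:

```python
point = (3, 4)

try:
    point[0] = 10              # tuples reject item assignment, like strings
except TypeError as err:
    error_message = str(err)
print(error_message)

x, y = point                   # unpacking
x, y = y, x                    # swap: the right-hand side is packed into a tuple first
print(x, y)                    # 4 3
```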


In [38]:
# zip pairs items up as tuples and stops at the shorter sequence (r1 has 11 items, r2 has 10):
r1 = range(11)
r2 = range(0, 100, 10)
for i1,i2 in zip(r1,r2):
    print(i1,i2)

0 0
1 10
2 20
3 30
4 40
5 50
6 60
7 70
8 80
9 90


DIFFERENCE FROM ARRAYS:
>Elements of list do not need to be of the same type

### `set` - unordered collection of unique elements

In [39]:
sl

['i',
 'am',
 'another',
 'a',
 'really',
 'stupid',
 'long',
 'string',
 'another',
 'another',
 'item']

In [40]:
set(sl)

{'a', 'am', 'another', 'i', 'item', 'long', 'really', 'string', 'stupid'}
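
Sets also support the usual set algebra. A minimal sketch on two small word sets (results are sorted only to make the printed order predictable):

```python
a = set(['i', 'am', 'a', 'string'])
b = set(['a', 'long', 'string'])

common = sorted(a & b)       # intersection
either = sorted(a | b)       # union
only_a = sorted(a - b)       # difference
one_side = sorted(a ^ b)     # symmetric difference

print(common)     # ['a', 'string']
print(either)     # ['a', 'am', 'i', 'long', 'string']
print(only_a)     # ['am', 'i']
print(one_side)   # ['am', 'i', 'long']
```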

In [41]:
# Nice optimization trick:
huge_list = range(1000000)  # Python 2: range() returns a list
huge_set = set(huge_list)

def check_list(elem, lst=None):
    if lst is None:
        lst = huge_list  # fall back to the module-level (global) list
    return elem in lst   # linear scan

def check_set(elem):
    return elem in huge_set  # hash lookup

# And let's refactor this into one function and talk about variable scopes

In [44]:
%time check_list(10000)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 276 µs


True

In [45]:
%time check_set(10000)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 14.1 µs


True
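
`%time` is an IPython magic. Outside the notebook, the standard `timeit` module gives a comparable measurement. A sketch with a smaller collection and the worst case for the list (looking up the last element); the sizes and repeat count are arbitrary, and actual timings depend on the machine. Note that on Python 3 a bare `range` object has fast integer membership, hence the explicit `list(...)`:

```python
import timeit

n = 100000
big_list = list(range(n))
big_set = set(big_list)
needle = n - 1    # last element: the list has to scan everything

list_time = timeit.timeit(lambda: needle in big_list, number=100)
set_time = timeit.timeit(lambda: needle in big_set, number=100)

print('list: {:.6f} s   set: {:.6f} s'.format(list_time, set_time))
```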

### `dict` - key/value pairs

Also known as a hash table or associative array; in general, a mapping type.

In [48]:
d = {'key1': 1, 'key2': 20, 'key3': 345}
d

{'key1': 1, 'key2': 20, 'key3': 345}

A key can be any *hashable* object. In practical terms that means immutable built-ins such as `str`, `int`, or a `tuple` of hashable items; a `list`, `set` or `dict` cannot be a key.
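
A small sketch of the difference, using a made-up `coords` dictionary:

```python
coords = {(0, 0): 'origin', (1, 0): 'east'}   # tuples are hashable: fine as keys

try:
    coords[[0, 1]] = 'north'                  # a list is mutable, hence unhashable
except TypeError as err:
    message = str(err)

print(message)          # unhashable type: 'list'
print(coords[(0, 0)])   # origin
```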

In [47]:
print(dir(d))

['__class__', '__cmp__', '__contains__', '__delattr__', '__delitem__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'has_key', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values', 'viewitems', 'viewkeys', 'viewvalues']


### `dict` is mutable!

In [49]:
d.pop('key1')
print(d)
# Note also that dict is unordered (in Python 2; since Python 3.7 a dict preserves insertion order)

{'key3': 345, 'key2': 20}


In [50]:
# Assignment
d['key1'] = 101
d['key2'] = 202
d

{'key1': 101, 'key2': 202, 'key3': 345}

In [51]:
# Membership test operates on keys
print('key3' in d)
print('key4' in d)

True
False


In [52]:
# Looping over key:value pairs

for key,val in d.items():
    print(key,val)

key3 345
key2 202
key1 101


### Working with files

In [None]:
# This works but is not good: the file stays open if an error occurs before close()

fh = open('myfile.txt', 'r')
for line in fh:
    pass  # Do something with line
fh.close()

In [None]:
# Better way (context manager)
import os.path
directory = 'dir1'
filename = 'file.txt'

with open(os.path.join(directory, filename), 'r') as fh:
    for line in fh:
        pass  # Do something with line
# Do something else. At this point fh.close() has been called automatically
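
A self-contained sketch of the same pattern that can be run anywhere: it writes a throwaway file into a temporary directory (the file name and contents are made up) and reads it back.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'file.txt')

with open(path, 'w') as fh:
    fh.write('first line\nsecond line\n')

with open(path, 'r') as fh:
    # Iterating over a file yields lines with the trailing newline still attached
    lines = [line.rstrip('\n') for line in fh]

print(lines)       # ['first line', 'second line']
print(fh.closed)   # True: the with-block closed the file on exit
```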

In [None]:
# We can also work with gzipped files
import gzip

with gzip.open('myfile.gz', 'rb') as fh: # Note 'rb' as file opening mode
    for line in fh:
        pass  # Do something with line

## Standard library

Python has an expansive standard library and comprehensive, accessible documentation, complete with examples. Some of the must-know modules:

- `collections`
- `itertools`
- `os` and particularly `os.path`
- `re`
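
One taste of each module, sketched on the RNA sequence used earlier. The specific calls are just examples of what these modules offer, not a recommended toolkit:

```python
import os.path
import re
from collections import Counter
from itertools import groupby

seq = 'AGUUCGAGGCUUAAACGGGCCUUAU'

base_counts = Counter(seq)                                    # collections: count things
runs = [(base, len(list(grp))) for base, grp in groupby(seq)] # itertools: consecutive runs
longest_run = max(runs, key=lambda run: run[1])               # first run of maximal length
gg_starts = [m.start() for m in re.finditer('GG', seq)]       # re: non-overlapping matches
stem, ext = os.path.splitext(os.path.basename('../data/rose.fa'))  # os.path: path pieces

print(base_counts['G'], longest_run, gg_starts, stem, ext)
```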

## Comparisons

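Comparison operators (`==`, `!=`, `<`, `is`, `in`, ...) normally return a `bool`. In a boolean context (`if`, `while`, `and`, `or`, `not`) any object is tested for its truth value.

The `bool` type is a subclass of `int` (`False ~ 0, True ~ 1`), mainly for historical reasons.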
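
Because `True` and `False` behave as 1 and 0, comparisons can be summed to count matches. A minimal sketch on a made-up sequence:

```python
is_int = isinstance(True, int)
two = True + True

bases = 'AGUUCGAGG'
n_g = sum(base == 'G' for base in bases)   # each comparison contributes 0 or 1

print(is_int, two, n_g)   # True 2 4
```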

In [55]:
a = 1
b = 0
if 0 < a < 2:
    print('a is True!')
if not b:
    print('b is False!')

a is True!
b is False!


In [56]:
s1 = 'Fox'
s2 = 'Dog'

# Equality check
if s1 != s2:
    print('s1 and s2 are different')
    
# Identity check
if s1 is s2:
    pass
else:
    print('s1 and s2 are not the same thing!')

s1 and s2 are different
s1 and s2 are not the same thing!


In [57]:
s2 = s1

# Identity check
if s1 is s2:
    print('s1 and s2 POINT to the same thing')
else:
    print('s1 and s2 are not the same thing!')

s1 and s2 POINT to the same thing


In [58]:
# Identity can be misleading for integers
b = 1
if a == b:
    print('a and b are equal!')
    
if a is b:
    print('a and b are the same object!')
    
# CPython implementation detail: integers from -5 to 256 are cached (singletons),
# so `is` happens to be True here. Use == to compare values.

a and b are equal!
a and b are the same object!


In [59]:
# Empty sequence types evaluate to False, non-empty ones to True
lst1 = []
lst2 = ['a']

if lst1 and lst2:
    print('Both lists are non-empty.')
else:
    print('One of the lists is empty.')

One of the lists is empty.


In [61]:
if lst1:
    pass
else:
    print('Empty')

Empty
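
A quick way to see which values count as false is to run a handful of them through `bool()`. The list of sample values below is an arbitrary selection:

```python
values = [0, 0.0, '', [], (), {}, set(), None, 1, -1, 'a', [0], (None,)]

for value in values:
    print(repr(value), bool(value))

falsy = [v for v in values if not v]
truthy = [v for v in values if v]

# Zero, empty containers and None are false; everything else here is true,
# including a list or tuple that merely CONTAINS a false value.
print(len(falsy), len(truthy))   # 8 5
```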


In [None]:
# Another singleton: test for it with `is`, e.g. `if a is None:`
a = None

## Example

In [62]:
!ls -lah ../data

total 191M
drwxrwxr-x 2 ilya ilya 4.0K Sep 30 18:05 .
drwxrwxr-x 5 ilya ilya 4.0K Sep 16 13:23 ..
-rw-rw-r-- 1 ilya ilya  24M Sep 30 08:54 7E2.pileup.gz
-rw-rw-r-- 1 ilya ilya  21M Sep 30 15:49 AAGCTA_R2.pileup.gz
-rw-rw-r-- 1 ilya ilya 146M Sep 29 12:04 BJ-HSR1_R1.fastq.gz
-rw-rw-r-- 1 ilya ilya 500K Sep 16 09:03 contigs.fasta
-rw-r----- 1 ilya ilya  644 Sep 16 09:00 dHSR1.fa
-rw-rw-r-- 1 ilya ilya 5.4K Sep 23 15:58 gradtimes.txt
-rw-rw-r-- 1 ilya ilya  445 Sep 16 09:00 hHSR-435.fa
-rw-rw-r-- 1 ilya ilya  611 Sep 16 09:00 hHSR.fa
-rw-rw-r-- 1 ilya ilya  126 Sep 16 09:04 rose.fa


In [70]:
import os
import sys
import csv
import gzip
import pandas as pd
csv.field_size_limit(sys.maxsize)

def parse_pileup(barcode, dirname='../results', track='minus', sample_id='minus'):
    '''
    Count read starts ('^' in the pileup details column) per position.
    Returns a DataFrame with columns 'pos', 'base' and the name given in `track`.
    '''
    filename = os.path.join(dirname, '{0}_{1}.pileup.gz'.format(barcode, sample_id))
    print(filename)
    with gzip.open(filename, 'rb') as pileup:
        reader = csv.DictReader(pileup,
                        delimiter='\t',
                        fieldnames=['seqname', 'pos', 'base', 'coverage', 'details', 'qual'])
        data = []
        for rec in reader:
            pos = int(rec['pos'])
            # NOTE: `last` is assigned before it is used, so it always refers to the current record
            last = rec
            # Read starts are attributed to the preceding position (pos - 1)
            if pos == 1:
                data.append({'pos': 0, 'base': '*',
                             track: rec['details'].count('^')})
            else:
                data.append({'pos': pos-1, 'base': last['base'],
                             track: rec['details'].count('^')})
    return pd.DataFrame.from_records(data)

In [71]:
df = parse_pileup('AAGCTA', dirname='../data', sample_id='R2')

../data/AAGCTA_R2.pileup.gz


In [72]:
df

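|   | base | minus | pos |
|---|------|-------|-----|
| 0 | * | 25 | 0 |
| 1 | C | 1 | 1 |
| 2 | T | 5 | 2 |
| 3 | C | 2 | 3 |
| 4 | G | 1 | 4 |
| 5 | C | 1 | 5 |
| 6 | T | 0 | 6 |
| 7 | A | 21 | 7 |
| 8 | T | 2 | 8 |
| 9 | G | 5 | 9 |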
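
The core of `parse_pileup` can be tried without the real data. The sketch below feeds `csv.DictReader` three hand-written, hypothetical pileup lines from memory and counts the `^` markers in the details column. It assumes Python 3 (`io.StringIO` with `str` input) and leaves out the gzip and pandas steps:

```python
import csv
import io

# Hypothetical tab-separated pileup excerpt, NOT taken from the real files
sample = (
    'chr1\t1\tA\t3\t^I.^I.^I.\tIII\n'
    'chr1\t2\tC\t3\t...\tIII\n'
    'chr1\t3\tG\t4\t...^I.\tIIII\n'
)

fields = ['seqname', 'pos', 'base', 'coverage', 'details', 'qual']
reader = csv.DictReader(io.StringIO(sample), delimiter='\t', fieldnames=fields)

# One (position, base, number of '^' markers) tuple per input line
starts = [(int(rec['pos']), rec['base'], rec['details'].count('^')) for rec in reader]
print(starts)   # [(1, 'A', 3), (2, 'C', 0), (3, 'G', 1)]
```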


## Assignment

Summarize the E. coli `.gff` file in Python (for example, count how many features of each type it contains).

EXTRA CREDIT:
>Find all occurrences of the sequence `GGGGCGGGGG` in the E. coli genome

EXTRA EXTRA CREDIT:
>Find all _almost_ exact occurrences of the same sequence in the E. coli genome (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz)

In [32]:
import gzip
from collections import Counter

foo = 'bar'

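def column_stat(filename, col=2, sep='\t'):
    '''Count the distinct values in column `col` of a gzipped, `sep`-delimited file.'''
    counter = Counter()
    foo = 'BAZ'  # local name: shadows the module-level foo without changing it
    print(foo)
    with gzip.open(filename) as fi:
        for line in fi:
            line_norm = line.strip()
            # Skip blank lines and '#' comment lines
            if line_norm and (not line_norm.startswith('#')):
                fields = line_norm.split(sep)
                counter[fields[col]] += 1
    return counter

def column_stat1(filename, **kwargs):
    '''Same as column_stat, but takes col and sep as optional keyword arguments.'''
    counter = Counter()
    col = kwargs.get('col', 2)
    sep = kwargs.get('sep', '\t')
    with gzip.open(filename) as fi:
        for line in fi:
            line_norm = line.strip()
            if line_norm and (not line_norm.startswith('#')):
                fields = line_norm.split(sep)
                counter[fields[col]] += 1
    return counter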



In [33]:
#column_stat('../data/GCF_000005845.2_ASM584v2_genomic.gff.gz', col=3, sep=',')
kwargs = {'col': 2, 'sep': '\t'}
column_stat('../data/GCF_000005845.2_ASM584v2_genomic.gff.gz', **kwargs)

BAZ


Counter({'gene': 4518, 'CDS': 4386, 'repeat_region': 355, 'exon': 178, 'tRNA': 89, 'ncRNA': 65, 'region': 62, 'rRNA': 22, 'tmRNA': 2})

In [12]:
!ls -lah ../data

total 192M
drwxrwxr-x 2 ilya ilya 4.0K Oct 14 14:41 .
drwxrwxr-x 5 ilya ilya 4.0K Oct 14 11:21 ..
-rw-rw-r-- 1 ilya ilya  24M Sep 30 08:54 7E2.pileup.gz
-rw-rw-r-- 1 ilya ilya  21M Sep 30 15:49 AAGCTA_R2.pileup.gz
-rw-rw-r-- 1 ilya ilya 146M Sep 29 12:04 BJ-HSR1_R1.fastq.gz
-rw-rw-r-- 1 ilya ilya 500K Sep 16 09:03 contigs.fasta
-rw-r----- 1 ilya ilya  644 Sep 16 09:00 dHSR1.fa
-rw-r----- 1 ilya ilya 1.4M Oct 14 14:41 GCF_000005845.2_ASM584v2_genomic.fna.gz
-rw-rw-r-- 1 ilya ilya 440K Sep 10 13:08 GCF_000005845.2_ASM584v2_genomic.gff.gz
-rw-rw-r-- 1 ilya ilya 5.4K Sep 23 15:58 gradtimes.txt
-rw-rw-r-- 1 ilya ilya  445 Sep 16 09:00 hHSR-435.fa
-rw-rw-r-- 1 ilya ilya  611 Sep 16 09:00 hHSR.fa
-rw-rw-r-- 1 ilya ilya  11K Oct 14 09:18 ROSE1_dp.ps
-rw-rw-r-- 1 ilya ilya  126 Sep 16 09:04 rose.fa


In [41]:
pattern = 'GGGGCGGGGG'

# Complement table used by rc()
nuc = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def rc(seq):
    '''Return the reverse complement of a DNA sequence.'''
    return ''.join([nuc[i] for i in seq])[::-1]

with gzip.open('../data/GCF_000005845.2_ASM584v2_genomic.fna.gz') as fi:
    glines = []
    for line in fi:
        line_norm = line.strip().upper()
        if not line_norm.startswith('>'):  # skip FASTA header lines
            glines.append(line_norm)

seq = ''.join(glines)

def find_kmer(seq, pattern):
    '''Map every k-mer of seq (k = len(pattern)) to the list of its 0-based start positions.'''
    seq_map = {}
    k = len(pattern)
    # len(seq) - k + 1 so that the last k-mer of the sequence is included
    for i in xrange(0, len(seq) - k + 1):
        kmer = seq[i:i+k]
        if kmer in seq_map:
            seq_map[kmer].append(i)
        else:
            seq_map[kmer] = [i]
    return seq_map
    
forward = find_kmer(seq, pattern)
print(' '.join([str(x) for x in forward[pattern]]))
reverse = find_kmer(seq, rc(pattern))
print(' '.join([str(x) for x in reverse[rc(pattern)]]))

300115
3349720
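
Building a map of every k-mer in the genome costs a lot of memory when only one pattern is of interest. A lighter alternative (not the approach used in the cell above) is to walk the sequence with `str.find`. The helper name `find_all` and the toy sequence are made up for this sketch:

```python
def find_all(seq, pattern):
    '''Return the 0-based start positions of every (possibly overlapping) match.'''
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        start = seq.find(pattern, start + 1)   # step by 1 so overlapping hits are kept
    return positions

pattern = 'GGGGCGGGGG'
toy = 'AA' + pattern + 'TT' + pattern + 'G'    # two planted copies of the pattern

hits = find_all(toy, pattern)
overlapping = find_all('AAAA', 'AA')

print(hits)          # [2, 14]
print(overlapping)   # [0, 1, 2]
```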


In [39]:
nuc = {'A':'T', 'T' : 'A', 'C' : 'G',  'G' : 'C'}
# Take the pattern and look up each base's complement in the dict
res = []
for i in pattern :
     res.append(nuc[i])
print(''.join(res)[::-1])

CCCCCGCCCC


In [40]:
def rc(seq):
    return ''.join([nuc[i] for i in seq])[::-1]

rc(pattern)

'CCCCCGCCCC'

In [7]:
seq[300115:300115+len(pattern)]

'GGGGCGGGGG'

In [9]:
D = 1 # maximum number of mismatches allowed

def hamming(s1, s2):
    '''
    Computes Hamming distance between s1 and s2.
    '''
    if len(s1) != len(s2):
        raise ValueError('s1 and s2 must be the same length to compute Hamming distance!')
    return sum(ch1 != ch2 for ch1,ch2 in zip(s1, s2))

seq_map = {}
k = len(pattern)
for i in xrange(0, len(seq) - k + 1):
    kmer = seq[i:i+k]
    if hamming(kmer, pattern) <= D:
        if kmer in seq_map:
            seq_map[kmer].append(i)
        else:
            seq_map[kmer] = [i]

# Flatten the per-k-mer position lists and print them in ascending order
positions = sorted(pos for hits in seq_map.values() for pos in hits)
print(' '.join(str(x) for x in positions))

300115
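
The same sliding-window idea is easier to follow on a toy sequence where the answer can be checked by eye. A sketch, with a hypothetical helper `approx_positions` and made-up input (written with `range` so it also runs on Python 3):

```python
def hamming(s1, s2):
    '''Number of positions at which two equal-length strings differ.'''
    if len(s1) != len(s2):
        raise ValueError('s1 and s2 must be the same length')
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def approx_positions(seq, pattern, max_mismatches):
    '''Start positions of all windows within max_mismatches of pattern.'''
    k = len(pattern)
    return [i for i in range(len(seq) - k + 1)
            if hamming(seq[i:i + k], pattern) <= max_mismatches]

toy = 'TTACGTTTACCTTT'
exact = approx_positions(toy, 'ACGT', 0)
one_off = approx_positions(toy, 'ACGT', 1)

print(exact)     # [2]
print(one_off)   # [2, 8]  ('ACCT' at position 8 differs from 'ACGT' in one base)
```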


In [47]:
s1 = 'AAATTTTAAA'
s2 = 'AAATTATAAT'
for c1,c2 in zip(s1,s2):
    print(c1,c2)

A A
A A
A A
T T
T T
T A
T T
A A
A A
A T
