# Python

A dynamically typed language created by Guido van Rossum (BDFL) and first released in 1991.

- code blocks are defined by INDENTATION
- everything is an object
- everything is a reference
- first class functions
- garbage collected
- batteries included

### OBJECT:
> A data structure that holds (encapsulates) both data and actions that can be performed on these data

### REFERENCE:
> A pointer to a specific memory location that holds the actual value (object)
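
A minimal sketch of what reference semantics look like in practice. The names `first`, `second`, and `independent` are illustrative only and do not appear elsewhere in these notes.

```python
# Two names bound to the same list object
first = [1, 2, 3]
second = first            # no copy is made; both names refer to one object
second.append(4)

print(first)              # the change made through `second` is visible here
print(first is second)    # identity check: same object

# An explicit copy creates a new, independent object
independent = list(first)
independent.append(5)
print(first, independent)
print(first is independent)
```

Assignment never copies an object by itself; when a copy is needed it has to be requested explicitly, for example with `list(...)` as above.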

### FIRST-CLASS FUNCTIONS:
> Functions (reusable blocks of code that compute something) can be manipulated (assigned and passed around) as any other object
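
As a rough illustration of the definition above, the sketch below assigns a function to a new name, passes it as an argument, and stores it in a dictionary. The helper names (`square`, `apply_twice`, `operations`) are made up for this example.

```python
def square(x):
    return x * x

def apply_twice(func, value):
    # `func` is an ordinary argument that happens to be callable
    return func(func(value))

alias = square                    # assign a function to another name
operations = {'sq': square}       # store it in a data structure

print(alias(3))
print(apply_twice(square, 3))
print(operations['sq'](4))
```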

### BATTERIES:
> Python's standard library (things that come with the language) and docs _are_ tremendous resources

To save yourself aggravation later, set your editor to indent with 4 spaces and _NEVER EVER_ mix tabs and spaces.

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### The takehome message

Python was _designed_ to be a concise, high-level, multi-paradigm language. You can write Java- or C++-like code in Python, but it'll look ugly. There is a _pythonic_ way of doing things. Often it _is_ the best way, but not always.

The tradeoffs:

- speed
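- lack of easy (true) concurrency: in CPython the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time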

## Code is written once but read many times

### You're collaborating with yourself in the future

>**Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.**

_John F Woods_

## Let's dive right in

Python is an interpreted language. When you launch it you find yourself in an interactive Python shell that, like `bash`, reads a command, executes it, and prints the result.

## Types

### Numerical types

### `int` - integer

In [2]:
a = 1
a

1

### Integer division gotcha (python 2 specific)

In [None]:
from __future__ import division, print_function

In [None]:
b = 2

a / b

In [None]:
a / float(b)

In [None]:
float(a) / b
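
The two division operators can be compared side by side. This is only a sketch and assumes Python 3 semantics for `/` (in Python 2 the `__future__` import from the cell above gives the same behaviour); `num` and `den` are illustrative names.

```python
num, den = 7, 2

true_div = num / den       # true division
floor_div = num // den     # floor division, spelled the same in Python 2 and 3
neg_floor = -7 // 2        # floors toward negative infinity, not toward zero
quotient, remainder = divmod(num, den)

print(true_div, floor_div, neg_floor, quotient, remainder)
```

Using `//` whenever an integer result is intended makes the code behave identically under both major versions.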

### `float` - floating point (real) number

In Python this means 64-bit `float`:

1-bit sign + 11-bit exponent + 52-bit fraction (53 bits of precision thanks to an implicit leading bit)
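
The float format can be inspected from within Python. The sketch below assumes the usual IEEE 754 double-precision platform and uses `sys.float_info` and `math.isclose` from the standard library; `total` is an illustrative name.

```python
import math
import sys

print(sys.float_info.mant_dig)      # bits of precision in the significand
print(sys.float_info.max)           # largest representable finite float

total = 0.1 + 0.2
print(total == 0.3)                 # exact comparison of floats is fragile
print(math.isclose(total, 0.3))     # compare with a tolerance instead
```

Because most decimal fractions have no exact binary representation, comparing floats with a tolerance is usually safer than `==`.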

In [None]:
# Float can be written in two ways
f = 1.23456
g = 1e-8
f + g

### `complex` - complex number

In [None]:
c = 1j # same as (0 + 1j)
i = 1.092 - 2.98003j
c - i

### Sequence (iterable) types

### `str` - string

In [None]:
s = 'I am a Python string. I can be "manipulated" in many ways.'
s

In [None]:
seq = '>RNA\tAGUUCGAGGCUUAAACGGGCCUUAU'
seq

In [None]:
print(seq)

In [None]:
len(seq)

In [None]:
print(s[0])
print(s[5])

In [None]:
# Strings are immutable!!!
s[3] = 'n'  # raises TypeError: 'str' object does not support item assignment
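
Since a string cannot be changed in place, the usual workaround is to build a new one. A small sketch, using the hypothetical names `text` and `patched` so that the notebook's `s` is left alone:

```python
text = 'I am a Python string.'

try:
    text[3] = 'n'                       # item assignment is not supported by str
except TypeError as err:
    print('TypeError:', err)

# Build a new string from slices instead of modifying in place
patched = text[:3] + 'n' + text[4:]
print(patched)
print(text)                             # the original is unchanged
```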

In [None]:
# Let's have a look inside:
print(dir(s))

In [None]:
s.split()

In [None]:
s.split('p')

In [None]:
s.upper()

In [None]:
seq.split()

In [None]:
seq

In [None]:
# Unpacking
name, seq = seq.split()
print('Name: {}\tSequence: {}'.format(name,seq))

In [None]:
s = '               i am  a really    stupid    long string   '.strip()
s.strip('i')

In [None]:
sl = s.split()
print(' '.join(sl))

### `list` - (ordered) collection of elements

In [None]:
sl, len(sl)

In [None]:
sl[2:]

In [None]:
print(dir(sl))

In [None]:
sl.insert(2, 'another')
sl

In [None]:
sl.extend(['another', 'item'])
sl

In [None]:
# Lists are mutable!!!
lst = ['item', 'another item']
lst1 = [item.upper() for item in lst]
print(lst1)

lst.append('add to lst')
print(lst1)

In [None]:
lst.append('YAY')
print(lst)
print(lst1)

In [None]:
# Lists are ordered and iterable:
lst1 = []
for item in lst:
    lst1.append(item)
    
print(lst1)

In [None]:
for c in seq:
    print(c)

In [None]:
# Don't do this:
for i in range(len(lst)):
    print(lst[i])
    
# If you need index of each element:
for ind,item in enumerate(lst):
    print(ind, item)

In [None]:
# To check that an element is in list:

print('item' in lst)
print('Item' in lst)

if 'item' in lst:
    print('Item in list!')
else:
    print('Not in list')

### `tuple` is an immutable list

In [None]:
t = ('name', 'position')
name, position = t
print(name, position)

In [None]:
# From the earlier example:
r1 = range(11)
r2 = range(0, 100, 10)
for i1,i2 in zip(r1,r2):
    print(i1,i2)

### DIFFERENCE FROM ARRAYS:
>Elements of a list do not need to be of the same type
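
To make the contrast concrete, the sketch below compares a plain list with a typed array from the standard library `array` module. That module is used here only as one example of an array type; the names `mixed`, `kinds`, and `ints` are illustrative.

```python
from array import array

mixed = [1, 'two', 3.0, None]            # a list may hold any mix of objects
kinds = [type(item).__name__ for item in mixed]
print(kinds)

ints = array('i', [1, 2, 3])             # typed array: signed integers only
try:
    ints.append('four')                  # wrong element type is rejected
except TypeError as err:
    print('TypeError:', err)
```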

### `set` - unordered collection of unique elements

In [None]:
sl

In [None]:
set(sl)

In [None]:
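# Nice optimization trick:
huge_list = list(range(1000000))  # list() so this is a real list in Python 3 too
huge_set = set(huge_list)

def check_list(elem, lst=None):
    # Search the given list, or the module-level huge_list by default
    if lst is None:
        lst = huge_list
    return elem in lst  # linear scan: O(n)

def check_set(elem):
    return elem in huge_set  # hash lookup: O(1) on average

# And let's refactor this into one function and talk about variable scopes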

In [None]:
%time check_list(10000)

In [None]:
%time check_set(10000)
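
`%time` is an IPython magic and is not available in a plain Python script. A comparable measurement can be sketched with the standard library `timeit` module; the data below (`as_list`, `as_set`, `target`) is hypothetical and deliberately smaller than the list used above.

```python
import timeit

as_list = list(range(100000))
as_set = set(as_list)
target = 99999                  # worst case for the list: the last element

# Repeat each membership test 100 times and report the total time
list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print('list: {:.6f}s  set: {:.6f}s'.format(list_time, set_time))
```

The absolute numbers depend on the machine, but the list lookup should be slower by several orders of magnitude because it scans the elements one by one.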

### `dictionary` - key/value pairs

Also known as a hash table or associative array; in general, a mapping type.

In [None]:
d = {'key1': 1, 'key2': 20, 'key3': 345}
d

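`key` can be any *hashable* type. In practical terms this means immutable built-ins such as `str`, numbers, and tuples that contain only hashable items; mutable containers such as `list`, `dict`, and `set` cannot be keys.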
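
A short sketch of the hashability rule, using a hypothetical `coords` dictionary: a tuple is accepted as a key, while a list with the same contents is rejected.

```python
coords = {}
coords[('chr1', 100)] = 'tuple key works'       # tuples of hashables are hashable

try:
    coords[['chr1', 100]] = 'list key fails'    # lists are mutable and unhashable
except TypeError as err:
    print('TypeError:', err)

print(coords)
```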

In [None]:
print(dir(d))

### `dict` is mutable!

In [None]:
d.pop('key1')
print(d)
# Note: key order is only guaranteed (insertion order) since Python 3.7; do not rely on it in older versions

In [None]:
# Assignment
d['key1'] = 101
d['key2'] = 202
d

In [None]:
# Membership test operates on keys
print('key3' in d)
print('key4' in d)

In [None]:
# Looping over key:value pairs

for key,val in d.items():
    print(key,val)

### Working with files

In [None]:
# This works but is not good: the file stays open if the loop raises

fh = open('myfile.txt', 'r')
for line in fh:
    pass  # Do something
fh.close()

In [None]:
# Better way (context manager)
import os.path
directory = 'dir1'
filename = 'file.txt'

with open(os.path.join(directory, filename), 'r') as fh:
    for line in fh:
        pass  # Do something
# Do something else. At this point fh.close() has been called automatically
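
What the context manager buys us can be sketched by writing the same read twice, once with an explicit `try`/`finally` and once with `with`. The scratch file and the names `path`, `manual_fh`, and `managed_fh` are assumptions made to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch can run anywhere
path = os.path.join(tempfile.mkdtemp(), 'file.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\n')

# Roughly what the with-statement does for us
manual_fh = open(path, 'r')
try:
    lines_manual = [line.rstrip('\n') for line in manual_fh]
finally:
    manual_fh.close()           # runs even if the loop raises

with open(path, 'r') as managed_fh:
    lines_with = [line.rstrip('\n') for line in managed_fh]

print(lines_manual == lines_with, manual_fh.closed, managed_fh.closed)
```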

In [None]:
# We can also work with gzipped files
import gzip

with gzip.open('myfile.gz', 'rb') as fh: # Note 'rb' as file opening mode: lines are bytes; use 'rt' for str
    for line in fh:
        pass  # Do something
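
The difference between binary and text mode matters in Python 3, where binary mode yields `bytes` and text mode yields `str`. A minimal sketch, assuming Python 3 and a hypothetical scratch file `example.gz`:

```python
import gzip
import os
import tempfile

gz_path = os.path.join(tempfile.mkdtemp(), 'example.gz')   # hypothetical file

with gzip.open(gz_path, 'wt') as out:       # 't' selects text mode
    out.write('#comment\nrecord\n')

with gzip.open(gz_path, 'rb') as fh_bin:
    binary_lines = [line for line in fh_bin]    # bytes objects

with gzip.open(gz_path, 'rt') as fh_txt:
    text_lines = [line for line in fh_txt]      # str objects

print(binary_lines)
print(text_lines)
```

String methods called with `str` arguments, such as `line.startswith('#')`, only work on the text-mode lines.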

## Standard library

Python has an expansive standard library and comprehensive, accessible documentation complete with examples. Some of the must-know modules:

- `collections`
- `itertools`
- `os` and particularly `os.path`
- `re`
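
A quick taste of each of the modules listed above. The inputs are made up for illustration (the RNA string is the one used earlier in these notes), and each call is only one of many things the module offers.

```python
import itertools
import os.path
import re
from collections import Counter, defaultdict

rna = 'AGUUCGAGGCUUAAACGGGCCUUAU'

bases = Counter(rna)                                 # collections: count items
print(bases.most_common(2))

by_length = defaultdict(list)                        # collections: default values
for word in ['a', 'bb', 'cc', 'ddd']:
    by_length[len(word)].append(word)
print(dict(by_length))

pairs = list(itertools.combinations('ACG', 2))       # itertools: combinatorics
print(pairs)

root, ext = os.path.splitext('reads.fastq.gz')       # os.path: path handling
print(root, ext)

match = re.search(r'G{3,}', rna)                     # re: regular expressions
print(match.group(), match.start())
```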

## Comparisons

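In a boolean context (`if`, `while`, `and`, `or`, `not`) every object is evaluated for its truth value; comparisons between built-in types themselves return a `bool`.

The `bool` type is a subclass of `int` (`False == 0`, `True == 1`), mainly for historical reasons.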
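
Both points can be checked interactively. The sketch below uses illustrative names (`seq_example`, `g_count`, `falsy`) and shows one handy consequence of booleans being integers: summing them counts matches.

```python
print(isinstance(True, int))      # the boolean type is a subclass of int
print(True + True)                # so booleans take part in arithmetic

# Summing booleans counts how many times a condition holds
seq_example = 'AGUUCGAGG'
g_count = sum(base == 'G' for base in seq_example)
print(g_count)

# Truth value testing: zero and empty values are falsy
falsy = [0, 0.0, '', [], (), {}, set(), None]
print([bool(value) for value in falsy])
```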

In [None]:
a = 1
b = 0
if 0 < a < 2:
    print('a is True!')
if not b:
    print('b is False!')

In [None]:
s1 = 'Fox'
s2 = 'Dog'

# Equality check
if s1 != s2:
    print('s1 and s2 are different')
    
# Identity check
if s1 is s2:
    pass
else:
    print('s1 and s2 are not the same thing!')

In [None]:
s2 = s1

# Identity check
if s1 is s2:
    print('s1 and s2 POINT to the same thing')
else:
    print('s1 and s2 are not the same thing!')

In [None]:
# Identity checks are unreliable for integers
b = 1
if a == b:
    print('a and b are equal!')
    
if a is b:
    print('a and b are the same object!')
    
# Implementation detail (CPython): integers from -5 to 256 are cached (singletons),
# so `is` happens to succeed here; use == to compare values

In [None]:
# Empty sequence types evaluate to False, non-empty ones to True
lst1 = []
lst2 = ['a']

if lst1 and lst2:
    print('Both lists are non-empty.')
else:
    print('At least one of the lists is empty.')

In [None]:
if lst1:
    pass
else:
    print('Empty')

In [None]:
# Another singleton
a = None
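
Because `None` is a singleton, it is conventionally tested with `is` rather than `==`. A common use is as a sentinel default argument; the function `add_item` below is a hypothetical example of that idiom.

```python
def add_item(item, bucket=None):
    # None marks "no bucket given"; compare by identity, not equality
    if bucket is None:
        bucket = []              # a fresh list for every call
    bucket.append(item)
    return bucket

print(add_item('x'))
print(add_item('y'))             # does not see 'x': no shared default list
print(add_item('z', ['w']))
```

Using a mutable object such as `[]` directly as the default would share one list between all calls, which is why the `None` sentinel is preferred.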

## Example

In [None]:
!ls -lah ../data

In [None]:
import os
import sys
import csv
import gzip
import pandas as pd
csv.field_size_limit(sys.maxsize)

def parse_pileup(barcode, dirname='../results', track='minus', sample_id='minus'):
    filename = os.path.join(dirname, '{0}_{1}.pileup.gz'.format(barcode, sample_id))
    print(filename)
    # Text mode ('rt'): in Python 3 the csv module needs str lines, not bytes
    with gzip.open(filename, 'rt') as pileup:
        reader = csv.DictReader(pileup,
                        delimiter='\t',
                        fieldnames=['seqname', 'pos', 'base', 'coverage', 'details', 'qual'])
        data = []
        for rec in reader:
            pos = int(rec['pos'])
            last = rec
            if pos == 1:
                data.append({'pos': 0, 'base': '*',
                             track: rec['details'].count('^')})
            else:
                data.append({'pos': pos-1, 'base': last['base'],
                             track: rec['details'].count('^')})
    return pd.DataFrame.from_records(data)

In [None]:
df = parse_pileup('AAGCTA', dirname='../data', sample_id='R2')

In [None]:
df

## Assignment

Produce a summary of the E. coli `.gff` file in Python.

### EXTRA CREDIT:
>Find all occurrences of the sequence `GGGGCGGGGG` in the E. coli genome

### EXTRA EXTRA CREDIT:
>Find all _almost_ exact occurrences of the same sequence in the E. coli genome (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz)

In [None]:
import gzip
from collections import Counter

foo = 'bar'

def column_stat(filename, col=2, sep='\t'):
    counter = Counter()
    foo = 'BAZ'  # local name: shadows the module-level foo inside this function only
    print(foo)
    # 'rt' opens the gzipped file in text mode so lines are str, not bytes
    with gzip.open(filename, 'rt') as fi:
        for line in fi:
            line_norm = line.strip()
            # Skip blank lines and comment lines
            if line_norm and (not line_norm.startswith('#')):
                fields = line_norm.split(sep)
                counter[fields[col]] += 1
    return counter
                         
def column_stat1(filename, **kwargs):
    counter = Counter()
    # Keyword arguments arrive as a dict; a missing key raises KeyError
    col = kwargs['col']
    sep = kwargs['sep']
    with gzip.open(filename, 'rt') as fi:
        for line in fi:
            line_norm = line.strip()
            if line_norm and (not line_norm.startswith('#')):
                fields = line_norm.split(sep)
                counter[fields[col]] += 1
    return counter



In [None]:
#column_stat('../data/GCF_000005845.2_ASM584v2_genomic.gff.gz', col=3, sep=',')
kwargs = {'col': 2, 'sep': '\t'}
column_stat('../data/GCF_000005845.2_ASM584v2_genomic.gff.gz', **kwargs)

In [None]:
!ls -lah ../data

In [None]:
pattern = 'GGGGCGGGGG'
nuc = {'A':'T', 'T' : 'A', 'C' : 'G',  'G' : 'C'}

def rc(seq):
    return ''.join([nuc[i] for i in seq])[::-1]

with gzip.open('../data/GCF_000005845.2_ASM584v2_genomic.fna.gz', 'rt') as fi:  # 'rt': read lines as str
    glines = []
    for line in fi:
        line_norm = line.strip().upper()
        if not line_norm.startswith('>'):
            glines.append(line_norm)

seq = ''.join(glines)

def find_kmer(seq, pattern):
    # Map every k-mer in seq to the list of positions where it starts
    seq_map = {}
    k = len(pattern)
    # +1 so that the last k-mer of the sequence is included
    for i in range(len(seq) - k + 1):
        seq_map.setdefault(seq[i:i+k], []).append(i)
    return seq_map
    
forward = find_kmer(seq, pattern)
# .get avoids a KeyError when the pattern does not occur at all
print(' '.join(str(x) for x in forward.get(pattern, [])))
reverse = find_kmer(seq, rc(pattern))
print(' '.join(str(x) for x in reverse.get(rc(pattern), [])))
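
Building a map of every k-mer is convenient when many patterns will be looked up, but for a single pattern it uses a lot of memory. A lighter alternative, sketched here with the hypothetical helper `find_all` and a toy sequence rather than the genome, is to call `str.find` repeatedly:

```python
def find_all(text, query):
    """Return start positions of all (possibly overlapping) matches."""
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        start = text.find(query, start + 1)   # advance by 1 to allow overlaps
    return positions

toy = 'GGGGCGGGGGCGGGGG'      # hypothetical toy sequence, not the genome
hits = find_all(toy, 'GGGGCGGGGG')
print(hits)
```

Advancing by one position after each hit, instead of by the pattern length, is what lets overlapping occurrences be reported.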

In [None]:
nuc = {'A':'T', 'T' : 'A', 'C' : 'G',  'G' : 'C'}
# Take the pattern, look up the complement of each base in the dict, then reverse
res = []
for i in pattern:
    res.append(nuc[i])
print(''.join(res)[::-1])

In [None]:
def rc(seq):
    return ''.join([nuc[i] for i in seq])[::-1]
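
An alternative way to write the reverse complement, assuming Python 3 where `str.maketrans` is available, is a translation table. The names `complement_table` and `rc_translate` are illustrative and not part of the original notebook.

```python
# Translation table mapping each base to its complement
complement_table = str.maketrans('ACGT', 'TGCA')

def rc_translate(seq):
    # Complement every base in one pass, then reverse the result
    return seq.translate(complement_table)[::-1]

print(rc_translate('GGGGCGGGGG'))
```

Unlike the dictionary lookup, `translate` leaves characters that are not in the table (for example `N`) unchanged instead of raising `KeyError`.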

In [None]:
D = 1 # number of mismatches

def hamming(s1, s2):
    '''
    Computes Hamming distance between s1 and s2.
    '''
    if len(s1) != len(s2):
        raise ValueError('s1 and s2 must be the same length to compute Hamming distance!')
    return sum(ch1 != ch2 for ch1,ch2 in zip(s1, s2))

seq_map = {}
k = len(pattern)
for i in range(len(seq) - k + 1):
    kmer = seq[i:i+k]
    if hamming(kmer, pattern) <= D:
        seq_map.setdefault(kmer, []).append(i)

# Flatten the per-k-mer position lists and print all positions in sorted order
positions = sorted(pos for hits in seq_map.values() for pos in hits)
print(' '.join(str(x) for x in positions))

In [None]:
s1 = 'AAATTTTAAA'
s2 = 'AAATTATAAT'
for c1,c2 in zip(s1,s2):
    print(c1,c2)

In [None]:
def hamming(s1, s2):
    '''
    Computes Hamming distance between s1 and s2.
    '''
    if len(s1) != len(s2):
        raise ValueError('s1 and s2 must be the same length to compute Hamming distance!')
    return sum(ch1 != ch2 for ch1,ch2 in zip(s1, s2))

def find_with_mismatch(seq, pattern, D=1):
    seq_map = {}
    k = len(pattern)
    for kmer, pos in [(seq[i:i+k], i) for i in range(0, len(seq)-k+1)]:
        if hamming(kmer, pattern) <= D:
            if kmer in seq_map:
                seq_map[kmer].append(pos)
            else:
                seq_map[kmer] = [pos,]
    return seq_map
        
print(find_with_mismatch(seq, pattern, D=0))

In [None]:
print(find_with_mismatch(seq, pattern, D=1))