## Comprehensions, Generators
### BIOINF 575 

### For loop RECAP

### for: the repetitive control structure with a known number of steps

To loop through a sequence of elements is to iterate.

```python
for var in sequence:
    statements
```

___ 

### Python Comprehension Statements
Courtesy of Marcurs Sherman - partly adapted

First, the **purpose** of comprehensions:
> "\[...\] comprehensions provide a more concise way to create \[iterables\] in situations where `map()` and `filter()` and/or nested loops would currently be used" - Barry Warsaw, [PEP 202](https://www.python.org/dev/peps/pep-0202/)

Comprehensions are what we call "_syntactic sugar_". 
This means that they do not do anything you could not have done already.     
But with them, you can do some operations more easily.
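
As a minimal sketch of the "syntactic sugar" idea, the snippet below builds the same list twice: once with `map()` and `filter()`, and once with a comprehension. The `reads` list and the length threshold are made up for illustration.

```python
# Hypothetical example data, not part of the lecture data set
reads = ["ACGT", "AC", "GGGTTT", "T"]

# map()/filter() version: keep reads of length >= 4, then take their lengths
lengths_map = list(map(len, filter(lambda r: len(r) >= 4, reads)))

# comprehension version: same result, read as
# "len(r) for each r in reads if r is long enough"
lengths_comp = [len(r) for r in reads if len(r) >= 4]

print(lengths_map)   # [4, 6]
print(lengths_comp)  # [4, 6]
```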

<img src="venn_diagram2.png" width=400 />

---
### Comprehension Syntax

#### Legend

<img src="legendary.png" width=250 />

#### Examples
<img src="comprehensions.png" width=500 />

#### Alternate syntax of a comprehension

<center><img src="http://python-3-patterns-idioms-test.readthedocs.io/en/latest/_images/listComprehensions.gif" width = "500"/></center>

---
#### The Comprehension Categories
1. `list` comprehensions - create a list
2. `dict`ionary comprehensions - create dictionaries
3. `set` comprehensions - create sets
4. `tuple`? comprehensions

In [3]:
sequences = ["ACTTGCCC", "AAAGTC", "CCTAC", "AAACCTA"]

In [5]:
sequences

['ACTTGCCC', 'AAAGTC', 'CCTAC', 'AAACCTA']

#### Basic list comprehension
* Compute simple expression for each element

In [9]:
sl = [len(seq) for seq in sequences]
sl

[8, 6, 5, 7]

In [13]:
sl = []
for s in sequences:
    sl.append(len(s))
sl

[8, 6, 5, 7]

#### List comprehension - use [ ]
* Compute complex expression for each element



In [15]:
# compute GC content (fraction of G and C nucleotides)

gcl = [(seq.count("C") + seq.count("G"))/len(seq) for seq in sequences]
gcl

[0.625, 0.3333333333333333, 0.6, 0.2857142857142857]

#### List comprehension with predicate
* Compute complex expression for specific elements
    * add a predicate - an if expression 
    * if expression - similar to the if statement but with no statements after the header line
        * e.g.: if "#" not in item


In [17]:
# compute GC content only for sequences that contain "AC"

sequences

['ACTTGCCC', 'AAAGTC', 'CCTAC', 'AAACCTA']

In [19]:
"AC" in 'AAAGTC'

False

In [25]:
gc_ac = [(s.count("C")+s.count("G"))/len(s) for s in sequences if "AC" in s]
gc_ac

[0.625, 0.6, 0.2857142857142857]
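
A predicate at the end of a comprehension drops elements, while a conditional expression (`a if test else b`) at the front keeps every element and only changes the computed value. A small sketch of the difference, re-declaring the `sequences` list so that it runs on its own:

```python
sequences = ["ACTTGCCC", "AAAGTC", "CCTAC", "AAACCTA"]

# predicate: filters, so the result can be shorter than the input
with_ac = [s for s in sequences if "AC" in s]

# conditional expression: one output value per input element
labels = ["has AC" if "AC" in s else "no AC" for s in sequences]

print(with_ac)  # ['ACTTGCCC', 'CCTAC', 'AAACCTA']
print(labels)   # ['has AC', 'no AC', 'has AC', 'has AC']
```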

#### If the comprehension becomes too complex - use a regular for loop

In [27]:
# compute GC content only for sequences that contain "AC"

sequences

['ACTTGCCC', 'AAAGTC', 'CCTAC', 'AAACCTA']

In [33]:
gc_ac = []
for s in sequences:
    gc = (s.count("C")+s.count("G"))/len(s)
    if "AC" in s:
        gc_ac.append(gc)

gc_ac

[0.625, 0.6, 0.2857142857142857]

#### Set comprehensions - use { } 
* Use when you want unique elements and the order does not matter

In [35]:
# get the first codon in each sequence  

sequences

['ACTTGCCC', 'AAAGTC', 'CCTAC', 'AAACCTA']

In [39]:
[s[:3] for s in sequences]

['ACT', 'AAA', 'CCT', 'AAA']

In [41]:
{s[:3] for s in sequences}

{'AAA', 'ACT', 'CCT'}

#### Dictionary comprehensions - use { }
* must start with something like: key_expression:value_expression
* Use when you want key:value pairs and the order does not matter

In [45]:
# index as key, sequence as value

dict(enumerate(sequences))


{0: 'ACTTGCCC', 1: 'AAAGTC', 2: 'CCTAC', 3: 'AAACCTA'}

In [59]:
{"seq" + str(i+1):s for i,s in enumerate(sequences)}

{'seq1': 'ACTTGCCC', 'seq2': 'AAAGTC', 'seq3': 'CCTAC', 'seq4': 'AAACCTA'}

In [61]:
{"seq" + str(i+1):len(s) for i,s in enumerate(sequences)}

{'seq1': 8, 'seq2': 6, 'seq3': 5, 'seq4': 7}
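
The same pattern also works with the sequence itself as the key and its GC content as the value. The sketch below assumes a small helper, `gc_content`, which is a hypothetical name introduced only for this example:

```python
sequences = ["ACTTGCCC", "AAAGTC", "CCTAC", "AAACCTA"]

def gc_content(seq):
    """Fraction of G and C nucleotides in seq (hypothetical helper)."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# key_expression : value_expression for each element
gc_by_seq = {s: gc_content(s) for s in sequences}

print(gc_by_seq["ACTTGCCC"])  # 0.625
print(gc_by_seq["CCTAC"])     # 0.6
```

Strings are hashable, so they can be used as dictionary keys; duplicate sequences would collapse into a single key.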

#### <font color = "red">Exercise:</font>   

* Create a list comprehension that stores, for each sequence, whether it can code for the amino acid Tyrosine (the TAT and TAC codons code for this amino acid).
* Change this into a dictionary comprehension where the key is "seq pos", where pos is the position of the sequence in the `sequences` list.


In [63]:
sequences

['ACTTGCCC', 'AAAGTC', 'CCTAC', 'AAACCTA']

In [65]:
s = 'AAAGTC'
"TAT" in s

False

In [70]:
[s for s in sequences]

['ACTTGCCC', 'AAAGTC', 'CCTAC', 'AAACCTA']

In [72]:
["TAT" in s for s in sequences]

[False, False, False, False]

In [76]:
sequences

['ACTTGCCC', 'AAAGTC', 'CCTAC', 'AAACCTA']

In [80]:
["TAT" in s or "TAC" in s for s in sequences]

[False, False, True, False]

In [84]:
{i:"TAT" in s or "TAC" in s for i, s in enumerate(sequences)}

{0: False, 1: False, 2: True, 3: False}

In [86]:
{"seq " + str(i +1):"TAT" in s or "TAC" in s for i, s in enumerate(sequences)}

{'seq 1': False, 'seq 2': False, 'seq 3': True, 'seq 4': False}

### Some pros of comprehensions
1. Concise - their use can easily distill multiple lines of code into a single, concise statement
1. Efficient (time and other resources) - _slightly_ more performant than regular loops
1. Flexible output - list, set, dictionary ...
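
One way to check the "slightly more performant" claim on your own machine is the standard-library `timeit` module. This is only a sketch: the data size and repetition count are arbitrary, and the measured times will differ between machines and Python versions.

```python
import timeit

data = list(range(10_000))  # arbitrary size chosen for illustration

def with_loop():
    result = []
    for x in data:
        result.append(x * 2)
    return result

def with_comprehension():
    return [x * 2 for x in data]

# Both approaches must give the same answer
same_result = with_loop() == with_comprehension()
print(same_result)  # True

# Total time for 100 calls of each function, in seconds
loop_time = timeit.timeit(with_loop, number=100)
comp_time = timeit.timeit(with_comprehension, number=100)
print(f"loop: {loop_time:.4f}s  comprehension: {comp_time:.4f}s")
```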

### Some cons of comprehensions
1. The "imperative" syntax - the order in which you type things to make one is different from the rest of Python
1. Readability - comprehension statements get more unreadable as complexity is added
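
To illustrate both points, here is a sketch of a nested comprehension next to the equivalent loops. It collects all complete codons in reading frame 0; the choice of task is only an example.

```python
sequences = ["ACTTGCCC", "AAAGTC", "CCTAC", "AAACCTA"]

# Nested comprehension: the computed expression comes first,
# then the `for` clauses in the same order as nested loops
codons = [s[i:i + 3] for s in sequences for i in range(0, len(s) - 2, 3)]

# The same logic as explicit loops: longer, but arguably easier to follow
codons_loop = []
for s in sequences:
    for i in range(0, len(s) - 2, 3):
        codons_loop.append(s[i:i + 3])

print(codons)
# ['ACT', 'TGC', 'AAA', 'GTC', 'CCT', 'AAA', 'CCT']
print(codons == codons_loop)  # True
```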

### RESOURCES

https://www.tutorialspoint.com/python-list-comprehension  
https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html  
https://realpython.com/list-comprehension-python/  
http://scipy-lectures.org/advanced/advanced_python/index.html   

#### Did we miss the tuple comprehensions?

In [109]:
# Try to make a `tuple` comprehension
# this will not return a tuple

g = (number * 2 for number in range(10))
g

<generator object <genexpr> at 0x13ebedb10>

In [111]:
list(g)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [115]:
# generator is at the end, no more values to show
for i in g:
    print(i)

In [117]:
g = (number * 2 for number in range(10))
for i in g:
    print(i)

0
2
4
6
8
10
12
14
16
18
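
The parenthesized form is often passed straight to a function that consumes it. A short sketch, re-declaring `sequences` so that it runs on its own:

```python
sequences = ["ACTTGCCC", "AAAGTC", "CCTAC", "AAACCTA"]

# When it is the only argument, the extra parentheses can be dropped
total_length = sum(len(s) for s in sequences)
longest = max(len(s) for s in sequences)
any_tyrosine = any("TAT" in s or "TAC" in s for s in sequences)

# To really get a tuple, pass the expression to tuple()
lengths = tuple(len(s) for s in sequences)

print(total_length, longest, any_tyrosine, lengths)
# 26 8 True (8, 6, 5, 7)
```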


### Python Generators
Courtesy of Marcurs Sherman - partly adapted

#### What looked like a "tuple comprehension" above is actually called a "generator expression".

<img src="http://nvie.com/img/relationships.png" width=600 align='middle'/>


* An iterable is an object that one can iterate over.
    * It returns an iterator when passed to the `iter()` function.
* An iterator is an object used to iterate over an iterable with the `__next__()` method.
    * Iterators have a `__next__()` method, which returns the next item of the object.

* Note that **every iterator** is also an **iterable**, but **_not every iterable is an iterator_**.
    * For example, a list is iterable but a list is not an iterator.
* An iterator can be created from an iterable by using the function `iter()`.
    * To make this possible, the class of an object needs either a method `__iter__`, which returns an iterator, or a `__getitem__` method with sequential indexes starting with 0.

https://www.geeksforgeeks.org/python-difference-iterable-iterator/
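
The points above can be sketched with a small class that is iterable only through `__getitem__`, together with a hand-written version of what a `for` loop does. The class name `Codons` is hypothetical and exists only for this illustration.

```python
class Codons:
    """Hypothetical sequence-like class: iterable through __getitem__ only."""
    def __init__(self, seq):
        self.seq = seq

    def __getitem__(self, index):
        codon = self.seq[3 * index: 3 * index + 3]
        if len(codon) < 3:
            raise IndexError(index)  # signals iter() that the data ended
        return codon

iterator = iter(Codons("ACTTGCCC"))  # works although there is no __iter__

# What a for loop does behind the scenes
collected = []
while True:
    try:
        collected.append(next(iterator))
    except StopIteration:
        break  # a for loop handles this exception silently

print(collected)                   # ['ACT', 'TGC']
print(iter(iterator) is iterator)  # True: an iterator is its own iterable
```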



In [123]:
r = range(3)
r

range(0, 3)

In [135]:
# dir(r)

In [137]:
iter(r)

<range_iterator at 0x13eba7420>

In [141]:
# dir(iter(r))

In [None]:
# dir(range)

In [131]:
# dir(1)

In [143]:
# is range an iterator?
next(range(3))

TypeError: 'range' object is not an iterator

In [145]:
next(iter(range(3)))

0

In [147]:
test_iter = iter(range(3))

In [153]:
next(test_iter)

2

#### and we can do next again and again ...

In [155]:
# and ... that's it ...
# when we reach the end of the sequence
# the iterator raises StopIteration on next
# we have to create it again to start from the beginning

next(test_iter)

StopIteration: 

In [157]:
test_gen = (number * 2 for number in range(10))

In [163]:
next(test_gen)

4

In [165]:
# retrieve all values
tuple(test_gen)

(6, 8, 10, 12, 14, 16, 18)

In [169]:
test_gen = (number * 2 for number in range(10))
tuple(test_gen)

(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

In [195]:
test_gen = (number * 2 for number in range(100000000))
next(test_gen)

0

In [197]:
next(test_gen)

2

In [199]:
for i in test_gen:
    pass

In [200]:
i

199999998
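
The loop above never held all the values in memory at once. A rough way to see the difference is `sys.getsizeof`, which reports the size of the container object itself (not of the integers it refers to). The size `n` below is arbitrary and much smaller than above so that the list version stays cheap.

```python
import sys

n = 100_000  # arbitrary size for illustration
as_list = [number * 2 for number in range(n)]
as_gen = (number * 2 for number in range(n))

list_size = sys.getsizeof(as_list)  # grows with n
gen_size = sys.getsizeof(as_gen)    # small and independent of n
print(list_size, gen_size)

# both produce the same values
print(sum(as_list) == sum(as_gen))  # True
```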

___
#### Functions RECAP

```python

# DEFINITION - creating a function

def function_name(arg1, arg2, darg=None):
    # instructions to compute result
    return result

# CALL - running a function

function_result = function_name(val1, val2, dval)
```

___


A generator is just a special case of a function. The main difference is how it gives its output. 

How do you make a function give a result?

In [203]:
def number_one():
    number = 1
    return number

In [207]:
number_one()

1

In [209]:
def number_one():
    number = 1
    yield number

In [213]:
x = number_one()
next(x)

1

In [215]:
next(x)

StopIteration: 

In [217]:
# create a generator for an infinite sequence of numbers
# Note for generators we have yield instead of return

def infinite_sequence():
    number = 0
    while True:
        yield number
        number += 1

In [219]:
numbers_seq_gen = infinite_sequence()

In [221]:
numbers_seq_gen

<generator object infinite_sequence at 0x13ec34a00>

In [223]:
next(numbers_seq_gen)

0

#### and we can do next again and again ...

In [233]:
next(numbers_seq_gen)

5
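
Calling `list()` on an infinite generator would never finish. One way to take a bounded slice is `itertools.islice` from the standard library, sketched here with the same `infinite_sequence` definition repeated so the block runs on its own:

```python
from itertools import islice

def infinite_sequence():
    number = 0
    while True:
        yield number
        number += 1

# islice stops after 5 values, so the endless loop is never a problem
first_five = list(islice(infinite_sequence(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```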

In [235]:
# a generator for a finite sequence of numbers
# this starts to look like range

def finite_sequence(limit):
    number = 0
    while number < limit:
        yield number
        number += 1

In [237]:
numbers_seq_gen = finite_sequence(3)

In [239]:
numbers_seq_gen

<generator object finite_sequence at 0x13ec34f40>

In [241]:
next(numbers_seq_gen)

0

In [247]:
# and we can do next again and again ... and ...that's it

next(numbers_seq_gen)


StopIteration: 

In [249]:
# we can put all the results in a list

list(finite_sequence(30))


[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29]

In [251]:
# go through the elements of the generator

x = finite_sequence(10)
y = next(x)
while y < 5:
    print(y)
    y = next(x)

0
1
2
3
4


In [253]:
# 5 is missing below: the while loop above already consumed it with its last next(x)
for i in x:
    print(i)

6
7
8
9


In [263]:
# generator that pairs up the elements of two sequences (e.g. keys and values for a dictionary)

def zip_2sequences(seq1, seq2):
    i = 0
    l = min(len(seq1), len(seq2))
    while i<l:
        yield (seq1[i], seq2[i])
        i += 1
    

In [277]:
s1 = [1,2,3]
s2 = ["ACG","AAC","CCC"]
g1 = zip_2sequences(s1,s2)
g1

<generator object zip_2sequences at 0x13ece6110>

In [279]:
next(g1)

(1, 'ACG')

In [281]:
g1 = zip_2sequences(s1,s2)
dict(g1)

{1: 'ACG', 2: 'AAC', 3: 'CCC'}

In [283]:
g1 = zip_2sequences(s1,s2)
for key,value in g1:
    print(key, value)

1 ACG
2 AAC
3 CCC


#### <font color = "red">Exercise:</font>   

* Create a generator of n nucleotides that keeps giving us a nucleotide in the order A,C,G,T and then starts again from A until it reaches n nucleotides. 


In [293]:
def gen_nucleotides(n, s):
    i = 0
    while i < n:
        yield s[i % len(s)]
        i += 1

In [297]:
list(gen_nucleotides(15, "ACGT"))

['A', 'C', 'G', 'T', 'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T', 'A', 'C', 'G']

In [299]:
9%5

4

In [305]:
11%5

1
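
An alternative sketch of the same exercise uses `itertools.cycle` and `itertools.islice` instead of the modulo arithmetic. The function name `gen_nucleotides_cycle` is made up for this variant.

```python
from itertools import cycle, islice

def gen_nucleotides_cycle(n, s="ACGT"):
    """Yield n characters, cycling through s (hypothetical variant)."""
    # cycle repeats s forever; islice cuts it off after n items
    yield from islice(cycle(s), n)

print("".join(gen_nucleotides_cycle(10)))  # ACGTACGTAC
```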

---
# Conclusion
Generators and generator expressions should be a standard tool in every bioinformaticist's tool belt. 

1. Generator expressions can compress simple for loops down to a single line
1. List comprehensions tend to be more efficient than standard for loops when the data is sufficiently large
1. The same syntax to make a list comprehension can be used to make dictionaries, sets, and generators
1. Generators are iterators that lazily evaluate the next value and `yield` it back
1. Once a generator (or any iterator) is consumed, you need to recreate it to get the values again

### Some pros of generators
1. Lazy evaluation: does not produce all the data at one time
1. Maintains state between steps: does not forget where it left off
1. Easily handles data of any size
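
A typical use of these properties is reading a large file one record at a time. The sketch below assumes a simple FASTA-like layout (header lines starting with `>`); the function name `read_fasta` and the records are made up, and `io.StringIO` stands in for an open file.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # previous record is complete
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)          # last record

# StringIO stands in for an open file; the records are invented
fake_file = io.StringIO(">seq1\nACTT\nGCCC\n>seq2\nAAAGTC\n")
records = read_fasta(fake_file)
print(next(records))  # ('seq1', 'ACTTGCCC')
print(next(records))  # ('seq2', 'AAAGTC')
```

Only one record is assembled at a time, so memory use does not depend on the number of records in the file.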

### Some cons of generators
1. Hard to explain to someone who does not use Python
1. For sufficiently small data, the trade-off is not worth it

#### RESOURCES 
https://www.tutorialspoint.com/generators-in-python   
https://www.geeksforgeeks.org/generators-in-python/   
https://book.pythontips.com/en/latest/generators.html   


In [309]:
[1,2,3] + 5

TypeError: can only concatenate list (not "int") to list

---
### Function Examples

___
##### <b>`*args`</b> - unknown no. of arguments - unpack collection of argument values
##### <b>`**kargs`</b> - unknown no. of arguments - unpack mapping of names and values 

In [311]:
x ,y ,z = [20,30,40]
print(x)
print(y)
print(z)

20
30
40


In [None]:
# what if the number of elements does not match?



In [313]:
x ,*y ,z = [20,30,50, "A", 40]
print(x)
print(y)
print(z)

20
[30, 50, 'A']
40


In [315]:
# if we use * we can provide an unknown number of positional argument values

def test_arg(*args_list):
    for value in args_list:
        print("value = ", value)

In [317]:
test_arg(1,2,3, {"a":4}, [4,5])

value =  1
value =  2
value =  3
value =  {'a': 4}
value =  [4, 5]


In [319]:
# no key=value arguments allowed
test_arg(args_list = 2)

TypeError: test_arg() got an unexpected keyword argument 'args_list'

In [321]:
# if we use * we can provide an unknown number of positional argument values
# if we use ** we can provide an unknown number of key = value arguments

def test_karg(**keys_args_dict):
    for name,value in keys_args_dict.items():
        print("name = ", name)
        print("value = ", value)

In [323]:
test_karg(**{"gene":"EGFR", "expression": 20,"transcript_no": 4})

name =  gene
value =  EGFR
name =  expression
value =  20
name =  transcript_no
value =  4


In [None]:
test_karg(gene = "EGFR", expression = 20, transcript_no = 4, snp_no = 5, genes_regualted = {"TP53", "EGR"})

In [325]:
# we can check for the key and perform computations with the value for that key
# or retrieve the value for a specific key

def test_karg(**keys_args_dict):
    for name,value in keys_args_dict.items():
        print("name = ", name)
        print("value = ", value)
        if (name == "expression"):
            print("new value", 2*keys_args_dict[name])
        

In [327]:
test_karg(gene = "EGFR", expression = 20, transcript_no = 4, snp_no = 5, genes_regualted = {"TP53", "EGR"})

name =  gene
value =  EGFR
name =  expression
value =  20
new value 40
name =  transcript_no
value =  4
name =  snp_no
value =  5
name =  genes_regualted
value =  {'EGR', 'TP53'}


In [329]:
test_karg(gene = "EGFR", Expression = 20, transcript_no = 4, snp_no = 5, genes_regualted = {"TP53", "EGR"})

name =  gene
value =  EGFR
name =  Expression
value =  20
name =  transcript_no
value =  4
name =  snp_no
value =  5
name =  genes_regualted
value =  {'EGR', 'TP53'}


In [None]:
# if the parameter has no **, all our key/value pairs have to be in one dictionary that we create ourselves
def test_karg(keys_args_dict):
    for name,value in keys_args_dict.items():
        print("name = ", name)
        print("value = ", value)

In [331]:
test_karg({"gene":"EGFR", "expression": 20,"transcript_no": 4})

name =  gene
value =  EGFR
name =  expression
value =  20
name =  transcript_no
value =  4


In [333]:
# we cannot provide the dictionary items as independent arguments
test_karg(gene = "EGFR", Expression = 20, transcript_no = 4, snp_no = 5, genes_regualted = {"TP53", "EGR"})

TypeError: test_karg() got an unexpected keyword argument 'gene'
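
A function can combine regular parameters with `*` and `**` parameters, and the same symbols unpack a list or a dictionary at the call site. The sketch below uses a hypothetical function, `describe_gene`, and illustrative values:

```python
def describe_gene(name, *aliases, **annotations):
    """Hypothetical function: one required value, then any extras."""
    lines = [f"gene = {name}"]
    for alias in aliases:                    # extra positional values arrive as a tuple
        lines.append(f"alias = {alias}")
    for key, value in annotations.items():   # extra key=value pairs arrive as a dict
        lines.append(f"{key} = {value}")
    return lines

# * and ** also work when calling, to unpack a list and a dictionary
extra_names = ["ERBB1", "HER1"]
info = {"expression": 20, "transcript_no": 4}
for line in describe_gene("EGFR", *extra_names, **info):
    print(line)
# gene = EGFR
# alias = ERBB1
# alias = HER1
# expression = 20
# transcript_no = 4
```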


____
##### <b>`lambda` function</b> - anonymous function - it has no name
Should be used only with simple expressions

https://docs.python.org/3/reference/expressions.html#lambda<br>
https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/<br>
https://realpython.com/python-lambda/<br>

`lambda arguments : expression`

A lambda function can take <b>any number of arguments</b>, but must always have <b>only one expression</b>.

In [1]:
help(compute_expression)

NameError: name 'compute_expression' is not defined

In [5]:
compute_expression = lambda x, y: x + y + x*y

In [7]:
help(compute_expression)

Help on function <lambda> in module __main__:

<lambda> lambda x, y



In [9]:
compute_expression(2, 3)

11
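
Lambdas are most often passed directly as an argument rather than stored in a variable. As a sketch, here is one used as the `key` of the built-in `sorted()` to order sequences by GC content (the `sequences` list is repeated so the block runs on its own):

```python
sequences = ["ACTTGCCC", "AAAGTC", "CCTAC", "AAACCTA"]

# sort by GC content, highest first; the lambda is used once and needs no name
by_gc = sorted(sequences,
               key=lambda s: (s.count("G") + s.count("C")) / len(s),
               reverse=True)
print(by_gc)  # ['ACTTGCCC', 'CCTAC', 'AAAGTC', 'AAACCTA']
```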

____
### Useful functions

#### Built-in functions
https://docs.python.org/3/library/functions.html

##### <b>`zip(*iterables)`</b> - make an iterator that aggregates respective elements from each of the iterables.   
https://docs.python.org/3/library/functions.html#zip

##### <b>`map(function, iterable, ...)`</b> - apply function to every element of an iterable - return an iterator over the results
https://docs.python.org/3/library/functions.html#map

##### <b>`filter(function, iterable)`</b> - apply function (bool result) to every element of an iterable - return the elements from the input iterable for which the function returns True
https://docs.python.org/3/library/functions.html#filter

##### <b>`functools.reduce(function, iterable[, initializer])`</b> - apply a two-argument function cumulatively to the elements of an iterable to reduce it to a single value
https://docs.python.org/3/library/functools.html#functools.reduce

____



<b>`zip(*iterables)`</b> - make an iterator that aggregates respective elements from each of the iterables.  


In [11]:
combined_res = zip([10,20,30],["ACT","GGT","AACT"],[True,False,True])
combined_res

<zip at 0x150babf80>

In [13]:
for element in combined_res:
    print(element)

(10, 'ACT', True)
(20, 'GGT', False)
(30, 'AACT', True)


In [15]:
list(combined_res)

[]

In [17]:
combined_res = zip([10,20,30],["ACT","GGT","AACT"],[True,False,True])
list(combined_res)

[(10, 'ACT', True), (20, 'GGT', False), (30, 'AACT', True)]

In [19]:
combined_res = zip([10,20,30,500],["ACT","GGT","AACT"],[True,False,True])
list(combined_res)

[(10, 'ACT', True), (20, 'GGT', False), (30, 'AACT', True)]
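
As the last cell shows, `zip` silently stops at the shortest input, so the value 500 is dropped. If the leftover elements matter, `itertools.zip_longest` pads the shorter inputs instead. A small sketch with two of the lists from above (the variable names are chosen for this example):

```python
from itertools import zip_longest

lengths = [10, 20, 30, 500]
seqs = ["ACT", "GGT", "AACT"]

truncated = list(zip(lengths, seqs))
padded = list(zip_longest(lengths, seqs, fillvalue=None))

print(truncated)  # [(10, 'ACT'), (20, 'GGT'), (30, 'AACT')]
print(padded)     # [(10, 'ACT'), (20, 'GGT'), (30, 'AACT'), (500, None)]
```

In Python 3.10 and later, `zip(..., strict=True)` raises a `ValueError` when the lengths differ, which may be preferable when a mismatch indicates a bug.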

In [23]:
x, y, z = zip(*[(10, 'ACT', True), (20, 'GGT', False), (30, 'AACT', True)])
print(x, y, z)

(10, 20, 30) ('ACT', 'GGT', 'AACT') (True, False, True)


In [353]:
# unzip list
x, y, z = zip(*[(3,4,7), (12,15,19), (30,60,90)])
print(x, y, z)

(3, 12, 30) (4, 15, 60) (7, 19, 90)


In [25]:
x, y, z = zip(*[(3,4,7,8), (12,15,19), (30,60,90)])
print(x, y, z)

(3, 12, 30) (4, 15, 60) (7, 19, 90)


In [27]:
combined_res = zip(["ACT","GGT","AAC"], [10,20,30])
dict(combined_res)

{'ACT': 10, 'GGT': 20, 'AAC': 30}

In [29]:
dict(zip(["ACT","GGT","AACT"], [10,20,30]))

{'ACT': 10, 'GGT': 20, 'AACT': 30}

In [33]:
keys = [1,2,3]
values = ["one", "two","three"]
list(zip(keys,values))

[(1, 'one'), (2, 'two'), (3, 'three')]

In [43]:
keys = [1,2,3]
values = ["one", "two","three"]
dict(zip(keys,values))

{1: 'one', 2: 'two', 3: 'three'}

In [39]:
dict([(1,), (2, 'two'), (3, 'three')])

ValueError: dictionary update sequence element #0 has length 1; 2 is required

In [41]:
dict([(1, "one"), (2, 'two'), (3, 'three')])

{1: 'one', 2: 'two', 3: 'three'}

_____

<b>`map(function, iterable, ...)`</b> - apply function to every element of an iterable - return an iterator over the results

In [45]:
map(abs,[-2,0,-5,6,-7])

<map at 0x1445fda50>

In [47]:
list(map(abs,[-2,0,-5,6,-7]))

[2, 0, 5, 6, 7]

In [51]:
def compute_addition(x,y):
    return x + y


In [53]:
list(map(compute_addition, [1,2,3,4], [50,60,70]))

[51, 62, 73]

In [57]:
def compute_addition(x,y = 10):
    return x + y

In [59]:
list(map(compute_addition, [1,2,3,4]))

[11, 12, 13, 14]

In [61]:
list(map(compute_addition, [1,2,3,4], [50,60,70]))

[51, 62, 73]

https://www.geeksforgeeks.org/python-map-function/

In [63]:
numbers1 = [1, 2, 3] 
numbers2 = [4, 5, 6] 
  
result = map(lambda x, y: x + y, numbers1, numbers2) 
list(result)

[5, 7, 9]

In [65]:
list(map(lambda x, y: x + y, [1,2,3,4], [50,60,70]) )

[51, 62, 73]

____
Use a lambda function and the map function to compute a result from the following 3 lists.<br>
For corresponding elements x, y, z of the three lists: if z (the element in the third list) is divisible by 3 return 3*x, otherwise return 2*y.

In [67]:
numbers1 = [1, 2, 3, 4, 5, 6] 
numbers2 = [7, 8, 9, 10, 11, 12] 
numbers3 = [13, 14, 15, 16, 17, 18] 

result = map(lambda x, y, z: 3*x if z % 3 == 0 else 2*y,
             numbers1, numbers2, numbers3) 
list(result)



[14, 16, 9, 20, 22, 18]

In [69]:
def compute_res(x,y,z):
    res = None
    if z%3 == 0:
        res = 3*x
    else:
        res = 2*y
    return res


result = map(compute_res, numbers1, numbers2, numbers3) 
list(result)

[14, 16, 9, 20, 22, 18]

____
<b>`filter(function, iterable)`</b> - apply function (bool result) to every element of an iterable - return the elements from the input iterable for which the function returns True

In [71]:
test_list = [3,4,5,6,7]
result = filter(lambda x: x > 4, test_list)
result

<filter at 0x1445efe50>

In [73]:
list(result)

[5, 6, 7]

In [75]:
# Filter to remove empty structures or 0
test_list = [3, 0, 5, None, 7, "", "AACG", [], {}, {1:"one"}]
result = filter(bool, test_list)
list(result)

[3, 5, 7, 'AACG', {1: 'one'}]

____
<b>`functools.reduce(function, iterable[, initializer])`</b> - apply a two-argument function cumulatively to the elements of an iterable to reduce it to a single value



In [80]:
help(reduce)

NameError: name 'reduce' is not defined

In [84]:
from functools import reduce

In [86]:
help(reduce)

Help on built-in function reduce in module _functools:

reduce(...)
    reduce(function, iterable[, initial]) -> value

    Apply a function of two arguments cumulatively to the items of a sequence
    or iterable, from left to right, so as to reduce the iterable to a single
    value.  For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
    ((((1+2)+3)+4)+5).  If initial is present, it is placed before the items
    of the iterable in the calculation, and serves as a default when the
    iterable is empty.



In [88]:
reduce(lambda x,y: x+y, [47,11,42,13])

113

<img src="https://www.python-course.eu/images/reduce_diagram.png" width="300"/>

https://www.python-course.eu/lambda.php

https://www.geeksforgeeks.org/reduce-in-python/
https://www.tutorialsteacher.com/python/python-reduce-function

In [90]:
test_list = [1,2,3,4,5,6]
reduce(lambda x,y: x+y, test_list)

21

In [92]:
# compute factorial of n
n=5
reduce(lambda x, y: x*y, range(1, n+1))

120

In [94]:
list(range(n))

[0, 1, 2, 3, 4]

In [96]:
list(range(1, n+1))

[1, 2, 3, 4, 5]

In [98]:
reduce(lambda x,y: x+y, ["AACT", "AA", "C", "TTG"])

'AACTAACTTG'

In [104]:
reduce(set.intersection, [{1,2,3,4}, {5,6,7,3, 2}, {2,3,8,9}])

{2, 3}
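
The optional third argument (the initializer) matters when the iterable may be empty. The sketch below also shows built-in alternatives for the two most common reductions; `math.prod` assumes Python 3.8 or later.

```python
from functools import reduce
import math

# Without an initializer, reduce fails on an empty iterable
try:
    reduce(lambda x, y: x + y, [])
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# With an initializer, the empty case returns the initializer
empty_sum = reduce(lambda x, y: x + y, [], 0)
offset_sum = reduce(lambda x, y: x + y, [1, 2, 3], 100)
print(empty_sum, offset_sum)  # 0 106

# For common reductions, built-ins are usually easier to read
print(sum([1, 2, 3, 4, 5, 6]))  # 21
print(math.prod(range(1, 6)))   # 120
```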