Under the hood, Python constructs like `for` loops and list comprehensions rely on objects that are "iterators" or that can be turned into "iterator" objects.

## Iterators

- Python objects that represent streams of data

- Iterators return the data they hold one element at a time (what counts as an "element" depends on the type of iterator). Elements are retrieved by passing the iterator to the `next()` function

- When an iterator has exhausted its elements, subsequent calls to `next()` raise a `StopIteration` exception
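To make the bullet points above concrete, here is a minimal sketch of a hand-written iterator. The `Countdown` class is a hypothetical example, not something from the standard library; it relies on the `__iter__` and `__next__` special methods, which are what `iter()` and `next()` call behind the scenes.

```python
class Countdown:
    """Hypothetical iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # iter() on an iterator simply returns the iterator itself
        return self

    def __next__(self):
        # next() calls this method; signal exhaustion with StopIteration
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


cd = Countdown(3)
first, second, third = next(cd), next(cd), next(cd)
print(first, second, third)   # 3 2 1

# the iterator is now exhausted, so another next() raises StopIteration
try:
    next(cd)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)              # True
```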

## Getting iterators from sequence objects

Common sequence objects like lists, tuples, strings, and ranges are not iterators themselves, but are "iterable" in that we can get an iterator from them using the `iter()` function. Other containers, such as dictionaries, are iterable too:

In [114]:
l = [1, 2, 3]
l_iter = iter(l)
l_iter

<list_iterator at 0x12581ca00>

In [117]:
import math
t = (math.sin, math.cos, math.tan)
t

(<function math.sin(x, /)>,
 <function math.cos(x, /)>,
 <function math.tan(x, /)>)

In [119]:
t_iter = iter(t)
t_iter

<tuple_iterator at 0x12581d7e0>

In [115]:
s = "ATGCAATGC"
s_iter = iter(s)
s_iter

<str_ascii_iterator at 0x12581dab0>

In [116]:
d = {'a': 1, 'b': 2, 'c': 3}
d_iter = iter(d)
d_iter

<dict_keyiterator at 0x12584b420>

In [120]:
# integers and many other objects can't be turned into iterators
i_iter = iter(10)


TypeError: 'int' object is not iterable
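The distinction between an iterable and an iterator can be checked directly. The sketch below assumes only built-in behavior; `is_iterable` is a hypothetical helper written for illustration.

```python
l = [1, 2, 3]
l_iter = iter(l)

# calling iter() on an iterator returns the very same object
same_object = iter(l_iter) is l_iter

# calling iter() on a list returns a new, independent iterator each time
a, b = iter(l), iter(l)
independent = a is not b


def is_iterable(obj):
    """Hypothetical helper: True if iter(obj) succeeds."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


print(same_object, independent)              # True True
print(is_iterable("ATG"), is_iterable(10))   # True False
```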

## Getting elements from an iterator using `next()`

In [130]:
l = [1, 2, 3]
l_iter = iter(l)

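In [131]:
# list() pulls every remaining element, leaving the iterator exhausted
list(l_iter)

[1, 2, 3]

In [126]:
# re-create the iterator, since the one above is now exhausted
l_iter = iter(l)
next(l_iter)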

1

In [127]:
next(l_iter)

2

In [128]:
next(l_iter)

3

In [129]:
next(l_iter)

StopIteration: 

In [134]:
d = {'a': 1, 'b': 2, 'c': 3}
d_iter = iter(d)

In [135]:
next(d_iter)

'a'

In [136]:
next(d_iter)

'b'

In [137]:
next(d_iter)

'c'

In [138]:
next(d_iter)

StopIteration: 

In [141]:
t = (math.sin, math.cos, math.tan)
t_iter = iter(t)
next(t_iter)

<function math.sin(x, /)>

In [142]:
next(t_iter)

<function math.cos(x, /)>
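If handling `StopIteration` is inconvenient, the built-in `next()` also accepts an optional second argument that is returned once the iterator is exhausted. A small sketch:

```python
l_iter = iter([1, 2])

# the second argument to next() is a default returned instead of
# raising StopIteration once the iterator runs out
results = [next(l_iter, None) for _ in range(4)]
print(results)   # [1, 2, None, None]
```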

## for loops and comprehensions can be applied to anything that is iterable

"Iterable" literally means that when you call `iter()` on the object it returns an iterator.

You can think of `for` loops as repeatedly calling `next()` on the iterator until a `StopIteration` exception is raised.

In [150]:
s = "ABCD"
for i in s:
    print(i)

A
B
C
D


In [159]:
# above is equivalent to
s = "ABCD"
s_iter = iter(s)
while True:
    try:
        print(next(s_iter))
    except StopIteration:
        break
    

A
B
C
D
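One consequence worth illustrating: an iterator can only be traversed once, whereas the iterable it came from hands out a fresh iterator for every loop. The sketch below uses list comprehensions in place of `for` loops so the results can be inspected.

```python
s = "ABCD"
s_iter = iter(s)

# the first pass consumes the iterator; the second finds nothing left
first_pass = [c for c in s_iter]
second_pass = [c for c in s_iter]

# a loop over a partly consumed iterator picks up where it left off
it = iter(s)
skipped = next(it)
rest = [c for c in it]

print(first_pass)      # ['A', 'B', 'C', 'D']
print(second_pass)     # []
print(skipped, rest)   # A ['B', 'C', 'D']
```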


### List, Set, and Dictionary Comprehensions

In [162]:
nucs = "ATGCTAATA"

In [163]:
# list comprehension
[i for i in nucs]

['A', 'T', 'G', 'C', 'T', 'A', 'A', 'T', 'A']

In [165]:
# set comprehension
{i for i in nucs}

{'A', 'C', 'G', 'T'}

In [166]:
# dictionary comprehension
{i: i.lower() for i in nucs}

{'A': 'a', 'T': 't', 'G': 'g', 'C': 'c'}
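Note that the dictionary comprehension above has only four entries because repeated keys overwrite one another. Comprehensions can also filter with an `if` clause or compute a value per element; a brief sketch using the same `nucs` string:

```python
nucs = "ATGCTAATA"

# an `if` clause keeps only the elements that pass the test (here, purines)
purines = [n for n in nucs if n in "AG"]

# the value expression can be any computation on the element
counts = {n: nucs.count(n) for n in sorted(set(nucs))}

print(purines)   # ['A', 'G', 'A', 'A', 'A']
print(counts)    # {'A': 4, 'C': 1, 'G': 1, 'T': 3}
```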

### Generator expressions

In Python, a "generator function" is a function that returns an iterator (a generator object). A generator expression uses syntax similar to a list comprehension, but rather than building a list it returns a generator object that produces its values one at a time.
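The usual way to write such a function is with the `yield` keyword. The `codons` function below is a hypothetical example: calling it runs none of its body, and each `next()` resumes the body until the next `yield`.

```python
def codons(seq):
    """Hypothetical generator function: yield successive 3-letter chunks.

    Trailing letters that do not fill a complete chunk are dropped.
    """
    for start in range(0, len(seq) - 2, 3):
        yield seq[start:start + 3]


gen = codons("ATGCATTTG")
first = next(gen)    # runs the body up to the first yield
rest = list(gen)     # consumes whatever is left

print(first)   # ATG
print(rest)    # ['CAT', 'TTG']
```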


In [219]:

# stand-in for some expensive to compute function
def complex_func(x):
    return math.cos(math.sin(x**2))



In [223]:
# list comprehension
# note that this carries out all the complex computations
# AND stores the results. For big inputs or really complex computations
# we may only need one result at a time, or may want to delay the
# computation until we actually need it

lc = [complex_func(i) for i in range(100)]
lc[:10]


[1.0,
 0.6663667453928805,
 0.7270351311688124,
 0.9162743174606308,
 0.9588413200803038,
 0.9912542848596704,
 0.5472018255605284,
 0.5786265349466179,
 0.6057994404065464,
 0.8080934908296122]

In [229]:
# generator expression
# note parentheses rather than square brackets

ge = (complex_func(i) for i in range(100))
ge

<generator object <genexpr> at 0x125a6e330>

In [230]:
next(ge), next(ge), next(ge)

(1.0, 0.6663667453928805, 0.7270351311688124)

In [231]:
ge = (complex_func(i) for i in range(100))
sum(ge)

76.8916482025566

In [232]:
sum(lc)

76.8916482025566
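To see the delayed computation directly, the sketch below swaps in a hypothetical `tracked` function that records every call. It also shows why `ge` had to be re-created before calling `sum()` above: a generator only yields the elements that have not yet been consumed.

```python
calls = []


def tracked(x):
    """Hypothetical stand-in that records each time it is called."""
    calls.append(x)
    return x * x


lc = [tracked(i) for i in range(5)]
calls_after_list = len(calls)       # all five calls happen up front

calls.clear()
ge = (tracked(i) for i in range(5))
calls_after_genexpr = len(calls)    # nothing has been computed yet

next(ge)
next(ge)
calls_after_two = len(calls)        # only the two requested values

total = sum(ge)                     # sums only the remaining values
leftover = list(ge)                 # the generator is now exhausted

print(calls_after_list, calls_after_genexpr, calls_after_two)   # 5 0 2
print(total, leftover)                                          # 29 []
```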

## The `itertools` module

The `itertools` module, part of the Python standard library, provides a large number of functions that produce useful iterators. We'll illustrate a few of these.

In [12]:
import itertools

### Infinite iterators

In [38]:
# This iterator infinitely returns the same thing!

rptr = itertools.repeat("Hello")

[next(rptr) for i in range(5)]


['Hello', 'Hello', 'Hello', 'Hello', 'Hello']

In [40]:
# An example where a repeater can be useful in combination
# with a generator expression


import random

rep1 = itertools.repeat(1)
random_nuc = (random.choice("ATGC") for i in rep1)

# an infinite stream of random nucleotides!
next(random_nuc), next(random_nuc), next(random_nuc)


('T', 'C', 'G')

In [45]:
# using the string join method with the generator expression above
# to generate a random 15bp nucleic acid sequence
# this will be different every time it's evaluated

rand_seq = ''.join(next(random_nuc) for i in range(15))
rand_seq

'TCATGGATGGCCACG'

In [215]:
z = itertools.count(11, step=3)

In [46]:
# itertools.cycle is another infinite iterator
# but one which cycles through its inputs

color_cycle = itertools.cycle(["red", "green", "blue"])

next(color_cycle), next(color_cycle), next(color_cycle), next(color_cycle)

('red', 'green', 'blue', 'red')

In [49]:
# itertools.count takes a starting value
# and a step size and can infinitely return
# the next value in a series

ctr = itertools.count(1, step=9)
[next(ctr) for i in range(10)]


[1, 10, 19, 28, 37, 46, 55, 64, 73, 82]

In [50]:
# remembers where it was if called again
[next(ctr) for i in range(10)]

[91, 100, 109, 118, 127, 136, 145, 154, 163, 172]
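Calling `next()` inside a comprehension works, but `itertools.islice` is a common alternative for taking a finite number of items from an infinite iterator. A short sketch reusing the examples above:

```python
import itertools

# take the first 10 values from an infinite counter
ctr = itertools.count(1, step=9)
first_ten = list(itertools.islice(ctr, 10))

# take 4 values from an infinite cycle
color_cycle = itertools.cycle(["red", "green", "blue"])
colors = list(itertools.islice(color_cycle, 4))

print(first_ten)   # [1, 10, 19, 28, 37, 46, 55, 64, 73, 82]
print(colors)      # ['red', 'green', 'blue', 'red']
```

Never pass an infinite iterator directly to `list()` or `sum()`, since the call will not return.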

### Other useful iterators 

In [52]:
# itertools.batched (Python 3.12+) takes items in batches of size n
# from the input iterable

seq = "ATGCATTTGACTC"

codon_itr = itertools.batched(seq, 3)

next(codon_itr), next(codon_itr), next(codon_itr)


(('A', 'T', 'G'), ('C', 'A', 'T'), ('T', 'T', 'G'))
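On Python versions older than 3.12, where `itertools.batched` is not available, similar behavior can be sketched with `itertools.islice`. The name `batched_fallback` is hypothetical. As with `batched`, the final batch may be shorter than `n`.

```python
import itertools


def batched_fallback(iterable, n):
    """Hypothetical stand-in for itertools.batched on Python < 3.12."""
    it = iter(iterable)
    # keep slicing off n items until islice returns an empty batch
    while batch := tuple(itertools.islice(it, n)):
        yield batch


seq = "ATGCATTTGACTC"

# join each tuple of characters back into a string
codons = ["".join(b) for b in batched_fallback(seq, 3)]
print(codons)   # ['ATG', 'CAT', 'TTG', 'ACT', 'C']
```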

In [74]:
# itertools.groupby provides an iterator
# that groups elements by a key function
#
# NOTE: items should be sorted by the same key function first



# first example, group by first letter

animals = ["aardvark", "ant", "dog", 
           "cat", "cougar", "koala", "beaver", "bear"]

sorted_by_first = sorted(animals, key = lambda x: x[0])

grp_by_first = itertools.groupby(sorted_by_first, key = lambda x: x[0])

for first, grp in grp_by_first:
    print("First letter:", first, " -> Group: ", list(grp))



First letter: a  -> Group:  ['aardvark', 'ant']
First letter: b  -> Group:  ['beaver', 'bear']
First letter: c  -> Group:  ['cat', 'cougar']
First letter: d  -> Group:  ['dog']
First letter: k  -> Group:  ['koala']


In [76]:
# second example, group by length of name

sorted_by_len = sorted(animals, key = len)

grp_by_len = itertools.groupby(sorted_by_len, key = len)

for namelen, grp in grp_by_len:
    print("Name length:", namelen, " -> Group: ", list(grp))


Name length: 3  -> Group:  ['ant', 'dog', 'cat']
Name length: 4  -> Group:  ['bear']
Name length: 5  -> Group:  ['koala']
Name length: 6  -> Group:  ['cougar', 'beaver']
Name length: 8  -> Group:  ['aardvark']
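The reason for sorting first is that `groupby` only groups consecutive items that share a key. A sketch of what happens with the unsorted `animals` list:

```python
import itertools

animals = ["aardvark", "ant", "dog",
           "cat", "cougar", "koala", "beaver", "bear"]

# grouping WITHOUT sorting: a new group starts whenever the key changes
unsorted_groups = [(k, list(g)) for k, g in itertools.groupby(animals, key=len)]
keys = [k for k, _ in unsorted_groups]

print(keys)   # [8, 3, 6, 5, 6, 4]
for namelen, grp in unsorted_groups:
    print("Name length:", namelen, " -> Group: ", grp)
```

The key 6 appears twice because "cougar" and "beaver" are not adjacent in the input.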


### Products, Permutations, and Combinations in itertools

In [31]:
# product (permutation with repetition) ->
#    give all possible sequences of length 2,
#    composed of the characters drawn from "ABC"

list(itertools.product("ABC", repeat=2))

[('A', 'A'),
 ('A', 'B'),
 ('A', 'C'),
 ('B', 'A'),
 ('B', 'B'),
 ('B', 'C'),
 ('C', 'A'),
 ('C', 'B'),
 ('C', 'C')]

In [32]:
# permutation (w/out repetition) ->
#    give all possible sequences of length 2,
#    composed of the characters drawn from "ABC"
#    but with no character appearing more than once

list(itertools.permutations("ABC", 2))

[('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]

In [34]:
# combination ->
#    give all possible selections of length 2,
#    composed of the characters drawn from "ABC".
#    Order doesn't matter and no character appears
#    more than once.

list(itertools.combinations("ABC", 2))

[('A', 'B'), ('A', 'C'), ('B', 'C')]

In [36]:
# combinations_with_replacement ->
#    give all possible selections of length 2,
#    composed of the characters drawn from "ABC".
#    Order doesn't matter, but a character
#    can be repeated.

list(itertools.combinations_with_replacement("ABC", 2))

[('A', 'A'), ('A', 'B'), ('A', 'C'), ('B', 'B'), ('B', 'C'), ('C', 'C')]
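As a sanity check, the number of items each of these iterators yields should match the standard counting formulas. The sketch below assumes Python 3.8 or later, where `math.perm` and `math.comb` are available.

```python
import itertools
import math

items = "ABC"
n, k = len(items), 2

n_product = len(list(itertools.product(items, repeat=k)))
n_perm = len(list(itertools.permutations(items, k)))
n_comb = len(list(itertools.combinations(items, k)))
n_comb_repl = len(list(itertools.combinations_with_replacement(items, k)))

print(n_product, n ** k)                       # 9 9
print(n_perm, math.perm(n, k))                 # 6 6
print(n_comb, math.comb(n, k))                 # 3 3
print(n_comb_repl, math.comb(n + k - 1, k))    # 6 6
```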

## The `pathlib` module

[pathlib](https://docs.python.org/3/library/pathlib.html) is a module in the Python standard library that provides a convenient object-oriented interface for working with file system paths.

It provides an iterator-based interface for getting path information and searching for matches.


In [2]:
from pathlib import Path

In [3]:
# specify a path from scratch

p = Path("/Applications/")
p

PosixPath('/Applications')

In [4]:
# use the home() method to get my home directory

Path.home()

PosixPath('/Users/pmagwene')

In [7]:
# append a subdirectory "gits" to my home directory

git_path = Path.home() / "gits"
git_path

PosixPath('/Users/pmagwene/gits')

In [8]:
# the expanduser() method translates a tilde (~) to the user's home directory

Path("~/tmp").expanduser()

PosixPath('/Users/pmagwene/tmp')

In [9]:
bio724 = git_path / "Bio724D_2024_2025"
bio724

PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025')
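Path objects also expose the pieces of a path as attributes. The sketch below uses `PurePosixPath`, which manipulates paths without touching the file system, so it behaves the same on any machine. The path itself is a made-up example.

```python
from pathlib import PurePosixPath

# hypothetical path, used only to illustrate the attributes
p = PurePosixPath("/home/user/project/code_examples/align_sequences.sh")

print(p.name)     # align_sequences.sh
print(p.stem)     # align_sequences
print(p.suffix)   # .sh
print(p.parent)   # /home/user/project/code_examples
print(p.parts)    # ('/', 'home', 'user', 'project', 'code_examples', 'align_sequences.sh')

# p.parents lists each enclosing directory, innermost first
ancestors = [str(a) for a in p.parents]
print(ancestors[:2])   # ['/home/user/project/code_examples', '/home/user/project']
```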

In [53]:
# Path.iterdir() returns an iterator over the items contained in a directory

directories = [i for i in bio724.iterdir() if i.is_dir()]
directories

[PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/class_notes'),
 PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/practical-tutorials'),
 PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/docs'),
 PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/slides'),
 PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/code_examples'),
 PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/templates'),
 PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/.git'),
 PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/data'),
 PosixPath('/Users/pmagwene/gits/Bio724D_2024_2025/python_notebooks')]

In [65]:
# We can search for matches in a Path using the
# `glob` and `rglob` (recursive glob)
# methods, both of which return generators

shell_files = bio724.rglob("*.sh")
shell_files

<generator object Path.rglob at 0x104cb2240>

In [67]:
for f in shell_files:
    print(f)

/Users/pmagwene/gits/Bio724D_2024_2025/code_examples/conductor.sh
/Users/pmagwene/gits/Bio724D_2024_2025/code_examples/align_sequences.sh
/Users/pmagwene/gits/Bio724D_2024_2025/code_examples/build_trees.sh
/Users/pmagwene/gits/Bio724D_2024_2025/code_examples/combine_sequences.sh
/Users/pmagwene/gits/Bio724D_2024_2025/code_examples/get_sequences.sh
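The examples above depend on one particular machine's directory layout. For something reproducible, the sketch below builds a small throwaway directory tree with `tempfile` (the file and directory names are made up) and compares `glob`, `rglob`, and `iterdir`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # build a tiny directory tree: one script at the top, one nested
    (root / "scripts").mkdir()
    (root / "top.sh").touch()
    (root / "scripts" / "nested.sh").touch()
    (root / "scripts" / "notes.txt").touch()

    # glob only matches directly inside root; rglob also descends
    top_level = sorted(p.name for p in root.glob("*.sh"))
    recursive = sorted(p.name for p in root.rglob("*.sh"))

    # iterdir combined with is_file() keeps only the regular files
    files_only = sorted(p.name for p in root.iterdir() if p.is_file())

print(top_level)    # ['top.sh']
print(recursive)    # ['nested.sh', 'top.sh']
print(files_only)   # ['top.sh']
```

The results are sorted because the order in which these methods yield paths is not guaranteed.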
