![Erudio logo](../../img/erudio-logo-small.png)

# More Itertools

In this lesson we continue to explore combining iterators, both using additional capabilities in `itertools` and using a few extras from the third-party `more-itertools` package, which adds many useful functions.  The official documentation of `itertools` contains a number of short "recipes"; all of these recipes, along with numerous other functions, are included in `more_itertools`.

In [2]:
!pip install more_itertools

Collecting more_itertools
  Downloading more_itertools-10.2.0-py3-none-any.whl (57 kB)
     ---------------------------------------- 57.0/57.0 kB 2.9 MB/s eta 0:00:00
Installing collected packages: more_itertools
Successfully installed more_itertools-10.2.0





In [3]:
from itertools import *
from more_itertools import *

## Run-length encoding

Sometimes data can be compressed very time- and memory-efficiently using a technique called *run-length encoding* (RLE).  The idea is that in certain data sets the same element may occur many times in succession.  If so, it can be more compact to represent such a run as the repeated element together with a count of its occurrences.  As compression techniques go, RLE alone rarely gives the most compact result, although it is often combined with other techniques.  Let us implement it in an iterator style to show some of the conciseness of `itertools`.

As sample data, let us look at a FASTA record of ribosomal RNA.  If the recurring objects were larger than a single character, the savings from RLE would be more significant, but this suffices to demonstrate the concept.  Here is a sample sequence:

In [4]:
ab000482 = """
agtttgatcctggctcagaacgaacgctggcggcaggcctaacacatgcaagtcgaggga
gaagctatcttcggatnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnaaacgactgctaataccgcatacgcccttcgggggaaagatttatc
gctattcgattggcccgcgttagattagctaagttggtaaggtaacggcttaccaaggcg
acgatctatagctggtttgagaggatgatcagccacactgggactgagacacggcccaga
ctcctacgggaggcagcagtggggaatattgcgcaatggaggaaactctgacgcagccat
gccgcgtgagtgaagaaggccttagggttgtaaagctctttcagacgtgatgaatgatga
cagtagcgtcaaaagaagttccggctaaacttcgtgccagcagccgcggtaatacgaagg
gaactagcgttgttcggatttactgggcgtaaagagcatgtaggcggattggacagttga
gggtgaaatcccagagctcaactctggaacggccttcaatacttccagtctagagtccgt
aagggggtggtggaattccgagtgtagaggtgaaattcgtagatattcggaggaacacca
gtggcgaaggcgaccacctggtacggtactgacgctgagatgcgaaagcgtggggagcaa
acaggattagataccctggtagtccacgccgtaaacgatgagtgctagttgtcaggatgt
ttacatcttggtgacgcagctaacgcattaagcactccgcctggggagtacggtcgcaag
attaaaactcaaaggaattgacgggggcccgcacaagcggtggagcatgtggtttaattc
gaagcaacgcgaagaaccttaccaattcttgacatacctgtcgcgatttccagagatgga
tttcttcagttcggctggacaggatacaggtgctgcatggctgtcgtcagctcgtgtcgt
gagatgttgggttaagtcccgcaacgagcgcaaccctcacccctagttgccagcatttag
ttgggcactctatgggaactgccggtgacaagccggaggaaggtggggatgacgtcaagt
catcatggcccttacggattgggctacacacgtgctacaatggtaactacagtgggcagc
gacgtcgcgaggcgaagcaaatctccaaaagttatctcagttcggattgttctctgcaac
tcgagagcatgaagtcggaatcgcctagtaatcgcggatcagcatgccgcggtgaata
"""

Let us wrap that slightly to remove the newlines.  This rRNA sample is quite small; the genes in human DNA, for example, are vastly larger (but still of this general form in FASTA format).  An iterator over the individual nucleotide symbols can be constructed as:

In [5]:
def symbols(fasta):
    # Use a generator expression to skip the newlines lazily
    yield from (c for c in fasta if c != '\n')

# A few symbols from first line continuing on second
list(islice(symbols(ab000482), 55, 65))

['a', 'g', 'g', 'g', 'a', 'g', 'a', 'a', 'g', 'c']

A generic RLE encoder can be built as a thin wrapper around `itertools.groupby()`, which groups consecutive equal items.  Notice that we encode incrementally as symbols are read.

In [6]:
def rle_encode(it):
    for k, g in groupby(it):
        yield (k, len(list(g)))

In [7]:
seq = symbols(ab000482)
for enc in rle_encode(islice(seq, 70, 140)):
    print(enc, end=" ")

('t', 1) ('c', 1) ('g', 2) ('a', 1) ('t', 1) ('n', 58) ('a', 3) ('c', 1) ('g', 1) ('a', 1) 
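
One detail of `rle_encode()` is that `len(list(g))` builds a temporary list for every run.  A minimal variant, sketched here under the hypothetical name `rle_encode_lean()` (not part of the lesson's code), counts each run without storing it.  It also illustrates that `groupby()` only merges *consecutive* equal items:

```python
from itertools import groupby

def rle_encode_lean(it):
    # Count the members of each run one at a time instead of
    # materializing them in a list first
    for key, group in groupby(it):
        yield (key, sum(1 for _ in group))

# 'a' appears in two separate runs, so it is reported twice
print(list(rle_encode_lean("aaabaa")))
# [('a', 3), ('b', 1), ('a', 2)]
```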

We can reverse the encoding by implementing a similarly short function using `itertools.repeat()` and `itertools.chain.from_iterable()`.  Repeating is just yielding the same item back a given number of times.  Chaining is more interesting.  It lazily combines as many iterables as you like, moving on to the next as soon as the previous one is exhausted.  However, if what you want to chain is not a handful of named iterables, but rather an iterable of perhaps thousands or millions of iterables, it is impractical to pass these all as explicit arguments.  For example:

In [8]:
a = {1, 2, 3}
b = "ABC"
c = [9, 4, 2]
for x in chain(a, b, c):
    print(x, end=' ')

1 2 3 A B C 9 4 2 

In [9]:
def iter_of_iters():
    yield a
    yield b
    yield c
    
d = iter_of_iters()
    
for x in chain.from_iterable(d):
    print(x, end=' ')

1 2 3 A B C 9 4 2 
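
The difference matters most when the outer iterable is itself lazy or unbounded.  As a small sketch (the generator `runs()` is a made-up illustration), `chain.from_iterable()` can flatten an infinite stream of iterables, whereas `chain(*runs())` would try to unpack the whole stream before producing anything:

```python
from itertools import chain, count, islice, repeat

def runs():
    # Infinite stream of iterables: one 1, two 2s, three 3s, ...
    for n in count(1):
        yield repeat(n, n)

flat = chain.from_iterable(runs())
first_ten = list(islice(flat, 10))
print(first_ten)
# [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
```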

So we can decode incrementally, taking each tuple `(val, ncount)` that we get from RLE encoding.

In [10]:
def rle_decode(it):
    yield from chain.from_iterable(repeat(x, n) for x, n in it)

In [11]:
seq = symbols(ab000482)
encoded = rle_encode(seq)
decoded = rle_decode(encoded)
print(seq, encoded, decoded, sep='\n')

<generator object symbols at 0x0000026A57425C40>
<generator object rle_encode at 0x0000026A574258C0>
<generator object rle_decode at 0x0000026A574259A0>


Looping through `decoded` should reproduce our original sequence.

In [12]:
for symbol in islice(decoded, 60):
    print(symbol, end='')

agtttgatcctggctcagaacgaacgctggcggcaggcctaacacatgcaagtcgaggga
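
A quick way to gain confidence in the pair of functions is a round-trip check on a small input.  The sketch below repeats the two definitions so that it runs on its own; the sample string is made up for illustration:

```python
from itertools import chain, groupby, repeat

def rle_encode(it):
    for k, g in groupby(it):
        yield (k, len(list(g)))

def rle_decode(it):
    yield from chain.from_iterable(repeat(x, n) for x, n in it)

# Invented sample: four a's, eight n's, two g's, one c
sample = "a" * 4 + "n" * 8 + "gg" + "c"
encoded = list(rle_encode(sample))
print(encoded)
# [('a', 4), ('n', 8), ('g', 2), ('c', 1)]

# Decoding the encoded pairs gives back the original symbols
roundtrip = "".join(rle_decode(encoded))
assert roundtrip == sample
```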

## Combining iterables

Let's build further on the idea of chaining that we saw used in `rle_decode()`.  FASTA files consist of a header line followed by sequence information.  We might have stored millions or billions of these, and wish to process them only incrementally.  For our stipulated purpose, we would like neither to read them all in at once, nor to read any further files once a search has found a certain pattern.

Combining things we've done, let us identify long runs of the same symbol across a family of separate FASTA files.

In [42]:
from glob import glob
from collections import namedtuple
Record = namedtuple("Record", "metadata sequence")

def read_fasta(pat):
    for fname in glob(pat):
        with open(fname) as fasta:
            # Return a pair of sequence description and seq as iterator
            yield Record(next(fasta).strip(), symbols(fasta.read()))
            
next(read_fasta("data/rRNA*.fasta"))

Record(metadata='>AB000106_1|Sphingomonas sp.|16S ribosomal RNA', sequence=<generator object symbols at 0x0000026A574B0820>)

Let us find those sequences that contain a run of at least 30 repetitions of the same symbol.

In [14]:
for record in read_fasta("rRNA*.fasta"):
    long_subseq = [enc for enc in rle_encode(record.sequence) if enc[1] >= 30]
    if long_subseq:
        print(record.metadata)
        print(*long_subseq)

>AB000476_1|Novispirillum itersonii|16S rRNA
('n', 34)
>AB000477_1|Novispirillum itersonii|16S rRNA
('n', 34)
>AB000478_1|Novispirillum itersonii|16S rRNA
('n', 34)
>AB000481_1|Aquaspirillum polymorphum|16S rRNA
('n', 99)
>AB000482_1|Terasakiella pusilla|16S rRNA
('n', 58)
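
Because records and symbols are produced lazily, a search can also stop at the first match and never request the remaining records.  The following is only a sketch under stated assumptions: `toy_records()` stands in for `read_fasta()`, its metadata and sequences are invented, and `first_long_run()` is a hypothetical helper rather than part of the lesson's code:

```python
from itertools import groupby

def rle_encode(it):
    for k, g in groupby(it):
        yield (k, len(list(g)))

opened = []  # Tracks which toy records were actually consumed

def toy_records():
    # Stand-in for read_fasta(); metadata and sequences are invented
    data = [
        (">toy_1|invented", "acgt" * 10),
        (">toy_2|invented", "ac" + "n" * 60 + "gt"),
        (">toy_3|invented", "t" * 80),
    ]
    for metadata, sequence in data:
        opened.append(metadata)
        yield metadata, iter(sequence)

def first_long_run(records, min_len=50):
    # Return as soon as any run exceeds min_len; later records are
    # never requested from the generator
    for metadata, sequence in records:
        for symbol, count in rle_encode(sequence):
            if count > min_len:
                return metadata, symbol, count
    return None

hit = first_long_run(toy_records())
print(hit)
# ('>toy_2|invented', 'n', 60)
print(opened)
# ['>toy_1|invented', '>toy_2|invented']
```

The third toy record is never touched, which is the behavior we would want when each record means opening another file.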


## Interleaving, tupling, chaining

The functions `itertools.chain()` and `itertools.chain.from_iterable()` combine multiple iterables.  Built-in `zip()` and `itertools.zip_longest()` also do this, although in a manner that advances each iterable in step, one item at a time.  These are all useful, but sometimes a slightly different way of combining is wanted.  `more_itertools` provides `interleave()` and `interleave_longest()`.  Let us look at all these options using some simple collections (iterables).

These all work with iterables in general, including infinite ones. But it is easier to see with small collections.

In [15]:
a = {1, 2, 3, 4, 5}
b = "ABCD"
c = [99, 88, 77]

In [16]:
list(chain(a, b, c))

[1, 2, 3, 4, 5, 'A', 'B', 'C', 'D', 99, 88, 77]

In [17]:
list(chain.from_iterable([a, b, c]))

[1, 2, 3, 4, 5, 'A', 'B', 'C', 'D', 99, 88, 77]

In [18]:
list(zip(a, b, c))

[(1, 'A', 99), (2, 'B', 88), (3, 'C', 77)]

In [19]:
list(zip_longest(a, b, c))

[(1, 'A', 99), (2, 'B', 88), (3, 'C', 77), (4, 'D', None), (5, None, None)]

In [20]:
list(interleave(a, b, c))

[1, 'A', 99, 2, 'B', 88, 3, 'C', 77]

In [21]:
list(interleave_longest(a, b, c))

[1, 'A', 99, 2, 'B', 88, 3, 'C', 77, 4, 'D', 5]
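
To see how interleaving relates to the standard tools, here is a rough standard-library sketch.  The names `interleave_sketch()` and `interleave_longest_sketch()` are hypothetical, and this is not claimed to be how `more_itertools` implements its functions; it only reproduces the behavior shown above by flattening the tuples from `zip()` and `zip_longest()`.  Lists are used instead of a set so that the order is unambiguous:

```python
from itertools import chain, zip_longest

def interleave_sketch(*iterables):
    # Flatten the tuples of zip(); stops with the shortest input
    return chain.from_iterable(zip(*iterables))

def interleave_longest_sketch(*iterables):
    # A private sentinel marks the gaps left by exhausted inputs,
    # so that legitimate None values in the data are preserved
    missing = object()
    padded = zip_longest(*iterables, fillvalue=missing)
    return (x for x in chain.from_iterable(padded) if x is not missing)

a = [1, 2, 3, 4, 5]
b = "ABCD"
c = [99, 88, 77]
print(list(interleave_sketch(a, b, c)))
# [1, 'A', 99, 2, 'B', 88, 3, 'C', 77]
print(list(interleave_longest_sketch(a, b, c)))
# [1, 'A', 99, 2, 'B', 88, 3, 'C', 77, 4, 'D', 5]
```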

## Combinatorics

One of the useful things `itertools` can do is combinatorics on iterators.  In the standard library, this consists of `permutations()`, `combinations()`, Cartesian `product()`, and `combinations_with_replacement()`.  But `more_itertools` adds a bunch more, including (but not limited to) `distinct_permutations()`, `circular_shifts()`, `partitions()`, `set_partitions()`, and `powerset()`.

The functions generally consume entire iterators as implemented, and are not suitable for infinite iterators. However, for finite iterators, they provide efficient and *lazy* ways to get arrangements of the source iteration elements.

In [22]:
a = {1, 2, 3, 4}
b = "ABC"
c = [99, 88]

One convenience of Cartesian product is that it can sometimes simplify nested loops.

In [23]:
for prod in product(a, b):
    print(prod, end=" ")

(1, 'A') (1, 'B') (1, 'C') (2, 'A') (2, 'B') (2, 'C') (3, 'A') (3, 'B') (3, 'C') (4, 'A') (4, 'B') (4, 'C') 

In [24]:
# Avoid two extra levels of nesting while looking at all combos
for i, j, k in product(a, b, c):
    print(i, j, k, sep='', end=' ')

1A99 1A88 1B99 1B88 1C99 1C88 2A99 2A88 2B99 2B88 2C99 2C88 3A99 3A88 3B99 3B88 3C99 3C88 4A99 4A88 4B99 4B88 4C99 4C88 
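
For comparison, this small sketch (with made-up short inputs) writes out the nested loops that a three-way `product()` replaces and checks that both give the same tuples in the same order:

```python
from itertools import product

a = [1, 2]
b = "AB"
c = [99, 88]

nested = []
for i in a:
    for j in b:
        for k in c:
            nested.append((i, j, k))

# product() varies the rightmost iterable fastest, like the innermost loop
assert nested == list(product(a, b, c))
print(nested[:3])
# [(1, 'A', 99), (1, 'A', 88), (1, 'B', 99)]
```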

In [25]:
list(permutations(b))

[('A', 'B', 'C'),
 ('A', 'C', 'B'),
 ('B', 'A', 'C'),
 ('B', 'C', 'A'),
 ('C', 'A', 'B'),
 ('C', 'B', 'A')]

In [26]:
list(permutations(b, r=2))

[('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]

In [27]:
# Not order dependent for distinct combination
list(combinations(b, r=2))

[('A', 'B'), ('A', 'C'), ('B', 'C')]

In [28]:
print(list(powerset(a)))

[(), (1,), (2,), (3,), (4,), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (1, 2, 3, 4)]
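
The power set is one of the recipes given in the `itertools` documentation, and it is short enough to write by hand: chain together the combinations of every size from 0 up to the full length.  The name `powerset_sketch()` is used here only to avoid shadowing the `more_itertools` function:

```python
from itertools import chain, combinations

def powerset_sketch(iterable):
    items = list(iterable)
    # Combinations of size 0, then size 1, ... up to len(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

print(list(powerset_sketch([1, 2, 3])))
# [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
```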


Sometimes you care about distinctness.  The standard tools only consider distinctness relative to position in an iterator.  Some extras in `more_itertools` also filter by item equality.

In [32]:
three_letters = list(permutations('roberto', r=3))
print(len(three_letters))
three_letters

210


[('r', 'o', 'b'),
 ('r', 'o', 'e'),
 ('r', 'o', 'r'),
 ('r', 'o', 't'),
 ('r', 'o', 'o'),
 ('r', 'b', 'o'),
 ('r', 'b', 'e'),
 ('r', 'b', 'r'),
 ('r', 'b', 't'),
 ('r', 'b', 'o'),
 ('r', 'e', 'o'),
 ('r', 'e', 'b'),
 ('r', 'e', 'r'),
 ('r', 'e', 't'),
 ('r', 'e', 'o'),
 ('r', 'r', 'o'),
 ('r', 'r', 'b'),
 ('r', 'r', 'e'),
 ('r', 'r', 't'),
 ('r', 'r', 'o'),
 ('r', 't', 'o'),
 ('r', 't', 'b'),
 ('r', 't', 'e'),
 ('r', 't', 'r'),
 ('r', 't', 'o'),
 ('r', 'o', 'o'),
 ('r', 'o', 'b'),
 ('r', 'o', 'e'),
 ('r', 'o', 'r'),
 ('r', 'o', 't'),
 ('o', 'r', 'b'),
 ('o', 'r', 'e'),
 ('o', 'r', 'r'),
 ('o', 'r', 't'),
 ('o', 'r', 'o'),
 ('o', 'b', 'r'),
 ('o', 'b', 'e'),
 ('o', 'b', 'r'),
 ('o', 'b', 't'),
 ('o', 'b', 'o'),
 ('o', 'e', 'r'),
 ('o', 'e', 'b'),
 ('o', 'e', 'r'),
 ('o', 'e', 't'),
 ('o', 'e', 'o'),
 ('o', 'r', 'r'),
 ('o', 'r', 'b'),
 ('o', 'r', 'e'),
 ('o', 'r', 't'),
 ('o', 'r', 'o'),
 ('o', 't', 'r'),
 ('o', 't', 'b'),
 ('o', 't', 'e'),
 ('o', 't', 'r'),
 ('o', 't', 'o'),
 ('o', 'o'

In [33]:
three_distinct = list(distinct_permutations('roberto', r=3))
print(len(three_distinct))
three_distinct

84


[('b', 'e', 'o'),
 ('b', 'e', 'r'),
 ('b', 'e', 't'),
 ('b', 'o', 'e'),
 ('b', 'o', 'o'),
 ('b', 'o', 'r'),
 ('b', 'o', 't'),
 ('b', 'r', 'e'),
 ('b', 'r', 'o'),
 ('b', 'r', 'r'),
 ('b', 'r', 't'),
 ('b', 't', 'e'),
 ('b', 't', 'o'),
 ('b', 't', 'r'),
 ('e', 'b', 'o'),
 ('e', 'b', 'r'),
 ('e', 'b', 't'),
 ('e', 'o', 'b'),
 ('e', 'o', 'o'),
 ('e', 'o', 'r'),
 ('e', 'o', 't'),
 ('e', 'r', 'b'),
 ('e', 'r', 'o'),
 ('e', 'r', 'r'),
 ('e', 'r', 't'),
 ('e', 't', 'b'),
 ('e', 't', 'o'),
 ('e', 't', 'r'),
 ('o', 'b', 'e'),
 ('o', 'b', 'o'),
 ('o', 'b', 'r'),
 ('o', 'b', 't'),
 ('o', 'e', 'b'),
 ('o', 'e', 'o'),
 ('o', 'e', 'r'),
 ('o', 'e', 't'),
 ('o', 'o', 'b'),
 ('o', 'o', 'e'),
 ('o', 'o', 'r'),
 ('o', 'o', 't'),
 ('o', 'r', 'b'),
 ('o', 'r', 'e'),
 ('o', 'r', 'o'),
 ('o', 'r', 'r'),
 ('o', 'r', 't'),
 ('o', 't', 'b'),
 ('o', 't', 'e'),
 ('o', 't', 'o'),
 ('o', 't', 'r'),
 ('r', 'b', 'e'),
 ('r', 'b', 'o'),
 ('r', 'b', 'r'),
 ('r', 'b', 't'),
 ('r', 'e', 'b'),
 ('r', 'e', 'o'),
 ('r', 'e'
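
The two counts can be cross-checked with the standard library alone by de-duplicating the positional permutations with a `set`.  This is only a verification sketch: it has to build all 210 tuples first and loses the ordering, which is the kind of work a dedicated function can avoid:

```python
from itertools import permutations

all_perms = list(permutations("roberto", r=3))
unique_perms = set(all_perms)
print(len(all_perms), len(unique_perms))
# 210 84
```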

Partitions can also be handy at times (from `more_itertools`):

In [31]:
for part in partitions('mertz'):
    for segment in part:
        print(''.join(segment), end=' : ')
    print()

mertz : 
m : ertz : 
me : rtz : 
mer : tz : 
mert : z : 
m : e : rtz : 
m : er : tz : 
m : ert : z : 
me : r : tz : 
me : rt : z : 
mer : t : z : 
m : e : r : tz : 
m : e : rt : z : 
m : er : t : z : 
me : r : t : z : 
m : e : r : t : z : 
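
Order-preserving partitions amount to choosing where to cut: a sequence of `n` items has `n - 1` gaps, and every subset of those gaps gives one partition, hence the 16 results for the 5 letters above.  A minimal sketch of that idea, using the hypothetical name `partitions_sketch()` and only the standard library:

```python
from itertools import chain, combinations

def partitions_sketch(iterable):
    seq = list(iterable)
    n = len(seq)
    positions = range(1, n)
    # Every subset of the gaps between items is one set of cut points
    all_cuts = chain.from_iterable(
        combinations(positions, r) for r in range(len(positions) + 1)
    )
    for cuts in all_cuts:
        starts = (0,) + cuts
        stops = cuts + (n,)
        yield [seq[i:j] for i, j in zip(starts, stops)]

parts = [["".join(seg) for seg in part] for part in partitions_sketch("mertz")]
print(len(parts))
# 16
print(parts[:3])
# [['mertz'], ['m', 'ertz'], ['me', 'rtz']]
```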


Simple partitions preserve order and split only between adjacent elements, but set partitions may group any elements together regardless of position (and hence there are many more).

In [34]:
for part in set_partitions('sanchez'):
    for segment in part:
        print(''.join(segment), end=' : ')
    print()

sanchez : 
s : anchez : 
sa : nchez : 
a : snchez : 
san : chez : 
an : schez : 
sn : achez : 
n : sachez : 
sanc : hez : 
anc : shez : 
snc : ahez : 
nc : sahez : 
sac : nhez : 
ac : snhez : 
sc : anhez : 
c : sanhez : 
sanch : ez : 
anch : sez : 
snch : aez : 
nch : saez : 
sach : nez : 
ach : snez : 
sch : anez : 
ch : sanez : 
sanh : cez : 
anh : scez : 
snh : acez : 
nh : sacez : 
sah : ncez : 
ah : sncez : 
sh : ancez : 
h : sancez : 
sanche : z : 
anche : sz : 
snche : az : 
nche : saz : 
sache : nz : 
ache : snz : 
sche : anz : 
che : sanz : 
sanhe : cz : 
anhe : scz : 
snhe : acz : 
nhe : sacz : 
sahe : ncz : 
ahe : sncz : 
she : ancz : 
he : sancz : 
sance : hz : 
ance : shz : 
snce : ahz : 
nce : sahz : 
sace : nhz : 
ace : snhz : 
sce : anhz : 
ce : sanhz : 
sane : chz : 
ane : schz : 
sne : achz : 
ne : sachz : 
sae : nchz : 
ae : snchz : 
se : anchz : 
e : sanchz : 
s : a : nchez : 
s : an : chez : 
s : n : achez : 
s : anc : hez : 
s : nc : ahez : 
s : ac : nhez : 
s : c
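
The number of set partitions of `n` distinct items is the Bell number of `n`, which grows much faster than the `2**(n - 1)` order-preserving partitions.  As a rough illustration, the hypothetical helper `bell_number()` below computes it with the Bell triangle; for the 7 letters of `'sanchez'` it gives 877, compared with 64 simple partitions:

```python
def bell_number(n):
    # Build rows of the Bell triangle; each row starts with the last
    # entry of the previous row, and each further entry adds the
    # entry to its left to the entry above that one
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

print([bell_number(n) for n in range(8)])
# [1, 1, 2, 5, 15, 52, 203, 877]
print(2 ** (7 - 1))
# 64
```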

# Exercises

## Description

In this exercise, you will utilize `itertools` and the iterator protocol to write generator functions `prime_factors(N)` and `all_factorizations(N)`.  These must be generator functions that yield results incrementally, not functions that produce lists of complete answers (although the tests here will concretize the iterators).  In your answer to all factorizations, include the number itself, but always exclude 1 from each tuple.  As the sample output and the tests show, each expected tuple contains at most one composite factor: `(4, 105)`, for example, is not among the expected factorizations of 420.

As a starting point, the prime generation function and its support function that were presented in an earlier exercise are available in the setup.  Answers should look like the ones below.  However, you may yield the individual factors or tuples of factors in whatever order you like; the tests will permit a different ordering than shown in this example.


```python
>>> list(prime_factors(420))
[2, 2, 3, 5, 7]

>>> list(all_factorizations(420))
[(2, 2, 3, 5, 7),
 (3, 4, 5, 7),
 (5, 7, 12),
 (7, 60),
 (5, 84),
 (3, 7, 20),
 (3, 140),
 (3, 5, 28),
 (2, 5, 6, 7),
 (2, 7, 30),
 (2, 210),
 (2, 5, 42),
 (2, 3, 7, 10),
 (2, 3, 70),
 (2, 3, 5, 14),
 (2, 2, 7, 15),
 (2, 2, 105),
 (2, 2, 5, 21),
 (2, 2, 3, 35),
 (420,)]
```

## Setup

In [35]:
from itertools import *
from math import sqrt, ceil

def up_to(seq, lim):
    for n in seq:
        if n <= lim:
            yield n
        else:
            break

def get_primes():
    "Pretty good Sieve of Eratosthenes"
    yield 2
    candidate = 3
    found = []
    while True:
        lim = int(ceil(sqrt(candidate)))
        if all(candidate % prime != 0 for prime in up_to(found, lim)):
            yield candidate
            found.append(candidate)
        candidate += 2
        
def prime_factors(N: int):
    # Correct signature, correct for N=10
    yield 2
    yield 5
    
def all_factorizations(N: int):
    # Correct signature, correct for N=10
    yield (2, 5)
    yield (10,)
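
As an aside, the support function `up_to()` behaves like `itertools.takewhile()` with a "less than or equal" predicate.  A small sketch with a made-up list of primes shows the equivalence:

```python
from itertools import takewhile

def up_to(seq, lim):
    for n in seq:
        if n <= lim:
            yield n
        else:
            break

primes = [2, 3, 5, 7, 11, 13]
small = list(takewhile(lambda n: n <= 7, primes))
# Both stop at the first item that exceeds the limit
assert list(up_to(primes, 7)) == small
print(small)
# [2, 3, 5, 7]
```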

## Solution

In [36]:
from functools import reduce
from operator import mul

def prime_factors(N):
    for p in get_primes():
        while N % p == 0:
            yield p
            N //= p
        if N == 1:
            return

def all_factorizations(N):
    # Track tuples already produced; (N,) itself is yielded last
    yielded = {(N,)}
    for factors in permutations(prime_factors(N)):
        for i in range(1, len(factors)):
            # Collapse the first i primes into one factor, keep the rest
            prod = reduce(mul, factors[:i])
            answer = tuple(sorted((prod,) + factors[i:]))
            if answer not in yielded:
                yield answer
            yielded.add(answer)
    yield (N,)

## Test Cases

In [37]:
def test_isgen_prime():
    from typing import Iterator
    assert isinstance(prime_factors(10), Iterator)
    
test_isgen_prime()

In [38]:
def test_isgen_allfac():
    from typing import Iterator
    assert isinstance(all_factorizations(10), Iterator)
    
test_isgen_allfac()

In [39]:
def test_prime_facs():
    assert set(prime_factors(380)) == {2, 5, 19}
    
test_prime_facs()

In [40]:
def test_all_facs():
    correct = {(2, 2, 5, 19), (2, 2, 95), (2, 5, 38), (2, 10, 19), 
               (2, 190), (4, 5, 19), (5, 76), (19, 20), (380,)}
    assert set(all_factorizations(380)) == correct
    
test_all_facs()
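
Beyond comparing against a fixed set, a property-style check can confirm that every yielded tuple really is a factorization.  The helper `is_valid_factorization()` is a hypothetical addition, not part of the exercise's tests, and it assumes Python 3.8 or later for `math.prod`:

```python
from math import prod

def is_valid_factorization(factors, N):
    # Every factor must be an integer greater than 1,
    # and the factors must multiply back to N
    return all(isinstance(f, int) and f > 1 for f in factors) and prod(factors) == N

expected_380 = {(2, 2, 5, 19), (2, 2, 95), (2, 5, 38), (2, 10, 19),
                (2, 190), (4, 5, 19), (5, 76), (19, 20), (380,)}
assert all(is_valid_factorization(t, 380) for t in expected_380)
# A tuple containing 1 is rejected even though its product is right
assert not is_valid_factorization((1, 380), 380)
```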

-------------
Materials licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) by the authors