## Chapter 4. Iterators and Generators
> Modules used in this chapter:
1. collections
 - collections.deque
 - collections.defaultdict
 - collections.abc.Iterable
2. itertools
 - itertools.dropwhile
 - itertools.islice
 - itertools.permutations
 - itertools.combinations
 - itertools.combinations_with_replacement
 - itertools.zip_longest
 - itertools.chain
3. heapq
4. os
5. fnmatch
6. gzip
7. bz2
8. re
9. sys  

### 1. Manually consuming an iterator
To consume an iterable manually, call `next()` and catch the `StopIteration` exception in your code.

In [None]:
def manual_iter():
    with open('/etc/passwd') as f:
        try:
            while True:
                line = next(f)
                print(line, end='')
        except StopIteration:
            pass
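
As an alternative to catching `StopIteration`, `next()` accepts a second argument that it returns once the iterator is exhausted. The following is a minimal sketch of that variant. It uses an in-memory `io.StringIO` with made-up contents as a stand-in for the real file, so both the sample text and the `collected` list are assumptions for illustration only.

```python
import io

# Hypothetical stand-in for a real file such as /etc/passwd
f = io.StringIO("root:x:0:0\ndaemon:x:1:1\n")

collected = []
while True:
    line = next(f, None)   # returns None instead of raising StopIteration
    if line is None:
        break
    collected.append(line)

print(collected)
```

The sentinel must be a value the iterator can never produce; `None` is safe here because a text file only yields strings.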

**How iteration works**

In [1]:
items = [1, 2, 3]
# Get the iterator
it = iter(items) # Invokes items.__iter__()

# Run the iterator
print(next(it)) # Invokes it.__next__()
print(next(it))
print(next(it))
print(next(it)) # Raises StopIteration: the iterator is exhausted

1
2
3


StopIteration: 

**Delegating iteration:** define an `__iter__()` method that forwards the iteration request to the internal container.

In [2]:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

# Example
if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    # Outputs Node(1), Node(2)
    for ch in root:
        print(ch)

Node(1)
Node(2)


### 2. Creating new iteration patterns with generators

In [None]:
def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment

for n in frange(0, 4, 0.5):
    print(n)

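A single `yield` statement anywhere in a function turns it into a generator. Unlike a normal function, a generator only runs in response to iteration. The following cell shows the underlying mechanics: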

In [None]:
def countdown(n):
    print('Starting to count from', n)
    while n > 0:
        yield n
        n -= 1
    print('Done!')


# Create the generator, notice no output appears
c = countdown(3)
print(c)

# Run to first yield and emit a value
print(next(c))

# Run to the next yield
print(next(c))

# Run to next yield
print(next(c))

# Run to next yield (iteration stops)
print(next(c))

**Building a reverse iterator**

In [None]:
class Countdown:
    def __init__(self, start):
        self.start = start

    # Forward iterator
    def __iter__(self):
        n = self.start
        while n > 0:
            yield n
            n -= 1

    # Reverse iterator
    def __reversed__(self):
        n = 1
        while n <= self.start:
            yield n
            n += 1

for rr in reversed(Countdown(30)):
    print(rr)
for rr in Countdown(30):
    print(rr)

**Generator functions with extra state**

In [2]:
from collections import deque

class linehistory:
    def __init__(self, lines, histlen=3):
        self.lines = lines
        self.history = deque(maxlen=histlen)

    def __iter__(self):
        for lineno, line in enumerate(self.lines, 1):
            self.history.append((lineno, line))
            yield line

    def clear(self):
        self.history.clear()

with open('untitled.txt') as f:
    lines = linehistory(f)
    for line in lines:
        if 'python' in line:
            for lineno, hline in lines.history:
                print('{}:{}'.format(lineno, hline), end='')

4:epugh
5:jikhgfs
6:pythoni
5:jikhgfs
6:pythoni
7:python 
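
The cell above depends on a local file, `untitled.txt`. The sketch below repeats the same class against an in-memory list so it can be run anywhere. The `sample` data, the `histlen=2` setting, and the `matches` list are assumptions introduced for illustration.

```python
from collections import deque

class linehistory:
    def __init__(self, lines, histlen=3):
        self.lines = lines
        self.history = deque(maxlen=histlen)

    def __iter__(self):
        for lineno, line in enumerate(self.lines, 1):
            self.history.append((lineno, line))
            yield line

# Hypothetical sample data in place of untitled.txt
sample = ["alpha\n", "beta\n", "gamma\n", "python rocks\n", "delta\n"]
lines = linehistory(sample, histlen=2)

matches = []
for line in lines:
    if 'python' in line:
        # history holds the last `histlen` lines seen, including the match
        matches.append(list(lines.history))

print(matches)
```

Because the state lives on the instance, other code can inspect `lines.history` while the loop is running. Note that the object is iterable but is not itself an iterator: call `iter(lines)` first if you want to drive it with `next()`.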


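**Slicing an iterator**

Iterators and generators cannot be sliced with the standard slice syntax, because their length is not known in advance and they do not implement indexing. `islice()` returns an iterator that produces the requested items: it consumes and discards everything up to the start index, then yields items one at a time until it reaches the stop index.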

In [5]:
def count(n):
    while True:
        yield n
        n += 1

c = count(0)
print(c[10:20])  # Raises TypeError: a generator cannot be sliced

# Now using islice()
import itertools
for x in itertools.islice(c, 10, 20):
    print(x)

TypeError: 'generator' object is not subscriptable

Note that `islice()` consumes data from the iterator passed to it, and iterators cannot be rewound. If you need to access that data again later, store it in a list first.
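
The consuming behaviour can be seen in a short sketch. The variable names `first` and `after` are introduced here for illustration; the `count()` generator is the same one used in the cell above.

```python
from itertools import islice

def count(n):
    while True:
        yield n
        n += 1

c = count(0)
first = list(islice(c, 10, 13))   # discards 0..9, then yields 10, 11, 12
after = next(c)                   # the generator resumes where islice stopped

print(first, after)
```

The items skipped and yielded by `islice()` are gone from `c` for good, which is why data that must be revisited should be copied into a list.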

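**Skipping the first part of an iterable**

Suppose you want to skip the comment lines at the start of a file. `dropwhile()` drops only the leading items that satisfy the test; once an item fails the test, every remaining item is passed through without further testing or filtering.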

In [None]:
from itertools import dropwhile
with open('/etc/passwd') as f:
    for line in dropwhile(lambda line: line.startswith('#'), f):
        print(line, end='')
        
from itertools import islice
items = ['a', 'b', 'c', 1, 4, 10, 15]
# Explicitly skip the first 3 items; islice(items, None, 3) does the opposite and yields only the first 3.
for x in islice(items, 3, None):
    print(x)
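
To make the "leading items only" behaviour of `dropwhile()` concrete, here is a small sketch on in-memory data. The `lines` list is made-up sample content, and `kept` and `no_comments` are names introduced for illustration.

```python
from itertools import dropwhile

# Hypothetical file contents: comments at the top and one further down
lines = [
    "# header\n",
    "# more header\n",
    "root:x:0:0\n",
    "# inline note\n",
    "daemon:x:1:1\n",
]

# Only the leading run of comment lines is dropped
kept = list(dropwhile(lambda line: line.startswith('#'), lines))
print(kept)

# To remove every comment line, filter instead
no_comments = [line for line in lines if not line.startswith('#')]
print(no_comments)
```

The comment in the middle of the data survives `dropwhile()`, which is the difference from a plain filter.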

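### 3. Iterating over permutations and combinations
Iterate over every possible permutation or combination of the items in a collection.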

In [7]:
items = ['a', 'b', 'c']
from itertools import permutations
for p in permutations(items):
    print(p)
    
# Permutations of a given length
for p in permutations(items, 2):
    print(p)

('a', 'b', 'c')
('a', 'c', 'b')
('b', 'a', 'c')
('b', 'c', 'a')
('c', 'a', 'b')
('c', 'b', 'a')
('a', 'b')
('a', 'c')
('b', 'a')
('b', 'c')
('c', 'a')
('c', 'b')


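**Combinations:** `itertools.combinations()` yields every combination of items from the input collection. Order within a combination does not matter, so `('a', 'b')` and `('b', 'a')` count as the same combination.

**Combinations with replacement:** `itertools.combinations_with_replacement()` allows the same item to be chosen more than once.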

In [12]:
from itertools import combinations
for c in combinations(items, 3):
    print(c)
print('-'*30)
for c in combinations(items, 2):
    print(c)
print('-'*30)
for c in combinations(items, 1):
    print(c)
print('-'*30)
from itertools import combinations_with_replacement
for c in combinations_with_replacement(items, 3):
    print(c)
print('-'*30)

('a', 'b', 'c')
------------------------------
('a', 'b')
('a', 'c')
('b', 'c')
------------------------------
('a',)
('b',)
('c',)
------------------------------
('a', 'a', 'a')
('a', 'a', 'b')
('a', 'a', 'c')
('a', 'b', 'b')
('a', 'b', 'c')
('a', 'c', 'c')
('b', 'b', 'b')
('b', 'b', 'c')
('b', 'c', 'c')
('c', 'c', 'c')
------------------------------


### 4. Tracking the index while iterating over a sequence

In [15]:
my_list = ['a', 'b', 'c']
for idx, val in enumerate(my_list):
    print(idx, val)
print('-'*30)
# To print conventional line numbers (starting from 1), pass a start argument:
my_list = ['a', 'b', 'c']
for idx, val in enumerate(my_list, 1):
    print(idx, val)

0 a
1 b
2 c
------------------------------
1 a
2 b
3 c


To map each word in a file to the lines on which it appears, `enumerate()` makes the job easy:

In [17]:
from collections import defaultdict
word_summary = defaultdict(list)

with open('untitled.txt', 'r') as f:
    lines = f.readlines()

for idx, line in enumerate(lines):
    # Create a list of words in current line
    words = [w.strip().lower() for w in line.split()]
    for word in words:
        word_summary[word].append(idx)
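
The cell above reads `untitled.txt` and records 0-based indexes. The sketch below runs the same idea on an in-memory list and passes a start value of 1 so the stored values read as conventional line numbers. The `sample_lines` data is made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data in place of untitled.txt
sample_lines = ["Python is fun\n", "fun with python generators\n"]

word_summary = defaultdict(list)
for lineno, line in enumerate(sample_lines, 1):
    for word in line.split():
        word_summary[word.lower()].append(lineno)

print(dict(word_summary))
```

A word that occurs twice on one line is recorded twice for that line; switch the values to sets if each line should be listed only once.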

Using `enumerate()` on a sequence of tuples

In [18]:
data = [ (1, 2), (3, 4), (5, 6), (7, 8) ]

# Correct!
for n, (x, y) in enumerate(data):
    ...
# Error!
for n, x, y in enumerate(data):
    ...

ValueError: not enough values to unpack (expected 3, got 2)

### 5. Iterating over multiple sequences at once
Use the `zip()` function, which pairs items up like a zipper.

In [19]:
xpts = [1, 5, 4, 2, 10, 7]
ypts = [101, 78, 37, 15, 62, 99]
for x, y in zip(xpts, ypts):
    print(x,y)

1 101
5 78
4 37
2 15
10 62
7 99


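If the sequences differ in length, `zip()` stops at the end of the shortest one. Use `itertools.zip_longest()` to continue until the longest one is exhausted.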

In [21]:
a = [1, 2, 3]
b = ['w', 'x', 'y', 'z']
for i in zip(a,b):
    print(i)
print('-'*20)
from itertools import zip_longest
for i in zip_longest(a,b):
    print(i)

(1, 'w')
(2, 'x')
(3, 'y')
--------------------
(1, 'w')
(2, 'x')
(3, 'y')
(None, 'z')
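
`zip_longest()` pads the shorter inputs with `None` by default, as the output above shows. Its `fillvalue` keyword argument supplies a different padding value. A short sketch, where the `padded` name and the choice of `0` as the fill value are assumptions for illustration:

```python
from itertools import zip_longest

a = [1, 2, 3]
b = ['w', 'x', 'y', 'z']

# Pad the shorter sequence with 0 instead of None
padded = list(zip_longest(a, b, fillvalue=0))
print(padded)
```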


Iterating over more than two sequences

In [22]:
a = [1, 2, 3]
b = [10, 11, 12]
c = ['x','y','z']
for i in zip(a, b, c):
    print(i)

# Build a dict from two paired sequences
s = dict(zip(a,b))
# zip() returns an iterator; wrap it in list() to see the values
print(list(zip(a, b)))

(1, 10, 'x')
(2, 11, 'y')
(3, 12, 'z')
[(1, 10), (2, 11), (3, 12)]


### 6. Iterating over items in separate containers
To iterate over several containers in turn, there are two options:

In [25]:
a = [1, 2, 3, 4]
b = ['x', 'y', 'z']
# 1. Concatenate with + (inefficient, because it builds a new a + b object)
for x in a + b:
    ...
    #print(x)

# 2. Use the chain() function
from itertools import chain
for x in chain(a, b):
    print(x)

1
2
3
4
x
y
z
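
Besides avoiding the temporary copy, `chain()` also works when the containers are of different types, where `+` would fail. The sketch below shows this, along with `chain.from_iterable()` for the case where the containers arrive as a single iterable. All variable names and data here are made up for illustration.

```python
from itertools import chain

active_items = {'a'}          # a set (hypothetical data)
inactive_items = ['b', 'c']   # a list

# active_items + inactive_items would raise TypeError (set + list is undefined)
combined = list(chain(active_items, inactive_items))
print(combined)

# chain.from_iterable() takes one iterable that contains the inputs
nested = [[1, 2], (3, 4), range(5, 7)]
flat = list(chain.from_iterable(nested))
print(flat)
```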


### 7. Building a processing pipeline with generator functions
Suppose the following directories and files exist on the system:

    foo/
        access-log-012007.gz
        access-log-022007.gz
        access-log-032007.gz
        ...
        access-log-012008
    bar/
        access-log-092007.bz2
        ...
        access-log-022008

Each log file contains lines like these:

    124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
    210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /ply/ ..." 200 11875
    210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /favicon.ico ..." 404 369
    61.135.216.105 - - [10/Jul/2012:00:20:04 -0500] "GET /blog/atom.xml ..." 304 -
    ...

Process these files with a pipeline of small generator functions built on `yield`:

In [None]:
import os
import fnmatch
import gzip
import bz2
import re

def gen_find(filepat, top):
    '''
    Find all filenames in a directory tree that match a shell wildcard pattern
    '''
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path,name)

def gen_opener(filenames):
    '''
    Open a sequence of filenames one at a time producing a file object.
    The file is closed immediately when proceeding to the next iteration.
    '''
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

def gen_concatenate(iterators):
    '''
    Chain a sequence of iterators together into a single sequence.
    '''
    for it in iterators:
        yield from it

def gen_grep(pattern, lines):
    '''
    Look for a regex pattern in a sequence of lines
    '''
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

# Find every log line that contains the word "python"
lognames = gen_find('access-log*', 'www')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('(?i)python', lines)
for line in pylines:
    print(line, end='')

# Sum the bytes transferred by the matching requests
lognames = gen_find('access-log*', 'www')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('(?i)python', lines)
bytecolumn = (line.rsplit(None,1)[1] for line in pylines)
byte_counts = (int(x) for x in bytecolumn if x != '-')  # avoid shadowing the built-in bytes
print('Total', sum(byte_counts))
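
The pipeline above needs a `www` directory of log files. The sketch below runs only the last stages on a few in-memory lines so the data flow can be followed without any files. The `log` list is made-up data in the format shown earlier, and `byte_counts` and `total` are names introduced for illustration.

```python
import re

def gen_grep(pattern, lines):
    """Yield only the lines that match a regex pattern."""
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

# Hypothetical log lines in the format shown above
log = [
    '124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71',
    '210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /python/ ..." 200 11875',
    '61.135.216.105 - - [10/Jul/2012:00:20:04 -0500] "GET /Python/atom.xml ..." 304 -',
]

pylines = gen_grep('(?i)python', log)                        # case-insensitive match
bytecolumn = (line.rsplit(None, 1)[1] for line in pylines)   # last column
byte_counts = (int(x) for x in bytecolumn if x != '-')       # skip missing sizes
total = sum(byte_counts)

print('Total', total)
```

Nothing is computed until `sum()` starts pulling values: each line travels through every stage before the next line is read, so memory use stays flat however large the input is.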

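### 8. Flattening a nested sequence
A recursive generator that contains a `yield from` statement solves this problem easily. `yield from` is useful whenever a generator needs to call other generators as subroutines.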

In [26]:
from collections.abc import Iterable

def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            # Pass ignore_types down so nested levels use the same setting
            yield from flatten(x, ignore_types)
            # Equivalent to:
            #     for i in flatten(x, ignore_types):
            #         yield i
        else:
            yield x

items = [1, 2, [3, 4, [5, 6], 7], 8]
# Produces 1 2 3 4 5 6 7 8
for x in flatten(items):
    print(x)

1
2
3
4
5
6
7
8
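
The `ignore_types` argument exists so that strings and bytes are treated as single values rather than being expanded into individual characters. A short sketch of that case follows, with a made-up `names` list. It imports `Iterable` from `collections.abc`, which is where the class lives in current Python versions.

```python
from collections.abc import Iterable

def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            yield from flatten(x, ignore_types)
        else:
            yield x

# Hypothetical nested data that mixes strings and lists
names = ['Dave', 'Paula', ['Thomas', 'Lewis']]
flat_names = list(flatten(names))
print(flat_names)
```

Without the check, a string such as `'Dave'` would be iterated character by character, and because each one-character string is itself iterable, the recursion would never bottom out.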


### 9. Iterating in sorted order over merged sorted iterables

In [27]:
import heapq
a = [1, 4, 7, 10]
b = [2, 5, 6, 11]
for c in heapq.merge(a, b):
    print(c)

1
2
4
5
6
7
10
11


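`heapq.merge()` can also merge several files. Because it works iteratively, it never reads any input sequence in full up front, so it can be used on very long sequences with little overhead.
**Note that `heapq.merge()` requires every input sequence to be sorted already.**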

In [None]:
with open('sorted_file_1', 'rt') as file1, \
    open('sorted_file_2', 'rt') as file2, \
    open('merged_file', 'wt') as outf:

    for line in heapq.merge(file1, file2):
        outf.write(line)
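
The sorted-input requirement is easy to see on small lists. In this sketch the two result names are introduced for illustration; the second call deliberately passes an unsorted first input.

```python
import heapq

# Both inputs sorted: the output is fully sorted
sorted_inputs = list(heapq.merge([1, 4, 7], [2, 5, 6]))
print(sorted_inputs)

# First input NOT sorted: merge() only compares the current heads,
# so the output comes out in the wrong order
unsorted_inputs = list(heapq.merge([7, 1, 4], [2, 5, 6]))
print(unsorted_inputs)
```

`heapq.merge()` does not check its inputs or raise an error here. It only ever compares the front item of each sequence, which is also what lets it work without reading the sequences in full.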

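### 10. Replacing an infinite while loop with an iterator
A little-known feature of `iter()` is that it optionally accepts a zero-argument callable and a sentinel (terminating) value. Used this way, it creates an iterator that calls the callable repeatedly until the return value equals the sentinel.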

In [None]:
# The usual approach
CHUNKSIZE = 8192

def reader(s):
    while True:
        data = s.recv(CHUNKSIZE)
        if data == b'':
            break
        process_data(data)
        
# iter() with a sentinel can usually replace the loop above
def reader2(s):
    for chunk in iter(lambda: s.recv(CHUNKSIZE), b''):
        process_data(chunk)

import sys
f = open('/etc/passwd')
for chunk in iter(lambda: f.read(10), ''):
    n = sys.stdout.write(chunk)
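
The socket and `/etc/passwd` examples above need external resources. The sketch below shows the same two-argument `iter()` pattern on an in-memory `io.BytesIO` stream. It uses `functools.partial` in place of the lambda, which is a stylistic choice rather than a requirement. The stream contents and the small `CHUNKSIZE` are assumptions chosen to keep the output short.

```python
import io
from functools import partial

stream = io.BytesIO(b'abcdefghij' * 3)   # 30 bytes of hypothetical data
CHUNKSIZE = 8

# Call stream.read(CHUNKSIZE) repeatedly until it returns the sentinel b''
chunks = list(iter(partial(stream.read, CHUNKSIZE), b''))
sizes = [len(c) for c in chunks]

print(sizes)
```

The sentinel must match what the callable returns at end of input: `b''` for binary reads and sockets, `''` for text-mode files.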