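Chapter 3 Dictionaries and Sets

3.1 Generic Mapping Types

The collections.abc module provides the abstract base classes Mapping and MutableMapping, which define the formal interfaces of dict and similar types.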

![chapter3-1-1](image/chapter3-1-1.jpg)

In [1]:
from collections import abc
my_dict={}
isinstance(my_dict,abc.Mapping)

True

In [2]:
isinstance(my_dict,abc.MutableMapping)

True

In [3]:
lit=[]
isinstance(lit,abc.MutableMapping)

False

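All mapping types in the standard library are implemented with dict, so they share one limitation: only hashable types can be used as keys (the requirement applies to keys only; values need not be hashable).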

In [4]:
tt = (1,3,[30,40])
hash(tt)

TypeError: unhashable type: 'list'

In [5]:
tt=(1,3,(30,40))
hash(tt)

8027211466853604488
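
The two cells above suggest a simple rule: a tuple is hashable only if every item it contains is hashable. As a minimal sketch, the helper `is_hashable` below (a hypothetical name, not part of the standard library) checks this by calling `hash()` and catching the `TypeError`.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable((1, 3, (30, 40))))      # True: every item is hashable
print(is_hashable((1, 3, [30, 40])))      # False: the tuple contains a list
print(is_hashable(frozenset([30, 40])))   # True: frozenset is immutable
```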

In [6]:
a=dict(one=1,two=2,three=3)
b={'one':1,'two':2,'three':3}
c=dict(zip(['one','two','three'],[1,2,3]))
d=dict([('two',2),('one',1),('three',3)])
e=dict({'one':1,'two':2,'three':3})
a==b==c==d==e

True

3.2 dict Comprehensions

In [1]:
DIAL_CODES = [
        (86, 'China'),
        (91, 'India'),
        (1, 'United States'),
        (62, 'Indonesia'),
        (55, 'Brazil'),
        (92, 'Pakistan'),
        (880, 'Bangladesh'),
        (234, 'Nigeria'),
        (7, 'Russia'),
        (81, 'Japan'),
    ]
country_code={country:code for code,country in DIAL_CODES}
country_code

{'Bangladesh': 880,
 'Brazil': 55,
 'China': 86,
 'India': 91,
 'Indonesia': 62,
 'Japan': 81,
 'Nigeria': 234,
 'Pakistan': 92,
 'Russia': 7,
 'United States': 1}

In [8]:
{code:country.upper() for country,code in country_code.items() if code <66}

{1: 'UNITED STATES', 7: 'RUSSIA', 55: 'BRAZIL', 62: 'INDONESIA'}

3.3 Common Mapping Methods

![chapter3-3](image/chapter3-3.jpg)

![chapter3-3-2](image/chapter3-3-2.jpg)

In [9]:
country_code.pop('China')

86

In [10]:
country_code

{'Bangladesh': 880,
 'Brazil': 55,
 'India': 91,
 'Indonesia': 62,
 'Japan': 81,
 'Nigeria': 234,
 'Pakistan': 92,
 'Russia': 7,
 'United States': 1}

In [2]:
import collections
my_orderedDict=collections.OrderedDict(DIAL_CODES)
my_orderedDict

OrderedDict([(86, 'China'),
             (91, 'India'),
             (1, 'United States'),
             (62, 'Indonesia'),
             (55, 'Brazil'),
             (92, 'Pakistan'),
             (880, 'Bangladesh'),
             (234, 'Nigeria'),
             (7, 'Russia'),
             (81, 'Japan')])

In [13]:
my_orderedDict.move_to_end(86,last=True)
my_orderedDict

OrderedDict([(91, 'India'),
             (1, 'United States'),
             (62, 'Indonesia'),
             (55, 'Brazil'),
             (92, 'Pakistan'),
             (880, 'Bangladesh'),
             (234, 'Nigeria'),
             (7, 'Russia'),
             (81, 'Japan'),
             (86, 'China')])

In [14]:
my_orderedDict.move_to_end(86,last=False)
my_orderedDict

OrderedDict([(86, 'China'),
             (91, 'India'),
             (1, 'United States'),
             (62, 'Indonesia'),
             (55, 'Brazil'),
             (92, 'Pakistan'),
             (880, 'Bangladesh'),
             (234, 'Nigeria'),
             (7, 'Russia'),
             (81, 'Japan')])

In [3]:
my_orderedDict.setdefault(86,'Chinese')

'China'

In [4]:
my_orderedDict

OrderedDict([(86, 'China'),
             (91, 'India'),
             (1, 'United States'),
             (62, 'Indonesia'),
             (55, 'Brazil'),
             (92, 'Pakistan'),
             (880, 'Bangladesh'),
             (234, 'Nigeria'),
             (7, 'Russia'),
             (81, 'Japan')])

In [16]:
my_orderedDict.setdefault(88,'illegal')

'illegal'

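Handling missing keys with setdefault

d.get(k, default) is often used in place of d[k] to supply a default for a missing key, but when the goal is to update the value stored under a key, both __getitem__ and get are awkward and inefficient.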

In [17]:
# Example 3-2: build an index that maps each word to the list of locations where it occurs
import sys
import re
WORD_RE=re.compile(r'\w+')
index={}
with open(sys.argv[1],encoding='utf-8') as fp:
    for line_no,line in enumerate(fp,1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no,column_no)
            
            occurrences = index.get(word,[])
            occurrences.append(location)
            index[word]=occurrences
for word in sorted(index,key=str.upper):
    print(word,index[word])

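FileNotFoundError: [Errno 2] No such file or directory: '-f'

The error appears because the cell was run inside a notebook, where sys.argv[1] is the kernel's '-f' option rather than a file path. Run the code as a script with a text file as its first argument instead.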

In [None]:
# Example 3-4: the same index, built with dict.setdefault
import sys
import re
WORD_RE=re.compile(r'\w+')
index={}
with open(sys.argv[1],encoding='utf-8') as fp:
    for line_no,line in enumerate(fp,1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no,column_no)
            # Longer form from Example 3-2, kept for comparison:
            # occurrences = index.get(word, [])
            # occurrences.append(location)
            # index[word] = occurrences
            index.setdefault(word, []).append(location)  # done in one line
            # Equivalent to:
            # if key not in my_dict:
            #     my_dict[key] = []
            # my_dict[key].append(new_value)
for word in sorted(index,key=str.upper):
    print(word,index[word])
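
Because the scripts above read the file named in sys.argv[1], they cannot run unchanged in a notebook. The sketch below applies the same setdefault idiom to an in-memory string; the sample text and the function name `build_index` are assumptions made for illustration.

```python
import re

WORD_RE = re.compile(r'\w+')

def build_index(text):
    """Map each word to a list of (line_no, column_no) occurrences."""
    index = {}
    for line_no, line in enumerate(text.splitlines(), 1):
        for match in WORD_RE.finditer(line):
            location = (line_no, match.start() + 1)
            # Fetch the list for this word, creating it first if the word is new.
            index.setdefault(match.group(), []).append(location)
    return index

sample = "beautiful is better than ugly\nexplicit is better than implicit"
index = build_index(sample)
for word in sorted(index, key=str.upper):
    print(word, index[word])
```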

3.4 Mappings with Flexible Key Lookup

3.4.1 defaultdict: Another Way to Handle Missing Keys

In [None]:
# Example 3-5: the same index, built with collections.defaultdict
import sys
import re
import collections
WORD_RE=re.compile(r'\w+')
index=collections.defaultdict(list)
with open(sys.argv[1],encoding='utf-8') as fp:
    for line_no,line in enumerate(fp,1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start()+1
            location = (line_no,column_no)
            index[word].append(location)  # if word is missing, default_factory is called to create its value, here an empty list
for word in sorted(index,key=str.upper):
    print(word,index[word])

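The default_factory of a defaultdict is invoked only by __getitem__, to supply a value for a missing key; it plays no role in the other methods. dd[k] calls default_factory when k is missing, whereas dd.get(k) returns None.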

In [4]:
import collections
l=collections.defaultdict(list)
l['w']

[]

In [5]:
int_l =collections.defaultdict(int)
int_l['w']

0

In [6]:
gender=collections.defaultdict(lambda :'male')
gender['chenglong']

'male'
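
The cells above only exercise dd[k]. As a small sketch of the other half of the rule, the following compares get and in with __getitem__ on the same missing key (the key 'x' is arbitrary).

```python
import collections

dd = collections.defaultdict(list)

print(dd.get('x'))   # None: get() does not call default_factory
print('x' in dd)     # False: the lookup above did not create the key
print(dd['x'])       # []: __getitem__ calls default_factory and stores the result
print('x' in dd)     # True: the key now exists
```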

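3.4.2 The __missing__ Special Method

The __missing__ method is called only by __getitem__ (for example, by d[k]). Defining __missing__ has no effect on other lookup methods such as get or __contains__ (which implements the in operator).

The best strategy for creating a custom mapping type is usually to subclass collections.UserDict. Here we subclass dict only to show how the built-in dict.__getitem__ calls __missing__.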

In [10]:
# Example 3-7
class StrKeyDict0(dict):
    def __missing__(self, key):
        # The isinstance(key, str) check is required: without it, when str(key) is
        # also missing, self[str(key)] would call __missing__ again and recurse forever
        if isinstance(key,str):
            raise KeyError(key)
        return self[str(key)]
    
    def get(self,key,default=None):
        try:
            return self[key]
        except KeyError:
            return default
        
    def __contains__(self, key):
        return key in self.keys() or str(key) in self.keys()

A membership test such as k in my_dict.keys() is fast in Python 3, even for large mappings, because dict.keys() returns a view rather than a list.
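
A brief sketch of what "view" means here: the object returned by keys() stays connected to the dictionary and supports set operations. The sample keys are arbitrary.

```python
d = {'2': 'two', '4': 'four'}
keys = d.keys()             # a view, not a copy of the keys

d['6'] = 'six'              # later changes to d show through the view
print('6' in keys)          # True
print(keys & {'2', '8'})    # {'2'}: views support set operations
```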

In [11]:
d = StrKeyDict0([('2','two'),('4','four')])
d['2']

'two'

In [13]:
d[4]

'four'

In [14]:
d[1]

KeyError: '1'

In [15]:
d.get('2')

'two'

In [16]:
d.get(4)

'four'

In [17]:
d.get(1,'N/A')

'N/A'

In [19]:
2 in d

True

In [20]:
1 in d

False

3.5 Variations of dict

collections.OrderedDict: keeps keys in insertion order

collections.ChainMap: holds several mappings and searches them as one

collections.Counter: keeps an integer count for each key


In [21]:
from  collections import OrderedDict,ChainMap,Counter
l=[('1','ONe'),('2','Two'),('3','three'),('4','four')]
d = OrderedDict(l)
d.popitem()


('4', 'four')

In [22]:
d.popitem(last=False)

('1', 'ONe')

In [23]:
import builtins
pylookup = ChainMap(locals(),globals(),vars(builtins))
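
The cell above produces no output, so here is a minimal sketch of how a ChainMap resolves keys. The two dictionaries, `defaults` and `overrides`, are hypothetical settings invented for illustration.

```python
from collections import ChainMap

defaults = {'color': 'red', 'user': 'guest'}   # hypothetical fallback settings
overrides = {'user': 'admin'}                  # hypothetical higher-priority settings

settings = ChainMap(overrides, defaults)
print(settings['user'])      # 'admin': found in the first mapping
print(settings['color'])     # 'red': falls through to the second mapping

settings['color'] = 'blue'   # writes go to the first mapping only
print(overrides)             # {'user': 'admin', 'color': 'blue'}
print(defaults['color'])     # 'red': the fallback mapping is untouched
```

Lookups search the mappings in order and stop at the first hit, which is why the order of the arguments matters.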

In [25]:
ct=Counter('abracadabra')
ct

Counter({'a': 5, 'b': 2, 'c': 1, 'd': 1, 'r': 2})

In [26]:
ct.most_common(2)

[('a', 5), ('r', 2)]

3.6 Subclassing UserDict

For a custom mapping type, subclassing UserDict is more convenient than subclassing dict.

In [28]:
# Example 3-8: with UserDict, the subclass can implement __setitem__ without unwanted recursion, and __contains__ becomes simpler
# On insert, update, and lookup, StrKeyDict converts any non-string key to a string
import collections
class StrKeyDict(collections.UserDict):
    def __missing__(self, key):
        if isinstance(key,str):
            raise KeyError(key)
        return self[str(key)]
    
    def __contains__(self, key):
        return str(key) in self.data  # the data attribute is the dict where UserDict actually stores its items
    
    def __setitem__(self, key, item):
        self.data[str(key)]=item

In [29]:
d1 = StrKeyDict([('2','two'),('4','four')])

In [31]:
d1[4]

'four'
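
The session above only checks a lookup. The sketch below repeats the class so that it runs on its own, then exercises insertion, membership, get, and update; the keys 6 and 8 are arbitrary test values.

```python
import collections

class StrKeyDict(collections.UserDict):
    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]

    def __contains__(self, key):
        return str(key) in self.data

    def __setitem__(self, key, item):
        self.data[str(key)] = item

d1 = StrKeyDict([('2', 'two'), ('4', 'four')])
d1[6] = 'six'                # the int key is stored as the string '6'
print(sorted(d1.data))       # ['2', '4', '6']
print(6 in d1, '6' in d1)    # True True
print(d1.get(4))             # 'four': the inherited get() relies on __getitem__
print(d1.get(1, 'N/A'))      # 'N/A'
d1.update({8: 'eight'})      # update() also goes through __setitem__
print('8' in d1.data)        # True
```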

3.7 Immutable Mappings

In [33]:
from types import  MappingProxyType
d={1:'A'}
d_pro = MappingProxyType(d)
d_pro

mappingproxy({1: 'A'})

In [34]:
d_pro[1]

'A'

In [35]:
d_pro[2]='B'

TypeError: 'mappingproxy' object does not support item assignment

In [36]:
d[2]='B'
d_pro

mappingproxy({1: 'A', 2: 'B'})

d_pro is dynamic: any change made to d is reflected in it.
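
One way this might be used, sketched under assumptions: a class keeps a private dict and hands out a read-only proxy. The `Inventory` class and its attribute names are hypothetical and exist only for this illustration.

```python
from types import MappingProxyType

class Inventory:
    """Hypothetical class that keeps a private dict and exposes a read-only view."""

    def __init__(self):
        self._stock = {}

    def add(self, item, count):
        self._stock[item] = self._stock.get(item, 0) + count

    @property
    def stock(self):
        return MappingProxyType(self._stock)

inv = Inventory()
inv.add('spam', 2)
view = inv.stock
inv.add('spam', 3)       # the change is visible through the existing view
print(view['spam'])      # 5

try:
    view['eggs'] = 1     # callers cannot modify the mapping through the view
except TypeError as exc:
    print('TypeError:', exc)
```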

3.8 Set Theory

A set is a collection of unique objects.

Mutable set: set

Immutable set: frozenset

In [37]:
l = ['spam','spam','eggs','l']
set(l)

{'eggs', 'l', 'spam'}

Set elements must be hashable. A set itself is not hashable, but a frozenset is, so a set can contain frozensets.
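
A short sketch of that rule; the element values are arbitrary. The exact wording of the error message depends on the Python version, so it is printed rather than quoted here.

```python
groups = {frozenset({1, 2}), frozenset({3, 4})}   # frozensets are hashable
print(frozenset({2, 1}) in groups)                 # True: equal frozensets hash alike

try:
    {{1, 2}, {3, 4}}                               # a set cannot contain plain sets
except TypeError as exc:
    print('TypeError:', exc)
```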

3.8.1 Set Literals

In [39]:
s={1}
type(s)

set

In [40]:
s

{1}

In [41]:
s.pop()

1

In [42]:
s

set()

In [43]:
s={}
type(s)

dict

In [44]:
s=set({})
type(s)

set

In [45]:
frozenset(range(10))

frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9})

3.8.2 Set Comprehensions

In [48]:
from unicodedata import name
{chr(i) for i in range(32,256) if 'W' in name(chr(i),'')}

{'2',
 'W',
 '_',
 'w',
 '²',
 '¶',
 'À',
 'Á',
 'Â',
 'Ã',
 'Ä',
 'Å',
 'Ç',
 'È',
 'É',
 'Ê',
 'Ë',
 'Ì',
 'Í',
 'Î',
 'Ï',
 'Ñ',
 'Ò',
 'Ó',
 'Ô',
 'Õ',
 'Ö',
 'Ø',
 'Ù',
 'Ú',
 'Û',
 'Ü',
 'Ý',
 'à',
 'á',
 'â',
 'ã',
 'ä',
 'å',
 'ç',
 'è',
 'é',
 'ê',
 'ë',
 'ì',
 'í',
 'î',
 'ï',
 'ñ',
 'ò',
 'ó',
 'ô',
 'õ',
 'ö',
 'ø',
 'ù',
 'ú',
 'û',
 'ü',
 'ý',
 'ÿ'}

3.8.3 Set Operations

![chapter3-8](image/chapter3-8.jpg)

In [54]:
a={1,2,3,4,5}
b={1,2}
c={3,4}
d=[3,1,2,3]
a.difference(b,c,d)

{5}

In [55]:
a.difference_update(b,c)

In [56]:
a

{5}

In [59]:
b^c

{1, 2, 3, 4}
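
The difference call above accepted a list as an argument. The sketch below contrasts that with the infix operators, which require sets on both sides; the values are the same small examples used above.

```python
a = {1, 2, 3, 4, 5}
d = [3, 1, 2, 3]             # a list, not a set

print(a.difference(d))       # {4, 5}: methods accept any iterable
print(a - set(d))            # {4, 5}: operators need a set on both sides

try:
    a - d
except TypeError as exc:
    print('TypeError:', exc)

print({1, 2} ^ {3, 4})       # {1, 2, 3, 4}: for disjoint sets this equals the union
print({1, 2} ^ {2, 3})       # {1, 3}: the shared element 2 is dropped
```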

![chapter3-8-2](image/chapter3-8-2.jpg)

3.9 dict and set Under the Hood