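# Writing Structured Programs
----
By now you should have a reasonable feel for natural language processing in Python. If you had no Python background beforehand, though, some parts of the language may still feel awkward. This chapter aims to answer the following questions:

1. How can you write well-structured, readable programs that others can easily reuse?
2. How do Python constructs such as loops, functions, and assignment actually work?
3. What are the common pitfalls of Python programming, and how can you avoid them?

Along the way you will consolidate your Python fundamentals through many examples and pick up some visualization techniques that are useful for natural language processing. If you already know Python well, you can skim this chapter. If you have no prior programming experience, working through it carefully will pay off.

The chapter touches on many programming concepts because NLP work requires them. We will not cover each in depth; we focus only on the parts that matter most for NLP.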

# Deep Vs Shallow

In [1]:
import copy

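# Shallow Copy
A shallow copy creates a new outer list, but its elements are references to the same inner objects as the original. This is often loosely described as "pass by reference", which is not quite accurate.

```a``` and ```b``` are two distinct lists whose elements point to the same inner lists, so a change made to an inner list through ```a``` is also visible through ```b```. The copy step only copies the references held by ```a``` into ```b```; the inner lists themselves are not duplicated.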

In [224]:
a = [[],[],[]]

In [225]:
b = a.copy()

In [226]:
a[1].append(666)

In [227]:
a

[[], [666], []]

In [228]:
b

[[], [666], []]

In [229]:
b[2].append(777)

In [230]:
a

[[], [666], [777]]

In [231]:
b

[[], [666], [777]]

In [241]:
for i in a:
    print(id(i))

2280831643848
2280831661064
2280830730184


In [242]:
for p in b:
    print(id(p))

2280831643848
2280831661064
2280830730184
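The matching `id` values can also be checked directly with `is`. The sketch below is a minimal illustration (the names `outer` and `shallow` are made up for this example): `list.copy()` builds a new outer list, but each slot still refers to the same inner list object.

```python
outer = [[], [], []]
shallow = outer.copy()

# The outer containers are different objects...
outer_is_shared = shallow is outer
# ...but each slot refers to the same inner list.
inner_is_shared = all(x is y for x, y in zip(outer, shallow))

# Appending to the outer list affects only that list.
outer.append([])
lengths = (len(outer), len(shallow))

print(outer_is_shared, inner_is_shared, lengths)  # False True (4, 3)
```

So a shallow copy protects you from changes to the outer list, but not from changes made inside the shared inner lists.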


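# Deep Copy
A deep copy recursively copies the values. This is often loosely described as "pass by value", which is not quite accurate.

```c``` is a copy of the values in ```a```: new objects are created for the outer list and for every nested object, and the contents of ```a``` are copied into them.

After this, ```c``` and ```a``` are independent of each other and share no mutable objects.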

In [104]:
c = copy.deepcopy(a)

In [105]:
c

[[], [1], [2], [], [], []]

In [106]:
c[0].append(666)

In [107]:
c

[[666], [1], [2], [], [], []]

In [108]:
a

[[], [1], [2], [], [], []]
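Because the outputs above depend on what `a` held when those cells were run, here is a self-contained sketch of the same behaviour. The names `original` and `deep` are illustrative only.

```python
import copy

original = [[], [1], [2]]
deep = copy.deepcopy(original)

# Mutating an inner list of the copy leaves the original untouched.
deep[0].append(666)

# No inner list is shared between the two structures.
shares_inner = any(x is y for x, y in zip(original, deep))

print(original)      # [[], [1], [2]]
print(deep)          # [[666], [1], [2]]
print(shares_inner)  # False
```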

In [247]:
a = ()

In [248]:
type(a)

tuple

In [252]:
b = [(),]

# A related example

In [45]:
list1 = [[]]*3

In [46]:
list1

[[], [], []]

In [47]:
for i in list1:
    print(id(i))

2280831016584
2280831016584
2280831016584


In [48]:
list1[1].append(666)

In [49]:
list1

[[666], [666], [666]]

In [50]:
list2 = [[] for i in range(0,3)]

In [51]:
list2

[[], [], []]

In [52]:
for i in list2:
    print(id(i))

2280831016392
2280831278792
2280830880840


In [55]:
list2[1].append(666)

In [56]:
list2

[[], [666], []]

In [57]:
a = [1,2,3,4,5,6,7]

In [58]:
b = a.copy()

In [59]:
a[1]= 2131

In [60]:
b

[1, 2, 3, 4, 5, 6, 7]

In [61]:
c = a[:]

In [62]:
a

[1, 2131, 3, 4, 5, 6, 7]

In [63]:
c

[1, 2131, 3, 4, 5, 6, 7]

In [64]:
a[1] = 2

In [65]:
a

[1, 2, 3, 4, 5, 6, 7]

In [66]:
c

[1, 2131, 3, 4, 5, 6, 7]

# equal and is

In [144]:
a = 'ppp'

In [145]:
b = 'ppp'

In [146]:
a is b

True

In [147]:
a == b

True

In [148]:
id(a)

2280831530688

In [149]:
id(b)

2280831530688

In [150]:
b = 'qqq'

In [151]:
id(b)

2280831629888

In [152]:
b = 'ppp'

In [153]:
id(b)

2280831530688
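That the two `'ppp'` literals share one `id` is an optimization of the interpreter (CPython reuses some string objects), not something the language guarantees. The sketch below is a hedged illustration with made-up variable names; it makes no claim about the result of `is` on strings, only about `==`.

```python
literal = 'ppp'
built = ''.join(['p', 'p', 'p'])  # same characters, assembled at run time

# Value comparison: true whenever the characters match.
print(literal == built)
# Identity comparison: the result depends on the interpreter, so do not rely on it.
print(literal is built)

# `is` is the right tool for singletons such as None.
value = None
print(value is None)
```

In short, compare strings and other values with `==`, and keep `is` for identity questions such as `x is None`.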

In [154]:
a = [[]]*2

In [158]:
a[0] is a[1]

True

In [159]:
a[0] == a[1]

True

In [157]:
a[0].append(123)

In [160]:
a

[[123], [123]]

In [161]:
b = [[] for i in range(0,2)]

In [162]:
b[0] == b[1]

True

In [163]:
b[0] is b[1]

False

In [204]:
class ppp:
    def __init__(self):
        self.a = 6
        self.b = []
    def __eq__(self,other):
        if self.a == other.a:
            return True
        else:
            return False

In [205]:
ppp1 = ppp()

In [206]:
ppp2 = ppp()

In [208]:
ppp1 == ppp2

True

In [209]:
ppp1.a = 2233

In [210]:
ppp1 == ppp2

False
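The `__eq__` above assumes that `other` always has an attribute `a`; comparing with an unrelated object such as an integer would raise `AttributeError`. A common pattern is to return `NotImplemented` for unsupported types. The class `Point` below is a hypothetical example, not part of the original notebook.

```python
class Point:
    """Hypothetical example class used only for this sketch."""

    def __init__(self, a):
        self.a = a

    def __eq__(self, other):
        # Let Python fall back to its default handling for unrelated types.
        if not isinstance(other, Point):
            return NotImplemented
        return self.a == other.a


p1, p2 = Point(6), Point(6)
print(p1 == p2)  # True: equal by value
print(p1 is p2)  # False: two separate objects
print(p1 == 6)   # False: no error for an unrelated type
```

Note that a class which defines `__eq__` without defining `__hash__` becomes unhashable in Python 3, so its instances cannot be used as dictionary keys or set members unless `__hash__` is added as well.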

In [211]:
a = [1,2,3,4,5,6]

In [216]:
b = a

In [217]:
a is b

True

In [218]:
c = a.copy()

In [219]:
c == a

True

In [220]:
c is a

False

In [221]:
id(a)

2280831694600

In [222]:
id(b)

2280831694600

In [223]:
id(c)

2280831740808

# Creating lists

In [1]:
list1 = [[]] * 3

In [3]:
list2 = [[] for i in range(0,3)]

In [4]:
list1 == list2

True

In [5]:
list1 is list2

False

In [6]:
list2[1].append(1)

In [7]:
list1 == list2

False

# Generator expressions

In [1]:
import nltk

In [16]:
text = 'The second line uses a generator expression. This is more than a notational convenience: in many language processing situations, generator\
expressions will be more efficient. In , storage for the list object must be allocated before the value of max() is computed. If the text is very large,\
this could be slow. In , the data is streamed to the calling function. Since the calling function simply has to find the maximum value — the word\
which comes latest in lexicographic sort order — it can process the stream of data without having to store anything more than the maximum value\
seen so far.'
text = text *10000

In [17]:
%timeit max([w.lower() for w in nltk.word_tokenize(text)])

1 loop, best of 3: 12.2 s per loop


In [18]:
%timeit max(w.lower() for w in nltk.word_tokenize(text))

1 loop, best of 3: 12.2 s per loop
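The two timings above are identical, plausibly because tokenization dominates the run time; the benefit of a generator expression is mainly in memory. The sketch below uses only the standard library, with `str.split` standing in for `nltk.word_tokenize` (an assumption made to keep the example self-contained).

```python
import sys

words = ('the quick brown fox ' * 25000).split()  # 100000 tokens

as_list = [w.lower() for w in words]  # materializes every element up front
as_gen = (w.lower() for w in words)   # produces elements on demand

list_bytes = sys.getsizeof(as_list)   # size of the list object itself
gen_bytes = sys.getsizeof(as_gen)     # small, independent of len(words)

largest_from_list = max(as_list)
largest_from_gen = max(as_gen)        # consumes the generator

print(list_bytes > gen_bytes)                 # True
print(largest_from_list == largest_from_gen)  # True
```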


# iterator

In [32]:
class Reverse:
    def __init__(self,_list):
        self.__list = _list
        self.__len = len(_list)
    def __next__(self):
        if self.__len == 0:
            raise StopIteration
        else:
            self.__len -= 1
            return self.__list[self.__len]
    def __iter__(self):
        self.__len = len(self.__list)
        return self

In [33]:
rev = Reverse([i for i in range(0,10)])

In [34]:
[i for i in rev]

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
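The `Reverse` class keeps its position on the instance and `__iter__` returns `self`, so two loops over the same object share one cursor. A sketch of an alternative, assuming a generator-based `__iter__` is acceptable (the class name `ReverseIterable` is hypothetical):

```python
class ReverseIterable:
    def __init__(self, items):
        self._items = items

    def __iter__(self):
        # Each call creates an independent generator with its own index.
        for index in range(len(self._items) - 1, -1, -1):
            yield self._items[index]


rev = ReverseIterable(list(range(10)))
first = list(rev)

# Nested loops over the same object work because the cursors are separate.
small = ReverseIterable([1, 2])
pairs = [(x, y) for x in small for y in small]

print(first)  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(pairs)  # [(2, 2), (2, 1), (1, 2), (1, 1)]
```

For plain sequences, the built-in `reversed()` already gives the same ordering without a custom class.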

In [36]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Function

In [43]:
def swap(first,second):
    '''
    Swap two values.
    input:
        first: the first value
        second: the second value
    
    return:
        the two values in swapped order, as a tuple (second, first)
    '''
    
    new_first, new_second = second, first
    
    return new_first, new_second

In [None]:
swap(1, 2)

In [42]:
def swap1(a,b):
    b,a=a,b
    return a,b

In [50]:
def use_position_arg(lng,lat,*pos_arg):
    print(pos_arg)

In [51]:
use_position_arg(24,36,'Raynor','Terran')

('Raynor', 'Terran')


In [52]:
def use_keywords_arg(lng,lat,**keyword_arg):
    print(keyword_arg)

In [55]:
use_keywords_arg(24,36,Name='Raynor',Species='Terran')

{'Name': 'Raynor', 'Species': 'Terran'}


In [56]:
def both_arg(lng,lat,*p_arg,**k_arg):
    print('p_arg:',p_arg)
    print('k_arg:',k_arg)

In [59]:
both_arg(24,36,'laotie',666,Species='Terran',name='Raynor')

p_arg: ('laotie', 666)
k_arg: {'Species': 'Terran', 'name': 'Raynor'}
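The `*` and `**` markers also work on the calling side, where they unpack a sequence into positional arguments and a dictionary into keyword arguments. A minimal sketch; `collect_args` is a hypothetical variant of `both_arg` that returns its extra arguments instead of printing them.

```python
def collect_args(lng, lat, *p_arg, **k_arg):
    # Return instead of print so the result can be inspected.
    return p_arg, k_arg


extra = ['laotie', 666]
options = {'Species': 'Terran', 'name': 'Raynor'}

# *extra fills the positional slots, **options fills the keyword slots.
p_arg, k_arg = collect_args(24, 36, *extra, **options)

print(p_arg)  # ('laotie', 666)
print(k_arg)  # {'Species': 'Terran', 'name': 'Raynor'}
```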


# isInstance check

In [70]:
def order_by_dict_value(not_order):
    assert isinstance(not_order,dict),'input must be dictionary'
    ordered_key = sorted(not_order,key=lambda x:not_order[x],reverse=True)
    new_dict= {}
    for key in ordered_key:
        new_dict[key] = not_order[key]
    return new_dict

In [71]:
order_by_dict_value({'a':65,'b':43,'c':3123,'d':1})

{'c': 3123, 'a': 65, 'b': 43, 'd': 1}
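`assert` statements are skipped when Python runs with the `-O` flag, so a type check meant for callers is often written as an explicit exception instead. A sketch under that assumption (the function name `sort_dict_by_value` is hypothetical). It also relies on dictionaries preserving insertion order, which the language guarantees from Python 3.7.

```python
def sort_dict_by_value(mapping):
    if not isinstance(mapping, dict):
        raise TypeError('input must be a dictionary')
    # Sort the (key, value) pairs by value, largest first, then rebuild a dict.
    return dict(sorted(mapping.items(), key=lambda kv: kv[1], reverse=True))


result = sort_dict_by_value({'a': 65, 'b': 43, 'c': 3123, 'd': 1})
print(list(result.items()))  # [('c', 3123), ('a', 65), ('b', 43), ('d', 1)]
```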

# generator

In [12]:
def multiable():
    '''Multiplication table'''
    for i in range(1,10):
        for j in range(1,10):
            if j<i:
                continue
            else:
                yield '{0}*{1}'.format(i,j)

In [16]:
def _multiable():
    '''Multiplication table'''
    table = []
    for i in range(1,10):
        for j in range(1,10):
            if j<i:
                continue
            else:
                table.append('{0}*{1}'.format(i,j))
    return table

In [None]:
[i for i in multiable()]

In [None]:
_multiable()
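The practical difference between the two versions is that the generator computes items only when asked, while the list version builds everything before returning. A minimal sketch of that laziness; `multiplication_pairs` is a hypothetical, self-contained equivalent of `multiable`.

```python
from itertools import islice


def multiplication_pairs():
    """Yield 'i*j' strings for 1 <= i <= j <= 9, one at a time."""
    for i in range(1, 10):
        for j in range(i, 10):
            yield '{0}*{1}'.format(i, j)


gen = multiplication_pairs()
first_three = list(islice(gen, 3))  # only three items are computed
next_one = next(gen)                # the generator resumes where it stopped
remaining = sum(1 for _ in gen)     # consume and count the rest

print(first_three)  # ['1*1', '1*2', '1*3']
print(next_one)     # 1*4
print(remaining)    # 41
```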

# map & reduce

In [18]:
from functools import reduce

In [19]:
def reduce_sum(list_of_number):
    return reduce(lambda x,y:x+y,list_of_number,0)

In [20]:
reduce_sum([i for i in range(0,10)])

45

In [26]:
def reduce_mean(list_of_number):
    return reduce(lambda x,y:(x+y),list_of_number,0)/len(list_of_number)

In [27]:
reduce_mean([i for i in range(0,10)])

4.5

In [28]:
def square(numbers):
    return map(lambda x:x*x,numbers)

In [31]:
[p for p in square([i for i in range(0,10)])]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# filter

In [33]:
def odd(list_of_int):
    return filter(lambda x:x%2==1,list_of_int)

In [35]:
[p for p in odd([i for i in range(0,10)])]

[1, 3, 5, 7, 9]
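For simple cases like these, built-ins and comprehensions are often considered easier to read than `reduce`, `map`, and `filter` with a `lambda`. The sketch below gives equivalents of `reduce_sum`, `reduce_mean`, `square`, and `odd`; the variable names are illustrative.

```python
numbers = list(range(10))

total = sum(numbers)                       # same result as reduce_sum
mean = sum(numbers) / len(numbers)         # same result as reduce_mean
squares = [x * x for x in numbers]         # same values as square
odds = [x for x in numbers if x % 2 == 1]  # same values as odd

print(total)    # 45
print(mean)     # 4.5
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(odds)     # [1, 3, 5, 7, 9]
```

Unlike `map` and `filter`, which return lazy iterators in Python 3, the list comprehensions here build their lists immediately.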

# use \_\_file\_\_

In [36]:
import jieba
jieba.__file__

'c:\\env\\anaconda3\\lib\\site-packages\\jieba\\__init__.py'

In [37]:
jieba.__name__

'jieba'

In [42]:
jieba.__doc__

# Pythonic

In [44]:
def dict_process(input_dict):
    '''C/C++ like'''
    count = 0
    max_len = len(input_dict.keys())
    while count < max_len:
        print('Current key number:{0}'.format(count))
        # dict views cannot be indexed in Python 3, so build a list of keys first
        key = list(input_dict.keys())[count]
        input_dict[key] += 17
        count += 1
    return input_dict

In [45]:
def dict_process_pynic(input_dict):
    '''python way'''
    for num,key in enumerate(input_dict.keys()):
        print('Current key number:{0}'.format(num))
        input_dict[key] += 17
    return input_dict
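Both functions above modify the dictionary that was passed in. When the caller's data should stay unchanged, a dictionary comprehension over `items()` is one option. A sketch under that assumption; the function name `add_to_values` and its `amount` parameter are hypothetical.

```python
def add_to_values(input_dict, amount=17):
    """Return a new dict with `amount` added to every value."""
    return {key: value + amount for key, value in input_dict.items()}


scores = {'a': 1, 'b': 2}
updated = add_to_values(scores)

print(updated)  # {'a': 18, 'b': 19}
print(scores)   # {'a': 1, 'b': 2}
```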