# Part I Python Fundamentals 5

These are my Python review notes for self-study. The original notes mix English and Chinese.
- Part I follows *[Liao's Python tutorial](https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000) (in Chinese)*

In [40]:
# Display multiple interactive objects in one shell
# No Need for print function
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### 9. Processes and Threads

- Process: a running program (Word, PyCharm, ...)
- Thread: a task within that program (the unit of execution)

Three ways to run multiple tasks in Python:

- multiple processes, one thread each
- one process, multiple threads
- multiple processes, each with multiple threads (complicated, not recommended)
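
As a rough illustration of the single-process, multi-thread mode from the list above, here is a minimal sketch using the standard threading module. The helper name run_in_threads and the worker names are made up for this example. Each worker records the process ID it sees, which should be identical for all of them because threads share one process.

```python
import os
import threading

def run_in_threads(n_threads):
    """Start n_threads workers and return a sorted list of (thread name, pid) pairs."""
    results = []
    lock = threading.Lock()  # protects the shared results list

    def worker():
        with lock:
            results.append((threading.current_thread().name, os.getpid()))

    threads = [threading.Thread(target=worker, name='worker-%d' % i)
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker to finish
    return sorted(results)

for name, pid in run_in_threads(3):
    print('%s ran in process %s' % (name, pid))
```

The printed process ID depends on the machine, but it should be the same on every line.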

Multiprocessing

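Unix/Linux operating systems provide a fork() system call, which is unusual. An ordinary function is called once and returns once, but fork() is called once and returns twice: the operating system copies the current process (the parent process) into a new one (the child process), and the call then returns in both the parent and the child.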

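In the child process fork() always returns 0, while in the parent it returns the child's process ID. The reason is that a parent can fork many children, so it has to record each child's ID, whereas a child only needs to call getppid() to get its parent's ID.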

In [18]:
import os


print('Start running process %s' % os.getpid()) # get current process id

# RETURN TWICE
pid = os.fork()
if pid == 0:
    print('This is the child process %s from father process %s' % (os.getpid(), os.getppid())) # get parent process id
else:
    print('This is the father process %s' % os.getpid())

Start running process 1356
This is the father process 1356
Start running process 1356
This is the child process 1379 from father process 1356


With fork(), a process that receives a new task can copy itself into a child process and let the child handle that task.
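
A minimal sketch of that pattern, assuming a Unix-like system where os.fork is available (it does not exist on Windows). The helper run_in_child and the task count_vowels are hypothetical names for this example, and passing the result back through the exit code is only for illustration, since exit codes are limited to small integers.

```python
import os

def run_in_child(task):
    """Fork a child that runs task() and exits with its return value as the exit code.

    Returns (child pid, child exit code) in the parent.
    """
    pid = os.fork()
    if pid == 0:
        # Child: fork() returned 0. Run the task, then exit immediately
        # so the child never continues into the parent's code.
        try:
            code = task()
        except Exception:
            code = 1
        os._exit(code)
    # Parent: fork() returned the child's pid. Wait for the child to finish.
    _, status = os.waitpid(pid, 0)
    return pid, os.WEXITSTATUS(status)

def count_vowels():
    # A small stand-in for "the new task"
    return sum(ch in 'aeiou' for ch in 'mynameismichael')

if hasattr(os, 'fork'):
    child_pid, exit_code = run_in_child(count_vowels)
    print('child %s finished with exit code %s' % (child_pid, exit_code))
```

The child's pid varies from run to run. The exit code here should be 6, the number of vowels counted by the child.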

(TODO: to be filled in later)

### 10. Regular Expressions (Regex, Greedy Matching)

Cheatsheet: [Regex Cheat Sheet](./source/regex_cheatsheet.pdf)

### re

In [29]:
import re

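# Always use the r prefix: in a raw string, a backslash (\) is taken
# literally as "just a backslash", so it needs no extra escaping.
# Apart from that, a raw string is the same as a regular string.

re.match(r'\d{3}-\d{5}', '123-45678')
# returns a match object if the start of the string matches
# returns None if it does not match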

<_sre.SRE_Match object; span=(0, 9), match='123-45678'>
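
To make the two comments above concrete, here is a small sketch of what the r prefix changes and of how to branch on the result of re.match. The helper name describe is made up for this example. It also shows that re.match only anchors at the start of the string, while re.fullmatch requires the whole string to match.

```python
import re

# A raw string keeps backslashes as-is, so these two spellings are the same pattern
assert r'\d{3}' == '\\d{3}'
# '\n' is one newline character; r'\n' is a backslash followed by the letter n
assert len('\n') == 1 and len(r'\n') == 2

pattern = r'\d{3}-\d{5}'

def describe(text):
    """Return a short description of how the pattern relates to text."""
    if re.fullmatch(pattern, text):
        return 'full match'
    if re.match(pattern, text):
        return 'prefix match only'
    return 'no match'

print(describe('123-45678'))    # full match
print(describe('123-456789'))   # prefix match only
print(describe('123 45678'))    # no match
```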

In [31]:
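# split a string on runs of commas, semicolons, and whitespace
# (inside a character class, , and ; need no escaping, and + would be a literal plus sign)
re.split(r'[,;\s]+', 'a;, b  ,c')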

['a', 'b', 'c']

Groups

In [43]:
matcher = re.match(r'(\d{3})-(\d{5})', '123-45678')
matcher
matcher.group(0) # whole
matcher.group(1) # group 1
matcher.group(2) # group 2

<_sre.SRE_Match object; span=(0, 9), match='123-45678'>

'123-45678'

'123'

'45678'

In [60]:
time = '19:05:30'
matcher = re.match(r'(\d{2}):(\d{2}):(\d{2})', time)
matcher
h, m, s = matcher.groups() # tuple unpacking
h, m, s

<_sre.SRE_Match object; span=(0, 8), match='19:05:30'>

('19', '05', '30')
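
The section heading mentions greedy matching, and groups make it easy to see. The sketch below uses an illustrative input, '102300', that is not in the original notes. By default a quantifier such as + is greedy and takes as much as it can. Adding ? makes it non-greedy.

```python
import re

# Greedy: \d+ grabs every digit, leaving nothing for 0*
greedy = re.match(r'^(\d+)(0*)$', '102300').groups()

# Non-greedy: \d+? takes as few digits as possible, so 0* can match the trailing zeros
lazy = re.match(r'^(\d+?)(0*)$', '102300').groups()

print(greedy)  # ('102300', '')
print(lazy)    # ('1023', '00')
```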

Precompile a regular expression so that it can be reused

In [62]:
my_regex = re.compile(r'^(\d{3})-(\d{5})$')
my_regex.match('123-45678')
my_regex.match('123-45678').groups()

<_sre.SRE_Match object; span=(0, 9), match='123-45678'>

('123', '45678')
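
A short sketch of why precompiling helps: the pattern is compiled once and the same object is then applied to many strings. The name PHONE_RE and the candidate strings are made up for this example.

```python
import re

# Compile once, at module level
PHONE_RE = re.compile(r'^(\d{3})-(\d{5})$')

candidates = ['123-45678', '010-12345', '12-345678', 'abc-12345']

parsed = []
for text in candidates:
    m = PHONE_RE.match(text)  # reuse the compiled pattern
    if m:
        parsed.append(m.groups())

print(parsed)  # [('123', '45678'), ('010', '12345')]
```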

In [74]:
# A regular expression that validates email addresses
# someone@gmail.com
# bill.gates@microsoft.com

def is_valid_email(email):
    email_regex = re.compile(r'^([^\-]+)@(.+\.com)$')
    return email_regex.match(email)

# test
assert is_valid_email('someone@gmail.com')
assert is_valid_email('bill.gates@microsoft.com')
assert not is_valid_email('bob#example.com')
assert not is_valid_email('mr-bob@example.com')
print('ok')

ok


In [92]:
# Extract the name from an email address that may carry a display name
# <Tom Paris> tom@voyager.org => Tom Paris
# bob@example.com => bob

# compile each pattern once and reuse it
EMAIL_REGEX = re.compile(r'^([^\-]+)@(.+)$')
DISPLAY_NAME_REGEX = re.compile(r'^<([\s\w]+)>.*')

def is_valid_email(email):
    return EMAIL_REGEX.match(email)

def name_of_email(email):
    m = is_valid_email(email)
    if not m:
        print(email)
        print('Not a valid one.')
        return None
    username, mailbox = m.groups()
    if '<' in username and '>' in username:
        # '<Tom Paris> tom' => 'Tom Paris'
        name = DISPLAY_NAME_REGEX.match(username).group(1)
    else:
        name = username
    return name


# test
assert name_of_email('<Tom Paris> tom@voyager.org') == 'Tom Paris'
assert name_of_email('tom@voyager.org') == 'tom'
print('ok')

ok


In [95]:
# A better implementation (really neat!)
# With or without < >, the name is always in group(2)
# The key is [\w\s]*: when there is nothing to match, * still yields an empty string ''

def name_of_email(addr):
    r=re.compile(r'^(<?)([\w\s]*)(>?)([\w\s]*)@([\w.]*)$')
    if not r.match(addr):
        return None
    else:
        m=r.match(addr)
        print(m.groups())
        return m.group(2)

# test
assert name_of_email('<Tom Paris> tom@voyager.org') == 'Tom Paris'
assert name_of_email('tom@voyager.org') == 'tom'
print('ok')

('<', 'Tom Paris', '>', ' tom', 'voyager.org')
('', 'tom', '', '', 'voyager.org')
ok


### 11. Built-in Modules (Commonly Used)

In [116]:
# datetime

from datetime import datetime
now = datetime.now() # local time

print(now)

# specific datetime
my_time = datetime(2018,12,30,12,30,22)
print(my_time)

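# datetime to timestamp
# Computers represent time as a number. The moment 1970-01-01 00:00:00 UTC+00:00 is called the epoch
# and is recorded as 0 (times before 1970 have negative timestamps); the current time, as a number
# of seconds relative to the epoch, is called the timestamp.
# timestamp = 0 corresponds to 1970-01-01 00:00:00 UTC+00:00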
now = datetime.now()
now.timestamp()

# timestamp to datetime
# fromtimestamp()
t_step = 1516692394.293447
my_time = datetime.fromtimestamp(t_step)
print(my_time)

# user's input string to datetime
# strptime()
input_time = '1987-9-1 15:00'
dt = datetime.strptime(input_time, '%Y-%m-%d %H:%M') # to datetime
dt2 = datetime.now().strftime('%H:%M') # to string
print(dt)
print(dt2)

# datetime arithmetic (adding and subtracting)
# >>> from datetime import datetime, timedelta
# >>> now = datetime.now()
# >>> now
# datetime.datetime(2015, 5, 18, 16, 57, 3, 540997)
# >>> now + timedelta(hours=10)
# datetime.datetime(2015, 5, 19, 2, 57, 3, 540997)
# >>> now - timedelta(days=1)
# datetime.datetime(2015, 5, 17, 16, 57, 3, 540997)
# >>> now + timedelta(days=2, hours=12)
# datetime.datetime(2015, 5, 21, 4, 57, 3, 540997)

# time zone conversion: skipped here

2018-01-23 15:39:50.931737
2018-12-30 12:30:22


1516693190.931948

2018-01-23 15:26:34.293447
1987-09-01 15:00:00
15:39
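
The timedelta lines in the cell above are only comments, so here is a runnable sketch with a fixed start time (the microseconds are dropped for readability). It also shows a timestamp converted with an explicit time zone. The UTC+8 zone is an assumption, chosen only because it reproduces the local time printed above for the same timestamp.

```python
from datetime import datetime, timedelta, timezone

# datetime arithmetic with timedelta
start = datetime(2015, 5, 18, 16, 57, 3)
print(start + timedelta(hours=10))          # 2015-05-19 02:57:03
print(start - timedelta(days=1))            # 2015-05-17 16:57:03
print(start + timedelta(days=2, hours=12))  # 2015-05-21 04:57:03

# timestamp -> aware datetime: passing tz makes the result independent
# of the machine's local time zone
utc_time = datetime.fromtimestamp(1516692394, tz=timezone.utc)
print(utc_time)    # 2018-01-23 07:26:34+00:00

# convert the same instant to UTC+8 (assumed zone, see the note above)
utc_plus_8 = timezone(timedelta(hours=8))
local_time = utc_time.astimezone(utc_plus_8)
print(local_time)  # 2018-01-23 15:26:34+08:00

# both objects describe the same instant, so their timestamps are equal
print(local_time.timestamp() == utc_time.timestamp())  # True
```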


In [127]:
# collections

import collections
# 1. namedtuple
# Quickly define a simple class that only holds named attributes (no custom methods)
coordinate = collections.namedtuple('coordinate', ['x', 'y'])
p1 = coordinate(1, 2)
p1.x, p1.y

(1, 2)
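
A few more properties of namedtuple, as a small sketch. The class name Coordinate is capitalized here only by convention. A namedtuple is still a tuple, so it supports indexing and unpacking, and it is immutable.

```python
import collections

Coordinate = collections.namedtuple('Coordinate', ['x', 'y'])
p = Coordinate(1, 2)

print(isinstance(p, tuple))  # True
x, y = p                     # unpacks like a regular tuple
print(p[0] == p.x)           # True

print(dict(p._asdict()))     # {'x': 1, 'y': 2}

# Fields cannot be reassigned; _replace returns a modified copy instead
moved = p._replace(x=10)
print(moved)                 # Coordinate(x=10, y=2)

try:
    p.x = 5
except AttributeError:
    print('namedtuple fields are read-only')
```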

In [128]:
# 2. deque
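# A list is fast for index access but slow for insertion and deletion away from its end:
# it is stored linearly, so the more data it holds, the slower those operations get.
# A deque supports efficient insertion and deletion at both ends.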
dq = collections.deque(['x'] * 3)
dq.append('a')
dq.appendleft('b')
dq.pop()
dq.popleft()
dq

'a'

'b'

deque(['x', 'x', 'x'])
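
Two typical uses of deque, sketched with made-up data: a first-in, first-out queue, and a fixed-size window that keeps only the most recent items through the maxlen argument.

```python
import collections

# FIFO queue: append on the right, pop from the left
queue = collections.deque()
for job in ['a', 'b', 'c']:
    queue.append(job)
first = queue.popleft()  # constant time; list.pop(0) would shift every remaining item
print(first)             # a
print(list(queue))       # ['b', 'c']

# Bounded window: once full, appending drops the oldest item
recent = collections.deque(maxlen=3)
for n in range(5):
    recent.append(n)
print(list(recent))      # [2, 3, 4]
```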

In [129]:
# 3. defaultdict
# With a plain dict, referencing a key that does not exist raises KeyError.
# To get a default value back for a missing key instead, use defaultdict.
dd = collections.defaultdict(lambda: 'N/A')
dd['key1'] = 1
dd['key2']

'N/A'
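
A common use of defaultdict is grouping, sketched below with a made-up word list. One detail worth knowing: looking up a missing key with [] stores the default in the dictionary, while get() does not call the factory at all.

```python
import collections

# Group words by first letter without checking whether the key exists
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']
by_letter = collections.defaultdict(list)
for w in words:
    by_letter[w[0]].append(w)
print(dict(by_letter))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}

dd = collections.defaultdict(lambda: 'N/A')
print('key2' in dd)    # False
dd['key2']             # the lookup creates the entry
print('key2' in dd)    # True
print(dd.get('key3'))  # None, because get() bypasses the default factory
```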

In [134]:
# 4. OrderedDict
# Keys are ordered: they are kept in insertion order
od = collections.OrderedDict([('b', 1), ('c', 2)])
od
od['a'] = 3
od

OrderedDict([('b', 1), ('c', 2)])

OrderedDict([('b', 1), ('c', 2), ('a', 3)])
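
Since Python 3.7 a plain dict also preserves insertion order, so the sketch below focuses on what OrderedDict still adds: reordering methods and order-sensitive equality. The keys used are arbitrary.

```python
import collections

od = collections.OrderedDict([('b', 1), ('c', 2), ('a', 3)])

od.move_to_end('b')              # move an existing key to the end
print(list(od))                  # ['c', 'a', 'b']

oldest = od.popitem(last=False)  # remove from the front (FIFO)
print(oldest)                    # ('c', 2)

# Two OrderedDicts are equal only if the order matches too
first = collections.OrderedDict([('x', 1), ('y', 2)])
second = collections.OrderedDict([('y', 2), ('x', 1)])
print(first == second)                         # False
print({'x': 1, 'y': 2} == {'y': 2, 'x': 1})    # True
```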

In [136]:
# 5. Counter
# actually a subclass of dict
# e.g., count how many times each character occurs
c = collections.Counter()
for char in "mynameismichael":
    c[char] = c[char] + 1
c

Counter({'a': 2,
         'c': 1,
         'e': 2,
         'h': 1,
         'i': 2,
         'l': 1,
         'm': 3,
         'n': 1,
         's': 1,
         'y': 1})
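
The counting loop above can be shortened, because Counter accepts any iterable directly. A brief sketch with the same string:

```python
import collections

c = collections.Counter('mynameismichael')

print(c['m'])             # 3
print(c['z'])             # 0, a missing key counts as zero instead of raising KeyError
print(c.most_common(1))   # [('m', 3)]
print(sum(c.values()))    # 15, the length of the string

c.update('mm')            # add more observations
print(c['m'])             # 5
```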

### 12. Miscellaneous

(TODO: to be filled in later)