`re` (regular expressions) is a flexible pattern-matching tool that lets you find the content you want inside a long string.

```
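Common syntax
Symbol       Meaning                                               Notes
.            Any single character except a newline
^            Anchors the match at the start of the string
$            Anchors the match at the end of the string
*            Zero or more of the preceding item
?            Zero or one of the preceding item
+            One or more of the preceding item
*? +? ??     Non-greedy forms: match as few characters as possible
{m}          Exactly m repetitions of the preceding item
[]           Any one character listed inside []                    [a-zA-Z0-9] matches any English letter or digit; [^6] matches any character except 6
()           Groups the enclosed expression and captures its match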

```
Reference: https://www.ibm.com/developerworks/cn/opensource/os-cn-pythonre/index.html
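
A quick way to check what each quantifier does is to run it on a small string. The sketch below is only an illustration; the sample strings and variable names are made up for this example.

```python
import re

# * : zero or more of the preceding character
star = re.findall(r'ab*', 'a ab abbb')        # ['a', 'ab', 'abbb']
# + : one or more of the preceding character
plus = re.findall(r'ab+', 'a ab abbb')        # ['ab', 'abbb']
# ? : zero or one of the preceding character
optional = re.findall(r'ab?', 'a ab abbb')    # ['a', 'ab', 'ab']
# {m} : exactly m repetitions of the preceding item
exact = re.findall(r'\d{3}', '12 345 6789')   # ['345', '678']
# [^6] : any single character except '6' (not only digits)
not_six = re.findall(r'[^6]', '56a6')         # ['5', 'a']
# greedy + takes as much as it can; non-greedy +? takes as little as it can
greedy = re.findall(r'<.+>', '<a><b>')        # ['<a><b>']
lazy = re.findall(r'<.+?>', '<a><b>')         # ['<a>', '<b>']
```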

```
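\ escapes a special character or introduces a special sequence
Common special sequences (the equivalents hold for ASCII text)
Sequence     Meaning                                  Equivalent to
\A           Matches only at the start of the string
\Z           Matches only at the end of the string
\d           Any digit 0-9                            [0-9]
\D           Any non-digit                            [^0-9]
\s           Any whitespace character                 [ \t\n\r\f\v]
\S           Any non-whitespace character             [^ \t\n\r\f\v]
\w           Any letter, digit, or underscore         [a-zA-Z0-9_]
\W           Any other character                      [^a-zA-Z0-9_]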
  
```
Reference: https://www.ibm.com/developerworks/cn/opensource/os-cn-pythonre/index.html
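
The special sequences are easier to remember after seeing them applied to one string. This is a small sketch; the `sample` string and the variable names are invented for illustration.

```python
import re

sample = 'id_42 costs $3.50\ttoday'

digits = re.findall(r'\d+', sample)    # runs of digits
words = re.findall(r'\w+', sample)     # runs of letters, digits, underscore
pieces = re.split(r'\s+', sample)      # split on spaces and the tab

# \A and \Z anchor to the very start and very end of the string
starts_with_id = bool(re.search(r'\Aid', sample))
ends_with_today = bool(re.search(r'today\Z', sample))

# A backslash escapes special characters such as $ and .
price = re.findall(r'\$\d+\.\d+', sample)

print(digits, words, pieces, price)
```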

In [1]:
import re

# 1. Splitting strings
## re.split()
Compare .split() with re.split()

In [2]:
s1 = "aa bb          cc"
print(s1.split(' '))
print(re.split(r'\s+', s1))

['aa', 'bb', '', '', '', '', '', '', '', '', '', 'cc']
['aa', 'bb', 'cc']


When the pattern contains a capturing group (), the matched delimiters are also included in the result.

In [3]:
line = 'asdf fjdk; afed, fjek,asdf, foo'
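# | means "or"
print(re.split(r'(;|,|\s)\s*', line))
# inside [] the | is a literal character, so list the delimiters directly
print(re.split(r'[;,\s]\s*', line))
# use (?:...) (a non-capturing group) to drop the delimiters from the result
print(re.split(r'(?:,|;|\s)\s*', line))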

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']


# 2. Matching the start or end of a string
```
.endswith('')  
.startswith('') 
```

In [4]:
filename = 'spam.txt'
print(filename.endswith('.txt'))
print(filename.startswith('file:'))


True
False


```
To test against several candidates at once,
pass them as a tuple; a list or set raises TypeError
```

In [5]:
choices = ['http:', 'ftp:']
url = 'https://www.ptt.cc/bbs/beauty/index.html'
# url.startswith(choices) raises TypeError because choices is a list
# convert it to a tuple first, as below
url.startswith(tuple(choices))   

False
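
The tuple requirement can be checked directly. The sketch below catches the `TypeError` raised for a list and then retries with a tuple; the extra `'https:'` entry and the variable names are additions for this illustration only.

```python
import re

url = 'https://www.ptt.cc/bbs/beauty/index.html'
choices = ['http:', 'https:', 'ftp:']

# A list is rejected with TypeError
try:
    url.startswith(choices)
    list_accepted = True
except TypeError:
    list_accepted = False

# A tuple is accepted; the result is True if any prefix matches
matched = url.startswith(tuple(choices))

# The same test written as a regular expression
matched_re = re.match(r'(?:http|https|ftp):', url) is not None

print(list_accepted, matched, matched_re)
```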

# 3. Matching with wildcards
```
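fnmatchcase()  always case-sensitive
fnmatch()      follows the operating system's rules (case-sensitive on Mac, case-insensitive on Windows)
Both also work on strings that are not filenames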
```

In [6]:
from fnmatch import fnmatch, fnmatchcase
print("Does 'foo.txt' and '*.txt' match ?", fnmatch('foo.txt', '*.txt'))
print("Does 'foo.txt' and '*.TXT' match ?", fnmatch('foo.txt', '*.TXT'))
print("Does 'foo.txt' and '*.TXT' match ?", fnmatchcase('foo.txt', '*.TXT'))
print("Does 'foo.txt' and '?oo.txt' match ?", fnmatch('foo.txt', '?oo.txt'))
print("Does 'Dat45.csv' and 'Dat[0-9]*' match ?", fnmatch('Dat45.csv', 'Dat[0-9]*'))

Does 'foo.txt' and '*.txt' match ? True
Does 'foo.txt' and '*.TXT' match ? True
Does 'foo.txt' and '*.TXT' match ? False
Does 'foo.txt' and '?oo.txt' match ? True
Does 'Dat45.csv' and 'Dat[0-9]*' match ? True


In [7]:
addresses = [
'5412 N CLARK ST',
'1060 W ADDISON ST',
'1039 W GRANVILLE AVE',
'2122 N CLARK ST',
'4802 N BROADWAY',
]

print([addr for addr in addresses if fnmatchcase(addr, '* ST')])
print([addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')])

['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
['5412 N CLARK ST']
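
The `fnmatch` module also offers `filter()` for whole lists and `translate()` for turning a wildcard into a regular expression. A minimal sketch, with a made-up `names` list:

```python
import re
from fnmatch import filter as fnfilter, translate

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# filter() returns the names that match the wildcard pattern
csv_files = fnfilter(names, 'Dat*.csv')

# translate() converts a wildcard into regex source text,
# which can then be compiled and reused like any other pattern
py_regex = re.compile(translate('*.py'))
py_files = [name for name in names if py_regex.match(name)]

print(csv_files, py_files)
```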


# 4. Matching more complex patterns
```
re.match()
re.compile()
```

In [8]:
import re
text1 = '11/27/2012'

if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
else:
    print('no')

yes


```
re.compile() stores a pattern you plan to reuse as a pattern object;
call .match() on that object afterwards
```

In [9]:
datepat = re.compile(r'\d+/\d+/\d+')
text2 = 'Nov 27, 2012'

if datepat.match(text1):
    print('yes')
else:
    print('no')


if datepat.match(text2):
    print('yes')
else:
    print('no')

yes
no


```
.match()     matches only at the start of the string
.findall()   returns every match found anywhere in the string
```

In [10]:
datepat = re.compile(r'\d+/\d+/\d+')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat.findall(text)

['11/27/2012', '3/13/2013']
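
Because `.match()` only anchors at the start, it can miss a date later in the text and can also accept a string with trailing junk. The sketch below contrasts it with `.search()` and `.fullmatch()`; the sample strings are invented for illustration.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')
text = 'Today is 11/27/2012.'

at_start = datepat.match(text)       # None: the text does not start with a date
anywhere = datepat.search(text)      # first match at any position

# match() is satisfied by a matching prefix, fullmatch() needs the whole string
loose = datepat.match('11/27/2012abc')
strict = datepat.fullmatch('11/27/2012abc')

print(at_start, anywhere.group(), loose.group(), strict)
```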

Wrapping parts of the pattern in () creates capture groups, which makes it easy to pull out each piece.

In [11]:
datepat1 = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datepat1.match('11/27/2012')
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))
print(m.groups())

11/27/2012
11
27
2012
('11', '27', '2012')


.finditer() yields the match objects one at a time as an iterator.

In [12]:
datepat1 = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
for m in datepat1.finditer(text):
    print(m.groups())

('11', '27', '2012')
('3', '13', '2013')
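
Groups can also be given names with `(?P<name>...)`, which avoids having to remember positions. A small sketch; the group names `month`, `day`, and `year` are chosen here for illustration.

```python
import re

datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
m = datepat.match('11/27/2012')

year = m.group('year')     # look a group up by name
parts = m.groupdict()      # all named groups as a dict

# Named groups can be referenced in a replacement with \g<name>
iso = datepat.sub(r'\g<year>-\g<month>-\g<day>', 'PyCon starts 3/13/2013.')

print(year, parts, iso)
```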


# 5. Searching and replacing text

.replace() and re.sub()

In [13]:
text = 'yeah, but no, but yeah, but no, but yeah'
print('origin        is:', text)
print('after replace is:', text.replace('yeah', 'yep'))
print('='*70)
text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print('origin        is:', text2)
print('after replace is:', re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2))

origin        is: yeah, but no, but yeah, but no, but yeah
after replace is: yep, but no, but yep, but no, but yep
======================================================================
origin        is: Today is 11/27/2012. PyCon starts 3/13/2013.
after replace is: Today is 2012-11-27. PyCon starts 2013-3-13.


re.subn() additionally reports how many substitutions were made.

In [14]:
datepat1 = re.compile(r'(\d+)/(\d+)/(\d+)')
newtext, n = datepat1.subn(r'\3-\1-\2', text2)
print(newtext)
print(n)

Today is 2012-11-27. PyCon starts 2013-3-13.
2
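
For replacements that a template string cannot express, `sub()` also accepts a function that receives each match object and returns the replacement text. The sketch below is one possible illustration; `MONTHS` and `change_date` are hypothetical helpers written for this example.

```python
import re

# Hypothetical lookup table for this example (index 0 is unused)
MONTHS = ('', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

def change_date(m):
    """Turn a month/day/year match into 'day Mon year'."""
    month_name = MONTHS[int(m.group(1))]
    return '{} {} {}'.format(m.group(2), month_name, m.group(3))

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
result = datepat.sub(change_date, text)
print(result)
```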


# 6. Ignoring case when replacing
flags=re.IGNORECASE

In [15]:
text = 'UPPER PYTHON, lower python, Mixed Python'
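# find every 'python' regardless of case
print(re.findall('python', text, flags=re.IGNORECASE))
# replace every 'python' with 'snake';
# note that the replacement text does not
# follow the case of the word it replaces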
print(re.sub('python', 'snake', text, flags=re.IGNORECASE))

['PYTHON', 'python', 'Python']
UPPER snake, lower snake, Mixed snake


A callback function can make the replacement follow the case of the original word.

In [16]:
def match(sub_word, match_word):
    text = match_word.group()
    if text.isupper():   # matched word is all uppercase
        return sub_word.upper()
    elif text.islower(): # matched word is all lowercase
        return sub_word.lower()
    elif text[0].isupper():   # matched word starts with an uppercase letter
        return sub_word.capitalize()
    else:
        return sub_word

```
partial(match, 'snake') returns a callback that takes a single match object (here, each match of 'python').
Besides a replacement string, sub() also accepts a callback function.
```

In [17]:
from functools import partial

In [18]:
re.sub('python', partial(match, 'snake'), text, flags=re.IGNORECASE)

'UPPER SNAKE, lower snake, Mixed Snake'
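
The same callback can be built with a closure instead of `functools.partial`. This is a sketch of that alternative, not a replacement for the version above; `matchcase` and `replace` are names chosen for this example.

```python
import re

def matchcase(word):
    """Return a callback that writes `word` in the case style of each match."""
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        return word
    return replace

text = 'UPPER PYTHON, lower python, Mixed Python'
result = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(result)
```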

# 7. Finding the shortest match
```
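Quantifiers in re are greedy by default, so a pattern may match the longest possible text
The pattern can be changed to find the shortest match instead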
```

In [19]:
# find the text inside double quotes
str_pat = re.compile(r'\"(.*)\"')
text1 = 'Computer says "no."'
print(str_pat.findall(text1))
text2 = 'Computer says "no." Phone says "yes."'
print(str_pat.findall(text2))

['no.']
['no." Phone says "yes.']


```
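* is greedy and matches as much as possible
Add a ? after the *
*? is the non-greedy form and matches as little as possible (listed in the syntax table at the top)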
```

In [20]:
str_pat = re.compile(r'\"(.*?)\"')
str_pat.findall(text2)

['no.', 'yes.']
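
Another common way to keep a match from running past the closing quote is a negated character class, which simply forbids a quote inside the captured text. A minimal sketch for comparison with the non-greedy form; the variable names are made up here.

```python
import re

text = 'Computer says "no." Phone says "yes."'

# Non-greedy: as few characters as possible between the quotes
lazy = re.findall(r'"(.*?)"', text)

# Negated class: any run of characters that are not a double quote
negated = re.findall(r'"([^"]*)"', text)

print(lazy, negated)
```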

# 8. Matching across multiple lines

In [21]:
# find the text inside /* */

comment = re.compile(r'/\*(.*?)\*/')
text1 = '/* this is a comment */'
text2 = '''/* this is a
    multiline comment */
    '''
print(comment.findall(text1))
# no match, because . does not match a newline
print(comment.findall(text2))

[' this is a comment ']
[]


(?:.|\n) matches either . or \n, so it covers every character including a newline.

In [22]:
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
comment.findall(text2)

[' this is a\n    multiline comment ']

```
re.compile() accepts a flag argument, re.DOTALL,
which makes the dot (.) in the pattern match any character, including a newline
```

In [23]:
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
comment.findall(text2)

[' this is a\n    multiline comment ']
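
The DOTALL behaviour can also be switched on inside the pattern with the inline flag `(?s)`. It is easy to confuse with `re.MULTILINE`, which does something different: it changes where `^` and `$` match. A short sketch of both, with made-up sample strings:

```python
import re

text = '/* this is a\n    multiline comment */'

# (?s) at the start of the pattern has the same effect as passing re.DOTALL
inline = re.findall(r'(?s)/\*(.*?)\*/', text)

# re.MULTILINE does not affect the dot; it lets ^ match at the start of every line
lines = 'one\ntwo'
first_only = re.findall(r'^\w+', lines)
every_line = re.findall(r'^\w+', lines, re.MULTILINE)

print(inline, first_only, every_line)
```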

# 9. Stripping unwanted characters from a string

```
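With no argument, whitespace is stripped
strip()      strips characters from both the start and the end
lstrip()     strips from the left (start) only
rstrip()     strips from the right (end) only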
```

In [24]:
s = ' hello world \n'
print(s.strip())
print(s.rstrip())
s.lstrip()

hello world
 hello world


'hello world \n'

In [25]:
t = '-----hello====='
print(t.lstrip('-'))
print(t.strip('-='))

hello=====
hello


```
strip() never touches text in the middle of the string
To remove inner spaces, use replace() or re.sub() to substitute them
```

In [26]:
s = ' hello    world \n'
s.strip()

'hello    world'
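
For the inner whitespace that `strip()` leaves alone, `replace()` and `re.sub()` can be used as follows. This is a small sketch; the variable names are chosen for illustration.

```python
import re

s = ' hello    world \n'

# replace() removes every space character, including the ones between words
no_spaces = s.replace(' ', '')

# re.sub() can collapse each run of whitespace into a single space
collapsed = re.sub(r'\s+', ' ', s.strip())

# split() plus join() gives the same collapsed result without re
joined = ' '.join(s.split())

print(repr(no_spaces), repr(collapsed), repr(joined))
```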

# 10. Aligning strings
```
ljust()   left-aligns
rjust()   right-aligns
center()  centers
```

In [27]:
# the argument is the total width
text = 'Hello World'
print(text.ljust(20))
print(text.rjust(20))
print(text.center(20))

Hello World         
         Hello World
    Hello World     


In [28]:
# an optional second argument sets the fill character
text.rjust(20,'=')

'=========Hello World'

format() can do the same thing.

In [29]:
# > right-aligns, < left-aligns, ^ centers
print(format(text, '>20'))
print(format(text, '<20'))
print(format(text, '^20'))
# a fill character may be placed before the alignment symbol
print(format(text, '*^20s'))

         Hello World
Hello World         
    Hello World     
****Hello World*****


```
format() can format several values at once
and can also format numbers
```

In [30]:
print('{:>10s} {:>10s}'.format('Hello', 'World'))
x = 1.2345
print(format(x, '>10'))
print(format(x, '^10.2f'))

     Hello      World
    1.2345
   1.23   


```
Overall,
format() is preferable to ljust(), rjust(), and center() because it works with any value type, not only strings
```
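
The same format specifications work inside f-strings (Python 3.6 and later). A brief sketch; the `width` variable is an addition for this example.

```python
text = 'Hello World'
x = 1.2345

right = f'{text:>20}'        # right-aligned in 20 columns
centered = f'{text:*^20}'    # centered, padded with *
number = f'{x:^10.2f}'       # number centered with two decimals

# The width itself can come from a variable
width = 20
dynamic = f'{text:<{width}}|'

print(right)
print(centered)
print(number)
print(dynamic)
```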

# 11. Interpolating variables into strings

.format()

In [31]:
s = '{name} has {n} messages.'
s.format(name='Guido', n=37)

'Guido has 37 messages.'

```
If the variables are already defined,
format_map() combined with vars() can fill them in
```

In [32]:
name = 'Guido'
n = 37
s.format_map(vars())

'Guido has 37 messages.'

# However
```
format() and format_map() raise KeyError
when a value is missing;
to avoid that, define a dict subclass with a __missing__() method
```

In [33]:
class safesub(dict): # dict subclass that tolerates missing keys
    def __missing__(self, key):
        return '{' + key +  '}' # put the placeholder back for a missing key

In [34]:
del n # delete the n defined earlier
s.format_map(safesub(vars()))

'Guido has {n} messages.'

# 12. Tokenizing text
```
In a pattern, (?P<TOKENNAME>...) gives the group a name
```

In [35]:
text = 'foo = 23 + 42 * 10'
tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'),('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]
import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

In [36]:
'|'.join([NAME, NUM, PLUS, TIMES, EQ, WS])

'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)|(?P<NUM>\\d+)|(?P<PLUS>\\+)|(?P<TIMES>\\*)|(?P<EQ>=)|(?P<WS>\\s+)'

In [37]:
scanner = master_pat.scanner('foo = 42')

In [38]:
scanner.match()

<_sre.SRE_Match object; span=(0, 3), match='foo'>

In [39]:
print(_.lastgroup, _.group())

NAME foo


In [40]:
scanner.match()

<_sre.SRE_Match object; span=(3, 4), match=' '>

In [41]:
print(_.lastgroup, _.group())

WS  


In [42]:
scanner.match()

<_sre.SRE_Match object; span=(4, 5), match='='>

In [43]:
print(_.lastgroup, _.group())

EQ =


In [44]:
scanner.match()

<_sre.SRE_Match object; span=(5, 6), match=' '>

In [45]:
print(_.lastgroup, _.group())

WS  


In [46]:
scanner.match()

<_sre.SRE_Match object; span=(6, 8), match='42'>

In [47]:
print(_.lastgroup, _.group())

NUM 42


Wrapping this in a function is more convenient.

In [48]:
from collections import namedtuple
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
scanner = master_pat.scanner('foo = 42')
def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    scanner = pat.scanner(text)
    # iteration stops once scanner.match() returns None
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())
for tok in generate_tokens(master_pat, 'foo = 42'):
    print(tok)

Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='42')
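
Whitespace tokens are usually not wanted in the final stream, and a list comprehension or generator expression can drop them. The sketch below repeats the patterns so that it runs on its own; `TOKEN_PATTERNS` is a name introduced only for this example.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

# Same named-group patterns as above, collected in one list
TOKEN_PATTERNS = [
    r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)',
    r'(?P<NUM>\d+)',
    r'(?P<PLUS>\+)',
    r'(?P<TIMES>\*)',
    r'(?P<EQ>=)',
    r'(?P<WS>\s+)',
]
master_pat = re.compile('|'.join(TOKEN_PATTERNS))

def generate_tokens(pat, text):
    """Yield a Token for each consecutive match until scanning fails."""
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

# Keep every token except whitespace
tokens = [tok for tok in generate_tokens(master_pat, 'foo = 23 + 42 * 10')
          if tok.type != 'WS']
for tok in tokens:
    print(tok)
```

Note that scanning stops silently at the first character no pattern matches, so input containing other symbols would need an additional pattern or an explicit check.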
