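### Regular Expressions

A regular expression is a tool for describing rules that strings follow. In other words, a regular expression is code that records a text pattern.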

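#### What it can be used for:

- Searching a large body of text for strings that follow a specific rule, e.g. finding where the email addresses are
- Replacing strings that follow a specific rule, e.g. converting lowercase text that follows a specific rule to uppercase
- Validating input, e.g. enforcing a required length or a mix of upper and lower case when a password is set

and so on.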
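
The three uses above can be sketched with Python's `re` module. This is only an illustration: the sample strings, the simplified email pattern, and the password rule (at least 8 characters with both a lowercase and an uppercase letter) are assumptions made for the example, not robust validators.

```python
import re

sample = "contact alice@example.com or bob@example.org for details"

# 1. Search: find substrings that look like email addresses (simplified pattern).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", sample)

# 2. Replace: upper-case every word that starts with "to".
shouted = re.sub(r"\bto\w+", lambda m: m.group(0).upper(), "today or tomorrow")

# 3. Validate: the whole string must satisfy the (assumed) password rule.
def is_valid_password(pw):
    return re.fullmatch(r"(?=.*[a-z])(?=.*[A-Z]).{8,}", pw) is not None

print(emails)
print(shouted)
print(is_valid_password("Secret123"), is_valid_password("secret"))
```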

https://deerchao.net/tutorials/regex/regex.htm  (30-minute introductory regular expression tutorial, in Chinese)

https://docs.python.org/3.6/library/re.html  (re documentation in python library)

https://regex101.com/ online regular expression tester and debugger (recommended)

### Common metacharacters
```
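.	Matches any character except a newline
\w	Matches any letter, digit, underscore, or Chinese character (e.g. \w can match "a", "b", "c", "1", "2", "3", ...; \w\w can match "ab", "cf", ...)
\s	Matches any whitespace character
\d	Matches a digit
\b	Matches the start or end of a word
^	Matches the start of the string
$	Matches the end of the string

*	Repeats zero or more times
+	Repeats one or more times
?	Repeats zero or one time
{n}	Repeats exactly n times
{n,}	Repeats n or more times
{n,m}	Repeats n to m times

\W	Matches any character that is not a letter, digit, underscore, or Chinese character
\S	Matches any non-whitespace character
\D	Matches any non-digit character
\B	Matches a position that is not the start or end of a word
[^x]	Matches any character except x
[^aeiou]	Matches any character except the letters a, e, i, o, u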
```
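
A few of the metacharacters in the table, applied to one line of text. This is a small sketch; the sample string is made up for illustration.

```python
import re

line = "order 66 shipped to room 101b on 2017-03-09"

# \d+ : every run of one or more digits, wherever it occurs.
digits = re.findall(r"\d+", line)

# \b\d+\b : digit runs that form a whole word ("101b" is skipped).
whole_numbers = re.findall(r"\b\d+\b", line)

# {n} : exact repetition counts describe a date-like shape.
date = re.search(r"\d{4}-\d{2}-\d{2}", line).group(0)

# ^ and \s : the string must start with "order" followed by whitespace.
starts_with_order = re.search(r"^order\s", line) is not None

print(digits, whole_numbers, date, starts_with_order)
```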

#### Basic syntax
- re.compile()
- re.search/match()

```python
a = re.compile(pattern)
result = a.match(string)
```
is equivalent to:
```python
result = re.match(pattern, string)
```
**Note:** The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.


In [6]:
import sys  
import re  

In [5]:
re_desc = """
This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings.
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
It is important to note that most regular expression operations are available as module-level functions and RegexObject methods. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters.
"""

In [7]:
re_desc.count("expression")

5

In [28]:
pattern = re.compile(r"expression")
search_result = pattern.search(re_desc) 
print (search_result.group(0))
print(search_result)

expression
<_sre.SRE_Match object; span=(30, 40), match='expression'>


In [15]:
search_result = re.search(r"expression",re_desc) 
search_result.group(0)

'expression'

#### re.match() vs. re.search()
re.match() checks for a match only at the **beginning** of the string

re.search() checks for a match anywhere in the string (this is what Perl does by default).

Both return a match object when the pattern matches and `None` when it does not.

A match object provides methods such as .group(), .start(), and .end() (see the documentation for details).

ref: https://docs.python.org/3.6/library/re.html#search-vs-match
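
A minimal sketch of the difference, including the `None` check that should come before calling any match-object method. The sample string is an assumption for illustration.

```python
import re

text = "abcdef"

# search() scans the whole string, so it finds "cd" in the middle.
m = re.search(r"cd", text)
if m is not None:  # always check before calling .group()
    found = (m.group(0), m.start(), m.end(), m.span())

# match() only tries at position 0, so the same pattern fails here.
at_start = re.match(r"cd", text)

print(found)
print(at_start)
```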

In [82]:
pattern = re.compile(r"expression")
match_result = pattern.match(re_desc)  
print(match_result)  # prints None because "expression" does not occur at the beginning of re_desc

None


In [317]:
re.match("c", "abcdef")    # No match
re.search("c", "abcdef")   # Match
re.search("^c", "abcdef")  # No match
re.search("^a", "abcdef")  # Match
re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match

<_sre.SRE_Match object; span=(4, 5), match='X'>

#### re.findall()
Returns all non-overlapping matches as a list.

In [43]:
#pattern = re.compile()
search_result = re.findall(r"exp\w*",re_desc) 
search_result

['expression',
 'expressions',
 'expression',
 'expressed',
 'expression',
 'expressed',
 'expression']

#### re.finditer()
Returns an iterator of match objects, so the matches can be printed one per line.

In [17]:
for m in re.finditer(r"expression", re_desc):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

30-40: expression
191-201: expression
556-566: expression
724-734: expression
1070-1080: expression


#### Special characters

1. The r prefix combined with \

Note: in an ordinary Python string literal, "\n" means a newline. Because \ is the escape character in Python string literals, \ followed by another character may take on a different meaning.

To avoid this, regular expression patterns are usually written with the r prefix, which marks the literal as a raw string in which backslashes are not processed:
r"\n" is a two-character string containing '\' and 'n'; re.search(r"\n", text) passes those two characters to the regex engine, which itself interprets \n as a newline and searches text for it.
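
The difference between the two kinds of literal can be checked directly. A small sketch; the sample strings are assumptions for illustration.

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash and the letter n
lengths = (len(plain), len(raw))

# The raw pattern still matches a real newline, because the regex engine
# itself reads the escape sequence \n as a newline character.
newline_hit = re.search(r"\n", "line one\nline two").span()

# A literal backslash needs \\ in the regex; without the r prefix the
# same pattern has to be spelled with four backslashes.
same_pattern = (r"\\" == "\\\\")
backslash_hit = re.search(r"\\temp", r"C:\temp").group(0)

print(lengths, newline_hit, same_pattern, backslash_hit)
```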

In [164]:
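text="C:\rl;kfj"              # in a normal string literal, "\r" is a single carriage-return character
pattern = re.compile(r"\r")   # the regex engine interprets \r as a carriage return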
result = pattern.search(text)  
print (result)
print(result.group(0))

<_sre.SRE_Match object; span=(2, 3), match='\r'>



2.1 r combined with "\\\x", where "\x" is any sequence with a special meaning in regular expressions (e.g. the common metacharacters listed at the beginning of these notes)

-- (e.g. \W matches any character that is not a letter, digit, underscore, or Chinese character)

But what if we want to search for the literal string "\W" rather than its special meaning?

 -- we need a double backslash, as below

In [152]:
text = r"C:\Windows\Program\Joshua"  # raw string: \W, \P and \J are not valid escapes in a normal literal
pattern = re.compile(r"C:\\Windows")
result = pattern.search(text)  
print(result.group(0))

C:\Windows


2.2 r combined with "\(" or "\)"

In a regular expression, ( and ) have a special meaning (grouping), so put a \ before them to match a literal "(" or ")".

In [167]:
text = r"(Windows\Program\Joshua"  # raw string keeps the backslashes literal
pattern = re.compile(r"\(Windows")
result = pattern.search(text)  
print(result.group(0))

(Windows


3. * vs. +
```
*    Repeats zero or more times. Even if the searched item does not occur in the text (i.e. zero times), search() does NOT return None; it returns a match object with group(0) == ''.
    -- Use * with caution: search(), unlike findall(), stops at the first match it finds, including an empty ''. (See Example 3 below.)
+    Repeats one or more times. If the searched item does not occur in the text, search() returns None.
```

In [70]:
num_str="aaaba"

In [92]:
pattern = re.compile(r"\d*")
result = pattern.search(num_str)  
print (type(result))
print (result)
result.group(0)

<class '_sre.SRE_Match'>
<_sre.SRE_Match object; span=(0, 0), match=''>


''

In [93]:
pattern = re.compile(r"\d+")
result = pattern.search(num_str)  
print (type(result))
print (result)
result.group(0)

<class 'NoneType'>
None


AttributeError: 'NoneType' object has no attribute 'group'

Example 3 - use * with caution
```
As shown below, when search() is combined with *, it proceeds like this: it starts at the 1st character of num_str,
i.e. "1" --> this is not "a", but a* may match zero times, so it returns a match object with match='' --> the search ends.

In this case we usually use + instead of *, since + requires at least one "a", so search() keeps scanning until it finds the 1st "a".
```

In [102]:
num_str="11ab334fga"
pattern = re.compile(r"a*")
result = pattern.search(num_str)
result

<_sre.SRE_Match object; span=(0, 0), match=''>

In [103]:
num_str="11ab334fga"
pattern = re.compile(r"a+")
result = pattern.search(num_str)
result

<_sre.SRE_Match object; span=(2, 3), match='a'>
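
The same caution applies to `findall()`: with `*` the result list is padded with empty matches, while `+` returns only the real hits. A small sketch reusing the `num_str` value from the cells above.

```python
import re

num_str = "11ab334fga"

# a* can match the empty string, so empty matches end up in the result list.
star_matches = re.findall(r"a*", num_str)

# a+ needs at least one "a", so only real occurrences are returned.
plus_matches = re.findall(r"a+", num_str)

# Filtering out the empty strings recovers the same hits.
non_empty = [m for m in star_matches if m]

print(star_matches)
print(plus_matches)
```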

4. Combining different special characters in one regular expression

In [107]:
s_str="Iraq Benq"
s_str1="Iraq,Benq"

In [113]:
pattern=r"\b\w*q[^u]\w*\b"
result = re.search(pattern,s_str)
result.group(0)

'Iraq Benq'

In [114]:
pattern=r"\b\w*q \w*\b"
result = re.search(pattern,s_str)
result.group(0)

'Iraq Benq'

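### Character matching

The expression `\(?0\d{2}[)-]?\d{8}` matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678.

Let's analyze it: first comes an escaped `\(`, which may appear 0 or 1 times (`?`); then a 0 followed by 2 digits (`\d{2}`); then either `)` or `-`, which appears once or not at all (`?`); and finally 8 digits (`\d{8}`).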

In [401]:
ip_address="not validate 255.255.255.0 and 127.0.0.1 or 1.1.1.1"
pattern = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
search_result = pattern.findall(ip_address) 
search_result

['255.255.255.0', '127.0.0.1', '1.1.1.1']

In [402]:
phone_number="some phone number is (010)88886666,some is 022-22334455, other is 02912345678, \
how about (022-87654321 format?"
pattern = re.compile(r"\(?0\d{2}[)-]?\d{8}")
search_result = pattern.findall(phone_number) 
search_result

['(010)88886666', '022-22334455', '02912345678', '(022-87654321']

### Alternation (A or B): use | to separate two alternatives A and B
When matching alternatives, each one is tested from left to right; as soon as one alternative matches, the remaining ones are not tried.
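
Because alternatives are tried from left to right, their order can change the result. A minimal sketch; the ZIP-code-like sample string is an assumption for illustration.

```python
import re

zip_text = "12345-6789"

# The short alternative comes first and already succeeds,
# so the longer form is never tried.
short_first = re.search(r"\d{5}|\d{5}-\d{4}", zip_text).group(0)

# Putting the more specific alternative first captures the full code.
long_first = re.search(r"\d{5}-\d{4}|\d{5}", zip_text).group(0)

print(short_first)
print(long_first)
```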

In [130]:
phone_number="some phone number is (010)88886666,some is 022-22334455, other is 02912345678, \
how about (022-87654321 format?"
pattern = re.compile(r"\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}")
search_result = pattern.findall(phone_number) 
search_result

['(010)88886666', '022-22334455', '02912345678', '022-87654321']

In [129]:
phone_number="some phone number is (010)88886666,some is 022-22334455, other is 02912345678, \
how about (022-87654321 format?"
pattern = re.compile(r"\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}")
for m in re.finditer(pattern, phone_number):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

21-34: (010)88886666
43-55: 022-22334455
66-77: 02912345678
90-102: 022-87654321


### Grouping: two behaviors of findall

In [154]:
text = "ababab hello ab  cdababcd"
pattern = re.compile(r"(ab){2}")
pattern.findall(text) 

['ab', 'ab']

In [155]:
text = "ababab hello ab  cdababcd"
pattern = re.compile(r"(?:ab){2}")
pattern.findall(text)

['abab', 'abab']

In [128]:
text = "ababab hello ab  cdababcd"
for m in re.finditer(r"(ab){2}", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

00-04: abab
19-23: abab


In [406]:
for m in re.finditer(r"(\d{1,3}\.){3}\d{1,3}", ip_address):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

13-26: 255.255.255.0
31-40: 127.0.0.1
44-51: 1.1.1.1


In [407]:
ip_address="not validate 355.555.255.0 and 127.0.0.1 or 1.1.1.1"
pattern = re.compile(r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)")
for m in pattern.finditer(ip_address):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

31-40: 127.0.0.1
44-51: 1.1.1.1


```
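Capturing
(exp)	Match exp and capture the text into an automatically numbered group
(?P<name>exp)	Match exp and capture the text into a group named name
(?:exp)	Match exp without capturing the matched text or assigning a group number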

```
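
How the kinds of group in the table change what `re.findall()` returns can be sketched as follows. The `x=1, y=22` string and the group names `key` and `value` are made up for the example.

```python
import re

text = "ababab hello ab  cdababcd"

# One capturing group: findall returns what the group captured last.
one_group = re.findall(r"(ab){2}", text)

# Non-capturing group: findall returns the whole match instead.
no_group = re.findall(r"(?:ab){2}", text)

# Several capturing groups: findall returns one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", "x=1, y=22")

# Named groups can be read back as a dict from a match object.
named = re.search(r"(?P<key>\w+)=(?P<value>\d+)", "x=1, y=22").groupdict()

print(one_group, no_group, pairs, named)
```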

In [146]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
m.group()

'Isaac Newton'

In [147]:
m.group(1)

'Isaac'

In [148]:
m.group(2)

'Newton'

In [153]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
m.group('first_name')

'Malcolm'

In [151]:
m.group('last_name')

'Reynolds'

In [12]:
text ="Iraq,Benq"
text1 ="Iraqx,Benq"
m = re.match(r"\b\w*q[^u]\w*\b",text)  # [^u] must consume one character, so it swallows the comma
m.group()

'Iraq,Benq'

In [11]:
m = re.match(r"\b\w*q(?!u)\w*\b",text)  # (?!u) only checks the next character without consuming it
m.group()

'Iraq'

### Back-references: how to capture repeated words (e.g. ABC ABC) using \1

In [358]:
text = "kitty kitty go go, so cute"
pattern = re.compile(r"\b(\w+)\b\s+\1\b")  # \1 refers back to the text captured by group 1
#search_result = pattern.findall(text) 
for m in pattern.finditer(text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

00-11: kitty kitty
12-17: go go


In [371]:
text = "kitty kitty go go, so cute"
pattern = re.compile(r"\b(?P<dw>\w+)\b\s+(?P=dw)\b")  
for m in pattern.finditer(text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

00-11: kitty kitty
12-17: go go


Sample text searched in the next cell: `<a href="http://cn.dataapplab.com">'hello dal'</a>`

In [376]:
text = '<a href="http://cn.dataapplab.com">\'hello dal\'</a>'
pattern = re.compile(r"(?P<quote>['\"]).*?(?P=quote)")  
for m in pattern.finditer(text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

08-34: "http://cn.dataapplab.com"
35-46: 'hello dal'


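### Zero-width assertions: finding words that start or end with xxx

Like \b, ^, and $, these specify a position that must satisfy a certain condition (the assertion), which is why they are called zero-width assertions.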

```
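(?=exp)	Matches a position that is followed by exp
(?<=exp)	Matches a position that is preceded by exp
(?!exp)	Matches a position that is not followed by exp
(?<!exp)	Matches a position that is not preceded by exp

Comment	(?#comment)	This kind of group has no effect on how the regular expression is processed; it only provides a comment for human readers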
```
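
Each of the four assertions in the table, applied to a short string. A sketch only; the price and weight strings are invented for illustration.

```python
import re

prices = "apple $30, pear 40, kiwi $5"

# (?<=exp): digits that come right after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# (?<!exp): whole numbers that are NOT preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+\b", prices)

# (?=exp): digits that are followed by " kg" (the unit is not part of the match).
before_kg = re.findall(r"\d+(?= kg)", "3 kg of rice, 10 km, 25 kg")

# (?!exp): words starting with "foo" that do not continue with "bar".
not_foobar = re.findall(r"foo(?!bar)\w*", "foobar foobaz")

print(after_dollar, not_after_dollar, before_kg, not_foobar)
```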

In [171]:
m = re.search(r'(?<=abc)\w+', 'abcdef')  # text that follows abc (the output excludes abc)
m.group(0)

'def'

In [169]:
m = re.search(r'(?<=-)\w+', 'spam-egg')
m.group(0)

'egg'

In [54]:
text = "I'm singing while you're dancing"
pattern = re.compile(r"\b\w+(?=ing\b)")   # match the leading part of words ending in ing (everything except the ing)
for m in pattern.finditer(text): 
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

04-08: sing
25-29: danc


In [51]:
text = "I'm singing while you're dancing"
pattern = re.compile(r"\b\w+(?<!ing)\b")   # match whole words that do not end in ing
for m in pattern.finditer(text): 
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

00-01: I
02-03: m
12-17: while
18-21: you
22-24: re


In [34]:
text = "I'm singing while you're dancing"
pattern = re.compile(r"(?<=\bdan)\w+\b")   # match the rest of words starting with dan (everything except the dan)
for m in pattern.finditer(text): 
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

28-32: cing


### Real examples

### Q: Find the adverbs

In [273]:
text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)

['carefully', 'quickly']

In [274]:
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

07-16: carefully
40-47: quickly


### Splitting text

In [157]:
phone_text = """Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way
Heather Albrecht: 548.326.4584 919 Park Place"""

In [158]:
entries = re.split("\n+", phone_text)
entries

['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']

In [160]:
re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [161]:
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [68]:
re.split(r'\W+', 'Words, words, words.', maxsplit=1)


['Words', 'words, words.']

In [162]:
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

In [70]:
[re.split(":? ", entry, maxsplit=3) for entry in entries]

[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

### Substitution

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

In [8]:
?re.sub

In [6]:
new_str=re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
       r'static PyObject*\npy_\1(void)\n{',
       'def myfunc():')
print(new_str)

static PyObject*
py_myfunc(void)
{


If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument and returns the replacement string. For example:

In [73]:
import random
def repl(m):
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

text = "Professor Abdolmalek, please report your absences promptly."
re.sub(r"(\w)(\w+)(\w)", repl, text)

'Psfooresr Abalomdlek, plasee rreopt your asebecns prtlpomy.'

In [74]:
re.sub(r"(\w)(\w+)(\w)", repl, text)

'Pfsoerosr Alemdoalbk, peasle roeprt your acenbess ppomrlty.'
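
Two further `re.sub()` features are sketched here: `\g<name>` in the replacement string to refer to a named group, and the `count` argument to limit how many occurrences are replaced. The sample strings and the helper name `shout` are assumptions for illustration.

```python
import re

date_text = "Released 2017-03-09, patched 2018-11-21."

# \g<name> inserts the text captured by the named group.
swapped = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<d>/\g<m>/\g<y>",
    date_text,
)

def shout(m):
    """Replacement function: return the matched word in upper case."""
    return m.group(0).upper()

# count=1 replaces only the leftmost occurrence.
first_only = re.sub(r"\b\w+ly\b", shout, "slowly but surely", count=1)

print(swapped)
print(first_only)
```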

### How to handle Chinese text?

In [275]:
title = u'你好，hello，世界'
pattern = re.compile(r'[\u4e00-\u9fff]+')
result = pattern.findall(title)
result

['你好', '世界']

<table cellspacing="0" cellpadding="0" width="900" border="1"><colgroup></colgroup><colgroup><col width="10%"><col width="75%"><col width="15%"></colgroup><tbody><tr><td colspan="3">
<p align="center"><span style="font-family:'Microsoft YaHei';font-size:24px;">Main character ranges for non-English scripts</span></p>
</td>
</tr><tr><td><span style="font-family:'Microsoft YaHei';font-size:16px;"><strong>Range</strong></span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;"><strong>Block</strong></span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;"><strong>Description</strong></span></td>
</tr><tr><td><span style="font-family:'Microsoft YaHei';font-size:16px;"><em>2E80~33FFh</em></span></td>
<td><span style="font-family:'Microsoft YaHei';color:#ff0000;font-size:16px;">CJK symbols</span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;">Contains Kangxi dictionary radicals, CJK supplementary radicals, Bopomofo (Zhuyin) symbols, Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized letters and numbers, months, and Japanese kana combinations, units, era names, months, dates, and times.</span></td>
</tr><tr><td><span style="font-family:'Microsoft YaHei';font-size:16px;"><em>3400~4DFFh</em></span></td>
<td><span style="font-family:'Microsoft YaHei';color:#ff0000;font-size:16px;">CJK Unified Ideographs Extension A</span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;">CJK Unified Ideographs Extension A, containing 6,582 CJK ideographs in total.</span></td>
</tr><tr><td><span style="font-family:'Microsoft YaHei';font-size:16px;"><em>4E00~9FFFh</em></span></td>
<td><span style="font-family:'Microsoft YaHei';color:#ff0000;font-size:16px;">CJK Unified Ideographs</span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;">CJK Unified Ideographs, containing 20,902 CJK ideographs in total.</span></td>
</tr><tr><td><span style="font-family:'Microsoft YaHei';font-size:16px;"><em>A000~A4FFh</em></span></td>
<td><span style="font-family:'Microsoft YaHei';color:#ff0000;font-size:16px;">Yi script</span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;">Contains the script and radicals of the Yi people of southern China.</span></td>
</tr><tr><td><span style="font-family:'Microsoft YaHei';font-size:16px;"><em>AC00~D7FFh</em></span></td>
<td><span style="font-family:'Microsoft YaHei';color:#ff0000;font-size:16px;">Hangul syllables</span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;">Contains the characters composed from Korean jamo.</span></td>
</tr><tr><td><span style="font-family:'Microsoft YaHei';font-size:16px;"><em>F900~FAFFh</em></span></td>
<td><span style="font-family:'Microsoft YaHei';color:#ff0000;font-size:16px;">CJK Compatibility Ideographs</span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;">Contains 302 CJK ideographs in total.</span></td>
</tr><tr><td><span style="font-family:'Microsoft YaHei';font-size:16px;"><em>FB00~FFFDh</em></span></td>
<td><span style="font-family:'Microsoft YaHei';color:#ff0000;font-size:16px;">Presentation forms</span></td>
<td><span style="font-family:'Microsoft YaHei';font-size:16px;">Contains Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, halfwidth forms, fullwidth forms, etc.</span></td>
</tr></tbody></table>
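
The ranges in the table can be dropped into a character class in the same way as the Chinese example. A small sketch using the Hangul and CJK Unified Ideographs ranges; the mixed-language sample string is an assumption for illustration.

```python
import re

mixed = "한국어 and 中文 and 日本語"

# AC00~D7FF: Hangul syllables.
hangul = re.findall(r"[\uac00-\ud7ff]+", mixed)

# 4E00~9FFF: CJK Unified Ideographs (shared by Chinese and Japanese kanji).
cjk = re.findall(r"[\u4e00-\u9fff]+", mixed)

print(hangul)
print(cjk)
```

Note that the CJK range alone cannot tell Chinese from Japanese text, since both use the same ideograph block.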

In [276]:
title = u'你好，hello，世界，生生世世，好不好啊, 世界繁荣'
pattern = re.compile(r'世{1,2}[\u4e00-\u9fff]?')
result = pattern.findall(title)
result

['世界', '世世', '世界']