# `re`
### - In Python, the built-in re module provides support for regular expressions ( Regular Expression ).


### 
## `re.search( pattern, string )`
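### - Scans string for the first location where pattern matches and returns a Match object, or None if nothing matches.
### pattern : the regular expression to search for
### string : the text to search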

In [4]:
import re

In [5]:
text = "The person's phone number is 408-555-1234. Call soon!"
pattern = 'phone'

In [6]:
re.search(pattern,text) # span=(13,18) -> the match starts at index 13 and ends just before index 18

<re.Match object; span=(13, 18), match='phone'>

**The result can be assigned to a variable :**

In [7]:
match = re.search(pattern,text)

In [8]:
match

<re.Match object; span=(13, 18), match='phone'>

**.span( ) : returns a tuple ( start index, end index ); .start( ) and .end( ) return each value on its own**

In [9]:
match.span()

(13, 18)

In [10]:
match.start()

13

In [11]:
match.end()

18
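
The cells above always find a match, but `re.search()` returns `None` when nothing matches, and calling `.span()` on `None` raises an error. A minimal sketch of a guarded lookup follows; the helper name `find_span` is an assumption made for illustration and is not part of the `re` module.

```python
import re

text = "The person's phone number is 408-555-1234. Call soon!"

def find_span(pattern, string):
    """Hypothetical helper: return the (start, end) span of the first match, or None."""
    match = re.search(pattern, string)
    if match is None:  # re.search returns None when nothing matches
        return None
    return match.span()

found = find_span("phone", text)
missing = find_span("email", text)

print(found)    # (13, 18)
print(missing)  # None
# Slicing the text with the span recovers the matched substring.
print(text[found[0]:found[1]])  # phone
```

The end index is exclusive, which is why the span can be used directly as a slice.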

### 
## `re.findall( pattern, string )`
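### - To collect every match (for example, to count them), use re.findall(), which returns a list of the matched strings.
### pattern : the regular expression to search for
### string : the text to search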

In [14]:
text = "my phone is a new phone"

In [15]:
matches = re.findall("phone",text)

In [16]:
matches

['phone', 'phone']

In [17]:
len(matches)

2
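
Two details of `re.findall()` are easy to miss: matching is case-sensitive by default, and an unsuccessful search returns an empty list rather than `None`. The sketch below assumes a slightly extended sample sentence (not from the original notebook) to show both.

```python
import re

# Sample sentence extended for illustration; the last "Phone" is capitalized.
text = "My phone is a new phone. Phone cases cost extra."

exact = re.findall("phone", text)                          # case-sensitive
any_case = re.findall("phone", text, flags=re.IGNORECASE)  # ignores case
nothing = re.findall("tablet", text)                       # no match at all

print(len(exact))  # 2
print(any_case)    # ['phone', 'phone', 'Phone']
print(nothing)     # []
```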

### 
## `re.finditer( pattern, string )`
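### - Iterates over every match in the target string, yielding one Match object per match.
### pattern : the regular expression to search for
### string : the text to search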

In [20]:
text = "my phone is a new phone"

In [21]:
for match in re.finditer("phone",text):
    print(match)

<re.Match object; span=(3, 8), match='phone'>
<re.Match object; span=(18, 23), match='phone'>


In [22]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


**.group( ) : returns the matched text rather than its position ( here match is the last Match object left over from the loop above )**

In [24]:
match.group()

'phone'
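
Because `re.finditer()` yields Match objects, the matched text and its position can be gathered in one pass. A small sketch, using the same sentence as the cells above; the variable name `hits` is an assumption.

```python
import re

text = "my phone is a new phone"

# Collect (matched text, start, end) for every match.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer("phone", text)]

print(hits)  # [('phone', 3, 8), ('phone', 18, 23)]
```

This is the main reason to prefer `re.finditer()` over `re.findall()`: the latter keeps only the strings and discards the positions.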

### 
### 
### 
# `Symbols ( Patterns )`
### - In a regular expression, special character codes stand for digits, letters, and other symbols.

<table ><tr><th>Symbol</th><th>Meaning</th><th>Example pattern</th><th >Example match</th></tr>

<tr ><td><span >\d</span></td><td>Matches any digit, equivalent to 0-9</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Matches a letter ( a-z, A-Z ), a digit ( 0-9 ), or an underscore ( _ )</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>Matches a whitespace character ( space, \t \n \r \f )</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>Matches any character that is not a digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Matches any character that is not a letter, a digit, or an underscore</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Matches any character that is not whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>


**More patterns : http://120.105.184.250/cswang/thit/Linux/RegularExpression.htm**

In [25]:
text = "My telephone number is 408-555-1234"

In [26]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [27]:
phone.group()

'408-555-1234'
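
The rows of the symbol table can be checked mechanically. The sketch below pairs each example pattern with its example match and uses `re.fullmatch()`, which succeeds only when the whole string fits the pattern; the names `examples` and `results` are assumptions for illustration.

```python
import re

# (pattern, sample) pairs taken from the symbol table.
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# True when the sample matches its pattern from start to end.
results = {pattern: bool(re.fullmatch(pattern, sample)) for pattern, sample in examples}

for pattern, ok in results.items():
    print(pattern, ok)

# A counterexample: "a" is not a digit, so the second \d fails.
print(re.fullmatch(r"file_\d\d", "file_2a"))  # None
```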

### 
# `Quantifiers`
### - A quantifier specifies how many times the element before it must occur.

<table ><tr><th>Symbol</th><th>Meaning</th><th>Example pattern</th><th >Example match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times in a row</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more times</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Occurs once or not at all</td><td>plurals?</td><td>plural</td></tr></table>

In [32]:
text = "My telephone number is 408-555-1234"

In [33]:
test1 = re.search(r'\d{3}-\d{3}-\d{4}',text)

In [34]:
test1.group()

'408-555-1234'
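
The quantifier rows can be verified the same way. This sketch checks each example with `re.fullmatch()` and then shows that quantifiers are greedy: `\d{2,4}` takes as many digits as it is allowed to. The `*` row is written here as the plain regex `A*B*C*`; the names `examples` and `results` are assumptions.

```python
import re

# (pattern, sample) pairs based on the quantifier table.
examples = [
    (r"Version \w-\w+", "Version A-b1_1"),
    (r"\D{3}", "abc"),
    (r"\d{2,4}", "123"),
    (r"\w{3,}", "anycharacters"),
    (r"A*B*C*", "AAACC"),
    (r"plurals?", "plural"),
]

results = [bool(re.fullmatch(pattern, sample)) for pattern, sample in examples]
print(results)  # [True, True, True, True, True, True]

# Greedy matching: given five digits, {2,4} consumes four of them.
greedy = re.search(r"\d{2,4}", "12345").group()
print(greedy)  # 1234
```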

### 
## `re.compile( pattern )`
### - Compiles a regular expression into a compiled pattern object, which is then used to do the matching.

In [35]:
import re

ore = re.compile(r'\d{4}')

a = ['1234','abc123','123bgt4567','qwer1234']

# print only the strings that contain four consecutive digits
for it in a:
    if ore.search(it):
        print(it)

1234
123bgt4567
qwer1234
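
A compiled pattern exposes the same operations as the module-level functions, as methods. The sketch below reuses the list from the cell above; the names `four_digits`, `kept`, and `digits` are assumptions chosen for readability.

```python
import re

four_digits = re.compile(r"\d{4}")
items = ['1234', 'abc123', '123bgt4567', 'qwer1234']

# Keep only the items that contain four consecutive digits.
kept = [item for item in items if four_digits.search(item)]
# Extract the four digits themselves from each kept item.
digits = [four_digits.search(item).group() for item in kept]

print(kept)                 # ['1234', '123bgt4567', 'qwer1234']
print(digits)               # ['1234', '4567', '1234']
print(four_digits.pattern)  # \d{4}
```

Compiling once and reusing the object keeps the pattern in one place when it is applied to many strings.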


### 

In [36]:
text = "My telephone number is 408-555-1234"

**Compile the pattern with re.compile(); each pair of parentheses defines a capture group that can be retrieved on its own :**

In [41]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
# group 1: (\d{3})   group 2: (\d{3})   group 3: (\d{4})

In [42]:
results = re.search(phone_pattern,text)

In [43]:
results.group()

'408-555-1234'

In [44]:
# Note: the first group is number 1, not 0 ( group 0 is the whole match )
results.group(1)

'408'

In [45]:
results.group(2)

'555'

In [46]:
results.group(3)

'1234'

In [47]:
results.group(4)

IndexError: no such group
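
The `IndexError` above can be avoided by asking the pattern how many groups it has. The following sketch uses `.groups()` to fetch all captured groups at once and a hypothetical helper, `safe_group`, that returns `None` for a group number that does not exist.

```python
import re

text = "My telephone number is 408-555-1234"
phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")
results = phone_pattern.search(text)

# .groups() returns every captured group as a tuple, starting from group 1.
parts = results.groups()
print(parts)                 # ('408', '555', '1234')
print(phone_pattern.groups)  # 3, the number of capture groups in the pattern

def safe_group(match, index):
    """Hypothetical helper: return group `index`, or None if there is no such group."""
    if 0 <= index <= match.re.groups:
        return match.group(index)
    return None

print(safe_group(results, 0))  # 408-555-1234
print(safe_group(results, 4))  # None
```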

### 
## `Other Regex Symbols`
### 
### Or : `|`

In [48]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>

In [50]:
pattern = r"man|woman"
text = "This man was here."

match = re.findall(pattern, text)
match

['man']

In [49]:
re.search(r"man|woman","This woman was here.")

<re.Match object; span=(5, 10), match='woman'>

In [51]:
pattern = r"man|woman"
text = "This woman was here."

match = re.findall(pattern, text)
match

['woman']
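
Alternatives are tried from left to right at each position, and the first one that succeeds wins. The sketch below shows why `man|woman` still finds `woman`, and why order matters when one alternative is a prefix of another. The sample sentence and the `cat|catfish` patterns are assumptions added for illustration.

```python
import re

# "man" cannot match starting at the "w", so "woman" is tried and matches there.
both = re.findall(r"man|woman", "A man and a woman were here.")
print(both)  # ['man', 'woman']

# When one alternative is a prefix of another, the leftmost alternative wins.
short_first = re.search(r"cat|catfish", "catfish").group()
long_first = re.search(r"catfish|cat", "catfish").group()
print(short_first)  # cat
print(long_first)   # catfish
```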

### Wildcard ( stands for any character ) : `.` or `\S`

In [52]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [53]:
re.findall(r".at","The bat went splat")

['bat', 'lat']

Each dot stands for exactly one character, and that character may be a space:

In [54]:
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

To capture whole words only, use \S ( any non-whitespace character ) :

In [55]:
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']
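
One exception to "any character" is worth knowing: by default `.` does not match a newline, unless the `re.DOTALL` flag is passed. A short sketch, using a made-up two-line string as the assumption:

```python
import re

text = "bat\nat"  # illustrative string with a newline before the second "at"

default = re.findall(r".at", text)                     # "." skips the newline
dotall = re.findall(r".at", text, flags=re.DOTALL)     # "." also matches "\n"
non_space = re.findall(r"\Sat", "The bat went splat")  # exactly one non-space character

print(default)    # ['bat']
print(dotall)     # ['bat', '\nat']
print(non_space)  # ['bat', 'lat']
```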

### Anchoring to the start or the end : start `^` , end `$`

In [56]:
# Ends with a number
re.findall(r'\d$','This ends with a number 2')

['2']

In [57]:
# Starts with a number
re.findall(r'^\d','1 is the loneliest number.')

['1']
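
By default `^` and `$` anchor to the start and end of the whole string. With the `re.MULTILINE` flag they anchor to every line instead. The sketch below assumes a small three-line string invented for the example.

```python
import re

lines = "1 apple\nbanana 2\n3 cherries"  # illustrative multi-line text

start_default = re.findall(r"^\d", lines)                    # start of the string only
start_multi = re.findall(r"^\d", lines, flags=re.MULTILINE)  # start of each line
end_default = re.findall(r"\d$", lines)                      # end of the string only
end_multi = re.findall(r"\d$", lines, flags=re.MULTILINE)    # end of each line

print(start_default)  # ['1']
print(start_multi)    # ['1', '3']
print(end_default)    # []
print(end_multi)      # ['2']
```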

### Exclusion : `[^ ]`
#### - Any character listed after ^ inside [ ] is excluded from the match.

In [58]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

Every character except digits:

In [59]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

Removing punctuation:

In [63]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [64]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [65]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))
clean

'This is a string But it has punctuation How can we remove it'
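
The same cleanup can be written as a substitution instead of a find-and-join. This is an alternative sketch, not the notebook's method: `re.sub()` replaces every character in the class `[!.?]` with an empty string.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Delete the listed punctuation marks; everything else, spaces included, is kept.
clean_sub = re.sub(r'[!.?]', '', test_phrase)
# The find-and-join approach from the cell above, for comparison.
clean_join = ' '.join(re.findall('[^!.? ]+', test_phrase))

print(clean_sub)                # This is a string But it has punctuation How can we remove it
print(clean_sub == clean_join)  # True
```

The two results agree here because the words are separated by single spaces; with repeated spaces, `re.sub()` would preserve them while the join would collapse them.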

### One or more of the preceding element : `+` ( quantifier )

In [61]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [62]:
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']
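
The effect of `+` is easiest to see on the digits themselves: without it each digit is a separate match, with it consecutive digits are grouped. The sketch also confirms that `[^\d]` and `\D` are interchangeable. Variable names are assumptions.

```python
import re

phrase = "there are 3 numbers 34 inside 5 this sentence."

single = re.findall(r'\d', phrase)   # one match per digit
grouped = re.findall(r'\d+', phrase) # one match per run of digits
same = re.findall(r'[^\d]+', phrase) == re.findall(r'\D+', phrase)

print(single)   # ['3', '3', '4', '5']
print(grouped)  # ['3', '34', '5']
print(same)     # True
```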

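### Grouping options with brackets : `[ ]`
    ( ) : matches the exact sequence of characters inside the parentheses
    ex. (ele) matches the "ele" in elephant , (thank) matches the "thank" in thanks
    
    [ ] : matches any single character listed inside the brackets, in any order
    ex. [abc] matches a, b, or c ; [abc]+ also matches ab, ac, bc, ba, ca, bca, cba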

In [68]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [70]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']
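
The difference between `( )` and `[ ]` can be checked directly. In this sketch, with sample strings invented for illustration, `[abc]` accepts the letters in any order while `(abc)` requires the exact sequence.

```python
import re

# [abc] matches one character at a time, in whatever order they appear.
one_each = re.findall(r"[abc]", "cab")
# [abc]+ matches runs made of those characters.
runs = re.findall(r"[abc]+", "a cab, a bca")
# (abc)+ matches only repetitions of the exact sequence "abc".
sequence = re.search(r"(abc)+", "abcabc cab").group()
no_sequence = re.search(r"(abc)", "cab")

print(one_each)     # ['c', 'a', 'b']
print(runs)         # ['a', 'cab', 'a', 'bca']
print(sequence)     # abcabc
print(no_sequence)  # None
```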

### When the alternatives share a common part, ( ) can group what differs:

In [71]:
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

r'cat(fish|nap|claw)' matches catfish (cat+fish), catnap (cat+nap), or catclaw (cat+claw).

In [72]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [73]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [75]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
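
Since the last search returns `None`, code that processes all three sentences has to check for it before calling `.group()`. A minimal sketch: `group(1)` holds whichever alternative inside the parentheses matched. The names `cat_pattern` and `found` are assumptions.

```python
import re

texts = [
    'Hello, would you like some catfish?',
    "Hello, would you like to take a catnap?",
    "Hello, have you seen this caterpillar?",
]

cat_pattern = re.compile(r'cat(fish|nap|claw)')

found = []
for sentence in texts:
    match = cat_pattern.search(sentence)
    # group(1) is the alternative that matched; None marks a sentence with no match.
    found.append(match.group(1) if match else None)

print(found)  # ['fish', 'nap', None]
```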