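# Regular Expression Practice
## In web scraping, regular expressions are often used to filter and search for strings that follow a specific pattern.
## This exercise practices filtering IP addresses and URLs.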

In [1]:
import re  # load the re module

In [23]:
# Define a helper that tests whether a regular expression matches the input text
def RegexMatchingTest(regex, input_text):
    # Compile the regular expression into a pattern object
    pattern = re.compile(regex)

    # Search the input text for the first location where the pattern matches
    result = pattern.search(input_text)

    if result:
        # group() returns the matched text, so we can print it
        print("Matched: %s" % (result.group()))

        if result.lastindex is not None:
            # group(0) is the entire match; group(1), group(2), ... are the capture groups
            for i in range(0, result.lastindex+1):
                print("  group(%d): %s" % (i, result.group(i)))
    else:
        print("Not matched.")
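
The helper uses `search`, which scans the whole text for the first match. As a minimal sketch (the sample strings are only for illustration), the difference from `re.match` and `re.fullmatch` looks like this:

```python
import re

text = "Google IP address is 216.58.200.227"

# re.search scans forward and returns the first match found anywhere in the text
found = re.search(r"\d+", text)
print(found.group())

# re.match only tries to match at the very beginning of the text
at_start = re.match(r"\d+", text)
print(at_start)

# re.fullmatch requires the whole string to match the pattern
whole = re.fullmatch(r"\d+", "216")
print(whole.group())
```

Using `search` is what lets the exercises below find an IP address or URL in the middle of a sentence.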

## Filtering IP addresses with a regular expression
#### A valid IPv4 address has the form X.X.X.X, where each X is a number from 0 to 255. A single regex can describe this format.

In [21]:
test_string = "Google IP address is 216.58.200.227"

# Regex pattern that extracts an IP address (dots are escaped so they match a literal ".")
regex = r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})'
RegexMatchingTest(regex, test_string)

Matched: 216.58.200.227
  group(0): 216.58.200.227
  group(1): 216
  group(2): 58
  group(3): 200
  group(4): 227


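#### This is the simplest way to write the regex. On closer inspection, though, it also matches invalid IP addresses such as 444.555.666.777.
#### The regex needs refining so that it accepts only valid addresses of the form [0 ~ 255].[0 ~ 255].[0 ~ 255].[0 ~ 255] and rejects invalid ones.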

In [26]:
'''
    Your code here.
    hint: split the possible range of each number into separate cases
          1. 000 ~ 199
          2. 200 ~ 249
          3. 250 ~ 255
'''
# One octet: 250-255, 200-249, or 0-199 (one to three digits)
octet = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
# Four octets separated by literal dots
regex = r"\.".join([octet] * 4)


test_string1 = "Test IP 216.58.200.227"
RegexMatchingTest(regex, test_string1)  # check that the regex matches this valid IP

test_string2 = "Test IP 999.888.777.666"
RegexMatchingTest(regex, test_string2)  # check that the regex rejects this invalid IP

Matched: 216.58.200.227
  group(0): 216.58.200.227
  group(1): 216
  group(2): 58
  group(3): 200
  group(4): 227
Not matched.
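
Because the helper searches anywhere in the text, an octet pattern without boundaries can still match inside a longer run of digits; for instance, `999.88.77.666` contains `99.88.77.66`. The sketch below is one possible way to tighten this, not the only one. It assumes the lookarounds `(?<!\d)` and `(?!\d)` are acceptable, and the names `OCTET`, `IPV4_PATTERN`, `find_ipv4`, and `is_valid_ipv4` are made up for this example. It also shows the standard-library `ipaddress` module as an alternative when the goal is to validate a single string.

```python
import ipaddress
import re

# One octet in the range 0-255 (hypothetical helper constant)
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
# Refuse to start or end in the middle of a longer digit run
IPV4_PATTERN = re.compile(r"(?<!\d)" + r"\.".join([OCTET] * 4) + r"(?!\d)")

def find_ipv4(text):
    """Return every substring of text that looks like a valid IPv4 address."""
    return [m.group() for m in IPV4_PATTERN.finditer(text)]

def is_valid_ipv4(candidate):
    """Validate one candidate string with the ipaddress module instead of a regex."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False

print(find_ipv4("Test IP 216.58.200.227"))
print(find_ipv4("Test IP 999.88.77.666"))
print(is_valid_ipv4("999.888.777.666"))
```

The regex is still useful for locating candidates inside free text, while `ipaddress` is arguably the simpler choice once a candidate string has been isolated.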


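## Filtering URLs with a regular expression
#### Web pages being scraped often contain A tags that link to other pages, for example:
`<a href="https://movies.yahoo.com.tw/movietime_result.html/id=9467">Showtimes</a>`
#### We want to extract the URL that follows "href=" so it can be processed later.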

In [7]:
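html_a_tag = "<a href=https://movies.yahoo.com.tw/movietime_result.html/id=9467> Showtimes </a>"

'''
    Your code here.
    regex pattern that extracts the URL
'''
# Matches from the scheme up to the last four consecutive digits, so it assumes the URL ends in a 4-digit id
regex = r"https?://.+\d{4}"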
RegexMatchingTest(regex, html_a_tag)

Matched: https://movies.yahoo.com.tw/movietime_result.html/id=9467
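
The pattern above depends on the URL ending in four digits. A more general sketch, assuming the link lives in an `href` attribute whose value may or may not be quoted, captures everything up to the next quote, whitespace, or `>`. The names `HREF_PATTERN` and `extract_hrefs` are hypothetical, and for real-world HTML a proper parser is usually more reliable than a regex.

```python
import re

# Capture an http(s) URL that follows href=, with or without surrounding quotes
HREF_PATTERN = re.compile(r"""href\s*=\s*["']?(https?://[^"'\s>]+)""")

def extract_hrefs(html):
    """Return all http(s) URLs found in href attributes of the given HTML text."""
    return HREF_PATTERN.findall(html)

unquoted = "<a href=https://movies.yahoo.com.tw/movietime_result.html/id=9467> link </a>"
quoted = '<a href="https://movies.yahoo.com.tw/movietime_result.html/id=9467"> link </a>'
print(extract_hrefs(unquoted))
print(extract_hrefs(quoted))
```

Because the pattern has exactly one capture group, `findall` returns the captured URLs directly rather than the full `href=...` text.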


## Example 1: Use "\w" to match word characters (letters, digits, underscore), "\d" to match digits, and "\s" to match whitespace

In [27]:
test_string = "My plate number is XYZ-1234."
regex = r'My plate number is \w\w\w-\d\d\d\d'
RegexMatchingTest(regex, test_string)

Matched: My plate number is XYZ-1234
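
Note that `\w` is broader than letters: it also matches digits and the underscore. A small sketch, assuming plates consist of exactly three uppercase letters followed by four digits (an assumption made only for this example), shows the difference:

```python
import re

loose = re.compile(r"\w\w\w-\d\d\d\d")   # \w accepts letters, digits, and "_"
strict = re.compile(r"[A-Z]{3}-\d{4}")   # only uppercase ASCII letters before the dash

for plate in ["XYZ-1234", "1_2-1234"]:
    print(plate, bool(loose.fullmatch(plate)), bool(strict.fullmatch(plate)))
```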


In [28]:
test_string = "My phone number is 0912-345 678."
regex = r'My phone number is \d\d\d\d-\d\d\d\s\d\d\d'
RegexMatchingTest(regex, test_string)

Matched: My phone number is 0912-345 678


In [29]:
# Use quantifiers such as {n} (or {n,m}) to shorten the pattern
test_string = "My phone number is 0912-345 678."
regex = r'My phone number is \d{4}-\d{3}\s{1}\d{3}'
RegexMatchingTest(regex, test_string)

Matched: My phone number is 0912-345 678
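
Quantifiers can also express optional or variable-length parts. The sketch below assumes, purely for illustration, that the dash and the space in the phone number are optional; `?` means zero or one occurrence and `{n,m}` means between n and m occurrences.

```python
import re

# The dash and the whitespace are each optional (zero or one occurrence)
phone = re.compile(r"\d{4}-?\d{3}\s?\d{3}")
# Between 2 and 4 digits in a row
short_number = re.compile(r"\d{2,4}")

for text in ["0912-345 678", "0912345678", "0912-345"]:
    print(text, bool(phone.fullmatch(text)))

print(short_number.fullmatch("12345"))
```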


In [30]:
# An even lazier version: "." stands for any character (except a newline by default)
test_string = "My phone number is 0912-345 678."
regex = r'My phone number is .{4}-.{3}.{1}.{3}'
RegexMatchingTest(regex, test_string)

Matched: My phone number is 0912-345 678
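
The shortcut with `.` is convenient but permissive, since `.` matches any character other than a newline. A quick comparison on a made-up non-numeric string:

```python
import re

loose = r"My phone number is .{4}-.{3}.{1}.{3}"
strict = r"My phone number is \d{4}-\d{3}\s\d{3}"
not_a_phone = "My phone number is abcd-efg_hij."

# The dot pattern accepts letters and punctuation in every position
print(re.search(loose, not_a_phone) is not None)
# The digit pattern rejects the same text
print(re.search(strict, not_a_phone) is not None)
```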


## Example 2: Use [...] to match any one of the characters listed inside the brackets

In [31]:
test_string = "I love dogs."
regex = 'I love [acdgnost]'
RegexMatchingTest(regex, test_string)

Matched: I love d


In [32]:
test_string = "I love cats."
regex = 'I love [acdgnost]'
RegexMatchingTest(regex, test_string)

Matched: I love c


In [33]:
# A character class matches exactly one character; to match several, add a quantifier such as "+" or "*"
test_string = "I love dogs."
regex = 'I love [acdgnost]+'
RegexMatchingTest(regex, test_string)

Matched: I love dogs


In [34]:
test_string = "I love people."
regex = 'I love [acdgnost]+'
RegexMatchingTest(regex, test_string)
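# The character right after "I love " is 'p', which is not listed in [acdgnost], so the match fails
# (the 'o' in "people" is in the class, but the class must match starting at the 'p')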

Not matched.
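
Listing every allowed character quickly becomes tedious. Character classes also accept ranges such as `a-z`, and a leading `^` negates the class. A minimal sketch using the same sentences:

```python
import re

# A range covers every lowercase ASCII letter, so "people" now matches
by_range = re.search(r"I love [a-z]+", "I love people.")
print(by_range.group())

# A negated class matches any character that is NOT listed; here, everything up to the period
by_negation = re.search(r"I love [^.]+", "I love dogs.")
print(by_negation.group())
```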


## Example 3: Grouping and capturing

In [35]:
test_string = "I like baseball sport."
regex = 'I like (hiking|baseball) sport'
RegexMatchingTest(regex, test_string)

Matched: I like baseball sport
  group(0): I like baseball sport
  group(1): baseball


In [36]:
test_string = "I like hiking sport."
regex = 'I like (hiking|basketball) sport'
RegexMatchingTest(regex, test_string)

Matched: I like hiking sport
  group(0): I like hiking sport
  group(1): hiking
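
Numbered groups work well for short patterns. As a sketch of two related features of Python's `re` module, a named group `(?P<name>...)` can be read back by name, and a non-capturing group `(?:...)` groups alternatives without storing them. The group name `sport` is chosen only for this example.

```python
import re

text = "I like baseball sport."

# A named group can be read back by name instead of by number
named = re.search(r"I like (?P<sport>hiking|baseball) sport", text)
print(named.group("sport"))

# A non-capturing group (?:...) still groups the alternatives but stores nothing
plain = re.search(r"I like (?:hiking|baseball) sport", text)
print(plain.lastindex)
```

With no capture groups in the pattern, `lastindex` is `None`, which is why the helper in this notebook prints only the `Matched:` line in that case.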


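## Example 4: Using the escape character "\"
When a metacharacter should be treated as an ordinary character, put the escape character "\" in front of it.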

In [37]:
test_string = "Please call number (02)2882-5252."
regex = r'Please call number \([0-9]{2}\)[0-9]{4}-[0-9]{4}'  # "\(" matches a literal "(" and "\)" matches a literal ")"
RegexMatchingTest(regex, test_string)

Matched: Please call number (02)2882-5252
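
When the text to match is a fixed literal string, the escaping can be delegated to `re.escape` instead of being written by hand. A minimal sketch, reusing the phone number from this example:

```python
import re

text = "Please call number (02)2882-5252."
literal = "(02)2882-5252"

# re.escape adds the backslashes needed to treat every character literally
pattern = re.escape(literal)
result = re.search(pattern, text)
print(result.group())
```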


## Example 5: Matching Chinese characters

In [38]:
test_string = "Here are 中文字 and English"  # a sentence mixing Chinese and English
regex = '[\u4e00-\u9fa5]+'                  # Unicode range commonly used for Chinese characters: U+4E00 to U+9FA5
RegexMatchingTest(regex, test_string)

Matched: 中文字
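
The helper reports only the first match. To collect every run of Chinese characters in a mixed sentence, `re.findall` can be used with the same character range. The sample sentence below is made up for illustration.

```python
import re

mixed = "Here are 中文字 and English 以及 more text"

# Each maximal run of characters in U+4E00..U+9FA5 becomes one list item
chinese_runs = re.findall("[\u4e00-\u9fa5]+", mixed)
print(chinese_runs)
```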
