# Regular Expression
- Online regex tool: http://tool.oschina.net/regex/
- Regular expressions extract specific text according to pattern rules
- Regex Match Table: https://www.rexegg.com/regex-quickstart.html
- Python's built-in `re` module supports the full regex syntax and its operations.
- `re` Code References: https://docs.python.org/3/library/re.html
- Regex Web Crawling Application Example: https://blog.csdn.net/Eastmount/article/details/78275983
- Medium post: https://medium.com/towards-data-science/regular-expressions-regex-with-examples-in-python-and-pandas-461228335670
- Escape Match: http://c.biancheng.net/view/2176.html


### Match
- Given a string and a regular expression, `match` checks whether the expression matches the string.
- Matching always starts at the first character of the string; if the pattern does not match there, the result is `None`.

In [24]:
import re

In [25]:
content1='Hello 123 4567 World_This is a Regex Demo'
print(len(content1))

41


In [26]:
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}', content1)  # each token corresponds to an entry in the regex table
print(result)
print(result.group())
print(result.span())

<re.Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)


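`^`: anchors the match at the start of the string

`\s`: any whitespace character (space, tab, newline)

`\d`: any digit

`\d{4}`: exactly four digits, shorthand for `\d\d\d\d`

`\w{10}`: exactly ten word characters (letters, digits, or underscore)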

-----------
Code model: `re.match(pattern, string)`

Output: 
- `.group()`: returns the matched text
- `.span()`: returns the start and end index of the matched portion in the original string
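
Because `re.match` only tries the very start of the string, it returns `None` when the pattern first occurs later on. The sketch below illustrates this with a sample string; the variable names are chosen here for illustration only.

```python
import re

text = 'Hello 123 4567 World_This is a Regex Demo'

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match(r'Hello', text)
not_at_start = re.match(r'World', text)

print(at_start.group(), at_start.span())
print(not_at_start)

# Guard against None before calling .group() or .span().
if not_at_start is None:
    print('no match at the start of the string')
```

Checking the result for `None` first avoids an `AttributeError` when the pattern does not match.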


#### Target Matching
- Extract part of the content with a regex.
- Method: 
   - wrap the part of the pattern we want to capture in `()`
   - then call `group(1)` to get the first captured group, followed by `group(2)`, `group(3)`, and so on

In [27]:
content2 = 'Hello 1234567 World_This is a Regex Demo'
result1 = re.match(r'^Hello\s(\d+)\sWorld', content2)
print(result1)
print(result1.group())
print(result1.group(1))  # first captured group
print(result1.span())

<re.Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World
1234567
(0, 19)
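
With more than one pair of parentheses, each group gets its own index. The following sketch, which uses a pattern made up for this illustration, also shows `groups()` and a named group as optional conveniences.

```python
import re

text = 'Hello 1234567 World_This is a Regex Demo'

# Two capturing groups: the digits and the following word.
m = re.match(r'^Hello\s(\d+)\s(\w+)', text)
print(m.group(1))   # first group
print(m.group(2))   # second group
print(m.groups())   # all groups as a tuple
print(m.span(1))    # position of the first group only

# A named group can be read by name instead of by index.
named = re.match(r'^Hello\s(?P<number>\d+)', text)
print(named.group('number'))
```

`group(0)`, or `group()` with no argument, always refers to the whole match.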


#### Universal Matching
- `.` matches any single character except `\n`.
- `*` repeats the preceding element zero or more times, so `.*` matches any run of characters.
- Intuition: there is no need to spell out the pattern character by character.
- Code: use `.*` to skip the middle part, then finish with the end-of-string anchor `$`.

In [28]:
result2=re.match('^Hello.*Demo$',content1)
print(result2)
print(result2.group())
print(result2.span())

<re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo
(0, 41)
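
To see how `.` and `.*` differ, here is a small sketch on toy strings. The `.+` variant (one or more characters) is not used elsewhere in these notes and is included only for comparison.

```python
import re

# `.` needs exactly one character between 'a' and 'c'.
one_char = re.match(r'a.c', 'abc')
no_char = re.match(r'a.c', 'ac')

# `.*` accepts zero or more characters, so both strings match.
zero_or_more_empty = re.match(r'a.*c', 'ac')
zero_or_more_many = re.match(r'a.*c', 'abbbc')

# `.+` requires at least one character in between.
one_or_more = re.match(r'a.+c', 'ac')

print(one_char, no_char)
print(zero_or_more_empty, zero_or_more_many)
print(one_or_more)
```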


#### Greedy Match & Lazy Match
- Case: with `.*`, we sometimes get unwanted results

In [29]:
# Suppose we want to extract the number from the content
result3 = re.match(r'^He.*(\d+).*Demo$', content2)
print(result3)
print(result3.group(1))  # the group holds only 7 instead of 1234567

<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
7


In **Greedy Matching**, `.*` matches as many characters as possible.
  - in this case, the first `.*` matches `llo 123456`, leaving only `7` for `\d+`

Hence, we use **Lazy Matching** with `.*?`, which matches as few characters as possible.


In [30]:
result4 = re.match(r'^He.*?(\d+).*Demo$', content2)
print(result4)
print(result4.group(1))

<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567
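
One way to see where the characters go is to put the wildcard itself in a capturing group. This is a sketch for illustration; the extra parentheses are added here and are not part of the patterns above.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Capture what the wildcard consumes (group 1) next to the digits (group 2).
greedy = re.match(r'^He(.*)(\d+).*Demo$', content)
lazy = re.match(r'^He(.*?)(\d+).*Demo$', content)

print(greedy.groups())  # the greedy wildcard swallows most of the digits
print(lazy.groups())    # the lazy wildcard stops before the first digit
```

The greedy wildcard gives back only as much as `\d+` strictly needs (one digit), while the lazy wildcard stops as early as it can and leaves all digits to `\d+`.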


Check what happens if `.*?` is placed at the end of the pattern.

In [31]:
content3='http://weibo.com/comment/kEraCN'
result5=re.match('http.*?comment/(.*?)',content3)
result6=re.match('http.*?comment/(.*)',content3)
print('result5', result5.group(1))
print('result6', result6.group(1))

result5 
result6 kEraCN


As a consequence, a lazy group at the very end of the pattern captures nothing, because matching zero characters already satisfies it.
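
If a lazy group is still wanted at the end, one possible workaround is to anchor the pattern with `$` so the group is forced to expand up to the end of the string. A minimal sketch on the same kind of URL:

```python
import re

url = 'http://weibo.com/comment/kEraCN'

# Nothing follows the lazy group, so it is satisfied with zero characters.
lazy_end = re.match(r'http.*?comment/(.*?)', url)

# The `$` anchor forces the lazy group to grow until the end of the string.
lazy_anchored = re.match(r'http.*?comment/(.*?)$', url)

print(repr(lazy_end.group(1)))
print(repr(lazy_anchored.group(1)))
```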

#### Modifier
Optional flags (modifiers) that control how the pattern is matched.

In [32]:
content4 = '''Hello 1234567 World_This
is a Regex Demo
'''

In [33]:
# content4 contains a newline in this case
result7 = re.match(r'^He.*?(\d+).*?Demo$', content4)
print(result7.group(1))
# this raises an error because result7 is None

AttributeError: ignored

`.` cannot match the newline character, so `re.match` returns `None` and calling `.group(1)` raises the error. To fix it, pass `re.S` so that `.` also matches newlines.

In [34]:
result8 = re.match(r'^He.*?(\d+).*?Demo$', content4, re.S)
print(result8.group(1))

1234567


##### Other Modifiers
`re.I`: Performs case-insensitive matching.

`re.L`: Locale-aware matching. Interprets words according to the current locale, affecting the *alphabetic group* (`\w` and `\W`) as well as *word boundary behavior* (`\b` and `\B`). In Python 3 it can only be used with bytes patterns.

`re.M`: Makes `$` match the end of each *line* and makes `^` match the start of each line, not just of the whole string.

`re.U`: Interprets letters according to the Unicode character set. This is already the default for `str` patterns in Python 3.

`re.X`: Permits more readable regex syntax by ignoring whitespace in the pattern and treating `#` as a *comment marker*.
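
The following sketch tries three of these flags on small sample strings invented for this illustration. Flags can be combined with `|`.

```python
import re

# re.I: ignore case.
case_insensitive = re.match(r'hello', 'HELLO World', re.I)

# re.M: `^` also matches right after each newline.
text = 'one two\nthree four'
without_m = re.search(r'^three', text)
with_m = re.search(r'^three', text, re.M)

# Flags are combined with the bitwise OR operator.
combined = re.search(r'^THREE', text, re.I | re.M)

# re.X: whitespace and `#` comments inside the pattern are ignored.
verbose_date = r'''
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
'''
date_match = re.match(verbose_date, '2016-12-15', re.X)

print(case_insensitive.group())
print(without_m, with_m.group(), combined.group())
print(date_match.groups())
```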

### Escape Match
- Situation: we want `.` to match a literal `.` rather than act as a special character
- Method: put `\` in front of it; in a non-raw string literal the backslash itself must be written as `\\`, which is why raw strings (`r'...'`) are convenient

In [35]:
content5 = '(百度) www.baidu.com'
result9 = re.match(r'\(百度\) www\.baidu\.com', content5)
print(result9)

<re.Match object; span=(0, 18), match='(百度) www.baidu.com'>
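
When the literal text is long or comes from user input, escaping by hand is error-prone. As a sketch of an alternative, the standard-library helper `re.escape` builds the escaped pattern automatically; the test strings below are made up for illustration.

```python
import re

literal = '(百度) www.baidu.com'

# re.escape adds the backslashes needed to match the text literally.
escaped = re.escape(literal)
m = re.match(escaped, literal)
print(m.group())

# An unescaped `.` matches any character, which can give false positives.
unescaped_hit = re.match(r'www.baidu.com', 'wwwXbaiduYcom')
escaped_miss = re.match(r'www\.baidu\.com', 'wwwXbaiduYcom')
print(unescaped_hit is not None, escaped_miss)

# A raw string and a doubled backslash describe the same pattern.
same_pattern = (r'\.' == '\\.')
print(same_pattern)
```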


## Search
- `search` scans the whole content and returns the first successful match.
- In contrast, `match` fails whenever the pattern does not match at the very start of the content.
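
The difference can be checked side by side. This is a small sketch on a sample string with extra text before the part of interest:

```python
import re

content = 'Extra text Hello 1234567 World_This is a Regex Demo Extra text'
pattern = r'Hello.*?(\d+).*?Demo'

# match: the string does not start with 'Hello', so there is no result.
matched = re.match(pattern, content)

# search: scans forward until the pattern fits somewhere in the string.
found = re.search(pattern, content)

print(matched)
print(found.group(1), found.span())
```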

In [36]:
content6 = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result10 = re.search(r'Hello.*?(\d+).*?Demo', content6)
print(result10)

<re.Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>


### Work with HTML

In [37]:
html = '''<div id="songs-list">
<h2 class="title"> Classical Songs </h2>
<p class="introduction">
Classical Songs List
</p>
<ul id="list" class="list-group">
<li data-view="2"> You Know What? </li>
<li data-view="7">
<a href="/2.mp3" singer="Image Dragon"> Bones </a>
</li>
<li data-view="4" class="active">
<a href="/3.mp3" singer="Donna Lewis"> I could be the one </a>
</li>
<li data-view="6"><a href="/4.mp3" singer="salem ilese"> Crypto Boy(Explicit) </a></li>
<li data-view="5"><a href="/5.mp3" singer="Emma Bale"> Cut Loose </a></li>
<li data-view="5">
<a href="/6.mp3" singer="Tone Damli"> Stupid </a>
</li>
</ul>
</div>'''

To extract the singer and song names:
`<li.*?active.*?singer="(.*?)">(.*?)</a>`

In [38]:
result11 = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S) 
if result11:  
    print(result11.group(1), result11.group(2))

Donna Lewis  I could be the one 


In [39]:
# Without `active` in the pattern, the first <li> followed by a singer attribute is matched
result12 = re.search('<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result12:  
    print(result12.group(1), result12.group(2))

Image Dragon  Bones 


In [40]:
# Without re.S, `.` cannot cross newlines, so only an <li> with its link on the same line matches
result13 = re.search('<li.*?singer="(.*?)">(.*?)</a>', html)
if result13:  
    print(result13.group(1), result13.group(2))

salem ilese  Crypto Boy(Explicit) 


## findall
- Returns **all** non-overlapping matches of the regex in the content, not just the first one

In the HTML case, we want the resource link, singer name, and song name of every song, and we print them with a for loop.

In [41]:
result14 = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(result14)  
print(type(result14)) 

for result in result14:  
    print(result)  
    print(result[0], result[1], result[2])  # access each captured group by index

[('/2.mp3', 'Image Dragon', ' Bones '), ('/3.mp3', 'Donna Lewis', ' I could be the one '), ('/4.mp3', 'salem ilese', ' Crypto Boy(Explicit) '), ('/5.mp3', 'Emma Bale', ' Cut Loose '), ('/6.mp3', 'Tone Damli', ' Stupid ')]
<class 'list'>
('/2.mp3', 'Image Dragon', ' Bones ')
/2.mp3 Image Dragon  Bones 
('/3.mp3', 'Donna Lewis', ' I could be the one ')
/3.mp3 Donna Lewis  I could be the one 
('/4.mp3', 'salem ilese', ' Crypto Boy(Explicit) ')
/4.mp3 salem ilese  Crypto Boy(Explicit) 
('/5.mp3', 'Emma Bale', ' Cut Loose ')
/5.mp3 Emma Bale  Cut Loose 
('/6.mp3', 'Tone Damli', ' Stupid ')
/6.mp3 Tone Damli  Stupid 
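
`findall` returns plain tuples (or plain strings when the pattern has a single group). When match objects are preferred, `re.finditer` is an alternative. The sketch below uses a shortened HTML snippet written for this illustration.

```python
import re

snippet = '''<li><a href="/2.mp3" singer="Image Dragon"> Bones </a></li>
<li><a href="/3.mp3" singer="Donna Lewis"> I could be the one </a></li>'''

# With one capturing group, findall returns a list of strings.
singers = re.findall(r'singer="(.*?)"', snippet)
print(singers)

# finditer yields match objects, so group() and span() stay available.
pattern = r'<a href="(.*?)" singer="(.*?)">(.*?)</a>'
songs = [
    (m.group(1), m.group(2), m.group(3).strip())
    for m in re.finditer(pattern, snippet)
]
for link, singer, title in songs:
    print(link, singer, title)
```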


## sub
- used to replace the parts of the content that match a regex

For example, remove all the numbers in the content.
- `re.sub(pattern, replacement, content)`

In [42]:
content7 = '54aK54yr5oiR54ix5L2g'
content7 = re.sub(r'\d+', '', content7)
print(content7)

aKyroiRixLg
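
Beyond deleting text, the replacement can reuse captured groups, and the number of replacements can be limited or counted. A short sketch of these options, with sample values chosen for illustration:

```python
import re

s = '54aK54yr5oiR54ix5L2g'

# count=1 replaces only the first run of digits.
first_only = re.sub(r'\d+', '', s, count=1)

# subn also reports how many replacements were made.
cleaned, n_replaced = re.subn(r'\d+', '', s)

# Backreferences (\1, \2, \3) reuse captured groups in the replacement.
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2016-12-15 12:00')

print(first_only)
print(cleaned, n_replaced)
print(reordered)
```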


Using `sub`, we can also get only the song names in the HTML case.

In [50]:
html = re.sub('<a.*?>|</a>', '', html)  # strip the <a ...> and </a> tags, keeping only the text
print(html)

result15 = re.findall('<li.*?>(.*?)</li>', html, re.S)  # use findall() to get the song names
for result in result15:
    print(result.strip())

<div id="songs-list">
<h2 class="title"> Classical Songs </h2>
<p class="introduction">
Classical Songs List
</p>
<ul id="list" class="list-group">
<li data-view="2"> You Know What? </li>
<li data-view="7">
 Bones 
</li>
<li data-view="4" class="active">
 I could be the one 
</li>
<li data-view="6"> Crypto Boy(Explicit) </li>
<li data-view="5"> Cut Loose </li>
<li data-view="5">
 Stupid 
</li>
</ul>
</div>
You Know What?
Bones
I could be the one
Crypto Boy(Explicit)
Cut Loose
Stupid


## compile
Compile a regular expression pattern into a pattern object, which can be reused for matching multiple times.

In [51]:
content8 = '2016-12-15 12:00'
content9 = '2016-12-17 12:55'
content10 = '2016-12-22 13:21'
# remove the clock time xx:xx with compile and sub
pattern = re.compile(r'\d{2}:\d{2}')
result16 = re.sub(pattern, '', content8)
result17 = re.sub(pattern, '', content9)
result18 = re.sub(pattern, '', content10)
print(result16, result17, result18)

2016-12-15  2016-12-17  2016-12-22 
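
A compiled pattern object also exposes the matching functions as methods, so the pattern does not have to be passed in each time. A minimal sketch with the same timestamps; the names `time_pattern`, `dates`, and `times` are chosen here for illustration:

```python
import re

time_pattern = re.compile(r'\d{2}:\d{2}')
stamps = ['2016-12-15 12:00', '2016-12-17 12:55', '2016-12-22 13:21']

# pattern.sub(...) is equivalent to re.sub(pattern, ...).
dates = [time_pattern.sub('', s).strip() for s in stamps]

# The same object can be reused for searching.
times = [time_pattern.search(s).group() for s in stamps]

print(dates)
print(times)
```

Calling `.strip()` also removes the trailing space that is left behind once the clock time is deleted.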
