# 🥨 Regex
  - Author: Nana Yang 
  - Date (Created): 2022/06/21
  - Tutorial Link: https://www.youtube.com/watch?v=AEE9ecgLgdQ
  - Written Tutorial: https://www.python-engineer.com/posts/regular-expressions/


In [None]:
import re

## Simple Exact Match
Functions for matching:

* `re.match(pattern, string)`: return a match only if the pattern matches at the beginning of the string, otherwise `None`
* `re.search(pattern, string)`: scan through the string and return the first match found at any position, otherwise `None`
* `re.findall(pattern, string)`: find all non-overlapping matches and return them as a list of strings
* `re.finditer(pattern, string)`: like `re.findall()`, but return the matches as an iterator of match objects
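To make the difference between the four functions concrete, here is a minimal sketch run on the same test string used in this section. The variable names (`text`, `first`, `all_texts`, `all_spans`) are only illustrative.

```python
import re

text = 'abc12345abc678ABC'

# re.match only tries position 0, so a pattern that occurs later fails.
at_start = re.match(r'abc', text)      # a Match object for the 'abc' at index 0
not_at_start = re.match(r'123', text)  # None: '123' is not at the start

# re.search scans forward and stops at the first hit.
first = re.search(r'123', text)
print(first.span())

# re.findall returns plain strings; re.finditer yields Match objects.
all_texts = re.findall(r'abc', text)
all_spans = [m.span() for m in re.finditer(r'abc', text)]
print(all_texts)
print(all_spans)
```

A rule of thumb: use `findall` when only the matched text is needed, and `finditer` when positions or groups are needed as well.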





In [None]:
string = 'abc12345abc678ABC'
pattern = r'abc' # r means raw-string
matches = re.finditer(pattern, string)
for m in matches: 
  print(m)

<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(8, 11), match='abc'>


In [None]:
## another way that yields the same result: compile the pattern first
pattern = re.compile(pattern)
matches = pattern.finditer(string)
for m in matches: 
  print(m) # the same Match objects as re.finditer(pattern, string)

<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(8, 11), match='abc'>


### How to get the text and index?

In [None]:
## getting match location/index
string = 'abc12345abc678ABC'
pattern = r'abc' # r means raw-string
matches = re.finditer(pattern, string)
for idx, m in enumerate(matches):
  print(f'text {idx}: {m.group()}, \n \
  start at: {m.start()}, \n \
  end at: {m.end()}, \n \
  span: {m.span()}')

text 0: abc, 
   start at: 0, 
   end at: 3, 
   span: (0, 3)
text 1: abc, 
   start at: 8, 
   end at: 11, 
   span: (8, 11)


## Meta/Special Characters

A handful of characters (metacharacters) have a special meaning inside a pattern; to match one of them literally, escape it with `\`.

### Metas
- `.` Any character (except newline character), e.g. "he..o"
- `^` Starts with, e.g. "^hello"
- `$` Ends with, e.g. "world$"
- `*` Zero or more occurrences, e.g. "aix*"
- `+` One or more occurrences, e.g. "aix+"
- `?` Zero or one occurrence, e.g. "colou?r"
- `{}` Exactly the specified number of occurrences, e.g. "al{2}" (or a range, e.g. "al{2,4}")
- `[]` A set of characters, e.g. "[a-m]"
- `\` Signals a special sequence (can also be used to escape special characters), e.g. "\d"
- `|` Either or, e.g. "falls|stays"
- `()` Capture and group
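As a quick illustration of the anchors, the dot, and alternation from the list above, here is a small sketch. The sample strings are made up for this example.

```python
import re

samples = ['hello world', 'say hello', 'heLLo', 'world']

# '^hello' : the string starts with 'hello'
starts = [s for s in samples if re.search(r'^hello', s)]
# 'world$' : the string ends with 'world'
ends = [s for s in samples if re.search(r'world$', s)]
# 'he..o'  : 'he', any two characters, then 'o'
dots = [s for s in samples if re.search(r'he..o', s)]
# 'say|world' : either alternative, anywhere in the string
either = [s for s in samples if re.search(r'say|world', s)]

print(starts)
print(ends)
print(dots)
print(either)
```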


In [None]:
string = 'meowhahahahahahahahahaaa'
print('test string length:', len(string)) 

test string length: 24


In [None]:
# exactly 4 repetitions of 'ha'; re.search returns the first (leftmost) match
pattern = r'(ha){4}' 
re.search(pattern, string)

<re.Match object; span=(4, 12), match='hahahaha'>

In [None]:
pattern = r'(ha)*' # zero repetitions are allowed, so the empty match at index 0 is returned
re.search(pattern, string)

<re.Match object; span=(0, 0), match=''>

In [None]:
pattern = r'(ha)+' # at least one 'ha' is required, so the match starts at index 4 and greedily takes all nine 'ha'
re.search(pattern, string)

<re.Match object; span=(4, 22), match='hahahahahahahahaha'>
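The different results of `*` and `+` follow from two rules: `re.search` returns the leftmost position where the pattern can succeed, and quantifiers are greedy. A minimal sketch on a shorter, made-up string:

```python
import re

text = 'meowhahaha'

# '(ha)*' accepts zero repetitions, so it succeeds immediately at index 0
# with an empty match; the 'm' is simply not consumed.
star = re.search(r'(ha)*', text)

# '(ha)+' needs at least one 'ha', so the search moves on to index 4
# and then consumes every consecutive 'ha' it can.
plus = re.search(r'(ha)+', text)

print(star.span(), repr(star.group()))
print(plus.span(), repr(plus.group()))
```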

### Special Sequences 
- Decimals 
  - `\d`: Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
  - `\D`: Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].
- Whitespaces 
  - `\s`: Matches any whitespace character (space, tab, newline, ...)
  - `\S`: Matches any non-whitespace character
- Words
  - `\w`: Matches any alphanumeric (word) character; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
  - `\W`: Matches any non-alphanumeric character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].
- Beginnings or Ends
  - `\b`: Matches at a word boundary, i.e. where the specified characters are at the beginning or at the end of a word, e.g. r"\bain" or r"ain\b"
  - `\B`: Matches where the specified characters are present, but NOT at the beginning (or NOT at the end) of a word, e.g. r"\Bain" or r"ain\B"
  - `\A`: Matches if the specified characters are at the beginning of the string, e.g. "\AThe"
  - `\Z`: Matches if the specified characters are at the end of the string, e.g. "Spain\Z"
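The example patterns quoted in the list can be tried on a short sentence. This is only a sketch, and the sentence is chosen for illustration.

```python
import re

text = 'The rain in Spain'

starts_with_ain = re.findall(r'\bain', text)   # no word starts with 'ain'
ends_with_ain = re.findall(r'ain\b', text)     # 'rain' and 'Spain' end with 'ain'
begins = re.search(r'\AThe', text) is not None
finishes = re.search(r'Spain\Z', text) is not None
chunks = re.findall(r'\S+', text)              # runs of non-whitespace characters
non_word = re.findall(r'\W', text)             # here: the three spaces

print(starts_with_ain, ends_with_ain)
print(begins, finishes)
print(chunks)
print(non_word)
```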

In [None]:
string = 'hello hey heywow hohey 123_ wowwhey'
print(len(string))

35


In [None]:
# match at the beginning of a word
pattern = r'\bhey'
matches = re.finditer(pattern, string)
for m in matches:
  print(m)

<re.Match object; span=(6, 9), match='hey'>
<re.Match object; span=(10, 13), match='hey'>


In [None]:
## match at the end of a word
pattern = r'hey\b'
matches = re.finditer(pattern, string)
for m in matches:
  print(m)
# The standalone word 'hey' at span (6, 9) is both the beginning and the end of a word,
# so it is matched by `\bhey` as well as by `hey\b`.

<re.Match object; span=(6, 9), match='hey'>
<re.Match object; span=(19, 22), match='hey'>
<re.Match object; span=(32, 35), match='hey'>


In [None]:
pattern = r'\Bhey' # 'hey' that is NOT at the beginning of a word: the matches of r'hey\b' minus the standalone 'hey'
matches = re.finditer(pattern, string)
for m in matches:
  print(m)
print('===============')
pattern = r'hey\B' # 'hey' that is NOT at the end of a word: the matches of r'\bhey' minus the standalone 'hey'
matches = re.finditer(pattern, string)
for m in matches:
  print(m)

<re.Match object; span=(19, 22), match='hey'>
<re.Match object; span=(32, 35), match='hey'>
===============
<re.Match object; span=(10, 13), match='hey'>


In [None]:
pattern = r'\b\s+' # split on runs of whitespace that follow a word boundary, so the words come back without trailing whitespace
re.split(pattern, string)

['hello', 'hey', 'heywow', 'hohey', '123_', 'wowwhey']

In [None]:
pattern = r'\d'
matches = re.finditer(pattern, string)
for m in matches:
  print(m)

<re.Match object; span=(23, 24), match='1'>
<re.Match object; span=(24, 25), match='2'>
<re.Match object; span=(25, 26), match='3'>


## Sets `[]`

In [None]:
## another way to capture the digits
# -> define a set; any single character in the set is matched

string = 'hello hey HELLO heywow hohey 123_ wowwhey'
pattern = r'[0-9]'
matches = re.finditer(pattern, string)
for m in matches:
  print(m)

<re.Match object; span=(29, 30), match='1'>
<re.Match object; span=(30, 31), match='2'>
<re.Match object; span=(31, 32), match='3'>


In [None]:
# alpha sets (case-sensitive) [a-z], [A-Z]
pattern = r'[A-Z]'
matches = re.finditer(pattern, string)
for m in matches:
  print(m)

<re.Match object; span=(10, 11), match='H'>
<re.Match object; span=(11, 12), match='E'>
<re.Match object; span=(12, 13), match='L'>
<re.Match object; span=(13, 14), match='L'>
<re.Match object; span=(14, 15), match='O'>


In [None]:
pattern = r'[A-L]' # Upper Case alpha set containing only A, B, C, D, ..., J, K, L (not M, N, O, ...Z)
matches = re.finditer(pattern, string)
for m in matches:
  print(m)
# no 'O' captured

<re.Match object; span=(10, 11), match='H'>
<re.Match object; span=(11, 12), match='E'>
<re.Match object; span=(12, 13), match='L'>
<re.Match object; span=(13, 14), match='L'>


### Parsing dates

In [None]:
dates = '''
01.04.2020

01 04 2020

2020.04.01

2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11
2020-08-18

2020/04/02

2020_04_04
2020_04_04
'''

In [None]:
print('All dates with a character in between')
print('following {year} {month} {date} or {year} {date} {month}')

pattern = r'\d\d\d\d.\d\d.\d\d'
matches = re.finditer(pattern, dates)
for m in matches:
  print(m)

All dates with a character in between
following {year} {month} {date} or {year} {date} {month}
<re.Match object; span=(25, 35), match='2020.04.01'>
<re.Match object; span=(37, 47), match='2020-04-01'>
<re.Match object; span=(48, 58), match='2020-05-23'>
<re.Match object; span=(59, 69), match='2020-06-11'>
<re.Match object; span=(70, 80), match='2020-07-11'>
<re.Match object; span=(81, 91), match='2020-08-11'>
<re.Match object; span=(92, 102), match='2020-08-18'>
<re.Match object; span=(104, 114), match='2020/04/02'>
<re.Match object; span=(116, 126), match='2020_04_04'>
<re.Match object; span=(127, 137), match='2020_04_04'>


In [None]:
print('All dates with `-` in between')
print('following {year}-{month}-{date} or {year}-{date}-{month}')
pattern = r'\d\d\d\d-\d\d-\d\d'
matches = re.finditer(pattern, dates)
for m in matches:
  print(m)

All dates with `-` in between
following {year}-{month}-{date} or {year}-{date}-{month}
<re.Match object; span=(37, 47), match='2020-04-01'>
<re.Match object; span=(48, 58), match='2020-05-23'>
<re.Match object; span=(59, 69), match='2020-06-11'>
<re.Match object; span=(70, 80), match='2020-07-11'>
<re.Match object; span=(81, 91), match='2020-08-11'>
<re.Match object; span=(92, 102), match='2020-08-18'>


In [None]:
from datetime import datetime

In [None]:
print('All dates with `.` in between')
## Use \. instead of . to escape the special character and match a literal dot ###
print('following {year}.{month}.{date} or {year}.{date}.{month}')
pattern = r'\d\d\d\d\.\d\d\.\d\d' 
matches = re.finditer(pattern, dates)
for m in matches:
  print(m)
  text = m.group()
  datetime_object = datetime.strptime(text, '%Y.%m.%d')
  print(datetime_object)

All dates with `.` in between
following {year}.{month}.{date} or {year}.{date}.{month}
<re.Match object; span=(25, 35), match='2020.04.01'>
2020-04-01 00:00:00


## Quantifiers `{}`
```
* : 0 or more
+ : 1 or more
? : 0 or 1, used when a character can be optional
{4} : exact number
{4,6} : range numbers (min, max)

```
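Each quantifier in the table can be seen on one made-up test string. The words and numbers below are purely illustrative.

```python
import re

text = 'color colour colouur 2020 12345678'

optional_u = re.findall(r'colou?r', text)      # 'u' may appear 0 or 1 time
any_u = re.findall(r'colou*r', text)           # 'u' may appear 0 or more times
some_u = re.findall(r'colou+r', text)          # 'u' must appear at least once
four_digits = re.findall(r'\b\d{4}\b', text)   # exactly four digits as a whole word
four_to_six = re.findall(r'\d{4,6}', text)     # greedy: takes up to six digits

print(optional_u)
print(any_u)
print(some_u)
print(four_digits)
print(four_to_six)
```

Note that `\d{4,6}` takes `123456` out of `12345678` and leaves `78` unmatched, because two digits are fewer than the minimum of four.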

In [None]:
# inside a character set, a dash between two characters specifies a range; at the start or end of the set it is a literal dash
print('Dates with - or . in between')
pattern = r'\d{4}[-.]\d{2}[-.]\d{2}'
matches = re.finditer(pattern, dates)
for match in matches:
    print(match)
print('========')
print('within those dates, I want the dates to be within 10 - 25')
print('Dates with - or . in between')
# (1[0-9]|2[0-5]) -> match a day from 10 to 19 or from 20 to 25, captured as a group 
pattern = r'\d{4}[-.]\d{2}[-.](1[0-9]|2[0-5])'
matches = re.finditer(pattern, dates)
for match in matches:
    print(match)

Dates with - or . in between
<re.Match object; span=(25, 35), match='2020.04.01'>
<re.Match object; span=(37, 47), match='2020-04-01'>
<re.Match object; span=(48, 58), match='2020-05-23'>
<re.Match object; span=(59, 69), match='2020-06-11'>
<re.Match object; span=(70, 80), match='2020-07-11'>
<re.Match object; span=(81, 91), match='2020-08-11'>
<re.Match object; span=(92, 102), match='2020-08-18'>
========
within those dates, I want the dates to be within 10 - 25
Dates with - or . in between
<re.Match object; span=(48, 58), match='2020-05-23'>
<re.Match object; span=(59, 69), match='2020-06-11'>
<re.Match object; span=(70, 80), match='2020-07-11'>
<re.Match object; span=(81, 91), match='2020-08-11'>
<re.Match object; span=(92, 102), match='2020-08-18'>
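Building on the `[-.]` set, the matched text can be handed to `datetime.strptime` once the separator is normalised. The sketch below assumes every match is in year, month, day order. The `sample` string and the `parsed` list are illustrative names, not part of the original notebook.

```python
import re
from datetime import datetime

sample = """
2020.04.01
2020-05-23
2020/04/02
"""

pattern = re.compile(r'\d{4}[-.]\d{2}[-.]\d{2}')

parsed = []
for m in pattern.finditer(sample):
    # Normalise '.' to '-' so that a single strptime format handles both styles.
    normalised = m.group().replace('.', '-')
    # Assumption: the order is year-month-day.
    parsed.append(datetime.strptime(normalised, '%Y-%m-%d').date())

print(parsed)
```

The date written with `/` is skipped because `/` is not in the set.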


In [None]:
string = 'pro--------gram--fi---les'
pattern = r'-{1,2}'
re.sub(pattern, '*', string)
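# {1,2} is greedy, so each match takes 2 dashes whenever it can:
# the run of 8 dashes is matched as 4 pairs and replaced with 4 *
# the run of 2 dashes is matched as 1 pair and replaced with 1 *
# the run of 3 dashes is matched as 1 pair plus 1 single dash and replaced with 2 *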

'pro****gram*fi**les'

## Conditions 
Use `name` and `email` examples to illustrate optional parts (`?`) and alternatives (`|`). 


In [None]:
names = """
Mr 
Mr Simpson
Mrs Simpson
Mr. Brown
Ms Smith
Mr.. Simpson
Mr. T
"""

In [None]:
# pattern = r'Mr\s\w+'  # \s matches 1 whitespace character, \w+ matches 1 or more word characters
pattern = r'Mr\s\w*'  # \s matches 1 whitespace character, \w* matches 0 or more word characters
matches = re.finditer(pattern, names)
for m in matches:
  print(m)

<re.Match object; span=(1, 4), match='Mr '>
<re.Match object; span=(5, 15), match='Mr Simpson'>


In [None]:
pattern = r'Mr\.?\s\w+' 
# match an actual `Mr.` (the dot is a special character, so it is escaped)
# ? means that `\.` (actual dot) can appear 0 or 1 time
# then match one whitespace character and 1 or more word characters
matches = re.finditer(pattern, names)
for m in matches:
  print(m)

<re.Match object; span=(5, 15), match='Mr Simpson'>
<re.Match object; span=(28, 37), match='Mr. Brown'>
<re.Match object; span=(60, 65), match='Mr. T'>


In [None]:
pattern = r'Mr\.*\s\w+' 
# same as the previous pattern, except that
# * means that `\.` (actual dot) can appear 0 or more times 
# the result has 1 match with 0 dots,
# 2 matches with 1 dot and 1 match with 2 dots. 
matches = re.finditer(pattern, names)
for m in matches:
  print(m)

<re.Match object; span=(5, 15), match='Mr Simpson'>
<re.Match object; span=(28, 37), match='Mr. Brown'>
<re.Match object; span=(47, 59), match='Mr.. Simpson'>
<re.Match object; span=(60, 65), match='Mr. T'>


Define a properly formatted name as a title, written either as
  - Mr. Mrs. Ms. 
  or
  - Mr Mrs Ms 

  followed by exactly one whitespace character, and then the surname (English letters only). 

In [None]:
# the pattern that finds all proper names: 
pattern = r'(Mr|Mrs|Ms)\.?\s\w+' 
matches = re.finditer(pattern, names)
for m in matches:
  print(m)

<re.Match object; span=(5, 15), match='Mr Simpson'>
<re.Match object; span=(16, 27), match='Mrs Simpson'>
<re.Match object; span=(28, 37), match='Mr. Brown'>
<re.Match object; span=(38, 46), match='Ms Smith'>
<re.Match object; span=(60, 65), match='Mr. T'>
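The definition asks for English letters only and exactly one whitespace character, whereas `\w` also accepts digits and `_`, and `\s` also accepts tabs and newlines. A stricter sketch, assuming a literal space and the set `[A-Za-z]` are what is wanted, can validate whole strings with `fullmatch`. The `candidates` list is made up for illustration.

```python
import re

candidates = ['Mr Simpson', 'Mrs. Simpson', 'Mr.. Simpson', 'Ms Smith2', 'Mr  Brown', 'Mr']

# Title, optional single dot, exactly one space, then letters only.
proper = re.compile(r'(Mr|Mrs|Ms)\.? [A-Za-z]+')

# fullmatch requires the whole string to fit the pattern, not just a part of it.
valid = [name for name in candidates if proper.fullmatch(name)]
print(valid)
```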


In [None]:
emails = """
pythonengineer@gmail.com
Python-engineer@gmx.de
python-engineer123@my-domain.org
b06102020@ntu.edu.tw
"""

In [None]:
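# the set below has no `-`, so for a hyphenated prefix only the part after the last hyphen is matched
prefix = r'[a-zA-Z0-9]+@'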
matches = re.finditer(prefix, emails)
for m in matches:
  print(m)

<re.Match object; span=(1, 16), match='pythonengineer@'>
<re.Match object; span=(33, 42), match='engineer@'>
<re.Match object; span=(56, 68), match='engineer123@'>
<re.Match object; span=(82, 92), match='b06102020@'>


In [None]:
# get prefix and the domain name of the email 
# domain is in between @ and the first . (actual dot)
# domain element set: [a-zA-Z-]+ ( + means 1 or more )
prefix_domain = r'[a-zA-Z0-9]+@[a-zA-Z-]+\.'
matches = re.finditer(prefix_domain, emails)
for m in matches:
  print(m)

<re.Match object; span=(1, 22), match='pythonengineer@gmail.'>
<re.Match object; span=(33, 46), match='engineer@gmx.'>
<re.Match object; span=(56, 78), match='engineer123@my-domain.'>
<re.Match object; span=(82, 96), match='b06102020@ntu.'>


## Grouping `()`
`()` captures a group, so we can get the substring in the match directly. 

In [None]:
pieces = """
Elisa uses 2 email formats. The most common Elisa email format is first '.' last 
(ex. jane.doe@elisa.com) 
being used 100.0% of the time. 
Other common formats are first last (ex. janedoe@elisa.com).
"""

In [None]:
### the full email checking pattern 
# before @, [a-zA-Z0-9.]+ matches the prefix of the email (letters, digits and dots)
# ([a-zA-Z-]+\.?) matches things like `ntu.`, `edu.`, `tw` 
# and it can repeat 1 or more times (+)
email_check = r'[a-zA-Z0-9.]+@([a-zA-Z-]+\.?)+'
matches = re.finditer(email_check, emails)
for m in matches:
  print(m)

<re.Match object; span=(1, 25), match='pythonengineer@gmail.com'>
<re.Match object; span=(33, 48), match='engineer@gmx.de'>
<re.Match object; span=(56, 81), match='engineer123@my-domain.org'>
<re.Match object; span=(82, 102), match='b06102020@ntu.edu.tw'>
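The matches above drop `Python-` and `python-` because the prefix set contains no hyphen. A possible variant, assuming `-` and `_` should also be allowed before the `@`, only widens that set and keeps the domain part unchanged. This is a sketch, not a full email validator.

```python
import re

emails = """
pythonengineer@gmail.com
Python-engineer@gmx.de
python-engineer123@my-domain.org
b06102020@ntu.edu.tw
"""

# Assumption: prefixes may also contain '_' and '-'.
# The hyphen is placed last in the set so that it is read literally.
email_check = r'[a-zA-Z0-9._-]+@([a-zA-Z-]+\.?)+'

found = [m.group() for m in re.finditer(email_check, emails)]
print(found)
```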


In [None]:
email_check = r'([a-zA-Z0-9.]+)@([a-zA-Z-]+\.?)+'
# group 0: the full match 
# group 1: ([a-zA-Z0-9.]+), the prefix before @
# group 2: ([a-zA-Z-]+\.?), repeated by +, so only its last repetition (e.g. `com`) is kept
# (g1)@(g2)

In [None]:
matches = re.finditer(email_check, pieces)
for m in matches:
  print(m.group())
  print(m.groups()) 
  print('=======')

jane.doe@elisa.com
('jane.doe', 'com')
=======
janedoe@elisa.com
('janedoe', 'com')
=======
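The second group holds only `com` because a repeated group keeps just its last repetition. One way to capture the whole domain, shown here as a sketch, is to wrap the repetition in an extra pair of parentheses. Groups are numbered by their opening parenthesis, so the wrapper becomes group 2 and the repeated part becomes group 3. The `text` string is made up for illustration.

```python
import re

text = 'Contact jane.doe@elisa.com or b06102020@ntu.edu.tw'

# group 1: prefix, group 2: whole domain, group 3: last repetition inside the domain
pattern = r'([a-zA-Z0-9.]+)@(([a-zA-Z-]+\.?)+)'

rows = [m.groups() for m in re.finditer(pattern, text)]
for prefix, domain, last_part in rows:
    print(prefix, domain, last_part)
```

Individual groups are also available by index on a match object, e.g. `m.group(1)` for the prefix and `m.group(0)` for the full match.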


## Modification
Use a `url` example to illustrate.

In [None]:
urls = """
My colab is 
https://colab.research.google.com/drive/1ajhs_-6SuSLW0KzNz-J60WJUN9flGucc?authuser=3#scrollTo=-hkYC-j7M1f7
and the youtube tutorial link is
https://www.youtube.com/watch?v=AEE9ecgLgdQ
My recent favorite song is  
https://www.youtube.com/watch?v=Tt9hwaM4d4A
"""

Capture URL links 

`pattern = r'https?://(www\.)?([0-9a-zA-Z-._#?=/]+)'`
- `()` is a capturing group
- first capturing group: `(www\.)?` captures `www.` 0 or 1 time (`?` makes the group optional)
- `([0-9a-zA-Z-._#?=/]+)` is the second capturing group where:
  - `[0-9a-zA-Z-._#?=/]` is the set of characters allowed in the rest of the URL
  - `[]+` means a character from the set can appear 1 or more times
  - `([]+)` makes the whole part a capturing group

In [None]:
pattern = r'https?://(www\.)?([0-9a-zA-Z-._#?=/]+)' 
matches = re.finditer(pattern, urls)
for m in matches:
  print(m.group())
  print(m.groups()) 
  print('=======')

https://colab.research.google.com/drive/1ajhs_-6SuSLW0KzNz-J60WJUN9flGucc?authuser=3#scrollTo=-hkYC-j7M1f7
(None, 'colab.research.google.com/drive/1ajhs_-6SuSLW0KzNz-J60WJUN9flGucc?authuser=3#scrollTo=-hkYC-j7M1f7')
=======
https://www.youtube.com/watch?v=AEE9ecgLgdQ
('www.', 'youtube.com/watch?v=AEE9ecgLgdQ')
=======
https://www.youtube.com/watch?v=Tt9hwaM4d4A
('www.', 'youtube.com/watch?v=Tt9hwaM4d4A')
=======


In [None]:
pattern = re.compile(pattern)
suburls = pattern.sub(r'\2', urls) # replace each matched URL with the text of its capturing group 2
suburls = suburls.split('\n')
suburls

['',
 'My colab is ',
 'colab.research.google.com/drive/1ajhs_-6SuSLW0KzNz-J60WJUN9flGucc?authuser=3#scrollTo=-hkYC-j7M1f7',
 'and the youtube tutorial link is',
 'youtube.com/watch?v=AEE9ecgLgdQ',
 'My recent favorite song is  ',
 'youtube.com/watch?v=Tt9hwaM4d4A',
 '']
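Two related tools may be useful when substituting with groups: `\g<2>` is an unambiguous spelling of `\2` that can be mixed with other replacement text, and `subn` returns the number of replacements alongside the new string. A small sketch on a one-line, made-up input:

```python
import re

text = 'See https://www.youtube.com/watch?v=AEE9ecgLgdQ for details'
pattern = re.compile(r'https?://(www\.)?([0-9a-zA-Z-._#?=/]+)')

# Wrap group 2 in angle brackets; \g<2> refers to the second capturing group.
tagged = pattern.sub(r'<\g<2>>', text)

# subn works like sub but also reports how many replacements were made.
stripped, n = pattern.subn(r'\2', text)

print(tagged)
print(stripped, n)
```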

## Compilation Flags 
- ASCII, A : Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
- DOTALL, S : Makes . match any character, including newlines.
- IGNORECASE, I : Does case-insensitive matching.
- LOCALE, L : Does a locale-aware match.
- MULTILINE, M : Multi-line matching, affecting ^ and $ (they also match at the start and end of each line).
- VERBOSE, X (for ‘extended’) : Enables verbose REs, which can be organized more cleanly and understandably.
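The effect of three of these flags can be seen side by side in a short sketch. The two-line `text` and the variable names are illustrative only.

```python
import re

text = 'first line\nsecond line'

# Without MULTILINE, ^ anchors only at the very start of the string.
single = re.findall(r'^\w+', text)
multi = re.findall(r'^\w+', text, re.MULTILINE)

# Without DOTALL, '.' stops at the newline, so the pattern cannot span both lines.
no_dotall = re.search(r'first.*second', text)
with_dotall = re.search(r'first.*second', text, re.DOTALL)

# VERBOSE ignores whitespace in the pattern and allows comments inside it.
date = re.compile(r'''
    \d{4}   # year
    [-.]    # separator
    \d{2}   # month
    [-.]    # separator
    \d{2}   # day
''', re.VERBOSE)
dates_found = date.findall('2020-04-01 and 2020.05.23')

print(single, multi)
print(no_dotall, with_dotall is not None)
print(dates_found)
```

Flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.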

In [None]:
my_string = "Hello World"
pattern = re.compile(r'world', re.IGNORECASE) # No match without I flag
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 11), match='World'>
