# Introduction to Regular Expressions (`regex`)

* **Pattern** - A text pattern of interest expressed in Regular Expression Language <br> Example: `\b\d+\b` matches a word made up of one or more decimal digits.

* **Text** – String in which to look for a match with a given pattern

* **Regex Engine** - Regular Expression Engine that does the actual work

* **Regex Module** – Python module for interacting with the Regex Engine. In Python this is the built-in `re` module.

# Resources

* <a href="https://regex101.com/"> regex101 </a>
* <a href="https://docs.python.org/3/library/re.html"> docs.python.org </a>
* <a href="https://www.regular-expressions.info/brackets.html"> www.regular-expressions.info </a>

## Python `regex` methods

Some common methods:
* `re.match()` - Finds a match only at the start of the text.
* `re.fullmatch()` - Matches only if the entire text matches the pattern.
* `re.search()` - Finds the first match anywhere in the text.
* `re.findall()` - Finds all non-overlapping matches and returns them as a list.
    * This method returns only after scanning the entire text, so it can take a long time on a long text.
* `re.finditer()` - Returns an iterator that yields match objects one at a time.
* `re.sub()` - Returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the string with the replacement `repl`.
* `re.split()` - Splits the string at each occurrence of the pattern.

For a complete list: https://docs.python.org/3/library/re.html#module-contents
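
`re.fullmatch()` and `re.split()` are listed above but not demonstrated in the examples further down, so here is a minimal sketch that contrasts them with `re.match()` and `re.search()`. The sample text and variable names are made up for illustration.

```python
import re

text = "2023-01-15 release"            # illustrative sample text
date_pattern = r"\d{4}-\d{2}-\d{2}"

# match: succeeds because the text *starts* with a date
starts_with_date = re.match(date_pattern, text) is not None

# fullmatch: fails because the text contains more than just the date
is_only_date = re.fullmatch(date_pattern, text) is not None

# search: finds the first run of lowercase letters anywhere in the text
word = re.search(r"[a-z]+", text)

# split: cut the text at every run of hyphens or whitespace
parts = re.split(r"[-\s]+", text)

print(starts_with_date)             # True
print(is_only_date)                 # False
print(word.group(0), word.start())  # release 11
print(parts)                        # ['2023', '01', '15', 'release']
```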








## `regex` pattern

### The raw-string

In [5]:
# Regex patterns are usually written as raw strings (r"..."), so backslashes reach the regex engine unchanged
pattern = r"\d+"
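
To see why the `r` prefix matters, the sketch below writes the same pattern with and without it. In an ordinary string Python turns `\b` into a backspace character before the regex engine ever sees it. The sample text is an assumption chosen for illustration.

```python
import re

plain = "\bcat\b"   # ordinary string: "\b" becomes a backspace character
raw = r"\bcat\b"    # raw string: backslash + "b" is kept, i.e. a word boundary

print(len(plain), len(raw))            # 5 7
print(re.findall(plain, "a cat sat"))  # []
print(re.findall(raw, "a cat sat"))    # ['cat']
```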

### Single char patterns

In [72]:
pattern = r"very" # A literal string matches itself
pattern = r"[ry]" # A set [...] matches any one of the listed characters (r OR y)
pattern = r"[a-dx-z0-9]" # Ranges inside a set: a-d or x-z or 0-9
pattern = r"bod[yies]*" # Combine a literal with a set to find variants of a word (body, bodies)
pattern = r"[^aieou]" # A leading ^ negates the set: any character except these vowels
pattern = r"." # Dot (.) is a wildcard and matches any character except newline \n
pattern = r"\." # Use \ to escape special characters (here: a literal dot)
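
The patterns above can be tried out with `re.findall()`. This is a small sketch; the sample strings and variable names are invented for illustration.

```python
import re

# Literal: matches the exact sequence of characters
literal = re.findall(r"very", "very busy body")

# Set: matches any single character from the set
either = re.findall(r"[ry]", "very busy body")

# Ranges inside a set
ranges = re.findall(r"[a-dx-z0-9]", "cab xyz 42 mno")

# Literal followed by zero or more characters from a set
variants = re.findall(r"bod[yies]*", "body bodies bod")

# Negated set: every character that is not one of these vowels
consonants = re.findall(r"[^aieou]", "body")

# Wildcard dot versus escaped (literal) dot
any_char = re.findall(r".", "a.b")
only_dot = re.findall(r"\.", "a.b")

print(literal)     # ['very']
print(either)      # ['r', 'y', 'y', 'y']
print(ranges)      # ['c', 'a', 'b', 'x', 'y', 'z', '4', '2']
print(variants)    # ['body', 'bodies', 'bod']
print(consonants)  # ['b', 'd', 'y']
print(any_char)    # ['a', '.', 'b']
print(only_dot)    # ['.']
```

Note that `bod[yies]*` is loose: it also accepts `bod` on its own, because the set may repeat zero times.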

### `Character Classes`

In [None]:
pattern = r"\w" # Word character: [0-9_a-zA-Z] plus Unicode word characters
pattern = r"\W" # Any character that is not a word character
pattern = r"\d" # Decimal digit: [0-9] plus decimal digits from other scripts (Unicode)
pattern = r"\D" # Any character that is not a digit
pattern = r"\s" # Whitespace character (space, tab, newline, ...)
pattern = r"\S" # Any character that is not whitespace
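
A short sketch of the character classes in action, including the Unicode behaviour of `\d`. The sample strings are assumptions. `re.ASCII` is the standard flag that restricts `\w`, `\d` and `\s` to ASCII characters.

```python
import re

text = "Room 42_b\tis free"   # illustrative sample text

words = re.findall(r"\w+", text)     # runs of word characters
digits = re.findall(r"\d", text)     # single digits
spaces = re.findall(r"\s", text)     # whitespace characters
non_words = re.findall(r"\W", text)  # everything that is not a word character

# Two Arabic-Indic digits followed by ASCII digits
mixed = "\u0664\u0662 42"
unicode_digits = re.findall(r"\d+", mixed)          # matches both groups
ascii_digits = re.findall(r"\d+", mixed, re.ASCII)  # matches only "42"

print(words)              # ['Room', '42_b', 'is', 'free']
print(digits)             # ['4', '2']
print(spaces)             # [' ', '\t', ' ']
print(non_words)          # [' ', '\t', ' ']
print(len(unicode_digits), ascii_digits)  # 2 ['42']
```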

### `Quantifiers`

<a href="https://medium.com/@318097/greedy-lazy-match-in-regular-expression-35ce8eca4060"> Difference between Greedy and Lazy </a>

In [74]:
# Quantifiers specify how many times the preceding character, set or group may repeat.
# They are mainly used when the number of characters to match is not known in advance.
# Greedy quantifiers match as much as possible; lazy ones (trailing ?) match as little as possible.

# Greedy    Lazy    Matches
# *         *?      0 or more times
# +         +?      1 or more times
# ?         ??      0 or 1 time
# {n}       {n}?    Exactly n times
# {n,}      {n,}?   At least n times
# {n,m}     {n,m}?  From n to m times

In [78]:
import re  # needed by this cell and the ones that follow

text = "Bam Baam Baaaaaamaam Bm"
pattern_grd = r"B\D*m"
pattern_lzy = r"B\D*?m"
print("Greedy match: ",re.findall(pattern_grd,text))
print("Lazy match: ",re.findall(pattern_lzy,text))

Greedy match:  ['Bam Baam Baaaaaamaam Bm']
Lazy match:  ['Bam', 'Baam', 'Baaaaaam', 'Bm']


In [82]:
text = "12, 123, 1234, 12345, 123456"
pattern_grd = r"\d{2,5}"
pattern_lzy = r"\d{2,5}?"
print("Greedy match: ",re.findall(pattern_grd,text))
print("Lazy match: ",re.findall(pattern_lzy,text))

Greedy match:  ['12', '123', '1234', '12345', '12345']
Lazy match:  ['12', '12', '12', '34', '12', '34', '12', '34', '56']


### `Anchors`

In [89]:
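# Anchors and word boundaries
# Use ^ to match at the beginning of the string (and of each line with the (?m) flag)
# Use $ to match at the end of the string (and of each line with the (?m) flag)
# Use the \b anchor to match at word boundaries
# Use the \B anchor to match anywhere that is not a word boundary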

text = "There is greenhouse gas in my green house, the one besides my outhouse."
pattern = r"\bhouse\b"
print(re.findall(pattern, text))

text = "text goes on line one \ntext goes here as well"
pattern = r"^text"      # Without flags, ^ matches only at the beginning of the string
pattern = r"(?m)^text"  # With (?m), ^ also matches right after each newline
print(re.findall(pattern, text))

['house']
['text', 'text']
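
The `$` and `\B` anchors are described above but not exercised, so here is a minimal sketch of both. The sample strings are invented for illustration. `re.MULTILINE` is the flag form of the inline `(?m)`.

```python
import re

text = "first line\nsecond line"

# Without flags, $ matches only at the very end of the string
end_of_string = re.findall(r"line$", text)

# With (?m), $ also matches just before each newline
end_of_each_line = re.findall(r"(?m)line$", text)

# re.MULTILINE is equivalent to the inline (?m) flag
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# \B requires that "house" does NOT start at a word boundary,
# so only "house" inside a longer word is found
houses = "greenhouse green house outhouse"
inside_word = [m.start() for m in re.finditer(r"\Bhouse\b", houses)]

print(end_of_string)     # ['line']
print(end_of_each_line)  # ['line', 'line']
print(line_starts)       # ['first', 'second']
print(inside_word)       # [5, 26]
```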


### `Groups` - find sub matches
group 0 refers to the whole text that matched the pattern<br>
groups 1..n refer to the sub-groups, numbered by their opening parenthesis from left to right

In [90]:
pattern = r"(\d{4})(\d{2})(\d{2})"
text = "Start date 20220222"

match = re.search(pattern,text)

for idx, value in enumerate(match.groups()):
    print(f"\tGroup: {idx+1} {value} \tat index: {match.start(idx+1)}")

	Group: 1 2022 	at index: 11
	Group: 2 02 	at index: 15
	Group: 3 22 	at index: 17


### `named groups`

In [None]:
pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
text = "Start date 20220222"

match = re.search(pattern,text)

if match:
    print(f"Found match: {match.group(0)} at index: {match.start()}")
    print(f"\t {match.groupdict()}")
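
Named groups can also be read one at a time. The sketch below reuses the pattern and text from the cell above and shows a few ways to access them; the variable names are chosen for illustration.

```python
import re

pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
match = re.search(pattern, "Start date 20220222")

if match:
    year = match.group("year")   # access by group name
    month = match["month"]       # shorthand, same as match.group("month")
    day_span = match.span("day") # (start, end) indices of the "day" group
    parts = match.groupdict()    # all named groups as a dict

    print(year)      # 2022
    print(month)     # 02
    print(day_span)  # (17, 19)
    print(parts)     # {'year': '2022', 'month': '02', 'day': '22'}
```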

## Examples

In [6]:
import re

### `re.match` - Find first match at start of text

In [62]:
text = "123 The year that Gustav Vasa became the king of Sweden is 1523"
pattern = r"\d+" # one or more digits

match = re.match(pattern,text)
print(f"{match.group(0)} at index {match.start()}" if match else "no match")

123 at index 0


### `re.search` - Finds the first match anywhere in the text

In [61]:
pattern = r"\d+" # one or more digits
text = r"The year that Gustav Vasa became the king of Sweden is 1523."

match = re.search(pattern,text)
print(f"Found a match: {match.group(0)} at index {match.start()}" if match else "No match")

Found a match: 1523 at index 55


### `re.findall` - Finds all matches 

This method returns only after scanning the entire text, so it can take a long time on a long text.

In [37]:
pattern = r"\d+"
text = "Lidingö Postal Codes are 18162, 18143, 18157, 18130"

matches = re.findall(pattern,text)
if matches:
    print("Found matches:", matches)
else:
    print("Found no matches")

Found matches: ['18162', '18143', '18157', '18130']


### `re.finditer` - Iterate over matches one at a time

In [40]:
pattern = r"\d+"
text = "Lidingö Postal Codes are 18162, 18143, 18157, 18130"

match_iter = re.finditer(pattern, text)

i = 0
for match in match_iter:
    print("\t", match.group(0), "at index:",match.start())
    i += 1
    if i > 2:
        break

	 18162 at index: 25
	 18143 at index: 32
	 18157 at index: 39
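
Because `re.finditer()` produces matches lazily, the manual counter can be replaced by `itertools.islice`, which stops the scan after a fixed number of matches. This is one possible sketch, not the only way to do it; the variable name `first_two` is illustrative.

```python
import re
from itertools import islice

pattern = r"\d+"
text = "Lidingö Postal Codes are 18162, 18143, 18157, 18130"

# Take only the first two matches; the rest of the text is never scanned
first_two = [(m.group(0), m.start()) for m in islice(re.finditer(pattern, text), 2)]

print(first_two)  # [('18162', 25), ('18143', 32)]
```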


### `re.sub` - Find a pattern and replace it with a replacement pattern

In [69]:
# Format date: 20200920 => 2020-09-20
text = " I was born 19950420 and Gustav was born 19951218 and Nils was born 19940802"
pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
replac_pattern = r"\g<year>-\g<month>-\g<day>"

print("Original text:", text)
print("New text:", re.sub(pattern,replac_pattern,text))

Original text:  I was born 19950420 and Gustav was born 19951218 and Nils was born 19940802
New text:  I was born 1995-04-20 and Gustav was born 1995-12-18 and Nils was born 1994-08-02


### `re.sub` - Find a pattern and replace it with the result of a function

In [70]:
# 20200821 => formatted with Python's datetime (as 21 August 2020)
# Ref: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
import datetime

def format_date(match):
    in_date = match.groupdict()
    year = int(in_date["year"])
    month = int(in_date["month"])
    day = int(in_date["day"])
    return datetime.date(year,month,day).strftime("%d %B %Y")

In [71]:
pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
text = " I was born 19950420 and Gustav was born 19951218 and Nils was born 19940802"

print("Original text:", text)
print("New text:", re.sub(pattern,format_date,text)) # Calling my user-defined function inside the sub

Original text:  I was born 19950420 and Gustav was born 19951218 and Nils was born 19940802
New text:  I was born 20 April 1995 and Gustav was born 18 December 1995 and Nils was born 02 August 1994


### Input validation

In [41]:
def is_integer(text):
    # fullmatch succeeds only if the whole string consists of digits;
    # unlike re.search(r"^\d+$", ...), it also rejects a trailing newline such as "123\n"
    return re.fullmatch(r"\d+", text) is not None

In [43]:
def test_is_integer():
    pass_lst = ["123","4","900","23464","0091"]
    fail_lst = ["as23","12b","1 2 3","1\t2"," 12","45 "]
    
    for t in pass_lst:
        if not is_integer(t):
            print("\tFailed to detect integer:",t)
    for t in fail_lst:
        if is_integer(t):
            print("\tIncorrectly classified as an integer:",t)

In [44]:
test_is_integer()