# Practical 3-2: Regular Expression

## 1.1 Python Regular Expressions

▪ A **RegEx**, or **Regular Expression**, is a sequence of characters that forms a search pattern.

▪ A search pattern defined using RegEx can be used to match against a string to check if the string contains the specified search pattern.

<img src="searchpattern.png" width="500">

https://towardsdatascience.com/easiest-way-to-remember-regular-expressions-regex-178ba518bebd

## 1.2 re Module Functions

▪ We can use functions defined in the **re module** to work with regular expressions:

\>>> <b>search: </b>	Returns a Match object if there is a match anywhere in the string

\>>> <b>findall: </b>	Returns a list containing all matches

\>>> <b>split:</b>	Returns a list where the string has been split at each match

\>>> <b>sub: </b>	Replaces one or many matches with a string

In [None]:
import re
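▪ As a quick orientation, the four functions listed above can be sketched side by side on one string. The sample sentence and the variable names below are chosen only for illustration.

```python
import re

text = "The rain in Spain"   # sample sentence, an assumption for this sketch

first = re.search("ai", text)       # Match object for the first "ai" (None if absent)
every = re.findall("ai", text)      # list of every non-overlapping match
pieces = re.split(" ", text)        # substrings found between the matches
masked = re.sub("ai", "__", text)   # new string with each match replaced

print(first.span())   # (5, 7)
print(every)          # ['ai', 'ai']
print(pieces)         # ['The', 'rain', 'in', 'Spain']
print(masked)         # The r__n in Sp__n
```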

### 1.2.1 search()

▪ The **search()** function searches the string for **the first match**, and returns a Match object if there is a match.

In [None]:
text = "She lives in Malaysia."
result = re.search("Malaysia", text)
print(result)

In [None]:
text = "She lives in Indonesia."
result = re.search("Malaysia", text)
print(result)

In [None]:
text = "I love coffee, yes I love coffee"
result = re.search("coffee", text)
print(result)
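▪ The printed Match object bundles the matched text with its position. A minimal sketch of pulling those pieces out individually is shown here; because search() returns None when nothing matches, the result is checked before any method is called on it.

```python
import re

text = "I love coffee, yes I love coffee"
result = re.search("coffee", text)

if result:                    # a Match object is truthy, None is falsy
    print(result.group())     # the matched text
    print(result.start())     # index where the first match begins
    print(result.end())       # index just past the first match
    print(result.span())      # (start, end) as a tuple
else:
    print("No match")
```

Only the first "coffee" is reported, even though the word appears twice.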

## 1.3 Metacharacters and Special Sequences

### 1.3.1 Introduction

▪ A RegEx is a **sequence of characters** that forms a search pattern.

▪ **Metacharacters** are **special characters** that are interpreted in a special way by a RegEx engine. 

![metacharacter.png](metacharacter.png)

▪ A **special sequence** is a \ followed by one of the characters in the list below, and has a special meaning.

![sequences.png](sequences.png)
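▪ Because special sequences start with a backslash, it is usually safer to write patterns as raw strings (prefix `r`). The sketch below uses `\b` (the standard word-boundary sequence) to illustrate why: in an ordinary string Python turns `\b` into a backspace character before the RegEx engine ever sees it.

```python
import re

plain = "\bcat\b"    # ordinary string: each \b becomes one backspace character
raw = r"\bcat\b"     # raw string: backslash and 'b' are kept as two characters

print(len(plain), len(raw))               # 5 7

print(re.search(plain, "a cat sat"))      # None, the text contains no backspace
print(re.search(raw, "a cat sat"))        # matches 'cat' as a whole word

# Doubling the backslash is the non-raw way to write the same pattern.
print("\\d" == r"\d")                     # True
```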

### 1.3.2 Metacharacter: ^ (Caret)

▪ The caret symbol **^** is used to check if **a string starts with the pattern that follows it**.

![regex2.jpg](regex2.jpg)

In [None]:
text = "I love coffee, yes I love coffee"
result = re.search("^coffee", text)
print(result)

In [None]:
text = "coffee, yes I love coffee"
result = re.search("^coffee", text)
print(result)

### 1.3.3 Metacharacter: $

▪ The dollar symbol **$** is used to check if **a string ends with the pattern that precedes it**.

![regex3.jpg](regex3.jpg)

In [None]:
text = "I love coffee, yes I love coffee"
result = re.search("coffee$", text)
print(result)

In [None]:
text = "I love coffee, yes I love coffee."
result = re.search("coffee$", text)
print(result)

### 1.3.4 Metacharacter: .

▪ The period symbol **.** matches any single character (except newline '\n').

![regex4.jpg](regex5.jpg)

#### 1.3.4.1 Example \#1

In [None]:
text = "dart"
result = re.search(".art", text)
print(result)

In [None]:
text = "cart"
result = re.search(".art", text)
print(result)

In [None]:
text = "cat"
result = re.search(".art", text)
print(result)
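▪ The "except newline" part of the rule can be checked directly. The sketch below also uses the `re.DOTALL` flag, which is not covered in this practical; it is the standard way to let the period match a newline as well.

```python
import re

text = "a\nc"   # a newline sits between 'a' and 'c'

default = re.search("a.c", text)              # '.' refuses to match '\n'
dotall = re.search("a.c", text, re.DOTALL)    # with the flag, '.' matches '\n' too

print(default)             # None
print(dotall is not None)  # True
```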

#### 1.3.4.2 Example \#2

In [None]:
pattern = '^a...s$'
text = 'abyss'

result = re.search(pattern, text)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

In [None]:
pattern = '^a...s'
text = 'abyss amyls'

result = re.search(pattern, text)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

In [None]:
pattern = '^a...s$'
text = 'abyss amyls'

result = re.search(pattern, text)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

### 1.3.5 Metacharacter: *

▪ The star symbol **\*** **matches zero or more occurrences of the pattern to its left**.

![regex4.jpg](regex4.jpg)

#### 1.3.5.1 Example \#1

In [None]:
text = "cat"
result = re.search(".*cat", text)
print(result)

In [None]:
text = "catwoman"
result = re.search(".*cat", text)
print(result)

In [None]:
text = "cute cat"
result = re.search(".*cat", text)
print(result)

In [None]:
text = "100 cats"
result = re.search(".*cat", text)
print(result)

In [None]:
text = "c7sb@#puiercat"
result = re.search(".*cat", text)
print(result)

#### 1.3.5.2 Example \#2

In [None]:
text_1 = "abysdfs"
result_1 = re.search("^a.*s$", text_1)
print(result_1)

In [None]:
text_2 = "a___s"
result_2 = re.search("^a.*s$", text_2)
print(result_2)

#### 1.3.5.3 Example \#3

In [None]:
text_1 = "The rain in Setapak"
result_1 = re.search("^The.*Setapak$", text_1)
print(result_1)

In [None]:
text_2 = "The rain in Kuala Lumpur"
result_2 = re.search("^The.*Setapak$", text_2)
print(result_2)
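▪ The "zero" in "zero or more" is easy to overlook when the star follows a period. A small sketch with the star applied to a single letter makes it visible; the test strings are made up for illustration.

```python
import re

pattern = "ab*c"   # 'a', then zero or more 'b', then 'c'

results = {}
for text in ["ac", "abc", "abbbc", "adc"]:
    results[text] = bool(re.search(pattern, text))
    print(text, results[text])
```

"ac" still matches because zero occurrences of "b" are allowed, while "adc" fails because "d" is not "b".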

### 1.3.6 Special Sequences: \d

▪ **\d** returns a match where the string contains digits (numbers from 0-9).

![regex7.jpg](regex7.jpg)

In [None]:
pattern = r'\d'  # raw string, so the backslash reaches the RegEx engine unchanged
text = 'My phone number is 012-3456789.'

result = re.search(pattern, text)
print(result)

In [None]:
pattern = r'\d\d\d-\d\d\d\d\d\d\d'
text = 'My phone number is 012-3456789.'

result = re.search(pattern, text)
print(result)

### 1.3.7 Metacharacter: {} 

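▪ The braces symbol **{}** matches a specified number of occurrences of the pattern to its left: **{n}** matches exactly n occurrences, and **{m,n}** matches at least m and at most n occurrences.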

![regex9.jpg](regex9.jpg)

#### 1.3.7.1 Example \#1

In [None]:
pattern = r'\d{3}-\d{7}'
text = 'My phone number is 012-3456789.'

result = re.search(pattern, text)
print(result)

In [None]:
pattern = r'\d{3}-\d{7,8}'
text = 'My phone number is 011-34567890.'

result = re.search(pattern, text)
print(result)

#### 1.3.7.2 Example \#2

In [None]:
pattern = 'ha{3}'
text = 'hahahaha'

result = re.search(pattern, text)
print(result)

In [None]:
pattern = 'ha{3}'
text = 'haaaa'

result = re.search(pattern, text)
print(result)
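▪ The different forms of the braces can be compared on the same inputs. This is a sketch only: the candidate strings are invented, and `{3,}` (three or more) is a standard form that the examples above do not use.

```python
import re

candidates = ["12", "123", "1234", "12345"]

# ^ and $ anchor the pattern so the whole string must consist of the digits.
patterns = {
    "exactly 3": r"^\d{3}$",
    "3 to 4": r"^\d{3,4}$",
    "3 or more": r"^\d{3,}$",
}

matched = {}
for label, pattern in patterns.items():
    matched[label] = [c for c in candidates if re.search(pattern, c)]
    print(label, matched[label])
```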

### 1.3.8 Metacharacter: ()

▪ The parentheses symbol **()** is used to group sub-patterns.

#### 1.3.8.1 Example \#1

In [None]:
pattern = '(ha){3}'
text = 'hahahaha'

result = re.search(pattern, text)
print(result)

#### 1.3.8.2 Example \#2

In [None]:
pattern = r'(\d{3})-(\d{7})'
text = 'My phone number is 012-3456789.'

result = re.search(pattern, text)
print(result)

#### 1.3.8.3 Example \#3

In [None]:
# Extract entire matching values
print(result.group())

# Extract match value for the first parenthesized subgroup
print(result.group(1))

# Extract match value for the second parenthesized subgroup
print(result.group(2))

# groups() returns all matching subgroups in a tuple (empty if there weren't any)
print(result.groups())
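▪ Since groups() returns a tuple, the subgroups can be unpacked straight into variables. The helper name `split_phone` below is hypothetical and only wraps the pattern from Example \#2; it returns None when there is no match, because calling group() on None would raise an AttributeError.

```python
import re

def split_phone(text):
    """Hypothetical helper: return (prefix, number) or None if no phone number is found."""
    result = re.search(r"(\d{3})-(\d{7})", text)
    if result is None:
        return None
    prefix, number = result.groups()   # one tuple item per parenthesized subgroup
    return prefix, number

print(split_phone("My phone number is 012-3456789."))   # ('012', '3456789')
print(split_phone("No number here."))                   # None
```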

### 1.3.9 Metacharacter: |

▪ The **|** symbol is used to check an either/or condition: it matches the pattern on its left or the pattern on its right.

![regex8.jpg](regex8.jpg)

#### 1.3.9.1 Example \#1

In [None]:
pattern = "python|java"
text = "I love java and python programming!"

result = re.search(pattern, text)
print(result)

In [None]:
if result:
    print("Yes, there is at least one match!")
else:
    print("No match")
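▪ The **|** symbol splits everything on its left from everything on its right, so parentheses are needed to limit the choice to part of a pattern. A minimal sketch, using a made-up "grey/gray" example:

```python
import re

loose = re.search("gra|ey", "grey")     # means 'gra' OR 'ey'
tight = re.search("gr(a|e)y", "grey")   # means 'gr', then 'a' OR 'e', then 'y'

print(loose.group())    # ey
print(tight.group())    # grey
print(tight.group(1))   # e
```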

## 1.2 re Module Functions (cont'd)

### 1.2.2 search() vs. match() 

▪ **match()** finds a match only if it occurs at the start of the string, whereas **search()** finds a match anywhere in the string.

In [None]:
text_1 = "It is raining in Setapak"

result_1 = re.search("It.*Setapak", text_1) 
if result_1:
    print("Search successful.") 
else:
    print("Sorry, the string is not found. ")
    
result_1 = re.match("It.*Setapak", text_1) 
if result_1:
    print("Search successful.") 
else:
    print("Sorry, the string is not found. ")

In [None]:
text_2 = "OMG, It is raining in Setapak"

result_2 = re.search("It.*Setapak", text_2) 
if result_2:
    print("Search successful.") 
else:
    print("Sorry, the string is not found. ")

result_2 = re.match("It.*Setapak", text_2)
if result_2:
    print("Search successful.") 
else:
    print("Sorry, the string is not found. ")
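▪ For comparison, the `re` module also provides **fullmatch()**, which succeeds only when the whole string matches the pattern. It is not used elsewhere in this practical; the sketch below simply runs all three functions on the same pattern, with a third sample string added for illustration.

```python
import re

pattern = "It.*Setapak"
texts = [
    "It is raining in Setapak",
    "OMG, It is raining in Setapak",
    "It is raining in Setapak today",
]

outcomes = []
for text in texts:
    found_anywhere = bool(re.search(pattern, text))    # match anywhere
    found_at_start = bool(re.match(pattern, text))     # match at the start only
    found_whole = bool(re.fullmatch(pattern, text))    # match must span the whole string
    outcomes.append((found_anywhere, found_at_start, found_whole))
    print(text, "->", outcomes[-1])
```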

## 1.3 RegEx (cont'd)

### 1.3.10 Special Sequences: \w	

▪ **\w** returns a match where the string contains any word characters (letters from a to z and A to Z, digits from 0-9, and the underscore _ character).

![regex10.jpg](regex10.jpg)

### 1.3.11 Metacharacter: +

▪ The plus symbol **+** matches one or more occurrences of the pattern to its left.

![regex6.jpg](regex6.jpg)

In [None]:
text_1 = 'python programming'

result = re.match(r"\w+", text_1)
if result: 
    print(result.group(0))
else:
    print('No match')

result = re.search(r"\w+", text_1)
if result: 
    print(result.group(0))
else:
    print('No match')

In [None]:
text_2 = '@python programming'
    
result = re.match(r"\w+", text_2)
if result: 
    print(result.group(0))
else:
    print('No match')
    
result = re.search(r"\w+", text_2)
if result: 
    print(result.group(0))
else:
    print('No match')
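▪ The difference between **+** and **\*** shows up when the preceding pattern does not occur at all. A small sketch, using a string without digits:

```python
import re

text = "abc"   # contains no digits

star = re.search(r"\d*", text)   # zero digits is acceptable: empty match at index 0
plus = re.search(r"\d+", text)   # at least one digit is required: no match

print(star.span(), repr(star.group()))   # (0, 0) ''
print(plus)                              # None
```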

## 1.2 re Module Functions (cont'd)

### 1.2.3 findall()

▪ The **findall()** function returns a list of strings containing all non-overlapping matches, in the order they are found.

#### 1.2.3.1 Example \#1

In [None]:
pattern = "python|java"
text = "I love java and python programming!"

result = re.search(pattern, text)
print(result)

In [None]:
pattern = "python|java"
text = "I love java and python programming!"

result = re.findall(pattern, text)
print(result)

In [None]:
if result:
    print(len(result), "matches" if len(result) > 1 else "match", "found")
else:
    print("No match")

#### 1.2.3.2 Example \#2

In [None]:
text = "The rain in Sungai Kapal, Johor"

result_1 = re.findall("ai", text)
print(result_1)

In [None]:
print(len(result_1))

#### 1.2.3.3 Example \#3

In [None]:
text = "bac" 
result = re.findall("^a", text) # caret '^' = starting with
print(len(result), 'match(es) found.')

In [None]:
text = "abc" 
result = re.findall("^a", text) # caret '^' = starting with
print(len(result), 'match(es) found.')
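▪ One behaviour worth knowing: when the pattern contains parentheses, findall() returns the group contents instead of the whole match. The sketch below reuses the phone-number pattern from section 1.3.8 on a made-up sentence.

```python
import re

text = "Call 012-3456789 or 011-2345678."

whole = re.findall(r"\d{3}-\d{7}", text)           # no groups: full matches
one_group = re.findall(r"(\d{3})-\d{7}", text)     # one group: that group only
two_groups = re.findall(r"(\d{3})-(\d{7})", text)  # several groups: a tuple per match

print(whole)        # ['012-3456789', '011-2345678']
print(one_group)    # ['012', '011']
print(two_groups)   # [('012', '3456789'), ('011', '2345678')]
```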

## 1.3 RegEx (cont'd)

### 1.3.12 Metacharacter: []

▪ Square brackets **[]** specify a set of characters you wish to match.

![regex12.jpg](regex12.jpg)

#### 1.3.12.1 Example \#1

In [None]:
text = "I love abc"
result = re.findall("[abc]", text)

print(result)
print('You found', len(result), 'match(es).')

In [None]:
text = "I love abc"
result = re.findall("[a-c]", text)

print(result)
print('You found', len(result), 'match(es).')

In [None]:
text = "I love abc"
result = re.findall("[a-e]", text) #[a-e] = [abcde]

print(result)
print('You found', len(result), 'match(es).')

In [None]:
text = "I love abc"
result = re.findall("[^abc]", text) # Excluding letters a, b and c

print(result)
print('You found', len(result), 'match(es).')

In [None]:
text = "I love ba abc bac abcd bcab acbde"

# [abc]{2} matches any sequence of exactly two characters, each drawn from a, b and c
result = re.findall("[abc]{2}", text)

print(result)
print('You found', len(result), 'match(es).')

In [None]:
text = "I love abca"

# [abc]{4} matches any sequence of exactly four characters, each drawn from a, b and c
result = re.findall("[abc]{4}", text)

print(result)
print('You found', len(result), 'match(es).')

#### 1.3.12.2 Example \#2

In [None]:
text = "You know the song 182932123256327?"

result_1 = re.findall("123", text)   # "123" != [123]
print(result_1)
print('You found', len(result_1), 'match(es).')

In [None]:
text = "You know the song 182932123256327?"

result_1 = re.findall("[1-3]", text) # [1-3] = [123]
print(result_1)
print('You found', len(result_1), 'match(es).')

In [None]:
text = "You know the song 182932123256327?"

result_1 = re.findall("[1-37]", text)   # [1-37] = [1237]
print(result_1)
print('You found', len(result_1), 'match(es).')

In [None]:
text = "You know the song 182932123256327?"

result_1 = re.findall("[1-3]{2}", text) # [1-3] = [123]
print(result_1)
print('You found', len(result_1), 'match(es).')

#### 1.3.12.3 Example \#3

In [None]:
text = "https://www.tarc.edu.my/"

# Any single character except newline '\n'
result = re.findall(".", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https://www.tarc.edu.my/"

# Any literal period character (inside [] the period loses its special meaning)
result = re.findall("[.]", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https://www.tarc.edu.my/"

# Any single character except these 3 symbols : and / and .
result = re.findall("[^:/.]", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https://www.tarc.edu.my/"

# Any single letter, or any single digit from 0 to 9
result = re.findall("[a-zA-Z0-9]", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https://www.tarc.edu.my/"

# Any single word character: a letter, a digit from 0 to 9, or an underscore
result = re.findall(r"[\w]", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https://www.tarc.edu.my/"

# Any single word character: a letter, a digit from 0 to 9, or an underscore
# search() returns only the first match
result = re.search(r"[\w]", text) 

print(result)

#### 1.3.12.4 Example \#4

In [None]:
text = "https"

# Matches ANY ONE of the characters within the square brackets
result = re.findall("[a-z]", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https"

# Matches a sequence of one or more LOWERCASE LETTERS, such as 'http' or 'www'.
result = re.findall("[a-z]+", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https"

# Matches a sequence of zero or more LOWERCASE LETTERS, such as 'http' or 'www'.
# The second item in the output is an empty string: a zero-occurrence match at the end of the text.
result = re.findall("[a-z]*", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https://www.tarc.edu.my/"

# Any run of one or more letters or digits (0 to 9)
result = re.findall("[a-zA-Z0-9]+", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "https://www.tarc.edu.my/"

# Any run of zero or more letters or digits (0 to 9), so empty matches are also returned
result = re.findall("[a-zA-Z0-9]*", text) 

print(result)
print(len(result), 'match(es) found.')

#### 1.3.12.5 Example \#5

In [None]:
text = "BIG small LARGE little HUGE tiny ENORMOUS stingy"
result = re.findall("[A-Z]+", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "BIG small LARGE little HUGE tiny ENORMOUS stingy"
result = re.findall("[^a-z ]+", text)

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "BIG small LARGE little HUGE tiny ENORMOUS stingy"
result = re.findall("[a-z]+", text)

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "BIG small LARGE little HUGE tiny ENORMOUS stingy"
result = re.findall("[a-z]*", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "BIG small LARGE little HUGE tiny ENORMOUS stingy"
result = re.findall(".*", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "A man is looking at me."
result = re.findall("[A-Z][a-z]*", text) 

print(result)
print(len(result), 'match(es) found.')

In [None]:
text = "A man is looking at me."
result = re.findall("^[A-Z].*", text) 

print(result)
print(len(result), 'match(es) found.')
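▪ Inside square brackets most metacharacters are treated as ordinary characters, as the `[.]` example above suggests. The sketch below extends that idea to `+` and `*`, and shows that a hyphen placed last in the set is taken literally rather than as a range. The sample strings are invented.

```python
import re

text = "1+1=2, 2*3=6, 9.5"

# '.', '+' and '*' are literal characters inside the brackets.
symbols = re.findall("[.+*]", text)
print(symbols)   # ['+', '*', '.']

# A hyphen at the end of the set is a literal hyphen, not a range.
hyphen = re.findall("[az-]", "a-z b")
print(hyphen)    # ['a', '-', 'z']
```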

### 1.3.13 Special Sequences: \s	

▪ **\s** returns a match where the string contains any **whitespace character**; it is equivalent to [ \t\n\r\f\v].
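▪ A minimal sketch of **\s** on its own is given here, on a made-up string. It also uses **\S**, the standard complement of \s, which matches any character that is not whitespace.

```python
import re

text = "a b\tc\nd"   # a space, a tab and a newline separate the letters

whitespace = re.findall(r"\s", text)   # every single whitespace character
visible = re.findall(r"\S", text)      # every single non-whitespace character

print(whitespace)   # [' ', '\t', '\n']
print(visible)      # ['a', 'b', 'c', 'd']
```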

## 1.2 re Module Functions (cont'd)

### 1.2.4 split()

▪ The **split()** function splits the string at each match and returns a list of the resulting substrings.

#### 1.2.4.1 Example \#1

In [None]:
text = "The rain in Sungai Kapal"
result = re.split(r"\s", text)
print(result)

In [None]:
text = "The rain in Sungai Kapal"
result = re.split(" ", text)
print(result)

In [None]:
text = "The rain in\tSungai Kapal"
result = re.split(r"\s", text)
print(result)

In [None]:
text = "The rain in\tSungai Kapal"
result = re.split(" ", text)
print(result)

#### 1.2.4.2 Example \#2

In [None]:
text = "FuJiang#TangHua#WenJie#EeMin"

result = re.split("#", text)

print('There are', len(result), 'students as follows:')
for index in range(len(result)):
    print('{}. {}'.format(str(index + 1), result[index]))

#### 1.2.4.3 Example \#3

In [None]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)
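▪ Two further options of split() may be useful here. This sketch assumes the same sample strings as above: `maxsplit` limits the number of splits, and wrapping the pattern in parentheses keeps the matched separators in the output list.

```python
import re

text = "FuJiang#TangHua#WenJie#EeMin"
first_split_only = re.split("#", text, maxsplit=1)   # stop after one split
print(first_split_only)   # ['FuJiang', 'TangHua#WenJie#EeMin']

string = 'Twelve:12 Eighty nine:89.'
without_numbers = re.split(r"\d+", string)    # separators are discarded
with_numbers = re.split(r"(\d+)", string)     # capturing group keeps the separators

print(without_numbers)   # ['Twelve:', ' Eighty nine:', '.']
print(with_numbers)      # ['Twelve:', '12', ' Eighty nine:', '89', '.']
```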

### 1.2.5 sub()

▪ The **sub()** function replaces one or many matches with a string.

#### 1.2.5.1 Example \#1

In [None]:
text = "The rain in Sungai Kapal"
result = re.sub(r"\s", "9", text)
print(result)

In [None]:
result = re.sub(r"\s", "*", text)
print(result)

#### 1.2.5.2 Example \#2

In [None]:
text = "i love apple"
result = re.sub("apple", "***", text)
print(result)

#### 1.2.5.3 Example \#3

In [None]:
email = "jolinTsai@gmail.com"
result = re.sub("gmail", "***", email)
print(result)
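▪ The "one or many" in the description of sub() can be controlled with the `count` argument, and the replacement string may refer back to a group with `\1`. The masking rule below (keep the first character of the user name) is only an illustrative assumption.

```python
import re

text = "The rain in Sungai Kapal"
first_only = re.sub(r"\s", "*", text, count=1)   # replace the first match only
print(first_only)   # The*rain in Sungai Kapal

email = "jolinTsai@gmail.com"
# (\w) captures the first character; \1 puts it back in the replacement.
masked = re.sub(r"^(\w)\w*@", r"\1***@", email)
print(masked)       # j***@gmail.com
```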

## 1.4 Regexr

### 1.4.1 Introduction

▪ **Regexr** is a regular expression tester (https://regexr.com/) with syntax highlighting that can be used to understand the meaning of a regex.

\>> Press **New** to start an experiment

\>>> **Explain** section: Explains the meaning of regex.

\>>> **Details** section: Show the matching part(s).

#### 1.4.1.1 Example \#1

In [None]:
text = 'Cats are smarter than dogs.'
pattern = '(.*) are (.*?) (.*)'

result = re.search(pattern, text)
print(result)

In [None]:
print(result.group(0))
print(result.group(1))
print(result.group(2))
print(result.group(3))

### 1.4.2 Greedy vs. Non-Greedy (or lazy) Quantifiers

▪ Quantifiers allow us to match their preceding elements a number of times. 

▪ By default, quantifiers work in the greedy mode. 

▪ **Greedy quantifiers** will match their preceding elements as many times as possible to return the biggest match possible.

▪ **Non-greedy quantifiers** will match as little as possible to return the smallest match possible. 

▪ To turn greedy quantifiers into non-greedy quantifiers, we add an extra question mark (?) to the quantifiers. 

![](greedyvslazy.png)

https://www.pythontutorial.net/python-regex/python-regex-non-greedy/

#### 1.4.1.2 Example \#2 (Greedy Quantifier)

![greedy.jpg](greedy.jpg)

https://cppsecrets.com/users/218111411511410110199104971141051161049764103109971051084699111109/Python-re-Greedy-vs-Non-Greedy-Matching.php

In [None]:
text = '<html><head><title></title></head><body></body></html>'

result = re.findall("<.+>", text)
print(result)

#### 1.4.1.3 Example \#3 (Lazy Quantifier)

![lazy.jpg](lazy.jpg)

In [None]:
result = re.findall("<.+?>", text)
print(result)
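▪ The same contrast can be seen outside HTML. A minimal sketch on a made-up sentence with two quoted words:

```python
import re

text = 'say "hello" and "bye"'

greedy = re.findall('".+"', text)    # .+ runs to the LAST closing quote
lazy = re.findall('".+?"', text)     # .+? stops at the NEXT closing quote

print(greedy)   # ['"hello" and "bye"']
print(lazy)     # ['"hello"', '"bye"']
```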

#### 1.4.1.4 Example \#4

In [None]:
text = 'Cats are  smarter than dogs.'
pattern = '(.*) are (.*?) (.*)'

result = re.search(pattern, text)
print(result)

In [None]:
print(result.groups())
print(result.group(0))
print(result.group(1))
print(result.group(2))
print(result.group(3))

#### 1.4.1.5 Example \#5

In [None]:
text = 'Cats are smarter than dogs.'
pattern = '(.*)are(.*?)(.*)'

result = re.search(pattern, text)
print(result)

In [None]:
print(result.groups())
print(result.group(0))
print(result.group(1))
print(result.group(2))
print(result.group(3))

#### 1.4.1.6 Example \#6

In [None]:
# \. matches a literal period; an unescaped . would match any character
pattern = r'^[a-zA-Z][a-zA-Z0-9._+-]+@[A-Za-z]+\.[A-Za-z]+'
text = 'abc-regex_2020@xmen.com'
result = re.search(pattern, text)

print(result)

if result:
    print(result[0])

In [None]:
# ^ requires the email address to start at the very beginning of the text
pattern = r'^[a-zA-Z][a-zA-Z0-9._+-]+@[A-Za-z]+\.[A-Za-z]+'
text = 'This is a sample email address: abc-regex_2020@xmen.com'
result = re.search(pattern, text)

print(result)

if result:
    print(result[0])

In [None]:
# Without ^ the email address may appear anywhere in the text
pattern = r'[a-zA-Z][a-zA-Z0-9._+-]+@[A-Za-z]+\.[A-Za-z]+'
text = 'This is a sample email address: abc-regex_2020@xmen.com'
result = re.search(pattern, text)

print(result)

if result:
    print(result[0])

In [None]:
# The first character must be a letter, so the leading digits are skipped
pattern = r'[a-zA-Z][a-zA-Z0-9._+-]+@[A-Za-z]+\.[A-Za-z]+'
text = 'This is a sample email address: 123abc-regex_2020@xmen.com'
result = re.search(pattern, text)

print(result)

if result:
    print(result[0])
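▪ To validate a whole input rather than find an address inside a longer text, one option is to compile the pattern once and require a full match. This is a sketch under assumptions: the helper name `looks_like_email` is hypothetical, the period before the domain ending is escaped as `\.`, and the pattern is the simplified one from this example, not a complete email rule.

```python
import re

# Compiled once so the same pattern object can be reused for many inputs.
EMAIL = re.compile(r"[a-zA-Z][a-zA-Z0-9._+-]+@[A-Za-z]+\.[A-Za-z]+")

def looks_like_email(candidate):
    """Hypothetical helper: True only if the WHOLE string fits the simplified pattern."""
    return EMAIL.fullmatch(candidate) is not None

print(looks_like_email("abc-regex_2020@xmen.com"))       # True
print(looks_like_email("123abc-regex_2020@xmen.com"))    # False, starts with a digit
print(looks_like_email("abc-regex_2020@xmenXcom"))       # False, no literal period
print(looks_like_email("Email: abc-regex_2020@xmen.com"))  # False, extra text around it
```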

<span style = "color:red">
    
**Exercise \#1: Write Python code that identifies whether an input is a valid FOCS student ID.** 

</span>

<span style = "color:red">
    
**Exercise \#2: Write Python code that extracts all named entities from a given text.** 

</span>

<span style = "color:red">
    
**Exercise \#3: Write Python code that verifies a TAR UMT student's email address.** 

</span>

<span style = "color:red">
    
**Exercise \#4: Write Python code that removes all numbers from a given text.** 

</span>

<span style = "color:red">
    
**Exercise \#5: Write Python code that reads dates from a given text.** 

</span>