## Regular Expressions

*Prepared by:*
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

This notebook shows how to use basic regular expressions (regex) in Python with the built-in `re` module.

## Preliminaries

### Import library

In [1]:
import re

## Regex Syntax

A regular expression (regex) "is a sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text." In the context of text preprocessing, we use regex to extract the parts that we need from text data.

Here are some cheatsheets for the regex syntax:

<img style="margin: auto" src="../images/pyregex-cheatsheet.png">
<br>
<img style="margin: auto" src="../images/regex-lovesdata.png">
<br>
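
As a quick taste of the syntax, here is a minimal sketch that tries a few common tokens with `re.findall`. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample string, made up for this illustration.
sample = 'Order 66 shipped on 2021-11-17 to jude_t'

# \d matches one digit; + repeats the previous token one or more times.
digits = re.findall(r'\d+', sample)
print(digits)

# \w matches a letter, digit, or underscore.
words = re.findall(r'\w+', sample)
print(words)

# {n} repeats the previous token exactly n times.
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample)
print(dates)

# [A-Z] and [a-z] are character classes; ^ anchors at the start of the string.
first_word = re.findall(r'^[A-Z][a-z]+', sample)
print(first_word)
```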

### Regex Tester Website

**Note!** It is very helpful to test your regex first on this <a href="http://www.regex101.com/">website</a> to make sure it behaves as expected before running your Python code.

## Basic Regex

### Regex functions 

**re.match**

Checks whether the pattern matches at the start of the string. It returns a `re.Match` object on success and `None` otherwise. If the regex has capturing groups, denoted by `()`, the substrings they matched are available through the match object, for example via `.groups()`.

In [2]:
text = 'I am Jude Jude Jude lala'

if re.match(r'Jude', text):
    print('matched')
else:
    print('no match')

no match


In [3]:
if re.match(r'I am', text):
    print('matched')
else:
    print('no match')

matched


In [4]:
re.match(r'I am', text)

<re.Match object; span=(0, 4), match='I am'>

In [5]:
if re.match(r'(.*) am (.*)', text):
    print('matched')
else:
    print('no match')

matched


In [6]:
re.match(r'(.*) am (.*)', text)

<re.Match object; span=(0, 24), match='I am Jude Jude Jude lala'>
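
The match object can be inspected further. The following is a minimal sketch, reusing the same text, of reading the captured groups and of the `None` result when the pattern does not occur at the start. The variable names are chosen for illustration.

```python
import re

text = 'I am Jude Jude Jude lala'

m = re.match(r'(.*) am (.*)', text)
if m is not None:
    whole = m.group(0)    # the whole match
    first = m.group(1)    # the first captured group
    groups = m.groups()   # all captured groups as a tuple
    print(whole)
    print(first)
    print(groups)

# 'Jude' is not at the start of the string, so re.match returns None.
no_match = re.match(r'Jude', text)
print(no_match)
```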

**re.search**

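Scans the string and returns a match object for the first substring that matches the pattern, or `None` if there is no match. Unlike `re.match`, the match may start anywhere in the string.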

In [7]:
text = 'I am Jude Jude Jude lala'
print(re.search(r'Jude', text)[0]) # whole matched string
print(re.search(r'I am', text)[0]) # whole matched string

s = re.search(r'(.*) am (.*)', text)
print(s[0]) # whole matched string
print(s.groups()) # captured groups

s = re.search(r'.*(Jude).*', text)
print(s[0]) # whole matched string
print(s.groups()) # captured groups

Jude
I am
I am Jude Jude Jude lala
('I', 'Jude Jude Jude lala')
I am Jude Jude Jude lala
('Jude',)
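
Because `re.search` returns `None` when nothing matches, indexing the result directly raises a `TypeError` in that case. Here is a minimal sketch of a guard, where `first_match` is a hypothetical helper name and not part of the `re` module.

```python
import re

text = 'I am Jude Jude Jude lala'

def first_match(pattern, string, default=None):
    """Return the first matched substring, or `default` if there is none.

    Hypothetical helper for illustration; not part of the re module.
    """
    m = re.search(pattern, string)
    return m[0] if m else default

found = first_match(r'Jude', text)
missing = first_match(r'hello', text, default='<none>')
print(found)
print(missing)

# span() gives the start and end index of the match in the string.
span = re.search(r'Jude', text).span()
print(span)
```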


**re.findall**

Returns a list of all non-overlapping substrings matching the pattern. The list is empty when nothing matches.

In [8]:
re.findall(r'Jude', text)

['Jude', 'Jude', 'Jude']

In [9]:
re.findall(r'am', text)

['am']

In [10]:
re.findall(r'hello', text)

[]
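
The return value of `re.findall` changes when the pattern contains capturing groups: with one group it returns the group's matches, and with several groups it returns tuples. A minimal sketch, using a made-up sample string:

```python
import re

# Hypothetical sample string, made up for this illustration.
sample = 'abc 123 def 456'

# No capturing group: the whole matches are returned.
whole = re.findall(r'[a-z]+ \d+', sample)
print(whole)

# One capturing group: only that group's text is returned.
one_group = re.findall(r'[a-z]+ (\d+)', sample)
print(one_group)

# Two capturing groups: a list of tuples is returned.
two_groups = re.findall(r'([a-z]+) (\d+)', sample)
print(two_groups)
```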

### Non-greedy searching

By default, `+` and `*` are greedy: they consume as many characters as possible, so the match extends up to the last occurrence of the pattern on the right. Here is an example.

In [11]:
text = 'I am Jude Jude Jude lala'

s = re.search(r'(.*)Jude(.*)', text)
print(s.groups())

('I am Jude Jude ', ' lala')


In [12]:
print(len(s.groups()))
print(s[0]) # index 0 contains the whole match
print(s[1]) # index 1 contains the first captured group
print(s[2]) # index 2 contains the second captured group

2
I am Jude Jude Jude lala
I am Jude Jude 
 lala


To prevent greedy searching, we add `?` after `*` or `+`. The quantifier then matches as few characters as possible.

In [13]:
text = 'I am Jude Jude Jude lala'

s = re.search(r'(.*?)Jude(.*)', text)
print(s.groups())

('I am ', ' Jude Jude lala')
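
The difference is easy to see on delimited text. The sketch below uses a made-up HTML-like string and compares a greedy pattern, a non-greedy pattern, and a negated character class, which is a common alternative to the non-greedy form.

```python
import re

# Hypothetical sample string, made up for this illustration.
html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r'<.+>', html)
print(greedy)

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.findall(r'<.+?>', html)
print(lazy)

# Alternative: [^>]+ cannot cross a '>' at all.
negated = re.findall(r'<[^>]+>', html)
print(negated)
```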


### Lookahead and lookbehind

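A regex lookahead, written `(?=...)`, asserts that a pattern follows the current position without including it in the match. It is used to get the text before a certain pattern.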

In [14]:
text = 'jude hello data hello mining hello meow hello lala hello woof hello end'

s = re.findall(r'.+(?=hello)', text)
print(s)

['jude hello data hello mining hello meow hello lala hello woof ']


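Let's also prevent greedy searching on this regex. Because the lookahead does not consume `hello`, each later match starts with the `hello` that ended the previous one.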

In [15]:
text = 'jude hello data hello mining hello meow hello lala hello woof hello end'

s = re.findall(r'.+?(?=hello)', text)
print(s)

['jude ', 'hello data ', 'hello mining ', 'hello meow ', 'hello lala ', 'hello woof ']


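A regex lookbehind, written `(?<=...)`, is the opposite of a lookahead: it asserts that a pattern precedes the current position without including it in the match. It is used to get the text after a certain pattern.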

In [16]:
text = 'jude hello data hello mining hello meow hello lala hello woof hello end'

s = re.findall(r'(?<=hello).+', text)
print(s)

[' data hello mining hello meow hello lala hello woof hello end']


Combining lookahead and lookbehind gets the text in between two patterns.

In [17]:
text = 'jude hello data hello mining hello meow hello lala hello woof hello end'

s = re.findall(r'(?<=hello).+?(?=hello)', text)
print(s)

[' data ', ' mining ', ' meow ', ' lala ', ' woof ']
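
Lookarounds matter here because they do not consume the delimiter. As a minimal sketch for comparison, the pattern below uses a plain capturing group between two literal `hello` strings instead. Each match consumes its closing `hello`, so that `hello` cannot open the next match and every other segment is skipped.

```python
import re

text = 'jude hello data hello mining hello meow hello lala hello woof hello end'

# Consuming version: both 'hello' delimiters are part of each match.
consuming = re.findall(r'hello(.+?)hello', text)
print(consuming)

# Lookaround version: the delimiters are only checked, not consumed.
lookaround = re.findall(r'(?<=hello).+?(?=hello)', text)
print(lookaround)
```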


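### Named Capture Group

A named capture group, written `(?P<name>...)`, lets you refer to a captured substring by name instead of by position.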

In [18]:
text = '''abc 123 def 456 def 789
abc 123 def 456 def 789'''

s = re.search(r'(?P<letters>\D{3}) (?P<numbers>\d{3})', text)
print(s)

<re.Match object; span=(0, 7), match='abc 123'>


In [19]:
s.groups()

('abc', '123')

In [20]:
s.groupdict()

{'letters': 'abc', 'numbers': '123'}

In [21]:
s['letters']

'abc'

In [22]:
s['numbers']

'123'

To iterate over all the matches, use `re.finditer`:

In [23]:
for s in re.finditer(r'(?P<letters>\D{3}) (?P<numbers>\d{3})', text):
    print(s.groupdict())

{'letters': 'abc', 'numbers': '123'}
{'letters': 'def', 'numbers': '456'}
{'letters': 'def', 'numbers': '789'}
{'letters': 'abc', 'numbers': '123'}
{'letters': 'def', 'numbers': '456'}
{'letters': 'def', 'numbers': '789'}


Another example, extracting a title, a name, and a date from each line:

In [24]:
text = '''Mr. Jude9999 valbabkalfa 2021-11-17

Dr. Seuss sadfasdfasfcxzv 1990-10-10 

Mrs. Mojo valbabkalfa 1900-01-06'''

for s in re.finditer(r'(?P<title>.+\.) (?P<name>[a-zA-Z]+).*(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text):
    print(s.groupdict())

{'title': 'Mr.', 'name': 'Jude', 'year': '2021', 'month': '11', 'day': '17'}
{'title': 'Dr.', 'name': 'Seuss', 'year': '1990', 'month': '10', 'day': '10'}
{'title': 'Mrs.', 'name': 'Mojo', 'year': '1900', 'month': '01', 'day': '06'}
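
The captured strings can then be converted into typed values. The following is a minimal sketch, assuming the same text, that builds `datetime.date` objects from the named groups. The pattern is a slightly tightened variant (`\w+\.` for the title and a non-greedy `.*?`), and the `records` structure is an illustrative choice.

```python
import re
from datetime import date

text = '''Mr. Jude9999 valbabkalfa 2021-11-17

Dr. Seuss sadfasdfasfcxzv 1990-10-10

Mrs. Mojo valbabkalfa 1900-01-06'''

# Compile once so the pattern can be reused; adjacent string literals
# are joined, which keeps the long pattern readable.
pattern = re.compile(
    r'(?P<title>\w+\.) (?P<name>[a-zA-Z]+).*?'
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
)

records = []
for m in pattern.finditer(text):
    d = m.groupdict()
    records.append({
        'title': d['title'],
        'name': d['name'],
        # Captured groups are strings, so convert before building a date.
        'date': date(int(d['year']), int(d['month']), int(d['day'])),
    })

for r in records:
    print(r['title'], r['name'], r['date'].isoformat())
```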


## References

1. "regular expression." lexico.com. 2021. https://www.lexico.com/definition/regular_expression (19 July 2021).
1. http://www.pyregex.com/

## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> <href>judemichaelteves@gmail.com</href> or <href>jude.teves@dlsu.edu.ph</href></sup><br>