# 7.4. Regular Expressions

Module M-227-04: Programming for Data Analytics

Instructor: prof. Dmitry Pavlyuk

## Regular expression

A regular expression ("regex") is a description of a pattern of text. With a regex you can:

- test whether a string matches the expression's pattern
- search for or replace characters in a string

Regular expressions are extremely powerful but tough to read.

For example,

__[a-zA-Z_-]+@[a-zA-Z_-]+.[a-zA-Z]{2,4}__

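is a simplified(!) regular expression for an email address. Note that the dot before the last group is not escaped, so it matches any character, not only a literal dot.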
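The effect of the unescaped dot can be checked with a small sketch. The names `loose` and `strict` and the two sample strings are made up for illustration; `strict` is only one possible tightening, not a complete email validator.

```python
import re

# Pattern from the slide: the "." before the last group is unescaped,
# so it matches any single character.
loose = r"[a-zA-Z_-]+@[a-zA-Z_-]+.[a-zA-Z]{2,4}"
# Hypothetical stricter variant: "\." matches a literal dot only.
strict = r"[a-zA-Z_-]+@[a-zA-Z_-]+\.[a-zA-Z]{2,4}"

samples = ["webmaster@aclweb.org", "webmaster@aclweb-org"]
for s in samples:
    loose_ok = re.fullmatch(loose, s) is not None
    strict_ok = re.fullmatch(strict, s) is not None
    print(f"{s}: loose={loose_ok}, strict={strict_ok}")
```

The second sample has no dot at all, yet the loose pattern still accepts it, while the strict one rejects it.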

## Patterns

Regular expressions are __patterns__ used to match character combinations in strings.

- Letters and numbers match themselves
- Patterns are case sensitive
- __Punctuation__ characters have special meanings!

In [1]:
import re
citation = 'Errors using inadequate data are much less than those using no data at all. [Charles Babbage]'

pattern1 = "data"
pattern2 = "charles"
print(f"'{pattern1}' in the citation?",re.search(pattern1, citation) is not None)
print(f"'{pattern2}' in the citation?",re.search(pattern2, citation) is not None)

'data' in the citation? True
'charles' in the citation? False
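Because patterns are case sensitive, the lowercase pattern above fails. One way around this, sketched here as an alternative rather than as part of the original notebook, is the standard `re.IGNORECASE` flag:

```python
import re

citation = "Errors using inadequate data are much less than those using no data at all. [Charles Babbage]"

# Default behaviour: "charles" does not match "Charles"
case_sensitive = re.search("charles", citation) is not None
# With re.IGNORECASE the letter case is ignored during matching
case_insensitive = re.search("charles", citation, re.IGNORECASE) is not None

print(case_sensitive)    # False
print(case_insensitive)  # True
```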


## Pattern matching: square brackets

Square brackets mean that any of the listed characters will match
- [ab] matches either _a_ or _b_
- [a-d] matches either _a_ or _b_ or _c_ or _d_
- A caret right after the opening bracket means __not__: [^a-d] matches anything but _a_, _b_, _c_ or _d_
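The three bracket forms listed above can be tried on a single word. This is a minimal sketch; the sample word is chosen only for illustration.

```python
import re

word = "Babbage"

# [ab] matches one character that is either "a" or "b"
either = re.findall(r"[ab]", word)
# [a-d] matches one character in the range a..d
in_range = re.findall(r"[a-d]", word)
# [^a-d] matches one character that is NOT in the range a..d
negated = re.findall(r"[^a-d]", word)

print(either)    # ['a', 'b', 'b', 'a']
print(in_range)  # ['a', 'b', 'b', 'a']
print(negated)   # ['B', 'g', 'e']
```

Note that the uppercase "B" falls outside the range a..d, so only the negated class picks it up.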

In [2]:
pattern2 = "charles"
print(f"'{pattern2}' in the citation?",re.search(pattern2, citation) is not None)
pattern3 = "[cC]harles"
print(f"'{pattern3}' in the citation?",re.search(pattern3, citation) is not None)

'charles' in the citation? False
'[cC]harles' in the citation? True


## Special symbols

- "__.__" means __any__ character except a newline (to match a literal "." you must escape it with a backslash: \.)
- "__*__" (asterisk) matches the previous character 0 or more times
- "__+__" (plus sign) matches the previous character 1 or more times
- "__?__" (question mark) matches the previous character 0 or 1 times

In [3]:
words = ["dm", "dom", "doom", "dooom"]
patterns = ["d.m","do*m","do+m","do?m"]
for pattern in patterns:
    print(f"Pattern {pattern}")
    for word in words:
        m = re.match(pattern, word)
        print(f"\t{pattern} matches {word}? {m is not None}")

Pattern d.m
	d.m matches dm? False
	d.m matches dom? True
	d.m matches doom? False
	d.m matches dooom? False
Pattern do*m
	do*m matches dm? True
	do*m matches dom? True
	do*m matches doom? True
	do*m matches dooom? True
Pattern do+m
	do+m matches dm? False
	do+m matches dom? True
	do+m matches doom? True
	do+m matches dooom? True
Pattern do?m
	do?m matches dm? True
	do?m matches dom? True
	do?m matches doom? False
	do?m matches dooom? False


## Special symbols: anchors and character classes

- $ - end of the string
- ^ - start of the string
- \d - digit
- \D - not a digit
- \w - a word character (a letter, digit or underscore)
- \W - not a word character
- \s - a whitespace character: [ \t\n\r\f\v] (plus other Unicode spaces)
- \S - not a whitespace character
- \b - a word boundary
- \B - not a word boundary
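A short sketch of the symbols from this list that deal with digits, word boundaries and string anchors. The sample sentence is invented for illustration.

```python
import re

text = "Room 227 opens at 8 on day 4"

# \d+ : one or more digits in a row
numbers = re.findall(r"\d+", text)
# \b\d\b : a single digit with a word boundary on both sides
single_digits = re.findall(r"\b\d\b", text)
# \d$ : a digit at the very end of the string
last_digit = re.findall(r"\d$", text)
# ^\w+ : the run of word characters at the very start of the string
first_word = re.findall(r"^\w+", text)

print(numbers)        # ['227', '8', '4']
print(single_digits)  # ['8', '4']
print(last_digit)     # ['4']
print(first_word)     # ['Room']
```

The digits inside "227" are not reported by the boundary pattern because each of them has another word character next to it.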


In [4]:
words = ["dm", "dom", "d m"]
patterns = [r"d\wm",r"d\Sm",r"^d\Wm"]
for pattern in patterns:
    print(f"Pattern {pattern}")
    for word in words:
        m = re.match(pattern, word)
        print(f"\t{pattern} matches {word}? {m is not None}")

Pattern d\wm
	d\wm matches dm? False
	d\wm matches dom? True
	d\wm matches d m? False
Pattern d\Sm
	d\Sm matches dm? False
	d\Sm matches dom? True
	d\Sm matches d m? False
Pattern ^d\Wm
	^d\Wm matches dm? False
	^d\Wm matches dom? False
	^d\Wm matches d m? True


## Braces / curly brackets

Braces give finer control over the number of repeats:
- {2} exactly two times
- {2,} at least two times
- {,2} at most two times
- {1,3} from 1 to 3 times

In [5]:
words = ["dm", "dom", "doom", "dooom"]
patterns = ["do{2}m","do{2,}m","do{,2}m","do{1,3}m"]
for pattern in patterns:
    print(f"Pattern {pattern}")
    for word in words:
        m = re.match(pattern, word)
        print(f"\t{pattern} matches {word}? {m is not None}")

Pattern do{2}m
	do{2}m matches dm? False
	do{2}m matches dom? False
	do{2}m matches doom? True
	do{2}m matches dooom? False
Pattern do{2,}m
	do{2,}m matches dm? False
	do{2,}m matches dom? False
	do{2,}m matches doom? True
	do{2,}m matches dooom? True
Pattern do{,2}m
	do{,2}m matches dm? True
	do{,2}m matches dom? True
	do{,2}m matches doom? True
	do{,2}m matches dooom? False
Pattern do{1,3}m
	do{1,3}m matches dm? False
	do{1,3}m matches dom? True
	do{1,3}m matches doom? True
	do{1,3}m matches dooom? True


## Regular expression: functions


- .match() - does the pattern match at the beginning of the string? Returns None or a Match object
- .search() - does the pattern match anywhere in the string? Returns None or a Match object for the first match
- .findall() - finds all non-overlapping matches in the string. Returns a list of strings (or an empty list)



In [6]:
tsi_hours = """
Working Hours
Mon-Fri 8:30 – 18:30
Sat 8:30-16:00
"""
pattern = r"\d{1,2}:\d{2}"
print(re.match(pattern, tsi_hours))
print(re.search(pattern, tsi_hours))
print(re.findall(pattern, tsi_hours))

None
<re.Match object; span=(23, 27), match='8:30'>
['8:30', '18:30', '8:30', '16:00']
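The Match object printed above carries the matched text and its position. The sketch below reads them with the standard `group()` and `span()` methods and also shows `re.finditer` and `re.sub`, two functions not covered on the slide. The shortened `hours` string is an assumption made to keep the example self-contained.

```python
import re

hours = "Sat 8:30-16:00"
pattern = r"\d{1,2}:\d{2}"

# search() returns a Match object for the first match (or None)
m = re.search(pattern, hours)
first_text = m.group()   # the matched text
first_span = m.span()    # (start, end) positions in the string

# finditer() yields one Match object per non-overlapping match
positions = [(mm.start(), mm.group()) for mm in re.finditer(pattern, hours)]

# sub() replaces every match with the given replacement string
masked = re.sub(pattern, "HH:MM", hours)

print(first_text, first_span)  # 8:30 (4, 8)
print(positions)               # [(4, '8:30'), (9, '16:00')]
print(masked)                  # Sat HH:MM-HH:MM
```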


## Parentheses / round brackets

Parentheses define capturing groups, which indicate what part of the match should be returned

In [7]:
pattern = r"\d{1,2}:\d{2}"
print(re.findall(pattern, tsi_hours))
pattern = r"(\d{1,2}):\d{2}"
print(re.findall(pattern, tsi_hours))
pattern = r"(\d{1,2}):(\d{2})"
print(re.findall(pattern, tsi_hours))

['8:30', '18:30', '8:30', '16:00']
['8', '18', '8', '16']
[('8', '30'), ('18', '30'), ('8', '30'), ('16', '00')]
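Captured groups are also available on a Match object through `group(n)`. As a minimal sketch of why this is useful, the example below converts each time to minutes after midnight; the `line` string and the conversion are illustrative additions, not part of the original notebook.

```python
import re

line = "Sat 8:30-16:00"
pattern = r"(\d{1,2}):(\d{2})"

m = re.search(pattern, line)
whole = m.group(0)    # the whole match
hour = m.group(1)     # first capturing group
minute = m.group(2)   # second capturing group

# findall() returns one (hour, minute) tuple per match
minutes = [int(h) * 60 + int(mi) for h, mi in re.findall(pattern, line)]

print(whole, hour, minute)  # 8:30 8 30
print(minutes)              # [510, 960]
```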


## Example: Parsing Raw Emails

In [8]:
with open('fraudlent_emails.txt') as f:
    text = f.read()
print(text[:800])

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,


## Parsing raw emails: extract email addresses

In [9]:
pattern = r"([a-zA-Z_-]+@[a-zA-Z_-]+.[a-zA-Z]{2,4})"

emails = re.findall(pattern, text)
emails

['webmaster@aclweb.org',
 'nng@spinfinder.com',
 'nng@spinfinder.com',
 'nng@spinfinder.com',
 'webmaster@aclweb.org',
 'webmaster@aclweb.org',
 'tony_m@lawyer.com']
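The result above does not include addresses such as james_ngola2002@maktoob.com, because the character classes in the pattern contain no digits. The sketch below compares the original pattern with a hypothetical broader one (`broader` is an assumption for illustration, still far from a full email grammar) on two header lines copied from the file preview.

```python
import re

sample = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    "To: webmaster@aclweb.org\n"
)

# Pattern used in the notebook: letters, "_" and "-" only
original = r"([a-zA-Z_-]+@[a-zA-Z_-]+.[a-zA-Z]{2,4})"
# Hypothetical broader pattern: \w also covers digits, dots are allowed
# in the local part, and the domain needs at least one literal dot
broader = r"[\w.-]+@[\w-]+(?:\.[\w-]+)+"

found_original = re.findall(original, sample)
found_broader = re.findall(broader, sample)

print(found_original)  # ['webmaster@aclweb.org']
print(found_broader)   # ['james_ngola2002@maktoob.com', 'webmaster@aclweb.org']
```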

## Parsing raw emails: extract subjects

In [10]:
pattern = r"Subject: ([^\n]+)"

subjects = re.findall(pattern, text)
subjects

['URGENT BUSINESS ASSISTANCE AND PARTNERSHIP',
 'URGENT ASSISTANCE /RELATIONSHIP (P)',
 'GOOD DAY TO YOU',
 'GOOD DAY TO YOU',
 'I Need Your Assistance.']

## Parsing raw emails: extract Reply-To

In [11]:
pattern = r"Reply-To: ([^\n]+)"

replytos = re.findall(pattern, text)
replytos

['james_ngola2002@maktoob.com', 'obong_715@epatra.com', 'm_abacha03@www.com']

## Parsing raw emails: extract X- headers

In [12]:
pattern = r"(X-[a-zA-Z_-]+):([^\n]+)"

xheaders = re.findall(pattern, text)
xheaders

[('X-Sieve', ' cmu-sieve 2.0'),
 ('X-Mailer', ' Microsoft Outlook Express 5.00.2919.6900 DM'),
 ('X-MIME-Autoconverted',
  ' from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311'),
 ('X-Sieve', ' cmu-sieve 2.0'),
 ('X-Sieve', ' cmu-sieve 2.0'),
 ('X-Mailer', ' Microsoft Outlook Express 5.00.2919.6900DM'),
 ('X-MIME-Autoconverted',
  ' from quoted-printable to 8bit by sideshowmel.si.UM id g9VMRBW20642'),
 ('X-Sieve', ' cmu-sieve 2.0'),
 ('X-Sieve', ' cmu-sieve 2.0'),
 ('X-Mailer', ' Microsoft Outlook Express 5.00.2919.6900 DM'),
 ('X-MIME-Autoconverted',
  ' from quoted-printable to 8bit by sideshowmel.si.UM id gA19mVW29040')]
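The values above keep the space that follows the colon, and the pattern would also match an "X-" name in the middle of a line. A possible refinement, sketched here under the assumption that each header sits on its own line, anchors the pattern with `^` and `$` under the standard `re.MULTILINE` flag and skips the leading whitespace. The `sample` text is a shortened excerpt of the file preview.

```python
import re

sample = (
    "X-Sieve: cmu-sieve 2.0\n"
    "Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n"
    "X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# [ \t]* consumes the blanks after the colon so they are not captured.
pattern = r"^(X-[a-zA-Z_-]+):[ \t]*(.+)$"
headers = re.findall(pattern, sample, flags=re.MULTILINE)

# A list of (name, value) tuples converts directly to a dictionary;
# repeated header names would keep only the last value.
header_dict = dict(headers)

print(headers)
print(header_dict["X-Sieve"])  # cmu-sieve 2.0
```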
