Regular expressions (regex) are a syntax for searching, extracting and manipulating specific string patterns within a larger text.

A basic example is '\s+'.

Here '\s' matches any whitespace character. Adding '+' at the end makes the pattern match one or more consecutive whitespace characters. So this pattern also matches tab ('\t') and newline ('\n') characters.

In [0]:
import re   
regex = re.compile(r'\s+')  # raw string keeps the backslash intact for the regex engine
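
As a quick sanity check of what '\s+' matches, here is a minimal sketch. The sample string is made up for illustration, and the r'...' raw-string prefix is used so that Python passes the backslash through to the regex engine unchanged.

```python
import re

# Raw string: the regex engine receives the two characters '\' and 's'.
whitespace = re.compile(r'\s+')

# Hypothetical sample containing a run of spaces, a tab and a newline.
sample = "a  b\tc\nd"

# Each run of whitespace is returned as one match.
runs = whitespace.findall(sample)
print(runs)
```

Every element of the result is a run of whitespace, which shows that tabs and newlines count as '\s' just like plain spaces.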

Split a string separated by a regex

In [0]:
text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English""" 

I have three course items in the format of “[Course Number] [Course Code] [Course Name]”. The spacing between the words is not equal.

I want to split these three course items into individual units of numbers and words. How can that be done?

This can be split in two ways:

1. By using the re.split method.

2. By calling the split method of the regex object.

In [0]:
# split the text around 1 or more whitespace characters
re.split(r'\s+', text) 

['101',
 'COM',
 'Computers',
 '205',
 'MAT',
 'Mathematics',
 '189',
 'ENG',
 'English']

In [0]:
regex.split(text)  # preferred when the same pattern is reused

['101',
 'COM',
 'Computers',
 '205',
 'MAT',
 'Mathematics',
 '189',
 'ENG',
 'English']
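
Both calls above flatten the three courses into one list, because the newline is whitespace too. If you would rather keep one list per course, a possible approach (a sketch, not the only way) is to split each line separately. The maxsplit argument, shown at the end, limits how many splits are made.

```python
import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

regex = re.compile(r'\s+')

# Split every line on its own so each course stays in its own list.
rows = [regex.split(line) for line in text.splitlines()]
print(rows)

# maxsplit=2 stops after two splits; the rest of the text is left untouched.
first_two = regex.split(text, maxsplit=2)
print(first_two)
```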

Finding pattern matches using findall, search and match

Let’s suppose you want to extract all the course numbers, that is, the numbers 101, 205 and 189 alone from the above text. How?

re.findall() 
'\d' is a regular expression which matches any digit.

Adding a '+' symbol to it requires at least 1 digit to be present for a match.

Similar to '+', there is a '*' symbol, which matches 0 or more digits, so it can also match where no digit is present.

In [0]:
# find all numbers within the text
print(text)
regex_num = re.compile(r'\d+')
regex_num.findall(text)

101 COM    Computers
205 MAT   Mathematics
189 ENG   English


['101', '205', '189']
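
To make the difference between '+' and '*' concrete, here is a small sketch on a made-up two-character string. Because '*' is satisfied by zero digits, it also reports empty matches at positions where no digit is found.

```python
import re

sample = "a1"  # hypothetical input: one letter followed by one digit

# '+' needs at least one digit, so only the '1' is reported.
plus_matches = re.findall(r'\d+', sample)
print(plus_matches)

# '*' accepts zero digits, so empty strings show up alongside the '1'.
star_matches = re.findall(r'\d*', sample)
print(star_matches)
```

This is why '\d+' is usually the better choice when you actually want numbers back.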

re.search() vs re.match()

regex.search() scans the text and returns the first match of the pattern, wherever it occurs.

regex.match() requires the pattern to be present at the very beginning of the text.

In [0]:
text2 = """COM    Computers
205 MAT   Mathematics 189"""

In [0]:
# compile the regex and search the pattern
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)

In [0]:
print('Starting Position: ', s.start())
print('Ending Position: ', s.end())
print(text2[s.start():s.end()])

Starting Position:  17
Ending Position:  20
205


We can get the same output using the group() method of the match object.

In [0]:
print(s.group())

205


In [0]:
m = regex_num.match(text2)
print(m)  # unlike search, match finds nothing: the text does not start with a digit

None
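
Since search() and match() return None when nothing is found, calling .group() on the result directly can raise an AttributeError. The sketch below guards against that. The helper name first_number is made up for this example, and fullmatch() is included to show the third variant, which requires the whole string to match.

```python
import re

text2 = """COM    Computers
205 MAT   Mathematics 189"""

regex_num = re.compile(r'\d+')

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    found = regex_num.search(text)
    return found.group() if found else None

print(first_number(text2))        # digits exist, but not at the start
print(first_number("no digits"))  # nothing to find

# match() anchors at the start; fullmatch() must consume the entire string.
starts_with_number = regex_num.match("205 MAT") is not None
is_only_number = regex_num.fullmatch("205 MAT") is not None
print(starts_with_number, is_only_number)
```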


Substitute one text with another using regex

To replace text, use regex.sub().

In [0]:
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English""" 

print(text)

101   COM 	  Computers
205   MAT 	  Mathematics
189   ENG  	  English


We want to even out all the extra spaces and put all the words on one single line.

To do this, we just have to use regex.sub to replace the '\s+' pattern with a single space ' '.

In [0]:
# replace one or more whitespace characters with a single space
regex = re.compile(r'\s+')
print(regex.sub(' ', text))

101 COM Computers 205 MAT Mathematics 189 ENG English


In [0]:
# or
print(re.sub(r'\s+', ' ', text))

101 COM Computers 205 MAT Mathematics 189 ENG English


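What if we only want to get rid of the extra spaces but keep each course entry on its own line?

This can be done using a negative lookahead (?!\n). It checks whether the next character is a newline and, if so, prevents the match from starting there. Note that the lookahead only guards the first character of the match: a run of spaces that is followed by a newline would still be consumed along with the newline.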

In [0]:
# get rid of all extra spaces except newline
regex = re.compile(r'((?!\n)\s+)')
print(regex.sub(' ', text))

101 COM Computers
205 MAT Mathematics
189 ENG English
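
The lookahead works here because no line ends in trailing spaces. As a hedged alternative, the negated class [^\S\n] ("whitespace, but not a newline") never touches a newline at all. The sketch below compares the two on a made-up string where the first line does end in spaces.

```python
import re

# Hypothetical input: note the trailing spaces before the newline.
messy = "101   COM \t  Computers   \n205   MAT \t  Mathematics"

# Lookahead version: the match starts on a space, then \s+ swallows the newline.
with_lookahead = re.sub(r'(?!\n)\s+', ' ', messy)
print(repr(with_lookahead))

# Negated class version: the newline can never be part of a match.
with_negated_class = re.sub(r'[^\S\n]+', ' ', messy)
print(repr(with_negated_class))
```

With the negated class the line break survives, at the cost of leaving one space before it; a later strip per line would remove that if needed.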


Regex groups

I want to extract the course number, code and name as separate items. First, without groups:

In [0]:
text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""  

# 1. extract all course numbers
print(re.findall('[0-9]+', text))

# 2. extract all course codes
print(re.findall('[A-Z]{3}', text))

# 3. extract all course names
print(re.findall('[A-Za-z]{4,}', text))

['101', '205', '189']
['COM', 'MAT', 'ENG']
['Computers', 'Mathematics', 'English']


We used 3 separate regular expressions, one each for matching the course number, code and name.

For the course number, the pattern [0-9]+ matches any digit from 0 to 9, and the trailing + makes it look for one or more such digits in a row. If you know the course number always has exactly 3 digits, the pattern could be [0-9]{3} instead.

For the course code, '[A-Z]{3}' matches exactly 3 consecutive uppercase letters A-Z.

For the course name, '[A-Za-z]{4,}' matches 4 or more consecutive upper- or lowercase letters, assuming every course name has at least 4 characters.

To do all of this with a single pattern, use regex groups.

In [0]:
# define the course text pattern groups and extract
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
re.findall(course_pattern, text)

[('101', 'COM', 'Computers'),
 ('205', 'MAT', 'Mathematics'),
 ('189', 'ENG', 'English')]
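
Tuples are fine for three fields, but positions get hard to remember as patterns grow. One option is to name the groups with the (?P<name>...) syntax and read them back with groupdict(). This is a sketch; the group names number, code and name are chosen here for illustration.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# Same pattern as above, with each group given a (hypothetical) name.
course_re = re.compile(
    r'(?P<number>[0-9]+)\s*(?P<code>[A-Z]{3})\s*(?P<name>[A-Za-z]{4,})'
)

# finditer yields match objects, so each one can be turned into a dict.
courses = [m.groupdict() for m in course_re.finditer(text)]
for course in courses:
    print(course)
```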

Greedy matching in regex

Regular expressions are greedy by default. That means a quantifier consumes as much text as it can while still letting the overall pattern match, even when a smaller part would have been sufficient.

In [0]:
text = "< body>Regex Greedy Matching Example < /body>"
re.findall('<.*>', text)

['< body>Regex Greedy Matching Example < /body>']

Instead of matching till the first occurrence of ‘>’, which I was hoping would happen at the end of first body tag itself, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.

Lazy matching, on the other hand, ‘takes as little as possible’. It is enabled by adding a `?` immediately after the quantifier (here, `.*` becomes `.*?`).

In [0]:
re.findall('<.*?>', text)

['< body>', '< /body>']

If you want only the first match to be retrieved, use the search method instead.

In [0]:
re.search('<.*?>', text).group()

'< body>'
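
Another common way to stop at the first '>' is a negated character class: [^>]* can never run past a closing bracket, so it gives the same result without relying on laziness. A small side-by-side sketch on the same text:

```python
import re

text = "< body>Regex Greedy Matching Example < /body>"

greedy = re.findall(r'<.*>', text)      # runs to the last '>'
lazy = re.findall(r'<.*?>', text)       # stops at the first '>' it can
negated = re.findall(r'<[^>]*>', text)  # cannot cross a '>' at all

print(greedy)
print(lazy)
print(negated)
```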

Regular expression syntax and patterns

Regular Expressions Syntax
BASIC SYNTAX

.             One character except new line
\.            A period. \ escapes a special character.
\d            One digit
\D            One non-digit
\w            One word character (letter, digit or underscore)
\W            One non-word character
\s            One whitespace
\S            One non-whitespace
\b            Word boundary
\n            Newline
\t            Tab

MODIFIERS

$             End of string
^             Start of string
ab|cd         Matches ab or cd.
[ab-d]	      One character of: a, b, c, d
[^ab-d]	      One character except: a, b, c, d
()            Items within parentheses are captured as a group
(a(bc))       Nested parentheses are captured as separate groups

REPETITIONS

[ab]{2}       Exactly 2 continuous occurrences of a or b
[ab]{2,5}     2 to 5 continuous occurrences of a or b
[ab]{2,}      2 or more continuous occurrences of a or b
+             One or more
*             Zero or more
?             0 or 1
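
By default, ^ and $ anchor to the start and end of the whole string. With the re.MULTILINE flag they also match at the start and end of every line. A short sketch using the course text from earlier:

```python
import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

# Without the flag, only the very first and very last positions qualify.
first_only = re.findall(r'^\d+', text)
last_only = re.findall(r'\w+$', text)

# With re.MULTILINE, every line gets its own ^ and $.
line_starts = re.findall(r'^\d+', text, flags=re.MULTILINE)
line_ends = re.findall(r'\w+$', text, flags=re.MULTILINE)

print(first_only, last_only)
print(line_starts, line_ends)
```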

Regular Expressions Examples

Any character except for a new line

In [0]:
text = 'https://www.howlongtoreadthis.com'
print(re.findall('.', text))  # .   Any character except for a new line
print(re.findall('...', text))

['h', 't', 't', 'p', 's', ':', '/', '/', 'w', 'w', 'w', '.', 'h', 'o', 'w', 'l', 'o', 'n', 'g', 't', 'o', 'r', 'e', 'a', 'd', 't', 'h', 'i', 's', '.', 'c', 'o', 'm']
['htt', 'ps:', '//w', 'ww.', 'how', 'lon', 'gto', 'rea', 'dth', 'is.', 'com']


A period

In [0]:
text = 'https://www.howlongtoreadthis.com'
print(re.findall(r'\.', text))  # matches a period
print(re.findall(r'[^\.]', text))  # matches anything but a period

['.', '.']
['h', 't', 't', 'p', 's', ':', '/', '/', 'w', 'w', 'w', 'h', 'o', 'w', 'l', 'o', 'n', 'g', 't', 'o', 'r', 'e', 'a', 'd', 't', 'h', 'i', 's', 'c', 'o', 'm']


Any digit

In [0]:
text = '01, Jan 2015'
print(re.findall(r'\d+', text))  # \d  Any digit. The + requires at least 1 digit.

['01', '2015']


Anything but a digit

In [0]:
text = '01, Jan 2015'
print(re.findall(r'\D+', text))  # \D  Anything but a digit

[', Jan ']


Any word character, including digits

In [0]:
text = '01, Jan 2015'
print(re.findall(r'\w+', text))  # \w  Any word character (letter, digit or underscore)

['01', 'Jan', '2015']


Anything but a word character

In [0]:
text = '01, Jan 2015'
print(re.findall(r'\W+', text))  # \W  Anything but a word character

[', ', ' ']


Collection of characters

In [0]:
text = '01, Jan 2015'
print(re.findall('[a-zA-Z]+', text))  # [] Matches any character inside

['Jan']


Match something exactly ‘n’ times

In [0]:
text = '01, Jan 2015'
print(re.findall(r'\d{4}', text))  # {n} Matches exactly n repetitions.
print(re.findall(r'\d{2,4}', text))  # {m,n} Matches between m and n repetitions.

['2015']
['01', '2015']


Match 1 or more occurrences

In [0]:
print(re.findall(r'Co+l', 'So Cooool'))  # Match for 1 or more occurrences

['Cooool']


Match any number of occurrences (0 or more times)

In [0]:
print(re.findall(r'Pi*lani', 'Pilani'))

['Pilani']


Match exactly zero or one occurrence

In [0]:
print(re.findall(r'colou?r', 'color'))

['color']


Match word boundaries

Word boundaries \b are commonly used to detect and match the beginning or end of a word. A boundary is a position where one side is a word character and the other side is not (whitespace, punctuation, or the start or end of the string).

For example, the regex \btoy will match the ‘toy’ in ‘toy cat’ and not in ‘tolstoy’. In order to match the ‘toy’ in ‘tolstoy’, you should use toy\b.

Can you come up with a regex that will match only the first ‘toy’ in ‘play toy broke toys’? (hint: \b on both sides)

Likewise, \B will match any non-boundary.

For example, \Btoy\B will match a ‘toy’ that has word characters on both sides, as in ‘antoynet’.

In [0]:
re.findall(r'\btoy\b', 'play toy broke toys')  # match toy with boundary on both sides

['toy']
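
To see all four combinations of \b and \B at once, the sketch below upper-cases whichever ‘toy’ each pattern matches. The sample sentence is made up so that each variant hits a different word.

```python
import re

words = 'toy antoynet tolstoy toys'  # hypothetical sample

# Replace each matched 'toy' with 'TOY' so the hit is easy to spot.
starts_word = re.sub(r'\btoy', 'TOY', words)    # boundary on the left
ends_word = re.sub(r'toy\b', 'TOY', words)      # boundary on the right
whole_word = re.sub(r'\btoy\b', 'TOY', words)   # boundary on both sides
inside_word = re.sub(r'\Btoy\B', 'TOY', words)  # no boundary on either side

print(starts_word)
print(ends_word)
print(whole_word)
print(inside_word)
```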

Practice Exercises

1. Extract the user id, domain name and suffix from the following email addresses.
emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

desired_output = [('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]

In [0]:
emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com""" 

# Solution
pattern = r'(\w+)@([A-Z0-9]+)\.([A-Z]{2,4})'
re.findall(pattern, emails, flags=re.IGNORECASE)

[('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]

In [0]:
pattern = r'(\w+)@([A-Za-z]+)\.([A-Za-z]{3})'
re.findall(pattern, emails)

[('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]

2. Retrieve all the words starting with ‘b’ or ‘B’ from the following text.

text = """Betty bought a bit of butter, But the butter was so bitter, So she bought some better butter, To make the bitter butter better.""" 

In [0]:
text = """Betty bought a bit of butter, But the butter was so bitter, So she bought some better butter, To make the bitter butter better.""" 

# Solution: 
import re
re.findall(r'\bB\w+', text, flags=re.IGNORECASE)

# '\b' mandates the left of 'B' is a word boundary, effectively requiring the word to start with 'B'. 
# Setting 'flags' arg to 're.IGNORECASE' makes the pattern case insensitive.

['Betty',
 'bought',
 'bit',
 'butter',
 'But',
 'butter',
 'bitter',
 'bought',
 'better',
 'butter',
 'bitter',
 'butter',
 'better']

In [0]:
re.findall(r'\b[Bb]\w+', text)

['Betty',
 'bought',
 'bit',
 'butter',
 'But',
 'butter',
 'bitter',
 'bought',
 'better',
 'butter',
 'bitter',
 'butter',
 'better']

3. Split the following irregular sentence into words

sentence = """A, very   very; irregular_sentence"""

desired_output = "A very very irregular sentence"

In [0]:
sentence = """A, very very; irregular_sentence"""

# Solution
import re
" ".join(re.split(r'[;,\s_]+', sentence))
#> 'A very very irregular sentence'

# Add more delimiters into the pattern as needed.

'A very very irregular sentence'
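
One thing to watch for with re.split: if the text starts or ends with a delimiter, the result contains empty strings at the edges. The sketch below uses a made-up variant of the sentence with leading spaces and a trailing period (and '.' added to the delimiter set) to show the effect and one way to filter it out.

```python
import re

# Hypothetical variant: leading spaces and a trailing period.
sentence = "  A, very   very; irregular_sentence."

parts = re.split(r'[;,\s_.]+', sentence)
print(parts)  # note the empty strings at both ends

# Drop the empty pieces before joining.
words = [w for w in parts if w]
print(" ".join(words))
```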

4. Clean up the following tweet so that it contains only the user’s message. That is, remove all URLs, hashtags, mentions, punctuation, RTs and CCs.

tweet = '''Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'''

desired_output = 'Good advice What I would do differently if I was learning to code today'

In [0]:
tweet = '''Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'''

# Solution
import re
def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'\b(?:RT|cc)\b', '', tweet)  # remove RT and cc as whole words only
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), '', tweet)  # remove punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse extra whitespace
    return tweet.strip()  # drop leading and trailing whitespace

print(clean_tweet(tweet))

Good advice What I would do differently if I was learning to code today


5. Extract all the text portions between the tags from the following HTML page:

https://raw.githubusercontent.com/selva86/datasets/master/sample.html

import requests

r = requests.get("https://raw.githubusercontent.com/selva86/datasets/master/sample.html")

r.text  # html text is contained here

desired_output = ['Your Title Here', 'Link Name', 'This is a Header', 'This is a Medium Header', 'This is a new paragraph! ', 'This is a another paragraph!', 'This is a new sentence without a paragraph break, in bold italics.']

In [0]:
import requests

r = requests.get("https://raw.githubusercontent.com/selva86/datasets/master/sample.html")

r.text

# Solution:
# Note: remove the space after < and /.*> for the pattern to work
re.findall('<.*?>(.*)</.*?>', r.text) 
#> ['Your Title Here', 'Link Name', 'This is a Header', 'This is a Medium Header', 'This is a new paragraph! ', 'This is a another paragraph!', 'This is a new sentence without a paragraph break, in bold italics.']

['Your Title Here',
 'Link Name',
 'This is a Header',
 'This is a Medium Header',
 'This is a new paragraph! ',
 'This is a another paragraph!',
 'This is a new sentence without a paragraph break, in bold italics.']
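
To try this pattern without a network call, here is a sketch on a small inline snippet. The html string is a made-up stand-in for r.text, not the actual page. It also shows a limitation: because (.*) is greedy, two elements on the same line get merged, and a lazy (.*?) avoids that.

```python
import re

# Hypothetical stand-in for r.text so the example runs offline.
html = """<title>Your Title Here</title>
<h1>This is a Header</h1>
<p>This is a new paragraph!</p>"""

# '.' does not match a newline, so each match stays within one line.
texts = re.findall(r'<.*?>(.*)</.*?>', html)
print(texts)

# Two elements on one line: the greedy group runs to the last closing tag.
one_line = "<b>one</b> <i>two</i>"
greedy_group = re.findall(r'<.*?>(.*)</.*?>', one_line)
lazy_group = re.findall(r'<.*?>(.*?)</.*?>', one_line)
print(greedy_group)
print(lazy_group)
```

For anything beyond simple, well-behaved markup, a real HTML parser is usually a safer choice than regex.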