<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Demo 9.2: Regular Expressions

INSTRUCTIONS:

- Run the cells
- Observe and understand the results
- Answer the questions

# Python Regular Expressions: A Simplified Guide
A regular expression, also called a regex, is a small pattern language for searching, extracting and manipulating specific string patterns in a larger text. It is widely used in projects that involve text validation, NLP and text mining.

Based on the blog post [**Python Regular Expressions Tutorial and Examples: A Simplified Guide**](https://www.machinelearningplus.com/python/python-regex-tutorial-examples/) on **Machine Learning Plus**

## Contents
1. Introduction to regular expressions
2. What is a regex pattern and how to compile one?
3. How to split a string separated by a regex?
4. Finding pattern matches using `findall()`, `search()` and `match()`
    - What does `regex.findall()` do?
    - `regex.search()` vs `regex.match()`
5. How to substitute one text with another using regex?
6. Regex groups
7. What is greedy matching in regex?
8. Most common regular expression syntax and patterns
9. Regular Expressions Examples

## 1. Introduction to regular expressions
Regular expressions, also called **regex**, are implemented in almost every programming language. In Python, they are provided by the standard module `re`.

They are widely used in natural language processing, in web applications that validate string input (such as email addresses) and in most data science projects that involve text mining.

Before getting to the regular expressions syntax, it is better to first understand how the `re` module works.

Let's first introduce the 5 main features of the `re` module and then see how to create commonly used regular expressions in Python.

## 2. What is a regex pattern and how to compile one?
A **regex pattern** is a special language used to represent generic text, numbers or symbols so it can be used to extract texts that conform to that pattern.

A basic example is `\s+`.

Here `\s` matches any single whitespace character. Adding `+` at the end makes the pattern match **one or more** whitespace characters, so it also matches **tab** (`\t`) and newline (`\n`) characters, not just spaces.

A larger list of regex patterns comes at the end. But before getting to that, let’s see how to compile and play with regular expressions.

In [1]:
# import the standard-library re module
# NOTE: the third-party regex package is a largely compatible alternative
#      (it has to be installed as it is not part of the base libraries)
import re

# raw string, so the backslash reaches the regex engine unchanged
spaces = re.compile(r'\s+')

The above code imports the `re` module and compiles a regular expression pattern that matches one or more whitespace characters.

## 3. How to split a string separated by a regex?
Let’s consider the following piece of text.

In [2]:
text = '''101 COM   Computers
205 MAT Mathematics
189 ENG  English'''

print('Raw content:\n- - - - - - - - - -\n%r\n- - - - - - - - - -' % text)
print('\nText:\n- - - - - - - - - -\n%s\n- - - - - - - - - -' % text)

Raw content:
- - - - - - - - - -
'101 COM   Computers\n205 MAT Mathematics\n189 ENG  English'
- - - - - - - - - -

Text:
- - - - - - - - - -
101 COM   Computers
205 MAT Mathematics
189 ENG  English
- - - - - - - - - -


The difference between `%s` and `%r` is that `%s` uses the `str()` function and `%r` uses the `repr()` function. For built-in types, the biggest difference in practice is that `repr()` for strings includes the quotes and escapes special characters such as `\n` and `\t`.
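As a small illustration of the `str`/`repr` difference, here is a minimal sketch. The sample string `s` is made up for this example and is not part of the original notebook.

```python
# Hypothetical sample string containing a real tab character.
s = 'tab\there'

as_str = '%s' % s     # formatted with str(): the tab stays a tab
as_repr = '%r' % s    # formatted with repr(): quotes added, tab shown as \t

print(as_str)
print(as_repr)

# The f-string conversion !r is equivalent to %r.
print(f'{s!r}')
```

Printing with `%r` is handy while debugging regular expressions, because invisible whitespace becomes visible.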

There are three course items in the format `[Course Number] [Course Code] [Course Name]`. The spacing between the words is **not equal**.

How can these three course items be split into individual units of numbers and words?

This can be split in two ways:
- By using the `re.split()` method
- By calling the `split()` method of the `spaces` object

In [3]:
# split the text around 1 or more space characters
# \s matches ANY ONE whitespace character. For ASCII, whitespace characters are [ \t\n\r\f\v]
# + is an occurrence indicator, i.e. one or more (1+)
print(re.split(r'\s+', text))

# or
print(spaces.split(text))

['101', 'COM', 'Computers', '205', 'MAT', 'Mathematics', '189', 'ENG', 'English']
['101', 'COM', 'Computers', '205', 'MAT', 'Mathematics', '189', 'ENG', 'English']


So both methods work. But which one to use in practice?

When using a particular pattern multiple times, it is better to compile a regular expression rather than using `re.split()` over and over again.
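The reuse of a compiled pattern can be sketched as follows. The list `rows` is a hypothetical stand-in for several separate strings that all need the same split; it is not from the original notebook.

```python
import re

# Compile once, then reuse the same pattern object on several strings.
spaces = re.compile(r'\s+')

# Hypothetical rows in the same format as the course text.
rows = ['101 COM   Computers', '205 MAT Mathematics', '189 ENG  English']

split_rows = [spaces.split(row) for row in rows]
for parts in split_rows:
    print(parts)
```

The `re` module also keeps an internal cache of recently compiled patterns, so the main benefit of compiling is often readability: the pattern gets a name and is defined in one place.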

## 4. Finding pattern matches using `findall()`, `search()` and `match()`
How can all the course numbers, that is, only the numbers `101`, `205` and `189`, be extracted from the above text?

### 4.1 What does `re.findall()` do?

In [4]:
# find all numbers within the text
print('Text:\n- - - - - - - - - -\n%s\n- - - - - - - - - -\n' % text)

regex_num = re.compile(r'\d+')
print('Numbers:', regex_num.findall(text))

Text:
- - - - - - - - - -
101 COM   Computers
205 MAT Mathematics
189 ENG  English
- - - - - - - - - -

Numbers: ['101', '205', '189']


In the above code, the special sequence `\d` matches any digit.

Adding the `+` symbol requires **at least 1** digit for a match.

Similar to `+`, there is a `*` symbol which matches **0 or more** digits. It effectively makes the presence of a digit optional.

Finally, the `findall()` method extracts all occurrences of the 1 or more digits from the text and returns them in a `list`.
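The difference between `+` and `*` is easiest to see side by side. The following is a minimal sketch; the string `sample` and the `ID` prefix are invented for illustration.

```python
import re

# Hypothetical sample text: three IDs, the first one without any digits.
sample = 'ID ID7 ID42'

# \d* allows zero digits, so the bare 'ID' also matches.
zero_or_more = re.findall(r'ID\d*', sample)

# \d+ needs at least one digit, so the bare 'ID' is skipped.
one_or_more = re.findall(r'ID\d+', sample)

print(zero_or_more)  # ['ID', 'ID7', 'ID42']
print(one_or_more)   # ['ID7', 'ID42']
```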

### 4.2 `re.search()` vs `re.match()`
As the name suggests, `re.search()` searches for the pattern in a given text.

But unlike `findall()` which returns the matched portions of the text as a list, `re.search()` returns a particular **match object** that contains the starting and ending positions of the first occurrence of the pattern.

Likewise, `re.match()` also returns a match object. But the difference is, it requires the pattern to be present **at the beginning** of the text itself.

In [5]:
# define the text
text2 = '''COM    Computers
205 MAT   Mathematics 189'''

# compile the regex and search the pattern
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)

print('Starting Position:', s.start())
print('Ending Position  :', s.end())
print('Content          :', text2[s.start():s.end()])

Starting Position: 17
Ending Position  : 20
Content          : 205


Alternatively, the same output is produced by the `group()` method of the match object.

In [6]:
print('Content          :', s.group())

Content          : 205


The method `match()` cannot find the number as it is not at the beginning of the string.

In [7]:
m = regex_num.match(text2)
print('Content          :', m)

Content          : None
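Because both `search()` and `match()` return `None` when nothing is found, calling `.group()` on the result without a check raises an `AttributeError`. One possible way to guard against that is sketched below; the helper `first_number` and its `anchored` parameter are hypothetical names introduced only for this example.

```python
import re

regex_num = re.compile(r'\d+')
text2 = 'COM    Computers\n205 MAT   Mathematics 189'

def first_number(pattern, text, anchored=False):
    """Return the first matched text, or None when there is no match."""
    # match() only looks at the start of the string, search() scans all of it.
    m = pattern.match(text) if anchored else pattern.search(text)
    return m.group() if m is not None else None

print(first_number(regex_num, text2))                 # 205
print(first_number(regex_num, text2, anchored=True))  # None
```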


## 5. How to substitute one text with another using regex?
There is the method `re.sub()` to replace texts.

The following modified version of the courses text has an extra tab after each course code.

In [8]:
# define the text
text = '''101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English'''
print('Raw content:\n- - - - - - - - - -\n%r\n- - - - - - - - - -' % text)
print('\nText:\n- - - - - - - - - -\n%s\n- - - - - - - - - -' % text)

Raw content:
- - - - - - - - - -
'101   COM \t  Computers\n205   MAT \t  Mathematics\n189   ENG  \t  English'
- - - - - - - - - -

Text:
- - - - - - - - - -
101   COM 	  Computers
205   MAT 	  Mathematics
189   ENG  	  English
- - - - - - - - - -


From the above `text`, how to even out all the extra spaces and put all the words in one single line?

This can be done with the method `sub()` to replace the `\s+` pattern with a single space (` `).

In [9]:
# replace one or more spaces with single space
regex = re.compile(r'\s+')
print(regex.sub(' ', text))

# or
print(re.sub(r'\s+', ' ', text))

101 COM Computers 205 MAT Mathematics 189 ENG English
101 COM Computers 205 MAT Mathematics 189 ENG English


How to get rid of the extra spaces but keep each course entry on its own line?

That requires a regex that excludes newline characters but includes all other whitespace.

This can be done using a **negative lookahead** `(?!\n)`. It checks that the next character is not a newline, without consuming it, so the whitespace match cannot begin at a newline.

In [10]:
print('Raw content:\n- - - - - - - - - -\n%r\n- - - - - - - - - -' % text)

Raw content:
- - - - - - - - - -
'101   COM \t  Computers\n205   MAT \t  Mathematics\n189   ENG  \t  English'
- - - - - - - - - -


In [11]:
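# collapse every run of whitespace except newline into a single space
# the lookahead is applied to each character, so a run can never swallow a newline
regex = re.compile(r'(?:(?!\n)\s)+')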
print('Text:\n- - - - - - - - - -\n%s\n- - - - - - - - - -' % regex.sub(' ', text))

Text:
- - - - - - - - - -
101 COM Computers
205 MAT Mathematics
189 ENG English
- - - - - - - - - -


In [12]:
print('Text:\n- - - - - - - - - -\n%r\n- - - - - - - - - -' % regex.sub(' ', text))

Text:
- - - - - - - - - -
'101 COM Computers\n205 MAT Mathematics\n189 ENG English'
- - - - - - - - - -
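An alternative to the lookahead, shown here only as a sketch, is a negated character class. `[^\S\n]` reads as "not a non-whitespace character and not a newline", which leaves every whitespace character except `\n`. The string `messy` is a hypothetical variant of the course text with trailing whitespace before each newline.

```python
import re

# Hypothetical variant of the course text with trailing whitespace on each line.
messy = '101   COM \t  Computers  \n205   MAT \t  Mathematics \t\n189   ENG  \t  English'

# Any whitespace character except newline, one or more times.
no_newline_ws = re.compile(r'[^\S\n]+')

cleaned = no_newline_ws.sub(' ', messy)
print(repr(cleaned))
```

Each run of spaces and tabs collapses to a single space, and both newlines survive, even where whitespace sits directly in front of them.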


## 6. Regex groups
Regular expression groups are a very useful feature that allows the desired parts of a match to be extracted as individual items.

Suppose the goal is to extract the course number, code and name as separate items.

Without groups, this requires code like the following.

In [13]:
text = '''101   COM   Computers
205   MAT   Mathematics
189   ENG    English'''

# 1. extract all course numbers
print('Number:', re.findall('[0-9]+', text))

# 2. extract all course codes
# {} is an occurrence indicator where {m} implies exactly m times
print('Code  :', re.findall('[A-Z]{3}', text))

# 3. extract all course names
# {} is an occurrence indicator where {m,} implies m or more (m+)
print('Name  :', re.findall('[A-Za-z]{4,}', text))

Number: ['101', '205', '189']
Code  : ['COM', 'MAT', 'ENG']
Name  : ['Computers', 'Mathematics', 'English']


There are three separate regular expressions: one for matching the course number, one for the code and one for the name.

For the course number, the pattern `[0-9]+` matches any digit from `0` to `9`. Adding a `+` symbol at the end makes it look for 1 or more consecutive digits. If the course number always has exactly 3 digits, the pattern could be `[0-9]{3}` instead.

For the course code, `[A-Z]{3}` matches exactly 3 consecutive capital letters `A-Z`.

For the course name, `[A-Za-z]{4,}` matches upper and lower case letters, assuming every course name has at least 4 characters.

QUESTION: What would be the pattern if the maximum limit of characters in course name is say, 20?

Now the code has three separate lines to get the individual items.

But there is a better way: **Regex Groups**.

Since all the entries have the same format, a single unified pattern can describe the entire course entry, with each part to extract placed inside a pair of parentheses `()`.

In [14]:
# define the course text pattern groups and extract
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
#                  (------)   (--------)   (------------)
re.findall(course_pattern, text)

[('101', 'COM', 'Computers'),
 ('205', 'MAT', 'Mathematics'),
 ('189', 'ENG', 'English')]

Notice that the patterns for the course number `[0-9]+`, code `[A-Z]{3}` and name `[A-Za-z]{4,}` are each placed inside parentheses `()` in order to form the groups.
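Groups can also be read from individual match objects. The sketch below uses `finditer()`, which yields one match object per course entry, together with `group(n)` and `groups()`. The dictionary keys `number`, `code` and `name` are chosen only for this example.

```python
import re

text = '''101   COM   Computers
205   MAT   Mathematics
189   ENG    English'''

course_pattern = re.compile(r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})')

courses = []
for m in course_pattern.finditer(text):
    # groups() returns all captured groups as a tuple; group(0) is the whole match.
    number, code, name = m.groups()
    courses.append({'number': number, 'code': code, 'name': name})
    print(m.group(1), m.group(2), m.group(3))
```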

## 7. What is greedy matching in regex?
The default behaviour of regular expression quantifiers is to be **greedy**. That means they consume as much text as possible while still allowing the overall pattern to match, even when a shorter match would have been sufficient.

Let’s see an example with an HTML-like tag.

In [15]:
text = '< body>Regex Greedy Matching Example < /body>'

# "dot" matches any character except a newline
# * causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible
re.findall('<.*>', text)

['< body>Regex Greedy Matching Example < /body>']

Instead of matching until the **first** occurrence of `>`, it extracted the **whole** string.

This is the default greedy or **take it all** behaviour of regex.

Lazy matching, on the other hand, takes **as little as possible**. This is done by adding a `?` directly after the quantifier.

The `*`, `+` and `?` quantifiers are all greedy; they match as much text as possible. Adding `?` after the quantifier (`*?`, `+?`, `??`) makes it perform the match in a non-greedy or minimal fashion; as few characters as possible will be matched.

In [16]:
re.findall('<.*?>', text)

['< body>', '< /body>']

To get only the **first** match, use the `search()` method instead.

In [17]:
re.search('<.*?>', text).group()

'< body>'
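Another common way to stop at the first `>`, offered here only as an alternative sketch, is a negated character class. `[^>]*` can never run past a `>`, so it gives the same result as the lazy version on this text even though the quantifier itself is greedy.

```python
import re

text = '< body>Regex Greedy Matching Example < /body>'

greedy = re.findall(r'<.*>', text)        # runs to the last '>'
lazy = re.findall(r'<.*?>', text)         # stops at the first '>'
negated = re.findall(r'<[^>]*>', text)    # cannot cross a '>' at all

print(greedy)
print(lazy)
print(negated)
```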

## 8. Most common regular expression syntax and patterns
Some commonly used wildcard patterns.

## Regular Expressions Syntax

### Basic Syntax
    .          One character except new line
    \.         A period. \ escapes a special character.
    \d         One digit
    \D         One non-digit
    \w         One word character including digits
    \W         One non-word character
    \s         One whitespace
    \S         One non-whitespace
    \b         Word boundary
    \n         Newline
    \t         Tab

### Modifiers
    $          End of string
    ^          Start of string
    ab|cd      Matches ab or cd.
    [ab-d]     One character of: a, b, c, d
    [^ab-d]    One character except: a, b, c, d
    ()         Items within parentheses are retrieved
    (a(bc))    Items within the sub-parentheses are retrieved

### Repetitions
    [ab]{2}    Exactly 2 consecutive occurrences of a or b
    [ab]{2,5}  2 to 5 consecutive occurrences of a or b
    [ab]{2,}   2 or more consecutive occurrences of a or b
    +          One or more
    *          Zero or more
    ?          0 or 1
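Several entries in the tables above (anchors, alternation, character sets and nested groups) can be tried out directly. This is a minimal sketch; the string `line` is invented for illustration.

```python
import re

# Hypothetical sample string, chosen only for illustration.
line = 'cat bat rat'

first_word = re.findall(r'^\w+', line)      # ^ anchors at the start
last_word = re.findall(r'\w+$', line)       # $ anchors at the end
either = re.findall(r'cat|rat', line)       # alternation: cat or rat
in_set = re.findall(r'[bc]at', line)        # b or c, followed by 'at'
not_in_set = re.findall(r'[^bc]at', line)   # any character except b or c, then 'at'
nested = re.findall(r'(a(bc))', 'abc')      # outer and inner group together

print(first_word, last_word, either)
print(in_set, not_in_set)
print(nested)
```

With more than one group in the pattern, `findall()` returns a list of tuples, one tuple per match, with the outer group listed before the inner one.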

## 9. Regular Expressions Examples

### 9.1. Any character except for a new line

In [18]:
text = 'website.com'
# .   Any character except for a new line
print('Any 1 character :', re.findall('.', text))
print('\nAny 3 characters:', re.findall('...', text))

Any 1 character : ['w', 'e', 'b', 's', 'i', 't', 'e', '.', 'c', 'o', 'm']

Any 3 characters: ['web', 'sit', 'e.c']


### 9.2. A period

In [19]:
text = 'website.com'
# matches a period
print('Find a period                     :', re.findall(r'\.', text))

# matches anything but a period
print('\nFind any character except a period:', re.findall(r'[^\.]', text))

Find a period                     : ['.']

Find any character except a period: ['w', 'e', 'b', 's', 'i', 't', 'e', 'c', 'o', 'm']


### 9.3. Any digit

In [20]:
text = '01, Jan 2015'
# \d  Any digit. The + mandates at least 1 digit.
print(re.findall(r'\d+', text))

['01', '2015']


### 9.4. Anything but a digit

In [21]:
text = '01, Jan 2015'
# \D  Anything but a digit
print(re.findall(r'\D+', text))

[', Jan ']


### 9.5. Any word character, including digits

In [22]:
text = '01, Jan 2015'
# \w  Any word character (letter, digit or underscore)
print(re.findall(r'\w+', text))

['01', 'Jan', '2015']


### 9.6. Anything but a word character

In [23]:
text = '01, Jan 2015'
# \W  Anything but a word character
print(re.findall(r'\W+', text))

[', ', ' ']


### 9.7. Collection of characters

In [24]:
text = '01, Jan 2015'
# [] Matches any character inside
print(re.findall('[a-zA-Z]+', text))

['Jan']


### 9.8. Match something exactly 'n' times, or between 'm' and 'n' times

In [25]:
text = '01, Jan 2015'
# {n} matches exactly n repetitions; {m,n} matches between m and n repetitions
print(re.findall(r'\d{4}', text))
print(re.findall(r'\d{2,4}', text))

['2015']
['01', '2015']


### 9.9. Match 1 or more occurrences

In [26]:
# Match for 1 or more occurrences
print(re.findall(r'Co+l', 'So Cooool'))

['Cooool']


### 9.10. Match any number of occurrences (0 or more times)

In [27]:
print(re.findall(r'Pi*lani', 'Pilani'))

['Pilani']


### 9.11. Match exactly zero or one occurrence

In [28]:
print(re.findall(r'colou?r', 'colour'))

['colour']


### 9.12. Match word boundaries
Word boundaries `\b` are commonly used to detect and match the beginning or end of a word. That is, one side is a word character and the other side is a non-word character (such as whitespace or punctuation) or the start or end of the string.

For example, the regex `\btoy` will match the `toy` in `toy cat` and not in `tolstoy`. In order to match the `toy` in `tolstoy`, you should use `toy\b`.

QUESTION: Come up with a regex that will match only the first `toy` in `play toy broke toys`. (hint: `\b` on both sides)

Likewise, `\B` will match any non-boundary.

For example, `\Btoy\B` will match `toy` surrounded by word characters on both sides, as in `antoynet`.

In [29]:
# match toy with boundary on both sides
re.findall(r'\btoy\b', 'play toy broke toys')

['toy']
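The three cases discussed above (`\b` on the left, `\b` on the right and `\B` on both sides) can be compared on one string. This is a sketch; the string `words` is made up, and `\w*` is added around `toy` only so that the output shows which whole word each pattern hit.

```python
import re

# Hypothetical sample containing 'toy' in four different positions.
words = 'toy tolstoy antoynet toys'

starts_with = re.findall(r'\btoy\w*', words)      # boundary before 'toy'
ends_with = re.findall(r'\w*toy\b', words)        # boundary after 'toy'
inside = re.findall(r'\w*\Btoy\B\w*', words)      # no boundary on either side

print(starts_with)  # ['toy', 'toys']
print(ends_with)    # ['toy', 'tolstoy']
print(inside)       # ['antoynet']
```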

© 2020 Institute of Data