# Regular expressions

Based on Chapter 9 of \[AM\] and Section 8.12 in \[DD\]:
- \[AM\] = AdditionalMaterial.pdf
- \[DD\] = Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The Cloud by Paul Deitel and Harvey Deitel. Pearson, 2020. This book is available through the campus book store.


Regular expressions (regex for short) are a language for searching, extracting and manipulating specific string patterns in a larger text. They are widely used in projects that involve text validation, natural language processing (NLP) and text mining.

## Dealing with the backslash `\`

A backslash is an escape character:

In [1]:
print("Hello\nthis statement is printed on a newline")

Hello
this statement is printed on a newline


What if you want to print the escape character? Just escape it!

In [2]:
print("This statement prints \\n")

This statement prints \n


Or use an r-string (raw string). A raw string does not interpret a backslash as an escape character; it treats it as a literal backslash.

In [3]:
print(r"This statement prints \n")

This statement prints \n


In [3]:
print(r"This statement prints \\n")

This statement prints \\n
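
One way to see the difference between the two kinds of literal is to compare string lengths. The following is a minimal sketch; the variable names `normal` and `raw` are chosen only for illustration.

```python
# Compare a regular string literal with a raw string literal.
normal = "a\nb"   # backslash-n is translated into one newline character
raw = r"a\nb"     # backslash and n stay as two separate characters

print(len(normal))       # 3
print(len(raw))          # 4
print(raw == "a\\nb")    # True: the raw string equals the escaped form
```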


## A regex pattern

A regex pattern is a small special-purpose language that describes text, numbers or symbols generically, so that it can be used to extract the parts of a text that conform to the pattern.
A basic example is `r'\s+'`. Here `\s` matches any whitespace character, and the `+` after it makes the pattern match one or more of them in a row. Because tabs (`'\t'`) and newlines (`'\n'`) are whitespace too, the pattern matches them as well.
A larger list of regex patterns comes later. Before getting to that, let's see how to compile and use regular expressions.

In [9]:
"""Import the re package and compile a regular expression pattern
   that matches one or more whitespace characters."""

import re
regex_ws = re.compile(r"\s+")  # raw string keeps the backslash intact

In [10]:
text = """101 COM    Computers 
205 MAT  Mathematics
189 ENG   English"""

In [11]:
# split text on all whitespace characters
regex_ws.split(text)

['101',
 'COM',
 'Computers',
 '205',
 'MAT',
 'Mathematics',
 '189',
 'ENG',
 'English']

In [12]:
# another possibility: call the module-level function without compiling first
re.split(r"\s+", text)

['101',
 'COM',
 'Computers',
 '205',
 'MAT',
 'Mathematics',
 '189',
 'ENG',
 'English']
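
The two calls above are interchangeable. The sketch below, which uses a made-up `sample` string, checks that the compiled pattern and the module-level function give the same result and that `\s+` treats tabs and newlines like spaces.

```python
import re

# Hypothetical sample mixing spaces, a tab and a newline.
sample = "101 \t COM\n205   MAT"

regex_ws = re.compile(r"\s+")              # compile once, reuse many times

parts_compiled = regex_ws.split(sample)
parts_module = re.split(r"\s+", sample)    # same result without compiling first

print(parts_compiled)                      # ['101', 'COM', '205', 'MAT']
print(parts_compiled == parts_module)      # True
```

Compiling is mainly a convenience when the same pattern is used many times.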

## Finding patterns 

### `findall`
`findall` returns the matched portions of the text as a list

In [13]:
# find all whitespace

print(text)
regex_ws.findall(text)

101 COM    Computers 
205 MAT  Mathematics
189 ENG   English


[' ', '    ', ' \n', ' ', '  ', '\n', ' ', '   ']

In [41]:
# \d matches a decimal digit
regex_decimal = re.compile(r'\d+')

In [40]:
regex_decimal.findall(text)

['101', '205', '189']

### `search`

As the name suggests, `search` searches for the pattern in a given text. Unlike `findall`, which returns the matched portions of the text as a list, `search` returns a match object that contains the starting and ending positions of the *first* occurrence of the pattern (or `None` if there is no match).

In [16]:
# returns a match object m
m = regex_decimal.search(text)

In [17]:
print('Starting Position: ', m.start()) 
print('Ending Position: ', m.end()) 
print(text[m.start():m.end()])

Starting Position:  0
Ending Position:  3
101


In [18]:
print(m.group())

101


In [19]:
# \w matches any alphanumeric character or underscore
# . matches any single character except newline

m = re.search(r'\d+.\w+', text)
print('Starting Position: ', m.start()) 
print('Ending Position: ', m.end()) 
print(m.group())

Starting Position:  0
Ending Position:  7
101 COM


`regex.match()` also returns a match object. The difference is that it requires the pattern to match at the very beginning of the text; otherwise it returns `None`.
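
The difference between `search` and `match` can be sketched as follows. The `line` string is a hypothetical example in which the number does not come first.

```python
import re

line = "COM 101 Computers"            # hypothetical sample text

found = re.search(r"\d+", line)       # scans the whole string
at_start = re.match(r"\d+", line)     # only tries position 0
code = re.match(r"[A-Z]+", line)      # succeeds: letters are at position 0

print(found.group())   # 101
print(at_start)        # None
print(code.group())    # COM
```

Because `match` returns `None` on failure, it is worth checking the result before calling `.group()` on it.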

### `sub`

With `sub` you can replace text.

In [20]:
text_with_tab = """101   COM \t Computers
205  MAT \t Mathematics
189     ENG \t English"""
print(text_with_tab)

101   COM 	 Computers
205  MAT 	 Mathematics
189     ENG 	 English


The goal is to even out the spacing.  
First try:

In [21]:
print(regex_ws.sub(' ', text_with_tab))

101 COM Computers 205 MAT Mathematics 189 ENG English


Problem: `\s` also matches `\n`

In [22]:
# what is the whitespace that is found?

regex_ws.findall(text_with_tab)

['   ', ' \t ', '\n', '  ', ' \t ', '\n', '     ', ' \t ']

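In [23]:
# match runs of whitespace that contain no newline
regex = re.compile(r'(?:(?!\n)\s)+')

`(?!expression)` is a negative lookahead: a zero-length assertion that succeeds only if `expression` does *not* match at the current position. It consumes no characters.

https://www.regular-expressions.info/lookaround.html

`(?!\n)\s` therefore matches a single whitespace character that is not `\n`. Wrapping it in a non-capturing group `(?:...)` followed by `+` applies the check to every character of the run. Writing `(?!\n)\s+` instead would test only the first character, so a run such as `' \n'` would still be matched and its newline replaced.

In [24]:
regex.findall(text_with_tab)

['   ', ' \t ', '  ', ' \t ', '     ', ' \t ']

In [25]:
print(regex.sub(' ', text_with_tab))

101 COM Computers
205 MAT Mathematics
189 ENG English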
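
An alternative that avoids lookahead altogether is a negated character class: `[^\S\n]` reads as "not a non-whitespace character and not a newline", that is, any whitespace except `\n`. The following is a minimal sketch with a made-up `sample` string that has a trailing space before the line break; `regex_no_nl` is a name chosen for illustration.

```python
import re

# Hypothetical sample; note the space right before the newline.
sample = "101   COM \t Computers \n205  MAT \t Mathematics"

regex_no_nl = re.compile(r"[^\S\n]+")   # whitespace runs without newlines
cleaned = regex_no_nl.sub(" ", sample)

print(cleaned)
# The newline survives; each run of spaces and tabs becomes one space.
```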


## Regex Groups

In [26]:
print(text)

101 COM    Computers 
205 MAT  Mathematics
189 ENG   English


In [27]:
# Notice the patterns for the course 
# num: [0-9]+         at least one digit
# code: [A-Z]{3}      exactly 3 capital letters
# name: [A-Za-z]{4,}  at least 4 letters

# Each part is placed inside parentheses () to form a group.

course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'

re.findall(course_pattern , text)

[('101', 'COM', 'Computers'),
 ('205', 'MAT', 'Mathematics'),
 ('189', 'ENG', 'English')]
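
Groups are also available on a match object, by number or by name. The sketch below rewrites the course pattern with named groups using the `(?P<name>...)` syntax; the group names `num`, `code` and `name` are chosen here for illustration.

```python
import re

# Same course pattern, with each group given a (hypothetical) name.
course_pattern = r"(?P<num>[0-9]+)\s*(?P<code>[A-Z]{3})\s*(?P<name>[A-Za-z]{4,})"

m = re.search(course_pattern, "101 COM    Computers")

print(m.group(0))        # whole match: 101 COM    Computers
print(m.group(1))        # 101
print(m.group("code"))   # COM
print(m.groups())        # ('101', 'COM', 'Computers')
```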

## Greedy matching

In [28]:
# The default behavior of regular expressions is to be greedy:
# Take as much as possible
text = "<body>Regex Greedy Matching Example </body>" 
re.findall('<.*>', text) 

['<body>Regex Greedy Matching Example </body>']

In [30]:
# Lazy matching, on the other hand, takes as little as possible.
# Add a '?' directly after the quantifier (here '*') to make it lazy.
re.findall('<.*?>', text) 

['<body>', '</body>']
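
The same applies to other quantifiers such as `+`. The sketch below uses a hypothetical `html` string and also shows a common alternative to lazy matching, a negated character class that cannot run past the closing `>`.

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # hypothetical sample

greedy = re.findall(r"<.+>", html)       # runs to the last '>'
lazy = re.findall(r"<.+?>", html)        # stops at the first '>'
negated = re.findall(r"<[^>]+>", html)   # '>' cannot be inside the tag

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(negated)   # ['<b>', '</b>', '<i>', '</i>']
```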

### Metacharacters, Character Classes and Quantifiers
* Regular expressions typically contain various special symbols called **metacharacters**:

| Regular expression metacharacters|
| --------
| `[]  `  `{}  `  `()  `  `\  `  `*  `  `+  `  `^  `  `$  `  `?  `  `.  `  `\|`

- ^ Matches the beginning of the string (and of each line when the `re.MULTILINE` flag is set).
- $ Matches the end of the string (and of each line when the `re.MULTILINE` flag is set).
- . Matches any single character except newline. The `re.S` (alias `re.DOTALL`) flag allows it to match newline as well.
- \[...\] Matches any single character in brackets.
- \[^...\] Matches any single character not in brackets.
- \* Matches 0 or more occurrences of the preceding expression.
- \+ Matches 1 or more occurrences of the preceding expression.
- \? Matches 0 or 1 occurrence of the preceding expression.
- {n} Matches exactly n occurrences of the preceding expression.
- {n,} Matches n or more occurrences of the preceding expression.
- {n,m} Matches at least n and at most m occurrences of the preceding expression (written without a space after the comma).
- a|b Matches either a or b.
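
A few of these metacharacters in action, as a minimal sketch. The sample strings and variable names are made up for illustration.

```python
import re

lines = "cat\ncar\ncart\ndog"   # hypothetical sample

# ^ and $ anchor to each line only when re.MULTILINE is set
anchored = re.findall(r"^ca.$", lines, flags=re.MULTILINE)

# {n,m}: between 2 and 3 digits; findall scans left to right
digits = re.findall(r"\d{2,3}", "7 42 123 12345")

# ? makes the preceding item optional; | chooses between alternatives
optional = re.findall(r"colou?r", "color colour")
either = re.findall(r"cat|dog", lines)

print(anchored)   # ['cat', 'car']
print(digits)     # ['42', '123', '123', '45']
print(optional)   # ['color', 'colour']
print(either)     # ['cat', 'dog']
```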


The **`\` metacharacter** begins each predefined **character class**. Each class matches a specific set of characters:

| Character class| Matches|
| ------------| ------------
| `\d`| Any digit (0–9).
| `\D`	| Any character that is _not_ a digit.
| `\s`	| Any whitespace character (such as spaces, tabs and newlines).
| `\S`	| Any character that is _not_ a whitespace character.
| `\w`	| Any **word character** (also called an **alphanumeric character**)—that is, any uppercase or lowercase letter, any digit or an underscore
| `\W`	| Any character that is _not_ a word character.
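
The classes in the table can be tried out on a short string. This is a minimal sketch; the `sample` string is hypothetical.

```python
import re

sample = "Room_42: 3 pm!"   # hypothetical sample

digits = re.findall(r"\d", sample)      # single digits
words = re.findall(r"\w+", sample)      # runs of word characters
non_words = re.findall(r"\W", sample)   # everything \w does not match
chunks = re.findall(r"\S+", sample)     # runs of non-whitespace

print(digits)      # ['4', '2', '3']
print(words)       # ['Room_42', '3', 'pm']
print(non_words)   # [':', ' ', ' ', '!']
print(chunks)      # ['Room_42:', '3', 'pm!']
```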

### Custom Character Classes
* Square brackets, `[]`, define a **custom character class** that matches a **single** character
* `[aeiou]` matches a lowercase vowel
* `[A-Z]` matches an uppercase letter
* `[a-z]` matches a lowercase letter 
* `[a-zA-Z]` matches any lowercase or uppercase letter
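
A short sketch of custom character classes, including a negated class with `^` as its first character. The `phrase` string is only an example.

```python
import re

phrase = "Regular Expressions"   # example input

vowels = re.findall(r"[aeiou]", phrase)         # lowercase vowels only
capitals = re.findall(r"[A-Z]", phrase)         # uppercase letters
non_letters = re.findall(r"[^a-zA-Z]", phrase)  # anything that is not a letter

print(vowels)        # ['e', 'u', 'a', 'e', 'i', 'o']
print(capitals)      # ['R', 'E']
print(non_letters)   # [' ']
```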

[https://regex101.com](https://regex101.com)

# Resources

- [re — Regular expression operations — Python documentation](https://docs.python.org/3/library/re.html?highlight=regular%20expressions)
- [Regular Expressions: Regexes in Python (Part 1) – Real Python](https://realpython.com/regex-python/)
- [https://regex101.com](https://regex101.com)

## The problem with backslashes

Example taken from [https://realpython.com/regex-python/](https://realpython.com/regex-python/)

In [31]:
s = r'foo\bar'
print(s)

foo\bar


Now suppose you want to create a regex that will match the backslash between 'foo' and 'bar'. The backslash is itself a special character in a regex, so to specify a literal backslash, you need to escape it with another backslash. If that's the case, then the following should work:


In [32]:
re.search('\', s)

SyntaxError: EOL while scanning string literal (1259342667.py, line 1)

In [33]:
re.search('\\', s)

error: bad escape (end of pattern) at position 0

Oops. What happened?

The problem here is that the backslash escaping happens twice, first by the Python interpreter on the string literal and then again by the regex parser on the regex it receives.

Here’s the sequence of events:

- The Python interpreter is the first to process the string literal `\\`. It interprets that as an escaped backslash and passes only a single backslash to re.search().
- The regex parser receives just a single backslash, which isn’t a meaningful regex, so the messy error ensues.

In [34]:
m = re.search('\\\\', s)
m.start()

3

In [35]:
m = re.search(r'\\', s)
m.start()

3

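Both versions work. In the first, the interpreter reduces the four backslashes in `'\\\\'` to two, which the regex parser then reads as one escaped backslash. In the second, the raw string suppresses the escaping at the interpreter level: the string `\\` gets passed unchanged to the regex parser, which again sees one escaped backslash as desired.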

It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.
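
The equivalence of the two spellings can be checked directly. The sketch below also uses `re.escape`, a standard-library helper not covered above, which builds an escaped pattern from a literal string; the variable names are chosen for illustration.

```python
import re

s = r"foo\bar"

# Both literals produce the same two-character pattern: backslash, backslash.
same = ("\\\\" == r"\\")
pattern_length = len(r"\\")

# re.escape turns a literal string into a pattern that matches it verbatim.
pattern = re.escape("\\")            # one literal backslash, escaped for regex use
position = re.search(pattern, s).start()

print(same)             # True
print(pattern_length)   # 2
print(position)         # 3
```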