## Regex 101, Applied to Hansard Speaker Names

Problem: Writing out every individual name for a find/replace dictionary is time-consuming. It is also easy to lose track of whether a specific variant of a name has already been added.

Solution: Many of the misspellings in the Hansard data are OCR errors, so they tend to follow patterns. Regular expressions can match several misspellings at once. They must be written with care, however, so that they do not also match the wrong words.

### Code

First import required modules.

In [4]:
import re

Now let's create the example text for which we will write regular expressions.

In [5]:
text = 'fisheries', 'school district', 'meat industry', 'fisiieries', 'fisiiery', 'fisiieeries'

In [6]:
text

('fisheries',
 'school district',
 'meat industry',
 'fisiieries',
 'fisiiery',
 'fisiieeries')

If we wanted to match all three spellings of "fisheries" (fisheries, fisiieries, and fisiieeries), we could use a greedy regular expression, `.*`, that matches any run of characters between the "s" and the "eries" of fisheries.

In [7]:
list_of_matches = []

for item in text:
    match = re.search(r'fis(.*)eries', item)
    list_of_matches.append(match)

In [9]:
list_of_matches

[<re.Match object; span=(0, 9), match='fisheries'>,
 None,
 None,
 <re.Match object; span=(0, 10), match='fisiieries'>,
 None,
 <re.Match object; span=(0, 11), match='fisiieeries'>]

Note that the greedy `.*` expands to cover every character between "fis" and "eries," so it matches "h," "ii," and "iie" alike. You therefore do not need to specify how many characters are being replaced.
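To check what the greedy group actually captured, a small sketch like the following collects the text matched by `(.*)` for each item. It reuses the `text` tuple from above; the name `captured` is introduced here purely for illustration.

```python
import re

text = ('fisheries', 'school district', 'meat industry',
        'fisiieries', 'fisiiery', 'fisiieeries')

# Collect the characters captured between "fis" and "eries" for each match.
captured = {}
for item in text:
    match = re.search(r'fis(.*)eries', item)
    if match:
        captured[item] = match.group(1)

print(captured)
# {'fisheries': 'h', 'fisiieries': 'ii', 'fisiieeries': 'iie'}
```

Items that do not match, such as "fisiiery," simply never enter the dictionary.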

We could also use a regular expression to match "fisiiery" (for fishery). This time, however, the pattern should stay inside a single word rather than expand to include other words. Because `.` matches any character, including spaces, we use `\w*` instead, which matches only word characters (letters, digits, and underscores).

We can signal a word boundary with `\b`, which here requires the match to end where the word ends.

In [11]:
list_of_matches_2 = []

for item in text:
    match = re.search(r'fis(\w*)er(\w*)\b', item)
    list_of_matches_2.append(match)

In [12]:
list_of_matches_2

[<re.Match object; span=(0, 9), match='fisheries'>,
 None,
 None,
 <re.Match object; span=(0, 10), match='fisiieries'>,
 <re.Match object; span=(0, 8), match='fisiiery'>,
 <re.Match object; span=(0, 11), match='fisiieeries'>]
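The difference between `.*` and `\w*` shows up as soon as a string holds more than one word, because `\b` on its own only requires the match to end at some word boundary. The sketch below compares the two on a hypothetical multi-word string that is not taken from the Hansard data.

```python
import re

sample = 'fisiiery board meeting'  # hypothetical multi-word example

# Greedy `.*` runs across spaces to the last word boundary in the string.
dot_match = re.search(r'fis(.*)er(.*)\b', sample)
print(dot_match.group(0))   # fisiiery board meeting

# `\w*` cannot cross a space, so the match stays inside one word.
word_match = re.search(r'fis(\w*)er(\w*)\b', sample)
print(word_match.group(0))  # fisiiery
```

When the goal is to replace a single misspelled word, the `\w*` form is the safer of the two.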

In some situations a greedy regular expression is not a wise choice. Say we had a variable of office titles and wanted to match every misspelling of "lord." If we wrote `l(.*)d`, we would also match "labcd."

Instead we use `.` without the `*`, once for each character we expect, so the pattern matches an exact number of characters.

In [10]:
titles = 'lord smith', 'lard smith', 'loud smith', 'loed smith', 'labcd'

In [11]:
# A greedy approach would be problematic here. Instead we should match the exact number of letters we wish to replace.

list_of_matches_3 = []

for item in titles:
    match = re.search(r'l(.)(.)d', item)
    list_of_matches_3.append(match)

In [12]:
list_of_matches_3

[<re.Match object; span=(0, 4), match='lord'>,
 <re.Match object; span=(0, 4), match='lard'>,
 <re.Match object; span=(0, 4), match='loud'>,
 <re.Match object; span=(0, 4), match='loed'>,
 None]
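Once a pattern matches the right variants, `re.sub` can perform the replacement itself. The following is a minimal sketch of how that might look. Adding `\b` on both sides is an assumption made here so that only whole four-letter words are rewritten, and the name `cleaned` is illustrative.

```python
import re

titles = ('lord smith', 'lard smith', 'loud smith', 'loed smith', 'labcd')

# Replace any four-character word of the form l??d with "lord".
# Caution: this would also rewrite legitimate words such as "land" or "lead".
cleaned = [re.sub(r'\bl..d\b', 'lord', item) for item in titles]

print(cleaned)
# ['lord smith', 'lord smith', 'lord smith', 'lord smith', 'labcd']
```

Because a pattern this loose can catch real words, it is worth inspecting the matches on actual data before applying the substitution in bulk.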

### Quiz

What regex might we write to match all of the following variants of one name?

In [None]:
names = 'Lord Ellenborough', 'Lord Ellen borough', 'Lord Ellen-borough'