# BDTA Lesson 13: Review of Regular Expressions

This is a review of regular expressions. It is often easier to experiment with a pattern in a regular expression tester such as:
* [RegEx Tester](https://pythex.org/)

First we set up a text to work on.

In [26]:
import re

test = '''<html>
<head>
<title>An example text for regular expressions created by Geoffrey Rockwell</title>
</head>
<body>
<p>Here is an example of a using <i>regular expressions.</i> This example
shows how you can do different things with regular expressions. Don't be shy to try them. 
A word to the wise, however. Don't be greedy.</p>

<p>Some of the difficulties that we can have are finding and tokenizing hyphenated words 
like tip-off and long-term.</p>

<p>Here are some dates: 1987, 1990, 1389, 2017 and 2027.</p>

<p>Now some acronyms like UN, USA, and NORA.</p>

<p>Here are some words that are similar woman, women, ...</p>

<p>And here are some links, <a href="http://www.ualberta.ca">the University page</a>, and 
the <a href="http://huco.ualberta.ca">the HuCo page</a>.</p>

<p>Finally, here I'm showing contractions like don't and I'll.</p>

<p>Prepared 2017.10.15</p>
</body>
</html>'''

You can use regex to find an exact sequence of characters.

In [4]:
matches = re.findall(r'the',test)
print(matches)

['the', 'the', 'the', 'the', 'the', 'the']
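
A literal pattern matches anywhere it occurs, including inside longer words such as "them". The sketch below contrasts the bare literal with a version wrapped in the word-boundary sequence \b. The string `sample` is made up for illustration and is not part of the lesson text.

```python
import re

# A short made-up sentence, separate from the lesson's test text.
sample = "Give them the book, then bathe the cat."

# The bare literal also matches inside longer words (them, then, bathe).
substring_hits = re.findall(r'the', sample)

# \b anchors the match at word boundaries, so only the word "the" counts.
word_hits = re.findall(r'\bthe\b', sample)

print(len(substring_hits))  # 5
print(len(word_hits))       # 2
```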


You can use it to find a pattern with a set of possible characters using **[ ]**.

In [4]:
matches = re.findall(r'wom[ae]n',test)
print(matches)

['woman', 'women']
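
A set in square brackets is stricter than the dot (.), which accepts any single character except a newline. A small sketch of the difference, using a made-up string `sample` with two near-misses:

```python
import re

# Made-up sample containing near-misses that only the dot will accept.
sample = "woman women wom3n wom-n"

dot_hits = re.findall(r'wom.n', sample)     # . matches any character except a newline
set_hits = re.findall(r'wom[ae]n', sample)  # [ae] matches only 'a' or 'e'

print(dot_hits)  # ['woman', 'women', 'wom3n', 'wom-n']
print(set_hits)  # ['woman', 'women']
```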


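You can provide a range of characters, such as all the capital letters [A-Z], the lowercase letters [a-z], or the digits [0-9]. The pattern below matches a capital letter followed by zero or more lowercase letters, so it finds capitalized words as well as lone capital letters.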

In [5]:
matches = re.findall(r'[A-Z][a-z]*',test)
print(matches)

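['An', 'Geoffrey', 'Rockwell', 'Here', 'This', 'Don', 'A', 'Don', 'Some', 'Here', 'Now', 'U', 'N', 'U', 'S', 'A', 'N', 'O', 'R', 'A', 'Here', 'And', 'University', 'Hu', 'Co', 'Finally', 'I', 'I', 'Prepared']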


We can use quantifiers to say how many times any character (.) or a set of characters should repeat.
* **\*** means 0 or more
* **+** means 1 or more
* **?** means 0 or 1

In [6]:
matches = re.findall(r'[A-Z][a-z]+',test)
print(matches)

['An', 'Geoffrey', 'Rockwell', 'Here', 'This', 'Don', 'Don', 'Some', 'Here', 'Now', 'Here', 'And', 'University', 'Hu', 'Co', 'Finally', 'Prepared']


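Note the difference between the results with * and +: with +, a lone capital letter such as 'A' no longer matches, because at least one lowercase letter must follow.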
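
The three quantifiers can be compared side by side on one short string. This is a minimal sketch; `sample` is a made-up string chosen so that each quantifier gives a different result.

```python
import re

# Made-up sample: a one-letter, a two-letter and a three-letter capitalized word.
sample = "A Be Cat"

star = re.findall(r'[A-Z][a-z]*', sample)      # capital + 0 or more lowercase
plus = re.findall(r'[A-Z][a-z]+', sample)      # capital + 1 or more lowercase
optional = re.findall(r'[A-Z][a-z]?', sample)  # capital + 0 or 1 lowercase

print(star)      # ['A', 'Be', 'Cat']
print(plus)      # ['Be', 'Cat']
print(optional)  # ['A', 'Be', 'Ca']
```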

We can also mix letters and other characters in a set. Here we have the lowercase letters and the apostrophe.

In [16]:
matches = re.findall(r'[A-Z][a-z\']*',test)
print(matches)

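['An', 'Geoffrey', 'Rockwell', 'Here', 'This', "Don't", 'A', "Don't", 'Some', 'Here', 'Now', 'U', 'N', 'U', 'S', 'A', 'N', 'O', 'R', 'A', 'Here', 'And', 'University', 'Hu', 'Co', 'Finally', "I'm", "I'll", 'Prepared']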


We can also specify special sequences using the backslash. For example, **\b** matches a word boundary. In the pattern below, **.+?** is a lazy quantifier: it matches as few characters as possible before the next word boundary.

In [14]:
matches = re.findall(r'[A-Z].+?\b',test)
print(matches)

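['An', 'Geoffrey', 'Rockwell', 'Here', 'This', 'Don', 'A ', 'Don', 'Some', 'Here', 'Now', 'UN', 'USA', 'NORA', 'Here', 'And', 'University', 'HuCo', 'Finally', "I'", "I'", 'Prepared']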
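
A few other backslash sequences come up often: \d for a digit, \w for a word character, and \s for whitespace. The following sketch tries them on a made-up string `sample`; it is only an illustration and not part of the lesson text.

```python
import re

# Made-up sample with words, numbers and punctuation.
sample = "Room 42, floor 3."

digits = re.findall(r'\d+', sample)       # runs of digits
words = re.findall(r'\w+', sample)        # runs of letters, digits or underscore
pieces = re.split(r'\s+', sample)         # split on runs of whitespace
whole = re.findall(r'\bfloor\b', sample)  # "floor" as a whole word

print(digits)  # ['42', '3']
print(words)   # ['Room', '42', 'floor', '3']
print(pieces)  # ['Room', '42,', 'floor', '3.']
print(whole)   # ['floor']
```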


# Review of regex homework

### Tags and new lines

In [10]:
cleanedContent = test
cleanedContent = re.sub(r'<.*?>',"",cleanedContent)
cleanedContent = re.sub(r'\n',"",cleanedContent)
print(cleanedContent)

An example text for regular expressions created by Geoffrey RockwellHere is an example of a using regular expressions. This exampleshows how you can do different things with regular expressions. Don't be shy to try them. A word to the wise, however. Don't be greedy.Some of the difficulties that we can have are finding and tokenizing hyphenated words like tip-off and long-term.Here are some dates: 1987, 1990, 1389, 2017 and 2027.Now some acronyms like UN, USA, and NORA.Here are some words that are similar woman, women, ...And here are some links, the University page, and the the HuCo page.Finally, here I'm showing contractions like don't and I'll.Prepared 2017.10.15
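
Two details in this cell are worth trying in isolation. The **?** after **.\*** makes the match lazy, so each tag is removed on its own instead of everything between the first < and the last >. Deleting newlines outright is also what produces run-together words like "exampleshows". A minimal sketch of both points, using the made-up strings `snippet` and `two_lines`:

```python
import re

# Made-up one-line snippet with nested tags.
snippet = "<p>Some <i>italic</i> text</p>"

greedy = re.sub(r'<.*>', '', snippet)  # greedy: swallows from the first < to the last >
lazy = re.sub(r'<.*?>', '', snippet)   # lazy: stops at the first > after each <

print(repr(greedy))  # ''
print(repr(lazy))    # 'Some italic text'

# Made-up two-line string to show the effect of the newline replacement.
two_lines = "This example\nshows how"

joined = re.sub(r'\n', '', two_lines)   # deleting the newline fuses two words
spaced = re.sub(r'\n', ' ', two_lines)  # replacing it with a space keeps them apart

print(joined)  # This exampleshows how
print(spaced)  # This example shows how
```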


### Contractions

In [8]:
cleanedContent = test
cleanedContent = re.sub(r'n\'t', ' not', cleanedContent)
cleanedContent = re.sub(r'\'m', ' am', cleanedContent)
cleanedContent = re.sub(r'\'ve', ' have', cleanedContent)
cleanedContent = re.sub(r'\'ll', ' will', cleanedContent)
print(cleanedContent)


<html>
<head>
<title>An example text for regular expressions created by Geoffrey Rockwell</title>
</head>
<body>
<p>Here is an example of a using <i>regular expressions.</i> This example
shows how you can do different things with regular expressions. Do not be shy to try them. 
A word to the wise, however. Do not be greedy.</p>

<p>Some of the difficulties that we can have are finding and tokenizing hyphenated words 
like tip-off and long-term.</p>

<p>Here are some dates: 1987, 1990, 1389, 2017 and 2027.</p>

<p>Now some acronyms like UN, USA, and NORA.</p>

<p>Here are some words that are similar woman, women, ...</p>

<p>And here are some links, <a href="http://www.ualberta.ca">the University page</a>, and 
the <a href="http://huco.ualberta.ca">the HuCo page</a>.</p>

<p>Finally, here I am showing contractions like do not and I will.</p>

<p>Prepared 2017.10.15</p>
</body>
</html>
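
The **n't** rule works for "don't" but would turn "can't" into "ca not" and "won't" into "wo not". One possible way around this, sketched below, is to replace the irregular forms first. The `irregular` table and the `expand` function are hypothetical names introduced for this example, and the sketch only handles lowercase forms.

```python
import re

# Hypothetical table of irregular contractions, handled before the general rules.
irregular = {"can't": "cannot", "won't": "will not"}

def expand(text):
    """Expand a few English contractions (lowercase forms only)."""
    # Irregular forms first, matched as whole words.
    for short, full in irregular.items():
        text = re.sub(r'\b' + re.escape(short) + r'\b', full, text)
    # General rules, each anchored at the end of the word.
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"'m\b", " am", text)
    text = re.sub(r"'ll\b", " will", text)
    return text

result = expand("I can't stay, but I'm sure they won't mind and don't care.")
print(result)  # I cannot stay, but I am sure they will not mind and do not care.
```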


### Find dates and links

In [11]:
re.findall(r'[12]\d\d\d',cleanedContent)

['1987', '1990', '1389', '2017', '2027', '2017']
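
The pattern above also matches the first four digits of a longer number. A sketch of a stricter version, using \b and the counted quantifier {3}, plus a pattern for full dates in the 2017.10.15 style. The string `sample` is made up for illustration.

```python
import re

# Made-up sample with a five-digit number that is not a year.
sample = "Dates: 1987, 2027, and 12345. Prepared 2017.10.15"

loose = re.findall(r'[12]\d\d\d', sample)       # also grabs '1234' out of 12345
strict = re.findall(r'\b[12]\d{3}\b', sample)   # exactly four digits as a whole number
full = re.findall(r'\b(\d{4})\.(\d{2})\.(\d{2})\b', sample)  # year, month, day groups

print(loose)   # ['1987', '2027', '1234', '2017']
print(strict)  # ['1987', '2027', '2017']
print(full)    # [('2017', '10', '15')]
```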

In [27]:
links = re.findall(r'href=["\'](.*?)["\']',test)
print(links)

['http://www.ualberta.ca', 'http://huco.ualberta.ca']
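
What re.findall returns depends on the parentheses in the pattern: with no group it returns the whole match, with one group only the group, and with several groups a tuple per match. A short sketch on a made-up string `html` that reuses the two links from the test text:

```python
import re

# Made-up one-line string reusing the two links from the test text.
html = ('<a href="http://www.ualberta.ca">the University page</a> and '
        '<a href="http://huco.ualberta.ca">the HuCo page</a>')

whole = re.findall(r'href=["\'].*?["\']', html)   # no group: the full match
urls = re.findall(r'href=["\'](.*?)["\']', html)  # one group: just the URL
pairs = re.findall(r'<a href=["\'](.*?)["\']>(.*?)</a>', html)  # two groups: tuples

print(whole)  # ['href="http://www.ualberta.ca"', 'href="http://huco.ualberta.ca"']
print(urls)   # ['http://www.ualberta.ca', 'http://huco.ualberta.ca']
print(pairs)  # [('http://www.ualberta.ca', 'the University page'), ('http://huco.ualberta.ca', 'the HuCo page')]
```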


### Proper nouns and acronyms

In [25]:
properNouns = re.findall(r'\b[A-Z][a-z]*?\b',test) # Note that it gets all capitalized words.
print(properNouns)

['An', 'Geoffrey', 'Rockwell', 'Here', 'This', 'Don', 'A', 'Don', 'Some', 'Here', 'Now', 'Here', 'And', 'University', 'Finally', 'I', 'I', 'Prepared']


In [24]:
acronyms = re.findall(r'\b[A-Z][A-Z]+?\b',test) # Note that it doesn't get acronyms with punctuation
print(acronyms)

['UN', 'USA', 'NORA']
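
One possible way to also catch acronyms written with periods is to add an alternative for repeated "letter plus period" pairs. This is only a sketch on a made-up string `sample`. Note that it cannot tell the final period of an acronym from the end of a sentence.

```python
import re

# Made-up sample mixing plain and dotted acronyms.
sample = "The U.S.A. and the UN met NORA in the U.K. today."

# Either two or more "capital + period" pairs, or two or more capitals as a whole word.
# (?: ) is a non-capturing group, so findall still returns the full match.
pattern = r'\b(?:[A-Z]\.){2,}|\b[A-Z]{2,}\b'

found = re.findall(pattern, sample)
print(found)  # ['U.S.A.', 'UN', 'NORA', 'U.K.']
```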
