# BDTA Lesson 11: Regular Expressions

This notebook introduces regular expressions.

Some links:

* [The Python Documentation](https://docs.python.org/3/library/re.html)
* [Tutorialspoint Tutorial](https://www.tutorialspoint.com/python3/python_reg_expressions.htm)
* [RegEx One Tutorial](https://regexone.com/references/python)
* [RegEx Tester](https://pythex.org/)

### Import re library

You need to import the regular expression library. We then create a text to play with.

In [1]:
import re

In [2]:
test = '''<html>
<head>
<title>An example text for regular expressions</title>
</head>
<body>
<p>Here is an example of a using <i>regular expressions.</i> This example
shows how you can do different things with regular expressions. Don't be shy to try them.</p>

<p>Some of the difficulties that we can have are finding and tokenizing hyphenated words 
like tip-off and long-term.</p>

<p>Here are some dates: 1987, 1990, 1389, 2017 and 2027.</p>

<p>Here are some words that are similar woman, women, ...</p>

<p>Prepared 2017.10.15</p>
</body>
</html>'''

In [None]:
# Match year-like strings that start with 19: 19..
# Match truncated words: wom.* (greedy, runs to end of line), wom[a-z]*
# Match variants of a word: wom[ae]n
# Match the word before another word (lookahead): \w+(?=\s*that)
# Common regex operations: search, findall, sub (replace or strip), split
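
The patterns in the notes above can be tried on a short string first. This is a minimal sketch; `sample` is a made-up string chosen for illustration, not the notebook's `test` text, and the lookahead uses the word "was" instead of "that" to fit that sample.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "One woman, two women, and 1990 was a long time ago."

# A dot matches any single character except a newline,
# so 19.. means "19" followed by any two characters.
years = re.findall(r'19..', sample)

# [a-z]* extends the match over lowercase letters only.
truncated = re.findall(r'wom[a-z]*', sample)

# .* is greedy and runs to the end of the line.
greedy = re.findall(r'wom.*', sample)

# A character class matches exactly one of the listed characters.
variants = re.findall(r'wom[ae]n', sample)

# A lookahead (?=...) checks what follows without including it in the match.
before_was = re.findall(r'\w+(?=\s+was)', sample)

print(years)       # ['1990']
print(truncated)   # ['woman', 'women']
print(variants)    # ['woman', 'women']
print(before_was)  # ['1990']
```

Note that `wom.*` returns a single long match here, because the greedy `.*` swallows the rest of the line after the first "wom".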

In [9]:
# re.search returns a match object for the first match (or None)
match = re.search(r'\w+(?=\s*that)',test)
print(match)

<_sre.SRE_Match object; span=(268, 280), match='difficulties'>


In [24]:
type(match)

_sre.SRE_Match
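
A match object carries both the matched text and its position. The sketch below shows the usual accessors on a short, made-up `sample` string (an assumption for illustration), and contrasts `re.search` with `re.findall`. In recent Python versions the type prints as `re.Match` rather than `_sre.SRE_Match`.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Some of the difficulties that we can have"

# re.search returns the first match as a match object, or None.
m = re.search(r'\w+(?=\s*that)', sample)
if m is not None:
    word = m.group()       # the matched text
    start, end = m.span()  # start and end offsets in the string
    print(word, start, end)  # difficulties 12 24

# re.findall returns a plain list of all matching strings instead.
all_words = re.findall(r'\w+(?=\s*that)', sample + " and words that differ")
print(all_words)  # ['difficulties', 'words']
```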

### Using a regex in an if statement

In [11]:
if re.search(r'hermeneutics',test): print("Found one!")
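
The cell above prints nothing because "hermeneutics" does not occur in the text: `re.search` returns `None` when there is no match, and `None` counts as false in an `if`. A small sketch of the same idea, where the helper name `contains` and the `sample` string are assumptions made up for this example:

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "This example shows how you can do different things."

def contains(pattern, text, flags=0):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text, flags) is not None

found_missing = contains(r'hermeneutics', sample)
found_word = contains(r'example', sample)
found_upper = contains(r'EXAMPLE', sample)
# re.IGNORECASE makes the match case-insensitive.
found_upper_ci = contains(r'EXAMPLE', sample, re.IGNORECASE)

print(found_missing, found_word, found_upper, found_upper_ci)
# False True False True
```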

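### Using a regex to find tags

Note how regex quantifiers are "greedy": `.*` matches as much as it can on a line, so `<.*>` runs from the first `<` to the last `>` on that line. Adding `?` after the quantifier (`.*?`) makes it stop at the first possible `>`.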

In [12]:
re.findall(r'<.*>',test)

['<html>',
 '<head>',
 '<title>An example text for regular expressions</title>',
 '</head>',
 '<body>',
 '<p>Here is an example of a using <i>regular expressions.</i>',
 '</p>',
 '<p>',
 '</p>',
 '<p>Here are some dates: 1987, 1990, 1389, 2017 and 2027.</p>',
 '<p>Here are some words that are similar woman, women, ...</p>',
 '<p>Prepared 2017.10.15</p>',
 '</body>',
 '</html>']

In [13]:
re.findall(r'<.*?>',test)

['<html>',
 '<head>',
 '<title>',
 '</title>',
 '</head>',
 '<body>',
 '<p>',
 '<i>',
 '</i>',
 '</p>',
 '<p>',
 '</p>',
 '<p>',
 '</p>',
 '<p>',
 '</p>',
 '<p>',
 '</p>',
 '</body>',
 '</html>']
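
The difference between the two patterns is easiest to see on a single line. The sketch below uses a made-up `sample` string (an assumption for illustration) and also shows a negated character class, a common alternative to the non-greedy form.

```python
import re

# Hypothetical one-line sample, used only for illustration.
sample = "<p>Here is <i>one</i> line.</p>"

# Greedy: from the first '<' to the last '>' on the line.
greedy = re.findall(r'<.*>', sample)

# Non-greedy: stop at the first '>' after each '<'.
lazy = re.findall(r'<.*?>', sample)

# Negated class: any run of characters that are not '>'.
negated = re.findall(r'<[^>]*>', sample)

print(greedy)   # ['<p>Here is <i>one</i> line.</p>']
print(lazy)     # ['<p>', '<i>', '</i>', '</p>']
print(negated)  # ['<p>', '<i>', '</i>', '</p>']
```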

### Using a regex to strip out tags

We don't have to use *findall*. Other functions are available, such as *sub*, which substitutes a replacement string for every match.

In [14]:
results = re.sub(r'<.*?>',"",test)
print(results)



An example text for regular expressions


Here is an example of a using regular expressions. This example
shows how you can do different things with regular expressions. Don't be shy to try them.

Some of the difficulties that we can have are finding and tokenizing hyphenated words 
like tip-off and long-term.

Here are some dates: 1987, 1990, 1389, 2017 and 2027.

Here are some words that are similar woman, women, ...

Prepared 2017.10.15




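Here we pull out the `<p>` elements. Only three of the five paragraphs are returned, because `.` does not match a newline and the first two paragraphs each span more than one line.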

In [107]:
results = re.findall(r'<p>.*?</p>',test)
print(results)

['<p>Here are some dates: 1987, 1990, 1389, 2017 and 2027.</p>', '<p>Here are some words that are similar woman, women, ...</p>', '<p>Prepared 2017.10.15</p>']
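
To capture paragraphs that span several lines, the `re.DOTALL` flag lets `.` match newlines as well. A minimal sketch on a made-up `sample` string (an assumption for illustration):

```python
import re

# Hypothetical sample with one paragraph broken across two lines.
sample = """<p>First paragraph
spans two lines.</p>
<p>Second paragraph.</p>"""

# By default '.' stops at a newline, so the first paragraph is skipped.
single_line_only = re.findall(r'<p>.*?</p>', sample)

# With re.DOTALL, '.' also matches '\n'.
all_paragraphs = re.findall(r'<p>.*?</p>', sample, flags=re.DOTALL)

# A group (...) makes findall return only the text between the tags.
contents = re.findall(r'<p>(.*?)</p>', sample, flags=re.DOTALL)

print(single_line_only)  # ['<p>Second paragraph.</p>']
print(len(all_paragraphs))  # 2
print(contents)  # ['First paragraph\nspans two lines.', 'Second paragraph.']
```

Keeping the quantifier non-greedy matters even more with `re.DOTALL`, since a greedy `.*` would then run from the first `<p>` to the last `</p>` in the whole text.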


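Here we remove newlines. Note that deleting them outright fuses the words on either side of a line break (see `exampleshows` in the output).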

In [39]:
results = re.sub(r'\n',"",test)
print(results)

<html><head><title>An example text for regular expressions</title></head><body><p>Here is an example of a using <i>regular expressions.</i> This exampleshows how you can do different things with regular expressions. Don't be shy to try them.</p><p>Some of the difficulties that we can have are finding and tokenizing hyphenated words like tip-off and long-term.</p><p>Here are some dates: 1987, 1990, 1389, 2017 and 2027.</p><p>Here are some words that are similar woman, women, ...</p><p>Prepared 2017.10.15</p></body></html>
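
Deleting newlines joins the last word of one line to the first word of the next. One possible alternative, sketched here on a made-up `sample` string (an assumption for illustration), is to replace every run of whitespace with a single space.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "This example\nshows how words \n\n can run together."

# Deleting newlines fuses "example" and "shows".
deleted = re.sub(r'\n', "", sample)

# \s+ matches any run of spaces, tabs and newlines;
# replacing it with one space keeps the words apart.
spaced = re.sub(r'\s+', " ", sample).strip()

print(deleted)  # This exampleshows how words  can run together.
print(spaced)   # This example shows how words can run together.
```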


Here we search for date-like strings.

In [109]:
match = re.findall(r'19..',test)
print(match)

['1987', '1990']


In [78]:
results = re.findall(r'[12]\d\d\d',test) # other options: [0-9]+ (any run of digits), [0-9][0-9][0-9][0-9]
print(results)

['1987', '1990', '1389', '2017', '2027', '2017']
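
The pattern `[12]\d\d\d` also matches four digits inside a longer number. A sketch of two refinements follows: word boundaries (`\b`) to require a standalone four-digit number, and groups to pull apart a full date written as `YYYY.MM.DD`. The `sample` string, including the serial number, is made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Dates: 1987, 2027, serial 119876, prepared 2017.10.15"

# Without boundaries, "1198" is picked out of the serial number.
loose = re.findall(r'[12]\d\d\d', sample)

# \b requires a word boundary on both sides; {3} means "exactly three".
bounded = re.findall(r'\b[12]\d{3}\b', sample)

# Each (...) group is returned as one element of a tuple.
full_dates = re.findall(r'\b(\d{4})\.(\d{2})\.(\d{2})\b', sample)

print(loose)       # ['1987', '2027', '1198', '2017']
print(bounded)     # ['1987', '2027', '2017']
print(full_dates)  # [('2017', '10', '15')]
```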


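We can also tokenize with a regex, either by splitting on runs of non-word characters with `re.split` or by matching the words themselves with `re.findall`. The second pattern below keeps hyphenated words together.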

In [111]:
results = re.split(r'\W+',test)
results[:5]
len(results)

99

In [65]:
results = re.findall(r'\b(\w[\w-]*)\b',test)
len(results)

88
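
The two counts differ partly because `re.split` breaks hyphenated words apart and can return empty strings at the edges, and both approaches still count tag names such as `p` as tokens. The sketch below shows this on a made-up one-line `sample` (an assumption for illustration) and strips the tags before tokenizing.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "<p>Hyphenated words like tip-off and long-term.</p>"

# Splitting on non-word characters: empty strings appear at both ends,
# and "tip-off" becomes two tokens.
split_tokens = re.split(r'\W+', sample)

# Matching words directly keeps internal hyphens.
found_tokens = re.findall(r'\b(\w[\w-]*)\b', sample)

# Removing the tags first avoids counting "p" as a word.
text_only = re.sub(r'<.*?>', "", sample)
clean_tokens = re.findall(r'\b(\w[\w-]*)\b', text_only)

print(split_tokens)
# ['', 'p', 'Hyphenated', 'words', 'like', 'tip', 'off', 'and', 'long', 'term', 'p', '']
print(found_tokens)
# ['p', 'Hyphenated', 'words', 'like', 'tip-off', 'and', 'long-term', 'p']
print(clean_tokens)
# ['Hyphenated', 'words', 'like', 'tip-off', 'and', 'long-term']
```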

---
## Exercise
Can you write regular expressions that will do the following?

* Extract words with similar endings (like adverbs ending in "ly")
* Extract dates of different sorts.
* Convert contractions like "don't" into "do not"
* Extract acronyms

**Optional**
* Extract the word after another word like "that"
* Extract names (where a name is two words in a row with capital letters)
* Tokenize on sentences.

---
## Homework
Create a notebook that gets, cleans up and tokenizes a web page *using regular expressions*. Try to do the following:
* Strip out the tags
* Strip out any excess newlines (\n)
* Convert contractions
* Deal with hyphenated words

**Optional**
Can you find a way to extract useful information about the web page like:
* List of tags
* List of links
* Any dates and names
* Any acronyms
