## Regular Expressions

Text data is often in need of “cleaning” and preparation before it can be effectively used for analysis purposes. Consider the following poorly formatted text string containing names and phone numbers of some residents of the town of Springfield:

"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

Use your Python regular expression (“regex”) skills to complete the following tasks:
### 1. Extract the names of each individual from the unformatted text string and store them in a vector of some sort. When complete, your vector should contain the following entries:

"Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy" "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"

In [1]:
# import Regular Expressions library 
import re
import numpy as np
import pandas as pd

# sample text
Springfield ="555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
Springfield

'555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert'

In [2]:
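# pattern for a name: a run of letters, commas, and whitespace, followed by a run
# of letters, periods, and whitespace, ending in a letter
pattern = r'[A-Z,\s]+[A-Z. \s]+[A-Z]'
# IGNORECASE makes the A-Z ranges match lowercase letters as well
regex = re.compile(pattern, flags=re.IGNORECASE)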
text = regex.findall(Springfield)
text

['Moe Szyslak',
 'Burns, C. Montgomery',
 'Rev. Timothy Lovejoy',
 'Ned Flanders',
 'Simpson, Homer',
 'Dr. Julius Hibbert']
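
As a cross-check, the same six names can also be extracted with explicit letter ranges instead of `re.IGNORECASE`. This is only a sketch: it assumes a name always starts and ends with a letter and contains nothing but letters, periods, commas, and spaces in between. The names `springfield`, `name_pattern`, and `names` are illustrative and not part of the notebook above.

```python
import re

springfield = (
    "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery"
    "555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders"
    "636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
)

# Assumption: a name starts and ends with a letter, with only letters,
# periods, commas, or spaces in between.
name_pattern = re.compile(r"[A-Za-z][A-Za-z., ]*[A-Za-z]")
names = name_pattern.findall(springfield)
print(names)
```

Because digits, parentheses, and hyphens are not in the character class, each phone number acts as a separator between two names.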


### 2. Using your new vector containing only the names of the six individuals, complete the following tasks:

a. Use your regex skills to rearrange the vector so that all elements conform to the standard "firstname lastname", preserving any titles (e.g., "Rev.", "Dr.", etc.) or middle/second names.

In [3]:
# split "lastname, firstname" entries on the comma and reverse the parts
# so that the first name comes first; entries without a comma are unchanged
text1 = [i.split(', ', 1)[::-1] for i in text]
text1

[['Moe Szyslak'],
 ['C. Montgomery', 'Burns'],
 ['Rev. Timothy Lovejoy'],
 ['Ned Flanders'],
 ['Homer', 'Simpson'],
 ['Dr. Julius Hibbert']]

In [4]:
# join the parts of each name back into a single "firstname lastname" string
text2 = [' '.join(q) for q in text1]
text2

['Moe Szyslak',
 'C. Montgomery Burns',
 'Rev. Timothy Lovejoy',
 'Ned Flanders',
 'Homer Simpson',
 'Dr. Julius Hibbert']
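
Since the task asks for a regex-based rearrangement, the reordering can also be sketched with a single substitution that uses capture groups. The pattern assumes that a comma appears only in "lastname, firstname" entries; `swap` and `reordered` are illustrative names.

```python
import re

names = [
    "Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
    "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert",
]

# Group 1 captures everything before the comma (the last name), group 2
# everything after it. The replacement emits them in swapped order.
# Names without a comma do not match and are returned unchanged.
swap = re.compile(r"^([^,]+),\s*(.+)$")
reordered = [swap.sub(r"\2 \1", name) for name in names]
print(reordered)
```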

b. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.). 

In [5]:
# convert the names to a pandas Series and test for a leading title:
# two or three letters followed by a period and a space
pattern_2 = r'[A-Za-z]{2,3}\. '
text3 = pd.Series(text2)
text3.str.match(pattern_2)

0    False
1    False
2     True
3    False
4    False
5     True
dtype: bool
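
The same logical vector can be built without pandas by matching against an explicit list of titles. This is a sketch that assumes only "Rev." and "Dr." occur in the data; the `titles` tuple is a hypothetical list that would need extending for other inputs.

```python
import re

names = [
    "Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
    "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert",
]

# Hypothetical list of known titles (an assumption, not taken from the task).
titles = ("Rev", "Dr")
title_pattern = re.compile(r"(?:" + "|".join(map(re.escape, titles)) + r")\.\s")

# re.match anchors at the start of the string, so only leading titles count.
has_title = [bool(title_pattern.match(name)) for name in names]
print(has_title)
```

Listing the titles explicitly is stricter than matching any two or three letters before a period, at the cost of having to maintain the list.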

c. Construct a logical vector indicating whether a character has a middle/second name.

In [6]:
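# convert the names to a pandas Series and test for a leading initial
# (a single letter followed by a period and a space); note that str.match
# only checks the start of each string
pattern_3 = r'[A-Za-z]\. '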
text4 = pd.Series(text2)
text4.str.match(pattern_3)

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool
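
A check for a middle or second name that does not depend on the initial appearing at the start of the string could look like the sketch below. It assumes that the only titles are "Rev." and "Dr." and that a name with more than two parts after removing the title has a middle/second name. The helper `has_middle_name` is hypothetical.

```python
import re

names = [
    "Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
    "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert",
]

# Assumption: only "Rev." and "Dr." occur as titles in this data.
title_prefix = re.compile(r"^(?:Rev|Dr)\.\s+")

def has_middle_name(name):
    """Return True if more than two name parts remain after removing a title."""
    without_title = title_prefix.sub("", name)
    return len(without_title.split()) > 2

has_middle = [has_middle_name(name) for name in names]
print(has_middle)
```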

### 3. Consider the HTML string <title>+++BREAKING NEWS+++<title>. We would like to extract the first HTML tag (i.e., “<title>”). To do so, we write the regular expression “<.+>”. Explain why this fails and correct the expression.

In [54]:
# '.+' is greedy: it matches as many characters as possible, so the match
# runs from the first '<' to the last '>' in the string
html_text = '<title>+++BREAKING NEWS+++<title>'
re.findall('<.+>', html_text)

['<title>+++BREAKING NEWS+++<title>']

In [94]:
# restrict the tag name to letters so the match cannot run past the first '>'
re.match('<[A-Za-z]+>', html_text).group()

'<title>'
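
The difference between greedy and non-greedy matching can be seen side by side. The sketch below compares the original expression with two common corrections: a lazy quantifier (`+?`) and a negated character class (`[^>]+`). The variable names are illustrative.

```python
import re

html_text = "<title>+++BREAKING NEWS+++<title>"

# Greedy: '.+' consumes as much as possible, up to the last '>'.
greedy = re.search(r"<.+>", html_text).group()

# Lazy: '.+?' stops at the first '>' that completes the match.
lazy = re.search(r"<.+?>", html_text).group()

# Negated class: '[^>]+' can never cross a '>' in the first place.
negated = re.search(r"<[^>]+>", html_text).group()

print(greedy)
print(lazy)
print(negated)
```

Both corrections return only the first tag; the negated class states the intent ("anything except a closing bracket") more directly.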

### 4. Consider the string “(5-3)^2=5^2-2*5*3+3^2”, which conforms to the binomial theorem. We would like to extract the formula in the string. To do so, we write the regular expression “[^0-9=+*()]+”. Explain why this fails and correct the expression.

In [104]:
# a '^' at the start of a character class negates the class, so the pattern
# matches everything except digits and the listed operators
formula_text = '(5-3)^2=5^2-2*5*3+3^2'
re.findall('[^0-9=+*()]+', formula_text)

['-', '^', '^', '-', '^']

In [100]:
# escape the caret so it is treated as a literal character, and add the
# minus sign to the class so the whole formula is matched
re.match(r'[\^\-0-9=+*()]+', formula_text).group()

'(5-3)^2=5^2-2*5*3+3^2'
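
The role of the caret inside a character class can be checked directly. In the sketch below, the first pattern starts its class with an unescaped `^` and therefore matches only characters that are not listed; the second escapes the caret and the hyphen so that both are treated as literals. The variable names are illustrative.

```python
import re

formula_text = "(5-3)^2=5^2-2*5*3+3^2"

# Leading caret: the class is negated, so only characters NOT listed match.
negated = re.findall(r"[^0-9=+*()]+", formula_text)

# Escaped caret and hyphen: both are literal members of the class.
literal = re.search(r"[\^\-0-9=+*()]+", formula_text).group()

print(negated)
print(literal)
```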