# Regular Expressions
In this notebook we are going to talk about pattern matching in strings using regular expressions. Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern which you give to a regex processor along with some source data. The processor then parses that source data using that pattern and returns chunks of text to the data scientist or programmer for further manipulation. There are three main reasons you would want to do this: to check whether a pattern exists within some source data, to get all instances of a complex pattern from some source data, or to clean your source data using a pattern, generally through string splitting. Regexes are not trivial, but they are a foundational technique for data cleaning in data science applications, and a solid understanding of regexes will help you quickly and efficiently manipulate text data for further data science applications.
Now, you could teach a whole course on regular expressions alone, especially if you wanted to demystify how the regex parsing engine works and efficient mechanisms for parsing text.

In [1]:
import re # Standard-library module with functions for working with regexes

text = "This is a good day" # Sample text to search
if re.match("good", text): # match() only checks at the start of the string
    print("good is in the start of the string")
else:
    print("It doesn't match")

if re.search("good", text): # search() looks anywhere in the string
    print("good is in the string")
else:
    print("good isn't in the string")

text = "Amy2 works diligently. Amy gets good grades. Our student Amy is succesful."
print(re.split("Amy", text)) # Split on a pattern with re.split()
print(text.split("Amy")) # For a plain substring, str.split() gives the same result
print(re.findall("Amy", text)) # Every occurrence of "Amy" in the text
if re.search("^Amy", text): print("Amy is in the start of the String") # ^ anchors the pattern to the start of the string
if re.search("ful.$", text): print("ful. is in the end of the string") # $ anchors the pattern to the end of the string; the unescaped . matches any character

It doesn't match
good is in the string
['', '2 works diligently. ', ' gets good grades. Our student ', ' is succesful.']
['', '2 works diligently. ', ' gets good grades. Our student ', ' is succesful.']
['Amy', 'Amy', 'Amy']
Amy is in the start of the String
ful. is in the end of the string
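
The `.` in `"ful.$"` is a wildcard, so that pattern would also accept other endings such as `ful!`. The following is a minimal sketch, reusing the same `text` string as the cell above, of escaping the dot and of splitting on a pattern that `str.split` cannot express. The names `loose`, `strict`, and `parts` are illustrative only.

```python
import re

text = "Amy2 works diligently. Amy gets good grades. Our student Amy is succesful."

# An unescaped "." matches any single character, so "ful!" is accepted too
loose = bool(re.search("ful.$", "succesful!"))
print(loose)   # True

# Escaping the dot restricts the match to a literal period
strict = bool(re.search(r"ful\.$", "succesful!"))
print(strict)  # False
print(bool(re.search(r"ful\.$", text)))  # True

# re.split takes a pattern: "Amy" optionally followed by digits
parts = re.split(r"Amy\d*", text)
print(parts)
# ['', ' works diligently. ', ' gets good grades. Our student ', ' is succesful.']
```

Raw strings (the `r"..."` prefix) keep backslashes such as `\.` and `\d` intact for the regex engine.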


In [2]:
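print(re.match("This", text)) # text now holds the Amy sentence from the previous cell, so match() finds nothing and returns None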

None
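
When a pattern does match, `re.match` and `re.search` return a match object rather than `True`. The following is a minimal sketch, assuming the original "good day" sentence, of reading the matched text and its position from that object. The variable names are illustrative.

```python
import re

text = "This is a good day"

m = re.search("good", text)
if m is not None:
    found = m.group()    # the text that matched
    position = m.span()  # (start, end) indices into text
    print(found, position)  # good (10, 14)

# match() is anchored at the start of the string, so this returns None
no_match = re.match("good", text)
print(no_match)  # None
```

Because `None` is falsy and a match object is truthy, both functions can be used directly in an `if` statement, as in the first cell.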


In [3]:
# Patterns and Character Classes
grades = "ACAAAABCBCBAA"

print(re.findall("B", grades)) # Find every B in the string
print(re.findall("[AB]", grades)) # Character class: find every A or B
print(re.findall("A[BC]", grades)) # An A followed by a B or a C, i.e. AB or AC
print(re.findall("AB|AC", grades)) # | is the OR operator: AB or AC
print(re.findall("[^A]", grades)) # ^ inside [] negates the class: every character that isn't an A
# Quantifiers of the form {m,n} set how many times the preceding item may repeat
print(re.findall("A{2,10}", grades)) # Minimum of 2 and maximum of 10 A's in a row
print(re.findall("A{1,1}A{1,1}", grades)) # Exactly one A followed by exactly one A
print(re.findall("A{2}", grades)) # Equivalent: exactly 2 A's
print(re.findall("A{1, 2}", grades)) # Be careful: with a space after the ',' this is no longer read as a quantifier, so nothing matches

['B', 'B', 'B']
['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']
['AC', 'AB']
['AC', 'AB']
['C', 'B', 'C', 'B', 'C', 'B']
['AAAA', 'AA']
['AA', 'AA', 'AA']
['AA', 'AA', 'AA']
[]
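
The `{m,n}` quantifiers have common shorthands; for example `+` means "one or more". The following is a small sketch, using the same `grades` string, of finding every run of consecutive A's and where each run starts. The names `runs` and `longest` are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# "A+" matches one or more A's in a row; it is shorthand for "A{1,}"
runs = [(m.group(), m.start()) for m in re.finditer("A+", grades)]
print(runs)     # [('A', 0), ('AAAA', 2), ('AA', 11)]

# Quantifiers are greedy, so each match takes the whole run
longest = max(re.findall("A+", grades), key=len)
print(longest)  # AAAA
```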


In [4]:
with open("data/ferpa.txt","r") as file:
    wiki=file.read()
# [\w ]* matches any run of word characters and spaces, and \[edit\] matches the literal text "[edit]".
# Each match is a single title because titles sit on their own line and \n is not in the character class.
print(re.findall(r"[\w ]*\[edit\]",wiki))
groups = re.findall(r"([\w ]*)(\[edit\])",wiki) # Parentheses group the parts of the match
print(groups)
for group in groups:
    print(group[0]) # Print only the title
# finditer() finds the same matches but returns an iterator of match objects instead of a list
# (?P<name>...) names a group; groupdict() then uses each name as a dictionary key
matches = re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki)
for match in matches:
    print(match.groupdict()) # Print each match as a dictionary
    
# A lookahead (?=...) requires the text to be present but leaves it out of the result
print(re.findall(r"(?P<title>[\w ]*)(?=\[edit\])",wiki))

['Overview[edit]', 'Access to public records[edit]', 'Student medical records[edit]']
[('Overview', '[edit]'), ('Access to public records', '[edit]'), ('Student medical records', '[edit]')]
Overview
Access to public records
Student medical records
{'title': 'Overview', 'edit_link': '[edit]'}
{'title': 'Access to public records', 'edit_link': '[edit]'}
{'title': 'Student medical records', 'edit_link': '[edit]'}
['Overview', '', 'Access to public records', '', 'Student medical records', '']
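
The empty strings in the last result come from the `*` quantifier: it allows a zero-length title, which also matches right before each `[edit]`. The following is a minimal sketch of that effect and one way to avoid it. The `sample` string is a hypothetical stand-in for the contents of `data/ferpa.txt`, not the real file.

```python
import re

# Hypothetical stand-in for the file contents
sample = "Overview[edit]\nSome text.\nAccess to public records[edit]\nMore text.\n"

# With "*", a zero-length title also matches just before each "[edit]"
with_star = re.findall(r"(?P<title>[\w ]*)(?=\[edit\])", sample)
print(with_star)  # ['Overview', '', 'Access to public records', '']

# With "+", the title needs at least one character, so the empty matches go away
titles = re.findall(r"(?P<title>[\w ]+)(?=\[edit\])", sample)
print(titles)     # ['Overview', 'Access to public records']
```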


In [5]:
# Next, let's look at regexes on other documents
with open("data/buddhist.txt","r", encoding = 'utf8') as file:
    # we'll read that into a variable called wiki
    wiki = file.read()
# Verbose mode allows you to write multi-line regexes and increases readability. In this mode we have to
# explicitly indicate all whitespace characters, either by prepending them with a \ or by using the \s
# special value. In return, we can write our regex a bit more like code, and can even include comments with #
pattern = r"""
(?P<title>.*)        #the university title
(–\ located\ in\ )   #an indicator of the location
(?P<city>\w*)        #city the university is in
(,\ )                #separator for the state
(?P<state>\w*)       #the state the city is located in"""
matches = re.finditer(pattern, wiki, re.VERBOSE)
for match in matches:
    print(match.groupdict())
    
# Here's another example from the New York Times which covers health tweets on news items. This data came from
# the UC Irvine Machine Learning Repository which is a great source of different kinds of data
with open("data/nytimeshealth.txt","r", encoding = 'utf8') as file:
    # We'll read everything into a variable and take a look at it
    health=file.read()
pattern = r"#[\w\d]*(?=\s)" # Pattern to find hashtags: a # plus word characters, followed by whitespace
print(re.findall(pattern, health)[:10]) # Show the first 10 matches

{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University ', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute ', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Institute of Buddhist Studies ', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College ', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'University of the West ', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies ', 'city': 'Glenside', 'state': 'Pennsylvania'}
['#askwell', '#pregnancy', '#Colorado', '#VegetarianThanksgiving', '#FallPrevention', '#Ebola', '#Ebola', '#ebola', '#Ebola', '#Ebola']
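
Two details of the results above are worth a closer look: each title keeps a trailing space, and the hashtag pattern only matches when whitespace follows the tag. The following is a hedged sketch of one way to handle both. The `line` string is shaped like the matches printed above and the `tweet` string is invented for illustration; the real data files may differ.

```python
import re

# One line shaped like the printed matches (an assumption about the file's format)
line = "Dhammakaya Open University – located in Azusa, California"

pattern = r"""
(?P<title>.*?)       # the university title, matched lazily
\ –\ located\ in\    # separator, with every space escaped for verbose mode
(?P<city>\w*)        # city the university is in
,\                   # separator before the state
(?P<state>\w*)       # the state the city is located in"""

record = re.search(pattern, line, re.VERBOSE).groupdict()
print(record)
# {'title': 'Dhammakaya Open University', 'city': 'Azusa', 'state': 'California'}

# Hypothetical tweet that ends with a hashtag
tweet = "Can you exercise too much? #askwell #pregnancy"

# Requiring trailing whitespace misses a tag at the very end of the text
needs_space = re.findall(r"#[\w\d]*(?=\s)", tweet)
print(needs_space)  # ['#askwell']

# "#\w+" needs no lookahead, and \w already includes digits
tags = re.findall(r"#\w+", tweet)
print(tags)         # ['#askwell', '#pregnancy']
```

Moving the space into the separator keeps it out of the `title` group. Note that `\w*` for the city would stop at the first space, so multi-word city names would need a wider character class.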


## Look ahead and look behind
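Another way to use regexes is with lookahead and lookbehind assertions, which check the text around a match without including it in the result. They are shown here alongside a non-capturing group for comparison.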

In [25]:
text = 'abxacya'
print(re.findall(r"a(?=b)\w{1}", text)) # (?=b) lookahead: matches the a in ab but not the a in ac. The b is not consumed, so \w{1} then matches it
print(re.findall(r"a(?:b)\w{1}", text)) # (?:b) non-capturing group: consumes the b without creating a group, so \w{1} matches the x
print(re.findall(r"a(?!b)\w{1}", text)) # (?!b) negative lookahead: matches an a that is not followed by b, here the a in ac
print(re.findall(r"(?<=a)\w{1}", text)) # (?<=a) lookbehind: matches a character preceded by a, without including the a
print(re.findall(r"(?<!a)\w{1}", text)) # (?<!a) negative lookbehind: matches a character that is not preceded by a

['ab']
['abx']
['ac']
['b', 'c']
['a', 'x', 'a', 'y', 'a']
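
The difference between a capturing group and a non-capturing group shows up in what `re.findall` returns. The following is a short sketch using the same `text` string, plus one lookbehind example on an invented `prices` string.

```python
import re

text = "abxacya"

# With a capturing group, findall returns only the group's contents
captured = re.findall(r"a(b)\w", text)
print(captured)      # ['b']

# With a non-capturing group, findall returns the whole match
non_captured = re.findall(r"a(?:b)\w", text)
print(non_captured)  # ['abx']

# Hypothetical example: digits that come right after a dollar sign
prices = "The tickets cost $25 and $7"
amounts = re.findall(r"(?<=\$)\d+", prices)
print(amounts)       # ['25', '7']
```

The lookbehind checks for the `$` without including it, so the results contain only the digits.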
