### Regex basics and common applications

In [1]:
import re

In [2]:
# search(): scans the string for the pattern and returns a Match object
# (truthy) for the first hit, or None (falsy) if nothing matches
text = "This is a good day"
print("Wonderful") if re.search("good", text) else print("Alas :(")

Wonderful
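
To make the return value of `re.search()` concrete, here is a minimal sketch that inspects it directly instead of only testing its truthiness. The variable names (`match`, `missing`) are illustrative and not part of the original notebook.

```python
import re

text = "This is a good day"

# A successful search returns a Match object describing the first hit.
match = re.search("good", text)
print(match.group())   # the matched text
print(match.span())    # (start, end) indices of the match

# An unsuccessful search returns None, which is falsy in an if-test.
missing = re.search("bad", text)
print(missing)
```

Because a Match object is always truthy and `None` is falsy, the one-line `if` in the cell above works without any explicit comparison.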


In [3]:
# split(): splits the string at every match of the pattern and returns the pieces as a list
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful"
re.split("Amy", text)

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful']

In [4]:
# findall(): looks for a pattern and returns all non-overlapping occurrences
# How many times have we mentioned Amy?
re.findall("Amy", text)

['Amy', 'Amy', 'Amy']
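
As a small follow-up, the results of `findall()` and `split()` can be post-processed with ordinary Python. The sketch below counts the mentions and drops the empty leading piece that `split()` produces; the names `mentions` and `parts` are illustrative only.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful"

# findall() returns a list, so len() gives the number of mentions.
mentions = len(re.findall("Amy", text))
print(mentions)

# split() keeps an empty string when the text starts with the pattern;
# strip the pieces and filter out the empty ones.
parts = [piece.strip() for piece in re.split("Amy", text) if piece.strip()]
print(parts)
```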

---

### Pattern and Character Classes

Here we are going to use the `[]` notation, which defines a character class: a set of characters inside the regular expression. The class matches exactly one character from the set at that position in the string.

In [5]:
grades = "ACAAAABCBCBAA"
# A character class matches any one of the listed characters: here A or B.
# Inside [] the | is a literal pipe, not an OR operator, so it is left out.
re.findall("[AB]", grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']
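
The difference between a character class and the `|` alternation operator is easy to miss, because both give the same result on the grades string. The following sketch uses a made-up sample string that contains literal pipes to show where they diverge.

```python
import re

sample = "A|B|C"  # hypothetical string containing literal pipe characters

# Inside a character class, | is just another character in the set.
with_pipe = re.findall("[A|B]", sample)
print(with_pipe)       # ['A', '|', 'B', '|']

# Listing the characters is enough to express "A or B".
class_only = re.findall("[AB]", sample)
print(class_only)      # ['A', 'B']

# Outside a class, | is alternation between whole sub-patterns.
alternation = re.findall("AB|BC", "ABCBC")
print(alternation)     # ['AB', 'BC']
```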

In [6]:
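# Define a range of values: an A followed by any character from A to C
re.findall("[A][A-C]", grades)
# Notice that AB is missing: findall() returns non-overlapping matches, and the A
# in front of the B was already consumed by the preceding AA match.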

['AC', 'AA', 'AA', 'AA']
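
If overlapping pairs are wanted as well, one possible approach (not used in the original notebook) is to wrap the pattern in a lookahead with a capturing group, so that each match consumes no characters. This is only a sketch of that idea.

```python
import re

grades = "ACAAAABCBCBAA"

# Plain pattern: matches consume their characters, so pairs cannot overlap.
pairs = re.findall("A[A-C]", grades)
print(pairs)

# Lookahead (?=...) matches at a position without consuming anything,
# so every A followed by A, B, or C is reported, including AB.
overlapping = re.findall("(?=(A[A-C]))", grades)
print(overlapping)
```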

In [7]:
# A caret at the start of a character class means negation. For example:
print(re.findall("[^A]", grades))
# Now compare it with this:
print(re.findall("^[^A]", grades))
# It returns nothing: the caret outside the brackets anchors the match to the start
# of the string, and our grades string starts with A

['C', 'B', 'C', 'B', 'C', 'B']
[]
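
The two roles of the caret can be checked side by side. The sketch below repeats the anchored pattern on a second, made-up string (`"BAC"`) that does not start with A.

```python
import re

grades = "ACAAAABCBCBAA"

# ^ outside brackets: anchor to the start of the string.
starts_with_a = re.findall("^A", grades)
print(starts_with_a)

# ^ outside plus ^ inside: a first character that is not A.
first_not_a = re.findall("^[^A]", grades)
print(first_not_a)

# The same pattern on a hypothetical string that starts with B.
first_not_a_other = re.findall("^[^A]", "BAC")
print(first_not_a_other)
```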


---

### Quantifiers

In [8]:
# How many times has this student been on a streak of back-to-back A's?
# {n,m} matches between n and m repetitions of the preceding token
print(re.findall("A{2,10}", grades))
# Here we are looking for two A's back-to-back
print(re.findall("A{1,1}A{1,1}", grades))
# Equivalent to
# re.findall("AA", grades)
# re.findall("A{2}", grades)

['AAAA', 'AA']
['AA', 'AA', 'AA']


Three more quantifiers are used all the time: `*` (zero or more repetitions), `+` (one or more), and `?` (zero or one).
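
A minimal sketch of how the three quantifiers differ, using a small made-up string rather than the notebook's data:

```python
import re

sample = "A AB ABB B"  # hypothetical test string

# * : an A followed by zero or more B's
star = re.findall("AB*", sample)
print(star)       # ['A', 'AB', 'ABB']

# + : an A followed by one or more B's
plus = re.findall("AB+", sample)
print(plus)       # ['AB', 'ABB']

# ? : an A followed by at most one B
optional = re.findall("AB?", sample)
print(optional)   # ['A', 'AB', 'AB']
```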

In [11]:
# We are going to load some text
with open("ferpa.txt", "r") as file:
    wiki = file.read()
wiki

'Overview[edit]\nFERPA gives parents access to their child\'s education records, an opportunity to seek to have the records amended, and some control over the disclosure of information from the records. With several exceptions, schools must have a student\'s consent prior to the disclosure of education records after that student is 18 years old. The law applies only to educational agencies and institutions that receive funds under a program administered by the U.S. Department of Education.\n\nOther regulations under this act, effective starting January 3, 2012, allow for greater disclosures of personal and directory student identifying information and regulate student IDs and e-mail addresses.[2] For example, schools may provide external companies with a student\'s personally identifiable information without the student\'s consent.[2]\n\nExamples of situations affected by FERPA include school employees divulging information to anyone other than the student about the student\'s grades o

In [14]:
# Raw strings (r"...") keep backslashes such as \[ and \w intact for the regex engine
print(re.findall(r"[a-zA-Z]{1,100}\[edit\]", wiki))
# The metacharacter \w matches any letter, digit, or underscore
print(re.findall(r"[\w]{1,100}\[edit\]", wiki))
# and with the * quantifier it becomes simpler
print(re.findall(r"[\w]*\[edit\]", wiki))

['Overview[edit]', 'records[edit]', 'records[edit]']
['Overview[edit]', 'records[edit]', 'records[edit]']
['Overview[edit]', 'records[edit]', 'records[edit]']


In [18]:
# We can improve this. Taking advantage of the * quantifier,
# we can match any run of \w characters and spaces
print(re.findall(r"[\w ]*\[edit\]", wiki))
# And that's all! Here we have the section titles of the article.
# Finally, we can split on "[" and keep only the title text
for title in re.findall(r"[\w ]*\[edit\]", wiki):
    print(re.split(r"\[", title)[0])

['Overview[edit]', 'Access to public records[edit]', 'Student medical records[edit]']
Overview
Access to public records
Student medical records
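
One detail worth knowing: because `*` also accepts zero repetitions, the pattern above would match a bare `[edit]` with no title in front of it. The sketch below shows this on a short made-up string (the real `ferpa.txt` is not needed to run it) and compares it with `+`.

```python
import re

# Hypothetical sample with one stray "[edit]" that has no title before it.
sample = "Overview[edit]\nSome text.\n[edit]\nAccess to public records[edit]\n"

# * allows an empty title, so the stray marker is returned too.
with_star = re.findall(r"[\w ]*\[edit\]", sample)
print(with_star)

# + requires at least one title character.
with_plus = re.findall(r"[\w ]+\[edit\]", sample)
print(with_plus)
```

Which quantifier is preferable depends on whether empty titles should be reported or silently skipped.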


---

### Groups

In [19]:
# Parentheses split a pattern into groups, so several sub-patterns can be captured at once.
re.findall(r"([\w ]*)(\[edit\])", wiki)
# Here we can see that Python returns one tuple per match, with one item per group.

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]
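
When only one part of the match is needed, a single capturing group is enough: with exactly one group, `findall()` returns plain strings instead of tuples. The sketch below runs on a short made-up sample and also shows a non-capturing group `(?:...)`, which groups without capturing.

```python
import re

sample = "Overview[edit]\nAccess to public records[edit]\n"  # hypothetical sample

# One capturing group: findall() returns just the captured titles.
titles = re.findall(r"([\w ]+)\[edit\]", sample)
print(titles)

# (?:...) groups the marker without adding it to the result.
titles_noncapturing = re.findall(r"([\w ]+)(?:\[edit\])", sample)
print(titles_noncapturing)
```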

In [29]:
# We could loop over the tuples to access each item,
# but finditer() returns an iterator of Match objects that makes this easier
for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.groups())
    print(item.group(0))
    print(item.group(1))
    print(item.group(2))
    print("\n")
# group(0) is the whole match; group(1) and group(2) are the individual groups

('Overview', '[edit]')
Overview[edit]
Overview
[edit]


('Access to public records', '[edit]')
Access to public records[edit]
Access to public records
[edit]


('Student medical records', '[edit]')
Student medical records[edit]
Student medical records
[edit]




Named groups are a very useful feature: they let you build a dict directly from the data captured by the regular expression.

In [30]:
# With (?P<name>...) you can name a group; groupdict() then returns the names as keys
# and the matched text as values.
for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])", wiki):
    print(item.groupdict())


{'title': 'Overview', 'edit_link': '[edit]'}
{'title': 'Access to public records', 'edit_link': '[edit]'}
{'title': 'Student medical records', 'edit_link': '[edit]'}
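
Named groups can also be read one at a time. The following sketch, run on a short made-up sample, collects the dicts into a list and accesses a single group by name; the variable names are illustrative.

```python
import re

sample = "Overview[edit]\nAccess to public records[edit]\n"  # hypothetical sample

# Compiling the pattern once lets us reuse it for several calls.
section_pattern = re.compile(r"(?P<title>[\w ]+)(?P<edit_link>\[edit\])")

# One dict per match, ready for further processing.
records = [m.groupdict() for m in section_pattern.finditer(sample)]
print(records)

# A single named group can be read with group("name") or with m["name"].
first = section_pattern.search(sample)
print(first.group("title"))
print(first["edit_link"])
```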


### Example

In [31]:
with open("buddhist.txt", "r") as file:
    wiki = file.read()
wiki

'Buddhist universities and colleges in the United States\nFrom Wikipedia, the free encyclopedia\nJump to navigationJump to search\n\nThis article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.\nFind sources: "Buddhist universities and colleges in the United States" – news · newspapers · books · scholar · JSTOR (December 2009) (Learn how and when to remove this template message)\nThere are several Buddhist universities in the United States. Some of these have existed for decades and are accredited. Others are relatively new and are either in the process of being accredited or else have no formal accreditation. The list includes:\n\nDhammakaya Open University – located in Azusa, California, part of the Thai Wat Phra Dhammakaya[1]\nDharmakirti College – located in Tucson, Arizona Now called Awam Tibetan Buddhist Institute (http://awaminstitute.org/)\nDharma Realm Buddh

In [35]:
# Example:
# Dhammakaya Open University – located in Azusa, California, part of the Thai Wat Phra Dhammakaya

pattern = r"""
(?P<title>.*)       # the university title
(–\ located\ in\ )  # an indicator of location
(?P<city>\w*)       # city the university is in
(,\ )               # separator for the state
(?P<state>\w+)      # the state the city is located in
"""
for item in re.finditer(pattern, wiki, re.VERBOSE):
    print(item.groupdict())

# Important: groupdict() only contains named groups,
# so the unnamed groups are not included in the result.

{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University ', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute ', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Institute of Buddhist Studies ', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College ', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'University of the West ', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies ', 'city': 'Glenside', 'state': 'Pennsylvania'}
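
The titles above end with a trailing space because the space before the dash is captured by the `title` group. One possible variation, sketched below on two lines taken from the output shown earlier, makes the title lazy and moves the space into the uncaptured separator. The name `university_pattern` and the use of `[ ]` for literal spaces are choices made for this sketch, not part of the original notebook.

```python
import re

# Two sample lines copied from the article text shown above.
sample = (
    "Dhammakaya Open University – located in Azusa, California, "
    "part of the Thai Wat Phra Dhammakaya[1]\n"
    "Dharmakirti College – located in Tucson, Arizona Now called "
    "Awam Tibetan Buddhist Institute (http://awaminstitute.org/)\n"
)

# In VERBOSE mode whitespace is ignored, so literal spaces are written as [ ].
university_pattern = re.compile(r"""
    (?P<title>.*?)            # university title, lazy so the trailing space is excluded
    [ ]–[ ]located[ ]in[ ]    # location indicator, matched but not captured
    (?P<city>\w+)             # city the university is in
    ,[ ]                      # separator before the state
    (?P<state>\w+)            # state the city is located in
""", re.VERBOSE)

universities = [m.groupdict() for m in university_pattern.finditer(sample)]
for entry in universities:
    print(entry)
```

Like the original pattern, this sketch uses `\w+` for the city and state, so it only captures the first word of a multi-word name.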
