# Regex

In [1]:
import re

1. `search`: looks for a pattern and returns a boolean 
1. `split`: use a pattern for creating a list of substrings
1. `findall`: look for a pattern and pull all ocurrences

In [10]:
text = 'Amy works diligently. Amy gets good grades. Our student Amy is succesful.'

In [87]:
# search returns a boolean and a re.Match object
if re.search('gets', text):
    print('Wonderful')
else:
    print('Alas')

re.search('gets', text)

Wonderful


<re.Match object; span=(26, 30), match='gets'>

In [11]:
re.split('Amy', text)

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is succesful.']

In [15]:
re.findall('Amy', text)

['Amy', 'Amy', 'Amy']

## Markup language to describe patterns

- `^`: means start
- `$`: means end


In [29]:
text = 'Amy works diligently. Amy gets good grades. Our student Amy is succesful.'

In [36]:
# The span shows where it find the pattern, and where it ends the pattern
re.search('^Amy', text)

<re.Match object; span=(0, 3), match='Amy'>

## Patterns and Character Classes

- `[]`: set operator

In [38]:
grades = 'ACAAAABCBCBAA'

In [40]:
# How many B's were in the grade list?
re.findall('B', grades)

['B', 'B', 'B']

In [44]:
# Count A's or B's
re.findall('[AB]', grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

In [51]:
# Find all instances where the student receive an A followed by a B or a C
re.findall('A[B-C]', grades)

['AC', 'AB']

In [53]:
re.findall('AB|AC', grades)

['AC', 'AB']

In [56]:
# "^" inside the set operator means negative or that doesn't match the pattern
re.findall('[^A]', grades)

['C', 'B', 'C', 'B', 'C', 'B']

In [58]:
re.findall('^[^A]', grades)

[]

## Quantifiers

Is the number of times you want a pattern to be matched in order to count.

1. `*`: 0 or more times
1. `?`: 0 or 1 times
1. `+`: 1 or more times
1. `{n}`: exactly n times
1. `{n,}`: at least n times
1. `{n,m}`: from n to m times

- `\w` to match any letter or digit
- `\s` matchs any whitespace character, like spaces and tabs
- `\d` for any digit
- `.` any character which is not a newline

In [59]:
grades = 'ACAAAABCBCBAA'

In [62]:
re.findall('A{2,10}', grades)

['AAAA', 'AA']

In [66]:
re.findall('A{1,1}A{1,1}', grades), re.findall('AA', grades)

(['AA', 'AA', 'AA'], ['AA', 'AA', 'AA'])

In [68]:
# just {x}, its considered both {m, n}, {2} equivalent {2, 2} 
re.findall('A{2}', grades)

['AA', 'AA', 'AA']

In [71]:
re.findall('A{1,10}B{1,10}C{1,10}', grades)

['AAAABC']

In [94]:
with open('ferpa.txt', 'r') as file:
    wiki = file.read()

wiki[:1000]

'Family Educational Rights and Privacy Act\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation Jump to search\n\nThis article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)\n\nThis article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: "Family Educational Rights and Privacy Act" â€“ news Â· newspapers Â· books Â· scholar Â· JSTOR (January 2012) (Learn how and when to remove this template message)\n\nThis article\'s lead section may be too short to adequately summarize the key points. Please consider expanding the lead to provide an accessible overview of all important aspects of the article. (May 2015)\n\n(Learn how and when to remove this template message)\n\nFERPA\n\nLong title Family Educational Rights and Privacy Act\n\nCitations\n\nStatutes at L

In [97]:
# headers have the words [edit] behind them, followed by a newline
re.findall('[a-z A-Z]{1,100}\[edit\]', wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]',
 'See also[edit]',
 'References[edit]',
 'External links[edit]']

In [100]:
re.findall('[\w ]{1,100}\[edit\]', wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]',
 'See also[edit]',
 'References[edit]',
 'External links[edit]']

In [102]:
re.findall('[\w ]*\[edit\]', wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]',
 'See also[edit]',
 'References[edit]',
 'External links[edit]']

In [108]:
for title in re.findall('[\w ]*\[edit\]', wiki):
    print(re.split('[\[]', title)[0])

Overview
Access to public records
Student medical records
See also
References
External links


## Groups

To group patterns together you use parentheses `(` `)`

In [112]:
re.findall('([\w ]*)(\[edit\])', wiki)

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]'),
 ('See also', '[edit]'),
 ('References', '[edit]'),
 ('External links', '[edit]')]

4. `finditer`: if we want a list of Match objects
5. `groups`: returns a tuple of the group

In [120]:
for item in re.finditer('([\w ]*)(\[edit\])', wiki):
    print(item.groups(), item)

('Overview', '[edit]') <re.Match object; span=(1940, 1954), match='Overview[edit]'>
('Access to public records', '[edit]') <re.Match object; span=(4930, 4960), match='Access to public records[edit]'>
('Student medical records', '[edit]') <re.Match object; span=(6141, 6170), match='Student medical records[edit]'>
('See also', '[edit]') <re.Match object; span=(6693, 6707), match='See also[edit]'>
('References', '[edit]') <re.Match object; span=(6810, 6826), match='References[edit]'>
('External links', '[edit]') <re.Match object; span=(10696, 10716), match='External links[edit]'>


6. `group`: get an individual group using group and a number
    - 0: the whole match
    - Each other number is the portion of the match we are interested

In [126]:
# 
for item in re.finditer('([\w ]*)(\[edit\])', wiki):
    print(item.group(1))

Overview
Access to public records
Student medical records
See also
References
External links


### Naming groups
We use the syntax `(?P<name>)`

7. `groupdict`: takes a match and makes a dictionary based on the name of the group

In [138]:
for item in re.finditer('(?P<title>[\w ]*)(?P<edit_link>\[edit\])', wiki):
    print(item.groupdict()['title'])
item.groupdict()

Overview
Access to public records
Student medical records
See also
References
External links


{'title': 'External links', 'edit_link': '[edit]'}

## Look-ahead and Look-behind

Search for a pattern, but not capturing it.

We use `?=` syntax

In [147]:
for item in re.finditer('(?P<title>[\w ]+)(?=\[edit\])', wiki):
    print(item)

<re.Match object; span=(1940, 1948), match='Overview'>
<re.Match object; span=(4930, 4954), match='Access to public records'>
<re.Match object; span=(6141, 6164), match='Student medical records'>
<re.Match object; span=(6693, 6701), match='See also'>
<re.Match object; span=(6810, 6820), match='References'>
<re.Match object; span=(10696, 10710), match='External links'>


## Example

In [77]:
with open('buddhist.txt', 'r') as file:
    wiki = file.read()
wiki

'Buddhist universities and colleges in the United States\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation Jump to search\n\nThis article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: "Buddhist universities and colleges in the United States" - news Â· newspapers Â· books Â· scholar Â· JSTOR (December 2009) (Learn how and when to remove this template message)\n\nThere are several Buddhist universities in the United States. Some of these have existed for decades and are accredited. Others are relatively new and are either in the process of being accredited or else have no formal accreditation. The list includes:\n\nDhammakaya Open University - located in Azusa, California, part of the Thai Wat Phra Dhammakaya[1]\n\nDharmakirti College - located in Tucson, Arizona Now called Awam Tibetan Buddhist Institute (http://awaminstitute.org/)\n\nDharm

In [78]:
# each university follows a pattern, name - located in city, state
pattern = """
(?P<title>.*)       # the university title
(-\ located\ in\ )  # an indicator of the location
(?P<city>[\w ]*)       # city
(,\ )
(?P<state>\w*)      # state"""

In [79]:
ex = [item.groupdict() for item in re.finditer(pattern, wiki, re.VERBOSE)]

# ex is equivalent to this  

for item in re.finditer(pattern, wiki, re.VERBOSE):
    print(item.groupdict())

{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University ', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute ', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Naropa University ', 'city': 'Boulder', 'state': 'Colorado'}
{'title': 'Institute of Buddhist Studies ', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College ', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'Soka University of America ', 'city': 'Aliso Viejo', 'state': 'California'}
{'title': 'University of the West ', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies ', 'city': 'Glenside', 'state': 'Pennsylvania'}


## Test

In [99]:
prueba = ['www.aBC.com', 'abc.com', 'ab_c.de8f.com', 
          'abc', 'abc..com']

In [178]:
patron = '''
(w{3}.)?
(.*)
(\w)
(.com$)
'''

In [179]:
for word in prueba:
    for item in re.finditer(patron, word, re.VERBOSE):
        print(item)

<re.Match object; span=(0, 11), match='www.aBC.com'>
<re.Match object; span=(0, 7), match='abc.com'>
<re.Match object; span=(0, 13), match='ab_c.de8f.com'>


### Quiz

In [180]:
import numpy as np

In [186]:
a1 = np.random.rand(4)
a2 = np.random.rand(4, 1)
a3 = np.array([1, 2, 3, 4])
a4 = np.arange(1, 4, 1)

a3, a4

(array([1, 2, 3, 4]), array([1, 2, 3]))

In [197]:
old = np.array([[1, 1, 1],
               [1, 1, 1]])
old = np.ones((2, 3)).astype(int)

new = old
new[0, :2] = 0

old, new

(array([[0, 0, 1],
        [1, 1, 1]]),
 array([[0, 0, 1],
        [1, 1, 1]]))

In [201]:
s = 'ACAABAACAAAB'
result = re.findall('A{1,2}', s)
L = len(result)
L, result

(5, ['A', 'AA', 'AA', 'AA', 'A'])

In [207]:
test = 'I refer to https://google.com and I never refer http://www.baidu.com if I have to search anything'
pat = '(?<=[https]:\/\/)([A-Za-z0-9.]*)'
re.findall(pat, test)

['google.com', 'www.baidu.com']

In [208]:
text='''Everyone has the following fundamental freedoms:
    (a) freedom of conscience and religion;
    (b) freedom of thought, belief, opinion and expression, including freedom of the press and other media of communication;
    (c) freedom of peaceful assembly; and
    (d) freedom of association.'''

import re
pattern = '\(.\)'
print(len(re.findall(pattern,text)))

4
