## Regex

In [1]:
import re


In [2]:
# match() checks for a match at the beginning of the string
# search() checks for a match anywhere in the string
# Both return a re.Match object (truthy) on success and None otherwise

text = 'This is a good day'

if re.search('good',text):
    print('Wonderful!')
    

Wonderful!
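As a quick check of what these two functions hand back, the sketch below calls both on the same `text` and prints the results. Each one returns a `re.Match` object when it succeeds and `None` when it does not, which is why the result can be used directly in an `if`.

```python
import re

text = 'This is a good day'

# match() only looks at the start of the string, so 'good' is not found there
at_start = re.match('good', text)
# search() scans the whole string and finds 'good' further in
anywhere = re.search('good', text)

print(at_start)         # None
print(anywhere.span())  # (10, 14)
```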


In [3]:
# findall() returns a list of all non-overlapping matches; split() cuts the string
# at each match and returns the pieces. Matching is case-sensitive, so 'HE' below is not matched.
text = ' HE is a very goofy. He is a boy. He is still a kid.'

In [4]:
re.findall('He', text)

['He', 'He']

In [5]:
re.split('He',text)

[' HE is a very goofy. ', ' is a boy. ', ' is still a kid.']

### Patterns and Character classes

A regular expression is written in a small pattern language of its own. The basic building blocks are:

* Anchors - tie the match to the start or end of the string
  * `^` - start
  * `$` - end
* `[]` - a character class, which matches any one of the listed characters



- `^` at the start of a pattern anchors the match to the beginning of the string, so it selects text that starts with the characters that follow it.
- `re.search()` returns a `re.Match` object, which also reports where the match was found, i.e., the `span`.
 

In [6]:
text = 'He is a very goofy. He is a boy. He is still a kid.'

re.search("^He", text)

<re.Match object; span=(0, 2), match='He'>
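The `$` anchor is listed above but not demonstrated, so here is a minimal sketch of both anchors on the same `text`. The patterns `kid\.$` and `^is` are illustrative choices, not part of the original notebook.

```python
import re

text = 'He is a very goofy. He is a boy. He is still a kid.'

# '^He' matches only the 'He' at the very start of the string
start = re.search('^He', text)
# 'kid\.$' matches only if the string ends with 'kid.' (the dot is escaped)
end = re.search(r'kid\.$', text)
# '^is' finds nothing: 'is' occurs in the string, but not at the start
no_match = re.search('^is', text)

print(start.span())  # (0, 2)
print(end.group())   # kid.
print(no_match)      # None
```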

For now, let us discuss character-wise selections. Say we have a string named `grades` in which each character corresponds to a student's grade in a subject:

In [7]:
grades = 'ABCAFAADDACADAC'

# Let us find all the A grades in the string.
re.findall('A',grades)

['A', 'A', 'A', 'A', 'A', 'A', 'A']

In general, the `[]` operator matches any one of the characters listed inside it. For example, to pull every A and B grade out of the `grades` string with `findall()`, we use the code in the following cell.

In [8]:
# Finding A and B grades.
re.findall("[AB]",grades)

['A', 'B', 'A', 'A', 'A', 'A', 'A', 'A']

The `[]` operator accepts any set of characters: each character is treated individually, and any character of the string that appears in the set is matched. Below, only `A` from `[Akhil]` occurs in `grades`.

In [9]:
re.findall('[Akhil]', grades)

['A', 'A', 'A', 'A', 'A', 'A', 'A']

Now, if you want to find the places where an A is immediately followed by a B, you can use two `[]` in a row to describe the sequence.

In [10]:
re.findall('[A][B]', grades)

['AB']

This becomes more useful when the first character is fixed but the next one can come from a range. Say you want all the instances where an A is immediately followed by a B, a C, or a D. The range `[B-D]` expresses that:

In [11]:
# A followed by B or C or D
re.findall('[A][B-D]', grades)

['AB', 'AD', 'AC', 'AD', 'AC']

Besides the `[]` operator, you can use the `|` operator to choose between alternatives. The pipe means `or`, and each alternative can be a whole sub-pattern rather than a single character.

In [12]:
#A followed by B or A followed by D
re.findall('AB|AD', grades)

['AB', 'AD', 'AD']

If you want everything except a character or a set of characters, put `^` as the first character inside `[]`. This negates the set.

In [13]:
# Find all the grades which are not A
re.findall('[^A]',grades)

['B', 'C', 'F', 'D', 'D', 'C', 'D', 'C']

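Note the difference between `^` outside the square brackets (anchor to the start of the string) and `^` inside them (negate the set). The two can be combined: the pattern below asks for a first character that is not an A, and it returns an empty list because `grades` starts with A.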

In [14]:
re.findall('^[^A]', grades)

[]
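To see the combined pattern succeed as well as fail, the sketch below runs it on `grades` and on a second string. The string `other` is a made-up example that starts with a B.

```python
import re

grades = 'ABCAFAADDACADAC'   # starts with 'A'
other = 'BACAD'              # hypothetical string that starts with 'B'

# '^[^A]': at the start of the string, one character that is not 'A'
first_of_grades = re.findall('^[^A]', grades)
first_of_other = re.findall('^[^A]', other)

print(first_of_grades)  # []
print(first_of_other)   # ['B']
```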

### Quantifiers

Quantifiers specify how many times a pattern must repeat to count as a match. The most basic quantifier is `e{m,n}`, where
* e - the expression or character that we are matching
* m - the minimum number of times
* n - the maximum number of times the item can be matched

In [15]:
re.findall('A{2,10}',grades)

['AA']

The pattern above looks for runs of 2 to 10 A's in a row. The pattern below instead matches exactly two A's back to back, so a longer run such as 'AAAA' would yield ['AA', 'AA'] rather than one long match. On `grades` both give the same result, because the only run of consecutive A's has length 2.

In [16]:
re.findall('A{1,1}A{1,1}', grades)

['AA']
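The two patterns only differ when a run is longer than two, which `grades` does not contain. A small sketch on a made-up string (`sample` is an assumption, not data from the notebook) makes the difference visible:

```python
import re

sample = 'AAAAB'  # hypothetical grades with four A's in a row

# Exactly two A's per match: the scan resumes after each match
pairs = re.findall('A{1,1}A{1,1}', sample)
# Between 2 and 10 A's: the quantifier is greedy, so the whole run is one match
runs = re.findall('A{2,10}', sample)

print(pairs)  # ['AA', 'AA']
print(runs)   # ['AAAA']
```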

There are some more quantifiers that are used as shorthand:
* `*` - match 0 or more times
* `?` - match 0 or 1 time
* `+` - match 1 or more times
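A short sketch of how the three shorthand quantifiers behave on the same input. The string `sample` is invented for illustration; in each pattern the quantifier applies to the `A` in front of the `B`.

```python
import re

sample = 'AB AAB B'  # hypothetical input

star = re.findall('A*B', sample)      # zero or more A's, then B
plus = re.findall('A+B', sample)      # one or more A's, then B
question = re.findall('A?B', sample)  # zero or one A, then B

print(star)      # ['AB', 'AAB', 'B']
print(plus)      # ['AB', 'AAB']
print(question)  # ['AB', 'AB', 'B']
```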

Let us look at an example. For this, we will use a Wikipedia article on FERPA saved in `ferpa.txt`.

In [17]:
with open('data\\Week_1\\ferpa.txt','r') as file:
    wiki=file.read()

print(wiki)


Overview[edit]
FERPA gives parents access to their child's education records, an opportunity to seek to have the records amended, and some control over the disclosure of information from the records. With several exceptions, schools must have a student's consent prior to the disclosure of education records after that student is 18 years old. The law applies only to educational agencies and institutions that receive funds under a program administered by the U.S. Department of Education.

Other regulations under this act, effective starting January 3, 2012, allow for greater disclosures of personal and directory student identifying information and regulate student IDs and e-mail addresses.[2] For example, schools may provide external companies with a student's personally identifiable information without the student's consent.[2]

Examples of situations affected by FERPA include school employees divulging information to anyone other than the student about the student's grades or behavior,

In [18]:
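# [a-zA-Z] (not A-z, which also covers punctuation such as '[' and '_'); raw string for the backslashes
re.findall(r'[a-zA-Z]{1,100}\[edit\]', wiki)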

['Overview[edit]', 'records[edit]', 'records[edit]']

### Shorthands

* `\w` - any letter, digit, or underscore
* `.` - any single character except a newline
* `\d` - any digit
* `\s` - any whitespace character
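The following sketch tries each shorthand on a small made-up string (`sample` is an assumption for illustration). Raw strings (`r'...'`) are used so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

sample = 'Room 42, floor 3'  # hypothetical input

digits = re.findall(r'\d', sample)    # single digits
numbers = re.findall(r'\d+', sample)  # runs of digits
words = re.findall(r'\w+', sample)    # runs of letters, digits, underscores
spaces = re.findall(r'\s', sample)    # whitespace characters
four_any = re.findall(r'4.', sample)  # '4' followed by any one character

print(digits)    # ['4', '2', '3']
print(numbers)   # ['42', '3']
print(words)     # ['Room', '42', 'floor', '3']
print(spaces)    # [' ', ' ', ' ']
print(four_any)  # ['42']
```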

In [19]:
re.findall(r'[\w]{1,100}\[edit\]', wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [20]:
re.findall(r'[\w]*\[edit\]', wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [21]:
re.findall(r'[\w ]*\[edit\]', wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

In [22]:
for title in re.findall(r'[\w ]*\[edit\]', wiki):
    print(re.split(r'[\[]', title)[0])

Overview
Access to public records
Student medical records


### Groups

In [23]:
re.findall(r'([\w ]*)(\[edit\])', wiki)

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]

In [24]:
for item in re.finditer(r'([\w ]*)(\[edit\])', wiki):
    print(item.groups())

('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')


In [25]:
for item in re.finditer(r'([\w ]*)(\[edit\])', wiki):
    print(item.group(1))

Overview
Access to public records
Student medical records


In [26]:
# Labelling and naming groups
# Syntax: (?P<name>pattern)
for item in re.finditer(r'(?P<Title>[\w ]*)(?P<edit_link>\[edit\])', wiki):
    print(item.group('Title'))

Overview
Access to public records
Student medical records


In [27]:
for item in re.finditer(r'(?P<Title>[\w ]*)(?P<edit_link>\[edit\])', wiki):
    print(item.groupdict())

{'Title': 'Overview', 'edit_link': '[edit]'}
{'Title': 'Access to public records', 'edit_link': '[edit]'}
{'Title': 'Student medical records', 'edit_link': '[edit]'}


### Look-ahead and Look-Behind matching

In [28]:
for item in re.finditer(r'(?P<Title>[\w ]+)(?=\[edit\])', wiki):
    print(item.groupdict())

{'Title': 'Overview'}
{'Title': 'Access to public records'}
{'Title': 'Student medical records'}
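The cell above uses a look-ahead, `(?=...)`, which requires the text that follows to match without making it part of the result. The look-behind counterpart is `(?<=...)`, which requires the text just before the match. A minimal sketch on a made-up string (`sample` and its `key=value` layout are assumptions for illustration):

```python
import re

sample = 'id=17; name=Ada; id=23'  # hypothetical input

# Look-behind: digits that are preceded by 'id=' ('id=' itself is not returned)
ids = re.findall(r'(?<=id=)\d+', sample)
# Look-ahead: words that are followed by '=' ('=' itself is not returned)
keys = re.findall(r'\w+(?==)', sample)

print(ids)   # ['17', '23']
print(keys)  # ['id', 'name', 'id']
```

Note that Python's `re` module requires a look-behind pattern to have a fixed width, so something like `(?<=\w+=)` is rejected.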


### Another example: buddhist.txt

In [29]:
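# Pass the encoding explicitly (assuming the file is UTF-8). Without it, Windows falls back
# to its default 'charmap' codec and raises:
# UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1589
with open('data\\Week_1\\buddhist.txt', 'r', encoding='utf-8') as f:
    wiki = f.read()

print(wiki)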

In [None]:
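# In re.VERBOSE mode whitespace in the pattern is ignored, so literal spaces are escaped
pattern = r"""
(?P<title>.*)        # title: everything before the separator
(–\ located\ in\ )   # literal separator text
(?P<city>\w*)        # city
(,\ )                # comma and space between city and state
(?P<state>\w*)       # state
"""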
for item in re.finditer(pattern,wiki,re.VERBOSE):
    print(item.groupdict())
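Since the file contents are not shown here, the sketch below runs the same verbose pattern on a single invented line to show the shape of the result. The string `sample` is hypothetical and is not taken from `buddhist.txt`; it only assumes a "title – located in city, state" layout.

```python
import re

# Hypothetical sample line, made up for illustration (not taken from buddhist.txt)
sample = 'Example University – located in Springfield, Illinois'

pattern = r"""
(?P<title>.*)        # everything before the separator
(–\ located\ in\ )   # literal separator; spaces are escaped in verbose mode
(?P<city>\w*)        # city
(,\ )                # comma and space
(?P<state>\w*)       # state
"""

matches = [m.groupdict() for m in re.finditer(pattern, sample, re.VERBOSE)]
for fields in matches:
    print(fields)
```

Because `.*` is greedy and the separator starts with the dash, the captured title keeps the space in front of the dash; calling `.strip()` on it is one way to clean that up.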

### Another example: nytimeshealth.txt

In [None]:
with open('data\\Week_1\\nytimeshealth.txt', 'r', encoding='utf-8') as f:  # assuming the file is UTF-8
    wiki = f.read()

print(wiki)

In [None]:
pattern = r'#[\w\d]*(?=\s)'
re.findall(pattern,wiki)
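A small sketch of what this hashtag pattern picks up, run on an invented tweet (`sample` is hypothetical). Because the look-ahead demands a whitespace character, a hashtag at the very end of the text is skipped; the second pattern, which also accepts the end of the string, is shown only as one possible variation.

```python
import re

sample = 'Get your shot #flu #health'  # hypothetical tweet text

# Original pattern: the hashtag must be followed by whitespace
followed_by_space = re.findall(r'#[\w\d]*(?=\s)', sample)
# Variation (an assumption, not from the notebook): also accept end of string
with_end = re.findall(r'#\w+(?=\s|$)', sample)

print(followed_by_space)  # ['#flu']
print(with_end)           # ['#flu', '#health']
```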

In [None]:
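pattern = r'''
(\|)                            # literal pipe; unescaped, | means "or"
(?P<Date>\w*\ \w*\ \d*)         # date, e.g. "Wed Dec 24"
(\ )
(?P<time>\d*:\d*:\d*)           # time
(\ \+\d*\ )                     # UTC offset, e.g. " +0000 "
(?P<year>\d{4})                 # year
(\|)                            # literal pipe
'''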
for item in re.finditer(pattern,wiki,re.VERBOSE):
    print(item.groupdict())

In [None]:
pattern = r'''
(\ )
(?P<link>\w*://\w*\.\w*/\w*)    # scheme://host.tld/path
'''
for item in re.finditer(pattern,wiki,re.VERBOSE):
    print(item.groupdict())

 http://nyti.ms/1xm7fTi

|Breaking: Lab Error May Have Exposed U.S. Technician to Ebola Virus http://nyti.ms/1ziIYf6

In [None]:
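pattern = r'''
(\|)                            # literal pipe before the tweet text
(?P<Tweet>[^|\n]*?)             # tweet text, up to the space before the link
(?=\ \w*://\w*\.\w*/\w*)        # look ahead for " scheme://host.tld/path"
'''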
for item in re.finditer(pattern,wiki,re.VERBOSE):
    print(item.groupdict())

Your Money: Affordable Care Act’s Tax Effects Now Loom for Filers

In [None]:
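pattern = r'''
(\|)                            # literal pipe; unescaped, | means "or"
(?P<Date>\w*\ \w*\ \d*)         # date
(\ )
(?P<time>\d*:\d*:\d*)           # time
(\ \+\d*\ )                     # UTC offset
(?P<year>\d{4})                 # year
(\|)                            # literal pipe
(?P<Tweet>[^|\n]*?)             # tweet text, up to the space before the link
(?=\ \w*://\w*\.\w*/\w*)        # look ahead for the link
'''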
for item in re.finditer(pattern,wiki,re.VERBOSE):
    print(item.groupdict())

547904193001185280|Wed Dec 24 23:58:33 +0000 2014|Machine Learning: Bedtime Technology for a Better Night’s Sleep http://nyti.ms/16RCgpE
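To tie the pieces together, here is a sketch that parses the sample line above in one pass. It is one possible way to write the pattern, not the notebook's own: the group names `id` and `link` are additions for illustration, `[ ]` is used for a literal space so that trailing whitespace in the verbose pattern cannot be lost, and it assumes every line has the `id|date|text link` layout shown.

```python
import re

line = ('547904193001185280|Wed Dec 24 23:58:33 +0000 2014|'
        'Machine Learning: Bedtime Technology for a Better Night’s Sleep '
        'http://nyti.ms/16RCgpE')

pattern = r'''
(?P<id>\d+)\|                    # tweet id, then a literal pipe
(?P<date>\w+[ ]\w+[ ]\d+)[ ]     # weekday, month, day
(?P<time>\d+:\d+:\d+)[ ]         # hours:minutes:seconds
\+\d+[ ]                         # UTC offset (not captured)
(?P<year>\d{4})\|                # four-digit year, then a literal pipe
(?P<tweet>.*?)[ ]                # tweet text, as short as possible
(?P<link>\w+://\S+)              # link: scheme, then non-whitespace
'''

m = re.search(pattern, line, re.VERBOSE)
print(m.groupdict())
```

The lazy `.*?` lets the tweet text contain punctuation such as `:` or `’` without listing every allowed character, at the cost of stopping at the first link if a tweet has more than one.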