# Lecture 8: Python Regular Expressions (Regex)

This lecture is about pattern matching in text using regular expressions. There are three main reasons for using regexes:
- to check whether a pattern exists within some text: <code>match()</code>, <code>search()</code>
- given a pattern, to extract all instances of that pattern which occur in the input text: <code>findall()</code>
- to modify your source data through string splitting or pattern-based substitution: <code>re.split()</code>, <code>re.sub(r"RE1", "RE2", target_str)</code>

Tutorials
https://regexone.com
https://www.regular-expressions.info/

Based on Coursera Applied-Data-Science-with-Python-Specialization course

In [None]:
# Import Python's regular expression module.
import re

### Checking 
- to check whether a pattern exists within some text <code>match(), search() </code> 

In [None]:
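# match() checks for a match at the beginning of the string and returns a Match object, or None if there is none.
# search() checks for a match anywhere in the string and also returns a Match object or None.
# Both results can be used directly in an if statement, since a Match object is truthy and None is falsy.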

text = "This is a good day."

if re.search("good", text): # the first parameter here is the pattern
    print("Wonderful!")
else:
    print("Alas :(")
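
To make the difference between the two functions concrete, here is a small sketch that reuses the sentence above. The variable names `at_start` and `anywhere` are illustrative only.

```python
import re

text = "This is a good day."

# match() only succeeds if the pattern matches at position 0.
at_start = re.match("good", text)     # None: "good" is not at the start
# search() scans the whole string for the first match.
anywhere = re.search("good", text)    # Match object: "good" is found later

print(at_start)            # None
print(anywhere.group())    # good
print(anywhere.span())     # (10, 14)
```

Because `None` is falsy and a Match object is truthy, either result can be placed directly in an `if` condition.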

### Getting all instances of a pattern 
- given a pattern, to get all instances of that pattern which occur in some text <code>findall()</code>

In [None]:
import re
text = "Hans is a new student. Hans is from Norway. Hans's friend is from Sri Lanka."

re.findall("Hans", text)

### Modifying text data

<code>re.split()</code>


In [None]:
# Both the findall() and split() functions will parse the input string and return chunks.
text = "Hans is a new student. Hans is from Norway. Hans's friend is from Sri Lanka."
# Split text on all instances of Hans
re.split("Hans", text)
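
As a rough illustration of how the two results differ, the sketch below runs both functions on a shortened version of the text. `findall()` returns the matched substrings, while `split()` returns the text between the matches.

```python
import re

text = "Hans is a new student. Hans is from Norway."

hits = re.findall("Hans", text)    # the matched substrings themselves
chunks = re.split("Hans", text)    # the pieces of text between the matches

print(hits)      # ['Hans', 'Hans']
print(chunks)    # ['', ' is a new student. ', ' is from Norway.']
```

The first chunk is an empty string because the text starts with a match.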

Replace every match of RE1 with RE2 in the string target_str:

<code>re.sub(r"RE1", "RE2", target_str)</code>

In [None]:
import re

target_str = "Jessa knows testing and machine learning"

# replace all whitespaces with underscore
res_str = re.sub(r"\s", "_", target_str)
# String after replacement
print(res_str)

### Patterns, Python regex language

https://docs.python.org/3/library/re.html

Which patterns can we write?
The regex syntax defines a language to describe patterns in text.

- Anchors specify the start and/or the end of the string that you are trying to match
- Metacharacters/character classes specify sets of characters
- Operators specify disjunction or negation
- Quantifiers specify the number of times a pattern must occur in the input string for the match to succeed
- Groups capture several sub-patterns of a match at the same time
- Compiling regexes and option flags

#### Anchors
- Anchors specify the start and/or the end of the string that you are trying to match. 
- The caret character ^ means start 
- The dollar sign character $ means end. 

If you put ^ before a pattern, the matched text must start at the beginning of the string. If you put $ after a pattern, the matched text must end at the end of the string.


In [None]:
text = "Hans is a new student. Hans is from Norway. Hans's friend is from Sri Lanka."

# Let's see if this begins with Hans
match = re.search("^Hans",text)
match
#match.span(0)

In [None]:
# Notice that re.search() returned a new object, an re.Match object. An re.Match object
# always has a boolean value of True, as something was found, so you can evaluate it in an if statement.
# When nothing is found, re.search() returns None instead, which evaluates to False.
# The rendering of the match object also tells you what text was matched, in this case
# the word Hans, and the location of the match, as the span.
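
The following sketch shows both anchors and the Match object side by side. It uses a shortened sentence, and the names `start`, `end` and `miss` are only illustrative.

```python
import re

text = "Hans is from Norway."

start = re.search("^Hans", text)       # matches: the text starts with "Hans"
end = re.search(r"Norway\.$", text)    # matches: the text ends with "Norway."
miss = re.search("^Norway", text)      # None: "Norway" is not at the start

if start:                                  # a Match object is truthy
    print(start.group(), start.span())     # Hans (0, 4)
print(end.span())                          # (13, 20)
print(miss)                                # None
```

The dot is escaped as `\.` in the second pattern because an unescaped dot would match any character.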

#### Set Operator
- defines a set of matching values
- [AB] matches either A or B

In [None]:
grades="DEDDDDEFEFEDD"

# How many D's?
re.findall("D",grades)

In [None]:
# How many D's or E's?
grades="DEDDDDEFEFEDD"
re.findall("[DE]",grades)

In [None]:
# How many E's followed by either D or F?
grades="DEDDDDEFEFEDD"
re.findall("E[DF]",grades)

##### Disjunction Operator
- defines a disjunction between patterns
- A|B matches either A or B

In [None]:
grades="DEDDDDEFEFEDD"
# All ED or EF
re.findall("ED|EF",grades)
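
For single characters, the set operator and the disjunction operator can express the same thing. A small sketch on the grades string from above:

```python
import re

grades = "DEDDDDEFEFEDD"

with_set = re.findall("E[DF]", grades)    # set: E followed by one character, D or F
with_alt = re.findall("ED|EF", grades)    # disjunction between two whole patterns

print(with_set)                # ['ED', 'EF', 'EF', 'ED']
print(with_set == with_alt)    # True
```

The set works on individual characters only, whereas `|` separates complete alternatives, which may be longer than one character.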

##### Negation Operator
You can use the caret with the set operator to negate the results.

In [None]:
grades="DEDDDDEFEFEDD"
# All non-D grades
re.findall("[^D]",grades)

In [None]:
# Note this carefully - the caret was previously used as an anchor matching the beginning of a string, but
# inside the set operator the caret, like the other special characters we will be talking about, takes on a
# different meaning: it negates the set. This can be a bit confusing. What do you think the result of this would be?
grades="DEDDDDEFEFEDD"
re.findall("^[^D]",grades)

In [None]:
# It's an empty list, because the regex says that we want to match any character at the beginning of the string
# which is not a D. Our string though starts with a D, so there is no match found. And remember when you are
# using the set operator you are doing character-based matching. So you are matching individual characters in
# an OR fashion.
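
The two roles of the caret can be compared directly. This is a minimal sketch on the same grades string, with illustrative variable names:

```python
import re

grades = "DEDDDDEFEFEDD"

not_d = re.findall("[^D]", grades)          # every character that is not D
first_not_d = re.findall("^[^D]", grades)   # a non-D character at the very start
first_not_e = re.findall("^[^E]", grades)   # a non-E character at the very start

print(not_d)          # ['E', 'E', 'F', 'E', 'F', 'E']
print(first_not_d)    # []
print(first_not_e)    # ['D']
```

Outside the brackets the caret anchors the pattern to the start of the string; as the first character inside the brackets it negates the set.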

##### Quantifiers

In [None]:
# A quantifier is the number of times you want a pattern to occur in order to match. The most basic
# quantifier is expressed as e{m,n}, where e is the expression or character we are matching, m is the minimum
# number of times you want it to be matched, and n is the maximum number of times the item could be matched.

grades="DEDDDDEFEFEDD"
# All sequences of D of length 2 to 10
re.findall("D{2,10}",grades) # we'll use 2 as our min, and 10 as our max

In [None]:
grades="DEDDDDEFEFEDD"
re.findall("D{1,2}",grades)

In [None]:
grades="DEDDDDEFEFEDD"

# if you don't include a quantifier then the default is {1,1}
# so this matches exactly two consecutive D's
re.findall("DD",grades)

In [None]:
grades="DEDDDDEFEFEDD"
# if you just have one number in the braces, it's considered to be both m and n
# All sequences of D of length exactly 2
re.findall("D{2}",grades)

In [None]:
grades="DEDDDDEFEFEDD"
# All sequences of 1 to 10 D's, followed by 1 to 10 E's, followed by 1 to 10 F's
re.findall("D{1,10}E{1,10}F{1,10}",grades)
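
Besides the brace notation, Python's `re` module offers three shorthand quantifiers: `+` for `{1,}`, `*` for `{0,}` and `?` for `{0,1}`. The sketch below tries them on the grades string. Quantifiers are greedy by default, so each one takes as many characters as it can.

```python
import re

grades = "DEDDDDEFEFEDD"

runs_plus = re.findall("D+", grades)       # + means one or more
runs_brace = re.findall("D{1,}", grades)   # the same thing in brace notation
optional = re.findall("ED?", grades)       # ? means zero or one
star = re.findall("ED*", grades)           # * means zero or more

print(runs_plus)    # ['D', 'DDDD', 'DD']
print(optional)     # ['ED', 'E', 'E', 'ED']
print(star)         # ['EDDDD', 'E', 'E', 'EDD']
```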

In [None]:
# Let's look at a more complex example, and load some data scraped from Wikipedia
with open("edit-regex.txt","r") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
wiki

In [None]:
# headers are followed by "[edit]\n"
# to get a list of all of the headers
re.findall(r"[a-zA-Z]{1,100}\[edit\]",wiki)

In [None]:
# Ok, that didn't quite work. It got all of the headers, but only the last word of each header.
# Use \w to match any word character: letters, digits and the underscore.
re.findall(r"[\w]{1,100}\[edit\]",wiki)

In [None]:
# Use the asterisk * to match 0 or more times.
re.findall(r"[\w]*\[edit\]",wiki)

In [None]:
# We can add a space to the set
re.findall(r"[\w ]*\[edit\]",wiki)

In [None]:
# Create a list of titles by iterating through the matches and applying another regex
for title in re.findall(r"[\w ]*\[edit\]",wiki):
    print(re.split(r"[\[]",title)[0])
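
Since the scraped file is not reproduced in these notes, the steps above can be tried on a small stand-in. The string `sample` below is hypothetical and only mimics the "header followed by [edit]" layout.

```python
import re

# Hypothetical stand-in for the scraped page, not the real file contents.
sample = "Overview[edit]\nSome text.\nAccess to records[edit]\nMore text.\n"

# Without a space in the set, only the last word of each header is captured.
last_words = re.findall(r"\w*\[edit\]", sample)
# With a space in the set, multi-word headers are captured in full.
full = re.findall(r"[\w ]*\[edit\]", sample)
# Cut each match at the opening bracket to keep just the title.
titles = [re.split(r"\[", header)[0] for header in full]

print(last_words)    # ['Overview[edit]', 'records[edit]']
print(titles)        # ['Overview', 'Access to records']
```

The patterns are written as raw strings (`r"..."`) so that the backslashes reach the regex engine unchanged.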

#### Groups
- To capture several sub-patterns, called groups, at the same time
- To create group patterns, use parentheses.
- The re module breaks out the result by groups

<code>re.finditer(RE,text)</code> to get the result, i.e., an iterator over Match objects

<code>match.group(n)</code> to get the nth group of a match

In [None]:
# Two groups: (i) any number of word characters or spaces (ii) [edit]
re.findall(r"([\w ]*)(\[edit\])",wiki)

In [None]:
# Two groups: (i) any number of word characters or spaces (ii) [edit]
m = re.finditer(r"([\w ]*)(\[edit\])",wiki)
m

In [None]:
# findall() returns strings (or tuples of strings when the pattern has groups), and search() and match()
# return individual Match objects. To iterate over all Match objects use the function finditer()
for item in re.finditer(r"([\w ]*)(\[edit\])",wiki):
    print(item.groups())

In [None]:
# We can get an individual group using group(number), where group(0) is the whole match, and each other number 
# is the portion of the match we are interested in.
for item in re.finditer(r"([\w ]*)(\[edit\])",wiki):
    print(item.group(1))
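
The group-related calls can be compared on a short example. The `sample` text below is a hypothetical stand-in for the scraped data.

```python
import re

sample = "Overview[edit]\nAccess to records[edit]\n"   # hypothetical sample text
pattern = r"([\w ]*)(\[edit\])"

# With groups in the pattern, findall() returns one tuple per match.
pairs = re.findall(pattern, sample)
# finditer() yields Match objects; take the first one to inspect it.
first = next(re.finditer(pattern, sample))

print(pairs)             # [('Overview', '[edit]'), ('Access to records', '[edit]')]
print(first.group(0))    # Overview[edit]
print(first.group(1))    # Overview
print(first.groups())    # ('Overview', '[edit]')
```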

#### Named Groups
- Groups can be named
- Named groups can be accessed through the dictionary associated with the Match object

<code>(?P\<name\>...)</code>, where the parenthesis starts the group, the <code>?P</code> indicates that this is an extension to basic regexes, and <code>name</code> is the dictionary key we want to use.

In [None]:
for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['title'])

In [None]:
# Printing out the whole dictionary
for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki):
    print(item.groupdict())
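
A named group can be read in several equivalent ways. The sketch below assumes a hypothetical `sample` string in place of the scraped data.

```python
import re

sample = "Overview[edit]\nAccess to records[edit]\n"   # hypothetical sample text
pattern = r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])"

m = re.search(pattern, sample)

print(m.group("title"))    # Overview  (access by name)
print(m["title"])          # Overview  (subscript access by name)
print(m.group(1))          # Overview  (named groups keep their number too)
print(m.groupdict())       # {'title': 'Overview', 'edit_link': '[edit]'}
```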

#### Look-ahead and Look-behind
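<code>(?=RE)</code> (look-ahead), <code>(?\<=RE)</code> (look-behind)
- the pattern given to the regex engine must match the text that comes right after (look-ahead) or right before (look-behind) the text we are trying to isolate, but it is not included in the match.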

In [None]:
# isolate the text which comes before [edit]
for item in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])",wiki):
    # the look-ahead requires the characters [edit] to follow, but they are not part of the match
    print(item[1])
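
The look-behind form works the same way in the other direction. A minimal sketch, where the `sample` string is hypothetical and chosen only to show both forms:

```python
import re

sample = "Overview[edit]\nprice: 42, total: 100\n"   # hypothetical sample text

# Look-ahead: text that is immediately followed by "[edit]".
ahead = re.findall(r"[\w ]+(?=\[edit\])", sample)
# Look-behind: digits that are immediately preceded by "total: ".
behind = re.findall(r"(?<=total: )\d+", sample)

print(ahead)     # ['Overview']
print(behind)    # ['100']
```

In Python's `re` module a look-behind pattern must have a fixed width, so something like `(?<=a+)` is rejected.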


#### Compiling regex and regex modes

The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search() or findall() etc., e.g. re.search(pat, str, re.IGNORECASE).

- IGNORECASE -- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
- DOTALL -- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up as you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*
- MULTILINE -- Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
- VERBOSE -- allows you to write multi-line regexes with comments, which increases readability. In this mode, whitespace in the pattern is ignored, so we have to explicitly indicate all whitespace characters we want to match, either by prepending them with a \ or by using the \s special value. Comments can be added with #
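
The effect of the first three flags can be seen on a two-line string. This is a small sketch with an illustrative `text` value:

```python
import re

text = "first line\nSecond LINE"

plain = re.findall("line", text)                       # case-sensitive by default
no_case = re.findall("line", text, re.IGNORECASE)      # ignore upper/lowercase

first_only = re.findall(r"^\w+", text)                 # ^ matches at string start only
each_line = re.findall(r"^\w+", text, re.MULTILINE)    # ^ matches at every line start

one_line = re.findall("first.*", text)                 # . stops at the newline
all_lines = re.findall("first.*", text, re.DOTALL)     # . also matches the newline

print(plain, no_case)          # ['line'] ['line', 'LINE']
print(first_only, each_line)   # ['first'] ['first', 'Second']
print(one_line, all_lines)     # ['first line'] ['first line\nSecond LINE']
```

Several flags can be combined with the `|` operator, for example `re.IGNORECASE | re.MULTILINE`.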


In [None]:
# A regex can be compiled for reuse
regex = re.compile(r"(?P<title>[\w ]+)(?=\[edit\])")
for item in regex.finditer(wiki):
    # the look-ahead requires [edit] to follow the title, but [edit] is not part of the match
    print(item[1])
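
A compiled pattern object has the same methods as the module-level functions and can also carry its flags. The sketch below uses a hypothetical `sample` string and illustrative names:

```python
import re

sample = "Overview[edit]\nAccess to records[edit]\n"   # hypothetical sample text

# Compile once, then call the methods on the pattern object.
title_re = re.compile(r"(?P<title>[\w ]+)(?=\[edit\])")
titles = [m.group("title") for m in title_re.finditer(sample)]

# Flags are passed at compile time and apply to every later call.
hans_re = re.compile("hans", re.IGNORECASE)
names = hans_re.findall("Hans met hans")

print(titles)    # ['Overview', 'Access to records']
print(names)     # ['Hans', 'hans']
```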

##### The verbose mode

In [None]:
# Let's look at some more Wikipedia data. Here's some data on universities in the US which are Buddhist-based
with open("buddhist.txt","r") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
# and let's print that variable out to the screen
wiki

In [None]:
# All universities (raw string, so the backslashes reach the regex engine unchanged)
pattern=r"""
(?P<title>.*)        #the university title
(–\ located\ in\ )   #an indicator of the location
(?P<city>\w*)        #city the university is in
(,\ )                #separator for the state
(?P<state>\w*)       #the state the city is located in"""

# Now when we call finditer() we just pass the re.VERBOSE flag as the last parameter, this makes it much
# easier to understand large regexes!
for item in re.finditer(pattern,wiki,re.VERBOSE):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict())
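
To see how verbose mode treats whitespace, the sketch below writes one pattern twice, once compact and once verbose, and applies both to a hypothetical "city, state" string.

```python
import re

sample = "Springfield, Illinois"   # hypothetical sample text

compact = r"(?P<city>\w+),\ (?P<state>\w+)"
verbose = r"""
(?P<city>\w+)    # city name
,\               # comma followed by an escaped space
(?P<state>\w+)   # state name
"""

same = (re.search(compact, sample).groupdict()
        == re.search(verbose, sample, re.VERBOSE).groupdict())
print(same)    # True

# An unescaped space in the pattern is ignored in verbose mode.
ignored = re.search("a b", "ab", re.VERBOSE)      # matches "ab"
missing = re.search("a b", "a b", re.VERBOSE)     # None: the pattern is really "ab"
escaped = re.search(r"a\ b", "a b", re.VERBOSE)   # matches "a b"
print(ignored.group(), missing, escaped.group())
```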

## Some Examples

In [None]:
# Extracting all domain names (the part of the URL which occurs after //)

import re
text = 'I refer to https://google.com and I never refer http://www.baidu.com, if I have to search anything'
# the look-behind (?<=://) requires "://" right before the match without including it
for item in re.findall(r"(?<=://)([A-Za-z0-9.]*)",text):
    print(item)


In [None]:
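import re
text = 'I refer to https://google.com and I never refer http://www.baidu.com if I have to search anything'
# [https] would be a set matching a single character, not the word "https", and a look-behind must have
# a fixed width, so we use one look-behind per scheme
for item in re.finditer(r"(?:(?<=http://)|(?<=https://))([A-Za-z0-9.]*)",text):
    print(item.group(1))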


In [None]:
import re
# All names (capitalized words)
simple_string = """Hans is 5 years old, and her sister Mary is 2 years old. Ruth and Peter, their parents, have 3 kids."""
re.findall(r"[A-Z]\w+", simple_string)

In [None]:
s = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" \
302 4622\n197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] \
"DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554\n156.127.178.177 - \
okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1"\
416 14701\n100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" \
204 6048\n168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] \
"GET /engage HTTP/2.0" 201 9645\n71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700]\
"PUT /cutting-edge HTTP/2.0" 406 24498\n180.95.121.94 - mohr6893 [21/Jun/2019:15:45:34 -0700]\
"PATCH /extensible/reinvent HTTP/1.1" 201 27330\n144.23.247.108 - auer7552 [21/Jun/2019:15:45:35 -0700]\
"POST /extensible/infrastructures/one-to-one/enterprise HTTP/1.1" 100 22921\n2.179.103.97 - \
lind8584 [21/Jun/2019:15:45:36 -0700] "POST /grow/front-end/e-commerce/robust HTTP/2.0"\
304 14641\n241.114.184.133 - tromp8355 [21/Jun/2019:15:45:37 -0700]\
"GET /redefine/orchestrate HTTP/1.0" 204 29059\n224.188.38.4 - \
keebler1423 [21/Jun/2019:15:45:40 -0700] "PUT /orchestrate/out-of-the-box/unleash/syndicate HTTP/1.1" \
404 28211\n94.11.36.112 - klein8508 [21/Jun/2019:15:45:41 -0700] \
"POST /enhance/solutions/bricks-and-clicks HTTP/1.1" 404 24768\n1 146.204.224.152 \
- feest6811 [21/Jun/2019:15:45:24 -0700] 302 4622 \
197.109.77.178 - kertzmann  [21/Jun/2019:15:45:24 -0700] 302 4622 312 \
159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845'

In [None]:
# All host IP addresses
re.findall(r'[\d.]{6,16}', s)

In [None]:
# All users
# - ortiz8891 [21
re.findall(r' - (\w+)\s+\[', s)

In [None]:
# request: method, path and protocol between the double quotes
re.findall(r'"([A-Z]+\s+/\S+\s+HTTP/[\d.]+)"', s)

In [None]:
# time
# [21/Jun/2019:15:45:31 -0700]
re.findall(r'\[([\w/:\s\-]+)\]',s)
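
The four separate patterns can also be combined into a single verbose regex with named groups. This is one possible sketch, not the only way to parse such logs. The group names `host`, `user`, `time` and `request` are chosen here for illustration, and the pattern is applied to a single line in the same shape as the data above.

```python
import re

# One line in the same shape as the log data above.
line = ('146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] '
        '"POST /incentivize HTTP/1.1" 302 4622')

log_re = re.compile(r"""
(?P<host>[\d.]+)          # IP address
\s+-\s+                   # separator between host and user
(?P<user>[\w-]+)          # user name, or a dash when it is missing
\s+\[(?P<time>[^\]]+)\]   # timestamp between square brackets
\s*"(?P<request>[^"]+)"   # request between double quotes
""", re.VERBOSE)

entry = log_re.search(line).groupdict()
print(entry["host"])       # 146.204.224.152
print(entry["user"])       # feest6811
print(entry["time"])       # 21/Jun/2019:15:45:24 -0700
print(entry["request"])    # POST /incentivize HTTP/1.1
```

Lines that lack a quoted request would not match this pattern, so a real log parser would need to decide how to handle them.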