# Regular Expressions and Patterns

* Regular expressions (regex) are short strings which describe patterns to search for within text
* A regular expression is interpreted by a regex processor, which can be used to search in or split up text into "chunks"
* A regex is written in a sort of "mini-language" for defining patterns of interest

* Good uses of regex:
  * Validating input data ("Hey, make sure all phone numbers are in the format (###) ###-####")
  * Quick and dirty cleaning of data when you can verify the results easily
* Questionable uses of regex:
  * If someone needs to be able to understand what you wrote
  * If there are a lot of edge cases (in which case you might want regex + more error handling)

* One more reason to learn regex: they're supported nearly everywhere in tools and languages (Java, Python, C#, as well as grep, text editors, etc.)!

* Regexes in Python are done through the `re` module (and the docs are your friend!):

In [None]:
from IPython.display import IFrame, display
# embed the official documentation for the re module in the notebook
display(IFrame("https://docs.python.org/3/library/re.html", width="100%", height=700))

* The most important operations are:
  * `re.search()` which returns a `Match` object for the first item which can be found
  * `re.finditer()` which returns an iterator over `Match` objects for items found
  * `re.findall()` which returns a list of matched strings rather than `Match` objects; `re.finditer()` is generally preferred
  * `re.split()` which uses a pattern to break up a string
  * `re.sub()` which replaces substrings through substitution
* But! Lots of other modules will take in a regex as well, and we'll touch on them in pandas
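
* A minimal sketch of the five operations side by side. The sample text and variable names are made up for illustration, and the patterns use raw strings and `\d`, both of which are explained further down:

```python
import re

text = "apples: 3, pears: 12, plums: 7"

# re.search returns a Match for the first hit (or None if there is none)
first = re.search(r"\d+", text)
print(first.group())

# re.finditer yields one Match per hit, so positions are available too
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(spans)

# re.findall returns plain strings only
numbers = re.findall(r"\d+", text)
print(numbers)

# re.split breaks the string wherever the pattern matches
parts = re.split(r",\s*", text)
print(parts)

# re.sub replaces every match with the replacement string
masked = re.sub(r"\d+", "#", text)
print(masked)
```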

* The Match object is key to understand.

In [None]:
import re
print(re.Match.__doc__)

* If nothing is found the `Match` object doesn't exist - it's `None`.
* There is some important subtlety here!

In [None]:
# Quick example
strng = "I absolutely love SI330 and everything \
 we do in class is amazing."
pattern = "SI330"
result = re.search(pattern, strng)

In [None]:
result

In [None]:
if result:
    print("I knew it was about SI330!")

In [None]:
if result == True:
    print("I knew it was about SI330!")

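* Wait, wtf? Why does `if result:` run the `print`, while `result == True` evaluates to `False`, when we have a `Match` object?

* This is an important point about Python objects:
  * `if result:` tests *truthiness*: a `Match` object counts as true and `None` counts as false
  * `==` checks for equality of value. A `Match` object is not equal to `True`, so `result == True` is `False`
  * `is` checks for identity, that the left hand side and right hand side point to the **same** object. `result is True` is also `False`, since `True` is not the same object as a given `Match`

* Don't use `==` with `Match` objects; write `if result:` or `if result is not None:` instead. In truth, never use `==` when checking a `bool`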
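
* A small sketch comparing truthiness, equality and identity on a `Match` object. The variable names and the `SI999` pattern are only for illustration:

```python
import re

result = re.search("SI330", "I absolutely love SI330")

truthy = bool(result)          # truthiness: what `if result:` actually tests
equal = (result == True)       # equality: a Match is not equal to True
identical = (result is True)   # identity: a Match is not the True object
found = result is not None     # the explicit spelling of the truthiness test

print(truthy, equal, identical, found)

# When nothing matches, re.search returns None
missing = re.search("SI999", "I absolutely love SI330")
print(missing is None)
```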

* Match objects also have some helpful information inside of them, such as what was matched (shown as `match=` in the output above and returned by `.group()`) and where it was matched in the string (shown as `span=` and returned by `.span()`)
* This can be helpful when your patterns can match many different substrings
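
* A short sketch of pulling that information out of a `Match`, reusing the SI330 sentence from above (the variable names are only for illustration):

```python
import re

text = "I absolutely love SI330"
m = re.search("SI330", text)

if m:
    matched = m.group()          # the text that was matched
    start, end = m.span()        # where it sits in the original string
    print(matched, start, end)
    print(text[start:end])       # slicing with the span gives the match back
```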

## Patterns
* We've already seen the most basic pattern, just a list of ordinary alphanumeric characters
* But there are a lot of special characters. Let's start with `.`
* `.` will match any single character except for newline characters (which we represent with the escape `\n`)

In [None]:
pattern='G..d'
re.search(pattern, 'Good')

In [None]:
re.search(pattern, 'Grid')

In [None]:
pattern='G..d'
re.search(pattern, 'Graduation!')

In [None]:
re.search(pattern, 'God')

* The next patterns to be aware of are
  * `\s` which matches whitespace: spaces, tabs, newlines, and odd Unicode whitespace characters
  * `\S` which matches non-whitespace
  * `\d` which matches digits
  * `\D` which matches non-digits

In [None]:
pattern="\D\d\d\d\D\s\d\d\d-\d\d\d\d"
re.search(pattern,"(306) 262-2905")

In [None]:
re.search(pattern,"306-262-2905")

In [None]:
pattern="\D\d\d\d\D\s\d\d\d-\d\d\d\d"
# But we see it's not an ideal pattern...
re.search(pattern,":306p 262-2905")

* In addition to characters to match, we can match positions (boundaries)
  * `^` matches at the beginning of the string (or of every line, if `re.MULTILINE` is set)
  * `$` matches at the end of the string (or of every line, if `re.MULTILINE` is set)
  * `\b` which matches at the beginning or end of a **word**
  * `\B` which matches anywhere that is not the beginning or end of a word
* Word boundaries are defined in terms of word characters
  * `\w` matches a word character (defined as letter, number or... underscore?)
  * `\W` matches a non-word character

In [None]:
strng="My goodness, have you heard that Li person \
is teaching? He's not even a Chris!"
re.search('^Brooks', strng)

In [None]:
re.search('^My', strng)

In [None]:
# words that start with good (but not good itself)
re.search('\bgood\B', strng)

* Wait, WTF? Isn't that supposed to work? What is happening here?
  * There are different ways of representing strings:
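    * Just as per normal: `strng="No thank you"`, in Python 3 this is unicode data, and backslash escapes such as `\n` or `\b` are interpreted by the string processing
    * As a raw string, written with an `r` prefix: `strng=r"No thank you"`. In this case, the backslash characters are left in and not escaped by the string processing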

In [None]:
print('No thank you Chris Teplovs')
print('No thank you Chris \brooks')
print(r'No thank you chris \brooks')
print('No thank you Chris \quarles')

* Goodness! The `\b` that we were putting in the string was being mistaken for a backspace character!
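* Wait, why didn't this happen with the `\d` before?
* Because `\d` isn't an escape sequence that Python strings recognize, so the backslash was left alone (newer versions of Python warn about this)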

* Moral of the story: Always prefix your regex strings with `r`
* Seriously. Always. Make your life easier.

In [None]:
# words that start with good (but not good itself)
re.search(r'\bgood\B', strng)

In [None]:
strng="Dang I love this class! It was worth every $"
re.search(r'worth every $', strng)

In [None]:
re.search(r'worth every \$', strng)
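
* A small sketch of how `^` and `$` behave on text with more than one line, and of `\$` as a literal dollar sign. The sample text is made up, and the `re.MULTILINE` flag is an addition not used elsewhere in these notes:

```python
import re

text = "first line costs $\nsecond line costs $"

# Default: ^ only anchors at the very start of the string
starts_default = re.findall(r"^\w+", text)

# re.MULTILINE: ^ also anchors right after every newline
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \$ is a literal dollar sign, the bare $ after it is the end-of-line anchor
dollars = re.findall(r"costs \$$", text, re.MULTILINE)

print(starts_default)
print(starts_multiline)
print(dollars)
```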

## Quantifiers
* Quantifiers say how many times the previous item (a character, a set, or a group) may repeat
  * `*` zero or more of the previous item
  * `+` one or more of the previous item
  * `?` zero or one of the previous item
  * `{m,n}` between `m` and `n` of the previous item; `{m}` means exactly `m`, and `{m,}` means `m` or more

In [None]:
strng='My phone number is (306) 373-2905'
re.search(r'\d*', strng)

In [None]:
# ok, seems like that wasn't the aim
strng='My phone number is (306) 373-2905'
re.search(r'\d+', strng)

In [None]:
# can we find all number fragments in the string?
re.findall(r'\d+', strng)

In [None]:
# what do you think this will do?
re.findall(r'\d{1,3}', strng)
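
* With quantifiers and anchors we can tighten the phone number pattern from earlier. This is only a sketch: it assumes the one format `(###) ###-####` is the only acceptable one, and it escapes the parentheses because unescaped `()` have a special meaning (capture groups, covered later)

```python
import re

# \( and \) are literal parentheses; {3} and {4} replace the repeated \d
pattern = r"^\(\d{3}\) \d{3}-\d{4}$"

candidates = ["(306) 262-2905", "306-262-2905", ":306p 262-2905"]

# Map each candidate to whether it fits the format
results = {c: bool(re.search(pattern, c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```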

## Sets of Characters
* We can wrap a set of characters we want to match inside of `[]`
* `[aeiou]` means match any vowel

In [None]:
re.findall(r'[aeiou]+','The quick brown fox jumped over the...')

In [None]:
# we can negate THE WHOLE SET with a caret `^`
re.findall(r'[^aeiou]{1}','The quick brown fox jumped over the...')

In [None]:
re.findall(r'dog[s]{1}','The dogs ran after the big dog')

* We can also define a range inside of a character set. This is still used, but metacharacters are often more appropriate.
  * `[A-Z]` all upper case Roman letters
  * `[a-zA-Z]` all upper or lower case Roman letters
  * `[a-zA-Z0-9_]` the same as `\w` for ASCII text (in Python 3, `\w` also matches letters and digits from other alphabets)

In [None]:
# unicode ranges work too
re.findall(r'[α-ω]+','Someone once said, "I am the α". Does this mean there is a γ?')
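
* A quick sketch of where `\w` and `[a-zA-Z0-9_]` differ. The sample text is made up, and `re.ASCII` is a flag not otherwise used in these notes:

```python
import re

# The escapes spell out: naïve café_99 α (written as escapes to pin the exact code points)
text = "na\u00efve caf\u00e9_99 \u03b1"

unicode_words = re.findall(r"\w+", text)           # \w is Unicode-aware for str patterns
ascii_class = re.findall(r"[a-zA-Z0-9_]+", text)   # explicit ASCII ranges only
ascii_flag = re.findall(r"\w+", text, re.ASCII)    # re.ASCII restricts \w to [a-zA-Z0-9_]

print(unicode_words)
print(ascii_class)
print(ascii_flag)
```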

* "And" logic is implicit (the pieces of a pattern must match one after another), but if you want to specify an "or" you use a pipe `|`

In [None]:
line="POST /incentivize HTTP/1.1"
# inside [] a pipe is just a literal '|', so the alternation goes outside the brackets
re.findall(r'HTTP/1\.1|HTTP/1\.2',line)

## Capture Groups
* Up until this point it probably seems really laborious. It is.
* Capture groups let us match and/or extract subpatterns so we can build many regexes up together
* To indicate a capture group we use parentheses `()`
* The canonical example? An email address

In [None]:
strng="The instructor is liwarren@umich.edu"
re.search(r'[\w.-]+@[\w.-]+',strng)

* But, there are actually a few different parts of an email address, including the username and the hostname

In [None]:
strng="The instructor is liwarren@umich.edu"
match=re.search(r'([\w.-]+)@([\w.-]+)',strng)
if match:
    print(match.group()) # the whole match
    print(match.group(1))# the first group
    print(match.group(2))# the second group

* Capture groups get even cooler though: you can label them like a variable
* Uses the syntax `(?P<name>...)`, where
  * the `()` denotes a capture group
  * the `?P` indicates this is an extension to standard regex
  * the `<name>` means that matches for that group are labeled with the dictionary key `name` (see `Match.groupdict()`)
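
* A minimal sketch of reading named groups back out, either one at a time with `group("name")` or all at once with `groupdict()`. The text is a shortened version of the Gordie Howe example:

```python
import re

text = "Born March 31, 1928 Floral, Saskatchewan, Canada"
m = re.search(r"(?P<month>\w+) (?P<day>\d{1,2}), (?P<year>\d{4})", text)

if m:
    month = m.group("month")   # one named group
    fields = m.groupdict()     # every named group as a dict
    print(month)
    print(fields)
```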

In [None]:
import re
re.search(r"(?P<month>\w*) (?P<day>\d{1,2}), (?P<year>\d\d\d\d)",
          "Gordie Howe Chex card.jpg Born	March 31, 1928 Floral, Saskatchewan, Canada")

* Last topic I'll touch on in capture groups: thus far the focus has been on returning and labeling the capture groups
* What if we want to match on the group, but don't want to see it come back?
* (like \[edit\])
* We can use non-capturing groups
  * `(?:...)` Match but don't return the group
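
* A small sketch of why this matters with `re.findall()`: when the pattern has a capture group, only the group comes back. The request line is made up for illustration:

```python
import re

line = "GET /a HTTP/1.1 then POST /b HTTP/1.2"

# Capturing group: findall returns only what the group captured
captured = re.findall(r"HTTP/1\.(1|2)", line)

# Non-capturing group: findall returns the whole match
whole = re.findall(r"HTTP/1\.(?:1|2)", line)

print(captured)
print(whole)
```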

* Let's see an example using data from Wikipedia on US universities which are Buddhist-based

In [None]:
# Get a list of dicts where each university 'name', 'city', and 'state' are labeled as such
with open("datasets/buddhist.txt","r") as file:
    wiki=file.read()
# solution: (?P<name>.*)(?:[–])(?: located in )(?P<city>\w*)(?:, )(?P<state>\w*)
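
* A sketch of applying the solution pattern with `re.finditer()` and `groupdict()`. The two sample lines are made up and only assume the `name – located in city, state` layout that the pattern implies; they are not the contents of `datasets/buddhist.txt`

```python
import re

# Made-up lines standing in for the real file (hypothetical, not real data)
sample = (
    "Example University – located in Springfield, Illinois\n"
    "Sample Institute – located in Portland, Oregon\n"
)

pattern = r"(?P<name>.*)(?:[–])(?: located in )(?P<city>\w*)(?:, )(?P<state>\w*)"

# One dict per match, keyed by the group names
records = [m.groupdict() for m in re.finditer(pattern, sample)]

for record in records:
    # .* also swallows the space before the dash, so strip the name
    record["name"] = record["name"].strip()
    print(record)
```

* Note that `\w*` stops at the first space, so a city or state with more than one word would be cut short by this pattern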

![](https://imgs.xkcd.com/comics/regular_expressions.png)