<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST2312/blob/main/CST2312_Class17.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CST2312 WK11CL17 - Regex and APIs**



---



# ***Regular Expressions***
***Regex***

**Reading** from the required textbook ([https://www.py4e.com/lessons/](https://www.py4e.com/lessons/)):

* [Regular Expressions](https://www.py4e.com/lessons/regex) (Chapter 12)
* [Data Science Cheat Sheet Python Regular Expressions](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)



So far we have used methods like `split` and `find` to extract portions of strings, or to check whether an item is part of a collection (list, set, tuple, dictionary) or whether a string is part of a longer string.

Regular Expressions
-------------------

Regular expressions (regexes or re’s) constitute an extremely powerful, flexible and concise language for matching elements in text ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this learning curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Efforts spent learning regular expressions quickly pay off--tasks that are well suited for regular expressions abound. Indeed, regular expressions are one of the most useful computer skills, and an absolutely critical tool for data scientists.

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. 

We will present examples using Python’s standard [re regular expression library](http://docs.python.org/library/re.html).

We will discuss Python libraries in detail later.

Many examples come from this [Google tutorial](https://developers.google.com/edu/python/regular-expressions). Note that the tutorial itself uses a 2.x version of Python, so several statements (for example, `print`) look different from later versions of Python. In this notebook, the examples from the Google tutorial are converted to the current Python version.

### Searching strings using regexes

The regular expression library `re` must be imported into your program before you can use it. The simplest use of the regular expression library is the `search()` function. 

In [2]:
# first import the library
import re

In [None]:
inputStr = 'an example word:cat!!'
if re.search('cat', inputStr):
  print(inputStr)
print ("Done with the example")

In [None]:
inputStr = 'an example word: cat!!'
if re.search('dog', inputStr):
  print(inputStr)
print ("Done with the example")

#### We can store the result of `re.search(pat, str)` in a variable.

In Python a regular expression search is typically written as:

`match = re.search(pat, str)`

The code `match = re.search(pat, str)` stores the search result in a variable named "match". 

The `re.search()` function takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, `search()` returns a match object; otherwise it returns `None`.

The `if` statement then tests the match. A match object is treated as true, so if the search succeeded the test passes and `match.group()` is the matching text (e.g. 'word: cat'). If the search did not succeed, `match` is `None`, which is treated as false, and there is no matching text.

The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions. It is recommended that you always write pattern strings with the 'r' as a habit.
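To see why the raw prefix matters, here is a small sketch comparing a raw and a non-raw pattern. The sample text and variable names are made up for illustration.

```python
import re

# In a normal string '\b' is the backspace character (one character long).
# In a raw string r'\b' stays as backslash + 'b', which the regex engine
# reads as "word boundary".
plain_len = len('\b')
raw_len = len(r'\b')
print(plain_len, raw_len)

text = 'a cat sat'
with_raw = re.search(r'\bcat\b', text)     # pattern uses word boundaries
without_raw = re.search('\bcat\b', text)   # pattern contains backspace characters
print(with_raw)      # a match object for 'cat'
print(without_raw)   # None, because the text has no backspace characters
```

Without the `r`, Python interprets the backslash before the regex engine ever sees it, so the pattern silently means something else.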

 

In [None]:
inputStr = 'an example word: cat!!'
match = re.search(r'word: \w\w\w', inputStr) ## both 'cat' and 'dog' are 3-letter words
match

Both `re.search()` and `re.match()` return a match object on success and `None` on failure. We use the `group(num)` method of the match object to get the matched text.

`group(num=0)`: returns the entire match (or the specific subgroup `num`)

In [None]:
inputStr = 'an example word: cat!!'
# Store the result so that match refers to this search, not an earlier cell
match = re.search(r'word: \w\w\w', inputStr)
# If-statement after search() tests if it succeeded
if match:
  print ('found', match.group()) ## 'found word: cat'
else:
  print ('did not find')
print ("Done with the example")

In [None]:
inputStr = 'an example word: dog!!'
match = re.search(r'word: \w\w\w', inputStr)
# If-statement after search() tests if it succeeded
if match:
  print ('found', match.group()) ## 'found word: dog'
else:
  print ('did not find')
print ("Done with the example")

In [None]:
line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line) # two capturing groups inside one regex

if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(0) : ", matchObj.group(0))
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")

### Basic Patterns

see [Python Documentation](https://docs.python.org/3/library/re.html)

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

* a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)

* . (a period) -- matches any single character except newline '\n'

* \w -- (lowercase w) matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.

* \b -- boundary between word and non-word

* \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.

* \t, \n, \r -- tab, newline, return

* \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

* ^ = start, $ = end -- match the start or end of the string

* \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
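A short sketch of the patterns from this list that the later examples do not cover: `\b`, `\s`, `^`, `$` and an escaped period. The sample string and variable names are made up for illustration.

```python
import re

sample = 'Total: 42 items, price 3.50'

# \b marks a word boundary, so 'item' does not match inside 'items'
whole_word = re.search(r'\bitem\b', sample)

# \s matches one whitespace character; \d matches one digit
label_and_digits = re.search(r'Total:\s\d\d', sample)

# ^ anchors to the start of the string, $ anchors to the end
starts_with_total = re.search(r'^Total', sample)
ends_with_items = re.search(r'items$', sample)

# \. matches a literal period, while an unescaped . matches any character
literal_dot = re.search(r'\d\.\d\d', sample)

print(whole_word)                  # None
print(label_and_digits.group())    # Total: 42
print(starts_with_total.group())   # Total
print(ends_with_items)             # None, the string ends with '3.50'
print(literal_dot.group())         # 3.50
```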

### Basic Examples


The basic rules of regular expression search for a pattern within a string are:

* The search proceeds through the string from start to end, stopping at the first match found

* All of the pattern must be matched, but not all of the string

* If `match = re.search(pat, str)` is successful, match is not `None` and in particular `match.group()` is the matching text

In [None]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.

match = re.search(r'iii', 'piiig')
if match:
  print ('found, match.group() == "iii"')
  print (match.group())

found, match.group() == "iii"
iii


In [None]:
match = re.search(r'igs', 'piiig')
if not match:  
  print ('not found, match == None')
  print (re.search(r'igs', 'piiig'))



In [None]:
## . = any char but \n
match = re.search(r'..g', 'piiig') 
if match:
  print ('found, match.group() == "iig"')
  print (match.group())



In [None]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')
if match:
  print ('found, match.group() == "123"')
  print (match.group())



In [None]:
match = re.search(r'\w\w\w', '@@abcd!!')
if match:
  print ('found, match.group() == "abc"')
  print (match.group())

### Email Example 1

Extract the account name and the domain from the email address


In [5]:
email = 'my-email@citytech.edu'

# your code here

### ***don't look here***



```
# a solution to exercise #1 
email_match=re.search(r'\S+@\S+', email)
print('found email address ' + email_match.group(0))
match_list=re.split(r"@", email_match.group(0))
print('the name and domain are: ', match_list)
```



### Email Example 2

Suppose you want to find the email address inside the string 'xyz my-email@citytech.edu data science'.

Here's an attempt using the pattern r'\w+@\w+':

In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'\w+@\w+', emailText)
if match:
  print (match.group())

The search does not get the whole email address in this case because the `\w` does not match the `'-'` or `'.'` in the address. We'll fix this using the regular expression features below.

**Square Brackets**

Square brackets can be used to indicate a set of chars, so `[abc]` matches `'a'` or `'b'` or `'c'`. The codes `\w`, `\s` etc. work inside square brackets too with the one exception that dot (`.`) just means a literal dot. For the emails problem, the square brackets are an easy way to add `'.'` and `'-'` to the set of chars which can appear around the `@` with the pattern `r'[\w.-]+@[\w.-]+'` to get the whole email address:

In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'[\w.-]+@[\w.-]+', emailText)
if match:
  print (match.group())

We will use the `group()` method to extract the account name and the domain from the email address.


In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'([\w.-]+)@([\w.-]+)', emailText)
if match:
  print ("email address:\t", match.group())
  print ("email account:\t", match.group(1))
  print ("email domain:\t", match.group(2))

### Iteration using regular expressions

If we want to extract data from a string in Python we can use the `findall()` or `finditer()` methods to extract all of the substrings which match a regular expression. 

These two methods produce results of different types.

Let’s use the example of wanting to extract anything that looks like an email address from any line regardless of format. For example, we want to pull the email addresses from each of the following lines:

In [None]:
emailText = '''xyz my-email@citytech.edu data science or 
your_email@citytech.edu, also we can mention their.email@citytech.edu'''

In [None]:
match = re.search(r'([\w.-]+)@([\w.-]+)', emailText)
if match:
  print ("email address:\t", match.group())
  print ("email account:\t", match.group(1))
  print ("email domain:\t", match.group(2))

In [None]:
matches = re.finditer(r'[\w.-]+@[\w.-]+', emailText)
for match in matches:
    print(match.group())

In [None]:
matches = re.finditer(r'([\w.-]+)@([\w.-]+)', emailText)
for match in matches:
    print(match.group(),"\t", match.group(1),"\t", match.group(2))

In [None]:
print (matches)  

In [None]:
matches = re.findall(r'([\w\.-]+)@([\w\.-]+)', emailText)
print (matches)  
for match in matches:
  print (match[0])  ## username
  print (match[1])  ## host

In [None]:
matches = re.findall(r'[\w\.-]+@[\w\.-]+', emailText)
print (matches)  
for match in matches:
  print (match)



---



Additional reading on the complexities of thorough regex rules for email addresses ▶ [O'Reilly Regular Expressions Cookbook 4.1, Validate Email Addresses](https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch04s01.html) by Jan Goyvaerts and Steven Levithan.



---



### Create a variable containing regular expression

Use the `re.compile()` function to store a regular expression in a variable as a reusable pattern object.

In [None]:
# We are looking for binary numbers
regex = re.compile(r'[10]+')
text = "asddf1101110100011abd1111panos0000"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
emailText = '''xyz my-email@citytech.edu data science or 
your_email@citytech.edu, also we can mention their.email@citytech.edu'''

In [None]:
# We are looking for email addresses
regex = re.compile(r'[\w\.-]+@[\w\.-]+')
matches = regex.finditer(emailText)
for match in matches:
    print(match.group())

In [None]:
# We look for money figures, either integers, or with 1 or 2 decimal
# digits
regex = re.compile(r'\$\d+(\.\d\d?)?')
text = '$1200.23 is the price today. $1200 was the price yesterday'
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# This code is going to generate no matches
regex = re.compile(r'Ra*nd.*m R[egex]')
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

print ("The end")

Regular expressions are typically case-sensitive. 

In [None]:
# Regular expressions are compiled into pattern objects
# Regular expressions are case-sensitive
regex = re.compile(r'in.*on')
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

But we can specify that they are case-insensitive, using the flag re.IGNORECASE

In [None]:
# Unless we specify that they are case-insensitive, using the flag re.IGNORECASE
regex = re.compile(r'in.*on', re.IGNORECASE)
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

### Greedy vs. Non-Greedy

Suppose you have text with tags in it: `<b>foo</b> and <i>so on</i>`

Suppose you are trying to match each tag with the pattern '(<.*>)' -- what does it match first?



In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*>)')
match = regex.search(text)
print (match.group())



The result is a little surprising, but the greedy aspect of the `.*` causes it to match the whole `<b>foo</b> and <i>so on</i>` as one big match. The problem is that the `.*` goes as far as it can, instead of stopping at the first `>` (that is, it is "greedy").

There is an extension to regular expressions where you add a `?` at the end, such as `.*?` or `.+?`, changing them to be non-greedy. Now they stop as soon as they can. So the pattern `'(<.*?>)'` will get just `'<b>'` as the first match, and `'</b>'` as the second match, and so on, getting each <..> pair in turn. The usual style is to use a `.*?` and then, immediately to its right, look for some concrete marker (> in this case) that forces the end of the `.*?` run.



In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*?>)')  # non-greedy: stop at the first '>'
match = regex.search(text)
print (match.group())

In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*?>)')  # non-greedy: each tag is matched separately
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.+?>)')
matches = regex.finditer(text)
for match in matches:
    print(match.group())

## Regular Expression Functions

### Analyzing

* `.match()`
* `.search()`
* `.finditer()`
* `.findall()`
* `.compile()`

[https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)

Let me separate these functions into three groups: 
1. `.compile()`
2. `.finditer()`, `.findall()`
3. `.match()`, `.search()`


# 1. `re.compile()`

## Regular expression compilation

Regular expression compilation produces a Python object that can be used for all sorts of regular expression operations. What is the benefit, given that we can use `re.match` and `re.search` directly? Compilation is convenient when we want to use a regular expression more than once. It can make our code more efficient and more readable.

### `re.compile(pattern, flags=0)`

Compile a regular expression pattern into a regular expression object, which can be used for matching using its `match()`, `search()` and other methods.


This sequence

In [None]:
inStr = 'This is my 123 example string'
prog = re.compile(r'\d+')
result = prog.search(inStr)
result

is equivalent to

In [None]:
result = re.search(r'\d+', inStr)
result

By using `re.compile()` and saving the resulting regular expression object for reuse you can make your code more efficient when the expression will be used several times in a single program.
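As a minimal sketch of that reuse, the pattern below is compiled once and then applied to several strings in a loop. The sample lines and variable names are made up for illustration. In practice the speed gain is often small, because the `re` module also keeps a cache of recently used patterns, so readability is frequently the main benefit.

```python
import re

# Compile once, then reuse the same pattern object on several strings.
number_pattern = re.compile(r'\d+')

lines = ['room 101', 'no digits here', 'call 555 now']   # made-up sample data
first_numbers = []
for line in lines:
    found = number_pattern.search(line)
    # search() returns None when a line has no digits
    first_numbers.append(found.group() if found else None)

print(first_numbers)   # ['101', None, '555']
```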

# 2. `re.finditer()`, `re.findall()`

## `.finditer()` VS `.findall()`

* `.findall()` returns a **list** of all matches of a regex in a string.
* `.finditer()` returns an iterator that yields regex matches.

https://realpython.com/regex-python/

https://realpython.com/regex-python-part-2/

https://www.8bitavenue.com/difference-between-re-search-and-re-match-in-python/

`re.findall(<regex>, <string>)` returns a list of all non-overlapping matches of `<regex>` in `<string>`. It scans the search string from left to right and returns all matches in the order found:

In [None]:
re.findall(r'#(\w+)#', '#foo#.#bar#.#baz#')

In this case, the specified regex is `#(\w+)#`. The matching strings are `'#foo#'`, `'#bar#'`, and `'#baz#'`. But the hash (`#`) characters don’t appear in the return list because they’re outside the grouping parentheses.

If `<regex>` contains more than one capturing group, then `re.findall()` returns a list of tuples containing the captured groups. The length of each tuple is equal to the number of groups specified:

In [None]:
re.findall(r'(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge')

In [None]:
re.findall(r'(\w+),(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge')

`re.finditer(<regex>, <string>)` scans `<string>` for non-overlapping matches of `<regex>` and returns an iterator that yields the match objects from any it finds. It scans the search string from left to right and returns matches in the order it finds them:

In [None]:
it = re.finditer(r'\w+', '...foo,,,,bar:%$baz//|')
it

In [None]:
for i in re.finditer(r'\w+', '...foo,,,,bar:%$baz//|'):
  print(i.group())

`re.findall()` and `re.finditer()` are very similar, but they differ in two respects:

1. `re.findall()` returns a **list**, whereas `re.finditer()` returns an **iterator**.

2. The items in the list that `re.findall()` returns are the actual matching strings, whereas the items yielded by the iterator that `re.finditer()` returns are match objects.

Any task that you could accomplish with one, you could probably also manage with the other. Which one you choose will depend on the circumstances. However, a lot of useful information can be obtained from a match object. If you need that information, then `re.finditer()` will probably be the better choice. For example, you can call the `.group()` method of a match object to get the matched text.
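The difference can be seen by running both functions with the same pattern and string. This is a small sketch; the variable names are chosen for illustration.

```python
import re

text = '#foo#.#bar#.#baz#'

# findall() gives a list of strings (the captured group, since there is one)
as_list = re.findall(r'#(\w+)#', text)

# finditer() yields match objects, which also carry position information
as_spans = [(m.group(1), m.span()) for m in re.finditer(r'#(\w+)#', text)]

print(as_list)    # ['foo', 'bar', 'baz']
print(as_spans)   # [('foo', (0, 5)), ('bar', (6, 11)), ('baz', (12, 17))]
```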

## Match objects

If a match is found when using `re.finditer()`, `re.match` or `re.search`, we can use some useful methods provided by the match object. Here is a short list of such methods, you may check the reference section for more details…

* `group()` returns the part of the string matched by the entire regular expression
* `group(1)` returns the text matched by the first capturing group
* `start()` and `end()` return the indices of the start and end of the substring matched by the whole pattern (or by a given group, e.g. `start(1)`).
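A brief sketch of these methods on a single match object. The sample sentence and the simplified pattern are made up for illustration.

```python
import re

sentence = 'write to alice@example today'   # made-up sample text
m = re.search(r'(\w+)@(\w+)', sentence)

whole = m.group()      # text matched by the entire pattern
first = m.group(1)     # text matched by the first capturing group
second = m.group(2)    # text matched by the second capturing group
begin, finish = m.start(), m.end()   # position of the whole match

print(whole, first, second)      # alice@example alice example
print(begin, finish)             # 9 22
print(sentence[begin:finish])    # slicing with start()/end() gives the match back
```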

# 3. `re.match()`, `re.search()`

### .match() VS .search()

This is the trickiest part. A good explanation is provided in this tutorial: https://www.8bitavenue.com/difference-between-re-search-and-re-match-in-python/ which is paraphrased here.

Additional information can also be found at ▶

https://realpython.com/regex-python/

https://realpython.com/regex-python-part-2/



When searching or matching, regular expression operations can take optional flags or modifiers. The following two modifiers are the most used ones…

* `re.M` (multiline)
* `re.I` (ignore case)
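The following sketch shows what the two flags change and how they are combined with `|`. The three-line sample text is made up for illustration.

```python
import re

text = """first line
Second line
FIRST again"""

# Without flags, ^ only matches at the very start of the string
default_hits = re.findall(r'^first', text)

# re.M lets ^ match at the start of every line
multiline_hits = re.findall(r'^\w+', text, re.M)

# Flags are combined with | ; here: every line start, ignoring case
combined_hits = re.findall(r'^first', text, re.M | re.I)

print(default_hits)     # ['first']
print(multiline_hits)   # ['first', 'Second', 'FIRST']
print(combined_hits)    # ['first', 'FIRST']
```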

## .match()

`re.match` matches an expression at the beginning of a string. If a match is found, a match object is returned; otherwise `None` is returned. If the input is a multiline string (for example, one written with triple quotes), that does not change the behavior of the match operation: `re.match` always tries to match at the beginning of the string, not at the beginning of each line. In regular expression syntax, the caret character (`^`) is used to match the beginning of a string. With `re.match` a leading `^` is redundant, because matching already starts there. The syntax for the `re.match` operation is as follows…



`mobj = re.match(pat, str, flags)`

* pat: regular expression pattern to match
* str: string in which to search for the pattern
* flags: one or more modifiers, for example re.M|re.I

### `.match(pattern, string, flags=0)`

Looks for a regex match at the beginning of a string.

If zero or more characters at the beginning of *string* match the regular expression *pattern*, return a corresponding match object. Return `None` if the string does not match the pattern; note that this is different from a zero-length match.

This is identical to `re.search()`, except that `re.search()` returns a match if the regular expression *pattern* matches anywhere in *string*, whereas `re.match()` returns a match only if the regular expression *pattern* matches at the beginning of *string*.

## .search()

`re.search` attempts to find the first occurrence of the pattern anywhere in the input string as opposed to the beginning. If the search is successful, `re.search` returns a match object, otherwise it returns `None`. The syntax for re.search operation is as follows…

`mobj = re.search(pat, str, flags)`

* pat: regular expression pattern to search for
* str: string in which to search for the pattern
* flags: one or more modifiers, for example re.M|re.I

### `.search(pattern, string, flags=0)`

Scans a string for a regex match.

Scan through *string* looking for the **first** location where the regular expression *pattern* produces a match, and return a corresponding match object. Return `None` if no position in the string matches the pattern.


You can use this function in a Boolean context, such as a conditional statement that checks whether a pattern is found in the input string: a match object is treated as true, and `None` is treated as false.
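To make the contrast concrete, here is a small sketch that runs `re.match()` and `re.search()` with the same pattern. The sample strings are made up for illustration.

```python
import re

line = 'dogs chase cats'

at_start = re.match(r'cats', line)    # 'cats' is not at the beginning
anywhere = re.search(r'cats', line)   # 'cats' appears later in the string

print(at_start)           # None
print(anywhere.group())   # cats

# re.M does not change re.match(): it still only tries the start of the string
two_lines = 'dogs\ncats'
match_multiline = re.match(r'cats', two_lines, re.M)
search_multiline = re.search(r'^cats', two_lines, re.M)

print(match_multiline)            # None
print(search_multiline.group())   # cats
```

If you need a pattern to match at the start of any line, `re.search()` with `^` and `re.M` is the combination to reach for.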

### Substitution 

The `re.sub(pat, replacement, str)` function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include `'\1'`, `'\2'` which refer to the text from `group(1)`, `group(2)`, and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user `(\1)` but have `nyc.gov` as the host.

In [None]:
text = '''xyz my-email@citytech.edu data science or 
your_email@citytech.edu, also we can mention their.email@citytech.edu'''
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print (re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@nyc.gov', text))
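The replacement string can also use more than one group reference. As a minimal sketch on made-up sample data, the example below uses `\1` and `\2` to swap the order of two captured words.

```python
import re

names = 'Lovelace, Ada; Hopper, Grace'   # made-up sample data

# Two capturing groups: (last name), (first name).
# In the replacement, \2 and \1 put them back in the opposite order.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', names)
print(swapped)   # Ada Lovelace; Grace Hopper
```

Note that the replacement is written as a raw string too, so that `\1` and `\2` reach `re.sub()` unchanged.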


### Homework Assignment 1

Imagine we want to conceal all phone numbers in a document.
Read the below multiline string and substitute all phone numbers with 

'XXX-XXX-XXXX'

In [None]:
raw_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

### Solution

In [None]:
## your solution here