<a href="https://colab.research.google.com/github/MBouchaqour/DS_Masjid2024/blob/main/DS1_Regular_expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

So far we have used methods like `split` and `find` to extract portions of strings, or to check whether a particular item is contained in a list, set, tuple, or dictionary, or whether a substring occurs in a longer string.

Regular Expressions
-------------------

Regular expressions (regexes or re’s) constitute an extremely powerful, flexible and concise language for matching elements in text ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this learning curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Efforts spent learning regular expressions quickly pay off--tasks that are well suited for regular expressions abound. Indeed, regular expressions are one of the most useful computer skills, and an absolutely critical tool for data scientists.

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement.

We will present examples using Python’s standard [`re` regular expression library](http://docs.python.org/library/re.html).

We will discuss Python libraries in detail later.

Many examples come from this [Google tutorial](https://developers.google.com/edu/python/regular-expressions). Note that the tutorial itself uses Python 2.x, so several statements (for example, `print`) look different than in later versions of Python. In this notebook, the examples from the Google tutorial have been converted to the current Python version.


### Searching strings using regexes

The regular expression library `re` must be imported into your program before you can use it. The simplest use of the regular expression library is the `search()` function.

In [None]:
# first import the library
import re

In [None]:
inputStr = 'an example word:cat!!'
if re.search('cat', inputStr):
  print(inputStr)
print ("Done with the example")

an example word:cat!!
Done with the example


In [None]:
inputStr = 'an example word: cat!!'
if re.search('dog', inputStr):
  print(inputStr)
print ("Done with the example")

Done with the example


#### We can store the result of `re.search(pat, str)` in a variable.

In Python a regular expression search is typically written as:

`match = re.search(pat, str)`

The code `match = re.search(pat, str)` stores the search result in a variable named "match".

The `re.search()` method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, `search()` returns a match object or `None` otherwise.

Then the `if`-statement tests the match: a match object is truthy, so if the test passes, the search succeeded and `match.group()` is the matching text (e.g. 'word: cat'). Otherwise the result is `None`, which is falsy: the search did not succeed, and there is no matching text.

The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions. It is recommended that you always write pattern strings with the 'r' just as a habit.
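
To make the effect of the 'r' prefix concrete, here is a minimal sketch (the sample text is made up for illustration). Without the prefix, Python itself turns `\b` into a backspace character before the regex engine ever sees the pattern.

```python
import re

# In a normal string Python converts '\b' into one backspace character;
# in a raw string the backslash and the 'b' stay as two characters.
print(len('\b'), len(r'\b'))  # 1 2

text = 'a cat sat'  # made-up sample text

# Raw pattern: \b reaches the regex engine and means "word boundary".
raw_match = re.search(r'\bcat\b', text)
# Non-raw pattern: the regex looks for backspace + 'cat' + backspace.
plain_match = re.search('\bcat\b', text)

print(raw_match.group())  # cat
print(plain_match)        # None
```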



In [None]:
inputStr = 'an example word: cat!!'
match = re.search(r'word: \w\w\w', inputStr) ## both 'cat' and 'dog' are 3-letter words
match

<re.Match object; span=(11, 20), match='word: cat'>

Both `re.search()` and `re.match()` return a match object on success and `None` on failure; the difference is that `re.match()` only matches at the beginning of the string. We use the `group(num)` method of the match object to get the matched text.

`group(num=0)`: This method returns entire match (or specific subgroup num)
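
As a small illustration of `re.search()` versus `re.match()` and of `group(num)`, here is a sketch that reuses the sample string from above. The parenthesized part of the second pattern is added here only to show a numbered group.

```python
import re

line = 'an example word: cat!!'

# re.search scans the whole string for the first place the pattern matches.
print(re.search(r'cat', line) is not None)  # True
# re.match only tries to match at the very beginning of the string.
print(re.match(r'cat', line))               # None
print(re.match(r'an', line).group())        # an

# group() or group(0) is the whole match; group(1) is the first
# parenthesized part of the pattern.
m = re.search(r'word: (\w\w\w)', line)
print(m.group(0))  # word: cat
print(m.group(1))  # cat
```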

In [None]:
inputStr = 'an example word: cat!!'
# Store the result so that the if-branch uses the match from this search
match = re.search(r'word: \w\w\w', inputStr)
# If-statement after search() tests if it succeeded
if match:
  print ('found', match.group()) ## 'found word: cat'
else:
  print ('did not find')
print ("Done with the example")

found word: cat
Done with the example


In [None]:
inputStr = 'an example word: dog!!'
match = re.search(r'word: \w\w\w', inputStr)
# If-statement after search() tests if it succeeded
if match:
  print ('found', match.group()) ## 'found word: dog'
else:
  print ('did not find')
print ("Done with the example")

found word: dog
Done with the example


In [None]:
line = "Cats are smarter than dogs. Don't you believe me?"

matchObj = re.match( r'(.*) are (.*?) .*', line) # parentheses define groups within the regex

if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(0) : ", matchObj.group(0))
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")

matchObj.group() :  Cats are smarter than dogs. Don't you believe me?
matchObj.group(0) :  Cats are smarter than dogs. Don't you believe me?
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter


### Basic Patterns

see [Python Documentation](https://docs.python.org/3/library/re.html)

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

* a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)

* . (a period) -- matches any single character except newline '\n'

* \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.

* \b -- boundary between word and non-word

* \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.

* \t, \n, \r -- tab, newline, return

* \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

* ^ = start, $ = end -- match the start or end of the string

* \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
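
The patterns in this list can be tried out one at a time. The following is a minimal sketch on a made-up sample string, not an exhaustive tour.

```python
import re

sample = 'Room 42:\tcats.and dogs'  # made-up sample text

print(re.search(r'\d\d', sample).group())           # 42
print(re.search(r'^\w+', sample).group())           # Room
print(re.search(r'\w+$', sample).group())           # dogs
# \b requires the 'd' to start a word, so the 'd' in 'and' is skipped.
print(re.search(r'\bd\w+', sample).group())         # dogs
# \s matches the tab here; \. matches only a literal period.
print(repr(re.search(r'\s\w+\.', sample).group()))  # '\tcats.'
# An unescaped dot matches any character, an escaped dot only a period.
print(re.search(r'cats.and', 'cats-and').group())   # cats-and
print(re.search(r'cats\.and', 'cats-and'))          # None
```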

### Basic Examples

Joke: what do you call a pig with three eyes? piiig!

The basic rules of regular expression search for a pattern within a string are:

* The search proceeds through the string from start to end, stopping at the first match found

* All of the pattern must be matched, but not all of the string

* If `match = re.search(pat, str)` is successful, match is not `None` and in particular `match.group()` is the matching text

In [None]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.

match = re.search(r'iii', 'piiig')
if match:
  print ('found, match.group() == "iii"')
  print (match.group())

found, match.group() == "iii"
iii


In [None]:
match = re.search(r'igs', 'piiig')
if not match:
  print ('not found, match == None')
  print (re.search(r'igs', 'piiig'))



not found, match == None
None


In [None]:
## . = any char but \n
match = re.search(r'..g', 'piiig')
if match:
  print ('found, match.group() == "iig"')
  print (match.group())



found, match.group() == "iig"
iig


In [None]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')
if match:
  print ('found, match.group() == "123"')
  print (match.group())



found, match.group() == "123"
123


In [None]:
match = re.search(r'\w\w\w', '@@abcd!!')
if match:
  print ('found, match.group() == "abc"')
  print (match.group())

found, match.group() == "abc"
abc


### Email Example 1

Extract the account name and the domain from the email address


In [None]:
email = 'my-email@citytech.edu'

# your code here

In [None]:
pattern=re.compile(r'([\w.-]+)@([\w.-]+)')
match = re.search(pattern, email)
if match:
  print (match.group())
  print (match.group(1))
  print (match.group(2))

my-email@citytech.edu
my-email
citytech.edu


### Email Example 2

Suppose you want to find the email address inside the string `'xyz my-email@citytech.edu data science'`.

Here's an attempt using the pattern r'\w+@\w+':

In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'\w+@\w+', emailText)
if match:
  print (match.group())

email@citytech


The search does not get the whole email address in this case because the `\w` does not match the `'-'` or `'.'` in the address. We'll fix this using the regular expression features below.

**Square Brackets**

Square brackets can be used to indicate a set of chars, so `[abc]` matches `'a'` or `'b'` or `'c'`. The codes `\w`, `\s` etc. work inside square brackets too with the one exception that dot (`.`) just means a literal dot. For the emails problem, the square brackets are an easy way to add `'.'` and `'-'` to the set of chars which can appear around the `@` with the pattern `r'[\w.-]+@[\w.-]+'` to get the whole email address:

In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'[\w.-]+@[\w.-]+', emailText)
if match:
  print (match.group())

my-email@citytech.edu


We will use the `group()` method to extract the account name and the domain from the email address.


In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'([\w.-]+)@([\w.-]+)', emailText)
if match:
  print ("email address:\t", match.group())
  print ("email account:\t", match.group(1))
  print ("email domain:\t", match.group(2))

email address:	 my-email@citytech.edu
email account:	 my-email
email domain:	 citytech.edu


### Iteration using regular expressions

If we want to extract data from a string in Python we can use the `findall()` or `finditer()` methods to extract all of the substrings which match a regular expression.

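These two methods produce results of different types: `findall()` returns a list of strings (or a list of tuples when the pattern contains more than one group), while `finditer()` returns an iterator of match objects.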

Let’s use the example of wanting to extract anything that looks like an email address from any line regardless of format. For example, we want to pull the email addresses from each of the following lines:

In [None]:
emailText = '''xyz my-email@citytech.edu data science or
your_email@citytech.edu, also we can mention their.email@citytech.edu'''

In [None]:
match = re.search(r'([\w.-]+)@([\w.-]+)', emailText)
if match:
  print ("email address:\t", match.group())
  print ("email account:\t", match.group(1))
  print ("email domain:\t", match.group(2))

email address:	 my-email@citytech.edu
email account:	 my-email
email domain:	 citytech.edu


In [None]:
matches = re.finditer(r'[\w.-]+@[\w.-]+', emailText)
for match in matches:
    print(match.group())

my-email@citytech.edu
your_email@citytech.edu
their.email@citytech.edu


In [None]:
matches = re.finditer(r'([\w.-]+)@([\w.-]+)', emailText)
for match in matches:
    print(match.group(),"\t", match.group(1),"\t", match.group(2))

my-email@citytech.edu 	 my-email 	 citytech.edu
your_email@citytech.edu 	 your_email 	 citytech.edu
their.email@citytech.edu 	 their.email 	 citytech.edu


In [None]:
print (matches) ## finditer returns an iterator

<callable_iterator object at 0x7910e048d480>


In [None]:
matches = re.findall(r'([\w\.-]+)@([\w\.-]+)', emailText)
print (matches)
for match in matches:
  print (match[0])  ## username
  print (match[1])  ## host

[('my-email', 'citytech.edu'), ('your_email', 'citytech.edu'), ('their.email', 'citytech.edu')]
my-email
citytech.edu
your_email
citytech.edu
their.email
citytech.edu


In [None]:
matches = re.findall(r'[\w\.-]+@[\w\.-]+', emailText)
print (matches)
for match in matches:
  print (match)

['my-email@citytech.edu', 'your_email@citytech.edu', 'their.email@citytech.edu']
my-email@citytech.edu
your_email@citytech.edu
their.email@citytech.edu
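
To see the two result types side by side, here is a short sketch on made-up addresses. `findall()` gives only the captured text, while each match object from `finditer()` also carries positions via `span()`.

```python
import re

text = 'a@x.org and b@y.org'  # made-up addresses for illustration
pattern = r'([\w.-]+)@([\w.-]+)'

# findall: a list of tuples, one tuple of groups per match.
found = re.findall(pattern, text)
print(found)  # [('a', 'x.org'), ('b', 'y.org')]

# finditer: match objects, which also know where each match was found.
spans = [(m.group(), m.span()) for m in re.finditer(pattern, text)]
print(spans)  # [('a@x.org', (0, 7)), ('b@y.org', (12, 19))]
```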


### Create a variable containing regular expression

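The `re.compile()` function compiles a regular expression into a pattern object, which can be stored in a variable and reused. Its methods, such as `search()`, `findall()`, and `finditer()`, take the text to search as their argument.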
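
As a quick sketch of how a compiled pattern relates to the module-level functions used earlier (the sample text and the variable name `number` are made up for illustration):

```python
import re

text = 'order 66 shipped, order 7 pending'  # made-up sample text

number = re.compile(r'\d+')

# The pattern object's methods mirror the module-level functions,
# minus the pattern argument.
print(number.search(text).group())  # 66
print(number.findall(text))         # ['66', '7']
print(re.findall(r'\d+', text) == number.findall(text))  # True

# The original pattern string stays available on the object.
print(number.pattern)  # \d+
```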

In [None]:
# We are looking for binary numbers
regex = re.compile(r'[10]+')
text = "asddf1101110100011abd1111panos0000"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

1101110100011
1111
0000


In [None]:
emailText = '''xyz my-email@citytech.edu data science or
your_email@citytech.edu, also we can mention their.email@citytech.edu'''

In [None]:
# We are looking for email addresses
regex = re.compile(r'[\w\.-]+@[\w\.-]+')
matches = regex.finditer(emailText)
for match in matches:
    print(match.group())

my-email@citytech.edu
your_email@citytech.edu
their.email@citytech.edu


In [None]:
# We look for money figures, either integers, or with 1 or 2 decimal
# digits
regex = re.compile(r'\$\d+(\.\d\d?)?')
text = '$1200.23 is the price today. $1200 was the price yesterday'
matches = regex.finditer(text)
for match in matches:
    print(match.group())

$1200.23
$1200


In [None]:
# This code is going to generate no matches
regex = re.compile(r'Ra*nd.*m R[egex]')
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

print ("The end")

The end


Regular expressions are typically case-sensitive.

In [None]:
# Regular expressions are compiled into pattern objects
# Regular expressions are case-sensitive
regex = re.compile(r'in.*on')
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

But we can specify that they are case-insensitive, using the flag re.IGNORECASE

In [None]:
# Unless we specify that they are case-insensitive, using the flag re.IGNORECASE
regex = re.compile(r'in.*on', re.IGNORECASE)
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

Information
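
The flag does not require `re.compile()`. As a sketch, it can also be passed to the module-level functions or written inline as `(?i)` at the start of the pattern. The pattern `c\w+` is chosen here only for illustration.

```python
import re

text = "CUNY, Citytech, Information and Data Management"

# Case-sensitive: only the lowercase 'c' inside 'Citytech' starts a match.
print(re.findall(r'c\w+', text))                 # ['ch']
# Flag passed as the third argument.
print(re.findall(r'c\w+', text, re.IGNORECASE))  # ['CUNY', 'Citytech']
# Same effect with the inline flag at the start of the pattern.
print(re.findall(r'(?i)c\w+', text))             # ['CUNY', 'Citytech']
```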


### Greedy vs. Non-Greedy

Suppose you have text with tags in it: `<b>foo</b> and <i>so on</i>`

Suppose you are trying to match each tag with the pattern '(<.*>)' -- what does it match first?



In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*>)')
match = regex.search(text)
print (match.group())

<b>foo</b> and <i>so on</i>




The result is a little surprising, but the greedy aspect of the `.*` causes it to match the whole `<b>foo</b> and <i>so on</i>` as one big match. The problem is that the `.*` goes as far as it can, instead of stopping at the first `>` (that is, it is "greedy").

There is an extension to regular expressions where you add a `?` at the end, such as `.*?` or `.+?`, changing them to be non-greedy. Now they stop as soon as they can. So the pattern `'(<.*?>)'` will get just `'<b>'` as the first match, and `'</b>'` as the second match, and so on, getting each `<..>` pair in turn. The style is typically that you use a `.*?`, and then immediately to its right look for some concrete marker (`>` in this case) that forces the end of the `.*?` run.



In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*?>)')
match = regex.search(text)
print (match.group())

<b>


In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*?>)')
matches = regex.finditer(text)
for match in matches:
    print(match.group())

<b>
</b>
<i>
</i>


In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.+?>)')
matches = regex.finditer(text)
for match in matches:
    print(match.group())

<b>
</b>
<i>
</i>
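
The quantifiers are easy to mix up, so here is a side-by-side sketch on the same text. Note in particular that `.?` (no star) is not a non-greedy form: it means "zero or one character", so it only finds tags with a one-character name.

```python
import re

text = "<b>foo</b> and <i>so on</i>"

# Greedy: .* runs to the last '>' in the string.
print(re.findall(r'<.*>', text))   # ['<b>foo</b> and <i>so on</i>']
# Non-greedy: .*? stops at the first '>' it can reach.
print(re.findall(r'<.*?>', text))  # ['<b>', '</b>', '<i>', '</i>']
# Optional single character: '</b>' and '</i>' are too long to match.
print(re.findall(r'<.?>', text))   # ['<b>', '<i>']
```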


### Substitution

The `re.sub(pat, replacement, str)` function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include `'\1'`, `'\2'` which refer to the text from `group(1)`, `group(2)`, and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user `(\1)` but have `nyc.gov` as the host.

In [None]:
text = '''xyz my-email@citytech.edu data science or
your_email@citytech.edu, also we can mention their.email@citytech.edu'''
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print (re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@nyc.gov', text))
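
As a further sketch of group references in the replacement string, the example below reorders three captured groups in made-up date text. It also uses the optional `count` argument of `re.sub()` to limit how many matches are replaced.

```python
import re

dates = 'start 2024-01-15, end 2024-02-01'  # made-up sample text
date_pattern = r'(\d{4})-(\d{2})-(\d{2})'   # year, month, day as groups 1-3

# Reorder the groups: year-month-day becomes day/month/year.
all_swapped = re.sub(date_pattern, r'\3/\2/\1', dates)
print(all_swapped)    # start 15/01/2024, end 01/02/2024

# count=1 replaces only the first match.
first_swapped = re.sub(date_pattern, r'\3/\2/\1', dates, count=1)
print(first_swapped)  # start 15/01/2024, end 2024-02-01
```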


### More patterns:

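* `^` Matches the beginning of the string (and of each line when the `re.MULTILINE` flag is set).
* `$` Matches the end of the string (and of each line when the `re.MULTILINE` flag is set).
* `.` Matches any character except a newline (a wildcard).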
* `\s` Matches a whitespace character.
* `\S` Matches a non-whitespace character (opposite of `\s`).
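
In Python's `re`, `^` and `$` anchor to the whole string unless `re.MULTILINE` is passed, in which case they also anchor to each line. A minimal sketch on a made-up two-line string:

```python
import re

lines = 'first line\nsecond line'  # made-up two-line string

# Default: ^ and $ refer to the start and end of the whole string.
print(re.findall(r'^\w+', lines))                # ['first']
print(re.findall(r'\w+$', lines))                # ['line']
# With re.MULTILINE they also match at the start and end of each line.
print(re.findall(r'^\w+', lines, re.MULTILINE))  # ['first', 'second']
print(re.findall(r'\w+$', lines, re.MULTILINE))  # ['line', 'line']
# \S+ collects runs of non-whitespace characters.
print(re.findall(r'\S+', lines))  # ['first', 'line', 'second', 'line']
```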

### Exercise 1

Imagine we want to conceal all phone numbers in a document.
Read the multiline string below and replace every phone number with

'XXX-XXX-XXXX'

In [None]:
raw_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

In [None]:
## your solution here

### Solution

In [None]:
# option 1: masks only numbers written as ddd-ddd-dddd
print (re.sub(r'\d\d\d-\d\d\d-\d\d\d\d', 'XXX-XXX-XXXX', raw_text))


# option 2: allows any non-digit separators between the three digit groups
#regex = re.compile(r'([2-9]\d{2})\D*(\d{3})\D*(\d{4})')
#newstring = re.sub(regex, "XXX-XXX-XXXX", raw_text)
#print(newstring)
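
The first option only handles the dashed format. One possible middle ground, shown here as a sketch rather than a complete solution, allows an optional pair of parentheses around the area code and an optional `-`, `.`, or space between groups. The names `PHONE` and `mask_phone_numbers` are hypothetical and the sample is a shortened version of the exercise text.

```python
import re

# Hypothetical helper, not part of the original solution.
# Optional '(' and ')', then 3 + 3 + 4 digits with optional separators.
PHONE = re.compile(r'\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}')

def mask_phone_numbers(text):
    """Replace anything that looks like a 10-digit phone number."""
    return PHONE.sub('XXX-XXX-XXXX', text)

sample = """512-234-5234
foo
2126660921
802 555 1212
803.555.1213
(804) 555-1214
"""
masked = mask_phone_numbers(sample)
print(masked)
```

Leading country codes and extensions (such as `1-805-555-1215` or `808-555-1218x1234`) are left partly visible by this pattern and would need additional handling.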



### Exercise 2

Set up a crawler that extracts the titles of the Related Content from
the White House Immigration page.

https://www.whitehouse.gov/issues/immigration/

First, let us check the source code:

view-source:https://www.whitehouse.gov/issues/immigration/

The information that we need comes after the following tag:

`<h2 class="briefing-statement__title">`

In [None]:
# We need this library to get the web page source code
import requests

In [None]:
# Fetch the HTML source
url = 'https://www.whitehouse.gov/issues/immigration/'
html = requests.get(url).text

In [None]:
html

In [None]:
pattern = r'YOUR PATTERN HERE'
regex = re.compile(pattern)
matches = regex.finditer(html)
for m in matches:
    ... #YOUR CODE HERE

### Solution

In [None]:
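# <, >, ", _ and / have no special meaning in a regex, so they need no backslash.
# The non-greedy .*? keeps each match within a single <a ...>...</a> element.
pattern = r'<h2 class="briefing-statement__title"><a href.*?>(.*?)</a>'
regex = re.compile(pattern)
matches = regex.finditer(html)
for m in matches:
    print (m.group(1))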
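
Because the live page can change or become unavailable, it can help to try the pattern on a small inline fragment first. The HTML below is a hypothetical stand-in written for this sketch; the real page's markup may differ.

```python
import re

# Hypothetical HTML fragment; the real page's markup may differ.
sample_html = '''
<h2 class="briefing-statement__title"><a href="/a">First title</a></h2>
<h2 class="other"><a href="/b">Not a briefing</a></h2>
<h2 class="briefing-statement__title"><a href="/c">Second title</a></h2>
'''

title_re = re.compile(
    r'<h2 class="briefing-statement__title"><a href.*?>(.*?)</a>')

# group(1) is the link text captured between <a ...> and </a>.
titles = [m.group(1) for m in title_re.finditer(sample_html)]
print(titles)  # ['First title', 'Second title']
```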

### Exercise 3

Modify Exercise 2 solution to keep only those titles that have the name
Trump in them.

### Solution

In [None]:
# Same pattern as in Exercise 2; only the filter on the title is new.
pattern = r'<h2 class="briefing-statement__title"><a href.*?>(.*?)</a>'
regex = re.compile(pattern)
matches = regex.finditer(html)
for m in matches:
  if 'Trump' in m.group(1):
    print (m.group(1))