# Regular expressions

- Pattern matching (search, replace)
- Expressive power equal to that of regular grammars and nondeterministic finite automata
- Works via classical string-matching algorithms extended with special characters (wildcards, quantifiers, etc.)
- This lab is Python-specific, but other RegEx engines work similarly

A nice regex editor: [regex101.com](http://www.regex101.com)

# [Cheatsheet of special characters](https://www.dataquest.io/blog/regex-cheatsheet/)

## Basic characters
<table style="border-collapse: collapse; width: 1000px;">
<tbody style="border-collapse: collapse; width: 1000px;">
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; background-color: #2d353b; text-align: center;"><span style="color: #ffffff;"><strong>Characters</strong></span></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000; background-color: #2d353b; text-align: center;"><span style="color: #ffffff;"><strong>Explanation</strong></span></td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>a</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Matches exactly one character a.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>ab</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Matches the string ab.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>a|b</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Matches a or b. If a matches, b is not tried.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>$</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match the end of the string.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>i</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Flag (not a pattern character): ignore case.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>s</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Flag: the dot matches every character, including newline.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>u</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Flag: match Unicode character classes.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>x</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Flag: allow whitespace and comments in the pattern (verbose).</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>^</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match the start of the string.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>.</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match any single character except a newline.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>*</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match 0 or more repetitions.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>+</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match one or more times.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>?</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match zero or one time.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>{a,b}</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match a to b times.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>{a,}</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match at least a times.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>{,b}</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match up to b times.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>{a}</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match exactly a times.</td>
</tr>
<tr>
<td style="width: 22.1812%; border-style: solid; border-color: #000000; text-align: center;"><strong>{a,b}?</strong></td>
<td style="width: 77.8188%; border-style: solid; border-color: #000000;">Match a to b times, as few times as possible (lazy).</td>
</tr>
</tbody>
</table>
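
The quantifier rows above can be checked with a short sketch. The sample strings and variable names below are made up for illustration; the calls are standard `re` functions.

```python
import re

sample = "aaaa"

# Greedy quantifier: takes as many repetitions as allowed (3 here).
greedy = re.search(r"a{2,3}", sample).group()

# Lazy quantifier: takes as few repetitions as allowed (2 here).
lazy = re.search(r"a{2,3}?", sample).group()

# '.' stands for exactly one character, and by default not a newline.
dot_matches = re.findall(r"b.d", "bad bed bd b\nd")

print(greedy)       # aaa
print(lazy)         # aa
print(dot_matches)  # ['bad', 'bed']
```

Note that "bd" is skipped because the dot needs one character to consume, and "b\nd" is skipped because the dot does not match a newline unless the s flag is set.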

## Character classes


<table style="border-collapse: collapse; width: 100%;">
<tbody>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; background-color: #2d353b; text-align: center;"><span style="color: #ffffff;"><strong>Class</strong></span></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000; background-color: #2d353b; text-align: center;"><span style="color: #ffffff;"><strong>Explanation</strong></span></td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\d</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches digits from 0-9.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\D</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches any non-digits.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\w</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches word characters: a-z, A-Z, 0-9, and underscore (_).</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\W</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches any character that is not a word character.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\s</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches whitespace characters.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\S</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches non-whitespace characters.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\n</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches a newline character.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\t</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches tab character.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\b</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches empty string, only at the beginning or end of a word.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\Z</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches the expression to its left at the absolute end of a string, whether in single or multi-line mode.</td>
</tr>
<tr>
<td style="width: 22.5722%; border-style: solid; border-color: #000000; text-align: center;"><strong>\A</strong></td>
<td style="width: 77.4278%; border-style: solid; border-color: #000000;">Matches the expression to its right at the absolute start of a string, whether in single or multi-line mode.</td>
</tr>
</tbody>
</table>
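
A small sketch of the classes above in action. The sample line is invented for this example.

```python
import re

line = "Order 66 shipped on day_3\tto Unit B"

digits = re.findall(r"\d+", line)   # runs of digits
words = re.findall(r"\w+", line)    # runs of word characters (letters, digits, _)
gaps = re.findall(r"\s", line)      # single whitespace characters (spaces and the tab)

# \b is zero-width: it matches only at a word boundary, so 'on' inside
# 'onto' or 'upon' is not found.
whole_word = re.findall(r"\bon\b", "on onto upon on")

print(digits)      # ['66', '3']
print(words)       # ['Order', '66', 'shipped', 'on', 'day_3', 'to', 'Unit', 'B']
print(len(gaps))   # 7
print(whole_word)  # ['on', 'on']
```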

## Character sets

<table style="border-collapse: collapse; width: 100%; height: 240px;">
<tbody>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; background-color: #2d353b; text-align: center; height: 24px;"><span style="color: #ffffff;"><strong>Sets</strong></span></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; background-color: #2d353b; text-align: center; height: 24px;"><span style="color: #ffffff;"><strong>Explanation</strong></span></td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[a-z]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Match any lowercase ASCII letter.</td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[xyz]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Matches either x, y, or z.</td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[x\-z]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Matches x, - or z.</td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[-x]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Matches - or x.</td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[a-d0-9]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Matches characters from a to d or from 0 to 9.</td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[^xy4]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Matches characters that are not x, y, or 4.</td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[(+*)]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Matches (, +, * or ).</td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[0-5][0-9]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Matches any two-digit number from 00 to 59.</td>
</tr>
<tr style="height: 24px;">
<td style="width: 21.3911%; border-style: solid; border-color: #000000; height: 24px; text-align: center;"><strong>[^ab5]</strong></td>
<td style="width: 78.6089%; border-style: solid; border-color: #000000; height: 24px;">Adding ^ excludes any character in the set. Here, it matches characters that are not a, b, or 5.</td>
</tr>
</tbody>
</table>
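
The set rules above, tried on a few invented inputs. The names `minute`, `valid`, `consonants` and `dashes` are only for illustration.

```python
import re

# [0-5][0-9] accepts a two-digit value from 00 to 59.
minute = re.compile(r"[0-5][0-9]")
# fullmatch requires the whole string to match, so "60" and "7" are rejected.
valid = [m for m in ["00", "07", "59", "60", "7"] if minute.fullmatch(m)]

# ^ as the first character inside a set negates the set.
consonants = re.findall(r"[^aeiou\s]", "regular expression")

# A hyphen placed first in the set is a literal hyphen, not a range.
dashes = re.findall(r"[-x]", "x-ray")

print(valid)       # ['00', '07', '59']
print(consonants)  # ['r', 'g', 'l', 'r', 'x', 'p', 'r', 's', 's', 'n']
print(dashes)      # ['x', '-']
```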

# Regular Expressions are used to match string patterns.


- They are very powerful

- If you want to pull out a string pattern, REs can do it

- They may seem intimidating  

# Things to note

The 'r' prefix creates a raw string, in which Python does not process backslash escapes.

r'\n' is a raw string of two characters, '\' and 'n', as 
opposed to '\n', which is the single newline character.
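
A quick way to see the difference is to compare lengths. This is a minimal sketch; the variable names are arbitrary.

```python
newline = '\n'   # one character: the newline
raw = r'\n'      # two characters: a backslash followed by 'n'

print(len(newline))   # 1
print(len(raw))       # 2
print(raw == '\\n')   # True: doubling the backslash gives the same string
```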

In [2]:
# examples of this; don't mind the Python syntax for now
import re
re.search('n', '\n')  #first item is pattern, second item is string

In [3]:
#there are two ways to handle this; one is to write \\ for every backslash
import re
re.search('n', '\\n')   

<re.Match object; span=(1, 2), match='n'>

In [4]:
re.search('n',  '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n')  #no match: these are newlines, and doubling
                                                   #every backslash by hand would be tedious

In [5]:
re.search('n',  r'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n')    #the r prefix makes this a raw string

<re.Match object; span=(1, 2), match='n'>

In [6]:
#there are some nuances that you should be aware of
#regular expressions have their own special characters as well
#as a pattern, '\n' and r'\n' both look for a newline

re.search('\n',  '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n')  

<re.Match object; span=(0, 1), match='\n'>

In [7]:
re.search(r'\n',  '\n\n')     #this works as well because r'\n' also looks
                                #for new line

<re.Match object; span=(0, 1), match='\n'>

In [10]:
#no match: the raw string contains no newline, but the pattern r'\n' looks for one
print(re.search(r'\n',  r'\n\n'))

None


# MATCH and SEARCH EXAMPLES

Common re functions: match and search

In [13]:
re.search("c", "abcdef")   #searches anywhere

<re.Match object; span=(2, 3), match='c'>

In [16]:
re.match("c", "abcdef")  #returns none because only looks at the start of string

In [17]:
bool(re.match("c", "abcdef"))  # no match returns boolean false

False

In [18]:
bool(re.match("a", "abcdef"))  #match returns true

True

In [21]:
re.search("c", "abcdef")  #tells you where it matched first and only first

<re.Match object; span=(2, 3), match='c'>

In [22]:
re.search("c", "abcdefc")  #multiple 'c's first instance only

<re.Match object; span=(2, 3), match='c'>

In [23]:
re.search("c", "abdef\nc") #multiline works with search

<re.Match object; span=(6, 7), match='c'>

In [24]:
re.match("c", "abcdef\nc")   #no match: match only checks the start of the string, not the start of each line
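
To match at the start of a later line, one option is `re.search` with the `^` anchor and the `re.MULTILINE` flag. A minimal sketch using the same string as above (the variable names are made up):

```python
import re

s = "abcdef\nc"

# match only ever tries position 0, with or without MULTILINE.
no_flag = re.match("c", s)
with_flag = re.match("c", s, re.MULTILINE)

# With MULTILINE, ^ also matches right after each newline.
line_start = re.search(r"^c", s, re.MULTILINE)

print(no_flag)            # None
print(with_flag)          # None
print(line_start.span())  # (7, 8)
```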

# Printing the output of match and search

In [25]:
(re.match("a", "abcdef"))   #match objects

<re.Match object; span=(0, 1), match='a'>

In [26]:
re.match("a", "abcdef").group()  #string output; the default group is 0 (the whole match)

'a'

In [27]:
re.match("a", "abcdef").group(0)  

'a'

In [28]:
re.search("n", "abcdefnc abcd").group()

'n'

In [29]:
re.search('n.+', "abcdefnc abcd").group()  #pull out different types of strings 
                                            #depending on the wildcards you use

'nc abcd'

In [30]:
re.search("c", "abdef\nc").start()

6

In [31]:
re.search("c", "abdef\nc").end()

7
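
Calling `.group()`, `.start()` or `.end()` directly on the result fails with an `AttributeError` when nothing matches, because `re.search` then returns `None`. A defensive sketch, where `first_match` is a hypothetical helper name:

```python
import re

def first_match(pattern, text):
    """Return (matched string, start, end) or None if there is no match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(), m.start(), m.end()

print(first_match("c", "abdef\nc"))  # ('c', 6, 7)
print(first_match("z", "abcdef"))    # None
```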

# Literal matching

In [34]:
re.search('na',"abcdefnc abcd" )  #no match: the literal 'na' needs an n immediately followed by an a

In [36]:
re.search('n|a',"abcdefnc abcda" )  #n or a

<re.Match object; span=(0, 1), match='a'>

In [37]:
re.search('n|a',"bcdefnc abcda" )  #with the leading a removed, the first match is an n

<re.Match object; span=(5, 6), match='n'>

In [38]:
re.search('n|a|b',"bcdefnc abcda" ) #any number of alternatives can be chained with |

<re.Match object; span=(0, 1), match='b'>

# re.findall

In [39]:
re.findall('n|a',"bcdefnc abcda" ) #find all pulls out all instances

['n', 'a', 'a']

In [40]:
re.search('abcd',"abcdefnc abcd" ) #multiple characters - literal search

<re.Match object; span=(0, 4), match='abcd'>

In [41]:
re.findall('abcd',"abcdefnc abcd" ) 

['abcd', 'abcd']

# More examples

In [43]:
text = "There is an apple in this sentence."

#Bound to string start, returns Match object
found = re.match(r"There",text)

print(found.group(), found.span())
print(found)

There (0, 5)
<re.Match object; span=(0, 5), match='There'>


In [44]:
found = re.match(r"apple",text)
print(found)

None


In [45]:
#Use search for sub-string matching
found = re.search(r"apple", text)
print(found.group(), found.span())

apple (12, 17)


In [46]:
#Returns first match only
found = re.search(r"i", text)

print(found.group(), found.span())

i (6, 7)


In [47]:
#Finding all NON-OVERLAPPING matches
found = re.findall(r"i", text)

#Returns list of matches
print(found)

['i', 'i', 'i']


Match all words starting with "a"

- word start: **\b**   
- character "a" after it: **a**   
- any alphanumeric character is allowed after it: **\w**   
- for 0 to infinite repetitions: **\***   
- the whole word is required, thus we match the word end explicitly: **\b**


In [49]:
#More useful with character classes
found = re.findall(r"\ba\w*\b", text)
#Returns a list of strings
print(found)

['an', 'apple']


In [50]:
#If we need an iterator of match objects, not just the strings
found = re.finditer(r"\ba\w*\b", text)
#Returns an iterator of match objects
for match in found:
  print(match)

<re.Match object; span=(9, 11), match='an'>
<re.Match object; span=(12, 17), match='apple'>


### Split text at every two letter word (exclude the whitespaces too)

- Word start and end with any number of whitespaces included: **\s\*\b** and **\b\s\***
- Quantifying exactly two word characters in between: **\w\{2\}**

In [51]:
split_text = re.split(r"\s*\b\w{2}\b\s*", text)
print(split_text)

['There', '', 'apple', 'this sentence.']
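
The empty string in the result comes from two adjacent two-letter words ("is" and "an"). If that is unwanted, the pieces can be filtered, and `maxsplit` limits the number of splits. A small sketch with the same text and pattern (variable names are arbitrary):

```python
import re

text = "There is an apple in this sentence."
pattern = r"\s*\b\w{2}\b\s*"

parts = re.split(pattern, text)
non_empty = [p for p in parts if p]                # drop empty pieces
first_only = re.split(pattern, text, maxsplit=1)   # split at the first match only

print(non_empty)   # ['There', 'apple', 'this sentence.']
print(first_only)  # ['There', 'an apple in this sentence.']
```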


### Find dates

Using groups to match dates in mm/dd/yyyy format

- The month can start with 1 or 0
  - If it starts with 1, it can end in 0, 1 or 2
  - If it starts with 0, it can end in any digit but 0
- The day can start with 3, 2, 1 or 0
  - If it starts with 3, it can end in 0 or 1
  - If it starts with 1 or 2, it can end in any digit
  - If it starts with 0, it can end in any digit but 0
- The year can be any sequence of 4 digits

In [52]:
def findDate(text):
  line=re.findall('(1[0-2]|0[1-9])/(3[01]|[12][0-9]|0[1-9])/([0-9]{4})',text)     
  return line

findDate('Todays date is 10/04/2021')

[('10', '04', '2021')]
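
The pattern checks the shape of a date, not the calendar, so 02/30/2021 would still match. One possible refinement, sketched here with named groups and the standard `datetime` module, is to discard matches that are not real dates. `DATE_RE` and `find_valid_dates` are hypothetical names.

```python
import re
from datetime import datetime

# Same pattern as above, with named groups for readability.
DATE_RE = re.compile(
    r"(?P<month>1[0-2]|0[1-9])/(?P<day>3[01]|[12][0-9]|0[1-9])/(?P<year>[0-9]{4})"
)

def find_valid_dates(text):
    """Return the dates in text that exist in the calendar."""
    dates = []
    for m in DATE_RE.finditer(text):
        try:
            d = datetime(int(m.group("year")), int(m.group("month")), int(m.group("day")))
        except ValueError:
            continue  # matched the pattern but is not a real date, e.g. 02/30
        dates.append(d.date())
    return dates

print(find_valid_dates("Due 02/30/2021 or 10/04/2021"))  # [datetime.date(2021, 10, 4)]
```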

## Grouping/Assertion
**( )** | Matches the expression inside the parentheses and groups it.

**(? )** | Inside parentheses like this, ? acts as an extension notation.

**(?aiLmsux)** | Here, a, i, L, m, s, u, and x are flags:

**a** — Matches ASCII only   
**i** — Ignore case   
**L** — Locale dependent   
**m** — Multi-line   
**s** — Dot matches all characters, including newline   
**u** — Matches unicode   
**x** — Verbose   

**(?FLAGS:A)** | Matches the expression as represented by A. Using a flag set represented by FLAGS.

**(?#...)** | A comment.

**A(?=B)** | Lookahead assertion. This matches the expression A only if it is followed by B.

**A(?!B)** | Negative lookahead assertion. This matches the expression A only if it is not followed by B.

**(?<=B)A** | Positive lookbehind assertion. This matches the expression A only if B is immediately to its left. B must be a fixed-length expression.

**(?<!B)A** | Negative lookbehind assertion. This matches the expression A only if B is not immediately to its left. B must be a fixed-length expression.
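
A short sketch of the four assertions and an inline flag group. The `prices` string is invented for this example.

```python
import re

prices = "apple $3, pear $12, 7 plums"

# Lookbehind: digits that come right after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: digits preceded by neither '$' nor another digit.
# Excluding digits as well stops the match from starting inside '12'.
not_price = re.findall(r"(?<![$\d])\d+", prices)

# Lookahead: digits followed by ' plums' (the lookahead text is not consumed).
plum_count = re.findall(r"\d+(?= plums)", prices)

# Negative lookahead: whole numbers not followed by ' plums'.
others = re.findall(r"\b\d+\b(?! plums)", prices)

# Inline flag group: ignore case for this part of the pattern only.
apples = re.findall(r"(?i:apple)", "Apple APPLE apple")

print(after_dollar)  # ['3', '12']
print(not_price)     # ['7']
print(plum_count)    # ['7']
print(others)        # ['3', '12']
print(apples)        # ['Apple', 'APPLE', 'apple']
```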

### Change numbered filename to IMG\<num\> if the extension is .jpg or .png
If no number is present, skip the file

- From the string start: **^**   
- We match the group of characters which are non-digit characters but word characters   
(non-"non-word" characters): **[^\d\W]**
- We match at least one of those: **+**
- Assert that after our replaceable name comes at least one digit: **(?=\d+**
- And an extension of .png or .jpg where we ignore the case (the dot is escaped so that it matches only a literal dot): **((?i:\\.png)|(?i:\\.jpg)))**


In [53]:
fnames = ["summer1.jpg", "summer2.PNG", "vacation3.png", "summer4.exe", "nlphomework.pdf", "vacation5.JPG"]

for f in fnames:
  print(re.sub(r"^[^\d\W]+(?=\d+((?i:\.png)|(?i:\.jpg)))","IMG",f))

IMG1.jpg
IMG2.PNG
IMG3.png
summer4.exe
nlphomework.pdf
IMG5.JPG
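
An alternative to the lookahead is to capture the number and the extension and rebuild the name in the replacement. The sketch below passes a function as the replacement; the zero-padding and lowercase extension are extra assumptions made for illustration, and `NAME_RE` and `rename` are hypothetical names.

```python
import re

# Group 1: the number, group 2: the extension (case-insensitive).
NAME_RE = re.compile(r"^[^\d\W]+(\d+)(\.(?:png|jpg))$", re.IGNORECASE)

def rename(fname):
    # The replacement function receives the match object and returns the new text.
    return NAME_RE.sub(
        lambda m: f"IMG{int(m.group(1)):03d}{m.group(2).lower()}", fname
    )

for f in ["summer1.jpg", "summer2.PNG", "summer4.exe", "nlphomework.pdf"]:
    print(rename(f))
# IMG001.jpg
# IMG002.png
# summer4.exe
# nlphomework.pdf
```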


# Find social media tags

In [54]:
def findHash(text):  
    line=re.findall(r"(?<=#)\w+",text)  #raw string, so \w reaches the regex engine unchanged
    return line

findHash("I love football. #football #FIFA")

['football', 'FIFA']
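
The same result can be obtained with a capture group instead of a lookbehind, since `findall` returns only the group when the pattern has exactly one. The extra `(?<!\w)` check below is an assumption about what should count as a tag: it skips a `#` that sits in the middle of a word. `find_tags` is a hypothetical name.

```python
import re

def find_tags(text):
    # '#' must not be preceded by a word character; the tag itself is captured.
    return re.findall(r"(?<!\w)#(\w+)", text)

print(find_tags("I love football. #football #FIFA"))  # ['football', 'FIFA']
print(find_tags("color=a#fff #real"))                 # ['real']
```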

## Reusing patterns

What does this match? **[\w.-]+@[\w.-]+**

In [57]:
pattern = re.compile(r"[\w.-]+@[\w.-]+")
print(pattern)
print(pattern.findall("This is my secret email address: info@myaddress.com"))

#Counter-intuitive: the replacement string comes first
print(pattern.sub("**SECRET EMAIL**","This is my secret email address: info@myaddress.com"))

re.compile('[\\w.-]+@[\\w.-]+')
['info@myaddress.com']
This is my secret email address: **SECRET EMAIL**
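
A compiled pattern can be reused across calls, and `subn` also reports how many replacements were made. The sketch below uses an invented sentence and shows one limitation of this simple pattern: a sentence-ending period after the address is swallowed, because `.` is in the allowed set.

```python
import re

EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+")

text = "Contact info@myaddress.com or admin@myaddress.com."

found = EMAIL_RE.findall(text)
masked, count = EMAIL_RE.subn("<hidden>", text)

print(found)   # ['info@myaddress.com', 'admin@myaddress.com.']
print(masked)  # Contact <hidden> or <hidden>
print(count)   # 2
```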


# Finding URLs

In [58]:
def find_url(string):
    text = re.findall(r'http[s]?:\/\/(?:\w|[$-_@.&+!*(),])+',string)
    #findall returns a list with every matched URL
    return text

find_url("Could you find http://404notfound.com/?isnotvalid=True&id=2 and this http://127.12.21.32:5342 please?")

['http://404notfound.com/?isnotvalid=True&id=2', 'http://127.12.21.32:5342']
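
Once the URLs are extracted, the standard `urllib.parse` module is usually a better tool than further regexes for taking them apart. A minimal sketch with the same pattern; `URL_RE` and `find_hosts` are hypothetical names.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://(?:\w|[$-_@.&+!*(),])+")

def find_hosts(string):
    """Return the host[:port] part of every URL found in string."""
    return [urlparse(url).netloc for url in URL_RE.findall(string)]

sample = ("Could you find http://404notfound.com/?isnotvalid=True&id=2 "
          "and this http://127.12.21.32:5342 please?")
print(find_hosts(sample))  # ['404notfound.com', '127.12.21.32:5342']
```

Inside the set, `$-_` is a character range from `$` to `_`, which is why characters such as `/`, `:`, `?` and `=` are accepted even though they are not listed one by one.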