# Regular Expressions
Source: https://www.sololearn.com/Course/Python/

Regular expressions are a powerful tool for various kinds of *string manipulation*.
They are a domain-specific language (DSL) that is available as a library in most modern programming languages, not just Python.
They are useful for **two main tasks**:
- verifying that strings **match a pattern** (for instance, that a string has the format of an email address),
- performing **substitutions** in a string (such as changing all American spellings to British ones).

Regular expressions in Python can be accessed using the **re** module, which is part of the standard library.

In [4]:
import re

- To avoid confusion when working with regular expressions, write patterns as raw strings: **r"expression"**.
- Raw strings **don't process backslash escapes**, so every backslash reaches the regex engine unchanged, which makes regular expressions easier to write.

In [2]:
pattern = r'spam \n'
print(pattern)

spam \n
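
To make the effect of the raw string more concrete, the short sketch below compares it with an ordinary string of the same spelling. The variable names `normal` and `raw` are only illustrative.

```python
# In an ordinary string, "\n" is a single newline character.
normal = "spam \n"
# In a raw string, the backslash and the "n" stay as two separate characters.
raw = r"spam \n"

print(len(normal))        # 6
print(len(raw))           # 7
print(raw == "spam \\n")  # True
```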


In [3]:
pattern = r'spam'
text = 'spamspamspam'

In [4]:
if re.match(pattern, text):
    print("Match")
else:
    print("No match")

Match


- The ***re.match*** function determines whether a pattern matches at the **beginning of a string**.
- If it does, match returns an *object* representing the match; if not, it returns *None*.

In [5]:
re.match(pattern, "spamspamspam")

<re.Match object; span=(0, 4), match='spam'>

- Other functions to match patterns are:
    - The function **re.search** finds a match of a pattern *anywhere* in the string.
    - The function **re.findall** returns *a list* of all substrings that match a pattern.
    - The function **re.finditer** does the same thing as re.findall, except it returns an *iterator*, rather than a list.

In [5]:
pattern = r"spam"
text = "eggspamsausagespam"
# CODE HERE : use match, search findall and finditer for this pattern and text
if re.match(pattern,text):
    print("Match")
else:
    print("No Match")
if re.search(pattern,text):
    print("Match")
else:
    print("No Match")
print(re.findall(pattern,text))
print(re.finditer(pattern,text))
for m in re.finditer(pattern,text):
    print(m)

No Match
Match
['spam', 'spam']
<callable_iterator object at 0x7f2fa8e0ddf0>
<re.Match object; span=(3, 7), match='spam'>
<re.Match object; span=(14, 18), match='spam'>


A successful search returns a match object with ***several methods*** that give details about the match:
- **group**, which returns the string matched,
- **start** and **end**, which return the start and end positions of the match,
- **span**, which returns the start and end positions of the match as a *tuple*.

In [7]:
pattern = r"pam"
match = re.search(pattern, "eggspamsausage")
if match:
    print(match.group())
    print(match.start())
    print(match.end())
    print(match.span())

pam
4
7
(4, 7)


## Search & Replace
>re.sub(pattern, repl, string, count=0)<br>

This method replaces occurrences of the pattern in string with repl. By default (count=0) it replaces all occurrences; if count is given, at most that many are replaced. It returns the modified string.

In [8]:
text = "My name is David. Hi David."
pattern = r"David"

#CODE HERE: substitute David with Amy in text
print(re.sub(pattern,"Amy", text))

My name is Amy. Hi Amy.
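
In Python's re module the limit on substitutions is passed as the `count` argument. The sketch below reuses the same sentence to show it, and also uses `re.subn`, which is not covered above but is part of the same module.

```python
import re

text = "My name is David. Hi David."

# count=1 replaces only the first occurrence; the default count=0 replaces all.
first_only = re.sub(r"David", "Amy", text, count=1)
print(first_only)  # My name is Amy. Hi David.

# re.subn returns the new string together with the number of replacements made.
replaced, n = re.subn(r"David", "Amy", text)
print(replaced)  # My name is Amy. Hi Amy.
print(n)         # 2
```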


# Metacharacters
Some characters are metacharacters and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.
> . ^ $ * + ? { } [ ] \ | ( )


The existence of metacharacters poses a problem if you want to create a regular expression (or regex) that *matches a literal metacharacter*, such as "$". You can do this by escaping the metacharacters by putting a backslash in front of them.<br>
However, this can cause problems, since backslashes also have an escaping function in normal Python strings. This can mean putting *three or four backslashes in a row* to do all the escaping.
To avoid this, you can use a raw string, which is a normal string with an **"r"** in front of it.
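
As a small illustration of escaping, the sketch below matches a literal "$" and "." in a price. The sample text is made up, and `re.escape` is an extra helper from the re module that is not discussed above.

```python
import re

price = "Total: $5.00"

# Raw string: one backslash per escaped metacharacter.
raw_pattern = r"\$5\.00"
# Ordinary string: each of those backslashes must itself be doubled.
plain_pattern = "\\$5\\.00"
print(raw_pattern == plain_pattern)  # True

found = re.search(raw_pattern, price)
print(found.group())  # $5.00

# re.escape adds the backslashes when a piece of text should be matched literally.
escaped = re.escape("$5.00")
print(re.search(escaped, price).group())  # $5.00
```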

## dot .
The first metacharacter we will look at is . (dot).
This matches any single character other than a newline.

In [29]:
pattern = r"gr.y"
if re.match(pattern, "grey"):
    print("Match 1")
if re.match(pattern, "gray"):
    print("Match 2")
if re.match(pattern, "blue"):
    print("Match 3")
if re.match(pattern, "gr y"):
    print("Match 4")
if re.match(pattern, "greeey"):
    print("Match 5")

Match 1
Match 2
Match 4
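
The newline exception can be checked directly. The sketch below also uses the `re.DOTALL` flag, which is not introduced in this tutorial; it is shown only as one way to make the dot match a newline as well.

```python
import re

pattern = r"gr.y"

# By default the dot does not match a newline character.
default_match = re.match(pattern, "gr\ny")
print(default_match)  # None

# With the re.DOTALL flag the dot matches any character, including a newline.
dotall_match = re.match(pattern, "gr\ny", re.DOTALL)
print(bool(dotall_match))  # True
```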


## ^ and \$
- start of string : ^ 
- end of string : $

In [11]:
#CODE HERE: write a pattern that matches words like grey or gary only when they are the entire string
pattern = r"^g..y$"
if re.search(pattern, "grey"):
    print("Match 1")
if re.search(pattern, "gary"):
    print("Match 2")
if re.search(pattern, "stingray"):
    print("Match 3")

Match 1
Match 2


### Character Classes:
Character classes provide a way to match only one of a specific set of characters.<br>
A character class is created by putting the characters it matches inside **square brackets**.

In [11]:
pattern = r"[aeiou]"
if re.search(pattern, "grey"):
    print("Match 1")
if re.search(pattern, "qwertyuiop"):
    print("Match 2")
if re.search(pattern, "rhythm myths"):
    print("Match 3")


Match 1
Match 2


Character classes can also match **ranges of characters.**
Some examples:
- The class **\[a-z\]** matches any lowercase alphabetic character.
- The class **\[G-P\]** matches any uppercase character from G to P.
- The class **\[0-9\]** matches any digit.
- Multiple ranges can be included in one class. For example, **\[A-Za-z\]** matches a letter of any case.

In [13]:
#CODE HERE: write a pattern that matches two uppercase characters followed by a single digit
pattern = r"[A-Z][A-Z][0-9]"
if re.search(pattern, "LS8"):
    print("Match 1")
if re.search(pattern, "E3"):
    print("Match 2")
if re.search(pattern, "1ab"):
    print("Match 3")  

Match 1


Place a **^** at the start of a character class to invert it.<br>
This causes it to match any character **other than** the ones included.<br>

Other metacharacters, such as $ and ., have *no meaning* within character classes.<br>
The metacharacter ^ has no meaning unless it is the *first character* in a class.

In [4]:
pattern = r"[^A-Z]"
if re.search(pattern, "this is all quiet"):
    print("Match 1")
if re.search(pattern, "AbCdEfG123"):
    print("Match 2")
if re.search(pattern, "THISISALLSHOUTING"):
    print("Match 3")

Match 1
Match 2


In [12]:
pattern = r"^[^A-Z]"
if re.search(pattern, "this is all quiet"):
    print("Match 1")
if re.search(pattern, "AbCdEfG123"):
    print("Match 2")
if re.search(pattern, "THISISALLSHOUTING"):
    print("Match 3")

Match 1
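
The rule that metacharacters lose their meaning inside a class can be tried out as follows. This is a minimal sketch with a made-up test string.

```python
import re

# Inside square brackets, "." and "$" are ordinary characters,
# and "^" is ordinary too when it is not the first character.
literal_class = r"[.$^]"
found = re.findall(literal_class, "a.b$c^d")
print(found)  # ['.', '$', '^']

# At the start of the class, "^" inverts it: match anything except a, b, c or d.
others = re.findall(r"[^abcd]", "a.b$c^d")
print(others)  # ['.', '$', '^']
```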


## *, +, ?, { and }.
These specify **numbers of repetitions**.<br>

The metacharacter **star** (\*) means **"zero or more repetitions of the previous thing"**. It tries to match *as many repetitions as possible*.<br>
The "previous thing" can be a single character, a class, or a group of characters in parentheses.

In [16]:
#CODE HERE: write a pattern that matches the word egg followed by zero or more spam
pattern = r"egg(spam)*"
if re.match(pattern, "egg"):
    print("Match 1")
if re.match(pattern, "eggspamspamegg"):
    print("Match 2")
if re.match(pattern, "spam"):
    print("Match 3")

Match 1
Match 2
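
To see how much text the star actually consumes, the sketch below prints the matched substring instead of just "Match". The second pattern, `a*`, is an extra illustration of the "zero repetitions" case.

```python
import re

# The match takes both "spam" repetitions and stops there; the trailing "egg" is left over.
m = re.match(r"egg(spam)*", "eggspamspamegg")
print(m.group())  # eggspamspam
print(m.span())   # (0, 11)

# Zero repetitions is also a valid match, so "a*" matches the empty string here.
empty = re.match(r"a*", "bbb")
print(repr(empty.group()))  # ''
```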


The metacharacter **+** is very similar to \*, except it means **"one or more repetitions"**.

In [32]:
pattern = r"g+"
if re.match(pattern, "g"):
    print("Match 1")
if re.match(pattern, "gggggggggggggg"):
    print("Match 2")
if re.match(pattern, "abc"):
    print("Match 3")

Match 1
Match 2


The metacharacter **?** means **"zero or one repetitions"**.

In [33]:
#CODE HERE: write a pattern that matches the word ice-cream or icecream
pattern = r"ice(-)?cream"
if re.match(pattern, "ice-cream"):
    print("Match 1")
if re.match(pattern, "icecream"):
    print("Match 2")
if re.match(pattern, "sausages"):
    print("Match 3")
if re.match(pattern, "ice--ice"):
    print("Match 4")    

Match 1
Match 2


**Curly braces { }**  can be used to represent the **number of repetitions** between two numbers.<br>
The regex {x,y} means "between x and y repetitions of something".<br>
Hence {0,1} is the same thing as ?.<br>
If the first number is missing, it is taken to be zero. If the second number is missing, it is taken to be infinity.

In [7]:
#CODE HERE: write a pattern that matches one to three nines at the end of a string
pattern = r"9{1,3}$"
if re.match(pattern, "9"):
    print("Match 1")
if re.match(pattern, "999"):
    print("Match 2")
if re.match(pattern, "9999"):
    print("Match 3")

Match 1
Match 2


In [14]:
pattern = r"9{2,}$"
if re.match(pattern, "9"):
    print("Match 1")
if re.match(pattern, "999"):
    print("Match 2")
if re.match(pattern, "9999"):
    print("Match 3")

Match 2
Match 3
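
The equivalence of {0,1} and ?, and the meaning of a missing first number, can be sketched like this. The example uses `re.fullmatch`, a function from the re module that is not covered above and that requires the whole string to match; the test words are made up.

```python
import re

words = ("color", "colour", "colouur")

# {0,1} and ? accept exactly the same strings.
with_braces = [bool(re.fullmatch(r"colou{0,1}r", w)) for w in words]
with_question = [bool(re.fullmatch(r"colou?r", w)) for w in words]
print(with_braces)    # [True, True, False]
print(with_question)  # [True, True, False]

# A missing first number means zero: {,2} allows 0, 1 or 2 repetitions.
up_to_two = [bool(re.fullmatch(r"9{,2}", s)) for s in ("", "9", "99", "999")]
print(up_to_two)  # [True, True, True, False]
```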


## Groups
A group can be created by surrounding part of a regular expression with **parentheses ()**.<br>
A group is then treated as a single unit, so it can be given as an argument to metacharacters such as \* and ?.

In [8]:
pattern = r"egg(spam)*"
if re.match(pattern, "egg"):
    print("Match 1")
if re.match(pattern, "eggspamspamspamegg"):
    print("Match 2")
if re.match(pattern, "spam"):
    print("Match 3")

Match 1
Match 2


The content of groups in a match can be accessed using the **group method**.<br>
- A call of *group(0) or group()* returns the whole match.
- A call of *group(n)*, where n is greater than 0, returns the nth group from the left.
- The method *groups()* returns all groups, from 1 upward, as a tuple.

In [40]:
pattern = r"a(bc)(de)(f(g)h)i"
match = re.match(pattern, "abcdefghijklmnop")
if match:
    print(match.group())
    print(match.group(0))
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
    print(match.group(4))

    print(match.groups())

abcdefghi
abcdefghi
bc
de
fgh
g
('bc', 'de', 'fgh', 'g')


### Named groups & Non-capturing groups
- Named groups have the format **(?P\<name\>...)**,
where name is the name of the group, and ... is the content.<br>
They behave exactly the same as normal groups, except they can be **accessed by group(name)** in addition to their number.
<br>
- Non-capturing groups have the format **(?:...)**.<br>
They are **not accessible** by the group method, so they can be added to an existing regular expression without breaking the numbering.

In [11]:
pattern = r"(?P<first>abc)(?:def)(ghi)"
match = re.match(pattern, "abcdefghi")
if match:
    #CODE HERE: print "first" group
    print(match.group("first"))
    #CODE HERE: print the first group
    print(match.group(1))
    #CODE HERE: print the second group
    print(match.group(2))
    print(match.groups())

abc
abc
ghi
('abc', 'ghi')
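
When several groups are named, the match object's `groupdict()` method (not covered above, but part of the standard match API) collects them into a dictionary. The sketch below is a variation on the pattern from the previous cell; the group name `last` is only illustrative.

```python
import re

pattern = r"(?P<first>abc)(?:def)(?P<last>ghi)"
match = re.match(pattern, "abcdefghi")
if match:
    # groupdict() maps each group name to the text it matched.
    print(match.groupdict())  # {'first': 'abc', 'last': 'ghi'}
    # The non-capturing group (?:def) takes no number, so "last" is group 2.
    print(match.group(2))     # ghi
```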


### | or metacharacter
The metacharacter **|** means "or": the pattern A|B matches either A or B.

In [15]:
pattern = r"gr(a|e)y"
match = re.match(pattern, "gray")

if match:
    print ("Match 1")
match = re.match(pattern, "grey")
if match:
    print ("Match 2")
match = re.match(pattern, "griy")
if match:
    print ("Match 3")

Match 1
Match 2


# Special Sequences
There are various special sequences you can use in regular expressions. They are written as a **backslash followed by another character**.<br>

One useful special sequence is a backslash and a number between 1 and 99, e.g., \1 or \17.<br>
This matches the *same text that the group with that number matched*.

In [31]:
#CODE HERE: write a pattern that matches the same word twice
pattern = r"(.*) \1"
match = re.match(pattern, "word word")
if match:
    print ("Match 1")
match = re.match(pattern, "?! ?!")
if match:
    print ("Match 2")
match = re.match(pattern, "abc cde")
if match:
    print ("Match 3")

Match 1
Match 2
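
A common use of such a backreference is spotting a word typed twice in a row. The sketch below is one possible way to do it; it relies on \w+ (one or more word characters) and \b (a word boundary), both of which are described later in this section, and the sample sentence is made up.

```python
import re

# \b(\w+) captures one whole word; " \1" then requires that same word again.
doubled = r"\b(\w+) \1\b"

match = re.search(doubled, "this is is a test")
print(match.group())   # is is
print(match.group(1))  # is

# Different words do not satisfy the backreference.
no_match = re.search(doubled, "abc cde")
print(no_match)  # None
```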


More useful special sequences are
- **\d** : match digits = [0-9]
- **\s** : match whitespace = [ \t\n\r\f\v]
- **\w** : match word characters = [a-zA-Z0-9_]

In Unicode mode (the default for str patterns in Python 3) they match certain other characters as well. For instance, \w matches letters with accents.

The upper-case versions of these special sequences, **\D**, **\S**, and **\W**, mean the **opposite** of the lower-case versions. For instance, \D matches anything that isn't a digit.

In [15]:
#CODE HERE: write a pattern that matches one or more non-digits followed by a digit
pattern = r"\D+\d" 
match = re.match(pattern, "Hi 999!") 
if match:
    print("Match 1")
match = re.match(pattern, "1, 23, 456!")
if match:
    print("Match 2")
match = re.match(pattern, " ! $?")
if match:
    print("Match 3")

Match 1
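
These sequences are often combined with re.findall to pull pieces out of a string. The sketch below shows a few such calls; the `re.ASCII` flag is an addition not covered above, used here only to contrast with the default Unicode behaviour.

```python
import re

# \d+ collects each run of digits.
digits = re.findall(r"\d+", "1, 23, 456!")
print(digits)  # ['1', '23', '456']

# \S+ collects each run of non-whitespace characters.
chunks = re.findall(r"\S+", "Hi   999!\n")
print(chunks)  # ['Hi', '999!']

# str patterns are Unicode-aware by default, so \w also matches accented letters.
accented = "caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
print(unicode_words)  # ['café']

# The re.ASCII flag restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", accented, re.ASCII)
print(ascii_words)  # ['caf']
```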


Additional special sequences are
- The sequences **\A** and **\Z** match only at the very beginning and very end of a string, respectively. (By default ^ and $ behave almost the same, except that $ also matches just before a trailing newline; in multiline mode, ^ and $ match at the start and end of every line.)
- The sequence **\b** matches the empty string between \w and \W characters, or \w characters and the beginning or end of the string. Informally, it represents the boundary between words.
- The sequence **\B** matches the empty string anywhere else.
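
The difference between ^ and $ on the one hand and \A and \Z on the other only shows up with multi-line text or a trailing newline. The sketch below illustrates this using the `re.MULTILINE` flag, which is not introduced in this tutorial; the sample text is made up.

```python
import re

text = "first line\nsecond line"

# Without flags, ^ matches only at the very start of the string.
plain = re.findall(r"^\w+", text)
print(plain)  # ['first']

# With re.MULTILINE, ^ also matches right after each newline.
multi = re.findall(r"^\w+", text, re.MULTILINE)
print(multi)  # ['first', 'second']

# \A is unaffected by the flag: it matches only at the start of the string.
anchored = re.findall(r"\A\w+", text, re.MULTILINE)
print(anchored)  # ['first']

# $ also matches just before a trailing newline; \Z does not.
dollar = re.search(r"line$", "second line\n")
big_z = re.search(r"line\Z", "second line\n")
print(bool(dollar), bool(big_z))  # True False
```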

In [17]:
#CODE HERE: write a pattern that matches the word "cat" surrounded by word boundaries
pattern = r"\bcat\b"
match = re.search(pattern, "The cat sat!")
if match:
    print ("Match 1")
    
match = re.search(pattern, "We s>cat<tered?")
if match:
    print ("Match 2")
match = re.search(pattern, "We scattered.")
if match:
    print ("Match 3")
match = re.search(pattern, "cat sat.")
if match:
    print ("Match 4")

Match 1
Match 2
Match 4
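
For comparison, \B does the opposite of \b. In the sketch below, which uses a made-up test string, \Bcat\B only finds "cat" when it is buried inside a longer word.

```python
import re

# \B requires that there is NO word boundary at that position.
inside = re.findall(r"\Bcat\B", "cat scattered concat")
print(inside)  # ['cat']

# The single hit is the "cat" inside "scattered".
match = re.search(r"\Bcat\B", "cat scattered concat")
print(match.span())  # (5, 8)
```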
