# Regular Expression

Credit: [Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex) by Corey Schafer](https://youtu.be/K8L6KVGG-7o)

Regular expressions using Python's built-in **re** module.

In [0]:
import re

The text we will be working with is a multiline string, **text_to_search**:

In [0]:
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):

. ^ $ * + ? { } [ ] \ | ( )

example.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

cat
mat
pat
bat
'''

### Raw string literals

Before getting into regular expressions, we need to know what raw string literals are. Let's see an example:

In [3]:
print('\tTab')

	Tab


In [4]:
print(r'\tTab')

\tTab


**Raw string literals** are string literals marked by an 'r' before the opening quote. In a raw string, escape sequences such as newlines, tabs, backspaces, and form feeds get no special treatment: the backslash stays in the string as an ordinary character. This matters for regular expressions because patterns use backslashes heavily.
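To see why this matters for patterns, here is a minimal sketch comparing the same pattern written with and without the 'r' prefix. The `sample` string is an assumption for illustration. In a normal string, `'\b'` becomes a single backspace character, while in a raw string it stays a backslash followed by `b`, which the re module reads as a word boundary.

```python
import re

sample = 'Ha HaHa'  # illustrative string, mirrors one line of text_to_search

# Normal string: Python turns '\b' into one backspace character
print(len('\b'))    # 1
# Raw string: the backslash and the 'b' are kept as two characters
print(len(r'\b'))   # 2

raw_matches = re.findall(r'\bHa', sample)    # word boundary followed by 'Ha'
plain_matches = re.findall('\bHa', sample)   # backspace character followed by 'Ha'

print(raw_matches)    # ['Ha', 'Ha']
print(plain_matches)  # []
```

Writing every pattern as a raw string avoids this kind of silent mismatch.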

### re.compile()

We are going to use the **compile** method, which lets us separate our pattern into a variable and reuse it for multiple searches.

re.**compile**(pattern, flags=0)

> Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.
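As a small sketch of that reuse, the compiled object below is applied to several strings in turn. The names `phone_pattern` and `lines` are hypothetical and only serve the illustration.

```python
import re

# Compile once; phone_pattern and the sample lines are illustrative names
phone_pattern = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

lines = ['call 321-555-4321 today', 'no number here', 'fax: 800-555-1234']

found = []
for line in lines:
    # The same compiled object is reused for every line
    m = phone_pattern.search(line)
    if m is not None:
        found.append(m.group(0))

print(found)  # ['321-555-4321', '800-555-1234']
```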



### re.finditer()

First, we are going to search for a simple string literal, **abc**.

We are going to use **finditer**, which returns us an iterator of all the matches.

re.**finditer**(pattern, string, flags=0)

> Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result. See also the note about findall().

In [5]:
pattern = re.compile(r'abc')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 4), match='abc'>


### string slicing

Here we see exactly one match, spanning index 1 up to (but not including) index 4. We can use **string slicing** with that span to retrieve the matched text.

In [6]:
text_to_search[1:4]

'abc'
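Instead of copying the indices by hand, the match object can supply them. The sketch below uses a short stand-in string (`text`, an assumption for illustration) and the match object's `span()`, `start()`, `end()`, and `group()` methods.

```python
import re

text = '\nabcdef'  # illustrative stand-in for the start of text_to_search

match = re.compile(r'abc').search(text)

start, end = match.span()            # the same pair shown in the printed match
print(match.start(), match.end())    # 1 4
print(text[start:end])               # abc
print(match.group(0))                # abc, the matched text itself
```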

Now if we just search for **cba**, then we will not have any match.

In [7]:
pattern = re.compile(r'cba')

matches = pattern.finditer(text_to_search)

len(list(matches))

0

If we search for **.**, we see that it matches every character except a newline.

In [8]:
pattern = re.compile(r'.')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='b'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='d'>
<_sre.SRE_Match object; span=(5, 6), match='e'>
<_sre.SRE_Match object; span=(6, 7), match='f'>
<_sre.SRE_Match object; span=(7, 8), match='g'>
<_sre.SRE_Match object; span=(8, 9), match='h'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='j'>
<_sre.SRE_Match object; span=(11, 12), match='k'>
<_sre.SRE_Match object; span=(12, 13), match='l'>
<_sre.SRE_Match object; span=(13, 14), match='m'>
<_sre.SRE_Match object; span=(14, 15), match='n'>
<_sre.SRE_Match object; span=(15, 16), match='o'>
<_sre.SRE_Match object; span=(16, 17), match='p'>
<_sre.SRE_Match object; span=(17, 18), match='q'>
<_sre.SRE_Match object; span=(18, 19), match='u'>
<_sre.SRE_Match object; span=(19, 20), match='r'>
<_sre.SRE_Match object; span=(20, 21), match='t'>
...

So if we instead want to search for a literal **.**, we need to escape it and pass **\\.**

In [9]:
pattern = re.compile(r'\.')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(114, 115), match='.'>
<_sre.SRE_Match object; span=(150, 151), match='.'>
<_sre.SRE_Match object; span=(172, 173), match='.'>
<_sre.SRE_Match object; span=(176, 177), match='.'>
<_sre.SRE_Match object; span=(224, 225), match='.'>
<_sre.SRE_Match object; span=(255, 256), match='.'>
<_sre.SRE_Match object; span=(268, 269), match='.'>


One practical example would be a URL:

In [10]:
pattern = re.compile(r'example\.com')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(143, 154), match='example.com'>


### MetaCharacters

Here are a few regular expression **MetaCharacters**:


```
.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)

\b      - Word Boundary
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group
```
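The character classes in the table can be compared side by side on one short string. This is only a sketch: the `sample` string and the `results` dictionary are assumptions for illustration, and `re.findall()` is used here simply to collect the matched text into lists.

```python
import re

sample = 'a_1 !\n'  # illustrative string, not part of text_to_search

results = {
    '.':   re.findall(r'.', sample),    # any character except newline
    r'\d': re.findall(r'\d', sample),   # digits
    r'\D': re.findall(r'\D', sample),   # everything that is not a digit
    r'\w': re.findall(r'\w', sample),   # word characters
    r'\W': re.findall(r'\W', sample),   # everything that is not a word character
    r'\s': re.findall(r'\s', sample),   # whitespace
    r'\S': re.findall(r'\S', sample),   # everything that is not whitespace
}

for name, found in results.items():
    print(name, found)
```

Each upper-case class returns exactly the characters its lower-case counterpart leaves out.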



#### \d

**\d** gives us all the matches that are digits.

In [11]:
pattern = re.compile(r'\d')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(55, 56), match='1'>
<_sre.SRE_Match object; span=(56, 57), match='2'>
<_sre.SRE_Match object; span=(57, 58), match='3'>
<_sre.SRE_Match object; span=(58, 59), match='4'>
<_sre.SRE_Match object; span=(59, 60), match='5'>
<_sre.SRE_Match object; span=(60, 61), match='6'>
<_sre.SRE_Match object; span=(61, 62), match='7'>
<_sre.SRE_Match object; span=(62, 63), match='8'>
<_sre.SRE_Match object; span=(63, 64), match='9'>
<_sre.SRE_Match object; span=(64, 65), match='0'>
<_sre.SRE_Match object; span=(156, 157), match='3'>
<_sre.SRE_Match object; span=(157, 158), match='2'>
<_sre.SRE_Match object; span=(158, 159), match='1'>
<_sre.SRE_Match object; span=(160, 161), match='5'>
<_sre.SRE_Match object; span=(161, 162), match='5'>
<_sre.SRE_Match object; span=(162, 163), match='5'>
<_sre.SRE_Match object; span=(164, 165), match='4'>
<_sre.SRE_Match object; span=(165, 166), match='3'>
<_sre.SRE_Match object; span=(166, 167), match='2'>
...

#### \D

**\D** gives us all the matches that are not digits.

In [12]:
pattern = re.compile(r'\D')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='b'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='d'>
<_sre.SRE_Match object; span=(5, 6), match='e'>
<_sre.SRE_Match object; span=(6, 7), match='f'>
<_sre.SRE_Match object; span=(7, 8), match='g'>
<_sre.SRE_Match object; span=(8, 9), match='h'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='j'>
<_sre.SRE_Match object; span=(11, 12), match='k'>
<_sre.SRE_Match object; span=(12, 13), match='l'>
<_sre.SRE_Match object; span=(13, 14), match='m'>
<_sre.SRE_Match object; span=(14, 15), match='n'>
<_sre.SRE_Match object; span=(15, 16), match='o'>
<_sre.SRE_Match object; span=(16, 17), match='p'>
<_sre.SRE_Match object; span=(17, 18), match='q'>
...

#### \w

**\w** gives us all the matches that are word characters (a-z, A-Z, 0-9, _).

In [13]:
pattern = re.compile(r'\w')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='b'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='d'>
<_sre.SRE_Match object; span=(5, 6), match='e'>
<_sre.SRE_Match object; span=(6, 7), match='f'>
<_sre.SRE_Match object; span=(7, 8), match='g'>
<_sre.SRE_Match object; span=(8, 9), match='h'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='j'>
<_sre.SRE_Match object; span=(11, 12), match='k'>
<_sre.SRE_Match object; span=(12, 13), match='l'>
<_sre.SRE_Match object; span=(13, 14), match='m'>
<_sre.SRE_Match object; span=(14, 15), match='n'>
<_sre.SRE_Match object; span=(15, 16), match='o'>
<_sre.SRE_Match object; span=(16, 17), match='p'>
<_sre.SRE_Match object; span=(17, 18), match='q'>
<_sre.SRE_Match object; span=(18, 19), match='u'>
<_sre.SRE_Match object; span=(19, 20), match='r'>
<_sre.SRE_Match object; span=(20, 21), match='t'>
...

#### \W

**\W** gives us all the matches that are not word characters.

In [14]:
pattern = re.compile(r'\W')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(27, 28), match='\n'>
<_sre.SRE_Match object; span=(54, 55), match='\n'>
<_sre.SRE_Match object; span=(65, 66), match='\n'>
<_sre.SRE_Match object; span=(66, 67), match='\n'>
<_sre.SRE_Match object; span=(69, 70), match=' '>
<_sre.SRE_Match object; span=(74, 75), match='\n'>
<_sre.SRE_Match object; span=(75, 76), match='\n'>
<_sre.SRE_Match object; span=(90, 91), match=' '>
<_sre.SRE_Match object; span=(91, 92), match='('>
<_sre.SRE_Match object; span=(96, 97), match=' '>
<_sre.SRE_Match object; span=(99, 100), match=' '>
<_sre.SRE_Match object; span=(102, 103), match=' '>
<_sre.SRE_Match object; span=(110, 111), match=')'>
<_sre.SRE_Match object; span=(111, 112), match=':'>
<_sre.SRE_Match object; span=(112, 113), match='\n'>
<_sre.SRE_Match object; span=(113, 114), match='\n'>
<_sre.SRE_Match object; span=(114, 115), match='.'>
<_sre.SRE_Match object; span=(115, 116), match=' '>
...

#### \s

**\s** gives us all the matches that are whitespace (space, tab, newline).

In [15]:
pattern = re.compile(r'\s')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(27, 28), match='\n'>
<_sre.SRE_Match object; span=(54, 55), match='\n'>
<_sre.SRE_Match object; span=(65, 66), match='\n'>
<_sre.SRE_Match object; span=(66, 67), match='\n'>
<_sre.SRE_Match object; span=(69, 70), match=' '>
<_sre.SRE_Match object; span=(74, 75), match='\n'>
<_sre.SRE_Match object; span=(75, 76), match='\n'>
<_sre.SRE_Match object; span=(90, 91), match=' '>
<_sre.SRE_Match object; span=(96, 97), match=' '>
<_sre.SRE_Match object; span=(99, 100), match=' '>
<_sre.SRE_Match object; span=(102, 103), match=' '>
<_sre.SRE_Match object; span=(112, 113), match='\n'>
<_sre.SRE_Match object; span=(113, 114), match='\n'>
<_sre.SRE_Match object; span=(115, 116), match=' '>
<_sre.SRE_Match object; span=(117, 118), match=' '>
<_sre.SRE_Match object; span=(119, 120), match=' '>
<_sre.SRE_Match object; span=(121, 122), match=' '>
<_sre.SRE_Match object; span=(123, 124), match=' '>
...

#### \S

**\S** gives us all the matches that are not whitespace (space, tab, newline).

In [16]:
pattern = re.compile(r'\S')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='b'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='d'>
<_sre.SRE_Match object; span=(5, 6), match='e'>
<_sre.SRE_Match object; span=(6, 7), match='f'>
<_sre.SRE_Match object; span=(7, 8), match='g'>
<_sre.SRE_Match object; span=(8, 9), match='h'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='j'>
<_sre.SRE_Match object; span=(11, 12), match='k'>
<_sre.SRE_Match object; span=(12, 13), match='l'>
<_sre.SRE_Match object; span=(13, 14), match='m'>
<_sre.SRE_Match object; span=(14, 15), match='n'>
<_sre.SRE_Match object; span=(15, 16), match='o'>
<_sre.SRE_Match object; span=(16, 17), match='p'>
<_sre.SRE_Match object; span=(17, 18), match='q'>
<_sre.SRE_Match object; span=(18, 19), match='u'>
<_sre.SRE_Match object; span=(19, 20), match='r'>
<_sre.SRE_Match object; span=(20, 21), match='t'>
...

#### \b

**\b** matches at a word boundary. Here we look for **Ha** preceded by a word boundary.

In [17]:
pattern = re.compile(r'\bHa')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(67, 69), match='Ha'>
<_sre.SRE_Match object; span=(70, 72), match='Ha'>


The two matches are:

...
1234567890

**Ha** **Ha**Ha

MetaCharacters
...

#### \B

Similarly, **\B** matches where there is no word boundary. Here we look for **Ha** that is not preceded by a word boundary.

In [18]:
pattern = re.compile(r'\BHa')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(72, 74), match='Ha'>


The match is:

...
1234567890

Ha Ha**Ha**

MetaCharacters
...

We can also do something like this:

In [19]:
pattern = re.compile(r'\bHa\b')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(67, 69), match='Ha'>


Only one **Ha** has a word boundary on both sides:

...
1234567890

**Ha** HaHa

MetaCharacters
...

#### ^ and $

To see how **^** and **$** work, let's take a string.

In [0]:
sentence = 'Start a sentence and then bring it to an end'

Now if we want to check whether a string **begins with something**, we can use **^**, and if we want to see if a string **ends with something** we can use **$**.

In [21]:
pattern = re.compile(r'^Start')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


In [22]:
pattern = re.compile(r'end$')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(41, 44), match='end'>


Let us check whether **and** is at the beginning or the end of the string. It is at neither, so both cells below print nothing.

In [0]:
pattern = re.compile(r'^and')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)

In [0]:
pattern = re.compile(r'and$')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)
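Because a loop over zero matches prints nothing, it can help to count the matches explicitly. The helper `count_matches` below is a hypothetical name introduced for this sketch. It wraps the same `finditer()` call used above.

```python
import re

sentence = 'Start a sentence and then bring it to an end'

def count_matches(regex, text):
    """Return how many non-overlapping matches regex has in text."""
    return len(list(re.compile(regex).finditer(text)))

print(count_matches(r'^and', sentence))  # 0, 'and' is not at the beginning
print(count_matches(r'and$', sentence))  # 0, 'and' is not at the end
print(count_matches(r'and', sentence))   # 1, 'and' does occur in the middle
```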

Now let's look at some practical examples, starting with a phone number search.

Let's try matching any character between the groups of digits.

In [25]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(156, 168), match='321-555-4321'>
<_sre.SRE_Match object; span=(169, 181), match='123.555.1234'>
<_sre.SRE_Match object; span=(182, 194), match='123*555*1234'>
<_sre.SRE_Match object; span=(195, 207), match='800-555-1234'>
<_sre.SRE_Match object; span=(208, 220), match='900-555-1234'>


Now, **123\*555\*1234** is not a valid phone number. Valid numbers only have **.** and **-** in between.

#### character set - []

In [26]:
pattern = re.compile(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(156, 168), match='321-555-4321'>
<_sre.SRE_Match object; span=(169, 181), match='123.555.1234'>
<_sre.SRE_Match object; span=(195, 207), match='800-555-1234'>
<_sre.SRE_Match object; span=(208, 220), match='900-555-1234'>


Here, **[.-]** is a character set, written with square brackets **[]**. We can list many characters inside it, but the set matches exactly one character from that list.
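The difference between **.** and **[.-]** can be checked on a shorter string. In this sketch the `numbers` string is an assumption for illustration. The last line shows that a character set consumes a single character, so two separators in a row do not match.

```python
import re

numbers = '321-555-4321 123.555.1234 123*555*1234'  # illustrative sample

# '.' accepts any separator, including '*'
any_separator = re.findall(r'\d\d\d.\d\d\d.\d\d\d\d', numbers)
# '[.-]' accepts only a literal dot or a hyphen
dot_or_dash = re.findall(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d', numbers)

print(any_separator)  # ['321-555-4321', '123.555.1234', '123*555*1234']
print(dot_or_dash)    # ['321-555-4321', '123.555.1234']

# The set matches exactly one character, so '.-' between digits fails
two_separators = re.findall(r'\d[.-]\d', '1.-2')
print(two_separators)  # []
```

Inside the brackets the dot needs no backslash, and the hyphen is taken literally because it sits at the end of the set.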

Now, if we want to match only 800 or 900 numbers, we can do something like this.

In [27]:
pattern = re.compile(r'[89]00[.-]\d\d\d[.-]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(195, 207), match='800-555-1234'>
<_sre.SRE_Match object; span=(208, 220), match='900-555-1234'>


Within a character set, **-** can also be used to specify a range.

If we want to match only the digits 1 to 5, we can write:

In [28]:
pattern = re.compile(r'[1-5]')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(55, 56), match='1'>
<_sre.SRE_Match object; span=(56, 57), match='2'>
<_sre.SRE_Match object; span=(57, 58), match='3'>
<_sre.SRE_Match object; span=(58, 59), match='4'>
<_sre.SRE_Match object; span=(59, 60), match='5'>
<_sre.SRE_Match object; span=(156, 157), match='3'>
<_sre.SRE_Match object; span=(157, 158), match='2'>
<_sre.SRE_Match object; span=(158, 159), match='1'>
<_sre.SRE_Match object; span=(160, 161), match='5'>
<_sre.SRE_Match object; span=(161, 162), match='5'>
<_sre.SRE_Match object; span=(162, 163), match='5'>
<_sre.SRE_Match object; span=(164, 165), match='4'>
<_sre.SRE_Match object; span=(165, 166), match='3'>
<_sre.SRE_Match object; span=(166, 167), match='2'>
<_sre.SRE_Match object; span=(167, 168), match='1'>
<_sre.SRE_Match object; span=(169, 170), match='1'>
<_sre.SRE_Match object; span=(170, 171), match='2'>
<_sre.SRE_Match object; span=(171, 172), match='3'>
<_sre.SRE_Match object; span=(173, 174), match='5'>
...

If we want to match only the lowercase letters a to p, we can write:

In [29]:
pattern = re.compile(r'[a-p]')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='b'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='d'>
<_sre.SRE_Match object; span=(5, 6), match='e'>
<_sre.SRE_Match object; span=(6, 7), match='f'>
<_sre.SRE_Match object; span=(7, 8), match='g'>
<_sre.SRE_Match object; span=(8, 9), match='h'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='j'>
<_sre.SRE_Match object; span=(11, 12), match='k'>
<_sre.SRE_Match object; span=(12, 13), match='l'>
<_sre.SRE_Match object; span=(13, 14), match='m'>
<_sre.SRE_Match object; span=(14, 15), match='n'>
<_sre.SRE_Match object; span=(15, 16), match='o'>
<_sre.SRE_Match object; span=(16, 17), match='p'>
<_sre.SRE_Match object; span=(68, 69), match='a'>
<_sre.SRE_Match object; span=(71, 72), match='a'>
<_sre.SRE_Match object; span=(73, 74), match='a'>
<_sre.SRE_Match object; span=(77, 78), match='e'>
...

#### [] with ^

Now if we want to match anything that is not a letter, that is, anything outside **a to z** and **A to Z**, we can put **^** at the start of the character set to negate it.

In [30]:
pattern = re.compile(r'[^a-zA-Z]')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(27, 28), match='\n'>
<_sre.SRE_Match object; span=(54, 55), match='\n'>
<_sre.SRE_Match object; span=(55, 56), match='1'>
<_sre.SRE_Match object; span=(56, 57), match='2'>
<_sre.SRE_Match object; span=(57, 58), match='3'>
<_sre.SRE_Match object; span=(58, 59), match='4'>
<_sre.SRE_Match object; span=(59, 60), match='5'>
<_sre.SRE_Match object; span=(60, 61), match='6'>
<_sre.SRE_Match object; span=(61, 62), match='7'>
<_sre.SRE_Match object; span=(62, 63), match='8'>
<_sre.SRE_Match object; span=(63, 64), match='9'>
<_sre.SRE_Match object; span=(64, 65), match='0'>
<_sre.SRE_Match object; span=(65, 66), match='\n'>
<_sre.SRE_Match object; span=(66, 67), match='\n'>
<_sre.SRE_Match object; span=(69, 70), match=' '>
<_sre.SRE_Match object; span=(74, 75), match='\n'>
<_sre.SRE_Match object; span=(75, 76), match='\n'>
<_sre.SRE_Match object; span=(90, 91), match=' '>
...

Now suppose we want to match three-letter words that end in **at** but do not begin with **b**. 
So we want something like this:

...
**cat**

**mat**

**pat**

bat
...

We can do something like this:

In [31]:
pattern = re.compile(r'[^b]at')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(273, 276), match='cat'>
<_sre.SRE_Match object; span=(277, 280), match='mat'>
<_sre.SRE_Match object; span=(281, 284), match='pat'>


### Quantifiers

```
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)
```
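Each quantifier in the table can be tried on one short string. This is a minimal sketch: the `sample` string and the `quantified` dictionary are assumptions for illustration. In every pattern the quantifier applies only to the **a** directly before it.

```python
import re

sample = 'ct cat caat caaat'  # illustrative words with 0, 1, 2 and 3 'a's

quantified = {
    'ca*t':     re.findall(r'ca*t', sample),      # zero or more 'a'
    'ca+t':     re.findall(r'ca+t', sample),      # one or more 'a'
    'ca?t':     re.findall(r'ca?t', sample),      # zero or one 'a'
    'ca{2}t':   re.findall(r'ca{2}t', sample),    # exactly two 'a'
    'ca{2,3}t': re.findall(r'ca{2,3}t', sample),  # two to three 'a'
}

for regex, found in quantified.items():
    print(regex, found)
```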

#### regex on Phone Numbers

For phone numbers, we can use the exact-number quantifier.

#### {}

In [32]:
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(156, 168), match='321-555-4321'>
<_sre.SRE_Match object; span=(169, 181), match='123.555.1234'>
<_sre.SRE_Match object; span=(195, 207), match='800-555-1234'>
<_sre.SRE_Match object; span=(208, 220), match='900-555-1234'>


Now, let's try to match the names starting with Mr near the bottom of text_to_search.

#### \*, + and ?

In [33]:
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(222, 233), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(234, 242), match='Mr Smith'>
<_sre.SRE_Match object; span=(266, 271), match='Mr. T'>


Here, **?** after **\\.** is a quantifier specifying that we want zero or one **.** after **Mr**.

Also, **\*** after **\\w** is a quantifier specifying that we want zero or more word characters after the initial capital letter of the name.

Now to match all the names we can do something like this:

In [34]:
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(222, 233), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(234, 242), match='Mr Smith'>
<_sre.SRE_Match object; span=(243, 251), match='Ms Davis'>
<_sre.SRE_Match object; span=(252, 265), match='Mrs. Robinson'>
<_sre.SRE_Match object; span=(266, 271), match='Mr. T'>


Here **(r|s|rs)** denotes a group. It allows us to match one of several alternative patterns. Here **|** represents **or**, so the pattern says: **M** followed by either **r**, **s**, or **rs**.

Now let's try to apply everything we have learned so far by matching emails.

In [0]:
emails = '''
ExampleMail@gmail.com
example.mail@university.edu
example-321-mail@my-work.net
'''

Let's try to match the first email. It seems simple: we just need to match letters before **@**, then letters again up to the **.**, followed by **com**.

In [36]:
pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')

matches = pattern.finditer(emails)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 22), match='ExampleMail@gmail.com'>


Here **+** after each **[a-zA-Z]** is a quantifier specifying that we need one or more letters.

To match the second one, we have to allow **.** among the letters before **@** and allow **edu** at the very end.

In [37]:
pattern = re.compile(r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)')

matches = pattern.finditer(emails)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 22), match='ExampleMail@gmail.com'>
<_sre.SRE_Match object; span=(23, 50), match='example.mail@university.edu'>


To match the last one, we need to allow **-** and **digits** along with letters and periods before **@**. We also need to allow **-** between **@** and **.**, and lastly we need to allow **net** at the end.

In [38]:
pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')

matches = pattern.finditer(emails)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 22), match='ExampleMail@gmail.com'>
<_sre.SRE_Match object; span=(23, 50), match='example.mail@university.edu'>
<_sre.SRE_Match object; span=(51, 79), match='example-321-mail@my-work.net'>
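The pattern above finds emails inside a larger text. As a rough sketch, the same pattern could also check whether a whole string looks like one of these emails by using `fullmatch()`, which requires the entire string to match. The `candidates` list and the switch to a non-capturing group `(?:...)` are assumptions made for this illustration, and the pattern is far from a complete email validator.

```python
import re

# Same character sets as above; (?:...) groups the alternatives without capturing
email_pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(?:com|edu|net)')

# Illustrative inputs, not taken from the notebook
candidates = [
    'ExampleMail@gmail.com',
    'example-321-mail@my-work.net',
    'not an email',
    'user@site.org',
]

# fullmatch() succeeds only when the pattern covers the whole string
accepted = [c for c in candidates if email_pattern.fullmatch(c)]
print(accepted)  # ['ExampleMail@gmail.com', 'example-321-mail@my-work.net']
```

The last candidate is rejected only because **org** is not among the listed endings, which shows how narrow the pattern still is.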


#### regex on url

Now, for another practical example, let's try to get useful information out of URLs.

In [0]:
urls = '''
https://www.google.com
http://example.com
https://youtube.com
https://www.nasa.gov
'''

What we need is just the domain name (**google**, **youtube**, etc.) and the top-level domain (**.com**, **.net**, **.edu**, etc.).

In [40]:
pattern = re.compile(r'https?://(www\.)?\w+\.\w+')

matches = pattern.finditer(urls)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 23), match='https://www.google.com'>
<_sre.SRE_Match object; span=(24, 42), match='http://example.com'>
<_sre.SRE_Match object; span=(43, 62), match='https://youtube.com'>
<_sre.SRE_Match object; span=(63, 83), match='https://www.nasa.gov'>


But this is not what we want. Here, we can use groups to capture the domain name and the top-level domain separately.

In [41]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

matches = pattern.finditer(urls)

for match in matches:
  print(match)

<_sre.SRE_Match object; span=(1, 23), match='https://www.google.com'>
<_sre.SRE_Match object; span=(24, 42), match='http://example.com'>
<_sre.SRE_Match object; span=(43, 62), match='https://youtube.com'>
<_sre.SRE_Match object; span=(63, 83), match='https://www.nasa.gov'>


Here **(www\.)** is Group 1, **(\w+)** is Group 2, and **(\.\w+)** is Group 3. There is also a Group 0, which is the entire match.

To retrieve a group, we pass its index to **group()**. For example, to print Group 0, we can do something like this:

In [42]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

matches = pattern.finditer(urls)

for match in matches:
  print(match.group(0))

https://www.google.com
http://example.com
https://youtube.com
https://www.nasa.gov


For Group 1:

In [43]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

matches = pattern.finditer(urls)

for match in matches:
  print(match.group(1))

www.
None
None
www.


For Group 2:

In [44]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

matches = pattern.finditer(urls)

for match in matches:
  print(match.group(2))

google
example
youtube
nasa


For Group 3:

In [45]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

matches = pattern.finditer(urls)

for match in matches:
  print(match.group(3))

.com
.com
.com
.gov
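The separate cells above can be condensed. As a sketch, `groups()` returns all numbered groups at once, and named groups written as `(?P<name>...)` let us refer to a group by name instead of by index. The names `domain` and `tld` are hypothetical labels chosen for this example.

```python
import re

# Same pattern as above, with hypothetical names for groups 2 and 3
url_pattern = re.compile(r'https?://(www\.)?(?P<domain>\w+)(?P<tld>\.\w+)')

with_www = url_pattern.search('https://www.nasa.gov')
without_www = url_pattern.search('http://example.com')

print(with_www.groups())         # ('www.', 'nasa', '.gov')
print(without_www.groups())      # (None, 'example', '.com')
print(with_www.group('domain'))  # nasa
print(with_www.group(2, 3))      # ('nasa', '.gov')
```

An optional group that did not take part in the match is reported as None, which is why Group 1 printed None for two of the URLs.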


Now we want just the domain name (Group 2) and the top-level domain (Group 3).

We can use a backreference, which refers to a group by its number. The **sub()** method accepts backreferences in its replacement string:

In [0]:
subbed_urls = pattern.sub(r'\2\3', urls)

In [47]:
print(subbed_urls)


google.com
example.com
youtube.com
nasa.gov
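Two variations on the substitution are sketched below on a shortened URL list (`short_urls` is an assumption for illustration). The `\g<2>` form is an alternative spelling of the backreference `\2`, and `sub()` also accepts a function that builds the replacement from each match. The function name `upper_domain` is hypothetical.

```python
import re

url_pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
short_urls = 'https://www.google.com\nhttp://example.com\n'  # illustrative subset

# Numbered backreferences, as in the cell above
by_number = url_pattern.sub(r'\2\3', short_urls)
# Equivalent \g<number> spelling
by_g_syntax = url_pattern.sub(r'\g<2>\g<3>', short_urls)

def upper_domain(match):
    """Build the replacement text from the groups of one match."""
    return match.group(2).upper() + match.group(3)

by_function = url_pattern.sub(upper_domain, short_urls)

print(by_number)    # google.com and example.com on separate lines
print(by_function)  # GOOGLE.com and EXAMPLE.com on separate lines
```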



### re.findall()

re.**findall**(pattern, string, flags=0)

> Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.



So if we try our name example again.

In [48]:
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')

matches = pattern.findall(text_to_search)

for match in matches:
  print(match)

r
r
s
rs
r


We can see that it returns only the text captured by the group **(r|s|rs)**.

If the pattern has no group, it returns the whole match.

If we go back to our phone numbers example, we can see:

In [49]:
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')

matches = pattern.findall(text_to_search)

for match in matches:
  print(match)

321-555-4321
123.555.1234
800-555-1234
900-555-1234
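To get the full names back from `findall()` while keeping the alternation, one option is a non-capturing group, written `(?:...)`. The sketch below compares both forms on a shortened copy of the names (`names` is an assumption for illustration).

```python
import re

names = 'Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T\n'  # illustrative copy

# Capturing group: findall() returns only what the group captured
captured = re.findall(r'M(r|s|rs)\.?\s[A-Z]\w*', names)
# Non-capturing group: findall() returns the whole match
whole = re.findall(r'M(?:r|s|rs)\.?\s[A-Z]\w*', names)

print(captured)  # ['r', 'r', 's', 'rs', 'r']
print(whole)     # ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
```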


### re.match()

re.**match**(pattern, string, flags=0)

> If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

> Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

NOTE: match() does not return an iterable. It returns only the first match, and only if that match starts at the beginning of the string. It's like anchoring the pattern with **^**.

In [50]:
pattern = re.compile(r'Start')

matches = pattern.match(sentence)

print(matches)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


In [51]:
pattern = re.compile(r'sentence')

matches = pattern.match(sentence)

print(matches)

None


### re.search()

re.**search**(pattern, string, flags=0)

> Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

Since match() can only find a pattern at the beginning of the string, we use **search()** to find a pattern that appears anywhere in it. Like match(), search() returns only the first match.

In [52]:
pattern = re.compile(r'sentence')

matches = pattern.search(sentence)

print(matches)

<_sre.SRE_Match object; span=(8, 16), match='sentence'>
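The methods can be compared side by side. The sketch below also includes `fullmatch()`, which is not covered above and succeeds only when the pattern matches the entire string. The `word` and `first` variable names are assumptions for illustration.

```python
import re

sentence = 'Start a sentence and then bring it to an end'
word = re.compile(r'sentence')
first = re.compile(r'Start')

print(word.match(sentence))       # None, 'sentence' is not at the beginning
print(word.search(sentence))      # a match object with span (8, 16)
print(first.match(sentence))      # a match object with span (0, 5)
print(first.fullmatch(sentence))  # None, 'Start' does not cover the whole string

# match() and search() return None on failure, so test before using the result
result = word.search(sentence)
if result is not None:
    print(result.span())  # (8, 16)
```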


### flags

Let's say we want to match a word where each letter can be uppercase, lowercase, or a mixture of both. To search for **Start** this way, we would normally have to write:

In [53]:
pattern = re.compile(r'[Ss][Tt][Aa][Rr][Tt]')

matches = pattern.search(sentence)

print(matches)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


Alternatively, we can use the **IGNORECASE** flag:

In [54]:
pattern = re.compile(r'start', re.IGNORECASE)

matches = pattern.search(sentence)

print(matches)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


There is also a shorthand for this:

In [55]:
pattern = re.compile(r'start', re.I)

matches = pattern.search(sentence)

print(matches)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


There are several other flags as well:

re.**DEBUG**

> Display debug information about compiled expression.

re.**I**
re.**IGNORECASE**

> Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale. To get this effect on non-ASCII Unicode characters such as ü and Ü, add the UNICODE flag.

re.**L**
re.**LOCALE**

> Make \w, \W, \b, \B, \s and \S dependent on the current locale.

re.**M**
re.**MULTILINE**

> When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.**S**
re.**DOTALL**

> Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

re.**U**
re.**UNICODE**

> Make the \w, \W, \b, \B, \d, \D, \s and \S sequences dependent on the Unicode character properties database. Also enables non-ASCII matching for IGNORECASE.

re.**X**
re.**VERBOSE**

> This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
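A few of these flags are sketched below on small inputs. The `two_lines` string and the variable names are assumptions for illustration. The verbose pattern is the phone number pattern from earlier, spread over several lines with comments.

```python
import re

two_lines = 'first line\nsecond line\n'  # illustrative two-line string

# Without MULTILINE, '^' matches only at the very start of the string
start_default = re.findall(r'^\w+', two_lines)
start_multiline = re.findall(r'^\w+', two_lines, re.MULTILINE)
print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']

# Without DOTALL, '.' does not match the newline after 'line'
dot_default = re.findall(r'line.', two_lines)
dot_all = re.findall(r'line.', two_lines, re.DOTALL)
print(dot_default)  # []
print(dot_all)      # ['line\n', 'line\n']

# VERBOSE ignores the whitespace and comments inside the pattern
phone = re.compile(r'''
    \d{3}   # first three digits
    [.-]    # separator: dot or hyphen
    \d{3}   # next three digits
    [.-]    # separator: dot or hyphen
    \d{4}   # last four digits
''', re.VERBOSE)
phones = phone.findall('321-555-4321 123*555*1234')
print(phones)  # ['321-555-4321']
```

Several flags can be combined with the **|** operator, for example `re.IGNORECASE | re.MULTILINE`.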

These are a few basic steps for using regular expressions in practice.