## Regular Expression

Regular expressions using Python's built-in re module.

In [1]:
import re

The text we will be working with is a multiline string:

In [2]:
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):

. ^ $ * + ? { } [ ] \ | ( )

example.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

cat
mat
pat
bat
'''

### Raw string literals:
A Python raw string is a normal string literal prefixed with r or R. This makes the backslash (\) a literal character instead of the start of an escape sequence.

In [3]:
print('\tTab Boom')

	Tab Boom


In [5]:
print(r'\tTab')

\tTab


So raw string literals are string literals marked by an r before the opening quote. In a raw string, escape sequences such as newlines, tabs, backspaces, and form feeds get no special treatment. This matters for regular expressions because patterns such as \d and \b rely on the backslash reaching the re module intact.
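
As a small sketch of why this matters for patterns, the snippet below compares the same pattern written with and without the r prefix. The variable names and the sample text are made up for illustration.

```python
import re

# In a normal string, \b is the backspace character (a single character).
normal = '\bHa'
# In a raw string the backslash survives, so re sees \b (word boundary).
raw = r'\bHa'

print(len(normal))  # 3: backspace, 'H', 'a'
print(len(raw))     # 4: backslash, 'b', 'H', 'a'

# The normal string looks for a literal backspace, which the text lacks.
normal_hit = re.search(normal, 'Ha HaHa')
# The raw string looks for 'Ha' at a word boundary.
raw_hit = re.search(raw, 'Ha HaHa')

print(normal_hit)  # None
print(raw_hit)     # <re.Match object; span=(0, 2), match='Ha'>
```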

### re.compile()
We are going to use the compile method, which lets us separate our pattern into a variable and reuse it for multiple searches.

#### Syntax 
re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.
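
A minimal sketch of the reuse that compile allows; the pattern and the sample strings here are illustrative and not part of the notebook's text.

```python
import re

# Compile once ...
digit_run = re.compile(r'\d+')

# ... then reuse the same pattern object on different strings.
first = digit_run.search('order 66 shipped')
second = digit_run.findall('rooms 12, 14 and 101')

print(first.group())  # 66
print(second)         # ['12', '14', '101']

# The module-level function gives the same result without an explicit compile.
same = re.findall(r'\d+', 'rooms 12, 14 and 101')
print(same == second)  # True
```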

### re.finditer()
- Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. 
- The string is scanned left-to-right, and matches are returned in the order found.
- Empty matches are included in the result.

#### Syntax
re.finditer(pattern, string, flags=0)

In [9]:
pattern = re.compile(r'abc')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(1, 4), match='abc'>
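
Each match object carries the matched text and its position. The sketch below, on a short made-up string, shows that span() returns indices that can be used to slice the original text.

```python
import re

text = 'xyz abc abc'
pattern = re.compile(r'abc')

for match in pattern.finditer(text):
    start, end = match.span()
    # group() is the matched text; slicing with the span gives the same thing.
    print(start, end, match.group(), text[start:end])

spans = [m.span() for m in pattern.finditer(text)]
print(spans)  # [(4, 7), (8, 11)]
```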


In [12]:
pattern = re.compile(r'cat')

In [13]:
matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(273, 276), match='cat'>


In [14]:
print(pattern)

re.compile('cat')


Let's try a pattern that is not present in text_to_search:

In [23]:
pattern = re.compile(r"cba")

matches = pattern.finditer(text_to_search)

len(list(matches))

0

Let's try to match . on its own. It matches any character except a newline, so nearly everything in the text matches:

In [24]:
pattern = re.compile(r'.')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
... (output truncated)

What if we need to search for a literal . itself? Then we escape it with a backslash: \.

In [26]:
pattern = re.compile(r'\.')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(114, 115), match='.'>
<re.Match object; span=(150, 151), match='.'>
<re.Match object; span=(172, 173), match='.'>
<re.Match object; span=(176, 177), match='.'>
<re.Match object; span=(224, 225), match='.'>
<re.Match object; span=(255, 256), match='.'>
<re.Match object; span=(268, 269), match='.'>


It worked: only the literal dots are matched.

## MetaCharacters
A few regular expression metacharacters:

- .       - Any Character Except New Line
- \d      - Digit (0-9)
- \D      - Not a Digit (0-9)
- \w      - Word Character (a-z, A-Z, 0-9, _)
- \W      - Not a Word Character

- \s      - Whitespace (space, tab, newline)
- \S      - Not Whitespace (space, tab, newline)
- \b      - Word Boundary
- \B      - Not a Word Boundary

- ^       - Beginning of a String
- $       - End of a String

- []      - Matches Characters in brackets
- [^ ]    - Matches Characters NOT in brackets
- |       - Either Or
- ( )     - Group
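
To get a quick feel for a few of these classes, here is a small sketch using re.findall on a made-up sample string.

```python
import re

sample = 'Room 42, floor_3!'

digits = re.findall(r'\d', sample)     # single digits
words = re.findall(r'\w+', sample)     # runs of word characters
spaces = re.findall(r'\s', sample)     # whitespace characters
non_word = re.findall(r'\W', sample)   # everything that is not a word character

print(digits)    # ['4', '2', '3']
print(words)     # ['Room', '42', 'floor_3']
print(spaces)    # [' ', ' ']
print(non_word)  # [' ', ',', ' ', '!']
```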


In [27]:
pattern = re.compile(r'\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(60, 61), match='6'>
<re.Match object; span=(61, 62), match='7'>
<re.Match object; span=(62, 63), match='8'>
<re.Match object; span=(63, 64), match='9'>
<re.Match object; span=(64, 65), match='0'>
<re.Match object; span=(156, 157), match='3'>
<re.Match object; span=(157, 158), match='2'>
<re.Match object; span=(158, 159), match='1'>
<re.Match object; span=(160, 161), match='5'>
<re.Match object; span=(161, 162), match='5'>
<re.Match object; span=(162, 163), match='5'>
<re.Match object; span=(164, 165), match='4'>
<re.Match object; span=(165, 166), match='3'>
<re.Match object; span=(166, 167), match='2'>
<re.Match object; span=(167, 168), match='1'>
<re.Match object; span=(169, 170), match='1'>
<re.Match object; span=(170, 171), match='2'>
... (output truncated)

In [28]:
pattern = re.compile(r'\D')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
... (output truncated)

In [29]:
pattern = re.compile(r'\w')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
... (output truncated)

In [30]:
pattern = re.compile(r'\s')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(27, 28), match='\n'>
<re.Match object; span=(54, 55), match='\n'>
<re.Match object; span=(65, 66), match='\n'>
<re.Match object; span=(66, 67), match='\n'>
<re.Match object; span=(69, 70), match=' '>
<re.Match object; span=(74, 75), match='\n'>
<re.Match object; span=(75, 76), match='\n'>
<re.Match object; span=(90, 91), match=' '>
<re.Match object; span=(96, 97), match=' '>
<re.Match object; span=(99, 100), match=' '>
<re.Match object; span=(102, 103), match=' '>
<re.Match object; span=(112, 113), match='\n'>
<re.Match object; span=(113, 114), match='\n'>
<re.Match object; span=(115, 116), match=' '>
<re.Match object; span=(117, 118), match=' '>
<re.Match object; span=(119, 120), match=' '>
<re.Match object; span=(121, 122), match=' '>
<re.Match object; span=(123, 124), match=' '>
<re.Match object; span=(125, 126), match=' '>
<re.Match object; span=(127, 128), match=' '>
<re.Match object; span=(129, 130), match=' '>
... (output truncated)

In [31]:
pattern = re.compile(r'\bHa')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(67, 69), match='Ha'>
<re.Match object; span=(70, 72), match='Ha'>


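\bHa matches Ha only where it starts at a word boundary. In the line

Ha HaHa

the matches are the first Ha (at the start of the line) and the second Ha (after the space). The third Ha directly follows a word character, so it is skipped.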

In [32]:
pattern = re.compile(r'\BHa')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(72, 74), match='Ha'>


In Ha HaHa, \BHa matches only the last Ha, because it is the only one that does not start at a word boundary.
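
The same comparison can be run on the line by itself. This sketch also adds \bHa\b, which is not used above, to show a whole-word match.

```python
import re

text = 'Ha HaHa'

# Ha that starts at a word boundary.
at_boundary = [m.span() for m in re.finditer(r'\bHa', text)]
# Ha that does NOT start at a word boundary.
inside_word = [m.span() for m in re.finditer(r'\BHa', text)]
# Ha with a boundary on both sides, i.e. a standalone word.
whole_word = [m.span() for m in re.finditer(r'\bHa\b', text)]

print(at_boundary)  # [(0, 2), (3, 5)]
print(inside_word)  # [(5, 7)]
print(whole_word)   # [(0, 2)]
```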

In [33]:
sentence = 'Start a sentence and then bring it to an end'

Now let's check whether the sentence begins with specific characters using ^, and whether it ends with specific characters using $.

In [34]:
pattern = re.compile(r'^Start')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)

<re.Match object; span=(0, 5), match='Start'>


In [35]:
pattern = re.compile(r'^tart')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)

This prints nothing, because the sentence does not begin with tart.

In [37]:
pattern = re.compile(r'end$')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)

<re.Match object; span=(41, 44), match='end'>


Let's try a typical phone number match.

In [44]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(156, 168), match='321-555-4321'>
<re.Match object; span=(169, 181), match='123.555.1234'>
<re.Match object; span=(182, 194), match='123*555*1234'>
<re.Match object; span=(195, 207), match='800-555-1234'>
<re.Match object; span=(208, 220), match='900-555-1234'>


Now, 123*555*1234 is not a valid phone number, but it matched because . accepts any character. Valid numbers only have . or - as separators, so we can use a character set.

##### Character Set []

In [45]:
pattern = re.compile(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d')
matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(156, 168), match='321-555-4321'>
<re.Match object; span=(169, 181), match='123.555.1234'>
<re.Match object; span=(195, 207), match='800-555-1234'>
<re.Match object; span=(208, 220), match='900-555-1234'>


So a character set matches exactly one character from those listed inside the square brackets at that position in the text being searched.
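
One consequence worth seeing: each character set is evaluated independently, so the two separators do not have to be the same. The list of numbers below is made up for this sketch, and fullmatch is used so that the whole string must fit the pattern.

```python
import re

numbers = ['321-555-4321', '123.555.1234', '123*555*1234', '123-555.1234']
pattern = re.compile(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d')

# Keep only the strings that match the pattern from start to end.
accepted = [n for n in numbers if pattern.fullmatch(n)]

print(accepted)  # ['321-555-4321', '123.555.1234', '123-555.1234']
```

Inside the brackets, . is a literal dot, and a - placed at the end of the set is a literal hyphen rather than a range.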

Now suppose we want only the numbers that start with 8 or 9:

In [46]:
pattern = re.compile(r'[89]\d\d[.-]\d\d\d[.-]\d\d\d\d')
matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(195, 207), match='800-555-1234'>
<re.Match object; span=(208, 220), match='900-555-1234'>


Get the digits in a certain range:

In [47]:
pattern = re.compile(r'[1-5]')
matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(156, 157), match='3'>
<re.Match object; span=(157, 158), match='2'>
<re.Match object; span=(158, 159), match='1'>
<re.Match object; span=(160, 161), match='5'>
<re.Match object; span=(161, 162), match='5'>
<re.Match object; span=(162, 163), match='5'>
<re.Match object; span=(164, 165), match='4'>
<re.Match object; span=(165, 166), match='3'>
<re.Match object; span=(166, 167), match='2'>
<re.Match object; span=(167, 168), match='1'>
<re.Match object; span=(169, 170), match='1'>
<re.Match object; span=(170, 171), match='2'>
<re.Match object; span=(171, 172), match='3'>
<re.Match object; span=(173, 174), match='5'>
<re.Match object; span=(174, 175), match='5'>
<re.Match object; span=(175, 176), match='5'>
<re.Match object; span=(177, 178), match='1'>
... (output truncated)

How about a range of letters?

In [50]:
pattern = re.compile(r'[g-t]')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(78, 79), match='t'>
<re.Match object; span=(81, 82), match='h'>
<re.Match object; span=(83, 84), match='r'>
<re.Match object; span=(86, 87), match='t'>
<re.Match object; span=(88, 89), match='r'>
<re.Match object; span=(89, 90), match='s'>
<re.Match object; span=(97, 98), match='t'>
<re.Match object; span=(98, 99), match='o'>
<re.Match object; span=(104, 105), match='s'>
... (output truncated)

[] with ^

Now if we want to match anything that is not a letter from a to z or A to Z, we can put ^ at the start of the character set.

In [51]:
pattern = re.compile(r'[^a-zA-Z]')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(27, 28), match='\n'>
<re.Match object; span=(54, 55), match='\n'>
<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(60, 61), match='6'>
<re.Match object; span=(61, 62), match='7'>
<re.Match object; span=(62, 63), match='8'>
<re.Match object; span=(63, 64), match='9'>
<re.Match object; span=(64, 65), match='0'>
<re.Match object; span=(65, 66), match='\n'>
<re.Match object; span=(66, 67), match='\n'>
<re.Match object; span=(69, 70), match=' '>
<re.Match object; span=(74, 75), match='\n'>
<re.Match object; span=(75, 76), match='\n'>
<re.Match object; span=(90, 91), match=' '>
<re.Match object; span=(91, 92), match='('>
<re.Match object; span=(96, 97), match=' '>
<re.Match object; span=(99, 100), match=' '>
... (output truncated)

Let's check the following condition: we want words that end with at but do not start with b.

In [53]:
pattern = re.compile(r'[^b]at')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(273, 276), match='cat'>
<re.Match object; span=(277, 280), match='mat'>
<re.Match object; span=(281, 284), match='pat'>


### Quantifiers

- *       - 0 or More
- +       - 1 or More
- ?       - 0 or One
- {3}     - Exact Number
- {3,4}   - Range of Numbers (Minimum, Maximum)
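
A short sketch of each quantifier on made-up strings, to make the counts concrete:

```python
import re

text = 'a aa aaa aaaa'

# {3}: exactly three a's forming a whole word.
exactly_three = re.findall(r'\ba{3}\b', text)
# {3,4}: between three and four a's forming a whole word.
three_or_four = re.findall(r'\ba{3,4}\b', text)
# +: one or more a's.
one_or_more = re.findall(r'a+', text)
# ?: the s is optional.
optional_s = re.findall(r'cats?', 'cat cats')
# *: an a followed by zero or more b's.
zero_or_more = re.findall(r'ab*', 'a ab abbb')

print(exactly_three)  # ['aaa']
print(three_or_four)  # ['aaa', 'aaaa']
print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa']
print(optional_s)     # ['cat', 'cats']
print(zero_or_more)   # ['a', 'ab', 'abbb']
```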

Regex on phone numbers:
For phone numbers we can use the exact-number quantifier.

##### {}

In [54]:
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(156, 168), match='321-555-4321'>
<re.Match object; span=(169, 181), match='123.555.1234'>
<re.Match object; span=(195, 207), match='800-555-1234'>
<re.Match object; span=(208, 220), match='900-555-1234'>


In [55]:
pattern = re.compile(r'\d{3}[-]\d{3}[-]\d{4}')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(156, 168), match='321-555-4321'>
<re.Match object; span=(195, 207), match='800-555-1234'>
<re.Match object; span=(208, 220), match='900-555-1234'>


Now, let's try to match the names that start with Mr near the bottom of text_to_search.

#### *, + and ?

In [62]:
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')
matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(222, 233), match='Mr. Schafer'>
<re.Match object; span=(234, 242), match='Mr Smith'>
<re.Match object; span=(266, 271), match='Mr. T'>
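
The | and ( ) metacharacters listed earlier can extend this pattern to the other titles. The following is a sketch on a copy of the name lines only; the group numbering and the variable names are choices made for this example.

```python
import re

names = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

# Group 1: the title (Mr, Ms or Mrs); group 2: the capitalized name.
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')

found = [(m.group(1), m.group(2)) for m in pattern.finditer(names)]
print(found)
# [('Mr', 'Schafer'), ('Mr', 'Smith'), ('Ms', 'Davis'),
#  ('Mrs', 'Robinson'), ('Mr', 'T')]
```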


#### re.match()
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

##### Syntax
re.match(pattern, string, flags=0)

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

NOTE: match() does not return an iterable. It returns only a single match, and only if that match starts at the very beginning of the string. It is like anchoring the pattern with ^.
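
The MULTILINE remark can be checked with a small sketch. The two-line string below is made up for the example.

```python
import re

lines = 'first line\nsecond line'
pattern = re.compile(r'^second', re.MULTILINE)

# match() only tries position 0, even though ^ may match after a newline.
from_match = pattern.match(lines)
# search() scans the whole string, so it finds the start of the second line.
from_search = pattern.search(lines)

print(from_match)   # None
print(from_search)  # <re.Match object; span=(11, 17), match='second'>
```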

In [63]:
pattern = re.compile(r'Start')

matches = pattern.match(sentence)

print(matches)

<re.Match object; span=(0, 5), match='Start'>


In [72]:
pattern = re.compile(r'end')

matches = pattern.match(sentence)

print(matches)

None


In [66]:
pattern = re.compile(r'example')

matches = pattern.match(text_to_search)

print(matches)

None


In [67]:
print(sentence)

Start a sentence and then bring it to an end


#### re.search()
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
##### Syntax
re.search(pattern, string, flags=0)

Since match() can only find a pattern at the beginning of the string, we use search() to find a pattern that appears anywhere in it. Like match(), search() returns only the first match.
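
A brief sketch contrasting search() with finditer() on a made-up string that contains the pattern twice. It also shows the None check that is usually needed before using the result.

```python
import re

text = 'end to end'
pattern = re.compile(r'end')

# search() stops at the first hit.
first = pattern.search(text)
if first is not None:
    print(first.span())  # (0, 3)

# finditer() keeps going and yields every non-overlapping hit.
all_spans = [m.span() for m in pattern.finditer(text)]
print(all_spans)  # [(0, 3), (7, 10)]
```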

In [71]:
pattern = re.compile(r'example')

matches = pattern.search(text_to_search)

print(matches)

<re.Match object; span=(143, 150), match='example'>


Got it!!!

In [73]:
pattern = re.compile(r'end')

matches = pattern.search(sentence)

print(matches)

<re.Match object; span=(41, 44), match='end'>


#### flags
Let's say we need to match a word where each letter can be uppercase, lowercase, or a mixture of both. To search for Start this way, we would normally have to write:

In [74]:
pattern = re.compile(r'[Ss][Tt][Aa][Rr][Tt]')

matches = pattern.search(sentence)

print(matches)

<re.Match object; span=(0, 5), match='Start'>


Let's make it simpler: we can use the IGNORECASE flag instead.

In [75]:
pattern = re.compile(r'start', re.IGNORECASE)

matches = pattern.search(sentence)

print(matches)

<re.Match object; span=(0, 5), match='Start'>


re.I is the shorthand for re.IGNORECASE:

In [76]:
pattern = re.compile(r'start', re.I)

matches = pattern.search(sentence)

print(matches)

<re.Match object; span=(0, 5), match='Start'>


##### Other flags
Several other flags are available:

#### re.DEBUG

Display debug information about compiled expression.

#### re.I re.IGNORECASE

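Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. In Python 3, str patterns use full Unicode matching by default, so ü also matches Ü without any extra flag, unless the ASCII flag is used to disable non-ASCII matches. The current locale does not change the effect of this flag unless the LOCALE flag is also used.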
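
A small sketch of IGNORECASE, assuming Python 3 and str patterns. The sample strings are made up.

```python
import re

# Plain ASCII text, different case.
hit_ascii = re.search(r'start', 'START here', re.IGNORECASE)
# A character range also becomes case-insensitive.
hit_range = re.findall(r'[A-Z]+', 'Mixed Case', re.I)
# In Python 3, non-ASCII letters are handled as well for str patterns.
hit_unicode = re.search('\u00fc', '\u00dc', re.I)  # ü against Ü

print(hit_ascii.group())        # START
print(hit_range)                # ['Mixed', 'Case']
print(hit_unicode is not None)  # True
```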

#### re.L re.LOCALE

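Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale. In Python 3 this flag can be used only with bytes patterns, and its use is discouraged.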

#### re.M re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
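
The effect of MULTILINE on ^ and $ can be sketched with a short three-line string (made up for the example):

```python
import re

text = 'cat\nmat\npat'

# Without the flag, ^ and $ anchor to the whole string only.
default_starts = re.findall(r'^\w+', text)
default_ends = re.findall(r'\w+$', text)

# With the flag, they anchor to every line.
multiline_starts = re.findall(r'^\w+', text, re.MULTILINE)
multiline_ends = re.findall(r'\w+$', text, re.M)

print(default_starts)    # ['cat']
print(default_ends)      # ['pat']
print(multiline_starts)  # ['cat', 'mat', 'pat']
print(multiline_ends)    # ['cat', 'mat', 'pat']
```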

#### re.S re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
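
A minimal sketch of DOTALL on a made-up two-line string:

```python
import re

text = 'first\nsecond'

# By default . refuses to match the newline between the two words.
without_flag = re.search(r'first.second', text)
# With DOTALL (short form re.S) the newline counts as "any character".
with_flag = re.search(r'first.second', text, re.DOTALL)

# Counting single-character matches shows the one extra newline.
count_default = len(re.findall(r'.', text))
count_dotall = len(re.findall(r'.', text, re.S))

print(without_flag)                 # None
print(with_flag.span())             # (0, 12)
print(count_default, count_dotall)  # 11 12
```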

#### re.U re.UNICODE

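Make the \w, \W, \b, \B, \d, \D, \s and \S sequences follow the Unicode character properties database. In Python 3 this is already the default for str patterns, so the flag is redundant and kept only for backward compatibility; use re.A (re.ASCII) to restrict these sequences to ASCII.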
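
As a sketch of the difference on Python 3, the snippet below runs \w+ over a made-up string containing accented letters, first with the default Unicode behaviour and then with re.ASCII.

```python
import re

text = 'na\u00efve caf\u00e9'  # "naïve café" written with escapes

# Default for str patterns: accented letters count as word characters.
unicode_words = re.findall(r'\w+', text)
# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the words are split.
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```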