# Regular Expressions

## Introduction to Regular Expressions

Regular expressions (REs) are primarily used to match patterns in strings. They have their own syntax, which is supported in many languages, sometimes natively; in Python you must import the re module.

In [2]:
## The Matching Characters
## These characters are special in REs and are called metacharacters: . ^ $ * + ? { } [ ] \ | ( )

import re
text = 'abcdfghijk'

# This expression says we are looking for the letter a, 
# zero or more letters from our bracket class, and it needs to end with an f
parser = re.search('a[b-f]*f', text)
print(parser)

print(parser.group())

<re.Match object; span=(0, 5), match='abcdf'>
abcdf


### Other Metachars

The * char matches the preceding element zero or more times.

The + char matches the preceding element one or more times.

The ? char matches the preceding element either once or zero times.

Braces are used as {a,b}, where a and b are decimal integers: the preceding element must repeat at least a times and at most b times.

The ^ char, placed first inside a character class, gives the complement: [^a] matches any character except the letter ‘a’. Outside a character class, ^ anchors the match to the start of the string.

$ is an anchor that matches at the end of the string (or just before a trailing newline).
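The quantifiers and anchors listed above can be tried side by side. The following is a minimal sketch; the patterns and sample strings are made up for illustration.

```python
import re

# Each entry is (pattern, text); the patterns are written as raw strings.
examples = [
    (r"ab*c", "ac"),         # * : zero or more 'b'
    (r"ab+c", "ac"),         # + : one or more 'b', so there is no match here
    (r"ab?c", "abc"),        # ? : zero or one 'b'
    (r"ab{2,3}c", "abbbc"),  # {2,3} : between two and three 'b'
    (r"[^a]", "abc"),        # [^a] : first character that is not 'a'
    (r"^abc$", "abc"),       # ^ and $ : anchor both ends of the string
]

results = []
for pattern, text in examples:
    match = re.search(pattern, text)
    # Store the matched text, or None when the pattern does not match
    results.append(match.group() if match else None)
    print(pattern, "->", results[-1])
```

Only the second pattern fails to match, because + requires at least one 'b' between 'a' and 'c'.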

## Pattern Matching using search

In [5]:
import re

text = "This is the fourth module of this learning path."

strings = ['the', 'one', 'path']

for string in strings:
    match = re.search(string, text)
    if match:
        print('Found "{}" in "{}"'.format(string, text))
        # Get position of match within string
        text_pos = match.span()
        # Print match using start and end indexes
        print(text[match.start():match.end()])
        print('\n')
    else:
        print('Did not find "{}"'.format(string))
        print('\n')

Found "the" in "This is the fourth module of this learning path."
the


Did not find "one"


Found "path" in "This is the fourth module of this learning path."
path




## Escape Codes

\d : Matches a digit

\D : Matches a non-digit

\s : Matches whitespace

\S : Matches non-whitespace

\w : Matches an alphanumeric character or underscore

\W : Matches a non-alphanumeric character

Example: for ASCII text, [\d] is the equivalent of [0-9]
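As a rough illustration of these escape codes, the sketch below applies a few of them to a made-up sample string (the string and variable names are assumptions for the example, not part of the original notebook).

```python
import re

sample = "Room 42, Floor 7"

digits = re.findall(r"\d+", sample)    # runs of digits
words = re.findall(r"\w+", sample)     # runs of alphanumeric characters
non_word = re.findall(r"\W+", sample)  # runs of non-alphanumeric characters
pieces = re.split(r"\s+", sample)      # split on runs of whitespace

print(digits)
print(words)
print(non_word)
print(pieces)
```

Note that splitting on \s+ leaves the comma attached to '42,', since a comma is not whitespace.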

## Compiling

Compiling a pattern with re.compile lets you save your regex as a pattern object and reuse it to execute matching functions at any time.

In [8]:
import re

text = "This is the fourth module of this learning path."

strings = ['the', 'one', 'path']

for string in strings:
    regex = re.compile(string)
    match = regex.search(text)
    if match:
        print('Found "{}" in "{}"'.format(string, text))
        text_pos = match.span()
        print(text[match.start():match.end()])
        print('\n')
    else:
        print('Did not find "{}"'.format(string))
        print('\n')

Found "the" in "This is the fourth module of this learning path."
the


Did not find "one"


Found "path" in "This is the fourth module of this learning path."
path
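The benefit of compiling shows most clearly when one pattern is reused on several strings. Here is a minimal sketch, assuming a hypothetical list of sentences: the pattern is compiled once, outside the loop, and its findall method is called for each sentence.

```python
import re

# Compile once, then reuse the same pattern object for every string
word_the = re.compile(r"the")

# Hypothetical sample data for illustration
sentences = [
    "This is the fourth module.",
    "No match here.",
    "Over the hills and through the woods.",
]

# Count the occurrences of the pattern in each sentence
counts = [len(word_the.findall(sentence)) for sentence in sentences]
print(counts)
```

A compiled pattern object offers the same operations as the module-level functions (search, findall, finditer and so on) as methods.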




## Compilation flags

re.A / re.ASCII - makes escape codes such as \w, \d and \s match ASCII characters only

re.DEBUG - displays debug information about the compiled expression

re.I / re.IGNORECASE - performs case-insensitive matching

re.L / re.LOCALE - makes escape codes and case-insensitive matching depend on the current locale; you should avoid using it

re.M / re.MULTILINE - makes ^ and $ match at the start and end of each line, not only of the whole string

re.S / re.DOTALL - allows the "." char to match any character, including newline chars

re.X / re.VERBOSE - lets you visually separate logical sections of your regular expression with whitespace and comments
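To see what some of these flags change in practice, the sketch below runs the same kind of pattern with and without a flag. The sample text is invented for the example.

```python
import re

text = "First line\nsecond LINE\nthird line"

# re.IGNORECASE: 'line' also matches 'LINE'
ignore_case = re.findall(r"line", text, re.IGNORECASE)

# re.MULTILINE: ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Without re.DOTALL, "." stops at the first newline;
# with it, "." runs across newlines to the end of the text
no_dotall = re.search(r"First.*", text).group()
with_dotall = re.search(r"First.*", text, re.DOTALL).group()

print(ignore_case)
print(line_starts)
print(repr(no_dotall))
print(repr(with_dotall))
```

Flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.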

In [11]:
## Using a compilation flag
# This expression is for finding email addresses

import re
re.compile(r'''
           [\w\.-]+      # the user name
           @
           [\w\.-]+     # the domain
           ''',
           re.VERBOSE)

re.compile(r"\n           [\w\.-]+      # the user name\n           @\n           [\w\.-]+     # the domain\n           ",
re.UNICODE|re.VERBOSE)
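A verbose pattern like this one can be stored and then applied to text. The sketch below is a simplified illustration rather than a full email validator; the sample addresses and the name email_re are assumptions made for the example.

```python
import re

# Simplified email pattern; whitespace and comments are ignored under re.VERBOSE
email_re = re.compile(r"""
    [\w.-]+   # the user name
    @
    [\w.-]+   # the domain
    """, re.VERBOSE)

# Hypothetical sample text
sample = "Contact alice@example.com or bob.smith@mail.example.org today."
found = email_re.findall(sample)
print(found)
```

Because whitespace in the pattern is ignored in verbose mode, a literal space would have to be written as \s or escaped.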

## Finding Multiple instances

In [15]:
# finding one instance

import re
my_string = 'the dog and the cat.'
pattern = 'the'
match = re.search(pattern, my_string)
print(match.group())

the


In [16]:
# finding all instances

import re
my_string = 'the dog and the cat.'
pattern = 'the'
print(re.findall(pattern, my_string))

['the', 'the']


In [17]:
# finding all instances, returning iterator

import re
my_string = 'the dog and the cat.'
pattern = 'the'

for match in re.finditer(pattern, my_string):
    s = "Found '{group}' and {begin}:{end}".format(
        group=match.group(), begin=match.start(),
        end=match.end())
    print(s)

Found 'the' and 0:3
Found 'the' and 12:15


## Backslashes in re & python

Since the backslash is also the escape character in Python string literals, write your patterns as raw strings to keep the code readable: r"\d+" instead of "\\d+". To match a literal backslash, use r"\\python" instead of "\\\\python".
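The equivalence between the raw and the escaped spelling can be checked directly. This is a small sketch; the path string is a made-up example.

```python
import re

# Hypothetical string containing one literal backslash
path = r"C:\python"

raw_pattern = r"\\python"        # raw string: two characters of backslash
escaped_pattern = "\\\\python"   # the same pattern without a raw string

# Both spellings produce the identical pattern text
same = raw_pattern == escaped_pattern

# The regex \\python matches one literal backslash followed by 'python'
match = re.search(raw_pattern, path)
print(same, match.group())
```

The raw string needs half as many backslashes, which is why it is the usual choice for regex patterns.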