In [36]:
import re

#### Creating Regex Objects

Passing a string value representing a regular expression to re.compile() returns a Regex pattern object. 

In [6]:
# The r prefix marks a raw string, in which backslashes are not treated as escape characters. This keeps patterns passed to re.compile() readable.
phone_number_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# A Regex object's search() method searches the string it is passed for a match to the regex.
# If a match is found, search() returns a Match object, whose group() method returns the matched text. If no match is found, search() returns None.
mo = phone_number_regex.search('My number is 415-555-4242.')
print( 'Phone number found: ' + mo.group() )

Phone number found: 415-555-4242
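
Because search() returns None when nothing matches, calling group() on the result directly can raise an AttributeError. A minimal sketch of a guard follows; the helper name find_phone_number is hypothetical and not part of the re module.

```python
import re

phone_number_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = phone_number_regex.search(text)
    # search() returns None when nothing matches, so check before calling group().
    if mo is None:
        return None
    return mo.group()

found = find_phone_number('My number is 415-555-4242.')
missing = find_phone_number('I have no phone.')
print(found)    # 415-555-4242
print(missing)  # None
```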


#### Pattern Matching with Regular Expressions

###### Grouping with Parentheses

In [10]:
phone_number_regex = re.compile( r'(\d\d\d)-(\d\d\d-\d\d\d\d)' )
mo                 = phone_number_regex.search( 'My number is 415-555-4242.' )
print( mo.group(1) )
print( mo.group(2) )
print( mo.group(0) )
print( mo.groups() )

415
555-4242
415-555-4242
('415', '555-4242')


In [12]:
area, main = mo.groups()
print( area )
print( main )

415
555-4242


###### Matching Multiple Groups with the Pipe

Use the | character to match one of several expressions. If more than one alternative occurs in the searched string, only the first occurrence is returned as the Match object.

In [13]:
hero_regex = re.compile( r'Batman|Tina Fey' )
mo1        = hero_regex.search( 'Batman and Tina Fey.' )

mo1.group()

'Batman'
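
"First" here means first in the searched text, not first in the pattern. A short sketch, using a reordered sentence chosen for illustration, shows this and uses findall() to collect every occurrence:

```python
import re

hero_regex = re.compile(r'Batman|Tina Fey')
text = 'Tina Fey and Batman.'

# search() stops at the earliest match in the text, even though
# 'Batman' is listed first in the pattern.
first = hero_regex.search(text).group()

# findall() returns every non-overlapping match, in order of appearance.
all_matches = hero_regex.findall(text)

print(first)        # Tina Fey
print(all_matches)  # ['Tina Fey', 'Batman']
```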

In [14]:
# Use parentheses to specify a shared prefix only once.
bat_regex = re.compile( r'Bat(man|mobile|copter|bat)' )
mo        = bat_regex.search( 'Batmobile lost a wheel' )

mo.group()

'Batmobile'

###### Optional Matching with the Question Mark

A regex can match whether or not a piece of the pattern is present. The ? marks the group that precedes it as an optional part of the pattern. 

In [15]:
# (wo)? is an optional part of the expression
bat_regex = re.compile( r'Bat(wo)?man' )
mo1       = bat_regex.search( 'The Adventures of Batman' )

mo1.group()

'Batman'

In [17]:
mo2 = bat_regex.search( 'The Adventures of Batwoman' )
mo2.group()

'Batwoman'

###### Matching Zero or More with the Star

The * means match zero or more. The group preceding the star can be absent or repeated any number of times in the text.

In [21]:
bat_regex = re.compile( r'Bat(wo)*man' )
mo1       = bat_regex.search( 'The Adventures of Batman' )

mo1.group()

'Batman'

In [19]:
bat_regex = re.compile( r'Bat(wo)*man' )
mo2       = bat_regex.search( 'The Adventures of Batwoman' )

mo2.group()

'Batwoman'

In [20]:
bat_regex = re.compile( r'Bat(wo)*man' )
mo3       = bat_regex.search( 'The Adventures of Batwowowowoman' )

mo3.group()

'Batwowowowoman'

###### Matching One or More with the Plus

The + means match one or more. The group preceding a plus must appear at least once. 

In [26]:
bat_regex = re.compile( r'Bat(wo)+man' )
mo1       = bat_regex.search( 'The Adventures of Batwoman' )

mo1.group()

'Batwoman'

In [27]:
mo2 = bat_regex.search( 'The Adventures of Batman' )

# "wo" must appear at least once. If it does not, search() returns None instead of a Match object.
mo2 == None

True

###### Match Repetitions with Curly Braces

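Placing a number in curly braces after a group matches that exact number of repetitions, as in (Ha){3}. Two numbers give a range, so (Ha){3,5} matches three to five repetitions. Either bound can be omitted: (Ha){,4} matches zero to four repetitions and (Ha){4,} matches four or more. Do not put spaces inside the braces, or they will be treated as literal text.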

In [31]:
ha_regex = re.compile( r'(Ha){3}' )
mo1      = ha_regex.search( 'HaHaHa' )

mo1.group()

'HaHaHa'
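
The different forms of the curly-brace quantifier can be compared side by side. This is a small sketch; the test string with five repetitions is chosen only for illustration.

```python
import re

text = 'HaHaHaHaHa'  # five repetitions of 'Ha'

# Exactly three repetitions.
exactly_three = re.compile(r'(Ha){3}').search(text).group()
# Between three and five repetitions (greedy, so it takes all five).
three_to_five = re.compile(r'(Ha){3,5}').search(text).group()
# Lower bound omitted: zero to two repetitions.
at_most_two = re.compile(r'(Ha){,2}').search(text).group()
# Upper bound omitted: four or more repetitions.
at_least_four = re.compile(r'(Ha){4,}').search(text).group()

print(exactly_three)  # HaHaHa
print(three_to_five)  # HaHaHaHaHa
print(at_most_two)    # HaHa
print(at_least_four)  # HaHaHaHaHa
```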

#### Greedy and Non-Greedy Matching

Regular expressions are greedy by default: in ambiguous situations, such as a {3,5} range, they match the longest string possible. To match the shortest string possible instead, follow the closing curly brace with a question mark ?. 

In [42]:
non_greedy_regex = re.compile( r'(Ha){3,5}?' )
mo1              = non_greedy_regex.search( 'HaHaHaHa' )

mo1.group()

'HaHaHa'
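
The difference only shows up when the quantifier is a range, so a side-by-side comparison may help. This sketch assumes a {3,5} range and a test string of five repetitions:

```python
import re

text = 'HaHaHaHaHa'  # five repetitions of 'Ha'

greedy_regex = re.compile(r'(Ha){3,5}')       # takes as many repetitions as allowed
non_greedy_regex = re.compile(r'(Ha){3,5}?')  # takes as few repetitions as allowed

greedy_match = greedy_regex.search(text).group()
non_greedy_match = non_greedy_regex.search(text).group()

print(greedy_match)      # HaHaHaHaHa
print(non_greedy_match)  # HaHaHa
```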

###### More on the findall() method

In [44]:
phone_regex = re.compile( r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)' )

phone_regex.findall( 'Cell: 415-555-9999, Work: 212-555-0000' )

[('415', '555', '9999'), ('212', '555', '0000')]
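
What findall() returns depends on whether the pattern contains groups. The sketch below runs the same phone number pattern with and without parentheses:

```python
import re

text = 'Cell: 415-555-9999, Work: 212-555-0000'

# No groups: findall() returns a list of the full matched strings.
no_groups_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
full_matches = no_groups_regex.findall(text)

# With groups: findall() returns a list of tuples, one string per group.
groups_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
group_matches = groups_regex.findall(text)

print(full_matches)   # ['415-555-9999', '212-555-0000']
print(group_matches)  # [('415', '555', '9999'), ('212', '555', '0000')]
```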

#### Character Classes

- \d is any numeric digit from 0 to 9  
- \D is any character that is not a numeric digit  
- \w is any letter, numeric digit, or the underscore character  
- \W is any character that is not a letter, numeric digit, or the underscore character  
- \s is any space, tab, or newline character  
- \S is any character that is not a space, tab, or newline  

For example, r'\d+\s\w+' matches one or more numeric digits, followed by a single whitespace character, followed by one or more letters, digits, or underscores. 
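
A quick sketch of that pattern in use; the sample sentences are made up for illustration:

```python
import re

count_regex = re.compile(r'\d+\s\w+')

# Each match is a run of digits, one whitespace character, then a run of word characters.
gifts = count_regex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies')

# \w also matches digits and underscores, not only letters.
mixed = count_regex.findall('7 item_2')

print(gifts)  # ['12 drummers', '11 pipers', '10 lords', '9 ladies']
print(mixed)  # ['7 item_2']
```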

#### Making Your Own Character Classes

You can define your own character classes using square brackets. Inside the brackets, most regular expression symbols (such as ., *, ? and parentheses) are treated as literal characters, so they do not need to be escaped. Placing a caret ^ immediately after the opening bracket creates a negative character class, which matches any character not in the class. A popular character class is [a-zA-Z0-9], which matches any letter or digit.

In [45]:
vowel_regex = re.compile( r'[aeiouAEIOU]' )
vowel_regex.findall( 'RoboCop eats baby food. BABY FOOD.' )

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

In [48]:
consonant_regex = re.compile( r'[^aeiouAEIOU]' )
consonant_regex.findall( 'RoboCop eats baby food. BABY FOOD.' )

['R',
 'b',
 'C',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']
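
Two further points about character classes can be sketched briefly: symbols such as . and ? are literal inside the brackets, and [a-zA-Z0-9] uses ranges. The sample strings are made up for illustration.

```python
import re

# Inside brackets, '.', '?' and '!' are literal characters; no backslash is needed.
punctuation_regex = re.compile(r'[.?!]')
punctuation = punctuation_regex.findall('Hi. Really? Yes!')

# Ranges: any lowercase letter, uppercase letter, or digit.
alnum_regex = re.compile(r'[a-zA-Z0-9]')
alnum = alnum_regex.findall('a_b-9 Z')

# A leading caret negates the class.
non_alnum_regex = re.compile(r'[^a-zA-Z0-9]')
non_alnum = non_alnum_regex.findall('a_b-9 Z')

print(punctuation)  # ['.', '?', '!']
print(alnum)        # ['a', 'b', '9', 'Z']
print(non_alnum)    # ['_', '-', ' ']
```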

#### Caret and Dollar Sign Characters

Placing a caret ^ at the start of a regex indicates that the match must occur at the beginning of the searched text. Likewise, placing a dollar sign $ at the end of the regex indicates that the text must end with the regex pattern. Using ^ and $ together indicates that the entire string must match the regex, not just a subset of it.

In [49]:
begin_with_hello = re.compile( r'^Hello' )
begin_with_hello.search( 'Hello World!' )

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [50]:
begin_with_hello.search( 'He said hello' ) == None

True

In [51]:
ends_with_number = re.compile( r'\d$' )
ends_with_number.search( 'Your number is 42' )

<_sre.SRE_Match object; span=(16, 17), match='2'>

In [52]:
ends_with_number.search( 'Your number is forty two' ) == None

True

In [53]:
whole_string_is_num = re.compile( r'^\d+$' )
whole_string_is_num.search( '1234567890' )

<_sre.SRE_Match object; span=(0, 10), match='1234567890'>

In [54]:
whole_string_is_num.search( '12345xyz67890' ) == None

True

In [55]:
whole_string_is_num.search( '12345 67890' ) == None

True

#### Wildcard Character

The period will match any character except for a newline. 

In [56]:
at_regex = re.compile( r'.at' )
at_regex.findall( 'The cat in the hat sat on the flat mat.' )

['cat', 'hat', 'sat', 'lat', 'mat']

###### Matching Everything with Dot Star

The dot-star .* matches zero or more of any character except a newline. It is greedy by default and will match the longest string it can. Adding a question mark, as in .*?, makes it match in a non-greedy fashion. 

In [57]:
name_regex = re.compile( r'First Name: (.*) Last Name: (.*)' )
mo         = name_regex.search( 'First Name: Rayne Last Name: Aveson' )
mo.group(1)

'Rayne'
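
The greedy and non-greedy forms of the dot-star give different results when the closing delimiter appears more than once. A small sketch, with a sample string chosen for illustration:

```python
import re

text = '<To serve man> for dinner.>'

# Greedy: .* runs to the last '>' in the string.
greedy_regex = re.compile(r'<.*>')
greedy_match = greedy_regex.search(text).group()

# Non-greedy: .*? stops at the first '>' it can.
non_greedy_regex = re.compile(r'<.*?>')
non_greedy_match = non_greedy_regex.search(text).group()

print(greedy_match)      # <To serve man> for dinner.>
print(non_greedy_match)  # <To serve man>
```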

###### Matching Newlines with the Dot Character

In [58]:
no_new_line_regex = re.compile( '.*' )
no_new_line_regex.search( 'Serve the public trust.\nProtect the innocent.\nUphold the law' ).group()

'Serve the public trust.'

In [59]:
no_new_line_regex = re.compile( '.*', re.DOTALL )
no_new_line_regex.search( 'Serve the public trust.\nProtect the innocent.\nUphold the law' ).group()

'Serve the public trust.\nProtect the innocent.\nUphold the law'

###### Other Parameters to re.compile()

To make a regex case-insensitive, pass re.IGNORECASE (or its shorthand re.I) as the second argument to re.compile().  
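
A minimal sketch of case-insensitive matching; the name robocop_regex and the sample sentences are chosen only for illustration:

```python
import re

# re.I is shorthand for re.IGNORECASE.
robocop_regex = re.compile(r'robocop', re.I)

mixed_case = robocop_regex.search('RoboCop is part man, part machine.').group()
upper_case = robocop_regex.search('ROBOCOP protects the innocent.').group()

# group() returns the text as it appears in the searched string.
print(mixed_case)  # RoboCop
print(upper_case)  # ROBOCOP
```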

The sub() method of a Regex object replaces matched text. Its first argument is the replacement string, and its second argument is the string to search; every match of the pattern in that string is replaced.

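VERBOSE mode lets you spread a regex over several lines and add comments to it. Write the pattern as a triple-quoted string (''' ''' instead of '') and pass re.VERBOSE as the second argument to re.compile(); whitespace and comments inside the pattern are then ignored.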

In [62]:
name_regex = re.compile( r'Agent \w+' )
name_regex.sub( 'CENSORED', 'Agent Riyun gave me the secret document to Agent Murdoch' )

'CENSORED gave me the secret document to CENSORED'

To use re.VERBOSE, re.I, and re.DOTALL together, combine them with the bitwise OR operator | and pass the result as the second argument to re.compile().
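
As a rough sketch, the first pattern below uses re.VERBOSE on its own and the second combines all three flags. The patterns and sample strings are made up for illustration.

```python
import re

# VERBOSE: whitespace and '#' comments inside the pattern are ignored.
phone_regex = re.compile(r'''
    (\d{3})     # area code
    -           # separator
    (\d{3})     # first three digits
    -           # separator
    (\d{4})     # last four digits
    ''', re.VERBOSE)
phone_parts = phone_regex.search('Call 415-555-4242 today').groups()

# Flags combined with |. In verbose mode a literal space must be written as \s.
agent_regex = re.compile(r'''
    agent \s+   # the word 'agent' in any case, then whitespace
    (.+)        # the rest of the text, including newlines because of DOTALL
    ''', re.VERBOSE | re.I | re.DOTALL)
agent_text = agent_regex.search('AGENT Riyun\nMurdoch').group(1)

print(phone_parts)       # ('415', '555', '4242')
print(repr(agent_text))  # 'Riyun\nMurdoch'
```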