A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

### RegEx Module
Python has a built-in package called re, which can be used to work with Regular Expressions.

Once you have imported the re module, you can start using regular expressions:

In [1]:
# Search the string to see if it starts with "The" and ends with "Spain":

import re

txt = 'The rain in Spain'
x = re.search('^The.*Spain$',txt)

if x:
    print('Yes, we have a match')
else:
    print('No match')

Yes, we have a match


### RegEx Functions
The re module offers a set of functions that allow us to search a string for a match:

| Function | Description |
| --- | --- |
| **findall** | Returns a list containing all matches |
| **search** | Returns a Match object if there is a match anywhere in the string |
| **split** | Returns a list where the string has been split at each match |
| **sub** | Replaces one or many matches with a string |

### Metacharacters
Metacharacters are characters with a special meaning:

| Character | Description | Example |
| --- | --- | --- |
| `[]` | A set of characters | `"[a-m]"` |
| `\` | Signals a special sequence (can also be used to escape special characters) | `"\d"` |
| `.` | Any character (except newline character) | `"he..o"` |
| `^` | Starts with | `"^hello"` |
| `$` | Ends with | `"planet$"` |
| `*` | Zero or more occurrences | `"he.*o"` |
| `+` | One or more occurrences | `"he.+o"` |
| `?` | Zero or one occurrences | `"he.?o"` |
| `{}` | Exactly the specified number of occurrences | `"he.{2}o"` |
| `\|` | Either or | `"falls\|stays"` |
| `()` | Capture and group | |
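The metacharacter list gives no example for `()`. As a minimal sketch, using a sample string containing a time of day, parentheses let you read parts of a match separately through `group(1)`, `group(2)`, and so on:

```python
import re

txt = "8 times before 11:45 AM"

# Two groups: the digits before the colon and the digits after it
m = re.search(r"(\d+):(\d+)", txt)

if m:
    print(m.group())   # the whole match
    print(m.group(1))  # the first group
    print(m.group(2))  # the second group
```

Here `m.group()` is `11:45`, while the two groups hold `11` and `45`.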

In [7]:
# Find all lower case characters alphabetically between "a" and "m":
txt = 'The rain in Spain'
x = re.findall("[a-m]",txt)

print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


In [8]:
#Find all digit characters:

txt = "That will be 59 dollars"
# Use a raw string (r'...') so that Python passes the backslash to re unchanged
x = re.findall(r'\d', txt)
print(x)

['5', '9']


In [10]:
#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":

txt = "hello planet"
x = re.findall('he..o',txt)
print(x)

['hello']


In [11]:
#Check if the string starts with 'hello':

txt = "hello planet"
x = re.findall('^hello',txt)

if x:
    print('Yes')
else:
    print('No')

Yes


In [12]:
#Check if the string ends with 'planet':

x = re.findall('planet$', txt)

if x:
    print('Yes')
else:
    print('No')

Yes


In [17]:
#Search for a sequence that starts with "he", followed by 0 or more (any) characters, and an "o":
x = re.findall('he.*o',txt)
print(x)

['hello']


In [19]:
#Search for a sequence that starts with "he", followed by 1 or more (any) characters, and an "o":
x = re.findall('he.+o',txt)
print(x)

['hello']


In [15]:
#Search for a sequence that starts with "he", followed by 0 or 1 (any) character, and an "o".
#"hello" has two characters between "he" and "o", so nothing matches:
x = re.findall('he.?o',txt)
print(x)

[]
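The empty result above can look surprising. As a small sketch with made-up variations of the sample string, the same pattern does match once there are zero or one characters between "he" and "o":

```python
import re

pattern = r"he.?o"

# Zero characters between "he" and "o"
zero_chars = re.findall(pattern, "heo planet")
# One character between "he" and "o"
one_char = re.findall(pattern, "helo planet")
# Two characters between "he" and "o": one too many for .?
two_chars = re.findall(pattern, "hello planet")

print(zero_chars)  # ['heo']
print(one_char)    # ['helo']
print(two_chars)   # []
```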


In [21]:
#Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":
x = re.findall('he.{2}o',txt)
print(x)

['hello']


In [22]:
#Check if the string contains either "falls" or "stays":

txt = "The rain in Spain falls mainly in the plain!"
x = re.findall('falls|stays',txt)

if x:
    print('Yes')
else:
    print('No')

Yes


### Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

In [24]:
# \A	Returns a match if the specified characters are at the beginning of the string	"\AThe"
# Check if the string starts with "The":
x = re.findall(r"\AThe", txt)

print(x)
if x:
    print('Yes')
else:
    print('No')

['The']
Yes


In [27]:
# \b	Returns a match where the specified characters are at the beginning or at the end of a word	r"\bain"  r"ain\b"
# (the "r" prefix makes Python treat the pattern as a "raw string", so \b is not read as a backspace character)

#Check if "ain" is present at the beginning of a WORD:
txt = 'The rain in Spain'
x = re.findall(r'\bain',txt)
print(x)

if x:
    print('Yes')
else:
    print('No')

[]
No
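The note about raw strings above matters most for `\b`, because in an ordinary Python string `"\b"` is the backspace character. A short sketch of the difference, using the same sample string:

```python
import re

txt = "The rain in Spain"

# Without the r prefix, Python turns "\b" into one backspace character
plain_len = len("\bain")   # backspace, a, i, n
raw_len = len(r"\bain")    # backslash, b, a, i, n
print(plain_len, raw_len)  # 4 5

# The non-raw pattern looks for a literal backspace followed by "S"
plain_result = re.findall("\bS", txt)
# The raw pattern uses \b as a word boundary
raw_result = re.findall(r"\bS", txt)

print(plain_result)  # []
print(raw_result)    # ['S']
```

Writing every pattern as a raw string avoids having to remember which escapes Python itself interprets.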


In [30]:
#Check if "ain" is present at the end of a WORD:
x = re.findall(r'ain\b',txt)
print(x)

if x:
    print('Yes')
else:
    print('No')

['ain', 'ain']
Yes


In [31]:
# \B	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word	r"\Bain"  r"ain\B"
# (the "r" prefix makes Python treat the pattern as a "raw string")

#Check if "ain" is present, but NOT at the beginning of a word:
x = re.findall(r'\Bain', txt)
print(x)

['ain', 'ain']


In [32]:
#Check if "ain" is present, but NOT at the end of a word:

x = re.findall(r'ain\B', txt)
print(x)

[]


In [33]:
# \d	Returns a match where the string contains digits (numbers from 0-9)	r"\d"

# Check if the string contains any digits (numbers from 0-9):
x = re.findall(r'\d', txt)
print(x)

[]


In [34]:
# \D	Returns a match where the string DOES NOT contain digits	r"\D"

# Return a match at every non-digit character:
x = re.findall(r'\D', txt)
print(x)

['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']


In [35]:
# \s	Returns a match where the string contains a white space character	r"\s"

# Return a match at every white-space character:
x = re.findall(r'\s', txt)
print(x)

[' ', ' ', ' ']


In [36]:
# \S	Returns a match where the string DOES NOT contain a white space character	r"\S"

# Return a match at every NON white-space character:
x = re.findall(r'\S', txt)
print(x)

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']


In [37]:
# \w	Returns a match where the string contains any word characters (letters, digits, and the
# underscore _ character)	r"\w"

#Return a match at every word character (letters, digits, and the underscore _ character):
x = re.findall(r'\w', txt)
print(x)

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']


In [38]:
# \W	Returns a match where the string DOES NOT contain any word characters	r"\W"

#Return a match at every NON word character (anything that is not a letter, digit, or underscore, such as "!", "?", or white space):
x = re.findall(r'\W', txt)
print(x)

[' ', ' ', ' ']


In [39]:
# \Z	Returns a match if the specified characters are at the end of the string	r"Spain\Z"

# Check if the string ends with "Spain":
x = re.findall(r'Spain\Z', txt)
print(x)

['Spain']


### Sets
A set is a group of characters inside a pair of square brackets [], with a special meaning:

In [42]:
# [arn]	Returns a match where one of the specified characters (a, r, or n) is present

#Check if the string has any a, r, or n characters:
x = re.findall("[arn]",txt)
print(x)

['r', 'a', 'n', 'n', 'a', 'n']


In [43]:
# [a-n]	Returns a match for any lower case character, alphabetically between a and n

#Find all lower case characters alphabetically between a and n:
x = re.findall('[a-n]',txt)
print(x)

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']


In [44]:
# [^arn]	Returns a match for any character EXCEPT a, r, and n

#Check if the string has other characters than a, r, or n:
x = re.findall('[^arn]',txt)
print(x)

['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']


In [45]:
# [0123]	Returns a match where any of the specified digits (0, 1, 2, or 3) are present

# Check if the string has any 0, 1, 2, or 3 digits:
x = re.findall('[0123]',txt)
print(x)

[]


In [46]:
# [0-9]	Returns a match for any digit between 0 and 9

#Check if the string has any digits:
x = re.findall('[0-9]',txt)
print(x)

[]


In [47]:
# [0-5][0-9]	Returns a match for any two-digit number from 00 to 59

#Check if the string has any two-digit numbers, from 00 to 59:
txt = "8 times before 11:45 AM"
x = re.findall('[0-5][0-9]',txt)
print(x)

['11', '45']
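Note that `[0-5][0-9]` matches any two adjacent digits in range, including pairs inside a longer number. The sketch below uses a made-up string to show this, and adds `\b` word boundaries as one possible way to keep only stand-alone two-digit numbers:

```python
import re

# Hypothetical sample string containing a four-digit number
txt = "Order 1234 shipped at 11:45"

# Pairs are also found inside "1234"
anywhere = re.findall(r"[0-5][0-9]", txt)
# \b on both sides requires the pair to stand on its own
standalone = re.findall(r"\b[0-5][0-9]\b", txt)

print(anywhere)    # ['12', '34', '11', '45']
print(standalone)  # ['11', '45']
```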


In [48]:
# [a-zA-Z]	Returns a match for any character alphabetically between a and z, lower case OR upper case

# Check if the string has any characters from a to z lower case, and A to Z upper case:
x = re.findall('[a-zA-Z]',txt)
print(x)

['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']


In [50]:
# [+]	In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string

# Check if the string has any + characters:
x = re.findall('[+]',txt)
print(x)

[]
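The sample string above contains no `+`, so the result is empty. As a small sketch with a made-up string that does contain such characters, the difference between a set and an escaped character looks like this:

```python
import re

# Hypothetical sample string containing + and *
txt = "2+2=4 and 3*3=9"

# Inside a set, + and * are plain characters
operators = re.findall(r"[+*]", txt)
# Outside a set, + has to be escaped to be matched literally
plus_signs = re.findall(r"\+", txt)

print(operators)   # ['+', '*']
print(plus_signs)  # ['+']
```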


### The findall() Function
The findall() function returns a list containing all matches.

The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

In [52]:
# Return an empty list if no match was found:

txt = 'The rain in Spain'
x = re.findall('Portugal',txt)
print(x)

[]


### The search() Function
The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

In [54]:
# Search for the first white-space character in the string:

x = re.search(r'\s', txt)
print('The first white space position is', x.start())

The first white space position is 3


In [55]:
# If no matches are found, the value None is returned:

x = re.search('Portugal',txt)
print(x)

None
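Because search() returns None when nothing matches, calling a method such as `.start()` on the result directly would raise an AttributeError. One way to guard against that is sketched below; `first_position` is a hypothetical helper name, not part of the re module:

```python
import re

txt = "The rain in Spain"

def first_position(pattern, text):
    """Hypothetical helper: start index of the first match, or -1 if none."""
    m = re.search(pattern, text)
    # m is None when nothing matches, so test it before calling m.start()
    return m.start() if m else -1

found = first_position(r"\s", txt)
missing = first_position("Portugal", txt)

print(found)    # 3
print(missing)  # -1
```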


### The split() Function
The split() function returns a list where the string has been split at each match:

In [57]:
# Split at each white-space character:

x = re.split(r'\s', txt)
print(x)

['The', 'rain', 'in', 'Spain']


In [58]:
# You can control the number of splits by specifying the maxsplit parameter:

# Split the string only at the first occurrence:
x = re.split(r'\s', txt, maxsplit=1)
print(x)

['The', 'rain in Spain']
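Splitting at every single white-space character produces empty strings when the text contains runs of spaces. A short sketch with a made-up string shows this, with `\s+` as one way to treat each run as a single separator:

```python
import re

# Hypothetical sample string with repeated spaces
txt = "The  rain   in Spain"

# \s splits at each individual white-space character
single = re.split(r"\s", txt)
# \s+ treats a whole run of white space as one separator
runs = re.split(r"\s+", txt)

print(single)  # ['The', '', 'rain', '', '', 'in', 'Spain']
print(runs)    # ['The', 'rain', 'in', 'Spain']
```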


### The sub() Function
The sub() function replaces the matches with the text of your choice:

In [60]:
# Replace every white-space character with the number 9:

x = re.sub(r'\s', '9', txt)
print(x)

The9rain9in9Spain


In [61]:
# You can control the number of replacements by specifying the count parameter:

# Replace the first 2 occurrences:
x = re.sub(r'\s', '9', txt, count=2)
print(x)

The9rain9in Spain
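The replacement text does not have to be fixed. As a sketch that goes slightly beyond the examples above, the replacement string can refer back to groups captured with `()` by writing `\1`, `\2`, and so on:

```python
import re

txt = "8 times before 11:45 AM"

# \1 and \2 in the replacement stand for the first and second group
x = re.sub(r"(\d+):(\d+)", r"\1h \2min", txt)
print(x)  # 8 times before 11h 45min AM
```

The replacement is written as a raw string too, so that `\1` reaches re as a group reference.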


### Match Object
A Match Object is an object containing information about the search and the result.

Note: If there is no match, the value None will be returned, instead of the Match Object.

In [62]:
# Do a search that will return a Match Object:
x = re.search('ai',txt)
print(x) #this will print an object

<re.Match object; span=(5, 7), match='ai'>


The Match object has properties and methods used to retrieve information about the search, and the result:

**.span()** returns a tuple containing the start and end positions of the match. <br>
**.string** returns the string passed into the function <br>
**.group()** returns the part of the string where there was a match

In [64]:
# Print the position (start- and end-position) of the first match occurrence.
# The regular expression looks for any word that starts with an upper case "S":

x = re.search(r'\bS\w+', txt)
print(x.span())

(12, 17)


In [65]:
# Print the string passed into the function:
print(x.string)

The rain in Spain


In [66]:
# Print the part of the string where there was a match.
# The regular expression looks for any word that starts with an upper case "S":
print(x.group())

# Note: If there is no match, search() returns None instead of a Match object.

Spain
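The three pieces of information above fit together: slicing `.string` with the positions from `.span()` gives the same text as `.group()`. A minimal sketch, using the `.start()` and `.end()` methods alongside the ones listed:

```python
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)

if x:
    start, end = x.span()
    # span() is the pair (x.start(), x.end())
    same_positions = (start, end) == (x.start(), x.end())
    # Slicing the original string with the span gives the matched text
    sliced = x.string[start:end]
    print(same_positions)        # True
    print(sliced == x.group())   # True
```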
