# **Python RegEx**

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

# **RegEx Module**

Python has a built-in package called re, which can be used to work with Regular Expressions.

Import the re module:

In [1]:
import re

# **RegEx in Python**

When you have imported the re module, you can start using regular expressions.

In [2]:
# Search the string to see if it starts with "The" and ends with "Spain":

import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$",txt)
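
The cell above stores the result but never inspects it. As a minimal sketch (the printed messages are illustrative, not part of the original cell), the result can be tested directly, because `re.search()` returns a Match object on success and `None` otherwise:

```python
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

# A Match object is truthy and None is falsy,
# so the result can be used directly in a condition.
if x:
    print("YES! We have a match!")
else:
    print("No match")
```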

# **RegEx Functions**

The re module offers a set of functions that allow us to search a string for a match:

**The findall() Function**

The findall() function returns a list containing all matches.

In [3]:
# findall  -  The findall() function returns a list containing all matches.

import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

In [4]:
# Return an empty list if no match was found:

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]
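
One detail worth keeping in mind is that findall() reports non-overlapping matches, scanning from left to right. The short sketch below uses made-up sample strings to illustrate this, and also shows that counting matches is simply a matter of taking the length of the returned list:

```python
import re

# Matches do not overlap: after "aa" is matched at index 0,
# scanning resumes at index 2, where only a single "a" remains.
overlapping = re.findall("aa", "aaa")
print(overlapping)

# The number of matches is the length of the returned list.
count = len(re.findall("ai", "The rain in Spain"))
print(count)
```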


**The search() Function**

The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

In [5]:
# search -	Returns a Match object if there is a match anywhere in the string.

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


If no matches are found, the value None is returned:

In [6]:
# Make a search that returns no match:

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None
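
Because search() may return None, calling a method such as .start() on the result without checking it first raises an AttributeError when nothing matched. A minimal sketch of a guard follows; the helper name `first_position` and the -1 fallback are assumptions made for illustration only:

```python
import re

txt = "The rain in Spain"

def first_position(pattern, text):
    """Return the start index of the first match, or -1 if there is none."""
    m = re.search(pattern, text)
    return m.start() if m is not None else -1

print(first_position(r"\s", txt))
print(first_position("Portugal", txt))
```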


**The split() Function**

The split() function returns a list where the string has been split at each match:

In [7]:
# split -	Returns a list where the string has been split at each match.

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


You can control the number of occurrences by specifying the maxsplit parameter:

In [8]:
# Split the string only at the first occurrence:

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']
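
When the input contains runs of white space, splitting at every single white-space character produces empty strings. The sketch below, which uses a sample string with repeated spaces chosen for illustration, compares `\s` with `\s+` and passes maxsplit as a keyword argument:

```python
import re

txt = "The  rain in   Spain"  # note the repeated spaces

# Splitting at each white-space character leaves empty strings
# between consecutive spaces.
single = re.split(r"\s", txt)
print(single)

# Splitting at runs of white space avoids the empty strings.
runs = re.split(r"\s+", txt)
print(runs)

# maxsplit limits the number of splits; the rest is left untouched.
limited = re.split(r"\s+", txt, maxsplit=2)
print(limited)
```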


**The sub() Function**

The sub() function replaces the matches with the text of your choice:

In [9]:
# sub -	Replaces one or many matches with a string.

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


You can control the number of replacements by specifying the count parameter:

In [10]:
# Replace the first 2 occurrences:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain
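
A common use of sub() is tidying up white space. The following is only a sketch with an invented sample string; it collapses each run of white space into a single space and shows count passed as a keyword argument:

```python
import re

txt = "The   rain  in Spain"

# Replace every run of white space with exactly one space.
collapsed = re.sub(r"\s+", " ", txt)
print(collapsed)

# With count=1 only the first run is replaced.
first_only = re.sub(r"\s+", "-", txt, count=1)
print(first_only)
```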


# **Match Object**

A Match Object is an object containing information about the search and the result.

*Note: If there is no match, the value None will be returned, instead of the Match Object.*


In [11]:
# Do a search that will return a Match Object:

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)

<re.Match object; span=(5, 7), match='ai'>


The Match object has properties and methods used to retrieve information about the search, and the result:

**.span()** returns a tuple containing the start and end positions of the match.  
**.string** returns the string passed into the function.                             
**.group()** returns the part of the string where there was a match.

In [12]:
# Print the position (start- and end-position) of the first match occurrence.

# The regular expression looks for any word that starts with an upper case "S":

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


In [13]:
# Print the string passed into the function:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain


In [14]:
# Print the part of the string where there was a match.

# The regular expression looks for any word that starts with an upper case "S":
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain
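
The three cells above can be tied together in one sketch: the positions reported by .span() index into .string, and slicing with them yields the same text as .group(). The .start() and .end() methods used here are standard Match methods that the cells above do not show:

```python
import re

txt = "The rain in Spain"
m = re.search(r"\bS\w+", txt)

if m is not None:
    start, end = m.span()
    # .start() and .end() return the two halves of .span().
    print(m.start(), m.end())
    # Slicing the original string with the span gives the matched text.
    print(m.string[start:end] == m.group())
```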


# **Metacharacters**

In [15]:
# [] 	A set of characters.

txt = "The rain in Spain"

#Find all lower case characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


In [16]:
# \ 	Signals a special sequence (can also be used to escape special characters)

txt = "That will be 59 dollars"

#Find all digit characters:

x = re.findall(r"\d", txt)
print(x)

['5', '9']


In [17]:
# . 	Any character (except newline character) 	"pl...t"

txt = "hello planet"

#Search for a sequence that starts with "pl", followed by three (any) characters, and a "t":
# (the pattern "he..o" would match "hello" in the same way)

x = re.findall("pl...t", txt)
print(x)

['planet']


In [18]:
# ^ 	Starts with 	"^hello"

txt = "hello planet"

#Check if the string starts with 'hello':

x = re.findall("^hello", txt)
if x:
  print("Yes, the string starts with 'hello'")
else:
  print("No match")

Yes, the string starts with 'hello'


In [19]:
# $ 	Ends with 	"planet$"

txt = "hello planet"

x = re.findall("planet$", txt)
if x:
  print("Yes, the string ends with 'planet'")
else:
  print("No match")

Yes, the string ends with 'planet'


In [20]:
# * 	Zero or more occurrences 	"he.*o"

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or more  (any) characters, and an "o":

x = re.findall("he.*o", txt)

print(x)

['hello']


In [21]:
# + 	One or more occurrences 	"he.+o"

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more  (any) characters, and an "o":

x = re.findall("he.+o", txt)

print(x)

['hello']


In [22]:
# ? 	Zero or one occurrences 	"he.?o"

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1  (any) character, and an "o":

x = re.findall("he.?o", txt)

print(x)

#This time we got no match, because there were not zero, not one, but two characters between "he" and the "o"

[]


In [23]:
# {} 	Exactly the specified number of occurrences 	"he.{2}o"

txt = "hello planet"

#Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":

x = re.findall("he.{2}o", txt)

print(x)

['hello']
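
To see how the four quantifiers differ, it can help to run them against strings with zero, one, and two characters between "he" and "o". The sample strings in this sketch are made up for the comparison:

```python
import re

patterns = ["he.*o", "he.+o", "he.?o", "he.{2}o"]
samples = ["heo", "helo", "hello"]  # 0, 1 and 2 characters between "he" and "o"

# For each pattern, record whether it finds a match in each sample.
results = {p: [bool(re.findall(p, s)) for s in samples] for p in patterns}

for pattern, found in results.items():
    print(f"{pattern:10} {found}")
```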


In [24]:
# | 	Either or 	"falls|stays"

txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":

x = re.findall("falls|stays", txt)

print(x)

['falls']


# **Special Sequences**

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

In [26]:
# \A 	Returns a match if the specified characters are at the beginning of the string 	"\AThe"

import re

txt = "The Rain In Spain"

x = re.findall(r"\AThe", txt)
print(x)

if x:
  print("Yes, There is a Match!..")
else:
  print("No Match!..")

['The']
Yes, There is a Match!..


In [30]:
# \b 	Returns a match where the specified characters are at the beginning or at the end of a word   r"\bain"

txt = "The rain in Spain"

#Check if "ain" is present at the beginning of a WORD:

x = re.findall(r"\bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [31]:
# \b (the "r" in the beginning is making sure that the string is being treated as a "raw string")   r"ain\b"

txt = "The rain in Spain"

#Check if "ain" is present at the end of a WORD:

x = re.findall(r"ain\b", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!
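
The raw-string prefix matters for `\b` in particular: in an ordinary Python string literal, `"\b"` is the backspace character, so the regular expression engine never sees a word-boundary sequence. A small sketch of the difference:

```python
import re

txt = "The rain in Spain"

# "\b" is one character (backspace); r"\b" is a backslash followed by "b".
print(len("\b"), len(r"\b"))

# Without the r prefix the pattern looks for "ain" followed by a backspace.
plain = re.findall("ain\b", txt)
print(plain)

# With the r prefix the pattern looks for "ain" at the end of a word.
raw = re.findall(r"ain\b", txt)
print(raw)
```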


In [32]:
# \B 	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word    r"\Bain"

txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the beginning of a word:

x = re.findall(r"\Bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!


In [33]:
# (the "r" in the beginning is making sure that the string is being treated as a "raw string")     r"ain\B"

txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the end of a word:

x = re.findall(r"ain\B", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [34]:
# \d 	Returns a match where the string contains digits (numbers from 0-9) 	"\d"

txt = "The rain in Spain"

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [35]:
# \D 	Returns a match where the string DOES NOT contain digits 	"\D"

txt = "The rain in Spain"

#Return a match at every non-digit character:

x = re.findall(r"\D", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No Match!")

['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [36]:
# \s 	Returns a match where the string contains a white space character 	"\s"

txt = "The rain in Spain"

#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


In [37]:
# \S 	Returns a match where the string DOES NOT contain a white space character 	"\S"

txt = "The rain in Spain"

#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [38]:
# \w 	Returns a match where the string contains any word characters (letters a-z and A-Z, digits 0-9, and the underscore _ character; for str patterns, other Unicode letters and digits also count) 	r"\w"

txt = "The rain in Spain"

#Return a match at every word character (letters, digits 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [39]:
# \W 	Returns a match at every character that is NOT a word character 	r"\W"

txt = "The rain in Spain"

x = re.findall(r"\W", txt)
print(x)

if x:
  print("Yes")
else:
  print("No")

[' ', ' ', ' ']
Yes


In [40]:
# \Z 	Returns a match if the specified characters are at the end of the string 	"Spain\Z"

txt = "The rain in Spain"

x = re.findall(r"Spain\Z", txt)
print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")

['Spain']
Yes, there is a match!
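
`\Z` is stricter than the `$` metacharacter shown earlier: in Python, `$` also matches just before a newline at the very end of the string, while `\Z` matches only at the true end. The sketch below uses a sample string with a trailing newline, added here for illustration:

```python
import re

txt = "The rain in Spain\n"  # note the trailing newline

# $ tolerates a single trailing newline.
dollar = re.findall(r"Spain$", txt)
print(dollar)

# \Z requires "Spain" to be the very last characters of the string.
strict_end = re.findall(r"Spain\Z", txt)
print(strict_end)
```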


# **Sets**

A set is a set of characters inside a pair of square brackets [ ] with a special meaning:

In [41]:
# [arn] 	Returns a match where one of the specified characters (a, r, or n) is present

txt = "The rain in Spain"

#Check if the string has any a, r, or n characters:

x = re.findall("[arn]", txt)

print(x)

['r', 'a', 'n', 'n', 'a', 'n']


In [42]:
# [a-n] 	Returns a match for any lower case character, alphabetically between a and n

txt = "The rain in Spain"

#Check if the string has any characters between a and n:

x = re.findall("[a-n]", txt)

print(x)

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']


In [43]:
# [^arn] 	Returns a match for any character EXCEPT a, r, and n

txt = "The rain in Spain"

#Check if the string has other characters than a, r, or n:

x = re.findall("[^arn]", txt)

print(x)

['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']


In [44]:
# [0123] 	Returns a match where any of the specified digits (0, 1, 2, or 3) are present

txt = "The rain in Spain"

#Check if the string has any 0, 1, 2, or 3 digits:

x = re.findall("[0123]", txt)

print(x)

[]


In [45]:
# [0-9] 	Returns a match for any digit between 0 and 9

txt = "8 times before 11:45 AM"

#Check if the string has any digits:

x = re.findall("[0-9]", txt)

print(x)

['8', '1', '1', '4', '5']


In [46]:
# [0-5][0-9] 	Returns a match for any two-digit number from 00 to 59

txt = "8 times before 11:45 AM"

#Check if the string has any two-digit numbers, from 00 to 59:

x = re.findall("[0-5][0-9]", txt)

print(x)

['11', '45']
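
Note that `[0-5][0-9]` matches any two adjacent characters that fit, including the first two digits of a longer number. As a sketch, assuming a sample string extended with a year for illustration, the `\b` sequence from the previous section can restrict the match to standalone two-digit numbers:

```python
import re

txt = "8 times before 11:45 AM in 1999"

# Without boundaries, the "19" inside "1999" is matched as well.
loose = re.findall("[0-5][0-9]", txt)
print(loose)

# Word boundaries on both sides keep only standalone two-digit numbers.
strict = re.findall(r"\b[0-5][0-9]\b", txt)
print(strict)
```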


In [47]:
# [a-zA-Z] 	Returns a match for any character alphabetically between a and z, lower case OR upper case

txt = "8 times before 11:45 AM"

#Check if the string has any characters from a to z lower case, and A to Z upper case:

x = re.findall("[a-zA-Z]", txt)

print(x)

['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']


In [50]:
# [+] 	In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string

txt = "8 times before 11:45 AM"

#Check if the string has any + characters:

x = re.findall("[+]", txt)

print(x)

[]
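
The sample string above contains no + character, so the result is empty. The following sketch uses a made-up string that does contain such characters, to show metacharacters being treated literally inside a set, alongside the backslash escape needed outside a set:

```python
import re

txt = "1+1=2 (approx.)"

# Inside a set, +, ., ( and ) are ordinary characters.
in_set = re.findall("[+.()]", txt)
print(in_set)

# Outside a set, + is a quantifier and must be escaped to match literally.
escaped = re.findall(r"\+", txt)
print(escaped)
```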
