### RegEx Module
Python has a built-in package called re, which can be used to work with Regular Expressions.

Import the re module:

In [1]:
import re

#### Example
Search the string to see if it starts with "The" and ends with "Spain":

In [2]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt) # Returns a Match object, or None if there is no match

print(x)

if x:
    print('OK')
else:
    print('NO')

<re.Match object; span=(0, 17), match='The rain in Spain'>
OK


### Split at each white-space character:

In [3]:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
s = ' '.join(x)  # rebuild the sentence with a single space between the words
print(s)

['The', 'rain', 'in', 'Spain']
The rain in Spain


#### You can control the number of splits by specifying the maxsplit parameter:

In [4]:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']
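
Splitting on a single `\s` can behave unexpectedly when the text has irregular spacing. The following is a minimal sketch, assuming a made-up sample string with a double space and a tab, that compares `\s` with `\s+` and passes `maxsplit` by keyword:

```python
import re

# Hypothetical sample text: two spaces after "The", then a tab after "rain".
txt = "The  rain\tin Spain"

# r"\s" splits at every single white-space character, so the double
# space leaves an empty string between the two split points.
per_char = re.split(r"\s", txt)

# r"\s+" treats each run of white-space as one separator.
per_run = re.split(r"\s+", txt)

# With maxsplit=1 only the first separator is used; the rest stays whole.
first_only = re.split(r"\s+", txt, maxsplit=1)

print(per_char)    # ['The', '', 'rain', 'in', 'Spain']
print(per_run)     # ['The', 'rain', 'in', 'Spain']
print(first_only)  # ['The', 'rain\tin Spain']
```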


### The sub() Function
The sub() function replaces the matches with the text of your choice:

##### Replace every white-space character with "--":

In [5]:
txt = "The rain in Spain"
x = re.sub(r"\s", "--", txt)
print(x)

The--rain--in--Spain


##### You can control the number of replacements by specifying the count parameter:

##### Replace the first 2 occurrences:

In [6]:
txt = "The rain in Spain"
x = re.sub(r"\s", "---", txt, count=2)
print(x)

The---rain---in Spain
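
When it is useful to know how many replacements were made, `re.subn` can be used in place of `re.sub`. This is a small sketch reusing the same sample string; the variable names are illustrative only:

```python
import re

txt = "The rain in Spain"

# count limits how many matches are replaced, scanning left to right.
first_two = re.sub(r"\s", "---", txt, count=2)

# re.subn does the same replacement but returns a tuple:
# (new_string, number_of_replacements_made).
replaced, n = re.subn(r"\s", "---", txt)

print(first_two)    # The---rain---in Spain
print(replaced, n)  # The---rain---in---Spain 3
```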


#### Metacharacters
Metacharacters are characters with a special meaning:

##### []:	A set of characters	"[a-m]"

In [7]:
txt = "The rain in Spain"

#Find all lower case characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)


['h', 'e', 'a', 'i', 'i', 'a', 'i']


#### \ :	Signals a special sequence (can also be used to escape special characters)	"\d"

In [8]:
import re

txt = "That will be 59 dollars"

#Find all digit characters:

x = re.findall(r"\d", txt)
print(x)


['5', '9']


##### . : Any character (except newline character)	"he..o"

In [9]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by two (any) characters, and a "t":

x = re.findall("he..t", txt)
print(x)

# No match: the fifth character of "hello" is "o", not "t". The pattern "he..o" would return ['hello'].
# Note on a.*b: any characters may appear between a and b, so if a substring starts with a and ends with b, the whole span from a to b is returned.


[]


##### ^	Starts with	"^hello"

In [10]:
import re

txt = "hello planet"

#Check if the string starts with 'he':

x = re.findall("^he", txt)
if x:
  print("Yes, the string starts with 'he'")
else:
  print("No match")


Yes, the string starts with 'he'


##### $ (dollar sign) : Ends with	"planet$"

In [11]:
import re

txt = "hello planet"

#Check if the string ends with 'planet':

x = re.findall("planet$", txt)
if x:
  print("Yes, the string ends with 'planet'")
else:
  print("No match")

Yes, the string ends with 'planet'


##### *	Zero or more occurrences	"he.*o"

In [12]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or more (any) characters, and an "l":

x = re.findall("he.*l", txt)

print(x)

['hello pl']
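
The result above runs all the way to the last "l" in the string because `*` is greedy: it consumes as many characters as it can while still allowing a match. A short sketch of the difference between the greedy `*` and the lazy `*?` on the same sample string:

```python
import re

txt = "hello planet"

# Greedy: .* grabs as much as possible, so the match ends at the LAST "l".
greedy = re.findall(r"he.*l", txt)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST "l".
lazy = re.findall(r"he.*?l", txt)

print(greedy)  # ['hello pl']
print(lazy)    # ['hel']
```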


##### +	One or more occurrences	"he.+o"

In [13]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more (any) characters, and an "l":

x = re.findall("he.+l", txt)

print(x)


['hello pl']


#### ?	Zero or one occurrences	"he.?o"

In [16]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1 (any) character, and an "l":

x = re.findall("he.?l", txt)

print(x)

#This matches "hell": the optional character is the first "l", followed by the required "l". The pattern "he.?o" would find no match, because two characters separate "he" from the "o".


['hell']


#### {}	Exactly the specified number of occurrences	"he.{2}o"

In [17]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":

x = re.findall("he.{2}o", txt)

print(x)

['hello']
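
Braces also accept a range. The sketch below, on the same sample string, shows `{m,n}` (between m and n occurrences) and `{m,}` (m or more) alongside the exact form; the variable names are illustrative:

```python
import re

txt = "hello planet"

# Exactly two (any) characters between "he" and "o".
exactly_two = re.findall(r"he.{2}o", txt)

# Runs of one to three "l" characters.
one_to_three = re.findall(r"l{1,3}", txt)

# Runs of at least two "l" characters.
at_least_two = re.findall(r"l{2,}", txt)

print(exactly_two)   # ['hello']
print(one_to_three)  # ['ll', 'l']
print(at_least_two)  # ['ll']
```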


#### |	Either or	"falls|stays"

In [18]:
import re

txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":

x = re.findall("falls|stays", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['falls']
Yes, there is at least one match!
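
The `|` operator has low precedence, so it splits the whole pattern unless it is wrapped in a group. A minimal sketch, assuming we want either verb followed by " mainly":

```python
import re

txt = "The rain in Spain falls mainly in the plain!"

# Without a group the alternatives are "falls" OR "stays mainly".
ungrouped = re.findall(r"falls|stays mainly", txt)

# A non-capturing group (?:...) limits the alternation to the two verbs,
# so " mainly" must follow whichever verb matched.
grouped = re.findall(r"(?:falls|stays) mainly", txt)

print(ungrouped)  # ['falls']
print(grouped)    # ['falls mainly']
```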


## Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

#### \A	Returns a match if the specified characters are at the beginning of the string	"\AThe"

In [19]:
import re

txt = "The rain in Spain The The"

#Check if the string starts with "The":

x = re.findall(r"\AThe", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")


['The']
Yes, there is a match!


#### \b	Returns a match where the specified characters are at the beginning or at the end of a word (the "r" in the beginning makes sure that the string is treated as a "raw string")	r"\bain" or r"ain\b"

In [20]:
import re

txt = "The rain in Spain"

#Check if "rai" is present at the beginning of a WORD:

x = re.findall(r"\brai", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['rai']
Yes, there is at least one match!


In [21]:
import re

txt = "The rain in Spain"

#Check if "ain" is present at the end of a WORD:

x = re.findall(r"ain\b", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['ain', 'ain']
Yes, there is at least one match!
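
The raw-string prefix matters for `\b` in particular, because in a normal Python string literal `"\b"` is the backspace character rather than a backslash followed by "b". A short sketch of the difference on the same sample string:

```python
import re

txt = "The rain in Spain"

# "\b" is one character (backspace); r"\b" is two (backslash and "b").
print(len("\b"), len(r"\b"))  # 1 2

# Without the r prefix the pattern means "ain" followed by a backspace.
without_raw = re.findall("ain\b", txt)

# With the r prefix the regex engine sees the word-boundary token.
with_raw = re.findall(r"ain\b", txt)

print(without_raw)  # []
print(with_raw)     # ['ain', 'ain']
```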


#### \d	Returns a match where the string contains digits (numbers from 0-9)	"\d"

In [22]:
import re

txt = "The ra090in i2929n Sp25235ain"

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['0', '9', '0', '2', '9', '2', '9', '2', '5', '2', '3', '5']
Yes, there is at least one match!


#### \D	Returns a match where the string DOES NOT contain digits	"\D"

In [23]:
import re

txt = "The rain in Spain"

#Return a match at every non-digit character:

x = re.findall(r"\D", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


#### \s	Returns a match where the string contains a white space character	"\s"

In [24]:
import re

txt = "The rain in Spain"

#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', ' ', ' ']
Yes, there is at least one match!


#### \S	Returns a match where the string DOES NOT contain a white space character	"\S"

In [25]:
import re

txt = "The rain in Spain"

#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


#### \w	Returns a match where the string contains any word characters (letters, digits from 0-9, and the underscore _ character)	"\w"

In [26]:
import re

txt = "The rain in Spain"

#Return a match at every word character (letters, digits from 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!
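
For `str` patterns, `\w` is Unicode-aware by default, so accented letters also count as word characters. The sketch below, using a made-up sample string, compares the default behaviour with the `re.ASCII` flag, which restricts `\w` to `[a-zA-Z0-9_]`:

```python
import re

# Hypothetical sample: "Año_2024 café!" written with escapes so the
# accented letters are single code points regardless of file encoding.
txt = "A\u00f1o_2024 caf\u00e9!"

# Default (Unicode) matching: accented letters are word characters.
unicode_words = re.findall(r"\w+", txt)

# re.ASCII limits \w to a-z, A-Z, 0-9 and the underscore.
ascii_words = re.findall(r"\w+", txt, flags=re.ASCII)

print(unicode_words)  # ['Año_2024', 'café']
print(ascii_words)    # ['A', 'o_2024', 'caf']
```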


#### \W	Returns a match where the string DOES NOT contain any word characters	"\W"

In [30]:
import re

txt = "900768 ! @ # $ % ^ T93295he r25252ain 25252in Spain"

#Return a match at every NON word character (anything other than letters, digits, and the underscore, such as "!", "?", and white-space):

x = re.findall(r"\W", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', '!', ' ', '@', ' ', '#', ' ', '$', ' ', '%', ' ', '^', ' ', ' ', ' ', ' ']
Yes, there is at least one match!


#### \Z	Returns a match if the specified characters are at the end of the string	"Spain\Z"

In [33]:
import re

txt = "The rain in Spain"

#Check if the string ends with "Spain":

x = re.findall(r"Spain\Z", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")


['Spain']
Yes, there is a match!
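
`\A` and `\Z` look similar to `^` and `$`, but they differ on multi-line input. A minimal sketch, assuming two made-up sample strings, that shows where the behaviour diverges:

```python
import re

two_lines = "The rain\nThe plain"

# By default ^ only matches at the very start of the string.
caret_default = re.findall(r"^The", two_lines)

# With re.MULTILINE, ^ also matches right after each newline.
caret_multiline = re.findall(r"^The", two_lines, flags=re.MULTILINE)

# \A ignores re.MULTILINE and still matches only at the very start.
start_only = re.findall(r"\AThe", two_lines, flags=re.MULTILINE)

trailing_newline = "The plain\n"

# $ also matches just before a trailing newline; \Z does not.
dollar = re.findall(r"plain$", trailing_newline)
end_only = re.findall(r"plain\Z", trailing_newline)

print(caret_default)    # ['The']
print(caret_multiline)  # ['The', 'The']
print(start_only)       # ['The']
print(dollar)           # ['plain']
print(end_only)         # []
```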


### Sets
A set is a set of characters inside a pair of square brackets [] with a special meaning:

In [34]:
import re

txt = "The rain in Spain"

#Check if the string has any a, i, or n characters:

x = re.findall("[ain]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!


In [35]:
import re

txt = "The rain in Spain"

#Check if the string has any characters between a and n:

x = re.findall("[a-n]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!


In [None]:
import re

txt = "The rain in Spain"

#Check if the string has other characters than a, r, or n:

x = re.findall("[^arn]", txt) # ^ at the start of a set excludes a, r, and n

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")
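
As a quick check of how a negated set behaves, here is a small sketch on the same sample string. It also shows that `^` only negates when it is the first character of the set; elsewhere it is an ordinary character (the second sample string is made up for illustration):

```python
import re

txt = "The rain in Spain"

# Every character that is NOT a, r, or n (spaces included).
others = re.findall(r"[^arn]", txt)

# ^ in any position other than the first is a literal caret.
literal_caret = re.findall(r"[a^]", "a^b")

print(others)         # ['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']
print(literal_caret)  # ['a', '^']
```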


In [36]:
import re

txt = "Th262e ra121090252in 252in352 Spain"

#Check if the string has any 0, 1, 2, or 3 digits:

x = re.findall("[0123]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['2', '2', '1', '2', '1', '0', '0', '2', '2', '2', '2', '3', '2']
Yes, there is at least one match!


In [37]:
import re

txt = "8 times before 11:45 AM"

#Check if the string has any digits:

x = re.findall("[0-9]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['8', '1', '1', '4', '5']
Yes, there is at least one match!


#### [0-5][0-9]	Returns a match for any two-digit numbers from 00 to 59

In [39]:
import re

txt = "08 times before 11:45 AM"

#Check if the string has any two-digit numbers, from 00 to 59:

x = re.findall("[0-5][0-9]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['08', '11', '45']
Yes, there is at least one match!
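
Note that `[0-5][0-9]` matches any two adjacent digits in that range, including pairs inside a longer number. If only standalone two-digit numbers are wanted, one option is to add word boundaries. A sketch with a made-up sample string:

```python
import re

# Hypothetical sample containing a four-digit year and a time.
txt = "Released in 2023 at 11:45, build 7"

# Matches digit pairs anywhere, so "2023" is split into "20" and "23".
loose = re.findall(r"[0-5][0-9]", txt)

# \b on both sides requires the pair to stand alone as a whole number.
strict = re.findall(r"\b[0-5][0-9]\b", txt)

print(loose)   # ['20', '23', '11', '45']
print(strict)  # ['11', '45']
```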


In [40]:
import re

txt = "8 times before 11:45 AM"

#Check if the string has any characters from a to z lower case, and A to Z upper case:

x = re.findall("[a-zA-Z]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']
Yes, there is at least one match!


#### [+]	In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string


In [43]:
import re

txt = "8 times #%# %$%# before 11:45 AM"

#Check if the string has any # characters:

x = re.findall("[#]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['#', '#', '#']
Yes, there is at least one match!
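
A few characters do keep a special role inside a set: `^` in first position, `-` between two characters, `]`, and the backslash. The sketch below, on a made-up sample string, contrasts the characters that are literal in a set with the ones that still need escaping:

```python
import re

# Hypothetical sample containing several punctuation characters.
txt = "1+1=2 (a*b) [x-y] ^z"

# Inside a set, + needs no escaping.
plus_in_set = re.findall(r"[+]", txt)

# Outside a set, + is a quantifier and must be escaped to match literally.
plus_escaped = re.findall(r"\+", txt)

# Several metacharacters listed together are all literal in a set.
mixed = re.findall(r"[+*().]", txt)

# ^, - and ] are escaped here so they are matched as ordinary characters.
specials = re.findall(r"[\^\-\]]", txt)

print(plus_in_set)   # ['+']
print(plus_escaped)  # ['+']
print(mixed)         # ['+', '(', '*', ')']
print(specials)      # ['-', ']', '^']
```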


### Match Object
A Match Object is an object containing information about the search and the result.

##### Note: If there is no match, the value None will be returned, instead of the Match Object.

In [14]:
import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) #this will print an object

if x:
    print('Found')
else:
    print('Not found')

<re.Match object; span=(5, 7), match='ai'>
Found


#### The Match object has properties and methods used to retrieve information about the search, and the result:

###### .span() returns a tuple containing the start and end positions of the match.

###### .string returns the string passed into the function

###### .group() returns the part of the string where there was a match

#### Print the position (start- and end-position) of the first match occurrence.

The regular expression looks for any word that starts with an upper case "S":

In [15]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)
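
The other properties and methods listed above can be sketched in the same way. Because `re.search` returns `None` when nothing matches, the example checks the result before using it; the second pattern is a made-up one chosen so that it does not match:

```python
import re

txt = "The rain in Spain"
m = re.search(r"\bS\w+", txt)

if m is not None:
    print(m.span())              # (12, 17)
    print(m.start(), m.end())    # 12 17
    print(m.group())             # Spain
    print(m.string)              # The rain in Spain
    # The span can be used to slice the original string.
    print(txt[m.start():m.end()])  # Spain

# No word in the sample starts with "Z", so search returns None
# and calling .span() on it would raise AttributeError.
missing = re.search(r"\bZ\w+", txt)
print(missing)  # None
```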
