# Regular Expression in Python

<h3 style='color:blue'> Use raw string to define a regex</h3>

- As you may already know, the backslash has a special meaning in Python string literals because it introduces an escape sequence. To avoid unintended escapes, always write a regex as a raw string.

- For example, let's say that in Python we are defining a string that is actually a path to an exercise folder like this `path = "c:\example\task\new"`.

- The `r` prefix creates a raw string: Python leaves every backslash in the literal untouched, so the special sequences and metacharacters reach the `re` library exactly as written.
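To see the effect of the `r` prefix in isolation, here is a minimal sketch comparing how Python stores a normal and a raw string literal. The variable names `plain` and `raw` are illustrative and not part of the original notebook.

```python
# In a normal string literal, Python converts \t into a single tab character.
plain = "a\tb"
# In a raw string literal, the backslash and the 't' stay as two characters.
raw = r"a\tb"

print(len(plain))      # 3 characters: 'a', tab, 'b'
print(len(raw))        # 4 characters: 'a', backslash, 't', 'b'
print(raw == "a\\tb")  # True: the raw form equals the manually escaped form
```

The raw form is what the regex engine should receive, because the engine does its own backslash processing afterwards.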

[python_regex](https://pynative.com/python/regex/)

In [6]:
import re

print("without raw string:")

# path_to_search = "c:\example\task\new"
# the r prefix makes this a raw string, so every backslash in the path is kept literally
target_string = r"c:\example\task\new\exercises\session1"  # this is the target string

without raw string:


**When we write a regex pattern, `\t` and `\n` have special meanings, so each literal `\` must be escaped with another `\`. In a normal string, however, Python itself collapses `\\` into a single `\` before the regex engine sees the pattern, so the escaping is lost. Prefixing the pattern with `r` keeps both backslashes, and the regex engine then reads `\\` as one literal backslash.**

In [10]:


pattern = "^c:\\example\\task\\new"

# In a normal string, Python collapses each \\ into a single \
# so the regex engine receives ^c:\example\task\new and rejects \e

res = re.search(pattern, target_string)
print(res.group())

error: bad escape \e at position 3

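Notice that the pattern is written as a normal string, so Python turns each `\\` into a single `\` and the regex engine receives `^c:\example\task\new`. If you execute the above code, you will get `re.error: bad escape \e`, because `\e` is not a valid regex escape. The `\t` and `\n` parts would likewise be read as a tab and a newline rather than as literal text.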

To avoid such issues, always write a regex pattern using a raw string. The character r denotes the raw string.

Now replace the existing pattern with `pattern = r"^c:\\example\\task\\new"` and execute our code again. Now you can get the following output.

In [9]:
pattern = r"^c:\\example\\task\\new"

res = re.search(pattern, target_string)
print(res.group())

c:\example\task\new


### Python regex methods

|Method|Description|
|:--|:--|
|re.compile('pattern')	|Compile a regular expression pattern provided as a string into a re.Pattern object.|
|re.search(pattern, str)	|Search for occurrences of the regex pattern inside the target string and return **only the first match.**|
|re.match(pattern, str)|Try to match the regex pattern **at the start of the string.** It returns a match only if the pattern is located at the beginning of the string.|
|re.fullmatch(pattern, str)	|Match the regular expression pattern to the entire string from the first to the last character.|
|re.findall(pattern, str)|Scans the regex pattern through the entire string and **returns all matches.**|
|re.finditer(pattern, str)|Scans the regex pattern through the entire string and returns an iterator yielding match objects.|
|re.split(pattern, str)|Split the string at each occurrence of the regex pattern and return the pieces as a list.|
|re.sub(pattern, replacement, str)|Replace one or more occurrences of a pattern in the string with a replacement.|
|re.subn(pattern, replacement, str)|Same as re.sub(). The difference is it will return a tuple of two elements. First, a new string after all replacement, and second the number of replacements it has made.|


### How to use regular expression in Python

In [12]:
import re

target_string = "Jessa salary is 8000$"

# compile regex pattern; \w matches any word character (letter, digit, or underscore)
str_pattern = r"\w"
pattern = re.compile(str_pattern)

# match regex pattern at start of the string
res = pattern.match(target_string)
# match character
print(res.group())  

# search for the regex pattern anywhere inside the string; \d matches any digit
res = re.search(r"\d", target_string)
print(res.group())

# pattern to find all digits
res = re.findall(r"\d", target_string)
print(res)  

# regex to split string on whitespaces
res = re.split(r"\s", target_string)
print("All tokens:", res)

# regex for replacement, replace space with hyphen
res = re.sub(r"\s", "-", target_string)

# string after replacement:
print(res)

J
8
['8', '0', '0', '0']
All tokens: ['Jessa', 'salary', 'is', '8000$']
Jessa-salary-is-8000$


### The Match object methods

|Method|	Meaning|
|:--|:--|
|group()|	Return the string matched by the regex pattern. |
|groups()|	Returns a tuple containing the strings for all matched subgroups.|
|start()	|Return the start position of the match.|
|end()	|Return the end position of the match.|
|span()	|Return a tuple containing the (start, end) positions of the match.|
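As a quick illustration of the methods in the table above, the following sketch calls each of them on one match. The sample sentence and the two-group pattern are assumptions chosen for the example.

```python
import re

# Hypothetical sample text; the pattern has two capturing groups.
m = re.search(r"(\w+) is (\d+)", "Emma is 25 years old")

print(m.group())           # Emma is 25
print(m.groups())          # ('Emma', '25')
print(m.start(), m.end())  # 0 10
print(m.span())            # (0, 10)
```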

### Regex Metacharacters

|Metacharacter|	Description|
|:--|:--|
|. (DOT)|	Matches any character except a newline.|
|^ (Caret)|	Matches pattern only at the start of the string.|
|$ (Dollar)|	Matches pattern at the end of the string.|
|* (asterisk)|	Matches 0 or more repetitions of the regex.|
|+ (Plus)|	Match 1 or more repetitions of the regex.|
|? (Question mark)|	Match 0 or 1 repetition of the regex.|
|[] (Square brackets)|	Used to indicate a set of characters. Matches any single character in brackets. For example, [abc] will match either a, or, b, or c character.|
|\| (Pipe)|	Used to specify multiple patterns. For example, P1\|P2, where P1 and P2 are two different regexes.|
|\ (backslash)|	Used to escape special characters or to signal a special sequence. For example, if you are searching for one of the special characters, you can use a \ to escape it.|
|[^...]|	Matches any single character not in brackets.|
|(...)|	Matches whatever regular expression is inside the parentheses. For example, (abc) will match to substring 'abc'|
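The metacharacters in the table above can be tried out on a small string. This is a minimal sketch; the sample text and the variable names are made up for illustration.

```python
import re

text = "cat bat rat\ncot"

dot = re.findall(r"c.t", text)          # . matches any character except a newline
in_set = re.findall(r"[bc]at", text)    # [bc] matches 'b' or 'c'
not_set = re.findall(r"[^bc]at", text)  # [^bc] matches anything except 'b' or 'c'
either = re.findall(r"cat|rat", text)   # | chooses between two patterns
escaped = re.findall(r"\.", "a.b.c")    # \. matches a literal dot

print(dot)      # ['cat', 'cot']
print(in_set)   # ['cat', 'bat']
print(not_set)  # ['rat']
print(either)   # ['cat', 'rat']
print(escaped)  # ['.', '.']

# ^ anchors at the start of the string, $ at the end
print(re.search(r"^cat", text).group())  # cat
print(re.search(r"cot$", text).group())  # cot
```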

### Regex special sequences (a.k.a. Character Classes)

|Special Sequence|	Meaning|
|:--|:--|
|\A|	Matches pattern only at the start of the string.|
|\Z|	Matches pattern only at the end of the string.|
|\d|	Matches any digit. Short for the character class [0-9] when `re.ASCII` is set; by default it also matches other Unicode decimal digits.|
|\D|	Matches any non-digit. Short for [^0-9].|
|\s|	Matches any whitespace character. Short for the character class [ \t\n\x0b\r\f].|
|\S|	Matches any non-whitespace character. Short for [^ \t\n\x0b\r\f].|
|\w	|Matches any word character (alphanumeric or underscore). Short for the character class [a-zA-Z_0-9] when `re.ASCII` is set; by default it also matches Unicode letters and digits.|
|\W|	Matches any non-word character. Short for [^a-zA-Z_0-9].|
|\b|	Matches the empty string, but only at the beginning or end of a word. Matches a word boundary where a word character is [a-zA-Z0-9_]. For example, '\bJessa\b' matches 'Jessa', 'Jessa.', '(Jessa)', 'Jessa Emma Kelly' but not 'JessaKelly' or 'Jessa5'.|
|\B|	Opposite of a \b. Matches the empty string, but only when it is not at the beginning or end of a word|
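The special sequences above can be checked with a few one-liners. The sketch below uses an invented sentence, and the last part assumes the default Unicode matching of Python 3 `str` patterns to show how `re.ASCII` narrows `\d`.

```python
import re

text = "Jessa5 and Jessa scored 42 points"

whole_word = re.findall(r"\bJessa\b", text)  # 'Jessa5' has no boundary after 'a'
numbers = re.findall(r"\d+", text)
spaces = re.findall(r"\s", text)
print(whole_word)   # ['Jessa']
print(numbers)      # ['5', '42']
print(len(spaces))  # 5

# \A and \Z anchor at the very start and very end of the string
starts = re.search(r"\AJessa", text) is not None
ends = re.search(r"points\Z", text) is not None
print(starts, ends)  # True True

# ARABIC-INDIC DIGIT THREE followed by an ASCII '5'
mixed = "\u0663" + "5"
unicode_digits = re.findall(r"\d", mixed)
ascii_digits = re.findall(r"\d", mixed, flags=re.ASCII)
print(len(unicode_digits))  # 2
print(ascii_digits)         # ['5']
```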

### Regex Quantifiers

|Quantifier|	Meaning|
|:--|:--|
|\*|	Match 0 or more repetitions of the preceding regex. For example, a* matches any string that contains zero or more occurrences of 'a'.|
|\+	|Match 1 or more repetitions of the preceding regex. For example, a+ matches any string that contains at least one a, i.e., a, aa, aaa, or any number of a's.|
|?|Match 0 or 1 repetition of the preceding regex. For example, a? matches any string that contains zero or one occurrence of a.|
|{2}|Matches exactly 2 copies of the preceding regex. For example, p{2} matches exactly two 'p' characters.|
|{2,4}|Match 2 to 4 repetitions of the preceding regex. For example, a{2,4} matches 2, 3, or 4 'a' characters. Write it without a space after the comma; otherwise Python treats the braces as literal text.|
|{3,}|Matches minimum 3 copies of the preceding regex. It will try to match as many repetitions as possible. For example, p{3,} matches a minimum of three 'p' characters.|
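One way to compare the quantifiers in the table above is to test each of them against the same set of strings with `re.fullmatch()`. This is only a sketch; the `samples` list and the `results` dictionary are illustrative names.

```python
import re

samples = ["", "a", "aa", "aaaa", "aaaaa"]
results = {}
for pattern in (r"a*", r"a+", r"a?", r"a{2}", r"a{2,4}", r"a{3,}"):
    # fullmatch() succeeds only if the quantified pattern covers the whole sample
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])

# Quantifiers are greedy: with five a's available, a{2,4} takes four of them.
greedy = re.search(r"a{2,4}", "aaaaa").group()
print(greedy)  # aaaa
```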


### Regex flags

|Flag|Long syntax|Meaning|
|:--|:--|:--|
|`re.A`|`re.ASCII`|Perform ASCII-only matching instead of full Unicode matching.|
|`re.I`|`re.IGNORECASE`|Perform case-insensitive matching.|
|`re.M`|`re.MULTILINE`|Used with the metacharacters ^ (caret) and \$ (dollar). When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line (right after each `\n`), and \$ matches at the end of the string and at the end of each line (right before each `\n`).|
|`re.S`|`re.DOTALL`|Make the DOT (.) special character match any character at all, including a newline. Without this flag, DOT (.) matches anything except a newline.|
|`re.X`|`re.VERBOSE`|Allow whitespace and comments inside the regex, which makes it more readable.|
|`re.L`|`re.LOCALE`|Make \w, \W, \b, \B, and case-insensitive matching dependent on the current locale. Use only with bytes patterns.|

- To specify more than one flag, use the | operator to connect them.

```python
re.findall(pattern, string, flags=re.I|re.M|re.X)
```
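The following sketch shows what some of these flags change in practice. The two-line sample string and the verbose pattern are assumptions made for the example.

```python
import re

text = "emma\nLoves python"

# Without re.M, ^ matches only at the very start of the string.
first_only = re.findall(r"^\w+", text)
# With re.M, ^ also matches right after each newline.
every_line = re.findall(r"^\w+", text, flags=re.M)
print(first_only)  # ['emma']
print(every_line)  # ['emma', 'Loves']

# Without re.S the dot does not match the newline; with re.S it does.
no_dotall = re.search(r"emma.Loves", text)
with_dotall = re.search(r"emma.Loves", text, flags=re.S)
print(no_dotall)                 # None
print(with_dotall is not None)   # True

# re.X lets the pattern span several lines with comments;
# re.I and re.M are combined with it using the | operator.
verbose = re.compile(
    r"""
    ^loves   # the word loves at the start of a line (any case)
    \s+      # one or more whitespace characters
    (\w+)    # capture the word that follows
    """,
    flags=re.I | re.M | re.X,
)
word = verbose.search(text).group(1)
print(word)  # python
```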


# re.match()

<img src='python-regex-match.jpg' style='width: 600px'>

# re.compile()

Python's `re.compile()` method compiles a regular expression pattern provided as a string into a regex pattern object (`re.Pattern`). We can then use this pattern object to search for a match inside different target strings using its methods, such as `match()` or `search()`.

```python
re.compile(pattern, flags=0)
```

<img src='python-regex-compile.jpg' style='width:600px'>

```python
pattern= re.compile("str_pattern")
result = pattern.match(string)
```

The code above is equivalent to the following:

```python
result = re.match("str_pattern", string)
```

#### Is it worth using Python’s re.compile()?
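**The `re` module caches recently used patterns, so for a handful of regexes the gain from compiling is small. But compiling a regex is useful in the following situations.**

- It signals that the compiled regular expression will be used many times and is meant to be kept around.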
- By compiling once and re-using the same regex multiple times, we reduce the possibility of typos.
- When you are using lots of different regexes, you should keep your compiled expressions for those which are used multiple times, so they’re not flushed out of the regex cache when the cache is full.
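As a small illustration of the points above, the sketch below compiles one pattern (with a flag) and reuses it for every item of a list. The log lines and the name `error_re` are hypothetical.

```python
import re

# Hypothetical log lines used only for illustration.
lines = ["ERROR disk full", "info started", "error timeout", "WARNING low memory"]

# Compile once, with a flag, then reuse the same object for every line.
error_re = re.compile(r"^error\b", re.IGNORECASE)

errors = [line for line in lines if error_re.match(line)]
print(errors)  # ['ERROR disk full', 'error timeout']
```

Because the pattern text appears in one place only, a later change to it cannot be applied inconsistently.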

**Match regex pattern at the beginning of the string**

In [None]:
import re

target_string = "Emma is a basketball player who was born on June 17"
result = re.match(r"\w{4}", target_string)  # four word characters at the start

print("Match object: ", result)

# Extract match value
print("Match value: ", result.group())

**Match regex pattern anywhere in the string**

In [14]:
target_string = "Jessa loves Python and pandas"

# Match six-letter word
pattern = r"\w{6}"

result = re.match(pattern, target_string)
print(result)

# search() method
result = re.search(pattern, target_string)
print(result.group()) 

# findall() method
result = re.findall(pattern, target_string)
print(result) 


None
Python
['Python', 'pandas']


**Match regex at the end of the string**

In [15]:
target_string = "Emma is a basketball player who was born on June 17, 1993"

# match at the end
result = re.search(r"\d{4}$", target_string)
print("Matching number: ", result.group())  

Matching number:  1993


**Match the exact word or string**

In [16]:
target_string = "Emma is a basketball player who was born on June 17"

result = re.findall(r"player", target_string)
print("Matching string literal: ", result) 

Matching string literal:  ['player']


**Understand the Match object**

As you know, the `match()` and `search()` methods return a `re.Match` object if a match is found. Let's see the structure of a `re.Match` object.

<img src='python-regex-match_object.jpg' style='width:600px'>


In [18]:
target_string = "Jessa and Kelly"

# Match five-letter word
res = re.match(r"\b\w{5}\b", target_string)

# printing entire match object
print(res)

# Extract Matching value
print(res.group())

# Start index of a match
print(res.start())

# End index of a match
print("End index: ", res.end()) 

# Start and end index of a match
pos = res.span()
print(pos)

# Use span to retrieve the matching string
print(target_string[pos[0]:pos[1]])

<re.Match object; span=(0, 5), match='Jessa'>
Jessa
0
End index:  5
(0, 5)
Jessa


**Match regex pattern that starts and ends with the given text**

In [19]:
import re

# string starts with letter 'P' and ends with letter 's'
def starts_ends_with(str1):
    res = re.match(r'^(P).*(s)$', str1)
    if res:
        print(res.group())
    else:
        print('None')

str1 = "PYnative is for Python developers"
starts_ends_with(str1)

str2 = "PYnative is for Python"
starts_ends_with(str2)

PYnative is for Python developers
None


**More matching operations**

In [21]:
str1 = "Emma 12 25"

# Match any character
print(re.match(r'.', str1))

# Match all digits
print(re.findall(r'\d', str1))

# Match all numbers
# + indicates 1 or more occurrences of \d
print(re.findall(r'\d+', str1))

# Match all special characters and symbols
str2 = "Hello #Jessa!@#$%"
print(re.findall(r'\W', str2))

<re.Match object; span=(0, 1), match='E'>
['1', '2', '2', '5']
['12', '25']
[' ', '#', '!', '@', '#', '$', '%']


**Regex Search vs. match**

- The match() checks for a match only at the beginning of the string; otherwise, it will return None.
- The search() checks for a match anywhere in the string.

In [22]:
target_string = "Emma is a baseball player who was born on June 17, 1993"

# Match 2-digit number
# Using match()
result = re.match(r'\d{2}', target_string)
print(result)

# Using search()
result = re.search(r'\d{2}', target_string)
print(result.group())

None
17


**The behavior of search vs. match with a multiline string**

- We use the re.M flag with the caret (^) metacharacter to match a pattern at the start of each line. Note, however, that even in MULTILINE mode, match() matches only at the beginning of the string and not at the beginning of each line.

- On the other hand, the search() method scans the entire multi-line string for the pattern and returns only the first match.

In [24]:
multi_line_string = """emma 
love Python"""

# Matches at the start
print(re.match('emma', multi_line_string).group())

# re.match() doesn't match at the start of each line
# It only matches at the start of the string
# Won't match
print(re.match('love', multi_line_string, re.MULTILINE))

# found "love" at start of newline
print(re.search('love', multi_line_string).group())

pattern = re.compile('Python$', re.MULTILINE)
# No Match
print(pattern.match(multi_line_string))

# found "Python" at the end
print(pattern.search(multi_line_string).group())

emma
None
love
None
Python


**re.fullmatch()**

- The re.fullmatch method returns a match object if and only if the *entire target string from the first to the last character matches the regular expression pattern.*

In [25]:
# string length of 42
str1 = "My name is maximums and my salary is 1000$"
print("str1 length: ", len(str1))

result = re.fullmatch(r".{42}", str1)

# print entire match object
print(result)

# print actual match value
print("Match: ", result.group())

str1 length:  42
<re.Match object; span=(0, 42), match='My name is maximums and my salary is 1000$'>
Match:  My name is maximums and my salary is 1000$
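To make the difference between the three matching functions concrete, here is a minimal sketch that applies the same pattern with `search()`, `match()`, and `fullmatch()`. The date-like sample string is an assumption.

```python
import re

text = "2024-05"
pattern = r"\d{4}"

found_anywhere = re.search(pattern, "in " + text)  # pattern may start anywhere
found_at_start = re.match(pattern, text)           # pattern must start at index 0
found_whole = re.fullmatch(pattern, text)          # pattern must cover the whole string

print(found_anywhere.span())   # (3, 7)
print(found_at_start.group())  # 2024
print(found_whole)             # None, because '-05' is left over

# A pattern that describes the entire string does satisfy fullmatch()
print(re.fullmatch(r"\d{4}-\d{2}", text).group())  # 2024-05
```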


# re.search()

<img src="python-regex-search.jpg" style="width:600px;"/>

`re.search()` returns only the first match of the pattern in the target string. Use `re.search()` to look for a pattern anywhere in the string.

In [26]:
# Target String
target_string = "Emma is a baseball player who was born on June 17"

# search() for eight-letter word
result = re.search(r"\w{8}", target_string)

# Print match object
print("Match Object", result)

# print the matching word using group() method
print("Matching word: ", result.group()) 

Match Object <re.Match object; span=(10, 18), match='baseball'>
Matching word:  baseball


**Regex search example find exact substring or word**

In [27]:
# Target String
target_string = "Emma is a baseball player who was born on June 17, 1993."

# find substring 'ball'
result = re.search(r"ball", target_string)

# Print matching substring
print(result.group())

# find exact word/substring surrounded by word boundary
result = re.search(r"\bball\b", target_string)
if result:
    print(result)

# find word 'player'
result = re.search(r"\bplayer\b", target_string)
print(result.group())

ball
player


**Regex search groups or multiple patterns**

In this section, we will learn how to search for multiple distinct patterns inside the same target string. Let’s assume, we want to search the following two distinct patterns inside the target string at the same time.

1. A ten-letter word
2. Two consecutive digits

Regex Pattern 1: `\w{10}`

Regex Pattern 2: `\d{2}`

Now each pattern will represent one group. Let's put each group inside parentheses ( ). In our case: `r"(\w{10}).+(\d{2})"`

On a successful search, we can use `match.group(1)` to get the match value of the first group and `match.group(2)` to get the match value of the second group.

In [34]:
target_string = "Emma is a basketball player who was born on June 17."

# two groups (two patterns), each enclosed in its own ( ) pair
result = re.search(r"(\w{10}).+(\d{2})", target_string)

# Extract the matches using group()

# print ten-letter word
print(result.group(1))
# Output basketball

# print two digit number
print(result.group(2))

# print the whole matching group
print(result.group())

basketball
17
basketball player who was born on June 17


**Search multiple words using regex**

In [35]:
import re

str1 = "Emma is a baseball player who was born on June 17, 1993."

# findall() for several exact words combined with |
# \b is used to specify a word boundary

result = re.findall(r"\bEmma\b|\bplayer\b|\bborn\b", str1)
print(result)

['Emma', 'player', 'born']


**Case insensitive regex search**

In [37]:
# Target String
target_string = "Emma is a Baseball player who was born on June 17, 1993."

# case sensitive searching
result = re.search(r"emma", target_string)
print("Matching word:", result)

print("case insensitive searching")
# using re.IGNORECASE
result = re.search(r"emma", target_string, re.IGNORECASE)
print("Matching word:", result.group())

Matching word: None
case insensitive searching
Matching word: Emma


# re.findall() & re.finditer()

<img src="python-regex-findall.jpg" style="width:600px;"/>

<img src="python-regex-finditer.jpg" style="width:600px;"/>

In [38]:
import re

target_string = "Emma is a basketball player who was born on June 17, 1993. She played 112 matches with scoring average 26.12 points per game. Her weight is 51 kg."
result = re.findall(r"\d+", target_string)

# print all matches
print("Found following matches")
print(result)

Found following matches
['17', '1993', '112', '26', '12', '51']


In [1]:
import re

# Target String one
str1 = "Emma's luck numbers are 251 761 231 451"

# pattern to find three consecutive digits
string_pattern = r"\d{3}"

# compile string pattern to re.Pattern object
regex_pattern = re.compile(string_pattern)

# print the type of compiled pattern
print(type(regex_pattern))

# find all the matches in string one
result = regex_pattern.findall(str1)
print(result)


# Target String two
str2 = "Kelly's luck numbers are 111 212 415"

# find all the matches in second string by reusing the same pattern
result = regex_pattern.findall(str2)
print(result)

<class 're.Pattern'>
['251', '761', '231', '451']
['111', '212', '415']


In [41]:

target_string = "Emma is a basketball player who was born on June 17, 1993. She played 112 matches with a scoring average of 26.12 points per game. Her weight is 51 kg."

# finditer() with regex pattern and target string
# \d{2} to match two consecutive digits 
result = re.finditer(r"\d{2}", target_string)

# print all match object
for match_obj in result:
    # print each re.Match object
    print(match_obj)
    
    # extract each matching number
    print(match_obj.group(0))

<re.Match object; span=(49, 51), match='17'>
17
<re.Match object; span=(53, 55), match='19'>
19
<re.Match object; span=(55, 57), match='93'>
93
<re.Match object; span=(70, 72), match='11'>
11
<re.Match object; span=(108, 110), match='26'>
26
<re.Match object; span=(111, 113), match='12'>
12
<re.Match object; span=(145, 147), match='51'>
51


**Regex to find all words starting with specific letters**

In [42]:
target_string = "Jessa is a Python developer. She also gives Python programming training"

# all words that start with the letter 'p'
print(re.findall(r'\b[p]\w+\b', target_string, re.I))

# all words that start with the substring 'py'
print(re.findall(r'\bpy\w+\b', target_string, re.I))

['Python', 'Python', 'programming']
['Python', 'Python']


**Regex to find all words that start and end with a specific letter**

In [45]:
target_string = "Jessa is a Python developer. She also gives Python programming training"
# all word starts with letter 'p' and ends with letter 'g'
print(re.findall(r'\b[p]\w+[g]\b', target_string, re.I))

# all word starts with letter 'p' or 't' and ends with letter 'g'
print(re.findall(r'\b[pt]\w+[g]\b', target_string, re.I))

target_string = "Jessa loves mango and orange"
# all word starts with substring 'ma' and ends with substring 'go'
print(re.findall(r'\bma\w+go\b', target_string, re.I))

target_string = "Kelly loves banana and apple"
# all word starts or ends with letter 'a'
print(re.findall(r'\b[a]\w+\b|\w+[a]\b', target_string, re.I))

['programming']
['programming', 'training']
['mango']
['banana', 'and', 'apple']


**Regex to find all words containing a certain letter**

In [46]:
target_string = "Jessa is a knows testing and machine learning"
# find all words that contain the letter 'i'
print(re.findall(r'\b\w*[i]\w*\b', target_string, re.I))

# find all words that contain the substring 'ing'
print(re.findall(r'\b\w*ing\w*\b', target_string, re.I))

['is', 'testing', 'machine', 'learning']
['testing', 'learning']


**Regex findall repeated characters**

In [48]:
target_string = "Jessa Erriikkka"

# (\w) captures any single word character
# and \1* then matches its repetitions, if any.
matcher = re.compile(r"(\w)\1*")

for match in matcher.finditer(target_string):
    print(match.group(), end=", ")
# output J, e, ss, a, E, rr, ii, kkk, a,

J, e, ss, a, E, rr, ii, kkk, a, 

# re.split()

|Operation|	Description|
|:--|:--|
|re.split(pattern, str)|	Split the string by each occurrence of the pattern.|
|re.split(pattern, str, maxsplit=2)|	Split the string by the occurrences of the pattern. Limit the number of splits to 2|
|re.split(p1\|p2, str)|	Split the string at any of several delimiter patterns (p1 or p2).|

In [50]:
target_string = "My name is maximums and my luck numbers are 12 45 78"

# split on white-space and return list of words separated by whitespace
word_list = re.split(r"\s+", target_string)
print(word_list)

['My', 'name', 'is', 'maximums', 'and', 'my', 'luck', 'numbers', 'are', '12', '45', '78']


**Limit the number of splits**

In [53]:
target_string = "12-45-78"

# Split only on the first occurrence
# maxsplit is 1
result = re.split(r"\D", target_string, maxsplit=1)
print(result)

# Allow up to three splits; only two delimiters exist, so both are used
# maxsplit is 3
result = re.split(r"\D", target_string, maxsplit=3)
print(result)

['12', '45-78']
['12', '45', '78']


**Regex to Split string with multiple delimiters**

In [60]:
target_string = "12,45,78,85-17-89"

# 2 delimiter - and ,
# use OR (|) operator to combine two pattern
result = re.split(r"-|,", target_string)
print(result)

# Regex to split string on five delimiters
target_string = "PYnative   dot.com; is for, Python-developer"
# Pattern to split: [-;,.\s]\s*
result = re.split(r"[-;,.\s]\s*", target_string)
print(result)

# Regex to split a string into words on any run of non-word characters
# (inside [...] a \b would mean a backspace, not a word boundary, so plain \W+ is used)
target_string = "PYnative! dot.com; is for, Python-developer?"
result = re.split(r"\W+", target_string)
print(result)

# Split strings by delimiters and specific word
text = "12, and45,78and85-17and89-97"
# split on the word 'and', whitespace, commas, and hyphens
result = re.split(r"and|[\s,-]+", text)
print(result)
# Output ['12', '', '45', '78', '85', '17', '89', '97']

['12', '45', '78', '85', '17', '89']
['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer']
['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer', '']
['12', '', '45', '78', '85', '17', '89', '97']


**Regex split a string and keep the separators**

If capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list.

Note: You capture a group by writing the pattern inside parentheses ( ).

In simple terms, be careful when using the re.split() method with a pattern that is enclosed in parentheses to capture groups. If capture groups are used, then the matched text is also included in the resulting list.

This is helpful when you want to keep the separators/delimiters in the resulting list.

In [62]:
target_string = "12-45-78."

# Split on non-digit
result = re.split(r"\D+", target_string)
print(result)

# Split on non-digit and keep the separators
# pattern written in parentheses
result = re.split(r"(\D+)", target_string)
print(result)

['12', '45', '78', '']
['12', '-', '45', '-', '78', '.', '']


**Regex split string by ignoring case**

In [63]:
# Without ignoring case
print(re.split('[a-z]+', "7J8e7Ss3a"))

# With ignoring case
print(re.split('[a-z]+', "7J8e7Ss3a", flags=re.IGNORECASE))

# Without ignoring case
print(re.split(r"emma", "Emma knows Python.EMMA loves Data Science"))

# With ignoring case
print(re.split(r"emma", "Emma knows Python.EMMA loves Data Science", flags=re.IGNORECASE))

['7J8', '7S', '3', '']
['7', '8', '7', '3', '']
['Emma knows Python.EMMA loves Data Science']
['', ' knows Python.', ' loves Data Science']


**Split string by upper case words**

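- We use the lookahead regex `\s(?=[A-Z])`.
- This regex splits at every whitespace character (`\s`) that is immediately followed by an upper-case letter (`[A-Z]`). The lookahead only checks the next character without consuming it, so the letter stays in the result.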

In [64]:
import re

print(re.split(r"\s(?=[A-Z])", "EMMA loves PYTHON and ML"))

['EMMA loves', 'PYTHON and', 'ML']


# re.sub()

|Operation|	Description|
|:--|:--|
|re.sub(pattern, replacement, string)|	Finds and replaces all occurrences of pattern with replacement|
|re.sub(pattern, replacement, string, count=1)|	Finds and replaces only the first occurrence of pattern with replacement|
|re.sub(pattern, replacement, string, count=n)|	Finds and replaces the first n occurrences of pattern with the replacement|

**Regex example to replace all whitespace with an underscore**

In [1]:
import re

target_str = "Jessa knows testing and machine learning"
res_str = re.sub(r"\s", "_", target_str)
# String after replacement
print(res_str)
# Output 'Jessa_knows_testing_and_machine_learning'

Jessa_knows_testing_and_machine_learning


**Regex to remove whitespaces from a string**

In [8]:
# Remove all spaces
target_str = "   Jessa Knows Testing And Machine Learning \t  ."

# \s+ to remove all spaces
# + indicate 1 or more occurrence of a space
res_str = re.sub(r"\s+", "", target_str)
# String after replacement
print(res_str)

# Remove leading spaces
target_str = "   Jessa Knows Testing And Machine Learning \t  ."
# ^\s+ remove only leading spaces
# caret (^) matches only at the start of the string
res_str = re.sub(r"^\s+", "", target_str)

# String after replacement
print(res_str)

# Remove trailing spaces
target_str = "   Jessa Knows Testing And Machine Learning \t\n"
# \s+$ removes only trailing spaces
# dollar ($) matches spaces only at the end of the string
res_str = re.sub(r"\s+$", "", target_str)

# String after replacement
print(res_str)

# Remove both leading and trailing spaces
target_str = "   Jessa Knows Testing And Machine Learning \t\n"
# ^\s+ removes leading spaces
# \s+$ removes trailing spaces
# | operator to combine both patterns
res_str = re.sub(r"^\s+|\s+$", "", target_str)

# String after replacement
print(res_str)

# Substitute multiple whitespaces with single whitespace using regex
target_str = "Jessa Knows Testing    And Machine     Learning \t \n"

# \s+ to match all whitespaces
# replace them using single space " "
res_str = re.sub(r"\s+", " ", target_str)

# string after replacement
print(res_str)

JessaKnowsTestingAndMachineLearning.
Jessa Knows Testing And Machine Learning 	  .
   Jessa Knows Testing And Machine Learning
Jessa Knows Testing And Machine Learning
Jessa Knows Testing And Machine Learning 


**Limit the maximum number of pattern occurrences to be replaced**

In [9]:
# original string
target_str = "Jessa knows testing and machine learning"
# replace only first occurrence
res_str = re.sub(r"\s", "-", target_str, count=1)
# String after replacement
print(res_str)

# replace three occurrence
res_str = re.sub(r"\s", "-", target_str, count=3)
print(res_str)

Jessa-knows testing and machine learning
Jessa-knows-testing-and machine learning


**Regex replacement function**

In [11]:
# replacement function to convert uppercase letter to lowercase
def convert_to_lower(match_obj):
    if match_obj.group() is not None:
        return match_obj.group().lower()

# Original String
str_obj = "Emma LOves PINEAPPLE DEssert and COCONUT Ice Cream"

# pass replacement function to re.sub()
res_str = re.sub(r"[A-Z]", convert_to_lower, str_obj)
# String after replacement
print(res_str)

emma loves pineapple dessert and coconut ice cream


**Regex replace group/multiple regex patterns**

In [12]:
# Original string
student_names = "Emma-Kelly Jessa Joy Scott-Joe Jerry"

# replace two pattern at the same time
# use OR (|) to separate two pattern
res = re.sub(r"(\s)|(-)", ",", student_names)
print(res)

Emma,Kelly,Jessa,Joy,Scott,Joe,Jerry


**Replace multiple regex patterns with different replacement**

In [13]:
# replacement function to convert uppercase word to lowercase
# and lowercase word to uppercase
def convert_case(match_obj):
    if match_obj.group(1) is not None:
        return match_obj.group(1).lower()
    if match_obj.group(2) is not None:
        return match_obj.group(2).upper()

# Original String
str_obj = "EMMA loves PINEAPPLE dessert and COCONUT ice CREAM"

# group 1 ([A-Z]+) capture and replace all uppercase word with a lowercase.
# group 2 ([a-z]+) capture and replace all lowercase word with an uppercase
# pass replacement function 'convert_case' to re.sub()
res_str = re.sub(r"([A-Z]+)|([a-z]+)", convert_case, str_obj)

# String after replacement
print(res_str)
# Output 'emma LOVES pineapple DESSERT AND coconut ICE cream'

emma LOVES pineapple DESSERT AND coconut ICE cream


**The re.subn() method**

The re.subn() method returns a tuple of two elements.

- The first element of the result is the new version of the target string after all the replacements have been made.
- The second element is the number of replacements it has made

In [16]:
target_string = "Emma loves PINEAPPLE, COCONUT, BANANA ice cream"
result = re.subn(r"[A-Z]{2,}", "MANGO", target_string)
print(result)


target_string = "Emma loves PINEAPPLE, COCONUT, BANANA ice cream"
result = re.subn(r"[A-Z]{2,}", "MANGO", target_string, count=2)
print(result)

('Emma loves MANGO, MANGO, MANGO ice cream', 3)
('Emma loves MANGO, MANGO, BANANA ice cream', 2)


# Regex Capturing Groups

A capturing group is the part of a regex pattern enclosed in parentheses ( ). For example, in the expression ((\w)(\s\d)) there are three such groups:

- ((\w)(\s\d))
- (\w)
- (\s\d)

We can specify as many groups as we wish. Each sub-pattern inside a pair of parentheses will be captured as a group. Capturing groups are numbered by counting their opening parentheses from left to right.
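The numbering rule can be checked directly on the expression ((\w)(\s\d)) from the list above. This is a minimal sketch; the target string "Room A 7" is made up for the example.

```python
import re

m = re.search(r"((\w)(\s\d))", "Room A 7")

# The compiled pattern reports how many capturing groups it contains.
print(m.re.groups)  # 3

# Groups are numbered by the position of their opening parenthesis.
print(repr(m.group(1)))  # 'A 7'  -> outer group ((\w)(\s\d))
print(repr(m.group(2)))  # 'A'    -> (\w)
print(repr(m.group(3)))  # ' 7'   -> (\s\d)
print(m.groups())        # ('A 7', 'A', ' 7')
```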

In [20]:
target_string = "The price of PINEAPPLE ice cream is 20"

# two groups enclosed in separate ( and ) bracket
result = re.search(r"(\b[A-Z]+\b).+(\b\d+)", target_string)

# Extract matching the whole string
print(result.group())

# Extract matching values of all groups
print(result.groups())

# Extract match value of group 1
print(result.group(1))

# Extract match value of group 2
print(result.group(2))

PINEAPPLE ice cream is 20
('PINEAPPLE', '20')
PINEAPPLE
20


In [22]:
target_string = "The price of PINEAPPLE ice cream is 20"

# three groups: an outer group wrapping the two inner groups
result = re.search(r"((\b[A-Z]+\b).+(\b\d+))", target_string)

# Extract matching the whole string
print(result.group())

# Extract matching values of all groups
print(result.groups())

# Extract match value of group 1
print(result.group(1))

# Extract match value of group 2
print(result.group(2))

# Extract match value of group 3
print(result.group(3))

PINEAPPLE ice cream is 20
('PINEAPPLE ice cream is 20', 'PINEAPPLE', '20')
PINEAPPLE ice cream is 20
PINEAPPLE
20


**Regex Capture Group Multiple Times**

In [27]:
target_string = "The price of ice-creams PINEAPPLE 20 MANGO 30 CHOCOLATE 40"

# two groups, each enclosed in its own pair of parentheses
# group 1: a word in uppercase letters
# group 2: a number
# you can compile the pattern first or pass it directly to re.finditer()
pattern = re.compile(r"(\b[A-Z]+\b).(\b\d+\b)")

print(pattern.findall(target_string))
print()

matched = pattern.search(target_string)  # search() returns only the first occurrence of the match
print(matched.group())
print(matched.groups())
print(matched.group(0))
print(matched.group(1))
print(matched.group(2))
print()

# iterate over every match; finditer() yields one match object per match
for match in pattern.finditer(target_string):
    # extract words
    print(match.group(1))
    # extract numbers
    print(match.group(2))

[('PINEAPPLE', '20'), ('MANGO', '30'), ('CHOCOLATE', '40')]

PINEAPPLE 20
('PINEAPPLE', '20')
PINEAPPLE 20
PINEAPPLE
20

PINEAPPLE
20
MANGO
30
CHOCOLATE
40


**Extract Range of Groups Matches**

In [29]:
target_string = "The price of PINEAPPLE ice cream is 20"
# two groups, each enclosed in its own pair of parentheses
result = re.search(r".+(\b[A-Z]+\b).+(\b\d+)", target_string)

# passing several group numbers returns a tuple of those groups
print(result.group(1, 2))
# Output ('PINEAPPLE', '20')
print(result.group())

('PINEAPPLE', '20')
The price of PINEAPPLE ice cream is 20


In [5]:
import re

text = "He's an engineer, isn't he?"
print(text.title())
print()

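# one or more letters, optionally followed by an apostrophe suffix such as 's
p = re.compile(r"[A-Za-z]+('[a-z]+)?")
m = p.search(text)
if m:
    print(m.group())   # whole match
    print(m.group(0))  # same as m.group()
    print(m.group(1))  # text captured by group 1
    print()
    # m.group() is a plain string, so [i] indexes its characters
    print(m.group()[0])
    print(m.group()[1])
    print(m.group()[2])
    print(m.group()[3])

# with one capturing group, findall() returns only that group's text
result = p.findall(text)
print(result)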

He'S An Engineer, Isn'T He?

He's
He's
's

H
e
'
s
["'s", '', '', "'t", '']
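
The last list holds the text of group 1 for each match, which is why words without an apostrophe show up as empty strings. One way to get the whole words instead is a non-capturing group, `(?:...)`. The sketch below reuses the same text; the name `words` is illustrative.

```python
import re

text = "He's an engineer, isn't he?"

# (?:...) groups without capturing, so findall() returns the whole match
p = re.compile(r"[A-Za-z]+(?:'[a-z]+)?")
words = p.findall(text)
print(words)
```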


In [2]:
import re

text = "He's an engineer, isn't he?"
print(text.title())
print()

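def titlecase(text):
    # capitalize the first letter of each word and lowercase the rest,
    # treating an apostrophe suffix such as 's as part of the word
    p = re.compile(r"[A-Za-z]+('[a-z]+)?")
    return p.sub(lambda m: m.group(0)[0].upper() + m.group(0)[1:].lower(), text)

print(titlecase(text))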

He'S An Engineer, Isn'T He?

He's An Engineer, Isn't He?


# Metacharacters and Operators

|Metacharacter|Description|
|:--|:--|
|. (Dot)|Matches any character except a newline.|
|^ (Caret)|Matches the pattern only at the start of the string.|
|$ (Dollar)|Matches the pattern at the end of the string.|
|* (Asterisk)|Matches 0 or more repetitions of the preceding regex.|
|+ (Plus)|Matches 1 or more repetitions of the preceding regex.|
|? (Question mark)|Matches 0 or 1 repetition of the preceding regex.|
|[] (Square brackets)|Indicates a set of characters and matches any single character in the set. For example, [abc] matches a, b, or c.|
|\| (Pipe)|Separates alternative patterns. For example, P1\|P2 matches either P1 or P2, where P1 and P2 are two different regexes.|
|\ (Backslash)|Escapes a special character or signals a special sequence. For example, to search for one of the special characters literally, put a \ in front of it.|
|[^...]|Matches any single character not in the brackets.|
|(...)|Matches whatever regular expression is inside the parentheses and captures it as a group. For example, (abc) matches the substring 'abc'.|

**Regex . dot metacharacter**

In [30]:
import re

target_string = "Emma loves \n Python"
# dot(.) metacharacter to match any character
result = re.search(r'.', target_string)
print(result.group())

# .+ to match any string except newline
result = re.search(r'.+', target_string)
print(result.group())

E
Emma loves 


**DOT to match a newline character**

In [31]:
str1 = "Emma is a Python developer \n She also knows ML and AI"

# dot(.) characters to match newline
result = re.search(r".+", str1, re.S)
print(result.group())

Emma is a Python developer 
 She also knows ML and AI


**Regex ^ caret metacharacter**

In [32]:
target_string = "Emma is a Python developer \n Emma also knows ML and AI"

# caret (^) matches at the beginning of a string
result = re.search(r"^\w{4}", target_string)
print(result.group())

Emma


**caret ( ^ ) to match a pattern at the beginning of each new line**

In [33]:
str1 = "Emma is a Python developer and her salary is 5000$ \nEmma also knows ML and AI"

# caret (^) matches at the beginning of each new line
# Using re.M flag
result = re.findall(r"^\w{4}", str1, re.M)
print(result)

['Emma', 'Emma']


**Regex $ dollar metacharacter**

In [34]:
str1 = "Emma is a Python developer \nEmma also knows ML and AI"
# dollar sign($) to match at the end of the string
result = re.search(r"\w{2}$", str1)
print(result.group())

AI


**Regex * asterisk/star metacharacter**

In [35]:
str1 = "Numbers are 8,23, 886, 4567, 78453"
# asterisk sign(*) to match 0 or more repetitions

result = re.findall(r"\d\d*", str1)
print(result)

['8', '23', '886', '4567', '78453']


**Regex + Plus metacharacter**

In [36]:
str1 = "Numbers are 8,23, 886, 4567, 78453"
# Plus sign (+) matches 1 or more repetitions, so \d\d+ needs at least two digits
result = re.findall(r"\d\d+", str1)
print(result)

['23', '886', '4567', '78453']


**The ? question mark metacharacter**

In [37]:
target_string = "Numbers are 8,23, 886, 4567, 78453"

# Question mark (?) matches 0 or 1 repetition, so this finds 4- or 5-digit numbers
result = re.findall(r"\d\d\d\d\d?", target_string)
print(result)

['4567', '78453']


**The \ backslash metacharacter**

In [38]:
str1 = "Emma is a Python developer. Emma salary is 5000$. Emma also knows ML and AI."
# escape dot
res = re.findall(r"\.", str1)
print(res)

['.', '.', '.']


**The [] square brackets metacharacter**

In [39]:
str1 = "Emma is a Python developer. Emma also knows ML and AI."
res = re.findall(r"[edk]", str1)
print(res)

['d', 'e', 'e', 'e', 'k', 'd']
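
The pipe and the parentheses listed in the metacharacter table have no cell of their own in this section. The following is a small sketch of both; the sample sentence and the names `either` and `grouped` are made up for illustration.

```python
import re

str1 = "Emma likes cats, Kelly likes dogs, Jessa likes birds"

# | matches either alternative
either = re.findall(r"cats|dogs", str1)
print(either)

# Parentheses limit the alternation to one part of the pattern;
# with a single capturing group, findall() returns only that group
grouped = re.findall(r"likes (cats|birds)", str1)
print(grouped)
```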


# Regex Special Sequences

|Special Sequence|	Meaning|
|:--|:--|
|\A|Matches the pattern only at the start of the string.|
|\Z|Matches the pattern only at the end of the string.|
|\d|Matches any digit. For ASCII text, short for the character class [0-9].|
|\D|Matches any non-digit. For ASCII text, short for [^0-9].|
|\s|Matches any whitespace character. For ASCII text, short for the character class [ \t\n\x0b\r\f].|
|\S|Matches any non-whitespace character. For ASCII text, short for [^ \t\n\x0b\r\f].|
|\w|Matches any word character (alphanumeric or underscore). For ASCII text, short for the character class [a-zA-Z_0-9].|
|\W|Matches any non-word character. For ASCII text, short for [^a-zA-Z_0-9].|
|\b|Matches the empty string, but only at the beginning or end of a word, that is, at a word boundary, where a word character is [a-zA-Z0-9_]. For example, '\bJessa\b' matches 'Jessa', 'Jessa.', '(Jessa)', and 'Jessa Emma Kelly' but not 'JessaKelly' or 'Jessa5'.|
|\B|Opposite of \b. Matches the empty string, but only when it is not at the beginning or end of a word.|

|Character Class|	Description|
|:--|:--|
|[abc]|Matches the letter a, b, or c.|
|[abc][pq]|Matches a, b, or c followed by either p or q.|
|[^abc]|Matches any character except a, b, or c (negation).|
|[0-9]|Matches any digit from 0 to 9, inclusive (range).|
|[a-z]|Matches any lowercase letter from a to z, inclusive (range).|
|[A-Z]|Matches any uppercase letter from A to Z, inclusive (range).|
|[a-zA-Z]|Matches any lowercase or uppercase letter (two ranges).|
|[m-p2-8]|Matches a letter from m to p or a digit from 2 to 8, but not the two-character string p2 (a class matches a single character).|
|[a-zA-Z0-9_]|Matches any alphanumeric character or an underscore.|

**Special Sequence \A and \Z**

In [42]:
import re

target_str = "Jessa is a Python developer, and her salary is 8000"

# \A to match at the start of a string
# match word starts with capital letter
result = re.findall(r"\A([A-Z].*?)\s", target_str)
print("Matching value", result)

result = re.findall(r"\A([A-Z].*)\s", target_str)
print("Matching value", result)

# \Z to match at the end of a string
# match number at the end of the string
result = re.findall(r"\d.*?\Z", target_str)
print("Matching value", result)


Matching value ['Jessa']
Matching value ['Jessa is a Python developer, and her salary is']
Matching value ['8000']
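
The first two patterns in this cell differ only in `.*?` versus `.*`. A `?` placed after a quantifier makes it non-greedy: it stops at the earliest point where the rest of the pattern can still match, while the plain quantifier consumes as much as possible. A short sketch on a made-up string (the names `html`, `greedy`, and `lazy` are illustrative):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs on to the last '>' in the string
greedy = re.findall(r"<.*>", html)
print(greedy)

# Non-greedy: .*? stops at the first '>' after each '<'
lazy = re.findall(r"<.*?>", html)
print(lazy)
```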


**Special sequence \d and \D**

In [43]:
target_str = "8000 dollar"

# \d to match all digits
result = re.findall(r"\d", target_str)
print(result)

# \d+ to match whole numbers (runs of one or more digits)
result = re.findall(r"\d+", target_str)
print(result)

# \D to match non-digits
result = re.findall(r"\D", target_str)
print(result)

['8', '0', '0', '0']
['8000']
[' ', 'd', 'o', 'l', 'l', 'a', 'r']


**Special Sequence \w and \W**

In [46]:
target_str = "Jessa and Kelly!!"

# \w to match all alphanumeric characters
result = re.findall(r"\w", target_str)
print(result)

# \w{5} to match five consecutive word characters
result = re.findall(r"\w{5}", target_str)
print(result)

# \W to match NON-alphanumeric
result = re.findall(r"\W", target_str)
print(result)

['J', 'e', 's', 's', 'a', 'a', 'n', 'd', 'K', 'e', 'l', 'l', 'y']
['Jessa', 'Kelly']
[' ', ' ', '!', '!']


**Special Sequence \s and \S**

The `\s` sequence matches any whitespace character in the target string. It covers the following characters:

* Space, as typed with the space bar (`" "`)
* Tab character (\t)
* Newline character (\n)
* Carriage return (\r)
* Form feed (\f)
* Vertical tab (\v)
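
A quick way to confirm the list above is to test each character against `\s` on its own. This is a minimal sketch; `whitespace_chars` and `results` are illustrative names.

```python
import re

# Each entry is (label, character); every one should match \s
whitespace_chars = [
    ("space", " "),
    ("tab", "\t"),
    ("newline", "\n"),
    ("carriage return", "\r"),
    ("form feed", "\f"),
    ("vertical tab", "\v"),
]

# fullmatch() succeeds only if the single character is whitespace
results = {label: bool(re.fullmatch(r"\s", ch)) for label, ch in whitespace_chars}
for label, matched in results.items():
    print(label, matched)
```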

In [47]:
target_str = "Jessa \t \n  "

# \s to match any whitespace
result = re.findall(r"\s", target_str)
print(result)

# \S to match non-whitespace
result = re.findall(r"\S", target_str)
print(result)

# split on white-spaces
result = re.split(r"\s+", "Jessa and Kelly")
print(result)

# replace each run of whitespace with a single space
result = re.sub(r"\s+", " ", "Jessa   and   \t \t Kelly  ")
print(result)

[' ', '\t', ' ', '\n', ' ', ' ']
['J', 'e', 's', 's', 'a']
['Jessa', 'and', 'Kelly']
Jessa and Kelly 


**Special Sequence \b and \B**

In [50]:
target_str = "  Jessa salary is 8000$ She is Python developer"

# \b marks a word boundary; \w{6} between two boundaries matches a six-letter word
result = re.findall(r"\b\w{6}\b", target_str)
print(result)

# \bthon\b needs "thon" as a separate word, so "Python" does not match
result = re.findall(r"\bthon\b", target_str)
print(result)

# \Bthon matches "thon" only when it does not start a word
result = re.findall(r"\Bthon", target_str)
print(result)

['salary', 'Python']
[]
['thon']


### Create custom character classes

**Simple character classes**

In [52]:
target_string = "Jessa loves Python. and her salary is 8000$"

# simple character class [Jde]: match the letter J, d, or e
result = re.findall(r"[Jde]", target_string)
print(result)

# simple character Class [0-9], Match any digit
result = re.findall(r"[0-9]", target_string)
print(result)

# two character classes [Pyt][hs]: match P, y, or t followed by either h or s
result = re.findall(r"[Pyt][hs]", target_string)
print(result)

['J', 'e', 'e', 'd', 'e']
['8', '0', '0', '0']
['th']


**Use negation to construct character classes**

In [53]:
target_string = "abcde25"

# match any character except a, b, or c
result = re.findall(r"[^abc]", target_string)
print(result)

# match any character except digits
result = re.findall(r"[^0-9]", target_string)
print(result)

['d', 'e', '2', '5']
['a', 'b', 'c', 'd', 'e']


**Use ranges to construct character classes**

In [55]:
target_string = "ABCDefg29"

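# any lowercase letter
print(re.findall(r"[a-z]", target_string))

# any uppercase letter
print(re.findall(r"[A-Z]", target_string))

# any letter
print(re.findall(r"[a-zA-Z]", target_string))

# any digit from 2 to 6
print(re.findall(r"[2-6]", target_string))

# any letter from A to C or any digit from 2 to 8
print(re.findall(r"[A-C2-8]", target_string))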

['e', 'f', 'g']
['A', 'B', 'C', 'D']
['A', 'B', 'C', 'D', 'e', 'f', 'g']
['2']
['A', 'B', 'C', '2']


# Regex Flags

|Flag|Long syntax|Meaning|
|:--|:--|:--|
|re.A|re.ASCII|Perform ASCII-only matching instead of full Unicode matching.|
|re.I|re.IGNORECASE|Perform case-insensitive matching.|
|re.M|re.MULTILINE|Used with the metacharacters `^` (caret) and `$` (dollar). When this flag is specified, `^` matches at the beginning of the string and at the beginning of each line (immediately after each \n), and `$` matches at the end of the string and at the end of each line (immediately before each \n).|
|re.S|re.DOTALL|Make the dot (.) match any character at all, including a newline. Without this flag, the dot matches anything except a newline.|
|re.X|re.VERBOSE|Allow whitespace and comments inside the regex, which makes a long pattern more readable.|
|re.L|re.LOCALE|Make \w, \W, \b, \B, and case-insensitive matching dependent on the current locale. Use only with bytes patterns.|
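
Flags can be combined with the bitwise OR operator `|`, or written inline at the start of the pattern as `(?im)`-style letters. The sketch below shows both forms on a made-up two-line string; the names `combined` and `inline` are illustrative.

```python
import re

target_str = "kelly loves ML\nKELLY loves AI"

# Combine flags with | : case-insensitive and multi-line at once
combined = re.findall(r"^kelly", target_str, re.I | re.M)
print(combined)

# The same two flags written inline at the start of the pattern
inline = re.findall(r"(?im)^kelly", target_str)
print(inline)
```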

**IGNORECASE flag**

In [56]:
target_str = "KELLy is a Python developer at a PYnative. kelly loves ML and AI"

# Without using re.I
result = re.findall(r"kelly", target_str)
print(result)

# with re.I
result = re.findall(r"kelly", target_str, re.I)
print(result)

# with re.IGNORECASE
result = re.findall(r"kelly", target_str, re.IGNORECASE)
print(result)

['kelly']
['KELLy', 'kelly']
['KELLy', 'kelly']


**DOTALL flag**

In [58]:
import re

# string with newline character
target_str = "ML\nand AI"

# Match any character
result = re.search(r".+", target_str)
print("Without using re.S flag:", result.group())
print()

# With re.S flag
result = re.search(r".+", target_str, re.S)
print("With re.S flag:", result.group())
print()

# With re.DOTALL flag
result = re.search(r".+", target_str, re.DOTALL)
print("With re.DOTALL flag:", result.group())


Without using re.S flag: ML

With re.S flag: ML
and AI

With re.DOTALL flag: ML
and AI


**VERBOSE flag**

In [59]:
target_str = "Jessa is a Python developer, and her salary is 8000"

# re.X allows whitespace and comments inside the regex
result = re.search(r"""(^\w{2,}) # match a word of 2 or more characters at the start
                        .+(\d{4}$) # match 4-digit number at the end """, target_str, re.X)
# word at the start of the string
print(result.group(1))

# 4-digit number
print(result.group(2))

Jessa
8000


**MULTILINE flag**

In [60]:
target_str = "Joy lucky number is 75\nTom lucky number is 25"

# find a 3-letter word at the start of each line
# Without re.M or re.MULTILINE flag
result = re.findall(r"^\w{3}", target_str)
print(result)

# find a 2-digit number at the end of each line
# Without re.M or re.MULTILINE flag
result = re.findall(r"\d{2}$", target_str)
print(result)

# With re.M or re.MULTILINE
# find a 3-letter word at the start of each line
result = re.findall(r"^\w{3}", target_str, re.MULTILINE)
print(result)

# With re.M
# find a 2-digit number at the end of each line
result = re.findall(r"\d{2}$", target_str, re.M)
print(result)

['Joy']
['25']
['Joy', 'Tom']
['75', '25']
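
For comparison, `\A` and `\Z` keep anchoring to the whole string even when `re.M` is set. The sketch below reuses the same target string; the result names are illustrative.

```python
import re

target_str = "Joy lucky number is 75\nTom lucky number is 25"

# ^ and $ follow line boundaries under re.M
caret = re.findall(r"^\w{3}", target_str, re.M)
dollar = re.findall(r"\d{2}$", target_str, re.M)
print(caret, dollar)

# \A and \Z still refer to the start and end of the whole string
start_only = re.findall(r"\A\w{3}", target_str, re.M)
end_only = re.findall(r"\d{2}\Z", target_str, re.M)
print(start_only, end_only)
```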


**ASCII flag**

In [61]:
import re

# string with ASCII and Unicode characters
target_str = "虎太郎 and Jessa are friends"

# Without re.A or re.ASCII, match all 3-letter words
result = re.findall(r"\b\w{3}\b", target_str)
print(result)

# With re.A or re.ASCII, match only 3-letter ASCII words
result = re.findall(r"\b\w{3}\b", target_str, re.A)
print(result)

['虎太郎', 'and', 'are']
['and', 'are']
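
The flag affects `\d` in the same way as `\w`. A minimal sketch, assuming a made-up string that contains the Arabic-Indic digit three (written as the escape `\u0663`); the names `unicode_digits` and `ascii_digits` are illustrative.

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit
target_str = "Room \u0663 and room 4"

# Without re.A, \d matches any Unicode decimal digit
unicode_digits = re.findall(r"\d", target_str)
print(unicode_digits)

# With re.A, \d matches only 0-9
ascii_digits = re.findall(r"\d", target_str, re.A)
print(ascii_digits)
```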
