
# Python RegEx
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

## RegEx Module
Python has a built-in package called `re`, which can be used to work with Regular Expressions.

Import the `re` module:

In [1]:
import re

## RegEx in Python
When you have imported the `re` module, you can start using regular expressions:

In [2]:
#Search the string to see if it starts with "The" and ends with "Spain":
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
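
The cell above stores the result in `x` without displaying it. A minimal sketch of one way to check the result, relying on the fact that `re.search()` returns a Match object (truthy) or `None` (falsy); the printed messages are illustrative only:

```python
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

# A Match object is truthy; None (no match) is falsy.
if x:
    print("YES! We have a match!")
else:
    print("No match")
```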

## RegEx Functions

| **Function** | **Description**                                                   |
|:------------:|:-----------------------------------------------------------------:|
| findall      | Returns a list containing all matches                             |
| search       | Returns a Match object if there is a match anywhere in the string |
| split        | Returns a list where the string has been split at each match      |
| sub          | Replaces one or many matches with a string                        |


## Metacharacters

| **Character** | **Description**                                                            | **Example**   |
|:-------------:|:--------------------------------------------------------------------------:|:-------------:|
| `[]`          | A set of characters                                                        | "[a-m]"       |
| `\`             | Signals a special sequence (can also be used to escape special characters) | "\d"          |
| `.`             | Any character (except newline character)                                   | "he..o"       |
| `^`             | Starts with                                                                | "^hello"      |
| `$`             | Ends with                                                                  | "planet$"     |
| `*`             | Zero or more occurrences                                                   | "he.*o"       |
| `+`            | One or more occurrences                                                    | "he.+o"       |
| `?`             | Zero or one occurrences                                                    | "he.?o"       |
| `{}`            | Exactly the specified number of occurrences                                | "he.{2}o"     |
| `\|`             | Either or                                                                  | "falls\|stays" |
| `()`            | Capture and group                                                          | &nbsp;        |
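
The metacharacters in the table can be tried out with `re.findall()`. The following is a small sketch, not an exhaustive reference; the sample strings and variable names are assumptions chosen for illustration:

```python
import re

txt = "hello planet"

starts = re.findall(r"^hello", txt)    # ^ anchors the pattern at the start
ends = re.findall(r"planet$", txt)     # $ anchors the pattern at the end
two_any = re.findall(r"he..o", txt)    # each . matches any single character
exact = re.findall(r"he.{2}o", txt)    # {2} means exactly two occurrences
in_set = re.findall(r"[a-m]", "hello") # any one character from a to m
either = re.findall(r"falls|stays", "The rain in Spain falls mainly in the plain")
groups = re.search(r"(he)(llo)", txt).groups()  # () captures sub-matches

print(starts, ends, two_any, exact)  # ['hello'] ['planet'] ['hello'] ['hello']
print(in_set)                        # ['h', 'e', 'l', 'l']
print(either)                        # ['falls']
print(groups)                        # ('he', 'llo')
```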


## Special Sequences

| **Special Sequence** | **Meaning**                                                                                                                                                                                                                                                            |
|:--------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| `\A`                  | Matches pattern only at the start of the string                                                                                                                                                                                                                        |
| `\Z`                  | Matches pattern only at the end of the string                                                                                                                                                                                                                          |
| `\d`                  | Matches any digit. For ASCII text, equivalent to the character class `[0-9]` |
| `\D`                  | Matches any non-digit. For ASCII text, equivalent to `[^0-9]` |
| `\s`                  | Matches any whitespace character. For ASCII text, equivalent to the character class `[ \t\n\x0b\r\f]` |
| `\S`                  | Matches any non-whitespace character. For ASCII text, equivalent to `[^ \t\n\x0b\r\f]` |
| `\w`                  | Matches any word character (letter, digit, or underscore). For ASCII text, equivalent to the character class `[a-zA-Z_0-9]` |
| `\W`                  | Matches any non-word character. For ASCII text, equivalent to `[^a-zA-Z_0-9]` |
| `\b`                  | Matches the empty string, but only at the beginning or end of a word, that is, at a word boundary, where a word character is `[a-zA-Z0-9_]`. For example, `\bJessa\b` matches 'Jessa', 'Jessa.', '(Jessa)', and 'Jessa Emma Kelly', but not 'JessaKelly' or 'Jessa5'. |
| `\B`                  | Opposite of `\b`. Matches the empty string, but only when it is not at the beginning or end of a word |
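
A short sketch of how some of these sequences behave in practice. The sample sentence and variable names are assumptions made for illustration; patterns are written as raw strings (`r"..."`) so that Python does not interpret the backslashes itself:

```python
import re

txt = "Jessa scored 85 in 2023"

digits = re.findall(r"\d+", txt)   # runs of digits
words = re.findall(r"\w+", txt)    # runs of word characters
spaces = re.findall(r"\s", txt)    # individual whitespace characters

at_start = re.search(r"\AJessa", txt)  # \A: only at the start of the string
at_end = re.search(r"2023\Z", txt)     # \Z: only at the end of the string

# \b requires a word boundary on both sides of "Jessa".
bounded = re.search(r"\bJessa\b", "(Jessa)")
unbounded = re.search(r"\bJessa\b", "JessaKelly")

print(digits)       # ['85', '2023']
print(words)        # ['Jessa', 'scored', '85', 'in', '2023']
print(len(spaces))  # 4
print(bounded is not None, unbounded is None)  # True True
```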


## Sets

| **Set** | **Description**                                                             |
|:-------------------:|:---------------------------------------------------------------------------:|
| `[abc]`             | Match the letter a, b, or c                                                 |
| `[abc][pq]`       | Match the letter a, b, or c followed by either p or q                       |
| `[^abc]`            | Match any character except a, b, or c (negation)                            |
| `[0-9]`            | Match any digit from 0 to 9, inclusive (range)                              |
| `[a-z]`            | Match any lowercase letter from a to z, inclusive (range)                   |
| `[A-Z]`            | Match any uppercase letter from A to Z, inclusive (range)                   |
| `[a-zA-Z]`        | Match any lowercase or uppercase letter (two ranges)                        |
| `[m-p2-8]`        | Match one character that is either a letter from m to p or a digit from 2 to 8; a single set matches one character, so it does not match the two-character string p2 as a whole |
| `[a-zA-Z0-9_]`  | Match any letter, digit, or underscore                                      |


## The findall() Function
The `findall()` function returns a list containing all matches.

In [3]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

In [4]:
import re

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]
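
`findall()` also works with patterns that contain metacharacters, and the results still come back in the order they appear in the string. A minimal sketch, assuming a longer sample sentence than the one used above:

```python
import re

txt = "The rain in Spain stays mainly in the plain"

# Find every word that contains "ai" followed by "n".
x = re.findall(r"\w*ain\w*", txt)
print(x)  # ['rain', 'Spain', 'mainly', 'plain']
```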


## The search() Function
The `search()` function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

In [5]:
#Search for the first white-space character in the string:
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


If no matches are found, the value `None` is returned:

In [6]:
import re

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None
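
Because `search()` can return `None`, calling a method such as `.start()` on the result without checking it first raises an `AttributeError` when there is no match. One possible way to guard against this is sketched below; the helper name `first_position` and the `-1` fallback are hypothetical choices, not part of the `re` module:

```python
import re

def first_position(pattern, text):
    """Return the start index of the first match, or -1 if there is none."""
    m = re.search(pattern, text)
    return m.start() if m is not None else -1

txt = "The rain in Spain"
print(first_position(r"\s", txt))       # 3
print(first_position("Portugal", txt))  # -1
```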


## The split() Function
The `split()` function returns a list where the string has been split at each match:

In [7]:
#Split at each white-space character:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


You can limit the number of splits by specifying the `maxsplit` parameter:

In [8]:
#Split the string only at the first occurrence:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']
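
When the text contains runs of several whitespace characters, splitting on a single `\s` produces empty strings between them. A sketch of the difference, assuming an irregularly spaced sample string; `\s+` treats each run as one separator:

```python
import re

txt = "The  rain in   Spain"  # irregular spacing (assumed sample)

single = re.split(r"\s", txt)                 # one split per whitespace character
runs = re.split(r"\s+", txt)                  # one split per run of whitespace
limited = re.split(r"\s+", txt, maxsplit=2)   # stop after two splits

print(single)   # ['The', '', 'rain', 'in', '', '', 'Spain']
print(runs)     # ['The', 'rain', 'in', 'Spain']
print(limited)  # ['The', 'rain', 'in   Spain']
```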


## The sub() Function
The `sub()` function replaces the matches with the text of your choice:

In [9]:
#Replace every white-space character with the number 9:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


You can control the number of replacements by specifying the `count` parameter:

In [10]:
#Replace the first 2 occurrences:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain
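
A brief sketch of a few more `sub()` cases. It passes `count` by keyword, which keeps the call readable, and shows that the string comes back unchanged when the pattern does not match; the underscore replacement and variable names are arbitrary choices for illustration:

```python
import re

txt = "The rain in Spain"

all_replaced = re.sub(r"\s", "_", txt)          # replace every match
first_only = re.sub(r"\s", "_", txt, count=1)   # replace only the first match
unchanged = re.sub("Portugal", "_", txt)        # no match: original string returned

print(all_replaced)  # The_rain_in_Spain
print(first_only)    # The_rain in Spain
print(unchanged)     # The rain in Spain
```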


## Match Object
A Match Object is an object containing information about the search and the result.

**Note**: If there is no match, the value `None` will be returned, instead of the Match Object.

In [11]:
import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)  # prints the Match object

<re.Match object; span=(5, 7), match='ai'>


The Match object has properties and methods used to retrieve information about the search, and the result:

`.span()` returns a tuple containing the start and end positions of the match.

`.string` returns the string passed into the function.

`.group()` returns the part of the string where there was a match.

In [12]:
#Print the start and end positions of the first match.
#The regular expression looks for any word that starts with an uppercase "S":
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


In [13]:
#Print the string passed into the function:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain


In [14]:
#Print the part of the string where there was a match.
#The regular expression looks for any word that starts with an uppercase "S":
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain
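
The Match object also exposes `.start()` and `.end()` (the two halves of `.span()`), and when the pattern contains capture groups, `.group(n)` and `.groups()` return the captured parts. A minimal sketch; the parentheses added to the pattern below are an assumption made here to demonstrate groups and are not part of the examples above:

```python
import re

txt = "The rain in Spain"
x = re.search(r"\b(S)(\w+)", txt)  # two capture groups: "S" and the rest

print(x.start(), x.end())  # 12 17
print(x.group(0))          # Spain  (the whole match, same as x.group())
print(x.group(1))          # S
print(x.group(2))          # pain
print(x.groups())          # ('S', 'pain')

# Slicing the original string with the span gives the matched text.
print(txt[x.start():x.end()] == x.group())  # True
```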
