# Python RegEx

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

## RegEx Module

Python has a built-in package called `re`, which can be used to work with regular expressions.

## RegEx in Python

When you have imported the `re` module, you can start using regular expressions:

Eg: Search the string to see if it starts with "The" and ends with "Spain":

In [4]:
import re

txt = "The rain in Spain"

# ^ Matches the *start of the string*
# .* Matches any character (.) 0 or more times (*). Basically, it matches everything in between
# $ Matches the *end of the string*
x = re.search("^The.*Spain$", txt)

if x:
    print("Match Found!")
else:
    print("Match not Found!")

Match Found!


## RegEx Functions

The `re` module offers a set of functions that allow us to search a string for a match:

| Function | Description |
|----------|-------------|
| `findall`|Returns a list containing all matches|
| `search`|Returns a `Match object` if there is a match anywhere in the string|
| `split`|Returns a list where the string has been split at each match|
| `sub`| Replaces one or many matches with a string|

## Metacharacters

Metacharacters are characters with a special meaning:

## Common Regex Characters in Python
| Character | Description | Example |
|-----------|-------------|---------|
| `[]`      | A set of characters | `"[a-m]"` |
| `\`       | Signals a special sequence (or escapes special characters) | `"\d"` |
| `.`       | Any character (except newline) | `"he..o"` |
| `^`       | Starts with | `"^hello"` |
| `$`       | Ends with | `"planet$"` |
| `*`       | Zero or more occurrences | `"he.*o"` |
| `+`       | One or more occurrences | `"he.+o"` |
| `?`       | Zero or one occurrences | `"he.?o"` |
| `{}`      | Exactly the specified number of occurrences | `"he.{2}o"` |
| `\|`       | Either/or | `"falls\|stays"` |
| `()`      | Capture and group | `"(abc)"` |
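
The short sketch below exercises several of the metacharacters from the table. The sample strings and variable names are made up for illustration; only `re.findall()` and `re.search()` come from the `re` module.

```python
import re

text = "hello planet, the rain falls and stays"

# [] : any single lowercase letter from a to m
letters = re.findall(r"[a-m]", "regex")
print(letters)  # ['e', 'g', 'e']

# . and {} : "he", exactly two of any character, then "o"
two_between = re.findall(r"he.{2}o", text)
print(two_between)  # ['hello']

# | : either "falls" or "stays"
either = re.findall(r"falls|stays", text)
print(either)  # ['falls', 'stays']

# ^ and $ : anchor the pattern to the start or end of the string
starts = bool(re.search(r"^hello", text))
ends = bool(re.search(r"planet$", text))
print(starts, ends)  # True False

# ? : zero or one occurrence, so the "u" is optional
optional_u = re.findall(r"colou?r", "color colour")
print(optional_u)  # ['color', 'colour']

# () : capture part of the match and read it back with group(1)
m = re.search(r"he(.+)o", "hello")
print(m.group(1))  # ll
```

Note that `planet$` does not match here because the sample string ends with "stays", not "planet".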


## Flags

You can add flags to the pattern when using regular expressions.

## Common Regex Flags in Python

| Flag | Shorthand | Description |
|------|-----------|-------------|
| `re.ASCII` | `re.A` | Makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, and `\S` match ASCII characters only |
| `re.DEBUG` |  | Displays debug information about the compiled pattern |
| `re.DOTALL` | `re.S` | Makes the `.` character match all characters (including newline) |
| `re.IGNORECASE` | `re.I` | Case-insensitive matching |
| `re.MULTILINE` | `re.M` | Allows `^` and `$` to match the start and end of each line |
| `re.NOFLAG` |  | Specifies that no flag is set for this pattern (available from Python 3.11) |
| `re.UNICODE` | `re.U` | Matches using Unicode rules (already the default for string patterns in Python 3; in Python 2, use this for Unicode) |
| `re.VERBOSE` | `re.X` | Allows whitespace and comments inside patterns, making them more readable |
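
As a rough illustration of how flags change matching, the sketch below passes them through the `flags` keyword argument. The sample text and variable names are assumptions chosen for the example.

```python
import re

text = "The rain\nin SPAIN"

# re.IGNORECASE: "spain" also matches "SPAIN"
ignorecase = re.findall(r"spain", text, flags=re.IGNORECASE)
print(ignorecase)  # ['SPAIN']

# re.MULTILINE: ^ matches at the start of every line, not only the string
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(line_starts)  # ['The', 'in']

# re.DOTALL: without it, . cannot cross the newline between "rain" and "in"
no_dotall = re.search(r"rain.in", text)
with_dotall = re.search(r"rain.in", text, flags=re.DOTALL)
print(no_dotall)            # None
print(with_dotall.group())  # rain, a newline, then in

# Flags can be combined with the | operator
combined = re.findall(r"^in spain$", text, flags=re.I | re.M)
print(combined)  # ['in SPAIN']

# re.VERBOSE: whitespace and # comments inside the pattern are ignored
time_pattern = r"""
    [0-2][0-9]   # hours
    :            # separator
    [0-5][0-9]   # minutes
"""
found_time = re.search(time_pattern, "Meet at 09:30", flags=re.VERBOSE)
print(found_time.group())  # 09:30
```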

## Special Sequences

A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:

## Common Regex Special Sequences in Python

| Character | Description | Example |
|-----------|-------------|---------|
| `\A` | Returns a match if the specified characters are at the **beginning** of the string | `\AThe` |
| `\b` | Returns a match where the specified characters are at the **beginning or end of a word** (use raw string `r""`) | `r"\bain"` <br> `r"ain\b"` |
| `\B` | Returns a match where the specified characters are present, **but NOT at the beginning or end of a word** (use raw string `r""`) | `r"\Bain"` <br> `r"ain\B"` |
| `\d` | Returns a match where the string contains a **digit** (0–9) | `\d` |
| `\D` | Returns a match where the string contains a **non-digit character** | `\D` |
| `\s` | Returns a match where the string contains a **whitespace character** | `\s` |
| `\S` | Returns a match where the string contains a **non-whitespace character** | `\S` |
| `\w` | Returns a match where the string contains any **word character** (a–z, A–Z, 0–9, `_`) | `\w` |
| `\W` | Returns a match where the string contains a **non-word character** | `\W` |
| `\Z` | Returns a match if the specified characters are at the **end** of the string | `Spain\Z` |
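
The following sketch tries a few of these special sequences on one sample string. The string and variable names are illustrative assumptions; every pattern is written as a raw string so the backslash reaches the regex engine unchanged.

```python
import re

txt = "The rain in Spain costs 25 euros"

# \A : match only at the beginning of the string
at_start = re.search(r"\AThe", txt)
print(at_start is not None)  # True

# \b : "ain" at the start of a word (none here) vs. at the end of a word
word_start = re.findall(r"\bain", txt)
word_end = re.findall(r"ain\b", txt)
print(word_start)  # []
print(word_end)    # ['ain', 'ain']

# \B : "ain" that is NOT at the start of a word ("rain", "Spain")
not_word_start = re.findall(r"\Bain", txt)
print(not_word_start)  # ['ain', 'ain']

# \d : each digit is returned as a separate match
digits = re.findall(r"\d", txt)
print(digits)  # ['2', '5']

# \s : every whitespace character
spaces = re.findall(r"\s", txt)
print(len(spaces))  # 6

# \w+ : runs of word characters, which splits the text into words
words = re.findall(r"\w+", txt)
print(words)  # ['The', 'rain', 'in', 'Spain', 'costs', '25', 'euros']

# \Z : match only at the end of the string
at_end = re.search(r"euros\Z", txt)
print(at_end is not None)  # True
```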

## Sets

A set is a set of characters inside a pair of square brackets `[]` with a special meaning:

## Common Regex Sets in Python

| Set | Description |
|-----|-------------|
| `[arn]` | Returns a match where **one of the specified characters** (a, r, or n) is present |
| `[a-n]` | Returns a match for **any lowercase character** alphabetically between a and n |
| `[^arn]` | Returns a match for **any character EXCEPT** a, r, and n |
| `[0123]` | Returns a match where **any of the specified digits** (0, 1, 2, or 3) are present |
| `[0-9]` | Returns a match for **any digit** between 0 and 9 |
| `[0-5][0-9]` | Returns a match for **any two-digit number** from 00 to 59 |
| `[a-zA-Z]` | Returns a match for **any character alphabetically between a and z**, lower or upper case |
| `[+]` | In sets, `+`, `*`, `.`, `\|`, `()`, `$`, `{}` have **no special meaning**, so `[+]` matches the literal `+` character |
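
A minimal sketch of a few sets from the table, applied with `re.findall()`. The sample string and variable names are assumptions made for the example.

```python
import re

txt = "The rain in Spain at 8:45 + 10"

# [arn] : any one of the characters a, r, or n
arn = re.findall(r"[arn]", txt)
print(arn)  # ['r', 'a', 'n', 'n', 'a', 'n', 'a']

# [a-zA-Z]+ : runs of letters, lower or upper case
letter_runs = re.findall(r"[a-zA-Z]+", txt)
print(letter_runs)  # ['The', 'rain', 'in', 'Spain', 'at']

# [0-5][0-9] : two-digit numbers from 00 to 59 (the lone "8" does not match)
two_digit = re.findall(r"[0-5][0-9]", txt)
print(two_digit)  # ['45', '10']

# [+] : inside a set, + is a literal plus sign
plus = re.findall(r"[+]", txt)
print(plus)  # ['+']

# [^0-9] : the first character that is NOT a digit
non_digit = re.search(r"[^0-9]", "8:45")
print(non_digit.group())  # :
```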

## The findall() Function

The `findall()` function returns a list containing all matches.

In [5]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

In [6]:
x = re.findall("Nepal", txt)
print(x)

[]


## The search() Function

The `search()` function searches the string for a match, and returns a Match object if there is a match. 

If there is more than one match, only the first occurrence of the match will be returned:

Eg: Search for the first white-space character in the string:

In [7]:
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)  # raw string, so "\s" is not read as a string escape

print("The first white-space character is located in position: ", x.start())

The first white-space character is located in position:  3


If no matches are found, the value `None` is returned.

In [10]:
import re

txt = "Hii"
x = re.search("Hello", txt)
print(x)

None
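
Because `search()` can return `None`, calling a method such as `.start()` directly on the result raises an `AttributeError` when nothing matched. One way to guard against that is sketched below; the helper name `first_whitespace_position` and its `-1` fallback are hypothetical choices for this example, not part of the `re` module.

```python
import re

def first_whitespace_position(text):
    """Return the index of the first whitespace character, or -1 if none."""
    match = re.search(r"\s", text)
    if match is None:
        # No match: there is no Match object to call .start() on
        return -1
    return match.start()

print(first_whitespace_position("The rain in Spain"))  # 3
print(first_whitespace_position("Hii"))                # -1
```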


## The split() Function

The `split()` function returns a list where the string has been split at each match.

Eg: Split at each white-space character:

In [11]:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


You can control the number of splits by specifying the `maxsplit` parameter.

Eg: Split the string only at the first occurrence:

In [12]:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']


## The sub() Function

The `sub()` function replaces the matches with the text of your choice.

Eg: Replace every white-space character with the digit 9:

In [14]:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


You can control the number of replacements by specifying the `count` parameter:

Eg: Replace the first 2 occurrences:

In [15]:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain


## Match Object

A Match Object is an object containing info about the search and the result.

>Note: If there is no match, the value `None` will be returned, instead of the Match Object.

Eg: Do a search that will return a Match Object:

In [16]:
import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) # this will print an object

<re.Match object; span=(5, 7), match='ai'>


The Match Object has properties and methods used to retrieve info about the search, and the result:

`.span()` returns a tuple containing the start and end positions of the match.

`.string` returns the string passed into the function

`.group()` returns the part of the string where there was a match

Eg: Print the position (start- and end-position) of the first match occurrence.

The regular expression looks for any word that starts with an upper case "S":

In [17]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


Eg: Print the string passed into the function

In [18]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain


Eg: Print the part of the string where there was a match

The regular expression looks for any word that starts with an upper case "S":

In [19]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain


>Note: If there is no match, the value `None` will be returned, instead of the Match Object.
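
When the pattern contains capture groups `()`, the same Match Object methods accept a group number. The sketch below assumes a made-up pattern with two groups; `.group(n)`, `.groups()`, and `.span(n)` are standard Match Object methods, and the `if` check covers the `None` case described in the note.

```python
import re

txt = "The rain in Spain"

# Two capture groups: the word before " in " and the word after it
match = re.search(r"(\w+) in (\w+)", txt)

if match:
    print(match.group())   # rain in Spain   (the whole match)
    print(match.group(1))  # rain            (first group)
    print(match.group(2))  # Spain           (second group)
    print(match.groups())  # ('rain', 'Spain')
    print(match.span(2))   # (12, 17)        (position of the second group)
else:
    print("Match not Found!")
```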