# <center> RegEx in Python</center>

![](images/memes/meme3.jpg)

# Getting started with RegEx in Python

The **[re](https://docs.python.org/3/howto/regex.html)** module provides an interface to the regular expression engine, allowing you to **compile regular expressions into objects and then perform matches with them**.

In [1]:
import re

## 1. Compiling Regular Expressions

Regular expressions are **compiled** into `Pattern` objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.


### `re.compile(pattern, flags=0)`

Compile a regular expression pattern, returning a pattern object.

- The regular expression is passed to `re.compile()` as a **string**. 

> Regular expressions are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. 

> Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.

In [2]:
# Here we are creating a pattern for the literal text "hello"
# We can now use it to search for the word "hello" in any text

pattern = re.compile("hello")

In [3]:
# Inspect the pattern object
# "re.UNICODE" is the default flag for str patterns
# It means classes such as \w, \d and \s match Unicode characters, not only ASCII ones
# ASCII covers only 128 characters (code points 0-127)
# Unicode covers far more characters, across many scripts
 
pattern

re.compile(r'hello', re.UNICODE)

In [4]:
# So the type of the pattern is

type(pattern)

re.Pattern

- `re.compile()` also accepts an optional `flags` argument, used to enable various special features and syntax variations. [More about flags](http://xahlee.info/python/python_regex_flags.html)

<br>

In the example below, we use the flag `re.I` (short for `re.IGNORECASE`) to ignore letter case in the regex pattern.

In [5]:
pattern = re.compile("hello", flags=re.I)

In [6]:
# The pattern now carries two flags: re.IGNORECASE and the default re.UNICODE

pattern

re.compile(r'hello', re.IGNORECASE|re.UNICODE)

This pattern will now match all of these inputs:
> `Hello`

> `HELLO`

> `hello`
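As a quick check of the claim above, here is a minimal sketch that compiles the pattern with `re.I` and tries it against a few inputs. The `candidates` list, including the non-matching word `help`, is made up for illustration.

```python
import re

# Compile once with the IGNORECASE flag, then reuse the pattern.
pattern = re.compile("hello", flags=re.I)

# Illustrative inputs; "help" is included as a non-matching control.
candidates = ["Hello", "HELLO", "hello", "help"]

# match() returns a Match object on success and None otherwise.
results = {text: pattern.match(text) is not None for text in candidates}
print(results)
# {'Hello': True, 'HELLO': True, 'hello': True, 'help': False}
```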

## 2. Performing Matches

So far, we have created a `Pattern` object representing a compiled regular expression using the `re.compile()` function.

Pattern objects have several methods and attributes.

Here is the list of different methods used for performing matches:


<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Method/Attribute</th>
    <th>Purpose</th>
</thead>
    
<tbody>
<tr>
    <td>match()</td>
    <td>Determine if the RE matches at the beginning of the string.</td>
</tr>
    
<tr>
    <td>search()</td>
    <td>Scan through a string, looking for any location where this RE matches.</td>
</tr>

<tr>
    <td>findall()</td>
    <td>Find all substrings where the RE matches, and return them as a list.</td>
</tr>

<tr>
    <td>finditer()</td>
    <td>Find all substrings where the RE matches, and return them as an iterator.</td>
</tr>
</tbody>
</table>

Let us go through them one by one:

### `match(string[, pos[, endpos]])`

- By default, a match is attempted only at the beginning of the string.

- Matching starts at index `pos` of the string (default is 0).

- Matching stops at index `endpos`: the string is treated as if it were only `endpos` characters long. By default, the whole string is considered.

- Returns `None` if no match is found.

- If a match is found, a `Match` object is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In [7]:
# Creating a pattern

pattern = re.compile("hello")

In [8]:
# Now trying to match the "hello" word with the "match()"
# It will return a match object

pattern.match("hello")

<re.Match object; span=(0, 5), match='hello'>

In [9]:
# Keeping the match object in a variable

match = pattern.match("hello world")

In [10]:
# The result is a Match object
# A Match object is returned whenever a match is found

type(match)

re.Match

In [11]:
# span()
# It will tell the index positions between which the match was found in the input text

match.span()

(0, 5)

In [12]:
# start()
# The starting position of the match

match.start()

0

In [13]:
# end()
# The ending position of the match

match.end()

5

In [14]:
# If no match is found it will return None

no_match = pattern.match("ello world")
type(no_match)

NoneType

In [15]:
# pos
# The "pos" parameter sets the index at which matching starts
# Here we start matching at index 4

pattern.match("say hello", pos=4)

<re.Match object; span=(4, 9), match='hello'>

In [16]:
# Check whether the result is None (it is not, because a match was found)

pattern.match("say hello", pos=4) is None

False

In [17]:
# endpos
# The "endpos" parameter sets the index at which matching stops
# Here only the first 4 characters ("hell") are considered, so nothing is returned

pattern.match("hello", endpos=4)

In [18]:
# Because only indices 0 to 3 are considered, "hello" cannot match and we get None

pattern.match("hello", endpos=4) is None

True
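Because `match()` returns either a `Match` object or `None`, calling code usually has to test the result before using it. The sketch below shows one way to do that. `describe` is a hypothetical helper written for this example, not part of the `re` module, and `group()` is the `Match` method that returns the matched substring.

```python
import re

pattern = re.compile("hello")

def describe(text, pos=0):
    """Hypothetical helper: summarise the result of pattern.match()."""
    m = pattern.match(text, pos)
    if m is None:
        # No match at the requested position.
        return "no match"
    # group() returns the matched substring, span() its (start, end) indices.
    return f"matched {m.group()!r} at {m.span()}"

print(describe("hello world"))       # matched 'hello' at (0, 5)
print(describe("say hello"))         # no match
print(describe("say hello", pos=4))  # matched 'hello' at (4, 9)
```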

### `search(string[, pos[, endpos]])`

- A match is looked for throughout the string, so the pattern can be found at any position.

- `pos` and `endpos` behave as they do for `match()`.

- Returns `None` if no match is found.

- If a match is found, a `Match` object is returned.

In [19]:
pattern.search("say hello")

<re.Match object; span=(4, 9), match='hello'>

In [20]:
# If the pattern occurs several times, only the first match is returned

pattern.search("say hello hello")

<re.Match object; span=(4, 9), match='hello'>
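To make the difference between `match()` and `search()` concrete, the sketch below runs both on the same text. Using the end of the first match as `pos` to reach the second occurrence is just one illustrative approach. The variable names are made up for this example.

```python
import re

pattern = re.compile("hello")
text = "say hello hello"

# match() only looks at the start of the text, so it fails here.
at_start = pattern.match(text)
print(at_start)            # None

# search() scans forward and reports the first occurrence.
first = pattern.search(text)
print(first.span())        # (4, 9)

# Resuming the scan where the first match ended finds the next occurrence.
second = pattern.search(text, pos=first.end())
print(second.span())       # (10, 15)
```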

### `findall(string[, pos[, endpos]])`

- Finds **all non-overlapping substrings** where the match is found, and returns them as a list.

- `pos` and `endpos` behave as they do for the `match()` and `search()` methods.

In [21]:
# If the pattern occurs several times, every occurrence is returned
# The result contains the matched strings, not their index positions

pattern.findall("say hello hello")

['hello', 'hello']
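The effect of `pos` and `endpos` on `findall()` can be sketched as follows. The index values 5 and 12 are chosen only to cut into the first and second `hello` respectively.

```python
import re

pattern = re.compile("hello")
text = "say hello hello"

# Starting at index 5 skips the "h" of the first "hello".
from_5 = pattern.findall(text, pos=5)
print(from_5)      # ['hello']

# Stopping at index 12 cuts the second "hello" short.
up_to_12 = pattern.findall(text, endpos=12)
print(up_to_12)    # ['hello']

# With both limits, no complete "hello" is left in the window.
window = pattern.findall(text, pos=5, endpos=12)
print(window)      # []
```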

### `finditer(string[, pos[, endpos]])`

- Finds **all non-overlapping substrings** where the match is found, and returns them as an iterator of the `Match` objects.

- `pos` and `endpos` behave as they do for the `match()`, `search()` and `findall()` methods.

In [22]:
# Here it creates a match object iterator

matches = pattern.finditer("say hello hello")
matches

<callable_iterator at 0x277361357f0>

In [23]:
type(matches)

callable_iterator

In [24]:
# Loop over the iterator to read each Match object

for match in matches:
    print(match.span())

(4, 9)
(10, 15)


In [25]:
# Highlight the matched text in the output
# This uses the helper function "highlight_regex_matches", imported from "utils"

from utils import highlight_regex_matches
highlight_regex_matches(pattern, "say hello hello")

say **hello** **hello**


> By now, you must have noticed that `match()`, `search()` and `finditer()` return `Match` object(s), whereas `findall()` returns a list of strings.
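The following sketch puts the two return styles side by side. It also shows that the iterator from `finditer()` can be consumed only once. The variable names are illustrative.

```python
import re

pattern = re.compile("hello")
text = "say hello hello"

# findall() gives the matched strings directly.
strings = pattern.findall(text)
print(strings)       # ['hello', 'hello']

# finditer() gives Match objects, so positions are available as well.
details = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(details)       # [('hello', (4, 9)), ('hello', (10, 15))]

# The iterator is exhausted after one pass.
matches = pattern.finditer(text)
first_pass = [m.span() for m in matches]
second_pass = [m.span() for m in matches]
print(first_pass)    # [(4, 9), (10, 15)]
print(second_pass)   # []
```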


### Note:

It is not mandatory to create a `Pattern` object explicitly using the `re.compile()` function in order to perform a regex operation.

You can directly use the module-level functions such as:
- `re.match(pattern, string, flags=0)`

- `re.search(pattern, string, flags=0)`

- `re.findall(pattern, string, flags=0)`

- `re.finditer(pattern, string, flags=0)`

and so on.

In a module level function, you can simply pass a **string** as your **regex pattern** as shown in the examples below.

In [26]:
# Here we are passing the pattern and the string in one go

re.match("hello", "hello")

<re.Match object; span=(0, 5), match='hello'>

In [27]:
# Doing the same with "findall()"

re.findall("hello", "say hello hello")

['hello', 'hello']
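The module-level functions also accept the `flags` argument shown in their signatures above, so a flag such as `re.I` can be passed without compiling first. Here is a minimal sketch with made-up input strings. Note that these functions take `flags` rather than `pos` and `endpos`.

```python
import re

# Case-insensitive search through a module-level function.
found = re.findall("hello", "Hello hello HELLO", flags=re.I)
print(found)          # ['Hello', 'hello', 'HELLO']

# Without the flag, only the exact lowercase form matches.
strict = re.findall("hello", "Hello hello HELLO")
print(strict)         # ['hello']

# re.search() behaves the same way with respect to flags.
no_flag = re.search("hello", "Say Hello")
with_flag = re.search("hello", "Say Hello", flags=re.I)
print(no_flag)            # None
print(with_flag.span())   # (4, 9)
```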

### Important Example

Consider the example below:

In [28]:
# Creating a text

txt = "This book costs $15."

Search for the pattern `$15`.

In [29]:
pattern = re.compile("$15")

In [30]:
pattern.search(txt)

### No match found. Why?

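`$` is a metacharacter with a special meaning for the regex engine: it matches at the end of the string rather than a literal dollar sign, so the pattern `$15` can never match. Here, we want to treat it like a literal.

In order to treat a metacharacter like a literal, you need to **escape** it using the `\` character.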
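A small sketch of what the engine does with `$` in each case, using the same `txt` as above. The variable names are illustrative.

```python
import re

txt = "This book costs $15."

# Unescaped, `$` is an anchor: it matches the empty position at the end of the text.
end_anchor = re.search("$", txt)
print(end_anchor.span())   # (20, 20)

# "$15" therefore asks for "15" after the end of the text, which cannot occur.
unescaped = re.search("$15", txt)
print(unescaped)           # None

# Escaped, `\$` matches a literal dollar sign.
escaped = re.search(r"\$15", txt)
print(escaped.span())      # (16, 19)
```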

In [31]:
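# Try again, escaping "$" with a backslash
# A raw string (r"...") stops Python from treating the backslash as a string escape

pattern = re.compile(r"\$15")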

In [32]:
pattern.search(txt)

<re.Match object; span=(16, 19), match='$15'>

In regular expressions, there are twelve metacharacters that should be escaped if they are to be used with their literal meaning:

- Backslash `\`
- Caret `^`
- Dollar sign `$`
- Dot `.`
- Pipe symbol `|`
- Question mark `?`
- Asterisk `*`
- Plus sign `+`
- Opening parenthesis `(`
- Closing parenthesis `)`
- Opening square bracket `[`
- Opening curly brace `{`
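When the text to search for comes from elsewhere, escaping each metacharacter by hand is error-prone. The standard library's `re.escape()` does this for you. The sketch below first checks that a backslash turns each of the twelve characters listed above into a literal, then uses `re.escape()` on a made-up search string.

```python
import re

# The twelve metacharacters from the list above.
metachars = ["\\", "^", "$", ".", "|", "?", "*", "+", "(", ")", "[", "{"]

# Prefixing each one with a backslash makes it match itself literally.
all_literal = all(re.fullmatch("\\" + ch, ch) is not None for ch in metachars)
print(all_literal)   # True

# re.escape() adds the backslashes for us.
txt = "This book costs $15."
needle = "$15."      # illustrative search text containing two metacharacters
found = re.search(re.escape(needle), txt)
print(found.span())  # (16, 20)
```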

![](images/memes/meme4.jpg)