In [1]:
import re

## 1. Compiling Regular Expressions

Regular expressions are **compiled** into `Pattern` objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.


### `re.compile(pattern, flags=0)`

Compile a regular expression pattern, returning a pattern object.

- The regular expression is passed to `re.compile()` as a **string**. 

> Regular expressions are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. 

> Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.

In [2]:
pattern = re.compile("hello")

In [3]:
type(pattern)

re.Pattern

In [4]:
pattern

re.compile(r'hello', re.UNICODE)

- `re.compile()` also accepts an optional `flags` argument, used to enable various special features and syntax variations. [flags details](http://xahlee.info/python/python_regex_flags.html)

<br>

In the example below, we use the flag `re.I` (short for `re.IGNORECASE`) to ignore letter case in the regex pattern.

In [5]:
pattern = re.compile("hello", flags=re.I)

In [6]:
pattern

re.compile(r'hello', re.IGNORECASE|re.UNICODE)
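To see what the flag changes in practice, here is a minimal sketch comparing a case-sensitive pattern with a case-insensitive one. The variable names and the sample string are illustrative only, and `match()` itself is covered in the next section.

```python
import re

# Two patterns for the same text: one default, one case-insensitive.
strict = re.compile("hello")
relaxed = re.compile("hello", flags=re.I)

text = "HELLO world"  # sample input chosen for illustration

strict_result = strict.match(text)    # None: "HELLO" differs in case
relaxed_result = relaxed.match(text)  # Match object covering "HELLO"

print(strict_result)
print(relaxed_result.group())
```

The matched text keeps its original case; the flag only affects how the comparison is made.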

## 2. Performing Matches

We have now created a `Pattern` object, representing a compiled regular expression, using the `re.compile()` function.

Pattern objects have several methods and attributes.

Here is the list of different methods used for performing matches:


<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Method/Attribute</th>
    <th>Purpose</th>
</thead>
    
<tbody>
<tr>
    <td>match()</td>
    <td>Determine if the RE matches at the beginning of the string.</td>
</tr>
    
<tr>
    <td>search()</td>
    <td>Scan through a string, looking for any location where this RE matches.</td>
</tr>

<tr>
    <td>findall()</td>
    <td>Find all substrings where the RE matches, and return them as a list.</td>
</tr>

<tr>
    <td>finditer()</td>
    <td>Find all substrings where the RE matches, and return them as an iterator.</td>
</tr>
</tbody>
</table>

Let us go through them one by one:

### `match(string[, pos[, endpos]])`

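- A match is attempted only at the beginning of the string, or at index `pos` if it is given.

- Checking starts from index `pos` of the string (default is 0).

- Checking stops before index `endpos` of the string, so the character at `endpos` itself is not examined. By default, `endpos` is set to a very large integer, so the whole string is checked.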

- Returns `None` if no match found.

- If a match is found, a `Match` object is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In [7]:
pattern = re.compile("hello")

In [8]:
match = pattern.match("hello world")

In [9]:
match

<re.Match object; span=(0, 5), match='hello'>

In [10]:
type(match)

re.Match

In [11]:
match.span()

(0, 5)

In [12]:
match.start()

0

In [13]:
match.end()

5

In [14]:
match = pattern.match("ello world")
print(match)

None


In [15]:
type(match)

NoneType

In [16]:
match is None

True

In [17]:
match = pattern.match("ello hello world", pos=5)
print(match)

<re.Match object; span=(5, 10), match='hello'>


In [18]:
match = pattern.match("ello hello world", pos=5, endpos=9)
print(match)

None


In [19]:
match = pattern.match("ello hello world", pos=5, endpos=12)
print(match)

<re.Match object; span=(5, 10), match='hello'>


In [20]:
pattern.match("say hello", pos=4) is None

False

In [21]:
pattern.match("hello", endpos=4) is None

True
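Because `match()` returns either a `Match` object or `None`, code that uses the result usually checks for `None` first. The sketch below shows one way to do that and also reads the matched substring with `group()`. The helper name `describe` is hypothetical and exists only for this illustration.

```python
import re

pattern = re.compile("hello")

def describe(text):
    """Return the matched text and its span, or a fallback message."""
    m = pattern.match(text)
    if m is None:  # match() found nothing at the start of the string
        return "no match"
    # group() returns the substring that was matched
    return f"{m.group()!r} at {m.span()}"

print(describe("hello world"))
print(describe("ello world"))
```

Checking with `is None` before calling `group()` or `span()` avoids an `AttributeError` on a failed match.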

### `search(string[, pos[, endpos]])`

- A match is looked for throughout the string, and the first one found is returned.

- `pos` and `endpos` behave the same way as in the `match()` method.

- Returns `None` if no match found.

- If a match is found, a `Match` object is returned.

In [22]:
pattern = re.compile("hello")

In [23]:
pattern.search("say hello")

<re.Match object; span=(4, 9), match='hello'>

In [24]:
pattern.search("say hello hello")

<re.Match object; span=(4, 9), match='hello'>

In [25]:
re.search("hello", "say hello hello")      # The module-level function also works

<re.Match object; span=(4, 9), match='hello'>
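The difference between `match()` and `search()` is easiest to see when both are applied to the same string. This is a small sketch with illustrative variable names:

```python
import re

pattern = re.compile("hello")
text = "say hello"

at_start = pattern.match(text)   # None: the string starts with "say"
anywhere = pattern.search(text)  # finds "hello" further into the string

print(at_start)
print(anywhere.span())
```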

### `findall(string[, pos[, endpos]])`

- Finds **all non-overlapping substrings** where the match is found, and returns them as a list.

- `pos` and `endpos` behave the same way as in the `match()` and `search()` methods.

In [26]:
pattern.findall("say hello hello")

['hello', 'hello']

In [27]:
re.findall("hello" , "say hello hello")

['hello', 'hello']

In [28]:
pattern = re.compile(r"\d")  # raw string, so Python leaves the backslash alone

In [29]:
pattern.findall("1 2 3 4 5 6")

['1', '2', '3', '4', '5', '6']
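The `pos` and `endpos` arguments can be tried with `findall()` as well. The sketch below restricts the search to part of the string; the variable names are illustrative only.

```python
import re

digits = re.compile(r"\d")
text = "1 2 3 4 5 6"

# Start scanning at index 4, which is where the digit "3" sits.
from_index_4 = digits.findall(text, 4)

# Scan only indexes 4, 5 and 6, i.e. the slice "3 4".
window = digits.findall(text, 4, 7)

print(from_index_4)
print(window)
```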

### `finditer(string[, pos[, endpos]])`

- Finds **all non-overlapping substrings** where the match is found, and returns them as an iterator of the `Match` objects.

- `pos` and `endpos` behave the same way as in the `match()`, `search()` and `findall()` methods.

In [30]:
pattern = re.compile("hello")

In [31]:
match_iter = pattern.finditer("say hello hello")

In [32]:
type(match_iter)

callable_iterator

In [33]:
next(match_iter)

<re.Match object; span=(4, 9), match='hello'>

In [34]:
next(match_iter)

<re.Match object; span=(10, 15), match='hello'>
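An iterator can be consumed only once, which is why the following cells build a fresh one before looping. As a small sketch, passing a default value to `next()` shows what happens once the matches run out:

```python
import re

pattern = re.compile("hello")
match_iter = pattern.finditer("say hello hello")

first = next(match_iter)
second = next(match_iter)

# Nothing is left for a third call; the default value is returned
# instead of a StopIteration exception being raised.
third = next(match_iter, None)

print(first.span(), second.span(), third)
```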

In [35]:
pattern = re.compile("hello")
match_iter = pattern.finditer("say hello hello")

In [36]:
for match in match_iter:
    print(match.span())

(4, 9)
(10, 15)


In [37]:
pattern = re.compile("hello")
match_iter = pattern.finditer("say hello hello")

In [38]:
for match in match_iter:
    print(match)

<re.Match object; span=(4, 9), match='hello'>
<re.Match object; span=(10, 15), match='hello'>


In [39]:
from utils import highlight_regex_matches
highlight_regex_matches(pattern, "say hello hello")

say hello hello

(both occurrences of `hello` are highlighted in the rendered output)


> As the examples show, `match()` and `search()` return a `Match` object (or `None`) and `finditer()` returns an iterator of `Match` objects, whereas `findall()` returns a list of strings.
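The two return types can be compared side by side. In this sketch, `findall()` gives only the matched text, while `finditer()` also provides the position of each match; the variable names are illustrative.

```python
import re

pattern = re.compile("hello")
text = "say hello hello"

# Plain strings: the position of each match is not available.
strings = pattern.findall(text)

# Match objects: text and position can both be read from each one.
details = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]

print(strings)
print(details)
```

When only the matched text is needed, `findall()` is the shorter option; `finditer()` is useful when positions matter.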


### Note:

It is not mandatory to create a `Pattern` object explicitly with `re.compile()` in order to perform a regex operation.

You can directly use the module-level functions such as:
- `re.match(pattern, string, flags=0)`

- `re.search(pattern, string, flags=0)`

- `re.findall(pattern, string, flags=0)`

- `re.finditer(pattern, string, flags=0)`

and so on.

In a module-level function, you simply pass a **string** as your **regex pattern**, as shown in the examples below.

In [40]:
re.match("hello", "hello")

<re.Match object; span=(0, 5), match='hello'>

In [41]:
re.findall("hello", "say hello hello")

['hello', 'hello']
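The signatures listed above take `flags` but no `pos` or `endpos`. As a brief sketch, a flag can be passed to a module-level function in the same way as to `re.compile()`; the sample text is made up for illustration.

```python
import re

text = "Hello hello HELLO"

# Default: the comparison is case-sensitive.
exact = re.findall("hello", text)

# With re.I, letter case is ignored.
any_case = re.findall("hello", text, flags=re.I)

print(exact)
print(any_case)
```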

### Important Example

Consider the example below:

In [42]:
txt = "This book costs $15."

Search for the pattern `$15`.

In [43]:
pattern = re.compile("$15")

In [44]:
print(pattern.search(txt))


None


### No match found. Why?

`$` is a metacharacter with a special meaning for the regex engine: it matches at the end of the string. Here, we want to treat it as a literal character.
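The special meaning of `$` can be checked directly. In this sketch, a bare `$` produces an empty match at the very end of the string, so the pattern `$15` would need the text `15` to come after the end of the string, which never happens.

```python
import re

txt = "This book costs $15."

# A bare "$" matches the empty position at the end of the string.
end_anchor = re.search("$", txt)
print(end_anchor.span())

# "$15" requires "15" after the end of the string, so it cannot match.
unescaped = re.search("$15", txt)
print(unescaped)
```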

To treat a metacharacter as a literal, **escape** it with the `\` character. Writing the pattern as a raw string, such as `r"\$15"`, keeps Python from trying to interpret the backslash itself.

In [45]:
pattern = re.compile(r"\$15")

In [46]:
pattern.search(txt)

<re.Match object; span=(16, 19), match='$15'>

In [47]:
pattern = re.compile(r"$15")
print(pattern.search(txt))

None


In [48]:
re.search(r"\$15", "This book costs $15.")

<re.Match object; span=(16, 19), match='$15'>

In [49]:
print(r"$15")

$15
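A raw string by itself does not escape anything, as the output above shows. When the text to search for is not known in advance, one option is to let the standard library add the backslashes with `re.escape()`. This is only a sketch of that approach, and the variable names are illustrative.

```python
import re

txt = "This book costs $15."
literal = "$15"  # text we want to find exactly as written

# re.escape() puts a backslash in front of regex metacharacters.
escaped = re.escape(literal)
print(escaped)

found = re.search(escaped, txt)
print(found.span())
```

This avoids having to remember which characters are metacharacters when the pattern comes from user input or another variable.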
