> ### A regular expression is a sequence of characters that defines a search pattern.

## 1. What exactly is a Regular Expression?

A regular expression, often called a pattern, is **an expression used to specify a set of strings** required for a particular purpose. 
- A simple way to specify a finite set of strings is to list its elements or members. <br>For example `{file, file1, file2}`. 
- The set `{file, file1, file2}` can be specified by the pattern `file(1|2)?`.
- The string set `{file, file1, file2}` is also matched by the looser pattern `file\d?`, although that pattern additionally accepts `file0`, `file3`, and so on; `file[12]?` is an exactly equivalent pattern.
- If there exists at least one regular expression that matches a particular set, then there exist infinitely many other regular expressions that also match it, i.e. **the specification is not unique**.<br>
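
To make the non-uniqueness concrete, here is a minimal Python sketch. The helper name `accepted_by` and the candidate list are illustrative assumptions, not part of the original text. It checks which candidate strings each pattern accepts in full: `file(1|2)?` and `file[12]?` accept the same strings, while `file\d?` accepts a larger set.

```python
import re

def accepted_by(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Hypothetical candidate strings, chosen only for illustration
candidates = ["file", "file1", "file2", "file3", "files"]

for pattern in (r"file(1|2)?", r"file[12]?", r"file\d?"):
    print(pattern, "->", accepted_by(pattern, candidates))
```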

## 2. The math of Regular Expressions

- The concept of **Regular Expressions** originated from **[Regular Languages](https://en.wikipedia.org/wiki/Regular_language)**. 

- **Regular Expressions** describe **Regular Languages** in **[Formal Language Theory](https://en.wikipedia.org/wiki/Formal_language)**.
> ***Formal Language Theory***: In mathematics, computer science, and linguistics, a **formal language** consists of words whose letters are taken from an alphabet and are **well-formed according to a specific set of rules**. The field of formal language theory studies primarily the purely syntactical aspects of such languages, that is, their internal structural patterns.

> ***Regular Languages***: Regular languages are the class of **formal languages** that can be expressed using regular expressions. ![](images/formal-lang-theory.png)

## 3. Uses of Regular Expressions

Some important usages of regular expressions are:

- Check whether an input conforms to a given pattern; for example, we can check whether a value entered in an HTML form is a valid e-mail address


- Look for occurrences of a pattern in a piece of text; for example, check if either the word "color" or the word "colour" appears in a document in just **one scan**


- Extract specific portions of a text; for example, extract the postal code of an address


- Replace portions of text; for example, replace every occurrence of "color" or "colour" with "red"


- Split a larger text into smaller pieces; for example, split a text at every occurrence of the dot, comma, or newline characters
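
The five uses above can be sketched with Python's `re` module as follows. The sample strings are made up for illustration, and the e-mail and postal-code patterns are deliberately simplified assumptions (real e-mail addresses and postal codes are more varied), so treat this as a sketch rather than production validation.

```python
import re

# 1. Validate: does the whole input conform to a (simplified) e-mail shape?
is_email = re.fullmatch(r"[\w.+-]+@[\w-]+(\.[\w-]+)+", "jane.doe@example.com") is not None

# 2. Search: does "color" or "colour" appear anywhere in the text?
has_colour = re.search(r"colou?r", "What colour is it?") is not None

# 3. Extract: pull a 5-digit postal code out of an address
postal_code = re.search(r"\b\d{5}\b", "742 Evergreen Terrace, Springfield 49007").group()

# 4. Replace: substitute every "color"/"colour" with "red"
replaced = re.sub(r"colou?r", "red", "color or colour")

# 5. Split: break a text at dots, commas, or newlines
pieces = re.split(r"[.,\n]", "a,b.c\nd")

print(is_email, has_colour, postal_code, replaced, pieces)
```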

### Regex today

- The rise of the web gave a big boost to Perl's regex implementation, and that is where today's regex syntax comes from. `Apache`, `C`, `C++`, the `.NET` languages, `Java`, `JavaScript`, `MySQL`, `PHP`, `Python` and `Ruby` all offer regex support that aims to be largely Perl-compatible. There is also a library called `PCRE`, which stands for Perl Compatible Regular Expressions.


- Today, the standard Python module for regular expressions, `re`, supports only Perl-style regular expressions. There is an [effort](https://pypi.python.org/pypi/regex) to write a new regex module with better POSIX-style support, which was intended to eventually replace Python's `re` module implementation.

## 5. Understanding the Regular Expression Syntax

A regex pattern is simply a sequence of characters. The components of a regex pattern are:

- **literals (ordinary characters)**: these characters carry no special meaning and are matched as they are.

- **metacharacters (KEY characters)**: these characters carry a special meaning and are processed in a special way.


![](images/components.png)
Let's start with a simple example.

Consider that we have got the list of several filenames in a folder.

```
file1.xml
file1.txt
file2.txt
file15.xml
file5.docx
file60.txt
file5.txt
```

And we want to keep only those filenames which follow a specific pattern, i.e.  `file<one or more digits>.txt`.

> Let's try to do this on an online tool to learn, build, & test Regular Expressions (RegEx / RegExp), [RegExr](https://regexr.com).

So, the regular expression we need here is:

`file\d+\.txt`

This expression can be understood as follows:

- `file` is a substring of literals which are matched against the input as is.

- `\d` is a metacharacter which instructs the engine to match a digit (0-9) at this position.

- `+` is also a metacharacter; it instructs the engine to match one or more repetitions of the preceding element (`\d` in this case).

- `\.` matches a literal dot. `.` is a metacharacter, but we want to use it as a literal here, so we escape it with the `\` character.

- `txt` is a substring of literals which are matched against the input as is.

![](images/example1.png)
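
The same filtering can be sketched in Python with the `re` module, which the next part introduces. The variable names are illustrative; the filenames are the ones listed above. `fullmatch()` is used here so that the whole filename, not just a part of it, has to follow the pattern.

```python
import re

filenames = [
    "file1.xml", "file1.txt", "file2.txt", "file15.xml",
    "file5.docx", "file60.txt", "file5.txt",
]

# file, then one or more digits, then a literal ".txt"
pattern = re.compile(r"file\d+\.txt")

matching = [name for name in filenames if pattern.fullmatch(name)]
print(matching)
```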

# 2.Getting started with RegEx in Python

The **[re](https://docs.python.org/3/howto/regex.html)** module provides an interface to the regular expression engine, allowing you to **compile regular expressions into objects and then perform matches with them**.

In [1]:
import re

## 1. Compiling Regular Expressions

Regular expressions are **compiled** into `Pattern` objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.


### `re.compile(pattern, flags=0)`

Compile a regular expression pattern, returning a pattern object.

- The regular expression is passed to `re.compile()` as a **string**. 

> Regular expressions are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. 

> Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.

In [2]:
pattern = re.compile("hello")
pattern

re.compile(r'hello', re.UNICODE)

- `re.compile()` also accepts an optional `flags` argument, used to enable various special features and syntax variations. [More about flags](http://xahlee.info/python/python_regex_flags.html)

<br>

In the example below, we use the flag `re.I` (short for `re.IGNORECASE`) to ignore letter case in the regex pattern.

In [3]:
pattern = re.compile("hello", flags=re.I)
pattern 

re.compile(r'hello', re.IGNORECASE|re.UNICODE)

## 2. Performing Matches

- A `Pattern` object (the compiled regular expression returned by `re.compile()`) provides the following methods for matching:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Method/Attribute</th>
    <th>Purpose</th>
</thead>
    
<tbody>
<tr>
    <td>match()</td>
    <td>Determine if the RE matches at the beginning of the string.</td>
</tr>
    
<tr>
    <td>search()</td>
    <td>Scan through a string, looking for any location where this RE matches.</td>
</tr>

<tr>
    <td>findall()</td>
    <td>Find all substrings where the RE matches, and return them as a list.</td>
</tr>

<tr>
    <td>finditer()</td>
    <td>Find all substrings where the RE matches, and return them as an iterator.</td>
</tr>
</tbody>
</table>

### `match(string[, pos[, endpos]])`
- A match is attempted only at the beginning of the string (by default).

- Matching starts at index `pos` of the string (default is 0).

- Matching stops at index `endpos` of the string; by default `endpos` is effectively the end of the string.

- Returns `None` if no match found.

- If a match is found, a `Match` object is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In [4]:
pattern = re.compile("hello")
match = pattern.match("hello world")
print(match.start(),match.span(),match.end())

0 (0, 5) 5
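
The effect of `pos` and `endpos`, and the need to check for `None`, can be illustrated with a small sketch (the variable names are assumptions made for this example):

```python
import re

pattern = re.compile("hello")

# "hello" is not at index 0, so match() finds nothing
no_match = pattern.match("say hello")

# Starting at pos=4, the pattern matches immediately
from_pos = pattern.match("say hello", 4)

# With endpos=8 the string is treated as ending after "hell", so there is no match
clipped = pattern.match("say hello", 4, 8)

# Always check for None before using the Match object
if from_pos is not None:
    print(from_pos.span(), from_pos.group())
print(no_match, clipped)
```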


### `search(string[, pos[, endpos]])`

- A match is searched for throughout the string.

- Same behaviour of `pos` and `endpos` as the `match()` function.

- Returns `None` if no match found.

- If a match is found, a `Match` object is returned.

In [5]:
print(pattern.search("say hello"))
print(pattern.search("say hello hello"))

<re.Match object; span=(4, 9), match='hello'>
<re.Match object; span=(4, 9), match='hello'>


### `finditer(string[, pos[, endpos]])`

- Finds **all non-overlapping substrings** where the match is found, and returns them as an iterator of the `Match` objects.

- Same behaviour of `pos` and `endpos` as the `match()`, `search()` and `findall()` function.

In [6]:
matches = pattern.finditer("say hello hello")
for match in matches:
    print(match.span())

(4, 9)
(10, 15)


In [7]:
from utils import highlight_regex_matches
highlight_regex_matches(pattern, "say hello hello")

say **hello** **hello**


- By now, you must have noticed that `match()`, `search()` and `finditer()` return `Match` object(s), whereas `findall()` returns a list of strings.
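
The difference can be seen side by side in a short sketch (variable names are illustrative): `findall()` gives only the matched text, while `finditer()` gives `Match` objects from which positions can also be read.

```python
import re

pattern = re.compile("hello")
text = "say hello hello"

# findall(): a plain list of matched strings
strings = pattern.findall(text)

# finditer(): an iterator of Match objects, which also carry positions
spans = [m.span() for m in pattern.finditer(text)]

print(strings)
print(spans)
```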


### Note:

It is not mandatory to create a `Pattern` object explicitly using `re.compile()` method in order to perform a regex operation.

You can directly use the module-level functions, such as:
- `re.match(pattern, string, flags=0)`

- `re.search(pattern, string, flags=0)`

- `re.findall(pattern, string, flags=0)`

- `re.finditer(pattern, string, flags=0)`


In [8]:
re.findall("hello", "say hello hello")

['hello', 'hello']

#### In order to treat a metacharacter like a literal, you need to **escape** it using `\` character.

In [9]:
txt = "This book costs $15."
pattern = re.compile("$15")
pattern.search(txt)

In [10]:
pattern = re.compile(r"\$15")  # raw string: the backslash reaches the regex engine unchanged
pattern.search(txt)

<re.Match object; span=(16, 19), match='$15'>

#### In regular expressions, there are twelve metacharacters that should be escaped if they are to be used with their literal meaning:

- Backslash `\`
- Caret `^`
- Dollar sign `$`
- Dot `.`
- Pipe symbol `|`
- Question mark `?`
- Asterisk `*`
- Plus sign `+`
- Opening parenthesis `(`
- Closing parenthesis `)`
- Opening square bracket `[`
- Opening curly brace `{`
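
As a quick check of the list above, the following sketch (variable names are assumptions for illustration) escapes each of the twelve metacharacters with a backslash and confirms that the escaped form matches the character literally. It also contrasts an unescaped and an escaped dot.

```python
import re

# The twelve metacharacters from the list above
metacharacters = r"\^$.|?*+()[{"

for ch in metacharacters:
    escaped = "\\" + ch          # prepend one backslash to the character
    assert re.fullmatch(escaped, ch) is not None

# Unescaped "." matches any character; escaped "\." matches only a dot
dot_unescaped = re.fullmatch(".", "x") is not None
dot_escaped = re.fullmatch(r"\.", "x") is not None
print(dot_unescaped, dot_escaped)
```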

# 3.Character Classes

- **Character classes** (also known as **character sets**) let us specify a set of characters, any one of which may match at a given position.
- Syntax: `[accepted characters]`

In [11]:
txt = """
Yesterday, I was driving my car without a driving licence. The traffic police stopped me and asked me for my 
license. I told them that I forgot my licence at home. 
"""

In [12]:
pattern = re.compile("licen[cs]e")
pattern.findall(txt)

['licence', 'license', 'licence']

In [13]:
highlight_regex_matches(pattern, txt)


Yesterday, I was driving my car without a driving **licence**. The traffic police stopped me and asked me for my 
**license**. I told them that I forgot my **licence** at home. 



![](images/example2.png)

# Character Set Range

> It is also possible to specify a range of characters by placing a hyphen (`-`) between two characters; for example, to match any lowercase letter we can use `[a-z]`. Likewise, to match any single digit we can define the character set `[0-9]`.

In [14]:
txt = """
The first season of Indian Premiere League (IPL) was played in 2008. 
The second season was played in 2009 in South Africa. 
Last season was played in 2018 and won by Chennai Super Kings (CSK).
CSK won the title in 2010 and 2011 as well.
Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017.
"""

In [15]:
pattern = re.compile("[1-9][0-9][0-9][0-9]")
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

In [16]:
highlight_regex_matches(pattern, txt)


The first season of Indian Premiere League (IPL) was played in **2008**. 
The second season was played in **2009** in South Africa. 
Last season was played in **2018** and won by Chennai Super Kings (CSK).
CSK won the title in **2010** and **2011** as well.
Mumbai Indians (MI) has also won the title 3 times in **2013**, **2015** and **2017**.



#### Negation of Ranges
> There is another possibility—the negation of ranges. We can invert the meaning
of a character set by placing a caret (`^`) symbol right after the opening square
bracket metacharacter (`[`).

In [17]:
pattern = re.compile("[^aeiou]")
pattern.findall(txt)

['\n', 'T', 'h', ' ', 'f', 'r', 's', 't', ' ', 's', 's', 'n', ' ', 'f', ' ', 'I', 'n', 'd', 'n', ' ', 'P', 'r', 'm', 'r', ' ', 'L', 'g', ' ', '(', 'I', 'P', 'L', ')', ' ', 'w', 's', ' ', 'p', 'l', 'y', 'd', ' ', 'n', ' ', '2', '0', '0', '8', '.', ' ', '\n',
 'T', 'h', ' ', 's', 'c', 'n', 'd', ' ', 's', 's', 'n', ' ', 'w', 's', ' ', 'p', 'l', 'y', 'd', ' ', 'n', ' ', '2', '0', '0', '9', ' ', 'n', ' ', 'S', 't', 'h', ' ', 'A', 'f', 'r', 'c', '.', ' ', '\n',
 'L', 's', 't', ' ', 's', 's', 'n', ' ', 'w', 's', ' ', 'p', 'l', 'y', 'd', ' ', 'n', ' ', '2', '0', '1', '8', ' ', 'n', 'd', ' ', 'w', 'n', ' ', 'b', 'y', ' ', 'C', 'h', 'n', 'n', ' ', 'S', 'p', 'r', ' ', 'K', 'n', 'g', 's', ' ', '(', 'C', 'S', 'K', ')', '.', '\n',
 'C', 'S', 'K', ' ', 'w', 'n', ' ', 't', 'h', ' ', 't', 't', 'l', ' ', 'n', ' ', '2', '0', '1', '0', ' ', 'n',
 ...]


# Predefined Character Classes

There exist some predefined character classes which can be used as a shortcut for some frequently used classes.


<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Element</th>
    <th>Description</th>
</thead>
    
<tbody>
<tr>
    <td>.</td>
    <td>This element matches any character except newline</td>
</tr>

<tr>
    <td>\d</td>
    <td>This matches any decimal digit; this is equivalent to the class [0-9]</td>
</tr>

<tr>
    <td>\D</td>
    <td>This matches any non-digit character; this is equivalent to the class [^0-9]</td>
</tr>

<tr>
    <td>\s</td>
    <td>This matches any whitespace character; this is equivalent to the class
[ \t\n\r\f\v]</td>
</tr>

<tr>
    <td>\S</td>
    <td>This matches any non-whitespace character; this is equivalent to the class
[^ \t\n\r\f\v]</td>
</tr>

<tr>
    <td>\w</td>
    <td>This matches any alphanumeric character; this is equivalent to the class
[a-zA-Z0-9_]</td>
</tr>
    
<tr>
    <td>\W</td>
    <td>This matches any non-alphanumeric character; this is equivalent to the
class [^a-zA-Z0-9_]</td>
</tr>
</tbody>
</table>
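
One caveat worth illustrating: in Python 3, for `str` patterns, the bracketed equivalents in the table hold exactly only when ASCII matching is requested with the `re.ASCII` flag. By default, `\d` and `\w` also match non-ASCII digits and letters. The sketch below uses a made-up sample string containing an Arabic-Indic digit and an accented letter.

```python
import re

# ASCII digit 3, Arabic-Indic digit three (U+0663), and the word "café"
text = "3 \u0663 caf\u00e9"

unicode_digits = re.findall(r"\d", text)                  # default: Unicode-aware
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)    # same as [0-9]

unicode_words = re.findall(r"\w+", text)                  # default: Unicode-aware
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)    # same as [a-zA-Z0-9_]+

print(unicode_digits, ascii_digits)
print(unicode_words, ascii_words)
```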


Now we can shorten our pattern for finding years in a given text:

In [18]:
pattern = re.compile(r"[1-9]\d\d\d")
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

In [19]:
pattern = re.compile(r"\d\d\d\d")
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

- Let us try to find out all special symbols (non-alphanumeric, non-whitespace characters) in our text now.

In [20]:
re.findall(r"[^\w\s]", txt)

['(', ')', '.', '.', '(', ')', '.', '.', '(', ')', ',', '.']

# 4.The Backslash Plague

- Consider a text containing some Windows style directory addresses in which we have to find `C:\Windows\System32` substring.

In [21]:
txt = """
C:\Windows
C:\Python
C:\Windows\System32
"""

In [22]:
pattern = re.compile("C:\Windows\System32")
print(pattern.search(txt))

None


- The regex engine treats `\` as a metacharacter (here, `\W` and `\S` are read as predefined character classes), whereas we intend it to be a literal.

- Solution: we need to escape the metacharacter. A metacharacter can be escaped by putting a `\` before it.

In [23]:
pattern = re.compile("C:\\Windows\\System32")
print(pattern.search(txt))

None


### Still no match found. Why???

`\` is used as an escape at two different levels. 

- First, the Python interpreter itself performs substitutions for `\` before the `re` module ever sees the pattern string. For instance, `\n` is converted to a newline character, `\t` is converted to a tab character, etc. 

- Finally, `re` reads the substituted pattern string and will apply its own substitutions for `\` character. 

Hence, to use `\` as a **literal**, the regex engine must receive `\\`, and each of those two backslashes must itself be written as `\\` in an ordinary Python string literal, giving `\\\\`.

In [24]:
pattern = re.compile("C:\\\\Windows\\\\System32")
print(pattern.search(txt))

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>


### Can we use 2 backslashes instead of 4 here?

Yes. With **raw strings**, we do not need the first (Python-level) layer of escaping. 

> Python raw strings are written as ***r"your string"***. In a raw string, backslashes are left untouched, so escape sequences like `\n`, `\t`, etc. are not processed.

In [25]:
pattern = re.compile(r"C:\\Windows\\System32")
print(pattern.search(txt))

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>
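
The equivalence of the four-backslash and raw-string spellings can be verified directly. In this sketch the variable names are illustrative; `path` holds the actual text being searched, which contains single backslashes.

```python
import re

normal = "C:\\\\Windows\\\\System32"   # ordinary literal: 4 backslashes per separator
raw = r"C:\\Windows\\System32"         # raw literal: 2 backslashes per separator

# Both literals produce the same pattern text, which is what `re` receives
print(normal == raw)
print(raw)

# The text to search contains single backslashes
path = "C:\\Windows\\System32"
print(path)

found = re.search(raw, path)
print(found.group() == path)
```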


### Do we really need to use 2 backslashes?

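If your pattern is meant to be matched **purely literally** (it uses no metacharacters), you can use the `re.escape()` function to escape every character that has a special meaning in a regex. (Before Python 3.7, `re.escape()` escaped all characters except ASCII letters, numbers and '_'.)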

In [26]:
re.escape(r"C:\Windows\System32")

'C:\\\\Windows\\\\System32'

In [27]:
re.search(re.escape(r"C:\Windows\System32"), txt)

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>

# 5.Alternation

**Alternation** is used to match a single regular expression out of several possible regular expressions.

- This is accomplished using the pipe symbol `|`.
- In effect, it lets us combine several separate regular expressions into a single regular expression.

In [28]:
txt = """the most common conjunctions are and, or and but."""

In [29]:
pattern = re.compile("and|or|the")
print(pattern.findall(txt))
highlight_regex_matches(pattern, txt)

['the', 'and', 'or', 'and']
**the** most common conjunctions are **and**, **or** **and** but.


In [30]:
txt = """What is your name? Who is that guy?"""

In [31]:
pattern = re.compile("What|Who is")
highlight_regex_matches(pattern, txt)

**What** is your name? **Who is** that guy?


The pattern `What|Who is` actually matches either the substring `What` or the substring `Who is`, because alternation has the lowest precedence of all regex operators.

To get the desired result, we need to wrap the alternatives in **parentheses**.

In [32]:
pattern = re.compile("(What|Who) is")
highlight_regex_matches(pattern, txt)

**What is** your name? **Who is** that guy?

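One side effect of grouping with parentheses is worth a short sketch: when a pattern contains a capturing group, `findall()` returns the text of the group rather than the whole match. If only grouping is wanted, Python's `re` offers the non-capturing form `(?:...)`. The variable names below are illustrative.

```python
import re

txt = "What is your name? Who is that guy?"

# Capturing group: findall() returns only what the group matched
captured = re.findall(r"(What|Who) is", txt)

# Non-capturing group: findall() returns the whole match
whole = re.findall(r"(?:What|Who) is", txt)

print(captured)
print(whole)
```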

# 6. Quantifiers

**Quantifiers** are the mechanisms to define how a **character**, **metacharacter**, or **character set** can be **repeated**.

Here is the list of the 4 basic quantifiers:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Symbol</th>
    <th>Name</th>
    <th>Quantification of previous character</th>
</thead>
    
<tbody>
<tr>
    <td>?</td>
    <td>Question Mark</td>
    <td>Optional (0 or 1 repetitions)</td>
</tr>
    
<tr>
    <td>*</td>
    <td>Asterisk</td>
    <td>Zero or more times</td>
</tr>

<tr>
    <td>+</td>
    <td>Plus Sign</td>
    <td>One or more times</td>
</tr>

<tr>
    <td>{n,m}</td>
    <td>Curly Braces</td>
    <td>Between n and m times</td>
</tr>
</tbody>
</table>
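
To see the four quantifiers side by side, the sketch below applies each of them to the character `b` in a made-up test string. The string and the variable names are assumptions chosen for illustration.

```python
import re

text = "a ab abb abbb abbbb"

optional = re.findall(r"ab?", text)        # 0 or 1 "b"
zero_or_more = re.findall(r"ab*", text)    # 0 or more "b"
one_or_more = re.findall(r"ab+", text)     # 1 or more "b"
two_to_three = re.findall(r"ab{2,3}", text)  # between 2 and 3 "b"

print(optional)
print(zero_or_more)
print(one_or_more)
print(two_to_three)
```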


#### Example 1

Find all the matches for `dog` and `dogs` in the given text.

In [33]:
txt = """
I have 2 dogs. One dog is 1 year old and other one is 2 years old. Both dogs are very cute! 
"""

In [34]:
pattern = re.compile("dogs?")
pattern.findall(txt)

['dogs', 'dog', 'dogs']

In [35]:
from utils import highlight_regex_matches
highlight_regex_matches(pattern, txt)


I have 2 **dogs**. One **dog** is 1 year old and other one is 2 years old. Both **dogs** are very cute! 



#### Example 2
Find all filenames starting with `file` and ending with `.txt` in the given text.

In [36]:
txt = """
file1.txt
file_one.txt
file.txt
fil.txt
file.xml
file-1.txt
"""

In [37]:
pattern = re.compile(r"file[\w-]1*\.txt")
print(pattern.findall(txt))
highlight_regex_matches(pattern, txt)

['file1.txt', 'file-1.txt']

**file1.txt**
file_one.txt
file.txt
fil.txt
file.xml
**file-1.txt**



In [38]:
re.compile(r"file[\w-]*\.[(txt)|(xml)]").findall(txt)

['file1.t', 'file_one.t', 'file.t', 'file.x', 'file-1.t']

Note that square brackets always create a character class: `[(txt)|(xml)]` matches a single character out of `(`, `)`, `|`, `t`, `x`, `m` and `l`, which is why each match stops one character after the dot. Choosing between whole alternatives requires parentheses, as in the next cell.

In [39]:
#find . -E -regex '.*\.(sh|ini|conf|vhost|xml|php)$'

p = re.compile(r"file[\w-]*\.(txt|xml)")
#p.findall(txt)
highlight_regex_matches(p, txt)


**file1.txt**
**file_one.txt**
**file.txt**
fil.txt
**file.xml**
**file-1.txt**



In [40]:
txt = """
file1.txt
file_one.txt
file09.txt
fil.txt
file23.xml
file.txt
"""

In [41]:
pattern = re.compile(r"file\d+\.txt")
pattern.findall(txt)

['file1.txt', 'file09.txt']

In [42]:
highlight_regex_matches(pattern, txt)


**file1.txt**
file_one.txt
**file09.txt**
fil.txt
file23.xml
file.txt



The curly-brace syntax comes in the following variations:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Syntax</th>
    <th>Description</th>
</thead>
    
<tbody>
<tr>
    <td>{n}</td>
    <td>The previous character is repeated exactly n times.</td>
</tr>
    
<tr>
    <td>{n,}</td>
    <td>The previous character is repeated at least n times.</td>
</tr>

<tr>
    <td>{,n}</td>
    <td>The previous character is repeated at most n times.</td>
</tr>

<tr>
    <td>{n,m}</td>
    <td>The previous character is repeated between n and m times (both inclusive).</td>
</tr>
</tbody>
</table>
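
The four curly-brace forms can be checked with a small sketch. The helper `matching_lengths` is a hypothetical name introduced only for this example; it reports for which repetition counts (0 to 5) a run of `a` characters is accepted in full.

```python
import re

def matching_lengths(pattern, max_len=5):
    """Return the run lengths n (0..max_len) for which 'a' * n fully matches."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "a" * n)]

exactly_two = matching_lengths(r"a{2}")
at_least_two = matching_lengths(r"a{2,}")
at_most_two = matching_lengths(r"a{,2}")
two_to_four = matching_lengths(r"a{2,4}")

print(exactly_two, at_least_two, at_most_two, two_to_four)
```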

### Example 4

Find years in the given text.


In [43]:
txt = """
The first season of Indian Premiere League (IPL) was played in 2008. 
The second season was played in 2009 in South Africa. 
Last season was played in 2018 and won by Chennai Super Kings (CSK).
CSK won the title in 2010 and 2011 as well.
Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017.
"""

In [44]:
pattern = re.compile(r"\d{4}")
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

In [45]:
pattern = re.compile(r"\d\d\d\d")
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

In [46]:
pattern = re.compile(r"\d+")
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '3', '2013', '2015', '2017']

### Example 5

In the given text, find all numbers that have 4 or more digits.

In [47]:
txt = """
123143
432
5657
4435
54
65111
"""

In [48]:
pattern = re.compile(r"\d{4}\d*")
pattern.findall(txt)

['123143', '5657', '4435', '65111']

In [49]:
pattern = re.compile(r"\d{4,}")
re.findall(pattern, txt)

['123143', '5657', '4435', '65111']

### Example 6

Write a pattern to validate telephone numbers.

Telephone numbers can be of the form: `555-555-5555`, `555 555 5555`, `5555555555`

In [50]:
txt = """
555-555-5555
555 555 5555
5555555555
"""

In [51]:
pattern = re.compile(r"(\d{10})|(\d{3}\-\d{3}\-\d{4})")
re.findall(pattern, txt)

[('', '555-555-5555'), ('5555555555', '')]

Because this pattern contains two capturing groups, `findall()` returns one tuple per match, with an empty string for the group that did not take part in the match. The space-separated number `555 555 5555` is not matched at all.

In [52]:
pattern = re.compile(r"\d{3}[-\s]?\d{3}[-\s]?\d{4}")
pattern.findall(txt)

['555-555-5555', '555 555 5555', '5555555555']
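
`findall()` locates phone numbers inside a larger text. To *validate* a single input, the whole string has to match, which is what `fullmatch()` checks. The sketch below reuses the pattern from the cell above; the helper name `is_phone_number` and the rejected samples are assumptions made for illustration.

```python
import re

phone = re.compile(r"\d{3}[-\s]?\d{3}[-\s]?\d{4}")

def is_phone_number(value):
    """Return True only if the entire string has the phone-number shape."""
    return phone.fullmatch(value) is not None

for candidate in ["555-555-5555", "555 555 5555", "5555555555",
                  "555-5555", "555-555-55555"]:
    print(candidate, is_phone_number(candidate))
```

Note that this pattern also accepts mixed separators such as `555 555-5555`; whether that is acceptable depends on the use case.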

# 7. Greedy Behaviour

In [53]:
txt = """<html><head><title>Title</title>"""

In [54]:
pattern = re.compile("<.*>")
pattern.findall(txt)

['<html><head><title>Title</title>']

In the above example, one may expect to get 4 matches, i.e. `<html>`, `<head>`, `<title>` and `</title>`. Instead, we get the longest match, i.e. `<html><head><title>Title</title>`.

This particular behaviour (to find the longest match) is called **greedy** behaviour.

> Quantifiers are greedy by default: a greedy quantifier tries to match as much as possible, giving the longest possible match.

# Non-Greedy behaviour

The **non-greedy** (or **reluctant**) behaviour can be requested by adding an extra question mark to the quantifier.

For example, `??`, `*?` or `+?`. 

> A quantifier marked as reluctant behaves in the opposite way to a greedy one: it tries to match as little as possible.

In [55]:
pattern = re.compile("<.*?>")
pattern.findall(txt)

['<html>', '<head>', '<title>', '</title>']
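
For this particular task there is a common alternative to the reluctant quantifier: a negated character class that simply forbids `>` inside the tag. This is a sketch of that alternative, not something the text above prescribes; both patterns give the same result on this input.

```python
import re

txt = "<html><head><title>Title</title>"

# Reluctant quantifier: stop at the first ">" that allows a match
lazy = re.findall(r"<.*?>", txt)

# Negated character class: any run of characters that are not ">"
negated = re.findall(r"<[^>]*>", txt)

print(lazy)
print(negated)
```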

# 8. Boundary Matchers

Consider a scenario where you want to find all occurrences of `and`, `or` and `the` in the given text.

In [56]:
txt = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, 
remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
"""

In [57]:
pattern = re.compile("and|or|the")
pattern.findall(txt)
highlight_regex_matches(pattern, txt)


L**or**em Ipsum is simply dummy text of **the** printing **and** typesetting industry. 
L**or**em Ipsum has been **the** industry's st**and**ard dummy text ever since **the** 1500s, 
when an unknown printer took a galley of type **and** scrambled it to make a type specimen book. 
It has survived not only five centuries, but also **the** leap into electronic typesetting, 
remaining essentially unchanged. 
It was popularised in **the** 1960s with **the** release of Letraset sheets containing L**or**em Ipsum passages, 
**and** m**or**e recently with desktop publishing software like Aldus PageMaker including versions of L**or**em Ipsum.



There is a slight problem with the above pattern: occurrences of `and`, `or` and `the` inside longer words (such as `Lorem` or `standard`) are also counted as matches, whereas we want to find only the standalone words `and`, `or` and `the`.

### What is the solution?

Solution is to use this pattern:

`\b(and|or|the)\b`

where `\b` is a metacharacter that matches at a position that is called a **word boundary**. 

Such identifiers that correspond to a particular position inside of the input are called **Boundary Matchers**.

**Note:** In an ordinary Python string, `\b` is the escape sequence for the backspace character, so we must write `\\b` (or use a raw string such as `r"\b(and|or|the)\b"`) for the regex engine to receive the `\b` metacharacter.

In [58]:
pattern = re.compile("\\b(and|or|the)\\b")
highlight_regex_matches(pattern, txt)


Lorem Ipsum is simply dummy text of **the** printing **and** typesetting industry. 
Lorem Ipsum has been **the** industry's standard dummy text ever since **the** 1500s, 
when an unknown printer took a galley of type **and** scrambled it to make a type specimen book. 
It has survived not only five centuries, but also **the** leap into electronic typesetting, 
remaining essentially unchanged. 
It was popularised in **the** 1960s with **the** release of Letraset sheets containing Lorem Ipsum passages, 
**and** more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.



Here is a table which shows the list of all boundary matchers available in Python:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Matcher</th>
    <th>Description</th>
</thead>
    
<tbody>
<tr>
    <td>^</td>
    <td>Matches at the beginning of a line</td>
</tr>
    
<tr>
    <td>$</td>
    <td>Matches at the end of a line</td>
</tr>

<tr>
    <td>\b</td>
    <td>Matches a word boundary</td>
</tr>

<tr>
    <td>\B</td>
    <td>Matches the opposite of \b. Anything that is not a word boundary</td>
</tr>

<tr>
    <td>\A</td>
    <td>Matches the beginning of the input</td>
</tr>

<tr>
    <td>\Z</td>
    <td>Matches the end of the input</td>
</tr>
</tbody>
</table>
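
A short sketch can make the differences between these matchers visible. The sample text and variable names are made up for illustration; the `re.M` flag used here makes `^` match at the start of every line, while `\A` still matches only at the start of the whole input.

```python
import re

text = "cat concat\ncat"

whole_words = re.findall(r"\bcat\b", text)              # "cat" as a standalone word
inside_word = re.findall(r"\Bcat", text)                # "cat" not preceded by a word boundary
line_starts = re.findall(r"^cat", text, flags=re.M)     # at the start of each line
input_start = re.findall(r"\Acat", text, flags=re.M)    # only at the start of the input

print(len(whole_words), len(inside_word), len(line_starts), len(input_start))
```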

### Example 1

Consider a scenario where we want to find all the lines in the given text which **start** with the pattern `Name:`.

In [59]:
txt = """
Name:
Age: 0
Roll No.: 15
Grade: S

Name: Ravi Teja
Age: -1
Roll No.: 123 Name: ABC
Grade: K

Name: Ram
Age: N/A
Roll No.: 1
Grade: G
"""

In [60]:
pattern = re.compile(r"^Name: \w*", flags=re.M)
pattern.findall(txt)

['Name: Ravi', 'Name: Ram']

> `re.M` (short for `re.MULTILINE`) is a flag which makes the `^` and `$` matchers work at the beginning and end of each line, not just of the whole input.

### Example 2

Find all the lines which do not end with a full stop (`.`) in the given text.

In [61]:
txt = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s!
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages
More recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."""

In [62]:
pattern = re.compile(r"^.+[^\.]$", flags=re.M)
pattern.findall(txt)
highlight_regex_matches(pattern, txt)


Lorem Ipsum is simply dummy text of the printing and typesetting industry.
**Lorem Ipsum has been the industry's standard dummy text ever since the 1500s!**
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
**It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages**
More recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.


# 9.Splitting

### `split(string[, maxsplit])`

- Every pattern object has a `split()` method which splits the input string at all positions where a match is found.

- `maxsplit` is an optional argument (default value 0) which specifies the maximum number of splits that can take place. A value of `0` means there is no limit on the number of splits.

- The text matched by the pattern is not included in any of the substrings obtained after splitting (unless the pattern contains capturing groups).

#### Example 1

Let us try to split a string to get individual lines in it.

In [63]:
txt = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated."""

In [64]:
pattern = re.compile("\n")
pattern.split(txt)

['Beautiful is better than ugly.',
 'Explicit is better than implicit.',
 'Simple is better than complex.',
 'Complex is better than complicated.']

In [65]:
pattern = re.compile(r"\W")
pattern.split(txt)

['Beautiful',
 'is',
 'better',
 'than',
 'ugly',
 '',
 'Explicit',
 'is',
 'better',
 'than',
 'implicit',
 '',
 'Simple',
 'is',
 'better',
 'than',
 'complex',
 '',
 'Complex',
 'is',
 'better',
 'than',
 'complicated',
 '']

In [66]:
pattern.split(txt, maxsplit=3)

['Beautiful',
 'is',
 'better',
 'than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.']
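
One detail of `split()` deserves a small sketch: if the pattern contains a capturing group, the text matched by the group is kept in the result list. The sample string and variable names below are assumptions made for illustration.

```python
import re

text = "a, b,c"

# Plain separator pattern: separators are dropped
plain = re.split(r",\s*", text)

# Capturing group around the comma: the commas are kept in the result
with_group = re.split(r"(,)\s*", text)

# maxsplit limits the number of splits; the rest stays in the last item
limited = re.split(r",\s*", text, maxsplit=1)

print(plain)
print(with_group)
print(limited)
```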

# 10. Substitution

Now, we are going to look at a method which will replace all the **leftmost non-overlapping occurrences** of a pattern in a given string and return the new string as result.

### `sub(repl, string[, count=0])`

- `repl` is the replacement string which gets substituted in the place of match

- `string` is the input text on which substitution takes place.

- `count` is an optional argument (default is 0) which specifies the maximum number of substitutions that can take place. `0` means there is no limit on the substitution count.


Let us consider a case where we want to replace all occurrences of numbers with a `-` in the given text.

In [67]:
txt = "100 cats, 23 dogs, 3 rabbits"

In [68]:
pattern = re.compile(r"\d+")
pattern.sub("-", txt)

'- cats, - dogs, - rabbits'

### `subn(repl, string[, count=0])`

- Returns the substituted string as well as the number of substitutions made.

- Can be thought of as a utility function over `sub()`.

In [69]:
pattern.subn("-", txt)

('- cats, - dogs, - rabbits', 3)
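
Two further aspects of `sub()` can be sketched briefly: the `count` argument, and the fact that `repl` may also be a function that receives each `Match` object and returns the replacement text. The doubling function below is a made-up example, not something taken from the text above.

```python
import re

pattern = re.compile(r"\d+")
txt = "100 cats, 23 dogs, 3 rabbits"

# Replace only the first two numbers
first_two = pattern.sub("-", txt, count=2)

# repl as a function: double every number found
doubled = pattern.sub(lambda m: str(int(m.group()) * 2), txt)

print(first_two)
print(doubled)
```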

# 11.Compilation Flags

- When compiling a pattern string into a pattern object, it's possible to **modify the standard behavior of the patterns** using **Compilation Flags**.

- Multiple compilation flags can be combined using the bitwise OR "|".
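
As a minimal sketch of combining flags, the example below joins `re.I` and `re.M` with `|` and compares the result with using each flag alone. The sample text and variable names are assumptions for illustration.

```python
import re

text = "The first line\nnot this one\nthe third line"

# Case-insensitive AND line-aware
both = re.findall(r"^the.+", text, flags=re.I | re.M)

# Only case-insensitive: ^ matches just at the start of the input
only_ignorecase = re.findall(r"^the.+", text, flags=re.I)

# Only multiline: ^ matches at each line start, but the case must agree
only_multiline = re.findall(r"^the.+", text, flags=re.M)

print(both)
print(only_ignorecase)
print(only_multiline)
```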

Here is a list of all the compilation flags:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Syntax</th>
    <th>Meaning</th>
</thead>
    
<tbody>
<tr>
    <td>re.IGNORECASE or re.I</td>
    <td>ignore case.</td>
</tr>

<tr>
    <td>re.MULTILINE or re.M</td>
    <td>make begin/end boundary matchers (^, $) consider each line.</td>
</tr>

<tr>
    <td>re.DOTALL or re.S</td>
    <td>make . match newline too.</td>
</tr>

<tr>
    <td>re.UNICODE or re.U</td>
    <td>make {\w, \W, \b, \B} follow Unicode rules.</td>
</tr>

<tr>
    <td>re.LOCALE or re.L</td>
    <td>make {\w, \W, \b, \B} follow locale.</td>
</tr>

<tr>
    <td>re.ASCII or re.A</td>
    <td>make {\w, \W, \b, \B} perform ASCII-only matching.</td>
</tr>

<tr>
    <td>re.VERBOSE or re.X</td>
    <td>allow whitespace and comments in regex.</td>
</tr>

<tr>
    <td>re.DEBUG</td>
    <td>get information about the compilation pattern.</td>
</tr>
</tbody>
</table>

Let's go through each one of them one by one.

## 1. re.IGNORECASE or re.I

This flag makes a regex pattern case-insensitive.


Let's check out an example to find all occurrences of `the` and `The` in the given text.

In [70]:
txt = """
The best thing about regex is that it makes the task of string manipulation so easy.
"""

In [71]:
pattern = re.compile("the", flags=re.I)
print(pattern)
highlight_regex_matches(pattern, txt)

re.compile('the', re.IGNORECASE)

**The** best thing about regex is that it makes **the** task of string manipulation so easy.



## 2. re.MULTILINE or re.M

This flag is used to make begin/end boundary matchers (`^`, `$`) consider each line of the given text.


Let's check out an example to find all lines starting with `A`.

In [72]:
txt = """
A man was crossing the road.
Suddenly, a car passed before him in a very high speed.
He was terrified
And shocked.
"""

In [73]:
pattern = re.compile("^A.+", flags=re.M)
highlight_regex_matches(pattern, txt)


**A man was crossing the road.**
Suddenly, a car passed before him in a very high speed.
He was terrified
**And shocked.**



## 3. re.DOTALL or re.S

The `.` metacharacter matches any character except a newline. If we want to make `.` match newlines too, we have to set this flag.
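To make the difference concrete, here is a minimal sketch that runs the same pattern with and without `re.S`. The two-line sample text is an assumption chosen for brevity.

```python
import re

txt = "first line\nsecond line"

# Without re.S, "." stops at the newline, so the match ends with line one.
default_match = re.search(r"first.+", txt).group(0)

# With re.S, "." also matches "\n", so the match runs to the end of the text.
dotall_match = re.search(r"first.+", txt, flags=re.S).group(0)

print(repr(default_match))
print(repr(dotall_match))
```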

Let's consider an example to match all the text after (and including) `car`.

In [74]:
pattern = re.compile("car.+", flags=re.S)
highlight_regex_matches(pattern, txt)


A man was crossing the road.
Suddenly, a **car passed before him in a very high speed.**
**He was terrified**
**And shocked.**


## 4. re.UNICODE or re.U

Using this flag, we can make the pattern characters `{\w, \W, \b, \B}` dependent on the Unicode character properties database.

> In Python 3, Unicode matching is the default for `str` patterns, so passing `re.UNICODE` explicitly is redundant there.

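Let's consider an example with Hindi text. As the first output shows, the built-in `re` module does not count the combining vowel signs as word characters even in Unicode mode, so `\w+` breaks the words apart. The third-party `regex` module, used in the second cell, keeps them whole.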

In [75]:
txt = "मुझे किताबें पढ़ना बहुत पसंद है।"
pattern = re.compile(r"\w+")
pattern.findall(txt)

['म', 'झ', 'क', 'त', 'ब', 'पढ', 'न', 'बह', 'त', 'पस', 'द', 'ह']

In [76]:
import regex
pattern = regex.compile(r"\w+")
pattern.findall(txt)

['मुझे', 'किताबें', 'पढ़ना', 'बहुत', 'पसंद', 'है']

## 5. re.LOCALE or re.L

> A locale is a set of environmental variables that defines the language, country, and character encoding settings (or any other special variant preferences) for your applications.

This flag will make the word pattern `{\w, \W}`, the boundary pattern `{\b, \B}` and case-insensitive matching dependent on the current locale. In Python 3 it can be used only with bytes patterns.

<span style="color:red;">**The use of this flag is discouraged in Python 3 as the locale mechanism is very unreliable, it only handles one “culture” at a time, and it only works with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it is able to handle different locales/languages.**</span>


## 6. re.ASCII or re.A

This flag will make the word pattern `{\w, \W}` and boundary pattern `{\b, \B}` perform ASCII-only matching, i.e. `\w` will match only `A-Z`, `a-z`, `0-9` and the underscore `_`.
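One way to see exactly which characters the flag affects is to compare the two modes as sets. This is a rough sketch over code points 0 to 255; the variable names are illustrative only.

```python
import re

# All code points 0-255.
chars = "".join(chr(i) for i in range(256))

unicode_word = set(re.findall(r"\w", chars))
ascii_word = set(re.findall(r"\w", chars, flags=re.A))

# Characters that count as word characters only under the default Unicode rules.
unicode_only = sorted(unicode_word - ascii_word)

print(len(ascii_word))        # a-z, A-Z, 0-9 and "_"
print("".join(unicode_only))
```

Every character in `unicode_only` lies outside the ASCII range, for example accented letters such as `é`.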

Let us see an example below:

In [77]:
chars =  ''.join(chr(i) for i in range(256))

In [78]:
print(chars)

 	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ


In [79]:
pattern = re.compile(r"\w")
highlight_regex_matches(pattern, chars)

 	
 !"#$%&'()*+,-./[43m[1m0[0m[43m[1m1[0m[43m[1m2[0m[43m[1m3[0m[43m[1m4[0m[43m[1m5[0m[43m[1m6[0m[43m[1m7[0m[43m[1m8[0m[43m[1m9[0m:;<=>?@[43m[1mA[0m[43m[1mB[0m[43m[1mC[0m[43m[1mD[0m[43m[1mE[0m[43m[1mF[0m[43m[1mG[0m[43m[1mH[0m[43m[1mI[0m[43m[1mJ[0m[43m[1mK[0m[43m[1mL[0m[43m[1mM[0m[43m[1mN[0m[43m[1mO[0m[43m[1mP[0m[43m[1mQ[0m[43m[1mR[0m[43m[1mS[0m[43m[1mT[0m[43m[1mU[0m[43m[1mV[0m[43m[1mW[0m[43m[1mX[0m[43m[1mY[0m[43m[1mZ[0m[\]^[43m[1m_[0m`[43m[1ma[0m[43m[1mb[0m[43m[1mc[0m[43m[1md[0m[43m[1me[0m[43m[1mf[0m[43m[1mg[0m[43m[1mh[0m[43m[1mi[0m[43m[1mj[0m[43m[1mk[0m[43m[1ml[0m[43m[1mm[0m[43m[1mn[0m[43m[1mo[0m[43m[1mp[0m[43m[1mq[0m[43m[1mr[0m[43m[1ms[0m[43m[1mt[0m[43m[1mu[0m[43m[1mv[0m[43m[1mw[0m[43m[1mx[0m[43m[1my[0m[43m[1mz[0m{|}~ ¡¢£¤¥¦§¨©[43m[1mª

In [80]:
pattern = re.compile(r"\w", flags=re.A)
highlight_regex_matches(pattern, chars)

 	
 !"#$%&'()*+,-./[43m[1m0[0m[43m[1m1[0m[43m[1m2[0m[43m[1m3[0m[43m[1m4[0m[43m[1m5[0m[43m[1m6[0m[43m[1m7[0m[43m[1m8[0m[43m[1m9[0m:;<=>?@[43m[1mA[0m[43m[1mB[0m[43m[1mC[0m[43m[1mD[0m[43m[1mE[0m[43m[1mF[0m[43m[1mG[0m[43m[1mH[0m[43m[1mI[0m[43m[1mJ[0m[43m[1mK[0m[43m[1mL[0m[43m[1mM[0m[43m[1mN[0m[43m[1mO[0m[43m[1mP[0m[43m[1mQ[0m[43m[1mR[0m[43m[1mS[0m[43m[1mT[0m[43m[1mU[0m[43m[1mV[0m[43m[1mW[0m[43m[1mX[0m[43m[1mY[0m[43m[1mZ[0m[\]^[43m[1m_[0m`[43m[1ma[0m[43m[1mb[0m[43m[1mc[0m[43m[1md[0m[43m[1me[0m[43m[1mf[0m[43m[1mg[0m[43m[1mh[0m[43m[1mi[0m[43m[1mj[0m[43m[1mk[0m[43m[1ml[0m[43m[1mm[0m[43m[1mn[0m[43m[1mo[0m[43m[1mp[0m[43m[1mq[0m[43m[1mr[0m[43m[1ms[0m[43m[1mt[0m[43m[1mu[0m[43m[1mv[0m[43m[1mw[0m[43m[1mx[0m[43m[1my[0m[43m[1mz[0m{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´

## 7. re.VERBOSE or re.X

This flag changes the regex syntax so that you can lay a pattern out over whitespace and annotate it with comments.

- Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash.

- When a line contains a `#` that is neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such `#` through the end of the line are ignored.

In [81]:
txt = """
This is a sample text123
"""

In [82]:
pattern = re.compile(r"\w+")
pattern.findall(txt)

['This', 'is', 'a', 'sample', 'text123']

In [83]:
pattern = re.compile(r"\w +")
pattern.findall(txt)

['s ', 's ', 'a ', 'e ']

In [84]:
pattern = re.compile(r"\w +  # find all words", flags=re.X)
pattern.findall(txt)

['This', 'is', 'a', 'sample', 'text123']
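The flag is most useful for longer patterns. As a sketch, a `dd/mm/yyyy` date pattern (chosen here purely for illustration) can be spread over several lines with one comment per part:

```python
import re

# A date pattern written on several lines with comments.
date_pattern = re.compile(r"""
    (\d{2})   # day
    /         # separator
    (\d{2})   # month
    /         # separator
    (\d{4})   # year
""", flags=re.X)

parts = date_pattern.match("12/02/2019").groups()
print(parts)
```

Because whitespace is ignored in this mode, a literal space has to be written as `\ ` or placed inside a character class such as `[ ]`.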

## 8. re.DEBUG

This flag, when set, prints debugging information about the compiled pattern.

In [85]:
pattern = re.compile("\b[a-e7-9]+\b", flags=re.DEBUG)

LITERAL 8
MAX_REPEAT 1 MAXREPEAT
  IN
    RANGE (97, 101)
    RANGE (55, 57)
LITERAL 8

 0. INFO 8 0b1 3 MAXREPEAT (to 9)
      prefix_skip 1
      prefix [0x8] ('\x08')
      overlap [0]
 9: LITERAL 0x8 ('\x08')
11. REPEAT_ONE 13 1 MAXREPEAT (to 25)
15.   IN 8 (to 24)
17.     RANGE 0x61 0x65 ('a'-'e')
20.     RANGE 0x37 0x39 ('7'-'9')
23.     FAILURE
24:   SUCCESS
25: LITERAL 0x8 ('\x08')
27. SUCCESS

> Note that the pattern above is not a raw string, so Python turns each `\b` into a backspace character (`\x08`) before the regex engine sees it. That is why the output shows `LITERAL 8` rather than a word-boundary assertion. Writing the pattern as `r"\b[a-e7-9]+\b"` gives the intended word boundaries.


# 12. Grouping

> Frequently you need to obtain more information than just whether the regex pattern matched or not.

By placing part of a regular expression inside round brackets or parentheses `(`, `)`, you can **group that part** of the regex pattern together.

### Applications of grouping:

#### 1. apply a quantifier to the entire group.

For example, `(ab)+` will match one or more repetitions of `ab`.

In [86]:
txt = "abbbbbabbbb"

In [87]:
pattern1 = re.compile("ab+")
pattern2 = re.compile("(ab)+")
highlight_regex_matches(pattern1, txt)
highlight_regex_matches(pattern2, txt)

[43m[1mabbbbb[0m[43m[1mabbbb[0m
[43m[1mab[0mbbbb[43m[1mab[0mbbb


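#### 2. limit the scope of alternation.

Without a group, `|` splits the whole pattern, so `my name is ram|sam` means either `my name is ram` or just `sam`. Grouping restricts the alternatives to `ram` and `sam`.

In [88]: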
txt = """
my name is ram
my name is sam
"""

In [89]:
pattern1 = re.compile("my name is ram|sam")
pattern2 = re.compile("my name is (ram|sam)")

In [90]:
highlight_regex_matches(pattern1, txt)
highlight_regex_matches(pattern2, txt)


**my name is ram**
my name is **sam**


**my name is ram**
**my name is sam**



#### 3. capture the text matched by group.

- Groups indicated with `(`, `)` also capture the **starting** and **ending** index of the text that they match.

- Groups can be retrieved by passing an argument to `group()`, `start()`, `end()`, and `span()` of the `Match` object. 

- Groups are numbered starting with `0`. 

- Group `0` is always present; it captures the whole regex pattern, so all `Match` object methods have group `0` as their default argument.

Consider an example where we want to parse a date and determine the day, month and year:

In [91]:
txt = "12/02/2019" 

In [92]:
pattern = re.compile(r"(\d{2})/(\d{2})/(\d{4})")
match = pattern.match(txt)

In [93]:
print(match.group(0))  # the entire match
print(match.group(1))
print(match.group(2))
day, month, year = match.groups()
print(day, month, year)

12/02/2019
12
02
12 02 2019
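The example above only uses `group()`. As a short sketch on the same date string, the position methods take a group number in the same way:

```python
import re

match = re.match(r"(\d{2})/(\d{2})/(\d{4})", "12/02/2019")

whole_span = match.span()      # group 0, i.e. the whole match
year_span = match.span(3)      # where the year was found
month_start, month_end = match.start(2), match.end(2)

print(whole_span, year_span, month_start, month_end)
```

Slicing the input with a span, e.g. `"12/02/2019"[6:10]`, gives back the text captured by that group.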


In [94]:
txt = """
Name: Nikhil
Age: 0
Roll No.: 15
Grade: S

Name: Ravi teja
Age: -1
Roll No.: 123
Grade: K

Name: Ram
Age: N/A
Roll No.: 1
Grade: G
"""

In [95]:
pattern = re.compile("Name: (.+)\n")
pattern.findall(txt)

['Nikhil', 'Ravi teja', 'Ram']

In [96]:
pattern = re.compile(r"^Name: \w*")  # without re.M, ^ matches only at the very start of the text
pattern.findall(txt)

[]

In [97]:
pattern = re.compile(r"^Name: \w*", flags=re.M)
pattern.findall(txt)

['Name: Nikhil', 'Name: Ravi', 'Name: Ram']

# 13. Backreferencing

**Backreferences** in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. 

> For example, `\1` will succeed if the exact contents of group `1` can be found at the current position, and fails otherwise.

### Example 1

Consider a scenario where we want to find all the duplicated words in the given text.

In [98]:
txt = """
hello hello hello
how are you
bye bye
"""

In [99]:
pattern = re.compile("(\w+) \\1")
pattern.findall(txt)

['hello', 'bye']

> Python's string literals also use a **backslash followed by digits** to include arbitrary characters in a string, so in a normal string the backslash of a backreference must be **escaped** (`\\1`) for the regex engine to receive `\1`. Using a **raw string** avoids the extra escaping.

Here is an example using raw strings.

In [101]:
pattern = re.compile(r"(\w+) \1")
pattern.findall(txt)

['hello', 'bye']

### Example 2

Consider a scenario where we want to find all dates with the format `dd/mm/yyyy` and change them to `yyyy-mm-dd` format. 

In [102]:
txt = """
today is 23/02/2019.
yesterday was 22/02/2019.
tomorrow is 24/02/2019.
"""

In [103]:
pattern = re.compile(r"(\d{2})/(\d{2})/(\d{4})")
newtxt = pattern.sub(r"\3-\2-\1", txt)
print(newtxt)


today is 2019-02-23.
yesterday was 2019-02-22.
tomorrow is 2019-02-24.



> Backreferences cannot be used inside a character class. In Python's `re` module, the `\1` in a regex like `(a)[\1b]` is read as a numeric character escape (the character `\x01`), not as a reference to group `1`.
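For Python's `re` module specifically, numeric escapes inside a character class are treated as characters. The following quick check is a sketch of that behaviour; the test strings are made up.

```python
import re

pattern = re.compile(r"(a)[\1b]")

matches_b = pattern.fullmatch("ab") is not None        # "b" is in the class
matches_x01 = pattern.fullmatch("a\x01") is not None   # so is the character \x01
matches_a = pattern.fullmatch("aa") is not None        # the text of group 1 is not

print(matches_b, matches_x01, matches_a)
```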

# 14. Named Groups

> Using numbers to refer to groups can be tedious and confusing, and the worst thing is that it doesn't allow you to give meaning or context to the group. That's why we have named groups.

Instead of referring to groups by numbers, groups can be referenced by a name. Such a group is called a **named group**.

- The syntax for a named group is one of the Python-specific extensions: `(?P<name>...)`  where `name` is, obviously, the name of the group. 

- Named groups behave exactly like capturing groups, and additionally associate a name with a group.

- Here is a table which shows three different ways to refer to named groups:
    
<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Use</th>
    <th>Syntax</th>
</thead>
    
<tbody>
<tr>
    <td>Inside a pattern</td>
    <td>(?P=name)</td>
</tr>
    
<tr>
    <td>In the repl string of the sub operation</td>
    <td>\g&lt;name&gt;</td>
</tr>

<tr>
    <td>In any of the methods of the Match object</td>
    <td>match.group('name')</td>
</tr>
</tbody>
</table>

### Example 1

Consider a scenario where we want to extract the first name and last name of a person.

In [104]:
txt = "Nikhil Kumar"

In [105]:
pattern = re.compile(r"(?P<first>\w+) (?P<last>\w+)")
match = pattern.match(txt)

In [106]:
match.group('first'),match.group('last')

('Nikhil', 'Kumar')
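When there are several named groups, `Match.groupdict()` returns all of them at once as a dictionary. A minimal sketch with the same pattern:

```python
import re

pattern = re.compile(r"(?P<first>\w+) (?P<last>\w+)")
match = pattern.match("Nikhil Kumar")

names = match.groupdict()
print(names)

# Named groups are still numbered, so both lookups return the same text.
same_text = match.group(1) == match.group("first")
print(same_text)
```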

### Example 2

Now consider the scenario where we want to swap first name and last name in above example.

In [107]:
pattern.sub(r"\g<last> \g<first>", txt)

'Kumar Nikhil'

### Example 3

Consider a scenario where we want to check if a person has same first and last name.

In [108]:
txt = "Jhonson Jhonson"
pattern = re.compile(r"(?P<first>\w+) (?P=first)")
pattern.findall(txt)

['Jhonson']

# 15. Non-Capturing Groups

> There are cases when we want to use groups but are not interested in extracting the information, i.e. in capturing the text matched inside the parentheses. An example is **alternation**.

Let's consider an example where we want to find the strings `i love cats` or `i love dogs` in the given text.

In [109]:
txt = """
i love cats
i love dogs
"""

In [110]:
pattern = re.compile("i love (cats|dogs)")
pattern.findall(txt)

['cats', 'dogs']

In [111]:
for match in pattern.finditer(txt):
    print("Complete regex match (default):", match.group(0))
    print("Match captured by 1st group:", match.group(1))

Complete regex match (default): i love cats
Match captured by 1st group: cats
Complete regex match (default): i love dogs
Match captured by 1st group: dogs


As we can see, `findall()` returns only the text captured by the group (`cats` or `dogs`) instead of the complete sentences.

To make a group **non-capturing**, we have to use the syntax `(?:pattern)`.

In [112]:
pattern = re.compile("i love (?:cats|dogs)")
pattern.findall(txt)

['i love cats', 'i love dogs']

> With the non-capturing syntax, the grouping works as before, but the engine no longer stores the text matched by the group and the regex is easier to maintain. Note that a non-capturing group cannot be referenced later.

# 16. Zero-width assertions

- Characters which indicate positions rather than actual content are called **zero-width assertions**.


- For instance, the caret symbol (`^`) represents the beginning of a line and the dollar sign (`$`) represents the end of a line. 


- They effectively do assertion without consuming characters; they just return a positive or negative result of the match.


- A more powerful kind of **zero-width assertion** is **look around**, a mechanism with which it is possible to check what comes before (**look behind**) or after (**look ahead**) the current position.
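
That these assertions consume nothing can be seen from the spans of their matches, which are all empty. A small sketch with a made-up string:

```python
import re

txt = "hi there"

# \b matches at word boundaries; every match is an empty string.
boundary_spans = [m.span() for m in re.finditer(r"\b", txt)]
print(boundary_spans)

# ^ and $ likewise match positions, not characters.
start_span = re.search(r"^", txt).span()
end_span = re.search(r"$", txt).span()
print(start_span, end_span)
```

Each span has equal start and end indices, i.e. a match of length zero.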


# Look around


**Look around** is a mechanism which, during the matching process, looks ahead of the current position (or behind it, depending on the type of look around used) to see if **some** pattern matches before continuing with the actual match.

The most important thing to understand here is that **look around** mechanism consists of 2 parts:
- **actual expression**: an expression whose match constitutes the final **result**.
- **non-consuming expression**: an expression whose match is evaluated before the actual expression, just to see if it can succeed. It is **not actually consumed** by the regex engine.
    - If the non-consuming match **succeeds**, the regex engine forgets about this non-consuming expression and starts evaluating the next character from the current position of the actual expression. 
    - If the non-consuming match **does not succeed**, we simply move to next character of the given text and repeat the whole match process again.

There are 2 main categories of **look around**  which, in turn, have 2 sub-categories each.

![](images/lookaround.png)

Let's explore each one of them one by one.

# Look ahead

**Look ahead** mechanism checks the match for a non-consuming expression **ahead** of a given pattern.


## Positive look ahead

- **Positive look ahead** will succeed if the passed non-consuming expression **does match** against the forthcoming input.

- The syntax is `A(?=B)` where `A` is the **actual expression** and `B` is the **non-consuming expression**. 


Let's check out an example to understand the concept. Let's assume that we want to find a match for `love` in the given text only if it is followed by `regex`.

In [114]:
txt = "i love python, i love regex"

In [115]:
pattern = re.compile('love regex')
match = pattern.search(txt)
match.span()

(17, 27)

In [116]:
pattern.findall(txt)

['love regex']

In [117]:
highlight_regex_matches(pattern, txt)

i love python, i **love regex**


As we can see, a total of 10 (index 17 to 27) characters, i.e. `love regex` are consumed to search for the given pattern in the text.

Now consider the regex pattern `love(?=\sregex)`.

In [118]:
pattern = re.compile(r"love(?=\sregex)")
match = pattern.search(txt)
print(match.span())
highlight_regex_matches(pattern, txt)

(17, 21)
i love python, i **love** regex


Now, using the **positive look ahead** mechanism, only 4 characters (index 17 to 21) are consumed for the match.

Let us check out another example to find all words in given text which are followed by `.` or `,`.

In [119]:
txt = "My favorite colors are red, green, and blue."
pattern = re.compile(r"\w+(?=,|\.)")
pattern.findall(txt)

['red', 'green', 'blue']

In [120]:
highlight_regex_matches(pattern, txt)

My favorite colors are **red**, **green**, and **blue**.
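
Because a look ahead consumes nothing, several of them can be tested at the same position. The sketch below uses a hypothetical validation rule (at least one digit, at least one lowercase letter, six or more characters) to show this.

```python
import re

# Two look aheads are tested at the same position (the start of the string).
# Neither consumes input, so ".{6,}" still sees the whole string afterwards.
pattern = re.compile(r"^(?=.*\d)(?=.*[a-z]).{6,}$")

candidates = ["abc123", "abcdef", "123456", "a1"]
results = {c: bool(pattern.match(c)) for c in candidates}
print(results)
```

Only `abc123` satisfies all three conditions; the others each fail one of them.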


## Negative look ahead

- **Negative look ahead** will succeed if the passed non-consuming expression **does not match** against the forthcoming input.

- The syntax is `A(?!B)` where `A` is the **actual expression** and `B` is the **non-consuming expression**. 


Let's assume that we want to find a match for `love` in the given text only if it is NOT followed by `regex`.

In [121]:
txt = "i love python, i love regex"

In [122]:
pattern = re.compile(r"love(?!\sregex)")
highlight_regex_matches(pattern, txt)

i **love** python, i love regex
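
One pitfall worth sketching: when the actual expression is a quantified pattern, the engine may backtrack to a shorter match that satisfies the negative look ahead. The example text and the "numbers not followed by `%`" task are assumptions for illustration.

```python
import re

txt = "50% off 20 items"

# Naive attempt: "\d+" backtracks from "50" to "5", which is followed by "0", not "%".
naive = re.findall(r"\d+(?!%)", txt)
print(naive)

# Adding \b forces the whole number to be matched before the look ahead is checked.
fixed = re.findall(r"\d+\b(?!%)", txt)
print(fixed)
```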


# 17. Look behind


**Look behind** mechanism checks the match for a non-consuming expression **behind** a given pattern.


## Positive look behind

- **Positive look behind** will succeed if the passed non-consuming expression **does match** against the preceding input.

- The syntax is `(?<=B)A` where `A` is the **actual expression** and `B` is the **non-consuming expression**. 


Let's check out an example to understand the concept. Let's assume that we want to find a match for `regex` in the given text only if it is preceded by `love` or `hate`.

In [123]:
txt = "love regex or hate regex, can't ignore regex"

In [124]:
pattern = re.compile(r"(?<=(love|hate)\s)regex")
highlight_regex_matches(pattern, txt)

love **regex** or hate **regex**, can't ignore regex


## Negative look behind

- **Negative look behind** will succeed if the passed non-consuming expression **does not match** against the preceding input.

- The syntax is `(?<!B)A` where `A` is the **actual expression** and `B` is the **non-consuming expression**. 


Let's assume that we want to find a match for `regex` in the given text only if it is NOT preceded by `love` or `hate`.

In [125]:
pattern = re.compile(r"(?<!(love|hate)\s)regex")
highlight_regex_matches(pattern, txt)

love regex or hate regex, can't ignore **regex**
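
One restriction to keep in mind: in Python's `re` module the expression inside a look behind must match text of a fixed length (`love` and `hate` above are both four characters). A brief sketch, with a made-up price string:

```python
import re

txt = "price: $45, discount: 10"

# Fixed-width look behind: digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", txt)
print(prices)

# A variable-width look behind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=\$+)\d+")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)
```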
