<a id="home"></a>
# Regular Expressions - Advanced (Part 1, Part 2)
**Before you start**<br/>
You should go over this notebook after completing the basic regular expression notebook:<br/>
[regular expressions - basic notebook](Ex09_RegularExpressions.ipynb)

| Section | Section-name | Section | Section-name | Section | Section-name | 
| :- | :- | :- | :- | :- | :- | 
| part 1 | [Regexp - Repetition](#part1) |  1. | [1. Repetition Qualifiers](#1) |  1.a. | [the `+` meta-character](#1a) | 
| 1.b. | [ `*` and `?`](#1b) | 1.c. | [Specifying number of occurrences](#1c) | 1.d. | [Example: E-Mails](#1d) | 
| 1.e. | [Optional Exercise for part 1](#1e) | 
| part 2 | [advanced syntax](#part2) |  2.a. | [Grouping](#2a) |  2.b. | [the `sub` function](#2b) | 
| 2.c. | [ the re `flags`](#2c) | 2.d. | [Greedy vs non-greedy](#2d) | 2.e. | [Other Functions](#2e) | 
| 2.f. | [Optional Exercise for part 2](#2f) |


### About this notebook
This notebook is divided into 2 parts (part 1, part 2).<br/>
**Part 1** practices the following items:
- Get familiar with repetition in Regular Expressions
- Practice the use of the `+`, `*` and `?` operators
- Practice the use of `{m}` and `{m,n}`

**Part 2** practices the following items:
- Get familiar with grouping
- Practice the use of `sub` and use of grouping in it
- Demonstrate the difference between greedy and non-greedy match
- Use of `re` flags

In [None]:
import re

# Show the output of every expression in a cell, not only the last one. This lets us condense several examples into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

[Go to the beginning of the notebook](#home)
<a id="part1"></a>
## Part 1 - Regular Expressions - Repetition

[Go to the beginning of the notebook](#home)
<a id="1"></a>
### 1. Repetition Qualifiers

A key concept in regex is repetition. There are five ways to express repetition in a pattern:

1. A pattern followed by the meta-character **+** is repeated one or more times. 
2. Replace the **+** with <b>*</b> and the pattern can appear zero or more times. 
3. Using **?** means the pattern appears zero or one time. 
4. For a specific number of occurrences, use **{m}** after the pattern, where m is replaced with the number of times the pattern should repeat. 
5. Use **{m,n}** where m is the minimum number of repetitions and n is the maximum. <br/>Leaving out n (**{m,}**) means the pattern appears at least m times, with no maximum.
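
As a quick side-by-side sketch of the forms listed above, the snippet below checks a few made-up strings against each one. The sample strings and the helper name `matches` are illustrative assumptions, not part of the notebook data; `re.fullmatch` is used so that a string counts only if the whole string fits the pattern.

```python
import re

# Hypothetical sample strings: "a" followed by 0 to 4 "b" characters.
samples = ["a", "ab", "abb", "abbb", "abbbb"]

# One pattern per repetition form from the list above.
patterns = {
    "+": r"ab+",          # one or more "b"
    "*": r"ab*",          # zero or more "b"
    "?": r"ab?",          # zero or one "b"
    "{m}": r"ab{2}",      # exactly two "b"
    "{m,n}": r"ab{2,3}",  # two to three "b"
    "{m,}": r"ab{2,}",    # at least two "b"
}

def matches(pattern, strings):
    """Return the strings that fit the pattern from start to end."""
    return [s for s in strings if re.fullmatch(pattern, s)]

for name, pattern in patterns.items():
    print(name, pattern, matches(pattern, samples))
```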

Now we will see an example of each of these:

[Go to the beginning of the notebook](#home)
<a id="1a"></a>
####  1.a. the `+` meta-character

Let's try to catch the slang word "hmm". As you may know, some people write it as "hmm", <br/>
while others write it as "hmmm" or even "hmmmmm". Let's see how we can capture all of these with one expression.

In [None]:
import re

In [None]:
txt="hmmmm.. what do you mean by that?"
pattern = "hm+"

print(re.findall(pattern,txt))

In [None]:
txt="hmmmm.. what do you mean by that?"
pattern2 = "hm*"

print(re.findall(pattern2,txt))

Try changing the 'hmm' expression in the ``txt`` string above to be longer or shorter. 

What do you see?
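
One way to explore this is to loop over a few variants of the text. The sketch below uses made-up variants and its own variable names, so it does not overwrite ``txt``. Note how `hm*` also picks up a bare "h" wherever one appears, since zero occurrences of "m" are allowed.

```python
import re

# Hypothetical variants of the sentence with a shorter and a longer "hmm".
variants = [
    "hm.. what do you mean by that?",
    "hmm.. what do you mean by that?",
    "hmmmmmm.. what do you mean by that?",
]

for txt_variant in variants:
    plus_matches = re.findall(r"hm+", txt_variant)  # "h" followed by at least one "m"
    star_matches = re.findall(r"hm*", txt_variant)  # "h" followed by any number of "m", including none
    print(plus_matches, star_matches)
```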

Let's use the following ``txt`` string and see how we can use the `+` meta-character to extract people's first names:

In [None]:
txt="""

name: Dina Ivry
email: dimai@gmail.com
time: 2020-11-02 11:32:11
phone: 052-3434233
city: Tel-aviv
title: knn  
content: can you explain what does the k hyper-parameter mean???

==============
name: Joseph Haim Katzir
email: joek@myemail.ac.il
time: 2020-12-20 13:34:02
phone: (054) 5444443
city: Tel aviv
title: what a great lecture   
content: avinoam this was one of your best

=============

"""

In [None]:
pattern = r"name: \w+"  # raw string, so that \w reaches the regex engine unchanged

print(re.findall(pattern,txt))

This is much simpler than the pattern we used before. <br/>
Note also that it captures names of variable length. <br/>
If you take a look, you can see that some names also have a middle name. <br/>

How can we write a pattern to capture the full name?

[Go to the beginning of the notebook](#home)
<a id="1b"></a>
#### 1.b.  `*` and `?` meta-characters
In order to capture both options, with and without a middle name, we need to define a pattern in which the middle name is optional. <br/>
For this we will use the `*` and `?` meta-characters. <br/>

The pattern will capture: first_name space optional_middle_name optional_space last_name

In [None]:
# we want to capture: first_name space optional_middle_name optional_space last_name
pattern = r"name: \w+ \w* ?\w+"

print(re.findall(pattern,txt))

Note that we could use a simpler pattern to capture the entire line that starts with "name:". <br/>

The pattern will look like this:

In [None]:
pattern = "name: .*"

print(re.findall(pattern,txt))

While this pattern is simpler, it doesn't give us a breakdown of the different elements of the name, which will become useful later.
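
The `?` meta-character appears above only as the optional space. As a separate minimal sketch, with a made-up sentence that is not part of the notebook data, here it makes a single letter optional:

```python
import re

# Hypothetical sentence containing both spellings.
sentence = "The color of the sky and the colour of the sea"

# "u?" means the letter "u" may appear zero or one time.
spellings = re.findall(r"colou?r", sentence)
print(spellings)
```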

[Go to the beginning of the notebook](#home)
<a id="1c"></a>
#### 1.c. Specifying number of occurrences

`{m}` specifies that exactly m copies of the previous RE should be matched. <br/>
Fewer matches cause the entire RE not to match. <br/>
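
One subtlety worth noting: `{m}` requires exactly m repetitions inside the match, but with `findall` or `search` the match may still be a slice of a longer run. The sketch below uses a made-up string; adding `\b` word boundaries is one possible way (an assumption about the goal, not the only option) to insist on stand-alone three-digit numbers.

```python
import re

# Hypothetical input: runs of 2, 3 and 5 digits.
digits = "12 123 12345"

# "12" is too short to match; "12345" still yields its first three digits.
loose = re.findall(r"\d{3}", digits)

# Word boundaries on both sides reject runs that are longer than three digits.
strict = re.findall(r"\b\d{3}\b", digits)

print(loose)
print(strict)
```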

Let's look at a phone numbers example:

In [None]:
phone_numbers = "(054) 232-2235, (050) 134-2215, this is common in twelve (12) countries and one (1) state"

# match exactly three digits enclosed in brackets (area-code)
re.findall(r"\(([0-9]{3})\)", phone_numbers)

We can now also extend it to extract the phone numbers in the text:

In [None]:
pattern=r"\(\d{3}\) \d{3}\-?\d{4}"

re.findall(pattern, phone_numbers)

If we apply it to the ``txt`` string data, we will see that it still doesn't work perfectly. 

In [None]:
re.findall(pattern, txt)

It found only one of the numbers. <br/>

Let's update the pattern to extract both numbers:

In [None]:
pattern=r"\(?\d{3}\)?[ \-]\d{3}\-?\d{4}"

re.findall(pattern, txt)

Hurray, it worked! Let's explain what we did:

1. We marked the brackets around the area code as optional (`\(?\d{3}\)?` instead of `\(\d{3}\)`)
2. We allowed either a space or a hyphen between the area code and the main number (`[ \-]` instead of ` `)
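
The same pattern can also be written with the `re.VERBOSE` flag, which lets each piece carry its own comment. This is only a sketch: the variable names and the sample string are made up for illustration, and the pattern is the one from the cell above, spread over several lines.

```python
import re

phone_pattern = re.compile(r"""
    \(?        # optional opening bracket
    \d{3}      # area code: exactly three digits
    \)?        # optional closing bracket
    [ \-]      # a space or a hyphen (whitespace is kept inside a character class)
    \d{3}      # first three digits of the main number
    \-?        # optional hyphen
    \d{4}      # last four digits
""", re.VERBOSE)

# Hypothetical sample covering the formats seen so far.
sample = "052-3434233, (054) 5444443, (054) 232-2235"
found = phone_pattern.findall(sample)
print(found)
```

Outside a character class, `re.VERBOSE` ignores whitespace and treats `#` as the start of a comment, which is why the pieces can be laid out one per line.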

Let's see another, more complex example:

[Go to the beginning of the notebook](#home)
<a id="1d"></a>
#### 1.d. Example: E-Mails

Let's take a look at how we can use regular expressions in a realistic task. <br/>
Suppose you're a marketer and you want to scrape e-mail addresses from a website. 

Here is an example:

In [None]:
html = 'You can reach us @datascience <a href="mailto:datascience@campus.gov.il">by e-mail</a> if necessary.'

pattern = ""  # your pattern here

re.findall(pattern, html)

In [None]:
# a first attempt:
# \w+  one or more word characters (letters, digits, underscore)
# @    the literal @
# \w+  one or more word characters
pattern=r'\w+@\w+'
re.findall(pattern, html)

That didn't capture the full address, because `\w` doesn't match the `.` character. 

We can write a more specific query:

In [None]:
# \w+  one or more word characters
# @    the literal @
# \w+  one or more word characters
# \.+  one or more periods
# \w+  one or more word characters
# \.   another period
# \w+  and more word characters
pattern=r'\w+@\w+\.+\w+\.\w+'
re.findall(pattern, html)

That worked! But it's easy to see that this isn't very general, i.e., it doesn't work for every valid e-mail address. 

See the example text below:

In [None]:
html2 = 'You can reach us also at <a href="mailto:datasciencemooc@gmail.com">by e-mail</a> if necessary.'

re.findall(pattern, html2)

Here the e-mail datasciencemooc@gmail.com wasn't matched at all. Let's revise the pattern a bit, and check it on both texts:

In [None]:
pattern=r'\w+@\w+\.+\w+\.?\w*'
for t in [html,html2]:
    print(re.findall(pattern, t))

This works on both texts, but what happens if the e-mail address changes and looks like the following?

In [None]:
html3 = 'You can reach us also at <a href="mailto:data-science-mooc@gmail.com">by e-mail</a> if necessary.'

re.findall(pattern, html3)

Here, something matched but it's the wrong e-mail! It's not data-science-mooc@gmail.com but mooc@gmail.com. 

To fix this, we need to improve the pattern and also use character classes (`[...]`):

In [None]:
pattern = r'[\w.-]+@[\w.-]+'
for t in [html,html2,html3]:
    print(re.findall(pattern, t))

That worked wonderfully! <br/>
See how easy it is to extract an e-mail from a website. <br/>

However, this pattern matches valid e-mail addresses, but it also matches non-valid ones. <br/>
So this is a fine regex if you want to extract e-mail addresses, but not if you want to validate an e-mail address. <br/>

Try to think of text sequences that would match this regular expression but do not represent valid e-mail addresses.
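
A few candidates are sketched below. The strings are invented for this illustration; each one fits the extraction pattern from start to end, even though none of them looks like a usable e-mail address.

```python
import re

pattern_email = r'[\w.-]+@[\w.-]+'

# Hypothetical strings that are not usable e-mail addresses.
not_emails = ["..@--", "-@.", "user@.com."]

for candidate in not_emails:
    # fullmatch succeeds only if the whole string fits the pattern
    print(candidate, bool(re.fullmatch(pattern_email, candidate)))
```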

[Go to the beginning of the notebook](#home)
<a id="1e"></a>
#### 1.e. Optional Exercise:
Optionally sharpen your skills with [regular expression self exercises notebook](Ex09-RegularExpression-Exercises.ipynb).<br/>
The following exercise is relevant for now:
* Exercise 3

[Go to the beginning of the notebook](#home)
<a id="part2"></a>
## Part 2 - Regular Expressions - advanced syntax

In this part you will practice the following items:
- Get familiar with grouping
- Practice the use of `sub` and use of grouping in it
- Demonstrate the difference between greedy and non-greedy match
- Use of `re` flags

[Go to the beginning of the notebook](#home)
<a id="2a"></a>
#### 2.a. Grouping
If we want to be more specific, for example about repeating substrings, <br/>
we need to be able to group a part of a regular expression. 

You can group with round brackets `()`, so you can refer only to those elements. <br/>
Each element is captured as a `group` item in the `Match` object. If you use `findall` with groups in the pattern, you will get a list of the grouped items rather than the entire match.
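
How `findall` reports its results depends on the number of groups in the pattern. The sketch below uses a made-up string of number ranges (not part of the notebook data) to show the three cases.

```python
import re

# Hypothetical input.
ranges = "10-20 30-40"

no_groups = re.findall(r"\d+-\d+", ranges)       # no groups: the full matches
one_group = re.findall(r"(\d+)-\d+", ranges)     # one group: only the captured part
two_groups = re.findall(r"(\d+)-(\d+)", ranges)  # several groups: one tuple per match

print(no_groups)
print(one_group)
print(two_groups)
```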

Now we will see an example of it, on our ``txt`` string data. <br/>
We want to extract the first and last name, but get only the names without the `name: ` prefix

In [None]:
txt="""
@start_comment
name: Dina Ivry
email: dimai@gmail.com
time: 2020-11-02 11:32:11
phone: 052-3434233
city: Tel-aviv
title: knn  
content: can you explain what does the k hyper-parameter mean???
@end_comment
==============
@start_comment
name: Joseph Haim Katzir
email: joek@myemail.ac.il
time: 2020-12-20 13:34:02
phone: (054) 5444443
city: Tel aviv
title: what a great lecture   
content: avinoam this was one of your best
@end_comment
=============

"""

Recall the pattern we tried before:

In [None]:
pattern = r"name: \w+ \w* ?\w+"

print(re.findall(pattern,txt))

As you can see, we also extracted the `name: ` prefix. 

We can use grouping to return only the name itself:

In [None]:
pattern = r"name: (\w+ \w* ?\w+)"

print(re.findall(pattern,txt))

If we want to capture each of the names (first, middle and last) separately, we can do this as well:

In [None]:
pattern = r"name: (\w+) ?(\w*) (\w+)"

print(re.findall(pattern,txt))

If we use the `search` function, we would do it this way:

In [None]:
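# search returns a Match object for the first match only, or None if nothing matched
m = re.search(pattern, txt)
if m:
    # an expression nested inside the if block is not displayed, so print it explicitly
    print("Number of groups:", len(m.groups()))
    print("First name is:", m.group(1))
    print("Middle name (if any) is:", m.group(2))
    print("Last name is:", m.group(3))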

[Go to the beginning of the notebook](#home)
<a id="2b"></a>
#### 2.b. the `sub` function

We can use the `sub()` function to dynamically replace content. 

In [None]:
weekdays = "We could meet Monday or Wednesday"

re.sub("Monday|Tuesday|Wednesday|Thursday|Friday", "Weekday",  weekdays)

We can use grouping functionality to make the replacement smarter. <br/>

Suppose we want to switch the order from a "first middle last" name structure to "last first middle": 

In [None]:
name="Joseph Haim Katzir"
pattern = r"(\w+) ?(\w*) (\w+)"

re.sub(pattern,r"\3 \1 \2",name)

Very powerful!
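
One detail to keep in mind: when a name has no middle name, group 2 is empty, so the replacement leaves a trailing space. A minimal sketch, with an assumed two-part name and a `strip()` call as one possible clean-up:

```python
import re

pattern_name = r"(\w+) ?(\w*) (\w+)"

# Hypothetical name without a middle name.
short_name = "Dina Ivry"

swapped = re.sub(pattern_name, r"\3 \1 \2", short_name)
print(repr(swapped))       # repr makes the trailing space visible

cleaned = swapped.strip()  # remove the leftover space
print(cleaned)
```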

[Go to the beginning of the notebook](#home)
<a id="2c"></a>
#### 2.c. the re `flags`
Look at the example below, where we try to extract the comments from the ``txt`` string:

In [None]:
pattern="@start_comment(.*)@end_comment"

print(re.findall(pattern,txt))

But nothing was extracted. <br/>
The reason is that the match spans multiple lines, and by default the `.` (dot) meta-character doesn't match newlines. <br/>

To overcome this, we need to use `flags` to instruct the matching engine to let the dot match newlines as well.
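
A minimal illustration of the flag on a two-line string (the string and variable names are made up for this sketch):

```python
import re

# Hypothetical two-line string.
two_lines = "start\nend"

# By default "." refuses to match the newline between the two words.
without_flag = re.findall(r"start.end", two_lines)

# With re.DOTALL, "." matches the newline as well.
with_flag = re.findall(r"start.end", two_lines, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```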

In [None]:
pattern="@start_comment(.*)@end_comment"

for s in re.findall(pattern,txt, flags=re.DOTALL):
    print(s)

This worked partially. <br/>
It extracted text, but not exactly what we wanted. <br/>
It extracted the entire text, from the first "@start_comment" up to the very last "@end_comment". <br/>

What we wanted was to extract only the text between each pair of adjacent start and end tags. <br/>
In order to solve this we need to introduce the `non-greedy` match.

[Go to the beginning of the notebook](#home)
<a id="2d"></a>
#### 2.d. Greedy vs non-greedy
By default, regular expression quantifiers are greedy.<br/>
This means they try to capture the longest match possible, starting from the first position where a match can begin. <br/>

In the previous example, we tried to match a single comment. <br/>
With a non-greedy quantifier we instruct the engine to look for the shortest match, rather than the longest one. <br/>

We can modify this behavior with the `?` character, which signals that the quantifier on its left should not be greedy:

In [None]:
pattern="@start_comment(.*?)@end_comment"

for s in re.findall(pattern,txt, flags=re.DOTALL):
    print(s)
    print("------------")

Greediness applies to the `*`, `+` and `?` quantifiers (and also to `{m,n}`), so these are legal sequences: `*?`, `+?`, `??` and `{m,n}?`.
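
The same contrast on a tiny made-up string, as a sketch (the tag-like text is an assumption for illustration only):

```python
import re

# Hypothetical input with two tag-like pieces.
tags = "<a><b>"

greedy = re.findall(r"<.+>", tags)   # longest possible: runs to the last ">"
lazy = re.findall(r"<.+?>", tags)    # shortest possible: stops at the first ">"

optional_greedy = re.findall(r"ab?", "ab")   # takes the optional "b"
optional_lazy = re.findall(r"ab??", "ab")    # skips the optional "b"

print(greedy, lazy)
print(optional_greedy, optional_lazy)
```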

[Go to the beginning of the notebook](#home)
<a id="2e"></a>
#### 2.e. Other Functions

We've covered a lot, but not all of the functionality of regex.  <br/>
A couple of other functions that could be helpful:

* [finditer](https://docs.python.org/3/library/re.html#re.finditer) returns an iterator over `Match` objects, instead of a list of strings
* the [IGNORECASE](https://docs.python.org/3/library/re.html#re.IGNORECASE) flag makes the matching case-insensitive
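
A short sketch of both, using a made-up sentence. Since `finditer` yields `Match` objects, the position of each match is available through `span()`; the variable names are illustrative.

```python
import re

# Hypothetical sentence with three spellings of the same word.
sentence = "Monday, monday and MONDAY"

# Without the flag, only the exact-case spelling matches.
case_sensitive = re.findall(r"monday", sentence)

# finditer yields Match objects one at a time; IGNORECASE matches all spellings.
hits = [
    (hit.group(), hit.span())
    for hit in re.finditer(r"monday", sentence, flags=re.IGNORECASE)
]

print(case_sensitive)
print(hits)
```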

[Go to the beginning of the notebook](#home)
<a id="2f"></a>
#### 2.f. Optional Exercise:
Optionally sharpen your skills with [regular expression self exercises notebook](Ex09-RegularExpression-Exercises.ipynb).<br/>
The following exercise is relevant for now:
* Exercise 4