<a id="home"></a>
# Regular Expressions - Basics

| Section | Section-name | Section | Section-name | Section | Section-name | 
| :- | :- | :- | :- | :- | :- | 
| 1. | [the re Module](#1) |  1.a. | [Simple (syntax) Examples](#1a) |  1.b. | [the 'search' function](#1b) | 
| 1.c. | [the 'findall' function](#1c) | 1.d. | [the 'split' function](#1d) | 
| 2. | [Character sets and ranges](#2) | 3. | [Escape codes](#3) |  4. | [Or expression](#4) | 
| 5. | [Useful tool](#5) | 6. | [Optional Exercises](#6) | 

**After you are done**<br/>
You should continue to the next regular expression notebook:<br/>
[regular expressions - advanced notebook](Ex09_RegularExpressions_adv.ipynb)

#### About this notebook
In this notebook we will practice the following items:
- Get familiar with Regular Expressions
- Basic use of the main `re` module functions: `search`, `findall`, `split`, `sub`, and the `Match` object.
- Practice basic regular expression syntax
- Get familiar with character sets and character ranges ([] operator)
- Get familiar with special escape codes (such as `\w`,`\d`, etc)
- Or expression
- Useful online regexp debugger

In this part of our exercise we'll learn about regular expressions. <br/>
Regular expressions are text matching patterns described with a formal syntax. <br/>
You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. <br/>
They are very useful to find (and replace) text, to extract structured information such as <br/>
e-mails, phone numbers, etc., or for cleaning up text that was entered by humans, and many other applications. 

In Python, regular expressions are available as part of the [`re`](https://docs.python.org/3/library/re.html#module-re) module. <br/>
There are various [good](https://docs.python.org/3/howto/regex.html) [tutorials](https://developers.google.com/edu/python/regular-expressions) [available](https://github.com/tesla809/intro-to-python-jupyter-notebooks/blob/master/47-Regular%20Expressions.ipynb) on which this document is partially based. 

[Go to the beginning of the notebook](#home)
<a id="1"></a>
### 1. the `re` Module

To use the `re` module in Python, you first need to import it. <br/>
As mentioned in the lecture, there are 3 main use cases, plus one object that supports them:
1. Find - mainly using the `search` and `findall` functions.
2. Replace - mainly using the `sub` function.
3. Split - using the `split` function.
4. Match object - this is not a use case, but an object returned mainly by the "find" functions that can be used for further text manipulation.


The basic syntax to search for a match in a string is this: 

```python
match = re.search(pattern, text)
```

Here, `pattern` is the regular expression and `text` is the text that the regular expression is applied to. <br/>
`match` holds the result of the search: a `Match` object if the pattern was found, or `None` otherwise.

[`search()`](https://docs.python.org/3/library/re.html#re.search) returns only the first occurrence of a match; in contrast, [`findall()`](https://docs.python.org/3/library/re.html#re.findall) returns all matches.

Another useful function is [`split()`](https://docs.python.org/3/library/re.html#re.split), which splits a string based on a regex pattern – we'll use all of these functions and others where appropriate. 

Mostly, we'll use search to learn about the syntax, but sometimes we'll use split instead of search to explain a pattern. <br/>
There are other functions which we'll use later.
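
As a quick orientation, here is a minimal sketch of the find, replace, and split use cases listed above. The `sample` string is a made-up input for illustration, not part of the forum data used later, and the `r` prefix on the patterns marks a raw string (explained in section 3).

```python
import re

# Hypothetical sample text, not taken from the forum snippet.
sample = "apples: 3, pears: 12"

# Find: search() returns a Match object for the first match (or None).
first = re.search(r"\d", sample)
print(first.group(0))      # 3

# Find: findall() returns all non-overlapping matches as a list of strings.
digits = re.findall(r"\d", sample)
print(digits)              # ['3', '1', '2']

# Replace: sub() replaces every match with the given string.
masked = re.sub(r"\d", "#", sample)
print(masked)              # apples: #, pears: ##

# Split: split() cuts the string wherever the pattern matches.
parts = re.split(", ", sample)
print(parts)               # ['apples: 3', 'pears: 12']
```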

[Go to the beginning of the notebook](#home)
<a id="1a"></a>
#### 1.a. Simple (syntax) Examples

We'll use a regular expression to demonstrate the syntax and use case: 
```python
'name: \w\w\w\w'
```

We'll use it to extract the names of the people who submitted inquiries to the forum. <br/>
The pattern matches the substring **'name: '** followed by four word characters (letters, digits, or underscore), encoded by **'\w\w\w\w'**. <br/>
Let's start with the import:

In [None]:
import re

# Show the output of every expression in a cell, not only the last one. This lets us condense each trick into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Below you can find a snippet of the comments posted to the course's forum. <br/>
We'll save it into a string variable.

In [None]:
txt="""

name: Dina Ivry
email: dimai@gmail.com
time: 2020-11-02 11:32:11
phone: +972-3-52-3434233
city: Tel-aviv
title: knn  
content: can you explain what does the k hyper-parameter mean???

==============
name: Joseph Katzir
email: joek@myemail.ac.il
time: 2020-12-20 13:34:02
phone: (054) 5444443
city: Tel aviv
title: what a great lecture   
content: avinoam this was one of your best

=============

"""
txt

[Go to the beginning of the notebook](#home)
<a id="1b"></a>
#### 1.b.  the `search` function

One of the most common uses for the re module is finding patterns in text. <br/>
Let's do a quick example of using the search method in the re module to find some text. <br/>
In this case, we'll look for the first names of the people who wrote in the forum, based on the pattern we mentioned before.

In [None]:
# The r prefix (raw string, see section 3) keeps the backslashes exactly as typed
good_pattern = r"name: \w\w\w\w"
no_match_pattern = r"first name: \w\w\w\w"

# Check for a match on the first pattern
if re.search(good_pattern, txt):
    print('Match was found for pattern:', good_pattern)
else:
    print('No Match was found for pattern:', good_pattern)

# Check for a match on the second pattern
if re.search(no_match_pattern, txt):
    print('Match was found for pattern:', no_match_pattern)
else:
    print('No Match was found for pattern:', no_match_pattern)


This is nice: we've seen that `re.search()` takes the pattern, scans the text, and tells us whether it found a match. <br/>
But how can we get the actual text it matched?

To understand this, we introduce the `Match` object. When `search` finds the pattern, <br/>
it returns a `Match` object. If the pattern is not found, `None` is returned. <br/>
To give a clearer picture of this match object, check out the cell below:

In [None]:
match = re.search(good_pattern,  txt)

type(match)

The Match object returned by the search() method is more than just a Boolean or None: <br/>
it contains information about the match, including the original input string, <br/>
the regular expression that was used, and the location of the match. <br/>

Let's see the methods we can use on the match object:

In [None]:
# Show start of match
match.start()

In [None]:
# Show end
match.end()

In [None]:
# show the text that was found
match.group(0)
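
To tie these methods together, here is a small sketch on a made-up one-line input (`line` is an assumption for illustration). It shows that `start()` and `end()` are simply indices into the searched string, and that a failed search must be checked for `None` before any method is called on it.

```python
import re

# Hypothetical one-line sample; the pattern is the same one used above.
line = "name: Dina Ivry"
m = re.search(r"name: \w\w\w\w", line)

if m is not None:                      # search() returns None when nothing matches
    print(m.span())                    # (0, 10), i.e. (m.start(), m.end())
    print(line[m.start():m.end()])     # name: Dina
    print(m.group(0))                  # name: Dina (same text as the slice)

# A failed search gives None, so calling .group() on it would raise AttributeError.
missing = re.search(r"first name: \w\w\w\w", line)
print(missing is None)                 # True
```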

[Go to the beginning of the notebook](#home)
<a id="1c"></a>
#### 1.c.  Finding all instances of a pattern - the `findall` function

You can use `re.findall()` to find all the instances of a pattern in a string. <br/>

For example, if we want to apply the previous pattern (`good_pattern`) on all the posts in the forum:

In [None]:
txt

In [None]:
# Returns a list of all matches
re.findall(good_pattern,txt)

As you can see, it extracted both names from the forum posts. <br/>

However, as we mentioned in the lecture, while the first name is extracted properly,<br/>
only the first 4 characters of the second name are extracted. We will see later how to fix this.

[Go to the beginning of the notebook](#home)
<a id="1d"></a>
#### 1.d.  the  `split` function
Split is another useful function in the `re` module. Let's see how we can split with regular-expression syntax. <br/>

This should look similar to how you used the split() method with strings, <br/>
but instead of a fixed separator you can use regular-expression syntax for more powerful splits. <br/>

We will start with a simple example:

In [None]:
email="myaddress@domain.com"

# Term to split on
split_term = '@'

# Split the phrase
re.split(split_term,email)

This splits the email into exactly the alias and its domain. <br/>
Let's take a look at a more sophisticated example. Consider email aliases. <br/>
They can be in the form of "first.last" or "first-last" or "first_last". <br/>

Your task is to split them into first and last. 

For that we'll make use of a `character set` and the `split` function (more on character sets and `character ranges` in the next section).

In [None]:
names=["first last","first_last","first.last","first-last"]
char_range="[ ._-]"

for name in names:
    print('splitting "{}" into:'.format(name),re.split(char_range,name))
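
For comparison, the sketch below contrasts the string method `split()` with `re.split()` on a made-up alias that mixes two separators (`alias` is a hypothetical input). It also shows the optional `maxsplit` argument of `re.split()`, which the cell above does not use.

```python
import re

alias = "first.middle-last"   # hypothetical alias mixing two separators

# str.split() accepts only one literal separator at a time.
by_dot = alias.split(".")
print(by_dot)                 # ['first', 'middle-last']

# re.split() with a character set splits on any of the listed characters.
by_set = re.split("[ ._-]", alias)
print(by_set)                 # ['first', 'middle', 'last']

# maxsplit limits how many splits are performed.
once = re.split("[ ._-]", alias, maxsplit=1)
print(once)                   # ['first', 'middle-last']
```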

[Go to the beginning of the notebook](#home)
<a id="2"></a>
### 2. Character sets and ranges

Character sets are used when you wish to match any one of a group of characters at a point in the input. <br/>
Brackets are used to construct character set inputs. <br/>

For example: the input **[ab]** searches for occurrences of either a or b.<br/>
As character sets grow larger, typing every character that should (or should not) match could become very tedious. <br/>

A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. 
**The format used is [start-end]**.

A common use case is to search for a specific range of letters in the alphabet; <br/>
for example, [a-f] matches any single letter from a to f.
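
Before moving to the forum text, here is a minimal sketch of sets and ranges on a made-up word (`word` is an assumption for illustration). The last line uses a leading `^` inside the brackets, which negates the set; that is the usual way to express characters that should not match.

```python
import re

word = "cabbage-42"   # hypothetical input

# A set lists the allowed characters one by one.
only_ab = re.findall("[ab]", word)
print(only_ab)        # ['a', 'b', 'b', 'a']

# A range covers every character between its two end points.
a_to_f = re.findall("[a-f]", word)
print(a_to_f)         # ['c', 'a', 'b', 'b', 'a', 'e']

# Several ranges can share one pair of brackets.
hex_like = re.findall("[a-f0-9]", word)
print(hex_like)       # ['c', 'a', 'b', 'b', 'a', 'e', '4', '2']

# A leading ^ negates the set: any character that is NOT listed.
not_a_to_f = re.findall("[^a-f]", word)
print(not_a_to_f)     # ['g', '-', '4', '2']
```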

Let's walk through some examples:

In [None]:
# find all 4-character sequences that start with a capital letter
cap_pattern=r"[A-Z]\w\w\w"

# Returns a list of all matches
re.findall(cap_pattern,txt)

In [None]:
# find all dates in format yyyy-mm-dd
date_pattern="[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"

# Returns a list of all matches
re.findall(date_pattern,txt)

As you can see, this is very powerful! However, it is a bit tedious to write; for that we introduce escape codes:

[Go to the beginning of the notebook](#home)
<a id="3"></a>
### 3. Escape codes
You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more.

For example:

<table class="docutils" border="1">

<thead valign="bottom">
<tr class="row-odd">
<th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even">
<td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd">
<td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even">
<td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd">
<td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even">
<td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric or underscore</td>
</tr>
<tr class="row-odd">
<td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>neither alphanumeric nor underscore</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash (\\). <br/>
Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. <br/>

Using raw strings, created by prefixing the literal value with **r**, for creating regular expressions eliminates this problem and maintains readability. 
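
A minimal sketch of the difference, using a made-up input string: the escaped normal string and the raw string below hold exactly the same characters, but the raw form is easier to read.

```python
import re

escaped = "\\d\\d"   # normal string: each backslash is written twice
raw = r"\d\d"        # raw string: backslashes are kept as typed

same = (escaped == raw)
print(same)          # True, both hold the 4 characters \d\d
print(len(raw))      # 4

found = re.findall(raw, "room 42")   # "room 42" is a hypothetical input
print(found)         # ['42']
```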

Let's take a fresh look at the previous pattern for finding date expressions:

In [None]:
# find all dates in format yyyy-mm-dd
date_pattern=r'\d\d\d\d-\d\d-\d\d'

# Returns a list of all matches
re.findall(date_pattern,txt)
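
The other codes from the table can be tried the same way. The sketch below runs each of them on a short made-up string (`sample` is an assumption for illustration); note that `\w` also matches the underscore.

```python
import re

sample = "id_7 ok!"   # hypothetical input

digits = re.findall(r"\d", sample)
print(digits)         # ['7']

spaces = re.findall(r"\s", sample)
print(spaces)         # [' ']

word_chars = re.findall(r"\w", sample)
print(word_chars)     # ['i', 'd', '_', '7', 'o', 'k']

non_word = re.findall(r"\W", sample)
print(non_word)       # [' ', '!']

non_digits = re.findall(r"\D", sample)
print(non_digits)     # ['i', 'd', '_', ' ', 'o', 'k', '!']

non_space = re.findall(r"\S", sample)
print(non_space)      # ['i', 'd', '_', '7', 'o', 'k', '!']
```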

The **r** prefix on regex strings is probably one of the things <br/>
that makes regex code hard to read at first for someone who is not familiar with regex in Python. <br/>

Hopefully after seeing these examples this syntax will become clear. 

In [None]:
txt2=r"I will eat 1\2\3 oranges"

# Returns the number of oranges I will eat
eat_pattern=r"\d\\\d\\\d"


re.findall(eat_pattern,txt2)

[Go to the beginning of the notebook](#home)
<a id="4"></a>
### 4. Or expression

We can use the pipe `|` to define an "or" between regular expressions; the pattern matches if any one of the alternatives matches:

In [None]:
weekdays = "We could meet Monday or Wednesday"
pattern = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
re.findall(pattern , weekdays)
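
The alternatives do not have to be fixed words; each side of the pipe can be a regular expression of its own. A minimal sketch, assuming a made-up line modeled on the forum snippet (`stamp` and `date_or_time` are hypothetical names):

```python
import re

stamp = "time: 2020-11-02 11:32"   # hypothetical line modeled on the forum snippet

# Either a yyyy-mm-dd date OR an hh:mm time.
date_or_time = r"\d\d\d\d-\d\d-\d\d|\d\d:\d\d"

found = re.findall(date_or_time, stamp)
print(found)   # ['2020-11-02', '11:32']
```

The pipe has the lowest precedence of the regex operators, so here everything to its left forms one alternative and everything to its right forms the other.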

[Go to the beginning of the notebook](#home)
<a id="5"></a>
### 5. Useful tool
In some cases, you may want a regexp debugger. <br/>
Debugging a regular expression is not always easy, but [this](https://regex101.com/) tool can be useful in some of those cases.

[Go to the beginning of the notebook](#home)
<a id="6"></a>
### 6. Optional Exercises:
Optionally sharpen your skills with [regular expression self exercises notebook](Ex09-RegularExpression-Exercises.ipynb).<br/>
The following exercises are relevant for now:
* Exercise 1
* Exercise 2

**After you are done**<br/>
You should continue to the next regular expression notebook:<br/>
[regular expressions - advanced notebook](Ex09_RegularExpressions_adv.ipynb)