<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Regular Expressions - mini-lesson

---

<a id="learning-objectives"></a>
### Learning Objectives
*After this lesson, you will be able to:*
- Define regular expressions.
- Use regular expressions to match text.
- Demonstrate how to use capturing and non-capturing groups.
- Use regex with Python and pandas.

In [1]:
from IPython.core.display import Image
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

<a id="most-famous-quote-in-regex-dom"></a>
## The Most Famous Quote in `regex-dom`

>"Some people, when confronted with a problem, think 
'I know, I'll use regular expressions.'  Now they have two problems." — Jamie Zawinski (Netscape engineer)

<a id="so-what-does-a-regular-expression-look-like"></a>
## So, What Does a Regular Expression Look Like?

## ```/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/```



<img src="assets/regex4.png">

<a id="the-history-of-regular-expressions"></a>
## The History of Regular Expressions

---
A regular expression is a **sequence of characters that defines a search pattern**. This pattern is usually then used by string-searching algorithms for "find" or "find and replace" operations on strings.


Regular expressions and neural nets have a common ancestry in the work of McCulloch and Pitts (1943) and their attempt to computationally represent a model of a neuron.

This work was picked up by Stephen Kleene (Mr. \*) and developed further into the idea of regular expressions. The idea was then popularized by its inclusion in Unix in the 1970s, in the form of [**grep**](http://opensourceforu.com/2012/06/beginners-guide-gnu-grep-basics-regular-expressions/) (short for "global regular expression print"). Its inclusion in Perl in the 1980s cemented its popularity.

Here's [the story of Walter Pitts](http://nautil.us/issue/21/information/the-man-who-tried-to-redeem-the-world-with-logic).

<a id="where-is-regex-implemented"></a>
## Where Is `regex` Implemented?

---

Regular expressions can be run in many places: your text editor, the `bash` shell, `Python`, and even `SQL`. Support for them is typically baked into the standard library of a programming language.



<a id="exploring-regex"></a>
## Exploring `regex`

---

The web app [RegExr](http://regexr.com/) is an excellent tool for exploring `regex`. I recommend bookmarking it.

For the rest of the lesson, we are going to use it to test our patterns. Open [RegExr](http://regexr.com/) in another tab.

#### Next: in the `Expression` subsection, delete everything.

You should see something like this: 

![image.png](attachment:image.png)

### Now copy the following text into the `Text` subsection of the RegExr website (linked above):

```
1. This is a string

2. That is also a string

3. This is an illusion

4. THIS IS LOUD

that isn't thus

bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab

6. tHiS	iS	CoFu SEd

777. THIS IS 100%-THE-BEST!!!

8888. this_is_a_fiiile.py

hidden bob
```

You should now see something like this:

![image.png](attachment:image.png)

<br>

**Note:** Let's leave the explanation of `//g` aside for the moment.

<a id="basic-regular-expression-syntax"></a>
## Let's Go Through Basic Regular Expression Syntax
---

<a id="literals"></a>
### Literals

Literals are essentially just what you think of as characters in a string. For example:

```
a
b
c
X
Y
Z
1
5
``` 

These are all considered literals.

_Enter: `T` in the Expression subsection_

![image.png](attachment:image.png)

<a id="character-classes"></a>
### Character Classes `[ ]`

A character class is a set of characters matched as an "or."

```
[io]
```

So, this class would run as "match either i or o."

You can include as many characters as you like in between the brackets.

Character classes match only a single character.
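
The same idea can be tried in Python. This is a minimal sketch using the standard library `re` module; the sample string is made up for illustration:

```python
import re

sample = "hip hop"  # made-up sample text

# [io] matches exactly one character: either "i" or "o".
single_chars = re.findall(r"[io]", sample)
print(single_chars)  # ['i', 'o']

# Surrounded by literals, the class still fills only one position.
words = re.findall(r"h[io]p", sample)
print(words)  # ['hip', 'hop']
```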

<a id="character-classes-can-also-accept-certain-ranges"></a>
### Character Classes Can Also Accept Certain Ranges

For example, the following will all work:
    
```
[a-f]
[a-z]
[A-Z]
[a-zA-Z]
[1-4]
[a-c1-3]
```

_Enter: `[Ths]`  in the Expression subsection_

![image.png](attachment:image.png)
   


**Check it out: why is only one character highlighted at a time?**

<a id="character-class-negation"></a>
### Character Class Negation

We can also add **negation** to character classes. For example:

```
[^a-z]
```

This means match *ANYTHING* that is *NOT* `a` through `z`.
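
To see ranges and negation side by side, here is a small sketch with the standard library `re` module. The sample string is an assumption chosen to mix upper case, lower case, digits, and a space:

```python
import re

sample = "Ab1 cD2"  # made-up sample text

# A range: any single lowercase letter.
lowercase = re.findall(r"[a-z]", sample)
print(lowercase)  # ['b', 'c']

# The negated range: any single character that is NOT a lowercase letter.
# Note that the space is matched too.
not_lowercase = re.findall(r"[^a-z]", sample)
print(not_lowercase)  # ['A', '1', ' ', 'D', '2']

# Several ranges can share one class.
upper_or_digit = re.findall(r"[A-Z0-9]", sample)
print(upper_or_digit)  # ['A', '1', 'D', '2']
```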

<a id="exercise-"></a>
## Exercise #1

---

<a id="what-happens-if-we-put-two-character-class-brackets-back-to-back"></a>
### What Happens If We Put Two Character Class Brackets Back to Back?

Using RegExr and the text snippet from earlier, match **"That", "that"**, and **"thus"** — but not **"This"** and **"this"** — using the following:
- One literal
- Two character classes (no negation)
- One negation in a character class

In [None]:
# A:            

<a id="shorthand-for-character-classes"></a>
## Shorthand for Character Classes
---

```
\w - Matches word characters (includes digits and underscores)
\W - Matches what \w doesn't — non-word characters
\d - Matches all digit characters
\D - Matches all non-digit characters
\s - Matches whitespace (including tabs)
\S - Matches non-whitespace
\n - Matches new lines
\r - Matches carriage returns
\t - Matches tabs
```

These can also be placed into brackets like so:

```
[\d\t]
[^\d\t]
```

#### Go ahead and try those out with the same example
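
If you prefer to experiment in Python, the sketch below runs a few of the shorthand classes through the standard library `re` module. The sample string is made up and deliberately contains an underscore, a tab, and a space:

```python
import re

sample = "a_1\t b!"  # made-up sample: letter, underscore, digit, tab, space, letter, "!"

word_chars = re.findall(r"\w", sample)
print(word_chars)  # ['a', '_', '1', 'b']

non_word_chars = re.findall(r"\W", sample)
print(non_word_chars)  # ['\t', ' ', '!']

whitespace = re.findall(r"\s", sample)
print(whitespace)  # ['\t', ' ']

# Shorthand classes also work inside brackets.
digit_or_tab = re.findall(r"[\d\t]", sample)
print(digit_or_tab)  # ['1', '\t']
```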

<a id="special-characters"></a>
## Special Characters
---

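Certain characters have a special meaning in a pattern. To match them literally, escape them with a backslash: "`\`."

These include the following:

```
.
?
*
+
^
$
|
\
{
}
(
)
[
]
```

Inside a character class, `-` (ranges) and a leading `^` (negation) are also special. Characters such as `&`, `<`, and `>` are ordinary characters in Python's default regex syntax and do not need escaping.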
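
As an illustration of why escaping matters, the sketch below compares an unescaped and an escaped dot using the standard library `re` module. The sample string is made up; `re.escape` is a standard helper that adds the backslashes for you:

```python
import re

sample = "file.py filexpy"  # made-up sample text

# Unescaped, "." is a special character that matches any character.
unescaped = re.findall(r"file.py", sample)
print(unescaped)  # ['file.py', 'filexpy']

# Escaped, "\." matches only a literal dot.
escaped = re.findall(r"file\.py", sample)
print(escaped)  # ['file.py']

# re.escape() escapes special characters in a literal string automatically.
auto_escaped = re.findall(re.escape("file.py"), sample)
print(auto_escaped)  # ['file.py']
```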

<a id="exercise-2"></a>
## Exercise #2

---

Use RegExr and our text snippet to match all digits. Do this three ways:

```
- First, with character classes
- Second, with the shorthand character classes
- Third, using only negation
```

In [None]:
# A:

<a id="the-dot"></a>
## The Dot

---

The dot (`.`) matches any single character except a newline (unless the "dot all" flag is enabled).
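
A short sketch of the dot in Python, using the standard library `re` module and a made-up sample string that contains a newline. The `re.DOTALL` flag is what lets the dot match newlines as well:

```python
import re

sample = "bob b\nb"  # made-up sample: the second "b...b" is split by a newline

# By default, "." does not match a newline.
default_matches = re.findall(r"b.b", sample)
print(default_matches)  # ['bob']

# With re.DOTALL, "." matches newlines too.
dotall_matches = re.findall(r"b.b", sample, flags=re.DOTALL)
print(dotall_matches)  # ['bob', 'b\nb']
```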

<a id="anchors"></a>
## Anchors

---

Anchors are used to denote the start and end of a line.

```
^ - Matches the start of the line
$ - Matches the end of the line
```

Example:

```
^Now - Matches the word "Now" when it occurs at the beginning of a line.  
country$ - Matches the word "country" when it occurs at the end of a line.
```
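
One thing to watch for in Python: by default, `^` and `$` anchor to the start and end of the whole string, and only anchor to each line when the `re.MULTILINE` flag is passed. The sketch below shows the difference with the standard library `re` module and a made-up three-line string:

```python
import re

sample = "Now is the time\nfor all good men\nNow and then"  # made-up sample text

# Without a flag, ^ only matches at the very start of the string.
start_default = re.findall(r"^Now", sample)
print(start_default)  # ['Now']

# With re.MULTILINE, ^ matches at the start of every line.
start_multiline = re.findall(r"^Now", sample, flags=re.MULTILINE)
print(start_multiline)  # ['Now', 'Now']

# "men" ends the second line, not the string, so $ needs re.MULTILINE to find it.
end_default = re.findall(r"men$", sample)
end_multiline = re.findall(r"men$", sample, flags=re.MULTILINE)
print(end_default, end_multiline)  # [] ['men']
```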

<a id="exercise-3"></a>
## Exercise #3

---

Use an anchor and a character class to find the **bab** and the **bob** at the end of the line, but not elsewhere.

In [3]:
# A:

### Using Regex with Python:

Run the following cell to load the `text` variable.

In [48]:
text = """1. This is a string

2. That is also a string

3. This is an illusion

4. THIS IS LOUD

that isn't thus

bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab

6. tHiS    iS    CoFu SEd

777. THIS IS 100%-THE-BEST!!!

8888. this_is_a_fiiile.py

hidden bob"""

**With Python, you can use regex in two different ways.**

First, you need to install and import the `regex` module.

**Note:** We are going to use the third-party `regex` module (available from PyPI). Its API is compatible with the standard library `re` module (built into Python), but it offers additional functionality.

In [61]:
# !pip install regex
import regex


Then either pass the pattern straight to a `regex` module function:

In [62]:
regex.findall('[Tt]h[^i][st]', text)

['That', 'that', 'thus']

or compile the pattern first (cleaner, especially when the pattern is reused):

In [64]:
pattern = regex.compile(r'[Tt]h[^i][st]')
pattern.findall(text)

['That', 'that', 'thus']

Both approaches return exactly the same output.

<a id="exercise-4"></a>
## Exercise #4

___

Using Python's `regex` module, select the 'CoFu' term in your `text` variable.

In [53]:
# A:


<a id="quantifiers"></a>
## Quantifiers

---

Quantifiers adjust how many times the preceding item is matched.

```
* - Zero or more
+ - One or more
? - Zero or one
{n} - Exactly 'n' number
{n,} - Matches 'n' or more occurrences
{n,m} - Between 'n' and 'm'
```
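
The table above can be checked one quantifier at a time. This is a minimal sketch with the standard library `re` module; the sample string is made up so that the number of `a` characters varies from zero to three:

```python
import re

sample = "ct cat caat caaat"  # made-up sample text

# Each pattern applies one quantifier to the letter "a".
results = {
    "ca*t": re.findall(r"ca*t", sample),          # zero or more
    "ca+t": re.findall(r"ca+t", sample),          # one or more
    "ca?t": re.findall(r"ca?t", sample),          # zero or one
    "ca{2}t": re.findall(r"ca{2}t", sample),      # exactly two
    "ca{2,}t": re.findall(r"ca{2,}t", sample),    # two or more
    "ca{1,2}t": re.findall(r"ca{1,2}t", sample),  # between one and two
}

for pattern, matches in results.items():
    print(pattern, matches)
# ca* t  -> ['ct', 'cat', 'caat', 'caaat']
# ca+ t  -> ['cat', 'caat', 'caaat']
# ca? t  -> ['ct', 'cat']
# ca{2}t -> ['caat']
# ca{2,}t -> ['caat', 'caaat']
# ca{1,2}t -> ['cat', 'caat']
```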

<a id="greedy-matching"></a>
## Greedy Matching

---


By default, quantifiers are **greedy**, and `.*` is the greediest of all: it matches as many characters as possible (i.e., the longest match).
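
The effect is easiest to see next to a non-greedy (lazy) quantifier, which is written by adding `?` after the quantifier. The sketch below uses the standard library `re` module and a made-up snippet of markup:

```python
import re

sample = "<b>bold</b> and <i>italic</i>"  # made-up sample text

# Greedy: .* runs to the LAST ">" it can reach, so there is one long match.
greedy = re.findall(r"<.*>", sample)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? stops at the FIRST ">" it can reach, so each tag is its own match.
lazy = re.findall(r"<.*?>", sample)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```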

<a id="exercise-5"></a>
## Exercise #5
---

1. Find **bob**, but only if it occurs three times in a row without any spaces.
2. Find **bob** if it occurs twice in a row, with or without spaces.

In [None]:
# A:

<a id="groups-and-capturing"></a>
## Groups and Capturing

---

In `regex`, parentheses (`()`) denote groupings. These groups can then be quantified.

Additionally, these groups can be designated as either "capture" or "non-capture."

To mark a group as a capture group, just put it in parentheses: `(match_phrase)`.

To mark it as a non-capture group, punctuate it like so: `(?:match_phrase)`.


For example:

In [72]:
url = 'https://stackoverflow.com/questions/tagged/regex'
pattern = regex.compile(r'(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?')
regex.findall(pattern, url)


[('https', 'stackoverflow.com', '/questions/tagged/regex')]
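
For comparison, here is a sketch of the same URL pattern with the scheme turned into a non-capture group, so it no longer appears in the results. It uses the standard library `re` module, whose API the `regex` module mirrors; the `"hahaha ha"` string is a made-up example of quantifying a group:

```python
import re

url = "https://stackoverflow.com/questions/tagged/regex"

# (?:...) still groups "https?|ftp" for the alternation, but does not capture it.
pattern = re.compile(r"(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?")
captured = pattern.findall(url)
print(captured)  # [('stackoverflow.com', '/questions/tagged/regex')]

# A match object exposes the whole match as group 0 and each capture by number.
match = pattern.search(url)
print(match.group(0))  # https://stackoverflow.com/questions/tagged/regex
print(match.group(1))  # stackoverflow.com

# Groups can be quantified: "ha" repeated exactly three times.
repeated = re.findall(r"(?:ha){3}", "hahaha ha")
print(repeated)  # ['hahaha']
```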

<a id="alternation"></a>
## Alternation

---

The pipe character (`|`) denotes an OR relation, much like `or` in Python.

For example, `(bob|bab)` or `(b(o|a)b)`.
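
Both forms can be compared in Python. The sketch below uses the standard library `re` module and a made-up sample string; note how `findall` changes its output once the alternation sits inside a capture group:

```python
import re

sample = "bob bab bib"  # made-up sample text

# Alternation between two whole words.
whole_words = re.findall(r"bob|bab", sample)
print(whole_words)  # ['bob', 'bab']

# Alternation limited to the middle letter, inside a non-capture group.
middle_letter = re.findall(r"b(?:o|a)b", sample)
print(middle_letter)  # ['bob', 'bab']

# With a capture group, findall returns only the captured letter.
captured_letter = re.findall(r"b(o|a)b", sample)
print(captured_letter)  # ['o', 'a']
```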

<a id="lookahead"></a>
## Lookahead
---

There are two types of lookaheads: positive and negative.

```
(?=match_text) - A positive lookahead says, "only match the current pattern if it is followed by another pattern."
(?!match_text) - A negative lookahead says the opposite.

Examples:
- that(?=guy) - Only match "that" if it is followed by "guy."
- these(?!guys) - Only match "these" if it is NOT followed by "guys."
```

In [79]:
text_ = 'that girl and that guy'
pattern = regex.compile(r'that(?=.guy)')
regex.findall(pattern, text_)

['that']
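
A negative lookahead works the same way in reverse. The sketch below uses the standard library `re` module and a made-up sentence; `finditer` is used so that the position of each match is visible, which shows that the lookahead text itself is not part of the match:

```python
import re

sample = "these guys and these girls"  # made-up sample text

# Match "these" only when it is NOT followed by " guys".
negative = [(m.group(), m.start()) for m in re.finditer(r"these(?! guys)", sample)]
print(negative)  # [('these', 15)]

# Match "these" only when it IS followed by " girls"; the lookahead is not consumed.
positive = re.search(r"these(?= girls)", sample)
print(positive.group(), positive.span())  # these (15, 20)
```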

<a id="exercise-6"></a>
## Exercise #6
---

1. Match **bob** only if it is followed by "_".
2. Match **bob** if it is followed by "_" or a new line character (Hint: How do we specify "or" in `regex`?).
3. Match **bob** only if it isn't followed by a space or a new line character.

In [54]:
# A:

<a id="regex-in-python-and-pandas"></a>
## Regex in Python and `pandas`

---

Let's practice working with `regex` in Python and `pandas` using the string below.

In [32]:
my_string = """
I said a hip hop,
The hippie, the hippie,
To the hip, hip hop, and you don't stop, a rock it
To the bang bang boogie, say, up jump the boogie,
To the rhythm of the boogie, the beat.
"""

You can find the documentation for this variable [here](https://www.youtube.com/watch?v=wLzwSqKNkVU).

<a id="regex-findall-method"></a>
### `regex`'s `.findall()` Method

In [80]:
mo = regex.findall('h[io]p', my_string)

In [81]:
mo

['hip', 'hop', 'hip', 'hip', 'hip', 'hip', 'hop']

In [82]:
# `.findall()` will return only the capture groups, if included.
mo = regex.findall('h([io])p', my_string)

In [83]:
mo

['i', 'o', 'i', 'i', 'i', 'i', 'o']
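
If you need both the full match and the captured part, one option is `finditer`, which yields match objects instead of strings. This is a sketch with the standard library `re` module on a shortened, made-up version of the lyrics:

```python
import re

lyrics = "I said a hip hop"  # shortened sample text

# A non-capture group keeps the grouping but makes findall return whole matches.
whole = re.findall(r"h(?:[io])p", lyrics)
print(whole)  # ['hip', 'hop']

# finditer gives access to the whole match (group 0) and the capture (group 1).
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r"h([io])p", lyrics)]
print(pairs)  # [('hip', 'i'), ('hop', 'o')]
```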

<a id="using-pandas"></a>
### Using `pandas`

In [45]:
fish = pd.Series(['onefish', 'twofish','redfish', 'bluefish'])
fish

0     onefish
1     twofish
2     redfish
3    bluefish
dtype: object

<a id="strcontains"></a>
### `str.contains`

In [46]:
# Get all fish that start with "b."
fish[fish.str.contains('^b')]

3    bluefish
dtype: object
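
`str.contains` accepts any of the syntax covered earlier. The sketch below rebuilds the same `fish` Series and filters it with a character class, an end-of-string anchor, and the `case` parameter; the specific patterns are just illustrative choices:

```python
import pandas as pd

fish = pd.Series(["onefish", "twofish", "redfish", "bluefish"])

# Keep entries where "fish" at the end is preceded by "e" or "o".
e_or_o = fish[fish.str.contains(r"[eo]fish$")]
print(e_or_o.tolist())  # ['onefish', 'twofish', 'bluefish']

# case=False makes the match case-insensitive.
blue = fish[fish.str.contains("BLUE", case=False)]
print(blue.tolist())  # ['bluefish']
```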

<a id="strextract"></a>
### `str.extract`

In [47]:
# `.extract()` maps capture groups to new Series.
fish.str.extract('(.*)fish', expand=False)

0     one
1     two
2     red
3    blue
dtype: object
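
With more than one capture group, `str.extract` returns a DataFrame with one column per group. In the sketch below the groups are named with `(?P<name>...)` so the columns get readable labels; the names `adjective` and `noun` are made up for this example:

```python
import pandas as pd

fish = pd.Series(["onefish", "twofish", "redfish", "bluefish"])

# Two named capture groups become two DataFrame columns.
parts = fish.str.extract(r"(?P<adjective>.*)(?P<noun>fish)")
print(parts)

print(parts["adjective"].tolist())  # ['one', 'two', 'red', 'blue']
```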

<a id="independent-practice"></a>
## Independent Practice
---

Pull up the following tutorials for regular expressions in Python. 

- [TutorialsPoint](http://www.tutorialspoint.com/python/python_reg_expressions.htm)  
- [Google Regex Tutorial](https://developers.google.com/edu/python/regular-expressions) (findall)

In the cells below, import Python's `regex` library and experiment with matching on the string.

Try out some of the following:
- Match with and without case sensitivity.
- Match using word boundaries (try "bob").
- Use positive and negative lookaheads.
- Experiment with the multi-line flag.
- Try matching the second or third instance of a repetitive pattern ("ab" or "bob," for example).
- Try using `regex.sub` (`re.sub` in the standard library) to replace a matching string.
- Note the difference between `search` and `match`.
- What happens to the order of groups if they are nested?

In [14]:
test = """
1. This is a string

2. That is also a string

3. This is an illusion

4. THIS IS LOUD

that isn't thus

bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab

6. tHiS	iS	CoFu SEd

777. THIS IS 100%-THE-BEST!!!

8888. this_is_a_fiiile.py

hidden bob

"""

<a id="extra-practice"></a>
## Extra Practice

---

Pull up the [Regex Golf](http://regex.alf.nu/) website and solve as many as you can!

If you get bored, try [Regex Crossword](https://regexcrossword.com/).