In [1]:
from __future__ import division

from IPython.core.display import Image

import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
%matplotlib inline

```
1. This is a string

2. That is also a string

3. This is an illusion

4. THIS IS LOUD

that isn't thus

bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab

6. tHiS	iS	CoFu SEd

777. THIS IS 100%-THE-BEST!!!

8888. this_is_a_fiiile.py

hidden bob
```

<a id="where-is-regex-implemented"></a>
## Where Is `regex` Implemented?

---

Regular expressions can be run in many places: your text editor, the `bash` shell, Python, and even SQL. Most programming languages include `regex` support in their standard library.

In Python, it can be imported like so:

```python
import re
```

<a id="basic-regular-expression-syntax"></a>
## Basic Regular Expression Syntax
---

<a id="literals"></a>
### Literals

Literals are essentially just what you think of as characters in a string. For example:

```
a
b
c
X
Y
Z
1
5
100
``` 

These are all considered literals.

<a id="character-classes"></a>
### Character Classes

A character class is a set of characters matched as an "or."

```
[io]
```

So, this class would run as "match either i or o."

You can include as many characters as you like in between the brackets.

Character classes match only a single character.
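
As a quick illustration of the "one character, any of these" behavior described above, here is a minimal sketch using Python's `re` module. The sample strings are made up for this example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "This is an illusion"

# [io] matches exactly one character that is either 'i' or 'o'.
single_chars = re.findall(r"[io]", sample)
print(single_chars)  # ['i', 'i', 'i', 'i', 'o']

# Inside a longer pattern the class still consumes a single character:
# 'h', then 'i' or 'o', then 'p'.
words = re.findall(r"h[io]p", "hip hop hap hiip")
print(words)  # ['hip', 'hop']
```

Note that `hiip` is not matched, because the class stands for one character, not a run of them.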

<a id="character-classes-can-also-accept-certain-ranges"></a>
### Character Classes Can Also Accept Certain Ranges

For example, the following will all work:
    
```
[a-f]
[a-z]
[A-Z]
[a-zA-Z]
[1-4]
[a-c1-3]
```
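
To see a few of these ranges in action, the sketch below runs them through `re.findall`. The sample string is an assumption made up for this example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "bob_42 Ralph-7"

# Any single lowercase letter.
lowercase = re.findall(r"[a-z]", sample)
print(lowercase)  # ['b', 'o', 'b', 'a', 'l', 'p', 'h']

# Any single digit from 1 through 4 (so '7' is skipped).
digits = re.findall(r"[1-4]", sample)
print(digits)  # ['4', '2']

# Ranges can be combined: a through c, or 1 through 3.
mixed = re.findall(r"[a-c1-3]", sample)
print(mixed)  # ['b', 'b', '2', 'a']
```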

<a id="character-class-negation"></a>
### Character Class Negation

We can also add **negation** to character classes. For example:

```
[^a-z]
```

This means match *ANYTHING* that is *NOT* `a` through `z`.
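
A small sketch of negation in Python follows. The sample string is hypothetical. It also shows that `^` only negates when it is the first character inside the brackets.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "tHiS iS 100%"

# Every single character that is NOT a lowercase letter,
# including spaces, digits, and punctuation.
not_lower = re.findall(r"[^a-z]", sample)
print(not_lower)  # ['H', 'S', ' ', 'S', ' ', '1', '0', '0', '%']

# When ^ is not first, it is just a literal caret inside the class.
literal_caret = re.findall(r"[a^z]", "a^z")
print(literal_caret)  # ['a', '^', 'z']
```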

### Exercise #1

#### Solution

`[Tt]h[^i][st]`

**Solution Breakdown:**  

`[Tt]` = _'T' or 't'_              
`h`    = _'h'_                      
`[^i]` = *Anything that is _not_ 'i'*  
`[st]` = _'s' or 't'_               
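
To check the breakdown above, here is a minimal sketch that runs the solution pattern against a few words. The word list is an assumption for illustration and is not the exercise's original text.

```python
import re

# Hypothetical words used to probe each part of the pattern.
words = "This That thus then"

# 'T' or 't', then 'h', then any character except 'i', then 's' or 't'.
matches = re.findall(r"[Tt]h[^i][st]", words)
print(matches)  # ['That', 'thus']
```

"This" fails on `[^i]`, and "then" fails on `[st]`.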

#### Exercise #2

1. `[0-9]`
2. `\d`
3. `[^\D]` **or** `[^a-zA-Z\s\%\'!\-\._]`  
>_The latter option of solution #3 is specific to our text block, as we explicitly specify the special characters to exclude._
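
The sketch below compares the first three solutions on one line of the text block. In Python 3, `\d` on a `str` pattern also matches non-ASCII decimal digits, so it is only equivalent to `[0-9]` on ASCII text (or when the `re.ASCII` flag is set).

```python
import re

sample = "777. THIS IS 100%-THE-BEST!!!"

# All three patterns pick out the same ASCII digits here.
patterns = [r"[0-9]", r"\d", r"[^\D]"]
results = [re.findall(p, sample) for p in patterns]
print(results[0])  # ['7', '7', '7', '1', '0', '0']
print(results[0] == results[1] == results[2])  # True

# \d also matches Unicode digits such as ARABIC-INDIC DIGIT THREE.
unicode_digit = "\u0663"
print(len(re.findall(r"\d", unicode_digit)))                  # 1
print(len(re.findall(r"[0-9]", unicode_digit)))               # 0
print(len(re.findall(r"\d", unicode_digit, flags=re.ASCII)))  # 0
```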

<a id="exercise-3"></a>
## Exercise #3

---

Use an anchor and a character class to find the **bab** and the **bob** at the end of the line, but not elsewhere.

#### Exercise #3

`b[oa]b$`
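
When trying this in Python, keep in mind that `$` only matches at the end of the whole string unless the `re.MULTILINE` flag is set. A minimal sketch, using three lines taken from the text block:

```python
import re

lines = "bob this is bob\nababababab\nhidden bob"

# Without re.MULTILINE, $ anchors to the end of the string only.
end_of_string = re.findall(r"b[oa]b$", lines)
print(end_of_string)  # ['bob']

# With re.MULTILINE, $ also matches just before each newline.
end_of_each_line = re.findall(r"b[oa]b$", lines, flags=re.MULTILINE)
print(end_of_each_line)  # ['bob', 'bab', 'bob']
```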

#### Exercise #4

<a id="exercise-4"></a>
## Exercise #5
---

1. Find **bob**, but only if it occurs three times in a row without any spaces.
2. Find **bob** if it occurs twice in a row, with or without spaces.

#### Exercise #5
1. `(bob){3}`
2. `(bob)( )?(bob)` **or**  `(bob ?){2}`
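
Here is a rough sketch of both solutions in Python, run on one line from the text block. It uses non-capturing groups, `(?:...)`, because `re.findall` would otherwise return only the captured group rather than the full match.

```python
import re

line = "bob bob_ ralph_ bobbobbobbybobbob"

# "bob" three times in a row, with no spaces.
triple = re.findall(r"(?:bob){3}", line)
print(triple)  # ['bobbobbob']

# "bob" twice in a row, with an optional space after each one.
double = re.findall(r"(?:bob ?){2}", line)
print(double)  # ['bob bob', 'bobbob', 'bobbob']
```

Matches do not overlap: after the first `bobbob` is consumed, the third `bob` in `bobbobbob` is followed by `by`, so it does not start a new match.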

<a id="groups-and-capturing"></a>
## Groups and Capturing

---

In `regex`, parentheses (`()`) denote groupings. These groups can then be quantified.

Additionally, these groups can be designated as either "capture" or "non-capture."

To mark a group as a capture group, just put it in parentheses: `(match_phrase)`.

To mark it as a non-capture group, punctuate it like so: `(?:match_phrase)`.
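
The difference shows up in what a match object reports. The following is a minimal sketch with a made-up sample string: both patterns match the same text, but only capture groups appear in `.groups()`.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "bobbob ralph"

# Two capture groups: a repeated group keeps its last repetition.
capturing = re.search(r"(bob)+ (ralph)", sample)
print(capturing.group())   # 'bobbob ralph'
print(capturing.groups())  # ('bob', 'ralph')

# The first group is non-capturing, so only 'ralph' is captured.
non_capturing = re.search(r"(?:bob)+ (ralph)", sample)
print(non_capturing.group())   # 'bobbob ralph'
print(non_capturing.groups())  # ('ralph',)
```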


### Exercise #6

1. `(bob)(?=_)`
2. `(bob)(?=_|\n)`
3. `(bob)(?!( |\n))`
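
The patterns above rely on lookaheads: `(?=...)` requires that something follows, `(?!...)` requires that it does not, and neither consumes the characters it checks. A small sketch, using a hypothetical sample string and simplified versions of the patterns:

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "bob bob_ bobby"

# Positive lookahead: "bob" only when the next character is "_".
followed_by_underscore = [m.start() for m in re.finditer(r"bob(?=_)", sample)]
print(followed_by_underscore)  # [4]

# Negative lookahead: "bob" only when the next character is NOT a space.
not_followed_by_space = [m.start() for m in re.finditer(r"bob(?! )", sample)]
print(not_followed_by_space)  # [4, 9]

# The lookahead itself is not part of the match.
print(re.search(r"bob(?=_)", sample).group())  # 'bob'
```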

<a id="regex-in-python-and-pandas"></a>
## Regex in Python and `pandas`

---

Let's practice working with `regex` in Python and `pandas` using the string below.

In [8]:

my_string = """
I said a haap hop hip,
The hippie, the hippie,
To the hip, hip hop, and you don't stop, a rock it
To the bang bang boogie, say, up jump the boogie,
To the rhythm of the boogie, the beat.
"""

In [9]:
# Import Python's built-in `re` module and `pandas`.
import re
import pandas as pd

<a id="regex-search-method"></a>
### The `re.search()` Function

In [10]:
# `re.search()` returns a match object for the first match, or `None` if there is none.
mo = re.search(r'h([aousi])p', my_string)  # 'h', then one of a/o/u/s/i (captured), then 'p'

In [11]:
# Everything that matches the expression:
mo.group()

'hop'

In [12]:
# The first capture group (like $1 in other regex flavors):
mo.group(1)

'o'

<a id="regex-findall-method"></a>
### The `re.findall()` Function

In [13]:
mo = re.findall('h[aio]p', my_string)

In [14]:
mo

['hop', 'hip', 'hip', 'hip', 'hip', 'hip', 'hop']

In [15]:
# `.findall()` will return only the capture groups, if included.
mo = re.findall('h([iao])p', my_string)

In [16]:
mo

['o', 'i', 'i', 'i', 'i', 'i', 'o']

<a id="using-pandas"></a>
### Using `pandas`

In [17]:
fish = pd.Series(['onefish', 'twofish','redfish', 'bluefish'])
fish

0     onefish
1     twofish
2     redfish
3    bluefish
dtype: object

<a id="strcontains"></a>
### `str.contains`

In [18]:
# Get all fish that start with "b."
fish[fish.str.contains('^b')]

3    bluefish
dtype: object

<a id="strextract"></a>
### `str.extract`

In [19]:
# `.extract()` maps capture groups to new Series.
fish.str.extract('(.*)fish', expand=False)

0     one
1     two
2     red
3    blue
dtype: object

<a id="independent-practice"></a>
## Independent Practice
---

Pull up the following tutorials for regular expressions in Python. 

- [Tutorialspoint](http://www.tutorialspoint.com/python/python_reg_expressions.htm)  
- [Google Regex Tutorial](https://developers.google.com/edu/python/regular-expressions) (findall)

In the cells below, import Python's `re` module and experiment with matching on the string.

Try out some of the following:
- Match with and without case sensitivity.
- Match using word boundaries (try "bob").
- Use positive and negative lookaheads.
- Experiment with the multi-line flag.
- Try matching the second or third instance of a repetitive pattern ("ab" or "bob," for example).
- Try using `re.sub` to replace a matching string.
- Note the difference between `search` and `match`.
- What happens to the order of groups if they are nested?
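
As a starting point for several of these prompts, here is a minimal sketch. It uses a short hypothetical sample rather than the full `test` string, so you can still explore the real text on your own.

```python
import re

# Hypothetical sample text, not the notebook's `test` string.
sample = "hidden bob\nBob bobby"

# Case sensitivity: re.IGNORECASE also matches "Bob".
case_sensitive = re.findall(r"bob", sample)
case_insensitive = re.findall(r"bob", sample, flags=re.IGNORECASE)
print(case_sensitive, case_insensitive)

# Word boundaries: \b keeps "bob" from matching inside "bobby".
whole_words = re.findall(r"\bbob\b", sample, flags=re.IGNORECASE)
print(whole_words)

# Multi-line flag: with re.MULTILINE, ^ matches at the start of every line.
first_word_only = re.findall(r"^\w+", sample)
first_word_each_line = re.findall(r"^\w+", sample, flags=re.MULTILINE)
print(first_word_only, first_word_each_line)

# search vs. match: match() only tries the very start of the string.
match_result = re.match(r"bob", sample)
search_result = re.search(r"bob", sample)
print(match_result, search_result.start())

# Nested groups are numbered by the position of their opening parenthesis.
nested = re.search(r"((b)(o))b", "bob").groups()
print(nested)

# re.sub replaces every match with the replacement string.
replaced = re.sub(r"\bbob\b", "ralph", sample)
print(repr(replaced))
```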

In [77]:
test = """
1. This is a string

2. That is also a string

3. This is an illusion

4. THIS IS LOUD

that isn't thus

bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab

6. tHiS	iS	CoFu SEd

777. THIS IS 100%-THE-BEST!!!

8888. this_is_a_fiiile.py

hidden bob

"""

<a id="extra-practice"></a>
## Extra Practice

---

Pull up the [Regex Golf](http://regex.alf.nu/) website and solve as many as you can!

If you get bored, try [Regex Crossword](https://regexcrossword.com/).

In [78]:
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Search successful.


In [79]:

# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged

result = re.findall(pattern, string) 
print(result)

['12', '89', '34']


In [99]:
import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']

['Twelve:', ' Eighty nine:', '.']


In [81]:

import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

['Twelve:', ' Eighty nine:89 Nine:9.']


In [82]:
# `re.sub()` returns a new string in which every match is replaced with `replace`.
# `pattern` and `string` are reused from the previous cell.
replace = ''  # replace each run of digits with an empty string

re.sub(pattern, replace, string)

'Twelve: Eighty nine: Nine:.'

In [78]:
# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

abc12de23f456


In [84]:
import re

# string containing a literal backslash and a newline
string = 'abc 12\\de 23 \n f45 6'

# matches one or more whitespace characters
pattern = r'\s+'
replace = ''

# count=1: replace only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

abc12\de 23 
 f45 6


In [85]:

# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

('abc12de23f456', 4)


In [86]:

import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

pattern found inside the string


In [87]:
import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
    print(match.group())
else:
    print("pattern not found")

801 35


In [88]:
>>> match.group(1)

'801'

In [89]:
>>> match.group(2)

'35'

In [90]:
>>> match.group(1, 2)

('801', '35')

In [91]:
>>> match.groups()

('801', '35')

In [92]:
>>> match.start()

2

In [93]:
>>> match.end()

8

In [94]:
>>> match.re

re.compile(r'(\d{3}) (\d{2})', re.UNICODE)

In [95]:
>>> match.string

'39801 356, 2102 1111'

In [96]:
import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

['\n', '\r']
