# Regular Expressions

- [Examples](#Examples)
- [Basic Regex](#Basic-Regex)
    - [Metacharacters](#Metacharacters)
    - [Repetition](#Repitition)
    - [Any of / None of](#Any-of-/-None-of)
    - [Anchors](#Anchors)
    - [Other Functions](#Other-Functions)
    - [Capture Groups](#Capture-Groups)
    - [Flags](#Flags)
    - [Usage with Pandas](#Usage-with-Pandas)

## Examples

Say I want to parse the following lines in a log file:

<div style="font-family: monospace; overflow: scroll; white-space: pre">GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
</div>
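A minimal sketch of how one of these lines could be pulled apart with the standard-library `re` module. The group names (`method`, `path`, `timestamp`, and so on) are labels chosen here for illustration, not names defined by the log format, and the pattern is only written against the sample lines above.

```python
import re

# Group names are illustrative assumptions, not part of the log format.
LOG_LINE = re.compile(r'''
    ^(?P<method>[A-Z]+)\s+        # request method, e.g. GET or POST
    (?P<path>\S+)\s+              # request path, including any query string
    \[(?P<timestamp>[^\]]+)\]\s+  # timestamp between square brackets
    (?P<protocol>\S+)\s+          # e.g. HTTP/1.1
    \{(?P<status>\d+)\}\s+        # status code between curly braces
    (?P<size>\d+)\s+              # number after the status (presumably a size)
    "(?P<user_agent>[^"]*)"\s+    # user agent between double quotes
    (?P<ip>\S+)$                  # client IP address
''', re.VERBOSE)

line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} '
        '510348 "python-requests/2.21.0" 97.105.19.58')

fields = LOG_LINE.match(line).groupdict()
print(fields['method'], fields['path'], fields['status'], fields['ip'])
```

Every value in `fields` is a string; converting `status` or `size` to integers would be a separate step.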

In [2]:
import pandas as pd

Extract various components of an address:

In [3]:
addresses = pd.Series([
    '84 Rainey Street, Arlen, TX',
    '4 Privet Drive, Little Whinging, Surrey, U.K.',
    '740 Evergreen Terrace, Springfield',
    '1 Infinite Loop, Cupertino, California',
    'Wayne Manor, Gotham City',
    '124 Conch Street, Bikini Bottom',
])
addresses

0                      84 Rainey Street, Arlen, TX
1    4 Privet Drive, Little Whinging, Surrey, U.K.
2               740 Evergreen Terrace, Springfield
3           1 Infinite Loop, Cupertino, California
4                         Wayne Manor, Gotham City
5                  124 Conch Street, Bikini Bottom
dtype: object

In [4]:
data = addresses.str.extract(r'^(\d+)?\s*(.*?),\s*([\w\s]+)')
data.columns = ['house_no', 'street', 'city']
data

  house_no             street             city
0       84      Rainey Street            Arlen
1        4       Privet Drive  Little Whinging
2      740  Evergreen Terrace      Springfield
3        1      Infinite Loop        Cupertino
4      NaN        Wayne Manor      Gotham City
5      124       Conch Street    Bikini Bottom


In [None]:
# find all the csv files referenced in the curriculum (this won't work for you)
# !(cd ~/codeup/curriculum/data-science/content && rg --vimgrep ".*pd.read_csv\(['\"](.+)['\"]\).*" -r '$1')

In [None]:
# find all the imports in .py files in the curriculum (this won't work for you)
# !(cd ~/codeup/curriculum/data-science/content && rg --vimgrep '^import\s+([\.\w]+)\s*(as\s*\w+)?.*$' -r '$1')

## Basic Regex

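- What is a regex? A small pattern language for matching text. It is bigger than Python: most languages and tools support it, each with its own flavor.
- Raw strings (`r'...'`) keep backslashes literal, so `\d` reaches the regex engine unchanged.
- `re.findall` returns every match as a list (the `re` module has other functions as well).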
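A short sketch of why raw strings matter. The sample text and variable names are made up for illustration. In a normal string literal `'\b'` is the backspace character, while in a raw string it stays a backslash plus `b`, which the regex engine reads as a word boundary.

```python
import re

normal = '\bis\b'   # Python turns each \b into a backspace character
raw = r'\bis\b'     # backslashes are kept, so the regex engine sees \b

text = 'This is it'

print(len(normal), len(raw))      # 4 6
print(re.findall(normal, text))   # []
print(re.findall(raw, text))      # ['is']
```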

In [9]:
import re

In [11]:
# for demonstration in this lesson
from zgulde.hl_matches import hl_all_matches_nb as hl # pip install zgulde

In [12]:
subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

In [13]:
re.findall(r'H', subject)

['H']

In [14]:
re.findall(r'e', subject)

['e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e']

In [15]:
hl(r'e', subject)

In [16]:
hl(r'70', subject)

### Metacharacters

In [17]:
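# \w matches a single word character: a letter, a digit, or an underscore
hl(r'\w', subject)

In [18]:
# \d matches a single digit
hl(r'\d', subject)

In [19]:
# \s matches a single whitespace character (space, tab, newline)
hl(r'\s', subject)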
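The `hl` helper appears to only highlight matches in the notebook, so its output does not survive here. As a rough stand-in, the same patterns can be run through `re.findall` to see what each metacharacter matches:

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

digits = re.findall(r'\d', subject)
spaces = re.findall(r'\s', subject)
word_chars = re.findall(r'\w', subject)

print(digits)          # ['3', '7', '0']
print(len(spaces))     # 11
print(word_chars[:5])  # ['H', 'e', 'l', 'l', 'o']
```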

<a id="Repitition"></a>
### Repetition

In [20]:
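# + means one or more of the preceding item, so this matches whole words
hl(r'\w+', subject)

In [24]:
# {5} means exactly five in a row
hl(r'\w{5}', subject)

In [25]:
# {5,} means five or more in a row
hl(r'\w{5,}', subject)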
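To make the difference between the three quantifiers concrete, here is what `re.findall` returns for each of them on the same `subject`. Note how `{5}` chops long words into five-character pieces and drops the leftovers.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

one_or_more = re.findall(r'\w+', subject)
exactly_five = re.findall(r'\w{5}', subject)
five_or_more = re.findall(r'\w{5,}', subject)

print(one_or_more)
# ['Hello', 'Bayes', 'Today', 'is', 'Dec', '3', 'and', 'the', 'temperature', 'is', '70', 'degrees']
print(exactly_five)
# ['Hello', 'Bayes', 'Today', 'tempe', 'ratur', 'degre']
print(five_or_more)
# ['Hello', 'Bayes', 'Today', 'temperature', 'degrees']
```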

### Any of / None of

In [26]:
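# [...] matches any one of the listed characters
hl(r'[aeiou]', subject)

In [27]:
# [^...] matches any one character that is not listed
hl(r'[^aeiou]', subject)

In [34]:
# one uppercase letter followed by one or more lowercase letters
hl(r'[A-Z][a-z]+', subject)
re.findall(r'[A-Z][a-z]+', subject)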

['Hello', 'Bayes', 'Today', 'Dec']

In [37]:
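# inside [...] the dot is a literal period, so this matches runs of digits and periods
hl(r'[\d.]+', subject)

In [38]:
# outside [...] the dot matches any character; + is greedy, so this runs to the last digit
hl(r'.+\d', subject)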
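The last two patterns look similar but behave very differently, which is easier to see with `re.findall`. This sketch also shows that a negated class such as `[^aeiou]` matches spaces and punctuation, not just consonants.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# Inside a character class the dot is a literal period.
digits_or_dots = re.findall(r'[\d.]+', subject)
print(digits_or_dots)  # ['3', '70', '.']

# Outside a class the dot matches any character, and + is greedy,
# so the match stretches from the start to the last digit.
greedy = re.findall(r'.+\d', subject)
print(greedy)  # ['Hello, Bayes! Today is Dec 3 and the temperature is 70']

# A negated class matches anything not listed, including ',' and ' '.
not_vowels = re.findall(r'[^aeiou]', 'Hello, Bayes!')
print(not_vowels)  # ['H', 'l', 'l', ',', ' ', 'B', 'y', 's', '!']
```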

### Anchors

In [22]:
# ^ anchors the match to the start of the string
hl(r'^.', subject)

In [None]:
# $ anchors the match to the end of the string
hl(r'.{3}$', subject)
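For reference, a small sketch of what the anchored patterns return when run through `re.findall`. Because an anchor pins the match to one position, each call yields at most one result on a single-line string.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

first_char = re.findall(r'^.', subject)
last_three = re.findall(r'.{3}$', subject)
first_word = re.findall(r'^\w+', subject)

print(first_char)  # ['H']
print(last_three)  # ['es.']
print(first_word)  # ['Hello']
```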

### Other Functions

- `re.search`
- `re.sub`
- `re.compile` + flags
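A brief sketch of the three functions listed above, applied to the same `subject` string. The variable names are illustrative.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# re.search returns a match object for the first match, or None if there is none.
first_number = re.search(r'\d+', subject)
print(first_number.group())  # 3

# re.sub replaces every match with the replacement string.
masked = re.sub(r'\d+', '#', subject)
print(masked)  # Hello, Bayes! Today is Dec # and the temperature is # degrees.

# re.compile builds a reusable pattern object; flags are the second argument.
letters = re.compile(r'[a-z]+', re.IGNORECASE)
print(letters.findall('Dec 3'))  # ['Dec']
```

Compiling is mostly a convenience when the same pattern is used many times; the module-level functions accept the same pattern strings.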

### Capture Groups

In [None]:
# parentheses create a capture group; here it captures the last character of each match
hl(r'\w+(\w)', subject)
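Capture groups change what the `re` functions hand back. As a sketch: when a pattern has one group, `re.findall` returns only the captured text, and a match object exposes each group by number or, with `(?P<name>...)`, by name. The `month` and `day` names are chosen here for illustration.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# With one group, findall returns the group, not the whole match:
# here, the last character of every word that has at least two characters.
last_chars = re.findall(r'\w+(\w)', subject)
print(last_chars)
# ['o', 's', 'y', 's', 'c', 'd', 'e', 'e', 's', '0', 's']

# Named groups make the pieces of a match easier to refer to.
date = re.search(r'(?P<month>[A-Z][a-z]{2}) (?P<day>\d+)', subject)
print(date.group(0))        # Dec 3
print(date.group('month'))  # Dec
print(date.group('day'))    # 3
```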

In [41]:
java_string = "i love javascript so much but java, not so much. mochajavacoffee"

In [42]:
# \w* on either side picks up the rest of any word that contains "java"
re.findall(r'\w*java\w*', java_string)

['javascript', 'java', 'mochajavacoffee']

In [None]:
# double letter: the same character twice in a row, as in "ll"
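The cell above is left empty in the notebook. One possible way to match a double letter, offered as a sketch rather than the intended solution, is a backreference: `\1` matches whatever text the first capture group just matched.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# (\w)\1 matches a word character followed by the same character again.
# findall returns only the group, so each result is a single letter.
doubled = re.findall(r'(\w)\1', subject)
print(doubled)  # ['l', 'e']

# finditer gives match objects, so the whole matched text is available.
pairs = [m.group() for m in re.finditer(r'(\w)\1', subject)]
print(pairs)  # ['ll', 'ee']

# Surrounding the pair with \w* returns the whole word instead.
words = [m.group() for m in re.finditer(r'\w*(\w)\1\w*', subject)]
print(words)  # ['Hello', 'degrees']
```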

### Flags

In [None]:
re.compile(r'', re.IGNORECASE | re.MULTILINE | re.VERBOSE)
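The pattern in the cell above is empty, so here is a small sketch with the same three flags applied to a made-up multi-line string. `re.IGNORECASE` ignores letter case, `re.MULTILINE` lets `^` match at the start of every line, and `re.VERBOSE` allows whitespace and `#` comments inside the pattern.

```python
import re

# Sample text invented for this example.
text = 'get /a\nPOST /b\nGET /c'

pattern = re.compile(r'''
    ^(get|post)   # method at the start of a line, in any case
    \s+           # one or more whitespace characters
    (\S+)         # the path: a run of non-whitespace characters
''', re.IGNORECASE | re.MULTILINE | re.VERBOSE)

requests = pattern.findall(text)
print(requests)  # [('get', '/a'), ('POST', '/b'), ('GET', '/c')]
```

With two capture groups, `findall` returns one tuple per match. Without `re.MULTILINE`, only the first line would match, because `^` would anchor to the start of the whole string.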

### Usage with Pandas

In [None]:
pd.Series.str.extract
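A minimal sketch of regex use through the pandas `.str` accessor, assuming pandas is installed. The `requests` series is sample data made up from the log lines at the top of the lesson. With named groups, `str.extract` uses the group names as column names.

```python
import pandas as pd

# Sample data for illustration, based on the log lines in the Examples section.
requests = pd.Series([
    'GET /api/v1/sales?page=86',
    'POST /users_accounts/file-upload',
    'GET /api/v1/items?page=3',
])

# One column per capture group; named groups become the column names.
parts = requests.str.extract(r'^(?P<method>[A-Z]+)\s+(?P<path>\S+)')
print(parts)

# str.contains returns a boolean Series, handy for filtering rows.
is_get = requests.str.contains(r'^GET')
print(requests[is_get])

# str.replace with regex=True substitutes every match in every element.
no_query = requests.str.replace(r'\?.*$', '', regex=True)
print(no_query)
```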