# Regular Expressions

- [Examples](#Examples)
- [Basic Regex](#Basic-Regex)
    - [Metacharacters](#Metacharacters)
    - [Repetition](#Repetition)
    - [Any of / None of](#Any-of-/-None-of)
    - [Anchors](#Anchors)
    - [Other Functions](#Other-Functions)
    - [Capture Groups](#Capture-Groups)
    - [Flags](#Flags)
    - [Usage with Pandas](#Usage-with-Pandas)

## Examples

Say I want to parse the following lines in a log file:

<div style="font-family: monospace; overflow: scroll; white-space: pre">GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
</div>
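One way to approach lines like these is a single pattern with named groups. The sketch below assumes every line follows the layout shown above; the name `LOG_LINE` and the group names (including reading the number after the status as a `size`) are illustrative assumptions, not part of the original lesson.

```python
import re

# Hypothetical pattern: the group names are guesses at what each field means.
LOG_LINE = re.compile(
    r'^(?P<method>[A-Z]+)\s+'          # HTTP verb, e.g. GET
    r'(?P<path>\S+)\s+'                # request path with query string
    r'\[(?P<timestamp>[^\]]+)\]\s+'    # everything between [ and ]
    r'(?P<protocol>\S+)\s+'            # e.g. HTTP/1.1
    r'\{(?P<status>\d+)\}\s+'          # status code inside braces
    r'(?P<size>\d+)\s+'                # number after the status (assumed to be a size)
    r'"(?P<user_agent>[^"]*)"\s+'      # quoted user agent, may contain spaces
    r'(?P<ip>\S+)$'                    # trailing IP address
)

line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 '
        '{200} 510348 "python-requests/2.21.0" 97.105.19.58')

match = LOG_LINE.search(line)
fields = match.groupdict() if match else {}
print(fields)
```

Every captured value is a string, so fields such as `status` would still need converting with `int()` before any numeric work.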

In [2]:
import re
import pandas as pd

Extract various components of an address:

In [84]:
addresses = pd.Series([
    '84 Rainey Street, Arlen, TX',
    '4 Privet Drive, Little Whinging, Surrey, U.K.',
    '740 Evergreen Terrace, Springfield',
    '1 Infinite Loop, Cupertino, California',
    'Wayne Manor, Gotham City',
    '124 Conch Street, Bikini Bottom',
])
addresses

0                      84 Rainey Street, Arlen, TX
1    4 Privet Drive, Little Whinging, Surrey, U.K.
2               740 Evergreen Terrace, Springfield
3           1 Infinite Loop, Cupertino, California
4                         Wayne Manor, Gotham City
5                  124 Conch Street, Bikini Bottom
dtype: object

In [82]:
data = addresses.str.extract(r'^(\d+)?\s*(.*?),\s*([\w\s]+)')
data.columns = ['house_no', 'street', 'city']
data

  house_no             street             city
0       84      Rainey Street            Arlen
1        4       Privet Drive  Little Whinging
2      740  Evergreen Terrace      Springfield
3        1      Infinite Loop        Cupertino
4      NaN        Wayne Manor      Gotham City
5      124       Conch Street    Bikini Bottom


In [None]:
# find all the csv files referenced in the curriculum (this won't work for you)
# !(cd ~/codeup/curriculum/data-science/content && rg --vimgrep ".*pd.read_csv\(['\"](.+)['\"]\).*" -r '$1')

In [None]:
# find all the imports in .py files in the curriculum (this won't work for you)
# !(cd ~/codeup/curriculum/data-science/content && rg --vimgrep '^import\s+([\.\w]+)\s*(as\s*\w+)?.*$' -r '$1')

## Basic Regex

- What is a regex? A pattern language for matching text. It is bigger than Python and comes in different flavors.
- Raw strings
- `re.findall` (but also other functions)
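To make the raw-string point concrete, here is a small sketch of why patterns are usually written with the `r` prefix. The sample text `'this is it'` is made up for illustration.

```python
import re

# In a normal string '\b' is one character (a backspace);
# in a raw string it is two characters: a backslash and 'b'.
print(len('\b'), len(r'\b'))

text = 'this is it'

# Raw string: the regex engine receives \b and treats it as a word boundary.
word_is = re.findall(r'\bis\b', text)

# Normal string: the regex engine receives literal backspace characters,
# which do not occur in the text, so nothing matches.
backspace_is = re.findall('\bis\b', text)

print(word_is, backspace_is)
```

`re.findall` returns a list of every non-overlapping match, which makes it handy for quick experiments like this one.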

In [5]:
import re

In [7]:
# for demonstration in this lesson
from zgulde.hl_matches import hl_all_matches_nb as hl # pip install zgulde

In [8]:
subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

In [9]:
re.findall(r'H', subject)

['H']

In [17]:
re.findall(r'e', subject)

['e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e']

In [11]:
hl(r'e', subject)

In [12]:
hl(r'70', subject)

### Metacharacters

In [18]:
hl(r'\w', subject)

In [19]:
hl(r'\d', subject)

In [20]:
hl(r'\s', subject)

In [22]:
hl(r'.', subject)
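The `hl` helper only shows its highlights inside a live notebook, so a plain `re.findall` version of the same metacharacters may be easier to follow here. This is a sketch that reuses the lesson's `subject` string.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

word_chars = re.findall(r'\w', subject)   # letters, digits, and underscore
digits = re.findall(r'\d', subject)       # digits only
whitespace = re.findall(r'\s', subject)   # spaces, tabs, newlines
anything = re.findall(r'.', subject)      # any character except a newline

print(len(word_chars), digits, len(whitespace), len(anything))
```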

### Repetition

In [21]:
hl(r'\w+', subject)

In [26]:
hl(r'\w{5}', subject)

In [27]:
hl(r'\w{6,8}', subject)
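The same quantifiers can be checked with `re.findall`, which prints the matches as text. A minimal sketch, again using the lesson's `subject`:

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

one_or_more = re.findall(r'\w+', subject)       # whole runs of word characters
exactly_five = re.findall(r'\w{5}', subject)    # chunks of exactly 5 word characters
six_to_eight = re.findall(r'\w{6,8}', subject)  # between 6 and 8, as many as possible

print(one_or_more)
print(exactly_five)
print(six_to_eight)
```

Note that `\w{5}` does not mean "five-letter words": it also carves longer words such as `temperature` into five-character pieces.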

### Any of / None of

In [28]:
hl(r'[aeiou]', subject)

In [29]:
hl(r'[^aeiou]', subject)

In [37]:
hl(r'[A-Z][a-z]+', subject)

In [38]:
re.findall(r'[A-Z][a-z]+', subject)

['Hello', 'Bayes', 'Today', 'Dec']

In [39]:
hl(r'\d.+', subject)

In [41]:
hl(r'[\d.]+', subject)
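The last two patterns look alike but behave very differently, because a `.` inside square brackets is a literal period rather than "any character". A short sketch with `re.findall` shows the contrast:

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

vowels = re.findall(r'[aeiou]', subject)       # any one of the listed characters
non_vowels = re.findall(r'[^aeiou]', subject)  # any one character NOT listed

# Outside brackets: a digit followed by one or more of any character.
digit_then_rest = re.findall(r'\d.+', subject)

# Inside brackets: one or more characters that are each a digit or a literal period.
digits_or_periods = re.findall(r'[\d.]+', subject)

print(len(vowels), len(non_vowels))
print(digit_then_rest)
print(digits_or_periods)
```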

In [46]:
# The regular expression below is 'greedy': it tries to match as much as it can.
hl(r'.+\d', subject)

In [43]:
re.findall(r'.+\d', subject)

['Hello, Bayes! Today is Dec 3 and the temperature is 70']

In [44]:
re.findall(r'.+?\d', subject)

['Hello, Bayes! Today is Dec 3', ' and the temperature is 7']

### Anchors

In [49]:
# start of the string then any character
re.findall(r'^.', subject)

['H']

In [55]:
hl(r'.{2}$', subject)

In [51]:
# any character and then the end of the string
re.findall(r'.$', subject)

['.']

In [50]:
# any character
re.findall(r'.', subject)

['H',
 'e',
 'l',
 'l',
 'o',
 ',',
 ' ',
 'B',
 'a',
 'y',
 'e',
 's',
 '!',
 ' ',
 'T',
 'o',
 'd',
 'a',
 'y',
 ' ',
 'i',
 's',
 ' ',
 'D',
 'e',
 'c',
 ' ',
 '3',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 't',
 'h',
 'e',
 ' ',
 't',
 'e',
 'm',
 'p',
 'e',
 'r',
 'a',
 't',
 'u',
 'r',
 'e',
 ' ',
 'i',
 's',
 ' ',
 '7',
 '0',
 ' ',
 'd',
 'e',
 'g',
 'r',
 'e',
 'e',
 's',
 '.']

In [None]:
hl(r'.{3}$', subject)

In [59]:
hl(r'.\b', subject)


In [60]:
re.findall(r'\b.', subject)

['H',
 ',',
 'B',
 '!',
 'T',
 ' ',
 'i',
 ' ',
 'D',
 ' ',
 '3',
 ' ',
 'a',
 ' ',
 't',
 ' ',
 't',
 ' ',
 'i',
 ' ',
 '7',
 ' ',
 'd',
 '.']

In [61]:
subject

'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

In [62]:
hl(r'\w{2}\b', subject)
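For reference, a `re.findall` sketch of the same word-boundary pattern: `\b` matches a position between a word character and a non-word character (or the start or end of the string), so `\w{2}\b` picks out the last two characters of each word.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# Two word characters immediately followed by a word boundary.
word_endings = re.findall(r'\w{2}\b', subject)
print(word_endings)
```

The single-character word `3` is skipped because it cannot supply two word characters.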

### Other Functions

- `re.search`
- `re.sub`
- `re.compile` + flags
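A brief sketch of the three functions listed above, using the lesson's `subject` string. The variable names are illustrative.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# re.search returns a match object for the first match, or None if there is none.
first_number = re.search(r'\d+', subject)
print(first_number.group(), first_number.start())

no_match = re.search(r'xyz', subject)
print(no_match)

# re.sub replaces every match with the replacement text.
masked = re.sub(r'\d+', '#', subject)
print(masked)

# re.compile builds a reusable pattern object with the same methods.
capitalized = re.compile(r'[A-Z][a-z]+')
print(capitalized.findall(subject))
```

Because `re.search` can return `None`, it is worth checking the result before calling `.group()` on it.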

### Capture Groups

In [65]:
hl(r'\w+w', subject)

In [67]:
hl(r'\w+(\w)', subject)

In [70]:
hl(r'(\w)\1', subject)

In [None]:
# (\w)\1 matches a double letter: a word character followed by itself
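When a pattern contains a capture group, `re.findall` returns the captured text rather than the whole match. The sketch below shows that behavior for the two grouped patterns above, and uses `re.finditer` to recover the full matches.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# With one group, findall returns only what the group captured:
# here, the last character of every word that has at least two characters.
last_chars = re.findall(r'\w+(\w)', subject)

# \1 refers back to whatever group 1 captured, so this finds doubled characters.
doubled = re.findall(r'(\w)\1', subject)

# finditer yields match objects, so group(0) gives the entire match.
doubled_full = [m.group(0) for m in re.finditer(r'(\w)\1', subject)]

print(last_chars)
print(doubled, doubled_full)
```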

In [71]:
date = '03 12 2019'

In [72]:
# re.sub(needle, replacement, haystack)
re.sub(r'0', 'ZERO', date)

'ZERO3 12 2ZERO19'

In [74]:
re.sub(r'\d+', 'DIGIT', date)

'DIGIT DIGIT DIGIT'

In [75]:
re.sub(r'(\d{2})\s(\d{2})\s(\d{4})', '???', date)

'???'

In [76]:
re.sub(r'(\d{2})\s(\d{2})\s(\d{4})', r'\1-\2-\3', date)

'03-12-2019'

In [77]:
re.sub(r'(\d{2})\s(\d{2})\s(\d{4})', r'\2/\1/\3', date)

'12/03/2019'

### Flags

In [79]:
re.compile(r'', re.IGNORECASE | re.MULTILINE | re.VERBOSE)

re.compile(r'', re.IGNORECASE|re.MULTILINE|re.UNICODE|re.VERBOSE)
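The empty pattern above only shows how flags combine with `|`. As a rough illustration of what each flag changes, here is a sketch with made-up sample text; the `date` string is the one used earlier in the lesson.

```python
import re

text = 'alpha\nBeta\nALPHA'

# IGNORECASE: letters match regardless of case.
any_case = re.findall(r'alpha', text, re.IGNORECASE)

# MULTILINE: ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
whole_string_only = re.findall(r'^\w+$', text)
per_line = re.findall(r'^\w+$', text, re.MULTILINE)

# VERBOSE: whitespace in the pattern is ignored and # starts a comment,
# which lets a long pattern be laid out over several lines.
date_pattern = re.compile(r'''
    (\d{2}) \s   # first two-digit number
    (\d{2}) \s   # second two-digit number
    (\d{4})      # four-digit year
''', re.VERBOSE)

date = '03 12 2019'
print(any_case, whole_string_only, per_line)
print(date_pattern.sub(r'\1-\2-\3', date))
```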

### Usage with Pandas

In [78]:
pd.Series.str.extract

<function pandas.core.strings.StringMethods.extract(self, pat, flags=0, expand=True)>

In [87]:
# capture the leading run of digits; str.extract returns each capture group as a DataFrame column
addresses.str.extract(r'^(\d+)')

     0
0   84
1    4
2  740
3    1
4  NaN
5  124


In [90]:
# capture everything after the whitespace character up to the last comma (greedy); put a question mark after the + (.+?) to stop at the first comma instead
addresses.str.extract(r'^(\d+)\s(.+),')



     0                                      1
0   84                   Rainey Street, Arlen
1    4  Privet Drive, Little Whinging, Surrey
2  740                      Evergreen Terrace
3    1               Infinite Loop, Cupertino
4  NaN                                    NaN
5  124                           Conch Street
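A possible follow-up, sketched under the assumption that only the street name is wanted: making the quantifier lazy (`.+?`) stops at the first comma, and named groups give the resulting columns readable names. The group names `house_no` and `street` are illustrative choices, and `addresses` is redefined so the block runs on its own.

```python
import pandas as pd

addresses = pd.Series([
    '84 Rainey Street, Arlen, TX',
    '4 Privet Drive, Little Whinging, Surrey, U.K.',
    '740 Evergreen Terrace, Springfield',
    '1 Infinite Loop, Cupertino, California',
    'Wayne Manor, Gotham City',
    '124 Conch Street, Bikini Bottom',
])

# .+? is lazy, so the street group ends at the first comma.
# Named groups (?P<name>...) become the column names of the result.
parts = addresses.str.extract(r'^(?P<house_no>\d+)\s(?P<street>.+?),')
print(parts)
```

Rows that do not match at all, such as the address with no leading house number, come back as missing values in every column.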
