# Regular Expressions

- [Examples](#Examples)
- [Basic Regex](#Basic-Regex)
    - [Metacharacters](#Metacharacters)
    - [Repetition](#Repetition)
    - [Any of / None of](#Any-of-/-None-of)
    - [Anchors](#Anchors)
    - [Other Functions](#Other-Functions)
    - [Capture Groups](#Capture-Groups)
    - [Flags](#Flags)
    - [Usage with Pandas](#Usage-with-Pandas)

## Examples

Say I want to parse the following lines in a log file:

<div style="font-family: monospace; overflow: scroll; white-space: pre">GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
</div>

In [143]:
import pandas as pd
import re

lines = '''
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
'''.strip().split('\n')

regex = re.compile(r'''
(?P<method>POST|GET)
\s*
(?P<path>(?:[/\w-]+))
(?:\?(?P<query_string>.*?)\s)?
\s*\[
    (?P<day>\d+)/(?P<month>\w+)/(?P<year>\d+):
    (?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})
    (?P<timezone>\+\d{4})
\]\s+
(?P<protocol>HTTPS?/\d\.\d)
\s+
\{(?P<status>\d+)\}
\s+
(?P<bytes_sent>\d+)
\s+
"(?P<user_agent>.*)"
\s+
(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
''', re.VERBOSE)

pd.DataFrame([regex.match(line).groupdict() for line in lines])


| | method | path | query_string | day | month | year | hour | minute | second | timezone | protocol | status | bytes_sent | user_agent | ip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GET | /api/v1/sales | page=86 | 16 | Apr | 2019 | 19 | 34 | 52 | +0000 | HTTP/1.1 | 200 | 510348 | python-requests/2.21.0 | 97.105.19.58 |
| 1 | POST | /users_accounts/file-upload | None | 16 | Apr | 2019 | 19 | 34 | 52 | +0000 | HTTP/1.1 | 201 | 42 | User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ... | 97.105.19.58 |
| 2 | GET | /api/v1/items | page=3 | 16 | Apr | 2019 | 19 | 34 | 53 | +0000 | HTTP/1.1 | 429 | 3561 | python-requests/2.21.0 | 97.105.19.58 |
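
The list comprehension above calls `.groupdict()` on every result of `regex.match`. `match` returns `None` when a line does not fit the pattern, so a single malformed line would raise an `AttributeError`. A minimal sketch of a more defensive loop follows; `request_re` and `sample_lines` are hypothetical, simplified stand-ins, not the full log regex:

```python
import re

# Simplified stand-in for the log regex above: only the method and the path
request_re = re.compile(r'(?P<method>GET|POST)\s+(?P<path>[/\w-]+)')

sample_lines = [
    'GET /api/v1/sales?page=86',
    'this line is not a request',
    'POST /users_accounts/file-upload',
]

# match() returns None for lines that do not fit, so filter those out
rows = [m.groupdict() for m in map(request_re.match, sample_lines) if m is not None]
print(rows)
# [{'method': 'GET', 'path': '/api/v1/sales'}, {'method': 'POST', 'path': '/users_accounts/file-upload'}]
```

The resulting list of dictionaries can be passed to `pd.DataFrame` exactly as in the cell above.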


Extract various components of an address:

In [2]:
addresses = pd.Series([
    '84 Rainey Street, Arlen, TX',
    '4 Privet Drive, Little Whinging, Surrey, U.K.',
    '740 Evergreen Terrace, Springfield',
    '1 Infinite Loop, Cupertino, California',
    'Wayne Manor, Gotham City',
    '124 Conch Street, Bikini Bottom',
])
addresses

0                      84 Rainey Street, Arlen, TX
1    4 Privet Drive, Little Whinging, Surrey, U.K.
2               740 Evergreen Terrace, Springfield
3           1 Infinite Loop, Cupertino, California
4                         Wayne Manor, Gotham City
5                  124 Conch Street, Bikini Bottom
dtype: object

In [3]:
data = addresses.str.extract(r'^(\d+)?\s*(.*?),\s*([\w\s]+)')
data.columns = ['house_no', 'street', 'city']
data

| | house_no | street | city |
|---|---|---|---|
| 0 | 84 | Rainey Street | Arlen |
| 1 | 4 | Privet Drive | Little Whinging |
| 2 | 740 | Evergreen Terrace | Springfield |
| 3 | 1 | Infinite Loop | Cupertino |
| 4 | NaN | Wayne Manor | Gotham City |
| 5 | 124 | Conch Street | Bikini Bottom |


In [4]:
# find all the csv files referenced in the curriculum (this won't work for you)
!(cd ~/codeup/curriculum/data-science/content && rg --vimgrep ".*pd.read_csv\(['\"](.+)['\"]\).*" -r '$1')

13-advanced-topics/2-cross-validation.ipynb:178:1:./data/cars.csv
13-advanced-topics/1-tidy-data.ipynb:127:1:data/treatments.csv
13-advanced-topics/1-tidy-data.ipynb:465:1:data/students.csv
13-advanced-topics/1-tidy-data.ipynb:1184:1:./data/sales.csv
13-advanced-topics/3.3-building-a-model.md:17:1:./data/spam_clean.csv
4-python/pandas-time-series.md:83:1:coffee_consumption.csv
4-python/7.4-intro-to-pandas.ipynb:4874:1:fruits.csv
4-python/7.4-intro-to-pandas.ipynb:4961:1:fruits.csv', header=None, sep=';
6-regression/wrangle.py:35:1:data/student_grades.csv
7-classification/2-acquire.ipynb:346:1:https://s3.amazonaws.com/irs-form-990/index_2011.csv
7-classification/handling-missing-values.ipynb:44:1:data/titanic.csv
11-nlp/pos-tagging.ipynb:48:1:data/news.csv
11-nlp/entity-labeling.ipynb:71:1:data/news.csv
11-nlp/6-model.ipynb:1215:1:./data/spam_clean.csv
11-nlp/5-explore.ipynb:105:1:./data/spam_clean.csv
5-stats/4.2-compare-means.ipynb:37:1:data/exam_scores.csv
5-stats/4.3

In [5]:
# find all the imports in .py files in the curriculum (this won't work for you)
!(cd ~/codeup/curriculum/data-science/content && rg --vimgrep '^import\s+([\.\w]+)\s*(as\s*\w+)?.*$' -r '$1')

13-advanced-topics/3.3-building-a-model.md:10:1:pandas
9-timeseries/prepare.py:1:1:pandas
10-anomaly-detection/plot_isolation_forest.py:27:1:numpy
10-anomaly-detection/plot_isolation_forest.py:28:1:matplotlib.pyplot
9-timeseries/summarize.py:1:1:pandas
9-timeseries/acquire.py:1:1:pandas
9-timeseries/acquire.py:2:1:env
10-anomaly-detection/lesson_prob_AD.py:2:1:itertools
10-anomaly-detection/lesson_prob_AD.py:5:1:matplotlib.pyplot
10-anomaly-detection/lesson_prob_AD.py:6:1:matplotlib.dates
10-anomaly-detection/lesson_prob_AD.py:7:1:numpy
10-anomaly-detection/lesson_prob_AD.py:8:1:pandas
10-anomaly-detection/lesson_prob_AD.py:9:1:math
10-anomaly-detection/lesson_prob_AD.py:13:1:seaborn
4-python/7.3-intro-to-numpy.md:13:1:numpy
8-clustering/acquire.py:1:1:pandas
8-clustering/acquire.py:2:1:env
6-regression/wrangle.py:1:1:pandas
6-regression/wrangle.py:2:1:numpy
6-regression/wrangle.py:4:1:env
11-nlp/project_acquire.py:23:1:os
11-nlp/project_acquire.py:24:1:json
11-nl

## Basic Regex

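- What is a regex? A small language for describing patterns in text. It is bigger than Python: most languages and many command-line tools support it, each in a slightly different flavor.
- Raw strings (`r'...'`) leave backslashes untouched, so the pattern reaches the regex engine exactly as written.
- `re.findall` returns every match as a list (the `re` module offers other functions as well).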

In [6]:
'here is a single quote \' <--'

"here is a single quote ' <--"

In [7]:
r'here is a single quote \' <--'

"here is a single quote \\' <--"

In [8]:
import re
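
The raw-string cells above show that `r'...'` keeps the backslash. A short sketch of why that matters for patterns: in a normal string `'\b'` is the backspace character, while in a raw string `\b` reaches the regex engine and means "word boundary". The sample text here is made up for illustration.

```python
import re

text = 'java javascript'

# Normal string: '\b' is a backspace character, which never occurs in the text
plain = re.findall('\bjava\b', text)
# Raw string: \b is passed through to the regex engine as a word boundary
raw = re.findall(r'\bjava\b', text)

print(plain)  # []
print(raw)    # ['java']
```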

In [9]:
# pip install --upgrade zgulde
# for demonstration in this lesson
from zgulde.hl_matches import hl_all_matches_nb as hl

In [10]:
subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

In [11]:
re.findall(r'H', subject)

['H']

In [12]:
re.findall(r'e', subject)

['e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e']

In [13]:
hl(r'e', subject)

In [14]:
hl(r'70', subject)

### Metacharacters

In [15]:
hl(r'\W', subject)

In [16]:
hl(r'\D', subject)

In [17]:
hl(r'\S', subject)

In [18]:
hl(r'.', subject)
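
A plain `re.findall` version of the metacharacter cells above, as a sketch that shows the matches as text. It only uses the `subject` string defined earlier.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

digits = re.findall(r'\d', subject)         # \d: any digit
non_word = set(re.findall(r'\W', subject))  # \W: anything that is not a letter, digit, or underscore
non_digits = re.findall(r'\D', subject)     # \D: anything that is not a digit
non_space = re.findall(r'\S', subject)      # \S: anything that is not whitespace
any_char = re.findall(r'.', subject)        # . : any character except a newline

print(digits)            # ['3', '7', '0']
print(sorted(non_word))  # [' ', '!', ',', '.']
# \D skips the three digits, \S skips the spaces, and . matches every character
print(len(subject), len(non_digits), len(non_space), len(any_char))
```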

### Repetition

In [19]:
re.findall(r'\w+', subject)

['Hello',
 'Bayes',
 'Today',
 'is',
 'Dec',
 '3',
 'and',
 'the',
 'temperature',
 'is',
 '70',
 'degrees']

In [20]:
hl(r'\w+', subject)

In [21]:
hl(r'\w{5}', subject)

In [22]:
hl(r'\w{6,8}', subject)
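
The counted quantifiers above can be checked with `re.findall`. This sketch also shows a detail that is easy to miss: without word boundaries, `\w{5}` happily matches five-character pieces of longer words.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

exactly_five = re.findall(r'\w{5}', subject)    # any run of exactly 5 word characters
six_to_eight = re.findall(r'\w{6,8}', subject)  # between 6 and 8, as many as possible
whole_five = re.findall(r'\b\w{5}\b', subject)  # whole words that are exactly 5 long

print(exactly_five)  # ['Hello', 'Bayes', 'Today', 'tempe', 'ratur', 'degre']
print(six_to_eight)  # ['temperat', 'degrees']
print(whole_five)    # ['Hello', 'Bayes', 'Today']
```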

### Any of / None of

In [23]:
hl(r'[aeiou]', subject)

In [24]:
hl(r'[^aeiou]', subject)

In [25]:
hl(r'[A-Z][a-z]+', subject)

In [26]:
re.findall(r'[A-Z][a-z]+', subject)

['Hello', 'Bayes', 'Today', 'Dec']

In [27]:
hl(r'\d.+', subject)

In [28]:
re.findall(r'\d.+', subject)

['3 and the temperature is 70 degrees.']

In [29]:
hl(r'[\d.]+', subject)
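
Inside a character class the `.` loses its "any character" meaning and matches a literal period, which is why `[\d.]+` behaves so differently from `\d.+`. A small sketch (the price sentence is a made-up example):

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# Runs made up of digits and literal periods only
in_subject = re.findall(r'[\d.]+', subject)
in_price = re.findall(r'[\d.]+', 'it costs 4.99 today')

print(in_subject)  # ['3', '70', '.']
print(in_price)    # ['4.99']
```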

In [30]:
# quantifiers are *greedy* by default: they match _as much_ as they can
hl(r'.+\d', subject)

In [31]:
re.findall(r'.+\d', subject)

['Hello, Bayes! Today is Dec 3 and the temperature is 70']

In [32]:
re.findall(r'.+?\d', subject)

['Hello, Bayes! Today is Dec 3', ' and the temperature is 7']

### Anchors

In [33]:
re.findall(r'^.', subject)

['H']

In [34]:
re.findall(r'\w$', subject)

[]

In [35]:
hl(r'.{2}$', subject)

In [36]:
re.findall(r'.$', subject)

['.']

In [37]:
hl(r'.', subject)

In [38]:
hl(r'.\b', subject)

In [39]:
hl(r'\b.', subject)

In [40]:
java_string = 'I love javascript so much, but java, not so much. I also like the delicious coffeedrink mochajavacoffee.'
java_string

'I love javascript so much, but java, not so much. I also like the delicious coffeedrink mochajavacoffee.'

In [41]:
# just the words that start with java
hl(r'\bjava\w*', java_string)

In [42]:
# any word that contains "java", including "java" itself
hl(r'\w*java\w*', java_string)
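
The same two patterns with `re.findall`, as a sketch that lists the matched words, plus a third pattern with a boundary on both sides for the exact word only:

```python
import re

java_string = 'I love javascript so much, but java, not so much. I also like the delicious coffeedrink mochajavacoffee.'

starts_with_java = re.findall(r'\bjava\w*', java_string)  # boundary before "java"
contains_java = re.findall(r'\w*java\w*', java_string)    # "java" anywhere in the word
exactly_java = re.findall(r'\bjava\b', java_string)       # boundary on both sides

print(starts_with_java)  # ['javascript', 'java']
print(contains_java)     # ['javascript', 'java', 'mochajavacoffee']
print(exactly_java)      # ['java']
```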

In [43]:
subject

'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

In [44]:
hl(r'\w{2}\b', subject)

In [45]:
hl(r'\b\w{2}', subject)

In [46]:
hl(r'^.', subject)

### Other Functions

- `re.search`
- `re.sub`
- `re.compile` + flags
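
A minimal sketch of the three functions listed above, using the `subject` string from earlier. The patterns are chosen only for illustration.

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# re.search returns a Match object for the first match, or None if there is none
first_number = re.search(r'\d+', subject)
no_match = re.search(r'xyz', subject)

# re.sub replaces every match with the replacement string
masked = re.sub(r'\d+', '#', subject)

# re.compile builds a reusable pattern object; flags are passed as the second argument
dec_re = re.compile(r'dec', re.IGNORECASE)
found = dec_re.findall(subject)

print(first_number.group())  # 3
print(no_match)              # None
print(masked)                # Hello, Bayes! Today is Dec # and the temperature is # degrees.
print(found)                 # ['Dec']
```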

### Capture Groups

In [47]:
re.findall(r'\w+\w', subject)

['Hello',
 'Bayes',
 'Today',
 'is',
 'Dec',
 'and',
 'the',
 'temperature',
 'is',
 '70',
 'degrees']

In [48]:
hl(r'\w+(\w)', subject)

In [49]:
hl(r'(\w)\1', subject)

In [50]:
hl(r'(\w)..\1', subject)
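
When a pattern contains a capture group, `re.findall` returns the group rather than the whole match. The sketch below makes that visible for the patterns above, and uses `re.finditer` to recover the full matched text:

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# Only the captured last character of each word comes back
last_letters = re.findall(r'\w+(\w)', subject)

# \1 refers back to whatever group 1 captured: a doubled character
doubled_groups = re.findall(r'(\w)\1', subject)
doubled_full = [m.group(0) for m in re.finditer(r'(\w)\1', subject)]

print(last_letters)    # ['o', 's', 'y', 's', 'c', 'd', 'e', 'e', 's', '0', 's']
print(doubled_groups)  # ['l', 'e']
print(doubled_full)    # ['ll', 'ee']
```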

In [51]:
date = '03 12 2019'

In [52]:
# re.sub(needle, replacement, haystack)
re.sub(r'0', 'ZERO', date)

'ZERO3 12 2ZERO19'

In [53]:
re.sub(r'\d{4}', 'digits', date)

'03 12 digits'

In [54]:
date

'03 12 2019'

In [55]:
re.sub(r'(\d{2})\s(\d{2})\s(\d{4})', r'\1-\2-\3', date)

'03-12-2019'

In [56]:
re.sub(r'(\d{2})\s(\d{2})\s(\d{4})', r'\2/\1/\3', date)

'12/03/2019'
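
Numbered references such as `\1` get hard to read as the number of groups grows. A sketch of the same substitution with named groups, referenced in the replacement as `\g<name>`. Treating the first field as the month is an assumption made here for illustration; the original string does not say which field is which.

```python
import re

date = '03 12 2019'

# Assumption: the fields are month, day, year
date_re = re.compile(r'(?P<month>\d{2})\s(?P<day>\d{2})\s(?P<year>\d{4})')

# \g<name> inserts the text captured by the named group
iso_date = date_re.sub(r'\g<year>-\g<month>-\g<day>', date)
print(iso_date)  # 2019-03-12
```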

In [57]:
date

'03 12 2019'

In [58]:
# the last word character at the end of the string (accounting for anything else at the end of the string)
hl(r'(\w)\W*$', subject)

### Flags

In [59]:
compiled_regexp = re.compile(r'^.', re.IGNORECASE | re.MULTILINE | re.VERBOSE)

compiled_regexp.findall('abc')

['a']
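
The cell above combines three flags with `|`, but its input does not show what each one does. A sketch with a two-line string (made up for illustration) that isolates each effect:

```python
import re

text = 'Hello\nhello'

# No flags: ^ matches only at the very start, and matching is case-sensitive
no_flags = re.findall(r'^hello', text)
# IGNORECASE: letters match regardless of case
ignore_case = re.findall(r'^hello', text, re.IGNORECASE)
# MULTILINE: ^ (and $) also match at the start (and end) of every line
multiline = re.findall(r'^hello', text, re.MULTILINE)

# VERBOSE: whitespace and # comments inside the pattern are ignored
verbose_re = re.compile(r'''
    ^       # start of a line, because of MULTILINE
    hello   # matches any casing, because of IGNORECASE
''', re.IGNORECASE | re.MULTILINE | re.VERBOSE)
all_three = verbose_re.findall(text)

print(no_flags)     # []
print(ignore_case)  # ['Hello']
print(multiline)    # ['hello']
print(all_three)    # ['Hello', 'hello']
```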

### Usage with Pandas

Several common `.str` methods accept a regular expression:

- `extract`: extract pieces of each string using capture groups
- `contains`: test whether a regular expression matches anywhere in each string
- `count`: count the number of occurrences of the regular expression in each string
- `replace`: make string substitutions based on a regular expression (pass `regex=True` in pandas 2.0 and later)

In [60]:
addresses = pd.Series([
    '84 Rainey Street, Arlen, TX',
    '4 Privet Drive, Little Whinging, Surrey, U.K.',
    '740 Evergreen Terrace, Springfield',
    '1 Infinite Loop, Cupertino, California',
    'Wayne Manor, Gotham City',
    '124 Conch Street, Bikini Bottom',
])
addresses

0                      84 Rainey Street, Arlen, TX
1    4 Privet Drive, Little Whinging, Surrey, U.K.
2               740 Evergreen Terrace, Springfield
3           1 Infinite Loop, Cupertino, California
4                         Wayne Manor, Gotham City
5                  124 Conch Street, Bikini Bottom
dtype: object

In [61]:
addresses.str.extract(r'^(\d+)?\s*(.+?),')

| | 0 | 1 |
|---|---|---|
| 0 | 84 | Rainey Street |
| 1 | 4 | Privet Drive |
| 2 | 740 | Evergreen Terrace |
| 3 | 1 | Infinite Loop |
| 4 | NaN | Wayne Manor |
| 5 | 124 | Conch Street |


Using named capture groups gives us a dataframe with column names.

In [62]:
addresses.str.extract(r'^(?P<house_no>\d+)?\s*(?P<street_name>.+?),')

| | house_no | street_name |
|---|---|---|
| 0 | 84 | Rainey Street |
| 1 | 4 | Privet Drive |
| 2 | 740 | Evergreen Terrace |
| 3 | 1 | Infinite Loop |
| 4 | NaN | Wayne Manor |
| 5 | 124 | Conch Street |


In [63]:
addresses.str.contains(r'\d')

0     True
1     True
2     True
3     True
4    False
5     True
dtype: bool

In [64]:
addresses[addresses.str.contains(r'\d')]

0                      84 Rainey Street, Arlen, TX
1    4 Privet Drive, Little Whinging, Surrey, U.K.
2               740 Evergreen Terrace, Springfield
3           1 Infinite Loop, Cupertino, California
5                  124 Conch Street, Bikini Bottom
dtype: object

In [65]:
addresses.str.count(r'\d')

0    2
1    1
2    3
3    1
4    0
5    3
dtype: int64

In [66]:
s = pd.Series(['aaa', 'abb', 'aba'])
s

0    aaa
1    abb
2    aba
dtype: object

In [67]:
s.str.replace('a', '')

0      
1    bb
2     b
dtype: object

In [68]:
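# regex=True is required in pandas 2.0+, where the pattern is otherwise treated as a literal string
s.str.replace(r'a(.)', r'\1', regex=True)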

0    aa
1    bb
2    ba
dtype: object
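
Whether `str.replace` treats its first argument as a pattern depends on the `regex` parameter, whose default changed from `True` to `False` in pandas 2.0. A sketch that passes the parameter explicitly, so the result does not depend on the installed version (the series here is a made-up example):

```python
import pandas as pd

s = pd.Series(['a.b', 'axb'])

# Literal replacement: only the exact text "a." is replaced
literal = s.str.replace('a.', '-', regex=False)
# Regex replacement: "." matches any character, so "ax" is replaced too
pattern = s.str.replace('a.', '-', regex=True)

print(literal.tolist())  # ['-b', 'axb']
print(pattern.tolist())  # ['-b', '-b']
```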

## Exercise Solutions

In [83]:
def is_vowel(letter):
    # anchored so that only a single vowel matches, not any string that contains one
    vowel_re = r'^[aeiouAEIOU]$'
    return bool(re.search(vowel_re, letter))

is_vowel('a'), is_vowel('b'), is_vowel('e'), is_vowel('d')

(True, False, True, False)

In [85]:
# A valid username starts with a lowercase letter, and only consists of lowercase letters,
# numbers, or the _ character. It should also be no longer than 32 characters
def is_valid_username(username):
    username_re = r'^[a-z][a-z0-9_]{,31}$' # one lowercase letter, then at most 31 more allowed characters
    return bool(re.search(username_re, username))

usernames = ['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'codeup', 'Codeup', 'codeup123', '1codeup']
print('is_valid_username | username')
print('----------------- | --------')
for username in usernames:
    print('%17s | %s' % (is_valid_username(username), username))

is_valid_username | username
----------------- | --------
            False | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
             True | codeup
            False | Codeup
             True | codeup123
            False | 1codeup
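
One subtlety of the `^...$` approach: `$` also matches just before a trailing newline, so `'codeup\n'` passes the check above. A sketch of a stricter variant using `re.fullmatch`, which requires the pattern to cover the entire string; the function name `is_valid_username_strict` is hypothetical.

```python
import re

def is_valid_username_strict(username):
    # fullmatch succeeds only if the whole string matches, trailing newline included
    return re.fullmatch(r'[a-z][a-z0-9_]{0,31}', username) is not None

# The anchored search accepts a trailing newline, fullmatch does not
loose_accepts_newline = bool(re.search(r'^[a-z][a-z0-9_]{,31}$', 'codeup\n'))
strict_accepts_newline = is_valid_username_strict('codeup\n')

print(loose_accepts_newline)   # True
print(strict_accepts_newline)  # False
```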


In [87]:
phone_numbers = pd.Series([
    '(210) 867 5309',
    '+1 210.867.5309',
    '867-5309',
    '210-867-5309',
])

phone_number_re = re.compile(r'''
^
(?P<calling_code>\+\d+)?
\D*?
(?P<area_code>\d{3})?
\D*?
(?P<first_three>\d{3})
\D*?
(?P<last_four>\d{4})
\D*
$
''', re.VERBOSE)

pd.DataFrame({
    'phone_number': phone_numbers,
    'regexp_matches': phone_numbers.str.match(phone_number_re)
})

| | phone_number | regexp_matches |
|---|---|---|
| 0 | (210) 867 5309 | True |
| 1 | +1 210.867.5309 | True |
| 2 | 867-5309 | True |
| 3 | 210-867-5309 | True |


In [88]:
hl(r'\d{4}$', '(210) 867 5309')

In [89]:
hl(r'\d{4}$', '+1 210.867.5309')

In [72]:
phone_numbers.str.extract(phone_number_re)

| | calling_code | area_code | first_three | last_four |
|---|---|---|---|---|
| 0 | NaN | 210 | 867 | 5309 |
| 1 | +1 | 210 | 867 | 5309 |
| 2 | NaN | NaN | 867 | 5309 |
| 3 | NaN | 210 | 867 | 5309 |


In [73]:
dates = pd.Series([
    '02/04/19',
    '02/05/19',
    '02/06/19',
    '02/07/19',
    '02/08/19',
    '02/09/19',
    '02/10/19',
])

# hint: use dates.str.replace and capture groups

In [74]:
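# regex=True is required in pandas 2.0+, where the pattern is otherwise treated as a literal string
dates.str.replace(r'(\d+)/(\d+)/(\d+)', r'20\3-\1-\2', regex=True)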

0    2019-02-04
1    2019-02-05
2    2019-02-06
3    2019-02-07
4    2019-02-08
5    2019-02-09
6    2019-02-10
dtype: object

In [75]:
# accepts a `Match` object and returns a string
def format_date(m: re.Match) -> str:
    return '20' + m.group('year') + '-' + m.group('month') + '-' + m.group('day')

# a callable replacement only works with regex=True (the default before pandas 2.0)
dates.str.replace(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)', format_date, regex=True)

0    2019-02-04
1    2019-02-05
2    2019-02-06
3    2019-02-07
4    2019-02-08
5    2019-02-09
6    2019-02-10
dtype: object
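
For comparison, a sketch of the same conversion without a regex, using the standard library's `datetime` parsing. This is a different technique from the exercise solution, not a replacement for it; the helper name `to_iso` is hypothetical. Unlike the regex, it also rejects impossible dates by raising a `ValueError`.

```python
from datetime import datetime

def to_iso(date_string):
    # %m/%d/%y parses month/day/two-digit year; %Y-%m-%d writes the ISO form
    return datetime.strptime(date_string, '%m/%d/%y').strftime('%Y-%m-%d')

converted = [to_iso(d) for d in ['02/04/19', '02/10/19']]
print(converted)  # ['2019-02-04', '2019-02-10']
```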

In [94]:
words = pd.read_csv('/usr/share/dict/words', header=None).iloc[:, 0].dropna().str.lower()
words

0                  a
1                  a
2                 aa
3                aal
4              aalii
             ...    
235881        zythem
235882        zythia
235883        zythum
235884       zyzomys
235885    zyzzogeton
Name: 0, Length: 235884, dtype: object

In [95]:
# How many words with at least 3 vowels? 190293
(words.str.count(r'[aeiou]') >= 3).sum()

191365

In [103]:
# How many words have at least 3 vowels in a row? 6156
words.str.match(r'.*[aeiou]{3}').sum()

6182

In [105]:
words[words.str.match(r'^[aeiou]{3,}')]

3041           aeacides
3042             aeacus
3043             aeaean
3109      aeolharmonica
3110             aeolia
              ...      
133272          ouenite
133278            ouija
133279         ouistiti
210115            uaupe
210173        ueueteotl
Name: 0, Length: 78, dtype: object

In [99]:
# How many words have at least 3 vowels in a row? 6156
words.str.contains(r'[aeiou]{3}').sum()

6182
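
The two versions of this count differ because `str.match` only matches at the start of each string (like `re.match`), which is why it needs the leading `.*`, while `str.contains` searches anywhere (like `re.search`). A sketch of that difference with the plain `re` functions and a few sample words chosen for illustration:

```python
import re

three_vowels = re.compile(r'[aeiou]{3}')
sample_words = ['aeon', 'queue', 'sky']

# match: the pattern must match starting at the first character
at_start = {w: bool(three_vowels.match(w)) for w in sample_words}
# search: the pattern may match anywhere in the string
anywhere = {w: bool(three_vowels.search(w)) for w in sample_words}

print(at_start)  # {'aeon': True, 'queue': False, 'sky': False}
print(anywhere)  # {'aeon': True, 'queue': True, 'sky': False}
```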

In [110]:
# How many words have at least 4 consonants in a row? 19743
words.str.contains(r'[^aeiou]{4}').sum()

19241

In [113]:
words[words.str.contains(r'[^aeiou]{8}')].sample(10)

136805    pachyrhynchous
169656       rhythmproof
196749        syncryptic
196748         syncrypta
136610      oxyrhynchous
115190     methylglycine
6467       amblyrhynchus
119715       morthwyrtha
196706    synchytriaceae
13506          arrhythmy
Name: 0, dtype: object

In [119]:
# How many words start and end with the same letter? 9916
words.str.contains(r'^(.).*\1$').sum()

11452

In [115]:
# How many words start and end with a vowel? 12351
words.str.contains(r'^[aeiou].*[aeiou]$').sum()

14657

In [121]:
# How many words contain the same letter 3 times in a row? 7
words[words.str.contains(r'(.)\1\1')]

24988             bossship
50636      demigoddessship
78498          goddessship
82997     headmistressship
140481       patronessship
230262            wallless
231688           whenceeer
Name: 0, dtype: object

In [123]:
# how many words contain the same doubled letter twice?
# e.g. assassin
words.str.contains(r'((.)\2).*\1').sum()

792

In [124]:
# which words contain the same doubled letter twice?
# e.g. assassin
words[words.str.contains(r'((.)\2).*\1')]

895        accessariness
906           accessless
909        accessoriness
2688      admissibleness
3879         agelessness
               ...      
234557       yellowbelly
234559        yellowbill
234834     youthlessness
235076      zeallessness
235683            zoozoo
Name: 0, Length: 792, dtype: object

In [125]:
# can we detect sentences that repeat a word? (a word being 3 or more characters long)
sentences = pd.Series([
    'Mary had a little lamb, little lamb, little lamb',
    'Hello Bayes! It is a fine morning for regular expressions.',
    'The plural of regex is regrets.',
    'Python is a fine, fine language.'
])
sentences

0     Mary had a little lamb, little lamb, little lamb
1    Hello Bayes! It is a fine morning for regular ...
2                      The plural of regex is regrets.
3                     Python is a fine, fine language.
dtype: object

In [133]:
sentences.str.findall(r'\b\w{3,}\b')

0    [Mary, had, little, lamb, little, lamb, little...
1    [Hello, Bayes, fine, morning, for, regular, ex...
2                        [The, plural, regex, regrets]
3                       [Python, fine, fine, language]
dtype: object

In [135]:
sentences[sentences.str.contains(r'(\b\w{3,}\b).*\1')]

0    Mary had a little lamb, little lamb, little lamb
3                    Python is a fine, fine language.
dtype: object
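
In the pattern above, the backreference `\1` is not bounded, so the repeated text may sit inside a longer word. A sketch of a stricter variant that wraps the backreference in `\b`; the test sentence and the names `loose` and `strict` are made up for illustration.

```python
import re

loose = re.compile(r'(\b\w{3,}\b).*\1')       # pattern from the cell above
strict = re.compile(r'\b(\w{3,})\b.*\b\1\b')  # the repeat must be a whole word too

tricky = 'A regex is not the same as regexes.'
repeated = 'Python is a fine, fine language.'

# "regex" reappears only as the start of "regexes"
loose_on_tricky = bool(loose.search(tricky))
strict_on_tricky = bool(strict.search(tricky))
strict_on_repeated = bool(strict.search(repeated))

print(loose_on_tricky)     # True
print(strict_on_tricky)    # False
print(strict_on_repeated)  # True
```

Both patterns are case-sensitive, so "The ... the" would not count as a repeat unless `re.IGNORECASE` is added.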