## Introduction to Regular Expressions

### Solutions to Exercises

#### DATA 601


1. Listed below are some regular expressions (some we have already seen). What kinds of strings will they match? Try them out on sample strings.

    - `r"[,\r\n]+"`
    - `r"([a-zA-Z0-9_]+)@"`
    - `r"\(\d\d\d\) \d\d\d-?\d\d\d\d"`
    - `r"[0-9]?[0-9]:[0-9][0-9]"`
    - `r"^.+"` (assume matching at the beginning of a string)





In [4]:
import re

exp = r"[,\r\n]+"

# One or more commas, carriage returns, or newlines in a row.
# Can be used to split CSV text into its individual fields.

test1 = "a,b,c,d,message,3.1412"
test2 = """a
b, test
c
d
message
3.1412"""
#print(re.split(exp, test1))
print(re.split(exp, test2))

['a', 'b', ' test', 'c', 'd', 'message', '3.1412']
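The `+` quantifier matters here: it treats a run of adjacent delimiters (for example the `\r\n` pair in Windows line endings, or two commas in a row) as a single separator. A minimal sketch comparing the pattern with and without `+`; the sample string is made up for illustration:

```python
import re

# Hypothetical sample: an empty field (",,") and a Windows line ending ("\r\n").
sample = "a,,b\r\nc"

with_plus = re.split(r"[,\r\n]+", sample)    # a run of delimiters counts as one separator
without_plus = re.split(r"[,\r\n]", sample)  # every delimiter character splits on its own

print(with_plus)     # ['a', 'b', 'c']
print(without_plus)  # ['a', '', 'b', '', 'c']
```

Collapsing runs also discards empty fields, so this pattern suits quick tokenizing rather than faithful CSV parsing.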


In [7]:
#exp = r"([a-zA-Z0-9_]+)@"
exp = r"(\w+)@"
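# Capture the run of word characters immediately before the '@' sign.
# For simple addresses this is the whole user name; note that in Python 3
# \w also matches non-ASCII letters, so it is slightly broader than [a-zA-Z0-9_].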
test1 = "super_cool@email.address"
test2 = "user2000@does.not.matter"

print(re.findall(exp, test1))
print(re.findall(exp, test2))

['super_cool']
['user2000']
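The two patterns in this cell are close but not identical, and neither captures a local part that contains dots. The sketch below illustrates both points; the addresses are hypothetical and chosen only to expose the differences:

```python
import re

dotted = "first.last@example.com"   # hypothetical: dot in the local part
accented = "jos\u00e9@example.com"  # hypothetical: non-ASCII letter (é)

# Only the word characters directly before '@' are captured.
dotted_match = re.findall(r"(\w+)@", dotted)

# \w is Unicode-aware for str patterns; the explicit class is ASCII-only.
unicode_match = re.findall(r"(\w+)@", accented)
explicit_match = re.findall(r"([a-zA-Z0-9_]+)@", accented)
ascii_flag_match = re.findall(r"(\w+)@", accented, re.ASCII)

print(dotted_match)      # ['last']
print(unicode_match)     # ['josé']
print(explicit_match)    # []
print(ascii_flag_match)  # []
```

Passing `re.ASCII` makes `\w` behave like `[a-zA-Z0-9_]`, which is one way to keep the shorter pattern while matching the original character class exactly.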


In [9]:
exp = r"\(\d\d\d\) ?\d\d\d-?\d\d\d\d"

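# Matches phone numbers written with a parenthesized three-digit area code.
# The space after the area code is made optional here (the exercise pattern
# requires it), and the hyphen between the last two groups is optional too.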

test1 = "(123) 456-7890"
test2 = "(123) 4567890"
test3 = "(123)4567890"
print(re.findall(exp, test1))
print(re.findall(exp, test2))
print(re.findall(exp, test3))

['(123) 456-7890']
['(123) 4567890']
['(123)4567890']
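Repeated `\d` tokens can be written with counted quantifiers, which some readers find easier to scan. The sketch below checks that the compact form accepts the same strings as the pattern above; the fourth sample (no parentheses) is an added, hypothetical negative case:

```python
import re

verbose = r"\(\d\d\d\) ?\d\d\d-?\d\d\d\d"
compact = r"\(\d{3}\) ?\d{3}-?\d{4}"  # same pattern using {n} quantifiers

samples = ["(123) 456-7890", "(123) 4567890", "(123)4567890", "123-456-7890"]

# fullmatch requires the whole string to match, not just a substring.
results = {s: bool(re.fullmatch(compact, s)) for s in samples}
agree = all(
    bool(re.fullmatch(verbose, s)) == bool(re.fullmatch(compact, s))
    for s in samples
)

for s, ok in results.items():
    print(f"{s!r}: {ok}")
print("patterns agree:", agree)
```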


In [11]:
exp = r"[0-9]?[0-9]:[0-9][0-9]"

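# Matches times written as H:MM or HH:MM. The pattern only checks the shape,
# not the value ranges, so an invalid time such as 55:79 also matches.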
test1 = "12:36"
test2 = "2:54"
test3 = "55:79"
print(re.findall(exp, test1))
print(re.findall(exp, test2))
print(re.findall(exp, test3))

['12:36']
['2:54']
['55:79']
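If invalid values such as `55:79` should be rejected, the digit classes can be narrowed. The following is one possible stricter pattern for 24-hour times, offered as a sketch rather than the intended exercise answer:

```python
import re

# Hours 0-23 (optional leading zero), minutes 00-59.
strict = r"(?:[01]?\d|2[0-3]):[0-5]\d"

samples = ["12:36", "2:54", "23:59", "55:79", "24:00"]

# fullmatch ensures the entire string is a valid time.
valid = {s: bool(re.fullmatch(strict, s)) for s in samples}

for s, ok in valid.items():
    print(f"{s}: {ok}")
```

The first three samples are accepted and the last two are rejected, because `55` and `24` fall outside the hour range the pattern allows.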


In [12]:
exp = r"^.+"

# Without re.MULTILINE, ^ anchors only at the start of the string, and . does
# not match a newline, so this grabs the first line of a multi-line string.

test1 = """This is the first line.
This is the second line.
This is the third line."""
print(re.findall(exp, test1))

['This is the first line.']
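The result depends on how `^` is interpreted. With the `re.MULTILINE` flag, `^` also matches right after each newline, so the same pattern returns every line. A short sketch of the difference, reusing the test string from the cell above:

```python
import re

text = """This is the first line.
This is the second line.
This is the third line."""

first_only = re.findall(r"^.+", text)                # ^ matches only at position 0
every_line = re.findall(r"^.+", text, re.MULTILINE)  # ^ matches at each line start

print(first_only)
print(every_line)
```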


2. Recall the Calgary Historical Rainfall dataset. Use a regular expression to parse the month and day out of the `TIMESTAMP` column into separate columns.

In [2]:
import pandas as pd

fileloc = "Historical_Rainfall_20240201.csv"
rdata = pd.read_csv(fileloc)
display(rdata.head())
print(rdata.dtypes)

|   | CHANNEL | NAME | YEAR | TIMESTAMP | RAINFALL | ID |
|---|---------|------|------|-----------|----------|----|
| 0 | 44 | Forest Lawn Creek | 2021 | 2021/05/01 02:40:00 PM | 0.2 | 2021-05-01T14:40:00-44 |
| 1 | 48 | Seton | 2021 | 2021/05/01 03:35:00 PM | 0.2 | 2021-05-01T15:35:00-48 |
| 2 | 17 | Windsor Park | 2021 | 2021/05/01 03:40:00 PM | 0.2 | 2021-05-01T15:40:00-17 |
| 3 | 17 | Windsor Park | 2021 | 2021/05/01 03:45:00 PM | 0.2 | 2021-05-01T15:45:00-17 |
| 4 | 18 | Cedarbrae | 2021 | 2021/05/01 03:45:00 PM | 0.2 | 2021-05-01T15:45:00-18 |


CHANNEL        int64
NAME          object
YEAR           int64
TIMESTAMP     object
RAINFALL     float64
ID            object
dtype: object


In [15]:
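# Named groups become the column names of the extracted DataFrame.
exp = r"^\d\d\d\d/(?P<MONTH>\d\d)/(?P<DAY>\d\d)"
# extractall returns one row per match, indexed by (original row, match number).
df = rdata['TIMESTAMP'].str.extractall(exp)
display(df.head())

# The pattern is anchored with ^, so each row yields at most one match.
# Dropping the MultiIndex gives a 0..n-1 index that lines up with rdata,
# provided every row matched.
df.reset_index(drop=True, inplace=True)
display(df.head())

# MONTH and DAY are stored as strings (e.g. '05'), not integers.
rdata['MONTH'] = df['MONTH']
rdata['DAY'] = df['DAY']
rdata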

|   | match | MONTH | DAY |
|---|-------|-------|-----|
| 0 | 0 | 05 | 01 |
| 1 | 0 | 05 | 01 |
| 2 | 0 | 05 | 01 |
| 3 | 0 | 05 | 01 |
| 4 | 0 | 05 | 01 |


|   | MONTH | DAY |
|---|-------|-----|
| 0 | 05 | 01 |
| 1 | 05 | 01 |
| 2 | 05 | 01 |
| 3 | 05 | 01 |
| 4 | 05 | 01 |


|   | CHANNEL | NAME | YEAR | TIMESTAMP | RAINFALL | ID | MONTH | DAY |
|---|---------|------|------|-----------|----------|----|-------|-----|
| 0 | 44 | Forest Lawn Creek | 2021 | 2021/05/01 02:40:00 PM | 0.2 | 2021-05-01T14:40:00-44 | 05 | 01 |
| 1 | 48 | Seton | 2021 | 2021/05/01 03:35:00 PM | 0.2 | 2021-05-01T15:35:00-48 | 05 | 01 |
| 2 | 17 | Windsor Park | 2021 | 2021/05/01 03:40:00 PM | 0.2 | 2021-05-01T15:40:00-17 | 05 | 01 |
| 3 | 17 | Windsor Park | 2021 | 2021/05/01 03:45:00 PM | 0.2 | 2021-05-01T15:45:00-17 | 05 | 01 |
| 4 | 18 | Cedarbrae | 2021 | 2021/05/01 03:45:00 PM | 0.2 | 2021-05-01T15:45:00-18 | 05 | 01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 962658 | 2 | Silver Springs | 2020 | 2020/09/30 05:45:00 AM | 0.2 | 2020-09-30T05:45:00-02 | 09 | 30 |
| 962659 | 3 | Edgemont | 2020 | 2020/09/30 05:45:00 AM | 0.2 | 2020-09-30T05:45:00-03 | 09 | 30 |
| 962660 | 1 | Spy Hill | 2020 | 2020/09/30 05:50:00 AM | 0.2 | 2020-09-30T05:50:00-01 | 09 | 30 |
| 962661 | 5 | University | 2020 | 2020/09/30 05:55:00 AM | 0.2 | 2020-09-30T05:55:00-05 | 09 | 30 |
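Because the pattern can match at most once per row, `str.extract` is a possible alternative to `str.extractall`: it keeps the original index, so no `reset_index` step is needed. The sketch below uses a tiny stand-in DataFrame (two timestamps copied from the output above) instead of the full CSV, and converts the captured strings to integers, which is an optional extra step not taken in the solution:

```python
import pandas as pd

# Stand-in for rdata: two TIMESTAMP values copied from the dataset preview.
sample = pd.DataFrame(
    {"TIMESTAMP": ["2021/05/01 02:40:00 PM", "2020/09/30 05:45:00 AM"]}
)

exp = r"^\d{4}/(?P<MONTH>\d{2})/(?P<DAY>\d{2})"

# extract returns one row per input row, with one column per named group.
parts = sample["TIMESTAMP"].str.extract(exp)

# Convert '05' -> 5 and attach the new columns by index.
sample = sample.join(parts.astype(int))

# Cross-check against pandas' own datetime parsing.
parsed = pd.to_datetime(sample["TIMESTAMP"], format="%Y/%m/%d %I:%M:%S %p")
months_agree = (parsed.dt.month == sample["MONTH"]).all()
days_agree = (parsed.dt.day == sample["DAY"]).all()

print(sample)
print(months_agree, days_agree)
```

Keeping the values as strings, as the solution does, preserves the leading zeros; converting to integers is more convenient for grouping and sorting.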
