# Text Wrangling and Regex

Adapted from Lisa Yan, Will Fithian, Joseph Gonzalez, Deborah Nolan, Sam Lau

Updated by Bella Crouch

Working with text: applying string methods and regular expressions

In [36]:
import numpy as np
import pandas as pd
import re

<br/><br/>
<br/>

---

# Real World Case Study: Restaurant Data

In this example, we will show how regular expressions let us track quantitative data across categories defined by keywords that appear in text fields. In particular, we'll ask:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?**
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

In [37]:
# load the violation records and give the columns short names
data = pd.read_csv("violations.csv")
data.columns = ["bid", "date", "dsc"]
data.head()

Unnamed: 0,bid,date,dsc
0,19,20171211,Inadequate food safety knowledge or lack of ce...
1,19,20171211,Unapproved or unmaintained equipment or utensils
2,19,20160513,Unapproved or unmaintained equipment or utensi...
3,19,20160513,Unclean or degraded floors walls or ceilings ...
4,19,20160513,Food safety certificate or food handler card n...


In [38]:
data.shape

(39042, 3)

In [39]:
# count how many distinct descriptions appear
description = data["dsc"]
val_counts = description.value_counts()
val_counts.shape

(14253,)

That's a lot of distinct descriptions! Can we **canonicalize** them at all? Let's explore two sets of 10 rows.

In [40]:
# the first ten descriptions
description[:10]

0    Inadequate food safety knowledge or lack of ce...
1     Unapproved or unmaintained equipment or utensils
2    Unapproved or unmaintained equipment or utensi...
3    Unclean or degraded floors walls or ceilings  ...
4    Food safety certificate or food handler card n...
5                                Improper food storage
6    Unclean or degraded floors walls or ceilings  ...
7    Unclean or degraded floors walls or ceilings  ...
8    Unclean or degraded floors walls or ceilings  ...
9    Food safety certificate or food handler card n...
Name: dsc, dtype: object

In [41]:
# ten descriptions from further down the table
description[90:100]

90    Moderate risk vermin infestation  [ date viola...
91    Wiping cloths not clean or properly stored or ...
92    Low risk vermin infestation  [ date violation ...
93    Improper storage of equipment utensils or line...
94    Improper food storage  [ date violation correc...
95    Improper storage use or identification of toxi...
96    High risk food holding temperature   [ date vi...
97    Unclean or degraded floors walls or ceilings  ...
98                                Improper food storage
99    Inadequately cleaned or sanitized food contact...
Name: dsc, dtype: object

In [42]:
# Use a regular expression to cut out the extra info in square brackets.
# Pattern r'\s*\[.*\]$':
#   \s*     zero or more whitespace characters (spaces, tabs)
#   \[.*\]  an opening bracket, zero or more of any character except newline, then a closing bracket
#   $       the end of the string
# Together, this removes bracketed text (and the whitespace before it) when it appears at the end of the string.
data['clean_desc'] = (data['dsc']
             .str.replace(r'\s*\[.*\]$', '', regex=True)
             .str.strip()   # remove any leading and trailing whitespace
             .str.lower())  # lowercase so that case differences don't create distinct values
data.head()

Unnamed: 0,bid,date,dsc,clean_desc
0,19,20171211,Inadequate food safety knowledge or lack of ce...,inadequate food safety knowledge or lack of ce...
1,19,20171211,Unapproved or unmaintained equipment or utensils,unapproved or unmaintained equipment or utensils
2,19,20160513,Unapproved or unmaintained equipment or utensi...,unapproved or unmaintained equipment or utensils
3,19,20160513,Unclean or degraded floors walls or ceilings ...,unclean or degraded floors walls or ceilings
4,19,20160513,Food safety certificate or food handler card n...,food safety certificate or food handler card n...
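
To see the cleaning pattern in isolation, here is a minimal sketch on a toy Series. The strings in `raw` are hypothetical (the bracketed notes are made up for illustration and are not taken from `violations.csv`); only the pattern and the method chain come from the cell above.

```python
import pandas as pd

# Hypothetical descriptions; the bracketed text is invented for illustration.
raw = pd.Series([
    "Improper food storage  [ example note ]",
    "Improper food storage",
    "  Low risk vermin infestation [ another note ]",
])

cleaned = (raw
           .str.replace(r"\s*\[.*\]$", "", regex=True)  # drop a trailing "[...]" and the whitespace before it
           .str.strip()                                  # trim leading and trailing whitespace
           .str.lower())                                 # normalize case

print(cleaned.tolist())
print(cleaned.nunique())  # the first two rows collapse into one canonical value
```

The first two strings differ only by the bracketed suffix, so they become the same value after cleaning, which is the effect we rely on to shrink the number of distinct descriptions.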


In [43]:
# canonicalizing definitely helped
len(data["clean_desc"].value_counts())

68

In [44]:
# the most common canonicalized descriptions
data["clean_desc"].value_counts().head()

clean_desc
unclean or degraded floors walls or ceilings               3507
moderate risk food holding temperature                     2542
inadequate and inaccessible handwashing facilities         2529
unapproved or unmaintained equipment or utensils           2382
inadequately cleaned or sanitized food contact surfaces    2301
Name: count, dtype: int64

Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?**
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

Below, we use regular expressions and `df.assign()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html?highlight=assign#pandas.DataFrame.assign)) to **method chain** our creation of new boolean features, one per keyword.

In [45]:
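# use regular expressions to assign new boolean features for the presence of various keywords
# the regex metacharacter | means "or"; any spaces around it are literal parts of the pattern
with_features = (data
 .assign(is_unclean   = data['clean_desc'].str.contains('clean'))
 .assign(is_high_risk = data['clean_desc'].str.contains('high risk'))
 .assign(is_vermin    = data['clean_desc'].str.contains('vermin'))
 # matches 'surface ' or ' floor' (spaces included), so 'surfaces' on its own is not caught
 .assign(is_surface   = data['clean_desc'].str.contains('surface | floor'))
 # the empty alternative after | matches every string, so is_human is True on every row
 .assign(is_human     = data['clean_desc'].str.contains('human |'))
 # matches 'permit ' or ' allow' (spaces included)
 .assign(is_permit    = data['clean_desc'].str.contains('permit | allow'))
)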


with_features.head()

Unnamed: 0,bid,date,dsc,clean_desc,is_unclean,is_high_risk,is_vermin,is_surface,is_human,is_permit
0,19,20171211,Inadequate food safety knowledge or lack of ce...,inadequate food safety knowledge or lack of ce...,False,False,False,False,True,False
1,19,20171211,Unapproved or unmaintained equipment or utensils,unapproved or unmaintained equipment or utensils,False,False,False,False,True,False
2,19,20160513,Unapproved or unmaintained equipment or utensi...,unapproved or unmaintained equipment or utensils,False,False,False,False,True,False
3,19,20160513,Unclean or degraded floors walls or ceilings ...,unclean or degraded floors walls or ceilings,True,False,False,True,True,False
4,19,20160513,Food safety certificate or food handler card n...,food safety certificate or food handler card n...,False,False,False,False,True,False
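
When writing alternations with `|`, whitespace and empty alternatives change what matches. The sketch below uses three descriptions of the kind seen in `clean_desc` to compare patterns. The names `spaced`, `tight`, and `empty_alt` are illustrative only, and `'surface|floor'` is shown as one possible tighter pattern, not as the keyword list the analysis must use.

```python
import pandas as pd

desc = pd.Series([
    "unclean or degraded floors walls or ceilings",
    "inadequately cleaned or sanitized food contact surfaces",
    "improper food storage",
])

# Spaces are literal: this looks for "surface " or " floor",
# so "surfaces" at the end of a string is missed.
spaced = desc.str.contains("surface | floor")

# Without the spaces, either substring anywhere in the string is enough.
tight = desc.str.contains("surface|floor")

# An empty alternative matches the empty string, which occurs in every string.
empty_alt = desc.str.contains("human |")

print(pd.DataFrame({"spaced": spaced, "tight": tight, "empty_alt": empty_alt}))
```

A column that is `True` on every row, like `empty_alt` here, carries no information about the keyword, so it is worth checking `value_counts()` on each new boolean feature before using it.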


<br/><br/>

### EDA

That's the end of our text wrangling. Now let's analyze restaurant health as a function of the number of violation keywords.

To do so, we'll first group so that our **granularity** is one inspection for a business on a particular date. Summing the boolean columns within each group effectively counts the number of violations by keyword for a given inspection.

In [46]:
# sum the boolean keyword columns within each (bid, date) inspection
count_features = (with_features
 .groupby(['bid', 'date']).sum(numeric_only=True).reset_index())
count_features.iloc[255:260, :]


Unnamed: 0,bid,date,is_unclean,is_high_risk,is_vermin,is_surface,is_human,is_permit
255,489,20150728,5,0,2,1,12,0
256,489,20150807,1,0,0,0,3,0
257,489,20160308,2,2,1,0,6,0
258,489,20160721,2,1,1,0,7,1
259,489,20161220,3,0,1,1,7,0
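
The grouping step works because booleans sum as 0 and 1. Here is a minimal sketch on a hypothetical toy frame (`toy` and its values are invented for illustration) using the same `groupby(...).sum(numeric_only=True)` call:

```python
import pandas as pd

# Hypothetical toy data: one row per violation, as in with_features.
toy = pd.DataFrame({
    "bid":        [1, 1, 1, 2],
    "date":       [20200101, 20200101, 20200202, 20200101],
    "dsc":        ["a", "b", "c", "d"],
    "is_vermin":  [True, False, True, False],
    "is_unclean": [True, True, False, False],
})

# True counts as 1 and False as 0, so the sum is a per-inspection count.
# numeric_only=True drops the text column "dsc" instead of concatenating strings.
counts = toy.groupby(["bid", "date"]).sum(numeric_only=True).reset_index()
print(counts)
```

Each row of `counts` is one inspection, and each `is_*` value is the number of that inspection's violations whose description matched the keyword.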


Check out our new dataframe in action:

In [47]:
count_features.sample(5,replace=True)

Unnamed: 0,bid,date,is_unclean,is_high_risk,is_vermin,is_surface,is_human,is_permit
9693,79794,20170809,0,0,0,0,2,0
11893,89643,20170118,2,0,0,0,2,0
9161,77755,20160527,1,1,0,0,6,0
9484,78962,20150213,1,0,0,0,6,0
822,1566,20160301,1,1,0,1,4,0


Now we'll reshape this "wide" table into a "tidy" table using `pd.melt` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=pd%20melt)). We won't describe it in detail, beyond noting that it is effectively the inverse of `pd.pivot_table`: each feature column becomes a set of rows labeled by the column's name.

Our **granularity** is now a violation type for a given inspection (for a business on a particular date).
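
As a rough illustration of the wide-to-tidy reshape, the sketch below melts a hypothetical two-inspection table (`wide` and its values are invented) and then pivots it back. Using `aggfunc="sum"` in the round trip is a choice made here to keep integer counts; it is not something the lecture specifies.

```python
import pandas as pd

# Hypothetical wide table: one row per inspection, one column per keyword count.
wide = pd.DataFrame({
    "bid":        [1, 2],
    "date":       [20200101, 20200102],
    "is_vermin":  [1, 0],
    "is_unclean": [2, 3],
})

# Tidy form: one row per (inspection, feature) pair.
tidy = pd.melt(wide, id_vars=["bid", "date"],
               var_name="feature", value_name="num_vios")
print(tidy)

# pivot_table undoes the melt: feature names become columns again.
back = (tidy
        .pivot_table(index=["bid", "date"], columns="feature",
                     values="num_vios", aggfunc="sum")
        .reset_index())
print(back)
```

The tidy table has one row for every combination of inspection and feature, so its length is the number of inspections times the number of feature columns.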

In [48]:
# melt to one row per (inspection, keyword feature) pair
total_violations = pd.melt(count_features, id_vars=['bid', 'date'],
            var_name='feature', value_name='num_vios')

# show a particular inspection's results
total_violations[(total_violations['bid'] == 489) & (total_violations['date'] == 20150728)]

Unnamed: 0,bid,date,feature,num_vios
255,489,20150728,is_unclean,5
12517,489,20150728,is_high_risk,0
24779,489,20150728,is_vermin,2
37041,489,20150728,is_surface,1
49303,489,20150728,is_human,12
61565,489,20150728,is_permit,0


Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?**
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

We have the second half of this question! Now let's **join** our table with the inspection scores, located in `inspections.csv`.

In [49]:
# read in the scores
# (assumes inspections.csv sits in the working directory, like violations.csv)
df = pd.read_csv('inspections.csv',
                  header=0,
                  usecols=[0, 1, 2],
                  names=['bid', 'score', 'date'])
df.head()

Unnamed: 0,bid,score,date
0,19,94,20160513
1,19,94,20171211
2,24,98,20171101
3,24,98,20161005
4,24,96,20160311


While the inspection scores were stored in a separate file from the violation descriptions, we notice that the **primary key** in inspections is (`bid`, `date`)! So we can reference this key in our join.
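
A join on a composite key can be sketched as follows. The frames `vios` and `scores` are hypothetical stand-ins with the same column names as the tables in this notebook. The `validate="many_to_one"` argument is an optional extra added here, which asks pandas to raise an error if (`bid`, `date`) is not unique in the scores table.

```python
import pandas as pd

# Hypothetical stand-in for the melted violations table.
vios = pd.DataFrame({
    "bid":      [1, 1, 2],
    "date":     [20200101, 20200101, 20200102],
    "feature":  ["is_vermin", "is_unclean", "is_vermin"],
    "num_vios": [1, 2, 0],
})

# Hypothetical stand-in for the inspection scores.
scores = pd.DataFrame({
    "bid":   [1, 2],
    "score": [90, 98],
    "date":  [20200101, 20200102],
})

# Many violation rows map to one score row per (bid, date).
joined = vios.merge(scores, on=["bid", "date"], validate="many_to_one")
print(joined)
```

Because each inspection has a single score, every feature row for that inspection receives the same `score` value after the join.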

In [50]:
# join the per-keyword counts (total_violations) with the scores (df) on the (bid, date) key
violation_df = total_violations.merge(df, on=['bid', 'date'])
violation_df.head(12)


# This case study took a long time to work through, but I learned a lot of new things from it.