# Yelp Businesses: Cleaning and Wrangling

The objective of this notebook is to inspect and wrangle the `business.json` file from the Yelp dataset.
At each feature extraction/cleaning step, the data is saved in a separate csv file named `business_feature.csv` so that we can trace each file back to its origin. This also avoids ending up with a massive dataframe with too many features.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
from collections import Counter, OrderedDict
import calendar

%matplotlib inline

# Load + Assess

In [2]:
#location of file
business_dir = 'data/business.json'

In [3]:
#load data
df_bus = pd.read_json(business_dir, orient='columns',lines=True)
#head
df_bus.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,postal_code,review_count,stars,state
0,2818 E Camino Acequia Drive,{'GoodForKids': 'False'},1SWheh84yJXfytovILXOAQ,"Golf, Active Life",Phoenix,,0,33.522143,-112.018481,Arizona Biltmore Golf Club,85016,5,3.0,AZ
1,30 Eglinton Avenue W,"{'RestaurantsReservations': 'True', 'GoodForMe...",QXAEGFB4oINsVuTFxEYKFQ,"Specialty Food, Restaurants, Dim Sum, Imported...",Mississauga,"{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",1,43.605499,-79.652289,Emerald Chinese Restaurant,L5R 3E7,128,2.5,ON
2,"10110 Johnston Rd, Ste 15","{'GoodForKids': 'True', 'NoiseLevel': 'u'avera...",gnKjwL_1w79qoiV3IC_xQQ,"Sushi Bars, Restaurants, Japanese",Charlotte,"{'Monday': '17:30-21:30', 'Wednesday': '17:30-...",1,35.092564,-80.859132,Musashi Japanese Restaurant,28210,170,4.0,NC
3,"15655 W Roosevelt St, Ste 237",,xvX2CttrVhyG2z1dFg_0xw,"Insurance, Financial Services",Goodyear,"{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...",1,33.455613,-112.395596,Farmers Insurance - Paul Lorenz,85338,3,5.0,AZ
4,"4209 Stuart Andrew Blvd, Ste F","{'BusinessAcceptsBitcoin': 'False', 'ByAppoint...",HhyxOkGAM07SRYtlQ4wMFQ,"Plumbing, Shopping, Local Services, Home Servi...",Charlotte,"{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...",1,35.190012,-80.887223,Queen City Plumbing,28217,4,4.0,NC


In [4]:
#shape
df_bus.shape

(192609, 14)

In [5]:
#data types
df_bus.dtypes

address          object
attributes       object
business_id      object
categories       object
city             object
hours            object
is_open           int64
latitude        float64
longitude       float64
name             object
postal_code      object
review_count      int64
stars           float64
state            object
dtype: object

In [6]:
#quick stats
df_bus.describe()

Unnamed: 0,is_open,latitude,longitude,review_count,stars
count,192609.0,192609.0,192609.0,192609.0,192609.0
mean,0.82304,38.541803,-97.594785,33.538962,3.585627
std,0.381635,4.941964,16.697725,110.135224,1.018458
min,0.0,33.204642,-115.493471,3.0,1.0
25%,1.0,33.637408,-112.274677,4.0,3.0
50%,1.0,36.144815,-111.759323,9.0,3.5
75%,1.0,43.602989,-79.983614,25.0,4.5
max,1.0,51.299943,-72.911982,8348.0,5.0


In [7]:
#fraction of missing values per column
df_bus.isna().mean()

address         0.000000
attributes      0.149713
business_id     0.000000
categories      0.002502
city            0.000000
hours           0.232751
is_open         0.000000
latitude        0.000000
longitude       0.000000
name            0.000000
postal_code     0.000000
review_count    0.000000
stars           0.000000
state           0.000000
dtype: float64

# Cleaning to-do list

Evaluating the business dataframe above, there are several features that need to be cleaned.
The list below offers a roadmap for addressing these issues, although it might not be comprehensive. We might need to add steps along the way.
Some columns hold nested data (dicts), and some values might not be stored with the appropriate data type.


- address
    - make everything lower case
    - extract feature: if on road/boulevard/ave/etc...
- attributes
    - break up dict to dummy variables
- business_id
    - no changes
- categories
    - make everything lower?
    - dummy variables and split by comma character
    - note that not everything is a restaurant (plumbers)
- city
    - maybe lower case?
- hours
    - split dict by days
        - open hour monday
        - close hour monday
        - etc...
    - figure out placeholder value for None
    - check if correlation between closed restaurant and no hours posted
- is_open
    - no changes
    - 82% are open, 18% are dead businesses
- latitude
    - no changes
- longitude
    - no changes
- name
    - no changes
- postal_code
    - note: some zips are canadian
- review_count
    - note that lowest value is 3
- stars
    - no changes
- state
    - some are canadian
    - add feature: is in USA yes/no
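
One item on the list, the relation between closed businesses and missing hours, can be checked with a simple cross-tabulation. The snippet below is only a sketch on a made-up stand-in for `df_bus` (the `toy_bus` frame and its values are assumptions for illustration, not results from the dataset).

```python
import pandas as pd

# Toy stand-in for df_bus (values are made up for illustration)
toy_bus = pd.DataFrame({
    "is_open": [0, 1, 1, 0, 1],
    "hours": [None, {"Monday": "9:0-17:0"}, {"Monday": "8:0-17:0"}, None, None],
})

# 1 if the business posted any hours, 0 otherwise
toy_bus["has_hours"] = toy_bus["hours"].notna().astype(int)

# Share of businesses with/without hours inside each is_open group
hours_by_status = pd.crosstab(toy_bus["is_open"], toy_bus["has_hours"], normalize="index")
print(hours_by_status)
```

On the real data the same two lines would run against `df_bus` directly; a large gap between the two rows would suggest that missing hours carry information about closure.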

In [8]:
#mapping dict for replacing and fixing data types
bool_to_int = {True: 1, False: 0, np.nan: 0, 'True': 1, 'False': 0, 'None': 0, None: 0}

## address

In [9]:
#create deep copy
df_bus_adr = df_bus.copy()
#make everything lower case
df_bus_adr['address'] = df_bus_adr['address'].str.lower()
#remove punctuation
df_bus_adr['address'] = df_bus_adr['address'].str.replace(r'[^\w\s]', '', regex=True)

In [10]:
#define counter object
adr_counter = Counter()
#loop over every address entry
for add in df_bus_adr.address:
    #loop over each individual word
    for word in add.lower().split():
        #add word to counter
        adr_counter[word] +=1

In [11]:
#list top k words
adr_counter.most_common(100)

[('rd', 46230),
 ('ste', 40472),
 ('w', 29488),
 ('e', 28551),
 ('ave', 23643),
 ('n', 21687),
 ('st', 21172),
 ('s', 20076),
 ('blvd', 18235),
 ('street', 14378),
 ('dr', 12894),
 ('avenue', 9499),
 ('road', 5760),
 ('unit', 5336),
 ('pkwy', 4724),
 ('rue', 4267),
 ('vegas', 3977),
 ('las', 3924),
 ('boulevard', 3185),
 ('100', 3021),
 ('suite', 2921),
 ('yonge', 2907),
 ('sw', 2655),
 ('drive', 2625),
 ('main', 2437),
 ('hwy', 2379),
 ('101', 2378),
 ('1', 2367),
 ('bell', 2293),
 ('queen', 2174),
 ('school', 1971),
 ('scottsdale', 1953),
 ('se', 1890),
 ('center', 1860),
 ('way', 1740),
 ('nw', 1715),
 ('valley', 1657),
 ('ln', 1642),
 ('park', 1585),
 ('110', 1535),
 ('2', 1497),
 ('eastern', 1491),
 ('sahara', 1477),
 ('dundas', 1466),
 ('lake', 1440),
 ('camelback', 1433),
 ('west', 1433),
 ('7', 1431),
 ('university', 1412),
 ('bloor', 1409),
 ('rainbow', 1403),
 ('indian', 1359),
 ('charleston', 1359),
 ('ne', 1351),
 ('105', 1326),
 ('baseline', 1325),
 ('102', 1313),
 ('east'

In [12]:
#road type mapping to homogenize road names
#keys are regex patterns; \b limits each match to a whole word
road_type_dict = {r'\brd\b': 'road', r'\brue\b': 'road', r'\bavenue\b': 'ave',
                  r'\bstreet\b': 'str', r'\bblvd\b': 'boulevard',
                  r'\bdrive\b': 'dr', r'\bhighway\b': 'hwy',
                  r'\bparkway\b': 'pkwy', r'\bcenter\b': 'ct', r'\blane\b': 'ln'}

#replace names
df_bus_adr['address'] = df_bus_adr['address'].replace(road_type_dict, regex=True)

In [13]:
#get list of finalized road values
#set to remove duplicates
road_types_list = list(set(road_type_dict.values()))
print(road_types_list)

['dr', 'ave', 'str', 'ct', 'hwy', 'road', 'ln', 'pkwy', 'boulevard']


In [14]:
#dict for dummies
road_col_dict = {}
#iterate over road types
for road in road_types_list:
    #create a dummy for that type
    dum_col = df_bus_adr['address'].str.contains(road)
    #add it to the dict
    road_col_dict[road] = dum_col

#convert boolean to 1/0
road_type_df = pd.DataFrame.from_dict(road_col_dict).replace({False:0, True:1})

road_type_df.head()

Unnamed: 0,dr,ave,str,ct,hwy,road,ln,pkwy,boulevard
0,1,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,1
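
Note that row 4 is flagged as `dr` even though its address is on a boulevard: `str.contains` matches substrings, and "andrew" contains "dr". A possible refinement, sketched below under our own assumptions, is to match whole words only. The helper `road_type_dummies` and the `road_aliases` mapping are hypothetical names, and the addresses are the previewed rows after lower-casing and punctuation removal.

```python
import pandas as pd

# Addresses from the preview above, lower-cased with punctuation removed
addresses = pd.Series([
    "4209 stuart andrew blvd ste f",
    "2818 e camino acequia drive",
    "30 eglinton avenue w",
])

# Hypothetical mapping: canonical road type -> spellings that count as that type
road_aliases = {
    "dr": ["dr", "drive"],
    "ave": ["ave", "avenue"],
    "boulevard": ["blvd", "boulevard"],
}

def road_type_dummies(series, aliases):
    """Return one 0/1 column per road type, matching whole words only."""
    cols = {}
    for road, words in aliases.items():
        # \b anchors the match at word boundaries, so 'dr' no longer matches 'andrew'
        pattern = r"\b(?:" + "|".join(words) + r")\b"
        cols[road] = series.str.contains(pattern, regex=True).astype(int)
    return pd.DataFrame(cols)

dummies = road_type_dummies(addresses, road_aliases)
print(dummies)
```

With this variant the first address is flagged only as `boulevard`. The trade-off is that every accepted spelling has to be listed explicitly.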


In [15]:
#key for reference in case we want to join tables
road_type_df['business_id'] = df_bus['business_id']

In [16]:
#save the work
road_type_df.to_csv(path_or_buf='data/cleaned/business_roadtype.csv')

## attributes

In [17]:
#break up dict inside df
df_atr = df_bus['attributes'].apply(pd.Series)
df_atr.head(10)

Unnamed: 0,GoodForKids,RestaurantsReservations,GoodForMeal,BusinessParking,Caters,NoiseLevel,RestaurantsTableService,RestaurantsTakeOut,RestaurantsPriceRange2,OutdoorSeating,...,BYOBCorkage,DriveThru,Smoking,AgesAllowed,HairSpecializesIn,Corkage,BYOB,DietaryRestrictions,Open24Hours,RestaurantsCounterService
0,False,,,,,,,,,,...,,,,,,,,,,
1,True,True,"{'dessert': False, 'latenight': False, 'lunch'...","{'garage': False, 'street': False, 'validated'...",True,u'loud',True,True,2.0,False,...,,,,,,,,,,
2,True,True,"{'dessert': False, 'latenight': False, 'lunch'...","{'garage': False, 'street': False, 'validated'...",False,u'average',True,True,2.0,False,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,"{'garage': False, 'street': False, 'validated'...",,,,,2.0,,...,,,,,,,,,,
6,,,,"{'garage': False, 'street': False, 'validated'...",,,,,2.0,,...,,,,,,,,,,
7,True,,,,,,,,3.0,,...,,,,,,,,,,
8,,,,"{'garage': False, 'street': False, 'validated'...",,,,,2.0,,...,,,,,,,,,,
9,,,,"{'garage': False, 'street': False, 'validated'...",,,,,2.0,,...,,,,,,,,,,


In [18]:
#inspect types
df_atr.dtypes

GoodForKids                   object
RestaurantsReservations       object
GoodForMeal                   object
BusinessParking               object
Caters                        object
NoiseLevel                    object
RestaurantsTableService       object
RestaurantsTakeOut            object
RestaurantsPriceRange2        object
OutdoorSeating                object
BikeParking                   object
Ambience                      object
HasTV                         object
WiFi                          object
Alcohol                       object
RestaurantsAttire             object
RestaurantsGoodForGroups      object
RestaurantsDelivery           object
BusinessAcceptsCreditCards    object
BusinessAcceptsBitcoin        object
ByAppointmentOnly             object
AcceptsInsurance              object
Music                         object
GoodForDancing                object
CoatCheck                     object
HappyHour                     object
BestNights                    object
W

In [19]:
#select cols that are still as objects
df_atr_obj_cols = df_atr.select_dtypes(include='object').columns
print(df_atr_obj_cols)

Index(['GoodForKids', 'RestaurantsReservations', 'GoodForMeal',
       'BusinessParking', 'Caters', 'NoiseLevel', 'RestaurantsTableService',
       'RestaurantsTakeOut', 'RestaurantsPriceRange2', 'OutdoorSeating',
       'BikeParking', 'Ambience', 'HasTV', 'WiFi', 'Alcohol',
       'RestaurantsAttire', 'RestaurantsGoodForGroups', 'RestaurantsDelivery',
       'BusinessAcceptsCreditCards', 'BusinessAcceptsBitcoin',
       'ByAppointmentOnly', 'AcceptsInsurance', 'Music', 'GoodForDancing',
       'CoatCheck', 'HappyHour', 'BestNights', 'WheelchairAccessible',
       'DogsAllowed', 'BYOBCorkage', 'DriveThru', 'Smoking', 'AgesAllowed',
       'HairSpecializesIn', 'Corkage', 'BYOB', 'DietaryRestrictions',
       'Open24Hours', 'RestaurantsCounterService'],
      dtype='object')


In [20]:
df_atr[df_atr_obj_cols].head(10)

Unnamed: 0,GoodForKids,RestaurantsReservations,GoodForMeal,BusinessParking,Caters,NoiseLevel,RestaurantsTableService,RestaurantsTakeOut,RestaurantsPriceRange2,OutdoorSeating,...,BYOBCorkage,DriveThru,Smoking,AgesAllowed,HairSpecializesIn,Corkage,BYOB,DietaryRestrictions,Open24Hours,RestaurantsCounterService
0,False,,,,,,,,,,...,,,,,,,,,,
1,True,True,"{'dessert': False, 'latenight': False, 'lunch'...","{'garage': False, 'street': False, 'validated'...",True,u'loud',True,True,2.0,False,...,,,,,,,,,,
2,True,True,"{'dessert': False, 'latenight': False, 'lunch'...","{'garage': False, 'street': False, 'validated'...",False,u'average',True,True,2.0,False,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,"{'garage': False, 'street': False, 'validated'...",,,,,2.0,,...,,,,,,,,,,
6,,,,"{'garage': False, 'street': False, 'validated'...",,,,,2.0,,...,,,,,,,,,,
7,True,,,,,,,,3.0,,...,,,,,,,,,,
8,,,,"{'garage': False, 'street': False, 'validated'...",,,,,2.0,,...,,,,,,,,,,
9,,,,"{'garage': False, 'street': False, 'validated'...",,,,,2.0,,...,,,,,,,,,,


In [21]:
def str_dict_to_df(series):
    """
    Takes in a pandas series with dicts stored as strings
    returns dataframe with dict keys as columns
    
    series: pandas series
    """
    eval_list = []
    for sr in series:
        if not pd.isna(sr):
            eval_list.append(eval(sr))
        else:
            eval_list.append(np.nan)
    
    eval_df = pd.Series(eval_list).apply(pd.Series)  
    
    #drop cols that are all nan
    eval_df = eval_df.dropna(axis=1, how='all')
    
    
    return eval_df
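
Since `eval` executes whatever string it is given, a more defensive variant could rely on `ast.literal_eval`, which only accepts Python literals. The sketch below is an alternative and not the function used in the rest of this notebook; `parse_dict_strings` and the toy series are hypothetical.

```python
import ast

import numpy as np
import pandas as pd

def parse_dict_strings(series):
    """Parse dict literals stored as strings into one column per key."""
    parsed = [
        ast.literal_eval(value) if isinstance(value, str) else None
        for value in series
    ]
    # literal_eval('None') returns None; treat it like a missing entry
    records = [p if isinstance(p, dict) else {} for p in parsed]
    return pd.DataFrame(records, index=series.index)

# Toy input mimicking the three kinds of values seen in the attribute columns
toy = pd.Series(["{'garage': False, 'lot': True}", np.nan, "None"])
parking = parse_dict_strings(toy)
print(parking)
```

Rows without a dict end up as all-NaN rows, so the result keeps the index of the input series and can be concatenated column-wise like the other frames.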

In [22]:
#store col names
dict_cols_list = []

for col in df_atr_obj_cols:
    #if contains a curly bracket, then assume column is a dict as string
    if df_atr[col].str.contains('{').any():
        dict_cols_list.append(col)
print(dict_cols_list)            

['GoodForMeal', 'BusinessParking', 'Ambience', 'Music', 'BestNights', 'HairSpecializesIn', 'DietaryRestrictions']


In [23]:
#store dataframes from dict nested columns
dict_col_df_list = []

for col in dict_cols_list:
    #apply string to dict evaluation
    temp_df = str_dict_to_df(df_atr[col])
    #append to list
    dict_col_df_list.append(temp_df)

#combine all in one dataframe
dict_col_df = pd.concat(dict_col_df_list, axis=1)
dict_col_df.head()

Unnamed: 0,breakfast,brunch,dessert,dinner,latenight,lunch,garage,lot,street,valet,...,kids,perms,straightperms,dairy-free,gluten-free,halal,kosher,soy-free,vegan,vegetarian
0,,,,,,,,,,,...,,,,,,,,,,
1,False,False,False,True,False,True,False,True,False,False,...,,,,,,,,,,
2,False,False,False,True,False,True,False,True,False,False,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [24]:
#store dummy attribute columns in list
atr_dum_df_cols = []

#iterate over object columns
for col in df_atr_obj_cols:
    #if contains a categorical variable then it begins with a u
    if df_atr[col].str.contains("u'").any():
        #add to the list
        atr_dum_df_cols.append(col)

#print out list
print(atr_dum_df_cols)


#store dummy dataframes in list
atr_dum_df_list = []

for col in atr_dum_df_cols:
    
    #we do not want to modify the dataframe in place so create a copy
    temp_series = df_atr[col].copy()
    
    
    #fix messy inputs and remove the u prefix
    temp_series = temp_series.str.replace("u'", "")
    #remove '
    temp_series= temp_series.str.replace("'", "")
        
    
    #create dummies
    dum_df = pd.get_dummies(temp_series)
    #drop the None column
    dum_df = dum_df.drop(columns=['None'])
        
    #fix messy col names in case we missed them
    dum_df.columns = dum_df.columns.str.replace("u'", "")
    dum_df.columns = dum_df.columns.str.replace("'", "")
        
    #add prefix
    dum_df = dum_df.add_prefix(col+'_')
                
    #append to list
    atr_dum_df_list.append(dum_df)

#concat
atr_dum_df = pd.concat(atr_dum_df_list, axis=1)

atr_dum_df.head()


['NoiseLevel', 'WiFi', 'Alcohol', 'RestaurantsAttire', 'BYOBCorkage', 'Smoking', 'AgesAllowed']


Unnamed: 0,NoiseLevel_average,NoiseLevel_loud,NoiseLevel_quiet,NoiseLevel_very_loud,WiFi_free,WiFi_no,WiFi_paid,Alcohol_beer_and_wine,Alcohol_full_bar,Alcohol_none,...,BYOBCorkage_no,BYOBCorkage_yes_corkage,BYOBCorkage_yes_free,Smoking_no,Smoking_outdoor,Smoking_yes,AgesAllowed_18plus,AgesAllowed_19plus,AgesAllowed_21plus,AgesAllowed_allages
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
atr_dum_df.columns

Index(['NoiseLevel_average', 'NoiseLevel_loud', 'NoiseLevel_quiet',
       'NoiseLevel_very_loud', 'WiFi_free', 'WiFi_no', 'WiFi_paid',
       'Alcohol_beer_and_wine', 'Alcohol_full_bar', 'Alcohol_none',
       'RestaurantsAttire_casual', 'RestaurantsAttire_dressy',
       'RestaurantsAttire_formal', 'BYOBCorkage_no', 'BYOBCorkage_yes_corkage',
       'BYOBCorkage_yes_free', 'Smoking_no', 'Smoking_outdoor', 'Smoking_yes',
       'AgesAllowed_18plus', 'AgesAllowed_19plus', 'AgesAllowed_21plus',
       'AgesAllowed_allages'],
      dtype='object')
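
The quote and `u'` cleanup can also be done in a single anchored pattern, which avoids touching a `u'` that might appear in the middle of a value. This is a minimal sketch on a made-up series; the values mimic the `NoiseLevel` column but are not read from the data.

```python
import numpy as np
import pandas as pd

# Toy values in the same shapes as the raw attribute strings
raw = pd.Series(["u'average'", "'loud'", "None", np.nan, "u'very_loud'"])

# Strip an optional leading u plus quote, and a trailing quote
cleaned = raw.str.replace(r"^u?'|'$", "", regex=True)

# One column per category; errors='ignore' keeps the drop safe if 'None' is absent
noise_dummies = (
    pd.get_dummies(cleaned)
    .drop(columns=["None"], errors="ignore")
    .add_prefix("NoiseLevel_")
    .astype(int)
)
print(noise_dummies)
```

Rows that were `None` or missing come out as all zeros, which matches how the dummy columns above treat them.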

In [26]:
#standard preprocessing for restaurant price range since it has no u
atr_price_range = pd.get_dummies(df_atr['RestaurantsPriceRange2'])
atr_price_range = atr_price_range.drop(columns= ['None'])
atr_price_range = atr_price_range.add_prefix('price_range'+'_')

atr_price_range.head()

Unnamed: 0,price_range_1,price_range_2,price_range_3,price_range_4
0,0,0,0,0
1,0,1,0,0
2,0,1,0,0
3,0,0,0,0
4,0,0,0,0


In [27]:
#add to the list
atr_to_drop = atr_dum_df_cols + dict_cols_list + ['RestaurantsPriceRange2']
print(atr_to_drop)

['NoiseLevel', 'WiFi', 'Alcohol', 'RestaurantsAttire', 'BYOBCorkage', 'Smoking', 'AgesAllowed', 'GoodForMeal', 'BusinessParking', 'Ambience', 'Music', 'BestNights', 'HairSpecializesIn', 'DietaryRestrictions', 'RestaurantsPriceRange2']


In [28]:
#combine features in one dataframe
df_atr_conc = pd.concat([df_atr.drop(columns=atr_to_drop),dict_col_df, atr_dum_df, atr_price_range], axis=1)

#standardize name
df_atr_conc.columns = df_atr_conc.columns.str.replace("-", "_")

#make 1/0
df_atr_conc = df_atr_conc.replace(bool_to_int)

df_atr_conc.head()

Unnamed: 0,GoodForKids,RestaurantsReservations,Caters,RestaurantsTableService,RestaurantsTakeOut,OutdoorSeating,BikeParking,HasTV,RestaurantsGoodForGroups,RestaurantsDelivery,...,Smoking_outdoor,Smoking_yes,AgesAllowed_18plus,AgesAllowed_19plus,AgesAllowed_21plus,AgesAllowed_allages,price_range_1,price_range_2,price_range_3,price_range_4
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
2,1,1,0,1,1,0,1,1,1,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
#quick check for column names
df_atr_conc.columns

Index(['GoodForKids', 'RestaurantsReservations', 'Caters',
       'RestaurantsTableService', 'RestaurantsTakeOut', 'OutdoorSeating',
       'BikeParking', 'HasTV', 'RestaurantsGoodForGroups',
       'RestaurantsDelivery', 'BusinessAcceptsCreditCards',
       'BusinessAcceptsBitcoin', 'ByAppointmentOnly', 'AcceptsInsurance',
       'GoodForDancing', 'CoatCheck', 'HappyHour', 'WheelchairAccessible',
       'DogsAllowed', 'DriveThru', 'Corkage', 'BYOB', 'Open24Hours',
       'RestaurantsCounterService', 'breakfast', 'brunch', 'dessert', 'dinner',
       'latenight', 'lunch', 'garage', 'lot', 'street', 'valet', 'validated',
       'casual', 'classy', 'divey', 'hipster', 'intimate', 'romantic',
       'touristy', 'trendy', 'upscale', 'background_music', 'dj', 'jukebox',
       'karaoke', 'live', 'no_music', 'video', 'friday', 'monday', 'saturday',
       'sunday', 'thursday', 'tuesday', 'wednesday', 'africanamerican',
       'asian', 'coloring', 'curly', 'extensions', 'kids', 'perms',
     

In [30]:
#key for reference
df_atr_conc['business_id'] = df_bus['business_id']

In [31]:
#save the work
df_atr_conc.to_csv(path_or_buf='data/cleaned/business_attributes.csv')

## Categories

In [32]:
#counter for original categories
cat_counter = Counter()
#loop through split categories
for cat_split in df_bus['categories'].str.split(',| '):
    #if statement to avoid none type is not iterable
    if cat_split:
        for cat in cat_split:
            cat_counter[cat] +=1

In [33]:
#see top k types
cat_counter.most_common(25)

[('', 596232),
 ('&', 129038),
 ('Services', 72809),
 ('Restaurants', 59382),
 ('Food', 47591),
 ('Shopping', 32643),
 ('Home', 31600),
 ('Spas', 23387),
 ('Bars', 21592),
 ('Beauty', 21518),
 ('Medical', 20510),
 ('Health', 18736),
 ('Hair', 15561),
 ('Local', 15405),
 ('Event', 14518),
 ('Repair', 13276),
 ('Automotive', 13203),
 ('Nightlife', 13095),
 ('Stores', 12969),
 ('Salons', 12847),
 ('Planning', 12740),
 ('American', 12580),
 ('Auto', 11392),
 ('Life', 10049),
 ('Arts', 9744)]

In [34]:
#get keys for top k common categories
top_cats = list(dict(cat_counter.most_common(25)).keys())
print(top_cats)
#the first 2 entries are the empty string and '&'; slicing from index 3 also skips 'Services'
print("\nselecting only top 10 relevant sections\n")
top_cats = top_cats[3:13]
print(top_cats)

['', '&', 'Services', 'Restaurants', 'Food', 'Shopping', 'Home', 'Spas', 'Bars', 'Beauty', 'Medical', 'Health', 'Hair', 'Local', 'Event', 'Repair', 'Automotive', 'Nightlife', 'Stores', 'Salons', 'Planning', 'American', 'Auto', 'Life', 'Arts']

selecting only top 10 relevant sections

['Restaurants', 'Food', 'Shopping', 'Home', 'Spas', 'Bars', 'Beauty', 'Medical', 'Health', 'Hair']
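
Because the split above uses both commas and spaces, multi-word categories such as "Active Life" are counted word by word, and the empty string dominates the counter. If whole categories are wanted instead, one option is to split on commas only and strip the surrounding whitespace. The sketch below uses category strings from the preview at the top of the notebook; it is an alternative and not the approach used for the dummies that follow.

```python
from collections import Counter

import pandas as pd

# Category strings from the previewed rows, plus one missing value
categories = pd.Series([
    "Golf, Active Life",
    "Sushi Bars, Restaurants, Japanese",
    None,
    "Insurance, Financial Services",
])

whole_cat_counter = Counter()
# dropna() skips businesses without categories
for cats in categories.dropna().str.split(","):
    # strip() removes the space that follows each comma; empty pieces are skipped
    whole_cat_counter.update(c.strip() for c in cats if c.strip())

print(whole_cat_counter.most_common(3))
```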


In [35]:
#create dict for categories
category_dict = {}
for cat in top_cats:
    #flag businesses whose category string contains this category
    dum_cat = df_bus['categories'].str.contains(cat)
    #add it to the dict
    category_dict[cat] = dum_cat
    
cat_type_df = pd.DataFrame.from_dict(category_dict).replace(bool_to_int)

cat_type_df.head()

Unnamed: 0,Restaurants,Food,Shopping,Home,Spas,Bars,Beauty,Medical,Health,Hair
0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,0
2,1,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,0,0,1,1,0,0,0,0,0,0


In [36]:
#key for reference
cat_type_df['business_id'] = df_bus['business_id']

In [37]:
#save the work
cat_type_df.to_csv(path_or_buf='data/cleaned/business_cats.csv')

## Hours

Holy grail of date time format:

http://strftime.org/

In [38]:
#split out the dict
hours_day_df = df_bus['hours'].apply(pd.Series)
hours_day_df.head()

Unnamed: 0,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
0,,,,,,,
1,9:0-0:0,9:0-0:0,9:0-0:0,9:0-0:0,9:0-1:0,9:0-1:0,9:0-0:0
2,17:30-21:30,,17:30-21:30,17:30-21:30,17:30-22:0,17:30-22:0,17:30-21:0
3,8:0-17:0,8:0-17:0,8:0-17:0,8:0-17:0,8:0-17:0,,
4,7:0-23:0,7:0-23:0,7:0-23:0,7:0-23:0,7:0-23:0,7:0-23:0,7:0-23:0


In [39]:
def series_to_datetime(df, series):
    """
    Splits a column of opening hours stored as strings into open/close columns
    must have format hour:minute-hour:minute
    can have missing values
    note: values stay as strings, the datetime conversion below is commented out
    
    df = pandas dataframe
    series = pandas column name
    """
    
    #create array for day of week
    #weekday = list(calendar.day_abbr)
    
    #ordered dict container
    series_dict = OrderedDict()
    
    #hour container
    open_hour = []
    close_hour = []
    
    #split the series along the dash (-)
    day = df[series].str.split("-")
    
    #iterate over days
    for hour in day:
        
        #if not a nan the split will return a list
        if isinstance(hour, list):
            open_hour.append(hour[0])
            close_hour.append(hour[1])
        else:
            #necessary nan for when not available
            open_hour.append(np.nan)
            close_hour.append(np.nan)
            
    #make a datetime object (left disabled)
    #open_hour_dt = pd.to_datetime(open_hour, dayfirst=True,format='%H:%M')
    #close_hour_dt = pd.to_datetime(close_hour, dayfirst=True, format='%H:%M')
    
    series_dict[series+'_open'] = open_hour
    series_dict[series+'_close'] = close_hour
    
    hours_df = pd.DataFrame.from_dict(series_dict)
    
    return hours_df

In [40]:
hours_df_list = []

for col in hours_day_df.columns:
    temp_hour_df = series_to_datetime(hours_day_df, col)
    hours_df_list.append(temp_hour_df)
    
    
hours_df_openclose = pd.concat(hours_df_list, axis=1)
hours_df_openclose.head()

Unnamed: 0,Monday_open,Monday_close,Tuesday_open,Tuesday_close,Wednesday_open,Wednesday_close,Thursday_open,Thursday_close,Friday_open,Friday_close,Saturday_open,Saturday_close,Sunday_open,Sunday_close
0,,,,,,,,,,,,,,
1,9:0,0:0,9:0,0:0,9:0,0:0,9:0,0:0,9:0,1:0,9:0,1:0,9:0,0:0
2,17:30,21:30,,,17:30,21:30,17:30,21:30,17:30,22:0,17:30,22:0,17:30,21:0
3,8:0,17:0,8:0,17:0,8:0,17:0,8:0,17:0,8:0,17:0,,,,
4,7:0,23:0,7:0,23:0,7:0,23:0,7:0,23:0,7:0,23:0,7:0,23:0,7:0,23:0
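
The open/close values are still strings such as `9:0`. If actual time objects are needed later, one possible route is the standard library's `datetime.strptime` with the `%H:%M` format. The sketch below is an assumption about how that could look, not a step performed in this notebook; `parse_clock` and the toy frame are hypothetical.

```python
from datetime import datetime

import numpy as np
import pandas as pd

def parse_clock(value):
    """Turn 'H:M' strings such as '9:0' into datetime.time; missing values become None."""
    if pd.isna(value):
        return None
    return datetime.strptime(value, "%H:%M").time()

# Toy frame in the same layout as hours_df_openclose
toy_hours = pd.DataFrame({
    "Monday_open": ["9:0", "17:30", np.nan],
    "Monday_close": ["0:0", "21:30", np.nan],
})

# Apply the parser to every cell, column by column
parsed_hours = toy_hours.apply(lambda col: col.map(parse_clock))
print(parsed_hours)
```

A closing time of `0:0` parses to midnight, which is earlier than the opening time on the same day, so any duration computed from these columns would need to handle closing times that fall past midnight.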


In [41]:
#key for reference
hours_df_openclose['business_id'] = df_bus['business_id']

In [42]:
#save the work
hours_df_openclose.to_csv(path_or_buf='data/cleaned/business_hours.csv')

## is_open

In [43]:
#quick check for unique values
df_bus.is_open.unique()

array([0, 1])

## state

In [44]:
#USA! USA!

states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

In [45]:
#exact match on the state code (a substring match would also accept longer codes that contain a US abbreviation)
df_bus_usa_id = df_bus[df_bus['state'].isin(states)]['business_id']

In [46]:
#save the work
df_bus_usa_id.to_csv(path_or_buf='data/cleaned/business_is_usa.csv', header=True)
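
The to-do list also mentions an "is in USA yes/no" feature. A minimal sketch of such a flag is shown below on made-up rows; the ids are hypothetical, `us_states` is a shortened stand-in for the full `states` list, and `XWY` is an invented code chosen to show why an exact match is safer than a substring match.

```python
import pandas as pd

# Hypothetical rows; 'XWY' stands in for a non-US code that happens to contain 'WY'
toy_bus = pd.DataFrame({
    "business_id": ["a1", "b2", "c3", "d4"],
    "state": ["AZ", "ON", "NC", "XWY"],
})

# Shortened list for the sketch; the full list is the `states` list above
us_states = ["AZ", "NC", "NV", "WY"]

# Exact membership test: a code counts as US only if it equals a listed abbreviation
toy_bus["is_usa"] = toy_bus["state"].isin(us_states).astype(int)

# Substring test for comparison: 'XWY' is flagged because it contains 'WY'
toy_bus["contains_match"] = (
    toy_bus["state"].str.contains("|".join(us_states)).astype(int)
)
print(toy_bus)
```

Keeping the flag as a 0/1 column next to `business_id` would make it joinable with the other `business_*.csv` files in the same way as the earlier features.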