# What makes restaurants close

In this part of the analysis we investigate which factors make restaurants close.

## Load data

In [1]:
import pandas as pd
import numpy as np

In [2]:
Resturants=pd.read_csv('data/toronto_restaurant_business2.csv',encoding="utf8")

In [3]:
len(Resturants)

8681

## Data preparation

First we remove columns that describe information specific to an individual restaurant: the business id, the name, and the address. Since we are only interested in restaurants in Toronto, city and state are removed as well. Longitude and latitude give more precise information than postal_code, but the two overlap. The postal code gives a more general indication of the area and is therefore chosen to represent location.

In [4]:
Resturants

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,rVBPQdeayMYht4Uv_FOLHg,Gourmet Burger Company,843 Kipling Avenue,Toronto,ON,M8Z 5G9,43.633291,-79.531768,3.0,13,0,"{""RestaurantsPriceRange2"": ""2"", ""RestaurantsAt...","Restaurants, Burgers, Food",{}
1,0QjROMVW9ACKjhSEfHqNCQ,Mi Mi Restaurant,688 Gerrard Street E,Toronto,ON,M4M 1Y3,43.666376,-79.348773,4.0,116,1,"{""RestaurantsTakeOut"": ""True"", ""Alcohol"": ""bee...","Vietnamese, Restaurants","{""Monday"": ""11:00-22:00"", ""Tuesday"": ""11:00-22..."
2,8k62wYhDVq1-652YbJi5eg,Tim Hortons,90 Adelaide Street W,Toronto,ON,M5H 3V9,43.649859,-79.382060,3.0,8,1,"{""OutdoorSeating"": ""False"", ""RestaurantsDelive...","Bagels, Donuts, Food, Cafes, Coffee & Tea, Res...",{}
3,0DnQh8SE8BSnvJltGCCiWg,Chick-N-Joy,3-1265 York Mills Road,Toronto,ON,M3A 1Z3,43.765279,-79.326248,3.0,11,1,"{""NoiseLevel"": ""loud"", ""BusinessParking"": {""ga...","Fast Food, Restaurants, Chicken Shop",{}
4,NLaK58WvlNQdUunSIkt-jA,Zav Coffee Shop & Gallery,2048 Danforth Avenue,Toronto,ON,M4C 1J6,43.685608,-79.313936,4.5,24,1,"{""DogsAllowed"": ""False"", ""OutdoorSeating"": ""Tr...","Coffee & Tea, Restaurants, Sandwiches, Food","{""Monday"": ""0:00-0:00"", ""Tuesday"": ""7:30-17:00..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8676,thzyiQZb16zD8wMliaEfRQ,Sushi Supreme,1995 Yonge Street,Toronto,ON,M4S 1Z8,43.700617,-79.396762,4.0,104,1,"{""Ambience"": {""romantic"": ""False"", ""intimate"":...","Sushi Bars, Restaurants, Japanese","{""Monday"": ""12:00-23:00"", ""Tuesday"": ""12:00-23..."
8677,eUi3O-8Gqh_nJ_ZhO-25gQ,Get & Go Burrito,"1077 Wilson Avenue, Unit 8",Toronto,ON,M3K 1G7,43.726656,-79.480365,3.5,43,1,"{""BusinessParking"": {""garage"": ""False"", ""stree...","Mexican, Restaurants","{""Monday"": ""11:00-23:00"", ""Tuesday"": ""11:00-23..."
8678,yFQCdWr_k1pTObzHPGis9Q,Grasshopper Restaurant,310 College Street,Toronto,ON,M5T 1S2,43.657716,-79.402098,4.0,177,1,"{""DogsAllowed"": ""False"", ""BikeParking"": ""True""...","Vegan, Restaurants, Vegetarian, Salad","{""Monday"": ""11:30-22:00"", ""Tuesday"": ""11:30-22..."
8679,GAgEoHcf4PSuZRS5Zd3ltA,Q's Shawarma,1075 Martin Grove Road,Toronto,ON,M9W 4W6,43.701807,-79.575135,4.0,16,1,"{""GoodForKids"": ""True"", ""HasTV"": ""True"", ""Rest...","Restaurants, Mediterranean","{""Monday"": ""11:00-21:00"", ""Tuesday"": ""11:00-21..."


In [5]:
Resturants=Resturants.drop(['name','address','city','state','longitude','latitude','business_id'],axis=1)

To be able to work with the data, postal_code is one-hot encoded.

In [6]:
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.postal_code,prefix='postal_code',drop_first=False)],axis=1)
Resturants=Resturants.drop(['postal_code'],axis=1)

The attributes need to be converted from strings to dictionaries.

In [7]:
import ast

# The attributes column is converted from strings to dictionaries.
# ast.literal_eval only accepts Python literals, so it is safer than eval.
Resturants.attributes=Resturants.attributes.apply(ast.literal_eval)
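
As a minimal sketch of this parsing step, assuming the attribute strings are Python- or JSON-style dictionary literals as in the preview above, a defensive parser might look like the following. The helper name `parse_attributes` is hypothetical and not part of the notebook.

```python
import ast
import json


def parse_attributes(text):
    """Parse a dictionary stored as text; return {} when it cannot be parsed.

    Hypothetical helper for illustration only.
    """
    if not isinstance(text, str) or not text.strip():
        return {}  # NaN (a float) or an empty string carries no attributes
    try:
        parsed = ast.literal_eval(text)  # accepts Python literals only, never runs code
    except (ValueError, SyntaxError):
        try:
            parsed = json.loads(text)  # fall back to JSON (true/false/null)
        except json.JSONDecodeError:
            return {}
    return parsed if isinstance(parsed, dict) else {}


example = '{"RestaurantsPriceRange2": "2", "BusinessParking": {"garage": "False"}}'
attrs = parse_attributes(example)
```

Unlike `eval`, this does not execute arbitrary expressions that might be present in the file, and rows without a usable value simply yield an empty dictionary.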

As not all attributes are listed for all restaurants, we first need to collect the set of all available attributes.

In [8]:
attributes=set([])
for d in Resturants.attributes:
    attributes=attributes.union(set(d.keys()))
print("There are the following {} attributes:".format(len(attributes)))
for attribute in attributes:
    print("- {}".format(attribute))

There are the following 37 attributes:
- RestaurantsPriceRange2
- GoodForDancing
- Smoking
- RestaurantsAttire
- GoodForKids
- BYOB
- DogsAllowed
- DietaryRestrictions
- OutdoorSeating
- CoatCheck
- ByAppointmentOnly
- Ambience
- Alcohol
- BestNights
- Corkage
- AgesAllowed
- DriveThru
- Caters
- BusinessAcceptsBitcoin
- NoiseLevel
- HappyHour
- RestaurantsCounterService
- WheelchairAccessible
- BikeParking
- RestaurantsReservations
- RestaurantsGoodForGroups
- Music
- RestaurantsDelivery
- AcceptsInsurance
- RestaurantsTableService
- GoodForMeal
- BusinessParking
- WiFi
- HairSpecializesIn
- HasTV
- RestaurantsTakeOut
- BusinessAcceptsCreditCards


Each attribute is now represented as an individual column in the dataframe.

In [9]:
print("Percent of restaurants for which attribute is NaN:")
print("---------------------------------------------------------")
for attribute in attributes:
    # Attributes that are not listed for a restaurant are stored as NaN
    Resturants[attribute]=Resturants.attributes.apply(lambda a:  a[attribute] if (attribute in a) else np.nan)
    print("{}: {}%".format(attribute,Resturants[attribute].isna().sum()/len(Resturants[attribute].isna())*100))
Resturants=Resturants.drop(['attributes'],axis=1)

Percent of restaurants for which attribute is NaN:
---------------------------------------------------------
RestaurantsPriceRange2: 19.32957032599931%
GoodForDancing: 94.74714894597398%
Smoking: 95.72629881350075%
RestaurantsAttire: 26.414007602810734%
GoodForKids: 22.474369312291213%
BYOB: 99.066927773298%
DogsAllowed: 87.29409054256422%
DietaryRestrictions: 99.86176707752563%
OutdoorSeating: 21.276350650846677%
CoatCheck: 95.63414353185117%
ByAppointmentOnly: 96.72848750143993%
Ambience: 21.518258265176822%
Alcohol: 29.24778251353531%
BestNights: 94.86234304803594%
Corkage: 99.35491302845294%
AgesAllowed: 99.9769611795876%
DriveThru: 96.49809929731597%
Caters: 45.89333026149061%
BusinessAcceptsBitcoin: 99.9769611795876%
NoiseLevel: 34.48911415735515%
HappyHour: 87.67423107936874%
RestaurantsCounterService: 99.98848058979381%
WheelchairAccessible: 82.67480704987905%
BikeParking: 39.00472295818454%
RestaurantsReservations: 19.68667204239143%
RestaurantsGoodForGroups: 20.94228775486695%

As we can see, far from all attributes are listed for all restaurants. NaN values provide no information about a restaurant and can therefore not be included when training the classifier. Furthermore, certain attributes are not listed for most restaurants. In such cases it is assumed that a value cannot generally be reasonably derived for the NaN cases. Attributes for which at least 80% of the restaurants have no value listed are therefore removed.

In [10]:
disgarded_attributes=set([])
for attribute in attributes:
    percent_NaN=Resturants[attribute].isna().sum()/len(Resturants[attribute].isna())*100
    if percent_NaN>=80:
        Resturants=Resturants.drop([attribute],axis=1)
        disgarded_attributes=disgarded_attributes.union(set([attribute]))
attributes.difference_update(disgarded_attributes)        
print("Attributes discarded:")
for a in disgarded_attributes:
    print(a)
print("-----------------------------")
print("There are {} remaining attributes".format(len(attributes)))

Attributes discarded:
GoodForDancing
HappyHour
RestaurantsCounterService
WheelchairAccessible
Smoking
BusinessAcceptsCreditCards
BYOB
DogsAllowed
Music
DietaryRestrictions
CoatCheck
AcceptsInsurance
ByAppointmentOnly
BestNights
AgesAllowed
Corkage
DriveThru
HairSpecializesIn
BusinessAcceptsBitcoin
-----------------------------
There are 18 remaining attributes
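
The 80% rule used above can also be written in vectorised form. The following is a sketch on a made-up toy frame, not the notebook's data; the names `toy`, `nan_share`, `to_drop` and `reduced` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the attribute columns (values are made up)
toy = pd.DataFrame({
    "HasTV": ["True", "False", np.nan, "True", "False"],
    "BYOB": [np.nan, np.nan, np.nan, np.nan, "False"],
    "Corkage": [np.nan] * 5,
})

nan_share = toy.isna().mean() * 100          # percent of NaN values per column
to_drop = nan_share[nan_share >= 80].index   # same ">= 80" rule as in the loop
reduced = toy.drop(columns=to_drop)          # keep only the better-covered columns
```

Computing the share once per column avoids dropping columns inside the loop and leaves the threshold in a single place.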


In [11]:
print("Percent of restaurants for which attribute is NaN:")
print("---------------------------------------------------------")
for attribute in attributes:
    print("{}: {}%".format(attribute,Resturants[attribute].isna().sum()/len(Resturants[attribute].isna())*100))

Percent of restaurants for which attribute is NaN:
---------------------------------------------------------
RestaurantsPriceRange2: 19.32957032599931%
RestaurantsAttire: 26.414007602810734%
GoodForKids: 22.474369312291213%
OutdoorSeating: 21.276350650846677%
Ambience: 21.518258265176822%
Alcohol: 29.24778251353531%
Caters: 45.89333026149061%
NoiseLevel: 34.48911415735515%
BikeParking: 39.00472295818454%
RestaurantsReservations: 19.68667204239143%
RestaurantsGoodForGroups: 20.94228775486695%
RestaurantsDelivery: 22.209422877548672%
RestaurantsTableService: 66.34028337749108%
GoodForMeal: 47.909227047575165%
BusinessParking: 18.78815804630803%
WiFi: 33.19894021426103%
HasTV: 23.77606266559152%
RestaurantsTakeOut: 15.631839649809931%


For the 18 remaining attributes we still need to find a solution for the NaN values.

In [12]:
print("There are {} restaurants with no NaN listed for attributes".format(len(Resturants.dropna())))

There are 1666 restaurants with no NaN listed for attributes


As we can see, dropping all restaurants with NaN values is not an option, as it would reduce the size of the dataset too drastically.

In [13]:
Resturants.Ambience

0       {'romantic': 'False', 'intimate': 'False', 'cl...
1       {'romantic': 'False', 'intimate': 'False', 'cl...
2       {'touristy': 'False', 'hipster': 'False', 'rom...
3       {'touristy': 'False', 'hipster': 'False', 'rom...
4                                                     NaN
                              ...                        
8676    {'romantic': 'False', 'intimate': 'False', 'cl...
8677    {'romantic': 'False', 'intimate': 'False', 'cl...
8678    {'touristy': 'False', 'hipster': 'False', 'rom...
8679    {'romantic': 'False', 'intimate': 'False', 'cl...
8680    {'touristy': 'False', 'hipster': 'False', 'rom...
Name: Ambience, Length: 8681, dtype: object

In [14]:
Resturants.BikeParking

0         NaN
1        True
2       False
3        True
4        True
        ...  
8676     True
8677    False
8678     True
8679    False
8680     True
Name: BikeParking, Length: 8681, dtype: object

In [15]:
Resturants.NoiseLevel

0       average
1       average
2           NaN
3          loud
4           NaN
         ...   
8676    average
8677    average
8678    average
8679        NaN
8680      quiet
Name: NoiseLevel, Length: 8681, dtype: object

As the above examples show, the values of the attributes are very different, and a general solution is therefore not available. Each attribute must be treated individually.

### Ambience

In [16]:
Resturants.Ambience.apply(lambda r: type(r)).unique()

array([<class 'dict'>, <class 'float'>, <class 'str'>], dtype=object)

We can see that not all values are dictionaries or NaN values.

In [17]:
for a in Resturants.Ambience:
    if type(a)==str:
        print(a)

None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None


We see that the string 'None' also occurs. This appears to correspond well to NaN. We now take a look at the dictionaries.

In [18]:
ambience_keys=set(Resturants.Ambience[0].keys())
Resturants.Ambience.apply(lambda d: set(d.keys())==ambience_keys if type(d)==dict else True)

0        True
1        True
2       False
3       False
4        True
        ...  
8676     True
8677     True
8678    False
8679     True
8680    False
Name: Ambience, Length: 8681, dtype: bool

We see that not all ambience dictionaries contain the same keys. We therefore have to collect all keys.

In [19]:
ambience_keys=set([])
for d in Resturants.Ambience:
    if type(d)==dict:
        ambience_keys=ambience_keys.union(set(d.keys()))
print("There are the following {} keys used to describe ambience:".format(len(ambience_keys)))
for key in ambience_keys:
    print("- {}".format(key))

There are the following 9 keys used to describe ambience:
- touristy
- intimate
- romantic
- casual
- classy
- hipster
- trendy
- upscale
- divey


Each key is now represented by a new column, where NaN and None are represented as False. The values are encoded as 0 and 1.

In [20]:
for key in ambience_keys:
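    # The flags are stored as the strings 'True'/'False'; bool('False') is True, so compare explicitly
    Resturants[key]=Resturants.Ambience.apply(lambda a:  int(str(a[key])=='True') if type(a)==dict and (key in a) else 0)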
Resturants=Resturants.drop(['Ambience'],axis=1)
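
When the dictionary values are stored as the strings 'True' and 'False', as the Ambience output above suggests, a plain `bool()` conversion treats both as truthy. The sketch below illustrates the difference with an explicit comparison; `flag_to_int` is a hypothetical helper and the example dictionary is made up.

```python
def flag_to_int(record, key):
    """Return 1 only when record[key] is the string 'True' or the bool True."""
    if not isinstance(record, dict):
        return 0  # NaN or the string 'None': nothing is listed
    return int(str(record.get(key)) == "True")


ambience = {"romantic": "False", "casual": "True"}

naive = int(bool(ambience["romantic"]))       # any non-empty string is truthy
checked = flag_to_int(ambience, "romantic")   # explicit comparison with 'True'
```

Here `naive` is 1 although the restaurant is listed as not romantic, while `checked` is 0.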

### WiFi

In [21]:
Resturants.WiFi.unique()

array(['no', 'free', nan, 'paid', 'None'], dtype=object)

We see that there is an overlap of categories, with 'no' and 'None' meaning the same thing. Both values are therefore set to 'no'. NaN is here assumed to mean 'no' as well.

In [22]:
Resturants.WiFi=Resturants.WiFi.replace("None","no")
Resturants.WiFi=Resturants.WiFi.replace(np.nan,"no")

WiFi can now be one-hot encoded.

In [23]:
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.WiFi,prefix='Wifi_',drop_first=False)],axis=1)
Resturants=Resturants.drop(['WiFi'],axis=1)

### Noise level

In [24]:
Resturants.NoiseLevel.unique()

array(['average', nan, 'loud', 'quiet', 'very_loud', 'None'], dtype=object)

We again see that None and NaN overlap. When nothing has been listed, we here assume an average noise level.

In [25]:
Resturants.NoiseLevel=Resturants.NoiseLevel.replace("None","average")
Resturants.NoiseLevel=Resturants.NoiseLevel.replace(np.nan,"average")

In [26]:
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.NoiseLevel,prefix='NoiseLevel_',drop_first=False)],axis=1)
Resturants=Resturants.drop(['NoiseLevel'],axis=1)

### Outdoor seating

In [27]:
Resturants.OutdoorSeating.unique()

array(['False', 'True', nan, 'None'], dtype=object)

Again the same pattern. If nothing is listed, we assume False.

In [28]:
Resturants.OutdoorSeating=Resturants.OutdoorSeating.replace("None",0)
Resturants.OutdoorSeating=Resturants.OutdoorSeating.replace(np.nan,0)
Resturants.OutdoorSeating=Resturants.OutdoorSeating.replace('False',0)
Resturants.OutdoorSeating=Resturants.OutdoorSeating.replace('True',1)
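
The same four replacements recur for several boolean attributes, so they could be collected in one helper. This is a sketch under the assumption that the column only contains 'True', 'False', 'None' and NaN; the name `encode_flag` and its `missing` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def encode_flag(column, missing=0):
    """Map 'True'/'False' to 1/0 and everything else (NaN, 'None') to `missing`."""
    mapping = {"True": 1, "False": 0}
    # Values without an entry in the mapping become NaN and are then filled
    return column.map(mapping).fillna(missing).astype(int)


outdoor = pd.Series(["False", "True", np.nan, "None"])
encoded = encode_flag(outdoor)              # unlisted values count as False
optimistic = encode_flag(outdoor, missing=1)  # unlisted values count as True
```

Passing `missing=1` covers the case where an unlisted value is assumed to be True rather than False.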

### Restaurants TakeOut

In [29]:
Resturants.RestaurantsTakeOut.unique()

array(['True', 'False', nan, 'None'], dtype=object)

The same approach is used.

In [30]:
Resturants.RestaurantsTakeOut=Resturants.RestaurantsTakeOut.replace("None",0)
Resturants.RestaurantsTakeOut=Resturants.RestaurantsTakeOut.replace(np.nan,0)
Resturants.RestaurantsTakeOut=Resturants.RestaurantsTakeOut.replace('False',0)
Resturants.RestaurantsTakeOut=Resturants.RestaurantsTakeOut.replace('True',1)

### Delivery

In [31]:
Resturants.RestaurantsDelivery.unique()

array(['True', 'False', nan, 'None'], dtype=object)

In [32]:
Resturants.RestaurantsDelivery=Resturants.RestaurantsDelivery.replace("None",0)
Resturants.RestaurantsDelivery=Resturants.RestaurantsDelivery.replace(np.nan,0)
Resturants.RestaurantsDelivery=Resturants.RestaurantsDelivery.replace('False',0)
Resturants.RestaurantsDelivery=Resturants.RestaurantsDelivery.replace('True',1)

### Reservations

In [33]:
Resturants.RestaurantsReservations.unique()

array(['False', nan, 'True', 'None'], dtype=object)

In [34]:
Resturants.RestaurantsReservations=Resturants.RestaurantsReservations.replace("None",0)
Resturants.RestaurantsReservations=Resturants.RestaurantsReservations.replace(np.nan,0)
Resturants.RestaurantsReservations=Resturants.RestaurantsReservations.replace('False',0)
Resturants.RestaurantsReservations=Resturants.RestaurantsReservations.replace('True',1)

### Has TV

In [35]:
Resturants.HasTV.unique()

array(['False', 'True', nan, 'None'], dtype=object)

In [36]:
Resturants.HasTV=Resturants.HasTV.replace("None",0)
Resturants.HasTV=Resturants.HasTV.replace(np.nan,0)
Resturants.HasTV=Resturants.HasTV.replace('False',0)
Resturants.HasTV=Resturants.HasTV.replace('True',1)

### Restaurants Price Range

In [37]:
Resturants.RestaurantsPriceRange2.unique()

array(['2', '1', '3', nan, '4', 'None'], dtype=object)

None and NaN are assumed to correspond to the most common price range.

In [38]:
Resturants.RestaurantsPriceRange2=Resturants.RestaurantsPriceRange2.replace('None',np.nan)
Resturants.RestaurantsPriceRange2=Resturants.RestaurantsPriceRange2.apply(lambda l: float(l) if type(l)==str else l)

In [39]:
# mode() returns a Series, so take its first entry to get a scalar fill value
common_priceRange=Resturants.RestaurantsPriceRange2.mode()[0]
Resturants.RestaurantsPriceRange2=Resturants.RestaurantsPriceRange2.fillna(common_priceRange)
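
A small sketch of mode imputation on made-up values may clarify this step. Note that `Series.mode()` returns a Series, since several values can tie, so one scalar has to be picked from it. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up price ranges in the same string form as the raw attribute
prices = pd.Series(["2", "1", "2", np.nan, "None", "3"])

# errors="coerce" turns anything non-numeric, such as 'None', into NaN
numeric = pd.to_numeric(prices, errors="coerce")

most_common = numeric.mode().iloc[0]   # first (smallest) of the most frequent values
filled = numeric.fillna(most_common)   # every gap now holds the mode
```

If two price ranges were equally common, `mode()` would list both in ascending order and `.iloc[0]` would pick the lower one.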

### Table Service

In [40]:
Resturants.RestaurantsTableService.unique()

array([nan, 'True', 'False', 'None'], dtype=object)

When not listed it is assumed that table service is available.

In [41]:
Resturants.RestaurantsTableService=Resturants.RestaurantsTableService.replace("None",1)
Resturants.RestaurantsTableService=Resturants.RestaurantsTableService.replace(np.nan,1)
Resturants.RestaurantsTableService=Resturants.RestaurantsTableService.replace('False',0)
Resturants.RestaurantsTableService=Resturants.RestaurantsTableService.replace('True',1)

### Alcohol

In [42]:
Resturants.Alcohol.unique()

array(['none', 'beer_and_wine', nan, 'full_bar', 'None'], dtype=object)

'None' and 'none' are here interpreted to mean that alcohol is not served. NaN is assigned the most common value.

In [43]:
common_alcohol_serving=Resturants.Alcohol.mode()
print(common_alcohol_serving)
# mode() returns a Series, so its first entry is used as the scalar fill value
Resturants.Alcohol=Resturants.Alcohol.fillna(common_alcohol_serving[0])
Resturants.Alcohol=Resturants.Alcohol.replace('none','None')

0    full_bar
dtype: object


In [44]:
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.Alcohol,prefix='Alcohol',drop_first=False)],axis=1)
Resturants=Resturants.drop(['Alcohol','Alcohol_None'],axis=1)

### Business Parking

In [45]:
Resturants.BusinessParking

0       {'garage': 'False', 'street': 'False', 'valida...
1       {'garage': 'False', 'street': 'True', 'validat...
2       {'garage': 'False', 'street': 'False', 'valida...
3       {'garage': 'False', 'street': 'False', 'valida...
4       {'garage': 'False', 'street': 'False', 'valida...
                              ...                        
8676    {'garage': 'False', 'street': 'True', 'validat...
8677    {'garage': 'False', 'street': 'False', 'valida...
8678    {'garage': 'False', 'street': 'True', 'validat...
8679                                                  NaN
8680    {'garage': 'False', 'street': 'True', 'validat...
Name: BusinessParking, Length: 8681, dtype: object

In [46]:
parking_keys=set([])
for d in Resturants.BusinessParking:
    if type(d)==dict:
        parking_keys=parking_keys.union(set(d.keys()))
print("There are the following {} keys used to describe parking:".format(len(parking_keys)))
for key in parking_keys:
    print("- {}".format(key))

There are the following 5 keys used to describe parking:
- lot
- street
- garage
- valet
- validated


In [47]:
for key in parking_keys:
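    # The flags are stored as the strings 'True'/'False'; bool('False') is True, so compare explicitly
    Resturants[key]=Resturants.BusinessParking.apply(lambda a:  int(str(a[key])=='True') if type(a)==dict and (key in a) else 0)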
Resturants=Resturants.drop(['BusinessParking'],axis=1)

### Catering

In [48]:
Resturants.Caters.unique()

array(['False', nan, 'True', 'None'], dtype=object)

When the value is None or NaN, it is assumed that the restaurant does not cater.

In [49]:
Resturants.Caters=Resturants.Caters.replace("None",0)
Resturants.Caters=Resturants.Caters.replace(np.nan,0)
Resturants.Caters=Resturants.Caters.replace('False',0)
Resturants.Caters=Resturants.Caters.replace('True',1)

### Kid friendly

In [50]:
Resturants.GoodForKids.unique()

array(['True', nan, 'False', 'None'], dtype=object)

In this case it is assumed that if the value is None or NaN, the restaurant is not kid friendly.<br>
**Would it make more sense to allow for None as a separate value, as in neither particularly kid friendly nor bad for kids?**

In [51]:
Resturants.GoodForKids=Resturants.GoodForKids.replace("None",0)
Resturants.GoodForKids=Resturants.GoodForKids.replace(np.nan,0)
Resturants.GoodForKids=Resturants.GoodForKids.replace('False',0)
Resturants.GoodForKids=Resturants.GoodForKids.replace('True',1)

### Attire

In [52]:
Resturants.RestaurantsAttire.unique()

array(['casual', nan, 'dressy', 'formal', 'None'], dtype=object)

When no value is available, casual could be assumed. Here we instead allow restaurants to have no particular attire, so NaN is treated as 'None'.

In [53]:
Resturants.RestaurantsAttire=Resturants.RestaurantsAttire.replace(np.nan,'None')
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.RestaurantsAttire,prefix='Attire',drop_first=False)],axis=1)
Resturants=Resturants.drop(['RestaurantsAttire','Attire_None'],axis=1)

### Bike parking

In [54]:
Resturants.BikeParking.unique()

array([nan, 'True', 'False', 'None'], dtype=object)

Here NaN and None are interpreted to mean False. **Is this reasonable for NaN?**

In [55]:
Resturants.BikeParking=Resturants.BikeParking.replace("None",0)
Resturants.BikeParking=Resturants.BikeParking.replace(np.nan,0)
Resturants.BikeParking=Resturants.BikeParking.replace('False',0)
Resturants.BikeParking=Resturants.BikeParking.replace('True',1)

### Suitable for groups

In [56]:
Resturants.RestaurantsGoodForGroups.unique()

array(['True', nan, 'False', 'None'], dtype=object)

In [57]:
Resturants.RestaurantsGoodForGroups=Resturants.RestaurantsGoodForGroups.replace("None",0)
Resturants.RestaurantsGoodForGroups=Resturants.RestaurantsGoodForGroups.replace(np.nan,0)
Resturants.RestaurantsGoodForGroups=Resturants.RestaurantsGoodForGroups.replace('False',0)
Resturants.RestaurantsGoodForGroups=Resturants.RestaurantsGoodForGroups.replace('True',1)

### What types of meals the restaurant is good for

In [58]:
Resturants.GoodForMeal

0                                                     NaN
1       {'dessert': 'False', 'latenight': 'False', 'lu...
2                                                     NaN
3       {'dessert': 'False', 'latenight': 'False', 'lu...
4                                                     NaN
                              ...                        
8676    {'dessert': 'False', 'latenight': 'False', 'lu...
8677    {'dessert': 'False', 'latenight': 'False', 'lu...
8678    {'dessert': 'False', 'latenight': 'False', 'lu...
8679                                                  NaN
8680                                                  NaN
Name: GoodForMeal, Length: 8681, dtype: object

In [59]:
meal_keys=set([])
for d in Resturants.GoodForMeal:
    if type(d)==dict:
        meal_keys=meal_keys.union(set(d.keys()))
print("There are the following {} keys used to describe meals:".format(len(meal_keys)))
for key in meal_keys:
    print("- {}".format(key))

There are the following 6 keys used to describe meals:
- breakfast
- dinner
- latenight
- lunch
- dessert
- brunch


In [60]:
for key in meal_keys:
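    # The flags are stored as the strings 'True'/'False'; bool('False') is True, so compare explicitly
    Resturants[key]=Resturants.GoodForMeal.apply(lambda a:  int(str(a[key])=='True') if type(a)==dict and (key in a) else 0)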
Resturants=Resturants.drop(['GoodForMeal'],axis=1)

**Is this still a reasonable approach when 47.9% of the GoodForMeal values are NaN?**

### Categories
The restaurants have a number of categories listed, which need to be transformed.

In [61]:
Resturants.categories=Resturants.categories.str.lower().apply(lambda c: c.split(", "))

In [62]:
categories=set([])
for c in Resturants.categories:
    categories=categories.union(set(c))
print("There are the following {} categories:".format(len(categories)))
print("--------------------------------------------")
for category in categories:
    print("- {}".format(category))

There are the following 384 categories:
--------------------------------------------
- vegan
- chinese
- barbeque
- dinner theater
- waffles
- professional services
- fashion
- american (traditional)
- bangladeshi
- cafes
- gay bars
- pop-up restaurants
- colombian
- alternative medicine
- venezuelan
- pets
- tacos
- scottish
- mobile phone repair
- gluten-free
- turkish
- financial services
- swiss food
- florists
- home services
- pasta shops
- czech/slovakian
- nightlife
- sports clubs
- music venues
- delicatessen
- gastropubs
- wineries
- local services
- butcher
- latin american
- casinos
- cheese shops
- day spas
- musical instruments & teachers
- soul food
- grocery
- cheesesteaks
- hot pot
- cuban
- afghan
- vietnamese
- patisserie/cake shop
- shopping
- airport lounges
- ethical grocery
- champagne bars
- adult entertainment
- belgian
- fish & chips
- shaved ice
- british
- education
- plumbing
- restaurants
- hostels
- irish pub
- vegetarian
- scandinavian
- hookah bars


It is now encoded whether each restaurant belongs to each of the categories.

In [63]:
for category in categories:
    Resturants[category]=Resturants.categories.apply(lambda c:  int(bool(category in c)))
Resturants=Resturants.drop(['categories'],axis=1)

**Should I investigate whether some categories are listed for only very few restaurants and then remove them?**
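
One way to explore this question, sketched here on made-up category lists, is to count in how many restaurants each category occurs and separate those below a chosen cut-off. The cut-off `min_restaurants` and all variable names are hypothetical.

```python
from collections import Counter

# Made-up category lists in the lower-cased, split form used above
category_lists = [
    ["restaurants", "burgers", "food"],
    ["vietnamese", "restaurants"],
    ["restaurants", "food", "plumbing"],
]

# Count each category once per restaurant
counts = Counter(c for cats in category_lists for c in set(cats))

min_restaurants = 2  # hypothetical cut-off, to be tuned on the real data
kept = sorted(c for c, n in counts.items() if n >= min_restaurants)
rare = sorted(c for c, n in counts.items() if n < min_restaurants)
```

Only the categories in `kept` would then be turned into columns, which could reduce the number of very sparse features.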

### Opening hours
The last bit of information we have available concerns the opening hours of the restaurant.

In [64]:
# Do we know the restaurants' opening hours? True means the hours are unknown (empty dictionary).
Resturants.hours.apply(lambda h: h=='{}').value_counts()

False    6720
True     1961
Name: hours, dtype: int64

When the opening hours are unknown, they will be set to the most common opening hours among the remaining restaurants.

In [65]:
import ast

# Parse the hours strings into dictionaries; literal_eval is safer than eval
Resturants.hours=Resturants.hours.apply(ast.literal_eval)

In [66]:
type(Resturants.hours[0])

dict

In [67]:
##Resturants.hours=Resturants.hours.apply(lambda l: type(l))

In [68]:
#Resturants.hours[7]#['Monday']#.time()#>=time(12,00)

In [69]:
import datetime
days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
hours=[]
for i in range (24):
    hours.append(datetime.time(i,0,0))
for day in days:
    for hour in hours:
        Resturants['open_'+day+"_"+str(hour)]=0 #default closed
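
As a sketch of how an hours string such as '11:00-22:00' could be mapped onto these hourly columns, the helper below lists the hours at whose start a restaurant is open. It rests on two assumptions that the data does not confirm: '0:00-0:00' is read as open all day, and hours after midnight are attributed to the same day's columns. The function names are hypothetical.

```python
import datetime


def to_minutes(clock):
    """Convert 'H:MM' to minutes after midnight."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


def open_hour_slots(span):
    """Return the hours (0-23) at whose start the restaurant is open."""
    start_text, end_text = span.split("-")
    start, end = to_minutes(start_text), to_minutes(end_text)
    if start == end:
        return list(range(24))  # assumption: '0:00-0:00' means open all day
    if end < start:
        end += 24 * 60  # closing time falls after midnight
    slots = []
    for hour in range(24):
        # Check the hour on the opening day and wrapped past midnight
        if any(start <= t < end for t in (hour * 60, hour * 60 + 24 * 60)):
            slots.append(hour)
    return slots


def column_name(day, hour):
    """Build a name in the same pattern as the columns created above."""
    return "open_" + day + "_" + str(datetime.time(hour, 0, 0))


monday_columns = [column_name("Monday", h) for h in open_hour_slots("11:00-22:00")]
```

With this rule a restaurant opening at 7:30 is first counted as open in the 8:00 slot, which is a simplification of the half-hour cases.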

In [70]:
Resturants

Unnamed: 0,stars,review_count,is_open,hours,postal_code_L3R 0A1,postal_code_L3R 0L7,postal_code_L3R 4X8,postal_code_L4K 2Z5,postal_code_L4K 5Y5,postal_code_L4L 8B7,...,open_Sunday_14:00:00,open_Sunday_15:00:00,open_Sunday_16:00:00,open_Sunday_17:00:00,open_Sunday_18:00:00,open_Sunday_19:00:00,open_Sunday_20:00:00,open_Sunday_21:00:00,open_Sunday_22:00:00,open_Sunday_23:00:00
0,3.0,13,0,{},0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4.0,116,1,"{'Monday': '11:00-22:00', 'Tuesday': '11:00-22...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3.0,8,1,{},0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3.0,11,1,{},0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4.5,24,1,"{'Monday': '0:00-0:00', 'Tuesday': '7:30-17:00...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8676,4.0,104,1,"{'Monday': '12:00-23:00', 'Tuesday': '12:00-23...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8677,3.5,43,1,"{'Monday': '11:00-23:00', 'Tuesday': '11:00-23...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8678,4.0,177,1,"{'Monday': '11:30-22:00', 'Tuesday': '11:30-22...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8679,4.0,16,1,"{'Monday': '11:00-21:00', 'Tuesday': '11:00-21...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [71]:
for index,resturant in Resturants.iterrows():
    if "Monday" in resturant.hours.keys():
        print(resturant.hours["Monday"])

11:00-22:00
0:00-0:00
11:30-0:00
0:00-0:00
11:30-0:00
11:00-1:30
11:30-22:00
11:30-23:30
0:00-0:00
11:00-19:00
17:30-23:30
10:00-0:00
11:00-21:00
12:00-18:00
10:00-21:00
8:00-20:00
0:00-0:00
17:00-1:00
8:00-16:00
10:00-2:00
...

In [72]:
for index, resturant in Resturants.iterrows():
    if resturant.hours != {}:
        for day in days:
            if day in resturant.hours:
                # Each entry looks like "11:00-22:00"
                open_str, close_str = resturant.hours[day].split("-")
                open_hour = pd.to_datetime(open_str).time()
                close_hour = pd.to_datetime(close_str).time()
                for hour in hours:
                    # Caveat: ranges that close after midnight (e.g. "11:00-2:00")
                    # and "0:00-0:00" never satisfy this test, so they stay 0.
                    if open_hour <= hour < close_hour:
                        # .loc writes into Resturants directly and avoids the
                        # chained assignment that raises SettingWithCopyWarning
                        Resturants.loc[index, 'open_' + day + "_" + str(hour)] = 1
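The loop above marks an hour as open only when `open_hour <= hour < close_hour`, which cannot hold for ranges such as `11:00-2:00` where the closing time falls after midnight. Below is a minimal sketch of a comparison that handles such ranges. The helper name `is_open_at` is hypothetical, and treating `0:00-0:00` as "open all day" is an assumption about the data, not something the dataset confirms. Hours after midnight are also attributed to the same weekday label here, which is a simplification.

```python
from datetime import time


def is_open_at(hour, open_hour, close_hour):
    """Return True if `hour` lies in the opening range [open_hour, close_hour).

    Assumption: a closing time at or before the opening time means the
    range wraps past midnight, so "0:00-0:00" counts as open all day.
    """
    if open_hour < close_hour:
        # Ordinary same-day range, e.g. 11:00-22:00
        return open_hour <= hour < close_hour
    # Wrapping range, e.g. 11:00-2:00: open late in the day or early in the morning
    return hour >= open_hour or hour < close_hour


# Small usage example on a range that closes after midnight
late_evening = is_open_at(time(23, 0), time(11, 0), time(2, 0))
early_morning = is_open_at(time(5, 0), time(11, 0), time(2, 0))
print(late_evening, early_morning)
```

Whether this interpretation is appropriate depends on what `0:00-0:00` means in the source data, so it is worth checking a few restaurants by hand before adopting it.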


In [73]:
# Impute missing opening hours with the most common value (mode) of each
# open_<day>_<hour> indicator among the restaurants that do list their hours
has_hours = Resturants.hours.apply(lambda h: h != {})
resturants_with_openhours = Resturants[has_hours]
for day in days:
    for hour in hours:
        column = 'open_' + day + "_" + str(hour)
        # mode() returns a Series; take its first entry as the scalar fill value
        Resturants.loc[~has_hours, column] = resturants_with_openhours[column].mode().iloc[0]


## Defining training, test and validation sets

In [74]:
Resturants

Unnamed: 0,stars,review_count,is_open,hours,postal_code_L3R 0A1,postal_code_L3R 0L7,postal_code_L3R 4X8,postal_code_L4K 2Z5,postal_code_L4K 5Y5,postal_code_L4L 8B7,...,open_Sunday_14:00:00,open_Sunday_15:00:00,open_Sunday_16:00:00,open_Sunday_17:00:00,open_Sunday_18:00:00,open_Sunday_19:00:00,open_Sunday_20:00:00,open_Sunday_21:00:00,open_Sunday_22:00:00,open_Sunday_23:00:00
0,3.0,13,0,{},0,0,0,0,0,0,...,1,1,1,1,1,0,0,0,0,0
1,4.0,116,1,"{'Monday': '11:00-22:00', 'Tuesday': '11:00-22...",0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,0
2,3.0,8,1,{},0,0,0,0,0,0,...,1,1,1,1,1,0,0,0,0,0
3,3.0,11,1,{},0,0,0,0,0,0,...,1,1,1,1,1,0,0,0,0,0
4,4.5,24,1,"{'Monday': '0:00-0:00', 'Tuesday': '7:30-17:00...",0,0,0,0,0,0,...,1,1,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8676,4.0,104,1,"{'Monday': '12:00-23:00', 'Tuesday': '12:00-23...",0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,0,0
8677,3.5,43,1,"{'Monday': '11:00-23:00', 'Tuesday': '11:00-23...",0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,0
8678,4.0,177,1,"{'Monday': '11:30-22:00', 'Tuesday': '11:30-22...",0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,0
8679,4.0,16,1,"{'Monday': '11:00-21:00', 'Tuesday': '11:00-21...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
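
The notebook does not show how the three sets are built, so the following is only a sketch of one common approach: two successive calls to scikit-learn's `train_test_split`, stratified on the label. The 60/20/20 proportions, the `random_state` and the toy data frame are all assumptions made for illustration. On the real `Resturants` frame the `hours` dictionary column would also have to be dropped before fitting a model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared Resturants frame (illustrative values only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "stars": rng.choice([3.0, 3.5, 4.0, 4.5], size=100),
    "review_count": rng.integers(3, 200, size=100),
    "is_open": [1] * 70 + [0] * 30,
})

X = toy.drop(columns=["is_open"])
y = toy["is_open"]

# First hold out 20% as the test set, keeping the open/closed ratio
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Then take 25% of the remainder (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the share of closed restaurants similar across the three sets, which matters when one class is much rarer than the other.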


## Predicting whether a business closes

### Baseline

As a baseline we use a classifier that assigns the labels open and closed at random.
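
A random baseline of this kind can be sketched with scikit-learn's `DummyClassifier`. The choice of `strategy="uniform"` (each class guessed with equal probability) is an assumption, since the notebook does not say which random scheme is meant, and the feature matrix below is toy data rather than the restaurant features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy stand-in for the features and the is_open label (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# "uniform" ignores the features and picks each class with equal probability
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

Any real model should beat this accuracy clearly. With imbalanced classes, `strategy="most_frequent"` is often a harder baseline to beat than uniform guessing.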

To predict which restaurants close, we build a binary classifier that labels each business as either open or closed. Logistic regression is used for this purpose.

### Logistic regression

In [75]:
from sklearn.linear_model import LogisticRegression

In [76]:
# Instantiate the model; without the parentheses clf would be the class itself
clf = LogisticRegression()
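
The cells above stop at creating the classifier. A minimal sketch of fitting and evaluating it could look as follows. It uses synthetic features in place of the restaurant data, and `max_iter=1000` and the 80/20 split are illustrative assumptions rather than settings taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Accuracy on the held-out rows
test_accuracy = clf.score(X_test, y_test)
# predict_proba columns follow clf.classes_, so column 0 is class 0 (closed)
closed_probability = clf.predict_proba(X_test)[:, 0]
print(test_accuracy)
```

On the real data, `X` would be the prepared feature columns and `y` the `is_open` column.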

## Interpreting factors influencing prediction

In [78]:
from sklearn.datasets import load_iris