# Investor part 3: What makes restaurants close

In this part of the analysis we investigate which factors make restaurants in Toronto close. The focus is on properties of the restaurants themselves; the general economic situation is not taken into account. To answer this question, we first examine whether the data contain any underlying structure that determines whether a restaurant has to close. This is done by building classifiers that predict whether a restaurant is closed. By examining the constructed classifiers, and how their predictions relate to a restaurant's features, we indirectly determine which features influence whether a restaurant is forced to close.

## Load data

In [18]:
import pandas as pd
import numpy as np

We start by loading the data set containing information about all restaurants in Toronto.

In [19]:
Resturants=pd.read_csv('data/toronto_restaurant_business2.csv',encoding="utf8")

In [20]:
print("Total number of resturants in Toronto: {}".format(len(Resturants)))

Total number of resturants in Toronto: 8681


## Data preparation

First we remove the columns that only identify the individual restaurant: the business ID, the name and the address. Since we are only interested in restaurants in Toronto, city and state are removed as well.

In [21]:
Resturants.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,rVBPQdeayMYht4Uv_FOLHg,Gourmet Burger Company,843 Kipling Avenue,Toronto,ON,M8Z 5G9,43.633291,-79.531768,3.0,13,0,"{""RestaurantsPriceRange2"": ""2"", ""RestaurantsAt...","Restaurants, Burgers, Food",{}
1,0QjROMVW9ACKjhSEfHqNCQ,Mi Mi Restaurant,688 Gerrard Street E,Toronto,ON,M4M 1Y3,43.666376,-79.348773,4.0,116,1,"{""RestaurantsTakeOut"": ""True"", ""Alcohol"": ""bee...","Vietnamese, Restaurants","{""Monday"": ""11:00-22:00"", ""Tuesday"": ""11:00-22..."
2,8k62wYhDVq1-652YbJi5eg,Tim Hortons,90 Adelaide Street W,Toronto,ON,M5H 3V9,43.649859,-79.38206,3.0,8,1,"{""OutdoorSeating"": ""False"", ""RestaurantsDelive...","Bagels, Donuts, Food, Cafes, Coffee & Tea, Res...",{}
3,0DnQh8SE8BSnvJltGCCiWg,Chick-N-Joy,3-1265 York Mills Road,Toronto,ON,M3A 1Z3,43.765279,-79.326248,3.0,11,1,"{""NoiseLevel"": ""loud"", ""BusinessParking"": {""ga...","Fast Food, Restaurants, Chicken Shop",{}
4,NLaK58WvlNQdUunSIkt-jA,Zav Coffee Shop & Gallery,2048 Danforth Avenue,Toronto,ON,M4C 1J6,43.685608,-79.313936,4.5,24,1,"{""DogsAllowed"": ""False"", ""OutdoorSeating"": ""Tr...","Coffee & Tea, Restaurants, Sandwiches, Food","{""Monday"": ""0:00-0:00"", ""Tuesday"": ""7:30-17:00..."


In [22]:
Resturants=Resturants.drop(['name','address','city','state','business_id'],axis=1)

Both the postal code and the longitude and latitude provide information about the location, so including both would be redundant. Longitude and latitude give the more precise position, while the postal code gives a more general indication of the area.

In [23]:
print("There are {} different postal codes".format(len(Resturants.postal_code.unique())))

There are 3176 different postal codes


If postal codes were to be included, it would be necessary to one-hot encode them. With 3176 different postal codes among 8681 restaurants, this would make the total number of attributes very large compared to the number of samples, and the observations for most postal codes would be very sparse. For this reason the postal code attribute is removed from the data set.

In [24]:
Resturants=Resturants.drop(['postal_code'],axis=1)

Longitude and latitude, on the other hand, provide very specific information about the location. As the influence of a restaurant's location has been investigated in a previous analysis, they are not included here.

In [25]:
Resturants=Resturants.drop(['latitude','longitude'],axis=1)

### Transforming attributes of resturants

The listed *attributes* of each restaurant are dictionaries stored in string format. The first step towards transforming them into a useful format is therefore to convert them into actual dictionaries.

In [26]:
# attributes are transformed from string to dictionaries
Resturants.attributes=Resturants.attributes.apply(lambda d: eval(d))
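Since `eval` executes whatever expression the string contains, a stricter parser can be preferable when the strings come from a file. A minimal sketch using `ast.literal_eval` from the standard library, which only accepts Python literals; the helper name `parse_attributes` and the example string are hypothetical, and it is assumed that the stored strings contain nothing but literals:

```python
import ast

def parse_attributes(raw):
    """Parse a dictionary stored as text without executing arbitrary code."""
    # ast.literal_eval only accepts literals (dicts, strings, numbers, None, ...)
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary, got {}".format(type(parsed).__name__))
    return parsed

# Hypothetical example in the same shape as the attributes column
example = '{"RestaurantsPriceRange2": "2", "BusinessParking": {"garage": "False"}}'
parsed_example = parse_attributes(example)
```

With this helper, the cell above would become `Resturants.attributes.apply(parse_attributes)`.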

Not all attributes are listed for all restaurants, so a list of all available attributes is determined.

In [27]:
attributes=set([])
for d in Resturants.attributes:
    attributes=attributes.union(set(d.keys()))
print("The are the following {} attributes:".format(len(attributes)))
for attribute in attributes:
    print("- {}".format(attribute))

The are the following 37 attributes:
- DriveThru
- HairSpecializesIn
- BusinessAcceptsCreditCards
- BusinessAcceptsBitcoin
- RestaurantsReservations
- Music
- GoodForKids
- BikeParking
- BestNights
- RestaurantsTakeOut
- RestaurantsDelivery
- Smoking
- AgesAllowed
- AcceptsInsurance
- BusinessParking
- ByAppointmentOnly
- WheelchairAccessible
- NoiseLevel
- Caters
- CoatCheck
- RestaurantsTableService
- DietaryRestrictions
- HappyHour
- HasTV
- RestaurantsGoodForGroups
- RestaurantsPriceRange2
- RestaurantsAttire
- OutdoorSeating
- GoodForMeal
- GoodForDancing
- Ambience
- Alcohol
- WiFi
- BYOB
- DogsAllowed
- RestaurantsCounterService
- Corkage


Each attribute is now represented as an individual column in the dataframe.

In [28]:
print("Percent of resturants for which attribute is NaN:")
print("---------------------------------------------------------")
for attribute in attributes:
    Resturants[attribute]=Resturants.attributes.apply(lambda a:  a[attribute] if (attribute in a) else np.nan)
    print("{}: {}%".format(attribute,Resturants[attribute].isna().sum()/len(Resturants[attribute].isna())*100))
Resturants=Resturants.drop(['attributes'],axis=1)

Percent of resturants for which attribute is NaN:
---------------------------------------------------------
DriveThru: 96.49809929731597%
HairSpecializesIn: 99.9769611795876%
BusinessAcceptsCreditCards: 96.8667204239143%
BusinessAcceptsBitcoin: 99.9769611795876%
RestaurantsReservations: 19.68667204239143%
Music: 92.25895634143532%
GoodForKids: 22.474369312291213%
BikeParking: 39.00472295818454%
BestNights: 94.86234304803594%
RestaurantsTakeOut: 15.631839649809931%
RestaurantsDelivery: 22.209422877548672%
Smoking: 95.72629881350075%
AgesAllowed: 99.9769611795876%
AcceptsInsurance: 99.9769611795876%
BusinessParking: 18.78815804630803%
ByAppointmentOnly: 96.72848750143993%
WheelchairAccessible: 82.67480704987905%
NoiseLevel: 34.48911415735515%
Caters: 45.89333026149061%
CoatCheck: 95.63414353185117%
RestaurantsTableService: 66.34028337749108%
DietaryRestrictions: 99.86176707752563%
HappyHour: 87.67423107936874%
HasTV: 23.77606266559152%
RestaurantsGoodForGroups: 20.94228775486695%
Restaur
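The expansion into one column per attribute can also be written without an explicit loop. A minimal sketch, assuming the column already holds parsed dictionaries; the three dictionaries below are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the parsed attributes column
attribute_dicts = pd.Series([
    {"WiFi": "free", "HasTV": "True"},
    {"WiFi": "no"},
    {},
])

# One column per key; keys missing from a row become NaN
expanded = pd.DataFrame(attribute_dicts.tolist(), index=attribute_dicts.index)

# isna().mean() is the fraction of missing values per column
percent_nan = expanded.isna().mean() * 100
```

Here `percent_nan` holds the same kind of percentages as the listing above, one entry per attribute.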

It is clear that far from all attributes are listed for all restaurants. NaN values do not provide information about a restaurant and therefore cannot be used directly when training the classifier. Furthermore, some attributes are missing for most restaurants. In such cases it is assumed that a value for the NaN cases cannot reasonably be derived. Attributes for which 40% or more of the restaurants have no value listed are therefore removed. Having information about an attribute for only 60% of the observations is still sparse, but the threshold is chosen so that some less common attributes can be included.

In [29]:
disgarded_attributes=set([])
for attribute in attributes:
    percent_NaN=Resturants[attribute].isna().sum()/len(Resturants[attribute].isna())*100
    if percent_NaN>=40:
        Resturants=Resturants.drop([attribute],axis=1)
        disgarded_attributes=disgarded_attributes.union(set([attribute]))
attributes.difference_update(disgarded_attributes)        
print("Attributes disgarded:")
print("-----------------------------")
for a in disgarded_attributes:
      print(a) 
print("-----------------------------")
print("-----------------------------")
print("There are {} remaing attributes".format(len(attributes)))

Attributes disgarded:
Smoking
DriveThru
HairSpecializesIn
AgesAllowed
BusinessAcceptsCreditCards
BusinessAcceptsBitcoin
AcceptsInsurance
ByAppointmentOnly
WheelchairAccessible
Caters
CoatCheck
RestaurantsTableService
DietaryRestrictions
HappyHour
Music
GoodForMeal
BestNights
GoodForDancing
BYOB
RestaurantsCounterService
DogsAllowed
Corkage
-----------------------------
There are 15 remaing attributes
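The same cut-off can be expressed as a column filter. A small sketch on a made-up frame, where `max_missing_share` is a hypothetical name for the 40% threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'rare' is missing in 3 of 4 rows
toy = pd.DataFrame({
    "common": ["True", "False", np.nan, "True"],
    "rare": [np.nan, np.nan, np.nan, "free"],
})

max_missing_share = 0.40  # same 40% cut-off as above
missing_share = toy.isna().mean()

# Keep only columns whose share of NaN is below the cut-off
kept = toy.loc[:, missing_share < max_missing_share]
discarded = sorted(missing_share.index[missing_share >= max_missing_share])
```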


The remaining attributes are shown below.

In [30]:
print("Percent of resturants for which attribute is NaN:")
print("---------------------------------------------------------")
for attribute in attributes:
    print("{}: {}%".format(attribute,Resturants[attribute].isna().sum()/len(Resturants[attribute].isna())*100))

Percent of resturants for which attribute is NaN:
---------------------------------------------------------
RestaurantsReservations: 19.68667204239143%
GoodForKids: 22.474369312291213%
BikeParking: 39.00472295818454%
RestaurantsTakeOut: 15.631839649809931%
RestaurantsDelivery: 22.209422877548672%
BusinessParking: 18.78815804630803%
NoiseLevel: 34.48911415735515%
HasTV: 23.77606266559152%
RestaurantsGoodForGroups: 20.94228775486695%
RestaurantsPriceRange2: 19.32957032599931%
RestaurantsAttire: 26.414007602810734%
OutdoorSeating: 21.276350650846677%
Ambience: 21.518258265176822%
Alcohol: 29.24778251353531%
WiFi: 33.19894021426103%


For the 15 remaining attributes there is still a need to find a solution for NaN values.

In [31]:
print("There are {} resturants with no NaN listed for attributes".format(len(Resturants.dropna())))

There are 3730 resturants with no NaN listed for attributes


As we can see, dropping all restaurants with NaN values is not an option, as it would reduce the data set drastically (from 8681 to 3730 restaurants).

The examples below show that the kind of values listed differs considerably between attributes. This means that each attribute must be treated differently, and so must its NaN values.

In [32]:
Resturants.Ambience

0       {'romantic': 'False', 'intimate': 'False', 'cl...
1       {'romantic': 'False', 'intimate': 'False', 'cl...
2       {'touristy': 'False', 'hipster': 'False', 'rom...
3       {'touristy': 'False', 'hipster': 'False', 'rom...
4                                                     NaN
                              ...                        
8676    {'romantic': 'False', 'intimate': 'False', 'cl...
8677    {'romantic': 'False', 'intimate': 'False', 'cl...
8678    {'touristy': 'False', 'hipster': 'False', 'rom...
8679    {'romantic': 'False', 'intimate': 'False', 'cl...
8680    {'touristy': 'False', 'hipster': 'False', 'rom...
Name: Ambience, Length: 8681, dtype: object

In [33]:
Resturants.BikeParking

0         NaN
1        True
2       False
3        True
4        True
        ...  
8676     True
8677    False
8678     True
8679    False
8680     True
Name: BikeParking, Length: 8681, dtype: object

In [34]:
Resturants.NoiseLevel

0       average
1       average
2           NaN
3          loud
4           NaN
         ...   
8676    average
8677    average
8678    average
8679        NaN
8680      quiet
Name: NoiseLevel, Length: 8681, dtype: object

#### Ambience

In [35]:
Resturants.Ambience.apply(lambda r: type(r)).unique()

array([<class 'dict'>, <class 'float'>, <class 'str'>], dtype=object)

We can see that not all values are dictionaries or NaN values.

In [19]:
for a in Resturants.Ambience:
    if type(a)==str:
        print(a)

None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None


We see that the string None also occurs. This appears to correspond well with NaN. We now take a look at the dictionaries.

In [36]:
ambience_keys=set(Resturants.Ambience[0].keys())
Resturants.Ambience.apply(lambda d: set(d.keys())==ambience_keys if type(d)==dict else True)

0        True
1        True
2       False
3       False
4        True
        ...  
8676     True
8677     True
8678    False
8679     True
8680    False
Name: Ambience, Length: 8681, dtype: bool

We see that not all ambience dictionaries contain the same keys. We therefore collect all keys to get an overview.

In [37]:
ambience_keys=set([])
for d in Resturants.Ambience:
    if type(d)==dict:
        ambience_keys=ambience_keys.union(set(d.keys()))
print("The are the following {} keys used to describe ambience:".format(len(ambience_keys)))
for key in ambience_keys:
    print("- {}".format(key))

The are the following 9 keys used to describe ambience:
- hipster
- trendy
- upscale
- divey
- classy
- romantic
- touristy
- intimate
- casual


Each of the ambience keys is now represented by a new column, where NaN and None are represented as False. The values are encoded as 0 and 1.

In [38]:
for key in ambience_keys:
    Resturants[key]=Resturants.Ambience.apply(lambda a:  int(bool(a[key])) if type(a)==dict and (key in a) else 0)
Resturants=Resturants.drop(['Ambience'],axis=1)

In [39]:
print("Percent of resturants which have ambiance attribute:")
print("---------------------------------------------------------")
for ambience in ambience_keys:
    HaveAttribute=len(Resturants[Resturants[ambience]==1])        
    print("{}: {}%".format(ambience,HaveAttribute/len(Resturants[ambience])*100))

Percent of resturants which have ambiance attribute:
---------------------------------------------------------
hipster: 75.37150097914986%
trendy: 78.19375647966824%
upscale: 77.9172906347195%
divey: 32.76120262642553%
classy: 78.19375647966824%
romantic: 78.19375647966824%
touristy: 78.19375647966824%
intimate: 78.19375647966824%
casual: 78.19375647966824%
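One thing worth checking in the listing above is that six of the nine keys share exactly the same percentage. That is the pattern one would expect if the dictionary values are the strings 'True' and 'False', as the printed column suggests: `bool('False')` is `True` in Python, so `int(bool(a[key]))` returns 1 for every listed non-empty string. A minimal sketch of a stricter encoding, where `flag` is a hypothetical helper and the sample dictionary is made up:

```python
def flag(value):
    """Return 1 only when the value is True or the string 'True'."""
    return int(str(value) == "True")

# bool() on any non-empty string is True, including the string 'False'
naive = int(bool("False"))
strict = flag("False")

# Hypothetical ambience entry in the same shape as the column above
ambience = {"romantic": "False", "casual": "True", "divey": None}
encoded = {key: flag(value) for key, value in ambience.items()}
```

With `flag` in place of `int(bool(...))`, only keys explicitly marked 'True' would be counted as present.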


#### Wifi

In [40]:
Resturants.WiFi.unique()

array(['no', 'free', nan, 'paid', 'None'], dtype=object)

We see that there is an overlap between categories, with no and None meaning the same thing. Both values are therefore set to 'no'. NaN is here assumed to mean 'no' as well.

In [41]:
Resturants.WiFi=Resturants.WiFi.replace("None","no")
Resturants.WiFi=Resturants.WiFi.replace(np.nan,"no")

Wifi can now be one-hot encoded.

In [42]:
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.WiFi,prefix='Wifi_',drop_first=False)],axis=1)
Resturants=Resturants.drop(['WiFi'],axis=1)
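A detail that matters when the columns are referenced later: `pd.get_dummies` joins prefix and value with `prefix_sep`, which defaults to `'_'`, so a prefix that already ends in an underscore yields a double underscore in the column names. A short sketch with made-up values:

```python
import pandas as pd

wifi = pd.Series(["no", "free", "paid", "no"])  # hypothetical values

# prefix_sep defaults to '_', so a prefix ending in '_' gives a double underscore
with_trailing = pd.get_dummies(wifi, prefix="Wifi_")
without_trailing = pd.get_dummies(wifi, prefix="Wifi")
```

The first call produces names of the form `Wifi__free`, the second `Wifi_free`.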

#### Noise level

In [43]:
Resturants.NoiseLevel.unique()

array(['average', nan, 'loud', 'quiet', 'very_loud', 'None'], dtype=object)

We again see that None and NaN overlap. When nothing has been listed, we assume an average noise level.

In [44]:
Resturants.NoiseLevel=Resturants.NoiseLevel.replace("None","average")
Resturants.NoiseLevel=Resturants.NoiseLevel.replace(np.nan,"average")

In [45]:
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.NoiseLevel,prefix='NoiseLevel_',drop_first=False)],axis=1)
Resturants=Resturants.drop(['NoiseLevel'],axis=1)

#### Outdoor seating

In [46]:
Resturants.OutdoorSeating.unique()

array(['False', 'True', nan, 'None'], dtype=object)

Again the same pattern can be observed. If nothing is listed, we assume False.

In [47]:
Resturants.OutdoorSeating=Resturants.OutdoorSeating.replace("None",0)
Resturants.OutdoorSeating=Resturants.OutdoorSeating.replace(np.nan,0)
Resturants.OutdoorSeating=Resturants.OutdoorSeating.replace('False',0)
Resturants.OutdoorSeating=Resturants.OutdoorSeating.replace('True',1)
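The same four replacements recur for every True/False attribute in the following sections. As a sketch, they could be collected in one helper; `encode_flag` is a hypothetical name, and the helper assumes that 'True', 'False', 'None' and NaN are the only values that occur:

```python
import numpy as np
import pandas as pd

def encode_flag(column):
    """Map 'True' to 1 and everything else ('False', 'None', NaN) to 0."""
    return (column == "True").astype(int)

# Hypothetical values in the same shape as OutdoorSeating
outdoor = pd.Series(["False", "True", np.nan, "None"])
encoded = encode_flag(outdoor)
```

A call such as `Resturants.OutdoorSeating=encode_flag(Resturants.OutdoorSeating)` would then replace the four lines above.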

#### Takeout

In [48]:
Resturants.RestaurantsTakeOut.unique()

array(['True', 'False', nan, 'None'], dtype=object)

The same approach is used as for the previous attributes.

In [49]:
Resturants.RestaurantsTakeOut=Resturants.RestaurantsTakeOut.replace("None",0)
Resturants.RestaurantsTakeOut=Resturants.RestaurantsTakeOut.replace(np.nan,0)
Resturants.RestaurantsTakeOut=Resturants.RestaurantsTakeOut.replace('False',0)
Resturants.RestaurantsTakeOut=Resturants.RestaurantsTakeOut.replace('True',1)

#### Offers delivery

In [50]:
Resturants.RestaurantsDelivery.unique()

array(['False', 'True', nan, 'None'], dtype=object)

If nothing is listed, it is assumed that delivery is not offered

In [51]:
Resturants.RestaurantsDelivery=Resturants.RestaurantsDelivery.replace("None",0)
Resturants.RestaurantsDelivery=Resturants.RestaurantsDelivery.replace(np.nan,0)
Resturants.RestaurantsDelivery=Resturants.RestaurantsDelivery.replace('False',0)
Resturants.RestaurantsDelivery=Resturants.RestaurantsDelivery.replace('True',1)

#### Reservations

In [52]:
Resturants.RestaurantsReservations.unique()

array(['False', nan, 'True', 'None'], dtype=object)

In [53]:
Resturants.RestaurantsReservations=Resturants.RestaurantsReservations.replace("None",0)
Resturants.RestaurantsReservations=Resturants.RestaurantsReservations.replace(np.nan,0)
Resturants.RestaurantsReservations=Resturants.RestaurantsReservations.replace('False',0)
Resturants.RestaurantsReservations=Resturants.RestaurantsReservations.replace('True',1)

#### Has TV

In [54]:
Resturants.HasTV.unique()

array(['False', 'True', nan, 'None'], dtype=object)

In [55]:
Resturants.HasTV=Resturants.HasTV.replace("None",0)
Resturants.HasTV=Resturants.HasTV.replace(np.nan,0)
Resturants.HasTV=Resturants.HasTV.replace('False',0)
Resturants.HasTV=Resturants.HasTV.replace('True',1)

#### Restaurants Price Range

In [56]:
Resturants.RestaurantsPriceRange2.unique()

array(['2', '1', '3', nan, '4', 'None'], dtype=object)

None and NaN are assumed to be the most common price range.

In [57]:
Resturants.RestaurantsPriceRange2=Resturants.RestaurantsPriceRange2.replace('None',np.nan)
Resturants.RestaurantsPriceRange2=Resturants.RestaurantsPriceRange2.apply(lambda l: float(l) if type(l)==str else l)

In [58]:
common_priceRange=Resturants.RestaurantsPriceRange2.mode()[0] # mode() returns a Series; take its first entry
Resturants.RestaurantsPriceRange2=Resturants.RestaurantsPriceRange2.fillna(common_priceRange)

#### Alcohol

In [59]:
Resturants.Alcohol.unique()

array(['none', 'beer_and_wine', nan, 'full_bar', 'None'], dtype=object)

"None" and "none" are here interpreted to mean that alcohol is not served. NaN is assigned the most common value.

In [60]:
common_alcohol_serving=Resturants.Alcohol.mode()
print("The most common type of status for alcohol serving is:")
print(common_alcohol_serving)
Resturants.Alcohol=Resturants.Alcohol.fillna(common_alcohol_serving[0]) # mode() returns a Series; use its first entry
Resturants.Alcohol=Resturants.Alcohol.replace('none','None')

The most common type of status for alcohol serving is:
0    full_bar
dtype: object


In [61]:
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.Alcohol,prefix='Alcohol',drop_first=False)],axis=1)
Resturants=Resturants.drop(['Alcohol','Alcohol_None'],axis=1)
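Note that the mode above is computed before 'none' and 'None' are merged, so the two spellings are counted separately. Whether this changes the most common value for the real data cannot be told from the output shown. A small sketch with made-up values illustrates how the order of the two steps can matter:

```python
import numpy as np
import pandas as pd

# Hypothetical values; 'none' and 'None' both mean that no alcohol is served
alcohol = pd.Series(["none", "None", "full_bar", "full_bar", "full_bar",
                     "none", "None", np.nan])

mode_before = alcohol.mode()[0]           # the two spellings are counted separately
merged = alcohol.replace("none", "None")  # merge the spellings first
mode_after = merged.mode()[0]             # now they are counted together

filled = merged.fillna(mode_after)
```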

#### Business parking

In [62]:
Resturants.BusinessParking

0       {'garage': 'False', 'street': 'False', 'valida...
1       {'garage': 'False', 'street': 'True', 'validat...
2       {'garage': 'False', 'street': 'False', 'valida...
3       {'garage': 'False', 'street': 'False', 'valida...
4       {'garage': 'False', 'street': 'False', 'valida...
                              ...                        
8676    {'garage': 'False', 'street': 'True', 'validat...
8677    {'garage': 'False', 'street': 'False', 'valida...
8678    {'garage': 'False', 'street': 'True', 'validat...
8679                                                  NaN
8680    {'garage': 'False', 'street': 'True', 'validat...
Name: BusinessParking, Length: 8681, dtype: object

In [63]:
parking_keys=set([])
for d in Resturants.BusinessParking:
    if type(d)==dict:
        parking_keys=parking_keys.union(set(d.keys()))
print("The are the following {} keys used to discribe parking:".format(len(parking_keys)))
print("-------------------------------------------------------")
for key in parking_keys:
    print("- {}".format(key))

The are the following 5 keys used to discribe parking:
-------------------------------------------------------
- valet
- garage
- street
- lot
- validated


In [64]:
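for key in parking_keys:
    # Values are stored as the strings 'True'/'False'; compare explicitly, since bool('False') is True
    Resturants[key]=Resturants.BusinessParking.apply(lambda a:  int(str(a[key])=='True') if type(a)==dict and (key in a) else 0)
Resturants=Resturants.drop(['BusinessParking'],axis=1)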

#### Kid friendly

In [65]:
Resturants.GoodForKids.unique()

array(['True', nan, 'False', 'None'], dtype=object)

In this case it is assumed that the restaurant is not kid friendly if the value is None or NaN.

In [66]:
Resturants.GoodForKids=Resturants.GoodForKids.replace("None",0)
Resturants.GoodForKids=Resturants.GoodForKids.replace(np.nan,0)
Resturants.GoodForKids=Resturants.GoodForKids.replace('False',0)
Resturants.GoodForKids=Resturants.GoodForKids.replace('True',1)

#### Attire

In [67]:
Resturants.RestaurantsAttire.unique()

array(['casual', nan, 'dressy', 'formal', 'None'], dtype=object)

Here we allow restaurants to not have a specific attire listed.

In [68]:
Resturants.RestaurantsAttire=Resturants.RestaurantsAttire.replace(np.nan,'None')
Resturants=pd.concat([Resturants,pd.get_dummies(Resturants.RestaurantsAttire,prefix='Attire',drop_first=False)],axis=1)
Resturants=Resturants.drop(['RestaurantsAttire','Attire_None'],axis=1)

#### Bike parking

In [69]:
Resturants.BikeParking.unique()

array([nan, 'True', 'False', 'None'], dtype=object)

Here NaN and None are interpreted to mean False.

In [70]:
Resturants.BikeParking=Resturants.BikeParking.replace("None",0)
Resturants.BikeParking=Resturants.BikeParking.replace(np.nan,0)
Resturants.BikeParking=Resturants.BikeParking.replace('False',0)
Resturants.BikeParking=Resturants.BikeParking.replace('True',1)

#### Suitable for groups

In [71]:
Resturants.RestaurantsGoodForGroups.unique()

array(['True', nan, 'False', 'None'], dtype=object)

In [72]:
Resturants.RestaurantsGoodForGroups=Resturants.RestaurantsGoodForGroups.replace("None",0)
Resturants.RestaurantsGoodForGroups=Resturants.RestaurantsGoodForGroups.replace(np.nan,0)
Resturants.RestaurantsGoodForGroups=Resturants.RestaurantsGoodForGroups.replace('False',0)
Resturants.RestaurantsGoodForGroups=Resturants.RestaurantsGoodForGroups.replace('True',1)

### Categories
The restaurants have a number of categories listed, which need to be transformed. We start by taking a look at all mentioned categories.

In [73]:
Resturants.categories=Resturants.categories.str.lower().apply(lambda c: c.split(", "))

In [74]:
categories=set([])
for c in Resturants.categories:
    categories=categories.union(set(c))
print("The are the following {} categories exist:".format(len(categories)))
print("--------------------------------------------")
for category in categories:
    print("- {}".format(category))

The are the following 384 categories exist:
--------------------------------------------
- financial services
- salad
- ethnic food
- health & medical
- cooking classes
- chicken shop
- chicken wings
- lawyers
- nightlife
- venues & event spaces
- dumplings
- eyelash service
- polish
- salvadoran
- flowers & gifts
- bakeries
- grocery
- gay bars
- specialty schools
- american (traditional)
- middle eastern
- modern european
- videos & video game rental
- dive bars
- sports bars
- piano bars
- video game stores
- caribbean
- comedy clubs
- bed & breakfast
- buffets
- egyptian
- irish
- barbeque
- convenience stores
- brazilian
- community service/non-profit
- supernatural readings
- moroccan
- hot pot
- event planning & services
- lounges
- ice cream & frozen yogurt
- resorts
- bartenders
- playgrounds
- food tours
- community centers
- italian
- austrian
- tax services
- gift shops
- taiwanese
- bubble tea
- champagne bars
- alternative medicine
- fitness & instruction
- ukrainian
- sy

It is now encoded whether each restaurant belongs to each of the categories.

In [75]:
for category in categories:
    Resturants[category]=Resturants.categories.apply(lambda c:  int(bool(category in c)))
Resturants=Resturants.drop(['categories'],axis=1)
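This kind of multi-label encoding can also be produced directly from the raw comma-separated strings with `Series.str.get_dummies`. A minimal sketch on made-up category strings; the threshold `min_count` is a hypothetical name and is set to 1 here only because the toy data is tiny:

```python
import pandas as pd

# Hypothetical raw strings in the same shape as the categories column
raw = pd.Series([
    "Restaurants, Burgers, Food",
    "Vietnamese, Restaurants",
    "Food, Lawyers",
])

# One 0/1 column per distinct category
encoded = raw.str.lower().str.get_dummies(sep=", ")

# Column sums give the number of restaurants per category
counts = encoded.sum()

# Keep only categories that occur in more than min_count rows
min_count = 1
filtered = encoded.loc[:, counts > min_count]
```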

In [76]:
print("Number of resturants which belong in category:")
print("---------------------------------------------------------")
for category in categories:
    InCategory=len(Resturants[Resturants[category]==1])        
    print("{}: {}".format(category,InCategory))

Number of resturants which belong in category:
---------------------------------------------------------
financial services: 2
salad: 310
ethnic food: 155
health & medical: 4
cooking classes: 3
chicken shop: 87
chicken wings: 272
lawyers: 1
nightlife: 1213
venues & event spaces: 77
dumplings: 6
eyelash service: 1
polish: 21
salvadoran: 8
flowers & gifts: 6
bakeries: 296
grocery: 60
gay bars: 4
specialty schools: 6
american (traditional): 419
middle eastern: 361
modern european: 60
videos & video game rental: 1
dive bars: 32
sports bars: 130
piano bars: 2
video game stores: 1
caribbean: 188
comedy clubs: 3
bed & breakfast: 1
buffets: 47
egyptian: 2
irish: 33
barbeque: 237
convenience stores: 11
brazilian: 19
community service/non-profit: 2
supernatural readings: 1
moroccan: 15
hot pot: 34
event planning & services: 341
lounges: 170
ice cream & frozen yogurt: 120
resorts: 2
bartenders: 1
playgrounds: 3
food tours: 1
community centers: 2
italian: 670
austrian: 1
tax services: 1
gift shops

Since certain categories have very few restaurants belonging to them, such a category could function as an identifier for individual restaurants. Categories with 10 or fewer restaurants are therefore not included.

In [77]:
disgarded_categories=set([])
for category in categories:
    InCategory=len(Resturants[Resturants[category]==1]) 
    if InCategory<=10:
        Resturants=Resturants.drop([category],axis=1)
        disgarded_categories=disgarded_categories.union(set([category]))
categories.difference_update(disgarded_categories)        
print("Categories disgarded:")
for a in disgarded_categories:
      print(a) 
print("-----------------------------")
print("There are {} remaing categories".format(len(categories)))

Categories disgarded:
dinner theater
financial services
health & medical
fruits & veggies
window washing
shopping centers
cooking classes
meditation centers
lawyers
skin care
dumplings
eyelash service
art schools
milkshake bars
stadiums & arenas
cards & stationery
salvadoran
flowers & gifts
gay bars
specialty schools
furniture stores
australian
video game stores
videos & video game rental
bocce ball
piano bars
comedy clubs
hungarian
swimming pools
electricians
bed & breakfast
egyptian
hair removal
toy stores
department stores
community service/non-profit
supernatural readings
wholesalers
irish pub
resorts
gelato
bartenders
hair loss centers
playgrounds
hostels
food tours
community centers
web design
scottish
austrian
shaved snow
cinema
animal shelters
smokehouse
acne treatment
life coach
tax services
gift shops
music & dvds
champagne bars
alternative medicine
fitness & instruction
ukrainian
yelp events
syrian
tobacco shops
ethnic grocery
acai bowls
bookstores
pet sitting
accessories
in

### Opening hours
The last piece of information available concerns the opening hours of the restaurants.

In [78]:
Resturants.hours

0                                                      {}
1       {"Monday": "11:00-22:00", "Tuesday": "11:00-22...
2                                                      {}
3                                                      {}
4       {"Monday": "0:00-0:00", "Tuesday": "7:30-17:00...
                              ...                        
8676    {"Monday": "12:00-23:00", "Tuesday": "12:00-23...
8677    {"Monday": "11:00-23:00", "Tuesday": "11:00-23...
8678    {"Monday": "11:30-22:00", "Tuesday": "11:30-22...
8679    {"Monday": "11:00-21:00", "Tuesday": "11:00-21...
8680                                                   {}
Name: hours, Length: 8681, dtype: object

In [79]:
print("Is resturants opending hours known:")
Resturants.hours.apply(lambda h: h!='{}').value_counts()

Is resturants opending hours known:


True     6720
False    1961
Name: hours, dtype: int64

When the opening hours are unknown, they are set to the most common value among the remaining restaurants.

To get a better overview of a restaurant's opening hours, each day is divided into four time intervals. This captures the broad pattern while abstracting away the finer details, which keeps the number of attributes down.
- Early Morning: 0:00-5:59
- Morning: 6:00-9:59
- Midday: 10:00-16:59
- Evening: 17:00-23:59

A restaurant whose opening hours partially fall within a time interval is considered open in that interval.
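The overlap rule can be written as a small function on clock times. A direct comparison of clock times does not cover listings that close after midnight, such as 11:00-2:00, where the closing time is earlier than the opening time. The sketch below, with the hypothetical helper `overlaps`, treats such a listing as running past midnight and, as a simplifying assumption, attributes the hours after midnight to the same day:

```python
import datetime

def overlaps(open_hour, close_hour, start, end):
    """Return True if the opening period shares any time with [start, end]."""
    if open_hour < close_hour:
        # Ordinary case: opening and closing on the same day
        return not (end < open_hour or close_hour < start)
    # close_hour <= open_hour: assumed to run past midnight (or around the clock),
    # so the period is [open_hour, 24:00) together with [0:00, close_hour]
    return open_hour <= end or start <= close_hour

early_morning = (datetime.time(0, 0), datetime.time(5, 59))
midday = (datetime.time(10, 0), datetime.time(16, 59))

# Hypothetical listing "11:00-2:00": closes after midnight
late_open, late_close = datetime.time(11, 0), datetime.time(2, 0)
late_midday = overlaps(late_open, late_close, *midday)
late_early = overlaps(late_open, late_close, *early_morning)

# Listing "7:30-17:00": ordinary same-day hours
day_early = overlaps(datetime.time(7, 30), datetime.time(17, 0), *early_morning)
```

Under this reading a listing of 0:00-0:00 counts as open in every interval, which is one possible interpretation of that value.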

In [80]:
Resturants.hours=Resturants.hours.apply(lambda h: eval(h)) #transform into dictionary format

In [81]:
import datetime

#Adding openinghours categories to data frame, with default value 0 for closed in all entries.
days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
timeIntervals={"Early Morning":(datetime.time(0,0,0),datetime.time(5,59,0)),
               "Morning":(datetime.time(6,0,0),datetime.time(9,59,0)),
               "Midday":(datetime.time(10,0,0),datetime.time(16,59,0)),
               "Evening":(datetime.time(17,0,0),datetime.time(23,59,0))}
for day in days:
    for time in timeIntervals.keys():
        Resturants['open_'+day+"_"+time]=0 #default closed

In [82]:
#If opening hours are listed, then for each day it is tested if the restaurant is open in the different time intervals.
for index, resturant in Resturants.iterrows():
    if resturant.hours !={}:
        for day in days:
            if day in resturant.hours.keys():
                opening_hours=resturant.hours[day].split("-")
                open_hour=pd.to_datetime(opening_hours[0]).time()
                close_hour=pd.to_datetime(opening_hours[1]).time()
                for time in timeIntervals.keys():
                    if not (timeIntervals[time][1]<open_hour or close_hour<timeIntervals[time][0]): # opening hours overlap with time interval
                        Resturants.loc[index,'open_'+day+"_"+time]=1 # .loc writes to the dataframe itself and avoids chained assignment

For restaurants where no opening hours are listed, the values are now set to the most common opening hours.

In [83]:
has_hours=Resturants.hours.apply(lambda h: h!={}) # True where opening hours are listed
resturants_with_openhours=Resturants[has_hours]
for day in days:
    for time in timeIntervals.keys():
        column='open_'+day+"_"+time
        # mode() returns a Series, so its first entry is used as the most common value
        Resturants.loc[~has_hours,column]=resturants_with_openhours[column].mode()[0]


In [84]:
Resturants=Resturants.drop(["hours"],axis=1)

### Normalization

After the transformation there are 229 attributes listed for each restaurant. However, the attributes differ considerably in range. To be able to interpret the influence of the attributes based on the constructed classifiers, the data are therefore rescaled to the interval [0, 1] using min-max normalization.

In [85]:
Resturants

Unnamed: 0,stars,review_count,is_open,RestaurantsReservations,GoodForKids,BikeParking,RestaurantsTakeOut,RestaurantsDelivery,HasTV,RestaurantsGoodForGroups,...,open_Friday_Midday,open_Friday_Evening,open_Saturday_Early Morning,open_Saturday_Morning,open_Saturday_Midday,open_Saturday_Evening,open_Sunday_Early Morning,open_Sunday_Morning,open_Sunday_Midday,open_Sunday_Evening
0,3.0,13,0,0,1,0,1,0,0,1,...,1,1,0,0,1,1,0,0,1,1
1,4.0,116,1,0,1,1,1,0,1,1,...,1,1,0,0,1,1,0,0,1,1
2,3.0,8,1,0,1,0,1,0,1,1,...,1,1,0,0,1,1,0,0,1,1
3,3.0,11,1,0,1,1,1,1,1,1,...,1,1,0,0,1,1,0,0,1,1
4,4.5,24,1,0,0,1,1,0,0,0,...,1,1,0,1,1,1,0,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8676,4.0,104,1,1,1,1,1,1,1,1,...,1,1,0,0,1,1,0,0,1,1
8677,3.5,43,1,0,1,0,1,1,1,1,...,1,1,0,0,1,1,0,0,1,1
8678,4.0,177,1,1,1,1,1,1,0,1,...,1,1,0,0,1,1,0,0,1,1
8679,4.0,16,1,0,1,0,1,0,1,1,...,1,1,0,0,1,1,0,0,0,0


In [91]:
Resturants_normalized=(Resturants-Resturants.min())/(Resturants.max()-Resturants.min())
Resturants_normalized

Unnamed: 0,stars,review_count,is_open,RestaurantsReservations,GoodForKids,BikeParking,RestaurantsTakeOut,RestaurantsDelivery,HasTV,RestaurantsGoodForGroups,...,open_Friday_Midday,open_Friday_Evening,open_Saturday_Early Morning,open_Saturday_Morning,open_Saturday_Midday,open_Saturday_Evening,open_Sunday_Early Morning,open_Sunday_Morning,open_Sunday_Midday,open_Sunday_Evening
0,0.500,0.003630,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
1,0.750,0.041016,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
2,0.500,0.001815,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
3,0.500,0.002904,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
4,0.875,0.007623,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8676,0.750,0.036661,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
8677,0.625,0.014519,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
8678,0.750,0.063158,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
8679,0.750,0.004719,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
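
To make the scaling concrete, here is a minimal sketch of the min-max formula used in the cell above, written in plain Python. The helper `min_max_scale` and the sample ratings are illustrative and not part of the notebook. The guard for constant columns is an added assumption: the pandas expression above would give NaN for a column whose minimum equals its maximum.

```python
def min_max_scale(values):
    """Scale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0 instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical star ratings covering a 1 to 5 scale.
stars = [1.0, 3.0, 4.0, 4.5, 5.0]
scaled = min_max_scale(stars)
print(scaled)
```

Under the assumption that the ratings span 1 to 5, a rating of 3.0 maps to 0.5 and 4.5 maps to 0.875, which agrees with the normalized table above.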


## Defining training and test sets

Before the data set is divided, the target value is separated from the rest of the attributes. As the overall goal is to determine whether a restaurant is closed, the target is created from the attribute *is_open*, which is inverted into *is_closed*.

In [94]:
isClosed=Resturants_normalized.is_open.apply(lambda l: 0 if l==1 else 1)
Resturants_normalized=Resturants_normalized.drop(['is_open'],axis=1)

Before defining the data sets we take a look at how many of the restaurants are closed.

In [102]:
print("Percent of resturants which are closed: {}%".format(len(isClosed[isClosed==1])/len(isClosed)*100))

Percent of resturants which are closed: 37.05794263333717%


While there are more observations of open restaurants, with 37% being closed, the data set is still judged to be balanced well enough that no special measures are required.

In [104]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Resturants_normalized, isClosed, test_size=0.2, random_state=42)

The test set is defined to contain 20 per cent of the available data.

In [113]:
print("Training set")
print("---------------------")
print("Number of train samples: {}".format(len(y_train)))
print("Percent of samples which represent closed resturants: {}%".format(len(y_train[y_train==1])/len(y_train)*100))
print()
print("Test set")
print("---------------------")
print("Number of test samples: {}".format(len(y_test)))
print("Percent of samples which represent closed resturants: {}%".format(len(y_test[y_test==1])/len(y_test)*100))

Training set
---------------------
Number of train samples: 6944
Percent of samples which represent closed resturants: 37.06797235023041%

Test set
---------------------
Number of test samples: 1737
Percent of samples which represent closed resturants: 37.01784686240645%


## Predicting if a restaurant closes
To predict which restaurants close, we build classifiers that predict whether a business is closed or open.

### Baseline

To be able to measure the performance of the classifiers, a baseline model is constructed. The following baseline model uses random selection to classify each case.

In [74]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score

In [75]:
BaselineClf = DummyClassifier(strategy="uniform") # using random classification
BaselineClf.fit(X_train,y_train)
baselinePred=BaselineClf.predict(X_test)
print("Basline")
print("---------------------------------------")
print("Accuracy: {}".format(accuracy_score(y_test,baselinePred)))
print("Recall: {}".format(recall_score(y_test,baselinePred)))
print("Precision: {}".format(precision_score(y_test,baselinePred)))
print("F1-score: {}".format(f1_score(y_test,baselinePred)))

Basline
---------------------------------------
Accuracy: 0.4985607369027058
Recall: 0.4976671850699845
Precision: 0.3686635944700461
F1-score: 0.42356055592322966


As expected, the baseline has an accuracy close to 50%.
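
The other baseline scores can also be anticipated. The sketch below works out the expected scores of a uniform random guesser from the share of closed restaurants in the test set alone. It is an illustration of the arithmetic, not part of the original notebook, and it assumes that guesses are independent of the true labels.

```python
# Share of closed restaurants in the test set, as printed above (about 37%).
prevalence = 0.3701784686240645

# A uniform random guesser labels each restaurant "closed" with probability 0.5,
# independently of the true label. Expected share of each outcome:
tp = 0.5 * prevalence          # closed, predicted closed
fn = 0.5 * prevalence          # closed, predicted open
fp = 0.5 * (1 - prevalence)    # open, predicted closed
tn = 0.5 * (1 - prevalence)    # open, predicted open

accuracy = tp + tn
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 3), round(recall, 3), round(precision, 3), round(f1, 3))

# For comparison: always predicting "open" (the majority class) would reach this accuracy.
majority_accuracy = 1 - prevalence
print(round(majority_accuracy, 3))
```

The expected precision equals the share of closed restaurants (about 0.37) and the expected F1-score is about 0.43, close to the measured baseline. Note that a classifier which always predicts "open" would reach roughly 63% accuracy, so accuracy alone is a weak yardstick here.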

### Logistic regression

For the model to be useful, it is important that its interpretability is high. Logistic regression is therefore chosen as the first alternative to the baseline model.

In [76]:
from sklearn.linear_model import LogisticRegression

In [77]:
LogisticRegressionClf = LogisticRegression(solver='liblinear') # liblinear: the data set is small enough that the memory-saving default solver is not necessary.
LogisticRegressionClf.fit(X_train,y_train)
LRPred=LogisticRegressionClf.predict(X_test)
print("Logistic regression classifer:")
print("---------------------------------------")
print("Accuracy: {}".format(accuracy_score(y_test,LRPred)))
print("Recall: {}".format(recall_score(y_test,LRPred)))
print("Precision: {}".format(precision_score(y_test,LRPred)))
print("F1-score: {}".format(f1_score(y_test,LRPred)))

Logistic regression classifer:
---------------------------------------
Accuracy: 0.7432354634427173
Recall: 0.5583203732503889
Precision: 0.6890595009596929
F1-score: 0.6168384879725086


As expected, the classifier based on logistic regression performs better than the baseline model, with an accuracy of just over 74%.
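
To see how the four scores relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are not printed by the notebook; they are inferred here from the reported scores and the test set size (1737 samples, 643 of them closed) and should be read as a reconstruction. The helper `scores_from_counts` is illustrative.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Return accuracy, recall, precision and F1 for binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Counts inferred from the printed scores (an assumption, not notebook output).
tp, fp, fn, tn = 359, 162, 284, 932
accuracy, recall, precision, f1 = scores_from_counts(tp, fp, fn, tn)
print("Accuracy: {:.4f}".format(accuracy))
print("Recall: {:.4f}".format(recall))
print("Precision: {:.4f}".format(precision))
print("F1-score: {:.4f}".format(f1))
```

Read this way, the recall of about 0.56 means that roughly 284 of the 643 closed restaurants in the test set are still predicted to be open.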

### Decision tree

The tuning of the hyperparameters is based on the following article:<br>
Mithrakumar, M. (11.11.2019). *How to tune a Decision Tree*. Towards Data Science: https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680. Visited 17.5.2020.

As a next possible approach a decision tree is used, which again allows easy interpretation of the model.<br>
Decision trees have a number of parameters which can be adjusted. To do this a validation set is defined, consisting of 20% of the training data.

In [80]:
X_train_tree, X_valid, y_train_tree, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=43)
print("Number of train samples: {}".format(len(y_train_tree)))
print("Number of validation samples: {}".format(len(y_valid)))

Number of train samples: 5555
Number of validation samples: 1389


In [81]:
from sklearn.tree import DecisionTreeClassifier

In [82]:
TreeClf1=DecisionTreeClassifier()
TreeClf1.fit(X_train_tree,y_train_tree)
Tree1pred=TreeClf1.predict(X_valid)
Tree1pred_train=TreeClf1.predict(X_train_tree)
print("Decision tree classifer 1:")
print("---------------------------------------")
print("Validation:")
print("Accuracy: {}".format(accuracy_score(y_valid,Tree1pred)))
print("Recall: {}".format(recall_score(y_valid,Tree1pred)))
print("Precision: {}".format(precision_score(y_valid,Tree1pred)))
print("F1-score: {}".format(f1_score(y_valid,Tree1pred)))
print()
print("Train:")
print("Accuracy: {}".format(accuracy_score(y_train_tree,Tree1pred_train)))
print("Recall: {}".format(recall_score(y_train_tree,Tree1pred_train)))
print("Precision: {}".format(precision_score(y_train_tree,Tree1pred_train)))
print("F1-score: {}".format(f1_score(y_train_tree,Tree1pred_train)))

Decision tree classifer 1:
---------------------------------------
Validation:
Accuracy: 0.7141828653707704
Recall: 0.587426326129666
Precision: 0.6152263374485597
F1-score: 0.6010050251256283

Train:
Accuracy: 0.9994599459945994
Recall: 0.9990314769975787
Precision: 0.999515503875969
F1-score: 0.9992734318236861


We see that in terms of both accuracy and F1-score the tree-based classifier performs worse than the model based on logistic regression. The near-perfect scores on the training set itself show that the model is overfitting. To address this issue, the minimum number of samples allowed in a leaf of the tree is raised from 1 to 5.

In [83]:
TreeClf2=DecisionTreeClassifier(min_samples_leaf=5)
TreeClf2.fit(X_train_tree,y_train_tree)
Tree2pred=TreeClf2.predict(X_valid)
Tree2pred_train=TreeClf2.predict(X_train_tree)
print("Decision tree classifer 2:")
print("---------------------------------------")
print("Validation:")
print("Accuracy: {}".format(accuracy_score(y_valid,Tree2pred)))
print("Recall: {}".format(recall_score(y_valid,Tree2pred)))
print("Precision: {}".format(precision_score(y_valid,Tree2pred)))
print("F1-score: {}".format(f1_score(y_valid,Tree2pred)))
print()
print("Train:")
print("Accuracy: {}".format(accuracy_score(y_train_tree,Tree2pred_train)))
print("Recall: {}".format(recall_score(y_train_tree,Tree2pred_train)))
print("Precision: {}".format(precision_score(y_train_tree,Tree2pred_train)))
print("F1-score: {}".format(f1_score(y_train_tree,Tree2pred_train)))

Decision tree classifer 2:
---------------------------------------
Validation:
Accuracy: 0.7350611951043916
Recall: 0.5854616895874263
Precision: 0.654945054945055
F1-score: 0.6182572614107884

Train:
Accuracy: 0.8586858685868587
Recall: 0.7549636803874092
Precision: 0.8482045701849836
F1-score: 0.7988726620548297


In [84]:
print("Depth of decision tree 2 is {}.".format(TreeClf2.get_depth()))

Depth of decision tree 2 is 24.


The smaller difference between training and validation scores indicates that raising the minimum number of samples required in leaf nodes has helped to decrease overfitting. However, there is still room for improvement. As the next step we increase the minimum decrease in impurity that must be obtained for a split in the tree to be allowed. The default value is zero; here it is set to 0.001.

In [85]:
TreeClf3=DecisionTreeClassifier(min_samples_leaf=5,min_impurity_decrease=0.001)
TreeClf3.fit(X_train_tree,y_train_tree)
Tree3pred=TreeClf3.predict(X_valid)
Tree3pred_train=TreeClf3.predict(X_train_tree)
print("Decision tree classifer 3:")
print("---------------------------------------")
print("Validation:")
print("Accuracy: {}".format(accuracy_score(y_valid,Tree3pred)))
print("Recall: {}".format(recall_score(y_valid,Tree3pred)))
print("Precision: {}".format(precision_score(y_valid,Tree3pred)))
print("F1-score: {}".format(f1_score(y_valid,Tree3pred)))
print()
print("Train:")
print("Accuracy: {}".format(accuracy_score(y_train_tree,Tree3pred_train)))
print("Recall: {}".format(recall_score(y_train_tree,Tree3pred_train)))
print("Precision: {}".format(precision_score(y_train_tree,Tree3pred_train)))
print("F1-score: {}".format(f1_score(y_train_tree,Tree3pred_train)))

Decision tree classifer 3:
---------------------------------------
Validation:
Accuracy: 0.761699064074874
Recall: 0.6051080550098232
Precision: 0.7031963470319634
F1-score: 0.6504751847940866

Train:
Accuracy: 0.7634563456345634
Recall: 0.6329297820823244
Precision: 0.701556629092861
F1-score: 0.665478615071283


With the training and validation scores being very close, the model appears to generalize well. The third version performs best overall on the validation set and is therefore now evaluated on the test set.

In [86]:
Tree3pred=TreeClf3.predict(X_test)
print("Decision tree classifer:")
print("---------------------------------------")
print("Test:")
print("Accuracy: {}".format(accuracy_score(y_test,Tree3pred)))
print("Recall: {}".format(recall_score(y_test,Tree3pred)))
print("Precision: {}".format(precision_score(y_test,Tree3pred)))
print("F1-score: {}".format(f1_score(y_test,Tree3pred)))

Decision tree classifer:
---------------------------------------
Test:
Accuracy: 0.7305699481865285
Recall: 0.5769828926905132
Precision: 0.654320987654321
F1-score: 0.6132231404958677


The tree-based classifier clearly outperforms the baseline model. However, it does not perform better than the logistic regression classifier in terms of accuracy, precision or F1-score; only its recall is slightly higher (0.58 against 0.56). The remaining work therefore focuses on the logistic regression classifier.
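
The manual tuning steps for the decision tree could also be written as a small loop over candidate settings, scored on the validation set. The sketch below is one possible way to do this and is not the procedure used in the notebook. It runs on synthetic data from `make_classification` as a stand-in for the restaurant data, and the candidate values 10 and 0.01 are illustrative additions to the values tried above.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the restaurant data (assumption: any binary data set works here).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=43)

results = []
for leaf, impurity in product([1, 5, 10], [0.0, 0.001, 0.01]):
    clf = DecisionTreeClassifier(min_samples_leaf=leaf,
                                 min_impurity_decrease=impurity,
                                 random_state=0)
    clf.fit(X_fit, y_fit)
    results.append({
        "min_samples_leaf": leaf,
        "min_impurity_decrease": impurity,
        "train_f1": f1_score(y_fit, clf.predict(X_fit)),
        "valid_f1": f1_score(y_val, clf.predict(X_val)),
    })

# Pick the setting with the best validation F1-score.
# A large gap between train_f1 and valid_f1 signals overfitting.
best = max(results, key=lambda r: r["valid_f1"])
print(best)
```

Selecting on the validation set only, as done here, keeps the test set untouched until the final evaluation.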

## Interpreting factors influencing prediction

Having constructed a successful classifier predicting whether a restaurant is closed or open, this section focuses on what influences the classifier, to gain an understanding of which factors are most important to be aware of in order to avoid a restaurant closing.<br>
To do this we start by calculating the odds ratios from the coefficients of the logistic regression classifier.

In [87]:
InterpretationOverview = pd.concat([pd.DataFrame(X_train.columns),pd.DataFrame(np.transpose(np.exp(LogisticRegressionClf.coef_)))], axis = 1)
InterpretationOverview=pd.concat([InterpretationOverview,pd.DataFrame(np.transpose(LogisticRegressionClf.coef_))],axis=1)
InterpretationOverview.columns=['attribute','odds ratio','weight']
InterpretationOverview

Unnamed: 0,attribute,odds ratio,weight
0,stars,0.782288,-0.245533
1,review_count,0.000149,-8.809615
2,RestaurantsReservations,1.412855,0.345612
3,GoodForKids,0.977803,-0.022447
4,RestaurantsGoodForGroups,1.439600,0.364365
...,...,...,...
223,open_Saturday_Evening,0.656462,-0.420890
224,open_Sunday_Early Morning,1.586239,0.461366
225,open_Sunday_Morning,1.049455,0.048271
226,open_Sunday_Midday,1.149704,0.139505
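
As a quick check of how the two columns relate, the sketch below recomputes the odds ratio as the exponential of the weight for three rows copied from the table above. It is a plain-Python illustration and not part of the notebook.

```python
import math

# Weights copied from the table above.
weights = {
    "stars": -0.245533,
    "review_count": -8.809615,
    "RestaurantsReservations": 0.345612,
}

# The odds ratio is the exponential of the logistic regression weight.
odds_ratios = {name: math.exp(w) for name, w in weights.items()}
for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print("{}: odds ratio {:.6f} ({} the odds of closing)".format(name, ratio, direction))
```

Because the attributes were min-max scaled, each odds ratio describes a move from the smallest to the largest value of that attribute. For binary attributes this is simply the difference between not having and having the attribute.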


With 228 different attributes it is not easy to get an immediate overview. We therefore start by ordering the attributes by their odds ratio values.

In [88]:
InterpretationOverview.sort_values(by=['odds ratio'],ascending=False)

Unnamed: 0,attribute,odds ratio,weight
36,Attire_formal,4.961547,1.601718
34,Attire_casual,3.775848,1.328625
17,divey,3.180984,1.157191
58,international,2.842133,1.044555
140,music venues,2.780455,1.022615
...,...,...,...
171,breweries,0.267735,-1.317758
159,donuts,0.214845,-1.537836
8,HasTV,0.196224,-1.628500
14,hipster,0.108494,-2.221064


We start by focusing on the five attributes with the highest odds ratios, as seen in the table above.<br>

**Attire_formal:** Formal attire has the highest odds ratio at 4.96. This means that when all other attributes of a restaurant stay the same, the odds of the restaurant closing increase by a factor of 4.96 when formal attire is required.<br>
**Attire_casual:** The second highest odds ratio is found for casual attire. For restaurants with casual attire the odds of closing are 3.78 times higher than for restaurants where all other attributes are the same.<br>
**divey:** For restaurants that are considered divey[1], the odds of closing are 3.18 times higher than for restaurants where all other factors are the same.<br>
**international:** International restaurants have 2.84 times higher odds of closing than restaurants where all other attributes are the same.<br>
**music venues:** For restaurants that are also music venues, the odds of closing are 2.78 times higher.

At the other end of the scale, we look at the five attributes with the lowest odds ratios.<br>

**review_count:** With a very low odds ratio of 0.00015, a higher review count lowers the odds of a restaurant closing. This indicates that restaurants with more customer interaction are less likely to close.<br>
**hipster:** With an odds ratio of 0.11, restaurants which are considered to have a hipster atmosphere have lower odds of closing. The odds of closing are multiplied by a factor of 0.11 compared to restaurants where all other attributes are the same.<br>
**HasTV:** Having a TV is the attribute with the third lowest odds ratio at 0.20. This again indicates that having a TV decreases the odds of a restaurant closing.<br>
**donuts:** The odds ratio of 0.21 indicates that restaurants serving donuts have lower odds of closing than similar restaurants.<br>
**breweries:** Finally, breweries have an odds ratio of 0.27. This indicates that the odds of closing are multiplied by a factor of 0.27 for breweries, compared to restaurants where all other features are the same.<br>


[1] Snesdude (September 04, 2014). *divey*. Urban Dictionary: https://www.urbandictionary.com/define.php?term=divey. Visited 17.5.2020.

While it is clear that the condition that all other factors remain the same makes the precise effect of a restaurant attribute on the risk of closing hard to determine, the odds ratios do provide some general indications of which factors are important. Above we only looked at the five attributes that, all else being equal, lead to the biggest increase in the odds of a restaurant closing, and the five that lower the odds the most. Many attributes remain, and to understand their influence better we now go through them by category.
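
An odds ratio acts on odds, not directly on probabilities, so a factor of 4.96 does not make closure 4.96 times as likely. The sketch below converts a probability to odds, applies an odds ratio and converts back. The helper `apply_odds_ratio` and the 37% starting probability (roughly the share of closed restaurants in the data set) are illustrative assumptions.

```python
def apply_odds_ratio(probability, odds_ratio):
    """Return the new probability after multiplying the odds by odds_ratio."""
    odds = probability / (1 - probability)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

# Hypothetical starting point: a restaurant with a 37% predicted chance of being closed.
base = 0.37
for name, ratio in [("Attire_formal", 4.961547), ("hipster", 0.108494)]:
    print("{}: {:.2f} -> {:.2f}".format(name, base, apply_odds_ratio(base, ratio)))
```

For this starting point, the formal attire odds ratio moves the probability to about 0.74, while the hipster odds ratio moves it to about 0.06.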

### Opening hours

When focusing on the opening hours of the restaurants, there is no clear overall pattern in their influence on whether a restaurant closes or not. However, a few time periods stand out. With a high odds ratio of 2.28, restaurants which are open early Friday morning appear to be more likely to end up closing than restaurants where all other factors are the same. At the other end of the scale, being open in the early morning on Monday and Tuesday appears to lower the odds of a restaurant having to close, with odds ratios of 0.46 and 0.49 respectively.

In [89]:
InterpretationOverview[InterpretationOverview.attribute.str.contains("open_")]

Unnamed: 0,attribute,odds ratio,weight
200,open_Monday_Early Morning,0.457458,-0.782071
201,open_Monday_Morning,0.659345,-0.416509
202,open_Monday_Midday,1.167715,0.155049
203,open_Monday_Evening,0.803101,-0.219275
204,open_Tuesday_Early Morning,0.485043,-0.723517
205,open_Tuesday_Morning,0.905623,-0.099132
206,open_Tuesday_Midday,1.228025,0.205407
207,open_Tuesday_Evening,0.979642,-0.020568
208,open_Wednesday_Early Morning,0.526661,-0.641198
209,open_Wednesday_Morning,1.365971,0.311866


### Attire

The next category we focus on is the attire required at the restaurants.<br>
As observed earlier, two of the three attire categories were among the five attributes with the highest odds ratios. Here it can be seen that restaurants with dressy attire also have a high odds ratio.

In [90]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.str.contains("open_")].index)
InterpretationOverview[InterpretationOverview.attribute.str.contains("Attire")]

Unnamed: 0,attribute,odds ratio,weight
34,Attire_casual,3.775848,1.328625
35,Attire_dressy,2.366118,0.86125
36,Attire_formal,4.961547,1.601718


### Ambiance

A divey ambiance was already noted earlier to have a very high odds ratio, indicating that restaurants considered to have such an ambiance are more likely to close than restaurants where all other attributes are the same. Among the remaining ambiances, hipster stands out. Its odds ratio of 0.11 indicates that restaurants considered to have a hipster ambiance have lower odds of closing.

In [91]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.str.contains("Attire")].index)
InterpretationOverview[InterpretationOverview.attribute.isin(['classy','upscale','trendy','hipster','romantic','intimate','divey','touristy','casual'])]

Unnamed: 0,attribute,odds ratio,weight
11,intimate,1.248999,0.222342
12,upscale,0.863006,-0.147333
13,trendy,1.248999,0.222342
14,hipster,0.108494,-2.221064
15,touristy,1.248999,0.222342
16,romantic,1.248999,0.222342
17,divey,3.180984,1.157191
18,classy,1.248999,0.222342
19,casual,1.248999,0.222342


### Parking

In [96]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.isin(['classy','upscale','trendy','hipster','romantic','intimate','divey','touristy','casual'])].index)
InterpretationOverview[(InterpretationOverview.attribute.isin(parking_keys)) | (InterpretationOverview.attribute=='BikeParking')]

Unnamed: 0,attribute,odds ratio,weight
9,BikeParking,0.505477,-0.682253
29,garage,1.037596,0.036907
30,valet,1.037596,0.036907
31,lot,1.037596,0.036907
32,street,1.037596,0.036907
33,validated,1.037596,0.036907


With all attributes describing car parking having odds ratios very close to 1, there is no indication that having certain parking options available changes the odds of a restaurant having to close. One thing that does stand out is bike parking: with an odds ratio of 0.5, having bike parking available is indicated to decrease the odds of a restaurant having to close.
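
All five car-parking attributes share exactly the same weight, as do six of the ambiance attributes in the previous table. One possible explanation (an assumption, not verified here) is that these columns hold identical values in the data set, in which case a regularized logistic regression cannot tell them apart. The sketch below shows one way to check for identical columns on a hypothetical miniature feature table; on the real data the same check could be run on `X_train`.

```python
import pandas as pd

# Hypothetical miniature feature table; 'garage' and 'valet' hold identical values.
features = pd.DataFrame({
    "BikeParking": [1, 0, 1, 1],
    "garage": [0, 0, 1, 0],
    "valet": [0, 0, 1, 0],
    "street": [1, 0, 0, 1],
})

# Transposing turns columns into rows, so duplicated() flags repeated columns.
duplicate_mask = features.T.duplicated(keep=False)
duplicated_columns = duplicate_mask[duplicate_mask].index.tolist()
print(duplicated_columns)
```

If such duplicates exist, their individual odds ratios should not be interpreted separately.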

### Wifi

In [129]:
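# Compare the column itself (not the .str accessor) with 'BikeParking', so that the row is actually dropped.
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[(InterpretationOverview.attribute.isin(parking_keys)) | (InterpretationOverview.attribute=='BikeParking')].index)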
InterpretationOverview[InterpretationOverview.attribute.str.contains('Wifi_')]

Unnamed: 0,attribute,odds ratio,weight


All attributes describing the wifi options of the restaurant have odds ratios a little below 1. Since a restaurant has exactly one of free, paid or no wifi, this overall does not indicate that the wifi option is important for whether or not a restaurant has to close.

### Takeout and delivery

With odds ratios very close to 1, there is no indication that restaurants offering takeout or delivery are more or less likely to close than other similar restaurants.

In [98]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.str.contains('Wifi_')].index)
InterpretationOverview[InterpretationOverview.attribute.isin(['RestaurantsDelivery','RestaurantsTakeOut'])]

Unnamed: 0,attribute,odds ratio,weight
5,RestaurantsTakeOut,1.007264,0.007237
6,RestaurantsDelivery,0.936055,-0.066081


### NoiseLevel

Based on the odds ratios for the attributes describing noise levels, the noise level by itself does not appear to be very important for the odds of a restaurant having to close.

In [101]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.isin(['RestaurantsDelivery','RestaurantsTakeOut'])].index)
InterpretationOverview[InterpretationOverview.attribute.str.contains('NoiseLevel')]

Unnamed: 0,attribute,odds ratio,weight
23,NoiseLevel__average,0.826705,-0.190308
24,NoiseLevel__loud,0.71283,-0.338513
25,NoiseLevel__quiet,1.34442,0.295963
26,NoiseLevel__very_loud,0.751533,-0.285641


### Alcohol

Restaurants which have a full bar appear to have slightly higher odds of closing than similar restaurants which do not have a full bar.

In [106]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.str.contains('NoiseLevel')].index)
InterpretationOverview[InterpretationOverview.attribute.str.contains('Alcohol')]

Unnamed: 0,attribute,odds ratio,weight
27,Alcohol_beer_and_wine,0.998578,-0.001423
28,Alcohol_full_bar,1.430657,0.358134


### Groups and kids

In [108]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.str.contains('Alcohol')].index)
len(InterpretationOverview)
InterpretationOverview[InterpretationOverview.attribute.isin(['RestaurantsGoodForGroups','GoodForKids'])]

Unnamed: 0,attribute,odds ratio,weight
3,GoodForKids,0.977803,-0.022447
4,RestaurantsGoodForGroups,1.4396,0.364365


Being a restaurant considered good for groups or good for kids does not by itself appear to have a large influence on the odds of a restaurant closing.

### Price, outdoor seating and reservations

Looking at the last three attributes that were originally listed as attributes of restaurants, it can be observed that restaurants that take reservations appear to have higher odds of closing.

In [112]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.isin(['RestaurantsGoodForGroups','GoodForKids'])].index)
len(InterpretationOverview)
InterpretationOverview[InterpretationOverview.attribute.isin(['RestaurantsPriceRange2','OutdoorSeating','RestaurantsReservations'])]

Unnamed: 0,attribute,odds ratio,weight
2,RestaurantsReservations,1.412855,0.345612
7,RestaurantsPriceRange2,0.819725,-0.198787
10,OutdoorSeating,1.256174,0.22807


### Reviews

Earlier we already saw that the review count has the lowest odds ratio of all the attributes. Looking at the odds ratio of stars, we see that the odds of a restaurant closing also appear to be lowered when the number of stars increases. While this is not surprising, it is not a parameter that a restaurant can control directly when trying to avoid closure.

In [118]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.isin(['RestaurantsPriceRange2','OutdoorSeating','RestaurantsReservations'])].index)
InterpretationOverview[InterpretationOverview.attribute.isin(['stars','review_count'])]

Unnamed: 0,attribute,odds ratio,weight
0,stars,0.782288,-0.245533
1,review_count,0.000149,-8.809615
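
The extreme odds ratio for review_count compares the restaurant with the fewest reviews to the one with the most, because the attribute was min-max scaled. The sketch below rescales the weight to a per-review effect. The span of the raw review counts is not printed in the notebook; it is inferred here from two rows of the tables above and should be treated as an estimate.

```python
import math

# Weight of the min-max scaled review_count, from the table above.
weight_scaled = -8.809615

# Span of the raw review counts, inferred from two rows shown earlier
# (13 reviews -> 0.003630 and 116 reviews -> 0.041016); treat it as an estimate.
raw_span = (116 - 13) / (0.041016 - 0.003630)

# Weight per single additional review, and the resulting odds ratios.
weight_per_review = weight_scaled / raw_span
print("Estimated span of review counts: {:.0f}".format(raw_span))
print("Odds ratio per extra review: {:.4f}".format(math.exp(weight_per_review)))
print("Odds ratio per 100 extra reviews: {:.2f}".format(math.exp(100 * weight_per_review)))
```

Under this estimate, 100 additional reviews multiply the odds of closing by roughly 0.73, which is easier to relate to an individual restaurant than the full-range odds ratio.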


### Restaurant categories

In [164]:
InterpretationOverview=InterpretationOverview.drop(InterpretationOverview[InterpretationOverview.attribute.isin(['stars','review_count','HasTV','BikeParking'])].index)
print("{} attributes still remains".format(len(InterpretationOverview)))

163 attributes still remains


We have yet to look into 163 attributes, representing different categories of restaurants. To allow for a better overview of these very diverse attributes, we take a look at the attributes with the 20 highest and 20 lowest odds ratios.<br>
For the highest scores there appears to be no overall trend in which general categories of restaurants are more likely to close. Among the categories with very high odds ratios, ice cream & frozen yogurt, waffles, peruvian and lounges can be noted. Belonging to each of these categories is indicated to increase the odds of a restaurant having to close, if all other factors remain the same.

In [162]:
InterpretationOverview=InterpretationOverview.sort_values(by="odds ratio",ascending=False)
InterpretationOverview[:20]

Unnamed: 0,attribute,odds ratio,weight
58,international,2.842133,1.044555
140,music venues,2.780455,1.022615
114,ice cream & frozen yogurt,2.56947,0.9437
166,waffles,2.526237,0.926731
131,peruvian,2.516727,0.922959
198,lounges,2.256915,0.813999
176,cheese shops,2.109254,0.746334
73,irish,2.007509,0.696895
47,hookah bars,1.951337,0.668515
189,arcades,1.915038,0.649737


Next we take a look at the categories with the 20 lowest odds ratios.<br>
As we have already seen, breweries and restaurants serving donuts have very low odds ratios, indicating that belonging to one of those categories lowers the odds of a restaurant being closed. Fast food, filipino and imported food also appear to lower the odds of a restaurant closing.

In [163]:
InterpretationOverview.iloc[-20:]

Unnamed: 0,attribute,odds ratio,weight
184,pizza,0.554911,-0.588948
139,hakka,0.553298,-0.591858
175,kebab,0.546759,-0.603747
65,coffee roasteries,0.532661,-0.629871
179,beer bar,0.52451,-0.645292
9,BikeParking,0.505477,-0.682253
193,local services,0.504114,-0.684952
196,kosher,0.472354,-0.750026
116,beauty & spas,0.462191,-0.771777
152,portuguese,0.438704,-0.82393


## Conclusion

Two classifiers were constructed, based on logistic regression and on a decision tree respectively. As both improved on the performance of a baseline model that classifies by random choice, it was shown that there are underlying structures influencing whether a restaurant has to close.<br>
While the performance of the two constructed classifiers was close, the highest performance was achieved by the model based on logistic regression. The subsequent analysis of the model highlighted the complexity of the factors influencing whether or not a restaurant can remain open.<br>
Based on the odds ratio of each attribute in the model, a number of factors were indicated to either raise or lower the odds of a restaurant closing.