## Problem Description

Who doesn't love food? Most of us crave at least a few favourite dishes, and we often have a favourite restaurant that serves them just the way we like. One factor can still make us think twice about ordering our favourite food from our favourite restaurant: the cost. In this hackathon, you will predict the cost of food served by restaurants across different cities in India. You will use your data science skills to investigate the factors that really affect the cost, and you may even gain some insights that help you choose what to eat and where.

## Feature Description

* TITLE: The feature of the restaurant that helps identify what it offers and for whom it is suitable.

* RESTAURANT_ID: A unique ID for each restaurant.

* CUISINES: The variety of cuisines that the restaurant offers.

* TIME: The open hours of the restaurant.

* CITY: The city in which the restaurant is located.

* LOCALITY: The locality of the restaurant.

* RATING: The average rating of the restaurant by customers.

* VOTES: The overall votes received by the restaurant.

* COST: The average cost of a two-person meal.

# Importing Libraries

In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt





# Getting Data

In [2]:
foodcost=pd.read_excel('Restrauntfoodcost.xlsx')

In [3]:
foodcost.head()

Unnamed: 0,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST
0,CASUAL DINING,9438,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49 votes,1200
1,"CASUAL DINING,BAR",13198,"Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30 votes,1500
2,CASUAL DINING,10915,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221 votes,800
3,QUICK BITES,6346,"Tibetan, Chinese",11:30am – 1am (Mon-Sun),Mumbai,Bandra West,4.1,24 votes,800
4,DESSERT PARLOR,15387,Desserts,11am – 1am (Mon-Sun),Mumbai,Lower Parel,3.8,165 votes,300


## Data Exploration

In [4]:
foodcost.shape

(12690, 9)

In [5]:
# Check for duplicate records and drop them, keeping the first occurrence
foodcost.duplicated().sum()
foodcost.drop_duplicates(keep='first', inplace=True)
# reset_index without drop=True keeps the old index as a new 'index' column
foodcost.reset_index(inplace=True)
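
The extra `index` column that appears after this cell comes from `reset_index` keeping the old row labels as a column. A minimal sketch on a made-up frame (the toy data and variable names are hypothetical, not taken from the dataset) of how `drop=True` changes that:

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row.
toy = pd.DataFrame({
    "TITLE": ["QUICK BITES", "QUICK BITES", "CASUAL DINING"],
    "COST": [300, 300, 800],
})

deduped = toy.drop_duplicates(keep="first")

# Default behaviour: the old index is kept as a new 'index' column.
with_index_col = deduped.reset_index()

# drop=True discards the old index instead of turning it into a column.
clean = deduped.reset_index(drop=True)

print(with_index_col.columns.tolist())  # ['index', 'TITLE', 'COST']
print(clean.columns.tolist())           # ['TITLE', 'COST']
```

Whether the old row position carries any signal is a modelling decision; dropping it simply keeps it out of the feature matrix.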

In [6]:
foodcost.head()

Unnamed: 0,index,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST
0,0,CASUAL DINING,9438,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49 votes,1200
1,1,"CASUAL DINING,BAR",13198,"Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30 votes,1500
2,2,CASUAL DINING,10915,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221 votes,800
3,3,QUICK BITES,6346,"Tibetan, Chinese",11:30am – 1am (Mon-Sun),Mumbai,Bandra West,4.1,24 votes,800
4,4,DESSERT PARLOR,15387,Desserts,11am – 1am (Mon-Sun),Mumbai,Lower Parel,3.8,165 votes,300


In [7]:
foodcost.shape

(12665, 10)

In [8]:
foodcost.isnull().sum()

index               0
TITLE               0
RESTAURANT_ID       0
CUISINES            0
TIME                0
CITY              112
LOCALITY           98
RATING              2
VOTES            1200
COST                0
dtype: int64

In [9]:
#check unique values in every column
for i in foodcost:
    print('unique values in ',i ,'is',foodcost[i].nunique())

unique values in  index is 12665
unique values in  TITLE is 113
unique values in  RESTAURANT_ID is 11892
unique values in  CUISINES is 4155
unique values in  TIME is 2689
unique values in  CITY is 359
unique values in  LOCALITY is 1416
unique values in  RATING is 32
unique values in  VOTES is 1847
unique values in  COST is 86


From the output above, we can observe that most columns have a large number of unique values; for example, CUISINES has 4155 and TIME has 2689.

## Data Pre-Processing

In [10]:
def extract_closed(time):
    a = re.findall(r'Closed \(.*?\)', time)
    if a != []:
        return a[0]
    else:
        return 'NA'

foodcost['CLOSED'] = foodcost['TIME'].apply(extract_closed)
foodcost.head()

Unnamed: 0,index,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST,CLOSED
0,0,CASUAL DINING,9438,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49 votes,1200,
1,1,"CASUAL DINING,BAR",13198,"Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30 votes,1500,
2,2,CASUAL DINING,10915,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221 votes,800,
3,3,QUICK BITES,6346,"Tibetan, Chinese",11:30am – 1am (Mon-Sun),Mumbai,Bandra West,4.1,24 votes,800,
4,4,DESSERT PARLOR,15387,Desserts,11am – 1am (Mon-Sun),Mumbai,Lower Parel,3.8,165 votes,300,
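
To see what the `Closed (...)` pattern does in isolation, here is a small standalone sketch. The helper name `split_closed` and the second sample string are hypothetical; only the first sample is copied from the data shown above.

```python
import re

# Same pattern as in the notebook, written as a raw string.
CLOSED_PATTERN = re.compile(r"Closed \(.*?\)")

def split_closed(time_text):
    """Return (opening hours without the closed part, closed part or 'NA')."""
    match = CLOSED_PATTERN.search(time_text)
    closed = match.group(0) if match else "NA"
    # Remove the closed part and tidy up a trailing comma or space.
    hours = CLOSED_PATTERN.sub("", time_text).strip(" ,")
    return hours, closed

samples = [
    "11am – 1am (Mon-Sun)",                    # taken from the data above
    "12noon – 11pm (Tue-Sun), Closed (Mon)",   # hypothetical example
]
for text in samples:
    print(split_closed(text))
```

The function combines the extraction and the removal step, which the notebook performs in two separate cells.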


In [11]:
# regex=True is required on recent pandas versions, where str.replace is literal by default
foodcost['TIME'] = foodcost['TIME'].str.replace(r'Closed \(.*?\)', '', regex=True)

In [12]:
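# 'NEW' and '-' are non-numeric placeholders in RATING; both are mapped to 1 before casting to float
foodcost['RATING'] = foodcost['RATING'].str.replace('NEW', '1')
foodcost['RATING'] = foodcost['RATING'].str.replace('-', '1').astype(float)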

In [13]:
foodcost['VOTES'] = foodcost['VOTES'].str.replace(' votes', '').astype(float)
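
An alternative way to clean these two columns, sketched on made-up values, is to let `pd.to_numeric` turn anything non-numeric into `NaN` rather than mapping placeholders to a fixed rating. This is a different choice from the one in the notebook, not a reproduction of it; the toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical raw values in the same text format as the dataset.
raw = pd.DataFrame({
    "RATING": ["3.6", "NEW", "-", None],
    "VOTES": ["49 votes", "221 votes", None, None],
})

# Placeholders such as 'NEW' and '-' become NaN instead of an invented rating.
rating = pd.to_numeric(raw["RATING"], errors="coerce")

# Strip the ' votes' suffix literally (regex=False), then convert to numbers.
votes = pd.to_numeric(
    raw["VOTES"].str.replace(" votes", "", regex=False),
    errors="coerce",
)

print(rating.tolist())
print(votes.tolist())
```

Keeping the placeholders as missing values leaves the imputation decision to a single, later step.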

In [14]:
foodcost['CITY'].fillna('Missing', inplace=True)  
foodcost['LOCALITY'].fillna('Missing', inplace=True)  
foodcost['RATING'].fillna(3.8, inplace=True)  
foodcost['VOTES'].fillna(0.0, inplace=True)
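
The same kind of imputation can be written without chained `inplace=True` calls, which recent pandas versions warn about. The sketch below uses a hypothetical toy frame, fills RATING with the column median instead of a fixed constant, and adds a `RATING_MISSING` flag; both the median and the flag are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in each column.
toy = pd.DataFrame({
    "CITY": ["Mumbai", None, "Chennai"],
    "RATING": [4.1, np.nan, 3.5],
    "VOTES": [24.0, np.nan, 221.0],
})

# Flag rows that had no rating before the gap is filled.
toy["RATING_MISSING"] = toy["RATING"].isna().astype(int)

fill_values = {
    "CITY": "Missing",
    "RATING": toy["RATING"].median(),  # data-driven instead of a fixed constant
    "VOTES": 0.0,
}

# One fillna call with a per-column mapping; the result is assigned back.
toy = toy.fillna(value=fill_values)
print(toy)
```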

In [15]:
foodcost['COST'] = foodcost['COST'].astype(float)

In [16]:
foodcost.head()

Unnamed: 0,index,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST,CLOSED
0,0,CASUAL DINING,9438,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49.0,1200.0,
1,1,"CASUAL DINING,BAR",13198,"Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30.0,1500.0,
2,2,CASUAL DINING,10915,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221.0,800.0,
3,3,QUICK BITES,6346,"Tibetan, Chinese",11:30am – 1am (Mon-Sun),Mumbai,Bandra West,4.1,24.0,800.0,
4,4,DESSERT PARLOR,15387,Desserts,11am – 1am (Mon-Sun),Mumbai,Lower Parel,3.8,165.0,300.0,


In [17]:
foodcost['TITLE'].nunique()

113

In [18]:
foodcost['CUISINES'].nunique()

4155

In [19]:
calc_mean = foodcost.groupby(['CITY'], axis=0).agg({'RATING': 'mean'}).reset_index()
calc_mean.columns = ['CITY','CITY_MEAN_RATING']
foodcost = foodcost.merge(calc_mean, on=['CITY'],how='left')

calc_mean = foodcost.groupby(['LOCALITY'], axis=0).agg({'RATING': 'mean'}).reset_index()
calc_mean.columns = ['LOCALITY','LOCALITY_MEAN_RATING']
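# NOTE: the result is assigned to `foodocost` (a typo for `foodcost`), so LOCALITY_MEAN_RATING
# is never added to `foodcost`; the outputs below reflect this and show only CITY_MEAN_RATING.
foodocost = foodcost.merge(calc_mean, on=['LOCALITY'],how='left')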
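
A per-group mean can also be attached without an intermediate frame and a merge, using `groupby(...).transform`. The following is a minimal sketch on hypothetical toy data, not the notebook's own code:

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "CITY": ["Mumbai", "Mumbai", "Chennai"],
    "LOCALITY": ["Bandra West", "Lower Parel", "Ramapuram"],
    "RATING": [4.0, 3.0, 4.2],
})

# transform('mean') returns one value per original row, aligned on the index,
# so the result can be assigned directly as a new column.
for col in ["CITY", "LOCALITY"]:
    toy[f"{col}_MEAN_RATING"] = toy.groupby(col)["RATING"].transform("mean")

print(toy)
```

Because the new column is written straight into the frame, there is no second variable to keep in sync.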

In [20]:
foodcost.head()

Unnamed: 0,index,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST,CLOSED,CITY_MEAN_RATING
0,0,CASUAL DINING,9438,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49.0,1200.0,,3.441053
1,1,"CASUAL DINING,BAR",13198,"Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30.0,1500.0,,3.581546
2,2,CASUAL DINING,10915,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221.0,800.0,,3.581546
3,3,QUICK BITES,6346,"Tibetan, Chinese",11:30am – 1am (Mon-Sun),Mumbai,Bandra West,4.1,24.0,800.0,,3.693112
4,4,DESSERT PARLOR,15387,Desserts,11am – 1am (Mon-Sun),Mumbai,Lower Parel,3.8,165.0,300.0,,3.693112


## Feature Extraction

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
tf1 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)
foodcost_title = tf1.fit_transform(foodcost['TITLE'])
foodcost_title = pd.DataFrame(data=foodcost_title.toarray(), columns=tf1.get_feature_names())

tf2 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)
foodcost_cuisines = tf2.fit_transform(foodcost['CUISINES'])
foodcost_cuisines = pd.DataFrame(data=foodcost_cuisines.toarray(), columns=tf2.get_feature_names())

tf3 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)
foodcost_city = tf3.fit_transform(foodcost['CITY'])
foodcost_city = pd.DataFrame(data=foodcost_city.toarray(), columns=tf3.get_feature_names())

tf4 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)
foodcost_locality = tf4.fit_transform(foodcost['LOCALITY'])
foodcost_locality = pd.DataFrame(data=foodcost_locality.toarray(), columns=tf4.get_feature_names())

tf5 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)
foodcost_time = tf5.fit_transform(foodcost['TIME'])
foodcost_time = pd.DataFrame(data=foodcost_time.toarray(), columns=tf5.get_feature_names())

In [22]:
foodcost.head()

Unnamed: 0,index,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST,CLOSED,CITY_MEAN_RATING
0,0,CASUAL DINING,9438,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49.0,1200.0,,3.441053
1,1,"CASUAL DINING,BAR",13198,"Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30.0,1500.0,,3.581546
2,2,CASUAL DINING,10915,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221.0,800.0,,3.581546
3,3,QUICK BITES,6346,"Tibetan, Chinese",11:30am – 1am (Mon-Sun),Mumbai,Bandra West,4.1,24.0,800.0,,3.693112
4,4,DESSERT PARLOR,15387,Desserts,11am – 1am (Mon-Sun),Mumbai,Lower Parel,3.8,165.0,300.0,,3.693112


In [23]:
foodcost = pd.concat([foodcost,foodcost_title, foodcost_cuisines, foodcost_city, foodcost_locality, foodcost_time], axis=1) 
foodcost.drop(['TITLE', 'CUISINES', 'CITY', 'LOCALITY', 'TIME'], axis=1, inplace=True)

In [24]:
foodcost.head()

Unnamed: 0,index,RESTAURANT_ID,RATING,VOTES,COST,CLOSED,CITY_MEAN_RATING,bakery,bar,beverage,...,closed,fri,hours,mon,not,sat,sun,thu,tue,wed
0,0,9438,3.6,49.0,1200.0,,3.441053,0.0,0.0,0.0,...,0.0,0.0,0.0,0.144885,0.0,0.0,0.150942,0.0,0.0,0.0
1,1,13198,4.2,30.0,1500.0,,3.581546,0.0,0.806864,0.0,...,0.0,0.0,0.0,0.200535,0.0,0.0,0.208919,0.0,0.0,0.0
2,2,10915,3.8,221.0,800.0,,3.581546,0.0,0.0,0.0,...,0.0,0.0,0.0,0.191223,0.0,0.0,0.199218,0.0,0.0,0.0
3,3,6346,4.1,24.0,800.0,,3.693112,0.0,0.0,0.0,...,0.0,0.0,0.0,0.184058,0.0,0.0,0.191753,0.0,0.0,0.0
4,4,15387,3.8,165.0,300.0,,3.693112,0.0,0.0,0.0,...,0.0,0.0,0.0,0.217237,0.0,0.0,0.226319,0.0,0.0,0.0


In [25]:
foodcost = pd.get_dummies(foodcost, columns=['CLOSED'], drop_first=True)

In [26]:
foodcost.head()

Unnamed: 0,index,RESTAURANT_ID,RATING,VOTES,COST,CITY_MEAN_RATING,bakery,bar,beverage,bites,...,"CLOSED_Closed (Mon, Wed, Thu, Sun)",CLOSED_Closed (Mon-Thu),CLOSED_Closed (Mon-Tue),CLOSED_Closed (Sat),CLOSED_Closed (Sat-Sun),CLOSED_Closed (Sun),CLOSED_Closed (Tue),CLOSED_Closed (Wed),CLOSED_Closed (Wed-Sun),CLOSED_NA
0,0,9438,3.6,49.0,1200.0,3.441053,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
1,1,13198,4.2,30.0,1500.0,3.581546,0.0,0.806864,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2,2,10915,3.8,221.0,800.0,3.581546,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
3,3,6346,4.1,24.0,800.0,3.693112,0.0,0.0,0.0,0.707107,...,0,0,0,0,0,0,0,0,0,1
4,4,15387,3.8,165.0,300.0,3.693112,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1


In [27]:
foodcost.shape

(12665, 2023)

In [28]:
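# Keep only rows with a known COST; .copy() avoids SettingWithCopyWarning on the later assignment
foodcost = foodcost[foodcost['COST'].notnull()].copy()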

In [29]:
foodcost.shape

(12665, 2023)

In [30]:
foodcost['COST'] = np.log1p(foodcost['COST'])
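
The target is now `log(1 + COST)`, so the matching inverse is `np.expm1`, not `np.exp`. A short check on hypothetical cost values:

```python
import numpy as np

# Hypothetical costs for illustration.
cost = np.array([300.0, 800.0, 1200.0, 1500.0])

log_cost = np.log1p(cost)       # log(1 + cost)
restored = np.expm1(log_cost)   # exact inverse: exp(x) - 1
off_by_one = np.exp(log_cost)   # exp alone returns cost + 1

print(restored)
print(off_by_one)
```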

## Train Test Split

In [31]:
X = foodcost.drop(labels=['COST'], axis=1)
y = foodcost['COST'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [32]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((9498, 2022), (9498,), (3167, 2022), (3167,))

## Building Model

In [33]:
from math import sqrt 
from sklearn.metrics import mean_squared_log_error

In [35]:
from sklearn.ensemble import BaggingRegressor
br = BaggingRegressor(base_estimator=None, n_estimators=30, max_samples=0.9, max_features=1.0, bootstrap=True, 
                      bootstrap_features=True, oob_score=True, warm_start=False, n_jobs=1, random_state=42, verbose=1)
br.fit(X_train, y_train)
y_pred_br = br.predict(X_test)
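# The target is log1p(COST), so the value printed here is an RMSLE;
# np.expm1 is the exact inverse of log1p (np.exp returns COST + 1)
print('RMSE:', sqrt(mean_squared_log_error(np.exp(y_test), np.exp(y_pred_br))))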

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   47.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


RMSE: 0.3708310102608996


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.0s finished


In [36]:
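from sklearn.ensemble import RandomForestRegressor
# Note: criterion='mse', max_features='auto' and min_impurity_split are no longer accepted by
# newer scikit-learn releases; this cell reflects the older version it was run on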
rf = RandomForestRegressor(n_estimators=40, criterion='mse', max_depth=None, min_samples_split=4, min_samples_leaf=1, 
                           min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, 
                           min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, 
                           random_state=42, verbose=1, warm_start=False)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print('RMSE:', sqrt(mean_squared_log_error(np.exp(y_test), np.exp(y_pred_rf))))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


RMSE: 0.37656460951430587


[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:   51.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:    0.0s finished
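
The metric used for both models is a root mean squared logarithmic error computed from log-transformed targets. A minimal NumPy sketch of that calculation, assuming the inputs are stored as `log1p(cost)`; the function name `rmsle_from_log1p` and the sample values are hypothetical:

```python
import numpy as np

def rmsle_from_log1p(y_true_log, y_pred_log):
    """RMSLE on the original scale for values stored as log1p(cost)."""
    # Undo the log1p transform to get back to currency units.
    y_true = np.expm1(np.asarray(y_true_log, dtype=float))
    y_pred = np.expm1(np.asarray(y_pred_log, dtype=float))
    # RMSLE = sqrt(mean((log1p(pred) - log1p(true))^2))
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical true and predicted costs, stored on the log1p scale.
true_log = np.log1p(np.array([300.0, 800.0, 1200.0]))
pred_log = np.log1p(np.array([350.0, 700.0, 1500.0]))

print(rmsle_from_log1p(true_log, pred_log))
```

Since `log1p(expm1(x))` returns `x`, this is numerically the same as the plain RMSE between the two log-scale arrays.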


In [37]:
# R^2 on the training data (not a held-out score)
rf.score(X_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:    0.1s finished


0.956799405952441

# Predicting on the Test Set

In [39]:
# Both arrays printed below are on the log1p(COST) scale, not in currency units
pred=rf.predict(X_test)
print("predicted cost",pred)
print("actual cost",y_test)

predicted cost [7.05185596 5.72959862 5.7697185  ... 7.62374873 5.3362872  8.03323569]
actual cost [7.17088848 5.52545294 5.70711026 ... 8.00670085 5.30330491 8.16080392]


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:    0.0s finished


# Saving model

In [40]:
x=pd.DataFrame(pred)
x.to_csv('Restraunt_food_cost.csv')
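
The saved predictions are on the log scale. A sketch of writing them in currency units instead, using the first three values printed above; the column names and the in-memory buffer are assumptions for illustration:

```python
import io

import numpy as np
import pandas as pd

# First three predicted and actual values from the printed output above (log1p scale).
pred_log = np.array([7.05185596, 5.72959862, 5.7697185])
actual_log = np.array([7.17088848, 5.52545294, 5.70711026])

# np.expm1 undoes the log1p transform applied to COST.
out = pd.DataFrame({
    "PREDICTED_COST": np.expm1(pred_log).round(2),
    "ACTUAL_COST": np.expm1(actual_log).round(2),
})

# index=False avoids writing the row index as an unnamed first column.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Replacing `buffer` with a file path writes the same content to disk.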

In [43]:
import pickle
# Save the fitted model to disk as a pickle file
with open('food_cost.pickle', 'wb') as f:
    pickle.dump(rf, f)
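
A saved pickle can be read back the same way. The round trip below uses a plain dictionary as a stand-in for the fitted regressor and a temporary directory, both of which are assumptions made so the sketch runs on its own:

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted model.
model = {"name": "stand-in for the fitted regressor", "n_estimators": 40}

path = os.path.join(tempfile.mkdtemp(), "food_cost.pickle")

# The context manager closes the file even if dump raises.
with open(path, "wb") as fh:
    pickle.dump(model, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == model)  # True
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model generally needs a compatible library version to load.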