# Restaurant Food Cost

#### Project Description

Who doesn’t love food? Most of us crave at least a few favourite dishes, and we usually have a favourite restaurant that serves them just the way we like. One factor can still make us think twice about ordering there: the cost. In this hackathon, you will predict the cost of food served by restaurants across different cities in India. You will use your data science skills to investigate the factors that really affect the cost, and you may even gain insights that help you choose what to eat and where.

- You are provided with the following two files:
1. train.csv: Use this dataset to train the model. It contains all the details related to restaurant food cost as well as the target variable “cost”.
2. test.csv: Use the trained model to predict the cost of a two-person meal for the restaurants in this file.

#####  Dataset Attributes
- TITLE: A feature of the restaurant that helps identify what it is and for whom it is suitable.
- RESTAURANT_ID: A unique ID for each restaurant.
- CUISINES: The variety of cuisines that the restaurant offers.
- TIME: The open hours of the restaurant.
- CITY: The city in which the restaurant is located.
- LOCALITY: The locality of the restaurant.
- RATING: The average rating of the restaurant by customers.
- VOTES: The overall votes received by the restaurant.
- COST: The average cost of a two-person meal.


In [1]:
# Import libraries 
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing datasets
train_data = pd.read_excel('Data_Train.xlsx')
test_data = pd.read_excel('Data_test.xlsx')

## Data Exploration

In [3]:
train_data.head(3)

Unnamed: 0,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST
0,CASUAL DINING,9438,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49 votes,1200
1,"CASUAL DINING,BAR",13198,"Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30 votes,1500
2,CASUAL DINING,10915,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221 votes,800


In [4]:
test_data.head(2)

Unnamed: 0,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES
0,CASUAL DINING,4085,"North Indian, Chinese, Mughlai, Kebab",12noon – 12midnight (Mon-Sun),Noida,Sector 18,4.3,564 votes
1,QUICK BITES,12680,"South Indian, Fast Food, Pizza, North Indian",7am – 12:30AM (Mon-Sun),Mumbai,Grant Road,4.2,61 votes


In [5]:
train_data.shape

(12690, 9)

In [6]:
test_data.shape

(4231, 8)

In [7]:
train_data.columns

Index(['TITLE', 'RESTAURANT_ID', 'CUISINES', 'TIME', 'CITY', 'LOCALITY',
       'RATING', 'VOTES', 'COST'],
      dtype='object')

In [8]:
test_data.columns

Index(['TITLE', 'RESTAURANT_ID', 'CUISINES', 'TIME', 'CITY', 'LOCALITY',
       'RATING', 'VOTES'],
      dtype='object')

In [9]:
# Check for duplicate records. They were not removed because dropping them lowered the score.
train_data.duplicated().sum()
# train_data.drop_duplicates(keep='first', inplace=True)
# train_data.reset_index(drop=True, inplace=True)

25

In [10]:
test_data.duplicated().sum()
# test_data.drop_duplicates(keep='first', inplace=True)
# test_data.reset_index(drop=True, inplace=True)

1

In [11]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12690 entries, 0 to 12689
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   TITLE          12690 non-null  object
 1   RESTAURANT_ID  12690 non-null  int64 
 2   CUISINES       12690 non-null  object
 3   TIME           12690 non-null  object
 4   CITY           12578 non-null  object
 5   LOCALITY       12592 non-null  object
 6   RATING         12688 non-null  object
 7   VOTES          11486 non-null  object
 8   COST           12690 non-null  int64 
dtypes: int64(2), object(7)
memory usage: 892.4+ KB


In [12]:
# Checking unique values in all columns 
for i in train_data.columns:
    print('Unique value in' , i, train_data[i].nunique())

Unique value in TITLE 113
Unique value in RESTAURANT_ID 11892
Unique value in CUISINES 4155
Unique value in TIME 2689
Unique value in CITY 359
Unique value in LOCALITY 1416
Unique value in RATING 32
Unique value in VOTES 1847
Unique value in COST 86


In [13]:
for i in test_data.columns:
    print('Unique value in' , i, test_data[i].nunique())

Unique value in TITLE 86
Unique value in RESTAURANT_ID 4127
Unique value in CUISINES 1727
Unique value in TIME 1183
Unique value in CITY 151
Unique value in LOCALITY 834
Unique value in RATING 31
Unique value in VOTES 1136


In [14]:
train_data.isnull().sum() / len(train_data)*100

TITLE            0.000000
RESTAURANT_ID    0.000000
CUISINES         0.000000
TIME             0.000000
CITY             0.882585
LOCALITY         0.772262
RATING           0.015760
VOTES            9.487786
COST             0.000000
dtype: float64

In [15]:
test_data.isnull().sum()/ len(test_data)*100

TITLE            0.000000
RESTAURANT_ID    0.000000
CUISINES         0.000000
TIME             0.000000
CITY             0.827228
LOCALITY         0.709052
RATING           0.047270
VOTES            9.501300
dtype: float64

## Data pre-processing

In [16]:
# Merge train and test (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([train_data, test_data], ignore_index=True)
df.head()

Unnamed: 0,TITLE,RESTAURANT_ID,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST
0,CASUAL DINING,9438,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49 votes,1200.0
1,"CASUAL DINING,BAR",13198,"Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30 votes,1500.0
2,CASUAL DINING,10915,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221 votes,800.0
3,QUICK BITES,6346,"Tibetan, Chinese",11:30am – 1am (Mon-Sun),Mumbai,Bandra West,4.1,24 votes,800.0
4,DESSERT PARLOR,15387,Desserts,11am – 1am (Mon-Sun),Mumbai,Lower Parel,3.8,165 votes,300.0


In [17]:
df = df[['TITLE', 'CUISINES', 'TIME', 'CITY', 'LOCALITY', 'RATING', 'VOTES', 'COST']]

In [18]:
# Extract the 'Closed (...)' part of the opening hours, if present
def extract_closed(time):
    a = re.findall(r'Closed \(.*?\)', time)
    if a:
        return a[0]
    return 'NA'

df['CLOSED'] = df['TIME'].apply(extract_closed)

In [19]:
# Remove substrings matching the pattern 'Closed (...)'
df['TIME'] = df['TIME'].str.replace(r'Closed \(.*?\)', '', regex=True)
# To remove substrings matching the pattern 'Closed...' instead, uncomment the line below:
# df['TIME'] = df['TIME'].str.replace(r'Closed...', '', regex=True)

In [20]:
# Replace 'NEW' with '1'
df['RATING'] = df['RATING'].str.replace('NEW', '1')

# Replace '-' with '1' and then convert to float
df['RATING'] = df['RATING'].str.replace('-', '1').astype(float)

In [21]:
# Remove ' votes' from each entry and convert to float
df['VOTES'] = df['VOTES'].str.replace(' votes', '').astype(float)
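
The string replacements above assume every rating is either a number, 'NEW', or '-', and every vote count ends in ' votes'. A more defensive sketch, using a made-up three-row frame, extracts the digits and lets `pd.to_numeric` turn anything unparseable into NaN. The `toy` frame and its values are illustrative assumptions; the notebook itself maps 'NEW' and '-' to 1 rather than NaN.

```python
import pandas as pd

# Hypothetical sample mimicking the raw RATING and VOTES columns
toy = pd.DataFrame({
    'RATING': ['3.6', 'NEW', '-'],
    'VOTES': ['49 votes', None, '221 votes'],
})

# errors='coerce' converts non-numeric strings such as 'NEW' or '-' to NaN
toy['RATING_NUM'] = pd.to_numeric(toy['RATING'], errors='coerce')

# Pull out the digits, then convert; missing entries stay NaN
toy['VOTES_NUM'] = pd.to_numeric(
    toy['VOTES'].str.extract(r'(\d+)', expand=False), errors='coerce'
)

print(toy)
```

Keeping unknown ratings as NaN leaves the choice of fill value (1, the mean, or a separate flag column) explicit rather than baked into the parsing step.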

In [22]:
# Fill missing values in 'CITY' column with 'Missing'
df['CITY'] = df['CITY'].fillna('Missing')

# Fill missing values in 'LOCALITY' column with 'Missing'
df['LOCALITY'] = df['LOCALITY'].fillna('Missing')

# Fill missing values in 'RATING' column with 3.8
df['RATING'] = df['RATING'].fillna(3.8)

# Fill missing values in 'VOTES' column with 0.0
df['VOTES'] = df['VOTES'].fillna(0.0)

In [23]:
# Converting Cost column to a float type
df['COST'] = df['COST'].astype(float)

In [24]:
df.head(2)

Unnamed: 0,TITLE,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST,CLOSED
0,CASUAL DINING,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49.0,1200.0,
1,"CASUAL DINING,BAR","Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30.0,1500.0,


In [25]:
#calculate the number of unique values in the 'TITLE' and 'CUISINES' columns 
df['TITLE'].nunique(), df['CUISINES'].nunique()

(123, 5183)

In [26]:
# Calculate Mean Rating by City
calc_mean = df.groupby('CITY').agg({'RATING': 'mean'}).reset_index()
calc_mean.columns = ['CITY','CITY_MEAN_RATING']

In [27]:
# Merge Mean Ratings by City back to the DataFrame
df = df.merge(calc_mean, on=['CITY'],how='left')

In [28]:
# Calculate Mean Rating by Locality
calc_mean = df.groupby('LOCALITY').agg({'RATING': 'mean'}).reset_index()
calc_mean.columns = ['LOCALITY','LOCALITY_MEAN_RATING']

In [29]:
# Merge Mean Ratings by Locality back to the DataFrame
df = df.merge(calc_mean, on=['LOCALITY'],how='left')

In [30]:
df.head(3)

Unnamed: 0,TITLE,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST,CLOSED,CITY_MEAN_RATING,LOCALITY_MEAN_RATING
0,CASUAL DINING,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49.0,1200.0,,3.376271,3.388889
1,"CASUAL DINING,BAR","Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30.0,1500.0,,3.584588,3.472222
2,CASUAL DINING,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221.0,800.0,,3.584588,3.55
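
The group-by, rename, and merge steps above can also be written in one line per feature with `groupby(...).transform('mean')`, which returns one value per row in the original row order. A sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'CITY': ['Chennai', 'Chennai', 'Mumbai'],
    'RATING': [4.2, 3.8, 4.1],
})

# Mean rating of each row's city, aligned with the original rows
toy['CITY_MEAN_RATING'] = toy.groupby('CITY')['RATING'].transform('mean')
print(toy)
```

This avoids the intermediate `calc_mean` frame and the risk of a merge changing the row order or count.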


In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Vectorization for 'TITLE' Column

tf1 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)

df_title = tf1.fit_transform(df['TITLE'])

df_title = pd.DataFrame(data=df_title.toarray(), columns=tf1.get_feature_names_out())


In [32]:
# TF-IDF Vectorization for 'CUISINES' Column
tf2 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)

df_cuisines = tf2.fit_transform(df['CUISINES'])

df_cuisines = pd.DataFrame(data=df_cuisines.toarray(), columns=tf2.get_feature_names_out())

In [33]:
# TF-IDF Vectorization for 'CITY' Column
tf3 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)

df_city = tf3.fit_transform(df['CITY'])

df_city = pd.DataFrame(data=df_city.toarray(), columns=tf3.get_feature_names_out())

In [34]:
# TF-IDF Vectorization for 'LOCALITY' Column
tf4 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)

df_locality = tf4.fit_transform(df['LOCALITY'])

df_locality = pd.DataFrame(data=df_locality.toarray(), columns=tf4.get_feature_names_out())

In [35]:
# TF-IDF Vectorization for 'TIME' Column
tf5 = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)

df_time = tf5.fit_transform(df['TIME'])

df_time = pd.DataFrame(data=df_time.toarray(), columns=tf5.get_feature_names_out())
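
Calling `.toarray()` on every TF-IDF matrix produces dense frames with thousands of mostly zero columns. If memory becomes a problem, one alternative is to keep the matrices sparse and stack them with `scipy.sparse.hstack`; scikit-learn's random forest, for example, accepts sparse input. This is a sketch on a made-up two-row frame, not the approach used in this notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature of the merged frame
toy = pd.DataFrame({
    'TITLE': ['CASUAL DINING', 'QUICK BITES'],
    'CUISINES': ['North Indian, Chinese', 'Desserts'],
    'VOTES': [49.0, 61.0],
})

# Each vectorizer returns a scipy sparse matrix
title_tfidf = TfidfVectorizer().fit_transform(toy['TITLE'])
cuisine_tfidf = TfidfVectorizer().fit_transform(toy['CUISINES'])

# Wrap the numeric column as sparse so everything can be stacked column-wise
numeric = csr_matrix(toy[['VOTES']].to_numpy())

features = hstack([numeric, title_tfidf, cuisine_tfidf]).tocsr()
print(features.shape)
```

The trade-off is that column names are no longer attached to the matrix; they can be recovered from each vectorizer's `get_feature_names_out()` if needed.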

In [36]:
df.head(3)

Unnamed: 0,TITLE,CUISINES,TIME,CITY,LOCALITY,RATING,VOTES,COST,CLOSED,CITY_MEAN_RATING,LOCALITY_MEAN_RATING
0,CASUAL DINING,"Malwani, Goan, North Indian","11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)",Thane,Dombivali East,3.6,49.0,1200.0,,3.376271,3.388889
1,"CASUAL DINING,BAR","Asian, Modern Indian, Japanese",6pm – 11pm (Mon-Sun),Chennai,Ramapuram,4.2,30.0,1500.0,,3.584588,3.472222
2,CASUAL DINING,"North Indian, Chinese, Biryani, Hyderabadi","11am – 3:30pm, 7pm – 11pm (Mon-Sun)",Chennai,Saligramam,3.8,221.0,800.0,,3.584588,3.55


In [37]:
# Concatenate transformed DataFrames with the original DataFrame
df = pd.concat([df, df_title, df_cuisines, df_city, df_locality, df_time], axis=1) 

# Drop original text columns
df.drop(['TITLE', 'CUISINES', 'CITY', 'LOCALITY', 'TIME'], axis=1, inplace=True)

In [38]:
# Check if 'CLOSED' column exists in the DataFrame
if 'CLOSED' in df.columns:
    # If 'CLOSED' column exists, convert categorical variable(s) into dummy
    df = pd.get_dummies(df, columns=['CLOSED'], drop_first=True)
else:
    # Print an error message if 'CLOSED' column does not exist
    print("Error: 'CLOSED' column does not exist in the DataFrame.")


In [39]:
df.shape

(16921, 2285)

In [40]:
# Split back into training and testing sets (copy to avoid SettingWithCopyWarning later)
train_df = df[df['COST'].notnull()].copy()

In [41]:
# Removing 'COST' Column from Testing Set
test_df = df[df['COST'].isnull()]
test_df = test_df.drop('COST', axis=1)  # Dropping 'COST' column from testing set

In [42]:
train_df.shape, test_df.shape

((12690, 2285), (4231, 2284))

In [43]:
train_df['COST'] = np.log1p(train_df['COST'])
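
Because the target is transformed with `np.log1p`, its exact inverse is `np.expm1`; `np.exp` returns values that are one rupee too high. The metric cells in this notebook apply `np.exp` to both the true and the predicted values, so the reported scores are only slightly affected, but final predictions should probably be mapped back with `np.expm1`. A small check on made-up costs:

```python
import numpy as np

cost = np.array([300.0, 800.0, 1200.0])  # made-up costs in rupees
log_cost = np.log1p(cost)                # same transform as applied to COST

restored = np.expm1(log_cost)            # exact inverse of log1p
shifted = np.exp(log_cost)               # equals cost + 1

print(restored)
print(shifted)
```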

## Train test split

In [44]:
X = train_df.drop(labels=['COST'], axis=1)
y = train_df['COST'].values

from sklearn.model_selection import train_test_split
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.25, random_state=1)

In [45]:
X_train.shape, y_train.shape, X_cv.shape, y_cv.shape

((9517, 2284), (9517,), (3173, 2284), (3173,))

## Build the model

### RandomForestRegressor

In [48]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from math import sqrt 

# Create RandomForestRegressor model
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_rf = rf_model.predict(X_cv)

# Calculate RMSLE
rmsle_rf = sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_rf)))
print('RMSLE (Random Forest):', rmsle_rf)


RMSLE (Random Forest): 0.3626825677031758
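
Since the targets and predictions are already on the log1p scale, RMSLE can also be computed directly as the root mean squared error of the log-scale values, without converting back to rupees. The helper name `rmsle_from_log` and the sample arrays are illustrative assumptions. On real data the result differs slightly from the cell above, because that cell inverts with `np.exp` rather than `np.expm1`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle_from_log(y_true_log, y_pred_log):
    """RMSLE for targets and predictions that are both on the log1p scale."""
    diff = np.asarray(y_true_log) - np.asarray(y_pred_log)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up costs and predictions in rupees
y_true = np.array([300.0, 800.0, 1500.0])
y_pred = np.array([350.0, 700.0, 1400.0])

direct = rmsle_from_log(np.log1p(y_true), np.log1p(y_pred))
via_sklearn = float(np.sqrt(mean_squared_log_error(y_true, y_pred)))
print(direct, via_sklearn)
```

Both numbers agree because `mean_squared_log_error` applies `log1p` to its inputs internally.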


### Gradient Boosting Regressor 

In [49]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_log_error
from math import sqrt

# Initialize the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Train the Gradient Boosting model
gb_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_gb = gb_model.predict(X_cv)

# Calculate RMSLE
rmsle_gb = sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_gb)))
print('RMSLE (Gradient Boosting):', rmsle_gb)


RMSLE (Gradient Boosting): 0.37224193983912335


### Support Vector Regressor 

In [50]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_log_error
from math import sqrt

# Initialize the SVR model with a pipeline for scaling
svr_model = make_pipeline(StandardScaler(), SVR())

# Train the SVR model
svr_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_svr = svr_model.predict(X_cv)

# Calculate RMSLE
rmsle_svr = sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_svr)))
print('RMSLE (SVR):', rmsle_svr)

RMSLE (SVR): 0.4171107186358355


### Bagging Regressor

In [52]:
from sklearn.ensemble import BaggingRegressor

# Bagging over the default decision-tree estimator, sampling rows and features with replacement.
# The deprecated base_estimator=None argument is omitted; None was the default.
br = BaggingRegressor(n_estimators=30, max_samples=0.9, max_features=1.0, bootstrap=True,
                      bootstrap_features=True, oob_score=True, warm_start=False, n_jobs=1, random_state=42, verbose=1)
br.fit(X_train, y_train)
y_pred_br = br.predict(X_cv)
print('RMSLE(BaggingRegressor):', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_br))))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   18.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.5s finished

RMSLE(BaggingRegressor): 0.35889807962931003


## Checking Best Model 

In [56]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_log_error
from math import sqrt

# List of regression models
model_list = [
    RandomForestRegressor(random_state=42),
    GradientBoostingRegressor(random_state=42),
    SVR(),
    BaggingRegressor(random_state=42)
]

# Initialize lists to store model names and RMSLE scores
model_names = []
rmsle_scores = []

# Iterate through the models
for model in model_list:
    model_name = model.__class__.__name__
    print(f"Training {model_name}...")
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions on the validation set
    y_pred = model.predict(X_cv)
    
    # Calculate RMSLE
    rmsle = sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred)))
    print(f"RMSLE ({model_name}): {rmsle}")
    
    # Store model name and RMSLE
    model_names.append(model_name)
    rmsle_scores.append(rmsle)

Training RandomForestRegressor...
RMSLE (RandomForestRegressor): 0.36016039297674485
Training GradientBoostingRegressor...
RMSLE (GradientBoostingRegressor): 0.37224193983912335
Training SVR...
RMSLE (SVR): 0.636616396013763
Training BaggingRegressor...
RMSLE (BaggingRegressor): 0.3721038821901323


## Predict on test set

In [58]:
#Xtrain = train_df.drop(labels='COST', axis=1)
#ytrain = train_df['COST'].values
Xtest = test_df

In [59]:
from sklearn.model_selection import KFold
from sklearn.ensemble import BaggingRegressor

err_br = []        # per-fold RMSLE
y_pred_totbr = []  # per-fold predictions on the competition test set

fold = KFold(n_splits=15, shuffle=True, random_state=42)

for train_index, test_index in fold.split(X):
    # Positional indexing, since KFold yields integer positions
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]

    br = BaggingRegressor(n_estimators=30, max_samples=1.0, max_features=1.0, bootstrap=True,
                          bootstrap_features=True, oob_score=False, warm_start=False, n_jobs=1, random_state=42, verbose=0)

    br.fit(X_train, y_train)
    y_pred_br = br.predict(X_test)

    # The metric is RMSLE, computed once per fold
    fold_rmsle = sqrt(mean_squared_log_error(np.exp(y_test), np.exp(y_pred_br)))
    print("RMSLE BR:", fold_rmsle)
    err_br.append(fold_rmsle)

    # Predict the competition test set with this fold's model
    y_pred_totbr.append(br.predict(Xtest))

RMSLE BR: 0.3604225466162505
RMSLE BR: 0.3606867271762973
RMSLE BR: 0.34840054065618664
RMSLE BR: 0.3511785117383026
RMSLE BR: 0.3412423764840244
RMSLE BR: 0.3324498158689537
RMSLE BR: 0.3627368477369282
RMSLE BR: 0.3781163629013975
RMSLE BR: 0.35734481021927217
RMSLE BR: 0.34075710028279965
RMSLE BR: 0.3451173859037766
RMSLE BR: 0.3592734017580756
RMSLE BR: 0.34009318446171727
RMSLE BR: 0.3571125481490086
RMSLE BR: 0.37301396435914974
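
The loop stores one set of test predictions per fold in `y_pred_totbr`, but the notebook does not show how they are combined. One common approach, sketched here with made-up numbers, is to average the log-scale predictions across folds and map the result back to rupees with `np.expm1`. The `fold_preds` values, the `submission` frame, and the output file name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in for y_pred_totbr: 3 folds, 4 test rows, predictions on the log1p scale
fold_preds = [
    np.log1p(np.array([300.0, 800.0, 1200.0, 500.0])),
    np.log1p(np.array([320.0, 760.0, 1100.0, 520.0])),
    np.log1p(np.array([310.0, 820.0, 1300.0, 480.0])),
]

# Average across folds (axis 0), then invert the log1p transform
mean_log_pred = np.mean(fold_preds, axis=0)
final_cost = np.expm1(mean_log_pred)

submission = pd.DataFrame({'COST': final_cost})
# submission.to_excel('submission.xlsx', index=False)  # hypothetical file name
print(submission)
```

Averaging on the log scale amounts to a geometric mean of the fold predictions, which matches the scale on which RMSLE is measured.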




In [67]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from math import sqrt 
import numpy as np

# Note: X_train, y_train, X_test and y_test were overwritten by the K-fold loop,
# so they now hold the last fold's training rows and held-out rows.

# Create RandomForestRegressor model
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)

# Train the model on the last fold's training rows
rf_model.fit(X_train, y_train)

# Predict on X_cv from the earlier train_test_split. Most of these rows are also in
# the current X_train, so this score is optimistic rather than a true validation score.
y_pred_rf = rf_model.predict(X_cv)

# Calculate RMSLE on the validation set
rmsle_rf = sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_rf)))
print('RMSLE on Validation Set (Random Forest):', rmsle_rf)

# Predict on the last fold's held-out rows (not the competition test set)
y_pred_test_rf = rf_model.predict(X_test)

# Calculate RMSLE on those held-out rows
rmsle_test_rf = sqrt(mean_squared_log_error(np.exp(y_test), np.exp(y_pred_test_rf)))
print('RMSLE on Test Set (Random Forest):', rmsle_test_rf)

RMSLE on Validation Set (Random Forest): 0.15965657186236626
RMSLE on Test Set (Random Forest): 0.36447666013161467


In [68]:
# Saving the best model 
!pip install joblib




In [71]:
import joblib

# Save the trained model to a file
joblib.dump(rf_model, 'random_forest_model.pkl')


['random_forest_model.pkl']

In [72]:
# Load the saved model from file
loaded_model = joblib.load('random_forest_model.pkl')
# Now you can use loaded_model for making predictions or further analysis
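
A reloaded model predicts on the same log1p scale it was trained on, so its output still has to be converted back to rupees. The round trip below uses a tiny random forest on synthetic data and a temporary file; the data, the file name `toy_model.pkl`, and the variable names are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_toy = rng.random((40, 3))               # synthetic features
cost_toy = 200 + 1000 * X_toy[:, 0]       # synthetic costs in rupees

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_toy, np.log1p(cost_toy))      # train on the log1p target

# Save and reload through a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_model.pkl')
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# Predictions come back on the log1p scale; expm1 converts them to rupees
pred_cost = np.expm1(reloaded.predict(X_toy[:5]))
print(pred_cost)
```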