<a href="https://www.kaggle.com/code/adhoppin/zomato-rating-prediction?scriptVersionId=95728613" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Zomato Restaurant Ratings

# **ABSTRACT**

Zomato is one of the best online food delivery apps; it gives users ratings and reviews for restaurants all over India. These ratings and reviews are considered among the most important factors in deciding how good a restaurant is.

We will therefore use a real-world dataset with various features a user would look at when choosing a restaurant. This analysis covers the city of Bengaluru (Bangalore).

Content
The basic idea of analyzing the Zomato dataset is to get a fair idea of the factors affecting the establishment of different types of restaurants in different parts of Bengaluru, and of the aggregate rating of each restaurant. Bengaluru has more than 12,000 restaurants, serving dishes from all over the world.

With new restaurants opening every day, the industry has not been saturated yet and demand is increasing day by day. In spite of the increasing demand, it has become difficult for new restaurants to compete with established ones, most of which serve the same food. Bengaluru is the IT capital of India, and most people here depend mainly on restaurant food because they don't have time to cook for themselves.

With such overwhelming demand for restaurants, it has become important to study the demography of a location. What kind of food is more popular in a locality? Does the entire locality love vegetarian food? If so, is that locality populated by a particular community, e.g. Jains, Marwaris or Gujaratis, who are mostly vegetarian? This kind of analysis can be done with the data by studying factors such as

    • Location of the restaurant
    • Approximate price of food
    • Whether the restaurant is theme-based or not
    • Which locality of the city has the most restaurants serving a given cuisine
    • The needs of people who are striving to get the best cuisine of the neighborhood
    • Whether a particular neighborhood is famous for its own kind of food

“Just so that you have a good meal the next time you step out”

The data is accurate to that available on the Zomato website as of 15 March 2019.
The data was scraped from Zomato in two phases. After going through the structure of the website, I found that for each neighborhood there are 6-7 categories of restaurants, viz. Buffet, Cafes, Delivery, Desserts, Dine-out, Drinks & nightlife, and Pubs and bars.

Phase I

In Phase I of the extraction, only the URL, name and address of each restaurant were extracted, as these were visible on the front page. The URLs of the restaurants on Zomato were recorded in a CSV file so that the data could later be extracted individually for each restaurant. This made the extraction process easier and reduced the extra load on my machine. The data for each neighborhood and each category can be found here

Phase II

In Phase II, the recorded data for each restaurant and each category was read, and the data for each restaurant was scraped individually. 15 variables were scraped in this phase. For each neighborhood and each category, online_order, book_table, rate, votes, phone, location, rest_type, dish_liked, cuisines, approx_cost(for two people), reviews_list and menu_item were extracted. See the feature description below for more details about the variables.

Acknowledgements
The data was scraped for educational purposes only. Note that I don't claim any copyright for the data. All copyright in the data is owned by Zomato Media Pvt. Ltd.

         Source: Kaggle

**Main Objective:**

The main goals of this project are:

>> Perform extensive **Exploratory Data Analysis (EDA)** on the Zomato dataset.

>> Build an appropriate **Machine Learning Model** that helps Zomato restaurants predict their ratings based on certain features.



## Feature description

1. **url** contains the URL of the restaurant on the Zomato website

2. **address** contains the address of the restaurant in Bengaluru

3. **name** contains the name of the restaurant

4. **online_order** whether online ordering is available in the restaurant or not

5. **book_table** whether a table-booking option is available or not

6. **rate** contains the overall rating of the restaurant out of 5

7. **votes** contains the total number of ratings for the restaurant as of the above-mentioned date

8. **phone** contains the phone number of the restaurant

9. **location** contains the neighborhood in which the restaurant is located

10. **rest_type** restaurant type

11. **dish_liked** dishes people liked in the restaurant

12. **cuisines** food styles, separated by comma

13. **approx_cost(for two people)** contains the approximate cost of a meal for two people

14. **reviews_list** list of tuples containing reviews for the restaurant, each tuple

15. **menu_item** contains the list of menu items available in the restaurant

16. **listed_in(type)** type of meal

17. **listed_in(city)** contains the neighborhood in which the restaurant is listed

## 1. Importing the libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

#### 1.1 Loading the dataset

In [None]:
data = pd.read_csv('../input/zomato-bangalore-restaurants/zomato.csv')

In [None]:
data

#### 1.2 Checking the shape of the dataset

In [None]:
data.shape

- There are 51717 samples in total, with 17 features.

In [None]:
data.columns

#### 1.3 Checking the datatypes

In [None]:
data.info()

- Many columns have the object dtype. Later we will convert the relevant object columns to numeric types.

## 2. Data Cleaning

#### 2.1 Checking the missing values

In [None]:
data.isnull().sum()

- There are many null values. The '__rate__', '__phone__', '__location__', '__rest_type__', '__dish_liked__', '__cuisines__' and '__approx_cost(for two people)__' columns all have missing values, so we have to handle the missing values first.
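
The raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up `toy` frame (not the Zomato data) that reports both the number and the percentage of missing values per column; the column values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the Zomato data (values are made up).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rate": ["4.1/5", None, "3.8/5", None],
    "dish_liked": ["Pasta", None, None, None],
})

# Count and percentage of missing values per column, worst first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)
print(missing)
```

Applied to `data`, the same two expressions would show which columns account for most of the rows lost when missing values are dropped.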

#### 2.2 Removing the unnecessary columns from the data

In [None]:
df = data.drop(['url', 'phone'], axis = 1) # dropped 'url' and 'phone' columns

In [None]:
df.head()

#### 2.3 Handling the null or missing values

In [None]:
df.dropna(inplace = True)

In [None]:
df.isnull().sum()

- Now there are no null values.

#### 2.4 Checking and handling duplicate values

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace = True)
df.duplicated().sum()

- Now there are no duplicate values.

#### 2.5 Renaming the columns appropriately

In [None]:
df = df.rename(columns = {'approx_cost(for two people)':'cost',
                         'listed_in(type)':'type', 'listed_in(city)': 'city'})

In [None]:
df.head()

- Successfully renamed the columns.

#### 2.6 Cleaning the "cost" column

In [None]:
df['cost'].unique()

- Here we can see that the values are strings and that some of them, such as 5,000 and 6,000, contain a comma (,). We have to remove the ',' from the values and convert them to a numeric type.


In [None]:
df['cost'] = df['cost'].apply(lambda x:x.replace(',', '')) # remove thousands separators, e.g. '1,200' -> '1200'
df['cost'] = df['cost'].astype(float)

df['cost'].unique()

- We have now successfully converted the values to a numeric type.
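
The same conversion can also be written with vectorized pandas calls. This is only a sketch on a made-up `cost_raw` series; `errors="coerce"` is an optional safeguard that turns anything unparseable into NaN instead of raising.

```python
import pandas as pd

# Made-up cost strings, including one missing entry.
cost_raw = pd.Series(["800", "1,200", "300", "6,000", None])

# Strip the thousands separator, then convert to float.
cost_num = pd.to_numeric(
    cost_raw.str.replace(",", "", regex=False),
    errors="coerce",
)
print(cost_num.tolist())
```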

#### 2.7 Handling the rate column

In [None]:
df['rate'].unique()

- The rating column is also of string type and has to be converted to a numeric type, which means removing the '/5' suffix from the values.
There is also a 'NEW' value, which carries no rating information, so we have to remove those rows.

In [None]:
df = df.loc[df.rate != 'NEW'] # getting rid of 'NEW'

In [None]:
df['rate'].unique()

In [None]:
df['rate'] = df['rate'].apply(lambda x:x.replace('/5', ''))

df['rate'].unique()

In [None]:
df['rate'] = df['rate'].apply(lambda x: float(x))
df['rate']

- Now our data is cleaned and we can move on to visualization.
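
The rate cleaning steps above can also be folded into one small helper. This is a hedged sketch: `parse_rate` is a hypothetical function name, and the `'-'` placeholder in the sample is an assumption about what other non-numeric values might look like, not something shown in the output above.

```python
import numpy as np
import pandas as pd

def parse_rate(value):
    """Convert strings such as '4.1/5' to 4.1; return NaN for anything unparseable."""
    if not isinstance(value, str):
        return np.nan
    text = value.replace("/5", "").strip()
    try:
        return float(text)
    except ValueError:
        return np.nan

# Made-up sample; 'NEW' and '-' stand in for non-numeric placeholders.
rates = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", None])
parsed = rates.apply(parse_rate)
print(parsed.tolist())
```

Rows that come back as NaN can then be dropped in one step with `dropna`, instead of filtering each placeholder separately.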

## 3. Data Visualization

#### 3.1 Most famous restaurant chains in Bangalore

In [None]:
plt.figure(figsize = (17,10))
chains = df['name'].value_counts()[:20]
sns.barplot(x = chains, y=  chains.index,  palette= 'deep')
plt.title('Most famous restaurant chains in Bangalore')
plt.xlabel('Number of outlets')
plt.show()

__Insights:__
- __'Onesta'__, __'Empire Restaurant'__ & __'KFC'__ are the most famous restaurant chains in Bangalore, measured by number of outlets.

#### 3.2 Online order or not

In [None]:
v = df['online_order'].value_counts()
fig = plt.gcf()
fig.set_size_inches((10,6))
cmap = plt.get_cmap('Set3')
color = cmap(np.arange(len(v)))

plt.pie(v, labels = v.index, wedgeprops= dict(width = 0.6),autopct = '%0.02f', shadow = True, colors=  color)
plt.title('Online orders', fontsize = 20)
plt.show()


__Insight:__
- Most restaurants offer online ordering and delivery.

#### 3.3 Book table or not

In [None]:
v = df['book_table'].value_counts()

fig = plt.gcf()
fig.set_size_inches((8,6))
cmap = plt.get_cmap('Set1')
color = cmap(np.arange(len(v)))

plt.pie(v, labels = v.index, wedgeprops= dict(width = 0.6),autopct = '%0.02f', shadow = True, colors=  color)
plt.title('Book Table', fontsize = 20)
plt.show()


__Insight:__
- Most restaurants don't offer table booking.

#### 3.4 Rating Distribution

In [None]:
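plt.figure(figsize = (9,7))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df['rate'], kde = True, stat = 'density')
plt.title('Rating Distribution')
plt.show()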

__Insight:__

- We can infer from the plot above that most of the ratings fall between 3.5 and 4.5.
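
A reading like this can be backed by a number. Below is a minimal sketch on a made-up `ratings` series (not the notebook's `df['rate']`) that computes the share of ratings falling in the 3.5 to 4.5 range.

```python
import pandas as pd

# Made-up ratings for illustration.
ratings = pd.Series([3.1, 3.6, 3.9, 4.0, 4.2, 4.4, 4.7, 3.7])

# between() is inclusive on both ends by default; the mean of the
# boolean mask is the fraction of values inside the range.
share = ratings.between(3.5, 4.5).mean()
print(f"{share:.0%} of ratings lie between 3.5 and 4.5")
```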

#### 3.5 Location

In [None]:
plt.figure(figsize=  (20,40))
chains = df['location'].value_counts()
sns.barplot(x = chains, y=  chains.index,  palette= 'deep')
plt.title('Location of restaurants in Bangalore')
plt.show()

__Insight:__
- From the plot above, we can see that most of the restaurants are located in '__Koramangala 5th Block__', '__BTM__' & '__Indiranagar__'.
- The fewest restaurants are located in '__KR Puram__', '__Kanakapura__' & '__Magadi Road__'.

#### 3.6 Restaurant Type

In [None]:
plt.figure(figsize = (20,40))
t = df['rest_type'].value_counts()
sns.barplot(y = t.index ,x = t, palette = 'Paired')
plt.title('Restaurant Type')
plt.show()

__Insight:__

- 'Casual Dining', 'Quick Bites', 'Cafe' and 'Dessert Parlor' are the most common types of restaurant.
- 'Food Court, Casual Dining' and 'Dhaba' are the least common.

#### 3.7 Most Liked Dishes

In [None]:
# Split each comma-separated 'dish_liked' string into individual dish names
df.index = range(df.shape[0])
likes = []
for i in range(df.shape[0]):
    for item in df['dish_liked'][i].split(','):
        likes.append(item.strip())  # strip the space that follows each comma

In [None]:
favourite_food = pd.Series(likes).value_counts()
favourite_food.head(30)

In [None]:
cmap = plt.get_cmap('Set3')
top_dishes = favourite_food.nlargest(n = 20, keep = 'first')
color = cmap(np.arange(len(top_dishes)) % cmap.N)  # one colour per bar, cycling through the palette

ax = top_dishes.plot(kind = 'bar', figsize = (18,10), title = 'Top 20 Favourite Food counts', color = color)

# Write the count above each bar
for i in ax.patches:
    ax.annotate(str(i.get_height()), (i.get_x() * 1.005, i.get_height() * 1.005))
    

__Insights:__
- From the plot above, we can see that __Pasta__ & __Pizza__ are the most-liked dishes in Bangalore restaurants.
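
The dish counting can also be expressed with pandas string methods instead of an explicit loop. This is a sketch on a made-up `dishes` series, assuming the same comma-separated format as `dish_liked`.

```python
import pandas as pd

# Made-up 'dish_liked'-style strings.
dishes = pd.Series([
    "Pasta, Pizza, Burgers",
    "Pizza, Mocktails",
    "Pasta, Pizza",
])

dish_counts = (
    dishes.str.split(",")  # one list of dish names per restaurant
    .explode()             # one row per dish
    .str.strip()           # remove the space after each comma
    .value_counts()
)
print(dish_counts)
```

Stripping whitespace matters here: without it, "Pizza" and " Pizza" would be counted as two different dishes.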

#### 3.8 Most popular cuisines of Bangalore

In [None]:
v = df['cuisines'].value_counts()[:15]
plt.figure(figsize = (20,8))

ax  = sns.barplot(x = v.index, y = v, palette = 'Paired')

# Write the count above each bar
for i in ax.patches:
    ax.annotate(str(int(i.get_height())), (i.get_x()*1.005, i.get_height()*1.005))


plt.title('Most popular cuisines of Bangalore', fontsize = 20)
plt.xlabel('Cuisines', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.xticks(rotation =90)
plt.show()


__Insights:__
- From the plot above, we can see that North Indian cuisine is the most common in Bangalore restaurants.

#### 3.9 Distribution of Cost of Food for two People

- contains the approximate cost of meal for two people

In [None]:
plt.figure(figsize=(20,8))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df['cost'], kde = True, stat = 'density')
plt.show()

In [None]:
df.columns

In [None]:
v = df['cost'].value_counts()
plt.figure(figsize = (20,8))

sns.barplot(x = v.index, y = v, palette = 'Paired')
plt.xticks(rotation  =90)
plt.show()

__Insights:__
- From the plot above, we can see that the most common price for two people is __400__ in Bangalore restaurants.

#### 3.10 Services Types

In [None]:
# Types of services
plt.figure(figsize = (12,12))
ax = sns.countplot(x = df['type'])   # draw the count plot once
plt.xticks(rotation = 90, ha = 'right')

plt.title('Type of Service')
plt.show()

__Insights:__
-  Here the two main service types are __Delivery__ and __Dine-out__. 

#### 3.11 Restaurants with the most votes

In [None]:
name_grp = df.groupby('name')
v = name_grp['votes'].sum().sort_values(ascending = False)[:20]  # top 20 restaurants by total votes

plt.figure(figsize = (20,10))
ax = sns.barplot(y = v, x = v.index)

# Write the vote total above each bar
for i in ax.patches:
    ax.annotate(str(int(i.get_height())), (i.get_x()* 1.005, i.get_height()*1.005))


plt.title('Restaurants with the most votes', fontsize = 20)
plt.xlabel('Restaurant', fontsize = 15)
plt.ylabel('Total votes', fontsize = 15)
plt.xticks(rotation =90)
plt.show()

__Insights:__
- From the analysis, we can see that __'Onesta'__, __'Truffles'__ & __'Empire Restaurant'__ are the most highly voted restaurants.

## 4. Data preparation

In [None]:
df.head()

#### 4.1 Convert the online_order categorical variable into a numeric format

In [None]:
# Map 'Yes'/'No' to 1/0 in a single assignment (avoids chained-assignment pitfalls)
df['online_order'] = df['online_order'].map({'Yes': 1, 'No': 0})

In [None]:
df.online_order.value_counts()

#### 4.2 Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.location = le.fit_transform(df.location)
df.rest_type = le.fit_transform(df.rest_type)
df.cuisines = le.fit_transform(df.cuisines)
df.menu_item = le.fit_transform(df.menu_item)

df.book_table = le.fit_transform(df.book_table)

In [None]:
df.head(n=2)

In [None]:
# Columns: online_order, book_table, rate, votes, location, rest_type, cuisines, cost, menu_item
my_data = df.iloc[:,[2,3,4,5,6,7,9,10,12]]

my_data.to_csv('Zomato_df.csv')

In [None]:
my_data.head()

In [None]:
plt.figure(figsize = (20,20))
sns.heatmap(my_data.corr(), annot = True)
plt.show()

__Insights:__
- From the heatmap above, we can see that rate is highly correlated with votes.
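
Reading a single column off a large heatmap can be error-prone, so it may help to rank the features by their correlation with the target directly. The sketch below uses a made-up `toy` frame; the numbers are invented and say nothing about the real dataset.

```python
import pandas as pd

# Made-up numeric data for illustration.
toy = pd.DataFrame({
    "rate":  [3.2, 3.8, 4.1, 4.5, 3.6],
    "votes": [10, 150, 400, 900, 90],
    "cost":  [300, 500, 400, 1200, 800],
})

# Correlation of every feature with 'rate', strongest (by absolute value) first.
corr_with_rate = (
    toy.corr()["rate"]
    .drop("rate")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_rate)
```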

#### 4.3 Dependent and independent variables

In [None]:
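# Features: online_order, book_table, votes, location, rest_type, cuisines, cost, menu_item
X = df.iloc[:,[2,3,5,6,7,9,10,12]]
y = df['rate']  # target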

In [None]:
X

In [None]:
y

#### 4.4 Splitting data into train and test set

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=10)


In [None]:
X_train

In [None]:
y_train

## 5. Modeling

#### 5.1 Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,y_train)

# Predict the test set
y_pred = lr.predict(X_test)

# Evaluate the model
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))

#### 5.2 Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(min_samples_leaf=0.01)

dtr.fit(X_train,y_train)

# Predict the test set
y_pred  = dtr.predict(X_test)

# Evaluate the model performance
from sklearn.metrics import r2_score

print(r2_score(y_test,y_pred))


#### 5.3 Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=650,random_state=245,min_samples_leaf=.0001)
rfr.fit(X_train,y_train)

# Predict the test set
y_pred  = rfr.predict(X_test)

# Evaluate the model performance
from sklearn.metrics import r2_score

print(r2_score(y_test,y_pred))

#### 5.4 Support Vector Regressor

In [None]:
from sklearn.svm import SVR
svr = SVR(kernel ='rbf')

svr.fit(X_train, y_train)

# predict the test set
y_pred = svr.predict(X_test)

# Evaluate the performance
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(r2)
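
An RBF-kernel SVR is sensitive to feature scale, and features such as votes and cost span very different ranges. One common option, sketched here on synthetic data rather than on the notebook's `X_train`, is to standardize the features inside a pipeline. The names `X_demo`, `y_demo` and `scaled_svr` are hypothetical, and whether scaling helps on the Zomato data would have to be checked.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; the columns are rescaled so that they differ
# widely in magnitude, loosely mimicking features such as votes and cost.
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler sits inside the pipeline, so it is fit on the training split only.
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
scaled_svr.fit(X_tr, y_tr)
print("R^2 with scaling:", scaled_svr.score(X_te, y_te))
```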

#### 5.5 Extra Trees Regressor

Extra Trees is like a Random Forest in that it builds multiple trees and combines their predictions, but with two key differences: by default it does not bootstrap the observations (each tree is fit on the whole training set rather than on a sample drawn with replacement), and nodes are split on randomly drawn thresholds rather than on the best threshold found by search. So in summary, ExtraTrees:

 - builds multiple trees with bootstrap = False by default, which means every tree sees the full training set.
 - splits each node by drawing a random threshold for each candidate feature and keeping the best of those random splits.

In Extra Trees, the randomness doesn't come from bootstrapping the data, but rather from the random split thresholds. The name ExtraTrees stands for Extremely Randomized Trees.
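
The difference in defaults can be checked directly in scikit-learn. The following sketch fits both ensembles on synthetic data (not the Zomato features) and prints each model's `bootstrap` setting next to its test R²; the scores are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # R^2 on the held-out split
    print(name, "bootstrap =", model.bootstrap, "R^2 =", round(scores[name], 3))
```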

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor(n_estimators = 120)

etr.fit(X_train, y_train)

# predict the test set
y_pred = etr.predict(X_test)

# Evaluate the model performance
from sklearn.metrics import r2_score
r2 = r2_score(y_test,y_pred)
print(r2)

## 6. Conclusion

- From the analysis, __'Onesta'__, __'Empire Restaurant'__ & __'KFC'__ are the most famous restaurant chains in Bangalore, measured by number of outlets.
- Most restaurants offer online ordering and delivery.
- Most restaurants don't offer table booking.
- From the analysis, most of the ratings fall between 3.5 and 4.5.
- From the analysis, most of the restaurants are located in '__Koramangala 5th Block__', '__BTM__' & '__Indiranagar__'. The fewest restaurants are located in '__KR Puram__', '__Kanakapura__' & '__Magadi Road__'.

- __'Casual Dining'__, __'Quick Bites'__, __'Cafe'__ and __'Dessert Parlor'__ are the most common types of restaurant, while __'Food Court, Casual Dining'__ and __'Dhaba'__ are the least common.
- From the analysis, __Pasta__ & __Pizza__ are the most-liked dishes in Bangalore restaurants.
- From the analysis, __North Indian__ cuisine is the most common in Bangalore restaurants.
- The two main service types are __Delivery__ and __Dine-out__.
- From the analysis, __'Onesta'__, __'Truffles'__ & __'Empire Restaurant'__ are the most highly voted restaurants.

- For the modeling part, I used __LinearRegression__, __DecisionTree Regressor__, __RandomForest Regressor__, __SupportVector Regressor__ & __ExtraTree Regressor__. Of these models, __ExtraTree Regressor__ performed best, so I selected __ExtraTree Regressor__ as the final model.
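
A side-by-side table makes this kind of model comparison easier to read. The sketch below runs the same five model families on synthetic data and reports R² together with mean absolute error. The hyperparameters and the `candidates` and `results` names are assumptions chosen for brevity, and the resulting ranking says nothing about the Zomato data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)

candidates = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "SVR": SVR(kernel="rbf"),
    "ExtraTreesRegressor": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

# Fit every candidate on the same split and collect both metrics.
results = []
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results.append((name, r2_score(y_te, pred), mean_absolute_error(y_te, pred)))

# Print the models from best to worst R^2.
for name, r2, mae in sorted(results, key=lambda row: row[1], reverse=True):
    print(f"{name:<22} R^2 = {r2:6.3f}   MAE = {mae:8.3f}")
```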

