## Business Problem

Restaurants play a major role in the business, intellectual, social and artistic life of a society. To enter the restaurant business, one needs a sound business plan and background analysis to attract investors' attention. For a restaurant to do well, customer ratings play a key role. Today, websites and apps let customers share their views and feedback on restaurants with a large audience. This is valuable because anyone can get in-depth information about a restaurant from other customers with very little effort and search cost. It is always better to analyse the market thoroughly before getting into the business.

We position ourselves as a consulting firm that provides insights and analysis to individuals planning to start a new restaurant in the city of Bangalore. In the analysis below, we identify various factors that play a role in determining a restaurant's customer ratings. We also build predictive models that help estimate a restaurant's success (measured by customer ratings) from its business plan.

For this analysis, we used an open-source dataset scraped from Zomato and available on Kaggle.

In [2]:
sc

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName('PROJECT').getOrCreate()

In [5]:
#importing data to a pandas dataframe
import pandas as pd
df = pd.read_csv("/dbfs/FileStore/tables/zomato.csv")

In [6]:
# Enable Arrow to speed up the pandas-to-Spark conversion.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Reading the CSV directly into a Spark DataFrame caused data mismatches in several rows,
# so we first load it with pandas and then convert it to a Spark DataFrame.

In [7]:
df = spark.createDataFrame(df)

In [8]:
df.show(n=5)

In [9]:
df.printSchema()

In [10]:
df.count()

In [11]:
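# Note: dropna() returns a new DataFrame. The result is not assigned here, so df itself is unchanged.
df.dropna()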

In [12]:
df.columns

In [13]:
# The rate column holds ratings in fraction form (4/5, 3/5, ...). Only the numerator is needed for the analysis,
# so we split on '/' and store the first part in a new column named ratings.
import pyspark.sql.functions as f

split_col = f.split(df['rate'], '/')
df = df.withColumn('ratings', split_col.getItem(0))
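
The split-then-cast logic can be illustrated in plain Python. This is only a sketch of the idea, not the Spark implementation: `parse_rating` is a hypothetical helper, and the sample strings in the usage note are made-up inputs rather than values taken from the dataset. Returning `None` for text that is not a number plays the role of the null that a Spark cast to float produces for such values.

```python
from typing import Optional


def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Return the numerator of an 'x/5' rating string as a float, or None."""
    if raw is None:
        return None
    # Keep only the part before the first '/', as split(...).getItem(0) does.
    numerator = raw.split("/")[0].strip()
    try:
        return float(numerator)
    except ValueError:
        # Non-numeric text cannot be used as a rating.
        return None
```

For example, `parse_rating("4.1/5")` gives `4.1`, while a non-numeric placeholder such as `parse_rating("NEW")` gives `None`.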

In [14]:
# A low vote count suggests that the rating is not reliable,
# so we keep only the data points with more than 150 votes.

df1=df.filter("votes > 150")

In [15]:
# Add two new features: the number of cuisines offered and the number of liked dishes.
df1 = df1.withColumn('CuisinesCount', f.size(f.split(f.col('cuisines'), ',')))
df1 = df1.withColumn('dish_liked_Count', f.size(f.split(f.col('dish_liked'), ',')))

df1 = df1.withColumnRenamed('listed_in(city)','listed_in_city')
df1 = df1.withColumnRenamed( 'approx_cost(for two people)', 'approx_cost')
df1 = df1.withColumnRenamed('listed_in(type)','listed_in_type')

from pyspark.sql.types import IntegerType
df1 = df1.withColumn("approx_cost", df1["approx_cost"].cast(IntegerType()))
df1 = df1.withColumn("ratings", df1["ratings"].cast('float'))

df1.show(n=5)
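
The two count features are simply the number of comma-separated items in a text column. The plain-Python sketch below mirrors that logic; `count_items` is a hypothetical helper. It assumes that missing values behave as they do under Spark's legacy `size` setting, where the size of a null array is -1, which is the likely origin of the -1 group that appears in the cuisine analysis.

```python
from typing import Optional


def count_items(raw: Optional[str]) -> int:
    """Count comma-separated items, returning -1 for a missing value."""
    if raw is None:
        # Assumption: mirrors Spark's legacy behaviour where size(NULL) is -1.
        return -1
    return len(raw.split(","))
```

With this helper, `count_items("North Indian, Chinese")` is 2 and `count_items(None)` is -1.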

In [16]:
df1.printSchema()


In [17]:
# Creating a view zomato, with the selected columns as features.
df1.select([ 'name','online_order', 'book_table', 'votes', 'location', 'rest_type', 'cuisines', 'approx_cost', 'listed_in_type','ratings','CuisinesCount','dish_liked_Count','listed_in_city']).createOrReplaceTempView('zomato')

In [18]:
df1.count()

#### Analysis

#### 1. Does online order availability in the restaurants decide the ratings?

In [21]:
%sql
select online_order,avg(ratings) as average_rating from zomato group by online_order 


online_order,average_rating
No,4.094441835752795
Yes,3.977004994571507


##### The two averages are close (4.09 without online ordering versus 3.98 with it), so online order availability does not appear to play an important role in a restaurant's rating. Looking closely, restaurants that do not offer online ordering are rated slightly higher on average than those that do.
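
The `group by` averages used throughout this analysis can be sketched in plain Python as follows. `average_by` is a hypothetical helper and the rows in the usage example are toy values, not data from the Zomato table. Like SQL `avg`, the sketch skips missing values.

```python
from collections import defaultdict


def average_by(rows, key, value):
    """Average `value` per distinct `key`, ignoring rows where value is None."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[value] is None:
            continue  # SQL avg() also ignores NULLs
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Toy rows for illustration only.
toy_rows = [
    {"online_order": "Yes", "ratings": 4.0},
    {"online_order": "Yes", "ratings": 3.5},
    {"online_order": "No", "ratings": 4.5},
    {"online_order": "No", "ratings": None},
    {"online_order": "No", "ratings": 4.0},
]
toy_averages = average_by(toy_rows, "online_order", "ratings")
print(toy_averages)
```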

#### 2. Does online table booking option availability in the restaurants decide the ratings?

In [24]:
%sql
select book_table,avg(ratings) as average_rating from zomato group by book_table 

book_table,average_rating
No,3.909053195008695
Yes,4.190330311785239


##### Restaurants that offer online table booking average about 0.28 rating points higher (4.19 versus 3.91) than those that do not.

#### 3a. Top five locations by average restaurant rating

In [27]:
%sql
select location,avg(ratings) as average_rating from zomato group by location order by average_rating desc limit 5 

location,average_rating
City Market,4.399999936421712
Lavelle Road,4.294812734944676
Hennur,4.235294173745548
Koramangala 5th Block,4.218389154924976
Vasanth Nagar,4.2156862230861885


##### These five locations have the highest average ratings, which makes them the most favourable for achieving good customer satisfaction.

#### 3b. Top five restaurants by average rating

In [30]:
%sql
select name,avg(ratings) as average_rating from zomato group by name order by average_rating desc limit 5 

name,average_rating
Asia Kitchen By Mainland China,4.900000095367432
Byg Brewski Brewing Company,4.900000095367432
Santé Spa Cuisine,4.900000095367432
Punjab Grill,4.871428694043841
Belgian Waffle Factory,4.844827734190842


##### These five restaurants have the highest average ratings and therefore the best customer satisfaction.

#### 3c. Are the top five rated restaurants located in the top five locations listed above?

In [33]:
%sql
select name,avg(ratings) as avg_rating,location from zomato group by name,location order by avg_rating desc limit 5

name,avg_rating,location
Flechazo,4.900000095367432,Whitefield
Belgian Waffle Factory,4.900000095367432,Brigade Road
Byg Brewski Brewing Company,4.900000095367432,Sarjapur Road
Asia Kitchen By Mainland China,4.900000095367432,Koramangala 5th Block
Milano Ice Cream,4.900000095367432,Indiranagar


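##### The above result gives some interesting findings. Because this query groups by both name and location, each row is a single name and location pair. Only one of the five, Asia Kitchen By Mainland China in Koramangala 5th Block, is in one of the top five locations; the other four are elsewhere. So location is not the only factor that decides the ratings.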
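
A quick cross-check of the two result tables above can be done with a set lookup. The names and locations below are copied from the query outputs for 3a and 3c; the variable names are only illustrative.

```python
# Top five locations by average rating (output of 3a).
top_locations = {
    "City Market",
    "Lavelle Road",
    "Hennur",
    "Koramangala 5th Block",
    "Vasanth Nagar",
}

# Top five name and location pairs by average rating (output of 3c).
top_restaurants = {
    "Flechazo": "Whitefield",
    "Belgian Waffle Factory": "Brigade Road",
    "Byg Brewski Brewing Company": "Sarjapur Road",
    "Asia Kitchen By Mainland China": "Koramangala 5th Block",
    "Milano Ice Cream": "Indiranagar",
}

# Keep the restaurants whose location is also a top-five location.
overlap = {
    name: location
    for name, location in top_restaurants.items()
    if location in top_locations
}
print(overlap)
```

The lookup finds a single match, Asia Kitchen By Mainland China in Koramangala 5th Block.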

#### 4. Will the number of cuisines have an impact on the customer satisfaction rate (Are customers satisfied with restaurants concentrating on single cuisine or inclined towards multi-cuisines)?

In [36]:
%sql

select CuisinesCount, avg(ratings) as average_rating from zomato group by CuisinesCount order by average_rating desc

CuisinesCount,average_rating
6,4.172398172352649
7,4.15754390348468
5,4.104934194958524
4,4.046109984178498
-1,4.0
3,3.993466321726392
2,3.967367345019812
8,3.965833322207133
1,3.964750593235261


##### From the above analysis, we can say that multi-cuisine restaurants are rated higher than single-cuisine restaurants.

Restaurants offering 5 to 7 cuisines have the highest average ratings (4.10 to 4.17), with 6 cuisines at the top. For 1 to 4 cuisines the average stays within a narrow band (3.96 to 4.05), so if a large menu is not feasible, adding a few extra cuisines brings little benefit and it is better to go with a single cuisine.

#### Model predictions

##### Because our dependent variable is continuous, ranging from zero to five, classification models such as logistic regression are not suitable here. Predicting a continuous target from several input features is a regression problem.

In [40]:
# Select the dependent variable (ratings) and the independent variables identified as most useful for predicting it

data=df1.select(['online_order','book_table','dish_liked_count','CuisinesCount','listed_in_city','listed_in_type','approx_cost','ratings'])

In [41]:
data.count()

In [42]:
data=data.dropna()

In [43]:
data.count()

In [44]:
# Create a 70-30 train test split

train_data,test_data=data.randomSplit([0.7,0.3])
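
Spark's `randomSplit` assigns each row to a split at random, so a 70-30 split produces sizes that are only approximately 70% and 30%, and the result changes between runs unless a seed is passed through its optional `seed` argument. The plain-Python sketch below illustrates this behaviour; `random_split` is a hypothetical helper, not the Spark implementation.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row independently to train or test with a fixed seed."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is drawn independently, so split sizes are approximate.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


sample_rows = list(range(1000))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(sample_rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, which keeps the reported model metrics stable when the notebook is rerun.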

#### Building Random Forest Model:

In [46]:
data.describe().show()

In [47]:
# Index the categorical column online_order: StringIndexer maps each category to a numerical index
from pyspark.ml.feature import StringIndexer
online_order_indexer = StringIndexer(inputCol='online_order',outputCol='online_order_index',handleInvalid='keep')
online_order_indexer_df = online_order_indexer.fit(data).transform(data)
online_order_indexer_df.show()
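
By default, `StringIndexer` orders labels by descending frequency, so the most frequent category gets index 0. With `handleInvalid='keep'`, labels that were not seen during fitting are placed in one extra index after the known labels. The sketch below imitates that mapping in plain Python; the two helper functions are hypothetical, and ties are broken alphabetically here as a simplifying assumption.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


def transform_label(label, mapping):
    """Return the index of a label; unseen labels share one extra index."""
    return mapping.get(label, float(len(mapping)))


mapping = fit_string_index(["Yes", "No", "Yes", "Yes", "No"])
print(mapping)
```

Here `Yes` is the most frequent label and maps to 0.0, `No` maps to 1.0, and any unseen label maps to 2.0.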

In [48]:
# Index the categorical column book_table: StringIndexer maps each category to a numerical index
from pyspark.ml.feature import StringIndexer
book_table_indexer = StringIndexer(inputCol='book_table',outputCol='book_table_index',handleInvalid='keep')
book_table_indexer_df = book_table_indexer.fit(online_order_indexer_df).transform(online_order_indexer_df)
book_table_indexer_df.show()

In [49]:
# Index the categorical column listed_in_city: StringIndexer maps each category to a numerical index
from pyspark.ml.feature import StringIndexer
listed_in_city_indexer = StringIndexer(inputCol='listed_in_city',outputCol='listed_in_city_index',handleInvalid='keep')
listed_in_city_indexer_df = listed_in_city_indexer.fit(book_table_indexer_df).transform(book_table_indexer_df)
listed_in_city_indexer_df.show()

In [50]:
# Vector assembler is used to create a vector of input features

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['online_order_index','book_table_index','listed_in_city_index',
                                       'dish_liked_count','CuisinesCount','approx_cost'],
                            outputCol="features")
output = assembler.transform(listed_in_city_indexer_df)
output.show(2)

In [51]:
# Select the feature vector and rename ratings to label, the column names the models expect.
df_classifier = output.selectExpr("features as features","ratings as label")
df_classifier.show(2)

In [52]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator  # needed for the RMSE evaluation below

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 5 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=5).fit(df_classifier)
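
`VectorIndexer` decides per feature whether to treat it as categorical: a feature with at most `maxCategories` distinct values is indexed as categorical, and any other feature is left as continuous. The sketch below shows that decision rule in plain Python on toy feature rows; `categorical_feature_indices` is a hypothetical helper and the numbers are illustrative only.

```python
def categorical_feature_indices(feature_rows, max_categories):
    """Return positions of features with at most max_categories distinct values."""
    n_features = len(feature_rows[0])
    distinct = [set() for _ in range(n_features)]
    for row in feature_rows:
        for position, value in enumerate(row):
            distinct[position].add(value)
    return [
        position
        for position, values in enumerate(distinct)
        if len(values) <= max_categories
    ]


# Toy rows: feature 0 has two distinct values, feature 1 has six.
toy_features = [
    [0.0, 100.0],
    [1.0, 200.0],
    [0.0, 300.0],
    [1.0, 400.0],
    [0.0, 500.0],
    [1.0, 600.0],
]
categorical = categorical_feature_indices(toy_features, max_categories=5)
print(categorical)
```

With `max_categories=5`, only the first feature is treated as categorical, because the second one has six distinct values.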

In [53]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = df_classifier.randomSplit([0.7, 0.3])

In [54]:
# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")

In [55]:
# Chain indexer and forest in a Pipeline
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

In [56]:
# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

In [57]:
predictions = model.transform(testData)

In [58]:
# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

rfModel = model.stages[1]
print(rfModel)  # summary only

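##### For the Random Forest regression model, the Root Mean Squared Error of 0.37 on the test data means that the predicted ratings typically deviate from the actual ratings by about 0.37 points on the zero-to-five scale. The standard deviation of the residuals is therefore low, and most predictions lie close to the observed ratings.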
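
For reference, the RMSE metric reported by the evaluator is the square root of the mean squared difference between labels and predictions. A minimal plain-Python sketch, using a hypothetical `rmse` helper and toy values rather than the model's actual predictions:

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values for illustration only.
toy_labels = [4.0, 3.5, 4.5]
toy_predictions = [4.5, 3.5, 4.0]
print(rmse(toy_labels, toy_predictions))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.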

### Decision Tree:

In [61]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

In [62]:
# Train a DecisionTree model.
dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

In [63]:
# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, dt])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)


In [64]:
# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

In [65]:
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
treeModel = model.stages[1]
# summary only
print(treeModel)

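##### For the Decision Tree model, the Root Mean Squared Error of 0.377 on the test data means that the predicted ratings typically deviate from the actual ratings by about 0.38 points on the zero-to-five scale. The standard deviation of the residuals is again low, and most predictions lie close to the observed ratings.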

### Linear Regression

In [68]:
from pyspark.ml.regression import LinearRegression

In [69]:
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(trainingData)
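
The `regParam` and `elasticNetParam` arguments control an elastic net penalty that is added to the squared-error loss. `regParam` sets the overall strength, and `elasticNetParam` mixes the L1 and L2 terms (1.0 is pure L1, 0.0 is pure L2). The sketch below computes that penalty for a given weight vector in plain Python; `elastic_net_penalty` is a hypothetical helper and the weights are toy values.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty: reg_param * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1_norm = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


# Toy weights with the same settings as the model above.
toy_penalty = elastic_net_penalty([1.0, -2.0], reg_param=0.3, elastic_net_param=0.8)
print(toy_penalty)
```

With `elasticNetParam=0.8` the penalty is dominated by the L1 term, which tends to push some coefficients to exactly zero.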

In [70]:
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

In [71]:
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

##### For the Linear Regression model, the Root Mean Squared Error is 0.413, so the standard deviation of the residuals is still low, although somewhat higher than for the tree-based models. Note that this value comes from the training summary, so it is measured on the training data rather than on the held-out test data.
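
The training summary also prints r2, the coefficient of determination. It compares the model's squared error with that of a baseline that always predicts the mean label. A minimal plain-Python sketch, using a hypothetical `r_squared` helper and toy values:

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    if ss_tot == 0:
        raise ValueError("r2 is undefined when all labels are identical")
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
perfect_fit = r_squared([3.0, 4.0, 5.0], [3.0, 4.0, 5.0])
mean_only = r_squared([3.0, 4.0, 5.0], [4.0, 4.0, 4.0])
print(perfect_fit, mean_only)
```

A value of 1 means the predictions match the labels exactly, and a value of 0 means the model does no better than always predicting the mean rating.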

##### From the evaluation of the above three models, we can see that their Root Mean Squared Errors are close to one another (0.37, 0.377 and 0.413), which suggests that all three are reasonable choices for this kind of data. Looking closely at the model performance, the Random Forest model performed better than the other two, with the lowest RMSE.

##### As a consulting firm, we are now in a position to advise any prospective client who wants to enter the restaurant business in the city of Bangalore on which location, number of cuisines, price range and restaurant theme to choose. Our model predicts ratings well enough to give an idea of how a business is likely to perform based on these factors.