In this kernel, we will try to predict the 'Aggregate rating' based on the other features.

**First, we will import the libraries that we will use for pre-processing and EDA.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Then we will load our data into a DataFrame.

In [None]:
df = pd.read_csv('../input/zomato.csv', encoding='iso-8859-1')

Let's take a look at the first 2 rows of our dataset:

In [None]:
df.head(2)

We can see that there are text columns, categorical columns, and numerical columns.
Let's take a deeper look at the properties of our columns.

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

We can see that there are 21 columns in our dataset. In this kernel we are not going to use the text columns, so we won't include them in our machine learning model.

We can also see some information about our target: the mean of 'Aggregate rating' is 2.66 and the standard deviation is 1.51. The minimum score is 0 and the maximum score is 4.9.

***

Now we will make some visualizations to get a better understanding of our data.

In [None]:
sns.set(rc={'figure.figsize':(9,7)})
sns.countplot(x='Has Table booking',data=df,palette='viridis')

We can see that most of the restaurants do not offer table booking.

In [None]:
sns.countplot(x='Has Online delivery',data=df,palette='viridis',order=['Yes','No'])

In [None]:
sns.countplot(x='Is delivering now',data=df,palette='viridis',order=['Yes','No'])

From this plot, it seems that almost no restaurants are delivering now. Let's take a look at the numbers.

In [None]:
df['Is delivering now'].value_counts()

We can see that only 34 restaurants are delivering now. Because of this imbalance in the data, we will not use this feature.
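
To quantify an imbalance like this, relative frequencies can be more telling than raw counts. Here is a minimal sketch on a small made-up Series (the values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for a Yes/No column; the values are invented for illustration.
flags = pd.Series(['No'] * 98 + ['Yes'] * 2, name='Is delivering now')

# normalize=True returns the share of each category instead of the raw count.
shares = flags.value_counts(normalize=True)
print(shares)

# Share of the rarest category; a very small value signals a near-constant column.
minority_share = shares.min()
print(f"minority share: {minority_share:.2%}")
```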

In [None]:
df['Switch to order menu'].value_counts()

This column contains only one value, so we will not use it either.
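
Constant columns like this one could also be found programmatically rather than one at a time. A small sketch, assuming a toy DataFrame whose column values are hypothetical:

```python
import pandas as pd

# Toy frame; the values are hypothetical and only illustrate the check.
toy = pd.DataFrame({
    'Switch to order menu': ['No', 'No', 'No', 'No'],
    'Has Online delivery': ['Yes', 'No', 'No', 'Yes'],
    'Votes': [10, 25, 3, 7],
})

# nunique() counts distinct values per column; a count of 1 means the column is constant.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```

A constant column carries no information for a model, so dropping it loses nothing.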

In [None]:
sns.countplot(x='Price range',data=df,palette='viridis')

We can see that, of all the price categories, the lowest one contains the most restaurants.

In [None]:
sns.countplot(x='Rating text',data=df,palette='viridis')

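The counts in this column look roughly bell-shaped. Since 'Rating text' is categorical and the bars are not necessarily plotted in rating order, this is only a loose description.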

In [None]:
sns.countplot(x='Rating color',data=df,palette='viridis')

This column appears to carry the same information as 'Rating text', so we will use only the 'Rating text' column.
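
One way to check such a claim, rather than relying on the plots alone, is to test whether the two columns map one-to-one. A hedged sketch on a toy frame (the label and color pairs below are made up for illustration):

```python
import pandas as pd

# Toy data; the label/color pairs are illustrative, not the dataset's actual mapping.
toy = pd.DataFrame({
    'Rating text': ['Good', 'Average', 'Good', 'Poor', 'Average'],
    'Rating color': ['Yellow', 'Orange', 'Yellow', 'Red', 'Orange'],
})

# For each text label, count the distinct colors that occur with it, and vice versa.
colors_per_text = toy.groupby('Rating text')['Rating color'].nunique()
texts_per_color = toy.groupby('Rating color')['Rating text'].nunique()

# The columns are redundant if every label has exactly one color and every color one label.
one_to_one = bool((colors_per_text == 1).all() and (texts_per_color == 1).all())
print(one_to_one)
```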

In [None]:
# histplot replaces the deprecated distplot (requires seaborn 0.11 or newer)
sns.histplot(df['Aggregate rating'], bins=20, color='blue', edgecolor='black')

We can see that most of the ratings follow a roughly normal distribution, but there is also a separate group of restaurants with a rating of 0.

***

Now, we will do some **feature engineering** and try to get more from our dataset.

First, we have to change the cost column. Let's look at how many currencies we have.

In [None]:
df['Currency'].unique()

So we have 12 different currencies, and we have to treat each one differently. The exchange rates were taken from www.XE.com.
We will convert each cost to US dollars.

In [None]:
df['new cost'] = 0

In [None]:
df['Currency'].unique()

In [None]:
d = {'Botswana Pula(P)':0.095, 'Brazilian Real(R$)':0.266,'Dollar($)':1,'Emirati Diram(AED)':0.272,
    'Indian Rupees(Rs.)':0.014,'Indonesian Rupiah(IDR)':0.00007,'NewZealand($)':0.688,'Pounds(\x8c£)':1.314,
    'Qatari Rial(QR)':0.274,'Rand(R)':0.072,'Sri Lankan Rupee(LKR)':0.0055,'Turkish Lira(TL)':0.188}

df['new cost'] = df['Average Cost for two'] * df['Currency'].map(d) 
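
Because `map` returns NaN for any key missing from the dictionary, it may be worth checking that every currency was matched. A minimal sketch on toy data (the currency labels and rates below are placeholders, not real quotes):

```python
import pandas as pd

# Toy data; 'Unknown(X)' is a hypothetical label that is missing from the rate table.
toy = pd.DataFrame({
    'Currency': ['Dollar($)', 'Rand(R)', 'Unknown(X)'],
    'Average Cost for two': [40, 500, 100],
})
rates = {'Dollar($)': 1.0, 'Rand(R)': 0.07}  # placeholder rates

# map() looks up each currency in the dictionary and yields NaN when there is no match.
toy['rate'] = toy['Currency'].map(rates)
toy['new cost'] = toy['Average Cost for two'] * toy['rate']

# List the currencies that were not converted.
unmapped = toy.loc[toy['rate'].isna(), 'Currency'].unique().tolist()
print(unmapped)
```

An empty list would mean every row received a rate; a typo in a dictionary key would show up here instead of silently producing NaN costs.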

In [None]:
df.head(2)

In [None]:
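# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(data=df.select_dtypes(include='number').corr(),cmap='coolwarm',annot=True)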

We can see that 'Price range', 'Votes' and 'new cost' are correlated with our target, so we will use them in our model.
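
Reading the target's correlations off a heatmap can be error-prone, so ranking them as numbers may help. A sketch on synthetic data (the columns and the relationship between them are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data: 'Votes' drives the rating by construction, 'Unrelated' does not.
rng = np.random.default_rng(0)
n = 200
votes = rng.integers(0, 1000, size=n)
toy = pd.DataFrame({
    'Votes': votes,
    'Unrelated': rng.normal(size=n),
    'Aggregate rating': 2 + votes / 500 + rng.normal(0, 0.3, size=n),
})

# Take the target's column of the correlation matrix and drop its self-correlation.
target_corr = toy.corr()['Aggregate rating'].drop('Aggregate rating')

# Rank features by the absolute strength of their linear relationship with the target.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```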

Now we will try to do some **Exploratory data analysis** in order to get a better understanding of the connection between our features, and between the features and the target.

We will first add a new feature derived from our target, to understand it better.

In [None]:
df['new Rating'] = 0

In [None]:
mask1 = (df['Aggregate rating'] < 1)
mask2 = (df['Aggregate rating'] >= 1) & (df['Aggregate rating'] < 2)
mask3 = (df['Aggregate rating'] >= 2) &(df['Aggregate rating'] < 3)
mask4 = (df['Aggregate rating'] >= 3) & (df['Aggregate rating'] < 4)
mask5 = (df['Aggregate rating'] >= 4)

df['new Rating'] = df['new Rating'].mask(mask1, 'Low')
df['new Rating'] = df['new Rating'].mask(mask2, 'Medium -')
df['new Rating'] = df['new Rating'].mask(mask3, 'Medium')
df['new Rating'] = df['new Rating'].mask(mask4, 'Medium +')
df['new Rating'] = df['new Rating'].mask(mask5, 'High')
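
The same binning can also be expressed with `pd.cut`, which some readers may find more compact than chained masks. A sketch on a handful of sample ratings (the sample values are made up; the bin edges and labels mirror the masks above):

```python
import numpy as np
import pandas as pd

# Sample ratings chosen to hit every bin, including the edges.
ratings = pd.Series([0.0, 1.0, 1.9, 2.5, 3.0, 3.9, 4.0, 4.9])

bins = [-np.inf, 1, 2, 3, 4, np.inf]
labels = ['Low', 'Medium -', 'Medium', 'Medium +', 'High']

# right=False makes each interval closed on the left: [1, 2), [2, 3), and so on.
binned = pd.cut(ratings, bins=bins, labels=labels, right=False)
print(binned.tolist())
```

The result is an ordered categorical, so plots can follow the label order without an explicit `order` argument.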

In [None]:
sns.set(rc={'figure.figsize':(18,6)})
sns.countplot(data=df,x='new Rating',order=['Low','Medium -','Medium','Medium +','High'])

We can see that most of the restaurants have an aggregate rating between 3 and 4.

In [None]:
sns.set(rc={'figure.figsize':(18,6)})
sns.scatterplot(data=df,x='Aggregate rating',y='Votes')
plt.ylim(0,1000)
plt.xlim(1,5)

We can see some correlation between the number of votes and the aggregate rating, but it is not strong.

In [None]:
sns.countplot(data=df,x='Aggregate rating',hue='Has Table booking',palette='viridis')

In [None]:
sns.countplot(data=df,x='Aggregate rating',hue='Has Online delivery',palette='viridis')

The online delivery feature appears to be more strongly related to the aggregate rating than the table booking feature. Both of them seem able to help the model, however, so we will use both.

In [None]:
df.head(2)

Now that we have decided which columns to use, we have to create a subset of our dataset.

In [None]:
new_df = df[['Has Table booking','Has Online delivery','Price range','Rating text','Votes','new cost','Aggregate rating']]
new_df.head()

We can see that some features have to be encoded before they can be passed to the machine learning algorithms (scikit-learn estimators expect numeric input, not text).

In [None]:
new_df = pd.get_dummies(new_df, columns=['Has Table booking','Has Online delivery','Price range','Rating text'])

In [None]:
new_df.head()
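
To make the effect of one-hot encoding concrete, here is a small sketch on a toy frame (the rows are invented; only the column names echo the dataset):

```python
import pandas as pd

# Toy frame with one text column, one integer-coded category, and one numeric column.
toy = pd.DataFrame({
    'Has Online delivery': ['Yes', 'No', 'Yes'],
    'Price range': [1, 3, 1],
    'Votes': [10, 200, 35],
})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=['Has Online delivery', 'Price range'])
print(encoded.columns.tolist())
print(encoded.shape)
```

Listing 'Price range' in `columns` forces it to be treated as a category even though it is stored as an integer; columns not listed, such as 'Votes', are left unchanged.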

Now our data is ready for prediction models. Let's start.

***

# Linear Regression

In [None]:
X = new_df.drop(['Aggregate rating'], axis=1)
y = new_df['Aggregate rating']

In [None]:
from sklearn import model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Linear Regression with scikit-learn; 10-fold cross-validation gives a more stable R^2 estimate
from sklearn.linear_model import LinearRegression
kfold = model_selection.KFold(n_splits=10)
lr = LinearRegression()
scoring = 'r2'
# cross_val_score fits clones of lr on each fold, so lr itself is still unfitted here
results = model_selection.cross_val_score(lr, X, y, cv=kfold, scoring=scoring)
# Fit on the training split and predict on the held-out test split
lr.fit(X_train,y_train)
lr_predictions = lr.predict(X_test)
print('Coefficients: \n', lr.coef_,'\n')
print(results)
print(results.mean())
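
`KFold` without shuffling splits the rows in their stored order, so if the rows happen to be grouped in some way, individual folds may not be representative. A hedged sketch of a shuffled variant on synthetic data (the data, coefficients, and names such as `X_demo` are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: a linear signal plus a little noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

# shuffle=True randomizes row order before splitting; random_state makes it repeatable.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring='r2')

# Report the spread as well as the mean, to show how stable the estimate is.
print(f"mean R^2: {scores.mean():.3f}, std: {scores.std():.3f}")
```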

In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, lr_predictions))
print('MSE:', metrics.mean_squared_error(y_test, lr_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, lr_predictions)))

In [None]:
from sklearn.metrics import r2_score
print("R_square score: ", r2_score(y_test,lr_predictions))
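
For readers who want to see what these metrics compute, here is a sketch that derives RMSE and R^2 by hand on four made-up values and compares them with scikit-learn's results:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only.
y_true = np.array([3.0, 4.0, 2.5, 4.5])
y_pred = np.array([2.8, 4.1, 3.0, 4.2])

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is the square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred))))
```

An R^2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the target.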

***

# Decision Trees

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(random_state = 42)
dtr.fit(X_train,y_train)
dtr_predictions = dtr.predict(X_test) 
results = model_selection.cross_val_score(dtr, X, y, cv=kfold, scoring='r2')
print(results)
print(results.sum()/10)

# R^2 Score
print("R_square score: ", r2_score(y_test,dtr_predictions))

***

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 100)
rfr.fit(X_train,y_train)
rfr_predicitions = rfr.predict(X_test) 
# Cross-validate the random forest itself (the original passed the decision tree, dtr)
results = model_selection.cross_val_score(rfr, X, y, cv=kfold, scoring='r2')
print(results)
print(results.sum()/10)

# R^2 Score
print("R_square score: ", r2_score(y_test,rfr_predicitions))

***

# Gradient Boost

In [None]:
from sklearn import ensemble
# The default loss is least squares, so the 'ls' alias (removed in newer scikit-learn releases) is not passed
clf = ensemble.GradientBoostingRegressor(n_estimators = 400, max_depth = 5, min_samples_split = 2,
          learning_rate = 0.1)
clf.fit(X_train, y_train)
clf_predicitions = clf.predict(X_test) 
# Cross-validate the gradient boosting model itself (the original passed the decision tree, dtr)
results = model_selection.cross_val_score(clf, X, y, cv=kfold, scoring='r2')
print(results)
print(results.sum()/10)
print("R_square score: ", r2_score(y_test,clf_predicitions))

***

In [None]:
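# Use a new name for the scores so the target variable y is not overwritten
scores = np.array([r2_score(y_test,lr_predictions),r2_score(y_test,dtr_predictions),r2_score(y_test,rfr_predicitions),
           r2_score(y_test,clf_predicitions)])
# Labels follow the same order as the scores above
labels = ["LinearRegression","DecisionTree","RandomForest","Gradient Boost"]
plt.bar(labels,scores)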
plt.title("Comparison of Regression Algorithms")
plt.ylabel("r2_score")
plt.show()
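
Keeping each score next to its model name in a dictionary is one way to make sure labels and values cannot drift out of order. A sketch on synthetic data (the data and the hyperparameters below are assumptions chosen for illustration, not the settings used above):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a linear term, a nonlinear term, and noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = 2 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Each model is stored under the label that will describe it in the comparison.
models = {
    'LinearRegression': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=42),
    'GradientBoost': GradientBoostingRegressor(random_state=42),
}

# Fit every model on the training split and score it on the same test split.
r2_by_model = {
    name: r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}

for name, score in r2_by_model.items():
    print(f"{name}: {score:.3f}")
```

The keys and values of `r2_by_model` could then be passed straight to `plt.bar`.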