# **RANDOM FOREST REGRESSOR MODEL** on the restaurant revenue dataset using scikit-learn

<a href="https://colab.research.google.com/github/dphi-official/Micro-Courses/blob/master/Supervised_Algorithms_Regression/Exercises/RandomForestRegressor_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background
With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees.

# Objective
You head the growing data science team of TFI in Istanbul or Shanghai. Using demographic, real estate, and commercial data, predict the annual restaurant sales of 100,000 regional locations, i.e., estimate revenue using a Random Forest Regressor.

The data is taken from [Kaggle](https://www.kaggle.com/c/restaurant-revenue-prediction/data).

## Tasks
1. Load the data from https://raw.githubusercontent.com/dphi-official/Datasets/master/restaurant_revenue.csv
2. The data involves categorical features as well as dates, so some preprocessing is required.
3. Build a Random Forest Regressor on the data and train it.
4. Evaluate the model using RMSE and R2 score.

In [1]:
import pandas as pd

1. Load the data from https://raw.githubusercontent.com/dphi-official/Datasets/master/restaurant_revenue.csv

In [2]:
full_data=pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/restaurant_revenue.csv')

In [None]:
full_data.shape

(137, 43)

In [None]:
full_data.head()

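| Unnamed: 0 | Id | Open Date | City | City Group | Type | P1 | P2 | P3 | P4 | P5 | ... | P29 | P30 | P31 | P32 | P33 | P34 | P35 | P36 | P37 | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 07/17/1999 | İstanbul | Big Cities | IL | 4 | 5.0 | 4.0 | 4.0 | 2 | ... | 3.0 | 5 | 3 | 4 | 5 | 5 | 4 | 3 | 4 | 5653753.0 |
| 1 | 1 | 02/14/2008 | Ankara | Big Cities | FC | 4 | 5.0 | 4.0 | 4.0 | 1 | ... | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6923131.0 |
| 2 | 2 | 03/09/2013 | Diyarbakır | Other | IL | 2 | 4.0 | 2.0 | 5.0 | 2 | ... | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2055379.0 |
| 3 | 3 | 02/02/2012 | Tokat | Other | IL | 6 | 4.5 | 6.0 | 6.0 | 4 | ... | 7.5 | 25 | 12 | 10 | 6 | 18 | 12 | 12 | 6 | 2675511.0 |
| 4 | 4 | 05/09/2009 | Gaziantep | Other | IL | 3 | 4.0 | 3.0 | 4.0 | 2 | ... | 3.0 | 5 | 1 | 3 | 2 | 3 | 4 | 3 | 3 | 4316715.0 |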


2. The data involves categorical features as well as dates, so some preprocessing is required.

In [None]:
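from sklearn import preprocessing

# Keep only the numeric columns; this drops 'Open Date', 'City', 'City Group', and 'Type'.
numeric_data = full_data.select_dtypes(include=['float64', 'int64'])

# Standardize the numeric columns. Note: scaled_data is not used by the model below,
# and tree-based models such as random forests do not require feature scaling.
scaler = preprocessing.StandardScaler().fit(numeric_data)
scaled_data = scaler.transform(numeric_data)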
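
The cell above keeps only the numeric columns, so `Open Date`, `City`, `City Group`, and `Type` never reach the model. As a minimal sketch of one way to bring dates and categories in, the example below works on a tiny hand-typed sample of the rows shown by `full_data.head()`. The reference date `2015-01-01` and the derived column names (`open_year`, `open_month`, `days_open`) are assumptions for illustration, not part of the original exercise.

```python
import pandas as pd

# Tiny sample mirroring a few columns of the rows shown by full_data.head().
sample = pd.DataFrame({
    "Open Date": ["07/17/1999", "02/14/2008", "03/09/2013"],
    "City Group": ["Big Cities", "Big Cities", "Other"],
    "Type": ["IL", "FC", "IL"],
    "P1": [4, 4, 2],
})

# Parse the month/day/year strings into datetimes.
opened = pd.to_datetime(sample["Open Date"], format="%m/%d/%Y")

# Assumption: an arbitrary fixed reference date, used only to turn
# each opening date into an age in days.
reference_date = pd.Timestamp("2015-01-01")

features = sample.drop(columns=["Open Date"])
features["open_year"] = opened.dt.year
features["open_month"] = opened.dt.month
features["days_open"] = (reference_date - opened).dt.days

# One-hot encode the low-cardinality categorical columns.
features = pd.get_dummies(features, columns=["City Group", "Type"])
print(features.columns.tolist())
```

`City` is left out of this sketch because it appears to take many distinct values (five different cities in the five rows shown), so one-hot encoding it on 137 rows would add many sparse columns.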

In [None]:
numeric_data

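| Unnamed: 0 | Id | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | ... | P29 | P30 | P31 | P32 | P33 | P34 | P35 | P36 | P37 | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 5.0 | 4.0 | 4.0 | 2 | 2 | 5 | 4 | 5 | ... | 3.0 | 5 | 3 | 4 | 5 | 5 | 4 | 3 | 4 | 5653753.0 |
| 1 | 1 | 4 | 5.0 | 4.0 | 4.0 | 1 | 2 | 5 | 5 | 5 | ... | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6923131.0 |
| 2 | 2 | 2 | 4.0 | 2.0 | 5.0 | 2 | 3 | 5 | 5 | 5 | ... | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2055379.0 |
| 3 | 3 | 6 | 4.5 | 6.0 | 6.0 | 4 | 4 | 10 | 8 | 10 | ... | 7.5 | 25 | 12 | 10 | 6 | 18 | 12 | 12 | 6 | 2675511.0 |
| 4 | 4 | 3 | 4.0 | 3.0 | 4.0 | 2 | 2 | 5 | 5 | 5 | ... | 3.0 | 5 | 1 | 3 | 2 | 3 | 4 | 3 | 3 | 4316715.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 132 | 132 | 2 | 3.0 | 3.0 | 5.0 | 4 | 2 | 4 | 4 | 4 | ... | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5787594.0 |
| 133 | 133 | 4 | 5.0 | 4.0 | 4.0 | 2 | 3 | 5 | 4 | 4 | ... | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9262754.0 |
| 134 | 134 | 3 | 4.0 | 4.0 | 4.0 | 2 | 3 | 5 | 5 | 5 | ... | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2544857.0 |
| 135 | 135 | 4 | 5.0 | 4.0 | 5.0 | 2 | 2 | 5 | 5 | 5 | ... | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7217634.0 |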


In [None]:
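# Features are every numeric column except the target.
# Note: x still includes 'Unnamed: 0' and 'Id', which look like row identifiers
# rather than predictive features.
x = numeric_data.drop('revenue', axis=1)
y = numeric_data['revenue']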

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=101)
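
To make the effect of `test_size=.20` and `random_state=101` concrete, here is a small sketch that splits a placeholder array with the same number of rows as reported by `full_data.shape`. The array `rows` is a stand-in assumed for illustration; it is not the restaurant data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 137 rows reported by full_data.shape (values are placeholders).
rows = np.arange(137)

# Two splits with the same random_state select the same rows.
train_a, test_a = train_test_split(rows, test_size=0.20, random_state=101)
train_b, test_b = train_test_split(rows, test_size=0.20, random_state=101)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```

With 137 rows this leaves 109 rows for training and 28 for testing, so any metric computed on the test set rests on a small sample and can vary noticeably from one split to another.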

3. Build a Random Forest Regressor on the data and train it

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(x_train, y_train)
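
`RandomForestRegressor()` is created above without a `random_state`, so each run can build a slightly different forest and report different scores. The sketch below, on synthetic stand-in data (an assumption, not the restaurant dataset), shows that fixing `random_state` makes two fits agree and that `feature_importances_` can be used to see which inputs the forest relies on. The values `n_estimators=50` and `random_state=101` are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 100 rows, 5 features,
# with the target driven almost entirely by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Fixing random_state makes bootstrap sampling and split selection repeatable.
model_a = RandomForestRegressor(n_estimators=50, random_state=101).fit(X, y)
model_b = RandomForestRegressor(n_estimators=50, random_state=101).fit(X, y)

same_predictions = np.allclose(model_a.predict(X), model_b.predict(X))
most_important = int(np.argmax(model_a.feature_importances_))
print(same_predictions, most_important)
```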

4. Evaluate the model using RMSE and R2 Score

In [None]:
from sklearn import metrics
y_pred = rfr.predict(x_test)
mse = metrics.mean_squared_error(y_test, y_pred)
print("Mean Squared Error :", mse)

Mean Squared Error : 11352694031031.078
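
The task asks for RMSE, while the cell above reports MSE. RMSE is the square root of MSE, which expresses the error in the same units as `revenue`. A minimal sketch, reusing the MSE value printed above:

```python
import math

# MSE value copied from the notebook output above.
mse = 11352694031031.078

# RMSE is the square root of MSE, so it is in the same units as revenue.
rmse = math.sqrt(mse)
print("Root Mean Squared Error :", rmse)
```

This comes to roughly 3.37 million, which is large relative to the revenue values shown in the first rows of the data (about 2 to 7 million).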


In [None]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R2 score :", r2)

R2 score : 0.06344447039624224
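
To help read this score, the sketch below computes R² by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r2_by_hand` and the toy values are hypothetical and only illustrate the formula; they are not taken from the dataset.

```python
def r2_by_hand(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy target values (illustrative only).
y_true = [2.0, 4.0, 6.0, 8.0]

# Perfect predictions give R2 = 1.
perfect = r2_by_hand(y_true, y_true)

# Always predicting the mean of y_true gives R2 = 0.
baseline = r2_by_hand(y_true, [5.0] * len(y_true))

print(perfect, baseline)
```

On this scale, an R² of about 0.06 means the model explains only around 6% of the variance in test-set revenue, which is barely better than always predicting the mean.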
