# RESTAURANT REVENUE PREDICTION
**INTRODUCTION** : With over 1,200 quick-service restaurants globally, TFI, the company behind renowned brands like Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s, employs over 20,000 people in Europe and Asia. TFI invests significantly daily in developing new restaurant sites.

However, deciding when and where to open new restaurants is largely subjective, relying on the personal judgment and experience of development teams. This subjective data makes it challenging to accurately extrapolate across geographies and cultures.

Opening new restaurant sites requires substantial time and capital investments. When the wrong location is chosen, the site closes within 18 months, resulting in operating losses.

Developing a mathematical model to enhance the effectiveness of investments in new restaurant sites would enable TFI to allocate more resources to crucial areas like sustainability, innovation, and employee training. This competition challenges you to predict the annual restaurant sales of 100,000 regional locations using demographic, real estate, and commercial data.

https://www.kaggle.com/competitions/restaurant-revenue-prediction/overview

---
**Dataset :**
* train.csv - the training set. Use this dataset for training your model.
* test.csv - the test set. To deter manual "guess" predictions, Kaggle has supplemented the test set with additional "ignored" data. These are not counted in the scoring.
* sampleSubmission.csv - a sample submission file in the correct format





---

**Data fields**

* **Id :** Restaurant id.
* **Open Date :** opening date for a restaurant
* **City :** City that the restaurant is in. Note that the names contain Unicode characters.
* **City Group:** Type of the city. Big cities, or Other.
* **Type :** Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile
* **P1, P2 - P37:** There are three categories of these obfuscated data. Demographic data are gathered from third party providers with GIS systems. These include population in any given area, age and gender distribution, development scales. Real estate data mainly relate to the m2 of the location, front facade of the location, car park availability. Commercial data mainly include the existence of points of interest including schools, banks, other QSR operators.
* **Revenue:** The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Please note that the values are transformed so they don't mean real dollar values.



In [1]:
!pip install vecstack
!pip install feature_engine

Collecting vecstack
  Downloading vecstack-0.4.0.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: vecstack
  Building wheel for vecstack (setup.py) ... done
  Created wheel for vecstack: filename=vecstack-0.4.0-py3-none-any.whl size=19861 sha256=758893eea9d90ac8fa0d398c681eb8a89a953d1696fbf2c4cd397d6e6cdae7ea
  Stored in directory: /root/.cache/pip/wheels/b8/d8/51/3cf39adf22c522b0a91dc2208db4e9de4d2d9d171683596220
Successfully built vecstack
Installing collected packages: vecstack
Successfully installed vecstack-0.4.0
Collecting feature_engine
  Downloading feature_engine-1.8.2-py2.py3-none-any.whl.metadata (9.9 kB)
Downloading feature_engine-1.8.2-py2.py3-none-any.whl (374 kB)
Installing collected packages: feature_engine
Successfully installed feature_engine-1.8.2


## Importing Libraries

In [2]:
# Data Manipulation
import pandas as pd
import numpy as np

#Data Visualization
import seaborn as sn
import matplotlib.pyplot as plt
from matplotlib import pyplot

#Data Imbalance
from imblearn.over_sampling import SMOTE

# Machine Learning
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor

from sklearn.model_selection import KFold, RepeatedKFold, GridSearchCV, cross_validate, train_test_split
from sklearn.model_selection import RandomizedSearchCV

from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder

from sklearn.svm import LinearSVR
# SGDRegressor is used in the modelling and stacking cells further down
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV, LinearRegression, SGDRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor

import feature_engine as fe
import time

#Stacking
from vecstack import stacking

import warnings
warnings.filterwarnings("ignore")



## IMPORTING DATA SETS


In [3]:
#Importing Training and Test File
train_data = pd.read_csv("/content/train.csv")

test_data = pd.read_csv("/content/test.csv")

In [4]:
print(train_data.shape)
print(test_data.shape)

(137, 43)
(100000, 42)


In [5]:
train_data.head(5)

Unnamed: 0,Id,Open Date,City,City Group,Type,P1,P2,P3,P4,P5,...,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue
0,0,07/17/1999,İstanbul,Big Cities,IL,4,5.0,4.0,4.0,2,...,3.0,5,3,4,5,5,4,3,4,5653753.0
1,1,02/14/2008,Ankara,Big Cities,FC,4,5.0,4.0,4.0,1,...,3.0,0,0,0,0,0,0,0,0,6923131.0
2,2,03/09/2013,Diyarbakır,Other,IL,2,4.0,2.0,5.0,2,...,3.0,0,0,0,0,0,0,0,0,2055379.0
3,3,02/02/2012,Tokat,Other,IL,6,4.5,6.0,6.0,4,...,7.5,25,12,10,6,18,12,12,6,2675511.0
4,4,05/09/2009,Gaziantep,Other,IL,3,4.0,3.0,4.0,2,...,3.0,5,1,3,2,3,4,3,3,4316715.0


In [6]:
test_data.head(5)

Unnamed: 0,Id,Open Date,City,City Group,Type,P1,P2,P3,P4,P5,...,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37
0,0,01/22/2011,Niğde,Other,FC,1,4.0,4.0,4.0,1,...,2.0,3.0,0,0,0,0,0,0,0,0
1,1,03/18/2011,Konya,Other,IL,3,4.0,4.0,4.0,2,...,1.0,3.0,0,0,0,0,0,0,0,0
2,2,10/30/2013,Ankara,Big Cities,FC,3,4.0,4.0,4.0,2,...,2.0,3.0,0,0,0,0,0,0,0,0
3,3,05/06/2013,Kocaeli,Other,IL,2,4.0,4.0,4.0,2,...,2.0,3.0,0,4,0,0,0,0,0,0
4,4,07/31/2013,Afyonkarahisar,Other,FC,2,4.0,4.0,4.0,1,...,5.0,3.0,0,0,0,0,0,0,0,0


## **EXPLORATORY DATA ANALYSIS**

In [7]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137 entries, 0 to 136
Data columns (total 43 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Id          137 non-null    int64  
 1   Open Date   137 non-null    object 
 2   City        137 non-null    object 
 3   City Group  137 non-null    object 
 4   Type        137 non-null    object 
 5   P1          137 non-null    int64  
 6   P2          137 non-null    float64
 7   P3          137 non-null    float64
 8   P4          137 non-null    float64
 9   P5          137 non-null    int64  
 10  P6          137 non-null    int64  
 11  P7          137 non-null    int64  
 12  P8          137 non-null    int64  
 13  P9          137 non-null    int64  
 14  P10         137 non-null    int64  
 15  P11         137 non-null    int64  
 16  P12         137 non-null    int64  
 17  P13         137 non-null    float64
 18  P14         137 non-null    int64  
 19  P15         137 non-null    i

In [8]:
train_data.describe()

Unnamed: 0,Id,P1,P2,P3,P4,P5,P6,P7,P8,P9,...,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue
count,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,...,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0,137.0
mean,68.0,4.014599,4.408759,4.317518,4.372263,2.007299,3.357664,5.423358,5.153285,5.445255,...,3.135036,2.729927,1.941606,2.525547,1.138686,2.489051,2.029197,2.211679,1.116788,4453533.0
std,39.692569,2.910391,1.5149,1.032337,1.016462,1.20962,2.134235,2.296809,1.858567,1.834793,...,1.680887,5.536647,3.512093,5.230117,1.69854,5.165093,3.436272,4.168211,1.790768,2576072.0
min,0.0,1.0,1.0,0.0,3.0,1.0,1.0,1.0,1.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1149870.0
25%,34.0,2.0,4.0,4.0,4.0,1.0,2.0,5.0,4.0,4.0,...,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2999068.0
50%,68.0,3.0,5.0,4.0,4.0,2.0,3.0,5.0,5.0,5.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3939804.0
75%,102.0,4.0,5.0,5.0,5.0,2.0,4.0,5.0,5.0,5.0,...,3.0,4.0,3.0,3.0,2.0,3.0,4.0,3.0,2.0,5166635.0
max,136.0,12.0,7.5,7.5,7.5,8.0,10.0,10.0,10.0,10.0,...,7.5,25.0,15.0,25.0,6.0,24.0,15.0,20.0,8.0,19696940.0


## NULL VALUES


In [9]:
null_counts = train_data.isnull().sum()

print("Null values in each column:")
print(null_counts)

Null values in each column:
Id            0
Open Date     0
City          0
City Group    0
Type          0
P1            0
P2            0
P3            0
P4            0
P5            0
P6            0
P7            0
P8            0
P9            0
P10           0
P11           0
P12           0
P13           0
P14           0
P15           0
P16           0
P17           0
P18           0
P19           0
P20           0
P21           0
P22           0
P23           0
P24           0
P25           0
P26           0
P27           0
P28           0
P29           0
P30           0
P31           0
P32           0
P33           0
P34           0
P35           0
P36           0
P37           0
revenue       0
dtype: int64


In [10]:
# Parsing 'Open Date' (MM/DD/YYYY) once and extracting the year and month as new columns
train_open_date = pd.to_datetime(train_data['Open Date'], format='%m/%d/%Y')
train_data['year'] = train_open_date.dt.year
train_data['month'] = train_open_date.dt.month

# Removing the 'Open Date' and 'Id' columns from the training dataset
train_data.drop(columns=['Open Date', 'Id'], inplace=True)

# Extracting the year and month for the test dataset
test_open_date = pd.to_datetime(test_data['Open Date'], format='%m/%d/%Y')
test_data['year'] = test_open_date.dt.year
test_data['month'] = test_open_date.dt.month

# Storing the 'Id' column for reference
base_id = test_data['Id']

# Dropping unnecessary columns from the test dataset
test_data.drop(columns=['Open Date', 'Id'], inplace=True)

### Dropping duplicate and correlated features

In [11]:
# Creating the set of dependent and independent variables for test and train data
X_train = train_data.drop(labels = "revenue", axis = 1)
y_train = train_data["revenue"]
X_test = test_data

In [12]:
# Removing duplicate and highly correlated features
print('Initial shape of dataset:', X_train.shape)

# Importing necessary modules from feature_engine
from feature_engine.selection import DropDuplicateFeatures, DropCorrelatedFeatures

# Dropping duplicate features
dup_remover = DropDuplicateFeatures()
dup_remover.fit(X_train)
X_train = dup_remover.transform(X_train)

# Dropping correlated features based on Pearson correlation
corr_remover = DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.8)
X_train = corr_remover.fit_transform(X_train)

print('Shape after removing duplicates and correlations:', X_train.shape)

Initial shape of dataset: (137, 42)
Shape after removing duplicates and correlations: (137, 19)
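
The two selectors above reduce the feature set from 42 to 19 columns. As a rough illustration of the correlation filter (a sketch of the idea, not feature_engine's actual implementation), the code below keeps a numeric column only if its absolute Pearson correlation with every previously kept column is at or below the threshold. The helper name `drop_correlated` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Hypothetical helper: keep the first column of each highly correlated group."""
    corr = df.corr(method="pearson").abs()
    kept = []
    for col in corr.columns:
        # Keep the column only if it is not strongly correlated with any column kept so far
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return df[kept]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                            # independent of "a"
})
reduced = drop_correlated(toy)
print(list(reduced.columns))
```

Which member of a correlated group survives depends on column order, so the 19 retained columns are one of several equally valid subsets.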


## **HANDLING CATEGORICAL FEATURES - ONEHOTENCODING**

In [13]:
# Applying One-Hot Encoding to categorical columns in the training and test datasets

# Defining the categorical columns to encode
categorical_features = ['City', 'City Group', 'Type']

# Initializing the OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Transforming categorical features in the training dataset
train_encoded = pd.DataFrame(
    encoder.fit_transform(X_train[categorical_features]),
    columns=encoder.get_feature_names_out(categorical_features)
)

# Transforming categorical features in the test dataset
test_encoded = pd.DataFrame(
    encoder.transform(X_test[categorical_features]),
    columns=encoder.get_feature_names_out(categorical_features)
)

# Merging the encoded features back into the original datasets
X_train = pd.concat([X_train.reset_index(drop=True), train_encoded.reset_index(drop=True)], axis=1)
X_test = pd.concat([X_test.reset_index(drop=True), test_encoded.reset_index(drop=True)], axis=1)

# Dropping the original categorical columns from both datasets
X_train.drop(columns=categorical_features, inplace=True)
X_test.drop(columns=categorical_features, inplace=True)

# Display the first few rows of the transformed training dataset
X_train.head()

Unnamed: 0,P1,P3,P4,P5,P6,P10,P11,P14,P21,P22,...,City_Trabzon,City_Uşak,City_İstanbul,City_İzmir,City_Şanlıurfa,City Group_Big Cities,City Group_Other,Type_DT,Type_FC,Type_IL
0,4,4.0,4.0,2,2,5,3,1,1,3,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,4,4.0,4.0,1,2,5,1,0,1,3,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,2,2.0,5.0,2,3,5,2,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,6,6.0,6.0,4,4,10,8,6,6,1,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,3,3.0,4.0,2,2,5,2,2,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


In [14]:
X_test.head()

Unnamed: 0,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,...,City_Trabzon,City_Uşak,City_İstanbul,City_İzmir,City_Şanlıurfa,City Group_Big Cities,City Group_Other,Type_DT,Type_FC,Type_IL
0,1,4.0,4.0,4.0,1,2,5,4,5,5,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,3,4.0,4.0,4.0,2,2,5,3,4,4,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,3,4.0,4.0,4.0,2,2,5,4,4,5,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,2,4.0,4.0,4.0,2,3,5,4,5,4,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,2,4.0,4.0,4.0,1,2,5,4,5,4,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [15]:
#Ensuring that the training and test sets have identical features
_common = []
for i in X_train.columns:
    if i in X_test.columns:
        _common.append(i)
len(_common)

X_train = X_train[_common]
X_test = X_test[_common]

print(X_train.shape)
print(X_test.shape)

(137, 55)
(100000, 55)


In [16]:
sc = StandardScaler()
X_scaled_train = sc.fit_transform(X_train)
X_scaled_train = pd.DataFrame(data = X_scaled_train, columns = X_train.columns)

In [17]:
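# Reuse the scaler fitted on the training set; refitting on the test set would scale the two sets differently
X_scaled_test = sc.transform(X_test)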
X_scaled_test = pd.DataFrame(data = X_scaled_test, columns = X_test.columns)
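
Scaling statistics should come from the training set only, so that a given raw value maps to the same scaled value in both sets. A small sketch with toy numbers (not the competition data) shows the difference between reusing the fitted scaler and refitting it on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0]])   # mean 2.0
toy_test = np.array([[2.0], [4.0]])           # mean 3.0

scaler = StandardScaler().fit(toy_train)

# Reusing the training statistics: the raw value 2.0 maps to 0.0, as it does in the training set
consistent = scaler.transform(toy_test)

# Refitting on the test set: the same raw value 2.0 now maps to -1.0
refit = StandardScaler().fit_transform(toy_test)

print(consistent.ravel())
print(refit.ravel())
```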

## **MODEL DEVELOPMENT**

---



1. Support Vector Regressor
2. Decision Tree Regressor
3. RandomForest Regressor
4. MLP Regressor
5. Gradient Descent Regressor

### **Support Vector Regressor**

In [18]:
from sklearn.svm import LinearSVR
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import pandas as pd

# Initialize the LinearSVR model
regressor = LinearSVR()

# Expanded Hyperparameter Search Space
parameters = {
    'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'],
    'epsilon': [0.1, 0.2, 0.5],  # Adding epsilon parameter
    'C': [0.1, 1.0, 10.0],  # Regularization parameter
    'max_iter': range(500, 2000, 100)  # Expanded max_iter range
}

# Using Randomized Search
randomsearch = RandomizedSearchCV(
    estimator=regressor,
    param_distributions=parameters,
    cv=5,
    n_iter=20,  # Increased number of iterations
    random_state=42,
    scoring='neg_mean_squared_error'  # Use scoring metric specific to regression
)
randomsearch.fit(X_scaled_train, y_train)

# Get the best parameters and score
best_params_random = randomsearch.best_params_
print('The Best Parameters from Randomized Search are:', best_params_random)
print('\nThe Best Score from Randomized Search is:', -randomsearch.best_score_)  # Sign flipped to report the MSE

# Fine-tuning with Grid Search around the best parameters from Random Search
param_grid = {
    'loss': [best_params_random['loss']],
    'epsilon': [best_params_random['epsilon'] - 0.1, best_params_random['epsilon'], best_params_random['epsilon'] + 0.1],
    'C': [best_params_random['C'] * 0.1, best_params_random['C'], best_params_random['C'] * 10],
    'max_iter': [best_params_random['max_iter'] - 100, best_params_random['max_iter'], best_params_random['max_iter'] + 100]
}

gridsearch = GridSearchCV(
    estimator=LinearSVR(),
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error'
)
gridsearch.fit(X_scaled_train, y_train)

best_params_grid = gridsearch.best_params_
print('The Best Parameters from Grid Search are:', best_params_grid)
print('\nThe Best Score from Grid Search is:', -gridsearch.best_score_)  # Sign flipped to report the MSE

# Train the optimized regressor with the best parameters
regressor_optimized = LinearSVR(**best_params_grid)
regressor_optimized.fit(X_scaled_train, y_train)

# Predict and store output
predictions = regressor_optimized.predict(X_scaled_test)
output = pd.DataFrame(predictions, columns=['Prediction'])
print(output.head())

The Best Parameters from Randomized Search are: {'max_iter': 500, 'loss': 'squared_epsilon_insensitive', 'epsilon': 0.2, 'C': 0.1}

The Best Score from Randomized Search is: 11558108718045.969
The Best Parameters from Grid Search are: {'C': 0.1, 'epsilon': 0.1, 'loss': 'squared_epsilon_insensitive', 'max_iter': 400}

The Best Score from Grid Search is: 11558108575189.074
     Prediction
0  4.441332e+06
1  2.640104e+06
2  3.038449e+06
3  1.935742e+06
4  7.155307e+06
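
The search scores above are mean squared errors, which are hard to read at this scale. Taking the square root puts them back in revenue units. The sketch below does this for the reported Randomized Search score and compares it with the revenue standard deviation from `train_data.describe()`. The comparison is only a rough sanity check, since the two numbers come from different data splits.

```python
import math

cv_mse = 11558108718045.969      # best Randomized Search score reported above
revenue_std = 2576072.0          # std of `revenue` from train_data.describe()

cv_rmse = math.sqrt(cv_mse)
print(f"CV RMSE: {cv_rmse:,.0f}")
print(f"RMSE / revenue std: {cv_rmse / revenue_std:.2f}")
```

An RMSE of roughly 3.4 million is larger than the revenue standard deviation of about 2.58 million, which suggests that this model does no better on held-out folds than always predicting the mean revenue.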


In [19]:
# Export csv predictions
sample=pd.read_csv('/content/sampleSubmission.csv')
output=pd.concat([sample['Id'],output],axis=1)
output.to_csv('Support vector base model.csv', index = None)


### **Decision Tree Regressor**

In [20]:
# Decision Tree Model
dtr=DecisionTreeRegressor()

# Hyper Parameter Tuning
# ('entropy' is a classification criterion and is not valid for DecisionTreeRegressor)
dtr_params={'criterion':['squared_error','friedman_mse','absolute_error','poisson'],
        'splitter':['best','random'],
       'max_depth':range(1,20,1),
       'max_leaf_nodes': range(5,50,5)}

# With no scoring argument, the search uses the estimator's default score (R^2 for regressors)
rscv=RandomizedSearchCV(dtr,dtr_params)
rscv.fit(X_train,y_train)
dtr_best_parameters=rscv.best_params_
print('The Best Parameters for Decision Tree Regressor are : ',rscv.best_params_)
print('\nThe Best Score for Decision Tree Regressor is : ',rscv.best_score_)

dtr=DecisionTreeRegressor(**dtr_best_parameters)
dtr.fit(X_train,y_train)
DTR_Prediction=dtr.predict(X_test)


The Best Parameters for Decision Tree Regressor are :  {'splitter': 'best', 'max_leaf_nodes': 10, 'max_depth': 9, 'criterion': 'absolute_error'}

The Best Score for Decision Tree Regressor is :  -0.11615005030613865
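
With no `scoring` argument, `RandomizedSearchCV` falls back to the estimator's own `score` method, which for scikit-learn regressors is R². A negative value therefore means the tuned tree does worse on held-out folds than a constant prediction of the mean. A toy illustration with a hand-written R² (the helper `r2` is defined here only for illustration):

```python
import numpy as np

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([2.0, 4.0, 6.0, 8.0])

mean_baseline = np.full_like(y, y.mean())   # always predict the mean
reversed_pred = y[::-1]                     # a deliberately bad prediction

print(r2(y, mean_baseline))   # the mean baseline scores exactly zero
print(r2(y, reversed_pred))   # a worse-than-baseline prediction scores below zero
```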


In [21]:
# Export csv predictions

sample=pd.read_csv('/content/sampleSubmission.csv')
output=pd.concat([sample['Id'],pd.DataFrame(DTR_Prediction,columns=['Prediction'])],axis=1)
output.to_csv('Decision tree model-base model.csv', index = None)

### **RandomForest Regressor**

In [22]:
# Random Forest Model
rfc = RandomForestRegressor()

# Doing HyperParameter Tuning & K-fold Cross Validation on RandomForestRegressor
parameters={'min_samples_leaf' : range(10,100,10),
               'max_depth': range(1,10,2),
               'max_features':[10,20,30,40,50],
               'n_estimators':[20,30,40]}

# Using Randomized Search
rf_random = RandomizedSearchCV(rfc,parameters,n_iter=25,cv=5)
rf_random.fit(X_train, y_train)
grid_parm=rf_random.best_params_
print('The Best Parameters for Random Forest Regressor are : ',grid_parm)
print('\nThe Best Score for Random Forest Regressor is : ',rf_random.best_score_)

# Predict and Store Output
rffinal = RandomForestRegressor(**grid_parm)
rffinal.fit(X_train, y_train)
rfc_predict=rffinal.predict(X_test)

The Best Parameters for Random Forest Regressor are :  {'n_estimators': 20, 'min_samples_leaf': 20, 'max_features': 40, 'max_depth': 7}

The Best Score for Random Forest Regressor is :  0.053342703882390374


In [23]:
# Export csv predictions

sample=pd.read_csv('/content/sampleSubmission.csv')
output=pd.concat([sample['Id'],pd.DataFrame(rfc_predict,columns=['Prediction'])],axis=1)
output.to_csv('Random forest regressor-base model.csv', index = None)

### **MLP Regressor**

In [45]:
# Multi-Layer Perceptron Model
mlp = MLPRegressor(random_state=1, max_iter=600)

#HyperParameter Tuning
parameters = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

# Using Randomized Search
mlp_random = RandomizedSearchCV(mlp,parameters,n_iter=15,cv=5)
mlp_random.fit(X_train, y_train)
mlp_best = mlp_random.best_params_

print('The Best Parameters for MLP Regressor are : ',mlp_best)
print('\nThe Best Score for MLP Regressor is : ',mlp_random.best_score_)

# Predict and Store Output
# Keep random_state and max_iter from the tuned estimator so the final model matches the search
mlpfinal = MLPRegressor(**mlp_best, random_state=1, max_iter=600)
mlpfinal.fit(X_train, y_train)

output = pd.DataFrame(mlpfinal.predict(X_test), columns = ['Prediction'])

The Best Parameters for MLP Regressor are :  {'solver': 'adam', 'learning_rate': 'constant', 'hidden_layer_sizes': (50, 50, 50), 'alpha': 0.0001, 'activation': 'relu'}

The Best Score for MLP Regressor is :  -0.01706134529288932


In [46]:
sample=pd.read_csv('/content/sampleSubmission.csv')
output=pd.concat([sample['Id'],output],axis=1)
output.to_csv('MultilayerP-base model.csv', index = None)

### **Gradient Descent Regressor**

In [39]:
# Stochastic Gradient Descent Model
sgd = SGDRegressor(random_state=42)

# Hyperparameter Tuning
sgd_params = {
    'loss': ['squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
    'eta0': [0.001, 0.01, 0.1],
    'max_iter': [500, 1000, 2000],
    'tol': [1e-3, 1e-4, 1e-5]
}

# Using Randomized Search CV
rscv = RandomizedSearchCV(estimator=sgd, param_distributions=sgd_params, cv=5, n_iter=20, scoring='r2', random_state=42)
rscv.fit(X_train, y_train)

sgd_best_parameters = rscv.best_params_
print('The Best Parameters for SGDRegressor are:', sgd_best_parameters)
print('\nThe Best Score for SGDRegressor is:', rscv.best_score_)

# Predict and Store Output
sgd_optimized = SGDRegressor(**sgd_best_parameters, random_state=42)
sgd_optimized.fit(X_train, y_train)

SGD_Prediction = sgd_optimized.predict(X_test)

The Best Parameters for SGDRegressor are: {'tol': 1e-05, 'penalty': 'l2', 'max_iter': 1000, 'loss': 'epsilon_insensitive', 'learning_rate': 'adaptive', 'eta0': 0.001, 'alpha': 0.001}

The Best Score for SGDRegressor is: -0.055825820622041486
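
SGD-based and neural models are sensitive to feature scale, and this cell fits on the unscaled `X_train`. One possible way to combine scaling with the search, sketched here on synthetic data rather than the competition files, is to wrap the scaler and the regressor in a `Pipeline` so that the scaler is refit inside each cross-validation fold. The step names `scale` and `sgd` and the small parameter grid are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the competition data)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=20.0, size=(120, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=120)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDRegressor(random_state=42, max_iter=2000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_distributions = {
    "sgd__alpha": [0.0001, 0.001, 0.01],
    "sgd__penalty": ["l2", "l1", "elasticnet"],
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5, cv=5, scoring="r2", random_state=42
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

Because the scaler lives inside the pipeline, calling `search.predict` on new rows applies the training-set scaling automatically.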


In [37]:
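# Export csv predictions
sample=pd.read_csv('/content/sampleSubmission.csv')
output=pd.concat([sample['Id'],pd.DataFrame(SGD_Prediction,columns=['Prediction'])],axis=1)
output.to_csv('Gradient descent regressor-base model.csv', index = None)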

# **STACKING MODEL**

In [28]:
# Building the level 0 (base) models in stacking
models = [SGDRegressor(), RandomForestRegressor(), DecisionTreeRegressor(), MLPRegressor(), LinearSVR()]

S_train, S_test = stacking(models,                     # list of models
                           X_train, y_train, X_test,   # data
                           regression=True,            # regression task
                           n_folds=4,                  # number of folds
                           stratified=True,            # ignored by vecstack when regression=True
                           shuffle=True,               # shuffle the data
                           verbose=2)                  # print all info

task:         [regression]
metric:       [mean_absolute_error]
mode:         [oof_pred_bag]
n_models:     [5]

model  0:     [SGDRegressor]
    fold  0:  [1093250292344813.50000000]
    fold  1:  [5184015880948759.00000000]
    fold  2:  [495273993767114.25000000]
    fold  3:  [914185818243529.75000000]
    ----
    MEAN:     [1921681496326054.00000000] + [1895970055244333.50000000]
    FULL:     [1915634553231300.75000000]

model  1:     [RandomForestRegressor]
    fold  0:  [1584614.96085714]
    fold  1:  [1497136.05794118]
    fold  2:  [2313733.27941176]
    fold  3:  [1905423.83205882]
    ----
    MEAN:     [1825227.03256723] + [320393.40526083]
    FULL:     [1823470.74007299]

model  2:     [DecisionTreeRegressor]
    fold  0:  [1925451.20000000]
    fold  1:  [2444289.73529412]
    fold  2:  [2874868.61764706]
    fold  3:  [2261644.44117647]
    ----
    MEAN:     [2376563.49852941] + [342638.12197147]
    FULL:     [2373270.70802920]

model  3:     [MLPRegressor]
    fold 
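
The log above reports `mode: [oof_pred_bag]`. According to vecstack's documentation, in this mode each column of `S_train` holds one base model's out-of-fold predictions, and the matching column of `S_test` is the average of that model's test predictions across the folds. The sketch below reproduces that idea by hand for a single model on synthetic data. It illustrates the technique and is not vecstack's code; the function name `oof_pred_bag` is borrowed only as a label.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def oof_pred_bag(model, X_tr, y_tr, X_te, n_folds=4, seed=0):
    """Return (out-of-fold train predictions, fold-averaged test predictions)."""
    oof = np.zeros(len(X_tr))
    test_preds = np.zeros((n_folds, len(X_te)))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for k, (fit_idx, hold_idx) in enumerate(folds.split(X_tr)):
        fold_model = clone(model).fit(X_tr[fit_idx], y_tr[fit_idx])
        # Each training row is predicted by a model that never saw it
        oof[hold_idx] = fold_model.predict(X_tr[hold_idx])
        test_preds[k] = fold_model.predict(X_te)
    return oof, test_preds.mean(axis=0)

# Synthetic data (not the competition data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=80)
X_new = rng.normal(size=(10, 3))

s_train_col, s_test_col = oof_pred_bag(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo, X_new
)
print(s_train_col.shape, s_test_col.shape)
```

With five base models, stacking one such column per model gives the meta-model a five-column training matrix.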

### Hyperparameter tuning with cross-validation (Support Vector Regressor as Meta Model)

In [29]:
# Stacked SVR Model
regressornew = LinearSVR()

# Hyper Parameter Tuning
parameters = {'loss':['epsilon_insensitive', 'squared_epsilon_insensitive'],
              'max_iter':range(600,1500,100)}
# Using Randomized Cross Validation
randomsearch = RandomizedSearchCV(regressornew, parameters)
randomsearch.fit(S_train, y_train)
best = randomsearch.best_params_
print('The Best Parameters for the stacked Support Vector Regressor are : ',best)
print('\nThe Best Score for the stacked Support Vector Regressor is : ',randomsearch.best_score_)

#Using the best parameters
regressornew = LinearSVR(**best)
regressornew.fit(S_train, y_train)
output = pd.DataFrame(regressornew.predict(S_test), columns = ['Prediction'])

The Best Parameters for the stacked Support Vector Regressor are :  {'max_iter': 600, 'loss': 'squared_epsilon_insensitive'}

The Best Score for the stacked Support Vector Regressor is :  -2.049364893483245


In [30]:
# Export csv predictions
sample=pd.read_csv('/content/sampleSubmission.csv')
output=pd.concat([sample['Id'],pd.DataFrame(output,columns=['Prediction'])],axis=1)
output.to_csv('StackedSVR-model.csv', index = None)

###  Hyperparameter tuning with cross-validation (Decision Tree Regression as Meta Model)

In [31]:
# Stacked Decision Tree Regressor Model
dtr_stacked=DecisionTreeRegressor()

# Hyper Parameter Tuning
# ('entropy' is a classification criterion and is not valid for DecisionTreeRegressor)
dtr_params={'criterion':['squared_error','friedman_mse','absolute_error','poisson'],
        'splitter':['best','random'],
       'max_depth':range(3,30,1),
       'max_leaf_nodes': range(5,50,5)}

# Using Randomized Cross Validation on the new estimator (not the earlier base-model `dtr`)
rscv=RandomizedSearchCV(dtr_stacked,dtr_params)
rscv.fit(S_train,y_train)
dtr_best_parameters=rscv.best_params_
print('The Best Parameters for Decision Tree Regressor are : ',rscv.best_params_)
print('\nThe Best Score for Decision Tree Regressor is : ',rscv.best_score_)

#Using the best parameters
dtr_stacked=DecisionTreeRegressor(**dtr_best_parameters)
dtr_stacked.fit(S_train,y_train)
DTR_Prediction_Stacked=dtr_stacked.predict(S_test)


The Best Parameters for Decision Tree Regressor are :  {'splitter': 'best', 'max_leaf_nodes': 5, 'max_depth': 21, 'criterion': 'squared_error'}

The Best Score for Decision Tree Regressor is :  -0.30219419032743755


In [32]:
#Export csv predictions
sample=pd.read_csv('/content/sampleSubmission.csv')
output=pd.concat([sample['Id'],pd.DataFrame(DTR_Prediction_Stacked,columns=['Prediction'])],axis=1)
output.to_csv('StackedDTR-model.csv', index = None)

###  Hyperparameter tuning with cross-validation (RandomForest Regression as Meta Model)

In [33]:
# Stacked RandomForest Regressor
model = RandomForestRegressor()

#Hyperparameter Tuning & K-fold Cross Validation
parameters={'min_samples_leaf' : range(10,100,10),
               'max_depth': range(1,10,2),
               'max_features':range(3,19,1),
               'n_estimators':[20,30,40]}


# Using Randomized Cross Validation
rf_random = RandomizedSearchCV(model,parameters,n_iter=25,cv=5)
rf_random.fit(S_train, y_train)
grid_parm=rf_random.best_params_
print('The Best Parameters for Random Forest Regressor are : ',grid_parm)
print('\nThe Best Score for Random Forest Regressor is : ',rf_random.best_score_)

#Using the best parameters
rffinal = RandomForestRegressor(**grid_parm)
rffinal.fit(S_train, y_train)

output=rffinal.predict(S_test)

The Best Parameters for Random Forest Regressor are :  {'n_estimators': 20, 'min_samples_leaf': 20, 'max_features': 7, 'max_depth': 7}

The Best Score for Random Forest Regressor is :  0.04828553469807728


In [34]:
#Export csv predictions
sample=pd.read_csv('/content/sampleSubmission.csv')
output=pd.concat([sample['Id'],pd.DataFrame(output,columns=['Prediction'])],axis=1)
output.to_csv('StackedRFR-model.csv', index = None)
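
The notebook imports scikit-learn's `StackingRegressor` but does not use it. For comparison, the sketch below shows how a similar two-level model could be expressed with it on synthetic data. This is an alternative to the vecstack workflow above, not a reproduction of it: the base models, the `Ridge` meta-model, and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (not the competition data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
        ("dt", DecisionTreeRegressor(max_depth=4, random_state=0)),
    ],
    final_estimator=Ridge(),   # meta-model trained on cross-validated base predictions
    cv=4,
)
stack.fit(X_demo, y_demo)
demo_predictions = stack.predict(X_demo[:5])
print(demo_predictions)
```

Keeping both levels in one estimator means the whole stack can be passed to `RandomizedSearchCV` or cross-validated as a single model.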



---

