**References:**<br>

[sklearn Supervised Learning](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)<br>
[Statsmodels](http://www.statsmodels.org/stable/index.html)<br>
[Lin Reg Evaluation (statsmodels)](https://statinfer.com/204-1-7-adjusted-r-squared-in-python/)<br>
[Regression Evaluation Metrics (sklearn)](https://scikit-learn.org/stable/modules/classes.html#regression-metrics)

## Importing Libraries and Cleaned Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Importing the Cleaned Dataset from previous steps
df = pd.read_csv('Cleaned.csv')

In [3]:
df.shape

(14204, 14)

In [4]:
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Type_Combined,data
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,14,Medium,Tier 1,Supermarket Type1,3735.138,FD,1
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,4,Medium,Tier 3,Supermarket Type2,443.4228,DR,1


In [5]:
# Keep Item_Identifier & Outlet_Identifier as separate Series (split by the 'data' flag)
# so that they can be used later to build the output file.
Item_Identifier_train = df.loc[df.loc[:,'data']==1,'Item_Identifier']
Outlet_Identifier_train = df.loc[df.loc[:,'data']==1,'Outlet_Identifier']

Item_Identifier_test = df.loc[df.loc[:,'data']==0,'Item_Identifier']
Outlet_Identifier_test = df.loc[df.loc[:,'data']==0,'Outlet_Identifier']
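
The split above relies on the `data` flag (1 for rows from the original train file, 0 for rows from the test file). A minimal sketch of the same idea on a toy frame, assuming made-up identifiers and a reusable boolean mask; the names `toy`, `is_train`, `ids_train` and `ids_test` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the combined dataset; the values are illustrative only.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07"],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT049", "OUT010"],
    "data": [1, 1, 0, 0],  # 1 = row came from train, 0 = row came from test
})

# Build the boolean mask once and reuse it for both selections.
is_train = toy["data"] == 1
ids_train = toy.loc[is_train, ["Item_Identifier", "Outlet_Identifier"]]
ids_test = toy.loc[~is_train, ["Item_Identifier", "Outlet_Identifier"]]

print(ids_train.shape, ids_test.shape)
```

Keeping both identifier columns in one small frame per split is only a convenience; separate Series, as in the cell above, work equally well.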

In [6]:
# Now let's drop 'Item_Identifier' & 'Item_Type' from the dataframe.
# 'Item_Type' is dropped because 'Item_Type_Combined' was created to replace it.
# 'Outlet_Identifier' is kept because it is one-hot encoded below.

df.drop(['Item_Identifier','Item_Type'],axis=1,inplace=True)

In [7]:
df.head(2)

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Type_Combined,data
0,9.3,Low Fat,0.016047,249.8092,OUT049,14,Medium,Tier 1,Supermarket Type1,3735.138,FD,1
1,5.92,Regular,0.019278,48.2692,OUT018,4,Medium,Tier 3,Supermarket Type2,443.4228,DR,1


---
---

## Encoding Categorical Variables
1. **Item_Fat_Content**
2. **Item_Type** [We'll not use this since we've already created Item_Type_Combined]
3. **Outlet_Identifier**
4. **Outlet_Size** (High/Medium/Small) => Treat missing values using *Outlet_Identifier*, the number of items in that store, and *Outlet_Type*.
5. **Outlet_Location_Type**
6. **Outlet_Type**
7. **Item_Type_Combined**

In [8]:
df1 = pd.get_dummies(df, columns =['Item_Fat_Content',
                                   'Item_Type_Combined',
                                   'Outlet_Location_Type',
                                   'Outlet_Size',
                                   'Outlet_Type',
                                   'Outlet_Identifier'
                                  ])
df1.dtypes

Item_Weight                      float64
Item_Visibility                  float64
Item_MRP                         float64
Outlet_Establishment_Year          int64
Item_Outlet_Sales                float64
data                               int64
Item_Fat_Content_Low Fat           uint8
Item_Fat_Content_Non Edible        uint8
Item_Fat_Content_Regular           uint8
Item_Type_Combined_DR              uint8
Item_Type_Combined_FD              uint8
Item_Type_Combined_NC              uint8
Outlet_Location_Type_Tier 1        uint8
Outlet_Location_Type_Tier 2        uint8
Outlet_Location_Type_Tier 3        uint8
Outlet_Size_High                   uint8
Outlet_Size_Medium                 uint8
Outlet_Size_Small                  uint8
Outlet_Type_Grocery Store          uint8
Outlet_Type_Supermarket Type1      uint8
Outlet_Type_Supermarket Type2      uint8
Outlet_Type_Supermarket Type3      uint8
Outlet_Identifier_OUT010           uint8
Outlet_Identifier_OUT013           uint8
Outlet_Identifie

In [9]:
df1.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,data,Item_Fat_Content_Low Fat,Item_Fat_Content_Non Edible,Item_Fat_Content_Regular,Item_Type_Combined_DR,...,Outlet_Identifier_OUT010,Outlet_Identifier_OUT013,Outlet_Identifier_OUT017,Outlet_Identifier_OUT018,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049
0,9.3,0.016047,249.8092,14,3735.138,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,5.92,0.019278,48.2692,4,443.4228,1,0,0,1,1,...,0,0,0,1,0,0,0,0,0,0
2,17.5,0.01676,141.618,14,2097.27,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,19.2,0.054021,182.095,15,732.38,1,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
4,8.93,0.054021,53.8614,26,994.7052,1,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
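
As a small illustration of what `pd.get_dummies` does in the cell above, here is a sketch on a toy frame (the `toy` data is made up). Each category becomes its own indicator column named `<column>_<value>`; the indicator dtype is `uint8` in older pandas versions, as in the output above, and `bool` in pandas 2.x.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values).
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Size": ["Medium", "Small", "Medium"],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Outlet_Size"])
print(encoded.columns.tolist())

# Optional variant: drop the first level of each category to avoid
# perfectly collinear indicator columns in linear models.
encoded_k_minus_1 = pd.get_dummies(toy, columns=["Outlet_Size"], drop_first=True)
print(encoded_k_minus_1.columns.tolist())
```

The notebook keeps all levels; `drop_first=True` is shown only as an option that is sometimes preferred for unregularised linear regression.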


---
---

## Splitting the DataFrame into train and test as per original datasets.

In [10]:
# Splitting into train.
train = df1.loc[df1.loc[:,'data']==1,:]

In [11]:
# Removing the 'Item_Outlet_Sales' & 'data' columns (positions 4 and 5) from train and storing the rest in X_TRAIN
X_TRAIN = pd.concat([train.iloc[:,:4], train.iloc[:,6:]], axis=1, sort=False)

In [12]:
Y_TRAIN = train['Item_Outlet_Sales']

---

In [13]:
# Test Dataset
test = df1.loc[df1.loc[:,'data']==0,:]

In [14]:
# Removing 'Item_Outlet_Sales' & 'data' column from the test dataset
X_TEST = pd.concat([test.iloc[:,:4], test.iloc[:,6:]], axis=1, sort=False)
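
The positional `iloc` slices above depend on the exact column order of `df1`. A possible alternative, sketched here on a toy frame with made-up values, is to remove the non-feature columns by name, which keeps working if the column order changes. The names `toy` and `non_features` are assumptions for illustration.

```python
import pandas as pd

# Toy encoded frame; column names mirror the notebook, values are illustrative.
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5],
    "Item_MRP": [249.8, 48.3, 141.6],
    "Item_Outlet_Sales": [3735.1, 443.4, 0.0],
    "data": [1, 1, 0],
    "Outlet_Size_Medium": [1, 1, 0],
})

train = toy[toy["data"] == 1]
test = toy[toy["data"] == 0]

# Columns that must not be used as model inputs.
non_features = ["Item_Outlet_Sales", "data"]

X_train = train.drop(columns=non_features)
y_train = train["Item_Outlet_Sales"]
X_test = test.drop(columns=non_features)

print(X_train.columns.tolist())
```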

---
---

# Model Building, Prediction and Evaluation

In [15]:
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2

In [16]:
def evaluation(test_y, pred):
    print("Mean Absolute Error (MAE):",mae(test_y,pred))
    print("Mean Squared Error (MSE):",mse(test_y,pred))
    print("Root Mean Square Error (RMSE):",np.sqrt(mse(test_y,pred)))
    print("R-Squared Value:",r2(test_y,pred))
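
To make the four metrics printed by `evaluation` concrete, here is a small worked example that computes them by hand with NumPy on made-up numbers. It follows the standard definitions that the scikit-learn functions imported above implement; the arrays are illustrative only.

```python
import numpy as np

# Made-up targets and predictions, purely for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred                    # [1, 0, -2, -1]

mae_value = np.mean(np.abs(errors))         # (1 + 0 + 2 + 1) / 4 = 1.0
mse_value = np.mean(errors ** 2)            # (1 + 0 + 4 + 1) / 4 = 1.5
rmse_value = np.sqrt(mse_value)             # square root of the MSE

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_value = 1 - ss_res / ss_tot

print(mae_value, mse_value, rmse_value, r2_value)
```

RMSE is in the same units as the target (sales), which usually makes it easier to interpret than MSE.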

In [17]:
# Cross Validation
from sklearn.model_selection import cross_val_score,cross_validate

[sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)<br>
[sklearn.model_selection.cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate)<br>
[scoring parameter](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)

In [18]:
# For evaluating for only 1 metric use this function otherwise the one below:
#accuracies = cross_val_score(estimator = model, 
#                             X = train_x, 
#                             y = train_y, 
#                             cv = 20, 
#                             scoring='neg_mean_absolute_error')

In [19]:
def cross_val(model, train_x, train_y):
    accuracies = cross_validate(estimator = model, 
                            X = train_x, 
                            y = train_y, 
                            cv = 10, 
                            scoring=('neg_mean_absolute_error','neg_mean_squared_error','r2'),
                            return_train_score=False)
    print("RMSE:",np.mean(np.sqrt(np.abs(accuracies['test_neg_mean_squared_error']))))
    # R-squared can be negative, so it is averaged as-is rather than through np.abs.
    print("R-squared:",np.mean(accuracies['test_r2']))
    print("MSE:",np.mean(np.abs(accuracies['test_neg_mean_squared_error'])))
    print("MAE:",np.mean(np.abs(accuracies['test_neg_mean_absolute_error'])))
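
The `neg_*` scorers used in `cross_val` return negated errors, because scikit-learn scorers follow a "higher is better" convention. A minimal sketch on synthetic data (the data, the 5-fold setting and the variable names are assumptions, not the notebook's dataset) shows how the per-fold values can be turned back into errors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = cross_validate(
    LinearRegression(), X, y, cv=5,
    scoring=("neg_mean_squared_error", "r2"),
)

neg_mse = scores["test_neg_mean_squared_error"]  # one non-positive value per fold
mse_per_fold = -neg_mse                          # flip the sign to get the MSE
rmse = np.sqrt(mse_per_fold).mean()              # mean of the per-fold RMSEs
r2_mean = scores["test_r2"].mean()               # R-squared needs no sign change

print(rmse, r2_mean)
```

Negating the scores is equivalent to `np.abs` for the error metrics, since they are never positive, but it makes the intent explicit.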

## 1. MULTIPLE LINEAR REGRESSION MODEL

In [20]:
from sklearn.linear_model import LinearRegression

In [21]:
MLR = LinearRegression(normalize=True) 

In [22]:
# Training the model on Train Data
MLR.fit(X_TRAIN, Y_TRAIN)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

**PREDICTIONS**

In [23]:
predictions = MLR.predict(X_TEST)

**EVALUATION**

In [24]:
cross_val(MLR, X_TRAIN, Y_TRAIN)

RMSE: 1129.8553198457973
R-squared: 0.5602653818968515
MSE: 1276844.1806150586
MAE: 837.8224216809034
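
The `normalize=True` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell fails on recent versions. One possible substitute, sketched below on synthetic data, is a pipeline with `StandardScaler`. For plain least squares, rescaling the features should not change the predictions, which the sketch checks. This is an assumption-laden illustration, not the notebook's own code, and for penalised models such as `Ridge` the two scaling conventions differ, so results would not match exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([1.0, 0.2, 30.0]) + rng.normal(scale=0.1, size=100)

# Ordinary least squares with and without a scaling step.
plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Unpenalised linear regression gives the same fitted values either way.
same = np.allclose(plain.predict(X), scaled.predict(X))
print(same)
```

A pipeline can be passed to `cross_val` in place of the bare estimator, so the scaler is refitted inside each fold.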


**% Using statsmodels %**

In [25]:
# import statsmodels.api as sm
# import statsmodels.formula.api as smf

#results = sm.OLS(Y_TRAIN, X_TRAIN).fit()

#print(results.summary())

## 2. RIDGE REGRESSION MODEL

In [26]:
from sklearn.linear_model import Ridge

In [27]:
RR = Ridge(normalize=True)

In [28]:
# Training the model on Overall Train Data
RR.fit(X_TRAIN, Y_TRAIN)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)

In [29]:
predictions = RR.predict(X_TEST)

In [30]:
# EVALUATION
cross_val(RR, X_TRAIN, Y_TRAIN)

RMSE: 1251.0767275739704
R-squared: 0.46160419896706273
MSE: 1566273.8838219715
MAE: 935.5944122682497


## 3. DECISION TREE REGRESSION

In [16]:
from sklearn.tree import DecisionTreeRegressor

In [17]:
DTR = DecisionTreeRegressor(max_depth=10, min_samples_leaf=100)

In [18]:
DTR.fit(X_TRAIN, Y_TRAIN)

DecisionTreeRegressor(criterion='mse', max_depth=10, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=100,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [19]:
predictions = DTR.predict(X_TEST)

In [26]:
# Evaluation
cross_val(DTR, X_TRAIN, Y_TRAIN)

RMSE: 1092.9347711168036
R-squared: 0.5881457978346949
MSE: 1194937.7948706043
MAE: 767.2623834431963
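
The two hyperparameters set above constrain the tree's shape: `max_depth` caps how many splits lie between root and leaf, and `min_samples_leaf` forces every leaf to hold at least that many training rows. A minimal sketch on synthetic data (the data and `random_state=0` are assumptions for reproducibility) inspects the fitted tree to confirm both limits:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1000-row problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
y = np.sin(6 * X[:, 0]) + rng.normal(scale=0.1, size=1000)

tree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=100, random_state=0)
tree.fit(X, y)

# Leaves are the nodes without children (children_left == -1).
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print("depth:", tree.get_depth())
print("smallest leaf:", leaf_sizes.min())
```

With 1000 rows and at least 100 rows per leaf, the tree can have at most 10 leaves, so here `min_samples_leaf` is the binding constraint rather than `max_depth`.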


## 4. RANDOM FOREST REGRESSION

In [23]:
from sklearn.ensemble import RandomForestRegressor

In [24]:
RFR = RandomForestRegressor(max_depth=10, min_samples_leaf=100, n_estimators=100)

In [25]:
RFR.fit(X_TRAIN, Y_TRAIN)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=100, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [26]:
predictions = RFR.predict(X_TEST)

In [27]:
# EVALUATION
cross_val(RFR, X_TRAIN, Y_TRAIN)

RMSE: 1084.8272048825506
R-squared: 0.5942568960571297
MSE: 1177283.3300120078
MAE: 760.6125087984959


## 5. SUPPORT VECTOR REGRESSION

In [28]:
from sklearn.svm import SVR

In [29]:
SV_regressor = SVR(gamma='scale')

In [30]:
SV_regressor.fit(X_TRAIN, Y_TRAIN)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [31]:
predictions = SV_regressor.predict(X_TEST)

In [32]:
predictions1 = SV_regressor.predict(X_TRAIN)

In [33]:
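# Note: these metrics are computed on the training data itself, not by cross-validation as for the other models.
evaluation(Y_TRAIN, predictions1)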

Mean Absolute Error (MAE): 1137.4614324923764
Mean Squared Error (MSE): 2462252.280506624
Root Mean Square Error (RMSE): 1569.1565506687418
R-Squared Value: 0.15438803909740484
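
An R-squared this low on the training data suggests the default `SVR` settings are poorly matched to the problem. One common cause, which is an assumption here and was not verified on this dataset, is that SVR is sensitive to the scale of both the features and the target. The sketch below compares a default `SVR` with one wrapped in `StandardScaler` for the inputs and `TransformedTargetRegressor` for the target, on synthetic data only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with badly scaled features and a large-valued target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = 2000 + 500 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=50, size=200)

# Default SVR on the raw features and target.
raw_svr = SVR(gamma="scale").fit(X, y)

# Same SVR, but features and target are standardised before fitting;
# predictions are transformed back to the original target scale.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(gamma="scale")),
    transformer=StandardScaler(),
).fit(X, y)

r2_raw = r2_score(y, raw_svr.predict(X))
r2_scaled = r2_score(y, scaled_svr.predict(X))
print(r2_raw, r2_scaled)
```

Whether scaling helps as much on the sales data would need to be checked, ideally with the same `cross_val` helper used for the other models.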


## 6. ANN

## 7. XGBOOST

In [29]:
from xgboost import XGBRegressor

In [58]:
my_model = XGBRegressor()
# Add silent=True to avoid printing out updates with each cycle

In [59]:
my_model.fit(X_TRAIN, Y_TRAIN, verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [60]:
predictions_xgboost = my_model.predict(X_TEST)

In [61]:
predictions_xgboost

array([1287.6254, 1408.5503,  628.3989, ..., 1792.4873, 3615.2861,
       1268.2827], dtype=float32)

In [62]:
cross_val(my_model, X_TRAIN, Y_TRAIN)

RMSE: 1084.1300837682552
R-squared: 0.5947415051446352
MSE: 1175749.9998201444
MAE: 757.8122937615178


---
---

# Saving the predictions to a DataFrame

In [27]:
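# `predictions` holds the output of whichever model's predict cell was run last;
# re-run the desired model's predict cell (the file name below suggests the Decision Tree) first.
solution = pd.DataFrame({"Item_Identifier":Item_Identifier_test,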
                         "Outlet_Identifier":Outlet_Identifier_test,
                         "Item_Outlet_Sales":predictions})

---
---

# Saving to csv file

In [28]:
# save file
solution.to_csv("Decision_Tree.csv", index = False)
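
The last two steps can be sketched end to end on toy values. The identifiers, index labels and predictions below are made up, and the CSV is written to an in-memory buffer instead of a file so the result can be inspected directly. Converting the identifier Series to arrays is an optional choice that discards the original row labels, so the output frame gets a fresh 0..n-1 index.

```python
import io

import numpy as np
import pandas as pd

# Toy identifiers that keep index labels from a larger frame (illustrative).
item_ids = pd.Series(["FDW58", "FDW14"], index=[8523, 8524])
outlet_ids = pd.Series(["OUT049", "OUT017"], index=[8523, 8524])
preds = np.array([1287.6, 1408.6])  # stand-in for a model's predictions

# Positional assembly: one row per test item, in the original order.
solution = pd.DataFrame({
    "Item_Identifier": item_ids.to_numpy(),
    "Outlet_Identifier": outlet_ids.to_numpy(),
    "Item_Outlet_Sales": preds,
})

# index=False keeps the row index out of the file.
buffer = io.StringIO()
solution.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```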