**References:**<br>

[sklearn Supervised Learning](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)<br>
[Statsmodels](http://www.statsmodels.org/stable/index.html)<br>
[Lin Reg Evaluation (statsmodels)](https://statinfer.com/204-1-7-adjusted-r-squared-in-python/)<br>
[Regression Evaluation Metrics (sklearn)](https://scikit-learn.org/stable/modules/classes.html#regression-metrics)

## Importing Libraries and Cleaned Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [25]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [3]:
# Importing the Cleaned Dataset from previous steps
df = pd.read_csv('Cleaned.csv')

In [20]:
df.head(2)

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales | data | Item_Categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 14 | Medium | Tier 1 | Supermarket Type1 | 3735.138 | 1 | FD |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 4 | Medium | Tier 3 | Supermarket Type2 | 443.4228 | 1 | DR |


---
---

## Encoding Categorical Variables
1. **Item_Identifier**
2. **Item_Fat_Content** => categories are not properly defined
3. **Item_Type**
4. **Outlet_Identifier**
5. **Outlet_Size** (High/Medium/Small) => impute missing values using *Outlet_Identifier*, the number of items in that store, and *Outlet_Type*
6. **Outlet_Type**

In [7]:
df1 = pd.get_dummies(df)
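
With default arguments, `pd.get_dummies` leaves numeric columns untouched and replaces every object-typed column with one indicator column per category. That includes `Item_Identifier`, which is where columns such as `Item_Identifier_DRA12` in the later outputs come from. A minimal sketch on a toy frame (the values and the name `toy` are made up for illustration):

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset above.
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Size": ["Medium", "Medium", "Small"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
})

# Numeric columns pass through; each object column becomes
# one indicator column per distinct category.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

A column with many distinct values therefore adds one feature per value, which is worth keeping in mind when reading the model results further down.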

---
---

## Splitting the DataFrame into train and test as per original datasets.

In [8]:
# Splitting into train.
train = df1.loc[df1.loc[:,'data']==1,:]

In [21]:
train.head(2)

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales | data | Item_Identifier_DRA12 | Item_Identifier_DRA24 | Item_Identifier_DRA59 | Item_Identifier_DRB01 | ... | Outlet_Location_Type_Tier 1 | Outlet_Location_Type_Tier 2 | Outlet_Location_Type_Tier 3 | Outlet_Type_Grocery Store | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 | Item_Categories_DR | Item_Categories_FD | Item_Categories_NC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.3 | 0.016047 | 249.8092 | 14 | 3735.138 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 5.92 | 0.019278 | 48.2692 | 4 | 443.4228 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |


In [14]:
# Drop 'Item_Outlet_Sales' and 'data' from train; keep the remaining columns as features in X_train
X_train = pd.concat([train.iloc[:,:4], train.iloc[:,6:]], axis=1, sort=False)

In [16]:
Y_train = train['Item_Outlet_Sales']
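
The positional slices above (`iloc[:,:4]` and `iloc[:,6:]`) rely on `Item_Outlet_Sales` and `data` sitting at positions 4 and 5. A sketch of the same selection by column name, shown on a toy frame with made-up values (`toy_train`, `X_toy` and `y_toy` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the encoded training frame.
toy_train = pd.DataFrame({
    "Item_Weight": [9.3, 5.92],
    "Item_MRP": [249.8092, 48.2692],
    "Item_Outlet_Sales": [3735.138, 443.4228],
    "data": [1, 1],
    "Outlet_Size_Medium": [1, 1],
})

target = "Item_Outlet_Sales"

# Drop the target and the train/test flag by name, wherever they sit.
X_toy = toy_train.drop(columns=[target, "data"])
y_toy = toy_train[target]
print(X_toy.columns.tolist())
```

Selecting by name keeps working if columns are added or reordered upstream, whereas positional slicing would silently pick different columns.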

In [26]:
# For statsmodels
TRAIN = pd.concat([X_train, Y_train], axis=1, sort=False)

**For evaluation purposes, split this training set again into train and test subsets.**

In [18]:
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split

In [19]:
train_X, test_X, train_Y, test_Y = train_test_split(X_train, Y_train, test_size = 0.3, random_state = 0)

These split datasets will be used for evaluation.
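
As a quick check of what `test_size=0.3` and `random_state=0` do, here is a sketch on placeholder arrays (`X_demo` and `y_demo` are synthetic, not the sales data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic rows: row i of X_demo is [2*i, 2*i + 1], and y_demo[i] is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# 30% of 10 rows go to the test subset, the remaining 70% to train.
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Rows are shuffled, but each feature row stays paired with its target.
print((X_tr[:, 0] == 2 * y_tr).all())  # True
```

Fixing `random_state` makes the shuffle reproducible, so the metrics reported below can be compared across runs.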

---

In [10]:
# Test Dataset
test = df1.loc[df1.loc[:,'data']==0,:]

In [22]:
test.head(2)

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales | data | Item_Identifier_DRA12 | Item_Identifier_DRA24 | Item_Identifier_DRA59 | Item_Identifier_DRB01 | ... | Outlet_Location_Type_Tier 1 | Outlet_Location_Type_Tier 2 | Outlet_Location_Type_Tier 3 | Outlet_Type_Grocery Store | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 | Item_Categories_DR | Item_Categories_FD | Item_Categories_NC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8523 | 20.75 | 0.007565 | 107.8622 | 14 | NaN | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 8524 | 8.3 | 0.038428 | 87.3198 | 6 | NaN | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |


In [12]:
# Drop the 'Item_Outlet_Sales' and 'data' columns from the test dataset
X_test = pd.concat([test.iloc[:,:4], test.iloc[:,6:]], axis=1, sort=False)

---
---

# Model Building, Prediction and Evaluation

## 1. Multiple Linear Regression

### **Using sklearn**

In [35]:
from sklearn.linear_model import LinearRegression

In [79]:
lm1 = LinearRegression(fit_intercept=False)
lm2 = LinearRegression()

In [80]:
# Train on the training subset (train_X, train_Y) that was split off from the train data
lm1.fit(train_X, train_Y)
predictions1 = lm1.predict(test_X)

In [81]:
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2

In [82]:
print("Mean Absolute Error (MAE):",mae(test_Y,predictions1))
print("Mean Squared Error (MSE):",mse(test_Y,predictions1))
print("Root Mean Square Error (RMSE):",np.sqrt(mse(test_Y,predictions1)))
print("R-Squared Value:",r2(test_Y,predictions1))

Mean Absolute Error (MAE): 642954713.5199857
Mean Squared Error (MSE): 5.701115716681887e+19
Root Mean Square Error (RMSE): 7550573300.539428
R-Squared Value: -18722370162940.68
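
The four metrics printed above can be reproduced by hand. Doing so also shows why R² can be negative: it compares the model's squared error with that of always predicting the mean of the targets, so a model that does worse than the mean scores below zero. A value as far below zero as the one above means the predictions are many orders of magnitude worse than that baseline. One possible contributor, not verified here, is the large number of `Item_Identifier_*` indicator columns. The sketch below uses made-up numbers and illustrative variable names:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_true - y_pred
mae_val = np.mean(np.abs(errors))   # mean absolute error
mse_val = np.mean(errors ** 2)      # mean squared error
rmse_val = np.sqrt(mse_val)         # root mean squared error

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_val = 1 - ss_res / ss_tot

# Predictions far from every target give a residual sum of squares
# larger than ss_tot, hence a negative R-squared.
bad_pred = np.full_like(y_true, 1000.0)
r2_bad = 1 - np.sum((y_true - bad_pred) ** 2) / ss_tot

print(mae_val, mse_val, rmse_val, r2_val, r2_bad)
```

Here `r2_val` is close to 1 because the predictions track the targets, while `r2_bad` is strongly negative.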


In [None]:
rmse_cv()
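
`rmse_cv` is not defined anywhere in this notebook, so the cell above cannot run as written. A hypothetical helper with that name, assuming it was meant to report cross-validated RMSE through scikit-learn's `cross_val_score`, might look like the sketch below. The signature, the default of five folds, and the synthetic data are all assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def rmse_cv(model, X, y, cv=5):
    """Hypothetical helper: per-fold RMSE from k-fold cross-validation."""
    # scikit-learn scorers follow "higher is better", so MSE is
    # exposed as its negative and has to be flipped back.
    neg_mse = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=cv
    )
    return np.sqrt(-neg_mse)


# Synthetic regression problem standing in for the sales data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_rmse = rmse_cv(LinearRegression(), X_demo, y_demo)
print(fold_rmse.mean())
```

With the notebook's data, the equivalent call would be `rmse_cv(lm1, X_train, Y_train)`; averaging over folds gives a less split-dependent estimate than the single 70/30 split used above.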

In [84]:
lm1.score(test_X,test_Y)

-18722370162940.68

### **Using statsmodels**

In [24]:
# import statsmodels.api as sm
# import statsmodels.formula.api as smf

In [32]:
results = sm.OLS(Y_train, X_train).fit()

In [33]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:      Item_Outlet_Sales   R-squared:                       0.643
Model:                            OLS   Adj. R-squared:                  0.563
Method:                 Least Squares   F-statistic:                     7.993
Date:                Fri, 19 Apr 2019   Prob (F-statistic):               0.00
Time:                        08:54:35   Log-Likelihood:                -71130.
No. Observations:                8523   AIC:                         1.454e+05
Df Residuals:                    6953   BIC:                         1.565e+05
Df Model:                        1569                                         
Covariance Type:            nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Item_Weigh
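
The adjusted R² in the summary header can be checked against the other reported figures with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1) / (residual degrees of freedom). The sketch below plugs in the rounded values printed above, so it should only agree with the reported 0.563 to within rounding:

```python
# Values copied from the OLS summary header above.
r_squared = 0.643   # R-squared (rounded as printed)
n_obs = 8523        # No. Observations
df_resid = 6953     # Df Residuals

# Penalise R-squared for the number of fitted parameters.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
print(adj_r_squared)
```

The gap between 0.643 and 0.563 reflects the 1,569 model degrees of freedom: with that many regressors, a fair share of the raw R² is attributable to model size rather than explanatory power.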

**=> Predictions on the overall test dataset**

In [68]:
# Training the model on Overall Train Data
lm2.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [69]:
# Predictions of Linear Regression
predictions2 = lm2.predict(X_test)

 ___

## 2. Decision Tree Regression

---
---

## 3. Random Forest Regression

---
---

## 4. SVR

---
---

## 5. ANN

---
---

## 6. XGBoost

In [18]:
from xgboost import XGBRegressor

In [33]:
my_model = XGBRegressor() #RMSE 1154
# my_model = XGBRegressor(learning_rate=0.05, n_estimators=1000) #RMSE 1165
# Add silent=True to avoid printing out updates with each cycle

In [34]:
my_model.fit(X_train, Y_train, verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.05, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
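
The RMSE figures in the comments above (1154 and 1165) are not computed anywhere in this notebook. One way to obtain such a number, assuming it refers to a hold-out split like `train_X`/`test_X` earlier, is sketched below. The helper name `holdout_rmse` is hypothetical, and the demo uses scikit-learn's `DecisionTreeRegressor` on synthetic data so that the sketch does not depend on xgboost being installed. Since `XGBRegressor` follows the scikit-learn estimator interface, it can be passed in the same way.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def holdout_rmse(model, X, y, test_size=0.3, random_state=0):
    """Hypothetical helper: fit on a train split, return RMSE on the rest."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))


# Synthetic data standing in for X_train / Y_train.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

score = holdout_rmse(
    DecisionTreeRegressor(max_depth=3, random_state=0), X_demo, y_demo
)
print(score)
```

In the notebook this would be called as `holdout_rmse(XGBRegressor(), X_train, Y_train)`, which makes it straightforward to compare the two parameter settings listed in the cell above on the same split.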

**PREDICTIONS**

In [35]:
predictions_xgboost = my_model.predict(X_test)

In [36]:
predictions_xgboost

array([1622.0024 , 1384.9318 ,  631.67786, ..., 1813.3983 , 3313.7683 ,
       1327.607  ], dtype=float32)

---
---

# Saving the predictions to a DataFrame

In [38]:
df.loc[df.loc[:,'data']==0,['Item_Identifier','Outlet_Identifier']].head()

| | Item_Identifier | Outlet_Identifier |
|---|---|---|
| 8523 | FDW58 | OUT049 |
| 8524 | FDW14 | OUT017 |
| 8525 | NCN55 | OUT010 |
| 8526 | FDQ58 | OUT017 |
| 8527 | FDY38 | OUT027 |


In [39]:
# Rows of the original (unencoded) frame that belong to the test set
test_rows = df['data'] == 0
solution = pd.DataFrame({
    "Item_Identifier": df.loc[test_rows, 'Item_Identifier'],
    "Outlet_Identifier": df.loc[test_rows, 'Outlet_Identifier'],
    "Item_Outlet_Sales": predictions_xgboost,
})

---
---

# Saving to csv file

In [40]:
# save file
solution.to_csv("XGBoost_1.csv", index = False)