# Linear Regression

For a good linear regression model, the data should satisfy a few assumptions. One of these is the absence of multicollinearity, i.e., the independent variables should not be strongly correlated with one another. However, as the correlation plot above shows, our data contains a few highly correlated independent variables. This multicollinearity can be dealt with through regularization.
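To illustrate why regularization helps with correlated predictors, here is a minimal sketch on synthetic data (not the sales data used in this notebook). The names `X_demo`, `y_demo`, `ols`, and `ridge`, and the value `alpha=1.0`, are assumptions made for the example. It fits ordinary least squares and ridge regression on two nearly identical features and prints both sets of coefficients for comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (hypothetical, not the sales data): x2 is almost a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(scale=0.5, size=200)

# Ordinary least squares vs. ridge (L2-regularized) regression
ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)  # alpha chosen arbitrarily for the sketch

print('Correlation between features:', np.corrcoef(x1, x2)[0, 1])
print('OLS coefficients:  ', ols.coef_)
print('Ridge coefficients:', ridge.coef_)
```

The ridge penalty shrinks the coefficient vector, so its norm is never larger than that of the least-squares fit. With strongly correlated features this tends to give more stable coefficient estimates.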

For the time being, let’s build our linear regression model with all the variables. We will use 5-fold cross-validation in all the models we are going to build. Cross-validation gives an idea of how well a model generalizes to unseen data.
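The cells in this section evaluate the model on a single 70/30 hold-out split created with `train_test_split`. For reference, 5-fold cross-validation could be sketched as follows. This is only an illustration on synthetic data from `make_regression`; the names `X_demo`, `y_demo`, `kfold`, and `fold_rmse` are assumptions and do not appear elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical stand-in for X and y)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Five folds: each fold is used once for validation, four times for training
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so the MSE is returned negated
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=kfold, scoring='neg_mean_squared_error')
fold_rmse = np.sqrt(-neg_mse)

print('RMSE per fold:', fold_rmse)
print('Mean RMSE:', fold_rmse.mean())
```

Averaging the error over five folds gives a less split-dependent estimate than a single hold-out set, at the cost of fitting the model five times.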

In [14]:
#loading packages
import pandas as pd
import numpy as np #for mathematical calculations
import seaborn as sns
import math
import matplotlib.pyplot as plt #for plotting graphs
from datetime import datetime #to access datetime
from pandas import Series, DataFrame #to work on series & dataframe
from pathlib import Path #to create path to directories and files
from sklearn.metrics import mean_squared_error
%matplotlib inline
import warnings #to ignore the warnings
warnings.filterwarnings('ignore')

In [2]:
#https://pbpython.com/notebook-process.html
today = datetime.today()
train_original = Path.cwd() /'data'/'raw'/'Train_File.csv'
test_original = Path.cwd() /'data'/'raw'/'Test_File.csv'
summary_file_train = Path.cwd() /'data'/'processed'/f'summary_train{today:%b-%d-%Y}.pkl'
summary_file_test = Path.cwd() /'data'/'processed'/f'summary_test{today:%b-%d-%Y}.pkl'

In [3]:
#reading data
train = pd.read_pickle(summary_file_train)
test = pd.read_pickle(summary_file_test)

In [4]:
train.head()

Unnamed: 0,Item_Identifier,Item_MRP,Item_Outlet_Sales,Item_Visibility,Item_Weight,Outlet_Identifier,Outlet_Years,Item_Visibility_MeanRatio,Item_Fat_Content_0,Item_Fat_Content_1,...,Outlet_0,Outlet_1,Outlet_2,Outlet_3,Outlet_4,Outlet_5,Outlet_6,Outlet_7,Outlet_8,Outlet_9
0,FDA15,249.8092,3735.138,0.016047,9.3,OUT049,14,0.931078,1,0,...,0,0,0,0,0,0,0,0,0,1
1,DRC01,48.2692,443.4228,0.019278,5.92,OUT018,4,0.93342,0,0,...,0,0,0,1,0,0,0,0,0,0
2,FDN15,141.618,2097.27,0.01676,17.5,OUT049,14,0.960069,1,0,...,0,0,0,0,0,0,0,0,0,1
3,FDX07,182.095,732.38,0.017834,19.2,OUT010,15,1.0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,NCD19,53.8614,994.7052,0.00978,8.93,OUT013,26,1.0,0,1,...,0,1,0,0,0,0,0,0,0,0


In [5]:
test.head()

Unnamed: 0,Item_Identifier,Item_MRP,Item_Visibility,Item_Weight,Outlet_Identifier,Outlet_Years,Item_Visibility_MeanRatio,Item_Fat_Content_0,Item_Fat_Content_1,Item_Fat_Content_2,...,Outlet_0,Outlet_1,Outlet_2,Outlet_3,Outlet_4,Outlet_5,Outlet_6,Outlet_7,Outlet_8,Outlet_9
8523,FDW58,107.8622,0.007565,20.75,OUT049,14,1.029192,1,0,0,...,0,0,0,0,0,0,0,0,0,1
8524,FDW14,87.3198,0.038428,8.3,OUT017,6,1.130311,0,0,1,...,0,0,1,0,0,0,0,0,0,0
8525,NCN55,241.7538,0.099575,14.6,OUT010,15,1.735215,0,1,0,...,1,0,0,0,0,0,0,0,0,0
8526,FDQ58,155.034,0.015388,7.315,OUT017,6,1.291577,1,0,0,...,0,0,1,0,0,0,0,0,0,0
8527,FDY38,234.23,0.118599,13.6,OUT027,28,0.917824,0,0,1,...,0,0,0,0,0,1,0,0,0,0


In [6]:
train = train.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1)
test = test.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1)

In [7]:
X = train.drop('Item_Outlet_Sales', axis=1)
y = train['Item_Outlet_Sales']

## Building Model

In [8]:
#importing linear regression from sklearn
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()

#import train_test_split to create a hold-out validation set
from sklearn.model_selection import train_test_split

In [9]:
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3)

In [10]:
#training linear regression model on train
lreg.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [11]:
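#https://medium.com/coinmonks/regularization-of-linear-models-with-sklearn-f88633a93a2
#score() returns the coefficient of determination (R^2); the "Test score" is computed on the hold-out split (x_cv, y_cv)
print('Training score:  {}'.format(lreg.score(x_train,y_train)))
print('Test score: {}'.format(lreg.score(x_cv,y_cv)))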

Training score:  0.5610328624555483
Test score: 0.5689859712515812
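The scores above are R² values, since `LinearRegression.score` returns the coefficient of determination. As a small sketch of what that number means, the snippet below computes R² by hand as 1 − SS_res / SS_tot and compares it with `score`. It uses synthetic data, and the names `X_demo`, `y_demo`, `model`, and `r2_manual` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical, not the sales data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('Manual R^2: ', r2_manual)
print('score() R^2:', model.score(X_demo, y_demo))
```

Read this way, a score of about 0.56 means the model explains a little over half of the variance in `Item_Outlet_Sales`.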


In [12]:
#predicting on cv
pred_cv = lreg.predict(x_cv)

In [15]:
#calculating rmse
mse = mean_squared_error(y_cv, pred_cv)
rmse = math.sqrt(mse)

In [16]:
print('RMSE: {}'.format(rmse))

RMSE: 1077.4578332404458
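RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as `Item_Outlet_Sales`. The toy example below, with made-up numbers in the hypothetical arrays `y_true` and `y_hat`, shows that the manual formula and the `mean_squared_error` route used above agree.

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values purely for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0, -2, so the squared errors are 1, 0, 4 and their mean is 5/3
rmse_manual = math.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = math.sqrt(mean_squared_error(y_true, y_hat))

print('Manual RMSE: ', rmse_manual)
print('sklearn RMSE:', rmse_sklearn)
```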
