# Linear Regression

Regression is a method of modelling a target value based on independent predictors. It is mostly used for forecasting and for finding cause-and-effect relationships between variables. Regression techniques differ mainly in the number of independent variables and the type of relationship between the independent and dependent variables.

<img src= "Images/Photo15.png" width = 500>

Simple linear regression is a type of regression analysis with a single independent variable and a linear relationship between the independent (x) and dependent (y) variables. The red line in the above graph is referred to as the best-fit straight line. Based on the given data points, we try to find the line that models the points best. The line can be modelled by the linear equation shown below.

$$y = a_0 + a_1 x$$

Here a_0 is the intercept and a_1 is the slope of the line.


Before moving on to the algorithm, let’s have a look at two important concepts you must know to better understand linear regression.

## Cost Function

The cost function helps us to figure out the best possible values for a_0 and a_1 which would provide the best fit line for the data points. Since we want the best values for a_0 and a_1, we convert this search problem into a minimization problem where we would like to minimize the error between the predicted value and the actual value.

<img src= "Images/Photo16.png" width=350>

We choose the above function to minimize. The difference between a predicted value and the ground truth is the error for that data point. We square each error, sum over all data points, and divide by the total number of data points. This gives the average squared error over all the data points, which is why this cost function is also known as the Mean Squared Error (MSE) function. Using this MSE function, we adjust the values of a_0 and a_1 until the MSE settles at its minimum.
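The averaging described above can be sketched in a few lines of NumPy. The function name `mse_cost` and the toy data are assumptions made for illustration; they do not come from the BigMart dataset.

```python
import numpy as np

def mse_cost(a0, a1, x, y):
    """Mean squared error of the line y_hat = a0 + a1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = a0 + a1 * x                 # predictions of the candidate line
    return np.mean((y_hat - y) ** 2)    # average of the squared errors

# Toy data (made up) lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

perfect_fit_cost = mse_cost(1.0, 2.0, x, y)  # line passes through every point
worse_fit_cost = mse_cost(0.0, 2.0, x, y)    # same slope, intercept off by 1
print(perfect_fit_cost, worse_fit_cost)
```

The line that matches the data has zero cost, and moving the intercept away from its true value raises the cost, which is the quantity the fitting procedure tries to reduce.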

## Gradient Descent

The next important concept needed to understand linear regression is gradient descent. Gradient descent is a method of updating a_0 and a_1 to reduce the cost function (MSE). The idea is to start with some values for a_0 and a_1 and then change them iteratively to reduce the cost. Gradient descent tells us in which direction, and by how much, to change the values at each step.

<img src= "Images/Photo17.png" width = 500>

To draw an analogy, imagine a U-shaped pit. You are standing at the top and want to reach the bottom, but you can only move in discrete steps. If you take small steps, you will eventually reach the bottom, but it will take a long time. If you take larger steps, you will get there sooner, but you risk overshooting the bottom and never settling exactly at it. In the gradient descent algorithm, the size of each step is controlled by the learning rate, which determines how quickly the algorithm converges to the minimum.
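The trade-off in the analogy can be tried out on a deliberately simple cost, f(a) = a², whose derivative is 2a. This is only an illustrative sketch: the function `descend`, the starting point and the two learning rates are hypothetical choices, not values from this notebook.

```python
def descend(learning_rate, start=5.0, n_steps=20):
    """Run gradient descent on f(a) = a**2 and return the final position."""
    a = start
    for _ in range(n_steps):
        gradient = 2.0 * a                # derivative of a**2
        a = a - learning_rate * gradient  # step against the gradient
    return a

small_step = descend(0.1)  # each step multiplies a by 0.8, so a shrinks toward 0
large_step = descend(1.1)  # each step multiplies a by -1.2, so a overshoots and grows
print(small_step, large_step)
```

With the small learning rate the position moves steadily toward the minimum at 0, while the large one jumps across the minimum on every step and ends up farther away than where it started.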

<img src= "Images/Photo18.png" width = 600>

In general, a cost function can be non-convex, in which case gradient descent may settle at a local minimum. For linear regression, however, the MSE cost function is always convex, so any minimum that gradient descent reaches is the global minimum.

You may be wondering how gradient descent updates a_0 and a_1. We take the gradient of the cost function, that is, its partial derivatives with respect to a_0 and a_1. Following the derivation below requires some calculus; if you are not familiar with it, you can take the results as given.

<img src= "Images/Photo19.png" width = 800>

<img src= "Images/Photo20.png" width = 400>
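A minimal sketch of the resulting update loop is given here. It assumes the cost is the plain MSE (1/n times the sum of squared errors), so the gradients carry a factor of 2/n; the figures above may use a different constant, which only rescales the learning rate. The function name, learning rate, iteration count and toy data are all illustrative assumptions.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_iters=2000):
    """Fit y = a0 + a1 * x by gradient descent on the mean squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    a0, a1 = 0.0, 0.0                              # arbitrary starting values
    for _ in range(n_iters):
        error = (a0 + a1 * x) - y                  # prediction minus ground truth
        grad_a0 = (2.0 / n) * np.sum(error)        # partial derivative w.r.t. a0
        grad_a1 = (2.0 / n) * np.sum(error * x)    # partial derivative w.r.t. a1
        a0 -= learning_rate * grad_a0              # move against the gradient
        a1 -= learning_rate * grad_a1
    return a0, a1

# Toy data (made up) lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
a0, a1 = gradient_descent(x, y)
print(a0, a1)
```

On this noise-free toy data the loop recovers an intercept close to 1 and a slope close to 2. On real data the learning rate usually needs tuning, and scaling the features first tends to help convergence.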

## Linear Regression using sklearn

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim of this data science project is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.


In [11]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn.model_selection import train_test_split

In [12]:
train = pd.read_csv("Dataset/BigMart/train.csv")
train.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [3]:
test = pd.read_csv("Dataset/BigMart/test.csv")
test.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDW58 | 20.75 | Low Fat | 0.007565 | Snack Foods | 107.8622 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 1 | FDW14 | 8.3 | reg | 0.038428 | Dairy | 87.3198 | OUT017 | 2007 | NaN | Tier 2 | Supermarket Type1 |
| 2 | NCN55 | 14.6 | Low Fat | 0.099575 | Others | 241.7538 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store |
| 3 | FDQ58 | 7.315 | Low Fat | 0.015388 | Snack Foods | 155.034 | OUT017 | 2007 | NaN | Tier 2 | Supermarket Type1 |
| 4 | FDY38 | NaN | Regular | 0.118599 | Dairy | 234.23 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 |


In [4]:
train.shape

(8523, 12)

In [5]:
test.shape

(5681, 11)

In [13]:
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()

In [22]:
X = train.loc[:,['Outlet_Establishment_Year','Item_MRP']]
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales, random_state=123)

#Training the model
lreg.fit(x_train,y_train)

#predicting on cv
pred = lreg.predict(x_cv)

#Calculating Mean Square Error
mse = np.mean((pred - y_cv)**2)
mse

1983577.4099061228

In [20]:
#Calculating Coefficients
coeff = DataFrame(x_train.columns)
coeff['Coefficient Estimate'] = Series(lreg.coef_)
coeff

| | 0 | Coefficient Estimate |
|---|---|---|
| 0 | Outlet_Establishment_Year | -9.763104 |
| 1 | Item_MRP | 15.587094 |


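The coefficient for Item_MRP is positive (about 15.59): holding the outlet's establishment year fixed, each one-unit increase in MRP is associated with an increase of roughly 15.59 in predicted sales. In this model, higher-priced items therefore tend to have higher sales. Note that the raw coefficients are not directly comparable across features, because the features are measured on different scales.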
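One way to keep feature names attached to their coefficients is to index a pandas Series by the training columns. The sketch below uses synthetic, noise-free data so that the fitted values can be checked by eye; the column names `x1` and `x2` and the generating equation are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features (hypothetical) and a target built as 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "x1": rng.normal(size=50),
    "x2": rng.normal(size=50),
})
target = 3.0 + 2.0 * features["x1"] - 1.0 * features["x2"]

model = LinearRegression().fit(features, target)

# Label each coefficient with the name of the column it belongs to
coef_table = pd.Series(model.coef_, index=features.columns,
                       name="Coefficient Estimate")
print(coef_table)
print(model.intercept_)
```

Because the target contains no noise, the fitted coefficients match the generating equation up to floating-point error, and each one reads as the change in the prediction per unit change in that feature with the other held fixed.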

# R-square and adjusted R-square
R-Square: It determines how much of the total variation in Y (dependent variable) is explained by the variation in X (independent variable). Mathematically, it can be written as:

<img src= "Images/Photo7.png" width = 400 align = center>

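For a least-squares model evaluated on its own training data, R-square lies between 0 and 1, where 0 means that the model explains none of the variability in the target variable (Y) and 1 means that it explains all of it. On held-out data, such as the validation set used here, it can also be negative when the model predicts worse than simply using the mean of Y.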
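As a rough sketch, R-square can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is the standard definition and the one scikit-learn's `score` uses for regressors; it is assumed to match the formula in the figure above. The function name `r_squared` and the toy values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-square = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                        # predictions match exactly
mean_only = r_squared(y_true, np.full(4, y_true.mean()))   # always predict the mean
worse_than_mean = r_squared(y_true, y_true[::-1])          # predictions in reverse order
print(perfect, mean_only, worse_than_mean)
```

Perfect predictions give 1, always predicting the mean gives 0, and predictions that are worse than the mean give a negative value, which can happen when a model is scored on data it was not fitted to.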


In [9]:
lreg.score(x_cv,y_cv)

0.3184339327541229

In this case, R² is about 0.32, meaning that only 32% of the variance in sales on the validation set is explained by the year of establishment and the MRP. The remaining 68% is not captured by these two predictors.

# Linear Regression with multiple variables
---

In [23]:
X = train.loc[:,['Outlet_Establishment_Year','Item_MRP','Item_Weight']]
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales)
lreg.fit(x_train,y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Running this code raises the ValueError shown above because the Item_Weight column has some missing values. Let us impute them with the mean of the non-null entries.
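Before imputing, it can help to count the missing values and confirm that none remain afterwards. The sketch below does this on a small made-up column rather than the real dataset; the frame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (made up) with two missing weights
toy = pd.DataFrame({"Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.2]})

missing_before = int(toy["Item_Weight"].isna().sum())   # count the NaN entries

# mean() skips NaN by default, so this is the mean of the non-null entries
mean_weight = toy["Item_Weight"].mean()
toy["Item_Weight"] = toy["Item_Weight"].fillna(mean_weight)

missing_after = int(toy["Item_Weight"].isna().sum())
print(missing_before, missing_after)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a selected column, avoids the chained-assignment warnings raised by recent pandas versions.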

In [24]:
# fill missing weights with the mean of the non-null entries
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())
X = train.loc[:,['Outlet_Establishment_Year','Item_MRP','Item_Weight']]
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales)
lreg.fit(x_train,y_train)

# Predicting on cv
pred = lreg.predict(x_cv)

# Calculating mse
mse = np.mean((pred - y_cv)**2)
mse

1992082.675126109

In [12]:
# Calculating Coefficients
coeff = DataFrame(x_train.columns)
coeff['Coefficient Estimate'] = Series(lreg.coef_)
coeff

| | 0 | Coefficient Estimate |
|---|---|---|
| 0 | Outlet_Establishment_Year | -9.284013 |
| 1 | Item_MRP | 15.452276 |
| 2 | Item_Weight | -3.096755 |


In [13]:
# Calculating R-square
lreg.score(x_cv,y_cv)

0.33067711577344805

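The MSE (about 1.99 million) is essentially unchanged from the two-feature model (about 1.98 million), while R² rises slightly from 0.318 to 0.331. Because this split was drawn without a fixed random_state, the two runs use different validation sets, so small differences like these are not a reliable comparison between the models.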

# Adjusted R-square
---
A drawback of R² is that, on the training data, adding new predictors (X) to the model can only increase it or leave it unchanged; it never decreases. R² alone therefore cannot tell us whether the added complexity actually makes the model more accurate.

That is why we use the adjusted R-square.

The adjusted R-square is a modified form of R-square that accounts for the number of predictors in the model by incorporating the model's degrees of freedom. It increases only if the new term improves the model more than would be expected by chance.
<img src= "Images/Photo8.png" width = 400 align = center>
where,

R2 = Sample R square

p = Number of predictors

N = total sample size
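With these symbols, the usual form of the adjustment is 1 - (1 - R2)(N - 1)/(N - p - 1), which is assumed here to match the figure above. A small helper might look like the following; the function name and the example numbers are hypothetical and unrelated to the BigMart results.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R-square: 1 - (1 - R2) * (N - 1) / (N - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Made-up example: R2 = 0.5 from N = 11 samples
adj_two_predictors = adjusted_r_squared(0.5, n_samples=11, n_predictors=2)
adj_five_predictors = adjusted_r_squared(0.5, n_samples=11, n_predictors=5)
print(adj_two_predictors, adj_five_predictors)
```

For the same R2 and sample size, the version with more predictors gets the lower adjusted value, which is the penalty for extra model complexity that plain R2 lacks.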

# Using all features for Prediction
---
Now let us build a model containing all the features.

## Data pre-processing steps for regression model

In [14]:
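# imputing missing values
# a visibility of zero is treated as missing and replaced with the column mean
train['Item_Visibility'] = train['Item_Visibility'].replace(0,np.mean(train['Item_Visibility']))
# convert the establishment year into the outlet's age as of 2013
train['Outlet_Establishment_Year'] = 2013 - train['Outlet_Establishment_Year']
train['Outlet_Size'] = train['Outlet_Size'].fillna('Small')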

# creating dummy variables to convert categorical into numeric values
mylist = list(train.select_dtypes(include=['object']).columns)
dummies = pd.get_dummies(train[mylist], prefix= mylist)
train.drop(mylist, axis=1, inplace = True)
X = pd.concat([train,dummies], axis =1 )
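To see what the dummy-variable step produces, here is a sketch on a tiny made-up frame. The frame `toy` is hypothetical, and the categorical column is named explicitly through the `columns` argument instead of being detected by dtype, which is a simplification of the cell above.

```python
import pandas as pd

# Toy frame (made up): one categorical column and one numeric column
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "High", "Small", "Medium"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# One 0/1 indicator column per category; the numeric column is left untouched
encoded = pd.get_dummies(toy, columns=["Outlet_Size"], dtype=int)
print(encoded)
```

Each row has a 1 in exactly one of the `Outlet_Size_*` columns, so the category is now represented by numbers that a linear model can use.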

In [16]:
import matplotlib.pyplot as plt
%matplotlib inline
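# Note: this reassigns X, so the dummy columns concatenated in the previous cell
# are discarded and only the remaining numeric columns are used as features.
X = train.drop(columns='Item_Outlet_Sales')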
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales, test_size =0.3)

# training a linear regression model on train
lreg.fit(x_train,y_train)

# predicting on cv
pred_cv = lreg.predict(x_cv)

from sklearn import metrics

# Another way of calculating mse (arguments are y_true, y_pred)
print(metrics.mean_squared_error(y_cv, pred_cv))

1842296.712910115


In [17]:
# calculating coefficients
coeff = DataFrame(x_train.columns)
coeff['Coefficient Estimate'] = Series(lreg.coef_)
coeff

| | 0 | Coefficient Estimate |
|---|---|---|
| 0 | Item_Weight | -0.850192 |
| 1 | Item_Visibility | -4678.838347 |
| 2 | Item_MRP | 15.584792 |
| 3 | Outlet_Establishment_Year | 13.439718 |


In [18]:
# evaluation using r-square
lreg.score(x_cv,y_cv)

0.33948060778042055