In [1]:
# importing required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# read the dataset

data = pd.read_csv("train.csv")

In [3]:
# dimension of the dataset

data.shape

(8523, 12)

In [4]:
# seeing the first few rows

data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## Problem: BigMart Sales Prediction (we need to predict the sales of stores, 'Item_Outlet_Sales', using all these variables)

### To build a model, we need 2 types of data: a training dataset (which we'll build our model on) and a test dataset (to make predictions and check our results), which generally doesn't have a target variable (if it has one, we need to remove it)

#### > Before starting, we can see that our target variable is continuous in nature, so this is a regression problem, and the algorithm we'll use is Linear Regression

#### > First, let's create the train and test datasets; since the dataset has 8523 rows, we'll use the first 8000 rows as the train dataset & the remaining 523 rows as the test dataset for making predictions

In [33]:
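# making train dataset
# note: the slice end is exclusive, so data[0:7999] keeps rows 0 to 7998 (7999 rows);
# data[0:8000] would keep the full first 8000 rows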

train = data[0:7999]

In [34]:
# making test dataset (generally does not have a target variable, but in this case it does, so we will need to get rid of it)

test = data[8000:]
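Python slices exclude their end index, so the row counts of a positional split are worth checking. A minimal sketch on a made-up stand-in table (the `demo` frame and its values are assumptions, not the BigMart data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same number of rows as the dataset (8523).
demo = pd.DataFrame({
    "Item_MRP": np.linspace(30.0, 270.0, 8523),
    "Item_Outlet_Sales": np.linspace(100.0, 5000.0, 8523),
})

n_train = 8000  # number of leading rows kept for training

# .iloc slices by position; the end index is exclusive, as with Python lists.
demo_train = demo.iloc[:n_train]   # rows 0 .. 7999
demo_test = demo.iloc[n_train:]    # rows 8000 .. 8522

print(demo_train.shape, demo_test.shape)
```

Using the same `n_train` for both slices guarantees that no row is dropped between the two parts.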

In [35]:
# see first few rows of test dataset

test.head()    # we see that it has the target variable, which we will need to get rid of

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,Item_Identifier_DRA12,Item_Identifier_DRA24,Item_Identifier_DRA59,Item_Identifier_DRB01,Item_Identifier_DRB13,...,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
8000,7.02,0.081329,150.0734,2002,4454.202,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
8001,7.42,0.020388,247.1092,2004,4233.1564,0,0,0,0,0,...,0,0,1,0,1,0,0,1,0,0
8002,17.25,0.113518,253.5724,1997,5033.448,0,0,0,0,0,...,0,0,1,1,0,0,0,1,0,0
8003,18.75,0.052917,190.6504,2002,1342.2528,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
8004,20.25,0.018911,220.5772,2007,2446.1492,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


### Now we need to adjust the dataset according to the scikit-learn implementation. scikit-learn algorithms take 2 separate arguments, whether for linear or logistic regression: the independent variables and the target variable.

But our 'train' dataset contains both independent and dependent variables together. So, we need to separate these out.

In [36]:
# creating set of independent variables from the train dataset

x_train = train.drop('Item_Outlet_Sales',axis=1)    # drops the 'Item_Outlet_Sales' (target) column from the train dataset (axis=1 means drop a column, not a row)

In [37]:
# creating the target (dependent) variable

y_train = train['Item_Outlet_Sales']

We can use the above two to train our model

### Also, now we need our test dataset with only independent variables to make predictions and to check our model

In [38]:
# performing the same above operations on the test dataset

x_test = test.drop('Item_Outlet_Sales',axis=1)

Now, we'll keep the target variable of the test dataset separately as the true values. These will be used later when we check the performance of our model.

In [39]:
# creating the true values variable

true_p = test['Item_Outlet_Sales']

So, now we have our independent variables in both train & test; and dependent variable (target) in the train dataset.
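The separation of features and target can be wrapped in a small helper. This is only a sketch: `split_features_target` is a hypothetical name and the three-row table is made up for illustration.

```python
import pandas as pd

def split_features_target(df, target):
    """Return (X, y): every column except `target`, and the `target` column."""
    X = df.drop(columns=[target])   # returns a new frame; df itself is unchanged
    y = df[target]
    return X, y

demo = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Item_Weight": [9.3, 5.92, 17.5],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3],
})

X_demo, y_demo = split_features_target(demo, "Item_Outlet_Sales")
print(list(X_demo.columns))
print(y_demo.name)
```

Because `drop` returns a copy, the original frame still holds the target column afterwards.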

## Now we can go ahead and create our linear regression model

### > To use the linear regression model, we need to import LinearRegression from scikit-learn

In [40]:
# import linear regression

from sklearn.linear_model import LinearRegression

Another thing about scikit-learn is that we first create an object (an estimator), and then call methods such as fit and predict on that object

In [41]:
# create an object

lreg = LinearRegression()    # object of linear regression

So, now whatever we'll be doing, we will use the 'lreg' object
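The create-then-call pattern can be sketched on toy data. The numbers here are made up so that the target follows y = 2x + 1 exactly; `model`, `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2*x + 1 exactly (made up for illustration).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])   # 2-D: one row per sample
y_demo = 2.0 * X_demo[:, 0] + 1.0                 # 1-D target

model = LinearRegression()                    # 1. create the estimator object
model.fit(X_demo, y_demo)                     # 2. learn coefficients from (features, target)
pred_demo = model.predict(np.array([[4.0]]))  # 3. predict for a new row

print(model.coef_, model.intercept_, pred_demo)
```

The fitted slope and intercept should come out close to 2 and 1, and the prediction for x = 4 close to 9.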

#### > Linear regression has a method called fit, which fits our model to the training data

So, let's go ahead and fit the training data.

But so far we have not done any pre-processing, data exploration or modifications, and we can see that our dataset has missing values, etc. We'll learn what scikit-learn requires as we go along and run into errors

In [19]:
# fitting the model

lreg.fit(x_train,y_train)    # (independent,dependent)

ValueError: could not convert string to float: 'FDA15'

#### > This gives an error. The message names 'FDA15', a value from the Item_Identifier column: the fit function expected float values and we passed it strings.

#### > One thing about scikit-learn estimators such as LinearRegression is that they need input in the form of numbers only, i.e. int or float, not string/object columns.

#### > So to get rid of this error, we need to create numeric features out of these categorical (string) features.

#### > Here, we'll use a technique called dummification (one-hot encoding): for each categorical variable, we create one 0/1 dummy column per distinct category

In [42]:
# create dummy variables for our train dataset

x_train = pd.get_dummies(x_train)
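To see what dummification does to a single column, here is a small sketch on a made-up three-row table (the values are illustrative only):

```python
import pandas as pd

demo = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 182.1],
    "Outlet_Size": ["Medium", "High", "Medium"],
})

# Each distinct string in a text column becomes its own indicator column;
# numeric columns pass through unchanged.
demo_dummies = pd.get_dummies(demo)

print(list(demo_dummies.columns))
print(demo_dummies["Outlet_Size_Medium"].astype(int).tolist())
```

Depending on the pandas version, the indicator columns hold 0/1 integers or booleans; both are treated as numbers by scikit-learn.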

In [43]:
# to confirm we can check the shape of the x_train

x_train.shape

(7999, 1604)

In [44]:
x_train.head()    # to see the changed variable dummies for first few rows

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Identifier_DRA12,Item_Identifier_DRA24,Item_Identifier_DRA59,Item_Identifier_DRB01,Item_Identifier_DRB13,Item_Identifier_DRB24,...,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0.016047,249.8092,1999,0,0,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,2009,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0
2,17.5,0.01676,141.618,1999,0,0,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
3,19.2,0.0,182.095,1998,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
4,8.93,0.0,53.8614,1987,0,0,0,0,0,0,...,1,0,0,0,0,1,0,1,0,0


Here, we can see that our number of features (columns) has increased from 11 to 1604.

We'll be repeating this step with x_test too.

In [45]:
# create dummy variables for our test dataset

x_test = pd.get_dummies(x_test)

So now that all our features are numeric, we can train our model using the .fit() function of lreg.

In [24]:
# fit/train the model using lreg.fit()

lreg.fit(x_train,y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

#### > We are still getting an error 'Input contains NaN, infinity or a value too large for dtype'.

#### > So another requirement of scikit-learn's LinearRegression (shared by most of its estimators) is that the input contains no missing values

#### > So we have to impute the missing values somehow

In [46]:
# impute missing values in our data (both train & test) using the .fillna() function

x_train.fillna(0,inplace=True)
x_test.fillna(0,inplace=True)    # fill all missing values with 0, modifying the DataFrame in place
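Before filling, it can help to count the missing values per column. The sketch below uses a made-up four-row table and also shows mean imputation, which is an alternative to the zero fill used in this notebook, not something the notebook itself does.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Count missing values per column before deciding how to fill them.
missing_before = demo.isna().sum()

filled_zero = demo.fillna(0)                             # the approach used above
filled_mean = demo.fillna(demo.mean(numeric_only=True))  # alternative: column means

print(missing_before.to_dict())
print(filled_zero["Item_Weight"].tolist())
print(filled_mean["Item_Weight"].tolist())
```

Filling a weight with 0 tells the model the item weighs nothing, so a column mean (computed on the training rows) is often a gentler placeholder; which one works better here is an open question.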

So now let's re-run our .fit() function for lreg

In [47]:
lreg.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#### > So here we can see that we have successfully fitted our model to the dataset and these [ LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False) ] are the default parameters with which it was fit

## So, now that we have fit our model, we can predict on the test dataset

In [29]:
# using lreg.predict() function to predict on the test dataset

lreg.predict(x_test)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1604 is different from 502)

#### > So we get an error: 'matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1604 is different from 502)'.

#### > It says that the dimensions of train & test are different: the model was fit on 1604 features, but x_test has only 502. (dimension mismatch)

#### > Likely reason: the test dataset contains fewer distinct categories than the train dataset, so get_dummies creates fewer columns for it

#### > Good solution: Whenever we perform dummification, we perform it on the entire dataset (train and test combined), so that both end up with the same features
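A quick way to confirm this kind of mismatch is to compare the column sets after encoding each part separately. A sketch on made-up data (the category values are illustrative):

```python
import pandas as pd

train_demo = pd.DataFrame({"Item_Type": ["Dairy", "Meat", "Household"]})
test_demo = pd.DataFrame({"Item_Type": ["Dairy", "Dairy"]})

# Encoding each part on its own only creates columns for categories it contains.
train_cols = pd.get_dummies(train_demo).columns
test_cols = pd.get_dummies(test_demo).columns

# Columns the model would be trained on that the test matrix lacks.
missing_in_test = sorted(set(train_cols) - set(test_cols))

print(len(train_cols), len(test_cols))
print(missing_in_test)
```

Here the train part yields three columns and the test part only one, which is the same situation as 1604 versus 502 on a small scale.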

In [32]:
# create dummy variables for our entire dataset (this can be done just after reading the dataset at the beginning)

data = pd.get_dummies(data)
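The effect of encoding before splitting can be checked on a small made-up table. The second half of the sketch shows `reindex`, an alternative for the case where the split has already happened; it is not what this notebook does.

```python
import pandas as pd

demo = pd.DataFrame({
    "Item_Type": ["Dairy", "Meat", "Household", "Dairy", "Dairy"],
    "Item_MRP": [249.8, 141.6, 53.9, 120.0, 95.5],
})

# Option 1: encode the full table, then split by position.
encoded = pd.get_dummies(demo)
x_train_demo = encoded.iloc[:3]
x_test_demo = encoded.iloc[3:]
same_layout = list(x_train_demo.columns) == list(x_test_demo.columns)

# Option 2 (alternative): encode each part separately, then reindex the
# test columns to the training layout, filling absent categories with 0.
late_train = pd.get_dummies(demo.iloc[:3])
late_test = pd.get_dummies(demo.iloc[3:])
late_test_aligned = late_test.reindex(columns=late_train.columns, fill_value=0)

print(same_layout)
print(list(late_test_aligned.columns) == list(late_train.columns))
```

Option 1 is simpler when all the data is available up front; option 2 is useful when new data arrives after the model has been fit.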

### Now, we can re-run all the commands that follow the loading/reading step, from the start

> Now we have created the dummy variables for the data, separated train & test, and created x_train, y_train, x_test and the true values from train and test respectively. We have also replaced the missing values, so we can now fit the model using lreg.fit()

#### Let's now make predictions on the test dataset

In [48]:
lreg.predict(x_test)

array([ 2.4110e+03,  3.8860e+03,  3.8490e+03,  2.0695e+03,  4.7590e+03,
        2.9940e+03,  1.8815e+03,  2.5100e+03,  3.5240e+03,  5.3375e+03,
        4.8685e+03,  1.0990e+03,  4.1410e+03,  9.3550e+02,  1.6855e+03,
        9.3800e+02,  5.4900e+02,  9.3150e+02,  3.4785e+03,  2.2285e+03,
        1.9805e+03,  2.1850e+03,  3.8095e+03,  5.3740e+03,  1.2210e+03,
       -8.5950e+02,  2.5590e+03,  2.5645e+03,  5.0355e+03,  4.2085e+03,
        6.3250e+02,  3.0350e+02,  1.0355e+03,  1.5375e+03,  1.4530e+03,
        2.6325e+03,  2.9075e+03,  1.3975e+03,  6.2000e+02,  4.8180e+03,
        2.9355e+03,  1.0650e+03,  3.6615e+03,  2.4250e+02,  3.4470e+03,
        1.8840e+03,  3.1485e+03,  2.5905e+03,  7.9700e+02,  6.2750e+02,
        5.4880e+03,  2.3430e+03,  2.4750e+03,  2.2645e+03,  1.1650e+02,
        2.9395e+03,  2.4585e+03,  1.6025e+03,  7.8500e+02,  3.7995e+03,
        2.0100e+02,  2.8410e+03,  2.4880e+03, -4.3400e+02,  4.8715e+03,
        2.5770e+03,  2.7115e+03,  2.9655e+03,  2.1055e+03,  2.60

So, we can see that we have successfully made some predictions on the test dataset. Let us store this prediction in a variable called 'pred'

In [49]:
# store the prediction in a variable

pred = lreg.predict(x_test)

### Now we want to check the performance of our model. Remember that for checking the performance of a regression model, we can use R-squared

> scikit-learn provides a method called .score() which, for LinearRegression, returns the R-squared value

In [50]:
# performance of our model - r2

lreg.score(x_test,true_p)    # takes 2 arguments as input (independent variables in the test dataset,true values of the target)

0.40227153690776773

So this is the R-squared value on the test dataset.
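For reference, R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand on made-up data and compares it with .score(); `r_squared` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up, roughly linear data for illustration.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

model = LinearRegression().fit(X_demo, y_demo)
manual = r_squared(y_demo, model.predict(X_demo))
builtin = model.score(X_demo, y_demo)

print(np.isclose(manual, builtin))
```

A value of 1 means the predictions match the true values exactly; on unseen data the value can even be negative when the model does worse than predicting the mean.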

We can also look at the R-squared value on the train dataset to see whether our model is overfitting.

In [51]:
# r2 for the train dataset

lreg.score(x_train,y_train)

0.6497735631845378

So here we can see that on the train dataset we have an R-squared value of 0.65, whereas on the test dataset we have 0.40.

So the performance dropped on the test dataset: either our model overfitted the training dataset, or our test dataset is not a representative sample of our train dataset. With 1604 features and only 7999 training rows, overfitting is a plausible cause.

In either case, it needs investigation. This is how we can check the performance of our model.

## Now let's also evaluate our model using RMSE (Root Mean Square Error)
> Square root of the mean of ((true minus predicted) squared)

In [52]:
# calculate rmse for test dataset

np.sqrt(np.mean(np.power((np.array(true_p)-np.array(pred)),2)))

1255.3981735645664

In [53]:
# storing the rmse in a variable

rmse_test = np.sqrt(np.mean(np.power((np.array(true_p)-np.array(pred)),2)))
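Since the same expression is needed for both train and test, it can be wrapped in a small function. A minimal sketch; `rmse` is a hypothetical helper name and the numbers are made up so that every error is exactly 1.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt(mean((true - predicted)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Every prediction is off by exactly 1, so the RMSE is 1.0.
print(rmse([2, 4, 6, 8], [1, 5, 5, 9]))
```

RMSE is expressed in the same units as the target, so here it would read as a typical error in sales units.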

Similarly, we can find RMSE for the train dataset.

But for this, we need the predictions for the train dataset (i.e. both the true and the predicted values of the train dataset)

In [54]:
# calculating rmse for the train dataset

np.sqrt(np.mean(np.power((np.array(y_train)-np.array(lreg.predict(x_train))),2)))

1013.0040784390595

In [55]:
# storing the rmse in a variable

rmse_train = np.sqrt(np.mean(np.power((np.array(y_train)-np.array(lreg.predict(x_train))),2)))

In [56]:
# print the rmse for both test and train

print(rmse_train)
print(rmse_test)

1013.0040784390595
1255.3981735645664


#### So here we can see that the RMSE on train is 1013 while on test it is 1255.
> So it's the same problem again: either our test sample is not representative of the train dataset, or our model has overfitted the train dataset. We will address these problems later on.

## This was how we can create a linear regression model in scikit-learn