## Multiple Linear Regression

Now you know how to build a model with one X (feature variable) and Y (response variable). But what if you have three feature variables, or maybe 10 or 100? Building a separate model for each of them, combining them, and then interpreting the result would be next to impossible. With multiple linear regression, you can build a single model that relates a response variable to many feature variables.
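
For reference, the model that multiple linear regression fits can be written as follows, where $p$ is the number of feature variables. The symbols $\beta_j$ and $\varepsilon$ are notation introduced here for illustration and are not used elsewhere in this notebook:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
```

Here $\beta_0$ is the intercept, each $\beta_j$ is the coefficient of feature $X_j$, and $\varepsilon$ is the error term. With $p = 1$ this reduces to simple linear regression.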

Let's see how to do that.

### Step_1: Importing and Understanding Data

In [None]:
import pandas as pd

In [None]:
# Importing advertising.csv
advertising_multi = pd.read_csv('advertising.csv')

In [None]:
# Looking at the first five rows
advertising_multi.head()

In [None]:
# Looking at the last five rows
advertising_multi.tail()

In [None]:
# What type of values are stored in the columns?
advertising_multi.info()

In [None]:
# Let's look at some statistical information about our dataframe.
advertising_multi.describe()

### Step_2: Visualising Data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Let's plot a pair plot of all variables in our dataframe

# Note: the Radio vs Sales, Newspaper vs Sales and Radio vs Newspaper panels all show widely scattered points
sns.pairplot(advertising_multi)
print(advertising_multi)

In [None]:
# Visualise the relationship between the features and the response using scatterplots
# Note: seaborn renamed the `size` argument of pairplot to `height`
sns.pairplot(advertising_multi, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='scatter')

### Step_3: Splitting the Data for Training and Testing

In [None]:
# Putting the feature variables into X
X = advertising_multi[['TV','Radio','Newspaper']]

# Putting response variable to y
y = advertising_multi['Sales']

In [None]:
# random_state is the seed used by the random number generator. It can be any integer.
# Older scikit-learn versions exposed train_test_split in sklearn.cross_validation;
# current versions provide it in sklearn.model_selection.
# A test set is held out so that predictions can be checked against data the model has not seen.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)
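
To make the role of `random_state` concrete, here is a simplified sketch of a seeded 70/30 split in plain Python. It is not scikit-learn's actual implementation, and the helper name `simple_train_test_split` is hypothetical; it only illustrates that the same seed always yields the same partition.

```python
import random

def simple_train_test_split(rows, train_size=0.7, seed=100):
    """Shuffle row indices with a seeded generator, then cut at train_size.

    Hypothetical helper for illustration only; scikit-learn's
    train_test_split uses its own shuffling logic.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # seeded, so the shuffle is reproducible
    cut = int(len(rows) * train_size)         # number of rows in the training part
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

rows = list(range(10))                        # stand-in for 10 data rows
train_a, test_a = simple_train_test_split(rows, seed=100)
train_b, test_b = simple_train_test_split(rows, seed=100)

print(len(train_a), len(test_a))              # sizes of the two parts
print(train_a == train_b)                     # same seed, same split
```

Every row ends up in exactly one of the two parts, and rerunning with the same seed reproduces the split, which is what makes results comparable between runs.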

### Step_4: Performing Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Representing LinearRegression as lm (creating a LinearRegression object)
lm = LinearRegression()

In [None]:
# fit the model to the training data
lm.fit(X_train,y_train)

### Step_5: Model Evaluation

In [None]:
# Print the intercept and the coefficients
print(lm.intercept_)
print(lm.coef_)

In [None]:
# Let's see the coefficient of each feature
coeff_df = pd.DataFrame(lm.coef_, index=X_test.columns, columns=['Coefficient'])
coeff_df

From the above result we may infer that if the TV value (advertising spend) increases by 1 unit while Radio and Newspaper are held fixed, predicted sales change by about 0.045 units.
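
A quick way to check this reading of a coefficient is to raise one feature by one unit while holding the others fixed and compare the two predictions. The sketch below uses the TV coefficient of 0.045 quoted above; the intercept and the Radio and Newspaper coefficients are made-up placeholder values, not output from this notebook.

```python
# Placeholder model: only the TV coefficient (0.045) comes from the text above;
# the intercept and the other coefficients are hypothetical values for illustration.
intercept = 3.0
coefficients = {"TV": 0.045, "Radio": 0.19, "Newspaper": 0.005}

def predict(row):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * value for name, value in row.items())

base = {"TV": 100.0, "Radio": 20.0, "Newspaper": 30.0}
bumped = dict(base, TV=base["TV"] + 1.0)      # TV up by one unit, others unchanged

change = predict(bumped) - predict(base)
print(round(change, 3))                       # matches the TV coefficient
```

The same check works for any feature: the difference between the two predictions is that feature's coefficient, whatever the values of the other features.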

### Step_6: Predictions

In [None]:
# Making predictions using the model
y_pred = lm.predict(X_test)

### Step_7: Calculating Error Terms

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

In [None]:
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
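
For intuition, both metrics can be computed by hand. The sketch below uses small made-up lists rather than the notebook's `y_test` and `y_pred`, and follows the standard definitions of mean squared error and R-squared.

```python
# Made-up values for illustration; not taken from the advertising data.
y_true = [10.0, 12.0, 15.0, 18.0, 20.0]
y_hat = [11.0, 11.0, 16.0, 17.0, 21.0]

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_hat)]

# Mean squared error: average of the squared residuals.
mse_manual = sum(r ** 2 for r in residuals) / n

# R-squared: 1 - (residual sum of squares / total sum of squares).
mean_y = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2_manual = 1 - ss_res / ss_tot

print('Mean_Squared_Error :', mse_manual)
print('r_square_value :', round(r2_manual, 4))
```

A model that always predicted the mean of `y_true` would have `ss_res == ss_tot` and therefore an R-squared of 0, which is why R-squared is read as an improvement over that baseline.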

### Optional Step: Checking for P-values Using statsmodels

In [None]:
# Note: statsmodels is used here only for its statistical report (p-values, R-squared and so on)
import statsmodels.api as sm
X_train_sm = X_train
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so you need to call sm.add_constant(X) to add one.
# add_constant adds the intercept c in y = m*x + c; statsmodels has to be told explicitly to include it.
X_train_sm = sm.add_constant(X_train_sm)
# Create a fitted model in one line
# OLS: ordinary least squares, which minimises the sum of squared errors e1^2 + e2^2 + ... + en^2
lm_1 = sm.OLS(y_train, X_train_sm).fit()

# Print the coefficients
lm_1.params

R-squared (R²) measures how much of the variation in the output (response) variable is explained by the input variables. For example, an R-squared of 0.8 means that 80% of the variation in the output variable is explained by the input variables. In simple terms, the higher the R-squared, the more variation the inputs explain and, other things being equal, the better the model fits the data.

However, the problem with R-squared is that it either stays the same or increases whenever more variables are added, even if they have no relationship with the output variable. This is where “Adjusted R-squared” helps: it penalizes you for adding variables that do not improve the existing model.

Hence, if you are building a linear regression on multiple variables, it is generally recommended to use Adjusted R-squared to judge the goodness of the model. If you have only one input variable, R-squared and Adjusted R-squared are very close for reasonably large samples, because the penalty for a single variable is small.

Typically, the more non-significant variables you add to the model, the wider the gap between R-squared and Adjusted R-squared becomes.

https://blog.minitab.com/blog/adventures-in-statistics-2/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables
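
The penalty can be made concrete with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper name `adjusted_r2` and the numbers below are illustrative assumptions, not results from this notebook.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept.

    r2: ordinary R-squared, n_obs: number of rows, n_predictors: number of features.
    """
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Same R-squared and sample size, increasing number of predictors:
for p in (1, 3, 10):
    print(p, round(adjusted_r2(0.90, 140, p), 4))
```

For a fixed R-squared, the adjusted value falls as predictors are added, so a new variable raises it only if it improves the fit by more than the penalty.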

In [None]:
# summary() prints the statistical report for the fitted model.
# The report below is for the training data, not for the full data set.
# From the report, look at the p-values, R-squared, Adj. R-squared, AIC, BIC and the coefficient values.
# Guideline: never let information from the test data leak into model building.
# R-squared keeps increasing (or stays the same) as variables are added to the model;
# Adj. R-squared corrects for this by penalising variables that do not improve the fit.
print(lm_1.summary())

In [None]:
# Newspaper      0.0046      0.008      0.613      0.541      -0.010       0.019 
# Columns above: coef, std err, t, P>|t|, [0.025, 0.975]
# The p-value for Newspaper (0.541) is far above the usual 0.05 threshold, so Newspaper can be removed.
# The p-value is calculated from the t statistic.

From the above we can see that Newspaper is statistically insignificant.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
print(advertising_multi.corr())

“Covariance” indicates the direction of the linear relationship between two variables. “Correlation”, on the other hand, measures both the strength and the direction of that relationship. Correlation is a function of the covariance: it divides the covariance by the product of the two standard deviations, which scales the value to the range -1 to +1. A positive sign means the variables tend to move together, i.e. when one increases, the other tends to increase as well; a negative sign means they tend to move in opposite directions.
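
The scaling step can be written out directly: correlation is the covariance divided by the product of the two standard deviations. The sketch below uses made-up numbers and hypothetical helper names (`covariance`, `correlation`); `DataFrame.corr()` computes the same Pearson coefficient by default.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up example values, not taken from advertising.csv.
spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [12.0, 19.0, 33.0, 38.0, 52.0]
sales_rescaled = [s * 1000 for s in sales]    # same data in different units

print(covariance(spend, sales), covariance(spend, sales_rescaled))
print(round(correlation(spend, sales), 4), round(correlation(spend, sales_rescaled), 4))
```

Covariance changes with the units of measurement, while correlation does not, which is why correlation is the more convenient measure for comparing pairs of variables.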
    

In [None]:
plt.figure(figsize = (5,5))
sns.heatmap(advertising_multi.corr(),annot = True)
# Correlation values always lie in the range -1 to +1.

# Note: with seaborn's default colour map, the darkest cells are the weakest correlations and the lightest cells the strongest.

### Step_8: Implementing the results and running the model again

From the data above, you can conclude that Newspaper is insignificant.

In [None]:
# Removing Newspaper from our dataset
X_train_new = X_train[['TV','Radio']]
X_test_new = X_test[['TV','Radio']]

In [None]:
# Model building
lm.fit(X_train_new,y_train)

In [None]:
# Making predictions
y_pred_new = lm.predict(X_test_new)

In [None]:
#Actual vs Predicted
c = list(range(1, len(y_test) + 1))            # one index per test observation
fig = plt.figure()
plt.plot(c, y_test, color="blue", linewidth=2.5, linestyle="-")      # actual values
plt.plot(c, y_pred_new, color="red", linewidth=2.5, linestyle="-")   # predictions from the refitted model
fig.suptitle('Actual and Predicted', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('Sales', fontsize=16)                               # Y-label

In [None]:
# Error terms
c = list(range(1, len(y_test) + 1))            # one index per test observation
fig = plt.figure()
plt.plot(c, y_test - y_pred_new, color="blue", linewidth=2.5, linestyle="-")   # residuals of the refitted model
fig.suptitle('Error Terms', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                      # X-label
plt.ylabel('ytest-ypred', fontsize=16)                # Y-label

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred_new)
r_squared = r2_score(y_test, y_pred_new)

In [None]:
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)

In [None]:
X_train_final = X_train_new
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so you need to call sm.add_constant(X) to add one.
X_train_final = sm.add_constant(X_train_final)
# create a fitted model in one line
lm_final = sm.OLS(y_train,X_train_final).fit()

print(lm_final.summary())

### Model Refinement Using RFE

The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a `coef_` attribute or through a `feature_importances_` attribute. Then, the least important features are pruned from the current set of features. This procedure is recursively repeated on the pruned dataset until the desired number of features to select is reached.
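
The loop described above can be sketched in plain Python. This is a simplified illustration rather than scikit-learn's implementation: the helper names (`solve`, `fit_ols`, `simple_rfe`) and the tiny noise-free data set are assumptions made for the example, and importance is taken to be the absolute value of each fitted coefficient.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(X, y, cols):
    """Least-squares fit with an intercept on the chosen columns; returns the slopes."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]   # prepend intercept column
    k = len(cols) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)[1:]                              # drop the intercept

def simple_rfe(X, y, n_features_to_select):
    """Repeatedly refit and drop the feature with the smallest |coefficient|."""
    remaining = list(range(len(X[0])))
    ranking = [1] * len(X[0])
    while len(remaining) > n_features_to_select:
        coefs = fit_ols(X, y, remaining)
        weakest = remaining[min(range(len(coefs)), key=lambda i: abs(coefs[i]))]
        ranking[weakest] = len(remaining) - n_features_to_select + 1
        remaining.remove(weakest)
    support = [i in remaining for i in range(len(X[0]))]
    return support, ranking

# Made-up data: y depends on the first two columns only.
X = [[1, 2, 5], [2, 1, 3], [3, 4, 1], [4, 3, 6], [5, 6, 2], [6, 5, 4]]
y = [2 + 3 * row[0] + 0.5 * row[1] for row in X]

support, ranking = simple_rfe(X, y, n_features_to_select=2)
print(support)   # which features were kept
print(ranking)   # 1 = kept; larger numbers were eliminated earlier
```

Because raw coefficients depend on each feature's scale, this kind of ranking is more meaningful when the features are on comparable scales.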

In [None]:
from sklearn.feature_selection import RFE

In [None]:

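# n_features_to_select must be passed as a keyword argument in current scikit-learn versions
rfe = RFE(lm, n_features_to_select=2)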
print(rfe)

In [None]:
rfe = rfe.fit(X_train, y_train)

In [None]:
print(rfe.support_)
print(rfe.ranking_)

### Simple Linear Regression: Newspaper(X) and Sales(y)

In [None]:
import pandas as pd
import numpy as np
# Importing dataset
advertising_multi = pd.read_csv('advertising.csv')

x_news = advertising_multi['Newspaper']

y_news = advertising_multi['Sales']

# Data Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_news, y_news, 
                                                    train_size=0.7 , 
                                                    random_state=110)

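# scikit-learn expects a 2-D feature array, so the single column must be reshaped.
# This is required only in the case of simple linear regression.
X_train = X_train.values[:, np.newaxis]
X_test = X_test.values[:, np.newaxis]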

# Linear regression from sklearn
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

# Fitting the model
lm.fit(X_train,y_train)

# Making predictions
y_pred = lm.predict(X_test)

# Importing mean square error and r square from sklearn library.
from sklearn.metrics import mean_squared_error, r2_score

# Computing mean square error and R square value
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

# Printing mean square error and R square value
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
print(lm.score(X_test, y_test))   # R-squared on the test set

Note: R-squared is the proportion of the variance in y that is explained by the input variables X.

In [None]:
# VIF :