<b> <u> Multiple Linear Regression </u> </b> <br>
<b> <u> News Case Study </u> </b> <br>
Problem Statement:
Essentially, the company wants:

- To identify the variables affecting the dependent variable

- To create a linear model that lets you quantitatively infer their effect on the dependent variable



**So interpretation is important!**

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
news_train = pd.read_csv("C:/Users/Administrator/Desktop/Upgrad Case Study/Multiple linear regression - news/train_file.csv")

In [None]:
news_train.head(2)

In [None]:
news_test = pd.read_csv("C:/Users/Administrator/Desktop/Upgrad Case Study/Multiple linear regression - news/test_file.csv")

In [None]:
news_test.head(2)

In [None]:
# import preprocessing from sklearn
from sklearn import preprocessing

In [None]:
news_train.shape

In [None]:
news_train.columns

In [None]:
news_train.dtypes

In [None]:
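# Cast the text, date, and sentiment columns to str before label encoding.
# Note: the numeric sentiment scores become category labels after this cast.
news_train.IDLink = news_train.IDLink.astype(str)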
news_train.Title = news_train.Title.astype(str)
news_train.Headline = news_train.Headline.astype(str)
news_train.Source = news_train.Source.astype(str)
news_train.Topic = news_train.Topic.astype(str)
news_train.PublishDate = news_train.PublishDate.astype(str)
news_train.SentimentTitle = news_train.SentimentTitle.astype(str)
news_train.SentimentHeadline = news_train.SentimentHeadline.astype(str)

In [None]:
news_train.dtypes

In [None]:
# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
news_train_2 = news_train.apply(le.fit_transform)
news_train_2.head(1000)
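
To make the encoding step concrete, here is a minimal sketch on a made-up column (the values and the `le_demo` name are hypothetical, not taken from the news data). `LabelEncoder` maps each distinct value to an integer between 0 and n_classes-1. For strings the classes are sorted as text, so numbers that were cast to `str` do not keep their numeric order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values, stored as strings like the columns cast above
values = ["10", "9", "2", "9"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(values)

# Classes are sorted as strings, so "10" comes before "2"
print(list(le_demo.classes_))
print(list(codes))
```

This is worth keeping in mind when a numeric column such as a sentiment score is encoded this way: the resulting codes rank the values by their text form, not by magnitude.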

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Apply the scaler to every column listed in num_vars
num_vars = ["IDLink","Title","Headline","Source","Topic","PublishDate","Facebook","GooglePlus","LinkedIn","SentimentTitle","SentimentHeadline"]

news_train_2[num_vars] = scaler.fit_transform(news_train_2[num_vars])

news_train_2.head()

In [None]:
news_train_2.isnull().sum()

As the dataset above shows, the min-max scaler rescales all values to the range 0 to 1.
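
As a small worked example of what min-max scaling does, the sketch below applies the formula (x - min) / (max - min) by hand and compares it with `MinMaxScaler`. The three input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data
x = np.array([[2.0], [4.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min), column by column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(scaled.ravel())
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.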

### Dividing into X and Y sets for the model building

- Here, `SentimentTitle` is the target variable of the first model

- `SentimentHeadline` is the target variable of the second model, built later in this notebook

In [None]:
y_train = news_train_2.pop('SentimentTitle')
X_train = news_train_2

## Building our model

This time, we will be using the **`LinearRegression` class from scikit-learn** for its compatibility with RFE (which is a utility from sklearn)

### RFE
Recursive feature elimination
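
RFE repeatedly fits the estimator and discards the weakest feature until the requested number remains. The sketch below illustrates this on synthetic data where only two of five columns influence the target; the data, the seed, and the `selector` name are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 0 and 1 drive the target; the other columns are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Keep the two strongest features
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # True for the selected columns
print(selector.ranking_)   # 1 for selected columns, larger for earlier eliminations
```

`support_` and `ranking_` are the same attributes inspected in the cells that follow.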

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the number of output variables set to 7
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=7)             # running RFE
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

After automatic selection with `RFE`, the columns `'IDLink', 'Title', 'Source'` were excluded.

In [None]:
X_train_new = X_train.drop(["IDLink", "Title", "Source"], axis = 1)

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_new = X_train[col]

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_new = sm.add_constant(X_train_new)

In [None]:
lm = sm.OLS(y_train,X_train_new).fit()   # Running the linear model

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

In [None]:
lm.params

### Checking VIF

Variance Inflation Factor, or VIF, gives a basic quantitative idea of how much the feature variables are correlated with each other. It is an extremely important parameter for testing our linear model. With $R_i^2$ denoting the R-squared obtained by regressing feature $i$ on the remaining features, the formula for calculating `VIF` is:

### $ VIF_i = \frac{1}{1 - {R_i}^2} $

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### We will now move to manual feature elimination, based on significance (p-value) and VIF

LinkedIn             -0.0102      0.014     -0.723      `0.469` <br>
The p-value of `LinkedIn` is `0.469`, which is above 0.05, so the variable can be dropped

In [None]:
X_train_new.columns

In [None]:
X_train_new = X_train_new.drop(['LinkedIn'], axis=1)

In [None]:
# Build a second fitted model, without LinkedIn
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

Facebook              0.0101      0.010      0.977      `0.329` <br>
The p-value of `Facebook` is `0.329`, which is above 0.05, so the variable can be dropped

In [None]:
X_train_new.columns

In [None]:
X_train_new = X_train_new.drop(['Facebook'], axis=1)

In [None]:
# Refit the model without Facebook
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

In [None]:
X_train_new = X_train_new.drop(['PublishDate'], axis=1)

In [None]:
# Refit the model without PublishDate
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

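Here the `VIF` reported for `const` is high (8.02). Since `const` is the intercept column rather than a feature, it is removed from `X_train_new` before the VIFs are recomputed; `sm.add_constant` adds it back when the model is refit.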

In [None]:
X_train_new = X_train_new.drop(['const'], axis=1)

In [None]:
# Refit the model; add_constant re-adds the intercept column
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

As the outputs above show, all the VIF values are less than `5` and all the p-values are less than `0.05`

## Residual Analysis of the train data

So, now to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.

In [None]:
y_train_cnt = lr_2.predict(X_train_lm)

In [None]:
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_cnt), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

## Making Predictions

#### Applying the scaling on the test sets

In [None]:
# import preprocessing from sklearn
from sklearn import preprocessing

In [None]:
news_test.shape

In [None]:
news_test.columns

In [None]:
news_test.dtypes

In [None]:
news_test.IDLink = news_test.IDLink.astype(str)
news_test.Title = news_test.Title.astype(str)
news_test.Headline = news_test.Headline.astype(str)
news_test.Source = news_test.Source.astype(str)
news_test.Topic = news_test.Topic.astype(str)
news_test.PublishDate = news_test.PublishDate.astype(str)


In [None]:
news_test['SentimentTitle'] = 0
news_test['SentimentHeadline'] = 0

In [None]:
news_test.head(1)

In [None]:
news_test.dtypes

In [None]:
news_test.SentimentTitle = news_test.SentimentTitle.astype(str)
news_test.SentimentHeadline = news_test.SentimentHeadline.astype(str)

In [None]:
news_test.dtypes

In [None]:
# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
news_test_2 = news_test.apply(le.fit_transform)
news_test_2.head(1000)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Apply the scaler to every column listed in num_vars
num_vars = ["IDLink","Title","Headline","Source","Topic","PublishDate","Facebook","GooglePlus","LinkedIn","SentimentTitle","SentimentHeadline"]

news_test_2[num_vars] = scaler.fit_transform(news_test_2[num_vars])

news_test_2.head()

Here, the scaling of the test data is complete. The values lie in the range 0 to 1.
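
A common convention, sketched here on made-up numbers, is to fit the scaler on the training data only and then call `transform` on the test data, so that both sets share the same minimum and maximum. The frames, values, and the `scaler_demo` name below are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train and test columns
train = pd.DataFrame({"Facebook": [0.0, 10.0, 20.0]})
test = pd.DataFrame({"Facebook": [5.0, 30.0]})

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train)   # learns min and max from train
test_scaled = scaler_demo.transform(test)         # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

With this approach a test value beyond the training range lands outside 0 to 1, which is expected: the test set is measured on the scale the model was trained on.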

#### Applying the scaling on the test sets

In [None]:
num_vars = ['IDLink','Title','Headline','Source','Topic', 'PublishDate','Facebook','GooglePlus','LinkedIn','SentimentTitle','SentimentHeadline']

In [None]:
news_test_2[num_vars] = scaler.transform(news_test_2[num_vars])

#### Dividing into X_test and y_test

In [None]:
y_test = news_test_2.pop('SentimentTitle')
X_test = news_test_2

In [None]:
# Now let's use our model to make predictions.

# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test[X_train_new.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
# Making predictions
y_pred = lr_2.predict(X_test_new)


In [None]:
lr_2.params

## Model Evaluation

In [None]:
# Plotting y_test and y_pred to understand the spread.
# Note: y_test holds the placeholder values assigned earlier, not true labels.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)                          # Y-label

In [None]:
y_pred

In [None]:
print("The 'SentimentTitle' predicted as per the model is ", y_pred)

Here is the test dataframe of unseen data on which `SentimentTitle` is predicted:

In [None]:
news_test.head(2)

In [None]:
type(y_pred)

In [None]:
news_test2 = y_pred.to_frame()

In [None]:
news_test2.head(4)

In [None]:
news_test_final = news_test.join(news_test2)

In [None]:
news_test_final.columns

In [None]:
news_test_final = news_test_final.drop(columns=['SentimentTitle'])

In [None]:
news_test_final = news_test_final.rename(columns={0: 'SentimentTitle'})

Here, the prediction on the `test` data is complete. <br>
The test data with the predicted `SentimentTitle`:
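
The steps used above to attach the predictions (`to_frame`, `join`, `drop`, `rename`) can be seen end to end on a tiny example. The frame `test_demo` and the prediction values are hypothetical.

```python
import pandas as pd

test_demo = pd.DataFrame({"IDLink": ["a1", "b2", "c3"], "SentimentTitle": [0, 0, 0]})
pred_demo = pd.Series([0.12, 0.34, 0.56])   # unnamed Series of predictions

# An unnamed Series becomes a one-column frame whose column label is 0
pred_frame = pred_demo.to_frame()

joined = (
    test_demo.drop(columns=["SentimentTitle"])   # remove the placeholder column
    .join(pred_frame)                            # align on the row index
    .rename(columns={0: "SentimentTitle"})       # give the predictions a name
)
print(joined)
```

The join aligns rows by index, so it relies on the predictions keeping the same index as the test frame.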

In [None]:
news_test_final.head(3)

In [None]:
news_test_final = news_test_final.drop(columns=['SentimentHeadline'])

# <u> SentimentHeadline : Predicting on the unseen Test Data </u>

In [None]:
news_train.dtypes

In [None]:
news_train.IDLink = news_train.IDLink.astype(str)
news_train.Title = news_train.Title.astype(str)
news_train.Headline = news_train.Headline.astype(str)
news_train.Source = news_train.Source.astype(str)
news_train.Topic = news_train.Topic.astype(str)
news_train.PublishDate = news_train.PublishDate.astype(str)
news_train.SentimentTitle = news_train.SentimentTitle.astype(str)
news_train.SentimentHeadline = news_train.SentimentHeadline.astype(str)

In [None]:
news_train.dtypes

In [None]:
# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
news_train_2 = news_train.apply(le.fit_transform)
news_train_2.head(1000)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Apply the scaler to every column listed in num_vars
num_vars = ["IDLink","Title","Headline","Source","Topic","PublishDate","Facebook","GooglePlus","LinkedIn","SentimentTitle","SentimentHeadline"]

news_train_2[num_vars] = scaler.fit_transform(news_train_2[num_vars])

news_train_2.head()

Here, the train data is scaled to the range 0 to 1

In [None]:
news_train_2.isnull().sum()

As the output above shows, no null values were introduced by the min-max scaling to the range 0 to 1

In [None]:
y_train = news_train_2.pop('SentimentHeadline')
X_train = news_train_2

## Building our model - Sentiment Headline

As before, we will be using the **`LinearRegression` class from scikit-learn** for its compatibility with RFE (which is a utility from sklearn)

### RFE
Recursive feature elimination

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the number of output variables set to 7
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=7)             # running RFE
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

Dropping the columns listed above, as identified by `RFE`

In [None]:
X_train_new = X_train.drop(["IDLink","Source","LinkedIn"], axis = 1)

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_new = X_train[col]

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_new = sm.add_constant(X_train_new)

In [None]:
lm = sm.OLS(y_train,X_train_new).fit()   # Running the linear model

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

### Checking VIF

Variance Inflation Factor, or VIF, gives a basic quantitative idea of how much the feature variables are correlated with each other. It is an extremely important parameter for testing our linear model. With $R_i^2$ denoting the R-squared obtained by regressing feature $i$ on the remaining features, the formula for calculating `VIF` is:

### $ VIF_i = \frac{1}{1 - {R_i}^2} $

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### We will now move to manual feature elimination, based on significance (p-value) and VIF

Title             -0.0036      0.004     -0.852      0.394 <br>
The `Significance Value` of `Title` is 0.394, which is above 0.05. Hence we are dropping it

In [None]:
X_train_new.columns

In [None]:
X_train_new = X_train_new.drop(['Title'], axis=1)

In [None]:
# Refit the model without Title
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

Headline           0.0049      0.004      1.191      0.234 <br>
The `Significance Value` of `Headline` is 0.234, which is above 0.05. Hence we are dropping it

In [None]:
X_train_new = X_train_new.drop(['Headline'], axis=1)

In [None]:
# Refit the model without Headline
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

Facebook          -0.0157      0.012     -1.362      0.173 <br>
The `Significance Value` of `Facebook` is 0.173, which is above 0.05. Hence we are dropping it

In [None]:
X_train_new = X_train_new.drop(['Facebook'], axis=1)

In [None]:
# Refit the model without Facebook
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

GooglePlus         0.0230      0.019      1.196      0.232 <br>
The `Significance Value` of `GooglePlus` is 0.232, which is above 0.05. Hence we are dropping it

In [None]:
X_train_new = X_train_new.drop(['GooglePlus'], axis=1)

In [None]:
# Refit the model without GooglePlus
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

PublishDate        0.0059      0.004      1.438      0.150 <br>
The `Significance Value` of `PublishDate` is 0.150, which is above 0.05. Hence we are dropping it

In [None]:
X_train_new = X_train_new.drop(['PublishDate'], axis=1)

In [None]:
# Refit the model without PublishDate
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train_new = X_train_new.drop(['const'], axis=1)

In [None]:
# Refit the model; add_constant re-adds the intercept column
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

This model is fine, hence we can go ahead and accept it

## Residual Analysis of the train data

So, now to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.

In [None]:
y_train_cnt = lr_2.predict(X_train_lm)

In [None]:
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_cnt), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18) 

## Making Predictions

#### Applying the scaling on the test sets

In [None]:
# import preprocessing from sklearn
from sklearn import preprocessing

In [None]:
news_test.shape

In [None]:
news_test.columns

In [None]:
news_test.dtypes

In [None]:
news_test.IDLink = news_test.IDLink.astype(str)
news_test.Title = news_test.Title.astype(str)
news_test.Headline = news_test.Headline.astype(str)
news_test.Source = news_test.Source.astype(str)
news_test.Topic = news_test.Topic.astype(str)
news_test.PublishDate = news_test.PublishDate.astype(str)


In [None]:
news_test['SentimentTitle'] = 0
news_test['SentimentHeadline'] = 0

In [None]:
news_test.head(1)

In [None]:

news_test.dtypes

In [None]:
news_test.SentimentTitle = news_test.SentimentTitle.astype(str)
news_test.SentimentHeadline = news_test.SentimentHeadline.astype(str)

In [None]:
news_test.dtypes

In [None]:
# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
news_test_2 = news_test.apply(le.fit_transform)
news_test_2.head(1000)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:

# Apply the scaler to every column listed in num_vars
num_vars = ["IDLink","Title","Headline","Source","Topic","PublishDate","Facebook","GooglePlus","LinkedIn","SentimentTitle","SentimentHeadline"]

news_test_2[num_vars] = scaler.fit_transform(news_test_2[num_vars])

news_test_2.head()

In [None]:
#### Applying the scaling on the test sets

num_vars = ['IDLink','Title','Headline','Source','Topic', 'PublishDate','Facebook','GooglePlus','LinkedIn','SentimentTitle','SentimentHeadline']

news_test_2[num_vars] = scaler.transform(news_test_2[num_vars])

In [None]:

#### Dividing into X_test and y_test

y_test = news_test_2.pop('SentimentHeadline')
X_test = news_test_2

In [None]:

# Now let's use our model to make predictions.

# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test[X_train_new.columns]

In [None]:
# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)


In [None]:
# Making predictions
y_pred = lr_2.predict(X_test_new)


In [None]:
print("The 'SentimentHeadline' predicted as per the model is ", y_pred)

In [None]:
news_test.head(2)

In [None]:
type(y_pred)

In [None]:
news_test2 = y_pred.to_frame()

In [None]:
news_test2.head(4)

In [None]:
news_test_final2 = news_test.join(news_test2)

In [None]:
news_test_final2.columns

In [None]:
news_test_final2 = news_test_final2.drop(columns=['SentimentHeadline'])

In [None]:
news_test_final2 = news_test_final2.rename(columns={0: 'SentimentHeadline'})

In [None]:
news_test_final2.head(2)

In [None]:
news_test_final2 = news_test_final2.drop(columns=['SentimentTitle'])

In [None]:
news_test_final3 = news_test_final2[["IDLink", "SentimentHeadline"]]

In [None]:
news_test_final3.head(3)

In [None]:
result = pd.merge(news_test_final, news_test_final3, on='IDLink')

In [None]:
result.head(3)
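
The merge on `IDLink` that builds `result` can be illustrated with two small frames. The frames and values below are hypothetical; `pd.merge` with `on=` performs an inner join by default.

```python
import pandas as pd

title_demo = pd.DataFrame({"IDLink": ["a1", "b2"], "SentimentTitle": [0.1, 0.2]})
headline_demo = pd.DataFrame({"IDLink": ["b2", "a1"], "SentimentHeadline": [0.4, 0.3]})

# Rows are matched on the shared key, regardless of their order in each frame
merged = pd.merge(title_demo, headline_demo, on="IDLink")
print(merged)
```

If a key appears more than once in either frame, the merge produces one row per matching pair, so the row count of the result is worth checking against the inputs.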

### <u> The Final Data Set After Predicting 'SentimentTitle' and 'SentimentHeadline' </u>
#### This is done on the unknown/test data

In [None]:
result.head(10)

In [None]:
result.describe()

In [None]:
result.info()