### LSE Data Analytics Online Career Accelerator 
# Course 301: Data Analytics with Python

## Practical activity: Conducting multiple linear regression using Python

**This is the solution to the activity.**

In multiple linear regression (MLR), you add one or more extra independent variables to the regression. Real-world data sets usually contain more than two variables, and MLR lets you model them together to produce predictions that can inform business decisions. This activity builds on the earlier simple linear regression practical exercise, this time with an additional independent variable.

The main objective is to run multiple linear regression on three variables (two independent, one dependent) to predict future median business values. You’ll divide the data into training and testing subsets, fit the model on the training subset, and check it with OLS. You’ll also check for multicollinearity and heteroscedasticity.

## 1. Prepare your workstation

In [None]:
# import all the necessary packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.api as sms
import sklearn
import matplotlib.pyplot as plt

from sklearn import datasets 
from sklearn import linear_model
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols

## 2. Import data set

In [None]:
df_ecom = pd.read_csv('Ecommerce_data.csv')

df_ecom

In [None]:
# view DataFrame
df_ecom.info()

## 3. Define variables

In [None]:
# dependent variable
y = df_ecom['Median_s'] 

# independent variables
X = df_ecom[['avg_no_it', 'tax']] 

In [None]:
# create train and test data sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [None]:
multi = LinearRegression()  
multi.fit(x_train, y_train)

In [None]:
multi.predict(x_train)

In [None]:
# Checking the value of R-squared, intercept and coefficients
print("R-squared: ", multi.score(x_train, y_train))
print("Intercept: ", multi.intercept_)
print("Coefficients:")
list(zip(x_train, multi.coef_))

In [None]:
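# make a prediction for one new observation (values in the same column order as X)
New_Value1 = 5.75   # avg_no_it
New_Value2 = 15.2   # tax
new_obs = pd.DataFrame([[New_Value1, New_Value2]], columns=X.columns)
print('Predicted Value: \n', multi.predict(new_obs))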

## 4. Training and testing subsets with MLR

In [None]:
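# fit OLS on the train subset; add_constant adds the intercept column
model = sm.OLS(y_train, sm.add_constant(x_train)).fit()
# fitted values for the train subset
Y_pred = model.predict(sm.add_constant(x_train))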
print_model = model.summary()

print(print_model)

In [None]:
print(multi.score(x_train,y_train)*100)

## 5. Check the model with OLS

In [None]:
# run regression on the train subset
mlr = LinearRegression()  
mlr.fit(x_train, y_train)

In [None]:
model = sm.OLS(y_train, sm.add_constant(x_train)).fit()
Y_pred = model.predict(sm.add_constant(x_train))
print_model = model.summary()
print(print_model)

In [None]:
# predictions on the train subset
y_pred_mlr = mlr.predict(x_train)
print("Prediction for train set: {}".format(y_pred_mlr))

In [None]:
print(mlr.score(x_train,y_train)*100)

In [None]:
meanAbErr = metrics.mean_absolute_error(y_train, y_pred_mlr)
meanSqErr = metrics.mean_squared_error(y_train, y_pred_mlr)

# report R-squared on the same (train) subset as the error metrics
print('R squared (%): {:.2f}'.format(mlr.score(x_train, y_train)*100))
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)

In [None]:
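# predict for one new observation (values in the same column order as X)
New_Value1 = 5.75   # avg_no_it
New_Value2 = 15.2   # tax
new_obs = pd.DataFrame([[New_Value1, New_Value2]], columns=X.columns)
print('Predicted Value: \n', mlr.predict(new_obs))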

In [None]:
# check multicollinearity with variance inflation factors (VIF)
x_temp = sm.add_constant(x_train)

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x_temp.values, i)
                     for i in range(x_temp.shape[1])]
vif["features"] = x_temp.columns
print(vif.round(1))
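
To make the VIF values easier to read, the sketch below computes them from the textbook definition, VIF = 1 / (1 - R²), where R² comes from regressing each feature on all the others. It is an illustration only: the helper `vif_from_r2` and the synthetic columns `a`, `b` and `c` are hypothetical and are not part of the activity's data set.

```python
import numpy as np

def vif_from_r2(X):
    """Return the VIF of each column of X using auxiliary regressions."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        # Regress column i on the remaining columns plus an intercept.
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic example (assumption): b is almost a copy of a, c is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)

vifs = vif_from_r2(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vifs):
    print(name, round(value, 1))
```

In this synthetic case the two nearly identical columns get large VIF values, while the unrelated column stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.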

In [None]:
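# Breusch-Pagan test for heteroscedasticity on the OLS residuals
# (stored under a new name so the fitted OLS model is not overwritten)
bp_test = sms.het_breuschpagan(model.resid, model.model.exog)

In [None]:
terms = ['LM stat', 'LM Test p-value', 'F-stat', 'F-test p-value']
print(dict(zip(terms, bp_test)))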

`Note:` Always fit the model on the train data and evaluate its performance on the test data: predict the test data and compare the predictions with the actual test values.
- Rerun the model on the test data and jot down your observations.
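
One possible shape for that rerun is sketched below. It uses synthetic data so that it runs on its own; the generated `X` and `y` and the names `y_pred_test`, `test_r2`, `test_mae` and `test_mse` are assumptions for illustration. In the activity you would reuse the existing `x_train`, `x_test`, `y_train` and `y_test` instead.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the activity data (assumption): two predictors.
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

# Fit on the train subset only.
mlr = LinearRegression().fit(x_train, y_train)

# Evaluate on the held-out test subset.
y_pred_test = mlr.predict(x_test)
test_r2 = mlr.score(x_test, y_test)
test_mae = metrics.mean_absolute_error(y_test, y_pred_test)
test_mse = metrics.mean_squared_error(y_test, y_pred_test)

print("Test R squared (%): {:.2f}".format(test_r2 * 100))
print("Test Mean Absolute Error:", test_mae)
print("Test Mean Square Error:", test_mse)
```

Comparing these test metrics with the train metrics shows how well the model generalises: a test R-squared far below the train value would point to overfitting.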