# Linear Regression Labskies

This lab will be very similar to a “Datathon”. During the next week you will work in groups on a dataset called `EcomExpense` to create the best possible model, using the modeling techniques learned in class, to predict the variable `Total.Spend`.

It is important for the lab that everyone uses the same validation setup, so you must all split the dataset 75%-25% at the start. Use the seed `2024` to obtain the same results as your peers.

We recommend starting with an exploratory study of the dataset to understand the data you will be working with.

You should explain the parameters of your model and how you arrived at it. It is also important that you check if there is multicollinearity among the variables or if there is any polynomial or interaction effect.
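One way to check for an interaction effect is a partial F-test that compares the model with and without the extra term. The sketch below is only an illustration on synthetic data: `x1`, `x2`, the coefficients, and the helper `sse` are all hypothetical and are not taken from `EcomExpense`.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Synthetic response built with a genuine interaction effect (assumption for the demo)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

def sse(X, y):
    """Sum of squared residuals of an OLS fit (X already contains the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
X_reduced = np.column_stack([ones, x1, x2])        # main effects only
X_full = np.column_stack([ones, x1, x2, x1 * x2])  # main effects plus interaction

sse_r, sse_f = sse(X_reduced, y), sse(X_full, y)
q = X_full.shape[1] - X_reduced.shape[1]           # number of added terms
f_stat = ((sse_r - sse_f) / q) / (sse_f / (n - X_full.shape[1]))
print("Partial F statistic:", round(f_stat, 2))
```

A large F statistic favors keeping the extra term. In a statsmodels formula, an interaction can be written as `Age:Record` (or `Age*Record` to include the main effects as well) and a squared term as `I(MonthlyIncome**2)`; the variable pairs here are only examples.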

Finally, you will have to defend your model using the different statistics and residual analysis that we have seen in class.

Good luck!

In [238]:
#We start by importing relevant libraries
import numpy as np
from scipy import stats
import pandas as pd
import csv
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.model_selection import train_test_split as tts

import itertools

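import joblib
import sys
# Some mlxtend releases import joblib via sklearn.externals, which recent
# scikit-learn versions no longer ship; alias it before importing mlxtend
sys.modules['sklearn.externals.joblib'] = joblib
from mlxtend.feature_selection import SequentialFeatureSelector as SFS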

from sklearn.linear_model import LinearRegression

from statsmodels.stats.outliers_influence import variance_inflation_factor

In [239]:
# Load the data from the CSV file, supplying our own column names
url = '/Users/luisinfanten/Desktop/IE/Classes/First-Year/Second-Semester/Simulating and Modelling/Models/Notebooks/LAB/EcomExpense.csv'
columns = ["TransactionID", "Age", "Items", "MonthlyIncome", "TransactionTime", "Record", "Gender", "CityTier", "TotalSpend"]
data = pd.read_csv(url, names = columns, header = 0)
# Drop the first column (TransactionID): it is an identifier, not a predictor
data = data.drop(data.columns[0], axis=1)
data.head()

Unnamed: 0,Age,Items,MonthlyIncome,TransactionTime,Record,Gender,CityTier,TotalSpend
0,42,10,7313,627.668127,5,Female,Tier 1,4198.385084
1,24,8,17747,126.904567,3,Female,Tier 2,4134.976648
2,47,11,22845,873.469701,2,Male,Tier 2,5166.614455
3,50,11,18552,380.219428,7,Female,Tier 1,7784.447676
4,60,2,14439,403.374223,2,Female,Tier 2,3254.160485

In [240]:
data = pd.get_dummies(data, columns=['Gender', 'CityTier'], drop_first=True)
data = data.rename(columns={'CityTier_Tier 2': 'CityTier_Tier_2', 'CityTier_Tier 3': 'CityTier_Tier_3'})
data

Unnamed: 0,Age,Items,MonthlyIncome,TransactionTime,Record,TotalSpend,Gender_Male,CityTier_Tier_2,CityTier_Tier_3
0,42,10,7313,627.668127,5,4198.385084,False,False,False
1,24,8,17747,126.904567,3,4134.976648,False,True,False
2,47,11,22845,873.469701,2,5166.614455,True,True,False
3,50,11,18552,380.219428,7,7784.447676,False,False,False
4,60,2,14439,403.374223,2,3254.160485,False,True,False
...,...,...,...,...,...,...,...,...,...
2357,50,7,5705,460.157207,3,2909.619546,True,True,False
2358,35,11,11202,851.924751,8,7968.633136,True,True,False
2359,27,5,21335,435.145358,8,8816.406448,False,False,True
2360,45,12,19294,658.439838,7,7915.595856,False,False,False


In [241]:
# Define the full model formula and split the data 75%-25% with the required seed
model_formula = 'TotalSpend ~ Age + Items + MonthlyIncome + TransactionTime + Record + Gender_Male + CityTier_Tier_2 + CityTier_Tier_3'
train_data, test_data = tts(data, test_size=0.25, random_state=2024)

In [242]:
#Check Proper Split
print(train_data.shape)
print(test_data.shape)

(1771, 9)
(591, 9)


In [243]:
first_model = smf.ols(formula=model_formula, data=train_data).fit()
first_model.summary()

0,1,2,3
Dep. Variable:,TotalSpend,R-squared:,0.923
Model:,OLS,Adj. R-squared:,0.923
Method:,Least Squares,F-statistic:,2658.0
Date:,"Mon, 29 Apr 2024",Prob (F-statistic):,0.0
Time:,09:59:18,Log-Likelihood:,-14307.0
No. Observations:,1771,AIC:,28630.0
Df Residuals:,1762,BIC:,28680.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-741.7281,99.770,-7.434,0.000,-937.408,-546.048
Gender_Male[T.True],284.1596,37.257,7.627,0.000,211.088,357.232
CityTier_Tier_2[T.True],-3.2595,44.917,-0.073,0.942,-91.355,84.836
CityTier_Tier_3[T.True],-182.6714,45.898,-3.980,0.000,-272.693,-92.650
Age,5.4470,1.557,3.499,0.000,2.394,8.500
Items,36.6601,4.336,8.454,0.000,28.155,45.165
MonthlyIncome,0.1499,0.002,64.482,0.000,0.145,0.154
TransactionTime,0.1961,0.065,3.005,0.003,0.068,0.324
Record,775.8669,6.024,128.805,0.000,764.053,787.681

0,1,2,3
Omnibus:,361.256,Durbin-Watson:,1.969
Prob(Omnibus):,0.0,Jarque-Bera (JB):,177.028
Skew:,0.622,Prob(JB):,3.62e-39
Kurtosis:,2.078,Cond. No.,97700.0


In [244]:
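# Common error statistics for the fitted model
sse = first_model.ssr # sum of squared residuals (SSE)
mse = first_model.mse_resid # mean squared error: SSE / residual degrees of freedom
rse = np.sqrt(mse) # residual standard error
# RSE as a percentage of the mean TotalSpend (mean taken over the full dataset); printed below as "Mean Error"
percentage_error = (rse/data["TotalSpend"].mean())*100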

print("SSE:",round(sse, 3))
print("MSE:",round(mse, 3))
print("RSE:",round(rse, 3))
print("Mean Error:",round(percentage_error, 3))

SSE: 1078022474.811
MSE: 611817.523
RSE: 782.188
Mean Error: 12.691
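
As a quick consistency check, these statistics can be reproduced from the regression summary: the MSE is the SSE divided by the residual degrees of freedom (1762), and the RSE is its square root. A minimal sketch using the printed values, copied by hand rather than read from the model object:

```python
import math

# Values copied from the outputs above
sse = 1078022474.811   # sum of squared residuals
df_resid = 1762        # Df Residuals = n - p - 1 = 1771 - 8 - 1

mse = sse / df_resid   # mean squared error of the residuals
rse = math.sqrt(mse)   # residual standard error

print("MSE:", round(mse, 3))
print("RSE:", round(rse, 3))
```

Both values agree with the printed MSE and RSE up to rounding.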


# Stepwise Model Selection

In [245]:
#Using Stepwise Regression
target = train_data['TotalSpend']
predictors = train_data.drop(columns='TotalSpend')

linear_regression = LinearRegression()

# Forward Selection
# Note: cv=0 disables cross-validation, so each subset is scored on the training data itself
forward_selector = SFS(linear_regression,
                       k_features="best",
                       forward=True,
                       floating=False,
                       scoring='r2',
                       cv=0)
forward_selector.fit(predictors, target)
forward_selected_features = list(predictors.columns[list(forward_selector.k_feature_idx_)])
print("Forward Selection: ", forward_selected_features)

# Backward Elimination
backward_selector = SFS(linear_regression,
                        k_features="best",
                        forward=False,
                        floating=False,
                        scoring='r2',
                        cv=0)
backward_selector.fit(predictors, target)
backward_eliminated_features = list(predictors.columns[list(backward_selector.k_feature_idx_)])
print("Backward Elimination: ", backward_eliminated_features)

Forward Selection:  ['Age', 'Items', 'MonthlyIncome', 'TransactionTime', 'Record', 'Gender_Male', 'CityTier_Tier_2', 'CityTier_Tier_3']
Backward Elimination:  ['Age', 'Items', 'MonthlyIncome', 'TransactionTime', 'Record', 'Gender_Male', 'CityTier_Tier_2', 'CityTier_Tier_3']


In [246]:
def summarize_results(selector, method, predictors):
    selected_features = list(predictors.columns[list(selector.k_feature_idx_)])
    print(f"{method} Results:")
    print("Selected features:", selected_features)
    print("Number of features:", len(selector.k_feature_idx_))
    print("R-squared:", selector.k_score_)
    print("\nFeature Selection History:")
    for idx, values in selector.subsets_.items():
        print("Step", idx, ": Features", list(predictors.columns[list(values["feature_idx"])]), "- R-squared:" ,values["avg_score"])

# Summarize Forward Selection results
summarize_results(forward_selector, "Forward Selection", predictors)
print("\n")
# Summarize Backward Elimination results
summarize_results(backward_selector, "Backward Elimination", predictors)

Forward Selection Results:
Selected features: ['Age', 'Items', 'MonthlyIncome', 'TransactionTime', 'Record', 'Gender_Male', 'CityTier_Tier_2', 'CityTier_Tier_3']
Number of features: 8
R-squared: 0.9234744064515731

Feature Selection History:
Step 1 : Features ['Record'] - R-squared: 0.7360232838885956
Step 2 : Features ['MonthlyIncome', 'Record'] - R-squared: 0.9157753854218289
Step 3 : Features ['Items', 'MonthlyIncome', 'Record'] - R-squared: 0.9190520432494808
Step 4 : Features ['Items', 'MonthlyIncome', 'Record', 'Gender_Male'] - R-squared: 0.9216096290764781
Step 5 : Features ['Items', 'MonthlyIncome', 'Record', 'Gender_Male', 'CityTier_Tier_3'] - R-squared: 0.9225682649449898
Step 6 : Features ['Age', 'Items', 'MonthlyIncome', 'Record', 'Gender_Male', 'CityTier_Tier_3'] - R-squared: 0.9230822105390406
Step 7 : Features ['Age', 'Items', 'MonthlyIncome', 'TransactionTim

<h5>Optimizing in-sample R-squared with forward and backward selection keeps every variable. This is expected: with cv=0 the score is computed on the training data, where R-squared can never decrease when a predictor is added. Since the p-value of CityTier_Tier_2 is large (0.942), we still remove it, accepting a negligible loss in R-squared.</h5>
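
To illustrate why in-sample R-squared cannot guide the selection on its own, the sketch below fits a model with and without a pure-noise predictor. The data are synthetic and the helper `r2_scores` is hypothetical; it only shows that plain R-squared never drops when a column is added, whereas adjusted R-squared applies a penalty for each extra predictor.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 200
x_useful = rng.normal(size=n)
x_noise = rng.normal(size=n)              # unrelated to y by construction
y = 3.0 * x_useful + rng.normal(size=n)

def r2_scores(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst
    p = X1.shape[1] - 1                   # number of predictors, excluding intercept
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

r2_small, adj_small = r2_scores(x_useful.reshape(-1, 1), y)
r2_big, adj_big = r2_scores(np.column_stack([x_useful, x_noise]), y)

print("R2:    ", r2_small, "->", r2_big)
print("Adj R2:", adj_small, "->", adj_big)
```

Scoring with cross-validation (a nonzero `cv` in the selector) or comparing adjusted R-squared, AIC, or BIC are common alternatives when the goal is to let the procedure drop weak predictors.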

# Model Analysis

In [247]:
# Drop CityTier_Tier_2, the predictor with the highest p-value (0.942)
new_model_formula = 'TotalSpend ~ Age + Items + MonthlyIncome + TransactionTime + Record + Gender_Male + CityTier_Tier_3'
best_model = smf.ols(formula=new_model_formula, data=train_data).fit()
best_model.summary()
# We prefer this model to the first one: no p-value is above 0.05,
# while R-squared and adjusted R-squared stay at 0.923

0,1,2,3
Dep. Variable:,TotalSpend,R-squared:,0.923
Model:,OLS,Adj. R-squared:,0.923
Method:,Least Squares,F-statistic:,3039.0
Date:,"Mon, 29 Apr 2024",Prob (F-statistic):,0.0
Time:,09:59:18,Log-Likelihood:,-14307.0
No. Observations:,1771,AIC:,28630.0
Df Residuals:,1763,BIC:,28670.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-743.2960,97.375,-7.633,0.000,-934.279,-552.313
Gender_Male[T.True],284.1406,37.245,7.629,0.000,211.091,357.190
CityTier_Tier_3[T.True],-181.0518,40.095,-4.516,0.000,-259.691,-102.412
Age,5.4456,1.556,3.499,0.000,2.393,8.498
Items,36.6642,4.335,8.458,0.000,28.162,45.166
MonthlyIncome,0.1499,0.002,64.504,0.000,0.145,0.154
TransactionTime,0.1960,0.065,3.005,0.003,0.068,0.324
Record,775.8730,6.021,128.855,0.000,764.063,787.683

0,1,2,3
Omnibus:,360.775,Durbin-Watson:,1.969
Prob(Omnibus):,0.0,Jarque-Bera (JB):,176.957
Skew:,0.622,Prob(JB):,3.75e-39
Kurtosis:,2.078,Cond. No.,94200.0
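
The residual diagnostics at the bottom of the summary can be read as follows. The Jarque-Bera statistic combines the residual skewness and kurtosis, and under normality it follows a chi-square distribution with 2 degrees of freedom. A minimal sketch that recomputes it from the rounded values printed above (so the result differs slightly from the reported 176.957):

```python
import math

# Values copied from the summary above
n = 1771          # No. Observations
skew = 0.622      # Skew
kurtosis = 2.078  # Kurtosis (a normal distribution has 3)

# Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)
jb = n / 6 * (skew**2 + (kurtosis - 3) ** 2 / 4)

# For a chi-square with 2 degrees of freedom, P(X > x) = exp(-x / 2)
p_value = math.exp(-jb / 2)

print("JB:", round(jb, 1))
print("p-value:", p_value)
```

The tiny p-value indicates that the residuals are not normally distributed (right-skewed and flatter than a normal), which is worth examining further in the residual analysis.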


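<h2>Adequacy & Usefulness</h2>
<h4>R-squared: 0.923 (adjusted 0.923), so the model explains about 92% of the variance in TotalSpend on the training data</h4>
<h4>AIC & BIC: 28630 and 28670. These are only meaningful relative to other models fit to the same data; BIC is slightly lower than the first model's 28680</h4>
<h4>Prob(F-statistic): Virtually 0, so the model as a whole is significant</h4>
<h4>Variable p-values: All below 0.05 (the largest is 0.003, for TransactionTime)</h4>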


<h2>Errors</h2>

In [248]:
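# Common error statistics for the reduced model
sse = best_model.ssr # sum of squared residuals (SSE)
mse = best_model.mse_resid # mean squared error: SSE / residual degrees of freedom
rse = np.sqrt(mse) # residual standard error
# RSE as a percentage of the mean TotalSpend in the training set; printed below as "Mean Error"
percentage_error = (rse/train_data["TotalSpend"].mean())*100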

print("SSE:",round(sse, 3))
print("MSE:",round(mse, 3))
print("RSE:",round(rse, 3))
print("Mean Error:",round(percentage_error, 3))

SSE: 1078025696.685
MSE: 611472.318
RSE: 781.967
Mean Error: 12.657
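
These statistics are all computed on the training set. A natural complement is the same error on the held-out 25%. The sketch below shows the idea on synthetic data with a plain least-squares fit; the coefficients, sample size, and the helper `rmse` are assumptions for illustration and are not the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# 75%-25% split, mirroring the lab's proportions
idx = rng.permutation(n)
cut = int(0.75 * n)
train_idx, test_idx = idx[:cut], idx[cut:]

# Fit on the training rows only
beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

def rmse(X_part, y_part):
    """Root mean squared prediction error using the training-set coefficients."""
    resid = y_part - X_part @ beta
    return float(np.sqrt(np.mean(resid**2)))

train_rmse = rmse(X[train_idx], y[train_idx])
test_rmse = rmse(X[test_idx], y[test_idx])
print("train RMSE:", round(train_rmse, 3))
print("test RMSE:", round(test_rmse, 3))
```

In this notebook the analogous computation would compare `best_model.predict(test_data)` with `test_data['TotalSpend']`; a test error close to the training RSE suggests the model is not overfitting.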


<h2>Multicollinearity</h2>

In [273]:
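# Build the matrix of the final model's predictors for the VIF computation
predictors2 = train_data.drop(columns=['TotalSpend', 'CityTier_Tier_2'])
# variance_inflation_factor does not add an intercept, so include a constant column explicitly
predictors2 = predictors2.assign(const=1)
# Cast the boolean dummies to integers so the matrix is purely numeric
predictors2['Gender_Male'] = predictors2['Gender_Male'].astype(int)
predictors2['CityTier_Tier_3'] = predictors2['CityTier_Tier_3'].astype(int)
predictors2.head()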

Unnamed: 0,Age,Items,MonthlyIncome,TransactionTime,Record,Gender_Male,CityTier_Tier_3,const
918,40,3,18713,442.587273,0,1,0,1
1720,37,9,27719,259.459471,9,0,0,1
401,55,10,27314,214.451836,10,0,1,1
994,31,12,7117,494.976242,8,1,1,1
430,36,8,10518,975.35169,5,0,0,1


In [274]:
# Compute the VIF for each explanatory variable
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(predictors2.values, i) for i in range(predictors2.shape[1])]
vif["Explanatory"] = predictors2.columns

# Print the VIF table
print(vif)

   VIF Factor      Explanatory
0    1.004019              Age
1    1.005120            Items
2    1.005134    MonthlyIncome
3    1.007932  TransactionTime
4    1.001958           Record
5    1.004368      Gender_Male
6    1.003850  CityTier_Tier_3
7   27.462292            const


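<h4>There is no evidence of multicollinearity among the predictors: every VIF is close to 1, well below the common threshold of 5. The large value for const (27.46) belongs to the intercept column and is not interpreted as a predictor VIF.</h4>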
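
For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below implements this definition directly with NumPy on synthetic columns; the function `vif` and the variables `a`, `b`, `c` are hypothetical and only illustrate what values near 1 versus large values mean.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, without a constant column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(2024)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent)   # both close to 1
print(vif_collinear)     # a and c strongly inflated, b still close to 1
```

The near-1 values in the table above correspond to the first case: each predictor is almost unexplained by the others.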