# Lab | Comparing regression models

For this lab, we will use the same customer analysis case study dataset as in the previous labs. We recommend working in the same notebook, since you will be reusing the variables you previously created in those labs.

Instructions
1. Fit the models LinearRegression, Lasso and Ridge and compare their performance.
2. Define a function that takes a list of models and trains (and tests) them so we can try a lot of them without repeating code.
3. Use feature selection techniques (P-Value, RFE) to select a subset of features to train the model with (if necessary).
4. (optional) Re-fit the models with the selected features.
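
Step 2 could be sketched roughly as follows. This is a minimal sketch, not the lab's reference solution: the helper name `compare_models`, the choice of R² as the metric, the `alpha=1.0` settings and the synthetic data from `make_regression` are all assumptions made for illustration. In the lab you would pass the real `X_train`, `X_test`, `y_train` and `y_test` instead.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return a DataFrame of train/test R^2 scores (hypothetical helper)."""
    rows = []
    for model in models:
        model.fit(X_train, y_train)
        rows.append({
            "model": type(model).__name__,
            "r2_train": model.score(X_train, y_train),  # R^2 on the training set
            "r2_test": model.score(X_test, y_test),     # R^2 on the held-out set
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic stand-in data (assumption); replace with the lab's own split.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

scores = compare_models(
    [LinearRegression(), Lasso(alpha=1.0), Ridge(alpha=1.0)],
    Xtr, Xte, ytr, yte,
)
print(scores)
```

Returning a DataFrame rather than printing inside the loop makes it easy to sort or plot the scores, and to call the helper again after feature selection.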

In [28]:
#Importing libraries and data
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt


from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression 
from sklearn.feature_selection import VarianceThreshold # It only works with numerical features


df = pd.read_csv('data/marketing_customer_analysis_clean (2).csv')
numericals_df = df.select_dtypes(include=np.number).drop(columns = 'unnamed:_0')
numericals_df


Unnamed: 0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,total_claim_amount,month
0,4809.216960,48029,61,7.000000,52,0.000000,9,292.800000,2
1,2228.525238,0,64,3.000000,26,0.000000,1,744.924331,1
2,14947.917300,22139,100,34.000000,31,0.000000,2,480.000000,2
3,22332.439460,49078,97,10.000000,3,0.000000,2,484.013411,1
4,9025.067525,23675,117,15.149071,31,0.384256,7,707.925645,1
...,...,...,...,...,...,...,...,...,...
10905,15563.369440,0,253,15.149071,40,0.384256,7,1214.400000,1
10906,5259.444853,61146,65,7.000000,68,0.000000,6,273.018929,1
10907,23893.304100,39837,201,11.000000,63,0.000000,2,381.306996,2
10908,11971.977650,64195,158,0.000000,27,4.000000,6,618.288849,2


In [34]:
#define X and Y
X = numericals_df
y = numericals_df[["total_claim_amount"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)






## Variance threshold method

In [36]:
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test  = pd.DataFrame(X_test, columns=X.columns)

#display(X_train)
print("Initial number of numerical columns: ",X_train.shape)
print()

selector = VarianceThreshold() # Default threshold value is 0
# Features with a training-set variance lower than this threshold will be removed.
selector.fit(X_train)

kept_features_indexes = selector.get_support(indices = True) # integer positions of the features that were kept
kept_features = list(X_train.columns[kept_features_indexes])

X_train = selector.transform(X_train)
X_test  = selector.transform(X_test)

X_train = pd.DataFrame(X_train, columns=kept_features)
X_test  = pd.DataFrame(X_test, columns=kept_features)

print("Final number of numerical columns: ",X_train.shape)
print()
X_train


Initial number of numerical columns:  (8728, 9)

Final number of numerical columns:  (8728, 9)



Unnamed: 0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,total_claim_amount,month
0,4665.129599,0.0,62.0,26.0,62.0,0.0,3.0,297.600000,2.0
1,10288.924950,96337.0,127.0,19.0,12.0,0.0,3.0,609.600000,1.0
2,4873.436612,18866.0,126.0,4.0,62.0,0.0,1.0,604.800000,1.0
3,6944.739992,0.0,68.0,24.0,31.0,0.0,2.0,489.600000,1.0
4,2472.469209,63860.0,62.0,26.0,81.0,0.0,1.0,208.598246,1.0
...,...,...,...,...,...,...,...,...,...
8723,3810.238281,0.0,108.0,7.0,57.0,0.0,1.0,777.600000,2.0
8724,3815.851163,38651.0,98.0,12.0,83.0,0.0,1.0,470.400000,1.0
8725,7850.590399,0.0,69.0,5.0,78.0,0.0,2.0,331.200000,2.0
8726,4974.235309,0.0,70.0,18.0,74.0,0.0,3.0,336.000000,2.0
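
The shape is unchanged above because, with the default threshold of 0, `VarianceThreshold` only drops features whose variance is exactly zero, and none of the nine columns is constant. A small toy example (the DataFrame `toy` and its column names are made up for illustration) shows what the selector does when a constant column is present:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (assumption): one constant column and one column that varies.
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "varying": [0.0, 1.0, 2.0, 3.0],
})

toy_selector = VarianceThreshold()  # default threshold is 0
toy_selector.fit(toy)

# get_support() returns a boolean mask over the input columns.
kept = list(toy.columns[toy_selector.get_support()])
print(kept)  # ['varying']
```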


## Correlation matrix

In [22]:
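#Linear regression
# X_train still contains the target column, so drop it before fitting to avoid target leakage
lm = LinearRegression()
model = lm.fit(X_train.drop(columns="total_claim_amount"), y_train)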

unnamed:_0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,total_claim_amount,month
3105,4665.129599,0,62,26.0,62,0.0,3,297.600000,2
6032,10288.924950,96337,127,19.0,12,0.0,3,609.600000,1
157,4873.436612,18866,126,4.0,62,0.0,1,604.800000,1
6964,6944.739992,0,68,24.0,31,0.0,2,489.600000,1
6349,2472.469209,63860,62,26.0,81,0.0,1,208.598246,1
...,...,...,...,...,...,...,...,...,...
5734,3810.238281,0,108,7.0,57,0.0,1,777.600000,2
5191,3815.851163,38651,98,12.0,83,0.0,1,470.400000,1
5390,7850.590399,0,69,5.0,78,0.0,2,331.200000,2
860,4974.235309,0,70,18.0,74,0.0,3,336.000000,2
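
For step 3 of the instructions, recursive feature elimination (RFE) could look roughly like this. It is a sketch under stated assumptions: the data is synthetic (`make_regression`), and keeping 3 features is an arbitrary choice for the demo. On the lab data you would fit on the training features with the target column removed.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X_rfe, y_rfe = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# RFE repeatedly fits the estimator and discards the weakest feature
# (smallest absolute coefficient) until n_features_to_select remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X_rfe, y_rfe)

selected = rfe.get_support(indices=True)  # positions of the kept features
print("Selected feature positions:", selected)
print("Ranking (1 = kept):", rfe.ranking_)
```

Because RFE with a linear model ranks features by coefficient size, it is usually sensible to scale the features first so the coefficients are comparable.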
