## Lab | Comparing regression models
For this lab, we will use the same customer analysis case study dataset as in the previous labs. We recommend working in the same notebook, since you will reuse the variables you previously created there.

Instructions
1. Fit the LinearRegression, Lasso and Ridge models and compare their performance.
2. Define a function that takes a list of models and trains (and tests) each of them, so we can try many models without repeating code.
3. Use feature selection techniques (p-value, RFE) to select a subset of features to train the model with (if necessary).
4. (Optional) Re-fit the models with the selected features.
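
Instruction 3 names p-values as one selection technique. The following is a minimal sketch of one way to do it, using the univariate F-test from scikit-learn's `f_regression`. It runs on synthetic data rather than the lab dataset: the column names (`signal_a`, `noise_a`, ...) and the 0.05 threshold are assumptions chosen for illustration. Univariate p-values test each feature on its own, so they can differ from the coefficient p-values of a multivariate OLS fit.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic data (hypothetical columns, not the lab dataset):
# two features that drive the target and two that are pure noise.
rng = np.random.default_rng(42)
n_samples = 400
X_demo = pd.DataFrame({
    "signal_a": rng.normal(size=n_samples),
    "signal_b": rng.normal(size=n_samples),
    "noise_a": rng.normal(size=n_samples),
    "noise_b": rng.normal(size=n_samples),
})
y_demo = 4 * X_demo["signal_a"] - 3 * X_demo["signal_b"] + rng.normal(size=n_samples)

# f_regression returns one F-statistic and one p-value per feature.
_, p_values = f_regression(X_demo, y_demo)
p_table = pd.Series(p_values, index=X_demo.columns).sort_values()

# Keep features whose p-value falls below the chosen threshold (assumed: 0.05).
alpha = 0.05
selected = p_table[p_table < alpha].index.tolist()

print(p_table)
print("Selected features:", selected)
```

On the lab data, the same call would take `X_train` and `y_train`, so that the test set plays no part in choosing the features.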

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_breast_cancer, fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('customer_analysis.csv')
df.head()

| | unnamed:_0 | customer | state | customer_lifetime_value | response | coverage | education | effective_to_date | employmentstatus | gender | ... | number_of_policies | policy_type | policy | renew_offer_type | sales_channel | total_claim_amount | vehicle_class | vehicle_size | vehicle_type | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DK49336 | Arizona | 4809.21696 | No | Basic | College | 2011-02-18 | Employed | M | ... | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.8 | Four-Door Car | Medsize | A | 2.0 |
| 1 | 1 | KX64629 | California | 2228.525238 | No | Basic | College | 2011-01-18 | Unemployed | F | ... | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | A | 1.0 |
| 2 | 2 | LZ68649 | Washington | 14947.9173 | No | Basic | Bachelor | 2011-02-10 | Employed | M | ... | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.0 | SUV | Medsize | A | 2.0 |
| 3 | 3 | XL78013 | Oregon | 22332.43946 | Yes | Extended | College | 2011-01-11 | Employed | M | ... | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A | 1.0 |
| 4 | 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 2011-01-17 | Medical Leave | F | ... | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | A | 1.0 |


In [3]:
df.columns


Index(['unnamed:_0', 'customer', 'state', 'customer_lifetime_value',
       'response', 'coverage', 'education', 'effective_to_date',
       'employmentstatus', 'gender', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints',
       'number_of_policies', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'total_claim_amount', 'vehicle_class', 'vehicle_size',
       'vehicle_type', 'month'],
      dtype='object')

## Fitting the LinearRegression, Lasso and Ridge models and comparing their performance

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score

# Load the dataset
file_path = 'customer_analysis.csv'
data = pd.read_csv(file_path)

# Preprocessing: drop identifier columns and label-encode the categorical variables
data_cleaned = data.drop(['unnamed:_0', 'customer'], axis=1)
categorical_cols = data_cleaned.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    le = LabelEncoder()
    # astype(str) turns NaN into the string 'nan', so missing values get their own label
    data_cleaned[col] = le.fit_transform(data_cleaned[col].astype(str))

# Assuming 'customer_lifetime_value' as the target variable
X = data_cleaned.drop('customer_lifetime_value', axis=1)
y = data_cleaned['customer_lifetime_value']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handling missing values: impute with statistics computed on the training set only
# (mode for label-encoded categorical columns, median for numeric columns)
for col in X_train.columns:
    if col in categorical_cols:
        fill_value = X_train[col].mode()[0]
    else:
        fill_value = X_train[col].median()
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which may leave the DataFrame unchanged in recent pandas versions
    X_train[col] = X_train[col].fillna(fill_value)
    X_test[col] = X_test[col].fillna(fill_value)

# Defining the model fitting function
def fit_and_evaluate(models, X_train, X_test, y_train, y_test, feature_selection=False):
    """Fit each (name, model) pair and return a dict mapping name to test-set R²."""
    results = {}
    for name, model in models:
        if feature_selection:
            # Keep the 5 features that RFE ranks highest for this model
            selector = RFE(model, n_features_to_select=5, step=1)
            selector = selector.fit(X_train, y_train)
            X_train_selected = selector.transform(X_train)
            X_test_selected = selector.transform(X_test)

            model.fit(X_train_selected, y_train)
            predictions = model.predict(X_test_selected)
        else:
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)

        # R² is a goodness-of-fit score for regression, not a classification accuracy
        results[name] = r2_score(y_test, predictions)

    return results

# List of models to evaluate
models = [
    ('Linear Regression', LinearRegression()),
    ('Lasso', Lasso(random_state=42)),
    ('Ridge', Ridge(random_state=42))
]

# Fitting and evaluating models without feature selection
results_without_fs = fit_and_evaluate(models, X_train, X_test, y_train, y_test)

# Fitting and evaluating models with feature selection
results_with_fs = fit_and_evaluate(models, X_train, X_test, y_train, y_test, feature_selection=True)

print("Results without Feature Selection:", results_without_fs)
print("Results with Feature Selection:", results_with_fs)


Results without Feature Selection: {'Linear Regression': 0.17569478108156888, 'Lasso': 0.1757814090743104, 'Ridge': 0.175690391612915}
Results with Feature Selection: {'Linear Regression': 0.022084657275460273, 'Lasso': 0.029726306048702633, 'Ridge': 0.029318544220811593}
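
R² drops sharply once RFE keeps only 5 features. One possible factor, not verified here, is feature scale: RFE ranks features by coefficient magnitude, and the features above are passed in unscaled (`StandardScaler` is imported but never applied). The sketch below uses synthetic data, not the lab dataset, to show how scaling can change which features RFE keeps. The variable names and the data-generating setup are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): columns 0 and 1 drive the
# target but live on a large scale; columns 2 to 5 are small-scale noise.
rng = np.random.default_rng(0)
n_samples = 500
informative = rng.normal(loc=50_000, scale=10_000, size=(n_samples, 2))
noise = rng.normal(loc=0.0, scale=0.01, size=(n_samples, 4))
X_demo = np.hstack([informative, noise])
y_demo = (3 * informative[:, 0] + 2 * informative[:, 1]) / 10_000 + rng.normal(size=n_samples)

# RFE on the raw features: large-scale features get tiny coefficients,
# so they look unimportant and are eliminated first.
raw_rfe = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X_demo, y_demo)

# RFE after standardization: coefficients become comparable across features.
scaled_rfe = Pipeline([
    ("scale", StandardScaler()),
    ("rfe", RFE(LinearRegression(), n_features_to_select=2, step=1)),
]).fit(X_demo, y_demo)

raw_support = raw_rfe.support_
scaled_support = scaled_rfe.named_steps["rfe"].support_
print("Selected without scaling:", raw_support)
print("Selected with scaling:   ", scaled_support)
```

To try this on the lab data, the same pipeline could be fitted on `X_train` and `y_train`, with the resulting R² compared against the figures above.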
