# Lab | Comparing regression models

For this lab, we will be using the same dataset we used in the previous labs. We recommend using the same notebook, since you will be reusing the variables you previously created and used in those labs.

## Instructions
1. In this final lab, we will model our data. 
Import sklearn train_test_split and separate the data.

2. We will start with removing outliers, if you have not already done so. We have discussed different methods to remove outliers. Use the one you feel more comfortable with, define a function for that. Use the function to remove the outliers and apply it to the dataframe.

3. Create a copy of the dataframe for the data wrangling.

4. Normalize the continuous variables. You can use any one method you want.

5. Encode the categorical variables (See the hint below for encoding categorical data!!!)

6. The time variable can be useful. Try to transform it into something more informative. Hint: day, week, and month as integers might be useful.

7. Since the model will only accept numerical data, check that every column is numerical. If some are not, convert them using encoding.

## Hint for Categorical Variables
You should deal with the categorical variables as shown below (for ordinal encoding, dummy code has been provided as well):

| Encoder Type | Column |
|---|---|
| One hot | state |
| Ordinal | coverage |
| Ordinal | employmentstatus |
| Ordinal | location code |
| One hot | marital status |
| One hot | policy type |
| One hot | policy |
| One hot | renew offer type |
| One hot | sales channel |
| One hot | vehicle class |
| Ordinal | vehicle size |

#### Dummy code
data["coverage"] = data["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})

given that the column "coverage" in the dataframe "data" has three categories, "Basic", "Extended", and "Premium", and the values are to be represented in that order.

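As a rough illustration of the two encoder types in the table, the sketch below applies an ordinal mapping and a one-hot encoding to a small toy dataframe. The toy values, the `ordinal_maps` name, and the ordering chosen for `vehicle_size` are assumptions made for this example, not part of the lab dataset.

```python
import pandas as pd

# Toy data standing in for the lab dataframe (illustrative values only).
data = pd.DataFrame({
    "coverage": ["Basic", "Premium", "Extended", "Basic"],
    "vehicle_size": ["Small", "Medsize", "Large", "Medsize"],
    "state": ["Oregon", "Nevada", "Oregon", "Arizona"],
})

# Ordinal columns: map each category to its rank.
# The ordering of vehicle_size is an assumption made for this example.
ordinal_maps = {
    "coverage": {"Basic": 0, "Extended": 1, "Premium": 2},
    "vehicle_size": {"Small": 0, "Medsize": 1, "Large": 2},
}
encoded = data.copy()
for col, mapping in ordinal_maps.items():
    encoded[col] = encoded[col].map(mapping)

# One-hot columns: one 0/1 indicator column per category.
encoded = pd.get_dummies(encoded, columns=["state"], dtype=int)
print(encoded)
```

Note that `map` returns NaN for any category missing from the mapping, so it is worth checking for NaNs after encoding.
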
8. Try a simple linear regression with all the data to see whether we are getting good results.

9. Great! Now define a function that takes a list of models and trains (and tests) them, so we can try many of them without repeating code.

10. Use the function to check LinearRegression and KNeighborsRegressor.

11. You can also check the MLPRegressor for this task!

12. Check and discuss the results.
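
One possible shape for the function described in step 9 is sketched below. The name `compare_models` and the synthetic data from `make_regression` are assumptions used only so the example runs on its own. In the lab you would pass your own encoded train and test sets.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return {model class name: (train R2, test R2)}."""
    scores = {}
    for model in models:
        model.fit(X_train, y_train)
        # For scikit-learn regressors, .score() returns the R2 coefficient.
        scores[type(model).__name__] = (
            model.score(X_train, y_train),
            model.score(X_test, y_test),
        )
    return scores


# Synthetic data, used only so the sketch is self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = compare_models([LinearRegression(), KNeighborsRegressor()], X_tr, X_te, y_tr, y_te)
for name, (train_r2, test_r2) in results.items():
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

Reporting both train and test scores makes overfitting easier to spot when discussing the results.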


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

In [4]:
customer = pd.read_csv("we_fn_use_c_marketing_customer_value_analysis.csv")
display(customer.head())
customer.info()

Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,...,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size
0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2/24/11,Employed,F,56274,...,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,QZ44356,Arizona,6979.535903,No,Extended,Bachelor,1/31/11,Unemployed,F,0,...,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,AI49188,Nevada,12887.43165,No,Premium,Bachelor,2/19/11,Employed,F,48767,...,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,WW63253,California,7645.861827,No,Basic,Bachelor,1/20/11,Unemployed,M,0,...,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,HB64268,Washington,2813.692575,No,Basic,Bachelor,2/3/11,Employed,M,43836,...,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 24 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Customer                       9134 non-null   object 
 1   State                          9134 non-null   object 
 2   Customer Lifetime Value        9134 non-null   float64
 3   Response                       9134 non-null   object 
 4   Coverage                       9134 non-null   object 
 5   Education                      9134 non-null   object 
 6   Effective To Date              9134 non-null   object 
 7   EmploymentStatus               9134 non-null   object 
 8   Gender                         9134 non-null   object 
 9   Income                         9134 non-null   int64  
 10  Location Code                  9134 non-null   object 
 11  Marital Status                 9134 non-null   object 
 12  Monthly Premium Auto           9134 non-null   i

I have taken the liberty of reversing the order of the first two questions.

We will now repeat some of the changes already applied in the two previous labs.

## 0. Cleaning the data

In [10]:
# Lowercase the column names and replace spaces with underscores
columns = []
for col in customer.columns:
    col = col.lower().replace(" ", "_")
    columns.append(col)
    
customer.columns = columns
customer

Unnamed: 0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,...,months_since_policy_inception,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size
0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2/24/11,Employed,F,56274,...,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,QZ44356,Arizona,6979.535903,No,Extended,Bachelor,1/31/11,Unemployed,F,0,...,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,AI49188,Nevada,12887.431650,No,Premium,Bachelor,2/19/11,Employed,F,48767,...,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,WW63253,California,7645.861827,No,Basic,Bachelor,1/20/11,Unemployed,M,0,...,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,HB64268,Washington,2813.692575,No,Basic,Bachelor,2/3/11,Employed,M,43836,...,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,LA72316,California,23405.987980,No,Basic,Bachelor,2/10/11,Employed,M,71941,...,89,0,2,Personal Auto,Personal L1,Offer2,Web,198.234764,Four-Door Car,Medsize
9130,PK87824,California,3096.511217,Yes,Extended,College,2/12/11,Employed,F,21604,...,28,0,1,Corporate Auto,Corporate L3,Offer1,Branch,379.200000,Four-Door Car,Medsize
9131,TD14365,California,8163.890428,No,Extended,Bachelor,2/6/11,Unemployed,M,0,...,37,3,2,Corporate Auto,Corporate L2,Offer1,Branch,790.784983,Four-Door Car,Medsize
9132,UP19263,California,7524.442436,No,Extended,College,2/3/11,Employed,M,21941,...,3,0,3,Personal Auto,Personal L2,Offer3,Branch,691.200000,Four-Door Car,Large


In [11]:
# Change the type of the column 'effective_to_date' to datetime
customer['effective_to_date']=pd.to_datetime(customer['effective_to_date'])
customer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 24 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   customer                       9134 non-null   object        
 1   state                          9134 non-null   object        
 2   customer_lifetime_value        9134 non-null   float64       
 3   response                       9134 non-null   object        
 4   coverage                       9134 non-null   object        
 5   education                      9134 non-null   object        
 6   effective_to_date              9134 non-null   datetime64[ns]
 7   employmentstatus               9134 non-null   object        
 8   gender                         9134 non-null   object        
 9   income                         9134 non-null   int64         
 10  location_code                  9134 non-null   object        
 11  marital_status   
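
Once the column is a datetime, step 6 of the instructions (day, week, and month as integers) could be approached along these lines. This is a sketch on three sample dates taken from the table above. The `dates` dataframe and the new column names are assumptions for illustration.

```python
import pandas as pd

# Three sample dates in the same month/day/two-digit-year format as the dataset.
dates = pd.DataFrame({"effective_to_date": ["2/24/11", "1/31/11", "2/19/11"]})
dates["effective_to_date"] = pd.to_datetime(dates["effective_to_date"], format="%m/%d/%y")

# Integer features derived from the datetime column.
dates["day"] = dates["effective_to_date"].dt.day
dates["weekday"] = dates["effective_to_date"].dt.dayofweek  # Monday = 0
dates["week"] = dates["effective_to_date"].dt.isocalendar().week.astype(int)
dates["month"] = dates["effective_to_date"].dt.month
print(dates)
```

Passing an explicit `format` avoids relying on pandas to infer whether the first number is the month or the day.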

In [12]:
# Check for NaNs
customer.isna().sum()

customer                         0
state                            0
customer_lifetime_value          0
response                         0
coverage                         0
education                        0
effective_to_date                0
employmentstatus                 0
gender                           0
income                           0
location_code                    0
marital_status                   0
monthly_premium_auto             0
months_since_last_claim          0
months_since_policy_inception    0
number_of_open_complaints        0
number_of_policies               0
policy_type                      0
policy                           0
renew_offer_type                 0
sales_channel                    0
total_claim_amount               0
vehicle_class                    0
vehicle_size                     0
dtype: int64

In [28]:
#Change name of column "employment status"
customer = customer.rename(columns={"employmentstatus":"employment_status"})
customer

Unnamed: 0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employment_status,gender,income,...,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size
0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2011-02-24,Employed,F,56274,...,32,5,0,1,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
2,AI49188,Nevada,12887.431650,No,Premium,Bachelor,2011-02-19,Employed,F,48767,...,18,38,0,2,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,WW63253,California,7645.861827,No,Basic,Bachelor,2011-01-20,Unemployed,M,0,...,18,65,0,7,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,HB64268,Washington,2813.692575,No,Basic,Bachelor,2011-02-03,Employed,M,43836,...,12,44,0,1,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize
5,OC83172,Oregon,8256.297800,Yes,Basic,Bachelor,2011-01-25,Employed,F,62902,...,14,94,0,2,Personal L3,Offer2,Web,159.383042,Two-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,LA72316,California,23405.987980,No,Basic,Bachelor,2011-02-10,Employed,M,71941,...,18,89,0,2,Personal L1,Offer2,Web,198.234764,Four-Door Car,Medsize
9130,PK87824,California,3096.511217,Yes,Extended,College,2011-02-12,Employed,F,21604,...,14,28,0,1,Corporate L3,Offer1,Branch,379.200000,Four-Door Car,Medsize
9131,TD14365,California,8163.890428,No,Extended,Bachelor,2011-02-06,Unemployed,M,0,...,9,37,3,2,Corporate L2,Offer1,Branch,790.784983,Four-Door Car,Medsize
9132,UP19263,California,7524.442436,No,Extended,College,2011-02-03,Employed,M,21941,...,34,3,0,3,Personal L2,Offer3,Branch,691.200000,Four-Door Car,Large


## 2. We will start with removing outliers, if you have not already done so.
We have discussed different methods to remove outliers. Use the one you feel more comfortable with, define a function for that. Use the function to remove the outliers and apply it to the dataframe.

In [29]:
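# Upper IQR fence (Q3 + 1.5 * IQR) for 'customer_lifetime_value'.
# Note: the filtered rows are stored in customer_df, whose shape is inspected below;
# the train/test split further down uses customer, not customer_df.
iqr = np.nanpercentile(customer['customer_lifetime_value'],75) - np.nanpercentile(customer['customer_lifetime_value'],25)
upper_limit = np.nanpercentile(customer['customer_lifetime_value'],75) + 1.5*iqr
customer_df = customer[customer['customer_lifetime_value'] < upper_limit]

In [30]:
# Upper IQR fence for 'total_claim_amount'; this filter is applied to customer itself.
iqr = np.nanpercentile(customer['total_claim_amount'],75) - np.nanpercentile(customer['total_claim_amount'],25)
upper_limit = np.nanpercentile(customer['total_claim_amount'],75) + 1.5*iqr
customer = customer[customer['total_claim_amount'] < upper_limit]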

In [31]:
customer_df.shape

(7940, 23)
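
The exercise asks for the outlier removal to be wrapped in a function. A minimal sketch of how the two cells above could be folded into one helper is shown here. The name `remove_upper_outliers`, the `factor` parameter, and the toy dataframe are assumptions for illustration. Unlike the cells above, it computes every fence on the input dataframe before filtering and returns a single filtered dataframe.

```python
import numpy as np
import pandas as pd


def remove_upper_outliers(df, columns, factor=1.5):
    """Drop rows whose value in any listed column is not below Q3 + factor * IQR."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = np.nanpercentile(df[col], [25, 75])
        # Rows at or above the fence (and rows with NaN) are marked False.
        mask &= df[col] < q3 + factor * (q3 - q1)
    return df[mask]


# Toy data: 100 is far above the fence for the first column.
toy = pd.DataFrame({
    "customer_lifetime_value": [1, 2, 3, 4, 100],
    "total_claim_amount": [10, 20, 30, 40, 50],
})
cleaned = remove_upper_outliers(toy, ["customer_lifetime_value", "total_claim_amount"])
print(cleaned.shape)
```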

## 1. In this final lab, we will model our data. 
Import sklearn train_test_split and separate the data.

In [36]:
X = customer.drop(["customer", "total_claim_amount"], axis = 1)
y = customer["total_claim_amount"]

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 25)
print(X_train.shape)
print(X_test.shape)

(6451, 21)
(2151, 21)


## 3. Create a copy of the dataframe for the data wrangling.

In [43]:
X_train_num = X_train.select_dtypes(np.number)
X_train_cat = X_train.select_dtypes("object")

X_test_num = X_test.select_dtypes(np.number)
X_test_cat = X_test.select_dtypes("object")

## 4. Normalize the continuous variables. 
You can use any one method you want.

In [44]:
# We are going to use the logarithmic transformation.
# To do so, the number needs to be both finite and different from 0.
# We are going to apply it to the 'customer_lifetime_value' and 'income' columns.

def log_transform(x):
    if (np.isfinite(x)) and (x != 0):
        return np.log(x)
    else:
        return x

#Then we apply it to both train and test

X_train_num["customer_lifetime_value"] = X_train_num["customer_lifetime_value"].apply(log_transform)
X_train_num["income"] = X_train_num["income"].apply(log_transform)

X_test_num["customer_lifetime_value"] = X_test_num["customer_lifetime_value"].apply(log_transform)
X_test_num["income"] = X_test_num["income"].apply(log_transform)
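
The function above leaves zero incomes unchanged while the non-zero ones are log-transformed. A common alternative for non-negative columns that contain zeros is `np.log1p`, which computes log(1 + x) and maps 0 to 0 without a special case. The sketch below is offered as an option under that assumption, not as the approach used in this notebook, and the `income` series is a toy sample.

```python
import numpy as np
import pandas as pd

# Toy income values, including zeros as in the dataset.
income = pd.Series([0, 56274, 48767, 0, 43836])

# log1p(x) = log(1 + x): defined at 0 and vectorized over the whole Series.
income_log = np.log1p(income)
print(income_log)
```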

In [45]:
mm_scaler = MinMaxScaler().fit(X_train_num)
X_train_num_mm = mm_scaler.transform(X_train_num)
X_train_num_mm = pd.DataFrame(X_train_num_mm, columns = X_train_num.columns)

X_test_num_mm = mm_scaler.transform(X_test_num)
X_test_num_mm = pd.DataFrame(X_test_num_mm, columns = X_test_num.columns)

In [46]:
X_train_num_mm

Unnamed: 0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies
0,0.346615,0.000000,0.123404,0.057143,0.464646,0.2,0.250
1,0.602264,0.959154,0.063830,0.171429,0.707071,0.2,0.125
2,0.425476,0.965408,0.212766,0.000000,0.494949,1.0,0.625
3,0.417354,0.975602,0.165957,0.142857,0.727273,0.0,0.250
4,0.723650,0.942271,0.229787,1.000000,0.303030,0.0,0.125
...,...,...,...,...,...,...,...
6446,0.452526,0.990564,0.055319,0.400000,0.040404,0.0,0.125
6447,0.649418,0.913911,0.008511,0.200000,0.707071,0.0,0.125
6448,0.275076,0.965442,0.000000,0.485714,0.414141,0.4,0.500
6449,0.287894,0.000000,0.153191,0.571429,0.090909,0.8,0.750


## 5. Encode the categorical variables 
(See the hint below for encoding categorical data!!!)

In [47]:
X_train_cat

Unnamed: 0,state,response,coverage,education,employment_status,gender,location_code,marital_status,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size
5571,California,No,Extended,High School or Below,Unemployed,M,Suburban,Married,Personal L3,Offer4,Agent,Two-Door Car,Medsize
3996,Oregon,No,Extended,High School or Below,Employed,F,Suburban,Divorced,Personal L3,Offer1,Branch,Four-Door Car,Small
1319,Washington,No,Basic,Bachelor,Employed,M,Urban,Married,Personal L1,Offer2,Agent,SUV,Medsize
2463,Oregon,No,Basic,College,Employed,M,Rural,Divorced,Personal L3,Offer1,Branch,SUV,Small
2566,Nevada,No,Basic,Bachelor,Employed,M,Rural,Married,Corporate L1,Offer3,Branch,SUV,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1236,Washington,No,Basic,Bachelor,Employed,M,Rural,Divorced,Personal L3,Offer1,Branch,Four-Door Car,Large
8970,California,No,Basic,High School or Below,Employed,F,Suburban,Married,Corporate L2,Offer1,Agent,Two-Door Car,Medsize
3101,Washington,No,Basic,Master,Employed,F,Rural,Married,Personal L1,Offer1,Call Center,Two-Door Car,Small
7028,Oregon,No,Extended,Bachelor,Unemployed,M,Suburban,Single,Personal L3,Offer1,Agent,Two-Door Car,Medsize
