# Lab | Customer Analysis Round 6

For this lab, we keep using the `marketing_customer_analysis.csv` file, which you can find in the `files_for_lab` folder.

## Get the data

We are using the marketing_customer_analysis.csv file.

In [30]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [31]:
# import data

data = pd.read_csv("marketing_customer_analysis.csv")

## Dealing with the data

Already done in round 2.

In [32]:
# standardise column names
data.columns = [column.lower().replace(' ', '_') for column in data.columns]

In [33]:
data.head()

| | customer | state | customer_lifetime_value | response | coverage | education | effective_to_date | employmentstatus | gender | income | ... | months_since_policy_inception | number_of_open_complaints | number_of_policies | policy_type | policy | renew_offer_type | sales_channel | total_claim_amount | vehicle_class | vehicle_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize |
| 1 | QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize |
| 2 | AI49188 | Nevada | 12887.43165 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | ... | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize |
| 3 | WW63253 | California | 7645.861827 | No | Basic | Bachelor | 1/20/11 | Unemployed | M | 0 | ... | 65 | 0 | 7 | Corporate Auto | Corporate L2 | Offer1 | Call Center | 529.881344 | SUV | Medsize |
| 4 | HB64268 | Washington | 2813.692575 | No | Basic | Bachelor | 2/3/11 | Employed | M | 43836 | ... | 44 | 0 | 1 | Personal Auto | Personal L1 | Offer1 | Agent | 138.130879 | Four-Door Car | Medsize |


In [34]:
# check for null values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 24 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   customer                       9134 non-null   object 
 1   state                          9134 non-null   object 
 2   customer_lifetime_value        9134 non-null   float64
 3   response                       9134 non-null   object 
 4   coverage                       9134 non-null   object 
 5   education                      9134 non-null   object 
 6   effective_to_date              9134 non-null   object 
 7   employmentstatus               9134 non-null   object 
 8   gender                         9134 non-null   object 
 9   income                         9134 non-null   int64  
 10  location_code                  9134 non-null   object 
 11  marital_status                 9134 non-null   object 
 12  monthly_premium_auto           9134 non-null   i

In [35]:
# check state column is clean (no abbreviations etc)
data["state"].unique()

array(['Washington', 'Arizona', 'Nevada', 'California', 'Oregon'],
      dtype=object)

In [36]:
# check gender column is clean
data["gender"].unique()

array(['F', 'M'], dtype=object)

In [39]:
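# customer is an identifier column: 9134 unique values in 9134 rows, so it carries no signal for the model
# and one-hot encoding it would add thousands of columns. Dropping it is a reasonable move.
# For comparison, a genuinely categorical column such as policy has only a handful of levels: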
data['policy'].nunique()

9

In [40]:
data = data.drop("customer", axis = 1)
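
One way to spot identifier-like columns before encoding is to compare each column's number of distinct values with the number of rows. The snippet below is a minimal sketch on a made-up toy frame (the values and the `id_like` name are assumptions for illustration, not part of the lab data):

```python
import pandas as pd

# Toy frame standing in for the lab data (values are made up for illustration).
toy = pd.DataFrame({
    "customer": ["AA1", "BB2", "CC3", "DD4"],
    "state": ["Oregon", "Nevada", "Oregon", "Nevada"],
    "gender": ["F", "M", "F", "M"],
})

# Number of distinct values per column.
cardinality = toy.nunique()

# Columns where every row has its own value behave like identifiers.
id_like = cardinality[cardinality == len(toy)].index.tolist()
print(id_like)

# Drop them before encoding.
toy = toy.drop(columns=id_like)
```

On real data this check is only a heuristic: a continuous numerical column can also have one distinct value per row, so it makes sense to apply it to categorical columns only.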

## Explore the data

Done in round 3.

## Processing Data

(Further processing...)

- X-y split. (done)
- Normalize (numerical). (done)
- One Hot/Label Encoding (categorical).
- Concat DataFrames

**X-Y Split** If you have not done it yet, keep in mind that the target is `total_claim_amount`

In [41]:
y = data['total_claim_amount']
X = data.drop(['total_claim_amount'], axis=1)

**Normalize (numerical)** If you have not done it yet, you can define a function using `StandardScaler` from the sklearn library

In [42]:
# split data into numerical & categorical
X_num = X.select_dtypes(include = np.number)
X_cat = X.select_dtypes(include = object)

In [43]:
from sklearn.preprocessing import OneHotEncoder, Normalizer, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [44]:
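# normalise numerical data. Note: Normalizer rescales each row (sample) to unit norm,
# whereas StandardScaler rescales each column (feature), so the two are not interchangeable.
# Open question: the numerical values no longer share a scale with the 0/1 categorical dummies - is this bad?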
transformer = Normalizer().fit(X_num)
x_normalized = transformer.transform(X_num)
print(x_normalized.shape)

(9134, 7)
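
To make the difference between the two scikit-learn transformers concrete, here is a small sketch on a made-up matrix (the `X_toy` values are assumptions chosen so the arithmetic is easy to follow). `Normalizer` scales each row to unit L2 norm, while `StandardScaler` centres and scales each column:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy numerical matrix: rows are customers, columns are features (made-up values).
X_toy = np.array([[3.0, 4.0],
                  [6.0, 8.0],
                  [0.0, 5.0]])

# Normalizer works row by row: every row ends up with L2 norm 1.
row_scaled = Normalizer().fit_transform(X_toy)
row_norms = np.linalg.norm(row_scaled, axis=1)

# StandardScaler works column by column: every column ends up with mean 0 and std 1.
col_scaled = StandardScaler().fit_transform(X_toy)
col_means = col_scaled.mean(axis=0)
col_stds = col_scaled.std(axis=0)

print(row_scaled)
print(col_scaled)
```

Note that the first two toy rows, `[3, 4]` and `[6, 8]`, become identical after row normalisation: only the direction of each row survives, not its magnitude. For a regression on customer attributes, per-column scaling is usually the closer match to what the lab text describes.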


**One Hot/Label Encoding (categorical)** Try one of the two options learned in class

In [50]:
# chose the one-hot method; drop_first=True drops one level per column to avoid redundant dummies
cat_data = pd.get_dummies(X_cat, drop_first=True)
cat_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Columns: 101 entries, state_California to vehicle_size_Small
dtypes: uint8(101)
memory usage: 901.0 KB
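
As a small illustration of what `pd.get_dummies` with `drop_first=True` produces, the sketch below encodes a made-up two-column frame (the `toy_cat` values are assumptions, not lab data). Each column with k levels yields k - 1 dummy columns, because the first level in sorted order is dropped:

```python
import pandas as pd

# Toy categorical frame (made-up values).
toy_cat = pd.DataFrame({
    "coverage": ["Basic", "Extended", "Premium", "Basic"],
    "gender": ["F", "M", "M", "F"],
})

# coverage has 3 levels -> 2 dummies; gender has 2 levels -> 1 dummy.
dummies = pd.get_dummies(toy_cat, drop_first=True)
print(dummies.columns.tolist())
# ['coverage_Extended', 'coverage_Premium', 'gender_M']
```

The same rule explains why a column with many distinct values, such as a date stored as text, contributes a large share of the encoded columns.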


**Concat DataFrames**

In [51]:
# concat normalised numerical and encoded categorical data
# np.concatenate works here, but it returns a plain NumPy array, so the column names are lost
X = np.concatenate([x_normalized, cat_data], axis=1)

In [52]:
# X is now a NumPy array rather than a DataFrame; scikit-learn accepts either,
# so this is fine for fitting the model
X

array([[0.04904913, 0.99879545, 0.00122467, ..., 1.        , 1.        ,
        0.        ],
       [0.99988883, 0.        , 0.01346645, ..., 0.        , 1.        ,
        0.        ],
       [0.2554939 , 0.96680794, 0.0021411 , ..., 1.        , 1.        ,
        0.        ],
       ...,
       [0.99993483, 0.        , 0.01041102, ..., 0.        , 1.        ,
        0.        ],
       [0.32439117, 0.94591282, 0.00413872, ..., 0.        , 0.        ,
        0.        ],
       [0.99897259, 0.        , 0.02945088, ..., 1.        , 1.        ,
        0.        ]])

## Linear Regression

- Train-test split.
- Apply linear regression.

**Train-test split** Divide your data into a train part and a test part

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
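
When a scaler that learns statistics from the data (such as `StandardScaler`) is used, a common practice is to split first and fit the scaler on the training part only, so that no information from the test rows leaks into the preprocessing. The following is a sketch on synthetic data; all names ending in `_toy` and the generated values are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (made-up data).
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Split first...
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# ...then learn the scaling parameters from the training rows only
# and apply the same transformation to both parts.
scaler = StandardScaler().fit(X_tr_toy)
X_tr_scaled = scaler.transform(X_tr_toy)
X_te_scaled = scaler.transform(X_te_toy)
```

A row-wise transformer like `Normalizer` learns nothing from the data, so for it the order of splitting and transforming makes no difference.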

**Apply linear regression** For this question you can use the `statsmodels` or `sklearn` libraries

In [54]:
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression()

In [55]:
predictions  = model.predict(X_test)
predictions.shape

(2741,)

## Model Validation

- Description:
  - R2.
  - MSE.
  - RMSE.
  - MAE.
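
For reference, the four metrics listed above can be computed by hand from the residuals. The sketch below uses four made-up target values and predictions (assumptions chosen for easy arithmetic) and checks the results against `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Made-up true values and predictions.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_pred

# MSE: mean of squared residuals (squared units of the target).
mse_manual = np.mean(residuals ** 2)
# RMSE: square root of MSE (same units as the target).
rmse_manual = np.sqrt(mse_manual)
# MAE: mean of absolute residuals (same units as the target).
mae_manual = np.mean(np.abs(residuals))
# R2: 1 minus residual sum of squares over total sum of squares.
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, rmse_manual, mae_manual, r2_manual)

# The manual values agree with scikit-learn's implementations.
assert np.isclose(mse_manual, mean_squared_error(y_true, y_pred))
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```

RMSE and MAE are in the same units as the target, which makes them easier to interpret than MSE.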

**Get R2 from the model**

In [65]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error



In [66]:
r2_score(y_test, predictions)

0.74471410126761

**Get MSE from the model**

In [67]:
mse = mean_squared_error(y_test, predictions)
print(mse)
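# MSE is in squared units of the target, so it is hard to judge on its own;
# RMSE and MAE below are in the same units as total_claim_amount and easier to read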

20840.087798052376


**Get RMSE from the model**

In [63]:
import math

rmse = math.sqrt(mse)
print(rmse)

144.36096355335252


**Get MAE from the model**

In [68]:
mean_absolute_error(y_test, predictions)

98.24923689303995
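
Whether error values like these are good or bad depends on the scale of the target. One possible way to judge them is to compare against a naive baseline that always predicts the training mean. The sketch below does this on synthetic data with scikit-learn's `DummyRegressor`; the data and all variable names are assumptions for illustration, not results from the lab dataset:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends linearly on the first feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 400 + 150 * X_toy[:, 0] + rng.normal(scale=50, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
toy_model = LinearRegression().fit(X_tr, y_tr)

rmse_baseline = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
rmse_model = np.sqrt(mean_squared_error(y_te, toy_model.predict(X_te)))

print(f"baseline RMSE: {rmse_baseline:.1f}, model RMSE: {rmse_model:.1f}")
```

A model whose RMSE is clearly below the baseline's is extracting signal from the features; comparing the RMSE with the mean or standard deviation of `y_test` gives a similar sense of scale.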