## 01 - Problem (case study)

An insurance company has collected its customer information in a dataset.
Our tasks are:
1) process and understand the data contained in the dataset,
2) deliver a model that predicts the total claim amount based on the given dataset.

## 02 - Getting Data

In [16]:
#Read the .csv file.
import pandas as pd
import numpy as np
data = pd.read_csv('./file1.csv')
data.rename(columns=str.lower, inplace=True)
data.rename(columns={
    'customer lifetime value': 'customer_lifetime_value',
    'monthly premium auto': 'monthly_premium_auto',
    'number of open complaints': 'number_of_open_complaints',
    'policy type': 'policy_type',
    'vehicle class': 'vehicle_class',
    'total claim amount': 'total_claim_amount',
    'st': 'state',
}, inplace=True)
data

Unnamed: 0,customer,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount
0,RB50392,Washington,,Master,,0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
1066,TM65736,Oregon,M,Master,305955.03%,38644,78.0,1/1/00,Personal Auto,Four-Door Car,361.455219
1067,VJ51327,Cali,F,High School or Below,2031499.76%,63209,102.0,1/2/00,Personal Auto,SUV,207.320041
1068,GS98873,Arizona,F,Bachelor,323912.47%,16061,88.0,1/0/00,Personal Auto,Four-Door Car,633.600000
1069,CW49887,California,F,Master,462680.11%,79487,114.0,1/0/00,Special Auto,SUV,547.200000
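
The column renames in the cell above follow one pattern: lowercase the header and replace spaces with underscores. A minimal sketch of that pattern as a single step, assuming raw headers such as `Customer Lifetime Value` and `ST` (hypothetical here, since the original CSV headers are not shown); `snake_case_columns` is an illustrative helper name, not part of pandas:

```python
import pandas as pd

def snake_case_columns(frame):
    """Lowercase every column name and replace spaces with underscores."""
    return frame.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))

# Toy frame with assumed raw headers (no rows needed to show the renaming).
toy = pd.DataFrame(columns=["Customer", "ST", "Customer Lifetime Value", "Total Claim Amount"])

# The abbreviation 'st' still needs an explicit rename to 'state'.
toy = snake_case_columns(toy).rename(columns={"st": "state"})
print(list(toy.columns))
```

This avoids listing every multi-word column by hand, at the cost of being less explicit about which columns are expected.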


## 03 - Cleaning/Wrangling/EDA

In [17]:
#Change headers names.
#Deal with NaN values.
#Categorical Features.
#Numerical Features.
#Exploration.

data['state'] = data['state'].replace('Cali','California').replace('AZ','Arizona').replace('WA','Washington')
data['education'] = data['education'].replace('Bachelors','Bachelor')
data['gender'] = data['gender'].replace('Femal','F').replace('Male','M').replace('female','F')
data['vehicle_class'] = data['vehicle_class'].replace('Sports Car','Luxury').replace('Luxury SUV','Luxury').replace('Luxury Car','Luxury')
data['customer_lifetime_value'] = data['customer_lifetime_value'].str.rstrip('%')
data['customer_lifetime_value'] = pd.to_numeric(data['customer_lifetime_value'], errors='coerce')
data[['col1','col2','col3']] = data['number_of_open_complaints'].str.split('/', expand=True)
data['number_of_open_complaints'] = data['col2'].astype(int)
data.drop(columns=['col1', 'col2', 'col3'], inplace=True)
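#Assign the result back instead of calling fillna(inplace=True) on a selected column, which recent pandas versions warn about.
data['gender'] = data['gender'].fillna("NG")
data['customer_lifetime_value'] = data['customer_lifetime_value'].fillna(data['customer_lifetime_value'].mean())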
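
The label fixes and the complaints parsing in the cell above can also be written more compactly. The following is a sketch on a small toy frame (the rows are made up for illustration, reusing only label variants that appear in the cell): a nested dictionary passed to `DataFrame.replace` holds the per-column corrections, and the middle token of the `1/0/00`-style field is taken as the number of open complaints, as the original code does with `col2`.

```python
import pandas as pd

# Toy rows (hypothetical) using the label variants handled above.
toy = pd.DataFrame({
    "state": ["Cali", "AZ", "WA", "Oregon"],
    "gender": ["Femal", "Male", "female", None],
    "number_of_open_complaints": ["1/0/00", "1/2/00", "1/1/00", "1/0/00"],
})

# One mapping per column: {column: {old_label: new_label}}.
label_fixes = {
    "state": {"Cali": "California", "AZ": "Arizona", "WA": "Washington"},
    "gender": {"Femal": "F", "female": "F", "Male": "M"},
}
cleaned = toy.replace(label_fixes)

# Missing gender gets its own category, as in the notebook.
cleaned["gender"] = cleaned["gender"].fillna("NG")

# Keep the middle token of 'a/b/c' as the complaint count.
cleaned["number_of_open_complaints"] = (
    cleaned["number_of_open_complaints"].str.split("/").str[1].astype(int)
)
print(cleaned)
```

Keeping the corrections in one dictionary makes it easier to review which labels were merged.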

## 04 - Processing Data

In [18]:
#Dealing with outliers.
#Normalization.
#Encoding Categorical Data.
#Splitting into train set and test set.
#Remove customer_lifetime_value outliers outside 1.5*IQR from the quartiles.
q1_clv, q3_clv = np.percentile(data['customer_lifetime_value'], [25, 75])
iqr_clv = q3_clv - q1_clv
upper_limit_clv = q3_clv + 1.5*iqr_clv
lower_limit_clv = q1_clv - 1.5*iqr_clv
data = data[(data['customer_lifetime_value']>lower_limit_clv) & (data['customer_lifetime_value']<upper_limit_clv)]
#Same rule for monthly_premium_auto, computed on the already filtered data.
q1_mpa, q3_mpa = np.percentile(data['monthly_premium_auto'], [25, 75])
iqr_mpa = q3_mpa - q1_mpa
upper_limit_mpa = q3_mpa + 1.5*iqr_mpa
lower_limit_mpa = q1_mpa - 1.5*iqr_mpa
data = data[(data['monthly_premium_auto']>lower_limit_mpa) & (data['monthly_premium_auto']<upper_limit_mpa)]

numerical_value = data.select_dtypes(include=np.number)
categorical_value = data.select_dtypes(include='object')
categorical_value = categorical_value.drop(['customer'], axis=1)
X_cat_nom = categorical_value.drop(['education'], axis=1)
X_cat_ord = categorical_value[['education']]
y = numerical_value['total_claim_amount']

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder().fit(X_cat_nom)
column_name_nom = encoder.get_feature_names_out(X_cat_nom.columns)
encoded = encoder.transform(X_cat_nom).toarray()
#Keep the original index so the encoded rows line up with numerical_value after the outlier filtering.
onehot_encoded = pd.DataFrame(encoded, columns=column_name_nom, index=X_cat_nom.index)
onehot_encoded = onehot_encoded.drop(columns=['gender_M','policy_type_Personal Auto','policy_type_Special Auto','policy_type_Corporate Auto'])

X = pd.concat([numerical_value, onehot_encoded], axis=1)

from sklearn.preprocessing import MinMaxScaler
#MinMaxScaler rescales each column to the range [0, 1] (normalization, not standardization).
transformer = MinMaxScaler().fit(X)
x_normalized = transformer.transform(X)
X = pd.DataFrame(x_normalized, columns=X.columns, index=X.index)

#from sklearn.preprocessing import LabelEncoder
#label_encoded = LabelEncoder().fit(X_cat_ord).transform(X_cat_ord)
#label_encoded = pd.DataFrame(label_encoded,columns=X_cat_ord.columns)
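#Non-relevant variable, so the label encoding above stays commented out.

#Final feature set: the numerical columns only, without the target. This replaces the encoded and scaled X built above.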
X = numerical_value.drop(['total_claim_amount'], axis=1)
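
The outlier step at the top of this cell applies the same 1.5*IQR rule twice. A minimal sketch of that rule as a reusable helper follows; `iqr_bounds` and `drop_iqr_outliers` are illustrative names, and the toy values are made up. As in the cell, columns are filtered one after another, so each column's bounds are computed on the rows that survived the previous filter.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(frame, columns, k=1.5):
    """Filter the columns sequentially, keeping rows strictly inside the limits."""
    for column in columns:
        lower, upper = iqr_bounds(frame[column], k)
        frame = frame[(frame[column] > lower) & (frame[column] < upper)]
    return frame

# Toy data (hypothetical): 100 lies far above the upper limit.
toy = pd.DataFrame({"value": [1, 2, 3, 4, 100]})
kept = drop_iqr_outliers(toy, ["value"])
print(kept)
```

With the notebook's data, the equivalent call would be `drop_iqr_outliers(data, ['customer_lifetime_value', 'monthly_premium_auto'])`.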



## 05 - Modeling

In [19]:
#Apply model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
from sklearn import linear_model
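lm = linear_model.LinearRegression()
#Note: the model is fitted on the full dataset, so the test split is not truly unseen; fit on (X_train, y_train) for an out-of-sample estimate.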
lm.fit(X,y)
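
Because the cell above fits on all of `X` and `y`, the test-split scores reported later are not a true out-of-sample estimate. A sketch of the usual alternative, fitting on the training split only, is shown below on synthetic stand-in data (not the insurance dataset). The `random_state` values are arbitrary choices added here so that the split is reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): y depends linearly on two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# random_state fixes the shuffle so repeated runs give the same split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# Fit on the training half only, so the test half stays unseen.
model = LinearRegression().fit(X_tr, y_tr)

print("r2 train:", model.score(X_tr, y_tr))
print("r2 test:", model.score(X_te, y_te))
```

Applied to the notebook, this would mean calling `lm.fit(X_train, y_train)`; the reported metrics would then need to be recomputed.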

## 06 - Model Validation

In [20]:
#R2.
#MSE.
#RMSE.
#MAE.

from sklearn.metrics import r2_score
#R2 on the training split.
predictions = lm.predict(X_train)
print("r2 predicted:",r2_score(y_train, predictions))
#R2 on the test split.
predictions_test = lm.predict(X_test)
print("r2 tested:",r2_score(y_test, predictions_test))

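from sklearn.metrics import mean_squared_error
#score and the printed mse are computed on the full dataset.
print("score:",lm.score(X,y))
y_pred=lm.predict(X)
#This test-split mse is not printed; it is only used for the rmse below.
mse=mean_squared_error(y_test,predictions_test)
print("mse:",mean_squared_error(y,y_pred))

from sklearn.metrics import mean_absolute_error
import math
#mae and rmse are computed on the test split.
mae = mean_absolute_error(y_test, predictions_test)
rmse = math.sqrt(mse)
print("mae:",mae)
print("rmse:",rmse)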

r2 predicted: 0.38975282264860844
r2 tested: 0.352067859647343
score: 0.3701812564325643
mse: 35683.54677363658
mae: 146.48865311995584
rmse: 195.48644490235242
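
The figures above mix two evaluation sets: the printed mse is computed on the full dataset, while the rmse is the square root of the test-split mse (195.49 squared is about 38,215, not 35,684). One way to avoid this is to compute all metrics from the same pair of arrays. A minimal sketch follows; `regression_report` is an illustrative helper name and the toy values are made up.

```python
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return R2, MSE, RMSE and MAE computed on the same pair of arrays."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mse": mse,
        "rmse": math.sqrt(mse),  # always the root of the mse reported alongside it
        "mae": mean_absolute_error(y_true, y_pred),
    }

# Toy values (hypothetical) just to exercise the helper.
report = regression_report([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(report)
```

In the notebook this could be called once per split, for example `regression_report(y_test, predictions_test)`.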


## 07 - Reporting

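Considering the information in the dataset, the linear regression explains roughly 35% to 39% of the variance in the total claim amount (R2 of about 0.39 on the training split and 0.35 on the test split), which is low. On the test split the MAE is about 146 and the RMSE about 195. Additional data or features may be needed to get a better fit in the future.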