## Introduction

This dataset is provided by an insurance company. The objective is to predict which clients are going to subscribe to the insurance offer, so this is a **binary classification problem**.

This dataset was used in a hackathon. The train and test sets were already defined, and the goal was to find the best prediction as measured by the ROC AUC.

## State-of-the-art

My notebook is influenced by these two notebooks:

- [Jakub's notebook.](https://www.kaggle.com/jjmewtw/actuarial-study-eda-pca-cluster-estimation-0-88)
- [Kostiantyn's notebook.](https://www.kaggle.com/isaienkov/insurance-prediction-eda-and-modeling-acc-88)


Both of them did great work on the EDA, so I won't pay that much attention to analysing each and every feature. The difference between them is the **approach** they take to solving the problem.

The target feature has an imbalanced distribution. Jakub applied a clustering treatment plus oversampling to enrich the data. Kostiantyn, on the other hand, focuses on a good hyperparameter search, without oversampling.
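To make the oversampling idea concrete, here is a minimal sketch of *random* oversampling with NumPy. It is only one common form of oversampling and not necessarily what Jakub's notebook does; the function name `random_oversample` and the toy arrays are made up for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes have the same size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample minority rows with replacement to fill the gap
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep_idx = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep_idx], y[keep_idx]

# Toy data: 8 negatives and 2 positives (made-up values)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_balanced, y_balanced = random_oversample(X_toy, y_toy)
print(np.bincount(y_balanced))  # class counts after oversampling
```

If something like this were used, it would normally be applied to the training split only, so that the evaluation data keeps its original class balance.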

## Exploratory Data Analysis

Our target is the binary feature 'Response'.

In [None]:
# imports + loading dataset


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn import neighbors
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import time
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
import optuna
from optuna.samplers import TPESampler


train = pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
test = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv')

print(train.shape)
print(test.shape)

train.head()
test.head()

We drop the 'id' feature because it doesn't give any information about the target.

In [None]:
train = train.drop(['id'], axis=1)
test = test.drop(['id'], axis=1)

We make sure there aren't null values.

In [None]:
train.info()

The output shows several *object* columns, which we will have to transform into numeric values later on.

Using train.describe() we can see some stats about our dataset. It only shows the numeric variables though.

In [None]:
train.describe()

Some features may contain outliers, like "Annual_Premium", "Age" or "Vintage".

With the next line we confirm the imbalance of the target feature.

In [None]:
train['Response'].value_counts()/len(train)

Because of this, we will consider the F1 Score more important than other metrics, like the Accuracy.
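A small worked example of why accuracy can mislead here. The label proportions and both sets of predictions are hypothetical, chosen only so that the two predictors have the same accuracy; the helper functions are written by hand for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def f1(y_true, y_pred):
    """F1 of the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical labels: 88 negatives followed by 12 positives
y_true = [0] * 88 + [1] * 12

# Predictor A always answers "no"
y_majority = [0] * 100
# Predictor B finds 6 of the 12 positives, at the price of 6 false alarms
y_model = [1] * 6 + [0] * 82 + [1] * 6 + [0] * 6

print(accuracy(y_true, y_majority), f1(y_true, y_majority))  # 0.88 0.0
print(accuracy(y_true, y_model), f1(y_true, y_model))        # 0.88 0.5
```

Both predictors score the same accuracy, but only the F1 Score reveals that the first one never finds a positive case.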

## **Preprocessing**

### **Encoding**


In [None]:
from sklearn.preprocessing import OrdinalEncoder

C = (train.dtypes == 'object')
CategoricalVariables = list(C[C].index)

print(CategoricalVariables)

We want to encode Gender and Vehicle_Damage as binary values (0 and 1).

For Vehicle_Age we will apply a different transformation, so that the codes ascend with the age of the car.

In [None]:
enc = OrdinalEncoder()
train[["Gender","Vehicle_Damage"]] = enc.fit_transform(train[["Gender","Vehicle_Damage"]])
# reuse the encoder fitted on train so that both sets share the same category-to-code mapping
test[["Gender","Vehicle_Damage"]] = enc.transform(test[["Gender","Vehicle_Damage"]])

train.loc[train['Vehicle_Age'] == '> 2 Years', 'Vehicle_Age'] = 2
train.loc[train['Vehicle_Age'] == '1-2 Year', 'Vehicle_Age'] = 1
train.loc[train['Vehicle_Age'] == '< 1 Year', 'Vehicle_Age'] = 0
test.loc[test['Vehicle_Age'] == '> 2 Years', 'Vehicle_Age'] = 2
test.loc[test['Vehicle_Age'] == '1-2 Year', 'Vehicle_Age'] = 1
test.loc[test['Vehicle_Age'] == '< 1 Year', 'Vehicle_Age'] = 0

train['Vehicle_Age']=train['Vehicle_Age'].astype(float)
test['Vehicle_Age']=test['Vehicle_Age'].astype(float)


test.head()
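The ordered encoding of Vehicle_Age can also be written with an explicit mapping, which keeps the order in one place. This is a sketch on a made-up mini-frame; `toy` and `vehicle_age_order` are illustrative names, and the three category strings are the ones used in the cell above.

```python
import pandas as pd

# Hypothetical mini-frame with the three categories that appear in 'Vehicle_Age'
toy = pd.DataFrame({'Vehicle_Age': ['< 1 Year', '1-2 Year', '> 2 Years', '1-2 Year']})

# Explicit order: older vehicles get larger codes
vehicle_age_order = {'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2}

# Series.map replaces each category by its code; unknown categories would become NaN
toy['Vehicle_Age'] = toy['Vehicle_Age'].map(vehicle_age_order).astype(float)
print(toy['Vehicle_Age'].tolist())  # [0.0, 1.0, 2.0, 1.0]
```

The same dictionary could then be applied to both train and test, which guarantees that the two sets are encoded identically.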

### Outliers

In [None]:
# draw boxplots to visualize outliers
has_outliers = ['Age', 'Annual_Premium', 'Vintage']

plt.figure(figsize=(10,7.5))

for i,col in enumerate(has_outliers):
    plt.subplot(1, 3, i+1)
    fig = train.boxplot(column=col)
    fig.set_title('')
    fig.set_ylabel(col)

Therefore, of our candidates, only "Annual_Premium" contains outliers.

In [None]:
has_outliers.remove('Age')
has_outliers.remove('Vintage')
up_outliers = []
low_outliers = []

def max_value(df, variable, top):
    return np.where(df[variable]>top, top, df[variable])


for col in has_outliers:
    IQR = train[col].quantile(0.75) - train[col].quantile(0.25)
    Lower_fence = train[col].quantile(0.25) - (IQR * 3)
    Upper_fence = train[col].quantile(0.75) + (IQR * 3)
    low_outliers.append(Lower_fence)
    up_outliers.append(Upper_fence)
    print(f'{col} outliers are values < {Lower_fence} or > {Upper_fence}')

for col, outlier in zip(has_outliers, up_outliers):
    train[col] = max_value(train, col, outlier)

# Same for the test set: reuse the upper fences computed on train,
# so that train and test are capped at the same value

for col, outlier in zip(has_outliers, up_outliers):
    test[col] = max_value(test, col, outlier)

test.head()
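The capping rule used above (upper fence = Q3 + 3·IQR) can be checked by hand on a tiny example. The values below are made up, and `upper_fence` is an illustrative helper rather than part of the notebook.

```python
import numpy as np

def upper_fence(values, k=3.0):
    """Upper outlier fence: third quartile plus k times the interquartile range."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + k * (q3 - q1)

# Made-up values: one obvious outlier in the training column
train_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
test_values = np.array([2.0, 50.0])

# The fence is computed on train and reused for test
fence = upper_fence(train_values)
train_capped = np.minimum(train_values, fence)
test_capped = np.minimum(test_values, fence)

print(fence)                 # 12.25
print(test_capped.tolist())  # [2.0, 12.25]
```

Here Q1 = 2.25 and Q3 = 4.75, so the IQR is 2.5 and the fence is 4.75 + 3 × 2.5 = 12.25; every value above it is replaced by the fence itself.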

### Heatmap

In [None]:
plt.figure(figsize=(16,12))
ax = sns.heatmap(train.corr(), annot=True, fmt='.2f')
ax.set_title('Correlations Insurance Sell')


Using the heatmap we find the following correlations between attributes.

Regarding the target attribute "Response", we find a weak relationship with whether the car has been damaged (Vehicle_Damage) and a negative correlation with whether the customer was previously insured (Previously_Insured).

Among the remaining attributes:

- The age of the person is strongly correlated with the age of the car, possibly because young people often drive old cars.
- Age is correlated with the sales channel, possibly because young people usually buy online and older people through other channels.
- Previously_Insured is correlated with age and with the age of the vehicle, possibly because young people tend to change insurers frequently.
- Previously_Insured is correlated with Vehicle_Damage. This may be an indirect effect of the correlations of the person's age with the age of the car and with the chosen sales channel.

### Normalization
We will apply feature scaling, specifically MinMaxScaler.

With this, all training values end up between 0 and 1. This can speed up gradient-based optimizers and helps distance-based classifiers such as KNN, where features with large ranges would otherwise dominate.

In [None]:
X = train.loc[:, train.columns != 'Response']
Y = train.loc[:, train.columns == 'Response']
Y = Y.to_numpy().ravel()

In [None]:
from sklearn.preprocessing import MinMaxScaler

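scaler = MinMaxScaler()
cols = X.columns
X = scaler.fit_transform(X)  # fit the scaler on the training features only
X = pd.DataFrame(X, columns=cols)
print(X.describe())  # every training column should now span [0, 1]
X = X.to_numpy()


# reuse the scaler fitted on train so that test is scaled with the same min/max
cols = test.columns
test = scaler.transform(test)
test = pd.DataFrame(test, columns=cols)
print(test.describe())
test = test.to_numpy()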
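Min-max scaling maps each feature with x' = (x - min) / (max - min), where min and max are taken from the training data. The sketch below works through that formula on made-up numbers; `min_max_scale` is an illustrative helper, not scikit-learn's implementation.

```python
import numpy as np

# Made-up single feature: training values and unseen test values
train_col = np.array([10.0, 20.0, 30.0])
test_col = np.array([15.0, 40.0])

# The statistics come from the training column only
col_min, col_max = train_col.min(), train_col.max()

def min_max_scale(x):
    """Apply x' = (x - min) / (max - min) with the training min and max."""
    return (x - col_min) / (col_max - col_min)

train_scaled = min_max_scale(train_col)
test_scaled = min_max_scale(test_col)

print(train_scaled.tolist())  # [0.0, 0.5, 1.0]
print(test_scaled.tolist())   # [0.25, 1.5]
```

Note that a test value larger than the training maximum ends up above 1. That is expected when the training statistics are reused, and it keeps train and test on the same scale.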

## Model selection

For model selection we will consider the following:

- LogisticRegression
- KNN
- DecisionTree
- Random Forest

We do not consider SVMs because of their training-time cost on a dataset of this size.

We also find other types of classifiers interesting, such as:
- XGBClassifier: boosting method based on decision trees, in which each new tree tries to correct the errors of the previous ones during training.
- LGBMClassifier: gradient boosting method developed by Microsoft, similar in spirit to XGBClassifier. It is usually faster because it grows each tree leaf-wise (vertically) instead of level-wise (horizontally), possibly at the cost of some accuracy.
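The idea of "iteratively correcting errors" can be illustrated with a toy boosting loop. This is a deliberately simplified sketch (squared error, one feature, depth-1 trees) and not the algorithm XGBoost or LightGBM actually implement, which add regularization, second-order information and many optimizations. All names and data below are made up.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on a 1-D feature that best splits the residuals (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current errors and adds a fraction of it."""
    prediction = np.full(len(y), y.mean())
    losses = []
    for _ in range(n_rounds):
        residual = y - prediction  # what the ensemble still gets wrong
        threshold, left_value, right_value = fit_stump(x, residual)
        prediction += learning_rate * np.where(x <= threshold, left_value, right_value)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return prediction, losses

# Toy step-shaped target: 0 for x <= 4, 1 otherwise
x = np.arange(10, dtype=float)
y = (x > 4).astype(float)

prediction, losses = boost(x, y)
print(losses[0], losses[-1])  # the error shrinks as rounds are added
```

Each round only has to model what the previous rounds left unexplained, which is why the training error keeps decreasing as more trees are added.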

### Candidate classifiers

In [None]:
list_acc=np.zeros((5,4))
list_f1=np.zeros((5,4))
list_time=np.zeros((5,4))
for i in range(5):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

    logreg = LogisticRegression(max_iter=100000)
    nn = neighbors.KNeighborsClassifier() # n_neighbors=5 by default
    dt = tree.DecisionTreeClassifier()
    clf = RandomForestClassifier()
    
    
    t0 = time.time()
    logreg.fit(X_train,Y_train)
    t1 = time.time()
    nn.fit(X_train,Y_train)
    t2 = time.time()
    dt.fit(X_train,Y_train)
    t3 = time.time()
    clf.fit(X_train, Y_train)
    t4 = time.time()

    
    Y_logreg=logreg.predict(X_test)
    Y_nn=nn.predict(X_test)
    Y_dt=dt.predict(X_test)
    Y_clf=clf.predict(X_test)
    
    list_acc[i][0] = metrics.accuracy_score(Y_test, Y_logreg)
    list_acc[i][1] = metrics.accuracy_score(Y_test, Y_nn)
    list_acc[i][2] = metrics.accuracy_score(Y_test, Y_dt)
    list_acc[i][3] = metrics.accuracy_score(Y_test, Y_clf)

    list_f1[i][0] = metrics.f1_score(Y_test, Y_logreg)
    list_f1[i][1] = metrics.f1_score(Y_test, Y_nn)
    list_f1[i][2] = metrics.f1_score(Y_test, Y_dt)
    list_f1[i][3] = metrics.f1_score(Y_test, Y_clf)

    list_time[i][0] = t1-t0
    list_time[i][1] = t2-t1
    list_time[i][2] = t3-t2
    list_time[i][3] = t4-t3

From this execution we obtain the following results with respect to the Accuracy, the F1 Score and the execution time.

In [None]:
plt.boxplot(list_acc);
for i in range(4):
    xderiv = (i+1)*np.ones(list_acc[:,i].shape)+(np.random.rand(5,)-0.5)*0.1
    plt.plot(xderiv,list_acc[:,i],'ro',alpha=0.3)
    
ax = plt.gca()
ax.set_xticklabels(['Logistic regression','NN','Tree','Forest'])
plt.ylabel('Accuracy')

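We note that all models reach a high accuracy. This is expected with such an imbalanced target: always predicting the majority class would already score well.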

In [None]:
plt.boxplot(list_f1);
for i in range(4):
    xderiv = (i+1)*np.ones(list_f1[:,i].shape)+(np.random.rand(5,)-0.5)*0.1
    plt.plot(xderiv,list_f1[:,i],'ro',alpha=0.3)
    
ax = plt.gca()
ax.set_xticklabels(['Logistic regression','NN','Tree','Forest'])
plt.ylabel('F1 Score')

We note that only the decision-tree-based models achieve a minimally "acceptable" F1 Score.

In [None]:
plt.boxplot(list_time);
for i in range(4):
    xderiv = (i+1)*np.ones(list_time[:,i].shape)+(np.random.rand(5,)-0.5)*0.1
    plt.plot(xderiv,list_time[:,i],'ro',alpha=0.3)
    
ax = plt.gca()
ax.set_xticklabels(['Logistic regression','NN','Tree','Forest'])
plt.ylabel('Time')

We observe that KNN takes a long time compared to the other methods, so we discard it as a model.

### Random forest

In [None]:
from sklearn.metrics import roc_curve,roc_auc_score
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
clf = RandomForestClassifier()
clf.fit(X_train, Y_train)
lr_probs = clf.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]

lr_auc = roc_auc_score(Y_test, lr_probs)

# summarize scores
print('ROC AUC =', lr_auc)
# calculate roc curves
lr_fpr, lr_tpr, _ = roc_curve(Y_test, lr_probs)

# plot the roc curve for the model
plt.plot(lr_fpr, lr_tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
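One way to read the ROC AUC value: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below computes it by brute force on four made-up scores; `pairwise_auc` is an illustrative helper and is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive is ranked higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 vs 0.4 is not), hence 0.75. Because only the ranking matters, the metric does not depend on the 0.5 threshold that `predict` applies.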

In [None]:
def plot_confusion_matrix(y_real, y_pred):
    cm = confusion_matrix(y_real, y_pred)

    ax= plt.subplot()
    sns.heatmap(cm, annot=True, ax = ax, fmt='g')

    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')

preds= clf.predict(X_test)
plot_confusion_matrix(Y_test, preds)

### Logistic Regression

In [None]:
from sklearn.metrics import roc_curve,roc_auc_score
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
clf = LogisticRegression(max_iter=100000)
clf.fit(X_train, Y_train)
lr_probs = clf.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]

lr_auc = roc_auc_score(Y_test, lr_probs)

# summarize scores
print('ROC AUC =', lr_auc)
# calculate roc curves
lr_fpr, lr_tpr, _ = roc_curve(Y_test, lr_probs)

# plot the roc curve for the model
plt.plot(lr_fpr, lr_tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

In [None]:
preds= clf.predict(X_test)
plot_confusion_matrix(Y_test, preds)

### Decision Tree

In [None]:
from sklearn.metrics import roc_curve,roc_auc_score
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
clf = tree.DecisionTreeClassifier() 
clf.fit(X_train, Y_train)

lr_probs = clf.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]

lr_auc = roc_auc_score(Y_test, lr_probs)

# summarize scores
print('ROC AUC =', lr_auc)
# calculate roc curves
lr_fpr, lr_tpr, _ = roc_curve(Y_test, lr_probs)

# plot the roc curve for the model
plt.plot(lr_fpr, lr_tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Despite not getting a good ROC AUC score, the decision tree reaches a good F1 Score, as the confusion matrix below shows.

In [None]:
preds= clf.predict(X_test)
plot_confusion_matrix(Y_test, preds)

After trying all these models, we conclude that for this dataset the models based on decision trees are the way to go, so we will try one of the tree-based models that are widely used on Kaggle for their good results.
Specifically, the **XGBClassifier**.

## XGBClassifier & Hyperparameter Search

We will directly apply the search for the best parameters to the **XGBClassifier** model, as this is where we expect to find the largest improvement.

In [None]:
np.random.seed(777)
sampler = TPESampler(seed=0)

def create_model(trial):
    max_depth = trial.suggest_int("max_depth", 2, 20)
    n_estimators = trial.suggest_int("n_estimators", 1, 400)
    learning_rate = trial.suggest_float('learning_rate', 0.0000001, 0.2)
    gamma = trial.suggest_float('gamma', 0.0000001, 1)
    scale_pos_weight = trial.suggest_int("scale_pos_weight", 1, 20)
    model = XGBClassifier(
        learning_rate=learning_rate, 
        n_estimators=n_estimators, 
        max_depth=max_depth, 
        gamma=gamma, 
        scale_pos_weight=scale_pos_weight, 
        random_state=0,
        eval_metric='error' # binary classification error rate, kept fixed for every trial
    )
    return model

def objective(trial):
    model = create_model(trial)
    model.fit(X_train, Y_train)
    preds = model.predict(X_test)
    score = f1_score(Y_test, preds) # we fix the f1_score as evaluating metric
    return score

"""
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=300) # TPE-guided parameter search, for n_trials trials

xgb_params = study.best_params
"""



In [None]:
# After 300 trials, these are the best parameters that optuna could find. I will use them directly instead of re-running the (slow) hyperparameter search
xgb_params = {
    'max_depth': 2, 
    'n_estimators': 384, 
    'learning_rate': 0.13878057972985153, 
    'gamma': 0.0256366368300565, 
    'scale_pos_weight': 2,
    'eval_metric':'error',
    'random_state':0
}

xgb = XGBClassifier(**xgb_params)
xgb.fit(X_train, Y_train)
preds = xgb.predict(X_test)
print('Optimized XGBClassifier accuracy: ', accuracy_score(Y_test, preds))
print('Optimized XGBClassifier f1-score', f1_score(Y_test, preds))

lr_probs = xgb.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]

lr_auc = roc_auc_score(Y_test, lr_probs)

# summarize scores
print('ROC AUC =', lr_auc)
# calculate roc curves
lr_fpr, lr_tpr, _ = roc_curve(Y_test, lr_probs)

# plot the roc curve for the model
plt.plot(lr_fpr, lr_tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
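One of the parameters tuned above, `scale_pos_weight`, controls how much weight the positive class gets. The XGBoost documentation mentions the ratio of negative to positive instances as a typical value to consider. The sketch below computes that ratio on hypothetical labels; `suggested_scale_pos_weight` is an illustrative helper, and the search above found a different optimum (2) when maximizing the F1 Score.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common starting point for scale_pos_weight."""
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    return negatives / positives

# Hypothetical labels: 88 negatives and 12 positives
toy_labels = [0] * 88 + [1] * 12

print(round(suggested_scale_pos_weight(toy_labels), 2))  # 7.33
```

A value like this falls inside the 1 to 20 range explored by the search, so it can serve as a sanity check on the range rather than as a replacement for tuning.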

To conclude the notebook, we will prepare the predictions on the test set, as if we were submitting them to the hackathon :)

In [None]:
submit = xgb.predict(test)
submitdf= pd.DataFrame(data=submit, columns=['Response'])
submitdf

In [None]:
submitdf.to_csv('output_submission.csv',index=False)

## Conclusions

In this notebook I learned a lot about the binary classification methods currently used on Kaggle, XGBClassifier and LightGBM. In addition, I also learned how optuna, a hyperparameter optimization framework, works.

Another aspect worth commenting on is that I have also broadened my view by seeing the different approach taken by [Jakub](https://www.kaggle.com/jjmewtw/actuarial-study-eda-pca-cluster-estimation-0-88) in his notebook. I find it very interesting to improve the quality of the data (enrich it) for better results.

Thank you all if you reached this point. **Sending a virtual hug from Barcelona!**