# Diplodatos Kaggle Competition

We present this piece of code to create the baseline for the competition, and as an example of how to deal with this kind of problem. The main goals are that you:

1. Explore the data and learn from it
1. Try different models and see which one best fits the given data
1. Get a higher score than the given one in the current baseline example
1. Try to get the highest score in the class :)

In [189]:
# Import the required packages
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read Data

In [190]:
train_df = pd.read_csv("../data/travel_insurance_prediction_train.csv")
test_df = pd.read_csv("../data/travel_insurance_prediction_test.csv")

## Explore the Data

It is your task to explore the data, analyze it, and extract insights, then use those insights to pick a better model.

In [191]:
train_df.head()

| | Customer | Age | Employment Type | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad | TravelInsurance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 33 | Private Sector/Self Employed | Yes | 550000 | 6 | 0 | No | No | 1 |
| 1 | 2 | 28 | Private Sector/Self Employed | Yes | 800000 | 7 | 0 | Yes | No | 0 |
| 2 | 3 | 31 | Private Sector/Self Employed | Yes | 1250000 | 4 | 0 | No | No | 0 |
| 3 | 4 | 31 | Government Sector | No | 300000 | 7 | 0 | No | No | 0 |
| 4 | 5 | 28 | Private Sector/Self Employed | Yes | 1250000 | 3 | 0 | No | No | 0 |


In [192]:
test_df.head()

| | Customer | Age | Employment Type | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1491 | 29 | Private Sector/Self Employed | Yes | 1100000 | 4 | 0 | No | No |
| 1 | 1492 | 28 | Private Sector/Self Employed | Yes | 750000 | 5 | 1 | Yes | No |
| 2 | 1493 | 31 | Government Sector | Yes | 1500000 | 4 | 0 | Yes | Yes |
| 3 | 1494 | 28 | Private Sector/Self Employed | Yes | 1400000 | 3 | 0 | No | Yes |
| 4 | 1495 | 33 | Private Sector/Self Employed | Yes | 1500000 | 4 | 0 | Yes | Yes |


**TravelInsurance** is the column that we should predict. That column is not present in the test set.
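As a starting point for the exploration, the sketch below checks the class balance of the target and the purchase rate per category of one column. It is only an illustration: `sample_df` is a toy frame built from the five rows printed by `train_df.head()`, so the numbers it prints say nothing about the full data set. In the notebook you would run the same calls on `train_df`.

```python
import pandas as pd

# Toy frame copied from the five rows shown by train_df.head();
# replace sample_df with train_df in the notebook.
sample_df = pd.DataFrame({
    "Age": [33, 28, 31, 31, 28],
    "AnnualIncome": [550000, 800000, 1250000, 300000, 1250000],
    "FrequentFlyer": ["No", "Yes", "No", "No", "No"],
    "TravelInsurance": [1, 0, 0, 0, 0],
})

# Share of each class in the target column.
class_balance = sample_df["TravelInsurance"].value_counts(normalize=True)
print(class_balance)

# Fraction of customers who bought insurance, per FrequentFlyer category.
rate_by_flyer = sample_df.groupby("FrequentFlyer")["TravelInsurance"].mean()
print(rate_by_flyer)
```

An imbalanced target is one reason the searches in this notebook score with `f1` rather than plain accuracy.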

In [193]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Customer             1490 non-null   int64 
 1   Age                  1490 non-null   int64 
 2   Employment Type      1490 non-null   object
 3   GraduateOrNot        1490 non-null   object
 4   AnnualIncome         1490 non-null   int64 
 5   FamilyMembers        1490 non-null   int64 
 6   ChronicDiseases      1490 non-null   int64 
 7   FrequentFlyer        1490 non-null   object
 8   EverTravelledAbroad  1490 non-null   object
 9   TravelInsurance      1490 non-null   int64 
dtypes: int64(6), object(4)
memory usage: 116.5+ KB


In [194]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Customer             497 non-null    int64 
 1   Age                  497 non-null    int64 
 2   Employment Type      497 non-null    object
 3   GraduateOrNot        497 non-null    object
 4   AnnualIncome         497 non-null    int64 
 5   FamilyMembers        497 non-null    int64 
 6   ChronicDiseases      497 non-null    int64 
 7   FrequentFlyer        497 non-null    object
 8   EverTravelledAbroad  497 non-null    object
dtypes: int64(5), object(4)
memory usage: 35.1+ KB


In [195]:
train_df.describe()

| | Customer | Age | AnnualIncome | FamilyMembers | ChronicDiseases | TravelInsurance |
|---|---|---|---|---|---|---|
| count | 1490.0 | 1490.0 | 1490.0 | 1490.0 | 1490.0 | 1490.0 |
| mean | 745.5 | 29.667114 | 927818.8 | 4.777181 | 0.275839 | 0.357047 |
| std | 430.270264 | 2.880994 | 381171.5 | 1.640248 | 0.447086 | 0.47929 |
| min | 1.0 | 25.0 | 300000.0 | 2.0 | 0.0 | 0.0 |
| 25% | 373.25 | 28.0 | 600000.0 | 4.0 | 0.0 | 0.0 |
| 50% | 745.5 | 29.0 | 900000.0 | 5.0 | 0.0 | 0.0 |
| 75% | 1117.75 | 32.0 | 1250000.0 | 6.0 | 1.0 | 1.0 |
| max | 1490.0 | 35.0 | 1800000.0 | 9.0 | 1.0 | 1.0 |


In [196]:
test_df.describe()

| | Customer | Age | AnnualIncome | FamilyMembers | ChronicDiseases |
|---|---|---|---|---|---|
| count | 497.0 | 497.0 | 497.0 | 497.0 | 497.0 |
| mean | 1739.0 | 29.599598 | 947585.5 | 4.68008 | 0.283702 |
| std | 143.615807 | 3.010506 | 363581.8 | 1.51347 | 0.451248 |
| min | 1491.0 | 25.0 | 300000.0 | 2.0 | 0.0 |
| 25% | 1615.0 | 28.0 | 650000.0 | 4.0 | 0.0 |
| 50% | 1739.0 | 29.0 | 950000.0 | 4.0 | 0.0 |
| 75% | 1863.0 | 32.0 | 1250000.0 | 6.0 | 1.0 |
| max | 1987.0 | 35.0 | 1750000.0 | 9.0 | 1.0 |


## Baseline

In this section we present a baseline based on a decision tree classifier.

Many of the attributes are binary, and there are a couple of numeric attributes. We might be able to one-hot encode some of them (e.g. family members) or even discretize others (age and annual income); this will become clearer after the EDA.

In [197]:
from sklearn.compose import make_column_transformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.linear_model import SGDClassifier

### Transform the columns into features

First we need to transform the columns into features. The type of features we use will have a direct impact on the final result. In this example we decided to discretize some numeric features and one-hot encode others. The number of bins, which columns get one-hot encoded, and so on are all up to you to experiment with.

In [166]:
transformer = make_column_transformer(
    (KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"), ["Age", "AnnualIncome"]),
    (OneHotEncoder(categories="auto", dtype="int", handle_unknown="ignore"),
     ["Employment Type", "GraduateOrNot", "FamilyMembers", "FrequentFlyer", "EverTravelledAbroad"]),
    remainder="passthrough")
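To make the two transformations more concrete, here is a minimal sketch of what each one does to a single column. The input values are toy data (the ages are the min, quartiles, and max reported by `train_df.describe()`), the variable names are illustrative, and `n_bins=2` is chosen only to keep the output small.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Discretization: each age is replaced by the index of its quantile bin.
ages = np.array([[25], [28], [29], [32], [35]])
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
age_bins = binner.fit_transform(ages)
print(binner.bin_edges_[0])  # bin boundaries learned from the data
print(age_bins.ravel())      # bin index per row

# One-hot encoding: one 0/1 column per category seen during fit.
flyer = np.array([["No"], ["Yes"], ["No"]], dtype=object)
encoder = OneHotEncoder(dtype=int, handle_unknown="ignore")
flyer_onehot = encoder.fit_transform(flyer).toarray()
print(encoder.categories_[0])
print(flyer_onehot)

# With handle_unknown="ignore", a category not seen during fit
# becomes a row of zeros instead of raising an error.
unseen = encoder.transform(np.array([["Maybe"]], dtype=object)).toarray()
print(unseen)
```

Note that one-hot encoding `FamilyMembers`, as the transformer above does, treats each family size as an unrelated category; whether that helps or hurts is something to check during experimentation.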

We transform the train and test data. To avoid overfitting, it is better to remove the `Customer` column, and we don't want the `TravelInsurance` column as part of the attributes either.

In [167]:
# The data for training the model
X_train = transformer.fit_transform(train_df.drop(columns=["Customer", "TravelInsurance"]))
y_train = train_df["TravelInsurance"].values

# The test data is only for generating the submission
X_test = transformer.transform(test_df.drop(columns=["Customer"]))

### Decision Tree - Grid Search

We do a Grid Search for the Decision Tree (this can be replaced by a randomized search if the model is too complex).

In [168]:
search_params = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 2, 5],
    'max_depth': [3, 6, 10]
}
tree = DecisionTreeClassifier(random_state=42)
tree_clf = GridSearchCV(tree, search_params, cv=4, scoring='f1', n_jobs=-1)
tree_clf.fit(X_train, y_train)

best_tree_clf = tree_clf.best_estimator_
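If the grid becomes too large, a randomized search could be used in its place. The sketch below is one possible setup, not part of the baseline: it runs on synthetic data from `make_classification` (a stand-in for `X_train` and `y_train`), and the sampling ranges and `n_iter=10` are arbitrary choices for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Lists are sampled uniformly; randint(a, b) draws integers in [a, b).
param_distributions = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": randint(1, 11),
    "max_depth": randint(2, 11),
}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=10,        # number of parameter settings sampled
    cv=4,
    scoring="f1",
    random_state=42,  # makes the sampled settings reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(random_search.best_score_)  # mean cross-validated F1 of the best setting
```

Unlike the grid, the cost here is fixed by `n_iter`, no matter how many values each parameter can take.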

In [169]:
print(best_tree_clf)

DecisionTreeClassifier(criterion='entropy', max_depth=6, min_samples_leaf=5,
                       random_state=42)


### Check Results

We can print the results of the best estimator found on the whole training set (we could also set apart a validation set if we find it useful).

In [170]:
print(classification_report(y_train, best_tree_clf.predict(X_train)))

              precision    recall  f1-score   support

           0       0.81      0.96      0.88       958
           1       0.89      0.61      0.72       532

    accuracy                           0.83      1490
   macro avg       0.85      0.78      0.80      1490
weighted avg       0.84      0.83      0.82      1490
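The report above is computed on the same rows the model was trained on, so it is likely to be optimistic. A hold-out validation set gives a more honest estimate. The following is a minimal sketch on synthetic data (a stand-in for `X_train` and `y_train`); the 25% split size and the names `X_fit`, `X_val`, `y_fit`, `y_val` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Keep 25% of the rows aside; stratify preserves the class proportions.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=5, random_state=42)
model.fit(X_fit, y_fit)

# Compare the score on seen rows with the score on unseen rows.
train_f1 = f1_score(y_fit, model.predict(X_fit))
val_f1 = f1_score(y_val, model.predict(X_val))
print(f"train F1: {train_f1:.2f}  validation F1: {val_f1:.2f}")
```

A large gap between the two numbers is a sign of overfitting.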



### SVM - Grid Search

In [81]:
params = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": ["auto", "scale"],
    "C":[1,10,20]
}

svm = SVC()
svm_clf = GridSearchCV(svm, params, cv=4, scoring='f1', n_jobs=-1)
svm_clf.fit(X_train, y_train)

best_svm = svm_clf.best_estimator_

In [82]:
print(best_svm)

SVC(C=10, gamma='auto', kernel='poly')


In [83]:
print(classification_report(y_train, best_svm.predict(X_train)))

              precision    recall  f1-score   support

           0       0.81      0.96      0.88       958
           1       0.90      0.61      0.73       532

    accuracy                           0.84      1490
   macro avg       0.86      0.78      0.80      1490
weighted avg       0.85      0.84      0.83      1490



### KNN

In [92]:
params = {
    "n_neighbors": [1,2,3,4,5,6,7,8,15,20],
    "weights": ["uniform", "distance"],
    "algorithm":["auto", "ball_tree", "kd_tree", "brute"]
}


knn = KNeighborsClassifier()
knn_clf= GridSearchCV(knn, params, cv=4, scoring='f1', n_jobs=-1)
knn_clf.fit(X_train, y_train)

best_knn = knn_clf.best_estimator_

In [93]:
print(best_knn)

KNeighborsClassifier(n_neighbors=7)


In [89]:
print(classification_report(y_train, best_knn.predict(X_train)))

              precision    recall  f1-score   support

           0       0.82      0.94      0.87       958
           1       0.85      0.62      0.72       532

    accuracy                           0.83      1490
   macro avg       0.84      0.78      0.80      1490
weighted avg       0.83      0.83      0.82      1490



### KNN Radius

In [141]:
params = {
    "weights": ["uniform", "distance"],
    "algorithm":["auto", "ball_tree", "kd_tree", "brute"],
    "outlier_label":["most_frequent"],
}

knnr = RadiusNeighborsClassifier()
knnr_clf= GridSearchCV(knnr, params, cv=9, scoring='f1', n_jobs=-1)
knnr_clf.fit(X_train, y_train)

best_knnr = knnr_clf.best_estimator_

In [142]:
print(best_knnr)

RadiusNeighborsClassifier(outlier_label='most_frequent', weights='distance')


In [143]:
print(classification_report(y_train, best_knnr.predict(X_train)))

              precision    recall  f1-score   support

           0       0.87      0.98      0.92       958
           1       0.95      0.74      0.83       532

    accuracy                           0.89      1490
   macro avg       0.91      0.86      0.88      1490
weighted avg       0.90      0.89      0.89      1490



Submitted with parameters
`RadiusNeighborsClassifier(outlier_label='most_frequent', weights='distance')`
and it obtained a score of 0.69 in the competition.
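The gap between the training-set report and the competition score is not surprising: with `weights='distance'`, every training row is its own nearest neighbor at distance zero, so predictions on the training set are dominated by the row's own label. The cross-validated score that `GridSearchCV` already stores in `best_score_` is usually a better guide. The sketch below illustrates the comparison on synthetic data with a `KNeighborsClassifier`; the data and the small grid are assumptions for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 7], "weights": ["uniform", "distance"]},
    cv=4,
    scoring="f1",
)
search.fit(X_demo, y_demo)

# Mean F1 over the held-out folds for the best parameter setting.
cv_f1 = search.best_score_

# F1 on the very rows the refitted best estimator was trained on.
train_f1 = f1_score(y_demo, search.best_estimator_.predict(X_demo))

print(search.best_params_)
print(f"cross-validated F1: {cv_f1:.2f}  training F1: {train_f1:.2f}")
```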

### SGD Linear

In [154]:
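# Parameter selection
# Note: newer scikit-learn releases rename 'log' to 'log_loss' and
# 'squared_loss' to 'squared_error', and expect penalty=None instead of 'none'.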
params = {
    'loss': ['perceptron','hinge','log','squared_loss','epsilon_insensitive'],
    'penalty' : ['l2', 'l1', 'none'],
    'alpha' : [0.0001, 0.001, 0.01, 0.1],
    'random_state': [42],
    'learning_rate': ['optimal','constant','invscaling'],
    'eta0': [0.0001, 0.001, 0.01, 0.1]
}

sgdlinear = SGDClassifier()
sgdlinear_clf = GridSearchCV(sgdlinear, params, cv=9, scoring='f1', n_jobs=-1)
sgdlinear_clf.fit(X_train, y_train)

best_sgdlinear = sgdlinear_clf.best_estimator_

In [157]:
print(best_sgdlinear)

SGDClassifier(alpha=0.1, eta0=0.0001, loss='log', random_state=42)


In [158]:
print(classification_report(y_train, best_sgdlinear.predict(X_train)))

              precision    recall  f1-score   support

           0       0.77      0.96      0.85       958
           1       0.86      0.48      0.61       532

    accuracy                           0.79      1490
   macro avg       0.81      0.72      0.73      1490
weighted avg       0.80      0.79      0.77      1490



In [198]:
from sklearn.preprocessing import MinMaxScaler

transformer = make_column_transformer(
    (MinMaxScaler(feature_range=(0,1)), ["Age", "AnnualIncome", "FamilyMembers"]),
    (OneHotEncoder(categories="auto", dtype="int", handle_unknown="ignore"),
     ["Employment Type", "GraduateOrNot", "FrequentFlyer", "EverTravelledAbroad"]),
    remainder="passthrough")

In [199]:
# The data for training the model
X_train = transformer.fit_transform(train_df.drop(columns=["Customer", "TravelInsurance"]))
y_train = train_df["TravelInsurance"].values

# The test data is only for generating the submission
X_test = transformer.transform(test_df.drop(columns=["Customer"]))

In [200]:
search_params = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 2, 5],
    'max_depth': [3, 6, 10]
}
tree = DecisionTreeClassifier(random_state=42)
tree_clf = GridSearchCV(tree, search_params, cv=4, scoring='f1', n_jobs=-1)
tree_clf.fit(X_train, y_train)

best_tree_clf = tree_clf.best_estimator_

In [201]:
print(best_tree_clf)

DecisionTreeClassifier(max_depth=6, min_samples_leaf=5, random_state=42)


In [202]:
print(classification_report(y_train, best_tree_clf.predict(X_train)))

              precision    recall  f1-score   support

           0       0.83      0.96      0.89       958
           1       0.89      0.64      0.74       532

    accuracy                           0.84      1490
   macro avg       0.86      0.80      0.81      1490
weighted avg       0.85      0.84      0.83      1490



## Generate the output

The last step is to generate the file to be *submitted* on Kaggle.

In [204]:
test_id = test_df["Customer"]
test_pred = best_tree_clf.predict(X_test)

submission = pd.DataFrame(list(zip(test_id, test_pred)), columns=["Customer", "TravelInsurance"])
submission.to_csv("travel_insurance_submission.csv", header=True, index=False)
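Before uploading, it can help to run a few sanity checks on the submission frame. The sketch below is one way to do that; `test_id` and `test_pred` are hypothetical stand-ins here (the first five `Customer` ids from `test_df.head()` and made-up predictions), and the specific checks are suggestions rather than requirements stated by the competition.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's test_id and test_pred.
test_id = pd.Series([1491, 1492, 1493, 1494, 1495], name="Customer")
test_pred = np.array([0, 1, 0, 0, 1])  # made-up predictions

submission = pd.DataFrame({"Customer": test_id, "TravelInsurance": test_pred})

# One row per customer, no repeated ids, and only 0/1 predictions.
assert len(submission) == len(test_id)
assert submission["Customer"].is_unique
assert set(submission["TravelInsurance"].unique()) <= {0, 1}

# Render the CSV as text to inspect the header and first rows.
csv_text = submission.to_csv(index=False)
print(csv_text)
```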