# Health Insurance Cross Sell (by Andrew Dettor)
### Will a Health Insurance customer be open to buying Vehicle Insurance as well?

### Dataset: [Health Insurance Cross Sell Prediction 🏠 🏥](https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction)
### Goals:
* Explore distributions of numerical and categorical features and their relationships with the target feature, Response (whether the customer responded to an offer about buying vehicle insurance)
* Preprocess data in order to model it (look at missing values/outliers/skewed distributions/standardization)
* Test out different Classification models by tuning their hyperparameters and comparing their performance
* Explore which features were the most impactful
* Explore potential interactions between features


### FYI - Confusing Names for Features:
* Vehicle_Damage - 1 = Customer got his/her vehicle damaged in the past. 0 = Customer didn't get his/her vehicle damaged in the past.
* Vintage - Number of Days Customer has been associated with the company.
* Policy_Sales_Channel - Anonymized code for the channel used to reach the customer, e.g. different agents, over mail, over phone, in person, etc.
* Annual_Premium - The amount customer needs to pay as premium in the year. (Not sure if this is talking only about their current health insurance premium or their potential vehicle insurance premium)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import warnings
warnings.filterwarnings('ignore')

# *Read in the data*

In [None]:
fname = "../input/health-insurance-cross-sell-prediction/train.csv"
df = pd.read_csv(fname)
df.head()

# *Exploratory Data Analysis*

In [None]:
df.info()

#### No missing values! That's a plus.

In [None]:
df.describe()

#### A lot of these numerical values seem like they're actually categorical
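
One quick way to check this is to count the distinct values in each column: a column with only a handful of distinct values is usually a code rather than a measurement. The sketch below uses an invented toy frame, and the cutoff `max_levels` is a hypothetical choice, not something taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration
toy = pd.DataFrame({
    "Driving_License": [1, 1, 0, 1, 1, 1],
    "Previously_Insured": [0, 1, 0, 1, 0, 0],
    "Annual_Premium": [2630.0, 40454.0, 33536.0, 38294.0, 28619.0, 27496.0],
})

# Count the distinct values in each column
n_unique = toy.nunique()

# Hypothetical cutoff: treat columns with few distinct values as categorical
max_levels = 3
likely_categorical = n_unique[n_unique <= max_levels].index.tolist()
print(likely_categorical)
```

On the real data the cutoff would need to be chosen by eye, since Region_Code and Policy_Sales_Channel have many levels but are still categorical.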

In [None]:
# Dropping id because it seems useless
id_col = df["id"]
df = df.drop("id", axis=1)

In [None]:
# Make a list of all categorical values using data description

cat_cols = ["Gender", "Driving_License", "Region_Code", "Previously_Insured", "Vehicle_Age", "Vehicle_Damage", "Policy_Sales_Channel"]

# Make a list of all numerical values using data description

num_cols = ["Age", "Annual_Premium", "Vintage"]

# Set the target variable

target = "Response"

## Numerical Features

In [None]:
# Create a correlation heatmap between the numerical values
# Source: https://seaborn.pydata.org/examples/many_pairwise_correlations.html

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Compute the correlation matrix
corr = df[num_cols].corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(5, 5))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, annot=True, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

#### No correlation amongst the numerical features

In [None]:
# Create histograms for each numerical feature to see their distributions

for cname in num_cols:
    # histplot replaces the deprecated distplot in seaborn >= 0.11
    sns.histplot(df[cname], kde=False)
    plt.show()

#### Age has a center around 25 and a smaller center around 45. It is asymmetric, with a large right skew. The data are quite spread out from the histogram's centers. Maybe Age could be normalized for the model.
#### Annual Premium has a center around 30,000 and is generally bell-shaped and symmetric; however, there is a huge outlier around 500,000. The data are very concentrated around the center.
#### Vintage has a uniform distribution with no visible center and has a large spread.


In [None]:
# See the average value for each response value

pd.pivot_table(df, index=target, values=num_cols)

#### The average ages, insurance premiums, and number of days the customer has been with the company are very similar in both response groups. The biggest difference is with Age. It seems like older people are more likely to buy vehicle insurance (not surprising).

## Categorical Features

In [None]:
# Create barplots for each categorical feature to see their distributions

for cname in cat_cols:
    valueCounts = df[cname].value_counts()
    # Pass x and y as keywords; newer seaborn versions reject positional data arguments
    sns.barplot(x=valueCounts.index, y=valueCounts.values).set_title(cname)
    plt.show()

#### There are slightly more males than females in the dataset.
#### Just about everyone has a driver's license, which makes sense.
#### One region has an extraordinarily large number of people in it.
#### Most people don't already have vehicle insurance.
#### Most people's vehicles are within 0-2 years old.
#### There are about the same number of people who have had vehicle damage in the past as there are with no previous vehicle damage.
#### There are about 3-5 sales channels that are used most frequently.


In [None]:
# See how people responded based on each value of each categorical feature
# Source: https://www.kaggle.com/kenjee/titanic-project-example

df_with_id = pd.concat([id_col, df], axis=1)

for cname in cat_cols:
    print(pd.pivot_table(df_with_id, index=target, columns=cname, values ="id", aggfunc="count"))
    print("\n\n")

#### Almost nobody who was interested in vehicle insurance already had vehicle insurance. (It's a waste of time to ask people who already have insurance to get a different insurance for the same thing.)

#### Very few people wanted vehicle insurance if they have had no damage previously. People who have had damage before were far more likely to want vehicle insurance.

#### The number of people from each sales channel and in each region varies wildly.

#### Proportionally, very few people with a vehicle age <1 year wanted to get vehicle insurance.

#### Looking at the Gender pivot table, it seems like a higher proportion of males wanted vehicle insurance.
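
Raw counts make proportions hard to compare when the groups differ in size. A small sketch of turning a count table into per-group response rates with `pd.crosstab`, using an invented toy frame rather than the real data:

```python
import pandas as pd

# Invented toy data: 4 males (2 responded) and 4 females (1 responded)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"],
    "Response": [1, 0, 0, 1, 0, 0, 0, 1],
})

# normalize="columns" divides each column by its total, so every column sums to 1
rates = pd.crosstab(toy["Response"], toy["Gender"], normalize="columns")
print(rates)
```

The row for `Response == 1` is then the share of each group that said yes, which is the number the observations above are really about.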

In [None]:
# See the proportions of people coming from the top sales channels

(df["Policy_Sales_Channel"].value_counts()/df["Policy_Sales_Channel"].value_counts().sum()).head()

In [None]:
# See the proportions of people in the top regions

(df["Region_Code"].value_counts()/df["Region_Code"].value_counts().sum()).head()

# *Feature Engineering*

## Dealing with Categorical Values

In [None]:
df.head()

In [None]:
# One-Hot Encode the columns with low cardinality

for cname in ["Gender", "Vehicle_Age", "Vehicle_Damage"]:
    df_one_hot = pd.get_dummies(df[cname], prefix=cname)
    df = pd.concat([df, df_one_hot], axis=1)
    df = df.drop(cname, axis=1)


In [None]:
# Remove the spaces in column names for easier access
# Source: https://stackoverflow.com/questions/13757090/pandas-column-access-w-column-names-containing-spaces

df.columns = [c.replace(' ', '_') for c in df.columns]

In [None]:
df.head()

### How to deal with columns with high cardinality (Region_Code, Policy_Sales_Channel):
#### Option 1: Group categorical values such that there are few unique values, then One-Hot Encode them
#### Option 2: Use a different type of Encoder that doesn't add so many new columns

#### I'm going to go with option 2 because it'll add fewer features

In [None]:
# Use a Target Encoder 
# Replaces categorical values with the average value of the target for that value of the feature

from category_encoders import TargetEncoder

# NOTE: Only fit categorical encoders to a training set to avoid target leakage
# Will fit the encoder within a K-Fold loop
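
To make the leakage point concrete, here is a minimal sketch of out-of-fold mean target encoding written with plain pandas. It is an illustration of the idea, not the `category_encoders` implementation (which, as far as I know, also smooths the category means toward the global mean). The toy data and the fold assignment are invented.

```python
import numpy as np
import pandas as pd

# Invented toy data
toy = pd.DataFrame({
    "Region_Code": ["a", "a", "b", "b", "a", "b", "c", "a"],
    "Response":    [1,   0,   0,   0,   1,   1,   0,   0],
})

n_folds = 2
fold_id = np.arange(len(toy)) % n_folds  # simple deterministic fold assignment
encoded = pd.Series(np.nan, index=toy.index)

for k in range(n_folds):
    train = toy[fold_id != k]
    val = toy[fold_id == k]
    # Category means are computed from the training rows only
    means = train.groupby("Region_Code")["Response"].mean()
    # Categories unseen in the training rows fall back to the training-wide mean
    fallback = train["Response"].mean()
    encoded.loc[val.index] = val["Region_Code"].map(means).fillna(fallback)

toy["Region_Code_enc"] = encoded
print(toy)
```

Each row is encoded using only the labels of other rows, so a row's own Response never leaks into its feature value.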

## Creating New Features

In [None]:
# People are very unlikely to say yes if they're insured or if they've had no previous vehicle damage
# Combine these two into one feature. It should be a good predictor of saying no.
df["Insured_With_No_Damage"] = df["Previously_Insured"]*df["Vehicle_Damage_No"]

# People are very likely to say yes if they're uninsured or if they've had previous damage
# Should be a good predictor of saying yes
df["Not_Insured_With_Damage"] = df["Previously_Insured"].apply(lambda x: 1 if x == 0 else 0) * df["Vehicle_Damage_Yes"]

# If people have a new car with damage and no insurance, they probably should get insurance
df["New_Damage_No_Insurance"] = df["Vehicle_Age_<_1_Year"]*df["Not_Insured_With_Damage"]

In [None]:
df.head()

In [None]:
# Create features indicating if they're in the top 3 most popular sales channels / region codes
top3regions = df["Region_Code"].value_counts().index.tolist()[0:3]
top3channels = df["Policy_Sales_Channel"].value_counts().index.tolist()[0:3]

df["Top_3_Region"] = df["Region_Code"].apply(lambda x: 1 if x in top3regions else 0)
df["Top_3_Sales_Channel"] = df["Policy_Sales_Channel"].apply(lambda x: 1 if x in top3channels else 0)

In [None]:
df.head()

In [None]:
# If a customer pays a high amount of money, but has been with the company for a long time, they probably have the money to pay more

df["Amount_Spent_Per_Day"] = df["Annual_Premium"]/df["Vintage"]

#### One more feature I could add is if the customer is in the top 3 most likely to say yes regions/sales channels, rather than the top 3 most populous regions/channels.
#### However, I would have to do this on a training set to prevent target leakage
#### This is probably redundant because of the Target Encoder I'll be using for these columns

# *Data Cleaning*

## Missing Values

In [None]:
df.head()

#### There are no missing values, so I don't have to impute anything. However, if there were, I would impute each feature as follows:

Numerical Columns: mean\
Categorical Columns: most frequent
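
A minimal pandas sketch of that imputation plan, on an invented toy frame with gaps (the real data has none). In a modeling pipeline the fill values would be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 45.0, 35.0],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", None, "1-2 Year"],
})

# Numerical column: fill with the column mean
age_mean = toy["Age"].mean()
toy["Age"] = toy["Age"].fillna(age_mean)

# Categorical column: fill with the most frequent value
most_frequent = toy["Vehicle_Age"].mode().iloc[0]
toy["Vehicle_Age"] = toy["Vehicle_Age"].fillna(most_frequent)

print(toy)
```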

## Outliers

#### I saw with Annual_Premium the histogram extended out to like 500,000. I should drop the outliers before normalizing.

In [None]:
# See the skew

df["Annual_Premium"].skew()

In [None]:
# Detecting outliers based on what I remember from my statistics class

q75 = df["Annual_Premium"].quantile(q=.75)
q25 = df["Annual_Premium"].quantile(q=.25)
IQR = q75-q25

lowerBound = q25 - 1.5*IQR
upperBound = q75 + 1.5*IQR

print(lowerBound)
print(upperBound)

In [None]:
# Remove outliers

outliers = df.loc[(df["Annual_Premium"] < lowerBound) | (df["Annual_Premium"] > upperBound)]
df = df.drop(outliers.index)
print("Dropped", outliers.shape[0], "outliers.")

## Normalization 

In [None]:
# How has skew changed from removing outliers?

df["Annual_Premium"].skew()

#### Still moderately skewed

In [None]:
# Age skew

df["Age"].skew()

#### Also moderately skewed

In [None]:
# Vintage skew

df["Vintage"].skew()

#### Unsurprisingly not skewed

In [None]:
# Do a BoxCox transformation on Age and Annual Premium
# Source: https://www.kaggle.com/datafan07/top-1-approach-eda-new-models-and-stacking

from scipy.stats import boxcox

df["Age"], age_lambda = boxcox(df["Age"])
df["Annual_Premium"], annualprem_lambda = boxcox(df["Annual_Premium"])

# Keep track of each lambda to use when preprocessing the test set
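
For reference, applying Box-Cox with an already fitted lambda is just a closed-form formula, which is why the stored lambdas are all that is needed for the test set. The sketch below writes it out in NumPy; `stored_lambda` is a hypothetical value, not the one fitted above, and the helper names are made up for illustration.

```python
import numpy as np

def boxcox_with_lambda(x, lmbda):
    """Box-Cox transform of positive values x for a given, already fitted lambda."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

def inverse_boxcox(y, lmbda):
    """Undo the transform, e.g. to read plots back in original units."""
    y = np.asarray(y, dtype=float)
    if lmbda == 0:
        return np.exp(y)
    return (lmbda * y + 1.0) ** (1.0 / lmbda)

stored_lambda = 0.5  # hypothetical; use the lambda returned when fitting on training data
test_ages = np.array([20.0, 40.0, 80.0])

transformed = boxcox_with_lambda(test_ages, stored_lambda)
recovered = inverse_boxcox(transformed, stored_lambda)
print(transformed, recovered)
```

Calling `scipy.stats.boxcox(x, lmbda=stored_lambda)` should give the same transformed values without refitting.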

## Scaling

In [None]:
# Scale the numerical data for the models that require it
# Don't forget to also scale the Target Encoded features after the encoder has been fit on a training set
# Source: https://www.kaggle.com/kenjee/titanic-project-example

from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df[["Age", "Annual_Premium", "Vintage", "Amount_Spent_Per_Day"]] = scale.fit_transform(df[["Age", "Annual_Premium", "Vintage", "Amount_Spent_Per_Day"]])

# *Model Selection*

## Preprocessing and cross validation functions
NOTE: I could do things a lot simpler if I didn't have to worry about target leakage with the TargetEncoder. Right now, each validation split needs to be tailored to the values in the corresponding training split. I can't just pass in a preprocessed big X into a cross validation function which randomly splits the X, because it would contain target-leaked data. 

In [None]:
def get_preprocessed_cv_data(train_X, val_X, train_y, val_y):
    
    ###############################################################################################
    # One-Hot Encode the columns with low cardinality
    for cname in ["Gender", "Vehicle_Age", "Vehicle_Damage"]:
        train_X_one_hot = pd.get_dummies(train_X[cname], prefix=cname)
        train_X = pd.concat([train_X, train_X_one_hot], axis=1)
        train_X = train_X.drop(cname, axis=1)
        
        val_X_one_hot = pd.get_dummies(val_X[cname], prefix=cname)
        val_X = pd.concat([val_X, val_X_one_hot], axis=1)
        val_X = val_X.drop(cname, axis=1)
        
    ###############################################################################################
    # Target Encode the columns with high cardinality
    
    enc = TargetEncoder(cols=["Region_Code", "Policy_Sales_Channel"])
    train_X = enc.fit_transform(train_X, train_y)
    val_X = enc.transform(val_X)
        
    ###############################################################################################
    # Remove the spaces in column names for easier access
    train_X.columns = [c.replace(' ', '_') for c in train_X.columns]
    val_X.columns = [c.replace(' ', '_') for c in val_X.columns]
    
    # Remove the < sign bc it doesnt work for XGB Classifier
    train_X.columns = [c.replace('<', 'less') for c in train_X.columns]
    val_X.columns = [c.replace('<', 'less') for c in val_X.columns]
    
    # Remove the > sign bc it doesnt work for XGB Classifier
    train_X.columns = [c.replace('>', 'greater') for c in train_X.columns]
    val_X.columns = [c.replace('>', 'greater') for c in val_X.columns]
    
    ###############################################################################################
    # Feature Engineering
    train_X["Insured_With_No_Damage"] = train_X["Previously_Insured"]*train_X["Vehicle_Damage_No"]
    train_X["Not_Insured_With_Damage"] = train_X["Previously_Insured"].apply(lambda x: 1 if x == 0 else 0) * train_X["Vehicle_Damage_Yes"]
    train_X["New_Damage_No_Insurance"] = train_X["Vehicle_Age_less_1_Year"]*train_X["Not_Insured_With_Damage"]
    train_X["Amount_Spent_Per_Day"] = train_X["Annual_Premium"]/train_X["Vintage"]
    
    val_X["Insured_With_No_Damage"] = val_X["Previously_Insured"]*val_X["Vehicle_Damage_No"]
    val_X["Not_Insured_With_Damage"] = val_X["Previously_Insured"].apply(lambda x: 1 if x == 0 else 0) * val_X["Vehicle_Damage_Yes"]
    val_X["New_Damage_No_Insurance"] = val_X["Vehicle_Age_less_1_Year"]*val_X["Not_Insured_With_Damage"]
    val_X["Amount_Spent_Per_Day"] = val_X["Annual_Premium"]/val_X["Vintage"]
    
    ###############################################################################################
    # More Feature Engineering
    top3regions = train_X["Region_Code"].value_counts().index.tolist()[0:3]
    top3channels = train_X["Policy_Sales_Channel"].value_counts().index.tolist()[0:3]

    train_X["Top_3_Region"] = train_X["Region_Code"].apply(lambda x: 1 if x in top3regions else 0)
    val_X["Top_3_Region"] = val_X["Region_Code"].apply(lambda x: 1 if x in top3regions else 0)
    
    train_X["Top_3_Sales_Channel"] = train_X["Policy_Sales_Channel"].apply(lambda x: 1 if x in top3channels else 0)
    val_X["Top_3_Sales_Channel"] = val_X["Policy_Sales_Channel"].apply(lambda x: 1 if x in top3channels else 0)
    
    ###############################################################################################
    # Remove outliers from training set Annual Premium
    q75 = train_X["Annual_Premium"].quantile(q=.75)
    q25 = train_X["Annual_Premium"].quantile(q=.25)
    IQR = q75-q25

    lowerBound = q25 - 1.5*IQR
    upperBound = q75 + 1.5*IQR
    
    outliers_train = train_X.loc[(train_X["Annual_Premium"] < lowerBound) | (train_X["Annual_Premium"] > upperBound)]
    train_X = train_X.drop(outliers_train.index)
    train_y = train_y.drop(outliers_train.index)
    
    ###############################################################################################
    # Normalize Age and Annual Premium
    # Use same lambdas on validation set
    
    for col in ["Age", "Annual_Premium"]:
        train_X[col], lmbda = boxcox(train_X[col])
        val_X[col] = boxcox(val_X[col], lmbda=lmbda)
        
    return train_X, val_X, train_y, val_y

In [None]:
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

# Returns the best hyperparameters and average score from 3-fold cross validation
def get_cv_score_and_best_model(X, y, model, paramfield, boolScale):
    
    totalScore = 0
    bestScore = 0
    bestParams = {}
    
    kf = KFold(n_splits=3)
    for train_index, val_index in kf.split(X):

        # Create train/test splits and preprocess using previously defined function
        train_X, val_X, train_y, val_y = X.iloc[train_index], X.iloc[val_index], y.iloc[train_index], y.iloc[val_index]
        train_X, val_X, train_y, val_y = get_preprocessed_cv_data(train_X, val_X, train_y, val_y)
        
        # Scale all numerical features if required (for SVM and KNN)
        if boolScale:
            toScale = ["Age", "Annual_Premium", "Vintage", "Amount_Spent_Per_Day", "Region_Code", "Policy_Sales_Channel"]
            scale = StandardScaler()
            train_X[toScale] = scale.fit_transform(train_X[toScale])
            val_X[toScale] = scale.transform(val_X[toScale])
        
        train_length = train_X.shape[0]
        val_length = val_X.shape[0]
        
        new_X = pd.concat([train_X, val_X], axis=0)
        new_y = pd.concat([train_y, val_y], axis=0)
        
        
        # Want to find the best hyperparameters for this specific train/test split
        # Create a list where train data indices are -1 and validation data indices are 0
        # Source: https://stackoverflow.com/questions/31948879/using-explicit-predefined-validation-set-for-grid-search-with-sklearn
        # Rows 0..train_length-1 are training rows, so the comparison must be strict
        split_index = [-1 if i < train_length else 0 for i in range(train_length+val_length)]
        
        ps = PredefinedSplit(test_fold = split_index)
        
        # Scoring metric is from the dataset description
        # Tries out all combinations of parameters in paramfield on this split
        clf = GridSearchCV(model, paramfield, scoring="roc_auc", cv=ps)
        
        # Fit the classifier using the preprocessed X and y
        # It will know what the validation set is bc the PredefinedSplit
        clf.fit(new_X, new_y)
        
        # Get the best score from this split coming from the best params
        score = clf.best_score_
        
        # Add to the totalScore to later return the average score
        totalScore += score
        
        # If this is the best split so far, keep track of the found hyperparameters
        if score >= bestScore:
            bestScore = score
            bestParams = clf.best_params_
    
    # Find the average score across the cv folds
    avgScore = totalScore/kf.get_n_splits()
    
    return avgScore, bestParams
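
The list handed to `PredefinedSplit` is easy to get wrong by one row, so here is the construction on its own with tiny, made-up lengths. A value of -1 means "always in the training set" and 0 means "belongs to validation fold 0". Note the strict `<`: with `<=`, the first validation row would be marked as training.

```python
# Made-up lengths for illustration
train_length = 3
val_length = 2

# Rows 0..train_length-1 are training rows; the remaining rows form validation fold 0
test_fold = [-1 if i < train_length else 0 for i in range(train_length + val_length)]
print(test_fold)  # [-1, -1, -1, 0, 0]
```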

## Model Training and Hyperparameter Optimization

In [None]:
# Get a fresh X and y
df = pd.read_csv(fname)
X = df.drop(["Response", "id"], axis=1)
y = df["Response"]

In [None]:
# Model: Naive Bayes
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# Not many hyperparameters with this one
paramfield = {'var_smoothing': [10**-8, 10**-9, 10**-10]}

# avgScore, bestParams = get_cv_score_and_best_model(X, y, model, paramfield, False)

# print(avgScore)
# print(bestParams)

# Previous Output:
# 0.8244416193893894
# {'var_smoothing': 1e-09}

In [None]:
# Model: Logistic Regression
from sklearn.linear_model import LogisticRegression

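model = LogisticRegression()
# NOTE: the default 'lbfgs' solver only supports the 'l2' and 'none' penalties,
# so the 'l1' and 'elasticnet' candidates fail to fit and are scored as NaN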
paramfield = {'penalty': ('l1', 'l2', 'elasticnet', 'none'),
              'C': [.011, .033, .11, .33, 1, 3, 9],
              'max_iter': [50, 100, 150, 200, 300],
              }

# avgScore, bestParams = get_cv_score_and_best_model(X, y, model, paramfield, False)

# print(avgScore)
# print(bestParams)

# Previous Output:
# 0.8423410561182303
# {'C': 0.11, 'max_iter': 50, 'penalty': 'l2'}

In [None]:
# Model: Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
paramfield = {'criterion': ('gini', 'entropy'),
              'splitter': ('best', 'random'),
              'max_depth':[50, 100, 200, None],
              'max_leaf_nodes':[500, 1000, 2000, None],
             }

# avgScore, bestParams = get_cv_score_and_best_model(X, y, model, paramfield, False)

# print(avgScore)
# print(bestParams)

# Previous Output:
# 0.8499812468738526
# {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': 500, 'splitter': 'random'}

In [None]:
# Model: K Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
paramfield = {'n_neighbors': [3,4,5,6,7],
              'weights': ('uniform', 'distance'),
             }

# avgScore, bestParams = get_cv_score_and_best_model(X, y, model, paramfield, True)

# print(avgScore)
# print(bestParams)

# Previous Output:
# 0.7800292419908607
# {'n_neighbors': 7, 'weights': 'uniform'}

In [None]:
# Model: XGBoost Classifier
from xgboost import XGBClassifier

model = XGBClassifier()
paramfield = {'learning_rate':[.033, .1, .3, .9],
              'n_estimators':[25, 50, 100, 150, 200],
             }

# avgScore, bestParams = get_cv_score_and_best_model(X, y, model, paramfield, False)

# print(avgScore)
# print(bestParams)

# Previous Output:
# 0.8569668696113993
# {'learning_rate': 0.1, 'n_estimators': 100}

In [None]:
# Model: Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
paramfield = {'max_depth':[50, 100, None],
              'max_leaf_nodes':[500, 1000, None],
              'n_estimators': [50, 100, 200]
             }

# avgScore, bestParams = get_cv_score_and_best_model(X, y, model, paramfield, False)

# print(avgScore)
# print(bestParams)

# Previous Output:
# 0.8557109570795897
# {'max_depth': 50, 'max_leaf_nodes': 500, 'n_estimators': 200}

In [None]:
# Model: Support Vector Classifier
from sklearn.svm import SVC

# Takes longer than 9 hours to fit. The notebook times out after 9 hours.

# From the documentation:
# The implementation is based on libsvm. 
# The fit time complexity is more than quadratic with the number of samples
# which makes it hard to scale to dataset with more than a couple of 10000 samples.

# I have over 380,000 samples, so of course it takes forever.

# model = SVC()
# paramfield = {'C': [.33, 1, 3]}

# Use LinearSVC instead

from sklearn.svm import LinearSVC

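# NOTE: the loss names below are LinearSVR losses; recent scikit-learn versions only
# accept 'hinge' and 'squared_hinge' for LinearSVC, so the grid may need updating to rerun
model = LinearSVC()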
paramfield = {'C': [.11, .33, 1, 3, 9],
             'loss': ('epsilon_insensitive', 'squared_epsilon_insensitive')
             }

# avgScore, bestParams = get_cv_score_and_best_model(X, y, model, paramfield, True)

# print(avgScore)
# print(bestParams)

# Previous Output:
# 0.8416891938364014
# {'C': 0.11, 'loss': 'squared_epsilon_insensitive'}

## Results

Which model to choose? (based on ROC-AUC score)
* **Naive Bayes**: 0.82
* **Logistic Regression**: 0.84
* **Decision Tree**: 0.85
* **K Nearest Neighbors**: 0.78
* **XGBoost**: *0.86*
* **Random Forest**: *0.86*
* **Linear Support Vector Machine**: 0.84

The Decision Tree would be the simplest choice, and its score is comparable to XGBoost and Random Forest. I'm going with Random Forest anyway: it scores slightly higher than the Decision Tree (0.856 vs. 0.850), essentially ties XGBoost (0.857), and is still a very explainable model (just a bunch of decision trees).

In [None]:
# Fit a model on the training data

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=.8, test_size=.2)
train_X, val_X, train_y, val_y = get_preprocessed_cv_data(train_X, val_X, train_y, val_y)

# From GridSearchCV: {'max_depth': 50, 'max_leaf_nodes': 500, 'n_estimators': 200}
model = RandomForestClassifier(max_depth=50, max_leaf_nodes=500, n_estimators=200)
model.fit(train_X, train_y)

In [None]:
# See the auc-roc score
from sklearn.metrics import roc_auc_score

# Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
print(roc_auc_score(val_y, model.predict_proba(val_X)[:, 1]))
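
As a reminder of what this number means: ROC-AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting half. The sketch below computes it by brute force on four invented points. It compares every positive with every negative, so it is only meant for tiny inputs; the helper name is made up.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_toy, p_toy))  # 0.75
```

Because only the ranking matters, the metric needs predicted probabilities (or any continuous score), not hard 0/1 labels.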

# *Feature Selection*
Why?
- Helps reduce overfitting on training and validation data
- Train and do inference faster with fewer features

In [None]:
# Use L1 Regularization to see the most important features
# The less important coefficients will be regularized to 0
# Make C lower to remove more features

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

logistic = LogisticRegression(C=.5, penalty="l1", solver="liblinear"
                             ).fit(train_X, train_y)

# Select the nonzero coefficients
selector = SelectFromModel(logistic, prefit=True)

train_X_new = selector.transform(train_X)

In [None]:
# Create a list of which features were selected

selected_df = pd.DataFrame(selector.inverse_transform(train_X_new),
                          index=np.arange(train_X_new.shape[0]),
                          columns=train_X.columns.tolist())

selected_features = selected_df.columns[selected_df.sum() != 0]

selected_features

In [None]:
# See which features should be removed according to this method

for col in train_X.columns.tolist():
    if col not in selected_features:
        print(col)

In [None]:
# Different approach to see feature importance
# Permutation importance

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model).fit(val_X, val_y)
eli5.show_weights(perm, top=100, feature_names = val_X.columns.tolist())

What strikes me as odd is all the very low weights. Did I do something wrong?\
If I trust this, then it seems like most of the features I made are helpful.\
Driving_License has no effect because almost everybody in the dataset has a driving license.\
I will try to find out why Age is so important in the next section.
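
For intuition about what these weights measure, here is a hand-rolled sketch of permutation importance, assuming the weight is the drop in a score (accuracy here) after shuffling one column. The toy data and `toy_model_predict` are hypothetical stand-ins, not the Random Forest above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the label depends only on feature 0; feature 1 is pure noise
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_model_predict(X):
    """Hypothetical stand-in for a fitted model: thresholds feature 0."""
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

baseline = accuracy(y_toy, toy_model_predict(X_toy))

importances = []
for j in range(X_toy.shape[1]):
    X_shuffled = X_toy.copy()
    # Shuffle one column to break its link with the label
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    importances.append(baseline - accuracy(y_toy, toy_model_predict(X_shuffled)))

print(baseline, importances)
```

Since each weight is a score difference, small absolute values are not necessarily a mistake. One possible reason is that several features carry overlapping information, so shuffling any single one costs the model little.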

# *Machine Learning Explainability*
#### Making sense of the model's predictions

## Partial Dependence Plots

In [None]:
# See how exactly Age and Vintage affect predictions

from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

features_to_plot = ['Age', 'Vintage']
inter1  =  pdp.pdp_interact(model=model, dataset=val_X, model_features=val_X.columns.tolist(), features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()

Being in the middle of the bell curve of Ages seems to greatly increase the chances of buying Vehicle Insurance. Vintage (days with company) only has an effect at the extremes: having been with the company for a long time, or for only a few days, increases the odds of getting vehicle insurance. At those extremes of Vintage, a wider range of ages is likely to buy vehicle insurance.
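
The idea behind a partial dependence plot can be sketched in a few lines: force one feature to a fixed value for every row, average the model's predictions, and repeat over a grid of values. The code below is a one-feature illustration with a hypothetical `toy_predict_proba` standing in for the fitted model; as I understand it, `pdp_interact` does the analogous thing over a two-feature grid.

```python
import numpy as np

def toy_predict_proba(X):
    """Hypothetical stand-in for model.predict_proba(X)[:, 1]."""
    age = X[:, 0]
    # Peaks at age 45; column 1 (vintage) is ignored entirely
    return np.exp(-((age - 45.0) / 15.0) ** 2)

# Invented rows: columns are [age, vintage]
X_toy = np.array([[25.0, 10.0], [45.0, 150.0], [65.0, 290.0]])

def partial_dependence(predict, X, feature_idx, grid):
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value           # force every row to this value
        averages.append(predict(X_mod).mean())  # average over the other features
    return np.array(averages)

pd_age = partial_dependence(toy_predict_proba, X_toy, 0, np.array([25.0, 45.0, 65.0]))
pd_vintage = partial_dependence(toy_predict_proba, X_toy, 1, np.array([10.0, 290.0]))
print(pd_age, pd_vintage)
```

A feature the model ignores gives a flat curve, while a feature with a sweet spot gives a bump, which is the shape seen for Age above.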

In [None]:
# See how exactly Age and Annual Premium affect predictions

from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

features_to_plot = ['Age', 'Annual_Premium']
inter1  =  pdp.pdp_interact(model=model, dataset=val_X, model_features=val_X.columns.tolist(), features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()

This is a very similar story to the previous graph. As people pay more for their annual premium, they are slightly more likely to buy vehicle insurance. The pattern with Age is the same as the previous graph.

## SHAP Value Plots

In [None]:
# Another way to see relationships between variables
# SHAP values and SHAP summary plots

import shap

explainer = shap.TreeExplainer(model)

shap_values = explainer.shap_values(val_X)

shap.summary_plot(shap_values[1], val_X)

Clear indicators that make a customer more likely to buy vehicle insurance:
* Not currently being insured
* Has had vehicle damage previously
* Pays a high annual premium
* Comes from a sales channel or region code that historically has been more likely to buy vehicle insurance
* Comes from a sales channel or region code with a high volume of people

Not-so-clear indicators:
* Having a very new car (< 1 year) sometimes increases odds and sometimes decreases odds
* High age sometimes increases odds and sometimes decreases odds
* Being male (slightly increases odds of buying vehicle insurance)

In [None]:
# Partial Dependence Plot, but enhanced with SHAP values
# Check out Age and Previously_Insured

shap.dependence_plot('Age', shap_values[1], val_X, interaction_index="Previously_Insured")

As found previously, the further someone is from the mean age, the less likely they are to buy vehicle insurance. The graph starts down, goes up, then back down. Not having vehicle insurance previously intensifies this effect greatly, basically increasing the 'amplitude'.

In [None]:
# Partial Dependence Plot, but enhanced with SHAP values
# Check out Annual Premium and if the customer has had damage to their vehicle in the past

shap.dependence_plot('Annual_Premium', shap_values[1], val_X, interaction_index="Vehicle_Damage_Yes")

There is no clear pattern for Annual Premium here. The same values can lead to a similar result for predicting buying vehicle insurance. However, not having vehicle damage previously nullifies the effect of Annual Premium. Notice how the blue dots are all centered around 0.00 on the y-axis, while the pink ones are all over the place.

# *Test Data*

In [None]:
# Load in the training data again
# Get a fresh X and y
df = pd.read_csv(fname)

train_id_col = df["id"]

train_X = df.drop(["Response", "id"], axis=1)
train_y = df["Response"]

In [None]:
# Load in the test data

fname_test = "../input/health-insurance-cross-sell-prediction/test.csv"
test_X = pd.read_csv(fname_test)

test_id_col = test_X["id"]

test_X = test_X.drop(["id"], axis=1)
test_X.head()

In [None]:
# Fit a model on all of the training data

from sklearn.model_selection import train_test_split

# The test set has no labels, so pass None for val_y (it is returned unchanged)
train_X, test_X, train_y, _ = get_preprocessed_cv_data(train_X, test_X, train_y, None)

# From GridSearchCV: {'max_depth': 50, 'max_leaf_nodes': 500, 'n_estimators': 200}
model = RandomForestClassifier(max_depth=50, max_leaf_nodes=500, n_estimators=200)

model.fit(train_X, train_y)

In [None]:
# Predict on all the test data and create a submission CSV
# Check the dataset page for specifics on formatting

test_preds = model.predict(test_X)

output = pd.DataFrame({'id': test_id_col,
                      'Response': test_preds})

output.to_csv('submission.csv', index=False)

This hackathon competition is no longer accepting submissions. :(