# Introduction

The goal is to predict which customers will respond positively to an automobile insurance offer. As a baseline for further experiments, we train a tuned LightGBM model.

The evaluation metric is *area under the ROC curve* and the binary target variable is *Response*.
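
As a reminder of what the metric measures, here is a minimal sketch on made-up labels and scores (not competition data). ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75
auc = roc_auc_score(y_true, y_score)
print(f"ROC AUC: {auc:.2f}")
```

Because the metric depends only on the ranking of the scores, it needs predicted probabilities (or any monotone score), not hard class labels.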

## Libraries

In [1]:
# Import Libraries
import numpy as np 
import pandas as pd 
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.base import clone
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score,roc_curve

## Read Data

In [2]:
train_data = pd.read_csv('/kaggle/input/playground-series-s4e7/train.csv', index_col='id')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e7/test.csv', index_col='id')
sample_sub = pd.read_csv("/kaggle/input/playground-series-s4e7/sample_submission.csv")

## First Look on the Data

In [3]:
f" Training data shape: {train_data.shape}, --- Test data shape: {test_data.shape}"

' Training data shape: (11504798, 11), --- Test data shape: (7669866, 10)'

In [4]:
train_data.Response.value_counts()

Response
0    10089739
1     1415059
Name: count, dtype: int64

The training set is large (about 11.5 million rows) and the target variable is imbalanced: only about 12.3% of the rows have *Response* = 1. We therefore try downsampling the majority class so that the model is trained on a balanced data set.

In [5]:
n_splits, random_state = 5, 904

In [6]:
train_data.head()

id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,Male,21,1,35.0,0,1-2 Year,Yes,65101.0,124.0,187,0
1,Male,43,1,28.0,0,> 2 Years,Yes,58911.0,26.0,288,1
2,Female,25,1,14.0,1,< 1 Year,No,38043.0,152.0,254,0
3,Female,35,1,1.0,0,1-2 Year,Yes,2630.0,156.0,76,0
4,Female,36,1,15.0,1,1-2 Year,No,31951.0,152.0,294,0


There are several categorical features in the data set. *Gender*, *Vehicle_Age* and *Vehicle_Damage* are stored as strings, and *Vehicle_Age* is ordinal. Further categorical features are *Driving_License*, *Region_Code*, *Previously_Insured* and *Policy_Sales_Channel*, which are stored as numeric codes.

In [7]:
train_data.dtypes

Gender                   object
Age                       int64
Driving_License           int64
Region_Code             float64
Previously_Insured        int64
Vehicle_Age              object
Vehicle_Damage           object
Annual_Premium          float64
Policy_Sales_Channel    float64
Vintage                   int64
Response                  int64
dtype: object

We single out the following features as categorical for further feature engineering:

In [8]:
cat_features = ['Gender', 'Vehicle_Age', 'Vehicle_Damage']

# Preprocessing

## Downsampling

In [9]:
majority_class = train_data[train_data['Response'] == 0]
minority_class = train_data[train_data['Response'] == 1]
sample_size = len(minority_class)
majority_sample = majority_class.sample(sample_size, random_state=random_state)  # fixed seed for reproducibility
X = pd.concat([majority_sample, minority_class], axis=0)
print("New shape of training data: ", X.shape)

New shape of training data:  (2830118, 11)


## Feature Engineering

There are several ways to encode the categorical features. For now, *Gender* and *Vehicle_Damage* are encoded as binary 0/1 integers and, since *Vehicle_Age* is ordinal, it is mapped to the integers 0, 1 and 2. The remaining columns are cast to smaller integer types to reduce memory usage (see [this notebook](https://www.kaggle.com/code/jmascacibar/optimizing-memory-usage-with-insurance-cross-sell)).
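
To illustrate the memory effect of the dtype casts, here is a small sketch on a toy frame. The values are made up, and it assumes, as the casts in this notebook do, that the columns have no missing values and fit into the chosen integer ranges.

```python
import pandas as pd

# Toy frame mimicking three competition columns (values are invented)
toy = pd.DataFrame({
    "Age": [21, 43, 25, 35],
    "Region_Code": [35.0, 28.0, 14.0, 1.0],
    "Vintage": [187, 288, 254, 76],
})

before = toy.memory_usage(index=False).sum()

# Cast to the narrowest integer types that hold the observed values
small = toy.astype({"Age": "int8", "Region_Code": "int8", "Vintage": "int16"})
after = small.memory_usage(index=False).sum()

print(f"bytes before: {before}, bytes after: {after}")
```

On a frame with millions of rows, the same casts reduce the footprint of these columns by the same factor.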

In [10]:
def new_features(data):
    df=data.copy()
    df[cat_features] = df[cat_features].astype('category')
    df["Vehicle_Age"] = df["Vehicle_Age"].cat.rename_categories({"1-2 Year": 1, "< 1 Year": 0, "> 2 Years": 2}).astype('int8')
    df['Gender'] = df['Gender'].cat.rename_categories({'Male': 0, 'Female': 1}).astype('int8')
    df['Vehicle_Damage'] = df['Vehicle_Damage'].cat.rename_categories({'No': 0, 'Yes': 1}).astype('int8')
    df['Age'] = df['Age'].astype('int8')
    df['Driving_License'] = df['Driving_License'].astype('int8')
    df['Region_Code'] = df['Region_Code'].astype('int8')
    df['Previously_Insured'] = df['Previously_Insured'].astype('int8')
    df['Annual_Premium'] = df['Annual_Premium'].astype('int32')
    df['Policy_Sales_Channel'] = df['Policy_Sales_Channel'].astype('int16')
    df['Vintage'] = df['Vintage'].astype('int16')
    return df


X = new_features(X)
test_data = new_features(test_data)
X['Response'] = X['Response'].astype('int8')

In [11]:
target = 'Response'
y = X.pop(target)

## EDA

## Mutual Information Importance

In [12]:
# Take a stratified 1% sample (one of 100 folds) to keep the computation cheap
_, sample_indices = next(StratifiedKFold(
    n_splits=100, shuffle=True, random_state=42).split(X, y.astype('str')))
X_small = X.iloc[sample_indices].copy()
y_small = y.iloc[sample_indices]

# Add a random categorical noise feature as a reference point
rng = np.random.default_rng(0)
X_small['cat_noise'] = rng.choice(5, size=len(X_small))

In [13]:
mi = mutual_info_classif(X_small, y_small, random_state=6)
mi = pd.DataFrame({
        'Features': X_small.columns,
        'Mutual information': mi*100
    })
mi = mi.sort_values(['Mutual information'], ascending=False).reset_index(drop=True)

In [14]:
# plt.figure(figsize=(16, 10))
# ax = sns.barplot(x="Mutual information", y="Features", data=mi)
# ax.set_xscale("log")
# ax.axvline(1,ls='--',color='k')
# plt.title("Crossvalidated Mutual Information", size=20)
# plt.show()

All features share more mutual information with the target than the noise feature does, so all of them are potentially useful for the model.
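
The noise-feature comparison can be sketched on synthetic data. Everything below is hypothetical: `signal` is constructed to depend on the target and `cat_noise` is drawn independently of it. Unlike the cell above, this sketch also marks the noise column as discrete through `discrete_features`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y_demo = rng.integers(0, 2, size=n)

demo = pd.DataFrame({
    # informative: the target plus Gaussian noise
    "signal": y_demo + rng.normal(0, 0.5, size=n),
    # uninformative: random categories, independent of the target
    "cat_noise": rng.choice(5, size=n),
})

mi_demo = mutual_info_classif(
    demo, y_demo, discrete_features=[False, True], random_state=0)
mi_scores = dict(zip(demo.columns, mi_demo))
print(mi_scores)
```

A feature whose score is not clearly above that of the noise column carries little information about the target on its own, although it may still help in interaction with other features.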

## Statistical Properties

In [15]:
kfold = StratifiedKFold(n_splits=n_splits, shuffle = True, random_state=random_state)

def cross_validate():
    """Compute out-of-fold and test predictions for the global `model`.

    Relies on the globals `model`, `X`, `y`, `test_data` and `kfold`.
    """
    
    start_time = datetime.datetime.now()
    scores = []
    oof_preds = np.zeros(len(y))
    test_preds = np.zeros(len(test_data))
    
    for fold, (train_index, valid_index) in enumerate(kfold.split(X,y.astype(str))):
                                                      
        X_train = X.iloc[train_index]
        y_train = y.iloc[train_index]
        X_val = X.iloc[valid_index]
        y_val = y.iloc[valid_index]
        
        m = clone(model)
        m.fit(X_train, y_train, eval_set=[(X_val, y_val)])

        y_pred = m.predict_proba(X_val)[:,1] #probability for Response == 1

        score = roc_auc_score(y_val,  y_pred)
        print(f"# Fold {fold}: ROC-AUC-Score={score:.5f}")
        scores.append(score)
        oof_preds[valid_index] += y_pred
        test_preds = test_preds + m.predict_proba(test_data)[:,1]/kfold.get_n_splits()
            
    elapsed_time = datetime.datetime.now() - start_time
    print(f"#ROC-AUC mean: {np.array(scores).mean():.7f} (+- {np.array(scores).std():.7f}) "
          f"#elapsed time: {int(np.round(elapsed_time.total_seconds() / 60))} min")

    return oof_preds, test_preds
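
The out-of-fold idea behind the function above can be sketched in isolation on synthetic data. This is an illustration only: `LogisticRegression` stands in for the LightGBM model, and all names with a `_demo` suffix are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

oof = np.zeros(len(y_demo))
for train_idx, valid_idx in skf.split(X_demo, y_demo):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    # Each row is scored exactly once, by a model that never saw it in training
    oof[valid_idx] = clf.predict_proba(X_demo[valid_idx])[:, 1]

oof_auc = roc_auc_score(y_demo, oof)
print(f"OOF ROC AUC: {oof_auc:.3f}")
```

Since every prediction comes from a model that did not train on that row, the AUC of the combined out-of-fold vector is an honest estimate of generalization on the (downsampled) training distribution.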

In [16]:
X.describe()

In [17]:
# Hold-out split used for the single-split models below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

# Modeling

## Crossvalidation Function

## LGBM Parameters

In [18]:
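lgbm_params = {'n_estimators': 2000,
               "verbose": -1,
               "metric": "auc",  # `metric` is the LightGBM parameter; `eval_metric` is an argument of fit()
               "early_stopping_round": 50,
               'depth': 6,  # LightGBM names its depth limit `max_depth`; `depth` appears to have no effect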
               'num_leaves': 171,
               'learning_rate': 0.02644045021671239,
               'min_child_samples': 72,
               'subsample': 0.5065444622662555,
               'colsample_bytree': 0.5395763161430562,
               'lambda_l1': 3.3153753990389056e-08,
               'lambda_l2': 2.160332353116391e-08}

## Run the Model

In [19]:
model = LGBMClassifier(**lgbm_params)
lgbm_oof, lgbm_test = cross_validate()

# Evaluation

... later ...

In [20]:
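# roc_curve returns (fpr, tpr, thresholds); plot TPR against FPR
fpr, tpr, _ = roc_curve(y, lgbm_oof)
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, label=f'LightGBM OOF (AUC = {roc_auc_score(y, lgbm_oof):.4f})')
plt.plot([0, 1], [0, 1], 'k--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()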

In [21]:
import time
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score, classification_report

In [22]:
%%time

model = XGBClassifier(
    objective='binary:logistic',
    eval_metric='auc',
    eta=0.05,
    alpha=0.1269124823585012,
    subsample=0.8345882521794742,
    colsample_bytree=0.44270196445757065,
    max_depth=15,
    min_child_weight=8,
    gamma=1.308021832047589e-08,
    random_state=random_state,
    max_bin=50000, #a weird max_bin, for reference: https://www.kaggle.com/competitions/playground-series-s4e7/discussion/516265
    enable_categorical=True,
    n_estimators=10000
)

model.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    early_stopping_rounds=50,
    verbose=50
)

print("Best iteration:", model.best_iteration)

booster = model.get_booster()
y_pred_prob = booster.predict(xgb.DMatrix(X_test, enable_categorical=True), iteration_range=(0, model.best_iteration + 1))
auc = roc_auc_score(y_test, y_pred_prob)
print(f"Validation AUC: {auc:.5f}")



[0]	validation_0-auc:0.81906
[50]	validation_0-auc:0.87623
[100]	validation_0-auc:0.88116
[150]	validation_0-auc:0.88268
[200]	validation_0-auc:0.88374
[250]	validation_0-auc:0.88450
[300]	validation_0-auc:0.88485
[350]	validation_0-auc:0.88505
[400]	validation_0-auc:0.88533
[450]	validation_0-auc:0.88582
[500]	validation_0-auc:0.88612
[550]	validation_0-auc:0.88627
[600]	validation_0-auc:0.88638
[650]	validation_0-auc:0.88666
[700]	validation_0-auc:0.88677
[750]	validation_0-auc:0.88695
[800]	validation_0-auc:0.88692
[816]	validation_0-auc:0.88692
Best iteration: 767
Validation AUC: 0.88699
CPU times: user 29min 29s, sys: 6.37 s, total: 29min 35s
Wall time: 7min 43s


In [23]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [24]:
y_pred_rf = rf.predict(X_test)
RF_probability = rf.predict_proba(X_test)[:,1]

In [25]:
print("Accuracy Score: ", accuracy_score(y_test, y_pred_rf))
print("AUC_RF: ", roc_auc_score(y_test, RF_probability))  # AUC needs (y_true, scores), not hard labels
print("F1-Score: ", f1_score(y_test, y_pred_rf, average= None))
print(classification_report(y_test, y_pred_rf))

In [26]:
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(y_test, RF_probability)

plt.title('Random Forest ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], ls='dashed', color='black')
plt.show()

In [27]:
test_data.columns

Index(['Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
       'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage'],
      dtype='object')

In [28]:
test = xgb.DMatrix(test_data, enable_categorical=True)

In [29]:
test_pred = booster.predict(test,iteration_range=(0, model.best_iteration + 1))

In [30]:
submission = pd.DataFrame({'id': test_data.index, 'Response': test_pred})

In [31]:
submission

,id,Response
0,11504798,0.016747
1,11504799,0.891927
2,11504800,0.694602
3,11504801,0.001001
4,11504802,0.323460
...,...,...
7669861,19174659,0.625008
7669862,19174660,0.000820
7669863,19174661,0.001780
7669864,19174662,0.886988


In [32]:
submission.to_csv('submission.csv', index=False)

# Submission

In [33]:
sample_sub['Response'] = lgbm_test
# Note: this overwrites the submission.csv written from the XGBoost predictions above
sample_sub.to_csv('submission.csv', index=False)
print(sample_sub.head())