# Binary Classification of Insurance Selling

The aim of this workbook is to train a model that predicts whether a customer will respond positively to an automobile insurance offer.

url: https://www.kaggle.com/competitions/playground-series-s4e7/overview

# Lib Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import dask.dataframe as dd

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [None]:
RANDOM_STATE = 32
Y_COLUMN = 'Response'

In [None]:
import os

TRAIN_DATASET_DIR = '/kaggle/input/playground-series-s4e7/train.csv' if os.path.exists('/kaggle/input/') else 'train.csv'
TEST_DATASET_DIR = '/kaggle/input/playground-series-s4e7/test.csv' if os.path.exists('/kaggle/input/') else 'test.csv'

print(f"train dataset dir: {TRAIN_DATASET_DIR}")
print(f"test dataset dir: {TEST_DATASET_DIR}")

## 1. EDA

### 1.1 Train and Test Dataset Loading

The aim of this section is to load the train and test datasets, in order to investigate missing data and general trends.

In [None]:
train_df = dd.read_csv(TRAIN_DATASET_DIR)

In [None]:
train_df = train_df.compute()

In [None]:
train_df.head()

In [None]:
test_df = dd.read_csv(TEST_DATASET_DIR)
test_df.head()

Train Data Null and Dtype checks

In [None]:
train_df.isna().sum()

In [None]:
train_df.dtypes

Test Data Null checks

In [None]:
test_df.isna().sum().compute()

Neither dataset appears to contain missing values, so we can proceed with the EDA of the training dataset.

### 1.2 Train Dataset Feature Engineering

The first step is to analyse the current dataset to determine if features can be engineered. For starters, I'm going to investigate the object columns to determine whether they can be encoded.

In [None]:
train_df.head()

In [None]:
train_df['Gender'].value_counts()

There are only two unique values for Gender, so it can be binary encoded.

In [None]:
train_df['Vehicle_Age'].value_counts()

There are only three unique values for Vehicle Age, so these can also be encoded.

In [None]:
train_df['Vehicle_Damage'].value_counts()

Finally, Vehicle Damage can also be binary encoded.

To summarize, all of the object columns can be encoded: Vehicle Damage and Gender will be binary encoded, and Vehicle Age will be one-hot encoded.

In [None]:
BINARY_COLS = ['Gender', 'Vehicle_Damage']
ONE_HOT_COLS = ['Vehicle_Age']

In [None]:
label_binarizer = LabelBinarizer()

In [None]:
train_df['Gender'] = label_binarizer.fit_transform(train_df['Gender'])
train_df['Vehicle_Damage'] = label_binarizer.fit_transform(train_df['Vehicle_Damage'])

In [None]:
train_df.head()

In [None]:
vehicle_age_one_hot = pd.get_dummies(train_df['Vehicle_Age'])

In [None]:
train_df_full = pd.concat([train_df.drop('Vehicle_Age', axis=1), vehicle_age_one_hot], axis=1)
train_df_full

### 1.3 Numerical Col Analysis

The next step is to review the distributions of the numerical columns.

In [None]:
NUMERICAL_COLS = ['Age', 'Region_Code', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']

In [None]:
train_df_full[NUMERICAL_COLS].hist(bins=20, figsize=(12, 10))

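Given the above distributions, it may be worth standardizing these columns (zero mean, unit variance), since they span different ranges and the regularized logistic regression used later is sensitive to feature scale.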
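As a rough illustration of what standardizing does, here is a minimal sketch using `StandardScaler` on a toy frame. The values are made up for illustration; only the column names `Age` and `Annual_Premium` come from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values standing in for two of the numerical columns.
toy_df = pd.DataFrame({
    'Age': [22, 35, 47, 58, 63],
    'Annual_Premium': [2630.0, 28000.0, 31500.0, 40250.0, 52000.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy_df), columns=toy_df.columns)

# After scaling, each column has mean ~0 and population standard deviation ~1,
# regardless of its original range.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note that `StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed when checking the result with pandas.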

### 1.4 Target vs Feature Analysis

The next step is to review the distribution of the Target Variable, and investigate the relationship between the rest of the features prior to model training.

In [None]:
plt.matshow(train_df_full.corr())
plt.show()
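The target distribution mentioned above could be checked along the following lines. This is only a sketch on a hypothetical toy frame (the values are invented; `Response`, `Age` and `Vehicle_Damage` are the only names taken from the notebook), showing the class share and each feature's correlation with the target.

```python
import pandas as pd

# Hypothetical toy data; Vehicle_Damage is assumed to be already encoded as 0/1.
toy_df = pd.DataFrame({
    'Age': [23, 45, 31, 52, 38, 27, 60, 41],
    'Vehicle_Damage': [0, 1, 0, 1, 1, 0, 1, 0],
    'Response': [0, 1, 0, 1, 0, 0, 1, 0],
})

# Share of each class in the target, which exposes any class imbalance.
class_share = toy_df['Response'].value_counts(normalize=True)
print(class_share)

# Correlation of every feature with the target, strongest first.
target_corr = (
    toy_df.corr()['Response']
    .drop('Response')
    .sort_values(ascending=False)
)
print(target_corr)
```

On the real data the same two calls would be applied to `train_df_full`. A strongly skewed class share would suggest stratifying the train/test split.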

## 2. Pipeline Definition and Train Test Split

Define the preprocessors used for numerical, one hot and binary data.

In [None]:
numerical_transformer = Pipeline(
    steps=[
        ('ss', StandardScaler())
    ]
)

In [None]:
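binary_transformer = Pipeline(
    steps=[
        # LabelBinarizer is designed for target labels: its fit_transform
        # accepts only y, so it raises a TypeError inside a Pipeline.
        # OneHotEncoder(drop='if_binary') yields a single 0/1 column instead.
        ('bin', OneHotEncoder(drop='if_binary'))
    ]
)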

In [None]:
one_hot_transformer = Pipeline(
    steps=[
        ('oh', OneHotEncoder())
    ]
)

In [None]:
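# Columns not listed in any transformer are dropped,
# because ColumnTransformer defaults to remainder='drop'.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, NUMERICAL_COLS),
        ('bin', binary_transformer, BINARY_COLS),
        ('one_hot', one_hot_transformer, ONE_HOT_COLS)
    ]
)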

In [None]:
X = train_df.drop(Y_COLUMN, axis=1)
y = train_df[Y_COLUMN]

In [None]:
lr_model = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegressionCV(Cs=10, cv=4, penalty='l2'))
    ]
)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_STATE)

In [None]:
lr_model.fit(X_train, y_train)
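The imports at the top include `classification_report` and `confusion_matrix`, which suggests scoring the fitted model on the held-out split. The sketch below shows that step on synthetic data with a plain `LogisticRegression`. The data and the names `X_demo`, `y_demo` and `demo_model` are hypothetical, so the numbers say nothing about this competition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (assumption: purely illustrative).
rng = np.random.default_rng(32)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

# Stratify so both splits keep the same class share.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=32, stratify=y_demo
)

demo_model = LogisticRegression().fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])
print(cm)
print(classification_report(y_te, y_pred, zero_division=0))
```

In this notebook the equivalent calls would compare `lr_model.predict(X_test)` against `y_test`.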