<h1 style="font-family: verdana; font-size: 48px;"><center>💵 Insurance Cross Selling 💵</center></h1>

<center><img src="https://m.everydaywinningtip.com/wp-content/uploads/2024/06/Car-Insurance-Premiums.png"></center>

<p><center style="color:#159364; font-family:cursive; font-size:20px; font-weight:600">Thanks for visiting my notebook. If you find it useful, please consider upvoting; it motivates me to create more notebooks like this one.</center></p>

# 🔬 Overview 🔬
<div class="alert alert-block alert-info">
    <p style="font-family:verdana; font-size:20px; line-height: 1.7;">
        Insurance cross-selling is a sales strategy in which an insurer offers additional insurance products to its existing customers. The objective is to build on the existing relationship and trust to offer complementary policies, such as home insurance to a client who already holds auto insurance with the company. This approach can increase customer value, improve customer retention, and raise revenue for the insurer.
    </p>
    <p style="font-size: 20px; font-family: verdana; line-height: 1.7">
        Here is a brief overview of the features that we are going to encounter and train our models on:
    </p>
    <ul style="font-family:verdana; font-size:20px; line-height: 2.3;">
        <li><strong>Age:</strong> The age of the customer.</li>
        <li><strong>Gender:</strong> The gender of the customer (usually represented as 'Male' or 'Female').</li>
        <li><strong>Driving_License:</strong> Indicates whether the customer has a valid driving license (usually 1 for Yes and 0 for No).</li>
        <li><strong>Region_Code:</strong> A categorical code representing the region where the customer resides.</li>
        <li><strong>Previously_Insured:</strong> Indicates whether the customer has previously been insured (1 for Yes, 0 for No).</li>
        <li><strong>Vehicle_Age:</strong> The age of the customer's vehicle, typically categorized into bins (e.g., '1-2 Years', '< 1 Year', '> 2 Years').</li>
        <li><strong>Vehicle_Damage:</strong> Indicates whether the customer has experienced vehicle damage in the past ('Yes' or 'No').</li>
        <li><strong>Annual_Premium:</strong> The amount of premium paid by the customer annually for their insurance policy.</li>
        <li><strong>Policy_Sales_Channel:</strong> A categorical code representing the channel through which the insurance policy was sold (e.g., different agents or online sales channels).</li>
        <li><strong>Vintage:</strong> The number of days the customer has been associated with the company.</li>
    </ul>
    
</div>

# 🎯 Our Goal 🎯
<div class="alert alert-block alert-info" style="font-size: 20px; font-family: verdana; line-height: 1.7">
    Our task is to understand the provided data and build predictive models that estimate whether a customer will respond positively to an offer of another company product (the Response column).
</div>


# 📚 Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc

warnings.filterwarnings("ignore")

from sklearn.preprocessing import MinMaxScaler

from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, ConfusionMatrixDisplay, classification_report

import catboost as cb

### 🛠️ Custom Function

<div class="alert alert-block alert-warning" style="font-family:verdana; font-size:20px; line-height:1.7;">
    <p>
        The following two functions simplify loading large CSV files and reduce the memory footprint of a pandas DataFrame.
    </p>
    <p>
        These two functions are from <a href="https://www.kaggle.com/code/gemartin/load-data-reduce-memory-usage"> https://www.kaggle.com/code/gemartin/load-data-reduce-memory-usage </a>
    </p>
</div>

In [2]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

def import_data(file, **kwargs):
    """create a dataframe and optimize its memory usage"""
    # keep_date_col is deprecated in recent pandas releases and has no effect here, so it is omitted
    df = pd.read_csv(file, parse_dates=True, **kwargs)
    df = reduce_mem_usage(df)
    return df
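
As a quick illustration of what the downcasting above does, and of one caveat, here is a small sketch on a toy frame. The column names and values are made up for this example. The last lines show why casting to `float16` deserves some care: it cannot represent every integer above 2048, so large values may be rounded.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): pandas defaults to 64-bit dtypes.
toy = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),
    "small_float": np.linspace(0.0, 1.0, 100),
})
before = toy.memory_usage(index=False).sum()

# Same idea as reduce_mem_usage: pick a narrower dtype that still holds the range.
toy["small_int"] = toy["small_int"].astype(np.int8)
toy["small_float"] = toy["small_float"].astype(np.float32)
after = toy.memory_usage(index=False).sum()

print(f"{before} bytes -> {after} bytes")

# Caveat: float16 has an 11-bit significand, so 2049 is rounded when cast.
lossy = float(np.float16(2049.0))
print(lossy)
```

If exact values matter for a float column, keeping it at `float32` is the safer choice.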

# 💾 Import Data 💾

In [3]:
train = import_data("/kaggle/input/playground-series-s4e7/train.csv", index_col = "id", engine="pyarrow")

Memory usage of dataframe is 1053.30 MB
Memory usage after optimization is: 274.30 MB
Decreased by 74.0%


In [4]:
test = import_data("/kaggle/input/playground-series-s4e7/test.csv", index_col = "id", engine="pyarrow")

Memory usage of dataframe is 643.68 MB
Memory usage after optimization is: 175.55 MB
Decreased by 72.7%


In [5]:
train["Region_Code"] = train["Region_Code"].astype(np.int8)
test["Region_Code"] = test["Region_Code"].astype(np.int8)

train["Policy_Sales_Channel"] = train["Policy_Sales_Channel"].astype(np.int16)
test["Policy_Sales_Channel"] = test["Policy_Sales_Channel"].astype(np.int16)

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11504798 entries, 0 to 11504797
Data columns (total 11 columns):
 #   Column                Dtype   
---  ------                -----   
 0   Gender                category
 1   Age                   int8    
 2   Driving_License       int8    
 3   Region_Code           int8    
 4   Previously_Insured    int8    
 5   Vehicle_Age           category
 6   Vehicle_Damage        category
 7   Annual_Premium        float32 
 8   Policy_Sales_Channel  int16   
 9   Vintage               int16   
 10  Response              int8    
dtypes: category(3), float32(1), int16(2), int8(5)
memory usage: 263.3 MB


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7669866 entries, 11504798 to 19174663
Data columns (total 10 columns):
 #   Column                Dtype   
---  ------                -----   
 0   Gender                category
 1   Age                   int8    
 2   Driving_License       int8    
 3   Region_Code           int8    
 4   Previously_Insured    int8    
 5   Vehicle_Age           category
 6   Vehicle_Damage        category
 7   Annual_Premium        float32 
 8   Policy_Sales_Channel  int16   
 9   Vintage               int16   
dtypes: category(3), float32(1), int16(2), int8(4)
memory usage: 168.2 MB


In [8]:
target = "Response"

In [9]:
initial_features = test.columns.to_list()
print(initial_features)

['Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']


In [10]:
categorical_features = [col for col in initial_features if pd.concat([train[col], test[col]]).nunique() < 10]

print(categorical_features)

['Gender', 'Driving_License', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage']


In [11]:
numerical_features = list(set(initial_features) - set(categorical_features))

print(numerical_features)

['Age', 'Region_Code', 'Policy_Sales_Channel', 'Annual_Premium', 'Vintage']


# 📊 Feature Distribution 📊

In [12]:
train[categorical_features] = train[categorical_features].astype("category")
test[categorical_features] = test[categorical_features].astype("category")

## Statistical summary of numeric features

In [13]:
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,11504798.0,38.383563,14.993459,20.0,24.0,36.0,49.0,85.0
Region_Code,11504798.0,26.41869,12.99159,0.0,15.0,28.0,35.0,52.0
Annual_Premium,11504798.0,30461.359375,16454.744141,2630.0,25277.0,31824.0,39451.0,540165.0
Policy_Sales_Channel,11504798.0,112.425442,54.035708,1.0,29.0,151.0,152.0,163.0
Vintage,11504798.0,163.897744,79.979531,10.0,99.0,166.0,232.0,299.0
Response,11504798.0,0.122997,0.328434,0.0,0.0,0.0,0.0,1.0


In [14]:
test.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,7669866.0,38.391369,14.999507,20.0,24.0,36.0,49.0,85.0
Region_Code,7669866.0,26.426614,12.994326,0.0,15.0,28.0,35.0,52.0
Annual_Premium,7669866.0,30465.523438,16445.865234,2630.0,25280.0,31827.0,39460.0,540165.0
Policy_Sales_Channel,7669866.0,112.364992,54.073585,1.0,29.0,151.0,152.0,163.0
Vintage,7669866.0,163.899577,79.984449,10.0,99.0,166.0,232.0,299.0


# Feature engineering

**Create New Features:**
   ```python
   df['Previously_Insured_Annual_Premium'] = pd.factorize((df['Previously_Insured'].astype(str) + df['Annual_Premium'].astype(str)).to_numpy())[0]
   df['Previously_Insured_Vehicle_Age'] = pd.factorize((df['Previously_Insured'].astype(str) + df['Vehicle_Age'].astype(str)).to_numpy())[0]
   df['Previously_Insured_Vehicle_Damage'] = pd.factorize((df['Previously_Insured'].astype(str) + df['Vehicle_Damage'].astype(str)).to_numpy())[0]
   df['Previously_Insured_Vintage'] = pd.factorize((df['Previously_Insured'].astype(str) + df['Vintage'].astype(str)).to_numpy())[0]
   ```

   - **Combining Columns:** Each line concatenates two columns as strings. For example, `(df['Previously_Insured'].astype(str) + df['Annual_Premium'].astype(str))` combines the `Previously_Insured` and `Annual_Premium` columns into a single string for each row.
   - **Factorizing Combined Strings:** The `pd.factorize()` function is then applied to the combined string arrays. Factorization converts each unique string into a unique integer. This is useful for converting categorical combinations into numerical values.
   - **Assigning New Features:** The factorized values are assigned to new columns in the DataFrame (`df`). Each new column represents a unique combination of the original columns.
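
To see what the concatenate-then-factorize step produces, here is a minimal sketch on a toy frame. The rows and category labels are made up for illustration; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with made-up values, mirroring two of the notebook's columns.
toy = pd.DataFrame({
    "Previously_Insured": [0, 1, 0, 0, 1],
    "Vehicle_Age": ["< 1 Year", "< 1 Year", "< 1 Year", "1-2 Year", "1-2 Year"],
})

# Concatenate the two columns as strings, one combined string per row.
combined = toy["Previously_Insured"].astype(str) + toy["Vehicle_Age"].astype(str)

# factorize maps each distinct string to an integer code,
# numbered in order of first appearance.
codes, uniques = pd.factorize(combined.to_numpy())

print(codes.tolist())
print(list(uniques))
```

Rows 0 and 2 share the same combination, so they receive the same code. The codes carry no ordering; they only identify which combination a row belongs to.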

In [19]:
# Concatenate train and test dataframes
df = pd.concat([train, test])

# Create the new features by factorizing the concatenated string columns
df['Previously_Insured_Annual_Premium'] = pd.factorize((df['Previously_Insured'].astype(str) + df['Annual_Premium'].astype(str)).to_numpy())[0]
df['Previously_Insured_Vehicle_Age'] = pd.factorize((df['Previously_Insured'].astype(str) + df['Vehicle_Age'].astype(str)).to_numpy())[0]
df['Previously_Insured_Vehicle_Damage'] = pd.factorize((df['Previously_Insured'].astype(str) + df['Vehicle_Damage'].astype(str)).to_numpy())[0]
df['Previously_Insured_Vintage'] = pd.factorize((df['Previously_Insured'].astype(str) + df['Vintage'].astype(str)).to_numpy())[0]

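# Split the combined dataframe back into train and test.
# The test index is left untouched so that test.index still holds the original
# ids when the submission files are written.
n_train = train.shape[0]
train = df.iloc[:n_train].reset_index(drop=True)
test = df.iloc[n_train:]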

In [20]:
test = test.drop(['Response'], axis=1)

# 🤖 Model Training 🤖


## 📏 Metric for Evaluation 📏
<div class="alert alert-block alert-info" style="font-family:verdana; font-size:20px; line-height:1.7;">
    <p>The metric we are going to use while evaluating model performance is <strong>area under the ROC curve (ROC-AUC)</strong>.</p>
    <div>
        <p>An <strong>ROC</strong> curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:</p>
        <ul style="line-height:2.2">
            <li><strong>True Positive Rate (TPR)</strong> $ = \frac{TP}{TP + FN}$ </li>
            <li><strong>False Positive Rate (FPR)</strong> $ = \frac{FP}{FP + TN}$ </li>
        </ul>
    </div>
    <p>
        At various classification thresholds, an <strong>ROC curve illustrates the relationship between TPR and FPR</strong>. Lowering the classification threshold classifies more items as positive, which increases both False Positives and True Positives. A typical ROC curve is illustrated in the accompanying figure.
    </p>
    <center><img src="https://developers.google.com/static/machine-learning/crash-course/images/ROCCurve.svg" height="600" width="640"></center>
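    <p><strong>ROC-AUC, or simply AUC,</strong> measures the two-dimensional area under the entire ROC curve from (0,0) to (1,1). A value of 1.0 corresponds to a perfect ranking of positives above negatives, and 0.5 to random guessing.</p>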
    <center><img src="https://developers.google.com/static/machine-learning/crash-course/images/AUC.svg" height="600" width="640"></center>
</div>
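
The definitions above can be checked on a tiny example. The labels and scores below are made up for illustration. The sketch computes the ROC points with scikit-learn, integrates them with the trapezoidal rule, and compares the result with `roc_auc_score` and with the equivalent ranking view of AUC (the fraction of positive/negative pairs in which the positive gets the higher score).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores (made up for illustration).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# FPR and TPR at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve via the trapezoidal rule.
auc_trapz = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# The same quantity computed directly by scikit-learn.
auc_sklearn = float(roc_auc_score(y_true, y_score))

# Ranking view: share of (positive, negative) pairs ordered correctly.
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
auc_pairs = float(np.mean(pos[:, None] > neg[None, :]))

print(auc_trapz, auc_sklearn, auc_pairs)
```

All three numbers agree here because the toy scores contain no ties. The ranking view also explains why AUC depends only on the order of the predictions, not on their scale.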

In [25]:
X = train.drop(target, axis=1)
y = train[target]

In [26]:
skfold = StratifiedKFold(5, shuffle=True, random_state=42)

<div class="alert alert-block alert-danger" style="font-family: verdana; font-size: 20px; line-height: 1.7;">
    The hyperparameters below are taken from several public notebooks. I haven't tuned them myself.
</div>

## CatBoost

In [27]:
cat_params = {
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'class_names': [0, 1],
    'learning_rate': 0.075,
    'iterations': 3000,
    'depth': 9,
    'random_strength': 0,
    'l2_leaf_reg': 0.5,
    'max_leaves': 512,
    'fold_permutation_block': 64,
    'task_type': 'GPU',
    'random_seed': 42,
    'verbose': False,
    'allow_writing_files': False
}

## Detailed Explanation of cb.Pool
The `cb.Pool` class is CatBoost's dataset container. It bundles the features, the labels, and metadata about the columns, and it provides:

- **Categorical Features Handling**: The `cat_features` parameter indicates which columns are categorical. CatBoost encodes these features internally with target statistics computed in an ordered fashion, so no manual one-hot or label encoding is needed.
- **Memory Optimization**: A Pool holds the data in CatBoost's internal format, so the same object can be reused for fitting and prediction without converting the data again.
- **Integration with CatBoost Models**: Using a Pool ensures that the data is in the format CatBoost expects, for both training and prediction.
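
To make the target-statistics idea more concrete, here is a simplified sketch in plain NumPy. It is not CatBoost's actual implementation: the function name `ordered_target_stat` and the `prior` value are hypothetical, and CatBoost additionally uses random permutations of the rows. The point it illustrates is that each row is encoded using only the labels of earlier rows in the same category, so a row's own label never leaks into its encoding.

```python
import numpy as np

def ordered_target_stat(categories, targets, prior=0.5):
    """Encode each row from the targets of earlier rows in the same category.

    Simplified illustration only:
    (sum of earlier targets + prior) / (number of earlier rows + 1).
    """
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i, (cat, y) in enumerate(zip(categories, targets)):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded[i] = (s + prior) / (n + 1)
        # Update the running statistics only after the row has been encoded.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

# Made-up categories and binary targets.
cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
enc = ordered_target_stat(cats, ys)
print(enc)
```

The first occurrence of each category falls back to the prior, and later occurrences move toward the observed positive rate of that category.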

In [30]:
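# Despite the name, oof_preds collects each fold model's predictions on the
# test set (averaged after the loop); oof_aucs holds the validation AUC per fold.
oof_preds = []
oof_aucs = []

# All features are cast to str and passed to CatBoost as categorical.
test_pool = cb.Pool(test.astype(str), cat_features=X.columns.values)

for fold, (train_idx, test_idx) in enumerate(skfold.split(X, y)):
    # Positional indexing, so the split does not depend on the index labels.
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]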
    
    X_train_pool = cb.Pool(X_train.astype(str), y_train, cat_features=X.columns.values)
    X_test_pool = cb.Pool(X_test.astype(str), y_test, cat_features=X.columns.values)
    
    cat_clf = cb.CatBoostClassifier(**cat_params)
    cat_clf = cat_clf.fit(X=X_train_pool,
                          eval_set=X_test_pool,
                          verbose=500,
                          early_stopping_rounds=200)
    
    test_pred = cat_clf.predict_proba(test_pool)[:, 1]
    
    oof_preds.append(test_pred)
    auc = cat_clf.best_score_['validation']['AUC']
    oof_aucs.append(auc)
    print(f"\n---- Fold {fold}: ROC-AUC Score: {auc:.6f}\n")
    
    del X_train, y_train, X_test, y_test
    del X_train_pool, X_test_pool
    del cat_clf
    gc.collect()

auc_mean = np.mean(oof_aucs)
auc_std = np.std(oof_aucs)
print(f"\n---> ROC-AUC Score: {auc_mean:.6f} ± {auc_std:.6f}\n")

test_pred_cat = np.mean(oof_preds, axis=0)

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8757758	best: 0.8757758 (0)	total: 9.42s	remaining: 7h 50m 42s
500:	test: 0.8944106	best: 0.8944106 (500)	total: 8m 49s	remaining: 44m 3s
1000:	test: 0.8948238	best: 0.8948238 (1000)	total: 17m 29s	remaining: 34m 56s
1500:	test: 0.8949615	best: 0.8949615 (1497)	total: 26m 8s	remaining: 26m 6s
2000:	test: 0.8950280	best: 0.8950285 (1996)	total: 34m 49s	remaining: 17m 23s
bestTest = 0.8950460553
bestIteration = 2219
Shrink model to first 2220 iterations.

---- Fold 0: ROC-AUC Score: 0.895046



Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8751965	best: 0.8751965 (0)	total: 905ms	remaining: 45m 14s
500:	test: 0.8939918	best: 0.8939918 (500)	total: 8m 40s	remaining: 43m 18s
1000:	test: 0.8944492	best: 0.8944492 (999)	total: 17m 16s	remaining: 34m 29s
1500:	test: 0.8946018	best: 0.8946018 (1500)	total: 25m 48s	remaining: 25m 46s
2000:	test: 0.8946610	best: 0.8946610 (2000)	total: 34m 24s	remaining: 17m 10s
2500:	test: 0.8946760	best: 0.8946806 (2382)	total: 42m 55s	remaining: 8m 33s
bestTest = 0.8946805894
bestIteration = 2382
Shrink model to first 2383 iterations.

---- Fold 1: ROC-AUC Score: 0.894681



Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8754973	best: 0.8754973 (0)	total: 907ms	remaining: 45m 21s
500:	test: 0.8943290	best: 0.8943290 (500)	total: 8m 40s	remaining: 43m 18s
1000:	test: 0.8947532	best: 0.8947533 (997)	total: 17m 18s	remaining: 34m 32s
1500:	test: 0.8948840	best: 0.8948840 (1500)	total: 25m 57s	remaining: 25m 55s
2000:	test: 0.8949235	best: 0.8949242 (1998)	total: 34m 37s	remaining: 17m 17s
2500:	test: 0.8949407	best: 0.8949438 (2369)	total: 43m 16s	remaining: 8m 38s
bestTest = 0.8949438035
bestIteration = 2369
Shrink model to first 2370 iterations.

---- Fold 2: ROC-AUC Score: 0.894944



Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8751477	best: 0.8751477 (0)	total: 1.1s	remaining: 54m 53s
500:	test: 0.8941020	best: 0.8941020 (500)	total: 8m 42s	remaining: 43m 24s
1000:	test: 0.8945151	best: 0.8945151 (1000)	total: 17m 22s	remaining: 34m 40s
1500:	test: 0.8946646	best: 0.8946646 (1500)	total: 25m 56s	remaining: 25m 54s
2000:	test: 0.8947448	best: 0.8947448 (2000)	total: 34m 29s	remaining: 17m 13s
2500:	test: 0.8947653	best: 0.8947668 (2491)	total: 43m 4s	remaining: 8m 35s
bestTest = 0.8947716951
bestIteration = 2582
Shrink model to first 2583 iterations.

---- Fold 3: ROC-AUC Score: 0.894772



Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8762399	best: 0.8762399 (0)	total: 904ms	remaining: 45m 12s
500:	test: 0.8948084	best: 0.8948084 (500)	total: 8m 40s	remaining: 43m 16s
1000:	test: 0.8952348	best: 0.8952348 (1000)	total: 17m 21s	remaining: 34m 39s
1500:	test: 0.8953721	best: 0.8953721 (1500)	total: 25m 58s	remaining: 25m 56s
2000:	test: 0.8954369	best: 0.8954384 (1994)	total: 34m 36s	remaining: 17m 16s
bestTest = 0.8954569697
bestIteration = 2249
Shrink model to first 2250 iterations.

---- Fold 4: ROC-AUC Score: 0.895457


---> ROC-AUC Score: 0.894980 ± 0.000271



# 🏳️ Generalization and Submission 🏳️

<div class="alert alert-block alert-danger" style="font-family: verdana; font-size:20px; line-height:1.7;">
    I am going to blend in the outputs of some public notebooks to improve my submission score.
</div> 

In [31]:
output = pd.DataFrame({'id': test.index, 'Response': test_pred_cat})
output.to_csv('sub_v3_cat.csv', index=False)

In [35]:
# Series.ravel() is deprecated in recent pandas releases; to_numpy() returns the same 1-D array.
ext1 = pd.read_csv("/kaggle/input/ps4e7-classification-generalization/submission.csv",
                   engine="pyarrow")[target].to_numpy()

ext2 = pd.read_parquet("/kaggle/input/stacking-xgb-lgbm-catb-ann/submission.parquet")[target].to_numpy()

ext3 = pd.read_csv("/kaggle/input/insurance-binary-classification/submission.csv",
                   engine="pyarrow")[target].to_numpy()

ext4 = pd.read_csv("/kaggle/input/0-89679-ps4e7-are-you-insured/submission.csv",
                   engine="pyarrow")[target].to_numpy()

In [38]:
sub = pd.DataFrame({
    'id': test.index,
    'Response': np.average([ext1, ext2, ext3, ext4, test_pred_cat], axis=0, weights=[25, 20, 15, 20, 1])
})

In [39]:
sub.to_csv("sub_v4_merge.csv", index=False)
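
For reference, here is a small sketch of how the weighted average in the blend behaves. The weights are the ones used above; the toy predictions are made up. `np.average` divides by the sum of the weights, so the weights do not need to add up to 1, and with these values the CatBoost model from this notebook contributes 1/81 of the final prediction (about 1.2%).

```python
import numpy as np

# Weights from the blend above, in order: ext1, ext2, ext3, ext4, CatBoost.
weights = np.array([25, 20, 15, 20, 1], dtype=float)

# Relative contribution of each prediction vector to the blend.
shares = weights / weights.sum()
print(shares.round(3))

# Toy check with made-up predictions from two models on two rows:
# np.average normalizes by the weight sum.
preds = np.array([[0.2, 0.4],
                  [0.6, 0.8]])
blend = np.average(preds, axis=0, weights=[3, 1])
print(blend)
```

Because ROC-AUC depends only on the ranking of the predictions, a blend like this works best when the individual prediction vectors are on a comparable scale.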