# Objective

This notebook uses an open Kaggle dataset to run an experiment on handling imbalanced data in a classification problem. The dataset is [Health insurance cross sell prediction](https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction).

In [3]:
import pandas as pd
from utilities import ut_standard_col_name, ut_distinct_elements

In [6]:
# import input (training) data
input_df = pd.read_csv("/Users/lorenzopusateri/Documents/01_studio/04_kaggle/Kaggle/competitions/insurance_cross_selling/data/train.csv")
# standardize column names
input_df = ut_standard_col_name(input_df)
input_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    381109 non-null  int64  
 1   gender                381109 non-null  object 
 2   age                   381109 non-null  int64  
 3   driving_license       381109 non-null  int64  
 4   region_code           381109 non-null  float64
 5   previously_insured    381109 non-null  int64  
 6   vehicle_age           381109 non-null  object 
 7   vehicle_damage        381109 non-null  object 
 8   annual_premium        381109 non-null  float64
 9   policy_sales_channel  381109 non-null  float64
 10  vintage               381109 non-null  int64  
 11  response              381109 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 34.9+ MB


# Variables meaning

One of the first steps with new data is like meeting a new person in real life: we have to exchange some basic information. Here the challenge metadata already gives us part of that information; still, it is good practice to start from a numerical summary of the data.

Columns are:
* `id`: unique ID of the customer.
* `gender`: gender of the customer.
* `age`: age of the customer.
* `driving_license`: 0: customer does not have a driving license, 1: customer already has a driving license.
* `region_code`: unique code for the region of the customer.
* `previously_insured`: 1: customer already has vehicle insurance, 0: customer doesn't have vehicle insurance.
* `vehicle_age`: age of the vehicle.
* `vehicle_damage`: 1: customer got his/her vehicle damaged in the past, 0: customer didn't get his/her vehicle damaged in the past.
* `annual_premium`: the amount the customer needs to pay as premium in the year.
* `policy_sales_channel`: anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
* `vintage`: number of days the customer has been associated with the company.
* `response`: 1: customer is interested, 0: customer is not interested.

Since I am Italian, I want to simulate the problem as it would appear in Italy, where insurance companies are not permitted to use gender for pricing and other research. Hence, I drop that column. I also drop `id`: it would be useful only as a key for merging with other data, and we have no other data to merge, so for prediction it is useless.

In [7]:
train_df = input_df.copy()
train_df = train_df.drop(columns=['gender', 'id'])

# Target variable

Now it is time to inspect the target variable in the training dataset. We are facing a classification problem: from the training data we need to build a model that correctly classifies new records described by the same variables.

In [8]:
value_counts_response = train_df.response.value_counts().to_dict()

# Compute percentage of target classes
perc_label_0 = value_counts_response[0]/len(train_df)
perc_label_1 = value_counts_response[1]/len(train_df)
print(f"Percentage of records with response=0: {round(perc_label_0, 2)*100}%")
print(f"Percentage of records with response=1: {round(perc_label_1, 2)*100}%")

Percentage of records with response=0: 88.0%
Percentage of records with response=1: 12.0%


## Imbalanced data-set

As we can see, there are many more records with `response=0`. In Machine Learning (ML) jargon this situation is called _imbalanced data_. Andreas C. Müller and Sarah Guido discuss the problem of imbalanced datasets in their book _Introduction to Machine Learning with Python_:

«Types of errors play an important role when one of two classes is much more frequent than the other one.

This is very common in practice; a good example is click-through prediction, where each data point represents an “impression,” an item that was shown to a user. This item might be an ad, or a related story, or a related person to follow on a social media site. 

The goal is to predict whether, if shown a particular item, a user will click on it (indicating they are interested). Most things users are shown on the Internet (in particular, ads) will not result in a click. You might need to show a user 100 ads or articles before they find something interesting enough to click on.

This results in a dataset where for each 99 “no click” data points, there is 1 “clicked” data point; in other words, 99% of the samples belong to the “no click” class. Datasets in which one class is much more frequent than the other are often called imbalanced datasets, or datasets with imbalanced classes.

In reality, imbalanced data is the norm, and it is rare that the events of interest have equal or even similar frequency in the data.»
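To make the quoted point concrete for this problem: a model that always predicts the majority class would already look accurate while finding no interested customer at all. The sketch below is only an illustration; it assumes a toy list of labels with the same 88/12 split seen above rather than the real data, and the "always predict 0" baseline is hypothetical.

```python
# Toy labels with the same 88/12 split as the target above (illustrative, not the real data).
y_true = [0] * 88 + [1] * 12

# Hypothetical baseline: always predict the majority class (response=0).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of true response=1 records that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_minority = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {recall_minority:.2f}")
```

The accuracy is high only because the majority class dominates; the recall on `response=1` shows that the baseline is useless for the class we care about.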

Hence we need a method that accounts for this kind of error and produces a balanced dataset. There are different methods to deal with imbalanced datasets. They can be split into two main categories:
1. Under-sampling: we **reduce** the size of the imbalanced dataset by removing records that have the most frequent label. This family of methods is useful when we have a huge amount of data and can afford to lose some information.
2. Over-sampling: we **increase** the size of the imbalanced dataset by adding new records for the least frequent label. This family of methods is useful because we do not lose original information. However, depending on the algorithm we choose, we can add more or less noise and/or errors to the data.
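The two families can be sketched with their simplest members, random under-sampling and random over-sampling, using only `pandas`. This is a minimal illustration on a hypothetical toy frame (`toy_df` and its values are assumptions, not the real data), and it is not the method used later in this notebook.

```python
import pandas as pd

# Toy frame with an 88/12 class split (illustrative values, not the real data).
toy_df = pd.DataFrame({
    "age": list(range(20, 120)),
    "response": [0] * 88 + [1] * 12,
})

majority = toy_df[toy_df["response"] == 0]
minority = toy_df[toy_df["response"] == 1]

# Random under-sampling: keep only as many majority records as there are minority records.
under_df = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Random over-sampling: draw minority records with replacement until both classes match.
over_df = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])

print(under_df["response"].value_counts().to_dict())
print(over_df["response"].value_counts().to_dict())
```

Under-sampling shrinks the frame to 24 rows, while over-sampling grows it to 176 rows by repeating minority records, which is exactly the trade-off between losing information and duplicating it.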

The method I use in this notebook is the _Synthetic Minority Over-sampling Technique for Nominal and Continuous_ (SMOTENC), implemented in the Python library `imbalanced-learn`. I chose it because the data has both categorical (nominal) and continuous variables. For further discussion of the over-sampling methods implemented in `imbalanced-learn`, see the library [documentation](https://imbalanced-learn.org/stable/over_sampling.html#smote-adasyn).