**Health Insurance Cross Sell Prediction 🏠 🏥** <br>
https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction

Hello world, this is my very first notebook on Kaggle. I'm a complete newbie to programming and data science. In this notebook I got 0.945 AUC, 95% accuracy, and 95% precision, and I'm personally doubtful about that result. How could it be so high compared to other notebooks posted for this dataset?
In this analysis, I oversampled the data because there is class imbalance in the target feature (Response). I also have not yet predicted the test dataset.

Please let me know whether such a high AUC is too good to be true, or whether it is what it is.
<br>Thank you in advance.

PS: in this notebook I only post the Random Forest and kNN methods, as both performed better than Logistic Regression and Decision Tree, with which I managed to get 0.7~0.8 AUC.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Load Dataset

In [None]:
df = pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
#df.sample((5), random_state=789)
df.head(1)

In [None]:
df.info()
#df.describe()

In [None]:
df.isna().sum()

In [None]:
print(df.duplicated().sum())

 **Features to Use:**
* **``All 10 features from Gender to Vintage, because:``**
 * There are no missing values or duplicate rows.
 * We can do feature encoding, outlier handling, standardization, and class balancing.

# DATA PRE-PROCESSING:
## Feature Engineering

### One Hot Encodings

* Vehicle_Age (3 value counts)
* Gender (2 value counts)
* Vehicle_Damage (2 value counts)
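
As an alternative to encoding each column in its own cell, the three columns listed above could also be encoded in a single `pd.get_dummies` call. This is only a sketch on a made-up toy frame (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame reusing the three categorical column names from this notebook;
# the values are made up for illustration.
toy = pd.DataFrame({
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
    "Age": [25, 40, 61],
})

categorical_cols = ["Vehicle_Age", "Gender", "Vehicle_Damage"]

# With `columns=`, get_dummies encodes the listed columns, prefixes each new
# column with its source column name, and drops the original columns.
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(encoded.columns.tolist())
```

With this form, the separate `join` and `drop` steps for the encoded columns are not needed.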

In [None]:
onehots = pd.get_dummies(df['Vehicle_Age'], prefix='Vehicle_Age')
df = df.join(onehots)

In [None]:
onehots2 = pd.get_dummies(df['Gender'], prefix='Gender')
df = df.join(onehots2)

In [None]:
onehots3 = pd.get_dummies(df['Vehicle_Damage'], prefix='Vehicle_Damage')
df = df.join(onehots3)

In [None]:
#df.sample(5)
df.info()

#### DROP Unused Features
* drop 'id' (not used)
* drop 'Gender' (one hot encoded)
* drop 'Vehicle_Damage' (one hot encoded)
* drop 'Vehicle_Age' (one hot encoded)

In [None]:
df = df.drop(['id', 'Gender', 'Vehicle_Damage', 'Vehicle_Age'], axis=1) 

In [None]:
df.info()

## OUTLIERS Handling (Annual Premium)
* I'm using the IQR method to identify outliers; one reason is that the data is not normally distributed.

### Outliers Method 1 : throw away outliers
(I used 2 methods to handle outliers. In the end, this method gave slightly better results.)

In [None]:
print(f'Count of rows before filtering outlier: {len(df)}')

filtered_entries = np.array([True] * len(df))
for col in ['Annual_Premium']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    low_limit = Q1 - (IQR * 1.5)
    high_limit = Q3 + (IQR * 1.5)

    filtered_entries = ((df[col] >= low_limit) & (df[col] <= high_limit)) & filtered_entries
    
df = df[filtered_entries]

print(f'Count of rows after filtering outlier: {len(df)}')

# Boxplot visualization
numerikready = ['Age', 'Driving_License', 'Previously_Insured', 'Annual_Premium', 'Vintage', 'Response']
fig = plt.figure(figsize=(20,7))
n_cols = int(np.ceil(len(numerikready)/2))  # plt.subplot needs integer grid sizes
for i in range(0, len(numerikready)):
    plt.subplot(2, n_cols, i+1)
    sns.boxplot(y=df[numerikready[i]], color='teal')  # y= draws a vertical box
plt.tight_layout()

### Outliers Method 2 : change outliers to minimum or maximum (fence value).
This cell is not run because Method 1 actually gave better results, but I show the code here. <br>
However, this method is probably more applicable when predicting the test dataset.
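
A minimal sketch of what capping at the fence values could look like, assuming the same 1.5 × IQR fences as Method 1. The premium values are made up, and `Series.clip` is just one way to do the capping, not necessarily the code that was originally tried:

```python
import pandas as pd

# Made-up premiums; 2630 and 250000 sit far from the bulk of the values.
premium = pd.Series(
    [2630, 25000, 28000, 31000, 33000, 36000, 39000, 250000],
    name="Annual_Premium",
)

q1 = premium.quantile(0.25)
q3 = premium.quantile(0.75)
iqr = q3 - q1
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr

# clip() replaces values outside [low_limit, high_limit] with the nearest
# fence, so the number of rows stays the same.
capped = premium.clip(lower=low_limit, upper=high_limit)

print(len(premium), len(capped))
print(capped.min(), capped.max())
```

Because no rows are removed, every row of a test file still gets a prediction, which is why capping fits the test dataset better than filtering.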

## STANDARDIZATION
I standardize ``Annual_Premium`` because its values are very large compared to the other features.
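
Standardization rescales a column to zero mean and unit variance, z = (x - mean) / std. This notebook fits a fresh `StandardScaler` on each file. A common alternative, sketched here with made-up numbers, is to fit the scaler once on the training data and reuse the same statistics on new data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up premium values standing in for the train and test files.
train_premium = np.array([[20000.0], [30000.0], [40000.0]])
test_premium = np.array([[30000.0], [50000.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data.
train_std = scaler.fit_transform(train_premium)
# transform reuses those training statistics instead of re-estimating them.
test_std = scaler.transform(test_premium)

print(scaler.mean_, scaler.scale_)
print(test_std.ravel())
```

Reusing one fitted scaler keeps a given premium mapped to the same standardized value in both files.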

In [None]:
from sklearn.preprocessing import StandardScaler
df['Annual_Premium_std'] = StandardScaler().fit_transform(df['Annual_Premium'].values.reshape(len(df), 1))

std = ['Annual_Premium_std']

display(df[std].describe())

### DROP UnStandardized data

In [None]:
df = df.drop(['Annual_Premium'], axis=1)
df.describe()

## BALANCING CLASS (Response)
* There is class imbalance in the Response feature, so I decided to balance it.
* I oversampled the Response == 1 class using RandomOverSampler. I use oversampling because it gave better results than undersampling.
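
One thing worth checking when scores look too good: random oversampling duplicates minority rows, so if the data is split into train and test after oversampling, copies of the same row can land in both parts. A hedged sketch of the alternative order (split first, then oversample only the training part) follows. It uses synthetic data and plain pandas sampling as a stand-in for RandomOverSampler, and the column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data with roughly 10% positives.
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Response": (rng.random(1000) < 0.1).astype(int),
})

# 1) Hold out the test rows first; stratify keeps the class ratio in both parts.
train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["Response"], random_state=789
)

# 2) Oversample the minority class inside the training split only.
majority = train[train["Response"] == 0]
minority = train[train["Response"] == 1]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
train_balanced = pd.concat([majority, minority_upsampled])

print(train_balanced["Response"].value_counts())
# Row labels shared between the balanced training set and the test set.
print(set(train_balanced.index) & set(test.index))
```

With this order the test split keeps its original class ratio, so metrics measured on it are closer to what unseen data would give.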

In [None]:
display(df.sample(1))
print('#'*100)
print(df['Response'].value_counts())

In [None]:
X = df[[col for col in df.columns if (str(df[col].dtype) != 'object') and col not in ['Response']]]
y = df['Response'].values
print(X.shape)
print(y.shape)

In [None]:
from imblearn import over_sampling
X_over, y_over = over_sampling.RandomOverSampler().fit_resample(X, y)
df_y_over = pd.Series(y_over).value_counts()
df_y_over

In [None]:
pd.DataFrame(y_over).rename(columns = {0 : 'Response'})

## Save the BALANCED dataset

In [None]:
df = pd.concat([X_over, pd.DataFrame(y_over).rename(columns = {0 : 'Response'})], axis=1)
df

## SAVE to CSV PreProcessed

In [None]:
### this code is optional if you want to export the pre-processed data to csv.
#df.to_csv('train_pre_processed.csv')

# MODELLING
## Random Forest
* I use n_estimators=400 and max_depth=110 because they gave the best result after I tuned the hyperparameters with randomized search.
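
The randomized search itself is not shown in this notebook. A minimal sketch of how such a search might look with scikit-learn's `RandomizedSearchCV` is given below. The search space, `n_iter`, `cv`, and the synthetic data are assumptions chosen so the example runs quickly, not the settings actually used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset so the sketch runs quickly.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical search space; the ranges actually searched are not given here.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [5, 10, 20, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,            # number of random parameter combinations to try
    scoring="roc_auc",   # rank candidates by cross-validated AUC
    cv=3,
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all of the provided data with the best parameters found.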


In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Split Feature Vector and Label
X = df.drop(['Response'], axis = 1) # use all features except the target
y = df['Response'] # target / label

#Splitting the data into Train and Test
from sklearn.model_selection import train_test_split 
X_train, X_test,y_train,y_test = train_test_split(X,
                                                y,
                                                test_size = 0.3,
                                                random_state = 789)

In [None]:
y_test.count()

In [None]:
rf = RandomForestClassifier(n_estimators= 400, max_depth=110, random_state=0)
rf.fit(X_train, y_train)

In [None]:
y_predicted = rf.predict(X_test)
y_predicted

In [None]:
# Outliers thrown away
# Data oversampled on Response == 1

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))

print('\naccuracy')
print(accuracy_score(y_test, y_predicted))
print('\nprecision')
print(precision_score(y_test, y_predicted))


print('\nclassification report')
print(classification_report(y_test, y_predicted)) # generate the precision, recall, f-1 score

In [None]:
print("train Accuracy : ",rf.score(X_train,y_train))
print("test Accuracy : ",rf.score(X_test,y_test))

In [None]:
from sklearn.metrics import roc_curve, auc
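# Note: y_predicted holds hard 0/1 labels, so this ROC curve has a single operating point;
# passing rf.predict_proba(X_test)[:, 1] instead would trace the full curve.
fpr, tpr, thresholds = roc_curve(y_test, y_predicted, pos_label=1) # pos_label: the label we treat as positive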
print('Area Under ROC Curve (AUC):', auc(fpr, tpr))
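
AUC measures how well a model ranks positives above negatives, so it is usually computed from scores or probabilities rather than from hard 0/1 predictions. The sketch below, with made-up labels and probabilities, compares the two ways of computing it. In scikit-learn, such scores can be obtained from `predict_proba(...)[:, 1]`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.6, 0.45, 0.8, 0.9])

# Hard labels at a 0.5 cut-off discard the ranking information.
y_label = (y_score >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_label)
auc_from_scores = roc_auc_score(y_true, y_score)

# The curve built from hard labels only has the two corners plus one point.
fpr_l, tpr_l, _ = roc_curve(y_true, y_label)
fpr_s, tpr_s, _ = roc_curve(y_true, y_score)

print(auc_from_labels, auc_from_scores)
print(len(fpr_l), len(fpr_s))
```

The two values can differ noticeably, so it helps to state which variant a reported AUC refers to when comparing against other notebooks.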

In [None]:
plt.subplots(figsize=(10, 6))
plt.plot(fpr, tpr, 'o-', label="ROC curve")
plt.plot(np.linspace(0,1,10), np.linspace(0,1,10), label="diagonal")
for x_pt, y_pt, txt in zip(fpr, tpr, thresholds):  # separate names so the label vector y is not overwritten
    plt.annotate(np.round(txt,2), (x_pt, y_pt-0.04))
plt.legend(loc="upper left")
plt.xlabel("FPR")
plt.ylabel("TPR")

In [None]:
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
ax = feat_importances.nlargest(10).plot(kind='barh')
ax.invert_yaxis()
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')

### Evaluation of Best Random Forest model based on pre-processing method 
Based on several trials, these are the conditions that gave the best result: 
 * Features : All features are used, with 3 features one-hot encoded.
 * Outliers : filtered (thrown away)
 * Standardized : Yes, on Annual_Premium
 * Class balancing : RandomOverSampler on the Response == 1 class
<br>
<br>
 * Best n_estimators: 400
 * Best max_depth: 110


## kNN Method

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Split Feature Vector and Label
X = df.drop(['Response'], axis = 1) # use all features except the target
y = df['Response'] # target / label

#Splitting the data into Train and Test
from sklearn.model_selection import train_test_split 
X_train, X_test,y_train,y_test = train_test_split(X,
                                                y,
                                                test_size = 0.3,
                                                random_state = 789)

neigh = KNeighborsClassifier(n_neighbors = 3)
neigh.fit(X_train, y_train)  # fit on the training split only, so X_test stays unseen

In [None]:
y_predicted = neigh.predict(X_test)
y_predicted

### Evaluation : kNN (k-Nearest Neighbors)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))

from sklearn.metrics import accuracy_score
print('\naccuracy')
print(accuracy_score(y_test, y_predicted))

from sklearn.metrics import classification_report
print('\nclassification report')
print(classification_report(y_test, y_predicted)) # generate the precision, recall, f-1 score, num

In [None]:
from sklearn.metrics import roc_curve, auc
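# Note: y_predicted holds hard 0/1 labels; neigh.predict_proba(X_test)[:, 1] would give scores for a full ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, y_predicted, pos_label=1) # pos_label: the label we treat as positive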
print('Area Under ROC Curve (AUC):', auc(fpr, tpr))
print("train Accuracy : ",neigh.score(X_train,y_train))
print("test Accuracy : ",neigh.score(X_test,y_test))

* Surprisingly, in this analysis kNN performed slightly better than Random Forest.
* In my run, the kNN model also appeared neither overfitted nor underfitted, because the train and test accuracy scores were nearly the same (0.9465 vs 0.9459).

# CONCLUSION

* The best performing model is kNN; even without hyperparameter tuning it gave the best result among the models tried. It has 95% accuracy and a 95% weighted-avg F1 score.
* The kNN model appears neither overfitted nor underfitted because its train and test accuracy are about the same (0.94 vs 0.94). This is better than Random Forest, which is slightly overfitted because its gap is larger (0.99 vs 0.94).
* kNN is faster to run (about 3.5 minutes) than Random Forest (about 5 minutes); these timings are from my local PC.
* Best pre-processing methods:
 * Features : All features are used, with 3 features one-hot encoded.
 * Outliers : filtered (thrown away)
 * Standardized : Yes, on Annual_Premium
 * Class balancing : RandomOverSampler
<br> 
<br> 

* Specifically, it also has 90% precision on the positive class, which should help boost the conversion ratio once this model is implemented in the company. It can improve the sales/marketing team's performance because they will know which customers to target (the predicted **Yes** Response).
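
For reference, precision on the positive class is TP / (TP + FP): of all customers predicted **Yes**, the share that really responded. A small worked example with made-up labels (not this notebook's results) shows how it follows from the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Made-up labels: 1 = customer interested, 0 = not interested.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0])

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Share of predicted positives that are truly positive.
precision_manual = tp / (tp + fp)

print(tn, fp, fn, tp)
print(precision_manual, precision_score(y_true, y_pred))
```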

# PREDICTING THE KAGGLE TEST DATASET

In [None]:
dfkaggletest = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv')
dfkaggletest.info()

In [None]:
onehots = pd.get_dummies(dfkaggletest['Vehicle_Age'], prefix='Vehicle_Age')
dfkaggletest = dfkaggletest.join(onehots)
onehots2 = pd.get_dummies(dfkaggletest['Gender'], prefix='Gender')
dfkaggletest = dfkaggletest.join(onehots2)
onehots3 = pd.get_dummies(dfkaggletest['Vehicle_Damage'], prefix='Vehicle_Damage')
dfkaggletest = dfkaggletest.join(onehots3)

In [None]:
dfkaggletest = dfkaggletest.drop(['id', 'Gender', 'Vehicle_Damage', 'Vehicle_Age'], axis=1) 

In [None]:
print(f'Row count before capping outliers: {len(dfkaggletest)}')

for col in ['Annual_Premium']:
    Q1 = dfkaggletest[col].quantile(0.25)
    Q3 = dfkaggletest[col].quantile(0.75)
    IQR = Q3 - Q1
    low_limit = Q1 - (IQR * 1.5)
    high_limit = Q3 + (IQR * 1.5)

    # Cap values above the upper fence at the fence value in one vectorized step;
    # no rows are dropped, so every test id still gets a prediction.
    dfkaggletest[col] = np.where(dfkaggletest[col] > high_limit, high_limit, dfkaggletest[col])

print(f'Row count after capping outliers: {len(dfkaggletest)}')

# Boxplot visualization
numerikready = ['Age', 'Driving_License', 'Previously_Insured', 'Annual_Premium', 'Vintage']#, 'Response']
fig = plt.figure(figsize=(20,7))
n_cols = int(np.ceil(len(numerikready)/2))  # plt.subplot needs integer grid sizes
for i in range(0, len(numerikready)):
    plt.subplot(2, n_cols, i+1)
    sns.boxplot(y=dfkaggletest[numerikready[i]], color='teal')  # y= draws a vertical box
plt.tight_layout()

In [None]:
from sklearn.preprocessing import StandardScaler
dfkaggletest['Annual_Premium_std'] = StandardScaler().fit_transform(dfkaggletest['Annual_Premium'].values.reshape(len(dfkaggletest), 1))

std = ['Annual_Premium_std']

dfkaggletest = dfkaggletest.drop(['Annual_Premium'], axis=1)
dfkaggletest.describe()

In [None]:
y_predicted = neigh.predict(dfkaggletest)
y_predicted

In [None]:
dfkagglesubmission = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv')
dfid = dfkagglesubmission[['id']]

dfkagglesubmission = pd.concat([dfid, pd.DataFrame(y_predicted).rename(columns = {0 : 'Response'})], axis=1)

dfkagglesubmission

In [None]:
dfkagglesubmission.to_csv('test_submission.csv', index=False)  # keep the row index out of the submission file