## Imbalanced Data Handling
A classifier trained on this data struggles to detect the minority class because minority rows are far rarer than majority rows, so the model leans toward the majority class. This section tries the following handling methods:
- Random Over Sampling
- Random Under Sampling
- Random Sampling (over and under sampling combined)
- SMOTE
- class_weight parameter

The goal is to find the best handling method by comparing evaluation metrics across the methods, using a Decision Tree Classifier as the base model. The method that best supports minority-class detection is then to be checked with the ROC-AUC score.
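ROC-AUC is named above as the deciding metric, so a small illustration of what it measures may help. The sketch below is a plain-Python version written for this explanation (the function name `roc_auc` and the toy scores are assumptions, not part of the notebook): it counts the share of (positive, negative) pairs in which the positive row receives the higher score, with ties counted as half.

```python
def roc_auc(y_true, scores):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: scores stand in for predicted probabilities of class 1
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

With scikit-learn, the usual route would be `sklearn.metrics.roc_auc_score` applied to `y_test` and the class 1 column of `predict_proba`; the quadratic loop here is only meant to show the definition.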

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
df=pd.read_csv('encoded_train.csv')
df.head()

Unnamed: 0,Duration,Net Sales,Commision (in value),Age,Claim,Agency_CWT,Agency_EPX,Agency_Others,Agency Type_Travel Agency,Distribution Channel_Online,Product Name_Annual Silver Plan,Product Name_Bronze Plan,Product Name_Others,Product Name_Rental Vehicle Excess Insurance,Product Name_Silver Plan,Destination_SINGAPORE
0,365,216.0,54.0,57,0,0,0,0,0,1,1,0,0,0,0,1
1,4,10.0,0.0,33,0,0,1,0,1,1,0,0,1,0,0,0
2,19,22.0,7.7,26,0,0,0,1,0,1,0,0,1,0,0,0
3,20,112.0,0.0,59,0,0,1,0,1,1,0,0,0,0,0,0
4,8,16.0,4.0,28,0,0,0,0,0,1,0,1,0,0,0,1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43832 entries, 0 to 43831
Data columns (total 16 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Duration                                      43832 non-null  int64  
 1   Net Sales                                     43832 non-null  float64
 2   Commision (in value)                          43832 non-null  float64
 3   Age                                           43832 non-null  int64  
 4   Claim                                         43832 non-null  int64  
 5   Agency_CWT                                    43832 non-null  int64  
 6   Agency_EPX                                    43832 non-null  int64  
 7   Agency_Others                                 43832 non-null  int64  
 8   Agency Type_Travel Agency                     43832 non-null  int64  
 9   Distribution Channel_Online                   43832 non-null 

In [4]:
# Imbalanced data proportion
(pd.crosstab(index=df['Claim'],columns='Proportion (%)',normalize=True)*100).round(2)

col_0,Proportion (%)
Claim,Unnamed: 1_level_1
0,98.46
1,1.54


The data is clearly imbalanced: class 0 makes up 98.46% of the rows and class 1 only 1.54%.
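To see why accuracy alone is misleading at this proportion, consider a hypothetical classifier that always predicts class 0. The sketch below uses the class counts that appear in this notebook's outputs (43155 non-claims, 677 claims); the "always 0" classifier itself is an assumption made for illustration.

```python
# Class counts as reported by the notebook's value counts
n_non_claim = 43155   # class 0
n_claim = 677         # class 1
total = n_non_claim + n_claim

# A classifier that always answers 0 is right on every class 0 row
accuracy_always_zero = n_non_claim / total
# ...and never finds a single claim
recall_always_zero = 0 / n_claim

print(f"accuracy: {accuracy_always_zero:.2%}, recall on class 1: {recall_always_zero:.2%}")
```

This is why the rest of the notebook looks at Recall, Precision, and F1 for class 1 rather than at accuracy.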

In [5]:
non_claim = df[df['Claim'] == 0] ## Majority Class
claim = df[df['Claim'] == 1] ## Minority Class

### Base Decision Tree Modelling

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [7]:
X = df.drop(columns= 'Claim')
y = df['Claim']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, stratify=y,random_state=169)

In [9]:
print(X_train.shape, X_test.shape)

(35065, 15) (8767, 15)


In [10]:
# Decision Tree Model
model_DT = DecisionTreeClassifier()

In [11]:
model_DT.fit(X_train, y_train)

In [12]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,recall_score,precision_score,f1_score

In [13]:
# Evaluation metrics helper
def Eva_Matrix(model, x_train, x_test, y_train, y_test, Nama):
    # Fit on the training split, then score both splits (positive class = 1)
    Model = model.fit(x_train, y_train)
    y_pred_train = Model.predict(x_train)
    acc_train = accuracy_score(y_train, y_pred_train)
    rec_train = recall_score(y_train, y_pred_train)
    prec_train = precision_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train)

    y_pred_test = Model.predict(x_test)
    acc_test = accuracy_score(y_test, y_pred_test)
    rec_test = recall_score(y_test, y_pred_test)
    prec_test = precision_score(y_test, y_pred_test)
    f1_test = f1_score(y_test, y_pred_test)

    # One summary row per split, labelled with the run name
    data_LR = {
        Nama + ' Training': [acc_train, rec_train, prec_train, f1_train],
        Nama + ' Testing': [acc_test, rec_test, prec_test, f1_test],
    }
    df_LR = pd.DataFrame(data_LR, index=['Accuracy', 'Recall', 'Precision', 'F1']).T.round(4)

    # labels=[1, 0] puts the minority class in the first row and column ('Akt' = actual)
    cr_train = classification_report(y_train, y_pred_train)
    cm_train = confusion_matrix(y_train, y_pred_train, labels=[1, 0])
    df_train = pd.DataFrame(data=cm_train, columns=['Pred 1', 'Pred 0'], index=['Akt 1', 'Akt 0'])

    cr_test = classification_report(y_test, y_pred_test)
    cm_test = confusion_matrix(y_test, y_pred_test, labels=[1, 0])
    df_test = pd.DataFrame(data=cm_test, columns=['Pred 1', 'Pred 0'], index=['Akt 1', 'Akt 0'])

    return df_LR, cr_train, df_train, cr_test, df_test

## Run Function
df_DT, cr_DT_tr, cm_DT_tr, cr_DT_ts, cm_DT_ts = Eva_Matrix(model_DT, X_train, X_test, y_train, y_test, 'Decision Tree Base')


In [14]:
df_DT

Unnamed: 0,Accuracy,Recall,Precision,F1
Decision Tree Base Training,0.9975,0.8413,1.0,0.9138
Decision Tree Base Testing,0.9681,0.0889,0.071,0.0789


In [15]:
print(cr_DT_tr, cr_DT_ts)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     34523
           1       1.00      0.84      0.91       542

    accuracy                           1.00     35065
   macro avg       1.00      0.92      0.96     35065
weighted avg       1.00      1.00      1.00     35065
               precision    recall  f1-score   support

           0       0.99      0.98      0.98      8632
           1       0.07      0.09      0.08       135

    accuracy                           0.97      8767
   macro avg       0.53      0.54      0.53      8767
weighted avg       0.97      0.97      0.97      8767



The classification report shows a large gap in Precision and Recall between the two classes on the test set:
- Precision is 0.99 for class 0 but only 0.07 for class 1
- Recall is 0.98 for class 0 but only 0.09 for class 1
- The data therefore needs imbalance handling

In [16]:
cm_DT_tr

Unnamed: 0,Pred 1,Pred 0
Akt 1,456,86
Akt 0,0,34523


In [17]:
cm_DT_ts

Unnamed: 0,Pred 1,Pred 0
Akt 1,12,123
Akt 0,157,8475
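The test confusion matrix above can be read back into the scores reported for "Decision Tree Base Testing". The sketch below is a hand calculation added for illustration; the helper name `metrics_from_counts` is an assumption, and the layout follows `labels=[1, 0]` (rows are actual 1 then 0, columns are predicted 1 then 0).

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Accuracy, recall, precision and F1 for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# cm_DT_ts as printed above: [[Akt 1/Pred 1, Akt 1/Pred 0], [Akt 0/Pred 1, Akt 0/Pred 0]]
cm_test = [[12, 123],
           [157, 8475]]
(tp, fn), (fp, tn) = cm_test

base_test = metrics_from_counts(tp, fn, fp, tn)
print({name: round(value, 4) for name, value in base_test.items()})
```

Only 12 of the 135 actual claims in the test set are caught, which is where the Recall of 0.0889 comes from.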


### 1. Random Over Sampling

In [18]:
from sklearn.utils import resample

Randomly duplicate minority-class rows until their count matches the majority class.

In [19]:
claim_oversample = resample(claim,
                            replace = True,
                            n_samples = len(non_claim),
                            random_state=28)

In [20]:
df_oversampling = pd.concat([claim_oversample, non_claim])
len(df_oversampling)

86310

In [21]:
df_oversampling['Claim'].value_counts()

1    43155
0    43155
Name: Claim, dtype: int64

The minority class has been duplicated until it matches the majority class in size. Next, build the training data from the oversampled set.

In [22]:
X_train_ov = df_oversampling.drop(columns='Claim')
y_train_ov = df_oversampling['Claim']

In [23]:
df_DT_ov,cr_DT_ov_tr,cm_DT_ov_tr,cr_DT_ov_ts,cm_DT_ov_ts=Eva_Matrix(model_DT,X_train_ov,X_test,y_train_ov,y_test,'Decision Tree Over Sample')

In [24]:
print(cr_DT_ov_tr, cr_DT_ov_ts)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     43155
           1       0.99      1.00      1.00     43155

    accuracy                           1.00     86310
   macro avg       1.00      1.00      1.00     86310
weighted avg       1.00      1.00      1.00     86310
               precision    recall  f1-score   support

           0       1.00      0.99      1.00      8632
           1       0.64      1.00      0.78       135

    accuracy                           0.99      8767
   macro avg       0.82      1.00      0.89      8767
weighted avg       0.99      0.99      0.99      8767



In [25]:
cm_DT_ov_tr

Unnamed: 0,Pred 1,Pred 0
Akt 1,43155,0
Akt 0,302,42853


In [26]:
cm_DT_ov_ts

Unnamed: 0,Pred 1,Pred 0
Akt 1,135,0
Akt 0,77,8555


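The classification report shows a class 1 Recall of 1.00 on both train and test. This indicates that the model has memorized the data (overfit), especially for class 1. Note that `claim` and `non_claim` were taken from the full `df`, so the oversampled training set also contains every row of `X_test`; the test scores of this run are therefore inflated.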
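Because `claim` and `non_claim` are built from the full `df`, the oversampled set overlaps with the test rows. One way to avoid that, sketched below in plain Python under stated assumptions, is to split first and resample only the training part. The helpers `stratified_split` and `random_oversample` and the toy `(row_id, label)` data are hypothetical and only mirror the idea of `train_test_split(..., stratify=y)` followed by `resample(..., replace=True)`.

```python
import random


def stratified_split(rows, test_fraction, seed):
    """Split (row_id, label) pairs into train/test, keeping class ratios."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({lbl for _, lbl in rows}):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def random_oversample(train_rows, minority_label, seed):
    """Draw minority rows with replacement until they match the majority count."""
    rng = random.Random(seed)
    minority = [r for r in train_rows if r[1] == minority_label]
    majority = [r for r in train_rows if r[1] != minority_label]
    return majority + rng.choices(minority, k=len(majority))


# Toy data: 20 minority rows (label 1) and 980 majority rows (label 0)
rows = [(i, 1 if i < 20 else 0) for i in range(1000)]
train, test = stratified_split(rows, test_fraction=0.2, seed=169)
train_ov = random_oversample(train, minority_label=1, seed=28)

# No test row may appear in the resampled training set
leaked = {rid for rid, _ in train_ov} & {rid for rid, _ in test}
print(len(train_ov), len(test), len(leaked))
```

In the notebook itself, the equivalent change would be to build `claim` and `non_claim` from the training rows only; the scores shown in this section were not produced that way.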

### 2. Random Under Sampling
Reduce the majority class so that its size matches the minority class.

In [27]:
claim_undersample = resample(non_claim,        # majority rows, despite the variable name
                            replace=True,     # draws with replacement, so a majority row can repeat
                            n_samples= len(claim),
                            random_state=69)

In [28]:
df_undersampling = pd.concat([claim_undersample, claim])

In [29]:
df_undersampling['Claim'].value_counts()

0    677
1    677
Name: Claim, dtype: int64

Both classes now have the same number of rows; this becomes the new training set.
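The undersampling cell above draws majority rows with `replace=True`, so the 677 kept rows may include repeats. A common alternative is to sample without replacement, so every kept majority row is distinct. The sketch below shows that idea with the standard library only; `random_undersample` and the toy ids are assumptions for illustration.

```python
import random


def random_undersample(majority_rows, n_keep, seed):
    """Keep n_keep distinct majority rows, chosen without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority_rows, n_keep)


majority_ids = list(range(1000))   # stand-in for the majority-class rows
kept = random_undersample(majority_ids, n_keep=50, seed=69)

# Every kept row is unique and comes from the majority pool
print(len(kept), len(set(kept)))
```

With `sklearn.utils.resample`, the same behaviour would come from passing `replace=False`.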

In [30]:
X_train_us = df_undersampling.drop(columns='Claim')
y_train_us = df_undersampling['Claim']

In [31]:
df_DT_us,cr_DT_us_tr,cm_DT_us_tr,cr_DT_us_ts,cm_DT_us_ts=Eva_Matrix(model_DT,X_train_us,X_test,y_train_us,y_test,'Decision Tree Under Sample')

In [32]:
print(cr_DT_us_tr, cr_DT_us_ts)

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       677
           1       1.00      0.99      1.00       677

    accuracy                           1.00      1354
   macro avg       1.00      1.00      1.00      1354
weighted avg       1.00      1.00      1.00      1354
               precision    recall  f1-score   support

           0       1.00      0.66      0.79      8632
           1       0.04      0.98      0.08       135

    accuracy                           0.66      8767
   macro avg       0.52      0.82      0.44      8767
weighted avg       0.98      0.66      0.78      8767



In [33]:
cm_DT_us_tr

Unnamed: 0,Pred 1,Pred 0
Akt 1,672,5
Akt 0,0,677


In [34]:
cm_DT_us_ts

Unnamed: 0,Pred 1,Pred 0
Akt 1,132,3
Akt 0,2950,5682


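The classification report shows that the model detects the minority class (test Recall 0.98) but can no longer recognize the majority class reliably (test Recall 0.66). The confusion matrix shows the cost as a much larger number of False Positives: 2950, against 157 for the base model. Because `claim` comes from the full `df`, all 135 test claims were also part of the training set, which inflates the class 1 Recall.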

### 3. Random Sampling
This method duplicates minority-class rows and reduces majority-class rows at the same time, ending with the same number of rows in each class.

In [35]:
claim_oversample = resample(claim,
                            replace = True,
                            n_samples = 12000,
                            random_state=28)
claim_undersample = resample(non_claim,
                            replace=True,
                            n_samples= 12000,
                            random_state=69)

In [36]:
df_UOS = pd.concat([claim_oversample, claim_undersample])

In [37]:
df_UOS['Claim'].value_counts()

1    12000
0    12000
Name: Claim, dtype: int64

Both classes now have the same number of rows; next, build the training set.

In [38]:
X_train_UOS = df_UOS.drop(columns='Claim')
y_train_UOS = df_UOS['Claim']

In [39]:
df_DT_UOS,cr_DT_UOS_tr,cm_DT_UOS_tr,cr_DT_UOS_ts,cm_DT_UOS_ts=Eva_Matrix(model_DT,X_train_UOS,X_test,y_train_UOS,y_test,'Decision Tree Under-Over Sample')

In [40]:
print(cr_DT_UOS_tr, cr_DT_UOS_ts)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     12000
           1       0.99      1.00      1.00     12000

    accuracy                           1.00     24000
   macro avg       1.00      1.00      1.00     24000
weighted avg       1.00      1.00      1.00     24000
               precision    recall  f1-score   support

           0       1.00      0.96      0.98      8632
           1       0.27      1.00      0.42       135

    accuracy                           0.96      8767
   macro avg       0.63      0.98      0.70      8767
weighted avg       0.99      0.96      0.97      8767



In [41]:
cm_DT_UOS_tr

Unnamed: 0,Pred 1,Pred 0
Akt 1,12000,0
Akt 0,88,11912


In [42]:
cm_DT_UOS_ts

Unnamed: 0,Pred 1,Pred 0
Akt 1,135,0
Akt 0,368,8264


The confusion matrices show that the model detects the minority class on train, but on test it makes many more errors, namely False Positives (368, against 157 for the base model).

### 4. SMOTE

In [43]:
import imblearn
from imblearn.over_sampling import SMOTE

In [44]:
sm = SMOTE(random_state=169)

In [45]:
# Splitting data
from sklearn.model_selection import train_test_split

In [46]:
X = df.drop(columns='Claim')
y = df['Claim']

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size= .80, stratify=y, random_state = 169)

In [48]:
# fit_sample was removed from imbalanced-learn; fit_resample is the current name
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

In [49]:
df_SMOTE = pd.concat([X_train_sm, y_train_sm], axis=1)
df_SMOTE['Claim'].value_counts()

0    34523
1    34523
Name: Claim, dtype: int64
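To reach this balance, SMOTE has to create 34523 - 542 = 33981 synthetic class 1 rows. As a rough sketch of the idea (not the imbalanced-learn implementation), each synthetic row is placed at a random point on the line between a minority row and one of its nearest minority neighbours. The function `smote_like` and the toy points below are assumptions made for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the chosen point (Euclidean distance)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority points: first feature numeric, second a 0/1 dummy column
minority_points = [(1.0, 0.0), (2.0, 1.0), (3.0, 0.0), (2.5, 1.0)]
new_points = smote_like(minority_points, n_new=6, k=2, seed=169)
for point in new_points:
    print(point)
```

Note that interpolation can give a 0/1 dummy column a fractional value. Most features in this dataset are one-hot encoded, which may be one reason SMOTE helps little here; imbalanced-learn also provides `SMOTENC` for mixed numeric and categorical data.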

In [50]:
X_train_sm = df_SMOTE.drop(columns= 'Claim')
y_train_sm = df_SMOTE['Claim']

In [51]:
df_DT_sm,cr_DT_sm_tr,cm_DT_sm_tr,cr_DT_sm_ts,cm_DT_sm_ts=Eva_Matrix(model_DT,X_train_sm,X_test,y_train_sm,y_test,'Decision Tree SMOTE')


In [52]:
print(cr_DT_sm_tr, cr_DT_sm_ts)

              precision    recall  f1-score   support

           0       1.00      0.99      0.99     34523
           1       0.99      1.00      0.99     34523

    accuracy                           0.99     69046
   macro avg       0.99      0.99      0.99     69046
weighted avg       0.99      0.99      0.99     69046
               precision    recall  f1-score   support

           0       0.99      0.96      0.97      8632
           1       0.06      0.17      0.09       135

    accuracy                           0.95      8767
   macro avg       0.53      0.57      0.53      8767
weighted avg       0.97      0.95      0.96      8767



In [53]:
cm_DT_sm_tr

Unnamed: 0,Pred 1,Pred 0
Akt 1,34385,138
Akt 0,294,34229


In [54]:
cm_DT_sm_ts

Unnamed: 0,Pred 1,Pred 0
Akt 1,23,112
Akt 0,331,8301


The classification report shows strong scores on train but far weaker ones on test (class 1 Recall 0.17, Precision 0.06). This method does not suit this case.

### 5. Class Weight Parameter
This method passes a class_weight parameter to the model so that training gives more weight to the minority class.

In [55]:
DT_CW = DecisionTreeClassifier(class_weight={0 : .10,
                                            1 : .90})
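The weights above give class 1 nine times the weight of class 0, while the training set holds roughly 64 class 0 rows for every class 1 row. For comparison, the sketch below computes inverse-frequency weights with the formula `n_samples / (n_classes * count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` is an assumption; the class counts are the training supports printed earlier (34523 and 542).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Training labels rebuilt from the supports in the classification report
y_train_labels = [0] * 34523 + [1] * 542
weights = balanced_class_weights(y_train_labels)
ratio = weights[1] / weights[0]

print({cls: round(w, 4) for cls, w in weights.items()}, round(ratio, 1))
```

Whether a stronger weighting would improve the test scores is not something this notebook tests.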

In [56]:
df_DT_CW, cr_DT_CW_tr, cm_DT_CW_tr, cr_DT_CW_ts, cm_DT_CW_ts = Eva_Matrix(DT_CW, X_train, X_test, y_train, y_test, "DT_CW")

In [57]:
print(cr_DT_CW_tr, cr_DT_CW_ts)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     34523
           1       0.79      1.00      0.88       542

    accuracy                           1.00     35065
   macro avg       0.89      1.00      0.94     35065
weighted avg       1.00      1.00      1.00     35065
               precision    recall  f1-score   support

           0       0.99      0.98      0.98      8632
           1       0.06      0.07      0.06       135

    accuracy                           0.97      8767
   macro avg       0.52      0.53      0.52      8767
weighted avg       0.97      0.97      0.97      8767



In [58]:
cm_DT_CW_tr

Unnamed: 0,Pred 1,Pred 0
Akt 1,540,2
Akt 0,147,34376


In [59]:
cm_DT_CW_ts

Unnamed: 0,Pred 1,Pred 0
Akt 1,10,125
Akt 0,165,8467


The confusion matrices show that the model still does not detect the minority class well on test, where the errors are much larger than on train (False Negatives rise from 2 to 125 and False Positives from 147 to 165).
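To put the methods side by side, the sketch below recomputes class 1 Recall, Precision, and F1 from the test confusion matrices printed in this notebook. The dictionary layout and the helper `class1_scores` are assumptions made for this summary. One caveat matters when reading it: the three random-resampling runs were trained on rows drawn from the full `df`, which includes the test rows, so their scores are not directly comparable with the base, SMOTE, and class_weight runs, which were fit on `X_train` only.

```python
# (TP, FN, FP, TN) for class 1, copied from the test confusion matrices above
test_cms = {
    "Base":         (12, 123, 157, 8475),
    "Random over":  (135, 0, 77, 8555),
    "Random under": (132, 3, 2950, 5682),
    "Combined":     (135, 0, 368, 8264),
    "SMOTE":        (23, 112, 331, 8301),
    "class_weight": (10, 125, 165, 8467),
}


def class1_scores(tp, fn, fp, tn):
    """Recall, precision and F1 for the positive class."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


summary = {name: class1_scores(*counts) for name, counts in test_cms.items()}

print(f"{'Method':<14}{'Recall':>8}{'Precision':>11}{'F1':>8}")
for name, s in summary.items():
    print(f"{name:<14}{s['recall']:>8.2f}{s['precision']:>11.2f}{s['f1']:>8.2f}")
```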

## Model Tuning
Tune the model to find the parameters that best reduce overfitting.

In [60]:
## Try GridSearch
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [61]:
skf = StratifiedKFold(n_splits=5)

In [88]:
params = {'max_leaf_nodes': list(range(2, 70)),
            'min_samples_split': range(2, 5),
            'max_depth' : range(1, 8)}

GridSearchCV is used to search for the best tree parameters, with the training data produced by random oversampling.

In [89]:
DT_GS_cv = GridSearchCV(DecisionTreeClassifier(random_state=169), params, verbose=1, cv=skf, n_jobs=-1, scoring='f1')
DT_GS_cv.fit(X_train_ov, y_train_ov)

Fitting 5 folds for each of 1428 candidates, totalling 7140 fits
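The "1428 candidates, totalling 7140 fits" in the log follows directly from the size of the parameter grid and the number of folds. The sketch below reproduces that count with the standard library; it copies the `params` dictionary from the cell above and assumes the 5 folds set in `StratifiedKFold(n_splits=5)`.

```python
from itertools import product

# Same grid as in the notebook
params = {'max_leaf_nodes': list(range(2, 70)),
          'min_samples_split': range(2, 5),
          'max_depth': range(1, 8)}

# Every combination of one value per parameter is one candidate
n_candidates = sum(1 for _ in product(*params.values()))
n_splits = 5                       # folds used by the cross-validation
n_fits = n_candidates * n_splits   # one fit per candidate per fold

print(n_candidates, n_fits)  # 1428 7140
```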


In [81]:
print(DT_GS_cv.best_estimator_, DT_GS_cv.best_score_)

DecisionTreeClassifier(max_depth=4, max_leaf_nodes=16, random_state=169) 0.7445181482989083


In [82]:
best_model = DT_GS_cv.best_estimator_
best_model.fit(X_train_ov, y_train_ov)

The grid search selects a tree with max_depth = 4 and max_leaf_nodes = 16; the tuned model is stored in the variable best_model.

In [83]:
df_DT_Tuned_ov,cr_DT_Tuned_ov_tr,cm_DT_Tuned_ov_tr,cr_DT_Tuned_ov_ts,cm_DT_Tuned_ov_ts=Eva_Matrix(best_model,X_train_ov,X_test,y_train_ov,y_test,'Decision Tree Tuned OverSample')

In [84]:
print(cr_DT_Tuned_ov_tr, cr_DT_Tuned_ov_ts)

              precision    recall  f1-score   support

           0       0.73      0.83      0.78     43155
           1       0.80      0.69      0.74     43155

    accuracy                           0.76     86310
   macro avg       0.77      0.76      0.76     86310
weighted avg       0.77      0.76      0.76     86310
               precision    recall  f1-score   support

           0       0.99      0.83      0.90      8632
           1       0.06      0.66      0.10       135

    accuracy                           0.82      8767
   macro avg       0.52      0.74      0.50      8767
weighted avg       0.98      0.82      0.89      8767



In [85]:
cm_DT_Tuned_ov_tr

Unnamed: 0,Pred 1,Pred 0
Akt 1,29842,13313
Akt 0,7345,35810


In [86]:
cm_DT_Tuned_ov_ts

Unnamed: 0,Pred 1,Pred 0
Akt 1,89,46
Akt 0,1506,7126


In [90]:
print(classification_report(y_test, best_model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.99      0.83      0.90      8632
           1       0.06      0.66      0.10       135

    accuracy                           0.82      8767
   macro avg       0.52      0.74      0.50      8767
weighted avg       0.98      0.82      0.89      8767



After tuning, overfitting is reduced. Before tuning, class 1 Recall was 1.00 on both train and test; after tuning it settles at 0.69 on train and 0.66 on test. Class 0 Precision is 0.73 on train and 0.99 on test, while class 1 Precision on test remains low at 0.06.

## Save Model to a Pickle File

In [91]:
import pickle

# Use a context manager so the file handle is closed after writing
with open('ClaimDetector_DecisionTree_v.1.0.pkl', 'wb') as f:
    pickle.dump(best_model, f)
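Loading the model back works the same way in reverse. The sketch below shows a save and load round trip using a stand-in object, since the fitted tree is not available outside the notebook; the dictionary `stand_in_model` and the temporary file name are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for best_model; any picklable object behaves the same way
stand_in_model = {"max_depth": 4, "max_leaf_nodes": 16}

path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl")

# Save: binary write mode, file closed automatically
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Load: binary read mode
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same library version that saved it.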