## Model - Claim Detection
- Class 0 - No Claim -> Negative
- Class 1 - Claim -> Positive
- **GOALS**: Minimize False Negatives (predicted No Claim, actually Claim) and False Positives (predicted Claim, actually No Claim)
- The model will focus on the F1 score
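
To make the two error types concrete, here is a minimal sketch that counts them on a small made-up label vector and derives the F1 score from those counts. The arrays `y_true` and `y_pred` are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels for illustration only: 1 = Claim, 0 = No Claim
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# With labels=[0, 1], ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# F1 is the harmonic mean of precision and recall, so it falls
# whenever either false positives or false negatives grow
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(f"FP={fp}, FN={fn}, F1={f1_manual:.3f}")
print(f1_score(y_true, y_pred))  # same value, computed by scikit-learn
```

Because F1 ignores true negatives, it is not inflated by the large No Claim class the way accuracy is.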

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Import the train file produced by the preprocessing step.

In [2]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Duration,Destination,Net Sales,Commision (in value),Age,Claim
0,C2B,Airlines,Online,Annual Silver Plan,365,SINGAPORE,216.0,54.0,57,0
1,EPX,Travel Agency,Online,Others,4,Others,10.0,0.0,33,0
2,Others,Airlines,Online,Others,19,Others,22.0,7.7,26,0
3,EPX,Travel Agency,Online,2 way Comprehensive Plan,20,Others,112.0,0.0,59,0
4,C2B,Airlines,Online,Bronze Plan,8,SINGAPORE,16.0,4.0,28,0


All categorical and numerical columns in this dataset are used. Most machine learning models cannot process non-numeric columns, so the categorical features must be encoded. **One-Hot Encoding** creates one dummy column per unique value in a categorical feature; with `drop_first=True`, as used here, the first category of each feature is dropped and acts as the baseline.
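
The effect of `drop_first` can be seen on a toy frame. This is only a sketch: the frame `toy` is made up, and `dtype=int` is passed so the dummies print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame; the values mimic the 'Agency Type' column but are illustrative
toy = pd.DataFrame({"Agency Type": ["Airlines", "Travel Agency", "Airlines"],
                    "Duration": [365, 4, 19]})

# Full one-hot: one dummy column per unique category
full = pd.get_dummies(toy, columns=["Agency Type"], dtype=int)

# drop_first=True keeps k-1 dummies; the dropped category becomes the all-zero baseline
reduced = pd.get_dummies(toy, columns=["Agency Type"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced frame, a row with `Agency Type_Travel Agency == 0` is implicitly an `Airlines` row, which is why no `Agency Type_Airlines` column appears in the encoded output.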

In [3]:
df = pd.get_dummies(df, columns=['Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Destination'], drop_first=True)

Save the encoded dataframe to a new CSV file, which will be used for the Cross Validation and Imbalance Data Handling steps that follow.

In [4]:
df.to_csv('encoded_train.csv', index=False)
df.head()

Unnamed: 0,Duration,Net Sales,Commision (in value),Age,Claim,Agency_CWT,Agency_EPX,Agency_Others,Agency Type_Travel Agency,Distribution Channel_Online,Product Name_Annual Silver Plan,Product Name_Bronze Plan,Product Name_Others,Product Name_Rental Vehicle Excess Insurance,Product Name_Silver Plan,Destination_SINGAPORE
0,365,216.0,54.0,57,0,0,0,0,0,1,1,0,0,0,0,1
1,4,10.0,0.0,33,0,0,1,0,1,1,0,0,1,0,0,0
2,19,22.0,7.7,26,0,0,0,1,0,1,0,0,1,0,0,0
3,20,112.0,0.0,59,0,0,1,0,1,1,0,0,0,0,0,0
4,8,16.0,4.0,28,0,0,0,0,0,1,0,1,0,0,0,1


In [5]:
from sklearn.model_selection import train_test_split

Assign the feature (independent) columns to `X` and the target (dependent) column to `y`.

In [6]:
X = df.drop(columns='Claim')
y = df['Claim']

Split the data so that 80% is used for training and 20% for testing; `stratify` ensures that the proportion of target classes is the same in the train and test sets.
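
A small sketch of what `stratify` guarantees, using a synthetic imbalanced target rather than the real data (`X_toy` and `y_toy` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 20 positives out of 1000 rows (illustrative only)
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.80, stratify=y_toy, random_state=100
)

# Both subsets keep the 2% positive rate of the full data
print(y_tr.sum(), len(y_tr))
print(y_te.sum(), len(y_te))
```

Without `stratify`, a random split of such a rare class could leave the test set with noticeably more or fewer positives than expected, which would make the test metrics unstable.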

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size= .80, stratify=y, random_state = 100)

In [8]:
X_train.head()

Unnamed: 0,Duration,Net Sales,Commision (in value),Age,Agency_CWT,Agency_EPX,Agency_Others,Agency Type_Travel Agency,Distribution Channel_Online,Product Name_Annual Silver Plan,Product Name_Bronze Plan,Product Name_Others,Product Name_Rental Vehicle Excess Insurance,Product Name_Silver Plan,Destination_SINGAPORE
32196,34,7.84,2.2,48,0,0,1,0,1,0,0,1,0,0,1
37891,24,20.0,0.0,35,0,1,0,1,1,0,0,0,0,0,0
20289,42,10.0,0.0,23,0,1,0,1,1,0,0,1,0,0,0
28541,33,30.0,10.5,47,0,0,1,1,0,0,0,1,0,0,0
24352,13,29.7,17.82,38,1,0,0,1,1,0,0,0,1,0,0


In [9]:
X_test.head()

Unnamed: 0,Duration,Net Sales,Commision (in value),Age,Agency_CWT,Agency_EPX,Agency_Others,Agency Type_Travel Agency,Distribution Channel_Online,Product Name_Annual Silver Plan,Product Name_Bronze Plan,Product Name_Others,Product Name_Rental Vehicle Excess Insurance,Product Name_Silver Plan,Destination_SINGAPORE
23288,15,79.2,47.52,49,1,0,0,1,1,0,0,0,1,0,0
40495,49,13.5,3.38,64,0,0,0,0,1,0,1,0,0,0,1
10226,31,69.3,41.58,33,1,0,0,1,1,0,0,0,1,0,0
12960,8,24.0,0.0,36,0,1,0,1,1,0,0,1,0,0,1
30212,29,61.0,0.0,35,0,1,0,1,1,0,0,1,0,0,0


## 1. Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegression

In [11]:
model_LR = LogisticRegression()

In [12]:
model_LR.fit(X_train, y_train)

LogisticRegression()

In [13]:
model_LR.score(X_train, y_train)

0.984457436189933

In [14]:
model_LR.score(X_test, y_test)

0.9847154100604539

In [15]:
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

StratifiedKFold sets how many folds the data is split into for validation (here 5) while keeping the class proportions the same in every fold.
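
As a rough illustration, the sketch below counts the positives that land in each validation fold of a synthetic target (the arrays are made up for the example):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic target with 10 positives among 500 rows (illustrative only)
y_toy = np.array([1] * 10 + [0] * 490)
X_toy = np.zeros((500, 1))

skf_toy = StratifiedKFold(n_splits=5)

# Each validation fold should receive the same share of the rare class
positives_per_fold = [
    int(y_toy[val_idx].sum()) for _, val_idx in skf_toy.split(X_toy, y_toy)
]
print(positives_per_fold)
```

Every fold receives two of the ten positives, so no fold is left without any Claim samples to score against.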

In [16]:
skf = StratifiedKFold(n_splits=5)

In [17]:
def Cross_Val(model, X, y, Nama):
    # 5-fold stratified cross-validation; returns the mean and std of each metric
    skf = StratifiedKFold(n_splits=5)
    # Default scoring for a classifier is accuracy
    cv_Acc = cross_val_score(model, X, y, cv = skf)
    # recall, precision and f1 are computed for the positive class (1 = Claim)
    cv_recall = cross_val_score(model, X, y, cv = skf, scoring='recall')
    cv_precision = cross_val_score(model, X, y, cv = skf, scoring='precision')
    cv_f1 = cross_val_score(model, X, y, cv = skf, scoring='f1')
    data = {
        Nama + "CV (Mean)" : [cv_Acc.mean(), cv_recall.mean(), cv_precision.mean(), cv_f1.mean()],
        Nama + "CV (Std)" : [cv_Acc.std(), cv_recall.std(), cv_precision.std(), cv_f1.std()]
    }
    df = pd.DataFrame(data, index = ["Accuracy", "Recall", "Precision", "F1"])
    return df
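
The helper above refits the model once per fold for each of its four metrics. A possible alternative, sketched here and not part of the original notebook, is scikit-learn's `cross_validate`, which scores several metrics from a single fit per fold. The name `cross_val_summary` and the synthetic data are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

def cross_val_summary(model, X, y, name):
    """Hypothetical helper: same summary table, computed with one fit per fold."""
    skf = StratifiedKFold(n_splits=5)
    scoring = {"Accuracy": "accuracy", "Recall": "recall",
               "Precision": "precision", "F1": "f1"}
    # cross_validate returns one array per metric under the key 'test_<name>'
    res = cross_validate(model, X, y, cv=skf, scoring=scoring)
    return pd.DataFrame(
        {name + " CV (Mean)": {m: res["test_" + m].mean() for m in scoring},
         name + " CV (Std)": {m: res["test_" + m].std() for m in scoring}}
    )

# Synthetic imbalanced data standing in for the real training set
X_demo, y_demo = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
summary = cross_val_summary(DecisionTreeClassifier(random_state=0), X_demo, y_demo, "DT demo")
print(summary.T)
```

With a randomized model such as a Decision Tree, this also means all four metrics describe the same fitted trees, instead of four independently fitted sets.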

In [18]:
df_LR_tr = Cross_Val(model_LR, X_train, y_train, "Logistic Regression Training").T
df_LR_tr

Unnamed: 0,Accuracy,Recall,Precision,F1
Logistic Regression TrainingCV (Mean),0.984429,0.0,0.0,0.0
Logistic Regression TrainingCV (Std),0.000166,0.0,0.0,0.0


In [19]:
df_LR_ts = Cross_Val(model_LR, X_test, y_test, "Logistic Regression Testing").T
df_LR_ts

Unnamed: 0,Accuracy,Recall,Precision,F1
Logistic Regression TestingCV (Mean),0.984487,0.0,0.0,0.0
Logistic Regression TestingCV (Std),0.000225,0.0,0.0,0.0


From the table above, the model only recognises the majority class (only Accuracy is non-zero): Recall, Precision, and F1 are all 0, which means it never correctly predicts a Claim. This model will therefore not be used in the next steps.
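
A classifier that always predicts the majority class reproduces this pattern of high Accuracy with zero Recall and F1. The sketch below uses scikit-learn's `DummyClassifier` on a synthetic target; the 1.5% positive rate is an assumption chosen to resemble the class ratio here, not the actual data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic target with about 1.5% positives (illustrative only)
y_toy = np.array([1] * 15 + [0] * 985)
X_toy = np.zeros((1000, 1))

# A baseline that always predicts the most frequent class (0 = No Claim)
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_hat)
rec = recall_score(y_toy, y_hat)
f1 = f1_score(y_toy, y_hat, zero_division=0)  # no predicted positives, so F1 is set to 0
print(acc, rec, f1)
```

An accuracy near 0.98 therefore says little on its own; any model worth keeping has to beat this baseline on Recall, Precision, and F1.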

## 2. K Nearest Neighbors

In [20]:
from sklearn.neighbors import KNeighborsClassifier

In [21]:
model_KNN = KNeighborsClassifier()

In [22]:
model_KNN.fit(X_train, y_train)

KNeighborsClassifier()

In [23]:
df_KNN_tr = Cross_Val(model_KNN, X_train, y_train, "KNN Training").T
df_KNN_tr

Unnamed: 0,Accuracy,Recall,Precision,F1
KNN TrainingCV (Mean),0.983973,0.0,0.0,0.0
KNN TrainingCV (Std),0.000171,0.0,0.0,0.0


In [24]:
df_KNN_ts = Cross_Val(model_KNN, X_test, y_test, "KNN Testing").T
df_KNN_ts

Unnamed: 0,Accuracy,Recall,Precision,F1
KNN TestingCV (Mean),0.984487,0.0,0.0,0.0
KNN TestingCV (Std),0.00023,0.0,0.0,0.0


From the table above, the model only recognises the majority class (only Accuracy is non-zero): Recall, Precision, and F1 are all 0, which means it never correctly predicts a Claim. This model will therefore not be used in the next steps.

## 3. Decision Tree Classifier

In [25]:
from sklearn.tree import DecisionTreeClassifier

In [26]:
model_DT = DecisionTreeClassifier()

In [27]:
model_DT.fit(X_train, y_train)

DecisionTreeClassifier()

In [28]:
df_DT_tr = Cross_Val(model_DT, X_train, y_train, "DT Training").T
df_DT_tr

Unnamed: 0,Accuracy,Recall,Precision,F1
DT TrainingCV (Mean),0.969086,0.057204,0.0466,0.046637
DT TrainingCV (Std),0.000998,0.029261,0.018983,0.024876


In [29]:
df_DT_ts = Cross_Val(model_DT, X_test, y_test, "DT Testing").T
df_DT_ts

Unnamed: 0,Accuracy,Recall,Precision,F1
DT TestingCV (Mean),0.96886,0.022222,0.013824,0.008333
DT TestingCV (Std),0.002968,0.018144,0.017846,0.016667


From the table above, the base Decision Tree model is able to recognise both target classes, as shown by the non-zero Recall, Precision, and F1 scores. These scores are very small compared to Accuracy, which is expected given how few class 1 samples there are relative to class 0. This is enough to carry the base Decision Tree Classifier forward to the Imbalance Data Handling stage.

## 4. Random Forest

In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
model_RF = RandomForestClassifier()

In [32]:
model_RF.fit(X_train, y_train)

RandomForestClassifier()

In [33]:
df_RF_tr = Cross_Val(model_RF, X_train, y_train, "RF Training").T
df_RF_tr

Unnamed: 0,Accuracy,Recall,Precision,F1
RF TrainingCV (Mean),0.981862,0.012929,0.069548,0.024393
RF TrainingCV (Std),0.000621,0.007382,0.050778,0.012211


In [34]:
df_RF_ts = Cross_Val(model_RF, X_test, y_test, "RF Testing").T
df_RF_ts

Unnamed: 0,Accuracy,Recall,Precision,F1
RF TestingCV (Mean),0.982548,0.007407,0.05,0.0
RF TestingCV (Std),0.001058,0.014815,0.1,0.0


From the table above, Random Forest is able to detect the minority class but is weaker than the Decision Tree model (the F1 score drops to 0 on testing), so Random Forest will not be used in the next steps.

## 5. Support Vector Classifier

In [35]:
from sklearn.svm import SVC

In [36]:
model_SVC = SVC()

In [37]:
model_SVC.fit(X_train, y_train)

SVC()

In [38]:
df_SVC_tr = Cross_Val(model_SVC, X_train, y_train, "SVC Training").T
df_SVC_tr

Unnamed: 0,Accuracy,Recall,Precision,F1
SVC TrainingCV (Mean),0.984543,0.0,0.0,0.0
SVC TrainingCV (Std),7e-05,0.0,0.0,0.0


In [39]:
df_SVC_ts = Cross_Val(model_SVC, X_test, y_test, "SVC Testing").T
df_SVC_ts

Unnamed: 0,Accuracy,Recall,Precision,F1
SVC TestingCV (Mean),0.984601,0.0,0.0,0.0
SVC TestingCV (Std),4e-06,0.0,0.0,0.0


From the table above, the model only recognises the majority class (only Accuracy is non-zero): Recall, Precision, and F1 are all 0, which means it never correctly predicts a Claim. This model will therefore not be used in the next steps. The complete table that follows compares all base models.

In [40]:
pd.concat([df_LR_tr, df_LR_ts, df_KNN_tr, df_KNN_ts, df_DT_tr, df_DT_ts, df_RF_tr, df_RF_ts, df_SVC_tr, df_SVC_ts])

Unnamed: 0,Accuracy,Recall,Precision,F1
Logistic Regression TrainingCV (Mean),0.984429,0.0,0.0,0.0
Logistic Regression TrainingCV (Std),0.000166,0.0,0.0,0.0
Logistic Regression TestingCV (Mean),0.984487,0.0,0.0,0.0
Logistic Regression TestingCV (Std),0.000225,0.0,0.0,0.0
KNN TrainingCV (Mean),0.983973,0.0,0.0,0.0
KNN TrainingCV (Std),0.000171,0.0,0.0,0.0
KNN TestingCV (Mean),0.984487,0.0,0.0,0.0
KNN TestingCV (Std),0.00023,0.0,0.0,0.0
DT TrainingCV (Mean),0.969086,0.057204,0.0466,0.046637
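DT TrainingCV (Std),0.000998,0.029261,0.018983,0.024876
DT TestingCV (Mean),0.96886,0.022222,0.013824,0.008333
DT TestingCV (Std),0.002968,0.018144,0.017846,0.016667
RF TrainingCV (Mean),0.981862,0.012929,0.069548,0.024393
RF TrainingCV (Std),0.000621,0.007382,0.050778,0.012211
RF TestingCV (Mean),0.982548,0.007407,0.05,0.0
RF TestingCV (Std),0.001058,0.014815,0.1,0.0
SVC TrainingCV (Mean),0.984543,0.0,0.0,0.0
SVC TrainingCV (Std),7e-05,0.0,0.0,0.0
SVC TestingCV (Mean),0.984601,0.0,0.0,0.0
SVC TestingCV (Std),4e-06,0.0,0.0,0.0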


From the Cross Validation results on the five base classification models, the Recall, Precision, and F1 scores indicate that the **Decision Tree** model looks promising for the next step. The Evaluation Matrix for the Decision Tree model follows.

In [41]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,recall_score,precision_score,f1_score


In [42]:
def Eva_Matrix(model,x_train,x_test,y_train,y_test,Nama):
    # Fit on the training split, then score both splits with the same fitted model
    Model=model.fit(x_train,y_train)
    y_pred_train=Model.predict(x_train)
    acc_train=accuracy_score(y_train,y_pred_train)
    rec_train=recall_score(y_train,y_pred_train)
    prec_train=precision_score(y_train,y_pred_train)
    f1_train=f1_score(y_train,y_pred_train)

    y_pred_test=Model.predict(x_test)
    acc_test=accuracy_score(y_test,y_pred_test)
    rec_test=recall_score(y_test,y_pred_test)
    prec_test=precision_score(y_test,y_pred_test)
    f1_test=f1_score(y_test,y_pred_test)

    # Summary table: one row per split, one column per metric
    data={
        Nama + ' Training':[acc_train,rec_train,prec_train,f1_train],
        Nama + ' Testing':[acc_test,rec_test,prec_test,f1_test]
    }
    df_scores=(pd.DataFrame(data,index=['Accuracy','Recall','Precision','F1']).T).round(4)

    # labels=[1,0] puts the positive class (Claim) in the first row and column
    cr_train=classification_report(y_train,y_pred_train)
    cm_train=confusion_matrix(y_train,y_pred_train,labels=[1,0])
    df_train=pd.DataFrame(data=cm_train,columns=['Pred 1','Pred 0'],index=['Akt 1','Akt 0'])

    cr_test=classification_report(y_test,y_pred_test)
    cm_test=confusion_matrix(y_test,y_pred_test,labels=[1,0])
    df_test=pd.DataFrame(data=cm_test,columns=['Pred 1','Pred 0'],index=['Akt 1','Akt 0'])

    return df_scores,cr_train,df_train,cr_test,df_test


In [43]:
# Decision Tree Evaluation Matrix function
df_DT_1,cr_DT_tr,cm_DT_tr,cr_DT_ts,cm_DT_ts=Eva_Matrix(model_DT,X_train,X_test,y_train,y_test,'Decision Tree Base')

In [44]:
df_DT_1.T

Unnamed: 0,Decision Tree Base Training,Decision Tree Base Testing
Accuracy,0.9973,0.9713
Recall,0.8247,0.0889
Precision,0.9978,0.0851
F1,0.903,0.087


In [45]:
print(cr_DT_tr, cr_DT_ts)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     34523
           1       1.00      0.82      0.90       542

    accuracy                           1.00     35065
   macro avg       1.00      0.91      0.95     35065
weighted avg       1.00      1.00      1.00     35065
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      8632
           1       0.09      0.09      0.09       135

    accuracy                           0.97      8767
   macro avg       0.54      0.54      0.54      8767
weighted avg       0.97      0.97      0.97      8767



In [46]:
# Confusion Matrix Training Data
cm_DT_tr

Unnamed: 0,Pred 1,Pred 0
Akt 1,447,95
Akt 0,1,34522


In [47]:
# Confusion Matrix Testing Data
cm_DT_ts

Unnamed: 0,Pred 1,Pred 0
Akt 1,12,123
Akt 0,129,8503


The confusion matrices show many more Type 1 and Type 2 errors on the testing data than on the training data. Recall for the positive class drops from 0.82 on training to 0.09 on testing, which indicates overfitting rather than underfitting. This is expected because Class 1 (Positive) has too few samples, so Imbalance Data Handling is needed.
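
The gap between the two splits can be checked by hand from the confusion matrices. The sketch below copies the four counts of each matrix and recomputes the positive-class metrics; the helper name `summarize` is introduced here only for illustration.

```python
# Counts copied from the two confusion matrices above (rows: actual, columns: predicted)
train = {"tp": 447, "fn": 95, "fp": 1, "tn": 34522}
test = {"tp": 12, "fn": 123, "fp": 129, "tn": 8503}

def summarize(cm):
    """Recall, precision and F1 for the positive class from raw counts."""
    recall = cm["tp"] / (cm["tp"] + cm["fn"])
    precision = cm["tp"] / (cm["tp"] + cm["fp"])
    f1 = 2 * cm["tp"] / (2 * cm["tp"] + cm["fp"] + cm["fn"])
    return {"recall": round(recall, 4), "precision": round(precision, 4), "f1": round(f1, 4)}

print(summarize(train))
print(summarize(test))
```

The recomputed values agree with the evaluation table: the tree finds 447 of 542 claims on the data it was fitted on, but only 12 of 135 on unseen data.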