***
> ## Data Introduction
This dataset contains information about customers of a telco company who churned and who did not. The columns are described below:<br>
1.customerID = unique customer ID.<br>
2.gender = customer gender (Male/Female).<br>
3.SeniorCitizen = whether the customer is a senior citizen (1) or not (0).<br>
4.Partner = whether the customer has a partner.<br>
5.Dependents = whether the customer has dependents.<br>
6.tenure = how long the customer has been subscribed (in months).<br>
7.MultipleLines = whether the customer uses multiple phone lines.<br>
8.InternetService = type of internet service the customer uses (DSL, Fiber optic, or none).<br>
9.OnlineSecurity = whether the customer has OnlineSecurity.<br>
10.OnlineBackup = whether the customer has OnlineBackup.<br>
11.DeviceProtection = whether the customer has DeviceProtection.<br>
12.TechSupport = whether the customer has TechSupport.<br>
13.StreamingTV = whether the customer has StreamingTV.<br>
14.StreamingMovies = whether the customer has StreamingMovies.<br>
15.Contract = type of contract (month-to-month, one year, or two year).<br>
16.PaperlessBilling = whether the customer uses PaperlessBilling.<br>
17.PaymentMethod = payment method.<br>
18.MonthlyCharges = amount charged per month.<br>
19.TotalCharges = total amount charged to date.<br>
20.Churn = whether the customer churned.<br>

## Data Preparation (Import libraries, data cleaning & data wrangling)
***

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
sns.set()

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/vertikalwil/Data-Analyst/main/datasets/telco.csv')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [4]:
# Some TotalCharges values contain only a space, so they cannot be converted to float.
for x in df.TotalCharges:
    try:
        float(x)
    except ValueError:
        print(f'Unable to convert to float with this value : {x}')

Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  
Unable to convert to float with this value :  


In [5]:
df[df.TotalCharges == ' ']

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,4472-LVYGI,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,,No
753,3115-CZMZD,Male,0,No,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,,No
936,5709-LVOEQ,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,...,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
1082,4367-NUYAO,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,,No
1340,1371-DWPAZ,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,,No
3331,7644-OMVMY,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.85,,No
3826,3213-VVOLG,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,,No
4380,2520-SGTTA,Female,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,,No
5218,2923-ARZLG,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,,No
6670,4075-WKNIU,Female,0,Yes,Yes,0,Yes,Yes,DSL,No,...,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,,No


>Based on the data above, whenever TotalCharges is blank, tenure is also 0. This indicates that these rows are customers who have just joined (no charges yet). There are only 11 such rows, so let's drop them.

In [6]:
df = df.drop(df.index[df.TotalCharges == ' ']).reset_index(drop=True)
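
> As an alternative way to locate and remove the blank entries, the sketch below uses `pd.to_numeric` with `errors="coerce"`. The `toy` frame and its values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "tenure": [1, 0, 34, 0],
    "TotalCharges": ["29.85", " ", "1889.5", " "],
})

# errors="coerce" turns strings that cannot be parsed (here a single space) into NaN.
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")
blank_rows = toy[converted.isna()]
print(blank_rows)

# Store the numeric column and drop the rows that could not be parsed.
clean = (
    toy.assign(TotalCharges=converted)
       .dropna(subset=["TotalCharges"])
       .reset_index(drop=True)
)
print(clean.dtypes)
```

> This approach converts the column to `float64` in the same step, so a separate `astype` call on TotalCharges would not be needed.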

In [7]:
#check for duplicate data, if True then there's no duplicate.
df.customerID.nunique() == len(df) 

True

In [8]:
#drop useless column
df.drop(columns=['customerID'], inplace=True)

In [9]:
df.tenure = df.tenure.astype('int64')
df.MonthlyCharges = df.MonthlyCharges.astype('float64')
df.TotalCharges = df.TotalCharges.astype('float64')

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7032 non-null   object 
 1   SeniorCitizen     7032 non-null   int64  
 2   Partner           7032 non-null   object 
 3   Dependents        7032 non-null   object 
 4   tenure            7032 non-null   int64  
 5   PhoneService      7032 non-null   object 
 6   MultipleLines     7032 non-null   object 
 7   InternetService   7032 non-null   object 
 8   OnlineSecurity    7032 non-null   object 
 9   OnlineBackup      7032 non-null   object 
 10  DeviceProtection  7032 non-null   object 
 11  TechSupport       7032 non-null   object 
 12  StreamingTV       7032 non-null   object 
 13  StreamingMovies   7032 non-null   object 
 14  Contract          7032 non-null   object 
 15  PaperlessBilling  7032 non-null   object 
 16  PaymentMethod     7032 non-null   object 


> The non-null count of every column matches the number of entries, which means there are no missing values. The data types are also appropriate.

In [11]:
def numericategoric(df):
    # Split column names by dtype using the public select_dtypes API.
    numeric_cols = list(df.select_dtypes(include='number').columns)
    categorical_cols = list(df.select_dtypes(exclude='number').columns)
    print("TotalNumericalData = " + str(len(numeric_cols)))
    print("TotalCategoricalData = " + str(len(categorical_cols)))
    print("Numerical = " + str(numeric_cols))
    print("Categorical = " + str(categorical_cols))

In [12]:
numericategoric(df)

TotalNumericalData = 4
TotalCategoricalData = 16
Numerical = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
Categorical = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']


>Most of the columns are categorical.

In [13]:
df.Churn.value_counts(normalize=True)

No     0.734215
Yes    0.265785
Name: Churn, dtype: float64

> The dataset is imbalanced: about 73% of customers did not churn and about 27% did.

# WORKFLOW & CONDITIONS

> Workflow:<br>
1.Data Pre-Processing 0 = Encode non-numeric columns only.<br>
2.Modelling 0 = Fit Random Forest, Decision Tree, and XGBoost models and choose the best threshold based on the data from Pre-Processing 0.<br>
3.Data Pre-Processing 1 = Encode non-numeric columns and oversample using SMOTE.<br>
4.Modelling 1 = Fit Random Forest, Decision Tree, and XGBoost models and choose the best threshold based on the data from Pre-Processing 1.<br>
5.Data Pre-Processing 2 = Encode non-numeric columns and undersample.<br>
6.Modelling 2 = Fit Random Forest, Decision Tree, and XGBoost models and choose the best threshold based on the data from Pre-Processing 2.<br>
7.Data Pre-Processing 3 = Encode non-numeric columns, create a feature, and scale.<br>
8.Modelling 3 = Fit Random Forest, Decision Tree, and XGBoost models and choose the best threshold based on the data from Pre-Processing 3.<br>
9.Conclusion = recap the results and choose the best model.

>Conditions:<br>
1.Models use default parameters, without hyperparameter tuning.<br>
2.Type I and Type II errors (FP & FN) are assumed to be equally costly, and the data are imbalanced, so evaluation focuses on the F1 score.
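
> To make the chosen metric concrete: the F1 score is the harmonic mean of precision and recall. The small helper below (`f1_from` is a name introduced here, not part of the notebook) reproduces the F1 value reported further down for the Random Forest at threshold 0.3 on the validation split, using the precision and recall from that table.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the Random Forest validation table (threshold 0.3).
f1 = f1_from(precision=0.521127, recall=0.706107)
print(round(f1, 6))
```

> Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by maximizing recall alone or precision alone.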

In [14]:
def threshold_score(model, xtest, ytest, thres):
    # Probability of the positive class (second column of predict_proba), computed once.
    proba_yes = model.predict_proba(xtest)[:, 1]
    recall = []
    precision = []
    f1_score = []
    rocauc_score = []
    accuracy_score = []
    for k in thres:
        # Predict churn (1) when the probability exceeds threshold k.
        prediction = (proba_yes > k).astype(int)
        recall.append(metrics.recall_score(ytest, prediction))
        precision.append(metrics.precision_score(ytest, prediction))
        f1_score.append(metrics.f1_score(ytest, prediction))
        # Note: ROC AUC is computed on the thresholded labels, not on the probabilities.
        rocauc_score.append(metrics.roc_auc_score(ytest, prediction))
        accuracy_score.append(metrics.accuracy_score(ytest, prediction))
    dfresult = pd.DataFrame([thres,recall,precision,f1_score,rocauc_score,accuracy_score]).transpose()
    # The 'thresold' spelling is kept because later cells index the result by this name.
    dfresult.columns = ['thresold','recall','precision','f1-score','RocAuc-score','Accuracy-score']
    return dfresult

> The function above reports several evaluation metrics for a model at each threshold in a single call.
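
> One detail worth knowing when reading the RocAuc-score column: the function passes thresholded 0/1 predictions to `roc_auc_score`, and in that case the result equals balanced accuracy, (recall + specificity) / 2, so it changes with the threshold. The sketch below checks this on made-up labels and probabilities (`y_true` and `proba` are illustrative, not taken from the dataset).

```python
import numpy as np
from sklearn import metrics

# Made-up ground truth and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.05, 0.3, 0.45, 0.7, 0.9])

# Hard labels at threshold 0.3, mirroring the "> k" rule of threshold_score.
y_pred = (proba > 0.3).astype(int)

auc_on_labels = metrics.roc_auc_score(y_true, y_pred)
balanced_acc = metrics.balanced_accuracy_score(y_true, y_pred)
auc_on_proba = metrics.roc_auc_score(y_true, proba)

print(auc_on_labels, balanced_acc)  # identical values
print(auc_on_proba)                 # threshold-independent ranking quality
```

> Passing the probabilities instead would give a single, threshold-independent AUC per model.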

## Data Pre-Processing 0
***
> I won't do anything special here, only encoding.

In [15]:
df1 = df.copy()

In [16]:
convert = {
    'Yes':1,
    'No':0,
    'No phone service':0,
    'Male':1,
    'Female':0,
    'Month-to-month':1,
    'One year':2,
    'Two year':3,
    'No internet service':0
}
kolom = ['gender','Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','Contract','PaperlessBilling','Churn']

In [17]:
for x in kolom:
    df1[x] = df[x].map(convert)

In [18]:
df1 = pd.get_dummies(df1, columns = ['InternetService', 'PaymentMethod'])

> Most categorical columns are converted to binary, while Contract is ordinally encoded. InternetService and PaymentMethod are one-hot encoded.
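
> The three encodings can be seen side by side on a tiny example. The `toy` frame below is made up for illustration; the mapping values follow the `convert` dictionary used in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "Two year", "One year"],
    "InternetService": ["DSL", "Fiber optic", "No"],
})

# Binary and ordinal encoding share one lookup table.
convert = {"Yes": 1, "No": 0, "Month-to-month": 1, "One year": 2, "Two year": 3}

encoded = toy.copy()
for col in ["Partner", "Contract"]:
    encoded[col] = toy[col].map(convert)

# One-hot encoding creates one indicator column per category.
encoded = pd.get_dummies(encoded, columns=["InternetService"])
print(encoded)
```

> Note that `map` returns NaN for any value missing from the dictionary, so it is worth checking for NaN after encoding when applying this to new data.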

In [19]:
x = df1.drop(columns=['Churn'])
y = df1['Churn']

In [20]:
x_pretrain, x_test, y_pretrain, y_test = train_test_split(x, y, test_size = 0.25, random_state=123)
x_train, x_validation, y_train, y_validation = train_test_split(x_pretrain, y_pretrain, test_size = 0.20, random_state=123)
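
> The two consecutive splits produce roughly a 60/15/25 train/validation/test partition. The sketch below reproduces only the row counts, using a placeholder array with one entry per cleaned row (7032) instead of the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(7032)  # placeholder: one entry per row of the cleaned dataset

# Same test_size values and random_state as the notebook cell.
pretrain, test = train_test_split(rows, test_size=0.25, random_state=123)
train, validation = train_test_split(pretrain, test_size=0.20, random_state=123)

sizes = {"train": len(train), "validation": len(validation), "test": len(test)}
shares = {name: round(n / len(rows), 3) for name, n in sizes.items()}
print(sizes)
print(shares)
```

> The training size of 4219 rows matches the class counts shown for `y_train` later in the notebook (3136 + 1083).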

## Random Forest Modelling 0
***

In [21]:
from sklearn.ensemble import RandomForestClassifier
clfrf = RandomForestClassifier()
clfrf.fit(x_train, y_train)

RandomForestClassifier()

In [22]:
threshold_score(clfrf, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.942748,0.378254,0.539891,0.715384,0.600948
1,0.2,0.820611,0.449791,0.581081,0.744479,0.706161
2,0.3,0.706107,0.521127,0.599676,0.745866,0.765877
3,0.4,0.576336,0.565543,0.570888,0.715028,0.784834
4,0.5,0.461832,0.636842,0.535398,0.68741,0.800948
5,0.6,0.354962,0.72093,0.475703,0.654782,0.805687
6,0.7,0.248092,0.764706,0.37464,0.611435,0.794313
7,0.8,0.141221,0.822222,0.241042,0.565567,0.779147
8,0.9,0.045802,1.0,0.087591,0.522901,0.763033


In [23]:
threshold_score(clfrf, x_test, y_test, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.692748,0.57619,0.629116,0.738189,0.756542


In [24]:
threshold_score(clfrf, x_train, y_train, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,1.0,0.956714,0.977878,0.992188,0.988386


> After tuning, threshold 0.3 gives the best F1 score on the validation data. The F1 score is about 0.63 on the test data and 0.98 on the training data. In its current state, this model is overfitting.
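
> Rather than reading the best threshold off the table by eye, it can be selected programmatically. The sketch below copies the threshold and F1 columns from the Random Forest validation output above into a small frame (`validation_scores` is a name introduced here) and picks the row with the highest F1 score.

```python
import pandas as pd

# Values copied from the validation table; the 'thresold' spelling follows threshold_score.
validation_scores = pd.DataFrame({
    "thresold": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "f1-score": [0.539891, 0.581081, 0.599676, 0.570888, 0.535398,
                 0.475703, 0.374640, 0.241042, 0.087591],
})

best = validation_scores.loc[validation_scores["f1-score"].idxmax()]
print(best["thresold"], best["f1-score"])
```

> The same two lines would work directly on the frame returned by `threshold_score`.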

In [25]:
status = 'Encoding Only'
dfhasil = pd.DataFrame(['RandomForest',float(threshold_score(clfrf, x_test, y_test, [0.3])['f1-score']), float(threshold_score(clfrf, x_test, y_test, [0.3])['thresold']),status]).transpose()

## Decision Tree Modelling 0
***

In [26]:
from sklearn.tree import DecisionTreeClassifier
clfdt = DecisionTreeClassifier()
clfdt.fit(x_train, y_train)

DecisionTreeClassifier()

In [27]:
threshold_score(clfdt, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.442748,0.454902,0.448743,0.633732,0.729858
1,0.2,0.442748,0.454902,0.448743,0.633732,0.729858
2,0.3,0.442748,0.454902,0.448743,0.633732,0.729858
3,0.4,0.442748,0.454902,0.448743,0.633732,0.729858
4,0.5,0.442748,0.456693,0.449612,0.634363,0.730806
5,0.6,0.442748,0.456693,0.449612,0.634363,0.730806
6,0.7,0.442748,0.456693,0.449612,0.634363,0.730806
7,0.8,0.442748,0.456693,0.449612,0.634363,0.730806
8,0.9,0.442748,0.456693,0.449612,0.634363,0.730806


In [28]:
threshold_score(clfdt, x_test, y_test, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.46374,0.535242,0.496933,0.646376,0.720137


In [29]:
threshold_score(clfdt, x_train, y_train, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.99446,1.0,0.997222,0.99723,0.998578


>For the decision tree, threshold tuning yields nearly identical metrics at every threshold. This means the model predicts probabilities of almost exactly 100% or 0%. In addition, its F1 score is very high on the training data (about 1.00) and poor on the test data (about 0.50). This model is severely overfitting.
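
> The flat metrics can be reproduced on synthetic data. A decision tree grown with default settings keeps splitting until its leaves are pure, so `predict_proba` returns only 0 or 1 and every threshold between them yields the same predictions. The data below come from `make_classification` and are unrelated to the telco dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with continuous features, so no two rows are identical.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
proba_yes = tree.predict_proba(X_test)[:, 1]

distinct = np.unique(proba_yes)
print(distinct)  # only the extreme probabilities appear

# Thresholds 0.1 and 0.9 therefore produce identical predictions.
pred_low = (proba_yes > 0.1).astype(int)
pred_high = (proba_yes > 0.9).astype(int)
print(np.array_equal(pred_low, pred_high))
```

> In the notebook's table the metrics shift slightly between 0.4 and 0.5, which is consistent with a few leaves holding a 0.5 probability (identical feature rows with different labels). Limiting `max_depth` or `min_samples_leaf` would produce intermediate probabilities, but that falls outside the default-parameter condition used here.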

In [30]:
dfhasil = pd.concat([dfhasil,pd.DataFrame(['DecisionTree',float(threshold_score(clfdt, x_test, y_test, [0.5])['f1-score']), float(threshold_score(clfdt, x_test, y_test, [0.5])['thresold']),status]).transpose()])

## XGBoost Modelling 0
***

In [31]:
from xgboost import XGBClassifier 
from xgboost import plot_importance
clfxg = XGBClassifier()
clfxg.fit(x_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

In [32]:
threshold_score(clfxg, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.866412,0.414234,0.560494,0.73081,0.662559
1,0.2,0.748092,0.482759,0.586826,0.741637,0.738389
2,0.3,0.687023,0.545455,0.608108,0.748934,0.780095
3,0.4,0.599237,0.585821,0.592453,0.729631,0.795261
4,0.5,0.496183,0.62201,0.552017,0.698281,0.8
5,0.6,0.385496,0.647436,0.483254,0.65807,0.795261
6,0.7,0.270992,0.657407,0.383784,0.612167,0.783886
7,0.8,0.21374,0.788732,0.336336,0.597412,0.790521
8,0.9,0.118321,0.885714,0.208754,0.556638,0.777251


In [33]:
threshold_score(clfxg, x_test, y_test, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.685115,0.620035,0.650952,0.753416,0.781001


In [34]:
threshold_score(clfxg, x_train, y_train, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.971376,0.788606,0.870501,0.940726,0.925812


> After tuning, threshold 0.3 gives the best F1 score on the validation data. The F1 score is about 0.65 on the test data and 0.87 on the training data. In its current state, this model is overfitting.
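
> The degree of overfitting of the three encoding-only models can be compared through the gap between training and test F1. The numbers below are copied from the outputs above (each model at its chosen threshold); the dictionary names are introduced here for illustration.

```python
# F1 scores copied from the train and test outputs of the encoding-only models.
baseline_f1 = {
    "RandomForest": {"train": 0.977878, "test": 0.629116},
    "DecisionTree": {"train": 0.997222, "test": 0.496933},
    "XGBoost": {"train": 0.870501, "test": 0.650952},
}

# A larger gap means the model fits the training data far better than unseen data.
gaps = {name: round(s["train"] - s["test"], 3) for name, s in baseline_f1.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train-test F1 gap = {gap}")
```

> On these numbers XGBoost has the smallest gap and the decision tree the largest, which agrees with the commentary on each model.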

In [35]:
dfhasil = pd.concat([dfhasil,pd.DataFrame(['XGBoost',float(threshold_score(clfxg, x_test, y_test, [0.3])['f1-score']), float(threshold_score(clfxg, x_test, y_test, [0.3])['thresold']),status]).transpose()])

## Data Pre-Processing 1
***
> I will apply encoding plus oversampling with SMOTE.

In [36]:
df1 = df.copy()

In [37]:
convert = {
    'Yes':1,
    'No':0,
    'No phone service':0,
    'Male':1,
    'Female':0,
    'Month-to-month':1,
    'One year':2,
    'Two year':3,
    'No internet service':0
}
kolom = ['gender','Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','Contract','PaperlessBilling','Churn']

In [38]:
for x in kolom:
    df1[x] = df[x].map(convert)

In [39]:
df1 = pd.get_dummies(df1, columns = ['InternetService', 'PaymentMethod'])

> Most categorical columns are converted to binary, while Contract is ordinally encoded. InternetService and PaymentMethod are one-hot encoded.

In [40]:
x = df1.drop(columns=['Churn'])
y = df1['Churn']

In [41]:
x_pretrain, x_test, y_pretrain, y_test = train_test_split(x, y, test_size = 0.25, random_state=123)
x_train, x_validation, y_train, y_validation = train_test_split(x_pretrain, y_pretrain, test_size = 0.20, random_state=123)

In [42]:
from imblearn import under_sampling, over_sampling
x_train, y_train = over_sampling.SMOTE().fit_resample(x_train, y_train)
#x_train, y_train = under_sampling.RandomUnderSampler().fit_resample(x_train, y_train)

In [43]:
y_train.value_counts()

0    3136
1    3136
Name: Churn, dtype: int64
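
> For intuition about what SMOTE does, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. This is a simplified NumPy illustration, not imbalanced-learn's implementation; `smote_like_samples`, its parameters, and the random data are all hypothetical.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new points on segments between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances; the diagonal is excluded from the neighbour search.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample, one of its k neighbours, and a random position between them.
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))
    return minority[base] + lam * (minority[picked] - minority[base])


rng = np.random.default_rng(1)
minority = rng.normal(size=(20, 3))  # made-up minority-class features
synthetic = smote_like_samples(minority, n_new=30)
print(synthetic.shape)
```

> Because every synthetic point lies between two existing minority points, oversampling adds no information from outside the training data, which may explain why test scores change little.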

## Random Forest Modelling 1
***

In [44]:
from sklearn.ensemble import RandomForestClassifier
clfrf = RandomForestClassifier()
clfrf.fit(x_train, y_train)

RandomForestClassifier()

In [45]:
threshold_score(clfrf, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.969466,0.359264,0.524252,0.699109,0.563033
1,0.2,0.866412,0.419593,0.56538,0.735224,0.669194
2,0.3,0.755725,0.47482,0.583211,0.739779,0.731754
3,0.4,0.645038,0.52648,0.57976,0.72668,0.767773
4,0.5,0.530534,0.565041,0.547244,0.697802,0.781991
5,0.6,0.423664,0.61326,0.501129,0.667696,0.790521
6,0.7,0.343511,0.661765,0.452261,0.642752,0.793365
7,0.8,0.217557,0.791667,0.341317,0.599321,0.791469
8,0.9,0.110687,0.852941,0.195946,0.552191,0.774408


In [46]:
threshold_score(clfrf, x_test, y_test, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.748092,0.531165,0.621236,0.733851,0.7281


In [47]:
threshold_score(clfrf, x_train, y_train, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,1.0,0.969697,0.984615,0.984375,0.984375


> The best threshold is 0.3 (based on the validation F1 score). The F1 score is about 0.62 on the test data and 0.98 on the training data. The model is still overfitting, and the test result barely changes compared with the run without SMOTE.

In [48]:
status = 'Encoding and SMOTE'
dfhasil = pd.concat([dfhasil,pd.DataFrame(['RandomForest',float(threshold_score(clfrf, x_test, y_test, [0.3])['f1-score']), float(threshold_score(clfrf, x_test, y_test, [0.3])['thresold']),status]).transpose()])

## Decision Tree Modelling 1
***

In [49]:
from sklearn.tree import DecisionTreeClassifier
clfdt = DecisionTreeClassifier()
clfdt.fit(x_train, y_train)

DecisionTreeClassifier()

In [50]:
threshold_score(clfdt, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.515267,0.509434,0.512334,0.675666,0.756398
1,0.2,0.515267,0.509434,0.512334,0.675666,0.756398
2,0.3,0.515267,0.509434,0.512334,0.675666,0.756398
3,0.4,0.515267,0.509434,0.512334,0.675666,0.756398
4,0.5,0.515267,0.511364,0.513308,0.676297,0.757346
5,0.6,0.515267,0.511364,0.513308,0.676297,0.757346
6,0.7,0.515267,0.511364,0.513308,0.676297,0.757346
7,0.8,0.515267,0.511364,0.513308,0.676297,0.757346
8,0.9,0.515267,0.511364,0.513308,0.676297,0.757346


In [51]:
threshold_score(clfdt, x_test, y_test, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.498092,0.531568,0.514286,0.655853,0.719568


In [52]:
threshold_score(clfdt, x_train, y_train, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.998087,1.0,0.999042,0.999043,0.999043


> For the decision tree, the model is still heavily overfitting, and the results are roughly the same as before SMOTE.

In [53]:
dfhasil = pd.concat([dfhasil,pd.DataFrame(['DecisionTree',float(threshold_score(clfdt, x_test, y_test, [0.5])['f1-score']), float(threshold_score(clfdt, x_test, y_test, [0.5])['thresold']),status]).transpose()])

## XGBoost Modelling 1
***

In [54]:
from xgboost import XGBClassifier 
clfxg = XGBClassifier()
clfxg.fit(x_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

In [55]:
threshold_score(clfxg, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.843511,0.412313,0.553885,0.723143,0.662559
1,0.2,0.751908,0.479319,0.585438,0.741024,0.735545
2,0.3,0.679389,0.521994,0.590381,0.73692,0.765877
3,0.4,0.618321,0.554795,0.584838,0.727193,0.781991
4,0.5,0.522901,0.561475,0.541502,0.693985,0.780095
5,0.6,0.454198,0.60101,0.517391,0.677288,0.789573
6,0.7,0.366412,0.680851,0.476427,0.654833,0.8
7,0.8,0.312977,0.759259,0.443243,0.640095,0.804739
8,0.9,0.194656,0.87931,0.31875,0.592915,0.793365


In [56]:
threshold_score(clfxg, x_test, y_test, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.715649,0.588697,0.645995,0.751666,0.766212


In [57]:
threshold_score(clfxg, x_train, y_train, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.996811,0.910574,0.951743,0.949458,0.949458


> For XGBoost, the best threshold by validation F1 score is 0.3. The F1 score is about 0.65 on the test data and 0.95 on the training data.

In [58]:
dfhasil = pd.concat([dfhasil,pd.DataFrame(['XGBoost',float(threshold_score(clfxg, x_test, y_test, [0.3])['f1-score']), float(threshold_score(clfxg, x_test, y_test, [0.3])['thresold']),status]).transpose()])

## Data Pre-Processing 2
***
> I will apply encoding plus undersampling.

In [59]:
df1 = df.copy()

In [60]:
convert = {
    'Yes':1,
    'No':0,
    'No phone service':0,
    'Male':1,
    'Female':0,
    'Month-to-month':1,
    'One year':2,
    'Two year':3,
    'No internet service':0
}
kolom = ['gender','Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','Contract','PaperlessBilling','Churn']

In [61]:
for x in kolom:
    df1[x] = df[x].map(convert)

In [62]:
df1 = pd.get_dummies(df1, columns = ['InternetService', 'PaymentMethod'])

> Most categorical columns are converted to binary, while Contract is ordinally encoded. InternetService and PaymentMethod are one-hot encoded.

In [63]:
x = df1.drop(columns=['Churn'])
y = df1['Churn']

In [64]:
x_pretrain, x_test, y_pretrain, y_test = train_test_split(x, y, test_size = 0.25, random_state=123)
x_train, x_validation, y_train, y_validation = train_test_split(x_pretrain, y_pretrain, test_size = 0.20, random_state=123)

In [65]:
from imblearn import under_sampling, over_sampling
x_train, y_train = under_sampling.RandomUnderSampler().fit_resample(x_train, y_train)

In [66]:
y_train.value_counts()

0    1083
1    1083
Name: Churn, dtype: int64
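
> Random undersampling simply discards majority-class rows until both classes are the same size. The sketch below shows the idea with plain NumPy on a label vector that has the same class counts as the training split before resampling (3136 and 1083); it is an illustration, not the `RandomUnderSampler` implementation.

```python
import numpy as np

rng = np.random.default_rng(123)

# Label vector with the class counts of the training split before resampling.
y_toy = np.array([0] * 3136 + [1] * 1083)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep every minority row and an equally sized random subset of the majority rows.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([kept_majority, minority_idx]))

y_balanced = y_toy[keep]
print(np.bincount(y_balanced))
```

> The same `keep` index would be applied to the feature matrix. The cost is that about two thirds of the majority-class rows are thrown away.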

## Random Forest Modelling 2
***

In [67]:
from sklearn.ensemble import RandomForestClassifier
clfrf = RandomForestClassifier()
clfrf.fit(x_train, y_train)

RandomForestClassifier()

In [68]:
threshold_score(clfrf, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.984733,0.314251,0.476454,0.637385,0.462559
1,0.2,0.969466,0.357746,0.522634,0.697217,0.56019
2,0.3,0.912214,0.398333,0.554524,0.72849,0.636019
3,0.4,0.854962,0.452525,0.59181,0.756611,0.707109
4,0.5,0.736641,0.481297,0.582202,0.737173,0.737441
5,0.6,0.637405,0.533546,0.58087,0.726647,0.771564
6,0.7,0.522901,0.600877,0.559184,0.704073,0.795261
7,0.8,0.385496,0.677852,0.491484,0.662483,0.801896
8,0.9,0.225191,0.766234,0.348083,0.601246,0.790521


In [69]:
threshold_score(clfrf, x_test, y_test, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.740458,0.553495,0.633469,0.743406,0.744596


In [70]:
threshold_score(clfrf, x_train, y_train, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.999077,0.999077,0.999077,0.999077,0.999077


> With undersampling, the threshold used is now 0.5 (on the validation data, 0.4 scores slightly higher: 0.59 versus 0.58), and the F1 score on the test data is still roughly the same at about 0.63.

In [71]:
status = 'Encoding and Undersampling'
dfhasil = pd.concat([dfhasil,pd.DataFrame(['RandomForest',float(threshold_score(clfrf, x_test, y_test, [0.5])['f1-score']), float(threshold_score(clfrf, x_test, y_test, [0.5])['thresold']),status]).transpose()])

## Decision Tree Modelling 2
***

In [72]:
from sklearn.tree import DecisionTreeClassifier
clfdt = DecisionTreeClassifier()
clfdt.fit(x_train, y_train)

DecisionTreeClassifier()

In [73]:
threshold_score(clfdt, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.679389,0.426859,0.5243,0.689001,0.693839
1,0.2,0.679389,0.426859,0.5243,0.689001,0.693839
2,0.3,0.679389,0.426859,0.5243,0.689001,0.693839
3,0.4,0.679389,0.426859,0.5243,0.689001,0.693839
4,0.5,0.679389,0.427885,0.525074,0.689632,0.694787
5,0.6,0.679389,0.427885,0.525074,0.689632,0.694787
6,0.7,0.679389,0.427885,0.525074,0.689632,0.694787
7,0.8,0.679389,0.427885,0.525074,0.689632,0.694787
8,0.9,0.679389,0.427885,0.525074,0.689632,0.694787


In [74]:
threshold_score(clfdt, x_test, y_test, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.69084,0.49118,0.574148,0.693475,0.694539


In [75]:
threshold_score(clfdt, x_train, y_train, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.999077,1.0,0.999538,0.999538,0.999538


> With undersampling, the decision tree's F1 score on the test data is higher than with the baseline and with SMOTE, at about 0.57.

In [76]:
dfhasil = pd.concat([dfhasil,pd.DataFrame(['DecisionTree',float(threshold_score(clfdt, x_test, y_test, [0.5])['f1-score']), float(threshold_score(clfdt, x_test, y_test, [0.5])['thresold']),status]).transpose()])

## XGBoost Modelling 2
***

In [77]:
from xgboost import XGBClassifier 
clfxg = XGBClassifier()
clfxg.fit(x_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

In [78]:
threshold_score(clfxg, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.942748,0.355396,0.516196,0.688902,0.561137
1,0.2,0.919847,0.38746,0.545249,0.719697,0.618957
2,0.3,0.874046,0.416364,0.564039,0.734627,0.664455
3,0.4,0.828244,0.451143,0.584118,0.747666,0.707109
4,0.5,0.740458,0.460808,0.568082,0.727102,0.720379
5,0.6,0.679389,0.493075,0.571429,0.72431,0.746919
6,0.7,0.568702,0.519164,0.542805,0.69734,0.762085
7,0.8,0.48855,0.592593,0.535565,0.688789,0.789573
8,0.9,0.316794,0.68595,0.43342,0.634437,0.794313


In [79]:
threshold_score(clfxg, x_test, y_test, [0.6])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.6,0.692748,0.57346,0.627485,0.736974,0.754835


In [80]:
threshold_score(clfxg, x_train, y_train, [0.6])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.6,0.967682,0.984038,0.975791,0.975993,0.975993


> The threshold used here is 0.6, although on the validation data 0.4 gives the highest F1 score (0.58 versus 0.57). The F1 score is about 0.63 on the test data and 0.98 on the training data.

In [81]:
dfhasil = pd.concat([dfhasil,pd.DataFrame(['XGBoost',float(threshold_score(clfxg, x_test, y_test, [0.6])['f1-score']), float(threshold_score(clfxg, x_test, y_test, [0.6])['thresold']),status]).transpose()])

## Data Pre-Processing 3
***
> I will apply encoding, feature creation, and scaling.

In [82]:
df1 = df.copy()

In [83]:
convert = {
    'Yes':1,
    'No':0,
    'No phone service':0,
    'Male':1,
    'Female':0,
    'Month-to-month':1,
    'One year':2,
    'Two year':3,
    'No internet service':0
}
kolom = ['gender','Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','Contract','PaperlessBilling','Churn']

In [84]:
df1['TotalBenefits'] = df.loc[:,'OnlineSecurity':'StreamingMovies'].apply(lambda x: list(x).count('Yes'),axis=1)
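
> The new TotalBenefits feature counts how many of the six add-on services a customer has. A vectorized way to compute the same count is sketched below on a made-up three-row frame; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows covering the three value types that occur in these columns.
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service"],
    "OnlineBackup": ["Yes", "No", "No internet service"],
    "DeviceProtection": ["No", "Yes", "No internet service"],
    "TechSupport": ["No", "No", "No internet service"],
    "StreamingTV": ["Yes", "No", "No internet service"],
    "StreamingMovies": ["No", "No", "No internet service"],
})

# Compare every cell with "Yes" and count the matches per row.
total_benefits = toy.loc[:, "OnlineSecurity":"StreamingMovies"].eq("Yes").sum(axis=1)
print(total_benefits.tolist())
```

> The label slice relies on the six columns being adjacent and in this order, as they are in the original data.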

In [85]:
for x in kolom:
    df1[x] = df[x].map(convert)

In [86]:
df1 = pd.get_dummies(df1, columns = ['InternetService', 'PaymentMethod'])

> Most categorical columns are converted to binary values, while Contract is ordinally encoded. InternetService and PaymentMethod are one-hot encoded.
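
> The three encodings and the `TotalBenefits` feature can be seen side by side on a tiny example. This is a sketch on made-up rows, with only two add-on columns instead of the full `OnlineSecurity` to `StreamingMovies` range; `toy`, `binary_map` and `contract_map` are illustrative names.

```python
import pandas as pd

# Toy frame (made-up rows) with the same kinds of columns as the notebook
toy = pd.DataFrame({
    'Partner': ['Yes', 'No', 'Yes'],
    'Contract': ['Month-to-month', 'Two year', 'One year'],
    'InternetService': ['DSL', 'Fiber optic', 'No'],
    'OnlineSecurity': ['Yes', 'No', 'No internet service'],
    'OnlineBackup': ['Yes', 'Yes', 'No internet service'],
})

# Feature creation: count the 'Yes' answers across the add-on columns
toy['TotalBenefits'] = (toy[['OnlineSecurity', 'OnlineBackup']] == 'Yes').sum(axis=1)

# Binary encoding for yes/no columns, ordinal encoding for Contract
binary_map = {'Yes': 1, 'No': 0, 'No internet service': 0}
contract_map = {'Month-to-month': 1, 'One year': 2, 'Two year': 3}
for col in ['Partner', 'OnlineSecurity', 'OnlineBackup']:
    toy[col] = toy[col].map(binary_map)
toy['Contract'] = toy['Contract'].map(contract_map)

# One-hot encoding for a nominal column with no natural order
toy = pd.get_dummies(toy, columns=['InternetService'])
print(toy)
```

> The feature count is taken before the mapping, while the columns still hold the original 'Yes'/'No' strings, which mirrors the order of the cells above.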

In [87]:
# Scaling: standardize the numeric columns to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges', 'TotalBenefits']
df1[num_cols] = StandardScaler().fit_transform(df1[num_cols])

In [88]:
x = df1.drop(columns=['Churn'])
y = df1['Churn']

In [89]:
x_pretrain, x_test, y_pretrain, y_test = train_test_split(x, y, test_size = 0.25, random_state=123)
x_train, x_validation, y_train, y_validation = train_test_split(x_pretrain, y_pretrain, test_size = 0.20, random_state=123)
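
> In the cells above, the scaler is fitted on the full dataset before the split, so validation and test rows contribute to the scaling statistics. Tree-based models are largely insensitive to feature scaling, so the practical effect here is probably small, but a common alternative is to fit the scaler on the training split only. The sketch below shows that pattern on made-up data; the names `toy`, `x_tr` and `x_te` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for a numeric column such as tenure
rng = np.random.default_rng(123)
toy = pd.DataFrame({'tenure': rng.integers(1, 73, size=200).astype(float),
                    'Churn': rng.integers(0, 2, size=200)})

x_toy = toy.drop(columns=['Churn'])
y_toy = toy['Churn']
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.25, random_state=123)

# Fit on the training split only, then apply the same transform to the test split
scaler = StandardScaler()
x_tr_scaled = x_tr.copy()
x_te_scaled = x_te.copy()
x_tr_scaled[['tenure']] = scaler.fit_transform(x_tr[['tenure']])
x_te_scaled[['tenure']] = scaler.transform(x_te[['tenure']])

print(x_tr_scaled['tenure'].mean(), x_tr_scaled['tenure'].std(ddof=0))
```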

In [90]:
y_train.value_counts()

0    3136
1    1083
Name: Churn, dtype: int64

## Random Forest Modelling 3
***

In [91]:
from sklearn.ensemble import RandomForestClassifier
clfrf = RandomForestClassifier()
clfrf.fit(x_train, y_train)

RandomForestClassifier()

In [92]:
threshold_score(clfrf, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.946565,0.376328,0.538545,0.71414,0.597156
1,0.2,0.828244,0.456842,0.588874,0.751449,0.712796
2,0.3,0.709924,0.510989,0.594249,0.74273,0.759242
3,0.4,0.561069,0.550562,0.555766,0.704872,0.777251
4,0.5,0.450382,0.634409,0.526786,0.682316,0.799052
5,0.6,0.351145,0.736,0.475452,0.654765,0.807583
6,0.7,0.255725,0.761364,0.382857,0.614622,0.795261
7,0.8,0.152672,0.851064,0.2589,0.571922,0.782938
8,0.9,0.068702,1.0,0.128571,0.534351,0.76872


In [93]:
threshold_score(clfrf, x_test, y_test, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.696565,0.576619,0.630942,0.739692,0.75711


In [94]:
threshold_score(clfrf, x_train, y_train, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,1.0,0.952507,0.975676,0.99139,0.987201


> After scaling and adding one new feature, the F1 score on the test data actually drops slightly (0.631, compared with 0.633 with undersampling).

In [95]:
status = 'Encoding,FeatureCreation,Scaling'
res = threshold_score(clfrf, x_test, y_test, [0.3])  # evaluate once and reuse the result
dfhasil = pd.concat([dfhasil, pd.DataFrame([['RandomForest', res['f1-score'].iloc[0], res['thresold'].iloc[0], status]])])

## Decision Tree Modelling 3
***

In [96]:
from sklearn.tree import DecisionTreeClassifier
clfdt = DecisionTreeClassifier()
clfdt.fit(x_train, y_train)

DecisionTreeClassifier()

In [97]:
threshold_score(clfdt, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.458015,0.466926,0.462428,0.642627,0.735545
1,0.2,0.458015,0.466926,0.462428,0.642627,0.735545
2,0.3,0.458015,0.466926,0.462428,0.642627,0.735545
3,0.4,0.458015,0.466926,0.462428,0.642627,0.735545
4,0.5,0.458015,0.466926,0.462428,0.642627,0.735545
5,0.6,0.458015,0.466926,0.462428,0.642627,0.735545
6,0.7,0.458015,0.466926,0.462428,0.642627,0.735545
7,0.8,0.458015,0.466926,0.462428,0.642627,0.735545
8,0.9,0.458015,0.466926,0.462428,0.642627,0.735545


In [98]:
threshold_score(clfdt, x_test, y_test, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.492366,0.538622,0.514457,0.656637,0.722981


In [99]:
threshold_score(clfdt, x_train, y_train, [0.5])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.5,0.99446,1.0,0.997222,0.99723,0.998578


> The decision tree shows no improvement either (test F1 score of 0.514).
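
> The identical rows in the validation table above have a likely explanation: an unpruned decision tree grows until its leaves are (nearly) pure, so `predict_proba` returns almost only 0 or 1, and every threshold between 0.1 and 0.9 yields the same predictions. The sketch below illustrates this on synthetic data from `make_classification`, not on the telco data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=8, random_state=123)
tree = DecisionTreeClassifier(random_state=123).fit(X[:400], y[:400])

# Positive-class probabilities on held-out rows
proba = tree.predict_proba(X[400:])[:, 1]
print(np.unique(proba))

# Every threshold strictly between 0 and 1 gives the same predictions
preds = [(proba >= t).astype(int) for t in np.arange(0.1, 1, 0.1)]
same_for_all = all(np.array_equal(preds[0], p) for p in preds)
print(same_for_all)
```

> Limiting the tree, for example with `max_depth`, produces impure leaves and therefore graded probabilities, which is what makes threshold tuning meaningful.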

In [100]:
res = threshold_score(clfdt, x_test, y_test, [0.5])  # evaluate once and reuse the result
dfhasil = pd.concat([dfhasil, pd.DataFrame([['DecisionTree', res['f1-score'].iloc[0], res['thresold'].iloc[0], status]])])

## XGBoost Modelling 3
***

In [101]:
from xgboost import XGBClassifier 
clfxg = XGBClassifier()
clfxg.fit(x_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

In [102]:
threshold_score(clfxg, x_validation, y_validation, np.arange(0.1, 1, 0.1))

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.1,0.847328,0.410351,0.552927,0.722529,0.659716
1,0.2,0.740458,0.479012,0.581709,0.73719,0.735545
2,0.3,0.69084,0.553517,0.614601,0.753364,0.784834
3,0.4,0.591603,0.596154,0.59387,0.729597,0.799052
4,0.5,0.492366,0.617225,0.547771,0.695742,0.798104
5,0.6,0.389313,0.641509,0.484561,0.658717,0.794313
6,0.7,0.305344,0.761905,0.435967,0.636909,0.803791
7,0.8,0.19084,0.793651,0.307692,0.587223,0.78673
8,0.9,0.114504,0.909091,0.20339,0.55536,0.777251


In [103]:
threshold_score(clfxg, x_test, y_test, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.675573,0.617801,0.645397,0.74905,0.778726


In [104]:
threshold_score(clfxg, x_train, y_train, [0.3])

Unnamed: 0,thresold,recall,precision,f1-score,RocAuc-score,Accuracy-score
0,0.3,0.979686,0.782448,0.870029,0.942809,0.924864


> For XGBoost, the F1 score on the test data (0.645) is still below the encoding-only result (0.651), so the extra pre-processing does not help.

# CONCLUSION

In [105]:
res = threshold_score(clfxg, x_test, y_test, [0.3])  # evaluate once and reuse the result
dfhasil = pd.concat([dfhasil, pd.DataFrame([['XGBoost', res['f1-score'].iloc[0], res['thresold'].iloc[0], status]])])
dfhasil.columns = ['model','f1-score on test','threshold','data pre-processing']
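
> The results table is built up with repeated `pd.concat` calls, which is why every row in the output below carries index 0. One alternative, sketched here with placeholder scores and a hypothetical `results` list, is to collect one dict per experiment and build the frame once.

```python
import pandas as pd

# Hypothetical accumulator; the scores below are placeholders, not notebook results
results = []
results.append({'model': 'ModelA', 'f1-score on test': 0.61,
                'threshold': 0.3, 'data pre-processing': 'Encoding Only'})
results.append({'model': 'ModelB', 'f1-score on test': 0.64,
                'threshold': 0.5, 'data pre-processing': 'Encoding Only'})

# Build the frame once, sort by score, and give the rows a clean 0..n-1 index
summary = (pd.DataFrame(results)
           .sort_values('f1-score on test', ascending=False)
           .reset_index(drop=True))
print(summary)
```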

In [106]:
dfhasil.sort_values('f1-score on test', ascending=False)

Unnamed: 0,model,f1-score on test,threshold,data pre-processing
0,XGBoost,0.650952,0.3,Encoding Only
0,XGBoost,0.645995,0.3,Encoding and SMOTE
0,XGBoost,0.645397,0.3,"Encoding,FeatureCreation,Scaling"
0,RandomForest,0.633469,0.5,Encoding and Undersampling
0,RandomForest,0.630942,0.3,"Encoding,FeatureCreation,Scaling"
0,RandomForest,0.629116,0.3,Encoding Only
0,XGBoost,0.627485,0.6,Encoding and Undersampling
0,RandomForest,0.621236,0.3,Encoding and SMOTE
0,DecisionTree,0.574148,0.5,Encoding and Undersampling
0,DecisionTree,0.514457,0.5,"Encoding,FeatureCreation,Scaling"


> As the dataframe above shows, Decision Tree is the worst model, while XGBoost and Random Forest perform about equally, with XGBoost only slightly ahead of Random Forest. The best data pre-processing is encoding only. Conclusion: the best model is XGBoost with encoding-only pre-processing, because it has the highest F1 score among all model and pre-processing combinations.

> Important note: because the F1 scores of RandomForest and XGBoost differ only marginally, the ordering of the dataframe above will likely change each time the notebook is re-run. First place, however, is still expected to go to XGBoost. Thank you.
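
> The run-to-run variation mentioned above likely stems, at least in part, from the models being created without a fixed seed. A minimal sketch, assuming reproducible results are wanted: passing `random_state` to `RandomForestClassifier` makes repeated fits give identical outputs. Synthetic data is used here, and the seed value 123 simply mirrors the one used for `train_test_split`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=123)

# Two independent fits with the same seed
rf_a = RandomForestClassifier(n_estimators=50, random_state=123).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=50, random_state=123).fit(X, y)

# With a fixed seed the predicted probabilities match exactly
same = np.allclose(rf_a.predict_proba(X), rf_b.predict_proba(X))
print(same)
```

> `DecisionTreeClassifier` and `XGBClassifier` accept a `random_state` argument as well.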