Query Google Cloud Platform:
SELECT limit_balance, sex, education_level, marital_status, age, pay_0, pay_2, pay_3, pay_4, pay_5, pay_6, bill_amt_1, bill_amt_2, bill_amt_3, bill_amt_4, bill_amt_5, bill_amt_6, pay_amt_1, pay_amt_2, pay_amt_3, pay_amt_4, pay_amt_5, pay_amt_6, default_payment_next_month 
FROM `bigquery-public-data.ml_datasets.credit_card_default` 
LIMIT 21967;

# i. Introduction
Name: Timothy
Batch: FTDS-011
Dataset: credit_card_default
Objective: Build a classification model that predicts default_payment_next_month using the saved dataset


## Conceptual Problems (answered at the end)

1.   What is the purpose of the criterion parameter in a Decision Tree? Explain one criterion that you understand!
2.   What is the purpose of pruning in tree models?
3.   How do you choose the optimal K in KNN?
4.   Explain what you know about Cross Validation!
5.   Explain what you know about Accuracy, Precision, Recall, and F1 Score!



# ii. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier

# iii. Data Loading
The data comes from Google Cloud Platform BigQuery. I exported it to a .csv file using the query at the top of this notebook and saved it to my Google Drive.

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/Dataset/bq-results-20220531-160622-1654013231983.csv')

# iv. Exploratory Data Analysis (EDA)

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

All columns have an integer dtype.

In [None]:
df.isnull().sum()

There are no missing values in the dataset.

In [None]:
plt.figure(figsize=[20,20])
plt.subplot(421)
g=sns.countplot(x= 'sex', hue = 'default_payment_next_month', data = df)
plt.subplot(422)
sns.countplot(x= 'education_level', hue='default_payment_next_month', data = df)
plt.subplot(423)
sns.countplot(x= 'age', hue='default_payment_next_month', data = df)
plt.subplot(424)
sns.countplot(x= 'marital_status', hue='default_payment_next_month', data = df)

Based on these 4 categories, I conclude that the customers most likely to pay their credit card bill are male, have at least a university education level (2), are aged 25-27, and are married.
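Raw counts in a countplot depend on how large each group is, so groups are easier to compare by their default rate. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions for illustration, not the notebook's data); in the notebook the same `groupby` could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use `df` loaded from the CSV.
toy = pd.DataFrame({
    "sex": [1, 1, 2, 2, 2, 1, 2, 1],
    "default_payment_next_month": [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 target per group is the share of rows with target == 1.
default_rate = toy.groupby("sex")["default_payment_next_month"].mean()
print(default_rate)
```

Comparing rates instead of counts avoids reading a larger group as a riskier group.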

In [None]:
df.describe()

In [None]:
corr = df.corr()

plt.figure(figsize=(20, 10))
# Correlations range from -1 to 1; a 0..1 colour scale would hide negative values
sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
plt.show()

The heatmap shows that the columns correlated with default_payment_next_month are pay_0, pay_2, pay_3, pay_4, pay_5, and pay_6.
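Reading one row of a large heatmap by eye is error-prone, so it can help to rank the features by their absolute correlation with the target. The sketch below uses synthetic data (the `toy` frame, its column values, and the `target` column are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: `pay_0` drives the target, `bill_amt_1` is unrelated.
rng = np.random.default_rng(0)
n = 500
pay_0 = rng.integers(-2, 9, size=n)
noise = rng.normal(size=n)
toy = pd.DataFrame({
    "pay_0": pay_0,
    "bill_amt_1": rng.normal(50_000, 10_000, size=n),
    "target": (pay_0 + noise > 3).astype(int),
})

# Take the target column of the correlation matrix and rank by absolute value.
target_corr = toy.corr()["target"].drop("target")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

In the notebook, the same two lines could be applied to `df` with `default_payment_next_month` in place of `target`.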

# v. Data Preprocessing

In [None]:
cc = df.copy()
cc.shape

Drop all bill_amt and pay_amt columns because their correlation with default_payment_next_month is very small.

In [None]:
cc.drop(['bill_amt_1', 'bill_amt_2', 'bill_amt_3', 'bill_amt_4', 'bill_amt_5', 'bill_amt_6', 'pay_amt_1', 'pay_amt_2', 'pay_amt_3', 'pay_amt_4', 'pay_amt_5', 'pay_amt_6'], axis=1, inplace=True)

In [None]:
cc.head()

## Distribution Check

In [None]:
def diagnostic_plots(cc, variable):
    # Define figure size
    plt.figure(figsize=(16, 4))

    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(cc[variable], bins=30)
    plt.title('Histogram')

    # Boxplot
    plt.subplot(1, 2, 2)
    sns.boxplot(y=cc[variable])
    plt.title('Boxplot')

    plt.show()

In [None]:
diagnostic_plots(cc, 'limit_balance')
print('\nSkewness Value : ', cc['limit_balance'].skew())

The limit_balance column is skewed, i.e. not normally distributed.

In [None]:
diagnostic_plots(cc, 'sex')
print('\nSkewness Value : ', cc['sex'].skew())

The sex column is treated as normally distributed.

In [None]:
diagnostic_plots(cc, 'education_level')
print('\nSkewness Value : ', cc['education_level'].skew())

The education_level column is skewed, i.e. not normally distributed.

In [None]:
diagnostic_plots(cc, 'marital_status')
print('\nSkewness Value : ', cc['marital_status'].skew())

The marital_status column is treated as normally distributed.

In [None]:
diagnostic_plots(cc, 'age')
print('\nSkewness Value : ', cc['age'].skew())

The age column is skewed, i.e. not normally distributed.

In [None]:
diagnostic_plots(cc, 'pay_0')
print('\nSkewness Value : ', cc['pay_0'].skew())

The pay_0 column is skewed, i.e. not normally distributed.

In [None]:
diagnostic_plots(cc, 'pay_2')
print('\nSkewness Value : ', cc['pay_2'].skew())

The pay_2 column is skewed, i.e. not normally distributed.

In [None]:
diagnostic_plots(cc, 'pay_3')
print('\nSkewness Value : ', cc['pay_3'].skew())

The pay_3 column is skewed, i.e. not normally distributed.

In [None]:
diagnostic_plots(cc, 'pay_4')
print('\nSkewness Value : ', cc['pay_4'].skew())

The pay_4 column is skewed, i.e. not normally distributed.

In [None]:
diagnostic_plots(cc, 'pay_5')
print('\nSkewness Value : ', cc['pay_5'].skew())

The pay_5 column is skewed, i.e. not normally distributed.

In [None]:
diagnostic_plots(cc, 'pay_6')
print('\nSkewness Value : ', cc['pay_6'].skew())

The pay_6 column is skewed, i.e. not normally distributed.

## Distribution Check Summary

1.  sex              Normal 
2.  education_level  Skew
3.  marital_status   Normal
4.  age              Skew
5.  pay_0            Skew
6.  pay_2            Skew
7.  pay_3            Skew
8.  pay_4            Skew
9.  pay_5            Skew
10. pay_6            Skew
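A summary like the one above can also be produced in one pass instead of one cell per column. This is a rough sketch on synthetic data; the column names and the cut-off of 0.5 on absolute skewness are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one symmetric, one clearly right-skewed.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.exponential(size=2000),
})

# Assumed rule of thumb: |skewness| <= 0.5 is treated as roughly symmetric.
THRESHOLD = 0.5
skewness = toy.skew()
labels = skewness.abs().le(THRESHOLD).map({True: "Normal", False: "Skew"})

summary = pd.DataFrame({"skewness": skewness, "label": labels})
print(summary)
```

Stating the threshold explicitly makes the Normal/Skew labels reproducible.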

## Handling Outlier

In [None]:
def find_normal_boundaries(cc, variable):
    upper_boundary = cc[variable].mean() + 3 * cc[variable].std()
    lower_boundary = cc[variable].mean() - 3 * cc[variable].std()

    return upper_boundary, lower_boundary

In [None]:
upper_boundary, lower_boundary = find_normal_boundaries(cc, 'sex')
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['sex'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['sex'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['sex'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['sex'] < lower_boundary]) / len(cc) * 100))

In [None]:
upper_boundary, lower_boundary = find_normal_boundaries(cc, 'marital_status')
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['marital_status'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['marital_status'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['marital_status'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['marital_status'] < lower_boundary]) / len(cc) * 100))

There are no outliers in the sex and marital_status columns.

In [None]:
def find_skewed_boundaries(cc, variable, distance):
    IQR = cc[variable].quantile(0.75) - cc[variable].quantile(0.25)

    lower_boundary = cc[variable].quantile(0.25) - (IQR * distance)
    upper_boundary = cc[variable].quantile(0.75) + (IQR * distance)

    return upper_boundary, lower_boundary

In [None]:
upper_boundary, lower_boundary = find_skewed_boundaries(cc, 'education_level', 1.5)
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['education_level'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['education_level'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['education_level'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['education_level'] < lower_boundary]) / len(cc) * 100))

The education_level column contains 1.4% outliers.

In [None]:
upper_boundary, lower_boundary = find_skewed_boundaries(cc, 'age', 1.5)
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['age'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['age'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['age'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['age'] < lower_boundary]) / len(cc) * 100))

The age column contains 0.8% outliers.

In [None]:
upper_boundary, lower_boundary = find_skewed_boundaries(cc, 'pay_0', 1.5)
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['pay_0'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['pay_0'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['pay_0'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['pay_0'] < lower_boundary]) / len(cc) * 100))

The pay_0 column contains 11% outliers.

In [None]:
upper_boundary, lower_boundary = find_skewed_boundaries(cc, 'pay_2', 1.5)
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['pay_2'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['pay_2'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['pay_2'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['pay_2'] < lower_boundary]) / len(cc) * 100))

The pay_2 column contains 14.5% outliers.

In [None]:
upper_boundary, lower_boundary = find_skewed_boundaries(cc, 'pay_3', 1.5)
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['pay_3'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['pay_3'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['pay_3'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['pay_3'] < lower_boundary]) / len(cc) * 100))

The pay_3 column contains 14.5% outliers.

In [None]:
upper_boundary, lower_boundary = find_skewed_boundaries(cc, 'pay_4', 1.5)
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['pay_4'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['pay_4'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['pay_4'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['pay_4'] < lower_boundary]) / len(cc) * 100))

The pay_4 column contains 12% outliers.



In [None]:
upper_boundary, lower_boundary = find_skewed_boundaries(cc, 'pay_5', 1.5)
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['pay_5'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['pay_5'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['pay_5'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['pay_5'] < lower_boundary]) / len(cc) * 100))

The pay_5 column contains 12% outliers.

In [None]:
upper_boundary, lower_boundary = find_skewed_boundaries(cc, 'pay_6', 1.5)
upper_boundary, lower_boundary

In [None]:
print('Total row: {}'.format(len(cc)))
print('Right End Outliers: {}'.format(len(cc[cc['pay_6'] > upper_boundary])))
print('Left End Outliers: {}'.format(len(cc[cc['pay_6'] < lower_boundary])))
print('')
print('% right end outliers : {}'.format(len(cc[cc['pay_6'] > upper_boundary]) / len(cc) * 100))
print('% left end outliers  : {}'.format(len(cc[cc['pay_6'] < lower_boundary]) / len(cc) * 100))

The pay_6 column contains 11% outliers.

## Handling Outlier Summary


*   No outliers: sex and marital_status
*   Trimming:    education_level and age
*   Capping:     pay_0, pay_2, pay_3, pay_4, pay_5, and pay_6
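The difference between the two strategies is easiest to see on a small series: trimming drops the rows outside the IQR boundaries, while capping keeps every row and clips the values to the boundaries. A minimal sketch with made-up numbers (the `values` series is an assumption for illustration), using the same 1.5 × IQR rule as `find_skewed_boundaries`:

```python
import pandas as pd

# Toy series with one obvious high outlier.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30])

# IQR boundaries with a distance of 1.5.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: remove rows outside the boundaries (row count shrinks).
trimmed = values[(values >= lower) & (values <= upper)]

# Capping: keep all rows, clip extreme values to the boundaries.
capped = values.clip(lower=lower, upper=upper)

print(len(values), len(trimmed), len(capped))
print(capped.max())
```

Trimming suits columns with few outliers, since little data is lost; capping suits columns where dropping 11% or more of the rows would be too costly.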



## Trimming

In [None]:
# Limits for `education_level`
education_level_upper_limit, education_level_lower_limit = find_skewed_boundaries(cc, 'education_level', 1.5)
education_level_upper_limit, education_level_lower_limit

# Limits for `age`
age_upper_limit, age_lower_limit = find_skewed_boundaries(cc, 'age', 1.5)
age_upper_limit, age_lower_limit


In [None]:
# Flag the outliers in category `education_level`
outliers_education_level = np.where(cc['education_level'] > education_level_upper_limit, True,
                       np.where(cc['education_level'] < education_level_lower_limit, True, False))

# Flag the outliers in category `age`
outliers_age = np.where(cc['age'] > age_upper_limit, True,
                       np.where(cc['age'] < age_lower_limit, True, False))



print(outliers_age[:10])

In [None]:
# Drop rows flagged as outliers in either column (element-wise OR of the boolean masks)
cc_trimmed = cc.loc[~(outliers_age | outliers_education_level)]

## Capping


In [None]:
!pip install feature-engine
from feature_engine.outliers import Winsorizer

In [None]:
windsorizer = Winsorizer(capping_method='iqr', # choose iqr for IQR rule boundaries or gaussian for mean and std
                          tail='both', # cap left, right or both tails 
                          fold=1.5,
                          variables=['pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6'])

windsorizer.fit(cc_trimmed)

cc_t = windsorizer.transform(cc_trimmed)

In [None]:
print('cc Dataframe - Before Capping')
print(cc_trimmed.describe())
print('')
print('cc Dataframe - After Capping')
print(cc_t.describe())

In [None]:
cc_t.head()

## Splitting Dataset

In [None]:
X = cc_t.drop(['default_payment_next_month'],axis=1)
y = cc_t['default_payment_next_month']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

print('Train Size : ', X_train.shape)
print('Test Size  : ', X_test.shape)

In [None]:
X_train.info()

In [None]:
X_test.info()

All columns are int or float and none are object, so we can proceed to feature scaling.

## Feature Scaling
StandardScaler is used because the outliers have already been handled.
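StandardScaler rescales each feature to zero mean and unit variance using statistics learned from the training data only, and the test data is transformed with those same statistics. The sketch below checks this by hand on a tiny made-up array (the arrays are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature column.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# Fit on the training data only, then apply to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Manual z-score using the training mean and (population) standard deviation.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_test_scaled, manual)
```

Because the mean and standard deviation are sensitive to extreme values, handling outliers before this step keeps the scaled features in a comparable range.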

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled

In [None]:
X_train = X_train_scaled.copy()

In [None]:
X_test = X_test_scaled.copy()

# vi. Model Definition, Training, and Evaluation

Add a print_score function to display the accuracy of each model.

In [None]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    elif train==False:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

## Logistic Regression

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print_score(logreg, X_train, y_train, X_test, y_test, train=True)
print_score(logreg, X_train, y_train, X_test, y_test, train=False)

In [None]:
y_pred_train = logreg.predict(X_train)
y_pred_test = logreg.predict(X_test)

y_pred_train

## SVM

In [None]:
svr = SVC()
svr.fit(X_train, y_train)

print_score(svr, X_train, y_train, X_test, y_test, train=True)
print_score(svr, X_train, y_train, X_test, y_test, train=False)

In [None]:
y_pred_train = svr.predict(X_train)
y_pred_test = svr.predict(X_test)

y_pred_train

## Decision Tree

In [None]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

In [None]:
params = {
    "criterion":("gini", "entropy"), 
    "splitter":("best", "random"), 
    "max_depth":(list(range(1, 20))), 
    "min_samples_split":[2, 3, 4], 
    "min_samples_leaf":list(range(1, 20)), 
}


tree_clf = DecisionTreeClassifier(random_state=42)
tree_cv = GridSearchCV(tree_clf, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=3)
tree_cv.fit(X_train, y_train)
best_params = tree_cv.best_params_
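print(f"Best parameters: {best_params}")

# Keep random_state so the refit matches the configuration selected by the grid search
tree_clf = DecisionTreeClassifier(**best_params, random_state=42)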
tree_clf.fit(X_train, y_train)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

In [None]:
acc_train_cross_val = cross_val_score(tree_clf, 
                                      X_train, 
                                      y_train, 
                                      cv=3, scoring="accuracy")

print('Accuracy - All - Cross Validation  : ', acc_train_cross_val)
print('Accuracy - Mean - Cross Validation : ', acc_train_cross_val.mean())
print('Accuracy - Std - Cross Validation  : ', acc_train_cross_val.std())
print('Accuracy - Range of Test-Set       : ', (acc_train_cross_val.mean()-acc_train_cross_val.std()) , '-', (acc_train_cross_val.mean()+acc_train_cross_val.std()))

In [None]:
y_pred_train = tree_clf.predict(X_train)
y_pred_test = tree_clf.predict(X_test)

y_pred_train

## Random Forest

In [None]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

In [None]:
acc_train_cross_val = cross_val_score(rf_clf, 
                                      X_train, 
                                      y_train, 
                                      cv=3, scoring="accuracy")

print('Accuracy - All - Cross Validation  : ', acc_train_cross_val)
print('Accuracy - Mean - Cross Validation : ', acc_train_cross_val.mean())
print('Accuracy - Std - Cross Validation  : ', acc_train_cross_val.std())
print('Accuracy - Range of Test-Set       : ', (acc_train_cross_val.mean()-acc_train_cross_val.std()) , '-', (acc_train_cross_val.mean()+acc_train_cross_val.std()))

In [None]:
y_pred_train = rf_clf.predict(X_train)
y_pred_test = rf_clf.predict(X_test)

y_pred_train

## KNN

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

print_score(knn_clf, X_train, y_train, X_test, y_test, train=True)
print_score(knn_clf, X_train, y_train, X_test, y_test, train=False)

In [None]:
acc_train_cross_val = cross_val_score(knn_clf, 
                                      X_train, 
                                      y_train, 
                                      cv=3, scoring="accuracy")

print('Accuracy - All - Cross Validation  : ', acc_train_cross_val)
print('Accuracy - Mean - Cross Validation : ', acc_train_cross_val.mean())
print('Accuracy - Std - Cross Validation  : ', acc_train_cross_val.std())
print('Accuracy - Range of Test-Set       : ', (acc_train_cross_val.mean()-acc_train_cross_val.std()) , '-', (acc_train_cross_val.mean()+acc_train_cross_val.std()))

In [None]:
y_pred_train = knn_clf.predict(X_train)
y_pred_test = knn_clf.predict(X_test)

y_pred_train
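The KNN model above uses the default number of neighbours. One common way to pick K is to compare several candidates with cross-validation and keep the one with the best mean score. This is a sketch on synthetic data; `X_toy`, `y_toy`, and the candidate list are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

# Odd candidate values avoid ties in binary majority voting.
candidate_k = [1, 3, 5, 7, 9, 11, 15, 21]

mean_scores = {}
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

# Keep the K with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

In the notebook, `X_train` and `y_train` would take the place of the toy arrays.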

## Naive Bayes

In [None]:
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

print_score(nb_clf, X_train, y_train, X_test, y_test, train=True)
print_score(nb_clf, X_train, y_train, X_test, y_test, train=False)

In [None]:
acc_train_cross_val = cross_val_score(nb_clf, 
                                      X_train, 
                                      y_train, 
                                      cv=3, scoring="accuracy")

print('Accuracy - All - Cross Validation  : ', acc_train_cross_val)
print('Accuracy - Mean - Cross Validation : ', acc_train_cross_val.mean())
print('Accuracy - Std - Cross Validation  : ', acc_train_cross_val.std())
print('Accuracy - Range of Test-Set       : ', (acc_train_cross_val.mean()-acc_train_cross_val.std()) , '-', (acc_train_cross_val.mean()+acc_train_cross_val.std()))

In [None]:
y_pred_train = nb_clf.predict(X_train)
y_pred_test = nb_clf.predict(X_test)

y_pred_train

## Bagging

In [None]:
bg_clf = BaggingClassifier()
bg_clf.fit(X_train, y_train)

print_score(bg_clf, X_train, y_train, X_test, y_test, train=True)
print_score(bg_clf, X_train, y_train, X_test, y_test, train=False)

In [None]:
y_pred_train = bg_clf.predict(X_train)
y_pred_test = bg_clf.predict(X_test)

y_pred_train

# vii. Model Inference

In [None]:
data_inf = cc_t.sample(10, random_state=17)
data_inf

In [None]:
X1 = data_inf.drop(['default_payment_next_month'],axis=1)
y1 = data_inf['default_payment_next_month']

In [None]:
# Reuse the scaler fitted on the training set. Splitting these 10 rows and refitting a
# new scaler on them would put the features on a different scale than the model saw.
X1_scaled = scaler.transform(X1)

X1_scaled

In [None]:
y_pred_inf = tree_clf.predict(X1_scaled)

y_pred_inf
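One way to keep preprocessing consistent between training and inference is to bundle the scaler and the classifier in a single scikit-learn pipeline, so new rows always pass through the scaler fitted during training. A minimal sketch on synthetic data; the toy arrays and the `max_depth=3` setting are assumptions for illustration, not the notebook's tuned parameters.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 190 rows for fitting, 10 held back as "new" rows.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
X_fit, y_fit = X_toy[:190], y_toy[:190]
X_new = X_toy[190:]

# The pipeline fits the scaler and the tree together on the training rows.
model = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(max_depth=3, random_state=0),
)
model.fit(X_fit, y_fit)

# At inference time the stored scaler statistics are reused automatically.
predictions = model.predict(X_new)
print(predictions)
```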

# viii. Pengambilan Keputusan
Of the 7 models above, I choose the Decision Tree model because, according to print_score, it has the highest precision for class 1 (the customer defaults on next month's payment) of all the models, namely 1 or 100%. I hope the bank can use the predictions from this Decision Tree model to plan its future strategy.

# ix. Jawaban Conceptual Problems

1.   Criterion is the Decision Tree parameter that determines how the purity of a split is measured, e.g. Gini impurity, which is 1 minus the sum of the squared class proportions in a node and equals 0 when the node contains a single class.
2.   Pruning is a data compression technique in machine learning and search algorithms that reduces the size of a decision tree by removing the parts that are unimportant or redundant, which also helps reduce overfitting.
3.   A common rule of thumb sets K to the square root of N, where N is the number of samples. This is only a starting point; candidate values are usually compared with cross-validation.
4.   Cross-validation is used in machine learning to estimate how well a model performs on unseen data. It uses a limited sample to estimate the model's performance when predicting on data that was not used for training.
5.   These metrics are defined from the four outcomes of the confusion matrix:
     *   Actual Negative, Predicted Negative = True Negative (TN)
     *   Actual Negative, Predicted Positive = False Positive (FP)
     *   Actual Positive, Predicted Negative = False Negative (FN)
     *   Actual Positive, Predicted Positive = True Positive (TP)
     *   Precision = TP / (TP + FP)
     *   Recall = TP / (TP + FN)
     *   F1-score = 2 * ((Precision * Recall) / (Precision + Recall))
     *   Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the share of all predictions that match the actual data
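The Gini impurity named in answer 1 and the metrics from answer 5 can be computed by hand. The sketch below does so in plain Python; the function names and the example counts are hypothetical and chosen only for illustration.

```python
def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A perfectly mixed binary node has the maximum impurity; a pure node has none.
print(gini_impurity([0, 0, 1, 1]))
print(gini_impurity([1, 1, 1, 1]))

# Made-up counts: 40 TP, 10 FP, 20 FN, 30 TN.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

With these counts precision is higher than recall: the model's positive predictions are mostly right, but it misses a third of the actual positives, which accuracy alone would not reveal.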


