Bengkel Koding Project: Hungarian Dataset

Name                  : Nicholaus Verdhy Putranto

Student ID (NIM)      : A11.2020.12447

Class                 : BKDS07

# Data Collection

The dataset for this project is taken from https://archive.ics.uci.edu/dataset/45/heart+disease

# Load Dataset

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re
import itertools

In [2]:
# Path to the raw dataset file
dir = "hungarian.data"

In [3]:
# Read the file with Latin-1 encoding
with open(dir, encoding='Latin1') as file:
    lines = [line.strip() for line in file]

# Build the DataFrame: each record spans 10 lines and must contain 76 fields
data = itertools.takewhile(
    lambda x: len(x) == 76,
    (' '.join(lines[i:(i + 10)]).split() for i in range(0, len(lines), 10)))

df = pd.DataFrame.from_records(data)
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'hungarian.data'
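
The parsing step above joins every group of lines into one record and stops at the first group that does not have the expected number of fields. The sketch below shows the same idea on synthetic data; the line contents, `LINES_PER_RECORD`, and `FIELDS_PER_RECORD` are assumptions made for illustration (the real file uses 10 lines and 76 fields per record).

```python
import itertools

# Synthetic stand-in for the raw file: each record is spread over 3 lines
# and has 6 fields in total (hypothetical sizes, for illustration only).
lines = [
    "1 63 1", "145 233", "name",
    "2 67 1", "160 286", "name",
    "3 broken",  # incomplete trailing record
]

LINES_PER_RECORD = 3
FIELDS_PER_RECORD = 6

# Join each group of lines, split on whitespace, and stop at the first
# group that does not have the expected number of fields.
records = list(itertools.takewhile(
    lambda fields: len(fields) == FIELDS_PER_RECORD,
    (" ".join(lines[i:i + LINES_PER_RECORD]).split()
     for i in range(0, len(lines), LINES_PER_RECORD)),
))

print(records)
```

Because `takewhile` stops at the first malformed group, any records after it are discarded as well, which is worth keeping in mind if the row count looks lower than expected.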

In [None]:
# Inspect the data type and the number of non-null values of each variable
df.info()

According to the dataset documentation, column 75 contains the value `name`, which marks the end of each record, and the value -9.0 represents a missing value. We therefore drop column 75 and convert the object columns to float so that -9.0 can be replaced with NaN.

In [None]:
# Drop column 75, which contains "name"
df.drop(75,axis=1,inplace=True)

# Drop column 0 because it is the record index
df.drop(0,axis=1,inplace=True)

In [None]:
# Convert the object columns to float
df = df.astype(float)

In [None]:
# Check the data types again
df.info()

# Data Validation

As stated in the dataset documentation, -9.0 represents a missing value. We use `.replace` to turn it into NaN.

In [None]:
df.replace(-9.0,np.nan, inplace=True)
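
As a small illustration of the sentinel handling described in this section, the sketch below runs the same three steps (drop the marker column, cast to float, replace -9.0 with NaN) on a toy frame. The values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: values arrive as strings, -9 marks a missing value,
# and the last column holds the end-of-record marker "name".
toy = pd.DataFrame([["1", "63", "-9", "name"],
                    ["2", "-9", "140", "name"]])

toy = toy.drop(columns=[3])      # remove the "name" marker column
toy = toy.astype(float)          # strings -> float, so "-9" becomes -9.0
toy = toy.replace(-9.0, np.nan)  # sentinel -> NaN

print(toy.isna().sum().sum())    # total number of missing values
```

The cast to float has to come first: while the column is still of type object, the string `"-9"` does not compare equal to the number -9.0.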

In [None]:
# Count the missing values in each column
df.isna().sum()

In [None]:
df.isna().sum().sum()

In total, this dataset contains 8511 missing values.

In [None]:
# Now look at the first 5 rows of the dataset; pd.set_option lets us see every column
pd.set_option("display.max_columns", None)
df.head()

# Feature Selection

According to the dataset documentation, the features to use are at column positions [1, 2, 7, 8, 10, 14, 17, 30, 36, 38, 39, 42, 49, 56], so we select those columns.

In [None]:
df_selected = df.iloc[:, [1, 2, 7,8,10,14,17,30,36,38,39,42,49,56]]
df_selected
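
One detail that is easy to miss: `iloc` selects by position, while the column labels kept their original numbers. Because label 0 was dropped earlier, position `p` now holds label `p + 1`, which is why a renaming dictionary for these columns has to use keys that are one higher than the positions. A minimal sketch on a toy frame (the values are made up):

```python
import pandas as pd

# Toy frame with integer column labels 0..5, like the raw DataFrame.
toy = pd.DataFrame([[10, 11, 12, 13, 14, 15]])

# After dropping label 0, the remaining labels are 1..5,
# so position p now holds label p + 1.
toy = toy.drop(columns=[0])

selected = toy.iloc[:, [1, 2]]   # positions 1 and 2
print(list(selected.columns))    # the labels found at those positions

# Renaming works on labels, not on positions.
renamed = selected.rename(columns={2: "age", 3: "sex"})
print(list(renamed.columns))
```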

Next, we rename the columns following the dataset documentation, using `.rename`. First we build a dictionary that maps the current column labels to the new names.

To keep the raw data intact, we copy the DataFrame into `df_selected_copy`, so that the original data is not contaminated by the processed data.

In [None]:
df_selected_copy = df_selected.copy()

In [None]:
# Dictionary that maps column labels to feature names
nama_kolom = {
    2: 'age',
    3: 'sex',
    8: 'cp',
    9: 'trestbps',
    11: 'chol',
    15: 'fbs',
    18: 'restecg',
    31: 'thalach',
    37: 'exang',
    39: 'oldpeak',
    40: 'slope',
    43: 'ca',
    50: 'thal',
    57: 'target'
}

In [None]:
# Rename the columns with .rename
df_selected_copy.rename(columns=nama_kolom,inplace=True)

In [None]:
df_selected_copy.head()

# Data Cleansing

The next stage is data cleansing, which removes the missing values from the dataset.

In [None]:
# Count the missing values in each column
df_selected_copy.isnull().sum()

In [None]:
# Number of rows and columns
df_selected_copy.shape

In [None]:
df_selected_copy.info()

The missing-value counts show that the columns slope, ca, and thal are missing almost 90% of their values, so we drop those columns first.

In [None]:
df_selected_copy.drop(["slope","ca","thal"],axis=1, inplace=True)
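
The decision to drop a column can also be expressed as a fraction of missing values rather than read off the raw counts. The sketch below is one way to do that; the toy values and the 0.5 threshold are assumptions for illustration, not values taken from this project.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "chol": [230.0, np.nan, 250.0, 210.0],
    "ca":   [np.nan, np.nan, np.nan, 0.0],
})

# isna().mean() gives the fraction of missing values per column.
missing_fraction = toy.isna().mean()
print(missing_fraction)

# Hypothetical threshold, chosen only for this illustration.
THRESHOLD = 0.5
to_drop = missing_fraction[missing_fraction > THRESHOLD].index.tolist()
toy = toy.drop(columns=to_drop)
print(list(toy.columns))
```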

In [None]:
df_selected_copy.isna().sum()

Next, we fill the remaining missing values with the column mean.

In [None]:
df_selected_copy.describe()

The means of the columns with missing values are already floats, so we can fill the missing values directly with `fillna` and `.mean`.

In [None]:
df_selected_copy.fillna(df_selected_copy.mean(),inplace=True)
df_selected_copy.isna().sum()
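
To make the effect of mean imputation concrete, here is a minimal sketch on a toy frame with made-up values. `fillna` accepts a Series of per-column means, so each column is filled with its own mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "chol": [200.0, np.nan, 260.0],
    "fbs":  [0.0, 1.0, np.nan],
})

# toy.mean() skips NaN by default and returns one mean per column.
filled = toy.fillna(toy.mean())
print(filled)
```

For a 0/1 column such as `fbs`, the mean is a fraction rather than one of the two categories, so the filled value is not a valid category; filling with the most frequent value would be an alternative for such columns.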

In [None]:
df_selected_copy.info()

Next, I check for duplicated rows with `.duplicated`.

In [None]:
df_selected_copy.duplicated().sum()

There is 1 duplicated row; let us look at it in the dataset.

In [None]:
df_duplikat = df_selected_copy.duplicated()
df_selected_copy[df_selected_copy.duplicated(keep=False)]
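
The two calls above behave slightly differently, which the following sketch illustrates on a toy frame (values made up). By default `duplicated()` flags only the later copies, while `keep=False` flags every row that takes part in a duplication.

```python
import pandas as pd

toy = pd.DataFrame({"age": [40, 52, 40], "chol": [200, 230, 200]})

n_extra_copies = int(toy.duplicated().sum())   # later copies only
involved = toy[toy.duplicated(keep=False)]     # every row involved
deduped = toy.drop_duplicates()                # keeps the first copy

print(n_extra_copies, len(involved), len(deduped))
```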

Now that we know there is duplicated data, we remove the duplicate.

In [None]:
df_selected_copy.drop_duplicates(inplace=True)

In [None]:
df_selected_copy.duplicated().sum()

In [None]:
df_selected_copy.to_csv("Hungarian_Data.csv",index=False,encoding="utf-8")

In [None]:
df_selected_copy.info()

# Correlation Between Variables

In [None]:
df_selected_copy.corr()

For the visualization we use seaborn, which has to be imported first.

In [None]:
#import library
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Compute the correlation matrix
korelasi_matrix = df_selected_copy.corr()

# Plot the heatmap
fig,ax = plt.subplots(figsize=(15,10))
sns.heatmap(korelasi_matrix,annot=True,linewidths=1.0, fmt=".3f")

From the correlations above, the variables most strongly correlated with target include oldpeak, exang, and cp, while thalach is negatively correlated with it.
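
Reading these values off a heatmap can be error-prone, so it may help to sort the correlations with the target directly. The sketch below does this on a toy frame; the numbers are made up and only mimic the direction of the relationships described above.

```python
import pandas as pd

# Made-up values: oldpeak rises with target, thalach falls with it.
toy = pd.DataFrame({
    "oldpeak": [0.0, 1.0, 2.0, 3.0],
    "thalach": [180.0, 160.0, 150.0, 120.0],
    "target":  [0.0, 0.0, 1.0, 1.0],
})

# Take the "target" column of the correlation matrix, remove the
# self-correlation, and sort from most positive to most negative.
corr_with_target = (
    toy.corr()["target"].drop("target").sort_values(ascending=False)
)
print(corr_with_target)
```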

# Data Construction

We now split the dataset into the independent variables (X) and the dependent variable (y). The data types already match what we need, namely float.

In [None]:
df_selected_copy.info()

In [None]:
# Put the independent variables in X and the target in y
X = df_selected_copy.drop("target",axis=1)
y = df_selected_copy["target"]

# Balancing Data

Because this is a classification task, let us check how balanced the classes are in this data.

In [None]:
df_selected_copy["target"].value_counts().plot(kind='bar',figsize=(10,6),color=['green','blue',"red","yellow","purple"])
plt.title("Count of the target")
plt.xticks(rotation=45);

The plot above shows that most records in this dataset belong to class 0.0. This needs to be addressed: without balancing, the model would be biased toward predicting class 0 over the other classes. We therefore oversample the minority classes with SMOTE.

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
# Apply SMOTE
smote = SMOTE(random_state=42)
X_smote_resampled, y_smote_resampled = smote.fit_resample(X, y)
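
The core idea behind SMOTE is to create synthetic minority samples by interpolating between an existing sample and one of its neighbours from the same class. The sketch below is a simplified illustration of that interpolation step in plain NumPy, not the imbalanced-learn implementation: it always uses the single nearest neighbour, whereas SMOTE picks at random among several nearest neighbours. The sample values and the helper name `synthesize` are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class samples (two features each).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def synthesize(samples, rng):
    """Create one synthetic point between a sample and its nearest neighbour."""
    i = rng.integers(len(samples))
    base = samples[i]
    # Euclidean distance from the chosen sample to every sample.
    dist = np.linalg.norm(samples - base, axis=1)
    dist[i] = np.inf                      # exclude the sample itself
    neighbour = samples[np.argmin(dist)]
    gap = rng.random()                    # uniform in [0, 1)
    return base + gap * (neighbour - base)

new_point = synthesize(minority, rng)
print(new_point)
```

Because the new point lies on the segment between two real samples, it always stays inside the range already covered by the minority class.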

In [None]:
# Compare the class distribution before and after SMOTE, side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
colors = ['green', 'blue', 'red', 'yellow', 'purple']

# Prepare the target data without SMOTE and with SMOTE
no_smote_y = pd.DataFrame(data=y)
smote_y = pd.DataFrame(data=y_smote_resampled)

# Left panel: class counts before SMOTE
no_smote_y.value_counts().plot(kind='bar', ax=axes[0], color=colors, rot=0)
axes[0].set_title("target before over sampling with SMOTE")

# Right panel: class counts after SMOTE
smote_y.value_counts().plot(kind='bar', ax=axes[1], color=colors, rot=0)
axes[1].set_title("target after over sampling with SMOTE")

plt.tight_layout()
plt.show()


In [None]:
# Class counts without SMOTE
no_smote_y.value_counts()

In [None]:
# Class counts with SMOTE
smote_y.value_counts()

# Data Normalization

Let us look at the descriptive statistics of the data.

In [None]:
df_selected_copy.describe()

We normalize the data with MinMaxScaler.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_smote_resampled_normal = scaler.fit_transform(X_smote_resampled)
len(X_smote_resampled_normal)

dfcek1 = pd.DataFrame(X_smote_resampled_normal)
dfcek1.describe()
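
With its default feature range of 0 to 1, MinMaxScaler rescales each column as (x - min) / (max - min). The sketch below reproduces that formula in plain NumPy on made-up values, and shows how a new sample would be scaled with the minimum and maximum learned from the data.

```python
import numpy as np

# Made-up data: two features, three samples.
X = np.array([[120.0, 200.0],
              [140.0, 300.0],
              [160.0, 250.0]])

col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Min-max scaling, computed per column.
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)

# A new sample is scaled with the min and max learned above.
new_sample = np.array([130.0, 275.0])
new_scaled = (new_sample - col_min) / (col_max - col_min)
print(new_scaled)
```

A new sample outside the original range would map outside 0 to 1, which is one reason to store the original minimum and maximum alongside a saved model.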

In [None]:
import json

# Compute the minimum and maximum of each feature on the original (unscaled) data;
# the scaled columns always span 0 to 1, so their min and max carry no information
min_max_values = {feature: {"min": float(values.min()), "max": float(values.max())} for feature, values in X_smote_resampled.items()}

# Save the minimum and maximum values as JSON
with open("min_max_values.json", "w") as json_file:
    json.dump(min_max_values, json_file, indent=2)

print("Minimum and maximum values saved to min_max_values.json")

In [None]:
X_smote_resampled_normal.size

In [None]:
y_smote_resampled.size

# Splitting Data

In [None]:
from sklearn.model_selection import train_test_split

# Split the features and target into train and test sets (oversampled data)
X_train, X_test, y_train, y_test = train_test_split(X_smote_resampled, y_smote_resampled, test_size=0.2, random_state=42,stratify=y_smote_resampled)

# Split the features and target into train and test sets (oversampled + normalized data)
X_train_normal, X_test_normal, y_train_normal, y_test_normal = train_test_split(X_smote_resampled_normal, y_smote_resampled, test_size=0.2, random_state=42,stratify = y_smote_resampled)
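
The `stratify` argument keeps the class proportions the same in the train and test sets. A minimal sketch on made-up, perfectly balanced data shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, two classes with 10 samples each.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Count the samples per class in each split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Since both splits above use the same `random_state` and the same `stratify` labels on inputs of equal length, they should select the same rows, which keeps the two experiments comparable.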

# Modeling

In [None]:
from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score, roc_auc_score, confusion_matrix

def evaluation(Y_test, Y_pred):
    """Print accuracy and the weighted-average recall, F1, and precision."""
    acc = accuracy_score(Y_test, Y_pred)
    rcl = recall_score(Y_test, Y_pred, average='weighted')
    f1 = f1_score(Y_test, Y_pred, average='weighted')
    ps = precision_score(Y_test, Y_pred, average='weighted')

    metric_dict = {'accuracy': round(acc, 3),
                   'recall': round(rcl, 3),
                   'F1 score': round(f1, 3),
                   'Precision score': round(ps, 3)}

    print(metric_dict)
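
A note on `average='weighted'`: each per-class score is weighted by the number of true samples of that class. For recall this weighted average works out to the same number as accuracy, so the two entries printed by `evaluation` are expected to coincide. The sketch below checks this by hand in NumPy on made-up labels.

```python
import numpy as np

# Made-up labels for three classes.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2])

classes, support = np.unique(y_true, return_counts=True)

# Recall of class c: share of correct predictions among samples
# whose true label is c.
recall_per_class = np.array(
    [np.mean(y_pred[y_true == c] == c) for c in classes])

# Weighted average: each class weighted by its share of the true labels.
weighted_recall = np.sum(recall_per_class * support / support.sum())
accuracy = np.mean(y_true == y_pred)

print(weighted_recall, accuracy)
```

Weighted precision and F1 do not reduce to accuracy in the same way, so those two columns carry the additional information.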

## Oversampled

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

### K-NN

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

In [None]:
y_pred_knn = knn_model.predict(X_test)
# Evaluate the KNN model
print("K-Nearest Neighbors (KNN) Model:")
accuracy_knn_smote = round(accuracy_score(y_test,y_pred_knn),3)
print("Accuracy:", accuracy_knn_smote)
print("Classification Report:")
print(classification_report(y_test, y_pred_knn))

In [None]:
evaluation(y_test,y_pred_knn)

In [None]:
cm = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
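
When reading these heatmaps, it helps to recall the orientation of scikit-learn's `confusion_matrix`: entry `cm[i, j]` counts the samples whose true label is `i` and whose predicted label is `j`. In a heatmap the rows therefore run along the y-axis (true labels) and the columns along the x-axis (predicted labels). A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: three samples of class 0, one of class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# cm[0, 1]: true class 0 predicted as class 1.
# cm[1, 0]: true class 1 predicted as class 0.
print(cm[0, 1], cm[1, 0])
```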

### Random Forest

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [None]:
y_pred_rf = rf_model.predict(X_test)
# Evaluate the Random Forest model
print("\nRandom Forest Model:")
accuracy_rf_smote = round(accuracy_score(y_test, y_pred_rf),3)
print("Accuracy:",accuracy_rf_smote)
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))

In [None]:
evaluation(y_test,y_pred_rf)

In [None]:
cm = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

### XGBoost

In [None]:
xgb_model = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)

In [None]:
y_pred_xgb = xgb_model.predict(X_test)
# Evaluate the XGBoost model
print("\nXGBoost Model:")
accuracy_xgb_smote = round(accuracy_score(y_test, y_pred_xgb),3)
print("Accuracy:",accuracy_xgb_smote)
print("Classification Report:")
print(classification_report(y_test, y_pred_xgb))

In [None]:
evaluation(y_test,y_pred_xgb)

In [None]:
cm = confusion_matrix(y_test, y_pred_xgb)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

## Oversampled + Normalization

### K-NN

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train_normal, y_train_normal)

y_pred_knn = knn_model.predict(X_test_normal)
# Evaluate the KNN model
print("K-Nearest Neighbors (KNN) Model:")
accuracy_knn_smote_normal = round(accuracy_score(y_test_normal,y_pred_knn),3)
print("Accuracy:", accuracy_knn_smote_normal)
print("Classification Report:")
print(classification_report(y_test_normal, y_pred_knn))

In [None]:
evaluation(y_test_normal,y_pred_knn)

In [None]:
cm = confusion_matrix(y_test_normal, y_pred_knn)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

### Random Forest

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_normal, y_train_normal)

In [None]:
y_pred_rf = rf_model.predict(X_test_normal)
# Evaluate the Random Forest model
print("\nRandom Forest Model:")
accuracy_rf_smote_normal = round(accuracy_score(y_test_normal, y_pred_rf),3)
print("Accuracy:",accuracy_rf_smote_normal )
print("Classification Report:")
print(classification_report(y_test_normal, y_pred_rf))

In [None]:
evaluation(y_test_normal,y_pred_rf)

In [None]:
cm = confusion_matrix(y_test_normal, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

In [None]:
# import pickle

# # Assuming you have a trained model stored in the variable 'model'
# # and you want to save it to a file named 'your_model.pkl'

# with open('randomforest_Oversampled_normalisasi.pkl', 'wb') as model_file:
#     pickle.dump(rf_model, model_file)

### XGBoost

In [None]:
xgb_model = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)
xgb_model.fit(X_train_normal, y_train_normal)

In [None]:
y_pred_xgb = xgb_model.predict(X_test_normal)
# Evaluate the XGBoost model
print("\nXGBoost Model:")
accuracy_xgb_smote_normal = round(accuracy_score(y_test_normal, y_pred_xgb),3)
print("Accuracy:",accuracy_xgb_smote_normal)
print("Classification Report:")
print(classification_report(y_test_normal, y_pred_xgb))

In [None]:
evaluation(y_test_normal,y_pred_xgb)

In [None]:
cm = confusion_matrix(y_test_normal, y_pred_xgb)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

## Tuning Model

In [None]:
from sklearn.model_selection import RandomizedSearchCV

### K-NN

In [None]:
knn_model = KNeighborsClassifier()
param_grid = {
"n_neighbors": range(3, 21),
"metric": ["euclidean", "manhattan", "chebyshev"],
"weights": ["uniform", "distance"],
"algorithm": ["auto", "ball_tree", "kd_tree"],
"leaf_size": range(10, 61),
}
knn_model = RandomizedSearchCV(estimator=knn_model, param_distributions=param_grid, n_iter=100, scoring="accuracy", cv=5)
knn_model.fit(X_train_normal, y_train_normal)
best_params = knn_model.best_params_
print(f"Best parameters: {best_params}")

In [None]:
y_pred_knn = knn_model.predict(X_test_normal)
# Evaluate the KNN model
print("K-Nearest Neighbors (KNN) Model:")
accuracy_knn_smote_normal_Tun = round(accuracy_score(y_test_normal,y_pred_knn),3)
print("Accuracy:", accuracy_knn_smote_normal_Tun)
print("Classification Report:")
print(classification_report(y_test_normal, y_pred_knn))

In [None]:
evaluation(y_test_normal,y_pred_knn)

In [None]:
cm = confusion_matrix(y_test_normal, y_pred_knn)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

In [None]:
# import pickle

# # Assuming you have a trained model stored in the variable 'model'
# # and you want to save it to a file named 'your_model.pkl'

# with open('knnmodel.pkl', 'wb') as model_file:
#     pickle.dump(knn_model, model_file)

In [None]:
# from joblib import dump

# # Assuming you have a trained model stored in the variable 'model'
# # and you want to save it to a file named 'your_model.joblib'

# dump(knn_model, 'knn_model.joblib')

### Random Forest

In [None]:
rf_model = RandomForestClassifier()
param_grid = {
"n_estimators": [100, 200],
"max_depth": [ 10, 15],
"min_samples_leaf": [1, 2],
"min_samples_split": [2, 5],
"max_features": ["sqrt", "log2"],
# "random_state": [42, 100, 200]
}
rf_model = RandomizedSearchCV(rf_model, param_grid, n_iter=100, cv=5, n_jobs=-1,random_state=42)
rf_model.fit(X_train_normal, y_train_normal)
best_params = rf_model.best_params_
print(f"Best parameters: {best_params}")
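
When every entry of the parameter dictionary is a list, RandomizedSearchCV samples candidates without replacement, so it cannot evaluate more candidates than the grid contains. If `n_iter` is larger than the grid, scikit-learn warns and runs every combination once. The sketch below counts the combinations for the grid above using only the standard library; `effective_iterations` is a name introduced here for illustration.

```python
import itertools

# Same grid as in the cell above.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 15],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", "log2"],
}

# The search space is the Cartesian product of all value lists.
combinations = list(itertools.product(*param_grid.values()))
print(len(combinations))

n_iter = 100
effective_iterations = min(n_iter, len(combinations))
print(effective_iterations)
```

With this grid the search is effectively exhaustive, so it behaves like a grid search over all combinations.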

In [None]:
y_pred_rf = rf_model.predict(X_test_normal)
# Evaluate the Random Forest model
print("\nRandom Forest Model:")
accuracy_rf_smote_normal_Tun = round(accuracy_score(y_test_normal, y_pred_rf),3)
print("Accuracy:",accuracy_rf_smote_normal_Tun)
print("Classification Report:")
print(classification_report(y_test_normal, y_pred_rf))

In [None]:
evaluation(y_test_normal,y_pred_rf)

In [None]:
cm = confusion_matrix(y_test_normal, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

### XGBoost

In [None]:
xgb_model = XGBClassifier()
param_grid = {
"max_depth": [3, 5, 7],
"learning_rate": [0.01, 0.1],
"n_estimators": [100, 200],
"gamma": [0, 0.1],
"colsample_bytree": [0.7, 0.8],
}
xgb_model = RandomizedSearchCV(xgb_model, param_grid, n_iter=10, cv=5, n_jobs=-1)
xgb_model.fit(X_train_normal, y_train_normal)
best_params = xgb_model.best_params_
print(f"Best parameters: {best_params}")

In [None]:
y_pred_xgb = xgb_model.predict(X_test_normal)
# Evaluate the XGBoost model
print("\nXGBoost Model:")
accuracy_xgb_smote_normal_Tun = round(accuracy_score(y_test_normal, y_pred_xgb),3)
print("Accuracy:",accuracy_xgb_smote_normal_Tun)
print("Classification Report:")
print(classification_report(y_test_normal, y_pred_xgb))

In [None]:
evaluation(y_test_normal,y_pred_xgb)

In [None]:
cm = confusion_matrix(y_test_normal, y_pred_xgb)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

In [None]:
# import pickle

# # Assuming you have a trained model stored in the variable 'model'
# # and you want to save it to a file named 'your_model.pkl'

# with open('xgBoost_tuning.pkl', 'wb') as model_file:
#     pickle.dump(xgb_model, model_file)

# Evaluasi

In [None]:
model_comp1 = pd.DataFrame({'Model': ['K-Nearest Neighbour','Random Forest','XGBoost'], 
                            'Accuracy': [accuracy_knn_smote*100,
                                         accuracy_rf_smote*100,
                                         accuracy_xgb_smote*100]})
model_comp1.head()

In [None]:
# Bar plot with the value shown above each bar
fig, ax = plt.subplots()
bars = plt.bar(model_comp1['Model'], model_comp1['Accuracy'], color=['red', 'green', 'blue'])
plt.xlabel('Model')
plt.ylabel('Accuracy (%)')
plt.title('Oversample')
plt.xticks(rotation=45, ha='right') # Rotate the x-axis labels for readability
# Add the value above each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom')
plt.show()

In [None]:
model_comp2 = pd.DataFrame({'Model': ['K-Nearest Neighbour','Random Forest','XGBoost'], 
                            'Accuracy': [accuracy_knn_smote_normal*100,
                                         accuracy_rf_smote_normal*100,
                                         accuracy_xgb_smote_normal*100]})

model_comp2.head()

In [None]:
# Bar plot with the value shown above each bar
fig, ax = plt.subplots()
bars = plt.bar(model_comp2['Model'], model_comp2['Accuracy'], color=['red', 'green', 'blue'])
plt.xlabel('Model')
plt.ylabel('Accuracy (%)')
plt.title('Normalization + Oversampling')
plt.xticks(rotation=45, ha='right') # Rotate the x-axis labels for readability
# Add the value above each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom')
plt.show()

In [None]:
model_comp3 = pd.DataFrame({'Model': ['K-Nearest Neighbour','Random Forest','XGBoost'], 
                            'Accuracy': [accuracy_knn_smote_normal_Tun*100,
                                         accuracy_rf_smote_normal_Tun*100,
                                         accuracy_xgb_smote_normal_Tun*100]})

model_comp3.head()

In [None]:
# Bar plot with the value shown above each bar
fig, ax = plt.subplots()
bars = plt.bar(model_comp3['Model'], model_comp3['Accuracy'], color=['red', 'green', 'blue'])
plt.xlabel('Model')
plt.ylabel('Accuracy (%)')
plt.title('Normalization + Oversampling + Tuning')
plt.xticks(rotation=45, ha='right') # Rotate the x-axis labels for readability
# Add the value above each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom')
plt.show()

In [None]:
# Data frame
model_compBest = pd.DataFrame({
    'Model': ['K-Nearest Neighbour (Oversample + Normalization + Tuning)',
              'Random Forest (Oversample + Normalization)',
              'XGBoost (Oversample + Normalization + Tuning)'],
    'Accuracy': [accuracy_knn_smote_normal_Tun*100, accuracy_rf_smote_normal*100,
                 accuracy_xgb_smote_normal_Tun*100]
})

# Bar plot with the value shown above each bar
fig, ax = plt.subplots()
bars = plt.bar(model_compBest['Model'], model_compBest['Accuracy'], color=['red', 'green', 'blue'])
plt.xlabel('Model')
plt.ylabel('Accuracy (%)')
plt.title('Best Model Comparison')
plt.xticks(rotation=45, ha='right') # Rotate the x-axis labels for readability
# Add the value above each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom')
plt.show()

# Conclusion


From the results above, we conclude that handling the class imbalance with SMOTE oversampling, tuning the hyperparameters with RandomizedSearchCV, and normalizing the data had a positive effect on the performance of the classification models. The best model in this study is KNN, which reached the highest accuracy, 93%. The Random Forest model, although its accuracy was high initially, dropped significantly after tuning and imbalance handling. The XGBoost model also performed well, with an accuracy of 92%. Choosing the model is therefore a critical decision when optimizing the overall performance of a classification system, and in this study the best model is KNN.