# **1. Import Libraries**

In this step, import the Python libraries needed for data analysis and for building the machine learning models.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **2. Loading the Dataset from the Clustering Results**

Load the clustering results from the CSV file into a DataFrame.

In [9]:
# Load Dataset
df = pd.read_csv('Data/Air_Quality_labeled.csv')
df.head()

| Unnamed: 0 | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | Cluster |
|---|---|---|---|---|---|---|---|
| 0 | -0.554217 | -0.693069 | -0.195111 | -0.636364 | -0.555556 | 0.369231 | 1 |
| 1 | -0.506024 | -0.653465 | -0.195111 | -0.636364 | -0.555556 | 0.369231 | 1 |
| 2 | -0.518072 | -0.663366 | -0.130074 | -0.568182 | -0.555556 | 0.307692 | 1 |
| 3 | -0.53012 | -0.673267 | 0.260147 | -0.545455 | -0.555556 | 0.292308 | 1 |
| 4 | -0.566265 | -0.70297 | 0.325184 | -0.522727 | -0.555556 | 0.292308 | 1 |


# **3. Data Splitting**

The data splitting step separates the dataset into two parts: a training set and a test set.

In [10]:
# Separate the features (X) from the target (y = 'Cluster')
X = df.drop(['Cluster'], axis=1)
y = df['Cluster']

# Split the data into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show the result of the split
print(f"Number of training samples: {len(X_train)}")
print(f"Number of test samples: {len(X_test)}")

Number of training samples: 84153
Number of test samples: 21039
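
When the cluster labels are imbalanced, a plain random split can shift the class proportions slightly between the two subsets. One common option is to pass `stratify=y` to `train_test_split`. The sketch below illustrates this on a small synthetic DataFrame; the toy values and the names `toy`, `X_toy`, and `y_toy` are assumptions for illustration, not the air-quality data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the labeled air-quality table
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "PM2.5": rng.normal(size=1000),
    "PM10": rng.normal(size=1000),
    "Cluster": np.repeat([0, 1], [200, 800]),  # imbalanced labels (20% / 80%)
})

X_toy = toy.drop(columns=["Cluster"])
y_toy = toy["Cluster"]

# stratify=y_toy keeps the 20/80 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_ratio = y_tr.mean()  # share of class 1 in the training set
test_ratio = y_te.mean()   # share of class 1 in the test set
print(f"Class-1 share: train={train_ratio:.3f}, test={test_ratio:.3f}")
```

Whether stratification changes the results for this notebook has not been tested here; it mainly matters when one class is rare.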


# **4. Building the Classification Model**


## **a. Building the Classification Model**

After choosing suitable classification algorithms, the next step is to train the models on the training data.

The recommended steps are as follows.
1. Choose a suitable classification algorithm, such as Logistic Regression, Decision Tree, Random Forest, or K-Nearest Neighbors (KNN).
2. Train the model on the training data.

In [18]:
# Collect the candidate models in one dictionary so they can be trained in a single loop
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
}

Write a narrative or explanation of the algorithms you used.
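
As one possible starting point for that explanation: logistic regression scores each sample with a linear function of the features and passes the score through the sigmoid to obtain a class-1 probability, whereas a decision tree assigns classes through a sequence of threshold splits on individual features. The sketch below checks the logistic regression part by recomputing `predict_proba` by hand on synthetic data; `X_demo` and `y_demo` are illustrative names, not the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with six features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w . x + b, then sigmoid(z) gives P(class = 1)
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with scikit-learn's own probability for class 1
sklearn_proba = clf.predict_proba(X_demo)[:, 1]
manual_pred = (manual_proba >= 0.5).astype(int)

print("Probabilities match:", np.allclose(manual_proba, sklearn_proba))
```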

## **b. Evaluasi Model Klasifikasi**

The **recommended** steps are as follows.
1. Make predictions on the test data.
2. Compute evaluation metrics such as Accuracy and F1-Score (optional: Precision and Recall).
3. Build a confusion matrix to inspect the correct and incorrect predictions in detail.

In [19]:
# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"Model: {name}")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:\n", report)
    print("Confusion Matrix:\n", conf_matrix)
    print("="*50)

Model: Logistic Regression
Accuracy: 0.9998098768952897
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3885
           1       1.00      1.00      1.00     17154

    accuracy                           1.00     21039
   macro avg       1.00      1.00      1.00     21039
weighted avg       1.00      1.00      1.00     21039

Confusion Matrix:
 [[ 3882     3]
 [    1 17153]]
Model: Decision Tree
Accuracy: 0.9857407671467275
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.96      0.96      3885
           1       0.99      0.99      0.99     17154

    accuracy                           0.99     21039
   macro avg       0.98      0.98      0.98     21039
weighted avg       0.99      0.99      0.99     21039

Confusion Matrix:
 [[ 3732   153]
 [  147 17007]]
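
The reported accuracies can be cross-checked against the confusion matrices printed above: accuracy is the sum of the diagonal (correct predictions) divided by the total number of test samples. The helper name `accuracy_from_confusion` is an illustrative choice.

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Confusion matrices copied from the output above
cm_log_reg = [[3882, 3], [1, 17153]]
cm_tree = [[3732, 153], [147, 17007]]

acc_log_reg = accuracy_from_confusion(cm_log_reg)  # 21035 / 21039
acc_tree = accuracy_from_confusion(cm_tree)        # 20739 / 21039

print(f"Logistic Regression: {acc_log_reg:.4f}")  # 0.9998
print(f"Decision Tree: {acc_tree:.4f}")           # 0.9857
```

In absolute terms, Logistic Regression misclassifies 4 of the 21,039 test samples and the Decision Tree 300.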


Write up the evaluation results of the algorithms used; if you used two algorithms, compare their results.

1. Logistic Regression → highest accuracy (99.98%), with 4 of 21,039 test samples misclassified.

2. Decision Tree → lowest accuracy (98.57%), with 300 of 21,039 test samples misclassified.

## **c. Tuning the Classification Model (Optional)**

Use GridSearchCV, RandomizedSearchCV, or another method to search for the best combination of hyperparameters.

In [14]:
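# GridSearchCV for Logistic Regression
param_grid = {
    # NOTE: recent scikit-learn versions reject the string 'none' (they expect None);
    # this is the likely cause of the 30 failed fits reported in the output below.
    'penalty': ['l2', 'none'],
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'saga']
}

# Initialize the model and GridSearchCV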
log_reg = LogisticRegression(max_iter=1000)
grid_search_log_reg = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')

# Fit model to data
grid_search_log_reg.fit(X_train, y_train)

print(f"Best Hyperparameters: {grid_search_log_reg.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search_log_reg.best_score_}")

30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sklea

Best Hyperparameters: {'C': 10, 'penalty': 'l2', 'solver': 'saga'}
Best Cross-Validation Accuracy: 0.9997979847403355




In [15]:
# GridSearchCV for Decision Tree
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeClassifier()
grid_search_dt = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')
grid_search_dt.fit(X_train, y_train)

print(f"Best Hyperparameters: {grid_search_dt.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search_dt.best_score_}")

Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.9870355344432241
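
The Decision Tree grid above contains 2 × 4 × 3 × 3 = 72 candidates, that is 360 fits with 5-fold cross-validation. `RandomizedSearchCV`, already imported in section 1, evaluates only a fixed number of randomly sampled candidates. A minimal sketch on synthetic data follows; `n_iter=10` and the random seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# Same search space as the grid above
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,          # sample 10 of the 72 combinations
    cv=5,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X_demo, y_demo)

n_sampled = len(random_search.cv_results_["params"])
print(f"Evaluated {n_sampled} of 72 possible combinations")
print(f"Best parameters: {random_search.best_params_}")
```

The trade-off is that a random sample may miss the best grid point, so the result should be read as an approximation of the full search.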


In [21]:
# Print the best hyperparameters and best cross-validation accuracy for both models
print("Best Hyperparameters and Best Cross-Validation Accuracy for model logistic regression")
print(f"Best Hyperparameters: {grid_search_log_reg.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search_log_reg.best_score_}")
print("="*50)
print("Best Hyperparameters and Best Cross-Validation Accuracy for model decision tree")
print(f"Best Hyperparameters: {grid_search_dt.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search_dt.best_score_}")
print("="*50)

Best Hyperparameters and Best Cross-Validation Accuracy for model logistic regression
Best Hyperparameters: {'C': 10, 'penalty': 'l2', 'solver': 'saga'}
Best Cross-Validation Accuracy: 0.9997979847403355
Best Hyperparameters and Best Cross-Validation Accuracy for model decision tree
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.9870355344432241


## **d. Evaluating the Classification Model after Tuning (Optional)**

The recommended steps are as follows.
1. Use the model with the best hyperparameters.
2. Recompute the evaluation metrics to see whether performance has improved.

In [23]:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Logistic regression refit with hard-coded hyperparameters.
# NOTE: these values differ from grid_search_log_reg.best_params_ printed above.
log_reg_best = LogisticRegression(C=0.1, penalty='l2', solver='liblinear')
log_reg_best.fit(X_train, y_train)
log_reg_predictions = log_reg_best.predict(X_test)

# Decision tree refit with hard-coded hyperparameters.
# NOTE: these values differ from grid_search_dt.best_params_ printed above.
decision_tree_best = DecisionTreeClassifier(criterion='gini', max_depth=30, min_samples_leaf=2, min_samples_split=5)
decision_tree_best.fit(X_train, y_train)
decision_tree_predictions = decision_tree_best.predict(X_test)

# Evaluate the performance of each model
models = {
    "Logistic Regression": log_reg_best,
    "Decision Tree": decision_tree_best,
}

for model_name, model in models.items():
    print(f"Model: {model_name}")
    predictions = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, predictions))
    print("Classification Report:\n", classification_report(y_test, predictions))
    print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
    print("="*50)


Model: Logistic Regression
Accuracy: 0.9992395075811588
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3885
           1       1.00      1.00      1.00     17154

    accuracy                           1.00     21039
   macro avg       1.00      1.00      1.00     21039
weighted avg       1.00      1.00      1.00     21039

Confusion Matrix:
 [[ 3874    11]
 [    5 17149]]
Model: Decision Tree
Accuracy: 0.9858833594752602
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.97      0.96      3885
           1       0.99      0.99      0.99     17154

    accuracy                           0.99     21039
   macro avg       0.97      0.98      0.98     21039
weighted avg       0.99      0.99      0.99     21039

Confusion Matrix:
 [[ 3757   128]
 [  169 16985]]
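
To make sure the evaluated model really uses the selected hyperparameters, the fitted search object can be used directly: with the default `refit=True`, `best_estimator_` is already retrained on the full training data with `best_params_`. The sketch below shows the pattern on synthetic data; the variable names and the small grid are illustrative, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["gini", "entropy"], "max_depth": [None, 5, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# best_estimator_ carries exactly the parameters in best_params_
best_model = search.best_estimator_
tuned_params = {k: best_model.get_params()[k] for k in search.best_params_}

test_accuracy = accuracy_score(y_te, best_model.predict(X_te))
print(f"Best parameters: {search.best_params_}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Reusing `best_estimator_` avoids retyping hyperparameter values by hand, which is where mismatches between the search result and the evaluated model tend to creep in.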


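## **e. Analysis of the Classification Model Evaluation Results**

1. Logistic Regression

- Before tuning: the most accurate model (test accuracy 0.9998).
- After tuning: test accuracy decreased slightly to 0.9992.

2. Decision Tree

- Before tuning: test accuracy 0.9857.
- After tuning: test accuracy 0.9859, essentially unchanged. The 0.9870 figure is the best cross-validation accuracy from GridSearchCV, not a test-set score.

Note that the models refit in part d use hyperparameter values that differ from the best parameters reported in part c (for example, `C=0.1` with `liblinear` instead of `C=10` with `saga` for Logistic Regression), so the "after tuning" scores do not reflect the configurations that GridSearchCV selected.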

