# **1. Import Library**

Pada tahap ini, Anda perlu mengimpor beberapa pustaka (library) Python yang dibutuhkan untuk analisis data dan pembangunan model machine learning.

In [1]:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from scipy.stats import uniform, randint
from sklearn.preprocessing import LabelEncoder

# **2. Memuat Dataset dari Hasil Clustering**

Memuat dataset hasil clustering dari file CSV ke dalam variabel DataFrame.

In [2]:
data = pd.read_csv('data_with_clusters.csv')
data.head()

Unnamed: 0,Artist,Track,Album,Album_type,Danceability,Energy,Loudness,Speechiness,Acousticness,Instrumentalness,...,Channel,Views,Likes,Comments,Licensed,official_video,Stream,EnergyLiveness,most_playedon,Cluster
0,Gorillaz,Feel Good Inc.,Demon Days,album,0.818,0.705,-6.679,0.177,0.00836,0.00233,...,Gorillaz,693555221.0,6220896.0,169907.0,True,True,1040235000.0,1.150082,Spotify,1
1,Gorillaz,Rhinestone Eyes,Plastic Beach,album,0.676,0.703,-5.815,0.0302,0.0869,0.000687,...,Gorillaz,72011645.0,1079128.0,31003.0,True,True,310083700.0,15.183585,Spotify,0
2,Gorillaz,New Gold (feat. Tame Impala and Bootie Brown),New Gold (feat. Tame Impala and Bootie Brown),single,0.695,0.923,-3.93,0.0522,0.0425,0.0469,...,Gorillaz,8435055.0,282142.0,7399.0,True,True,63063470.0,7.956897,Spotify,0
3,Gorillaz,On Melancholy Hill,Plastic Beach,album,0.689,0.739,-5.81,0.026,1.5e-05,0.509,...,Gorillaz,211754952.0,1788577.0,55229.0,True,True,434663600.0,11.546875,Spotify,0
4,Gorillaz,Clint Eastwood,Gorillaz,album,0.663,0.694,-8.627,0.171,0.0253,0.0,...,Gorillaz,618480958.0,6197318.0,155930.0,True,True,617259700.0,9.942693,Youtube,1


# **3. Data Splitting**

Tahap Data Splitting bertujuan untuk memisahkan dataset menjadi dua bagian: data latih (training set) dan data uji (test set).

In [3]:
# Menentukan fitur dan target
X = data.drop(columns=['Stream', 'Cluster'])
y = data['Cluster']

# Melakukan splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Menampilkan ukuran dataset
print(f"Ukuran data latih (X_train): {X_train.shape}")
print(f"Ukuran data uji (X_test): {X_test.shape}")


Ukuran data latih (X_train): (16475, 23)
Ukuran data uji (X_test): (4119, 23)


# **4. Membangun Model Klasifikasi**


## **a. Membangun Model Klasifikasi**

Setelah memilih algoritma klasifikasi yang sesuai, langkah selanjutnya adalah melatih model menggunakan data latih.

Berikut adalah rekomendasi tahapannya.
1. Pilih algoritma klasifikasi yang sesuai, seperti Logistic Regression, Decision Tree, Random Forest, atau K-Nearest Neighbors (KNN).
2. Latih model menggunakan data latih.

### **a.1 Encoding Data kategorikal**



In [4]:
label_encoders = {}
for col in X_train.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    le.fit(pd.concat([X_train[col], X_test[col]], axis=0))  # Latih encoder dengan gabungan X_train dan X_test
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])
    label_encoders[col] = le  # Simpan encoder jika diperlukan nanti

### **a.2 Membangun model klasifikasi**

In [12]:
# Inisialisasi model Gradient Boosting
gbm = GradientBoostingClassifier(random_state=42)  

pada pelatihan kali ini saya menggunakan algoritma GradientBoostingClassifier

### **a.3 Melatih Model Dengan Data latih**

In [13]:
# Melatih Model Dengan Data Latih
gbm.fit(X_train, y_train)  

### **a.4 Prediksi Pada Data latih**

In [14]:
# Prediksi pada data latih
y_train_pred = gbm.predict(X_train)  

## **b. Evaluasi Model Klasifikasi**

### **b.1 Lakukan prediksi menggunakan data uji.**

In [15]:
# Prediksi pada data uji
y_test_pred = gbm.predict(X_test)  

### **b.2 Evaluasi Pada training set.**

In [16]:
# Evaluasi pada data latih
train_accuracy = accuracy_score(y_train, y_train_pred)
train_f1 = f1_score(y_train, y_train_pred, average='weighted')

### **b.3 Evaluasi Pada Data Uji.**

In [18]:
# Evaluasi pada testing set
test_accuracy = accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred, average='weighted')

### **b.4 Buat confusion matrix untuk melihat detail prediksi benar dan salah.**

In [19]:
# Output hasil
print(f"Training Accuracy: {train_accuracy:.2%}")
print(f"Training F1-Score: {train_f1:.2%}")
print(f"Testing Accuracy: {test_accuracy:.2%}")
print(f"Testing F1-Score: {test_f1:.2%}")
print("\nClassification Report (Testing Set):")
print(classification_report(y_test, y_test_pred))
print("\nConfusion Matrix (Testing Set):")
print(confusion_matrix(y_test, y_test_pred))

Training Accuracy: 99.89%
Training F1-Score: 99.89%
Testing Accuracy: 99.13%
Testing F1-Score: 99.13%

Classification Report (Testing Set):
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1411
           1       0.99      1.00      0.99      1128
           2       0.99      0.99      0.99       729
           3       0.99      0.99      0.99       851

    accuracy                           0.99      4119
   macro avg       0.99      0.99      0.99      4119
weighted avg       0.99      0.99      0.99      4119


Confusion Matrix (Testing Set):
[[1397    0    8    6]
 [   0 1123    0    5]
 [   5    0  724    0]
 [   2   10    0  839]]


## **c. Tuning Model Klasifikasi (Optional)**

Gunakan GridSearchCV, RandomizedSearchCV, atau metode lainnya untuk mencari kombinasi hyperparameter terbaik

In [20]:
# Parameter untuk tuning
param_dist = {
    'n_estimators': randint(50, 171),         
    'learning_rate': uniform(0.01, 0.2),      
    'max_depth': randint(5, 7),              
    'min_samples_split': randint(2, 8),      
    'min_samples_leaf': randint(2, 8),       
    'subsample': uniform(0.7, 0.3),           
    'max_features': ['sqrt', 'log2', None]    
}

# Randomized Search untuk hyperparameter tuning
random_search = RandomizedSearchCV(
    estimator=gbm,
    param_distributions=param_dist,
    n_iter=50,  # Jumlah iterasi pencarian
    cv=3,       # Cross-validation (3 fold)
    scoring='f1',  # Gunakan F1-Score sebagai metrik utama
    random_state=42,
    verbose=1,
    n_jobs=-1
)

# Fit model pada training set
random_search.fit(X_train, y_train)

# Model terbaik dari Randomized Search
best_gbm = random_search.best_estimator_ 

# Prediksi pada training dan testing set
y_train_pred = best_gbm.predict(X_train)
y_test_pred = best_gbm.predict(X_test)

# Evaluasi pada training set
train_accuracy = accuracy_score(y_train, y_train_pred)
train_f1 = f1_score(y_train, y_train_pred, average='weighted')

# Evaluasi pada testing set
test_accuracy = accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred, average='weighted')

# Output hasil
print("Best Hyperparameters (Randomized Search):", random_search.best_params_)
print(f"Training Accuracy: {train_accuracy:.2%}")
print(f"Training F1-Score: {train_f1:.2%}")
print(f"Testing Accuracy: {test_accuracy:.2%}")
print(f"Testing F1-Score: {test_f1:.2%}")
print("\nClassification Report (Testing Set):")
print(classification_report(y_test, y_test_pred))
print("\nConfusion Matrix (Testing Set):")
print(confusion_matrix(y_test, y_test_pred))

Fitting 3 folds for each of 50 candidates, totalling 150 fits


 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan]


Best Hyperparameters (Randomized Search): {'learning_rate': np.float64(0.0849080237694725), 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 4, 'min_samples_split': 6, 'n_estimators': 70, 'subsample': np.float64(0.7468055921327309)}
Training Accuracy: 99.99%
Training F1-Score: 99.99%
Testing Accuracy: 99.20%
Testing F1-Score: 99.20%

Classification Report (Testing Set):
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1411
           1       1.00      1.00      1.00      1128
           2       0.99      0.99      0.99       729
           3       0.98      0.99      0.99       851

    accuracy                           0.99      4119
   macro avg       0.99      0.99      0.99      4119
weighted avg       0.99      0.99      0.99      4119


Confusion Matrix (Testing Set):
[[1394    0    8    9]
 [   0 1124    0    4]
 [   6    0  723    0]
 [   3    3    0  845]]


## **d. Evaluasi Model Klasifikasi setelah Tuning (Optional)**

Berikut adalah rekomendasi tahapannya.
1. Gunakan model dengan hyperparameter terbaik.
2. Hitung ulang metrik evaluasi untuk melihat apakah ada peningkatan performa.

## **e. Analisis Hasil Evaluasi Model Klasifikasi**

Berikut adalah **rekomendasi** tahapannya.
1. Bandingkan hasil evaluasi sebelum dan setelah tuning (jika dilakukan).
  - setelah tuning tidak ada perbedaan yang signifikan pada metrik evaluasi 
2. Identifikasi kelemahan model, seperti:
  - Precision atau Recall rendah untuk kelas tertentu.
     - precision dan Recall sangat tinggi untuk Cluster 2 
     - sedangkan untuk Cluster lain tidak setinggi cluster 1 dengan rincian:
       - Cluster  precision    recall  
        -  0       0.99      0.99      
        -  1       1.00      1.00      
        -  2       0.99      0.99      
        -  3       0.98      0.99      
  - Apakah model mengalami overfitting atau underfitting?
   - Tidak
3. Berikan rekomendasi tindakan lanjutan, seperti mengumpulkan data tambahan atau mencoba algoritma lain jika hasil belum memuaskan.

### Rekomendai Hasil Evaluasi Model KLasifikasi
 