# **1. Import Libraries**

In this step, import the Python libraries needed for data analysis and for building the machine learning models.

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

# **2. Loading the Dataset from the Clustering Results**

Load the clustered dataset from the CSV file into a DataFrame variable.

In [5]:
data = pd.read_csv('clustered_dataset.csv', sep=',')

data.head()

| Unnamed: 0 | provinsi | tahun | upah | jenis_x | daerah_x | periode | gk | ump | daerah_y | jenis_y | peng | avg_wage_permonth | pca1 | pca2 | cluster | dbs_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACEH | 2015 | 11226 | MAKANAN | PERKOTAAN | MARET | 293697.0 | 1900000.0 | PERDESAAN | MAKANAN | 395136.0 | 1975776 | -1.22125 | -0.390123 | 0 | Middle |
| 1 | ACEH | 2015 | 11226 | MAKANAN | PERKOTAAN | MARET | 293697.0 | 1900000.0 | PERDESAAN | NONMAKANAN | 260183.0 | 1975776 | -1.359359 | -0.744973 | 1 | Middle |
| 2 | ACEH | 2015 | 11226 | MAKANAN | PERKOTAAN | MARET | 293697.0 | 1900000.0 | PERDESAAN | TOTAL | 655319.0 | 1975776 | -0.954983 | 0.294011 | 2 | Middle |
| 3 | ACEH | 2015 | 11226 | MAKANAN | PERKOTAAN | MARET | 293697.0 | 1900000.0 | PERKOTAAN | MAKANAN | 466355.0 | 1975776 | -1.148366 | -0.202857 | 3 | Middle |
| 4 | ACEH | 2015 | 11226 | MAKANAN | PERKOTAAN | MARET | 293697.0 | 1900000.0 | PERKOTAAN | NONMAKANAN | 529945.0 | 1975776 | -1.083289 | -0.035652 | 4 | Middle |


In [6]:
print("\nDataset shape:", data.shape)


Dataset shape: (40733, 16)
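
A column named `Unnamed: 0` usually appears when a CSV was written with its row index and then read back without declaring it. The sketch below assumes that is what happened here; it uses a small in-memory sample (a subset of the columns shown in the preview) rather than the real `clustered_dataset.csv`, and the variable names are illustrative only.

```python
import io

import pandas as pd

# Two rows and a subset of columns copied from the preview, held in memory
# so the sketch runs without clustered_dataset.csv.
sample_csv = io.StringIO(
    ",provinsi,tahun,upah,cluster,dbs_category\n"
    "0,ACEH,2015,11226,0,Middle\n"
    "1,ACEH,2015,11226,1,Middle\n"
)

# index_col=0 treats the first column as the row index instead of
# creating an "Unnamed: 0" data column.
sample = pd.read_csv(sample_csv, index_col=0)

print(sample.columns.tolist())
print(sample.shape)
```

If the assumption holds for the real file, the same `index_col=0` argument could be passed to the `pd.read_csv` call above.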


# **3. Data Splitting**

The data splitting step separates the dataset into two parts: a training set and a test set. With `test_size=0.2`, 20% of the rows are held out for testing.

In [7]:
original_features = ['upah', 'gk', 'ump', 'peng', 'avg_wage_permonth']
pca_features = ['pca1', 'pca2']
target = 'cluster'

In [8]:
X = data[original_features]
y = data[target]

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
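
The split above is purely random. When class sizes differ a lot, a stratified split keeps each class's share roughly equal in both parts. The sketch below illustrates this on synthetic labels (not the notebook's data); note that `stratify` raises an error if any class has only one member, so very rare clusters would have to be handled first.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the real `cluster` column.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.choice([-1, 0, 1], size=1000, p=[0.5, 0.3, 0.2])

# stratify=y_demo preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)


def class_share(labels, cls):
    """Return the fraction of `labels` equal to `cls`."""
    return float(np.mean(labels == cls))


share_all = class_share(y_demo, -1)
share_test = class_share(y_te, -1)
print(f"Share of class -1 overall: {share_all:.3f}, in test set: {share_test:.3f}")
```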

# **4. Building the Classification Model**

## **a. Building the Classification Model**

In [14]:
def evaluate_features(X_train, X_test, y_train, y_test):
    """Fit a Random Forest and print its test accuracy and weighted F1."""
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    print(f"Accuracy Score: {accuracy_score(y_test, y_pred):.3f}")
    print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted'):.3f}")

print("Performance with original features:")
X_train_orig = X_train[original_features]
X_test_orig = X_test[original_features] 
evaluate_features(X_train_orig, X_test_orig, y_train, y_test)

print("\nPerformance with PCA components:")
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train[original_features])
X_test_pca = pca.transform(X_test[original_features])
evaluate_features(X_train_pca, X_test_pca, y_train, y_test)

Performance with original features:
Accuracy Score: 0.952
F1 Score: 0.949

Performance with PCA components:
Accuracy Score: 0.960
F1 Score: 0.961
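
Because `evaluate_features` hard-codes its estimator, evaluating a second model means redefining the function. One possible alternative, sketched here on synthetic data, passes the estimator in as an argument and returns the metrics. The helper name `evaluate_model` and the demo data are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return test accuracy and weighted F1 (hypothetical helper)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
    }


# Synthetic three-class data standing in for the notebook's features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=42),
}
results = {
    name: evaluate_model(model, X_tr, X_te, y_tr, y_te)
    for name, model in models.items()
}

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1_weighted']:.3f}")
```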


In [15]:
def evaluate_features(X_train, X_test, y_train, y_test):
    """Fit a Logistic Regression and print its test accuracy and weighted F1."""
    # No solver is set, so scikit-learn's default (lbfgs) is used.
    lr = LogisticRegression(max_iter=100, random_state=42)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)

    print(f"Accuracy Score: {accuracy_score(y_test, y_pred):.3f}")
    print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted'):.3f}")

print("Performance with original features:")
X_train_orig = X_train[original_features]
X_test_orig = X_test[original_features] 
evaluate_features(X_train_orig, X_test_orig, y_train, y_test)

print("\nPerformance with PCA components:")
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train[original_features])
X_test_pca = pca.transform(X_test[original_features])
evaluate_features(X_train_pca, X_test_pca, y_train, y_test)

Performance with original features:


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy Score: 0.461
F1 Score: 0.308

Performance with PCA components:


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy Score: 0.224
F1 Score: 0.242
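
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route, assuming a `StandardScaler` placed in a `Pipeline` in front of the classifier, is shown below on synthetic features with deliberately mismatched magnitudes. The step names `scaler` and `clf` are arbitrary, and the data is not the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; columns are rescaled to very different magnitudes to
# imitate wage-like features measured in different units.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)
X_demo = X_demo * np.array([1.0, 1e3, 1e6, 1e4, 1e5])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
scaled_lr.fit(X_tr, y_tr)

test_accuracy = scaled_lr.score(X_te, y_te)
iterations_used = int(scaled_lr.named_steps["clf"].n_iter_.max())
print(f"Test accuracy: {test_accuracy:.3f}, solver iterations: {iterations_used}")
```

Keeping the scaler inside the pipeline avoids fitting it on the test rows, which would otherwise leak information into training.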


## **b. Evaluating the Classification Model**

In [16]:
param_grid_rf = {
    'n_estimators': [50],
    'max_depth': [None],
    'min_samples_split': [2]
}

grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), 
                              param_grid_rf, 
                              cv=3,
                              scoring='f1_weighted',
                              n_jobs=-1)
grid_search_rf.fit(X_train, y_train)

# Best Random Forest Model
print("\nBest Random Forest Parameters:", grid_search_rf.best_params_)
best_rf_model = grid_search_rf.best_estimator_

# Evaluate Best Model
y_test_pred_best_rf = best_rf_model.predict(X_test)
print("\nBest Random Forest Model Performance:")
print(f"Testing Accuracy: {accuracy_score(y_test, y_test_pred_best_rf):.2f}")
print(f"Testing F1-Score: {f1_score(y_test, y_test_pred_best_rf, average='weighted'):.2f}")


1 fits failed out of a total of 3.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\favia\AppData\Roaming\Python\Python313\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\favia\AppData\Roaming\Python\Python313\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "C:\Users\favia\AppData\Roaming\Python\Python313\site-packages\sklearn\ensemble\_forest.py", line 487, in fit
    trees = Parallel(
    ...<2 lines>...
        prefer="threads",
    )


Best Random Forest Parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50}

Best Random Forest Model Performance:
Testing Accuracy: 0.95
Testing F1-Score: 0.95
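
The grid above contains a single parameter combination, and one of its three fits failed. The sketch below shows, on synthetic data, how a grid with several candidates could be searched with `error_score='raise'` (the option named in the warning) so that a failing fit stops the search instead of being scored as NaN. The grid values are arbitrary examples, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the notebook's training set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)

# Four candidates (2 x 2); the values are illustrative only.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1_weighted",
    error_score="raise",  # surface fit failures immediately
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
for params, score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    print(params, f"{score:.3f}")
```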


## **c. Analysis of the Classification Model Evaluation Results**

# Random Forest Model Analysis

## 1. Model Configuration
- **Number of Decision Trees**: 50 (the only value in the grid; the baseline in cell 14 uses the scikit-learn default of 100)
- **Tree Depth**: Unlimited
- **Minimum Samples to Split a Node**: 2
- **Testing Accuracy**: 0.95 (95%)
- **Testing F1-Score (weighted)**: 0.95 (95%)
- **Training Accuracy and F1-Score**: not printed by the cells above

## 2. Model Strengths
- **Strong Performance**: particularly for a multi-class classification task.
- **No Clear Sign of Overfitting**: test scores remain high, although training scores are needed to confirm this.
- **High F1-Score**: the weighted F1 of 0.95 indicates a good balance between precision and recall.

## 3. Potential Limitations
- **High Accuracy**: possible data leakage should be ruled out.
- **Features Similar to the Target**: make sure no feature is a near-copy of the target.
- **Limited Parameter Grid**: the grid holds a single combination, so the chosen settings may not be optimal.

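# Logistic Regression Model Analysis

## 1. Model Configuration
- **Solver**: `lbfgs` (the scikit-learn default, since the code sets no solver)
- **Max Iterations**: 100
- **Random State**: 42
- **Testing Accuracy**: 0.46 (46%) with the original features, 0.22 (22%) with the PCA components
- **Testing F1-Score (weighted)**: 0.31 (31%) with the original features, 0.24 (24%) with the PCA components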

## 2. Main Problems
- **Low Performance**: accuracy and F1-score are both below 50%.
- **Multi-class Difficulty**: the model struggles with the large number of classes.
- **Class Imbalance**: most classes have a precision and recall of 0.00.

## 3. Failure Analysis
- **Dominance of Class -1**: 3,899 of 8,147 samples (about 48%); 8,147 is the size of the test set.
- **Minor Classes**: most classes have only 1-3 samples.
- **Prediction Bias**: the model tends to predict the majority class.
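
To illustrate the prediction bias described above, the sketch below uses a toy label vector (not the notebook's data) and a `DummyClassifier` that always predicts the most frequent class. Its per-class report shows the same pattern: a reasonable accuracy driven by the dominant class, and precision and recall of 0.00 for every other class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Toy labels: class -1 dominates, classes 0 and 1 are rare.
y_true = np.array([-1] * 6 + [0] * 2 + [1] * 2)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
majority = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = majority.predict(X_dummy)

# zero_division=0 reports 0.0 (without warnings) for classes never predicted.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

for label in ("-1", "0", "1"):
    print(label, report[label]["precision"], report[label]["recall"])
print("accuracy", report["accuracy"])
```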

## 4. Recommended Improvements
- Apply SMOTE to balance the classes
- Reduce the number of classes by merging them
- Increase `max_iter`
- Scale the features, as the convergence warning suggests
- Consider a different solver
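
One of the recommendations above, merging classes, could be sketched as follows: clusters with fewer than a chosen number of samples are pooled into a single label. The threshold `min_samples` and the pooled label `rare_label` are hypothetical values chosen for illustration, and the labels are a toy series rather than the real `cluster` column.

```python
import pandas as pd

# Toy cluster labels: -1 and 0 are frequent, 1, 2 and 3 appear once each.
labels = pd.Series([-1, -1, -1, -1, 0, 0, 0, 1, 2, 3], name="cluster")

min_samples = 3   # hypothetical threshold below which a cluster counts as rare
rare_label = -2   # hypothetical label assigned to all pooled rare clusters

counts = labels.value_counts()
rare_clusters = counts[counts < min_samples].index

# Keep frequent labels unchanged; replace rare ones with the pooled label.
merged = labels.where(~labels.isin(rare_clusters), rare_label)

print(merged.value_counts().to_dict())
```

Whether pooling is appropriate depends on what the clusters mean; merging unrelated clusters trades label detail for trainable class sizes.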

# Key Observations

## 🚀 Random Forest vs. Logistic Regression
- **Random Forest significantly outperforms Logistic Regression**.  
- **RF reaches a test accuracy of 0.95**, while **LR struggles** with the multi-class task and stays below 0.50.  

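## ⚠️ Possible Overfitting in Random Forest
- **Training scores are not reported** in the cells above, although a forest with unlimited depth (`max_depth=None`) usually fits its training data almost perfectly.  
- **Test scores are very high** (**0.95**).  
- Comparing training and test scores would show whether the model is **overfitting**.  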
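
A simple way to check for overfitting is to compute the same metric on the training and test sets and look at the gap. The sketch below does this for a default Random Forest on synthetic data; the data and variable names are illustrative assumptions, and the numbers it prints say nothing about the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the notebook's features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# score() returns mean accuracy for classifiers.
train_acc = forest.score(X_tr, y_tr)
test_acc = forest.score(X_te, y_te)
gap = train_acc - test_acc

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"Gap:            {gap:.3f}")
```

A small gap with high scores on both sides suggests the model generalizes; a large gap points to overfitting.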

## 📉 Class Imbalance
- Many classes have **very few samples** (**1-3 instances**).  
- **Class -1 dominates** with **3,899 samples**.  

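## 🔧 Model Complexity
- The grid contains **a single parameter combination** (**50 trees**, unlimited depth, `min_samples_split=2`), so the model has not really been tuned.  
- **GridSearchCV was run**, but one of its three cross-validation fits failed and its score was set to NaN; the evaluated model is the refitted `best_estimator_`.  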
