# **Name : Maulana Agus Setiawan**
# **Student ID (NIM) : 2209106024**

Dataset link: [Steel Industry Dataset](https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption)

# **Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# **Load the Dataset and Initial Information**
#### Read the dataset and inspect its initial structure


In [None]:
df = pd.read_csv("/content/drive/MyDrive/Tugas semester 6/Computer Vision/Steel Industry/Steel_industry_data.csv")
print('='*20)
df.info()
print('='*20)
print(df.isna().sum())  # printed explicitly: a cell only auto-displays its last expression
print('='*20)
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35040 entries, 0 to 35039
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   date                                  35040 non-null  object 
 1   Usage_kWh                             35040 non-null  float64
 2   Lagging_Current_Reactive.Power_kVarh  35040 non-null  float64
 3   Leading_Current_Reactive_Power_kVarh  35040 non-null  float64
 4   CO2(tCO2)                             35040 non-null  float64
 5   Lagging_Current_Power_Factor          35040 non-null  float64
 6   Leading_Current_Power_Factor          35040 non-null  float64
 7   NSM                                   35040 non-null  int64  
 8   WeekStatus                            35040 non-null  object 
 9   Day_of_week                           35040 non-null  object 
 10  Load_Type                             35040 non-null  object 
dtypes: float64(6), 

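|   | date | Usage_kWh | Lagging_Current_Reactive.Power_kVarh | Leading_Current_Reactive_Power_kVarh | CO2(tCO2) | Lagging_Current_Power_Factor | Leading_Current_Power_Factor | NSM | WeekStatus | Day_of_week | Load_Type |
|---|------|-----------|--------------------------------------|--------------------------------------|-----------|------------------------------|------------------------------|-----|------------|-------------|-----------|
| 0 | 01/01/2018 00:15 | 3.17 | 2.95 | 0.0 | 0.0 | 73.21 | 100.0 | 900 | Weekday | Monday | Light_Load |
| 1 | 01/01/2018 00:30 | 4.0 | 4.46 | 0.0 | 0.0 | 66.77 | 100.0 | 1800 | Weekday | Monday | Light_Load |
| 2 | 01/01/2018 00:45 | 3.24 | 3.28 | 0.0 | 0.0 | 70.28 | 100.0 | 2700 | Weekday | Monday | Light_Load |
| 3 | 01/01/2018 01:00 | 3.31 | 3.56 | 0.0 | 0.0 | 68.09 | 100.0 | 3600 | Weekday | Monday | Light_Load |
| 4 | 01/01/2018 01:15 | 3.82 | 4.5 | 0.0 | 0.0 | 64.72 | 100.0 | 4500 | Weekday | Monday | Light_Load |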


# **Drop Irrelevant Columns**

#### The `date` column has no direct effect on predicting `Load_Type`, and a raw timestamp string gives the model no usable numeric information without further time-feature extraction. The time of day remains available through `NSM`, which in the preview above equals the number of seconds since midnight (900 at 00:15).

In [None]:
df.drop(columns=['date'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35040 entries, 0 to 35039
Data columns (total 10 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Usage_kWh                             35040 non-null  float64
 1   Lagging_Current_Reactive.Power_kVarh  35040 non-null  float64
 2   Leading_Current_Reactive_Power_kVarh  35040 non-null  float64
 3   CO2(tCO2)                             35040 non-null  float64
 4   Lagging_Current_Power_Factor          35040 non-null  float64
 5   Leading_Current_Power_Factor          35040 non-null  float64
 6   NSM                                   35040 non-null  int64  
 7   WeekStatus                            35040 non-null  object 
 8   Day_of_week                           35040 non-null  object 
 9   Load_Type                             35040 non-null  object 
dtypes: float64(6), int64(1), object(3)
memory usage: 2.7+ MB
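
If the timestamp were kept instead of dropped, one possible extraction is sketched below. It rebuilds the first five `date` values from the preview, assumes the day-first format `%d/%m/%Y %H:%M`, and derives `hour`, `seconds_since_midnight`, and `day_name` columns. These column names are illustrative and are not part of the original notebook.

```python
import pandas as pd

# First five timestamps and NSM values copied from the dataset preview.
sample = pd.DataFrame({
    "date": ["01/01/2018 00:15", "01/01/2018 00:30", "01/01/2018 00:45",
             "01/01/2018 01:00", "01/01/2018 01:15"],
    "NSM": [900, 1800, 2700, 3600, 4500],
})

# Assumption: dates are written day-first (dd/mm/yyyy).
ts = pd.to_datetime(sample["date"], format="%d/%m/%Y %H:%M")

# Hypothetical derived features.
sample["hour"] = ts.dt.hour
sample["seconds_since_midnight"] = ts.dt.hour * 3600 + ts.dt.minute * 60
sample["day_name"] = ts.dt.day_name()

print(sample)
```

For these rows the derived seconds match `NSM`, which suggests the time of day is still available to the model after `date` is dropped.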


# **Encoding Categorical Labels**
#### Label encoding converts categorical data into a numeric format that machine learning algorithms can process. The `WeekStatus`, `Day_of_week`, and `Load_Type` columns hold categories rather than numbers.

In [None]:
label_encoders = {}
for col in ['WeekStatus', 'Day_of_week', 'Load_Type']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
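
To make the resulting integer codes explicit, the sketch below fits a `LabelEncoder` on the three `Load_Type` values that occur in this dataset. `LabelEncoder` sorts classes alphabetically, so the codes do not follow a Light < Medium < Maximum ordering. The short `sample` list is illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

sample = ["Light_Load", "Medium_Load", "Maximum_Load", "Light_Load"]

le = LabelEncoder()
codes = le.fit_transform(sample)

# classes_ is sorted; a class's position in it is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original strings.
print(le.inverse_transform(codes).tolist())
```

This alphabetical ordering is why the classification report lists the classes as Light_Load, Maximum_Load, Medium_Load.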

# **Data Cleaning**
#### A `Z-score` is used to detect outliers in the numeric data. With a threshold of 3, any row containing a value more than 3 standard deviations from its column mean is treated as an outlier and removed. Why? Outliers can distort training and encourage overfitting, especially for scale-sensitive algorithms such as SVM. Note that the categorical columns were label-encoded in the previous step, so `select_dtypes(include=np.number)` includes them in this check as well.

In [None]:
z_scores = np.abs(stats.zscore(df.select_dtypes(include=np.number)))
df = df[(z_scores < 3).all(axis=1)]
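
The filtering rule is easier to see on a toy frame. The sketch below uses made-up columns `a` and `b` (not taken from the dataset), with one extreme value in `a`. That row is dropped because its absolute z-score exceeds 3.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data: 19 ordinary values and one extreme value in column "a".
toy = pd.DataFrame({
    "a": [10.0] * 19 + [100.0],
    "b": np.arange(20, dtype=float),
})

# Column-wise absolute z-scores (zscore uses ddof=0 by default).
z = np.abs(stats.zscore(toy))

# Keep only rows where every column lies within 3 standard deviations.
filtered = toy[(z < 3).all(axis=1)]

print(len(toy), "rows before,", len(filtered), "rows after")
```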

# **Normalization (Scaling)**
#### `MinMaxScaler` rescales every feature to the 0–1 range. Why is this needed? SVM (like many other algorithms) is sensitive to feature scale: without normalization, features with large values dominate the computation.

In [None]:
scaler = MinMaxScaler()
features = df.drop(columns='Load_Type')
target = df['Load_Type']
features_scaled = scaler.fit_transform(features)
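
As a small illustration of what the scaler computes, each value is mapped to `(x - min) / (max - min)` per column. The arrays below are made up. The sketch also shows fitting on one array and reusing the learned minimum and maximum on another, which is how a fitted scaler would be applied to held-out data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: one small-range feature and one large-range feature.
X_fit = np.array([[0.0, 1000.0],
                  [5.0, 3000.0],
                  [10.0, 5000.0]])
X_new = np.array([[2.5, 6000.0]])

scaler = MinMaxScaler()
X_fit_scaled = scaler.fit_transform(X_fit)  # learns the per-column min and max

# transform() reuses the learned min/max, so unseen values may fall outside [0, 1].
X_new_scaled = scaler.transform(X_new)

print(X_fit_scaled)
print(X_new_scaled)
```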

# **Feature Engineering - PCA (Principal Component Analysis)**
#### `PCA` is used for dimensionality reduction. `n_components=0.95` keeps just enough principal components to retain 95% of the total variance in the data. The aim is to reduce overfitting, speed up training, and suppress unimportant noise in the features.

In [None]:
pca = PCA(n_components=0.95)
features_pca = pca.fit_transform(features_scaled)
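
To check how many components the 95% threshold actually keeps, the fitted object exposes `n_components_` and `explained_variance_ratio_`. The sketch below runs on synthetic data (an assumption, since the real feature matrix is not reproduced here) in which the third feature nearly duplicates the first.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic features: the third column is almost a copy of the first.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("cumulative variance:", np.cumsum(pca.explained_variance_ratio_))
```

Running the same two `print` calls on the notebook's own `pca` object would show how many of the nine features survive the reduction.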

# **Split Dataset: Train and Test**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features_pca, target, test_size=0.2, random_state=42,
)
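
The classes are imbalanced (in the classification report further down, Light_Load has roughly twice the support of either other class), so a stratified split is one option worth considering. The sketch below uses made-up labels to show that `stratify` keeps class proportions the same in both partitions. It is a variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with a 60/20/20 class split.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 20 + [2] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class counts in each partition follow the original 60/20/20 ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```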

# **Training the SVM Model**
#### Uses a **Support Vector Classifier** with the `RBF` (Radial Basis Function) kernel. The `RBF` kernel is suited to non-linear classification; here it is trained on the features produced by PCA.

In [None]:
model = SVC(kernel='rbf')
model.fit(X_train, y_train)
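
In the cells above, the scaler and PCA are fitted on the full dataset before the split, so statistics from the test rows influence the transformation. One way to avoid this is to wrap the steps in a `Pipeline` and fit it on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features. It is an alternative arrangement, not the notebook's original method, and its scores are unrelated to those reported here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the nine features and three load classes.
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5,
    n_classes=3, random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=0.95)),
    ("svm", SVC(kernel="rbf")),
])

# fit() learns min/max, principal components, and the SVM from training rows only.
pipe.fit(X_train, y_train)

# score() applies the already-fitted transforms to the test rows before predicting.
print("test accuracy:", pipe.score(X_test, y_test))
```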

# **Model Evaluation**

*   `Confusion matrix` → shows correct and incorrect predictions for each class.
*   `Classification report` → precision, recall, f1-score.



In [None]:
y_pred = model.predict(X_test)

In [None]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoders['Load_Type'].classes_))

Confusion Matrix:
[[3361   30  213]
 [  92 1091  192]
 [ 225  578  996]]

Classification Report:
              precision    recall  f1-score   support

  Light_Load       0.91      0.93      0.92      3604
Maximum_Load       0.64      0.79      0.71      1375
 Medium_Load       0.71      0.55      0.62      1799

    accuracy                           0.80      6778
   macro avg       0.76      0.76      0.75      6778
weighted avg       0.80      0.80      0.80      6778
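
The report's figures can be recomputed directly from the confusion matrix, which makes the weak spot easier to see. In the sketch below the matrix is copied from the output above. Rows are actual classes and columns are predicted classes, in the order Light_Load, Maximum_Load, Medium_Load.

```python
import numpy as np

# Confusion matrix copied from the output (rows = actual, columns = predicted).
cm = np.array([[3361,   30,  213],
               [  92, 1091,  192],
               [ 225,  578,  996]])
labels = ["Light_Load", "Maximum_Load", "Medium_Load"]

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # correct / actual count per class
precision = true_positives / cm.sum(axis=0)  # correct / predicted count per class
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The largest off-diagonal entry is 578: Medium_Load samples predicted as Maximum_Load. This single cell accounts for most of the low Medium_Load recall (0.55) and the low Maximum_Load precision (0.64).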



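# **Random Prediction Examples**
#### A quick spot check of the model's predictions on a few random samples from the test set. Five samples are too few to measure accuracy; the classification report above remains the reference.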

In [None]:
random_indices = random.sample(range(len(X_test)), 5)

print("\nExample classification results (random samples from the test set):")
for i, idx in enumerate(random_indices):
    actual = label_encoders['Load_Type'].inverse_transform([y_test.iloc[idx]])[0]
    predicted = label_encoders['Load_Type'].inverse_transform([y_pred[idx]])[0]
    print(f"Sample {i+1}: Actual = {actual}, Predicted = {predicted}")

Example classification results (random samples from the test set):
Sample 1: Actual = Medium_Load, Predicted = Medium_Load
Sample 2: Actual = Light_Load, Predicted = Light_Load
Sample 3: Actual = Light_Load, Predicted = Light_Load
Sample 4: Actual = Medium_Load, Predicted = Medium_Load
Sample 5: Actual = Maximum_Load, Predicted = Maximum_Load
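
Because `random.sample` is called without a seed, the five indices change on every run. If a repeatable spot check is wanted, the generator can be seeded first. In the sketch below the seed value 42 is arbitrary, and `n_test` is set to the test-set size shown in the classification report.

```python
import random

n_test = 6778  # test-set size taken from the classification report

random.seed(42)  # arbitrary seed, chosen only for illustration
first_draw = random.sample(range(n_test), 5)

random.seed(42)
second_draw = random.sample(range(n_test), 5)

# Same seed, same indices.
print(first_draw == second_draw)
```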
