# Analysis, Preprocessing, and Machine Learning Modeling on a Men's Shoes Dataset

This notebook walks through a complete analysis of a men's shoes dataset: exploratory data analysis (EDA), data cleaning, and building and evaluating several machine learning models for regression and classification tasks.

### 1. Import the Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# Classification models
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, classification_report, accuracy_score

# Plot styling
plt.style.use('seaborn-v0_8-whitegrid')

### 2. Load and Inspect the Raw Data

In [None]:
df = pd.read_csv('MEN_SHOES.csv')
print("First 5 rows of the data:")
display(df.head())

print("\nDataset info:")
df.info()

**Initial Observations:**
1.  `How_Many_Sold` and `Current_Price` are read as `object` dtype even though they should be numeric, because the values contain the characters '₹' and ','.
2.  The `Current_Price` column has missing values.
3.  `Product_details` contains long, highly varied text. This analysis ignores it, since using it would require more involved NLP techniques.
4.  `Brand_Name` is a categorical feature.
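
The first two observations can be checked directly. The sketch below uses a small made-up frame (the values are illustrative and are not taken from `MEN_SHOES.csv`) to show that currency-formatted strings do not load as a numeric dtype and how missing values can be counted per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw column formats described above
sample = pd.DataFrame({
    "Brand_Name": ["A", "B", "C"],
    "How_Many_Sold": ["2,242", "240", "16,662"],
    "Current_Price": ["₹1,098", np.nan, "₹499"],
})

# Columns holding formatted strings are not numeric dtypes
is_numeric = {col: pd.api.types.is_numeric_dtype(sample[col]) for col in sample.columns}
print(is_numeric)

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```

On the real data, the same two checks would be `df.dtypes` and `df.isna().sum()`.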

### 3. Data Cleaning and Preprocessing

In [None]:
# Helper to clean the price and units-sold columns
def clean_numeric_column(series):
    # Strip the currency symbol and thousands separators from non-numeric columns
    if not pd.api.types.is_numeric_dtype(series):
        series = series.str.replace('₹', '', regex=False).str.replace(',', '', regex=False)
    # Values that still cannot be parsed become NaN
    return pd.to_numeric(series, errors='coerce')

df_cleaned = df.copy()
df_cleaned['Current_Price'] = clean_numeric_column(df_cleaned['Current_Price'])
df_cleaned['How_Many_Sold'] = clean_numeric_column(df_cleaned['How_Many_Sold'])

# Impute missing Current_Price values with the median
median_price = df_cleaned['Current_Price'].median()
# Assign the result back: fillna(inplace=True) on a selected column may not update df_cleaned in recent pandas versions
df_cleaned['Current_Price'] = df_cleaned['Current_Price'].fillna(median_price)

# Drop the Product_details column
df_cleaned = df_cleaned.drop('Product_details', axis=1)

# Drop rows with a missing RATING, if any
df_cleaned.dropna(subset=['RATING'], inplace=True)

print("Dataset info after cleaning:")
df_cleaned.info()

print("\nDescriptive statistics:")
display(df_cleaned.describe())

### 4. Exploratory Data Analysis (EDA)

#### 4.1. Distribution of Numeric Features

In [None]:
numerical_features = ['Current_Price', 'How_Many_Sold', 'RATING']

plt.figure(figsize=(18, 5))
for i, col in enumerate(numerical_features):
    plt.subplot(1, 3, i + 1)
    sns.histplot(df_cleaned[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('figure_1.png')
plt.show()

#### 4.2. Outlier Detection with Boxplots

In [None]:
plt.figure(figsize=(18, 5))
for i, col in enumerate(numerical_features):
    plt.subplot(1, 3, i + 1)
    sns.boxplot(y=df_cleaned[col])
    plt.title(f'Boxplot {col}')
plt.tight_layout()
plt.savefig('figure_2.png')
plt.show()
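
The boxplots flag outliers visually. With default settings, the whiskers extend to 1.5 times the interquartile range (IQR) beyond the quartiles, so the same rule can be applied numerically to count the flagged points. The sketch below is a minimal illustration on made-up prices; the helper name `iqr_outlier_mask` is hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Illustrative prices (not from the dataset); 9999 is an obvious outlier
prices = pd.Series([499, 549, 599, 649, 699, 749, 799, 9999])
mask = iqr_outlier_mask(prices)
flagged = prices[mask].tolist()
print(flagged)
```

Applied to the notebook's data, `iqr_outlier_mask(df_cleaned['Current_Price']).sum()` would give the number of points drawn beyond the whiskers for that column.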

#### 4.3. Correlation Heatmap

In [None]:
plt.figure(figsize=(8, 6))
correlation_matrix = df_cleaned[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numeric Features')
plt.savefig('figure_3.png')
plt.show()

### 5. Preparing for Modeling

#### 5.1. Feature Engineering: Creating the Classification Target
For the classification task, we create a new target variable by binning `RATING` into three categories.
- **0 (Poor)**: Rating <= 3.5
- **1 (Fair)**: 3.5 < Rating <= 4.0
- **2 (Good)**: Rating > 4.0

In [None]:
bins = [0, 3.5, 4.0, 5.0]
labels = [0, 1, 2] # Poor, Fair, Good
df_cleaned['Rating_Category'] = pd.cut(df_cleaned['RATING'], bins=bins, labels=labels, include_lowest=True)

print("Rating category distribution:")
print(df_cleaned['Rating_Category'].value_counts())
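
The bin edges passed to `pd.cut` are right-closed, which is what places a rating of exactly 3.5 in class 0 and exactly 4.0 in class 1. The short check below uses hand-picked ratings on and around the edges (illustrative values, not dataset rows) to confirm the boundary behavior, and shows that a value outside the bins becomes NaN.

```python
import pandas as pd

bins = [0, 3.5, 4.0, 5.0]
labels = [0, 1, 2]

# Ratings chosen to sit on and around the bin edges
ratings = pd.Series([0.0, 3.5, 3.6, 4.0, 4.1, 5.0])
categories = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)
print(categories.tolist())  # [0, 0, 1, 1, 2, 2]

# A rating above 5.0 falls in no bin and is returned as NaN
out_of_range = pd.cut(pd.Series([5.5]), bins=bins, labels=labels, include_lowest=True)
print(out_of_range.isna().tolist())  # [True]
```

If the real `RATING` column could contain values outside 0 to 5, those rows would need to be handled before the stratified split.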

#### 5.2. Defining Features (X) and Targets (y)

In [None]:
features = ['Brand_Name', 'How_Many_Sold', 'Current_Price']

# For regression
X_reg = df_cleaned[features]
y_reg = df_cleaned['RATING']

# For classification
X_class = df_cleaned[features]
y_class = df_cleaned['Rating_Category']

# Column types for the preprocessor
categorical_features = ['Brand_Name']
numerical_features_model = ['How_Many_Sold', 'Current_Price']

#### 5.3. Building the Preprocessing Pipeline

In [None]:
# Numeric pipeline: median imputation, then scaling.
# The imputer guards against NaNs that pd.to_numeric(errors='coerce') may leave in How_Many_Sold.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: one-hot encoding
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both transformers with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features_model),
        ('cat', categorical_transformer, categorical_features)
    ])
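
To make the effect of the preprocessor concrete, the sketch below fits a comparable `ColumnTransformer` on a tiny hypothetical frame (brands "A", "B", "C" and all numbers are invented). It shows the output layout, scaled numeric columns followed by one indicator column per brand seen during fitting, and how `handle_unknown='ignore'` encodes an unseen brand as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows with two brands
train = pd.DataFrame({
    "Brand_Name": ["A", "B", "A", "B"],
    "How_Many_Sold": [10, 20, 30, 40],
    "Current_Price": [500.0, 700.0, 900.0, 1100.0],
})
# A test row whose brand never appeared during fitting
test = pd.DataFrame({"Brand_Name": ["C"], "How_Many_Sold": [25], "Current_Price": [800.0]})

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["How_Many_Sold", "Current_Price"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand_Name"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

train_matrix = demo_preprocessor.fit_transform(train)
test_matrix = demo_preprocessor.transform(test)

# Two scaled numeric columns plus one indicator column per training brand
print(train_matrix.shape)  # (4, 4)
# The unseen brand "C" gets zeros in both brand columns
print(test_matrix[0, 2:].tolist())  # [0.0, 0.0]
```

Because the encoder lives inside the pipeline, it is fitted only on the training portion of each split, so brands that appear only in test data are handled this way automatically.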

#### 5.4. Splitting the Data (Train & Test Split)

In [None]:
# Split for regression
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Split for classification (stratified so class proportions are preserved)
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(X_class, y_class, test_size=0.2, random_state=42, stratify=y_class)
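
The classification split passes `stratify=y_class`, which keeps the class proportions of the full data in both the training and test sets. The sketch below illustrates this on synthetic, deliberately imbalanced labels (not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80% class 2, 15% class 1, 5% class 0
y = np.array([2] * 80 + [1] * 15 + [0] * 5)
X = np.arange(len(y)).reshape(-1, 1)

_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The 20-row test set mirrors the 5/15/80 class proportions
counts = np.bincount(y_test_strat, minlength=3).tolist()
print(counts)  # [1, 3, 16]
```

Note that stratified splitting raises an error when a class has fewer than two members, so very rare rating categories would need to be merged or dropped first.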

### 6. Task 1: Regression Modeling

In [None]:
# Initialize the regression models
models_reg = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'SVR': SVR()
}

results_reg = {}

for name, model in models_reg.items():
    # Build the full pipeline: preprocessor followed by the model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', model)])

    # Fit the model
    pipeline.fit(X_train_reg, y_train_reg)

    # Predict on the test set
    y_pred = pipeline.predict(X_test_reg)

    # Evaluate
    mae = mean_absolute_error(y_test_reg, y_pred)
    mse = mean_squared_error(y_test_reg, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test_reg, y_pred)
    
    results_reg[name] = {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R-squared': r2}
    
    print(f"--- {name} ---")
    print(f"MAE: {mae:.4f}")
    print(f"MSE: {mse:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"R-squared: {r2:.4f}\n")
    
df_results_reg = pd.DataFrame(results_reg).T
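
For reference, the four metrics reported above can be computed by hand. The sketch below uses made-up true and predicted ratings (not model output) and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true and predicted ratings (not produced by any model above)
y_true = np.array([4.0, 3.5, 4.5, 3.0])
y_hat = np.array([3.8, 3.7, 4.1, 3.4])

errors = y_true - y_hat
mae_manual = np.mean(np.abs(errors))       # mean absolute error
mse_manual = np.mean(errors ** 2)          # mean squared error
rmse_manual = np.sqrt(mse_manual)          # root mean squared error, in rating units
# R-squared: 1 minus residual sum of squares over total sum of squares
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae_manual:.4f} MSE={mse_manual:.4f} RMSE={rmse_manual:.4f} R2={r2_manual:.4f}")
# MAE=0.3000 MSE=0.1000 RMSE=0.3162 R2=0.6800

# The hand-computed values agree with scikit-learn
sk_values = (mean_absolute_error(y_true, y_hat),
             mean_squared_error(y_true, y_hat),
             r2_score(y_true, y_hat))
print(np.allclose((mae_manual, mse_manual, r2_manual), sk_values))  # True
```

An R-squared of 0 corresponds to always predicting the mean rating, and negative values mean the model does worse than that baseline.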

### 7. Task 2: Classification Modeling

In [None]:
# Initialize the classification models
models_class = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
    'SVC': SVC(random_state=42, probability=True) # probability=True enables predict_proba; the metrics below do not need it
}

results_class = {}

for name, model in models_class.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model)])
    
    pipeline.fit(X_train_class, y_train_class)
    y_pred = pipeline.predict(X_test_class)
    
    accuracy = accuracy_score(y_test_class, y_pred)
    report = classification_report(y_test_class, y_pred, output_dict=True)
    results_class[name] = {'accuracy': accuracy, 'report': report}
    
    print(f"--- {name} ---")
    print(f"Accuracy: {accuracy:.4f}")
    print(classification_report(y_test_class, y_pred))
    print("="*30 + "\n")
    
accuracies = {name: res['accuracy'] for name, res in results_class.items()}
df_results_class = pd.DataFrame.from_dict(accuracies, orient='index', columns=['Accuracy'])
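
Accuracy is easier to interpret next to a trivial baseline, especially if the rating categories turn out to be imbalanced. The sketch below is an assumption-laden illustration, not part of the original analysis: it uses synthetic labels and scikit-learn's `DummyClassifier` to show the accuracy obtained by always predicting the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic imbalanced labels standing in for Rating_Category
y_train = np.array([2] * 70 + [1] * 20 + [0] * 10)
y_test = np.array([2] * 14 + [1] * 4 + [0] * 2)
X_train = np.zeros((len(y_train), 1))  # the baseline ignores the features
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"{baseline_accuracy:.2f}")  # 0.70
```

In this notebook the same baseline could be fitted on `X_train_class` and `y_train_class`; a model whose accuracy is close to that number has learned little beyond the class frequencies.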

### 8. Hyperparameter Tuning with GridSearchCV

#### 8.1. Tuning the Regression Models

In [None]:
# Parameter grids for the regression models
param_grid_reg = {
    'Decision Tree': {
        'regressor__max_depth': [5, 10, 15, None],
        'regressor__min_samples_leaf': [1, 2, 4]
    },
    'SVR': {
        'regressor__C': [0.1, 1, 10],
        'regressor__gamma': ['scale', 'auto']
    }
}

tuned_results_reg = {}

for name, params in param_grid_reg.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', models_reg[name])])
    
    grid_search = GridSearchCV(pipeline, params, cv=5, scoring='r2', n_jobs=-1)
    grid_search.fit(X_train_reg, y_train_reg)
    
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test_reg)
    
    r2 = r2_score(y_test_reg, y_pred)
    tuned_results_reg[name + ' Tuned'] = {'R-squared': r2, 'Best Params': grid_search.best_params_}
    
    print(f"--- {name} Tuned ---")
    print(f"Best CV R-squared: {grid_search.best_score_:.4f}")
    print(f"Test R-squared: {r2:.4f}")
    print(f"Best Parameters: {grid_search.best_params_}\n")

# Append the tuned test-set R-squared next to the untuned results
# (the other metric columns stay NaN for the tuned rows)
for name, res in tuned_results_reg.items():
    df_results_reg.loc[name, 'R-squared'] = res['R-squared']
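
Two details of the tuning loop are worth spelling out. Parameter names such as `regressor__max_depth` follow the `<step name>__<parameter>` convention, which routes the value to the pipeline step named `regressor`. Also, `grid_search.best_score_` is the mean cross-validated score on the training data, not the test-set score stored in `tuned_results_reg`. The sketch below shows both on synthetic data (the data and the small grid are illustrative assumptions).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=120)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("regressor", DecisionTreeRegressor(random_state=42))])

# "regressor__max_depth" targets max_depth of the step named "regressor"
search = GridSearchCV(pipe, {"regressor__max_depth": [2, 5, None]}, cv=5, scoring="r2")
search.fit(X, y)

# best_score_ equals the mean CV score of the winning candidate
mean_scores = search.cv_results_["mean_test_score"]
print(search.best_params_)
print(np.isclose(search.best_score_, mean_scores[search.best_index_]))  # True
```

The two numbers answer different questions: the CV score guides the choice of hyperparameters, while the held-out test score estimates how the chosen model generalizes.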

#### 8.2. Tuning the Classification Models

In [None]:
# Parameter grids for the classification models
param_grid_class = {
    'Decision Tree': {
        'classifier__max_depth': [5, 10, 15, None],
        'classifier__min_samples_leaf': [1, 2, 4],
        'classifier__criterion': ['gini', 'entropy']
    },
    'KNN': {
        'classifier__n_neighbors': [3, 5, 7, 9],
        'classifier__weights': ['uniform', 'distance']
    },
    'SVC': {
        'classifier__C': [0.1, 1, 10],
        'classifier__gamma': ['scale', 'auto']
    }
}

tuned_results_class = {}

for name, params in param_grid_class.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', models_class[name])])
    
    grid_search = GridSearchCV(pipeline, params, cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train_class, y_train_class)
    
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test_class)
    
    accuracy = accuracy_score(y_test_class, y_pred)
    tuned_results_class[name + ' Tuned'] = {'accuracy': accuracy, 'Best Params': grid_search.best_params_}
    
    print(f"--- {name} Tuned ---")
    print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")
    print(f"Test Accuracy: {accuracy:.4f}")
    print(f"Best Parameters: {grid_search.best_params_}\n")

# Append the tuned test-set accuracy next to the untuned results
for name, res in tuned_results_class.items():
    df_results_class.loc[name, 'Accuracy'] = res['accuracy']

### 9. Visualizing the Model Performance Comparison

In [None]:
# Plot the regression results
plt.figure(figsize=(12, 6))
df_results_reg['R-squared'].sort_values().plot(kind='barh', color=plt.cm.viridis(np.linspace(0, 1, len(df_results_reg))))
plt.title('Regression Model Comparison (R-squared)')
plt.xlabel('R-squared')
plt.savefig('figure_4.png')
plt.show()

# Plot the classification results
plt.figure(figsize=(12, 6))
df_results_class['Accuracy'].sort_values().plot(kind='barh', color=plt.cm.plasma(np.linspace(0, 1, len(df_results_class))))
plt.title('Classification Model Comparison (Accuracy)')
plt.xlabel('Accuracy')
plt.savefig('figure_5.png')
plt.show()

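### 10. Validating Performance with Cross-Validation
We run cross-validation on the best tuned models to check that their performance is robust. Note that the folds are drawn from the full dataset (`X_reg`, `X_class`), including the rows held out as the test set earlier.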

In [None]:
print("--- Cross-Validation on the Best Regression Model (SVR Tuned) ---")
best_reg_model_info = tuned_results_reg['SVR Tuned']
best_svr = SVR(C=best_reg_model_info['Best Params']['regressor__C'], gamma=best_reg_model_info['Best Params']['regressor__gamma'])
cv_pipeline_reg = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', best_svr)])

cv_scores_r2 = cross_val_score(cv_pipeline_reg, X_reg, y_reg, cv=10, scoring='r2')

print(f"R2 scores from 10-fold cross-validation: \n{cv_scores_r2}")
print(f"Mean R2: {cv_scores_r2.mean():.4f}")
print(f"R2 standard deviation: {cv_scores_r2.std():.4f}\n")

print("--- Cross-Validation on the Best Classification Model (SVC Tuned) ---")
best_class_model_info = tuned_results_class['SVC Tuned']
best_svc = SVC(C=best_class_model_info['Best Params']['classifier__C'], gamma=best_class_model_info['Best Params']['classifier__gamma'])
cv_pipeline_class = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', best_svc)])

cv_scores_acc = cross_val_score(cv_pipeline_class, X_class, y_class, cv=10, scoring='accuracy')

print(f"Accuracy scores from 10-fold cross-validation: \n{cv_scores_acc}")
print(f"Mean accuracy: {cv_scores_acc.mean():.4f}")
print(f"Accuracy standard deviation: {cv_scores_acc.std():.4f}")
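
One caveat about `cv=10`: for a regressor, an integer `cv` uses `KFold` without shuffling, so each fold is a contiguous block of rows (classifiers get `StratifiedKFold`, also unshuffled). If the CSV happens to be ordered, for example by rating or brand, fold scores can vary for reasons unrelated to the model. The sketch below, on synthetic data sorted by target, shows how an explicit shuffled splitter could be passed instead; whether the real file is ordered is an assumption that would need checking.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data sorted by target, mimicking a file ordered by rating
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
order = np.argsort(y)
X, y = X[order], y[order]

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("regressor", SVR())])

# Shuffled folds with a fixed seed, instead of contiguous blocks of rows
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores_shuffled = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
print(f"{scores_shuffled.mean():.4f} +/- {scores_shuffled.std():.4f}")
```

In the notebook, the same `cv` object could replace `cv=10` in the `cross_val_score` call for the regression pipeline, with `StratifiedKFold(n_splits=10, shuffle=True, random_state=42)` as the analogous choice for classification.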

### Conclusions

1.  **Preprocessing** cleaned the data and prepared it for modeling.
2.  **EDA** gave insight into the distributions of the features and the relationships between them.
3.  **Regression models**: The tuned **SVR** model showed the best performance in predicting `RATING`, as measured by R-squared.
4.  **Classification models**: The tuned **SVC** model gave the highest accuracy in classifying the rating categories.
5.  **Hyperparameter tuning** improved the performance of most models.
6.  **Cross-validation** confirmed that the performance of the best models (SVR and SVC) is reasonably stable (robust), with a low standard deviation in their evaluation scores.