## 🚀 Next Steps (Weeks 4-5)

### Models and techniques to try:
1. **Random Forest** - An ensemble of many decision trees
2. **XGBoost** - A widely used gradient boosting library
3. **Hyperparameter Tuning** - Search for better model parameters
4. **Cross Validation** - More robust validation than a single train/test split
5. **Model Comparison** - Compare all models on the same metrics
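
As a rough preview of how the cross-validation and model-comparison steps might look, here is a minimal sketch. It runs on synthetic data from `make_regression` (an assumption, standing in for the real `X_train` / `y_train`), and the Random Forest settings are illustrative defaults, not tuned values from this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): swap in X_train / y_train in the notebook.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

# Candidate models: the baseline tree plus a Random Forest with default-like settings.
candidates = {
    "Decision Tree": DecisionTreeRegressor(
        max_depth=10, min_samples_split=10, min_samples_leaf=5, random_state=42
    ),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# 5-fold cross-validation, using the same folds for every model.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_results[name] = scores
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```

Reusing one `KFold` object keeps the comparison fair, because every model is scored on identical splits.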

---

### 📝 Notes:
- This baseline model is the reference point for the models that follow
- Target: R² > 0.85 and MAPE < 10%
- Focus: reduce error and improve generalization

---

**🎉 Week 3 Complete!**

In [1]:
# Summary comparison table
summary_df = pd.DataFrame({
    'Metric': ['R² Score', 'RMSE (ton/ha)', 'MAE (ton/ha)', 'MAPE (%)'],
    'Training': [
        f"{train_metrics['R2']:.4f}",
        f"{train_metrics['RMSE']:.4f}",
        f"{train_metrics['MAE']:.4f}",
        f"{train_metrics['MAPE']:.2f}"
    ],
    'Test': [
        f"{test_metrics['R2']:.4f}",
        f"{test_metrics['RMSE']:.4f}",
        f"{test_metrics['MAE']:.4f}",
        f"{test_metrics['MAPE']:.2f}"
    ]
})

print("\n" + "="*60)
print("📊 BASELINE MODEL PERFORMANCE SUMMARY (DECISION TREE)")
print("="*60)
print(summary_df.to_string(index=False))
print("="*60)

# Interpretation (the R² thresholds are rules of thumb, not formal cutoffs)
print("\n💡 INTERPRETATION:")
if test_metrics['R2'] > 0.8:
    print("   ✅ Model is VERY GOOD (R² > 0.8)")
elif test_metrics['R2'] > 0.6:
    print("   ✅ Model is GOOD (R² > 0.6)")
elif test_metrics['R2'] > 0.4:
    print("   ⚠️  Model is FAIR (R² > 0.4)")
else:
    print("   ❌ Model NEEDS IMPROVEMENT (R² <= 0.4)")

# Check for overfitting: a training R² well above the test R² is the warning sign
r2_diff = train_metrics['R2'] - test_metrics['R2']
if r2_diff < 0.1:
    print("   ✅ No sign of overfitting (R² gap < 0.1)")
elif r2_diff < 0.2:
    print("   ⚠️  Slight overfitting (R² gap < 0.2)")
else:
    print("   ❌ Overfitting detected (R² gap >= 0.2)")

print(f"\n   📉 Average Error: ±{test_metrics['MAE']:.2f} ton/ha")

NameError: name 'pd' is not defined

## 📋 Baseline Model Conclusions

### ✅ What Has Been Done:
1. ✓ Loaded the preprocessed data from Week 2
2. ✓ Implemented a Decision Tree Regressor
3. ✓ Trained the model with manually chosen hyperparameters
4. ✓ Evaluated it with the metrics R², RMSE, MAE, and MAPE
5. ✓ Visualized the prediction results
6. ✓ Analyzed feature importance

### 📊 Model Performance:

In [None]:
# Visualize the tree (max_depth=3 for readability)
plt.figure(figsize=(20, 10))
plot_tree(dt_model, 
          max_depth=3,
          feature_names=list(X_train.columns),  # a plain list works across scikit-learn versions
          filled=True,
          fontsize=10,
          rounded=True)
plt.title('Decision Tree Visualization (Depth=3)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 Note: this shows only the top levels of the tree (up to depth 3).")

## 🔟 Decision Tree Visualization (Optional)

A visualization of the decision tree (only the first few levels).
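
When a figure is hard to read, the same top levels can also be printed as text rules. The sketch below uses scikit-learn's `export_text` on a small synthetic dataset; the data and the feature names `f0` to `f3` are hypothetical placeholders, not the notebook's real columns.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data with hypothetical feature names (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=42
)
feature_names = ["f0", "f1", "f2", "f3"]

tree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=5, random_state=42)
tree.fit(X_demo, y_demo)

# Print only the first few levels as indented if/else rules.
rules = export_text(tree, feature_names=feature_names, max_depth=3)
print(rules)
```

In the notebook, the equivalent call would pass `dt_model` and `list(X_train.columns)`.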

In [None]:
# Get the feature importances
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Show the top 10 features
print("📊 Top 10 Most Important Features:\n")
print(feature_importance.head(10).to_string(index=False))

# Visualization
plt.figure(figsize=(10, 8))
top_10 = feature_importance.head(10)
plt.barh(range(len(top_10)), top_10['Importance'], color='steelblue')
plt.yticks(range(len(top_10)), top_10['Feature'])
plt.xlabel('Importance Score')
plt.title('Top 10 Feature Importance - Decision Tree', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

## 9️⃣ Feature Importance

Which features have the most influence on the predictions?
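
The `feature_importances_` attribute is impurity-based and computed from the training data. As a complementary check, permutation importance on held-out data is one option. This is a different technique from the one used in this notebook, shown here only as a sketch on synthetic data (an assumption) where just the first column drives the target.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): only column 0 carries signal, columns 1-2 are noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
tree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=5, random_state=42)
tree.fit(X_tr, y_tr)

# Impurity-based importances (from training) always sum to 1.
print("Impurity-based:", np.round(tree.feature_importances_, 3))

# Permutation importance: drop in test R² when one column is shuffled.
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=42)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

If the two rankings disagree strongly on real data, that is usually worth a closer look before drawing conclusions about any single feature.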

In [None]:
# Visualization 3: Metric Comparison Bar Chart
metrics_df = pd.DataFrame({
    'Training': [train_metrics['R2'], train_metrics['RMSE'], train_metrics['MAE'], train_metrics['MAPE']],
    'Test': [test_metrics['R2'], test_metrics['RMSE'], test_metrics['MAE'], test_metrics['MAPE']]
}, index=['R² Score', 'RMSE', 'MAE', 'MAPE (%)'])

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

colors = ['#3498db', '#2ecc71']

for idx, metric in enumerate(metrics_df.index):
    ax = axes[idx]
    metrics_df.loc[metric].plot(kind='bar', ax=ax, color=colors, width=0.6)
    ax.set_title(f'{metric}', fontsize=14, fontweight='bold')
    ax.set_ylabel('Value')
    ax.set_xticklabels(['Training', 'Test'], rotation=0)
    ax.grid(axis='y', alpha=0.3)
    
    # Add the value label above each bar
    for i, v in enumerate(metrics_df.loc[metric]):
        ax.text(i, v + 0.01*v, f'{v:.2f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Visualization 2: Residual Plot (Error Analysis)
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Residuals for the training set
train_residuals = y_train - y_train_pred
axes[0].scatter(y_train_pred, train_residuals, alpha=0.5, color='blue', edgecolor='black', linewidth=0.5)
axes[0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0].set_xlabel('Predicted Yield (ton/ha)')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residual Plot - Training Set')
axes[0].grid(alpha=0.3)

# Residuals for the test set
test_residuals = y_test - y_test_pred
axes[1].scatter(y_test_pred, test_residuals, alpha=0.5, color='green', edgecolor='black', linewidth=0.5)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted Yield (ton/ha)')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot - Test Set')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 Interpreting the residual plots:")
print("   - Points scattered randomly around the 0 line = the model fits well")
print("   - A visible pattern = the model is missing some structure in the data")

In [None]:
# Visualization 1: Actual vs Predicted
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot for the training set
axes[0].scatter(y_train, y_train_pred, alpha=0.5, color='blue', edgecolor='black', linewidth=0.5)
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Yield (ton/ha)')
axes[0].set_ylabel('Predicted Yield (ton/ha)')
axes[0].set_title(f'Training Set\nR² = {train_metrics["R2"]:.4f}')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot for the test set
axes[1].scatter(y_test, y_test_pred, alpha=0.5, color='green', edgecolor='black', linewidth=0.5)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Yield (ton/ha)')
axes[1].set_ylabel('Predicted Yield (ton/ha)')
axes[1].set_title(f'Test Set\nR² = {test_metrics["R2"]:.4f}')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 8️⃣ Visualizing the Results

In [None]:
# Function that computes all the metrics
def evaluate_model(y_true, y_pred, dataset_name=""):
    """
    Evaluate a regression model, print the metrics, and return them as a dict.
    """
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    # Note: MAPE is undefined if y_true contains zeros (division by zero)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
    print(f"\n{'='*50}")
    print(f"📊 MODEL EVALUATION - {dataset_name}")
    print(f"{'='*50}")
    print(f"R² Score : {r2:.4f}")
    print(f"RMSE     : {rmse:.4f} ton/ha")
    print(f"MAE      : {mae:.4f} ton/ha")
    print(f"MAPE     : {mape:.2f}%")
    print(f"{'='*50}\n")
    
    return {
        'R2': r2,
        'RMSE': rmse,
        'MAE': mae,
        'MAPE': mape
    }

# Evaluate on the training set
train_metrics = evaluate_model(y_train, y_train_pred, "TRAINING SET")

# Evaluate on the test set
test_metrics = evaluate_model(y_test, y_test_pred, "TEST SET")

## 7️⃣ Evaluating the Model with Metrics

### 📏 Metrics Used:

1. **R² Score (Coefficient of Determination)**
   - Range: -∞ to 1 (the closer to 1, the better)
   - Interpretation: the proportion of variance in the target that the model explains
   
2. **RMSE (Root Mean Squared Error)**
   - Same unit as the target (ton/ha)
   - Sensitive to outliers
   
3. **MAE (Mean Absolute Error)**
   - The average absolute error
   - More robust to outliers than RMSE
   
4. **MAPE (Mean Absolute Percentage Error)**
   - Error expressed as a percentage (%)
   - Easy to interpret, but undefined when an actual value is 0
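
To make the four definitions concrete, here is a small worked example that computes each metric by hand with NumPy and cross-checks the results against scikit-learn. The four actual/predicted values are made up for illustration and are not taken from the yield dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only (not from the dataset).
y_true = np.array([2.0, 4.0, 5.0, 8.0])
y_pred = np.array([2.5, 3.5, 5.0, 7.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                  # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))           # square root of the mean squared error
mape = np.mean(np.abs(errors / y_true)) * 100  # average percentage error

ss_res = np.sum(errors ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
print(f"MAPE = {mape:.2f}%")
print(f"R²   = {r2:.2f}")

# Cross-check against scikit-learn's implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

For these four points the sketch gives MAE = 0.50, RMSE ≈ 0.61, MAPE = 12.50%, and R² = 0.92. The single largest error (1.0) contributes two thirds of the squared-error total, which is why RMSE comes out above MAE.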

In [None]:
# Predict on the training set
y_train_pred = dt_model.predict(X_train)

# Predict on the test set
y_test_pred = dt_model.predict(X_test)

print("✅ Predictions complete!")
print("Sample test predictions (first 5):")
print(f"Actual   : {y_test.head().values}")
print(f"Predicted: {y_test_pred[:5]}")

## 6️⃣ Prediction

In [None]:
import time

# Record the start time
start_time = time.time()

# Train the model
dt_model.fit(X_train, y_train)

# Compute the training time
training_time = time.time() - start_time

print("✅ Model trained successfully!")
print(f"⏱️  Training time: {training_time:.2f} seconds")
print(f"📊 Number of leaf nodes: {dt_model.get_n_leaves()}")
print(f"📏 Actual depth: {dt_model.get_depth()}")

## 5️⃣ Training the Model

In [None]:
# Initialize the Decision Tree Regressor
dt_model = DecisionTreeRegressor(
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)

print("🌳 Decision Tree model initialized successfully!")
print(f"Parameters: {dt_model.get_params()}")

## 4️⃣ Baseline Model: Decision Tree Regressor

### 📖 Why a Decision Tree?
- **Easy to understand**: it mirrors how a person works through a series of yes/no decisions
- **No scaling required**: splits depend only on the ordering of feature values, so the data can be used as is
- **A good baseline**: a reference point for comparing other models
- **Handles non-linearity**: it can capture complex, non-linear patterns
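
The "no scaling required" point can be checked empirically. The sketch below fits the same tree on raw and on standardized features and compares the training predictions. It uses synthetic data from `make_regression` (an assumption), and `StandardScaler` appears only for the comparison; it is not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=5.0, random_state=42
)
X_scaled = StandardScaler().fit_transform(X_demo)

# Same hyperparameters as the baseline model in this notebook.
params = dict(max_depth=10, min_samples_split=10, min_samples_leaf=5, random_state=42)

pred_raw = DecisionTreeRegressor(**params).fit(X_demo, y_demo).predict(X_demo)
pred_scaled = DecisionTreeRegressor(**params).fit(X_scaled, y_demo).predict(X_scaled)

# Standardization preserves the ordering of values within each feature,
# so the tree is expected to partition the samples the same way.
same = np.allclose(pred_raw, pred_scaled)
print(f"Same training predictions with and without scaling: {same}")
```

Distance-based or gradient-based models generally do not share this property, which is worth remembering when comparing other models against this baseline.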

### Hyperparameters Used:
- `max_depth=10`: the tree is at most 10 levels deep
- `min_samples_split=10`: a node needs at least 10 samples to be split
- `min_samples_leaf=5`: every leaf must contain at least 5 samples
- `random_state=42`: for reproducibility
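
To get a feel for what `max_depth` does, the following sketch trains the same tree at several depth limits and compares training and test R². The data is synthetic (an assumption), so the numbers say nothing about the yield dataset; only the qualitative pattern is of interest.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=20.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

depth_scores = []
for depth in [2, 5, 10, None]:  # None means no depth limit
    tree = DecisionTreeRegressor(
        max_depth=depth, min_samples_split=10, min_samples_leaf=5, random_state=42
    )
    tree.fit(X_tr, y_tr)
    train_r2 = tree.score(X_tr, y_tr)  # score() returns R² for regressors
    test_r2 = tree.score(X_te, y_te)
    depth_scores.append((depth, train_r2, test_r2, tree.get_depth()))
    print(f"max_depth={depth!s:>4}: train R² = {train_r2:.3f}, "
          f"test R² = {test_r2:.3f}, actual depth = {tree.get_depth()}")
```

A widening gap between training and test R² as the depth grows is the usual sign that the tree is starting to overfit, which `min_samples_split` and `min_samples_leaf` help to limit.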

In [None]:
# Target variable statistics
print("📊 Target Statistics (Yield_tons_per_hectare):")
print(f"Mean Train  : {y_train.mean():.2f} ton/ha")
print(f"Median Train: {y_train.median():.2f} ton/ha")
print(f"Min Train   : {y_train.min():.2f} ton/ha")
print(f"Max Train   : {y_train.max():.2f} ton/ha")
print(f"Std Train   : {y_train.std():.2f} ton/ha")

# Visualize the target distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(y_train, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
plt.xlabel('Yield (tons/hectare)')
plt.ylabel('Frequency')
plt.title('Target Distribution - Training Set')
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.boxplot(y_train, vert=True)
plt.ylabel('Yield (tons/hectare)')
plt.title('Target Boxplot - Training Set')
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Preview the training features
print("📋 First 5 Rows of the Training Features:")
print(X_train.head())
print("\n" + "="*50)

# List the columns after encoding
print(f"\n📌 List of {len(X_train.columns)} Features:")
for i, col in enumerate(X_train.columns, 1):
    print(f"{i}. {col}")

## 3️⃣ Quick Data Exploration

In [None]:
# Load the 4 preprocessed files
X_train = pd.read_csv('data/X_train.csv')
X_test = pd.read_csv('data/X_test.csv')
y_train = pd.read_csv('data/y_train.csv')
y_test = pd.read_csv('data/y_test.csv')

# Convert y from a DataFrame to a Series so it is easier to work with
y_train = y_train.squeeze()
y_test = y_test.squeeze()

print("📊 Data Information:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape : {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape : {y_test.shape}")
print("\n✅ Data loaded successfully!")

## 2️⃣ Load the Preprocessed Data

We now reload the data that was preprocessed in Week 2.

In [None]:
# Libraries for data manipulation
import pandas as pd
import numpy as np

# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for machine learning
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Plot settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ All libraries imported successfully!")

## 1️⃣ Import Libraries

# 🌾 Week 3: Baseline Model - Decision Tree
## Crop Yield Prediction

**Objectives:**
1. Load the preprocessed data (Week 2)
2. Implement a Decision Tree as the baseline model
3. Evaluate model performance with several metrics
4. Visualize the results

---