# 🌤️ Weather Model Training - Dual Model (Hourly + Daily)

This notebook follows `training_guide.md` to train a **dual model**:
1. **Hourly model** - per-hour predictions (temp, humidity, windspeed, pressure, weather_code)
2. **Daily model** - per-day predictions (temp_min, temp_max, temp_mean, humidity_avg, windspeed_avg, pressure_avg, weather_code_dominant)

**Output:** 7 `.pkl` model files for different deployment needs.

## 1. Environment Setup and Library Imports

In [None]:
# Install dependencies if they are not already present
# !pip install pandas numpy matplotlib seaborn scikit-learn xgboost joblib

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
import os
warnings.filterwarnings('ignore')

# Scikit-learn
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, f1_score, classification_report, confusion_matrix
)

# Regression Models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# XGBoost
try:
    from xgboost import XGBRegressor, XGBClassifier
    XGBOOST_AVAILABLE = True
except ImportError:
    print("XGBoost not installed. Skipping XGBoost models.")
    XGBOOST_AVAILABLE = False

# Joblib for saving models
import joblib

print("✅ All libraries imported successfully!")
print(f"   - Pandas: {pd.__version__}")
print(f"   - NumPy: {np.__version__}")
print(f"   - XGBoost Available: {XGBOOST_AVAILABLE}")

## 2. Data Collection and Loading

In [None]:
# Load dataset (23 columns: hourly + daily features)
DATA_PATH = '../data/historical_data_2000_2024.csv'

df = pd.read_csv(DATA_PATH)

# Convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Sort by time (IMPORTANT for time series)
df = df.sort_values('timestamp').reset_index(drop=True)

print(f"📊 Dataset loaded: {len(df):,} rows x {len(df.columns)} columns")
print(f"📅 Time range: {df['timestamp'].min()} - {df['timestamp'].max()}")
print("\n📋 Dataset columns:")
print(df.columns.tolist())
df.head()

In [None]:
# Data structure info
df.info()

## 3. Exploratory Data Analysis (EDA)

### 3.1 Descriptive Statistics

In [None]:
# Descriptive statistics for the numeric features
df.describe()

### 3.2 Distribution Plots

In [None]:
# Distribution of the main weather parameters (Hourly)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

params = ['temp', 'humidity', 'windspeed', 'sealevelpressure']
titles = ['Temperature (°C)', 'Humidity (%)', 'Wind Speed (km/h)', 'Sea Level Pressure (hPa)']

for ax, param, title in zip(axes.flatten(), params, titles):
    sns.histplot(df[param], kde=True, ax=ax, color='steelblue')
    ax.set_title(f'Distribution of {title}')
    ax.set_xlabel(title)

plt.suptitle('Distribution of Hourly Weather Parameters', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Distribution of the daily weather parameters
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

daily_params = ['temp_max_daily', 'temp_min_daily', 'temp_mean_daily', 
                'humidity_avg_daily', 'pressure_avg_daily', 'windspeed_avg_daily']
titles = ['Temp Max (°C)', 'Temp Min (°C)', 'Temp Mean (°C)', 
          'Humidity Avg (%)', 'Pressure Avg (hPa)', 'Windspeed Avg (km/h)']

for ax, param, title in zip(axes.flatten(), daily_params, titles):
    sns.histplot(df[param].dropna(), kde=True, ax=ax, color='coral')
    ax.set_title(f'Distribution of {title}')
    ax.set_xlabel(title)

plt.suptitle('Distribution of Daily Weather Parameters', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### 3.3 Correlation Analysis

In [None]:
# Correlation heatmap
numeric_cols = ['temp', 'humidity', 'windspeed', 'sealevelpressure', 'rain', 
                'weather_code', 'temp_max_daily', 'temp_min_daily', 'temp_mean_daily',
                'humidity_avg_daily', 'pressure_avg_daily', 'windspeed_avg_daily']

plt.figure(figsize=(14, 10))
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Heatmap of Weather Variables (Hourly + Daily)')
plt.tight_layout()
plt.show()

### 3.4 Correlation Analysis: weather_code and rain

In [None]:
# Relationship between weather_code and rain
weather_rain_analysis = df.groupby('weather_code')[['rain']].agg(['mean', 'min', 'max', 'count'])
print("📊 Weather Code vs Rain:")
weather_rain_analysis

In [None]:
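# Verify the deterministic relationship instead of only asserting it
rain_equals_precip = (df['rain'] == df['precipitation']).all()
codes_ge_50_always_rain = (df.loc[df['weather_code'] >= 50, 'rain'] > 0).all()

print(f"\n🔍 rain == precipitation everywhere: {rain_equals_precip}")
print(f"🔍 Weather codes with rain > 0: {sorted(df[df['rain'] > 0]['weather_code'].unique())}")
print(f"🔍 Weather codes with rain = 0: {sorted(df[df['rain'] == 0]['weather_code'].unique())}")
print(f"🔍 weather_code >= 50 always has rain > 0: {codes_ge_50_always_rain}")

# Conclusion (valid only if the checks above print True)
print("\n✅ CONCLUSION (if the checks above hold):")
print("   - rain and precipitation are IDENTICAL across the dataset")
print("   - weather_code >= 50 ALWAYS means rain (deterministic)")
print("   - No need to predict rain separately; predicting weather_code is enough")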
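
If rain really is determined by `weather_code`, the relationship can be captured as a lookup table. The sketch below shows one way to derive such a table by averaging the observed rain per code. It is an illustration only: the `demo` frame is made-up data and `build_code_to_rain` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

def build_code_to_rain(frame: pd.DataFrame) -> dict:
    """Map each weather_code to its mean observed rain, rounded to 1 decimal."""
    means = frame.groupby("weather_code")["rain"].mean().round(1)
    return {int(code): float(value) for code, value in means.items()}

# Hypothetical toy data standing in for the real dataset
demo = pd.DataFrame({
    "weather_code": [0, 0, 3, 51, 51, 61],
    "rain":         [0.0, 0.0, 0.0, 0.1, 0.3, 1.7],
})

code_to_rain = build_code_to_rain(demo)
print(code_to_rain)
```

A mean is only one possible summary; a median would be less sensitive to a few extreme downpours within a code.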

## 4. Data Preprocessing and Feature Engineering

### 4.1 Hourly Data Preprocessing

In [None]:
# Copy the dataframe for preprocessing
df_hourly = df.copy()

# 1. Label encoding for 'conditions'
le_conditions = LabelEncoder()
df_hourly['conditions_encoded'] = le_conditions.fit_transform(df_hourly['conditions'])

print("📝 Label encoding for 'conditions':")
for i, label in enumerate(le_conditions.classes_):
    print(f"   {i}: {label}")

In [None]:
# 2. Label encoding for 'weather_code' (IMPORTANT for XGBoost)
# XGBoost expects class labels to be consecutive integers (0, 1, 2, ...)
le_weather_code = LabelEncoder()
df_hourly['weather_code_encoded'] = le_weather_code.fit_transform(df_hourly['weather_code'])

print("📝 Label encoding for 'weather_code':")
for i, label in enumerate(le_weather_code.classes_):
    print(f"   {i}: {label}")

In [None]:
# 3. Feature engineering: lag features for the hourly data
hourly_target_cols = ['temp', 'humidity', 'windspeed', 'sealevelpressure']

for col in hourly_target_cols:
    # 1-hour lag
    df_hourly[f'{col}_lag_1'] = df_hourly[col].shift(1)
    # 24-hour lag
    df_hourly[f'{col}_lag_24'] = df_hourly[col].shift(24)
    # 24-hour rolling mean over PAST values only: shift(1) keeps the current
    # hour (the value being predicted) out of its own feature
    df_hourly[f'{col}_rolling_24'] = df_hourly[col].shift(1).rolling(window=24).mean()

n_new_features = len(hourly_target_cols) * 3
print(f"✅ Hourly feature engineering done! New columns: {n_new_features} lag & rolling features")
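
To make the effect of `shift` and `rolling` concrete, here is a small sketch on a made-up six-value series (the column names `roll3_incl` and `roll3_past` are illustrative). It contrasts a rolling mean whose window includes the current row with one computed over past rows only. The first variant would leak the target into the feature if that same row's value is what the model predicts.

```python
import pandas as pd

# Hypothetical hourly series: the value equals the row index
s = pd.Series(range(6), dtype=float, name="temp")

features = pd.DataFrame({
    "temp": s,
    "temp_lag_1": s.shift(1),                            # value one step earlier
    "roll3_incl": s.rolling(window=3).mean(),            # window includes the current row
    "roll3_past": s.shift(1).rolling(window=3).mean(),   # window covers past rows only
})
print(features)
```

In the last row, `roll3_incl` averages the values 3, 4, 5 (including the current 5), while `roll3_past` averages 2, 3, 4. The past-only variant also produces one more leading `NaN` row, which a later `dropna()` removes.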

In [None]:
# 4. Drop rows with NaN (introduced by the lag & rolling features)
rows_before = len(df_hourly)
df_hourly = df_hourly.dropna().reset_index(drop=True)
rows_after = len(df_hourly)

print(f"🗑️ Rows dropped (NaN): {rows_before - rows_after:,}")
print(f"📊 Final hourly dataset: {rows_after:,} rows")

### 4.2 Daily Data Preprocessing

In [None]:
# Aggregate the hourly data into daily rows
df_daily = df.groupby(['year', 'month', 'day']).agg({
    'temp': ['min', 'max', 'mean'],
    'humidity': 'mean',
    'windspeed': 'mean',
    'sealevelpressure': 'mean',
    'weather_code': lambda x: x.mode()[0],  # Dominant weather_code
    'rain': 'sum'  # Total rainfall
}).reset_index()

# Flatten column names
df_daily.columns = ['year', 'month', 'day', 
                    'temp_min', 'temp_max', 'temp_mean',
                    'humidity_avg', 'windspeed_avg', 'pressure_avg',
                    'weather_code_dominant', 'rain_total']

print(f"📊 Daily dataset: {len(df_daily):,} rows (days)")
df_daily.head()
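
An alternative way to express the same aggregation is pandas named aggregation, which assigns the flat column names directly and avoids the separate renaming step. The sketch below is a minimal illustration on made-up data. It groups on the calendar date of `timestamp` rather than on `year`/`month`/`day` columns, which is an assumption for the example, not how the cell above does it.

```python
import pandas as pd

# Hypothetical two-day sample with one reading every 6 hours
demo = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=8, freq="6h"),
    "temp": [24.0, 26.0, 31.0, 27.0, 23.0, 25.0, 30.0, 26.0],
    "weather_code": [0, 3, 3, 61, 0, 0, 51, 0],
})

daily = demo.groupby(demo["timestamp"].dt.date).agg(
    temp_min=("temp", "min"),
    temp_max=("temp", "max"),
    temp_mean=("temp", "mean"),
    # mode() returns sorted values, so ties resolve to the smallest code
    weather_code_dominant=("weather_code", lambda x: x.mode().iloc[0]),
)
print(daily)
```

The tie-breaking note matters for the dominant code: on a day split evenly between two codes, the lower code wins, which tends to favor the drier category.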

In [None]:
# Label encoding for 'weather_code_dominant' (IMPORTANT for XGBoost)
le_weather_code_daily = LabelEncoder()
df_daily['weather_code_dominant_encoded'] = le_weather_code_daily.fit_transform(df_daily['weather_code_dominant'])

print("📝 Label encoding for 'weather_code_dominant':")
for i, label in enumerate(le_weather_code_daily.classes_):
    print(f"   {i}: {label}")

In [None]:
# Daily feature engineering - lag features
df_daily['temp_min_lag_1'] = df_daily['temp_min'].shift(1)   # Yesterday
df_daily['temp_max_lag_1'] = df_daily['temp_max'].shift(1)
df_daily['temp_mean_lag_1'] = df_daily['temp_mean'].shift(1)
df_daily['humidity_avg_lag_1'] = df_daily['humidity_avg'].shift(1)
df_daily['windspeed_avg_lag_1'] = df_daily['windspeed_avg'].shift(1)
df_daily['pressure_avg_lag_1'] = df_daily['pressure_avg'].shift(1)

df_daily['temp_min_lag_7'] = df_daily['temp_min'].shift(7)   # One week ago
df_daily['temp_max_lag_7'] = df_daily['temp_max'].shift(7)
df_daily['temp_mean_lag_7'] = df_daily['temp_mean'].shift(7)
df_daily['rain_total_lag_1'] = df_daily['rain_total'].shift(1)

# Drop rows with NaN
rows_before = len(df_daily)
df_daily = df_daily.dropna().reset_index(drop=True)
rows_after = len(df_daily)

print(f"🗑️ Rows dropped (NaN): {rows_before - rows_after:,}")
print(f"📊 Final daily dataset: {rows_after:,} rows")

## 5. Model Training and Comparison

### 5.1 Data Split (Time-Series Split)

In [None]:
# ===== HOURLY DATA SPLIT =====
hourly_train_size = int(len(df_hourly) * 0.8)
hourly_train = df_hourly[:hourly_train_size]
hourly_test = df_hourly[hourly_train_size:]

print(f"📊 HOURLY Data Split (80-20):")
print(f"   Train: {len(hourly_train):,} rows")
print(f"   Test:  {len(hourly_test):,} rows")

# ===== DAILY DATA SPLIT =====
daily_train_size = int(len(df_daily) * 0.8)
daily_train = df_daily[:daily_train_size]
daily_test = df_daily[daily_train_size:]

print(f"\n📊 DAILY Data Split (80-20):")
print(f"   Train: {len(daily_train):,} rows")
print(f"   Test:  {len(daily_test):,} rows")
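
Later cells rely on names such as `X_hourly_train`, `y_hourly_train_reg`, `hourly_feature_cols`, and `hourly_target_reg`, whose definitions do not appear in the cells shown here. The sketch below illustrates how a feature/target split and an expanding-window cross-validation with `TimeSeriesSplit` might look. Everything in it is an assumption made for illustration: the synthetic `demo` frame, the `feature_cols` and `target_cols` lists, and the single-target setup (the notebook itself predicts several targets at once).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for df_hourly (assumed columns, not the real data)
rng = np.random.default_rng(42)
n = 500
hours = np.arange(n)
demo = pd.DataFrame({
    "hour": hours % 24,
    "temp": 27 + 4 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.3, n),
})
demo["temp_lag_1"] = demo["temp"].shift(1)
demo["temp_lag_24"] = demo["temp"].shift(24)
demo = demo.dropna().reset_index(drop=True)

feature_cols = ["hour", "temp_lag_1", "temp_lag_24"]  # hypothetical feature list
target_cols = ["temp"]                                 # hypothetical target list

# Chronological 80/20 split: no shuffling, the test set is strictly later
split = int(len(demo) * 0.8)
X_train, X_test = demo[feature_cols].iloc[:split], demo[feature_cols].iloc[split:]
y_train, y_test = demo[target_cols].iloc[:split], demo[target_cols].iloc[split:]

# Expanding-window cross-validation on the training part only
fold_mae = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X_train):
    model = RandomForestRegressor(n_estimators=20, random_state=42)
    model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx].values.ravel())
    pred = model.predict(X_train.iloc[val_idx])
    fold_mae.append(mean_absolute_error(y_train.iloc[val_idx], pred))

print([round(m, 3) for m in fold_mae])
```

Each fold trains on an earlier block and validates on the block that follows it, so no fold ever sees data from its own future.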

## 6. Results Analysis and Per-Parameter Performance

### 6.1 Per-Parameter Evaluation (Hourly Regression)

In [None]:
# Per-parameter performance of the best model (Hourly)
print("="*70)
print("PER-PARAMETER EVALUATION (HOURLY REGRESSION)")
print("="*70)

# Use the best model from the comparison
best_reg_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
best_reg_model.fit(X_hourly_train, y_hourly_train_reg)
y_pred = best_reg_model.predict(X_hourly_test)

# Evaluate each parameter
param_results = []
for i, param in enumerate(hourly_target_reg):
    mae = mean_absolute_error(y_hourly_test_reg.iloc[:, i], y_pred[:, i])
    rmse = np.sqrt(mean_squared_error(y_hourly_test_reg.iloc[:, i], y_pred[:, i]))
    r2 = r2_score(y_hourly_test_reg.iloc[:, i], y_pred[:, i])
    param_results.append({"Parameter": param, "MAE": mae, "RMSE": rmse, "R2": r2})

df_param_results = pd.DataFrame(param_results)
display(df_param_results)

### 6.2 Classification Evaluation (Weather Code) - Confusion Matrix

In [None]:
# Confusion matrix for the weather_code classification
best_clf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
best_clf_model.fit(X_hourly_train, y_hourly_train_clf)
y_pred_clf = best_clf_model.predict(X_hourly_test)

# Confusion Matrix
cm = confusion_matrix(y_hourly_test_clf, y_pred_clf)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=le_weather_code.classes_,
            yticklabels=le_weather_code.classes_)
plt.title("Confusion Matrix - Weather Code Classification (Hourly)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()

print("Classification Report:")
print(classification_report(y_hourly_test_clf, y_pred_clf,
                          target_names=[str(c) for c in le_weather_code.classes_]))

### 6.3 Per-Parameter Evaluation (Daily Regression)

In [None]:
# Per-parameter performance of the best model (Daily)
print("="*70)
print("PER-PARAMETER EVALUATION (DAILY REGRESSION)")
print("="*70)

best_daily_reg = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
best_daily_reg.fit(X_daily_train, y_daily_train_reg)
y_pred_daily = best_daily_reg.predict(X_daily_test)

daily_param_results = []
for i, param in enumerate(daily_target_reg):
    mae = mean_absolute_error(y_daily_test_reg.iloc[:, i], y_pred_daily[:, i])
    rmse = np.sqrt(mean_squared_error(y_daily_test_reg.iloc[:, i], y_pred_daily[:, i]))
    r2 = r2_score(y_daily_test_reg.iloc[:, i], y_pred_daily[:, i])
    daily_param_results.append({"Parameter": param, "MAE": mae, "RMSE": rmse, "R2": r2})

df_daily_param = pd.DataFrame(daily_param_results)
display(df_daily_param)

### 6.5 Retraining on the Full Dataset (Final Model)

In [None]:
# Use train + test together for the final training
X_hourly_full = df_hourly[hourly_feature_cols]
y_hourly_reg_full = df_hourly[hourly_target_reg]
y_hourly_clf_full = df_hourly[hourly_target_clf]

X_daily_full = df_daily[daily_feature_cols]
y_daily_reg_full = df_daily[daily_target_reg]
y_daily_clf_full = df_daily[daily_target_clf]

print(f"Full Hourly Data: {len(X_hourly_full):,} rows")
print(f"Full Daily Data: {len(X_daily_full):,} rows")

In [None]:
# Train the final models on 100% of the data
print("="*70)
print("TRAINING FINAL MODELS (100% DATA)")
print("="*70)

# HOURLY Models
print("Training Hourly Regressor...")
hourly_regressor = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
hourly_regressor.fit(X_hourly_full, y_hourly_reg_full)
print("   Hourly Regressor trained")

print("Training Hourly Classifier...")
hourly_classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
hourly_classifier.fit(X_hourly_full, y_hourly_clf_full)
print("   Hourly Classifier trained")

# DAILY Models
print("Training Daily Regressor...")
daily_regressor = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
daily_regressor.fit(X_daily_full, y_daily_reg_full)
print("   Daily Regressor trained")

print("Training Daily Classifier...")
daily_classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
daily_classifier.fit(X_daily_full, y_daily_clf_full)
print("   Daily Classifier trained")

print("="*70)
print("ALL 4 FINAL MODELS TRAINED SUCCESSFULLY!")
print("="*70)

## 7. Saving the Best Models (7 .pkl Files)

In [None]:
# Make sure the models folder exists
os.makedirs("models", exist_ok=True)

# Weather code to rain mapping
weather_code_to_rain = {0:0, 1:0, 2:0, 3:0, 51:0.2, 53:0.7, 55:1.1, 61:1.7, 63:4.0, 65:10.3}

# 1. COMBINED MODEL
combined_package = {
    "hourly": {
        "regressor": hourly_regressor,
        "classifier": hourly_classifier,
        "feature_columns": hourly_feature_cols,
        "target_regression": hourly_target_reg,
        "target_classification": "weather_code",
    },
    "daily": {
        "regressor": daily_regressor,
        "classifier": daily_classifier,
        "feature_columns": daily_feature_cols,
        "target_regression": daily_target_reg,
        "target_classification": "weather_code_dominant",
    },
    "label_encoder_hourly": le_weather_code,
    "label_encoder_daily": le_weather_code_daily,
    "label_encoder_conditions": le_conditions,
    "weather_code_to_rain": weather_code_to_rain,
    "version": "2.1",
    "trained_date": datetime.now().isoformat(),
    "model_type": "combined"
}
joblib.dump(combined_package, "models/weather_model_combined.pkl")
print("1. Combined model saved")

# 2. HOURLY MODEL
hourly_package = {
    "regressor": hourly_regressor,
    "classifier": hourly_classifier,
    "feature_columns": hourly_feature_cols,
    "target_regression": hourly_target_reg,
    "target_classification": "weather_code",
    "label_encoder": le_weather_code,
    "label_encoder_conditions": le_conditions,
    "weather_code_to_rain": weather_code_to_rain,
    "version": "2.1",
    "trained_date": datetime.now().isoformat(),
    "model_type": "hourly"
}
joblib.dump(hourly_package, "models/weather_model_hourly.pkl")
print("2. Hourly model saved")

# 3. DAILY MODEL
daily_package = {
    "regressor": daily_regressor,
    "classifier": daily_classifier,
    "feature_columns": daily_feature_cols,
    "target_regression": daily_target_reg,
    "target_classification": "weather_code_dominant",
    "label_encoder": le_weather_code_daily,
    "weather_code_to_rain": weather_code_to_rain,
    "version": "2.1",
    "trained_date": datetime.now().isoformat(),
    "model_type": "daily"
}
joblib.dump(daily_package, "models/weather_model_daily.pkl")
print("3. Daily model saved")

In [None]:
# 4-7. SEPARATE REGRESSOR & CLASSIFIER FILES

# 4. Hourly Regressor Only
hourly_reg_package = {
    "model": hourly_regressor,
    "feature_columns": hourly_feature_cols,
    "target": hourly_target_reg,
    "version": "2.1",
    "model_type": "hourly_regressor"
}
joblib.dump(hourly_reg_package, "models/weather_model_hourly_regressor.pkl")
print("4. Hourly regressor saved")

# 5. Hourly Classifier Only
hourly_clf_package = {
    "model": hourly_classifier,
    "feature_columns": hourly_feature_cols,
    "target": "weather_code",
    "label_encoder": le_weather_code,
    "weather_code_to_rain": weather_code_to_rain,
    "version": "2.1",
    "model_type": "hourly_classifier"
}
joblib.dump(hourly_clf_package, "models/weather_model_hourly_classifier.pkl")
print("5. Hourly classifier saved")

# 6. Daily Regressor Only
daily_reg_package = {
    "model": daily_regressor,
    "feature_columns": daily_feature_cols,
    "target": daily_target_reg,
    "version": "2.1",
    "model_type": "daily_regressor"
}
joblib.dump(daily_reg_package, "models/weather_model_daily_regressor.pkl")
print("6. Daily regressor saved")

# 7. Daily Classifier Only
daily_clf_package = {
    "model": daily_classifier,
    "feature_columns": daily_feature_cols,
    "target": "weather_code_dominant",
    "label_encoder": le_weather_code_daily,
    "weather_code_to_rain": weather_code_to_rain,
    "version": "2.1",
    "model_type": "daily_classifier"
}
joblib.dump(daily_clf_package, "models/weather_model_daily_classifier.pkl")
print("7. Daily classifier saved")

print(f"\nTotal: 7 model files created in models/ folder!")
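
On the consuming side, a saved package has to be loaded, its encoded predictions decoded back to the original weather codes, and those codes optionally mapped to rain amounts. The sketch below walks through that round trip on a toy classifier. The data, the file name `demo_classifier.pkl`, and the three-entry rain mapping are made up for illustration. Only the dictionary keys (`model`, `label_encoder`, `weather_code_to_rain`) mirror the classifier packages saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for a classifier package (hypothetical data)
raw_codes = np.array([0, 3, 61, 0, 3, 61, 0, 61])
X = np.array([[20.0], [25.0], [30.0], [21.0], [26.0], [31.0], [19.0], [32.0]])

encoder = LabelEncoder()
y = encoder.fit_transform(raw_codes)  # codes 0, 3, 61 become labels 0, 1, 2
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

package = {
    "model": clf,
    "label_encoder": encoder,
    "weather_code_to_rain": {0: 0, 3: 0, 61: 1.7},
}

# Save and reload through a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_classifier.pkl")
    joblib.dump(package, path)
    loaded = joblib.load(path)

# Predict encoded labels, then map them back to the original weather codes
encoded_pred = loaded["model"].predict(np.array([[20.5], [30.5]]))
decoded = loaded["label_encoder"].inverse_transform(encoded_pred)
rain = [loaded["weather_code_to_rain"][int(code)] for code in decoded]
print(decoded, rain)
```

Because joblib files are pickles, they should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.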

## 8. Multi-Step Forecasting (Recursive Strategy)

In [None]:
def recursive_forecast_hourly(model_reg, model_clf, last_known_data, feature_cols, n_hours=24):
    """
    Recursively forecast the weather for the next n_hours.

    last_known_data is a one-row DataFrame holding the starting features.
    Only the *_lag_1 columns and 'hour' are refreshed at each step; every
    other feature (e.g. *_lag_24, *_rolling_24) keeps its starting value.
    """
    predictions = []
    current_data = last_known_data.copy()
    
    for i in range(n_hours):
        X = current_data[feature_cols].values.reshape(1, -1)
        reg_pred = model_reg.predict(X)[0]
        clf_pred = model_clf.predict(X)[0]
        
        pred_dict = {
            "hour_ahead": i + 1,
            "temp": reg_pred[0],
            "humidity": reg_pred[1],
            "windspeed": reg_pred[2],
            "sealevelpressure": reg_pred[3],
            "weather_code_encoded": clf_pred
        }
        predictions.append(pred_dict)
        
        # Feed the predictions back in as the next step's lag-1 features
        current_data["temp_lag_1"] = reg_pred[0]
        current_data["humidity_lag_1"] = reg_pred[1]
        current_data["windspeed_lag_1"] = reg_pred[2]
        current_data["sealevelpressure_lag_1"] = reg_pred[3]
        current_data["hour"] = (current_data["hour"] + 1) % 24
        
    return pd.DataFrame(predictions)

print("recursive_forecast_hourly() defined successfully")
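
Recursive forecasting feeds each prediction back in as an input for the next step. The function above refreshes only the `*_lag_1` columns and `hour`, so any 24-hour lag or rolling-mean features stay frozen at their starting values over the whole horizon. One way to keep them current is to carry a buffer of the latest 24 values, sketched below for a single variable. `DummyModel`, `forecast_with_history`, and the feature order `[lag_1, lag_24, rolling_24]` are hypothetical and exist only for this illustration.

```python
from collections import deque

import numpy as np

class DummyModel:
    """Hypothetical stand-in for a fitted regressor: averages lag_1 and lag_24."""
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return 0.5 * X[:, 0] + 0.5 * X[:, 1]

def forecast_with_history(model, history, n_steps):
    """Recursive forecast where lag_1, lag_24 and the 24-step mean come from a buffer."""
    buffer = deque(history, maxlen=24)  # keeps only the latest 24 values
    if len(buffer) < 24:
        raise ValueError("need at least 24 past values")
    predictions = []
    for _ in range(n_steps):
        lag_1 = buffer[-1]                       # most recent value (observed or predicted)
        lag_24 = buffer[0]                       # value 24 steps back
        rolling_24 = sum(buffer) / len(buffer)   # mean of the last 24 values
        X = np.array([[lag_1, lag_24, rolling_24]])
        y_hat = float(model.predict(X)[0])
        predictions.append(y_hat)
        buffer.append(y_hat)                     # the prediction becomes the newest value
    return predictions

history = [float(v) for v in range(24)]          # hypothetical history: 0.0 .. 23.0
print(forecast_with_history(DummyModel(), history, n_steps=2))
```

With a bounded `deque`, appending a prediction automatically drops the oldest value, so every lagged feature shifts forward together.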

In [None]:
# Example usage of the recursive forecast
last_row = hourly_test.iloc[-1:].copy()

# Run a recursive forecast for the next 72 hours
forecast_df = recursive_forecast_hourly(
    best_reg_model, 
    best_clf_model, 
    last_row, 
    hourly_feature_cols, 
    n_hours=72
)

print("Recursive forecast results (next 72 hours):")
display(forecast_df.head(10))

## 9. Visualizing the Multi-Step Forecast vs Actual

In [None]:
# Compare the recursive forecast against the actual values
actual_72h = hourly_test.tail(72).copy().reset_index(drop=True)

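# Start from the first row of the 72-hour window: assuming the features are
# lags of earlier hours, a row's features predict that same row's targets
start_point = hourly_test.iloc[-72:-71].copy()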
forecast_72h = recursive_forecast_hourly(
    best_reg_model, 
    best_clf_model, 
    start_point, 
    hourly_feature_cols, 
    n_hours=72
)

# Comparison plot
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
params = ["temp", "humidity", "windspeed", "sealevelpressure"]
titles = ["Temperature (°C)", "Humidity (%)", "Wind Speed (km/h)", "Pressure (hPa)"]

for ax, param, title in zip(axes.flatten(), params, titles):
    ax.plot(range(72), actual_72h[param].values, "b-", label="Actual", linewidth=2)
    ax.plot(range(72), forecast_72h[param].values, "r--", label="Recursive forecast", linewidth=2)
    ax.set_title(f"{title} - Actual vs Recursive Forecast")
    ax.set_xlabel("Hours Ahead")
    ax.set_ylabel(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle("Multi-Step Recursive Forecast vs Actual (72 Hours)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

## 10. Visualizing the Impact of Incremental Data

In [None]:
# Learning curve: retrain from scratch on growing fractions of the training
# data (the model is not updated incrementally)
print("="*70)
print("INCREMENTAL DATA IMPACT ANALYSIS")
print("="*70)

fractions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
r2_scores = []

for frac in fractions:
    n_samples = int(len(X_hourly_train) * frac)
    X_subset = X_hourly_train.iloc[:n_samples]
    y_subset = y_hourly_train_reg.iloc[:n_samples]
    
    temp_model = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
    temp_model.fit(X_subset, y_subset)
    
    y_pred = temp_model.predict(X_hourly_test)
    r2 = r2_score(y_hourly_test_reg, y_pred)
    r2_scores.append(r2)
    
    print(f"   {int(frac*100):3d}% data ({n_samples:,} samples): R2 = {r2:.4f}")

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot([f*100 for f in fractions], r2_scores, "bo-", linewidth=2, markersize=8)
plt.xlabel("Percentage of Training Data (%)", fontsize=12)
plt.ylabel("R2 Score on Test Set", fontsize=12)
plt.title("Impact of Incremental Data on Model Performance", fontsize=14, fontweight="bold")
plt.grid(True, alpha=0.3)
plt.xticks([f*100 for f in fractions])
plt.tight_layout()
plt.show()

print(f"\nInsight: R2 goes from {r2_scores[0]:.4f} (10% of the training data) to {r2_scores[-1]:.4f} (100%)")

## Summary

This notebook has covered:

1. **Environment Setup** - Import all libraries
2. **Data Loading** - Load the 23-column dataset
3. **EDA** - Distribution and correlation analysis
4. **Feature Engineering** - Lag features for Hourly and Daily
5. **Model Comparison** - 5 regression models & 4 classification models
6. **Performance Analysis** - Per-parameter evaluation and confusion matrix
7. **Retraining & Saving** - 7 `.pkl` model files saved
8. **Multi-Step Forecasting** - Recursive strategy
9. **Forecast Visualization** - Actual vs Predicted
10. **Incremental Impact** - R2 vs data size

**Output Files:**
```
models/
|- weather_model_combined.pkl
|- weather_model_hourly.pkl
|- weather_model_daily.pkl
|- weather_model_hourly_regressor.pkl
|- weather_model_hourly_classifier.pkl
|- weather_model_daily_regressor.pkl
|- weather_model_daily_classifier.pkl
```