# 🌤️ Weather Model Training - Model Selection

This notebook follows `training_guide.md` up to **Point 5: Selecting the Best Model**.

**Goals:**
1. Load and explore the historical weather dataset (2000-2024)
2. Engineer features for time-series data
3. Compare several algorithms to select the best model
4. Determine the best model for regression and for classification

## 1. Environment Setup and Library Imports

In [None]:
# Install dependencies if they are not already present
# !pip install pandas numpy matplotlib seaborn scikit-learn xgboost joblib

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, f1_score, classification_report, confusion_matrix
)

# Regression Models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# XGBoost
try:
    from xgboost import XGBRegressor, XGBClassifier
    XGBOOST_AVAILABLE = True
except ImportError:
    print("XGBoost not installed. Skipping XGBoost models.")
    XGBOOST_AVAILABLE = False

# Joblib for saving models
import joblib

print("✅ All libraries imported successfully!")
print(f"   - Pandas: {pd.__version__}")
print(f"   - NumPy: {np.__version__}")
print(f"   - XGBoost Available: {XGBOOST_AVAILABLE}")

## 2. Data Collection and Loading

In [None]:
# Load dataset
DATA_PATH = 'data/historical_data_2000_2024.csv'

df = pd.read_csv(DATA_PATH)

# Convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Sort by time (IMPORTANT for time-series)
df = df.sort_values('timestamp').reset_index(drop=True)

print(f"📊 Dataset loaded: {len(df):,} rows x {len(df.columns)} columns")
print(f"📅 Time range: {df['timestamp'].min()} - {df['timestamp'].max()}")
df.head()

In [None]:
# Inspect the data structure
df.info()
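
The lag features in section 4 use `shift(1)` and `shift(24)`, which only mean "1 hour ago" and "24 hours ago" if consecutive rows are exactly one hour apart. A minimal sketch of a continuity check follows, run on a toy series rather than the real dataset. The helper name `summarize_time_gaps` and the one-hour default are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def summarize_time_gaps(timestamps: pd.Series, expected: str = "1h") -> dict:
    """Count duplicate timestamps and steps that differ from the expected spacing."""
    ts = pd.to_datetime(timestamps).sort_values().reset_index(drop=True)
    steps = ts.diff().dropna()
    return {
        "duplicates": int(ts.duplicated().sum()),
        "irregular_steps": int((steps != pd.Timedelta(expected)).sum()),
    }

# Toy data: one duplicated hour (01:00) and one missing hour (03:00)
toy = pd.Series(pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 01:00",
    "2024-01-01 02:00", "2024-01-01 04:00",
]))
gap_report = summarize_time_gaps(toy)
print(gap_report)
```

On the real data this would be called as `summarize_time_gaps(df['timestamp'])`. Any non-zero count suggests resampling or de-duplicating before building lags.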

## 3. Exploratory Data Analysis (EDA)

### 3.1 Descriptive Statistics

In [None]:
# Descriptive statistics for numeric features
df.describe()

### 3.2 Distribution Visualization

In [None]:
# Visualize the distributions of the main weather parameters
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

params = ['temp', 'humidity', 'windspeed', 'sealevelpressure']
titles = ['Temperature (°C)', 'Humidity (%)', 'Wind Speed (km/h)', 'Sea Level Pressure (hPa)']

for ax, param, title in zip(axes.flatten(), params, titles):
    sns.histplot(df[param], kde=True, ax=ax, color='steelblue')
    ax.set_title(f'Distribution of {title}')
    ax.set_xlabel(title)

plt.tight_layout()
plt.show()

### 3.3 Correlation Analysis

In [None]:
# Correlation heatmap
numeric_cols = ['temp', 'humidity', 'windspeed', 'sealevelpressure', 'rain',
                'precipitation', 'apparent_temperature', 'surface_pressure', 'weather_code']

plt.figure(figsize=(12, 8))
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Heatmap of Weather Variables')
plt.tight_layout()
plt.show()

### 3.4 Correlation Analysis: weather_code, conditions, and rain

In [None]:
# Analyze the relationship between weather_code and rain
weather_rain_analysis = df.groupby('weather_code')[['rain', 'precipitation']].agg(['mean', 'min', 'max', 'count'])
print("📊 Weather Code vs Rain/Precipitation:")
weather_rain_analysis

In [None]:
# Analyze the relationship between conditions and rain
conditions_analysis = df.groupby('conditions')[['rain']].agg(['mean', 'count'])
print("📊 Conditions vs Rain:")
conditions_analysis

In [None]:
# Verification: rain == precipitation
print(f"\n🔍 Verify rain == precipitation: {(df['rain'] == df['precipitation']).all()}")
print(f"🔍 Weather codes with rain > 0: {sorted(df[df['rain'] > 0]['weather_code'].unique())}")
print(f"🔍 Weather codes with rain = 0: {sorted(df[df['rain'] == 0]['weather_code'].unique())}")

# Conclusions (hard-coded; confirm them against the values printed above)
print("\n✅ CONCLUSIONS:")
print("   - rain and precipitation are IDENTICAL across the whole dataset")
print("   - weather_code >= 50 ALWAYS means rain (deterministic)")
print("   - No need to predict rain separately; predicting weather_code is enough")
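
The conclusions above are printed unconditionally, so it can help to test the rule directly. The sketch below checks, on a toy frame, whether `rain > 0` holds exactly when `weather_code >= 50` (both directions, which is slightly stricter than the printed claim). The helper name `rain_matches_code_threshold` is hypothetical; the threshold of 50 is taken from the conclusion text.

```python
import pandas as pd

def rain_matches_code_threshold(frame: pd.DataFrame, threshold: int = 50) -> bool:
    """True if rain > 0 exactly when weather_code >= threshold."""
    rainy_by_code = frame["weather_code"] >= threshold
    rainy_by_value = frame["rain"] > 0
    return bool((rainy_by_code == rainy_by_value).all())

# Toy data that follows the rule
toy = pd.DataFrame({
    "weather_code": [0, 3, 51, 63],
    "rain": [0.0, 0.0, 0.2, 4.1],
})
consistent = rain_matches_code_threshold(toy)

# A single dry row carrying a rain code breaks the rule
toy_bad = toy.assign(rain=[0.0, 0.0, 0.0, 4.1])
inconsistent = rain_matches_code_threshold(toy_bad)
print(consistent, inconsistent)
```

Calling `rain_matches_code_threshold(df)` on the real dataset would confirm or refute the deterministic relationship before `rain` is dropped as a separate target.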

## 4. Data Preprocessing and Feature Engineering

In [None]:
# Copy the dataframe for preprocessing
df_processed = df.copy()

# 1. Label-encode 'conditions'
le_conditions = LabelEncoder()
df_processed['conditions_encoded'] = le_conditions.fit_transform(df_processed['conditions'])

print("📝 Label encoding for 'conditions':")
for i, label in enumerate(le_conditions.classes_):
    print(f"   {i}: {label}")

In [None]:
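# 2. Feature Engineering: Lag Features (assumes one row per hour)
target_cols = ['temp', 'humidity', 'windspeed', 'sealevelpressure']

for col in target_cols:
    # 1-hour lag
    df_processed[f'{col}_lag_1'] = df_processed[col].shift(1)
    # 24-hour lag
    df_processed[f'{col}_lag_24'] = df_processed[col].shift(24)
    # Rolling mean of the previous 24 hours; shift(1) keeps the current
    # (target) value out of the window to avoid target leakage
    df_processed[f'{col}_rolling_24'] = df_processed[col].shift(1).rolling(window=24).mean()

# feature_cols below expects year/month/day/hour; derive them from timestamp
# only if the dataset does not already provide them
for part in ['year', 'month', 'day', 'hour']:
    if part not in df_processed.columns:
        df_processed[part] = getattr(df_processed['timestamp'].dt, part)

n_new = 3 * len(target_cols)
print(f"✅ Feature engineering done! New columns: {n_new} lag & rolling features")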
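
A rolling mean computed on the raw column includes the current row, which is the very value the model is asked to predict. The sketch below contrasts that with a window shifted back by one row so it only sees past values. The toy series and the window of 3 are illustrative choices, not values from the notebook.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0], name="temp")

# Window ends at the current row: the target leaks into its own feature
rolling_with_current = s.rolling(window=3).mean()

# Window ends one row earlier: only past values are used
rolling_past_only = s.shift(1).rolling(window=3).mean()

comparison = pd.DataFrame({
    "temp": s,
    "with_current": rolling_with_current,
    "past_only": rolling_past_only,
})
print(comparison)
```

The shifted variant costs one extra leading NaN row, which is negligible on a dataset of this size.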

In [None]:
# 3. Drop rows with NaN (caused by lag & rolling)
rows_before = len(df_processed)
df_processed = df_processed.dropna().reset_index(drop=True)
rows_after = len(df_processed)

print(f"🗑️ Rows dropped (NaN): {rows_before - rows_after:,}")
print(f"📊 Final dataset: {rows_after:,} rows")

In [None]:
# Preview the preprocessing results
print("📋 Columns after preprocessing:")
print(df_processed.columns.tolist())
df_processed.head()

## 5. Model Training and Comparison

### 5.1 Data Split (Time-Series Split)

In [None]:
# IMPORTANT: chronological split (NOT a random split)
# The data MUST be sorted by timestamp

train_size = int(len(df_processed) * 0.8)
train_df = df_processed[:train_size]
test_df = df_processed[train_size:]

print(f"📊 Data Split (80-20 Chronological):")
print(f"   Train: {len(train_df):,} rows ({train_df['timestamp'].min()} - {train_df['timestamp'].max()})")
print(f"   Test:  {len(test_df):,} rows ({test_df['timestamp'].min()} - {test_df['timestamp'].max()})")
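
A single 80/20 split gives only one estimate of out-of-sample error. `TimeSeriesSplit`, which the notebook imports, produces several expanding-window folds in which training rows always precede test rows. The sketch below shows the fold layout on 10 dummy rows; `n_splits=3` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(10).reshape(-1, 1)  # 10 rows, already in time order
tscv = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in tscv.split(X_demo):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

The same splitter can be passed as `cv=tscv` to `cross_val_score` so that each candidate model is scored on several chronological folds instead of one.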

In [None]:
# Define features (X) and targets (y)

# Features for the models
feature_cols = [
    'year', 'month', 'day', 'hour',
    'temp_lag_1', 'temp_lag_24', 'temp_rolling_24',
    'humidity_lag_1', 'humidity_lag_24', 'humidity_rolling_24',
    'windspeed_lag_1', 'windspeed_lag_24', 'windspeed_rolling_24',
    'sealevelpressure_lag_1', 'sealevelpressure_lag_24', 'sealevelpressure_rolling_24'
]

# Targets for regression
target_regression = ['temp', 'humidity', 'windspeed', 'sealevelpressure']

# Target for classification
target_classification = 'weather_code'

# Separate X and y
X_train = train_df[feature_cols]
X_test = test_df[feature_cols]

y_train_reg = train_df[target_regression]
y_test_reg = test_df[target_regression]

y_train_clf = train_df[target_classification]
y_test_clf = test_df[target_classification]

print(f"✅ Features (X): {len(feature_cols)} columns")
print(f"✅ Regression targets: {target_regression}")
print(f"✅ Classification target: {target_classification}")

### 5.2 Regression Model Comparison

In [None]:
# Define the regression models
regression_models = {
    'Linear Regression': LinearRegression(),
    'K-Neighbors': KNeighborsRegressor(n_neighbors=5),
    'Decision Tree': DecisionTreeRegressor(random_state=42, max_depth=10),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
}

if XGBOOST_AVAILABLE:
    regression_models['XGBoost'] = XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1, verbosity=0)

print(f"📋 Regression models to compare: {list(regression_models.keys())}")
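
K-Neighbors relies on distances, so a feature with a large numeric range (for example `year`, or pressure in hPa) can dominate the others. One common remedy is to wrap the estimator in a pipeline with `StandardScaler`, which the notebook imports but does not use. The sketch below demonstrates the effect on synthetic data only; whether scaling helps on this weather dataset would need to be measured.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two synthetic features on very different scales; only the small one drives y
small = rng.normal(0.0, 1.0, size=200)
large = rng.normal(1000.0, 100.0, size=200)
X_demo = np.column_stack([small, large])
y_demo = 3.0 * small

# The scaler is fitted on the training rows only, when the pipeline is fitted
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_scaled.fit(X_demo[:160], y_demo[:160])
r2_scaled = knn_scaled.score(X_demo[160:], y_demo[160:])

# Same model without scaling, for comparison
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_demo[:160], y_demo[:160])
r2_raw = knn_raw.score(X_demo[160:], y_demo[160:])
print(f"R2 scaled: {r2_scaled:.3f}, R2 raw: {r2_raw:.3f}")
```

A pipeline can replace the bare estimator in `regression_models` without changing the training loop, because it exposes the same `fit` and `predict` methods.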

In [None]:
# Regression evaluation helper
def evaluate_regression(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R2': r2}

# Train and evaluate all regression models
regression_results = []

for name, model in regression_models.items():
    print(f"\n🔄 Training {name}...")

    # Train model (multi-output)
    model.fit(X_train, y_train_reg)

    # Predict
    y_pred = model.predict(X_test)

    # Evaluate each target separately
    for i, target in enumerate(target_regression):
        metrics = evaluate_regression(y_test_reg[target], y_pred[:, i])
        metrics['Model'] = name
        metrics['Target'] = target
        regression_results.append(metrics)

    # Overall metrics: averaged across the four targets, which have different units,
    # so they are useful for ranking models rather than as physical error magnitudes
    overall_metrics = evaluate_regression(y_test_reg, y_pred)
    overall_metrics['Model'] = name
    overall_metrics['Target'] = 'OVERALL'
    regression_results.append(overall_metrics)

    print(f"   ✅ {name} - R²: {overall_metrics['R2']:.4f}, RMSE: {overall_metrics['RMSE']:.4f}")

print("\n✅ All regression models trained!")
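
Because `temp_lag_1` and its siblings are among the features, a useful reference point is the persistence forecast, which predicts that each variable keeps its value from one hour earlier. A model only adds value if it beats this baseline. The sketch below computes the baseline on a synthetic hourly series; the column names mirror the notebook's naming, but the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
hours = np.arange(24 * 30)
# Synthetic hourly temperature: daily cycle plus noise
temp = 27 + 4 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.3, hours.size)

demo = pd.DataFrame({"temp": temp})
demo["temp_lag_1"] = demo["temp"].shift(1)
demo = demo.dropna().reset_index(drop=True)

# Persistence: the prediction for hour t is the observed value at hour t-1
y_true = demo["temp"]
y_persist = demo["temp_lag_1"]

baseline_rmse = float(np.sqrt(mean_squared_error(y_true, y_persist)))
baseline_r2 = float(r2_score(y_true, y_persist))
print(f"Persistence baseline - RMSE: {baseline_rmse:.3f}, R2: {baseline_r2:.3f}")
```

On the real test set the equivalent would be comparing `test_df['temp']` against `test_df['temp_lag_1']`, and likewise for the other three targets.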

In [None]:
# Show the regression comparison results
df_reg_results = pd.DataFrame(regression_results)

# Keep only the OVERALL rows
df_reg_overall = df_reg_results[df_reg_results['Target'] == 'OVERALL'].copy()
df_reg_overall = df_reg_overall.sort_values('R2', ascending=False)

print("\n📊 REGRESSION MODEL COMPARISON RESULTS (Overall):")
print("="*70)
display(df_reg_overall[['Model', 'MSE', 'RMSE', 'MAE', 'R2']].reset_index(drop=True))

In [None]:
# Visualize the R² comparison per model
plt.figure(figsize=(10, 5))
colors = ['#2ecc71' if x == df_reg_overall['R2'].max() else '#3498db' for x in df_reg_overall['R2']]
plt.barh(df_reg_overall['Model'], df_reg_overall['R2'], color=colors)
plt.xlabel('R² Score')
plt.title('R² Score Comparison - Regression Models')
plt.xlim(0, 1)
for i, v in enumerate(df_reg_overall['R2']):
    plt.text(v + 0.01, i, f'{v:.4f}', va='center')
plt.tight_layout()
plt.show()

# Best regression model
best_reg_model = df_reg_overall.iloc[0]['Model']
print(f"\n🏆 BEST MODEL FOR REGRESSION: {best_reg_model}")

### 5.3 Classification Model Comparison

In [None]:
# Define the classification models
classification_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42, n_jobs=-1),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
}

if XGBOOST_AVAILABLE:
    classification_models['XGBoost'] = XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1, verbosity=0)

print(f"📋 Classification models to compare: {list(classification_models.keys())}")

In [None]:
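# Train and evaluate all classification models
classification_results = []

# XGBClassifier expects labels 0..n-1, while weather_code values may not be
# consecutive integers (e.g. codes >= 50 for rain), so encode the training
# labels and map predictions back to the original codes
le_weather = LabelEncoder().fit(y_train_clf)
y_train_clf_enc = le_weather.transform(y_train_clf)

for name, model in classification_models.items():
    print(f"\n🔄 Training {name}...")

    # Train model on the encoded labels
    model.fit(X_train, y_train_clf_enc)

    # Predict, then decode back to weather_code values
    y_pred = le_weather.inverse_transform(model.predict(X_test))

    # Evaluate
    accuracy = accuracy_score(y_test_clf, y_pred)
    f1_macro = f1_score(y_test_clf, y_pred, average='macro', zero_division=0)
    f1_weighted = f1_score(y_test_clf, y_pred, average='weighted', zero_division=0)

    classification_results.append({
        'Model': name,
        'Accuracy': accuracy,
        'F1 (Macro)': f1_macro,
        'F1 (Weighted)': f1_weighted
    })

    print(f"   ✅ {name} - Accuracy: {accuracy:.4f}, F1 (Weighted): {f1_weighted:.4f}")

print("\n✅ All classification models trained!")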
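
The loop reports accuracy, macro F1, and weighted F1. These can diverge sharply when classes are imbalanced, which is plausible for weather codes (an assumption here: common codes far outnumber rare ones in this dataset). The toy example below scores a degenerate classifier that always predicts the majority code; the label values are made up.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 8 samples of code 0 and 2 samples of code 61
y_true = [0] * 8 + [61] * 2
# A degenerate model that always predicts the majority code
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={acc:.3f}, macro F1={f1_macro:.3f}, weighted F1={f1_weighted:.3f}")
```

Accuracy and weighted F1 look respectable even though the rare class is never predicted, while macro F1 exposes the problem. Ranking models by accuracy alone can therefore favor models that ignore rare weather codes.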

In [None]:
# Show the classification comparison results
df_clf_results = pd.DataFrame(classification_results)
df_clf_results = df_clf_results.sort_values('Accuracy', ascending=False)

print("\n📊 CLASSIFICATION MODEL COMPARISON RESULTS:")
print("="*70)
display(df_clf_results.reset_index(drop=True))

In [None]:
# Visualize the accuracy comparison per model
plt.figure(figsize=(10, 5))
colors = ['#2ecc71' if x == df_clf_results['Accuracy'].max() else '#e74c3c' for x in df_clf_results['Accuracy']]
plt.barh(df_clf_results['Model'], df_clf_results['Accuracy'], color=colors)
plt.xlabel('Accuracy')
plt.title('Accuracy Comparison - Classification Models')
plt.xlim(0, 1)
for i, v in enumerate(df_clf_results['Accuracy']):
    plt.text(v + 0.01, i, f'{v:.4f}', va='center')
plt.tight_layout()
plt.show()

# Best classification model
best_clf_model = df_clf_results.iloc[0]['Model']
print(f"\n🏆 BEST MODEL FOR CLASSIFICATION: {best_clf_model}")

### 5.4 Model Selection Conclusions

In [None]:
print("="*70)
print("🎯 BEST MODEL SELECTION - CONCLUSIONS")
print("="*70)

# Best Regression
best_reg = df_reg_overall.iloc[0]
print(f"\n📈 REGRESSION (temp, humidity, windspeed, pressure):")
print(f"   🏆 Best Model: {best_reg['Model']}")
print(f"   📊 R² Score: {best_reg['R2']:.4f}")
print(f"   📊 RMSE: {best_reg['RMSE']:.4f}")
print(f"   📊 MAE: {best_reg['MAE']:.4f}")

# Best Classification
best_clf = df_clf_results.iloc[0]
print(f"\n📊 CLASSIFICATION (weather_code):")
print(f"   🏆 Best Model: {best_clf['Model']}")
print(f"   📊 Accuracy: {best_clf['Accuracy']:.4f}")
print(f"   📊 F1 (Weighted): {best_clf['F1 (Weighted)']:.4f}")

print("\n" + "="*70)
print("✅ Next step: retrain on 100% of the data, then save the model")
print("="*70)
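
The next step named in the cell above, retraining on all data and saving the model, could look roughly like this with `joblib`, which the notebook already imports. This is a sketch on synthetic data that writes to a temporary directory; the filename `best_regressor.pkl` and the use of `LinearRegression` are placeholders, not the guide's actual choices.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_all = rng.normal(size=(100, 3))
y_all = X_all @ np.array([1.0, -2.0, 0.5]) + 0.1

# Retrain the chosen model type on 100% of the (here synthetic) data
final_model = LinearRegression().fit(X_all, y_all)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_regressor.pkl")  # hypothetical filename
    joblib.dump(final_model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = bool(np.allclose(final_model.predict(X_all), restored.predict(X_all)))
print(f"Restored model reproduces predictions: {same_predictions}")
```

Any fitted encoders used during training (such as the `LabelEncoder` for `conditions`) would need to be saved alongside the model so inference can apply the same mapping.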

---

## Summary

This notebook has completed:

1. ✅ **Environment Setup** - Imported all required libraries
2. ✅ **Data Loading** - Loaded the 227K-row dataset (2000-2024)
3. ✅ **EDA** - Analyzed distributions, correlations, and the relationship between weather_code and rain
4. ✅ **Feature Engineering** - Lag features and rolling means
5. ✅ **Model Comparison** - 5 regression models and 4 classification models (counts include XGBoost, which is skipped when it is not installed)

**Next steps (Points 6-10 in training_guide.md):**
- Per-parameter performance analysis
- Retraining on 100% of the data
- Saving the models to .pkl
- Multi-step forecasting
- Visualizing the results
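
Of the next steps listed, multi-step forecasting interacts most directly with the lag features: beyond the first step the true lag values are unknown, so predictions have to be fed back in as inputs. Below is a minimal recursive sketch with a single lag-1 feature and a linear model on a synthetic series. The guide's actual approach is not shown in this notebook, so treat this as one possible strategy; `recursive_forecast` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic hourly series with a daily cycle
hours = np.arange(24 * 20)
series = 27 + 4 * np.sin(2 * np.pi * hours / 24)

# One-step model: predict the value at t from the value at t-1
X_lag = series[:-1].reshape(-1, 1)
y_next = series[1:]
one_step = LinearRegression().fit(X_lag, y_next)

def recursive_forecast(model, last_value: float, horizon: int) -> list:
    """Forecast `horizon` steps, feeding each prediction back as the next lag."""
    forecasts = []
    current = last_value
    for _ in range(horizon):
        current = float(model.predict(np.array([[current]]))[0])
        forecasts.append(current)
    return forecasts

forecast = recursive_forecast(one_step, last_value=float(series[-1]), horizon=6)
print([round(v, 2) for v in forecast])
```

With the notebook's full feature set, each step would also need the 24-hour lag and rolling-mean columns rebuilt from a buffer of recent observations and predictions. Errors compound over the horizon, so accuracy should be evaluated per forecast step.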