# Week 8: Machine Learning Applications
## Advanced ML Techniques for Heatwave Analysis

**Instructor**: Sohn Chul

---

## 🎯 Learning Objectives

By the end of this session, you will be able to:
1. Implement deep learning models for KMA heat index prediction
2. Apply clustering algorithms to identify heatwave patterns
3. Develop anomaly detection systems for extreme events
4. Build real-time prediction pipelines
5. Create interpretable ML models for policy support

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Deep Learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, callbacks
from tensorflow.keras.optimizers import Adam

# XGBoost
import xgboost as xgb

# Other utilities
from scipy import stats
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)
tf.random.set_seed(42)

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("✅ Libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")

## 2. Prepare Data with KMA Heat Index

In [None]:
# Generate comprehensive dataset
np.random.seed(42)

# Create date range
dates = pd.date_range('2025-04-01', '2025-08-31', freq='10min')
n = len(dates)

# Generate weather variables
hours = dates.hour + dates.minute/60
days = (dates - dates[0]).days

# Temperature with complex patterns
temp_seasonal = 20 + (days / 30) * 3
temp_daily = 7 * np.sin((hours - 6) * np.pi / 12)
temp_noise = np.random.normal(0, 2, n)
temperature = temp_seasonal + temp_daily + temp_noise

# Humidity
humidity_base = 70 - temperature * 0.5
humidity_daily = 10 * np.sin((hours - 12) * np.pi / 12)
humidity = humidity_base + humidity_daily + np.random.normal(0, 5, n)
humidity = np.clip(humidity, 30, 95)

# Additional features
wind_speed = np.abs(np.random.normal(2, 1.5, n))
solar_radiation = np.maximum(0, 500 * np.sin((hours - 6) * np.pi / 12) + np.random.normal(0, 50, n))
pressure = 1013 + np.random.normal(0, 10, n)
pm25 = np.abs(np.random.exponential(25, n))
pm10 = np.abs(np.random.exponential(40, n))

# KMA Heat Index Calculation
def calculate_wet_bulb_temperature(Ta, RH):
    """Calculate wet-bulb temperature using Stull's formula."""
    Tw = (Ta * np.arctan(0.151977 * (RH + 8.313659)**0.5) + 
          np.arctan(Ta + RH) - 
          np.arctan(RH - 1.676331) +
          0.00391838 * RH**1.5 * np.arctan(0.023101 * RH) - 
          4.686035)
    return Tw

def calculate_heat_index_kma(Ta, RH):
    """Calculate heat index using KMA formula."""
    Tw = calculate_wet_bulb_temperature(Ta, RH)
    HI = (-0.2442 + 0.55399 * Tw + 0.45535 * Ta - 
          0.0022 * Tw**2 + 0.00278 * Tw * Ta + 3.0)
    return HI

# Calculate KMA heat index
heat_index = calculate_heat_index_kma(temperature, humidity)

# Create DataFrame
df = pd.DataFrame({
    'datetime': dates,
    'temperature': temperature,
    'humidity': humidity,
    'wind_speed': wind_speed,
    'solar_radiation': solar_radiation,
    'pressure': pressure,
    'pm25': pm25,
    'pm10': pm10,
    'heat_index': heat_index
})

# Add time features
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Add heat stress categories (KMA standards)
df['heat_stress'] = pd.cut(df['heat_index'], 
                           bins=[-np.inf, 25, 28, 31, 35, np.inf],
                           labels=['Comfortable', 'Caution', 'Extreme Caution', 
                                  'Danger', 'Extreme Danger'])

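# Resample to hourly for ML models (average the numeric columns only;
# the categorical 'heat_stress' column cannot be averaged)
df_indexed = df.set_index('datetime')
df_hourly = df_indexed.drop(columns='heat_stress').resample('H').mean()
# Hourly category = most frequent 10-minute category within the hour
df_hourly['heat_stress'] = df_indexed['heat_stress'].resample('H').agg(
    lambda x: x.mode()[0] if len(x) > 0 else 'Comfortable')
df_hourly = df_hourly.reset_index()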

print(f"✅ Dataset created: {len(df_hourly)} hourly records")
print("\n📊 Heat Index Statistics (KMA):")
print(df_hourly['heat_index'].describe())
print("\n📊 Heat Stress Distribution:")
print(df.groupby('heat_stress').size())
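
As a quick sanity check on the two formulas above, the sketch below re-declares them under shorter, hypothetical names (`wet_bulb_stull`, `heat_index_kma`) and evaluates a single point. At 20°C and 50% relative humidity, Stull's fit gives a wet-bulb temperature of roughly 13.7°C. The input values are illustrative and not taken from the dataset.

```python
import numpy as np

def wet_bulb_stull(Ta, RH):
    """Wet-bulb temperature (°C) from air temperature (°C) and RH (%), Stull's fit."""
    return (Ta * np.arctan(0.151977 * np.sqrt(RH + 8.313659))
            + np.arctan(Ta + RH)
            - np.arctan(RH - 1.676331)
            + 0.00391838 * RH**1.5 * np.arctan(0.023101 * RH)
            - 4.686035)

def heat_index_kma(Ta, RH):
    """Same polynomial as calculate_heat_index_kma in the cell above."""
    Tw = wet_bulb_stull(Ta, RH)
    return (-0.2442 + 0.55399 * Tw + 0.45535 * Ta
            - 0.0022 * Tw**2 + 0.00278 * Tw * Ta + 3.0)

# Single-point check of the wet-bulb formula
tw_check = float(wet_bulb_stull(20.0, 50.0))
print(f"Wet-bulb at 20°C, 50% RH: {tw_check:.2f}°C")

# At a fixed air temperature, the index should rise with humidity
rh_grid = np.array([40.0, 60.0, 80.0])
hi_grid = heat_index_kma(30.0, rh_grid)
print(dict(zip(rh_grid.tolist(), np.round(hi_grid, 2).tolist())))
```

If the printed index does not increase with humidity, a coefficient has probably been mistyped.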

## 3. Deep Learning Model for Heat Index Prediction

In [None]:
# Prepare data for neural network
feature_cols = ['temperature', 'humidity', 'wind_speed', 'solar_radiation', 
                'pressure', 'pm25', 'pm10', 'hour', 'month', 'is_weekend']

X = df_hourly[feature_cols].values
y = df_hourly['heat_index'].values

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Build Neural Network
def create_nn_model(input_dim):
    model = models.Sequential([
        layers.Dense(128, activation='relu', input_shape=(input_dim,)),
        layers.Dropout(0.2),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(32, activation='relu'),
        layers.Dense(16, activation='relu'),
        layers.Dense(1)
    ])
    
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )
    
    return model

# Create and train model
nn_model = create_nn_model(X_train_scaled.shape[1])

# Early stopping
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train model
history = nn_model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    callbacks=[early_stop],
    verbose=0
)

# Evaluate model
nn_loss, nn_mae = nn_model.evaluate(X_test_scaled, y_test, verbose=0)
nn_predictions = nn_model.predict(X_test_scaled, verbose=0).flatten()

# Calculate metrics
from sklearn.metrics import r2_score, mean_squared_error
nn_r2 = r2_score(y_test, nn_predictions)
nn_rmse = np.sqrt(mean_squared_error(y_test, nn_predictions))

print("🧠 Neural Network Performance:")
print(f"  R² Score: {nn_r2:.4f}")
print(f"  RMSE: {nn_rmse:.4f}°C")
print(f"  MAE: {nn_mae:.4f}°C")

# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Model Loss During Training')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].scatter(y_test, nn_predictions, alpha=0.5, s=10)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual KMA Heat Index (°C)')
axes[1].set_ylabel('Predicted Heat Index (°C)')
axes[1].set_title('Neural Network Predictions')
axes[1].grid(True, alpha=0.3)

plt.suptitle('Deep Learning Model for KMA Heat Index Prediction', fontsize=16)
plt.tight_layout()
plt.show()
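
The random `train_test_split` above mixes hours from the whole season into both sets. For time-ordered data, a chronological hold-out is a common alternative. The sketch below shows one way to do it; `chronological_split` is a hypothetical helper (not part of scikit-learn), and the toy arrays stand in for `X` and `y`.

```python
import numpy as np

def chronological_split(X, y, test_fraction=0.2):
    """Hold out the last `test_fraction` of rows, preserving time order."""
    n_test = int(round(len(X) * test_fraction))
    split_at = len(X) - n_test
    return X[:split_at], X[split_at:], y[:split_at], y[split_at:]

# Toy data: 10 time-ordered rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = chronological_split(X_toy, y_toy, test_fraction=0.2)
print(len(X_tr), len(X_te))
# Every training row precedes every test row
print(y_tr.max() < y_te.min())
```

A score from a chronological split is usually a more honest estimate of how a model behaves on future hours.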

## 4. LSTM for Time Series Prediction

In [None]:
# Prepare sequences for LSTM
def create_sequences(data, seq_length, pred_length=1):
    X, y = [], []
    for i in range(len(data) - seq_length - pred_length + 1):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length:i+seq_length+pred_length, -1])  # Predict heat index
    return np.array(X), np.array(y)

# Prepare LSTM data
seq_length = 24  # Use 24 hours to predict next hour
lstm_features = ['temperature', 'humidity', 'wind_speed', 'heat_index']
lstm_data = df_hourly[lstm_features].values

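# Scale data (fit the scaler on the first 80% of rows only, so the
# chronological test period does not leak into the scaling range)
scaler_lstm = MinMaxScaler()
scaler_lstm.fit(lstm_data[:int(len(lstm_data) * 0.8)])
lstm_data_scaled = scaler_lstm.transform(lstm_data)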

# Create sequences
X_seq, y_seq = create_sequences(lstm_data_scaled, seq_length)
y_seq = y_seq.reshape(-1, 1)  # Reshape for single output

# Split data
train_size = int(len(X_seq) * 0.8)
X_train_lstm = X_seq[:train_size]
X_test_lstm = X_seq[train_size:]
y_train_lstm = y_seq[:train_size]
y_test_lstm = y_seq[train_size:]

# Build LSTM model
lstm_model = models.Sequential([
    layers.LSTM(50, activation='relu', return_sequences=True, 
                input_shape=(seq_length, len(lstm_features))),
    layers.Dropout(0.2),
    layers.LSTM(50, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(25, activation='relu'),
    layers.Dense(1)
])

lstm_model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mae'])

# Train LSTM
lstm_history = lstm_model.fit(
    X_train_lstm, y_train_lstm,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    callbacks=[callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
    verbose=0
)

# Evaluate LSTM
lstm_predictions = lstm_model.predict(X_test_lstm, verbose=0)

# Inverse transform predictions
# Create dummy array for inverse transform
dummy = np.zeros((len(lstm_predictions), len(lstm_features)))
dummy[:, -1] = lstm_predictions.flatten()
lstm_predictions_actual = scaler_lstm.inverse_transform(dummy)[:, -1]

dummy[:, -1] = y_test_lstm.flatten()
y_test_lstm_actual = scaler_lstm.inverse_transform(dummy)[:, -1]

# Calculate metrics
lstm_r2 = r2_score(y_test_lstm_actual, lstm_predictions_actual)
lstm_rmse = np.sqrt(mean_squared_error(y_test_lstm_actual, lstm_predictions_actual))

print("\n🔮 LSTM Time Series Performance:")
print(f"  R² Score: {lstm_r2:.4f}")
print(f"  RMSE: {lstm_rmse:.4f}°C")

# Visualize LSTM predictions
plt.figure(figsize=(14, 6))
plt.plot(y_test_lstm_actual[:100], label='Actual', alpha=0.7)
plt.plot(lstm_predictions_actual[:100], label='LSTM Predicted', alpha=0.7)
plt.xlabel('Time (hours)')
plt.ylabel('KMA Heat Index (°C)')
plt.title('LSTM Time Series Predictions (First 100 hours)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
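
To see what `create_sequences` produces, and to give the LSTM score some context, the sketch below windows a toy array and scores a persistence baseline (next value = last observed value). The sine series and the name `persistence_rmse` are illustrative assumptions; on the real data the same comparison would use `lstm_data_scaled`.

```python
import numpy as np

def create_sequences(data, seq_length, pred_length=1):
    """Sliding windows: X holds seq_length rows, y the next value(s) of the last column."""
    X, y = [], []
    for i in range(len(data) - seq_length - pred_length + 1):
        X.append(data[i:i + seq_length])
        y.append(data[i + seq_length:i + seq_length + pred_length, -1])
    return np.array(X), np.array(y)

# Toy series: 2 features, 100 hourly steps with a 24-hour cycle
t = np.arange(100)
toy = np.column_stack([np.cos(2 * np.pi * t / 24), np.sin(2 * np.pi * t / 24)])

X_win, y_win = create_sequences(toy, seq_length=24)
print(X_win.shape, y_win.shape)  # (76, 24, 2) (76, 1)

# Persistence baseline: forecast = last observed target value in each window
persistence_pred = X_win[:, -1, -1]
persistence_rmse = float(np.sqrt(np.mean((y_win[:, 0] - persistence_pred) ** 2)))
print(f"Persistence RMSE: {persistence_rmse:.4f}")
```

An LSTM that cannot beat the persistence RMSE on the same windows is adding little beyond the last observation.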

## 5. XGBoost for Heat Index Prediction

In [None]:
# Prepare data for XGBoost
# Add lag features
df_xgb = df_hourly.copy()
for lag in [1, 3, 6, 12, 24]:
    df_xgb[f'temp_lag_{lag}'] = df_xgb['temperature'].shift(lag)
    df_xgb[f'hi_lag_{lag}'] = df_xgb['heat_index'].shift(lag)

df_xgb = df_xgb.dropna()

# Features for XGBoost
xgb_features = feature_cols + [col for col in df_xgb.columns if 'lag' in col]
X_xgb = df_xgb[xgb_features]
y_xgb = df_xgb['heat_index']

# Split data
X_train_xgb, X_test_xgb, y_train_xgb, y_test_xgb = train_test_split(
    X_xgb, y_xgb, test_size=0.2, random_state=42
)

# Train XGBoost
xgb_model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

xgb_model.fit(X_train_xgb, y_train_xgb)

# Predictions
xgb_predictions = xgb_model.predict(X_test_xgb)

# Calculate metrics
xgb_r2 = r2_score(y_test_xgb, xgb_predictions)
xgb_rmse = np.sqrt(mean_squared_error(y_test_xgb, xgb_predictions))

print("🚀 XGBoost Performance:")
print(f"  R² Score: {xgb_r2:.4f}")
print(f"  RMSE: {xgb_rmse:.4f}°C")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': xgb_features,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False).head(15)

# Visualize feature importance
plt.figure(figsize=(10, 8))
plt.barh(feature_importance['feature'], feature_importance['importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('XGBoost Feature Importance for KMA Heat Index')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
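
The lag features above rely on `shift`, which moves past values forward in time and leaves `NaN` in the first rows. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'heat_index': [20.0, 21.0, 23.0, 26.0, 30.0]})
for lag in [1, 2]:
    # Row t receives the value observed at row t - lag
    toy[f'hi_lag_{lag}'] = toy['heat_index'].shift(lag)

print(toy)

# Rows without a full lag history contain NaN and are dropped
toy_clean = toy.dropna()
print(len(toy), '->', len(toy_clean))
```

With lags up to 24, the same logic removes the first 24 hourly rows of `df_xgb`.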

## 6. Clustering Analysis for Heatwave Patterns

In [None]:
# Prepare data for clustering
clustering_features = ['temperature', 'humidity', 'heat_index', 'solar_radiation', 'hour', 'month']
X_cluster = df_hourly[clustering_features].dropna()

# Scale features
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

# Apply PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cluster_scaled)

# K-Means clustering
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans_labels = kmeans.fit_predict(X_cluster_scaled)

# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=10)
dbscan_labels = dbscan.fit_predict(X_cluster_scaled)

# Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=n_clusters)
hierarchical_labels = hierarchical.fit_predict(X_cluster_scaled)

# Evaluate clustering
kmeans_silhouette = silhouette_score(X_cluster_scaled, kmeans_labels)
kmeans_db = davies_bouldin_score(X_cluster_scaled, kmeans_labels)

print("📊 Clustering Performance:")
print(f"K-Means Silhouette Score: {kmeans_silhouette:.4f}")
print(f"K-Means Davies-Bouldin Score: {kmeans_db:.4f}")
print(f"DBSCAN unique clusters: {len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)}")

# Visualize clustering results
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# K-Means
scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, 
                          cmap='viridis', alpha=0.6, s=10)
axes[0].set_title('K-Means Clustering')
axes[0].set_xlabel('First Principal Component')
axes[0].set_ylabel('Second Principal Component')
plt.colorbar(scatter1, ax=axes[0])

# DBSCAN
scatter2 = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=dbscan_labels, 
                          cmap='viridis', alpha=0.6, s=10)
axes[1].set_title('DBSCAN Clustering')
axes[1].set_xlabel('First Principal Component')
axes[1].set_ylabel('Second Principal Component')
plt.colorbar(scatter2, ax=axes[1])

# Hierarchical
scatter3 = axes[2].scatter(X_pca[:, 0], X_pca[:, 1], c=hierarchical_labels, 
                          cmap='viridis', alpha=0.6, s=10)
axes[2].set_title('Hierarchical Clustering')
axes[2].set_xlabel('First Principal Component')
axes[2].set_ylabel('Second Principal Component')
plt.colorbar(scatter3, ax=axes[2])

plt.suptitle('Clustering Analysis of Heatwave Patterns', fontsize=16)
plt.tight_layout()
plt.show()

# Analyze cluster characteristics
X_cluster['cluster'] = kmeans_labels
cluster_summary = X_cluster.groupby('cluster')[clustering_features].mean().round(2)

print("\n📊 K-Means Cluster Characteristics:")
print(cluster_summary)
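
The cell above fixes `n_clusters = 5` in advance. One common way to check that choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs that stand in for `X_cluster_scaled`; the blob centers and the range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_cluster_scaled: three well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Higher silhouette = tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()})
print("Best k by silhouette:", best_k)
```

On real weather data the curve is usually flatter than on these blobs, so the score is a guide rather than a rule.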

## 7. Anomaly Detection for Extreme Events

In [None]:
# Isolation Forest for anomaly detection
iso_forest = IsolationForest(contamination=0.05, random_state=42)
anomaly_labels = iso_forest.fit_predict(X_cluster_scaled)

# Mark anomalies
X_cluster['is_anomaly'] = anomaly_labels == -1
anomalies = X_cluster[X_cluster['is_anomaly']]
normal = X_cluster[~X_cluster['is_anomaly']]

print(f"🚨 Anomaly Detection Results:")
print(f"  Total anomalies detected: {len(anomalies)} ({len(anomalies)/len(X_cluster)*100:.2f}%)")
print(f"  Normal observations: {len(normal)}")

# Analyze anomaly characteristics
print("\n📊 Anomaly Characteristics:")
print("Anomalies:")
print(anomalies[clustering_features].describe().round(2))
print("\nNormal:")
print(normal[clustering_features].describe().round(2))

# Visualize anomalies
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot in PCA space
axes[0].scatter(X_pca[~X_cluster['is_anomaly'].values, 0], 
               X_pca[~X_cluster['is_anomaly'].values, 1], 
               c='blue', alpha=0.3, s=10, label='Normal')
axes[0].scatter(X_pca[X_cluster['is_anomaly'].values, 0], 
               X_pca[X_cluster['is_anomaly'].values, 1], 
               c='red', alpha=0.8, s=20, label='Anomaly')
axes[0].set_xlabel('First Principal Component')
axes[0].set_ylabel('Second Principal Component')
axes[0].set_title('Anomaly Detection in PCA Space')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Heat index distribution
axes[1].hist(normal['heat_index'], bins=50, alpha=0.5, label='Normal', color='blue')
axes[1].hist(anomalies['heat_index'], bins=20, alpha=0.7, label='Anomaly', color='red')
axes[1].set_xlabel('KMA Heat Index (°C)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Heat Index Distribution: Normal vs Anomaly')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('Anomaly Detection for Extreme Heat Events', fontsize=16)
plt.tight_layout()
plt.show()
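
The share of anomalies reported above is largely set by the `contamination` parameter: Isolation Forest places its threshold so that about that fraction of the training rows is flagged, whether or not the data contain real extremes. The sketch below illustrates this on plain Gaussian noise (made-up data with no injected outliers) and shows `score_samples` as a way to rank records instead.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))  # no injected outliers at all

iso = IsolationForest(contamination=0.05, random_state=42)
flags = iso.fit_predict(X_demo)
flagged_share = float(np.mean(flags == -1))
print(f"Share flagged: {flagged_share:.3f}")

# Continuous scores (lower = more anomalous) allow a custom threshold or ranking
anomaly_scores = iso.score_samples(X_demo)
most_anomalous = np.argsort(anomaly_scores)[:10]
print("Ten most anomalous row indices:", most_anomalous)
```

For an early-warning use case, ranking by score and checking the top records against known events is often more informative than the fixed percentage.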

## 8. Model Ensemble and Comparison

In [None]:
# Compare all models
model_comparison = {
    'Neural Network': {'R²': nn_r2, 'RMSE': nn_rmse},
    'LSTM': {'R²': lstm_r2, 'RMSE': lstm_rmse},
    'XGBoost': {'R²': xgb_r2, 'RMSE': xgb_rmse}
}

comparison_df = pd.DataFrame(model_comparison).T

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# R² comparison
axes[0].bar(comparison_df.index, comparison_df['R²'], color=['blue', 'green', 'orange'])
axes[0].set_ylabel('R² Score')
axes[0].set_title('Model R² Score Comparison')
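# Keep the usual 0.9-1.0 zoom, but widen it if any model scores lower
axes[0].set_ylim([min(0.9, comparison_df['R²'].min() - 0.05), 1.0])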
axes[0].grid(True, alpha=0.3)

# RMSE comparison
axes[1].bar(comparison_df.index, comparison_df['RMSE'], color=['blue', 'green', 'orange'])
axes[1].set_ylabel('RMSE (°C)')
axes[1].set_title('Model RMSE Comparison')
axes[1].grid(True, alpha=0.3)

plt.suptitle('Machine Learning Model Performance Comparison', fontsize=16)
plt.tight_layout()
plt.show()

print("📊 Model Performance Summary:")
print(comparison_df.round(4))

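# Create ensemble predictions (simple average of NN and XGBoost)
# The two models were evaluated on different random test splits, so score
# both on the XGBoost test rows to keep predictions and targets aligned.
# Caveat: some of these rows were in the NN training split, so the
# ensemble score is optimistic.
nn_on_xgb_test = nn_model.predict(
    scaler_X.transform(X_test_xgb[feature_cols].values), verbose=0).flatten()
ensemble_predictions = (nn_on_xgb_test + xgb_predictions) / 2
ensemble_r2 = r2_score(y_test_xgb, ensemble_predictions)
ensemble_rmse = np.sqrt(mean_squared_error(y_test_xgb, ensemble_predictions))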

print(f"\n🎯 Ensemble Model Performance:")
print(f"  R² Score: {ensemble_r2:.4f}")
print(f"  RMSE: {ensemble_rmse:.4f}°C")
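
A weighted average is a small step beyond the simple average. The sketch below weights two prediction vectors by the inverse of their RMSE. The arrays are made-up stand-ins for two models scored on the same rows; in practice the weights would come from a validation set rather than the test set.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(0)
y_true = rng.normal(28, 3, size=200)            # made-up targets
pred_a = y_true + rng.normal(0, 0.5, size=200)  # stand-in for model A
pred_b = y_true + rng.normal(0, 1.0, size=200)  # stand-in for model B

# Weights proportional to 1/RMSE, normalised to sum to 1
inv = np.array([1 / rmse(y_true, pred_a), 1 / rmse(y_true, pred_b)])
weights = inv / inv.sum()
ensemble = weights[0] * pred_a + weights[1] * pred_b

print("weights:", np.round(weights, 3))
print("RMSE A / B / ensemble:",
      round(rmse(y_true, pred_a), 3),
      round(rmse(y_true, pred_b), 3),
      round(rmse(y_true, ensemble), 3))
```

Both prediction vectors must refer to the same rows in the same order, otherwise the average is meaningless.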

## 9. Real-time Prediction Pipeline

In [None]:
class HeatIndexPredictor:
    """Real-time KMA Heat Index prediction pipeline"""
    
    def __init__(self, model, scaler, feature_cols):
        self.model = model
        self.scaler = scaler
        self.feature_cols = feature_cols
        
    def preprocess(self, data):
        """Preprocess a single observation given as a dict"""
        # Work on a copy and fill any missing feature with 0
        data = dict(data)
        for col in self.feature_cols:
            if col not in data:
                data[col] = 0
        
        # Select and order features (a dict cannot be indexed with a list)
        X = np.array([[data[col] for col in self.feature_cols]], dtype=float)
        
        # Scale features
        X_scaled = self.scaler.transform(X)
        
        return X_scaled
    
    def predict(self, data):
        """Make prediction"""
        X_scaled = self.preprocess(data)
        prediction = self.model.predict(X_scaled, verbose=0)[0][0]
        
        # Determine heat stress category
        if prediction < 25:
            category = 'Comfortable'
            color = 'green'
        elif prediction < 28:
            category = 'Caution'
            color = 'yellow'
        elif prediction < 31:
            category = 'Extreme Caution'
            color = 'orange'
        elif prediction < 35:
            category = 'Danger'
            color = 'red'
        else:
            category = 'Extreme Danger'
            color = 'darkred'
        
        return {
            'heat_index': prediction,
            'category': category,
            'color': color
        }
    
    def predict_batch(self, df):
        """Make batch predictions"""
        predictions = []
        for _, row in df.iterrows():
            pred = self.predict(row.to_dict())
            predictions.append(pred)
        return predictions

# Initialize predictor
predictor = HeatIndexPredictor(nn_model, scaler_X, feature_cols)

# Test real-time prediction
test_data = pd.DataFrame({
    'temperature': [30, 35, 28],
    'humidity': [70, 80, 60],
    'wind_speed': [2, 1, 3],
    'solar_radiation': [400, 500, 300],
    'pressure': [1013, 1012, 1014],
    'pm25': [25, 30, 20],
    'pm10': [40, 45, 35],
    'hour': [14, 15, 10],
    'month': [7, 8, 6],
    'is_weekend': [0, 0, 1]
})

print("🔮 Real-time Prediction Examples:")
print("="*60)

for i, row in test_data.iterrows():
    result = predictor.predict(row.to_dict())
    print(f"\nScenario {i+1}:")
    print(f"  Temperature: {row['temperature']}°C, Humidity: {row['humidity']}%")
    print(f"  Predicted KMA Heat Index: {result['heat_index']:.2f}°C")
    print(f"  Category: {result['category']}")
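
For batch predictions, the if/elif chain in `predict` can be expressed as one vectorised lookup. The sketch below uses `np.digitize` with the same thresholds; `categorize`, `BIN_EDGES` and `LABELS` are hypothetical names. One detail to keep in mind: `pd.cut` in Section 2 uses right-closed bins by default, so a value of exactly 25.0 is 'Comfortable' there but 'Caution' in the predictor. Passing `right=False` to `pd.cut` would align the two.

```python
import numpy as np

BIN_EDGES = [25, 28, 31, 35]
LABELS = np.array(['Comfortable', 'Caution', 'Extreme Caution',
                   'Danger', 'Extreme Danger'])

def categorize(heat_index_values):
    """Vectorised version of the if/elif chain: each edge belongs to the higher class."""
    # right=False (default): index i means BIN_EDGES[i-1] <= x < BIN_EDGES[i]
    idx = np.digitize(heat_index_values, BIN_EDGES)
    return LABELS[idx]

demo_categories = categorize(np.array([24.9, 25.0, 27.99, 31.0, 36.0]))
print(demo_categories)
```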

## 10. Model Interpretability

In [None]:
# SHAP values for XGBoost model interpretability
import shap

# Create SHAP explainer
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test_xgb[:100])  # Use subset for visualization

# Summary plot
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test_xgb[:100], feature_names=xgb_features, show=False)
plt.title('SHAP Feature Importance for KMA Heat Index Prediction')
plt.tight_layout()
plt.show()

# Feature interaction plot
# dependence_plot opens its own figure unless an Axes is passed in
fig, ax = plt.subplots(figsize=(10, 6))
shap.dependence_plot('temperature', shap_values, X_test_xgb[:100],
                     feature_names=xgb_features, interaction_index='humidity',
                     ax=ax, show=False)
ax.set_title('Temperature vs Humidity Interaction Effect on Heat Index')
plt.tight_layout()
plt.show()
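
SHAP values are additive: for each row, the base value plus the per-feature contributions equals the model output. For a linear model with independent features the contributions have a closed form, w_i (x_i - mean(x_i)), so the property can be checked without the SHAP library. The weights and data below are made up for illustration and are unrelated to the XGBoost model.

```python
import numpy as np

rng = np.random.default_rng(1)
X_bg = rng.normal(size=(500, 3))        # background data (made up)
w = np.array([2.0, -1.0, 0.5])          # hypothetical linear weights
b = 4.0

def f(X):
    """Linear model: works for a single row or a 2-D batch."""
    return X @ w + b

x = np.array([1.0, 0.5, -2.0])          # row to explain
base_value = float(f(X_bg).mean())      # expected model output over the background
phi = w * (x - X_bg.mean(axis=0))       # per-feature contributions

print("contributions:", np.round(phi, 3))
print("base + sum:", round(base_value + float(phi.sum()), 6),
      "prediction:", round(float(f(x)), 6))
```

`TreeExplainer` provides the same additivity for tree ensembles, which is what lets the summary plot be read as a decomposition of each prediction.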

## 11. Save Models and Report

In [None]:
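# Save models
import os
import joblib

# Create the output directory if it does not exist yet
os.makedirs('../models', exist_ok=True)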

# Save neural network
nn_model.save('../models/kma_heat_index_nn_model.h5')
print("✅ Neural Network model saved")

# Save LSTM
lstm_model.save('../models/kma_heat_index_lstm_model.h5')
print("✅ LSTM model saved")

# Save XGBoost
joblib.dump(xgb_model, '../models/kma_heat_index_xgb_model.pkl')
print("✅ XGBoost model saved")

# Save scalers
joblib.dump(scaler_X, '../models/scaler_features.pkl')
joblib.dump(scaler_lstm, '../models/scaler_lstm.pkl')
print("✅ Scalers saved")

# Generate report
report = f"""
MACHINE LEARNING ANALYSIS REPORT
=================================

MODELS EVALUATED:
1. Deep Neural Network (4 hidden layers)
2. LSTM for Time Series
3. XGBoost with Lag Features
4. Ensemble Model

PERFORMANCE SUMMARY:
{comparison_df.round(4).to_string()}

Ensemble Model:
- R² Score: {ensemble_r2:.4f}
- RMSE: {ensemble_rmse:.4f}°C

CLUSTERING ANALYSIS:
- K-Means Silhouette Score: {kmeans_silhouette:.4f}
- Identified {n_clusters} distinct weather patterns

ANOMALY DETECTION:
- Total anomalies: {len(anomalies)} ({len(anomalies)/len(X_cluster)*100:.2f}%)
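- Mean heat index: anomalies {anomalies['heat_index'].mean():.2f}°C, normal {normal['heat_index'].mean():.2f}°C

KEY FINDINGS:
1. Highest test R²: {comparison_df['R²'].idxmax()} (the LSTM forecasts the next hour, so its score is not directly comparable)
2. Top XGBoost feature by importance: {feature_importance.iloc[0]['feature']}
3. XGBoost was trained with lagged temperature and heat index features
4. K-Means grouped the hourly records into {n_clusters} clusters
5. Isolation Forest flagged {len(anomalies)} records at contamination=0.05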

RECOMMENDATIONS:
1. Deploy XGBoost for operational forecasting
2. Use LSTM for next-hour forecasts from the preceding 24 hours
3. Implement anomaly detection for early warnings
4. Update models weekly with new data
5. Monitor model drift and retrain as needed
"""

print(report)

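# Save report (create the directory if needed; UTF-8 keeps the ² and ° symbols)
from pathlib import Path
Path('../reports').mkdir(parents=True, exist_ok=True)
with open('../reports/ml_analysis_report.txt', 'w', encoding='utf-8') as f:
    f.write(report)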
print("\n✅ Report saved to ../reports/ml_analysis_report.txt")
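
A saved artifact is only useful if it reloads with the same behaviour. The sketch below round-trips a scaler through `joblib` in a temporary directory and compares the outputs; the file name and the toy array are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the training features
X_demo = np.array([[20.0, 60.0], [30.0, 70.0], [35.0, 80.0]])
scaler = StandardScaler().fit(X_demo)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scaler_demo.pkl')
    joblib.dump(scaler, path)
    reloaded = joblib.load(path)

round_trip_ok = bool(np.allclose(scaler.transform(X_demo), reloaded.transform(X_demo)))
print("Round trip OK:", round_trip_ok)
```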

## 12. Assignment

### Week 8 Tasks:

1. **Deep Learning Models** (25 points)
   - Implement and train neural networks
   - Build LSTM for time series prediction
   - Optimize hyperparameters

2. **Advanced ML Techniques** (25 points)
   - Apply XGBoost with feature engineering
   - Create ensemble models
   - Compare model performances

3. **Clustering Analysis** (25 points)
   - Identify heatwave patterns
   - Compare clustering algorithms
   - Analyze cluster characteristics

4. **Anomaly Detection** (25 points)
   - Detect extreme heat events
   - Implement early warning system
   - Validate with historical events

### Bonus Challenge:
- Implement transformer model for heat index prediction
- Create AutoML pipeline for model selection
- Develop mobile app for real-time predictions

## Summary

This week, we covered:
- ✅ Deep learning with neural networks and LSTM
- ✅ XGBoost for advanced predictions
- ✅ Clustering for pattern discovery
- ✅ Anomaly detection for extreme events
- ✅ Model interpretability with SHAP

### Next Week Preview:
**Week 9: Advanced Visualization and Dashboards**
- Interactive dashboards with Plotly/Dash
- Real-time monitoring systems
- Geographic visualizations
- Report generation

### Resources:
- [TensorFlow Documentation](https://www.tensorflow.org/)
- [XGBoost Documentation](https://xgboost.readthedocs.io/)
- [SHAP Documentation](https://shap.readthedocs.io/)

---
**End of Week 8**

*Instructor: Sohn Chul*