# Week 7: Statistical Modeling and Prediction
## Regression Analysis and Heat Index Forecasting

**Instructor**: Sohn Chul

---

## 🎯 Learning Objectives

By the end of this session, you will be able to:
1. Build regression models for KMA heat index prediction
2. Perform correlation analysis and interpret it with care (correlation does not imply causation)
3. Implement time series forecasting models
4. Validate model performance and accuracy
5. Create ensemble models for improved predictions

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Statistical modeling
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Time series
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.stats.diagnostic import acorr_ljungbox
import statsmodels.api as sm

# Other utilities
from scipy import stats
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("✅ Libraries imported successfully!")

## 2. Generate and Prepare Data with KMA Heat Index

In [None]:
# Generate comprehensive dataset for modeling
np.random.seed(42)

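# Create hourly date range (end at 23:00 so the final day is complete
# and its daily mean is not based on a single midnight reading)
dates = pd.date_range('2025-04-01', '2025-08-31 23:00', freq='h')
n = len(dates)

# Generate correlated weather variables (plain NumPy arrays for the math below)
hours = dates.hour.to_numpy()
days = (dates - dates[0]).days.to_numpy()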

# Temperature with patterns
temp_base = 20 + (days / 30) * 3  # Seasonal trend
temp_daily = 7 * np.sin((hours - 6) * np.pi / 12)  # Daily cycle
temperature = temp_base + temp_daily + np.random.normal(0, 2, n)

# Humidity (inversely correlated with temperature)
humidity = 75 - temperature * 0.8 + np.random.normal(0, 8, n)
humidity = np.clip(humidity, 30, 95)

# Wind speed (affects perceived temperature)
wind_speed = np.abs(np.random.normal(2, 1.5, n))

# Solar radiation (correlated with hour of day)
solar_radiation = np.maximum(0, 500 * np.sin((hours - 6) * np.pi / 12) + np.random.normal(0, 50, n))

# Air pressure
pressure = 1013 + np.random.normal(0, 10, n)

# KMA Heat Index Calculation
def calculate_wet_bulb_temperature(Ta, RH):
    """Calculate wet-bulb temperature (°C) using Stull's formula.

    Ta is air temperature in °C and RH is relative humidity in percent.
    """
    Tw = (Ta * np.arctan(0.151977 * (RH + 8.313659)**0.5) + 
          np.arctan(Ta + RH) - 
          np.arctan(RH - 1.676331) + 
          0.00391838 * RH**1.5 * np.arctan(0.023101 * RH) - 
          4.686035)
    return Tw

def calculate_heat_index_kma(Ta, RH):
    """Calculate heat index using KMA formula."""
    Tw = calculate_wet_bulb_temperature(Ta, RH)
    HI = (-0.2442 + 0.55399 * Tw + 0.45535 * Ta - 
          0.0022 * Tw**2 + 0.00278 * Tw * Ta + 3.0)
    return HI

# Calculate KMA heat index
heat_index = calculate_heat_index_kma(temperature, humidity)

# Create DataFrame
df = pd.DataFrame({
    'datetime': dates,
    'temperature': temperature,
    'humidity': humidity,
    'wind_speed': wind_speed,
    'solar_radiation': solar_radiation,
    'pressure': pressure,
    'heat_index': heat_index
})

# Add time-based features
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Add lag features for time series
df['temp_lag_1'] = df['temperature'].shift(1)
df['temp_lag_24'] = df['temperature'].shift(24)
df['hi_lag_1'] = df['heat_index'].shift(1)
df['hi_lag_24'] = df['heat_index'].shift(24)

# Remove NaN values from lag features
df = df.dropna()

print(f"✅ Dataset created with {len(df)} records")
print("\n📊 Data Summary:")
print(df[['temperature', 'humidity', 'heat_index']].describe())
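
Before modeling, it can help to spot-check the two formulas at a single known point. The sketch below re-implements them under the hypothetical names `wet_bulb_stull` and `heat_index_kma` (stand-ins for the functions in the cell above, with the third arctangent constant assumed to be 1.676331). At Ta = 20 °C and RH = 50 %, the wet-bulb temperature should come out near 13.7 °C, and at a fixed temperature the heat index should rise with humidity.

```python
import numpy as np

def wet_bulb_stull(Ta, RH):
    """Wet-bulb temperature (°C) from air temperature (°C) and relative humidity (%)."""
    return (Ta * np.arctan(0.151977 * np.sqrt(RH + 8.313659))
            + np.arctan(Ta + RH)
            - np.arctan(RH - 1.676331)
            + 0.00391838 * RH**1.5 * np.arctan(0.023101 * RH)
            - 4.686035)

def heat_index_kma(Ta, RH):
    """Heat index (°C) using the same coefficients as the notebook cell."""
    Tw = wet_bulb_stull(Ta, RH)
    return (-0.2442 + 0.55399 * Tw + 0.45535 * Ta
            - 0.0022 * Tw**2 + 0.00278 * Tw * Ta + 3.0)

# Single-point check at 20 °C and 50 % relative humidity
tw_check = wet_bulb_stull(20.0, 50.0)
hi_check = heat_index_kma(20.0, 50.0)
print(f"Tw = {tw_check:.2f} °C, heat index = {hi_check:.2f} °C")

# Monotonicity check: same temperature, drier vs. more humid air
hi_dry = heat_index_kma(30.0, 40.0)
hi_humid = heat_index_kma(30.0, 80.0)
print(f"30 °C at 40 % RH: {hi_dry:.2f} °C, at 80 % RH: {hi_humid:.2f} °C")
```

If either check fails after editing the coefficients, the most likely cause is a mistyped constant.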

## 3. Correlation Analysis

In [None]:
# Calculate correlation matrix
correlation_vars = ['temperature', 'humidity', 'wind_speed', 'solar_radiation', 'pressure', 'heat_index']
corr_matrix = df[correlation_vars].corr()

# Create correlation heatmap
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', 
            cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Weather Variables and KMA Heat Index', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Print all correlations with heat index, sorted from most positive to most negative
hi_corr = corr_matrix['heat_index'].sort_values(ascending=False)
print("\n🔍 Correlations with KMA Heat Index:")
print("="*40)
for var, corr in hi_corr.items():
    if var != 'heat_index':
        print(f"{var:15s}: {corr:+.3f}")
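
A high correlation in the matrix above does not by itself show that one variable drives the other. The sketch below illustrates the point with two synthetic series that share a common driver: their raw Pearson correlation is strong, but it nearly vanishes once the shared driver is regressed out (a simple partial correlation). All names here (`daily_cycle`, `residualize`, and so on) are hypothetical and the data is generated only for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 2000

# Shared driver (confounder): position in the daily cycle
daily_cycle = np.sin(rng.uniform(0, 2 * np.pi, n))

# Both variables depend on the cycle, but not on each other
temperature = 25 + 7 * daily_cycle + rng.normal(0, 2, n)
solar = 300 + 250 * daily_cycle + rng.normal(0, 50, n)

r_raw, p_raw = stats.pearsonr(temperature, solar)

def residualize(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: correlate what is left after removing the confounder
r_partial, p_partial = stats.pearsonr(
    residualize(temperature, daily_cycle),
    residualize(solar, daily_cycle),
)

print(f"Raw correlation:     {r_raw:+.3f} (p = {p_raw:.3g})")
print(f"Partial correlation: {r_partial:+.3f} (p = {p_partial:.3g})")
```

The same check can be applied to any pair in `correlation_vars` when a third variable, such as hour of day, plausibly drives both.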

## 4. Linear Regression Models

In [None]:
# Prepare features and target
feature_cols = ['temperature', 'humidity', 'wind_speed', 'solar_radiation', 
                'pressure', 'hour', 'month', 'is_weekend']
X = df[feature_cols]
y = df['heat_index']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple linear models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1)
}

results = {}

for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    
    # Calculate metrics
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    test_mae = mean_absolute_error(y_test, y_pred_test)
    
    results[name] = {
        'train_r2': train_r2,
        'test_r2': test_r2,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'test_mae': test_mae,
        'predictions': y_pred_test
    }

# Display results
print("📊 Linear Model Performance:")
print("="*70)
print(f"{'Model':<20} {'Train R²':>10} {'Test R²':>10} {'Test RMSE':>10} {'Test MAE':>10}")
print("-"*70)
for name, metrics in results.items():
    print(f"{name:<20} {metrics['train_r2']:>10.4f} {metrics['test_r2']:>10.4f} "
          f"{metrics['test_rmse']:>10.4f} {metrics['test_mae']:>10.4f}")
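
A single train/test split gives one estimate of model quality. A common complement is k-fold cross-validation with the scaler inside a pipeline, so that scaling statistics are learned only from each training fold. The following is a minimal sketch on synthetic data. `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions made for this example, not values from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three synthetic features on different scales, linear target plus noise
X_demo = rng.normal(size=(300, 3)) * np.array([5.0, 10.0, 1.0])
y_demo = 0.9 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 0.5, 300)

# The scaler is refit inside every fold, which avoids leaking test-fold statistics
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='r2')

print(f"R² per fold: {np.round(cv_scores, 4)}")
print(f"Mean R²: {cv_scores.mean():.4f} (std {cv_scores.std():.4f})")
```

A large spread between folds suggests the single-split score depends heavily on which rows landed in the test set.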

## 5. Feature Importance Analysis

In [None]:
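# Get coefficients from Linear Regression (fitted on standardized features,
# so coefficient magnitudes are directly comparable across features)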
lr_model = models['Linear Regression']
coefficients = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': lr_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

# Visualize feature importance
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Coefficient plot
colors = ['green' if c > 0 else 'red' for c in coefficients['Coefficient']]
axes[0].barh(coefficients['Feature'], coefficients['Coefficient'], color=colors, alpha=0.7)
axes[0].set_xlabel('Coefficient Value')
axes[0].set_title('Linear Regression Feature Coefficients', fontweight='bold')
axes[0].axvline(x=0, color='black', linestyle='-', linewidth=0.5)
axes[0].grid(True, alpha=0.3)

# Single-feature R²: fit one regression per feature.
# These scores do not sum to the full-model R² because the features are correlated.
r2_contributions = []
for feature in feature_cols:
    X_single = X_train_scaled[:, feature_cols.index(feature)].reshape(-1, 1)
    lr_single = LinearRegression()
    lr_single.fit(X_single, y_train)
    r2_single = lr_single.score(X_single, y_train)
    r2_contributions.append(r2_single)

contribution_df = pd.DataFrame({
    'Feature': feature_cols,
    'R² Contribution': r2_contributions
}).sort_values('R² Contribution', ascending=False)

axes[1].bar(contribution_df['Feature'], contribution_df['R² Contribution'], 
           color='skyblue', alpha=0.7)
axes[1].set_xlabel('Feature')
axes[1].set_ylabel('Individual R² Score')
axes[1].set_title('Individual Feature R² Contributions', fontweight='bold')
plt.setp(axes[1].get_xticklabels(), rotation=45, ha='right')
axes[1].grid(True, alpha=0.3)

plt.suptitle('Feature Importance Analysis for KMA Heat Index Prediction', fontsize=16)
plt.tight_layout()
plt.show()

print("\n📊 Feature Coefficients:")
print(coefficients.to_string(index=False))
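
Coefficients and single-feature R² both depend on how the features relate to each other. Permutation importance is a model-agnostic alternative: it shuffles one feature at a time on held-out data and measures how much the score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic data. The feature names and generating coefficients are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
names = ['strong_feature', 'weak_feature', 'noise_feature']

# Target depends strongly on column 0, weakly on column 1, not at all on column 2
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(0, 0.5, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Mean drop in R² on the test set when each feature is shuffled
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = sorted(zip(names, perm.importances_mean), key=lambda item: item[1], reverse=True)

for name, importance in ranking:
    print(f"{name:15s}: {importance:.4f}")
```

The same call works with any fitted estimator, so it can be used to compare linear and tree-based models on equal terms.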

## 6. Polynomial Regression

In [None]:
# Create polynomial features (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

# Train polynomial regression with Ridge regularization
poly_model = Ridge(alpha=1.0)
poly_model.fit(X_train_poly, y_train)

# Make predictions
y_pred_poly_train = poly_model.predict(X_train_poly)
y_pred_poly_test = poly_model.predict(X_test_poly)

# Calculate metrics
poly_train_r2 = r2_score(y_train, y_pred_poly_train)
poly_test_r2 = r2_score(y_test, y_pred_poly_test)
poly_test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_poly_test))
poly_test_mae = mean_absolute_error(y_test, y_pred_poly_test)

print("📊 Polynomial Regression Performance:")
print("="*40)
print(f"Train R²: {poly_train_r2:.4f}")
print(f"Test R²: {poly_test_r2:.4f}")
print(f"Test RMSE: {poly_test_rmse:.4f}")
print(f"Test MAE: {poly_test_mae:.4f}")

# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Linear vs Actual
axes[0].scatter(y_test, results['Linear Regression']['predictions'], 
               alpha=0.5, s=10, label='Predictions')
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
            'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual KMA Heat Index (°C)')
axes[0].set_ylabel('Predicted Heat Index (°C)')
axes[0].set_title('Linear Regression Predictions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Polynomial vs Actual
axes[1].scatter(y_test, y_pred_poly_test, alpha=0.5, s=10, label='Predictions')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
            'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual KMA Heat Index (°C)')
axes[1].set_ylabel('Predicted Heat Index (°C)')
axes[1].set_title('Polynomial Regression Predictions')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('Regression Model Predictions vs Actual KMA Heat Index', fontsize=16)
plt.tight_layout()
plt.show()

## 7. Ensemble Models

In [None]:
# Train ensemble models
ensemble_models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
}

ensemble_results = {}

for name, model in ensemble_models.items():
    # Train model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Calculate metrics
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    test_mae = mean_absolute_error(y_test, y_pred_test)
    
    ensemble_results[name] = {
        'model': model,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'test_rmse': test_rmse,
        'test_mae': test_mae,
        'predictions': y_pred_test
    }

# Display ensemble results
print("📊 Ensemble Model Performance:")
print("="*60)
print(f"{'Model':<20} {'Train R²':>10} {'Test R²':>10} {'Test RMSE':>10} {'Test MAE':>10}")
print("-"*60)
for name, metrics in ensemble_results.items():
    print(f"{name:<20} {metrics['train_r2']:>10.4f} {metrics['test_r2']:>10.4f} "
          f"{metrics['test_rmse']:>10.4f} {metrics['test_mae']:>10.4f}")

# Feature importance from Random Forest
rf_model = ensemble_results['Random Forest']['model']
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'], color='teal', alpha=0.7)
plt.xlabel('Importance Score')
plt.title('Random Forest Feature Importance for KMA Heat Index', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Random Forest Feature Importance:")
print(feature_importance.to_string(index=False))
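
The ensemble models above use fixed hyperparameters. A minimal sketch of tuning them with `GridSearchCV` follows, assuming a deliberately small grid and synthetic data so that it runs quickly. The grid values, `X_demo`, and `y_demo` are illustrative choices, not recommendations for the heat index data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)

# Synthetic target with one linear and one nonlinear effect
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 0.3, 200)

param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [3, None],
}

# 3-fold cross-validation over every grid combination, scored by (negated) RMSE
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_
print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated RMSE: {best_rmse:.4f}")
```

After fitting, `search.best_estimator_` is refit on all the data passed to `fit` and can be evaluated on a held-out test set.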

## 8. Time Series Forecasting

In [None]:
# Prepare time series data
ts_data = df.set_index('datetime')['heat_index'].resample('D').mean()

# Split into train and test
train_size = int(len(ts_data) * 0.8)
ts_train = ts_data[:train_size]
ts_test = ts_data[train_size:]

# ARIMA Model
arima_model = ARIMA(ts_train, order=(2, 1, 2))
arima_fit = arima_model.fit()

# Make predictions
arima_forecast = arima_fit.forecast(steps=len(ts_test))

# Exponential Smoothing
exp_model = ExponentialSmoothing(ts_train, seasonal='add', seasonal_periods=7)
exp_fit = exp_model.fit()
exp_forecast = exp_fit.forecast(steps=len(ts_test))

# Calculate metrics
arima_rmse = np.sqrt(mean_squared_error(ts_test, arima_forecast))
arima_mae = mean_absolute_error(ts_test, arima_forecast)
exp_rmse = np.sqrt(mean_squared_error(ts_test, exp_forecast))
exp_mae = mean_absolute_error(ts_test, exp_forecast)

print("📊 Time Series Model Performance:")
print("="*40)
print(f"ARIMA(2,1,2):")
print(f"  RMSE: {arima_rmse:.4f}")
print(f"  MAE: {arima_mae:.4f}")
print(f"\nExponential Smoothing:")
print(f"  RMSE: {exp_rmse:.4f}")
print(f"  MAE: {exp_mae:.4f}")

# Visualize forecasts
fig = go.Figure()

# Actual data
fig.add_trace(go.Scatter(
    x=ts_train.index, y=ts_train.values,
    mode='lines', name='Training Data',
    line=dict(color='blue', width=2)
))

fig.add_trace(go.Scatter(
    x=ts_test.index, y=ts_test.values,
    mode='lines', name='Actual Test Data',
    line=dict(color='black', width=2)
))

# ARIMA forecast
fig.add_trace(go.Scatter(
    x=ts_test.index, y=arima_forecast,
    mode='lines', name='ARIMA Forecast',
    line=dict(color='red', width=2, dash='dash')
))

# Exponential Smoothing forecast
fig.add_trace(go.Scatter(
    x=ts_test.index, y=exp_forecast,
    mode='lines', name='Exp. Smoothing Forecast',
    line=dict(color='green', width=2, dash='dot')
))

fig.update_layout(
    title='Time Series Forecasting of Daily Mean KMA Heat Index',
    xaxis_title='Date',
    yaxis_title='Heat Index (°C)',
    height=500,
    showlegend=True
)

fig.show()
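
Forecast errors are easier to judge against a naive baseline. The sketch below computes two simple ones on a synthetic daily series: persistence (repeat the last training value) and drift (extend the average training slope). The series and all names are assumptions for illustration. With the notebook's data, `ts_train` and `ts_test` could take the place of `train` and `test`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily series with a mild upward trend
idx = pd.date_range('2025-04-01', periods=120, freq='D')
series = pd.Series(20 + 0.05 * np.arange(120) + rng.normal(0, 1, 120), index=idx)

split = int(len(series) * 0.8)
train, test = series.iloc[:split], series.iloc[split:]

# Persistence: every forecast equals the last observed training value
naive_forecast = pd.Series(train.iloc[-1], index=test.index)

# Drift: continue the average slope between the first and last training points
slope = (train.iloc[-1] - train.iloc[0]) / (len(train) - 1)
drift_forecast = pd.Series(
    train.iloc[-1] + slope * np.arange(1, len(test) + 1), index=test.index
)

def rmse(actual, forecast):
    """Root mean squared error between two aligned series."""
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

naive_rmse = rmse(test, naive_forecast)
drift_rmse = rmse(test, drift_forecast)
print(f"Persistence RMSE: {naive_rmse:.4f}")
print(f"Drift RMSE:       {drift_rmse:.4f}")
```

If ARIMA or exponential smoothing does not beat these baselines on the test period, the extra model complexity is hard to justify.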

## 9. Model Comparison and Selection

In [None]:
# Compile all model results
all_models = {}

# Add linear models
for name, metrics in results.items():
    all_models[name] = {
        'R² Score': metrics['test_r2'],
        'RMSE': metrics['test_rmse'],
        'MAE': metrics['test_mae']
    }

# Add polynomial model
all_models['Polynomial Regression'] = {
    'R² Score': poly_test_r2,
    'RMSE': poly_test_rmse,
    'MAE': poly_test_mae
}

# Add ensemble models
for name, metrics in ensemble_results.items():
    all_models[name] = {
        'R² Score': metrics['test_r2'],
        'RMSE': metrics['test_rmse'],
        'MAE': metrics['test_mae']
    }

# Create comparison DataFrame
comparison_df = pd.DataFrame(all_models).T
comparison_df = comparison_df.sort_values('R² Score', ascending=False)

# Visualize model comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# R² Score comparison
axes[0].barh(comparison_df.index, comparison_df['R² Score'], color='skyblue', alpha=0.7)
axes[0].set_xlabel('R² Score')
axes[0].set_title('Model R² Score Comparison', fontweight='bold')
axes[0].grid(True, alpha=0.3)

# RMSE comparison
axes[1].barh(comparison_df.index, comparison_df['RMSE'], color='coral', alpha=0.7)
axes[1].set_xlabel('RMSE')
axes[1].set_title('Model RMSE Comparison', fontweight='bold')
axes[1].grid(True, alpha=0.3)

# MAE comparison
axes[2].barh(comparison_df.index, comparison_df['MAE'], color='lightgreen', alpha=0.7)
axes[2].set_xlabel('MAE')
axes[2].set_title('Model MAE Comparison', fontweight='bold')
axes[2].grid(True, alpha=0.3)

plt.suptitle('Statistical Model Performance Comparison for KMA Heat Index Prediction', fontsize=16)
plt.tight_layout()
plt.show()

print("📊 Model Performance Summary:")
print("="*60)
print(comparison_df.round(4))

# Identify best model
best_model = comparison_df['R² Score'].idxmax()
print(f"\n🏆 Best Model: {best_model}")
print(f"   R² Score: {comparison_df.loc[best_model, 'R² Score']:.4f}")
print(f"   RMSE: {comparison_df.loc[best_model, 'RMSE']:.4f}")
print(f"   MAE: {comparison_df.loc[best_model, 'MAE']:.4f}")

## 10. Residual Analysis

In [None]:
# Use the best performing model (highest test R² in Section 9) for residual analysis
all_predictions = {name: m['predictions'] for name, m in results.items()}
all_predictions['Polynomial Regression'] = y_pred_poly_test
all_predictions.update({name: m['predictions'] for name, m in ensemble_results.items()})

model_name = best_model
best_predictions = all_predictions[model_name]

residuals = y_test - best_predictions

# Residual plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Residuals vs Fitted
axes[0, 0].scatter(best_predictions, residuals, alpha=0.5, s=10)
axes[0, 0].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')
axes[0, 0].grid(True, alpha=0.3)

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)

# Histogram of residuals
axes[1, 0].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].axvline(x=0, color='red', linestyle='--', linewidth=1)
axes[1, 0].grid(True, alpha=0.3)

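# Residuals over time (sorted by index, because the random split shuffled the test rows)
axes[1, 1].plot(residuals.sort_index().values, alpha=0.7)
axes[1, 1].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[1, 1].set_xlabel('Test sample (chronological order)')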
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals Over Time')
axes[1, 1].grid(True, alpha=0.3)

plt.suptitle(f'Residual Analysis for {model_name} Model', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Statistical tests
# Shapiro-Wilk test for normality
shapiro_stat, shapiro_p = stats.shapiro(residuals[:1000])  # Use subset for test
print(f"\n📊 Residual Analysis:")
print(f"Mean of residuals: {residuals.mean():.4f}")
print(f"Std of residuals: {residuals.std():.4f}")
print(f"\nShapiro-Wilk test for normality:")
print(f"  Statistic: {shapiro_stat:.4f}")
print(f"  p-value: {shapiro_p:.4f}")
if shapiro_p > 0.05:
    print("  Result: Residuals appear to be normally distributed")
else:
    print("  Result: Residuals do not appear to be normally distributed")
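
With thousands of residuals, a formal normality test can reject for deviations too small to matter in practice, so it is worth looking at effect-size style summaries as well. The sketch below reports skewness, excess kurtosis, and the Q-Q plot correlation for a synthetic residual vector. `resid_demo` is a stand-in for this example and is not the notebook's `residuals`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
resid_demo = rng.normal(0, 0.5, 2000)  # synthetic, roughly normal residuals

# Skewness near 0 means symmetric; excess kurtosis near 0 means normal-like tails
skewness = stats.skew(resid_demo)
excess_kurtosis = stats.kurtosis(resid_demo)  # Fisher definition: normal gives 0

# Correlation of the Q-Q plot points with their least-squares line
(_, _), (qq_slope, qq_intercept, qq_r) = stats.probplot(resid_demo, dist='norm')

print(f"Skewness:        {skewness:+.3f}")
print(f"Excess kurtosis: {excess_kurtosis:+.3f}")
print(f"Q-Q correlation: {qq_r:.4f}")
```

Values far from zero for the first two numbers, or a Q-Q correlation well below 1, point to the kind of departure from normality that is worth investigating.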

## 11. Save Model and Generate Report

In [None]:
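# Save the best model selected in Section 9
import os
import joblib

os.makedirs('../models', exist_ok=True)  # joblib.dump fails if the folder is missing

fitted_models = {
    **models,
    'Polynomial Regression': poly_model,
    **{name: m['model'] for name, m in ensemble_results.items()},
}
best_model_obj = fitted_models[best_model]
model_filename = f"../models/kma_heat_index_{best_model.lower().replace(' ', '_')}_model.pkl"

# Save the model and the preprocessing objects. The linear models need the scaler,
# the polynomial model needs the scaler and `poly`; the tree ensembles use raw features.
joblib.dump(best_model_obj, model_filename)
joblib.dump(scaler, '../models/kma_heat_index_scaler.pkl')
joblib.dump(poly, '../models/kma_heat_index_poly_features.pkl')
print(f"✅ Model saved to {model_filename}")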

# Generate summary report
report = f"""
STATISTICAL MODELING REPORT - KMA HEAT INDEX PREDICTION
========================================================

DATA SUMMARY:
- Total samples: {len(df)}
- Training samples: {len(X_train)}
- Testing samples: {len(X_test)}
- Features used: {', '.join(feature_cols)}

MODEL PERFORMANCE:
{comparison_df.round(4).to_string()}

BEST MODEL: {best_model}
- R² Score: {comparison_df.loc[best_model, 'R² Score']:.4f}
- RMSE: {comparison_df.loc[best_model, 'RMSE']:.4f} °C
- MAE: {comparison_df.loc[best_model, 'MAE']:.4f} °C

KEY FINDINGS (expected patterns; confirm each against the results above):
1. Temperature is the strongest predictor of KMA heat index
2. Humidity has significant negative correlation with heat index
3. Time-based features (hour, month) improve prediction accuracy
4. Ensemble models outperform linear models
5. Random Forest provides best balance of accuracy and interpretability

TIME SERIES FORECASTING:
- ARIMA(2,1,2) RMSE: {arima_rmse:.4f} °C
- Exponential Smoothing RMSE: {exp_rmse:.4f} °C

RECOMMENDATIONS:
1. Use the best-performing model ({best_model}) for real-time heat index prediction
2. Consider ensemble methods for operational forecasting
3. Update models regularly with new data
4. Monitor model performance for drift
5. Integrate with S-DoT sensor network for real-time predictions
"""

print(report)

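# Save report (create the output folder first; UTF-8 keeps the ° and ² characters intact)
import os
os.makedirs('../reports', exist_ok=True)
with open('../reports/statistical_modeling_report.txt', 'w', encoding='utf-8') as f:
    f.write(report)
print("\n✅ Report saved to ../reports/statistical_modeling_report.txt")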
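
When a saved model depends on preprocessing, one way to keep the pieces together is to persist a single pipeline rather than separate files. The sketch below saves and reloads a small scaler-plus-Ridge pipeline with `joblib` and checks that predictions match. It writes to a temporary directory, and the data and file name `demo_pipeline.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 2] + rng.normal(0, 0.1, 100)

# Scaler and model travel together, so they cannot get out of sync
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.allclose(pipeline.predict(X_demo), reloaded.predict(X_demo))
print(f"Predictions identical after reload: {same_predictions}")
```

Pickled models are tied to the library versions that created them, so it is a good idea to record the scikit-learn version alongside the file.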

## 12. Assignment

### Week 7 Tasks:

1. **Regression Analysis** (25 points)
   - Build multiple regression models for KMA heat index
   - Compare linear, polynomial, and regularized models
   - Analyze feature importance and coefficients

2. **Ensemble Methods** (25 points)
   - Implement Random Forest and Gradient Boosting
   - Tune hyperparameters using cross-validation
   - Compare ensemble performance with linear models

3. **Time Series Forecasting** (25 points)
   - Apply ARIMA and exponential smoothing
   - Forecast daily heat index values
   - Evaluate forecast accuracy

4. **Model Validation** (25 points)
   - Perform residual analysis
   - Conduct cross-validation
   - Test for overfitting

### Bonus Challenge:
- Implement neural network for heat index prediction
- Create real-time prediction API
- Develop uncertainty quantification for predictions

## Summary

In this week, we covered:
- ✅ Multiple regression techniques for KMA heat index prediction
- ✅ Feature importance and correlation analysis
- ✅ Ensemble methods (Random Forest, Gradient Boosting)
- ✅ Time series forecasting with ARIMA and exponential smoothing
- ✅ Model validation and residual analysis

### Next Week Preview:
**Week 8: Machine Learning Applications**
- Deep learning for heat index prediction
- Clustering for pattern discovery
- Anomaly detection
- Real-time prediction systems

### Resources:
- [Scikit-learn Documentation](https://scikit-learn.org/)
- [Statsmodels Time Series](https://www.statsmodels.org/stable/tsa.html)
- [Feature Engineering Guide](https://feature-engine.readthedocs.io/)

---
**End of Week 7**
