# 🎯 AQI Forecasting ML Pipeline - 1d/3d/7d Predictions

**Complete ML Pipeline with MLflow Tracking**

## Models:
### Regression (Predict AQI values)
1. Linear Regression
2. Random Forest
3. XGBoost

### Classification (Predict High AQI alerts)
1. Logistic Regression
2. Random Forest
3. XGBoost

## Forecasting Horizons:
- **1-day** ahead (aqi_next_1d)
- **3-day** ahead (aqi_next_3d)
- **7-day** ahead (aqi_next_7d)
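
The target column names suggest that each horizon is a city's AQI shifted forward by N days. The gold table's actual construction is not shown in this notebook, so the following is only a minimal sketch of how such targets could be derived with pandas. It assumes one row per city per day, and the `city` and `date` column names are hypothetical.

```python
import pandas as pd

def add_horizon_targets(df, horizons=(1, 3, 7)):
    """Add aqi_next_{h}d columns by looking h rows ahead within each city."""
    out = df.sort_values(["city", "date"]).copy()
    for h in horizons:
        # shift(-h) pulls the value from h rows ahead; assumes one row per day
        out[f"aqi_next_{h}d"] = out.groupby("city")["aqi"].shift(-h)
    return out

# Tiny illustrative frame (values are made up)
demo = pd.DataFrame({
    "city": ["A"] * 4 + ["B"] * 4,
    "date": list(pd.date_range("2024-01-01", periods=4)) * 2,
    "aqi": [100, 110, 120, 130, 50, 60, 70, 80],
})
targets = add_horizon_targets(demo)
print(targets[["city", "date", "aqi", "aqi_next_1d", "aqi_next_3d"]])
```

Grouping by city before shifting keeps the last rows of one city from picking up another city's values; those rows end up null, which is consistent with the null-target filter applied in section 3.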

## 1. Setup & Imports

In [0]:
%pip install xgboost

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
# Core libraries
import pandas as pd
import numpy as np
from pyspark.sql import functions as F

# Sklearn models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from xgboost import XGBRegressor, XGBClassifier

# Metrics
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score
)

# MLflow
import mlflow
import mlflow.sklearn
from sklearn.utils.class_weight import compute_class_weight
print(" Libraries imported successfully")

 Libraries imported successfully


## 2. Load Data from Gold Table

In [0]:
# Table configuration
CATALOG = "aqi_india"
SCHEMA = "gold"
TABLE = "aqi_ml_features"

# Load data
df = spark.table(f"{CATALOG}.{SCHEMA}.{TABLE}")

print(f"Total records: {df.count():,}")
print(f"Columns: {len(df.columns)}")

Total records: 289,034
Columns: 52


## 3. Feature Engineering & Preprocessing

In [0]:
# Filter valid records (all target columns not null)
df_clean = df.filter(
    F.col("aqi_next_1d").isNotNull() &
    F.col("aqi_next_3d").isNotNull() &
    F.col("aqi_next_7d").isNotNull()
)

print(f"Records after filtering: {df_clean.count():,}")

Records after filtering: 287,137


In [0]:
# Select features and targets
feature_cols = [
    # Current AQI
    "aqi",
    
    # Lag features
    "aqi_lag_1", "aqi_lag_3", "aqi_lag_7", "aqi_lag_14", "aqi_lag_30",
    
    # Rolling statistics
    "aqi_rolling_avg_7", "aqi_rolling_avg_14", "aqi_rolling_avg_30",
    "aqi_rolling_std_7", "aqi_rolling_std_14",
    "aqi_rolling_max_7", "aqi_rolling_min_7",
    
    # Change features
    "aqi_change_1d", "aqi_change_7d",
    "aqi_pct_change_1d", "aqi_pct_change_7d",
    
    # City baselines
    "city_avg_aqi", "city_std_aqi",
    "aqi_deviation_from_city_avg", "aqi_z_score",
    
    # Time features
    "month", "day_of_week", "quarter",
    "month_sin", "month_cos",
    "day_of_week_sin", "day_of_week_cos",
    
    # Boolean features
    "is_weekend", "is_high_pollution", "is_severe_pollution",
    
    # Interaction features
    "is_winter_high_pollution_city", "weekend_pollution_delta"
]

target_cols = [
    "aqi_next_1d", "aqi_next_3d", "aqi_next_7d",
    "target_high_aqi_tomorrow", "target_severe_aqi_tomorrow"
]

# Select columns
df_ml = df_clean.select(feature_cols + target_cols)

print(f"Features: {len(feature_cols)}")
print(f"Targets: {len(target_cols)}")

Features: 33
Targets: 5


In [0]:
# Convert boolean columns to int
df_ml = df_ml \
    .withColumn("is_weekend", F.col("is_weekend").cast("int")) \
    .withColumn("is_high_pollution", F.col("is_high_pollution").cast("int")) \
    .withColumn("is_severe_pollution", F.col("is_severe_pollution").cast("int"))

# Convert to pandas
pdf = df_ml.toPandas()

print(f" Data converted to pandas: {pdf.shape}")
print(f"  Rows: {pdf.shape[0]:,}")
print(f"  Columns: {pdf.shape[1]}")

 Data converted to pandas: (287137, 38)
  Rows: 287,137
  Columns: 38


In [0]:
# Handle missing values
pdf = pdf.fillna(0)

print(f" Missing values handled")
print(f"  Final dataset: {pdf.shape}")

 Missing values handled
  Final dataset: (287137, 38)
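
Filling every gap with 0 treats a missing lag or rolling value as an AQI of zero. One possible alternative, sketched here on made-up values, is to fill with medians computed on the training rows only. The helper name `fill_with_train_median` is hypothetical, and the sketch assumes the data has already been split into train and test frames.

```python
import pandas as pd

def fill_with_train_median(train, test):
    """Fill missing values in both frames using medians from the training frame."""
    medians = train.median(numeric_only=True)
    return train.fillna(medians), test.fillna(medians)

# Illustrative frames (values are made up)
train_demo = pd.DataFrame({"aqi_lag_30": [100.0, None, 140.0], "aqi": [90.0, 120.0, 150.0]})
test_demo = pd.DataFrame({"aqi_lag_30": [None, 200.0], "aqi": [80.0, 160.0]})

train_filled, test_filled = fill_with_train_median(train_demo, test_demo)
print(train_filled)
print(test_filled)
```

Computing the medians on the training frame alone keeps test-set statistics out of preprocessing.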


## 4. Prepare Train/Test Sets

In [0]:
# Features
X = pdf[feature_cols]

# Regression targets
y_1d = pdf["aqi_next_1d"]
y_3d = pdf["aqi_next_3d"]
y_7d = pdf["aqi_next_7d"]

# Classification targets
y_high = pdf["target_high_aqi_tomorrow"]
y_severe = pdf["target_severe_aqi_tomorrow"]

# Train/test split (80/20); a single call keeps features and all targets row-aligned
(X_train, X_test,
 y_1d_train, y_1d_test,
 y_3d_train, y_3d_test,
 y_7d_train, y_7d_test,
 y_high_train, y_high_test,
 y_severe_train, y_severe_test) = train_test_split(
    X, y_1d, y_3d, y_7d, y_high, y_severe, test_size=0.2, random_state=42
)

print(f" Train/Test split complete")
print(f" Training samples: {X_train.shape[0]:,}")
print(f" Test samples: {X_test.shape[0]:,}")

 Train/Test split complete
 Training samples: 229,709
 Test samples: 57,428
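
The split above is random, so training and test rows are interleaved in time. Because the features include lags and rolling windows, neighbouring days can land on both sides of the split, which may make the test metrics optimistic. A chronological hold-out is one common alternative. The sketch below assumes a `date` column (not part of the selected feature set here) and uses a hypothetical helper name.

```python
import pandas as pd

def chronological_split(df, date_col="date", test_fraction=0.2):
    """Hold out the most recent dates as the test set."""
    unique_dates = sorted(df[date_col].unique())
    n_test_dates = max(1, round(len(unique_dates) * test_fraction))
    cutoff = unique_dates[-n_test_dates]  # first date that belongs to the test set
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Illustrative frame: 2 cities x 10 days (values are made up)
demo = pd.DataFrame({
    "city": ["A"] * 10 + ["B"] * 10,
    "date": list(pd.date_range("2024-01-01", periods=10)) * 2,
    "aqi": range(20),
})
train_part, test_part = chronological_split(demo)
print(len(train_part), len(test_part))
print(train_part["date"].max(), "<", test_part["date"].min())
```

Splitting on dates rather than row positions keeps every city's rows for a given day on the same side of the cutoff.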


## 5. MLflow Setup

In [0]:
# MLflow configuration
from mlflow.tracking import _model_registry

def _dummy_get_registry_uri_from_spark_session():
    return None

_model_registry.utils._get_registry_uri_from_spark_session = _dummy_get_registry_uri_from_spark_session

mlflow.set_tracking_uri("databricks")

# Set experiment (update with your username)
EXPERIMENT_NAME = "/Users/keerthi.amulya.1999@gmail.com/AQI_ML_Pipeline_Complete_2"
mlflow.set_experiment(EXPERIMENT_NAME)

exp = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
print(f"✓ MLflow Experiment: {exp.name}")
print(f"  Experiment ID: {exp.experiment_id}")

✓ MLflow Experiment: /Users/keerthi.amulya.1999@gmail.com/AQI_ML_Pipeline_Complete_2
  Experiment ID: 3079394547912133


## 6. Helper Functions

In [0]:
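def calculate_mape(y_true, y_pred):
    """Calculate Mean Absolute Percentage Error, skipping rows where y_true is 0"""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0  # avoid division by zero
    return np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100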

def train_regression_model(model, model_name, horizon, X_train, X_test, y_train, y_test):
    """
    Train regression model and log to MLflow
    """
    run_name = f"{model_name}_{horizon}"
    
    with mlflow.start_run(run_name=run_name):
        # Train model
        model.fit(X_train, y_train)
        
        # Predictions
        y_pred = model.predict(X_test)
        
        # Calculate metrics
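        # Take the square root explicitly; the squared=False argument is
        # deprecated in recent scikit-learn releases
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))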
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        mape = calculate_mape(y_test, y_pred)
        
        # Log parameters
        mlflow.log_param("model_type", model_name)
        mlflow.log_param("forecast_horizon", horizon)
        mlflow.log_param("n_features", X_train.shape[1])
        mlflow.log_param("n_train", X_train.shape[0])
        mlflow.log_param("n_test", X_test.shape[0])
        
        # Log metrics
        mlflow.log_metrics({
            "rmse": rmse,
            "mae": mae,
            "r2": r2,
            "mape": mape
        })
        
        # Log model
        mlflow.sklearn.log_model(model, "model")
        
        print(f"  {run_name}: RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.3f}, MAPE={mape:.2f}%")
        
        return model, {"rmse": rmse, "mae": mae, "r2": r2, "mape": mape}

def train_classification_model(model, model_name, target_type, X_train, X_test, y_train, y_test):
    """
    Train classification model and log to MLflow
    """
    run_name = f"{model_name}_{target_type}"
    
    with mlflow.start_run(run_name=run_name):
        # Train model
        model.fit(X_train, y_train)
        
        # Predictions
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, zero_division=0)
        recall = recall_score(y_test, y_pred, zero_division=0)
        f1 = f1_score(y_test, y_pred, zero_division=0)
        
        # Log parameters
        mlflow.log_param("model_type", model_name)
        mlflow.log_param("target_type", target_type)
        mlflow.log_param("n_features", X_train.shape[1])
        mlflow.log_param("n_train", X_train.shape[0])
        mlflow.log_param("n_test", X_test.shape[0])
        
        # Log metrics
        mlflow.log_metrics({
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1
        })
        
        # Log model
        mlflow.sklearn.log_model(model, "model")
        
        print(f"  {run_name}: Acc={accuracy:.3f}, Prec={precision:.3f}, Rec={recall:.3f}, F1={f1:.3f}")
        
        return model, {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print("✓ Helper functions defined")

✓ Helper functions defined
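
To make the regression metrics concrete, here is a small worked example on made-up values using only NumPy. Masking rows where the true value is zero before computing MAPE is an assumption of this sketch, made to avoid division by zero.

```python
import numpy as np

# Made-up true and predicted AQI values
y_true = np.array([100.0, 200.0, 50.0, 0.0])
y_pred = np.array([110.0, 180.0, 55.0, 5.0])

errors = y_true - y_pred

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))

# MAPE: mean absolute error relative to the true value, in percent,
# computed only where the true value is non-zero
nonzero = y_true != 0
mape = np.mean(np.abs(errors[nonzero] / y_true[nonzero])) * 100

print(f"RMSE={rmse:.2f}, MAE={mae:.2f}, MAPE={mape:.2f}%")
# RMSE=11.73, MAE=10.00, MAPE=10.00%
```

RMSE penalises the single 20-point miss more heavily than MAE does, which is why it comes out higher than MAE here.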


---
# REGRESSION PIPELINE
## Predict AQI Values (1d, 3d, 7d)

## 7. Linear Regression (1d, 3d, 7d)

In [0]:
print("\n" + "="*80)
print("LINEAR REGRESSION - Multi-horizon Forecasting")
print("="*80)

lr_results = {}

# 1-day forecast
print("\n1-DAY FORECAST:")
lr_1d, metrics_1d = train_regression_model(
    LinearRegression(), "LinearRegression", "1day",
    X_train, X_test, y_1d_train, y_1d_test
)
lr_results['1d'] = metrics_1d

# 3-day forecast
print("\n3-DAY FORECAST:")
lr_3d, metrics_3d = train_regression_model(
    LinearRegression(), "LinearRegression", "3day",
    X_train, X_test, y_3d_train, y_3d_test
)
lr_results['3d'] = metrics_3d

# 7-day forecast
print("\n7-DAY FORECAST:")
lr_7d, metrics_7d = train_regression_model(
    LinearRegression(), "LinearRegression", "7day",
    X_train, X_test, y_7d_train, y_7d_test
)
lr_results['7d'] = metrics_7d

print("\n✓ Linear Regression complete for all horizons")


LINEAR REGRESSION - Multi-horizon Forecasting

1-DAY FORECAST:




  LinearRegression_1day: RMSE=37.18, MAE=24.50, R2=0.806, MAPE=23.45%

3-DAY FORECAST:




  LinearRegression_3day: RMSE=46.71, MAE=32.28, R2=0.693, MAPE=32.21%

7-DAY FORECAST:




  LinearRegression_7day: RMSE=50.13, MAE=35.51, R2=0.649, MAPE=36.10%

✓ Linear Regression complete for all horizons


## 8. Random Forest Regression (1d, 3d, 7d)

In [0]:
print("\n" + "="*80)
print("RANDOM FOREST REGRESSION - Multi-horizon Forecasting")
print("="*80)

rf_results = {}

# 1-day forecast
print("\n1-DAY FORECAST:")
rf_1d, metrics_1d = train_regression_model(
    RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1),
    "RandomForest", "1day",
    X_train, X_test, y_1d_train, y_1d_test
)
rf_results['1d'] = metrics_1d

# 3-day forecast
print("\n3-DAY FORECAST:")
rf_3d, metrics_3d = train_regression_model(
    RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1),
    "RandomForest", "3day",
    X_train, X_test, y_3d_train, y_3d_test
)
rf_results['3d'] = metrics_3d

# 7-day forecast
print("\n7-DAY FORECAST:")
rf_7d, metrics_7d = train_regression_model(
    RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1),
    "RandomForest", "7day",
    X_train, X_test, y_7d_train, y_7d_test
)
rf_results['7d'] = metrics_7d

print("\n✓ Random Forest complete for all horizons")


RANDOM FOREST REGRESSION - Multi-horizon Forecasting

1-DAY FORECAST:




  RandomForest_1day: RMSE=36.82, MAE=24.18, R2=0.810, MAPE=22.91%

3-DAY FORECAST:




  RandomForest_3day: RMSE=44.40, MAE=30.56, R2=0.723, MAPE=30.34%

7-DAY FORECAST:




  RandomForest_7day: RMSE=45.85, MAE=32.14, R2=0.707, MAPE=32.34%

✓ Random Forest complete for all horizons


## 9. XGBoost Regression (1d, 3d, 7d)

In [0]:
print("\n" + "="*80)
print("XGBOOST REGRESSION - Multi-horizon Forecasting")
print("="*80)

xgb_results = {}

# 1-day forecast
print("\n1-DAY FORECAST:")
xgb_1d, metrics_1d = train_regression_model(
    XGBRegressor(n_estimators=100, max_depth=10, learning_rate=0.1, random_state=42, n_jobs=-1),
    "XGBoost", "1day",
    X_train, X_test, y_1d_train, y_1d_test
)
xgb_results['1d'] = metrics_1d

# 3-day forecast
print("\n3-DAY FORECAST:")
xgb_3d, metrics_3d = train_regression_model(
    XGBRegressor(n_estimators=100, max_depth=10, learning_rate=0.1, random_state=42, n_jobs=-1),
    "XGBoost", "3day",
    X_train, X_test, y_3d_train, y_3d_test
)
xgb_results['3d'] = metrics_3d

# 7-day forecast
print("\n7-DAY FORECAST:")
xgb_7d, metrics_7d = train_regression_model(
    XGBRegressor(n_estimators=100, max_depth=10, learning_rate=0.1, random_state=42, n_jobs=-1),
    "XGBoost", "7day",
    X_train, X_test, y_7d_train, y_7d_test
)
xgb_results['7d'] = metrics_7d

print("\n✓ XGBoost complete for all horizons")


XGBOOST REGRESSION - Multi-horizon Forecasting

1-DAY FORECAST:




  XGBoost_1day: RMSE=36.60, MAE=24.00, R2=0.812, MAPE=22.64%

3-DAY FORECAST:




  XGBoost_3day: RMSE=43.87, MAE=30.15, R2=0.729, MAPE=29.85%

7-DAY FORECAST:




  XGBoost_7day: RMSE=45.50, MAE=31.89, R2=0.711, MAPE=31.97%

✓ XGBoost complete for all horizons


---
# CLASSIFICATION PIPELINE
## Predict High/Severe Pollution Alerts

## 10. Logistic Regression (High & Severe)

In [0]:
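def train_classification_model(model, model_name, target_type, X_train, X_test, y_train, y_test):
    """
    Train classification model and log to MLflow.

    Redefines the helper from section 6: this version also logs the
    training-set class counts and prints one metric per line.
    """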
    run_name = f"{model_name}_{target_type}"
    
    with mlflow.start_run(run_name=run_name):
        # Train model
        model.fit(X_train, y_train)
        
        # Predictions
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, zero_division=0)
        recall = recall_score(y_test, y_pred, zero_division=0)
        f1 = f1_score(y_test, y_pred, zero_division=0)
        
        # Log parameters
        mlflow.log_param("model_type", model_name)
        mlflow.log_param("target_type", target_type)
        mlflow.log_param("n_features", X_train.shape[1])
        mlflow.log_param("n_train", X_train.shape[0])
        mlflow.log_param("n_test", X_test.shape[0])
        
        # Calculate class distribution
        unique, counts = np.unique(y_train, return_counts=True)
        class_dist = dict(zip(unique, counts))
        mlflow.log_param("class_0_count", class_dist.get(0, 0))
        mlflow.log_param("class_1_count", class_dist.get(1, 0))
        
        # Log metrics
        mlflow.log_metrics({
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1
        })
        
        # Log model
        mlflow.sklearn.log_model(model, "model")
        
        print(f"  {run_name}:")
        print(f"    Accuracy:  {accuracy:.3f}")
        print(f"    Precision: {precision:.3f}")
        print(f"    Recall:    {recall:.3f}")
        print(f"    F1 Score:  {f1:.3f}")
        
        return model, {
            "accuracy": accuracy, 
            "precision": precision, 
            "recall": recall, 
            "f1": f1
        }

print(" Classification training function defined")

 Classification training function defined


In [0]:
print("\n" + "="*80)
print("LOGISTIC REGRESSION - Alert Classification")
print("="*80)

log_results = {}

# High pollution alert
print("\nHIGH POLLUTION ALERT (AQI > 150):")
log_high, metrics_high = train_classification_model(
    LogisticRegression(
        max_iter=1000, 
        class_weight='balanced',  
        random_state=42
    ),
    "LogisticRegression", "HighAQI",
    X_train, X_test, y_high_train, y_high_test
)
log_results['high'] = metrics_high

# Severe pollution alert
print("\nSEVERE POLLUTION ALERT (AQI > 300):")
print(" Note: Severe events are rare (class imbalance)")
log_severe, metrics_severe = train_classification_model(
    LogisticRegression(
        max_iter=1000, 
        class_weight='balanced',  
        random_state=42
    ),
    "LogisticRegression", "SevereAQI",
    X_train, X_test, y_severe_train, y_severe_test
)
log_results['severe'] = metrics_severe

print("\n✓ Logistic Regression complete for both alert types")


LOGISTIC REGRESSION - Alert Classification

HIGH POLLUTION ALERT (AQI > 150):


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


  LogisticRegression_HighAQI:
    Accuracy:  0.897
    Precision: 0.642
    Recall:    0.879
    F1 Score:  0.742

SEVERE POLLUTION ALERT (AQI > 300):
 Note: Severe events are rare (class imbalance)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


  LogisticRegression_SevereAQI:
    Accuracy:  0.931
    Precision: 0.093
    Recall:    0.926
    F1 Score:  0.170

✓ Logistic Regression complete for both alert types
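
Both runs above hit the iteration limit, and the warning itself suggests scaling the data. A minimal sketch of that suggestion wraps the model in a scikit-learn pipeline with `StandardScaler`. The data below is synthetic and only mimics features on very different scales; whether scaling changes the reported metrics on the real data was not tested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins: one AQI-scale feature and one small-scale feature
aqi = rng.normal(150, 80, size=1000)
month_sin = rng.uniform(-1, 1, size=1000)
X_demo = np.column_stack([aqi, month_sin])
y_demo = (aqi + rng.normal(0, 20, size=1000) > 250).astype(int)

# The scaler is fit inside the pipeline, so it only sees the data passed to fit()
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)
scaled_logreg.fit(X_demo, y_demo)

n_iter = scaled_logreg.named_steps["logisticregression"].n_iter_[0]
accuracy = scaled_logreg.score(X_demo, y_demo)
print(f"Iterations used: {n_iter}, accuracy: {accuracy:.3f}")
```

A pipeline like this can be passed as `model` to `train_classification_model` unchanged, since it exposes the same `fit` and `predict` methods.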


## 11. Random Forest Classification (High & Severe)

In [0]:
print("\n" + "="*80)
print("RANDOM FOREST CLASSIFICATION - Alert Classification")
print("="*80)

rfc_results = {}

# High pollution alert
print("\nHIGH POLLUTION ALERT (AQI > 150):")
rfc_high, metrics_high = train_classification_model(
    RandomForestClassifier(
        n_estimators=100, 
        max_depth=15, 
        class_weight='balanced',  
        random_state=42, 
        n_jobs=-1
    ),
    "RandomForestClassifier", "HighAQI",
    X_train, X_test, y_high_train, y_high_test
)
rfc_results['high'] = metrics_high

# Severe pollution alert
print("\nSEVERE POLLUTION ALERT (AQI > 300):")
print("   Note: Severe events are rare (class imbalance)")
rfc_severe, metrics_severe = train_classification_model(
    RandomForestClassifier(
        n_estimators=100, 
        max_depth=15, 
        class_weight='balanced',  
        random_state=42, 
        n_jobs=-1
    ),
    "RandomForestClassifier", "SevereAQI",
    X_train, X_test, y_severe_train, y_severe_test
)
rfc_results['severe'] = metrics_severe

print("\n✓ Random Forest Classification complete for both alert types")


RANDOM FOREST CLASSIFICATION - Alert Classification

HIGH POLLUTION ALERT (AQI > 150):




  RandomForestClassifier_HighAQI:
    Accuracy:  0.916
    Precision: 0.711
    Recall:    0.840
    F1 Score:  0.770

SEVERE POLLUTION ALERT (AQI > 300):
   Note: Severe events are rare (class imbalance)




  RandomForestClassifier_SevereAQI:
    Accuracy:  0.983
    Precision: 0.285
    Recall:    0.804
    F1 Score:  0.421

✓ Random Forest Classification complete for both alert types


## 12. XGBoost Classification (High & Severe)

In [0]:
print("\n" + "="*80)
print("XGBOOST CLASSIFICATION - Alert Classification")
print("="*80)

xgbc_results = {}

# Calculate class weight ratio for severe cases
severe_ratio = len(y_severe_train[y_severe_train == 0]) / len(y_severe_train[y_severe_train == 1])
print(f"  Severe class ratio: {severe_ratio:.1f}:1 (imbalanced)")

# High pollution alert
print("\nHIGH POLLUTION ALERT (AQI > 150):")
xgbc_high, metrics_high = train_classification_model(
    XGBClassifier(
        n_estimators=100, 
        max_depth=10, 
        learning_rate=0.1,
        scale_pos_weight=3,  
        random_state=42, 
        n_jobs=-1
    ),
    "XGBoostClassifier", "HighAQI",
    X_train, X_test, y_high_train, y_high_test
)
xgbc_results['high'] = metrics_high

# Severe pollution alert
print("\nSEVERE POLLUTION ALERT (AQI > 300):")
print(" Note: Severe events are rare (class imbalance)")
xgbc_severe, metrics_severe = train_classification_model(
    XGBClassifier(
        n_estimators=100, 
        max_depth=10, 
        learning_rate=0.1,
        scale_pos_weight=severe_ratio,  
        random_state=42, 
        n_jobs=-1
    ),
    "XGBoostClassifier", "SevereAQI",
    X_train, X_test, y_severe_train, y_severe_test
)
xgbc_results['severe'] = metrics_severe

print("\n✓ XGBoost Classification complete for both alert types")


XGBOOST CLASSIFICATION - Alert Classification
  Severe class ratio: 114.3:1 (imbalanced)

HIGH POLLUTION ALERT (AQI > 150):




  XGBoostClassifier_HighAQI:
    Accuracy:  0.913
    Precision: 0.700
    Recall:    0.849
    F1 Score:  0.767

SEVERE POLLUTION ALERT (AQI > 300):
 Note: Severe events are rare (class imbalance)




  XGBoostClassifier_SevereAQI:
    Accuracy:  0.988
    Precision: 0.347
    Recall:    0.721
    F1 Score:  0.468

✓ XGBoost Classification complete for both alert types
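
The severe-alert models trade precision for recall: with class weighting they catch most severe days but raise many false alarms. `model.predict` applies a fixed 0.5 cut-off; moving that cut-off is one way to shift the balance. The sketch below shows the effect on made-up scores using only NumPy. In the notebook, the scores would presumably come from `predict_proba(X_test)[:, 1]`, which is an assumption about how one might extend it rather than something the source does.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as alerts."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Made-up labels and predicted probabilities for ten days
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.55, 0.60, 0.65, 0.70, 0.80, 0.95])

for threshold in (0.5, 0.75):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.50: precision=0.50, recall=1.00
# threshold=0.75: precision=1.00, recall=0.67
```

Raising the threshold removes the three false alarms at the cost of missing one true severe day; which trade-off is acceptable depends on how the alerts are used.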


---
## 13. Summary of All Results

In [0]:
print("\n" + "="*80)
print("📊 REGRESSION RESULTS SUMMARY (Lower RMSE is better)")
print("="*80)

print("\n1-DAY FORECAST:")
print(f"  Linear Regression: RMSE={lr_results['1d']['rmse']:.2f}, R2={lr_results['1d']['r2']:.3f}")
print(f"  Random Forest:     RMSE={rf_results['1d']['rmse']:.2f}, R2={rf_results['1d']['r2']:.3f}")
print(f"  XGBoost:           RMSE={xgb_results['1d']['rmse']:.2f}, R2={xgb_results['1d']['r2']:.3f}")

print("\n3-DAY FORECAST:")
print(f"  Linear Regression: RMSE={lr_results['3d']['rmse']:.2f}, R2={lr_results['3d']['r2']:.3f}")
print(f"  Random Forest:     RMSE={rf_results['3d']['rmse']:.2f}, R2={rf_results['3d']['r2']:.3f}")
print(f"  XGBoost:           RMSE={xgb_results['3d']['rmse']:.2f}, R2={xgb_results['3d']['r2']:.3f}")

print("\n7-DAY FORECAST:")
print(f"  Linear Regression: RMSE={lr_results['7d']['rmse']:.2f}, R2={lr_results['7d']['r2']:.3f}")
print(f"  Random Forest:     RMSE={rf_results['7d']['rmse']:.2f}, R2={rf_results['7d']['r2']:.3f}")
print(f"  XGBoost:           RMSE={xgb_results['7d']['rmse']:.2f}, R2={xgb_results['7d']['r2']:.3f}")

print("\n" + "="*80)
print("📊 CLASSIFICATION RESULTS SUMMARY (Higher Accuracy is better)")
print("="*80)

print("\nHIGH AQI ALERT (>150):")
print(f"  Logistic Regression: Acc={log_results['high']['accuracy']:.3f}, F1={log_results['high']['f1']:.3f}")
print(f"  Random Forest:       Acc={rfc_results['high']['accuracy']:.3f}, F1={rfc_results['high']['f1']:.3f}")
print(f"  XGBoost:             Acc={xgbc_results['high']['accuracy']:.3f}, F1={xgbc_results['high']['f1']:.3f}")


print("\n" + "="*80)
print("✅ ALL MODELS TRAINED AND LOGGED TO MLFLOW")
print("="*80)
print("\nTotal Models Trained: 15")
print("  - Regression: 9 (3 models × 3 horizons)")
print("  - Classification: 6 (3 models × 2 alert types)")
print(f"\nView results in MLflow: Experiments → {EXPERIMENT_NAME}")


📊 REGRESSION RESULTS SUMMARY (Lower RMSE is better)

1-DAY FORECAST:
  Linear Regression: RMSE=37.18, R2=0.806
  Random Forest:     RMSE=36.82, R2=0.810
  XGBoost:           RMSE=36.60, R2=0.812

3-DAY FORECAST:
  Linear Regression: RMSE=46.71, R2=0.693
  Random Forest:     RMSE=44.40, R2=0.723
  XGBoost:           RMSE=43.87, R2=0.729

7-DAY FORECAST:
  Linear Regression: RMSE=50.13, R2=0.649
  Random Forest:     RMSE=45.85, R2=0.707
  XGBoost:           RMSE=45.50, R2=0.711

📊 CLASSIFICATION RESULTS SUMMARY (Higher Accuracy is better)

HIGH AQI ALERT (>150):
  Logistic Regression: Acc=0.897, F1=0.742
  Random Forest:       Acc=0.916, F1=0.770
  XGBoost:             Acc=0.913, F1=0.767

✅ ALL MODELS TRAINED AND LOGGED TO MLFLOW

Total Models Trained: 15
  - Regression: 9 (3 models × 3 horizons)
  - Classification: 6 (3 models × 2 alert types)

View results in MLflow: Experiments → /Users/keerthi.amulya.1999@gmail.com/AQI_ML_Pipeline_Complete_2
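
The printed summary compares models horizon by horizon and leaves out the severe-alert results. As a sketch, the same numbers can be placed in pandas tables so the best model per column can be read off directly. All values below are copied from the outputs earlier in this notebook; the table layout is only one possible choice.

```python
import pandas as pd

# RMSE per model and horizon, copied from the regression outputs above
rmse_by_model = {
    "LinearRegression": {"1d": 37.18, "3d": 46.71, "7d": 50.13},
    "RandomForest": {"1d": 36.82, "3d": 44.40, "7d": 45.85},
    "XGBoost": {"1d": 36.60, "3d": 43.87, "7d": 45.50},
}
rmse_table = pd.DataFrame(rmse_by_model).T  # rows: models, columns: horizons
best_rmse = rmse_table.idxmin()             # model with the lowest RMSE per horizon

# Severe-alert metrics, copied from the classification outputs above
severe_by_model = {
    "LogisticRegression": {"precision": 0.093, "recall": 0.926, "f1": 0.170},
    "RandomForest": {"precision": 0.285, "recall": 0.804, "f1": 0.421},
    "XGBoost": {"precision": 0.347, "recall": 0.721, "f1": 0.468},
}
severe_table = pd.DataFrame(severe_by_model).T
best_severe_f1 = severe_table["f1"].idxmax()

print(rmse_table)
print(best_rmse)
print(severe_table)
print(f"Best severe-alert F1: {best_severe_f1}")
```

For the rare severe class, accuracy is above 0.93 for every model while precision stays below 0.35, so F1, precision, and recall are more informative than accuracy there.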
