# AQI Prediction Pipeline (WAQI Edition + Auto-Location + Detailed Forecasts)

This notebook implements a complete pipeline for Air Quality Index (AQI) prediction using the **World Air Quality Index (WAQI)** API.

**Features**:
1. **Automatic Location Detection**: Uses IP-based lookup via `geocoder` to find your current city and coordinates.
2. **Data Fetching**: Fetches the real daily PM2.5 forecast from WAQI.
3. **Augmentation**: Adds synthetic history to the short real series so the dataset is large enough to demonstrate model training.
4. **Modeling**: Trains a Random Forest regressor and, if TensorFlow is installed, an LSTM.
5. **Detailed Forecasting**:
    - **24-Hour Forecast**: Simulates hourly progression from the daily prediction using a fixed diurnal profile.
    - **7-Day Forecast**: Includes day labels (Day-1, Day-2, ...) and location tagging.
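
The 24-hour simulation listed under feature 5 can be read as simple arithmetic: a fixed list of 24 hourly multipliers is rescaled so that its mean equals the predicted daily average. The sketch below illustrates that idea in isolation. The multipliers are copied from `generate_hourly_forecast()` later in this notebook; the helper name `simulate_hourly` and the input value 150.0 are assumptions made for illustration only.

```python
# Hourly multipliers copied from generate_hourly_forecast() in this notebook.
DIURNAL_PROFILE = [
    0.8, 0.75, 0.7, 0.7, 0.75, 0.85, 1.0, 1.2, 1.3, 1.2,   # 00-09
    1.1, 1.0, 0.9, 0.9, 0.95, 1.0, 1.1, 1.2, 1.3, 1.25,    # 10-19
    1.15, 1.0, 0.9, 0.85,                                   # 20-23
]


def simulate_hourly(daily_avg, profile=DIURNAL_PROFILE):
    """Scale the profile so the mean of the hourly values equals daily_avg."""
    profile_mean = sum(profile) / len(profile)
    scale = daily_avg / profile_mean
    return [weight * scale for weight in profile]


# Illustrative daily average, not a real measurement.
hourly = simulate_hourly(150.0)
mean_hourly = sum(hourly) / len(hourly)  # equals 150.0 up to rounding error
```

Because the values come from a fixed shape rather than from hourly observations, they show a plausible daily rhythm and should not be read as an hourly prediction.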

In [1]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline

try:
    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, Dropout
    TF_AVAILABLE = True
except ImportError:
    print("Warning: TensorFlow not found. LSTM model will be disabled.")
    TF_AVAILABLE = False

try:
    import geocoder
    GEOCODER_AVAILABLE = True
except ImportError:
    print("Warning: 'geocoder' library not found. Auto-location will use API default.")
    GEOCODER_AVAILABLE = False



In [2]:
# ==========================================
# CONSTANTS & CONFIGURATION
# ==========================================
API_KEY = "24da0fb52646fa4b9ad08d97d043ea86d2c31983"
DEFAULT_BASE_URL = "https://api.waqi.info/feed/here/"
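
As an alternative to keeping the token in the notebook itself, the configuration can be read from the environment. The following is a minimal sketch under assumptions: the variable name `WAQI_API_TOKEN` and the helper `load_waqi_token` are hypothetical and not part of this notebook or of the WAQI API.

```python
import os


def load_waqi_token(env_var="WAQI_API_TOKEN"):
    """Return the WAQI token from the environment, or fail with a clear message.

    The variable name is a hypothetical convention chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your WAQI API token."
        )
    return token


# Possible usage in the configuration cell:
# API_KEY = load_waqi_token()
```

Failing early with an explicit message tends to be easier to debug than a rejected request later in the pipeline.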

## 1. Location Detection

In [3]:
def detect_user_location():
    """
    Detects the user's location using IP-based geocoding.
    Returns (lat, lon), or (None, None) if detection fails or `geocoder` is unavailable.
    """
    if not GEOCODER_AVAILABLE:
        return None, None
        
    print("\n--- Step 1: Detecting User Location ---")
    try:
        g = geocoder.ip('me')
        if g.ok:
            print(f"Success! Detected Location: {g.city}, {g.country}")
            print(f"Coordinates: {g.lat}, {g.lng}")
            return g.lat, g.lng
        else:
            print("Could not detect location from IP. Will rely on API default.")
            return None, None
    except Exception as e:
        print(f"Location detection failed: {e}")
        return None, None
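
The `(lat, lon)` or `(None, None)` result above decides which WAQI feed URL is queried. A small sketch of that decision follows, using the two URL patterns that appear in this notebook (`feed/here/` and `feed/geo:{lat};{lon}/`). The helper name `build_feed_url` is an assumption for illustration. It compares against `None` explicitly so that a coordinate of exactly 0.0 still counts as valid.

```python
WAQI_FEED_ROOT = "https://api.waqi.info/feed"


def build_feed_url(lat=None, lon=None):
    """Pick the geo feed when both coordinates are known, else the IP-based feed."""
    if lat is None or lon is None:
        return f"{WAQI_FEED_ROOT}/here/"
    return f"{WAQI_FEED_ROOT}/geo:{lat};{lon}/"


fallback_url = build_feed_url()                 # no coordinates detected
geo_url = build_feed_url(23.0258, 72.5873)      # coordinates from the sample run
equator_url = build_feed_url(0.0, 72.5873)      # latitude 0.0 is still a valid input
```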

In [4]:
class AQIDataFetcher:
    """
    Handles fetching data from WAQI API.
    """
    def __init__(self, api_token, lat=None, lon=None):
        self.api_token = api_token
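        # Compare against None so that a coordinate of exactly 0.0 is not treated as missing.
        if lat is not None and lon is not None: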
            self.base_url = f"https://api.waqi.info/feed/geo:{lat};{lon}/"
            print(f"Targeting API for specific coordinates: {lat}, {lon}")
        else:
            self.base_url = DEFAULT_BASE_URL
            print("Targeting API for auto-detected IP location (feed/here/).")

    def fetch_data(self):
        params = {'token': self.api_token}
        
        print("Fetching data from WAQI API...")
        try:
            # A timeout keeps the notebook from hanging on a stalled connection.
            response = requests.get(self.base_url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()
            
            if data['status'] != 'ok':
                raise ValueError(f"API Error: {data.get('data', 'Unknown error')}")

            # Extract Daily Forecast for PM2.5
            forecast_data = data['data'].get('forecast', {}).get('daily', {}).get('pm25', [])
            
            # Extract Location Metadata
            city_info = data['data'].get('city', {})
            metadata = {
                'city': city_info.get('name', 'Unknown'),
                'lat': city_info.get('geo', [0, 0])[0], 
                'lon': city_info.get('geo', [0, 0])[1]
            }
            
            if not forecast_data:
                print("No forecast data found in API response.")
                return pd.DataFrame(), metadata

            records = []
            for item in forecast_data:
                records.append({
                    'dt': pd.to_datetime(item['day']),
                    'aqi': item['avg'], 
                    'min': item['min'],
                    'max': item['max']
                })
            
            df = pd.DataFrame(records)
            df.set_index('dt', inplace=True)
            df.sort_index(inplace=True)
            
            # Get Current 'Real' AQI
            current_aqi = data['data'].get('aqi')
            print(f"Current Real-Time AQI: {current_aqi}")
            print(f"Confirmed Station: {metadata['city']} (Lat: {metadata['lat']}, Lon: {metadata['lon']})")
            
            return df, metadata

        # ValueError covers the non-'ok' API status raised above.
        except (requests.exceptions.RequestException, ValueError) as e:
            print(f"Error fetching data: {e}")
            return pd.DataFrame(), {}
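
To make the parsing step easier to follow without a network call, here is a hedged offline sketch. The payload is made up for illustration: its values are not real measurements, and its shape only mirrors the keys that `fetch_data()` reads (`status`, `data.city`, `data.forecast.daily.pm25` with `day`, `avg`, `min`, `max`). The helper name `parse_pm25_forecast` is an assumption.

```python
import pandas as pd

# Illustrative payload only: values are invented, keys mirror what fetch_data() reads.
sample_payload = {
    "status": "ok",
    "data": {
        "aqi": 155,
        "city": {"name": "Example Station", "geo": [23.0, 72.6]},
        "forecast": {
            "daily": {
                "pm25": [
                    {"day": "2025-01-02", "avg": 140, "min": 90, "max": 170},
                    {"day": "2025-01-01", "avg": 150, "min": 100, "max": 180},
                ]
            }
        },
    },
}


def parse_pm25_forecast(payload):
    """Turn a WAQI-style payload into a date-indexed DataFrame sorted by day."""
    items = payload["data"].get("forecast", {}).get("daily", {}).get("pm25", [])
    if not items:
        return pd.DataFrame(columns=["aqi", "min", "max"])
    df = pd.DataFrame(
        {
            "aqi": [item["avg"] for item in items],
            "min": [item["min"] for item in items],
            "max": [item["max"] for item in items],
        },
        index=pd.to_datetime([item["day"] for item in items]),
    )
    df.index.name = "dt"
    return df.sort_index()


parsed = parse_pm25_forecast(sample_payload)
empty = parse_pm25_forecast({"status": "ok", "data": {}})
```

Keeping the parsing separate from the HTTP request makes it possible to test the DataFrame construction against a saved response.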

## 2. Preprocessing & Augmentation

In [5]:
class DataPreprocessor:
    """
    Handles data cleaning, filling missing values, and feature engineering.
    """
    def preprocess_and_augment(self, df):
        """
        1. Preprocesses the real data.
        2. AUGMENTS it with synthetic history because 7-10 days is NOT enough for training.
        """
        if df.empty:
            print("DataFrame is empty. Skipping preprocessing.")
            return df

        print("Preprocessing data...")
        
        # 1. Augmentation
        real_mean = df['aqi'].mean()
        real_std = df['aqi'].std() if len(df) > 1 else 10
        if np.isnan(real_std): real_std = 10
        
        last_date = df.index[-1]
        start_date = last_date - timedelta(days=60)
        synthetic_dates = pd.date_range(start=start_date, end=last_date - timedelta(days=1), freq='D')
        
        synthetic_aqi = []
        val = real_mean
        for _ in range(len(synthetic_dates)):
            val += np.random.normal(0, real_std * 0.5)
            val = max(10, min(500, val))
            synthetic_aqi.append(val)
            
        df_synthetic = pd.DataFrame({
            'aqi': synthetic_aqi,
            'min': [v - 10 for v in synthetic_aqi],
            'max': [v + 10 for v in synthetic_aqi]
        }, index=synthetic_dates)
        
        df_final = pd.concat([df_synthetic, df])
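        # A stable sort keeps real rows after synthetic ones on shared dates,
        # so keep='last' below reliably retains the real value.
        df_final.sort_index(inplace=True, kind='mergesort')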
        df_final = df_final[~df_final.index.duplicated(keep='last')]
        
        print(f"Augmented data size: {len(df_final)} records (Synthetic + Real)")

        # 2. Feature Engineering
        df_final['day_of_week'] = df_final.index.dayofweek
        df_final['month'] = df_final.index.month
        
        df_final['day_sin'] = np.sin(2 * np.pi * df_final['day_of_week'] / 7)
        df_final['day_cos'] = np.cos(2 * np.pi * df_final['day_of_week'] / 7)

        # 3. Create Lag Features
        target_col = 'aqi'
        for lag in [1, 2, 3]: 
            df_final[f'lag_{lag}d'] = df_final[target_col].shift(lag)

        df_final.dropna(inplace=True)
        
        print(f"Data shape after preprocessing: {df_final.shape}")
        return df_final
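
The feature engineering above can be tried on a tiny series to see what each column contains. This is a minimal sketch with invented values: the five-day series is illustrative, not real data, and only the column names match the notebook.

```python
import numpy as np
import pandas as pd

# Toy daily series; the values are illustrative, not real measurements.
toy = pd.DataFrame(
    {"aqi": [100.0, 110.0, 120.0, 130.0, 140.0]},
    index=pd.date_range("2025-01-06", periods=5, freq="D"),  # 2025-01-06 is a Monday
)

# Cyclical encoding: the weekday is mapped onto a circle, so Sunday (6)
# ends up next to Monday (0) instead of six units away.
dow = toy.index.dayofweek.to_numpy()
toy["day_sin"] = np.sin(2 * np.pi * dow / 7)
toy["day_cos"] = np.cos(2 * np.pi * dow / 7)

# Lag features: lag_kd holds the AQI value from k days earlier.
for lag in [1, 2, 3]:
    toy[f"lag_{lag}d"] = toy["aqi"].shift(lag)

# The first three rows have no complete lag history and are dropped.
toy = toy.dropna()
```

With three lags, the first three rows are always lost to `dropna()`, which is consistent with the sample run going from 61 augmented records to 58 rows.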

## 3. Modeling & Forecasting (Updated)

In [6]:
class AQIModels:
    """
    Contains Model definitions, training logic, and evaluation.
    """
    
    def train_random_forest(self, df, target_col='aqi'):
        print("\n--- Training Random Forest Regressor ---")
        
        features = [c for c in df.columns if c not in [target_col, 'min', 'max']]
        X = df[features]
        y = df[target_col]

        test_size = max(1, int(len(df) * 0.2))  # avoid an empty test split on tiny datasets
        X_train, X_test = X.iloc[:-test_size], X.iloc[-test_size:]
        y_train, y_test = y.iloc[:-test_size], y.iloc[-test_size:]

        param_grid = {
            'n_estimators': [50, 100],
            'max_depth': [5, 10, None],
        }

        rf_model = RandomForestRegressor(random_state=42)
        tscv = TimeSeriesSplit(n_splits=3)
        
        grid_search = GridSearchCV(
            estimator=rf_model,
            param_grid=param_grid,
            cv=tscv,
            scoring='neg_mean_absolute_error',
            n_jobs=-1
        )
        
        grid_search.fit(X_train, y_train)
        best_rf = grid_search.best_estimator_
        predictions = best_rf.predict(X_test)
        
        mae = mean_absolute_error(y_test, predictions)
        print(f"Best RF Parameters: {grid_search.best_params_}")
        print(f"Random Forest MAE: {mae:.2f}")
        return best_rf, mae

    def train_lstm(self, df, target_col='aqi', look_back=3):
        print("\n--- Training LSTM Model ---")
        
        scaler = MinMaxScaler(feature_range=(0, 1))
        features = [c for c in df.columns if c not in ['min', 'max']]
        
        data_values = df[features].values
        scaled_data = scaler.fit_transform(data_values)
        
        X, y = [], []
        target_idx = features.index(target_col)

        for i in range(look_back, len(scaled_data)):
            X.append(scaled_data[i-look_back:i, :]) 
            y.append(scaled_data[i, target_idx]) 
            
        X, y = np.array(X), np.array(y)
        
        test_size = int(len(X) * 0.2)
        X_train, X_test = X[:-test_size], X[-test_size:]
        y_train, y_test = y[:-test_size], y[-test_size:]
        
        model = Sequential()
        model.add(LSTM(32, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
        model.add(Dropout(0.2))
        model.add(LSTM(16, return_sequences=False))
        model.add(Dropout(0.2))
        model.add(Dense(1)) 
        
        model.compile(optimizer='adam', loss='mae')
        history = model.fit(X_train, y_train, epochs=30, batch_size=8, validation_data=(X_test, y_test), verbose=1)
        
        predictions_scaled = model.predict(X_test)
        
        dummy_matrix = np.zeros((len(predictions_scaled), len(features)))
        dummy_matrix[:, target_idx] = predictions_scaled.flatten()
        predictions = scaler.inverse_transform(dummy_matrix)[:, target_idx]
        
        dummy_matrix_y = np.zeros((len(y_test), len(features)))
        dummy_matrix_y[:, target_idx] = y_test
        actuals = scaler.inverse_transform(dummy_matrix_y)[:, target_idx]
        
        mae = mean_absolute_error(actuals, predictions)
        print(f"LSTM MAE: {mae:.2f}")
        return model, mae

    def generate_forecast(self, model, df, metadata, days=7):
        """
        Generates a `days`-day forecast (7 by default) by applying the trained model recursively.
        Includes 'Day-N' labels.
        """
        print(f"\n--- Generating {days} Day Forecast ---")
        
        last_date = df.index[-1]
        
        future_dates = [last_date + timedelta(days=i) for i in range(1, days + 1)]
        forecast_values = []
        past_aqi = [df.iloc[-i]['aqi'] for i in range(3, 0, -1)] 
        
        for future_date in future_dates:
            day_of_week = future_date.dayofweek
            day_sin = np.sin(2 * np.pi * day_of_week / 7)
            day_cos = np.cos(2 * np.pi * day_of_week / 7)
            
            lag_1d = past_aqi[-1]
            lag_2d = past_aqi[-2]
            lag_3d = past_aqi[-3]
            
            input_data = pd.DataFrame([{
                'day_of_week': day_of_week,
                'month': future_date.month,
                'day_sin': day_sin,
                'day_cos': day_cos,
                'lag_1d': lag_1d,
                'lag_2d': lag_2d,
                'lag_3d': lag_3d
            }])
            
            pred_aqi = model.predict(input_data)[0]
            forecast_values.append(pred_aqi)
            past_aqi.append(pred_aqi)
            past_aqi.pop(0) 
        
        forecast_df = pd.DataFrame({
            'Day': [f"Day-{i}" for i in range(1, days + 1)],
            'Date': future_dates,
            'Location': [metadata.get('city', 'Unknown')] * days,
            'Latitude': [metadata.get('lat', 0)] * days,
            'Longitude': [metadata.get('lon', 0)] * days,
            'Predicted AQI': [round(x, 2) for x in forecast_values],
            'Status': [self._get_aqi_status(x) for x in forecast_values]
        })
        return forecast_df

    def generate_hourly_forecast(self, current_aqi_pred):
        """
        Generates a simulated 24-hour forecast based on the predicted Daily Average.
        Uses a standard diurnal profile.
        """
        print("\n--- Generating 24-Hour Hourly Forecast (Simulated) ---")
        
        diurnal_profile = [
            0.8, 0.75, 0.7, 0.7, 0.75, 0.85, 1.0, 1.2, 1.3, 1.2, # 00-09
            1.1, 1.0, 0.9, 0.9, 0.95, 1.0, 1.1, 1.2, 1.3, 1.25, # 10-19
            1.15, 1.0, 0.9, 0.85                                      # 20-23
        ]
        
        profile_mean = sum(diurnal_profile) / 24
        scaling_factor = current_aqi_pred / profile_mean
        # Capture the clock once so every row is offset from the same instant.
        now = datetime.now()
        current_hour_idx = now.hour
        
        future_hours = []
        for i in range(24):
            h_idx = (current_hour_idx + 1 + i) % 24
            time_str = (now + timedelta(hours=i+1)).strftime("%Y-%m-%d %H:00")
            predicted_h_aqi = diurnal_profile[h_idx] * scaling_factor
            
            future_hours.append({
                'Time': time_str,
                'Predicted AQI': round(predicted_h_aqi, 2),
                'Status': self._get_aqi_status(predicted_h_aqi)
            })
            
        return pd.DataFrame(future_hours)

    def _get_aqi_status(self, aqi):
        if aqi <= 50: return "Good"
        elif aqi <= 100: return "Moderate"
        elif aqi <= 150: return "Unhealthy for Sensitive Groups"
        elif aqi <= 200: return "Unhealthy"
        elif aqi <= 300: return "Very Unhealthy"
        else: return "Hazardous"
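
The recursive step in `generate_forecast()` can be isolated from scikit-learn to show how predictions are fed back as lag inputs. In the sketch below, `recursive_forecast` and `mean_of_lags` are hypothetical names, and the averaging function is only a stand-in for a trained model, not the Random Forest used in this notebook. The history values are invented.

```python
def recursive_forecast(predict_one, history, steps, n_lags=3):
    """Roll a one-step predictor forward, feeding each prediction back as a lag."""
    window = list(history[-n_lags:])      # ordered oldest ... newest
    predictions = []
    for _ in range(steps):
        lags = window[::-1]               # lag_1d first (most recent value)
        pred = predict_one(lags)
        predictions.append(pred)
        window = window[1:] + [pred]      # drop the oldest value, append the prediction
    return predictions


def mean_of_lags(lags):
    """Hypothetical stand-in for a trained model: the mean of the lag values."""
    return sum(lags) / len(lags)


# Illustrative history, not real measurements.
preds = recursive_forecast(mean_of_lags, [90.0, 120.0, 150.0], steps=3)
```

From the second step onward the inputs include earlier predictions, so errors can compound over the horizon. That is a general property of recursive forecasting and a reason to treat the later days of the 7-day table with more caution.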

## 4. Execution Pipeline

In [7]:
print("Initializing AQI Prediction Pipeline...")

# 1. Location Detection
user_lat, user_lon = detect_user_location()

# 2. Fetch Data (passing detected location)
print("\n--- Step 2: Fetching Air Quality Data ---")
fetcher = AQIDataFetcher(api_token=API_KEY, lat=user_lat, lon=user_lon)
df, metadata = fetcher.fetch_data()

# 3. Preprocessing & Augmentation
print("\n--- Step 3: Preprocessing & Model Training ---")
preprocessor = DataPreprocessor()
df_processed = preprocessor.preprocess_and_augment(df)

if not df_processed.empty:
    trainer = AQIModels()
    rf_model, rf_mae = trainer.train_random_forest(df_processed)
    
    if TF_AVAILABLE:
        lstm_model, lstm_mae = trainer.train_lstm(df_processed)
    else:
        lstm_model, lstm_mae = None, float('inf')
    
    # 4A. Generate 7-Day Forecast
    forecast_df = trainer.generate_forecast(rf_model, df_processed, metadata, days=7)

    # 4B. Generate 24-Hour Forecast (Using Day-1 prediction)
    day_1_pred = forecast_df.iloc[0]['Predicted AQI']
    hourly_df = trainer.generate_hourly_forecast(day_1_pred)
    
    print("\n=== FINAL RESULTS (Daily Average AQI) ===")
    print(f"Random Forest MAE: {rf_mae:.4f}")
    if TF_AVAILABLE:
        print(f"LSTM MAE:          {lstm_mae:.4f}")
    
    print("\n=== 24-HOUR HOURLY FORECAST ===")
    print(hourly_df.to_string(index=False))

    print("\n=== 7-DAY AQI FORECAST ===")
    print(forecast_df.to_string(index=False))

Initializing AQI Prediction Pipeline...

--- Step 1: Detecting User Location ---
Success! Detected Location: Ahmedabad, IN
Coordinates: 23.0258, 72.5873

--- Step 2: Fetching Air Quality Data ---
Targeting API for specific coordinates: 23.0258, 72.5873
Fetching data from WAQI API...
Current Real-Time AQI: 155
Confirmed Station: Maninagar, Ahmedabad, India (Lat: 23.002657, Lon: 72.591912)

--- Step 3: Preprocessing & Model Training ---
Preprocessing data...
Augmented data size: 61 records (Synthetic + Real)
Data shape after preprocessing: (58, 10)

--- Training Random Forest Regressor ---
Best RF Parameters: {'max_depth': 10, 'n_estimators': 100}
Random Forest MAE: 15.65

--- Generating 7 Day Forecast ---

--- Generating 24-Hour Hourly Forecast (Simulated) ---

=== FINAL RESULTS (Daily Average AQI) ===
Random Forest MAE: 15.6536

=== 24-HOUR HOURLY FORECAST ===
            Time  Predicted AQI                         Status
2025-12-26 23:00         135.34 Unhealthy for Sensitive Groups
2