# Air Quality Assessment Dataset

This dataset focuses on air quality assessment across various regions. It contains 5000 samples and captures critical environmental and demographic factors that influence pollution levels.

Source: https://www.kaggle.com/datasets/mujtabamatin/air-quality-and-pollution-assessment

## Key Features:

- **Temperature (°C)**: Average temperature of the region.
- **Humidity (%)**: Relative humidity recorded in the region.
- **PM2.5 Concentration (µg/m³)**: Fine particulate matter levels.
- **PM10 Concentration (µg/m³)**: Coarse particulate matter levels.
- **NO2 Concentration (ppb)**: Nitrogen dioxide levels.
- **SO2 Concentration (ppb)**: Sulfur dioxide levels.
- **CO Concentration (ppm)**: Carbon monoxide levels.
- **Proximity to Industrial Areas (km)**: Distance to the nearest industrial zone.
- **Population Density (people/km²)**: Number of people per square kilometer in the region.

## Target Variable: Air Quality Levels

- **Good**: Clean air with low pollution levels.
- **Moderate**: Acceptable air quality but with some pollutants present.
- **Poor**: Noticeable pollution that may cause health issues for sensitive groups.
- **Hazardous**: Highly polluted air posing serious health risks to the population.


In [1]:
import argparse
import itertools
import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
from xgboost import XGBClassifier
import logging
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("mlflow").setLevel(logging.WARNING)
logging.getLogger("requests").setLevel(logging.WARNING)

logging.basicConfig(level=logging.DEBUG)


In [2]:
df = pd.read_csv('updated_pollution_dataset.csv')
df.head(5)

,Temperature,Humidity,PM2.5,PM10,NO2,SO2,CO,Proximity_to_Industrial_Areas,Population_Density,Air Quality
0,29.8,59.1,5.2,17.9,18.9,9.2,1.72,6.3,319,Moderate
1,28.3,75.6,2.3,12.2,30.8,9.7,1.64,6.0,611,Moderate
2,23.1,74.7,26.7,33.8,24.4,12.6,1.63,5.2,619,Moderate
3,27.1,39.1,6.1,6.3,13.5,5.3,1.15,11.1,551,Good
4,26.5,70.7,6.9,16.0,21.9,5.6,1.01,12.7,303,Good


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Temperature                    5000 non-null   float64
 1   Humidity                       5000 non-null   float64
 2   PM2.5                          5000 non-null   float64
 3   PM10                           5000 non-null   float64
 4   NO2                            5000 non-null   float64
 5   SO2                            5000 non-null   float64
 6   CO                             5000 non-null   float64
 7   Proximity_to_Industrial_Areas  5000 non-null   float64
 8   Population_Density             5000 non-null   int64  
 9   Air Quality                    5000 non-null   object 
dtypes: float64(8), int64(1), object(1)
memory usage: 390.8+ KB


The `average` parameter controls how the per-class precision and recall scores are combined into a single number:

- **'micro'**: Calculate metrics globally by counting the total true positives, false negatives, and false positives.
- **'macro'**: Calculate metrics for each class and take the unweighted mean.
- **'weighted'**: Calculate metrics for each class and take the mean weighted by support (the number of true instances of each class).
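
The difference between the three averaging modes is easiest to see on a small imbalanced example. The sketch below uses made-up labels (they are not taken from the dataset) and scikit-learn's `precision_score`; the variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical toy labels (not from the dataset): class 0 is the majority class
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

# Per-class precision, before any averaging
per_class = precision_score(y_true, y_pred, average=None)

# The same metric combined in the three ways described above
scores = {
    mode: precision_score(y_true, y_pred, average=mode)
    for mode in ("micro", "macro", "weighted")
}

print("per class:", per_class)
for mode, value in scores.items():
    print(f"{mode:>8}: {value:.3f}")
```

With balanced classes the three values tend to be close; the more imbalanced the classes, the more `macro` (every class counts equally) can diverge from `weighted` (large classes dominate).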

In [4]:
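# `!export` runs in a throwaway subshell and does not affect this kernel.
# `%env` sets the variable for the notebook process and the commands it launches.
%env MLFLOW_TRACKING_URI=http://127.0.0.1:5000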

In [19]:
%%writefile mlflow-projects-example.py

import matplotlib
matplotlib.use('TkAgg')
import argparse
import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd
import logging
import mlflow.pyfunc
import logging.config
import mlflow.xgboost  # flavor used by mlflow.xgboost.log_model below
import sys
from pathlib import Path
from mlflow.models.signature import infer_signature
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, precision_score, recall_score, classification_report, accuracy_score, ConfusionMatrixDisplay
from xgboost import XGBClassifier


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", default="./updated_pollution_dataset.csv")
    parser.add_argument("--early_stopping_rounds", type=int, default=10)
    parser.add_argument("--average", choices=['micro', 'macro', 'weighted'], default='weighted')
    return parser.parse_args()

def configure_logging():
    """Configure logging from logging.conf if it exists, otherwise log to stdout."""
    
    if Path("logging.conf").exists():
        logging.config.fileConfig("logging.conf")
    else:
        logging.basicConfig(
            format="%(asctime)s [%(levelname)s] %(message)s",
            handlers=[logging.StreamHandler(sys.stdout)],
            level=logging.INFO,
        )


def prepare_data(data_path):
    try:
        df = pd.read_csv(data_path)
        quality_mapping = {'Good': 0, 'Moderate': 1, 'Poor': 2, 'Hazardous': 3}
        df['Air Quality'] = df['Air Quality'].apply(lambda x: quality_mapping[x])
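        # Cast the integer column to float so the MLflow signature uses a
        # float type, which can still represent missing values at inference time
        df["Population_Density"] = df["Population_Density"].astype(float)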

        
        logging.debug(f"prepare_data called with data_path: {data_path}")
        return df  
    except Exception as e:
        logging.error(f"Error preparing data: {e}")
        raise


    
   


def split_data(loaded_data):
    logging.info("splitting data...")
    try:
        X = loaded_data.drop(columns=['Air Quality'])
        y = loaded_data['Air Quality']
        X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
        X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)
        logging.debug(f"shape of X_train : {X_train.shape}")
        return X_train, X_test, X_val, y_train, y_test, y_val
    except KeyError as e:
        logging.error(f"Missing required column: {e}")
        raise
    except ValueError as e:
        logging.error(f"Error splitting data: {e}")
        raise



def create_model(args):
    # Initialize the model for multi-class classification
    logging.info("creating model...")
    model = XGBClassifier(
        objective='multi:softmax',  # Predicts class labels directly ('multi:softprob' gives probabilities)
        num_class=4,                # Number of classes
        eval_metric=['mlogloss','merror'],   # Suitable for multi-class classification
        random_state=42,
        early_stopping_rounds=args.early_stopping_rounds
    )
    return model


def train_model(model, X_train, y_train, X_val, y_val):
    # Train the model
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=0)
    return model


def evaluate_model(model, y_test, y_pred, args):
    logging.info("Training finished; evaluating on the test set.")
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")

    # Precision and recall, combined across classes according to --average
    precision = precision_score(y_test, y_pred, average=args.average)
    recall = recall_score(y_test, y_pred, average=args.average)
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")

    report = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(report)
    
    logging.info("precision: %s", precision)
    logging.info("recall: %s", recall)


    return accuracy, precision, recall


def job_done(args):
    df = prepare_data(args.data)
    if df is None:
        logging.error("Data preparation failed. Exiting the script.")
        raise RuntimeError("Data preparation failed. Please check the input file.")

    X_train, X_test, X_val, y_train, y_test, y_val = split_data(df)

    mlflow.set_experiment("a-new-demo")
    with mlflow.start_run():
        
        #Log parameters and metrics together
        params = {
            "data": args.data,
            "early_stopping_rounds": args.early_stopping_rounds,
            "average": args.average
        }
        mlflow.log_params(params)
        
        # Create and train the model
        model = create_model(args)
        trained_model = train_model(model, X_train, y_train, X_val, y_val)
        y_pred = trained_model.predict(X_test)

        accuracy, precision, recall = evaluate_model(trained_model, y_test, y_pred, args)
        
        metrics = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall
        }
        mlflow.log_metrics(metrics)

        # Corrected model input example
        model_input = pd.DataFrame([{
            "Temperature": 29.8,
            "Humidity": 59.1,
            "PM2.5": 5.2,
            "PM10": 17.9,
            "NO2": 18.9,
            "SO2": 9.2,
            "CO": 1.72,
            "Proximity_to_Industrial_Areas": 6.3,
            "Population_Density": 319.0,
        }])

        # Simulated model output
        model_output = pd.DataFrame({"Air Quality": [0]})

        # Infer signature
        signature = infer_signature(model_input, model_output)

        # Save locally in XGBoost's native binary format (a .json suffix saves JSON instead)
        trained_model.save_model("xgboost_model.ubj")
        
        # Log with MLflow using XGBoost flavor
        mlflow.xgboost.log_model(
            trained_model,
            artifact_path="air-quality-xgboost-model",
            signature=signature,
            input_example=model_input
        )



if __name__ == "__main__":
    configure_logging()
    job_done(parse_args())


Overwriting mlflow-projects-example.py


In [5]:

!python mlflow-projects-example.py


2025/01/14 17:45:27 INFO mlflow.tracking.fluent: Experiment with name 'a-new-demo' does not exist. Creating a new experiment.
Accuracy: 0.95
Precision: 0.96
Recall: 0.95
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       300
           1       0.95      0.94      0.95       225
           2       0.88      0.91      0.90       150
           3       0.97      0.89      0.93        75

    accuracy                           0.95       750
   macro avg       0.95      0.94      0.94       750
weighted avg       0.96      0.95      0.95       750



In [10]:
#run in terminal
#!mlflow ui



In [11]:
# Build the sample in one chain so the cast is not applied to a slice of df
sample = df[:5].drop(columns='Air Quality').astype({"Population_Density": float})
sample


,Temperature,Humidity,PM2.5,PM10,NO2,SO2,CO,Proximity_to_Industrial_Areas,Population_Density
0,29.8,59.1,5.2,17.9,18.9,9.2,1.72,6.3,319.0
1,28.3,75.6,2.3,12.2,30.8,9.7,1.64,6.0,611.0
2,23.1,74.7,26.7,33.8,24.4,12.6,1.63,5.2,619.0
3,27.1,39.1,6.1,6.3,13.5,5.3,1.15,11.1,551.0
4,26.5,70.7,6.9,16.0,21.9,5.6,1.01,12.7,303.0


In [12]:
import mlflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")
logged_model = 'runs:/54b7f4b988c04bc2b5b92fc6bfe44991/air-quality-xgboost-model'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)

predictions = loaded_model.predict(sample)
reverse_quality_mapping = {0: 'Good', 1: 'Moderate', 2: 'Poor', 3: 'Hazardous'}
readable_predictions = [reverse_quality_mapping[p] for p in predictions]
print(readable_predictions)

['Moderate', 'Moderate', 'Moderate', 'Good', 'Good']


## Production Environment

In [15]:
%%writefile inference.py


import json
import numpy as np
import pandas as pd
import mlflow.pyfunc
from xgboost import XGBClassifier
import joblib
from typing import Union, Dict, List
import logging

class Airquality_Detector(mlflow.pyfunc.PythonModel):
    def __init__(self) -> None:
        self.target_encoder = None
        self.model = None
        self.required_columns = [
            "Temperature", "Humidity", "PM2.5", "PM10", "NO2", "SO2", "CO",
            "Proximity_to_Industrial_Areas", "Population_Density"
        ]

    def load_context(self, context) -> None:
        """Load the target encoder and model from artifacts."""
        try:
            self.target_encoder = joblib.load(context.artifacts["target_encoder"])
            self.model = XGBClassifier()
            self.model.load_model(context.artifacts["model"])
            logging.info("Model and encoder loaded successfully")
        except Exception as e:
            logging.error(f"Error loading model context: {str(e)}")
            raise

    def _validate_columns(self, data: pd.DataFrame) -> None:
        """Validate that all required columns are present."""
        missing_cols = set(self.required_columns) - set(data.columns)
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")

    def _process_json_input(self, json_input: Union[str, Dict, List]) -> pd.DataFrame:
        """Convert JSON input to pandas DataFrame."""
        try:
            # Handle string input
            if isinstance(json_input, str):
                data = json.loads(json_input)
            else:
                data = json_input

            # Handle single record vs list of records
            if isinstance(data, dict):
                df = pd.DataFrame([data])
            elif isinstance(data, list):
                df = pd.DataFrame(data)
            else:
                raise ValueError("JSON input must be a dictionary or list of dictionaries")

            return df
        except Exception as e:
            raise ValueError(f"Error processing JSON input: {str(e)}")

    def _preprocess_input(self, data: Union[str, Dict, List, pd.DataFrame]) -> pd.DataFrame:
        """Preprocess input data regardless of format."""
        try:
            # Convert input to DataFrame if it's not already
            if not isinstance(data, pd.DataFrame):
                data = self._process_json_input(data)

            # Validate columns
            self._validate_columns(data)

            # Ensure correct column order
            data = data[self.required_columns]

            # Convert datatypes
            data = data.astype({
                "Temperature": float,
                "Humidity": float,
                "PM2.5": float,
                "PM10": float,
                "NO2": float,
                "SO2": float,
                "CO": float,
                "Proximity_to_Industrial_Areas": float,
                "Population_Density": float
            })

            return data
        except Exception as e:
            raise ValueError(f"Error preprocessing input: {str(e)}")

    def predict(self, context, model_input: Union[str, Dict, List, pd.DataFrame]) -> List[str]:
        """
        Make predictions on the input data.
        
        Args:
            context: MLflow model context
            model_input: Can be one of:
                - pandas DataFrame
                - JSON string
                - Python dictionary
                - List of dictionaries
                
        Returns:
            List of predicted air quality categories
        """
        try:
            # Preprocess the input
            processed_input = self._preprocess_input(model_input)
            # Log input for debugging
            logging.info(f"Input for prediction:\n{processed_input}") 
            # Make predictions
            predictions = self.model.predict(processed_input)
            # Convert numerical predictions to readable categories
            if self.target_encoder is not None:
                predictions = self.target_encoder.inverse_transform(predictions)
            # Log predictions for debugging
            logging.debug(f"Decoded predictions: {predictions}")
            return predictions.tolist()  # Convert numpy array to list 
            
        except Exception as e:
            logging.error(f"Error during prediction: {str(e)}")
            raise

Overwriting inference.py


In [20]:
%%writefile mlflow-projects-example-production.py

# mlflow-projects-example-production.py

import matplotlib
matplotlib.use('TkAgg')
import argparse
import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd
import logging
import logging.config
import mlflow.pyfunc
import sys
import joblib
from pathlib import Path

# Silence chatty third-party loggers
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("matplotlib").setLevel(logging.WARNING)
from sklearn.preprocessing import LabelEncoder
from mlflow.models.signature import infer_signature
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, precision_score, recall_score, classification_report, accuracy_score, ConfusionMatrixDisplay
from xgboost import XGBClassifier

from inference import Airquality_Detector

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", default="./updated_pollution_dataset.csv")
    parser.add_argument("--early_stopping_rounds", type=int, default=10)
    parser.add_argument("--average", choices=['micro', 'macro', 'weighted'], default='weighted')
    return parser.parse_args()


def configure_logging():
    """Configure logging from logging.conf if it exists, otherwise log to stdout."""
    
    if Path("logging.conf").exists():
        logging.config.fileConfig("logging.conf")
    else:
        logging.basicConfig(
            format="%(asctime)s [%(levelname)s] %(message)s",
            handlers=[logging.StreamHandler(sys.stdout)],
            level=logging.INFO,
        )


def prepare_data(data_path):
    try:
        df = pd.read_csv(data_path)
        df.columns = df.columns.str.strip()  # Remove any whitespace from column names        
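        # Cast the integer column to float so the MLflow signature uses a
        # float type, which can still represent missing values at inference time
        df["Population_Density"] = df["Population_Density"].astype(float)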

        logging.debug(f"prepare_data called with data_path: {data_path}")
        return df  
    except Exception as e:
        logging.error(f"Error preparing data: {e}")
        raise


def split_data(loaded_data):
    logging.info("splitting data...")
    
    try:
        X = loaded_data.drop(columns=['Air Quality'])
        y = loaded_data['Air Quality']
        X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
        X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)
        logging.debug(f"shape of X_train : {X_train.shape}")
        return X_train, X_test, X_val, y_train, y_test, y_val
    except KeyError as e:
        logging.error(f"Missing required column: {e}")
        raise
    except ValueError as e:
        logging.error(f"Error splitting data: {e}")
        raise



def create_model(args):
    # Initialize the model for multi-class classification
    logging.info("creating model...")
    model = XGBClassifier(
        objective='multi:softmax',  # Predicts class labels directly ('multi:softprob' gives probabilities)
        num_class=4,                # Number of classes
        eval_metric=['mlogloss','merror'],   # Suitable for multi-class classification
        random_state=42,
        early_stopping_rounds=args.early_stopping_rounds
    )
    return model


def train_model(model, X_train, y_train, X_val, y_val):
    # Train the model
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=0)
    return model


def evaluate_model(model, y_test, y_pred, args):
    logging.info("Training finished; evaluating on the test set.")
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")

    # Precision and recall, combined across classes according to --average
    precision = precision_score(y_test, y_pred, average=args.average)
    recall = recall_score(y_test, y_pred, average=args.average)
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")

    report = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(report)
    
    logging.info("precision: %s", precision)
    logging.info("recall: %s", recall)


    return accuracy, precision, recall


def job_done(args):
    df = prepare_data(args.data)
    if df is None:
        logging.error("Data preparation failed. Exiting the script.")
        raise RuntimeError("Data preparation failed. Please check the input file.")

    X_train, X_test, X_val, y_train, y_test, y_val = split_data(df)
    # Encode the string labels as integers. LabelEncoder assigns indices in
    # sorted label order: Good=0, Hazardous=1, Moderate=2, Poor=3.
    target_encoder = LabelEncoder()
    y_train = target_encoder.fit_transform(y_train)
    y_test = target_encoder.transform(y_test)
    y_val = target_encoder.transform(y_val)

    joblib.dump(target_encoder, 'target_encoder.pkl')
    
    print("LabelEncoder saved to 'target_encoder.pkl'.")

    mlflow.set_experiment("mlflow-production-demo")
    with mlflow.start_run():
        
        # Log parameters and metrics together
        params = {
            "data": args.data,
            "early_stopping_rounds": args.early_stopping_rounds,
            "average": args.average
        }
        mlflow.log_params(params)
        # Create and train the model
        model = create_model(args)
        trained_model = train_model(model, X_train, y_train, X_val, y_val)
        y_pred = trained_model.predict(X_test)

        accuracy, precision, recall = evaluate_model(trained_model, y_test, y_pred, args)
        metrics = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall
        }
        mlflow.log_metrics(metrics)

        model_input = pd.DataFrame([{
            "Temperature": 29.8,
            "Humidity": 59.1,
            "PM2.5": 5.2,
            "PM10": 17.9,
            "NO2": 18.9,
            "SO2": 9.2,
            "CO": 1.72,
            "Proximity_to_Industrial_Areas": 6.3,
            "Population_Density": 319.0,
        }])

        signature = infer_signature(model_input, trained_model.predict(model_input))
        
        trained_model.save_model('air_model.ubj')

        air_quality = Airquality_Detector()
        artifacts = {
            "target_encoder": "./target_encoder.pkl",
            "model": "./air_model.ubj"
        }
        
        mlflow.pyfunc.log_model(
            artifact_path='model',
            conda_env="./conda.yaml",
            python_model=air_quality,
            artifacts=artifacts,
            signature=signature,
            input_example=model_input,
            # Must be a string; the bare name air_quality_model raises NameError
            registered_model_name="air_quality_model",
        )

if __name__ == "__main__":
    configure_logging()
    job_done(parse_args())


Overwriting mlflow-projects-example-production.py


In [21]:
!python mlflow-projects-example-production.py

LabelEncoder saved to 'target_encoder.pkl'.
Accuracy: 0.94
Precision: 0.94
Recall: 0.94
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       300
           1       0.92      0.77      0.84        75
           2       0.92      0.97      0.95       225
           3       0.86      0.87      0.86       150

    accuracy                           0.94       750
   macro avg       0.92      0.90      0.91       750
weighted avg       0.94      0.94      0.94       750

Downloading artifacts: 100%|████████████████████| 1/1 [00:00<00:00, 3692.17it/s]
Downloading artifacts: 100%|████████████████████| 1/1 [00:00<00:00, 1400.90it/s]
🏃 View run powerful-lynx-783 at: http://127.0.0.1:5000/#/experiments/468274013418532110/runs/1faa06996ad94733a1904ff8ea9976e4
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/468274013418532110
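
Note that the class indices in this report are not in the same order as in the first run. The first script used a hand-written mapping (Good=0, Moderate=1, Poor=2, Hazardous=3), while `LabelEncoder` assigns indices in sorted label order, which matches the support counts above (300, 75, 225, 150). A minimal sketch of that behaviour, using a short made-up list of labels rather than the dataset itself:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label list; only the set of distinct values matters
labels = ["Moderate", "Good", "Poor", "Hazardous", "Good"]

encoder = LabelEncoder().fit(labels)

# classes_ holds the labels in sorted order; the position is the encoded value
classes = list(encoder.classes_)
encoded = encoder.transform(["Good", "Hazardous", "Moderate", "Poor"]).tolist()
decoded = encoder.inverse_transform([2, 0]).tolist()

print(classes)
print(encoded)
print(decoded)
```

This is why the fitted encoder is saved alongside the model: the serving code needs the same index-to-label mapping to turn predictions back into readable categories.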


In [5]:
df["Population_Density"] =df["Population_Density"].astype(float)
sample = df[:5].drop(columns = 'Air Quality')
sample

,Temperature,Humidity,PM2.5,PM10,NO2,SO2,CO,Proximity_to_Industrial_Areas,Population_Density
0,29.8,59.1,5.2,17.9,18.9,9.2,1.72,6.3,319.0
1,28.3,75.6,2.3,12.2,30.8,9.7,1.64,6.0,611.0
2,23.1,74.7,26.7,33.8,24.4,12.6,1.63,5.2,619.0
3,27.1,39.1,6.1,6.3,13.5,5.3,1.15,11.1,551.0
4,26.5,70.7,6.9,16.0,21.9,5.6,1.01,12.7,303.0


In [23]:

# `!export` only affects a throwaway subshell, so use %env to set the variable
%env MLFLOW_TRACKING_URI=http://127.0.0.1:5000
# Set the MLflow tracking URI
mlflow.set_tracking_uri("http://127.0.0.1:5000")

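# Run one of the following commands in a terminal to serve the model.
# Either reference the run ID and the artifact path the model was logged under:
# mlflow models serve -m runs:/f3747c220a0b4629a57dce74b58d0b7d/model -h 0.0.0.0 -p 8080 --env-manager local
# or reference the registered model by name and version:
# mlflow models serve -m models:/air_quality_model/1 -h 0.0.0.0 -p 8080 --env-manager local
# (older MLflow releases spelled the last option --no-conda)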



In [23]:
!curl -X POST http://0.0.0.0:8080/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": [{"Temperature": 29.8, "Humidity": 59.1, "PM2.5": 26.7, "PM10": 33.8, "NO2": 30.8, "SO2":9.2,"CO":1.72,"Proximity_to_Industrial_Areas":11.1,"Population_Density":551.0}]}'

{"predictions": ["Moderate"]}
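
The same request can be sent from Python. The sketch below is one possible client using only the standard library; the helper names `build_payload` and `predict` are hypothetical, and the default URL assumes the model is being served locally on port 8080 as in the `mlflow models serve` command above.

```python
import json
import urllib.request

# Feature names expected by the model, as listed in inference.py
FEATURES = [
    "Temperature", "Humidity", "PM2.5", "PM10", "NO2", "SO2", "CO",
    "Proximity_to_Industrial_Areas", "Population_Density",
]

# Same record as in the curl command
example_record = {
    "Temperature": 29.8, "Humidity": 59.1, "PM2.5": 26.7, "PM10": 33.8,
    "NO2": 30.8, "SO2": 9.2, "CO": 1.72,
    "Proximity_to_Industrial_Areas": 11.1, "Population_Density": 551.0,
}


def build_payload(records):
    """Check each record for missing features and wrap the list in {"inputs": ...}."""
    for record in records:
        missing = [name for name in FEATURES if name not in record]
        if missing:
            raise ValueError(f"Missing required features: {missing}")
    return json.dumps({"inputs": records})


def predict(records, url="http://127.0.0.1:8080/invocations", timeout=10):
    """POST the records to the scoring endpoint and return the predictions list."""
    request = urllib.request.Request(
        url,
        data=build_payload(records).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["predictions"]


# Usage (requires the model server to be running):
# print(predict([example_record]))
```

Validating the payload on the client side is optional, since `Airquality_Detector` performs its own column check, but it fails faster and avoids a round trip for obviously incomplete records.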