# MLflow - Experiment Tracking: Air Quality Classification

Our goal in this notebook is to demonstrate **experiment tracking** with **MLflow**, an open-source platform for managing machine learning workflows. Tracking experiments is crucial for transparency, reproducibility, and effective model management. MLflow lets us log parameters, metrics, and models, and compare multiple runs over time. This makes it easier to tune hyperparameters, optimize model performance, and keep a history of all experiments for future reference.

In this notebook, we will use MLflow to track and manage the development of a classification model that predicts **air quality**. Our dataset contains environmental features, including temperature, humidity, pollutant levels (PM2.5, PM10, NO2, SO2, CO), proximity to industrial areas, and population density. From these features, the model classifies air quality into categories such as "Good", "Moderate", or "Hazardous". Along the way, we will log the hyperparameters, the performance metrics, and the trained model, so that MLflow holds a clear and organized record of our work.

In [1]:
# Import libraries
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [2]:
# Load the dataset from a CSV file
df = pd.read_csv('air_quality.csv')  # Replace with the path to your CSV file

# Check the first few rows of the dataset to understand its structure
print(df.head())

# Feature columns and target variable
X = df.drop('Air Quality', axis=1)
y = df['Air Quality']


   Temperature  Humidity  PM2.5  PM10   NO2   SO2    CO  \
0         29.8      59.1    5.2  17.9  18.9   9.2  1.72   
1         28.3      75.6    2.3  12.2  30.8   9.7  1.64   
2         23.1      74.7   26.7  33.8  24.4  12.6  1.63   
3         27.1      39.1    6.1   6.3  13.5   5.3  1.15   
4         26.5      70.7    6.9  16.0  21.9   5.6  1.01   

   Proximity_to_Industrial_Areas  Population_Density Air Quality  
0                            6.3                 319    Moderate  
1                            6.0                 611    Moderate  
2                            5.2                 619    Moderate  
3                           11.1                 551        Good  
4                           12.7                 303        Good  


In [3]:
# Encode the target variable (Air Quality) using Label Encoding
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

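# Standardize the features
# Note: the scaler is fit on the full dataset here, before the train/test split,
# so test-set statistics influence the scaling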
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [4]:
# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.2, random_state=42)
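
One way to keep test rows out of the scaling statistics is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The following is a minimal sketch under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins, not the air quality data, and the pipeline is an alternative to the cells above rather than what this notebook runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (assumption, not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on the training rows only, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# At prediction time the pipeline reuses the training mean and scale
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

A pipeline like this can also be logged as a single model, so the scaling travels with the classifier when it is loaded later.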


In [13]:
# Define model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "random_state": 44,
}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")

print(accuracy, precision, recall, f1)

0.948 0.948699773521202 0.948 0.9479213062383529
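
The `average="weighted"` option used above computes the metric per class and then averages the per-class values, weighting each class by its number of true samples. As a small worked illustration, the sketch below reproduces that arithmetic for precision in plain Python. The labels are made up for the example and are not taken from the air quality data.

```python
from collections import Counter

# Made-up labels for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_hat = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

# Support: number of true samples per class
support = Counter(y_true)


def class_precision(label):
    """Fraction of the predictions for `label` that are correct."""
    predicted = [t for t, p in zip(y_true, y_hat) if p == label]
    return sum(t == label for t in predicted) / len(predicted) if predicted else 0.0


# Weight each class's precision by its support, then normalize
weighted_precision = sum(
    class_precision(label) * count for label, count in support.items()
) / len(y_true)
print(round(weighted_precision, 4))
```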


### Experiment Tracking


Before we log anything, we need to define and set an experiment in MLflow.

- **Experiment:** In MLflow, an experiment acts as a container that holds all the runs (individual model training sessions) related to a specific task or project.
- **Run:** A run represents a single execution of a machine learning model, including its parameters, metrics, and results. Each time you train a model, MLflow logs this as a new run.

Grouping runs under one experiment lets us track and compare them in one place, and quickly evaluate the impact of different configurations, hyperparameters, or model versions over time.

- The **MLflow Tracking Server (UI)** is running on http://127.0.0.1:5000/. Calling `mlflow.set_tracking_uri(remote_server_uri)` tells MLflow to log experiments to that server instead of the default local storage.
- If you don't set a tracking URI with `mlflow.set_tracking_uri()`, MLflow logs experiments and their metrics and models locally in the **./mlruns** directory, which stores all the experiment data:

    - **Experiment Folder:** Each experiment is assigned a unique experiment ID and stored in a subfolder within mlruns.
    - **Run Folder:** Within each experiment, every individual model training session is represented by a run, which is given a unique run ID. Each run is stored in its own folder under the experiment directory.

In [14]:
# remote_server_uri = "http://127.0.0.1:5000/"
# mlflow.set_tracking_uri(remote_server_uri)
# print(mlflow.tracking.get_tracking_uri())

In [15]:
# # Set the experiment name
# mlflow.set_experiment('Air_Quality_Experiment')

Once the experiment is set up in MLflow, we can start tracking various aspects of our model, such as **parameters, metrics, and the model itself**. These are logged during each run, allowing us to monitor and evaluate different model configurations over time. Here’s how we can log each of these elements:

- `mlflow.set_tag`: Sets metadata for the run, such as the model type, dataset name, or experiment category. This helps in organizing and filtering runs.
- `mlflow.log_param`: Logs a model parameter, such as a hyperparameter (e.g., the number of iterations or the learning rate).
- `mlflow.log_metric`: Logs a performance metric, such as accuracy, precision, recall, or loss, for evaluating how well the model performs.
- `mlflow.sklearn.log_model`: Logs the trained scikit-learn model so it can be saved and retrieved for future use or deployment.

#### Logging Parameters, Metrics, and the Model

In [16]:
# Start MLflow run
with mlflow.start_run():

    # Set experiment tags
    mlflow.set_tag("model_type", "LogisticRegression")
    mlflow.set_tag("experiment_name", "Air Quality Classification")

    # Log model parameters
    for param, value in params.items():
        mlflow.log_param(param, value)

    # Log evaluation metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)

    # Log trained model
    mlflow.sklearn.log_model(lr, "model")

    # Print run details
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Model saved in run: {mlflow.active_run().info.run_id}")

Accuracy: 0.9480
Precision: 0.9487
Recall: 0.9480
F1 Score: 0.9479
Model saved in run: b9ab91a25dcf44209604b2bebcc25115


In [17]:
# Look up the experiment by name
# Note: the lookup fails below because the set_experiment call above is commented out,
# so the run was most likely logged to MLflow's "Default" experiment instead
experiment_name = 'Air_Quality_Experiment'
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment is not None:
    # Get the runs in the experiment as a pandas DataFrame
    runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

    # Print the available columns
    print(runs.columns)

    # Metrics are exposed as columns prefixed with "metrics."
    for index, row in runs.iterrows():
        print(f"Run ID: {row['run_id']}, Status: {row['status']}, Accuracy: {row['metrics.accuracy']}")
else:
    print(f"Experiment with name '{experiment_name}' not found.")

Experiment with name 'Air_Quality_Experiment' not found.


#### Loading the Model for Prediction

After a model is logged in MLflow, it can be loaded again using its **run ID**. `mlflow.sklearn.load_model()` takes a model URI of the form `runs:/<run_id>/model`, where `model` is the artifact path used when the model was logged, and returns the scikit-learn model. This lets you reuse the model for predictions or analysis at any time after it has been logged.

In [19]:
# Logged model in MLflow
mlflow_run_id = 'b9ab91a25dcf44209604b2bebcc25115'  # Choose the run ID of the model you want to load
logged_model_path = f"runs:/{mlflow_run_id}/model"

# Load model as a sklearn model
loaded_model = mlflow.sklearn.load_model(logged_model_path)

In [20]:
y_pred = loaded_model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.948
