# 📊 Module 4: Generate Reports for Data and Model Drift

In this module, we'll use [Evidently](https://evidentlyai.com/) to generate reports that help monitor and detect:

- **Data Drift**
- **Target Drift**
- **Regression Performance** changes
- **Data Quality** issues

We'll start by loading reference and current data:
- **Reference Data**: January & February 2011
- **Current Data**: March 2011

This simulates comparing a baseline dataset against new incoming production data.


In [1]:
# Install requirements
!pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## 📦 Import Required Libraries

Before we generate any reports, we need to import the libraries used in this module: pandas for data handling, Evidently for the reports, and MLflow for loading the trained model.


In [1]:
# Import necessary modules
import os
import joblib

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

import pandas as pd
import numpy as np

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import (
    DataDriftPreset,
    DataQualityPreset,
    RegressionPreset,
    TargetDriftPreset,
)

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

## 📁 Load Reference and Current Data

In [2]:
# Load reference (January & February) and current (March) datasets
data_path = "./data/processed/"

data_01 = pd.read_csv(data_path + 'data_2011_01.csv')
data_02 = pd.read_csv(data_path + 'data_2011_02.csv')

reference_data = pd.concat([data_01, data_02], ignore_index=True)

current_data = pd.read_csv(data_path + 'data_2011_03.csv')

# Preview shapes and basic info
print("Reference data shape:", reference_data.shape)
print("Current data shape:", current_data.shape)
# reference_data.head()

Reference data shape: (1337, 17)
Current data shape: (730, 17)


## 🗺️ Define Column Mapping

Evidently supports specifying column roles explicitly via `ColumnMapping`, which helps produce more accurate and meaningful metrics.

Here, we define:
- `target`: the actual value to predict (`count`)
- `prediction`: (optional) the column that holds the model's predictions; it is added to the data later in this module
- `numerical_features`: continuous input features
- `categorical_features`: categorical or discrete input features
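
A mapping that names a column the data does not contain is easy to miss, so it can help to check the names before running a report. The helper below is a minimal sketch, not part of Evidently: `missing_mapped_columns` is a hypothetical name, and it works on any iterable of column names (for example `reference_data.columns`).

```python
from typing import Iterable, List, Optional


def missing_mapped_columns(
    available: Iterable[str],
    target: Optional[str],
    prediction: Optional[str],
    numerical_features: Iterable[str],
    categorical_features: Iterable[str],
) -> List[str]:
    """Return the mapped column names that are absent from `available`, in mapping order."""
    available_set = set(available)
    mapped = [target, prediction, *numerical_features, *categorical_features]
    return [name for name in mapped if name is not None and name not in available_set]


# Hand-written column list for illustration; in the notebook you could pass reference_data.columns
columns = ["temp", "atemp", "humidity", "windspeed", "hour", "weekday",
           "season", "holiday", "workingday", "count"]

missing = missing_mapped_columns(
    columns,
    target="count",
    prediction="prediction",
    numerical_features=["temp", "atemp", "humidity", "windspeed", "hour", "weekday"],
    categorical_features=["season", "holiday", "workingday"],
)
print(missing)  # ['prediction'], because no prediction column exists in this list
```

An empty list means every mapped column is present.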


In [11]:
target = "count"
prediction = "prediction"
numerical_features = ['temp', 'atemp', 'humidity', 'windspeed', 'hour', 'weekday']
categorical_features = ['season', 'holiday', 'workingday']

# Tell Evidently which role each column plays
column_mapping = ColumnMapping(
    target=target,
    prediction=prediction,
    numerical_features=numerical_features,
    categorical_features=categorical_features,
)

## 🧳 Select and Load a trained Model Version from MLflow

In this step, we interact with the MLflow Model Registry to:

1. **List all available versions** of a registered model (`BikeSharingModel`) along with their metadata, such as version number, stage, and run ID.
2. **Prompt the user** to choose a specific version to use for deployment or analysis.
3. **Load the selected model** from the MLflow tracking server using the model URI.

This makes it easy to manage multiple iterations of a model and ensures reproducibility when deploying or testing specific versions.
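
Because the version number is typed in by hand, a small validation step can catch typos before MLflow is asked to load anything. The sketch below is illustrative only: `resolve_version` and `build_model_uri` are hypothetical helpers, and the list of available versions is hard-coded here rather than read from the registry.

```python
from typing import Iterable


def resolve_version(raw: str, available: Iterable[str]) -> str:
    """Validate a user-entered version string against the known versions.

    Raises ValueError listing the valid choices if the entry is not recognised.
    """
    choice = raw.strip()
    valid = [str(v) for v in available]
    if choice not in valid:
        raise ValueError(f"Unknown version {choice!r}; choose one of {valid}")
    return choice


def build_model_uri(model_name: str, version: str) -> str:
    """Build a registry URI of the form models:/<name>/<version>."""
    return f"models:/{model_name}/{version}"


# Stand-in for the versions returned by the registry
version = resolve_version(" 1 ", available=["1"])
print(build_model_uri("BikeSharingModel", version))  # models:/BikeSharingModel/1
```

In the notebook, the `available` argument could be built from the `version` attribute of each entry the client returns.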


In [4]:
import mlflow
from mlflow.tracking import MlflowClient

# Point MLflow at the tracking server and create a registry client
MLFLOW_TRACKING_URI = 'https://mlflow-mlflow.apps.cluster-x5r72.dynamic.redhatworkshops.io'
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
client = MlflowClient()

model_name = "BikeSharingModel"

# List available versions
versions = client.search_model_versions(filter_string=f"name='{model_name}'", order_by=["version_number DESC"])

print("📦 Available versions for model:", model_name)
for v in versions:
    print(f"Version: {v.version}, Stage: {v.current_stage}, Status: {v.status}, Run ID: {v.run_id}")

# Ask the user to select a version
selected_version = input("Enter the version number you want to download: ").strip()

# Load the selected model version
model_uri = f"models:/{model_name}/{selected_version}"
model = mlflow.pyfunc.load_model(model_uri=model_uri)

print(f"✅ Model version {selected_version} loaded successfully from MLflow.")

📦 Available versions for model: BikeSharingModel
Version: 1, Stage: None, Status: READY, Run ID: a7750d532dd641e78c6c7879cc1b79ac


Enter the version number you want to download:  1



✅ Model version 1 loaded successfully from MLflow.


## 📈 Generate a Regression Performance Report

The **Regression Performance Report** evaluates how well a model performs over time.

To simulate production monitoring, we first add a `prediction` column to both datasets by scoring them with the model loaded from MLflow (in production, this column would come from your inference pipeline). The report then compares the predicted and actual target values (`count`) and shows metrics such as:
- RMSE
- R²
- Error distribution
- Prediction quality
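
To make the first two metrics concrete, here is a dependency-free sketch that computes RMSE and R² from their textbook definitions. It is only an illustration with made-up numbers, not Evidently's implementation, and `rmse_and_r2` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple


def rmse_and_r2(actual: Sequence[float], predicted: Sequence[float]) -> Tuple[float, float]:
    """Return (RMSE, R²) for paired actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    n = len(actual)
    # Sum of squared residuals between actual and predicted values
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares around the mean of the actual values
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    rmse = math.sqrt(residual_ss / n)
    # R² is undefined when the actual values are all identical
    r2 = float("nan") if total_ss == 0 else 1.0 - residual_ss / total_ss
    return rmse, r2


# Made-up rental counts and predictions, for illustration only
rmse, r2 = rmse_and_r2([10, 20, 30, 40], [12, 18, 33, 37])
print(round(rmse, 4), round(r2, 3))  # 2.5495 0.948
```

A rising RMSE or a falling R² on current data, relative to the reference data, suggests the model is degrading.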


In [12]:
reference_data["prediction"] = model.predict(reference_data[numerical_features + categorical_features])
current_data["prediction"] = model.predict(current_data[numerical_features + categorical_features])
# reference_data.head()

In [6]:
# The `prediction` column was added in the previous cell using the model loaded from MLflow.
# In production, it would come from your model inference pipeline.

# Create the Regression Performance report
regression_report = Report(metrics=[RegressionPreset()])

# Run the report with column mapping
regression_report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping
)

# Save the report as HTML
os.makedirs("./reports", exist_ok=True)
output_path = "./reports/regression_performance_report.html"
regression_report.save_html(output_path)

print(f"✅ Regression Performance report saved to {output_path}")



✅ Regression Performance report saved to ./reports/regression_performance_report.html


## 📉 Generate a Data Drift Report

We'll use Evidently's `Report` class with the `DataDriftPreset` to compare feature distributions between the reference (January & February) and current (March) datasets.

This report will help us understand whether any input features have changed significantly, which may impact model predictions.
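
For intuition about what a drift score measures, the sketch below computes the Population Stability Index (PSI) for one numerical feature. Evidently selects its own statistical test for each column, so this is not necessarily what the report uses; the function name `psi`, the bin count, and the sample values are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index using equal-width bins derived from the reference data."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference values are constant; cannot build bins")
    width = (hi - lo) / bins

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into the edge bins
            counts[idx] += 1
        # Floor each share at a small epsilon so the logarithm stays defined
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = shares(reference)
    cur_shares = shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Synthetic feature values: the current sample is the reference shifted upward by 30
reference_values = [float(v) for v in range(100)]
current_values = [v + 30.0 for v in reference_values]

print(psi(reference_values, reference_values))  # 0.0, identical distributions
print(psi(reference_values, current_values) > 0.25)  # True, a large shift
```

The 0.25 cut-off used above is a common rule of thumb for a significant shift, not a threshold taken from Evidently.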


In [15]:
# Create a report with the Data Drift preset
data_drift_report = Report(metrics=[DataDriftPreset()])

# Run the comparison
data_drift_report.run(
    reference_data=reference_data[numerical_features + categorical_features],
    current_data=current_data[numerical_features + categorical_features],
    column_mapping=column_mapping
)

# Create the report directory if it doesn't exist
report_dir = "./reports"
os.makedirs(report_dir, exist_ok=True)

# Save the report as an HTML file
output_path = os.path.join(report_dir, "data_drift_report.html")
data_drift_report.save_html(output_path)

print(f"✅ Data Drift report saved to {output_path}")

✅ Data Drift report saved to ./reports/data_drift_report.html


## 🎯 Generate a Target Drift Report

We'll now generate a **Target Drift Report** using Evidently.

This report focuses specifically on changes in the distribution of the **target variable** (`count`), which represents the total number of bike rentals. Drift in the target distribution can indicate seasonal or behavioral changes in users that may affect model performance.
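
One simple way to quantify a shift in a single variable such as the target is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. The sketch below is an illustration with made-up counts, not necessarily the test Evidently applies here, and `ks_statistic` is a hypothetical helper name.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest absolute gap between the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    if not ref_sorted or not cur_sorted:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The gap between two step functions can only change at an observed value
    for value in set(ref_sorted) | set(cur_sorted):
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


# Made-up rental counts: the current sample overlaps half of the reference range
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

Values near 0 mean the two samples look alike; values near 1 mean they barely overlap.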


In [19]:
# Create the Target Drift report
target_drift_report = Report(metrics=[TargetDriftPreset()])

# Run the report with column mapping
target_drift_report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping
)

# Save the report as HTML
output_path = "./reports/target_drift_report.html"
target_drift_report.save_html(output_path)

print(f"✅ Target Drift report saved to {output_path}")

✅ Target Drift report saved to ./reports/target_drift_report.html


# ✅ Summary

In this module, we learned how to use Evidently to monitor data and model performance over time.

We completed the following steps:
- ✅ Evaluated **Regression Model Performance** using predictions from the model loaded from MLflow
- ✅ Compared new dataset (March) with training data (January & February) to detect **Data Drift**
- ✅ Analyzed changes in the **Target variable** distribution

These reports can be integrated into automated pipelines to continuously track the health of machine learning systems in production.
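
As a sketch of such an integration, the helper below decides whether a pipeline step should raise an alert based on the share of drifted columns. It assumes the report has been exported to a dictionary (for example via `Report.as_dict()`) whose metric results include a `share_of_drifted_columns` key; that shape is an assumption and should be checked against your Evidently version. The example dictionary is hand-written, and `find_drift_share` and `should_alert` are hypothetical names.

```python
from typing import Any, Dict


def find_drift_share(report_dict: Dict[str, Any]) -> float:
    """Return the first `share_of_drifted_columns` value found among the metric results."""
    for metric in report_dict.get("metrics", []):
        result = metric.get("result", {})
        if "share_of_drifted_columns" in result:
            return float(result["share_of_drifted_columns"])
    raise KeyError("no metric result with 'share_of_drifted_columns' found")


def should_alert(report_dict: Dict[str, Any], max_share: float = 0.5) -> bool:
    """True when the share of drifted columns exceeds the allowed maximum."""
    return find_drift_share(report_dict) > max_share


# Hand-written stand-in for an exported report (assumed shape, not real output)
example_report = {
    "metrics": [
        {"metric": "DatasetDriftMetric", "result": {"share_of_drifted_columns": 0.67}},
    ]
}
print(should_alert(example_report))  # True, because 0.67 > 0.5
```

A scheduled job could call `should_alert` after each report run and notify the team or trigger retraining when it returns `True`.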


## 🧪 Generate a Data Quality Report

This report helps identify common data issues such as:
- Missing values
- Unexpected or invalid values
- Type mismatches
- Constant or duplicate columns

This is useful for ensuring the data pipeline remains clean and reliable over time.
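
Two of these checks, missing values and constant columns, are simple enough to sketch by hand. The example below is illustrative only and is not how Evidently computes its metrics. It works on a plain mapping from column name to values, which `DataFrame.to_dict(orient="list")` can produce; `quality_summary` is a hypothetical helper and the data is made up.

```python
import math
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def quality_summary(columns: Dict[str, List[Any]]) -> Dict[str, Dict[str, Any]]:
    """Count missing values and flag constant columns, per column."""
    summary: Dict[str, Dict[str, Any]] = {}
    for name, values in columns.items():
        present = [v for v in values if not is_missing(v)]
        summary[name] = {
            "missing": len(values) - len(present),
            # A column with at most one distinct non-missing value carries no signal
            "constant": len(set(present)) <= 1,
        }
    return summary


# Made-up sample: one missing temperature, and a holiday flag that never changes
sample = {"temp": [9.8, None, 10.2], "holiday": [0, 0, 0]}
print(quality_summary(sample))
# {'temp': {'missing': 1, 'constant': False}, 'holiday': {'missing': 0, 'constant': True}}
```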


In [None]:
# Create the Data Quality report
data_quality_report = Report(metrics=[DataQualityPreset()])

# Run the report with column mapping
data_quality_report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping
)

# Save the report as HTML
output_path = "./reports/data_quality_report.html"
data_quality_report.save_html(output_path)

print(f"✅ Data Quality report saved to {output_path}")
