# 📊 Module 4: Generate Reports for Data and Model Drift

In this module, we'll use [Evidently](https://evidentlyai.com/) to generate reports that help monitor and detect:

- **Data Drift**
- **Target Drift**
- **Data Quality Issues**
- **Regression Model Performance**

We'll start by loading two consecutive months of data:
- **Reference Data**: January 2011
- **Current Data**: February 2011

This simulates comparing a baseline dataset against new incoming production data.


In [None]:
# Install requirements
!pip install -r requirements.txt

## 📦 Import Required Libraries

Before generating any reports, we import the libraries used throughout this module: pandas for loading the data, Evidently for building the reports, and MLflow for retrieving the trained model.


In [1]:
# Import necessary modules
import os
import joblib

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

import pandas as pd
import numpy as np

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import (
    DataDriftPreset,
    DataQualityPreset,
    RegressionPreset,
    TargetDriftPreset,
)

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

## 📁 Step 1: Load Reference and Current Data

In [2]:
# Load reference (January) and current (February) datasets
reference_data = pd.read_csv("./data/processed/data_2011_01.csv")
current_data = pd.read_csv("./data/processed/data_2011_02.csv")

# Preview shapes and basic info
print("Reference data shape:", reference_data.shape)
print("Current data shape:", current_data.shape)
reference_data.head()

Reference data shape: (688, 17)
Current data shape: (649, 17)


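| Unnamed: 0 | dteday | instant | season | year | month | hour | holiday | weekday | workingday | weathersit | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 | 2 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 | 3 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 | 4 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 | 5 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |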
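
Before comparing two months, it can help to confirm that both files share the same schema. The following is a minimal sketch and not part of the original notebook: the helper name `compare_schemas` and the small stand-in frames are hypothetical, used so the example runs without the CSV files.

```python
import pandas as pd


def compare_schemas(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
    """Return columns missing from either frame and columns whose dtypes differ."""
    ref_cols, cur_cols = set(reference.columns), set(current.columns)
    shared = sorted(ref_cols & cur_cols)
    return {
        "missing_in_current": sorted(ref_cols - cur_cols),
        "missing_in_reference": sorted(cur_cols - ref_cols),
        "dtype_mismatches": [
            col for col in shared if reference[col].dtype != current[col].dtype
        ],
    }


# Hypothetical stand-ins for reference_data and current_data
ref_demo = pd.DataFrame({"temp": [0.24, 0.22], "hour": [0, 1], "count": [16, 40]})
cur_demo = pd.DataFrame({"temp": [0.30, 0.28], "hour": ["0", "1"]})

schema_report = compare_schemas(ref_demo, cur_demo)
print(schema_report)
```

With the real data, the same call would be `compare_schemas(reference_data, current_data)`; three empty lists would suggest the two months line up column for column.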


## 🗺️ Step 2: Define Column Mapping

Evidently supports specifying column roles explicitly via `ColumnMapping`, which helps produce more accurate and meaningful metrics.

Here, we define:
- `target`: the actual value to predict (`count`)
- `prediction`: (optional) the name of the column that holds model predictions (this column is added later, once the model is loaded)
- `numerical_features`: continuous input features
- `categorical_features`: categorical or discrete input features


In [3]:
# Column roles for the bike-sharing dataset
target = "count"
prediction = "prediction"
numerical_features = ["temp", "atemp", "humidity", "windspeed", "hour", "weekday"]
categorical_features = ["season", "holiday", "workingday"]

column_mapping = ColumnMapping()

column_mapping.target = target
column_mapping.prediction = prediction
column_mapping.numerical_features = numerical_features
column_mapping.categorical_features = categorical_features
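
A mapping that names a column the DataFrame does not contain is easy to miss until a report fails. As a hedged sketch (the helper `missing_mapped_columns` and the one-row frame are hypothetical, not part of Evidently or the original notebook), the mapped names can be checked against the data first. The `prediction` column is left out of the check because it is only added after the model is loaded.

```python
import pandas as pd


def missing_mapped_columns(df, target, numerical_features, categorical_features):
    """List mapped column names that are absent from the DataFrame."""
    expected = [target, *numerical_features, *categorical_features]
    return [col for col in expected if col not in df.columns]


# Hypothetical one-row frame using column names from the data preview
mapping_demo = pd.DataFrame([{
    "season": 1, "holiday": 0, "workingday": 0, "weekday": 6, "hour": 0,
    "temp": 0.24, "atemp": 0.2879, "humidity": 0.81, "windspeed": 0.0, "count": 16,
}])

missing = missing_mapped_columns(
    mapping_demo,
    target="count",
    numerical_features=["temp", "atemp", "humidity", "windspeed", "hour", "weekday"],
    categorical_features=["season", "holiday", "workingday"],
)
print(missing)
```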

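Next, we load a registered version of the trained model from the MLflow Model Registry so that we can generate predictions for both datasets.

In [4]: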
import mlflow
from mlflow.tracking import MlflowClient

# Initialize MLflow client
mlflow.set_tracking_uri("http://localhost:5000")
client = MlflowClient()
model_name = "BikeSharingModel"

# List available versions
versions = client.search_model_versions(filter_string=f"name='{model_name}'", order_by=["version_number DESC"])

print("üì¶ Available versions for model:", model_name)
for v in versions:
    print(f"Version: {v.version}, Stage: {v.current_stage}, Status: {v.status}, Run ID: {v.run_id}")

# Ask the user to select a version
selected_version = input("Enter the version number you want to download: ").strip()

# Load the selected model version
model_uri = f"models:/{model_name}/{selected_version}"
model = mlflow.pyfunc.load_model(model_uri=model_uri)

print(f"✅ Model version {selected_version} loaded successfully from MLflow.")

📦 Available versions for model: BikeSharingModel
Version: 2, Stage: None, Status: READY, Run ID: cd611bbea71e4006a2c1668522776c47
Version: 1, Stage: None, Status: READY, Run ID: bd324dc36d764539aae2d7e5226fd5e9


Enter the version number you want to download:  2


✅ Model version 2 loaded successfully from MLflow.
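
The `input()` prompt works in an interactive notebook but blocks an unattended run. One possible alternative, sketched here under assumptions, is to pick the highest version whose status is `READY`. The helper `latest_ready_version` is hypothetical, and the `SimpleNamespace` objects only mimic the `version` and `status` fields printed above so that the example runs without an MLflow server.

```python
from types import SimpleNamespace


def latest_ready_version(versions):
    """Return the highest version number among entries whose status is READY."""
    ready = [int(v.version) for v in versions if v.status == "READY"]
    if not ready:
        raise ValueError("No READY model versions found")
    return max(ready)


# Hypothetical stand-ins for the objects returned by search_model_versions
demo_versions = [
    SimpleNamespace(version="1", status="READY"),
    SimpleNamespace(version="2", status="READY"),
]
selected = latest_ready_version(demo_versions)
model_uri_demo = f"models:/BikeSharingModel/{selected}"
print(model_uri_demo)
```

In the notebook, `versions` from the cell above could be passed in place of `demo_versions`, and the resulting URI handed to `mlflow.pyfunc.load_model`.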


## 📈 Generate a Regression Performance Report

The **Regression Performance Report** evaluates how well a model performs over time.

To simulate production monitoring, we first add a `prediction` column to both datasets by running the loaded model on the input features (in production, this column would come from your inference pipeline). The report then compares the predicted and actual target values (`count`) and shows metrics such as:
- RMSE
- R²
- Error distribution
- Prediction quality


In [5]:
reference_data["prediction"] = model.predict(reference_data[numerical_features + categorical_features])
current_data["prediction"] = model.predict(current_data[numerical_features + categorical_features])
reference_data.head()

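| Unnamed: 0 | dteday | instant | season | year | month | hour | holiday | weekday | workingday | weathersit | temp | atemp | humidity | windspeed | casual | registered | count | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 | 25.48 |
| 1 | 2011-01-01 | 2 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 | 35.44 |
| 2 | 2011-01-01 | 3 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 | 25.67 |
| 3 | 2011-01-01 | 4 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 | 12.81 |
| 4 | 2011-01-01 | 5 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 | 3.42 |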


In [7]:
# The `prediction` column was added in the previous cell.
# In production, it would come from your model inference pipeline.

# Create the Regression Performance report
regression_report = Report(metrics=[RegressionPreset()])

# Run the report with column mapping
regression_report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping
)

# Make sure the output directory exists, then save the report as HTML
os.makedirs("./reports", exist_ok=True)
output_path = "./reports/regression_performance_report.html"
regression_report.save_html(output_path)

print(f"✅ Regression Performance report saved to {output_path}")



R^2 score is not well-defined with less than two samples.



✅ Regression Performance report saved to ./reports/regression_performance_report.html


## 📉 Step 3: Generate a Data Drift Report

We'll use Evidently's `Report` class with the `DataDriftPreset` to compare feature distributions between the reference (January) and current (February) datasets.

This report will help us understand whether any input features have changed significantly, which may impact model predictions.
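
To give a feel for what "changed significantly" can mean for a numerical feature, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical distribution functions. This is one common drift measure, shown as an illustration; it is not a claim about which test Evidently selects for a given column. The helper `ks_statistic` and the synthetic samples are hypothetical.

```python
import numpy as np


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    # Evaluate both empirical CDFs at every observed value
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))


rng = np.random.default_rng(seed=0)
january_temp = rng.normal(loc=0.20, scale=0.05, size=500)   # synthetic, not the real data
february_temp = rng.normal(loc=0.30, scale=0.05, size=500)  # synthetic, shifted upward

print("same sample:   ", ks_statistic(january_temp, january_temp))
print("shifted sample:", ks_statistic(january_temp, february_temp))
```

A statistic of 0 means the two samples have identical empirical distributions, while values close to 1 mean they barely overlap.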


In [3]:
# Create a report with the Data Drift preset
data_drift_report = Report(metrics=[DataDriftPreset()])

# Run the comparison
data_drift_report.run(reference_data=reference_data, current_data=current_data, column_mapping=column_mapping)

# Create directories if they don't exist
report_dir = "./reports"
os.makedirs(report_dir, exist_ok=True)

# Save the report as an HTML file
output_path = "./reports/data_drift_report.html"
data_drift_report.save_html(output_path)

print(f"✅ Data Drift report saved to {output_path}")

✅ Data Drift report saved to ./reports/data_drift_report.html


## 🎯 Step 4: Generate a Target Drift Report

We'll now generate a **Target Drift Report** using Evidently.

This report focuses specifically on changes in the distribution of the **target variable** (`count`), which represents the total number of bike rentals. Drift in the target distribution can indicate seasonal or behavioral changes in users that may affect model performance.
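
One way to put a number on a shift in the target distribution is the Population Stability Index (PSI). The sketch below is a hedged illustration with a hypothetical helper and synthetic rental counts; it is not taken from the notebook and is not presented as the statistic Evidently uses for this report.

```python
import numpy as np


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two samples, binned on quantiles of the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior cut points from reference quantiles; the outer bins are open-ended
    cuts = np.unique(np.quantile(reference, np.linspace(0, 1, bins + 1)[1:-1]))
    ref_counts = np.bincount(
        np.searchsorted(cuts, reference, side="right"), minlength=cuts.size + 1
    )
    cur_counts = np.bincount(
        np.searchsorted(cuts, current, side="right"), minlength=cuts.size + 1
    )
    # Clip shares to avoid log(0) and division by zero for empty bins
    ref_share = np.clip(ref_counts / reference.size, eps, None)
    cur_share = np.clip(cur_counts / current.size, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


rng = np.random.default_rng(seed=1)
january_counts = rng.poisson(lam=50, size=600)   # synthetic rental counts
february_counts = rng.poisson(lam=80, size=600)  # synthetic, higher demand

psi_same = population_stability_index(january_counts, january_counts)
psi_shifted = population_stability_index(january_counts, february_counts)
print(psi_same, psi_shifted)
```

A frequently quoted rule of thumb treats PSI below roughly 0.1 as stable and above roughly 0.25 as a notable shift, though such thresholds are conventions rather than guarantees.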


In [4]:
# Create the Target Drift report
target_drift_report = Report(metrics=[TargetDriftPreset()])

# Run the report with column mapping
target_drift_report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping
)

# Save the report as HTML
output_path = "./reports/target_drift_report.html"
target_drift_report.save_html(output_path)

print(f"✅ Target Drift report saved to {output_path}")

✅ Target Drift report saved to ./reports/target_drift_report.html


## 🧪 Step 5: Generate a Data Quality Report

This report helps identify common data issues such as:
- Missing values
- Unexpected or invalid values
- Type mismatches
- Constant or duplicate columns

This is useful for ensuring the data pipeline remains clean and reliable over time.
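
Several of these checks can be approximated directly in pandas, which is handy for a quick look before opening a full report. The following is a minimal sketch with a hypothetical helper `basic_quality_checks` and a made-up frame; it covers only a small part of what a dedicated data quality report would examine.

```python
import pandas as pd


def basic_quality_checks(df: pd.DataFrame) -> dict:
    """Collect a few simple data-quality indicators for one DataFrame."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": {c: str(t) for c, t in df.dtypes.items()},
    }


# Made-up frame with one missing value, one constant column, and one repeated row
quality_demo = pd.DataFrame({
    "temp": [0.24, 0.22, None, 0.24],
    "year": [0, 0, 0, 0],
    "count": [16, 40, 32, 16],
})
quality = basic_quality_checks(quality_demo)
print(quality)
```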


In [5]:
# Create the Data Quality report
data_quality_report = Report(metrics=[DataQualityPreset()])

# Run the report with column mapping
data_quality_report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=column_mapping
)

# Save the report as HTML
output_path = "./reports/data_quality_report.html"
data_quality_report.save_html(output_path)

print(f"✅ Data Quality report saved to {output_path}")


ValueError: cannot insert count, already exists
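
The `ValueError` above is shown as it appeared in the notebook output. One possible explanation, which is an assumption and not something the source confirms, is a name clash between the `count` target column and a column that pandas creates under the same name internally (in pandas 2.x, `value_counts()` names its result `count`). If that is the cause, renaming the target before running the report may avoid the error. The new name `rental_count` and the helper `rename_target` below are hypothetical.

```python
import pandas as pd

# Hypothetical replacement name for the target column
SAFE_TARGET = "rental_count"


def rename_target(df: pd.DataFrame, old: str = "count", new: str = SAFE_TARGET) -> pd.DataFrame:
    """Return a copy of the frame with the target column renamed."""
    return df.rename(columns={old: new})


# Made-up stand-in for reference_data
demo_reference = pd.DataFrame({"temp": [0.24, 0.22], "count": [16, 40]})
renamed_reference = rename_target(demo_reference)
print(list(renamed_reference.columns))

# With the real data, the same rename would be applied to both frames, and
# `column_mapping.target` would be set to SAFE_TARGET before re-running the report.
```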

# ✅ Summary

In this module, we learned how to use Evidently to monitor data and model performance over time.

We completed the following steps:
- ✅ Compared two months of data to detect **Data Drift**
- ✅ Analyzed changes in the **Target variable** distribution
- ✅ Set up a **Data Quality** report to check the datasets for issues
- ✅ Evaluated **Regression Model Performance** using predictions from a model loaded from MLflow

These reports can be integrated into automated pipelines to continuously track the health of machine learning systems in production.
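
As a small sketch of what such a pipeline might iterate over, the helper below pairs each monthly file with the following one, so that every month serves as the reference for the next. The function `consecutive_pairs` and the March file name are hypothetical; the approach assumes file names sort chronologically, as the zero-padded `data_YYYY_MM.csv` pattern used in this module does.

```python
def consecutive_pairs(paths):
    """Pair each file with the one that follows it in sorted order: (reference, current)."""
    ordered = sorted(paths)
    return list(zip(ordered[:-1], ordered[1:]))


# The first two names come from this module; the March file is hypothetical
monthly_files = [
    "./data/processed/data_2011_02.csv",
    "./data/processed/data_2011_01.csv",
    "./data/processed/data_2011_03.csv",
]
pairs = consecutive_pairs(monthly_files)
for reference_path, current_path in pairs:
    print(reference_path, "->", current_path)
```

Each pair could then be loaded with `pd.read_csv` and passed to the report cells from this module.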
