
# 🧪 Lab: Model Monitoring with Evidently AI (Classification)

**Goal:** Learn how to monitor a machine‑learning model in production-like conditions using **[Evidently AI](https://docs.evidentlyai.com/)**:
- Detect **data drift** and **target drift**
- Track **classification performance** over time
- Produce **HTML reports** and a simple **batch monitoring loop** with alerts

> This lab is self-contained and runs locally with synthetic production batches. You can adapt it to your own datasets later.

**Tested with:** Python 3.9+, scikit‑learn ≥ 1.2, evidently ≥ 0.7 (the code uses the `Dataset` / `DataDefinition` API; the outputs below come from 0.7.14)  
**Created:** 2025-09-15 (UTC)


## 1) Environment Setup

In [3]:

# In a clean environment, uncomment the two lines below to install every dependency.
# Installing may take 1–2 minutes.

# %pip install -U pip
# %pip install -U numpy pandas scikit-learn matplotlib evidently

# Evidently itself is always installed or upgraded:
%pip install -U evidently



Note: you may need to restart the kernel to use updated packages.


In [5]:
import evidently
evidently.__version__


'0.7.14'

## 2) Imports & Utility

In [8]:

import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score



import evidently
from evidently import Dataset, DataDefinition, BinaryClassification, Report
from evidently.presets import DataDriftPreset, ClassificationPreset
from evidently.metrics import DriftedColumnsCount, Accuracy
print("evidently.__version__ =", evidently.__version__)



# Nice display options
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 120)

OUTPUT_DIR = Path('evidently_reports')
OUTPUT_DIR.mkdir(exist_ok=True)

RNG = np.random.default_rng(42)


evidently.__version__ = 0.7.14


## 3) Load Dataset (scikit‑learn Breast Cancer)

In [9]:

# We use a built-in dataset (no internet needed).
ds = load_breast_cancer(as_frame=True)
df = ds.frame.copy()

# Rename target to a friendly name
df = df.rename(columns={'target': 'label'})

# Quick peek
df.head()


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


## 4) Create Reference and Simulated Production (Current) Data


We'll simulate a production stream by splitting the dataset:

- **Reference** = training slice (assumed "healthy" baseline).
- **Current** = batches sliced from the remaining data, with synthetic drift injected into the later batches (feature mean shifts and random label flips).


In [11]:

# Split into model train (reference) and holdout (to simulate production batches)
ref_df, prod_pool = train_test_split(df, test_size=0.5, random_state=7, stratify=df['label'])

# Separate features and labels
feature_names = [c for c in df.columns if c != 'label']
X_ref, y_ref = ref_df[feature_names], ref_df['label']

# Train a simple classifier
clf = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=500, random_state=7))
])
clf.fit(X_ref, y_ref)

# Add model outputs to reference (simulate serving)
ref_df = ref_df.copy()
ref_df['pred_proba'] = clf.predict_proba(X_ref)[:, 1]
ref_df['prediction'] = (ref_df['pred_proba'] >= 0.5).astype(int)

print("Reference size:", ref_df.shape, "  AUC:", roc_auc_score(y_ref, ref_df['pred_proba']))
ref_df.head()


Reference size: (284, 33)   AUC: 0.9971380114479542


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label,pred_proba,prediction
349,11.95,14.96,77.23,426.7,0.1158,0.1206,0.01171,0.01787,0.2459,0.06581,0.361,1.05,2.455,26.65,0.0058,0.02417,0.007816,0.01052,0.02734,0.003114,12.81,17.72,83.09,496.2,0.1293,0.1885,0.03122,0.04766,0.3124,0.0759,1,0.997574,1
507,11.06,17.12,71.25,366.5,0.1194,0.1071,0.04063,0.04268,0.1954,0.07976,0.1779,1.03,1.318,12.3,0.01262,0.02348,0.018,0.01285,0.0222,0.008313,11.69,20.74,76.08,411.1,0.1662,0.2031,0.1256,0.09514,0.278,0.1168,1,0.999763,1
114,8.726,15.83,55.84,230.9,0.115,0.08201,0.04132,0.01924,0.1649,0.07633,0.1665,0.5864,1.354,8.966,0.008261,0.02213,0.03259,0.0104,0.01708,0.003806,9.628,19.62,64.48,284.4,0.1724,0.2364,0.2456,0.105,0.2926,0.1017,1,0.999695,1
419,11.16,21.41,70.95,380.3,0.1018,0.05978,0.008955,0.01076,0.1615,0.06144,0.2865,1.678,1.968,18.99,0.006908,0.009442,0.006972,0.006159,0.02694,0.00206,12.36,28.92,79.26,458.0,0.1282,0.1108,0.03582,0.04306,0.2976,0.07123,1,0.997902,1
452,12.0,28.23,76.77,442.5,0.08437,0.0645,0.04055,0.01945,0.1615,0.06104,0.1912,1.705,1.516,13.86,0.007334,0.02589,0.02941,0.009166,0.01745,0.004302,13.09,37.88,85.07,523.7,0.1208,0.1856,0.1811,0.07116,0.2447,0.08194,1,0.98858,1


## 5) Simulate Production Batches with Drift


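We'll create **N batches** (`N_BATCHES = 6`) from the remaining pool and induce controlled drift to see Evidently in action: batches 3–6 get a shift on the first five features, and batches 5–6 additionally get 10% of their labels flipped.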


In [12]:

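def induce_feature_shift(df_in: pd.DataFrame, shift_cols, shift_by=0.25, scale=1.0, rng=None):
    """Add Gaussian noise with mean `shift_by` and std `0.1 * scale` to each numeric column in `shift_cols`."""
    rng = np.random.default_rng(0) if rng is None else rng
    df_out = df_in.copy()
    for c in shift_cols:
        if pd.api.types.is_numeric_dtype(df_out[c]):
            # The offset is absolute, so it matters most for small-valued features
            noise = rng.normal(loc=shift_by, scale=0.1*scale, size=len(df_out))
            df_out[c] = df_out[c] + noise
    return df_out

def flip_labels(df_in: pd.DataFrame, flip_rate=0.0, rng=None):
    """Flip each binary label with probability `flip_rate` (label noise)."""
    rng = np.random.default_rng(0) if rng is None else rng
    df_out = df_in.copy()
    if flip_rate > 0:
        m = rng.random(len(df_out)) < flip_rate  # boolean mask of rows to flip
        df_out.loc[m, 'label'] = 1 - df_out.loc[m, 'label']
    return df_out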

# Build batches
N_BATCHES = 6
batch_size = int(np.ceil(len(prod_pool) / N_BATCHES))

batches = []
start = 0
for i in range(N_BATCHES):
    batch = prod_pool.iloc[start:start+batch_size].copy()
    start += batch_size

    # Induce drift for later batches
    if i >= 2:
        batch = induce_feature_shift(batch, shift_cols=feature_names[:5], shift_by=0.35, scale=1.25, rng=RNG)
    if i >= 4:
        batch = flip_labels(batch, flip_rate=0.10, rng=RNG)

    # Get predictions
    Xb = batch[feature_names]
    batch['pred_proba'] = clf.predict_proba(Xb)[:, 1]
    batch['prediction'] = (batch['pred_proba'] >= 0.5).astype(int)
    batch['batch_id'] = i + 1
    batches.append(batch)

len(batches), [b.shape[0] for b in batches]


(6, [48, 48, 48, 48, 48, 45])
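
To build intuition for why the shifted batches should register as drifted, here is a minimal, standard-library sketch of a two-sample Kolmogorov–Smirnov statistic: the largest gap between the empirical CDFs of a reference sample and a current sample. This is an illustration only, not Evidently's implementation; Evidently chooses its drift test per column depending on column type and sample size. The function name `ks_statistic` and the synthetic Gaussian samples are assumptions made for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    n_ref, n_cur = len(ref_sorted), len(cur_sorted)
    gap = 0.0
    # The supremum is reached at one of the observed values,
    # so it is enough to evaluate both CDFs at every sample point.
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / n_ref
        cdf_cur = bisect_right(cur_sorted, x) / n_cur
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


# Hypothetical data: one feature drawn three times, the last time with a mean shift
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(300)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(300)]
shifted = [rng.gauss(1.5, 1.0) for _ in range(300)]

print(f"no shift: {ks_statistic(reference, same_dist):.3f}")
print(f"shifted:  {ks_statistic(reference, shifted):.3f}")
```

A statistic near 0 means the two samples look alike; a value near 1 means they barely overlap. A drift test turns this statistic into a p-value and compares it with a threshold.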

## 6) Evidently Data Definition (Column Mapping)

In [13]:

# Evidently needs to know which columns hold the target, the predictions, and the features.
# Evidently 0.7 no longer provides the legacy ColumnMapping class (hence a NameError if you use it);
# DataDefinition takes its place.
data_definition = DataDefinition(
    classification=[BinaryClassification(
        target='label',
        prediction_labels='prediction',
        prediction_probas='pred_proba',
    )],
    numerical_columns=feature_names,            # all features here are numeric
    categorical_columns=['label', 'prediction'],
)

MONITORED_COLUMNS = feature_names + ['label', 'pred_proba', 'prediction']

def to_dataset(frame: pd.DataFrame) -> Dataset:
    # Keep only the mapped columns so bookkeeping fields such as batch_id are not evaluated
    return Dataset.from_pandas(frame[MONITORED_COLUMNS], data_definition=data_definition)

ref_ds = to_dataset(ref_df)


## 7) Data Drift Report

In [None]:

batch0 = batches[0]  # compare first current batch vs reference
batch0_ds = to_dataset(batch0)

# In Evidently 0.7, run() takes Dataset objects and returns the evaluation result;
# that result (not the Report) is what you save or display.
data_drift_report = Report([DataDriftPreset()])
data_drift_eval = data_drift_report.run(current_data=batch0_ds, reference_data=ref_ds)

HTML_PATH = OUTPUT_DIR / 'data_drift_batch1_vs_ref.html'
data_drift_eval.save_html(str(HTML_PATH))
print(f"Saved: {HTML_PATH.resolve()}")

data_drift_eval


## 8) Classification Performance Report

In [None]:

classif_report = Report([ClassificationPreset()])
classif_eval = classif_report.run(current_data=batch0_ds, reference_data=ref_ds)

HTML_PATH = OUTPUT_DIR / 'classification_performance_batch1_vs_ref.html'
classif_eval.save_html(str(HTML_PATH))
print(f"Saved: {HTML_PATH.resolve()}")

classif_eval


## 9) Target Drift Report

In [None]:

# Evidently 0.7 has no TargetDriftPreset. Per-column ValueDrift on the target and
# prediction columns covers the same check.
from evidently.metrics import ValueDrift

target_drift_report = Report([ValueDrift(column='label'), ValueDrift(column='prediction')])
target_drift_eval = target_drift_report.run(current_data=batch0_ds, reference_data=ref_ds)

HTML_PATH = OUTPUT_DIR / 'target_drift_batch1_vs_ref.html'
target_drift_eval.save_html(str(HTML_PATH))
print(f"Saved: {HTML_PATH.resolve()}")

target_drift_eval


## 10) Batch Monitoring Loop + Simple Alerts


We'll iterate through batches and compute **key metrics** per batch:
- `share_drifted_columns` from Evidently's `DriftedColumnsCount` metric
- Overall **Accuracy** and **ROC AUC** (computed with scikit‑learn)
- A simple alert if drift share or accuracy degrades beyond thresholds


In [None]:

records = []
DRIFT_ALERT_THRESHOLD = 0.3     # 30% of columns drifted
ACCURACY_ALERT_DROP = 0.08      # alert if accuracy below (reference - 0.08)

# Reference performance
ref_acc = accuracy_score(y_ref, ref_df['prediction'])
ref_auc = roc_auc_score(y_ref, ref_df['pred_proba'])
print(f"Reference Accuracy={ref_acc:.3f}, AUC={ref_auc:.3f}")

for b in batches:
    batch_id = int(b['batch_id'].iloc[0])
    b_ds = to_dataset(b)
    y_true = b['label']
    y_prob = b['pred_proba']
    y_pred = b['prediction']

    # Data drift share
    drift_eval = Report([DriftedColumnsCount()]).run(current_data=b_ds, reference_data=ref_ds)
    # In evidently 0.7.x this metric's value is expected to be a dict with 'count' and 'share';
    # print drift_eval.dict() to confirm the layout for your version.
    share_drifted = drift_eval.dict()['metrics'][0]['value']['share']

    # Performance
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_prob)

    alert = False
    reasons = []
    if share_drifted >= DRIFT_ALERT_THRESHOLD:
        alert = True
        reasons.append(f"drifted_columns_share={share_drifted:.2f} ≥ {DRIFT_ALERT_THRESHOLD}")
    if acc <= (ref_acc - ACCURACY_ALERT_DROP):
        alert = True
        reasons.append(f"accuracy_drop={ref_acc-acc:.2f} ≥ {ACCURACY_ALERT_DROP}")

    # Save an HTML report per batch (Classification + Drift combined)
    rep_eval = Report([DataDriftPreset(), ClassificationPreset()]).run(current_data=b_ds, reference_data=ref_ds)
    out_html = OUTPUT_DIR / f'batch{batch_id}_report.html'
    rep_eval.save_html(str(out_html))

    records.append({
        'batch_id': batch_id,
        'rows': len(b),
        'share_drifted_columns': share_drifted,
        'accuracy': acc,
        'auc': auc,
        'alert': alert,
        'reasons': "; ".join(reasons)
    })

monitor_df = pd.DataFrame(records).sort_values('batch_id')
monitor_df


## 11) Visualize Monitoring Metrics

In [None]:

plt.figure(figsize=(7,4))
plt.plot(monitor_df['batch_id'], monitor_df['share_drifted_columns'], marker='o')
plt.axhline(DRIFT_ALERT_THRESHOLD, linestyle='--', label='alert threshold')
plt.title('Share of Drifted Columns over Batches')
plt.xlabel('Batch ID')
plt.ylabel('Share drifted')
plt.legend()

plt.figure(figsize=(7,4))
plt.plot(monitor_df['batch_id'], monitor_df['accuracy'], marker='o')
# The alert rule compares against the reference accuracy, not the first batch's accuracy
plt.axhline(ref_acc - ACCURACY_ALERT_DROP, linestyle='--', label='alert threshold')
plt.title('Accuracy over Batches')
plt.xlabel('Batch ID')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

monitor_df


## 12) Exercises / What to Hand In


1. **Run all cells** and open the generated HTML reports in the `evidently_reports/` folder.
2. Change the drift intensity in `induce_feature_shift()` and the `flip_rate` to see the effect on reports.
3. Adjust the **decision threshold** from `0.5` to another value and re-run the batch loop.
4. Replace the dataset with your own (keep the same column names for target/prediction or update the column mapping in section 6).
5. Add at least **one new alert rule**, e.g., AUC drop or precision for the positive class using Evidently metrics.
6. (Bonus) Export `monitor_df` to CSV and make a small dashboard plotting drift vs. accuracy over time.
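
As one possible starting point for exercise 5, the alert rules from the batch loop can be factored into a small function, which makes adding a rule a local change. The sketch below is hypothetical: `AlertResult`, `evaluate_alerts`, and the `auc_drop=0.05` threshold are names and values invented for illustration, and the metric dictionaries simply mirror the columns of `monitor_df`.

```python
from dataclasses import dataclass, field


@dataclass
class AlertResult:
    alert: bool = False
    reasons: list = field(default_factory=list)


def evaluate_alerts(metrics, reference, drift_threshold=0.3,
                    accuracy_drop=0.08, auc_drop=0.05):
    """Apply threshold rules to one batch; thresholds here are illustrative defaults."""
    result = AlertResult()

    share = metrics["share_drifted_columns"]
    if share >= drift_threshold:
        result.reasons.append(f"drifted_columns_share={share:.2f} >= {drift_threshold}")

    acc_gap = reference["accuracy"] - metrics["accuracy"]
    if acc_gap >= accuracy_drop:
        result.reasons.append(f"accuracy_drop={acc_gap:.2f} >= {accuracy_drop}")

    # New rule (hypothetical threshold): alert on a drop in ROC AUC
    auc_gap = reference["auc"] - metrics["auc"]
    if auc_gap >= auc_drop:
        result.reasons.append(f"auc_drop={auc_gap:.2f} >= {auc_drop}")

    result.alert = bool(result.reasons)
    return result


# Made-up numbers, only to show the call shape
reference = {"accuracy": 0.98, "auc": 0.995}
batch = {"share_drifted_columns": 0.40, "accuracy": 0.85, "auc": 0.90}
outcome = evaluate_alerts(batch, reference)
print(outcome.alert, outcome.reasons)
```

Keeping each rule as an independent check that appends a reason means a batch can report several causes at once, which is how the `reasons` column in `monitor_df` is built.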


## 13) Appendix: Useful Snippets

In [None]:

# Save aggregated metrics
monitor_df.to_csv(OUTPUT_DIR / 'monitor_summary.csv', index=False)

# Helper for trying a different decision threshold (0.4 here); returns (labels, probabilities)
def predict_with_threshold(model, X, threshold=0.4):
    p = model.predict_proba(X)[:,1]
    return (p >= threshold).astype(int), p
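
Before re-running the batch loop with a new threshold (exercise 3), it can help to see what the threshold does to the confusion counts. The following is a minimal sketch on hand-made values; `confusion_counts` and the six example probabilities are assumptions for illustration and do not come from the lab's model.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP/FP/FN/TN when probabilities >= threshold are labeled positive."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            counts["tp"] += 1
        elif pred == 1 and truth == 0:
            counts["fp"] += 1
        elif pred == 0 and truth == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical labels and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.90, 0.45, 0.60, 0.42, 0.10, 0.55]

for thr in (0.5, 0.4):
    print(thr, confusion_counts(y_true, y_prob, thr))
```

Lowering the threshold from 0.5 to 0.4 recovers the missed positive (0.45) at the cost of one extra false positive (0.42), which is the trade-off to watch when you change the threshold in the lab.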
