
# Evidently Lab: Binary Classification (v0.7+ API)

**Goal:** Build a minimal-yet-practical workflow to evaluate a binary classifier using Evidently **0.7+** with:
- clean data schema via `DataDefinition` + `BinaryClassification`
- model quality metrics (Accuracy, Precision, Recall, F1, ROC AUC)
- simple data quality checks
- segment-level breakdowns
- reference vs. current comparison & drift checks

> **Import rule (per lab request):** only import **what we use** from the list provided.


## 1) Setup & version check

In [1]:

# (Optional) quick version check — should be 0.7+
import evidently
print("evidently.__version__:", evidently.__version__)


evidently.__version__: 0.7.14
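
Printing the version leaves the comparison to the reader. A minimal sketch of an explicit guard follows; `parse_major_minor` and `require_at_least` are hypothetical helper names introduced here, not part of Evidently. The check compares integer tuples because a plain string comparison would rank `"0.10"` below `"0.7"`.

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a dotted version string such as '0.7.14'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def require_at_least(version: str, minimum: tuple[int, int] = (0, 7)) -> None:
    """Raise RuntimeError if `version` is older than `minimum`."""
    if parse_major_minor(version) < minimum:
        raise RuntimeError(
            f"Evidently {minimum[0]}.{minimum[1]}+ required, found {version}"
        )


# Passes silently for the version printed above.
require_at_least("0.7.14")
```

In the notebook, the call would be `require_at_least(evidently.__version__)`.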


## 2) Imports (selected from the given list)

In [2]:

import pandas as pd

from evidently import Dataset, DataDefinition, BinaryClassification, Report

# Metrics & utilities used in this lab (picked from the provided list)
from evidently.metrics import (
    ColumnCount,
    RowCount,
    DuplicatedRowCount,
    DatasetMissingValueCount,
    Accuracy,
    Precision,
    Recall,
    F1Score,
    RocAuc,
    DriftedColumnsCount,
    ValueDrift,
)
from evidently.metrics.group_by import GroupBy


## 3) Create synthetic **reference** and **current** datasets

In [3]:

# We'll create a small synthetic binary classification dataset using only pandas + stdlib.
# Columns:
# - 'age' (numeric), 'income' (numeric), 'segment' (categorical)
# - 'target' (0/1), 'pred_proba' (probability of class 1), 'prediction' (0/1 via 0.5 threshold)

import random
random.seed(7)

def make_split(n_rows=600, shift=False):
    rows = []
    for i in range(n_rows):
        age = random.randint(18, 70)
        base_income = random.randint(18, 120) * 1000
        segment = random.choice(["A", "B", "C"])
        
        # True target: a simple rule with noise
        p_true = 0.25
        if age < 30: p_true += 0.05
        if base_income > 80000: p_true += 0.15
        if segment == "A": p_true += 0.10
        # add small noise
        p_true = max(0.01, min(0.99, p_true + random.uniform(-0.05, 0.05)))
        target = 1 if random.random() < p_true else 0
        
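        # Model probability (imperfect); optionally shifted to simulate score drift
        p_pred = p_true * 0.7 + random.uniform(0, 0.3)
        if shift:
            # Shift the model's scores in "current": +0.05 everywhere, plus another
            # +0.05 for segment "C". The features (age, income, segment) keep the
            # same distribution, so drift is expected mainly in `pred_proba`.
            p_pred = min(0.99, max(0.01, p_pred + 0.05))
            if segment == "C":
                p_pred = min(0.99, p_pred + 0.05)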
        
        pred_label = 1 if p_pred >= 0.5 else 0
        
        rows.append({
            "age": age,
            "income": base_income,
            "segment": segment,
            "target": target,
            "pred_proba": p_pred,
            "prediction": pred_label,
        })
    return pd.DataFrame(rows)

reference_df = make_split(n_rows=800, shift=False)
current_df   = make_split(n_rows=800, shift=True)

reference_df

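|     | age | income | segment | target | pred_proba | prediction |
|-----|-----|--------|---------|--------|------------|------------|
| 0   | 38  | 37000  | B       | 1      | 0.346330   | 0          |
| 1   | 41  | 92000  | A       | 1      | 0.404463   | 0          |
| 2   | 44  | 26000  | A       | 0      | 0.464406   | 0          |
| 3   | 25  | 46000  | C       | 0      | 0.392051   | 0          |
| 4   | 43  | 24000  | A       | 0      | 0.300144   | 0          |
| ... | ... | ...    | ...     | ...    | ...        | ...        |
| 795 | 57  | 83000  | B       | 0      | 0.519305   | 1          |
| 796 | 18  | 27000  | A       | 0      | 0.434655   | 0          |
| 797 | 57  | 96000  | C       | 0      | 0.512207   | 1          |
| 798 | 60  | 87000  | C       | 0      | 0.267657   | 0          |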


## 4) Define the schema with `DataDefinition` + `BinaryClassification`

In [4]:

# Map column roles so Evidently knows which columns hold the target, the predicted labels, and the scores.
data_def = DataDefinition(
    classification=[
        BinaryClassification(
            target="target",                 # ground-truth labels (0/1)
            prediction_labels="prediction",  # predicted class labels (0/1)
            prediction_probas="pred_proba",  # probability of the positive class (1)
            # Optional: set pos_label if your positive class is not 1.
            # pos_label=1,
        )
    ],
    # Feature types are left for Evidently to infer.
)

# Attach the schema to each dataset; in 0.7+ the definition travels with the Dataset.
dataset_ref = Dataset.from_pandas(reference_df, data_definition=data_def)
dataset_cur = Dataset.from_pandas(current_df, data_definition=data_def)

print("OK: DataDefinition + Dataset created")


OK: DataDefinition + Dataset created
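
The column roles above only help if the columns actually hold what the schema claims. As a hedged sketch (plain Python, independent of Evidently), the check below walks a list of row dicts and reports targets outside 0/1, scores outside [0, 1], and labels that disagree with the thresholded score. The function name `validate_binary_rows` and the sample rows are made up for illustration.

```python
def validate_binary_rows(rows, threshold=0.5):
    """Return a list of human-readable problems found in `rows` (empty if none)."""
    problems = []
    for i, row in enumerate(rows):
        if row["target"] not in (0, 1):
            problems.append(f"row {i}: target {row['target']!r} is not 0/1")
        if not 0.0 <= row["pred_proba"] <= 1.0:
            problems.append(f"row {i}: pred_proba {row['pred_proba']} outside [0, 1]")
        expected = 1 if row["pred_proba"] >= threshold else 0
        if row["prediction"] != expected:
            problems.append(
                f"row {i}: prediction {row['prediction']} != thresholded label {expected}"
            )
    return problems


# Illustrative rows only; they are not taken from the lab's generated data.
sample = [
    {"target": 1, "pred_proba": 0.62, "prediction": 1},
    {"target": 0, "pred_proba": 0.41, "prediction": 1},  # label disagrees with score
    {"target": 2, "pred_proba": 1.20, "prediction": 1},  # bad target and bad score
]
for problem in validate_binary_rows(sample):
    print(problem)
```

To run it on the lab data, pass `reference_df.to_dict("records")` or `current_df.to_dict("records")`.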


## 5) Quick **Data Quality** report

In [5]:

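dq_report = Report(metrics=[
    ColumnCount(),
    RowCount(),
    DuplicatedRowCount(),
    DatasetMissingValueCount(),
])

# In 0.7+, run() takes only the data: current first, then the optional reference.
# The schema is read from the Dataset objects, so there is no data_definition argument.
# run() returns the evaluation result; we assign it back to `dq_report` so that
# later cells can display and save it under the same name.
dq_report = dq_report.run(dataset_cur, reference_data=dataset_ref)
dq_report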

## 6) **Model Quality** metrics (current-only)

In [6]:

quality_report = Report(metrics=[
    Accuracy(),
    Precision(),
    Recall(),
    F1Score(),
    RocAuc(),
])

# Current-only run: pass just the current dataset. The result replaces the Report
# object under the same name so that later cells can display and save it.
quality_report = quality_report.run(dataset_cur)
quality_report
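
To make the reported numbers easier to sanity-check, here is a small sketch that computes the same five quantities by hand from their textbook definitions. It uses plain Python, the helper names are ours, and it is not a description of how Evidently computes them internally. ROC AUC is taken as the probability that a random positive outscores a random negative, with ties counting as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1; undefined ratios fall back to 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, chosen so the arithmetic is easy to follow by hand.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(classification_metrics(y_true, y_pred))
print("roc_auc:", roc_auc(y_true, scores))
```

On this toy sample there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so accuracy, precision, recall, and F1 all come out at 0.75, and 13 of the 16 positive/negative pairs are ranked correctly.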

## 7) Segment-level breakdown with `GroupBy` (by `segment`)

In [None]:

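# GroupBy wraps a single metric and splits it by the values of a column,
# so each metric gets its own wrapper. The (metric, column_name) argument
# order is assumed from the 0.7 API; adjust it if your version differs.
segmented_report = Report(metrics=[
    GroupBy(Accuracy(), "segment"),
    GroupBy(Precision(), "segment"),
    GroupBy(Recall(), "segment"),
    GroupBy(F1Score(), "segment"),
])

# The result replaces the Report object under the same name for later cells.
segmented_report = segmented_report.run(dataset_cur)
segmented_report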
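
For a quick cross-check of the per-segment numbers, the breakdown can also be done by hand. The sketch below is plain Python with a hypothetical helper name, `accuracy_by_segment`. It groups rows by segment and returns each segment's accuracy together with its row count, since a segment with few rows gives a noisy estimate.

```python
from collections import defaultdict


def accuracy_by_segment(rows, segment_key="segment"):
    """Map each segment value to (accuracy, row count)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        seg = row[segment_key]
        totals[seg] += 1
        hits[seg] += int(row["target"] == row["prediction"])
    return {seg: (hits[seg] / totals[seg], totals[seg]) for seg in totals}


# Illustrative rows only; they are not taken from the lab's generated data.
rows = [
    {"segment": "A", "target": 1, "prediction": 1},
    {"segment": "A", "target": 0, "prediction": 1},
    {"segment": "B", "target": 0, "prediction": 0},
    {"segment": "B", "target": 1, "prediction": 1},
    {"segment": "C", "target": 1, "prediction": 0},
]
for seg, (acc, n) in sorted(accuracy_by_segment(rows).items()):
    print(f"segment {seg}: accuracy={acc:.2f} over {n} rows")
```

With the lab data, `accuracy_by_segment(current_df.to_dict("records"))` gives the figures to compare against the report.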


## 8) Reference vs. Current: column drift & performance shift

In [None]:

compare_report = Report(metrics=[
    DriftedColumnsCount(),                 # count of columns detected as drifted
    ValueDrift(column="age"),
    ValueDrift(column="income"),
    ValueDrift(column="pred_proba"),       # score/probability drift
    Accuracy(),
    F1Score(),
])

# Current data first, reference second; the schema comes from the Dataset objects.
# The result replaces the Report object under the same name for later cells.
compare_report = compare_report.run(dataset_cur, reference_data=dataset_ref)
compare_report
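
To build intuition for what a drift check on a numeric column measures, here is a hedged sketch of one common statistic, the two-sample Kolmogorov-Smirnov distance. It is the largest gap between the empirical CDFs of reference and current. Evidently selects its own drift method per column, so this illustrates the idea and says nothing about what `ValueDrift` runs. The function name and sample values are ours.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


# Illustrative ages: the second sample is the first one shifted up by 20 years.
reference_ages = [20, 30, 40, 50, 60]
current_ages = [40, 50, 60, 70, 80]

print("same sample:", ks_statistic(reference_ages, reference_ages))
print("shifted sample:", ks_statistic(reference_ages, current_ages))
```

The statistic alone is not a drift verdict. A test turns it into a p-value that also depends on the sample sizes.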

## 9) Extract numbers programmatically (JSON)

In [None]:

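# In 0.7+, the run result exposes its contents as a plain dict via .dict()
# (the legacy as_dict() method no longer exists).
as_dict = compare_report.dict()

# Collect the values of every metric whose identifier starts with `metric_name`.
# Assumed layout: a top-level "metrics" list whose items carry the identifier
# under "metric_name" or "metric_id" (e.g. "Accuracy()") and the number under
# "value". Print as_dict once to confirm the layout for your version.
def find_metric(result_dict, metric_name):
    hits = []
    for item in result_dict.get("metrics", []):
        identifier = str(item.get("metric_name") or item.get("metric_id") or "")
        if identifier.startswith(metric_name) and "value" in item:
            hits.append(item["value"])
    return hits

acc_values = find_metric(as_dict, "Accuracy")
f1_values  = find_metric(as_dict, "F1Score")

print("Extracted Accuracy values:", acc_values[:1])
print("Extracted F1Score values:", f1_values[:1])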
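
When the exact layout of an exported dict is unknown, it helps to list every leaf with its full key path before writing a targeted lookup. The sketch below does that for any nested dict or list. The `payload` is a made-up stand-in, not real Evidently output, and `iter_leaves` is a hypothetical helper name.

```python
def iter_leaves(node, path=()):
    """Yield (path, value) for every non-container value in a nested dict/list."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, path + (index,))
    else:
        yield path, node


# Hypothetical payload for illustration only; real exports may be shaped differently.
payload = {
    "metrics": [
        {"name": "Accuracy", "value": 0.61},
        {"name": "F1Score", "value": 0.48},
    ],
    "meta": {"source": "demo"},
}

for path, value in iter_leaves(payload):
    print(path, "->", value)
```

Once the path to the number of interest is visible, a direct lookup along that path is simpler and safer than a generic search.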

## 10) Save reports to HTML and JSON

In [None]:

import os

out_dir = "/mnt/data/evidently_lab_outputs"
os.makedirs(out_dir, exist_ok=True)

# HTML (each name below holds the result returned by Report.run())
dq_report.save_html(os.path.join(out_dir, "01_data_quality.html"))
quality_report.save_html(os.path.join(out_dir, "02_model_quality_current.html"))
segmented_report.save_html(os.path.join(out_dir, "03_segmented_quality.html"))
compare_report.save_html(os.path.join(out_dir, "04_compare_ref_vs_cur.html"))

# JSON (machine-readable); .json() returns the result serialized as a string
with open(os.path.join(out_dir, "04_compare_ref_vs_cur.json"), "w", encoding="utf-8") as f:
    f.write(compare_report.json())

print("Saved to:", out_dir)


## 11) Exercises / Variations

1. **Change the threshold:** Replace `prediction` with labels derived from a custom threshold (e.g. 0.35, 0.7) and re-run quality metrics.
2. **Add more features:** Insert additional numeric/categorical columns and add `ValueDrift` for them.
3. **Imbalance scenario:** Modify the data generator to reduce positives to ~10% and see how Precision/Recall/F1 change.
4. **Per-segment deep dive:** Add more metrics inside `GroupBy`, e.g., `RocAuc()`.
5. **Multiclass extension:** Swap to `MulticlassClassification` (not covered here) and adapt metrics accordingly.
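
As a starting point for exercise 1, here is a minimal sketch of a threshold sweep in plain Python. It re-derives the labels from the scores at each cutoff and reports precision and recall, which makes the trade-off visible before anything is re-run through Evidently. The helper name `precision_recall_at` and the toy data are ours.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are labelled positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data for illustration: 4 positives, 6 negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1, 0.36, 0.5]

for threshold in (0.35, 0.5, 0.7):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this sample, raising the cutoff from 0.35 to 0.7 lifts precision from 3/7 to 2/3 while recall drops from 0.75 to 0.5. In the lab itself, the equivalent step is to overwrite the `prediction` column, for example `current_df["prediction"] = (current_df["pred_proba"] >= 0.35).astype(int)`, then rebuild the `Dataset` and re-run the quality report.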
