# Lab 08 — Why Schema Validation in ML/LLM Pipelines

**Focus Area:** Why schema validation — catching upstream drift; protecting training/eval

> This lab is the *why* and *show‑me* for validation. You'll simulate upstream changes (types, ranges, unexpected categories) and see how a light schema gate prevents bad data from reaching LLM‑adjacent stages.

---

## Outcomes

By the end of this lab, you will be able to:

1. Explain the difference between **structural drift** (columns/types) and **semantic drift** (values/ranges/categories), and why each harms LLM workflows.
2. Add a **pre‑flight validation gate** that fails fast with actionable messages.
3. Use a **minimal Pandera schema** (or Pydantic model per row) to enforce types, ranges, and categorical sets.
4. Capture **human‑readable failure reports** for debugging, CI, and incident triage.

## Prerequisites & Setup

- Python 3.13 with `pandas`, `numpy`, `pandera`, `pydantic`, `pyarrow` installed.  
- JupyterLab or VS Code with Jupyter extension.
- Artifacts from previous labs (optional but recommended): `artifacts/clean/per_customer.parquet` or `users_clean.parquet`  

**Start a notebook:** `week02_lab08.ipynb`

If you don't have prior artifacts, synthesize a small frame now:

In [2]:
import numpy as np, pandas as pd
rng = np.random.default_rng(42)
users2 = pd.DataFrame({
    'CustomerID': [f'C{i:05d}' for i in range(300)],
    'country_norm': rng.choice(['USA','DE','SG','BR'], size=300, p=[.55,.2,.15,.1]),
    'age': rng.integers(16, 80, size=300).astype('int64'),
    'ltv_usd': np.round(np.clip(rng.lognormal(3.0, 0.7, size=300), 0, 5e4), 2),
    'is_adult': (rng.integers(16, 80, size=300) >= 18),
    'is_high_value': rng.random(300) > 0.85,
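})
# Derive is_adult from age; the independent random draw above can contradict the age column
users2['is_adult'] = users2['age'] >= 18
users2.head()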

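|   | CustomerID | country_norm | age | ltv_usd | is_adult | is_high_value |
|---|---|---|---|---|---|---|
| 0 | C00000 | SG | 55 | 28.0 | True | True |
| 1 | C00001 | USA | 65 | 18.32 | True | True |
| 2 | C00002 | SG | 63 | 5.58 | True | True |
| 3 | C00003 | DE | 24 | 38.47 | True | True |
| 4 | C00004 | USA | 68 | 13.15 | True | True |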


## Part A — What can go wrong, concretely?

In LLM/ML pipelines, silent data drift can:

- **Break transforms** (e.g., after a type flip from string→int, `to_datetime` either fails or silently reads the integers as epoch offsets).
- **Bias metrics** (e.g., new country labels split a cohort: `U.S.A.` appears again).
- **Explode tokens/costs** (e.g., unexpectedly long text fields; numeric → string inflation).
- **Poison eval/train** (e.g., negative prices; out‑of‑range ages; missing required keys).
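
To make the "bias metrics" case concrete, here is a minimal sketch using a made-up five-row frame (the `orders` name and its values are illustrative, not lab data). It shows that a drifted label raises no error at all; it just produces an extra cohort.

```python
import pandas as pd

# Hypothetical mini-sample: the same country arrives under two labels
orders = pd.DataFrame({
    "country_norm": ["USA", "USA", "U.S.A.", "DE", "U.S.A."],
    "ltv_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Nothing errors: the drifted label simply becomes its own cohort
per_country = orders.groupby("country_norm")["ltv_usd"].agg(["count", "mean"])
print(per_country)

# The real USA cohort has 4 rows, but the report splits it into 2 + 2
usa_rows = int(per_country.loc["USA", "count"])
drifted_rows = int(per_country.loc["U.S.A.", "count"])
```

Any metric computed per country (and any prompt that embeds it) would now describe two half-sized cohorts with different means.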

**Exercise:** Create 3 synthetic drifts.

In [3]:
broken = users2.copy()
# 1) Structural drift: age becomes string for some rows.
#    Cast to object first; writing strings straight into an int64 column
#    triggers pandas' incompatible-dtype warning.
broken['age'] = broken['age'].astype(object)
broken.loc[broken.index[:20], 'age'] = broken.loc[broken.index[:20], 'age'].astype(str)
# 2) Semantic drift: country label out of policy
broken.loc[10:15, 'country_norm'] = ['U.S.A.','United States','usa','US','USA','USA']
# 3) Range drift: negative ltv sneaks in
broken.loc[50:55, 'ltv_usd'] = [-10, -5, -1, 0, 1, 2]
broken.head()


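|   | CustomerID | country_norm | age | ltv_usd | is_adult | is_high_value |
|---|---|---|---|---|---|---|
| 0 | C00000 | SG | 55 | 28.0 | True | True |
| 1 | C00001 | USA | 65 | 18.32 | True | True |
| 2 | C00002 | SG | 63 | 5.58 | True | True |
| 3 | C00003 | DE | 24 | 38.47 | True | True |
| 4 | C00004 | USA | 68 | 13.15 | True | True |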
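
The `head()` of the broken frame looks identical to the clean one because a string `'55'` prints just like the integer `55`. One way to make the mix visible is to count the Python types stored in the column. This is a small sketch on a hypothetical Series standing in for `broken['age']`.

```python
import pandas as pd

# Hypothetical column in which upstream started sending some ages as strings
age = pd.Series([55, "65", 63, "24", 68], dtype=object)

# Printing hides the difference; counting Python types exposes it
type_counts = age.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

Running the same `map(...).value_counts()` on `broken['age']` should show both `int` and `str` entries.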


## Part B — Minimal Pandera schema as a gate

We'll define a small DataFrame schema to catch the above.

In [13]:
%pip install pandera

import pandera.pandas as pa
from pandera.pandas import Column, Check

Schema = pa.DataFrameSchema({
    'CustomerID': Column(object, nullable=False),
    'country_norm': Column(object, Check.isin(['USA','DE','SG','BR']), nullable=False),
    'age': Column(pa.Int64, Check.in_range(0, 120), nullable=False),
    'ltv_usd': Column(float, Check.ge(0), nullable=False),
    'is_adult': Column(bool, nullable=False),
    'is_high_value': Column(bool, nullable=False),
})



### B1. Validate clean vs broken

In [21]:
# Clean should pass
ok = Schema.validate(users2, lazy=True)
print('clean rows:', len(ok))

# Broken should fail with a report
try:
    Schema.validate(broken, lazy=True)
except pa.errors.SchemaErrors as err:
    report = err.failure_cases
    print(report.head(10))

clean rows: 300
  schema_context        column                            check  check_number  \
0         Column  country_norm  isin(['USA', 'DE', 'SG', 'BR'])           0.0   
1         Column  country_norm  isin(['USA', 'DE', 'SG', 'BR'])           0.0   
2         Column  country_norm  isin(['USA', 'DE', 'SG', 'BR'])           0.0   
3         Column  country_norm  isin(['USA', 'DE', 'SG', 'BR'])           0.0   
4         Column       ltv_usd      greater_than_or_equal_to(0)           0.0   
5         Column       ltv_usd      greater_than_or_equal_to(0)           0.0   
6         Column       ltv_usd      greater_than_or_equal_to(0)           0.0   
7         Column           age                   dtype('int64')           NaN   
8         Column           age                 in_range(0, 120)           0.0   

                                        failure_case index  
0                                             U.S.A.    10  
1                                      United State

**Checkpoint:** Inspect `report` to see: wrong dtype (`age`), out‑of‑set categories (`country_norm`), and negative values (`ltv_usd`).

### B2. Actionable messages for CI / logs

In [22]:
# Summarize by column + failure type
summary = (report
           .groupby(['column', 'check'])
           .size()
           .reset_index(name='failures')
           .sort_values('failures', ascending=False))
summary

|   | column | check | failures |
|---|---|---|---|
| 2 | country_norm | `isin(['USA', 'DE', 'SG', 'BR'])` | 4 |
| 3 | ltv_usd | `greater_than_or_equal_to(0)` | 3 |
| 1 | age | `in_range(0, 120)` | 1 |
| 0 | age | `dtype('int64')` | 1 |


> **Interpretation:** This summary is what you'd attach to a CI artifact or Slack alert.
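
A summary frame is easier to post in a log line or chat alert once it is rendered as plain text. The helper below is a sketch only: `format_alert` is a hypothetical name, and the inline `summary` frame mimics the columns produced in B2 rather than reusing the lab's objects.

```python
import pandas as pd

# Hypothetical summary with the same columns as the B2 roll-up
summary = pd.DataFrame({
    "column": ["country_norm", "ltv_usd", "age"],
    "check": ["isin(['USA', 'DE', 'SG', 'BR'])", "greater_than_or_equal_to(0)", "dtype('int64')"],
    "failures": [4, 3, 1],
})

def format_alert(summary: pd.DataFrame, dataset: str, max_lines: int = 5) -> str:
    """Render the failure summary as a short plain-text message."""
    total = int(summary["failures"].sum())
    lines = [f"Schema validation failed for {dataset}: {total} failure cases"]
    top = summary.sort_values("failures", ascending=False).head(max_lines)
    for row in top.itertuples(index=False):
        lines.append(f"- {row.column}: {row.check} ({row.failures} cases)")
    return "\n".join(lines)

message = format_alert(summary, dataset="users2_broken")
print(message)
```

Capping the message at a few lines keeps alerts readable; the full failure table can still be attached as a file.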

## Part C — Row‑level validation with Pydantic (optional)

Use Pydantic models when you're validating **per‑row payloads** (e.g., API messages) or writing contracts across services.

In [25]:
%pip install pydantic

from pydantic import BaseModel, Field, ValidationError
from typing import Literal

class CustomerRow(BaseModel):
    CustomerID: str
    country_norm: Literal['USA','DE','SG','BR']
    age: int = Field(ge=0, le=120)
    ltv_usd: float = Field(ge=0)
    is_adult: bool
    is_high_value: bool

row = users2.iloc[0].to_dict()
CustomerRow(**row)



CustomerRow(CustomerID='C00000', country_norm='SG', age=55, ltv_usd=28.0, is_adult=True, is_high_value=True)

In [26]:
try:
    CustomerRow(**broken.iloc[12].to_dict())
except ValidationError as e:
    print(e)

1 validation error for CustomerRow
country_norm
  Input should be 'USA', 'DE', 'SG' or 'BR' [type=literal_error, input_value='usa', input_type=str]
    For further information visit https://errors.pydantic.dev/2.12/v/literal_error


**When to prefer Pydantic:** API boundaries, message queues, microservices. **When to prefer Pandera:** bulk DataFrame validation in ETL/ELT.
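
Note that the row-12 error report lists only `country_norm`, even though the `age` values in the first 20 rows of `broken` are strings. A likely explanation is that Pydantic's default (lax) mode coerces numeric strings to `int`. The sketch below, using two hypothetical models `LaxRow` and `StrictRow`, shows how `ConfigDict(strict=True)` turns that coercion off so the structural drift is reported as well.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field, ValidationError

class LaxRow(BaseModel):
    country_norm: Literal["USA", "DE", "SG", "BR"]
    age: int = Field(ge=0, le=120)

class StrictRow(LaxRow):
    # strict=True disables type coercion, so "31" is no longer accepted as an int
    model_config = ConfigDict(strict=True)

payload = {"country_norm": "USA", "age": "31"}  # hypothetical drifted row

lax = LaxRow(**payload)  # lax mode converts the string to an int
print(type(lax.age).__name__)

try:
    StrictRow(**payload)
    strict_rejected = False
    error_fields = []
except ValidationError as exc:
    strict_rejected = True
    error_fields = [e["loc"] for e in exc.errors()]
print(strict_rejected, error_fields)
```

Lax mode is convenient at messy boundaries; strict mode is the closer match to what the Pandera dtype check enforces.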

## Part D — Pre‑flight gate function + fail‑fast

Wrap the schema check in a reusable function that raises a concise, friendly error and writes a CSV report for triage.

In [27]:
from pathlib import Path

def validate_or_raise(df: pd.DataFrame, schema: pa.DataFrameSchema, name: str, out_dir: str = 'artifacts/validation') -> pd.DataFrame:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    try:
        return schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as err:
        rep = err.failure_cases
        dest = Path(out_dir) / f'{name}_schema_failures.csv'
        rep.to_csv(dest, index=False)
        # compact message for logs/CI
        top = (rep.groupby(['column','check']).size().reset_index(name='n')
                 .sort_values('n', ascending=False).head(5).to_dict(orient='records'))
        # chain the original exception so the full Pandera traceback stays available
        raise RuntimeError(f"Validation failed for {name}. Top issues: {top}. See {dest}") from err

In [28]:
# Example usage
_ = validate_or_raise(users2, Schema, name='users2_clean')
try:
    _ = validate_or_raise(broken, Schema, name='users2_broken')
except RuntimeError as e:
    print('\nGATE BLOCKED ->', e)


GATE BLOCKED -> Validation failed for users2_broken. Top issues: [{'column': 'country_norm', 'check': "isin(['USA', 'DE', 'SG', 'BR'])", 'n': 4}, {'column': 'ltv_usd', 'check': 'greater_than_or_equal_to(0)', 'n': 3}, {'column': 'age', 'check': 'in_range(0, 120)', 'n': 1}, {'column': 'age', 'check': "dtype('int64')", 'n': 1}]. See artifacts/validation/users2_broken_schema_failures.csv
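
In a CI job, the gate usually needs to end the process with a non-zero exit code rather than print and continue. The following sketch assumes a gate that raises `RuntimeError` like `validate_or_raise`; `run_gate` and `demo_check` are hypothetical names, and `demo_check` is a simple stand-in so the example runs without the lab schema.

```python
import sys
from typing import Callable

import pandas as pd

def run_gate(check: Callable[[pd.DataFrame], pd.DataFrame], df: pd.DataFrame) -> int:
    """Return a process exit code: 0 if the gate passes, 1 if it blocks."""
    try:
        check(df)
    except RuntimeError as exc:
        print(f"GATE BLOCKED -> {exc}", file=sys.stderr)
        return 1
    return 0

def demo_check(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for validate_or_raise(df, Schema, name=...)
    if (df["ltv_usd"] < 0).any():
        raise RuntimeError("Validation failed for demo: negative ltv_usd")
    return df

clean_code = run_gate(demo_check, pd.DataFrame({"ltv_usd": [1.0, 2.5]}))
broken_code = run_gate(demo_check, pd.DataFrame({"ltv_usd": [1.0, -5.0]}))
print(clean_code, broken_code)
# In a script you would finish with: sys.exit(run_gate(...))
```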


## Part E — Wrap‑Up

Add a markdown cell and answer:

1. Name one structural and one semantic drift you simulated. How would each impact an LLM component downstream?  
2. Paste the top 3 failure types from your summary and propose a remediation (fix in source vs transform rule).  
3. Where would you place this validation gate in your Day‑1/Day‑2 pipeline, and why?

### Your answers here:

**1. Structural and Semantic Drift:**

- **Structural drift:** The `age` column changed from `int64` to a mix of integers and strings for some rows. This would break downstream operations like numerical aggregations, comparisons, or ML model features that expect numeric input.
- **Semantic drift:** Country labels changed to non-standard values ('U.S.A.', 'United States', 'usa', 'US'). This would cause category splits in LLM prompts, create duplicate embeddings for the same entity, and inflate token usage unnecessarily.

**2. Top 3 Failure Types:**

*(Fill in based on your actual output)*

Example:
- `age` dtype mismatch: Fix at source by enforcing integer type constraints in upstream system
- `country_norm` out of allowed set: Add transform rule to normalize variants before validation
- `ltv_usd` negative values: Fix at source by adding database constraint or API validation
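
A "transform rule" for the country variants could look like the sketch below. The alias map is an assumption made for illustration (a real one should come from your data policy), and `normalize_country` is a hypothetical helper name. Unknown labels are deliberately left untouched so the validation gate still catches them.

```python
import pandas as pd

# Assumed alias map for illustration only
COUNTRY_ALIASES = {"U.S.A.": "USA", "United States": "USA", "usa": "USA", "US": "USA"}
ALLOWED = {"USA", "DE", "SG", "BR"}

def normalize_country(values: pd.Series) -> pd.Series:
    """Strip whitespace, then map known aliases; unknown labels pass through unchanged."""
    cleaned = values.str.strip()
    return cleaned.map(lambda v: COUNTRY_ALIASES.get(v, v))

raw = pd.Series(["U.S.A.", "United States", "usa", "US", "DE", "Brasil"])
normalized = normalize_country(raw)
still_invalid = sorted(set(normalized) - ALLOWED)
print(normalized.tolist())
print(still_invalid)
```

Running normalization before validation keeps the schema strict while absorbing variants you have explicitly decided to accept.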

**3. Validation Gate Placement:**

I would place this validation gate:
- **Immediately after data ingestion** (Day-1) to fail fast before any transformation or enrichment
- **Before feature engineering** (Day-2) to protect ML/LLM components from corrupted inputs
- **As a CI/CD check** before promoting data to production storage

Why: Early detection minimizes wasted processing, prevents cascading errors, and provides clear diagnostic information at the point of failure.

---

**Common pitfalls:** Using `object` dtypes everywhere; not distinguishing `structural` vs `semantic` drift; over‑fitting schemas (too strict for expected evolution).

## Solution Snippets (reference)

**Quick failure roll‑up:**

In [29]:
summary = (report.groupby(['column','check'])
           .size().reset_index(name='failures')
           .sort_values('failures', ascending=False))
summary.head()

|   | column | check | failures |
|---|---|---|---|
| 2 | country_norm | `isin(['USA', 'DE', 'SG', 'BR'])` | 4 |
| 3 | ltv_usd | `greater_than_or_equal_to(0)` | 3 |
| 1 | age | `in_range(0, 120)` | 1 |
| 0 | age | `dtype('int64')` | 1 |


**CI‑style assert:**

In [30]:
# validate() raises on failure, so reaching the assert already means the frame passed
assert Schema.validate(users2, lazy=True) is not None

**Lightweight allow‑list for categories:**

In [31]:
allowed = {'USA','DE','SG','BR'}
viol = set(broken['country_norm']) - allowed
viol

{'U.S.A.', 'US', 'United States', 'usa'}
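
The set difference tells you *which* labels are out of policy but not how many rows they affect. A small extension, sketched here on a hypothetical `labels` Series standing in for `broken['country_norm']`, counts rows per violating label and computes the overall violation rate.

```python
import pandas as pd

allowed = {"USA", "DE", "SG", "BR"}
# Hypothetical labels standing in for broken['country_norm']
labels = pd.Series(["USA", "U.S.A.", "usa", "US", "USA", "DE", "usa"])

# Rows whose label is outside the allow-list
mask = ~labels.isin(list(allowed))
violation_counts = labels[mask].value_counts()
violation_rate = float(mask.mean())
print(violation_counts)
print(f"{violation_rate:.1%} of rows out of policy")
```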