# Lab 11 — Data Quality Dimensions & Thresholded Alerts
 
**Focus Area:** Quality dimensions — **Completeness, Validity, Consistency, Timeliness** with thresholds & alerts

> In this lab you'll operationalize data‑quality checks across four core dimensions and wire lightweight **alerts** that block or warn before downstream LLM processing. You'll reuse artifacts from earlier labs and produce a small, reusable **DQ report**.

## Outcomes

By the end of this lab, you will be able to:

1. Define and compute metrics for **completeness**, **validity**, **consistency**, and **timeliness**.
2. Set **thresholds** (warn vs fail) and emit a compact machine‑readable **DQ report**.
3. Detect **consistency** issues (e.g., country naming) using a reference dimension.
4. Integrate DQ checks with earlier **Pandera** schema validation and **profiling** outputs.

## Prerequisites & Setup

- Python 3.13 with `pandas`, `numpy`, `pandera` (optional), `pyarrow`  
- Artifacts (preferred): from previous lab — `users_clean.parquet`, `per_customer.parquet`
- JupyterLab or VS Code with Jupyter extension.

### Loading Artifacts (Preferred Method)

If you have completed earlier labs, load the existing artifacts:

In [1]:
import pandas as pd
from pathlib import Path

# Load users data from clean artifacts
users_path = Path('artifacts/clean/users_clean.parquet')
if users_path.exists():
    users2 = pd.read_parquet(users_path)
    print(f"✓ Loaded {len(users2)} users from {users_path}")
else:
    print(f"✗ {users_path} not found. Use synthetic data fallback below.")

# Load per-customer aggregated data
per_customer_path = Path('artifacts/clean/per_customer.parquet')
if per_customer_path.exists():
    per_customer = pd.read_parquet(per_customer_path)
    print(f"✓ Loaded {len(per_customer)} customer records from {per_customer_path}")
else:
    print(f"✗ {per_customer_path} not found. Use synthetic data fallback below.")

# Examine the structure
print("\nUsers columns:", users2.columns.tolist() if 'users2' in locals() else 'N/A')
print("Per-customer columns:", per_customer.columns.tolist() if 'per_customer' in locals() else 'N/A')

# country reference (same idea as 2E)
country_dim = pd.DataFrame({
    'raw': ['USA','U.S.A.','United States','US','usa','U. S. A.','BR','Brasil','DE','Germany','SG','Singapore','N/A'],
    'canonical': ['USA','USA','USA','USA','USA','USA','BR','BR','DE','DE','SG','SG','UNKNOWN']
})
Path('artifacts/reports').mkdir(parents=True, exist_ok=True)

✓ Loaded 1471 users from artifacts/clean/users_clean.parquet
✓ Loaded 4 customer records from artifacts/clean/per_customer.parquet

Users columns: ['user_id', 'email', 'age', 'country', 'signup_date', 'spend', 'is_marketing_opt_in', 'country_norm', 'spend_usd', 'signup_dt']
Per-customer columns: ['CustomerID', 'n_orders', 'freight_mean', 'freight_sum', 'CompanyName', 'Country', 'spend_segment']


### Synthetic Data Fallback

If artifacts are missing, synthesize data:

In [None]:
import numpy as np, pandas as pd
from pathlib import Path
rng = np.random.default_rng(11)
N = 5000
users2 = pd.DataFrame({
    'CustomerID': [f'C{i:05d}' for i in range(N)],
    'email': [f'user{i}@example.com' if rng.random()>.01 else None for i in range(N)],
    'age': rng.integers(16, 80, size=N).astype('Int64'),
    'signup_dt': pd.to_datetime('2025-01-01') + pd.to_timedelta(rng.integers(0, 40, size=N), unit='D'),
    'country': rng.choice(['US','usa','United States','DE','SG','BR','N/A'], size=N, p=[.35,.05,.08,.2,.2,.1,.02]),
    'ltv_usd': np.round(np.clip(rng.lognormal(3.1, .7, size=N), 0, 1e5), 2),
})
orders = pd.DataFrame({
    'OrderID': np.arange(10_000, 10_000+N*2),
    'CustomerID': rng.choice(users2['CustomerID'], size=N*2),
    'OrderDate': pd.to_datetime('2025-02-01') + pd.to_timedelta(rng.integers(0, 10, size=N*2), unit='D'),
    'Freight': np.round(np.clip(rng.lognormal(3.0, 0.7, size=N*2), 0, 2e4), 2)
})

# country reference (same idea as 2E)
country_dim = pd.DataFrame({
    'raw': ['USA','U.S.A.','United States','US','usa','U. S. A.','BR','Brasil','DE','Germany','SG','Singapore','N/A'],
    'canonical': ['USA','USA','USA','USA','USA','USA','BR','BR','DE','DE','SG','SG','UNKNOWN']
})
Path('artifacts/reports').mkdir(parents=True, exist_ok=True)

## Part A — Define Quality Dimensions & Base Metrics

**Note:** The cells below use the column names of the real `users_clean.parquet` artifact (`user_id`, `signup_date`, `spend_usd`). If you are on the synthetic fallback, substitute the equivalent columns (`CustomerID`, `signup_dt`, `ltv_usd`) or rename them first, and check `users2.columns.tolist()` whenever you are unsure.
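For the synthetic fallback, a rename map is one way to align the columns with the cells that follow. This is a minimal sketch: `SYNTHETIC_TO_ARTIFACT` and `synthetic_users` are hypothetical names, and the two-row frame only stands in for the synthetic `users2`.

```python
import pandas as pd

# Hypothetical rename map: synthetic fallback names -> names used in the cells below
SYNTHETIC_TO_ARTIFACT = {
    'CustomerID': 'user_id',
    'signup_dt': 'signup_date',
    'ltv_usd': 'spend_usd',
}

# Tiny stand-in for the synthetic users2 frame (same column names, two rows)
synthetic_users = pd.DataFrame({
    'CustomerID': ['C00000', 'C00001'],
    'email': ['user0@example.com', None],
    'age': [34, 51],
    'signup_dt': pd.to_datetime(['2025-01-03', '2025-01-20']),
    'country': ['US', 'DE'],
    'ltv_usd': [22.5, 40.0],
})

aligned = synthetic_users.rename(columns=SYNTHETIC_TO_ARTIFACT)
print(aligned.columns.tolist())
```

The Pandera schema in C1 declares `user_id` as an integer, so the synthetic string IDs would still need adjusting before that gate is used.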

### A1. Completeness (null rate on required columns)

In [6]:
print(users2.columns.tolist())

['user_id', 'email', 'age', 'country', 'signup_date', 'spend', 'is_marketing_opt_in', 'country_norm', 'spend_usd', 'signup_dt']


In [7]:
# Adjust column names to match your actual dataset
# For synthetic data: ['CustomerID','email','signup_dt']
# For real artifacts, check: users2.columns.tolist()
required_cols = ['user_id','email','signup_date', 'spend_usd']  # Update as needed
null_rates = users2[required_cols].isna().mean().to_dict()
null_rates

{'user_id': 0.0,
 'email': 0.0,
 'signup_date': 0.04078857919782461,
 'spend_usd': 0.0}
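Because column names differ between the real and synthetic data, `users2[required_cols]` raises a `KeyError` as soon as one required column is absent. A possible variant, sketched here with a hypothetical helper `null_rates_for` and toy data, treats a missing column as 100% null so the completeness check still produces a number:

```python
import pandas as pd

def null_rates_for(df: pd.DataFrame, required: list[str]) -> dict[str, float]:
    """Null rate per required column; a column absent from df counts as fully null."""
    # reindex adds any missing column filled with NaN instead of raising KeyError
    return df.reindex(columns=required).isna().mean().to_dict()

demo = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'email': ['a@example.com', None, 'c@example.com', 'd@example.com'],
})
rates = null_rates_for(demo, ['user_id', 'email', 'signup_date'])
print(rates)
```

Whether a missing column should report a rate of 1.0 or stop the run is a policy choice; the sketch picks the former so the issue surfaces as an ordinary alert.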

### A2. Validity (type/range rules)

In [8]:
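# Null values are counted as valid here; missing data is already covered by completeness in A1
valid_age = users2['age'].between(0, 120) | users2['age'].isna()
# The metric keeps the "ltv" name used in the thresholds, but it is computed on spend_usd
valid_ltv = users2['spend_usd'].ge(0) | users2['spend_usd'].isna()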
validity = {
    'age_in_range_rate': float(valid_age.mean()),
    'ltv_nonnegative_rate': float(valid_ltv.mean())
}
validity

{'age_in_range_rate': 1.0, 'ltv_nonnegative_rate': 1.0}

### A3. Consistency (country naming via reference)

In [9]:
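# Normalize, then left join to reference mapping.
# Caveat: raw_key is not unique after normalization ('USA', 'U.S.A.', 'usa' and 'U. S. A.'
# all become 'USA'), so this join can return more rows than users2 and skew the rate.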
norm = (users2['country'].astype('string')
         .str.replace('.','', regex=False)
         .str.replace(' ','', regex=False)
         .str.upper())
ref = country_dim.assign(raw_key = country_dim['raw'].str.replace('.','', regex=False).str.replace(' ','', regex=False).str.upper())
map_df = pd.DataFrame({'country_key': norm})
map_df = map_df.merge(ref[['raw_key','canonical']], left_on='country_key', right_on='raw_key', how='left')
consistency_rate = float(map_df['canonical'].notna().mean())
consistency_rate

0.88841882601798
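One caveat about the join: once dots and spaces are stripped, several reference spellings collapse to the same key, so a left join on that key can emit more than one row per user and shift the rate. The sketch below uses toy data (not the lab's) and a hypothetical helper `norm_key` to show the effect and one possible remedy, de-duplicating the reference and asking `merge` to verify the relationship.

```python
import pandas as pd

def norm_key(s: pd.Series) -> pd.Series:
    # Same normalization as the cell above: drop dots and spaces, upper-case
    return (s.astype('string')
             .str.replace('.', '', regex=False)
             .str.replace(' ', '', regex=False)
             .str.upper())

country_dim = pd.DataFrame({
    'raw': ['USA', 'U.S.A.', 'usa', 'DE'],
    'canonical': ['USA', 'USA', 'USA', 'DE'],
})
countries = pd.Series(['usa', 'DE', 'France', 'Peru'])  # 2 of 4 are mappable

ref = country_dim.assign(raw_key=norm_key(country_dim['raw']))
keys = pd.DataFrame({'country_key': norm_key(countries)})

# Joining on the non-unique key repeats the 'USA' row three times
dup = keys.merge(ref[['raw_key', 'canonical']],
                 left_on='country_key', right_on='raw_key', how='left')
inflated_rate = float(dup['canonical'].notna().mean())

# De-duplicating the reference keeps exactly one output row per input row;
# validate='many_to_one' raises if the right-hand keys are not unique
ref_unique = ref.drop_duplicates('raw_key')
ok = keys.merge(ref_unique[['raw_key', 'canonical']],
                left_on='country_key', right_on='raw_key',
                how='left', validate='many_to_one')
true_rate = float(ok['canonical'].notna().mean())

print(len(dup), inflated_rate)
print(len(ok), true_rate)
```

A quick sanity check on real data is to compare `len(map_df)` with `len(users2)`; they should match if the reference keys are unique.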

### A4. Timeliness (freshness lag)

In [14]:
import pandas as pd
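# Fixed for reproducibility. For a live run, pd.Timestamp.now(tz='UTC').tz_localize(None)
# gives the current UTC time as a tz-naive value, which is what the subtraction from a
# tz-naive signup_date below needs (a tz-aware "now" would raise a TypeError).
now = pd.Timestamp('2025-02-15')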

# Convert signup_date to datetime if it's stored as string
users2['signup_date'] = pd.to_datetime(users2['signup_date'], format='mixed')

lag_days = (now - users2['signup_date']).dt.days
fresh_rate = float((lag_days <= 30).mean())  # share of rows (0-1) whose signup_date falls within the 30-day SLA window
fresh_stats = {'lag_p50': int(lag_days.median()), 'lag_p95': int(lag_days.quantile(0.95))}
{'fresh_rate': fresh_rate, **fresh_stats}

{'fresh_rate': 0.2760027192386132, 'lag_p50': 40, 'lag_p95': 41}
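Note that rows with a missing timestamp produce a `NaN` lag, and `NaN <= 30` evaluates to `False`, so they count as stale in `fresh_rate`. If you prefer to report freshness and missingness separately, a sketch along these lines may help (toy values; `fresh_all`, `fresh_known`, and `missing_rate` are illustrative names):

```python
import pandas as pd

now = pd.Timestamp('2025-02-15')
signup = pd.Series(pd.to_datetime(['2025-02-10', '2025-01-01', None, '2025-02-01']))

lag_days = (now - signup).dt.days            # NaT becomes NaN
fresh_all = float((lag_days <= 30).mean())   # missing timestamps count as stale
known = lag_days.dropna()
fresh_known = float((known <= 30).mean())    # freshness among rows that have a timestamp
missing_rate = float(signup.isna().mean())   # reported separately (a completeness signal)

print({'fresh_all': fresh_all, 'fresh_known': fresh_known, 'missing_rate': missing_rate})
```

Either convention is defensible; what matters is that the report states which one the threshold refers to.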

**Checkpoint:** In your words, distinguish validity vs consistency for `country`.

*Answer:* Validity checks whether each country value is acceptable on its own terms, i.e., it belongs to the expected set of values or format (a recognizable country name or code rather than free text or a placeholder). Consistency checks whether the same country is represented uniformly, by mapping the various spellings (e.g., 'US', 'usa', 'United States') to a single canonical form (e.g., 'USA'). Whether the field is populated at all is a completeness question, not a validity one.

## Part B — Thresholds: Warn vs Fail & Compact Alert Object

### B1. Define thresholds

In [15]:
thresholds = {
    'completeness': { 'email_null_rate_max': 0.02, 'signup_dt_null_rate_max': 0.00 },
    'validity':     { 'age_in_range_min': 0.995,  'ltv_nonnegative_min': 1.00 },
    'consistency':  { 'country_mapped_min': 0.98 },
    'timeliness':   { 'fresh_rate_min': 0.90, 'lag_p95_max': 40 }
}

### B2. Evaluate metrics against thresholds

In [19]:
def evaluate_dq(null_rates, validity, consistency_rate, fresh_rate, fresh_stats, thresholds):
    alerts = []
    def add(level, dim, metric, value, target, msg):
        alerts.append({'level': level, 'dimension': dim, 'metric': metric, 'value': float(value), 'target': float(target), 'message': msg})

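    # Completeness (only the email threshold is evaluated here;
    # 'signup_dt_null_rate_max' from B1 has no matching branch yet)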
    if null_rates['email'] > thresholds['completeness']['email_null_rate_max']:
        add('FAIL','completeness','email_null_rate', null_rates['email'], thresholds['completeness']['email_null_rate_max'], 'Email null rate too high')
    elif null_rates['email'] > thresholds['completeness']['email_null_rate_max'] * 0.8:
        add('WARN','completeness','email_null_rate', null_rates['email'], thresholds['completeness']['email_null_rate_max'], 'Email null rate nearing limit')

    # Validity
    if validity['age_in_range_rate'] < thresholds['validity']['age_in_range_min']:
        add('FAIL','validity','age_in_range_rate', validity['age_in_range_rate'], thresholds['validity']['age_in_range_min'], 'Age out of range')
    if validity['ltv_nonnegative_rate'] < thresholds['validity']['ltv_nonnegative_min']:
        add('FAIL','validity','ltv_nonnegative_rate', validity['ltv_nonnegative_rate'], thresholds['validity']['ltv_nonnegative_min'], 'Negative LTV detected')

    # Consistency
    if consistency_rate < thresholds['consistency']['country_mapped_min']:
        add('WARN','consistency','country_mapped_rate', consistency_rate, thresholds['consistency']['country_mapped_min'], 'New/unmapped country variants observed')

    # Timeliness
    if fresh_rate < thresholds['timeliness']['fresh_rate_min']:
        add('WARN','timeliness','fresh_rate', fresh_rate, thresholds['timeliness']['fresh_rate_min'], 'Records stale beyond SLA')
    if fresh_stats['lag_p95'] > thresholds['timeliness']['lag_p95_max']:
        add('WARN','timeliness','lag_p95', fresh_stats['lag_p95'], thresholds['timeliness']['lag_p95_max'], 'Tail latency too high')

    return alerts

alerts = evaluate_dq(null_rates, validity, consistency_rate, fresh_rate, fresh_stats, thresholds)
alerts[:5]

[{'level': 'WARN',
  'dimension': 'consistency',
  'metric': 'country_mapped_rate',
  'value': 0.88841882601798,
  'target': 0.98,
  'message': 'New/unmapped country variants observed'},
 {'level': 'WARN',
  'dimension': 'timeliness',
  'metric': 'fresh_rate',
  'value': 0.2760027192386132,
  'target': 0.9,
  'message': 'Records stale beyond SLA'},
 {'level': 'WARN',
  'dimension': 'timeliness',
  'metric': 'lag_p95',
  'value': 41.0,
  'target': 40.0,
  'message': 'Tail latency too high'}]
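The `signup_dt_null_rate_max` threshold defined in B1 has no matching branch in `evaluate_dq`, so the non-zero `signup_date` null rate never raises an alert. One way to keep thresholds and checks from drifting apart is to drive the evaluation from a rule table. The sketch below is an assumption about how that could look, not the lab's implementation: `evaluate_rules`, the tuple layout, and the FAIL/WARN levels assigned to each rule are all illustrative.

```python
def evaluate_rules(rules):
    """Each rule is (level, dimension, metric, value, target, kind).

    kind == 'max': breach when value > target
    kind == 'min': breach when value < target
    """
    found = []
    for level, dim, metric, value, target, kind in rules:
        breached = value > target if kind == 'max' else value < target
        if breached:
            found.append({'level': level, 'dimension': dim, 'metric': metric,
                          'value': float(value), 'target': float(target)})
    return found

# Metric values copied from the outputs above; levels are illustrative choices
null_rates_demo = {'email': 0.0, 'signup_date': 0.04078857919782461}
rules = [
    ('FAIL', 'completeness', 'email_null_rate', null_rates_demo['email'], 0.02, 'max'),
    ('FAIL', 'completeness', 'signup_date_null_rate', null_rates_demo['signup_date'], 0.00, 'max'),
    ('WARN', 'consistency', 'country_mapped_rate', 0.88841882601798, 0.98, 'min'),
]
alerts_demo = evaluate_rules(rules)
for a in alerts_demo:
    print(a['level'], a['metric'], a['value'], a['target'])
```

With a table like this, adding a threshold without a corresponding rule is easier to spot, because both live side by side.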

### B3. Persist a machine‑readable DQ report

In [20]:
import json
from pathlib import Path
Path('artifacts/reports').mkdir(parents=True, exist_ok=True)
report = {
    'timestamp': pd.Timestamp.now(tz='UTC').isoformat(),  # Timestamp.utcnow() is deprecated in recent pandas
    'metrics': {
        'completeness': null_rates,
        'validity': validity,
        'consistency': {'country_mapped_rate': consistency_rate},
        'timeliness': {'fresh_rate': fresh_rate, **fresh_stats}
    },
    'thresholds': thresholds,
    'alerts': alerts
}
with open('artifacts/reports/dq_report.json','w') as f:
    json.dump(report, f, indent=2)
'Wrote artifacts/reports/dq_report.json'

'Wrote artifacts/reports/dq_report.json'
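A downstream job can load the report and summarize it without re-running the checks. The following is a minimal sketch that writes a stand-in report to a temporary directory so it runs on its own; the report content is abbreviated and illustrative, and `level_counts` is a hypothetical name.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Stand-in report with the same top-level keys as the one written above
report_demo = {
    'timestamp': '2025-02-15T00:00:00+00:00',
    'metrics': {'consistency': {'country_mapped_rate': 0.888}},
    'thresholds': {'consistency': {'country_mapped_min': 0.98}},
    'alerts': [
        {'level': 'WARN', 'dimension': 'consistency', 'metric': 'country_mapped_rate',
         'value': 0.888, 'target': 0.98, 'message': 'New/unmapped country variants observed'},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'dq_report.json'
    path.write_text(json.dumps(report_demo, indent=2))
    loaded = json.loads(path.read_text())   # round-trips as plain dicts and lists

level_counts = Counter(a['level'] for a in loaded['alerts'])
print(dict(level_counts))
```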

**Checkpoint:** Which alerts would be **FAIL** (block) vs **WARN** (notify) in your org? Justify.

*Answer:* In most organizations:
- **FAIL (block)**: Completeness issues (missing critical identifiers like CustomerID, email, signup_dt), validity issues (negative LTV, age out of valid range) should block the pipeline because they can cause downstream processing errors or incorrect business logic.
- **WARN (notify)**: Consistency issues (country naming variants) and timeliness issues (stale data) should trigger warnings because they don't prevent processing but indicate data quality degradation that needs attention. These can be tolerated temporarily while being monitored and addressed.

## Part C — Hook into Validation & Profiling

### C1. Combine with Pandera (optional gate)

In [25]:
import pandera.pandas as pa
from pandera import Column, Check
UsersSchema = pa.DataFrameSchema({
    'user_id': Column(pa.Int64, nullable=False),
    'email': Column(object, nullable=False, checks=Check.str_matches(r'^.+@.+\..+$')),
    'age': Column(pa.Float64, nullable=False, checks=Check.in_range(0,120)),
    'signup_date': Column(pa.DateTime, nullable=False),
    'spend_usd': Column(pa.Float64, nullable=False, checks=Check.ge(0))
})
try:
    _ = UsersSchema.validate(users2.dropna(subset=['user_id','email','signup_date']), lazy=True)
except pa.errors.SchemaErrors as err:
    print('Schema gate failed; see failure cases below:')
    display(err.failure_cases.head())

### C2. Single boolean to drive CI

In [None]:
print(alerts)
fail = any(a['level']=='FAIL' for a in alerts)
warn = any(a['level']=='WARN' for a in alerts)
print('DQ status =>', 'FAIL' if fail else 'WARN' if warn else 'OK')
# In CI: sys.exit(1) if fail

[{'level': 'WARN', 'dimension': 'consistency', 'metric': 'country_mapped_rate', 'value': 0.88841882601798, 'target': 0.98, 'message': 'New/unmapped country variants observed'}, {'level': 'WARN', 'dimension': 'timeliness', 'metric': 'fresh_rate', 'value': 0.2760027192386132, 'target': 0.9, 'message': 'Records stale beyond SLA'}, {'level': 'WARN', 'dimension': 'timeliness', 'metric': 'lag_p95', 'value': 41.0, 'target': 40.0, 'message': 'Tail latency too high'}]
DQ status => WARN
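In a CI job the status usually has to become a process exit code. A small sketch, assuming a hypothetical helper `dq_exit_code` and an optional strict mode in which warnings also fail the build:

```python
def dq_exit_code(alerts, fail_on_warn=False):
    """Return 1 when the pipeline should be blocked, else 0."""
    levels = {a['level'] for a in alerts}
    if 'FAIL' in levels:
        return 1
    if fail_on_warn and 'WARN' in levels:
        return 1
    return 0

demo_alerts = [{'level': 'WARN', 'metric': 'fresh_rate'}]
print(dq_exit_code(demo_alerts))                     # warnings alone do not block
print(dq_exit_code(demo_alerts, fail_on_warn=True))  # strict mode blocks on warnings
# In a CI script: import sys; sys.exit(dq_exit_code(alerts))
```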


## Part D — Wrap‑Up

### Reflection Questions

1. **Give one concrete metric per dimension that you computed and the threshold you chose.**
   - **Completeness**: Email null rate with threshold of 2% maximum
   - **Validity**: Age in range (0-120) with threshold of 99.5% minimum
   - **Consistency**: Country mapped rate with threshold of 98% minimum
   - **Timeliness**: Fresh rate (within 30 days) with threshold of 90% minimum

2. **Which alerts would block the pipeline vs notify only? Why?**
   - **Block (FAIL)**: Missing signup dates, negative LTV values, excessive null emails, out-of-range ages - these indicate fundamental data integrity issues that would cause downstream processing errors
   - **Notify (WARN)**: Country naming inconsistencies, stale data, high tail latency - these are quality issues that should be monitored and addressed but don't prevent processing

3. **Where in your pipeline would you store and review the DQ report?**
   - Store in a time-series database or object storage with timestamps for historical tracking
   - Review in automated dashboards (e.g., Grafana, Tableau) with alerting capabilities
   - Integrate into CI/CD pipeline logs and monitoring systems
   - Archive reports in data lake for audit trails and trend analysis
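For historical tracking, one lightweight option is to append each report as a line of JSON and load the history into a frame for trend checks. This is a sketch under stated assumptions: the file name `dq_history.jsonl`, the helper `append_report`, and the three sample rates are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def append_report(history_path: Path, report: dict) -> None:
    # One JSON document per line keeps appends cheap and the file easy to stream
    with history_path.open('a') as f:
        f.write(json.dumps(report) + '\n')

with tempfile.TemporaryDirectory() as tmp:
    history = Path(tmp) / 'dq_history.jsonl'
    samples = [('2025-02-13T00:00:00+00:00', 0.95),
               ('2025-02-14T00:00:00+00:00', 0.91),
               ('2025-02-15T00:00:00+00:00', 0.89)]  # illustrative values only
    for ts, rate in samples:
        append_report(history, {
            'timestamp': ts,
            'metrics': {'consistency': {'country_mapped_rate': rate}},
        })
    rows = [json.loads(line) for line in history.read_text().splitlines()]

trend = pd.DataFrame({
    'timestamp': pd.to_datetime([r['timestamp'] for r in rows]),
    'country_mapped_rate': [r['metrics']['consistency']['country_mapped_rate'] for r in rows],
})
trend['delta'] = trend['country_mapped_rate'].diff()  # run-over-run change
print(trend)
```

A steadily negative `delta` is the kind of signal a point-in-time threshold check would miss until the metric finally crosses the limit.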

### Export DQ alerts to CSV (optional)

In [28]:
# Export alerts to CSV for review
if alerts:
    alerts_df = pd.DataFrame(alerts)
    alerts_df.to_csv('artifacts/reports/dq_alerts.csv', index=False)
    print(f'Exported {len(alerts)} alerts to artifacts/reports/dq_alerts.csv')
    display(alerts_df)
else:
    print('No alerts to export - data quality checks passed!')

Exported 3 alerts to artifacts/reports/dq_alerts.csv


|   | level | dimension | metric | value | target | message |
|---|-------|-----------|--------|-------|--------|---------|
| 0 | WARN | consistency | country_mapped_rate | 0.888419 | 0.98 | New/unmapped country variants observed |
| 1 | WARN | timeliness | fresh_rate | 0.276003 | 0.9 | Records stale beyond SLA |
| 2 | WARN | timeliness | lag_p95 | 41.0 | 40.0 | Tail latency too high |


## Common Pitfalls

- **Overly strict thresholds** that keep the report permanently red; start lenient and tighten over time
- **Ambiguous units** (days vs hours) for timeliness - always document clearly
- **Mixing validity (range) with consistency (naming)** - keep dimensions distinct
- **Not storing historical DQ reports** - trends are as important as point-in-time checks
- **Ignoring warn alerts** - they accumulate and eventually become failures
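On the units pitfall: `.dt.days` keeps only the whole-day component of a timedelta, so a six-hour lag reads as 0 days. A short sketch with toy timestamps shows the difference and one way to make the unit explicit in the metric name:

```python
import pandas as pd

now = pd.Timestamp('2025-02-15 12:00')
arrived = pd.Series(pd.to_datetime([
    '2025-02-15 06:00',   # 6 hours ago
    '2025-02-14 00:00',   # 36 hours ago
    '2025-02-12 12:00',   # 72 hours ago
]))
lag = now - arrived

lag_days_whole = lag.dt.days                 # whole days only: 6 h -> 0, 36 h -> 1
lag_hours = lag.dt.total_seconds() / 3600    # exact lag, unit stated in the name

print(lag_days_whole.tolist())
print(lag_hours.tolist())
```

Naming metrics with their unit (for example `lag_p95_days` or `lag_p95_hours`) keeps thresholds and values comparable when the report is read later.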

## Solution Snippets (reference)

**Null‑rate dict for any set of columns:**
```python
def null_rate_dict(df, cols):
    return df[cols].isna().mean().to_dict()
```

**Country mapping coverage:**
```python
coverage = (map_df['canonical'].notna().mean())
```

**SLA freshness check for arbitrary datetime col:**
```python
def fresh_rate_within(s, now, days):
    return float(((now - s).dt.days <= days).mean())
```

**CI fail/warn toggle:**
```python
fail = any(a['level']=='FAIL' for a in alerts)
warn = any(a['level']=='WARN' for a in alerts)
```