# Lab 10 — Data Profiling at Scale with ydata‑profiling

**Focus Area:** Data profiling — summary stats, cardinality, distributions, outlier flags, and integrating reports into review

> This lab shows how to generate actionable **profiling reports** for medium–large datasets using **ydata‑profiling**, interpret the outputs (summary stats, high‑cardinality, distributions, correlations, outliers), and integrate those artifacts into your review/CI workflow alongside Pandera/Pydantic gates.

## Outcomes

By the end of this lab, you will be able to:

1. Produce a **ProfileReport** (full and minimal) and export it to HTML for team review.
2. Interpret key sections: **overview**, **variables**, **interactions**, **correlations**, **missingness**, **alerts** (outliers, skew, high cardinality).
3. Extract **machine‑readable metrics** from the report to track drift over time.
4. Profile **at scale** using sampling, column subsets, and configuration tuning.

## Prerequisites & Setup

- Python 3.13 with `pandas`, `numpy`, `ydata-profiling`, `pyarrow`.
- JupyterLab or VS Code with Jupyter extension.

If you don't have artifacts, synthesize a dataset:

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

# Try to load existing artifact, or synthesize if not available
artifact_path = Path("../artifacts/clean/per_customer_enriched.parquet")

if artifact_path.exists():
    per_cust_enriched = pd.read_parquet(artifact_path)
    print(f"Loaded {len(per_cust_enriched)} rows from {artifact_path}")
else:
    print("Artifact not found, synthesizing dataset...")
    rng = np.random.default_rng(0)
    N = 50_000
    per_cust_enriched = pd.DataFrame({
        'CustomerID': [f'C{i:05d}' for i in range(N)],
        'country_norm': rng.choice(['USA','DE','SG','BR'], size=N, p=[.58,.18,.16,.08]),
        'n_orders': rng.poisson(3, size=N),
        'freight_sum': np.round(np.clip(rng.lognormal(3.0, 0.8, size=N), 0, 2e5), 2),
        'freight_mean': np.round(np.clip(rng.lognormal(2.5, 0.6, size=N), 0, 1e4), 2),
        'is_adult': rng.random(size=N) > 0.1,
        'is_high_value': rng.random(size=N) > 0.9,
    })

per_cust_enriched.head()

Artifact not found, synthesizing dataset...


| | CustomerID | country_norm | n_orders | freight_sum | freight_mean | is_adult | is_high_value |
|---|---|---|---|---|---|---|---|
| 0 | C00000 | DE | 5 | 16.81 | 26.88 | True | False |
| 1 | C00001 | USA | 6 | 14.5 | 2.5 | True | False |
| 2 | C00002 | USA | 3 | 9.76 | 7.26 | True | False |
| 3 | C00003 | USA | 6 | 22.98 | 6.72 | True | False |
| 4 | C00004 | SG | 2 | 88.19 | 14.02 | True | False |


## Part A — Generate a Minimal Profile

### A1. Basic report (minimal config)

In [5]:
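# Note: NumPy 1.x publishes no prebuilt wheels for Python 3.13, so the "numpy<2.0" pin
# forces a source build there; relax the pin if your ydata-profiling release accepts NumPy 2.
%pip install --upgrade "numpy<2.0" "scipy>=1.11.0" ydata-profiling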

from ydata_profiling import ProfileReport

# Create reports directory if it doesn't exist
Path("artifacts/reports").mkdir(parents=True, exist_ok=True)

# Sample data if too large
sample = per_cust_enriched.sample(15_000, random_state=42) if len(per_cust_enriched) > 15_000 else per_cust_enriched

profile_min = ProfileReport(
    sample,
    title="Per-Customer Enriched — Minimal Profile",
    minimal=True,  # disables heavy calculations (e.g., interactions)
    explorative=True,
    progress_bar=True
)

profile_min.to_file("artifacts/reports/per_customer_minimal.html")
print("Minimal profile saved to: artifacts/reports/per_customer_minimal.html")


Minimal profile saved to: artifacts/reports/per_customer_minimal.html





### A2. Read the overview

- **Warnings/Alerts:** high cardinality (e.g., `CustomerID`), skewed distributions (`freight_sum`), zeros inflation.
- **Missingness:** ensure expected null rates (should be near 0 post‑cleaning).

**Checkpoint:** List 3 alerts the report shows and classify them: quality issue vs expected property.

**Analysis of Alerts:**

1. **High cardinality in CustomerID** - Expected property (unique identifier)
2. **Skewed distribution in freight_sum** - Expected property (log-normal distribution is typical for monetary values)
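3. **Zeros in `n_orders`** (if flagged) - Expected property for a count column (in the synthesized data, a Poisson count with mean 3 is zero about 5% of the time); unexpected zeros or missing values in `freight_sum` would be a quality issue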

*(Review the HTML report to identify specific alerts)*
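The classification above can be kept next to the report as a small allowlist, so reviewers only re-examine alerts nobody has signed off on. The sketch below is a minimal illustration: the `(column, alert_kind)` tuples, `EXPECTED_ALERTS`, and `triage_alerts` are hypothetical names typed in by hand, not the format ydata-profiling emits.

```python
# Hypothetical allowlist of alerts the team has reviewed and accepted.
# The (column, alert_kind) pairs are illustrative labels, not ydata-profiling output.
EXPECTED_ALERTS = {
    ("CustomerID", "high_cardinality"),  # unique identifier
    ("freight_sum", "skewed"),           # log-normal monetary value
}

def triage_alerts(alerts, expected=EXPECTED_ALERTS):
    """Split alerts into accepted properties and items needing investigation."""
    accepted = [a for a in alerts if a in expected]
    investigate = [a for a in alerts if a not in expected]
    return accepted, investigate

# Alerts transcribed by hand from the HTML report (example values only).
seen = [
    ("CustomerID", "high_cardinality"),
    ("freight_sum", "skewed"),
    ("freight_mean", "missing"),
]
accepted, investigate = triage_alerts(seen)
print("expected:", accepted)
print("investigate:", investigate)
```

Keeping the allowlist in version control gives each "expected property" decision a reviewable history.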

## Part B — Focused Full Profile (column subset + tuned)

### B1. Choose columns and tune config

In [7]:
cols = ["country_norm","n_orders","freight_sum","freight_mean","is_high_value"]
subset = per_cust_enriched[cols].copy()

profile_cfg = {
    "title": "Per-Customer Enriched — Focused Profile",
    "dataset": {"description": "Subset profile for review & CI"},
    "variables": {"descriptions": {
        "freight_sum": "Total freight per customer (currency units)",
        "freight_mean": "Average freight per order",
        "n_orders": "Order count per customer"
    }},
    "correlations": {"pearson": {"calculate": True}, "spearman": {"calculate": True}},
    "missing_diagrams": {"heatmap": True, "dendrogram": False},
}

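profile_full = ProfileReport(
    subset,
    title=profile_cfg["title"],
    dataset=profile_cfg["dataset"],        # dataset description shown in the overview
    variables=profile_cfg["variables"],    # per-column descriptions
    correlations=profile_cfg["correlations"],
    missing_diagrams=profile_cfg["missing_diagrams"],
    explorative=True,
    minimal=False,
    progress_bar=True
)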

profile_full.to_file("artifacts/reports/per_customer_focused.html")
print("Focused profile saved to: artifacts/reports/per_customer_focused.html")


Focused profile saved to: artifacts/reports/per_customer_focused.html





### B2. Interpret variables & correlations

- **Variables tab:** check **distributions**, **zeros**, **distinct counts** (cardinality), **outlier flags** for `freight_sum`.
- **Correlations:** look for strong positive or negative relationships (e.g., `n_orders` vs `freight_sum`), and verify they are **business‑plausible**.

**Checkpoint:** Name one correlation you'd expect and whether the profile confirms it.

**Expected Correlation:**

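On real order data we would expect a **positive correlation between `n_orders` and `freight_sum`**: customers who place more orders should accumulate higher total freight, and the report should show a positive Pearson/Spearman coefficient. The synthesized fallback dataset draws every column independently, so there the coefficients will sit near zero; treat that as a property of the synthetic data, not a pipeline defect.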

*(Review the Correlations section in the HTML report to verify)*

## Part C — Extract Metrics Programmatically

### C1. Get summary dict

In [24]:
# Get description object
desc = profile_full.get_description()

# desc.variables maps each column name to a dict of summary statistics
n_rows = len(subset)

# Collect the per-column summaries for the profiled columns
var_summaries = {col: desc.variables[col] for col in cols}

print(f"Number of rows: {n_rows}")
print(f"Variables analyzed: {list(var_summaries.keys())[:3]}")

# Display first few variable names
n_rows, list(var_summaries.keys())[:3]

Number of rows: 50000
Variables analyzed: ['country_norm', 'n_orders', 'freight_sum']


(50000, ['country_norm', 'n_orders', 'freight_sum'])
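Not every summary carries every statistic; a categorical column such as `country_norm` may have no `mean`, so direct key access can raise `KeyError`. The sketch below assumes the summaries behave like plain dicts and uses a hypothetical helper, `pick_stats`, with hand-written stand-in values rather than real report output.

```python
def pick_stats(var_summaries, wanted):
    """Return {column_stat: value} for the requested stats, skipping absent keys."""
    picked = {}
    for column, stat_names in wanted.items():
        summary = var_summaries.get(column, {})
        for stat in stat_names:
            if stat in summary:
                picked[f"{column}_{stat}"] = summary[stat]
    return picked

# Hand-written stand-in for the summaries; values are illustrative only.
example_summaries = {
    "freight_sum": {"mean": 27.68, "std": 26.11},
    "country_norm": {"n_distinct": 4},  # no "mean" for a categorical column
}
wanted = {"freight_sum": ["mean", "std"], "country_norm": ["mean", "n_distinct"]}
stats = pick_stats(example_summaries, wanted)
print(stats)
```

Skipping absent keys keeps the extraction step from failing when a column changes type between runs; whether a silent skip or a hard failure is preferable is a team decision.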

### C2. Build a compact drift tracker

In [26]:
import json
from pathlib import Path

# Access statistics from variable descriptions (they're dicts, not objects)
metrics = {
    "n_rows": n_rows,
    "freight_sum_mean": var_summaries['freight_sum']['mean'],
    "freight_sum_std": var_summaries['freight_sum']['std'],
    "n_orders_mean": var_summaries['n_orders']['mean'],
    "n_orders_distinct": var_summaries['n_orders']['n_distinct'],
    "country_cardinality": var_summaries['country_norm']['n_distinct'],
}

Path("artifacts/metrics").mkdir(parents=True, exist_ok=True)
with open("artifacts/metrics/per_customer_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

print("Metrics saved to: artifacts/metrics/per_customer_metrics.json")
metrics

Metrics saved to: artifacts/metrics/per_customer_metrics.json


{'n_rows': 50000,
 'freight_sum_mean': 27.682903800000002,
 'freight_sum_std': 26.105907971847415,
 'n_orders_mean': 2.99766,
 'n_orders_distinct': 13,
 'country_cardinality': 4}
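A single JSON file only holds the latest run. To track drift over time, one option is to append each snapshot to a JSON Lines history file. The sketch below is an assumption about how that could look: the `history.jsonl` name, the `run` label, and both helper functions are hypothetical. The demo writes to a temporary directory so nothing under `artifacts/` is touched.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def append_snapshot(history_path, metrics, run_label):
    """Append one metrics snapshot as a JSON line, stamped with the UTC time."""
    record = {
        "run": run_label,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    }
    history_path = Path(history_path)
    history_path.parent.mkdir(parents=True, exist_ok=True)
    with history_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_history(history_path):
    """Read every snapshot back as a list of dicts, oldest first."""
    with Path(history_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo in a temporary directory with made-up second-run values.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metrics" / "history.jsonl"
    append_snapshot(path, {"n_rows": 50000, "country_cardinality": 4}, "run-1")
    append_snapshot(path, {"n_rows": 50000, "country_cardinality": 5}, "run-2")
    history = load_history(path)

print([h["metrics"]["country_cardinality"] for h in history])
```

An append-only file keeps earlier runs intact, so any past snapshot can serve as the baseline in the comparison step.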

### C3. Compare to a baseline (simulated)

In [27]:
# Create a fake baseline and compare for illustration
baseline = {k: (v * 0.9 if isinstance(v, (int, float)) else v) for k, v in metrics.items()}

def pct_diff(a, b):
    return None if b == 0 else (a - b) / b

delta = {k: pct_diff(metrics[k], baseline[k]) if isinstance(metrics[k], (int, float)) else None for k in metrics}
delta_formatted = {k: round(v, 3) for k, v in delta.items() if v is not None}

print("Relative differences from baseline (fractions, 0.111 = 11.1%):")
delta_formatted

Relative differences from baseline (fractions, 0.111 = 11.1%):


{'n_rows': 0.111,
 'freight_sum_mean': 0.111,
 'freight_sum_std': 0.111,
 'n_orders_mean': 0.111,
 'n_orders_distinct': 0.111,
 'country_cardinality': 0.111}

**Checkpoint:** Which metric movements would trigger investigation (>20% by default)?

In this simulated example, every metric moves by about 11% (1/0.9 - 1 ≈ 0.111), so none crosses the 20% default. In a real scenario:
- Changes > 20% in `freight_sum_mean` or `freight_sum_std` would indicate significant data drift
- Changes in `country_cardinality` might indicate new markets or data quality issues
- Large changes in `n_orders_mean` could signal business shifts or data collection problems

## Part D — Operate at Scale & Integrate in Review

### D1. Tips for biggish data

- **Sampling:** `.sample(50_000, random_state=42)` for reproducible profiles; keep full data for Pandera validation.
- **Disable heavy bits:** `minimal=True` or turn off interactions/correlations you don't need.
- **Column subsets:** profile only **review‑critical** columns per PR.
- **Persist artifacts:** write to `artifacts/reports/` and link in your PR checklist.
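The sampling and column-subset tips can be combined in one small helper that runs before `ProfileReport`. This is a sketch under assumptions: `prepare_profile_input` is a hypothetical name, the 50,000-row cap mirrors the tip above, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd

def prepare_profile_input(df, columns, max_rows=50_000, seed=42):
    """Keep only the review-critical columns and cap the row count by sampling."""
    subset = df[columns]
    if len(subset) > max_rows:
        # Fixed seed so repeated runs profile the same rows.
        subset = subset.sample(max_rows, random_state=seed)
    return subset

# Small synthetic frame standing in for the full dataset.
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "n_orders": rng.poisson(3, size=200_000),
    "freight_sum": rng.lognormal(3.0, 0.8, size=200_000),
    "notes": "free text not needed for review",
})
to_profile = prepare_profile_input(full, ["n_orders", "freight_sum"])
print(to_profile.shape)
```

Selecting columns before sampling avoids copying wide columns that the report will never show.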

### D2. Review checklist snippet (add to PR template)

- [ ] Profile HTML attached (`per_customer_focused.html`).
- [ ] Key metrics JSON updated (`per_customer_metrics.json`).
- [ ] Any new high‑cardinality or outlier alerts acknowledged.
- [ ] Pandera schema still passes (link to Lab 3B test).

### D3. Wire to CI (concept)

- Save `profile.to_file()` output as a CI artifact.
- Parse `profile.get_description()` (or the JSON from `profile.to_json()`) and **fail** if critical thresholds are exceeded (e.g., null rate, cardinality spike, extreme mean/STD drift).

Example CI check:

In [28]:
# Example CI validation function
def validate_metrics_for_ci(metrics, baseline, thresholds):
    """
    Validate metrics against baseline with configurable thresholds.
    
    Args:
        metrics: Current metrics dict
        baseline: Baseline metrics dict
        thresholds: Dict of metric_name -> max_allowed_pct_change
    
    Returns:
        tuple: (passed: bool, failures: list)
    """
    failures = []
    
    for metric, threshold in thresholds.items():
        if metric not in metrics or metric not in baseline:
            continue
            
        current = metrics[metric]
        base = baseline[metric]
        
        if base == 0:
            continue
            
        pct_change = abs((current - base) / base)
        
        if pct_change > threshold:
            failures.append({
                'metric': metric,
                'current': current,
                'baseline': base,
                'pct_change': round(pct_change * 100, 2),
                'threshold': round(threshold * 100, 2)
            })
    
    return len(failures) == 0, failures

# Example usage
thresholds = {
    'freight_sum_mean': 0.20,  # 20% max change
    'freight_sum_std': 0.25,   # 25% max change
    'n_orders_mean': 0.15,     # 15% max change
    'country_cardinality': 0.10  # 10% max change (new countries should be rare)
}

passed, failures = validate_metrics_for_ci(metrics, baseline, thresholds)

if passed:
    print("✓ All metrics within acceptable thresholds")
else:
    print("✗ Metrics validation failed:")
    for failure in failures:
        print(f"  {failure['metric']}: {failure['pct_change']}% change (threshold: {failure['threshold']}%)")

✗ Metrics validation failed:
  country_cardinality: 11.11% change (threshold: 10.0%)
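For the check to fail a build, the CI job has to end with a non-zero exit code. The sketch below shows one way to wrap the same comparison around two JSON files. The function name `ci_exit_code`, the file names, and the current/baseline values are all assumptions for illustration; a real job would finish with `sys.exit(ci_exit_code(...))`.

```python
import json
import tempfile
from pathlib import Path

def ci_exit_code(metrics_path, baseline_path, thresholds):
    """Return 0 if every tracked metric is within its threshold, else 1."""
    metrics = json.loads(Path(metrics_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    failed = []
    for name, limit in thresholds.items():
        base = baseline.get(name)
        current = metrics.get(name)
        if base in (None, 0) or current is None:
            continue  # nothing to compare against
        if abs((current - base) / base) > limit:
            failed.append(name)
    for name in failed:
        print(f"drift threshold exceeded: {name}")
    return 1 if failed else 0

# Demo with throwaway files and made-up values.
with tempfile.TemporaryDirectory() as tmp:
    current_file = Path(tmp) / "current.json"
    baseline_file = Path(tmp) / "baseline.json"
    current_file.write_text(json.dumps({"freight_sum_mean": 27.68, "country_cardinality": 5}))
    baseline_file.write_text(json.dumps({"freight_sum_mean": 27.00, "country_cardinality": 4}))
    code = ci_exit_code(current_file, baseline_file,
                        {"freight_sum_mean": 0.20, "country_cardinality": 0.10})
print("exit code:", code)
```

Returning the code instead of calling `sys.exit` inside the function keeps it easy to unit-test.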


## Solution Snippets (reference)

### Minimal profile one‑liner:

In [29]:
# Quick minimal profile for rapid iteration
# ProfileReport(df.sample(20_000), minimal=True, explorative=True).to_file("../artifacts/reports/df_min.html")
print("Example code - uncomment to run")

Example code - uncomment to run


### Turn off heavy interactions:

In [30]:
# Profile with correlations but without interactions
# ProfileReport(df, minimal=False, correlations={"pearson": {"calculate": True}}, interactions=None)
print("Example code - uncomment to run")

Example code - uncomment to run


### Extract a null‑rate table from dict:

In [31]:
# Extract null rates from profile summary
desc = profile_full.get_description()
null_rates = {}
for col in cols:
    if col in desc.variables:
        var_dict = desc.variables[col]
        null_rates[col] = var_dict.get('p_missing', 0)

null_rates_formatted = {k: round(v, 4) for k, v in null_rates.items()}

print("Null rates by column:")
null_rates_formatted

Null rates by column:


{'country_norm': 0.0,
 'n_orders': 0.0,
 'freight_sum': 0.0,
 'freight_mean': 0.0,
 'is_high_value': 0.0}
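A null-rate table becomes a gate once each column has a budget. The sketch below assumes a default budget of 1% with optional per-column overrides; the function name, the budget, and the "regressed" values are made up for illustration.

```python
def columns_over_null_budget(null_rates, max_null_rate=0.01, overrides=None):
    """Return {column: rate} for columns whose null rate exceeds its budget."""
    overrides = overrides or {}
    return {
        col: rate
        for col, rate in null_rates.items()
        if rate > overrides.get(col, max_null_rate)
    }

# First dict mirrors the all-zero result above; second is a made-up regression.
clean = {"country_norm": 0.0, "n_orders": 0.0, "freight_sum": 0.0}
regressed = {"country_norm": 0.0, "n_orders": 0.0, "freight_sum": 0.034}
print(columns_over_null_budget(clean))
print(columns_over_null_budget(regressed))
```

An empty result means every column is within budget; anything else can be printed in the CI log and used to fail the job.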

## Wrap‑Up

Answer the following questions:

### 1. List two alerts flagged by the profile and how you'd mitigate them.

**Alert 1: High cardinality in CustomerID**
- **Mitigation:** This is expected for a unique identifier. Document as expected behavior and exclude from cardinality warnings in CI. Consider hashing or pseudonymizing if privacy is a concern.

**Alert 2: Skewed distribution in freight_sum**
- **Mitigation:** Log-normal distributions are common for monetary values. Document this as expected business behavior. Consider using log-scale transformations for ML models. Monitor outliers separately to catch data quality issues.

*(Additional alerts to check in the actual report: zeros inflation, extreme values, unexpected missing data)*

### 2. Paste two metrics from your JSON that you will watch in CI and why.

In [32]:
# Display key metrics to monitor
ci_metrics = {
    "freight_sum_mean": metrics["freight_sum_mean"],
    "country_cardinality": metrics["country_cardinality"]
}

print("Key metrics for CI monitoring:")
print(json.dumps(ci_metrics, indent=2))

Key metrics for CI monitoring:
{
  "freight_sum_mean": 27.682903800000002,
  "country_cardinality": 4
}


**Metrics to watch in CI:**

1. **`freight_sum_mean`**: Monitors the average total freight per customer. Large changes could indicate:
   - Pricing changes in the business
   - Data quality issues (wrong currency, missing decimals)
   - Shift in customer segments
   - Threshold: Alert if >20% change from baseline

2. **`country_cardinality`**: Tracks the number of distinct countries. Changes could indicate:
   - Expansion into new markets (expected if planned)
   - Data quality issues (invalid country codes)
   - Changes in data collection or normalization logic
   - Threshold: Alert if >10% change (should be relatively stable)
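With a baseline cardinality of 4, a single extra country code is already a 25% change, so the 10% threshold fires on any change at all. A set comparison then shows which values are responsible. The sketch below assumes the four codes from the synthesized dataset as the known set; `diff_categories` and the stray `"usa "` value are hypothetical.

```python
KNOWN_COUNTRIES = {"USA", "DE", "SG", "BR"}  # values used in the synthesized dataset

def diff_categories(observed, known=KNOWN_COUNTRIES):
    """Report category values that appeared or disappeared relative to the known set."""
    observed = set(observed)
    return {"new": sorted(observed - known), "missing": sorted(known - observed)}

baseline_cardinality = 4
observed_values = ["USA", "DE", "SG", "BR", "usa "]  # hypothetical un-normalized code
relative_change = (len(set(observed_values)) - baseline_cardinality) / baseline_cardinality
category_diff = diff_categories(observed_values)
print(f"cardinality change: {relative_change:.0%}")
print(category_diff)
```

The cardinality metric says that something changed; the diff says whether it looks like a new market or a normalization slip.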

### 3. Where in the pipeline will you generate and store the profiling report?

**Pipeline Integration Strategy:**

1. **Generation Points:**
   - **Post-cleaning stage:** After data cleaning/enrichment but before loading to production
   - **Pre-deployment validation:** As part of CI/CD before promoting to production
   - **Scheduled monitoring:** Weekly/monthly for production data drift detection

2. **Storage Locations:**
   - **HTML Reports:** `artifacts/reports/` directory (versioned or timestamped)
   - **Metrics JSON:** `artifacts/metrics/` for programmatic access and CI checks
   - **CI Artifacts:** Uploaded as build artifacts in CI/CD system (GitHub Actions, GitLab CI, etc.)
   - **Object Storage:** S3/Azure Blob for long-term archival and historical comparison

3. **Integration Points:**
   - Link HTML reports in PR descriptions for reviewer access
   - Automated CI checks parse JSON metrics and fail builds on threshold violations
   - Dashboard integration showing metric trends over time
   - Alert system for significant drift detection
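The "versioned or timestamped" storage idea can be sketched as a helper that derives both artifact paths from one UTC timestamp, so a report and its metrics file always pair up. The function name, the file-name pattern, and the example date are assumptions, not an established convention of this lab.

```python
from datetime import datetime, timezone
from pathlib import Path

def artifact_paths(name, root="artifacts", when=None):
    """Build timestamped report and metrics paths for one profiling run."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    root = Path(root)
    return {
        "report": root / "reports" / f"{name}_{stamp}.html",
        "metrics": root / "metrics" / f"{name}_{stamp}.json",
    }

# Fixed example timestamp so the output is reproducible.
paths = artifact_paths("per_customer", when=datetime(2025, 1, 15, 8, 30, tzinfo=timezone.utc))
print(paths["report"].as_posix())
print(paths["metrics"].as_posix())
```

The same stamp can be reused as the object-storage key prefix when archiving runs for historical comparison.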

## Common Pitfalls

- **Running full profiles on multi‑million rows without sampling:** Memory use and run time grow quickly; sample large datasets down to a manageable size (10-50k rows) first
- **Not persisting artifacts:** Reports and metrics should be version-controlled or archived
- **Treating expected skew as an error:** Document and accept expected business patterns (log-normal distributions, seasonal effects)
- **Over-alerting in CI:** Set appropriate thresholds to avoid alert fatigue
- **Ignoring computational cost:** Disable heavy features (interactions, correlations) when not needed

## Final Verification

In [33]:
# Verify all artifacts were created
import os

artifacts_to_check = [
    "artifacts/reports/per_customer_minimal.html",
    "artifacts/reports/per_customer_focused.html",
    "artifacts/metrics/per_customer_metrics.json"
]

print("Artifact verification:")
for artifact in artifacts_to_check:
    exists = os.path.exists(artifact)
    status = "✓" if exists else "✗"
    size = os.path.getsize(artifact) if exists else 0
    print(f"{status} {artifact} ({size:,} bytes)" if exists else f"{status} {artifact} (not found)")

print("\n✓ Lab 10 complete! Export HTML reports and commit the JSON metrics.")

Artifact verification:
✓ artifacts/reports/per_customer_minimal.html (1,083,308 bytes)
✓ artifacts/reports/per_customer_focused.html (1,152,066 bytes)
✓ artifacts/metrics/per_customer_metrics.json (187 bytes)

✓ Lab 10 complete! Export HTML reports and commit the JSON metrics.
