# Session 5 — Data Quality & Governance

Production-grade pipelines deliver **accurate**, **timely**, and **secure** data. This session covers data quality dimensions, validation strategies, observability (metrics & alerting), lineage & metadata, and security & compliance — with examples.

## 🧭 1️⃣ Why Data Quality & Governance

ETL that **runs** is good. ETL that produces **correct, monitored, and governed** data is production-grade. Data Quality ensures reliable outputs; Governance ensures proper access, lineage, and compliance.

## 📐 2️⃣ Data Quality Dimensions

| Dimension | Description | Example Check |
|----------|-------------|---------------|
| **Accuracy** | Values reflect reality | City matches valid postal region |
| **Completeness** | Required fields populated | `customer_id` not null |
| **Consistency** | Uniform across systems | `CA` vs `California` normalization |
| **Timeliness** | Fresh enough for decision-making | Daily file arrived by 02:00 |
| **Uniqueness** | No duplicates | `order_id` unique |
| **Validity** | Conforms to rules/formats | Date format `YYYY-MM-DD` |
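
Three of the example checks in the table (validity, consistency, timeliness) can be sketched in plain Python. This is a minimal illustration, not a complete rule engine: the `STATE_ALIASES` mapping, the helper names, and the sample values are assumptions made for the example.

```python
import re
from datetime import datetime, time

# Hypothetical alias map for the consistency check (CA vs California).
STATE_ALIASES = {"california": "CA", "ca": "CA", "new york": "NY", "ny": "NY"}

def is_valid_date(value: str) -> bool:
    """Validity: value must be a real calendar date written as YYYY-MM-DD."""
    # The regex enforces zero-padded fields; strptime rejects impossible dates.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def normalize_state(value: str) -> str:
    """Consistency: map known spellings to one canonical code."""
    cleaned = value.strip()
    return STATE_ALIASES.get(cleaned.lower(), cleaned)

def arrived_on_time(arrival: datetime, deadline: time = time(2, 0)) -> bool:
    """Timeliness: the daily file must arrive by the deadline (02:00 here)."""
    return arrival.time() <= deadline

print(is_valid_date("2024-02-30"))                    # False: no 30 February
print(normalize_state(" California "))                # CA
print(arrived_on_time(datetime(2024, 5, 1, 1, 45)))   # True
```

Unknown state values pass through unchanged, so they can be caught by a separate validity check rather than being silently rewritten.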

## 🧪 3️⃣ Validation Strategies (SQL, Python, Frameworks)

**Inline SQL**
```sql
SELECT COUNT(*) AS negative_amounts FROM orders WHERE amount < 0;
SELECT COUNT(*) AS invalid_email FROM customers WHERE email NOT LIKE '%@%';
```

**Python / pandas**
```python
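assert df['amount'].ge(0).all(), 'Negative amount found'
# na=False counts a missing email as invalid; without it, NaN results are skipped by .all()
assert df['email'].str.contains('@', na=False).all(), 'Invalid email'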
```

**Great Expectations (expectation names and arguments, shown as simplified YAML-style pseudo-config)**
```yaml
expect_table_row_count_to_be_between:
  min_value: 1
expect_column_values_to_not_be_null:
  column: customer_id
```

**dbt tests (schema.yml)**
```yaml
version: 2
models:
  - name: orders          # model name assumed for illustration
    columns:
      - name: order_id
        tests:
          - unique
      - name: customer_id
        tests:
          - not_null
```

## 🧭 4️⃣ Where to Place Checks in Pipelines

| Stage | Example Checks |
|------|----------------|
| **Pre-Load (staging)** | Schema/column count, required columns, type checks |
| **Post-Load (curated)** | Row count match, key integrity, aggregates reconciliation |
| **Ongoing Monitoring** | Freshness, volume anomalies, distribution drift |
| **Alerting** | Email/Slack/SMS when thresholds fail |
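
The pre-load and post-load rows above can be sketched as two small pandas functions. This is one possible shape under stated assumptions: `REQUIRED_COLUMNS`, the function names, and the sample frames are hypothetical, and a real pipeline would compare against the actual staging and curated tables.

```python
import pandas as pd

# Assumed schema for the example; not taken from a real pipeline.
REQUIRED_COLUMNS = {"order_id", "customer_id", "amount"}

def pre_load_checks(staging: pd.DataFrame) -> list[str]:
    """Staging checks: required columns are present and amount is numeric."""
    problems = []
    missing = REQUIRED_COLUMNS - set(staging.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "amount" in staging.columns and not pd.api.types.is_numeric_dtype(staging["amount"]):
        problems.append("amount is not numeric")
    return problems

def post_load_checks(staging: pd.DataFrame, curated: pd.DataFrame) -> list[str]:
    """Curated checks: row counts match and amount totals reconcile."""
    problems = []
    if len(staging) != len(curated):
        problems.append(f"row count mismatch: {len(staging)} vs {len(curated)}")
    if abs(staging["amount"].sum() - curated["amount"].sum()) > 1e-9:
        problems.append("amount totals do not reconcile")
    return problems

staging = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 12],
    "amount": [100.0, 200.0, 50.0],
})
curated = staging.iloc[:2]  # simulate a load that dropped one row

print(pre_load_checks(staging))            # no problems expected
print(post_load_checks(staging, curated))  # count and total both differ
```

Returning a list of problems instead of raising on the first failure lets the caller decide whether to block the load, quarantine rows, or only alert.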

## 🧰 5️⃣ Hands-On: Simple Validation Script (pandas)

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 3, 3],
    'amount': [100, 200, -50, 400],
    'email': ['a@x.com', 'b@x.com', 'c', 'd@x.com']
})

# Row-level masks, computed once and reused for reporting and quarantine.
# na=False treats a missing email as invalid instead of propagating NaN.
negative = df['amount'] < 0
duplicate = df['order_id'].duplicated(keep=False)
bad_email = ~df['email'].str.contains('@', na=False)

issues = []
if negative.any():
    issues.append('Negative amounts found')
if duplicate.any():
    issues.append('Duplicate order_id')
if bad_email.any():
    issues.append('Invalid email format')

if issues:
    print('❌ Quality issues detected:')
    for issue in issues:
        print('-', issue)
    bad_rows = df[negative | duplicate | bad_email]
    print('\nQuarantined rows:')
    display(bad_rows)  # Jupyter built-in; use print(bad_rows) outside a notebook
else:
    print('✅ All checks passed')
```


### 📊 Optional: Visualize Check Counts

```python
import matplotlib.pyplot as plt

labels = ['amount>=0', 'order_id unique', 'email contains @']
fails = [
    int((df['amount'] < 0).any()),
    int(df['order_id'].duplicated().any()),
    int((~df['email'].str.contains('@')).any()),
]
passes = [1 - f for f in fails]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(labels, passes, label='pass')
ax.bar(labels, fails, bottom=passes, label='fail')
ax.set_ylabel('flag (0/1)')
ax.set_title('Validation Checks Summary')
ax.legend(loc='upper right')
plt.xticks(rotation=10)
plt.tight_layout()
plt.show()
```


## 🔔 6️⃣ Metrics & Alerting — Ensuring Visibility

Once validation rules are in place, the next step is **observability**: tracking the health of your data and alerting when something goes wrong.

### 🧠 Why It Matters
Without metrics, failures go unnoticed until business users complain. Metrics turn raw logs into **quantifiable quality KPIs** (row counts, null %, timeliness).

### ⚙️ Typical Metrics
| Metric | Description | Example |
|---------|--------------|----------|
| **Row Count** | Number of records per batch | Expected ≈ 10,000; alert if < 9,000 |
| **Null Percentage** | Share of null values in a column | `email` null % > 5% → alert |
| **Freshness Delay** | Time elapsed since the last successful load | > 2 hours → warning |
| **Distribution Change** | Compare min/max/mean to a baseline | `mean(price)` deviates more than 30% from baseline → alert |
| **Execution Time** | Detect long-running tasks | Runtime exceeds the job's configured threshold → warning |
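
The metrics in the table can be computed per batch with a few lines of pandas. The sketch below is illustrative only: `compute_metrics`, its parameters, the baseline value, and the sample batch are assumptions, and the `now` argument exists so the freshness calculation is reproducible.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

def compute_metrics(df, last_loaded_at, baseline_mean_price, now=None):
    """Return row count, null %, freshness delay and mean drift for one batch."""
    now = now or datetime.now(timezone.utc)
    mean_price = df["price"].mean()
    return {
        "row_count": len(df),
        "null_pct_email": df["email"].isna().mean() * 100,
        "freshness_hours": (now - last_loaded_at).total_seconds() / 3600,
        # Signed relative change of the batch mean against the stored baseline.
        "mean_price_change_pct": (mean_price - baseline_mean_price) / baseline_mean_price * 100,
    }

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
batch = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
})
metrics = compute_metrics(
    batch,
    last_loaded_at=now - timedelta(hours=3),
    baseline_mean_price=20.0,
    now=now,
)
print(metrics)
```

With this sample, one of four emails is null (25%), the last load was 3 hours ago, and the mean price of 25 sits 25% above the assumed baseline of 20.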

### 🔧 Implementation Steps
1. **Collect** metrics in each pipeline stage (pre/post-load).
2. **Store** them in a monitoring sink (CloudWatch, Azure Monitor, Prometheus).
3. **Visualize** using dashboards (Grafana, Power BI, Looker).
4. **Alert** when metrics exceed thresholds.
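
The collect, store, and alert steps can be sketched with the standard library alone. This is a rough outline, not a monitoring client: the `THRESHOLDS` rules mirror the examples in the metrics table, and `evaluate` and `to_record` are hypothetical helpers. Shipping the record to CloudWatch, Azure Monitor, or Prometheus is left out.

```python
import json
from datetime import datetime, timezone

# Hypothetical rules: each returns True when the metric breaches its threshold.
THRESHOLDS = {
    "row_count": lambda v: v < 9000,
    "null_pct_email": lambda v: v > 5,
    "freshness_hours": lambda v: v > 2,
}

def evaluate(metrics: dict) -> list[str]:
    """Return the names of the metrics that breach their threshold."""
    return [
        name
        for name, breached in THRESHOLDS.items()
        if name in metrics and breached(metrics[name])
    ]

def to_record(stage: str, metrics: dict, breaches: list[str]) -> str:
    """Serialize one metrics sample as a JSON line for a monitoring sink."""
    return json.dumps({
        "stage": stage,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
        "breaches": breaches,
    })

sample = {"row_count": 8700, "null_pct_email": 1.2, "freshness_hours": 0.5}
breaches = evaluate(sample)
print(to_record("post-load", sample, breaches))
```

Keeping thresholds in one mapping makes them easy to review and to move into configuration later.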

### ☁️ Cloud Examples
| Cloud | Metrics | Alerting |
|--------|----------|-----------|
| **AWS** | CloudWatch Metrics, Glue Job Run stats | SNS Topics, SES Email |
| **Azure** | Monitor Metrics, Log Analytics Workspace | Alerts → Logic Apps → Teams/Email |

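### 🧩 Example Code (sketch)
```python
def send_alert(subject: str, body: str) -> None:
    # Placeholder: swap in SNS, Logic Apps, Slack, or email in a real pipeline
    print(f'[ALERT] {subject}: {body}')

metrics = {
    'row_count': len(df),  # df is the batch being checked
    'null_pct_email': df['email'].isna().mean() * 100,
}
if metrics['null_pct_email'] > 5:
    send_alert('DQ Warning: high null rate', f"Null % = {metrics['null_pct_email']:.2f}")
```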


## 🧬 7️⃣ Lineage & Metadata — Understanding the Journey

Lineage and metadata management ensure **traceability** and **transparency** across the data lifecycle.

### 🧠 Key Concepts
| Term | Meaning | Example |
|------|----------|----------|
| **Lineage** | How data moves & transforms from source → target | Orders DB → ETL → Redshift Table |
| **Metadata** | Descriptive info about datasets | column types, owners, refresh frequency |
| **Business Glossary** | Common definitions across teams | “Active Customer = purchase in 90 days” |
| **Provenance** | Source of truth for each field | `revenue` = `orders.amount × rate` |

### 🧱 Process to Capture Lineage
1. **Instrument pipelines** to log input/output tables.
2. **Centralize metadata** in a catalog.
3. **Visualize dependencies** (DAG graphs).
4. **Enable impact analysis** — find downstream tables affected by a schema change.
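
Impact analysis reduces to a graph walk once input and output tables are logged. The sketch below assumes lineage has already been collected into a simple adjacency mapping; the table names in `LINEAGE` and the `downstream` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
LINEAGE = {
    "orders_db.orders": ["staging.orders"],
    "staging.orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["bi.sales_dashboard", "bi.finance_report"],
}

def downstream(table: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every dataset affected by a change to `table`."""
    seen = {table}
    queue = deque([table])
    affected = []
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:  # guard against revisiting shared dependencies
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected

print(downstream("staging.orders", LINEAGE))
# ['warehouse.fact_orders', 'bi.sales_dashboard', 'bi.finance_report']
```

A catalog tool performs the same traversal at scale; the value of the sketch is showing that a schema change in staging reaches both BI assets.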

### ☁️ Tooling
| Domain | Azure | AWS | Open Source |
|---------|--------|-----|--------------|
| Catalog & Lineage | **Purview** | **Glue Data Catalog** | **OpenMetadata**, DataHub |
| Transformation Tracking | ADF, Synapse | Glue Workflows | Airflow Lineage backend |
| Glossary & Tags | Purview Business Glossary | Glue Tags | DataHub, Amundsen |

### 🧩 Example Workflow
```text
Source DB (Orders)
   │  extract
   ▼
Staging (CSV/Parquet in S3)
   │  transform
   ▼
Warehouse (FactOrders)
   │  load
   ▼
Dashboard (Power BI)
```
Each arrow represents lineage; metadata includes owners, timestamps, schema, and job IDs.
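
One way to record the metadata attached to each arrow is a small event object per hop. The `LineageEvent` class, its field names, and the job IDs, owners, and dataset names below are illustrative assumptions, not the format of any particular catalog.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    """One hop in the workflow: which job moved data from where to where."""
    job_id: str
    step: str
    source: str
    target: str
    owner: str
    schema: dict
    run_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

ORDER_SCHEMA = {"order_id": "int", "amount": "decimal"}  # assumed columns

events = [
    LineageEvent("job-001", "extract", "orders_db.orders", "staging.orders", "data-eng", ORDER_SCHEMA),
    LineageEvent("job-002", "transform", "staging.orders", "warehouse.FactOrders", "data-eng", ORDER_SCHEMA),
    LineageEvent("job-003", "load", "warehouse.FactOrders", "powerbi.orders_dashboard", "bi-team", ORDER_SCHEMA),
]

for event in events:
    print(asdict(event))
```

Because each event's target is the next event's source, the list can be stitched back into the end-to-end path shown in the diagram.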


## 🔐 8️⃣ Security & Compliance — Protecting Data Integrity & Privacy

Security & governance ensure only the **right people** access the **right data** in the **right way**.

### 🧠 Pillars of Data Security
1. **Access Control** — Limit who can read/write data.
   - AWS IAM policies, Azure RBAC.
   - Example: only `Finance-Analyst` role can read revenue tables.
2. **Encryption**
   - **At Rest:** AWS KMS, Azure Key Vault.
   - **In Transit:** HTTPS/TLS connections.
3. **Network Isolation**
   - Private Endpoints, VPC Peering, Service Endpoints.
4. **Auditing & Logging**
   - CloudTrail (AWS), Activity Logs (Azure).
5. **PII Handling**
   - Mask or tokenize sensitive columns (`email`, `phone`, `ssn`).
   - Example: `SELECT SHA2(email, 256) AS email_hash`.
6. **Compliance & Retention**
   - GDPR, CCPA, HIPAA → define retention policies.
   - Example: delete logs older than 7 years.
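
For the PII handling pillar, note that an unkeyed hash of an email can be reversed by hashing a list of known addresses and comparing. A keyed hash (HMAC) is one common mitigation. The sketch below is a minimal illustration: `pseudonymize`, `mask_email`, and the demo key are assumptions, and in practice the key would come from a secret store such as KMS or Key Vault.

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Keyed SHA-256 (HMAC) of a normalized value: stable for joins, useless without the key."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()

def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}" if domain else "***"

KEY = b"demo-key-load-from-a-secret-store"  # placeholder; never hard-code real keys

token = pseudonymize("Alice@Example.com", KEY)
print(len(token))                       # 64 hex characters
print(mask_email("alice@example.com"))  # a***@example.com
```

Pseudonymized values keep the same token for the same input, so tables can still be joined on them; masked values are meant for display only.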

### ☁️ Comparison
| Category | Azure | AWS |
|-----------|--------|-----|
| Access Control | Azure RBAC (identities in Microsoft Entra ID, formerly Azure AD) | IAM Policies / Lake Formation |
| Encryption Mgmt | Key Vault | KMS |
| Audit Logs | Activity Logs + Monitor | CloudTrail + CloudWatch |
| PII Governance | Purview data classification (sensitive data scanning) | Macie Sensitive Data Discovery |

### 🧩 Workflow Example
1. Developer deploys ETL → assigns minimal permissions.
2. Data encrypted via KMS/Key Vault.
3. Purview/Glue catalog tags sensitive columns.
4. Monitor audit logs → alert on unauthorized access.
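
The last step, alerting on unauthorized access, can be sketched as a scan of audit events against a read policy. Everything here is hypothetical: `READ_POLICY`, the event fields, and the sample log stand in for what CloudTrail or Azure Activity Logs would provide in a different format.

```python
# Hypothetical policy: which roles may read which tables.
READ_POLICY = {
    "finance.revenue": {"Finance-Analyst"},
    "sales.orders": {"Finance-Analyst", "Sales-Analyst"},
}

def is_allowed(role: str, table: str) -> bool:
    """Deny by default: tables without a policy entry are readable by nobody."""
    return role in READ_POLICY.get(table, set())

def unauthorized(events: list[dict]) -> list[dict]:
    """Return the audit events whose role is not permitted to read the table."""
    return [e for e in events if not is_allowed(e["role"], e["table"])]

audit_log = [
    {"user": "ana", "role": "Finance-Analyst", "table": "finance.revenue"},
    {"user": "bob", "role": "Sales-Analyst", "table": "finance.revenue"},
    {"user": "cy", "role": "Sales-Analyst", "table": "sales.orders"},
]

for event in unauthorized(audit_log):
    print(f"ALERT: {event['user']} ({event['role']}) read {event['table']}")
```

The deny-by-default lookup matches the minimal-permissions idea in step 1: a table with no policy entry produces an alert for every read.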


## 🖼️ 9️⃣ Visual: Data Quality Lifecycle

```python
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

BG = '#e6f0ff'; FILL = '#e6f0ff'; EDGE = '#2563eb'; TXT = '#111827'
W, H, GAP, PAD = 0.18, 0.22, 0.06, 0.010
Y0 = 0.39

labels = [
    ('Ingest', 'Raw data arrives'),
    ('Validate', 'Pre/Post-load checks'),
    ('Log', 'Metrics + failures'),
    ('Alert', 'Notify on breach'),
    ('Dashboard', 'DQ KPIs'),
    ('Govern', 'Catalog + lineage')
]

fig, ax = plt.subplots(figsize=(13, 3.8))
fig.patch.set_facecolor(BG); ax.set_facecolor(BG); ax.set_axis_off()

# Six boxes plus five gaps span 1.38 units, wider than a fixed 0..1 axis,
# so the x-limits are sized to the content to keep the outer boxes visible.
total_w = len(labels)*W + (len(labels)-1)*GAP
xs = [i*(W+GAP) for i in range(len(labels))]
ax.set_xlim(-0.05, total_w + 0.05); ax.set_ylim(0, 1)

def box(x, title, sub):
    # pad and rounding_size are in data units here, so they stay well below W and H
    r = FancyBboxPatch((x, Y0), W, H, boxstyle='round,pad=0.005,rounding_size=0.02',
                       fc=FILL, ec=EDGE, lw=1.6)
    ax.add_patch(r)
    ax.text(x+W/2, Y0+H*0.62, title, ha='center', va='center', fontsize=10, color=TXT, fontweight='bold')
    ax.text(x+W/2, Y0+H*0.36, sub, ha='center', va='center', fontsize=9, color=TXT)

for (t, s), x in zip(labels, xs):
    box(x, t, s)

# Arrows connect the right edge of each box to the left edge of the next.
y_mid = Y0 + H/2
for i in range(len(xs)-1):
    x1 = xs[i] + W + PAD; x2 = xs[i+1] - PAD
    ax.annotate('', xy=(x2, y_mid), xytext=(x1, y_mid),
                arrowprops=dict(arrowstyle='->', lw=2, color='#4b5563',
                                shrinkA=0, shrinkB=0, mutation_scale=12))

plt.tight_layout()
plt.show()
```


## ☁️ 🔟 Cloud Mapping (AWS + Azure)

| Purpose | Azure | AWS |
|---------|-------|-----|
| Catalog & Lineage | Microsoft Purview (formerly Azure Purview) | AWS Glue Data Catalog |
| Quality (built-in) | ADF Data Flows, Synapse | Glue Data Quality (built on open-source Deequ) |
| Alerting | Azure Monitor + Logic Apps | CloudWatch + SNS |
| Security | RBAC, Key Vault, Private Links | IAM, KMS, Lake Formation |

## 💡 1️⃣1️⃣ Practice / Assignment

1) Add **3 data checks** to your Session 3 pipeline and **quarantine** failed rows.

2) Emit a **quality summary** (rows passed/failed) and display a simple bar chart.

3) Sketch **lineage** for one dataset: source → staging → curated → BI.

4) Define **roles and permissions** (owner, steward, consumer) for one domain.
