# 01 - Trust: Data Quality & Reliability

This notebook demonstrates **AI Readiness Pillar #1: Trust**.

You will:
- Load intentionally imperfect raw data
- Define a **data contract** (schema + rules)
- Use **Great Expectations** to validate data
- Apply a **quality gate** (fail or quarantine)
- Produce a cleaned "Silver" dataset

> Works in **VS Code** (local) and **Azure Databricks**.


## 📺 Related Video

This notebook corresponds to:

**AI Readiness Explained - What "AI-Ready" Really Means (Part 1)**  
▶️ https://youtu.be/DfG2y-90-wc

If you prefer to understand the *architecture and design reasoning* before running the code, watch the video first.


In [None]:
# --- Project bootstrap (do not remove) ---
import sys
from pathlib import Path

PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

# Set this for local VS Code (default repo layout)
BASE_PATH = Path("..")  # if running from notebooks/
DATA_RAW = BASE_PATH / "data" / "raw"
DATA_CURATED = BASE_PATH / "data" / "curated"

DATA_RAW, DATA_CURATED


In [None]:
customers = pd.read_csv(DATA_RAW / "customers.csv")
orders = pd.read_csv(DATA_RAW / "orders.csv")
events = pd.read_csv(DATA_RAW / "events.csv")

customers.head(), orders.head(), events.head()


## Define a simple data contract

A data contract is a **shared agreement** between producers and consumers:
- expected columns + types
- required fields
- allowed values / ranges
- uniqueness constraints

Below we define a lightweight contract in Python (you can store this as YAML/JSON in real systems).

In [None]:
customer_contract = {
    "required_columns": ["customer_id","email","department","country","created_at"],
    "unique": ["customer_id"],
    "not_null": ["customer_id","department","created_at"],
}

order_contract = {
    "required_columns": ["order_id","customer_id","amount","currency","order_ts"],
    "unique": ["order_id"],
    "not_null": ["order_id","customer_id","amount","order_ts"],
    "ranges": {"amount": (0, None)},  # no negative amounts
}

contracts = {"customers": customer_contract, "orders": order_contract}
contracts
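
A contract is only useful if something checks data against it. The following is a minimal sketch, not part of the original notebook: `check_contract` is a hypothetical helper that reads the same dictionary keys used above (`required_columns`, `unique`, `not_null`, `ranges`) and reports violations using plain pandas. The contract is round-tripped through JSON to mimic loading it from a file, and the toy `demo_orders` frame is invented for illustration.

```python
import json
import pandas as pd


def check_contract(df, contract):
    """Return a list of contract violations; an empty list means the frame passes."""
    problems = []
    missing = [c for c in contract.get("required_columns", []) if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in contract.get("unique", []):
        if col in df.columns and df[col].duplicated().any():
            problems.append(f"duplicate values in {col}")
    for col in contract.get("not_null", []):
        if col in df.columns and df[col].isna().any():
            problems.append(f"null values in {col}")
    for col, (low, high) in contract.get("ranges", {}).items():
        if col not in df.columns:
            continue
        values = df[col].dropna()  # nulls are reported by the not_null rule instead
        if low is not None and (values < low).any():
            problems.append(f"{col} below {low}")
        if high is not None and (values > high).any():
            problems.append(f"{col} above {high}")
    return problems


# Round-trip through JSON to mimic reading the contract from a file.
# JSON turns the (0, None) tuple into the list [0, null]; unpacking still works.
demo_contract = json.loads(json.dumps({
    "required_columns": ["order_id", "amount", "currency"],
    "unique": ["order_id"],
    "not_null": ["amount"],
    "ranges": {"amount": (0, None)},
}))

# Toy data, invented for illustration only.
demo_orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, None]})

violations = check_contract(demo_orders, demo_contract)
print(violations)
```

Returning a list of messages rather than raising on the first failure lets a caller see every broken rule in one pass, which is usually what you want when profiling a new source.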


## Validate with Great Expectations

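Great Expectations (GX) lets you declare data rules as *expectations*, group them into reusable expectation suites, and validate batches of data against them.

Install:
- `pip install great_expectations` (the cells below use the GX 1.x API, for example `context.data_sources`)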


In [None]:
import pandas as pd
import numpy as np
import great_expectations as gx
import great_expectations.expectations as gxe

In [None]:
context = gx.get_context()  # EphemeralDataContext by default
data_source = context.data_sources.add_pandas(name="pandas_ds")

In [None]:
customers_asset = data_source.add_dataframe_asset(name="customers_df")
orders_asset = data_source.add_dataframe_asset(name="orders_df")

customers_batch_def = customers_asset.add_batch_definition_whole_dataframe("whole_df")
orders_batch_def = orders_asset.add_batch_definition_whole_dataframe("whole_df")

In [None]:
customers_batch = customers_batch_def.get_batch(batch_parameters={"dataframe": customers})
orders_batch = orders_batch_def.get_batch(batch_parameters={"dataframe": orders})

In [None]:


# Expectations (examples). In GX 1.x a batch validates one expectation object at a time.
checks = [
    ("customers", customers_batch, gxe.ExpectTableColumnsToMatchOrderedList(column_list=customer_contract["required_columns"])),
    ("customers", customers_batch, gxe.ExpectColumnValuesToBeUnique(column="customer_id")),
    ("customers", customers_batch, gxe.ExpectColumnValuesToNotBeNull(column="customer_id")),
    ("customers", customers_batch, gxe.ExpectColumnValuesToNotBeNull(column="created_at")),
    ("orders", orders_batch, gxe.ExpectColumnValuesToBeUnique(column="order_id")),
    ("orders", orders_batch, gxe.ExpectColumnValuesToNotBeNull(column="amount")),
    ("orders", orders_batch, gxe.ExpectColumnValuesToBeBetween(column="amount", min_value=0)),
]

# Run every check against its batch and keep the validation result
results = [(dataset, type(expectation).__name__, batch.validate(expectation)) for dataset, batch, expectation in checks]

# Summarise
pd.DataFrame([{"dataset": d, "expectation": name, "success": r.success} for d, name, r in results])
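
The expectations in this section are listed by hand, which means they can drift from the contract defined earlier. As a rough sketch, the contract could drive them instead. `contract_to_expectations` below is a hypothetical helper, not part of Great Expectations: it only maps contract keys to snake_case expectation type names and keyword arguments. Turning each pair into an expectation object is deliberately left out, because constructor details depend on the installed GX version.

```python
def contract_to_expectations(contract):
    """Translate a contract dict into (expectation_type, kwargs) pairs."""
    specs = []
    if "required_columns" in contract:
        specs.append((
            "expect_table_columns_to_match_ordered_list",
            {"column_list": list(contract["required_columns"])},
        ))
    for col in contract.get("unique", []):
        specs.append(("expect_column_values_to_be_unique", {"column": col}))
    for col in contract.get("not_null", []):
        specs.append(("expect_column_values_to_not_be_null", {"column": col}))
    for col, (low, high) in contract.get("ranges", {}).items():
        specs.append((
            "expect_column_values_to_be_between",
            {"column": col, "min_value": low, "max_value": high},
        ))
    return specs


# A cut-down contract in the same shape as order_contract, for illustration.
demo_contract = {
    "required_columns": ["order_id", "amount"],
    "unique": ["order_id"],
    "not_null": ["order_id", "amount"],
    "ranges": {"amount": (0, None)},
}

specs = contract_to_expectations(demo_contract)
for expectation_type, kwargs in specs:
    print(expectation_type, kwargs)
```

Keeping the contract as the single source of truth means a new rule is added in one place and shows up in validation automatically.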


## Quality Gate: quarantine bad rows

Instead of "hoping" downstream systems handle issues, implement **quality gates**.

Here we:
- fix a few simple issues
- quarantine records that still violate hard rules
- write Silver outputs
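
Quarantined rows are easier to act on when each one records which rule it broke. The sketch below shows one possible way to do that with pandas. `split_with_reasons`, the rule labels, and the `quarantine_reason` column are all illustrative names, not part of the notebook, and the toy frame is invented.

```python
import pandas as pd


def split_with_reasons(df, rules):
    """Split df into (clean, quarantined) frames.

    rules maps a reason label to a function that returns a boolean Series,
    True for rows that violate the rule. Quarantined rows get a
    quarantine_reason column listing every rule they broke.
    """
    reasons = pd.Series("", index=df.index)
    for label, is_bad in rules.items():
        mask = is_bad(df).astype(bool)
        reasons[mask] = reasons[mask] + label + ";"
    bad = reasons != ""
    quarantined = df[bad].assign(quarantine_reason=reasons[bad].str.rstrip(";"))
    return df[~bad].copy(), quarantined


# Toy data, invented for illustration only.
demo_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [10.0, -5.0, None, 7.5],
    "currency": ["GBP", "GBP", "USD", None],
})

demo_rules = {
    "negative_amount": lambda d: d["amount"] < 0,
    "missing_amount": lambda d: d["amount"].isna(),
    "missing_currency": lambda d: d["currency"].isna(),
}

clean, quarantined = split_with_reasons(demo_orders, demo_rules)
print(clean)
print(quarantined)
```

A row that breaks several rules gets all of their labels joined with `;`, so nothing is lost when the quarantine file is reviewed later.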


In [None]:
# Basic cleaning examples
customers_clean = customers.copy()

# standardise country codes a bit
customers_clean["country"] = customers_clean["country"].replace({
    "United Kingdom": "UK",
    "U.K.": "UK",
    "": np.nan
})

# drop duplicate customer_id keeping first
customers_clean = customers_clean.drop_duplicates(subset=["customer_id"], keep="first")

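# quarantine: rows missing any field the contract marks as not_null
cust_bad = customers_clean[customer_contract["not_null"]].isna().any(axis=1)
cust_quarantine = customers_clean[cust_bad].copy()
customers_silver = customers_clean[~cust_bad].copy()

orders_clean = orders.copy()
# quarantine: contract violations (null required fields, negative amounts) plus missing currency.
# A null amount is checked explicitly because NaN < 0 evaluates to False.
order_bad = (
    orders_clean[order_contract["not_null"]].isna().any(axis=1)
    | (orders_clean["amount"] < 0)
    | orders_clean["currency"].isna()
)
order_quarantine = orders_clean[order_bad].copy()
orders_silver = orders_clean[~order_bad].copy()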

len(customers), len(customers_silver), len(cust_quarantine), len(orders), len(orders_silver), len(order_quarantine)
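
Quarantining handles individual bad rows, but a gate can also fail the whole run when too large a share of a dataset is rejected. The following is a minimal sketch of that "fail" mode. `QualityGateError`, `enforce_quality_gate`, and the 5% default threshold are assumptions chosen for illustration, and the row counts are made up.

```python
class QualityGateError(Exception):
    """Raised when too many rows fail validation to trust the batch."""


def enforce_quality_gate(name, total_rows, quarantined_rows, max_quarantine_ratio=0.05):
    """Return the quarantine ratio, or raise if it exceeds the allowed maximum."""
    if total_rows == 0:
        raise QualityGateError(f"{name}: no rows received")
    ratio = quarantined_rows / total_rows
    if ratio > max_quarantine_ratio:
        raise QualityGateError(
            f"{name}: {ratio:.1%} of rows quarantined (limit {max_quarantine_ratio:.1%})"
        )
    return ratio


# Illustrative numbers only.
print(enforce_quality_gate("orders", total_rows=1000, quarantined_rows=12))

try:
    enforce_quality_gate("customers", total_rows=200, quarantined_rows=30)
except QualityGateError as err:
    print("gate failed:", err)
```

In this notebook the call would take `len(orders)` and `len(order_quarantine)` as inputs and run before the Silver files are written, so a badly broken batch never reaches downstream consumers.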


In [None]:
DATA_CURATED.mkdir(parents=True, exist_ok=True)

customers_silver.to_csv(DATA_CURATED / "customers_silver.csv", index=False)
orders_silver.to_csv(DATA_CURATED / "orders_silver.csv", index=False)

cust_quarantine.to_csv(DATA_CURATED / "customers_quarantine.csv", index=False)
order_quarantine.to_csv(DATA_CURATED / "orders_quarantine.csv", index=False)

print("Wrote curated datasets to:", DATA_CURATED.resolve())


## Why this matters for AI

- **RAG**: noisy docs/metadata → wrong retrieval → hallucinations
- **Agents**: bad data → wrong actions at scale
- **Analytics-to-action**: inconsistent definitions → conflicting decisions

Quality gates make AI systems **predictable**.
