# 📘 Chapter 4: Data Journey and Data Storage

This chapter focuses on how **data evolves across a production ML pipeline**, and how to manage and validate it with tools like **ML Metadata (MLMD)**, **TensorFlow Metadata (TFMD)**, and **TensorFlow Data Validation (TFDV)**. It also introduces storage patterns like **feature stores**, **data warehouses**, and **data lakes**.


## 🧭 Data Journey & Data Provenance

**Data journey**: How data flows and is transformed from collection through training to inference.

**Data provenance**: The recorded lineage of the data, which lets you trace each artifact back to its origin and see every change made along the way.

### 🧱 Artifacts
Objects consumed or produced by ML pipeline steps, including raw data, transformed data, models, schemas, and metrics.


---

## 🗃️ ML Metadata (MLMD)

MLMD is a library for storing and retrieving metadata about ML workflows, such as:
- Artifacts (inputs/outputs)
- Executions (runs of components)
- Contexts (e.g. pipeline ID, project ID)

```python
# MLMD metadata model relationships
# Artifact <-- event --> Execution <-- association --> Context
```

**Usage examples:**
- Compare models trained on the same dataset
- Trace back how/when data was transformed
- Build DAGs of pipeline executions
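
The relationships above can be modeled in a few lines of plain Python. This is a toy in-memory sketch for illustration only, not the MLMD API: the `ToyMetadataStore` class, its method names, and the artifact names are all made up for this example.

```python
from collections import defaultdict


class ToyMetadataStore:
    """Toy stand-in for a metadata store (hypothetical, not the MLMD API)."""

    def __init__(self):
        self.inputs = defaultdict(list)    # execution -> input artifacts
        self.producer = {}                 # artifact -> execution that produced it
        self.contexts = defaultdict(list)  # context -> executions (associations)

    def record_execution(self, execution, inputs, outputs, context):
        """Record one component run and the events linking it to artifacts."""
        self.inputs[execution].extend(inputs)
        for artifact in outputs:
            self.producer[artifact] = execution
        self.contexts[context].append(execution)

    def lineage(self, artifact):
        """Return every upstream artifact that contributed to `artifact`."""
        upstream = []
        execution = self.producer.get(artifact)
        if execution is None:
            return upstream  # raw artifact with no recorded producer
        for parent in self.inputs[execution]:
            upstream.append(parent)
            upstream.extend(self.lineage(parent))
        return upstream


store = ToyMetadataStore()
store.record_execution("transform_run_1", ["raw_data"], ["transformed_data"], "pipeline_a")
store.record_execution("trainer_run_1", ["transformed_data"], ["model_v1"], "pipeline_a")

print(store.lineage("model_v1"))  # ['transformed_data', 'raw_data']
```

Walking the producer links backwards is what "tracing back how data was transformed" amounts to. A real metadata store persists the same kind of links in a database rather than in dictionaries.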

---

## 🧬 Using a Schema (TFMD + TFDV)

A **schema** defines expectations for input data:
- Feature names and types (e.g., int, float, string)
- Whether features are required
- Expected value ranges (domains) or valency (the number of values per example)

Schemas can:
- Catch anomalies (missing values, wrong types)
- Track data evolution
- Guide feature engineering

```protobuf
# Example: a feature entry in a TFMD schema (text-format protobuf)
feature {
  name: "some_feature"
  type: BYTES
  presence {
    min_fraction: 1.0
  }
  not_in_environment: "Serving"
}
```
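
To make the idea of "expectations" concrete, here is a minimal pure-Python sketch of the checks a schema enables. It is not how TFDV is implemented: `FeatureSpec` and `find_anomalies` are hypothetical names, and only two checks (presence fraction and type) are modeled.

```python
from dataclasses import dataclass


@dataclass
class FeatureSpec:
    """Hypothetical, simplified stand-in for a schema feature entry."""
    name: str
    type: type
    min_fraction: float = 1.0  # share of examples that must contain the feature


def find_anomalies(examples, specs):
    """Return human-readable anomaly messages for a list of dict examples."""
    anomalies = []
    for spec in specs:
        values = [ex[spec.name] for ex in examples if ex.get(spec.name) is not None]
        fraction = len(values) / len(examples) if examples else 0.0
        if fraction < spec.min_fraction:
            anomalies.append(f"{spec.name}: present in {fraction:.0%} of examples")
        if any(not isinstance(v, spec.type) for v in values):
            anomalies.append(f"{spec.name}: unexpected type")
    return anomalies


specs = [FeatureSpec("age", int), FeatureSpec("country", str)]
examples = [
    {"age": 31, "country": "DE"},
    {"age": "unknown", "country": "FR"},
    {"age": 45},
]
print(find_anomalies(examples, specs))
# ['age: unexpected type', 'country: present in 67% of examples']
```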

---

## 🧪 TFDV Schema Validation

TFDV can compute summary statistics for a dataset and infer an initial schema from them:
```python
import tensorflow_data_validation as tfdv

dataset_stats = tfdv.generate_statistics_from_tfrecord("train.tfrecord")
schema = tfdv.infer_schema(statistics=dataset_stats)
tfdv.display_schema(schema)
```

```python
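# Validate a different split (hypothetical path) against the schema inferred above;
# validating the same statistics the schema was inferred from finds nothing.
eval_stats = tfdv.generate_statistics_from_tfrecord("eval.tfrecord")
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)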
```

---

## 🌍 Schema Environments

Schemas can differ across environments (training vs serving):
```protobuf
default_environment: "Training"
default_environment: "Serving"
feature {
  name: "label"
  not_in_environment: "Serving"
}
```
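With this schema the label is expected in training data, but its absence from serving data is not reported as an anomaly. To validate against a specific environment, pass its name to `tfdv.validate_statistics` through the `environment` argument.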
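
The effect of `not_in_environment` can be sketched without TFDV. The dict below is a hypothetical mirror of the schema above (with one ordinary feature added), and `expected_features` is a made-up helper, not a library function.

```python
def expected_features(schema, environment):
    """List feature names that should be present in the given environment."""
    return [
        feature["name"]
        for feature in schema["features"]
        if environment not in feature.get("not_in_environment", [])
    ]


# Hypothetical dict mirror of the schema, plus one ordinary feature.
schema = {
    "default_environments": ["Training", "Serving"],
    "features": [
        {"name": "some_feature"},
        {"name": "label", "not_in_environment": ["Serving"]},
    ],
}

print(expected_features(schema, "Training"))  # ['some_feature', 'label']
print(expected_features(schema, "Serving"))   # ['some_feature']
```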

---

## 📉 Detecting Changes Across Datasets

TFDV supports:
- `skew_comparator`: Compares training data against serving data
- `drift_comparator`: Compares consecutive spans of the same data to detect changes over time

```protobuf
# Drift threshold for one feature in the schema proto
feature {
  name: "some_feature"
  drift_comparator {
    infinity_norm { threshold: 0.05 }
  }
}
```
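
For a categorical feature, the L-infinity distance behind this threshold is the largest absolute difference in relative frequency of any single value between the two datasets. The sketch below computes it by hand, assuming both inputs are non-empty lists of category values. The function name and the sample data are made up for illustration, and TFDV's own computation may differ in detail.

```python
from collections import Counter


def l_infinity_distance(values_a, values_b):
    """Largest absolute difference in relative frequency of any category.

    Assumes both lists are non-empty.
    """
    freq_a, freq_b = Counter(values_a), Counter(values_b)
    categories = set(freq_a) | set(freq_b)
    return max(
        abs(freq_a[c] / len(values_a) - freq_b[c] / len(values_b))
        for c in categories
    )


yesterday = ["card"] * 70 + ["cash"] * 30
today = ["card"] * 60 + ["cash"] * 30 + ["voucher"] * 10

distance = l_infinity_distance(yesterday, today)
threshold = 0.05  # same value as the schema example
print(f"distance={distance:.2f}, drift={distance > threshold}")
# distance=0.10, drift=True
```

Here the share of `card` drops from 0.70 to 0.60 and `voucher` appears with a share of 0.10, so the distance is 0.10 and exceeds the 0.05 threshold.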

---

## 🏢 Enterprise Data Storage

Efficient storage is critical for scalable ML. This section compares feature stores, databases, data warehouses, and data lakes.

### 🔧 Feature Stores
Centralized systems for:
- Storing features + metadata
- Serving features in real-time or batch
- Enabling reuse across teams

Feature stores help:
- Avoid duplicate feature engineering
- Version & time-travel features
- Serve features with low latency

```python
# Conceptual use only: `feature_store` is a hypothetical client, not a real library
feature_store.put("user_age_bucket", user_id="123", value="25-34")
features = feature_store.get(["user_age_bucket"], user_id="123")
```
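
Versioning and time travel can be illustrated with a small in-memory sketch. `ToyFeatureStore` and its `put`/`get` signatures are assumptions made up for this example and do not correspond to any particular feature store product. Timestamps are plain integers for simplicity.

```python
import bisect
from collections import defaultdict


class ToyFeatureStore:
    """In-memory sketch of a versioned feature store (hypothetical API)."""

    def __init__(self):
        # (feature name, entity ID) -> time-sorted list of (timestamp, value)
        self._history = defaultdict(list)

    def put(self, name, entity_id, value, timestamp):
        history = self._history[(name, entity_id)]
        history.append((timestamp, value))
        history.sort(key=lambda item: item[0])  # keep versions ordered by time

    def get(self, names, entity_id, as_of):
        """Return the latest value of each feature known at time `as_of`."""
        result = {}
        for name in names:
            history = self._history.get((name, entity_id), [])
            timestamps = [ts for ts, _ in history]
            index = bisect.bisect_right(timestamps, as_of)
            result[name] = history[index - 1][1] if index else None
        return result


store = ToyFeatureStore()
store.put("user_age_bucket", "123", "18-24", timestamp=1)
store.put("user_age_bucket", "123", "25-34", timestamp=5)

print(store.get(["user_age_bucket"], "123", as_of=3))  # {'user_age_bucket': '18-24'}
print(store.get(["user_age_bucket"], "123", as_of=9))  # {'user_age_bucket': '25-34'}
```

Reading with `as_of` set to the time of each training example is what prevents values that only became known later from leaking into training.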

---

### 🧊 Data Warehouses vs Databases

| **Data Warehouse**          | **Database**                    |
|-----------------------------|----------------------------------|
| Meant for **analytics**     | Meant for **transactions**       |
| Historical data, batched    | Real-time operations             |
| Schema-on-write             | Schema-on-write                  |
| Complex queries             | Fast, simple queries             |

---

### 💧 Data Lakes vs Warehouses

| **Data Lake**                          | **Data Warehouse**                      |
|----------------------------------------|------------------------------------------|
| Stores **raw** data                    | Stores **processed & structured** data   |
| Schema-on-read                         | Schema-on-write                          |
| Great flexibility, may become swampy   | High consistency and performance         |

**Data Swamp**: When a data lake becomes disorganized and unusable.
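
The difference between schema-on-write and schema-on-read can be shown with a short sketch. Everything here is illustrative: the two-column `SCHEMA`, the list-backed `warehouse` and `lake`, and the helper names are assumptions, not a real storage API.

```python
import json

SCHEMA = {"user_id": int, "amount": float}  # hypothetical two-column schema


def conform(record):
    """Coerce a record to SCHEMA, raising if a field is missing or invalid."""
    return {name: cast(record[name]) for name, cast in SCHEMA.items()}


warehouse = []  # schema-on-write: only conforming rows are ever stored
lake = []       # schema-on-read: raw text is stored as-is


def write_to_warehouse(record):
    warehouse.append(conform(record))  # bad rows are rejected here


def write_to_lake(raw_line):
    lake.append(raw_line)  # nothing is checked on write


def read_from_lake():
    rows = []
    for raw_line in lake:
        try:
            rows.append(conform(json.loads(raw_line)))
        except (KeyError, ValueError, TypeError):
            continue  # malformed rows surface only at read time
    return rows


write_to_lake('{"user_id": 1, "amount": "9.5"}')
write_to_lake('{"user_id": "oops"}')
print(read_from_lake())  # [{'user_id': 1, 'amount': 9.5}]

write_to_warehouse({"user_id": 2, "amount": 3})
try:
    write_to_warehouse({"user_id": "oops"})
except (KeyError, ValueError):
    print("rejected on write")
```

The lake accepts both lines and silently drops the malformed one at query time, which is the flexibility that can turn into a swamp. The warehouse refuses the bad row up front.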

---


## 🧠 Keyword & Concept List

| **Term**                  | **Definition** |
|---------------------------|----------------|
| **MLMD**                  | Library to track pipeline metadata (artifacts, runs, etc.) |
| **TFMD**                  | TensorFlow Metadata – defines schema for features |
| **TFDV**                  | TensorFlow Data Validation – validates data against schema |
| **Schema**                | Structured definition of expected data fields, types, valency |
| **Artifact**              | Any output/input of a pipeline component |
| **Data Provenance**       | Traceability of data transformations across pipeline |
| **Feature Store**         | Central repository of reusable, curated features |
| **Data Lake**             | Flexible store for raw or semi-structured data |
| **Data Warehouse**        | Structured store for analytics over historical data |
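| **Skew/Drift**            | Skew: distribution differences between training and serving data; drift: distribution changes in the same data over time |
| **Time Travel (ML)**      | Retrieving feature values as they were known at a past point in time, so training never uses information from the future |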

---


✅ *This notebook captures the key tools and patterns to manage metadata and storage in production ML pipelines.*
