# Chapter 65: Data and Model Lineage

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand what data and model lineage means and why it is crucial for governance, debugging, and reproducibility
- Distinguish between data lineage (tracking data transformations) and model lineage (tracking model creation and deployment)
- Capture lineage information using tools like DVC, MLflow, and custom metadata stores
- Implement a simple lineage tracking system for the NEPSE prediction pipeline
- Visualise lineage graphs to understand dependencies and impact of changes
- Use lineage for impact analysis (e.g., if a source dataset changes, which models are affected?)
- Support compliance with regulations (e.g., GDPR transparency and audit requirements) through lineage
- Integrate lineage with experiment tracking and model registries
- Recognise the challenges of lineage in complex, distributed systems

---

## Introduction

In the NEPSE prediction system, we have multiple components: raw data CSV files, feature engineering scripts, trained models, and deployed APIs. If a data quality issue is discovered in the raw data from 2023, which models are affected? If we update a feature definition, which experiment runs used the old definition? Answering these questions requires **lineage**—the ability to trace the relationships between data, code, features, models, and predictions.

Lineage is a fundamental aspect of data governance and MLOps. It enables:

- **Reproducibility**: Knowing exactly which data and code produced a model.
- **Debugging**: Tracing errors back to their source.
- **Impact analysis**: Assessing the effect of changes upstream.
- **Compliance**: Meeting regulatory requirements for explainability and audit trails.
- **Trust**: Providing transparency to stakeholders.

In this chapter, we will explore both data lineage and model lineage. We will use tools like DVC for data versioning and MLflow for model tracking, and we will build a simple lineage graph to visualise dependencies. Using the NEPSE system as a running example, we will see how lineage can be captured and used in practice.

---

## 65.1 Lineage Concepts

### 65.1.1 Data Lineage

**Data lineage** tracks the origin and transformations of data as it flows through pipelines. For the NEPSE system, this includes:

- Raw CSV files from the exchange.
- Cleaned and imputed data.
- Engineered features (e.g., lagged prices, rolling statistics).
- Training and test splits.

Each transformation step should be recorded, along with the code version and parameters used. This allows us to trace any feature value back to its source.
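
One way to picture such a record is as a small immutable structure. The sketch below is illustrative only: the `TransformationRecord` class, its field names, and the hash and commit values are hypothetical and are not part of DVC or MLflow.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransformationRecord:
    """One step in a data lineage chain (hypothetical structure)."""
    step: str             # e.g. "clean" or "features"
    input_hashes: tuple   # content hashes of the inputs
    output_hash: str      # content hash of the output
    code_version: str     # e.g. a Git commit hash
    params: dict = field(default_factory=dict)


def trace_to_source(records, output_hash):
    """Walk backwards from an output hash and list the steps that produced it."""
    by_output = {r.output_hash: r for r in records}
    chain = []
    frontier = [output_hash]
    while frontier:
        current = frontier.pop()
        record = by_output.get(current)
        if record is None:
            continue  # reached a raw source with no recorded producer
        chain.append(record.step)
        frontier.extend(record.input_hashes)
    return chain


# Placeholder hashes, commit IDs, and parameters, for illustration only.
records = [
    TransformationRecord("clean", ("raw-hash",), "cleaned-hash", "commit-a", {"impute": "ffill"}),
    TransformationRecord("features", ("cleaned-hash",), "features-hash", "commit-a", {"lags": [1, 5]}),
]
print(trace_to_source(records, "features-hash"))
```

Starting from the features file, the trace visits the `features` step and then the `clean` step, stopping at the raw data because nothing recorded produced it.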

### 65.1.2 Model Lineage

**Model lineage** tracks the creation and deployment of models. It includes:

- The training dataset (version) used.
- The code and configuration (hyperparameters) that produced the model.
- The experiment run that generated the model.
- The model version in the registry.
- The deployment environment and stage (staging, production).

Model lineage helps answer: "Which model is currently in production, and how was it trained?"

### 65.1.3 Why Both Matter

Data and model lineage are intertwined. A model is trained on a specific version of a dataset. If that dataset is updated, the model's performance may change. By linking model lineage to data lineage, we can understand the full picture.

For example, if we discover an error in the raw data from a certain period, we can query the lineage to find all models trained on that data and retrain them.

---

## 65.2 Tools for Lineage

Several tools can help capture lineage:

- **DVC** (Data Version Control): Versions datasets and pipelines, tracking dependencies between code and data.
- **MLflow**: Tracks experiments and models, and can log dataset versions.
- **Great Expectations**: Validates data and can store expectations as metadata.
- **Apache Atlas**: Enterprise‑grade data governance and lineage platform.
- **Amundsen**: Data discovery and lineage tool.
- **OpenLineage**: Open standard for lineage metadata collection.

For the NEPSE system, we will combine DVC for data versioning and pipeline tracking, and MLflow for experiment and model tracking.

---

## 65.3 Implementing Data Lineage with DVC

DVC (Data Version Control) is a tool that brings Git‑like versioning to data and machine learning pipelines. It tracks data files and the transformations applied to them.

### 65.3.1 Setting Up DVC

```bash
pip install dvc
cd nepse-project
git init
dvc init
```

DVC stores metadata in small `.dvc` files (which you commit to Git) and keeps the file contents in a local cache under `.dvc/cache`. The cached data can also be pushed to remote storage (e.g., S3, GCS).

### 65.3.2 Adding Raw Data

```bash
dvc add data/raw/nepse_2023.csv
git add data/raw/nepse_2023.csv.dvc data/raw/.gitignore
git commit -m "Add raw NEPSE 2023 data"
```

The `.dvc` file contains a hash of the data file. This hash uniquely identifies the version.
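
To check a file against the recorded hash yourself, you can compute its MD5 in Python. This is a minimal sketch that assumes DVC 3.x, which, as far as we know, records the plain MD5 of the file contents for single files; older DVC versions normalised line endings in text files first, so their hashes may differ. The helper name `file_md5` is our own.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large CSV files do not need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_md5("data/raw/nepse_2023.csv")` should then give a value you can compare with the `md5` field in `data/raw/nepse_2023.csv.dvc`.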

### 65.3.3 Defining a Pipeline

DVC allows you to define pipelines where each stage takes dependencies and produces outputs. This automatically captures lineage.

Create a `dvc.yaml` file:

```yaml
stages:
  clean:
    cmd: python scripts/clean_data.py data/raw/nepse_2023.csv data/interim/cleaned.csv
    deps:
      - scripts/clean_data.py
      - data/raw/nepse_2023.csv
    outs:
      - data/interim/cleaned.csv
  features:
    cmd: python scripts/feature_engineering.py data/interim/cleaned.csv data/processed/features.csv
    deps:
      - scripts/feature_engineering.py
      - data/interim/cleaned.csv
    outs:
      - data/processed/features.csv
  train:
    cmd: python scripts/train_model.py data/processed/features.csv models/model.pkl
    deps:
      - scripts/train_model.py
      - data/processed/features.csv
    outs:
      - models/model.pkl
    metrics:
      - metrics/accuracy.json
```

**Explanation:**  
Each stage specifies its dependencies (code and input data) and outputs. DVC tracks the hashes of all dependencies and outputs. When you run `dvc repro`, DVC checks if any dependency has changed and re‑runs the necessary stages. This ensures reproducibility and provides lineage: we know exactly which data and code produced each output.
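
The change detection described above can be approximated in a few lines. This is a simplified sketch of the idea, not DVC's actual implementation: the `recorded` and `current` dictionaries and the hash strings are made up, and real DVC also considers outputs, parameters, and the stage command.

```python
def changed_dependencies(recorded, current):
    """Return, per stage, the dependencies whose hash differs from the recorded one."""
    stale = {}
    for stage, deps in recorded.items():
        changed = sorted(
            path for path, old_hash in deps.items()
            if current.get(path) != old_hash
        )
        if changed:
            stale[stage] = changed
    return stale


# Hashes recorded at the last successful run (placeholder values).
recorded = {
    "clean": {"scripts/clean_data.py": "h1", "data/raw/nepse_2023.csv": "h2"},
    "features": {"scripts/feature_engineering.py": "h3", "data/interim/cleaned.csv": "h4"},
}
# Hashes of the files as they are now: the raw CSV has been edited.
current = {
    "scripts/clean_data.py": "h1",
    "data/raw/nepse_2023.csv": "h2-edited",
    "scripts/feature_engineering.py": "h3",
    "data/interim/cleaned.csv": "h4",
}
print(changed_dependencies(recorded, current))
```

Only `clean` is reported at first. Once it re-runs and produces a different `cleaned.csv`, the `features` stage becomes stale in turn, which is how a change propagates down the pipeline.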

### 65.3.4 Running the Pipeline

```bash
dvc repro
```

This executes the pipeline. DVC records the execution in `dvc.lock`, which captures the exact versions (hashes) of all dependencies and outputs.

### 65.3.5 Exploring Lineage with DVC

DVC provides commands to show the pipeline graph:

```bash
dvc dag
```

This displays a text‑based graph of stages. For a visual representation, you can use `dvc dag --dot` and render with Graphviz.

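DVC has no dedicated history command. The history of a pipeline output lives in Git, because every change to an output's hash is recorded in `dvc.lock`. To list the commits that changed the pipeline's data:

```bash
git log --oneline -- dvc.lock
```

Each commit listed touched `dvc.lock`, meaning the hash of some stage's dependency or output changed. To compare the tracked data between two commits, use `dvc diff <commit_a> <commit_b>`.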

### 65.3.6 Remote Storage

For collaboration, you can configure remote storage (e.g., an S3 bucket) and push the data:

```bash
dvc remote add -d myremote s3://mybucket/nepse-dvc
dvc push
```

Now the data is versioned and stored centrally.

---

## 65.4 Integrating Data Lineage with MLflow

While DVC tracks data and pipeline versions, MLflow tracks experiments and models. We can link them by logging DVC file hashes or Git commits in MLflow.

### 65.4.1 Logging Data Version in MLflow

In our training script, we can capture the DVC hash of the features file.

```python
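import mlflow
import subprocess
import yaml  # PyYAML, typically installed alongside MLflow

def get_dvc_hash(filepath, lock_file='dvc.lock'):
    # Look up the hash DVC recorded for a pipeline output in dvc.lock.
    # Assumes dvc.lock is up to date for this file, i.e. the stage that
    # produces it has already been run with `dvc repro`.
    with open(lock_file) as f:
        lock = yaml.safe_load(f) or {}
    for stage in lock.get('stages', {}).values():
        for out in stage.get('outs', []):
            if out.get('path') == filepath:
                return out.get('md5', 'unknown')
    return "unknown"  # file is not a tracked pipeline output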

with mlflow.start_run():
    # ... other logging ...
    
    # Log data version
    features_hash = get_dvc_hash('data/processed/features.csv')
    mlflow.log_param("features_data_version", features_hash)
    
    # Also log the commit hash of the code
    commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
    mlflow.log_param("git_commit", commit)
```

**Explanation:**  
By logging the DVC hash of the features file and the Git commit, we create a link between the model and the exact data and code that produced it. If we later need to reproduce the model, we can check out that Git commit and run `dvc checkout` to restore the data.

### 65.4.2 Using DVC and MLflow Together

A common workflow:

1. Data scientists develop features and update the DVC pipeline.
2. They run `dvc repro` to update the features file.
3. They run training scripts with MLflow tracking, which logs the DVC hash.
4. The best model is registered in MLflow Model Registry.
5. For deployment, the model is fetched along with its metadata, including the data version.

This creates a complete lineage chain.

---

## 65.5 Model Lineage with MLflow

MLflow already captures significant model lineage through its tracking and registry.

### 65.5.1 Tracking Run Metadata

Each MLflow run records:

- Parameters (including data versions, as above).
- Metrics.
- Tags (e.g., `model_type`, `feature_set`).
- Artifacts (model file, plots).
- Source information (e.g., the entry-point script and, when available, the Git commit, stored as `mlflow.source.*` tags).

This metadata forms the model's lineage.

### 65.5.2 Model Registry Lineage

When a model is registered in the MLflow Model Registry, it retains a link to the source run. You can see which run produced a model version and then drill down into that run's details.

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_version_details = client.get_model_version("NEPSE_Predictor", "5")  # version is passed as a string
run_id = model_version_details.run_id
run = client.get_run(run_id)
print(run.data.params)
```

This allows you to trace a deployed model back to its training run and all associated metadata.

### 65.5.3 Stage Transitions

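The registry also records the current stage of each model version (e.g., "Staging" or "Production") and when the version was last updated. Open-source MLflow does not, to our knowledge, keep a full history of who performed each transition, so if you need that for audits, log it yourself (for example, as model version tags) or rely on a managed platform that records registry activity. Note also that recent MLflow releases deprecate stages in favour of model version aliases and tags. Either way, this information is part of the model's deployment lineage.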

---

## 65.6 Building a Simple Lineage Graph

We can build a simple lineage graph by combining DVC pipeline info and MLflow run info. This graph can help visualise dependencies.

### 65.6.1 Extracting DVC Pipeline

DVC can export the pipeline graph in Graphviz DOT format:

```bash
dvc dag --dot > pipeline.dot
```

You can convert this to a graph using Graphviz or parse it to extract dependencies.
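
As an alternative to parsing DOT output, the dependency edges can be read from `dvc.lock`, which is a YAML file. The sketch below works on the already-parsed mapping (for example, the result of `yaml.safe_load`) and assumes the layout used by recent DVC versions: a top-level `stages` key whose entries list `deps` and `outs`, each with a `path` field. The function name and the sample data are illustrative.

```python
def lineage_edges(lock):
    """Turn a parsed dvc.lock mapping into (source, target) edges.

    Each dependency points to its stage, and each stage points to its outputs.
    """
    edges = []
    for stage_name, stage in lock.get("stages", {}).items():
        stage_node = f"stage:{stage_name}"
        for dep in stage.get("deps", []):
            edges.append((dep["path"], stage_node))
        for out in stage.get("outs", []):
            edges.append((stage_node, out["path"]))
    return edges


# Hand-written sample mirroring the structure of dvc.lock (hashes are placeholders).
sample_lock = {
    "schema": "2.0",
    "stages": {
        "clean": {
            "cmd": "python scripts/clean_data.py data/raw/nepse_2023.csv data/interim/cleaned.csv",
            "deps": [{"path": "data/raw/nepse_2023.csv", "md5": "aaa"}],
            "outs": [{"path": "data/interim/cleaned.csv", "md5": "bbb"}],
        },
    },
}
for source, target in lineage_edges(sample_lock):
    print(f"{source} -> {target}")
```

The resulting list of pairs can be passed directly to `nx.DiGraph.add_edges_from` if you are building the graph with NetworkX.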

### 65.6.2 Extracting MLflow Lineage

MLflow provides a REST API to query runs. We can fetch runs that produced models and link them to data versions.
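
In Python, the same information is available through `MlflowClient.search_runs`. The helper below is a sketch that turns run objects into data-to-run edges. It relies only on the `info.run_id` and `data.params` attributes, and it assumes the data version was logged under the parameter name `features_data_version`, as in Section 65.4.1. The stand-in runs built with `SimpleNamespace` are there so the snippet runs without a tracking server.

```python
from types import SimpleNamespace


def data_to_run_edges(runs, param_name="features_data_version"):
    """Return (data_node, run_node) pairs for runs that logged a data version."""
    edges = []
    for run in runs:
        data_version = run.data.params.get(param_name)
        if data_version:
            edges.append((f"data_{data_version}", f"run_{run.info.run_id}"))
    return edges


def fake_run(run_id, params):
    """Build a minimal stand-in for an MLflow Run object (illustration only)."""
    return SimpleNamespace(
        info=SimpleNamespace(run_id=run_id),
        data=SimpleNamespace(params=params),
    )


runs = [
    fake_run("r1", {"features_data_version": "abc123"}),
    fake_run("r2", {}),  # no data version logged, so no edge is produced
]
print(data_to_run_edges(runs))
```

With a live tracking server, replace `runs` with the result of `MlflowClient().search_runs(experiment_ids=["1"])`.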

### 65.6.3 Example: Visualising with NetworkX

```python
import networkx as nx
import matplotlib.pyplot as plt
import mlflow
from mlflow.tracking import MlflowClient

# Create a graph
G = nx.DiGraph()

# Add DVC pipeline nodes (simplified: from dvc.lock)
# For each stage, add edges from deps to outs.

# Add MLflow run nodes
client = MlflowClient()
runs = client.search_runs(experiment_ids=["1"])
for run in runs:
    run_id = run.info.run_id
    G.add_node(f"run_{run_id}", type="run", accuracy=run.data.metrics.get("test_accuracy"))
    # Link run to data version (if logged)
    data_version = run.data.params.get("features_data_version")
    if data_version:
        G.add_edge(f"data_{data_version}", f"run_{run_id}")

# Draw
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_size=3000, node_color='lightblue')
plt.show()
```

**Explanation:**  
This is a simplistic example, but it shows the idea. In practice, lineage graphs can become large, so tools like Apache Atlas or custom graph databases (e.g., Neo4j) are used.

---

## 65.7 Impact Analysis

One of the main uses of lineage is **impact analysis**: if something changes upstream, what downstream artifacts are affected?

### 65.7.1 Data Change

Suppose we discover that the raw data for March 2023 had a systematic error. We fix it and update the raw CSV. Using DVC, we run:

```bash
dvc repro
```

DVC automatically detects which stages depend on the changed raw data and re‑runs them, producing new cleaned data, new features, and new models. This ensures all downstream artifacts are consistent.

But what about models already in production that were trained on the old data? We need to trace them. Using the lineage graph, we can query all models that used the old data version (by the DVC hash). We can then decide to retrain them or mark them as deprecated.
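
Such a query is a forward traversal of the lineage graph. The following is a minimal sketch using a breadth-first search over a plain list of edges; the file paths come from the pipeline above, while the run and model names are made up for illustration.

```python
from collections import deque


def downstream(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Illustrative edges: run and model names are hypothetical.
edges = [
    ("data/raw/nepse_2023.csv", "data/interim/cleaned.csv"),
    ("data/interim/cleaned.csv", "data/processed/features.csv"),
    ("data/processed/features.csv", "run_r1"),
    ("run_r1", "NEPSE_Predictor v5"),
    ("other_features.csv", "run_r2"),
]
affected = downstream(edges, "data/raw/nepse_2023.csv")
print(sorted(affected))
```

Everything reachable from the raw file is reported as affected, while `run_r2`, which was trained on an unrelated features file, is left out.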

### 65.7.2 Code Change

If a feature engineering script changes, DVC will rerun that stage and all downstream stages. MLflow runs that used the old script will still exist in the tracking server, but they are not automatically updated. However, we can query runs by the Git commit of the script and identify which models were affected.

### 65.7.3 Example Query

Using MLflow, we can search for runs that used a specific data version:

```python
runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="params.features_data_version = 'abc123'"
)
```

This returns all runs trained on that data version.

---

## 65.8 Compliance and Auditing

For regulated industries, lineage is essential for audits. Regulators may ask:

- Which data was used to train the model that made this prediction?
- How was that data processed?
- Was the model properly validated before deployment?

A lineage system provides answers. By storing all metadata immutably, we can produce an audit trail.

### 65.8.1 Storing Lineage for Compliance

Consider storing lineage information in an immutable data store (e.g., append‑only database, blockchain) to prevent tampering. For each prediction, you might log:

- Model version.
- Input features (or a hash).
- Prediction timestamp.
- Output.

Then, using the model version, you can trace back to the training run and data.
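
A lightweight way to make such a log tamper-evident is to chain the records by hash, so that altering one entry invalidates every later one. The sketch below illustrates the idea with the standard library only. The function names, the feature name `close_lag_1`, and the sample values are hypothetical, and a production system would still need durable, access-controlled storage.

```python
import hashlib
import json


def _digest(payload):
    """SHA-256 of a JSON-serialisable object, using a canonical key order."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_record(prev_hash, model_version, features, output, timestamp):
    """Create a prediction log entry that is chained to the previous one."""
    record = {
        "prev_hash": prev_hash,
        "model_version": model_version,
        "input_hash": _digest(features),  # store a hash rather than the raw inputs
        "output": output,
        "timestamp": timestamp,
    }
    record["record_hash"] = _digest(record)
    return record


def verify_chain(records):
    """Check that no record was altered and that the links are intact."""
    prev_hash = None
    for record in records:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if record["prev_hash"] != prev_hash or record["record_hash"] != _digest(body):
            return False
        prev_hash = record["record_hash"]
    return True


first = make_record(None, "5", {"close_lag_1": 2010.5}, 1, "2024-01-02T10:00:00Z")
second = make_record(first["record_hash"], "5", {"close_lag_1": 2025.0}, 0, "2024-01-03T10:00:00Z")
log = [first, second]
print(verify_chain(log))   # True

first["output"] = 0        # simulate tampering with a stored record
print(verify_chain(log))   # False
```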

---

## 65.9 Challenges and Best Practices

### 65.9.1 Challenges

- **Scale**: In large organisations with thousands of models and datasets, lineage graphs become huge. Automated tools are necessary.
- **Granularity**: How much detail to capture? Too much detail can be overwhelming; too little defeats the purpose.
- **Integration**: Different tools (DVC, MLflow, Airflow) have their own lineage representations. Integrating them requires effort.
- **Dynamic systems**: In streaming pipelines, data is continuously updated, making lineage more complex.

### 65.9.2 Best Practices

1. **Start simple**: Begin by capturing data version and code version for each model. Expand as needed.
2. **Automate lineage collection**: Integrate with your pipeline tools rather than manually recording.
3. **Use unique identifiers**: Use hashes of data files and code commits as universal IDs.
4. **Store lineage alongside artifacts**: Keep metadata with the model file (e.g., in MLflow).
5. **Regularly audit lineage**: Check that lineage is complete and correct.
6. **Plan for retention**: Decide how long to keep lineage information (e.g., for regulatory requirements).
7. **Educate the team**: Ensure everyone understands the importance of lineage and follows practices that enable it.

---

## 65.10 Complete Example: NEPSE Lineage Pipeline

Let's put together a complete example for the NEPSE system that captures data and model lineage using DVC and MLflow.

**Directory structure:**

```
nepse-project/
├── data/
│   ├── raw/
│   │   └── nepse_2023.csv
│   ├── interim/
│   │   └── cleaned.csv
│   └── processed/
│       └── features.csv
├── scripts/
│   ├── clean_data.py
│   ├── feature_engineering.py
│   └── train_model.py
├── models/
│   └── model.pkl
├── metrics/
│   └── accuracy.json
├── dvc.yaml
├── dvc.lock
└── .gitignore
```

**dvc.yaml** (as shown earlier).

**scripts/train_model.py** (with MLflow and DVC integration):

```python
import argparse
import json
import os
import pickle
import subprocess

import mlflow
import mlflow.sklearn
import pandas as pd
import yaml  # PyYAML, typically installed alongside MLflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def get_dvc_hash(path, lock_file='dvc.lock'):
    # Look up the hash that DVC recorded for a pipeline output in dvc.lock.
    # Falls back to "unknown" if the lock file or the entry is missing.
    try:
        with open(lock_file) as f:
            lock = yaml.safe_load(f) or {}
    except FileNotFoundError:
        return "unknown"
    for stage in lock.get('stages', {}).values():
        for out in stage.get('outs', []):
            if out.get('path') == path:
                return out.get('md5', 'unknown')
    return "unknown"


def main():
    parser = argparse.ArgumentParser()
    # Positional arguments match the `cmd` of the train stage in dvc.yaml
    parser.add_argument('data_path', nargs='?', default='data/processed/features.csv')
    parser.add_argument('model_path', nargs='?', default='models/model.pkl')
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--max_depth', type=int, default=10)
    args = parser.parse_args()

    # Start MLflow run
    with mlflow.start_run():
        # Log parameters
        mlflow.log_param("n_estimators", args.n_estimators)
        mlflow.log_param("max_depth", args.max_depth)

        # Log data version
        data_hash = get_dvc_hash(args.data_path)
        mlflow.log_param("features_data_version", data_hash)

        # Log code version (Git commit)
        commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
        mlflow.log_param("git_commit", commit)

        # Load data
        df = pd.read_csv(args.data_path)
        X = df.drop(columns=['target'])
        y = df['target']

        # Train/test split (time-based; assumes rows are in chronological order)
        split_idx = int(0.8 * len(df))
        X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
        y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

        # Train
        model = RandomForestClassifier(n_estimators=args.n_estimators,
                                       max_depth=args.max_depth,
                                       random_state=42)
        model.fit(X_train, y_train)

        # Evaluate
        train_acc = accuracy_score(y_train, model.predict(X_train))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        mlflow.log_metric("train_accuracy", train_acc)
        mlflow.log_metric("test_accuracy", test_acc)

        # Log the model to MLflow
        mlflow.sklearn.log_model(model, "model")

        # Also write the model file that DVC tracks as the stage output
        os.makedirs(os.path.dirname(args.model_path) or '.', exist_ok=True)
        with open(args.model_path, 'wb') as f:
            pickle.dump(model, f)

        # Save metrics to file (for DVC)
        os.makedirs('metrics', exist_ok=True)
        with open('metrics/accuracy.json', 'w') as f:
            json.dump({'test_accuracy': test_acc}, f)

        print(f"Run ID: {mlflow.active_run().info.run_id}")


if __name__ == "__main__":
    main()
```

**Running the pipeline:**

```bash
dvc repro
```

This will run the entire pipeline, and the training stage will log to MLflow. The DVC lock file records all dependencies and outputs.

**Exploring lineage:**

- To see the pipeline graph: `dvc dag`
- To see MLflow runs: `mlflow ui`
- To find which model version used a particular data hash: use MLflow search.

Now we have a complete lineage from raw data to trained model, all captured and queryable.

---

## Chapter Summary

In this chapter, we explored the critical topic of data and model lineage. We covered:

- The definitions and importance of lineage for reproducibility, debugging, impact analysis, and compliance.
- Tools for lineage: DVC for data and pipeline versioning, and MLflow for experiment and model tracking.
- How to implement data lineage with DVC pipelines, capturing dependencies and versions.
- How to integrate data lineage with MLflow by logging DVC hashes.
- Model lineage through MLflow runs and the model registry.
- Building simple lineage graphs and performing impact analysis.
- Compliance considerations and best practices.
- A complete example for the NEPSE system combining DVC and MLflow.

With lineage in place, the NEPSE prediction system becomes transparent and auditable. We can trace any prediction back to the data and code that produced it, and we can assess the impact of changes upstream. This is a fundamental requirement for trustworthy machine learning in production.

In the next chapter, we will discuss **Model Governance**, which builds on lineage to ensure that models are developed, deployed, and maintained responsibly.

---

**End of Chapter 65**

