# Task 15.5: Reproducibility Audit

**Module:** 15 - Benchmarking, Evaluation & MLOps  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand why reproducibility matters in ML
- [ ] Identify common sources of non-reproducibility
- [ ] Implement reproducibility best practices
- [ ] Create a reproducibility checklist for your projects
- [ ] Verify training reproducibility across runs

---

## 📚 Prerequisites

- Completed: Tasks 15.1-15.4
- Knowledge of: PyTorch training, random seeds
- Hardware: DGX Spark or any GPU

---

## 🌍 Real-World Context

**Reproducibility is the foundation of scientific ML.** Consider these scenarios:

- **Research:** "We can't reproduce your paper's results" = career problem
- **Production:** "Why does the model perform differently today?" = debugging nightmare
- **Compliance:** "Prove this model was trained correctly" = legal requirement
- **Collaboration:** "I ran your code but got different numbers" = wasted time

**Companies take this seriously:**
- **Google:** Has internal reproducibility standards for all ML
- **Meta:** Publishes training configs for all major models
- **OpenAI:** Provides detailed technical reports

---

## 🧒 ELI5: What is Reproducibility?

> **Imagine you're baking cookies.** If you follow the EXACT same recipe:
> - Same ingredients (flour, sugar, butter)
> - Same amounts (2 cups flour, 1 cup sugar)
> - Same oven temperature (350°F)
> - Same baking time (12 minutes)
>
> You should get the SAME cookies every time!
>
> **But ML is trickier.** Even with the same code, you might get different results because:
> - Random numbers are used (like shuffling ingredients randomly)
> - Hardware behaves slightly differently (like different ovens)
> - Software versions change (like using a new recipe book)
>
> **Reproducibility means:** Given the same inputs, get the same outputs. Every. Single. Time.

---

## Part 1: Sources of Non-Reproducibility

Let's understand what can cause different results.

In [None]:
# Common sources of non-reproducibility
REPRODUCIBILITY_ISSUES = {
    "Random Seeds": {
        "description": "Random number generators not seeded consistently",
        "affected": ["Weight initialization", "Data shuffling", "Dropout", "Data augmentation"],
        "fix": "Set seeds for all random sources (Python, NumPy, PyTorch, CUDA)"
    },
    "GPU Non-Determinism": {
        "description": "CUDA operations with non-deterministic algorithms",
        "affected": ["Convolutions", "Attention", "Certain reduction ops"],
        "fix": "Use torch.use_deterministic_algorithms(True)"
    },
    "Data Order": {
        "description": "Training data loaded in different orders",
        "affected": ["Model convergence", "Final weights"],
        "fix": "Fix random seed for data loaders, save shuffling order"
    },
    "Software Versions": {
        "description": "Different package versions have different behaviors",
        "affected": ["Algorithm implementations", "Default parameters"],
        "fix": "Lock dependencies, use containers"
    },
    "Hardware Differences": {
        "description": "Different GPUs/CPUs have different floating point behavior",
        "affected": ["Numerical precision", "Optimization paths"],
        "fix": "Document hardware, use consistent environments"
    },
    "Floating Point Precision": {
        "description": "FP operations are not perfectly associative",
        "affected": ["Batch normalization", "Loss calculations"],
        "fix": "Use deterministic algorithms, be aware of tolerance"
    }
}

print("⚠️ Common Sources of Non-Reproducibility:")
print("=" * 60)

for issue, details in REPRODUCIBILITY_ISSUES.items():
    print(f"\n🔴 {issue}")
    print(f"   {details['description']}")
    print(f"   Affects: {', '.join(details['affected'])}")
    print(f"   Fix: {details['fix']}")
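
The "Floating Point Precision" entry can be seen without a GPU. The sketch below uses only the standard library: it adds the same three numbers in two groupings and gets two different answers, which is the effect a parallel reduction produces when its summation order changes between runs. The `metrics_match` helper and its tolerance values are illustrative assumptions, not part of the notebook's API.

```python
import math

a, b, c = 1e16, 1.0, -1e16

# Same three numbers, two groupings: the result depends on evaluation order.
left_to_right = (a + b) + c   # 1.0 is rounded away when added to 1e16
reordered = (a + c) + b       # the large terms cancel first, so 1.0 survives
print(left_to_right, reordered)  # 0.0 1.0


def metrics_match(x: float, y: float, rel_tol: float = 1e-6) -> bool:
    """Compare two run metrics with an explicit, documented tolerance."""
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-12)


print(metrics_match(0.693147, 0.693147 + 1e-9))  # True
print(metrics_match(0.693147, 0.70))             # False
```

When bit-exact equality is not achievable (for example across different GPU models), stating the tolerance you compare with is usually more honest than claiming identical results.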

---

## Part 2: Setting Up Reproducible Training

Let's implement a robust seed-setting mechanism.

In [None]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
from typing import Optional

def set_seed(seed: int, deterministic: bool = True) -> None:
    """
    Set all random seeds for reproducibility.
    
    Args:
        seed: The random seed to use
        deterministic: If True, use deterministic algorithms (slower but reproducible)
    """
    # Python's built-in random
    random.seed(seed)
    
    # NumPy
    np.random.seed(seed)
    
    # PyTorch CPU
    torch.manual_seed(seed)
    
    # PyTorch GPU
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # For multi-GPU
    
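    # Recorded for child processes only: hash randomization is fixed at interpreter
    # startup, so export PYTHONHASHSEED before launching Python to affect this process.
    os.environ['PYTHONHASHSEED'] = str(seed)
    
    # Deterministic behavior (may impact performance)
    if deterministic:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        
        # cuBLAS needs this workspace setting (CUDA 10.2+) for deterministic matmuls;
        # set it before the first CUDA call or deterministic mode can raise RuntimeError
        os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')
        
        # For PyTorch 1.8+
        try:
            torch.use_deterministic_algorithms(True)
        except AttributeError:
            pass  # Older PyTorch version
    else:
        # Clear flags left over from an earlier deterministic=True call
        torch.backends.cudnn.deterministic = False
        if hasattr(torch, "use_deterministic_algorithms"):
            torch.use_deterministic_algorithms(False)
    
    print(f"✅ All random seeds set to {seed}")
    print(f"   Deterministic mode: {deterministic}")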

# Test it
set_seed(42)
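
One caveat about `PYTHONHASHSEED`: Python reads it once at interpreter startup, so assigning it through `os.environ` inside a running notebook only affects processes started afterwards. The sketch below checks this by launching fresh interpreters with and without the variable. The helper name `string_hash_in_child` is made up for this example.

```python
import os
import subprocess
import sys


def string_hash_in_child(hash_seed=None):
    """Return hash('reproducibility') as computed by a freshly started interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    completed = subprocess.run(
        [sys.executable, "-c", "print(hash('reproducibility'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout.strip())


# With the variable exported before startup, string hashes repeat across processes.
seeded_runs = [string_hash_in_child(42) for _ in range(3)]
print(len(set(seeded_runs)) == 1)  # True

# Without it, each interpreter picks its own random hash seed.
unseeded_runs = [string_hash_in_child() for _ in range(3)]
print(unseeded_runs)
```

For training scripts, this suggests exporting the variable in the shell or container definition (for example `PYTHONHASHSEED=42 python train.py`) rather than relying on the in-process assignment.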

In [None]:
# Verify seed setting works
def verify_seed_reproducibility(seed: int = 42, n_tests: int = 3):
    """Verify that setting seeds produces reproducible random numbers."""
    
    results = []
    
    for i in range(n_tests):
        set_seed(seed)
        
        result = {
            "python_random": random.random(),
            "numpy_random": np.random.rand(),
            "torch_random": torch.rand(1).item(),
        }
        
        if torch.cuda.is_available():
            result["cuda_random"] = torch.rand(1, device="cuda").item()
        
        results.append(result)
    
    # Check all results are the same
    all_same = all(r == results[0] for r in results)
    
    print(f"\n🔍 Seed Reproducibility Test (seed={seed}):")
    print("=" * 50)
    
    for key in results[0].keys():
        values = [r[key] for r in results]
        same = len(set(values)) == 1
        status = "✅" if same else "❌"
        print(f"  {status} {key}: {values[0]:.6f}")
    
    print(f"\n{'✅ All random sources are reproducible!' if all_same else '❌ Some sources are not reproducible!'}")
    return all_same

verify_seed_reproducibility(42)

---

## Part 3: Reproducible Model Training

Let's create a reproducible training setup.

In [None]:
# Simple model for testing
class SimpleNet(nn.Module):
    """Simple network for reproducibility testing."""
    
    def __init__(self, input_size=10, hidden_size=20, output_size=2):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)

# Create reproducible data
def create_reproducible_data(seed: int, n_samples: int = 1000):
    """Create reproducible synthetic data."""
    set_seed(seed)
    
    X = torch.randn(n_samples, 10)
    # Label is 1 when the first two features sum to more than zero (no noise is added)
    y = (X[:, 0] + X[:, 1] > 0).long()
    
    return X, y

X, y = create_reproducible_data(42)
print(f"Data shape: X={X.shape}, y={y.shape}")
print(f"First few X values: {X[0, :3]}")

In [None]:
from torch.utils.data import DataLoader, TensorDataset

def train_model(
    seed: int,
    n_epochs: int = 5,
    learning_rate: float = 0.01,
    batch_size: int = 32,
    verbose: bool = False
) -> dict:
    """
    Train a model with full reproducibility.
    
    Returns:
        Dictionary with training results and final model state
    """
    # Set seed BEFORE everything
    set_seed(seed)
    
    # Create data (seeded)
    X, y = create_reproducible_data(seed)
    
    # Create data loader with generator for reproducible shuffling
    g = torch.Generator()
    g.manual_seed(seed)
    
    dataset = TensorDataset(X, y)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        generator=g,  # Reproducible shuffling!
        drop_last=False
    )
    
    # Create model (seeded initialization)
    model = SimpleNet()
    
    # Optimizer and loss
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    
    # Training loop
    history = []
    
    for epoch in range(n_epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0
        
        for batch_X, batch_y in dataloader:
            optimizer.zero_grad()
            
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            _, predicted = outputs.max(1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)
        
        avg_loss = total_loss / len(dataloader)
        accuracy = correct / total
        history.append({"epoch": epoch, "loss": avg_loss, "accuracy": accuracy})
        
        if verbose:
            print(f"Epoch {epoch+1}/{n_epochs}: loss={avg_loss:.4f}, acc={accuracy:.4f}")
    
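    # Fingerprint ALL final parameters: SHA-256 over each tensor's name and raw bytes
    import hashlib  # local import keeps this cell self-contained
    digest = hashlib.sha256()
    for name, tensor in model.state_dict().items():
        digest.update(name.encode())
        digest.update(tensor.detach().cpu().numpy().tobytes())
    final_weights_hash = digest.hexdigest()[:16]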
    
    return {
        "final_loss": history[-1]["loss"],
        "final_accuracy": history[-1]["accuracy"],
        "history": history,
        "weights_hash": final_weights_hash
    }

# Train with verbose output
result = train_model(seed=42, verbose=True)

In [None]:
# Verify reproducibility: train multiple times and compare
def verify_training_reproducibility(seed: int = 42, n_runs: int = 3):
    """Verify that training is reproducible across multiple runs."""
    
    results = []
    
    print(f"\n🔬 Running {n_runs} training runs with seed={seed}")
    print("=" * 50)
    
    for i in range(n_runs):
        result = train_model(seed=seed, verbose=False)
        results.append(result)
        print(f"Run {i+1}: loss={result['final_loss']:.6f}, acc={result['final_accuracy']:.6f}")
    
    # Compare results
    losses = [r["final_loss"] for r in results]
    accuracies = [r["final_accuracy"] for r in results]
    hashes = [r["weights_hash"] for r in results]
    
    loss_match = all(abs(l - losses[0]) < 1e-6 for l in losses)
    acc_match = all(abs(a - accuracies[0]) < 1e-6 for a in accuracies)
    hash_match = len(set(hashes)) == 1
    
    print(f"\n📊 Reproducibility Results:")
    print(f"  {'✅' if loss_match else '❌'} Final loss identical: {losses[0]:.6f}")
    print(f"  {'✅' if acc_match else '❌'} Final accuracy identical: {accuracies[0]:.6f}")
    print(f"  {'✅' if hash_match else '❌'} Model weights identical")
    
    is_reproducible = loss_match and acc_match and hash_match
    
    if is_reproducible:
        print(f"\n🎉 Training is fully reproducible!")
    else:
        print(f"\n⚠️ Training has reproducibility issues!")
    
    return is_reproducible

verify_training_reproducibility(42)
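
When two runs do not match, the final numbers say little about where they drifted apart. A possible next step is to compare the per-epoch histories and report the first divergence. The sketch below assumes histories shaped like the `history` list returned by `train_model` (dicts with `epoch`, `loss`, `accuracy`). The function name `first_divergence` and the sample values are invented for illustration.

```python
import math
from typing import Optional


def first_divergence(history_a: list, history_b: list, rel_tol: float = 0.0) -> Optional[dict]:
    """Return the first epoch/metric where two histories differ, or None if they match."""
    if len(history_a) != len(history_b):
        return {"epoch": None, "metric": "length", "a": len(history_a), "b": len(history_b)}
    for entry_a, entry_b in zip(history_a, history_b):
        for metric in ("loss", "accuracy"):
            if not math.isclose(entry_a[metric], entry_b[metric], rel_tol=rel_tol, abs_tol=0.0):
                return {
                    "epoch": entry_a["epoch"],
                    "metric": metric,
                    "a": entry_a[metric],
                    "b": entry_b[metric],
                }
    return None


# Hand-written histories standing in for two train_model(...)["history"] results
run_a = [{"epoch": 0, "loss": 0.61, "accuracy": 0.70}, {"epoch": 1, "loss": 0.42, "accuracy": 0.85}]
run_b = [{"epoch": 0, "loss": 0.61, "accuracy": 0.70}, {"epoch": 1, "loss": 0.43, "accuracy": 0.85}]

print(first_divergence(run_a, run_a))  # None
print(first_divergence(run_a, run_b))  # {'epoch': 1, 'metric': 'loss', 'a': 0.42, 'b': 0.43}
```

A divergence at epoch 0 tends to point at initialization or data order, while a later one more often points at non-deterministic kernels. Treat that as a heuristic, not a rule.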

---

## Part 4: Environment Reproducibility

Capturing the full environment is crucial for reproducibility.

In [None]:
import platform
import sys
from datetime import datetime

def capture_environment() -> dict:
    """Capture the full environment for reproducibility."""
    
    env = {
        "timestamp": datetime.now().isoformat(),
        "python": {
            "version": sys.version,
            "executable": sys.executable
        },
        "platform": {
            "system": platform.system(),
            "release": platform.release(),
            "machine": platform.machine(),
            "processor": platform.processor()
        },
        "packages": {
            "torch": torch.__version__,
            "numpy": np.__version__,
        }
    }
    
    # CUDA info
    if torch.cuda.is_available():
        env["cuda"] = {
            "available": True,
            "version": torch.version.cuda,
            "device_count": torch.cuda.device_count(),
            "device_name": torch.cuda.get_device_name(0),
            "cudnn_version": torch.backends.cudnn.version(),
            "cudnn_deterministic": torch.backends.cudnn.deterministic,
            "cudnn_benchmark": torch.backends.cudnn.benchmark
        }
    else:
        env["cuda"] = {"available": False}
    
    return env

env = capture_environment()

print("\n🖥️ Environment Snapshot:")
print("=" * 50)

import json
print(json.dumps(env, indent=2, default=str))

In [None]:
# Generate requirements file
def generate_requirements():
    """Generate a requirements.txt for the current environment."""
    
    # Key packages to track
    packages = [
        "torch",
        "numpy",
        "transformers",
        "mlflow",
        "datasets",
    ]
    
    requirements = []
    
    for pkg in packages:
        try:
            version = __import__(pkg).__version__
            requirements.append(f"{pkg}=={version}")
        except (ImportError, AttributeError):
            pass
    
    return "\n".join(requirements)

print("\n📋 Requirements:")
print(generate_requirements())
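
Reading `__version__` from an imported module misses packages whose distribution name differs from the import name (for example `scikit-learn` versus `sklearn`) or that do not define the attribute. A possible alternative is the standard library's `importlib.metadata` (Python 3.8+), which looks up installed distributions without importing them. The function name `pinned_requirements` is an assumption for this sketch.

```python
from importlib import metadata


def pinned_requirements(distributions: list) -> str:
    """Build 'name==version' lines from installed distribution metadata."""
    lines = []
    for name in distributions:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment, so nothing to pin
    return "\n".join(lines)


print(pinned_requirements(["torch", "numpy", "scikit-learn", "mlflow"]))
```

To capture the complete environment rather than a hand-picked list, `python -m pip freeze > requirements.txt` records every installed distribution.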

In [None]:
# Docker command for reproducible environment
print("🐳 Docker for Reproducibility")
print("=" * 50)
# Raw string so the trailing backslashes in the shell command are printed literally
print(r"""

For DGX Spark, use NGC containers:

```bash
docker run --gpus all -it --rm \
    -v $HOME/workspace:/workspace \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.01-py3 \
    python your_training_script.py
```

Key points:
- Always pin the container version (e.g., 25.01-py3, not 'latest')
- Mount volumes for persistent data
- Use --ipc=host for multi-processing

Save your container version in your experiment logs!
""")

---

## Part 5: The Reproducibility Checklist

A comprehensive checklist for reproducible ML projects.

In [None]:
REPRODUCIBILITY_CHECKLIST = f"""
📋 REPRODUCIBILITY CHECKLIST
{'='*60}

## 🎲 Random Seeds
- [ ] Python random seed set
- [ ] NumPy random seed set
- [ ] PyTorch manual_seed set (CPU)
- [ ] PyTorch CUDA seeds set
- [ ] PYTHONHASHSEED environment variable set before Python starts
- [ ] DataLoader generator seeded

## 🔧 Deterministic Settings
- [ ] torch.backends.cudnn.deterministic = True
- [ ] torch.backends.cudnn.benchmark = False
- [ ] torch.use_deterministic_algorithms(True) (if available)

## 📦 Environment
- [ ] requirements.txt or environment.yml committed
- [ ] Docker/container version recorded
- [ ] Hardware (GPU model) documented
- [ ] OS and driver versions noted

## 📊 Data
- [ ] Data version tracked (hash or version number)
- [ ] Train/val/test split reproducible
- [ ] Preprocessing deterministic
- [ ] Data loading order fixed

## 🧠 Model
- [ ] Architecture defined in code (not just checkpoints)
- [ ] Hyperparameters logged
- [ ] Weight initialization seeded
- [ ] All model checkpoints saved

## 📝 Experiment Tracking
- [ ] All parameters logged (MLflow, W&B, etc.)
- [ ] Metrics recorded at each step/epoch
- [ ] Artifacts (models, plots) saved
- [ ] Git commit hash recorded

## ✅ Verification
- [ ] Training verified reproducible (run twice, compare)
- [ ] Evaluation verified reproducible
- [ ] Another team member reproduced results
"""

print(REPRODUCIBILITY_CHECKLIST)
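
Two of the Data items, tracking the data version by hash and keeping the split reproducible, can be covered with the standard library alone. The sketch below hashes a file in chunks and builds a seeded train/val/test split from a dedicated `random.Random` instance. The helper names, the split fractions, and the toy CSV are illustrative assumptions.

```python
import hashlib
import random
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_indices(n_samples: int, seed: int, val_fraction: float = 0.1, test_fraction: float = 0.1) -> dict:
    """Shuffle indices with a dedicated RNG so the split ignores global random state."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    n_test = int(n_samples * test_fraction)
    return {
        "val": indices[:n_val],
        "test": indices[n_val:n_val + n_test],
        "train": indices[n_val + n_test:],
    }


# Demo on a throwaway file; in practice, point file_sha256 at the real dataset file
with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = Path(tmp_dir) / "toy_dataset.csv"
    data_path.write_text("x,y\n1,0\n2,1\n")
    data_hash = file_sha256(data_path)
    print(data_hash[:12])

split = split_indices(n_samples=1000, seed=42)
print({name: len(ids) for name, ids in split.items()})  # {'val': 100, 'test': 100, 'train': 800}
```

Logging the hash and the split seed next to the run parameters lets a later audit confirm that two runs saw the same bytes in the same partitions.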

In [None]:
# Automated reproducibility audit
def audit_reproducibility() -> dict:
    """Audit the current environment for reproducibility."""
    
    audit = {
        "passed": [],
        "warnings": [],
        "failed": []
    }
    
    # Check CUDA deterministic settings
    if torch.cuda.is_available():
        if torch.backends.cudnn.deterministic:
            audit["passed"].append("CUDNN deterministic mode enabled")
        else:
            audit["warnings"].append("CUDNN deterministic mode not enabled")
        
        if not torch.backends.cudnn.benchmark:
            audit["passed"].append("CUDNN benchmark mode disabled")
        else:
            audit["warnings"].append("CUDNN benchmark mode enabled (may cause non-determinism)")
    
    # Check PYTHONHASHSEED
    if 'PYTHONHASHSEED' in os.environ:
        audit["passed"].append(f"PYTHONHASHSEED set to {os.environ['PYTHONHASHSEED']}")
    else:
        audit["warnings"].append("PYTHONHASHSEED not set")
    
    # Check for reproducibility-affecting packages
    try:
        import transformers
        audit["passed"].append(f"Transformers version: {transformers.__version__}")
    except ImportError:
        pass
    
    # Summary
    print("\n🔍 Reproducibility Audit Results:")
    print("=" * 50)
    
    print(f"\n✅ Passed ({len(audit['passed'])})")
    for item in audit["passed"]:
        print(f"   • {item}")
    
    print(f"\n⚠️ Warnings ({len(audit['warnings'])})")
    for item in audit["warnings"]:
        print(f"   • {item}")
    
    print(f"\n❌ Failed ({len(audit['failed'])})")
    for item in audit["failed"]:
        print(f"   • {item}")
    
    return audit

audit_reproducibility()
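
The audit above inspects the cuDNN flags and `PYTHONHASHSEED`. Two further settings may be worth checking: the global switch behind `torch.use_deterministic_algorithms` and the `CUBLAS_WORKSPACE_CONFIG` variable that PyTorch documents for deterministic CUDA matrix multiplies. The sketch below is a hypothetical companion function, not part of the notebook's audit. It skips the PyTorch checks when the package is not installed.

```python
import importlib.util
import os


def audit_determinism_flags() -> dict:
    """Report determinism-related settings that the cuDNN flags alone do not cover."""
    findings = {"passed": [], "warnings": []}

    # cuBLAS workspace setting that PyTorch documents for deterministic CUDA matmuls
    workspace = os.environ.get("CUBLAS_WORKSPACE_CONFIG")
    if workspace in (":4096:8", ":16:8"):
        findings["passed"].append(f"CUBLAS_WORKSPACE_CONFIG={workspace}")
    else:
        findings["warnings"].append("CUBLAS_WORKSPACE_CONFIG not set to :4096:8 or :16:8")

    if importlib.util.find_spec("torch") is None:
        findings["warnings"].append("PyTorch not installed; skipped torch checks")
        return findings

    import torch

    # Global switch set by torch.use_deterministic_algorithms(True) (PyTorch 1.8+)
    if hasattr(torch, "are_deterministic_algorithms_enabled"):
        if torch.are_deterministic_algorithms_enabled():
            findings["passed"].append("torch deterministic algorithms enabled")
        else:
            findings["warnings"].append("torch.use_deterministic_algorithms(True) not called")
    return findings


for status, items in audit_determinism_flags().items():
    for item in items:
        print(f"{status}: {item}")
```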

---

## Part 6: Logging for Reproducibility with MLflow

Let's create a complete reproducible training run with full logging.

In [None]:
import mlflow
import subprocess

def get_git_hash() -> str:
    """Get current git commit hash."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            timeout=5
        )
        return result.stdout.strip()[:8] if result.returncode == 0 else "unknown"
    except Exception:
        return "unknown"

def reproducible_training_run(
    seed: int,
    experiment_name: str = "Reproducibility-Demo",
    **training_kwargs
):
    """
    Execute a fully reproducible training run with complete logging.
    """
    # Set up MLflow
    MLFLOW_DIR = os.path.abspath("../mlflow")
    mlflow.set_tracking_uri(f"file://{MLFLOW_DIR}")
    mlflow.set_experiment(experiment_name)
    
    with mlflow.start_run(run_name=f"seed-{seed}"):
        # 1. Log reproducibility settings
        mlflow.log_params({
            "seed": seed,
            "deterministic_mode": True,
        })
        
        # 2. Log environment
        env = capture_environment()
        mlflow.log_params({
            "python_version": env["python"]["version"].split()[0],
            "torch_version": env["packages"]["torch"],
            "numpy_version": env["packages"]["numpy"],
            "platform": env["platform"]["system"],
        })
        
        if env["cuda"]["available"]:
            mlflow.log_params({
                "cuda_version": env["cuda"]["version"],
                "gpu_name": env["cuda"]["device_name"],
            })
        
        # 3. Log git info
        mlflow.set_tag("git_commit", get_git_hash())
        
        # 4. Log training parameters
        mlflow.log_params(training_kwargs)
        
        # 5. Save environment file as artifact
        env_path = "/tmp/environment.json"
        with open(env_path, 'w') as f:
            json.dump(env, f, indent=2, default=str)
        mlflow.log_artifact(env_path)
        
        # 6. Run training
        result = train_model(seed=seed, **training_kwargs)
        
        # 7. Log metrics
        for epoch_data in result["history"]:
            mlflow.log_metrics(
                {"loss": epoch_data["loss"], "accuracy": epoch_data["accuracy"]},
                step=epoch_data["epoch"]
            )
        
        mlflow.log_metrics({
            "final_loss": result["final_loss"],
            "final_accuracy": result["final_accuracy"]
        })
        
        # 8. Log model weights hash for verification
        mlflow.log_param("weights_hash", result["weights_hash"])
        
        print(f"\n✅ Reproducible run complete!")
        print(f"   Seed: {seed}")
        print(f"   Final accuracy: {result['final_accuracy']:.4f}")
        print(f"   Weights hash: {result['weights_hash']}")
        
        return result

# Run it!
result = reproducible_training_run(
    seed=42,
    n_epochs=5,
    learning_rate=0.01,
    batch_size=32
)
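
A commit hash identifies the code only if the working tree was clean when the run started. As a possible extension of `get_git_hash`, the sketch below also records whether there were uncommitted changes, using `git status --porcelain`. The names `_git` and `get_git_state` are assumptions for this example. It falls back to `"unknown"` when git is missing or the directory is not a repository.

```python
import subprocess


def _git(*args: str) -> str:
    """Run a git command and return stdout, raising if git fails or is missing."""
    completed = subprocess.run(
        ["git", *args], capture_output=True, text=True, timeout=5, check=True
    )
    return completed.stdout.strip()


def get_git_state() -> dict:
    """Return the full commit hash and whether the working tree has uncommitted changes."""
    try:
        commit = _git("rev-parse", "HEAD")
        dirty = bool(_git("status", "--porcelain"))
    except (OSError, subprocess.SubprocessError):
        return {"commit": "unknown", "dirty": None}
    return {"commit": commit, "dirty": dirty}


print(get_git_state())
```

Both values could be attached to the run with `mlflow.set_tag`, in the same way `git_commit` is tagged above.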

---

## ✋ Try It Yourself: Exercise

**Task:** Verify the reproducibility of a training pipeline.

1. Run `reproducible_training_run` with seed=123, 3 times
2. Verify all three runs have identical final metrics
3. Change one setting (e.g., remove deterministic mode) and observe the difference
4. Document your findings

<details>
<summary>💡 Hint</summary>

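Try modifying the `set_seed` function to disable the deterministic settings:
```python
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True
torch.use_deterministic_algorithms(False)
```
Then compare results. The cuDNN flags only affect GPU kernels, so the small CPU-only `SimpleNet` may still match exactly; a difference is most likely to appear with convolutional or attention layers on a GPU.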

</details>

In [None]:
# YOUR CODE HERE

# Step 1: Run 3 times with seed=123

# Step 2: Compare results

# Step 3: Disable deterministic mode and compare

# Step 4: Document findings

---

## ⚠️ Common Mistakes

### Mistake 1: Setting Seed Only Once

In [None]:
# ❌ Wrong: Setting seed only at start of script
# set_seed(42)
# ... many operations ...
# model = create_model()  # Seed state may have drifted

# ✅ Right: Set seed immediately before the operation you want to reproduce
# set_seed(42)
# model = create_model()  # Deterministic
# set_seed(42)  # Reset if needed for next reproducible op

print("Set seed as close as possible to the operation you want to reproduce!")
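
An alternative to re-seeding the global state before each operation is to give each component its own generator, so unrelated random draws elsewhere cannot shift its stream. The sketch below shows the idea with the standard library's `random.Random`. `torch.Generator` and `numpy.random.default_rng` play the same role in their libraries. The `init_weights` function is a stand-in invented for this example.

```python
import random


def init_weights(rng: random.Random, n: int = 3) -> list:
    """Stand-in for weight initialization: draw n values from the given RNG."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Global RNG: an unrelated draw between seeding and use shifts every later value.
random.seed(42)
baseline = [random.gauss(0.0, 1.0) for _ in range(3)]
random.seed(42)
random.random()  # e.g. a shuffle or augmentation step added later
drifted = [random.gauss(0.0, 1.0) for _ in range(3)]
print(baseline == drifted)  # False

# Dedicated RNG: draws on the global state cannot disturb it.
weights_a = init_weights(random.Random(42))
random.random()
weights_b = init_weights(random.Random(42))
print(weights_a == weights_b)  # True
```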

### Mistake 2: Forgetting DataLoader Workers

In [None]:
# ❌ Wrong: Using num_workers > 0 without seeding
# dataloader = DataLoader(dataset, num_workers=4)  # Non-reproducible!

# ✅ Right: Seed each worker
def worker_init_fn(worker_id):
    """Initialize each DataLoader worker with a unique but reproducible seed."""
    seed = torch.initial_seed() % 2**32
    np.random.seed(seed)
    random.seed(seed)

# Usage:
# dataloader = DataLoader(
#     dataset,
#     num_workers=4,
#     worker_init_fn=worker_init_fn,  # Each worker gets seeded!
#     generator=torch.Generator().manual_seed(42)
# )

print("Always use worker_init_fn when num_workers > 0!")

# 💡 Note: This function is also available in the scripts module:
# from scripts.reproducibility import worker_init_fn

---

## 🎉 Checkpoint

You've learned:
- ✅ Sources of non-reproducibility in ML
- ✅ How to properly set all random seeds
- ✅ Deterministic training settings
- ✅ Environment capture and logging
- ✅ The reproducibility checklist
- ✅ Verification techniques

---

## 📖 Further Reading

- [PyTorch Reproducibility Guide](https://pytorch.org/docs/stable/notes/randomness.html)
- [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples)
- [Papers With Code - ML Reproducibility Challenge](https://paperswithcode.com/rc2022)

---

## 🧹 Cleanup

In [None]:
import gc
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("✅ Cleanup complete!")

---

## 📝 Module 15 Summary

Congratulations on completing Module 15! You've learned:

1. **Benchmarking (15.1):** How to evaluate LLMs with standard benchmarks
2. **Custom Evaluation (15.2):** Building task-specific evaluation frameworks
3. **MLflow (15.3):** Experiment tracking and visualization
4. **Model Registry (15.4):** Version control for models
5. **Reproducibility (15.5):** Ensuring consistent results

These skills form the foundation of professional ML engineering. Every serious ML team uses these practices!

**Next Steps:**
- Apply these practices to your own projects
- Set up a central MLflow server for your team
- Create custom benchmarks for your use cases
- Build CI/CD pipelines with reproducibility checks