# Task 4.3.7: Reproducibility Audit

**Module:** 4.3 - MLOps & Experiment Tracking  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand why reproducibility matters in ML
- [ ] Implement proper random seed management
- [ ] Create reproducible training environments
- [ ] Audit and verify experiment reproducibility
- [ ] Document experiments for future reference

---

## Prerequisites

- Completed: Task 4.3.6 (Model Registry)
- Knowledge of: Python, PyTorch, Docker basics

---

## Real-World Context

You trained an amazing model 6 months ago. Now you need to retrain it with new data, but you can't reproduce the original results. The loss curves look different. The accuracy is lower.

**What went wrong?**

This happens constantly in ML research and production:
- Papers that can't be replicated
- Models that can't be retrained
- Bugs that can't be debugged

Reproducibility is the foundation of reliable ML. Without it, you're flying blind.

---

## ELI5: What is Reproducibility?

> **Imagine you baked the perfect cake last week.**
>
> You want to bake it again, but you:
> - Didn't write down the exact amounts
> - Used a different oven this time
> - Can't remember how long you baked it
>
> Result? A completely different cake!
>
> **In AI terms:** Reproducibility means you can run the exact same experiment again and get the exact same results. This requires controlling:
> - Random seeds (your "ingredients")
> - Software versions (your "oven")
> - Hardware settings (your "temperature")
> - Data versions (your "recipe")

---

## The Reproducibility Pyramid

```
                    ┌──────────────────┐
                    │   Same Results   │  <- Goal
                    └────────┬─────────┘
               ┌─────────────┴─────────────┐
               │     Same Environment      │  <- Docker/Conda
               └─────────────┬─────────────┘
          ┌──────────────────┴──────────────────┐
          │        Same Code & Config           │  <- Git + MLflow
          └──────────────────┬──────────────────┘
     ┌───────────────────────┴───────────────────────┐
     │              Same Data Version                │  <- DVC/HF
     └───────────────────────┬───────────────────────┘
┌────────────────────────────┴────────────────────────────┐
│                 Same Random Seeds                       │  <- Foundation
└─────────────────────────────────────────────────────────┘
```

## Part 1: Random Seed Management

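Unseeded random number generation is usually the first source of non-reproducibility to rule out: weight initialization, data shuffling, dropout, and augmentation all draw random numbers.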

In [None]:
import random
import numpy as np
import torch
import os
from typing import Optional

def set_seed(seed: int = 42, deterministic: bool = True):
    """
    Set all random seeds for reproducibility.
    
    Args:
        seed: The seed value to use
        deterministic: If True, use deterministic algorithms (slower but reproducible)
    
    Example:
        set_seed(42)
        # Now all random operations are reproducible
    """
    # Python random
    random.seed(seed)
    
    # Numpy
    np.random.seed(seed)
    
    # PyTorch
    torch.manual_seed(seed)
    
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # For multi-GPU
    
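    # Affects hash randomization only in child processes started after this
    # point; the current interpreter read PYTHONHASHSEED at startup
    os.environ['PYTHONHASHSEED'] = str(seed)
    
    # Deterministic operations
    if deterministic:
        # Required by cuBLAS for deterministic results on CUDA 10.2+;
        # must be set before the first CUDA matrix multiplication
        os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        # For PyTorch 1.8+
        if hasattr(torch, 'use_deterministic_algorithms'):
            try:
                # warn_only (PyTorch 1.11+): ops without a deterministic
                # implementation warn instead of raising when they run
                torch.use_deterministic_algorithms(True, warn_only=True)
            except TypeError:
                # Older PyTorch without warn_only: such ops will raise
                torch.use_deterministic_algorithms(True)
    else:
        # Allow non-deterministic for speed
        torch.backends.cudnn.deterministic = False
        torch.backends.cudnn.benchmark = True
        if hasattr(torch, 'use_deterministic_algorithms'):
            torch.use_deterministic_algorithms(False)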
    
    print(f"Seeds set to {seed} (deterministic={deterministic})")

# Test reproducibility
print("Test 1: Set seed and generate random numbers")
set_seed(42)
print(f"  Python random: {random.random():.6f}")
print(f"  Numpy random: {np.random.random():.6f}")
print(f"  Torch random: {torch.rand(1).item():.6f}")

print("\nTest 2: Reset seed and verify same numbers")
set_seed(42)
print(f"  Python random: {random.random():.6f}")
print(f"  Numpy random: {np.random.random():.6f}")
print(f"  Torch random: {torch.rand(1).item():.6f}")
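
One caveat about `PYTHONHASHSEED`: Python reads it once at interpreter startup, so assigning it through `os.environ` inside a running process only affects child processes. The sketch below, standard library only, shows the effect by launching fresh interpreters. The helper name `child_hash` is hypothetical.

```python
import os
import subprocess
import sys

def child_hash(hash_seed: str) -> str:
    """Run a fresh interpreter with PYTHONHASHSEED set and return hash('abc') as text."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('abc'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

# The same hash seed gives the same string hash in every new interpreter.
print(child_hash("123") == child_hash("123"))
```

If your training depends on set or dict iteration order over strings, set the variable before Python starts (for example in the shell or the Dockerfile) rather than inside the notebook.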

In [None]:
# Demonstrate the problem without seeds
print("Without seed management:")
for i in range(3):
    x = torch.rand(3)
    print(f"  Run {i+1}: {x.numpy().round(4)}")

print("\nWith seed management:")
for i in range(3):
    set_seed(42)
    x = torch.rand(3)
    print(f"  Run {i+1}: {x.numpy().round(4)}")
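
Global seeding is fragile when other code draws from the same global generator between your calls. One alternative is to pass around a dedicated generator object. The sketch below uses the standard library's `random.Random`; `sample_indices` is a hypothetical helper. NumPy's `np.random.default_rng(seed)` and `torch.Generator().manual_seed(seed)` play the same role in their libraries.

```python
import random

def sample_indices(rng: random.Random, n: int, k: int) -> list:
    """Draw k distinct indices from range(n) using the given generator."""
    return rng.sample(range(n), k)

rng_a = random.Random(42)
first = sample_indices(rng_a, 100, 5)

# Unrelated code consuming the global generator does not disturb a local one.
random.random()

rng_b = random.Random(42)
second = sample_indices(rng_b, 100, 5)

print(first == second)  # True: same seed, same private stream
```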

### DataLoader Reproducibility

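DataLoaders with multiple workers need special handling: every worker is a separate process with its own random state, and the shuffle order depends on the generator the loader draws from.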

In [None]:
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id: int):
    """
    Seed function for DataLoader workers.
    
    This ensures each worker has a unique but reproducible seed.
    """
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def create_reproducible_dataloader(
    dataset,
    batch_size: int = 32,
    shuffle: bool = True,
    num_workers: int = 4,
    seed: int = 42
):
    """
    Create a DataLoader with reproducible behavior.
    """
    # Create generator with fixed seed
    generator = torch.Generator()
    generator.manual_seed(seed)
    
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator
    )

# Example (num_workers=0 here, so this checks the seeded shuffle order, not worker seeding)
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

print("Testing DataLoader reproducibility:")
for run in range(2):
    set_seed(42)
    loader = create_reproducible_dataloader(dataset, batch_size=10, shuffle=True, num_workers=0)
    first_batch = next(iter(loader))
    print(f"  Run {run+1} first batch sum: {first_batch[0].sum():.4f}")

---

## Part 2: Environment Reproducibility

To rerun an experiment months later, you need a record of the exact environment and a way to recreate it.

In [None]:
import sys
import platform
import subprocess
from datetime import datetime

def capture_environment() -> dict:
    """
    Capture comprehensive environment information.
    
    Returns:
        dict: Environment details for reproducibility
    """
    env = {
        "timestamp": datetime.now().isoformat(),
        "python": {
            "version": sys.version,
            "executable": sys.executable,
        },
        "platform": {
            "system": platform.system(),
            "release": platform.release(),
            "machine": platform.machine(),
            "processor": platform.processor(),
        },
        "packages": {},
        "gpu": {},
    }
    
    # Key packages
    key_packages = [
        'torch', 'numpy', 'transformers', 'datasets', 
        'mlflow', 'wandb', 'peft', 'accelerate'
    ]
    
    for pkg in key_packages:
        try:
            module = __import__(pkg)
            env["packages"][pkg] = getattr(module, '__version__', 'unknown')
        except ImportError:
            env["packages"][pkg] = 'not installed'
    
    # GPU info
    if torch.cuda.is_available():
        env["gpu"] = {
            "available": True,
            "device_count": torch.cuda.device_count(),
            "device_name": torch.cuda.get_device_name(0),
            "cuda_version": torch.version.cuda,
            "cudnn_version": torch.backends.cudnn.version(),
            "memory_total_gb": torch.cuda.get_device_properties(0).total_memory / 1024**3,
        }
    else:
        env["gpu"] = {"available": False}
    
    return env

# Capture current environment
env_info = capture_environment()

print("Current Environment:")
print("="*60)
print(f"\nPython: {env_info['python']['version'].split()[0]}")
print(f"Platform: {env_info['platform']['system']} {env_info['platform']['release']}")
print(f"Machine: {env_info['platform']['machine']}")

print("\nKey Packages:")
for pkg, version in env_info['packages'].items():
    print(f"  {pkg}: {version}")

if env_info['gpu']['available']:
    print(f"\nGPU: {env_info['gpu']['device_name']}")
    print(f"  Memory: {env_info['gpu']['memory_total_gb']:.1f} GB")
    print(f"  CUDA: {env_info['gpu']['cuda_version']}")

In [None]:
# Generate requirements.txt with pinned versions
def generate_requirements(output_path: str = "requirements.txt"):
    """
    Generate a requirements.txt with exact versions.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True,
            text=True,
            check=True  # raise if pip fails instead of writing an empty file
        )
        
        with open(output_path, "w") as f:
            f.write(f"# Generated: {datetime.now().isoformat()}\n")
            f.write(f"# Python: {sys.version.split()[0]}\n")
            f.write(f"# Platform: {platform.system()} {platform.machine()}\n\n")
            f.write(result.stdout)
        
        print(f"Requirements saved to {output_path}")
        return True
    except Exception as e:
        print(f"Error: {e}")
        return False

# Note: Uncomment to generate
# generate_requirements("requirements_frozen.txt")

### Docker for Complete Reproducibility

In [None]:
# Dockerfile template for DGX Spark
dockerfile_template = '''
# Dockerfile for Reproducible ML Training on DGX Spark
# Based on NGC PyTorch container for ARM64 + CUDA

FROM nvcr.io/nvidia/pytorch:25.11-py3

# Set environment variables for reproducibility
ENV PYTHONHASHSEED=42
ENV CUBLAS_WORKSPACE_CONFIG=:4096:8

# Install additional dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Copy training code
WORKDIR /workspace
COPY . /workspace

# Set default seed in environment
ENV RANDOM_SEED=42

# Entry point
ENTRYPOINT ["python", "train.py"]
'''

print("Dockerfile for Reproducible Training:")
print("="*60)
print(dockerfile_template)
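
The Dockerfile sets `RANDOM_SEED`, but nothing in this notebook reads it. Below is a minimal sketch of how a `train.py` entry point might pick it up. The helper name `seed_from_env` and the fallback value are assumptions, not part of the template above.

```python
import os

def seed_from_env(var_name: str = "RANDOM_SEED", default: int = 42) -> int:
    """Return the integer seed stored in an environment variable, or a default."""
    raw = os.environ.get(var_name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"{var_name} must be an integer, got {raw!r}") from exc

seed = seed_from_env()
print(f"Using seed {seed}")
# In train.py this value would then be passed to set_seed(seed).
```

Failing loudly on a malformed value is a deliberate choice here: silently falling back to a default would make two runs look comparable when they are not.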

---

## Part 3: Reproducibility Audit Framework

Let's build a framework to audit and verify reproducibility.

In [None]:
from dataclasses import dataclass, field
from typing import List, Dict, Any, Callable
import hashlib
import json

@dataclass
class ReproducibilityCheck:
    """Result of a reproducibility check."""
    name: str
    passed: bool
    message: str
    details: Dict = field(default_factory=dict)


class ReproducibilityAuditor:
    """
    Audit experiments for reproducibility.
    
    Example:
        auditor = ReproducibilityAuditor()
        results = auditor.audit(config, train_fn)
        auditor.print_report(results)
    """
    
    def __init__(self, tolerance: float = 1e-5):
        """
        Args:
            tolerance: Maximum allowed difference for numerical reproducibility
        """
        self.tolerance = tolerance
        self.checks: List[ReproducibilityCheck] = []
    
    def check_seeds(self, config: dict) -> ReproducibilityCheck:
        """
        Check if seeds are properly configured.
        """
        required_seeds = ['seed', 'random_seed', 'numpy_seed', 'torch_seed']
        found_seeds = [k for k in required_seeds if k in config]
        
        if not found_seeds:
            return ReproducibilityCheck(
                name="seed_configuration",
                passed=False,
                message="No seed found in config",
                details={"searched": required_seeds}
            )
        
        return ReproducibilityCheck(
            name="seed_configuration",
            passed=True,
            message=f"Seed found: {found_seeds[0]} = {config[found_seeds[0]]}",
            details={"seed_key": found_seeds[0], "seed_value": config[found_seeds[0]]}
        )
    
    def check_deterministic_mode(self) -> ReproducibilityCheck:
        """
        Check if PyTorch deterministic mode is enabled.
        """
        is_deterministic = torch.backends.cudnn.deterministic
        benchmark_disabled = not torch.backends.cudnn.benchmark
        
        passed = is_deterministic and benchmark_disabled
        
        return ReproducibilityCheck(
            name="deterministic_mode",
            passed=passed,
            message="Deterministic mode " + ("enabled" if passed else "disabled"),
            details={
                "cudnn_deterministic": is_deterministic,
                "cudnn_benchmark": not benchmark_disabled
            }
        )
    
    def check_environment_captured(self, env_info: dict) -> ReproducibilityCheck:
        """
        Check if environment info is properly captured.
        """
        required = ['python', 'packages', 'platform']
        missing = [k for k in required if k not in env_info]
        
        if missing:
            return ReproducibilityCheck(
                name="environment_capture",
                passed=False,
                message=f"Missing environment info: {missing}",
                details={"missing": missing}
            )
        
        return ReproducibilityCheck(
            name="environment_capture",
            passed=True,
            message="Environment properly captured",
            details={
                "python": env_info['python'].get('version', '')[:20],
                "packages_count": len(env_info.get('packages', {}))
            }
        )
    
    def verify_reproducibility(
        self,
        train_fn: Callable,
        config: dict,
        n_runs: int = 2
    ) -> ReproducibilityCheck:
        """
        Verify that training is reproducible by running multiple times.
        """
        results = []
        
        for i in range(n_runs):
            # Reset seeds before each run
            seed = config.get('seed', 42)
            set_seed(seed)
            
            # Run training
            result = train_fn(config)
            results.append(result)
        
        # Compare results
        first_result = results[0]
        differences = []
        
        for i, result in enumerate(results[1:], 1):
            if isinstance(first_result, dict) and isinstance(result, dict):
                for key in first_result:
                    if key in result:
                        diff = abs(first_result[key] - result[key])
                        # "not <=" also flags NaN, which "> tolerance" would miss
                        if not diff <= self.tolerance:
                            differences.append({
                                "run": i,
                                "metric": key,
                                "diff": diff
                            })
            elif isinstance(first_result, (int, float)):
                diff = abs(first_result - result)
                # "not <=" also flags NaN, which "> tolerance" would miss
                if not diff <= self.tolerance:
                    differences.append({"run": i, "diff": diff})
        
        passed = len(differences) == 0
        
        return ReproducibilityCheck(
            name="reproducibility_verification",
            passed=passed,
            message=f"Runs {'identical' if passed else 'differ'} within tolerance {self.tolerance}",
            details={
                "n_runs": n_runs,
                "differences": differences,
                "tolerance": self.tolerance
            }
        )
    
    def audit(
        self,
        config: dict,
        env_info: dict = None,
        train_fn: Callable = None
    ) -> List[ReproducibilityCheck]:
        """
        Run full reproducibility audit.
        """
        self.checks = []
        
        # Check seeds
        self.checks.append(self.check_seeds(config))
        
        # Check deterministic mode
        self.checks.append(self.check_deterministic_mode())
        
        # Check environment capture
        if env_info:
            self.checks.append(self.check_environment_captured(env_info))
        
        # Verify reproducibility
        if train_fn:
            self.checks.append(self.verify_reproducibility(train_fn, config))
        
        return self.checks
    
    def print_report(self, checks: List[ReproducibilityCheck] = None):
        """
        Print a formatted audit report.
        """
        checks = checks or self.checks
        
        print("\n" + "="*60)
        print("REPRODUCIBILITY AUDIT REPORT")
        print("="*60)
        
        passed_count = sum(1 for c in checks if c.passed)
        total_count = len(checks)
        
        print(f"\nOverall: {passed_count}/{total_count} checks passed")
        print("-"*40)
        
        for check in checks:
            icon = "✅" if check.passed else "❌"
            print(f"\n{icon} {check.name}")
            print(f"   {check.message}")
            if check.details and not check.passed:
                print(f"   Details: {check.details}")
        
        print("\n" + "="*60)

In [None]:
# Test the auditor
import torch.nn as nn
import torch.optim as optim

def simple_train(config: dict) -> dict:
    """
    Simple training function for testing reproducibility.
    """
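    # Create model (hidden size comes from config so it is logged, not hardcoded)
    hidden_dim = config.get('hidden_dim', 32)
    model = nn.Sequential(
        nn.Linear(10, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, 1)
    )
    
    optimizer = optim.Adam(model.parameters(), lr=config.get('lr', 0.01))
    criterion = nn.MSELoss()
    
    # Generate synthetic data from the seeded RNG; it is identical across
    # runs only because the caller sets the seed first
    X = torch.randn(100, 10)
    y = torch.randn(100, 1)
    
    # Train for config['epochs'] full-batch steps
    total_loss = 0.0
    for _ in range(config.get('epochs', 10)):
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()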
    
    return {
        "final_loss": loss.item(),
        "total_loss": total_loss,
        "first_param": model[0].weight[0, 0].item()
    }

# Configure and run audit
config = {
    "seed": 42,
    "lr": 0.01,
    "epochs": 10
}

auditor = ReproducibilityAuditor(tolerance=1e-6)
set_seed(42, deterministic=True)

env_info = capture_environment()

checks = auditor.audit(
    config=config,
    env_info=env_info,
    train_fn=simple_train
)

auditor.print_report()
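
`verify_reproducibility` compares only the keys present in both results, so a metric that disappears in a later run goes unnoticed. A stricter comparison could look like the sketch below. `compare_metrics` is a hypothetical helper, not part of the auditor above, and it assumes all metric values are plain numbers.

```python
import math

def compare_metrics(first: dict, other: dict, tolerance: float = 1e-6) -> list:
    """Return a list of human-readable problems; an empty list means the runs match."""
    problems = []
    for key in sorted(set(first) | set(other)):
        if key not in first or key not in other:
            problems.append(f"{key}: missing in one run")
            continue
        a, b = first[key], other[key]
        if math.isnan(a) or math.isnan(b):
            # NaN never compares equal, so report it explicitly.
            problems.append(f"{key}: NaN encountered")
        elif abs(a - b) > tolerance:
            problems.append(f"{key}: differs by {abs(a - b):.3e}")
    return problems

run_1 = {"final_loss": 0.5, "total_loss": 6.0}
run_2 = {"final_loss": 0.5, "total_loss": 6.1}
print(compare_metrics(run_1, run_1))  # []
print(compare_metrics(run_1, run_2))
```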

---

## Part 4: MLflow Reproducibility Logging

Integrate reproducibility info with MLflow.

In [None]:
import mlflow

def log_reproducibility_info(
    config: dict,
    env_info: dict = None
):
    """
    Log reproducibility information to MLflow.
    """
    # Log seed
    if 'seed' in config:
        mlflow.log_param("seed", config['seed'])
    
    # Log environment info
    if env_info is None:
        env_info = capture_environment()
    
    # Log as params
    mlflow.log_param("python_version", env_info['python']['version'].split()[0])
    mlflow.log_param("platform", f"{env_info['platform']['system']}-{env_info['platform']['machine']}")
    
    # Log key package versions
    for pkg, version in env_info['packages'].items():
        mlflow.log_param(f"pkg_{pkg}", version)
    
    # Log GPU info
    if env_info['gpu'].get('available'):
        mlflow.log_param("gpu_name", env_info['gpu']['device_name'])
        mlflow.log_param("cuda_version", env_info['gpu']['cuda_version'])
    
    # Log full environment as artifact
    with open("environment.json", "w") as f:
        json.dump(env_info, f, indent=2, default=str)
    mlflow.log_artifact("environment.json")
    os.remove("environment.json")
    
    # Log config as artifact
    with open("config.json", "w") as f:
        json.dump(config, f, indent=2, default=str)
    mlflow.log_artifact("config.json")
    os.remove("config.json")
    
    print("Reproducibility info logged to MLflow!")

In [None]:
# Example: Reproducible training run with MLflow
TRACKING_DIR = "./mlruns"
os.makedirs(TRACKING_DIR, exist_ok=True)
mlflow.set_tracking_uri(f"file://{os.path.abspath(TRACKING_DIR)}")
mlflow.set_experiment("Reproducibility-Demo")

config = {
    "seed": 42,
    "lr": 0.01,
    "hidden_dim": 64,
    "epochs": 20
}

with mlflow.start_run(run_name="reproducible-training"):
    # Set seeds
    set_seed(config['seed'], deterministic=True)
    
    # Log reproducibility info
    log_reproducibility_info(config)
    
    # Train
    result = simple_train(config)
    
    # Log metrics
    mlflow.log_metrics(result)
    
    print(f"Run ID: {mlflow.active_run().info.run_id}")
    print(f"Results: {result}")

---

## Part 5: Reproducibility Checklist

A comprehensive checklist for ensuring reproducibility.

In [None]:
reproducibility_checklist = '''
# ML Reproducibility Checklist

## Code & Configuration
- [ ] All random seeds are set and logged
- [ ] Config file is versioned with git
- [ ] All hyperparameters are in config (not hardcoded)
- [ ] Code version (git commit) is logged

## Environment
- [ ] Python version specified
- [ ] All package versions pinned (requirements.txt or pyproject.toml)
- [ ] CUDA/cuDNN versions documented
- [ ] Docker image used (if applicable)
- [ ] Hardware specifications documented

## Data
- [ ] Dataset version tracked (DVC, HF datasets, or manual versioning)
- [ ] Data preprocessing code is deterministic
- [ ] Train/val/test splits are fixed and documented
- [ ] Data loading order is deterministic (DataLoader seeds)

## Training
- [ ] torch.backends.cudnn.deterministic = True
- [ ] torch.backends.cudnn.benchmark = False
- [ ] DataLoader worker_init_fn and generator set
- [ ] Model initialization is seeded
- [ ] Optimizer state is reproducible

## Logging
- [ ] All parameters logged to experiment tracker
- [ ] Environment info logged
- [ ] Training metrics logged per step
- [ ] Final model saved with config

## Verification
- [ ] Run training 2+ times to verify identical results
- [ ] Document any known sources of non-determinism
- [ ] Test loading and inference reproducibility
'''

print(reproducibility_checklist)
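
For the Data items in the checklist, a lightweight complement to DVC or Hugging Face dataset revisions is to log a content hash of each data file. The sketch below uses `hashlib`; the helper name `file_sha256` and the throwaway demo file are illustrative only.

```python
import hashlib
import os
import tempfile

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file; in practice, point this at your dataset files.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"id,label\n1,0\n2,1\n")
    demo_path = tmp.name

fingerprint = file_sha256(demo_path)
print(fingerprint[:16])
os.remove(demo_path)
```

The digest can be logged next to the seed, for example with `mlflow.log_param`, so a later run can confirm it is training on byte-identical data.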

In [None]:
# Create a function to generate a reproducibility report
def generate_reproducibility_report(
    config: dict,
    env_info: dict,
    git_commit: str = None,
    output_path: str = "REPRODUCIBILITY.md"
) -> str:
    """
    Generate a markdown reproducibility report.
    """
    report = f"""# Reproducibility Report

Generated: {datetime.now().isoformat()}

## Configuration

```json
{json.dumps(config, indent=2, default=str)}
```

## Environment

| Component | Version |
|-----------|--------|
| Python | {env_info['python']['version'].split()[0]} |
| Platform | {env_info['platform']['system']} {env_info['platform']['machine']} |
| PyTorch | {env_info['packages'].get('torch', 'N/A')} |
| NumPy | {env_info['packages'].get('numpy', 'N/A')} |

## GPU Information

| Property | Value |
|----------|-------|
| Available | {env_info['gpu'].get('available', False)} |
| Device | {env_info['gpu'].get('device_name', 'N/A')} |
| CUDA | {env_info['gpu'].get('cuda_version', 'N/A')} |
| Memory | {env_info['gpu'].get('memory_total_gb', 0):.1f} GB |

## Reproducibility Settings

- Random Seed: `{config.get('seed', 'NOT SET')}`
- Deterministic Mode: `{torch.backends.cudnn.deterministic}`
- cuDNN Benchmark: `{torch.backends.cudnn.benchmark}`

## How to Reproduce

1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set the seed: `set_seed({config.get('seed', 42)})`
4. Run training: `python train.py --config config.json`
"""
    
    if git_commit:
        report += f"\n## Git Information\n\nCommit: `{git_commit}`\n"
    
    with open(output_path, "w") as f:
        f.write(report)
    
    print(f"Report saved to {output_path}")
    return report

# Generate report
report = generate_reproducibility_report(
    config=config,
    env_info=env_info
)

print("\nGenerated Report Preview:")
print("="*60)
print(report[:1000] + "...")
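
The call above leaves `git_commit` unset. One way to fill it is sketched below, assuming the code runs inside a Git working tree with the `git` CLI on the PATH. `get_git_commit` is a hypothetical helper and returns `None` when either assumption fails.

```python
import subprocess
from typing import Optional

def get_git_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the current commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a repository.
        return None
    return result.stdout.strip() or None

commit = get_git_commit()
print(commit if commit else "No git commit available")
```

The result can be passed as `git_commit=get_git_commit()`. Note that a commit hash does not capture uncommitted changes; `git status --porcelain` is one way to detect a dirty working tree before training.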

---

## Try It Yourself

Create a complete reproducible training pipeline:

1. Set up seeds properly
2. Capture environment
3. Train a model
4. Verify reproducibility with multiple runs
5. Generate a reproducibility report

<details>
<summary>Hint</summary>

Use the `ReproducibilityAuditor` class to verify your training:
```python
auditor = ReproducibilityAuditor(tolerance=1e-6)
checks = auditor.audit(config, env_info, train_fn)
assert all(c.passed for c in checks), "Reproducibility failed!"
```

</details>

In [None]:
# YOUR CODE HERE
# Create your reproducible training pipeline

# Your pipeline code...

---

## Common Mistakes

### Mistake 1: Setting Seeds Only Once

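```python
# Wrong - seed once per session, then train repeatedly
set_seed(42)
train()  # Run 1 starts from the seeded state
train()  # Run 2 starts wherever run 1 left the RNG

# Right - reset seeds at the start of every run
def train():
    set_seed(42)  # Same starting state each time
    for epoch in range(epochs):
        train_epoch()
```
**Why:** Every random draw advances the generator state, so a second run in the same process starts from a different state unless you reseed first. Reseed once per run, not once per epoch; reseeding inside the epoch loop would repeat the same shuffles and dropout masks every epoch.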

### Mistake 2: Ignoring DataLoader Workers

```python
# Wrong - workers have different random states
loader = DataLoader(dataset, num_workers=4, shuffle=True)

# Right - seed workers properly
loader = DataLoader(
    dataset,
    num_workers=4,
    shuffle=True,
    worker_init_fn=seed_worker,
    generator=torch.Generator().manual_seed(42)
)
```
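**Why:** Each worker is a separate process with its own random state, derived from a base seed that the main process draws when iteration starts. Passing a seeded `generator` fixes that base seed, and `worker_init_fn` propagates it to NumPy and Python's `random` inside each worker.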

### Mistake 3: Using benchmark Mode

```python
# Wrong - benchmark optimizes for speed, not reproducibility
torch.backends.cudnn.benchmark = True

# Right - deterministic mode
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
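**Why:** Benchmark mode times several cuDNN convolution algorithms and picks the fastest, so the selected algorithm, and with it the floating-point result, can vary between runs.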

### Mistake 4: Not Pinning Package Versions

```bash
# Wrong
pip install torch transformers

# Right
pip install torch==2.1.0 transformers==4.35.0
```
**Why:** New package versions may have different behaviors.
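
Pinning only helps if the running environment actually matches the pins. The sketch below compares `name==version` lines against the installed distributions using `importlib.metadata`. `check_pins` is a hypothetical helper and deliberately simplistic: it skips any line that is not a plain `==` pin and does not understand extras or environment markers.

```python
from importlib import metadata

def check_pins(requirement_lines):
    """Return {name: (pinned, installed)} for every pin that does not match.

    installed is None when the distribution is not installed.
    """
    mismatches = {}
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks, options, and non-exact specifiers
        name, pinned = (part.strip() for part in line.split("==", 1))
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches

pins = ["# example pins", "torch==2.1.0", "transformers==4.35.0"]
print(check_pins(pins))  # empty dict only if both pins match the environment
```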

---

## Checkpoint

You've learned:
- How to properly manage random seeds across libraries
- How to capture and recreate environments
- How to audit experiments for reproducibility
- How to integrate reproducibility with MLflow
- Best practices for reproducible ML

---

## Challenge (Optional)

Create a "reproducibility guarantee" system that:
1. Automatically captures all environment info before training
2. Runs training 3 times and compares results
3. Generates a badge/certificate if results match within tolerance
4. Automatically creates a GitHub issue if reproducibility fails

---

## Further Reading

- [PyTorch Reproducibility Guide](https://pytorch.org/docs/stable/notes/randomness.html)
- [The Machine Learning Reproducibility Checklist (Joelle Pineau)](https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf)
- [Papers with Code - Reproducibility](https://paperswithcode.com/)
- [MLflow Tracking Best Practices](https://mlflow.org/docs/latest/tracking.html#organizing-runs-into-experiments)

---

## Cleanup

In [None]:
# Clean up
import os

# Remove generated files
for f in ["REPRODUCIBILITY.md", "config.json", "environment.json"]:
    if os.path.exists(f):
        os.remove(f)

print("Cleanup complete!")
print("\n" + "="*60)
print("Congratulations! You've completed Module 4.3: MLOps & Experiment Tracking!")
print("="*60)

---

## Module Summary

In this module, you learned:

1. **MLflow Setup** - Track experiments with parameters, metrics, and artifacts
2. **W&B Integration** - Create dashboards and run hyperparameter sweeps
3. **Benchmark Suite** - Evaluate LLMs with standard benchmarks (MMLU, HellaSwag)
4. **Custom Evaluation** - Build task-specific metrics and LLM-as-judge
5. **Drift Detection** - Monitor models in production with Evidently AI
6. **Model Registry** - Version and deploy models safely
7. **Reproducibility** - Ensure experiments can be exactly replicated

---

## Next Module

Continue to **Module 4.4: Containerization & Deployment** to learn how to package and deploy your models as production services.

**Next:** [Module 4.4: Containerization & Deployment](../../module-4.4-containerization-deployment/)