# Module 10: Advanced Reproducibility & Containerization

**Estimated Time:** 50 minutes

## Learning Objectives

By the end of this module, you will be able to:

1. Explain computational reproducibility and its importance
2. Create Docker containers for reproducible research environments
3. Use version control for data and models (DVC, Git-LFS)
4. Manage dependencies with conda and pip
5. Automate workflows with Make and Snakemake
6. Implement reproducible random number generation
7. Document computational environments thoroughly
8. Create a fully reproducible research project

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import platform
import subprocess
from datetime import datetime
import warnings

warnings.filterwarnings("ignore")

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["font.size"] = 11

# Create output directory
import os

os.makedirs("../notebooks/outputs/module_10", exist_ok=True)

print("✓ Libraries imported successfully")
print("✓ Output directory created")

## 1. What is Computational Reproducibility?

**Computational Reproducibility**: The ability of independent researchers to obtain the same results using the original code and data.
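
In practice, "the same results" usually means numerically equal within a small tolerance, because floating-point output can differ slightly across platforms and library versions. The following is a minimal sketch of such a check using only the standard library; the function name `results_match`, the result keys, and the tolerance values are illustrative assumptions rather than an established convention.

```python
import math


def results_match(original, reproduced, rel_tol=1e-9, abs_tol=1e-12):
    """Return True if two result dicts have the same keys and near-equal values."""
    if original.keys() != reproduced.keys():
        return False
    return all(
        math.isclose(original[key], reproduced[key], rel_tol=rel_tol, abs_tol=abs_tol)
        for key in original
    )


# Hypothetical summary statistics from the original run and from a re-run
original_run = {"mean_score": 0.1 + 0.2, "effect_size": 0.45}
rerun = {"mean_score": 0.3, "effect_size": 0.45}

print("Match within tolerance:", results_match(original_run, rerun))
print("Exactly equal:", original_run == rerun)
```

The tolerant comparison accepts the two runs while exact equality does not, because `0.1 + 0.2` is not exactly `0.3` in binary floating point. Which tolerance is appropriate depends on the analysis and should be stated alongside the results.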

### The Reproducibility Spectrum

```
Low                                                                High
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
No code        Code          Code +         Code +        Containerized
shared         shared        dependencies   automation    + automated
                             documented     (workflows)   testing
│              │             │              │             │
Not            Minimally     Reproducible   Highly        Fully
reproducible   reproducible  (in principle) reproducible  reproducible
```

### Common Barriers to Reproducibility

In [None]:
# Illustrative data on reproducibility barriers (simulated values, not actual survey results)
barriers_data = {
    "Barrier": [
        "Different software\nversions",
        "Missing\ndependencies",
        "Undocumented\nsteps",
        "Hardware\ndifferences",
        "Random seed\nnot set",
        "Data not\navailable",
        "Code not\nshared",
        "OS differences",
    ],
    "Frequency": [78, 72, 68, 45, 42, 38, 35, 32],  # % of failures
    "Solution": [
        "Lockfiles",
        "Conda/Docker",
        "Workflow tools",
        "Containers",
        "Set seeds",
        "Share data",
        "Share code",
        "Containers",
    ],
}

barriers_df = pd.DataFrame(barriers_data)
barriers_df = barriers_df.sort_values("Frequency", ascending=True)

# Create visualization
fig, ax = plt.subplots(figsize=(12, 8))

# Color by solution type
solution_colors = {
    "Lockfiles": "#3498db",
    "Conda/Docker": "#e74c3c",
    "Workflow tools": "#2ecc71",
    "Containers": "#e74c3c",
    "Set seeds": "#f39c12",
    "Share data": "#9b59b6",
    "Share code": "#9b59b6",
}

colors = [solution_colors[sol] for sol in barriers_df["Solution"]]

bars = ax.barh(
    barriers_df["Barrier"],
    barriers_df["Frequency"],
    color=colors,
    edgecolor="black",
    linewidth=1.5,
    alpha=0.8,
)

# Add value labels
for i, (barrier, freq) in enumerate(zip(barriers_df["Barrier"], barriers_df["Frequency"])):
    ax.text(freq + 2, i, f"{freq}%", va="center", fontsize=11, fontweight="bold")

ax.set_xlabel("Percentage of Reproducibility Failures", fontsize=12, fontweight="bold")
ax.set_title(
    "Common Barriers to Computational Reproducibility", fontsize=14, fontweight="bold", pad=20
)
ax.set_xlim([0, 90])
ax.grid(axis="x", alpha=0.3, linestyle="--")

# Add legend
from matplotlib.patches import Patch

legend_elements = [
    Patch(facecolor="#e74c3c", edgecolor="black", label="Containerization"),
    Patch(facecolor="#3498db", edgecolor="black", label="Dependency locking"),
    Patch(facecolor="#2ecc71", edgecolor="black", label="Workflow automation"),
    Patch(facecolor="#f39c12", edgecolor="black", label="Random seeds"),
    Patch(facecolor="#9b59b6", edgecolor="black", label="Open science"),
]
ax.legend(handles=legend_elements, loc="lower right", fontsize=10, title="Solution Type")

plt.tight_layout()
plt.savefig(
    "../notebooks/outputs/module_10/reproducibility_barriers.png", dpi=300, bbox_inches="tight"
)
plt.show()

print("✓ Barriers visualization saved")
print("\n📊 Top 3 Barriers:")
# Rows are sorted ascending, so reverse the last three to list the most frequent first
for _, row in barriers_df.tail(3).iloc[::-1].iterrows():
    print(
        f"   {row['Barrier'].replace(chr(10), ' ')}: {row['Frequency']}% (solution: {row['Solution']})"
    )

## 2. Docker and Containerization

### What are Containers?

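**Containers** package your code, its dependencies, and the operating-system userland (system libraries and tools) into a single, portable unit. The kernel is not included; containers share the host's kernel.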

**Benefits:**
- ✓ "Works on my machine" → "Works on ANY machine"
- ✓ Isolates dependencies (no conflicts)
- ✓ Lightweight (compared to virtual machines)
- ✓ Versioned and shareable

### Docker Basics

**Key Concepts:**
- **Image**: Blueprint for container (like a class)
- **Container**: Running instance of image (like an object)
- **Dockerfile**: Recipe for building an image
- **Registry**: Repository for images (Docker Hub)

### Example Dockerfile for Research

In [None]:
# Create example Dockerfile for data science research
dockerfile_content = """# Research Project Dockerfile
# This creates a reproducible environment for data analysis

# Start from official Python image (patch version pinned so rebuilds use the same interpreter)
FROM python:3.10.8-slim

# Set metadata
LABEL maintainer="your.email@university.edu"
LABEL description="Reproducible environment for Sleep & Memory Study"
LABEL version="1.0"

# Set working directory
WORKDIR /research

# Install system dependencies
RUN apt-get update && apt-get install -y \\
    build-essential \\
    git \\
    curl \\
    && rm -rf /var/lib/apt/lists/*

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
# Pin exact versions for reproducibility
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV MPLBACKEND=Agg

# Default command: run analysis
CMD ["python", "analysis/main_analysis.py"]

# To build: docker build -t my-research:v1.0 .
# To run: docker run -v $(pwd)/data:/research/data my-research:v1.0
"""

print("📦 EXAMPLE DOCKERFILE")
print("=" * 70)
print(dockerfile_content)

# Save Dockerfile
with open("../notebooks/outputs/module_10/Dockerfile.example", "w") as f:
    f.write(dockerfile_content)

print("\n✓ Dockerfile saved to outputs/module_10/Dockerfile.example")

In [None]:
# Create docker-compose for more complex setups
docker_compose = """# docker-compose.yml
# For multi-container research setups

version: '3.8'

services:
  # Jupyter notebook server
  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./data:/research/data
      - ./notebooks:/research/notebooks
      - ./results:/research/results
    environment:
      - JUPYTER_ENABLE_LAB=yes
    # Note: the image must include JupyterLab (add jupyterlab to requirements.txt)
    command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root
  
  # Database (if needed)
  postgres:
    image: postgres:14
    environment:
      - POSTGRES_DB=research_db
      - POSTGRES_USER=researcher
      - POSTGRES_PASSWORD=secure_password
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

# Usage:
# docker-compose up -d        # Start all services
# docker-compose down         # Stop all services
# docker-compose logs jupyter # View logs
"""

print("🐳 DOCKER-COMPOSE EXAMPLE")
print("=" * 70)
print(docker_compose)

with open("../notebooks/outputs/module_10/docker-compose.yml.example", "w") as f:
    f.write(docker_compose)

print("\n✓ Docker-compose saved to outputs/module_10/docker-compose.yml.example")

## 3. Version Control for Data and Models

### The Problem with Large Files in Git

Git is designed for code (small text files), not:
- Large datasets (>100 MB)
- Binary files (models, images)
- Files that change frequently
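
Before committing, it can help to scan the working tree for files that exceed the size threshold mentioned above. The following is a minimal sketch using only the standard library; the function name `find_large_files` and the default limit of 100 MB are assumptions chosen to match the list above, not part of Git or any Git tool.

```python
from pathlib import Path


def find_large_files(root, limit_bytes=100 * 1024 * 1024):
    """Return (path, size) pairs for files under root larger than limit_bytes."""
    large = []
    for path in Path(root).rglob("*"):
        # Skip Git's own object store; only working-tree files matter here
        if ".git" in path.parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit_bytes:
            large.append((str(path), size))
    # Largest files first
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the repository root returns candidates for DVC or Git-LFS tracking, largest first. An empty list means nothing exceeds the limit.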

### Solutions

#### 1. DVC (Data Version Control)

**DVC** tracks large files separately from Git, storing metadata in Git and data in cloud storage.

```bash
# Initialize DVC
dvc init

# Track large file
dvc add data/raw/large_dataset.csv
git add data/raw/large_dataset.csv.dvc data/raw/.gitignore
git commit -m "Add dataset (tracked with DVC)"

# Configure remote storage (S3, Google Drive, etc.)
dvc remote add -d myremote s3://mybucket/dvcstore

# Push data to remote
dvc push

# Others can pull data
dvc pull
```

#### 2. Git-LFS (Large File Storage)

**Git-LFS** replaces large files with pointers in Git.

```bash
# Set up Git-LFS hooks (requires the git-lfs package to be installed first)
git lfs install

# Track file types
git lfs track "*.psd"
git lfs track "*.pkl"
git lfs track "*.h5"

# Add and commit as normal (include .gitattributes, where the tracking rules are stored)
git add .gitattributes model.pkl
git commit -m "Add trained model"
git push
```
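
To make the pointer idea concrete, the sketch below builds the small text stub that stands in for a large file, following the three-field layout (version, SHA-256 object ID, size) described in the Git-LFS pointer specification as I understand it. The function name `lfs_pointer_text` and the fake file content are assumptions for illustration; Git-LFS generates these pointers itself, so this code is never needed in a real workflow.

```python
import hashlib


def lfs_pointer_text(content: bytes) -> str:
    """Build the text of a Git-LFS style pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Hypothetical stand-in for a large binary file such as model.pkl
fake_model_bytes = b"\x00" * 5_000_000
pointer = lfs_pointer_text(fake_model_bytes)

print(pointer)
print(f"Pointer: {len(pointer)} bytes; original file: {len(fake_model_bytes)} bytes")
```

Git stores only the short pointer text in the repository history, while the hash identifies the real content held on the LFS server. DVC follows a similar idea with its `.dvc` metadata files.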

### DVC vs Git-LFS Comparison

In [None]:
# Create comparison table
comparison_data = {
    "Feature": [
        "Storage",
        "Max File Size",
        "Cost",
        "Pipeline Support",
        "Cloud Options",
        "Learning Curve",
        "Best For",
    ],
    "DVC": [
        "Separate from Git",
        "Unlimited",
        "Free (use own storage)",
        "Excellent (dvc.yaml)",
        "S3, GCS, Azure, SSH, etc.",
        "Moderate",
        "ML pipelines, large datasets",
    ],
    "Git-LFS": [
        "Git host's LFS server",
        "Set by host (e.g., 2 GB on GitHub Free)",
        "Host-dependent; free tiers have limited storage and bandwidth",
        "Limited",
        "GitHub, GitLab, Bitbucket",
        "Easy",
        "Binary assets, small models",
    ],
}

comparison_df = pd.DataFrame(comparison_data)
print("📊 DVC vs GIT-LFS COMPARISON")
print("=" * 80)
print(comparison_df.to_string(index=False))
print(
    "\n💡 Recommendation: Use DVC for data science workflows, Git-LFS for occasional large files."
)

# Save comparison
comparison_df.to_csv("../notebooks/outputs/module_10/dvc_vs_gitlfs.csv", index=False)
print("\n✓ Comparison saved to outputs/module_10/dvc_vs_gitlfs.csv")

## 4. Dependency Management

### The Dependency Hell Problem

**Scenario:** Your code works today, but in 6 months:
- Package versions have changed
- Dependencies conflict
- Code breaks

**Solution:** Lock exact versions of ALL dependencies.

### Python: pip and conda

In [None]:
# Generate requirements.txt with exact versions
requirements_locked = """# requirements.txt (LOCKED VERSIONS)
# Generated: 2025-01-20
# Python: 3.10.8

# Core scientific computing
numpy==1.24.2
pandas==2.0.1
scipy==1.10.1

# Statistics
statsmodels==0.14.0
pingouin==0.5.3
scikit-learn==1.3.0

# Visualization
matplotlib==3.7.1
seaborn==0.12.2
plotly==5.17.0

# Jupyter
jupyter==1.0.0
ipykernel==6.23.1
nbconvert==7.6.0

# To generate this file:
# pip freeze > requirements.txt

# To install exact versions:
# pip install -r requirements.txt
"""

print("📦 LOCKED REQUIREMENTS.TXT")
print("=" * 70)
print(requirements_locked)

with open("../notebooks/outputs/module_10/requirements_locked.txt", "w") as f:
    f.write(requirements_locked)

print("\n✓ Requirements file saved")

In [None]:
# Generate conda environment.yml
conda_env = """# environment.yml (CONDA ENVIRONMENT)
name: research-env
channels:
  - conda-forge
  - defaults

dependencies:
  # Python version
  - python=3.10.8
  
  # Core packages
  - numpy=1.24.2
  - pandas=2.0.1
  - scipy=1.10.1
  - matplotlib=3.7.1
  - seaborn=0.12.2
  - scikit-learn=1.3.0
  - statsmodels=0.14.0
  
  # Jupyter
  - jupyter=1.0.0
  - ipykernel=6.23.1
  
  # pip packages (if not in conda)
  - pip:
    - pingouin==0.5.3
    - plotly==5.17.0

# To create environment:
# conda env create -f environment.yml

# To activate:
# conda activate research-env

# To export current environment:
# conda env export > environment.yml
"""

print("🐍 CONDA ENVIRONMENT.YML")
print("=" * 70)
print(conda_env)

with open("../notebooks/outputs/module_10/environment.yml", "w") as f:
    f.write(conda_env)

print("\n✓ Conda environment file saved")

### Best Practices

1. **Lock versions early:** Generate requirements.txt or environment.yml at project start
2. **Update deliberately:** Don't automatically update packages
3. **Test after updates:** Run tests before committing version changes
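4. **Document Python version:** Record it alongside the package pins (e.g., in `environment.yml`, the Dockerfile base image, or a comment in `requirements.txt`)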
5. **Use virtual environments:** Never install packages globally
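
One way to act on these practices is to check, at the start of an analysis, that the installed packages match the pinned versions. The sketch below uses `importlib.metadata` from the standard library. The helper names `parse_pins` and `find_mismatches` are hypothetical, and the parser only understands simple `name==version` lines (no extras, environment markers, or version ranges).

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Extract {package: version} from lines of the form 'name==version'."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Example pins in the same style as a locked requirements.txt
example = """
# Core scientific computing
numpy==1.24.2
pandas==2.0.1   # pinned
"""
pins = parse_pins(example)
for name, (pinned, installed) in find_mismatches(pins).items():
    print(f"{name}: pinned {pinned}, installed {installed or 'not installed'}")
```

An empty result from `find_mismatches` means the current environment agrees with the lockfile. Anything printed is a version drift worth resolving before results are compared.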

## 5. Workflow Automation

### Why Automate Workflows?

Manual workflows lead to:
- ❌ Forgotten steps
- ❌ Inconsistent execution order
- ❌ Wasted time re-running unchanged steps

Automated workflows provide:
- ✓ Reproducibility (same steps, same order)
- ✓ Efficiency (only re-run what changed)
- ✓ Documentation (workflow IS the documentation)

### Make and Makefiles

**Make** is a classic build automation tool.
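
Make decides what to re-run by comparing file modification times: a target is rebuilt when it is missing or older than any of its prerequisites. The sketch below mimics that rule in Python to show why unchanged steps are skipped. The function name `needs_rebuild` and the file names are assumptions for illustration, and the timestamps are set artificially with `os.utime`.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Mimic Make's rule: rebuild if target is missing or older than a prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in prerequisites)


demo_results = []
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "clean_data.csv")
    output = os.path.join(workdir, "analysis_output.csv")

    with open(source, "w"):
        pass
    demo_results.append(needs_rebuild(output, [source]))  # output does not exist yet

    with open(output, "w"):
        pass
    os.utime(source, (1_000, 1_000))  # pretend the input is old
    os.utime(output, (2_000, 2_000))  # and the output was written later
    demo_results.append(needs_rebuild(output, [source]))

    os.utime(source, (3_000, 3_000))  # input modified after the output
    demo_results.append(needs_rebuild(output, [source]))

print(demo_results)
```

The three checks correspond to a first run, an up-to-date target, and a target made stale by an edited input.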

In [None]:
# Create example Makefile for research project
makefile_content = """# Makefile for Research Project
# Automates the entire analysis pipeline

.PHONY: all clean data analysis figures report

# Default target: run entire pipeline
all: report

# Download and preprocess data
data: data/processed/clean_data.csv

data/processed/clean_data.csv: scripts/01_download_data.py scripts/02_clean_data.py
	@echo "Downloading and cleaning data..."
	python scripts/01_download_data.py
	python scripts/02_clean_data.py
	@echo "✓ Data ready"

# Run statistical analysis
analysis: results/analysis_output.csv

results/analysis_output.csv: data/processed/clean_data.csv scripts/03_analyze.py
	@echo "Running analysis..."
	python scripts/03_analyze.py
	@echo "✓ Analysis complete"

# Generate figures
figures: results/figures/figure1.png

results/figures/figure1.png: results/analysis_output.csv scripts/04_visualize.py
	@echo "Creating figures..."
	python scripts/04_visualize.py
	@echo "✓ Figures created"

# Compile final report
report: results/final_report.pdf

results/final_report.pdf: results/analysis_output.csv results/figures/figure1.png manuscript/main.Rmd
	@echo "Compiling report..."
	Rscript -e "rmarkdown::render('manuscript/main.Rmd', output_dir='results', output_file='final_report.pdf')"
	@echo "✓ Report complete: results/final_report.pdf"

# Clean all generated files
clean:
	@echo "Cleaning generated files..."
	rm -rf data/processed/*
	rm -rf results/*
	@echo "✓ Clean complete"

# Usage:
# make          # Run entire pipeline
# make data     # Only download/clean data
# make analysis # Only run analysis
# make clean    # Remove all generated files
"""

print("🔨 EXAMPLE MAKEFILE")
print("=" * 70)
print(makefile_content)

with open("../notebooks/outputs/module_10/Makefile.example", "w") as f:
    f.write(makefile_content)

print("\n✓ Makefile saved to outputs/module_10/Makefile.example")

### Snakemake for Data Science

**Snakemake** is a Make-like workflow manager with features aimed at data science:
- Python-based syntax
- Automatic parallelization
- Cloud execution support
- Conda integration
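
Workflow managers of this kind treat the pipeline as a directed acyclic graph and run each rule only after the rules it depends on. The sketch below illustrates that ordering step with the standard library's `graphlib`. It is a conceptual illustration, not Snakemake's implementation: Snakemake infers dependencies by matching input and output file names, whereas here the rule names and their dependencies are written out by hand as assumptions.

```python
from graphlib import TopologicalSorter

# Each rule maps to the set of rules whose outputs it needs as inputs.
# Rule names describe a hypothetical five-step pipeline.
rule_dependencies = {
    "download_data": set(),
    "clean_data": {"download_data"},
    "analyze": {"clean_data"},
    "visualize": {"analyze"},
    "report": {"analyze", "visualize"},
}

# static_order() yields every rule after all of its dependencies
execution_order = list(TopologicalSorter(rule_dependencies).static_order())
print(execution_order)
```

Rules with no dependency path between them could run in parallel, which is what makes automatic parallelization possible once the graph is known.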

In [None]:
# Create example Snakefile
snakefile_content = """# Snakefile for Research Project
# More powerful than Make, Python-based

# Configuration
configfile: "config.yaml"

# Target rule: what we want to produce
rule all:
    input:
        "results/final_report.pdf",
        "results/figures/figure1.png"

# Download raw data
rule download_data:
    output:
        "data/raw/dataset.csv"
    params:
        url = config["data_url"]
    shell:
        "wget {params.url} -O {output}"

# Clean data
rule clean_data:
    input:
        "data/raw/dataset.csv"
    output:
        "data/processed/clean_data.csv"
    conda:
        "envs/data_processing.yaml"
    script:
        "scripts/clean_data.py"

# Run analysis
rule analyze:
    input:
        "data/processed/clean_data.csv"
    output:
        "results/analysis_output.csv",
        "results/stats.txt"
    params:
        alpha = config["alpha"]
    conda:
        "envs/analysis.yaml"
    script:
        "scripts/analyze.py"

# Create visualizations
rule visualize:
    input:
        "results/analysis_output.csv"
    output:
        "results/figures/figure1.png"
    conda:
        "envs/visualization.yaml"
    script:
        "scripts/visualize.py"

# Generate report
rule report:
    input:
        data = "results/analysis_output.csv",
        figures = "results/figures/figure1.png"
    output:
        "results/final_report.pdf"
    conda:
        "envs/report.yaml"
    shell:
        "Rscript -e 'rmarkdown::render(\\"manuscript/main.Rmd\\", output_dir=\\"results\\", output_file=\\"final_report.pdf\\")'"

# Usage:
# snakemake --cores 4                    # Run pipeline with 4 cores
# snakemake --cores 4 --use-conda        # Run with the per-rule conda environments
# snakemake --dag | dot -Tpng > dag.png  # Visualize workflow (requires Graphviz)
# snakemake -n                           # Dry run (show what would run)
"""

print("🐍 EXAMPLE SNAKEFILE")
print("=" * 70)
print(snakefile_content)

with open("../notebooks/outputs/module_10/Snakefile.example", "w") as f:
    f.write(snakefile_content)

print("\n✓ Snakefile saved to outputs/module_10/Snakefile.example")

## 6. Reproducible Random Number Generation

### The Problem

Many analyses use randomness:
- Random sampling
- Train/test splits
- Monte Carlo simulations
- Stochastic algorithms (e.g., SGD)

**Without setting seeds, results are different every time!**

### Solution: Set Random Seeds

In [None]:
# Demonstrate importance of random seeds
print("🎲 RANDOM SEED DEMONSTRATION\n")

# Without seed
print("WITHOUT SEED (different each time):")
sample1 = np.random.normal(0, 1, 5)
sample2 = np.random.normal(0, 1, 5)
print(f"Run 1: {sample1}")
print(f"Run 2: {sample2}")
print(f"Identical? {np.array_equal(sample1, sample2)}\n")

# With seed
print("WITH SEED (reproducible):")
np.random.seed(42)
sample3 = np.random.normal(0, 1, 5)
np.random.seed(42)  # Reset to same seed
sample4 = np.random.normal(0, 1, 5)
print(f"Run 1: {sample3}")
print(f"Run 2: {sample4}")
print(f"Identical? {np.array_equal(sample3, sample4)}")

print("\n✓ Always set random seeds for reproducibility!")

In [None]:
# Create comprehensive seed-setting function
def set_all_seeds(seed=42):
    """
    Set random seeds for all common libraries.

    This ensures reproducibility across:
    - NumPy
    - Python's random module
    - TensorFlow (if installed)
    - PyTorch (if installed)

    Parameters:
    -----------
    seed : int
        Random seed value
    """
    # Python's random module
    import random

    random.seed(seed)

    # NumPy
    np.random.seed(seed)

    # TensorFlow (if available)
    try:
        import tensorflow as tf

        tf.random.set_seed(seed)
    except ImportError:
        pass

    # PyTorch (if available)
    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(seed)
            torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

    print(f"✓ All random seeds set to {seed}")


# Usage example
set_all_seeds(42)

# Save function to file
seed_script = '''"""Reproducibility utilities."""
import numpy as np
import random

def set_all_seeds(seed=42):
    """Set random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass
    
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
    
    return seed

# Usage:
# from reproducibility_utils import set_all_seeds
# set_all_seeds(42)
'''

with open("../notebooks/outputs/module_10/reproducibility_utils.py", "w") as f:
    f.write(seed_script)

print("✓ Seed-setting utility saved to outputs/module_10/reproducibility_utils.py")

### Best Practices for Random Seeds

1. **Set seeds at the beginning** of your script/notebook
2. **Document the seed value** in your paper (e.g., "We used seed=42 for all random operations")
3. **Use consistent seeds** across related scripts
4. **Test with multiple seeds** to ensure results are robust (report in sensitivity analysis)
5. **Don't repeatedly reset seeds** within a script: re-seeding with the same value replays the same random stream, so draws that should be independent become identical
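
The robustness check in point 4 can be sketched as a loop over several seeds, with each run using its own generator object instead of re-seeding the global state. The example below uses only the standard library. The toy task (a Monte Carlo estimate of pi), the function name `estimate_pi`, and the chosen seeds are assumptions for illustration only.

```python
import random
import statistics


def estimate_pi(seed, n_points=20_000):
    """Monte Carlo estimate of pi using a generator local to this call."""
    rng = random.Random(seed)  # independent of the global random state
    inside = sum(
        1 for _ in range(n_points) if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * inside / n_points


seeds = [0, 1, 2, 3, 42]
estimates = {seed: estimate_pi(seed) for seed in seeds}
for seed, value in estimates.items():
    print(f"seed={seed:>2}: pi ~ {value:.4f}")

# Spread across seeds indicates how much the result depends on the seed
spread = statistics.stdev(list(estimates.values()))
print(f"Std. dev. across seeds: {spread:.4f}")
```

Each seed reproduces its own estimate exactly. A small spread across seeds suggests the conclusion does not hinge on one particular seed, which is the kind of evidence a sensitivity analysis would report.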

## 7. Environment Documentation

### Why Document Your Environment?

Future users (including future you!) need to know:
- What software versions you used
- What operating system
- What hardware (for GPU code)

### Creating a Reproducibility Report

In [None]:
def generate_reproducibility_report(output_file="reproducibility_report.txt"):
    """
    Generate comprehensive environment documentation.
    """
    report = []
    report.append("=" * 80)
    report.append("REPRODUCIBILITY REPORT")
    report.append("=" * 80)
    report.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")

    # Python version
    report.append("PYTHON ENVIRONMENT")
    report.append("-" * 80)
    report.append(f"Python version: {sys.version}")
    report.append(f"Python executable: {sys.executable}\n")

    # System information
    report.append("SYSTEM INFORMATION")
    report.append("-" * 80)
    report.append(f"Platform: {platform.platform()}")
    report.append(f"System: {platform.system()}")
    report.append(f"Release: {platform.release()}")
    report.append(f"Machine: {platform.machine()}")
    report.append(f"Processor: {platform.processor()}\n")

    # Installed packages
    report.append("INSTALLED PACKAGES")
    report.append("-" * 80)

    # Key packages
    key_packages = ["numpy", "pandas", "scipy", "matplotlib", "seaborn", "sklearn", "statsmodels"]

    for pkg in key_packages:
        try:
            module = __import__(pkg)
            version = getattr(module, "__version__", "unknown")
            report.append(f"{pkg}: {version}")
        except ImportError:
            report.append(f"{pkg}: NOT INSTALLED")

    report.append("\nFor complete package list, run: pip freeze\n")

    # Random seed
    report.append("REPRODUCIBILITY SETTINGS")
    report.append("-" * 80)
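    report.append("Random seed: 42 (set at beginning of all scripts)")
    # Read the print options actually in effect instead of hardcoding them
    printopts = np.get_printoptions()
    report.append(
        f"NumPy printoptions: precision={printopts['precision']}, suppress={printopts['suppress']}\n"
    )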

    # Data sources
    report.append("DATA SOURCES")
    report.append("-" * 80)
    report.append("Raw data: data/raw/dataset.csv (SHA256: ...)")
    report.append("See data/README.md for data provenance\n")

    # Analysis scripts
    report.append("ANALYSIS WORKFLOW")
    report.append("-" * 80)
    report.append("To reproduce analysis:")
    report.append("  1. Install dependencies: pip install -r requirements.txt")
    report.append("  2. Run workflow: make all")
    report.append("  OR use Docker: docker build -t analysis . && docker run analysis\n")

    report.append("=" * 80)

    # Write report
    report_text = "\n".join(report)

    # Print to console
    print(report_text)

    # Save to file
    output_path = f"../notebooks/outputs/module_10/{output_file}"
    with open(output_path, "w") as f:
        f.write(report_text)

    print(f"\n✓ Report saved to {output_path}")
    return report_text


# Generate report
generate_reproducibility_report()
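
The report's data-sources entry leaves the SHA256 value as a placeholder. A checksum like that can be computed with the standard library's `hashlib`, as in the minimal sketch below. The function name `sha256_of_file` and the small demo file are assumptions for illustration; in a real project the function would be pointed at the actual raw data file.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use constant for large datasets
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "dataset.csv")  # hypothetical stand-in file
    with open(demo_path, "wb") as handle:
        handle.write(b"id,score\n1,0.5\n")
    demo_digest = sha256_of_file(demo_path)
    print(f"{os.path.basename(demo_path)}  SHA256: {demo_digest}")
```

Recording the digest next to the data lets anyone confirm they are analysing byte-for-byte the same file.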

## 8. Complete Reproducibility Checklist

### Use this checklist for every research project:

In [None]:
checklist = """COMPUTATIONAL REPRODUCIBILITY CHECKLIST
=====================================================================

ENVIRONMENT
☐ requirements.txt or environment.yml created with EXACT versions
☐ Python version documented
☐ Dockerfile created (or instructions for container)
☐ Virtual environment used (never install globally)
☐ Environment documentation generated (pip freeze, conda list)

CODE
☐ All code in version control (Git)
☐ Random seeds set at beginning of scripts
☐ Seed values documented in paper/README
☐ No hardcoded file paths (use relative paths or config files)
☐ Code is commented and readable
☐ Functions have docstrings

DATA
☐ Raw data preserved (never modified)
☐ Data cleaning/processing scripted (not manual)
☐ Large files managed with DVC or Git-LFS
☐ Data provenance documented (where did data come from?)
☐ Data sharing plan (public repository or on request)
☐ License for data specified

WORKFLOW
☐ Analysis workflow automated (Makefile or Snakefile)
☐ Pipeline runs from start to finish without manual intervention
☐ Pipeline tested on clean environment
☐ Workflow diagram created (optional but helpful)

DOCUMENTATION
☐ README with setup instructions
☐ README with execution instructions
☐ README with expected outputs
☐ Comments explain "why", not just "what"
☐ Unusual dependencies explained
☐ Known issues documented

TESTING
☐ Someone else can run your code (ideally on different machine)
☐ Results match reported values
☐ Code runs without errors
☐ Outputs generated as expected

PUBLICATION
☐ Code repository linked in paper
☐ Data repository linked in paper (or availability statement)
☐ Random seeds reported in methods
☐ Software versions reported in methods
☐ Deviations from preregistration documented (if applicable)

LONG-TERM PRESERVATION
☐ Code archived with DOI (Zenodo, figshare)
☐ Data archived with DOI
☐ Container image archived (e.g., Docker Hub, or an image export deposited on Zenodo)
☐ Preregistration linked (OSF, AsPredicted)

=====================================================================
GOLD STANDARD: "One-click reproducibility"
  Goal: Someone can clone your repo, run one command, and get results
  Example: docker run myimage
        OR: make all
=====================================================================
"""

print(checklist)

with open("../notebooks/outputs/module_10/reproducibility_checklist.txt", "w") as f:
    f.write(checklist)

print("\n✓ Checklist saved to outputs/module_10/reproducibility_checklist.txt")
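
The checklist item on avoiding hardcoded file paths can be met by deriving every location from a single project root. The following is a minimal sketch with `pathlib`. The function name `project_paths` and the directory layout are assumptions for illustration, and a relative root of `"."` is used only to keep the example runnable anywhere.

```python
from pathlib import Path


def project_paths(project_root):
    """Derive all data and result locations from a single project root."""
    root = Path(project_root)
    return {
        "raw_data": root / "data" / "raw" / "dataset.csv",
        "clean_data": root / "data" / "processed" / "clean_data.csv",
        "figures": root / "results" / "figures",
    }


# In a script, the root would typically be computed from the script's own
# location, for example Path(__file__).resolve().parents[1].
paths = project_paths(".")
for name, path in paths.items():
    print(f"{name}: {path}")
```

Because no absolute path appears in the code, the project can be cloned to any directory or mounted into a container without edits.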

## 9. Practice Exercise: Create a Reproducible Mini-Project

### Task

Create a fully reproducible analysis project with:
1. Locked dependencies
2. Random seed setting
3. Automated workflow
4. Documentation

### Starter Template

In [None]:
project_structure = """REPRODUCIBLE PROJECT TEMPLATE
=====================================================================

my_research_project/
│
├── README.md                    # Project overview and instructions
├── requirements.txt             # Python dependencies (LOCKED versions)
├── Dockerfile                   # Container definition
├── Makefile                     # Workflow automation
├── .gitignore                   # Git ignore file
│
├── data/
│   ├── raw/                     # Original, immutable data
│   │   └── .gitkeep
│   ├── processed/               # Cleaned data (generated)
│   │   └── .gitkeep
│   └── README.md                # Data provenance
│
├── scripts/
│   ├── 01_download_data.py      # Data acquisition
│   ├── 02_clean_data.py         # Data preprocessing
│   ├── 03_analyze.py            # Statistical analysis
│   ├── 04_visualize.py          # Create figures
│   └── reproducibility_utils.py # Helper functions
│
├── notebooks/
│   ├── 01_exploratory_analysis.ipynb
│   └── 02_final_analysis.ipynb
│
├── results/
│   ├── figures/                 # Generated plots
│   ├── tables/                  # Generated tables
│   └── stats/                   # Statistical output
│
├── manuscript/
│   ├── main.Rmd                 # R Markdown manuscript
│   └── references.bib           # Bibliography
│
└── tests/
    └── test_analysis.py         # Unit tests (optional)

=====================================================================

README.md TEMPLATE:
-------------------

# Project Title

## Overview
Brief description of research project.

## Setup
```bash
# Clone repository
git clone https://github.com/username/project.git
cd project

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\\Scripts\\activate

# Install dependencies
pip install -r requirements.txt
```

## Usage
```bash
# Run entire pipeline
make all

# OR using Docker
docker build -t my-analysis .
docker run -v $(pwd)/results:/research/results my-analysis
```

## Expected Output
- `results/figures/figure1.png`: Main result
- `results/final_report.pdf`: Complete analysis

## Environment
- Python 3.10.8
- See `requirements.txt` for package versions
- Random seed: 42

## Citation
If you use this code, please cite:
[Citation info]

## License
MIT License

=====================================================================
"""

print(project_structure)

with open("../notebooks/outputs/module_10/project_template.txt", "w") as f:
    f.write(project_structure)

print("\n✓ Project template saved to outputs/module_10/project_template.txt")
print("\n💡 Exercise: Create this structure for your own research project!")
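
As a starting point for the exercise, the template's directories and placeholder files can be created with a short script. The sketch below uses `pathlib`. The function name `scaffold_project` and the exact lists of directories and files are assumptions based on the template above, and should be adjusted to the project at hand.

```python
from pathlib import Path

# Directory and file names follow the template; adjust them to your project
DIRECTORIES = [
    "data/raw",
    "data/processed",
    "scripts",
    "notebooks",
    "results/figures",
    "results/tables",
    "results/stats",
    "manuscript",
    "tests",
]
EMPTY_FILES = [
    "README.md",
    "requirements.txt",
    "Dockerfile",
    "Makefile",
    ".gitignore",
    "data/raw/.gitkeep",
    "data/processed/.gitkeep",
    "data/README.md",
]


def scaffold_project(root):
    """Create the template's directories and placeholder files under root."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for file_name in EMPTY_FILES:
        # touch() creates missing files and leaves existing content untouched
        (root / file_name).touch(exist_ok=True)
    return root
```

Running `scaffold_project("my_research_project")` is safe to repeat: existing directories and files are kept as they are.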

## 10. Summary

### Key Takeaways

1. **Reproducibility is a spectrum**
   - Minimum: Share code
   - Better: Lock dependencies
   - Best: Containerize + automate + test

2. **Containers solve environment problems**
   - Docker packages code + dependencies + OS
   - "Works on my machine" → "Works everywhere"
   - Essential for long-term reproducibility

3. **Version control for data, not just code**
   - DVC for large datasets and ML pipelines
   - Git-LFS for occasional large files
   - Never commit large files directly to Git

4. **Lock ALL dependencies**
   - Use requirements.txt (pip freeze)
   - Use environment.yml (conda)
   - Include Python version
   - Test in clean environment

5. **Automate workflows**
   - Make or Snakemake for pipelines
   - Documents analysis steps
   - Enables one-click reproduction
   - Only re-runs changed steps

6. **Always set random seeds**
   - At beginning of scripts
   - For all random libraries
   - Document seed values
   - Test robustness with multiple seeds

7. **Document your environment**
   - Generate reproducibility report
   - Include in supplementary materials
   - Update when packages change

### Levels of Reproducibility

**Level 1: Bare Minimum**
- ☐ Code shared publicly
- ☐ Data available

**Level 2: Good Practice**
- ☐ Dependencies documented
- ☐ Random seeds set
- ☐ README with instructions

**Level 3: Excellent Practice**
- ☐ Dependencies locked (requirements.txt)
- ☐ Workflow automated (Make/Snakemake)
- ☐ Tested on clean environment

**Level 4: Gold Standard**
- ☐ Fully containerized (Docker)
- ☐ One-click reproduction
- ☐ Automated tests
- ☐ Archived with DOI

### Impact

Reproducible research:
- ✓ Enables verification of findings
- ✓ Facilitates building on prior work
- ✓ May increase citation rates
- ✓ Saves time (for future you!)
- ✓ Builds trust in science

## Additional Resources

### Docker
- Docker Documentation: https://docs.docker.com
- Rocker (R + Docker): https://www.rocker-project.org
- Jupyter Docker Stacks: https://jupyter-docker-stacks.readthedocs.io

### Version Control
- DVC Documentation: https://dvc.org/doc
- Git-LFS: https://git-lfs.github.com

### Workflow Tools
- GNU Make: https://www.gnu.org/software/make/manual/
- Snakemake: https://snakemake.readthedocs.io
- Nextflow: https://www.nextflow.io (for bioinformatics)

### Readings
- Sandve et al. (2013). Ten simple rules for reproducible computational research. *PLOS Computational Biology*.
- Wilson et al. (2017). Good enough practices in scientific computing. *PLOS Computational Biology*.
- Peng (2011). Reproducible research in computational science. *Science*, 334(6060), 1226-1227.
- The Turing Way: https://the-turing-way.netlify.app

### Tools
- CodeOcean: Cloud platform for reproducible code
- Binder: Run Jupyter notebooks in browser
- Gigantum: Collaborative data science platform
- Renku: GitLab-based reproducible research platform

## Next Steps

In **Module 11: Research Collaboration & Project Management**, you'll learn:
- Git workflows for team collaboration
- Project management tools (Trello, GitHub Projects)
- Collaborative writing with Google Docs, Overleaf
- Code review best practices
- Managing co-authorship and contributions
- Communication strategies for research teams

Transform from solo researcher to effective collaborator! 🤝