# Module 06: Reproducibility Crisis and Documentation Standards

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 75 minutes

**Prerequisites**: Module 05: Statistical Validation and Hypothesis Testing

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand the reproducibility crisis in ML-based science
2. Apply the NeurIPS reproducibility checklist
3. Implement comprehensive documentation standards
4. Use version control effectively for research
5. Create reproducible computational environments
6. Design data preprocessing pipelines that prevent data leakage
7. Document data lineage and transformations

## Setup

Let's import required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

np.random.seed(42)

print('✓ Libraries imported successfully!')

## 1. The Reproducibility Crisis in ML-Based Science

### Understanding the Scale

The reproducibility crisis is a systemic problem in modern science.

**Key Statistics:**

- **Princeton Study**: 41 papers from 30 fields had reproducibility failures
- **Cascading Impact**: These 41 papers affected 648 subsequent papers
- **Nature 2016 Survey**: More than 70% of surveyed researchers had tried and failed to reproduce another scientist's experiments
- **Self-Reproducibility**: More than half had failed to reproduce their own experiments
- **Root Cause**: In ML-based science, data leakage is one of the most pervasive causes

### Why This Matters

Irreproducible research has serious consequences:
1. Wasted resources building on flawed findings
2. Research effort steered in the wrong direction
3. Breakdown of the scientific method, which depends on independent verification
4. Loss of public trust in science
5. Real-world harm in applied domains (medicine, autonomous systems)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

categories = ['Flawed Papers', 'Papers Affected']
values = [41, 648]
colors = ['#e74c3c', '#c0392b']

axes[0, 0].bar(categories, values, color=colors, edgecolor='black', linewidth=2)
axes[0, 0].set_ylabel('Number of Papers', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Princeton Study: Cascading Effects', fontsize=12, fontweight='bold')
axes[0, 0].set_ylim(0, 700)

for i, val in enumerate(values):
    axes[0, 0].text(i, val + 20, str(val), ha='center', fontsize=12, fontweight='bold')

# Share of surveyed researchers who reported failing to reproduce a result
# (the survey reports "more than 70%" and "more than half", so these are lower bounds)
axes[0, 1].bar([1, 2], [70, 50], color=['#e74c3c', '#e74c3c'], edgecolor='black', linewidth=2)
axes[0, 1].set_xticks([1, 2])
axes[0, 1].set_xticklabels(["Others' Work", 'Own Work'])
axes[0, 1].set_ylabel('Failed to Reproduce (%)', fontsize=11, fontweight='bold')
axes[0, 1].set_title('Nature 2016: Reported Failures', fontsize=12, fontweight='bold')
axes[0, 1].set_ylim(0, 100)

# NOTE: no source is cited for these frequencies; treat them as illustrative
causes = ['Data Leakage', 'Poor Docs', 'Missing Params', 'No Seed', 'Env Variation']
freqs = [45, 28, 18, 12, 10]

axes[1, 0].barh(causes, freqs, color=['#e74c3c', '#e67e22', '#f39c12', '#f1c40f', '#2ecc71'], edgecolor='black')
axes[1, 0].set_xlabel('Frequency (%)', fontsize=11, fontweight='bold')
axes[1, 0].set_title('Common Causes of Irreproducibility', fontsize=12, fontweight='bold')

# NOTE: no source is cited for these adoption figures; treat them as illustrative
years = [2019, 2020, 2021, 2022, 2023, 2024]
adoption = [15, 32, 48, 65, 78, 88]

axes[1, 1].plot(years, adoption, marker='o', linewidth=3, markersize=10, color='#3498db')
axes[1, 1].set_xlabel('Year', fontsize=11, fontweight='bold')
axes[1, 1].set_ylabel('Papers with Checklist (%)', fontsize=11, fontweight='bold')
axes[1, 1].set_title('NeurIPS Checklist Adoption', fontsize=12, fontweight='bold')
axes[1, 1].set_ylim(0, 100)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('Better reporting standards are one response to the reproducibility crisis.')

## 2. The NeurIPS Reproducibility Checklist

### The Seven Core Requirements

The NeurIPS reproducibility checklist is a widely used reference for reporting research so that others can replicate it. This module focuses on seven of its items. Let's examine each requirement in detail with practical examples.

1. **Claims Accuracy**: All claims must match the actual findings
2. **Limitations Documentation**: Clearly acknowledge assumptions and limitations
3. **Experimental Reproducibility**: Give sufficient detail for others to replicate the work
4. **Open Access**: Make data and code publicly available
5. **Experimental Settings**: Report all hyperparameters and hardware specs
6. **Statistical Significance**: Report results with confidence intervals and error bars
7. **Compute Resources**: Report training time, memory, and GPU hours
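
Requirement 6 is easy to state and easy to get wrong. As a minimal sketch, assuming five hypothetical per-seed accuracies (the values and the helper name `summarize_runs` are illustrative, not from a real experiment), the mean, sample standard deviation, and a t-based 95% confidence interval could be computed like this:

```python
import numpy as np
from scipy import stats

def summarize_runs(scores, confidence=0.95):
    """Return mean, sample std, and a t-based confidence interval of the mean."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    std = scores.std(ddof=1)          # sample standard deviation
    sem = std / np.sqrt(n)            # standard error of the mean
    # Two-sided critical value from Student's t with n-1 degrees of freedom
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    half_width = t_crit * sem
    return mean, std, (mean - half_width, mean + half_width)

# Hypothetical accuracies (%) from five seeds
accuracies = [91.2, 92.8, 93.1, 91.9, 92.5]
mean, std, (ci_low, ci_high) = summarize_runs(accuracies)
print(f"Accuracy: {mean:.1f}% ± {std:.1f}% "
      f"(95% CI: {ci_low:.1f}-{ci_high:.1f}, n={len(accuracies)})")
```

With only five runs the t critical value (about 2.78) is noticeably larger than the normal-approximation value of 1.96, so the interval is wider than "mean ± 2 standard errors" would suggest. State explicitly whether a reported "±" is a standard deviation, a standard error, or a confidence half-width.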

### Applying the NeurIPS Checklist

Let's see how to apply each requirement systematically.

In [None]:
# NeurIPS Reproducibility Checklist Implementation
class ReproducibilityChecklist:
    """
    Comprehensive checklist for ensuring research reproducibility.
    Based on NeurIPS 2020+ standards.
    """
    def __init__(self, paper_title):
        self.paper_title = paper_title
        self.checklist = {
            '1. Claims Accuracy': {
                'status': False,
                'requirements': [
                    'All claims backed by evidence in results',
                    'No overgeneralization beyond tested scenarios',
                    'Clearly state what IS and IS NOT claimed',
                    'Match abstract/intro claims to conclusion'
                ],
                'example_good': 'Our model achieves 92.3% accuracy on CIFAR-10 test set',
                'example_bad': 'Our model works well on most image tasks'
            },
            '2. Limitations': {
                'status': False,
                'requirements': [
                    'Acknowledge dataset limitations',
                    'Describe computational constraints',
                    'Identify assumptions made',
                    'Discuss where method may fail'
                ],
                'example_good': 'Limited to images <1024x1024, assumes RGB format',
                'example_bad': 'No major limitations identified'
            },
            '3. Experimental Detail': {
                'status': False,
                'requirements': [
                    'Report all hyperparameters',
                    'Specify random seeds used',
                    'Document train/val/test splits',
                    'Describe preprocessing steps'
                ],
                'example_good': 'Learning rate=0.001, batch_size=32, seed=42, 80/10/10 split',
                'example_bad': 'Standard hyperparameters used'
            },
            '4. Open Access': {
                'status': False,
                'requirements': [
                    'Code available on GitHub/GitLab',
                    'Data publicly accessible or described',
                    'Pretrained models shared',
                    'Clear instructions for reproduction'
                ],
                'example_good': 'github.com/user/repo with README and requirements.txt',
                'example_bad': 'Code available upon request'
            },
            '5. Experimental Settings': {
                'status': False,
                'requirements': [
                    'Hardware specifications (GPU model, RAM)',
                    'Software versions (Python, PyTorch, etc.)',
                    'Operating system',
                    'Number of runs/seeds reported'
                ],
                'example_good': 'NVIDIA RTX 3090, PyTorch 1.12, Python 3.9, Ubuntu 20.04, 5 seeds',
                'example_bad': 'Ran on GPU with PyTorch'
            },
            '6. Statistical Significance': {
                'status': False,
                'requirements': [
                    'Report mean AND standard deviation',
                    'Confidence intervals (95% CI)',
                    'Error bars in visualizations',
                    'Multiple runs (min 3-5 seeds)'
                ],
                'example_good': 'Accuracy: 92.3% ± 0.8% (mean ± std; 95% CI: 91.3-93.3, n=5)',
                'example_bad': 'Accuracy: 92.3%'
            },
            '7. Compute Resources': {
                'status': False,
                'requirements': [
                    'Total training time',
                    'GPU hours consumed',
                    'Memory requirements',
                    'Cost estimate if using cloud'
                ],
                'example_good': '48 hours on 4x RTX 3090 (192 GPU-hours), ~$300 on AWS',
                'example_bad': 'Training took a few days'
            }
        }
    
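    def mark_complete(self, item_number):
        """Mark a checklist item (1-7) as complete."""
        # Match on "N." so that, e.g., 1 can never match a key starting with "10."
        prefix = f"{item_number}."
        matches = [k for k in self.checklist if k.startswith(prefix)]
        if not matches:
            raise ValueError(f"No checklist item numbered {item_number}")
        self.checklist[matches[0]]['status'] = True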
    
    def get_completion_rate(self):
        """Calculate overall completion percentage."""
        total = len(self.checklist)
        completed = sum(1 for item in self.checklist.values() if item['status'])
        return (completed / total) * 100
    
    def generate_report(self):
        """Generate a comprehensive checklist report."""
        print(f"Reproducibility Checklist: {self.paper_title}")
        print("=" * 80)
        print(f"Overall Completion: {self.get_completion_rate():.0f}%\n")
        
        for item_name, details in self.checklist.items():
            status_symbol = "✓" if details['status'] else "✗"
            print(f"{status_symbol} {item_name}")
            print("   Requirements:")
            for req in details['requirements']:
                print(f"     • {req}")
            print(f"   ✓ Good: {details['example_good']}")
            print(f"   ✗ Bad:  {details['example_bad']}\n")

        if self.get_completion_rate() < 100:
            print("⚠️  WARNING: Checklist incomplete. Paper not ready for submission.")
        else:
            print("✓ All requirements met. Paper ready for submission!")

# Example usage
paper = ReproducibilityChecklist("Deep Learning for Time Series Forecasting")

# Mark some items as complete
paper.mark_complete(1)  # Claims accuracy
paper.mark_complete(2)  # Limitations
paper.mark_complete(5)  # Experimental settings

# Generate report
paper.generate_report()

## 3. Research Documentation Standards

### The Core Principle

**'If methods cannot be reproduced from documentation alone, the documentation is insufficient.'**

### What to Document

**Electronic Lab Notebooks:**
- Date and time of work
- Methods detailed enough for a colleague to replicate
- Equipment settings and procedures
- Deviations from protocol
- Negative results
- Links to raw data with versions

**Data Dictionary:**
- Short and long variable names
- Format and units
- Allowable values
- Complete definitions

**Data Lineage:**
- Source of origin
- All transformations in order
- Processing pipeline steps
- Version information
- Software dependencies
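
Version information in a lineage record is more trustworthy when it is tied to the file contents. The sketch below shows one way to do that with a SHA-256 checksum from the standard library; the helper name `sha256_of_file`, the record fields, and the throwaway demo file are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file; in a project this would be the raw dataset
demo_bytes = b"id,value\n1,3.5\n2,4.1\n"
with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "dataset_v1.csv"
    raw_path.write_bytes(demo_bytes)
    lineage_record = {
        "file": raw_path.name,
        "sha256": sha256_of_file(raw_path),
        "size_bytes": raw_path.stat().st_size,
    }
print(lineage_record)
```

Recomputing the checksum before each run and comparing it with the recorded value detects silent changes to a file that kept the same name.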

In [None]:
# Electronic Lab Notebook Entry Template
class LabNotebook:
    """Template for documenting research experiments."""
    
    @staticmethod
    def create_entry(experiment_name, researcher, date=None):
        """Generate a structured lab notebook entry."""
        if date is None:
            date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        
        template = f"""
{'='*80}
ELECTRONIC LAB NOTEBOOK ENTRY
{'='*80}

Experiment: {experiment_name}
Researcher: {researcher}
Date/Time: {date}

{'='*80}
1. OBJECTIVE
{'='*80}
What question are you trying to answer?
What is the specific goal of this experiment?

Example: "Test whether adding attention mechanism improves model accuracy by >5%"

{'='*80}
2. HYPOTHESIS
{'='*80}
What do you expect to happen and why?

Example: "Attention will improve accuracy because it allows the model to focus on 
relevant features, particularly for long sequences where context matters."

{'='*80}
3. METHODS
{'='*80}
Detailed procedure that allows exact replication:

a) Data:
   - Dataset: [name, version, source, size]
   - Split: [train/val/test proportions, random seed]
   - Preprocessing: [all transformations applied]

b) Model Architecture:
   - Type: [model family]
   - Layers: [detailed architecture]
   - Parameters: [total count]

c) Training Configuration:
   - Hyperparameters: [learning rate, batch size, epochs, optimizer]
   - Hardware: [GPU model, RAM, CPU]
   - Software: [Python version, library versions]
   - Random Seeds: [all seeds used]

d) Evaluation Metrics:
   - Primary: [main metric to judge success]
   - Secondary: [additional metrics for context]

{'='*80}
4. RESULTS
{'='*80}
a) Quantitative Results:
   - Report mean, std dev, 95% CI
   - Include all runs (don't cherry-pick!)
   - Negative results are valuable!

b) Observations:
   - Training curves (convergence, overfitting?)
   - Unexpected behaviors
   - Error analysis

c) Visualizations:
   - Include figures with captions
   - Reference saved plot files

{'='*80}
5. DEVIATIONS FROM PROTOCOL
{'='*80}
Document ANY changes from planned procedure:
- Why the change was made
- What was changed
- Impact on results

Example: "Reduced batch size from 64 to 32 due to OOM error. 
May affect training stability."

{'='*80}
6. DATA LINEAGE
{'='*80}
Raw Data: data/raw/dataset_v1.csv (SHA256: abc123...)
  ↓ [clean_missing_values.py v1.2]
Processed: data/processed/cleaned_v1.csv (SHA256: def456...)
  ↓ [feature_engineering.py v2.0]
Features: data/features/features_v1.csv (SHA256: ghi789...)
  ↓ [train_model.py v3.1]
Model: models/attention_model_run5.pkl (SHA256: jkl012...)

{'='*80}
7. CONCLUSIONS
{'='*80}
- Was hypothesis supported?
- What are the key takeaways?
- What should be done next?
- Any concerns or limitations?

{'='*80}
8. NEXT STEPS
{'='*80}
Based on these results:
1. [action item 1]
2. [action item 2]
3. [action item 3]

{'='*80}
9. FILES AND ARTIFACTS
{'='*80}
Code: experiments/exp_2024_01_15/train.py (commit: 1a2b3c4)
Notebook: notebooks/analysis_attention.ipynb
Model: models/attention_v1_seed42.pkl
Logs: logs/training_20240115.log
Figures: figures/attention_analysis_20240115/

{'='*80}
10. SIGNATURE
{'='*80}
Researcher: {researcher}
Reviewed by: [name of reviewer, if applicable]
Date: {date}
{'='*80}
"""
        return template

# Example: Create a lab notebook entry
entry = LabNotebook.create_entry(
    experiment_name="Attention Mechanism Evaluation - Run 5",
    researcher="Dr. Jane Smith"
)

print(entry)

### Data Dictionary Example

An electronic lab notebook (ELN) such as the template above tracks how the research progressed. A data dictionary complements it by defining every variable in the dataset. Here's an example for a customer churn dataset:

In [None]:
data_dict = pd.DataFrame({
    'Short Name': ['cust_id', 'age', 'tenure', 'churn', 'charges'],
    'Long Name': ['Customer ID', 'Age', 'Months with Company', 'Churn Status', 'Monthly Charges'],
    'Format': ['Integer 6 digits', 'Integer 18-80', 'Integer 0-72', 'Binary 0/1', 'Decimal 2 places'],
    'Units': ['n/a', 'Years', 'Months', 'n/a', 'USD/month'],
    'Definition': ['Unique customer ID', 'Age at extraction', 'Months with account', 'Stopped service (1=yes)', 'Monthly charges billed']
})

print('DATA DICTIONARY EXAMPLE')
print('='*100)
print(data_dict.to_string(index=False))
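
A data dictionary is more useful when it is also enforced. The sketch below is one possible approach rather than part of the original material: the `ALLOWED_RANGES` mapping and the `validate_against_dictionary` helper are hypothetical names, the ranges are copied from the Format column above, and the customer records are made up.

```python
import pandas as pd

# Allowable ranges taken from the data dictionary's Format column
ALLOWED_RANGES = {
    "age": (18, 80),
    "tenure": (0, 72),
    "churn": (0, 1),
}

def validate_against_dictionary(df, allowed_ranges):
    """Return {column: [row indices]} for values outside the documented range."""
    violations = {}
    for column, (low, high) in allowed_ranges.items():
        if column not in df.columns:
            violations[column] = ["<column missing>"]
            continue
        # between() is inclusive on both ends; missing values count as violations
        bad_rows = df.index[~df[column].between(low, high)].tolist()
        if bad_rows:
            violations[column] = bad_rows
    return violations

# Hypothetical records; the row at index 2 has an out-of-range age
customers = pd.DataFrame({
    "cust_id": [100001, 100002, 100003],
    "age": [34, 52, 17],
    "tenure": [12, 60, 3],
    "churn": [0, 1, 0],
})
violations = validate_against_dictionary(customers, ALLOWED_RANGES)
print(violations)
```

Running a check like this at load time turns the dictionary from passive documentation into a guard against silently malformed data.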

In [None]:
# Docker for Complete Environment Reproducibility

print("="*80)
print("OPTION 3: DOCKER CONTAINERS")
print("="*80)
print("\nDocker gives near-identical environments across:")
print("  • Different host operating systems (Linux, Mac, Windows)")
print("  • Different machines (same CPU architecture; GPU drivers can still vary)")
print("  • Different points in time (pinned dependencies, archived image)")
print()

dockerfile_example = """# Dockerfile for Research Project
# This creates a complete reproducible environment

FROM python:3.9.13-slim-buster

# Document maintainer
LABEL maintainer="researcher@university.edu"
LABEL description="Reproducible environment for Customer Churn Prediction research"
LABEL version="1.0"

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \\
    gcc \\
    g++ \\
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first (for Docker layer caching)
COPY requirements.txt .

# Install Python dependencies with exact versions
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV MPLBACKEND=Agg

# Document how to run
CMD ["python", "train_model.py"]

# Usage instructions:
# Build: docker build -t churn-prediction:v1.0 .
# Run:   docker run -v $(pwd)/data:/app/data churn-prediction:v1.0
# Shell: docker run -it churn-prediction:v1.0 /bin/bash
"""

print("="*80)
print("DOCKERFILE EXAMPLE")
print("="*80)
print(dockerfile_example)

print("\n" + "="*80)
print("DOCKER COMPOSE FOR COMPLEX PROJECTS")
print("="*80)

docker_compose_example = """# docker-compose.yml
# For projects with multiple services (database, web app, etc.)

version: '3.8'

services:
  research:
    build: .
    volumes:
      - ./data:/app/data
      - ./results:/app/results
    environment:
      - PYTHONUNBUFFERED=1
      - RANDOM_SEED=42
    command: python train_model.py

  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/app/notebooks
      - ./data:/app/data
    command: jupyter notebook --ip=0.0.0.0 --allow-root --no-browser

# Usage:
# Start all services: docker-compose up
# Stop: docker-compose down
# Run specific service: docker-compose run research python experiment.py
"""

print(docker_compose_example)

print("\n" + "="*80)
print("ENVIRONMENT COMPARISON")
print("="*80)

comparison = pd.DataFrame({
    'Method': ['venv + requirements.txt', 'conda + environment.yml', 'Docker + Dockerfile'],
    'Setup Time': ['Fast (seconds)', 'Medium (minutes)', 'Slow (minutes)'],
    'Reproducibility': ['Good', 'Very good', 'Highest'],
    'OS Independence': ['No', 'Partial', 'Yes'],
    'Learning Curve': ['Easy', 'Medium', 'Hard'],
    'Best For': ['Simple Python', 'Scientific computing', 'Production/Publishing']
})

print(comparison.to_string(index=False))

print("\n✓ Recommendation: Start with venv, use conda for complex dependencies,")
print("  use Docker for published research or production deployment")

### Option 3: Docker for Complete Reproducibility

**Best for**: Maximum reproducibility across different operating systems and hardware

Docker creates a complete containerized environment including OS, system libraries, and all dependencies.

In [None]:
# Environment Management Best Practices

print("="*80)
print("OPTION 1: VIRTUAL ENVIRONMENTS (venv)")
print("="*80)

venv_commands = """
# Step 1: Create a virtual environment
python3 -m venv research_env

# Step 2: Activate it
# On Linux/Mac:
source research_env/bin/activate
# On Windows:
research_env\\Scripts\\activate

# Step 3: Install packages
pip install numpy pandas scikit-learn matplotlib

# Step 4: Freeze exact versions
pip freeze > requirements.txt

# Step 5: Share requirements.txt with your code
# Others can recreate your environment with:
pip install -r requirements.txt
"""

print(venv_commands)
print("\n" + "="*80)
print("REQUIREMENTS.TXT BEST PRACTICES")
print("="*80)

# Example requirements.txt with proper formatting
requirements_example = """# requirements.txt for Customer Churn Prediction
# Generated: 2024-01-15
# Python version: 3.9.13

# Core Data Science
numpy==1.24.3
pandas==2.0.3
scipy==1.11.1

# Machine Learning
scikit-learn==1.3.0
xgboost==1.7.6

# Visualization
matplotlib==3.7.2
seaborn==0.12.2

# Jupyter
jupyter==1.0.0
ipykernel==6.25.0

# Utilities
tqdm==4.65.0
python-dotenv==1.0.0

# DO NOT use >= or ~= (version ranges leave the installed version open)
# DO pin exact versions (==) for reproducibility
# DO document Python version separately
# DO regenerate when adding new dependencies
"""

print(requirements_example)

print("\n" + "="*80)
print("OPTION 2: CONDA ENVIRONMENTS")
print("="*80)

conda_commands = """
# Step 1: Create conda environment with specific Python version
conda create -n research_env python=3.9

# Step 2: Activate it
conda activate research_env

# Step 3: Install packages
conda install numpy pandas scikit-learn matplotlib

# Step 4: Export complete environment
# (add --no-builds to drop platform-specific build strings when sharing across OSes)
conda env export > environment.yml

# Step 5: Share environment.yml
# Others can recreate with:
conda env create -f environment.yml
"""

print(conda_commands)

print("\n" + "="*80)
print("ENVIRONMENT.YML EXAMPLE")
print("="*80)

environment_yml = """name: research_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9.13
  - numpy=1.24.3
  - pandas=2.0.3
  - scikit-learn=1.3.0
  - matplotlib=3.7.2
  - pip=23.2.1
  - pip:
    - xgboost==1.7.6
    - seaborn==0.12.2
"""

print(environment_yml)

print("\n✓ Choose venv for simple projects, conda for complex scientific computing")

### Creating Reproducible Computational Environments

One of the top causes of irreproducibility is environment variation. Different Python versions, library versions, or operating systems can produce different results.

**The Solution**: Document and isolate your computational environment.
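
Documenting the environment can be partly automated. As a minimal sketch (the helper name `snapshot_environment` and the package list are assumptions), the interpreter version, platform, and installed package versions can be captured at run time with the standard library and stored next to the results:

```python
import json
import platform
from importlib import metadata

def snapshot_environment(packages):
    """Collect interpreter, OS, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }

env_info = snapshot_environment(["numpy", "pandas", "scikit-learn"])
print(json.dumps(env_info, indent=2))
```

A snapshot like this records what was actually installed when the experiment ran, which can differ from what `requirements.txt` asked for.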

### Option 1: Virtual Environments + requirements.txt

**Best for**: Simple Python projects without complex dependencies
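
A `requirements.txt` only helps reproducibility if every line pins an exact version. The following sketch flags lines that do not; it is a simple line-based check written for illustration (the helper name `find_unpinned` is hypothetical), not a full parser for pip's requirement syntax, so options such as `-r other.txt` are also reported.

```python
def find_unpinned(requirements_text):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for line in requirements_text.splitlines():
        requirement = line.split("#", 1)[0].strip()   # drop comments
        if not requirement:
            continue
        name, separator, version = requirement.partition("==")
        # Wildcards such as ==1.3.* still allow several versions
        if not separator or not name.strip() or not version or "*" in version:
            unpinned.append(requirement)
    return unpinned

sample_requirements = """# demo file
numpy==1.24.3
pandas>=2.0
scikit-learn~=1.3.0
matplotlib
seaborn==0.12.2  # pinned
"""
print(find_unpinned(sample_requirements))
```

A check of this kind could run in a test suite so that an unpinned dependency is caught before the environment file is shared.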

## 4. Code Reproducibility: Preventing Data Leakage

### The Cardinal Rule

**'Fit preprocessing transformations ONLY on training data. Never use test set statistics.'**

### Why This Matters

Using test set information in preprocessing is data leakage, one of the top causes of irreproducibility.

### Common Data Leakage Sources

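1. Preprocessing before the train-test split
2. Using future information (time series)
3. Including proxy variables for the target
4. Improper group handling (related samples, such as one patient's records, split across train and test)
5. Feature selection on the full dataset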
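
Two of these sources, future information and improper group handling, come from how the data is split rather than how it is transformed. The sketch below uses scikit-learn's `GroupShuffleSplit` and `TimeSeriesSplit` on synthetic data; the "patient" grouping and the array sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))
groups = np.repeat(np.arange(5), 4)   # hypothetical: 5 patients, 4 samples each

# Group-aware split: every patient lands entirely in train or entirely in test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=groups))
shared = set(groups[train_idx]) & set(groups[test_idx])
print("Groups in both train and test:", shared)

# Time-aware split: each fold trains on earlier rows and tests on later rows
for fold, (past_idx, future_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(X)):
    print(f"Fold {fold}: train up to row {past_idx.max()}, "
          f"test rows {future_idx.min()}-{future_idx.max()}")
```

A plain random split would scatter one patient's samples across both sets and let later rows inform predictions about earlier ones, which are the two failure modes these splitters are designed to avoid.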

### The Seven-Step Data Preprocessing Workflow

1. Data Acquisition
2. Library Import
3. Data Loading & Inspection
4. Data Splitting (before any step that learns from the data)
5. Missing Value Handling (fit on training data only)
6. Categorical Encoding (fit on training data only)
7. Feature Scaling (fit on training data only)

**Critical Order**: Split the data first (step 4). Fit the transformations in steps 5-7 on the training set only, then apply the fitted transformations unchanged to the validation and test sets.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

np.random.seed(42)
X_raw = np.random.randn(100, 3) * 10 + np.array([100, 50, 20])

print('INCORRECT APPROACH (DATA LEAKAGE):')
print('='*60)
scaler_wrong = StandardScaler()
X_scaled = scaler_wrong.fit_transform(X_raw)
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)
print('Problem: Scaler fit on ALL data (includes test set)')
print('Result: Test statistics influenced the scaler!\n')

print('CORRECT APPROACH (NO LEAKAGE):')
print('='*60)
X_train, X_test = train_test_split(X_raw, test_size=0.2, random_state=42)
scaler_correct = StandardScaler()
X_train_scaled = scaler_correct.fit_transform(X_train)
X_test_scaled = scaler_correct.transform(X_test)
print('Solution: Scaler fit ONLY on training data')
print('Result: Test data never influences the scaler!')
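
The manual fit/transform sequence above is correct, but it is easy to get wrong once cross-validation is involved, because the scaler must be refit inside every fold. One common way to handle this is a scikit-learn `Pipeline`. The sketch below uses synthetic data and a logistic regression chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=[100, 50, 20], scale=10, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 150).astype(int)   # synthetic labels for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline refits the scaler on the training part of every CV fold
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on the full training set; the test set is only transformed
model.fit(X_train, y_train)
fitted_scaler = model.named_steps["standardscaler"]
print("CV accuracy per fold:", np.round(scores, 3))
print("Scaler mean matches training mean:",
      np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
print("Test accuracy:", round(model.score(X_test, y_test), 3))
```

Because scaling and modeling travel together as one estimator, there is no separate scaler object that could accidentally be fit on the full dataset.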

In [None]:
# Exercise 4: Create reproducibility package templates

print("="*80)
print("EXERCISE 4: REPRODUCIBILITY PACKAGE")
print("="*80)

# Task 1: NeurIPS Checklist
print("\n1. NeurIPS CHECKLIST")
print("-" * 80)
print("Use the ReproducibilityChecklist class to assess your project:")
print()
print("house_price_project = ReproducibilityChecklist('House Price Prediction')")
print("# Mark completed items (1-7)")
print("house_price_project.mark_complete(1)  # Claims accuracy")
print("# ... mark others as appropriate")
print("house_price_project.generate_report()")

# Task 2: README template
print("\n2. README.MD TEMPLATE")
print("-" * 80)

readme_template = """# House Price Prediction

Reproducible machine learning project predicting house prices using Random Forest.

## Quick Start

```bash
# Clone repository
git clone https://github.com/username/house-price-prediction.git
cd house-price-prediction

# Create environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\\Scripts\\activate

# Install dependencies
pip install -r requirements.txt

# Run experiment (reproduces published results)
python train_model.py --seed 42

# Expected output: MAE = $23,450 ± $1,200
```

## Results Summary

- **Model**: Random Forest Regression
- **Performance**: MAE = $23,450 ± $1,200 (95% CI: $21,050-$25,850)
- **Runs**: n=5 with seeds [42, 123, 456, 789, 1011]
- **Training Time**: ~2.3 hours on Intel i7, 16GB RAM

## Repository Structure

```
house-price-prediction/
├── data/
│   ├── raw/                 # Original data (not in git)
│   ├── processed/           # Cleaned data (not in git)
│   └── sample/              # Small sample for testing (<10MB)
├── notebooks/               # Analysis notebooks
├── src/
│   ├── preprocessing.py     # Data cleaning pipeline
│   ├── features.py          # Feature engineering
│   ├── model.py             # Random Forest implementation
│   └── evaluate.py          # Evaluation metrics
├── models/                  # Saved models (not in git)
├── results/                 # Output files
├── tests/                   # Unit tests
├── train_model.py           # Main training script
├── requirements.txt         # Python dependencies
├── .gitignore
└── README.md                # This file
```

## Environment

- Python 3.9.13
- See `requirements.txt` for all dependencies
- Tested on Ubuntu 20.04, macOS 12.0, Windows 11

## Data

Dataset: 10,000 houses with 15 features
- Source: [provide source]
- License: [provide license]
- Download: [provide link or instructions]
- Place in `data/raw/houses.csv`

## Reproducing Results

To exactly reproduce the published results:

1. Use Python 3.9.13
2. Install exact dependency versions from requirements.txt
3. Run with seed=42 (default)
4. Results may vary slightly (<1%) due to floating point precision

## Citation

If you use this code, please cite:

```
@article{smith2024house,
  title={Reproducible House Price Prediction},
  author={Smith, Jane},
  journal={Journal of ML Reproducibility},
  year={2024}
}
```

## License

MIT License - see LICENSE file
"""

print(readme_template)

# Task 3: requirements.txt
print("\n3. REQUIREMENTS.TXT")
print("-" * 80)

requirements = """# requirements.txt
# Python 3.9.13
# Generated: 2024-01-15

# Core
numpy==1.24.3
pandas==2.0.3
scipy==1.11.1

# Machine Learning
scikit-learn==1.3.0

# Utilities
joblib==1.3.1
tqdm==4.65.0

# Testing
pytest==7.4.0
"""
print(requirements)

# Task 4: .gitignore
print("\n4. .GITIGNORE")
print("-" * 80)

gitignore = """# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
venv/
env/
ENV/

# Jupyter
.ipynb_checkpoints
*_tested.ipynb

# Data (ignore large files)
data/raw/
data/processed/
*.csv
*.parquet
*.h5

# Models
models/*.pkl
models/*.h5
models/*.pt

# Results
results/
logs/

# IDE
.vscode/
.idea/
*.swp

# OS
.DS_Store
Thumbs.db

# Keep sample data
!data/sample/*.csv
"""
print(gitignore)

print("\n" + "="*80)
print("✓ Complete all 5 components for a reproducible research package")
print("="*80)

## 9. Exercise 4: Create a Complete Reproducibility Package

You've completed a machine learning project predicting house prices. Create a complete reproducibility package that includes:

1. **NeurIPS Checklist** - Mark which items are complete
2. **README.md** - Instructions for reproducing your results
3. **requirements.txt** - All dependencies with exact versions
4. **.gitignore** - What files to exclude from version control
5. **Lab Notebook Entry** - Document your final experiment

**Project Details:**
- Model: Random Forest regression
- Dataset: 10,000 houses, 15 features
- Results: MAE = $23,450 ± $1,200 (95% CI: $21,050-$25,850, n=5 seeds)
- Training time: 2.3 hours on Intel i7, 16GB RAM
- Python 3.9, scikit-learn 1.3.0, pandas 2.0.3
- Splits: 70% train, 15% val, 15% test (seed=42)
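
When documenting the split, it helps to show exactly how it was produced. A 70/15/15 split can be sketched with two calls to `train_test_split`; the arrays below are random placeholders with the project's shape (10,000 rows, 15 features), not real house data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
rng = np.random.default_rng(SEED)
X = rng.normal(size=(10_000, 15))      # placeholder features
y = rng.normal(size=10_000)            # placeholder target

# First hold out 30%, then split that holdout evenly into validation and test
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=SEED
)
print(len(X_train), len(X_val), len(X_test))   # 7000 1500 1500
```

Recording the seed and both `test_size` values in the README or lab notebook is what lets someone else recover the same three subsets.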

## 5. Version Control for Reproducible Research

### Why Git Matters

- Creates audit trails of changes
- Enables rollback to working versions
- Documents decisions through commit messages
- Facilitates collaboration
- Enables reproducibility at specific commits
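
Reproducing a result at a specific commit requires knowing which commit produced it. One lightweight option, sketched here with a hypothetical `current_git_commit` helper and an illustrative metadata dictionary, is to store the commit hash with every run:

```python
import subprocess

def current_git_commit():
    """Return the current commit hash, or None outside a Git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a repository, or Git is not installed
        return None
    return result.stdout.strip()

run_metadata = {"seed": 42, "git_commit": current_git_commit()}
print(run_metadata)
```

The hash only identifies committed code, so uncommitted changes should be committed (or at least noted) before a run whose results will be reported.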

### Repository Best Practices

**Track (✓):**
- Notebooks without outputs
- Scripts and source code
- Sample data <10MB
- README and documentation
- Requirements.txt and environment.yml

**Ignore (❌):**
- Notebook outputs
- Large datasets >10MB
- Virtual environments
- Cache files
- Credentials and secrets
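
Tracking notebooks without outputs means clearing them before each commit. A notebook file is JSON, so the idea can be sketched with the standard library alone. The helper name `strip_outputs` and the hand-written demo notebook are assumptions for illustration; dedicated tools such as nbstripout automate the same step.

```python
import copy
import json

def strip_outputs(notebook):
    """Return a copy of an nbformat-4 notebook dict with code outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

# Tiny hand-written notebook for illustration
demo_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print('hi')"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
    ],
}
cleaned = strip_outputs(demo_notebook)
print(json.dumps(cleaned["cells"][1], indent=2))
```

Clearing outputs keeps diffs readable and avoids committing large embedded images or data that happened to be printed in a cell.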

## 6. Exercise 1: Create a Reproducibility Checklist

You're submitting a paper 'Deep Learning for Time Series' to NeurIPS.

**Current Status:**
- ✓ Code on GitHub
- ✓ Random seeds set
- ✗ No confidence intervals (only mean: 94.3%)
- ✗ Training time not documented (48 hours)
- ✓ Limitations described
- ✗ No setup instructions on GitHub

**Task**: Analyze which reproducibility items are complete and what needs fixing before publication.

In [None]:
print('EXERCISE 1: REPRODUCIBILITY CHECKLIST')
print('='*70)
print('\nReview the paper status above.')
print('\nQuestions to answer:')
print('1. Which items are clearly complete?')
print('2. What is preventing publication readiness?')
print('3. In what order should the fixes be prioritized?')
print('4. Which checklist items need the most work?')

## 7. Exercise 2: Write a Data Dictionary

Create a data dictionary for house price prediction with variables:
- sqft: Square footage
- beds: Number of bedrooms
- price: Sale price
- zip: Postal code
- year: Year built
- cond: Condition rating

Include short names, long names, format, units, and definitions.

In [None]:
print('EXERCISE 2: DATA DICTIONARY')
print('='*70)
print('\nCreate a DataFrame with columns:')
print('- Short Name, Long Name, Format, Units, Definition')
print('\nFor variables: sqft, beds, price, zip, year, cond')

## 8. Exercise 3: Identify Data Leakage

Analyze three scenarios:

**Scenario A**: Impute missing values on full dataset, then split

**Scenario B**: Split first, then select features using training only

**Scenario C**: Use information from AFTER the prediction time

For each: (1) Has leakage? (2) Why? (3) How to fix?

In [None]:
print('EXERCISE 3: DATA LEAKAGE ANALYSIS')
print('='*70)
print('\nAnalyze each scenario for data leakage:')
print('\nScenario A: Preprocess all, then split')
print('  - Has leakage? YES/NO')
print('  - Why?')
print('  - Fix?')
print('\nScenario B: Split, then feature select on training')
print('  - Has leakage? YES/NO')
print('  - Why?')
print('  - Fix?')
print('\nScenario C: Use future information')
print('  - Has leakage? YES/NO')
print('  - Why?')
print('  - Fix?')

## Summary

### Key Takeaways

✅ **Reproducibility Crisis**: More than 70% of surveyed researchers have failed to reproduce others' work; data leakage is a pervasive cause in ML-based science

✅ **NeurIPS Checklist**: Seven requirements (claims, limitations, experimental details, open access, settings, statistics, compute)

✅ **Documentation**: Electronic lab notebooks, data dictionaries, READMEs, data lineage

✅ **Environment**: Specify Python/library versions, set random seeds, document dependencies

✅ **Data Leakage Prevention**: The cardinal rule is to fit preprocessing ONLY on training data

✅ **Version Control**: Use Git for audit trails; track code, sample data, and docs; ignore outputs and secrets

### What's Next?

**Module 07: Literature Review Methodologies** covers:
- Systematic reviews (PRISMA 2020)
- Scoping reviews (JBI methodology)
- Meta-analysis techniques
- Risk of bias assessment

## Self-Assessment

Before Module 07, ensure you can:

- [ ] Explain the reproducibility crisis and causes
- [ ] Apply the NeurIPS seven-item checklist
- [ ] Create comprehensive data dictionaries
- [ ] Document data lineage and transformations
- [ ] Set up reproducible environments
- [ ] Prevent data leakage in preprocessing
- [ ] Use version control for research
- [ ] Write clear README files
- [ ] Report results with uncertainty measures

If all boxes are checked, you're ready for Module 07! 🎉