# Module 06: Reproducibility Crisis and Documentation Standards

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 75 minutes

**Prerequisites**: Module 05: Statistical Validation and Hypothesis Testing

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand the reproducibility crisis in ML-based science
2. Apply the NeurIPS reproducibility checklist
3. Implement comprehensive documentation standards
4. Use version control effectively for research
5. Create reproducible computational environments
6. Design data preprocessing pipelines that prevent data leakage
7. Document data lineage and transformations

## Setup

Let's import required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

np.random.seed(42)

print('✓ Libraries imported successfully!')

## 1. The Reproducibility Crisis in ML-Based Science

### Understanding the Scale

The reproducibility crisis is a systemic problem in modern science.

**Key Statistics:**

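- **Princeton Study**: A literature survey found 41 papers, spanning 30 fields, that documented data leakage errors in ML-based research
- **Cascading Impact**: The errors documented in those 41 papers collectively affected 648 papers
- **Nature 2016 Survey**: More than 70% of researchers had tried and failed to reproduce another scientist's experiments
- **Self-Reproducibility**: More than 50% had failed to reproduce their own experiments
- **Root Cause**: The Princeton study identifies data leakage as a pervasive cause of these failures in ML-based science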

### Why This Matters

Irreproducible research has serious consequences:
1. Wasted resources building on flawed findings
2. Scientific community moves in wrong directions
3. The scientific method breaks down
4. Loss of public trust in science
5. Real-world harm in applied domains (medicine, autonomous systems)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 41 survey papers documented leakage errors that collectively affected 648 papers
categories = ['Papers Documenting Errors', 'Papers Affected']
values = [41, 648]
colors = ['#e74c3c', '#c0392b']

axes[0, 0].bar(categories, values, color=colors, edgecolor='black', linewidth=2)
axes[0, 0].set_ylabel('Number of Papers', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Princeton Study: Cascading Effects', fontsize=12, fontweight='bold')
axes[0, 0].set_ylim(0, 700)

for i, val in enumerate(values):
    axes[0, 0].text(i, val + 20, str(val), ha='center', fontsize=12, fontweight='bold')

# Share of surveyed researchers who tried and failed to reproduce a result.
# This is a failure rate; it cannot be converted into a per-experiment success rate.
axes[0, 1].bar([1, 2], [70, 50], color=['#e74c3c', '#e74c3c'], edgecolor='black', linewidth=2)
axes[0, 1].set_xticks([1, 2])
axes[0, 1].set_xticklabels(["Others' Work", 'Own Work'])
axes[0, 1].set_ylabel('Researchers Reporting a Failure (%)', fontsize=11, fontweight='bold')
axes[0, 1].set_title('Nature 2016: Failed Reproduction Attempts', fontsize=12, fontweight='bold')
axes[0, 1].set_ylim(0, 100)

# Illustrative percentages only; the notebook cites no source for these values
causes = ['Data Leakage', 'Poor Docs', 'Missing Params', 'No Seed', 'Env Variation']
freqs = [45, 28, 18, 12, 10]

axes[1, 0].barh(causes, freqs, color=['#e74c3c', '#e67e22', '#f39c12', '#f1c40f', '#2ecc71'], edgecolor='black')
axes[1, 0].set_xlabel('Frequency (%)', fontsize=11, fontweight='bold')
axes[1, 0].set_title('Common Causes of Irreproducibility (Illustrative)', fontsize=12, fontweight='bold')

# Illustrative trend only; the notebook cites no source for these values
years = [2019, 2020, 2021, 2022, 2023, 2024]
adoption = [15, 32, 48, 65, 78, 88]

axes[1, 1].plot(years, adoption, marker='o', linewidth=3, markersize=10, color='#3498db')
axes[1, 1].set_xlabel('Year', fontsize=11, fontweight='bold')
axes[1, 1].set_ylabel('Papers with Checklist (%)', fontsize=11, fontweight='bold')
axes[1, 1].set_title('NeurIPS Checklist Adoption (Illustrative)', fontsize=12, fontweight='bold')
axes[1, 1].set_ylim(0, 100)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('Reporting standards such as the NeurIPS checklist aim to reduce these failures.')

## 2. The NeurIPS Reproducibility Checklist

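### The Seven Core Requirements

This module focuses on seven items from the NeurIPS paper checklist; the full checklist also covers further topics such as ethics and broader impacts.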

**1. Claims Accuracy** - All claims must match actual findings

**2. Limitations Documentation** - Clearly acknowledge assumptions and limitations

**3. Experimental Reproducibility** - Sufficient detail for others to replicate

**4. Open Access** - Make data and code publicly available

**5. Experimental Settings** - Report all hyperparameters and hardware specs

**6. Statistical Significance** - Results with confidence intervals and error bars

**7. Compute Resources** - Report training time, memory, GPU hours
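
Item 6 can be made concrete with a short sketch. The accuracies below are made-up placeholder values standing in for repeated runs with different random seeds, and `mean_with_ci` is a hypothetical helper. The percentile bootstrap is one common way to build an interval; the checklist does not prescribe a particular method.

```python
import numpy as np

def mean_with_ci(scores, n_boot=10_000, level=0.95, seed=0):
    """Return (mean, lower, upper) from a percentile bootstrap over runs."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the runs with replacement and record the mean of each resample.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(boot_means, [alpha, 1.0 - alpha])
    return scores.mean(), lower, upper

# Hypothetical test accuracies from five runs with different seeds.
run_accuracies = [0.941, 0.938, 0.947, 0.935, 0.944]
mean, lower, upper = mean_with_ci(run_accuracies)
print(f"accuracy = {mean:.3f} "
      f"(95% bootstrap CI: {lower:.3f} to {upper:.3f}, n = {len(run_accuracies)} runs)")
```

Reporting the number of runs next to the interval lets readers judge how much weight the interval can bear; with only a handful of runs, a bootstrap interval is rough.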

## 3. Research Documentation Standards

### The Core Principle

**'If methods cannot be reproduced from documentation alone, the documentation is insufficient.'**

### What to Document

**Electronic Lab Notebooks:**
- Date and time of work
- Detailed methods for colleague replication
- Equipment settings and procedures
- Deviations from protocol
- Negative results
- Links to raw data with versions

**Data Dictionary:**
- Short and long variable names
- Format and units
- Allowable values
- Complete definitions

**Data Lineage:**
- Source of origin
- All transformations in order
- Processing pipeline steps
- Version information
- Software dependencies

In [None]:
data_dict = pd.DataFrame({
    'Short Name': ['cust_id', 'age', 'tenure', 'churn', 'charges'],
    'Long Name': ['Customer ID', 'Age', 'Months with Company', 'Churn Status', 'Monthly Charges'],
    'Format': ['Integer 6 digits', 'Integer 18-80', 'Integer 0-72', 'Binary 0/1', 'Decimal 2 places'],
    'Units': ['ID', 'Years', 'Months', 'Binary', 'USD/month'],
    'Definition': ['Unique customer ID', 'Age at extraction', 'Months with account', 'Stopped service (1=yes)', 'Monthly charges billed']
})

print('DATA DICTIONARY EXAMPLE')
print('='*100)
print(data_dict.to_string(index=False))
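
The data lineage items listed above can be captured in a small machine-readable record. This is a minimal sketch using only the standard library: the file name, the in-memory `raw_bytes` stand-in, the step names, and the `record_step` helper are all hypothetical.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

# Stand-in for the bytes of a raw data file (hypothetical content).
raw_bytes = b"cust_id,age,tenure\n100001,34,12\n100002,51,40\n"

lineage = {
    "source": "crm_export.csv",  # hypothetical file name
    # A content hash pins the exact version of the raw data.
    "source_sha256": hashlib.sha256(raw_bytes).hexdigest(),
    "recorded_at": datetime.now(timezone.utc).isoformat(),
    "python_version": platform.python_version(),
    "transformations": [],
}

def record_step(log, step, details):
    """Append one transformation, numbered in the order it was applied."""
    log["transformations"].append(
        {"order": len(log["transformations"]) + 1, "step": step, "details": details}
    )

# Hypothetical processing steps, recorded in execution order.
record_step(lineage, "drop_duplicates", {"subset": ["cust_id"]})
record_step(lineage, "impute_median", {"columns": ["age"], "fit_on": "training split"})

print(json.dumps(lineage, indent=2))
```

Storing this record next to the processed data gives a reader the source, the ordered transformations, and the version information in one place.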

## 4. Code Reproducibility and Environment Management

### The Cardinal Rule

**'Fit preprocessing transformations ONLY on training data. Never use test set statistics.'**

### Why This Matters

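Using test set information in preprocessing is data leakage, one of the top causes of irreproducibility. The model is evaluated on data that has already shaped its inputs, so the reported performance is overly optimistic.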

### Common Data Leakage Sources

1. Preprocessing before train-test split
2. Using future information (time series)
3. Including proxy variables
4. Improper group handling
5. Feature selection on full dataset
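
Sources 2 and 4 can often be avoided by choosing a splitter that respects time order or group membership. The sketch below uses scikit-learn's `TimeSeriesSplit` and `GroupShuffleSplit` on toy data; the array contents and the "four groups of three records" layout are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

X = np.arange(24).reshape(12, 2)     # 12 toy samples, assumed to be in time order
groups = np.repeat(np.arange(4), 3)  # e.g. 4 subjects with 3 records each

# Source 2: every training index precedes every test index in each fold.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()

# Source 4: all records from one group stay on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=groups))
shared = set(groups[train_idx]) & set(groups[test_idx])
print("groups shared between train and test:", shared)
```

An empty set of shared groups means no subject contributes records to both sides of the split.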

### The Seven-Step Data Preprocessing Workflow

1. Data Acquisition
2. Library Import
3. Data Loading & Inspection
4. Data Splitting (before any fitted transformation)
5. Missing Value Handling (fit on training data only)
6. Categorical Encoding (fit on training data only)
7. Feature Scaling (fit on training data only)

**Critical Order**: Split the data first (step 4). Then fit steps 5-7 on the training data ONLY and apply the fitted transformations, unchanged, to the test data.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 3 features centered near 100, 50, and 20
np.random.seed(42)
X_raw = np.random.randn(100, 3) * 10 + np.array([100, 50, 20])

print('INCORRECT APPROACH (DATA LEAKAGE):')
print('='*60)
scaler_wrong = StandardScaler()
X_scaled = scaler_wrong.fit_transform(X_raw)  # fit sees every row, including future test rows
X_train_leaky, X_test_leaky = train_test_split(X_scaled, test_size=0.2, random_state=42)
print('Problem: Scaler fit on ALL data (includes test set)')
print('Result: Test statistics influenced the scaler!')
print('Scaler means:', np.round(scaler_wrong.mean_, 2), '\n')

print('CORRECT APPROACH (NO LEAKAGE):')
print('='*60)
X_train, X_test = train_test_split(X_raw, test_size=0.2, random_state=42)  # split first
scaler_correct = StandardScaler()
X_train_scaled = scaler_correct.fit_transform(X_train)  # fit on training rows only
X_test_scaled = scaler_correct.transform(X_test)        # reuse the training statistics
print('Solution: Scaler fit ONLY on training data')
print('Result: Test data never influences the scaler!')
print('Scaler means:', np.round(scaler_correct.mean_, 2))
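
The same rule has to hold inside cross-validation, where it is easy to break by scaling once before the folds are made. One way to enforce it is to wrap the preprocessing and the model in a scikit-learn pipeline, so the scaler is re-fit on the training part of every fold. This is a sketch on synthetic data; the label rule and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=[100, 50, 20], scale=10, size=(200, 3))
# Synthetic binary labels driven by the first feature plus noise (assumption).
y = (X[:, 0] + rng.normal(scale=5, size=200) > 100).astype(int)

# Within each fold, fit() touches only the training part; the held-out
# part is transformed with the statistics learned from that training part.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", np.round(scores, 3))
```

Because the scaler lives inside the pipeline, there is no separate preprocessing step that could accidentally be fit on held-out rows.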

## 5. Version Control for Reproducible Research

### Why Git Matters

- Creates audit trails of changes
- Enables rollback to working versions
- Documents decisions through commit messages
- Facilitates collaboration
- Enables reproducibility at specific commits
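
Reproducing a result at a specific commit is only possible if the result records which commit produced it. A minimal sketch, assuming the code runs inside a Git working tree with the `git` executable on the PATH; `current_git_commit`, `working_tree_is_clean`, and `run_record` are hypothetical names.

```python
import subprocess

def _git_output(*args):
    """Run a git command and return its stdout, or None if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return result.stdout.strip()

def current_git_commit():
    """Return the full hash of HEAD, or 'unknown' outside a repository."""
    return _git_output("rev-parse", "HEAD") or "unknown"

def working_tree_is_clean():
    """True only if git reports no uncommitted or untracked changes."""
    status = _git_output("status", "--porcelain")
    return status is not None and status == ""

# Store this next to every saved result or figure.
run_record = {"commit": current_git_commit(), "clean": working_tree_is_clean()}
print(run_record)
```

A result produced from a working tree that is not clean cannot be traced to a commit, so it is worth committing before a run that will be reported.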

### Repository Best Practices

**Track (✓):**
- Notebooks without outputs
- Scripts and source code
- Sample data <10MB
- README and documentation
- requirements.txt and environment.yml

**Ignore (❌):**
- Notebook outputs
- Large datasets >10MB
- Virtual environments
- Cache files
- Credentials and secrets
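
A tracked requirements.txt is most useful when it pins the versions that were actually used. The sketch below reads the installed versions with the standard library's `importlib.metadata`; the package list and the `environment_snapshot` helper are assumptions and should be adapted to the project.

```python
import platform
import sys
from importlib import metadata

def environment_snapshot(packages):
    """Collect the Python version, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

# Distribution names as published on PyPI (note "scikit-learn", not "sklearn").
snapshot = environment_snapshot(
    ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn"]
)

print(f"# Python {snapshot['python']} on {snapshot['platform']}")
for name, version in snapshot["packages"].items():
    if version != "not installed":
        print(f"{name}=={version}")  # one pinned requirements.txt line per package
```

The printed lines follow the `name==version` form that pip accepts, so they can be pasted into requirements.txt and committed with the code.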

## 6. Exercise 1: Create a Reproducibility Checklist

You're submitting a paper 'Deep Learning for Time Series' to NeurIPS.

**Current Status:**
- ✓ Code on GitHub
- ✓ Random seeds set
- ✗ No confidence intervals (only mean: 94.3%)
- ✗ Training time not documented (48 hours)
- ✓ Limitations described
- ✗ No setup instructions on GitHub

**Task**: Analyze which reproducibility items are complete and what needs fixing before publication.

In [None]:
print('EXERCISE 1: REPRODUCIBILITY CHECKLIST')
print('='*70)
print('\nReview the paper status above.')
print('\nQuestions to answer:')
print('1. Which items are clearly complete?')
print('2. What is preventing publication readiness?')
print('3. What priority order for fixes?')
print('4. Which checklist items need most work?')

## 7. Exercise 2: Write a Data Dictionary

Create a data dictionary for house price prediction with variables:
- sqft: Square footage
- beds: Number of bedrooms
- price: Sale price
- zip: Postal code
- year: Year built
- cond: Condition rating

Include short names, long names, format, units, and definitions.

In [None]:
print('EXERCISE 2: DATA DICTIONARY')
print('='*70)
print('\nCreate a DataFrame with columns:')
print('- Short Name, Long Name, Format, Units, Definition')
print('\nFor variables: sqft, beds, price, zip, year, cond')

## 8. Exercise 3: Identify Data Leakage

Analyze three scenarios:

**Scenario A**: Impute missing values on full dataset, then split

**Scenario B**: Split first, then select features using training only

**Scenario C**: Use information from AFTER the prediction time

For each: (1) Has leakage? (2) Why? (3) How to fix?

In [None]:
print('EXERCISE 3: DATA LEAKAGE ANALYSIS')
print('='*70)
print('\nAnalyze each scenario for data leakage:')
print('\nScenario A: Impute on full dataset, then split')
print('  - Has leakage? YES/NO')
print('  - Why?')
print('  - Fix?')
print('\nScenario B: Split, then feature select on training')
print('  - Has leakage? YES/NO')
print('  - Why?')
print('  - Fix?')
print('\nScenario C: Use future information')
print('  - Has leakage? YES/NO')
print('  - Why?')
print('  - Fix?')

## Summary

### Key Takeaways

✅ **Reproducibility Crisis**: In the Nature 2016 survey, more than 70% of researchers had failed to reproduce others' work; data leakage is a pervasive cause in ML-based science

✅ **NeurIPS Checklist**: Seven requirements (claims, limitations, experimental details, open access, settings, statistics, compute)

✅ **Documentation**: Electronic lab notebooks, data dictionaries, READMEs, data lineage

✅ **Environment**: Specify Python/library versions, set random seeds, document dependencies

✅ **Data Leakage Prevention**: Cardinal rule: fit preprocessing ONLY on training data

✅ **Version Control**: Use Git for audit trails; track code, small sample data, and docs; ignore outputs, large datasets, and secrets

### What's Next?

**Module 07: Literature Review Methodologies** covers:
- Systematic reviews (PRISMA 2020)
- Scoping reviews (JBI methodology)
- Meta-analysis techniques
- Risk of bias assessment

## Self-Assessment

Before Module 07, ensure you can:

- [ ] Explain the reproducibility crisis and its causes
- [ ] Apply the NeurIPS seven-item checklist
- [ ] Create comprehensive data dictionaries
- [ ] Document data lineage and transformations
- [ ] Set up reproducible environments
- [ ] Prevent data leakage in preprocessing
- [ ] Use version control for research
- [ ] Write clear README files
- [ ] Report results with uncertainty measures

If all boxes are checked, you're ready for Module 07! 🎉