# Week 8 Lab: Reproducibility & Environments

**CS 203: Software Tools and Techniques for AI**

In this lab, you'll learn to:
1. Set random seeds for reproducibility
2. Create and manage virtual environments
3. Generate requirements.txt
4. Use config files
5. Create a proper project structure

## Part 1: The Reproducibility Problem

Let's see why random seeds matter.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
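# Create sample data from a private, seeded generator so the dataset itself
# is identical on every run; the global NumPy state stays unseeded
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)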

# Train WITHOUT random seed
print("Training WITHOUT random seed (run multiple times):")
for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = RandomForestClassifier(n_estimators=10)
    model.fit(X_train, y_train)
    print(f"  Run {i+1}: Accuracy = {model.score(X_test, y_test):.3f}")

print("\nNotice how the accuracy varies each time!")

In [None]:
# Train WITH random seed
print("Training WITH random seed (run multiple times):")
for i in range(3):
    np.random.seed(42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    model.fit(X_train, y_train)
    print(f"  Run {i+1}: Accuracy = {model.score(X_test, y_test):.3f}")

print("\nNow the accuracy is exactly the same every time!")

## Part 2: A Complete Seed Function

In [None]:
import random
import numpy as np

def set_seed(seed=42):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    
    # PyTorch (if available)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
    
    # TensorFlow (if available)
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass
    
    print(f"Random seed set to {seed}")

# Call at the start of every script
set_seed(42)

In [None]:
# Test that it works
set_seed(42)
print("First sequence:", [random.randint(1, 100) for _ in range(5)])

set_seed(42)
print("Same sequence:", [random.randint(1, 100) for _ in range(5)])
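
Seeding the global generators, as `set_seed` does, affects every piece of code that draws from them. A complementary pattern is to give a function its own generator. The sketch below uses only the standard library; the helper name `sample_with_local_rng` is made up for illustration.

```python
import random

def sample_with_local_rng(seed, n=5):
    """Draw n integers from a private generator; global random state is untouched."""
    rng = random.Random(seed)  # independent of random.seed(...)
    return [rng.randint(1, 100) for _ in range(n)]

first = sample_with_local_rng(42)
second = sample_with_local_rng(42)
print("Identical:", first == second)
```

Because the generator is local, calling other code that uses `random` between the two draws does not change the result.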

## Part 3: Virtual Environments (Terminal Commands)

Run these commands in your terminal, not in this notebook.

### Creating a Virtual Environment

```bash
# Navigate to your project directory
cd my_project

# Create virtual environment
python -m venv venv

# Activate it (Mac/Linux)
source venv/bin/activate

# Activate it (Windows)
venv\Scripts\activate

# Your prompt should now show (venv)
```

### Installing Packages

```bash
# Install packages
pip install pandas scikit-learn matplotlib

# See what's installed
pip list

# Save to requirements.txt
pip freeze > requirements.txt

# Deactivate when done
deactivate
```
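
To confirm from Python that the interpreter is really the one inside your environment, you can compare `sys.prefix` with `sys.base_prefix`. This is a small sketch; the function name `in_virtualenv` is an assumption for illustration, and the check covers environments created with `venv` (a conda environment would need a different test).

```python
import sys

def in_virtualenv():
    """Return True when the interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```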

## Part 4: Working with requirements.txt

In [None]:
# Let's see what packages are installed in this environment
# (pkg_resources is deprecated; importlib.metadata is the standard-library replacement)
from importlib import metadata

# Get installed packages, skipping any entry with broken metadata
installed = sorted(
    ((d.metadata["Name"], d.version) for d in metadata.distributions() if d.metadata["Name"]),
    key=lambda item: item[0].lower(),
)

print("Some installed packages:")
for name, version in installed[:10]:
    print(f"  {name}=={version}")

In [None]:
# Generate a requirements.txt for our project
from importlib import metadata

project_packages = [
    'pandas',
    'numpy',
    'scikit-learn',
    'matplotlib'
]

print("Example requirements.txt:\n")
for pkg in project_packages:
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}  # not installed, version unknown")
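
A requirements file only helps reproducibility when every line pins an exact version. The sketch below flags lines without an `==` pin. The helper `find_unpinned` and the version numbers in the example are illustrative assumptions, and the parsing is deliberately simple (it ignores pip options and does not handle URL requirements).

```python
def find_unpinned(requirements_text):
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned

# Illustrative file contents; the versions are placeholders
example = """\
pandas==2.1.0
numpy>=1.24
scikit-learn  # version unknown
"""
print(find_unpinned(example))
```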

## Part 5: Configuration Files

In [None]:
# Instead of hardcoding values in your code...

# BAD: Hardcoded values
learning_rate = 0.01
batch_size = 32
model_path = "/home/nipun/models/netflix.pkl"  # Breaks on other machines!

print("BAD: Hardcoded values are not portable!")

In [None]:
# GOOD: Use a config file
import yaml
import json

# Example config (normally in a separate file)
config_yaml = """
training:
  learning_rate: 0.01
  batch_size: 32
  epochs: 100
  random_seed: 42

paths:
  data: data/processed/
  model: models/netflix.pkl
  logs: logs/

model:
  type: random_forest
  n_estimators: 100
  max_depth: 10
"""

# Parse the config
config = yaml.safe_load(config_yaml)

print("Config loaded:")
print(json.dumps(config, indent=2))

In [None]:
# Access config values
print(f"Learning rate: {config['training']['learning_rate']}")
print(f"Model type: {config['model']['type']}")
print(f"Data path: {config['paths']['data']}")

In [None]:
# Write config to file
with open('config.yaml', 'w') as f:
    yaml.dump(config, f, default_flow_style=False)

print("Config saved to config.yaml")

# Read it back
with open('config.yaml', 'r') as f:
    loaded_config = yaml.safe_load(f)

print(f"Loaded back: learning_rate = {loaded_config['training']['learning_rate']}")
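
Once settings live in a config, a common next step is to keep one set of defaults and override a few values per experiment. The following is a minimal sketch of a recursive merge over nested dictionaries like the one loaded above; `merge_config` is a hypothetical helper, not part of PyYAML.

```python
import copy

def merge_config(defaults, overrides):
    """Recursively merge overrides into a copy of defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # descend into sections
        else:
            merged[key] = copy.deepcopy(value)  # replace or add a leaf value
    return merged

defaults = {"training": {"learning_rate": 0.01, "batch_size": 32, "random_seed": 42}}
experiment = {"training": {"learning_rate": 0.001}}

merged = merge_config(defaults, experiment)
print(merged["training"])
```

The defaults are never mutated, so the same base config can be reused for several experiments.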

## Part 6: Project Structure Template

In [None]:
import os

# Create a reproducible project structure
project_structure = {
    'netflix-predictor': {
        'data': {
            'raw': {},
            'processed': {}
        },
        'models': {},
        'notebooks': {},
        'src': {},
        'tests': {},
    }
}

def print_structure(d, indent=0):
    for key, value in d.items():
        print("  " * indent + f"├── {key}/")
        if isinstance(value, dict):
            print_structure(value, indent + 1)

print("Recommended Project Structure:")
print_structure(project_structure)
print("  " * 1 + "├── requirements.txt")
print("  " * 1 + "├── config.yaml")
print("  " * 1 + "├── README.md")
print("  " * 1 + "└── .gitignore")
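
The cell above only prints the layout. If you want the directories to exist on disk, a nested dictionary of the same shape can drive `pathlib`. This is a sketch; `create_structure` is a hypothetical helper, and it writes into a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

def create_structure(base, tree):
    """Create every directory described by a nested dict under base."""
    created = []
    for name, children in tree.items():
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # safe to rerun
        created.append(path)
        created.extend(create_structure(path, children))
    return created

tree = {"netflix-predictor": {"data": {"raw": {}, "processed": {}}, "models": {}, "src": {}}}

with tempfile.TemporaryDirectory() as tmp:
    for path in create_structure(tmp, tree):
        print(path.relative_to(tmp))
```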

## Part 7: Creating Essential Files

In [None]:
# Create a sample .gitignore
gitignore_content = """# Data files (too large for Git)
data/raw/
*.csv
*.parquet

# Models (too large)
models/*.pkl
models/*.pth
*.h5

# Virtual environment
venv/
env/

# Python cache
__pycache__/
*.pyc
*.pyo

# Jupyter checkpoints
.ipynb_checkpoints/

# Secrets (NEVER commit these!)
.env
secrets.yaml
*.key

# IDE
.vscode/
.idea/

# Logs
*.log
logs/
"""

print(".gitignore example:")
print(gitignore_content)

In [None]:
# Create a sample README
readme_content = """# Netflix Movie Predictor

Predicts whether a movie will be successful based on features.

## Setup

1. Create virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # Mac/Linux
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

Train the model:
```bash
python src/train.py
```

Make predictions:
```bash
python src/predict.py --input data/test.csv
```

## Project Structure

- `data/` - Raw and processed data
- `models/` - Trained model files
- `src/` - Source code
- `notebooks/` - Jupyter notebooks for exploration

## Authors

- Your Name
"""

print("README.md example:")
print(readme_content)

## Part 8: Complete Reproducible Training Script

In [None]:
# Complete reproducible training script
# The shebang has to be the very first line of the file, so the string starts with it
training_script = '''#!/usr/bin/env python
"""Train the Netflix movie predictor model."""

import random
import numpy as np
import pandas as pd
import yaml
import pickle
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

def set_seed(seed):
    """Set all random seeds."""
    random.seed(seed)
    np.random.seed(seed)

def load_config(path="config.yaml"):
    """Load configuration from YAML file."""
    with open(path) as f:
        return yaml.safe_load(f)

def main():
    # Load config
    config = load_config()
    
    # Set random seed
    set_seed(config["training"]["random_seed"])
    print(f"Random seed: {config['training']['random_seed']}")
    
    # Load data
    data = pd.read_csv(config["paths"]["data"] + "movies.csv")
    print(f"Loaded {len(data)} samples")
    
    # Prepare features
    X = data[["budget", "runtime", "star_power"]]
    y = data["success"]
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=config["model"]["n_estimators"],
        max_depth=config["model"]["max_depth"],
        random_state=config["training"]["random_seed"]
    )
    
    # Cross-validation
    scores = cross_val_score(model, X, y, cv=5)
    print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
    
    # Train final model
    model.fit(X, y)
    
    # Save model
    with open(config["paths"]["model"], "wb") as f:
        pickle.dump(model, f)
    print(f"Model saved to {config['paths']['model']}")

if __name__ == "__main__":
    main()
'''

print("Complete reproducible training script (train.py):")
print(training_script)
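
A saved model is easier to reproduce when it is stored alongside a record of how it was produced. The sketch below collects the seed, Python version, platform, and package versions into a dictionary that could be written next to the model file. The function name `collect_run_metadata` and the default package list are assumptions for illustration.

```python
import json
import platform
import sys
from importlib import metadata

def collect_run_metadata(seed, packages=("numpy", "pandas", "scikit-learn")):
    """Gather the details needed to rerun an experiment later."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "random_seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

print(json.dumps(collect_run_metadata(42), indent=2))
```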

## Part 9: Reproducibility Checklist

In [None]:
checklist = [
    ("Random seeds set", True),
    ("Virtual environment created", True),
    ("requirements.txt with pinned versions", True),
    ("Config file (no hardcoded values)", True),
    ("README with setup instructions", True),
    (".gitignore for data/models", True),
    ("Proper project structure", True),
    ("Tested on clean environment", False),  # You should do this!
]

print("Reproducibility Checklist:")
print("=" * 50)
for item, done in checklist:
    status = "✓" if done else "○"
    print(f"  [{status}] {item}")

complete = sum(1 for _, done in checklist if done)
print(f"\nProgress: {complete}/{len(checklist)} items complete")
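
The checklist above is filled in by hand. Some items can be checked automatically, for example whether the expected files exist. This is a sketch under the assumption that the project root contains the four files named in Part 6; `check_project` and `REQUIRED_FILES` are hypothetical names.

```python
from pathlib import Path

REQUIRED_FILES = ["requirements.txt", "config.yaml", "README.md", ".gitignore"]

def check_project(root="."):
    """Map each expected file name to whether it exists under root."""
    root = Path(root)
    return {name: (root / name).is_file() for name in REQUIRED_FILES}

for name, present in check_project().items():
    print(f"  [{'✓' if present else '○'}] {name}")
```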

## Part 10: Exercise - Make Your Own Project Reproducible

Now it's your turn! For your Netflix predictor project:

1. **Create a virtual environment** and install your dependencies
2. **Generate requirements.txt** with pinned versions
3. **Add random seeds** to all your training scripts
4. **Create a config.yaml** for hyperparameters and paths
5. **Write a README.md** with setup instructions
6. **Create a .gitignore** to exclude data and models
7. **Test it!** Clone your repo fresh and follow your own instructions

In [None]:
# Clean up
import os
if os.path.exists('config.yaml'):
    os.remove('config.yaml')
    print("Cleaned up config.yaml")