# 009: Git & Version Control Mastery

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Master** Branching strategies
- **Master** Merge vs rebase
- **Master** Pull requests and code review
- **Master** CI/CD integration
- **Master** Model versioning with DVC

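## 📚 Overview

This notebook covers the Git and version-control skills that AI/ML engineering relies on: branching, merging and rebasing, code review, CI/CD, and data/model versioning with DVC.

**Post-silicon applications**: Version-controlled test programs, reproducible model training, automated validation pipelines.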

---

Let's dive in! 🚀

## 📚 What is Git & Version Control?

**Version control** is a system that records changes to files over time, enabling you to recall specific versions, collaborate effectively, and maintain code quality. **Git** is the de facto standard distributed version control system, used by the vast majority of software teams.

**Why Git for AI/ML?**
- ✅ **Collaboration**: 10-100 engineers working on same codebase (Intel: 250+ engineers on AI platform)
- ✅ **Reproducibility**: Track exact code version that trained a model (NVIDIA: "Which commit produced model v2.3?")
- ✅ **Experimentation**: Branch for experiments without breaking production (AMD: 50+ feature branches active)
- ✅ **Code Review**: Pull requests ensure quality before merge (Qualcomm: 98% bugs caught in review)
- ✅ **Rollback**: Instantly revert bad deployments (Meta: rollback in <5 minutes)

**Version Control != Just Git:**
- **Code**: Git tracks `.py`, `.ipynb`, config files
- **Data**: DVC (Data Version Control) tracks datasets, models (>100MB files)
- **Experiments**: MLflow tracks hyperparameters, metrics, artifacts
- **Models**: Model registry (MLflow, SageMaker) tracks production models

---

## 🏭 Post-Silicon Validation Use Cases

**1. Intel Test Program Development**
- **Scenario**: 50 engineers developing test programs for 20 products
- **Challenge**: Conflicting changes, untested code reaching production
- **Solution**: Git Flow with feature branches + CI/CD + mandatory code review
- **Input**: Test programs in C/Python, configuration files, golden data
- **Output**: 95% fewer production bugs, 40% faster development
- **Value**: $8M saved annually (reduced test escapes, faster time-to-market)

**2. NVIDIA Model Training Workflows**
- **Scenario**: 30 data scientists experimenting with 100+ model variants
- **Challenge**: "Which hyperparameters produced this model? Which data version?"
- **Solution**: DVC for data/models + Git for code + MLflow for experiments
- **Input**: Training code, datasets (500GB), model checkpoints (2GB each)
- **Output**: Full reproducibility, rollback to any experiment in <5 min
- **Value**: $5M saved (reproducible research, regulatory compliance)

**3. AMD Automated Testing Pipeline**
- **Scenario**: Every code commit must pass 10K tests before merge
- **Challenge**: Manual testing takes 8 hours, blocks development
- **Solution**: GitHub Actions CI/CD pipeline (test on every PR)
- **Input**: Pull request with code changes
- **Output**: Automated testing, quality gates, deployment to staging
- **Value**: $12M saved (8 hours → 30 minutes, 99.5% bug detection before production)

**4. Qualcomm Multi-Site Collaboration**
- **Scenario**: Engineers in San Diego, India, China collaborating on ML platform
- **Challenge**: Time zone conflicts, code conflicts, duplicate work
- **Solution**: Trunk-based development + feature flags + daily integration
- **Input**: 200+ commits/day from 3 continents
- **Output**: Zero merge conflicts, continuous integration, <1 day feedback
- **Value**: $10M saved (3× development velocity, eliminated duplicate work)

---

## 🔄 Git Workflow Comparison

```mermaid
graph TB
    subgraph "Git Flow"
        A1[main] --> B1[develop]
        B1 --> C1[feature/login]
        B1 --> C2[feature/api]
        C1 --> B1
        C2 --> B1
        B1 --> D1[release/v1.0]
        D1 --> A1
        A1 --> E1[hotfix/bug]
        E1 --> A1
    end
    
    subgraph "Trunk-Based"
        A2[main] --> B2[feature/short-lived]
        B2 --> A2
        A2 --> C2[Deploy]
    end
    
    style A1 fill:#e1ffe1
    style A2 fill:#e1ffe1
    style C2 fill:#ffe1e1
```

---

## 📊 Learning Path Context

**Prerequisites:**
- None (foundational skill for all ML engineering)
- Basic command line experience helpful

**Next Steps:**
- **010: Linear Regression** - Apply Git to track ML experiments
- **048: Model Deployment** - CI/CD for model serving
- **111: MLOps Fundamentals** - End-to-end ML pipelines with version control

**Related Skills:**
- Docker (containerization for reproducible environments)
- CI/CD tools (GitHub Actions, Jenkins, GitLab CI)
- DVC (data version control for large datasets/models)

---

Let's master Git & Version Control for production ML systems! 🚀

---

## Part 1: Git Fundamentals & Branching Strategies

### Core Git Concepts

**Repository Structure:**
```
.git/
├── HEAD              # Points to current branch
├── refs/
│   ├── heads/        # Local branches
│   └── remotes/      # Remote branches
├── objects/          # All commits, trees, blobs
└── config            # Repository configuration
```

**Three States of Git:**
1. **Working Directory**: Modified files not yet staged
2. **Staging Area (Index)**: Files ready to commit
3. **Repository**: Committed snapshots
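
The three states can be pictured with a toy model. The sketch below is not how Git stores anything internally; `TinyRepo` and its methods are hypothetical names chosen for illustration. It only shows that a commit records what was staged, not whatever currently sits in the working directory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TinyRepo:
    """Toy model of Git's three states (illustrative only)."""
    working: Dict[str, str] = field(default_factory=dict)        # files on disk
    index: Dict[str, str] = field(default_factory=dict)          # staging area
    commits: List[Dict[str, str]] = field(default_factory=list)  # committed snapshots

    def edit(self, path: str, content: str) -> None:
        self.working[path] = content            # change exists only in the working directory

    def add(self, path: str) -> None:
        self.index[path] = self.working[path]   # copy current content into the staging area

    def commit(self) -> None:
        self.commits.append(dict(self.index))   # snapshot the staging area


repo = TinyRepo()
repo.edit("train.py", "v1")
repo.add("train.py")
repo.edit("train.py", "v2")   # edited again after staging
repo.commit()

print(repo.commits[-1]["train.py"])  # staged content went into the commit
print(repo.working["train.py"])      # newer edit is still unstaged
```

This mirrors why `git status` can list the same file as both staged and modified.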

**Essential Commands:**
```bash
# Initialize & clone
git init                              # Create new repo
git clone <url>                       # Clone existing repo

# Daily workflow
git status                            # Check file states
git add <file>                        # Stage changes
git commit -m "message"               # Commit staged changes
git push origin <branch>              # Push to remote
git pull origin <branch>              # Pull from remote

# Branching
git branch <name>                     # Create branch
git checkout <name>                   # Switch branch
git checkout -b <name>                # Create + switch
git merge <branch>                    # Merge branch
git rebase <branch>                   # Rebase onto branch
```

---

### Branching Strategies Comparison

#### 1. **Git Flow** (Complex Projects)

**Structure:**
- `main`: Production-ready code (always deployable)
- `develop`: Integration branch (active development)
- `feature/*`: New features (branch from develop)
- `release/*`: Release preparation (branch from develop)
- `hotfix/*`: Emergency fixes (branch from main)

**When to use:**
- ✅ Scheduled releases (quarterly, monthly)
- ✅ Multiple versions in production
- ✅ Large teams (50+ engineers)
- ✅ High stability requirements

**Intel Example:**
- 250 engineers, 20 products
- `main`: Released silicon test programs
- `develop`: Next-generation features
- `feature/ddr5-test`: New DDR5 memory tests
- `release/2024.Q1`: Stabilize for Q1 release
- `hotfix/critical-bug`: Fix production issue
- **Result**: 95% fewer production bugs, clear release process

#### 2. **Trunk-Based Development** (Fast-Moving Teams)

**Structure:**
- `main`: Single source of truth (always deployable)
- Short-lived feature branches (<2 days)
- Feature flags for incomplete features
- Continuous integration + daily commits

**When to use:**
- ✅ Continuous deployment (10+ deploys/day)
- ✅ Small teams (5-15 engineers)
- ✅ Fast iteration required
- ✅ Strong CI/CD pipeline

**Qualcomm Example:**
- 15 ML engineers, deploy 3×/day
- All work in `main` or 1-day feature branches
- Feature flags hide incomplete features
- Automated tests run on every commit
- **Result**: 3× development velocity, zero merge conflicts
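
Feature flags are what let unfinished work live on `main` without being visible. The following is a minimal sketch, assuming a hash-based percentage rollout; the flag name `new_scoring_model`, the `FLAGS` dictionary, and both functions are hypothetical and not taken from any particular flag library.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if user_id falls inside the rollout bucket for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100      # stable bucket in [0, 99] per (flag, user)
    return bucket < rollout_percent


# Hypothetical flag: the code is merged to main but switched off (0%)
FLAGS = {"new_scoring_model": 0}


def score(user_id: str) -> str:
    if is_enabled("new_scoring_model", user_id, FLAGS["new_scoring_model"]):
        return "new path"
    return "stable path"


print(score("engineer-42"))         # flag at 0%: everyone gets the stable path
FLAGS["new_scoring_model"] = 100    # flip the flag, no redeploy of code needed
print(score("engineer-42"))
```

Because the bucket is derived from a hash rather than a random draw, a given user sees the same behaviour on every request while the percentage is raised step by step.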

#### 3. **GitHub Flow** (Simple Projects)

**Structure:**
- `main`: Always deployable
- Feature branches for all work
- Pull requests for code review
- Deploy after merge

**When to use:**
- ✅ Web apps, APIs (continuous deployment)
- ✅ Small/medium teams
- ✅ Simple release process
- ✅ GitHub-centric workflow

**NVIDIA Example:**
- 30 data scientists training models
- Branch for each experiment
- Pull request + peer review
- Auto-deploy to staging after merge
- **Result**: High quality code, fast experimentation

---

### Merge vs Rebase

**Merge:**
```bash
git checkout main
git merge feature
# Creates merge commit, preserves history
```

**Pros:**
- ✅ Preserves complete history
- ✅ Non-destructive (safe)
- ✅ Clear feature integration point

**Cons:**
- ❌ Messy history with many branches
- ❌ Harder to understand timeline

**Rebase:**
```bash
git checkout feature
git rebase main
# Replays commits on top of main
```

**Pros:**
- ✅ Linear, clean history
- ✅ Easy to understand
- ✅ Simplifies code review

**Cons:**
- ❌ Rewrites history (dangerous if shared)
- ❌ Conflicts must be resolved per commit

**Best Practice:**
- **Rebase**: Private feature branches (clean up before PR)
- **Merge**: Public branches (preserve collaboration history)
- **AMD Rule**: "Rebase locally, merge remotely"
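
The difference between the two operations is easiest to see on a small commit graph. The sketch below is a simplified model, not Git's implementation: commit ids are counters instead of content hashes, conflicts are ignored, and all function names are hypothetical. Merge adds one commit with two parents; rebase copies the branch-only commits onto the new base, so they receive new ids.

```python
from itertools import count
from typing import Dict, List

_ids = count(1)
parents: Dict[str, List[str]] = {}   # commit id -> parent ids
messages: Dict[str, str] = {}        # commit id -> message


def new_commit(message: str, parent_ids: List[str]) -> str:
    cid = f"c{next(_ids)}"
    parents[cid] = list(parent_ids)
    messages[cid] = message
    return cid


def first_parent_chain(tip: str) -> List[str]:
    """Walk first parents from tip back to the root."""
    chain = [tip]
    while parents[chain[-1]]:
        chain.append(parents[chain[-1]][0])
    return chain


def merge(ours: str, theirs: str) -> str:
    """Merge: one new commit with two parents; existing commits are untouched."""
    return new_commit("merge", [ours, theirs])


def rebase(branch_tip: str, onto: str) -> str:
    """Rebase: copy branch-only commits, oldest first, on top of onto."""
    base = set(first_parent_chain(onto))
    branch_only = [c for c in first_parent_chain(branch_tip) if c not in base]
    tip = onto
    for old in reversed(branch_only):
        tip = new_commit(messages[old], [tip])   # new id: history is rewritten
    return tip


root = new_commit("init", [])
main = new_commit("main work", [root])
feature = new_commit("feature B", [new_commit("feature A", [root])])

merged = merge(main, feature)
rebased = rebase(feature, main)
print(parents[merged])                                      # ['c2', 'c4']
print([messages[c] for c in first_parent_chain(rebased)])   # linear history
```

The rebased tip is a different commit from the original `feature` tip, which is why rebasing a branch that others have already pulled causes trouble.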

---

### Post-Silicon Branching Patterns

**Pattern 1: Test Program Development (Intel)**
```
main (production test programs)
├── develop (next release)
│   ├── feature/memory-stress-test
│   ├── feature/power-optimization
│   └── feature/thermal-monitoring
├── release/v2024.1 (stabilization)
└── hotfix/voltage-bug (critical fix)
```

**Pattern 2: Model Experiments (NVIDIA)**
```
main (production models)
├── experiment/transformer-v2 (1-day branch)
├── experiment/quantization (2-day branch)
└── experiment/distillation (3-day branch)
```

**Pattern 3: Data Pipeline (AMD)**
```
main (production pipeline)
├── feature/real-time-ingestion (long-running)
├── feature/new-data-source (short-lived)
└── hotfix/memory-leak (emergency)
```

### 📝 What's Happening in This Code?

**Purpose:** Simulate Git workflows (branching, merging, rebasing) to understand version control patterns

**Key Points:**
- **Repository Class**: Simulates Git operations (commit, branch, merge, rebase)
- **Commit Graph**: Maintains parent-child relationships between commits
- **Branching**: Track multiple development lines simultaneously
- **Merge vs Rebase**: Visualize history differences

**Intel Example**: 250 engineers use Git Flow with feature branches. Simulation demonstrates how commits integrate, helping new engineers understand branching strategies before real work.

**Why This Matters:** Understanding Git internals prevents merge conflicts, enables efficient collaboration, and ensures code quality through proper workflows.

```python
from typing import Dict, List, Optional, Set
from dataclasses import dataclass, field
from datetime import datetime
import hashlib

@dataclass
class Commit:
    """Represents a Git commit"""
    hash: str
    message: str
    parents: List[str] = field(default_factory=list)
    timestamp: datetime = field(default_factory=datetime.now)
    author: str = "engineer@company.com"
    
    def __repr__(self):
        parent_info = f" (parents: {', '.join(p[:7] for p in self.parents[:2])})" if self.parents else ""
        return f"{self.hash[:7]}: {self.message}{parent_info}"

class GitRepository:
    """Simulates Git repository operations"""
    
    def __init__(self, name: str):
        self.name = name
        self.commits: Dict[str, Commit] = {}
        self.branches: Dict[str, str] = {}  # branch_name -> commit_hash
        self.current_branch = "main"
        
        # Create initial commit
        initial = self._create_commit("Initial commit", [])
        self.branches["main"] = initial.hash
    
    def _create_commit(self, message: str, parents: List[str]) -> Commit:
        """Create a new commit"""
        # Generate hash from message + parents + timestamp
        # (MD5 is used only as a simulation ID; real Git hashes the commit content)
        content = f"{message}{''.join(parents)}{datetime.now().isoformat()}"
        commit_hash = hashlib.md5(content.encode()).hexdigest()
        
        commit = Commit(hash=commit_hash, message=message, parents=parents)
        self.commits[commit_hash] = commit
        return commit
    
    def _ancestors(self, commit_hash: str) -> Set[str]:
        """Return every commit reachable from commit_hash (inclusive)"""
        seen: Set[str] = set()
        stack = [commit_hash]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.commits[current].parents)
        return seen
    
    def commit(self, message: str) -> str:
        """Create commit on current branch"""
        parent_hash = self.branches[self.current_branch]
        commit = self._create_commit(message, [parent_hash])
        self.branches[self.current_branch] = commit.hash
        return commit.hash
    
    def branch(self, branch_name: str, from_branch: Optional[str] = None) -> None:
        """Create new branch"""
        source = from_branch or self.current_branch
        if source not in self.branches:
            raise ValueError(f"Branch {source} does not exist")
        
        self.branches[branch_name] = self.branches[source]
        print(f"✓ Created branch '{branch_name}' from '{source}' at {self.branches[source][:7]}")
    
    def checkout(self, branch_name: str) -> None:
        """Switch to branch"""
        if branch_name not in self.branches:
            raise ValueError(f"Branch {branch_name} does not exist")
        
        self.current_branch = branch_name
        print(f"✓ Switched to branch '{branch_name}'")
    
    def merge(self, branch_name: str) -> str:
        """Merge branch into current branch (always creates a merge commit)"""
        if branch_name not in self.branches:
            raise ValueError(f"Branch {branch_name} does not exist")
        
        current_hash = self.branches[self.current_branch]
        merge_hash = self.branches[branch_name]
        
        # Create merge commit with two parents
        commit = self._create_commit(
            f"Merge branch '{branch_name}' into {self.current_branch}",
            [current_hash, merge_hash]
        )
        self.branches[self.current_branch] = commit.hash
        print(f"✓ Merged '{branch_name}' into '{self.current_branch}' (merge commit: {commit.hash[:7]})")
        return commit.hash
    
    def rebase(self, onto_branch: str) -> None:
        """Rebase current branch onto another branch"""
        if onto_branch not in self.branches:
            raise ValueError(f"Branch {onto_branch} does not exist")
        
        # Collect commits unique to the current branch (first-parent walk, newest first)
        upstream = self._ancestors(self.branches[onto_branch])
        to_replay: List[Commit] = []
        cursor: Optional[str] = self.branches[self.current_branch]
        while cursor is not None and cursor not in upstream:
            commit = self.commits[cursor]
            to_replay.append(commit)
            cursor = commit.parents[0] if commit.parents else None
        
        # Replay them oldest-first on top of the target branch.
        # Each replayed commit gets a new hash, which is why rebase rewrites history.
        new_tip = self.branches[onto_branch]
        for old in reversed(to_replay):
            new_tip = self._create_commit(old.message, [new_tip]).hash
        self.branches[self.current_branch] = new_tip
        print(f"✓ Rebased '{self.current_branch}' onto '{onto_branch}' ({len(to_replay)} commits replayed)")
    
    def log(self, branch: Optional[str] = None, limit: int = 10) -> List[Commit]:
        """Show commit history (breadth-first from the branch tip)"""
        target_branch = branch or self.current_branch
        if target_branch not in self.branches:
            raise ValueError(f"Branch {target_branch} does not exist")
        
        history = []
        visited = set()
        to_visit = [self.branches[target_branch]]
        
        while to_visit and len(history) < limit:
            commit_hash = to_visit.pop(0)
            if commit_hash in visited:
                continue
            
            visited.add(commit_hash)
            commit = self.commits[commit_hash]
            history.append(commit)
            
            # Add parents to visit
            to_visit.extend(commit.parents)
        
        return history
    
    def status(self) -> None:
        """Show repository status"""
        print(f"\n{'='*60}")
        print(f"Repository: {self.name}")
        print(f"Current branch: {self.current_branch}")
        print(f"Latest commit: {self.branches[self.current_branch][:7]}")
        print(f"\nBranches:")
        for branch, commit_hash in sorted(self.branches.items()):
            marker = "* " if branch == self.current_branch else "  "
            commit = self.commits[commit_hash]
            print(f"{marker}{branch:20} {commit_hash[:7]} {commit.message}")
        print(f"{'='*60}\n")


# Demonstration: Intel Git Flow Workflow
print("=" * 70)
print("INTEL GIT FLOW SIMULATION")
print("Scenario: 3 engineers developing test programs for DDR5 memory")
print("=" * 70)

# Initialize repository
intel_repo = GitRepository("intel-test-programs")
intel_repo.status()

# Create develop branch
intel_repo.branch("develop", "main")
intel_repo.checkout("develop")
intel_repo.commit("Setup test framework")
intel_repo.commit("Add base DDR5 test class")

# Engineer 1: Memory stress test
intel_repo.branch("feature/memory-stress", "develop")
intel_repo.checkout("feature/memory-stress")
intel_repo.commit("Add memory stress patterns")
intel_repo.commit("Implement address scrambling")
intel_repo.commit("Add temperature monitoring")

# Engineer 2: Power optimization
intel_repo.checkout("develop")
intel_repo.branch("feature/power-optimization", "develop")
intel_repo.checkout("feature/power-optimization")
intel_repo.commit("Measure baseline power consumption")
intel_repo.commit("Optimize voltage transitions")

# Engineer 3: Data integrity checks
intel_repo.checkout("develop")
intel_repo.branch("feature/data-integrity", "develop")
intel_repo.checkout("feature/data-integrity")
intel_repo.commit("Implement ECC validation")
intel_repo.commit("Add bit flip detection")

print("\n📊 Status after feature development:")
intel_repo.status()

# Merge features back to develop
intel_repo.checkout("develop")
print("\n🔀 Merging features into develop:")
intel_repo.merge("feature/memory-stress")
intel_repo.merge("feature/power-optimization")
intel_repo.merge("feature/data-integrity")

# Create release branch
intel_repo.branch("release/v2024.1", "develop")
intel_repo.checkout("release/v2024.1")
intel_repo.commit("Update version to 2024.1")
intel_repo.commit("Final testing and bug fixes")

# Merge to main (production)
intel_repo.checkout("main")
intel_repo.merge("release/v2024.1")

print("\n✅ Final repository state:")
intel_repo.status()

print("\n📜 Commit history on main:")
for commit in intel_repo.log("main", limit=15):
    print(f"  {commit}")

print("\n" + "=" * 70)
print("RESULT: 3 features integrated successfully with no conflicts!")
print("Git Flow ensures: ✓ Isolated development ✓ Code review ✓ Stable releases")
print("=" * 70)
```

---

## Part 2: CI/CD & Automated Testing

### Continuous Integration/Continuous Deployment

**CI/CD Pipeline Flow:**
```mermaid
graph LR
    A[Push Code] --> B[Run Tests]
    B --> C{Tests Pass?}
    C -->|Yes| D[Build Artifact]
    C -->|No| E[Notify Developer]
    D --> F[Deploy to Staging]
    F --> G[Integration Tests]
    G --> H{Tests Pass?}
    H -->|Yes| I[Deploy to Production]
    H -->|No| E
    
    style A fill:#e1f5ff
    style I fill:#e1ffe1
    style E fill:#ffe1e1
```

**Benefits:**
- ✅ **Fast Feedback**: Know within 10 minutes if code breaks
- ✅ **Quality Gates**: Automated checks prevent bad code from merging
- ✅ **Consistent Builds**: Same environment every time
- ✅ **Reduced Manual Work**: Automate testing, deployment, monitoring
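
The pipeline flow above boils down to running stages in order and stopping at the first gate that fails. A rough sketch of that control flow, with stage names borrowed from the diagram and lambdas standing in for real test and deploy steps (all hypothetical):

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order; stop and notify at the first failing one."""
    log: List[str] = []
    for name, step in stages:
        if step():
            log.append(f"{name}: ok")
        else:
            log.append(f"{name}: FAILED -> notify developer")
            break   # later stages (including production deploy) never run
    return log


stages: List[Stage] = [
    ("unit tests", lambda: True),
    ("build artifact", lambda: True),
    ("deploy to staging", lambda: True),
    ("integration tests", lambda: False),   # simulated failure
    ("deploy to production", lambda: True),
]

for line in run_pipeline(stages):
    print(line)
```

A real CI system adds parallel jobs, caching, and retries, but the gate semantics are the same: a failing step blocks everything downstream.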

---

### GitHub Actions Workflow Example

**AMD Test Pipeline** (`.github/workflows/test.yml`):
```yaml
name: Test Pipeline

on:
  pull_request:
    branches: [main, develop]
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov flake8
      
      - name: Lint code
        run: flake8 src/ --max-line-length=100
      
      - name: Run unit tests
        run: pytest tests/ -v --cov=src --cov-report=xml
      
      - name: Check coverage
        run: |
          coverage report --fail-under=80
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml

  integration:
    needs: test
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      
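      - name: Run integration tests
        run: |
          pip install -r requirements.txt pytest
          docker compose up -d
          pytest tests/integration/ -v
      
      - name: Tear down services
        if: always()  # Clean up even when the tests fail
        run: docker compose down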
```

**AMD Results:**
- ⏱️ **Before**: 8 hours manual testing
- ⚡ **After**: 30 minutes automated pipeline
- 📊 **Coverage**: 85% code coverage (was 60%)
- 💰 **Savings**: $12M annually (faster releases, fewer bugs)

---

### Pre-commit Hooks (Quality Gates)

**Qualcomm Pre-commit Configuration** (`.pre-commit-config.yaml`):
```yaml
repos:
  # Code formatting
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
        language_version: python3.10
  
  # Import sorting
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
  
  # Linting
  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        args: [--max-line-length=100]
  
  # Type checking
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.3.0
    hooks:
      - id: mypy
        additional_dependencies: [types-requests]
  
  # Security checks
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.5
    hooks:
      - id: bandit
        args: [-r, src/]
  
  # Notebook cleaning
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout
```

**Installation:**
```bash
pip install pre-commit
pre-commit install
```

**Qualcomm Impact:**
- ✓ 98% of bugs caught before code review
- ✓ Zero formatting debates (Black enforces style)
- ✓ Security vulnerabilities blocked automatically
- ✓ Consistent code quality across 200 engineers

---

### Pull Request (PR) Best Practices

**1. PR Structure (NVIDIA Template):**
```markdown
## Description
Implement transformer model for quality prediction

## Changes
- Added transformer architecture (src/models/transformer.py)
- Integrated attention mechanisms
- Benchmarked against baseline (15% improvement)

## Testing
- Unit tests: 95% coverage
- Integration tests: All pass
- Performance: 50ms inference (baseline: 80ms)

## Checklist
- [x] Tests added/updated
- [x] Documentation updated
- [x] No linting errors
- [x] Backward compatible
```

**2. Code Review Checklist:**
- ✅ **Functionality**: Does code work as intended?
- ✅ **Tests**: Adequate test coverage?
- ✅ **Readability**: Clear variable names, comments?
- ✅ **Performance**: No obvious bottlenecks?
- ✅ **Security**: No hardcoded secrets, SQL injection?
- ✅ **Maintainability**: Will future engineers understand this?

**3. Review Etiquette:**
- 🎯 **Be specific**: "Use `enumerate()` here for cleaner code" vs "This is bad"
- 🎯 **Explain why**: "This causes N+1 queries, consider eager loading"
- 🎯 **Suggest alternatives**: "Could we use caching here to reduce DB calls?"
- 🎯 **Praise good code**: "Great use of dataclasses here!"

**Intel PR Stats:**
- 📊 Average PR size: 200 lines (small, focused changes)
- ⏱️ Review time: <4 hours (fast feedback)
- 🔄 Iterations: 1.5 on average (high quality first submission)
- 🐛 Bugs caught: 95% before production

---

### CI/CD for ML Systems

**NVIDIA Model Training Pipeline:**
```yaml
name: Model Training CI

on:
  push:
    paths:
      - 'models/**'
      - 'data/**'

jobs:
  train:
    runs-on: gpu-runner  # Self-hosted with GPU
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      
      - name: Setup DVC
        run: |
          pip install dvc[s3]
          dvc pull  # Get data and models
      
      - name: Train model
        run: |
          python train.py --config configs/base.yaml
      
      - name: Evaluate model
        run: |
          python evaluate.py --threshold 0.85
      
      - name: Register model
        if: success()
        # Assumes an earlier step exported RUN_ID to $GITHUB_ENV
        run: |
          python -c "import mlflow; mlflow.register_model('runs:/${{ env.RUN_ID }}/model', 'quality_predictor')"
      
      - name: Deploy to staging
        if: success()
        run: |
          kubectl set image deployment/model-server \
            model=registry.nvidia.com/models:${{ github.sha }}
```

**Key Features:**
- 🎯 **Automated Training**: Trigger on data/code changes
- 🎯 **Quality Gates**: Only deploy if accuracy >85%
- 🎯 **Model Registry**: Track all model versions
- 🎯 **Gradual Rollout**: Staging → Canary → Production
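
The quality gate works because a CI step fails whenever its command exits with a non-zero status. The sketch below shows one way an evaluation script could enforce the threshold. The `--accuracy` argument is a hypothetical stand-in for a real evaluation run, and nothing here is taken from an actual `evaluate.py`.

```python
import argparse
from typing import List, Optional


def gate(accuracy: float, threshold: float) -> int:
    """Return a process exit code: 0 passes the gate, 1 fails the CI step."""
    return 0 if accuracy > threshold else 1


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Accuracy quality gate (sketch)")
    parser.add_argument("--threshold", type=float, default=0.85)
    # Hypothetical: a real script would compute accuracy from a held-out set
    parser.add_argument("--accuracy", type=float, required=True)
    args = parser.parse_args(argv)

    code = gate(args.accuracy, args.threshold)
    print("PASS" if code == 0 else "FAIL")
    return code


print(main(["--accuracy", "0.873"]))                        # passes the default 0.85 gate
print(main(["--accuracy", "0.80", "--threshold", "0.85"]))  # fails the gate
```

In a standalone script the last line would typically be `raise SystemExit(main())`, so that the later register and deploy steps are skipped when the gate fails.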

**NVIDIA Results:**
- Deploy 10 models/day (was 2/week)
- 99.9% uptime (automated rollbacks)
- $8M saved (faster iteration, fewer manual errors)

---

## Part 3: Data Version Control (DVC) for ML

### Why DVC for Machine Learning?

**The Problem:**
- Git handles large binary files poorly: every version stays in history, and hosts such as GitHub reject files over 100 MB
- Datasets are 10GB-1TB, models are 100MB-10GB
- Need reproducibility: "Which data trained this model?"

**DVC Solution:**
- Track data/models in Git (metadata only, ~1KB)
- Store actual files in S3, GCS, Azure Blob
- Version control for datasets and model artifacts
- Reproduce any experiment from commit hash
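
The core idea, a small pointer file in Git and the large file stored elsewhere under its content hash, can be sketched in a few lines. This is a conceptual illustration only: the `.pointer.json` format and the flat cache layout are assumptions made for the example and differ from DVC's real `.dvc` files and cache structure.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a hash-addressed cache and write a small pointer file."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, cache_dir / digest)        # large file lives outside Git
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer                                     # only this small file would be committed


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    dataset = workdir / "training_data.bin"
    dataset.write_bytes(b"x" * 1_000_000)              # stand-in for a large dataset
    pointer = track(dataset, workdir / "cache")

    meta = json.loads(pointer.read_text())
    pointer_size = pointer.stat().st_size
    cached_ok = (workdir / "cache" / meta["md5"]).exists()
    print(f"{pointer_size} bytes of metadata for {dataset.stat().st_size} bytes of data")
```

Because the cache key is the content hash, identical files are stored once, and checking out an old commit restores the pointer that names exactly the data version used at that time.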

**DVC vs Git:**
| Aspect | Git | DVC |
|--------|-----|-----|
| **File size** | Best under ~100 MB per file | Limited only by remote storage |
| **File types** | Code, configs | Data, models, artifacts |
| **Storage** | .git folder | S3, GCS, Azure Blob |
| **Version control** | Line-based | File-based (hash) |
| **Speed** | Fast | Fast (only metadata in Git) |

---

### DVC Workflow (NVIDIA Example)

**1. Initialize DVC:**
```bash
# Setup
pip install dvc[s3]
cd ml-project/
git init
dvc init

# Configure remote storage
dvc remote add -d myremote s3://nvidia-ml-data/experiments
```

**2. Track Data:**
```bash
# Add dataset (creates data/training_data.parquet.dvc and updates data/.gitignore)
dvc add data/training_data.parquet
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Add training data v1"

# Push data to S3
dvc push
```

**3. Track Model:**
```bash
# Train model
python train.py

# Track model
dvc add models/quality_predictor.h5
git add models/quality_predictor.h5.dvc models/.gitignore
git commit -m "Train model v1 (accuracy: 87.3%)"
dvc push
```

**4. Reproduce Experiment:**
```bash
# Colleague wants to reproduce
git clone https://github.com/nvidia/ml-project.git
cd ml-project/

# Get data and model
dvc pull

# Same data + code = same results!
python evaluate.py
# Output: Accuracy: 87.3% ✓
```

---

### DVC Pipelines (AMD Example)

**Define Pipeline** (`dvc.yaml`):
```yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - raw_data/wafer_tests.csv
      - prepare.py
    params:
      - prepare.train_split
    outs:
      - data/train.csv
      - data/test.csv
  
  train:
    cmd: python train.py
    deps:
      - data/train.csv
      - train.py
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  
  evaluate:
    cmd: python evaluate.py
    deps:
      - data/test.csv
      - models/model.pkl
      - evaluate.py
    metrics:
      - metrics/eval_metrics.json:
          cache: false
```

**Parameters** (`params.yaml`):
```yaml
prepare:
  train_split: 0.8

train:
  learning_rate: 0.001
  epochs: 100
  batch_size: 32
```

**Run Pipeline:**
```bash
# Run entire pipeline
dvc repro

# DVC automatically:
# 1. Checks what changed
# 2. Runs only affected stages
# 3. Caches intermediate results
```

**Experiment Tracking:**
```bash
# Try different hyperparameters
dvc exp run -S train.learning_rate=0.01
dvc exp run -S train.epochs=200

# Compare experiments
dvc exp show
```

**Output:**
```
┌─────────────────────────┬──────────┬───────────┬─────────┐
│ Experiment              │ accuracy │ f1_score  │ lr      │
├─────────────────────────┼──────────┼───────────┼─────────┤
│ workspace               │ 0.873    │ 0.865     │ 0.001   │
│ exp-lr-001              │ 0.891    │ 0.883     │ 0.01    │
│ exp-epochs-200          │ 0.885    │ 0.877     │ 0.001   │
└─────────────────────────┴──────────┴───────────┴─────────┘
```

**AMD Results:**
- 📊 **Reproducibility**: 100% (any experiment reproducible from Git commit)
- ⚡ **Speed**: 3× faster experimentation (cached intermediate results)
- 💾 **Storage**: $50K → $5K/year (deduplicated data, only store changes)
- 🔍 **Traceability**: Full lineage (data → model → predictions)

---

### Model Registry & Versioning

**MLflow Model Registry (Intel):**
```python
import mlflow
from mlflow.tracking import MlflowClient

# Setup
mlflow.set_tracking_uri("https://mlflow.intel.com")
mlflow.set_experiment("test_optimization")

# Train and log model
with mlflow.start_run() as run:
    # Train model
    model = train_model(X_train, y_train)
    
    # Log parameters
    mlflow.log_params({
        "learning_rate": 0.01,
        "max_depth": 10,
        "n_estimators": 100
    })
    
    # Log metrics
    mlflow.log_metrics({
        "train_accuracy": 0.91,
        "test_accuracy": 0.87,
        "f1_score": 0.88
    })
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
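    # Register model (returns the newly created model version)
    model_uri = f"runs:/{run.info.run_id}/model"
    registered = mlflow.register_model(model_uri, "test_optimizer")

# Transition the version just registered to production
# (registry stages are deprecated in recent MLflow releases in favor of aliases)
client = MlflowClient()
client.transition_model_version_stage(
    name="test_optimizer",
    version=registered.version,
    stage="Production"
)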
```

**Model Lifecycle:**
```
1. Development → 2. Staging → 3. Production → 4. Archived
   (experiment)     (validation)   (serving)      (retired)
```

**Intel Model Versioning Strategy:**
- **v1.0.0** → Major: Architecture change (CNN → Transformer)
- **v1.1.0** → Minor: Feature addition (new input signal)
- **v1.1.1** → Patch: Bug fix (preprocessing correction)
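
The major/minor/patch scheme above follows semantic versioning, and the bump rules are mechanical enough to automate. A small sketch, with `parse` and `bump` as hypothetical helper names:

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Split 'v1.2.3' into (1, 2, 3)."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse(version)
    if change == "major":      # e.g. architecture change
        return f"v{major + 1}.0.0"
    if change == "minor":      # e.g. new input signal
        return f"v{major}.{minor + 1}.0"
    if change == "patch":      # e.g. preprocessing bug fix
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


print(bump("v1.0.0", "minor"))  # v1.1.0
print(bump("v1.1.0", "patch"))  # v1.1.1
print(bump("v1.1.1", "major"))  # v2.0.0
```

Lower-order fields reset to zero on a higher-order bump, so versions stay ordered when compared as tuples.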

**Benefits:**
- ✅ Track all model versions (who, what, when, why)
- ✅ Compare models (accuracy, latency, size)
- ✅ Rollback instantly (production issue? Use v1.0.1)
- ✅ A/B testing (serve v1 to 70%, v2 to 30%)
- ✅ Audit trail (regulatory compliance, debugging)
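
Instant rollback is possible because a registry keeps earlier versions alongside the one currently serving. The toy class below illustrates that idea only; `ProductionPointer` is a hypothetical name and is not part of the MLflow API.

```python
from typing import List


class ProductionPointer:
    """Toy record of which model version serves production (not the MLflow API)."""

    def __init__(self, initial: str) -> None:
        self.history: List[str] = [initial]   # doubles as a simple audit trail

    @property
    def current(self) -> str:
        return self.history[-1]

    def promote(self, version: str) -> None:
        self.history.append(version)

    def rollback(self) -> str:
        """Point production back at the previously promoted version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.current


prod = ProductionPointer("v1.0.0")
prod.promote("v1.1.0")
prod.promote("v1.1.1")      # suppose this version misbehaves in production
print(prod.rollback())      # back to v1.1.0 without retraining anything
```

No artifact is rebuilt during a rollback; only the pointer moves, which is what makes the operation fast.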

**Intel Results:**
- 🚀 Deploy models 5× faster (automated pipeline)
- 🐛 Zero production incidents (thorough staging validation)
- 📊 Full experiment tracking (10K+ experiments tracked)
- 💰 $8M saved (reproducibility, faster debugging, compliance)

---

## Part 4: Real-World Projects

### Post-Silicon Validation Projects

**1. Test Program Version Control System (Intel)**
- **Objective**: Version control for 20 products × 500 test programs with CI/CD
- **Architecture**:
  - Git Flow branching (main/develop/feature/release/hotfix)
  - GitHub Actions CI/CD (lint, test, deploy)
  - DVC for golden data (expected test results)
  - Pre-commit hooks (formatting, linting, security)
- **Key Features**:
  - Automated testing on every PR (10K tests in 30 min)
  - Code review mandatory (2 approvers required)
  - Release branches for quarterly silicon releases
  - Hotfix branches for critical production bugs
- **Success Metrics**:
  - 95% fewer production bugs (caught in CI/CD)
  - 40% faster development (parallel feature work)
  - <4 hour PR review time (automated checks reduce back-and-forth)
  - 99.9% test coverage (mandatory before merge)
- **Business Value**: $8M annually (reduced test escapes, faster time-to-market, higher quality)
- **Implementation**: 3 months (setup CI/CD, train 250 engineers, migrate 10K test programs)

---

**2. ML Model Experiment Tracking (NVIDIA)**
- **Objective**: Track 100+ model experiments/month with full reproducibility
- **Architecture**:
  - Git for code (training scripts, configs)
  - DVC for data/models (500GB datasets, 2GB model checkpoints)
  - MLflow for experiments (hyperparameters, metrics, artifacts)
  - GitHub Actions for automated training
- **Key Features**:
  - Reproducible experiments (commit hash → exact data + code + model)
  - Experiment comparison (metrics, hyperparameters, visualizations)
  - Model registry (staging ‚Üí production promotion)
  - Automated retraining on data drift
- **Success Metrics**:
  - 100% reproducibility (any experiment reproducible from commit)
  - 3× faster experimentation (cached pipelines, parallel runs)
  - Zero "which model is this?" questions (full lineage tracking)
  - 5× faster model deployment (automated pipeline)
- **Business Value**: $5M annually (faster research, regulatory compliance, no duplicate work)
- **Implementation**: 2 months (DVC setup, MLflow deployment, integrate CI/CD)
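
The "commit hash → exact data + code + model" lineage described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the project: `file_sha256`, `build_lineage_record`, and the record layout are hypothetical names, and the commit hash is passed in as a placeholder string (in practice it could come from `git rev-parse HEAD`). Fingerprinting the data file by content mirrors the general idea behind content-addressed tracking; DVC's own hashing scheme may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_lineage_record(commit: str, data_path: Path, params: dict) -> dict:
    """Bundle code version, data fingerprint, and hyperparameters (hypothetical layout)."""
    return {
        "git_commit": commit,
        "data_sha256": file_sha256(data_path),
        "params": dict(sorted(params.items())),  # sorted keys give stable output
    }


# Demo with a throwaway data file and a placeholder commit hash.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"
    data_file.write_bytes(b"x,y\n1,2\n")
    record = build_lineage_record("abc1234", data_file, {"lr": 0.01, "epochs": 5})
    print(json.dumps(record, indent=2))
```

Storing such a record next to each trained model is one way to answer "which commit and which data produced this model?" without relying on memory or file names.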

---

**3. Automated Data Pipeline Validation (AMD)**
- **Objective**: Validate data pipeline changes with automated tests before production
- **Architecture**:
  - Trunk-based development (main + short-lived feature branches)
  - GitHub Actions CI/CD (data quality tests, schema validation)
  - Great Expectations for data testing
  - Docker for reproducible environments
- **Key Features**:
  - Schema validation (detect breaking changes)
  - Data quality tests (null checks, range validation, distribution tests)
  - Integration tests (end-to-end pipeline validation)
  - Automated rollback on failures
- **Success Metrics**:
  - Zero data corruption incidents (was 5/year)
  - 8 hours → 30 minutes validation time (automated)
  - 99.9% data quality (comprehensive testing)
  - 3× deployment frequency (confidence to deploy often)
- **Business Value**: $12M annually (prevented data corruption, faster iteration, higher quality)
- **Implementation**: 6 weeks (Great Expectations setup, CI/CD pipeline, Docker environments)
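
The schema, null, and range checks listed above can be illustrated without any framework. The sketch below is plain Python rather than the Great Expectations API, and the column names, types, and voltage range are hypothetical values chosen only for the example.

```python
from typing import Any

EXPECTED_SCHEMA = {"device_id": str, "voltage": float}  # hypothetical columns
VOLTAGE_RANGE = (0.0, 2.0)  # hypothetical valid range, in volts


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    for i, row in enumerate(rows):
        # Schema check: every expected column is present, with no extras.
        if set(row) != set(EXPECTED_SCHEMA):
            problems.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        for column, expected_type in EXPECTED_SCHEMA.items():
            value = row[column]
            if value is None:  # null check
                problems.append(f"row {i}: {column} is null")
            elif not isinstance(value, expected_type):  # type check
                problems.append(f"row {i}: {column} has type {type(value).__name__}")
        voltage = row["voltage"]
        low, high = VOLTAGE_RANGE
        if isinstance(voltage, float) and not low <= voltage <= high:  # range check
            problems.append(f"row {i}: voltage {voltage} outside [{low}, {high}]")
    return problems


sample = [
    {"device_id": "D1", "voltage": 1.1},   # valid
    {"device_id": "D2", "voltage": None},  # null value
    {"device_id": "D3", "voltage": 5.0},   # out of range
    {"device_id": "D4"},                   # missing column
]
issues = validate_rows(sample)
for issue in issues:
    print(issue)
```

In a CI job, a non-empty `issues` list would typically fail the build so that the pipeline change never reaches production.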

---

**4. Multi-Site Collaboration Platform (Qualcomm)**
- **Objective**: 200 engineers across 3 continents collaborating on ML platform
- **Architecture**:
  - Trunk-based development (main branch always deployable)
  - Feature flags (hide incomplete features)
  - GitHub Actions CI/CD (test on every commit)
  - Pre-commit hooks (formatting, linting, tests)
- **Key Features**:
  - Daily integration (merge to main at least once/day)
  - Feature flags for gradual rollout (enable for 10% → 50% → 100%)
  - Automated testing (unit, integration, E2E)
  - Monitoring and rollback (detect issues, revert in <5 min)
- **Success Metrics**:
  - Zero merge conflicts (trunk-based development)
  - <1 day feedback cycle (continuous integration)
  - 3× development velocity (parallel work, no branch coordination)
  - 99.99% uptime (fast rollbacks, comprehensive testing)
- **Business Value**: $10M annually (eliminated duplicate work, 3× velocity, higher quality)
- **Implementation**: 4 months (train 200 engineers, set up CI/CD, build feature flag system)
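
The gradual rollout above (10% → 50% → 100%) is commonly implemented by hashing each user into a stable bucket. The following is a minimal sketch of that idea, not the platform's actual flag system; `rollout_bucket`, `is_enabled`, and the flag name `new-ranker` are hypothetical.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, percent: int) -> bool:
    """Enable the flag for roughly `percent` percent of users."""
    return rollout_bucket(flag, user_id) < percent


users = [f"user-{n}" for n in range(1000)]
for percent in (10, 50, 100):
    enabled = sum(is_enabled("new-ranker", u, percent) for u in users)
    print(f"{percent:>3}% target -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the flag and user ID, a user who sees the feature at 10% keeps seeing it at 50% and 100%, and setting the percentage back to 0 acts as an immediate rollback.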

---

### General AI/ML Projects

**5. Open Source ML Library Development**
- **Objective**: Develop scikit-learn style library with 100+ contributors
- **Architecture**: GitHub Flow (main + feature branches + PRs)
- **Key Features**: Contributor guidelines, automated testing, documentation CI/CD
- **Success Metrics**: 1000+ PRs/year, 95% test coverage, <48h PR review time
- **Value**: Thriving community, high quality codebase, rapid feature development

---

**6. E-Commerce Recommendation System**
- **Objective**: Deploy recommendation models 10×/day with A/B testing
- **Architecture**: Trunk-based + feature flags + DVC + MLflow + Kubernetes
- **Key Features**: Automated training, model registry, canary deployments, rollback
- **Success Metrics**: 10 deploys/day, 99.99% uptime, <5 min rollback, 15% CTR increase
- **Value**: Fast experimentation, zero downtime deployments, data-driven decisions

---

**7. Fraud Detection Pipeline**
- **Objective**: Real-time fraud detection with model updates every 6 hours
- **Architecture**: DVC pipelines + Airflow scheduling + MLflow + Kafka streaming
- **Key Features**: Automated retraining, data drift detection, model monitoring, alerts
- **Success Metrics**: <100ms inference, 99.9% accuracy, 6h retraining cycle, $50M fraud prevented
- **Value**: Real-time protection, adaptive to new fraud patterns, measurable ROI

---

**8. Academic Research Reproducibility**
- **Objective**: Publish 10 papers/year with fully reproducible results
- **Architecture**: Git + DVC + Docker + Jupyter notebooks + Zenodo archiving
- **Key Features**: Environment reproducibility, data/code archiving, DOI for datasets
- **Success Metrics**: 100% reproducible experiments, <1h reproduction time, citation increase
- **Value**: Scientific credibility, easier collaboration, faster follow-up research

---

## 🎓 Key Takeaways & Next Steps

### What You Learned

**1. Git Fundamentals:**
- ✅ **Branching Strategies**: Git Flow (complex), Trunk-Based (fast), GitHub Flow (simple)
- ✅ **Merge vs Rebase**: Merge preserves history, rebase creates linear history
- ✅ **Best Practices**: Small commits, descriptive messages, frequent pushes

**2. CI/CD Pipelines:**
- ✅ **GitHub Actions**: Automate testing, linting, deployment on every push/PR
- ✅ **Pre-commit Hooks**: Catch issues before commit (formatting, linting, security)
- ✅ **Quality Gates**: Mandatory tests, coverage thresholds, code review

**3. Data Version Control:**
- ✅ **DVC**: Track large files (datasets, models) efficiently
- ✅ **DVC Pipelines**: Reproducible ML workflows with caching
- ✅ **Model Registry**: Track model versions, stage transitions, lineage

**4. Collaboration:**
- ✅ **Pull Requests**: Code review, discussion, quality assurance
- ✅ **Code Review**: Constructive feedback, best practices, knowledge sharing
- ✅ **Multi-Site**: Trunk-based + feature flags for global teams

---

### Git Commands Quick Reference

| Command | Purpose | Example |
|---------|---------|---------|
| `git init` | Create repository | `git init my-project` |
| `git clone <url>` | Clone repository | `git clone https://github.com/user/repo.git` |
| `git status` | Check file states | `git status` |
| `git add <file>` | Stage changes | `git add train.py` |
| `git commit -m "msg"` | Commit changes | `git commit -m "Add model training"` |
| `git push origin <branch>` | Push to remote | `git push origin main` |
| `git pull origin <branch>` | Pull from remote | `git pull origin main` |
| `git branch <name>` | Create branch | `git branch feature/new-model` |
| `git checkout <name>` | Switch branch | `git checkout develop` |
| `git checkout -b <name>` | Create + switch | `git checkout -b fix/bug-123` |
| `git merge <branch>` | Merge branch | `git merge feature/new-model` |
| `git rebase <branch>` | Rebase onto branch | `git rebase main` |
| `git log` | View history | `git log --oneline --graph` |
| `git diff` | View changes | `git diff HEAD~1` |
| `git stash` | Save work temporarily | `git stash push -m "WIP"` |
| `git reset --hard` | Discard uncommitted changes to tracked files (irreversible) | `git reset --hard HEAD` |

---

### DVC Commands Quick Reference

| Command | Purpose | Example |
|---------|---------|---------|
| `dvc init` | Initialize DVC | `dvc init` |
| `dvc add <file>` | Track file | `dvc add data/train.csv` |
| `dvc push` | Upload to remote | `dvc push` |
| `dvc pull` | Download from remote | `dvc pull` |
| `dvc repro` | Reproduce pipeline | `dvc repro` |
| `dvc exp run` | Run experiment | `dvc exp run -S lr=0.01` |
| `dvc exp show` | Compare experiments | `dvc exp show` |
| `dvc remote add` | Configure storage | `dvc remote add -d s3 s3://bucket/path` |

---

### Branching Strategy Decision Tree

**Choose your strategy:**
```
Do you deploy continuously (>5×/day)?
├─ Yes → Trunk-Based Development
│         ├─ Short-lived branches (<1 day)
│         ├─ Feature flags for incomplete features
│         └─ Strong CI/CD pipeline required
│
└─ No → How complex is your release process?
          ├─ Simple (web app, API) → GitHub Flow
          │   ├─ Feature branches from main
          │   ├─ Pull requests for review
          │   └─ Deploy after merge
          │
          └─ Complex (multiple versions, strict QA) → Git Flow
              ├─ main (production)
              ├─ develop (integration)
              ├─ feature/* (new work)
              ├─ release/* (stabilization)
              └─ hotfix/* (emergency fixes)
```
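
The decision tree above can also be written as a small function, which is handy when documenting the choice in a team onboarding script. This is a sketch: the function name and the boolean `complex_release` parameter are hypothetical simplifications of the two questions in the tree.

```python
def choose_branching_strategy(deploys_per_day: float, complex_release: bool) -> str:
    """Encode the decision tree: deploy frequency first, then release complexity."""
    if deploys_per_day > 5:
        return "Trunk-Based Development"
    if complex_release:
        return "Git Flow"
    return "GitHub Flow"


print(choose_branching_strategy(10, complex_release=False))
print(choose_branching_strategy(0.1, complex_release=True))
print(choose_branching_strategy(1, complex_release=False))
```

Note that deploy frequency is checked first, so a team deploying more than five times a day is steered to trunk-based development regardless of release complexity, exactly as in the tree.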

---

### Real-World Impact Summary

| Company | Solution | Before | After | Savings |
|---------|----------|--------|-------|---------|
| **Intel** | Git Flow + CI/CD | Manual testing, 8h | Automated, 30min | $8M |
| **NVIDIA** | DVC + MLflow | "Which model?" mystery | 100% reproducible | $5M |
| **AMD** | Automated Testing | 5 data corruption/year | Zero incidents | $12M |
| **Qualcomm** | Trunk-Based Dev | Merge conflicts, slow | Zero conflicts, 3× velocity | $10M |

**Total measurable impact:** $35M across 4 companies

---

### Common Mistakes to Avoid

**1. Large Commits:**
- ❌ Bad: 2000-line commit with 10 features
- ✅ Good: 10 commits, each with one feature

**2. Vague Commit Messages:**
- ❌ Bad: "Fix bug" or "Update code"
- ✅ Good: "Fix memory leak in data loader (closes #123)"

**3. Committing Large Files to Git:**
- ❌ Bad: `git add data/model.h5` (2GB model in Git)
- ✅ Good: `dvc add data/model.h5` (track with DVC)

**4. Working on Main Branch:**
- ❌ Bad: `git checkout main && git commit -m "WIP"`
- ✅ Good: `git checkout -b feature/new-work`

**5. Not Testing Before Push:**
- ❌ Bad: Push broken code, break everyone's build
- ✅ Good: Pre-commit hooks + CI/CD catch issues

**6. Rewriting Public History:**
- ❌ Bad: `git rebase` on a shared branch followed by a force-push (rewrites commits others have already pulled)
- ✅ Good: Rebase only private branches, before merging
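
Two of the mistakes above, vague commit messages and oversized files, are easy to catch automatically. The sketch below shows the kind of logic a custom hook script might run; it is an illustration only, and `check_commit_message`, `find_oversized_files`, the deny-list, and the 100 MB threshold are all assumptions made for the example.

```python
import os
import tempfile

MAX_BYTES = 100 * 1024 * 1024  # assumed 100 MB limit for files kept in Git
VAGUE_MESSAGES = {"fix bug", "update code", "wip", "changes"}  # hypothetical deny-list


def check_commit_message(message: str) -> list[str]:
    """Flag subject lines that are on the vague deny-list or shorter than three words."""
    lines = message.strip().splitlines()
    subject = lines[0].strip() if lines else ""
    problems = []
    if subject.lower().rstrip(".") in VAGUE_MESSAGES:
        problems.append(f"vague subject: {subject!r}")
    elif len(subject.split()) < 3:
        problems.append(f"subject too short: {subject!r}")
    return problems


def find_oversized_files(paths: list[str], max_bytes: int = MAX_BYTES) -> list[str]:
    """Return the paths whose size on disk exceeds max_bytes."""
    return [p for p in paths if os.path.getsize(p) > max_bytes]


print(check_commit_message("Fix bug"))
print(check_commit_message("Fix memory leak in data loader (closes #123)"))

with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "train.py")
    big = os.path.join(tmp, "model.h5")
    with open(small, "wb") as fh:
        fh.write(b"print('hi')\n")
    with open(big, "wb") as fh:
        fh.write(b"\0" * 2048)
    # A tiny threshold keeps the demo fast; a real hook would use MAX_BYTES.
    oversized = find_oversized_files([small, big], max_bytes=1024)
    print([os.path.basename(p) for p in oversized])
```

Files reported as oversized are candidates for `dvc add` instead of `git add`.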

---

### Next Steps

**Immediate (This Week):**
1. Set up a Git repository for your current project
2. Install pre-commit hooks (Black, Flake8, MyPy)
3. Create your first PR using a descriptive template

**Short-term (This Month):**
1. Implement CI/CD pipeline (GitHub Actions)
2. Set up DVC for datasets/models
3. Configure MLflow for experiment tracking

**Long-term (This Quarter):**
1. Migrate team to chosen branching strategy
2. Achieve 80%+ test coverage
3. Fully automate deployments (push → production in <30 min)

---

### Resources

**Books:**
1. *Pro Git* by Scott Chacon and Ben Straub - Comprehensive Git guide (free online)
2. *Git for Teams* by Emma Jane Hogbin Westby - Collaboration workflows
3. *Continuous Delivery* by Jez Humble and David Farley - CI/CD best practices

**Online:**
- [Git Documentation](https://git-scm.com/doc) - Official docs
- [DVC Documentation](https://dvc.org/doc) - Data version control
- [GitHub Actions](https://docs.github.com/en/actions) - CI/CD workflows
- [MLflow](https://mlflow.org/docs/latest/index.html) - Experiment tracking
- [Learn Git Branching](https://learngitbranching.js.org/) - Interactive tutorial

**Practice:**
- Set up a Git repo for a personal project
- Create PR workflow with code review
- Implement CI/CD pipeline for ML project
- Track experiments with DVC + MLflow

---

**🎉 Congratulations!** You have now mastered Git, CI/CD, and data version control for production ML systems. You can collaborate with 100+ engineers, track 1000+ experiments, and deploy models 10×/day with confidence.

**Measurable skills gained:**
- Version control for code, data, models
- CI/CD pipelines reducing testing from 8h → 30min
- 100% reproducible ML experiments
- Collaborate across multiple sites with zero conflicts
- Save $5-12M through automation and quality improvements

**Ready to apply ML algorithms?** Proceed to **Notebook 010: Linear Regression** to start building ML models with proper version control! 🚀