# Module 11: Git for Data Science - Specialized Workflows

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 120-150 minutes

**Prerequisites**: 
- Module 01: Git Fundamentals
- Module 02: GitHub Essentials
- Module 10: Git Best Practices
- Basic understanding of data science workflows

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Version control Jupyter notebooks effectively
2. Handle large datasets using Git LFS
3. Implement data versioning with DVC
4. Manage machine learning models in Git
5. Collaborate on notebooks without conflicts
6. Track experiments and model versions
7. Set up data science-specific Git workflows
8. Use nbdime for notebook diff and merge

---

## 1. The Data Science Git Challenge

### Why Data Science is Different

Traditional software development:
- ✅ Small text files (source code)
- ✅ Deterministic outputs
- ✅ Clear separation of code and data

Data science adds complexity:
- ⚠️ Large binary files (datasets, models)
- ⚠️ Non-deterministic experiments
- ⚠️ Notebooks mix code, outputs, and visualizations
- ⚠️ Data preprocessing pipelines
- ⚠️ Model artifacts and checkpoints
- ⚠️ Multiple experiment iterations

### What Should You Version Control?

```
✅ YES - Version in Git:
├── Source code (.py files)
├── Notebooks (.ipynb, without outputs)
├── Configuration files
├── Requirements and dependencies
├── Documentation
├── Small sample datasets (<10MB)
└── Scripts and utilities

❌ NO - Don't version in Git:
├── Large datasets (>10MB)
├── Trained models (>10MB)
├── Notebook outputs
├── Checkpoint files
├── Cache directories
└── Virtual environments

🔧 ALTERNATIVE - Use specialized tools:
├── Large files → Git LFS
├── Datasets → DVC, S3, cloud storage
├── Models → MLflow, DVC, model registry
└── Experiments → MLflow, Weights & Biases
```
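
The 10 MB rule of thumb above can be checked before a commit. The following is a minimal sketch using only the standard library; the function name `find_large_files` and the default threshold are illustrative assumptions, not part of Git or any tool named in this module.

```python
from pathlib import Path

SIZE_LIMIT_BYTES = 10 * 1024 * 1024  # the 10 MB rule of thumb used in this module


def find_large_files(root, limit_bytes=SIZE_LIMIT_BYTES):
    """Return sorted (relative_path, size) pairs for files above the limit."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # skip Git's own object store
        if path.is_file() and path.stat().st_size > limit_bytes:
            hits.append((relative.as_posix(), path.stat().st_size))
    return sorted(hits)
```

Calling `find_large_files(".")` in a project root lists the candidates for Git LFS or DVC instead of plain Git.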

---

## 2. Jupyter Notebooks in Version Control

### The Notebook Problem

Jupyter notebooks are **JSON files** that contain:
- Code cells
- Markdown cells
- Cell outputs (text, images, data)
- Execution counts
- Metadata

**Example notebook structure**:
```json
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": ["Result: 42"]
          },
          "output_type": "execute_result"
        }
      ],
      "source": ["x = 42\n", "x"]
    }
  ]
}
```
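
Given the structure above, it is possible to estimate how much of a notebook file consists of outputs rather than code. This is a rough sketch that works on the parsed JSON as a plain dictionary; `output_share` is a hypothetical helper name, and the estimate ignores whitespace differences in the file on disk.

```python
import json


def output_share(notebook):
    """Approximate fraction of the serialized notebook taken up by cell outputs."""
    total = len(json.dumps(notebook))
    outputs = sum(
        len(json.dumps(cell["outputs"]))
        for cell in notebook.get("cells", [])
        if cell.get("outputs")  # markdown cells and unexecuted cells contribute nothing
    )
    return outputs / total
```

To try it on a real file, load it first with `json.load(open("notebook.ipynb", encoding="utf-8"))`.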

### Problems with Versioning Notebooks

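1. **Output bloat**: Embedded images are stored as base64 text and can add megabytes to a single notebook
2. **Execution counts change**: Re-running a cell alters its counter and produces diffs with no code change
3. **Metadata noise**: Kernel and editor metadata changes show up in diffs
4. **Merge conflicts**: Line-based merges can easily break the JSON structure
5. **Not human-readable**: Raw JSON diffs are hard to review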

---

## 3. Solution 1: nbstripout - Strip Notebook Outputs

In [None]:
# Install nbstripout
!pip install -q nbstripout

print("✓ nbstripout installed")
print("\nWhat is nbstripout?")
print("A tool that removes outputs from Jupyter notebooks before committing.")
print("\nWhy use it?")
print("- Keeps repository size small")
print("- Focuses diffs on actual code changes")
print("- Prevents accidental commit of sensitive outputs")
print("- Reduces merge conflicts")

In [None]:
import os
from pathlib import Path

# Create practice environment
practice_dir = Path("ds_git_practice")
practice_dir.mkdir(exist_ok=True)
os.chdir(practice_dir)

# Initialize Git repository
!git init
!git config user.name "DS Learner"
!git config user.email "ds@example.com"

print("✓ Created practice repository")

In [None]:
# Install nbstripout for this repository
# This sets up a Git filter that automatically strips outputs
!nbstripout --install

print("✓ nbstripout filter installed")
print("\nFrom now on, all notebooks will be stripped before commit!")

In [None]:
# Check .gitattributes (created by nbstripout)
if Path(".gitattributes").exists():
    with open(".gitattributes", "r") as f:
        print("Contents of .gitattributes:")
        print("=" * 50)
        print(f.read())
        print("=" * 50)
        print("\nThis tells Git to filter .ipynb files through nbstripout")

### Manual Usage

```bash
# Strip outputs from a single notebook
nbstripout notebook.ipynb

# Strip outputs from all notebooks in directory
nbstripout notebooks/*.ipynb

# Restoring outputs is not possible: stripping rewrites the file in place.
# Re-run the notebook, or keep a copy, if you need the outputs later.
```
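
To make the effect of stripping concrete, here is a simplified sketch of the idea on a notebook loaded as a dictionary. It is not nbstripout's actual implementation (which also handles metadata and many options); `strip_outputs` is a hypothetical helper written for illustration.

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    clean = copy.deepcopy(notebook)
    for cell in clean.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clean


notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {"data": {"text/plain": ["Result: 42"]}, "output_type": "execute_result"}
            ],
            "source": ["x = 42\n", "x"],
        }
    ]
}

clean = strip_outputs(notebook)
print(clean["cells"][0]["outputs"], clean["cells"][0]["execution_count"])  # [] None
```

The source lines survive untouched, which is why diffs on stripped notebooks show only code and markdown changes.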

**Best Practice**: Keep a separate directory for executed notebooks

```
notebooks/
├── development/     # Working notebooks with outputs (gitignored)
└── final/           # Clean notebooks without outputs (versioned)
```

---

## 4. Solution 2: nbdime - Notebook Diff and Merge

In [None]:
# Install nbdime
!pip install -q nbdime

print("✓ nbdime installed")
print("\nWhat is nbdime?")
print("Notebook-aware diff and merge tool that:")
print("- Shows meaningful diffs between notebooks")
print("- Provides visual diff in browser")
print("- Handles notebook merges intelligently")
print("- Integrates with Git and Jupyter")

In [None]:
# Configure nbdime for Git
!nbdime config-git --enable

print("✓ nbdime configured for Git")
print("\nGit will now use nbdime for notebook diffs and merges")

### Using nbdime

```bash
# Compare two notebooks
nbdiff notebook1.ipynb notebook2.ipynb

# Visual diff in browser
nbdiff-web notebook1.ipynb notebook2.ipynb

# Diff with Git
git diff notebook.ipynb
# Now shows notebook-friendly diff!

# 3-way merge during conflicts
nbmerge base.ipynb local.ipynb remote.ipynb

# Visual merge tool
nbmerge-web base.ipynb local.ipynb remote.ipynb
```
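
The core idea behind a notebook-aware diff is to compare cell sources and ignore outputs, execution counts, and metadata. The sketch below illustrates that idea with the standard library's `difflib`; it is far simpler than nbdime's real algorithm, and `cell_sources` and `diff_cells` are hypothetical names.

```python
import difflib


def cell_sources(notebook):
    """Join each cell's source lines into one string per cell."""
    return ["".join(cell.get("source", [])) for cell in notebook.get("cells", [])]


def diff_cells(old_nb, new_nb):
    """Unified diff over cell sources only, ignoring outputs and metadata."""
    old_lines = "\n".join(cell_sources(old_nb)).splitlines()
    new_lines = "\n".join(cell_sources(new_nb)).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm=""))
```

Two notebooks that differ only in execution counts or outputs produce an empty diff here, whereas a plain `git diff` on the raw JSON would report changes.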

### Jupyter Integration

```bash
# Enable nbdime in Jupyter
nbdime extensions --enable

# Launch Jupyter with nbdime
jupyter notebook
# Now has "git" button for diffs!
```

---

## 5. Git LFS (Large File Storage)

### What is Git LFS?

Git LFS is an extension that handles large files efficiently:
- Stores large files on a separate server
- Keeps only **pointers** in your Git repository
- Downloads large files only when needed
- Works with GitHub, GitLab, Bitbucket

### How It Works

```
Without LFS:                      With LFS:
─────────────                     ──────────
Git Repo                          Git Repo
├── code.py                       ├── code.py
├── data.csv (100MB)   →  SLOW    ├── data.csv (pointer)  →  FAST
└── model.pkl (500MB)  →  SLOW    └── model.pkl (pointer) →  FAST
                                              ↓
                                        LFS Storage
                                        ├── data.csv (100MB)
                                        └── model.pkl (500MB)
```
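
The pointer that replaces each large file in the diagram is a tiny text file holding a spec version line, a SHA-256 object ID, and the file size. The sketch below builds that text for arbitrary bytes; `make_lfs_pointer` is a hypothetical helper for illustration, since in practice the Git LFS filter writes pointers for you.

```python
import hashlib


def make_lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# A 1 MB payload is represented by a pointer of well under 200 characters
print(make_lfs_pointer(b"x" * 1_000_000))
```

Because only this pointer is stored in Git history, cloning stays fast no matter how large the real file is.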

### Installation

```bash
# Install Git LFS
# On Ubuntu/Debian:
sudo apt-get install git-lfs

# On macOS:
brew install git-lfs

# On Windows:
# Download from https://git-lfs.github.com/

# Initialize Git LFS
git lfs install
```

---

## 6. Using Git LFS

In [None]:
# Check if Git LFS is available
!git lfs version 2>/dev/null || echo "Git LFS not installed. Install from https://git-lfs.github.com/"

In [None]:
# Track file types with LFS
# This creates .gitattributes entries

lfs_config = """# Track data files
*.csv filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text

# Track model files
*.pkl filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text

# Track archives (e.g. zipped image datasets)
*.zip filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
"""

# Append to .gitattributes
with open(".gitattributes", "a") as f:
    f.write("\n" + lfs_config)

print("✓ Configured Git LFS for common data science files")
print("\nTracked file types:")
print("- Data: .csv, .parquet, .h5")
print("- Models: .pkl, .h5, .pb, .pth, .onnx")
print("- Archives: .zip, .tar.gz")
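
To confirm which patterns ended up routed through LFS, the `.gitattributes` text can be scanned for `filter=lfs`. This is a small sketch with a hypothetical helper name, `lfs_tracked_patterns`; it assumes patterns contain no spaces, and `git lfs track` remains the authoritative way to list tracked patterns.

```python
def lfs_tracked_patterns(gitattributes_text):
    """Return patterns whose attributes include filter=lfs, in order, without duplicates."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs and pattern not in patterns:
            patterns.append(pattern)
    return patterns
```

For the configuration written above, `lfs_tracked_patterns(open(".gitattributes").read())` reports `*.h5` once even though it is listed under both data and model files.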

### Common Git LFS Commands

```bash
# Track specific file types
git lfs track "*.csv"
git lfs track "*.pkl"
git lfs track "models/*.h5"

# See what's being tracked
git lfs track

# See which files are stored in LFS
git lfs ls-files

# Pull LFS files
git lfs pull

# Fetch LFS files without checking out
git lfs fetch

# Migrate existing files to LFS
git lfs migrate import --include="*.csv"
```

### GitHub LFS Limits

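GitHub's LFS quotas and pricing have changed over time, so treat the figures below as a snapshot and check GitHub's billing documentation for current values.

**Free accounts**:
- 1 GB storage
- 1 GB bandwidth per month

**Paid accounts**:
- Additional data packs available
- $5/month for 50GB storage + 50GB bandwidth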

For larger datasets, consider:
- AWS S3
- Google Cloud Storage
- Azure Blob Storage
- DVC (Data Version Control)

---

## 7. DVC - Data Version Control

### What is DVC?

DVC is like **Git for data**:
- Versions large datasets
- Tracks machine learning models
- Manages ML pipelines
- Works with any storage (S3, GCS, Azure, local)
- Integrates seamlessly with Git

### How DVC Works

```
1. Add data file to DVC:
   dvc add data/large_dataset.csv
   
   Creates:
   ├── data/large_dataset.csv.dvc  (small metadata file, tracked in Git)
   └── data/large_dataset.csv      (gitignored; contents stored in the DVC cache)

2. Git tracks only the .dvc file:
   git add data/large_dataset.csv.dvc
   git commit -m "Add large dataset"

3. Push data to remote storage:
   dvc push
```
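
The steps above rest on content-addressed storage: the file is hashed, stored under its hash, and only the small metadata is given to Git. The following sketch imitates that idea with the standard library. It is not DVC's implementation, the directory layout is a simplification (the real cache layout differs between DVC versions), and `add_to_cache` is a hypothetical name.

```python
import hashlib
import shutil
from pathlib import Path


def add_to_cache(data_path, cache_dir):
    """Copy a file into a content-addressed cache and return its metadata."""
    data_path = Path(data_path)
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()
    # Store the file under <cache>/<first 2 hex chars>/<remaining chars>
    target = Path(cache_dir) / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(data_path, target)  # identical content is stored only once
    return {"md5": digest, "size": data_path.stat().st_size, "path": data_path.name}
```

The returned dictionary plays the role of the `.dvc` file: it is small enough to commit, and the hash is enough to find the exact data version again.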

### Installation

In [None]:
# Install DVC
!pip install -q dvc

print("✓ DVC installed")
print("\nDVC capabilities:")
print("- Version large files efficiently")
print("- Track ML experiments")
print("- Define reproducible pipelines")
print("- Share data across team")
print("- Works with any cloud storage")

In [None]:
# Initialize DVC in repository
!dvc init

print("\n✓ DVC initialized")
print("\nCreated:")
print("- .dvc/ directory (DVC config and cache)")
print("- .dvcignore (like .gitignore for DVC)")

---

## 8. Using DVC - Practical Example

In [None]:
import pandas as pd
import numpy as np

# Create a sample "large" dataset
np.random.seed(42)

# Simulate 100,000 rows of sensor data
large_dataset = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=100000, freq='1min'),
    'sensor_1': np.random.normal(100, 15, 100000),
    'sensor_2': np.random.normal(50, 10, 100000),
    'sensor_3': np.random.normal(75, 20, 100000),
    'temperature': np.random.normal(20, 5, 100000),
    'humidity': np.random.uniform(30, 80, 100000),
})

# Save to CSV
os.makedirs('data', exist_ok=True)
large_dataset.to_csv('data/sensor_data.csv', index=False)

print("✓ Created sample dataset")
print(f"\nDataset shape: {large_dataset.shape}")
print(f"File size: {os.path.getsize('data/sensor_data.csv') / 1024 / 1024:.2f} MB")

In [None]:
# Add dataset to DVC
!dvc add data/sensor_data.csv

print("\n✓ Added to DVC")
print("\nWhat happened:")
print("1. DVC stored the file in its cache (.dvc/cache/) and kept it available in the workspace")
print("2. Created data/sensor_data.csv.dvc (metadata file)")
print("3. Added data/sensor_data.csv to data/.gitignore")

In [None]:
# Examine the .dvc file
with open('data/sensor_data.csv.dvc', 'r') as f:
    dvc_file = f.read()

print("Contents of sensor_data.csv.dvc:")
print("=" * 50)
print(dvc_file)
print("=" * 50)
print("\nThis file contains:")
print("- MD5 hash of the data file")
print("- Size of the data file")
print("- Path to the data file")
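
The recorded hash and size are what let DVC detect whether a workspace file still matches the committed version. Below is a sketch of that check, assuming the fields appear in the file as `md5:`, `size:` and `path:` lines. The line scan is deliberately naive (a YAML parser would be the robust choice), and both function names are hypothetical.

```python
import hashlib
import re


def read_dvc_fields(dvc_text):
    """Pull md5, size and path out of .dvc file text with a naive line scan."""
    fields = {}
    for key in ("md5", "size", "path"):
        match = re.search(rf"^\s*-?\s*{key}:\s*(\S+)\s*$", dvc_text, flags=re.MULTILINE)
        if match:
            fields[key] = match.group(1)
    return fields


def matches_recorded_hash(data: bytes, fields):
    """True when the data's MD5 and length agree with the recorded fields."""
    return (
        hashlib.md5(data).hexdigest() == fields.get("md5")
        and len(data) == int(fields.get("size", -1))
    )
```

A mismatch means the data changed since the `.dvc` file was written, which is the signal to run `dvc add` again and commit the updated metadata.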

In [None]:
# Commit the .dvc file to Git
!git add data/sensor_data.csv.dvc data/.gitignore
!git commit -m "Add sensor dataset with DVC"

print("✓ Committed .dvc file to Git")
print("\nNow:")
print("- Git tracks only the small .dvc file (~100 bytes)")
print("- DVC manages the full dataset (size printed earlier)")
print("- Team members can pull data with 'dvc pull' once it has been pushed to a remote")

### DVC Remote Storage

Configure where DVC stores your data:

```bash
# Local remote (for testing)
dvc remote add -d myremote /tmp/dvc-storage

# AWS S3
dvc remote add -d myremote s3://mybucket/dvc-storage

# Google Cloud Storage
dvc remote add -d myremote gs://mybucket/dvc-storage

# Azure Blob Storage
dvc remote add -d myremote azure://mycontainer/dvc-storage

# Push data to remote
dvc push

# Pull data from remote
dvc pull
```

---

## 9. Versioning Machine Learning Models

### Strategy 1: DVC for Models

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import pickle

# Train a simple model
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Save model
os.makedirs('models', exist_ok=True)
with open('models/classifier_v1.pkl', 'wb') as f:
    pickle.dump(model, f)

print("✓ Trained and saved model")
print(f"Model size: {os.path.getsize('models/classifier_v1.pkl') / 1024:.2f} KB")

In [None]:
# Version the model with DVC
!dvc add models/classifier_v1.pkl
!git add models/classifier_v1.pkl.dvc models/.gitignore
!git commit -m "Add classifier model v1"

print("✓ Model versioned with DVC")
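
DVC versions the model file itself, but not the parameters and metrics that produced it. One lightweight way to keep them together is a plain-text log that records the model's hash next to its settings. This is a sketch under stated assumptions: the file name `experiments.jsonl` and both helper functions are hypothetical, and a tool such as MLflow covers the same ground more thoroughly.

```python
import hashlib
import json
from datetime import datetime, timezone


def experiment_record(model_bytes, params, metrics):
    """Bundle a model's hash with the parameters and metrics that produced it."""
    return {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "params": params,
        "metrics": metrics,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


def append_record(record, log_path="experiments.jsonl"):
    """Append one JSON record per line so the log stays diff-friendly in Git."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```

For the classifier above, a call could look like `append_record(experiment_record(pickle.dumps(model), {"n_estimators": 100}, {"train_accuracy": model.score(X, y)}))`.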

### Strategy 2: Model Registry

For production systems, use a model registry:

**MLflow**:
```python
import mlflow
import mlflow.sklearn

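# Log model and keep a handle on the run so its ID can be reused
with mlflow.start_run() as run:
    mlflow.log_params({"n_estimators": 100})
    mlflow.log_metric("accuracy", 0.95)  # example value
    mlflow.sklearn.log_model(model, "classifier")

# Register the logged model under a name in the registry
mlflow.register_model(
    f"runs:/{run.info.run_id}/classifier",
    "SensorClassifier"
)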
```

**Benefits**:
- Centralized model storage
- Metadata tracking (metrics, parameters)
- Model lineage
- Deployment integration
- Role-based access control

---

## 10. Data Science .gitignore Template

In [None]:
# Create comprehensive .gitignore for data science
ds_gitignore = """# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Virtual Environments
venv/
env/
.venv/
ENV/
conda_env/

# Jupyter Notebook
.ipynb_checkpoints/
*-checkpoint.ipynb

# Data files (use DVC instead)
*.csv
*.tsv
*.xlsx
*.xls
*.parquet
*.feather
*.h5
*.hdf5
*.db
*.sqlite
# Exception: small sample/test data
!data/sample/**
!data/test/**

# Model files (use DVC or MLflow)
*.pkl
*.pickle
*.joblib
*.h5
*.pb
*.pt
*.pth
*.onnx
*.tflite
models/*
!models/*.dvc
!models/.gitignore
checkpoints/
saved_models/

# Large files
*.zip
*.tar
*.tar.gz
*.rar
*.7z

# Image datasets
*.jpg
*.jpeg
*.png
*.gif
*.bmp
*.tiff
*.svg
# Exception: documentation images
!docs/images/**
!reports/figures/**
!README_images/**

# Video/Audio
*.mp4
*.avi
*.mov
*.mp3
*.wav
*.flac

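# DVC: do not ignore dvc.lock or *.dvc files, they must be committed.
# The cache and temp files are already excluded by .dvc/.gitignore.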

# MLflow
mlruns/
mlartifacts/

# Weights & Biases
wandb/

# TensorBoard
runs/
logs/
tensorboard/

# Experiment tracking
experiments/
.experiments/

# Secrets and credentials
.env
.env.local
.env.*.local
secrets.yaml
credentials.json
*.pem
*.key
config/secrets/

# IDE
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/

# Distribution / packaging
build/
dist/
*.egg-info/

# Documentation
docs/_build/
site/
"""

with open(".gitignore", "w") as f:
    f.write(ds_gitignore)

print("✓ Created comprehensive data science .gitignore")
print("\nKey exclusions:")
print("- Data files (CSV, Parquet, etc.)")
print("- Model files (PKL, H5, PT, etc.)")
print("- Large media files")
print("- Experiment tracking directories")
print("- Secrets and credentials")
print("\nExceptions:")
print("- Small sample/test data")
print("- Documentation images")

---

## 11. ML Experiment Tracking Workflow

### Recommended Structure

```
ml-project/
├── .git/                   # Git repository
├── .dvc/                   # DVC cache
│
├── data/                   # Data directory
│   ├── raw/                # Original data (DVC tracked)
│   ├── processed/          # Processed data (DVC tracked)
│   └── sample/             # Small samples (Git tracked)
│
├── notebooks/              # Jupyter notebooks (outputs stripped)
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   └── 03_modeling.ipynb
│
├── src/                    # Source code (Git tracked)
│   ├── data/
│   │   ├── load.py
│   │   └── preprocess.py
│   ├── features/
│   │   └── build_features.py
│   ├── models/
│   │   ├── train.py
│   │   └── predict.py
│   └── visualization/
│       └── visualize.py
│
├── models/                 # Trained models (DVC tracked)
│   ├── model_v1.pkl.dvc
│   └── model_v2.pkl.dvc
│
├── experiments/            # Experiment logs (gitignored)
│   └── mlruns/             # MLflow tracking
│
├── configs/                # Configuration files (Git tracked)
│   ├── model_config.yaml
│   └── training_config.yaml
│
├── tests/                  # Unit tests (Git tracked)
│   └── test_preprocessing.py
│
├── dvc.yaml                # DVC pipeline (Git tracked)
├── requirements.txt        # Dependencies (Git tracked)
├── .gitignore              # Git ignore patterns
├── .dvcignore              # DVC ignore patterns
└── README.md               # Documentation (Git tracked)
```

---

## 12. DVC Pipelines - Reproducible ML

### What are DVC Pipelines?

Define your ML workflow as a pipeline:
- Each stage is a command
- DVC tracks dependencies and outputs
- Automatically re-runs only what changed
- Fully reproducible

### Example Pipeline

In [None]:
# Create a simple pipeline definition
dvc_pipeline = """stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw/sensor_data.csv
      - src/prepare.py
    outs:
      - data/processed/clean_data.csv

  train:
    cmd: python src/train.py
    deps:
      - data/processed/clean_data.csv
      - src/train.py
    params:
      - train.n_estimators
      - train.max_depth
    outs:
      - models/classifier.pkl
    metrics:
      - metrics/train_metrics.json

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - models/classifier.pkl
      - data/processed/clean_data.csv
      - src/evaluate.py
    metrics:
      - metrics/test_metrics.json
"""

with open("dvc.yaml", "w") as f:
    f.write(dvc_pipeline)

print("✓ Created DVC pipeline")
print("\nPipeline stages:")
print("1. prepare: Clean raw data")
print("2. train: Train model")
print("3. evaluate: Evaluate model")
print("\nRun with: dvc repro")
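
The order of the three stages is never stated in the pipeline file; it follows from which stage produces the files another stage depends on. The sketch below derives that order with the standard library's `graphlib` (Python 3.9+). The `STAGES` dictionary mirrors the pipeline definition above by hand, and `stage_order` is a hypothetical helper, not a DVC function.

```python
from graphlib import TopologicalSorter

# Mirrors the deps and outs of the pipeline above (metrics count as outputs)
STAGES = {
    "prepare": {
        "deps": ["data/raw/sensor_data.csv", "src/prepare.py"],
        "outs": ["data/processed/clean_data.csv"],
    },
    "train": {
        "deps": ["data/processed/clean_data.csv", "src/train.py"],
        "outs": ["models/classifier.pkl", "metrics/train_metrics.json"],
    },
    "evaluate": {
        "deps": ["models/classifier.pkl", "data/processed/clean_data.csv", "src/evaluate.py"],
        "outs": ["metrics/test_metrics.json"],
    },
}


def stage_order(stages):
    """Order stages so each runs after the stages that produce its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(STAGES))  # ['prepare', 'train', 'evaluate']
```

Dependencies that no stage produces, such as the raw CSV and the scripts, are treated as plain inputs and add no edges to the graph.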

### Running the Pipeline

```bash
# Run entire pipeline
dvc repro

# DVC will:
# 1. Check which stages have changed dependencies
# 2. Re-run only those stages
# 3. Cache all outputs
# 4. Track metrics and parameters

# View pipeline
dvc dag

# Compare experiments
dvc metrics show
dvc metrics diff

# Compare parameters
dvc params diff
```
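
The "re-run only what changed" behaviour comes down to comparing dependency hashes against those recorded after the last successful run. The following sketch shows that comparison with a JSON file standing in for the lock file. DVC itself records this state in `dvc.lock`; the JSON format and the function names here are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path):
    """Hash a file's bytes; a missing file hashes to None."""
    path = Path(path)
    return hashlib.md5(path.read_bytes()).hexdigest() if path.is_file() else None


def needs_rerun(stage_name, deps, lock_path):
    """Compare current dependency hashes with those saved after the last run."""
    lock_path = Path(lock_path)
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    current = {str(dep): file_md5(dep) for dep in deps}
    return lock.get(stage_name) != current


def record_run(stage_name, deps, lock_path):
    """Save the dependency hashes once the stage has finished."""
    lock_path = Path(lock_path)
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    lock[stage_name] = {str(dep): file_md5(dep) for dep in deps}
    lock_path.write_text(json.dumps(lock, indent=2))
```

A stage runner would call `needs_rerun` before executing a stage's command and `record_run` afterwards, so untouched stages are skipped on the next pass.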

---

## 13. Collaboration Workflow for DS Teams

### Recommended Workflow

```bash
# 1. Clone repository
git clone https://github.com/team/ml-project.git
cd ml-project

# 2. Install dependencies
pip install -r requirements.txt

# 3. Pull data with DVC
dvc pull

# 4. Create experiment branch
git checkout -b experiment/new-features

# 5. Make changes, train models
jupyter notebook notebooks/experiment.ipynb

# 6. Track new data/models with DVC
dvc add data/processed/new_features.csv
dvc add models/improved_model.pkl

# 7. Commit code and .dvc files
git add notebooks/experiment.ipynb
git add data/processed/new_features.csv.dvc
git add models/improved_model.pkl.dvc
git commit -m "feat: Add new feature engineering approach"

# 8. Push code to Git, data to DVC
git push origin experiment/new-features
dvc push

# 9. Create pull request
# Team reviews code and can pull your data with 'dvc pull'
```

### Team Benefits

- **Code**: Versioned in Git
- **Data**: Versioned in DVC, stored centrally
- **Models**: Tracked with DVC or MLflow
- **Experiments**: Reproducible with DVC pipelines
- **Notebooks**: Clean diffs with nbdime
- **Collaboration**: Everyone has access to same data/models

---

## 14. Exercise 1: Set Up nbstripout

**Task**: Configure a repository to automatically strip notebook outputs.

**Steps**:
1. Create a new Git repository
2. Install nbstripout
3. Configure Git filter
4. Create a notebook with outputs
5. Verify outputs are stripped on commit

In [None]:
# Exercise 1: Your solution here

# TODO: Implement the exercise

print("TODO: Complete this exercise")

---

## 15. Exercise 2: Version a Dataset with DVC

**Task**: Create a dataset, version it with DVC, and simulate updating it.

**Requirements**:
1. Generate a CSV dataset
2. Add it to DVC
3. Commit to Git
4. Modify the dataset
5. Update with DVC
6. Show version history

In [None]:
# Exercise 2: Your solution here

# TODO: Implement the exercise

print("TODO: Complete this exercise")

---

## 16. Exercise 3: Design a DS Team Workflow

**Task**: Design a complete workflow for a data science team.

**Team Context**:
- 4 data scientists
- Working on a customer churn prediction model
- Large dataset (5GB)
- Multiple experiments running in parallel
- Need to track model performance
- Monthly production deployments

**Address**:
1. How to handle the large dataset?
2. How to version notebooks?
3. How to track experiments?
4. How to version models?
5. What's the Git workflow?
6. How to ensure reproducibility?

### Your Workflow Design Here

TODO: Describe your complete data science workflow

Consider:
- Tools (Git, DVC, MLflow, etc.)
- Repository structure
- Branching strategy
- Data management
- Model versioning
- Experiment tracking
- Deployment process

---

## 17. Summary

### Key Concepts Learned

1. **Jupyter Notebooks**:
   - Use nbstripout to remove outputs
   - Use nbdime for meaningful diffs
   - Keep notebooks clean in Git

2. **Large Files**:
   - Git LFS for large binary files kept in the repository (GitHub rejects regular Git files over 100 MB)
   - DVC for very large datasets
   - Cloud storage for massive data

3. **Data Versioning**:
   - DVC tracks data like Git tracks code
   - Metadata in Git, data in DVC cache
   - Works with any cloud storage

4. **Model Management**:
   - Version with DVC for simple projects
   - Use MLflow for complex projects
   - Track metrics and parameters

5. **Pipelines**:
   - DVC pipelines ensure reproducibility
   - Cache intermediate results
   - Re-run only what changed

6. **Team Collaboration**:
   - Git for code, DVC for data
   - Shared remote storage
   - Reproducible experiments

### Essential Tools

```bash
# Notebooks
pip install nbstripout nbdime
nbstripout --install
nbdime config-git --enable

# Large files
git lfs install
git lfs track "*.csv"

# Data versioning
pip install dvc
dvc init
dvc add data/large_file.csv
dvc push

# Experiment tracking
pip install mlflow
mlflow ui
```

### Best Practices Checklist

- [ ] Strip notebook outputs before committing
- [ ] Use DVC for datasets > 10MB
- [ ] Version models with DVC or MLflow
- [ ] Define reproducible pipelines
- [ ] Track experiment metrics
- [ ] Document data dependencies
- [ ] Use cloud storage for team data
- [ ] Separate code and data clearly
- [ ] Never commit secrets or credentials
- [ ] Test reproducibility on clean checkout

---

## 18. What's Next?

You've mastered Git for data science workflows! Continue with:

**Module 12: GitHub Pages and Portfolio Hosting**
- Host your portfolio on GitHub Pages
- Create project documentation sites
- Showcase your data science projects
- Build your professional brand

### Additional Resources

**Tools**:
- [nbstripout Documentation](https://github.com/kynan/nbstripout)
- [nbdime Documentation](https://nbdime.readthedocs.io/)
- [Git LFS](https://git-lfs.github.com/)
- [DVC Documentation](https://dvc.org/doc)
- [MLflow](https://mlflow.org/)

**Tutorials**:
- [DVC Tutorial](https://dvc.org/doc/start)
- [MLflow Tutorial](https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html)
- [Effective Jupyter Notebooks](https://jupyter-notebook.readthedocs.io/)

**Articles**:
- [Data Version Control in Practice](https://realpython.com/python-data-version-control/)
- [ML Model Versioning](https://neptune.ai/blog/version-control-for-ml-models)
- [Reproducible Data Science](https://towardsdatascience.com/)

### Keep Learning

Advanced topics to explore:
1. **CI/CD for ML**: Automated model training and deployment
2. **Feature Stores**: Centralized feature management
3. **Model Monitoring**: Track model performance in production
4. **A/B Testing**: Compare model versions
5. **MLOps**: End-to-end ML operations

---

**Congratulations!** You now have the skills to manage complex data science projects with professional version control. 🎉