# Module 08: Git for Data Science

**Difficulty**: ⭐⭐ Intermediate

**Estimated Time**: 75-90 minutes

**Prerequisites**: 
- Module 00
- Module 01
- Module 02
- Module 03
- Module 04
- Module 05
- Module 06
- Module 07

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Version Jupyter notebooks effectively
2. Handle large datasets with Git LFS
3. Use nbstripout for clean commits
4. Implement data versioning strategies
5. Track experiments systematically
6. Create reproducible workflows

---

## 1. Versioning Jupyter Notebooks

Jupyter notebooks contain both code and outputs, making version control tricky.

### The Problem

Notebooks include:
- Code cells
- Output cells (can be large)
- Metadata (execution counts, timestamps)
- Binary data (images, plots)

This makes diffs messy and merge conflicts common.
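
To make the problem concrete, here is a minimal sketch of what removing outputs from a notebook involves. It treats the notebook as plain JSON and uses only the standard library. The function name `strip_outputs` and the sample notebook are hypothetical, and this is a simplified illustration rather than how nbstripout itself is implemented.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict without outputs or execution counts."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round-trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical two-cell notebook used only for illustration
sample = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["42\n"]}
            ],
            "source": ["print(6 * 7)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(sample)
print(json.dumps(stripped["cells"][1], indent=1))
```

Only the `outputs` and `execution_count` fields change; the source code of each cell is untouched, which is why a stripped notebook diffs cleanly.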

### Solution: nbstripout

**Install**:
```bash
pip install nbstripout
```

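**Setup** (once per repository, run from inside the repository):
```bash
nbstripout --install
```

This registers a Git filter that strips outputs from notebooks as they are staged. Commits stay clean, while the notebook in your working directory keeps its outputs.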

### Manual Stripping

```bash
# Strip outputs from specific notebook
nbstripout notebook.ipynb

# Strip all notebooks
find . -name '*.ipynb' -exec nbstripout {} \;
```
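
Before committing, it can help to see which notebooks still carry outputs. The following is a small sketch of such a check, assuming notebooks in the standard JSON format; `notebooks_with_outputs` is a hypothetical helper, not part of nbstripout.

```python
import json
from pathlib import Path


def notebooks_with_outputs(root: str) -> list[Path]:
    """List notebooks under `root` whose code cells still contain outputs."""
    dirty = []
    for path in sorted(Path(root).rglob("*.ipynb")):
        if ".ipynb_checkpoints" in path.parts:
            continue  # skip Jupyter's autosave copies
        notebook = json.loads(path.read_text(encoding="utf-8"))
        cells = notebook.get("cells", [])
        if any(c.get("cell_type") == "code" and c.get("outputs") for c in cells):
            dirty.append(path)
    return dirty
```

Calling `notebooks_with_outputs(".")` from the project root returns the paths that would benefit from stripping.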

### .gitattributes for Notebooks

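By default, `nbstripout --install` records its attributes in `.git/info/attributes`, which is local to your clone. To share the configuration with collaborators, commit a `.gitattributes` file instead (each collaborator still needs to run `nbstripout --install` to register the filter):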
```
*.ipynb filter=nbstripout
*.ipynb diff=ipynb
```

---

## 2. Git LFS for Large Files

Git Large File Storage (LFS) replaces large files in the repository with small text pointers and stores the actual file contents on a separate server.

### Why Git LFS?

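Git struggles with:
- Large files (GitHub, for example, rejects files over 100 MB)
- Binary files, which cannot be diffed or merged line by line
- Large binary files that change frequently, since every version is stored in full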

### Installing Git LFS

```bash
# macOS
brew install git-lfs

# Debian/Ubuntu
sudo apt-get install git-lfs

# Initialize (once per user account)
git lfs install
```

### Tracking Large Files

```bash
# Track specific file types
git lfs track "*.csv"
git lfs track "*.h5"
git lfs track "*.pkl"

# Track specific files
git lfs track "data/large_dataset.csv"

# Commit .gitattributes
git add .gitattributes
git commit -m "Configure Git LFS"
```
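
Once a file is tracked, Git commits a short text pointer in its place and the contents go to LFS storage. The sketch below builds such a pointer following the Git LFS pointer format (a version line, a SHA-256 object ID, and the size in bytes). The helper name `lfs_pointer_text` and the sample bytes are hypothetical; Git LFS generates real pointers itself.

```python
import hashlib


def lfs_pointer_text(content: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file contents."""
    digest = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest}\n"
        f"size {len(content)}\n"
    )


# Hypothetical CSV contents, standing in for a large dataset
print(lfs_pointer_text(b"col_a,col_b\n1,2\n"))
```

The pointer stays three lines long however large the file is, which is what keeps the repository itself small.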

### Best Practices

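- Track files larger than about 50 MB with LFS (GitHub warns at 50 MB and blocks pushes at 100 MB)
- Avoid LFS for large files that change frequently: each version is stored in full and can quickly use up LFS storage
- Consider external storage (for example, cloud object storage) for very large datasets
- Document data sources in the README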
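
To apply the size guideline above, a short script can list the files that exceed a threshold. This is a minimal sketch using only the standard library; `files_over` and its default of 50 MB are illustrative choices, not part of Git LFS.

```python
from pathlib import Path


def files_over(root: str, threshold_mb: float = 50.0) -> list[tuple[Path, float]]:
    """Return (path, size in MB) for files under `root` above the threshold."""
    limit = threshold_mb * 1024 * 1024
    hits = []
    for path in Path(root).rglob("*"):
        # Skip Git's own object store and anything that is not a regular file
        if ".git" in path.parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit:
            hits.append((path, size / (1024 * 1024)))
    # Largest files first
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Running `files_over(".")` in a project root gives a list of candidates for `git lfs track` or for external storage.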

---

## 3. Data Versioning Strategies

### Directory Structure

```
project/
├── data/
│   ├── raw/              # Original data (never modify)
│   ├── processed/        # Cleaned data
│   ├── interim/          # Intermediate transformations
│   └── external/         # Third-party data
├── notebooks/
│   ├── 01_exploration.ipynb
│   ├── 02_preprocessing.ipynb
│   └── 03_modeling.ipynb
├── src/
│   ├── data/             # Data processing scripts
│   └── models/           # Model code
└── .gitignore
```
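
The layout above can be created in one step. The sketch below builds the folders and places an empty `.gitkeep` file in each data folder, a common convention (not a Git feature) for keeping otherwise empty directories in the repository. The function name `create_skeleton` is hypothetical.

```python
from pathlib import Path

DATA_DIRS = ["raw", "processed", "interim", "external"]
OTHER_DIRS = ["notebooks", "src/data", "src/models"]


def create_skeleton(project_root: str) -> None:
    """Create the project layout, with .gitkeep placeholders in data folders."""
    root = Path(project_root)
    for name in DATA_DIRS:
        folder = root / "data" / name
        folder.mkdir(parents=True, exist_ok=True)
        (folder / ".gitkeep").touch()  # placeholder so Git keeps the folder
    for name in OTHER_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    (root / ".gitignore").touch()
```

Because every call uses `exist_ok=True` or `touch()`, running `create_skeleton("project")` twice is harmless.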

### .gitignore for Data Science

```
# Data: ignore the contents but keep the folders via .gitkeep
data/raw/*
data/processed/*
!data/raw/.gitkeep
!data/processed/.gitkeep

# Models
models/*.h5
models/*.pkl

# Jupyter
.ipynb_checkpoints/
*_tested.ipynb

# Python
__pycache__/
*.py[cod]
```

### Data Version Control (DVC)

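DVC is a tool built specifically for data versioning. It stores the data outside Git and commits a small `.dvc` metadata file in its place: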

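```bash
# Install (the [s3] extra adds support for the S3 remote used below)
pip install "dvc[s3]"

# Initialize (inside an existing Git repository)
dvc init

# Track data: DVC writes dataset.csv.dvc and updates data/raw/.gitignore
# If .gitignore contains data/raw/*, first add the exceptions
# !data/raw/*.dvc and !data/raw/.gitignore so Git can see these files
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "Add dataset"

# Configure remote storage, commit the config, and push the data
dvc remote add -d storage s3://mybucket/path
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push
```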
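
DVC identifies each tracked file by a hash of its contents and records that hash in the `.dvc` file, so a change to the data shows up in Git as a change to the `.dvc` file. The following is a minimal sketch of that idea, assuming an MD5 content hash, which is what DVC records by default. `file_md5` is a hypothetical helper and not part of DVC's API.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large datasets never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `file_md5` on two copies of a dataset is a quick way to check that they are byte-for-byte identical.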

---

## 4. Experiment Tracking

### Git Branching for Experiments

```bash
# Create experiment branch
git switch -c experiment/xgboost-hyperparameters

# Make changes, train models, evaluate
# ...

# If successful, merge back
git switch main
git merge experiment/xgboost-hyperparameters

# If unsuccessful, delete the branch (-D forces deletion of unmerged work)
git branch -D experiment/xgboost-hyperparameters
```

### Tags for Model Versions

```bash
# Tag a model version
git tag -a v1.0.0 -m "Production model v1.0.0 - Accuracy: 94.2%"
git push origin v1.0.0

# List tags
git tag -l

# Check out a specific version (leaves you in a detached HEAD state)
git switch --detach v1.0.0
```

### Experiment Logging

Keep a running log in a file such as `.experiment_log.md`, committed alongside the code:

```markdown
# Experiment Log

## 2024-01-15: XGBoost Hyperparameter Tuning

- **Branch**: experiment/xgboost-hyperparameters
- **Data**: data/processed/train_2024-01.csv
- **Model**: XGBoost v1.7.0
- **Parameters**:
  - max_depth: 6
  - learning_rate: 0.1
  - n_estimators: 100
- **Results**:
  - Accuracy: 94.2%
  - F1 Score: 0.91
- **Status**: SUCCESS - Merged to main
- **Commit**: abc123
```
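
Writing entries by hand invites inconsistencies, so the format above can be generated instead. The sketch below reproduces the sample entry from a few arguments. The function `format_entry` and its parameter names are hypothetical, and the values are the ones from the example log.

```python
from datetime import date


def format_entry(title, branch, data, model, params, results, status, commit, day=None):
    """Render one experiment as a Markdown section in the log format above."""
    day = day or date.today()
    lines = [
        f"## {day.isoformat()}: {title}",
        "",
        f"- **Branch**: {branch}",
        f"- **Data**: {data}",
        f"- **Model**: {model}",
        "- **Parameters**:",
        *[f"  - {key}: {value}" for key, value in params.items()],
        "- **Results**:",
        *[f"  - {key}: {value}" for key, value in results.items()],
        f"- **Status**: {status}",
        f"- **Commit**: {commit}",
    ]
    return "\n".join(lines) + "\n"


entry = format_entry(
    title="XGBoost Hyperparameter Tuning",
    branch="experiment/xgboost-hyperparameters",
    data="data/processed/train_2024-01.csv",
    model="XGBoost v1.7.0",
    params={"max_depth": 6, "learning_rate": 0.1, "n_estimators": 100},
    results={"Accuracy": "94.2%", "F1 Score": 0.91},
    status="SUCCESS - Merged to main",
    commit="abc123",
    day=date(2024, 1, 15),
)
print(entry)
```

Appending the returned string to the log file after each run keeps every entry in the same shape.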

---

```python
# Example: Setting up nbstripout
import os

# Create a practice directory for this module's demonstrations
practice_dir = "../outputs/data_science_git"
os.makedirs(practice_dir, exist_ok=True)

print(f"Created: {practice_dir}")
print("\nSet up nbstripout:")
print("1. pip install nbstripout")
print("2. nbstripout --install")
print("3. Outputs will be stripped automatically!")
```

## Exercises

### Exercise 1

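Practice the concepts from this module: install nbstripout in a practice repository, commit a notebook that contains outputs, and confirm that the committed version has none.

```python
# Your code for Exercise 1
```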


### Exercise 2

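Apply your knowledge to a real scenario: configure Git LFS to track a large CSV file, then inspect the entry it adds to `.gitattributes`.

```python
# Your code for Exercise 2
```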


### Exercise 3

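Challenge exercise combining multiple concepts: run an experiment on its own branch, record it in an experiment log, merge it into `main`, and tag the result.

```python
# Your code for Exercise 3
```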


## Knowledge Check

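Make sure you can explain why notebook outputs complicate diffs, what Git LFS and DVC each store in the repository, and how branches and tags support experiment tracking.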

### Checklist
- [ ] Understand core concepts
- [ ] Completed all exercises
- [ ] Can apply skills independently

---

## Summary

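In this module, you learned how to keep notebook commits clean with nbstripout, handle large files with Git LFS, version data with a structured layout and DVC, and track experiments with branches, tags, and an experiment log.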

---

## Next Steps

Continue to the next module!

**Excellent work!**