# Lesson 2: Git & DVC Fundamentals

**Module 2: Reproducibility & Versioning**  
**Estimated Time**: 2-3 hours  
**Difficulty**: Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand why Git is not enough for ML  
✅ Learn DVC (Data Version Control) core concepts  
✅ Version large datasets alongside code  
✅ Build DVC data pipelines  
✅ Answer interview questions on data versioning  

---

## 📚 Table of Contents

1. [The Problem with Git for Data](#1-the-problem)
2. [Introduction to DVC](#2-intro-to-dvc)
3. [DVC Hands-On: Versioning Data](#3-dvc-basics)
4. [DVC Hands-On: Data Pipelines](#4-dvc-pipelines)
5. [Interview Preparation](#5-interview-questions)

---

## 1. The Problem with Git for Data

### Why not just `git add data.csv`?

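1. **Size Limits**: GitHub rejects files larger than 100 MB, and Git itself slows down as large binaries accumulate in the repository.
2. **Cloning Speed**: `git clone` downloads the complete history. With 50 versions of a 1 GB dataset committed, a clone can approach 50 GB.
3. **Binary Bloat**: Git's diff and merge tooling is line-oriented, which is meaningless for binaries (images, models). Compressed or binary files also delta-compress poorly, so each new version adds close to a full copy to the history.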

### The MLOps Solution

- **Code** goes to **Git**.
- **Data** goes to **Cloud Storage** (S3, GCS, Azure).
- The **link** between them is managed by **DVC**.
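
To make the clone-size argument concrete, here is a back-of-the-envelope calculation using the lesson's own numbers (50 versions, 1 GB each, pointer files of roughly 1 KB). It assumes the worst case for plain Git, where every version is kept in full because nothing delta-compresses; real repositories may do somewhat better.

```python
# Rough clone sizes, using the numbers from this lesson.
GB = 1024**3
KB = 1024

versions = 50            # dataset versions committed over time
dataset_size = 1 * GB    # size of each version
pointer_size = 1 * KB    # approximate size of one pointer file

# Assumed worst case for plain Git: every version is stored in full.
git_clone_bytes = versions * dataset_size

# Git + DVC: history holds only pointers; you pull the one version you need.
dvc_clone_bytes = versions * pointer_size + dataset_size

print(f"Plain Git clone: {git_clone_bytes / GB:.1f} GB")
print(f"Git + DVC      : {dvc_clone_bytes / GB:.4f} GB")
```

The exact figures matter less than the shape: the Git history grows with the number of pointers, not with the number of dataset copies.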

## 2. Introduction to DVC (Data Version Control)

### What is DVC?

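DVC is an open-source version control tool for the data, models, and pipelines in data science projects. It works on top of Git: Git versions the code and small metadata files, while DVC manages the large files those metadata files point to.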

### How It Works

1. You have a large file: `data.csv` (1GB).
2. You run `dvc add data.csv`.
3. DVC creates `data.csv.dvc` (small text file, ~1KB).
4. DVC moves the contents of `data.csv` into `.dvc/cache`, links or copies the file back into your workspace, and adds `data.csv` to `.gitignore` so Git ignores it.
5. You `git add data.csv.dvc .gitignore`.
6. You `git commit` the **pointer** file, not the **data** file.
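
The steps above can be sketched in a few lines of Python. This is a toy illustration of the pointer-plus-cache idea, not DVC's actual implementation: the `toy_add` function, the `toy_cache` directory, and the `.ptr` extension are made up for the example, and real `.dvc` files use DVC's own YAML layout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def toy_add(data_path: Path, cache_dir: Path) -> Path:
    """Copy a file into a content-addressed cache and write a pointer file."""
    # Reads the whole file into memory, which is fine for a toy example.
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()

    # Store the content under a path derived from its hash.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_path, cached)

    # The pointer is tiny: it records only the hash, size, and file name.
    pointer = data_path.parent / (data_path.name + ".ptr")
    size = data_path.stat().st_size
    pointer.write_text(f"md5: {digest}\nsize: {size}\npath: {data_path.name}\n")
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    data = workspace / "data.csv"
    data.write_text("A,B\n1,2\n3,4\n")

    pointer = toy_add(data, workspace / "toy_cache")
    pointer_text = pointer.read_text()
    pointer_size = pointer.stat().st_size
    print(pointer_text)
```

Because the cache path comes from the content hash, adding the same data twice stores it only once, and the pointer stays a few dozen bytes no matter how large the data file is.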

### Key Concepts

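- **.dvc Files**: Small, human-readable pointer files (YAML) that record the MD5 hash, size, and path of the tracked data.
- **DVC Cache**: Local storage (`.dvc/cache`) that holds the actual data, one entry per distinct version.
- **Remote Storage**: Shared storage (for example, an S3 bucket) that the team uploads data to with `dvc push` and downloads from with `dvc pull`.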
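
The relationship between the local cache and remote storage can be sketched as a one-way sync of content-addressed objects. The snippet below is a simplified model under stated assumptions: `sync_objects` is a hypothetical helper, a local directory stands in for the S3 bucket, and the object names are made-up hashes. Real DVC talks to cloud APIs and does considerably more bookkeeping.

```python
import shutil
import tempfile
from pathlib import Path


def sync_objects(src: Path, dst: Path) -> list[str]:
    """Copy every object present in src but missing in dst; return their names."""
    copied = []
    for obj in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = obj.relative_to(src)
        target = dst / rel
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, target)
            copied.append(rel.as_posix())
    return copied


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "local_cache"
    remote = Path(tmp) / "remote_storage"  # stands in for a shared bucket
    remote.mkdir()

    # Two cached data versions, named by made-up hashes.
    for name, content in [("aa/111", "v1"), ("bb/222", "v2")]:
        path = cache / name
        path.parent.mkdir(parents=True)
        path.write_text(content)

    first_push = sync_objects(cache, remote)   # uploads both objects
    second_push = sync_objects(cache, remote)  # nothing new to upload
    print(first_push, second_push)
```

Calling the same function with the arguments swapped models a pull: a teammate with an empty cache receives exactly the objects they are missing.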

## 3. DVC Hands-On: Versioning Data

Let's simulate a DVC workflow.

```python
# Note: In a real environment, you run these in the terminal.
# This notebook simulates the commands and explains output.
import os

import numpy as np
import pandas as pd

print("Step 1: Initialize DVC Project")
print("$ dvc init")
print("Initialized DVC project. It created .dvc/ directory.")

print("\nStep 2: Create Dummy Data")
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df.to_csv('data.csv', index=False)
print("Created data.csv (Simulating large file)")

print("\nStep 3: Track Data with DVC")
print("$ dvc add data.csv")
print("Output: To track the changes with git, run:\n git add data.csv.dvc .gitignore")

print("\nStep 4: Check what happened")
# The pointer file only exists if `dvc add` was actually run in a terminal.
if os.path.exists('data.csv.dvc'):
    print("data.csv.dvc exists! Let's see inside:")
    with open('data.csv.dvc', 'r') as f:
        print(f.read())
else:
    print("Note: You need to install DVC locally to see this file real-time")

print("\nStep 5: Git Commit")
print("$ git add data.csv.dvc .gitignore")
print("$ git commit -m 'Add training data v1'")
```

```text
Step 1: Initialize DVC Project
$ dvc init
Initialized DVC project. It created .dvc/ directory.

Step 2: Create Dummy Data
Created data.csv (Simulating large file)

Step 3: Track Data with DVC
$ dvc add data.csv
Output: To track the changes with git, run:
 git add data.csv.dvc .gitignore

Step 4: Check what happened
Note: You need to install DVC locally to see this file real-time

Step 5: Git Commit
$ git add data.csv.dvc .gitignore
$ git commit -m 'Add training data v1'
```


### Modifying and Versioning New Data

Simulate changing data:

```bash
# 1. A helper script updates the data
$ python update_data.py

# 2. DVC detects the change (output abbreviated)
$ dvc status
data.csv.dvc: changed

# 3. Track the new version
$ dvc add data.csv

# 4. Commit the updated pointer file
$ git add data.csv.dvc
$ git commit -m "Update training data v2"
```

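**Result**: You can switch between data versions using **Git**! Check out the commit you want, then let DVC sync the data to match its pointer file.

```bash
$ git checkout HEAD^1   # previous commit (detached HEAD); restores the v1 pointer file
$ dvc checkout          # replaces data.csv with the version that pointer references
# Now you have data v1!
```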
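
Why does this work? Git only swaps the pointer; the data is restored from the cache by hash. The following toy model sketches that idea. `save_version` and `restore_version` are hypothetical helpers, and the returned hash plays the role of the value a `.dvc` file would record in each commit.

```python
import hashlib
import tempfile
from pathlib import Path


def save_version(data: Path, cache: Path) -> str:
    """Store the current file content in the cache and return its hash."""
    content = data.read_bytes()
    digest = hashlib.md5(content).hexdigest()
    (cache / digest).write_bytes(content)
    return digest


def restore_version(digest: str, data: Path, cache: Path) -> None:
    """Overwrite the workspace file with the cached content for this hash."""
    data.write_bytes((cache / digest).read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_cache"
    cache.mkdir()
    data = Path(tmp) / "data.csv"

    data.write_text("A,B\n1,2\n")
    v1 = save_version(data, cache)   # hash recorded by the "v1" commit

    data.write_text("A,B\n1,2\n3,4\n")
    v2 = save_version(data, cache)   # hash recorded by the "v2" commit

    # Going back to the v1 commit yields the v1 hash; restoring by that
    # hash brings back the v1 data without re-downloading or recomputing it.
    restore_version(v1, data, cache)
    restored = data.read_text()
    print(restored)
```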

## 4. DVC Hands-On: Data Pipelines

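A DVC pipeline, defined in `dvc.yaml`, describes each step as a stage with a command (`cmd`), its dependencies (`deps`), and its outputs (`outs`). An output of one stage that appears as a dependency of another is what links the stages together.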

### Example Pipeline

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw.csv
    deps:
    - data/raw.csv
    - src/prepare.py
    outs:
    - data/prepared.csv

  train:
    cmd: python src/train.py data/prepared.csv
    deps:
    - data/prepared.csv
    - src/train.py
    outs:
    - models/model.pkl
    metrics:
    - metrics.json
```

### Benefits
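- **Reproducibility**: `dvc repro` runs the pipeline but re-runs only the stages whose dependencies changed, plus the stages downstream of them. DVC records the hashes of each stage's dependencies and outputs in `dvc.lock` to detect those changes.
- **Dependency Graph**: DVC knows `train` depends on `prepare` because `data/prepared.csv` is an output of one and a dependency of the other.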
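
The skip-unchanged-stages behaviour can be modelled in a few lines. This is a simplified sketch, not how DVC is implemented: `toy_repro` and `fingerprint` are hypothetical names, file contents live in a dictionary instead of on disk, stages are assumed to be listed in dependency order, and running a stage is faked by writing a string that depends on its inputs.

```python
import hashlib


def fingerprint(contents: dict[str, str], deps: list[str]) -> str:
    """Hash the names and contents of a stage's dependencies."""
    h = hashlib.md5()
    for name in deps:
        h.update(name.encode())
        h.update(contents[name].encode())
    return h.hexdigest()


def toy_repro(stages: dict, contents: dict[str, str], lock: dict[str, str]) -> list[str]:
    """Run stages in order, skipping any whose dependencies are unchanged."""
    ran = []
    for name, stage in stages.items():
        current = fingerprint(contents, stage["deps"])
        if lock.get(name) == current:
            continue  # same inputs as the last run: skip this stage
        for out in stage["outs"]:
            contents[out] = f"{out} built from {current}"  # fake "running" the cmd
        lock[name] = current
        ran.append(name)
    return ran


stages = {
    "prepare": {"deps": ["data/raw.csv", "src/prepare.py"], "outs": ["data/prepared.csv"]},
    "train": {"deps": ["data/prepared.csv", "src/train.py"], "outs": ["models/model.pkl"]},
}
contents = {"data/raw.csv": "raw v1", "src/prepare.py": "code v1", "src/train.py": "code v1"}
lock = {}

first_run = toy_repro(stages, contents, lock)    # everything runs
second_run = toy_repro(stages, contents, lock)   # nothing changed

contents["src/train.py"] = "code v2"
after_code_edit = toy_repro(stages, contents, lock)   # only train re-runs

contents["data/raw.csv"] = "raw v2"
after_data_edit = toy_repro(stages, contents, lock)   # prepare, then train

print(first_run, second_run, after_code_edit, after_data_edit)
```

Editing `src/train.py` leaves `prepare` untouched, while changing `data/raw.csv` invalidates `prepare`, whose new output in turn invalidates `train`.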

## 5. Interview Preparation

### Common Questions

#### Q1: "How do you manage large datasets in your ML projects?"
**Answer Framework**: 
- "I use **DVC** paired with Cloud Storage (S3)."
- "I track data versions using git commits of `.dvc` files."
- "This ensures that every code commit is linked to the exact data version used."

#### Q2: "Why not just store data in S3 directly?"
**Answer**: 
- S3 alone doesn't give you version **history linked to code**.
- DVC lets you `git checkout` an old experiment, run `dvc checkout` (or `dvc pull` if the data is not in your local cache), and get the exact data processing script AND the data input it was run with.

#### Q3: "What is a DVC pipeline?"
**Answer**:
- It's a way to define stages (prepare, train, evaluate) in a `dvc.yaml` file.
- It tracks dependencies (input files) and outputs (models, metrics).
- `dvc repro` ensures reproducibility by running only necessary steps based on changed dependencies.