# Project Step 1: Data Setup & Versioning (DVC)

**Objective**: Download the COCO-128 dataset and set up **DVC** (Data Version Control) to track it.

**Why?** In Computer Vision, datasets are often huge (GBs to TBs), which makes them impractical to store in Git; GitHub, for example, rejects files larger than 100 MB. DVC lets us keep a lightweight reference (a `.dvc` file) in Git while storing the actual images in a local cache or in remote storage such as S3.

---

## 📚 Steps
1. **Download Data**: Fetch COCO-128 (a small 128-image subset of COCO).
2. **Initialize DVC**: Set up the DVC repo.
3. **Track Data**: Add the images to DVC.
4. **Explore Data**: Visualize some images and labels.

## 1. Download Data

We use the `download` helper from `ultralytics` to fetch the standard COCO-128 dataset, which contains images and YOLO-format labels.
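
Once the download cell below has run, the folder layout can be summarized with a few lines of standard-library Python. This is a minimal sketch, not part of `ultralytics`: the helper name `summarize_yolo_split` is hypothetical, and it assumes the `images/train2017` and `labels/train2017` layout used later in this notebook.

```python
from pathlib import Path


def summarize_yolo_split(root, split="train2017"):
    """Count image and label files for one split of a YOLO-style dataset.

    Hypothetical helper for illustration; assumes <root>/images/<split>/*.jpg
    and <root>/labels/<split>/*.txt.
    """
    root = Path(root)
    images = sorted((root / "images" / split).glob("*.jpg"))
    labels = sorted((root / "labels" / split).glob("*.txt"))

    # An image with no matching .txt file has no label file at all.
    label_stems = {p.stem for p in labels}
    unlabeled = [p.name for p in images if p.stem not in label_stems]

    return {"images": len(images), "labels": len(labels), "unlabeled": unlabeled}
```

Calling `summarize_yolo_split("../data/raw/coco128")` should report 128 images for this dataset; any file listed under `unlabeled` simply has no matching `.txt` file.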

In [None]:
import os
from ultralytics.utils.downloads import download

# Define paths
DATA_DIR = "../data/raw"
COCO_URL = "https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip"

os.makedirs(DATA_DIR, exist_ok=True)

# Download and unzip only if the dataset folder does not exist yet
extract_path = os.path.join(DATA_DIR, "coco128")

if not os.path.exists(extract_path):
    print("Downloading COCO-128...")
    download(COCO_URL, dir=DATA_DIR, unzip=True, delete=True)
    print("Download Complete!")
else:
    print("Data already exists.")

print(f"Data location: {os.path.abspath(extract_path)}")

## 2. Initialize DVC

Run these commands in your terminal (or here via `!`).

**Concept**: `dvc init` creates a `.dvc/` directory that stores DVC's configuration and, by default, its local cache.

> **Note**: If Git is not yet initialized for the project, run `git init` in the project root first.
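
Where `dvc init` is run matters, because DVC and Git both locate a project by searching the current directory and its parents for `.dvc/` or `.git`. The sketch below approximates that lookup so you can check what a command started from the notebook directory would find. `find_root` is a hypothetical helper, not a DVC or Git API, and the real tools apply additional rules.

```python
from pathlib import Path


def find_root(start, marker):
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper for illustration. Returns None if no ancestor
    contains the marker.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        # `.git` can be a file (worktrees, submodules), so test for existence only.
        if (candidate / marker).exists():
            return candidate
    return None


if __name__ == "__main__":
    print("Git root:", find_root(".", ".git"))
    print("DVC root:", find_root(".", ".dvc"))
```

If the DVC root prints as `None`, DVC commands started from this directory will not find a repository.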

In [None]:
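# Initialize DVC at the project root (if not already done)
import subprocess

def run_command(cmd):
    """Run a shell command, print its output, and return True on success."""
    try:
        result = subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True)
        print(result.stdout)
        return True
    except subprocess.CalledProcessError as e:
        print(f"Error: {e.stderr}")
        return False

print("Initializing DVC...")
# Run from the project root (one level above notebooks/) so that `dvc add` in step 3 finds the repo.
# Plain `dvc init` expects the project root to be the Git repo root. If the project sits inside a
# larger Git repo, use `dvc init --subdir`; `dvc init --no-scm` skips Git integration entirely,
# in which case DVC will not update .gitignore for tracked data.
if run_command("cd .. && dvc init"):
    print("DVC initialized.")
run_command("cd .. && dvc config core.analytics false")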

## 3. Track Data

We want to track the `data/raw/coco128` folder.

**Command**: `dvc add data/raw/coco128`

This does two things:
1. Adds `/coco128` to `data/raw/.gitignore` (so Git ignores the large folder).
2. Creates `data/raw/coco128.dvc` (a small YAML file containing an MD5 hash that identifies the folder's contents).

You then commit the `.dvc` file and the updated `.gitignore` to Git.
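
To make the hash idea concrete, here is a minimal sketch of content addressing for a single file, using only `hashlib`. It is not DVC's implementation: for a directory, DVC records a hash derived from a manifest of the per-file hashes, and `cache_relpath` is a hypothetical helper whose two-character prefix layout is an assumption for illustration (the exact cache layout differs between DVC versions).

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large images never need to fit in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_relpath(md5_hex):
    """Hypothetical helper: map a digest to a content-addressed relative path.

    Assumption for illustration: the first two hex characters form a directory.
    """
    return f"{md5_hex[:2]}/{md5_hex[2:]}"
```

Because the path depends only on the file's bytes, two identical files map to the same cache entry, and any change to a file yields a different hash in the `.dvc` file.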

In [None]:
print("Tracking data with DVC...")
# We assume running from 'notebooks/' so we go up one level
dvc_cmd = "cd .. && dvc add data/raw/coco128"
run_command(dvc_cmd)

print("\nChecking created files:")
if os.path.exists("../data/raw/coco128.dvc"):
    print("✅ ../data/raw/coco128.dvc created!")
    with open("../data/raw/coco128.dvc", "r") as f:
        print("--- Content of .dvc file ---")
        print(f.read())
else:
    print("❌ .dvc file not found. Check the DVC output above for errors.")

## 4. Explore Data

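Let's sanity-check our images and labels.
YOLO labels are text files with one object per line: `class_id x_center y_center width height`, where the four box values are normalized to the 0-1 range relative to the image width and height.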
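
The conversion from a normalized label line to pixel corners can be worked through on its own before drawing anything. The sketch below uses two hypothetical helpers, `parse_yolo_line` and `yolo_to_pixel_box`, and assumes the five-value detection format described above.

```python
def parse_yolo_line(line):
    """Split one label line into (class_id, x_center, y_center, width, height)."""
    parts = line.split()
    class_id = int(parts[0])
    x_c, y_c, w, h = map(float, parts[1:5])
    return class_id, x_c, y_c, w, h


def yolo_to_pixel_box(x_c, y_c, w, h, img_w, img_h):
    """Convert a normalized center/size box to pixel corners (x1, y1, x2, y2)."""
    x1 = round((x_c - w / 2) * img_w)
    y1 = round((y_c - h / 2) * img_h)
    x2 = round((x_c + w / 2) * img_w)
    y2 = round((y_c + h / 2) * img_h)
    return x1, y1, x2, y2


# Worked example: a box centered in a 640x480 image,
# a quarter of the image wide and half of it tall.
class_id, x_c, y_c, w, h = parse_yolo_line("45 0.5 0.5 0.25 0.5")
print(class_id, yolo_to_pixel_box(x_c, y_c, w, h, img_w=640, img_h=480))
# prints: 45 (240, 120, 400, 360)
```

Note that x values scale with the image width and y values with the image height, so the same normalized label remains valid if the image is resized.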

In [None]:
import matplotlib.pyplot as plt
import cv2
import glob

img_path = "../data/raw/coco128/images/train2017/"
label_path = "../data/raw/coco128/labels/train2017/"

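# Sort so the sample image is the same on every run (glob order is not guaranteed)
images = sorted(glob.glob(img_path + "*.jpg"))
print(f"Found {len(images)} images.")

# Pick one image and locate its label file (same base name, .txt extension)
sample_img_path = images[0]
sample_stem = os.path.splitext(os.path.basename(sample_img_path))[0]
sample_label_path = os.path.join(label_path, sample_stem + ".txt")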

# Read Image
img = cv2.imread(sample_img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
h, w, _ = img.shape

# Read Labels and Draw Boxes
with open(sample_label_path, "r") as f:
    labels = f.readlines()
    
for label in labels:
    cls, x, y, bw, bh = map(float, label.split())
    # Convert normalized to pixel coords
    x1 = int((x - bw/2) * w)
    y1 = int((y - bh/2) * h)
    x2 = int((x + bw/2) * w)
    y2 = int((y + bh/2) * h)
    
    cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 0), 2)

plt.figure(figsize=(10, 10))
plt.imshow(img)
plt.title(f"Sample Image: {os.path.basename(sample_img_path)}")
plt.axis("off")
plt.show()

## Next Step

You have successfully versioned your raw data. In the next notebook, we will **Configure Training with Hydra** and **Track Experiments with W&B**.

Proceed to `02_training_tracking.ipynb`.