# Custom Dataset Preparation for YOLO

**Week 14 - Module 5: Object Detection Models**

**Estimated Time:** 15 minutes

## Learning Objectives
- Understand YOLO dataset format
- Prepare custom dataset for training
- Create data.yaml configuration
- Split data into train/val/test

---

## 1. YOLO Dataset Format

YOLO uses a specific format for training data. Understanding this format is crucial for preparing your own datasets.

### Image Files
- Supported formats: `.jpg`, `.jpeg`, `.png`, `.bmp`
- Any resolution (YOLO will resize)
- Example: `image001.jpg`, `cat_001.png`

### Label Files
- Format: `.txt` (plain text)
- **Same filename** as corresponding image
- Example: `image001.txt` for `image001.jpg`

### Label File Format

Each line in the `.txt` file represents one bounding box:

```
class_id x_center y_center width height
```

Where:
- `class_id`: Integer starting from 0 (e.g., 0=person, 1=car, 2=dog)
- `x_center`: X-coordinate of box center (normalized 0-1)
- `y_center`: Y-coordinate of box center (normalized 0-1)
- `width`: Box width (normalized 0-1)
- `height`: Box height (normalized 0-1)

### Example Label File

```txt
0 0.5 0.5 0.3 0.4
1 0.2 0.7 0.15 0.2
0 0.8 0.3 0.25 0.35
```

This means:
- **Line 1**: Class 0, center at (50%, 50%), size 30%×40% of image
- **Line 2**: Class 1, center at (20%, 70%), size 15%×20% of image
- **Line 3**: Class 0, center at (80%, 30%), size 25%×35% of image
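
To make the reading above concrete, here is a minimal sketch that parses one label line and converts it back to pixel corner coordinates. The helper name `yolo_to_pixel_box` and the 640×480 image size are assumptions chosen for illustration; they are not part of any YOLO library.

```python
def yolo_to_pixel_box(line, image_width, image_height):
    """Parse one YOLO label line; return (class_id, x_min, y_min, x_max, y_max) in pixels."""
    class_id, x_c, y_c, w, h = line.split()
    # Scale the normalized center and size back to pixel units
    x_c, w = float(x_c) * image_width, float(w) * image_width
    y_c, h = float(y_c) * image_height, float(h) * image_height
    # Corners are half a box width/height away from the center
    return int(class_id), x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2

# First line of the example label file, on an assumed 640x480 image
print(yolo_to_pixel_box("0 0.5 0.5 0.3 0.4", 640, 480))
```

For this line the box spans roughly x = 224 to 416 and y = 144 to 336 pixels.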

### Normalization Formula

If you have pixel coordinates, convert to normalized:

```python
x_center = (x_min + x_max) / (2 * image_width)
y_center = (y_min + y_max) / (2 * image_height)
width = (x_max - x_min) / image_width
height = (y_max - y_min) / image_height
```
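
As a worked instance of the formula, the sketch below wraps it in a function and applies it to one box. The function name `pixel_to_yolo` and the sample box (corners (100, 120) and (300, 360) on a 640×480 image) are illustrative assumptions.

```python
def pixel_to_yolo(x_min, y_min, x_max, y_max, image_width, image_height):
    """Convert pixel corner coordinates to normalized YOLO (x_center, y_center, width, height)."""
    x_center = (x_min + x_max) / (2 * image_width)
    y_center = (y_min + y_max) / (2 * image_height)
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return x_center, y_center, width, height

box = pixel_to_yolo(100, 120, 300, 360, 640, 480)
print(" ".join(f"{v:.6f}" for v in box))  # 0.312500 0.500000 0.312500 0.500000
```

Prefixing the class ID to that string gives a complete label line.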

## 2. Dataset Directory Structure

YOLO expects a specific directory structure:

```
my_dataset/
├── images/
│   ├── train/
│   │   ├── image001.jpg
│   │   ├── image002.jpg
│   │   └── ...
│   ├── val/
│   │   ├── image101.jpg
│   │   ├── image102.jpg
│   │   └── ...
│   └── test/
│       ├── image201.jpg
│       ├── image202.jpg
│       └── ...
└── labels/
    ├── train/
    │   ├── image001.txt
    │   ├── image002.txt
    │   └── ...
    ├── val/
    │   ├── image101.txt
    │   ├── image102.txt
    │   └── ...
    └── test/
        ├── image201.txt
        ├── image202.txt
        └── ...
```

### Key Points
- **images/** and **labels/** must have same subdirectory structure
- **train/**, **val/**, **test/** splits are standard
- Each image should have a corresponding label file (an empty file marks an image with no objects)
- Filenames must match exactly (except extension)
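
The mirrored layout means a label path can be derived from an image path by swapping one directory name and the extension. The helper below, `image_to_label_path`, is a hypothetical illustration of that convention, not a function from any YOLO package.

```python
from pathlib import Path

def image_to_label_path(image_path):
    """Swap the last 'images' path component for 'labels' and the extension for .txt."""
    parts = list(Path(image_path).parts)
    # Index of the last 'images' component (raises ValueError if there is none)
    idx = len(parts) - 1 - parts[::-1].index("images")
    parts[idx] = "labels"
    return Path(*parts).with_suffix(".txt")

print(image_to_label_path("my_dataset/images/train/image001.jpg"))
```

On a POSIX system this prints `my_dataset/labels/train/image001.txt`, which is why the two subtrees and the filenames have to line up.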

## 3. Create Sample Dataset

Let's create a small synthetic dataset to demonstrate the format.

In [None]:
# Install required libraries (scikit-learn is used by split_dataset, pyyaml by the data.yaml cell)
!pip install -q ultralytics opencv-python matplotlib pillow scikit-learn pyyaml

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from pathlib import Path
import shutil

# Create dataset directory structure
dataset_root = Path('sample_dataset')
splits = ['train', 'val', 'test']

# Remove existing dataset if present
if dataset_root.exists():
    shutil.rmtree(dataset_root)

# Create directories
for split in splits:
    (dataset_root / 'images' / split).mkdir(parents=True, exist_ok=True)
    (dataset_root / 'labels' / split).mkdir(parents=True, exist_ok=True)

print("✅ Dataset directory structure created!")
print("\nDirectory structure:")
for root, dirs, files in os.walk(dataset_root):
    level = root.replace(str(dataset_root), '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = ' ' * 2 * (level + 1)
    for file in files:
        print(f"{subindent}{file}")

In [None]:
# Function to create synthetic images with boxes
def create_synthetic_image(width=640, height=480, num_objects=3):
    """
    Create a synthetic image with random colored rectangles
    Returns: image, list of [class_id, x_center, y_center, width, height]
    """
    # Create white background
    image = np.ones((height, width, 3), dtype=np.uint8) * 255
    
    annotations = []
    colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255)]
    
    for i in range(num_objects):
        # Random class (0-2: circle, rectangle, triangle)
        class_id = np.random.randint(0, 3)
        
        # Random position and size
        obj_width = np.random.randint(50, 150)
        obj_height = np.random.randint(50, 150)
        x1 = np.random.randint(0, width - obj_width)
        y1 = np.random.randint(0, height - obj_height)
        x2 = x1 + obj_width
        y2 = y1 + obj_height
        
        # Draw shape
        color = colors[class_id]
        if class_id == 0:  # Circle
            center = ((x1 + x2) // 2, (y1 + y2) // 2)
            radius = min(obj_width, obj_height) // 2
            cv2.circle(image, center, radius, color, -1)
        elif class_id == 1:  # Rectangle
            cv2.rectangle(image, (x1, y1), (x2, y2), color, -1)
        else:  # Triangle
            pts = np.array([
                [(x1 + x2) // 2, y1],
                [x1, y2],
                [x2, y2]
            ], np.int32)
            cv2.fillPoly(image, [pts], color)
        
        # Convert to YOLO format (normalized)
        x_center = ((x1 + x2) / 2) / width
        y_center = ((y1 + y2) / 2) / height
        norm_width = (x2 - x1) / width
        norm_height = (y2 - y1) / height
        
        annotations.append([class_id, x_center, y_center, norm_width, norm_height])
    
    return image, annotations

# Create sample images
num_samples = {'train': 10, 'val': 3, 'test': 2}

for split, count in num_samples.items():
    for i in range(count):
        # Create image and annotations
        image, annotations = create_synthetic_image()
        
        # Save image
        image_path = dataset_root / 'images' / split / f'{split}_{i:03d}.jpg'
        cv2.imwrite(str(image_path), image)
        
        # Save annotations
        label_path = dataset_root / 'labels' / split / f'{split}_{i:03d}.txt'
        with open(label_path, 'w') as f:
            for ann in annotations:
                f.write(f"{ann[0]} {ann[1]:.6f} {ann[2]:.6f} {ann[3]:.6f} {ann[4]:.6f}\n")

print("✅ Synthetic dataset created!")
print(f"\n📊 Dataset Statistics:")
for split, count in num_samples.items():
    print(f"  {split}: {count} images")
print(f"\n  Total: {sum(num_samples.values())} images")
print("\n  Classes: 0=Circle, 1=Rectangle, 2=Triangle")

## 4. Verify Dataset

Always verify your dataset before training to catch errors early.

In [None]:
# Function to visualize annotations
def visualize_yolo_annotations(image_path, label_path, class_names):
    """
    Visualize YOLO annotations on an image
    """
    # Read image
    image = cv2.imread(str(image_path))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    height, width = image.shape[:2]
    
    # Read annotations
    annotations = []
    if os.path.exists(label_path):
        with open(label_path, 'r') as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) == 5:
                    class_id = int(parts[0])
                    x_center = float(parts[1])
                    y_center = float(parts[2])
                    box_width = float(parts[3])
                    box_height = float(parts[4])
                    annotations.append([class_id, x_center, y_center, box_width, box_height])
    
    # Create figure
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.imshow(image)
    
    # Draw bounding boxes
    colors = ['red', 'green', 'blue', 'yellow', 'magenta']
    for ann in annotations:
        class_id, x_center, y_center, box_width, box_height = ann
        
        # Convert normalized to pixel coordinates
        x_center_px = x_center * width
        y_center_px = y_center * height
        box_width_px = box_width * width
        box_height_px = box_height * height
        
        # Calculate corner coordinates
        x1 = x_center_px - box_width_px / 2
        y1 = y_center_px - box_height_px / 2
        
        # Draw rectangle
        rect = patches.Rectangle(
            (x1, y1), box_width_px, box_height_px,
            linewidth=3, edgecolor=colors[class_id % len(colors)],
            facecolor='none'
        )
        ax.add_patch(rect)
        
        # Add label
        class_name = class_names[class_id] if class_id < len(class_names) else f'Class {class_id}'
        ax.text(
            x1, y1 - 5, class_name,
            color=colors[class_id % len(colors)],
            fontsize=12, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8)
        )
    
    ax.axis('off')
    return fig, ax, annotations

# Visualize sample images
class_names = ['Circle', 'Rectangle', 'Triangle']

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

# Show 6 training samples
for i in range(6):
    image_path = dataset_root / 'images' / 'train' / f'train_{i:03d}.jpg'
    label_path = dataset_root / 'labels' / 'train' / f'train_{i:03d}.txt'
    
    # Read and display
    image = cv2.imread(str(image_path))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    height, width = image.shape[:2]
    
    # Read annotations
    annotations = []
    with open(label_path, 'r') as f:
        for line in f:
            parts = line.strip().split()
            class_id = int(parts[0])
            x_center = float(parts[1]) * width
            y_center = float(parts[2]) * height
            box_width = float(parts[3]) * width
            box_height = float(parts[4]) * height
            
            x1 = x_center - box_width / 2
            y1 = y_center - box_height / 2
            
            # Draw box
            colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
            cv2.rectangle(image, 
                        (int(x1), int(y1)), 
                        (int(x1 + box_width), int(y1 + box_height)),
                        colors[class_id], 3)
            
            # Add label
            cv2.putText(image, class_names[class_id],
                       (int(x1), int(y1) - 10),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.6,
                       colors[class_id], 2)
    
    axes[i].imshow(image)
    axes[i].set_title(f'Train Sample {i}', fontsize=10, fontweight='bold')
    axes[i].axis('off')

plt.suptitle('Dataset Verification: Annotated Images', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("✅ Dataset verification complete!")
print("\n💡 Check that each box tightly encloses its shape and carries the right class name")

## 5. data.yaml Configuration

The `data.yaml` file tells YOLO where to find your data and what classes you have.

### Format

```yaml
path: /path/to/dataset  # Absolute path to dataset root
train: images/train     # Relative path to training images
val: images/val         # Relative path to validation images
test: images/test       # Relative path to test images (optional)

names:
  0: person
  1: car
  2: dog
```

### Alternative Format (List)

```yaml
path: /path/to/dataset
train: images/train
val: images/val

names: ['person', 'car', 'dog']
```
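
Both forms describe the same mapping from class ID to name. The sketch below shows one way to bring them to a common dictionary form once the YAML has been loaded; `normalize_names` is a hypothetical helper written for this example.

```python
def normalize_names(names):
    """Return a {class_id: name} dict from either the list form or the mapping form."""
    if isinstance(names, dict):
        # Mapping form: keys are the class IDs
        return {int(k): str(v) for k, v in names.items()}
    # List form: the position in the list is the class ID
    return {i: str(name) for i, name in enumerate(names)}

as_list = normalize_names(['person', 'car', 'dog'])
as_dict = normalize_names({0: 'person', 1: 'car', 2: 'dog'})
print(as_list == as_dict)  # True
```

In the list form the order matters: reordering the list silently changes which ID each name refers to.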

In [None]:
# Create data.yaml for our dataset
import yaml

data_yaml = {
    'path': str(dataset_root.absolute()),
    'train': 'images/train',
    'val': 'images/val',
    'test': 'images/test',
    'names': {
        0: 'Circle',
        1: 'Rectangle',
        2: 'Triangle'
    }
}

# Save data.yaml
yaml_path = dataset_root / 'data.yaml'
with open(yaml_path, 'w') as f:
    yaml.dump(data_yaml, f, default_flow_style=False, sort_keys=False)

print("‚úÖ data.yaml created!")
print("\nContents:")
print("="*60)
with open(yaml_path, 'r') as f:
    print(f.read())
print("="*60)

## 6. Annotation Tools

For real-world datasets, you'll need annotation tools. Here are the most popular:

### 1. LabelImg (Desktop)
- **Type**: Free, open-source, desktop app
- **Platform**: Windows, macOS, Linux
- **Format**: Supports YOLO, PASCAL VOC, and CreateML
- **Installation**: `pip install labelImg`
- **GitHub**: https://github.com/heartexlabs/labelImg

**Pros:**
- Simple and intuitive
- Keyboard shortcuts
- Directly saves in YOLO format

**Cons:**
- Manual one-by-one annotation
- No collaboration features

### 2. Roboflow (Web-based)
- **Type**: Web-based platform (free tier available)
- **URL**: https://roboflow.com
- **Features**: Annotation, augmentation, format conversion, hosting

**Pros:**
- Team collaboration
- Automatic augmentation
- Export to multiple formats
- Public datasets available

**Cons:**
- Free tier has limits
- Requires internet connection

### 3. CVAT (Computer Vision Annotation Tool)
- **Type**: Web-based, open-source
- **URL**: https://www.cvat.ai
- **Features**: Advanced annotation, video support, interpolation

**Pros:**
- Enterprise-grade features
- Video annotation
- Semi-automatic annotation

**Cons:**
- Steeper learning curve
- Requires setup (self-hosted or cloud)

### 4. Labelbox
- **Type**: Commercial platform (free tier)
- **URL**: https://labelbox.com
- **Features**: Full ML data pipeline

### 5. Makesense.ai
- **Type**: Free, browser-based
- **URL**: https://www.makesense.ai
- **Features**: No signup required, runs in browser

**Pros:**
- No installation
- Privacy (runs locally)
- Simple interface

### Recommendation for Beginners
- **Small datasets (<100 images)**: LabelImg or Makesense.ai
- **Medium datasets (100-1000 images)**: Roboflow
- **Large/enterprise**: CVAT or Labelbox

## 7. Data Augmentation Preview

YOLO has built-in data augmentation during training. Here are common techniques:

### Common Augmentations
1. **Geometric**: Rotation, flip, scaling, translation
2. **Color**: HSV adjustment, brightness, contrast
3. **Advanced**: Mosaic, MixUp, CutOut

### Mosaic Augmentation
Combines 4 images into one, creating diverse scenes and improving small object detection.

### MixUp
Blends two images together, creating smoother decision boundaries.
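
Geometric augmentations move the objects, so the labels have to move with them. The training pipeline takes care of this for its built-in augmentations, but if you augment images yourself you must update the label files too. As a minimal sketch, a horizontal flip maps `x_center` to `1 - x_center` and leaves everything else unchanged; `hflip_labels` is a hypothetical helper name.

```python
def hflip_labels(lines):
    """Mirror YOLO label lines horizontally: x_center -> 1 - x_center."""
    flipped = []
    for line in lines:
        class_id, x_c, y_c, w, h = line.split()
        # Only the horizontal center changes; width, height, and y_center stay the same
        flipped.append(
            f"{class_id} {1 - float(x_c):.6f} {float(y_c):.6f} {float(w):.6f} {float(h):.6f}"
        )
    return flipped

print(hflip_labels(["1 0.2 0.7 0.15 0.2"]))  # ['1 0.800000 0.700000 0.150000 0.200000']
```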

In [None]:
# Demonstrate simple augmentations
import cv2
import numpy as np

# Load a sample image
image_path = dataset_root / 'images' / 'train' / 'train_000.jpg'
image = cv2.imread(str(image_path))
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Create augmented versions
aug_images = []
aug_titles = []

# Original
aug_images.append(image)
aug_titles.append('Original')

# Horizontal flip
flipped = cv2.flip(image, 1)
aug_images.append(flipped)
aug_titles.append('Horizontal Flip')

# Rotation
h, w = image.shape[:2]
center = (w // 2, h // 2)
matrix = cv2.getRotationMatrix2D(center, 15, 1.0)
rotated = cv2.warpAffine(image, matrix, (w, h), borderValue=(255, 255, 255))
aug_images.append(rotated)
aug_titles.append('Rotation (15°)')

# Brightness adjustment
hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
hsv[:, :, 2] = np.clip(hsv[:, :, 2] * 1.3, 0, 255).astype(np.uint8)
bright = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
aug_images.append(bright)
aug_titles.append('Brightness +30%')

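# Scaling (zoom in 20%, then center-crop back to the original size;
# resizing straight back to (w, h) would just reproduce the original image)
scaled = cv2.resize(image, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_LINEAR)
sh, sw = scaled.shape[:2]
y0, x0 = (sh - h) // 2, (sw - w) // 2
scaled = scaled[y0:y0 + h, x0:x0 + w]
aug_images.append(scaled)
aug_titles.append('Scale +20%')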

# Gaussian blur
blurred = cv2.GaussianBlur(image, (5, 5), 0)
aug_images.append(blurred)
aug_titles.append('Gaussian Blur')

# Visualize
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, (img, title) in enumerate(zip(aug_images, aug_titles)):
    axes[idx].imshow(img)
    axes[idx].set_title(title, fontsize=12, fontweight='bold')
    axes[idx].axis('off')

plt.suptitle('Data Augmentation Examples', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n💡 YOLOv8 Training Augmentations:")
print("  • Mosaic (combines 4 images)")
print("  • MixUp (blends 2 images)")
print("  • HSV augmentation")
print("  • Random flip, rotation, scale")
print("  • CutOut (random erasing)")
print("\n  These are applied on the fly during training.")
print("  Which ones are enabled, and how strongly, depends on the training hyperparameters of your version.")

## 8. Class Balance Check

Imbalanced datasets can lead to poor performance. Always check class distribution.
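
One common way to quantify imbalance is an inverse-frequency weight per class: rare classes get weights above 1, frequent ones below 1. The sketch below computes such weights from in-memory label lines. The helper `inverse_frequency_weights` and the sample data are illustrative assumptions, and whether your training setup can consume such weights is a separate question this sketch does not answer.

```python
from collections import Counter

def inverse_frequency_weights(label_lines):
    """Return {class_id: weight}, where weight = total / (num_classes * class_count)."""
    counts = Counter(int(line.split()[0]) for line in label_lines if line.strip())
    total = sum(counts.values())
    return {cid: total / (len(counts) * n) for cid, n in sorted(counts.items())}

# Hypothetical labels: 6 boxes of class 0, 3 of class 1, 1 of class 2
lines = ["0 0.5 0.5 0.2 0.2"] * 6 + ["1 0.5 0.5 0.2 0.2"] * 3 + ["2 0.5 0.5 0.2 0.2"]
for cid, weight in inverse_frequency_weights(lines).items():
    print(f"class {cid}: weight {weight:.2f}")
```

Here class 2 receives the largest weight because it appears only once in ten boxes.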

In [None]:
# Count instances per class
def count_class_instances(dataset_path):
    """
    Count number of instances per class in dataset
    """
    class_counts = {}
    
    for split in ['train', 'val', 'test']:
        label_dir = dataset_path / 'labels' / split
        
        for label_file in label_dir.glob('*.txt'):
            with open(label_file, 'r') as f:
                for line in f:
                    parts = line.strip().split()
                    if len(parts) == 5:
                        class_id = int(parts[0])
                        if class_id not in class_counts:
                            class_counts[class_id] = 0
                        class_counts[class_id] += 1
    
    return class_counts

# Count instances
class_counts = count_class_instances(dataset_root)

# Visualize distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
classes = [class_names[i] for i in sorted(class_counts.keys())]
counts = [class_counts[i] for i in sorted(class_counts.keys())]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

bars = axes[0].bar(classes, counts, color=colors)
axes[0].set_xlabel('Class', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Number of Instances', fontsize=12, fontweight='bold')
axes[0].set_title('Class Distribution', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Add count labels
for bar in bars:
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}',
                ha='center', va='bottom', fontweight='bold')

# Pie chart
axes[1].pie(counts, labels=classes, autopct='%1.1f%%', colors=colors, startangle=90)
axes[1].set_title('Class Distribution (%)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Print statistics
total = sum(counts)
print("\n📊 Class Balance Analysis:")
print("="*60)
print(f"{'Class':<15} {'Count':<10} {'Percentage':<15} {'Status':<15}")
print("="*60)

for class_id in sorted(class_counts.keys()):
    count = class_counts[class_id]
    pct = (count / total) * 100
    
    # Determine balance status
    if pct < 20:
        status = "⚠️ Underrepresented"
    elif pct > 40:
        status = "⚠️ Overrepresented"
    else:
        status = "✅ Balanced"
    
    print(f"{class_names[class_id]:<15} {count:<10} {pct:<15.1f} {status:<15}")

print("="*60)
print(f"Total instances: {total}")
print("\n💡 With 3 classes, each should ideally hold roughly 20-40% of total instances")
print("   If imbalanced, consider: more data, augmentation, or class weights")

## 9. Train/Val/Test Split

Proper data splitting is crucial for model evaluation.

### Common Split Ratios

| Split | Percentage | Purpose |
|-------|-----------|----------|
| Train | 70-80% | Model learning |
| Validation | 10-20% | Hyperparameter tuning |
| Test | 5-10% | Final evaluation |

### Guidelines
- **Small datasets (<100 images)**: 70/20/10
- **Medium datasets (100-1000)**: 75/15/10
- **Large datasets (>1000)**: 80/10/10

### Important Considerations
1. **Random split**: Shuffle before splitting so that file order (e.g. capture time or source) does not bias any split
2. **Stratified split**: Maintain class proportions across splits
3. **No data leakage**: Same image shouldn't appear in multiple splits
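
The considerations above can be sketched with the standard library alone: shuffle with a fixed seed, cut the list by ratio, then confirm that no filename lands in two splits. `split_filenames` and the 20 sample filenames are assumptions for illustration; this sketch does not stratify by class.

```python
import random

def split_filenames(filenames, train_ratio=0.7, val_ratio=0.2, seed=42):
    """Shuffle reproducibly and cut into train/val/test; test gets the remainder."""
    files = sorted(filenames)            # fixed starting order
    random.Random(seed).shuffle(files)   # seeded shuffle, so the split is repeatable
    n_train = round(len(files) * train_ratio)
    n_val = round(len(files) * val_ratio)
    return {
        'train': files[:n_train],
        'val': files[n_train:n_train + n_val],
        'test': files[n_train + n_val:],
    }

names = [f"image{i:03d}.jpg" for i in range(20)]
splits = split_filenames(names)
print({k: len(v) for k, v in splits.items()})  # {'train': 14, 'val': 4, 'test': 2}

# Leakage check: the three splits must be pairwise disjoint and cover every file
all_files = splits['train'] + splits['val'] + splits['test']
assert len(all_files) == len(set(all_files)) == len(names)
```

Because the slices come from a single shuffled list, the splits cannot overlap; the assertion is there to catch regressions if the function is later changed.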

In [None]:
# Function to split dataset
def split_dataset(source_images, source_labels, output_root, train_ratio=0.7, val_ratio=0.2):
    """
    Split dataset into train/val/test
    
    Args:
        source_images: Path to source images
        source_labels: Path to source labels
        output_root: Output directory
        train_ratio: Proportion for training (default 0.7)
        val_ratio: Proportion for validation (default 0.2)
    """
    import shutil
    from sklearn.model_selection import train_test_split
    
    # Get all image files (sorted so the split is reproducible across runs)
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp'}
    image_files = sorted(f for f in Path(source_images).glob('*.*') if f.suffix.lower() in image_extensions)
    
    # First split: train vs (val + test)
    train_files, temp_files = train_test_split(
        image_files, 
        train_size=train_ratio, 
        random_state=42
    )
    
    # Second split: val vs test
    val_size = val_ratio / (1 - train_ratio)
    val_files, test_files = train_test_split(
        temp_files,
        train_size=val_size,
        random_state=42
    )
    
    splits = {
        'train': train_files,
        'val': val_files,
        'test': test_files
    }
    
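    # Copy files to appropriate directories
    output_root = Path(output_root)
    for split_name, files in splits.items():
        # Create the destination folders if they do not exist yet
        (output_root / 'images' / split_name).mkdir(parents=True, exist_ok=True)
        (output_root / 'labels' / split_name).mkdir(parents=True, exist_ok=True)

        for image_file in files:
            # Copy image
            dst_image = output_root / 'images' / split_name / image_file.name
            shutil.copy(image_file, dst_image)
            
            # Copy label
            label_file = Path(source_labels) / (image_file.stem + '.txt')
            if label_file.exists():
                dst_label = output_root / 'labels' / split_name / label_file.name
                shutil.copy(label_file, dst_label)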
    
    return splits

# Demonstrate split ratios
print("📊 Dataset Split Recommendations:")
print("="*60)
print(f"{'Dataset Size':<20} {'Train':<10} {'Val':<10} {'Test':<10}")
print("="*60)
print(f"{'Small (<100)':<20} {'70%':<10} {'20%':<10} {'10%':<10}")
print(f"{'Medium (100-1000)':<20} {'75%':<10} {'15%':<10} {'10%':<10}")
print(f"{'Large (>1000)':<20} {'80%':<10} {'10%':<10} {'10%':<10}")
print("="*60)

# Show current split
print("\n📁 Current Dataset Split:")
print("="*60)
for split in ['train', 'val', 'test']:
    image_dir = dataset_root / 'images' / split
    num_images = len(list(image_dir.glob('*.jpg')))
    total_images = sum([len(list((dataset_root / 'images' / s).glob('*.jpg'))) for s in ['train', 'val', 'test']])
    pct = (num_images / total_images) * 100 if total_images > 0 else 0
    print(f"{split.capitalize():<10}: {num_images:>3} images ({pct:>5.1f}%)")
print("="*60)

## 10. Common Dataset Errors

### Error 1: Mismatched Image/Label Files
```
✅ images/train/img001.jpg exists
❌ labels/train/img001.txt missing
```
**Solution**: Ensure every image has a corresponding label file (even if empty)

### Error 2: Out-of-Range Coordinates
```
❌ 0 0.5 0.5 1.5 0.4  (width > 1.0)
```
**Solution**: All coordinates must be normalized between 0 and 1
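
When a box is out of range because it genuinely extends past the image edge, one possible repair is to clip its corners to the image and recompute the center and size. The sketch below does that; `clip_yolo_box` is a hypothetical helper. If the values are out of range because of a conversion bug, fix the conversion instead of clipping.

```python
def clip_yolo_box(x_center, y_center, width, height):
    """Clip a normalized box to the [0, 1] image area and return the adjusted box."""
    # Convert to corners and clamp each one to the image
    x_min = max(0.0, x_center - width / 2)
    y_min = max(0.0, y_center - height / 2)
    x_max = min(1.0, x_center + width / 2)
    y_max = min(1.0, y_center + height / 2)
    # Convert the clamped corners back to center/size form
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)

# The out-of-range example above: the width shrinks from 1.5 to 1.0
print(clip_yolo_box(0.5, 0.5, 1.5, 0.4))
```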

### Error 3: Wrong Class IDs
```
❌ 5 0.5 0.5 0.3 0.4  (but only 3 classes: 0, 1, 2)
```
**Solution**: Every class ID must be an integer from 0 to (number of classes - 1) and match an entry under `names` in data.yaml

### Error 4: Empty Label Files
```
⚠️ labels/train/img001.txt is empty
```
**Note**: Empty files are OK (images with no objects)

### Error 5: Incorrect Format
```
❌ 0,0.5,0.5,0.3,0.4  (commas instead of spaces)
❌ 0 50 50 30 40      (pixels instead of normalized)
```
**Solution**: Use space-separated normalized values
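
Both mistakes can often be repaired mechanically. The sketch below accepts commas or spaces and, when any value exceeds 1, assumes the line holds center-format pixel values and divides by the image size. That assumption, the helper name `repair_label_line`, and the 100×100 image size are all illustrative; check how your own files were produced before applying anything like this.

```python
def repair_label_line(line, image_width, image_height):
    """Hypothetical helper: accept commas or spaces and normalize pixel-valued boxes."""
    parts = line.replace(",", " ").split()
    class_id = int(parts[0])
    x_c, y_c, w, h = (float(p) for p in parts[1:5])
    if max(x_c, y_c, w, h) > 1.0:
        # Assumption: values above 1 are (x_center, y_center, width, height) in pixels
        x_c, w = x_c / image_width, w / image_width
        y_c, h = y_c / image_height, h / image_height
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

print(repair_label_line("0,0.5,0.5,0.3,0.4", 100, 100))  # 0 0.500000 0.500000 0.300000 0.400000
print(repair_label_line("0 50 50 30 40", 100, 100))      # 0 0.500000 0.500000 0.300000 0.400000
```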

In [None]:
# Dataset validation function
def validate_dataset(dataset_path, num_classes):
    """
    Validate YOLO dataset for common errors
    """
    errors = []
    warnings = []
    
    for split in ['train', 'val', 'test']:
        image_dir = dataset_path / 'images' / split
        label_dir = dataset_path / 'labels' / split
        
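        # Get all images (same extensions as listed in Section 1)
        image_extensions = {'.jpg', '.jpeg', '.png', '.bmp'}
        images = [p for p in image_dir.glob('*') if p.suffix.lower() in image_extensions]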
        
        for image_file in images:
            # Check for corresponding label
            label_file = label_dir / (image_file.stem + '.txt')
            
            if not label_file.exists():
                errors.append(f"Missing label: {label_file}")
                continue
            
            # Validate label content
            if label_file.stat().st_size == 0:
                warnings.append(f"Empty label: {label_file}")
                continue
            
            with open(label_file, 'r') as f:
                for line_num, line in enumerate(f, 1):
                    parts = line.strip().split()
                    
                    if len(parts) != 5:
                        errors.append(f"{label_file}:{line_num} - Wrong format (expected 5 values)")
                        continue
                    
                    try:
                        class_id = int(parts[0])
                        x_center = float(parts[1])
                        y_center = float(parts[2])
                        width = float(parts[3])
                        height = float(parts[4])
                        
                        # Validate class ID
                        if class_id < 0 or class_id >= num_classes:
                            errors.append(f"{label_file}:{line_num} - Invalid class ID {class_id}")
                        
                        # Validate coordinates (0-1 range)
                        if not (0 <= x_center <= 1):
                            errors.append(f"{label_file}:{line_num} - x_center out of range: {x_center}")
                        if not (0 <= y_center <= 1):
                            errors.append(f"{label_file}:{line_num} - y_center out of range: {y_center}")
                        if not (0 <= width <= 1):
                            errors.append(f"{label_file}:{line_num} - width out of range: {width}")
                        if not (0 <= height <= 1):
                            errors.append(f"{label_file}:{line_num} - height out of range: {height}")
                    
                    except ValueError as e:
                        errors.append(f"{label_file}:{line_num} - Invalid number format")
    
    return errors, warnings

# Validate our dataset
errors, warnings = validate_dataset(dataset_root, num_classes=3)

print("\n🔍 Dataset Validation Results:")
print("="*60)

if not errors and not warnings:
    print("✅ Dataset is valid! No errors or warnings.")
else:
    if errors:
        print(f"\n❌ Errors ({len(errors)}):")
        for error in errors[:10]:  # Show first 10
            print(f"  • {error}")
        if len(errors) > 10:
            print(f"  ... and {len(errors) - 10} more")
    
    if warnings:
        print(f"\n⚠️ Warnings ({len(warnings)}):")
        for warning in warnings[:10]:
            print(f"  • {warning}")
        if len(warnings) > 10:
            print(f"  ... and {len(warnings) - 10} more")

print("\n" + "="*60)

## 11. Exercise: Prepare Your Own Mini Dataset

Now it's your turn to prepare a dataset!

### Exercise Tasks

1. **Collect Images**: Gather 10-20 images for your custom object detection task
   - Option A: Download from internet
   - Option B: Take photos with your phone
   - Option C: Use existing public dataset

2. **Choose Classes**: Define 2-3 object classes to detect
   - Examples: "cat", "dog", "person", "car", "phone", etc.

3. **Annotate Images**: Use one of these tools:
   - LabelImg (recommended for beginners)
   - Makesense.ai (browser-based)
   - Roboflow (if you want to try web-based)

4. **Organize Dataset**: Create proper directory structure
   ```
   my_custom_dataset/
   ├── images/
   │   ├── train/
   │   ├── val/
   │   └── test/
   ├── labels/
   │   ├── train/
   │   ├── val/
   │   └── test/
   └── data.yaml
   ```

5. **Create data.yaml**: Define your classes and paths

6. **Validate**: Run the validation function to check for errors

### Success Criteria
- ✅ All images have corresponding label files
- ✅ All coordinates are normalized (0-1)
- ✅ Class IDs are correct (0, 1, 2, ...)
- ✅ data.yaml is properly configured
- ✅ No validation errors

In [None]:
# Exercise template - fill in your details

# TODO: Define your classes
my_classes = ['class1', 'class2', 'class3']  # Replace with your classes

# TODO: Set your dataset path
my_dataset_path = Path('my_custom_dataset')  # Change to your path

# TODO: Create data.yaml
# my_data_yaml = {
#     'path': str(my_dataset_path.absolute()),
#     'train': 'images/train',
#     'val': 'images/val',
#     'test': 'images/test',
#     'names': {i: name for i, name in enumerate(my_classes)}
# }

# TODO: Validate your dataset
# errors, warnings = validate_dataset(my_dataset_path, len(my_classes))
# print(f"Errors: {len(errors)}, Warnings: {len(warnings)}")

print("✏️ Complete the TODOs above to prepare your custom dataset!")
print("\n📚 Next steps:")
print("  1. Collect 10-20 images")
print("  2. Annotate using LabelImg or Makesense.ai")
print("  3. Organize into train/val/test splits")
print("  4. Create data.yaml configuration")
print("  5. Validate dataset for errors")
print("\n  Ready for Notebook 04: Training YOLOv8! 🚀")

## 12. Summary

### What We Learned

✅ **YOLO Format**: Normalized bounding boxes with class IDs

✅ **Directory Structure**: Proper organization of images and labels

✅ **data.yaml**: Configuration file for training

✅ **Annotation Tools**: LabelImg, Roboflow, CVAT, Makesense.ai

✅ **Data Augmentation**: Built-in augmentations in YOLO

✅ **Class Balance**: Importance of balanced datasets

✅ **Data Splitting**: Train/val/test ratios

✅ **Validation**: Checking for common errors

### Key Takeaways

1. **Format matters**: YOLO expects specific format (normalized coordinates)
2. **Validation is crucial**: Always validate before training
3. **Balance your data**: Avoid class imbalance when possible
4. **Use the right tools**: Choose annotation tool based on dataset size
5. **Split properly**: Maintain class distribution across splits

### Dataset Preparation Checklist

- [ ] Images and labels in correct directories
- [ ] Filenames match (image001.jpg → image001.txt)
- [ ] All coordinates normalized (0-1 range)
- [ ] Class IDs start from 0 and are consecutive
- [ ] data.yaml created with correct paths
- [ ] Train/val/test split done (e.g. 70/20/10 or 80/10/10)
- [ ] Dataset validated (no errors)
- [ ] Class balance checked

### Preview: Notebook 04 - Training YOLOv8

In the next notebook, we'll:
- Train YOLOv8 on custom dataset
- Monitor training progress
- Evaluate model performance
- Fine-tune hyperparameters
- Export trained model

---

**Your dataset is ready! Let's train a model!** 🎯