# Notebook 04: YOLOv8 Training on Custom Dataset

**Week 14 - Module 5: Object Detection Models**  
**Tutorial T14: Fine-tuning YOLO for Custom Detection Tasks**

## Learning Objectives
- Train YOLOv8 on custom dataset
- Understand training parameters (epochs, batch size, learning rate)
- Monitor training with TensorBoard
- Evaluate trained model performance
- Export models for deployment

**Estimated Time:** 25 minutes  
**Prerequisites:** Completed Notebook 03 (Dataset Preparation)

## Prerequisites Check

### Required Resources:
- ✅ Dataset from Notebook 03 (annotated images in YOLO format)
- ✅ GPU recommended (CUDA-enabled)
- ✅ ~500MB disk space for model weights
- ✅ Estimated training time: 10-15 minutes on GPU, 60+ minutes on CPU

### What You'll Learn:
1. How to configure training parameters
2. Monitor training progress in real-time
3. Evaluate model performance metrics
4. Fine-tune hyperparameters for better results
5. Export models for production deployment

In [None]:
# Setup: Install Ultralytics YOLO and check GPU availability
!pip install -q ultralytics tensorboard

from ultralytics import YOLO
import torch
import os
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Image, display

# Check CUDA availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ Training will use CPU (slower)")

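# Set working directory (this path is specific to the author's machine; edit it for your setup)
notebook_dir = Path('/Users/rameshbabu/data/projects/srm/lectures/Deep_Neural_Network_Architectures/course_planning/weekly_plans/week14-module5-detection-models/notebooks')
if notebook_dir.is_dir():
    os.chdir(notebook_dir)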

## Step 1: Load Pre-trained YOLOv8 Model

We'll start with a pre-trained YOLOv8 model and fine-tune it on our custom dataset.

### YOLOv8 Model Variants:
- **YOLOv8n** (nano): 6MB, fastest, 3.2M parameters - **Recommended for learning**
- **YOLOv8s** (small): 22MB, 11.2M parameters
- **YOLOv8m** (medium): 52MB, 25.9M parameters
- **YOLOv8l** (large): 87MB, 43.7M parameters
- **YOLOv8x** (extra large): 136MB, 68.2M parameters

In [None]:
# Load pre-trained YOLOv8 nano model (fastest for training)
model = YOLO('yolov8n.pt')

print("\n📦 Model loaded successfully!")
print(f"Model type: {model.model.__class__.__name__}")
print(f"Model task: {model.task}")
print(f"Pre-trained on: COCO dataset (80 classes)")

# Display model summary
model.info()

## Step 2: Dataset Configuration

We'll use a sample dataset for demonstration. You can replace this with your own dataset from Notebook 03.

### Dataset Structure (YOLO format):
```
dataset/
├── data.yaml          # Dataset configuration
├── train/
│   ├── images/        # Training images
│   └── labels/        # Training labels (.txt)
└── val/
    ├── images/        # Validation images
    └── labels/        # Validation labels (.txt)
```
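
Each `.txt` label file holds one object per line as `class x_center y_center width height`, with the four coordinates normalized to the range 0 to 1. As a minimal sketch, the helper below (a hypothetical name, not part of Ultralytics) converts one such line into pixel corner coordinates, which is handy for sanity-checking annotations before training:

```python
def yolo_line_to_pixels(line, img_w, img_h):
    """Convert 'cls xc yc w h' (normalized) to (cls, x1, y1, x2, y2) in pixels."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    cls_id = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:])
    # Centre/size -> top-left and bottom-right corners, scaled to the image.
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return cls_id, x1, y1, x2, y2


print(yolo_line_to_pixels("1 0.5 0.5 0.25 0.5", img_w=640, img_h=480))
# (1, 240.0, 120.0, 400.0, 360.0)
```

A box whose corners fall outside the image, or a class index missing from `data.yaml`, usually points to an annotation export problem.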

In [None]:
# Option 1: Download sample hardhat detection dataset
# This is a small dataset for quick training demonstration

!pip install -q roboflow

from roboflow import Roboflow

# Download sample dataset (hardhat detection)
# You can skip this if you have your own dataset
# Note: replace the placeholder key below with your own before running
print("📥 Downloading sample dataset...")
rf = Roboflow(api_key="YOUR_API_KEY")  # Get free API key from roboflow.com

# Alternative: Create a simple toy dataset for demonstration
# We'll create a minimal dataset configuration

dataset_path = Path('sample_dataset')
dataset_path.mkdir(exist_ok=True)

# Create data.yaml configuration file
data_yaml = f"""
# Dataset configuration for YOLOv8 training
path: {dataset_path.absolute()}  # Dataset root directory
train: train/images  # Train images (relative to 'path')
val: val/images      # Validation images (relative to 'path')

# Classes
names:
  0: person
  1: hardhat
  2: no-hardhat
"""

with open(dataset_path / 'data.yaml', 'w') as f:
    f.write(data_yaml)

print("✅ Dataset configuration created!")
print(f"Dataset path: {dataset_path.absolute()}")
print("\n📝 data.yaml contents:")
print(data_yaml)

## Step 3: Training Parameters Explained

### Key Training Parameters:

| Parameter | Default | Description | Tuning Tips |
|-----------|---------|-------------|-------------|
| `epochs` | 100 | Number of full passes over the training set | Start with 50, increase if underfitting |
| `imgsz` | 640 | Input image size | 640 (default), 1280 (more accurate, slower) |
| `batch` | 16 | Batch size | Reduce if OOM error, increase for faster training |
| `patience` | 50 | Early stopping patience | 10-20 for small datasets |
| `lr0` | 0.01 | Initial learning rate | Auto (default) or 0.001-0.01 |
| `device` | None (auto-select) | GPU index (e.g. 0) or `'cpu'` | Use `'cpu'` if no GPU |
| `workers` | 8 | Data loading threads | Reduce if CPU bottleneck |
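
`patience` controls early stopping: training halts once the validation metric has gone that many epochs without improving. The sketch below illustrates the general idea only; it is not Ultralytics' implementation, and the function name and the score history are made up for the example.

```python
def early_stop_epoch(scores, patience):
    """Return the 0-based epoch at which training would stop, or None.

    `scores` is a per-epoch validation metric where higher is better.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch
    return None


history = [0.10, 0.25, 0.31, 0.30, 0.29, 0.31, 0.28]  # illustrative values
print(early_stop_epoch(history, patience=3))   # 5
print(early_stop_epoch(history, patience=10))  # None
```

A small `patience` saves time on small datasets but can stop a run that is only temporarily plateauing.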

### Loss Functions:
- **Box Loss**: Bounding box regression (IoU-based)
- **Class Loss**: Classification loss (binary cross-entropy per class)
- **DFL Loss**: Distribution Focal Loss (localization)
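
Since the box loss is IoU-based, it helps to see how Intersection over Union is computed for two axis-aligned boxes. This is a plain-IoU sketch under the assumption that boxes are `(x1, y1, x2, y2)` tuples; the training loss itself uses a refined variant that this example does not reproduce.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    # Overlap width/height are clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))   # half-overlapping boxes
print(box_iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes
```

An IoU of 1.0 means a perfect match and 0.0 means no overlap, so a loss of the form `1 - IoU` shrinks as predictions align with the ground truth.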

In [None]:
# Training configuration
training_config = {
    'data': str(dataset_path / 'data.yaml'),
    'epochs': 50,              # Number of training epochs
    'imgsz': 640,              # Input image size (pixels)
    'batch': 16,               # Batch size (reduce if OOM)
    'patience': 10,            # Early stopping patience
    'save': True,              # Save checkpoints
    'device': 0 if torch.cuda.is_available() else 'cpu',  # GPU or CPU
    'workers': 4,              # Data loading threads
    'project': 'yolo_training', # Project name
    'name': 'exp',             # Experiment name
    'exist_ok': True,          # Reuse (and overwrite) an existing experiment directory
    'pretrained': True,        # Use pre-trained weights
    'optimizer': 'Adam',       # Optimizer (Adam, SGD, AdamW)
    'verbose': True,           # Verbose output
    'seed': 42,                # Random seed for reproducibility
    'deterministic': True,     # Deterministic training
    'plots': True,             # Generate plots
}

print("🔧 Training Configuration:")
for key, value in training_config.items():
    print(f"  {key}: {value}")

## Step 4: Start Training

**Note:** This cell will take 10-15 minutes on GPU, 60+ minutes on CPU.

Training progress will show:
- Loss values (box, cls, dfl)
- Precision, Recall, mAP50, mAP50-95
- Training speed (images/second)
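
To make the reported metrics concrete: precision and recall come from counts of true positives, false positives, and false negatives, while mAP50-95 averages AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below uses hypothetical helper names and made-up counts; it shows the arithmetic only, not how Ultralytics computes AP internally.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# IoU thresholds 0.50, 0.55, ..., 0.95 used by the mAP50-95 metric.
iou_thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def map50_95(ap_per_threshold):
    """Average AP values measured at each of the ten IoU thresholds."""
    if len(ap_per_threshold) != len(iou_thresholds):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)


p, r = precision_recall(tp=8, fp=2, fn=4)  # illustrative counts
print(f"precision={p:.3f} recall={r:.3f}")
print(iou_thresholds)
```

Because mAP50-95 also rewards tight localization at high IoU thresholds, it is normally lower than mAP50 for the same model.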

In [None]:
# Start training (this will take time!)
print("🚀 Starting training...\n")
print("⏱️ Estimated time: 10-15 min (GPU) or 60+ min (CPU)\n")

# Train the model
results = model.train(**training_config)

print("\n✅ Training completed!")
print(f"Results saved to: {results.save_dir}")

## Step 5: Monitor Training with TensorBoard (Optional)

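TensorBoard visualizes the logged training metrics. To watch a run in real time, launch it before or while training runs; launched afterwards, it shows the completed run.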

In [None]:
# Load TensorBoard in Jupyter
%load_ext tensorboard

# Launch TensorBoard
# Point to the training logs directory
%tensorboard --logdir yolo_training/exp

print("📊 TensorBoard launched!")
print("You can view the training metrics above.")
print("\nKey metrics to watch:")
print("  - train/box_loss: Should decrease steadily")
print("  - train/cls_loss: Should decrease steadily")
print("  - metrics/mAP50: Should increase (target: >0.5)")
print("  - metrics/mAP50-95: Should increase (target: >0.3)")

## Step 6: Training Results Analysis

Let's analyze the training performance and visualize key metrics.

In [None]:
# Load and plot training results
import pandas as pd

# Read results CSV
results_csv = Path('yolo_training/exp/results.csv')
if results_csv.exists():
    df = pd.read_csv(results_csv)
    df.columns = df.columns.str.strip()  # Remove whitespace
    
    # Create figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('YOLOv8 Training Results', fontsize=16, fontweight='bold')
    
    # Plot 1: Loss curves
    axes[0, 0].plot(df['epoch'], df['train/box_loss'], label='Box Loss', linewidth=2)
    axes[0, 0].plot(df['epoch'], df['train/cls_loss'], label='Class Loss', linewidth=2)
    axes[0, 0].plot(df['epoch'], df['train/dfl_loss'], label='DFL Loss', linewidth=2)
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].set_title('Training Losses')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Plot 2: mAP metrics
    axes[0, 1].plot(df['epoch'], df['metrics/mAP50(B)'], label='mAP@0.5', linewidth=2, color='green')
    axes[0, 1].plot(df['epoch'], df['metrics/mAP50-95(B)'], label='mAP@0.5:0.95', linewidth=2, color='blue')
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('mAP')
    axes[0, 1].set_title('Mean Average Precision')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Plot 3: Precision and Recall
    axes[1, 0].plot(df['epoch'], df['metrics/precision(B)'], label='Precision', linewidth=2, color='purple')
    axes[1, 0].plot(df['epoch'], df['metrics/recall(B)'], label='Recall', linewidth=2, color='orange')
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Score')
    axes[1, 0].set_title('Precision and Recall')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # Plot 4: Learning rate
    if 'lr/pg0' in df.columns:
        axes[1, 1].plot(df['epoch'], df['lr/pg0'], linewidth=2, color='red')
        axes[1, 1].set_xlabel('Epoch')
        axes[1, 1].set_ylabel('Learning Rate')
        axes[1, 1].set_title('Learning Rate Schedule')
        axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print final metrics
    print("\n📈 Final Training Metrics:")
    print(f"  mAP@0.5: {df['metrics/mAP50(B)'].iloc[-1]:.3f}")
    print(f"  mAP@0.5:0.95: {df['metrics/mAP50-95(B)'].iloc[-1]:.3f}")
    print(f"  Precision: {df['metrics/precision(B)'].iloc[-1]:.3f}")
    print(f"  Recall: {df['metrics/recall(B)'].iloc[-1]:.3f}")
else:
    print("⚠️ Results CSV not found. Train the model first.")
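
The final row of `results.csv` is not necessarily the best epoch. As a dependency-free sketch, the helper below (a hypothetical name) strips the padded column headers and returns the epoch with the highest `metrics/mAP50-95(B)`. The sample rows are made-up numbers for illustration, not real training output.

```python
import csv
import io


def best_epoch(csv_text, metric="metrics/mAP50-95(B)"):
    """Return (epoch, value) for the row with the highest `metric`."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    # Header names may carry padding, so strip keys and values.
    rows = [{k.strip(): v.strip() for k, v in row.items()} for row in reader]
    best = max(rows, key=lambda r: float(r[metric]))
    return int(float(best["epoch"])), float(best[metric])


sample = """epoch, metrics/mAP50(B), metrics/mAP50-95(B)
1, 0.21, 0.10
2, 0.48, 0.27
3, 0.46, 0.29
4, 0.47, 0.28
"""
print(best_epoch(sample))  # (3, 0.29)
```

With a real run, read the file first, for example `best_epoch(Path('yolo_training/exp/results.csv').read_text())`.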

## Step 7: Evaluate on Validation Set

Let's evaluate the trained model on the validation set to measure performance.

In [None]:
# Load the best trained model
best_model = YOLO('yolo_training/exp/weights/best.pt')

print("🔍 Evaluating model on validation set...\n")

# Run validation
metrics = best_model.val()

# Display metrics
print("\n📊 Validation Metrics:")
print(f"  mAP@0.5: {metrics.box.map50:.3f}")
print(f"  mAP@0.5:0.95: {metrics.box.map:.3f}")
# box.p and box.r are per-class arrays; mp and mr are their means
print(f"  Precision: {metrics.box.mp:.3f}")
print(f"  Recall: {metrics.box.mr:.3f}")
print(f"\n  Per-class mAP@0.5:")
for i, class_map in enumerate(metrics.box.ap50):
    print(f"    Class {i}: {class_map:.3f}")

# Interpretation guide
print("\n📚 Metrics Interpretation:")
print("  mAP@0.5 > 0.5: Good detection performance")
print("  mAP@0.5 > 0.7: Excellent detection performance")
print("  mAP@0.5:0.95 > 0.3: Good localization accuracy")
print("  Precision: % of correct detections (avoid false positives)")
print("  Recall: % of objects detected (avoid missing objects)")

## Step 8: Test Trained Model on New Images

Let's compare the trained model with the pre-trained model on test images.

In [None]:
# Download a test image or use your own
import urllib.request

test_image_url = 'https://ultralytics.com/images/bus.jpg'
test_image_path = 'test_image.jpg'

urllib.request.urlretrieve(test_image_url, test_image_path)

# Run inference with both models
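print("🔍 Running inference...\n")

# Pre-trained model: reload the original COCO weights, because after
# model.train() the `model` object holds the fine-tuned weights
pretrained_model = YOLO('yolov8n.pt')
pretrained_results = pretrained_model(test_image_path)

# Fine-tuned model
finetuned_results = best_model(test_image_path)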

# Visualize results side-by-side
fig, axes = plt.subplots(1, 2, figsize=(15, 7))

# Plot pre-trained results
# plot() returns a BGR array, so reverse the channel order for matplotlib (RGB)
pretrained_img = pretrained_results[0].plot()[..., ::-1]
axes[0].imshow(pretrained_img)
axes[0].set_title('Pre-trained YOLOv8n (COCO)', fontsize=14, fontweight='bold')
axes[0].axis('off')

# Plot fine-tuned results
finetuned_img = finetuned_results[0].plot()[..., ::-1]
axes[1].imshow(finetuned_img)
axes[1].set_title('Fine-tuned YOLOv8n (Custom)', fontsize=14, fontweight='bold')
axes[1].axis('off')

plt.tight_layout()
plt.show()

print("\n📊 Detection Comparison:")
print(f"  Pre-trained detections: {len(pretrained_results[0].boxes)}")
print(f"  Fine-tuned detections: {len(finetuned_results[0].boxes)}")

## Step 9: Save and Export Model

Export the trained model for deployment in various formats.

In [None]:
# Export model to different formats
print("📦 Exporting model...\n")

# Export to ONNX (for cross-platform deployment)
onnx_path = best_model.export(format='onnx')
print(f"✅ ONNX model: {onnx_path}")

# Other export formats:
# - 'torchscript': PyTorch TorchScript
# - 'coreml': Apple CoreML (iOS)
# - 'tflite': TensorFlow Lite (mobile)
# - 'saved_model': TensorFlow SavedModel
# - 'pb': TensorFlow GraphDef
# - 'engine': TensorRT (NVIDIA)

# Example: Export to TensorFlow Lite for mobile deployment
# tflite_path = best_model.export(format='tflite')
# print(f"✅ TFLite model: {tflite_path}")

print("\n📁 Model files saved:")
print(f"  PyTorch (.pt): yolo_training/exp/weights/best.pt")
print(f"  ONNX (.onnx): {onnx_path}")
print("\n🚀 Models ready for deployment!")

## Step 10: Overfitting Check

Compare training and validation metrics to detect overfitting.

In [None]:
# Overfitting analysis
if results_csv.exists():
    df = pd.read_csv(results_csv)
    df.columns = df.columns.str.strip()
    
    # Plot train vs validation loss
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Total loss comparison
    train_loss = df['train/box_loss'] + df['train/cls_loss'] + df['train/dfl_loss']
    val_loss = df['val/box_loss'] + df['val/cls_loss'] + df['val/dfl_loss']
    
    axes[0].plot(df['epoch'], train_loss, label='Train Loss', linewidth=2)
    axes[0].plot(df['epoch'], val_loss, label='Val Loss', linewidth=2)
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Total Loss')
    axes[0].set_title('Train vs Validation Loss')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Gap analysis
    loss_gap = val_loss - train_loss
    axes[1].plot(df['epoch'], loss_gap, linewidth=2, color='red')
    axes[1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Loss Gap (Val - Train)')
    axes[1].set_title('Overfitting Indicator')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Overfitting diagnosis (rough heuristic; the thresholds below are illustrative, not calibrated)
    final_gap = loss_gap.iloc[-1]
    print("\n🔍 Overfitting Analysis:")
    if final_gap < 0.1:
        print("  ✅ No overfitting detected (gap < 0.1)")
        print("  → Model generalizes well")
    elif final_gap < 0.3:
        print("  ⚠️ Slight overfitting (gap 0.1-0.3)")
        print("  → Consider: Early stopping, data augmentation")
    else:
        print("  ❌ Significant overfitting (gap > 0.3)")
        print("  → Solutions:")
        print("    1. Reduce epochs (early stopping)")
        print("    2. Increase data augmentation")
        print("    3. Add dropout/regularization")
        print("    4. Collect more training data")

## Step 11: Hyperparameter Tuning Tips

### Common Issues and Solutions:

#### 1. Low mAP (<0.3)
**Solutions:**
- ✅ Increase epochs (50 → 100)
- ✅ Improve data quality (better annotations)
- ✅ Balance dataset (equal samples per class)
- ✅ Try larger model (yolov8n → yolov8s)

#### 2. Out of Memory (OOM)
**Solutions:**
- ✅ Reduce batch size (16 → 8 → 4)
- ✅ Reduce image size (640 → 480)
- ✅ Use smaller model (yolov8s → yolov8n)
- ✅ Reduce workers (8 → 4 → 2)
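
The batch-size reduction (16 → 8 → 4) can be automated with a simple retry loop. This is a sketch under stated assumptions: `train_with_batch_backoff` and `fake_train` are hypothetical names, and `fake_train` merely stands in for a real call such as `model.train(**training_config)` by raising the same kind of `RuntimeError: CUDA out of memory` message.

```python
def train_with_batch_backoff(train_fn, batch=16, min_batch=2):
    """Call train_fn(batch), halving the batch after each out-of-memory error."""
    while True:
        try:
            return batch, train_fn(batch)
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            # Re-raise unrelated errors, or give up once the batch is too small.
            if not is_oom or batch // 2 < min_batch:
                raise
            batch //= 2


def fake_train(batch):
    # Stand-in for real training: pretend batches above 4 do not fit in memory.
    if batch > 4:
        raise RuntimeError("CUDA out of memory")
    return f"trained with batch={batch}"


print(train_with_batch_backoff(fake_train))  # (4, 'trained with batch=4')
```

In a real notebook it is usually worth calling `torch.cuda.empty_cache()` between attempts so that memory from the failed run is released.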

#### 3. Slow Training
**Solutions:**
- ✅ Use GPU instead of CPU
- ✅ Increase batch size (if memory allows)
- ✅ Use smaller model (yolov8m → yolov8n)
- ✅ Reduce image size (1280 → 640)

#### 4. Overfitting
**Solutions:**
- ✅ Early stopping (patience=10)
- ✅ Data augmentation (built-in YOLO augmentation)
- ✅ Reduce epochs
- ✅ Collect more training data

### Recommended Hyperparameters:

| Scenario | Model | Epochs | Batch | Image Size |
|----------|-------|--------|-------|------------|
| Quick test | yolov8n | 20 | 16 | 640 |
| Small dataset (<500 images) | yolov8n | 50 | 16 | 640 |
| Medium dataset (500-5000) | yolov8s | 100 | 16 | 640 |
| Large dataset (>5000) | yolov8m | 150 | 32 | 640 |
| High accuracy needed | yolov8l | 200 | 16 | 1280 |
| Mobile deployment | yolov8n | 100 | 16 | 320 |
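
The dataset-size rows of the table can be encoded as a small lookup for use as a starting point. The helper name `suggest_config` is hypothetical, and the returned values simply mirror the table above; treat them as defaults to tune rather than fixed rules.

```python
def suggest_config(num_images):
    """Return starting hyperparameters for a dataset of `num_images` images."""
    if num_images < 500:        # small dataset
        return {"model": "yolov8n.pt", "epochs": 50, "batch": 16, "imgsz": 640}
    if num_images <= 5000:      # medium dataset
        return {"model": "yolov8s.pt", "epochs": 100, "batch": 16, "imgsz": 640}
    # large dataset
    return {"model": "yolov8m.pt", "epochs": 150, "batch": 32, "imgsz": 640}


print(suggest_config(1200))
```

The `model` entry is a weights filename to pass to `YOLO(...)`; the remaining keys match the `training_config` argument names used earlier.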

## Step 12: Common Training Issues

### Troubleshooting Guide:

```python
# Issue 1: CUDA Out of Memory
# Error: RuntimeError: CUDA out of memory
# Solution:
training_config['batch'] = 8  # Reduce from 16
training_config['workers'] = 2  # Reduce from 4

# Issue 2: Poor mAP (<0.3)
# Solution:
training_config['epochs'] = 100  # Increase from 50
model = YOLO('yolov8s.pt')  # Use larger model

# Issue 3: No GPU available (training on CPU is slow)
# Solution: shrink the workload
training_config['device'] = 'cpu'
training_config['batch'] = 4  # Smaller batch
training_config['epochs'] = 20  # Fewer epochs for testing

# Issue 4: Dataset not found
# Solution: Check data.yaml path
import os
yaml_path = training_config['data']
print(f"YAML exists: {os.path.exists(yaml_path)}")
```

## Exercise: Train on Your Own Dataset

### Task:
1. Prepare your own dataset using Roboflow or LabelImg
2. Create `data.yaml` with your class names
3. Train YOLOv8 for 50 epochs
4. Evaluate and export the model

### Dataset Requirements:
- ✅ At least 100 images per class
- ✅ 80% train, 20% validation split
- ✅ YOLO format annotations (.txt)
- ✅ Balanced class distribution
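
One way to produce the 80/20 split reproducibly is to shuffle the image names with a fixed seed and slice the result. This is a minimal sketch with a hypothetical helper name and made-up filenames; it only splits the list of names, so copying each image and its matching label file into `train/` and `val/` is left to you.

```python
import random


def split_train_val(items, val_fraction=0.2, seed=42):
    """Shuffle deterministically and return (train_items, val_items)."""
    items = sorted(items)               # stable starting order
    random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
    n_val = max(1, round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


names = [f"img_{i:03d}.jpg" for i in range(100)]  # illustrative filenames
train_names, val_names = split_train_val(names)
print(len(train_names), len(val_names))  # 80 20
```

A purely random split does not guarantee balanced classes in both subsets, so check the per-class counts afterwards on small datasets.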

### Steps:
```python
# 1. Update data.yaml path
training_config['data'] = 'path/to/your/data.yaml'

# 2. Train model
results = model.train(**training_config)

# 3. Evaluate
metrics = model.val()
print(f"mAP@0.5: {metrics.box.map50}")

# 4. Export
model.export(format='onnx')
```

In [None]:
# Exercise workspace - Train your own model here

# TODO: Update with your dataset path
my_dataset_path = 'path/to/your/dataset/data.yaml'

# TODO: Configure training parameters
my_config = {
    'data': my_dataset_path,
    'epochs': 50,
    'imgsz': 640,
    'batch': 16,
    'device': 0 if torch.cuda.is_available() else 'cpu',
}

# TODO: Train your model
# my_model = YOLO('yolov8n.pt')
# my_results = my_model.train(**my_config)

print("✏️ Complete the exercise above with your own dataset!")

## Summary

### What We Learned:
1. ✅ **Model Selection**: Choosing appropriate YOLOv8 variant (nano to extra-large)
2. ✅ **Training Configuration**: Understanding epochs, batch size, learning rate, patience
3. ✅ **Training Process**: Running training, monitoring progress with TensorBoard
4. ✅ **Evaluation Metrics**: Interpreting mAP, precision, recall, loss curves
5. ✅ **Overfitting Detection**: Analyzing train vs validation performance
6. ✅ **Model Export**: Deploying models in ONNX, TFLite, TorchScript formats
7. ✅ **Troubleshooting**: Solving common training issues (OOM, slow training, low mAP)

### Key Takeaways:
- **Start small**: Use YOLOv8n for quick iterations, then scale up
- **Monitor metrics**: Watch mAP@0.5 (target >0.5) and loss curves
- **Early stopping**: Use patience=10-20 to prevent overfitting
- **GPU acceleration**: Training on GPU is 10-50× faster than CPU
- **Data quality**: Better annotations → better model performance

### Performance Benchmarks:
- **mAP@0.5 > 0.5**: Good detection performance
- **mAP@0.5 > 0.7**: Excellent detection performance
- **mAP@0.5:0.95 > 0.3**: Good localization accuracy

### Next Steps:
1. **Notebook 05**: Explore SSD architecture and implementation
2. **Notebook 06**: Compare YOLO vs SSD for different use cases
3. **Practice**: Train on your own custom dataset with real-world images

---

**Congratulations! You can now train custom YOLO models for object detection! 🎉**