YOLO Model Training Pipeline

This notebook runs the complete training workflow for a YOLOv8n object detection model on the Pascal VOC 2012 dataset. It trains with the configuration defined in notebook 02, tracks metrics with MLflow, and validates the trained model.

Training Workflow:
1. Load the model configuration from notebook 02
2. Initialize the YOLO model with pretrained COCO weights
3. Train on the Pascal VOC 2012 dataset (3000-5000 filtered images)
4. Track training progress and metrics with MLflow
5. Validate the trained model on the validation split
6. Log the best model to MLflow as a run artifact
7. Verify that the best.pt checkpoint is ready for the prediction phase

Training Duration:
- GPU: 30-60 minutes
- CPU: 3-4 hours

Dataset: Pascal VOC 2012 (~70% train, 15% val, 15% test split)
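The split itself is prepared before this notebook. As a sketch of how a reproducible ~70/15/15 split can be produced, assuming a flat list of image IDs (split_dataset and the IDs are hypothetical, not the method used to build this dataset), shuffling with a locally seeded RNG keeps the assignment identical across runs without touching global random state:

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    ratios: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 42,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items with a fixed seed and cut them into train/val/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG: global random state untouched
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Hypothetical image IDs standing in for the filtered VOC images
ids = [f"img_{i:05d}" for i in range(3000)]
train_ids, val_ids, test_ids = split_dataset(ids)
print(len(train_ids), len(val_ids), len(test_ids))
```

Because the test split takes the remainder, rounding never drops or duplicates an image.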


In [1]:
import yaml
from pathlib import Path
import mlflow
import os
import torch
import shutil
import numpy as np
import matplotlib.pyplot as plt
from ultralytics import YOLO
from PIL import Image

# Reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Project structure
PROJECT_ROOT = Path('../').resolve()  # Convert to an absolute path for MLflow
DATA_DIR = PROJECT_ROOT / 'data'
MODELS_DIR = PROJECT_ROOT / 'models'
RUNS_DIR = PROJECT_ROOT / 'runs'
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Configure MLflow with a file:// URI that is valid on Windows and POSIX
mlflow_path = str(RUNS_DIR / 'mlflow').replace('\\', '/')
if mlflow_path[1] == ':':  # Path starts with a drive letter (C:, D:, etc.)
    mlflow_uri = f"file:///{mlflow_path}"
else:
    mlflow_uri = f"file://{mlflow_path}"

mlflow.set_tracking_uri(mlflow_uri)
mlflow.set_experiment('yolo_3class_detection')

print("YOLO MODEL CONFIGURATION")
print("=" * 60)
print(f"Project Root: {PROJECT_ROOT}")
print(f"Data Dir: {DATA_DIR}")
print(f"Models Dir: {MODELS_DIR}")
print(f"MLflow URI: {mlflow_uri}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
print("=" * 60)

YOLO MODEL CONFIGURATION
Project Root: C:\Users\jordy\OneDrive\Desktop\iaaaa\iajordy2
Data Dir: C:\Users\jordy\OneDrive\Desktop\iaaaa\iajordy2\data
Models Dir: C:\Users\jordy\OneDrive\Desktop\iaaaa\iajordy2\models
MLflow URI: file:///C:/Users/jordy/OneDrive/Desktop/iaaaa/iajordy2/runs/mlflow
CUDA Available: True
GPU: NVIDIA GeForce RTX 3080 Laptop GPU




Stage 1: Environment Initialization and MLflow Setup

This stage sets up the training environment with reproducibility settings and experiment tracking configuration.

Components:
- PyTorch and NumPy: Deep learning framework and numerical operations
- MLflow: Experiment tracking and model versioning
- YOLO: Ultralytics YOLOv8 object detection framework

Reproducibility:
- Fixed seeds (42) for NumPy, PyTorch CPU, and CUDA GPU
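- Improves run-to-run reproducibility on the same hardware and software stack; bit-identical results across different machines, GPU models, or library versions are not guaranteed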
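The seeding in the first cell covers NumPy and PyTorch but not Python's built-in random module, which data pipelines may also draw from. A minimal sketch of a single helper that seeds all three and skips libraries that are not installed; seed_everything is a hypothetical name. Ultralytics additionally applies its own seeding from the seed argument passed to train().

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed Python, NumPy and PyTorch RNGs, skipping libraries that are missing."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU generator and all CUDA devices
        if torch.cuda.is_available():
            # Trade cuDNN autotuning speed for repeatable kernel selection
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```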

MLflow configuration:
- Sets the tracking URI to a local file store at runs/mlflow (a file:// URI with Windows drive-letter handling)
- Creates experiment named "yolo_3class_detection"
- All training metrics will be logged and retrievable
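The drive-letter check in the first cell exists because a Windows path such as C:/... needs three slashes after file: while a POSIX path already begins with /. A sketch of the same logic as a small, testable function; to_file_uri is a hypothetical name. For an absolute concrete path, pathlib's Path.as_uri() produces the same form and also percent-encodes characters such as spaces, which this manual version does not.

```python
def to_file_uri(path: str) -> str:
    """Build a file:// URI from an absolute Windows or POSIX path string."""
    normalized = path.replace("\\", "/")  # Windows separators -> forward slashes
    if len(normalized) > 1 and normalized[1] == ":":  # drive letter such as C:
        return f"file:///{normalized}"
    return f"file://{normalized}"  # POSIX paths already start with "/"


print(to_file_uri(r"C:\Users\me\project\runs\mlflow"))
print(to_file_uri("/home/me/project/runs/mlflow"))
```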

In [2]:
# Configuration (from notebook 02)
MODEL_NAME = 'yolov8n'
PRETRAINED_WEIGHTS = 'yolov8n.pt'
NUM_CLASSES = 3
CLASS_NAMES = ['person', 'car', 'dog']

TRAINING_CONFIG = {
    'epochs': 50,
    'batch_size': 16,
    'imgsz': 416,
    'patience': 10,
    'device': 0 if torch.cuda.is_available() else 'cpu',
    'seed': SEED,
    'lr0': 0.01,
    'lrf': 0.01,
    'momentum': 0.937,
    'weight_decay': 0.0005,
    'warmup_epochs': 3.0,
    'warmup_momentum': 0.8,
}

print("\n[1] Loading Configuration")
print("-" * 60)
print(f"Model: {MODEL_NAME}")
print(f"Classes: {NUM_CLASSES} ({', '.join(CLASS_NAMES)})")
print(f"\nTraining config:")
for key, value in TRAINING_CONFIG.items():
    print(f"  {key}: {value}")


[1] Loading Configuration
------------------------------------------------------------
Model: yolov8n
Classes: 3 (person, car, dog)

Training config:
  epochs: 50
  batch_size: 16
  imgsz: 416
  patience: 10
  device: 0
  seed: 42
  lr0: 0.01
  lrf: 0.01
  momentum: 0.937
  weight_decay: 0.0005
  warmup_epochs: 3.0
  warmup_momentum: 0.8


Stage 2: Load Configuration from Notebook 02

This stage replicates the model and training configuration defined in notebook 02.

Configuration includes:
- Model name and pretrained weights
- Dataset specification (3 classes)
- All training hyperparameters

This duplication keeps notebook 03 self-contained: it can run without importing any state from a notebook 02 session, but the values must be kept in sync with notebook 02 by hand.
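The only TRAINING_CONFIG key whose name differs from the corresponding Ultralytics train() argument is batch_size (Ultralytics calls it batch), so the whole dictionary can be forwarded instead of listing each key by hand. That keeps the logged and the applied hyperparameters from drifting apart. A minimal sketch; to_train_kwargs and RENAMES are hypothetical names, and the example dictionary copies the values from TRAINING_CONFIG.

```python
from typing import Any, Dict

# Keys whose TRAINING_CONFIG name differs from the Ultralytics train() argument
RENAMES = {"batch_size": "batch"}


def to_train_kwargs(config: Dict[str, Any], data_yaml: str) -> Dict[str, Any]:
    """Translate a TRAINING_CONFIG-style dict into keyword arguments for train()."""
    kwargs = {RENAMES.get(key, key): value for key, value in config.items()}
    kwargs["data"] = data_yaml
    return kwargs


example_config = {
    "epochs": 50, "batch_size": 16, "imgsz": 416, "patience": 10,
    "device": "cpu", "seed": 42, "lr0": 0.01, "lrf": 0.01,
    "momentum": 0.937, "weight_decay": 0.0005,
    "warmup_epochs": 3.0, "warmup_momentum": 0.8,
}
kwargs = to_train_kwargs(example_config, "data/data.yaml")
print(sorted(kwargs))
# In the notebook this could be called as: model.train(**kwargs, name='yolo_run')
```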

In [3]:
# End any existing MLflow run
if mlflow.active_run():
    mlflow.end_run()

# Initialize model
model = YOLO(PRETRAINED_WEIGHTS)

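# If Ultralytics' built-in MLflow integration is enabled, it reads these
# environment variables; setting them keeps its logging in the same tracking
# store, experiment, and run as this notebook, and stops it from ending the
# run when training finishes
os.environ['MLFLOW_TRACKING_URI'] = mlflow_uri
os.environ['MLFLOW_EXPERIMENT_NAME'] = 'yolo_3class_detection'
os.environ['MLFLOW_KEEP_RUN_ACTIVE'] = 'true'

# Start MLflow run
mlflow.start_run(run_name='yolo_training_run')

# Log parameters once per run (MLflow rejects re-logging a key with a different value)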
mlflow.log_params({
    'model_name': MODEL_NAME,
    'num_classes': NUM_CLASSES,
    'classes': ', '.join(CLASS_NAMES),
    'epochs': TRAINING_CONFIG['epochs'],
    'batch_size': TRAINING_CONFIG['batch_size'],
    'imgsz': TRAINING_CONFIG['imgsz'],
    'patience': TRAINING_CONFIG['patience'],
    'lr0': TRAINING_CONFIG['lr0'],
    'lrf': TRAINING_CONFIG['lrf'],
    'momentum': TRAINING_CONFIG['momentum'],
    'weight_decay': TRAINING_CONFIG['weight_decay'],
})

print("\n[2] Starting Training")
print("-" * 60)

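# Train. Every hyperparameter from TRAINING_CONFIG is passed explicitly so the
# values logged to MLflow are the values requested from Ultralytics.
# Note: with Ultralytics' default optimizer='auto', lr0 and momentum may be
# chosen automatically; set optimizer explicitly (e.g. 'SGD') to enforce them.
results = model.train(
    data=str(DATA_DIR / 'data.yaml'),
    epochs=TRAINING_CONFIG['epochs'],
    imgsz=TRAINING_CONFIG['imgsz'],
    batch=TRAINING_CONFIG['batch_size'],
    device=TRAINING_CONFIG['device'],
    patience=TRAINING_CONFIG['patience'],
    seed=TRAINING_CONFIG['seed'],
    lr0=TRAINING_CONFIG['lr0'],
    lrf=TRAINING_CONFIG['lrf'],
    momentum=TRAINING_CONFIG['momentum'],
    weight_decay=TRAINING_CONFIG['weight_decay'],
    warmup_epochs=TRAINING_CONFIG['warmup_epochs'],
    warmup_momentum=TRAINING_CONFIG['warmup_momentum'],
    save=True,
    exist_ok=True,
    name='yolo_run',
    verbose=True
)

# Take the output directory from the trainer (runs/detect/yolo_run by default)
# instead of assuming a path relative to the notebook's working directory
best_model_path = Path(model.trainer.save_dir) / 'weights' / 'best.pt'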

if best_model_path.exists():
    # Copy best model to models directory for easy access
    MODELS_DIR.mkdir(parents=True, exist_ok=True)
    dst_best_model = MODELS_DIR / 'best.pt'
    shutil.copy2(str(best_model_path), str(dst_best_model))
    
    # Log training artifacts and metrics
    mlflow.log_artifact(str(best_model_path), artifact_path='models')
    
    # Log training completion and results
    if hasattr(results, 'box') and results.box:
        mlflow.log_metric('final_mAP50', float(results.box.map50))
        mlflow.log_metric('final_mAP50_95', float(results.box.map))
    
    mlflow.log_metric('training_completed', 1)
    print(f"\n✓ Best model found at: {best_model_path}")
    print(f"✓ Best model copied to: {dst_best_model}")
else:
    mlflow.log_metric('training_completed', 0)
    print(f"\n✗ Best model NOT found at {best_model_path}")
    # Try to find where YOLO actually saved it
    search_paths = list(Path('../runs').glob('**/best.pt'))
    if search_paths:
        print(f"Found best.pt at: {search_paths[0]}")

mlflow.end_run()
print("Training run completed and logged to MLflow")


[2] Starting Training
------------------------------------------------------------
Ultralytics 8.4.10  Python-3.10.0 torch-2.5.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Laptop GPU, 8192MiB)
[34m[1mengine\trainer: [0magnostic_nms=False, amp=True, angle=1.0, augment=False, auto_augment=randaugment, batch=16, bgr=0.0, box=7.5, cache=False, cfg=None, classes=None, close_mosaic=10, cls=0.5, compile=False, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=False, cutmix=0.0, data=C:\Users\jordy\OneDrive\Desktop\iaaaa\iajordy2\data\data.yaml, degrees=0.0, deterministic=True, device=0, dfl=1.5, dnn=False, dropout=0.0, dynamic=False, embed=None, end2end=None, epochs=50, erasing=0.4, exist_ok=True, fliplr=0.5, flipud=0.0, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=416, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.01, lrf=0.01, mask_ratio=4, max_det=300, mixup=0.0, mode=train, model=yolov8n.pt, momentum=0.937, m

Stage 3: Execute Training with MLflow Logging

This is the core training stage. The YOLO model is initialized with pretrained COCO weights and fine-tuned on the custom 3-class dataset.

Training flow:
1. Load pretrained YOLOv8n weights (trained on the 80-class COCO dataset)
2. Start MLflow run for experiment tracking
3. Execute model.train() with specified hyperparameters
4. Save the best checkpoint to runs/detect/yolo_run/weights/best.pt (Ultralytics' output directory) and copy it to models/best.pt
5. Log best.pt and the final mAP metrics to MLflow
6. End MLflow run
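If the output folder changes (for example, a run without exist_ok=True is written to yolo_run2 instead of yolo_run), a hard-coded path to best.pt goes stale. A minimal sketch that instead picks the most recently modified best.pt under a runs directory; find_latest_best is a hypothetical helper, not part of Ultralytics.

```python
from pathlib import Path
from typing import Optional


def find_latest_best(runs_root: Path) -> Optional[Path]:
    """Return the most recently modified */weights/best.pt under runs_root, or None."""
    candidates = [p for p in runs_root.glob("**/weights/best.pt") if p.is_file()]
    if not candidates:
        return None
    # Newest modification time wins, so the latest training run is preferred
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Example call in the notebook (RUNS_DIR is defined in the first cell):
# latest = find_latest_best(RUNS_DIR / 'detect')
```

Sorting by modification time is a heuristic: if an older run is retrained in place, it becomes the "latest" again.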

Key points:
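- Transfer learning: training starts from pretrained COCO features instead of random initialization, so far less data and time are needed
- Early stopping: training stops if the validation fitness score does not improve for 10 consecutive epochs (patience=10)
- Best model: best.pt is the checkpoint with the highest validation fitness; Ultralytics also saves last.pt from the final epoch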
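Patience-based early stopping can be illustrated with a small class that tracks the best score and the epoch it occurred in. This is a simplified sketch, not Ultralytics' implementation (which monitors its own fitness score); EarlyStopper is a hypothetical name and the scores are made up. With patience=10 and the best score at epoch 3, training stops at epoch 13.

```python
class EarlyStopper:
    """Patience-based early stopping on a 'higher is better' validation score."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = 0

    def step(self, epoch: int, score: float) -> bool:
        """Record one epoch's score; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch  # this epoch's weights would become best.pt
        # Stop once `patience` epochs have passed without a new best score
        return (epoch - self.best_epoch) >= self.patience


# Hypothetical per-epoch validation scores: improves until epoch 3, then plateaus
scores = [0.30, 0.40, 0.45, 0.50] + [0.49] * 12
stopper = EarlyStopper(patience=10)
for epoch, score in enumerate(scores):
    if stopper.step(epoch, score):
        print(f"stop at epoch {epoch}, best epoch {stopper.best_epoch}")
        break
```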

In [4]:
# Validation
# Use best model from runs directory or models directory
best_model_candidates = [
    Path('../runs/detect/yolo_run/weights/best.pt'),  # Default YOLO location
    Path('../models/best.pt'),  # Copied location
]

best_model_path = None
for candidate in best_model_candidates:
    if candidate.exists():
        best_model_path = candidate
        print(f"Found best model at: {best_model_path}")
        break

if best_model_path:
    mlflow.start_run(run_name='yolo_validation_run')
    
    best_model = YOLO(str(best_model_path))
    val_results = best_model.val(
        data=str(DATA_DIR / 'data.yaml'),
        imgsz=TRAINING_CONFIG['imgsz'],
        batch=TRAINING_CONFIG['batch_size'],
        device=TRAINING_CONFIG['device'],
        verbose=False
    )
    
    mlflow.log_params({
        'validation_dataset': 'val',
        'model_checkpoint': 'best.pt'
    })
    
    if hasattr(val_results, 'box') and val_results.box:
        metrics = {
            'mAP50': float(val_results.box.map50),
            'mAP50_95': float(val_results.box.map),
            'precision': float(val_results.box.p.mean()),
            'recall': float(val_results.box.r.mean()),
        }
        for metric_name, metric_value in metrics.items():
            mlflow.log_metric(metric_name, metric_value)
        
        print("\n[3] Validation Metrics")
        print("-" * 60)
        for metric_name, metric_value in metrics.items():
            print(f"  {metric_name}: {metric_value:.4f}")
    
    mlflow.end_run()
    print("\nValidation completed")
else:
    print("\n✗ Cannot validate: best.pt not found in any expected location")
    print("Expected locations:")
    for candidate in best_model_candidates:
        print(f"  - {candidate}")


Found best model at: ..\runs\detect\yolo_run\weights\best.pt
Ultralytics 8.4.10  Python-3.10.0 torch-2.5.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Laptop GPU, 8192MiB)
Model summary (fused): 73 layers, 3,006,233 parameters, 0 gradients, 8.1 GFLOPs
[34m[1mval: [0mFast image access  (ping: 0.00.0 ms, read: 667.2215.1 MB/s, size: 83.5 KB)
[K[34m[1mval: [0mScanning C:\Users\jordy\OneDrive\Desktop\iaaaa\iajordy2\data\labels\val.cache... 434 images, 0 backgrounds, 0 corrupt: 100% ━━━━━━━━━━━━ 434/434  0.0s
[K                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% ━━━━━━━━━━━━ 28/28 10.6it/s 2.6s0.2s
                   all        434       1040      0.861      0.646      0.759      0.527
Speed: 0.7ms preprocess, 2.0ms inference, 0.0ms loss, 0.8ms postprocess per image
Results saved to [1mC:\Users\jordy\OneDrive\Desktop\iaaaa\iajordy2\runs\detect\val11[0m

[3] Validation Metrics
------------------------------------------------------------
  mAP50:

Stage 4: Model Validation and Metrics Logging

This stage validates the trained model on the validation dataset and logs performance metrics to MLflow.

Validation process:
1. Load the best.pt checkpoint
2. Execute model.val() on the validation split with the same image size, batch size, and device used in training
3. Extract performance metrics from validation results

Metrics computed:
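- mAP50: mean Average Precision at an IoU threshold of 0.50, averaged over classes
- mAP50_95: mean Average Precision averaged over the ten IoU thresholds 0.50, 0.55, ..., 0.95
- precision: fraction of predicted boxes that are correct (IoU ≥ 0.50 with a matching ground-truth box), averaged over classes
- recall: fraction of ground-truth objects that are detected, averaged over classes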
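To make these definitions concrete, a sketch of IoU between two boxes, precision and recall from match counts, and the ten thresholds averaged by mAP50-95. Ultralytics' matching of predictions to ground truth and its AP interpolation are more involved; box_iou and precision_recall are illustrative helpers only, and the counts are made up.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2) corner coordinates


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes in xyxy format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The ten IoU thresholds averaged by mAP50-95: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS: List[float] = [round(0.50 + 0.05 * i, 2) for i in range(10)]

print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(precision_recall(tp=8, fp=2, fn=4))
print(IOU_THRESHOLDS)
```

For the two 10×10 boxes above, which share half their area, the intersection is 50 and the union 150, so IoU is 1/3: below the 0.50 threshold, so this pair would not count as a match for mAP50.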

MLflow logs:
- All metrics for experiment tracking and comparison
- The checkpoint name (best.pt) as a run parameter
- The evaluated split ('val') as a run parameter