# Hand Keypoint Detection using YOLOv11

## Introduction

This notebook walks you through using YOLOv11 (You Only Look Once) to detect 21 keypoints on the human hand by training a YOLO pose model on the Ultralytics Hand Keypoints dataset, which contains 26,768 images.

## Credits

- Dataset: [Ultralytics Hand Keypoints](https://docs.ultralytics.com/datasets/pose/hand-keypoints/)
- Reference: [Ultralytics documentation](https://docs.ultralytics.com/datasets/pose/hand-keypoints/)
- Model: [Ultralytics](https://github.com/ultralytics/ultralytics)

## Table of Contents

- [Introduction](#introduction)
- [Prerequisites](#prerequisites)
- [Installing YOLOv11](#installing-yolov11)
- [Getting the Dataset](#getting-the-dataset)
- [Training](#training)
- [Testing](#testing)
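- [Performance](#performance)
- [Video Testing](#video-testing)
- [Define keypoint connections for drawing skeleton](#define-keypoint-connections-for-drawing-skeleton)
- [WebCam Testing](#webcam-testing)
- [Usage](#usage)
- [Portable Weights](#portable-weights)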

## Prerequisites

We'll be using the following tools for this notebook:
- Ultralytics YOLOv11 model
- Ultralytics Hand Keypoints dataset

Make sure you have access to a GPU for faster computation. Run the `nvidia-smi` command and check that the output looks something like the following:

In [1]:
!nvidia-smi

Tue Dec 16 00:48:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L40S                    On  |   00000000:30:00.0 Off |                    0 |
| N/A   36C    P0             85W /  350W |    2200MiB /  46068MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------
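
Outside a notebook, the same GPU check can be scripted. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH` when an NVIDIA driver is installed; the helper name `gpu_summary` is made up for this illustration.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the GPU list reported by nvidia-smi, or None if it is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # no NVIDIA driver tools on PATH
    try:
        out = subprocess.run(
            [exe, "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # the tool exists but could not query a GPU
    return out.stdout.strip() or None


print(gpu_summary() or "No NVIDIA GPU detected.")
```

If this prints no GPU, the `device=0` argument used for training later in this notebook will likely need to be changed.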

Check that the current working directory is the root directory of the project:

In [2]:
import os
HOME = os.getcwd()
print(HOME)

/home/ubuntu/Motion_MVP1_Yolo


## Installing YOLOv11

The Ultralytics package includes all the libraries and dependencies needed to run YOLOv11, so installation is simple:

In [3]:
!pip install ultralytics

Defaulting to user installation because normal site-packages is not writeable
Collecting ultralytics
  Downloading ultralytics-8.3.239-py3-none-any.whl (1.1 MB)
Collecting matplotlib>=3.3.0
  Downloading matplotlib-3.10.8-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.7 MB)
Collecting numpy>=1.23.0
  Downloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)

Make sure that ultralytics is installed correctly

In [4]:
from IPython import display
display.clear_output()

import ultralytics
ultralytics.checks()

Ultralytics 8.3.239 🚀 Python-3.10.12 torch-2.9.1+cu128 CUDA:0 (NVIDIA L40S, 45458MiB)
Setup complete ✅ (8 CPUs, 61.9 GB RAM, 107.1/123.9 GB disk)
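
If you want to record package versions without importing the packages themselves, the standard library can read them from the installed metadata. This is a small sketch; the helper name `installed_version` and the list of packages checked are choices made for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("ultralytics", "torch", "opencv-python"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```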


## Getting the Dataset

The dataset is fetched automatically using the predefined YAML file for the Hand Keypoints dataset, and training starts once the download completes.

- Name: Ultralytics Hand Keypoints Dataset
- Format: YOLOv11
- Images count: 26,768
- Image size: 640x640
- Keypoints count: 21
- Train/Val/Test Distribution: 70%:30%:0%
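
The split percentages can be checked against the image counts. The sketch below assumes that every image outside the validation set belongs to the training set (the test share is 0%); the validation count of 7,992 is the number of images scanned in the validation log later in this notebook.

```python
TOTAL_IMAGES = 26_768  # from the dataset summary above
VAL_IMAGES = 7_992     # reported by the validation scan in this notebook

# Assumption: all remaining images are training images
train_images = TOTAL_IMAGES - VAL_IMAGES
train_share = train_images / TOTAL_IMAGES
val_share = VAL_IMAGES / TOTAL_IMAGES

print(f"train: {train_images} images ({train_share:.1%})")
print(f"val:   {VAL_IMAGES} images ({val_share:.1%})")
```

This gives 18,776 training images, roughly a 70/30 split.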

## Training

Run the following code to get the dataset and start training.

### Download the hand-keypoints.yaml file from the following Git repo

https://github.com/ultralytics/ultralytics/blob/main/ultralytics/cfg/datasets/hand-keypoints.yaml
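
To get a feel for what the labels behind this YAML file look like, here is a hedged sketch of a parser for one label row. It assumes the usual YOLO pose layout of a class id, a normalized box (`cx cy w h`), then `x y visibility` for each of the 21 keypoints; check the `kpt_shape` entry in the YAML file to confirm. The function name and the sample row are made up for illustration.

```python
NUM_KEYPOINTS = 21  # matches the dataset summary
DIMS = 3            # assumed: x, y, visibility per keypoint


def parse_pose_label(line):
    """Split one label row into class id, box (cx, cy, w, h) and keypoints."""
    values = line.split()
    expected = 1 + 4 + NUM_KEYPOINTS * DIMS
    if len(values) != expected:
        raise ValueError(f"expected {expected} fields, got {len(values)}")
    class_id = int(values[0])
    box = tuple(float(v) for v in values[1:5])
    flat = [float(v) for v in values[5:]]
    # Group the flat list into (x, y, visibility) triples
    keypoints = [tuple(flat[i:i + DIMS]) for i in range(0, len(flat), DIMS)]
    return class_id, box, keypoints


# Synthetic row for illustration only (not taken from the dataset)
row = "0 0.5 0.5 0.2 0.3 " + " ".join("0.5 0.5 2" for _ in range(NUM_KEYPOINTS))
class_id, box, keypoints = parse_pose_label(row)
print(class_id, box, len(keypoints), keypoints[0])
```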

### References
https://docs.ultralytics.com/tasks/pose/#val

This notebook was downloaded from Kaggle using the link below:

https://www.kaggle.com/code/mak175/hand-keypoint-detection/edit?fromFork=1

In [7]:
from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n-pose.pt")  # load a pretrained model (recommended for training)

# Train the model
results = model.train(data="hand-keypoints.yaml", epochs=10, imgsz=640, save=True, device=0)

Ultralytics 8.3.239 🚀 Python-3.10.12 torch-2.9.1+cu128 CUDA:0 (NVIDIA L40S, 45458MiB)
engine/trainer: agnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=16, bgr=0.0, box=7.5, cache=False, cfg=None, classes=None, close_mosaic=10, cls=0.5, compile=False, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=False, cutmix=0.0, data=hand-keypoints.yaml, degrees=0.0, deterministic=True, device=0, dfl=1.5, dnn=False, dropout=0.0, dynamic=False, embed=None, epochs=10, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.0, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=640, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.01, lrf=0.01, mask_ratio=4, max_det=300, mixup=0.0, mode=train, model=yolo11n-pose.pt, momentum=0.937, mosaic=1.0, multi_scale=False, name=train3, nbs=64, nms=False, opset=None, optimize=False, optimizer=auto, overlap_mask=True, patience=100, perspective=0.0, plots

KeyboardInterrupt: 

### If training is accidentally stopped, resume it with the code below

In [8]:
from ultralytics import YOLO

# Load the last checkpoint
model = YOLO("runs/pose/train3/weights/last.pt")

# Resume training
model.train(resume=True)


Ultralytics 8.3.239 🚀 Python-3.10.12 torch-2.9.1+cu128 CUDA:0 (NVIDIA L40S, 45458MiB)
engine/trainer: agnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=16, bgr=0.0, box=7.5, cache=False, cfg=None, classes=None, close_mosaic=10, cls=0.5, compile=False, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=False, cutmix=0.0, data=hand-keypoints.yaml, degrees=0.0, deterministic=True, device=0, dfl=1.5, dnn=False, dropout=0.0, dynamic=False, embed=None, epochs=10, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.0, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=640, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.01, lrf=0.01, mask_ratio=4, max_det=300, mixup=0.0, mode=train, model=runs/pose/train3/weights/last.pt, momentum=0.937, mosaic=1.0, multi_scale=False, name=train3, nbs=64, nms=False, opset=None, optimize=False, optimizer=auto, overlap_mask=True, patience=100, persp

ultralytics.utils.metrics.PoseMetrics object with attributes:

ap_class_index: array([0])
box: ultralytics.utils.metrics.Metric object
confusion_matrix: <ultralytics.utils.metrics.ConfusionMatrix object at 0x7843fd1ee980>
curves: ['Precision-Recall(B)', 'F1-Confidence(B)', 'Precision-Confidence(B)', 'Recall-Confidence(B)', 'Precision-Recall(B)', 'F1-Confidence(B)', 'Precision-Confidence(B)', 'Recall-Confidence(B)', 'Precision-Recall(P)', 'F1-Confidence(P)', 'Precision-Confidence(P)', 'Recall-Confidence(P)']
curves_results: [[array([          0,    0.001001,    0.002002,    0.003003,    0.004004,    0.005005,    0.006006,    0.007007,    0.008008,    0.009009,     0.01001,    0.011011,    0.012012,    0.013013,    0.014014,    0.015015,    0.016016,    0.017017,    0.018018,    0.019019,     0.02002,    0.021021,    0.022022,    0.023023,
          0.024024,    0.025025,    0.026026,    0.027027,    0.028028,    0.029029,     0.03003,    0.031031,    0.032032,    0.033033,    0.034034, 

**NOTE**: Some images in the dataset might be corrupted or non-normalized. You can ignore the warnings related to them if you encounter any.
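
These warnings are raised when a label contains coordinates outside the normalized range. Below is a sketch of that kind of check, assuming valid coordinates must lie in [0, 1]; the function name is hypothetical and this is not the library's own validation code. The sample values are copied from the `IMG_00001885.jpg` warning in the validation log of this notebook.

```python
def out_of_bounds(coords, low=0.0, high=1.0):
    """Return the coordinates that fall outside the normalized range."""
    return [c for c in coords if not (low <= c <= high)]


# Values from the IMG_00001885.jpg warning
bad = out_of_bounds([1.01038, 1.0125777])
# A well-formed label produces an empty list
good = out_of_bounds([0.25, 0.5, 1.0])

print("offending values:", bad)
print("clean label:", good)
```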

The model was trained for only 10 epochs because of resource constraints: long training runs caused the GPU to overheat and throttle, which slowed the process down. You can increase the number of epochs for better results if you have more resources available.

## Testing

Let's test our model performance on the validation set.

In [10]:
model = YOLO(f"{HOME}/runs/pose/train3/weights/best.pt")

# No need to pass imgsz, data, conf, batch, etc.; the model reuses the values specified during training
metrics = model.val()
metrics.box.map    # mAP50-95
metrics.box.map50  # mAP50
metrics.box.map75  # mAP75
metrics.box.maps   # per-class mAP50-95 (the value displayed below)

Ultralytics 8.3.239 🚀 Python-3.10.12 torch-2.9.1+cu128 CUDA:0 (NVIDIA L40S, 45458MiB)
YOLO11n-pose summary (fused): 109 layers, 2,956,000 parameters, 0 gradients, 7.8 GFLOPs
val: Fast image access ✅ (ping: 0.0±0.0 ms, read: 954.6±256.3 MB/s, size: 15.9 KB)
val: Scanning /home/ubuntu/Motion_MVP1_Yolo/datasets/hand-keypoints/labels/val.cache... 7992 images, 0 backgrounds, 145 corrupt: 100% ━━━━━━━━━━━━ 7992/7992 10.5Mit/s 0.0s
val: /home/ubuntu/Motion_MVP1_Yolo/datasets/hand-keypoints/images/val/IMG_00001884.jpg: ignoring corrupt image/label: non-normalized or out of bounds coordinates [1.0106432 1.0375395 1.0459943 1.030653  1.0159814 1.0682954 1.0765926
 1.0667012]
val: /home/ubuntu/Motion_MVP1_Yolo/datasets/hand-keypoints/images/val/IMG_00001885.jpg: ignoring corrupt image/label: non-normalized or out of bounds coordinates [1.01038   1.0125777]
val: /home/ubuntu/Motion_MVP1_Yolo/dataset

array([    0.87033])

## Performance

YOLO saves detailed visualisations of model performance. Check the `runs/pose/train` folder (`runs/pose/train3` for the run above) for training performance, and the `runs/pose/val` folder for validation performance.

We achieved a mAP of 0.805, which is a solid result for only 10 epochs of training; whether it is reliable enough depends on the accuracy your keypoint detection task requires.
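
Keypoint mAP is typically built on Object Keypoint Similarity (OKS), which plays the role that IoU plays for boxes. The sketch below follows the COCO-style definition; the per-keypoint constants `k` and the object area used here are placeholders chosen for illustration, and the exact constants and normalization used by Ultralytics may differ.

```python
import math


def oks(pred, truth, visible, area, k):
    """COCO-style Object Keypoint Similarity between predicted and true keypoints."""
    total, count = 0.0, 0
    for (px, py), (tx, ty), v, ki in zip(pred, truth, visible, k):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - tx) ** 2 + (py - ty) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0


# Synthetic hand: 21 labeled keypoints inside a 100 x 100 pixel box
truth = [(10.0 + 4 * i, 50.0) for i in range(21)]
visible = [2] * 21
k = [0.05] * 21          # placeholder per-keypoint constants
area = 100.0 * 100.0     # placeholder object area in pixels^2

perfect = oks(truth, truth, visible, area, k)
shifted = oks([(x + 5.0, y) for x, y in truth], truth, visible, area, k)
print(f"perfect prediction: {perfect:.3f}")
print(f"every keypoint off by 5 px: {shifted:.3f}")
```

A prediction counts as correct at a given threshold when its OKS exceeds that threshold, and mAP averages precision over a range of such thresholds.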

## Video Testing

In [None]:
import cv2
from ultralytics import YOLO

path = ["/home/ubuntu/Motion_Nerf_MVP1/data/Barista_coffee.mp4", 
        "/home/ubuntu/Motion_Nerf_MVP1/data/McDonalds.mp4",
        "/home/ubuntu/Motion_Nerf_MVP1/data/McDonalds_POV.mp4"]

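# Open the third clip in the list (McDonalds_POV.mp4)
cap = cv2.VideoCapture(path[2])

# Raise instead of calling exit(), which would shut down the notebook kernel
if not cap.isOpened():
    raise RuntimeError("Error: Could not open video file.")

# Load the best weights from the training run
model = YOLO("/home/ubuntu/Motion_MVP1_YOLO/src/runs/pose/train3/weights/best.pt")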

while True:
    ret, frame = cap.read()
    if not ret:
        print("End of video or cannot read the frame.")
        break

    results = model.predict(frame)
    annotated_frame = results[0].plot()

    # Resize the output window to 50% of the original size
    resized_frame = cv2.resize(annotated_frame, None, fx=0.5, fy=0.5)

    cv2.imshow("YOLO Inference on Video", resized_frame)

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()



0: 384x640 (no detections), 57.2ms
Speed: 2.2ms preprocess, 57.2ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 11.1ms
Speed: 2.2ms preprocess, 11.1ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 10.5ms
Speed: 2.3ms preprocess, 10.5ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.6ms
Speed: 1.7ms preprocess, 9.6ms inference, 0.5ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 11.3ms
Speed: 1.1ms preprocess, 11.3ms inference, 0.7ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 11.0ms
Speed: 1.0ms preprocess, 11.0ms inference, 0.5ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.6ms
Speed: 1.3ms preprocess, 9.6ms inference, 0.5ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 11.0ms
Speed: 1.4ms preprocess, 11.0ms infer

KeyboardInterrupt: 
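
The `Speed:` lines in this output can be turned into a rough throughput estimate. The sketch below parses two lines copied from the log above and sums the preprocess, inference, and postprocess times; it ignores video decoding and display time, so real end-to-end FPS will be lower. The helper name is made up for this example.

```python
import re

SPEED_RE = re.compile(
    r"Speed: ([\d.]+)ms preprocess, ([\d.]+)ms inference, ([\d.]+)ms postprocess"
)


def frame_times_ms(log_text):
    """Return total per-frame latency (ms) for each complete 'Speed:' line."""
    return [sum(float(x) for x in m.groups()) for m in SPEED_RE.finditer(log_text)]


# Two lines copied from the output above
log = """
Speed: 2.2ms preprocess, 57.2ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)
Speed: 2.2ms preprocess, 11.1ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)
"""
times = frame_times_ms(log)
steady = times[1:]  # skip the first frame, which likely includes one-time warm-up
fps = 1000.0 / (sum(steady) / len(steady))
print(f"steady-state latency: {steady[0]:.1f} ms, about {fps:.0f} FPS")
```

On this hardware the model itself runs at roughly 72 frames per second once warmed up.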


## Define keypoint connections for drawing skeleton

In [1]:
import cv2
import numpy as np
from ultralytics import YOLO

path = ["/home/ubuntu/Motion_Nerf_MVP1/data/Barista_coffee.mp4", 
        "/home/ubuntu/Motion_Nerf_MVP1/data/McDonalds.mp4",
        "/home/ubuntu/Motion_Nerf_MVP1/data/McDonalds_POV.mp4"]

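# Open the third clip in the list (McDonalds_POV.mp4)
cap = cv2.VideoCapture(path[2])

# Raise instead of calling exit(), which would shut down the notebook kernel
if not cap.isOpened():
    raise RuntimeError("Error: Could not open video file.")

# Load the best weights from the training run
model = YOLO("/home/ubuntu/Motion_MVP1_YOLO/src/runs/pose/train3/weights/best.pt")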

POSE_CONNECTIONS = [
    (0, 1), (1, 2), (2, 3), (3, 4),        # Thumb
    (0, 5), (5, 6), (6, 7), (7, 8),        # Index
    (0, 9), (9,10), (10,11), (11,12),      # Middle
    (0,13), (13,14), (14,15), (15,16),     # Ring
    (0,17), (17,18), (18,19), (19,20)      # Pinky
]

while True:
    ret, frame = cap.read()
    if not ret:
        print("End of video or cannot read the frame.")
        break

    results = model.predict(frame)
    annotated_frame = frame.copy()

    for kp_tensor in results[0].keypoints.xy:
        kpts = kp_tensor.cpu().numpy().astype(int)

        # Draw keypoints, skipping undetected ones reported as (0, 0)
        for x, y in kpts:
            if x or y:
                cv2.circle(annotated_frame, (int(x), int(y)), 4, (0, 255, 0), -1)

        # Draw lines
        for i, j in POSE_CONNECTIONS:
            if i < len(kpts) and j < len(kpts):
                pt1 = tuple(int(v) for v in kpts[i])
                pt2 = tuple(int(v) for v in kpts[j])
                # Skip only when a point is (0, 0); all() would also drop points on an image edge
                if any(pt1) and any(pt2):
                    cv2.line(annotated_frame, pt1, pt2, (255, 0, 0), 2)

    # Resize the output window to 50% of the original size
    resized_frame = cv2.resize(annotated_frame, None, fx=0.5, fy=0.5)

    cv2.imshow("YOLO Skeleton Pose", resized_frame)

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()



0: 384x640 (no detections), 55.0ms
Speed: 2.2ms preprocess, 55.0ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.3ms
Speed: 2.1ms preprocess, 9.3ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.6ms
Speed: 1.8ms preprocess, 9.6ms inference, 0.5ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.1ms
Speed: 1.7ms preprocess, 9.1ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.1ms
Speed: 1.1ms preprocess, 9.1ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.1ms
Speed: 1.0ms preprocess, 9.1ms inference, 0.7ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.0ms
Speed: 1.1ms preprocess, 9.0ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 9.2ms
Speed: 1.1ms preprocess, 9.2ms inference, 0.7m
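
The skeleton cell treats `(0, 0)` as a missing keypoint. Here is a sketch of that filtering as a standalone function that can be tested without a video or a model; it skips a point only when both coordinates are zero, so points lying on an image edge are kept. The function name `skeleton_segments` is made up for this example, and the connection list repeats the one defined in the cell above.

```python
HAND_CONNECTIONS = [
    (0, 1), (1, 2), (2, 3), (3, 4),         # Thumb
    (0, 5), (5, 6), (6, 7), (7, 8),         # Index
    (0, 9), (9, 10), (10, 11), (11, 12),    # Middle
    (0, 13), (13, 14), (14, 15), (15, 16),  # Ring
    (0, 17), (17, 18), (18, 19), (19, 20),  # Pinky
]


def skeleton_segments(keypoints, connections=HAND_CONNECTIONS):
    """Return ((x1, y1), (x2, y2)) pairs for connections whose endpoints both exist."""
    def present(pt):
        return pt[0] != 0 or pt[1] != 0  # (0, 0) marks a missing keypoint

    segments = []
    for i, j in connections:
        if i < len(keypoints) and j < len(keypoints):
            a, b = keypoints[i], keypoints[j]
            if present(a) and present(b):
                segments.append(((int(a[0]), int(a[1])), (int(b[0]), int(b[1]))))
    return segments


# Synthetic keypoints: thumb tip missing, index tip on the left image edge
pts = [(i + 1, i + 1) for i in range(21)]
pts[4] = (0, 0)    # missing, so the (3, 4) segment is dropped
pts[8] = (0, 50)   # on the edge, so the (7, 8) segment is kept
segments = skeleton_segments(pts)
print(len(segments), "of", len(HAND_CONNECTIONS), "segments drawn")
```

Each returned pair can be passed directly to `cv2.line` as the two endpoints.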

## WebCam Testing

You can detect hand keypoints in real time by running the model on a webcam video stream captured with OpenCV.

In [None]:
import cv2
from ultralytics import YOLO

cap = cv2.VideoCapture(0)

# Raise instead of calling exit(), which would shut down the notebook kernel
if not cap.isOpened():
    raise RuntimeError("Error: Could not open webcam.")

model = YOLO('best.pt')

while True:

    ret, frame = cap.read()

    if not ret:
        print("Error: Could not read frame.")
        break

    result = model.predict(frame)

    cv2.imshow('Webcam Video Stream', result[0].plot())

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

## Usage

The YOLO weights are now fine-tuned to detect hand keypoints, so you can use them as a basis for detecting hand gestures.

The trained weights are stored in the `runs/pose/train/weights/best.pt` file (`runs/pose/train3/weights/best.pt` for the run in this notebook). To use them on your images, first load the weights into a YOLO model, then use the model to detect hand keypoints in your images.
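
Once you have the 21 keypoints for a hand, simple geometry can turn them into gestures. The following is a hypothetical heuristic, not part of Ultralytics: a finger counts as extended when its tip is farther from the wrist than its middle joint. It assumes the keypoint ordering implied by `POSE_CONNECTIONS` (0 is the wrist, 1 to 4 the thumb, 5 to 8 the index finger, and so on), and the sample coordinates are synthetic.

```python
import math

WRIST = 0
# (tip, middle joint) index pairs, following the ordering used by POSE_CONNECTIONS
FINGERS = {
    "thumb": (4, 2),
    "index": (8, 6),
    "middle": (12, 10),
    "ring": (16, 14),
    "pinky": (20, 18),
}


def extended_fingers(kpts):
    """Return names of fingers whose tip lies farther from the wrist than its middle joint."""
    wx, wy = kpts[WRIST]

    def dist(idx):
        x, y = kpts[idx]
        return math.hypot(x - wx, y - wy)

    return [name for name, (tip, joint) in FINGERS.items() if dist(tip) > dist(joint)]


# Synthetic open hand: wrist at (100, 100), each finger a straight line of 4 points
open_hand = [(100.0, 100.0)] + [
    (100.0 + 10 * f, 100.0 - 10 * step) for f in range(5) for step in range(1, 5)
]
# Same hand with the index fingertip curled back toward the wrist
curled = list(open_hand)
curled[8] = (105.0, 95.0)

print(extended_fingers(open_hand))
print(extended_fingers(curled))
```

In practice `kpts` would come from `results[0].keypoints.xy` for one detected hand, converted to a list of `(x, y)` pairs. A heuristic like this is sensitive to hand orientation and missing keypoints, so treat it as a starting point.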

## Portable Weights

You can now use your trained model on any machine with ultralytics installed. Just copy the weights file `runs/pose/train/weights/best.pt` to that machine and load the YOLO model with those weights.
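
When moving weights between machines, it can be worth confirming that the copy is intact before loading it. This is a minimal sketch using only the standard library; the helper names are invented for this example, and the demonstration uses a stand-in file rather than real weights.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_weights(src, dst_dir):
    """Copy a weights file into dst_dir and verify the copy via SHA-256."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = Path(shutil.copy2(src, dst_dir))
    if sha256_of(src) != sha256_of(dst):
        raise IOError(f"checksum mismatch after copying {src}")
    return dst


# Demonstration with a stand-in file; replace it with your real best.pt path
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "best.pt"
    fake.write_bytes(b"stand-in weights")
    copied = copy_weights(fake, Path(tmp) / "deploy")
    print(copied.name, copied.exists())
```

For a transfer across machines, compute the hash on both sides and compare the two strings.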

Your model is now ready to _read_ your gestures!