# Hand Gesture Classification Using Deep Learning

## Problem Description

In ML1, I worked on a multi-class hand gesture classification problem using classical machine learning models such as SVM and Random Forest.

Each sample consists of 21 hand landmarks extracted from MediaPipe.  
Each landmark has (x, y, z) coordinates, so the total number of input features is:

21 × 3 = 63 features.

The goal is to predict the gesture label (e.g., fist, call, dislike).

This is a supervised multi-class classification problem where:

X ∈ R^63  
y ∈ {0, 1, ..., K-1}
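
To make the feature layout concrete, the sketch below reshapes one flat 63-dimensional sample into a (21, 3) array of landmarks. It assumes the columns are ordered x0, y0, z0, x1, y1, z1, ..., which is how the preprocessing code in this notebook indexes them. The sample values are synthetic placeholders, not real MediaPipe output.

```python
import numpy as np

NUM_LANDMARKS = 21
COORDS_PER_LANDMARK = 3

# Synthetic flat sample (placeholder values, not real MediaPipe output).
flat_sample = np.arange(NUM_LANDMARKS * COORDS_PER_LANDMARK, dtype=np.float32)

# Assumed column order: x0, y0, z0, x1, y1, z1, ...
landmarks = flat_sample.reshape(NUM_LANDMARKS, COORDS_PER_LANDMARK)

wrist = landmarks[0]       # landmark 0 is the wrist
index_tip = landmarks[8]   # landmark 8 is the index fingertip

print(flat_sample.shape, landmarks.shape)
```

With this layout, landmark `i` occupies columns `3*i`, `3*i + 1`, and `3*i + 2` of the flat vector.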

In this notebook, I reimplement the same problem using a Deep Neural Network.

In [69]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

## Data Preprocessing

1. **Label Encoding** — gesture names are encoded into integer class labels.

2. **Translation Normalization** — X and Y coordinates are shifted so that
   landmark 0 (wrist) becomes the origin. This makes the representation
   position-invariant.

3. **Scale Normalization** — X and Y coordinates are divided by the mean
   distance from the wrist to the 4 fingertips (index, middle, ring, pinky).
   This makes the representation scale-invariant regardless of hand size or
   distance from the camera.
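
The two normalization steps can also be sketched as a single vectorized NumPy function. This is an illustrative alternative, not the code used in this notebook: the function name `normalize_landmarks` and the random test data are assumptions. The check at the end moves and enlarges a hand in the x/y plane and confirms that the normalized features do not change.

```python
import numpy as np

FINGERTIPS = [8, 12, 16, 20]  # index, middle, ring, pinky (thumb excluded)

def normalize_landmarks(flat, eps=1e-6):
    """Normalize an (N, 63) array laid out as x0, y0, z0, x1, y1, z1, ..."""
    pts = np.asarray(flat, dtype=np.float64).reshape(-1, 21, 3).copy()

    # Translation: shift x and y so the wrist (landmark 0) becomes the origin.
    wrist_xy = pts[:, :1, :2].copy()
    pts[:, :, :2] -= wrist_xy

    # Scale: mean 2D distance from the wrist to the four fingertips.
    tip_dist = np.linalg.norm(pts[:, FINGERTIPS, :2], axis=2)  # shape (N, 4)
    scale = np.maximum(tip_dist.mean(axis=1), eps)             # shape (N,)
    pts[:, :, :2] /= scale[:, None, None]

    # z is left untouched, matching the steps described above.
    return pts.reshape(-1, 63)

# Hypothetical hand, then the same hand shifted and enlarged in x/y.
rng = np.random.default_rng(0)
hand = rng.random((1, 63))
shifted = hand.reshape(1, 21, 3).copy()
shifted[:, :, :2] = shifted[:, :, :2] * 2.5 + 0.3
shifted = shifted.reshape(1, 63)

print(np.allclose(normalize_landmarks(hand), normalize_landmarks(shifted)))
```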

In [70]:
# Load dataset
df = pd.read_csv("/Users/ahmedtarek/Developer/Python/DL/hand_landmarks_data.csv")

X = df.drop('label', axis=1).copy()
y = df['label']

x_cols = list(range(0, X.shape[1], 3))
y_cols = [c + 1 for c in x_cols]
z_cols = [c + 2 for c in x_cols]

# 1️⃣ Translation normalization (X and Y only)
X.iloc[:, x_cols] = X.iloc[:, x_cols].sub(X.iloc[:, 0], axis=0)
X.iloc[:, y_cols] = X.iloc[:, y_cols].sub(X.iloc[:, 1], axis=0)

# 2️⃣ Scale normalization using 4 fingertips (no thumb)
fingertip_indices = [8, 12, 16, 20]

fingertip_distances = []
for tip in fingertip_indices:
    col_x = tip * 3
    col_y = tip * 3 + 1
    dist = np.sqrt(X.iloc[:, col_x]**2 + X.iloc[:, col_y]**2)
    fingertip_distances.append(dist)

div = np.mean(fingertip_distances, axis=0)
div = np.where(div < 1e-6, 1e-6, div)

X.iloc[:, x_cols] = X.iloc[:, x_cols].div(div, axis=0)
X.iloc[:, y_cols] = X.iloc[:, y_cols].div(div, axis=0)

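# Note: no StandardScaler is applied; "X_scaled" holds the translation- and
# scale-normalized landmarks produced by the steps above.
X_scaled = X.values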

# Encode labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

## Train / Validation / Test Split

The dataset is split into:

- Training set → used to update model parameters.
- Validation set → used to monitor generalization and apply early stopping.
- Test set → used only once at the end for final evaluation.

This separation ensures that the test set remains completely unseen during training.  
It allows us to measure the true generalization performance of the model.
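
As a quick sanity check of the split proportions, the sketch below applies the same two-stage stratified split to synthetic data. The array names and the 1000-sample, 4-class dataset are made up for illustration; only the `train_test_split` arguments mirror this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 63 features, 4 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 63))
y_demo = np.repeat(np.arange(4), 250)

# First hold out 30%, then split that portion in half: 70 / 15 / 15.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

print(len(X_tr), len(X_va), len(X_te))
# Stratification keeps the class counts nearly equal inside every subset.
print(np.bincount(y_tr), np.bincount(y_va), np.bincount(y_te))
```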

In [71]:
# Train-validation-test split
X_train, X_temp, y_train, y_temp = train_test_split(
    X_scaled, y_encoded, test_size=0.3, random_state=42, stratify=y_encoded
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Convert to PyTorch tensors
X_train = torch.FloatTensor(X_train)
y_train = torch.LongTensor(y_train)

X_val = torch.FloatTensor(X_val)
y_val = torch.LongTensor(y_val)

X_test = torch.FloatTensor(X_test)
y_test = torch.LongTensor(y_test)

## Neural Network Architecture

I implement a fully connected neural network (Multi-Layer Perceptron).

Structure:
- Input layer: 63 neurons (one per feature)
- Hidden Layer 1: 128 neurons + ReLU + Dropout (p = 0.3)
- Hidden Layer 2: 64 neurons + ReLU + Dropout (p = 0.3)
- Output layer: K neurons (number of gesture classes)

Why ReLU?

ReLU(x) = max(0, x)

ReLU helps reduce the vanishing gradient problem and allows deeper models to train more efficiently compared to sigmoid or tanh.

The model learns nonlinear transformations of the input features, which lets it capture decision boundaries that a linear classifier cannot represent.
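
To give a sense of the model's size, the sketch below counts the trainable parameters for the layer widths listed above (63, 128, 64, K). The helper names are hypothetical, and K = 18 is taken from the class count quoted in the conclusion.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Total trainable parameters of an MLP with the given layer widths."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

# K = 18 is an assumption based on the class count quoted in the conclusion.
K = 18
total = mlp_params([63, 128, 64, K])
print(total)  # 17618
```

Activation and dropout layers add no parameters, so they do not appear in the count.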

In [72]:
class GestureNN(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(64, num_classes)
        )

    def forward(self, x):
        return self.model(x)

model = GestureNN(input_size=63, num_classes=len(le.classes_))

## Loss Function and Optimization

For multi-class classification, I use CrossEntropyLoss.

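Cross-entropy measures the mismatch between the predicted class probabilities and the true class label. In PyTorch, `nn.CrossEntropyLoss` applies log-softmax internally, so the network outputs raw scores (logits) and needs no softmax layer.

Mathematically, with a one-hot true label y and predicted probabilities ŷ over the K classes:

\[
L = -\sum_{i=0}^{K-1} y_i \log(\hat{y}_i)
\]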

This encourages the model to assign high probability to the correct class.
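
A worked example may help here. The NumPy sketch below computes the loss for a single sample from raw scores, using a numerically stable log-softmax. The function name and the scores are hypothetical. Because the true label is one-hot, the sum reduces to the negative log-probability of the correct class.

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one sample, given raw scores and the true class index."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

logits = np.array([2.0, 0.5, -1.0])    # hypothetical scores for 3 classes

loss_correct = cross_entropy_from_logits(logits, target=0)  # top score is right
loss_wrong = cross_entropy_from_logits(logits, target=2)    # top score is wrong
print(loss_correct, loss_wrong)

# A model that spreads probability evenly over 18 classes has loss ln(18).
uniform_loss = cross_entropy_from_logits(np.zeros(18), target=0)
print(uniform_loss)
```

The uniform case gives ln(18) ≈ 2.89, which is close to the first-epoch loss in the training log and suggests the network starts out near uniform predictions.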

For optimization, I use Adam.

Adam combines:
- Momentum (to accelerate convergence)
- Adaptive learning rates (to handle different parameter scales)

This makes training more stable and faster compared to standard gradient descent.
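
The two ingredients can be seen in a single Adam update, sketched below in NumPy. The function name is hypothetical; the default `beta1`, `beta2`, and `eps` values are the PyTorch defaults, and `lr` matches the value used in this notebook.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad         # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])   # gradients on very different scales
theta_new, m, v = adam_step(theta, grad, np.zeros(2), np.zeros(2), t=1)

step = theta - theta_new
print(step)
```

Although the two gradients differ by four orders of magnitude, both parameters move by roughly `lr` on the first step. This is the adaptive learning rate at work: each step is normalized by the gradient's own magnitude.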

In [73]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

## Training Process with Early Stopping

During training, each batch of data goes through the following steps:

1. **Forward pass** – compute predictions.  
2. **Compute loss** – measure the difference between predictions and true labels.  
3. **Backward pass** – compute gradients using backpropagation.  
4. **Update weights** – adjust model parameters with the Adam optimizer. Adam refines the basic gradient descent rule below by adding momentum and per-parameter step sizes:

\[
\theta = \theta - \alpha \nabla L(\theta)
\]

Where:  
- **θ** represents the model parameters.  
- **α** is the learning rate.  
- **∇L(θ)** is the gradient of the loss with respect to the parameters.  

This process is repeated for multiple epochs until the model converges.

### Early Stopping (Regularization)

To prevent overfitting, I implement **early stopping** based on validation loss:

- Training loss usually decreases continuously.  
- Validation loss decreases initially but may start increasing once overfitting begins.  

I monitor the validation loss after each epoch. If it does not improve for a fixed number of epochs (called **patience**, set to 15 here), training stops and the best model (the one with the lowest validation loss) is restored for the final evaluation.  

By selecting the parameters that minimize validation loss rather than just training loss, early stopping acts as an **implicit regularization method**, helping the model generalize better to unseen data.
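
The patience rule can be isolated from the training loop and tested on a plain list of losses. The sketch below is a minimal illustration; the helper name and the loss history are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a loss history."""
    best_loss = float("inf")
    best_epoch = 0
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: a plateau after epoch 3, then a late gain.
history = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.69]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

With `patience=3`, training stops at epoch 6 and never reaches the slightly better loss at epoch 7. A larger patience trades extra training time for a lower chance of stopping on a temporary plateau.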

In [74]:
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=128)

num_epochs = 300
patience = 15  # number of epochs to wait
best_val_loss = float('inf')
counter = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0

    for xb, yb in train_loader:
        optimizer.zero_grad()
        outputs = model(xb)
        loss = criterion(outputs, yb)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    train_loss /= len(train_loader)

    # ----- Validation -----
    model.eval()
    val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for xb, yb in val_loader:
            outputs = model(xb)
            loss = criterion(outputs, yb)
            val_loss += loss.item()

            _, predicted = torch.max(outputs, 1)
            total += yb.size(0)
            correct += (predicted == yb).sum().item()

    val_loss /= len(val_loader)
    val_acc = correct / total

    print(f"Epoch [{epoch+1}/{num_epochs}] "
          f"Train Loss: {train_loss:.4f} "
          f"Val Loss: {val_loss:.4f} "
          f"Val Acc: {val_acc:.4f}")

    # ----- Early Stopping Logic -----
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        counter = 0
        torch.save(model.state_dict(), "best_model.pth")  # Save best model
    else:
        counter += 1
        print(f"Early stopping counter: {counter}/{patience}")

        if counter >= patience:
            print("Early stopping triggered.")
            break

Epoch [1/300] Train Loss: 2.8537 Val Loss: 2.7916 Val Acc: 0.1797
Epoch [2/300] Train Loss: 2.6994 Val Loss: 2.5534 Val Acc: 0.2262
Epoch [3/300] Train Loss: 2.4532 Val Loss: 2.3019 Val Acc: 0.3243
Epoch [4/300] Train Loss: 2.2390 Val Loss: 2.0708 Val Acc: 0.3794
Epoch [5/300] Train Loss: 2.0330 Val Loss: 1.8456 Val Acc: 0.4363
Epoch [6/300] Train Loss: 1.8500 Val Loss: 1.6598 Val Acc: 0.4531
Epoch [7/300] Train Loss: 1.7064 Val Loss: 1.5115 Val Acc: 0.5510
Epoch [8/300] Train Loss: 1.5824 Val Loss: 1.3906 Val Acc: 0.5669
Epoch [9/300] Train Loss: 1.4817 Val Loss: 1.2924 Val Acc: 0.6398
Epoch [10/300] Train Loss: 1.3995 Val Loss: 1.2148 Val Acc: 0.6432
Epoch [11/300] Train Loss: 1.3323 Val Loss: 1.1509 Val Acc: 0.6788
Epoch [12/300] Train Loss: 1.2724 Val Loss: 1.0938 Val Acc: 0.7099
Epoch [13/300] Train Loss: 1.2194 Val Loss: 1.0410 Val Acc: 0.7318
Epoch [14/300] Train Loss: 1.1736 Val Loss: 0.9944 Val Acc: 0.7341
Epoch [15/300] Train Loss: 1.1209 Val Loss: 0.9468 Val Acc: 0.7632
...

## Final Test Evaluation

After training finishes, I load the best saved model and evaluate it on the test set.

The test set was never used during training or validation.  
Therefore, it provides an unbiased estimate of the model’s generalization performance.

Evaluation metrics include:
- Accuracy
- Precision
- Recall
- F1-score

Accuracy is calculated as:

Accuracy = Correct Predictions / Total Samples

Accuracy summarizes overall performance, while precision, recall, and F1-score show how each individual gesture class is handled.
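
To show how the per-class metrics relate to each other, the sketch below derives them from a confusion matrix in NumPy. The function name and the six-sample toy labels are assumptions for illustration. The definitions used are precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = the harmonic mean of the two.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes):
    """Accuracy plus per-class precision, recall and F1 from a confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)   # rows: true class, columns: predicted

    tp = np.diag(cm).astype(np.float64)
    predicted = cm.sum(axis=0)           # TP + FP per class
    actual = cm.sum(axis=1)              # TP + FN per class

    # Classes with an empty denominator get a score of 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1

# Toy labels: one sample of class 0 is misclassified as class 1.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
accuracy, precision, recall, f1 = per_class_metrics(y_true, y_pred, num_classes=3)
print(accuracy, precision, recall, f1)
```

The single mistake lowers recall for class 0 and precision for class 1, which is the kind of per-class detail that overall accuracy does not reveal.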

In [75]:
model.load_state_dict(torch.load("best_model.pth"))
model.eval()

test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=128)

all_preds = []
all_labels = []

with torch.no_grad():
    for xb, yb in test_loader:
        outputs = model(xb)
        _, predicted = torch.max(outputs, 1)

        all_preds.extend(predicted.numpy())
        all_labels.extend(yb.numpy())

# Convert to numpy arrays
all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

# Accuracy
test_accuracy = (all_preds == all_labels).mean()
print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.9846832814122534


In [76]:
from sklearn.metrics import classification_report

print(classification_report(
    all_labels,
    all_preds,
    target_names=le.classes_
))

                 precision    recall  f1-score   support

           call       0.99      0.98      0.98       226
        dislike       0.99      1.00      1.00       194
           fist       0.99      0.98      0.99       141
           four       0.98      0.99      0.98       245
           like       0.98      0.99      0.99       216
           mute       0.95      0.97      0.96       163
             ok       0.99      0.99      0.99       239
            one       0.98      0.96      0.97       189
           palm       0.96      1.00      0.98       248
          peace       1.00      0.99      0.99       216
 peace_inverted       1.00      0.99      0.99       225
           rock       1.00      1.00      1.00       219
           stop       0.97      0.96      0.96       223
  stop_inverted       0.97      0.98      0.97       235
          three       1.00      0.96      0.98       219
         three2       1.00      1.00      1.00       248
         two_up       0.99    

## Conclusion

In this notebook, I reimplemented a classical supervised classification problem using a Deep Neural Network.

### Comparison with ML1 Baseline (SVM)

| Model | Test Accuracy |
|-------|--------------|
| SVM (ML1) | 99.08% |
| Deep Neural Network | 98.47% |

The SVM marginally outperforms the MLP, by about 0.6 percentage points. A gap this small is consistent with the nature of the input.

The input consists of **pre-extracted, low-dimensional, structured features** (63 landmark coordinates), not raw pixels or sequences. SVMs with RBF kernels are well-suited to this regime: the kernel trick lets them fit nonlinear, maximum-margin decision boundaries without first learning a feature representation.

Deep networks tend to outperform classical models when features are **raw and high-dimensional** (images, text, audio), where they learn hierarchical representations automatically through backpropagation. With only 63 features, the MLP has less room to leverage its capacity advantage.

This does **not** mean Deep Learning failed: 98.47% test accuracy across 18 gesture classes is excellent. It illustrates an important principle:

> **Model selection should match the data regime. Deep Learning is not universally superior — its advantage depends on the dimensionality, volume, and structure of the data.**

### Full Pipeline Summary

The complete pipeline included:
- **Preprocessing**: Label encoding, wrist-relative translation normalization, and fingertip-distance scale normalization of the x and y coordinates
- **Data Splitting**: Stratified Train / Validation / Test split (70/15/15)
- **Model Training**: Mini-batch training with the Adam optimizer
- **Regularization**: Dropout (0.3) and Early Stopping based on validation loss
- **Final Evaluation**: Test set evaluation with accuracy, precision, recall, and F1-score