Real-Time Hand Gesture Recognition and Data Collection

1. Introduction 

This script captures hand landmarks and images of alphabet hand gestures in real time using the MediaPipe Hands module. The collected data is organized into one directory per letter of the alphabet, so it can be used to build hand gesture datasets for sign language.


2. Libraries

OpenCV (cv2): used for capturing video and image processing
MediaPipe (mp): used for hand tracking and landmark detection
NumPy (np): used for numerical operations and data handling
os: used for file and directory operations
scikit-learn: used for label encoding and for splitting the data into training and test sets
TensorFlow/Keras: used to build, train, and load the classification model


3. Initialization

This script initializes the MediaPipe Hands module and a video capture object.

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5, min_tracking_confidence=0.5)
mp_drawing = mp.solutions.drawing_utils
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 450)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 450)


4. Gesture Data Collection 

This script collects hand landmarks, draws them on the video frame, and saves the landmarks and cropped hand images for each gesture in separate directories.


while index < len(alphabet):
    # ...
    while True:
        # ...
        if results.multi_hand_landmarks is not None:
            for hand_landmarks in results.multi_hand_landmarks:
                # ...
                landmarks_positions = [(lm.x * frame.shape[1], lm.y * frame.shape[0]) for lm in hand_landmarks.landmark]
                landmarks_file_path = os.path.join(letter_path, letter + str(count) + '.npy')
                np.save(landmarks_file_path, landmarks_positions)
                # ...
                hand_crop = frame[int(bboxC[1]):int(bboxC[1] + bboxC[3]), int(bboxC[0]):int(bboxC[0] + bboxC[2])]
                if hand_crop.size != 0:
                    if cv2.waitKey(1) & 0xFF == ord('c'):
                        image_name = letter + str(count) + '.png'
                        image_path = os.path.join(letter_path, image_name)
                        cv2.imwrite(image_path, hand_crop)
                        # ...
                        count += 1
        else:
            print('no hand')
        # ...
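
The crop above relies on a bounding box built from the landmark positions and enlarged by a scaling factor. The sketch below shows one way to compute such a box and clip it to the frame, so the slice never receives negative indices. The helper name `padded_bbox` and the sample points are hypothetical; only the 1.5 scaling factor and the 450x450 frame size come from the script.

```python
def padded_bbox(points, frame_w, frame_h, scale=1.5):
    """Return (x, y, w, h) of the box around points, grown by scale and clipped to the frame."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(xs) - min(xs)
    h = max(ys) - min(ys)
    # Grow the box symmetrically around its centre.
    left = min(xs) - w * (scale - 1) / 2
    top = min(ys) - h * (scale - 1) / 2
    right = left + w * scale
    bottom = top + h * scale
    # Clip to the frame so that array slicing never sees negative indices.
    left, top = max(0, int(left)), max(0, int(top))
    right, bottom = min(frame_w, int(right)), min(frame_h, int(bottom))
    return left, top, max(0, right - left), max(0, bottom - top)


# Hypothetical pixel coordinates for a hand near the top-left corner of a 450x450 frame.
points = [(10.0, 20.0), (110.0, 20.0), (10.0, 220.0), (110.0, 220.0)]
print(padded_bbox(points, 450, 450))  # (0, 0, 135, 270)
```

Without clipping, a hand close to the frame edge produces a negative start index, the slice comes back empty, and no image is saved for that frame.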

5. Data Preprocessing 

Hand gesture images and landmarks are loaded from disk. The images are resized and normalized, and the landmarks are flattened for input into the gesture classification model.

for letter in alphabet:
    letter_path = os.path.join(data_path, letter)
    for file_name in os.listdir(letter_path):
        # ...
        hand_landmarks = np.load(landmarks_path, allow_pickle=True)
        image = cv2.imread(image_path)
        # ...
        landmarks_positions = hand_landmarks.flatten()
        images.append(hand_crop_array)
        landmarks_list.append(landmarks_positions)
        labels.append(letter)
        # ...
images = np.array(images)
labels = np.array(labels)
landmarks = np.array(landmarks_list)
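
As a rough illustration of the shapes involved, the sketch below flattens one set of 21 (x, y) landmarks into 42 features and one-hot encodes its letter. The helper `encode_sample` is hypothetical, and the script itself uses `LabelEncoder` and `to_categorical`; the class indices agree with this sketch only under the assumption that every letter has at least one sample.

```python
import numpy as np

alphabet = [chr(ord("A") + i) for i in range(26)]


def encode_sample(landmarks_xy, letter):
    """Flatten a (21, 2) landmark array to 42 features and one-hot encode its letter."""
    features = np.asarray(landmarks_xy, dtype=np.float32).reshape(-1)
    one_hot = np.zeros(len(alphabet), dtype=np.float32)
    one_hot[alphabet.index(letter)] = 1.0
    return features, one_hot


# Hypothetical sample: 21 landmarks at the origin, labelled "C".
features, one_hot = encode_sample(np.zeros((21, 2)), "C")
print(features.shape, one_hot.shape, int(one_hot.argmax()))  # (42,) (26,) 2
```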



6. Gesture Classification Model 

A simple fully connected neural network is built and trained to predict the hand gesture from the landmarks.


model = models.Sequential()
model.add(layers.Flatten(input_shape=(42,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(len(alphabet), activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
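
To get a feel for the size of this network, the trainable parameters can be counted by hand: each Dense layer has one weight per input-unit pair plus one bias per unit, while Flatten and Dropout add none. The helper `dense_params` below is a hypothetical name used only for this arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


hidden = dense_params(42, 128)   # 42 landmark coordinates -> 128 hidden units
output = dense_params(128, 26)   # 128 hidden units -> 26 letters
print(hidden, output, hidden + output)  # 5504 3354 8858
```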


7. Model Training and Evaluation 

The model is trained and evaluated on the dataset split into training and testing sets.

history = model.fit(
    x=landmarks_train,
    y=y_train,
    epochs=32,
    batch_size=64,
    validation_data=(landmarks_test, y_test)
)

test_loss, test_acc = model.evaluate(landmarks_test, y_test)
print(f"Test accuracy: {test_acc}")
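
Accuracy alone does not show which letters get confused with each other. A confusion matrix over the test set does, and the sketch below builds one from one-hot labels and predicted probabilities using NumPy only. The function name and the three-class toy data are hypothetical; with the real model, `y_prob` would be the output of `model.predict(landmarks_test)`.

```python
import numpy as np


def confusion_matrix_from_onehot(y_true, y_prob):
    """Rows are true classes, columns are predicted classes."""
    n_classes = y_true.shape[1]
    true_ids = np.argmax(y_true, axis=1)
    pred_ids = np.argmax(y_prob, axis=1)
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # Add 1 at every (true, predicted) pair, including repeated pairs.
    np.add.at(matrix, (true_ids, pred_ids), 1)
    return matrix


# Toy data: four samples, three classes; the third sample is misclassified.
y_true = np.eye(3)[[0, 1, 2, 2]]
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.1, 0.2, 0.7]])
matrix = confusion_matrix_from_onehot(y_true, y_prob)
accuracy = np.trace(matrix) / matrix.sum()
print(matrix)
print(accuracy)  # 0.75
```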


8. Gesture Recognition

Hand landmarks captured in real time are passed to the trained model, and the predicted gesture is displayed on the video feed.
  
while index < len(alphabet):
    # ...
    while True:
        # ...
        if results.multi_hand_landmarks is not None:
            for hand_landmarks in results.multi_hand_landmarks:
                # ...
                landmarks_positions = [(lm.x * frame.shape[1], lm.y * frame.shape[0]) for lm in hand_landmarks.landmark]
                landmarks_array = np.array(landmarks_positions).flatten()
                # ...
                predictions = model.predict(landmarks_array.reshape(1, -1))
                predicted_class = np.argmax(predictions)
                cv2.putText(frame, f"Predicted: {alphabet[predicted_class]}", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        # ...
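
The model returns one probability per letter, so the label shown on screen is the index of the largest value. The sketch below decodes such an output and also hides low-confidence predictions. The helper `decode_prediction` and the `min_confidence` threshold are assumptions for illustration and are not part of the script.

```python
import numpy as np

alphabet = [chr(ord("A") + i) for i in range(26)]


def decode_prediction(predictions, min_confidence=0.6):
    """Map a (1, 26) softmax output to (letter or None, confidence)."""
    probs = np.asarray(predictions)[0]
    predicted_class = int(np.argmax(probs))
    confidence = float(probs[predicted_class])
    if confidence < min_confidence:
        return None, confidence  # too uncertain to display a letter
    return alphabet[predicted_class], confidence


# Hypothetical model output that strongly favours index 1 ("B").
fake_output = np.full((1, 26), 0.01)
fake_output[0, 1] = 0.75
letter, confidence = decode_prediction(fake_output)
print(letter, round(confidence, 2))  # B 0.75
```

Note that the class index has to be computed before it is used to look up the confidence.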


9. Conclusion

This script covers data collection, preprocessing, model training, and real-time recognition of alphabet hand gestures, which makes it a useful starting point for applications such as sign language translation.


Sources:

1. https://medium.com/mlearning-ai/american-sign-language-alphabet-recognition-ec286915df12
2. https://www.mdpi.com/1424-8220/23/18/7970
3. https://www.kaggle.com/datasets/grassknoted/asl-alphabet
4. https://github.com/computervisioneng/...
5. https://github.com/yuliianikolaenko/asl-alphabet-classification
6. https://github.com/topics/asl-recognizer
7. https://github.com/topics/asl-alphabet-translator
8. https://github.com/11a55an/american-sign-language-detection
9. https://github.com/VedantMistry13/American-Sign-Language-Recognition-using-Deep-Neural-Network
10. https://github.com/kinivi/hand-gesture-recognition-mediapipe/blob/main/app.py
11. https://github.com/Kazuhito00/hand-ge...
12. https://www.computervision.zone/cours...
13. https://github.com/nicknochnack/Actio...,  Complete Machine Learning and Data Science Courses
14. ChatGPT
15. https://github.com/ivangrov
16. https://www.youtube.com/channel/UCxladMszXan-jfgzyeIMyvw/about
 

In [None]:
1. Collect keypoints for training and testing
2. Preprocess the data
3. Build and train a model
4. Test predictions
5. Evaluate using a confusion matrix and accuracy
6. Test in real time
7. Tune the model

In [None]:
import cv2
import mediapipe as mp
import os
import numpy as np


#set up mediapipe
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5, min_tracking_confidence=0.5)
mp_drawing = mp.solutions.drawing_utils

#set up data path, define alphabet
data_path = "/Users/reagan/desktop/AI/AI_ASL/"
alphabet = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", 
            "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X",  "Y", "Z"]

#set up camera
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 450)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 450)

index = 0

#loop through alphabet
while index < len(alphabet):
    letter = alphabet[index]
    letter_path = os.path.join(data_path, letter)
    os.makedirs(letter_path, exist_ok=True)
    
    print(letter, letter_path)
    
    #set up count for images
    count = 0

    #main loop for collecting images
    while True:
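        ret, frame = cap.read()
        if not ret: #camera returned no frame; cvtColor would fail on an empty image
            print('no frame from camera')
            break
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) #convert to RGB for mediapipe
        results = hands.process(rgb_frame)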

        if results.multi_hand_landmarks is not None:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS,
                                          mp_drawing.DrawingSpec(color=(0, 117, 128), thickness=2, circle_radius=4),
                                          mp_drawing.DrawingSpec(color=(53, 101, 77), thickness=2, circle_radius=2)
                                          )
                
                # Extract landmark positions and save them to a file
                landmarks_positions = [(lm.x * frame.shape[1], lm.y * frame.shape[0]) for lm in hand_landmarks.landmark]
                landmarks_file_path = os.path.join(letter_path, letter + str(count) + '.npy')
                np.save(landmarks_file_path, landmarks_positions)

                # Extract bounding box of hand and draw it on the frame
                bboxC = (
                    min(landmarks_positions, key=lambda x: x[0])[0],
                    min(landmarks_positions, key=lambda x: x[1])[1],
                    max(landmarks_positions, key=lambda x: x[0])[0] - min(landmarks_positions, key=lambda x: x[0])[0],
                    max(landmarks_positions, key=lambda x: x[1])[1] - min(landmarks_positions, key=lambda x: x[1])[1]
                )
                scaling_factor = 1.5 # bounding box size multiplier to get padding around the hand
                bboxC = (
                    int(bboxC[0] - (bboxC[2] * (scaling_factor - 1) / 2)), # adjusted left coordinate
                    int(bboxC[1] - (bboxC[3] * (scaling_factor - 1) / 2)), # adjusted top coordinate
                    int(bboxC[2] * scaling_factor), # adjusted width
                    int(bboxC[3] * scaling_factor)  # adjusted height
                )

                # Draw lines between hand landmarks, scaled to the actual frame size
                frame_size = [frame.shape[1], frame.shape[0]]
                for start_idx, end_idx in mp_hands.HAND_CONNECTIONS:
                    start, end = hand_landmarks.landmark[start_idx], hand_landmarks.landmark[end_idx]
                    start_point = tuple(np.multiply([start.x, start.y], frame_size).astype(int).tolist())
                    end_point = tuple(np.multiply([end.x, end.y], frame_size).astype(int).tolist())
                    cv2.line(rgb_frame, start_point, end_point, (255, 0, 0), 2)
                
                # Draw rectangle around hand using adjusted bounding box
                cv2.rectangle(frame, (int(bboxC[0]), int(bboxC[1])),
                              (int(bboxC[0] + bboxC[2]), int(bboxC[1] + bboxC[3])), (0, 0, 0), 2)

                

                # Crop hand from frame using adjusted bounding box
                hand_crop = frame[int(bboxC[1]):int(bboxC[1] + bboxC[3]), int(bboxC[0]):int(bboxC[0] + bboxC[2])]

                print('hand')

        else:
            print('no hand')

        cv2.imshow('frame', frame)

        # Poll the keyboard once per frame; a second waitKey call could swallow key presses
        key = cv2.waitKey(1) & 0xFF

        if key == ord('c'):
            # Save the crop of the last detected hand, if there is one
            if results.multi_hand_landmarks is not None and hand_crop.size != 0:
                image_name = letter + str(count) + '.png'
                image_path = os.path.join(letter_path, image_name)
                cv2.imwrite(image_path, hand_crop)
                print('image_name:', image_name)
                count += 1
        elif key == ord('q'):
            break
        elif key == ord(' '):
            count = 0
            break

    index += 1

print(count)

cap.release()
cv2.destroyAllWindows()


In [None]:
import cv2
import mediapipe as mp
import os
import numpy as np
from tensorflow.keras.preprocessing.image import img_to_array
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# set up mediapipe
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5, min_tracking_confidence=0.5)
mp_drawing = mp.solutions.drawing_utils

# set up data path and alphabet
data_path = "/Users/reagan/desktop/AI/AI_ASL/"
alphabet = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
            "K", "L", "M", "N", "O", "P", "Q", "R", "S",
            "T", "U", "V", "W", "X",  "Y", "Z"]

# lists for storing data
images = []
labels = []
landmarks_list = []

# initialize LabelEncoder
label_encoder = LabelEncoder() 

# loop through each letter in alphabet and each image file in each letter folder
for letter in alphabet:
    letter_path = os.path.join(data_path, letter)
    for file_name in os.listdir(letter_path):
        if not file_name.endswith(".png"): # skip the .npy landmark files and any other non-image files
            continue
        image_path = os.path.join(letter_path, file_name)
        landmarks_path = os.path.join(letter_path, file_name.replace(".png", ".npy")) # replace image extension with .npy extension
        print(image_path)
        print(landmarks_path)

        # load image and landmarks

        hand_landmarks = np.load(landmarks_path) # load landmarks

        image = cv2.imread(image_path) # load image

        if image is not None:
            rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

            # resize, normalize, and convert the image to an array
            hand_crop_resized = cv2.resize(rgb_image, (224, 224))
            hand_crop_resized = hand_crop_resized.astype(float) / 255.0
            hand_crop_array = img_to_array(hand_crop_resized)
            landmarks_positions = hand_landmarks.flatten()

            print(landmarks_positions.shape)
            print(hand_crop_array.shape)
            
            images.append(hand_crop_array)
            landmarks_list.append(landmarks_positions)
            labels.append(letter)

            # save landmarks as .npy file
            landmarks_array = np.array(landmarks_positions)
            landmarks_save_path = os.path.join(letter_path, file_name.replace(".png", ".npy"))
            np.save(landmarks_save_path, landmarks_array)

# encode labels and convert to categorical

labels = label_encoder.fit_transform(labels)
labels = to_categorical(labels, num_classes=len(alphabet))

images = np.array(images)
labels = np.array(labels)
landmarks = np.array(landmarks_list)

# split data into training and testing sets

if len(landmarks) > 0:
    X_train, X_test, landmarks_train, landmarks_test, y_train, y_test = train_test_split(
        images, landmarks, labels, test_size=0.2, random_state=42) # split images, landmarks, and labels

        
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)
    print(landmarks_train.shape)
    print(landmarks_test.shape)
    
else:
    print("No data available for splitting.") 


In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models





model = models.Sequential()
#flatten the input
model.add(layers.Flatten(input_shape=(42,)))  # 42 = 21 landmarks * 2 coordinates
model.add(layers.Dense(128, activation='relu')) # 128 = 2^7 neurons in hidden layer
model.add(layers.Dropout(0.5)) # 50% dropout rate to reduce overfitting
model.add(layers.Dense(len(alphabet), activation='softmax')) # 26 neurons in output layer for 26 classes


#compile the model
#adam optimizer minimizes the loss function by updating the weights after each batch
#loss function is categorical crossentropy because the labels are one-hot encoded over more than 2 classes
#accuracy is used to measure the performance of the model

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

#train the model
history = model.fit( 
    x=landmarks_train,     #input data
    y=y_train,            #output data (labels)
    epochs=32,            #32 passes over the training data
    batch_size=64,        #number of samples per gradient update
    validation_data=(landmarks_test, y_test) #data to validate the model on
)


test_loss, test_acc = model.evaluate(landmarks_test, y_test)
print(f"Test accuracy: {test_acc}")


model.save("/Users/reagan/desktop/model_landmarks.h5")


In [None]:
import matplotlib.pyplot as plt

#plot the accuracy and loss for the training and validation data

plt.plot(history.history['accuracy'], label='accuracy') 
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])
plt.legend(loc='lower right')
plt.show()

In [None]:
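from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import KFold

# cross_val_score expects a scikit-learn estimator, so the 5 folds are run by hand for the Keras model
cv_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(landmarks):
    fold_model = tf.keras.models.clone_model(model) # same architecture, freshly initialized weights
    fold_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    fold_model.fit(landmarks[train_idx], labels[train_idx], epochs=32, batch_size=64, verbose=0)
    cv_scores.append(fold_model.evaluate(landmarks[val_idx], labels[val_idx], verbose=0)[1])
print("c.v", cv_scores)
print("mean_accuracy", np.mean(cv_scores))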


In [1]:
import cv2
import mediapipe as mp
import os
import numpy as np
from tensorflow.keras.models import load_model


model = load_model("/Users/reagan/models/model_landmarks.h5") # Load the model

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5, min_tracking_confidence=0.5) # Initialize the hands module from mediapipe  
mp_drawing = mp.solutions.drawing_utils

data_path = "/Users/reagan/desktop/AI/AI_ASL/" # Path to the data
alphabet = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
            "K", "L", "M", "N", "O", "P", "Q", "R", "S",
            "T", "U", "V", "W", "X",  "Y", "Z"]

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 450)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 450)

index = 0

while index < len(alphabet): # Loop through each letter
    letter = alphabet[index]
    letter_path = os.path.join(data_path, letter)
    os.makedirs(letter_path, exist_ok=True)
    
    print(letter, letter_path)

    count = 0

    while True:
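        ret, frame = cap.read()
        if not ret: #camera returned no frame; cvtColor would fail on an empty image
            print('no frame from camera')
            break
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)      #convert frame to RGB color space because mediapipe works with RGB images
        results = hands.process(rgb_frame)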

        if results.multi_hand_landmarks is not None:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS,  #draw landmarks on frame
                                          mp_drawing.DrawingSpec(color=(0, 117, 128), thickness=2, circle_radius=4), 
                                          mp_drawing.DrawingSpec(color=(53, 101, 77), thickness=2, circle_radius=2)
                                          )

                landmarks_positions = [(lm.x * frame.shape[1], lm.y * frame.shape[0]) 
                                       for lm in hand_landmarks.landmark]
                landmarks_array = np.array(landmarks_positions).flatten()

                bboxC = (
                    min(landmarks_positions, key=lambda x: x[0])[0],  #leftmost x-coordinate
                    min(landmarks_positions, key=lambda x: x[1])[1],  #topmost y-coordinate
                    max(landmarks_positions, key=lambda x: x[0])[0] - min(landmarks_positions, key=lambda x: x[0])[0], #width of bbox
                    max(landmarks_positions, key=lambda x: x[1])[1] - min(landmarks_positions, key=lambda x: x[1])[1]  #height of bbox
                )

                #scaling factor to scale the bbox
                scaling_factor = 1.5  

                #scaled bbox coordinates
                bboxC = (
                    int(bboxC[0] - (bboxC[2] * (scaling_factor - 1) / 2)), #adjusted leftmost x-coordinate
                    int(bboxC[1] - (bboxC[3] * (scaling_factor - 1) / 2)), #adjusted topmost y-coordinate
                    int(bboxC[2] * scaling_factor), #adjusted width of bbox
                    int(bboxC[3] * scaling_factor)  #adjusted height of bbox
                )

                frame_size = [frame.shape[1], frame.shape[0]] #scale to the actual frame size
                for start_idx, end_idx in mp_hands.HAND_CONNECTIONS: #draw lines between landmarks
                    start, end = hand_landmarks.landmark[start_idx], hand_landmarks.landmark[end_idx]
                    start_point = tuple(np.multiply([start.x, start.y], frame_size).astype(int).tolist())
                    end_point = tuple(np.multiply([end.x, end.y], frame_size).astype(int).tolist())
                    cv2.line(rgb_frame, start_point, end_point, (255, 0, 0), 2)  #draw line between two points

                cv2.rectangle(frame, (int(bboxC[0]), int(bboxC[1])),
                              (int(bboxC[0] + bboxC[2]), int(bboxC[1] + bboxC[3])), (0, 0, 0), 2) #draw rectangle around hand

                #crop hand from frame; copy so the text drawn later does not end up in the saved image
                hand_crop = frame[int(bboxC[1]):int(bboxC[1] + bboxC[3]), int(bboxC[0]):int(bboxC[0] + bboxC[2])].copy()

                landmarks_input = landmarks_array.reshape(1, -1)

                predictions = model.predict(landmarks_input) #predict letter

                predicted_class = np.argmax(predictions) #get index of predicted letter

                confidence = predictions[0, predicted_class] #get confidence of prediction

                cv2.putText(frame, f"Predicted: {alphabet[predicted_class]} ({confidence:.2f})", (10, 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)  # Put text on frame

        cv2.imshow('frame', frame) #show frame

        key = cv2.waitKey(1) & 0xFF #poll the keyboard once per frame

        if key == ord('c'):
            #save the crop of the last detected hand, if there is one
            if results.multi_hand_landmarks is not None and hand_crop.size != 0:
                image_name = letter + str(count) + '.png'
                image_path = os.path.join(letter_path, image_name)
                cv2.imwrite(image_path, hand_crop)
                print('image_name:', image_name)
                count += 1
        elif key == ord('q'):
            break
        # elif key == ord(' '):
        #     count = 0
        #     break

    index += 1

print(count)

cap.release()
cv2.destroyAllWindows() 


2023-12-06 05:56:25.307934: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0000 00:00:1701860190.028681       1 gl_context.cc:344] GL version: 2.1 (2.1 ATI-5.1.35), renderer: AMD Radeon Pro 555X OpenGL Engine
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.


A /Users/reagan/desktop/AI/AI_ASL/A


error: OpenCV(4.8.1) /Users/runner/work/opencv-python/opencv-python/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'
