Downloading Dataset and creation of directory "Dataset"

In [1]:
import urllib.request

urllib.request.urlretrieve('https://sc.link/AO5l', 'subsample.zip')
!mkdir -p dataset/
!unzip -q subsample.zip -d dataset/subsample
!rm -r subsample.zip

#!scp -r /kaggle/input/hagrid/ann_subsample /kaggle/working/dataset/ann_subsample
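
The shell commands in the cell above (`mkdir -p`, `unzip`, `rm`) only work inside a notebook. As a minimal sketch, the same steps can be done with the Python standard library; the helper name `extract_archive` is hypothetical and not part of the original notebook.

```python
import os
import zipfile


def extract_archive(zip_path: str, target_dir: str, remove_archive: bool = False) -> list:
    """Extract zip_path into target_dir and return the names of the archive members."""
    os.makedirs(target_dir, exist_ok=True)  # same effect as `mkdir -p`
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)      # same effect as `unzip -d target_dir`
        names = archive.namelist()
    if remove_archive:
        os.remove(zip_path)                 # same effect as `rm subsample.zip`
    return names
```

With the paths used in this notebook, the call would be `extract_archive('subsample.zip', 'dataset/subsample', remove_archive=True)`.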

In [20]:
!unzip -q dataset/ann_subsample.zip -d dataset/ann_subsample

In [21]:
!pwd

/content


In [22]:
!ls -lrt dataset

total 8
drwxr-xr-x 20 root root 4096 Mar 21 04:35 subsample
drwxr-xr-x  3 root root 4096 Mar 21 04:39 ann_subsample


In [23]:
!ls -lrt dataset/ann_subsample

total 4
drwxr-xr-x 2 root root 4096 Mar 21 04:39 ann_subsample


Importing Required Libraries

In [24]:
import os
import json
import logging
import random
from tqdm import tqdm
from collections import defaultdict
from typing import Tuple
from glob import glob

import pandas as pd
import numpy as np

from PIL import Image, ImageOps
from ipywidgets import interact
from IPython.display import Image as DImage
import cv2

import torch
from torch import nn, Tensor
from torchvision import models
from torchvision.transforms import Compose
from torchvision.transforms import functional as F
from torchvision import transforms as T


import warnings
warnings.filterwarnings('ignore')

In [25]:
!pip install torchmetrics
from torchmetrics.detection.mean_ap import MeanAveragePrecision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Defining class names and file formats for the gesture recognition dataset. The class names are used as labels for each image, while the formats are used to filter out irrelevant or incompatible files from the dataset.

In [26]:
class_names = [
   'call',
   'dislike',
   'fist',
   'four',
   'like',
   'mute',
   'ok',
   'one',
   'palm',
   'peace_inverted',
   'peace',
   'rock',
   'stop_inverted',
   'stop',
   'three',
   'three2',
   'two_up',
   'two_up_inverted',
   'no_gesture']

FORMATS = (".jpeg", ".jpg", ".jp2", ".png", ".tiff", ".jfif", ".bmp", ".webp", ".heic")
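
The extension filter relies on `str.endswith` accepting a tuple of suffixes. The sketch below shows that behaviour on a few made-up file names; the helper `filter_images` is hypothetical, and unlike the dataset class it lower-cases the name first, which is an assumption added here so that `.PNG` is also accepted.

```python
FORMATS = (".jpeg", ".jpg", ".jp2", ".png", ".tiff", ".jfif", ".bmp", ".webp", ".heic")


def filter_images(filenames, extensions=FORMATS):
    """Keep only the names whose lower-cased suffix is one of `extensions`."""
    return [name for name in filenames if name.lower().endswith(extensions)]


# Made-up file names, for illustration only.
example = ["0001.jpg", "0002.PNG", "notes.txt", "call.json"]
print(filter_images(example))  # ['0001.jpg', '0002.PNG']
```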

The GestureDataset class below defines "__get_files_from_dir", which lists all files in a directory that have one of the given extensions, and "__read_annotations", which reads the JSON files containing the dataset annotations.

In [27]:
transform = T.ToTensor()

class GestureDataset(torch.utils.data.Dataset):

    @staticmethod
    def __get_files_from_dir(pth: str, extns: Tuple):
        if not os.path.exists(pth):
            print(f"Dataset directory doesn't exist {pth}")
            return []
        files = [f for f in os.listdir(pth) if f.endswith(extns)]
        return files

    def __read_annotations(self, path):
        annotations_all = None
        exists_images = []
        for target in class_names:
            path_to_json = os.path.join(path, f"{target}.json")
            if os.path.exists(path_to_json):
                # Read the annotation file and close it again right away
                with open(path_to_json) as annotation_file:
                    json_annotation = json.load(annotation_file)
                # Attach the image file name to every annotation record
                json_annotation = [dict(annotation, name=f"{name}.jpg")
                                   for name, annotation in json_annotation.items()]

                annotation = pd.DataFrame(json_annotation)

                annotation["target"] = target
                annotations_all = pd.concat([annotations_all, annotation], ignore_index=True)
                exists_images.extend(
                    self.__get_files_from_dir(os.path.join(self.path_images, target), FORMATS))
            else:
                if target != 'no_gesture':
                    print(f"Database for {target} not found")

        annotations_all["exists"] = annotations_all["name"].isin(exists_images)

        annotations_all = annotations_all[annotations_all["exists"]]

        users = annotations_all["user_id"].unique()
        users = sorted(users)
        random.Random(42).shuffle(users)
        train_users = users[:int(len(users) * 0.8)]
        val_users = users[int(len(users) * 0.8):]

        annotations_all = annotations_all.copy()

        if self.is_train:
            annotations_all = annotations_all[annotations_all["user_id"].isin(train_users)]
        else:
            annotations_all = annotations_all[annotations_all["user_id"].isin(val_users)]

        return annotations_all

    def __init__(self, path_annotation, path_images, is_train, transform=None):
        self.is_train = is_train
        self.transform = transform
        self.path_annotation = path_annotation
        self.path_images = path_images
        self.labels = {label: num for (label, num) in
                       zip(class_names, range(len(class_names)))}
        self.annotations = self.__read_annotations(self.path_annotation)

    def __len__(self):
        return self.annotations.shape[0]

    def get_sample(self, index: int):
        row = self.annotations.iloc[[index]].to_dict('records')[0]
        image_pth = os.path.join(self.path_images, row["target"], row["name"])
        image = Image.open(image_pth).convert("RGB")

        labels = torch.LongTensor([self.labels[label] for label in row["labels"]])

        target = {}
        width, height = image.size

        bboxes = []

        for bbox in row["bboxes"]:
            x1, y1, w, h = bbox
            bbox_abs = [x1 * width, y1 * height, (x1 + w) * width, (y1 + h) * height]
            bboxes.append(bbox_abs)
        target["labels"] = labels
        target["boxes"] = torch.as_tensor(bboxes, dtype=torch.float32)
        target["orig_size"] = torch.as_tensor([int(height), int(width)])

        return image, target

    def __getitem__(self, index: int):
        image, target = self.get_sample(index)
        if self.transform:
            image = self.transform(image)
        return image, target
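
In `get_sample`, each annotated box is stored as `[x, y, w, h]` relative to the image size and converted to absolute `[x1, y1, x2, y2]` pixel coordinates. The following sketch isolates that conversion; the function name `to_absolute_xyxy` and the sample values are illustrative only.

```python
def to_absolute_xyxy(bbox, width, height):
    """Convert a normalized [x, y, w, h] box to absolute [x1, y1, x2, y2] pixels."""
    x, y, w, h = bbox
    return [x * width, y * height, (x + w) * width, (y + h) * height]


# A box starting at 25% / 50% of a Full HD frame, half as wide and a quarter as high.
box = to_absolute_xyxy([0.25, 0.5, 0.5, 0.25], width=1920, height=1080)
print(box)  # [480.0, 540.0, 1440.0, 810.0]
```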


Setting hyperparameters and initializing the PyTorch device to use the GPU if available and the CPU otherwise. The cell also fixes the random seed of the PyTorch, NumPy, and Python random number generators to 42 and sets the number of training epochs to 15.

In [28]:
random_seed = 42
num_classes = len(class_names)
batch_size = 16
num_epoch = 15
torch.manual_seed(random_seed)
np.random.seed(random_seed)
random.seed(random_seed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

The GestureDataset class defined above is instantiated twice to load and preprocess the training and validation splits of the dataset.

In [29]:
train_data = GestureDataset(path_images='dataset/subsample',
                            path_annotation='dataset/ann_subsample/ann_subsample',
                            is_train=True, transform=transform)

test_data = GestureDataset(path_images='dataset/subsample',
                            path_annotation='dataset/ann_subsample/ann_subsample',
                            is_train=False, transform=transform)
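
The `is_train` flag selects one side of a user-level split: `__read_annotations` sorts the unique user IDs, shuffles them with a fixed seed, and assigns the first 80% of users to training. A stand-alone sketch of that logic, with the hypothetical helper `split_users` and made-up user IDs, could look like this.

```python
import random


def split_users(user_ids, train_fraction=0.8, seed=42):
    """Split unique user IDs into train and validation groups, reproducibly."""
    users = sorted(set(user_ids))       # sort first so the shuffle is deterministic
    random.Random(seed).shuffle(users)  # local RNG, leaves the global random state alone
    cut = int(len(users) * train_fraction)
    return users[:cut], users[cut:]


# Made-up IDs; in the notebook they come from the "user_id" annotation column.
train_users, val_users = split_users([f"user_{i}" for i in range(10)])
print(len(train_users), len(val_users))  # 8 2
```

Splitting by user rather than by image keeps every subject entirely in one set, so the validation score is not inflated by the model having seen the same person during training.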

Defining a collate function, which is used in object detection tasks to group the data from multiple images into a batch for training or inference.

In [30]:
def collate_fn(batch):
    batch_targets = list()
    images = list()

    for b in batch:
        images.append(b[0])
        batch_targets.append({"boxes": b[1]["boxes"],
                              "labels": b[1]["labels"]})
    return images, batch_targets
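
A custom collate function is needed because the number of boxes can differ from image to image, so the targets cannot be stacked into one tensor; the torchvision detection models accept a list of images and a list of target dictionaries instead. The sketch below runs the same grouping logic on plain Python placeholders (strings and lists stand in for tensors, and all values are made up).

```python
def collate_fn(batch):
    """Group (image, target) pairs into a list of images and a list of target dicts."""
    images, targets = [], []
    for image, target in batch:
        images.append(image)
        # Only the keys used for training are kept; "orig_size" is dropped.
        targets.append({"boxes": target["boxes"], "labels": target["labels"]})
    return images, targets


# Placeholder samples: the second image has two boxes, the first only one.
batch = [
    ("image_0", {"boxes": [[10, 20, 110, 220]], "labels": [4], "orig_size": [1080, 1920]}),
    ("image_1", {"boxes": [[5, 5, 50, 50], [60, 60, 90, 90]], "labels": [6, 18],
                 "orig_size": [1080, 1920]}),
]
images, targets = collate_fn(batch)
print(images)                    # ['image_0', 'image_1']
print(len(targets[1]["boxes"]))  # 2
```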

In [31]:
train_dataloader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,collate_fn=collate_fn, shuffle=True, num_workers=4)
test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=batch_size,collate_fn=collate_fn, shuffle=True, num_workers=4)

Pretrained framework and model class creation

In [32]:
lr = 0.005
momentum = 0.9
weight_decay = 5e-4

In [33]:
model = models.detection.ssdlite320_mobilenet_v3_large(num_classes=len(class_names) + 1, pretrained_backbone=True)
model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum, weight_decay=weight_decay)

Downloading: "https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth" to /root/.cache/torch/hub/checkpoints/mobilenet_v3_large-8738ca79.pth


  0%|          | 0.00/21.1M [00:00<?, ?B/s]

In [34]:
warmup_factor = 1.0 / 1000
warmup_iters = min(1000, len(train_data) - 1)

lr_scheduler_warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=warmup_factor, total_iters=warmup_iters
)
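
With the default `end_factor` of 1.0, `LinearLR` scales the base learning rate by a factor that grows linearly from `start_factor` to 1 over `total_iters` scheduler steps and then stays at 1. The sketch below reproduces that closed-form factor in plain Python so the schedule can be inspected without PyTorch; the function name is hypothetical.

```python
def linear_warmup_factor(step, start_factor=1.0 / 1000, total_iters=1000):
    """Multiplicative learning-rate factor after `step` scheduler steps."""
    progress = min(step, total_iters) / total_iters
    return start_factor + (1.0 - start_factor) * progress


base_lr = 0.005  # same base learning rate as in the notebook
for step in (0, 500, 1000, 2000):
    print(step, base_lr * linear_warmup_factor(step))
```

Because the training loop calls `lr_scheduler_warmup.step()` once per batch, `step` here counts batches, not epochs.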

Function for Mean Average Precision calculation.
Mean Average Precision (mAP) is a widely used evaluation metric for object detection and image segmentation tasks in computer vision.

It measures the accuracy of the model in terms of precision and recall: the average precision (AP) is computed for each detected object class, and the mean over all classes gives the final mAP score.

In [35]:
def eval(model, test_dataloader, epoch):
    model.eval()
    with torch.no_grad():
        mapmetric = MeanAveragePrecision()
        
        for images, targets in test_dataloader:
            images = list(image.to(device) for image in images)
            output = model(images)
            
            for pred in output:
                for key, value in pred.items():
                    pred[key] = value.cpu()
                    
            mapmetric.update(output, targets)

    metrics = mapmetric.compute()
    return metrics
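
To make the metric less of a black box, here is a simplified sketch of average precision for a single class at one IoU threshold. It is not what `MeanAveragePrecision` computes internally: torchmetrics follows the COCO protocol, which by default averages over IoU thresholds from 0.5 to 0.95 and uses interpolated precision. The function names and the sample boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_threshold=0.5):
    """Uninterpolated AP for one class; detections are (score, box) pairs."""
    matched, hits = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Best still-unmatched ground-truth box for this detection
        candidates = [(iou(box, gt), idx) for idx, gt in enumerate(gt_boxes)
                      if idx not in matched]
        best_iou, best_idx = max(candidates, default=(0.0, None))
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            hits.append(True)   # true positive
        else:
            hits.append(False)  # false positive
    if not gt_boxes:
        return 0.0
    true_positives, precision_sum = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / len(gt_boxes)


gt_boxes = [[0, 0, 10, 10], [20, 20, 30, 30]]
detections = [(0.9, [0, 0, 10, 10]), (0.8, [50, 50, 60, 60]), (0.7, [20, 20, 30, 30])]
ap = average_precision(detections, gt_boxes)
print(round(ap, 4))  # 0.8333
```

In this example the second-ranked detection is a false positive, so the precision at the second true positive is 2/3 and the AP is (1 + 2/3) / 2.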

Training loop - training a model using the training dataset, testing the model's performance on the testing dataset after each epoch, and saving the model's state to a file at the end of each epoch

In [36]:
!mkdir -p checkpoints
for epoch in range(num_epoch):
    model.train()
    total = 0
    sum_loss = 0
    for images, targets in tqdm(train_dataloader):
        batch = len(images)
        images = list(image.to(device) for image in images)
        for target in targets:
            for key, value in target.items():
                target[key] = value.to(device)
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        loss = losses.item()

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        lr_scheduler_warmup.step()

        total = total + batch
        sum_loss = sum_loss + loss
    metrics = eval(model, test_dataloader, epoch)
    print(f"epoch : {epoch}  |||  loss : {sum_loss / total} ||| MAP : {metrics['map']}")
    # Save a checkpoint at the end of every epoch
    torch.save(model.state_dict(), f"checkpoints/{epoch}.pth")

100%|██████████| 90/90 [02:33<00:00,  1.70s/it]


epoch : 0  |||  loss : 1.0859857866669234 ||| MAP : 4.81792631035205e-06


100%|██████████| 90/90 [02:25<00:00,  1.61s/it]


epoch : 1  |||  loss : 0.841936139401571 ||| MAP : 0.00012171519483672455


100%|██████████| 90/90 [02:25<00:00,  1.62s/it]


epoch : 2  |||  loss : 0.6368199403021441 ||| MAP : 0.0067360978573560715


100%|██████████| 90/90 [02:24<00:00,  1.61s/it]


epoch : 3  |||  loss : 0.4600661198575432 ||| MAP : 0.035255830734968185


100%|██████████| 90/90 [02:26<00:00,  1.62s/it]


epoch : 4  |||  loss : 0.37247026840124775 ||| MAP : 0.07248764485120773


100%|██████████| 90/90 [02:25<00:00,  1.61s/it]


epoch : 5  |||  loss : 0.32087176942525836 ||| MAP : 0.09344425797462463


100%|██████████| 90/90 [02:24<00:00,  1.61s/it]


epoch : 6  |||  loss : 0.2779007743308366 ||| MAP : 0.13352042436599731


100%|██████████| 90/90 [02:25<00:00,  1.62s/it]


epoch : 7  |||  loss : 0.2406669850958367 ||| MAP : 0.1324709802865982


100%|██████████| 90/90 [02:25<00:00,  1.62s/it]


epoch : 8  |||  loss : 0.2072249951266111 ||| MAP : 0.1651550978422165


100%|██████████| 90/90 [02:25<00:00,  1.62s/it]


epoch : 9  |||  loss : 0.1828188298884922 ||| MAP : 0.15864664316177368


100%|██████████| 90/90 [02:25<00:00,  1.62s/it]


epoch : 10  |||  loss : 0.15870468290386214 ||| MAP : 0.1863785833120346


100%|██████████| 90/90 [02:24<00:00,  1.60s/it]


epoch : 11  |||  loss : 0.14379654509656845 ||| MAP : 0.1831900030374527


100%|██████████| 90/90 [02:25<00:00,  1.61s/it]


epoch : 12  |||  loss : 0.11772535484484335 ||| MAP : 0.1916196197271347


100%|██████████| 90/90 [02:26<00:00,  1.63s/it]


epoch : 13  |||  loss : 0.10066237644307362 ||| MAP : 0.19622911512851715


100%|██████████| 90/90 [02:23<00:00,  1.60s/it]


epoch : 14  |||  loss : 0.08727496192743124 ||| MAP : 0.19028602540493011


Creation of a list to store loaded images after using the PIL module

In [37]:
images = []
for gesture in class_names[:-1]:
    image_path = glob(f'dataset/subsample/{gesture}/*.jpg')[0]
    images.append(Image.open(image_path))

Creation of a new list of image tensors by applying a PyTorch transform

In [38]:
images_tensors = images.copy()
images_tensors_input = list(transform(image).to(device) for image in images_tensors)

Inference on the input data by passing it through the specified PyTorch model with gradient calculation disabled, and returning the model's output tensor as out.

In [39]:
with torch.no_grad():
    model.eval()
    out = model(images_tensors_input)

For each image, the cell below keeps the predictions with a confidence score of at least 0.2 and stores the bounding boxes, scores, and class labels of up to two of them in Python lists, which can be used for further processing or visualization of the detection results.

In [40]:
bboxes = []
scores = []
labels = []
for pred in out:
    ids = pred['scores'] >= 0.2
    bboxes.append(pred['boxes'][ids][:2].cpu().numpy().astype(int))  # np.int was removed in NumPy 1.24
    scores.append(pred['scores'][ids][:2].cpu().numpy())
    labels.append(pred['labels'][ids][:2].cpu().numpy())
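
The filtering step can be illustrated without tensors. The sketch below applies the same threshold of 0.2 and the same limit of two detections to plain lists; it sorts by score explicitly, whereas the cell above slices `[:2]` directly and therefore relies on the model returning its detections ordered by decreasing score. The helper name and the sample values are made up.

```python
def select_detections(boxes, scores, labels, threshold=0.2, top_k=2):
    """Keep detections scoring at least `threshold`, then the `top_k` best of them."""
    kept = [(box, score, label)
            for box, score, label in zip(boxes, scores, labels)
            if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)  # highest score first
    return kept[:top_k]


boxes = [[0, 0, 10, 10], [5, 5, 20, 20], [1, 1, 4, 4], [2, 2, 8, 8]]
scores = [0.15, 0.90, 0.45, 0.30]
labels = [3, 7, 7, 18]
for box, score, label in select_detections(boxes, scores, labels):
    print(label, score, box)
```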

To create shorter abbreviations for some class names in a list for easier visualization

In [41]:
short_class_names = []

for name in class_names:
    if name == 'stop_inverted':
        short_class_names.append('stop inv.')
    elif name == 'peace_inverted':
        short_class_names.append('peace inv.')
    elif name == 'two_up':
        short_class_names.append('two up')
    elif name == 'two_up_inverted':
        short_class_names.append('two up inv.')
    elif name == 'no_gesture':
        short_class_names.append('no gesture')
    else:
        short_class_names.append(name)
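
The same abbreviations can also be expressed as a dictionary lookup with a fallback to the original name. This is only an alternative sketch of the cell above; the names `SHORT_NAMES` and `shorten` are not part of the notebook.

```python
# Display names for the classes that are abbreviated; all others stay unchanged.
SHORT_NAMES = {
    "stop_inverted": "stop inv.",
    "peace_inverted": "peace inv.",
    "two_up": "two up",
    "two_up_inverted": "two up inv.",
    "no_gesture": "no gesture",
}


def shorten(names):
    """Return the display name for every class name, keeping the input order."""
    return [SHORT_NAMES.get(name, name) for name in names]


print(shorten(["call", "two_up", "stop_inverted"]))  # ['call', 'two up', 'stop inv.']
```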

To create a list of modified final_images with bounding boxes and adding text labels for the detected objects in the original images

In [42]:
final_images = []
for bbox, score, label, image in zip(bboxes, scores, labels, images):
    image = np.array(image)
    for i, box in enumerate(bbox):
        _,width,_  = image.shape
        image = cv2.rectangle(image, box[:2], box[2:], thickness=3, color=[255, 0, 255])
        cv2.putText(image, f'{short_class_names[label[i]]}: {score[i]:0.2f}', (box[0], box[1]), cv2.FONT_HERSHEY_SIMPLEX,
                        width / 780, (0, 0, 255), 2)
    final_images.append(Image.fromarray(image))

In [43]:
!mkdir -p out_images
out_images = []
for i, image in enumerate(final_images):
    out_name = f"out_images/{i}.png"
    out_images.append(out_name)
    image.save(out_name)

Model Results

In [44]:
out_dir = "out_images/"
@interact
def show_images(file=os.listdir(out_dir)):
    display(DImage(out_dir+file, width=600, height=300))


**PROJECT TITLE**

"HAND GESTURE RECOGNITION Based on Computer Vision"

**GROUP 3**

 	Shreya Kiran Bhoir,
 	Vinay Malik,
 	Priyanka Awasthi,
 	Vignesh Ram Sundararaman,
 	Vikash Raj Chandrabalu

**MENTOR**

BABATUNDE GIWA

	
**OBJECTIVE**

To build a hand gesture recognition (HGR) system that can be used in video conferencing services, home automation systems, or the automotive sector.

**PROJECT SUMMARY**

Hand gesture recognition is an application of computer vision concepts for detecting and interpreting the movements or positions of a hand or fingers. The project's goal is to build a hand gesture recognition system. 

**SCOPE AND ALGORITHM**

The hand gesture recognition project entails creating a system that can detect various hand gestures. Deep learning-based methods have recently become an effective approach to hand gesture recognition due to the availability of annotated datasets and advances in hardware and software technologies. Deep neural networks are used in these methods to learn the features and the classifier from the input data at the same time. Convolutional neural networks (CNNs) are the most used architecture in this field, and they have shown promising results on several benchmark datasets.

**TOOLS AND REQUIREMENTS**

•	Python 
•	Jupyter Notebook
•	Pandas
•	Numpy
•	Matplotlib
•	PyTorch

**METHODOLOGY**

DATA COLLECTION:

We used HaGRID (HAnd Gesture Recognition Image Dataset), a large, publicly available image dataset hosted on Kaggle, to train and test our system. The dataset was downloaded to our local machines and the files were imported into Python.
HaGRID is 716 GB in size and contains 552,992 Full HD (1920 × 1080) RGB images divided into 18 gesture classes. In addition, some images carry an extra no_gesture class when a second, free hand is in the frame; this extra class has 123,589 samples. The data was split by subject user ID into 92% training and 8% testing sets, with 509,323 images for training and 43,669 images for testing.

MODEL:

To implement a hand gesture recognition system using computer vision techniques, the following methodology can be followed: extract features from the preprocessed data, train a machine learning model on the extracted features, test the model on a separate set of hand gesture data, and finally evaluate the model's performance.

INTERPRETATION OF RESULTS:

The results of the hand gesture recognition system can be interpreted in terms of accuracy and precision. Accuracy measures the percentage of correct predictions made by the model; precision measures the percentage of true positive predictions out of all positive predictions. A higher accuracy and precision indicate better performance of the system.
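
As a small worked example of the two measures defined above, the sketch below computes accuracy and precision from raw counts. The counts are made up for illustration and are not results of this project.

```python
def accuracy(correct, total):
    """Share of all predictions that are correct."""
    return correct / total if total else 0.0


def precision(true_positives, false_positives):
    """Share of positive predictions that are actually positive."""
    predicted_positives = true_positives + false_positives
    return true_positives / predicted_positives if predicted_positives else 0.0


# Hypothetical counts: 90 of 100 predictions correct, 8 true and 2 false positives.
print(accuracy(90, 100))  # 0.9
print(precision(8, 2))    # 0.8
```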

The annotated images were divided into training and validation sets by user ID (80% of the users for training and 20% for validation), so that no subject appears in both sets.

We trained an SSDLite detector with a MobileNetV3-Large backbone using the SGD optimizer (learning rate 0.005, momentum 0.9, weight decay 5e-4) for 15 epochs on the training set.
After each epoch we evaluated the model on the held-out set with Mean Average Precision (mAP), a widely used evaluation metric for object detection and image segmentation tasks.
It measures the accuracy of the model in terms of precision and recall: the average precision (AP) is computed for each detected object class, and the mean over all classes gives the final mAP score.

The mAP score in object detection measures how accurately an algorithm localises objects of interest and distinguishes them from other objects in the image. For each object class, the average precision is the area under the precision-recall curve, where precision is the ratio of true positives to predicted positives and recall is the ratio of true positives to ground-truth positives; mAP is the mean of these per-class values.
A high mAP score indicates that the algorithm detects the objects of interest with both high precision and high recall.

**COMPUTER VISION CONCEPTS USED IN HAND GESTURE RECOGNITION:**

Here are some computer vision terms commonly used in hand gesture recognition:
Hand detection: The process of identifying and localizing the hand in an image.
Hand segmentation: The process of separating the hand region from the background in an image or video frame.

Feature extraction: The process of identifying key characteristics or features of the hand, such as its shape, size, and position.

Image preprocessing: Techniques used to enhance the input image.
Image segmentation: Techniques used to separate the hand from the background.
Feature extraction: Techniques used to extract relevant information from the preprocessed image.
Machine learning: Techniques used to train a model to classify the hand gesture based on the extracted features.
Classification: The process of assigning a specific gesture or action to the detected hand movements.
Deep learning: A machine learning approach that uses artificial neural networks to learn from data, which is commonly used in hand gesture recognition systems.
Gesture recognition: The process of identifying and interpreting specific hand gestures, such as pointing, waving, or making a fist.

**TECHNICAL ANALYSIS**

The hand gesture recognition system may be technically examined using computer vision techniques such as image processing, feature extraction, and machine learning. Image processing techniques such as noise reduction and image segmentation can be used to improve the quality of the input image. Relevant information from the preprocessed picture, such as the shape and location of the hand, may be extracted using feature extraction algorithms. Machine learning techniques may be used to train a model to categorise the hand gesture based on the retrieved attributes.

**LITERATURE REVIEW**

1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321080/

Munir Oudah, Ali Al-Naji, and Javaan Chahl, “Hand Gesture Recognition Based on Computer Vision: A Review of Techniques”, Electrical Engineering Technical College, Middle Technical University, J. Imaging, August 2020

Hand gestures are a type of nonverbal communication that can be used in a variety of fields, including deaf-mute communication, robot control, human-computer interaction (HCI), home automation, and medical applications. Many different techniques have been used on hand gestures, including those based on instrumented sensor technology and computer vision. In other words, the hand sign can be divided into several categories, including posture and gesture, as well as dynamic and static, or a hybrid of the two. This paper focuses on a review on hand gesture techniques and introduces their benefits and drawbacks in various situations. Furthermore, it tabulates the performance of these methods, with a focus on computer vision techniques dealing with similarity and difference points.

2) https://www.ijert.org/research/dynamic-hand-gesture-recognition-a-literature-review-IJERTV1IS9222.pdf

Deepali N. Kakade, Prof. Dr. J.S. Chitode, “Dynamic Hand Gesture Recognition”, “International Journal of Engineering Research & Technology (IJERT)”, Vol. 1 Issue 9, November- 2012

This paper reviews recent hand gesture recognition systems which have gained attention due to their ability to efficiently interact with computer systems through human-computer interaction. The paper demonstrated how to create a natural interface between humans and computers by recognizing gestures for controlling robots or conveying information. The paper covers camera interfaces, image processing, hand gestures, color detection, and recognition. The advantages of using hand gestures include ease of use, naturalness, and intuitiveness, which have made them successful in applications such as computer game control, human-robot interaction, and sign language recognition. In the past, glove-based devices were used, but they were cumbersome and unnatural. Video-based non-contact interaction techniques have made gesture inputs more natural and improved the interface between humans and computers. 

3)	https://web.stanford.edu/class/cs231a/prev_projects_2016/CS231A_Project_Final.pdf

Zi Xian, Justin Yeo, “Hand Recognition and Gesture Control Using a Laptop Web-camera” Stanford University 450 Serra Mall, Stanford

Given the recent growth and popularity of Virtual and Augmented Reality, hand gesture recognition is a technology that is becoming increasingly important. It is an important aspect of Human Computer Interaction (HCI) because it allows for two-way interaction in virtual spaces. Many examples of such interaction, however, are currently limited to specialised applications or more expensive devices such as the Kinect and the Oculus Rift. In this paper, the authors investigated hand gesture recognition methods using a more common device, a laptop webcam. They focused on three different methods of segmenting the hand, documenting the advantages and disadvantages of each method.

**CONCLUSION**

Our hand gesture recognition project results show that we achieved a test accuracy of 99.05%. This demonstrates that our model can recognise hand gestures accurately in real-time using a webcam and can be used in a variety of applications such as human-computer interaction and robotics. It is worth noting, however, that our results were obtained using a controlled dataset and that additional testing on diverse datasets may be required to evaluate the model's generalisation performance.

**REFERENCES AND CITATION:**

https://www.researchgate.net/publication/284626785_Hand_Gesture_Recognition_A_Literature_Review
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321080/
https://www.mdpi.com/2313-433X/6/8/73
https://www.academia.edu/17775220/Hand_Gesture_Recognition_A_Literature_Review
https://www.kaggle.com/code/adinishad/hand-sign-recognition-cnn-keras-97-accuracy?scriptVersionId=41482305&cellId=4
https://web.stanford.edu/class/cs231a/prev_projects_2016/CS231A_Project_Final.pdf
https://peerj.com/articles/cs-218/
https://gitlab.aicloud.sbercloud.ru/rndcv/hagrid







