# **Gesture-Recognition System For Drones**

## **Goal:**
Use MediaPipe hand landmarking together with a machine learning model to build a system that reliably recognizes hand gestures and translates them into movement commands for the drone.

## **The gestures:**
We use a total of 8 gestures to control the Tello drone. The number in parentheses is the number of recorded variants of each gesture (18 in total).<br>
The gestures are:
<div style="display: inline-block; margin-right: 10%; max-width: 200px; float: right;">
<img
  src="https://s-cdn.ryzerobotics.com/stormsend/uploads/13433930-d1e1-0135-d3c1-12530322f90d/guava-%E7%99%BD-pc-160_154_2x.png"
  alt="Drone Image"
  title="Fig.1 Drone"
  style="border-radius:5%;"><p align="center">Fig.1 Drone</p>
</div>

1. ```Up     - Point Upwards                         (2)```
2. ```Down   - Point Downwards                       (2)```
3. ```Left   - Point to Left                         (2)```
4. ```Right  - Point to Right                        (2)```
5. ```Front  - Flatten hand and Point Forward        (2)```
6. ```Back   - Thumb and Pinky Finger out            (2)```
7. ```Land   - Okay sign                             (2)```
8. ```Flip   - Yo! sign                              (4)```
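
The variant counts in parentheses can be checked against the folder names used later in the notebook. This is a minimal sketch, assuming every variant name follows the `<gesture><number>` pattern seen in `gesture_list`; the helper name `base_gesture` is hypothetical and not part of the project code.

```python
import re
from collections import Counter

# Variant names as they appear in gesture_list later in the notebook.
gesture_list = ['back1', 'back2', 'down1', 'down2',
                'flip1', 'flip2', 'flip3', 'flip4',
                'front1', 'front2', 'land1', 'land2',
                'left1', 'left2', 'right1', 'right2', 'up1', 'up2']


def base_gesture(variant: str) -> str:
    """Strip the trailing variant number, e.g. 'flip3' -> 'flip' (hypothetical helper)."""
    return re.sub(r"\d+$", "", variant)


# Count how many recorded variants belong to each of the 8 gestures.
variants_per_gesture = Counter(base_gesture(v) for v in gesture_list)
print(variants_per_gesture)
```

Seven gestures have two variants each and Flip has four, which gives the 18 classes the model is trained on.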


#### This is a **four-part project**, and its dataset is available on Kaggle.
The GitHub repository of the project is:<br>
[https://github.com/RumbleJack56/HandGestureRecognition-P](https://github.com/RumbleJack56/HandGestureRecognition-P)


## **Part 1: Data-Collection**

We collect data with the ```opencv``` library, using ```cv2.VideoCapture()``` to access the camera and capture images of the different gestures as training examples.<br>
First, let's import all the libraries.

In [5]:
#importing dependencies
import cv2
import os
import time
import pandas as pd
import numpy as np
print(os.getcwd())

e:\College\S2-even\ML\Project\HandGestureRecognition-P


* We check for available **webcams**
* Then we select a webcam
* Then we capture a picture on every press of the **s** key and save it in the dataset folder
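
The webcam check can also be written so that every probed device is released again before the capture loop opens it. The following is only a sketch: the function name `find_cameras` and its `opener` parameter are hypothetical, and by default it assumes `cv2.VideoCapture` is available, as in the notebook.

```python
def find_cameras(max_ports=6, opener=None):
    """Return the port numbers at which a camera can be opened.

    `opener` is a hypothetical hook for testing; by default the sketch
    uses cv2.VideoCapture.
    """
    if opener is None:
        import cv2  # imported lazily so the helper can be tried without OpenCV
        opener = cv2.VideoCapture
    found = []
    for port in range(max_ports):
        cap = opener(port)
        try:
            if cap.isOpened():
                found.append(port)
        finally:
            cap.release()  # free the device so a later VideoCapture can reopen it
    return found
```

Calling `find_cameras()` probes ports 0 to 5, the same range as the one-line filter used in the next cell.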

In [None]:
gesture_list = ['back1', 'back2', 'down1', 'down2', 'flip1', 'flip2', 'flip3', 'flip4', 'front1', 'front2', 'land1', 'land2', 'left1', 'left2', 'right1', 'right2', 'up1', 'up2']

In [None]:
available_cameras = list(filter(lambda x:cv2.VideoCapture(x) and cv2.VideoCapture(x).isOpened(),range(6)))
gesture_list = os.listdir(".dataset/")
maxEntries = 100
print("Available Cameras at ports : ",*available_cameras)
print("Gestures to record are :", gesture_list)

The camera frame is cropped to 480 by 480; we add an additional 20px on top to accommodate the text that guides the image capture.<br>
```cv2.waitKey()``` is used for keypress detection.<br>
There are 18 gesture variants in total.<br>
The gesture name and image number are shown at the top of the frame.
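
The column slice `79:559` used in the capture loop appears to assume a 640-pixel-wide camera frame. As a hedged sketch, the crop window can instead be computed from the actual frame width; the helper name `center_crop_columns` is hypothetical.

```python
from typing import Tuple


def center_crop_columns(frame_width: int, crop_width: int = 480) -> Tuple[int, int]:
    """Return (start, stop) column indices of a horizontally centered crop."""
    if frame_width < crop_width:
        raise ValueError("frame is narrower than the requested crop")
    start = (frame_width - crop_width) // 2
    return start, start + crop_width


start, stop = center_crop_columns(640)
print(start, stop)  # 80 560
```

For a 640-wide frame the exact center window is `80:560`; the notebook's `79:559` is the same width shifted left by one pixel, so both produce 480-column crops.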

In [None]:
cap = cv2.VideoCapture(available_cameras[0])
mainFrame = np.zeros(500*480*3,dtype=np.uint8).reshape(500,480,3)
entryNum = 1

for gesture in gesture_list:
    mainFrame[0:20,:200,:] = np.zeros(20*200*3).reshape(20,200,3)
    cv2.putText(mainFrame,f"Save with S | Quit with Q",[200,15],0,0.5,[255,255,255])
    cv2.putText(mainFrame,f"{gesture}  Img:{entryNum}",[20,15],0,0.5,[255,255,255])
    while entryNum<=maxEntries:
        ret , frame = cap.read()
        mainFrame[20:,:,:] = frame[0:480,79:559,:]
        cv2.imshow("frame",mainFrame)

        inp = cv2.waitKey(5) & 0xFF

        if inp == ord("s"):
            mainFrame[0:20,:200,:] = np.zeros(20*200*3).reshape(20,200,3)
            cv2.putText(mainFrame,f"{gesture}  Img:{entryNum}",[20,15],0,0.5,[255,255,255])
            cv2.imwrite(f".dataset/{gesture}/{entryNum}.jpg", frame[:,79:559,:])
            entryNum+=1
        if inp == ord("q"):
            break
    entryNum=1
    mainFrame[0:20,:200,:] = np.zeros(20*200*3).reshape(20,200,3)
    cv2.putText(mainFrame,f"waiting 3 sec",[20,15],0,0.5,[255,255,255])
    cv2.imshow("frame",mainFrame)
    cv2.waitKey(1)  # let the window redraw so the message is visible during the pause
    time.sleep(3)

cap.release()
cv2.destroyAllWindows()

## **Part 2: Data Preprocessing**

* Now we have a dataset with 100 images for each of the 18 gesture variants, for a total of 1800 images
* Using MediaPipe, we convert each image into the landmark points of the hand
* We store the points, together with whether the hand is a left or right hand, as columns of a dataframe
* We save the dataframe as a CSV file

##### **First we import the necessary libraries :**

In [119]:
import pandas as pd
import numpy as np
import os
import cv2
from mediapipe import tasks,Image,solutions
from mediapipe.framework.formats import landmark_pb2
print(os.getcwd())


e:\College\S2-even\ML\Project\HandGestureRecognition-P


<img src="https://developers.google.com/static/mediapipe/images/solutions/hand-landmarks.png" width="800" height="300" alt="Hand Landmarker Image (not loaded)" style="float:right;border-radius:30px;">

**MediaPipe** provides a ```HandLandmarker``` class that is configured with a *hand_landmarker.task* model file.<br> It uses that model to locate the points on the hand.<br>
There are **21 points** on the hand, shown in the image.<br>
We first define a function to *convert an image to coordinates*.<br>
Then we initialize the dataframe that stores the coordinates.
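
Each detected hand becomes one row of 43 numbers: a handedness flag followed by the x and y of the 21 points. The sketch below mirrors that layout without needing MediaPipe; `Point`, `feature_columns` and `to_feature_row` are hypothetical stand-ins, not names from the library or the project.

```python
from collections import namedtuple

# Hypothetical stand-in for a MediaPipe landmark (only x and y are used here).
Point = namedtuple("Point", ["x", "y"])


def feature_columns(num_points=21):
    """Column names in dataframe order: Hand, x1, y1, ..., x21, y21."""
    return ["Hand"] + [axis + str(i) for i in range(1, num_points + 1) for axis in "xy"]


def to_feature_row(handedness, points):
    """Flatten one hand into [hand_flag, x1, y1, ..., xN, yN]."""
    row = [0.0 if handedness.lower() == "left" else 1.0]  # 0 = left, 1 = right
    for pt in points:
        row.extend([pt.x, pt.y])
    return row


example = to_feature_row("Right", [Point(0.5, 0.5)] * 21)
print(len(feature_columns()), len(example))  # 43 43
```

The 43 values per row are what the model later receives as input.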


In [120]:
BaseOptions = tasks.BaseOptions
HandLandmarker = tasks.vision.HandLandmarker
HandLandmarkerOptions = tasks.vision.HandLandmarkerOptions
VisionMode_IMAGE = tasks.vision.RunningMode.IMAGE

#define conversion Function
def convertToCords(img):
    landmarker_options = HandLandmarkerOptions(base_options=BaseOptions(model_asset_path="handlandmarker/hand_landmarker.task"),
                                           num_hands=1,
                                           running_mode=VisionMode_IMAGE)
    detector = HandLandmarker.create_from_options(landmarker_options)
    image = Image.create_from_file(img)
    rawOutput = detector.detect(image)
    
    if len(rawOutput.hand_landmarks)==0:
        return [0]*43 , rawOutput
    cords = [[pt.x,pt.y] for h in rawOutput.hand_landmarks for pt in h]
    hands = [x.category_name for y in rawOutput.handedness for x in y]
    hands = [0 if a.lower()=="left" else 1 for a in hands]
    cords = np.array(cords).reshape(-1)
    return np.concatenate([hands,cords]) , rawOutput

#create dataframe
df = pd.DataFrame(columns = ["Gesture","Specific","Hand"]+[a+str(b) for b in range(1,22)for a in "xy" ])

We append each image in which a hand was successfully detected to the dataframe, then save the dataframe.
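
When no hand is found, the conversion function returns a row of 43 zeros, so the loop rejects rows that contain more than 10 zeros. A small sketch of that check, with the hypothetical helper name `is_missing_hand`:

```python
def is_missing_hand(row, max_zeros=10):
    """True when a feature row looks like the all-zero 'no hand detected' placeholder."""
    # A left hand legitimately contributes one zero (the handedness flag),
    # so a small tolerance is kept instead of testing for any zero at all.
    return list(row).count(0) > max_zeros


print(is_missing_hand([0] * 43))            # True
print(is_missing_hand([0.0] + [0.4] * 42))  # False
```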

In [None]:
gesture_list = os.listdir(".dataset/")
errors = []
c=0
for gesture in gesture_list:
    for img in os.listdir(".dataset/"+gesture+"/"):
        coords , raw = convertToCords(f".dataset/{gesture}/{img}")
        if list(coords).count(0) > 10:
            errors.append([gesture,img,coords])
            continue
        findat = ["".join(filter(lambda x:not x.isnumeric(),gesture)),gesture] + list(coords)
        print(findat,len(findat))
        df.loc[c] = findat
        c+=1

In [122]:
df.to_csv("csv/co-ordinates.csv",index=False)
df


Unnamed: 0,Gesture,Specific,Hand,x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6,x7,y7,x8,y8,x9,y9,x10,y10,x11,y11,x12,y12,x13,y13,x14,y14,x15,y15,x16,y16,x17,y17,x18,y18,x19,y19,x20,y20,x21,y21
0,back,back1,0.0,0.414146,0.612191,0.328202,0.509499,0.274394,0.378656,0.235361,0.280007,0.172614,0.232476,0.417589,0.305328,0.478275,0.240938,0.447562,0.327513,0.413079,0.384554,0.492442,0.339574,0.545159,0.286216,0.491170,0.383502,0.450574,0.436260,0.557899,0.376470,0.615387,0.327836,0.549380,0.412990,0.496268,0.462539,0.611348,0.414597,0.677527,0.330069,0.720912,0.286768,0.765386,0.237655
1,back,back1,0.0,0.370650,0.610049,0.298608,0.504365,0.243131,0.383307,0.198213,0.276979,0.130233,0.221631,0.393534,0.307553,0.453799,0.268997,0.417932,0.352670,0.377661,0.397066,0.462743,0.344746,0.514734,0.312115,0.452130,0.403362,0.417195,0.434952,0.522675,0.387920,0.579721,0.356566,0.508486,0.435011,0.453792,0.472075,0.566888,0.430045,0.642149,0.365600,0.692941,0.333702,0.745193,0.293637
2,back,back1,0.0,0.473404,0.610534,0.403636,0.540998,0.336039,0.457764,0.274731,0.383794,0.208107,0.352572,0.441720,0.354440,0.492019,0.326441,0.486615,0.402027,0.467288,0.446365,0.505381,0.367606,0.549151,0.344474,0.524393,0.429126,0.500389,0.467755,0.559533,0.387625,0.604363,0.359324,0.570785,0.435650,0.537248,0.476573,0.602514,0.409813,0.652023,0.339880,0.688451,0.302070,0.722174,0.258313
3,back,back1,0.0,0.404165,0.591113,0.323498,0.500862,0.257899,0.386378,0.194849,0.297057,0.122084,0.254426,0.400766,0.296457,0.453977,0.256672,0.428348,0.346314,0.396861,0.396004,0.474016,0.323633,0.517238,0.295042,0.468126,0.392140,0.432252,0.433274,0.536755,0.356948,0.585354,0.324204,0.527779,0.409341,0.481247,0.451517,0.585874,0.391966,0.654301,0.317290,0.703211,0.279311,0.751139,0.231592
4,back,back1,0.0,0.438877,0.565304,0.365005,0.479435,0.295897,0.372908,0.227563,0.281984,0.147816,0.245443,0.427087,0.272703,0.498420,0.237385,0.484256,0.318174,0.450644,0.366944,0.500393,0.296554,0.558035,0.270496,0.519303,0.359943,0.479609,0.402161,0.562462,0.327223,0.623040,0.296477,0.575185,0.375263,0.526642,0.418887,0.608410,0.358565,0.672683,0.287383,0.719293,0.249577,0.762312,0.205041
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1777,up,up2,1.0,0.318180,0.760065,0.413418,0.760018,0.514830,0.695766,0.584936,0.641249,0.652940,0.616664,0.476888,0.538831,0.495333,0.433714,0.497844,0.367536,0.495844,0.312925,0.407003,0.509364,0.416092,0.548843,0.399864,0.630851,0.388385,0.668507,0.337585,0.514801,0.350348,0.579631,0.347047,0.649710,0.347737,0.675092,0.275496,0.534110,0.295744,0.579818,0.304097,0.639276,0.311401,0.661359
1778,up,up2,1.0,0.333133,0.765357,0.433654,0.750039,0.528928,0.673653,0.591601,0.612895,0.651415,0.577790,0.474883,0.525129,0.477423,0.417132,0.470425,0.345884,0.461055,0.287999,0.402714,0.500945,0.412038,0.534246,0.406168,0.620228,0.399233,0.667343,0.332708,0.510996,0.346669,0.567924,0.351486,0.641314,0.354265,0.676793,0.270657,0.533770,0.290863,0.575671,0.304575,0.634753,0.314181,0.660901
1779,up,up2,1.0,0.359078,0.772219,0.462878,0.744825,0.550234,0.657884,0.606673,0.591640,0.665992,0.555741,0.476407,0.511747,0.471551,0.400755,0.461377,0.332231,0.450677,0.277905,0.403008,0.497380,0.424101,0.529585,0.424281,0.616300,0.418132,0.661633,0.336175,0.513233,0.364106,0.570689,0.371530,0.644089,0.372024,0.675554,0.278025,0.539133,0.309333,0.578653,0.325870,0.635492,0.332590,0.658003
1780,up,up2,1.0,0.361350,0.768730,0.462772,0.749407,0.552244,0.666314,0.609610,0.600176,0.671599,0.564308,0.486660,0.515329,0.485570,0.405388,0.477345,0.336882,0.467720,0.280982,0.413424,0.497023,0.434099,0.533829,0.432243,0.623287,0.425796,0.666410,0.345723,0.511361,0.372192,0.574959,0.377984,0.648686,0.379364,0.675236,0.286465,0.537240,0.316468,0.582095,0.331279,0.640273,0.338255,0.659528


## **Part 3: Model Compilation**

* Now we have the dataframe with target Y = the gesture variant (the Specific column) and parameters X = Hand and x1 y1 to x21 y21
* We encode each gesture variant as an integer class index (0 to 17), the label format that SparseCategoricalCrossentropy expects
* The input size is 43 (1 handedness flag + 21 × 2 coordinates)
* Model used is a Dense Sequential Network with 128 , 64 , DropOut 0.2 , 64 , 32 , 18

##### **First we import the necessary libraries :**

In [123]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import InputLayer,Dropout,Dense
from tensorflow.nn import softmax
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import Accuracy
import pandas as pd,numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

Then we load the data from the CSV file and split it into training and validation sets with ```train_test_split```.
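
The integer labels come from the position of each variant name in the folder listing. Because `os.listdir` returns entries in arbitrary order, a hedged alternative is to sort the names so the mapping matches the alphabetical list that is hard-coded in Part 4. The helper name `build_label_map` is hypothetical.

```python
def build_label_map(names):
    """Map each gesture variant name to a stable integer class index."""
    return {name: index for index, name in enumerate(sorted(names))}


# Alphabetical list as hard-coded in the live-prediction cell of Part 4.
part4_list = ['back1', 'back2', 'down1', 'down2', 'flip1', 'flip2', 'flip3', 'flip4',
              'front1', 'front2', 'land1', 'land2', 'left1', 'left2',
              'right1', 'right2', 'up1', 'up2']

# Even if the folder listing comes back in another order, the indices agree.
label_map = build_label_map(reversed(part4_list))
print(label_map["back1"], label_map["up2"])  # 0 17
```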

In [124]:
gesture_list = os.listdir(".dataset/")
Xy = pd.read_csv("csv/co-ordinates.csv")
y_full = pd.DataFrame(map(lambda x: gesture_list.index(x),pd.DataFrame(Xy.loc[:,"Specific"]).to_numpy().reshape(-1).tolist()))
X_full = Xy.drop(["Specific","Gesture"],axis=1)
X_train, X_valid, y_train,  y_valid = train_test_split(X_full,y_full,train_size=0.7,test_size=0.3,random_state=443)

Now we define the model, and preview its summary

In [125]:
model = Sequential([
    InputLayer(input_shape=[43]),
    Dense(128,activation="relu",kernel_regularizer="l2",name="PrimaryIN"),
    Dense(64,activation="relu",kernel_regularizer="l2",name="Reducer1"),
    Dropout(0.2,name="80percent"),
    Dense(64,activation="relu",name="Paralleler"),
    Dense(32,activation="relu",name="Reducer2"),
    Dense(18,activation="linear",name="Out"),
],name="Gestures")

model.summary()
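
As a quick sanity check on the summary, the trainable parameter count of the dense stack can be worked out by hand. This sketch assumes the layer widths from the cell above (43 inputs, then 128, 64, 64, 32 and 18 units) and that each Dense layer has one bias per unit; Dropout adds no parameters.

```python
# Input width followed by the Dense layer widths from the model definition.
layer_sizes = [43, 128, 64, 64, 32, 18]


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer)     # [5632, 8256, 4160, 2080, 594]
print(total_params)  # 20722
```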



After this we train the model. We use SparseCategoricalCrossentropy because it handles single-label, multi-class classification with integer class labels.<br> The output layer is linear and the loss is computed from logits, so we can apply softmax afterwards to obtain confidence values and threshold them.
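
To make the loss concrete, here is a plain-Python sketch of what sparse categorical cross-entropy from logits computes for a single sample: the negative log of the softmax probability at the true class index. It illustrates the formula only and is not the Keras implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one sample whose true class is the integer `label`."""
    return -math.log(softmax(logits)[label])


logits = [2.0, 0.5, -1.0]
print(sparse_ce_from_logits(logits, 0))  # small loss: class 0 has the largest logit
print(sparse_ce_from_logits(logits, 2))  # larger loss: class 2 has the smallest logit
```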

In [127]:
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True),optimizer=Adam(1e-3),metrics=['accuracy'])

model.fit(X_train,y_train,epochs=100,validation_data=(X_valid,y_valid),callbacks=EarlyStopping(monitor='val_accuracy',patience=5,restore_best_weights=True))

Epoch 1/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - accuracy: 0.9146 - loss: 0.8329 - val_accuracy: 1.0000 - val_loss: 0.6558
Epoch 2/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9334 - loss: 0.7582 - val_accuracy: 0.9981 - val_loss: 0.6155
Epoch 3/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9538 - loss: 0.7087 - val_accuracy: 0.9963 - val_loss: 0.5708
Epoch 4/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9348 - loss: 0.6820 - val_accuracy: 0.9944 - val_loss: 0.5489
Epoch 5/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9601 - loss: 0.6073 - val_accuracy: 0.9421 - val_loss: 0.5351
Epoch 6/100
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9708 - loss: 0.5797 - val_accuracy: 1.0000 - val_loss: 0.4767


<keras.src.callbacks.history.History at 0x1c5d3a86290>
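
The step counts in the log (39 per training epoch, 17 for prediction) can be cross-checked against the split sizes. This sketch rests on three assumptions: the dataframe has 1781 rows (index 0 to 1780 in the preview), scikit-learn rounds the validation size up and the training size down, and Keras uses its default batch size of 32.

```python
import math

n_samples = 1781   # assumption: rows 0..1780 shown in the dataframe preview
batch_size = 32    # assumption: Keras default, since model.fit sets no batch_size

n_valid = math.ceil(0.3 * n_samples)    # validation rows (test_size=0.3)
n_train = math.floor(0.7 * n_samples)   # training rows (train_size=0.7)

train_steps = math.ceil(n_train / batch_size)
valid_steps = math.ceil(n_valid / batch_size)
print(n_train, n_valid)          # 1246 535
print(train_steps, valid_steps)  # 39 17
```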

We now test it on the validation set.<br>
The hyperparameters were tuned by iterative testing and intuition, and the resulting accuracy is high.

In [128]:
a = model.predict(X_valid)
p = [[np.argmax(softmax(x)),max(softmax(x)).numpy()] for x in a]
q = y_valid.to_numpy()
testpreds = [np.argmax(softmax(x)) for x in a]
acc = Accuracy(name="accuracy")
acc.update_state(testpreds,q)
acc.result().numpy()

17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


1.0

We observe 100% accuracy on the validation split, which indicates that the model separates the recorded gestures very well.<br>
We save this model in the models folder.

In [129]:
model.save("models/main3.keras")

## **Part 4: Live Working**



* We have trained and saved a model with high validation accuracy
* We load that model and predict on live webcam frames processed with MediaPipe
* We present the output in an OpenCV window

##### **First we import the necessary libraries :**

In [8]:
from tensorflow.keras.models import load_model
from tensorflow.nn import softmax
import cv2,numpy as np,time,os
from mediapipe import Image,tasks,solutions,ImageFormat
from mediapipe.framework.formats import landmark_pb2

In [9]:
available_cameras = list(filter(lambda x:cv2.VideoCapture(x) and cv2.VideoCapture(x).isOpened(),range(6)))
print("Available Cameras at ports : ",*available_cameras)


Available Cameras at ports :  0 2


We now reuse code similar to the coordinate conversion and image capturing steps.<br>
This lets us keep the same input and output format as before.<br>
We feed this into the loaded model to predict the gesture for each frame.

In [10]:
BaseOptions = tasks.BaseOptions
HandLandmarker = tasks.vision.HandLandmarker
HandLandmarkerOptions = tasks.vision.HandLandmarkerOptions
VisionMode_IMAGE = tasks.vision.RunningMode.IMAGE

solution_landmark_style = solutions.drawing_styles.get_default_hand_landmarks_style
solution_connection_style = solutions.drawing_styles.get_default_hand_connections_style

def convertToCords(img):
    landmarker_options = HandLandmarkerOptions(base_options=BaseOptions(model_asset_path="handlandmarker/hand_landmarker.task"),
                                           num_hands=1,
                                           running_mode=VisionMode_IMAGE)
    detector = HandLandmarker.create_from_options(landmarker_options)
    
    image = Image(image_format= ImageFormat.SRGB,data=img)
    rawOutput = detector.detect(image)

    
    if len(rawOutput.hand_landmarks)==0:
        return [0]*43 , rawOutput
    cords = [[pt.x,pt.y] for h in rawOutput.hand_landmarks for pt in h]
    hands = [x.category_name for y in rawOutput.handedness for x in y]
    hands = [0 if a.lower()=="left" else 1 for a in hands]
    cords = np.array(cords).reshape(-1)
    return np.concatenate([hands,cords]) , rawOutput

def showImg(img,detected_result):
    for landmarks,handedness in zip(detected_result.hand_landmarks,detected_result.handedness):
        proto_marks = landmark_pb2.NormalizedLandmarkList()
        proto_marks.landmark.extend([landmark_pb2.NormalizedLandmark(x=L.x,y=L.y) for L in landmarks])
        solutions.drawing_utils.draw_landmarks(img,proto_marks,solutions.hands.HAND_CONNECTIONS,solution_landmark_style(),solution_connection_style())
    return img

Now we run the main loop and display a prediction only when its confidence exceeds the 50% threshold used in the code (```conf > 0.5```).
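
The thresholding step can be illustrated on its own: convert the logits to probabilities, take the most likely class, and report it only when its probability clears the threshold. This is a plain-Python sketch using the `conf > 0.5` comparison from the loop as the default; the function name `decide` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, labels, threshold=0.5):
    """Return (label, confidence); label is None when confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > threshold:
        return labels[best], probs[best]
    return None, probs[best]


labels = ["up1", "down1", "left1"]
print(decide([5.0, 0.0, 0.0], labels))  # confident: returns 'up1'
print(decide([0.0, 0.0, 0.0], labels))  # uniform scores: returns None
```

A `None` label corresponds to the "no strong detection" message drawn by the loop.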

In [11]:
cap = cv2.VideoCapture(available_cameras[0])
mainFrame = np.zeros(550*600*3,dtype=np.uint8).reshape(550,600,3)
model = load_model("models/main3.keras")
gesture_list = ['back1', 'back2', 'down1', 'down2', 'flip1', 'flip2', 'flip3', 'flip4', 'front1', 'front2', 'land1', 'land2', 'left1', 'left2', 'right1', 'right2', 'up1', 'up2']


while True:
    _ , frame = cap.read()
    frame = frame[:,79:559,:]
    pts , raw = convertToCords(cv2.cvtColor(frame,cv2.COLOR_BGR2RGB))
    pickim = showImg(frame,raw)
    conf=0
    if len(raw.handedness)!=0:
        preds = model.predict(np.array(pts).reshape(1,43),verbose=0)
        softout = softmax(preds)
        ans = gesture_list[np.argmax(softout)]
        conf = np.max(softout.numpy())
        print(ans ,conf, list(softout.numpy().tolist()) ,list(pts),end="\r")
    
    mainFrame[70:,79:559,:] = pickim
    if conf > 0.5:
        mainFrame[:70,:300,:] = np.zeros(70*300*3).reshape(70,300,3)
        cv2.putText(mainFrame,f"{ans}",[20,20],0,0.5,color=[0,255,0])
    else:
        mainFrame[:70,:300,:] = np.zeros(70*300*3).reshape(70,300,3)
        cv2.putText(mainFrame,f"no strong detection",[20,20],0,0.5,color=[0,0,255])

    cv2.imshow("Frame",mainFrame)

    if (cv2.waitKey(25) & 0xFF == ord('q')):
        break
cap.release()
cv2.destroyAllWindows()

right2 0.8377846 [[6.903024768689647e-05, 0.01303767692297697, 7.4356574941703e-07, 0.01790691167116165, 0.0013277461985126138, 0.0024760933592915535, 0.011225773021578789, 6.049272371910774e-08, 3.1776065156918776e-07, 8.305639312311541e-06, 1.6398884472579311e-09, 9.985136983914344e-08, 2.3774562578182667e-05, 0.0015103545738384128, 0.045435160398483276, 0.8377845883369446, 5.163367859495338e-06, 0.06918823719024658]] [1.0, 0.4161107540130615, 0.8581729531288147, 0.4449227750301361, 0.775661826133728, 0.534915566444397, 0.7258224487304688, 0.632540225982666, 0.727108359336853, 0.7016428709030151, 0.7390425205230713, 0.5146825909614563, 0.701450765132904, 0.6432111263275146, 0.6929610967636108, 0.7206109166145325, 0.6936841011047363, 0.7762658596038818, 0.6951579451560974, 0.5143750905990601, 0.7766696810722351, 0.6698098182678223, 0.7755172848701477, 0.7390329837799072, 0.772437334060669, 0.7804704308509827, 0.7667444944381714, 0.5190845727920532, 0.8543644547462463, 0.67056524753570