### <center><h1> CAPSTONE PROJECT </h1></center>

# <center><h1>ASL SIGN DETECTION</h1></center>

***What is ASL?***

*American Sign Language (ASL) is a visual language used predominantly by Deaf communities in the United States and Anglophone Canada. It communicates through both manual cues (handshapes and movements) and non-manual cues (such as facial expressions). This project focuses solely on the manual subsystem of ASL, specifically the letters of the English alphabet.*

**Workflow of the Project**
   - Importing necessary libraries
   - Creating a function to save frames from a live webcam feed in a desired format
   - Preparation of Training Data
   - Model construction
   - Training the model
   - Using the model to predict ASL hand signs

### Importing Libraries

In [1]:
import cv2
import numpy as np
import math
import os
import random
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow.keras
from tensorflow import keras
from IPython.display import clear_output, display
from IPython.utils import io

Directory where the images captured from the webcam will be saved.

*Change it to a directory of your convenience*

In [2]:
IMG_SAVE_DIR = "C:/Jupyter/Capstone Project - ASL to Text/ImagesFromROI/"

Directory that contains the training images, organized by letter. A sub-directory for each letter must already be present.

*Change it to a directory of your convenience*

In [3]:
TRAIN_DIR = "C:/Jupyter/Capstone Project - ASL to Text/TrainingImages/"

Defining the classes of images that we wish to classify. *The letters J and Z are excluded here because their signs require motion, which a single frame cannot capture*

In [4]:
CATEGORIES = ['A','B','C','D','E','F','G','H','I','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y']
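The position of each letter in this list is later used as its numeric label, so it helps to see the mapping explicitly. A minimal sketch follows; `LETTER_TO_INDEX` and `INDEX_TO_LETTER` are hypothetical helper names, not part of the notebook.

```python
import string

# Same class list as in the notebook: A-Z without the motion-based letters J and Z.
CATEGORIES = [c for c in string.ascii_uppercase if c not in ("J", "Z")]

# Hypothetical helper dictionaries (assumed names, for illustration only).
LETTER_TO_INDEX = {letter: idx for idx, letter in enumerate(CATEGORIES)}
INDEX_TO_LETTER = dict(enumerate(CATEGORIES))

print(len(CATEGORIES))       # 24 classes
print(LETTER_TO_INDEX["K"])  # 9, because J is skipped
print(INDEX_TO_LETTER[23])   # Y
```

Because J is missing, every letter after I sits one position earlier than its place in the full alphabet.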

### Function To Save Frames from Webcam within our Region of Interest [ROI]

This integer variable is used as a counter to keep track of the images saved, and to make sure files don't have the same name and get overwritten

In [5]:
currentFrame = 0 # Initializing counter to zero
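Note that the counter starts from zero every time this cell is run, so a second capture session would write over `frame0.jpg`, `frame1.jpg`, and so on. One possible way around this, sketched below, is to start from the highest index already present in the folder. The helper `next_frame_index` is hypothetical, and a temporary directory stands in for `IMG_SAVE_DIR`.

```python
import os
import re
import tempfile

def next_frame_index(save_dir):
    """Return one more than the highest N among files named frameN.jpg (0 if none)."""
    pattern = re.compile(r"^frame(\d+)\.jpg$")
    indices = []
    for name in os.listdir(save_dir):
        match = pattern.match(name)
        if match:
            indices.append(int(match.group(1)))
    return max(indices) + 1 if indices else 0

# Demonstration in a temporary directory standing in for IMG_SAVE_DIR.
with tempfile.TemporaryDirectory() as demo_dir:
    start_empty = next_frame_index(demo_dir)
    for n in (0, 1, 7):
        # Empty placeholder files that only mimic the saved frame names.
        open(os.path.join(demo_dir, f"frame{n}.jpg"), "w").close()
    start_after = next_frame_index(demo_dir)

print(start_empty, start_after)  # 0 8
```

With such a helper, the counter could be set with `currentFrame = next_frame_index(IMG_SAVE_DIR)` instead of a fixed zero.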

In [6]:
cap = cv2.VideoCapture(0) # Defining VideoCapture Object with value 0, which means it will use the webcam

while True:
    ret, frame = cap.read() # Reading each frame from the webcam and storing it in a variable called frame.
    if not ret: # Stopping if the webcam did not return a frame
        break
    
    mirror = cv2.flip(frame, 1) # Flipping each frame so that the video feed resembles a mirror  
    
    fh, fw = mirror.shape[:2] # Getting the frame's height and width
    
    rois = int(fh/1.7) # Defining the side of the square ROI as roughly 59% of the height of the frame
    
    cropImg = mirror[0:rois, fw-rois:fw] # Cropping out the part necessary for the ROI
    
    grey = cv2.cvtColor(cropImg, cv2.COLOR_BGR2GRAY) # Converting the BGR image of the ROI to Greyscale (B&W) 
    
    value = (11, 11) # Setting the Blur Kernel size
    
    blurred = cv2.GaussianBlur(grey, value, 0) # Blurring the Greyscale Image
    
    _, thresh = cv2.threshold(blurred, 127, 255, cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU) # Applying inverted binary thresholding; with THRESH_OTSU the threshold is picked automatically and the 127 is ignored
    
    third_axis = np.repeat(thresh[...,np.newaxis], 3, -1) # B&W images don't have a 3rd axis. So adding it manually.
    
    mirror[0:rois, fw-rois:fw] = third_axis # Overlaying the thresholded image on our webcam feed 
    
    cv2.imshow("Webcam", mirror) # Showing the video to user
    
    key = cv2.waitKey(1) & 0xFF # Reading the keyboard once per frame so a key press is not lost between two separate checks
    
    if key == ord('q'): # If the key 'q' is pressed the live session will terminate
        break
        
    if key == ord('s'): # If the key 's' is pressed the image in the ROI at that time will get saved
        cv2.imwrite(IMG_SAVE_DIR+'frame'+str(currentFrame)+'.jpg', third_axis)
        print('Saved Pic '+str(currentFrame))
        currentFrame+=1
    
cap.release()              
cv2.destroyAllWindows() # Closing the OpenCV windows once the loop has ended
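The ROI geometry and the channel-repeat trick can be tried without a webcam. The sketch below uses a synthetic 480x640 frame (an assumed size) and a hand-made black-and-white mask in place of the greyscale, blur and threshold steps, so only NumPy is needed.

```python
import numpy as np

# Synthetic 480x640 BGR frame standing in for a webcam frame (assumed size).
frame = np.full((480, 640, 3), 200, dtype=np.uint8)

fh, fw = frame.shape[:2]            # height first, then width
rois = int(fh / 1.7)                # side of the square ROI
roi = frame[0:rois, fw - rois:fw]   # top-right corner of the frame

# Stand-in for the thresholded image: a single-channel 0/255 mask with a white square.
mask = np.zeros((rois, rois), dtype=np.uint8)
mask[rois // 4: 3 * rois // 4, rois // 4: 3 * rois // 4] = 255

# Repeat the single channel three times so it can be pasted into the 3-channel frame.
mask_3ch = np.repeat(mask[..., np.newaxis], 3, -1)
frame[0:rois, fw - rois:fw] = mask_3ch

print(rois, roi.shape, mask_3ch.shape)  # 282 (282, 282, 3) (282, 282, 3)
```

For a 480-pixel-high frame the ROI is a 282x282 square covering columns 358 to 639 of the top rows.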

### Preparing Training Dataset

Creating an empty list to store the training data

In [7]:
trainImgs = []

Nested loop that reads in each image using OpenCV, resizes it, and stores the image array together with its label in the list

In [8]:
for cat in CATEGORIES:
    path=os.path.join(TRAIN_DIR,cat) # Specifying the directory to take images from for each alphabet
    class_num = CATEGORIES.index(cat) # Using the index of the CATEGORIES list to assign a label to each image
    for fname in os.listdir(path): # FOR loop which appends read images to the Training List
        img=cv2.imread(os.path.join(path,fname))
        if img is None: # cv2.imread returns None for files it cannot decode, so those are skipped
            continue
        resizedimg = cv2.resize(img, (224,224), interpolation= cv2.INTER_CUBIC)
        trainImgs.append([resizedimg,class_num])
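The labelling relies entirely on the folder layout: every file inside `TRAIN_DIR/<letter>/` gets the index of that letter. A small sketch of that idea, using a temporary directory, a shortened category list and empty placeholder files instead of real images (`list_labelled_files` is a hypothetical helper):

```python
import os
import tempfile

DEMO_CATEGORIES = ["A", "B", "C"]  # shortened list for the demonstration

def list_labelled_files(train_dir, categories):
    """Return (file_path, class_index) pairs, one per file in each class folder."""
    pairs = []
    for class_num, cat in enumerate(categories):
        class_dir = os.path.join(train_dir, cat)
        for fname in sorted(os.listdir(class_dir)):
            pairs.append((os.path.join(class_dir, fname), class_num))
    return pairs

with tempfile.TemporaryDirectory() as train_dir:
    # Build <train_dir>/<letter>/ with two empty placeholder files each.
    for cat in DEMO_CATEGORIES:
        os.makedirs(os.path.join(train_dir, cat))
        for n in range(2):
            open(os.path.join(train_dir, cat, f"frame{n}.jpg"), "w").close()
    pairs = list_labelled_files(train_dir, DEMO_CATEGORIES)

labels = [label for _, label in pairs]
print(labels)  # [0, 0, 1, 1, 2, 2]
```

The labels come out grouped by class, which is why the list is shuffled before training.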

Checking that the images are stored correctly. *Press any key to close the image window*

In [9]:
cv2.imshow('Random', trainImgs[787][0])
cv2.waitKey(0)
cv2.destroyAllWindows()

Shuffling the training data so that the neural network doesn't see the images grouped by class and pick up on their order

In [10]:
random.shuffle(trainImgs)

Separating the Images and Labels

In [11]:
X = []
y = []
for item in trainImgs:
    X.append(item[0])
    y.append(item[1])
X=np.array(X)
y=np.array(y)
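The reason for shuffling the `[image, label]` pairs first and separating them afterwards is that every image keeps its own label. A toy sketch with strings in place of image arrays (the fixed seed is an assumption made only so the run is repeatable):

```python
import random

# Toy stand-ins: a string plays the role of the image array, the int is its label.
train_pairs = [["img_A0", 0], ["img_A1", 0], ["img_B0", 1], ["img_C0", 2]]
original = {tuple(p) for p in train_pairs}

rng = random.Random(42)   # fixed seed, assumed here for reproducibility
rng.shuffle(train_pairs)  # shuffles whole pairs in place

# Split the shuffled pairs into two aligned lists.
X_demo, y_demo = map(list, zip(*train_pairs))

# Every image is still matched with the label it started with.
print(set(zip(X_demo, y_demo)) == original)  # True
```

Shuffling `X` and `y` separately would break this pairing.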

Normalizing the images by scaling the pixel values from [0, 255] to [0, 1]

In [12]:
Xnor = X/255.0
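Dividing a `uint8` array by `255.0` gives a `float64` array, which takes eight times the memory of the original images. A hedged alternative, shown on a small random batch rather than the real data, is to convert to `float32` first:

```python
import numpy as np

# Two random 224x224 BGR images standing in for the real training array X.
X_demo = np.random.default_rng(0).integers(0, 256, size=(2, 224, 224, 3), dtype=np.uint8)

X_f64 = X_demo / 255.0                     # what the cell above does: result is float64
X_f32 = X_demo.astype(np.float32) / 255.0  # alternative: float32 uses half the memory

print(X_f64.dtype, X_f64.nbytes)  # float64 2408448
print(X_f32.dtype, X_f32.nbytes)  # float32 1204224
```

Whether this matters depends on how many training images there are; for a small dataset the original line is fine.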

Train-test split (70% training, 30% testing)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(Xnor, y, test_size=0.3)
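With 24 classes, a purely random split can leave some letters under-represented in the test set. `train_test_split` has a `stratify` parameter for this; the pure-Python sketch below only illustrates the idea and is not scikit-learn's implementation. `stratified_split` is a hypothetical helper and the labels are toy data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with about test_size of every class in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # random choice within each class
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 3 classes with 10 samples each.
y_demo = [c for c in range(3) for _ in range(10)]
train_idx, test_idx = stratified_split(y_demo)
test_counts = [sum(1 for i in test_idx if y_demo[i] == c) for c in range(3)]
print(len(train_idx), len(test_idx), test_counts)  # 21 9 [3, 3, 3]
```

In the notebook itself the equivalent would be passing `stratify=y` to `train_test_split`.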

### Model Construction

Specifying URL where MobileNetV2 resides

In [14]:
mobilenet_v2 = "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4"

Wrapping the MobileNetV2 feature extractor as a Keras layer and specifying the input shape

In [15]:
mobile_net_layers = hub.KerasLayer(mobilenet_v2, input_shape=(224,224,3))

Freezing the pre-trained layers, as we don't want training to change the weights that MobileNetV2 has already learned.

In [16]:
mobile_net_layers.trainable = False

Constructing the model

In [17]:
model = tf.keras.Sequential([
  mobile_net_layers,
  tf.keras.layers.Dropout(0.3),
  tf.keras.layers.Dense(24,activation='softmax')
])

Model Summary

In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 1280)              2257984   
                                                                 
 dropout (Dropout)           (None, 1280)              0         
                                                                 
 dense (Dense)               (None, 24)                30744     
                                                                 
Total params: 2,288,728
Trainable params: 30,744
Non-trainable params: 2,257,984
_________________________________________________________________
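The numbers in the summary can be checked by hand: the only trainable layer is the final `Dense` layer, which has one weight per (feature, class) pair plus one bias per class. All values below are taken from the summary above.

```python
feature_dim = 1280  # output size of the MobileNetV2 feature vector (from the summary)
num_classes = 24    # number of letters in CATEGORIES

dense_params = feature_dim * num_classes + num_classes  # weights + biases
frozen_params = 2_257_984                               # KerasLayer row in the summary
total_params = frozen_params + dense_params

print(dense_params, total_params)  # 30744 2288728
```

So only about 1.3% of the parameters are updated during training; the rest stay frozen.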


Compiling the model

In [19]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
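The `sparse_categorical_crossentropy` loss is used because the labels are plain integers (0 to 23) rather than one-hot vectors. For a single sample it is the negative log of the probability the model gave to the correct class. A minimal sketch with made-up probabilities and 3 classes (Keras additionally clips the probabilities and averages over the batch, which is left out here):

```python
import math

def sparse_categorical_crossentropy(true_index, probs):
    """Per-sample loss: negative log of the probability assigned to the true class."""
    return -math.log(probs[true_index])

confident = [0.05, 0.90, 0.05]  # made-up softmax output, true class is index 1
unsure = [0.40, 0.30, 0.30]     # made-up softmax output, true class is index 1

loss_confident = sparse_categorical_crossentropy(1, confident)
loss_unsure = sparse_categorical_crossentropy(1, unsure)
print(round(loss_confident, 4), round(loss_unsure, 4))  # 0.1054 1.204
```

The loss is small when the correct letter gets a high probability and grows quickly as that probability drops.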

### Model Training

In [20]:
model.fit(X_train, y_train, epochs=4, validation_data=(X_test, y_test))      

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x2c2d724b190>

The model reaches good accuracy on the held-out test data, so this time I will train a fresh model on the entire dataset instead of just the training split.

In [21]:
model = tf.keras.Sequential([
  mobile_net_layers,
  tf.keras.layers.Dropout(0.3),
  tf.keras.layers.Dense(24,activation='softmax')
])

model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

model.fit(Xnor, y, epochs=5)      

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2c2df9389d0>

Saving the model

In [22]:
model.save('MobileNetV2TrainedOnBgSubtraction.h5')

Loading the saved model *(This step was done for ease of use so that I didn't need to train the model every time I wanted to make some changes)*

In [23]:
model=keras.models.load_model('MobileNetV2TrainedOnBgSubtraction.h5',custom_objects={'KerasLayer': hub.KerasLayer})

### Using The Model To Predict American Sign Language

This mostly repeats the steps used during image collection to preprocess the images in the ROI, with the added functionality of outputting the predicted letter and its probability.

In [28]:
cap = cv2.VideoCapture(0)
prevDisp = 'placeholder'
while True:
    _, frame = cap.read()
    
    mirror = cv2.flip(frame, 1)
    
    fh, fw = mirror.shape[:2]
    rois = int(fh/1.7)    
    cropImg = mirror[0:rois, fw-rois:fw]
    
    grey = cv2.cvtColor(cropImg, cv2.COLOR_BGR2GRAY)
    
    value = (7, 7)
    blurred = cv2.GaussianBlur(grey, value, 0)
    
    _, thresh = cv2.threshold(blurred, 127, 255, cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)
    
    rgb = np.repeat(thresh[...,np.newaxis], 3, -1)
    
    mirror[0:rois, fw-rois:fw] = rgb
    
    resizedimg = cv2.resize(rgb, (224,224), interpolation= cv2.INTER_CUBIC)
    
    normalizedimgformodel = resizedimg/255.0
    
    with io.capture_output() as captured:
        predictions = model.predict(np.array([normalizedimgformodel]))
    if predictions.max()>0.7:
        guessNo = np.argmax(np.squeeze(predictions))
        guessAlpha = CATEGORIES[guessNo]
        
        if prevDisp != guessAlpha:
            clear_output(wait=True)
            display(guessAlpha, predictions.max())
            prevDisp = guessAlpha
            
    cv2.imshow('WebCam', mirror)
    if cv2.waitKey(20) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()  

'S'

0.74711865

As we can see, the model successfully outputs the predicted hand sign and its probability. In several tests, the model performed well across all the letters it was trained on.
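The decision step inside the loop (take the most probable class, but only report it when the probability is above 0.7) can be isolated and tried without a webcam or a model. In the sketch below `decode_prediction` is a hypothetical helper and the probability vectors are made up.

```python
CATEGORIES = ['A','B','C','D','E','F','G','H','I','K','L','M',
              'N','O','P','Q','R','S','T','U','V','W','X','Y']

def decode_prediction(probs, categories, threshold=0.7):
    """Return (letter, probability) for the top class, or None if it is not above threshold."""
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    if probs[best] > threshold:
        return categories[best], probs[best]
    return None

# Made-up output: 0.75 on the class at index 17, the rest spread evenly.
confident = [0.25 / 23] * 24
confident[17] = 0.75

# Made-up output: the model has no idea, every class equally likely.
uniform = [1 / 24] * 24

print(decode_prediction(confident, CATEGORIES))  # ('S', 0.75)
print(decode_prediction(uniform, CATEGORIES))    # None
```

Returning `None` for low-confidence frames mirrors the loop above, where nothing new is displayed unless the threshold is passed.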