** Sign Language Prediction from Video **

We use the LSA64 Argentinian Sign Language video dataset, which consists of 3200 videos of 64 words. We selected 24 of these words for this project, giving 1200 videos in total to train our models. The goal is to hold a real-time conversation with Google Assistant or Alexa, using sign language from these videos as the input.

We tested two types of models:
1. Time Distributed CNN and RNN
2. CNN for feature extraction and LSTM

Our best accuracy, 96% on training and 95% on testing, was obtained with model 2.
We first break each video from the dataset into frames, keeping track of the number of frames in each. For prediction, we fix the sequence length at 40 frames per video. We then feed the individual frames into a CNN: an Inception V3 model pre-trained on ImageNet, which extracts features that we save as a sequence in a .npy file. We then load these extracted features into our LSTM layer, which learns the temporal features of the video and predicts the output word for each video.
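
The text does not say how the 40 frames are chosen from a longer clip. One common option is to take evenly spaced frames, sketched below. The helper name `sample_frames` and the frame filenames are hypothetical and are used only for illustration; the even-spacing rule itself is an assumption.

```python
def sample_frames(frames, seq_len=40):
    """Pick seq_len evenly spaced items from frames, keeping their order."""
    n = len(frames)
    if n < seq_len:
        raise ValueError("video has fewer frames than the sequence length")
    # Index i of the output maps to floor(i * n / seq_len) in the input,
    # so the picks are spread across the whole clip.
    return [frames[(i * n) // seq_len] for i in range(seq_len)]


# Hypothetical frame filenames for one 90-frame clip.
frames = ["frame_%04d.jpg" % i for i in range(90)]
picked = sample_frames(frames)
print(len(picked), picked[0], picked[-1])
```

With this rule every clip yields exactly 40 frames, so the stacked features of each video have the same shape regardless of the clip's original length.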

For the final prediction, we load the trained weights and feed the LSTM the extracted features from the .npy file of each word, then build a sentence from the predicted words. The sentence is converted to speech to interact with Google Assistant, and the assistant's spoken reply is converted back to text for the end user.
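
The step from per-sign predictions to a spoken command can be sketched in plain Python as below. The three-class vocabulary and the probability rows are invented for illustration and do not come from the dataset; `build_sentence` and `argmax` are hypothetical helper names.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def build_sentence(prob_rows, class_words, wake_phrase="Okay Google"):
    """Turn one softmax row per sign into a command string."""
    # Each row holds the class probabilities for one sign clip;
    # the most probable class is looked up in the vocabulary.
    words = [class_words[argmax(row)] for row in prob_rows]
    return wake_phrase + " " + " ".join(words)


# Hypothetical vocabulary and softmax outputs for two signs.
class_words = {0: "What is the", 1: "Weather", 2: "Play"}
prob_rows = [[0.90, 0.06, 0.04], [0.10, 0.85, 0.05]]
command = build_sentence(prob_rows, class_words)
print(command)
```

The resulting string is what would be handed to the text-to-speech step.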


In [1]:
from keras.layers import Dense, Flatten, Dropout, ZeroPadding3D
from keras.layers.recurrent import LSTM
from keras.models import Sequential, load_model
from keras.optimizers import Adam, RMSprop
from keras.layers.wrappers import TimeDistributed
from keras.layers.convolutional import (Conv2D, MaxPooling3D, Conv3D,
    MaxPooling2D)
from collections import deque
import sys
from keras.callbacks import TensorBoard, ModelCheckpoint, EarlyStopping, CSVLogger
import time
import os.path
import pandas as pd
import numpy as np
import win32com.client as wincl
import speech_recognition as sr


Using TensorFlow backend.


In [2]:

def pipeline():
    
    weather = ['063_010_002', '062_010_002']
    music = ['064_002_002', '053_010_002', '036_010_002']
    call = ['064_002_002', '017_002_004']
    dance = ['064_002_002', '057_003_001']
    argentina = ['028_001_005', '024_002_001']
    opaque = ['063_010_002', '001_003_003']
    vision = ['033_003_004']
    bday = ['060_004_002','050_004_001', '030_006_004']
    
    # Each entry is one sentence: a list of video IDs signed in order.
    # The 'call' sentence is left out of this run.
    pipe = [weather, vision, argentina, opaque, bday, dance, music]
    
    labels = pd.read_csv('data_file.csv')
    pred_label = pd.read_csv('Words_Class.csv')
    print('Defining the Model and loading Weights..')
    seq_len = 40
    dim = 2048
    input_shape = (seq_len, dim)
    model = Sequential()
    model.add(LSTM(2048,return_sequences=False,
                           input_shape=input_shape,
                           dropout=0.5))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(24, activation='softmax'))
    optimizer = Adam(lr=1e-5, decay=1e-6)
    metrics = ['accuracy']
    model.compile(loss='categorical_crossentropy', optimizer=optimizer,
                           metrics=metrics)
    

    model.summary()
    model.load_weights('lstm-features.008-0.068.hdf5')
    print('Weights loaded')
    
    for data in pipe:
        
        # Predict one word per sign clip and collect the words into a sentence.
        print('Generating Prediction for Signs..')
        sentence = []

        for sign in data:
            # Pre-extracted features for this clip, shape (seq_len, dim).
            features = sign + '-40-features.npy'
            # Add a batch axis so the LSTM receives shape (1, seq_len, dim).
            X_test = np.expand_dims(np.load(features), axis=0)
            Y_pred = model.predict_classes(X_test)
            # Ground-truth class of the clip, kept for manual comparison.
            Y_actual = labels.loc[labels['VideoID'] == sign].Class
            # Map the predicted class index to its word.
            predicted = pred_label.loc[pred_label['Prediction'] == Y_pred[0]].Class
            sentence.append(predicted.iloc[0])

        final_sent = ' '.join(sentence)

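        print('Converting to speech.. ')
        command = "Okay Google " + final_sent
        print(command)

        # Speak the command aloud through the Windows SAPI voice so that
        # the assistant can hear it.
        speak = wincl.Dispatch("SAPI.SpVoice")
        r = sr.Recognizer()
        speak.Speak(command)
        # Record the assistant's spoken reply from the microphone.
        with sr.Microphone() as source:
            print("listening..")
            r.adjust_for_ambient_noise(source)
            audio = r.listen(source)
        # Transcribe the reply; keep going with the next sentence on failure.
        try:
            text = r.recognize_google(audio)
            print("Speech to Text: " + text)
        except sr.UnknownValueError:
            print("Speech to Text: could not understand the reply")
        except sr.RequestError as err:
            print("Speech to Text: recognition request failed: " + str(err))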


In [3]:
pipeline()

Defining the Model and loading Weights..
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 2048)              33562624  
_________________________________________________________________
dense_1 (Dense)              (None, 512)               1049088   
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 24)                12312     
Total params: 34,624,024
Trainable params: 34,624,024
Non-trainable params: 0
_________________________________________________________________
Weights loaded
Generating Prediction for Signs..
Converting to speech.. 
Okay Google What is the Weather
listening..
Speech to Text: now in Boston it's 49 and cloudy today it'll be rainy with a forecasted high of 49 and a
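
The parameter counts in the model summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and ordinary dense layers with a bias; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units


lstm = lstm_params(2048, 2048)      # lstm_1
dense_1 = dense_params(2048, 512)   # dense_1
dense_2 = dense_params(512, 24)     # dense_2
total = lstm + dense_1 + dense_2
print(lstm, dense_1, dense_2, total)
```

These values match the `Param #` column and the `Total params` line of the summary; the LSTM layer alone accounts for about 97% of the weights.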