This work develops a Deep Learning model that takes the optical flow produced by the gestures of a signer performing a specific ASL sign and predicts the correct gloss. It explores models that use both spatial and temporal information. The dataset contains a total of 20 glosses.
- cv2: sudo apt-get install python-opencv
- numpy: sudo pip install numpy
- tensorflow: pip install --upgrade tensorflow
- scipy: sudo pip install scipy
- scikit-learn: sudo pip install scikit-learn
- pillow: sudo pip install pillow
- h5py: sudo pip install h5py
- keras: sudo pip install keras
- matplotlib: sudo pip install matplotlib
Execute the following command: python experiments.py agent
where agent can be any of these 3 values: {random, bias, conv}
- Random: chooses a gloss uniformly at random.
- Bias: always chooses the gloss with the most repetitions in the dataset.
- Convolutional Model: a convolutional network trained with supervised learning. The input to the network is the cumulative optical flow of the video of the person performing a sign, and the output is the gloss. The model is trained with categorical cross-entropy loss, and the weights are optimized using the Adam optimizer.
- LSTM Model: instead of the cumulative optical flow of the entire video, this model uses the optical flow of each pair of consecutive frames and combines them with an LSTM in order to learn temporal information. This agent cannot currently be tested due to an issue uploading the data; it will be available soon.
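The convolutional agent described above can be sketched roughly as follows. This is a minimal illustration and not the exact architecture in experiments.py: the input shape (64, 64, 2) (the two flow channels), the layer sizes, and the function name build_conv_model are all assumptions. Only the loss, optimizer, and output size (20 glosses) come from the text.

```python
# Hedged sketch of a conv classifier over a cumulative optical-flow image.
# Input shape and layer sizes are illustrative assumptions, not the
# architecture actually used in experiments.py.
from tensorflow.keras import layers, models

NUM_GLOSSES = 20  # the dataset contains 20 glosses


def build_conv_model(input_shape=(64, 64, 2)):
    model = models.Sequential([
        layers.Input(shape=input_shape),          # cumulative flow (dx, dy)
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_GLOSSES, activation="softmax"),
    ])
    # Categorical cross-entropy loss and Adam optimizer, as described above.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```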
Currently there is no way to pass parameters such as the number of epochs, the fractions used to split the dataset, the pooling and kernel sizes, or the model name through the command line. They are easy to change: open experiments.py, go to the function of the specific agent, and edit the parameters there.
- The final total loss.
- The top 4 accuracy.
- A confusion matrix.
It saves in the results directory some visualisations of the training, including accuracy per epoch, loss per epoch, and a confusion matrix image. It also saves the trained model and its weights.
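The reported metrics can be sketched as follows. This is an illustration with placeholder data, not the code in experiments.py; top_k_accuracy is a hypothetical helper implementing the rank-k accuracy described above.

```python
# Hedged sketch of the evaluation metrics: rank-k (top-k) accuracy and a
# confusion matrix. The arrays below are illustrative placeholder data.
import numpy as np
from sklearn.metrics import confusion_matrix


def top_k_accuracy(y_true, probs, k=4):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(y_true, top_k)]
    return float(np.mean(hits))


y_true = np.array([0, 1, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.5, 0.2]])
rank1 = top_k_accuracy(y_true, probs, k=1)   # 0.75: 3 of 4 argmax hits
cm = confusion_matrix(y_true, probs.argmax(axis=1))
```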
Rank1 Accuracy: 55.4% Rank2 Accuracy: 66.2% Rank3 Accuracy: 72.2% Rank4 Accuracy: 77.4%
Confusion Matrix
Top 4 Accuracy Per Epoch
Validation and Training Loss
t-SNE applied to the feature vectors created by the trained convolutional model for each video of the signer. This shows in 2D how the model learns to create similar representations (small Euclidean distance) for the same and similar classes, and different representations for different classes.
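The t-SNE projection described above can be reproduced roughly as follows. The features here are random placeholder data; in the repository they would be the trained convolutional model's feature vectors, one per video.

```python
# Hedged sketch of the t-SNE visualisation: project per-video feature
# vectors into 2D. `features` is placeholder data standing in for the
# conv model's learned representations.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
features = rng.randn(100, 128)           # one 128-D vector per video (placeholder)
labels = rng.randint(0, 20, size=100)    # one of the 20 glosses per video

embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(features)
# embedding has shape (100, 2); with good representations, points sharing
# a label cluster together (small Euclidean distance between them).
```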