# Capstone Project

Victor Geislinger 
<!--October 2018-->

## I. Definition
<!--_(approx. 1-2 pages)_-->

### Project Overview
<!--
In this section, look to provide a high-level overview of the project in layman’s terms. Questions to ask yourself when writing this section:
- _Has an overview of the project been provided, such as the problem domain, project origin, and related datasets or input data?_
- _Has enough background information been given so that an uninformed reader would understand the problem domain and following problem statement?_
-->

American Sign Language (ASL) is a sign language that does not use speech to communicate and is used mostly by the American Deaf population. Though used throughout the English-speaking United States, it is in fact its own language, separate from English, and builds its syntax from multiple visual elements such as handshapes, movements, positions, and nonmanual markers. Although there are many variations of sign language specific to different languages, regions, and needs, being able to use a computer to detect ASL would be extremely useful not only for ASL translation but also for translation of other sign languages and for non-language gesture recognition.

This project focuses on classifying the handshapes that represent the letters of the English alphabet from still images using deep learning image recognition techniques. The focus on still images is a starting point for full ASL translation, which would also require measuring aspects such as handshape position (handshapes in different positions and orientations can affect meaning) and movement (in ASL, as in all languages, meaning depends strongly on time). Those aspects can be captured in other types of datasets, such as video.

It should be noted that the handshapes being classified are not all the handshapes used in ASL (handshapes associated with numbers, for example, are not covered), and signs that represent letters but require movement, specifically "J" and "Z", have not been included. Lastly, though this project considers each handshape in the position associated with its letter, the letters "P" and "Q" in fact use the same handshapes as the letters "K" and "G" respectively, but in different positions. Since the still images differ significantly between these related letters, it was deemed appropriate to treat each pair as separate classes.


### Problem Statement
<!--
In this section, you will want to clearly define the problem that you are trying to solve, including the strategy (outline of tasks) you will use to achieve the desired solution. You should also thoroughly discuss what the intended solution will be for this problem. Questions to ask yourself when writing this section:
- _Is the problem statement clearly defined? Will the reader understand what you are expecting to solve?_
- _Have you thoroughly discussed how you will attempt to solve the problem?_
- _Is an anticipated solution clearly defined? Will the reader understand what results you are looking for?_
-->

The project's goal is to classify static images of the ASL handshapes associated with $24$ English letters: "A"-"I" and "K"-"Y". Since the dataset consists of static images, image recognition strategies using deep learning will be used. The dataset comes from a paper that had a similar goal, and that paper's results will serve as an overall benchmark for evaluating this project's performance. The subgoal is to classify the handshapes with better accuracy than the paper's model, which used both the static images and depth sensing information.
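
As an illustration, the $24$ class labels can be enumerated directly. This is a minimal sketch; the names `LETTERS` and `LABEL_TO_INDEX` are assumptions made for this example and are not taken from the project notebook.

```python
import string

# The 24 static handshapes: every uppercase letter except "J" and "Z",
# which require movement and are therefore excluded.
LETTERS = [c for c in string.ascii_uppercase if c not in ("J", "Z")]

# Hypothetical mapping from letter to integer class index (0-23).
LABEL_TO_INDEX = {letter: i for i, letter in enumerate(LETTERS)}

print(len(LETTERS))          # 24
print(LABEL_TO_INDEX["K"])   # 9
```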

The dataset will be split into training, validation, and testing subsets, and multiple image recognition models will be trained and compared. The subsets will be consistent across the different models so that the models can be compared more easily. This is discussed in more detail later, but it should be noted now that five different subjects provided the handshape images; the training and validation sets will be drawn randomly from four of the subjects, and the fifth subject's images will be reserved for the testing set.



### Metrics
<!--
In this section, you will need to clearly define the metrics or calculations you will use to measure performance of a model or result in your project. These calculations and metrics should be justified based on the characteristics of the problem and problem domain. Questions to ask yourself when writing this section:
- _Are the metrics you’ve chosen to measure the performance of your models clearly discussed and defined?_
- _Have you provided reasonable justification for the metrics chosen based on the problem and solution?_
-->
    
This project evaluates the performance of a given model by the fraction of handshapes it classifies correctly, $accuracy = \frac{N_{correct}}{N_{predicted}}$, where $N_{correct}$ is the number of correct predictions and $N_{predicted}$ is the total number of predictions made. A second metric, which combines the precision and recall of the model, is the $F_1$ score given by $F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$. This project uses the $F_1$ score rather than a different $F_\beta$ ($\beta \neq 1$) score since precision and recall should be weighted approximately equally (we care about identifying the correct handshape as much as we care about avoiding misclassifying a letter).
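
Both metrics can be sketched in a few lines of plain Python. The report does not say how the per-class $F_1$ scores are combined into one number for $24$ classes, so the unweighted (macro) average used here is an assumption, as are the function names.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
print(accuracy(y_true, y_pred))   # 0.75
print(macro_f1(y_true, y_pred))
```

In the toy example, "A" has precision $1$ and recall $0.5$, while "B" has precision $\frac{2}{3}$ and recall $1$, so the two per-class $F_1$ scores differ even though the overall accuracy is a single number.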

## II. Analysis
<!--_(approx. 2-4 pages)_-->

### Data Exploration
<!--
In this section, you will be expected to analyze the data you are using for the problem. This data can either be in the form of a dataset (or datasets), input data (or input files), or even an environment. The type of data should be thoroughly described and, if possible, have basic statistics and information presented (such as discussion of input features or defining characteristics about the input or environment). Any abnormalities or interesting qualities about the data that may need to be addressed have been identified (such as features that need to be transformed or the possibility of outliers). Questions to ask yourself when writing this section:
- _If a dataset is present for this problem, have you thoroughly discussed certain features about the dataset? Has a data sample been provided to the reader?_
- _If a dataset is present for this problem, are statistics about the dataset calculated and reported? Have any relevant results from this calculation been discussed?_
- _If a dataset is **not** present for this problem, has discussion been made about the input space or input data for your problem?_
- _Are there any abnormalities or characteristics about the input space or dataset that need to be addressed? (categorical variables, missing values, outliers, etc.)_
-->

The dataset used was the ASL FingerSpelling Dataset from the University of Surrey’s Center for Vision, Speech and Signal Processing [\[1\]](#References) [\[2\]](#References). This dataset contains both color images and depth sensing data collected with a Microsoft Kinect. (Note that this project did not use the depth sensing data since it focused on static images.) The images include $24$ different handshapes, each representing a letter of the English alphabet; "J" and "Z" are excluded since these letters depend on movement and therefore don't have a static image representation.

The dataset images have been cropped around the handshape though each cropping results in a differently sized image fitting within a $275$ pixels by $250$ pixels window with a resolution of $72$ pixels per inch. The background behind the handshape is not uniform or consistent. The dataset contains approximately $65,000$ images generated by five different non-native ASL signers with over $500$ samples of each of the $24$ different handshapes. The handshapes in the image feature some rotational differences as the subjects were instructed to adjust hand position for the camera. These positional adjustments however still preserve the common position associated with the English letter it represents. 

<!-- #TODO: Display a random sample of letters from the dataset -->

![Random sample of images from the dataset.](images/dataset_sampleImages.png)


### Exploratory Visualization
<!--
In this section, you will need to provide some form of visualization that summarizes or extracts a relevant characteristic or feature about the data. The visualization should adequately support the data being used. Discuss why this visualization was chosen and how it is relevant. Questions to ask yourself when writing this section:
- _Have you visualized a relevant characteristic or feature about the dataset or input data?_
- _Is the visualization thoroughly analyzed and discussed?_
- _If a plot is provided, are the axes, title, and datum clearly defined?_
-->

<!-- #TODO: Display plots for height & width -->

![Histogram of images' heights in pixels.](images/histogram_pxHeight.png)

![Histogram of images' widths in pixels.](images/histogram_pxWidth.png)



Exploring the dataset, it can be observed that the images have different dimensions, though all fit within a $275$ pixels by $250$ pixels window. This makes sense given how the data was collected, since the handshapes have varying aspect ratios. The images' heights are roughly normally distributed, centered at approximately $140$ pixels. The images' widths, however, have a right-skewed distribution. The width distribution implies that most handshapes are relatively narrow, likely because most handshapes for English letters have the hand positioned with the fingers perpendicular to the horizon.

<!-- #TODO: Display plot for aspect ratio -->
![Histogram of images' aspect ratio.](images/histogram_pxAspectRatio.png)

![Histogram of images' area (total number of pixels in image).](images/histogram_pxTotalPixels.png)


Investigating further into the aspect ratio of the images and the area (number of pixels in an image), a right-skewed distribution is observed for both. The area is literally the number of inputs the model would receive if the pixels were fed in directly, and it directly affects file size: larger areas (more pixels) mean larger files and more data. It can also be observed that over $75\%$ of images have an aspect ratio smaller than $1.0$, meaning that a large majority of images are taller than they are wide.

<!-- #TODO: overall descriptive statistics -->

Looking at our images' different sizes, one can pick up on some patterns in the data. The smallest height and width are $64$ pixels, while the maximums for height and width are approximately $270$ and $250$ pixels respectively. One can notice that the aspect ratio trends below $1.0$ meaning that most images are taller than they are wide. It can also be observed that the area metric is heavily skewed toward smaller values. This fits with the aspect ratio and likely is because of the skew in width. The average number of pixels in the images is about $15000$ pixels and less than $25\%$ have more than $18000$ pixels.
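
The statistics above can be reproduced from a list of image sizes. The sketch below assumes the `(width, height)` pairs have already been read from the image files; the function name `summarize` and the sample sizes are made up for illustration and are not values from the dataset. The aspect ratio is taken as width divided by height, so that values below $1.0$ mean an image is taller than it is wide.

```python
from statistics import mean, quantiles


def summarize(sizes):
    """Summarize (width, height) pairs given in pixels."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    ratios = [w / h for w, h in sizes]   # < 1.0 means taller than wide
    areas = [w * h for w, h in sizes]    # total pixels per image
    summary = {}
    for name, values in (("width", widths), ("height", heights),
                         ("aspect_ratio", ratios), ("area", areas)):
        q1, q2, q3 = quantiles(values, n=4)   # quartile cut points
        summary[name] = {"min": min(values), "q1": q1, "median": q2,
                         "q3": q3, "max": max(values), "mean": mean(values)}
    return summary


# Made-up sizes for illustration only; not taken from the dataset.
sizes = [(64, 64), (90, 140), (100, 150), (120, 160), (250, 270)]
stats = summarize(sizes)
print(stats["height"]["max"])   # 270
print(stats["area"]["mean"])
```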

<!-- #TODO: Cite further discussion & code in data_preprocessing notebook (?) -->


### Algorithms and Techniques
<!--
In this section, you will need to discuss the algorithms and techniques you intend to use for solving the problem. You should justify the use of each one based on the characteristics of the problem and the problem domain. Questions to ask yourself when writing this section:
- _Are the algorithms you will use, including any default variables/parameters in the project clearly defined?_
- _Are the techniques to be used thoroughly discussed and justified?_
- _Is it made clear how the input data or datasets will be handled by the algorithms and techniques chosen?_
-->

Since this project is focused on classifying static images, it was determined that deep learning techniques, specifically convolutional neural networks (CNNs), would be the most relevant and effective method. However, since CNNs require all images to have the same dimensions, it was determined that the best approach for this project was to scale the images to the same size. This is discussed further in the "Data Preprocessing" section, but in summary all images were scaled to $160$ pixels by $160$ pixels to preserve the relative shape of the majority of images.

Multiple model architectures were used and compared. A basic CNN model was built from scratch to determine the accuracy that could be achieved before moving on to more advanced techniques. Afterwards, transfer learning was used to improve accuracy. VGG-16, VGG-19, and ResNet-50 architectures were used with bottleneck features to speed up training. These models' results were then compared with one another in terms of overall accuracy and $F_1$ score, as well as the accuracy in correctly classifying specific handshapes.
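
The per-handshape comparison can be sketched as follows: for each letter, count how often the model's prediction matches the true label. The function name and the toy labels are assumptions for illustration.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only.
y_true = ["M", "M", "S", "S", "L", "L"]
y_pred = ["S", "M", "S", "S", "L", "L"]
print(per_class_accuracy(y_true, y_pred))
```

Here one of the two "M" samples is mistaken for "S", so "M" scores $0.5$ while "S" and "L" score $1.0$.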


### Benchmark
<!--
In this section, you will need to provide a clearly defined benchmark result or threshold for comparing across performances obtained by your solution. The reasoning behind the benchmark (in the case where it is not an established result) should be discussed. Questions to ask yourself when writing this section:
- _Has some result or value been provided that acts as a benchmark for measuring performance?_
- _Is it clear how this result or value was obtained (whether by data or by hypothesis)?_
-->

The first and simplest benchmark compares the developed model with a "random choice" model. With each handshape being equally likely, we would expect the "random choice" model to correctly identify a handshape only $\frac{1}{24} \approx 4.2\%$ of the time on average.
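
As a short worked derivation, assume the baseline guesses each of the $24$ letters uniformly at random, independently of the image. If $p_k$ denotes the fraction of test images showing letter $k$ (a symbol introduced here for illustration), the expected accuracy is

$$
E[accuracy] = \sum_{k=1}^{24} p_k \cdot \frac{1}{24} = \frac{1}{24} \sum_{k=1}^{24} p_k = \frac{1}{24} \approx 4.2\%
$$

Under this assumption the expected accuracy of uniform guessing is $\frac{1}{24}$ whether or not the classes are perfectly balanced.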
  
A goal of this project is to achieve better performance than could be obtained with specialized equipment beyond an ordinary camera (like the Microsoft Kinect). The performance of the random forest model in the *"Spelling It Out"* paper [\[1\]](#References) therefore serves as our idealized benchmark. The paper's confusion matrix, produced by training and validating on four of the subjects and testing on the fifth subject's images, shows that similar handshapes were misclassified. The handshapes least often identified correctly were "T", "O", "S", and "M" (correct $7\%$, $13\%$, $17\%$, and $17\%$ of the time respectively), while those identified most accurately were "L", "V", "B", and "G" (correct $87\%$, $87\%$, $83\%$, and $80\%$ of the time respectively). The paper's model achieved an overall mean precision of $73\%$ using only the images and $75\%$ using both images and depth data.

## III. Methodology
<!--_(approx. 3-5 pages)_-->

### Data Preprocessing
<!--
In this section, all of your preprocessing steps will need to be clearly documented, if any were necessary. From the previous section, any of the abnormalities or characteristics that you identified about the dataset will be addressed and corrected here. Questions to ask yourself when writing this section:
- _If the algorithms chosen require preprocessing steps like feature selection or feature transformations, have they been properly documented?_
- _Based on the **Data Exploration** section, if there were abnormalities or characteristics that needed to be addressed, have they been properly corrected?_
- _If no preprocessing is needed, has it been made clear why?_
-->

Since it was determined that CNNs were to be used (as discussed in the previous section), the images needed to be scaled to the same dimensions. Ideally, resizing would keep as much information as possible. The extreme solution would be to scale the images to the maximum height and width ($272$ pixels and $249$ pixels respectively); however, this would create larger file sizes and could greatly distort the majority of the images. Instead, the target size should let most images retain most (if not all) of their information, so that only a minority of images lose information. Note that the images should not be cropped since they have already been cropped around the handshapes.

It was observed that most images are taller than they are wide. It then made sense that resizing the data should favor taller images since this is more common in the data. One solution was to add padding to images so that the width matches with the height. However, adding this padding could give bias to the model if particular letters have differing aspect ratios. In other words, the model could simply learn based on the padding instead of the handshape in the image. Thus stretching/squeezing or rescaling the image seems to be preferred even if the images will lose more information.
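
The trade-off between padding and stretching can be made concrete with a small sketch. The function names and the example size are assumptions for illustration; the target of $160$ pixels is the size this report settles on.

```python
def padding_fraction(width, height):
    """Fraction of a square-padded image that would be padding."""
    side = max(width, height)
    return 1 - (width * height) / (side * side)


def stretch_factors(width, height, target=160):
    """Per-axis scale factors needed to stretch an image to target x target."""
    return target / width, target / height


# Made-up size for illustration: a tall, narrow crop.
print(padding_fraction(80, 160))   # 0.5
print(stretch_factors(80, 160))    # (2.0, 1.0)
```

The amount of padding depends only on the aspect ratio, so if aspect ratio correlates with the letter, the padded area itself becomes a signal the model could learn. Stretching removes that signal at the cost of distorting the handshape along one axis.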

After determining that rescaling was the best strategy, it needed to be determined how large the rescaled images would be for the CNN architecture. It has already been said that a square image was preferred. More than $75\%$ of the images have aspect ratios less than $1.0$ so most images could simply be rescaled to a square by changing the width.

Restricting the width to $125$ pixels would keep information preserved for about $75\%$ of the images. This seems to be reasonable for width but height has to be considered as well. Restricting the image to $125$ pixels for height would mean information would be preserved for less than $25\%$ of the images. This appeared to be a poor tradeoff especially considering that most images were tall and therefore most pictures would lose information.

Focusing on height instead, a restriction of $160$ pixels would preserve all the height information for about $75\%$ of the images. At this size, over $75\%$ of the images would also preserve all their width information. This was ideal since the vast majority of images would preserve their relative shape and pixel density.

In summary, the images were resized to $160$ pixels by $160$ pixels, which preserved most of the images' information. The size was also convenient since each dimension is divisible by $32$ (a power of $2$). Each resized image contained $25600$ pixels, which allowed a large proportion of the images to be scaled without losing information.
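
The reasoning used to pick the target size can be sketched as a simple check: for a candidate size, what fraction of images would not need to be shrunk along a given dimension? The function name and the sample values are assumptions for illustration, not dataset values.

```python
def fraction_not_downscaled(values, target):
    """Fraction of values that are <= target, i.e. would not be shrunk."""
    return sum(v <= target for v in values) / len(values)


# Made-up heights and widths for illustration only.
heights = [100, 130, 150, 160, 200]
widths = [64, 80, 100, 120, 170]
print(fraction_not_downscaled(heights, 160))   # 0.8
print(fraction_not_downscaled(widths, 160))    # 0.8
print(fraction_not_downscaled(heights, 125))   # 0.2
```

Applied to the real height and width lists, the same check at $125$ and $160$ pixels corresponds to the comparison described above.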

### Implementation
<!--
In this section, the process for which metrics, algorithms, and techniques that you implemented for the given data will need to be clearly documented. It should be abundantly clear how the implementation was carried out, and discussion should be made regarding any complications that occurred during this process. Questions to ask yourself when writing this section:
- _Is it made clear how the algorithms and techniques were implemented with the given datasets or input data?_
- _Were there any complications with the original metrics or techniques that required changing prior to acquiring a solution?_
- _Was there any part of the coding process (e.g., writing complicated functions) that should be documented?_
-->

The first goal was to build a CNN from scratch (with no transfer learning). (Note that all work for the following sections can be found in the notebook titled *asl_recognition.ipynb* unless stated otherwise.) The first step was to set aside all of one subject's images as the testing set; the images from the other four subjects were used for training and validation. This was to emulate the benchmark from the paper referred to earlier. The remaining images were then randomly split so that $80\%$ were used for training and the other $20\%$ for validation. These testing, training, and validation image sets were used consistently for every model developed.
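
A minimal sketch of such a split is shown below. It is not the notebook's code: the function name, the `(subject, path)` representation, the fixed random seed, and the subject IDs and file names are all assumptions for illustration.

```python
import random


def split_by_subject(samples, test_subject, val_fraction=0.2, seed=0):
    """Hold out one subject for testing; randomly split the rest.

    `samples` is a list of (subject, path) pairs.
    """
    test = [s for s in samples if s[0] == test_subject]
    rest = [s for s in samples if s[0] != test_subject]
    random.Random(seed).shuffle(rest)   # seeded so the split is repeatable
    n_val = int(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test   # train, validation, test


# Hypothetical subject IDs and file names, for illustration only.
samples = [(subj, f"{subj}/img_{i}.png") for subj in "ABCDE" for i in range(10)]
train, val, test = split_by_subject(samples, test_subject="E")
print(len(train), len(val), len(test))   # 32 8 10
```

Fixing the seed is one way to keep the three subsets identical across models, which is what makes the later comparisons fair.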

The next step was to prepare the model. Originally the goal for the basic model from scratch was to use colored images (RGB) as input. However, having three channels made it difficult to load the training and validation sets into memory as it was too memory intensive. So the basic model used grayscale (one channel) images as input.

The images were also scaled to $224$ pixels by $224$ pixels to make the basic model easier to compare with the ResNet50 transfer learning model to be used later. It was predicted that ResNet50 would do best among the planned transfer learning models (the others being VGG16 and VGG19). The grayscale, scaled images for training and validation were then loaded into memory as tensors so they could be used to fit the model's weights.
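
A rough estimate shows why three channels were a problem. The sketch assumes the images are held as one dense array of 32-bit floats and uses a hypothetical count of $40{,}000$ training images; neither the data type nor the exact count is stated in this report.

```python
def tensor_bytes(n_images, height, width, channels, bytes_per_value=4):
    """Memory needed to hold the images as one dense array (float32 by default)."""
    return n_images * height * width * channels * bytes_per_value


n = 40_000  # hypothetical number of training images
rgb = tensor_bytes(n, 224, 224, 3)
gray = tensor_bytes(n, 224, 224, 1)
print(f"RGB:       {rgb / 1e9:.1f} GB")    # RGB:       24.1 GB
print(f"Grayscale: {gray / 1e9:.1f} GB")   # Grayscale: 8.0 GB
```

Whatever the real count, dropping from three channels to one cuts the memory needed by a factor of three.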

The basic model was created with three convolutional layers using the ReLU activation function and with $16$, $32$, and $64$ filters. Each of these layers was followed by a max pooling layer with a pooling size of $2$. A global average pooling layer then reduced the feature maps to a vector, and the model was finished with a dense softmax layer of $24$ outputs (one for each handshape to predict). Below is the code used in the notebook to define the model:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

model = Sequential()
# Three convolution + max-pooling stages on 224x224 grayscale (one-channel) input
model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu', input_shape=(224, 224, 1)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
# Average each of the 64 feature maps down to a single value
model.add(GlobalAveragePooling2D())
# One softmax output per handshape
model.add(Dense(24, activation='softmax'))
```
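
The size of this model can be checked by hand. The sketch below counts trainable parameters using the standard formulas for convolutional and dense layers (pooling layers have none); the helper names are assumptions for illustration.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel_size * kernel_size * in_channels + 1) * filters


def dense_params(in_features, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (in_features + 1) * units


total = (
    conv2d_params(2, 1, 16)      # 80
    + conv2d_params(2, 16, 32)   # 2080
    + conv2d_params(2, 32, 64)   # 8256
    + dense_params(64, 24)       # 1560
)
print(total)   # 11976
```

With roughly twelve thousand parameters this is a very small network, which gives some context for the accuracy it reaches.
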
Next the model was trained using the training data and validation data. The model was trained for $16$ epochs with a batch size of $20$ images. This produced weights that could then be used to evaluate the model.

The model was then evaluated on the testing set images, which were converted to grayscale and scaled to $224$ pixels by $224$ pixels in the same way (and also loaded into memory as a tensor). The trained model produced an overall accuracy of about $43\%$. It was determined that the next step for improvement would be to use transfer learning.


### Refinement
<!--
In this section, you will need to discuss the process of improvement you made upon the algorithms and techniques you used in your implementation. For example, adjusting parameters for certain models to acquire improved solutions would fall under the refinement category. Your initial and final solutions should be reported, as well as any significant intermediate results as necessary. Questions to ask yourself when writing this section:
- _Has an initial solution been found and clearly reported?_
- _Is the process of improvement clearly documented, such as what techniques were used?_
- _Are intermediate and final solutions clearly reported as the process is improved?_
-->

Using the model built from scratch, training took a relatively long time even with small batches and few epochs. It was determined that using transfer learning would be an effective use of computing resources to help improve the model's overall accuracy. Attempts were made using the models of VGG16, VGG19, and ResNet50.

The first attempt used VGG16. Bottleneck features were used to speed up training. The VGG16 model was loaded with its final classification layer removed, and the training images were fed through it to compute the bottleneck features. A CNN (similar to the basic model) could then be trained on these features to predict the handshapes.

Various CNNs and image sizes were tested with the VGG16 model. RGB images were used, and the image size ranged from $80$ pixels by $80$ pixels to $160$ pixels by $160$ pixels. Several variations of the CNN were tried, most with no more than three convolutional layers. Overall accuracy ranged from about $50\%$ to about $65\%$. The best overall accuracy achieved with VGG16 was about $66\%$, using three convolutional layers of $256$, $128$, and $32$ filters with a batch size of $9000$ and $400$ epochs. A similar procedure was used with VGG19, but it was not able to exceed the best overall accuracy of $66\%$.
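
The input size also determines how large the bottleneck features are. The sketch below assumes the features are taken from the full convolutional base of VGG16 or VGG19, which contains five $2 \times 2$ max-pooling layers with stride $2$ and ends with $512$ channels; the function name is an assumption for illustration.

```python
def vgg_bottleneck_shape(height, width, n_pools=5, channels=512):
    """Output shape of the VGG16/VGG19 convolutional base (no top layers).

    Each of the five 2x2, stride-2 max-pooling layers halves the spatial
    size, rounding down; the last convolutional block has 512 channels.
    """
    for _ in range(n_pools):
        height //= 2
        width //= 2
    return height, width, channels


print(vgg_bottleneck_shape(160, 160))   # (5, 5, 512)
print(vgg_bottleneck_shape(80, 80))     # (2, 2, 512)
```

A $160$ by $160$ input divides evenly through all five poolings, whereas an $80$ by $80$ input loses a row and a column at the last pooling step and leaves only a $2$ by $2$ grid for the small CNN trained on top.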

Using ResNet50 for transfer learning increased the overall accuracy slightly more. Again bottleneck features were used, this time with $224$ pixels by $224$ pixels RGB images. An overall accuracy of about $70\%$ was achieved using a CNN with three convolutional layers ($256$, $12$, and $32$ filters) on the bottleneck features, a batch size of $512$, and $2048$ epochs. This seemed to be the level that could be achieved with reasonable access to computing resources.


## IV. Results
<!--_(approx. 2-3 pages)_-->

### Model Evaluation and Validation
<!--
In this section, the final model and any supporting qualities should be evaluated in detail. It should be clear how the final model was derived and why this model was chosen. In addition, some type of analysis should be used to validate the robustness of this model and its solution, such as manipulating the input data or environment to see how the model’s solution is affected (this is called sensitivity analysis). Questions to ask yourself when writing this section:
- _Is the final model reasonable and aligning with solution expectations? Are the final parameters of the model appropriate?_
- _Has the final model been tested with various inputs to evaluate whether the model generalizes well to unseen data?_
- _Is the model robust enough for the problem? Do small perturbations (changes) in training data or the input space greatly affect the results?_
- _Can results found from the model be trusted?_
-->

### Justification
<!--
In this section, your model’s final solution and its results should be compared to the benchmark you established earlier in the project using some type of statistical analysis. You should also justify whether these results and the solution are significant enough to have solved the problem posed in the project. Questions to ask yourself when writing this section:
- _Are the final results found stronger than the benchmark result reported earlier?_
- _Have you thoroughly analyzed and discussed the final solution?_
- _Is the final solution significant enough to have solved the problem?_
-->


## V. Conclusion
<!--_(approx. 1-2 pages)_-->

### Free-Form Visualization
<!--
In this section, you will need to provide some form of visualization that emphasizes an important quality about the project. It is much more free-form, but should reasonably support a significant result or characteristic about the problem that you want to discuss. Questions to ask yourself when writing this section:
- _Have you visualized a relevant or important quality about the problem, dataset, input data, or results?_
- _Is the visualization thoroughly analyzed and discussed?_
- _If a plot is provided, are the axes, title, and datum clearly defined?_
-->

### Reflection
<!--
In this section, you will summarize the entire end-to-end problem solution and discuss one or two particular aspects of the project you found interesting or difficult. You are expected to reflect on the project as a whole to show that you have a firm understanding of the entire process employed in your work. Questions to ask yourself when writing this section:
- _Have you thoroughly summarized the entire process you used for this project?_
- _Were there any interesting aspects of the project?_
- _Were there any difficult aspects of the project?_
- _Does the final model and solution fit your expectations for the problem, and should it be used in a general setting to solve these types of problems?_
-->

### Improvement
<!--
In this section, you will need to provide discussion as to how one aspect of the implementation you designed could be improved. As an example, consider ways your implementation can be made more general, and what would need to be modified. You do not need to make this improvement, but the potential solutions resulting from these changes are considered and compared/contrasted to your current solution. Questions to ask yourself when writing this section:
- _Are there further improvements that could be made on the algorithms or techniques you used in this project?_
- _Were there algorithms or techniques you researched that you did not know how to implement, but would consider using if you knew how?_
- _If you used your final solution as the new benchmark, do you think an even better solution exists?_
-->


# References

\[1\]: Pugeault, N., and Bowden, R. (2011). Spelling It Out: Real-Time ASL Fingerspelling Recognition. In Proceedings of the 1st IEEE Workshop on Consumer Depth Cameras for Computer Vision, jointly with ICCV'2011.


\[2\]: Pugeault, Nicolas. "Nico" ASL FingerSpelling Dataset from the University of Surrey’s Center for Vision, Speech and Signal Processing, empslocal.ex.ac.uk/people/staff/np331/index.php?section=FingerSpellingDataset.