# Machine Learning Engineer Nanodegree
## Capstone Proposal
Victor Geislinger  
2018 Month Day

## American Sign Language Handshape Detection from Static Images

### Domain Background
> _(approx. 1-2 paragraphs)_
>
> In this section, provide brief details on the background information of the domain from which the project is proposed. Historical information relevant to the project should be included. It should be clear how or why a problem in the domain can or should be solved. Related academic research should be appropriately cited in this section, including why that research is relevant. Additionally, a discussion of your personal motivation for investigating a particular problem in the domain is encouraged but not required.

American Sign Language (ASL) is a sign language that does not use speech to communicate and is mostly used by the American Deaf population. Though used throughout the English-speaking United States, it is in fact its own language, separate from English, and it builds its syntax from multiple visual components such as handshapes, movement, position, and nonmanual markers. There are many sign languages specific to different languages, regions, and needs, so being able to detect ASL with a computer would be useful not only for ASL translation but also for translating other sign languages and for general gesture recognition.

  
There has been past research on detecting ASL and ASL-like handshapes and movements, including general gesture recognition. Several past attempts at detecting hand motions and handshapes used datasets with depth information from sensors such as the Microsoft Kinect. However, depth sensors are relatively uncommon compared to the ubiquitous camera sensor found on nearly every computer and phone. Recent advances in image recognition and classification make it attainable to detect and classify ASL handshapes from video or from static images (possibly taken from video).

### Problem Statement
> _(approx. 1 paragraph)_
> 
> In this section, clearly describe the problem that is to be solved. The problem described should be well defined and should have at least one relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

In this project I will classify static images of different ASL handshapes. This is a good stepping stone before a video dataset is used, since time dependence and movement do not have to be considered. The task is comparable to classifying the MNIST handwritten digit database, but with images of ASL handshapes. If the dataset is already cropped around the handshape, a convolutional neural network (CNN), possibly taking advantage of transfer learning, could be used to classify the images. The model could then be evaluated with a validation set and/or new images from a similarly preprocessed but independent dataset.

### Datasets and Inputs
>_(approx. 2-3 paragraphs)_
>
>In this section, the dataset(s) and/or input(s) being considered for the project should be thoroughly described, such as how they relate to the problem and why they should be used. Information such as how the dataset or input is (was) obtained, and the characteristics of the dataset or input, should be included with relevant references and citations as necessary. It should be clear how the dataset(s) or input(s) will be used in the project and whether their use is appropriate given the context of the problem.

The preferred dataset is the ASL FingerSpelling Dataset from the University of Surrey’s Centre for Vision, Speech and Signal Processing (http://empslocal.ex.ac.uk/people/staff/np331/index.php?section=FingerSpellingDataset). This dataset contains both color images and depth data collected with a Microsoft Kinect. (Note that I will not be using the depth data since my project focuses on static images.) The images include 24 different handshapes, each representing a letter of the English alphabet; "j" and "z" are excluded because they depend on movement and have no static image representation. The images have been cropped around the handshape, though each crop results in a differently sized image, and the background behind the handshape is not uniform or consistent. The dataset contains roughly 48,000 images generated by 4 different non-native ASL signers, which is consistent with over 500 samples of each of the 24 handshapes from each signer (4 × 24 × 500 = 48,000).
  
The secondary dataset, which could be used for validation, is "Sign Language MNIST" from Kaggle (https://www.kaggle.com/datamunge/sign-language-mnist/home). It consists of grayscale images of 24 different handshapes, each representing a letter of the English alphabet; "j" and "z" are again excluded because they depend on movement and have no static image representation. All images have been cropped around the handshape to a square 28x28 pixels (784 pixel values per image).
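
Since both datasets cover the same 24 static letters, a shared label mapping keeps their classes aligned. The following is a minimal sketch; the names `LETTERS`, `LETTER_TO_INDEX`, and `INDEX_TO_LETTER` are hypothetical, and the contiguous 0 to 23 indexing is an assumption that may not match the label encoding shipped with either dataset.

```python
import string

# The 24 static handshapes: every letter except "j" and "z",
# which require movement and have no static image representation.
EXCLUDED = {"j", "z"}
LETTERS = [c for c in string.ascii_lowercase if c not in EXCLUDED]

# Assumption: classes are indexed contiguously (0 to 23) in alphabetical order.
LETTER_TO_INDEX = {letter: i for i, letter in enumerate(LETTERS)}
INDEX_TO_LETTER = {i: letter for letter, i in LETTER_TO_INDEX.items()}

print(len(LETTERS))          # 24
print(LETTER_TO_INDEX["k"])  # 9, because "j" is skipped
```

If a dataset stores labels in a different encoding, its labels would need to be remapped to this index before the two datasets are compared.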

- Other possible datasets (video, depth, images): http://facundoq.github.io/unlp/sign_language_datasets/index.html

### Solution Statement
>_(approx. 1 paragraph)_
>
>In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Additionally, describe the solution thoroughly such that it is clear that the solution is quantifiable (the solution can be expressed in mathematical or logical terms), measurable (the solution can be measured by some metric and clearly observed), and replicable (the solution can be reproduced and occurs more than once).

- Resize images from the Surrey dataset to a square (likely 256x256)
- Separate the resized images 80-20 for training and testing respectively
- Build CNN model architecture
  - Use transfer learning (possibly from a model pretrained on ILSVRC2012)
  - Classify images into one of 24 handshapes
- Use a confusion matrix to visualize the performance
- Determine precision
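
The 80-20 separation step could be sketched as follows. This is only an illustration under stated assumptions: the helper `stratified_split` and the file names are hypothetical, and splitting per letter (stratification) is an assumption added here so that every handshape keeps the same train/test ratio.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (path, label) pairs into train and test lists, label by label."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical file names: 10 images for each of 3 letters.
samples = [(f"{letter}_{i}.png", letter) for letter in "abc" for i in range(10)]
train, test = stratified_split(samples)
print(len(train), len(test))  # 24 6
```

Fixing the seed keeps the split replicable between runs, which matters when comparing models trained at different times.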

### Benchmark Model
>_(approximately 1-2 paragraphs)_
>
>In this section, provide the details for a benchmark model or result that relates to the domain, problem statement, and intended solution. Ideally, the benchmark model or result contextualizes existing methods or known information in the domain and problem given, which could then be objectively compared to the solution. Describe how the benchmark model or result is measurable (can be measured by some metric and clearly observed) with thorough detail.

- Compare with randomly guessing the handshapes
  - Expected precision of uniform random guessing is 1/24 ≈ 4.2%
- Compare with the results reported by the dataset's authors (the "Spelling It Out" paper)
  - Confusion matrix
      - misclassifies on similar looking handshapes
      - lowest percent correct (t,o,s,m w/ 7%,13%,17%,17% respectively)
      - highest percent correct (l,v,b,g w/ 87%,87%,83%,80% respectively)
  - Reference success (0.35 overall --> http://empslocal.ex.ac.uk/people/staff/np331/index.php?section=FingerSpellingDataset)
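
The random-guessing baseline can be checked with a short simulation. This is a rough sketch: the function `random_guess_accuracy` and the balanced label set are hypothetical and stand in for the real test labels.

```python
import random

NUM_CLASSES = 24  # static handshapes (the alphabet without "j" and "z")


def random_guess_accuracy(true_labels, num_classes=NUM_CLASSES, seed=0):
    """Fraction of labels matched by uniform random guesses."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(num_classes) == y for y in true_labels)
    return hits / len(true_labels)


expected = 1 / NUM_CLASSES  # about 0.0417

# Hypothetical balanced label set: 2,000 samples per class.
labels = [c for c in range(NUM_CLASSES) for _ in range(2000)]
simulated = random_guess_accuracy(labels)
print(f"expected={expected:.4f} simulated={simulated:.4f}")
```

With balanced classes, the simulated value should land close to 1/24, which gives a floor that any trained model should clearly exceed.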

### Evaluation Metrics
>_(approx. 1-2 paragraphs)_
>
>In this section, propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).


- Use mean squared error (MSE) $MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$
- Use the F1 score since we want to weight precision and recall equally. That is, failing to recognize a letter is just as bad as misclassifying a letter
    - $F_\beta = (1+\beta^2) \frac{precision \times recall}{(\beta^2 \times precision) + recall}$
    - $F_1 = 2 \frac{precision \times recall}{precision + recall}$
    - $recall = \frac{pos_{true}}{pos_{true} + neg_{false}}$
    - $precision = \frac{pos_{true}}{pos_{true} + pos_{false}}$
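
The precision, recall, and $F_1$ definitions above can be computed per letter from a confusion matrix, as in the sketch below. The function names are hypothetical, and averaging the per-class $F_1$ scores with equal weight (macro-averaging) is an assumption; the proposal does not yet fix how the 24 per-class scores are combined.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_scores(matrix):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, but wrong
        fn = sum(matrix[c]) - tp                       # true c, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return scores


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores (an assumed aggregation)."""
    scores = per_class_scores(matrix)
    return sum(f1 for _, _, f1 in scores) / len(scores)


# Toy example with 3 classes instead of 24.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                      # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(macro_f1(cm), 3))  # 0.656
```

The same confusion matrix can also be plotted to show which handshapes are confused with each other.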

### Project Design
>_(approx. 1 page)_
>
>In this final section, summarize a theoretical workflow for approaching a solution given the problem. Provide thorough discussion for what strategies you may consider employing, what analysis of the data might be required before being used, or which algorithms will be considered for your implementation. The workflow and discussion that you provide should align with the qualities of the previous sections. Additionally, you are encouraged to include small visualizations, pseudocode, or diagrams to aid in describing the project design, but it is not required. The discussion should clearly outline your intended workflow of the capstone project.

-----------

**Before submitting your proposal, ask yourself. . .**

- Does the proposal you have written follow a well-organized structure similar to that of the project template?
- Is each section (particularly **Solution Statement** and **Project Design**) written in a clear, concise and specific fashion? Are there any ambiguous terms or phrases that need clarification?
- Would the intended audience of your project be able to understand your proposal?
- Have you properly proofread your proposal to assure there are minimal grammatical and spelling mistakes?
- Are all the resources used for this project correctly cited and referenced?