<a href="https://colab.research.google.com/github/Benned-H/Reading_List/blob/master/Notes/Hand_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Hand Segmentation Using Skin Color and Background Information
By Wei Wang, Jing Pan

This paper presents a new method for segmenting hands from the background of an image. Their method uses an adaptive skin color model with three steps:
1. Capture pixel values of hand and background
2. Propose Gaussian models on the color space
3. Segment the image using various models, intersect the results

Results were better than other skin-color-only models.

## 1. Introduction

Various applications make accurate hand recognition quite important, and segmentation is a crucial first step in this process. Because human skin is generally within a limited range of hues, color-based hand recognition has been investigated for decades. The process depends on two choices: the **color space** and the **model of distribution** for skin colors. Prior work used a variety of techniques and spaces, including:
* Color spaces: Normalized RGB, CIE, XYZ, HSV, HSI, YCbCr
* Gaussian model for distributions
* Edge detection
* Varied chrominance spaces
* Skin/edge information in various spaces

## 2. Color Space and Gaussian model

Their method was primarily concerned with the use of background information to help segmentation. Thus they only used the normalized RGB and YCbCr color spaces and a single Gaussian model.

**Normalized RGB**   
RGB is a convenient color model widely used for processing image data. Unfortunately, the RGB color space is not robust because the same surface color maps to different RGB values under different conditions or illumination. Normalized RGB was proposed to mitigate this problem, and it does perform better under different lighting conditions, but *only when the illumination is uniform*. Normalized RGB can be calculated as:

$r=\frac{R}{R+G+B}$; $g=\frac{G}{R+G+B}$; $b=\frac{B}{R+G+B}$.
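As a quick check of the formula, here is a minimal sketch in plain Python. The function name `normalize_rgb` and the handling of pure black are my own assumptions, not something the paper specifies.

```python
def normalize_rgb(r, g, b):
    """Return the normalized RGB chromaticity of one pixel."""
    total = r + g + b
    if total == 0:
        # Pure black has no defined chromaticity. Returning equal thirds
        # is an assumption made for this sketch, not part of the paper.
        return (1 / 3, 1 / 3, 1 / 3)
    return (r / total, g / total, b / total)


# The same surface under brighter and dimmer (uniform) light.
bright = normalize_rgb(200, 100, 50)
dim = normalize_rgb(100, 50, 25)
```

Both pixels land on the same normalized point, which is the robustness to uniform brightness changes described above.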

**YCbCr**   
YCbCr is considered better suited to our purposes than RGB: skin colors cluster more tightly, it's easy to calculate, and there is far less overlap between skin and non-skin tones across illumination conditions. YCbCr separates out a luminance signal (Y) and two chrominance components (Cb and Cr). We can discard signal Y to improve performance over various lighting conditions. The transform from RGB to YCbCr is simple:

$
\begin{bmatrix}
Y \\ Cb \\ Cr
\end{bmatrix}=
\begin{bmatrix}
0.2568 & 0.5041 & 0.0979 \\
-0.1482 & -0.2910 & 0.4392 \\
0.4392 & -0.3678 & -0.0714
\end{bmatrix}
\begin{bmatrix}
R \\ G \\ B
\end{bmatrix}+
\begin{bmatrix}
16 \\ 128 \\ 128
\end{bmatrix}
$
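The matrix form translates directly to code. This is a minimal per-pixel sketch assuming 8-bit RGB inputs; the helper names `rgb_to_ycbcr` and `chroma` are mine.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr) with the matrix above."""
    y = 16 + 0.2568 * r + 0.5041 * g + 0.0979 * b
    cb = 128 - 0.1482 * r - 0.2910 * g + 0.4392 * b
    cr = 128 + 0.4392 * r - 0.3678 * g - 0.0714 * b
    return (y, cb, cr)


def chroma(r, g, b):
    """Discard luminance Y and keep only the (Cb, Cr) pair."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return (cb, cr)


black = rgb_to_ycbcr(0, 0, 0)
white = rgb_to_ycbcr(255, 255, 255)
```

Each chrominance row of the matrix sums to zero, so any gray pixel (R = G = B) maps to Cb = Cr = 128 and only Y changes with brightness.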

## Gaussian Mixture Model - [Brilliant](https://brilliant.org/wiki/gaussian-mixture-model/#)

Gaussian mixture models (GMMs) are probabilistic models for representing normally distributed subpopulations within an overall population (a normal distribution has mean = median = mode, and is symmetric about its center). Mixture models don't need to know which subpopulation each data point belongs to, which makes them a form of unsupervised learning (e.g. human height data would have two normal distributions, one per sex, which a GMM could capture).

**Motivation**   
We might want to try modeling data with a GMM if it appears to have more than one 'peak' distribution. Unimodal (one 'peak') models would give a poor fit in such a case, and yet GMMs retain the computational benefits of a single Gaussian model.

**To be continued upon additional probability background...**

## 2 cont.

The properties of skin color can be modeled using a Gaussian distribution, which has the formula:   
$f(x)=\frac{1}{\sqrt{2\pi \sigma ^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean value of the samples and $\sigma$ is their standard deviation (so $\sigma^2$ is the variance).
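To make the single-Gaussian model concrete, here is a small sketch that estimates $\mu$ and $\sigma$ from sample values and evaluates $f(x)$. The sample values, the helper names, and the "within $k$ standard deviations" matching rule are all assumptions for illustration; the paper's own matching rule may differ.

```python
import math


def fit_gaussian(samples):
    """Estimate the mean and standard deviation of 1-D samples."""
    n = len(samples)
    mu = sum(samples) / n
    variance = sum((x - mu) ** 2 for x in samples) / n
    return mu, math.sqrt(variance)


def gaussian_pdf(x, mu, sigma):
    """Evaluate the Gaussian density f(x) from the formula above."""
    coefficient = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coefficient * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


def is_skin(x, mu, sigma, k=2.0):
    """Hypothetical matching rule: accept values within k sigma of the mean."""
    return abs(x - mu) <= k * sigma


# Made-up chrominance samples from a skin patch.
skin_samples = [150, 152, 148, 151, 149]
mu, sigma = fit_gaussian(skin_samples)
```

With these samples the mean is 150 and the variance is 2, so a value of 151 matches the model while 160 does not.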

Using this model for skin color is a process of matching each pixel in the image. If matched, we consider the pixel as a skin pixel, and if not we consider it background. The two parameters ($\mu$ and $\sigma$) decide the structure of our Gaussian model, and need to be learned. A common method for this is **offline training** on thousands of images, but these authors use an adaptive skin color model, which uses the center of the hand skin to calculate and constantly update the parameters of the model. This **online** model seems to work better in different illuminations. Because this paper doesn't explain that model, I'll read through the source of this idea:

# 2. A New Method for Hand Segmentation Using Free-Form Skin Color Model
By Ahmad Yahya Dawod, Junaidi Abdullah, and Md.Jahangir Alam

Segmentation remains difficult; this paper proposes a new method using a free-form skin color model. The pixel values of the hand are represented in the YCbCr color space, and the chrominance components are mapped onto the CbCr plane. To cluster the region of skin color on this plane, edge detection is used (as opposed to an ellipse) to construct a free-form skin color model.

## I. Introduction

The goal of hand segmentation is to detect the position and orientation of hands in an image; the aim of skin color pixel classification is to determine if a color pixel is a skin color or non-skin color. There are several techniques used to model the skin color:
* An elliptical boundary model which fits an ellipse on the CbCr plane. The ellipse ends up including non-skin pixel colors, unfortunately.
* Coarse model - Fixed straight lines are used as boundaries to a coarse region, but again this includes non-skin pixels.
* Estimate the boundary by constructing bilinear and bicubic boxes around the CbCr pixel cluster. Same issue.

This paper proposes a new method that uses a free-form boundary which models the skin color depending on the person and minimizes the inclusion of non-skin pixels.

## II. Suggested Method

Their method consists of four modules:
1. Image acquisition (skin region cropping)
2. Mapping (CbCr color space mapping)
3. Morphology (erosion & dilation)
4. Boundary creation (detecting edges)

**Image Acquisition**   
Because different people have different skin tones, the authors believe (and I agree) that we shouldn't just define some general range for skin tones. Thus we need to crop a skin image of the person using the system to develop a free-form model specific to them. As has been mentioned, choosing the right color space is the first step we need to tackle. Long story short, we choose YCbCr for the previously written reasons (taken from here, by the way). Skin image cropping just crops the image so we only see a patch of the user's skin as the cropped result. We can then form a cluster in our color space with this example.

**CbCr Color Space Mapping**   
They observed that the intensity value Y of YCbCr has little influence on the color distribution. On the Cb and Cr plane, we can generate a map where white (255) points are skin pixels and black (0) are non-skin. The resulting 255x255 map is the range of skin color present in the cropped image.
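A rough sketch of the mapping step, assuming the cropped skin patch is given as a list of 8-bit RGB tuples. I index the map by rounded Cr (row) and Cb (column) and use a 256x256 grid so that every 8-bit value 0 to 255 has a cell; both choices, and the name `build_cbcr_map`, are my assumptions rather than details from the paper.

```python
def build_cbcr_map(skin_pixels, size=256):
    """Mark every (Cb, Cr) pair seen in the skin patch as white (255)."""
    cbcr_map = [[0] * size for _ in range(size)]
    for r, g, b in skin_pixels:
        cb = 128 - 0.1482 * r - 0.2910 * g + 0.4392 * b
        cr = 128 + 0.4392 * r - 0.3678 * g - 0.0714 * b
        # Round to the nearest cell and clamp into the grid.
        row = min(size - 1, max(0, round(cr)))
        col = min(size - 1, max(0, round(cb)))
        cbcr_map[row][col] = 255
    return cbcr_map


# Made-up skin patch; the first and third pixels are identical.
skin_patch = [(200, 150, 120), (180, 120, 100), (200, 150, 120)]
skin_map = build_cbcr_map(skin_patch)
white_cells = sum(value == 255 for row in skin_map for value in row)
```

Here the three pixels produce only two white cells, since duplicate colors land on the same point of the plane. A real patch would produce a ragged cluster, which is what the morphology stage then cleans up.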

**Morphology**   
This stage uses image processing to create a cleaner single free-form shape. Two operations are used: Dilation adds pixels to fill in missing pixels in the white cluster and erosion removes extra pixels not belonging to the white cluster. Both help our resulting segmentation, and are applied in the order of dilation, erosion, and then edge point extraction.

**Boundary Creation**   
Here we determine the actual region which will define skin and non-skin color pixels. We consider the white cluster and apply an edge detection algorithm, for which there are a few different methods (gradient, laplacian). The **gradient method** detects the edge by looking for the maximum and minimum in the first derivative of the image, whereas the **laplacian method** searches for zero crossings in the second derivative of the image.

For gradient edge detection, given image function $f(x,y)$, the gradient magnitude $g(x,y)$ and direction $\theta(x,y)$ are computed as:   
$g(x,y)\cong \sqrt{\Delta x^2 + \Delta y^2}$ and $\theta(x,y)\cong \arctan\left(\frac{\Delta y}{\Delta x}\right)$,   
where $\Delta x =f(x+n,y)-f(x-n,y)$ and $\Delta y =f(x,y+n)-f(x,y-n)$,   
where $n$ is a small integer.
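The central-difference formulas can be sketched as below. The image is assumed to be a list of rows, so $f(x,y)$ is `image[y][x]`. I use `math.atan2` in place of a plain arctangent of the ratio so that $\Delta x = 0$ does not cause a division by zero; that substitution is mine. The caller must keep `(x, y)` at least `n` pixels away from the border.

```python
import math


def gradient_at(image, x, y, n=1):
    """Central-difference gradient magnitude and direction at (x, y)."""
    dx = image[y][x + n] - image[y][x - n]
    dy = image[y + n][x] - image[y - n][x]
    magnitude = math.sqrt(dx ** 2 + dy ** 2)
    direction = math.atan2(dy, dx)  # radians; well defined when dx == 0
    return magnitude, direction


# A vertical edge: dark on the left, bright on the right.
edge_image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
magnitude, direction = gradient_at(edge_image, 1, 1)
```

At the pixel just left of the edge, the intensity changes only along $x$, so the magnitude is 255 and the direction is 0 radians.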

They used the Sobel operator for the gradient method due to its fast speed, fine scale edge detection, and smoothing action. Once the edge has been detected, they arrive at the skin color cluster. Pixels with color in the cluster are skin, those not are classified as non-skin.

## III. Experimental Results

The authors also implemented a few morphological operations on the detected hand: fill holes in the segmented hand and also index all white regions in descending order by number of pixels. Only the white region with the most pixels is marked as white in the final result. In another example, a watch cuts the hand at the wrist, and with this knowledge the authors can include the top two white regions.
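The "keep only the largest white region" step can be sketched with a breadth-first connected-component search. The 4-connectivity and the names `label_regions` and `keep_largest` are assumptions for illustration; the paper does not say which labeling algorithm it used.

```python
from collections import deque


def label_regions(mask):
    """Return white regions as lists of (row, col), largest first."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 255 or seen[r][c]:
                continue
            # Flood outward from this unvisited white pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            region = []
            while queue:
                row, col = queue.popleft()
                region.append((row, col))
                for nr, nc in ((row - 1, col), (row + 1, col),
                               (row, col - 1), (row, col + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] == 255 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append(region)
    regions.sort(key=len, reverse=True)
    return regions


def keep_largest(mask, count=1):
    """Keep only the `count` largest white regions; blacken the rest."""
    result = [[0] * len(mask[0]) for _ in mask]
    for region in label_regions(mask)[:count]:
        for r, c in region:
            result[r][c] = 255
    return result


noisy = [
    [255, 255, 0, 0, 255],
    [255, 255, 0, 0, 0],
    [0, 0, 0, 255, 255],
]
hand_only = keep_largest(noisy)
hand_and_wrist = keep_largest(noisy, count=2)
```

Passing `count=2` mirrors the watch example, where the top two regions are kept because the hand is known to be split at the wrist.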

To validate their work, the authors estimated a **true positive rate** (TPR) and **false positive rate** (FPR). These were calculated as:   
$\text{TPR}=\frac{\text{TP}}{\text{TP}+\text{FN}}$, $\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}}$.
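A small sketch of how these rates could be computed from a predicted mask and a ground-truth mask, both using 255 for skin and 0 for non-skin. The function names and the toy masks are mine, and the sketch assumes each mask contains at least one skin and one non-skin pixel in the ground truth so neither denominator is zero.

```python
def confusion_counts(predicted, truth):
    """Count TP, FP, FN, TN over two same-sized binary masks."""
    tp = fp = fn = tn = 0
    for pred_row, true_row in zip(predicted, truth):
        for pred, actual in zip(pred_row, true_row):
            if pred == 255 and actual == 255:
                tp += 1
            elif pred == 255:
                fp += 1
            elif actual == 255:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def rates(predicted, truth):
    """Return (TPR, FPR) using the formulas above."""
    tp, fp, fn, tn = confusion_counts(predicted, truth)
    return tp / (tp + fn), fp / (fp + tn)


predicted_mask = [[255, 255, 0], [0, 255, 0]]
truth_mask = [[255, 0, 0], [255, 255, 0]]
tpr, fpr = rates(predicted_mask, truth_mask)
```

For these toy masks there are 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, giving a TPR of 2/3 and an FPR of 1/3.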

## Conclusion

This paper introduced a novel free-form skin color model in the YCbCr color space. Its most important contribution is the ability to accurately tailor a hand segmentation color space to a user, regardless of their skin color or the setting's illumination.

*--Done 6/23/2019--*

# Resuming 1

## 3. Proposed Method

Their model introduces a new technique based on mixing a background model and a skin color model. A flowchart of the process:   
![Flowchart](https://i.imgur.com/m7YG08D.png)

A prerequisite of the process is that we've detected a hand and have a region of interest (ROI) that's a cropped region fully containing the hand and some background as well. In this ROI, we crop a region $P_0$ that contains *only skin color* (in the center of the hand, probably), as we needed in the previously read paper. We also crop $N$ regions $(P_1, P_2, ... ,P_N)$ containing random samples of background color. These may include some skin color at this point, no problem. We build six directions of the model to best ensure an accurate skin color and background color sampling.

We now want to check whether the background samples actually belong to the background. We calculate the mean ($M_i$) and variance ($s_i$) of each cropped region to set up their Gaussian models. Each background crop is then used to segment the skin crop region with an automatic threshold interval $T_i$. Pixels in the skin crop region within $T_i$ are set to white (255) and others are set to black (0). If the proportion of the white parts exceeds some fixed threshold $S$, the corresponding background image is considered true background and is accepted; otherwise it's discarded. To ensure a mostly-uniform background color, background crop regions with variance above some value $V_m$ are also dropped.

The skin crop region and surviving background crop regions are used to segment the cropped image with thresholds $T_0$ and $T_1$, respectively. The segmentation results $(R_0,R_1,...,R_N)$ are intersected to get the final result. Their method used parameters $N=50$, $V_m=0.15$, $S=0.95$, and the interval $T_i$ for each background crop region depended on $s_i$ to guarantee that 90% of the pixels in the crop region were in its Gaussian distribution. The original images were 320x240 and the cropped images varied in size.
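Here is one plausible reading of the segment-and-intersect step, sketched on a single-channel image. Everything specific in it is an assumption on my part: the interval is taken as $\mu \pm 1.645\sigma$ (which covers about 90% of a normal distribution), the skin result $R_0$ keeps pixels inside the skin interval, and each background result keeps pixels outside that background's interval. The paper may define these differently.

```python
def coverage_interval(mu, sigma, z=1.645):
    """Interval holding roughly 90% of a normal distribution's mass."""
    return (mu - z * sigma, mu + z * sigma)


def segment(image, interval, keep_inside):
    """White (255) where membership in the interval equals keep_inside."""
    low, high = interval
    return [[255 if (low <= value <= high) == keep_inside else 0
             for value in row] for row in image]


def intersect_masks(masks):
    """White only where every mask is white."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[255 if all(mask[r][c] == 255 for mask in masks) else 0
             for c in range(cols)] for r in range(rows)]


# Made-up chrominance values; 120 is background that looks skin-like.
image = [[150, 151, 120], [149, 100, 152]]
skin_interval = coverage_interval(150, 20)        # loose skin model
background_interval = coverage_interval(120, 3)   # one accepted background crop

r0 = segment(image, skin_interval, keep_inside=True)
r1 = segment(image, background_interval, keep_inside=False)
final = intersect_masks([r0, r1])
```

In this toy case the loose skin model alone would accept the pixel with value 120, but the background mask removes it, which is the kind of improvement the method is after.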

## 4. Experimental Results

Their method worked far better than the traditional only-skin-based method. In comparing different values for $N$, increasing the value helps the method up to some point. When $N$ exceeds about 40, improvements slow. Perhaps $N$ only needs to be large enough that all useful background information is obtained? A smaller $N$ is preferable for images with simple backgrounds, but no description of how to measure this is given.

The authors evaluate the accuracy rate (AR) using the formula:   
$\text{AR}=\frac{\text{AS}\cap \text{AL}}{\text{AS}\cup \text{AL}}$, where $\text{AS}$ is the area of the segmented hand and $\text{AL}$ is the area of the labeled hand in the image. Thus $\text{AR}\le 1$, with equality only for a perfect segmentation, and a higher AR indicates a more accurate segmentation. Their method reaches an AR of 90.36% on a difficult image while the online model reached 69.30%.
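The AR formula is an intersection-over-union of the two masks. A minimal sketch, with my own function name and toy masks, assuming at least one white pixel exists in either mask:

```python
def accuracy_rate(segmented, labeled):
    """Area of intersection divided by area of union of two binary masks."""
    intersection = union = 0
    for seg_row, lab_row in zip(segmented, labeled):
        for seg, lab in zip(seg_row, lab_row):
            if seg == 255 and lab == 255:
                intersection += 1
            if seg == 255 or lab == 255:
                union += 1
    return intersection / union


segmented_mask = [[255, 255, 0], [0, 255, 0]]
labeled_mask = [[255, 0, 0], [255, 255, 0]]
ar = accuracy_rate(segmented_mask, labeled_mask)
```

The toy masks share 2 white pixels out of 4 in their union, so the AR is 0.5. Identical masks give exactly 1.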

## 5. Conclusions

This paper introduced a hand segmentation method which used background information samples to improve accuracy. They used only a single Gaussian model to characterize skin color in the normalized RGB and YCbCr color spaces. Future work might include a GMM or free-form skin color model to make further improvements.   
*--Done 6/24/2019--*

# 3. Dilation
By R. Fisher, S. Perkins, A. Walker, and E. Wolfart

## Introduction

Dilation is one of the two basic operators of **mathematical morphology**, which concerns the theory of processing geometrical structures (typically applied to digital images). It's typically applied to binary images, and the basic effect is to gradually enlarge the boundaries of the **foreground** (white) pixels in the image. Thus the areas of white grow in size and holes in those regions shrink.

## How It Works

Dilation uses two inputs: the image to be dilated, and a **structuring element**. The structuring element (AKA kernel) is a set of coordinate points that determines the precise result of the dilation operation. The mathematical definition:   

*Def*: Suppose $X$ is the set of Euclidean coordinates corresponding to the input binary image, and $K$ is the set of coordinates for the structuring element. Let $Kx$ denote the translation of $K$ so that its origin is at $x$. Then the dilation of $X$ by $K$ is simply the set of all points $x$ s.t. the intersection of $Kx$ with $X$ is non-empty.   
*In my words*: The result $Y$ is the set of every position $y$ at which the kernel $K$, placed with its origin at $y$, overlaps at least one point of $X$. When $K$ contains its own origin, every point of $X$ qualifies, so $X \subseteq Y$ and the foreground can only grow.

E.g. Let's say our structuring element is a 3x3 square, with origin at its center. We'll represent foreground pixels with 1's and background as 0's.

In order to compute the dilation of this binary image, we consider each background pixel of the image in turn. We then superimpose the kernel over each of these **input pixels**. If *at least one* of the kernel's pixels overlaps with a foreground pixel, the input pixel is set to the foreground (probably do this *after* we first consider each background pixel). Thus if all pixels were set to background when we start, nothing will change. Also, dilation is the *dual* of erosion, meaning that dilating foreground pixels is the same as eroding background pixels.
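The procedure above can be sketched in plain Python for a 0/1 image. Results are written to a copy so that newly set pixels do not influence later decisions (the "do this after" concern). Treating positions outside the image as background and the name `dilate` are my assumptions.

```python
def dilate(image, kernel_offsets=None):
    """Binary dilation of a 0/1 image; defaults to a 3x3 square kernel."""
    if kernel_offsets is None:
        kernel_offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]  # read from image, write to result
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1:
                continue  # foreground stays foreground
            for dr, dc in kernel_offsets:
                nr, nc = r + dr, c + dc
                # Positions outside the image count as background.
                if 0 <= nr < rows and 0 <= nc < cols and image[nr][nc] == 1:
                    result[r][c] = 1
                    break
    return result


single_pixel = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
dilated = dilate(single_pixel)
```

A single foreground pixel grows into a 3x3 block, and an all-background image comes back unchanged, as noted above.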

## Guidelines for Use

Most implementations of dilation expect a binary input, usually with foreground pixels as 255 and background as 0. This can typically be produced from a grayscale image using thresholding, a technique I'll soon take notes on. Specification for the structuring element is often left to the implementation. For larger structuring elements, ellipses are commonly used. In general, dilation rounds convex boundaries and preserves concave boundaries.

Directional dilations can be created with less symmetrical kernels: an 11-pixel wide yet 1-pixel tall element will dilate horizontally only. Greyscale dilation generally brightens the image as bright regions grow in size. Dark regions near high intensity values will be filled in, and uniform areas will generally stay as they were.

Dilation can also be used for **edge detection**. Take the 3x3 dilation of an image and subtract away the original image, highlighting only the new pixels at the edges of objects.

Finally, we can also implement **region filling** using dilation. We'll also need logical NOT and AND. The process can be described by:   
$X_k=\text{dilate}(X_{k-1},J)\cap A_{\text{not}}$, where $X_k$ is the region which will fill the boundary, $J$ is the structuring element, and $A_{\text{not}}$ is the negative of the boundary. As a series of steps:
1. Say we know pixel $X_0$ is inside the region.
2. Dilate an image containing this single pixel.
3. AND the result with the NOT of the region border to prevent spreading outside the region.
4. Repeat from step 2 until convergence.
5. OR this result with the border.
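The steps above can be sketched as follows on 0/1 images. I use a cross-shaped structuring element for $J$ so the fill cannot leak diagonally through a thin border; that choice and the helper names are assumptions for this sketch.

```python
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def dilate(image, offsets=CROSS):
    """Binary dilation of a 0/1 image with the given kernel offsets."""
    rows, cols = len(image), len(image[0])
    return [[1 if any(0 <= r + dr < rows and 0 <= c + dc < cols
                      and image[r + dr][c + dc] == 1
                      for dr, dc in offsets) else 0
             for c in range(cols)] for r in range(rows)]


def fill_region(border, seed):
    """Fill the region enclosed by `border`, starting from `seed` (row, col)."""
    rows, cols = len(border), len(border[0])
    current = [[0] * cols for _ in range(rows)]
    current[seed[0]][seed[1]] = 1                      # step 1
    while True:
        grown = dilate(current)                        # step 2
        # Step 3: AND with NOT border so the fill stays inside.
        clipped = [[1 if grown[r][c] == 1 and border[r][c] == 0 else 0
                    for c in range(cols)] for r in range(rows)]
        if clipped == current:                         # step 4: converged
            break
        current = clipped
    # Step 5: OR the filled interior with the border itself.
    return [[1 if current[r][c] == 1 or border[r][c] == 1 else 0
             for c in range(cols)] for r in range(rows)]


ring = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 0],
]
filled = fill_region(ring, seed=(2, 2))
```

The interior of the ring is filled while the column outside the border stays background.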

*Dilation done 6/27/2019*

# 4. Erosion
By R. Fisher, S. Perkins, A. Walker, and E. Wolfart

## Introduction

Erosion is the second of the two basic operators of mathematical morphology. Its effect is to erode away the boundaries of foreground pixels (white pixels, typically) so that such areas shrink in size and holes in the areas become larger.

*Def*: Suppose $X$ is the set of Euclidean coordinates corresponding to the input binary image, and $K$ is the set of coordinates for the structuring element. Let $Kx$ denote the translation of $K$ so that its origin is at $x$. Then the erosion of $X$ by $K$ is simply the set of all points $x$ s.t. $Kx \subseteq X$.   
As a note, erosion is just the *dual* of dilation: eroding the foreground is equivalent to dilating the background. Thus the code should need few changes, and the understanding carries over.
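A matching sketch of erosion on a 0/1 image: a pixel survives only if the whole kernel, placed at that pixel, fits inside the foreground. As with my dilation sketch, positions outside the image are treated as background, which is an assumption and means foreground touching the image border is eroded.

```python
def erode(image, kernel_offsets=None):
    """Binary erosion of a 0/1 image; defaults to a 3x3 square kernel."""
    if kernel_offsets is None:
        kernel_offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Keep the pixel only if every kernel position is foreground.
            if all(0 <= r + dr < rows and 0 <= c + dc < cols
                   and image[r + dr][c + dc] == 1
                   for dr, dc in kernel_offsets):
                result[r][c] = 1
    return result


block = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
eroded = erode(block)
```

The 3x3 block shrinks to its single center pixel, the reverse of what dilation does to a single pixel.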

## Guidelines for Use

I skip the parts that repeat dilation's applications/usage. What's new for erosion?   
Erosion can be used to separate touching objects in a binary image so they can be counted using some labeling algorithm. To segment touching items, we can erode the image. Note that a circular kernel distorts the shape of items in the image much less than a square one. You can still do edge detection too by eroding an image and then subtracting the result from the original image.

*Erosion done 6/27/2019*


# 5. RGB to YCbCr conversion [[Article]](https://sistenix.com/rgb2ycbcr.html)
By Nelson Campos

This color space conversion is pretty important to my current approach, so here's the math shown:   
* $Y=16+\frac{65.738R}{256}+\frac{129.057G}{256}+\frac{25.064B}{256}$.
* $Cb=128-\frac{37.945R}{256}-\frac{74.494G}{256}+\frac{112.439B}{256}$.
* $Cr=128+\frac{112.439R}{256}-\frac{94.154G}{256}-\frac{18.285B}{256}$.

This is the same as:
* $Y=16+0.2568R+0.5041G+0.0979B$,
* $Cb=128-0.1482R-0.2910G+0.4392B$, and
* $Cr=128+0.4392R-0.3678G-0.0714B$.

All of which are *exactly* what the first paper said.

Because this might take some processing, we could also convert the process to bit shifts. This is a bit tedious so I'll check it out if we need a speedup; we'll see.   
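For reference, a sketch of what the bit-shift version could look like. It rounds each numerator in the `/256` equations above to the nearest integer, adds 128 before shifting so the result rounds to nearest, and replaces the division by 256 with `>> 8`. This is a common fixed-point approximation that I am assuming here, not something taken from the article.

```python
def rgb_to_ycbcr_int(r, g, b):
    """Integer-only approximation of the RGB to YCbCr conversion."""
    # Python's >> floors toward negative infinity, so negative sums are safe.
    y = 16 + ((66 * r + 129 * g + 25 * b + 128) >> 8)
    cb = 128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8)
    cr = 128 + ((112 * r - 94 * g - 18 * b + 128) >> 8)
    return (y, cb, cr)


black = rgb_to_ycbcr_int(0, 0, 0)
white = rgb_to_ycbcr_int(255, 255, 255)
red = rgb_to_ycbcr_int(255, 0, 0)
```

Black maps to (16, 128, 128), white to (235, 128, 128), and pure red to (82, 90, 240), each within one unit of the floating-point formulas.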
*--That's all I need for now.--*

# 6. Vision-Based Hand Gesture Recognition for Human Computer Interaction: A Survey
By Siddharth S. Rautaray and Anupam Agrawal

**Abstract** - HCI is a crucial component of the future of how we use computers, and gestures should become a part of this interaction. This work will present a summary of the progress made in this area as well as areas where future work was needed (as of 2012).

## 1. Introduction

A particular issue is that the 2 degrees of freedom of a mouse fail to emulate 3D manipulations. Gestures are used not only for identifying or manipulating objects in space but also for communicating emotions or ideas. Thus **gesture** might be defined as *a physical movement of the hands, arms, face, and body with the intent to convey information or meaning*. Hand gestures are of particular interest, as the hands are by far the most common body part employed in gestures.

## 2. Enabling Technologies for Hand Gesture Recognition

**Gesture recognition** refers to the process of tracking human gestures, building a representation of them, and converting them to semantic commands. The two types of devices that have made this feasible so far are *contact-based* and *vision-based* devices. Contact-based devices rely on physical interaction with a controller and are thus less adaptable to new users.

Previous surveys are summarized, including:
* Integration with gaze and speech are suggested (Pavlovic et al. 1997)
* Lack of data and long time for capture are issues (Moeslund et al. 2001)
* Different algorithms are needed in different settings (Chaudhary et al. 2011)
* Soft computing like ANN, fuzzy logic, genetic algorithms discussed (Wachs et al. 2011)
* Move process towards modularization, scalability, and decentralization (Kanniche 2012)

As a note, I will be avoiding notes on all but the most promising contact-based devices. They're inherently less attractive for my purposes, which are accessibility and immediate ease of use. Following a conversation with my mom, who works in special education, switches are the best bet for simplicity, her students being early and Pre-K children with autism. Further notes on gross motor control, bilateral control, etc. would likely help my understanding in what's useful in this area.

Some vision-based systems use hand markers as either reflective markers with strobe or active LED lights. These systems can be captured in 2D and preprocessed to 3D using the known setup of the markers. Vision techniques also face the challenges of:
* Huge number of DOF for gestures.
* Variability of 2D view based on camera's location.
* Different spatial resolutions.
* Variability of gesture speeds.

Any vision-based system needs to analyze images at the frame rate of their video input, in real time. Robustness against different lighting conditions and backgrounds, in-plane and out-of-plane image rotations, scalability from a few gesture primitives, and working with users of different sizes and colors are all important as well.

Of course, vision-based systems suffer from accuracy, complexity, and occlusion issues when compared to contact-based systems. Yet they're more convenient and less intrusive! Luckily, this paper shares my feelings and from here solely focuses on vision-based techniques.

## 3. Vision-Based Hand Gesture Taxonomies and Representations

The literature divides hand gestures into two types: **static** and **dynamic**. Static hand gestures are defined by some orientation and position of the still hand in space, whereas gestures involving movement are dynamic gestures. For example, gestures that we can find as a single disembodied emoji hand are likely to be static gestures. Dynamic hand gestures done intentionally for communication are **conscious** dynamic gestures, whereas unintentional gestures are **unconscious**.

Dynamic gestures can be further divided into subcategories:
* **Emblematic** gestures (emblems, quotable gestures) are direct translations of short verbal communications, e.g. waving goodbye or nodding for affirmative. Of course, these definitions may be culturally specific.
* **Affect displays** are gestures that convey emotion or intentions.
* **Adaptors** such as head shaking or moving one's leg release body tension and are typically unintentional.
* **Illustrator** gestures emphasize key points of speech, and can further be divided into:
>a. Beats are short, quick, rhythmic and often repetitive.   
>b. Deictic gestures include pointing to concrete real locations or people; also abstract pointing to locations or periods of time.   
>c. Iconic gestures depict figural representations or actions, e.g. moving a hand up with wiggling fingers to represent climbing a tree.   
>d. Metaphoric gestures depict abstractions.   
>e. Cohesive gestures are thematically related but temporally separated, caused by interruption to the communicator.

**Vision-based Hand Gesture Representations**   
There have been many models proposed to aid in hand gesture recognition, the two main categories being 3D model-based and appearance-based methods. A 3D model-based approach defines some 3D spatial description of a human hand and its temporal aspect is then the task to automate. The temporal characteristics are split into three stages: preparation/prestroke phase, nucleus/stroke phase, and retraction/poststroke phase. The model updates its parameters along transitions of the temporal model, which is computationally intensive but precise. 3D model representations include:
* 3D textured volumetric - Contain high details of human body skeleton and skin surface.
* 3D geometric - Less precise with skin information, have structure of hand still.
* 3D skeleton - Include only bone structure approximations.

Conversely, appearance-based approaches include color-based, silhouette geometry, deformable gabarit, and motion-based models. Broadly these are either 2D static methods or motion-based models. Descriptions:
* Color-based models use body markers to track the motion of a body part. These can use color features, hierarchical models, and particle filtering.
* Silhouette geometry models include several properties of the silhouette such as perimeter, convexity, surface, bounding box, elongation, rectangularity, centroid, and orientation. These apparently can be enough in some cases, like Birdal and Hassanpour 2008.
* Deformable gabarit models are based on deformable active contours. Snakes are parameterized with motion to analyze gestures and actions.
* Motion-based models are pretty intuitive (Local motion histogram introduced by Luo 2008).

## 4. Vision-Based Hand Gesture Recognition Techniques

Most complete hand gesture recognition mechanisms comprise three main phases:
1. Detection
2. Tracking
3. Recognition

### 4.1 Detection

This first step involves the detection of hands and then segmentation of their regions from an image. This isolates the task-relevant data before passing it to subsequent stages. The features suggested for this step include skin color, shape, motion, and anatomical models of hands.

# To Read/Learn:

* Sobel operator?
* https://homepages.inf.ed.ac.uk/rbf/HIPR2/edgdetct.htm

Last revised 8/1/2019.