# ASL Handshape Data

## Description of data
The data is available from Nicolas Pugeault's website
(http://empslocal.ex.ac.uk/people/staff/np331/index.php?section=FingerSpellingDataset). The dataset covers 24 static handshapes corresponding to English letters (excluding "J" and "Z", since they require motion). It comprises about 60,000 RGB (intensity) images and depth images collected from 5 different non-native signers. The images have some rotational variance because the subjects moved their hands during image capture.

## Notebook setup

Run this section even if the data is already partially processed, since the imports and variables below are used throughout the later sections.

In [None]:
import os
import re
import sys
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image 
from functools import reduce

## Variables needed for reference throughout sections
dataset_url = 'www.cvssp.org/FingerSpellingKinect2011/fingerspelling5.tar.bz2'
filename = 'fingerspelling5.tar.bz2'
# Data directory
final_data_dir = 'dataset'
# Data's top-level directory (after download & decompression)
dataset_dir = 'dataset5'

## Grabbing data

Below is a script to download the data to the local machine. Note that the compressed file is over 2 GB. If the data has already been retrieved, you can skip this section and start with preprocessing the dataset.

In [None]:
# Download the dataset archive from the URL defined above
os.system('wget {URL}'.format(URL=dataset_url))

In [None]:
# Uncompress
os.system('tar xjf {}'.format(filename))
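
As an alternative to shelling out to `wget` and `tar`, the same two steps can be sketched with the standard library only. This is a minimal sketch, not the method used above: `download_if_missing` and `extract_archive` are hypothetical helper names, and the `http://` prefix is an assumption, since `dataset_url` has no scheme and `urllib` requires one.

```python
import os
import tarfile
import urllib.request


def download_if_missing(url, filename):
    """Download `url` to `filename` unless the file already exists."""
    if os.path.exists(filename):
        return False  # nothing downloaded
    # Assumption: the URL has no scheme, so prepend one
    if '://' not in url:
        url = 'http://' + url
    urllib.request.urlretrieve(url, filename)
    return True


def extract_archive(filename, dest_dir='.'):
    """Extract a bzip2-compressed tar archive into `dest_dir`."""
    # Only extract archives from a source you trust
    with tarfile.open(filename, 'r:bz2') as archive:
        archive.extractall(dest_dir)
```

For this notebook the calls would be `download_if_missing(dataset_url, filename)` followed by `extract_archive(filename)`. Skipping the download when the file exists avoids fetching the 2 GB archive twice.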

## Removing and relabelling data

Only the RGB image data is needed. The files are relabelled so that they can easily be placed into one directory while still carrying the metadata needed for classification, validation, and testing.

In [None]:
# Make a new data directory if it doesn't exist
if not os.path.exists(final_data_dir):
    os.makedirs(final_data_dir)

# Define patterns for depth files & RGB files
# Format: `depth_0_0528.png` & `color_12_0137.png`
# Raw strings keep the regex escapes intact; `\.` matches a literal dot
pattern_depth_file = r'(depth\w*\.png)'
pattern_rgb_file = r'color_\d*_(\d*)\.png'
# Number of files renamed/deleted
n_del, n_rename = 0, 0

# Give each subject a numerical ID (letters can be confusing)
# Each subject is in a directory named with a letter ('A','B','C',...)
# Sort the listing so subject IDs are the same on every run
for (subject_id, subject_dir) in enumerate(sorted(os.listdir(dataset_dir))):
    # Directories for each letter (excluding "j" & "z")
    path_to_subject = os.path.join(dataset_dir, subject_dir)
    
    for letter_dir in os.listdir(path_to_subject):
        # Use letter as a zero-padded number ('a' starts @ 00)
        letter_id = ord(letter_dir.lower()) - ord('a')
        letter_id = f'{letter_id:02d}'
        path_to_letter = os.path.join(path_to_subject, letter_dir)
        
        for image_file in os.listdir(path_to_letter):
            # Remove depth file
            if re.search(pattern_depth_file, image_file):
                path_depth_file = os.path.join(path_to_letter, image_file)
                os.remove(path_depth_file)
                # Inform depth file removed
                print(f'\r#{n_del}: Depth file deleted {path_depth_file}', end='')
                n_del += 1
            else:
                # Get ID of each file (None if not matched)
                num_id = re.match(pattern_rgb_file, image_file)
                if num_id:
                    # Keep only the parenthesized group (the image number)
                    num_id = num_id.group(1)
                    path_image_file = os.path.join(path_to_letter, image_file)
                    # Rename image
                    new_image_name = f'{letter_id}_{subject_id}_{num_id}.png'
                    new_path_image_file = os.path.join(final_data_dir, new_image_name)
                    os.rename(path_image_file, new_path_image_file)
                    # Inform image renamed
                    print(f'\r#{n_rename}: {new_path_image_file} renamed from {path_image_file}', end='')
                    n_rename += 1
            sys.stdout.flush()
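
Because each renamed file follows the `{letter_id}_{subject_id}_{num_id}.png` scheme, the metadata can be recovered from the name alone. The helper below is a minimal sketch of that idea; `parse_image_name` and `pattern_new_name` are hypothetical names that do not appear elsewhere in this notebook.

```python
import re

# Matches names such as `00_3_0137.png` (letter ID, subject ID, image number)
pattern_new_name = re.compile(r'(\d+)_(\d+)_(\d+)\.png')


def parse_image_name(image_name):
    """Return (letter, subject_id, num_id) for a relabelled image name."""
    match = pattern_new_name.fullmatch(image_name)
    if match is None:
        raise ValueError(f'Unexpected file name: {image_name}')
    letter_id, subject_id, num_id = match.groups()
    # Letter IDs were computed as ord(letter) - ord('a'), so invert that
    letter = chr(int(letter_id) + ord('a'))
    return letter, int(subject_id), num_id


print(parse_image_name('00_3_0137.png'))  # ('a', 3, '0137')
```

Grouping files by the parsed `subject_id` is one possible way to hold out whole signers for validation or testing.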

## Change images to grayscale

The RGB images are converted to grayscale, since color shouldn't be necessary for recognition. This should also reduce the file sizes of the images and can help with generalizing to future datasets.
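
For reference, Pillow's `convert('L')` maps each RGB pixel to a single luminance value using the ITU-R 601-2 luma transform, `L = R * 299/1000 + G * 587/1000 + B * 114/1000`. The pure-Python sketch below applies that formula to one pixel. `rgb_to_luma` is a hypothetical helper for illustration only, and Pillow's own integer rounding may differ by one level for some pixels.

```python
def rgb_to_luma(r, g, b):
    """Approximate the ITU-R 601-2 luma value for one 8-bit RGB pixel."""
    return round((r * 299 + g * 587 + b * 114) / 1000)


print(rgb_to_luma(255, 255, 255))  # 255 (white stays white)
print(rgb_to_luma(255, 0, 0))      # 76 (pure red becomes dark gray)
```

Three channels collapse into one, which is where the expected reduction in file size comes from.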

In [None]:
# Convert image to grayscale and save file
def img_to_gray(img_file):
    img_path = os.path.join(final_data_dir, img_file)
    img = Image.open(img_path, 'r').convert('L')
    img.save(img_path)
    
# Keep track of image number
n = 0
# Only consider PNG files in the data directory
file_list = [x for x in os.listdir(final_data_dir) if x.endswith('.png')]
n_imgs = len(file_list)
errors_convert = []

# Iterate over each image
for img_filename in file_list:
    print(f'\r#{n: <5} of {n_imgs}: Converting image `{img_filename}` to gray', end='')
    # Keep track of files that were not successfully converted
    try:
        img_to_gray(img_filename)
    except Exception:
        errors_convert.append((n, img_filename))
        # Start on a new line so the progress output is not overwritten
        print(f'\n`{img_filename}` was NOT converted')
    n += 1
    sys.stdout.flush()