# ASL Handshape Data

## Description of data
The data is available on Nicolas Pugeault's website
(http://empslocal.ex.ac.uk/people/staff/np331/index.php?section=FingerSpellingDataset). The dataset covers 24 static handshapes corresponding to English letters (excluding "J" and "Z", which require motion). It comprises about 60,000 RGB (intensity) images and depth images collected from 5 different non-native signers. The images have some rotational variance because the subjects moved their hands during image capture.

## Grabbing data

Below is a script to download the data to the local machine. Note that the compressed file is over 2GB. If the data has already been retrieved, you can skip this section and start with preprocessing the dataset.

In [None]:
import os, re

In [None]:
# Link to the dataset; download it with wget
dataset_url = 'www.cvssp.org/FingerSpellingKinect2011/fingerspelling5.tar.bz2'
os.system('wget {URL}'.format(URL=dataset_url))

In [None]:
# Uncompress
filename = 'fingerspelling5.tar.bz2'
os.system('tar xjf {}'.format(filename))
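
The two cells above rely on the `wget` and `tar` command-line tools being installed. As a hedged alternative, here is a minimal sketch that does the same download and extraction with the Python standard library only. The helper names `download_if_missing` and `extract_archive` are hypothetical, and the `http://` scheme in the usage comment is an assumption (the URL above is written without one).

```python
import os
import tarfile
import urllib.request


def download_if_missing(url, filename):
    """Download `url` to `filename` unless the file is already present.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(filename):
        return False  # archive already on disk, nothing to do
    urllib.request.urlretrieve(url, filename)
    return True


def extract_archive(filename, target_dir='.'):
    """Unpack a bzip2-compressed tar archive into `target_dir`."""
    with tarfile.open(filename, 'r:bz2') as archive:
        archive.extractall(target_dir)


# Usage (assumes the server answers over plain HTTP):
# download_if_missing(
#     'http://www.cvssp.org/FingerSpellingKinect2011/fingerspelling5.tar.bz2',
#     'fingerspelling5.tar.bz2')
# extract_archive('fingerspelling5.tar.bz2')
```

Skipping the download when the archive already exists avoids fetching the 2GB file again if the notebook is rerun.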

## Removing and relabelling data

Only the RGB image data is needed. The files should be relabelled so that they can easily be placed into one directory while still carrying the metadata needed for classification, validation, and testing.

In [None]:
def purge(directory, pattern):
    """Delete every file in `directory` whose name matches the regex `pattern`."""
    for f in os.listdir(directory):
        if re.search(pattern, f):
            os.remove(os.path.join(directory, f))

# Make a new data directory if it doesn't exist
final_data_dir = 'dataset'
if not os.path.exists(final_data_dir):
    os.makedirs(final_data_dir)
    
# Top-level directory of the data (after download & decompression)
dataset_dir = 'dataset5'

# Define patterns for depth files & RGB files
# Format: `depth_0_0528.png` & `color_12_0137.png`
# Raw strings keep `\w` and `\d` as regex escapes; `\.` matches a literal dot
pattern_depth_file = r'(depth\w*\.png)'
pattern_rgb_file = r'color_\d*_(\d*)\.png'

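# Give each subject a numeric ID (a letter could be confused with the handshape label)
# Each subject has its own directory named with a letter ('A','B','C',...)
# Sort the listing so the numeric IDs are assigned in a stable order
for (subject_id, subject_dir) in enumerate(sorted(os.listdir(dataset_dir))):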
    # Directories for each letter (excluding "j" & "z")
    path_to_subject = os.path.join(dataset_dir, subject_dir)
    
    for letter_dir in os.listdir(path_to_subject):
        # Use letter as a zero-padded, two-digit number ('a' starts @ 00)
        letter_id = ord(letter_dir.lower()) - ord('a')
        letter_id = f'{letter_id:02d}'
        path_to_letter = os.path.join(path_to_subject, letter_dir)
        
        for image_file in os.listdir(path_to_letter):
            # Remove depth file
            if re.search(pattern_depth_file, image_file):
                path_depth_file = os.path.join(path_to_letter, image_file)
                os.remove(path_depth_file)
            else:
                # Get ID of each file (None if not matched)
                num_id = re.match(pattern_rgb_file, image_file)
                if num_id:
                    # Keep only the captured group (the image number)
                    num_id = num_id.group(1)
                    path_image_file = os.path.join(path_to_letter, image_file)
                    # Rename image
                    new_image_name = f'{letter_id}_{subject_id}_{num_id}.png'
                    new_path_image_file = os.path.join(final_data_dir, new_image_name)
                    os.rename(path_image_file, new_path_image_file)
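
Since the relabelled names follow the `{letter_id}_{subject_id}_{num_id}.png` layout, the metadata can be read back from a filename later on. The sketch below shows one possible way to do that and to hold out a subject for testing. The names `parse_image_name`, `split_by_subject`, `NAME_PATTERN`, and `LETTERS` are hypothetical, and the contiguous class index (0 to 23, skipping "j" and "z") is an assumption about how the labels might be used downstream.

```python
import re
import string

# Filename layout produced by the relabelling step:
# <letter_id>_<subject_id>_<num_id>.png, e.g. `00_0_0137.png`
NAME_PATTERN = re.compile(r'(\d{2})_(\d+)_(\d+)\.png')

# The 24 static letters (no "j" or "z"), in alphabetical order
LETTERS = [c for c in string.ascii_lowercase if c not in 'jz']


def parse_image_name(name):
    """Return (letter, class_index, subject_id, num_id), or None if no match."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    letter_id, subject_id, num_id = match.groups()
    letter = chr(ord('a') + int(letter_id))
    if letter not in LETTERS:
        return None  # letter_id does not map to one of the 24 static letters
    return letter, LETTERS.index(letter), int(subject_id), num_id


def split_by_subject(names, test_subject):
    """Split filenames into (train, test), holding out one subject for testing."""
    train, test = [], []
    for name in names:
        parsed = parse_image_name(name)
        if parsed is None:
            continue  # skip files that don't follow the naming scheme
        subject_id = parsed[2]
        (test if subject_id == test_subject else train).append(name)
    return train, test
```

For example, `split_by_subject(os.listdir('dataset'), test_subject=4)` would keep every image from subject 4 out of the training list, which avoids testing on a signer the model has already seen.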