<a href="https://colab.research.google.com/github/Fasih1994/IBM_Datascienc_Capstone_data/blob/master/ETL_IBM_Datascienc_Capstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steps

*   Load the data from Kaggle
*   Remove all depth files from the data (**because we only work with the color images**)
*   Upload the data to GitHub
*   Download the data from GitHub onto Colab (**because it gives GPU computing for free & we are working with images here**)
*   Set up directories with a suitable naming convention
*   Copy all the data into one place with the desired names

# About Data

The dataset used is the ASL FingerSpelling Dataset from the University of Surrey’s Centre for
Vision, Speech and Signal Processing. This dataset contains both color images
and depth-sensing data collected with a Microsoft Kinect. (Note that this project did not
use the depth-sensing data, since it focuses on static color images.) The images
include 24 different handshapes, each representing a letter of the English alphabet; "J"
and "Z" are excluded because these letters depend on movement and therefore have no
static image representation.
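
As a small illustration of the 24-letter label set described above, the sketch below builds the list of static letters by dropping "j" and "z". The names `STATIC_LETTERS` and `LETTER_TO_INDEX` are hypothetical and do not appear elsewhere in this notebook.

```python
import string

# Every lowercase letter except the two motion-based signs "j" and "z"
STATIC_LETTERS = [c for c in string.ascii_lowercase if c not in ("j", "z")]

# Hypothetical contiguous class index per letter (an assumption, not the notebook's scheme)
LETTER_TO_INDEX = {letter: index for index, letter in enumerate(STATIC_LETTERS)}

print(len(STATIC_LETTERS))   # 24
print(LETTER_TO_INDEX["k"])  # 9, because "j" is skipped
```

Note that the deletion loop later in this notebook computes `ord(letter) - ord('a')` instead, which leaves gaps at the positions of "j" and "z".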

The dataset images have been cropped around the handshape, so each image has a
different size, fitting within a 275 by 250 pixel window at a resolution of 72
pixels per inch. The background behind the handshape is not uniform or consistent. **The dataset
contains approximately 65,000 images generated by five different non-native ASL signers, with
over 500 samples of each of the 24 different handshapes**. The handshapes in the images show
some rotational differences, as the subjects were instructed to adjust their hand position for the camera.
These positional adjustments nevertheless preserve the common hand position associated with the
English letter each handshape represents.

# Let's Start
## Loading data from Kaggle

In [0]:

import os
import re
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

final_data_dir = 'data'
# Data's top-level directory (after download & decompression)
dataset_dir = 'dataset5'

In [0]:
# Make sure the Kaggle config directory exists before copying the API token
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list -s ASL


In [0]:
!kaggle datasets download mrgeislinger/asl-rgb-depth-fingerspelling-spelling-it-out  


## Uncompressing the data

In [0]:
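# Uncompress the bzip2-compressed tar archive.
# NOTE: `filename` is not defined anywhere in this notebook; set it to the
# path of the downloaded archive before running this cell.
os.system('tar xjf {}'.format(filename))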
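
As an alternative to shelling out to `tar`, the same extraction can be sketched with the standard-library `tarfile` module. The helper name `extract_archive` and its parameters are hypothetical, and the sketch assumes the archive is a trusted bzip2-compressed tar file.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Extract a bzip2-compressed tar archive into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    # Mode "r:bz2" corresponds to the `j` flag in `tar xjf`
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        # Only use extractall() on archives from a source you trust
        archive.extractall(path=dest_dir)


# Example call (the archive path is a placeholder to fill in):
# extract_archive(filename, dest_dir=".")
```

Unlike `os.system`, this raises an exception if the archive is missing or corrupt, so a failed extraction cannot go unnoticed.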

## Deleting Depth images

In [0]:
# Make a new data directory if it doesn't exist
if not os.path.exists(final_data_dir):
    os.makedirs(final_data_dir)

# Define the pattern for depth files (RGB files are kept)
# Formats: `depth_0_0528.png` (depth) & `color_12_0137.png` (RGB)
# Raw string with an escaped dot so that only a literal ".png" matches
pattern_depth_file = r'(depth\w*\.png)'
# Number of files deleted
n_del = 0

# Each subject has its own directory named with a letter ('A','B','C',...);
# enumerate() also gives each subject a numeric ID, since letters could be
# confused with the handshape letters
for (subject_id, subject_dir) in enumerate(os.listdir(dataset_dir)):
    # Directories for each letter (excluding "j" & "z")
    path_to_subject = os.path.join(dataset_dir, subject_dir)
    
    for letter_dir in os.listdir(path_to_subject):
        # Use letter as number ('a' starts @ 00)
        letter_id = ord(letter_dir.lower()) - ord(('a'))
        letter_id = '0{}'.format(letter_id) if letter_id < 10 else letter_id
        path_to_letter = os.path.join(path_to_subject, letter_dir)
        
        for image_file in os.listdir(path_to_letter):
            # Remove depth file
            if re.search(pattern_depth_file, image_file):
                path_depth_file = os.path.join(path_to_letter, image_file)
                os.remove(path_depth_file)
                # Inform depth file removed
                print('\r#{}: Depth file deleted {}'.format(n_del,path_depth_file), end='')
                n_del += 1
            sys.stdout.flush()
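
Before deleting anything, it can help to check the depth-file pattern against the two filename formats mentioned in the comments above. The sketch below is a standalone check; the helper `is_depth_file` is hypothetical and uses an anchored variant of the pattern.

```python
import re

# Anchored variant: the whole filename must look like depth<...>.png
DEPTH_FILE_PATTERN = re.compile(r"^depth\w*\.png$")


def is_depth_file(file_name):
    """Return True if file_name looks like a depth image."""
    return DEPTH_FILE_PATTERN.search(file_name) is not None


print(is_depth_file("depth_0_0528.png"))   # True
print(is_depth_file("color_12_0137.png"))  # False
```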

### After deleting all depth files, I pushed the dataset to GitHub
[Click here to visit my GitHub repo for the dataset](https://github.com/Fasih1994/IBM_Datascienc_Capstone_data.git)

### Note:
**All the code above was run on my local machine; the code below was run on Colab.**

## Cloning data from GitHub


In [0]:
"""
Cloning the data from github
"""

!git clone https://github.com/Fasih1994/IBM_Datascienc_Capstone_data.git

Cloning into 'IBM_Datascienc_Capstone'...
remote: Enumerating objects: 65904, done.
remote: Counting objects: 100% (65904/65904), done.
remote: Compressing objects: 100% (65904/65904), done.
remote: Total 65904 (delta 1), reused 65900 (delta 0), pack-reused 0
Receiving objects: 100% (65904/65904), 1.76 GiB | 50.38 MiB/s, done.
Resolving deltas: 100% (1/1), done.
Checking out files: 100% (65775/65775), done.


## Creating directories

In [0]:
# Data directory
final_data_dir = 'data'
# Data's top-level directory (the cloned repository)
dataset_dir = 'IBM_Datascienc_Capstone'

In [0]:
import os 
import shutil
import sys

## Organizing Data for later use

In [0]:
n_copy = 0
if not os.path.exists(final_data_dir):
    os.mkdir(final_data_dir)

subjects = 'A B C D E'.split()
for subject in subjects:
    path = os.path.join(dataset_dir,subject)
    
    for letter in os.listdir(path):
        current_path = os.path.join(path,letter)
        new_path = os.path.join(final_data_dir,letter)
        
        if not os.path.exists(new_path):
            os.mkdir(new_path)
        
        for file in os.listdir(current_path):
            file_path = os.path.join(current_path,file)
            new_name = '{}_{}.png'.format(letter,len(os.listdir(new_path)))
            new_file_path = os.path.join(new_path,new_name)
            shutil.copy(file_path,new_file_path)
            n_copy+=1
            print('\r#{}: {} copied from {}'.format(n_copy,new_file_path,file_path), end='')
            sys.stdout.flush()

#9640: data/a/a_186.png copied from IBM_Datascienc_Capstone/A/a/color_0_0199.png
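
The loop above calls `os.listdir(new_path)` once per copied file to pick the next index, which gets slower as each letter directory fills up. A minimal sketch of the same merge with one running counter per letter is shown below. The function name `merge_subject_dirs` is hypothetical, and the sketch assumes the destination directory starts out empty, since otherwise existing files with the same names would be overwritten.

```python
import os
import shutil
from collections import defaultdict


def merge_subject_dirs(dataset_dir, final_data_dir, subjects=("A", "B", "C", "D", "E")):
    """Copy <dataset_dir>/<subject>/<letter>/* to <final_data_dir>/<letter>/<letter>_<n>.png."""
    next_index = defaultdict(int)  # next free file index for each letter
    n_copied = 0
    for subject in subjects:
        subject_path = os.path.join(dataset_dir, subject)
        for letter in sorted(os.listdir(subject_path)):
            src_dir = os.path.join(subject_path, letter)
            dst_dir = os.path.join(final_data_dir, letter)
            os.makedirs(dst_dir, exist_ok=True)
            for file_name in sorted(os.listdir(src_dir)):
                new_name = "{}_{}.png".format(letter, next_index[letter])
                shutil.copy(os.path.join(src_dir, file_name),
                            os.path.join(dst_dir, new_name))
                next_index[letter] += 1
                n_copied += 1
    return n_copied
```

Calling `merge_subject_dirs(dataset_dir, final_data_dir)` would then return the number of files copied, which can be compared with the per-letter counts checked at the end of the notebook.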

## Freeing disk space by removing the cloned repository

In [0]:
!rm -r IBM_Datascienc_Capstone

## Checking the number of files for each letter

In [0]:
for letter in os.listdir(final_data_dir):
    path = os.path.join(final_data_dir, letter)
    print('{} : {}'.format(letter, len(os.listdir(path))))


n : 2694
w : 3108
i : 2631
y : 2666
a : 2676
g : 2679
o : 2656
l : 2754
h : 2696
q : 2672
f : 2615
t : 2624
b : 2728
k : 2939
r : 2940
p : 2803
x : 2731
d : 2680
c : 2916
s : 2784
e : 2681
v : 2738
u : 2653
m : 2710
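
As a sanity check, the per-letter counts printed above can be summed and compared with the "approximately 65,000 images" figure from the dataset description. The sketch below hard-codes the counts from that output; the variable names are hypothetical.

```python
# Per-letter file counts, copied from the output above
counts = {
    "n": 2694, "w": 3108, "i": 2631, "y": 2666, "a": 2676, "g": 2679,
    "o": 2656, "l": 2754, "h": 2696, "q": 2672, "f": 2615, "t": 2624,
    "b": 2728, "k": 2939, "r": 2940, "p": 2803, "x": 2731, "d": 2680,
    "c": 2916, "s": 2784, "e": 2681, "v": 2738, "u": 2653, "m": 2710,
}

total = sum(counts.values())
smallest = min(counts, key=counts.get)
largest = max(counts, key=counts.get)

print(len(counts))                  # 24 letters
print(total)                        # 65774 images in total
print(smallest, counts[smallest])   # f 2615
print(largest, counts[largest])     # w 3108
```

The total of 65,774 images is consistent with the dataset description, and the classes are reasonably balanced, ranging from 2,615 to 3,108 images per letter.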
