<p style="background-color:#ffffff;font-family:candaralight;color:#C15D06;font-size:215%;text-align:center;border-radius:10px 10px;"> Google ASL Fingerspelling</p>
<p style="background-color:#ffffff;font-family:candaralight;color:#B0B0B0;font-size:150%;text-align:center;border-radius:10px 10px;">✋ Recognition and Visualization ✋</p>

<div style="width:100%;text-align: center;"> <img align=middle src="https://media1.giphy.com/media/Co5TKVg51CmFsxPpNP/giphy.webp" alt="Heat beating" > </div>





Welcome to my notebook for the ASL Fingerspelling Recognition dataset competition! In this notebook, I present my solution for detecting and translating American Sign Language (ASL) fingerspelling into text. Using deep learning techniques, I have trained a model on the largest dataset of its kind, consisting of over three million fingerspelled characters captured from smartphone selfie cameras.

My aim is to contribute to the advancement of sign language recognition technology and make AI more accessible for the Deaf and Hard of Hearing community. By enabling faster and smoother communication between Deaf and Hard of Hearing individuals and hearing non-signers, this work has the potential to create a positive impact.

I hope that my notebook will showcase the effectiveness of my approach and provide insights into the development of robust sign language recognition AI. I am excited to present my solution and contribute to the empowerment of the Deaf and Hard of Hearing community through innovative machine learning techniques.

Wish me luck as I embark on this journey!
        
 **<span style="color:darkorange;"> If you liked this notebook, please don't forget to upvote, and good luck.</span>**

   <center><div class="alert alert-block alert-warning" style="margin: 2em; line-height: 1.7em; font-family: candaralight;">
    <b style="font-size: 18px;">👏 &nbsp; IF YOU FORK THIS OR FIND THIS HELPFUL &nbsp; 👏</b><br><br><b style="font-size: 22px; color: darkorange">PLEASE UPVOTE!</b><br><br>This was a lot of work for me and while it may seem silly, it makes me feel appreciated when others like my work. 😅
</div></center>
    

<p id="toc"></p>

<br><br>

<h1 style="font-family: candaralight; font-size: 28px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; color: #C15D06; background-color: #ffffff;">TABLE OF CONTENTS</h1>




    
* [1. DATA OVERVIEW](#1)
    
    - [Import Libraries](#1.1)
    
    - [Loading Dataset](#1.2)
    
    - [Data Description](#1.3)
        
* [2. Data Preprocessing](#2)   

    - [Data Exploration](#2.1)      
        
    - [Data Cleaning](#2.2)

* [3. Exploratory Data Analysis](#3)
    
    - [Inspect the PATH Column](#3.1)

    - [Inspect the PARTICIPANT_ID Column](#3.2)

    - [Inspect the SEQUENCE_ID Column](#3.3)

    - [Inspect the PHRASE Column](#3.4)

    - Inspect a Landmark PARQUET File

* [4. Evaluation Metric](#4)

* Visualization

* Training


<p id="1"></p>

<h1 style="font-family: candaralight; font-size: 28px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; color: #C15D06; background-color: #ffffff;">1. DATA OVERVIEW</h1>



In this notebook, we will be exploring and analyzing the ASL Fingerspelling dataset. We will start by importing the required libraries, loading the data, and giving a brief description of the dataset.

<a id="1.1"></a>
<br>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">1.1 <b>Import</b> Libraries</h3>

---



In [52]:
# import data processing and visualisation libraries
import os
import json
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.io as pio
%matplotlib inline

pio.templates.default = "simple_white"

# import image processing libraries
import cv2
import skimage
from skimage.transform import resize

# import tensorflow and keras
import tensorflow as tf
from tensorflow import keras

print("Packages imported...")



Let's load the training and supplemental metadata dataframes.
<a id="1.2"></a>


<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">1.2 <b>Loading</b> Dataset</h3>

---


In [None]:
df = pd.read_csv("/dataset/input/asl-fingerspelling/train.csv")
metadata = pd.read_csv("/dataset/input/asl-fingerspelling/supplemental_metadata.csv")



<a id="1.3"></a>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">1.3 <b>Data</b> Description</h3>

---

In [None]:
df.head()



The phrases in the training set contain random websites, addresses, and phone numbers.


In [None]:
print(f"Total number of files : {df.shape[0]}")
print(f"Total number of participants in the dataset : {df.participant_id.nunique()}")
print(f"Total number of unique phrases : {df.phrase.nunique()}")

In [None]:
metadata.head()

The metadata phrases are mostly normal sentences!




- `path` - The path to the landmark file.
- `file_id` - A unique identifier for the data file.
- `participant_id` - A unique identifier for the data contributor.
- `sequence_id` - A unique identifier for the landmark sequence. Each data file may contain many sequences.
- `phrase` - The labels for the landmark sequence. The train and test datasets contain randomly generated addresses, phone numbers, and urls derived from components of real addresses/phone numbers/urls. Any overlap with real addresses, phone numbers, or urls is purely accidental. The supplemental dataset consists of fingerspelled sentences. Note that some of the urls include adult content. The intent of this competition is to support the Deaf and Hard of Hearing community in engaging with technology on an equal footing with other adults.
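
To make the relationship between a landmark file and its sequences concrete, here is a minimal sketch on a made-up landmark table. It assumes that each row is one video frame and that the index is named `sequence_id`; the values and the helper name `frames_for_sequence` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Made-up landmark table: two sequences stored in the same file,
# one row per video frame, indexed by sequence_id (an assumption).
landmarks = pd.DataFrame(
    {
        "frame": [0, 1, 2, 0, 1],
        "x_right_hand_0": [0.51, 0.52, None, 0.40, 0.41],
        "y_right_hand_0": [0.60, 0.61, None, 0.70, 0.69],
    },
    index=pd.Index([101, 101, 101, 202, 202], name="sequence_id"),
)


def frames_for_sequence(landmark_df: pd.DataFrame, sequence_id: int) -> pd.DataFrame:
    """Return all frames of one sequence (hypothetical helper)."""
    # Passing a list to .loc always yields a DataFrame, even for a single frame.
    return landmark_df.loc[[sequence_id]]


# Number of frames recorded for each sequence in the file.
frames_per_sequence = landmarks.groupby("sequence_id").size()
print(frames_per_sequence.to_dict())
print(frames_for_sequence(landmarks, 202))
```

Each row of `train.csv` then points to one such slice: `path` selects the file and `sequence_id` selects the frames inside it.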



In [None]:
print(f"Total number of files in metadata: {metadata.shape[0]}")
print(f"Total number of participants in metadata : {metadata.participant_id.nunique()}")
print(f"Total number of unique phrases in metadata : {metadata.phrase.nunique()}")

Let's dig deeper into the dataframe info.

In [None]:
# Check the dimensions of the dataset
print(df.shape)

# Check the data types of columns
df.info()

In [None]:
df.describe()

<p id="2"></p>

<h1 style="font-family: candaralight; font-size: 28px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; color: #C15D06; background-color: #ffffff;">2. DATA PREPROCESSING</h1>


<a id="2.1"></a>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">2.1 Data <b>Exploration</b> </h3>


In [None]:
print("Dataset Summary Statistics:")
df.describe()


In [None]:
print("Metadata Summary Statistics:")
metadata.describe()


<a id="2.2"></a>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">2.2 Data <b>Cleaning</b></h3>


In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing Values:")
print(missing_values)

In [None]:
df[df.isnull().any(axis=1)]

<p id="3"></p>

<h1 style="font-family: candaralight; font-size: 28px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; color: #C15D06; background-color: #ffffff;">3. EXPLORATORY DATA ANALYSIS</h1>


<a id="3.1"></a>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">3.1 Inspect the <b>'PATH'</b> Column</h3>

---



In [None]:
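# Number of rows that share each landmark file
path_counts = df["path"].value_counts()
print(f"Unique paths: {path_counts.size}")
print(f"Minimum rows per path: {path_counts.min()}")
print(f"Maximum rows per path: {path_counts.max()}")
df["path"].describe().to_frame().T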

The path column is simply the path to the landmark file (parquet). Many rows share the same file.
<ul>
    <li><b>Number of unique paths</b>: 68</li>
    <li><b>Minimum number of rows sharing one path</b>: 287</li>
    <li><b>Maximum number of rows sharing one path</b>: 1000</li>
</ul>


<a id="3.2"></a>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">3.2 Inspect the <b>`PARTICIPANT_ID`</b> Column</h3>

---


In [None]:
import plotly.express as px

print("\n... BASICS OF THE PARTICIPANT ID COLUMN:\n")
df["participant_id"].astype(str).describe().to_frame().T

The participant_id statistics indicate a varied distribution of data contributions among participants, with some participants contributing more examples than others.

<ul>
    <li><b>Number of Unique Participants</b>: 94</li>
    <li><b>Average Number of Rows Per Participant</b>: 715.82</li>
    <li><b>Standard Deviation in Counts Per Participant</b>: 230.86</li>
    <li><b>Minimum Number of Examples For One Participant</b>: 1</li>
    <li><b>Maximum Number of Examples For One Participant</b>: 1537</li>
</ul>
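
Summary figures of this kind can be derived from the per-participant row counts. The sketch below uses a made-up series of participant IDs in place of `df["participant_id"]`; the variable names are illustrative, and the real numbers come from the full training dataframe.

```python
import pandas as pd

# Made-up participant IDs standing in for df["participant_id"].
participant_id = pd.Series([7, 7, 7, 12, 12, 30], name="participant_id")

# Rows contributed by each participant.
counts = participant_id.value_counts()

summary = {
    "unique participants": counts.size,
    "mean rows": counts.mean(),
    "std of rows": counts.std(),  # sample standard deviation (ddof=1)
    "min rows": counts.min(),
    "max rows": counts.max(),
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```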

In [None]:

#The column is set to strings as it is an ID
df["participant_id"] = df["participant_id"].astype(str)

# Calculate the counts for each participant_id
counts = df["participant_id"].value_counts()

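# Custom color scheme used for the plots
color_scheme = ["#4f000b", "#720026", "#ce4257", "#ff7f51", "#ff9b54"]

# Set up the figure and axes
fig, ax = plt.subplots(figsize=(15, 6))

# Plot the bar chart
bars = ax.bar(counts.index, counts.values, color=color_scheme[1])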

# Set the labels and title
ax.set_xlabel("Participant ID")
ax.set_ylabel("Total Row Count")
ax.set_title("Row Counts by Participant ID")

# Rotate the x-axis labels if needed
plt.xticks(rotation=90, ha='center')
plt.xlim(-1, 94)

# Show the plot
plt.show()


<a id="3.3"></a>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">3.3 Inspect the <b>`SEQUENCE_ID`</b> Column</h3>

---





In [None]:
df["sequence_id"].astype(str).describe().to_frame().T

The sequence_id column is a unique identifier for the landmark sequence. Each data file may contain many sequences, and every row has its own sequence_id value.
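
The uniqueness of the identifier can be checked directly rather than read off the `describe()` table. This is a small sketch on a made-up frame that reuses the `path` and `sequence_id` column names; the file names and IDs are invented for illustration.

```python
import pandas as pd

# Made-up frame with the same column names as train.csv.
train = pd.DataFrame(
    {
        "path": ["train_landmarks/1.parquet"] * 2 + ["train_landmarks/2.parquet"],
        "sequence_id": [101, 202, 303],
    }
)

# True when no sequence_id occurs twice.
ids_are_unique = train["sequence_id"].is_unique

# Number of distinct sequences stored in each landmark file.
sequences_per_file = train.groupby("path")["sequence_id"].nunique()

print(ids_are_unique)
print(sequences_per_file.to_dict())
```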

<a id="3.4"></a>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">3.4 Inspect the <b>`PHRASE`</b> Column</h3>

---

How long are the phrases?


In [None]:
import seaborn as sns

df['phrase_len'] = df.phrase.str.len()
metadata['phrase_len'] = metadata.phrase.str.len()

# Black text and tick labels
for param in ['text.color', 'axes.labelcolor', 'xtick.color', 'ytick.color']:
    plt.rcParams[param] = '#000000'

# White figure and axes backgrounds
for param in ['figure.facecolor', 'axes.facecolor', 'savefig.facecolor']:
    plt.rcParams[param] = '#ffffff'

# Set the custom color scheme
color_scheme = ["#4f000b", "#720026", "#ce4257", "#ff7f51", "#ff9b54"]

# One panel per dataframe
fig, axs = plt.subplots(1, 2, figsize=(14, 6), tight_layout=True)
panels = [(df, 'Phrase length in the training set'),
          (metadata, 'Phrase length in the supplemental metadata')]

for ax, (frame, title) in zip(axs, panels):
    sns.histplot(frame.phrase_len, kde=True, binwidth=2, color=color_scheme[0], ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Phrase length (characters)')
    ax.set_ylabel('Sample Count')

    # Remove axes spines
    for s in ['top', 'bottom', 'left', 'right']:
        ax.spines[s].set_visible(False)

    # Remove x, y tick marks
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')

    # Add padding between axes and labels
    ax.xaxis.set_tick_params(pad=5)
    ax.yaxis.set_tick_params(pad=10)

    # Add x, y gridlines
    ax.grid(True, color='grey', linestyle='-.', linewidth=0.5, alpha=0.6)

plt.show()

<a id="3.5"></a>

<h3 style="font-family: candaralight; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #C15D06; background-color: #ffffff;">3.5 Inspect the <b>`PARQUET`</b> File</h3>

---





In [None]:
sample = pd.read_parquet("/dataset/input/asl-fingerspelling/train_landmarks/1019715464.parquet")


In [None]:
print(f"Sample shape = {sample.shape}")
sample.sample(10)

In [None]:
sample.describe()



We see a few negative coordinates above. Verbatim from the mediapipe page:

    MULTI_HAND_LANDMARKS Collection of detected/tracked hands, where each hand is represented as a list of 21 hand landmarks and each landmark is composed of x, y and z. x and y are normalized to [0.0, 1.0] by the image width and height respectively. z represents the landmark depth with the depth at the wrist being the origin, and the smaller the value the closer the landmark is to the camera. The magnitude of z uses roughly the same scale as x.

So negative values for x and y are not expected in principle; they presumably correspond to landmarks that MediaPipe places outside the image frame. It is also important to note that for a good chunk of each video one or both hands are not visible and therefore have no landmark data. Check the nulls in the data below:


In [None]:
def get_cols(df, words_pos, words_neg=(), ret_names=True):
    cols = []
    names = []
    for col in df.columns:
        # Keep the column if its name contains every word in words_pos
        # and none of the words in words_neg
        if all(w in col for w in words_pos) and not any(w in col for w in words_neg):
            cols.append(df[col])  # Append the entire column to the list
            names.append(col)

    # Return the list of columns together with their names
    if ret_names:
        return cols, names
    # Or only the list of columns
    else:
        return cols


In [None]:
# Landmark Indices for Left/Right hand without z axis in raw data
LEFT_HAND_IDXS0, LEFT_HAND_NAMES0 = get_cols(sample, ['left_hand'], ['z'])
RIGHT_HAND_IDXS0, RIGHT_HAND_NAMES0 = get_cols(sample, ['right_hand'], ['z'])
#RIGHT_HAND_NAMES0.insert(0, "frame")
LEFT_HAND_NAMES0.insert(0, "frame")
COLUMNS = np.concatenate((LEFT_HAND_NAMES0, RIGHT_HAND_NAMES0))

N_COLS0 = len(COLUMNS)
# Only X/Y axes are used
N_DIMS0 = 2

print(f'N_COLS0: {N_COLS0}')

In [None]:

RIGHT_HAND = sample.loc[:, RIGHT_HAND_NAMES0] 
LEFT_HAND = sample.loc[:, LEFT_HAND_NAMES0] 
RIGHT_HAND


In [None]:
print(f"Percentage of nulls in Left Hand data = {100*np.mean(LEFT_HAND['x_left_hand_0'].isnull()):.02f} %")
print(f"Percentage of nulls in Right Hand data = {100*np.mean(RIGHT_HAND['x_right_hand_0'].isnull()):.02f} %")
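
The out-of-range coordinates mentioned earlier can be quantified in the same spirit as the null check. The sketch below is an assumption-laden illustration: `out_of_range_fraction` is a hypothetical helper, and the small dataframe is made up rather than read from a landmark file.

```python
import numpy as np
import pandas as pd


def out_of_range_fraction(frame: pd.DataFrame) -> pd.Series:
    """Fraction of non-null values outside [0, 1] per column (hypothetical helper)."""
    # NaN compares False on both sides, so missing frames are never counted as outside.
    outside = (frame < 0.0) | (frame > 1.0)
    return outside.sum() / frame.notna().sum()


# Made-up coordinates; NaN marks frames where the hand was not detected.
coords = pd.DataFrame(
    {
        "x_right_hand_0": [0.25, -0.05, np.nan, 0.75],
        "y_right_hand_0": [1.10, 0.50, np.nan, np.nan],
    }
)

fractions = out_of_range_fraction(coords)
print(fractions)
```

On the real data, the same helper could be applied to `sample[RIGHT_HAND_NAMES0]` to see how common such values are.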

<p id="4"></p>

<h1 style="font-family: candaralight; font-size: 28px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; color: #C15D06; background-color: #ffffff;">4. EVALUATION METRIC</h1>

The **Levenshtein distance**, also known as the edit distance, quantifies the dissimilarity between two strings by measuring the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. This metric provides a valuable measure of how well the predicted ASL sequence matches the ground truth or reference sequence.

To calculate the Levenshtein distance for ASL recognition, the predicted ASL sequence and the reference or ground truth sequence are compared character by character. Each character is treated as a token, representing a specific sign or gesture. The Levenshtein distance is then computed by determining the minimum number of edit operations needed to transform the predicted sequence into the reference sequence or vice versa.

The evaluation metric for this contest is the normalized total Levenshtein distance. The formula for calculating the metric is as follows:

Metric = (N - D) / N

Where:

    N is the total number of characters in the labels.
    D is the total Levenshtein distance.

Computing the metric requires two things: the labels and the predicted phrases. The labels are in the "phrase" column of train.csv and supplemental_metadata.csv. The predictions come from a model that takes the landmark sequences in the train_landmarks/ and supplemental_landmarks/ directories as input.

D is the sum of the Levenshtein distances between each predicted phrase and its label, and N is the sum of the label lengths. Because insertions and deletions shift character positions, the distance cannot be obtained by comparing the two strings position by position; it requires a dynamic-programming procedure such as the one implemented in the next cell, or an existing library. Higher values of the metric indicate better predictions, and a perfect prediction scores 1.

The Levenshtein distance (LD) is explained in depth in this discussion.

In [None]:
import numpy as np

# Dynamic-programming implementation of the Levenshtein distance

def printDistances(distances, token1Length, token2Length):
    for t1 in range(token1Length + 1):
        for t2 in range(token2Length + 1):
            print(int(distances[t1][t2]), end=" ")
        print()


def levenshteinDistanceDP(token1, token2):
    # distances[i][j] holds the distance between the first i characters
    # of token1 and the first j characters of token2
    distances = np.zeros((len(token1) + 1, len(token2) + 1), dtype=int)

    # First column: turning a prefix of token1 into the empty string takes t1 deletions
    for t1 in range(len(token1) + 1):
        distances[t1][0] = t1

    # First row: turning the empty string into a prefix of token2 takes t2 insertions
    for t2 in range(len(token2) + 1):
        distances[0][t2] = t2

    # Fill in the distances for all combinations of prefixes of the two strings
    for t1 in range(1, len(token1) + 1):
        for t2 in range(1, len(token2) + 1):
            if token1[t1 - 1] == token2[t2 - 1]:
                # Matching characters add no cost
                distances[t1][t2] = distances[t1 - 1][t2 - 1]
            else:
                # Otherwise take the cheapest of the three neighbouring cells and add 1
                distances[t1][t2] = 1 + min(
                    distances[t1][t2 - 1],      # insertion
                    distances[t1 - 1][t2],      # deletion
                    distances[t1 - 1][t2 - 1],  # substitution
                )

    # Print the full distance matrix
    printDistances(distances, len(token1), len(token2))

    # Return the distance between the two complete strings
    return int(distances[len(token1)][len(token2)])


phrase1 = '3 creekhouse'
phrase2 = 'scales/kuhaylah'

# The function prints the matrix and returns an integer distance
print("Printing The Distance Matrix:")
print(f" \nThe Levenshtein distance of phrases = {levenshteinDistanceDP(phrase1, phrase2)} ")
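
With a distance function in hand, the competition score (N - D) / N follows in a few lines. The sketch below is self-contained and uses a compact two-row variant of the dynamic program; the function names and the example predictions are made up for illustration and are not the official scoring code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using two rolling rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def normalized_levenshtein_score(predictions, labels) -> float:
    """(N - D) / N, with N the total label length and D the total distance."""
    total_chars = sum(len(label) for label in labels)
    total_distance = sum(levenshtein(p, t) for p, t in zip(predictions, labels))
    return (total_chars - total_distance) / total_chars


# Made-up predictions and labels, for illustration only.
labels = ["3 creekhouse", "scales/kuhaylah"]
predictions = ["3 creekhouse", "scale/kuhaylah"]
score = normalized_levenshtein_score(predictions, labels)
print(score)
```

Here the second prediction is missing one character, so D = 1 over N = 27 label characters and the score is 26/27, roughly 0.963.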



## Introduction
This notebook is a continuation of the Google ASL Fingerspelling Recognition project. In the previous notebook, we processed the data and extracted features for the task. In this notebook, we will train and evaluate a model using the extracted features. Additionally, we will use the Levenshtein distance as a metric to measure the similarity between predicted labels and ground truth labels.

## Table of Contents
1. Importing Libraries and Levenshtein Distance Implementation
2. Calculating Levenshtein Distance
3. Data Preparation and Splitting
4. Model Definition
5. Optimization Setup
6. Training Loop
7. Evaluation
8. Conclusion

Let's proceed with the code implementation and analysis.








<p id="5"></p>

<h1 style="font-family: candaralight; font-size: 28px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; color: #C15D06; background-color: #ffffff;">5. Visualization</h1>

In [None]:
### import libraries
import pandas as pd,numpy as np,os
import json
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
from pathlib import Path

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import normalize

print("importing")

In [None]:
LANDMARK_FILES_DIR = "/dataset/input/asl-fingerspelling/train_landmarks"
TRAIN_FILE = "/dataset/input/asl-fingerspelling/train.csv"
with open("/dataset/input/asl-fingerspelling/character_to_prediction_index.json", "r") as f:
    label_map = json.load(f)

In [2]:
import os
import gc

import json
from tqdm import tqdm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn

from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings(action='ignore')


In [None]:
# Memory saving function credit to https://www.dataset.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    #start_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    #end_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
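
One caveat of `reduce_mem_usage` is that float columns are cast to `float16` whenever their range fits, which trades precision for memory. The sketch below illustrates the size of that rounding error on a single made-up coordinate value; it is an illustration of the trade-off, not a measurement on the competition data.

```python
import numpy as np

# A made-up normalized coordinate, stored at two reduced precisions.
value = 0.123456
as_float32 = np.float32(value)
as_float16 = np.float16(value)

# Rounding error introduced by each cast.
error32 = abs(float(as_float32) - value)
error16 = abs(float(as_float16) - value)

print(f"float32 error: {error32:.2e}")
print(f"float16 error: {error16:.2e}")
print(f"float16 largest finite value: {np.finfo(np.float16).max}")
```

For coordinates normalized to roughly [0, 1] the `float16` error stays small, but it is several orders of magnitude larger than with `float32`, which may matter for features built from small differences between frames.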

In [None]:
import multiprocessing as mp
import torch.nn.functional as F

N_LANDMARKS = 21        # MediaPipe tracks 21 landmarks per hand
FEATURE_LEN = 3258      # fixed length of the flattened feature vector
MAX_LABEL_LENGTH = 30   # fixed length of the padded label vector

# Function to build the features and labels for one row of train.csv
def process_parquet(row):
    row = row[1]  # df.iterrows() yields (index, Series) pairs
    path = os.path.join("/dataset/input/asl-fingerspelling", row.path)
    landmark_df = pd.read_parquet(path, columns=COLUMNS)

    # Keep only the frames that belong to this row's sequence
    # (the landmark frames are indexed by sequence_id)
    group = landmark_df.loc[[row.sequence_id]]

    # Map each character of the phrase to its integer label
    label = np.array([label_map[ch] for ch in row.phrase], dtype=np.int8)

    sequence_features = []

    # Iterate over each right-hand landmark
    for landmark_index in range(N_LANDMARKS):
        # x and y coordinates of the landmark, one value per frame, shape (1, n_frames)
        x = torch.tensor(group[f'x_right_hand_{landmark_index}'].values.astype(np.float32)).view(1, -1)
        y = torch.tensor(group[f'y_right_hand_{landmark_index}'].values.astype(np.float32)).view(1, -1)

        # Drop the frames in which the landmark was not detected
        x = x[:, ~torch.isnan(x[0])]
        y = y[:, ~torch.isnan(y[0])]

        # The mean over dim 0 keeps the per-frame values,
        # the std over dim 1 gives a single value per landmark
        x_mean = torch.mean(x, 0)
        y_mean = torch.mean(y, 0)
        x_std = torch.std(x, 1)
        y_std = torch.std(y, 1)

        # Add the calculated features to the sequence feature vector
        sequence_features.extend([x_mean, y_mean, x_std, y_std])

    # Flatten and replace NaNs (for example the std of fewer than two frames) with 0
    features = torch.nan_to_num(torch.cat(sequence_features), nan=0.0)

    # Zero-pad or truncate to the fixed feature length
    features = F.pad(features, (0, max(0, FEATURE_LEN - features.shape[0])))[:FEATURE_LEN]

    return features.numpy(), label


df = pd.read_csv(TRAIN_FILE)
df = reduce_mem_usage(df)

all_features = np.zeros((df.shape[0], FEATURE_LEN))
labels = np.zeros((df.shape[0], MAX_LABEL_LENGTH))

# Process the rows in parallel
with mp.Pool() as pool:
    results = pool.imap(process_parquet, df.iterrows(), chunksize=250)
    for i, (x, y) in tqdm(enumerate(results), total=df.shape[0]):
        all_features[i, :] = x
        # Labels longer than MAX_LABEL_LENGTH are truncated, shorter ones stay zero-padded
        y = y[:MAX_LABEL_LENGTH]
        labels[i, :len(y)] = y

# Save the feature and label arrays for the training notebook
np.save("feature_data.npy", all_features)
np.save("feature_labels.npy", labels)

# Print the shapes of the arrays
print("Features array shape:", all_features.shape)
print("Labels array shape:", labels.shape)


## Training
I trained a fully connected neural network in PyTorch. The network regresses the zero-padded vector of label indices directly from the feature vector, using the Adam optimizer and mean squared error (`MSELoss`) as the loss function.

In [3]:
datay = np.load("/dataset/input/aslfr-data-processing/feature_labels.npy")
datax = np.load("/dataset/input/aslfr-data-processing/feature_data.npy")

In [None]:
datax.shape[1]

In [None]:
datay.shape

In [None]:
class ASLModel(nn.Module):
    def __init__(self, p):
        super(ASLModel, self).__init__()
        self.dropout = nn.Dropout(p)
        self.layer0 = nn.Linear(3258, 1024)
        self.layer1 = nn.Linear(1024, 512)
        self.layer2 = nn.Linear(512, 30)

        
    def forward(self, x):
        x = self.layer0(x)
        x = self.dropout(x)
        x = self.layer1(x)
        x = self.layer2(x)
        return x

In [4]:
import Levenshtein

def calculate_normalized_levenshtein_distance(pred_strings, target_strings):
    total_distance = 0
    total_length = 0

    for pred, target in zip(pred_strings, target_strings):
        distance = Levenshtein.distance(pred, target)
        total_distance += distance
        total_length += len(target)

    normalized_distance = total_distance / total_length
    return normalized_distance


In [5]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import Levenshtein

def calculate_levenshtein_distance(pred_labels, target_labels):
    distance = 0
    for pred, target in zip(pred_labels, target_labels):
        distance += Levenshtein.distance(pred, target)
    return distance

class ASLDataset(Dataset):
    def __init__(self, datax, datay):
        self.datax = datax
        self.datay = datay
        
    def __getitem__(self, index):
        return self.datax[index,:], self.datay[index]
        
    def __len__(self):
        return len(self.datay)
    
# Data Split
trainx, testx, trainy, testy = train_test_split(datax, datay, test_size=0.15, random_state=42)

# Convert data to PyTorch tensors
trainx = torch.from_numpy(trainx).float()
trainy = torch.from_numpy(trainy).float()
testx = torch.from_numpy(testx).float()
testy = torch.from_numpy(testy).float()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Data Preparation
train_data = ASLDataset(trainx, trainy)
test_data = ASLDataset(testx, testy)

# DataLoader
BATCH_SIZE = 128
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)

# Model Definition
class ASLModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(ASLModel, self).__init__()
        self.fc1 = nn.Linear(input_size, 2048)
        self.fc2 = nn.Linear(2048, 1024)
        self.fc3 = nn.Linear(1024, output_size)
        #self.dropout = nn.Dropout(0.2)  

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        #x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        #x = self.dropout(x)
        x = self.fc3(x)
        return x

# Model Initialization
model = ASLModel(input_size=trainx.shape[1], output_size=trainy.shape[1]).to(device)

# Optimization Setup
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
#optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  
#optimizer = torch.optim.AdamW(model.parameters(), lr=0.005)

# Training Loop
EPOCHS = 50
for epoch in range(EPOCHS):
    model.train()
    train_loss = 0.0
    train_correct = 0

    for inputs, targets in train_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)
        #print('targets = ',targets)

        optimizer.zero_grad()
        outputs = model(inputs)
        #print('models output = ',outputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_correct += (outputs.round() == targets).sum().item()

    train_loss /= len(train_loader)
    # Fraction of label positions predicted exactly after rounding
    train_accuracy = train_correct / trainy.numel()

    # Evaluation
    model.eval()
    test_loss = 0.0
    test_correct = 0
    levenshtein_distance = 0

    with torch.no_grad():
        for inputs, targets in test_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, targets)
            #print(loss)

            test_loss += loss.item()
            test_correct += (outputs.round() == targets).sum().item()

            # Define the reverse mapping dictionary
            reverse_label_map = {v: k for k, v in label_map.items()}

            # Round the outputs to the nearest valid label index
            max_index = max(reverse_label_map)
            pred_indices = outputs.detach().cpu().numpy().round().clip(0, max_index).astype(int)
            target_indices = targets.detach().cpu().numpy().round().astype(int)

            # Convert predictions and targets to strings; index 0 is also the padding value and is skipped
            pred_labels = ["".join(reverse_label_map[i] for i in row if i != 0) for row in pred_indices]
            target_labels = ["".join(reverse_label_map[i] for i in row if i != 0) for row in target_indices]

            # Calculate Levenshtein distance
            levenshtein_distance += calculate_levenshtein_distance(pred_labels, target_labels)

    test_loss /= len(test_loader)
    # Fraction of label positions predicted exactly after rounding
    test_accuracy = test_correct / testy.numel()
    average_levenshtein_distance = levenshtein_distance / len(test_data)

    # Print epoch results
    print(f"Epoch {epoch+1}/{EPOCHS}")
    print(f"Train Loss: {train_loss:.4f} | Train Accuracy: {train_accuracy:.4f}")
    print(f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f}")
    print(f"Average Levenshtein Distance: {average_levenshtein_distance:.4f}")
    print("=" * 50)


