# ASL Neural Network Pipeline Notebook

This notebook contains all the steps necessary to train a neural network for the ASL Neural Network App project located at [this repository](https://github.com/TWilliamsA7/asl-neural-app/tree/main). Utility functions can also be found in the above repository under the src directory.

1. Setup: Configuration & Authentication
2. Environment: Initialization & Imports
3. Data: Acquisition & Preprocessing
4. Data: Loading & Splitting
5. Model: Architecture
6. Model: Training
7. Model: Evaluation

## Setup: Configuration & Authentication

This section sets up the authentication and configuration that the Colab environment needs.

In [12]:
# Import necessary modules for setup

from google.colab import userdata, auth, files
import os
import sys

### Create GitHub connection via Colab variables

In [None]:
# Define repository details
USERNAME = "TWilliamsA7"
REPO_NAME = "asl-neural-app.git"
BRANCH_NAME = "main"

# Get PAT (Personal Access Token) stored in Colab Secrets
PAT = userdata.get("GITHUB_PAT")
if not PAT:
    raise ValueError("GITHUB_PAT secret not found!")

# Construct authenticated URL for accessing the repository
AUTHENTICATED_URL = f"https://{PAT}@github.com/{USERNAME}/{REPO_NAME}"
REPO_FOLDER = REPO_NAME.replace(".git", "")

# Set global Git configuration
!git config --global user.email "twilliamsa776@gmail.com"
!git config --global user.name "{USERNAME}"

print("Setup github connection and authenticated url successfully!")
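
Because `AUTHENTICATED_URL` embeds the personal access token, it is worth masking the token before the URL is ever printed or logged. The helper below is a minimal sketch of that idea; `redact_credentials` is a hypothetical name (it is not part of the repository's `src` utilities), and the token in the example is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_credentials(url: str) -> str:
    """Return the URL with any userinfo (token or user:password) replaced by '***'."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        # No credentials embedded, nothing to hide
        return url
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit((parts.scheme, f"***@{host}", parts.path, parts.query, parts.fragment))


# Hypothetical token, for illustration only
example_url = "https://ghp_exampletoken@github.com/TWilliamsA7/asl-neural-app.git"
print(redact_credentials(example_url))
```

In this notebook the call would be `print(redact_credentials(AUTHENTICATED_URL))`. Note that `git clone` may still echo the remote URL in its own error messages, so this only protects output produced by the notebook's Python code.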

### Google Cloud Authentication

In [None]:
print("--- GCS Authentication ---")

auth.authenticate_user()

print("Google Cloud authentication complete.")

## Environment: Initialization & Imports

### Clone GitHub Repository

In [None]:
# Clean up any existing clone
if os.path.isdir(REPO_FOLDER):
    print(f"Removing old {REPO_FOLDER} folder...")
    !rm -rf {REPO_FOLDER}

# Clone the repository using the authenticated URL
print(f"Cloning repository: {REPO_NAME}...")
!git clone {AUTHENTICATED_URL}

# Change directory into the cloned repository
%cd {REPO_FOLDER}
print(f"Current working directory: {os.getcwd()}")

### Install Dependencies

In [None]:
print("Upgrading pip, setuptools, and wheel...")
!pip install --upgrade pip setuptools wheel -q

print("Using preinstalled numpy and tensorflow dependencies")

print("Installing remaining project dependencies from requirements.txt...")
!pip install -r requirements.txt -q

print("Dependencies installed successfully.")

### Set Up .kaggle Directory

- You must upload your kaggle.json API credentials file when prompted.



In [None]:
# Check if the credentials file already exists in the expected location
if not os.path.exists(os.path.expanduser('~/.kaggle/kaggle.json')):
    print("Uploading kaggle.json file...")
    # This will open a dialog for you to select and upload your file
    uploaded = files.upload()

    # Check if the upload was successful
    if not uploaded:
        print("ERROR: kaggle.json was not uploaded.")
    else:
        # The uploaded file is saved in the current working directory
        # (the cloned repository, after the earlier %cd). Move and secure it.

        # 1. Create the required directory
        !mkdir -p ~/.kaggle/

        # 2. Move the uploaded file into the correct directory
        # The uploaded file must be named exactly 'kaggle.json'
        !mv kaggle.json ~/.kaggle/kaggle.json

        # 3. Set the correct permissions (CRITICAL)
        # Permissions should be 600 so that only the owner can read the key.
        !chmod 600 ~/.kaggle/kaggle.json

        print("Kaggle authentication file set up successfully!")
else:
    print("Kaggle credentials already found at ~/.kaggle/kaggle.json.")

# --- Verification Step ---
# Run a simple Kaggle command to test authentication.
# Note: a failing `!` shell command does not raise a Python exception, so the
# except branch only catches Python-level errors. Check the command output itself.
try:
    print("\nAttempting to list datasets (Verification)...")
    # This command uses the username/key from the now-configured kaggle.json
    !kaggle datasets list -s asl_alphabet | head -n 3
    print("\nVerification command finished. If datasets are listed above, the Kaggle API is authenticated.")
except Exception as e:
    print(f"\nERROR: Verification failed. Please check the content of your kaggle.json file. Details: {e}")
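
A shell command started with `!` reports failure through its exit status rather than a Python exception, so a `try`/`except` around it cannot detect a bad credential file. One way to make the check explicit is to run the command through `subprocess` and inspect the return code. This is a sketch under that assumption; `command_succeeds` is a hypothetical helper, and the two commands below are stand-ins that only invoke the Python interpreter.

```python
import subprocess
import sys


def command_succeeds(args: list[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Command failed with exit code {result.returncode}: {result.stderr.strip()}")
    return result.returncode == 0


# Stand-in commands: one that succeeds and one that exits with status 3
ok = command_succeeds([sys.executable, "-c", "print('hello')"])
bad = command_succeeds([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok, bad)
```

For the verification step in this notebook, the argument list would be `["kaggle", "datasets", "list", "-s", "asl_alphabet"]`, with the `head -n 3` truncation done in Python on `result.stdout` instead of through a pipe.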

### Connect Src directory for access to utility functions

In [None]:
sys.path.append('src')
print("Setup Complete. Colab environment is ready.")
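
`sys.path.append('src')` stores a relative path, so imports from `src` only resolve while the working directory is the repository root, and re-running the cell adds duplicate entries. The sketch below registers the absolute path once instead; `add_to_sys_path` is a hypothetical helper, not something the repository provides.

```python
import os
import sys


def add_to_sys_path(directory: str) -> str:
    """Append the absolute form of `directory` to sys.path once and return it."""
    absolute = os.path.abspath(directory)
    if absolute not in sys.path:
        sys.path.append(absolute)
    return absolute


src_path = add_to_sys_path("src")
add_to_sys_path("src")  # Second call is a no-op because the path is already registered
print(src_path)
```

This matters here because later cells change directory with `%cd /` and `%cd /content`, after which a relative `'src'` entry no longer points at the cloned repository.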

## Data: Acquisition & Preprocessing

### Include necessary imports

In [None]:
import numpy as np
import cv2
import gc
import shutil

# Re-imported in case earlier cells were not run
import os
import sys

# Ensure src accessibility
sys.path.append('src')

# Import utility functions
from data_utils import extract_keypoints

### Setup directories and constants

In [None]:
KAGGLE_DATASET_ID = "grassknoted/asl-alphabet"
DESTINATION_PATH = "sample_data"
PROCESSED_OUTPUT_DIR = 'processed_data'
DATA_ROOT_FOLDER_NAME = 'asl_alphabet_train'

os.makedirs(DESTINATION_PATH, exist_ok=True)
os.makedirs(PROCESSED_OUTPUT_DIR, exist_ok=True)

### Download Data via Kaggle API

In [None]:
print(f"Downloading dataset: {KAGGLE_DATASET_ID}")
!kaggle datasets download -d {KAGGLE_DATASET_ID} -p {DESTINATION_PATH} --unzip

# Define the exact root path to the image subfolders (A, B, C, etc.)
DATA_ROOT = os.path.join(DESTINATION_PATH, DATA_ROOT_FOLDER_NAME, DATA_ROOT_FOLDER_NAME)
print(f"Image data root set to: {DATA_ROOT}")

### Feature Extraction and Array Storage

In [None]:
GCS_BUCKET_NAME = "gs://asl-keypoint-data-storage-2025"
GCS_DESTINATION_FOLDER = "processed_features_v1"

# 1. Get all unique class folder names and sort them alphabetically
class_names = sorted([d for d in os.listdir(DATA_ROOT) if os.path.isdir(os.path.join(DATA_ROOT, d))])

# 2. Create the dictionary
label_map = {name: i for i, name in enumerate(class_names)}

FEATURE_OUTPUT_DIR = os.path.join('processed_data', 'class_splits')
os.makedirs(FEATURE_OUTPUT_DIR, exist_ok=True) # Ensure the directory exists

def create_and_save_features():
    # List to hold file paths of NPY files for later concatenation
    all_class_files = []

    # Iterate through all class folders
    for class_name in class_names:
        class_path = os.path.join(DATA_ROOT, class_name)
        label_index = label_map[class_name]

        print(f"Processing Class: {class_name} (Label: {label_index})")

        # --- Memory-Saving Block ---
        class_keypoints = []
        class_images = []
        class_labels = []

        for image_name in os.listdir(class_path):
            image_path = os.path.join(class_path, image_name)

            # Use the imported modular function
            keypoints, resized_img = extract_keypoints(image_path)

            if keypoints is not None:
                class_keypoints.append(keypoints)
                class_images.append(resized_img)
                class_labels.append(label_index)

        # 3. Convert and Save (The memory-intensive part, done one class at a time)
        X_key_class = np.array(class_keypoints, dtype=np.float32)
        X_cnn_class = np.array(class_images, dtype=np.float32)
        y_class = np.array(class_labels, dtype=np.int32)

        # 4. Save to Disk
        # Use a temporary name for each class file
        key_file = os.path.join(FEATURE_OUTPUT_DIR, f'keypoints_{class_name}.npy')
        cnn_file = os.path.join(FEATURE_OUTPUT_DIR, f'cnn_{class_name}.npy')
        label_file = os.path.join(FEATURE_OUTPUT_DIR, f'labels_{class_name}.npy')

        np.save(key_file, X_key_class)
        np.save(cnn_file, X_cnn_class)
        np.save(label_file, y_class)
        all_class_files.append((key_file, cnn_file, label_file))

        print(f"Processed and saved {class_name}. Freeing memory...")

        # 5. Crucial: Delete objects and force garbage collection
        del X_key_class, X_cnn_class, y_class, class_keypoints, class_images, class_labels
        gc.collect()

# --- EXECUTION ---
create_and_save_features()
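
The label map above is built by sorting the class folder names and enumerating them, so the integer assigned to each class depends only on the set of folder names. The short sketch below shows the resulting mapping and its inverse, which is what turns a predicted index back into a sign name. The folder names are an illustrative subset chosen for the example, and `index_to_name` is a name introduced here.

```python
# Illustrative subset of class folder names, deliberately unsorted
folder_names = ["B", "space", "A", "del", "nothing", "C"]

# Same construction as in the cell above: sort, then enumerate
class_names = sorted(folder_names)
label_map = {name: i for i, name in enumerate(class_names)}

# Inverse mapping, for decoding model predictions back to class names
index_to_name = {i: name for name, i in label_map.items()}

print(label_map)
print(index_to_name[3])
```

Python's default string sort places uppercase letters before lowercase ones, so single-letter classes come first and lowercase names such as `del` follow. Saving `label_map` alongside the features would keep training and inference consistent if the folder set ever changes.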

### Final Concatenation

In [None]:
print("Starting memory-optimized final concatenation...")

# 1. Identify all temporary class files that need to be merged
temp_files = sorted(os.listdir(FEATURE_OUTPUT_DIR))
keypoint_files = [os.path.join(FEATURE_OUTPUT_DIR, f) for f in temp_files if f.startswith('keypoints_')]
cnn_files = [os.path.join(FEATURE_OUTPUT_DIR, f) for f in temp_files if f.startswith('cnn_')]
label_files = [os.path.join(FEATURE_OUTPUT_DIR, f) for f in temp_files if f.startswith('labels_')]

# Check if files were found
if not keypoint_files:
    raise FileNotFoundError("No temporary keypoint files found. Check FEATURE_OUTPUT_DIR path.")
if not cnn_files:
    raise FileNotFoundError("No temporary cnn files found. Check FEATURE_OUTPUT_DIR path.")
if not label_files:
    raise FileNotFoundError("No temporary label files found. Check FEATURE_OUTPUT_DIR path.")

# 2. Concatenation (one output file at a time; each merge is freed before the next)

def merge_files_efficiently(file_list, final_name):
    """Loads every file in file_list, concatenates them in memory, and saves the result."""

    output_path = os.path.join(FEATURE_OUTPUT_DIR, final_name)
    print(f"Merging {len(file_list)} files into {final_name}...")

    all_arrays = [np.load(f) for f in file_list]
    merged_array = np.concatenate(all_arrays)
    np.save(output_path, merged_array)

    # Crucial: Delete objects and force garbage collection after each merge
    del all_arrays, merged_array
    gc.collect()
    print(f"Successfully saved {final_name}.")
    return output_path

# Execute the merges
final_keypoints_path = merge_files_efficiently(keypoint_files, 'X_keypoints.npy')
final_labels_path = merge_files_efficiently(label_files, 'y_labels.npy')

print("\nAll final feature files created successfully on local disk.")

# 3. Upload to GCS
GCS_PATH = f"{GCS_BUCKET_NAME}/{GCS_DESTINATION_FOLDER}"
print(f"Uploading final feature files from {FEATURE_OUTPUT_DIR} to {GCS_PATH}...")

# Upload X_keypoints.npy
!gsutil cp {FEATURE_OUTPUT_DIR}/X_keypoints.npy {GCS_PATH}/X_keypoints.npy

# Upload y_labels.npy
!gsutil cp {FEATURE_OUTPUT_DIR}/y_labels.npy {GCS_PATH}/y_labels.npy

print("\nUpload to GCS complete. Only final files were uploaded.")
print("Data processing pipeline finished! 🎉")
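
The keypoint and label files are merged separately, and they stay aligned only because both file lists are sorted by the same class suffix. A cheap sanity check before uploading is to confirm that the merged arrays have the same number of rows. The sketch below does this on small synthetic files; `row_counts_match`, the demo file names, and the feature width of 4 are assumptions made for the example.

```python
import os
import tempfile

import numpy as np


def row_counts_match(features_path: str, labels_path: str) -> bool:
    """Compare the first dimension of two .npy files without reading them fully."""
    # mmap_mode="r" maps the files instead of loading their contents into RAM
    features = np.load(features_path, mmap_mode="r")
    labels = np.load(labels_path, mmap_mode="r")
    return features.shape[0] == labels.shape[0]


with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_demo.npy")
    y_path = os.path.join(tmp, "y_demo.npy")
    np.save(x_path, np.zeros((5, 4), dtype=np.float32))  # 5 samples, arbitrary width
    np.save(y_path, np.arange(5, dtype=np.int32))        # 5 labels
    matched = row_counts_match(x_path, y_path)

print(matched)
```

In this notebook the call would be `row_counts_match(final_keypoints_path, final_labels_path)`.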

In [None]:
print("--- Starting Memory-Mapped Merge for X_cnn_images (with Disk Cleanup) ---")

# 1. Identify all temporary files and verify paths
try:
    temp_files = sorted(os.listdir(FEATURE_OUTPUT_DIR))
    # We load these lists for reference, they are NOT deleted yet.
    label_files = [os.path.join(FEATURE_OUTPUT_DIR, f) for f in temp_files if f.startswith('labels_')]
    cnn_files = [os.path.join(FEATURE_OUTPUT_DIR, f) for f in temp_files if f.startswith('cnn_')]
except FileNotFoundError:
    print(f"Error: The directory {FEATURE_OUTPUT_DIR} was not found. Please check REPO_NAME.")
    exit()

if not cnn_files or not label_files:
    print("Error: No intermediate 'cnn_*.npy' or 'labels_*.npy' files found. Cannot proceed.")
    exit()

# 2. Calculate the required final shape (metadata only)
print(f"Found {len(cnn_files)} intermediate files.")

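# Calculate the total number of samples (rows); the label files are small
total_samples = sum(np.load(f).shape[0] for f in label_files)

# Get the shape of a single image (e.g., (224, 224, 3)).
# mmap_mode='r' reads the shape without loading the image data into RAM.
cnn_image_shape = np.load(cnn_files[0], mmap_mode='r').shape[1:]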

print(f"Total Samples to Merge: {total_samples}")
print(f"Image Feature Shape: {cnn_image_shape}")

# 3. Create and Populate the Memory-Mapped Array
FINAL_CNN_PATH = os.path.join(FEATURE_OUTPUT_DIR, 'X_cnn_images.npy')
current_row = 0

print(f"Creating memory-mapped file at: {FINAL_CNN_PATH}")

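# Create the destination memory-mapped array (mode='w+' means create/overwrite).
# Note: np.memmap writes raw bytes with no .npy header, so despite the extension
# this file must be read back with np.memmap using the same dtype and shape,
# not with np.load.
X_cnn_final_map = np.memmap(
    FINAL_CNN_PATH,
    dtype=np.float32,
    mode='w+',
    shape=(total_samples, *cnn_image_shape)
)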

# Iteratively write data into the memory-mapped file
for i, cnn_file in enumerate(cnn_files):
    # Load one small class array into RAM
    X_cnn_class = np.load(cnn_file)
    num_samples = X_cnn_class.shape[0]

    # Write the small array directly into the correct slice of the large file on disk
    X_cnn_final_map[current_row:current_row + num_samples] = X_cnn_class

    # Update the row counter
    current_row += num_samples

    print(f"  -> Wrote file {i+1}/{len(cnn_files)} ({num_samples} samples).")

    # Crucial: Delete objects and force garbage collection after each loop
    del X_cnn_class
    gc.collect()

    # Flush ensures data is written to disk immediately
    X_cnn_final_map.flush()

    # --- DISK CLEANUP STEP ---
    os.remove(cnn_file)
    print(f"  -> Deleted source file: {os.path.basename(cnn_file)}")

print("\nStep 1 of 2: X_cnn_images successfully merged and saved locally.")

# Final cleanup of the memmap object before GCS upload
del X_cnn_final_map
gc.collect()

# 4. Upload the final file to GCS
GCS_PATH = f"{GCS_BUCKET_NAME}/{GCS_DESTINATION_FOLDER}"
GCS_DESTINATION_FILE = os.path.basename(FINAL_CNN_PATH)

print(f"\nStep 2 of 2: Uploading {GCS_DESTINATION_FILE} to {GCS_PATH}...")
# Use gsutil cp to copy the local file to the GCS path
!gsutil cp {FINAL_CNN_PATH} {GCS_PATH}/{GCS_DESTINATION_FILE}

print("\nSUCCESS: X_cnn_images.npy uploaded to GCS.")
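
A file written through `np.memmap` contains only the raw array bytes, so whoever reads it must supply the dtype and shape again. `np.lib.format.open_memmap` is an alternative that writes a standard `.npy` header first, which lets `np.load(..., mmap_mode='r')` recover the shape on its own. The sketch below contrasts the two on a tiny synthetic array; the shape and file names are arbitrary choices for the example and are not taken from the dataset.

```python
import os
import tempfile

import numpy as np

shape = (6, 4, 4, 3)  # Arbitrary demo shape: 6 tiny "images"
data = np.arange(np.prod(shape), dtype=np.float32).reshape(shape)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw_demo.dat")
    npy_path = os.path.join(tmp, "header_demo.npy")

    # Option 1: raw memmap, as used in the cell above
    raw = np.memmap(raw_path, dtype=np.float32, mode="w+", shape=shape)
    raw[:] = data
    raw.flush()
    del raw
    # Reading back requires the same dtype and shape to be passed again
    raw_back = np.memmap(raw_path, dtype=np.float32, mode="r", shape=shape)
    raw_ok = bool(np.array_equal(raw_back, data))
    # The raw file holds exactly the array bytes, with no header
    raw_size_ok = os.path.getsize(raw_path) == data.nbytes

    # Option 2: memmap with a .npy header
    out = np.lib.format.open_memmap(npy_path, mode="w+", dtype=np.float32, shape=shape)
    out[:] = data
    out.flush()
    del out
    # np.load recovers dtype and shape from the header
    npy_back = np.load(npy_path, mmap_mode="r")
    npy_ok = bool(np.array_equal(npy_back, data)) and npy_back.shape == shape

    del raw_back, npy_back  # Release the mappings before the directory is removed

print(raw_ok, raw_size_ok, npy_ok)
```

Either option keeps peak RAM usage at one class array at a time. The header variant trades a few extra bytes on disk for not having to carry `total_samples` and `cnn_image_shape` over to the training notebook.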

In [6]:
import os

%cd /
print("Current working directory:", os.getcwd())
%cd /content/sample_data
print("New working directory:", os.getcwd())

%cd /

print("Current working directory:", os.getcwd())

/
Current working directory: /
/content/sample_data
New working directory: /content/sample_data
/
Current working directory: /


## Data: Loading & Splitting

In [None]:
import os
from google.colab import auth  # Used by setup_gcs_data() if the setup cell was not run

In [16]:
# Constants

%cd /content

GCS_BUCKET_NAME = "gs://asl-keypoint-data-storage-2025"
GCS_DESTINATION_FOLDER = "processed_features_v1/"
CNN_FILE_NAME = "X_cnn_images.npy"
LABELS_FILE_NAME = "y_labels.npy"
LOCAL_FEATURE_DIR = 'gcs_loaded_data'
GCS_PATH = f"{GCS_BUCKET_NAME}/{GCS_DESTINATION_FOLDER}"


/content


In [17]:
def setup_gcs_data():
    """Authenticates GCS access and copies large files to the local Colab SSD."""
    print("Authenticating Google Cloud Storage...")
    try:
        # Authenticate the user for GCS access
        auth.authenticate_user()
    except Exception as e:
        print(f"Authentication failed: {e}")
        return False

    # Create the local directory
    os.makedirs(LOCAL_FEATURE_DIR, exist_ok=True)
    print(f"Local storage directory created at: {LOCAL_FEATURE_DIR}")

    # Use gsutil to copy the files to the local SSD
    print(f"Copying {CNN_FILE_NAME} (38 GB) from GCS to local SSD...")
    # It is crucial to use the local SSD for fast I/O during training.
    # The 'gsutil cp' command is optimized for this transfer.
    try:
        # Copy the large feature file
        !gsutil cp {GCS_PATH}{CNN_FILE_NAME} {LOCAL_FEATURE_DIR}/

        # Copy the much smaller labels file
        print(f"Copying {LABELS_FILE_NAME} from GCS to local SSD...")
        !gsutil cp {GCS_PATH}{LABELS_FILE_NAME} {LOCAL_FEATURE_DIR}/
        print("Data transfer complete.")
        return True
    except Exception as e:
        print(f"Data transfer failed: {e}")
        return False

setup_gcs_data()

Authenticating Google Cloud Storage...
Local storage directory created at: gcs_loaded_data
Copying X_cnn_images.npy (38 GB) from GCS to local SSD...
Copying gs://asl-keypoint-data-storage-2025/processed_features_v1/X_cnn_images.npy...
==> NOTE: You are downloading one or more large file(s), which would
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

\ [1 files][ 35.7 GiB/ 35.7 GiB]  122.8 MiB/s                                   
Operation completed over 1 objects/35.7 GiB.                                     
Copying y_labels.npy from GCS to local SSD...
Copying gs://asl-keypoint-data-storage-2025/processed_features_v1/y_labels.npy...
/ [1 files][248.9 KiB/248.9 KiB]                                                
Operation completed over 1 objects/248.9 KiB.                                    
Data transfer complete.


True
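
With the files on local disk, the splitting step can work on indices rather than on the image array itself, so the large file never has to be copied in RAM. The following is a minimal sketch of a stratified train/validation split computed from the labels alone. It is not the notebook's own splitting code; `stratified_split_indices`, the 20% validation fraction, and the synthetic labels are assumptions for illustration.

```python
import numpy as np


def stratified_split_indices(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with each class split by the same fraction."""
    rng = np.random.default_rng(seed)
    train_parts, val_parts = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # Positions of this class
        rng.shuffle(idx)                     # Shuffle within the class
        n_val = int(round(len(idx) * val_fraction))
        val_parts.append(idx[:n_val])
        train_parts.append(idx[n_val:])
    # Sorted indices give mostly sequential reads from a memory-mapped file
    train_idx = np.sort(np.concatenate(train_parts))
    val_idx = np.sort(np.concatenate(val_parts))
    return train_idx, val_idx


# Synthetic labels: 3 classes with 10 samples each
labels = np.repeat(np.arange(3), 10)
train_idx, val_idx = stratified_split_indices(labels, val_fraction=0.2, seed=0)
print(len(train_idx), len(val_idx))  # → 24 6
```

In this notebook the labels would come from `np.load` on `y_labels.npy` inside `LOCAL_FEATURE_DIR`, and the resulting index arrays would select rows from the memory-mapped image file.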