# Data Collection and Preprocessing

Usher Raymond Abainza, Dane Casey Casino, Kein Jake Culangao, and Karylle dela Cruz

## 1 Data Collection

**Image Dataset.** The dataset used in this study is the American Sign Language (ASL) Dataset, sourced from Kaggle. It is organized into 36 classes, corresponding to the digits 0-9 and the letters a-z. Each class is stored in a separate folder containing images of hands forming the corresponding sign (Fig. 1 shows an example). Most classes have 70 images; the exception is the letter 't', which contains only 65, creating a minor imbalance. This small variation is not expected to significantly affect model performance. If needed, techniques such as oversampling or weighted loss functions can be applied to mitigate the effects of class imbalance.

To provide further context, the inclusion of all letters in a structured folder hierarchy allows for automatic labeling and consistent preprocessing, which simplifies model training and ensures reproducibility. The minor imbalance in the letter 't' is acknowledged but does not compromise the integrity of the dataset, and addressing it through oversampling or loss weighting ensures fairness across classes if necessary.
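If loss weighting is used, one common choice is inverse-frequency class weights. The sketch below is illustrative only: the function name `inverse_frequency_weights` is hypothetical, and the example counts cover just three classes, assuming 70 images per class and 65 for 't' as described above.

```python
def inverse_frequency_weights(class_counts):
    """Return a weight per class, proportional to 1 / count.

    Weights are scaled so that sum(weight * count) equals the dataset size,
    which keeps the average per-sample weight at 1.
    """
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {
        label: total / (num_classes * count)
        for label, count in class_counts.items()
    }


# Illustrative subset of classes; counts follow the description above.
example_counts = {"s": 70, "t": 65, "u": 70}
example_weights = inverse_frequency_weights(example_counts)
```

The under-represented class 't' receives a slightly larger weight than the others. Ordered by class index and converted to a tensor, such weights could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.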

The dataset was collected to facilitate the development of machine learning models for hand gesture recognition, providing a standardized set of images for training, validation, and testing. The images are captured in varying lighting conditions and hand orientations, which helps the model generalize better to different real-world scenarios.

This variability in lighting and orientation is intentional, simulating realistic environments in which users might practice ASL, and thus the dataset supports robust model evaluation beyond laboratory conditions.

![image](./images/1.png)

**Figure 1:** Random sample images from the American Sign Language (ASL) Dataset with their corresponding labels

**Video Dataset.** In addition to the image dataset, the researchers will create a custom video dataset to evaluate the model's ability to recognize ASL gestures in real-time motion. This dataset will consist of short video recordings, each approximately 5 to 10 seconds in duration. The participants will be instructed to spell out their name using ASL hand signs, producing a sequence of gestures that the model will later process on a frame-by-frame basis.

The creation of this video dataset complements the static image dataset, bridging the gap between isolated letter recognition and continuous sequence processing. By capturing motion, the videos allow evaluation of temporal dynamics and provide a more realistic assessment of model performance in practical use cases.

The videos will be recorded using consistent settings, including controlled lighting, a stable camera position, and a clear background to minimize noise and visual distractions. Each recorded clip will then be segmented into individual frames (Fig. 2 shows an example), allowing the image-trained model to classify each frame and reconstruct the spelled name. This approach enables the integration of static-image-based training with dynamic gesture recognition in videos.

Such controlled recording ensures that variability in environmental factors is minimized, allowing the focus to remain on the model's capacity to handle motion and sequence reconstruction without confounding noise.

![image](./images/2.png)

**Figure 2:** Random sample frames from the video dataset with their corresponding video names

To clarify the role of the video dataset within the overall study, it was designed with two primary objectives in mind:

1. **To test real-world applicability**, ensuring the system can interpret continuous gesture sequences rather than isolated static images.
2. **To evaluate temporal consistency**, determining how well the model maintains accuracy across varied motions, speeds, and transitions between gestures.

By combining the static image dataset with the custom video dataset, the project ensures a more robust evaluation of the model's performance in both controlled and dynamic environments.



## 2 Data Preprocessing

### 2.1 Image Dataset Preparation for Model Training and Testing

Before training the CNN, the ASL image dataset required systematic preprocessing to ensure consistency, enhance learning efficiency, and facilitate reliable evaluation. These steps transform raw images into a standardized format suitable for input to deep neural networks while maintaining the integrity of visual features critical for accurate letter recognition. Consistent preprocessing also removes incidental differences in image size and intensity scale, which supports generalization to new, unseen data.

To achieve these goals, several preprocessing steps were applied:

**Resizing.** All images were resized to 224\(\times\)224 pixels, which is compatible with standard Convolutional Neural Networks (CNNs) such as ResNet. This ensures uniformity across the dataset and reduces computational complexity.

Resizing standardizes input dimensions so that the CNN processes all images consistently. Because `Resize((224, 224))` forces a square output, the aspect ratio is preserved only when the source images are already square; otherwise the hand shape is slightly stretched. The fixed size also matches the input expectations of pre-trained architectures, enabling transfer learning and efficient feature extraction.

**Normalization.** After pixel values were scaled to the [0, 1] range, each channel was normalized using the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], the values commonly used with ImageNet pre-trained models. This step helps the model converge faster and improves training stability. Normalization centers the data and scales pixel values, reducing gradient instability and helping the network learn meaningful features rather than being influenced by arbitrary intensity differences.
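To make the arithmetic concrete, the following sketch applies the same scaling and per-channel normalization to a single pixel in plain Python. The helper name `normalize_pixel` is hypothetical and is used only for illustration; it is not part of the training pipeline.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize one 8-bit RGB pixel: scale to [0, 1], then (x - mean) / std."""
    scaled = [channel / 255.0 for channel in rgb]  # 8-bit range -> [0, 1]
    return [(value - m) / s for value, m, s in zip(scaled, mean, std)]


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
```

With these statistics, normalized values range from roughly -2.1 (a black pixel, red channel) to roughly 2.6 (a white pixel, blue channel).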

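```python
from torchvision import datasets, transforms

# Resize, convert to a float tensor in [0, 1], then normalize per channel
# with the ImageNet mean and standard deviation.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
```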

The transformation pipeline prepares raw images for efficient CNN training. Resizing ensures uniform dimensions, converting to tensors makes the data compatible with PyTorch, and normalization scales the pixel values to stabilize gradients and accelerate convergence during training. This pipeline is particularly critical when leveraging pre-trained models, as it aligns the input distribution with that of the ImageNet dataset, supporting transfer learning and robust feature extraction.

**Label Encoding.** Each folder name (0-9, a-z) was mapped to a unique numerical label based on the folder structure, which is handled automatically by torchvision's `ImageFolder` class. This encoding allows the model to output class predictions in a format suitable for cross-entropy loss computation.

```python
dataset = datasets.ImageFolder('asl_dataset', transform=transform)
```

Automated label encoding ensures consistency, reduces the risk of manual error, and facilitates seamless integration of categorical targets with the model's training process. By leveraging folder names as labels, the workflow becomes reproducible and scalable across datasets of varying size or complexity.
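`ImageFolder` sorts the class folder names and assigns consecutive integers in that order. The plain-Python sketch below mirrors that behavior for the folder names described above; `build_class_to_idx` is a hypothetical helper written for illustration, not a torchvision function.

```python
import string


def build_class_to_idx(folder_names):
    """Map class folder names to integer labels in sorted order."""
    classes = sorted(folder_names)
    return {name: index for index, name in enumerate(classes)}


# Folder names as described above: digits 0-9 and letters a-z.
folder_names = list(string.digits) + list(string.ascii_lowercase)
class_to_idx = build_class_to_idx(folder_names)

# Inverse mapping, useful for turning predicted indices back into characters.
idx_to_class = {index: name for name, index in class_to_idx.items()}
```

Because digits sort before lowercase letters, the digits occupy indices 0 to 9 and the letters occupy 10 to 35. On a loaded dataset, the actual mapping can be inspected through the `class_to_idx` attribute.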

**Train-Validation-Test Split and Data Loading.** To properly evaluate the model while maintaining reliable performance metrics, the dataset was divided into training, validation, and testing sets using an 80%-10%-10% split. The split was performed at random, so each subset contains an approximately representative mix of classes.

```python
from torch.utils.data import random_split

total_size = len(dataset)
train_size = int(0.8 * total_size)
val_size = int(0.1 * total_size)
test_size = total_size - train_size - val_size  # Remainder, so sizes sum to the total

# Randomly assign samples to the three subsets
train_dataset, val_dataset, test_dataset = random_split(
    dataset, [train_size, val_size, test_size]
)
```

Splitting the dataset in this way provides a structured framework for model evaluation. The training set supports feature learning, the validation set allows monitoring and hyperparameter tuning during training, and the test set provides an unbiased measure of final performance. Because the split is random, it is worth checking that every class appears in each subset in roughly equal proportion; a stratified split can enforce this if the random split proves uneven.
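One way to guarantee per-class representation is to split the indices of each class separately. The following is a minimal sketch under stated assumptions, not the procedure used in this study: `stratified_split`, its parameters, and the toy label list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, val_frac=0.1, seed=0):
    """Split sample indices per class so each subset contains every class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train_idx, val_idx, test_idx = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(train_frac * len(indices))
        n_val = int(val_frac * len(indices))
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # remainder goes to test
    return train_idx, val_idx, test_idx


# Toy labels: 70 samples for two classes and 65 for a third.
toy_labels = [0] * 70 + [1] * 65 + [2] * 70
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

With an `ImageFolder` dataset, the label list is available as `dataset.targets`, and each index list can be wrapped with `torch.utils.data.Subset(dataset, indices)` to obtain the corresponding subset.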

DataLoaders were then created to efficiently feed the data into the model in mini-batches. A batch size of 32 was used, with shuffling applied to the training set to improve model generalization, while the validation and test sets were loaded sequentially.

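```python
from torch.utils.data import DataLoader

# Shuffle only the training data; keep validation and test order fixed
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
```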

This setup delivers data in manageable batches, which allows efficient GPU usage and stable gradient updates. Shuffling the training set each epoch varies batch composition so that the model does not learn from a fixed sample order, while loading the validation and test sets in a fixed order keeps evaluation deterministic and repeatable. Overall, this configuration supports training, validation, and testing of the CNN on static images, and the same loading approach is applied to video frames downstream.

### 2.2 Video Dataset Preparation for Testing

To evaluate model performance on video data, a preprocessing pipeline was applied to convert videos into image frames suitable for CNN-based models. This step bridges the gap between static-image training and dynamic sequence evaluation, ensuring that the model can be assessed on real-world motion data while maintaining consistency with the preprocessing applied to static images. Proper preparation of video frames is essential to accurately capture temporal information without introducing artifacts that could degrade model performance.

**Frame Extraction.** Videos in the test set were processed individually. Frames were extracted at a target rate of 5 frames per second (FPS) to standardize temporal sampling and reduce computational load.

Standardizing the frame rate ensures uniform temporal granularity across all videos, balancing the need for sufficient temporal resolution with computational efficiency. By extracting frames consistently, the model can evaluate motion and gesture transitions reliably, without being biased by differences in video recording speed or duration.
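The sampling logic can be sketched independently of any video library. The function below selects which source frames to keep so that the output approximates the target rate; the function name, the 30 FPS source rate, and the 150-frame clip are assumptions chosen for illustration, since the recording frame rate is not specified here.

```python
def frame_indices_for_target_fps(source_fps, total_frames, target_fps=5.0):
    """Return indices of frames to keep so sampling approximates target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Step between kept frames, measured in source frames.
    # A step below 1 would mean sampling faster than the source, so clamp it.
    step = max(source_fps / target_fps, 1.0)
    indices = []
    k = 0
    while int(k * step) < total_frames:
        indices.append(int(k * step))
        k += 1
    return indices


# Hypothetical 5-second clip recorded at 30 FPS (150 frames).
kept = frame_indices_for_target_fps(source_fps=30, total_frames=150)
```

For this hypothetical clip, every sixth frame is kept, giving 25 frames for 5 seconds of video. A video reader would then decode and save only the frames at these indices.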

**Preprocessing.** Each extracted frame was converted to RGB, resized to 224\(\times\)224 pixels, and normalized using the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225] consistent with the image dataset preprocessing. This ensures that the frames are compatible with the pre-trained CNN models used for classification.

```python
# Reuse the transform defined for the image dataset
video_test_dataset = datasets.ImageFolder("videos_processed", transform=transform)
```

Maintaining the same preprocessing steps as the image dataset guarantees that the model receives input in a familiar format, preserving feature consistency. Converting to RGB ensures compatibility with CNNs trained on three-channel images, resizing maintains input dimensionality, and normalization aligns the pixel distribution with what the network expects, which supports stable predictions.

**Directory Structure and Label Encoding.** Frames from each video were stored in a dedicated folder, preserving class structure. torchvision's `ImageFolder` automatically maps each folder to a numerical label, making the dataset ready for testing.

Organizing frames in this structured manner not only facilitates automated label assignment but also enables reproducibility and easy expansion of the dataset. This approach preserves the association between frames and their corresponding gestures or letters, which is critical for evaluating sequence reconstruction accuracy.

**Data Loading.** A DataLoader was used to feed the video frames in mini-batches during testing, with shuffling disabled to maintain frame order:

```python
video_loader = DataLoader(video_test_dataset, batch_size=32, shuffle=False)
```

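Sequential loading preserves the temporal order of frames, which is essential for post-processing and letter sequence reconstruction. This holds only if frame filenames sort in temporal order, because `ImageFolder` lists files in sorted filename order; zero-padded frame indices (for example, `frame_0007` before `frame_0010`) satisfy this. Mini-batch loading also keeps evaluation efficient, allowing the model to process video data without overloading memory while maintaining accurate frame-level assessment.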
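As an illustration of the kind of post-processing that ordered frame predictions make possible, the sketch below collapses runs of identical per-frame predictions into single letters. It is a simple assumed approach rather than the reconstruction method of this study; `collapse_predictions`, the `min_run` threshold, and the example predictions are hypothetical.

```python
from itertools import groupby


def collapse_predictions(frame_labels, min_run=2):
    """Collapse consecutive identical frame predictions into one letter each.

    Runs shorter than min_run are treated as noise and dropped.
    """
    letters = []
    for label, run in groupby(frame_labels):
        if len(list(run)) >= min_run:
            letters.append(label)
    return "".join(letters)


# Hypothetical per-frame predictions for a clip spelling "dan",
# with one spurious frame ("x") during a transition.
frame_predictions = list("dddddaaaaxnnnn")
spelled = collapse_predictions(frame_predictions)
```

This simple rule has known limits: a name with a doubled letter is collapsed to a single letter, and a run interrupted by a misclassified frame is counted twice. Both cases would need additional handling, such as pause detection between signs.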

This setup allows the model, trained on static images, to be evaluated on video data by processing each frame individually, providing an assessment of performance on dynamic inputs. By combining consistent frame extraction, standardized preprocessing, structured labeling, and orderly data loading, the pipeline ensures that evaluation reflects both the model's classification accuracy and its ability to maintain performance across continuous sequences of gestures.

Collectively, the procedures described for both image and video datasets establish a robust foundation for model development. By ensuring consistency in data format, preprocessing, and labeling, the pipeline enables accurate and reproducible evaluation of the CNN's performance. The integration of static images with frame-level video evaluation also prepares the system for subsequent experiments in sequence reconstruction, providing a seamless bridge from data preparation to model design and testing.