# Results and Discussion

**Authors:** Usher Raymond Abainza, Dane Casey Casino, Kein Jake Culanggo, and Karylle dela Cruz

---

## 1   Results of Video Testing

The effectiveness of YOLOv8 variants in ASL fingerspelling recognition was evaluated through a series of controlled video tests, where sample words—DOG, HELLO, TRAVEL, FOREST, and CAMERA—were presented to both the Nano and Small models. These tests were designed to simulate realistic conditions, including variations in hand orientation, motion, and lighting, reflecting challenges commonly encountered in practical applications. The evaluation focused on metrics that capture both the correctness of letter classification and the quality of sequential recognition, including Character Accuracy, Character Error Rate (CER), Word Error Rate (WER), and Subsequence Accuracy.

By comparing the model predictions against the ground truth for each word, we aimed not only to assess overall performance but also to identify misclassification patterns that could inform future model refinement. Particular attention was given to sequences containing visually similar letters, such as ‘A,’ ‘T,’ ‘M,’ ‘N,’ ‘S,’ and ‘E,’ which previous studies and team observations indicated are prone to misidentification due to their fist-like configurations. The following sections present a detailed discussion of the results, highlighting both successful cases and failure modes, and offering insights into the strengths and limitations of both YOLOv8 variants in the context of ASL fingerspelling recognition.

**1.1 YOLOv8‑n (Nano) Performance**

The YOLOv8‑n model was evaluated on the five test videos. Table 1-1 summarizes character accuracy and other metrics for each word.

| Test Word | Predicted Output | Character Accuracy (%) | CER (%) | WER (%) | Subsequence Accuracy (%) |
|-----------|------------------|------------------------|---------|---------|--------------------------|
| DOG       | DOGG             | 75.0                   | 33.3    | 33.3    | 100.0                    |
| HELLO     | HMMELO           | 50.0                   | 60.0    | 60.0    | 60.0                     |
| TRAVEL    | TTKKKRVMML       | 30.0                   | 116.7   | 116.7   | 33.3                     |
| FOREST    | FFFCOURRMMEMMMMM | 25.0                   | 200.0   | 200.0   | 66.7                     |
| CAMERA    | COMAMMMEURA      | 54.5                   | 83.3    | 83.3    | 100.0                    |

**Table 1-1.** YOLOv8-n (Nano) Performance on Five Test Videos
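The CER values above can exceed 100% because every spurious inserted letter counts as an error while the denominator stays fixed at the reference length. The following is a minimal sketch of that computation, assuming CER is the Levenshtein edit distance divided by the number of reference characters. This definition appears consistent with the CER column, but it is not stated in the text, and the function names are illustrative.

```python
def edit_distance(reference: str, predicted: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the prediction.
    previous = list(range(len(predicted) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, pred_char in enumerate(predicted, start=1):
            substitution = previous[j - 1] + (ref_char != pred_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, predicted: str) -> float:
    """CER in percent, normalized by reference length (assumed definition)."""
    return 100.0 * edit_distance(reference, predicted) / len(reference)


for word, output in [("DOG", "DOGG"), ("HELLO", "HMMELO"), ("TRAVEL", "TTKKKRVMML")]:
    print(word, output, round(character_error_rate(word, output), 1))
```

For “TRAVEL,” the ten-letter prediction requires seven edits against a six-letter reference, which gives the 116.7% reported in the table.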

**1.1.1 Best Performing Test Word**

Among the five words tested, “DOG” was the best-performing word for YOLOv8-n, achieving a character accuracy of 75%, a Character Error Rate (CER) and Word Error Rate (WER) of 33.3% each, and perfect subsequence accuracy. The only error was a duplicated final ‘G’ in the predicted output “DOGG.” Although this is lower than the 100% character accuracy achieved by YOLOv8-s (Section 1.2), it indicates that YOLOv8-n was generally able to detect and classify the letters in “DOG” with reasonable consistency across frames. The relative success of this word can be attributed to the visually distinct handshapes of ‘D,’ ‘O,’ and ‘G,’ which reduce the likelihood of misclassification. This suggests that even the lightweight Nano variant performs well on sequences composed of clearly differentiated hand configurations.

**1.1.2 Words with Lower Accuracy**

The other words (“HELLO,” “TRAVEL,” “FOREST,” and “CAMERA”) exhibited substantially lower performance, with character accuracy ranging from 25% to 54.5%, CER and WER between 60% and 200%, and subsequence accuracy between 33.3% and 100%. A primary factor contributing to these results is the presence of letters with similar fist-based configurations, including ‘A,’ ‘T,’ ‘M,’ ‘N,’ ‘S,’ and ‘E.’ In “HELLO,” the doubled ‘L’ and the fist-like ‘E’ created confusion due to subtle finger positioning differences. “TRAVEL” and “FOREST” were particularly challenging, as multiple letters resembled fists, and transitions between letters introduced temporal inconsistencies, increasing edit distances and reducing sequence accuracy. “CAMERA” displayed moderate character accuracy but continued to show errors with repeated or similar handshapes, reflecting the same pattern of misclassification observed in the other low-performing words.

**1.1.3 Distinct Factors in Failure Cases**

While misclassifications largely stemmed from fist-based handshape similarity, each word exhibited distinct contributing factors. In “TRAVEL,” the model often inserted extra letters or mispredicted sequences, likely due to rapid hand transitions that challenged the Nano variant’s capacity for temporal consistency. “FOREST” suffered from compounded errors due to motion blur, hand orientation variability, and partial occlusion, leading to repeated or missing letters and extremely low character accuracy. “HELLO” and “CAMERA,” though not as severely affected, still experienced errors in sequences containing consecutive letters with minimal visual differences. These observations indicate that misclassifications are influenced not only by letter similarity but also by dynamic video factors such as motion, frame blur, and spatial occlusion, which appear to impact YOLOv8-n more significantly than YOLOv8-s due to its reduced model capacity.

**1.1.4 Implications for YOLOv8-n Performance**

Overall, the evaluation demonstrates that YOLOv8-n can successfully detect and classify distinct ASL handshapes, as evidenced by the relatively high performance on “DOG.” However, the model struggles with sequences involving visually similar letters, particularly those forming fists, and is more sensitive to motion, partial occlusion, and frame variability than its larger counterpart. These results align with prior studies suggesting that lightweight YOLO variants maintain reasonable classification capabilities but are limited in precise localization and fine-grained discrimination. Consequently, while YOLOv8-n offers efficiency and lower computational demand suitable for resource-constrained environments, further improvements in temporal smoothing, handshape differentiation, and robustness to dynamic conditions would be necessary for reliable real-world fingerspelling recognition.

**1.2 YOLOv8‑s (Small) Performance**

The YOLOv8‑s model was tested under identical conditions. Table 1-2 presents its performance metrics.

| Test Word | Predicted Output | Character Accuracy (%) | CER (%) | WER (%) | Subsequence Accuracy (%) |
|-----------|------------------|------------------------|---------|---------|--------------------------|
| DOG       | DOG              | 100.0                  | 0.0     | 0.0     | 100.0                    |
| HELLO     | HEEELLO          | 66.7                   | 40.0    | 40.0    | 60.0                     |
| TRAVEL    | KRAKVMEMEEL      | 45.5                   | 100.0   | 100.0   | 0.0                      |
| FOREST    | FCORMMEMMT       | 50.0                   | 83.3    | 83.3    | 66.7                     |
| CAMERA    | CMMMEXA          | 57.1                   | 50.0    | 50.0    | 33.3                     |

**Table 1-2.** YOLOv8-s (Small) Performance on Five Test Videos

**1.2.1 Best Performing Test Word**

Among the five words tested, “DOG” emerged as the best-performing word, achieving 100% character accuracy, zero Character Error Rate (CER), zero Word Error Rate (WER), and perfect subsequence accuracy. This outcome indicates that YOLOv8-s was able to precisely localize and classify each handshape in the word “DOG” across all frames in the video. One likely reason for its superior performance is that the letters ‘D,’ ‘O,’ and ‘G’ involve handshapes that are visually distinct from each other, with minimal overlap in spatial finger positioning. Consequently, the model could more confidently differentiate each gesture without confusing subtle variations. This demonstrates that YOLOv8-s performs exceptionally well when handshapes are distinct and the sequence does not include letters with similar fist-based configurations.

**1.2.2 Words with Lower Accuracy**

The remaining words—“HELLO,” “TRAVEL,” “FOREST,” and “CAMERA”—showed lower character accuracy, higher CER and WER, and reduced subsequence accuracy. Several factors contributed to these poorer outcomes. One major challenge is the presence of letters with highly similar handshapes, particularly ‘A,’ ‘T,’ ‘M,’ ‘N,’ ‘S,’ and ‘E,’ all of which form a closed fist or near-fist configuration. For example, in “HELLO,” the repeated ‘E’ and ‘L’ letters involve handshapes that can easily be confused due to minor finger movements or orientation differences. Similarly, “TRAVEL” and “FOREST” contained multiple letters that resemble a fist or involve partial occlusions during transitions, leading to high edit distances and low subsequence accuracy. “CAMERA” exhibited moderate character accuracy but struggled with repeated ‘M’ and ‘E’ letters, reflecting the same pattern of misclassification among visually similar handshapes.

**1.2.3 Distinct Factors in Failure Cases**

While the general reason for misclassifications is confusion among fist-based letters, there are distinct contributing factors for different words. In “TRAVEL,” the model frequently inserted extra letters or mispredicted sequences, likely due to rapid hand movements and transitions between letters, causing temporal inconsistencies. In “FOREST,” the extreme misalignment may have been exacerbated by hand orientation, motion blur, or partial occlusions in the video frames, which hindered bounding box localization and led to repeated or missing letters. “HELLO” and “CAMERA,” although not as severely affected, still experienced errors due to subtle differences between similar handshapes, particularly in sequences with consecutive letters requiring precise finger positioning. These observations suggest that misclassifications are not caused solely by letter similarity but are compounded by dynamic factors such as motion, frame blur, and spatial occlusion.

**1.2.4 Implications for YOLOv8-s Performance**

Overall, these results indicate that YOLOv8-s is capable of high detection performance when letters are visually distinct and hand orientation is stable, as seen with “DOG.” However, the model exhibits limitations in fine-grained differentiation among letters forming fists, consistent with observations from previous studies. Errors are magnified in words containing sequences of similar handshapes or when video frames include rapid motion or partial occlusions. This reinforces the notion that while YOLOv8-s can effectively classify distinct ASL handshapes, improvements in localization precision, temporal consistency, and robustness to visually subtle differences are necessary to reliably handle real-world fingerspelling sequences. Furthermore, the performance pattern confirms that the model’s misclassifications are not random but largely predictable based on handshape similarity and sequence complexity.

**1.3 Grad-CAM Analysis**

![image](./images/Picture10.png)

**Figure 1-1.** YOLOv8 Grad-CAM Visualization on the “DOG” Fingerspelling Video

To gain additional insight into the decision-making process of the YOLOv8 models, Grad-CAM (Gradient-weighted Class Activation Mapping) was applied to selected video frames. The goal was to visualize which regions of the input images contributed most strongly to the model’s classification decisions, ideally highlighting specific finger positions or hand configurations. However, the resulting activation maps were relatively unspecific. Across all tested hand signs, Grad-CAM consistently produced a diffuse red overlay centered over the entire hand, rather than focusing on particular fingers or localized hand regions. This suggests that the models primarily rely on the overall hand region rather than distinct handshape patterns when making predictions. 

Consequently, while Grad-CAM served as a helpful supplementary tool to confirm that the model is attending to the correct general area, it did not provide fine-grained insight into the subtle spatial patterns that differentiate similar ASL letters. The analysis indicates that interpretability at the finger-level remains limited, and more targeted explainability methods may be necessary to fully understand the model’s internal feature use.
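For reference, the standard Grad-CAM formulation is reproduced below. How it was attached to the YOLOv8 models is not specified here, so the choice of target score $y^c$ (for example, the class score of a detected letter) and of feature maps $A^k$ (the activations of one convolutional layer) should be read as assumptions.

$$
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \operatorname{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
$$

Here $Z$ is the number of spatial positions in each feature map. The resulting heatmap has the spatial resolution of $A^k$ and is upsampled to the input size, so maps taken from deep, low-resolution layers are inherently coarse. This may contribute to the diffuse overlays described above, although it does not by itself establish which features the models rely on.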


## 2   Discussion

**2.1 Performance on Distinct Handshapes**

The evaluation of YOLOv8-n and YOLOv8-s on ASL fingerspelling videos shows that sequences composed of letters with visually distinct handshapes, such as the word “DOG,” yielded the best results for both models: 75% character accuracy with YOLOv8-n and 100% with YOLOv8-s, with 100% subsequence accuracy in both cases. This outcome highlights the models’ ability to detect and classify hand configurations when gestures are clearly differentiated, suggesting that both YOLOv8 variants effectively leverage spatial features under stable hand orientations and smooth letter transitions. These cases underscore the models’ capability to capture distinct visual patterns in ASL gestures when ambiguity is minimal.

**2.2 Challenges with Similar Fist-Based Letters**

Conversely, words containing letters with similar fist-based configurations—specifically ‘A,’ ‘T,’ ‘M,’ ‘N,’ ‘S,’ and ‘E’—consistently produced lower accuracy, higher Character and Word Error Rates, and reduced subsequence alignment. Misclassifications in these sequences were influenced by subtle differences in finger positioning, rapid transitions between letters, partial occlusions, and frame-to-frame variability. The Nano model was particularly affected, likely due to its reduced capacity for fine-grained feature extraction and temporal consistency. While the Small model generally improved performance, it too struggled with sequences containing visually similar letters, indicating that increased model capacity alone cannot fully resolve ambiguities inherent in certain handshapes.

**2.3 Distinct Failure Patterns and Model Implications**

Distinct patterns emerged among failure cases, revealing that misclassifications arise from multiple factors beyond visual similarity. In words like “TRAVEL” and “FOREST,” extra or missing letters were frequently observed, reflecting temporal inconsistencies exacerbated by hand motion, orientation variability, and occasional motion blur. In contrast, words such as “HELLO” and “CAMERA” exhibited errors predominantly due to repeated or similar handshapes, highlighting that sequence complexity and inherent gesture ambiguity both contribute to misclassification. Grad-CAM analysis further supports this interpretation, showing that both models primarily attend to the overall hand region rather than specific finger configurations. These findings indicate that while YOLOv8 models are effective at general hand detection, their reliability diminishes for visually similar or temporally dynamic sequences. Consequently, improvements in localization precision, temporal modeling, and handshape differentiation are essential for robust real-world ASL recognition.

Overall, the evaluation reveals predictable patterns in model limitations, providing clear guidance for targeted enhancements in model architecture, dataset diversity, and post-processing techniques, consistent with prior studies emphasizing that fine-grained handshape distinctions remain a core challenge in automated ASL fingerspelling recognition.


## 3   Recommendations

**3.1 Model Enhancements.** To improve recognition accuracy, particularly for challenging letters and sequences, future work should explore modifications to the current detection architecture. Larger or specialized YOLOv8 variants, such as Medium or Transformer-augmented models, may better capture fine-grained hand features and subtle differences among fist-based gestures. Additionally, integrating temporal information across consecutive video frames through sequence modeling techniques, such as recurrent neural networks or temporal convolutions, could provide context-aware predictions, helping to reduce misclassifications in sequences containing visually similar letters. These improvements would address the temporal inconsistencies observed in words like “TRAVEL” and “FOREST,” where rapid hand transitions and frame variability contributed significantly to errors.

**3.2 Data Expansion and Augmentation.** The evaluation highlights that limited data diversity contributes to misclassification and reduced character accuracy. Expanding the dataset with additional ASL hand images and video sequences under varied lighting conditions, hand orientations, and occlusions would improve model generalization. Furthermore, targeted augmentation strategies, including rotations, scaling, and simulated occlusions, could enhance model robustness, allowing it to maintain accuracy even when hand positioning or environmental conditions are suboptimal. Increased data diversity is especially important for sequences containing similar fist-based letters, which are prone to subtle misclassification.
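As an illustration of the simulated-occlusion idea, the sketch below blanks out a random rectangle inside the hand's bounding box. It is a hypothetical, dependency-free example operating on a grayscale image stored as nested lists. The function name, the `max_fraction` default, and the box convention are assumptions, not part of the pipeline evaluated here.

```python
import random
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # hand box as (x0, y0, x1, y1); x1 and y1 exclusive


def occlude(image: Image, box: Box, rng: random.Random,
            max_fraction: float = 0.4, fill: int = 0) -> Image:
    """Return a copy of image with a random rectangle inside box set to fill."""
    x0, y0, x1, y1 = box
    # Patch size is a random share of the box, at least one pixel per side.
    patch_w = max(1, int((x1 - x0) * rng.uniform(0.1, max_fraction)))
    patch_h = max(1, int((y1 - y0) * rng.uniform(0.1, max_fraction)))
    # Choose the top-left corner so the patch stays inside the box.
    left = rng.randint(x0, x1 - patch_w)
    top = rng.randint(y0, y1 - patch_h)
    result = [row[:] for row in image]  # copy so the source frame is untouched
    for y in range(top, top + patch_h):
        for x in range(left, left + patch_w):
            result[y][x] = fill
    return result


frame = [[255] * 8 for _ in range(8)]
augmented = occlude(frame, box=(2, 2, 6, 6), rng=random.Random(0))
```

Occlusion leaves the bounding-box label unchanged, whereas rotation and scaling also require transforming the box coordinates along with the pixels.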

**3.3 Post-Processing and Prediction Smoothing.** Enhancing model outputs through post-processing can mitigate transient misclassifications and improve sequence-level reliability. Temporal smoothing across video frames or sequence-level correction using language models can help maintain consistency in predicted letter sequences. Confidence-based filtering could further eliminate spurious detections, repeated letters, or improbable sequences. Implementing these strategies would reduce Word and Character Error Rates, particularly in words with consecutive similar handshapes, and would complement architectural improvements without significantly increasing computational costs.
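A minimal sketch of how confidence-based filtering and temporal smoothing might be combined is shown below. It assumes each video frame yields at most one `(letter, confidence)` pair. The function name and both thresholds (`min_confidence`, `min_run`) are hypothetical and would need tuning against the frame rate and signing speed.

```python
from itertools import groupby
from typing import List, Optional, Tuple

# (letter, confidence) for a frame, or None when nothing was detected
Frame = Optional[Tuple[str, float]]


def decode_letters(frames: List[Frame], min_confidence: float = 0.5,
                   min_run: int = 3) -> str:
    """Collapse per-frame detections into a letter string."""
    # 1. Confidence-based filtering: drop empty and low-confidence frames.
    confident = [f[0] for f in frames if f is not None and f[1] >= min_confidence]
    # 2. Temporal smoothing: keep a letter only if it persists for min_run frames.
    stable = [letter for letter, run in groupby(confident)
              if len(list(run)) >= min_run]
    # 3. Merge identical letters that became adjacent once short runs were dropped.
    return "".join(letter for letter, _ in groupby(stable))


frames = ([("D", 0.9)] * 4 + [("O", 0.8)] * 5 + [("C", 0.3)] * 2
          + [("G", 0.9)] * 4 + [("M", 0.9)] * 1 + [("G", 0.85)] * 3)
print(decode_letters(frames))  # DOG
```

In this example the low-confidence ‘C’ frames and the single-frame ‘M’ are discarded, and the two ‘G’ runs are merged into one letter. A known limitation of this scheme is that genuinely doubled letters, such as the ‘LL’ in “HELLO,” also collapse into one, so recovering them would require an additional step such as lexicon-based or language-model correction.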

**3.4 Deployment Considerations.** For practical applications, model selection should align with the computational constraints of the target environment. Lightweight configurations, such as YOLOv8-n, are suitable for mobile or edge devices where efficiency is prioritized, though at the cost of some fine-grained accuracy. In contrast, YOLOv8-s or larger variants are more appropriate for server or desktop environments, where the higher detection reliability justifies the computational overhead. Future implementations should carefully balance the trade-off between model performance and resource requirements, ensuring responsive and accurate ASL recognition in real-world educational or assistive settings.
