# ASL Video-to-Text Translation Using CNN-Based Hand Sign Recognition and LLM Sequence Reconstruction

**Authors:** Usher Raymond Abainza, Dane Casey Casino, Kein Jake Culanggo, and Karylle dela Cruz

---

## 1. Background

American Sign Language (ASL) constitutes a fully expressive visual language, integrating hand gestures, facial expressions, and body movements to convey meaning. While primarily used by Deaf and Hard-of-Hearing (DHH) individuals, ASL has gained increasing attention among non-signers seeking practical communication skills. Its prominence as a second language has grown substantially; according to the Modern Language Association, ASL ranks as the third most-studied modern or foreign language at U.S. universities [1], reflecting considerable interest among hearing learners. Moreover, research demonstrates that acquiring ASL does not hinder the development of spoken English, underscoring its viability as a linguistically safe learning option [2]. Among the foundational components of ASL, fingerspelling, the sequential representation of letters, plays a critical role in conveying proper nouns, technical terminology, and less common words. Mastery of fingerspelling requires not only recognition of individual handshapes but also the ability to produce and sequence them accurately to form meaningful words and numerical expressions.

Despite its importance, traditional educational approaches for teaching fingerspelling, including static illustrations, printed charts, and multiple-choice exercises, primarily assess recognition of isolated letters. These methods provide limited insight into a learner’s capacity to construct sequences accurately and often fail to capture prevalent challenges, such as inconsistent hand positioning, timing errors between consecutive letters, and the translation of visual recognition into motor execution. In contrast, research in language acquisition highlights that active production, rather than passive recognition, significantly enhances retention, comprehension, and the transfer of knowledge to practical contexts [3].

Perceptual-motor processes are also integral to effective ASL learning. Martinez and Singleton (2018) demonstrated that movement short-term memory and visuospatial short-term memory jointly predict the proficiency with which hearing non-signers acquire signs, emphasizing the necessity of practicing actual hand movements rather than relying solely on visual recognition [4]. These findings indicate that educational systems capable of delivering immediate feedback on free-form fingerspelling attempts would address a critical pedagogical gap, particularly for learners engaged in remote or application-based instruction, by supporting skill development beyond rote memorization and multiple-choice evaluation.

Traditional computer vision techniques frequently encounter limitations due to variability in hand shapes across signers, which impedes accurate classification of fingerspelled letters [5]. Recent work has demonstrated that one-stage object detectors can reliably recognize alphabet-level hand gestures by jointly localizing hands and predicting classes within a single pass. For example, Poornima et al. (2024) applied a YOLOv5-based system specifically to ASL alphabet letters, achieving high accuracy and efficient detection, while Hasan et al. (2025) implemented a customized YOLOv8 approach, demonstrating real-time performance and robust classification for sign language recognition [6, 7]. These studies establish YOLO-family models, which are themselves convolutional networks, as viable and efficient alternatives to classification-only CNN pipelines for alphabet-level sign recognition.

Building on these capabilities, recent studies have explored lightweight and optimized YOLOv8 variants for real-time ASL recognition tasks. Alsharif et al. (2024) applied transfer learning with YOLOv8 for real-time ASL alphabet recognition, reporting 98% precision, 98% recall, a 99% F1 score, and a mean average precision (mAP) of 98%, highlighting the potential of YOLOv8 for rapid educational feedback [8]. Jia and Li (2023) proposed SLR-YOLO, an improved YOLOv8 network with a lighter backbone and enhanced feature fusion, achieving 90.6% and 98.5% accuracy on ASL and Bengali sign datasets, respectively [9]. Che et al. (2025) developed SFG-YOLOv8, an efficient gesture keypoint detector, showing mAP improvements for small-feature gestures with high inference speed, suitable for low-power augmented reality applications [10]. Alaftekin et al. (2024) demonstrated a YOLOv4-CSP-based real-time recognition system for Turkish sign language, achieving 98.95% precision, 98.15% recall, and 99.49% mAP, further confirming YOLO’s applicability across languages and datasets [11]. Collectively, these studies support the selection of YOLOv8-n (Nano) as a lightweight baseline for computational efficiency and YOLOv8-s (Small) for improved classification performance, balancing speed, accuracy, and resource requirements for educational platforms.

Together, these findings establish the potential of YOLO-based recognition and real-time hand tracking to provide immediate feedback on spelling accuracy and sequence formation, even when limited to alphabet and numerical recognition. While such frameworks neither interpret facial expressions and grammatical context nor support full ASL comprehension, they represent a meaningful advancement in supporting fingerspelling literacy and structured skill development for non-signers, particularly in remote or application-based learning environments.


### 1.1 Problem Gap

Although existing ASL recognition research demonstrates highly accurate classification of isolated handshapes, there remains a lack of systems capable of evaluating a learner’s production of continuous fingerspelling sequences. Current educational materials and automated systems primarily focus on recognition of single images rather than full sequences, limiting their usefulness for instructional platforms that monitor progress across multiple letters. There is a clear gap in accessible tools able to analyze free-form, pre-recorded fingerspelling attempts and transform frame-level predictions into coherent letter sequences suitable for feedback and assessment. While one-stage object detectors such as YOLO have demonstrated efficacy for single-letter recognition, few studies integrate sequence reconstruction with temporal smoothing to address variations in gesture timing.

### 1.2 System Justification

The proposed system addresses this gap by combining a YOLO-based CNN recognition model with a post-processing procedure that reconstructs letter sequences from continuous video frames. Lightweight variants of YOLO, including YOLOv8-n (Nano) and YOLOv8-s (Small), provide robust frame-level classification while minimizing computational overhead. Temporal smoothing and repetition-collapsing techniques compensate for inconsistencies in gesture execution, enabling accurate sequence reconstruction. This design aligns with pedagogical needs for evaluating active production, as highlighted in prior research on perceptual-motor processes and retention [3, 4], and facilitates the development of educational applications that provide meaningful feedback beyond isolated recognition tasks.
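
The post-processing idea can be illustrated with a minimal sketch. It assumes that each video frame has already been reduced to a single label (a letter, or `None` when no hand is detected). The function names, the window size, and the minimum run length are illustrative choices, not parameters fixed by this study.

```python
from collections import Counter
from typing import List, Optional


def smooth_labels(labels: List[Optional[str]], window: int = 5) -> List[Optional[str]]:
    """Replace each frame label with the majority label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighborhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


def collapse_runs(labels: List[Optional[str]], min_run: int = 3) -> str:
    """Emit one letter per run of identical labels lasting at least min_run frames."""
    letters = []
    run_label, run_length = None, 0
    for label in labels + [None]:  # trailing None flushes the final run
        if label == run_label:
            run_length += 1
            continue
        if run_label is not None and run_length >= min_run:
            letters.append(run_label)
        run_label, run_length = label, 1
    return "".join(letters)


# Hypothetical frame labels: "H" with a one-frame misclassification, a gap, then "E".
frames = ["H"] * 3 + ["M"] + ["H"] * 3 + [None] * 3 + ["E"] * 4
word = collapse_runs(smooth_labels(frames, window=3), min_run=3)
```

In this sketch, a doubled letter (as in "HELLO") is recovered only when the two handshapes are separated by frames labeled `None` or by another letter, so overly aggressive smoothing could merge them. How the final system handles doubled letters is left open here.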

### 1.3 Scope and Limitations

The system focuses exclusively on the ASL manual alphabet and numerical handshapes. Dynamic signs, facial expressions, grammatical structures, and full-sentence interpretation are outside the scope. Only offline processing of pre-recorded videos is considered, with real-time deployment reserved for future work. The model is trained on publicly available datasets of static handshapes and evaluated using learner-submitted videos. It assumes clear visibility of the hands and does not incorporate signer-specific calibration, limiting its current applicability to well-lit, unobstructed recordings.

### 1.4 Expected Contribution

This study contributes an integrated framework for automated fingerspelling assessment comprising (1) a YOLO-based CNN model for accurate letter recognition, (2) a frame-level inference procedure for video-based evaluation, and (3) a reconstruction method capable of generating coherent letter sequences. By leveraging lightweight YOLO variants optimized for accuracy and speed, the framework enables educational platforms to provide formative feedback for learners practicing continuous fingerspelling. The approach addresses the limitations of prior single-image recognition systems while remaining computationally accessible for offline instructional use.
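
As an illustration of component (2), the sketch below shows one plausible way to reduce the detections in each frame to a single label. It assumes the Ultralytics Python package is used for inference. The `Detection` tuple, the `frame_label` helper, the 0.5 confidence threshold, and the weight and video paths passed by the caller are hypothetical and are not prescribed by this study.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Detection = Tuple[str, float]  # (class label, confidence); hypothetical structure


def frame_label(detections: Iterable[Detection], min_conf: float = 0.5) -> Optional[str]:
    """Return the label of the most confident detection, or None if none reaches min_conf."""
    best = max(detections, key=lambda d: d[1], default=None)
    if best is None or best[1] < min_conf:
        return None
    return best[0]


def video_detections(weights: str, video_path: str) -> Iterator[List[Detection]]:
    """Yield the detections for each frame of a video (requires `ultralytics`)."""
    from ultralytics import YOLO  # imported lazily so frame_label works without it

    model = YOLO(weights)
    # stream=True returns a generator, so frames are processed one at a time.
    for result in model.predict(source=video_path, stream=True, verbose=False):
        yield [
            (result.names[int(cls)], float(conf))
            for cls, conf in zip(result.boxes.cls, result.boxes.conf)
        ]
```

A caller could then build the per-frame label list with `[frame_label(d) for d in video_detections(weights, video_path)]`. Keeping only the most confident detection assumes one signing hand per frame, which is consistent with the scope stated in Section 1.3.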

## 2. Objectives

The primary objective of this study is to develop an automated ASL fingerspelling recognition system suitable for educational platforms. The system will be trained on image-based datasets of ASL alphabet handshapes and evaluated using pre-recorded fingerspelling videos. Offline processing is the focus, while real-time recognition is reserved for future work. To achieve this, the study pursues the following specific objectives:

**Train a YOLOv8 Model for ASL Letter Recognition.**  
   Train YOLOv8 variants, including YOLOv8-n (Nano) and YOLOv8-s (Small), to classify individual ASL fingerspelled letters from static images. These models balance computational efficiency and recognition accuracy, enabling robust classification across varied handshapes and lighting conditions.

**Apply the Model to Pre-Recorded Video Inputs.**  
   Implement a procedure that extracts frames from pre-recorded videos and applies the trained YOLOv8 model to generate frame-level predictions, supporting sequential evaluation of continuous fingerspelling attempts.

**Reconstruct Letter Sequences From Frame Predictions.**  
   Convert per-frame predictions into coherent letter sequences, incorporating temporal smoothing, removal of transitional frames, and collapsing of repeated predictions to produce accurate spellings of words.

**Evaluate Model Performance and Identify the Optimal YOLO Variant.**  
   Assess model performance on the image dataset using mAP50-95, precision, and recall. For the pre-recorded video datasets, evaluate character accuracy, character error rate, word error rate, and subsequence accuracy. Compare the results of YOLOv8-n and YOLOv8-s to identify which variant offers the best balance of accuracy, inference speed, and computational efficiency for educational applications.
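
For the video-level metrics, character error rate (CER) and word error rate (WER) are conventionally defined as the edit distance between the reference and the predicted sequence (substitutions, deletions, and insertions) divided by the reference length. The sketch below follows that conventional definition. The function names are illustrative, and the exact formulations of character accuracy and subsequence accuracy used in this study are not assumed here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference item
                curr[j - 1] + 1,         # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER over characters; assumes a non-empty reference string."""
    return edit_distance(ref, hyp) / len(ref)


def word_error_rate(ref: str, hyp: str) -> float:
    """WER over whitespace-separated words; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)
```

For example, if a learner spells "HELLO" and the system reconstructs "HELO", one deletion over five reference characters gives a CER of 0.2.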

## 3. References

[1] Lapiak, J. A. (2000, January 1). ASL, the 3rd most studied foreign language. Handspeak. https://www.handspeak.com/learn/8/

[2] Pontecorvo, E., Higgins, M., Mora, J., Lieberman, A. M., Pyers, J., & Caselli, N. K. (2023). Learning a Sign Language Does Not Hinder Acquisition of a Spoken Language. Journal of Speech, Language, and Hearing Research, 66(4), 1291–1308. https://doi.org/10.1044/2022_JSLHR-22-00505

[3] Hopman, E. W. M., & MacDonald, M. C. (2018). Production Practice During Language Learning Improves Comprehension. Psychological Science, 29(6), 961–971. https://doi.org/10.1177/0956797618754486

[4] Martinez, D., & Singleton, J. L. (2018). Predicting sign learning in hearing adults: The role of perceptual-motor (and phonological?) processes. Applied Psycholinguistics, 39(5), 905–931. https://doi.org/10.1017/S0142716418000048

[5] Camgoz, N. C., Koller, O., Hadfield, S., & Bowden, R. (2018). Neural sign language translation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7784–7793.

[6] Poornima, I. G. A., Priya, G. S., Yogaraja, C. A., Venkatesh, R., & Shalini, P. (2024). Hand and sign recognition of alphabets using YOLOv5. SN Computer Science, 5(3), 1–11. https://doi.org/10.1007/s42979-024-02628-4

[7] Hasan, M., Paul, B. K., Islam, N., et al. (2025). Advancing real-time sign language detection for deaf and hearing-impaired communities: A customized YOLOv8 approach with tailored annotations in computer vision. BMC Artificial Intelligence, 1, 11. https://doi.org/10.1186/s44398-025-00010-9

[8] Alsharif, B., Alalwany, E., & Ilyas, M. (2024). Transfer learning with YOLOV8 for real-time recognition system of American Sign Language Alphabet. Franklin Open, 8, 100165. https://doi.org/10.1016/j.fraope.2024.100165

[9] Jia, W., & Li, C. (2023). SLR-YOLO: An improved YOLOv8 network for real-time sign language recognition. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 46(1), 1663–1680. https://doi.org/10.3233/JIFS-235132

[10] Che, W., Zhang, H., Wu, B., et al. (2025). SFG-YOLOv8: efficient and lightweight small-feature gesture keypoint detector. Journal of King Saud University – Computer and Information Sciences, 37, 44. https://doi.org/10.1007/s44443-025-00044-z

[11] Alaftekin, M., Pacal, I., & Cicek, K. (2024). Real-time sign language recognition based on YOLO algorithm. Neural Computing & Applications, 36, 7609–7624. https://doi.org/10.1007/s00521-024-09503-6