# ASL Video-to-Text Translation Using CNN-Based Hand Sign Recognition and LLM Sequence Reconstruction

**Authors:** Usher Raymond Abainza, Dane Casey Casino, Kein Jake Culanggo, and Karylle dela Cruz

---

## 1. Background

American Sign Language (ASL) is a fully expressive visual language that integrates hand gestures, facial expressions, and body movements to convey meaning. While primarily used by Deaf and Hard-of-Hearing (DHH) individuals, ASL has gained increasing attention among non-signers seeking practical communication skills. Its prominence as a second language has grown substantially; according to the Modern Language Association, ASL ranks as the third most-studied modern or foreign language at U.S. universities [1], reflecting considerable interest among hearing learners. Moreover, research indicates that acquiring ASL does not hinder the development of spoken English, which supports its viability as a language that can be learned safely alongside English [2]. Among the foundational components of ASL, fingerspelling, the sequential representation of letters and numbers, plays a critical role in conveying proper nouns, technical terminology, and less common words. Mastery of fingerspelling requires not only recognition of individual handshapes but also the ability to produce and sequence them accurately to form meaningful words and numerical expressions.

Despite its importance, traditional educational approaches for teaching fingerspelling, including static illustrations, printed charts, and multiple-choice exercises, primarily assess recognition of isolated letters. These methods provide limited insight into a learner’s capacity to construct sequences accurately and often fail to capture common difficulties, such as inconsistent hand positioning, timing errors between consecutive letters, and the translation of visual recognition into motor execution. In contrast, research in language acquisition indicates that active production, rather than passive recognition, significantly enhances retention, comprehension, and the transfer of knowledge to practical contexts [3].

Perceptual-motor processes are also integral to effective ASL learning. Martinez and Singleton (2018) demonstrated that movement short-term memory and visuospatial short-term memory jointly predict how proficiently hearing non-signers acquire signs, emphasizing the need to practice actual hand movements rather than rely solely on visual recognition [4]. These findings indicate that educational systems capable of delivering immediate feedback on free-form fingerspelling attempts would address a critical pedagogical gap, particularly for learners engaged in remote or application-based instruction, by supporting skill development beyond rote memorization and multiple-choice evaluation.

Traditional computer vision techniques are frequently limited by variability in hand shapes across signers, which impedes accurate classification of fingerspelled letters and numbers [5]. Advances in deep learning have mitigated many of these challenges. Convolutional neural networks (CNNs) have proven highly effective for image-based classification of complex handshapes. For example, Alsharif et al. (2023) evaluated multiple deep architectures, including ResNet-50 and EfficientNet, on a dataset of over 87,000 ASL alphabet images, achieving accuracy rates exceeding 99.9% and demonstrating the feasibility of reliable letter-level recognition [6]. Similarly, Zhang (2025) compared several CNN models, including a custom CNN, ResNet-50, EfficientNet-B0, and Inception-V3, on the same dataset, reporting near-perfect performance for EfficientNet-B0 and ResNet-50 (99.8% and 99.6%, respectively) [7]. Additionally, frameworks such as MediaPipe facilitate real-time hand detection and landmark extraction, enabling CNN models to focus exclusively on letter and number classification rather than hand localization. This approach simplifies preprocessing and can enhance recognition accuracy, making real-time feedback for educational applications feasible [8].

Together, these studies establish the potential of CNN-based recognition and real-time hand tracking to provide immediate feedback on spelling accuracy and sequence formation, even when limited to alphabet and numerical recognition. While such frameworks do not interpret facial expressions or grammatical context, and do not amount to full ASL comprehension, they represent a meaningful advancement in supporting fingerspelling literacy and structured skill development for non-signers, particularly in remote or application-based learning environments.


---

## 2. Objectives

The primary objective of this study is to develop an automated ASL fingerspelling recognition system designed for integration into educational platforms. The system will be trained on image-based datasets of ASL alphabet and number hand signs and evaluated using pre-recorded videos of fingerspelling attempts. The study is limited to offline processing of recorded video; real-time recognition is left as a recommendation for future research. To achieve this aim, the study pursues the following specific objectives:

1. **Train a CNN Model for ASL Letter and Number Recognition**  
   Develop and train a convolutional neural network capable of accurately classifying individual ASL fingerspelled letters and numbers from static images. The model will learn robust handshape features to support reliable recognition across varying visual conditions.

2. **Apply the Model to Pre-Recorded Video Inputs**  
   Implement a video-processing pipeline that extracts frames from pre-recorded videos and applies the trained CNN to generate frame-level predictions, enabling practical evaluation of the model's performance in a more realistic, sequential setting.

3. **Reconstruct Letter and Number Sequences From Frame Predictions**  
   Design and apply a post-processing method to convert per-frame classifications into coherent letter or number sequences. This includes temporal smoothing, removal of transitional frames, and collapsing repeated predictions to produce accurate spellings of words or numbers.

4. **Evaluate Model Performance and Identify the Most Effective Architecture**  
   Evaluate the CNN's classification accuracy on image and pre-recorded video datasets, reporting per-letter accuracy, Character Error Rate (CER), and confusion matrices. Compare several CNN architectures (VGG, ResNet, MobileNet) to identify the best balance of accuracy, efficiency, and generalization.
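
To make the post-processing and evaluation objectives concrete, the following is a minimal Python sketch of one possible approach, not the study's final method. It assumes per-frame labels are already available from the CNN, and that transitional or low-confidence frames have been mapped to a placeholder label `"_"`. The function names, the window size of 5, and the minimum run length of 3 are all illustrative assumptions. Temporal smoothing is done by majority vote over a sliding window, repeated predictions are collapsed run by run, and CER is computed as the Levenshtein edit distance divided by the reference length.

```python
from collections import Counter
from itertools import groupby


def smooth_predictions(labels, window=5):
    """Majority-vote smoothing: each frame takes the most common label
    in a window centered on it (window size is an assumed parameter)."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighborhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


def collapse_runs(labels, min_run=3, blank="_"):
    """Collapse consecutive identical labels into one character, dropping
    the assumed transition label and runs shorter than min_run frames."""
    sequence = []
    for label, run in groupby(labels):
        run_length = sum(1 for _ in run)
        if label != blank and run_length >= min_run:
            sequence.append(label)
    return "".join(sequence)


def character_error_rate(reference, hypothesis):
    """CER = Levenshtein distance(reference, hypothesis) / len(reference)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(reference)


# Hypothetical per-frame labels for a clip spelling "CAT";
# "X" stands for a single misclassified frame, "_" for transition frames.
frames = list("CCCCC__AAAXAA__TTTTT")
decoded = collapse_runs(smooth_predictions(frames))
print(decoded)                               # CAT
print(character_error_rate("CAT", decoded))  # 0.0
```

One limitation of this sketch is that doubled letters (as in "HELLO") collapse into a single character unless a run of transition frames longer than the smoothing window separates them, so a full pipeline would need an additional cue for repeated letters.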


---   

## References

[1] ASL, the 3rd most studied foreign language. (2000, January 1). 1996-2025 Jolanta A. Lapiak. https://www.handspeak.com/learn/8/

[2] Pontecorvo, E., Higgins, M., Mora, J., Lieberman, A. M., Pyers, J., & Caselli, N. K. (2023). Learning a Sign Language Does Not Hinder Acquisition of a Spoken Language. *Journal of Speech, Language, and Hearing Research, 66*(4), 1291–1308. https://doi.org/10.1044/2022_JSLHR-22-00505

[3] Hopman, E. W. M., & MacDonald, M. C. (2018). Production Practice During Language Learning Improves Comprehension. *Psychological Science, 29*(6), 961–971. https://doi.org/10.1177/0956797618754486

[4] Martinez, D., & Singleton, J. L. (2018). Predicting sign learning in hearing adults: The role of perceptual-motor (and phonological?) processes. *Applied Psycholinguistics, 39*(5), 905–931. https://doi.org/10.1017/S0142716418000048

[5] Camgoz, N. C., Koller, O., Hadfield, S., & Bowden, R. (2018). Neural sign language translation. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 7784–7793.

[6] Alsharif, B., Altaher, A. S., Altaher, A., Ilyas, M., & Alalwany, E. (2023). Deep Learning Technology to Recognize American Sign Language Alphabet. *Sensors, 23*(18), 7970. https://doi.org/10.3390/s23187970

[7] Zhang, C. (2025). A comparative study of deep CNN architectures for static American sign language recognition. *Science and Technology of Engineering Chemistry and Environmental Protection, 1*(4). https://doi.org/10.61173/yackf205

[8] Almarzouqi, H., Hussain, A. J., Al-Jumeily, D., & Fergus, P. (2020). Vision-based American Sign Language recognition using deep learning. *International Journal of Advanced Computer Science and Applications, 11*(4), 237–244.