Where phonics meets the phoenix: a rebirth of speech. Personalized, accessible, and effective speech therapy powered by NLP and CV, helping patients with Broca's aphasia on the path to speech recovery.
Worldwide, approximately one in four individuals over the age of 25 will experience a stroke in their lifetime, and an estimated 12.2 million people are affected annually. Stroke remains a leading cause of disability across the globe, and the issue hits closer to home than we might realize: recently, a mentor and friend of mine suffered a stroke that resulted in Broca's aphasia, a language disorder caused by damage to the brain in which speech production is severely impaired while comprehension remains largely intact. Broca's aphasia heavily affects stroke patients and the elderly as cognitive decline progresses, and it is an extremely difficult condition to navigate for both the patient and their loved ones.
As a result, many patients who experience aphasia undergo speech therapy to regain their language abilities, but several issues stand in the way. First, speech therapy is costly: across providers and forms of therapy, sessions average almost $200, and the need for ongoing treatment compounds that cost. Second, accessibility is limited, since many patients find it difficult or impossible to see a therapist on a daily basis.
We set out to resolve these issues, prioritizing the user's experience along three axes: accessibility, effectiveness, and personalization. Our solution is a speech therapy web app designed to meet all of these goals. It serves both healthcare and educational purposes, targeting stroke patients, the elderly, individuals with language difficulties, and students learning English as a second language.
Phonix, our speech therapy web app, generates personalized practice with generative AI, gives targeted feedback using computer vision, and incorporates natural language processing through automatic speech recognition, letting users practice and improve from the comfort of their own homes. An easy-to-navigate UI includes a personalized stats tracker of the pronunciations the user excels at and those that need more practice. Phonix identifies the user's weak points (e.g., words or specific phonemes that require more practice for better precision) and analyzes both the user's speech and facial movements to deliver feedback grounded in the combined audio and lip analyses. It then generates new practice tailored to those weak points, imitating a real-life speech therapist but with greater efficiency and accessibility, with the goal of reaching a wider audience. To make the experience engaging, holistic, and realistic, practice terms span varying difficulties, drawing on a vast selection of both words and sentences.
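As a rough sketch of how such a per-phoneme stats tracker could work (the structure and function names below are illustrative, not our exact schema):

```python
from collections import defaultdict

# Hypothetical per-phoneme tracker: count attempts and successes so the app
# can surface the user's weakest phonemes and generate practice around them.
stats = defaultdict(lambda: {"attempts": 0, "correct": 0})

def record(phoneme: str, correct: bool) -> None:
    stats[phoneme]["attempts"] += 1
    stats[phoneme]["correct"] += int(correct)

def weakest(n: int = 3) -> list[str]:
    """Phonemes with the lowest accuracy, i.e. candidates for new practice."""
    return sorted(stats, key=lambda p: stats[p]["correct"] / stats[p]["attempts"])[:n]
```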
The integration of OpenCV, Dlib, OpenAI's Whisper and GPT-4o mini APIs, and the ElevenLabs API forms the foundation of our project. We use SQLite to store initial word and sentence data generated by RandomWordAPI Vercel and OpenAI's API, respectively. This data is displayed on our user interface, and the user's pronunciation of the text is recorded. Whisper transcribes the audio into text, which is then compared to the expected text from the database. To enhance accuracy, we use the english_to_ipa module to convert both the spoken and expected English text into the International Phonetic Alphabet (IPA). This lets us precisely isolate each syllable and assess discrepancies between the expected phonetic transcription and the user's pronunciation.
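A minimal sketch of this transcribe-and-compare step, assuming the open-source whisper package and the eng_to_ipa PyPI module (most likely what english_to_ipa refers to); file names and the expected text are placeholders:

```python
import whisper            # open-source Whisper ASR
import eng_to_ipa as ipa  # English-to-IPA conversion

# Transcribe the user's recording (file name is a placeholder).
model = whisper.load_model("base")
spoken_text = model.transcribe("user_recording.wav")["text"].strip().lower()

expected_text = "the quick brown fox"  # fetched from the SQLite database in practice

# Convert both to IPA and compare word by word to flag mispronunciations.
spoken_ipa = ipa.convert(spoken_text).split()
expected_ipa = ipa.convert(expected_text).split()

for expected_word, spoken_word in zip(expected_ipa, spoken_ipa):
    if expected_word != spoken_word:
        print(f"Mismatch: expected {expected_word}, heard {spoken_word}")
```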
Upon identifying a mispronunciation, the particular word and syllable are passed to a GPT-4o mini API call, which generates a phonetically similar word tailored to that specific mispronunciation. Concurrently, the user's audio file is run through the open-source Allosaurus phoneme recognizer to generate a series of timestamps corresponding to the pronounced phonemes. By identifying the mispronounced phoneme and its timestamp, we match it with the corresponding frame in the OpenCV feed, capturing the precise moment the phoneme was articulated.
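A hedged sketch of this stage, assuming the openai and allosaurus packages (the prompt, phoneme, and file names are illustrative; Allosaurus's timestamp mode emits one "start duration phone" line per phoneme):

```python
import cv2
from openai import OpenAI
from allosaurus.app import read_recognizer

# Ask GPT-4o mini for a practice word targeting the mispronounced syllable.
client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Give one English word phonetically similar to the syllable "
                   "/θɪŋ/ in 'think'. Reply with the word only.",
    }],
)
practice_word = reply.choices[0].message.content.strip()

# Run Allosaurus with timestamps, then locate the mispronounced phoneme.
recognizer = read_recognizer()
for line in recognizer.recognize("user_recording.wav", timestamp=True).splitlines():
    start, duration, phone = line.split()
    if phone == "θ":  # the phoneme flagged as mispronounced
        # Grab the webcam frame recorded at that exact moment.
        cap = cv2.VideoCapture("user_recording.mp4")
        fps = cap.get(cv2.CAP_PROP_FPS)
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(float(start) * fps))
        ok, frame = cap.read()
        cap.release()
        break
```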
Using Dlib's facial landmark detection, we extract the user's mouth positioning and compare it to a pre-labeled database of accurate mouth positions for each IPA phoneme. The system computes differences in the relative dimensional ratios to assess pronunciation accuracy, and personalized feedback is provided based on these findings. This feedback is coupled with the ElevenLabs API's text-to-speech functionality so the user can hear the correct pronunciation, which is essential for improvement. Finally, OpenAI's API is invoked to provide a comprehensive response based on the user's audio input as an additional layer of specific feedback.
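A minimal sketch of the mouth analysis and playback, assuming dlib's standard 68-point landmark model (points 48-67 cover the mouth); the reference ratio, threshold, file names, voice ID, and API key are all hypothetical:

```python
import math
import cv2
import dlib
import requests

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_ratio(gray_frame) -> float:
    """Mouth height/width ratio from dlib's 68-point landmark model."""
    face = detector(gray_frame)[0]
    pts = predictor(gray_frame, face)
    width = math.dist((pts.part(48).x, pts.part(48).y),
                      (pts.part(54).x, pts.part(54).y))   # mouth corners
    height = math.dist((pts.part(51).x, pts.part(51).y),
                       (pts.part(57).x, pts.part(57).y))  # top/bottom lip
    return height / width

frame = cv2.imread("phoneme_frame.png")  # frame captured at the phoneme's timestamp
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

REFERENCE = {"θ": 0.35}  # hypothetical pre-labeled ratio for this phoneme
observed = mouth_ratio(gray)
if abs(observed - REFERENCE["θ"]) / REFERENCE["θ"] > 0.2:  # illustrative threshold
    print("Place the tongue between the teeth and relax the jaw.")

# Let the user hear the correct pronunciation via ElevenLabs text-to-speech.
audio = requests.post(
    "https://api.elevenlabs.io/v1/text-to-speech/VOICE_ID",
    headers={"xi-api-key": "ELEVENLABS_KEY"},
    json={"text": "think"},
).content
```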
Throughout the process of developing Phonix, we learned many new technical skills and frameworks, discovered what makes a project successful, and experienced the power of AI firsthand.
To build the frontend of Phonix, we learned to use JavaScript, React, and CSS. We developed our backend in Python, through which we learned to research and adopt open-source modules such as allosaurus and english_to_ipa, integrate Python functions with a JavaScript server using Flask, make and manage API calls to GPT-4o mini and Whisper, construct databases with SQLite, and detect facial landmarks using computer vision with OpenCV and dlib. We also learned how to build our own dataset for use in computer vision and image analysis.
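As a hedged sketch of that Flask bridge (the route and payload are illustrative, not our exact API):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/analyze", methods=["POST"])
def analyze():
    # The React frontend posts the recorded audio; Python runs the analysis.
    request.files["audio"].save("user_recording.wav")
    # ... run the Whisper / Allosaurus / dlib pipeline here ...
    return jsonify({"feedback": "Great /s/; let's practice /θ/ next."})

if __name__ == "__main__":
    app.run(port=5000)
```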
This project was also a meaningful opportunity to immerse ourselves in the engineering process from start to finish. From ideating to researching to designing to programming to assembling Phonix, we learned the importance of flexibility, communication, and persistence in building a product from scratch in such a short time period. We found that frequent check-ins and updates are crucial for maintaining transparency about progress, especially because we delegated the various components (frontend, audio processing, visual processing, etc.) among our team members. We also discovered that integrating these separate components can be challenging, so communication and thorough planning are extremely important.
Lastly, we learned that a hackathon is a great deal of fun and learning and greatly appreciated this opportunity!
- Python
- Allosaurus by xinjli
- OpenAI
- Whisper
- GPT-4o mini
- SQLite
- OpenCV
- Dlib
- JavaScript
- React
- CSS
- ElevenLabs
- IC Speech Mouth Positioning
- RandomWordAPI Vercel
- Flask