Command to enable: Start clothing mode
In this mode, the app detects different types of clothes, colors, and patterns to help users make better clothing choices. Users just need to point the camera at the clothes and tap anywhere on the screen, and the app will then speak the detection out loud.
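As a rough client-side sketch of this point-and-tap flow (not the app's actual code; the server URL and the 'frame'/'detection' Socket.IO event names are assumptions for illustration):

```javascript
// Assumes the socket.io client is loaded via a <script> tag.
const socket = io('https://example-server');   // hypothetical server URL

const video = document.querySelector('video');
const canvas = document.createElement('canvas');

document.body.addEventListener('click', () => {
  // Grab the current camera frame and send it along with the active mode.
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext('2d').drawImage(video, 0, 0);
  socket.emit('frame', { mode: 'clothing', image: canvas.toDataURL('image/jpeg') });
});

// Speak whatever the server detected (clothes, colors, patterns).
socket.on('detection', (text) => {
  speechSynthesis.speak(new SpeechSynthesisUtterance(text));
});
```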
Command to enable: Start walking mode
Walking Mode can help users navigate safely by monitoring moving cars.
Command to enable: Start shopping mode
Shopping Mode helps users find and buy things. Users just need to say "Where's (object name)?" and the app will monitor in real time until the object is found.
Command to enable: Start objects mode
Objects Mode helps users find objects. Users just need to say "Where's (object name)?" and the app will monitor in real time until the object is found.
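A minimal sketch of how this continuous monitoring could look on the server side; the event names and the detectObjects helper are assumptions for illustration, not the project's real code:

```javascript
const io = require('socket.io')(3000);

io.on('connection', (socket) => {
  let target = null;                                       // set when the user says "Where's ...?"

  socket.on('find-object', (name) => { target = name.toLowerCase(); });

  socket.on('frame', async (image) => {
    if (!target) return;
    const objects = await detectObjects(image);            // hypothetical call into the object-detection API
    if (objects.some((o) => o.label.toLowerCase() === target)) {
      socket.emit('speak', `Found the ${target}`);         // the app reads this out loud
      target = null;                                       // stop monitoring once found
    }
  });
});
```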
Command to enable: Start labels mode
Labels Mode helps users choose the right product for them. Users just need to point the camera at the product, tap anywhere on the screen, and the app then says the most relevant info from the product (names, nouns, brands, adjectives).
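As an illustration only (the post doesn't name the NLP tooling), filtering the recognized text down to the word types mentioned above could look something like this with a library such as compromise:

```javascript
const nlp = require('compromise');

// Keep only the word types Labels Mode reads out: nouns (names, brands) and adjectives.
function relevantInfo(ocrText) {
  const doc = nlp(ocrText);
  return doc.nouns().out('array').concat(doc.adjectives().out('array')).join(', ');
}
```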
Command to enable: Start menu mode
Menu Mode helps users read menus and decide what to eat. Users just need to point the camera at the menu, tap anywhere on the screen, and the app then reads out the menu items with their prices. Users can tap again at any moment to interrupt the speech.
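The Web Speech synthesis API exposed by WebKit supports cancelling speech in progress, so the "tap again to interrupt" behaviour can be sketched roughly like this (the event wiring is illustrative, not the app's actual code):

```javascript
// Speak the recognized menu items, and let a second tap cut the speech short.
function speakMenu(text) {
  speechSynthesis.cancel();                                  // stop anything already being spoken
  speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}

document.body.addEventListener('click', () => {
  if (speechSynthesis.speaking) speechSynthesis.cancel();    // tap again to interrupt
});
```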
Command to enable: Start numbers mode
Numbers Mode helps users read numbers (especially phone numbers).
Command to enable: Start explore mode
Explore Mode helps users explore and understand the world around them. From the description, they can imagine what’s around and better enjoy going out.
When the user needs to stop the current mode, s/he just needs to say "stop" or "exit".
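As a rough illustration (not the app's actual code), the recognized transcript could be mapped to these voice commands with a simple matcher like the one below; the returned action shape is an assumption:

```javascript
// Map a recognized transcript to one of the voice commands above.
function parseCommand(transcript) {
  const text = transcript.toLowerCase();
  const start = text.match(/start (\w+) mode/);
  if (start) return { action: 'start', mode: start[1] };          // e.g. "start menu mode"
  if (/\b(stop|exit)\b/.test(text)) return { action: 'stop' };    // leave the current mode
  return { action: 'none' };
}
```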
The audio from the app is streamed to the Socket.IO server (Node.js) for continuous real-time speech-to-text recognition.
The server uses Voice Activity Detection (VAD) to detect when the user has started speaking and when s/he has stopped. The audio chunks are streamed continuously to a DeepSpeech model (TensorFlow), which is based on Baidu's Deep Speech research paper. When the user finishes speaking, recognition completes and the result is processed to determine which actions should be performed. The detected actions are then returned to the app.
The video from the app is streamed to the Socket.IO server (Node.js) for continuous real-time image recognition.
This server acts as an API gateway: it receives an incoming image along with the current mode and routes the image for processing to the mode-specific APIs, which are described next.
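A simplified sketch of that gateway routing, with made-up endpoint URLs standing in for the mode-specific APIs:

```javascript
const io = require('socket.io')(3000);
const axios = require('axios');

// Placeholder endpoints, not the project's real services.
const MODE_APIS = {
  clothing: 'http://clothing-api/detect',
  objects:  'http://objects-api/detect',
  labels:   'http://labels-api/read',
  menu:     'http://menu-api/read',
};

io.on('connection', (socket) => {
  socket.on('frame', async ({ mode, image }) => {
    const api = MODE_APIS[mode];
    if (!api) return;
    const { data } = await axios.post(api, { image });   // forward to the mode-specific API
    socket.emit('detection', data.text);                 // the app speaks this result
  });
});
```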
The app is a Progressive Web App, which allows it to be developed in HTML5 and JavaScript while still being added to the users' Home Screen (available from Safari on iOS). Audio is synthesized using the Text-to-Speech API available in the iOS WebKit JavaScript API, and video and audio are captured using APIs available in the iOS WebKit JavaScript API.
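The capture and speech pieces rely on standard browser APIs exposed by WebKit; roughly (simplified, not the app's actual code):

```javascript
// Capture the camera and microphone via getUserMedia (available in iOS Safari).
async function startCapture() {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: 'environment' },   // rear camera
    audio: true,
  });
  document.querySelector('video').srcObject = stream;
  return stream;
}

// Speak results with the Web Speech synthesis API.
function speak(text) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = 1.1;                     // slightly faster speech (illustrative choice)
  speechSynthesis.speak(utterance);
}
```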