Demo Video: https://youtube.com/shorts/SjxQXo5ecP4
1. If you did not clone with submodules, initialize the backend submodule:

   ```sh
   git submodule update --init --recursive
   ```

2. In the repository root, install frontend dependencies and build:

   ```sh
   npm i
   npm run build
   ```

3. Copy the generated `dist/` folder into `backend/dist/`:

   ```sh
   cp -r dist backend/dist
   ```

4. Go into the backend folder and install backend dependencies:

   ```sh
   cd backend
   poetry install
   ```

5. Activate the virtual environment:

   ```sh
   source .venv/bin/activate
   ```

6. Start the backend development server:

   ```sh
   poe dev
   ```
The webapp will be hosted at http://localhost:3000/index.html.
The environment description feature requires an OpenAI-compatible backend server running a multimodal model.
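As a sketch, a request to such a server typically follows the OpenAI chat-completions format, with the camera frame passed inline as a base64 data URL. The function name, model name, and prompt below are illustrative assumptions, not the project's actual code:

```python
import base64
import json

def build_vision_request(prompt: str, jpeg_bytes: bytes,
                         model: str = "local-multimodal") -> dict:
    """Build an OpenAI-compatible chat-completions payload with an inline image.

    `model` is a placeholder; a llama.cpp server accepts whatever model
    name it was launched with.
    """
    data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }

payload = build_vision_request("What obstacles are ahead?", b"\xff\xd8\xff")
print(json.dumps(payload)[:40])
```

POSTing this payload to the server's `/v1/chat/completions` endpoint returns the description text in the standard completions response shape.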
EchoPath was built to address a gap in accessibility tools: many systems assume visual interaction first. This project treats audio as the primary interface for blind and low-vision users.
EchoPath is a real-time voice-and-vision navigation assistant that:
- Streams live camera frames to the backend for perception.
- Listens for the wake phrase “hey john.”
- Captures a spoken command after wake-word detection.
- Sends the command plus the current frame to the backend (`query_llm`).
- Speaks concise, non-visual responses from the backend (`query_llm_response`).
- Plays spatial audio cues for nearby obstacles using 3D position data.
- Frontend: React + TypeScript + Capacitor camera integration.
- Transport: WebSocket for continuous frame and message streaming.
- Voice loop: Browser speech recognition + speech synthesis.
- Spatial audio: Custom Web Audio directional cue engine.
- Backend: FastAPI orchestration, Hugging Face Transformers (depth), Ultralytics YOLO (detection), and llama.cpp exposed via an OpenAI-compatible API.
- Message contracts: `image`, `query_llm`, `query_llm_response`.
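These contracts can be pictured as small JSON envelopes over the WebSocket. Only the `type` tags come from the project; the other field names here are illustrative assumptions:

```python
import json

# Hypothetical envelope shapes for the three message types; field names
# other than "type" are assumptions, not the project's actual schema.
def image_msg(jpeg_b64: str) -> str:
    return json.dumps({"type": "image", "data": jpeg_b64})

def query_llm_msg(command: str) -> str:
    return json.dumps({"type": "query_llm", "command": command})

def query_llm_response_msg(text: str) -> str:
    return json.dumps({"type": "query_llm_response", "text": text})

print(query_llm_msg("describe my surroundings"))
```

Keeping the contract this small is what lets the frontend stream `image` frames continuously while `query_llm`/`query_llm_response` pairs ride the same socket.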
- Stable wake-word behavior in continuous recognition.
- Avoiding stale transcript and repeated-command state bugs.
- Recovery and retries after speech/WebSocket failures.
- Keeping responses actionable without visual-only language.
- Meeting real-time timing constraints across capture, network, inference, and TTS.
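A common safeguard for the speech and WebSocket failures above is a bounded retry with exponential backoff. This generic sketch is not the project's actual recovery code:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on exception with exponential backoff.

    Re-raises the last error once the attempt budget is exhausted, so the
    caller can fall back to a spoken error message instead of going silent.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Bounding the attempts matters here: an unbounded retry loop would silently stall the voice loop, which for a non-visual interface is worse than an explicit spoken failure.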
- End-to-end wake-word → command → backend → spoken-response loop.
- Live camera streaming integrated with backend vision + LLM querying.
- Spatial audio cues for obstacle direction and proximity.
- Improved robustness with timeout and retry safeguards.
- Hands-free, accessibility-first interaction flow.
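The directional cues reduce to mapping an obstacle's camera-relative 3D position to an azimuth and loudness for the audio engine; in the browser this would feed Web Audio panning, but the geometry itself is simple. The coordinate convention and falloff curve below are assumptions for illustration:

```python
import math

def obstacle_cue(x: float, y: float, z: float) -> tuple[float, float]:
    """Map a camera-relative obstacle position to (azimuth_deg, gain).

    Assumes +x is right and +z is forward of the camera (a convention,
    not necessarily the project's). Azimuth is degrees clockwise from
    straight ahead; gain falls off with distance so closer obstacles
    sound louder.
    """
    azimuth = math.degrees(math.atan2(x, z))
    distance = math.sqrt(x * x + y * y + z * z)
    gain = 1.0 / (1.0 + distance)  # simple inverse falloff, max 1.0 at zero distance
    return azimuth, gain

print(obstacle_cue(1.0, 0.0, 1.0))  # obstacle 45 degrees to the right
```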
- Accessibility-first design changes system architecture, not only UI text.
- Reliability and state handling are as important as model quality.
- Clear protocol contracts accelerate iteration in real-time systems.
- Helpful guidance for blind users should be concise, actionable, and sensory-aware.
- Stronger on-device/offline fallback for voice commands.
- Better personalization (voice style, verbosity, route preferences).
- Expanded route safety signals (surface, curb, and crossing cues).
- Confidence-aware responses when model certainty is low.
- Broader testing with blind and low-vision participants.
