Bridging sign language and spoken language in real time
SignBridge is an end-to-end communication tool that translates sign language into text and speech while simultaneously transcribing spoken language into text, enabling real-time conversations between deaf and hearing individuals. It combines on-device computer vision, deep learning classifiers, and Google Gemini for transcription, translation, and text-to-speech.
- Fingerspelling Recognition — Detects hand landmarks via MediaPipe and classifies ASL letters A–Z in real time using a TFLite model (see the landmark-extraction sketch after this list).
- Word-Level Sign Recognition (experimental) — Buffers pose + hand landmark sequences and predicts sign glosses using a PyTorch LSTM/Transformer model trained on the WLASL dataset.
- Live Speech Transcription — Streams 16 kHz PCM audio over a WebSocket to Google Gemini Live for low-latency captions.
- Cross-Language Translation — When the deaf and hearing users speak different languages, the live transcription pipeline doubles as a spoken-language interpreter.
- Text-to-Speech Narration — Converts accumulated sign sequences into natural speech audio via Gemini's TTS modality.
- Responsive Web App — Modern SPA with pages for the interactive demo, live transcription, technology overview, methodology, and team info.
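To make the fingerspelling pipeline concrete, here is a minimal landmark-extraction sketch using MediaPipe's Python solutions API (the web app runs the Hand Landmarker in the browser; this standalone loop is an illustration, not code from fingerspelling_demo.py). It flattens the 21 detected landmarks into the (x, y) vector a letter classifier would consume:

```python
# Minimal sketch: webcam -> MediaPipe hand landmarks -> flat (x, y) feature vector.
# Uses the legacy mediapipe.solutions API as an illustrative stand-in.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        # 21 landmarks x (x, y) -> 42-dimensional feature vector
        features = [coord for p in lm for coord in (p.x, p.y)]
        print(len(features), features[:4])
    cv2.imshow("hands", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```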
```
┌──────────────────────────────────────────────────────┐
│                   Frontend (React)                   │
│  MediaPipe Hand/Pose Landmarker · AudioWorklet PCM   │
└──────────┬──────────────┬───────────────┬────────────┘
           │ REST         │ REST          │ WebSocket
           ▼              ▼               ▼
┌──────────────────────────────────────────────────────┐
│                  Backend (FastAPI)                    │
│                                                      │
│  POST /predict            → TFLite fingerspelling    │
│  POST /predict/wlasl      → PyTorch WLASL classifier │
│  POST /gemini/transcribe  → Gemini speech-to-text    │
│  POST /gemini/narrate     → Gemini TTS (WAV base64)  │
│  WS   /ws/live-transcribe → Gemini Live streaming    │
│  GET  /wlasl/status       → Model status & metadata  │
└──────────────────────────────────────────────────────┘
```
| Layer | Technology |
|---|---|
| Frontend | React 19, TypeScript 5.9, Vite 8, Tailwind CSS 4, React Router 7 |
| Backend | FastAPI, Uvicorn, Pydantic v2, Python 3.11+ |
| ML — Fingerspelling | TFLite (ai-edge-litert), MediaPipe Hand Landmarker |
| ML — Word Signs | PyTorch (BiLSTM w/ attention or Transformer), MediaPipe Holistic |
| Speech / TTS | Google Gemini API (google-genai) — transcription, live streaming, text-to-speech |
| Training | MediaPipe 0.10.14 feature extraction, PyTorch training loop, WLASL dataset |
```
SignBridge/
├── frontend/                    # React SPA
│   ├── src/
│   │   ├── pages/               # Home, Demo, Transcription, Technology, Methodology, Team
│   │   ├── components/          # Layout, Navbar, Footer
│   │   └── lib/                 # liveCaption WebSocket helper
│   └── public/                  # Static assets, PCM AudioWorklet processor
├── backend/                     # FastAPI server
│   ├── main.py                  # Routes & app setup
│   ├── wlasl_engine.py          # WLASL model loader & inference
│   └── gemini_live_ws.py        # Gemini Live WebSocket bridge
├── Models/
│   ├── Fingerspelling/          # TFLite model + label CSV (A–Z)
│   └── 2000_common_word/        # PyTorch WLASL checkpoint
├── Training/
│   ├── mediapipe_extraction.py  # Feature extraction from video
│   ├── wlasl_2000_train.py      # LSTM/Transformer training script
│   ├── checkpoints/             # Saved model weights
│   └── reports/latest/          # Accuracy reports, confusion matrices
├── Media/                       # Logo & promo assets
├── wlasl_demo.py                # Standalone WLASL inference demo
├── fingerspelling_demo.py       # Standalone fingerspelling demo
├── hand_points_demo.py          # MediaPipe landmark visualizer
├── requirements.txt             # Python deps for demos & training
└── netlify.toml                 # Frontend deployment config
```
- Node.js >= 20
- Python >= 3.11
- A Google Gemini API key (for transcription, TTS, and live captioning)
```bash
git clone https://github.com/<your-org>/SignBridge.git
cd SignBridge
```

Backend setup:

```bash
cd backend
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Create a `.env` file (or edit the existing one):

```
GEMINI_API_KEY=your_gemini_api_key_here
```

The Gemini key is optional at the server level — the frontend UI also provides an input field that sends the key per-request.
Frontend setup:

```bash
cd frontend
npm install
```

From the `frontend/` directory:

```bash
npm run dev:all
```

This starts Vite (frontend on http://localhost:5173) and Uvicorn (backend on http://127.0.0.1:8000) concurrently. The Vite dev server proxies `/api/*` requests to the backend automatically.
Alternatively, run them separately:
```bash
# Terminal 1 — backend
cd backend
uvicorn main:app --host 127.0.0.1 --port 8000 --reload

# Terminal 2 — frontend
cd frontend
npm run dev
```

Navigate to http://localhost:5173 and grant camera/microphone permissions when prompted.
Fingerspelling model (TFLite):

- Input: 21 hand landmarks (x, y) extracted by MediaPipe
- Output: One of 26 classes (A–Z)
- Format: TFLite (quantized, runs on-device via `ai-edge-litert`)
- Location: `Models/Fingerspelling/keypoint_classifier.tflite`
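A hedged sketch of running this model with the LiteRT interpreter follows; the flat [1, 42] input layout and the absence of extra preprocessing are assumptions, so verify them against the interpreter's reported input details and the label CSV:

```python
# Hypothetical inference against the fingerspelling classifier.
import numpy as np
from ai_edge_litert.interpreter import Interpreter

interpreter = Interpreter(model_path="Models/Fingerspelling/keypoint_classifier.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Stand-in for a real flattened landmark vector: 21 landmarks x (x, y) = 42 values.
landmarks = np.random.rand(1, 42).astype(np.float32)
interpreter.set_tensor(inp["index"], landmarks)
interpreter.invoke()
probs = interpreter.get_tensor(out["index"])[0]
print("Predicted class index:", int(np.argmax(probs)))  # map to A–Z via the label CSV
```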
Word-level sign model (WLASL, PyTorch):

- Input: Sequence of frames, each a 258-dimensional vector (33×4 pose + 21×3 per hand)
- Output: Word-level gloss prediction
- Architecture: Bidirectional LSTM with attention pooling, or Transformer encoder
- Location: `Models/2000_common_word/wlasl_best.pt`
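To make the 258-dimensional layout concrete, the sketch below assembles one frame's vector from MediaPipe Holistic output: 33 pose landmarks with (x, y, z, visibility), plus 21 landmarks with (x, y, z) for each hand, zero-filled when a hand is not visible. The exact ordering used by mediapipe_extraction.py is an assumption here:

```python
import numpy as np

def frame_features(results):
    """Flatten one MediaPipe Holistic result into a 258-dim vector.
    Layout (33*4 + 21*3 + 21*3 = 258) is illustrative, not read from the repo."""
    pose = np.zeros(33 * 4, dtype=np.float32)
    left = np.zeros(21 * 3, dtype=np.float32)
    right = np.zeros(21 * 3, dtype=np.float32)

    if results.pose_landmarks:
        pose = np.array([v for p in results.pose_landmarks.landmark
                         for v in (p.x, p.y, p.z, p.visibility)], dtype=np.float32)
    if results.left_hand_landmarks:
        left = np.array([v for p in results.left_hand_landmarks.landmark
                         for v in (p.x, p.y, p.z)], dtype=np.float32)
    if results.right_hand_landmarks:
        right = np.array([v for p in results.right_hand_landmarks.landmark
                          for v in (p.x, p.y, p.z)], dtype=np.float32)

    return np.concatenate([pose, left, right])  # shape: (258,)
```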
```bash
python Training/mediapipe_extraction.py \
    --wlasl_root path/to/wlasl_videos \
    --feature_mode hands_pose
```

Requires Python 3.11 (MediaPipe compatibility). Outputs `.npy` feature sequences per video.
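Before training or inference, each variable-length sequence is typically padded or truncated to a fixed number of frames. The helper below is an illustrative sketch using the `WLASL_MAX_FRAMES` default of 50; it is not taken from the training script:

```python
import numpy as np

def load_sequence(path, max_frames=50, feat_dim=258):
    """Load one extracted .npy sequence and fit it to a fixed length."""
    seq = np.load(path).astype(np.float32)   # shape: (T, 258)
    if len(seq) >= max_frames:
        return seq[:max_frames]               # truncate long clips
    pad = np.zeros((max_frames - len(seq), feat_dim), dtype=np.float32)
    return np.concatenate([seq, pad], axis=0) # zero-pad short clips

# Example: batch = np.stack([load_sequence(p) for p in paths])  # (N, 50, 258)
```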
```bash
python Training/wlasl_2000_train.py \
    --wlasl_root Training/wlasl_100 \
    --arch transformer \
    --epochs 100 \
    --batch_size 32
```

Key flags: `--arch` (lstm | transformer), `--loss` (ce | focal), `--weighted_sampling`, `--augment_noise`, `--augment_drop`. Checkpoints and reports are saved to `Training/checkpoints/` and `Training/reports/latest/`.
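As an illustration of what a flag like `--weighted_sampling` usually implies, the sketch below shows the standard PyTorch recipe: sample each clip with probability inversely proportional to its class frequency so rare glosses appear as often as common ones. This is a generic example, not the script's exact implementation:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# labels: one integer gloss id per training clip (illustrative values)
labels = torch.tensor([0, 0, 0, 1, 2, 2])
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]   # rarer classes get larger weights

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```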
Backend environment variables:

| Variable | Default | Description |
|---|---|---|
| `GEMINI_API_KEY` | — | Google Gemini API key (optional if provided per-request) |
| `GEMINI_TTS_MODEL` | `gemini-2.5-flash-preview-tts` | Model used for text-to-speech |
| `GEMINI_TTS_VOICE` | `Kore` | TTS voice name |
| `GEMINI_TRANSCRIBE_MODEL` | `gemini-2.5-flash` | Model used for batch transcription |
| `GEMINI_LIVE_MODEL` | `gemini-2.5-flash-native-audio-preview` | Model used for live WebSocket transcription |
| `WLASL_CHECKPOINT` | `Models/2000_common_word/wlasl_best.pt` | Path to WLASL model checkpoint |
| `WLASL_MAX_FRAMES` | `50` | Max sequence length for WLASL inference |
Frontend build environment:

| Variable | Default | Description |
|---|---|---|
| `VITE_API_BASE_URL` | — | Production API base URL (omit for dev proxy) |
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/predict` | Classify fingerspelled letter from 21 hand landmarks |
| `POST` | `/predict/wlasl` | Classify word-level sign from a frame sequence |
| `GET` | `/wlasl/status` | Check WLASL model load status and metadata |
| `POST` | `/gemini/transcribe` | Transcribe audio (base64) via Gemini |
| `POST` | `/gemini/narrate` | Generate TTS audio from text, with optional translation |
| `WS` | `/ws/live-transcribe` | Stream live PCM audio for real-time captions |
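For a quick local smoke test of the REST endpoints, something like the following should work once the backend is running. The request body for `/predict` (the "landmarks" field and its shape) is a guess; check the auto-generated FastAPI docs at http://127.0.0.1:8000/docs for the real schemas:

```python
import requests

BASE = "http://127.0.0.1:8000"

# Check whether the WLASL checkpoint loaded
print(requests.get(f"{BASE}/wlasl/status").json())

# Classify a fingerspelled letter from 21 (x, y) landmarks.
# NOTE: "landmarks" is a hypothetical field name; inspect /docs for the actual schema.
payload = {"landmarks": [[0.5, 0.5]] * 21}
resp = requests.post(f"{BASE}/predict", json=payload)
print(resp.status_code, resp.json())
```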
The frontend is configured for Netlify deployment via `netlify.toml`:

- Build command: `npm ci && npm run build`
- Publish directory: `frontend/dist`
- SPA fallback: all routes redirect to `index.html`
The backend can be deployed to any Python hosting platform (Railway, Render, AWS, etc.). Set VITE_API_BASE_URL in the frontend build environment to point to the deployed backend URL.
Run these from the project root to test models independently with a webcam:
```bash
# Fingerspelling recognition
python fingerspelling_demo.py

# WLASL word recognition
python wlasl_demo.py

# MediaPipe landmark visualization
python hand_points_demo.py
```

This project is licensed under the MIT License.