Polyglot — Live Screen Companion

Talk to your screen. In any language.

Polyglot is an AI-powered live screen companion that watches what you see and has natural voice conversations about it — in 8+ languages. Built for the Gemini Live Agent Challenge, it combines Gemini's native audio streaming with real-time screen understanding.

Live Demo: polyglot-frontend-1064261519338.us-east1.run.app

What It Does

Share your screen and start talking. Polyglot sees what you see and responds with natural voice — no typing needed.

Voice Conversation — Full-duplex voice chat powered by Gemini Live's native audio streaming. Interrupt mid-sentence, ask follow-ups, have a real conversation.
Screen Understanding — Smart auto-capture detects when your screen changes and sends frames to Gemini. Switch tabs, scroll, watch a video — Polyglot keeps up.
Multilingual — Speak in English, Hindi, Spanish, Japanese, German, French, Portuguese, or Korean. Polyglot responds in the same language.
Real-time Transcription — See what you said and what Polyglot said, live in the conversation panel.

Architecture Overview

graph LR
    subgraph Browser["Frontend (Next.js)"]
        MIC["🎤 Microphone"] --> |PCM 16kHz| WS_CLIENT["WebSocket Client"]
        SCREEN["🖥️ Screen Capture"] --> |JPEG frames| WS_CLIENT
        WS_CLIENT --> |Audio playback| SPEAKER["🔊 Speaker"]
        WS_CLIENT --> |Transcripts| UI["💬 Chat UI"]
    end

    subgraph CloudRun["Backend (FastAPI on Cloud Run)"]
        WS_SERVER["WebSocket Server"] --> AUDIO_Q["Audio Queue"]
        WS_SERVER --> IMAGE_Q["Image Queue"]
        WS_SERVER --> TEXT_Q["Text Queue"]
        AUDIO_Q --> GEMINI_SESSION["Gemini Live Session"]
        IMAGE_Q --> GEMINI_SESSION
        TEXT_Q --> GEMINI_SESSION
        GEMINI_SESSION --> OUTPUT_Q["Output Queue"]
        OUTPUT_Q --> WS_SERVER
    end

    WS_CLIENT <--> |"WebSocket (wss://)"| WS_SERVER
    GEMINI_SESSION <--> |"Bidirectional Stream"| GEMINI["Gemini Live API"]

    style Browser fill:#1a1a2e,stroke:#6c63ff,color:#fff
    style CloudRun fill:#16213e,stroke:#0f3460,color:#fff
    style GEMINI fill:#4a0e4e,stroke:#c84b96,color:#fff

Key Technical Features

Feature	Implementation
Bidirectional Audio	`send_realtime_input()` via Google GenAI SDK 1.67 with asyncio queue-based concurrency
Server-side VAD	Gemini's built-in voice activity detection handles turn-taking naturally
Smart Screen Capture	Pixel-diff change detection (64×64 thumbnail sampling) — only sends frames when content actually changes
Multi-turn Conversations	Persistent `session.receive()` loop across turns with interrupt handling
Audio Downsampling	Client-side 48kHz→16kHz PCM conversion for correct Gemini input format
Transcript Accumulation	Fragment-level streaming assembled into complete chat messages

Project Structure

Polyglot/
├── backend/
│   ├── app/
│   │   ├── main.py            # FastAPI + WebSocket handler
│   │   ├── gemini_live.py     # Gemini Live session management
│   │   └── config.py          # Environment configuration
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/
│   ├── src/app/
│   │   ├── page.tsx           # Main UI + audio/video/WS logic
│   │   ├── layout.tsx         # Root layout
│   │   └── globals.css        # Tailwind styles
│   ├── Dockerfile
│   └── package.json
├── ARCHITECTURE.md            # Detailed technical architecture
└── .env                       # Environment variables

Run Locally

Prerequisites

Python 3.11+, Node.js 20+
Google Cloud project with Vertex AI enabled
gcloud auth application-default login

Backend

cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp ../.env.example ../.env  # Edit with your project ID
uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload

Frontend

cd frontend
npm install
echo "NEXT_PUBLIC_WS_URL=ws://localhost:8001/ws/live" > .env.local
npm run dev

Open localhost:3000, click Start, enable mic, and talk.

Deploy to Google Cloud

Both services run on Cloud Run in us-east1:

# Backend
docker build -t us-east1-docker.pkg.dev/YOUR_PROJECT/cloud-run-source-deploy/polyglot-backend ./backend
docker push us-east1-docker.pkg.dev/YOUR_PROJECT/cloud-run-source-deploy/polyglot-backend
gcloud run deploy polyglot-backend --image=... --region=us-east1 --allow-unauthenticated --timeout=3600 --session-affinity

# Frontend (pass backend URL at build time)
docker build --build-arg NEXT_PUBLIC_WS_URL=wss://YOUR_BACKEND_URL/ws/live -t .../polyglot-frontend ./frontend
docker push .../polyglot-frontend
gcloud run deploy polyglot-frontend --image=... --region=us-east1 --allow-unauthenticated --port=3000

Tech Stack

Gemini Live API — Native audio streaming with gemini-live-2.5-flash-native-audio
Google GenAI SDK — Python SDK v1.67 with send_realtime_input()
FastAPI — Async WebSocket server with concurrent queue architecture
Next.js 16 — React frontend with Tailwind CSS v4
Cloud Run — Serverless deployment with WebSocket support

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
backend		backend
frontend		frontend
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
LICENSE		LICENSE
README.md		README.md
deploy.sh		deploy.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Polyglot — Live Screen Companion

What It Does

Architecture Overview

Key Technical Features

Project Structure

Run Locally

Prerequisites

Backend

Frontend

Deploy to Google Cloud

Tech Stack

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Polyglot — Live Screen Companion

What It Does

Architecture Overview

Key Technical Features

Project Structure

Run Locally

Prerequisites

Backend

Frontend

Deploy to Google Cloud

Tech Stack

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages