Proxy

Give your coding agent a voice.

Wake word → Speech-to-text → Copilot → Text-to-speech → Speaker

All streaming. All low-latency. Fully hands-free.

Getting Started · How It Works · Features · Configuration · Contributing

Why Proxy?

You already have a coding agent. It's good at what it does. But every time you want to ask it something, you stop coding, switch context, type a prompt, wait, read the response, and switch back.

Proxy sits between you and your agent — that's it. It's not a new AI, not a framework, not a platform. It's a voice layer that proxies your speech to the agent and speaks the response back. Your agent does the thinking. Proxy just gives it a voice.

What does it look like?

You:     "Proxy, what's the technical debt in the autograder repo?"
Proxy:   *acknowledgment sound*
Proxy:   "Alright, let me check the autograder." ← latency filler
         Reading 8 files...                      ← terminal log
Proxy:   "I found three main areas of            ← streamed response
          technical debt. First, the config
          system still uses..."
You:     "Stop."                                 ← interrupt mid-speech
Proxy:   *silence*
         *listening for your next prompt*

No browser tabs. No copy-paste. No keyboard. Just your voice and your code.

Features

Wake word activation — Say "Proxy" to start. Local Vosk model, no cloud dependency for wake detection.

Streaming responses — Hear Copilot's answer while it's still generating. Sentence-boundary chunking feeds ElevenLabs TTS in real-time.

On-demand status — Say "what's happening?" during long silences to hear what Copilot is doing. Proxy summarizes recent tool calls and thoughts into one spoken sentence.

Voice interruption — Say "stop" anytime during a response to cancel and take back control. Instantly.

Persistent sessions — One Copilot session lives for the entire process. Full conversational context across every interaction.

Anti self-listening — Two-layer protection (speech gate + echo filter) prevents Proxy from transcribing its own voice.

Contextual wake sounds — First interaction gets a greeting. Subsequent ones get a casual callup.

Fully configurable — 40+ environment variables for tuning every aspect: voice, latency, wake sensitivity, STT model, and more.
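
The echo-filter half of the anti self-listening feature above can be pictured as a text comparison between what Proxy just spoke and what the STT returns. This is a minimal sketch, not Proxy's actual implementation: the helper name `looks_like_echo`, the normalization, and the 0.8 similarity threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and drop punctuation so comparisons ignore formatting."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def looks_like_echo(
    transcript: str, recently_spoken: list[str], threshold: float = 0.8
) -> bool:
    """Return True if the transcript closely matches something just spoken.

    Hypothetical helper: the threshold and matching strategy are illustrative.
    """
    heard = _normalize(transcript)
    if not heard:
        return False
    for sentence in recently_spoken:
        spoken = _normalize(sentence)
        # Treat a substring hit or a high similarity ratio as an echo.
        if heard in spoken:
            return True
        if SequenceMatcher(None, heard, spoken).ratio() >= threshold:
            return True
    return False
```

A transcript flagged this way would be dropped rather than forwarded to the agent. The speech gate, the other layer, is not shown here.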

Getting Started

You will need:

Python 3.11+, PortAudio, and the GitHub Copilot CLI (installed and authenticated)
API keys for Deepgram (STT) and ElevenLabs (TTS)
A Vosk model for wake word detection

git clone https://github.com/ArthurCRodrigues/proxy.git
cd proxy
pip install -e ".[dev]"
proxy devices                  # list available input devices (index, name, sample rate)
proxy init                     # guided setup: downloads model, configures API keys
proxy                          # run Proxy via the installed CLI command
./proxy                        # or use the repo-local launcher from the project root

Run at startup (Linux)

Make sure your .env file is configured with your API keys first, then:

proxy setup

This installs a systemd user service that launches Proxy on login. Useful commands after installing:

systemctl --user status proxy     # check if it's running
journalctl --user -u proxy -f     # follow the logs
systemctl --user stop proxy       # stop it
systemctl --user disable proxy    # remove from startup

How It Works

Proxy sits between your microphone and your coding agent. Audio flows through five components in sequence:

1. Vosk (local) listens for the wake word, with no cloud calls and no network round trip.
2. Deepgram transcribes your speech in real time over a persistent WebSocket.
3. Copilot receives the transcript via ACP and streams back a response.
4. ElevenLabs converts each sentence to speech as it arrives.
5. Speaker plays the audio while Copilot is still generating.

Saying "stop" at any point during steps 3–5 cancels everything and returns control to you.
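
The hand-off between Copilot's streamed text and sentence-by-sentence TTS can be sketched as a small buffering generator. This is an illustrative sketch under assumptions, not Proxy's code: the function name `sentences_from_stream` and the boundary heuristic (split after `.`, `!`, or `?` followed by whitespace) are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Assumed heuristic: a sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one ends at a sentence boundary.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # Keep the unfinished tail for the next chunk.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence could be sent to TTS immediately, which is what lets playback begin before the full response exists. A simple heuristic like this one will split on abbreviations such as "e.g."; a production chunker would need to handle those cases.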

States

Proxy is a state machine with five states. Each interaction follows the same path:

State	What's happening	How it ends
IDLE	Waiting for wake word. Nothing else running.	You say "Proxy" → LISTENING
LISTENING	Deepgram is transcribing your voice.	You finish speaking → THINKING. You say "never mind" → IDLE. Silence for 10s → IDLE.
THINKING	Your prompt was sent to Copilot. Waiting for a response. Thoughts are spoken aloud.	First response chunk arrives → SPEAKING. You say "stop" → LISTENING.
SPEAKING	Copilot's response is streaming through TTS and playing back.	Response finishes → IDLE. You say "stop" → LISTENING.
STOPPED	Shutting down. Terminal state.	—
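
The table above maps directly onto a lookup of (state, event) pairs. The sketch below is one way to encode it, not Proxy's actual implementation: the event names (`wake_word`, `utterance_end`, and so on) are hypothetical labels for the triggers in the table, and the `shutdown` event that leads to STOPPED is an assumption, since the table does not say how STOPPED is reached.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    STOPPED = auto()


# Event names are hypothetical labels for the triggers listed in the table.
TRANSITIONS: dict[tuple[State, str], State] = {
    (State.IDLE, "wake_word"): State.LISTENING,
    (State.LISTENING, "utterance_end"): State.THINKING,
    (State.LISTENING, "never_mind"): State.IDLE,
    (State.LISTENING, "timeout"): State.IDLE,
    (State.THINKING, "first_chunk"): State.SPEAKING,
    (State.THINKING, "stop"): State.LISTENING,
    (State.SPEAKING, "response_done"): State.IDLE,
    (State.SPEAKING, "stop"): State.LISTENING,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; events with no matching transition are ignored."""
    if event == "shutdown":  # assumed trigger for the terminal state
        return State.STOPPED
    if state is State.STOPPED:
        return State.STOPPED  # terminal: nothing leaves STOPPED
    return TRANSITIONS.get((state, event), state)
```

For example, `next_state(State.SPEAKING, "stop")` returns `State.LISTENING`, matching the interrupt behavior described in the table.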

Configuration

All settings are environment variables. See .env.example for the full list.

Key settings

Variable	Default	What it does
`PROXY_WAKE_PHRASE`	`proxy`	The wake word
`PROXY_WAKE_ALIASES`	`proxy,roxy,rocky`	Alternative wake word spellings for better recognition
`PROXY_STOPWORD_ALIASES`	`stop,shut up`	Phrases that interrupt the current response
`PROXY_LISTENING_TIMEOUT_MS`	`10000`	How long to wait for speech before returning to IDLE
`PROXY_DEEPGRAM_UTTERANCE_END_MS`	`3500`	Silence duration before Deepgram finalizes your speech
`PROXY_ELEVENLABS_VOICE_ID`	—	Your ElevenLabs voice (required)
`PROXY_ELEVENLABS_SPEED`	`0.95`	TTS speech speed
`PROXY_COPILOT_COMMAND`	`copilot`	CLI command to invoke Copilot
`PROXY_LOG_LEVEL`	`INFO`	Logging level
`PROXY_LOG_DEBUG_MODULES`	—	Comma-separated modules for DEBUG logging (e.g. `proxy.stt.deepgram`)

Custom instructions

Place an instructions.md file in the project root to give your agent custom context. This file is sent as system instructions when the session starts. Use it to tell the agent about your project, preferred response style, or domain-specific knowledge.

If no file is found, Proxy uses built-in defaults that tell the agent to respond in plain conversational language suitable for voice.
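
The lookup-with-fallback behavior described above can be sketched in a few lines. This is an assumption-laden illustration: `load_instructions` is a hypothetical name, the default text is a placeholder for Proxy's real built-in instructions (which this README does not show), and treating an empty file like a missing one is a choice made here, not documented behavior.

```python
from pathlib import Path

# Placeholder only: Proxy's real built-in defaults are not shown in this README.
DEFAULT_INSTRUCTIONS = "Respond in plain conversational language suitable for voice."


def load_instructions(project_root: Path) -> str:
    """Return the contents of instructions.md, or the default if it is absent."""
    path = project_root / "instructions.md"
    if path.is_file():
        text = path.read_text(encoding="utf-8").strip()
        if text:  # assumption: an empty file falls back to the default
            return text
    return DEFAULT_INSTRUCTIONS
```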

Wake sounds

Proxy plays a short audio clip when the wake word is detected. You need to provide your own WAV files:

assets/greetings/ — played on the first wake of a session (e.g. "Hello!", "Hey there!")
assets/wake/ — played on subsequent wakes (e.g. "Yes?", "Hm?")
assets/yes.wav — fallback if the directories above are empty

Place one or more .wav files (PCM 16-bit) in each directory. Proxy picks one at random each time.

Vanguard mode (optional)

Vanguard uses a local model to fill Copilot's silence with context-aware speech. When enabled, Proxy acknowledges your prompt immediately ("Hold on, let me check the autograder") while Copilot boots up, and you can ask "what's happening?" anytime to hear a summary of what Copilot is doing.

Requires Ollama running locally:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the default model
ollama pull llama3.2:3b

# Start the server (runs on port 11434)
ollama serve

Then enable it in your .env:

PROXY_VANGUARD_ENABLED=1
PROXY_VANGUARD_MODEL=llama3.2:3b
PROXY_VANGUARD_CONTEXT=Projects: my-api, my-frontend. Languages: Python, TypeScript.

PROXY_VANGUARD_CONTEXT tells the local model about your projects and tools so it can reference them by name. Without it, a filler might say "Let me check that repository." With it, you get "Hold on, let me look at my-api." Keep it short — just project names, languages, and key terms.

Or run proxy init — it will offer to set up Vanguard as part of the guided setup.

Vanguard is completely optional. When it is disabled, Proxy runs without the local model and the rest of its behavior is unchanged.
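
For a sense of what a context-aware filler request might involve, the sketch below builds a request for Ollama's `/api/generate` endpoint on the default port. The endpoint and the `model`, `prompt`, and `stream` fields are part of Ollama's REST API; the prompt wording and the function names `build_filler_request` and `generate_filler` are hypothetical and do not reflect how Vanguard actually prompts the model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_filler_request(
    user_prompt: str, context: str, model: str = "llama3.2:3b"
) -> dict:
    """Build an Ollama generate payload. The prompt wording is hypothetical."""
    instruction = (
        "In one short spoken sentence, acknowledge the request below "
        "and say you are working on it.\n"
        f"Known context: {context}\n"
        f"Request: {user_prompt}"
    )
    # stream=False asks Ollama for a single JSON object instead of a stream.
    return {"model": model, "prompt": instruction, "stream": False}


def generate_filler(user_prompt: str, context: str, timeout_s: float = 5.0) -> str:
    """Send the request to a locally running Ollama server and return its text."""
    payload = json.dumps(build_filler_request(user_prompt, context)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"].strip()
```

Calling `generate_filler` requires `ollama serve` to be running; `build_filler_request` can be inspected on its own to see how the value of PROXY_VANGUARD_CONTEXT would reach the model.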

What's Next

Proxy currently works with GitHub Copilot. Here's where it's headed:

Claude Code support — first priority. Proxy should work with the most popular agent runtimes, not just one.
Agent-agnostic protocol — a simple bridge interface so any coding agent can plug in.
ElevenLabs WebSocket streaming — true real-time TTS for lower latency.
Alternative STT/TTS providers — Whisper, local TTS, Azure, Google.
Non-English language support — wake word models and STT/TTS configs for other languages.

See ROADMAP.md for the full roadmap.

Contributing

Proxy is built to be extended. Some areas where contributions would be especially valuable:

Agent backends — Proxy currently works with GitHub Copilot. Adding support for Claude Code, Aider, Continue, or other coding agents would make it useful to a much wider audience.

STT/TTS providers — Alternative speech engines (Whisper, Azure, Google, local TTS) for different cost/latency/privacy tradeoffs.

Language support — Wake word models and STT configs for non-English languages.

Latency optimization — Every millisecond matters in voice UX. Profiling, benchmarking, and optimization PRs are welcome.

Testing — Integration tests, edge case coverage, CI pipeline.

License

MIT

Proxy — because your coding agent deserves a voice.
