I built Voice Slides Realtime as a direct solution to a major friction point in presentation building: navigating and editing Google Slides while you are actively practicing your talk. This application merges the Google Slides API with the OpenAI Realtime WebRTC architecture to give you full, hands-free programmatic control over your slide decks.

Every technical presentation has a conceptual villain: friction. Practicing a tech talk exposes it immediately: your hands are full, you are pacing the room, mentally queuing up your next transition. The moment you want to tweak a slide or add a note, you hit the trap: you have to break your flow, walk back to the laptop, and use a mouse.
This application engineers a system-level solution that introduces two primary workflows to solve this bottleneck.
The first workflow connects your microphone directly to an intelligent agent built on the OpenAI Realtime API. The agent maintains constant awareness of your active slide. You can command it to navigate to specific slides, replace text blocks, inject new text boxes, format fonts, or even evaluate the current slide's pacing. The application uses a backend tool registry to execute these changes live in your presentation without requiring physical input.
Sometimes you do not want an AI agent talking back to you. When you just need to practice a 45-minute keynote and want an exact transcript of your ad-libbed thoughts, Dictation Mode bypasses the conversational pipeline entirely. It records your microphone natively and proxies the buffer to the OpenAI Whisper model, functioning as an uninterrupted text area you can copy straight to your clipboard.
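The dictation proxy step can be sketched as a small helper. This is an illustrative sketch, not the repo's actual code: the function names are hypothetical, and it assumes the recorded MediaRecorder blob is forwarded to OpenAI's `/v1/audio/transcriptions` endpoint as multipart form data (Node 18+ or a browser provides the global `FormData`, `Blob`, and `fetch`):

```typescript
// Illustrative sketch of the Dictation Mode proxy. Assumes the recorded
// audio arrives as a Blob from MediaRecorder; function names are hypothetical.

export function buildTranscriptionForm(audio: Blob): FormData {
  const form = new FormData();
  // Whisper expects a named file part plus the model identifier.
  form.append("file", audio, "dictation.webm");
  form.append("model", "whisper-1");
  return form;
}

export async function transcribeAudio(audio: Blob, apiKey: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` }, // key stays server-side
    body: buildTranscriptionForm(audio),
  });
  const json = await res.json();
  return json.text; // the default response shape is { text: "..." }
}
```

Running this through a server action rather than the browser keeps the OpenAI key off the client, which matters for the same reasons discussed in the security section below.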
If you want to pull this down and run the system yourself, you need to configure the OAuth infrastructure carefully. Because this application requires write access to your Google Drive documents, the OAuth client and consent screen must be set up exactly as described below.
- Clone the repository and install the dependencies:

  ```bash
  git clone https://github.com/TheDThompsonDev/voice-slides-realtime.git
  cd voice-slides-realtime
  npm install
  ```

- Duplicate the environment template:

  ```bash
  cp env.example .env.local
  ```

You are going to need two API keys. Fetch your OpenAI key from their developer dashboard and place it in the `OPENAI_API_KEY` variable. For the Google Client ID, follow the infrastructure steps below.
- Open the Google Cloud Console and create a new project.
- Under APIs and Services, access the Library and enable the Google Slides API.
- Access the OAuth consent screen and configure it for External use. You can leave it in Testing mode for local development.
- Go to Credentials and create a new OAuth client ID.
- Set the Application type to Web application.
- Name the application Voice Slides.
- Under Authorized JavaScript origins, add http://localhost:3000.
- Under Authorized redirect URIs, add http://localhost:3000.
- Copy the generated Client ID and paste it into your `.env.local` as `NEXT_PUBLIC_GOOGLE_CLIENT_ID`.
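When both steps are complete, your `.env.local` should look roughly like this (placeholder values shown; `env.example` in the repo is the authoritative template):

```shell
# OpenAI key from the developer dashboard (server-side only)
OPENAI_API_KEY=sk-your-openai-key
# Google OAuth client ID from the Cloud Console (safe to expose to the browser)
NEXT_PUBLIC_GOOGLE_CLIENT_ID=your-client-id.apps.googleusercontent.com
```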
Once your environment variables are set, boot the development server:

```bash
npm run dev
```

Navigate to localhost:3000. Sign in with Google, paste a valid Google Slides URL, and initialize the WebRTC connection to begin editing.
This is not a basic API wrapper. If you want to build high-availability voice agents in production, you have to engineer for the team and the architecture, not just the IDE. Below, we break down the traps most developers fall into when building real-time apps in Next.js, and the mental models used to solve them.
Building a resilient audio application requires understanding the exact lifecycle of the data. Review the architecture below before attempting to modify the codebase.
```mermaid
graph TD
    User([User Speaking])

    subgraph Frontend [Next.js Client]
        Mic[Microphone Buffer]
        WebRTC[WebRTC Data Channel]
        Dictation[MediaRecorder Blob]
        ToolExec[Tool Registry Executor]
    end

    subgraph Backend [Next.js Server Actions]
        GSlides[Google Slides Actions]
        WhisperAction[OpenAI Whisper Proxy]
        SessionToken[Realtime Session Token]
    end

    subgraph External [External Services]
        OAI_RTC[OpenAI Realtime Engine]
        OAI_REST[OpenAI Transcriptions]
        GCP[Google Slides API]
    end

    User --> Mic

    %% Assistant Flow
    SessionToken -- Ephemeral Token --> WebRTC
    Mic --> WebRTC
    WebRTC -- Streaming Audio --> OAI_RTC
    OAI_RTC -- Function Call JSON --> ToolExec
    ToolExec -- Server Action --> GSlides
    GSlides -- API Request --> GCP
    GCP -- Success Response --> GSlides
    GSlides -- Result JSON --> ToolExec
    ToolExec -- Function Output --> WebRTC

    %% Dictation Flow
    Mic --> Dictation
    Dictation -- FormData Upload --> WhisperAction
    WhisperAction -- File Upload --> OAI_REST
    OAI_REST -- Transcript --> WhisperAction
    WhisperAction -- Text --> Dictation
```
Most tutorials will teach you to throw your WebRTC connection, your mic stream, and your UI all into one giant file. That breaks in production because React unmounts and re-renders will constantly sever your audio connection.
The Architecture Fix: We decoupled the infrastructure from the UI using strict custom hooks. useWebRTCSession owns the entire connection lifecycle, useAuth owns the token state, and VoiceInterface.tsx acts as a pure, thin orchestrator. If the team needs to change the UI, they do not risk breaking the WebRTC SDP handshake.
When changing slides programmatically, the naive approach is to update a React state variable that holds the current slide ID, forcing the iframe to fully network-refresh. This causes a massive, jarring white flash every time you say "next slide."
The Architecture Fix: We utilize persistent sessionStorage caching paired with dynamic iframe hash routing. By mutating the URL hash (#slide=id.xxx) instead of the React key, we bypass the network refresh layer. We remove the white flicker completely, making the embedded web app feel native.
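The hash trick reduces to a pure helper. This is a minimal sketch with hypothetical names, assuming the Google Slides embed URL accepts a `#slide=id.<slideId>` fragment as described above:

```typescript
// Illustrative sketch of the hash-routing trick: mutate the URL fragment
// instead of remounting the iframe, so the browser skips the network refresh.
// Function and parameter names are hypothetical, not the repo's actual code.

export function withSlideHash(embedUrl: string, slideId: string): string {
  const base = embedUrl.split("#")[0]; // strip any existing fragment
  return `${base}#slide=id.${slideId}`;
}

// In the viewer, update the existing iframe rather than re-rendering it;
// a fragment-only change navigates without a full document reload.
export function navigateIframe(
  iframe: { src: string },
  embedUrl: string,
  slideId: string
): void {
  iframe.src = withSlideHash(embedUrl, slideId);
}
```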
When the AI assistant "goes silent," most developers simply send a system prompt telling the Language Model "don't talk." The problem? The Whisper Voice Activity Detection (VAD) model will hallucinate background room noise as speech, triggering infinite loops of empty function calls back to your server.
The Architecture Fix: We implemented a physical DOM mutation strategy. We don't just ask the AI to be quiet; we dynamically mute the HTML audio track at the browser layer while simultaneously patching the session instructions via the WebRTC data channel. We also enforce a strict English language lock at the API boundary, killing false-positive triggers from room ambiance.
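A sketch of that dual-layer silencing, under the assumption that the soft mute is delivered as a Realtime `session.update` event over the data channel; the instruction strings and function name are illustrative, not the repo's actual code:

```typescript
// Illustrative sketch of the dual silencing strategy. The `session.update`
// event type comes from the OpenAI Realtime API; the instruction text and
// the element-level mute flag are assumptions for illustration.

export interface SilenceActions {
  muteAudioElement: boolean; // hard mute: no assistant audio can play at all
  sessionPatch: { type: "session.update"; session: { instructions: string } };
}

export function buildSilenceActions(silent: boolean): SilenceActions {
  return {
    muteAudioElement: silent,
    // Soft mute at the model layer, sent over the WebRTC data channel.
    sessionPatch: {
      type: "session.update",
      session: {
        instructions: silent
          ? "Stay silent. Do not respond with audio unless explicitly asked."
          : "Respond conversationally and confirm completed actions.",
      },
    },
  };
}
```

Pairing both layers matters: the DOM mute guarantees silence even if the model ignores its patched instructions.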
If you want to understand how the AI operates under the hood, here are three senior-level lessons extracted from this codebase showing how we handle the OpenAI Realtime API.
The Trap: Exposing your OPENAI_API_KEY on the client side just to initialize the WebRTC connection. It violates every security protocol but is wildly common in AI tutorials.
The Fix: Ephemeral Tokens. We execute a secure Server Action (actions/openai.ts) to hit OpenAI's REST API. We pass our system prompts, VAD configs, and our model instructions securely on the server. OpenAI returns a short-lived client_secret. We pass that ephemeral token back down to the browser to authorize the WebRTC peer connection. The master key never touches the client.
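A server-side sketch of that flow. The endpoint path and model name follow OpenAI's documented realtime sessions API but should be verified against current docs; the function name is hypothetical:

```typescript
// Sketch of minting an ephemeral Realtime token on the server. The endpoint
// and model name are assumptions based on OpenAI's /v1/realtime/sessions flow.

interface SessionRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

export function buildRealtimeSessionRequest(
  apiKey: string,
  instructions: string
): SessionRequest {
  return {
    url: "https://api.openai.com/v1/realtime/sessions",
    headers: {
      Authorization: `Bearer ${apiKey}`, // the master key never leaves the server
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview",
      instructions, // system prompt and VAD config are fixed here, server-side
    }),
  };
}

// The server action POSTs this request and returns only the short-lived
// client_secret value to the browser to authorize the peer connection.
```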
The Trap: Using HTTP polling or standard WebSockets for two-way conversational voice traffic. This results in terrible 500ms+ latency that kills conversational flow.
The Fix: WebRTC. Notice in hooks/useWebRTCSession.ts that we aren't just sending text payloads. We create a native RTCPeerConnection and establish a Data Channel specifically named oai-events. Audio is streamed natively via hardware tracks (pc.addTrack(micStream)), and structured data (like tools and text transcripts) travels instantly over the Data Channel. This gives us sub-50ms conversational latency.
The Trap: Hardcoding a massive 100-line if/else chain in your React component to handle AI tool execution.
The Fix: We leverage OpenAI's Function Calling strictly via the Open/Closed Principle. When the AI decides it needs to edit a Google Slide, it doesn't edit it directly. It fires a response.function_call_arguments.done event down the WebRTC data channel.
Instead of parsing that inline, we instantly proxy it to our Backend Tool Registry (lib/tool-handlers.ts). We execute the Google Slides API call, bundle the success/fail result as JSON, and fire it back to the AI via a conversation.item.create event. The AI reads the JSON result and then audibly confirms your action: "Alright, I've updated the text on that slide for you."
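The return leg can be sketched as two small event builders. The event shapes follow the Realtime API's `conversation.item.create` / `function_call_output` pattern; field names should be checked against current docs:

```typescript
// Sketch of returning a tool result to the model over the data channel.

export function buildToolResultEvent(callId: string, result: unknown): string {
  return JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId, // ties this result back to the model's function call
      output: JSON.stringify(result),
    },
  });
}

// After queuing the result, ask the model to produce its spoken confirmation:
export function buildResponseTrigger(): string {
  return JSON.stringify({ type: "response.create" });
}
```

Both strings would be sent with `dataChannel.send(...)`; the model then reads the JSON result and voices the confirmation.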
The codebase follows a modular architecture engineered for scale and separation of concerns.
```
src/
├── actions/                     # Next.js Server Actions
│   ├── openai.ts                # Session token + Whisper transcription
│   └── slides.ts                # Google Slides CRUD operations
├── app/
│   ├── globals.css              # Theme + scrollbar styles
│   ├── layout.tsx               # Root layout, env var injection
│   └── page.tsx                 # Page shell using hooks
├── components/
│   ├── DictationInterface.tsx   # Mic recorder + Whisper pipeline
│   ├── Providers.tsx            # Google OAuth context
│   ├── SlidesViewer.tsx         # Iframe slide viewer + nav
│   └── VoiceInterface.tsx       # Voice UI orchestrator
├── hooks/
│   ├── useAuth.ts               # Token state + session persistence
│   ├── useDictation.ts          # Transcript + clipboard
│   ├── usePresentation.ts       # URL parsing, slide loading, nav
│   ├── useTranscriptLog.ts      # Log buffer + auto-scroll
│   └── useWebRTCSession.ts      # WebRTC lifecycle + message routing
└── lib/
    ├── constants.ts             # System prompt
    ├── tool-definitions.ts      # OpenAI function schemas
    ├── tool-handlers.ts         # Tool registry map + executor
    └── types.ts                 # Shared TypeScript types
```
The AI assistant supports these voice-driven tools, defined in tool-definitions.ts and dispatched through a registry map in tool-handlers.ts. Our Open/Closed architecture means adding a new tool requires two additions (one schema entry and one handler entry)—no if/else dispatch logic needs to change.
| Tool | Description |
|---|---|
| `next_slide` | Navigate forward one slide |
| `previous_slide` | Navigate backward one slide |
| `go_to_slide` | Jump to a specific slide number |
| `get_current_slide_elements` | Inspect all element IDs and text on the active slide |
| `replace_text` | Overwrite text inside a specific element |
| `add_text_box` | Insert a new text box on the current slide |
| `format_text` | Apply bold or italic formatting |
| `evaluate_presentation` | Read all slide text for full-deck critique |
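The registry pattern behind that table can be sketched as a plain map from tool name to handler, with stub handlers standing in for the real Google Slides calls in `lib/tool-handlers.ts`:

```typescript
// Minimal sketch of a map-based tool registry. Tool names mirror the table
// above; the stub handlers are placeholders for the actual Slides API calls.

type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const toolHandlers: Record<string, ToolHandler> = {
  next_slide: async () => ({ ok: true, action: "next" }),
  previous_slide: async () => ({ ok: true, action: "previous" }),
  go_to_slide: async (args) => ({ ok: true, slide: args.slideNumber }),
  // Adding a tool = one schema entry + one handler entry. No dispatch edits.
};

export async function executeTool(
  name: string,
  args: Record<string, unknown>
): Promise<unknown> {
  const handler = toolHandlers[name];
  if (!handler) {
    // Return a structured error so the model can recover gracefully.
    return { ok: false, error: `Unknown tool: ${name}` };
  }
  return handler(args);
}
```

The executor never branches on tool names, which is what keeps the dispatch logic closed to modification as new tools are registered.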
The frontend state is managed through five custom hooks, each owning a single domain:
```mermaid
graph LR
    subgraph page.tsx
        useAuth
        usePresentation
        useDictation
    end

    subgraph VoiceInterface.tsx
        useTranscriptLog
        useWebRTCSession
    end

    useAuth -- accessToken --> usePresentation
    useAuth -- signOut --> usePresentation
    usePresentation -- slideState --> VoiceInterface.tsx
    useTranscriptLog -- addLog --> useWebRTCSession
    useWebRTCSession -- tool calls --> tool-handlers.ts
```