Voice Slides Realtime

I built Voice Slides Realtime as a direct solution to a major friction point in presentation building: navigating and editing Google Slides while you are actively practicing your talk. This application merges the Google Slides API with the OpenAI Realtime WebRTC architecture to give you full, hands-free programmatic control over your slide decks.

The Enemy: Breaking the Developer Flow

Every technical presentation has a conceptual villain: friction. Practicing a tech talk means your hands are full, you are pacing the room, and you are mentally queuing up your next transition. The moment you want to tweak a slide or add a note, you hit the trap: you have to break your flow, walk back to the laptop, and use a mouse.

This application engineers a system-level solution to that bottleneck, built around two primary workflows.

1. AI Assistant Mode

This mode connects your microphone directly to an intelligent agent built on the OpenAI Realtime API. The agent maintains constant awareness of your active slide. You can command it to navigate to specific slides, replace text blocks, inject new text boxes, format fonts, or even evaluate the current slide pacing. The application uses a backend tool registry to execute these changes live in your presentation without requiring physical input.

2. Speech Dictation Mode

Sometimes you do not want an AI agent talking back to you. When you just need to practice a 45-minute keynote and want an exact transcript of your ad-libbed thoughts, Dictation Mode bypasses the conversational pipeline. It records your microphone natively and proxies the audio buffer to the OpenAI Whisper model, functioning as an uninterrupted text area that you can copy straight to your clipboard.
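The proxy step is a small Server Action. Here is a rough sketch, assuming an action named transcribeAudio and the official openai Node SDK; the real code lives in actions/openai.ts and may differ in details like the form field name:

// Hypothetical sketch of the Whisper proxy as a Next.js Server Action.
'use server';

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function transcribeAudio(formData: FormData): Promise<string> {
  // "audio" is an assumed field name for the recorded MediaRecorder blob.
  const file = formData.get('audio') as File;
  const result = await openai.audio.transcriptions.create({
    file,
    model: 'whisper-1',
  });
  return result.text;
}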

How to Run This Locally

If you want to pull this down and run the system yourself, you need to configure the OAuth infrastructure carefully. Because this application requires write access to your Google Drive documents, the Google OAuth client has to be set up correctly.

  1. Clone the repository and install the dependencies:
git clone https://github.com/TheDThompsonDev/voice-slides-realtime.git
cd voice-slides-realtime
npm install
  2. Duplicate the environment template:
cp env.example .env.local

You are going to need two API keys. Fetch your OpenAI key from their developer dashboard and place it in the OPENAI_API_KEY variable. For the Google Client ID, follow the infrastructure steps below.
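Once both keys are in place, .env.local should look roughly like this (placeholder values, matching the two variables named above):

OPENAI_API_KEY=sk-your-openai-key
NEXT_PUBLIC_GOOGLE_CLIENT_ID=your-client-id.apps.googleusercontent.com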

Configuring Google Cloud

  1. Open the Google Cloud Console and create a new project.
  2. Under APIs and Services, access the Library and enable the Google Slides API.
  3. Access the OAuth consent screen and configure it for External use. You can leave it in Testing mode for local development.
  4. Go to Credentials and create a new OAuth client ID.
  5. Set the Application type to Web application.
  6. Name the application Voice Slides.
  7. Under Authorized JavaScript origins, add http://localhost:3000.
  8. Under Authorized redirect URIs, add http://localhost:3000.
  9. Copy the generated Client ID and paste it into your .env.local as NEXT_PUBLIC_GOOGLE_CLIENT_ID.

Once your environment variables are set, boot the development server:

npm run dev

Navigate to localhost:3000. Sign in with Google, paste a valid Google Slides URL, and initialize the WebRTC connection to begin editing.

The Mental Model: Engineering System Resilience

This is not a basic API wrapper. If you want to build high-availability voice agents in production, you have to engineer for the team and the architecture, not just the IDE. Below, we break down the traps most developers fall into when building real-time apps in Next.js, and the mental models used to solve them.

Data Flow Architecture

Building a resilient audio application requires understanding the exact lifecycle of the data. Review the architecture below before attempting to modify the codebase.

graph TD
    User([User Speaking])

    subgraph Frontend [Next.js Client]
        Mic[Microphone Buffer]
        WebRTC[WebRTC Data Channel]
        Dictation[MediaRecorder Blob]
        ToolExec[Tool Registry Executor]
    end

    subgraph Backend [Next.js Server Actions]
        GSlides[Google Slides Actions]
        WhisperAction[OpenAI Whisper Proxy]
        SessionToken[Realtime Session Token]
    end

    subgraph External [External Services]
        OAI_RTC[OpenAI Realtime Engine]
        OAI_REST[OpenAI Transcriptions]
        GCP[Google Slides API]
    end

    User --> Mic

    %% Assistant Flow
    SessionToken -- Ephemeral Token --> WebRTC
    Mic --> WebRTC
    WebRTC -- Streaming Audio --> OAI_RTC
    OAI_RTC -- Function Call JSON --> ToolExec
    ToolExec -- Server Action --> GSlides
    GSlides -- API Request --> GCP
    GCP -- Success Response --> GSlides
    GSlides -- Result JSON --> ToolExec
    ToolExec -- Function Output --> WebRTC

    %% Dictation Flow
    Mic --> Dictation
    Dictation -- FormData Upload --> WhisperAction
    WhisperAction -- File Upload --> OAI_REST
    OAI_REST -- Transcript --> WhisperAction
    WhisperAction -- Text --> Dictation

The Trap: The "God Component"

Most tutorials will teach you to throw your WebRTC connection, your mic stream, and your UI all into one giant file. That breaks in production because React unmounts and re-renders will constantly sever your audio connection.

The Architecture Fix: We decoupled the infrastructure from the UI using strict custom hooks. useWebRTCSession owns the entire connection lifecycle, useAuth owns the token state, and VoiceInterface.tsx acts as a pure, thin orchestrator. If the team needs to change the UI, they do not risk breaking the WebRTC SDP handshake.
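As an illustration of the pattern (a simplified sketch, not the actual useWebRTCSession), the hook holds the RTCPeerConnection in a ref so re-renders never touch it, and only a real unmount closes it:

// Hypothetical sketch of a hook that owns a connection lifecycle.
import { useEffect, useRef } from 'react';

export function useConnectionLifecycle() {
  const pcRef = useRef<RTCPeerConnection | null>(null);

  useEffect(() => {
    const pc = new RTCPeerConnection();
    pcRef.current = pc;

    // Re-renders of the consuming component never recreate this object;
    // only an actual unmount tears the connection down.
    return () => {
      pc.close();
      pcRef.current = null;
    };
  }, []);

  return pcRef;
}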

The Trap: The Full Unmount Flicker

When changing slides programmatically, the naive approach is to update a React state variable that holds the current slide ID, forcing the iframe to fully network-refresh. This causes a massive, jarring white flash every time you say "next slide."

The Architecture Fix: We utilize persistent sessionStorage caching paired with dynamic iframe hash routing. By mutating the URL hash (#slide=id.xxx) instead of the React key, we bypass the network refresh layer. We remove the white flicker completely, making the embedded web app feel native.
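A sketch of the idea, assuming you hold a ref to the embedded slides iframe (the helper name is illustrative, not the function in the codebase):

// Hypothetical sketch: mutate the iframe's hash instead of re-keying it.
function goToSlide(iframe: HTMLIFrameElement, slideObjectId: string) {
  const url = new URL(iframe.src);
  // A hash-only change is treated as a same-document navigation, so the
  // embedded deck repositions without a full network refresh or white flash.
  url.hash = `slide=${slideObjectId}`;
  iframe.src = url.toString();
}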

The Trap: Hardware vs Software Muting (The Hot Mic Hallucination)

When the AI assistant "goes silent," most developers simply send a system prompt telling the Language Model "don't talk." The problem? The Whisper Voice Activity Detection (VAD) model will hallucinate background room noise as speech, triggering infinite loops of empty function calls back to your server.

The Architecture Fix: We implemented a physical DOM mutation strategy. We don't just ask the AI to be quiet: we dynamically mute the audio track at the browser layer while simultaneously patching the session instructions via the WebRTC data channel. We also lock the transcription language to English at the API boundary, killing false-positive triggers from room ambiance.
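A sketch of the two-pronged mute, assuming the hook has access to the microphone MediaStream and the oai-events data channel; the session.update payload follows the Realtime event format, but treat the details as illustrative:

// Hypothetical sketch: mute at the media layer AND patch the instructions.
function muteAssistant(micStream: MediaStream, dc: RTCDataChannel) {
  // 1. Hard mute: disable the local audio track so VAD receives true
  //    silence rather than room ambiance it can hallucinate into speech.
  micStream.getAudioTracks().forEach((track) => { track.enabled = false; });

  // 2. Soft mute: patch the live session instructions over the data channel.
  dc.send(JSON.stringify({
    type: 'session.update',
    session: { instructions: 'Stay silent until explicitly unmuted.' },
  }));
}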

Tutorial: Engineering the AI Integration

If you want to understand how the AI operates under the hood, here are three senior-level lessons extracted from this codebase showing how we handle the OpenAI Realtime API.

Lesson 1: Securely Minting the Realtime Token

The Trap: Exposing your OPENAI_API_KEY on the client side just to initialize the WebRTC connection. It violates every security protocol but is wildly common in AI tutorials.

The Fix: Ephemeral Tokens. We execute a secure Server Action (actions/openai.ts) to hit OpenAI's REST API. We pass our system prompts, VAD configs, and our model instructions securely on the server. OpenAI returns a short-lived client_secret. We pass that ephemeral token back down to the browser to authorize the WebRTC peer connection. The master key never touches the client.
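A minimal sketch of that minting step, assuming the Realtime sessions REST endpoint and a gpt-4o-realtime model name (the exact prompts and VAD configuration live in actions/openai.ts):

// Hypothetical sketch: the master key never leaves the server.
'use server';

export async function mintRealtimeToken(): Promise<string> {
  const res = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'gpt-4o-realtime-preview',            // assumed model name
      instructions: 'You are a slide-editing assistant.', // system prompt stays server-side
      turn_detection: { type: 'server_vad' },      // VAD config stays server-side
    }),
  });
  const session = await res.json();
  // Short-lived client secret that is safe to hand to the browser.
  return session.client_secret.value;
}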

Lesson 2: The WebRTC Data Channel Limit

The Trap: Using HTTP polling or standard WebSockets for two-way conversational voice traffic. This results in terrible 500ms+ latency that kills conversational flow.

The Fix: WebRTC. Notice in hooks/useWebRTCSession.ts that we aren't just sending text payloads. We create a native RTCPeerConnection and establish a Data Channel specifically named oai-events. Audio is streamed natively via hardware tracks (pc.addTrack(micStream)), and structured data (like tools and text transcripts) travels instantly over the Data Channel. This gives us sub-50ms conversational latency.
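In sketch form, the handshake looks roughly like this (simplified from what hooks/useWebRTCSession.ts does; the SDP-exchange URL and model name are assumptions based on OpenAI's documented WebRTC flow):

// Hypothetical sketch of the Realtime WebRTC handshake.
async function connect(ephemeralToken: string, micStream: MediaStream) {
  const pc = new RTCPeerConnection();

  // Structured events (tool calls, transcripts) travel over this channel.
  const dc = pc.createDataChannel('oai-events');

  // Audio streams natively as media tracks, not as JSON payloads.
  micStream.getTracks().forEach((track) => pc.addTrack(track, micStream));

  // Standard SDP offer/answer exchange, authorized by the ephemeral token.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const res = await fetch(
    'https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${ephemeralToken}`,
        'Content-Type': 'application/sdp',
      },
      body: offer.sdp,
    }
  );
  await pc.setRemoteDescription({ type: 'answer', sdp: await res.text() });

  return { pc, dc };
}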

Lesson 3: Function Calling over Data Channels

The Trap: Hardcoding a massive 100-line if/else chain in your React component to handle AI tool execution.

The Fix: We leverage OpenAI's Function Calling through a tool registry built on the Open/Closed Principle. When the AI decides it needs to edit a Google Slide, it doesn't edit it directly. It fires a response.function_call_arguments.done event down the WebRTC data channel.

Instead of parsing that inline, we instantly proxy it to our Backend Tool Registry (lib/tool-handlers.ts). We execute the Google Slides API call, bundle the success/fail result as JSON, and fire it back to the AI via a conversation.item.create event. The AI reads the JSON result and then audibly confirms your action: "Alright, I've updated the text on that slide for you."
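In sketch form, the round trip looks like this (event names follow the Realtime API; the registry lookup stands in for what lib/tool-handlers.ts actually does):

// Hypothetical sketch of the tool-call round trip over the data channel.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

function routeToolCalls(dc: RTCDataChannel, registry: Record<string, ToolHandler>) {
  dc.onmessage = async (event) => {
    const msg = JSON.parse(event.data);
    if (msg.type !== 'response.function_call_arguments.done') return;

    // Look the tool up in the registry and run the matching Server Action.
    const handler = registry[msg.name];
    if (!handler) return;
    const result = await handler(JSON.parse(msg.arguments));

    // Feed the JSON result back so the model can confirm the action out loud.
    dc.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: msg.call_id,
        output: JSON.stringify(result),
      },
    }));
    dc.send(JSON.stringify({ type: 'response.create' }));
  };
}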

Project Structure

The codebase follows a modular architecture engineered for scale and separation of concerns.

src/
├── actions/                    # Next.js Server Actions
│   ├── openai.ts               # Session token + Whisper transcription
│   └── slides.ts               # Google Slides CRUD operations
├── app/
│   ├── globals.css             # Theme + scrollbar styles
│   ├── layout.tsx              # Root layout, env var injection
│   └── page.tsx                # Page shell using hooks
├── components/
│   ├── DictationInterface.tsx  # Mic recorder + Whisper pipeline
│   ├── Providers.tsx           # Google OAuth context
│   ├── SlidesViewer.tsx        # Iframe slide viewer + nav
│   └── VoiceInterface.tsx      # Voice UI orchestrator
├── hooks/
│   ├── useAuth.ts              # Token state + session persistence
│   ├── useDictation.ts         # Transcript + clipboard
│   ├── usePresentation.ts      # URL parsing, slide loading, nav
│   ├── useTranscriptLog.ts     # Log buffer + auto-scroll
│   └── useWebRTCSession.ts     # WebRTC lifecycle + message routing
└── lib/
    ├── constants.ts            # System prompt
    ├── tool-definitions.ts     # OpenAI function schemas
    ├── tool-handlers.ts        # Tool registry map + executor
    └── types.ts                # Shared TypeScript types

Tool Registry

The AI assistant supports these voice-driven tools, defined in tool-definitions.ts and dispatched through a registry map in tool-handlers.ts. Our Open/Closed architecture means adding a new tool requires two additions (one schema entry and one handler entry)—no if/else dispatch logic needs to change.

Tool                          Description
next_slide                    Navigate forward one slide
previous_slide                Navigate backward one slide
go_to_slide                   Jump to a specific slide number
get_current_slide_elements    Inspect all element IDs and text on the active slide
replace_text                  Overwrite text inside a specific element
add_text_box                  Insert a new text box on the current slide
format_text                   Apply bold or italic formatting
evaluate_presentation         Read all slide text for full-deck critique
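Concretely, those two additions might look like this (a sketch of the shapes, not the exact contents of tool-definitions.ts and tool-handlers.ts; replaceTextAction is a hypothetical stand-in for the real Server Action):

// 1. One schema entry in lib/tool-definitions.ts (Realtime function-tool format).
export const replaceTextTool = {
  type: 'function',
  name: 'replace_text',
  description: 'Overwrite text inside a specific element on the current slide.',
  parameters: {
    type: 'object',
    properties: {
      elementId: { type: 'string' },
      newText: { type: 'string' },
    },
    required: ['elementId', 'newText'],
  },
};

// 2. One handler entry in the registry map in lib/tool-handlers.ts.
declare function replaceTextAction(elementId: string, newText: string): Promise<void>; // hypothetical stand-in

type ToolHandler = (args: Record<string, string>) => Promise<unknown>;

export const toolRegistry: Record<string, ToolHandler> = {
  replace_text: ({ elementId, newText }) => replaceTextAction(elementId, newText),
  // next_slide, previous_slide, go_to_slide, etc. follow the same shape.
};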

Hook Architecture

The frontend state is managed through five custom hooks, each owning a single domain:

graph LR
    subgraph page.tsx
        useAuth
        usePresentation
        useDictation
    end

    subgraph VoiceInterface.tsx
        useTranscriptLog
        useWebRTCSession
    end

    useAuth -- accessToken --> usePresentation
    useAuth -- signOut --> usePresentation
    usePresentation -- slideState --> VoiceInterface.tsx
    useTranscriptLog -- addLog --> useWebRTCSession
    useWebRTCSession -- tool calls --> tool-handlers.ts
