Origami is a high-fidelity automated video generation platform that transforms PDF slides into cinematic narrated content. By leveraging advanced AI for script synthesis, professional Text-to-Speech (TTS), and programmatic video rendering, Origami streamlines the creation of engaging presentations, tutorials, and educational media directly from your static documents.
- Features
- Getting Started
- Usage
- Configuration
- Tech Stack
- Project Structure
- Roadmap & TODO
- Acknowledgements
- License
- PDF to Presentation: Upload PDF slides and automatically extract them into a sequence of video scenes.
- AI-Powered Scripting: Integrated with Google Gemini AI and WebLLM (Local Browser Inference) to transform fragmented slide notes into coherent, professional scripts.
- High-Quality TTS: Supports local and cloud-based Text-to-Speech using Kokoro-js.
- Local Inference: Run TTS entirely locally via Dockerized Kokoro FastAPI.
- Hybrid Voices: Create custom voice blends by mixing two models with adjustable weights.
- Rich Media Support: Insert MP4 videos and GIFs seamlessly between slides.
- Programmatic Video Rendering: Frame-perfect assembly powered by high-performance canvas rendering.
- Smart Audio Engineering:
  - Auto-Ducking: Background music volume automatically lowers during voiceovers.
  - Normalization: The final render is automatically normalized to the YouTube loudness standard (-14 LUFS).
- Interactive Slide Editor: Drag-and-drop reordering, real-time preview, and batch script updates.
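The auto-ducking behavior can be sketched as a simple gain rule. This is a hypothetical illustration (the `musicGainAt` helper and `Interval` type are invented here, not Origami's actual code); real implementations also ramp the gain smoothly rather than stepping it:

```typescript
// Hypothetical sketch of auto-ducking: while any voiceover interval is
// active, background music drops from its normal gain to a ducked gain.

interface Interval {
  start: number; // seconds
  end: number;   // seconds
}

function musicGainAt(
  t: number,
  voiceovers: Interval[],
  normalGain = 1.0,
  duckedGain = 0.25,
): number {
  // Is any voiceover playing at time t?
  const speaking = voiceovers.some((iv) => t >= iv.start && t < iv.end);
  return speaking ? duckedGain : normalGain;
}
```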
- Node.js (v20+)
- npm or yarn
- Docker & Docker Compose (Required for local server-side TTS and deployment)
- Clone the repository:

  ```bash
  git clone https://github.com/IslandApps/Origami-AI.git
  cd Origami-AI
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Start the development server (runs both Vite and the rendering server):

  ```bash
  npm run dev
  ```

The application will be available at http://localhost:3000.
To deploy this application using Docker, you must first clone the repository, as the image is built locally from the source.
- Clone the repository:

  ```bash
  git clone https://github.com/IslandApps/Origami-AI.git
  cd Origami-AI
  ```
A docker-compose.yml file is provided in the root directory. To start the application, run:

```bash
docker-compose up -d
```

Example docker-compose.yml:

```yaml
services:
  origami-ai:
    build: .
    container_name: origami-ai
    ports:
      - "3000:3000"
    restart: unless-stopped
    environment:
      - PORT=3000
      - NODE_ENV=production
```

Alternatively, build and run the image manually:

- Build the image:

  ```bash
  docker build -t origami-ai .
  ```

- Run the container:

  ```bash
  docker run -d -p 3000:3000 --name origami-ai origami-ai
  ```
This project is designed to be easily self-hosted using Dokploy. Simply point Dokploy to this repository, and it will automatically detect the Dockerfile and docker-compose.yml to orchestrate the deployment.
The application will be available at http://localhost:3000.
Drag and drop your presentation PDF into the main upload area. The application will process text from each page to create initial slides.
Scroll down to the Configure Slides panel to manage your project globally:
- Global Settings: Set a global voice (or create a custom Hybrid Voice), adjust post-slide delays, or run batch operations like "Find & Replace".
- Media Assets: Click Insert Video to add MP4 clips or GIFs between slides.
- Audio Mixing: Upload custom background music or select from the library (e.g., "Modern EDM"). Use the sliders to mix volume levels.
In the Slide Editor grid:
- AI Scripting: Click the AI Fix Script button (Sparkles icon) to have Gemini rewrite raw slide text into a natural spoken script.
- Manual Editing: Edit scripts directly. Highlight specific text sections to generate/regenerate audio for just that part.
- Generate Output: Click the Generate TTS button (Speech icon) to create voiceovers.
- Preview: Click the Play button to hear the result or click the slide thumbnail to expand the visual preview.
Click the Download Video button. The application will:
- Pre-process the slide configuration in your browser.
- Render frames in parallel using your browser's resources.
- Process the final video and audio mix using client-side FFmpeg WASM.
- Normalize the audio to -14 LUFS and download the resulting MP4.
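The -14 LUFS target in the last step corresponds to FFmpeg's `loudnorm` filter. As a hedged sketch, a helper that builds the relevant argument list (the function name and the true-peak/LRA defaults are assumptions, not values taken from Origami's code):

```typescript
// Illustrative sketch: build FFmpeg "loudnorm" arguments targeting -14 LUFS
// (the YouTube reference loudness). I = integrated loudness, TP = true peak,
// LRA = loudness range.

function loudnormArgs(targetLufs = -14, truePeakDb = -1.0): string[] {
  return ["-af", `loudnorm=I=${targetLufs}:TP=${truePeakDb}:LRA=11`];
}
```

With FFmpeg WASM, such arguments would be spliced into the command array passed to the worker, e.g. `["-i", "mix.wav", ...loudnormArgs(), "out.mp4"]`.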
Open the Settings Modal (Gear Icon) to customize the application:
Configure the AI model used for script refinement ("AI Fix Script").
- Google Gemini: Built-in and recommended. Requires a Google AI Studio API Key.
- Custom/OpenAI-Compatible: Point to any OpenAI-compatible endpoint (e.g., LocalAI, Ollama, vLLM).
  - Base URL: Enter your provider's URL (e.g., `http://localhost:11434/v1`).
  - Model Name: Specify the model ID (e.g., `llama-3`).
  - API Key: Enter if required by your provider.
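For reference, requests to an OpenAI-compatible endpoint follow a well-known shape. A minimal sketch, where the base URL, model name, and prompt are placeholders for whatever your provider uses (the `chatRequest` helper is invented for illustration):

```typescript
// Illustrative only: the shape of a chat-completions request to an
// OpenAI-compatible endpoint (LocalAI, Ollama, vLLM, ...).

function chatRequest(baseUrl: string, model: string, apiKey?: string) {
  return {
    // Trim any trailing slash before appending the standard path.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // The Authorization header is only sent when a key is provided.
        ...(apiKey ? { Authorization: `Bearer ${apiKey}` } : {}),
      },
      body: JSON.stringify({
        model,
        messages: [
          { role: "user", content: "Rewrite these slide notes as a spoken script." },
        ],
      }),
    },
  };
}
```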
- Engine: Choose between the internal Web Worker (client-side) or a local Dockerized Kokoro instance (faster/server-side).
- Audio Defaults: Set default voice models and quantization levels (q4/q8).
You can build your own library of background music tracks that will be available in the dropdown menus:
- Navigate to the `src/assets/music/` directory.
- Paste your `.mp3` files here.
- The application will automatically detect these files and list them in the UI (e.g., `my_cool_track.mp3` becomes "My Cool Track").
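The filename-to-label convention can be sketched as a small helper (hypothetical, not the app's actual code): strip the extension, split on underscores or hyphens, and title-case each word.

```typescript
// Sketch of the assumed naming rule: "my_cool_track.mp3" -> "My Cool Track".

function trackLabel(filename: string): string {
  return filename
    .replace(/\.mp3$/i, "")        // drop the extension
    .split(/[_\-\s]+/)             // split on underscores, hyphens, spaces
    .map((w) => w.charAt(0).toUpperCase() + w.slice(1))
    .join(" ");
}
```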
- Frontend: React 19, Vite, Tailwind CSS (v4)
- Video Engine: FFmpeg WASM (Client-side)
- AI: Google Gemini API & WebLLM (Local Browser Inference)
- TTS: Kokoro (FastAPI / Web Worker)
- Backend: Express.js (serving as a rendering orchestration layer)
- Utilities: Lucide React (icons), dnd-kit (drag & drop), pdfjs-dist (PDF processing)
- `src/components/`: React UI components (Slide Editor, Modals, Uploaders).
- `src/services/`: Core logic for AI, TTS, PDF processing, and local storage.
- `server.ts`: Express server handling static file serving and SPA routing.
- YouTube Metadata Generator: Automatically generate optimized titles and descriptions using Gemini.
- Thumbnail Generator: Create custom YouTube thumbnails based on slide content.
- Voiceover Recording: Support for recording custom voiceovers directly within the app using a microphone.
- Header Layout Optimization: Refactor and organize the application header for better aesthetics and usability.
This project is made possible by the following incredible open-source libraries and projects:
- FFmpeg.wasm: Enabling frame-perfect video assembly directly in the browser.
- WebLLM: Bringing high-performance local LLM inference to the web.
- Kokoro-js: Providing high-quality, local Text-to-Speech capabilities.
- Hugging Face Transformers: Powering state-of-the-art machine learning in the browser.
- PDF.js: The standard for parsing and rendering PDF documents.
- Lucide React: Beautifully crafted open-source icons.
- dnd-kit: A modern, lightweight toolkit for drag-and-drop interfaces.
- Dokploy: The open-source platform used for seamless self-hosting and deployment.
- Google Antigravity: The AI-powered IDE used to build and refine this project.