Pixelbot is a multimodal, context-aware AI agent built on Pixeltable, open-source AI data infrastructure. The agent can process and reason about multiple data types (documents, images, videos, audio), use external tools, search a knowledge base derived from uploaded files, generate images, maintain chat history, and draw on a selective memory bank.
The backend is built with Flask (Python) and the frontend with vanilla JavaScript. This open-source code fully replicates what you can find at https://agent.pixeltable.com/, which is hosted on AWS EC2 instances.
Pixeltable acts as AI Data Infrastructure, simplifying the development of this complex, infinite-memory multimodal agent:
- **Declarative Workflows:** The entire agent logic, from data ingestion and processing to LLM calls and tool execution, is defined declaratively using Pixeltable tables, views, and computed columns (`setup_pixeltable.py`). Pixeltable automatically manages dependencies and execution order (see the sketch below this list).
- **Unified Data Handling:** Natively handles diverse data types (documents, images, videos, audio) within its tables, eliminating the need for separate storage solutions.
- **Automated Processing:** Computed columns automatically trigger functions (like thumbnail generation, audio extraction, transcription via Whisper, and image generation via DALL-E) when new data arrives or dependencies change.
- **Efficient Transformations:** Views and iterators (like `DocumentSplitter`, `FrameIterator`, and `AudioSplitter`) process data on the fly (e.g., chunking documents, extracting video frames) without duplicating the underlying data.
- **Integrated Search:** Embedding indexes are easily added to tables and views, enabling powerful semantic search across text, images, and frames with simple syntax (`.similarity()`).
- **Seamless Tool Integration:** Any Python function (`@pxt.udf`) or Pixeltable query function (`@pxt.query`) can be registered as a tool for the LLM using `pxt.tools()`. Pixeltable handles the invocation (`pxt.invoke_tools()`) based on the LLM's decision.
- **State Management:** Persistently stores all relevant application state (uploaded files, chat history, memory, generated images, workflow runs) within its managed tables.
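To make the declarative pattern concrete, here is a minimal, self-contained sketch of the core idioms this project relies on: a table, a view built from an iterator, an embedding index, and a similarity query. The table/view paths mirror this project's schema; the separator choice and embedding model ID are illustrative assumptions, not the app's literal code.

```python
import pixeltable as pxt
from pixeltable.iterators import DocumentSplitter
from pixeltable.functions.huggingface import sentence_transformer

# A table natively stores documents; inserting rows triggers all dependent computation.
docs = pxt.create_table('agents.collection', {'document': pxt.Document})

# A view chunks each document on the fly -- no duplication of the underlying data.
chunks = pxt.create_view(
    'agents.chunks',
    docs,
    iterator=DocumentSplitter.create(document=docs.document, separators='paragraph'),
)

# An embedding index keeps itself up to date as new rows arrive.
chunks.add_embedding_index(
    'text',
    string_embed=sentence_transformer.using(model_id='intfloat/multilingual-e5-large-instruct'),
)

# Semantic search is then a one-liner.
sim = chunks.text.similarity('What did the report say about revenue?')
results = chunks.order_by(sim, asc=False).select(chunks.text, score=sim).limit(5).collect()
```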
```mermaid
flowchart TD
%% User Interaction
User([User]) -->|Query| ToolsTable[agents.tools]
User -->|Selective Memory| MemoryBankTable[agents.memory_bank]
User -->|Upload Files| SourceTables["agents.collection, agents.images, agents.videos, agents.audios"]
User -->|Generate Image| ImageGenTable[agents.image_generation_tasks]
%% Main Agent Workflow
ToolsTable -->|Prompt| DocSearch[Search Documents]
ToolsTable -->|Prompt| ImageSearch[Search Images]
ToolsTable -->|Prompt| VideoFrameSearch[Search Video Frames]
ToolsTable -->|Prompt, Tools| InitialLLM[Claude 3.5 - Tools]
AvailableTools["**Available Tools**:
get_latest_news
fetch_financial_data
search_news
search_video_transcripts
search_audio_transcripts"] -.-> InitialLLM
InitialLLM -->|Tool Choice| ToolExecution[pxt.invoke_tools]
ToolExecution --> ToolOutput[Tool Output]
%% Context Assembly
DocSearch -->|Context| AssembleTextContext[Assemble Text Context]
ImageSearch -->|Context| AssembleFinalMessages[Assemble Final Messages]
VideoFrameSearch -->|Context| AssembleFinalMessages
ToolOutput -->|Context| AssembleTextContext
AssembleTextContext -->|Text Summary| AssembleFinalMessages
ToolsTable -->|Recent History| AssembleFinalMessages
MemIndex -->|Context| AssembleTextContext
ChatHistIndex -->|Context| AssembleTextContext
%% Final LLM Call & Output
AssembleFinalMessages -->|Messages| FinalLLM[Claude 3.5 - Answer]
FinalLLM -->|Answer| ExtractAnswer[Extract Answer]
ExtractAnswer -->|Answer| User
ExtractAnswer -->|Answer| LogChat[agents.chat_history]
ToolsTable -->|User Prompt| LogChat
%% Follow-up Generation
FinalLLM -->|Answer| FollowUpLLM[Mistral Small - Follow-up]
FollowUpLLM -->|Suggestions| User
%% Image Generation Workflow
ImageGenTable -->|Prompt| OpenAI_Dalle[DALL-E 3]
OpenAI_Dalle -->|Image Data| ImageGenTable
ImageGenTable -->|Retrieve Image| User
%% Supporting Structures
SourceTables --> Views[**Materialized Views**
Chunks, Frames, Sentences]
Views --> Indexes[Embedding Indexes
E5, CLIP]
MemoryBankTable --> MemIndex[Search Memory]
LogChat --> ChatHistIndex[Search Conversations]
%% Styling
classDef table fill:#E1C1E9,stroke:#333,stroke-width:1px
classDef view fill:#C5CAE9,stroke:#333,stroke-width:1px
classDef llm fill:#FFF9C4,stroke:#333,stroke-width:1px
classDef workflow fill:#E1F5FE,stroke:#333,stroke-width:1px
classDef search fill:#C8E6C9,stroke:#333,stroke-width:1px
classDef tool fill:#FFCCBC,stroke:#333,stroke-width:1px
classDef io fill:#fff,stroke:#000,stroke-width:2px
class User io
    class ToolsTable,SourceTables,ImageGenTable,LogChat,MemoryBankTable table
class Views view
class Indexes,MemIndex,ChatHistIndex search
class InitialLLM,FinalLLM,FollowUpLLM,OpenAI_Dalle llm
    class DocSearch,ImageSearch,VideoFrameSearch search
class ToolExecution,AvailableTools,ToolOutput tool
    class AssembleTextContext,AssembleFinalMessages,ExtractAnswer workflow
```
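The tool-use core of this diagram boils down to a pair of computed columns on `agents.tools`. A condensed, hedged sketch using Pixeltable's Anthropic integration follows; the model ID and message shape are illustrative, and the real column definitions live in `setup_pixeltable.py`.

```python
import pixeltable as pxt
from pixeltable.functions.anthropic import messages
import functions  # this project's UDF module

t = pxt.get_table('agents.tools')

# Register UDFs (and @pxt.query functions) as tools the LLM may call.
tools = pxt.tools(functions.get_latest_news, functions.fetch_financial_data)

# First pass: Claude sees the prompt plus tool schemas and picks a tool (or none).
t.add_computed_column(initial_response=messages(
    model='claude-3-5-sonnet-latest',
    messages=[{'role': 'user', 'content': t.prompt}],
    tools=tools,
))

# Pixeltable executes whichever tool(s) the model chose and stores the output.
t.add_computed_column(tool_output=pxt.invoke_tools(tools, t.initial_response))
```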
```
.
├── .env                  # Environment variables (API keys, AUTH_MODE)
├── .venv/                # Virtual environment files (if created here)
├── data/                 # Default directory for uploaded/source media files
├── logs/                 # Application logs
│   └── app.log
├── static/               # Static assets for Flask frontend (CSS, JS, Images)
│   ├── css/style.css
│   ├── image/*.png
│   ├── js/
│   │   ├── api.js
│   │   └── ui.js
│   ├── manifest.json
│   ├── robots.txt
│   └── sitemap.xml
├── templates/            # HTML templates for Flask frontend
│   └── index.html
├── endpoint.py           # Flask backend: API endpoints and UI rendering
├── functions.py          # Python UDFs and context assembly logic
├── config.py             # Central configuration (model IDs, defaults, personas)
├── requirements.txt      # Python dependencies
└── setup_pixeltable.py   # Pixeltable schema definition script
```
Pixeltable organizes data in directories, tables, and views. This application uses the following structure within the `agents` directory:
```
agents/
├── collection                    # Table: Source documents (PDF, TXT, etc.)
│   ├── document: pxt.Document
│   ├── uuid: pxt.String
│   └── timestamp: pxt.Timestamp
├── images                        # Table: Source images
│   ├── image: pxt.Image
│   ├── uuid: pxt.String
│   ├── timestamp: pxt.Timestamp
│   └── thumbnail: pxt.String(computed)        # Base64 sidebar thumbnail
├── videos                        # Table: Source videos
│   ├── video: pxt.Video
│   ├── uuid: pxt.String
│   ├── timestamp: pxt.Timestamp
│   └── audio: pxt.Audio(computed)             # Extracted audio (used by audio_chunks view)
├── audios                        # Table: Source audio files (MP3, WAV)
│   ├── audio: pxt.Audio
│   ├── uuid: pxt.String
│   └── timestamp: pxt.Timestamp
├── chat_history                  # Table: Stores conversation turns
│   ├── role: pxt.String                       # 'user' or 'assistant'
│   ├── content: pxt.String
│   └── timestamp: pxt.Timestamp
├── memory_bank                   # Table: Saved text/code snippets
│   ├── content: pxt.String
│   ├── type: pxt.String                       # 'code' or 'text'
│   ├── language: pxt.String                   # e.g., 'python'
│   ├── context_query: pxt.String              # Original query or note
│   └── timestamp: pxt.Timestamp
├── image_generation_tasks        # Table: Image generation requests & results
│   ├── prompt: pxt.String
│   ├── timestamp: pxt.Timestamp
│   └── generated_image: pxt.Image(computed)   # DALL-E 3 output
├── user_personas                 # Table: User-defined personas
│   ├── persona_name: pxt.String
│   ├── initial_prompt: pxt.String
│   ├── final_prompt: pxt.String
│   ├── llm_params: pxt.Json
│   └── timestamp: pxt.Timestamp
├── tools                         # Table: Main agent workflow orchestration
│   ├── prompt: pxt.String
│   ├── timestamp: pxt.Timestamp
│   ├── user_id: pxt.String
│   ├── initial_system_prompt: pxt.String
│   ├── final_system_prompt: pxt.String
│   ├── max_tokens, stop_sequences, temperature, top_k, top_p  # LLM params
│   ├── initial_response: pxt.Json(computed)               # Claude tool choice output
│   ├── tool_output: pxt.Json(computed)                    # Output from executed tools (UDFs or queries)
│   ├── doc_context: pxt.Json(computed)                    # Results from document search
│   ├── image_context: pxt.Json(computed)                  # Results from image search
│   ├── video_frame_context: pxt.Json(computed)            # Results from video frame search
│   ├── memory_context: pxt.Json(computed)                 # Results from memory bank search
│   ├── chat_memory_context: pxt.Json(computed)            # Results from chat history search
│   ├── history_context: pxt.Json(computed)                # Recent chat turns
│   ├── multimodal_context_summary: pxt.String(computed)   # Assembled text context for final LLM
│   ├── final_prompt_messages: pxt.Json(computed)          # Fully assembled messages (incl. images/frames) for final LLM
│   ├── final_response: pxt.Json(computed)                 # Claude final answer generation output
│   ├── answer: pxt.String(computed)                       # Extracted text answer
│   ├── follow_up_input_message: pxt.String(computed)      # Formatted prompt for Mistral
│   ├── follow_up_raw_response: pxt.Json(computed)         # Raw Mistral response
│   └── follow_up_text: pxt.String(computed)               # Extracted follow-up suggestions
├── chunks                        # View: Document chunks via DocumentSplitter
│   └── (Implicit: EmbeddingIndex: E5-large-instruct on text)
├── video_frames                  # View: Video frames via FrameIterator (1 FPS)
│   └── (Implicit: EmbeddingIndex: CLIP on frame)
├── video_audio_chunks            # View: Audio chunks from videos table via AudioSplitter
│   └── transcription: pxt.Json(computed)                  # Whisper transcription
├── video_transcript_sentences    # View: Sentences from video transcripts via StringSplitter
│   └── (Implicit: EmbeddingIndex: E5-large-instruct on text)
├── audio_chunks                  # View: Audio chunks from audios table via AudioSplitter
│   └── transcription: pxt.Json(computed)                  # Whisper transcription
└── audio_transcript_sentences    # View: Sentences from direct audio transcripts via StringSplitter
    └── (Implicit: EmbeddingIndex: E5-large-instruct on text)

# Available tools (registered via pxt.tools()):
# - functions.get_latest_news (UDF)
# - functions.fetch_financial_data (UDF)
# - functions.search_news (UDF)
# - search_video_transcripts (@pxt.query function)
# - search_audio_transcripts (@pxt.query function)

# Embedding indexes enabled on:
# - agents.chunks.text
# - agents.images.image
# - agents.video_frames.frame
# - agents.video_transcript_sentences.text
# - agents.audio_transcript_sentences.text
# - agents.memory_bank.content
# - agents.chat_history.content
```
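Two representative computed-column chains from this schema, sketched with Pixeltable's built-in OpenAI functions. The iterator parameter values, chunk duration, and model IDs here are illustrative assumptions; the authoritative definitions are in `setup_pixeltable.py`.

```python
import pixeltable as pxt
from pixeltable.functions.video import extract_audio
from pixeltable.functions import openai
from pixeltable.iterators import AudioSplitter

# Chain 1: video -> extracted audio -> audio chunks -> Whisper transcription.
videos = pxt.get_table('agents.videos')
videos.add_computed_column(audio=extract_audio(videos.video, format='mp3'))

video_audio_chunks = pxt.create_view(
    'agents.video_audio_chunks',
    videos,
    iterator=AudioSplitter.create(audio=videos.audio, chunk_duration_sec=60.0),
)
video_audio_chunks.add_computed_column(
    transcription=openai.transcriptions(audio=video_audio_chunks.audio_chunk, model='whisper-1')
)

# Chain 2: prompt -> DALL-E 3 image, computed as soon as a row is inserted.
tasks = pxt.get_table('agents.image_generation_tasks')
tasks.add_computed_column(
    generated_image=openai.image_generations(tasks.prompt, model='dall-e-3')
)
```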
You are welcome to swap any of these calls, e.g., WhisperX instead of OpenAI Whisper, or llama.cpp instead of Mistral, either through our built-in modules or by bringing your own models, frameworks, and API calls. See our integration and UDF pages to learn more. You can easily make this application entirely local by relying on local LLM runtimes and local embedding/transcription solutions.
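For instance, swapping the hosted Mistral follow-up call for a local model is just a user-defined function. A hypothetical sketch, assuming the `llama-cpp-python` package and a locally downloaded GGUF file (the path and prompt are illustrative):

```python
import pixeltable as pxt
from llama_cpp import Llama  # assumption: llama-cpp-python is installed

# Load the model once at import time, not per row.
_llm = Llama(model_path='models/mistral-7b-instruct.Q4_K_M.gguf', verbose=False)

@pxt.udf
def follow_up_suggestions(answer: str) -> str:
    """Local replacement for the hosted Mistral follow-up call."""
    out = _llm.create_chat_completion(
        messages=[{'role': 'user',
                   'content': f'Suggest three short follow-up questions to this answer:\n{answer}'}],
        max_tokens=128,
    )
    return out['choices'][0]['message']['content']

# Used like any built-in function inside a computed column, e.g.:
# t.add_computed_column(follow_up_text=follow_up_suggestions(t.answer))
```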
- Python 3.9+
- API Keys:
- Anthropic
- OpenAI
- Mistral AI
- NewsAPI (100 requests per day free)
```bash
# 1. Create and activate a virtual environment (recommended)
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt
```
Create a `.env` file in the project root and add your API keys. Keys marked with `*` are required for core LLM functionality.
```bash
# Required for core LLM functionality *
ANTHROPIC_API_KEY=sk-ant-api03-...  # Main reasoning/tool use (Claude 3.5 Sonnet)
OPENAI_API_KEY=sk-...               # Audio transcription (Whisper) & image generation (DALL-E 3)
MISTRAL_API_KEY=...                 # Follow-up question suggestions (Mistral Small)

# Optional (enable specific tools by providing keys)
NEWS_API_KEY=...                    # Enables the NewsAPI tool
# Note: the yfinance and DuckDuckGo Search tools do not require API keys.

# --- Authentication Mode (required to run locally) ---
# Set to 'local' to bypass the WorkOS authentication used at agent.pixeltable.com
# and to use a default user. Leaving this unset will result in errors.
AUTH_MODE=local
```
1. **Initialize the Pixeltable schema:** This script creates the Pixeltable directories, tables, views, and computed columns defined in `setup_pixeltable.py`. Run it once initially.

   *Why run this?* It defines the data structures and the declarative AI workflow within Pixeltable, telling Pixeltable how to store, transform, index, and process your data automatically.

   ```bash
   python setup_pixeltable.py
   ```
2. **Start the web server:** This runs the Flask application, using the Waitress production server by default.

   ```bash
   python endpoint.py
   ```

   The application will be available at http://localhost:5000.
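For reference, the "Waitress by default" behavior follows the standard serving pattern below. This is a generic sketch of the pattern, not necessarily the exact code in `endpoint.py`.

```python
# Minimal sketch: serving a Flask app with Waitress, a production-grade WSGI server.
from flask import Flask
from waitress import serve

app = Flask(__name__)

if __name__ == '__main__':
    # Waitress replaces Flask's single-threaded development server.
    serve(app, host='0.0.0.0', port=5000)
```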
Data Persistence Note: Pixeltable stores all of its data (file references, tables, views, indexes) locally, typically in a `.pixeltable` directory created within your project workspace. Your uploaded files, generated images, chat history, and memory bank therefore persist across application restarts.
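Because everything lives in regular Pixeltable tables, you can inspect persisted state from any Python session. A small sketch (the table path comes from the schema above):

```python
import pixeltable as pxt

# Re-open a persisted table by path; no setup script needed after the first run.
chat = pxt.get_table('agents.chat_history')
print(chat.count())  # rows survive application restarts

# Fetch the five most recent conversation turns.
recent = chat.order_by(chat.timestamp, asc=False).limit(5).collect()
print(recent)
```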
The web interface provides several tabs:
- Chat Interface: Main interaction area. Ask questions, switch between chat and image generation modes. View results, including context retrieved (images, video frames) and follow-up suggestions. Save responses to the Memory Bank.
- Agent Settings: Configure the system prompts (initial for tool use, final for answer generation) and LLM parameters (temperature, max tokens, etc.) used by Claude.
- Chat History: View past queries and responses. Search history and view detailed execution metadata for each query. Download history as JSON.
- Generated Images: View images created using the image generation mode. Search by prompt, view details, download, or delete images.
- Memory Bank: View, search, manually add, and delete saved text/code snippets. Download memory as JSON.
- How it Works: Provides a technical overview of how Pixeltable powers the application's features.
- **Unified Multimodal Data Management:** Ingests, manages, processes, and indexes documents (text, PDFs, Markdown), images (JPG, PNG), videos (MP4), and audio files (MP3, WAV) using Pixeltable's specialized data types.
- **Declarative AI Workloads:** Leverages Pixeltable's computed columns and views to declaratively define complex conditional workflows, including data processing (chunking, frame extraction, audio extraction), embedding generation, AI model inference, and context assembly, while maintaining data lineage and versioning.
- **Agentic RAG & Tool Use:** The agent dynamically decides which tools to use based on the query (see the sketch after this list). Available tools include:
  - External APIs: fetching news (NewsAPI, DuckDuckGo) and financial data (yfinance).
  - Internal knowledge search: Pixeltable `@pxt.query` functions are registered as tools, letting the agent search video and audio transcripts on demand, for example.
- **Semantic Search:** Implements vector search across multiple modalities, powered by embedding indexes that Pixeltable maintains incrementally and automatically:
  - Document chunks (`sentence-transformers`)
  - Images & video frames (`CLIP`)
  - Chat history (`sentence-transformers`)
  - Memory Bank items (`sentence-transformers`)
- **LLM Integration:** Seamlessly integrates multiple LLMs for different tasks within the Pixeltable workflow:
  - Reasoning & tool use: Anthropic Claude 3.5 Sonnet
  - Audio transcription: OpenAI Whisper (via computed columns on audio chunks)
  - Image generation: OpenAI DALL-E 3 (via computed columns on image prompts)
  - Follow-up suggestions: Mistral Small Latest
- **Chat History:** Persistently stores conversation turns in a Pixeltable table (`agents.chat_history`), enabling retrieval and semantic search over past interactions.
- **Memory Bank:** Allows saving and semantically searching important text snippets or code blocks, stored in a dedicated Pixeltable table (`agents.memory_bank`).
- **Image Generation:** Generates images from user prompts using DALL-E 3, orchestrated via a Pixeltable table (`agents.image_generation_tasks`).
- **Local Mode:** Supports running locally without external authentication (WorkOS) via `AUTH_MODE=local` for easier setup and development.
- **Responsive UI:** A clean web interface built with Flask, Tailwind CSS, and JavaScript.
- **Centralized Configuration:** Uses `config.py` to manage model IDs, default system prompts, LLM parameters, and persona presets.
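As referenced in the Agentic RAG item above, registering an internal semantic search as a tool is just a decorated query function. A hedged sketch following the pattern used in `setup_pixeltable.py` (the parameter name and result limit are illustrative):

```python
import pixeltable as pxt

sentences = pxt.get_table('agents.video_transcript_sentences')

@pxt.query
def search_video_transcripts(query_text: str):
    """Return the transcript sentences most similar to query_text."""
    sim = sentences.text.similarity(query_text)
    return (sentences.order_by(sim, asc=False)
                     .select(sentences.text, score=sim)
                     .limit(10))

# Registered alongside plain-Python UDFs; the LLM decides when to call it.
tools = pxt.tools(search_video_transcripts)
```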
This application serves as a comprehensive demonstration of Pixeltable's capabilities for managing complex multimodal AI workflows, covering data storage, transformation, indexing, retrieval, and serving.
The primary focus is on illustrating Pixeltable patterns and best practices within the `setup_pixeltable.py` script and the related user-defined functions (`functions.py`).
While functional, less emphasis was placed on optimizing the Flask application (`endpoint.py`) and the associated frontend components (`style.css`, `index.html`, `ui.js`, etc.). These parts should not necessarily be considered exemplars of web-development best practices.
For simpler examples demonstrating Pixeltable integration with various frameworks (FastAPI, React, TypeScript, Gradio, etc.), please refer to the Pixeltable Examples Documentation.