Pixelbot

Pixelbot is a multimodal, context-aware AI agent built with Pixeltable, open-source AI data infrastructure. The agent can process and reason about various data types (documents, images, videos, audio), use external tools, search a knowledge base derived from uploaded files, generate images, maintain a chat history, and leverage a selective memory bank.

Overview

The backend is built with Flask (Python) and the frontend with vanilla JavaScript. This open-source code fully replicates the app hosted at https://agent.pixeltable.com/, which runs on AWS EC2 instances.

🚀 How Pixeltable Powers This App

Pixeltable acts as AI Data Infrastructure, simplifying the development of this complex, infinite-memory multimodal agent:

  • 📜 Declarative Workflows: The entire agent logic, from data ingestion and processing to LLM calls and tool execution, is defined declaratively using Pixeltable tables, views, and computed columns (setup_pixeltable.py). Pixeltable automatically manages dependencies and execution order.
  • 🔀 Unified Data Handling: Natively handles diverse data types (documents, images, videos, audio) within its tables, eliminating the need for separate storage solutions.
  • ⚙️ Automated Processing: Computed columns automatically trigger functions (like thumbnail generation, audio extraction, transcription via Whisper, image generation via DALL-E) when new data arrives or dependencies change.
  • ✨ Efficient Transformations: Views and Iterators (like DocumentSplitter, FrameIterator, AudioSplitter) process data on the fly (e.g., chunking documents, extracting video frames) without duplicating the underlying data.
  • 🔎 Integrated Search: Embedding indexes are easily added to tables/views, enabling powerful semantic search across text, images, and frames with simple syntax (.similarity()).
  • 🔌 Seamless Tool Integration: Any Python function (@pxt.udf) or Pixeltable query function (@pxt.query) can be registered as a tool for the LLM using pxt.tools(). Pixeltable handles the invocation (pxt.invoke_tools()) based on the LLM's decision.
  • 💾 State Management: Persistently stores all relevant application state (uploaded files, chat history, memory, generated images, workflow runs) within its managed tables.
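
As a taste of this declarative style, here is a minimal sketch of a table with a computed column and an embedding index. It is illustrative rather than a copy of setup_pixeltable.py: the table mirrors agents.images from the schema below, but the thumbnail logic and the CLIP model ID are assumptions.

import pixeltable as pxt
from pixeltable.functions.huggingface import clip

# Source table: Pixeltable stores and versions the image natively.
images = pxt.create_table('agents.images', {'image': pxt.Image}, if_exists='ignore')

# Computed column: populated automatically whenever a new row arrives.
images.add_computed_column(thumbnail=images.image.resize([96, 96]))

# Embedding index: maintained incrementally as rows are added or changed.
images.add_embedding_index('image', embedding=clip.using(model_id='openai/clip-vit-base-patch32'))

# Semantic search with the .similarity() syntax mentioned above:
sim = images.image.similarity('a person riding a bicycle')
results = images.order_by(sim, asc=False).select(images.image).limit(3).collect()

The full agent workflow is summarized in the diagram below:
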
flowchart TD
    %% User Interaction
    User([User]) -->|Query| ToolsTable[agents.tools]
    User -->|Selective Memory| MemoryBankTable[agents.memory_bank]
    User -->|Upload Files| SourceTables["agents.collection, agents.images, agents.videos, agents.audios"]
    User -->|Generate Image| ImageGenTable[agents.image_generation_tasks]

    %% Main Agent Workflow
    ToolsTable -->|Prompt| DocSearch[Search Documents]
    ToolsTable -->|Prompt| ImageSearch[Search Images]
    ToolsTable -->|Prompt| VideoFrameSearch[Search Video Frames]

    ToolsTable -->|Prompt, Tools| InitialLLM[Claude 3.5 - Tools]
    AvailableTools["**Available Tools**:
    get_latest_news
    fetch_financial_data
    search_news
    search_video_transcripts
    search_audio_transcripts"] -.-> InitialLLM
    InitialLLM -->|Tool Choice| ToolExecution[pxt.invoke_tools]
    ToolExecution --> ToolOutput[Tool Output]

    %% Context Assembly
    DocSearch -->|Context| AssembleTextContext[Assemble Text Context]
    ImageSearch -->|Context| AssembleFinalMessages[Assemble Final Messages]
    VideoFrameSearch -->|Context| AssembleFinalMessages

    ToolOutput -->|Context| AssembleTextContext
    AssembleTextContext -->|Text Summary| AssembleFinalMessages
    ToolsTable -->|Recent History| AssembleFinalMessages
    MemIndex -->|Context| AssembleTextContext
    ChatHistIndex -->|Context| AssembleTextContext

    %% Final LLM Call & Output
    AssembleFinalMessages -->|Messages| FinalLLM[Claude 3.5 - Answer]
    FinalLLM -->|Answer| ExtractAnswer[Extract Answer]
    ExtractAnswer -->|Answer| User
    ExtractAnswer -->|Answer| LogChat[agents.chat_history]
    ToolsTable -->|User Prompt| LogChat

    %% Follow-up Generation
    FinalLLM -->|Answer| FollowUpLLM[Mistral Small - Follow-up]
    FollowUpLLM -->|Suggestions| User

    %% Image Generation Workflow
    ImageGenTable -->|Prompt| OpenAI_Dalle[DALL-E 3]
    OpenAI_Dalle -->|Image Data| ImageGenTable
    ImageGenTable -->|Retrieve Image| User

    %% Supporting Structures
    SourceTables --> Views["**Materialized Views**<br/>Chunks, Frames, Sentences"]
    Views --> Indexes["Embedding Indexes<br/>E5, CLIP"]
    MemoryBankTable --> MemIndex[Search Memory]
    LogChat --> ChatHistIndex[Search Conversations]

    %% Styling
    classDef table fill:#E1C1E9,stroke:#333,stroke-width:1px
    classDef view fill:#C5CAE9,stroke:#333,stroke-width:1px
    classDef llm fill:#FFF9C4,stroke:#333,stroke-width:1px
    classDef workflow fill:#E1F5FE,stroke:#333,stroke-width:1px
    classDef search fill:#C8E6C9,stroke:#333,stroke-width:1px
    classDef tool fill:#FFCCBC,stroke:#333,stroke-width:1px
    classDef io fill:#fff,stroke:#000,stroke-width:2px

    class User io
    class ToolsTable,SourceTables,ImageGenTable,LogChat,MemoryBankTable table
    class Views view
    class Indexes,MemIndex,ChatHistIndex search
    class InitialLLM,FinalLLM,FollowUpLLM,OpenAI_Dalle llm
    class DocSearch,ImageSearch,VideoFrameSearch search
    class ToolExecution,AvailableTools,ToolOutput tool
    class AssembleTextContext,AssembleFinalMessages,ExtractAnswer workflow
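
In code, the workflow in the diagram reduces to inserting a row into agents.tools and reading back its computed columns. A hedged sketch (the prompt value is made up, and depending on the schema, insert() may require additional columns such as user_id):

import pixeltable as pxt

tools_t = pxt.get_table('agents.tools')

# Inserting a prompt row triggers the whole chain: tool choice, searches,
# context assembly, the final Claude call, and follow-up suggestions.
tools_t.insert([{'prompt': 'Summarize what the uploaded report says about Q3.'}])

# Computed columns populate automatically; read the extracted answer back.
print(tools_t.select(tools_t.prompt, tools_t.answer).collect())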

πŸ“ Project Structure

.
├── .env                  # Environment variables (API keys, AUTH_MODE)
├── .venv/                # Virtual environment files (if created here)
├── data/                 # Default directory for uploaded/source media files
├── logs/                 # Application logs
│   └── app.log
├── static/               # Static assets for Flask frontend (CSS, JS, images)
│   ├── css/style.css
│   ├── image/*.png
│   ├── js/
│   │   ├── api.js
│   │   └── ui.js
│   ├── manifest.json
│   ├── robots.txt
│   └── sitemap.xml
├── templates/            # HTML templates for Flask frontend
│   └── index.html
├── endpoint.py           # Flask backend: API endpoints and UI rendering
├── functions.py          # Python UDFs and context assembly logic
├── config.py             # Central configuration (model IDs, defaults, personas)
├── requirements.txt      # Python dependencies
└── setup_pixeltable.py   # Pixeltable schema definition script

📊 Pixeltable Schema Overview

Pixeltable organizes data in directories, tables, and views. This application uses the following structure within the agents directory:

agents/
├── collection              # Table: Source documents (PDF, TXT, etc.)
│   ├── document: pxt.Document
│   ├── uuid: pxt.String
│   └── timestamp: pxt.Timestamp
├── images                  # Table: Source images
│   ├── image: pxt.Image
│   ├── uuid: pxt.String
│   ├── timestamp: pxt.Timestamp
│   └── thumbnail: pxt.String(computed) # Base64 sidebar thumbnail
├── videos                  # Table: Source videos
│   ├── video: pxt.Video
│   ├── uuid: pxt.String
│   ├── timestamp: pxt.Timestamp
│   └── audio: pxt.Audio(computed)      # Extracted audio (used by audio_chunks view)
├── audios                  # Table: Source audio files (MP3, WAV)
│   ├── audio: pxt.Audio
│   ├── uuid: pxt.String
│   └── timestamp: pxt.Timestamp
├── chat_history            # Table: Stores conversation turns
│   ├── role: pxt.String        # 'user' or 'assistant'
│   ├── content: pxt.String
│   └── timestamp: pxt.Timestamp
├── memory_bank             # Table: Saved text/code snippets
│   ├── content: pxt.String
│   ├── type: pxt.String          # 'code' or 'text'
│   ├── language: pxt.String      # e.g., 'python'
│   ├── context_query: pxt.String # Original query or note
│   └── timestamp: pxt.Timestamp
├── image_generation_tasks  # Table: Image generation requests & results
│   ├── prompt: pxt.String
│   ├── timestamp: pxt.Timestamp
│   └── generated_image: pxt.Image(computed) # DALL-E 3 output
├── user_personas           # Table: User-defined personas
│   ├── persona_name: pxt.String
│   ├── initial_prompt: pxt.String
│   ├── final_prompt: pxt.String
│   ├── llm_params: pxt.Json
│   └── timestamp: pxt.Timestamp
├── tools                   # Table: Main agent workflow orchestration
│   ├── prompt: pxt.String
│   ├── timestamp: pxt.Timestamp
│   ├── user_id: pxt.String
│   ├── initial_system_prompt: pxt.String
│   ├── final_system_prompt: pxt.String
│   ├── max_tokens, stop_sequences, temperature, top_k, top_p # LLM params
│   ├── initial_response: pxt.Json(computed)  # Claude tool choice output
│   ├── tool_output: pxt.Json(computed)       # Output from executed tools (UDFs or queries)
│   ├── doc_context: pxt.Json(computed)       # Results from document search
│   ├── image_context: pxt.Json(computed)     # Results from image search
│   ├── video_frame_context: pxt.Json(computed) # Results from video frame search
│   ├── memory_context: pxt.Json(computed)    # Results from memory bank search
│   ├── chat_memory_context: pxt.Json(computed) # Results from chat history search
│   ├── history_context: pxt.Json(computed)   # Recent chat turns
│   ├── multimodal_context_summary: pxt.String(computed) # Assembled text context for final LLM
│   ├── final_prompt_messages: pxt.Json(computed) # Fully assembled messages (incl. images/frames) for final LLM
│   ├── final_response: pxt.Json(computed)    # Claude final answer generation output
│   ├── answer: pxt.String(computed)          # Extracted text answer
│   ├── follow_up_input_message: pxt.String(computed) # Formatted prompt for Mistral
│   ├── follow_up_raw_response: pxt.Json(computed) # Raw Mistral response
│   └── follow_up_text: pxt.String(computed)  # Extracted follow-up suggestions
├── chunks                  # View: Document chunks via DocumentSplitter
│   └── (Implicit: EmbeddingIndex: E5-large-instruct on text)
├── video_frames            # View: Video frames via FrameIterator (1 FPS)
│   └── (Implicit: EmbeddingIndex: CLIP on frame)
├── video_audio_chunks      # View: Audio chunks from video table via AudioSplitter
│   └── transcription: pxt.Json(computed)   # Whisper transcription
├── video_transcript_sentences # View: Sentences from video transcripts via StringSplitter
│   └── (Implicit: EmbeddingIndex: E5-large-instruct on text)
├── audio_chunks            # View: Audio chunks from audio table via AudioSplitter
│   └── transcription: pxt.Json(computed)   # Whisper transcription
└── audio_transcript_sentences # View: Sentences from direct audio transcripts via StringSplitter
    └── (Implicit: EmbeddingIndex: E5-large-instruct on text)

# Available Tools (Registered via pxt.tools()):
# - functions.get_latest_news (UDF)
# - functions.fetch_financial_data (UDF)
# - functions.search_news (UDF)
# - search_video_transcripts (@pxt.query function)
# - search_audio_transcripts (@pxt.query function)

# Embedding Indexes Enabled On:
# - agents.chunks.text
# - agents.images.image
# - agents.video_frames.frame
# - agents.video_transcript_sentences.text
# - agents.audio_transcript_sentences.text
# - agents.memory_bank.content
# - agents.chat_history.content
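
For illustration, a view plus its embedding index is declared roughly like this (a hedged sketch; the exact splitter parameters and embedding model ID in setup_pixeltable.py may differ):

import pixeltable as pxt
from pixeltable.iterators import DocumentSplitter
from pixeltable.functions.huggingface import sentence_transformer

docs = pxt.get_table('agents.collection')

# View: chunks documents on the fly without copying the source data.
chunks = pxt.create_view(
    'agents.chunks',
    docs,
    iterator=DocumentSplitter.create(document=docs.document, separators='sentence'),
)

# Index: kept up to date incrementally as documents are added.
chunks.add_embedding_index(
    'text',
    embedding=sentence_transformer.using(model_id='intfloat/multilingual-e5-large-instruct'),
)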

▶️ Getting Started

Prerequisites

You are welcome to swap out any of the calls below, e.g., WhisperX instead of OpenAI Whisper, or Llama.cpp instead of Mistral, either through Pixeltable's built-in modules or by bringing your own models, frameworks, and API calls. See the integration and UDF documentation pages to learn more. You can make this application run entirely locally by relying on local LLM runtimes and local embedding/transcription solutions.
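
For example, a hedged sketch of swapping the hosted transcription call for a local Whisper model (the chunk column name and model size here are assumptions, not taken from setup_pixeltable.py):

import pixeltable as pxt
from pixeltable.functions import whisper  # runs openai-whisper locally

audio_chunks = pxt.get_table('agents.audio_chunks')

# Computed column that transcribes locally instead of calling the OpenAI API.
audio_chunks.add_computed_column(
    local_transcription=whisper.transcribe(audio=audio_chunks.audio_chunk, model='base.en')
)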

Installation

# 1. Create and activate a virtual environment (recommended)
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt

Environment Setup

Create a .env file in the project root and add your API keys. Keys marked with * are required for core LLM functionality.

# Required for Core LLM Functionality *
ANTHROPIC_API_KEY=sk-ant-api03-...  # For main reasoning/tool use (Claude 3.5 Sonnet)
OPENAI_API_KEY=sk-...             # For audio transcription (Whisper) & image generation (DALL-E 3)
MISTRAL_API_KEY=...               # For follow-up question suggestions (Mistral Small)

# Optional (Enable specific tools by providing keys)
NEWS_API_KEY=...                  # Enables the NewsAPI tool
# Note: yfinance and DuckDuckGo Search tools do not require API keys.

# --- Authentication Mode (required to run locally) ---
# Set to 'local' to bypass the WorkOS authentication used at agent.pixeltable.com
# and to use a default local user. Leaving this unset will result in errors.
AUTH_MODE=local

Running the Application

  1. Initialize Pixeltable Schema: This script creates the necessary Pixeltable directories, tables, views, and computed columns defined in setup_pixeltable.py. Run this once initially.

    Why run this? This defines the data structures and the declarative AI workflow within Pixeltable. It tells Pixeltable how to store, transform, index, and process your data automatically.

    python setup_pixeltable.py
  2. Start the Web Server: This runs the Flask application using the Waitress production server by default.

    python endpoint.py

    The application will be available at http://localhost:5000.
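
Optionally, sanity-check the schema from a Python shell (illustrative):

import pixeltable as pxt

# After setup_pixeltable.py has run, the agents tables and views should exist.
print(pxt.list_tables())  # expect entries like 'agents.tools', 'agents.chunks', ...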

Data Persistence Note: Pixeltable stores all its data (file references, tables, views, indexes) locally, typically in a .pixeltable directory created within your project workspace. This means your uploaded files, generated images, chat history, and memory bank are persistent across application restarts.

πŸ–±οΈ Usage Overview

The web interface provides several tabs:

  • Chat Interface: Main interaction area. Ask questions, switch between chat and image generation modes. View results, including context retrieved (images, video frames) and follow-up suggestions. Save responses to the Memory Bank.
  • Agent Settings: Configure the system prompts (initial for tool use, final for answer generation) and LLM parameters (temperature, max tokens, etc.) used by Claude.
  • Chat History: View past queries and responses. Search history and view detailed execution metadata for each query. Download history as JSON.
  • Generated Images: View images created using the image generation mode. Search by prompt, view details, download, or delete images.
  • Memory Bank: View, search, manually add, and delete saved text/code snippets. Download memory as JSON.
  • How it Works: Provides a technical overview of how Pixeltable powers the application's features.

⭐ Key Features

  • 💾 Unified Multimodal Data Management: Ingests, manages, processes, and indexes documents (text, PDFs, markdown), images (JPG, PNG), videos (MP4), and audio files (MP3, WAV) using Pixeltable's specialized data types.
  • βš™οΈ Declarative AI Workloads: Leverages Pixeltable's computed columns and views to declaratively define complex conditional workflows including data processing (chunking, frame extraction, audio extraction), embedding generation, AI model inference, and context assembly while maintaining data lineage and versioning.
  • 🧠 Agentic RAG & Tool Use: The agent dynamically decides which tools to use based on the query. Available tools include:
    • External APIs: Fetching news (NewsAPI, DuckDuckGo), financial data (yfinance).
    • Internal Knowledge Search: Pixeltable @pxt.query functions are registered as tools, allowing the agent to search video and audio transcripts on demand (see the sketch after this list).
  • πŸ” Semantic Search: Implements vector search across multiple modalities, powered by any embedding indexes that Pixeltable incrementally and automatically maintain:
    • Document Chunks (sentence-transformers)
    • Images & Video Frames (CLIP)
    • Chat History (sentence-transformers)
    • Memory Bank items (sentence-transformers)
  • 🔌 LLM Integration: Seamlessly integrates multiple LLMs for different tasks within the Pixeltable workflow:
    • Reasoning & Tool Use: Anthropic Claude 3.5 Sonnet
    • Audio Transcription: OpenAI Whisper (via computed columns on audio chunks)
    • Image Generation: OpenAI DALL-E 3 (via computed columns on image prompts)
    • Follow-up Suggestions: Mistral Small Latest
  • 💬 Chat History: Persistently stores conversation turns in a Pixeltable table (agents.chat_history), enabling retrieval and semantic search over past interactions.
  • 📝 Memory Bank: Allows saving and semantically searching important text snippets or code blocks stored in a dedicated Pixeltable table (agents.memory_bank).
  • 🖼️ Image Generation: Generates images based on user prompts using DALL-E 3, orchestrated via a Pixeltable table (agents.image_generation_tasks).
  • 🏠 Local Mode: Supports running locally without external authentication (WorkOS) (AUTH_MODE=local) for easier setup and development.
  • 🖥️ Responsive UI: A clean web interface built with Flask, Tailwind CSS, and JavaScript.
  • 🛠️ Centralized Configuration: Uses a central config.py to manage model IDs, default system prompts, LLM parameters, and persona presets.

⚠️ Disclaimer

This application serves as a comprehensive demonstration of Pixeltable's capabilities for managing complex multimodal AI workflows, covering data storage, transformation, indexing, retrieval, and serving.

The primary focus is on illustrating Pixeltable patterns and best practices within the setup_pixeltable.py script and related User-Defined Functions (functions.py).

While functional, the Flask application (endpoint.py) and the associated frontend components (style.css, index.html, ui.js, ...) received less optimization attention and should not be taken as exemplars of web development best practices.

For simpler examples demonstrating Pixeltable integration with various frameworks (FastAPI, React, TypeScript, Gradio, etc.), please refer to the Pixeltable Examples Documentation.