# Context Bridge

Unified Python package for RAG-powered documentation management: crawl, store, chunk, and retrieve technical documentation with vector + BM25 hybrid search.
- Overview
- Features
- Architecture
- Installation
- Quick Start
- Configuration
- Usage
- Database Schema
- Development
- Technical Documentation
- License
## Overview

Context Bridge is a standalone Python package designed to help AI agents, LLMs, and developers manage technical documentation with RAG (Retrieval-Augmented Generation) capabilities. It bridges the gap between scattered online documentation and AI-ready, searchable knowledge bases.
- Crawls technical documentation from URLs using Crawl4AI
- Organizes crawled pages into logical groups with size management
- Chunks Markdown content intelligently while preserving structure
- Embeds chunks using vector embeddings (Ollama/Gemini)
- Stores everything in PostgreSQL with vector and vchord_bm25
- Searches with hybrid vector + BM25 search for best results
- Serves via MCP (Model Context Protocol) for AI agent integration
- Manages through a Streamlit UI for human oversight
## Features

- Smart Crawling: Automatically detect and crawl documentation sites, sitemaps, and text files
- Intelligent Chunking: Smart Markdown chunking that respects code blocks, paragraphs, and sentences
- Hybrid Search: Dual vector + BM25 search for superior retrieval accuracy
- Version Management: Track multiple versions of the same documentation
- Document Organization: Manual page grouping with size constraints before chunking
- High Performance: PSQLPy for fast async PostgreSQL operations
- AI-Ready: MCP server for seamless AI agent integration
- User-Friendly: Streamlit UI for documentation management
- Vector Search: Powered by the PostgreSQL `vector` extension (pgvector)
- BM25 Full-Text Search: Using the `vchord_bm25` extension
- Async/Await: Fully asynchronous operations for scalability
- Configurable Embeddings: Support for Ollama (local) and Google Gemini (cloud)
- Type-Safe: Pydantic models for configuration and data validation
- Modular Design: Clean separation of concerns (repositories, services, managers)
## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                  Context Bridge Architecture                  │
└───────────────────────────────────────────────────────────────┘

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Streamlit   │    │  MCP Server  │    │  Python API  │
│      UI      │    │  (AI Agent)  │    │   (Direct)   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────┐
│              Service Layer              │
│  - CrawlingService                      │
│  - ChunkingService                      │
│  - EmbeddingService                     │
│  - SearchService                        │
└──────────────────────────┬──────────────┘
                           │
                           ▼
┌─────────────────────────────────────────┐
│             Repository Layer            │
│  - DocumentRepository                   │
│  - PageRepository                       │
│  - GroupRepository                      │
│  - ChunkRepository                      │
└──────────────────────────┬──────────────┘
                           │
                           ▼
┌─────────────────────────────────────────┐
│            PostgreSQL Manager           │
│  - Connection Pooling                   │
│  - Transaction Management               │
└──────────────────────────┬──────────────┘
                           │
                           ▼
┌─────────────────────────────────────────┐
│            PostgreSQL Database          │
│  Extensions:                            │
│  - vector (vector search)               │
│  - vchord_bm25 (BM25 search)            │
│  - pg_tokenizer (text tokenization)     │
└─────────────────────────────────────────┘

External Dependencies:

┌──────────────┐    ┌──────────────┐
│   Crawl4AI   │    │    Ollama    │
│  (Crawling)  │    │  or Gemini   │
│              │    │ (Embeddings) │
└──────────────┘    └──────────────┘
```
Data flow:

```
1. Crawl Documentation
        ↓
2. Store Raw Pages
        ↓
3. Manual Organization (Group Pages)
        ↓
4. Smart Chunking
        ↓
5. Generate Embeddings
        ↓
6. Store with Vector + BM25 Indexes
        ↓
7. Hybrid Search (Vector + BM25)
```
## Installation

Requirements:

- Python 3.11+
- PostgreSQL 14+ with extensions:
  - vector
  - vchord
  - pg_tokenizer
  - vchord_bm25
- Ollama (for local embeddings) or Google API Key (for Gemini)
```bash
pip install context-bridge

# With Gemini support
pip install context-bridge[gemini]

# With MCP server
pip install context-bridge[mcp]

# With Streamlit UI
pip install context-bridge[ui]

# All features
pip install context-bridge[all]
```

MCP Server:

```bash
# Using the installed script
context-bridge-mcp

# Or run directly
python -m context_bridge_mcp
```

Streamlit UI:

```bash
# Using streamlit directly
streamlit run streamlit_app/app.py

# Or with uv
uv run streamlit run streamlit_app/app.py
```

From source:

```bash
git clone https://github.com/yourusername/context-bridge.git
cd context-bridge
pip install -e .
```

Initialize the database:

```bash
python -m context_bridge.database.init_databases
```

This will:
- Create required PostgreSQL extensions
- Create all necessary tables
- Set up vector and BM25 indexes
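For reference, the extension setup performed by the initializer corresponds roughly to the following statements (a sketch; the authoritative version lives in `context_bridge/schema/extensions.sql`):

```sql
-- Sketch only: the packaged schema file is the source of truth.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS vchord;
CREATE EXTENSION IF NOT EXISTS pg_tokenizer;
CREATE EXTENSION IF NOT EXISTS vchord_bm25;
```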
## Quick Start

```python
import asyncio
from context_bridge import ContextBridge, Config

async def main():
    # Create config with your settings
    config = Config(
        postgres_host="localhost",
        postgres_password="your_secure_password",
        embedding_model="nomic-embed-text:latest"
    )

    # Use with context manager
    async with ContextBridge(config=config) as bridge:
        # Crawl documentation
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )

        # Search documentation
        search_results = await bridge.search(
            query="async await tutorial",
            document_id=result.document_id
        )

        for hit in search_results[:3]:
            print(f"Score: {hit.score}, Content: {hit.content[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
```

Configuration via environment variables (for example, in Docker):

```bash
# Set environment variables
export POSTGRES_HOST=postgres
export POSTGRES_PASSWORD=secure_password
export OLLAMA_BASE_URL=http://ollama:11434
export EMBEDDING_MODEL=nomic-embed-text:latest
# Or in docker-compose.yml
environment:
  - POSTGRES_HOST=postgres
  - POSTGRES_PASSWORD=secure_password
  - EMBEDDING_MODEL=nomic-embed-text:latest
```

```python
import asyncio
from context_bridge import ContextBridge
async def main():
    # Config automatically loaded from environment variables
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )
```

Create a `.env` file (git-ignored):

```bash
# PostgreSQL Configuration
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=context_bridge
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text:latest
VECTOR_DIMENSION=768
# Search Configuration
SIMILARITY_THRESHOLD=0.7
BM25_WEIGHT=0.3
VECTOR_WEIGHT=0.7
# Chunking Configuration
CHUNK_SIZE=2000
MIN_COMBINED_CONTENT_SIZE=100
MAX_COMBINED_CONTENT_SIZE=3500000
# Crawling Configuration
CRAWL_MAX_DEPTH=3
CRAWL_MAX_CONCURRENT=5
```

Then in your code:

```python
import asyncio
from context_bridge import ContextBridge
async def main():
    # Config automatically loaded from .env file (if python-dotenv is available)
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(...)
```

To use .env files in development, install with dev dependencies:

```bash
pip install context-bridge[dev]
```

## Configuration

The package uses Pydantic for type-safe, type-hinted configuration. Context Bridge supports three configuration methods:
- Direct Python instantiation (recommended for packaged installs)
- Environment variables (recommended for containers/CI)
- .env file (convenient for local development only)
PostgreSQL configuration:

| Setting | Default | Description | Python API |
|---|---|---|---|
| `POSTGRES_HOST` | `localhost` | PostgreSQL host | `postgres_host` |
| `POSTGRES_PORT` | `5432` | PostgreSQL port | `postgres_port` |
| `POSTGRES_USER` | `postgres` | PostgreSQL user | `postgres_user` |
| `POSTGRES_PASSWORD` | (empty) | PostgreSQL password (min 8 chars for prod) | `postgres_password` |
| `POSTGRES_DB` | `context_bridge` | Database name | `postgres_db` |
| `DB_POOL_MAX` | `10` | Connection pool size | `postgres_max_pool_size` |
Ollama configuration:

| Setting | Default | Description | Python API |
|---|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL | `ollama_base_url` |
| `EMBEDDING_MODEL` | `nomic-embed-text:latest` | Ollama model name | `embedding_model` |
| `VECTOR_DIMENSION` | `768` | Embedding vector dimension | `vector_dimension` |
Search configuration:

| Setting | Default | Description | Python API |
|---|---|---|---|
| `SIMILARITY_THRESHOLD` | `0.7` | Minimum similarity score | `similarity_threshold` |
| `BM25_WEIGHT` | `0.3` | BM25 weight in hybrid search | `bm25_weight` |
| `VECTOR_WEIGHT` | `0.7` | Vector weight in hybrid search | `vector_weight` |
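To build intuition for how the two weights interact, hybrid ranking of this kind is typically a linear blend of the normalized per-chunk scores; the sketch below assumes that behavior and is not the exact internal implementation:

```python
def hybrid_score(vector_score: float, bm25_score: float,
                 vector_weight: float = 0.7, bm25_weight: float = 0.3) -> float:
    """Assumed blending rule: weighted sum of normalized vector and BM25 scores."""
    return vector_weight * vector_score + bm25_weight * bm25_score

# A chunk with cosine similarity 0.82 and normalized BM25 score 0.40:
print(hybrid_score(0.82, 0.40))  # 0.694
```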
Chunking configuration:

| Setting | Default | Description | Python API |
|---|---|---|---|
| `CHUNK_SIZE` | `2000` | Default chunk size (bytes) | `chunk_size` |
| `MIN_COMBINED_CONTENT_SIZE` | `100` | Minimum combined page size (bytes) | `min_combined_content_size` |
| `MAX_COMBINED_CONTENT_SIZE` | `3500000` | Maximum combined page size (bytes) | `max_combined_content_size` |
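As a rough illustration of what `smart_chunk_markdown` aims for (the real algorithm is described in `docs/technical/smart_chunk_markdown_algorithm.md`), a simplified chunker that targets `CHUNK_SIZE`, prefers paragraph breaks, and never splits inside fenced code blocks might look like this; it is a sketch, not the shipped implementation:

```python
def naive_smart_chunk(markdown: str, chunk_size: int = 2000) -> list[str]:
    """Simplified illustration: cut near chunk_size, keep code fences intact,
    and prefer the last paragraph break inside the window."""
    chunks, start, n = [], 0, len(markdown)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            window = markdown[start:end]
            if window.count("```") % 2 == 1:
                # An unclosed code fence: extend the chunk to its closing fence.
                closing = markdown.find("```", end)
                end = n if closing == -1 else closing + 3
            else:
                # Otherwise back up to the last blank line, if it is not too early.
                brk = window.rfind("\n\n")
                if brk > chunk_size * 0.3:
                    end = start + brk
        piece = markdown[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    return chunks
```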
Crawling configuration:

| Setting | Default | Description | Python API |
|---|---|---|---|
| `CRAWL_MAX_DEPTH` | `3` | Maximum crawl depth | `crawl_max_depth` |
| `CRAWL_MAX_CONCURRENT` | `10` | Maximum concurrent crawl operations | `crawl_max_concurrent` |
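Every setting above also maps to a keyword argument on `Config` (the "Python API" column), so it can be set directly in code; the values below are illustrative:

```python
from context_bridge import Config

config = Config(
    postgres_host="localhost",
    postgres_max_pool_size=10,
    similarity_threshold=0.7,
    vector_weight=0.7,
    bm25_weight=0.3,
    chunk_size=2000,
    crawl_max_depth=3,
)
```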
## Usage

Crawling documentation:

```python
from context_bridge.service.crawling_service import CrawlingService, CrawlConfig
from crawl4ai import AsyncWebCrawler

# Configure crawler
config = CrawlConfig(
    max_depth=3,           # How deep to follow links
    max_concurrent=10,     # Concurrent requests
    memory_threshold=70.0  # Memory usage threshold
)

service = CrawlingService(config)

# Crawl a documentation site
async with AsyncWebCrawler(verbose=True) as crawler:
    result = await service.crawl_webpage(
        crawler,
        "https://docs.example.com"
    )

# Access results
for crawl_result in result.results:
    print(f"URL: {crawl_result.url}")
    print(f"Content length: {len(crawl_result.markdown)}")
```

Storing documents and pages:

```python
from context_bridge.repositories.document_repository import DocumentRepository
from context_bridge.repositories.page_repository import PageRepository

# db_manager is the package's PostgreSQL connection manager.
async with db_manager.connection() as conn:
    doc_repo = DocumentRepository(conn)
    page_repo = PageRepository(conn)

    # Create a new document
    doc_id = await doc_repo.create(
        name="Python Documentation",
        version="3.11",
        source_url="https://docs.python.org/3/",
        description="Official Python 3.11 documentation"
    )

    # Store crawled pages
    for page in crawled_pages:
        await page_repo.create(
            document_id=doc_id,
            url=page.url,
            content=page.markdown,
            content_hash=hash(page.markdown)
        )
```

Grouping pages before chunking:

```python
from context_bridge.repositories.group_repository import GroupRepository
# User manually selects pages to group
page_ids = [1, 2, 3, 4, 5]

# Create a group
async with db_manager.connection() as conn:
    group_repo = GroupRepository(conn)
    group_id = await group_repo.create_group(
        document_id=doc_id,
        page_ids=page_ids,
        min_size=1000,   # Minimum total content size
        max_size=50000   # Maximum total content size
    )
```

Chunking and embedding:

```python
from context_bridge.service.chunking_service import ChunkingService
from context_bridge.service.embedding import EmbeddingService

chunking_service = ChunkingService()
embedding_service = EmbeddingService(config)

# group_repo and chunk_repo are repository instances obtained as in the previous examples.
# Get groups ready for chunking
eligible_groups = await group_repo.get_eligible_groups(doc_id)

for group in eligible_groups:
    # Get combined content
    content = await group_repo.get_group_content(group.id)

    # Smart chunking
    chunks = chunking_service.smart_chunk_markdown(
        content,
        chunk_size=2000
    )

    # Generate embeddings and store
    for i, chunk_text in enumerate(chunks):
        embedding = await embedding_service.get_embedding(chunk_text)
        await chunk_repo.create(
            document_id=doc_id,
            group_id=group.id,
            chunk_index=i,
            content=chunk_text,
            embedding=embedding
        )
```

Finding documents:

```python
from context_bridge.repositories.document_repository import DocumentRepository
# Find relevant documents
documents = await doc_repo.find_by_query(
    query="python asyncio tutorial",
    limit=5
)

for doc in documents:
    print(f"{doc.name} (v{doc.version})")
```

Hybrid search within a document:

```python
from context_bridge.repositories.chunk_repository import ChunkRepository
# Search within a specific document
chunks = await chunk_repo.hybrid_search(
    document_id=doc_id,
    version="3.11",
    query="async await examples",
    query_embedding=await embedding_service.get_embedding("async await examples"),
    limit=10,
    vector_weight=0.7,
    bm25_weight=0.3
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content[:200]}...")
```

Context Bridge includes a full-featured web interface for managing documentation:
```bash
# Install with UI support
pip install context-bridge[ui]

# Run the Streamlit application
uv run streamlit run streamlit_app/app.py

# Or use the installed script
context-bridge-ui
```

Features:
- Document Management: View, search, and delete documents
- Page Organization: Select and group crawled pages for processing
- Chunk Processing: Convert page groups into searchable chunks
- Hybrid Search: Search across all documentation with advanced filtering
The Model Context Protocol server allows AI agents to interact with Context Bridge:
```bash
# Install with MCP support
pip install context-bridge[mcp]

# Run the MCP server
uv run python -m context_bridge_mcp

# Or use the installed script
context-bridge-mcp
```

Available Tools:
- `find_documents`: Search for documents by query
- `search_content`: Perform hybrid vector + BM25 search within specific documents
Integration with AI Clients: The MCP server can be integrated with AI assistants like Claude Desktop for seamless documentation access.
For detailed usage instructions, see the MCP Server Usage Guide.
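As an illustrative (unverified) example, a Claude Desktop entry in `claude_desktop_config.json` might register the server like this, passing database and embedding settings through environment variables:

```json
{
  "mcpServers": {
    "context-bridge": {
      "command": "context-bridge-mcp",
      "env": {
        "POSTGRES_HOST": "localhost",
        "POSTGRES_PASSWORD": "your_secure_password",
        "OLLAMA_BASE_URL": "http://localhost:11434"
      }
    }
  }
}
```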
## Database Schema

```sql
-- Documents (versioned documentation)
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    source_url TEXT,
    description TEXT,
    metadata JSONB DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(name, version)
);

-- Pages (raw crawled content)
CREATE TABLE pages (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    url TEXT NOT NULL UNIQUE,
    content TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    content_length INTEGER GENERATED ALWAYS AS (length(content)) STORED,
    crawled_at TIMESTAMPTZ DEFAULT NOW(),
    status TEXT DEFAULT 'pending' CHECK (status IN ('pending', 'processing', 'chunked', 'deleted')),
    group_id UUID, -- For future grouping feature
    metadata JSONB DEFAULT '{}'::jsonb
);

-- Chunks (embedded content)
CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    group_id UUID, -- For future grouping feature
    embedding VECTOR(768), -- Dimension must match config
    bm25_vector bm25vector, -- Auto-generated by trigger
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(document_id, group_id, chunk_index)
);

-- Indexes
CREATE INDEX idx_pages_document ON pages(document_id);
CREATE INDEX idx_pages_status ON pages(status);
CREATE INDEX idx_pages_hash ON pages(content_hash);
CREATE INDEX idx_pages_group ON pages(group_id);
CREATE INDEX idx_chunks_document ON chunks(document_id);
CREATE INDEX idx_chunks_group ON chunks(group_id);
CREATE INDEX idx_chunks_vector ON chunks USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_chunks_bm25 ON chunks USING bm25(bm25_vector bm25_ops);
```
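For orientation, the vector half of a hybrid query can be expressed against this schema with pgvector's cosine-distance operator; this is a sketch, not necessarily the exact query issued by `ChunkRepository.hybrid_search`:

```sql
-- Illustrative only: top 10 chunks of one document by cosine similarity.
SELECT id,
       content,
       1 - (embedding <=> $1) AS vector_score
FROM chunks
WHERE document_id = $2
ORDER BY embedding <=> $1
LIMIT 10;
```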
## Development

### Project Structure

```
context_bridge/                      # Core package
├── __init__.py
├── config.py                        # Configuration management
├── core.py                          # Main ContextBridge API
├── database/
│   ├── init_databases.py            # Database initialization
│   └── postgres_manager.py          # Connection pool manager
├── schema/
│   └── extensions.sql               # PostgreSQL extensions & schema
├── repositories/                    # Data access layer
│   ├── document_repository.py
│   ├── page_repository.py
│   ├── group_repository.py
│   └── chunk_repository.py
└── service/                         # Business logic layer
    ├── crawling_service.py
    ├── chunking_service.py
    ├── embedding.py
    ├── search_service.py
    └── url_service.py

context_bridge_mcp/                  # MCP Server (Model Context Protocol)
├── __init__.py
├── server.py                        # MCP server implementation
├── schemas.py                       # Tool input/output schemas
└── __main__.py                      # CLI entry point

streamlit_app/                       # Streamlit Web UI
├── __init__.py
├── app.py                           # Main application
├── pages/                           # Multi-page navigation
│   ├── documents.py                 # Document management
│   ├── crawled_pages.py             # Page management
│   └── search.py                    # Search interface
├── components/                      # Reusable UI components
├── utils/                           # UI utilities and helpers
└── README.md                        # UI-specific documentation

docs/                                # Documentation
├── guide/
│   └── MCP_SERVER_USAGE.md          # MCP server usage guide
├── plan/                            # Development plans
│   └── ui_and_mcp_implementation_plan.md
├── technical/                       # Technical guides
│   ├── crawl4ai_complete_guide.md
│   ├── embedding_service.md
│   ├── psqlpy-complete-guide.md
│   ├── python_mcp_server_guide.md
│   ├── python-testing-guide.md
│   └── smart_chunk_markdown_algorithm.md
└── memory_templates.yaml            # Memory usage templates

tests/                               # Test suite
├── conftest.py
├── integration/
├── unit/
└── e2e/                             # End-to-end tests
    ├── conftest.py
    └── test_streamlit_ui.py
```
### Running Tests
```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run all tests
pytest
# Run with coverage
pytest --cov=context_bridge --cov-report=html
# Run specific test file
pytest tests/test_chunking_service.py -v
# Format code
black context_bridge tests
# Type checking
mypy context_bridge
# Linting
ruff check context_bridge
```

## Technical Documentation

Comprehensive technical guides are available in `docs/`:
- UI Testing Report - Comprehensive Playwright testing results and bug fixes
- MCP Server Usage Guide - How to use the MCP server with AI clients
- Crawl4AI Guide - Complete crawling documentation
- Embedding Service - Ollama and Gemini embedding setup
- PSQLPy Guide - PostgreSQL driver usage
- MCP Server Guide - MCP server implementation
- Testing Guide - Testing best practices
- Smart Chunking Algorithm - Chunking implementation
- UI & MCP Implementation Plan - Development roadmap and progress
Contributions are welcome! Please read our contributing guidelines and submit pull requests.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
- Crawl4AI - High-performance web crawler
- PSQLPy - Async PostgreSQL driver
- pgvector - Vector similarity search
- MCP - Model Context Protocol
For questions, issues, or feature requests:
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: your.email@example.com
Built with ❤️ for AI agents and developers