Local Retrieval-Augmented Generation system for Claude Code with Multi-Agent Framework integration.
A production-ready Claude Code plugin that combines ChromaDB vector embeddings with intelligent document retrieval and Multi-Agent Framework (MAF) orchestration for context-aware development assistance.
Current Version: 2.0.0 Status: Production Ready (with known limitations documented in KNOWN_ISSUES.md)
Key Features:
- ChromaDB-based vector storage with HNSW indexing
- Hybrid search combining semantic and keyword matching
- Multi-Agent Framework for intelligent query routing
- Zero external API costs for document processing
- Comprehensive plugin system (hooks, MCP server, slash commands)
Alternative Project: For a standalone CLI experience with extended features, see dt-cli. Both projects are actively maintained and can be used together.
RAG-CLI is a production-ready local Retrieval-Augmented Generation system that enhances your development workflow by providing instant access to your project documentation, codebase context, and external resources. It works seamlessly with Claude Code as a native plugin, eliminating the need for external API calls while processing documents locally with enterprise-grade security and performance.
- Zero API Overhead: Process documents locally without incurring API costs
- Instant Context: Get relevant documentation in milliseconds instead of manual searches
- Improved Code Quality: Make better decisions with context-aware assistance
- Complete Privacy: All document processing stays on your machine
- Developer Focused: Optimized for development workflows and Claude Code integration
- Local-First Architecture: Everything runs locally except Claude API calls
- Fast Performance: <100ms vector search, <5s end-to-end responses
- Hybrid Search: Combines semantic vector search with keyword matching for superior accuracy
- Claude Code Integration: Seamless plugin for enhanced development workflow
- Multi-Format Support: Process MD, PDF, DOCX, HTML, and TXT files
- Real-Time Monitoring: TCP server with PowerShell interface for system observability
- Background File Watching: Automatic document indexing with watchdog library (debounced events)
- Multi-Agent Orchestration: Intelligent routing between RAG and code analysis agents
- Production Ready: Comprehensive error handling, logging, and monitoring
- Python: 3.8 or higher (tested with 3.13)
- RAM: 4GB minimum (8GB recommended for large document sets)
- Disk Space: 2GB for dependencies + space for document vectors
- Claude Code: Latest version (for plugin mode)
- Anthropic API Key: Optional (only for standalone mode)
RAG-CLI runs efficiently on:
- Windows 10+ / macOS / Linux
- Laptops with limited resources (scales gracefully)
- Cloud instances and Docker containers
- CI/CD pipelines
The easiest way to get RAG-CLI as a Claude Code plugin:
# In Claude Code terminal
/plugin marketplace add https://github.com/ItMeDiaTech/rag-cli.git
/plugin install rag-cliThen restart Claude Code. The plugin will activate automatically with zero configuration.
Benefits:
- Automatic installation of all dependencies
- Plugin manages its own lifecycle
- No API key needed (uses Claude Code internally)
- One-command updates via
/plugin update rag-cli
For development, testing, or custom configuration:
# Clone the repository
git clone https://github.com/ItMeDiaTech/rag-cli.git
cd rag-cli
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Verify installation
python -c "from rag_cli.core import embeddings; print('Installation successful!')"For contributing to RAG-CLI:
# Clone and install in editable mode
git clone https://github.com/ItMeDiaTech/rag-cli.git
cd rag-cli
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install with development dependencies
pip install -e ".[dev]"
# Configure MCP server for development mode
python scripts/configure_mcp.py
# Run tests to verify
pytest tests/Important for Contributors: The configure_mcp.py script generates .mcp.json with absolute paths for your system. This file is gitignored and preserves any other MCP servers you have configured. You can re-run the script anytime if your project path changes.
If you installed manually and want to use it as a Claude Code plugin:
# From the RAG-CLI directory
python scripts/sync_plugin.py
# This will copy necessary files to:
# ~/.claude/plugins/marketplaces/rag-cli/
# Then restart Claude CodeNo configuration needed. RAG-CLI auto-detects Claude Code environment:
# First time setup: Index your documents
/rag-project
# Or manually index
python scripts/index.py --input /path/to/docsFor development or testing outside Claude Code:
# Set environment variables
export ANTHROPIC_API_KEY="sk-ant-..."
export RAG_CLI_MODE="standalone"
export RAG_CLI_LOG_LEVEL="INFO"
# Index documents
python scripts/index.py --input data/documents
# Test retrieval
python scripts/retrieve.py "Your question here"Edit config/default.yaml to customize:
# Model selection
embeddings:
model_name: sentence-transformers/all-MiniLM-L6-v2 # Fast, 384-dim
# Search parameters
retrieval:
top_k: 5 # Number of results
hybrid_ratio: 0.7 # 70% semantic, 30% keyword
rerank: true # Use cross-encoder reranking
# Claude settings (standalone only)
claude:
model: claude-haiku-4-5-20251001
max_tokens: 4096
temperature: 0.7# Test plugin installation
/plugin
# Should show: RAG-CLI plugin is installed and loaded
# Test basic functionality
/search "test query"
# Check system status
python scripts/validate_plugin.pyUse Method 1 (Marketplace) for easiest setup.
Gather your documentation:
# Create documents directory
mkdir -p data/documents
# Copy your files
cp /path/to/docs/*.md data/documents/
cp /path/to/docs/*.pdf data/documents/Supported formats: Markdown, PDF, DOCX, HTML, TXT
In Claude Code or terminal:
# Option 1: As Claude Code plugin (easiest)
/rag-project # Auto-indexes current project
# Option 2: Manual indexing
python scripts/index.py --input data/documents --output data/vectorsAsk Claude Code questions about your documents:
# In Claude Code
/search "How do I configure authentication?"
# Or directly ask Claude
"How do I configure authentication?"
# RAG-CLI will automatically enhance with context# In Claude Code
/rag-enable
# Now all your questions will automatically get document contextDisable with: /rag-disable
Traditional workflow:
- Search for documentation (browser, help files)
- Copy/paste relevant sections
- Ask Claude about the problem
- Time: 2-5 minutes per question
With RAG-CLI:
- Ask Claude directly
- RAG-CLI retrieves relevant docs automatically
- Claude responds with context
- Time: <5 seconds per question
Real-world impact: Process 10x more questions per session.
RAG-CLI provides Claude with your actual documentation, code patterns, and project conventions:
Without RAG-CLI:
- Claude makes general assumptions
- Recommendations may conflict with your patterns
- Need to manually validate advice against your codebase
With RAG-CLI:
- Claude knows your exact requirements
- Recommendations match your conventions
- Context-aware solutions specific to your project
Stop mentally tracking:
- API documentation details
- Code structure and patterns
- Configuration requirements
- Best practices for your project
RAG-CLI automatically provides this context, freeing your mind for actual problem-solving.
API Usage:
- Claude Code mode: No API calls for document retrieval
- Saves $$ on large projects with extensive documentation
Time Savings:
- 80% reduction in documentation lookup time
- 50% reduction in clarification questions
- Faster code reviews and architectural decisions
Organizations using RAG-CLI report:
| Metric | Improvement |
|---|---|
| Development Speed | 30-40% faster completion |
| Code Quality | 25% fewer bugs in reviews |
| Documentation Accuracy | 90% vs 60% without context |
| Onboarding Time | 50% reduction |
| API Costs | Up to 60% savings |
RAG-CLI implements a sophisticated document retrieval pipeline:
-
Document Ingestion
- Supports: Markdown, PDF, DOCX, HTML, TXT
- Automatic metadata extraction
- Intelligent chunking (500 tokens with 100-token overlap, configurable via
core.constants)
-
Embedding Generation
- Model:
sentence-transformers/all-MiniLM-L6-v2 - Fast: <200ms for 100 documents
- Efficient: 384-dimensional vectors
- Cached for repeat queries
- Model:
-
Intelligent Retrieval
- Hybrid search: 70% semantic + 30% keyword (configurable via
core.constants) - Cross-encoder reranking for accuracy
- Returns top-K results with confidence scores (default: 5, max: 100)
- Sub-100ms retrieval time
- Hybrid search: 70% semantic + 30% keyword (configurable via
-
Query Enhancement
- Automatic document classification
- Intelligent context assembly
- Format adaptation for Claude Code
- Citation tracking
-
Response Generation
- Integration with Claude Haiku (fast, accurate)
- Streaming responses for better UX
- Automatic citation injection
- Configurable output formatting
Local Processing:
- All document processing happens locally
- No sensitive data sent to external services
- Full privacy and security
- Offline-capable (after initial indexing)
Performance Optimized:
- ChromaDB vector store with HNSW indexing (industry standard)
- Batch processing for throughput
- Async operations for responsiveness
- Memory-efficient chunking
Production Ready:
- Comprehensive error handling
- Graceful degradation on failures
- Detailed logging and monitoring
- Multi-agent orchestration for complex queries
Frontend: Claude Code Plugin
|
Integration Layer: MCP Server + Hooks + Slash Commands
|
Retrieval: Hybrid Search Pipeline
|
ML/AI:
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Reranking: Cross-encoder (ms-marco-MiniLM-L-6-v2)
- Storage: ChromaDB (with HNSW indexing)
Document Processing:
- Parsing: LangChain + BeautifulSoup + PyPDF2 + python-docx
- Chunking: Semantic boundary detection
- Metadata: Automatic extraction
LLM Integration:
- Model: Claude Haiku (via Anthropic API)
- When plugin: Claude Code internal processing
- Streaming: For better perceived performance
Monitoring:
- Structured Logging: structlog
- Metrics: Prometheus-compatible
- TCP Server: Real-time status monitoring
API Integration
- Auto-complete API calls with context
- Validate parameters against documentation
- Get usage examples from your code
Bug Fixing
- Search error messages in documentation
- Find related issues in your codebase
- Get debugging hints from best practices
Code Review
- Check against project standards automatically
- Retrieve relevant architectural patterns
- Validate against best practices
Knowledge Base
- Keep team documentation synchronized
- Instantly query your knowledge base
- Reduce "How do I..." questions
Onboarding
- New developers get context-aware help
- Questions answered with your actual docs
- 50% faster ramp-up time
Continuous Learning
- Search your learning materials instantly
- Get context from multiple sources
- Connect related concepts automatically
Knowledge Synthesis
- Combine insights from multiple documents
- Get connections between topics
- Build comprehensive understanding faster
RAG-CLI supports three operation modes:
- No API key required
- Automatically detected when running as Claude Code plugin
- Returns formatted context for Claude's internal processing
- Optimal performance with zero API costs
- Requires Anthropic API key
- Direct API calls to Claude
- Full control over model parameters
- Useful for testing and development
- Auto-detects environment
- Uses Claude Code when available
- Falls back to API when needed
- Maximum flexibility
Set mode via environment variable:
export RAG_CLI_MODE="claude_code" # or "standalone" or "hybrid"RAG-CLI/
src/
core/ # Core RAG pipeline
constants.py # Global configuration constants
embeddings.py # Sentence transformer integration
vector_store.py # ChromaDB vector operations
document_processor.py # Document chunking
retrieval_pipeline.py # Hybrid search
claude_integration.py # Claude API interface
monitoring/ # Observability
logger.py # Structured logging
tcp_server.py # Monitoring server
plugin/ # Claude Code integration
skills/ # Agent skills
commands/ # Slash commands
hooks/ # Event hooks
mcp/ # MCP server
scripts/ # CLI utilities
tests/ # Test suites
data/ # Documents and vectors
config/ # Configuration files
- Document Processing: Documents -> Chunks (400-500 tokens) -> Metadata extraction
- Embedding Generation: Chunks -> sentence-transformers -> 384-dim vectors
- Vector Storage: Embeddings -> ChromaDB with HNSW indexing -> Persistent storage
- Retrieval: Query -> Hybrid search -> Reranking -> Top-K results
- Response Generation: Context + Query -> Claude Haiku -> AI response
# Operation Mode
mode:
operation: hybrid # claude_code, standalone, or hybrid
claude_code:
format_context: true
include_metadata: true
max_context_length: 10000
# Embeddings
embeddings:
model_name: sentence-transformers/all-MiniLM-L6-v2
model_dim: 384
batch_size: 32
cache_enabled: true
# Vector Store
vector_store:
type: chromadb
index_type: hnsw # ChromaDB uses HNSW for efficient similarity search
save_path: data/vectors
# Retrieval
retrieval:
top_k: 5
hybrid_ratio: 0.7 # 70% vector, 30% keyword
rerank: true
reranker_model: cross-encoder/ms-marco-MiniLM-L-6-v2
# Claude (for standalone mode)
claude:
model: claude-haiku-4-5-20251001
max_tokens: 4096
temperature: 0.7
api_key_env: ANTHROPIC_API_KEY # Only needed for standaloneEnvironment Variable Protection:
-
Never Commit .env Files: The
.envfile contains sensitive API keys and should NEVER be committed to version control- Already included in
.gitignore - Use
config/templates/.env.templateas a reference
- Already included in
-
File Permissions: On Unix systems, ensure
.envhas restricted permissions:chmod 600 .env # Read/write for owner only -
API Key Storage:
- Store all API keys in
.envfile only - Never hardcode keys in source code
- Use environment variables via
os.getenv()
- Store all API keys in
-
Subprocess Security: RAG-CLI automatically sanitizes environment variables when spawning subprocesses, removing sensitive keys like:
ANTHROPIC_API_KEYTAVILY_API_KEYOPENAI_API_KEY
-
Configuration Files: User-specific configurations in
config/are gitignored. Only default templates inconfig/defaults/andconfig/templates/are version controlled.
/search [query]- Search indexed documents/rag-enable- Enable automatic RAG enhancement/rag-disable- Disable automatic RAG enhancement/rag-project- Analyze current project and index relevant documentation/update-rag- Synchronize RAG-CLI plugin files
Access the RAG retrieval skill:
/skill rag-retrieval "Your question here"
RAG-CLI includes several hooks that enhance your Claude Code experience:
- Slash Command Blocker (Priority 150) - Prevents Claude from responding to slash commands, showing only execution status
- User Prompt Submit (Priority 100) - Automatically enhances queries with RAG context and multi-agent orchestration
- Response Post (Priority 80) - Adds inline citations to Claude responses when RAG context is used
- Error Handler (Priority 70) - Provides graceful error handling with helpful troubleshooting tips
- Plugin State Change (Priority 60) - Persists RAG settings across Claude Code restarts
- Document Indexing (Priority 50, disabled by default) - Automatically indexes new or modified documents
RAG-CLI integrates with the Multi-Agent Framework (MAF) to provide intelligent query routing:
- RAG Only: Simple document retrieval queries
- MAF Only: Pure code analysis and debugging tasks
- Parallel RAG+MAF: Complex queries combining documentation and code analysis
- Decomposed: Multi-part queries with intelligent sub-query distribution
The orchestrator automatically selects the best strategy based on query intent, providing faster and more accurate responses.
RAG-CLI provides structured, readable output for all operations:
Search Results Example:
# RAG Search Results
## Retrieval Results
Found: 5 relevant documents
Time: 145ms
## Retrieved Documents
**1. Getting Started Guide (score: 0.890)**
> This guide will help you get started with the installation process...
**2. Configuration Reference (score: 0.870)**
> The configuration file allows you to customize various aspects...
Orchestration Output Example:
## Query Processing
**Strategy:** parallel
**Intent:** troubleshooting
**Confidence:** 87.5%
**Documents:** 3
**MAF Agent:** debugger
The formatting system provides:
- Clean markdown headers for each stage
- Performance metrics (time, document count, confidence scores)
- Document previews with intelligent truncation
- Progress indicators for multi-step operations
- Collapsible sections for detailed logs (when verbose mode enabled)
from src.core.document_processor import get_document_processor
from src.core.embeddings import get_embedding_model
from src.core.vector_store import get_vector_store
# Process documents
processor = get_document_processor()
documents = processor.process_directory("data/documents")
# Generate embeddings
model = get_embedding_model()
embeddings = model.encode_batch([doc["content"] for doc in documents])
# Store vectors
store = get_vector_store()
store.add_documents(documents, embeddings)from src.core.retrieval_pipeline import HybridRetriever
# Initialize retriever
retriever = HybridRetriever(vector_store, embedding_model, config)
# Search
results = retriever.search("Your query", top_k=5)from src.core.claude_integration import ClaudeAssistant
# Initialize assistant
assistant = ClaudeAssistant(config)
# Generate response
response = assistant.generate_response(query, retrieved_docs)The monitoring server runs on port 9999 and accepts these commands:
STATUS- System health and statisticsMETRICS- Performance metricsLOGS- Recent log entriesHEALTH- Health check status
# Check status
./scripts/monitor.ps1 -Command STATUS
# View metrics
./scripts/monitor.ps1 -Command METRICSimport socket
import json
def query_monitor(command):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.connect(("localhost", 9999))
s.send(command.encode())
response = s.recv(4096).decode()
return json.loads(response)
status = query_monitor("STATUS")
print(status)| Operation | Target | Typical |
|---|---|---|
| Vector Search | <100ms | 45ms |
| End-to-End | <5s | 3.2s |
| Embedding Generation | <500ms | 200ms |
| Document Processing | 0.5s/100 docs | 0.4s/100 docs |
- Large Datasets (>100K docs): Use HNSW index instead of Flat
- Memory Constraints: Enable document streaming
- Faster Search: Reduce top_k and disable reranking
- Better Accuracy: Increase hybrid_ratio for more semantic search
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test
pytest tests/test_core.py::TestEmbeddings- Unit tests for all core modules
- Integration tests for full pipeline
- Performance benchmarks
- Plugin component validation
No results found:
- Ensure documents are indexed:
ls data/vectors/ - Lower similarity threshold:
--threshold 0.5 - Check document processing logs
Slow performance:
- Reduce top_k parameter
- Enable caching in configuration
- Use HNSW index for large datasets
API errors (Standalone mode only):
- Verify ANTHROPIC_API_KEY is set
- Check rate limits
- Switch to Claude Code mode if running as plugin
- Review logs:
tail -f logs/rag_cli.log
Mode detection issues:
- Check current mode:
python -c "from src.core.claude_code_adapter import get_adapter; print(get_adapter().get_mode_info())" - Force mode:
export RAG_CLI_MODE="claude_code" - Verify .claude directory exists for Claude Code
export RAG_CLI_LOG_LEVEL=DEBUG
python scripts/retrieve.py "test query" --verbosesrc/core/- Core RAG components (includesconstants.pyfor centralized configuration)src/monitoring/- Logging and metricssrc/plugin/- Claude Code integrationscripts/- CLI utilitiestests/- Test suitesconfig/- Configuration files
RAG-CLI uses a centralized constants module (core.constants) for all tunable parameters:
- Performance: Batch sizes, worker counts, cache sizes
- Search: Top-K limits, hybrid search weights, query length limits
- Processing: Chunk sizes, overlap ratios, file size limits
- Thresholds: Vector store index transitions (Flat -> HNSW -> IVF)
- Timeouts: HTTP, embedding generation, search operations
This design makes it easy to tune performance without modifying code throughout the codebase.
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
- Follow PEP 8 guidelines
- Use type hints
- Add docstrings to all functions
- Run
blackfor formatting
MIT License - see LICENSE file for details.
- GitHub Issues: Report bugs
- Known Issues: KNOWN_ISSUES.md - Current limitations and workarounds
- Documentation: Wiki
- Discussions: Community forum
- Security: SECURITY.md - API key management and security best practices
- Sentence Transformers for embedding models
- ChromaDB for vector database with HNSW indexing
- Anthropic for Claude API
- LangChain for document processing
Built with focus on performance, accuracy, and developer experience.