A minimal SQL agent that handles high-cardinality databases through dynamic table selection and value disambiguation using semantic search.
Traditional text-to-SQL systems struggle with large databases (100+ tables) because:
- Context Limits: Cannot fit all table schemas into LLM context
- Value Ambiguity: Users provide fuzzy inputs ("SF" vs "San Francisco", misspellings)
- Query Complexity: Need to navigate relationships across many tables
This agent uses a progressive query building approach:
User Query � Intent Parsing � Table Selection � Value Disambiguation � SQL Generation
1. Dynamic Table Selection
- LLM analyzes user query against database schema
- Returns only relevant tables (minimizes context usage)
- Scales to databases with 100+ tables
2. Value Disambiguation
- Vector embeddings (Gemini text-embedding-004) for high-cardinality columns
- Semantic search finds closest matches for fuzzy inputs
- Example: "SF" � "San Francisco", "ACDC" � "AC/DC"
3. Error Recovery
- Classifies SQL errors (syntax, missing tables, performance)
- Self-corrects and retries (up to 2 attempts)
- Progressively refines queries
NEW: We now have a LangGraph-based multi-agent implementation with autonomous agents!
The original pipeline architecture works well but has limitations:
- Fixed workflow: Same steps every time, can't adapt
- No agent autonomy: Functions don't make decisions
- Limited observability: Hard to see why decisions were made
The multi-agent system solves this:
┌──────────────┐
│ Supervisor │ ← Decides what to do next
│ Agent │
└──────┬───────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────┐
│ Planner │ │ Table │ │ SQL │
│ │ │ Selector │ │Generator │
└─────────┘ └──────────┘ └──────────┘
| Feature | Pipeline | Multi-Agent |
|---|---|---|
| Routing | Fixed sequence | Supervisor decides dynamically |
| Tools | Direct calls | Agents autonomously use tools |
| Flexibility | Rigid | Can skip/repeat steps |
| Observability | Basic logs | Full agent conversation trace |
# Run the multi-agent system
python run_multi_agent.py
# Or use in code
from agent import MultiAgentSQL
agent = MultiAgentSQL("Chinook.db", api_key)
result = agent.query("Show me tracks by AC/DC", verbose=True)Full documentation: See agent/README.md and LANGGRAPH_ARCHITECTURE.md
� Dynamic table selection from large schemas � Fuzzy value matching with vector embeddings � Progressive SQL query building � Automatic error classification and retry � Interactive CLI interface
- Python 3.11+
- Google API Key (for Gemini models)
- Clone and navigate to project:
cd v2- Install dependencies:
uv sync- Configure API key:
Create a
.envfile:
cp .env.example .envEdit .env and add your Google API key:
GOOGLE_API_KEY=your_actual_api_key_here
- Pre-build vector embeddings:
python setup.pyThis analyzes the database, identifies high-cardinality columns, and creates vector embeddings. You'll see output like:
Found 3 high-cardinality columns:
" Artist: Name
" Album: Title
" Track: Name
[1/3] � Building embeddings for Artist.Name... (275 values) �
[2/3] � Building embeddings for Album.Title... (347 values) �
[3/3] � Building embeddings for Track.Name... (3503 values) �
python main.pyExample session:
Your question: Show me all tracks by AC/DC
[1] Parsing intent...
[2] Selecting relevant tables...
[3] Disambiguating values...
[4] Generating SQL query...
[5] Executing query...
� Query succeeded
SQL: SELECT Track.Name FROM Track JOIN Album ON Track.AlbumId = Album.AlbumId ...
Results:
Name
----
For Those About To Rock
Put The Finger On You
...
python test_agent.pyTests include:
- Basic artist lookup
- Customer count by country
- Top selling tracks
- Fuzzy value matching (e.g., "SF" � "San Francisco")
python check_collections.pyView what embeddings have been created:
ChromaDB Collections Status
======================================================================
Found 3 collection(s):
� Artist.Name
- Collection name: artist_name
- Vectors stored: 275
- Sample values: AC/DC, Accept, Aerosmith...
Total vectors: 4125
v2/
�