SQL Agent for High-Cardinality Databases

A minimal SQL agent that handles high-cardinality databases through dynamic table selection and value disambiguation using semantic search.

Theory

Traditional text-to-SQL systems struggle with large databases (100+ tables) because:

Context Limits: Cannot fit all table schemas into LLM context
Value Ambiguity: Users provide fuzzy inputs ("SF" vs "San Francisco", misspellings)
Query Complexity: Need to navigate relationships across many tables

Solution Architecture

This agent uses a progressive query building approach:

User Query � Intent Parsing � Table Selection � Value Disambiguation � SQL Generation

1. Dynamic Table Selection

LLM analyzes user query against database schema
Returns only relevant tables (minimizes context usage)
Scales to databases with 100+ tables

2. Value Disambiguation

Vector embeddings (Gemini text-embedding-004) for high-cardinality columns
Semantic search finds closest matches for fuzzy inputs
Example: "SF" � "San Francisco", "ACDC" � "AC/DC"

3. Error Recovery

Classifies SQL errors (syntax, missing tables, performance)
Self-corrects and retries (up to 2 attempts)
Progressively refines queries

🆕 Multi-Agent Architecture (LangGraph)

NEW: We now have a LangGraph-based multi-agent implementation with autonomous agents!

Why Multi-Agent?

The original pipeline architecture works well but has limitations:

Fixed workflow: Same steps every time, can't adapt
No agent autonomy: Functions don't make decisions
Limited observability: Hard to see why decisions were made

The multi-agent system solves this:

                  ┌──────────────┐
                  │  Supervisor  │ ← Decides what to do next
                  │    Agent     │
                  └──────┬───────┘
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
      ┌─────────┐  ┌──────────┐  ┌──────────┐
      │ Planner │  │  Table   │  │   SQL    │
      │         │  │ Selector │  │Generator │
      └─────────┘  └──────────┘  └──────────┘

Key Differences

Feature	Pipeline	Multi-Agent
Routing	Fixed sequence	Supervisor decides dynamically
Tools	Direct calls	Agents autonomously use tools
Flexibility	Rigid	Can skip/repeat steps
Observability	Basic logs	Full agent conversation trace

Quick Start (Multi-Agent)

# Run the multi-agent system
python run_multi_agent.py

# Or use in code
from agent import MultiAgentSQL

agent = MultiAgentSQL("Chinook.db", api_key)
result = agent.query("Show me tracks by AC/DC", verbose=True)

Full documentation: See agent/README.md and LANGGRAPH_ARCHITECTURE.md

Features

� Dynamic table selection from large schemas � Fuzzy value matching with vector embeddings � Progressive SQL query building � Automatic error classification and retry � Interactive CLI interface

Installation

Prerequisites

Python 3.11+
Google API Key (for Gemini models)

Setup

Clone and navigate to project:

cd v2

Install dependencies:

uv sync

Configure API key: Create a .env file:

cp .env.example .env

Edit .env and add your Google API key:

GOOGLE_API_KEY=your_actual_api_key_here

Pre-build vector embeddings:

python setup.py

This analyzes the database, identifies high-cardinality columns, and creates vector embeddings. You'll see output like:

Found 3 high-cardinality columns:
  " Artist: Name
  " Album: Title
  " Track: Name

[1/3] � Building embeddings for Artist.Name... (275 values) �
[2/3] � Building embeddings for Album.Title... (347 values) �
[3/3] � Building embeddings for Track.Name... (3503 values) �

Usage

Interactive Mode

python main.py

Example session:

Your question: Show me all tracks by AC/DC

[1] Parsing intent...
[2] Selecting relevant tables...
[3] Disambiguating values...
[4] Generating SQL query...
[5] Executing query...

� Query succeeded
SQL: SELECT Track.Name FROM Track JOIN Album ON Track.AlbumId = Album.AlbumId ...

Results:
Name
----
For Those About To Rock
Put The Finger On You
...

Run Tests

python test_agent.py

Tests include:

Basic artist lookup
Customer count by country
Top selling tracks
Fuzzy value matching (e.g., "SF" � "San Francisco")

Check Vector Collections

python check_collections.py

View what embeddings have been created:

ChromaDB Collections Status
======================================================================

Found 3 collection(s):

  � Artist.Name
    - Collection name: artist_name
    - Vectors stored: 275
    - Sample values: AC/DC, Accept, Aerosmith...

Total vectors: 4125

Project Structure

v2/
�

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SQL Agent for High-Cardinality Databases

Theory

Solution Architecture

🆕 Multi-Agent Architecture (LangGraph)

Why Multi-Agent?

Key Differences

Quick Start (Multi-Agent)

Features

Installation

Prerequisites

Setup

Usage

Interactive Mode

Run Tests

Check Vector Collections

Project Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
agent		agent
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
Chinook.db		Chinook.db
README.md		README.md
agent.py		agent.py
check_collections.py		check_collections.py
chinook_sqlite.sql		chinook_sqlite.sql
main.py		main.py
pyproject.toml		pyproject.toml
query_generator.py		query_generator.py
run_multi_agent.py		run_multi_agent.py
setup.py		setup.py
table_selector.py		table_selector.py
test_agent.py		test_agent.py
uv.lock		uv.lock
value_disambiguator.py		value_disambiguator.py

Folders and files

Latest commit

History

Repository files navigation

SQL Agent for High-Cardinality Databases

Theory

Solution Architecture

🆕 Multi-Agent Architecture (LangGraph)

Why Multi-Agent?

Key Differences

Quick Start (Multi-Agent)

Features

Installation

Prerequisites

Setup

Usage

Interactive Mode

Run Tests

Check Vector Collections

Project Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages