Skip to content

Anshumaan031/CardinalQuery

Repository files navigation

SQL Agent for High-Cardinality Databases

A minimal SQL agent that handles high-cardinality databases through dynamic table selection and value disambiguation using semantic search.

Theory

Traditional text-to-SQL systems struggle with large databases (100+ tables) because:

  1. Context Limits: Cannot fit all table schemas into LLM context
  2. Value Ambiguity: Users provide fuzzy inputs ("SF" vs "San Francisco", misspellings)
  3. Query Complexity: Need to navigate relationships across many tables

Solution Architecture

This agent uses a progressive query building approach:

User Query � Intent Parsing � Table Selection � Value Disambiguation � SQL Generation

1. Dynamic Table Selection

  • LLM analyzes user query against database schema
  • Returns only relevant tables (minimizes context usage)
  • Scales to databases with 100+ tables

2. Value Disambiguation

  • Vector embeddings (Gemini text-embedding-004) for high-cardinality columns
  • Semantic search finds closest matches for fuzzy inputs
  • Example: "SF" � "San Francisco", "ACDC" � "AC/DC"

3. Error Recovery

  • Classifies SQL errors (syntax, missing tables, performance)
  • Self-corrects and retries (up to 2 attempts)
  • Progressively refines queries

🆕 Multi-Agent Architecture (LangGraph)

NEW: We now have a LangGraph-based multi-agent implementation with autonomous agents!

Why Multi-Agent?

The original pipeline architecture works well but has limitations:

  • Fixed workflow: Same steps every time, can't adapt
  • No agent autonomy: Functions don't make decisions
  • Limited observability: Hard to see why decisions were made

The multi-agent system solves this:

                  ┌──────────────┐
                  │  Supervisor  │ ← Decides what to do next
                  │    Agent     │
                  └──────┬───────┘
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
      ┌─────────┐  ┌──────────┐  ┌──────────┐
      │ Planner │  │  Table   │  │   SQL    │
      │         │  │ Selector │  │Generator │
      └─────────┘  └──────────┘  └──────────┘

Key Differences

Feature Pipeline Multi-Agent
Routing Fixed sequence Supervisor decides dynamically
Tools Direct calls Agents autonomously use tools
Flexibility Rigid Can skip/repeat steps
Observability Basic logs Full agent conversation trace

Quick Start (Multi-Agent)

# Run the multi-agent system
python run_multi_agent.py

# Or use in code
from agent import MultiAgentSQL

agent = MultiAgentSQL("Chinook.db", api_key)
result = agent.query("Show me tracks by AC/DC", verbose=True)

Full documentation: See agent/README.md and LANGGRAPH_ARCHITECTURE.md


Features

� Dynamic table selection from large schemas � Fuzzy value matching with vector embeddings � Progressive SQL query building � Automatic error classification and retry � Interactive CLI interface

Installation

Prerequisites

  • Python 3.11+
  • Google API Key (for Gemini models)

Setup

  1. Clone and navigate to project:
cd v2
  1. Install dependencies:
uv sync
  1. Configure API key: Create a .env file:
cp .env.example .env

Edit .env and add your Google API key:

GOOGLE_API_KEY=your_actual_api_key_here
  1. Pre-build vector embeddings:
python setup.py

This analyzes the database, identifies high-cardinality columns, and creates vector embeddings. You'll see output like:

Found 3 high-cardinality columns:
  " Artist: Name
  " Album: Title
  " Track: Name

[1/3] � Building embeddings for Artist.Name... (275 values) �
[2/3] � Building embeddings for Album.Title... (347 values) �
[3/3] � Building embeddings for Track.Name... (3503 values) �

Usage

Interactive Mode

python main.py

Example session:

Your question: Show me all tracks by AC/DC

[1] Parsing intent...
[2] Selecting relevant tables...
[3] Disambiguating values...
[4] Generating SQL query...
[5] Executing query...

� Query succeeded
SQL: SELECT Track.Name FROM Track JOIN Album ON Track.AlbumId = Album.AlbumId ...

Results:
Name
----
For Those About To Rock
Put The Finger On You
...

Run Tests

python test_agent.py

Tests include:

  • Basic artist lookup
  • Customer count by country
  • Top selling tracks
  • Fuzzy value matching (e.g., "SF" � "San Francisco")

Check Vector Collections

python check_collections.py

View what embeddings have been created:

ChromaDB Collections Status
======================================================================

Found 3 collection(s):

  � Artist.Name
    - Collection name: artist_name
    - Vectors stored: 275
    - Sample values: AC/DC, Accept, Aerosmith...

Total vectors: 4125

Project Structure

v2/
�

About

An autonomous LLM-powered SQL agent leveraging multi-agent orchestration, vector embeddings, and dynamic schema pruning to enable natural language querying over high-cardinality databases exceeding 100+ tables.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages