This document provides a detailed technical overview of the Text-to-SQL Assistant, an application that converts natural language questions into executable SQL queries. It leverages OpenAI's GPT models for reasoning, ChromaDB for RAG (Retrieval-Augmented Generation), and Streamlit for the user interface.
The application follows a sequential pipeline architecture to process user queries.
- User Input: The user asks a question via the Streamlit UI (e.g., "Show me top 5 customers").
- Orchestration: The request is passed to the `answer_question` function in `orchestrator.py`.
- Pipeline Execution:
- Schema Retrieval (RAG): The system searches for relevant database tables using vector embeddings (ChromaDB).
- SQL Generation: The LLM generates a SQL query based on the retrieved schema.
- Validation: A critic module checks the SQL for safety (preventing `DROP`, `DELETE`, etc.).
- Execution: The validated SQL is run against a MySQL database.
- Response: The results (DataFrame) and the generated SQL are sent back to the UI.
The core logic resides in the agents/ directory (note: while named agents, these are now functional modules in the pipeline).
`orchestrator.py` is the pipeline controller.
- Function: `answer_question(question)`
- Logic: Sequentially calls the retrieval, generation, validation, and execution steps. Returns the SQL, validation message, and result DataFrame.
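The sequential control flow can be sketched as follows. This is a minimal sketch, not the project's actual code: the stub bodies and the `(is_valid, message)` return shape are assumptions, and the real module imports these steps from `retriever.py`, `sql_generator.py`, `critic.py`, and `sql_executor.py`.

```python
# Hypothetical sketch of the orchestrator pipeline; inline stubs stand
# in for the four agent modules so the example is self-contained.

def retrieve_schema(question):
    # stub for the RAG lookup in retriever.py
    return "Table: customers\nColumns: id, name, total_spend"

def generate_sql(question, schema):
    # stub for the LLM call in sql_generator.py
    return "SELECT name FROM customers ORDER BY total_spend DESC LIMIT 5"

def validate_sql(sql):
    # stub for the keyword scan in critic.py
    return True, "OK"

def run_sql(sql):
    # stub for MySQL execution in sql_executor.py
    return [("Alice",), ("Bob",)]

def answer_question(question):
    """Run retrieval, generation, validation, and execution in sequence."""
    schema = retrieve_schema(question)
    sql = generate_sql(question, schema)
    is_valid, message = validate_sql(sql)
    if not is_valid:
        return sql, message, None  # unsafe SQL is never executed
    return sql, message, run_sql(sql)
```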
`retriever.py` implements Retrieval-Augmented Generation (RAG) to provide schema context to the LLM, avoiding the need to feed the entire database schema into the context window.
- Database: Uses ChromaDB to store vector embeddings of table schemas.
- Embeddings: Uses `text-embedding-3-small` to convert questions and table descriptions into vectors.
- Process: `retrieve_schema(question)` queries the vector store for the top 3 most relevant tables and returns a formatted string containing the table name, columns, and description.
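The top-k lookup can be illustrated without ChromaDB or the OpenAI embedding API (neither is usable in a standalone snippet) by ranking precomputed vectors with plain cosine similarity. The table dictionaries, vector values, and `k` parameter below are invented for illustration; the real retriever embeds the question and queries the persisted ChromaDB store.

```python
# Illustration of top-k schema retrieval using cosine similarity in
# place of ChromaDB; embeddings are toy 2-D vectors, not real ones.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_schema(question_vec, tables, k=3):
    """Return the k table descriptions closest to the question vector."""
    ranked = sorted(tables, key=lambda t: cosine(question_vec, t["vec"]),
                    reverse=True)
    return "\n\n".join(
        f"Table: {t['name']}\nColumns: {t['columns']}\nDescription: {t['desc']}"
        for t in ranked[:k]
    )
```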
`critic.py` is a safety layer that ensures the generated SQL is safe to execute.
- Function: `validate_sql(sql)`
- Checks: Scans for dangerous keywords such as `DROP`, `DELETE`, `UPDATE`, and `ALTER`.
- Result: Returns `True` if the SQL is safe, otherwise `False` with an error message.
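A keyword scan of this kind can be sketched as below. The exact keyword set, matching strategy, and message strings in `critic.py` may differ; this version uses word-boundary regexes, which is an assumption.

```python
# Sketch of a keyword-based SQL safety check; the blocked keywords are
# the ones the document lists, and the regex approach is an assumption.
import re

DANGEROUS_KEYWORDS = ("DROP", "DELETE", "UPDATE", "ALTER")

def validate_sql(sql):
    """Return (True, ok-message) if safe, else (False, error message)."""
    for kw in DANGEROUS_KEYWORDS:
        # \b avoids false positives such as a column named "updated_at"
        if re.search(rf"\b{kw}\b", sql, re.IGNORECASE):
            return False, f"Unsafe SQL: contains {kw}"
    return True, "SQL passed validation"
```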
`sql_executor.py` handles the actual interaction with the MySQL database.
- Library: `mysql.connector`
- Function: `run_sql(sql)`
- Output: Executes the query and returns the result as a Pandas DataFrame.
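The execute-and-return-a-DataFrame step can be sketched like this. Since a MySQL server is not available in a standalone snippet, `sqlite3` stands in for `mysql.connector` here; the real module opens a MySQL connection using the credentials from `.env`.

```python
# Sketch of run_sql: execute a query and return a pandas DataFrame.
# sqlite3 replaces mysql.connector purely so the example is runnable.
import sqlite3
import pandas as pd

def run_sql(sql, conn):
    """Execute a read-only query and return the result as a DataFrame."""
    return pd.read_sql_query(sql, conn)

# Demo with an in-memory database and invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
df = run_sql("SELECT name FROM customers ORDER BY id", conn)
```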
text_to_sql/
├── agents/
│ ├── orchestrator.py # Main pipeline controller
│ ├── critic.py # SQL validation logic
│ ├── retriever.py # RAG logic with ChromaDB
│ ├── sql_executor.py # MySQL database connection & execution
│ └── sql_generator.py # SQL generation logic
├── db/
│ └── mysql_connection.py # Database helper scripts
├── chroma_store/ # Persisted Vector Database (ChromaDB)
├── app.py # Streamlit Entry Point
├── build_schema_store.py # Script to index schema into ChromaDB
├── requirements.txt # Project Dependencies
└── .env # Environment variables (API Keys, DB Creds)
- Python 3.8+
- MySQL Database
- OpenAI API Key
1. Clone the repository:

   ```bash
   git clone <repository_url>
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure Environment: Create a `.env` file with the following:

   ```ini
   OPENAI_API_KEY=your_key_here
   MYSQL_HOST=localhost
   MYSQL_PORT=3306
   MYSQL_USER=root
   MYSQL_PASSWORD=your_password
   MYSQL_DATABASE=your_db
   ```

4. Build Schema Store: Run this once to populate ChromaDB with your database schema:

   ```bash
   python build_schema_store.py
   ```

5. Run the Application:

   ```bash
   streamlit run app.py
   ```
```mermaid
graph TD
    User["User (Streamlit UI)"] --> Orchestrator[Orchestrator]
    Orchestrator --> Retriever["Retriever (ChromaDB)"]
    Orchestrator --> Generator["SQL Generator (LLM)"]
    Orchestrator --> Critic["Critic (Validator)"]
    Orchestrator --> Executor["Executor (MySQL)"]
    Retriever -- Schema Context --> Generator
    Generator -- Generated SQL --> Critic
    Critic -- Validation Status --> Orchestrator
    Orchestrator -- If Valid --> Executor
    Executor -- DataFrame --> Orchestrator
    Orchestrator --> User
```