This document provides a detailed technical overview of the Text-to-SQL Assistant, an application that converts natural language questions into executable SQL queries. It leverages OpenAI's GPT models for reasoning, ChromaDB for RAG (Retrieval-Augmented Generation), and Streamlit for the user interface.
The application follows a sequential pipeline architecture to process user queries.
- User Input: The user asks a question via the Streamlit UI (e.g., "Show me top 5 customers").
- Orchestration: The request is passed to the `answer_question` function in `orchestrator.py`.
- Pipeline Execution:
- Schema Retrieval (RAG): The system searches for relevant database tables using vector embeddings (ChromaDB).
- SQL Generation: The LLM generates a SQL query based on the retrieved schema.
- Validation: A critic module checks the SQL for safety (preventing `DROP`, `DELETE`, etc.).
- Execution: The validated SQL is run against a MySQL database.
- Response: The results (DataFrame) and the generated SQL are sent back to the UI.
The core logic resides in the agents/ directory (note: while named agents, these are now functional modules in the pipeline).
`orchestrator.py` is the pipeline controller.
- Function: `answer_question(question)`
- Logic: Sequentially calls the retrieval, generation, validation, and execution steps. Returns the SQL, validation message, and result DataFrame.
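The sequential control flow can be sketched as follows. This is a minimal sketch, not the project's actual code: the stub bodies and the `(is_valid, message)` return shape are assumptions, and the real module imports these steps from `retriever.py`, `sql_generator.py`, `critic.py`, and `sql_executor.py`.

```python
# Hypothetical sketch of the orchestrator pipeline; inline stubs stand
# in for the four agent modules so the example is self-contained.

def retrieve_schema(question):
    # stub for the RAG lookup in retriever.py
    return "Table: customers\nColumns: id, name, total_spend"

def generate_sql(question, schema):
    # stub for the LLM call in sql_generator.py
    return "SELECT name FROM customers ORDER BY total_spend DESC LIMIT 5"

def validate_sql(sql):
    # stub for the keyword scan in critic.py
    return True, "OK"

def run_sql(sql):
    # stub for MySQL execution in sql_executor.py
    return [("Alice",), ("Bob",)]

def answer_question(question):
    """Run retrieval, generation, validation, and execution in sequence."""
    schema = retrieve_schema(question)
    sql = generate_sql(question, schema)
    is_valid, message = validate_sql(sql)
    if not is_valid:
        return sql, message, None  # unsafe SQL is never executed
    return sql, message, run_sql(sql)
```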
`retriever.py` implements Retrieval-Augmented Generation (RAG) to provide schema context to the LLM, avoiding the need to feed the entire database schema into the context window.
- Database: Uses ChromaDB to store vector embeddings of table schemas.
- Embeddings: Uses `text-embedding-3-small` to convert questions and table descriptions into vectors.
- Process: `retrieve_schema(question)` queries the vector store for the top 3 most relevant tables and returns a formatted string containing the table name, columns, and description.
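The top-k lookup can be illustrated without ChromaDB or the OpenAI embedding API (neither is usable in a standalone snippet) by ranking precomputed vectors with plain cosine similarity. The table dictionaries, vector values, and `k` parameter below are invented for illustration; the real retriever embeds the question and queries the persisted ChromaDB store.

```python
# Illustration of top-k schema retrieval using cosine similarity in
# place of ChromaDB; embeddings are toy 2-D vectors, not real ones.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_schema(question_vec, tables, k=3):
    """Return the k table descriptions closest to the question vector."""
    ranked = sorted(tables, key=lambda t: cosine(question_vec, t["vec"]),
                    reverse=True)
    return "\n\n".join(
        f"Table: {t['name']}\nColumns: {t['columns']}\nDescription: {t['desc']}"
        for t in ranked[:k]
    )
```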
`critic.py` is a safety layer that ensures the generated SQL is safe to execute.
- Function: `validate_sql(sql)`
- Checks: Scans for dangerous keywords such as `DROP`, `DELETE`, `UPDATE`, and `ALTER`.
- Result: Returns `True` if the SQL is safe, otherwise `False` with an error message.
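A keyword scan of this kind can be sketched as below. The exact keyword set, matching strategy, and message strings in `critic.py` may differ; this version uses word-boundary regexes, which is an assumption.

```python
# Sketch of a keyword-based SQL safety check; the blocked keywords are
# the ones the document lists, and the regex approach is an assumption.
import re

DANGEROUS_KEYWORDS = ("DROP", "DELETE", "UPDATE", "ALTER")

def validate_sql(sql):
    """Return (True, ok-message) if safe, else (False, error message)."""
    for kw in DANGEROUS_KEYWORDS:
        # \b avoids false positives such as a column named "updated_at"
        if re.search(rf"\b{kw}\b", sql, re.IGNORECASE):
            return False, f"Unsafe SQL: contains {kw}"
    return True, "SQL passed validation"
```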
`sql_executor.py` handles the actual interaction with the MySQL database.
- Library: `mysql.connector`
- Function: `run_sql(sql)`
- Output: Executes the query and returns the result as a Pandas DataFrame.
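The execute-and-return-a-DataFrame step can be sketched like this. Since a MySQL server is not available in a standalone snippet, `sqlite3` stands in for `mysql.connector` here; the real module opens a MySQL connection using the credentials from `.env`.

```python
# Sketch of run_sql: execute a query and return a pandas DataFrame.
# sqlite3 replaces mysql.connector purely so the example is runnable.
import sqlite3
import pandas as pd

def run_sql(sql, conn):
    """Execute a read-only query and return the result as a DataFrame."""
    return pd.read_sql_query(sql, conn)

# Demo with an in-memory database and invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
df = run_sql("SELECT name FROM customers ORDER BY id", conn)
```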
text_to_sql/
├── agents/
│ ├── orchestrator.py # Main pipeline controller
│ ├── critic.py # SQL validation logic
│ ├── retriever.py # RAG logic with ChromaDB
│ ├── sql_executor.py # MySQL database connection & execution
│ └── sql_generator.py # SQL generation logic
├── db/
│ └── mysql_connection.py # Database helper scripts
├── chroma_store/ # Persisted Vector Database (ChromaDB)
├── app.py # Streamlit Entry Point
├── build_schema_store.py # Script to index schema into ChromaDB
├── requirements.txt # Project Dependencies
└── .env # Environment variables (API Keys, DB Creds)
- Python 3.8+
- MySQL Database
- OpenAI API Key
1. Clone the repository:

   ```bash
   git clone <repository_url>
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure Environment: Create a `.env` file with the following:

   ```ini
   OPENAI_API_KEY=your_key_here
   MYSQL_HOST=localhost
   MYSQL_PORT=3306
   MYSQL_USER=root
   MYSQL_PASSWORD=your_password
   MYSQL_DATABASE=your_db
   ```

4. Build Schema Store: Run this once to populate ChromaDB with your database schema:

   ```bash
   python build_schema_store.py
   ```

5. Run the Application:

   ```bash
   streamlit run app.py
   ```
```mermaid
graph TD
    User["User (Streamlit UI)"] --> Orchestrator[Orchestrator]
    Orchestrator --> Retriever["Retriever (ChromaDB)"]
    Orchestrator --> Generator["SQL Generator (LLM)"]
    Orchestrator --> Critic["Critic (Validator)"]
    Orchestrator --> Executor["Executor (MySQL)"]
    Retriever -- Schema Context --> Generator
    Generator -- Generated SQL --> Critic
    Critic -- Validation Status --> Orchestrator
    Orchestrator -- If Valid --> Executor
    Executor -- DataFrame --> Orchestrator
    Orchestrator --> User
```