A natural language to SQL query system that can work with multiple SQLite databases. The system uses OpenAI's GPT model to generate SQL queries based on natural language questions and routes queries to the appropriate database.
The system consists of several components:
- Database servers for each database
- SQL query generator
- Client interface with database routing
- Natural language processing for database selection
- Source: Kaggle - E-commerce Dataset by Olist
- Description: Brazilian E-commerce data from Olist Store containing information about 100k orders from 2016 to 2018
- Tables:
- orders: Order details and timestamps
- customers: Customer information and location
- products: Product details and measurements
- sellers: Seller information and location
- order_items: Order items, prices, and freight values
- Source: Kaggle - 2016 US Election Dataset
- Description: US County-level demographic data combined with 2016 election results
- Tables:
- county_facts: Demographic information by county
- county_facts_dictionary: Column descriptions and metadata
- election_results: County-level election results
- Create a virtual environment and activate it:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate- Install dependencies:
pip install -r requirements.txt- Set up your environment variables:
- Create a
.envfile in the root directory - Add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
- Download and place the SQLite databases:
- Download Database 1 from Olist E-commerce Dataset
- Download Database 2 from 2016 US Election Dataset
- Place the files in the
data/directory:
data/
├── database1.sqlite # Rename the Olist database file
└── database2.sqlite # Rename the Election database file
- Start the system:
python client.py- Enter your questions in natural language. The system will:
- Determine which database to use
- Generate an appropriate SQL query
- Execute the query
- Display the results in a formatted table
Example questions:
- For Database 1 (E-commerce):
- "How many orders were placed in 2018?"
- "What are the top selling products by revenue?"
- "Show me the average delivery time of products by state"
- For Database 2 (Demographics):
- "What are the top 5 counties by population?"
- "Show me the election results in Florida counties"
- "What's the median age in counties with population over 1 million?"
If you want to use different databases, you'll need to modify the following files:
- Update the database file paths in
get_schema():
conn = sqlite3.connect("data/your_database.sqlite")- Update
database1_infoanddatabase2_infowith your database schemas:
self.database1_info = {
"name": "Your Database Name",
"description": "Your database description",
"tables": {
"table_name": ["column1", "column2", ...],
...
}
}- Modify keywords and weights in
determine_database():
database1_keywords = {
"keyword1": weight1,
"keyword2": weight2,
...
}- Update the system prompt rules for your databases:
"Important rules for query generation:
1. For Database 1:
- Your specific rules here
2. For Database 2:
- Your specific rules here".
├── client.py # Main client interface
├── sql_generator.py # SQL query generation
├── database1_server.py # Server for first database
├── database2_server.py # Server for second database
├── requirements.txt # Python dependencies
├── .env # Environment variables
└── data/ # Database files
├── database1.sqlite
└── database2.sqlite
-
Database Schema:
- Update the schema information in the database server files
- Modify the schema information in
client.py
-
Keyword Weights:
- Adjust keywords and their weights in
client.pyto match your new databases - Add domain-specific keywords that indicate which database to use
- Adjust keywords and their weights in
-
Query Generation Rules:
- Update the rules in
sql_generator.pyto match your new database structures - Add specific JOIN conditions and table relationships
- Update the rules in
-
Table Display:
- Modify the result formatting in
client.pyif your new databases require different display formats
- Modify the result formatting in
-
If queries aren't routing to the correct database:
- Check the keyword weights in
client.py - Add more specific keywords for your databases
- Adjust the scoring system if needed
- Check the keyword weights in
-
If SQL queries are incorrect:
- Verify the schema information is up to date
- Check the rules in
sql_generator.py - Ensure table relationships are properly defined
-
If results aren't displaying properly:
- Modify the display formatting in
client.py - Adjust column widths and alignment as needed
- Modify the display formatting in