Skip to content

An advanced Text-to-SQL system powered by LangChain and OpenAI that transforms natural language queries into accurate SQL statements for World Bank databases. Features RAG pipeline, few-shot learning, semantic search, and multi-model coordination.

License

Notifications You must be signed in to change notification settings

AnnasMustafaDev/advanced-text2sql-query-system

Repository files navigation

World Bank SQL Agent

An advanced AI-powered natural language interface for querying World Bank databases using LangChain, OpenAI, and semantic search techniques.

Project Overview

This project implements an intelligent SQL agent that allows users to query World Bank data using natural language. The system combines multiple AI components including few-shot learning, retrieval-augmented generation (RAG), and multi-model coordination to provide accurate, context-aware responses.

Key Features

Advanced AI Components

1. Few-Shot Learning

  • Semantic similarity-based example selection using FAISS vector database
  • Dynamic retrieval of relevant query patterns from example database
  • Context-aware prompt engineering with top-K example selection

2. Retrieval-Augmented Generation (RAG)

  • Custom TF-IDF search pipeline for indicator discovery
  • Sentence-level query decomposition for improved matching
  • Cosine similarity-based ranking of relevant database indicators

3. SQL Agent with Function Calling

  • LangChain-based SQL agent with OpenAI function calling
  • Intelligent query generation from natural language
  • Database schema-aware query construction
  • Error recovery and query refinement

4. Multi-Stage Summarization

  • Two-model approach: GPT-4o for queries, GPT-4o-mini for summaries
  • Context-preserving response generation
  • Token-optimized summarization pipeline

Production-Grade Features

Configuration Management

  • Environment-based configuration with .env support
  • Centralized config class for easy maintenance
  • Validation of required environment variables

Error Handling

  • Comprehensive try-catch blocks across all components
  • Graceful degradation on API failures
  • User-friendly error messages

Token Tracking & Cost Optimization

  • OpenAI callback integration for token usage monitoring
  • Cost tracking for production deployment
  • Optimized model selection for different tasks

Custom Search Pipeline

  • Built-from-scratch TF-IDF vectorization
  • Multi-part query processing with sentence tokenization
  • Deduplication and relevance ranking
  • Performance timing for optimization

Agent Orchestration

  • Complex prompt engineering with database schema injection
  • Custom SQL generation rules and guardrails
  • Multi-tool coordination (SQL queries, schema inspection)
  • Verbose logging for debugging and monitoring

Architecture

User Query
    ↓
[NLTK Setup & Preprocessing]
    ↓
[Few-Shot Selector] ← FAISS Vector Store
    ↓
[Indicator Search] ← TF-IDF Similarity
    ↓
[Query Augmentation]
    ↓
[SQL Agent] ← LangChain + OpenAI
    ↓
[Database Query] ← SQLite
    ↓
[Response Summarization] ← GPT-4o-mini
    ↓
Final Answer

Project Structure

world-bank-sql-agent/
├── config.py                 # Configuration management
├── utils.py                  # NLTK setup, database loading
├── few_shot_selector.py      # Semantic example selection
├── indicator_search.py       # TF-IDF search implementation
├── prompts.py               # All prompt templates
├── agent.py                 # SQL agent implementation
├── main.py                  # Application entry point
├── requirements.txt         # Python dependencies
├── .env.example            # Environment variables template
├── .gitignore              # Git ignore rules
└── data/
    ├── world_bank_data_updated.db
    └── Worldbankfewshots.json

Technical Stack

  • LLM Framework: LangChain
  • Models: OpenAI GPT-4o, GPT-4o-mini
  • Vector Store: FAISS
  • Embeddings: OpenAI Embeddings
  • Database: SQLite with SQLAlchemy
  • NLP: NLTK for tokenization
  • ML: Scikit-learn for TF-IDF and cosine similarity
  • Data Processing: Pandas, NumPy

Installation

  1. Clone the repository
git clone https://github.com/yourusername/world-bank-sql-agent.git
cd world-bank-sql-agent
  1. Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  1. Install dependencies
pip install -r requirements.txt
  1. Set up environment variables
cp .env.example .env 
# Edit .env and add your OpenAI API key
  1. Prepare data files
  • Place world_bank_data_updated.db in the data/ directory
  • Place Worldbankfewshots.json in the data/ directory

Usage

Basic Usage

from main import main

# Query the database
query = "What is the GDP growth rate in South Asian countries in 2023?"
result = main(query)

Running from Command Line

python main.py

Example Queries

# Economic indicators
"Compare GDP growth between developed and developing nations"

# Social indicators
"Show literacy rates in South Asia for the last 5 years"

# Environmental data
"What are CO2 emissions trends in European countries?"

# Financial data
"Are citizens in developed countries more aware of financial consumer protection laws?"

🔧 Configuration

Edit .env file to customize:

# OpenAI Configuration
OPENAI_API_KEY=your-api-key-here
CHAT_MODEL=gpt-4o
SUMMARY_MODEL=gpt-4o-mini
TEMPERATURE=0.1

# Database
DATABASE_PATH=data/world_bank_data_updated.db

# Search Parameters
TOP_N_INDICATORS=5
DEFAULT_LIMIT=150

Performance Optimization

  • Token Usage: Monitored via OpenAI callbacks
  • Search Speed: TF-IDF provides <100ms indicator search
  • Model Selection: GPT-4o for complex queries, GPT-4o-mini for summaries
  • Caching: FAISS vector store for fast semantic search
  • Query Limits: Configurable result limits to prevent over-fetching

Security

  • API keys stored in .env (not committed to git)
  • SQL injection prevention via LangChain SQL toolkit
  • Read-only database queries (no DML operations)
  • Input validation and sanitization

Testing

# Run with example query
python main.py

# Modify query in main.py for testing
if __name__ == "__main__":
    query = "Your test query here"
    main(query)

Database Schema

The system works with World Bank data containing:

  • 12 columns: id, indicator_id, indicator_name, indicator_description, country_id, country_name, country_region, country_incomelevel, country_lendingtype, year, value, unitofmeasure
  • Time Range: 2015 onwards
  • Coverage: Multiple countries, regions, and economic indicators

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Your Name

Acknowledgments

  • World Bank for providing comprehensive economic data
  • OpenAI for GPT models
  • LangChain community for excellent documentation
  • FAISS for efficient vector similarity search

Project Complexity

This project demonstrates senior-level AI engineering skills including:

Multiple AI Components Integration

  • Few-shot learning with semantic similarity
  • RAG pipeline with custom search
  • SQL agent with function calling
  • Multi-stage summarization

Production-Ready Features

  • Configuration management
  • Comprehensive error handling
  • Token usage tracking and cost optimization
  • Modular, maintainable architecture

Custom ML Pipelines

  • TF-IDF search built from scratch
  • Sentence-level query decomposition
  • Relevance ranking algorithms

Advanced Agent Orchestration

  • Complex prompt engineering
  • Multi-tool coordination
  • Dynamic query augmentation

Multi-Model Coordination

  • Task-appropriate model selection
  • Cost-performance optimization
  • Parallel processing pipelines

Star this repository if you find it useful!

About

An advanced Text-to-SQL system powered by LangChain and OpenAI that transforms natural language queries into accurate SQL statements for World Bank databases. Features RAG pipeline, few-shot learning, semantic search, and multi-model coordination.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages