An advanced AI-powered natural language interface for querying World Bank databases using LangChain, OpenAI, and semantic search techniques.
This project implements an intelligent SQL agent that allows users to query World Bank data using natural language. The system combines multiple AI components including few-shot learning, retrieval-augmented generation (RAG), and multi-model coordination to provide accurate, context-aware responses.
- Semantic similarity-based example selection using FAISS vector database
- Dynamic retrieval of relevant query patterns from example database
- Context-aware prompt engineering with top-K example selection
- Custom TF-IDF search pipeline for indicator discovery
- Sentence-level query decomposition for improved matching
- Cosine similarity-based ranking of relevant database indicators
- LangChain-based SQL agent with OpenAI function calling
- Intelligent query generation from natural language
- Database schema-aware query construction
- Error recovery and query refinement
- Two-model approach: GPT-4o for queries, GPT-4o-mini for summaries
- Context-preserving response generation
- Token-optimized summarization pipeline
- Environment-based configuration with
.envsupport - Centralized config class for easy maintenance
- Validation of required environment variables
- Comprehensive try-catch blocks across all components
- Graceful degradation on API failures
- User-friendly error messages
- OpenAI callback integration for token usage monitoring
- Cost tracking for production deployment
- Optimized model selection for different tasks
- Built-from-scratch TF-IDF vectorization
- Multi-part query processing with sentence tokenization
- Deduplication and relevance ranking
- Performance timing for optimization
- Complex prompt engineering with database schema injection
- Custom SQL generation rules and guardrails
- Multi-tool coordination (SQL queries, schema inspection)
- Verbose logging for debugging and monitoring
User Query
↓
[NLTK Setup & Preprocessing]
↓
[Few-Shot Selector] ← FAISS Vector Store
↓
[Indicator Search] ← TF-IDF Similarity
↓
[Query Augmentation]
↓
[SQL Agent] ← LangChain + OpenAI
↓
[Database Query] ← SQLite
↓
[Response Summarization] ← GPT-4o-mini
↓
Final Answer
world-bank-sql-agent/
├── config.py # Configuration management
├── utils.py # NLTK setup, database loading
├── few_shot_selector.py # Semantic example selection
├── indicator_search.py # TF-IDF search implementation
├── prompts.py # All prompt templates
├── agent.py # SQL agent implementation
├── main.py # Application entry point
├── requirements.txt # Python dependencies
├── .env.example # Environment variables template
├── .gitignore # Git ignore rules
└── data/
├── world_bank_data_updated.db
└── Worldbankfewshots.json
- LLM Framework: LangChain
- Models: OpenAI GPT-4o, GPT-4o-mini
- Vector Store: FAISS
- Embeddings: OpenAI Embeddings
- Database: SQLite with SQLAlchemy
- NLP: NLTK for tokenization
- ML: Scikit-learn for TF-IDF and cosine similarity
- Data Processing: Pandas, NumPy
- Clone the repository
git clone https://github.com/yourusername/world-bank-sql-agent.git
cd world-bank-sql-agent- Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate- Install dependencies
pip install -r requirements.txt- Set up environment variables
cp .env.example .env
# Edit .env and add your OpenAI API key- Prepare data files
- Place
world_bank_data_updated.dbin thedata/directory - Place
Worldbankfewshots.jsonin thedata/directory
from main import main
# Query the database
query = "What is the GDP growth rate in South Asian countries in 2023?"
result = main(query)python main.py# Economic indicators
"Compare GDP growth between developed and developing nations"
# Social indicators
"Show literacy rates in South Asia for the last 5 years"
# Environmental data
"What are CO2 emissions trends in European countries?"
# Financial data
"Are citizens in developed countries more aware of financial consumer protection laws?"Edit .env file to customize:
# OpenAI Configuration
OPENAI_API_KEY=your-api-key-here
CHAT_MODEL=gpt-4o
SUMMARY_MODEL=gpt-4o-mini
TEMPERATURE=0.1
# Database
DATABASE_PATH=data/world_bank_data_updated.db
# Search Parameters
TOP_N_INDICATORS=5
DEFAULT_LIMIT=150- Token Usage: Monitored via OpenAI callbacks
- Search Speed: TF-IDF provides <100ms indicator search
- Model Selection: GPT-4o for complex queries, GPT-4o-mini for summaries
- Caching: FAISS vector store for fast semantic search
- Query Limits: Configurable result limits to prevent over-fetching
- API keys stored in
.env(not committed to git) - SQL injection prevention via LangChain SQL toolkit
- Read-only database queries (no DML operations)
- Input validation and sanitization
# Run with example query
python main.py
# Modify query in main.py for testing
if __name__ == "__main__":
query = "Your test query here"
main(query)The system works with World Bank data containing:
- 12 columns: id, indicator_id, indicator_name, indicator_description, country_id, country_name, country_region, country_incomelevel, country_lendingtype, year, value, unitofmeasure
- Time Range: 2015 onwards
- Coverage: Multiple countries, regions, and economic indicators
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Your Name
- GitHub: @AnnasMustafaDev
- LinkedIn: Annas Mustafa
- World Bank for providing comprehensive economic data
- OpenAI for GPT models
- LangChain community for excellent documentation
- FAISS for efficient vector similarity search
This project demonstrates senior-level AI engineering skills including:
✅ Multiple AI Components Integration
- Few-shot learning with semantic similarity
- RAG pipeline with custom search
- SQL agent with function calling
- Multi-stage summarization
✅ Production-Ready Features
- Configuration management
- Comprehensive error handling
- Token usage tracking and cost optimization
- Modular, maintainable architecture
✅ Custom ML Pipelines
- TF-IDF search built from scratch
- Sentence-level query decomposition
- Relevance ranking algorithms
✅ Advanced Agent Orchestration
- Complex prompt engineering
- Multi-tool coordination
- Dynamic query augmentation
✅ Multi-Model Coordination
- Task-appropriate model selection
- Cost-performance optimization
- Parallel processing pipelines
⭐ Star this repository if you find it useful!