A multilingual Retrieval-Augmented Generation (RAG) system designed to handle Bengali product queries using advanced embedding models and semantic search capabilities.
This project implements a sophisticated RAG system that can process queries in Bengali (Bangla) and English, providing accurate product recommendations and information retrieval. The system uses ChromaDB as the vector database with Alibaba's multilingual embedding model for enhanced semantic understanding across languages.
- Custom Embedding Function: Utilizes Alibaba-NLP's `gte-multilingual-base` model for generating high-quality embeddings that work effectively across Bengali and English.
- ChromaDB Vector Database: Persistent vector storage with custom embedding integration for efficient similarity search and document retrieval.
- Multi-Strategy Search Engine: Implements multiple search approaches:
  - Semantic search using embeddings
  - Intent-based filtering with category metadata
  - Keyword extraction and matching
  - Cross-lingual query processing
- Language Processing Pipeline:
  - Automatic language detection
  - Bengali to English translation using LLM
  - Intent extraction and classification
  - Keyword extraction using YAKE algorithm
- Large Language Model Integration: Uses Meta's Llama 4 Scout model, served through Groq, for translation, intent analysis, and response generation.
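The custom embedding function listed among the components above could look roughly like the sketch below. It is not the project's actual code: the names `GteEmbeddingFunction` and `load_gte_encoder` are hypothetical, loading the model with `trust_remote_code=True` and normalizing the vectors are assumptions, and the encoder is injectable so the wrapper can be exercised without downloading the model.

```python
from typing import Callable, List, Optional, Sequence

MODEL_NAME = "Alibaba-NLP/gte-multilingual-base"

# An encoder maps a batch of texts to one vector per text.
Encoder = Callable[[Sequence[str]], List[List[float]]]


def load_gte_encoder(model_name: str = MODEL_NAME) -> Encoder:
    """Build an encoder backed by sentence-transformers (hypothetical helper)."""
    # Imported lazily so the wrapper below can be used without the library.
    from sentence_transformers import SentenceTransformer

    # Assumption: the model ships custom modeling code and therefore
    # needs trust_remote_code=True when it is loaded.
    model = SentenceTransformer(model_name, trust_remote_code=True)

    def encode(texts: Sequence[str]) -> List[List[float]]:
        # Assumption: unit-length vectors are wanted; drop the flag otherwise.
        return model.encode(list(texts), normalize_embeddings=True).tolist()

    return encode


class GteEmbeddingFunction:
    """Callable with the `__call__(input)` shape ChromaDB calls (hypothetical name)."""

    def __init__(self, encoder: Optional[Encoder] = None) -> None:
        self._encoder = encoder  # loaded on first use when not injected

    def __call__(self, input: Sequence[str]) -> List[List[float]]:
        if self._encoder is None:
            self._encoder = load_gte_encoder()
        return self._encoder(input)
```

An instance of such a class would typically be passed as the `embedding_function` argument when creating a collection on a persistent ChromaDB client. Depending on the ChromaDB version, subclassing `chromadb.EmbeddingFunction` may be required instead of relying on the call signature alone.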
- Embedding Model: Alibaba-NLP/gte-multilingual-base (768-dimensional)
- Vector Database: ChromaDB with persistent storage
- LLM Provider: Groq (meta-llama/llama-4-scout-17b-16e-instruct)
- Text Processing: NLTK, YAKE keyword extraction
- Language Support: Bengali (primary), English
- Development Environment: Google Colab compatible
- Multilingual Support: Native Bengali query processing with English fallback
- Intent Classification: Automatic categorization of user queries into product categories
- Multi-Strategy Search: Combines semantic, keyword, and intent-based search strategies
- Custom Embeddings: Leverages state-of-the-art multilingual embedding models
- Persistent Storage: ChromaDB ensures data persistence across sessions
- Interactive Interface: Command-line interface with real-time query processing
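To make the multilingual routing concrete, here is a minimal sketch of a language check. The project lists `langdetect` for this step; the version below swaps in a standard-library heuristic based on the Bengali Unicode block so that it runs without extra packages. The function names and the 0.5 threshold are assumptions for illustration only.

```python
def bengali_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Bengali Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = sum(1 for ch in letters if "\u0980" <= ch <= "\u09FF")
    return bengali / len(letters)


def route_query(text: str, threshold: float = 0.5) -> str:
    """Return 'bn' for mostly Bengali-script text, otherwise 'en' (assumed labels)."""
    return "bn" if bengali_ratio(text) >= threshold else "en"
```

A script-based check like this treats romanized Bengali (Bengali typed in Latin letters) as non-Bengali script, so such queries would need a statistical detector or the LLM translation step to be handled properly.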
The system processes structured product data with the following categories:
- Baby Products
- Electronics & Technology
- Home & Kitchen Appliances
- Fashion & Clothing
- Beauty & Personal Care
- Sports & Fitness
- Books & Education
- Food & Beverages
- Automotive & Tools
- Health & Wellness
- Pet Supplies
- Garden & Outdoor
- Office & Stationery
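Intent classification has to land on one of the categories above before it can be used as a metadata filter. A minimal sketch of that validation step follows, assuming the LLM returns a free-text label; `normalize_intent` is a hypothetical helper, not a function from the project.

```python
from typing import Optional

CATEGORIES = [
    "Baby Products",
    "Electronics & Technology",
    "Home & Kitchen Appliances",
    "Fashion & Clothing",
    "Beauty & Personal Care",
    "Sports & Fitness",
    "Books & Education",
    "Food & Beverages",
    "Automotive & Tools",
    "Health & Wellness",
    "Pet Supplies",
    "Garden & Outdoor",
    "Office & Stationery",
]


def normalize_intent(raw: str) -> Optional[str]:
    """Map a free-text label to a known category, or None if nothing matches."""
    # Trim whitespace, surrounding quotes, and a trailing period from the label.
    cleaned = raw.strip().strip("\"'.").casefold()
    for category in CATEGORIES:
        if cleaned == category.casefold():
            return category
    return None
```

Returning `None` for an unrecognized label lets the caller skip the category filter and fall back to the unfiltered searches rather than filtering on a category that does not exist.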
- Python 3.7+ environment
- Required packages (automatically installed): python-dotenv, yake, langchain-groq, chromadb, langdetect, nltk, sentence-transformers, torch
- Environment Configuration:
  - Create a `.env` file with your `GROQ_API_KEY`
  - Prepare your product data in `.txt` format
- File Upload (for Google Colab):
  - Upload your `.env` file
  - Upload your product data file (`products.txt`)
- System Initialization:
  - Run the notebook cells sequentially
  - The system will automatically download and initialize the embedding model
  - ChromaDB will be set up with persistent storage
- Data Loading:
  - Product data is parsed and indexed automatically
  - Embeddings are generated using the custom Alibaba model
  - Vector database is populated with product information
- Interactive Usage:
  - Enter Bengali queries in the command prompt
  - Use special commands:
    - `quit` - Exit the system
    - `toggle` - Switch NLP translation mode
  - Receive responses in Bengali with relevant product information
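The interactive loop with its `quit` and `toggle` commands can be sketched as below. This is an illustration under assumptions rather than the notebook's code: `run_session` and the `answer` callback are hypothetical names, and translation is assumed to start switched on.

```python
from typing import Callable, Iterable, List


def run_session(lines: Iterable[str], answer: Callable[[str, bool], str]) -> List[str]:
    """Process input lines until 'quit'; 'toggle' flips the translation mode."""
    use_translation = True  # assumed initial state
    outputs: List[str] = []
    for line in lines:
        command = line.strip()
        if not command:
            continue  # ignore empty input
        if command.lower() == "quit":
            break
        if command.lower() == "toggle":
            use_translation = not use_translation
            outputs.append(f"NLP translation: {'on' if use_translation else 'off'}")
            continue
        # Any other input is treated as a query for the RAG pipeline.
        outputs.append(answer(command, use_translation))
    return outputs
```

Taking the input lines as an iterable keeps the loop easy to test. In a live session the iterable could be a small generator that wraps `input()`.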
Bengali Query: "pet supplier ki ki ache?" (romanized Bengali, roughly "What pet supplies are available?")
System Response: Relevant pet products with details and pricing, in Bengali
Bengali Query: "10000tk er moddhe valo ki mobile ache?" (roughly "What good mobile phones are there within 10,000 taka?")
System Response: Available mobile phones with specifications and prices, plus a recommendation, in Bengali
The system employs a four-pronged search approach:
- English Semantic Search: Translated query processed through embeddings
- Intent-Based Search: Category-filtered search using extracted intent
- Keyword Search: Both Bengali and English keyword matching
- Direct Bengali Search: Native Bengali query processing
Results are combined and deduplicated to provide comprehensive responses.
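The combine-and-deduplicate step might look like the following sketch. The source does not say how merged results are ranked, so keeping the smallest distance per document and sorting by it is an assumption, as are the names `Hit` and `merge_results`. It also assumes every strategy reports a comparable distance, which a keyword matcher may not.

```python
from typing import Dict, Iterable, List, Tuple

# (document id, distance) pairs; a lower distance means a closer match.
Hit = Tuple[str, float]


def merge_results(result_sets: Iterable[List[Hit]], top_k: int = 5) -> List[Hit]:
    """Merge hits from several strategies, keeping the best distance per document."""
    best: Dict[str, float] = {}
    for hits in result_sets:
        for doc_id, distance in hits:
            if doc_id not in best or distance < best[doc_id]:
                best[doc_id] = distance
    # Rank the deduplicated documents from closest to farthest.
    ranked = sorted(best.items(), key=lambda item: item[1])
    return ranked[:top_k]
```

For example, a product returned by both the semantic and the intent-based search appears once in the merged list, carrying whichever distance was smaller.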
- NLP Translation: Can be toggled on/off during runtime
- Search Parameters: Configurable top-k results (default: 4-6)
- Embedding Model: Supports different multilingual models
- LLM Temperature: Adjustable for response creativity (default: 0.0-0.2)
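These options could be gathered in one place, as in the sketch below. The `RagConfig` class and its field names are hypothetical. The default values are picked from inside the ranges listed above, and the validation bounds are assumptions, not documented limits of the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    """Runtime options for the RAG pipeline (hypothetical container)."""

    use_nlp_translation: bool = True  # can be flipped at runtime via 'toggle'
    top_k: int = 5                    # assumed value inside the 4-6 range
    embedding_model: str = "Alibaba-NLP/gte-multilingual-base"
    llm_model: str = "meta-llama/llama-4-scout-17b-16e-instruct"
    temperature: float = 0.1          # assumed value inside the 0.0-0.2 range

    def __post_init__(self) -> None:
        # Assumed sanity checks; adjust to the limits of the chosen provider.
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
```

Freezing the dataclass means a changed setting produces a new configuration object, for example through `dataclasses.replace`, which keeps a running session from mutating shared state by accident.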
This system provides a robust foundation for multilingual e-commerce search and can be extended to support additional languages and product categories.