vexT is a state-of-the-art (SOTA) prototype for Hybrid Search and Retrieval-Augmented Generation (RAG). It bridges the gap between traditional keyword search and modern semantic understanding, delivering precise, context-aware answers for e-commerce product data.
- 🧠 Hybrid Retrieval Engine: Combines the precision of BM25 (Keyword Search) with the semantic understanding of k-NN HNSW (Vector Search) using OpenSearch.
- 🤖 Generative AI (RAG): Integrates Google Gemini 2.5 Flash to synthesize natural language answers based on retrieved product context.
- 🌍 Multilingual Support: Powered by
paraphrase-multilingual-MiniLM-L12-v2, enabling seamless search in English, Vietnamese, and more. - ⚡ High-Performance Infrastructure: Dockerized OpenSearch cluster with optimized HNSW settings (
m=16,ef_construction=128). - 🛠️ Automated ETL Pipeline: Robust data processing pipeline that handles cleaning, sampling, and vectorization of large datasets.
- 📊 Interactive UI: A clean, responsive Streamlit interface for real-time testing and demonstration.
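As a rough illustration, the HNSW settings above would sit inside an OpenSearch index body along these lines. This is a sketch, not vexT's actual mapping: the field names and the `lucene` engine choice are assumptions, while dimension 384 matches the output size of `paraphrase-multilingual-MiniLM-L12-v2`.

```python
# Illustrative OpenSearch index body using the HNSW settings listed above.
# Field names ("product_name", "embedding") and the "lucene" engine are
# assumptions; m=16 and ef_construction=128 come from the feature list.
index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN for this index
    "mappings": {
        "properties": {
            "product_name": {"type": "text"},  # BM25-searchable text field
            "embedding": {
                "type": "knn_vector",
                "dimension": 384,  # MiniLM-L12-v2 embedding size
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene",
                    "parameters": {"m": 16, "ef_construction": 128},
                },
            },
        }
    },
}
```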
The system follows a modular architecture:
- Data Ingestion (Offline): Raw CSV data is processed, vectorized, and indexed into OpenSearch.
- Search Runtime (Online): User queries are vectorized and sent to OpenSearch via a Hybrid Query DSL.
- RAG Inference: Top results are injected into a prompt context for Gemini to generate the final response.
(See `docs/system_workflow_detailed.md` for a deep dive into the internal workflow.)
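The online flow above can be sketched in a few lines. Everything here is illustrative rather than vexT's actual code: the field names, the `hybrid` query clause, and the prompt wording are assumptions.

```python
# Sketch of the online flow: build a hybrid BM25 + k-NN query body, then
# inject the retrieved hits into a prompt for the LLM. Field names and
# prompt wording are hypothetical.
def build_hybrid_query(text: str, vector: list, k: int = 5) -> dict:
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # BM25 keyword clause
                    {"match": {"product_name": {"query": text}}},
                    # HNSW vector clause
                    {"knn": {"embedding": {"vector": vector, "k": k}}},
                ]
            }
        },
    }

def build_rag_prompt(question: str, hits: list) -> str:
    context = "\n".join(f"- {h['_source']['product_name']}" for h in hits)
    return (
        "Answer the question using only the product context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Note that OpenSearch's `hybrid` query also expects a search pipeline with a score-normalization processor to blend the BM25 and vector scores; that setup is omitted from this sketch.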
Before you begin, ensure you have the following installed:
- Docker Desktop: For running the OpenSearch cluster.
- Python 3.12+: The core programming language.
- uv (Recommended): An extremely fast Python package installer and resolver.
- Google Cloud API Key: Access to Gemini models.
```bash
git clone https://github.com/EurusDevSec/vexT.git
cd vexT
```

Create a `.env` file in the root directory:

```
GOOGLE_API_KEY=your_google_api_key_here
```

We recommend using uv for lightning-fast setup:

```bash
cd src
uv sync
```

(Alternatively, use `pip install -r requirements.txt` if you prefer standard pip.)
Launch the OpenSearch cluster using Docker Compose. Note that we use custom ports (10200, 10600) to avoid conflicts on Windows.
```bash
# From project root
docker-compose -f infra/docker-compose.yml up -d
```

Wait ~30 seconds for the cluster to initialize.
Process the data and generate vectors. This step includes smart sampling to preserve demo scenarios.
```bash
# From src/ directory
uv run etl_pipeline.py
```

Create the index mapping and ingest the data into OpenSearch:

```bash
uv run search_core.py
```

Start the Streamlit frontend:

```bash
uv run streamlit run app.py
```

Access the app at http://localhost:8501.
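The "smart sampling" mentioned in the ETL step could, in principle, look like the following sketch. The `must_keep` predicate and the row counts are hypothetical illustrations, not vexT's actual logic.

```python
import random

# Hypothetical "smart sampling": always keep rows needed for demo
# scenarios, then fill the remaining budget with a seeded random sample
# of the rest so the output stays reproducible.
def smart_sample(rows: list, must_keep, n: int, seed: int = 42) -> list:
    keep = [r for r in rows if must_keep(r)]
    rest = [r for r in rows if not must_keep(r)]
    rng = random.Random(seed)
    extra = rng.sample(rest, min(max(n - len(keep), 0), len(rest)))
    return keep + extra
```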
```
vexT/
├── docs/                  # Documentation & architecture plans
├── infra/                 # Infrastructure (Docker Compose)
│   └── docker-compose.yml
├── res/                   # Resources (data files)
│   ├── flipkart_data.csv
│   └── flipkart_data_ready.json
├── src/                   # Source code
│   ├── app.py             # Streamlit frontend
│   ├── etl_pipeline.py    # Data processing & vectorization
│   ├── rag_engine.py      # Gemini AI integration
│   ├── search_core.py     # OpenSearch logic
│   └── pyproject.toml     # Project dependencies (uv)
├── .env                   # Environment variables
├── CODE_OF_CONDUCT.md     # Community standards
├── CONTRIBUTING.md        # Contribution guidelines
└── README.md              # Project documentation
```
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
We are committed to providing a friendly, safe, and welcoming environment for all. Please review our Code of Conduct before participating.
Distributed under the MIT License. See LICENSE for more information.
EurusDevSec - GitHub Profile
Project Link: https://github.com/EurusDevSec/vexT
- Default OpenSearch credentials (configured in `infra/docker-compose.yml`): `admin` / `StrongPassword123!`.
- The ETL mapping (column names) is configurable in `src/etl_pipeline.py` to accommodate other datasets.
- For development, the heap settings in `infra/docker-compose.yml` are conservative; increase them for larger datasets or production use.
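To poke the cluster directly with the defaults above, a stdlib-only sketch (the port and credentials are the ones listed here; certificate verification is disabled only because the demo cluster ships a self-signed certificate):

```python
import base64
import ssl
import urllib.request

# Basic-auth request against the Dockerized cluster on the custom port.
# Disabling certificate verification is for local development only.
token = base64.b64encode(b"admin:StrongPassword123!").decode()
req = urllib.request.Request(
    "https://localhost:10200",
    headers={"Authorization": f"Basic {token}"},
)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Uncomment once the Docker cluster is up:
# with urllib.request.urlopen(req, context=ctx) as resp:
#     print(resp.read().decode())  # cluster name, version, etc.
```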