An intelligent, retrieval-first QA system powered by Locality Sensitive Hashing (LSH) and Hybrid Retrieval over the NUST Undergraduate and Postgraduate Student Handbooks.
This project was developed for the Big Data Analytics (BDA) course at NUST SEECS. It focuses on scalable, efficient retrieval of academic policies using MinHash, SimHash, and PageRank.
- Python 3.10+
- Node.js 18+
- OpenRouter API Key (for free/low-cost LLM answer generation)
Create a .env file in the backend/ directory:
OPENROUTER_API_KEY=your_key_herecd backend
python -m venv .venv
source .venv/bin/activate # On Windows use \`.venv\Scripts\activate\`
pip install -r requirements.txt
uvicorn main:app --reload --port 8000cd frontend
npm install
npm run devThe application will be available at http://localhost:5173.
backend/: FastAPI application handling ingestion, indexing, and retrieval.frontend/: React + Vite + Tailwind UI for query and analytics visualization.data/:raw/: Source PDFs (UG/PG Handbooks).chunks/: Processed semantic text chunks.index/: Precomputed MinHash, SimHash, and TF-IDF artifacts.results/: Experimental evaluation data and reports.
scripts/: Utilities for running experiments and mass ingestion.
The project comes with pre-computed indices, but if you wish to add new documents or re-index:
- Place PDFs: Put your PDF files in
data/raw/. - Ingest via UI: Navigate to the Ingest page in the dashboard and click "Start Ingestion".
- CLI Ingestion:
python scripts/run_experiments.py # Runs indexing as part of the pipeline
The system includes a robust evaluation framework measuring Precision@k, Recall@k, Latency, and Scalability.
To run the experiments:
cd backend
export PYTHONPATH=$PYTHONPATH:$(pwd)
python ../scripts/run_experiments.pyView the results in the Analytics tab of the web dashboard.
- Ingestion: Semantic chunking using
Unstructured(or custom sliding window). - Retrieval:
- Hybrid: MinHash+LSH (fast filtering) combined with SimHash (Hamming distance scoring).
- Baseline: TF-IDF exact vector similarity.
- Reranking: PageRank importance scores fused with retrieval similarity.
- Generation: Context-grounded response generation using OpenRouter (e.g., Qwen/GPT-4 Free).
This project is for academic purposes as part of the NUST BDA course.