# RAGKit

RAGKit is a comprehensive toolkit for building and evaluating information retrieval systems with Retrieval-Augmented Generation (RAG) capabilities. It enables you to scrape domain-specific content, create TREC-style reusable test collections, and build sparse, dense, or hybrid indexes for LLM-powered retrieval and evaluation.
🚀 Run the Tutorial in Colab
📑 See the tutorial slides
📑 See my lecture slides from CS 5001 (Information Retrieval) taught at Missouri S&T in Spring 2025
- **Introduction to RAG**
  - What is Retrieval-Augmented Generation?
  - Why LLMs hallucinate and how RAG mitigates it
- **Core RAG Workflow**
  - Retrieval → Reranking → Generation pipeline
  - Bi-encoder and cross-encoder strategies
- **Hands-on Setup**
  - Using the Mistral API and Colab
  - Required libraries and computing resources
- **Customizing RAG for Domains**
  - Domain-specific data, prompts, and evaluation
  - Applications in research, tech support, education, healthcare, and campus info
- **Prompt Engineering**
  - Designing domain-appropriate prompts
  - Role definition, format control, reasoning chains, citation styles
- **Evaluation Techniques**
  - Scientific, technical, educational, and healthcare evaluation criteria
  - IR metrics: nDCG, Recall, Precision@k, Reciprocal Rank (illustrated in the snippet after this outline)
- **Advanced Architectures**
  - Two-stage retrieval and reranking
  - Multi-hop RAG, HyDE, Knowledge Graph RAG, Self-RAG
- **Debugging and Productionization**
  - Common RAG issues and how to fix them
  - Scaling to production: monitoring, routing, feedback loops
- **Tools and Resources**
  - FAISS, sentence-transformers, ir_datasets, LangChain, LlamaIndex
  - Research papers and open datasets
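For concreteness, here is a tiny illustration of two of the metrics named above for a single query. `precision_at_k` and `reciprocal_rank` are toy helpers written for this README, not functions from the tutorial or the toolkit:

```python
# Toy illustration of Precision@k and Reciprocal Rank for one query.
# `ranking` is a system's ranked list of doc ids; `relevant` is the qrels set.

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d9"}
print(precision_at_k(ranking, relevant, k=3))  # 1 relevant in top 3 -> 0.333...
print(reciprocal_rank(ranking, relevant))      # first hit at rank 3 -> 0.333...
```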
## Table of Contents

- Features
- Requirements
- Installation
- Quick Start
- Directory Structure
- Data and Usage Guide
- Running Tests
- Contributing
- License
- Acknowledgements
- Contact
## Features

- **Web Scraping**: Robust academic website scraping with rate limiting and content extraction
- **Corpus Building**: Create structured corpora using sliding-window passages
- **Multiple Index Types**:
  - BM25 sparse index
  - Dense retrieval using transformer models
  - Hybrid retrieval combining both approaches (see the fusion sketch after this list)
- **Topic Integration**: Convert and validate topics, generate automatic qrels
- **Comprehensive Evaluation**: Standard IR metrics with visualization
- **Test Collection Creation**: Tools for creating reusable IR test collections
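To give a rough idea of what hybrid retrieval does, the sketch below fuses sparse and dense scores with a weighted sum after min-max normalization. The names `normalize`, `fuse`, and the `alpha` weight are illustrative only, not RAGKit's API:

```python
# Minimal sketch of weighted-sum hybrid fusion, assuming you already have
# {doc_id: score} dicts from a BM25 run and a dense run.
from __future__ import annotations

def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize scores so sparse and dense scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(sparse: dict[str, float], dense: dict[str, float], alpha: float = 0.5):
    """Rank documents by alpha * sparse + (1 - alpha) * dense."""
    sparse, dense = normalize(sparse), normalize(dense)
    fused = {
        doc: alpha * sparse.get(doc, 0.0) + (1 - alpha) * dense.get(doc, 0.0)
        for doc in set(sparse) | set(dense)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

print(fuse({"d1": 12.3, "d2": 8.1}, {"d1": 0.62, "d3": 0.77}))
```

Weighted-sum fusion over normalized scores is one common choice; reciprocal rank fusion is a popular alternative when score scales are hard to compare.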
## Requirements

- Python 3.8 or higher
- Java 21 or higher (for Pyserini)
- 8 GB RAM minimum (16 GB recommended for dense indexing)
- CUDA-compatible GPU (optional, for faster dense indexing; see the check below)
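To see whether the optional GPU will actually be used, you can ask PyTorch directly. This is just a convenience snippet, not part of the toolkit:

```python
import torch

# Dense indexing falls back to CPU when no CUDA device is visible.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Dense encoding will run on: {device}")
```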
Main dependencies include:
- pyserini
- torch
- sentence-transformers
- faiss-cpu (or faiss-gpu)
- numpy
- pandas
- spacy
- beautifulsoup4
- tqdm
## Installation

- **Clone the repository**

  ```bash
  git clone https://github.com/yourusername/academic-ir-pipeline
  cd academic-ir-pipeline
  ```

- **Create a virtual environment (recommended)**

  ```bash
  # Using venv
  python -m venv venv

  # Activate the environment
  # On Windows:
  venv\Scripts\activate
  # On Unix or macOS:
  source venv/bin/activate
  ```

- **Install dependencies**

  ```bash
  # Install base requirements
  pip install -r requirements.txt

  # Install development requirements (for testing)
  pip install -r requirements-dev.txt

  # Download spaCy model
  python -m spacy download en_core_web_sm
  ```

- **Install Java (required for Pyserini)**

  On Ubuntu:

  ```bash
  sudo apt-get update
  sudo apt-get install openjdk-21-jdk
  ```

  On macOS:

  ```bash
  brew install openjdk@21
  ```

  On Windows:

  - Download and install OpenJDK 21 from Eclipse Adoptium
  - Add Java to your PATH environment variable

- **Verify the installation**

  ```bash
  # Run tests to verify installation
  pytest tests/
  ```
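Beyond the test suite, a quick import check catches most environment problems early (a missing spaCy model, for example, surfaces immediately). This is just a convenience snippet, not part of the repository:

```python
# Quick environment check: each import fails fast if its dependency
# is missing or misconfigured.
import faiss
import pyserini
import torch
import sentence_transformers
import spacy

print("torch:", torch.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
spacy.load("en_core_web_sm")  # verifies the downloaded spaCy model loads
print("Environment looks good.")
```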
## Quick Start

Here's a minimal example to get started:

- **Scrape a website**

  ```bash
  python -m src.scraping.website_scraper \
      --url "https://example.edu/research" \
      --output data/scraped \
      --max-pages 100
  ```

- **Build corpus**

  ```bash
  python -m src.corpus.builder \
      --input data/scraped/scraped_data.json \
      --output academic_corpus
  ```

- **Create index**

  ```bash
  python -m src.indexing.create_index \
      --corpus academic_corpus \
      --output index \
      --dense \
      --encoder sentence-transformers/msmarco-distilbert-base-v3
  ```

- **Run evaluation**

  ```bash
  python -m src.evaluation.ir_tools evaluate \
      --qrels academic_corpus/qrels.txt \
      --run runs/bm25_run.txt \
      --output evaluation_results.json
  ```
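Once an index is built, you can also query it interactively from Python. The snippet below uses Pyserini's `LuceneSearcher` against a sparse index, assuming the toolkit writes a standard Lucene index that Pyserini can read; the index path and query are placeholders matching the commands above:

```python
# Interactive sanity check on a sparse index, assuming a standard
# Lucene index layout that Pyserini can open directly.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("index")  # path from the indexing step above
hits = searcher.search("faculty research on information retrieval", k=5)

for rank, hit in enumerate(hits, start=1):
    print(f"{rank:2d} {hit.docid:20s} {hit.score:.4f}")
```

Pyserini also provides a FAISS-based searcher for dense indexes; see its documentation for the analogous setup.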
## Directory Structure

```
academic-ir-pipeline/
├── src/                  # Source code
│   ├── scraping/         # Web scraping tools
│   ├── corpus/           # Corpus building
│   ├── indexing/         # Index creation
│   ├── evaluation/       # Evaluation tools
│   └── topics/           # Topic handling
├── tests/                # Test suite
├── data/                 # Data directory
├── docs/                 # Documentation
└── examples/             # Example notebooks
```
## Data and Usage Guide

This repository includes two key supporting guides to help you work with the pipeline:

- 📦 **Data**: Explains how the `data/` directory is organized, including scraped inputs, processed corpus files, and prebuilt indexes.
- 🛠️ **Usage Guide**: Explains how to use the code in this repository for scraping, corpus creation, indexing, topic integration, evaluation, and troubleshooting. Follow this guide if you're running the pipeline end-to-end or exploring individual components.
## Running Tests

- **Install test dependencies**

  You can either install them directly:

  ```bash
  pip install pytest pytest-cov pytest-asyncio pytest-timeout
  ```

  or install all dev dependencies at once:

  ```bash
  pip install -r requirements-dev.txt
  ```

- **Run the tests**

  ```bash
  # Run all tests
  pytest

  # Run with coverage report
  pytest --cov=src tests/

  # Run specific test file
  pytest tests/test_corpus_builder.py

  # Run excluding slow tests
  pytest -m "not slow"
  ```

## Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Run tests
- Submit a pull request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgements

- Pyserini for IR tools
- BEIR for datasets and evaluation methodology
- Sentence Transformers for dense retrieval
## Contact

For questions and support, email shubham.chatterjee@mst.edu.