RAGKit

RAGKit is a comprehensive toolkit for building and evaluating information retrieval systems with Retrieval-Augmented Generation (RAG) capabilities. It enables you to scrape domain-specific content, create TREC-style reusable test collections, and build sparse, dense, or hybrid indexes for LLM-powered retrieval and evaluation.

Missouri S&T RAG Tutorial Materials

🚀 Run the Tutorial in Colab

Open In Colab

📑 See the tutorial slides

Download Slides

📑 See my lecture slides from CS 5001 (Information Retrieval) taught at Missouri S&T in Spring 2025

View Slides

📚 Topics Covered in the Tutorial

  • Introduction to RAG

    • What is Retrieval-Augmented Generation?
    • Why LLMs hallucinate and how RAG mitigates it
  • Core RAG Workflow

    • Retrieval → Reranking → Generation pipeline
    • Bi-encoder and cross-encoder strategies (see the code sketch after this list)
  • Hands-on Setup

    • Using Mistral API and Colab
    • Required libraries and computing resources
  • Customizing RAG for Domains

    • Domain-specific data, prompts, and evaluation
    • Applications in research, tech support, education, healthcare, and campus info
  • Prompt Engineering

    • Designing domain-appropriate prompts
    • Role definition, format control, reasoning chains, citation styles
  • Evaluation Techniques

    • Scientific, technical, educational, and healthcare evaluation criteria
    • IR metrics: nDCG, Recall, Precision@k, Reciprocal Rank
  • Advanced Architectures

    • Two-stage retrieval and reranking
    • Multi-hop RAG, HyDE, Knowledge Graph RAG, Self-RAG
  • Debugging and Productionization

    • Common RAG issues and how to fix them
    • Scaling to production: monitoring, routing, feedback loops
  • Tools and Resources

    • FAISS, sentence-transformers, ir_datasets, LangChain, LlamaIndex
    • Research papers and open datasets
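
As referenced in the Core RAG Workflow topic above, the retrieve → rerank → generate pipeline can be pictured with a short, self-contained sketch using sentence-transformers. The model names and toy passages below are illustrative, not part of this repository:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

passages = [
    "RAG grounds LLM answers in retrieved passages.",
    "BM25 is a classic sparse retrieval model.",
    "Cross-encoders rescore query-passage pairs jointly.",
]
query = "How does RAG reduce hallucination?"

# Stage 1: bi-encoder retrieval -- embed query and passages independently,
# then keep the top-k passages by cosine similarity.
bi_encoder = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v3")
passage_emb = bi_encoder.encode(passages, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]

# Stage 2: cross-encoder reranking -- score each (query, passage) pair jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
candidates = [passages[hit["corpus_id"]] for hit in hits]
scores = reranker.predict([(query, p) for p in candidates])

# Stage 3 (generation) would place the top-ranked passages into an LLM prompt.
for score, passage in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {passage}")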

📚 Table of Contents

  1. Features
  2. Requirements
  3. Installation
  4. Quick Start
  5. Directory Structure
  6. Data & Usage Documentation
  7. Running Tests
  8. Contributing
  9. License
  10. Acknowledgments
  11. Contact

✨ Features

  • Web Scraping: Robust academic website scraping with rate limiting and content extraction
  • Corpus Building: Create structured corpora using sliding-window passages (see the sketch after this list)
  • Multiple Index Types:
    • BM25 sparse index
    • Dense retrieval using transformer models
    • Hybrid retrieval combining both approaches
  • Topic Integration: Convert and validate topics, generate automatic qrels
  • Comprehensive Evaluation: Standard IR metrics with visualization
  • Test Collection Creation: Tools for creating reusable IR test collections
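
The sliding-window passaging used for corpus building can be pictured as follows; the window and stride sizes here are illustrative assumptions, not the builder's actual defaults:

def sliding_window_passages(text, window=100, stride=50):
    """Split text into overlapping passages of `window` tokens, advancing `stride` tokens per step."""
    tokens = text.split()
    return [
        " ".join(tokens[start:start + window])
        for start in range(0, max(len(tokens) - window, 0) + 1, stride)
    ]

# A 250-token document yields 4 overlapping passages (starting at tokens 0, 50, 100, 150).
doc = " ".join(f"tok{i}" for i in range(250))
print(len(sliding_window_passages(doc)))  # 4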

📋 Requirements

System Requirements

  • Python 3.8 or higher
  • Java 21 or higher (for Pyserini)
  • 8GB RAM minimum (16GB recommended for dense indexing)
  • CUDA-compatible GPU (optional, for faster dense indexing)

Python Dependencies

Main dependencies include:

  • pyserini
  • torch
  • sentence-transformers
  • faiss-cpu (or faiss-gpu)
  • numpy
  • pandas
  • spacy
  • beautifulsoup4
  • tqdm

🔧 Installation

  1. Clone the Repository

    git clone https://github.com/yourusername/academic-ir-pipeline
    cd academic-ir-pipeline
  2. Create a Virtual Environment (Recommended)

    # Using venv
    python -m venv venv
    # Activate the environment
    # On Windows:
    venv\Scripts\activate
    # On Unix or macOS:
    source venv/bin/activate
  3. Install Dependencies

    # Install base requirements
    pip install -r requirements.txt
    
    # Install development requirements (for testing)
    pip install -r requirements-dev.txt
    
    # Download spaCy model
    python -m spacy download en_core_web_sm
  4. Install Java (Required for Pyserini)

    On Ubuntu:

    sudo apt-get update
    sudo apt-get install openjdk-21-jdk

    On macOS:

    brew install openjdk@21

    On Windows:

    • Download and install OpenJDK 21 from Adoptium (formerly AdoptOpenJDK)
    • Add Java to your PATH environment variable
  5. Verify Installation

    # Run tests to verify installation
    pytest tests/
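
As a lighter-weight smoke test than the full suite, the imports below confirm the Python dependencies and the Pyserini/Java bridge are wired up (a minimal sketch, assuming the steps above completed):

# Importing Pyserini's Lucene searcher starts the JVM, so this fails fast
# if Java is missing or not on the PATH.
import torch
import sentence_transformers
from pyserini.search.lucene import LuceneSearcher

print("torch", torch.__version__)
print("sentence-transformers", sentence_transformers.__version__)
print("Pyserini Lucene bindings loaded OK")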

🚀 Quick Start

Here's a minimal example to get started:

  1. Scrape a website
python -m src.scraping.website_scraper \
    --url "https://example.edu/research" \
    --output data/scraped \
    --max-pages 100
  2. Build corpus
python -m src.corpus.builder \
    --input data/scraped/scraped_data.json \
    --output academic_corpus
  3. Create index
python -m src.indexing.create_index \
    --corpus academic_corpus \
    --output index \
    --dense \
    --encoder sentence-transformers/msmarco-distilbert-base-v3
  4. Run evaluation
python -m src.evaluation.ir_tools evaluate \
    --qrels academic_corpus/qrels.txt \
    --run runs/bm25_run.txt \
    --output evaluation_results.json
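
Once an index exists, it can also be queried directly from Python rather than through run files. A minimal sketch with Pyserini's LuceneSearcher, assuming a sparse BM25 index was built at the index path from the Quick Start:

from pyserini.search.lucene import LuceneSearcher

# Point the searcher at the Lucene index directory created by create_index.
searcher = LuceneSearcher("index")
hits = searcher.search("retrieval augmented generation", k=5)
for rank, hit in enumerate(hits, start=1):
    print(f"{rank:2d}. {hit.docid}  {hit.score:.3f}")

A hybrid run would combine these BM25 scores with dense-retrieval scores (e.g., min-max normalize each and interpolate) before ranking.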

📁 Directory Structure

academic-ir-pipeline/
├── src/                    # Source code
│   ├── scraping/          # Web scraping tools
│   ├── corpus/            # Corpus building
│   ├── indexing/          # Index creation
│   ├── evaluation/        # Evaluation tools
│   └── topics/            # Topic handling
├── tests/                 # Test suite
├── data/                  # Data directory
├── docs/                  # Documentation
└── examples/              # Example notebooks



📂 Data & Usage Documentation

This repository includes two key supporting guides to help you work with the pipeline:

  • 📦 Data:
    Explains how the data/ directory is organized, including scraped inputs, processed corpus files, and prebuilt indexes.

  • 🛠️ Usage Guide:
    Explains how to use the code in this repository for scraping, corpus creation, indexing, topic integration, evaluation, and troubleshooting. Follow this guide if you're running the pipeline end to end or exploring individual components.


🧪 Running Tests

  1. Install test dependencies

You can either install them directly:

pip install pytest pytest-cov pytest-asyncio pytest-timeout

Or install all dev dependencies at once:

pip install -r requirements-dev.txt
  2. Run the tests
# Run all tests
pytest

# Run with coverage report
pytest --cov=src tests/

# Run specific test file
pytest tests/test_corpus_builder.py

# Run excluding slow tests
pytest -m "not slow"

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

📧 Contact

For questions and support, email shubham.chatterjee@mst.edu.
