
EntRAG Benchmark

EntRAG is a benchmark for evaluating the performance of Retrieval-Augmented Generation (RAG) systems in an enterprise context. It is designed to provide a comprehensive evaluation of RAG systems on a heterogeneous corpus of documents across various enterprise-like domains, and it includes a QA dataset of 100 handcrafted questions used for the evaluation.

Installation

The project uses uv for project management. Make sure you have it installed on your local development machine.

Poe Tasks

The project uses Poe for task management. Tasks are defined in the pyproject.toml file and can be run with the uv run poe <task-name> command.

Available Poe Tasks

  • check: Check the codebase for linting and formatting errors with Ruff
  • style: Apply style fixes to the codebase with Ruff
  • style-all: Format and check the entire codebase for linting and formatting errors
  • quality: Check the quality of the entire codebase

Quickstart

  1. Clone the repository and navigate to the root directory:
git clone https://github.com/fkapsahili/EntRAG.git
cd EntRAG
  2. Install the dependencies using uv:
uv sync --all-groups
  3. Get an API key for OpenAI or Gemini and add it to the .env file:
touch .env
echo "OPENAI_API_KEY=<your_openai_api_key>" >> .env 
echo "GEMINI_API_KEY=<your_gemini_api_key>" >> .env
  4. Navigate to the mock API package:
cd packages/entrag-mock-api

# Start the mock API server
uv run poe start

# The server will run on http://localhost:8000
# Keep this terminal open while running the benchmark
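# (a Python snippet after this list shows how to verify the server is reachable)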
  5. Run the benchmark with the default config in a new terminal:
# Navigate to the benchmark package
cd packages/entrag

# Run the pipeline
poe entrag --config example/configs/default.yaml 
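
Before running the benchmark, you can optionally confirm that the mock API from step 4 is up. The following is a minimal reachability check, assuming only that the server listens on http://localhost:8000 as noted above; it does not depend on any specific route:

# check_mock_api.py - treat any HTTP response (even a 404) as "server is up"
import urllib.error
import urllib.request

try:
    urllib.request.urlopen("http://localhost:8000", timeout=5)
    print("Mock API is reachable.")
except urllib.error.HTTPError:
    # The server answered with an HTTP error code, so it is running.
    print("Mock API is reachable.")
except OSError as exc:
    # Connection refused or timed out: the server is not running.
    print(f"Mock API is not reachable: {exc}")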

Data Processing

Using Pre-processed Data

If you have access to the pre-processed markdown data, place it in the data/entrag_processed/ directory (or any directory specified as chunking.files_directory in your config file), and the benchmark will use it directly.

Preprocessing Raw Documents

If you have raw documents that need to be processed, use the standalone processing script before running the benchmark. The script processes the documents and saves them to the data/entrag_processed/ directory.

cd packages/entrag
uv run poe create-markdown-documents.py \
    --input-dir <path_to_raw_documents> \
    --output-dir data/entrag_processed/ \
    --workers <number_of_workers> \
    --batch-size <batch_size>

Note: The processing script uses Docling for Markdown conversion and can require significant computational resources when running multiple workers. Ensure you have enough memory and CPU available.

Configuration

Using Existing Configurations

The benchmark includes a default configuration located at packages/entrag/evaluation_configs/default/config.yaml. You can run the benchmark with this configuration:

uv run poe entrag --config default

The results will then be saved in the evaluation_results/ directory.

Creating Custom Configurations

To create a custom configuration for your specific evaluation needs:

  1. Create a new configuration directory:
mkdir evaluation_configs/my-custom-config
  2. Create the configuration file:
touch evaluation_configs/my-custom-config/config.yaml
  3. Configure your evaluation settings by editing the config.yaml file. Here's a template with explanations:
config_name: my-custom-config
tasks:
  question_answering:
    run: true                           # Enable/disable QA evaluation
    hf_dataset_id: fkapsahili/EntRAG    # HuggingFace dataset ID
    dataset_path: null                  # Alternative: local dataset path
    split: train                        # Dataset split to use

chunking:
  enabled: true                          # Enable document chunking
  files_directory: data/entrag_processed # Input directory for processed docs
  output_directory: data/entrag_chunked  # Output directory for chunks
  dataset_name: entrag                   # Dataset identifier
  max_tokens: 2048                       # Maximum tokens per chunk

embedding:
  enabled: true                         # Enable embedding generation
  model: text-embedding-3-small         # Embedding model to use
  batch_size: 8                         # Batch size for embedding generation
  output_directory: data/embeddings     # Output directory for embeddings

model_evaluation:
  max_workers: 10                       # Parallel workers for LLM inference
  output_directory: evaluation_results  # Results output directory
  retrieval_top_k: 5                    # Number of documents to retrieve
  model_provider: openai                # AI provider: "openai" or "gemini"
  model_name: gpt-4o-mini               # Model name for evaluation
  reranking_model_name: gpt-4o-mini     # Model for reranking (if applicable)
  4. Run your custom configuration:
uv run poe entrag --config my-custom-config

Configuration Options

Task Configuration:

  • question_answering.run: Whether to run QA evaluation
  • hf_dataset_id: HuggingFace dataset containing evaluation questions
  • split: Which dataset split to use for evaluation

Chunking Configuration:

  • files_directory: Directory containing processed markdown documents
  • max_tokens: Maximum size for document chunks
  • dataset_name: Identifier for the dataset being processed

Model Configuration:

  • model_provider: Choose between openai or gemini
  • model_name: Specific model to use for evaluation
  • retrieval_top_k: Number of relevant documents to retrieve for each question

Available Models:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4.1-nano, gpt-4.1-mini, gpt-4.1
  • Gemini: gemini-2.0-pro, gemini-2.0-flash

Extending the Benchmark

Custom AI Providers

You can add support for new AI providers (Claude, Ollama, local models, etc.) by implementing the BaseAIEngine interface; a minimal sketch follows the steps below:

  1. Create your custom AI engine by extending BaseAIEngine
  2. Implement the chat_completion method with your provider's API logic
  3. Add your provider to the create_ai_engine function in main.py
  4. Update your configuration to use the new provider
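
Here is a minimal sketch of such an engine. The names BaseAIEngine and chat_completion come from the steps above; the import path, constructor arguments, message format, and return type shown here are assumptions and should be matched to the actual interface in the entrag package:

# Hypothetical custom engine; adapt the details to the real BaseAIEngine.
from entrag.ai_engine import BaseAIEngine  # import path is an assumption


class MyProviderEngine(BaseAIEngine):
    """AI engine backed by a hypothetical external provider."""

    def __init__(self, model_name: str, api_key: str) -> None:
        self.model_name = model_name
        self.api_key = api_key

    def chat_completion(self, messages: list[dict]) -> str:
        # Forward `messages` to your provider's API and return the
        # generated text, e.g. via an HTTP client of your choice.
        raise NotImplementedError("Wire up your provider's API call here")

Once the engine is registered in the create_ai_engine function in main.py, the provider can be selected from the configuration, as shown below.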

Example configuration:

model_evaluation:
  model_provider: custom_provider
  model_name: your-custom-model

Custom RAG Implementations

You can implement novel RAG approaches by extending the RAGLM base class; a minimal sketch follows the steps below:

  1. Create your custom RAG class by extending RAGLM
  2. Implement the required methods:
    • build_store(): Build your retrieval index/database
    • retrieve(): Implement your retrieval logic
    • generate(): Define your response generation strategy
  3. Add your model to the evaluation pipeline in main.py
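
A minimal sketch of such a class is shown below. The class name RAGLM and the three required methods come from the steps above; the import path and the exact method signatures are assumptions and should be matched to the base class definition in the entrag package:

# Hypothetical custom RAG model; adapt the signatures to the real RAGLM.
from entrag.rag import RAGLM  # import path is an assumption


class HierarchicalRAG(RAGLM):
    """Sketch of a multi-level document organization and retrieval strategy."""

    def build_store(self, chunks: list[str]) -> None:
        # Build the retrieval index/database, e.g. embed the chunks
        # and store them in a vector index.
        self.index = None

    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        # Return the top_k chunks most relevant to the query.
        return []

    def generate(self, query: str, context: list[str]) -> str:
        # Combine the query with the retrieved context and call the
        # configured AI engine to produce the final answer.
        return ""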

Example Extensions

Popular extensions you might implement:

  • Graph-based RAG: Using knowledge graphs for retrieval expansion
  • Agentic RAG: RAG systems that can use tools and APIs
  • Hierarchical RAG: Multi-level document organization and retrieval

The benchmark will automatically evaluate your custom implementations alongside the built-in models, providing standardized metrics for comparison.

Getting the EntRAG Document Dataset

To access the complete EntRAG document dataset for evaluation:

  1. Contact for dataset access: Send an email requesting access to the EntRAG benchmark dataset
  2. Include in your request:
  • Your research affiliation
  • Intended use case for the benchmark
  • Whether you need raw documents or pre-processed markdown files
  3. Available data formats:
  • Pre-processed markdown: Ready-to-use documents for immediate benchmarking
  • Raw PDF documents: Original source files if you want to test custom preprocessing
  • Mock API datasets: Data files required for the mock API backend to simulate dynamic API queries

The QA dataset is available on HuggingFace at fkapsahili/EntRAG and will be automatically downloaded during benchmark execution.
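
If you want to inspect the questions outside the pipeline, the QA dataset can also be loaded directly with the HuggingFace datasets library. A minimal sketch; the dataset ID and split match the defaults in the configuration template above, while the field names depend on the dataset itself:

# Requires the `datasets` package (pip install datasets).
from datasets import load_dataset

dataset = load_dataset("fkapsahili/EntRAG", split="train")
print(len(dataset))  # number of QA examples
print(dataset[0])    # first example; field names depend on the dataset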

Packages

The project is organized as a monorepo structured into several Python packages.

  • entrag: The main benchmark package, containing the pipeline and evaluation code.
  • entrag-mock-api: A mock API for the benchmark, used to evaluate dynamic questions that require a call to an external API.