The open-source micro-evaluation tool for Legal AI systems.
Built for evaluating AI agents, RAG pipelines, and LLMs on legal documents.
Metrics · Quickstart · How It Works · Agent Interface · Dataset Schema · Output
LexEval is a lightweight, open-source micro-eval tool purpose-built for Legal AI systems. It gives developers a fast, structured way to measure how accurately an AI agent answers questions about legal documents — contracts, NDAs, service agreements, and more.
Think of it like pytest for your legal AI: plug in your agent, point it at a dataset, and get objective, multi-dimensional scores for every response.
pip install -r requirements.txt
python cli.py agents/my_agent.py datasets/service_agreement_dataset.json
⚠️ LexEval uses an LLM-as-a-judge approach. Set your `OPENAI_API_KEY` before running.
LexEval evaluates every response across 5 independent metrics, each implemented as an async LLM evaluator:
| Metric | What It Checks | Scoring |
|---|---|---|
| Answer Correctness | Does the predicted answer match the expected legal fact? | 1.0 – 5.0 |
| Hallucination Detection | Did the agent invent facts not present in the document? | 1.0 (hallucinated) – 5.0 (clean) |
| Entity Accuracy | Are named legal parties (companies, individuals) correctly identified? | 1.0 – 5.0 |
| Date Accuracy | Are extracted dates chronologically correct across formats? | 1.0 – 5.0 |
| Refusal Correctness | Does the agent properly refuse when information is absent? | 1.0 – 5.0 |
Each metric returns a structured JSON verdict:
{
  "score": 5.0,
  "verdict": "Correct",
  "reasoning": "The model accurately identifies the governing law as India, per Clause 10."
}

Scores are averaged per sample and across all samples into an overall score.
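The aggregation itself is plain averaging. A minimal sketch (illustrative names only, not LexEval's actual internals):

```python
def aggregate(sample_metrics: list[dict]) -> float:
    """Average metric scores within each sample, then across samples.

    Each entry maps metric names to verdicts like {"score": 5.0, ...}.
    Illustrative sketch only -- not LexEval's source.
    """
    per_sample = [
        sum(m["score"] for m in metrics.values()) / len(metrics)
        for metrics in sample_metrics
    ]
    return round(sum(per_sample) / len(per_sample), 2)

overall = aggregate([
    {"answer_correctness": {"score": 5.0}, "hallucination": {"score": 4.0}},
    {"answer_correctness": {"score": 3.0}, "hallucination": {"score": 4.0}},
])
print(overall)  # 4.0
```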
pip install -r requirements.txt

Create a .env file (copy from .env.example):
OPENAI_API_KEY=sk-...
# Optional: Override evaluator model or endpoint
LLM_MODEL=gpt-4o-mini
LLM_API_URL=https://api.openai.com/v1/chat/completions

Place your contract PDF or TXT in the documents/ folder:
documents/
└── service_agreement.pdf
Define your evaluation questions in datasets/:
{
  "document_file": "service_agreement.pdf",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q2",
      "question": "What arbitration rules apply?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}

Create an agent file that wraps your AI system (see Agent Interface):
# agents/my_agent.py
def run_agent(document, question):
    answer = my_legal_ai_system(document, question)
    return {"answer": answer}

Run the evaluation:

python cli.py agents/my_agent.py datasets/service_agreement_dataset.json

Results are saved automatically to results/.
LexEval follows a simple, deterministic pipeline:
Dataset (JSON)
│
▼
Document Loader ← PDF or TXT extraction via pdfplumber
│
▼
Agent Runner ← Your run_agent(document, question) function
│
▼
Metric Evaluators ← 5 async LLM-as-a-judge metrics run in parallel
│
▼
Results JSON ← Per-sample scores + overall score
- Dataset is loaded — document text is extracted from PDF/TXT, or provided inline.
- Agent is dynamically imported — your `run_agent` function is loaded at runtime via `importlib`.
- Each question is sent to your agent — the agent returns a predicted answer.
- All 5 metrics evaluate the response — they run concurrently via `asyncio.gather`.
- Scores are aggregated — per-sample averages and an overall score are computed.
- Results are saved to `results/results-<agent>-<dataset>-<timestamp>.json`.
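The dynamic-import and fan-out steps above can be sketched like this (a simplified illustration under stated assumptions; the function names are mine, not LexEval's source):

```python
import asyncio
import importlib.util


def load_agent(path: str):
    """Import an agent file at runtime and return its run_agent function."""
    spec = importlib.util.spec_from_file_location("user_agent", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run_agent


async def evaluate_sample(document: str, question: str, answer: str, metrics: list):
    """Fan one predicted answer out to all metric evaluators concurrently."""
    return await asyncio.gather(
        *(metric(document, question, answer) for metric in metrics)
    )
```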
LexEval/
│
├── cli.py # CLI entry point
├── evaluator.py # Async evaluation pipeline
│
├── core/
│ ├── agent_loader.py # Dynamically loads your agent file
│ └── dataset_loader.py # Loads JSON datasets + extracts document text
│
├── metrics/
│ ├── answer_correctness.py # Checks factual correctness of the answer
│ ├── hallucination.py # Detects hallucinated content
│ ├── entity_accuracy.py # Validates legal entity identification
│ ├── date_accuracy.py # Compares dates across formats
│ ├── refusal_correctness.py # Validates NOT_FOUND handling
│ ├── llm_client.py # Shared OpenAI-compatible LLM client
│ └── utils/
│ └── normalize.py # Date parsing + text normalization
│
├── agents/
│ └── my_agent.py # Example agent (edit or replace)
│
├── datasets/ # Your evaluation datasets (JSON)
├── documents/ # Contract PDFs or TXT files
├── results/ # Evaluation outputs (auto-generated)
│
├── agent_interface.md # Agent contract specification
├── dataset_schema.md # Dataset format reference
├── .env.example # Environment variable template
└── requirements.txt
Your agent must expose a function named run_agent. LexEval will dynamically import and call it for each evaluation question.
def run_agent(document: str, question: str) -> dict | str:
    ...

| Parameter | Type | Description |
|---|---|---|
| `document` | `str` | Document text (full contract) or a document ID — depends on your dataset config |
| `question` | `str` | The evaluation question |
Return value — either a dict with an answer key, or a plain string:

# Option A — preferred
return {"answer": "AlphaTech Solutions Pvt. Ltd."}

# Option B — also accepted
return "AlphaTech Solutions Pvt. Ltd."

Example — a minimal local agent:

# agents/my_agent.py
def run_agent(document, question):
    # Call your AI system, RAG pipeline, or LLM API here
    answer = my_legal_ai(document, question)
    return {"answer": answer}

Example — an agent that talks to an external service over HTTP:

import requests

def run_agent(document_id, question):
    # 1. Upload the document
    with open(f"documents/{document_id}.pdf", "rb") as f:
        requests.post("http://localhost:8000/upload", files={"file": f},
                      headers={"x-document-id": document_id})
    # 2. Query the agent
    resp = requests.post("http://localhost:8000/ask", json={
        "question": question
    }, headers={"x-document-id": document_id})
    return {"answer": resp.json().get("answer", "")}

Whether `run_agent` receives the document text or a document ID depends on your dataset configuration:

- If `documentId` is set in the dataset → `document` is the document ID (use it to fetch your own file).
- If `document_text` or `document_file` is set without `documentId` → `document` is the full extracted text.
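That dispatch can be expressed in a few lines; here is a hedged sketch of the documented precedence (not LexEval's actual code; `extracted_text` stands in for whatever the document loader extracted):

```python
def resolve_document(dataset: dict, extracted_text: str = "") -> str:
    """Pick what gets passed as `document` to run_agent.

    Sketch of the documented precedence; not LexEval's actual code.
    """
    if dataset.get("documentId"):
        # Agent receives the ID and fetches the document itself.
        return dataset["documentId"]
    # Otherwise: inline text, or text extracted from document_file.
    return dataset.get("document_text") or extracted_text
```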
See Dataset Schema for details.
Datasets are JSON files stored in datasets/. Each dataset represents one document and a list of evaluation questions.
{
  "document_file": "service_agreement.pdf",
  "document_text": "",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q15",
      "question": "What arbitration rules apply under this agreement?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}

| Field | Type | Required | Description |
|---|---|---|---|
| `document_file` | `string` | One of these | Filename of a PDF or TXT in `documents/`. LexEval extracts text automatically. |
| `document_text` | `string` | One of these | Inline document text. Use instead of `document_file` for short contracts. |
| `documentId` | `string` | Optional | If set, this ID is passed directly to `run_agent` instead of the document text. |
| `samples` | `array` | ✅ | List of evaluation questions. |
| `samples[].id` | `string` | ✅ | Unique question identifier. |
| `samples[].question` | `string` | ✅ | The question to send to the agent. |
| `samples[].expected_answer` | `string` | ✅ | Correct answer. Use `"NOT_FOUND"` when the document does not contain the information. |
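The required fields lend themselves to a quick pre-flight check before a run. A hedged sketch of such a helper (something you could add yourself; it does not ship with LexEval):

```python
def validate_dataset(ds: dict) -> list[str]:
    """Return schema problems found in a dataset dict (empty list = looks valid)."""
    errors = []
    # A dataset needs at least one document source.
    if not any(ds.get(k) for k in ("document_file", "document_text", "documentId")):
        errors.append("set document_file, document_text, or documentId")
    samples = ds.get("samples")
    if not isinstance(samples, list) or not samples:
        return errors + ["samples must be a non-empty array"]
    seen = set()
    for i, s in enumerate(samples):
        for key in ("id", "question", "expected_answer"):
            if not s.get(key):
                errors.append(f"samples[{i}] is missing '{key}'")
        if s.get("id") in seen:
            errors.append(f"duplicate sample id: {s.get('id')}")
        seen.add(s.get("id"))
    return errors
```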
Option 1 — External file (PDF or TXT)
{
  "document_file": "contract.pdf"
}

LexEval looks for the file in documents/ and extracts text using pdfplumber.
Option 2 — Inline text
{
  "document_text": "This Service Agreement is entered into on March 1, 2026..."
}

Useful for testing with short or synthetic documents.
Option 3 — Document ID (for external systems)
{
  "documentId": "service_agreement"
}

The ID is passed directly to run_agent. Use this when your agent fetches the document itself (e.g., from a database or cloud storage).
Use "NOT_FOUND" as the expected answer when the document does not contain the information:
{
  "id": "q15",
  "question": "What arbitration rules apply under this agreement?",
  "expected_answer": "NOT_FOUND"
}

The Refusal Correctness metric rewards agents that correctly refuse to answer — and also rewards agents that find the information when the reference was incorrectly marked as NOT_FOUND.
Results are saved automatically to:
results/results-<agent>-<dataset>-<timestamp>.json
{
  "overall_score": 4.29,
  "results": [
    {
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd.",
      "predicted_answer": "The Service Provider is AlphaTech Solutions Pvt. Ltd.",
      "sample_score": 4.8,
      "metrics": {
        "answer_correctness": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "The model accurately identifies the Service Provider."
        },
        "hallucination": {
          "score": 5.0,
          "verdict": "Pass",
          "reasoning": "No hallucinated facts detected."
        },
        "entity_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "AlphaTech Solutions Pvt. Ltd. matches the reference entity."
        },
        "date_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "No date comparison required for this question."
        },
        "refusal_correctness": {
          "score": 4.0,
          "verdict": "Correct",
          "reasoning": "Answer was expected and correctly provided."
        }
      }
    }
  ]
}

| Score | Interpretation |
|---|---|
| 4.5 – 5.0 | Excellent — agent reliably extracts legal facts |
| 3.5 – 4.4 | Good — minor issues with formatting or partial answers |
| 2.5 – 3.4 | Fair — noticeable errors in specific metric categories |
| < 2.5 | Poor — significant hallucination, entity confusion, or refusal failures |
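If you run LexEval as a CI gate, the bands above translate directly into a threshold check. A small sketch (the thresholds come from the table; the gating policy itself is an assumption, not a LexEval feature):

```python
import json

# Thresholds copied from the interpretation table above.
BANDS = [(4.5, "Excellent"), (3.5, "Good"), (2.5, "Fair")]


def interpret(score: float) -> str:
    """Map an overall score onto its interpretation band."""
    for threshold, label in BANDS:
        if score >= threshold:
            return label
    return "Poor"


def gate(results_path: str, minimum: float = 3.5) -> bool:
    """Read a LexEval results file and decide whether a CI run should pass."""
    with open(results_path) as f:
        score = json.load(f)["overall_score"]
    print(f"overall {score:.2f} -> {interpret(score)}")
    return score >= minimum
```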
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | API key for the LLM evaluator |
| `LLM_MODEL` | `gpt-4o-mini` | Model used by all metric evaluators |
| `LLM_API_URL` | `https://api.openai.com/v1/chat/completions` | OpenAI-compatible endpoint |
LexEval supports any OpenAI-compatible API — swap in Mistral, Together AI, Groq, or a local ollama endpoint by setting LLM_API_URL and LLM_MODEL.
OPENAI_API_KEY=your-key
LLM_API_URL=http://localhost:11434/v1/chat/completions
LLM_MODEL=llama3

| Format | Support | Notes |
|---|---|---|
| `.pdf` | ✅ | Text extracted with pdfplumber |
| `.txt` | ✅ | Read directly as UTF-8 |
| Inline text | ✅ | Via `document_text` field in dataset |
| `.docx` | ❌ | Not yet supported |
| Scanned PDFs | ⚠️ | Only if a text layer is present |
LexEval is designed for teams building or evaluating:
- Legal RAG systems — Retrieval-augmented generation over contracts and agreements
- Contract QA agents — AI assistants that answer questions about specific clauses
- Legal LLM benchmarking — Compare model performance on structured legal extraction tasks
- Prompt engineering — Test whether prompt changes improve factual precision
- CI/CD evaluation pipelines — Automate regression testing for legal AI systems
Contributions are welcome! Here's how to get started:
1. Fork and clone the repository.

2. Create a feature branch:

   git checkout -b feature/your-feature-name

3. Follow module boundaries:
   - `core/` — document loading and agent loading utilities
   - `metrics/` — individual metric evaluators (one file per metric)
   - `agents/` — example and reference agent implementations
   - `datasets/` — evaluation datasets (JSON)
   - Keep secrets and endpoints in environment variables, never hardcoded.

4. Testing: there are currently no automated tests. If you introduce complex logic, please add `pytest` tests. At minimum, run the CLI against a local dataset to verify your changes work end-to-end.

5. Submit a pull request describing:
   - What you changed
   - Why it is needed
   - Any new environment variables or configuration introduced
Please coordinate with project maintainers for coding style expectations.
⭐ If LexEval is useful to your team, consider starring the repo!