The open-source micro-evaluation tool for Legal AI systems.
Built for evaluating AI agents, RAG pipelines, and LLMs on legal documents.
Metrics · Quickstart · How It Works · Agent Interface · Dataset Schema · Output
LexEval is a lightweight, open-source micro-eval tool purpose-built for Legal AI systems. It gives developers a fast, structured way to measure how accurately an AI agent answers questions about legal documents — contracts, NDAs, service agreements, and more.
Think of it like pytest for your legal AI: plug in your agent, point it at a dataset, and get objective, multi-dimensional scores for every response.
pip install -r requirements.txt
python cli.py agents/my_agent.py datasets/service_agreement_dataset.json
⚠️ LexEval uses an LLM-as-a-judge approach. Set your `OPENAI_API_KEY` before running.
LexEval evaluates every response across 5 independent metrics, each implemented as an async LLM evaluator:
| Metric | What It Checks | Scoring |
|---|---|---|
| Answer Correctness | Does the predicted answer match the expected legal fact? | 1.0 – 5.0 |
| Hallucination Detection | Did the agent invent facts not present in the document? | 1.0 (hallucinated) – 5.0 (clean) |
| Entity Accuracy | Are named legal parties (companies, individuals) correctly identified? | 1.0 – 5.0 |
| Date Accuracy | Are extracted dates chronologically correct across formats? | 1.0 – 5.0 |
| Refusal Correctness | Does the agent properly refuse when information is absent? | 1.0 – 5.0 |
Each metric returns a structured JSON verdict:
{
  "score": 5.0,
  "verdict": "Correct",
  "reasoning": "The model accurately identifies the governing law as India, per Clause 10."
}

Scores are averaged per sample and across all samples into an overall score.
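The aggregation itself is plain averaging. A minimal sketch (illustrative names only, not LexEval's actual internals):

```python
def aggregate(sample_metrics: list[dict]) -> float:
    """Average metric scores within each sample, then across samples.

    Each entry maps metric names to verdicts like {"score": 5.0, ...}.
    Illustrative sketch only -- not LexEval's source.
    """
    per_sample = [
        sum(m["score"] for m in metrics.values()) / len(metrics)
        for metrics in sample_metrics
    ]
    return round(sum(per_sample) / len(per_sample), 2)

overall = aggregate([
    {"answer_correctness": {"score": 5.0}, "hallucination": {"score": 4.0}},
    {"answer_correctness": {"score": 3.0}, "hallucination": {"score": 4.0}},
])
print(overall)  # 4.0
```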
pip install -r requirements.txt

Create a .env file (copy from .env.example):
OPENAI_API_KEY=sk-...
# Optional: Override evaluator model or endpoint
LLM_MODEL=gpt-4o-mini
LLM_API_URL=https://api.openai.com/v1/chat/completions

Place your contract PDF or TXT in the documents/ folder:
documents/
└── service_agreement.pdf
Define your evaluation questions in datasets/:
{
  "document_file": "service_agreement.pdf",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q2",
      "question": "What arbitration rules apply?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}

Create an agent file that wraps your AI system (see Agent Interface):
# agents/my_agent.py
def run_agent(document, question):
    answer = my_legal_ai_system(document, question)
    return {"answer": answer}

Run the evaluation:

python cli.py agents/my_agent.py datasets/service_agreement_dataset.json

Results are saved automatically to results/.
LexEval follows a simple, deterministic pipeline:
Dataset (JSON)
│
▼
Document Loader ← PDF or TXT extraction via pdfplumber
│
▼
Agent Runner ← Your run_agent(document, question) function
│
▼
Metric Evaluators ← 5 async LLM-as-a-judge metrics run in parallel
│
▼
Results JSON ← Per-sample scores + overall score
- Dataset is loaded — document text is extracted from PDF/TXT, or provided inline.
- Agent is dynamically imported — your `run_agent` function is loaded at runtime via `importlib`.
- Each question is sent to your agent — the agent returns a predicted answer.
- All 5 metrics evaluate the response — they run concurrently via `asyncio.gather`.
- Scores are aggregated — per-sample averages and an overall score are computed.
- Results are saved to `results/results-<agent>-<dataset>-<timestamp>.json`.
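The dynamic-import and fan-out steps above can be sketched like this (a simplified illustration under stated assumptions; the function names are mine, not LexEval's source):

```python
import asyncio
import importlib.util


def load_agent(path: str):
    """Import an agent file at runtime and return its run_agent function."""
    spec = importlib.util.spec_from_file_location("user_agent", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run_agent


async def evaluate_sample(document: str, question: str, answer: str, metrics: list):
    """Fan one predicted answer out to all metric evaluators concurrently."""
    return await asyncio.gather(
        *(metric(document, question, answer) for metric in metrics)
    )
```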
LexEval/
│
├── cli.py # CLI entry point
├── evaluator.py # Async evaluation pipeline
│
├── core/
│ ├── agent_loader.py # Dynamically loads your agent file
│ └── dataset_loader.py # Loads JSON datasets + extracts document text
│
├── metrics/
│ ├── answer_correctness.py # Checks factual correctness of the answer
│ ├── hallucination.py # Detects hallucinated content
│ ├── entity_accuracy.py # Validates legal entity identification
│ ├── date_accuracy.py # Compares dates across formats
│ ├── refusal_correctness.py # Validates NOT_FOUND handling
│ ├── llm_client.py # Shared OpenAI-compatible LLM client
│ └── utils/
│ └── normalize.py # Date parsing + text normalization
│
├── agents/
│ └── my_agent.py # Example agent (edit or replace)
│
├── datasets/ # Your evaluation datasets (JSON)
├── documents/ # Contract PDFs or TXT files
├── results/ # Evaluation outputs (auto-generated)
│
├── agent_interface.md # Agent contract specification
├── dataset_schema.md # Dataset format reference
├── .env.example # Environment variable template
└── requirements.txt
Your agent must expose a function named run_agent. LexEval will dynamically import and call it for each evaluation question.
def run_agent(document: str, question: str) -> dict | str:
    ...

| Parameter | Type | Description |
|---|---|---|
| `document` | `str` | Document text (full contract) or a document ID — depends on your dataset config |
| `question` | `str` | The evaluation question |
Return value — either a dict with an answer key, or a plain string:

# Option A — preferred
return {"answer": "AlphaTech Solutions Pvt. Ltd."}

# Option B — also accepted
return "AlphaTech Solutions Pvt. Ltd."

Example — a minimal local agent:

# agents/my_agent.py
def run_agent(document, question):
    # Call your AI system, RAG pipeline, or LLM API here
    answer = my_legal_ai(document, question)
    return {"answer": answer}

Example — an agent that talks to an external service over HTTP:

import requests

def run_agent(document_id, question):
    # 1. Upload the document
    with open(f"documents/{document_id}.pdf", "rb") as f:
        requests.post("http://localhost:8000/upload", files={"file": f},
                      headers={"x-document-id": document_id})
    # 2. Query the agent
    resp = requests.post("http://localhost:8000/ask", json={
        "question": question
    }, headers={"x-document-id": document_id})
    return {"answer": resp.json().get("answer", "")}

Whether `run_agent` receives the document text or a document ID depends on your dataset configuration:

- If `documentId` is set in the dataset → `document` is the document ID (use it to fetch your own file).
- If `document_text` or `document_file` is set without `documentId` → `document` is the full extracted text.
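That dispatch can be expressed in a few lines; here is a hedged sketch of the documented precedence (not LexEval's actual code; `extracted_text` stands in for whatever the document loader extracted):

```python
def resolve_document(dataset: dict, extracted_text: str = "") -> str:
    """Pick what gets passed as `document` to run_agent.

    Sketch of the documented precedence; not LexEval's actual code.
    """
    if dataset.get("documentId"):
        # Agent receives the ID and fetches the document itself.
        return dataset["documentId"]
    # Otherwise: inline text, or text extracted from document_file.
    return dataset.get("document_text") or extracted_text
```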
See Dataset Schema for details.
Datasets are JSON files stored in datasets/. Each dataset represents one document and a list of evaluation questions.
{
  "document_file": "service_agreement.pdf",
  "document_text": "",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q15",
      "question": "What arbitration rules apply under this agreement?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}

| Field | Type | Required | Description |
|---|---|---|---|
| `document_file` | `string` | One of these | Filename of a PDF or TXT in `documents/`. LexEval extracts text automatically. |
| `document_text` | `string` | One of these | Inline document text. Use instead of `document_file` for short contracts. |
| `documentId` | `string` | Optional | If set, this ID is passed directly to `run_agent` instead of the document text. |
| `samples` | `array` | ✅ | List of evaluation questions. |
| `samples[].id` | `string` | ✅ | Unique question identifier. |
| `samples[].question` | `string` | ✅ | The question to send to the agent. |
| `samples[].expected_answer` | `string` | ✅ | Correct answer. Use `"NOT_FOUND"` when the document does not contain the information. |
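The required fields lend themselves to a quick pre-flight check before a run. A hedged sketch of such a helper (something you could add yourself; it does not ship with LexEval):

```python
def validate_dataset(ds: dict) -> list[str]:
    """Return schema problems found in a dataset dict (empty list = looks valid)."""
    errors = []
    # A dataset needs at least one document source.
    if not any(ds.get(k) for k in ("document_file", "document_text", "documentId")):
        errors.append("set document_file, document_text, or documentId")
    samples = ds.get("samples")
    if not isinstance(samples, list) or not samples:
        return errors + ["samples must be a non-empty array"]
    seen = set()
    for i, s in enumerate(samples):
        for key in ("id", "question", "expected_answer"):
            if not s.get(key):
                errors.append(f"samples[{i}] is missing '{key}'")
        if s.get("id") in seen:
            errors.append(f"duplicate sample id: {s.get('id')}")
        seen.add(s.get("id"))
    return errors
```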
Option 1 — External file (PDF or TXT)
{
  "document_file": "contract.pdf"
}

LexEval looks for the file in documents/ and extracts text using pdfplumber.
Option 2 — Inline text
{
  "document_text": "This Service Agreement is entered into on March 1, 2026..."
}

Useful for testing with short or synthetic documents.
Option 3 — Document ID (for external systems)
{
  "documentId": "service_agreement"
}

The ID is passed directly to run_agent. Use this when your agent fetches the document itself (e.g., from a database or cloud storage).
Use "NOT_FOUND" as the expected answer when the document does not contain the information:
{
  "id": "q15",
  "question": "What arbitration rules apply under this agreement?",
  "expected_answer": "NOT_FOUND"
}

The Refusal Correctness metric rewards agents that correctly refuse to answer — and also rewards agents that find the information when the reference was incorrectly marked as NOT_FOUND.
Results are saved automatically to:
results/results-<agent>-<dataset>-<timestamp>.json
{
  "overall_score": 4.29,
  "results": [
    {
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd.",
      "predicted_answer": "The Service Provider is AlphaTech Solutions Pvt. Ltd.",
      "sample_score": 4.8,
      "metrics": {
        "answer_correctness": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "The model accurately identifies the Service Provider."
        },
        "hallucination": {
          "score": 5.0,
          "verdict": "Pass",
          "reasoning": "No hallucinated facts detected."
        },
        "entity_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "AlphaTech Solutions Pvt. Ltd. matches the reference entity."
        },
        "date_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "No date comparison required for this question."
        },
        "refusal_correctness": {
          "score": 4.0,
          "verdict": "Correct",
          "reasoning": "Answer was expected and correctly provided."
        }
      }
    }
  ]
}

| Score | Interpretation |
|---|---|
| 4.5 – 5.0 | Excellent — agent reliably extracts legal facts |
| 3.5 – 4.4 | Good — minor issues with formatting or partial answers |
| 2.5 – 3.4 | Fair — noticeable errors in specific metric categories |
| < 2.5 | Poor — significant hallucination, entity confusion, or refusal failures |
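If you run LexEval as a CI gate, the bands above translate directly into a threshold check. A small sketch (the thresholds come from the table; the gating policy itself is an assumption, not a LexEval feature):

```python
import json

# Thresholds copied from the interpretation table above.
BANDS = [(4.5, "Excellent"), (3.5, "Good"), (2.5, "Fair")]


def interpret(score: float) -> str:
    """Map an overall score onto its interpretation band."""
    for threshold, label in BANDS:
        if score >= threshold:
            return label
    return "Poor"


def gate(results_path: str, minimum: float = 3.5) -> bool:
    """Read a LexEval results file and decide whether a CI run should pass."""
    with open(results_path) as f:
        score = json.load(f)["overall_score"]
    print(f"overall {score:.2f} -> {interpret(score)}")
    return score >= minimum
```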
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | API key for the LLM evaluator |
| `LLM_MODEL` | `gpt-4o-mini` | Model used by all metric evaluators |
| `LLM_API_URL` | `https://api.openai.com/v1/chat/completions` | OpenAI-compatible endpoint |
LexEval supports any OpenAI-compatible API — swap in Mistral, Together AI, Groq, or a local ollama endpoint by setting LLM_API_URL and LLM_MODEL.
OPENAI_API_KEY=your-key
LLM_API_URL=http://localhost:11434/v1/chat/completions
LLM_MODEL=llama3

| Format | Support | Notes |
|---|---|---|
| `.pdf` | ✅ | Text extracted with pdfplumber |
| `.txt` | ✅ | Read directly as UTF-8 |
| Inline text | ✅ | Via `document_text` field in dataset |
| `.docx` | ❌ | Not yet supported |
| Scanned PDFs | ⚠️ | Only if a text layer is present |
LexEval is designed for teams building or evaluating:
- Legal RAG systems — Retrieval-augmented generation over contracts and agreements
- Contract QA agents — AI assistants that answer questions about specific clauses
- Legal LLM benchmarking — Compare model performance on structured legal extraction tasks
- Prompt engineering — Test whether prompt changes improve factual precision
- CI/CD evaluation pipelines — Automate regression testing for legal AI systems
Contributions are welcome! Here's how to get started:
1. Fork and clone the repository.

2. Create a feature branch:

   git checkout -b feature/your-feature-name

3. Follow module boundaries:
   - `core/` — document loading and agent loading utilities
   - `metrics/` — individual metric evaluators (one file per metric)
   - `agents/` — example and reference agent implementations
   - `datasets/` — evaluation datasets (JSON)
   - Keep secrets and endpoints in environment variables, never hardcoded.

4. Testing: there are currently no automated tests. If you introduce complex logic, please add `pytest` tests. At minimum, run the CLI against a local dataset to verify your changes work end-to-end.

5. Submit a pull request describing:
   - What you changed
   - Why it is needed
   - Any new environment variables or configuration introduced
Please coordinate with project maintainers for coding style expectations.
⭐ If LexEval is useful to your team, consider starring the repo!