LexStack-AI/LexEval

⚖️ LexEval — Legal AI Micro-Eval Tool

The open-source micro-evaluation tool for Legal AI systems.
Built for evaluating AI agents, RAG pipelines, and LLMs on legal documents.

Metrics · Quickstart · How It Works · Agent Interface · Dataset Schema · Output

Python 3.9+ License Legal AI


LexEval is a lightweight, open-source micro-eval tool purpose-built for Legal AI systems. It gives developers a fast, structured way to measure how accurately an AI agent answers questions about legal documents — contracts, NDAs, service agreements, and more.

Think of it like pytest for your legal AI: plug in your agent, point it at a dataset, and get objective, multi-dimensional scores for every response.

pip install -r requirements.txt
python cli.py agents/my_agent.py datasets/service_agreement_dataset.json

⚠️ LexEval uses an LLM-as-a-judge approach. Set your OPENAI_API_KEY before running.


🔥 Metrics

LexEval evaluates every response across 5 independent metrics, each implemented as an async LLM evaluator:

| Metric | What It Checks | Scoring |
|---|---|---|
| Answer Correctness | Does the predicted answer match the expected legal fact? | 1.0 – 5.0 |
| Hallucination Detection | Did the agent invent facts not present in the document? | 1.0 (hallucinated) – 5.0 (clean) |
| Entity Accuracy | Are named legal parties (companies, individuals) correctly identified? | 1.0 – 5.0 |
| Date Accuracy | Are extracted dates chronologically correct across formats? | 1.0 – 5.0 |
| Refusal Correctness | Does the agent properly refuse when information is absent? | 1.0 – 5.0 |

Each metric returns a structured JSON verdict:

{
  "score": 5.0,
  "verdict": "Correct",
  "reasoning": "The model accurately identifies the governing law as India, per Clause 10."
}

Scores are averaged per sample and across all samples into an overall score.
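That averaging can be sketched as follows. This is an illustrative helper, not LexEval's actual internals; the `aggregate` name and input shape are assumptions.

```python
def aggregate(samples):
    """samples: list of dicts mapping metric name -> score (1.0-5.0).

    Returns (per-sample averages, overall average). Illustrative only.
    """
    per_sample = [sum(m.values()) / len(m) for m in samples]
    overall = sum(per_sample) / len(per_sample)
    return per_sample, overall

per_sample, overall = aggregate([
    {"answer_correctness": 5.0, "hallucination": 5.0, "entity_accuracy": 5.0,
     "date_accuracy": 5.0, "refusal_correctness": 4.0},
    {"answer_correctness": 4.0, "hallucination": 5.0, "entity_accuracy": 3.0,
     "date_accuracy": 5.0, "refusal_correctness": 5.0},
])
print(per_sample)           # [4.8, 4.4]
print(round(overall, 2))    # 4.6
```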


🚀 Quickstart

1. Install dependencies

pip install -r requirements.txt

2. Configure your LLM evaluator

Create a .env file (copy from .env.example):

OPENAI_API_KEY=sk-...

# Optional: Override evaluator model or endpoint
LLM_MODEL=gpt-4o-mini
LLM_API_URL=https://api.openai.com/v1/chat/completions

3. Add your document

Place your contract PDF or TXT in the documents/ folder:

documents/
└── service_agreement.pdf

4. Create a dataset

Define your evaluation questions in datasets/:

{
  "document_file": "service_agreement.pdf",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q2",
      "question": "What arbitration rules apply?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}

5. Connect your agent

Create an agent file that wraps your AI system (see Agent Interface):

# agents/my_agent.py

def run_agent(document, question):
    answer = my_legal_ai_system(document, question)
    return {"answer": answer}

6. Run the evaluation

python cli.py agents/my_agent.py datasets/service_agreement_dataset.json

Results are saved automatically to results/.


🔍 How It Works

LexEval follows a simple, deterministic pipeline:

Dataset (JSON)
    │
    ▼
Document Loader        ← PDF or TXT extraction via pdfplumber
    │
    ▼
Agent Runner           ← Your run_agent(document, question) function
    │
    ▼
Metric Evaluators      ← 5 async LLM-as-a-judge metrics run in parallel
    │
    ▼
Results JSON           ← Per-sample scores + overall score
  1. Dataset is loaded — document text is extracted from PDF/TXT, or provided inline.
  2. Agent is dynamically imported — your run_agent function is loaded at runtime via importlib.
  3. Each question is sent to your agent — the agent returns a predicted answer.
  4. All 5 metrics evaluate the response — they run concurrently via asyncio.gather.
  5. Scores are aggregated — per-sample averages and an overall score are computed.
  6. Results are saved to results/results-<agent>-<dataset>-<timestamp>.json.
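Step 4 can be sketched with `asyncio.gather`. The metric functions below are stand-ins (the real evaluators call an LLM judge, and their signatures may differ):

```python
import asyncio

# Stand-in metric evaluators; the real ones prompt an LLM judge.
async def answer_correctness(pred, expected):
    return {"score": 5.0, "verdict": "Correct"}

async def hallucination(pred, document):
    return {"score": 5.0, "verdict": "Pass"}

async def evaluate_sample(pred, expected, document):
    # Run all metrics concurrently, as in step 4 of the pipeline.
    results = await asyncio.gather(
        answer_correctness(pred, expected),
        hallucination(pred, document),
    )
    return dict(zip(["answer_correctness", "hallucination"], results))

verdicts = asyncio.run(
    evaluate_sample("India", "India", "...governing law: India...")
)
print(verdicts["answer_correctness"]["verdict"])  # Correct
```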

📁 Project Structure

LexEval/
│
├── cli.py                     # CLI entry point
├── evaluator.py               # Async evaluation pipeline
│
├── core/
│   ├── agent_loader.py        # Dynamically loads your agent file
│   └── dataset_loader.py      # Loads JSON datasets + extracts document text
│
├── metrics/
│   ├── answer_correctness.py  # Checks factual correctness of the answer
│   ├── hallucination.py       # Detects hallucinated content
│   ├── entity_accuracy.py     # Validates legal entity identification
│   ├── date_accuracy.py       # Compares dates across formats
│   ├── refusal_correctness.py # Validates NOT_FOUND handling
│   ├── llm_client.py          # Shared OpenAI-compatible LLM client
│   └── utils/
│       └── normalize.py       # Date parsing + text normalization
│
├── agents/
│   └── my_agent.py            # Example agent (edit or replace)
│
├── datasets/                  # Your evaluation datasets (JSON)
├── documents/                 # Contract PDFs or TXT files
├── results/                   # Evaluation outputs (auto-generated)
│
├── agent_interface.md         # Agent contract specification
├── dataset_schema.md          # Dataset format reference
├── .env.example               # Environment variable template
└── requirements.txt

🤖 Agent Interface

Your agent must expose a function named run_agent. LexEval will dynamically import and call it for each evaluation question.

Signature

def run_agent(document: str, question: str) -> dict | str:
    ...
| Parameter | Type | Description |
|---|---|---|
| document | str | Document text (full contract) or a document ID — depends on your dataset config |
| question | str | The evaluation question |

Return value — either a dict with an answer key, or a plain string:

# Option A — preferred
return {"answer": "AlphaTech Solutions Pvt. Ltd."}

# Option B — also accepted
return "AlphaTech Solutions Pvt. Ltd."
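Both shapes are accepted; a plausible normalization looks like this (illustrative, not the library's actual code):

```python
def normalize_answer(result):
    """Accept either a dict with an 'answer' key or a plain string."""
    if isinstance(result, dict):
        return str(result.get("answer", ""))
    return str(result)

print(normalize_answer({"answer": "AlphaTech Solutions Pvt. Ltd."}))
print(normalize_answer("AlphaTech Solutions Pvt. Ltd."))
```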

Minimal example

# agents/my_agent.py

def run_agent(document, question):
    # Call your AI system, RAG pipeline, or LLM API here
    answer = my_legal_ai(document, question)
    return {"answer": answer}

Example: Agent calling an external API

import requests

def run_agent(document_id, question):
    # 1. Upload the document
    with open(f"documents/{document_id}.pdf", "rb") as f:
        requests.post("http://localhost:8000/upload", files={"file": f},
                      headers={"x-document-id": document_id})

    # 2. Query the agent
    resp = requests.post("http://localhost:8000/ask", json={
        "question": question
    }, headers={"x-document-id": document_id})

    return {"answer": resp.json().get("answer", "")}

When does document contain the ID vs. full text?

LexEval decides this based on your dataset configuration:

  • If documentId is set in the dataset → document is the document ID (use it to fetch your own file).
  • If document_text or document_file is set without documentId → document is the full extracted text.

See Dataset Schema for details.


📋 Dataset Schema

Datasets are JSON files stored in datasets/. Each dataset represents one document and a list of evaluation questions.

Full schema

{
  "document_file": "service_agreement.pdf",
  "document_text": "",
  "documentId": "service_agreement",
  "samples": [
    {
      "id": "q1",
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd."
    },
    {
      "id": "q15",
      "question": "What arbitration rules apply under this agreement?",
      "expected_answer": "NOT_FOUND"
    }
  ]
}

Field reference

| Field | Type | Required | Description |
|---|---|---|---|
| document_file | string | One of these | Filename of a PDF or TXT in documents/. LexEval extracts text automatically. |
| document_text | string | One of these | Inline document text. Use instead of document_file for short contracts. |
| documentId | string | Optional | If set, this ID is passed directly to run_agent instead of the document text. |
| samples | array | Yes | List of evaluation questions. |
| samples[].id | string | Yes | Unique question identifier. |
| samples[].question | string | Yes | The question to send to the agent. |
| samples[].expected_answer | string | Yes | Correct answer. Use "NOT_FOUND" when the document does not contain the information. |
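A lightweight sanity check against this schema can catch mistakes before a run. This validator is illustrative; LexEval may validate datasets differently:

```python
def validate_dataset(ds):
    """Minimal checks per the field reference above (illustrative)."""
    errors = []
    if not any(ds.get(k) for k in ("document_file", "document_text", "documentId")):
        errors.append("need document_file, document_text, or documentId")
    for i, s in enumerate(ds.get("samples", [])):
        for field in ("id", "question", "expected_answer"):
            if not s.get(field):
                errors.append(f"samples[{i}] missing {field}")
    return errors

print(validate_dataset({
    "document_file": "service_agreement.pdf",
    "samples": [{"id": "q1",
                 "question": "Who is the service provider?",
                 "expected_answer": "AlphaTech Solutions Pvt. Ltd."}],
}))  # []
```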

Document loading options

Option 1 — External file (PDF or TXT)

{
  "document_file": "contract.pdf"
}

LexEval looks for the file in documents/ and extracts text using pdfplumber.
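The extraction step can be sketched like this. `load_document` is a hypothetical name, not LexEval's loader; only the use of pdfplumber for PDFs and UTF-8 reads for TXT comes from the docs above:

```python
import os
import tempfile

def load_document(path):
    """Extract text from a .pdf (via pdfplumber) or read a .txt as UTF-8."""
    if path.lower().endswith(".pdf"):
        import pdfplumber  # third-party dependency used by LexEval
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    with open(path, encoding="utf-8") as f:
        return f.read()

# Demo with a TXT file (the PDF branch needs pdfplumber installed).
tmp = os.path.join(tempfile.mkdtemp(), "contract.txt")
with open(tmp, "w", encoding="utf-8") as f:
    f.write("This Service Agreement is entered into on March 1, 2026.")
print(load_document(tmp))
```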

Option 2 — Inline text

{
  "document_text": "This Service Agreement is entered into on March 1, 2026..."
}

Useful for testing with short or synthetic documents.

Option 3 — Document ID (for external systems)

{
  "documentId": "service_agreement"
}

The ID is passed directly to run_agent. Use this when your agent fetches the document itself (e.g., from a database or cloud storage).

Handling missing information

Use "NOT_FOUND" as the expected answer when the document does not contain the information:

{
  "id": "q15",
  "question": "What arbitration rules apply under this agreement?",
  "expected_answer": "NOT_FOUND"
}

The Refusal Correctness metric rewards agents that correctly refuse to answer — and also rewards agents that find the information when the reference was incorrectly marked as NOT_FOUND.
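On the agent side, returning the literal "NOT_FOUND" instead of guessing is the pattern this metric rewards. A minimal sketch, with a trivial keyword check standing in for real retrieval or LLM logic:

```python
def run_agent(document, question):
    """Refusal pattern: return the literal NOT_FOUND rather than guess."""
    # Toy stand-in for retrieval: real agents would use embeddings or an LLM.
    topic = "arbitration"
    if topic in question.lower() and topic not in document.lower():
        return {"answer": "NOT_FOUND"}
    return {"answer": "See the relevant clause."}

print(run_agent("This Agreement is governed by the laws of India.",
                "What arbitration rules apply under this agreement?"))
# {'answer': 'NOT_FOUND'}
```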


📊 Output & Results

Results are saved automatically to:

results/results-<agent>-<dataset>-<timestamp>.json

Output structure

{
  "overall_score": 4.29,
  "results": [
    {
      "question": "Who is the service provider in this agreement?",
      "expected_answer": "AlphaTech Solutions Pvt. Ltd.",
      "predicted_answer": "The Service Provider is AlphaTech Solutions Pvt. Ltd.",
      "sample_score": 4.8,
      "metrics": {
        "answer_correctness": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "The model accurately identifies the Service Provider."
        },
        "hallucination": {
          "score": 5.0,
          "verdict": "Pass",
          "reasoning": "No hallucinated facts detected."
        },
        "entity_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "AlphaTech Solutions Pvt. Ltd. matches the reference entity."
        },
        "date_accuracy": {
          "score": 5.0,
          "verdict": "Correct",
          "reasoning": "No date comparison required for this question."
        },
        "refusal_correctness": {
          "score": 4.0,
          "verdict": "Correct",
          "reasoning": "Answer was expected and correctly provided."
        }
      }
    }
  ]
}

Score interpretation

| Score | Interpretation |
|---|---|
| 4.5 – 5.0 | Excellent — agent reliably extracts legal facts |
| 3.5 – 4.4 | Good — minor issues with formatting or partial answers |
| 2.5 – 3.4 | Fair — noticeable errors in specific metric categories |
| < 2.5 | Poor — significant hallucination, entity confusion, or refusal failures |
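For CI pipelines, a small gate script can read the newest results file and fail the build below a threshold. The helper name and the 4.0 bar are illustrative; only the results path pattern and the overall_score field come from the docs above:

```python
import glob
import json
import os
import tempfile

def latest_overall_score(results_dir):
    """Return overall_score from the newest LexEval results file, or None."""
    paths = sorted(glob.glob(os.path.join(results_dir, "results-*.json")))
    if not paths:
        return None
    with open(paths[-1]) as f:
        return json.load(f)["overall_score"]

# Demo against a synthetic results file (schema per the Output section).
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "results-my_agent-demo-20260301.json"), "w") as f:
    json.dump({"overall_score": 4.29, "results": []}, f)
print(latest_overall_score(tmp))  # 4.29
```

In a CI step you would point this at results/ and exit nonzero when the score dips below your chosen bar.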

⚙️ Configuration

Environment variables

| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | API key for the LLM evaluator |
| LLM_MODEL | gpt-4o-mini | Model used by all metric evaluators |
| LLM_API_URL | https://api.openai.com/v1/chat/completions | OpenAI-compatible endpoint |

LexEval supports any OpenAI-compatible API — swap in Mistral, Together AI, Groq, or a local ollama endpoint by setting LLM_API_URL and LLM_MODEL.
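Any endpoint that speaks the standard chat-completions payload works. A minimal request-builder sketch reading the variables above (illustrative, not metrics/llm_client.py verbatim):

```python
import json
import os
import urllib.request

def build_request(prompt):
    """Build a chat-completions request from the env vars documented above."""
    url = os.environ.get("LLM_API_URL",
                         "https://api.openai.com/v1/chat/completions")
    payload = {
        "model": os.environ.get("LLM_MODEL", "gpt-4o-mini"),
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )

req = build_request("Score this answer from 1.0 to 5.0.")
print(req.full_url)
```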

Using a custom evaluator endpoint

LLM_API_KEY=your-key
LLM_API_URL=http://localhost:11434/v1/chat/completions
LLM_MODEL=llama3

🧩 Supported Document Types

| Format | Support | Notes |
|---|---|---|
| .pdf | ✅ | Text extracted with pdfplumber |
| .txt | ✅ | Read directly as UTF-8 |
| Inline text | ✅ | Via document_text field in dataset |
| .docx | ❌ | Not yet supported |
| Scanned PDFs | ⚠️ | Only if a text layer is present |

🛠️ Use Cases

LexEval is designed for teams building or evaluating:

  • Legal RAG systems — Retrieval-augmented generation over contracts and agreements
  • Contract QA agents — AI assistants that answer questions about specific clauses
  • Legal LLM benchmarking — Compare model performance on structured legal extraction tasks
  • Prompt engineering — Test whether prompt changes improve factual precision
  • CI/CD evaluation pipelines — Automate regression testing for legal AI systems

🤝 Contributing

Contributions are welcome! Here's how to get started:

  1. Fork and clone the repository.

  2. Create a feature branch:

    git checkout -b feature/your-feature-name
  3. Follow module boundaries:

    • core/ — document loading and agent loading utilities
    • metrics/ — individual metric evaluators (one file per metric)
    • agents/ — example and reference agent implementations
    • datasets/ — evaluation datasets (JSON)
    • Keep secrets and endpoints in environment variables, never hardcoded.
  4. Testing: There are currently no automated tests. If you introduce complex logic, please add pytest tests. At minimum, run the CLI against a local dataset to verify your changes work end-to-end.

  5. Submit a pull request describing:

    • What you changed
    • Why it is needed
    • Any new environment variables or configuration introduced

Please coordinate with project maintainers for coding style expectations.


⭐ If LexEval is useful to your team, consider starring the repo!

About

A lightweight, open-source toolkit for unit-testing Legal RAG pipelines. It replaces generic string matching with specialized logic to validate contract data and model behavior, allowing for rapid iteration on small, high-quality datasets.
