Autograder Workbench – LLM-based Evaluation for Retrieve/Generate Systems

Autograder Workbench is a toolbox for evaluating information systems that use retrieval and/or generation approaches.

The workbench builds on the code base for the Exam Answerability Metric, which uses a test bank of exam questions to evaluate systems for CAR Y3. In this release, the code base is extended to generate a test bank of nuggets (aka key facts) and to provide better support for a human-in-the-loop to verify and supervise the process (without conducting any passage-level relevance judgments). The code base can evaluate systems even when no manual passage-level judgments are available.

The code base of the Autograder Workbench is released under a BSD-3 open source license.

Approach

The Autograder evaluation paradigm consists of several phases, each supported with utilities provided in this code base.

  1. Test Bank Generation: LLMs with human-in-the-loop develop a collection of test nuggets and/or exam questions.
  2. Grading: The LLM will grade all system responses, passage-by-passage, by either scanning the passage for mentions of each nugget or trying to answer each exam question based on the passage content.
  3. Manual Oversight and Verification: To manually verify that the LLM is operating as intended, the workbench supports inspecting extracted answers and nugget mentions along with passage-level grades.
  4. Evaluation: An evaluation score for each IR system is computed, either via Autograde-Cover or Autograde-Qrels. The latter exports a qrels file that is inter-operable with the evaluation tool trec_eval.
  5. Additional analyses: If available, official leaderboards can be used to analyze rank correlation of the predicted system ranking. If manual judgments are available, the workbench provides an inter-annotator agreement analysis.

Resources

This package includes Python command line utilities for the phases of the pipeline above:

  • Phase 1: autograder-generate generates a test bank from query sets.
  • Phase 2: autograder-grade grades passages from system responses.
  • Phase 3: autograder-verify supports manual verification and supervision.
  • Phase 4: autograder-evaluate derives evaluation scores for systems under evaluation.
  • Additional analyses: autograder-analyze offers leaderboard rank correlation and inter-annotator agreement analyses.

Upon installation, each of these command line utilities provides extensive documentation when called with --help.

Installation via poetry

Let's examine usage of the Autograder Workbench on the TREC DL 20 dataset. First, clone this repository, fetch the data-dl20.tar.xz tarball, and extract it into the directory data/dl20.

$ git clone <this repository>
$ cd <cloned directory>
$ poetry install
$ wget https://www.cs.unh.edu/~dietz/autograder/data-dl20.tar.xz  # tarball with graded runs, questions, and nuggets
$ mkdir -p data/dl20 && tar -xf data-dl20.tar.xz -C data/dl20

Official run files need to be obtained from https://trec.nist.gov/results/trec29/deep.passages.input.html. Access credentials are provided by the TREC Manager. Decompressed run files need to be placed in ./data/dl20/dl20runs.

Alternative installation methods are described below.

Interchange Data Model

All phases use the same JSON data model (as gzip-compressed JSON-lines):


[
  "Query ID",
  [
    {
      "paragraph_id": "Unique Paragraph Identifier",
      "text": "Full Text of the Paragraph",
      "paragraph": ..., // paragraph with additional markup, if available
      "paragraph_data": {
        "judgments": [
          {
            "paragraphId": "Same Paragraph Identifier",
            "query": "Associated Query ID, potentially Identifier of Subquery",
            "relevance": 2, // judgment grade
            "titleQuery": "Query ID"
          }
        ],
        "rankings": [
          {
            "method": "Ranking Method",
            "paragraphId": "Unique Paragraph Identifier",
            "queryId": "Associated Query ID, potentially Identifier of Subquery",
            "rank": 6, // retrieval rank
            "score": 17.560528 // retrieval score
          }
        ]
      },
      "exam_grades": [ // for exam questions and nuggets
        {
          "correctAnswered": ["List of Correctly Answered Question and Nugget IDs"],
          "wrongAnswered": ["List of Incorrectly Answered Question and Nugget IDs"],
          "self_ratings": [
            {
              "nugget_id": "Nugget ID",
              // alternatively: "question_id": "Question ID"
              "self_rating": 4 // self-rating grade
            }
          ],
          "answers": [
            ["Question or Nugget ID", "Answer Text"]
          ],
          "llm": "Huggingface Language Model Used",
          "llm_options": {
            "prompt_template": "Template Used for Prompt",
            "answer_match": "Answer Matching Strategy"
          },
          "prompt_info": {
            "prompt_class": "NuggetSelfRatedPrompt",
            "prompt_style": "Is the nugget addressed...",
            "context_first": false,
            "check_unanswerable": true,
            "check_answer_key": false,
            "is_self_rated": true
          },
          "exam_ratio": 0.25 // fraction of questions answered correctly
        }
      ],
      "grades": [
        {
          "correctAnswered": true, // if judged as relevant
          "self_ratings": 4, // self-rating on relevance
          "answers": "Answer Text",
          "llm": "Huggingface Language Model Used",
          "llm_options": {...},
          "prompt_info": ...
        }
      ]
    }
  ]
]
Data Model. Query, passage text, and paragraph ID must be provided externally. If available, manual judgment levels and system information can be used for analysis. Phase 2 adds the fields exam_grades and/or grades with information about correct nuggets/questions, self-ratings of answerability, and answers for manual verification. In Phase 3, the workbench supports filtering based on llm and prompt_class.
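
For illustration, a file in this interchange format can be read with a few lines of Python. The following is a minimal sketch based on the data model above (not part of the workbench API); it assumes one [query ID, paragraph list] pair per gzip-compressed JSON line:

import gzip
import json

def read_query_paragraphs(path):
    """Yield (query_id, paragraphs) pairs from a *.jsonl.gz interchange file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            query_id, paragraphs = json.loads(line)
            yield query_id, paragraphs

for query_id, paragraphs in read_query_paragraphs("data/dl20/dl20-graded.jsonl.gz"):
    for para in paragraphs:
        # exam_grades is only present after Phase 2 (grading)
        for grade in para.get("exam_grades", []):
            print(query_id, para["paragraph_id"], grade.get("exam_ratio"))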

Usage

Collection of External Inputs

The following inputs are required:

  • dl-queries.json: Queries in the form of a JSON dictionary mapping query ID to query text.

  • dl20-passages.jsonl.gz: Collection of passages from system responses (rankings or generated text) for grading. These follow the data interchange model, providing the query ID, paragraph_id, and text.
    A system's rank information can be stored in paragraph_data.rankings[].
    If available, manual judgments can be stored in paragraph_data.judgments[]. An example file is provided in trecDL2020-qrels-runs-with-text.jsonl.gz.
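
If system responses are not yet in this format, a minimal conversion sketch along the lines below can produce a compatible passage file (the paths, the method name "my-system", and the passages_by_query contents are hypothetical placeholders):

import gzip
import json

# Replace with your own system output: query ID -> ranked passages with text.
passages_by_query = {
    "dl20-q1": [
        {"paragraph_id": "p123", "text": "Full text of the passage ...",
         "rank": 1, "score": 17.56},
    ],
}

with gzip.open("data/dl20/dl20-passages.jsonl.gz", "wt", encoding="utf-8") as out:
    for query_id, passages in passages_by_query.items():
        entry = [query_id, [
            {"paragraph_id": p["paragraph_id"],
             "text": p["text"],
             "paragraph_data": {
                 "judgments": [],  # fill in if manual judgments are available
                 "rankings": [{"method": "my-system",
                               "paragraphId": p["paragraph_id"],
                               "queryId": query_id,
                               "rank": p["rank"],
                               "score": p["score"]}],
             }}
            for p in passages
        ]]
        out.write(json.dumps(entry) + "\n")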

Phase 1: Test Bank Generation

Generate a test bank of nuggets as follows:

$ export OPENAI_API_KEY=...
$ poetry run autograder-generate \
 -q data/dl20/dl20-queries.json \
 -o data/dl20/dl20-nuggets.jsonl.gz \
 --use-nuggets \
 --gpt-model gpt-3.5-turbo \
 --test-collection DL20 \
 --description "A new set of generated nuggets for DL20"

This will produce dl20-nuggets.jsonl.gz, which contains a test bank of nuggets. For instance,

$ zcat data/dl20/dl20-nuggets.jsonl.gz | jq .items[].question_text
"Which musicians or bands are considered pioneers of rock n roll?"
"What were the major influences that led to the emergence of rock n roll?"
"Are there any specific events or performances that marked the beginning of rock n roll?"
...
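
The same inspection can be done programmatically; a small sketch, assuming the per-line structure visible in the jq query above:

import gzip
import json

with gzip.open("data/dl20/dl20-nuggets.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        test_bank = json.loads(line)
        for item in test_bank["items"]:
            print(item["question_text"])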

Phase 2: Grading

We can then assess the quality of an IR system by scanning the system's response for mentions of the nuggets. Here we use a nugget-specific self-rating prompt for the flan-t5-large model.

This phase will use a local GPU. The CUDA device ID and batch size are configured via environment variables:

export GPU_DEVICE=0
export BATCH_SIZE=10

Set the device to None to use CPUs.

$ poetry run autograder-grade \
   data/dl20/dl20-passages.jsonl.gz \
   -o data/dl20/dl20-graded.jsonl.gz \
   --model-name google/flan-t5-large \
   --model-pipeline text2text \
   --prompt-class NuggetSelfRatedPrompt \
   --question-path data/dl20/dl20-nuggets.jsonl.gz  \
   --question-type question-bank \
   --use-nuggets 

Alternative prompt classes are:

  • NuggetSelfRatedPrompt: self-rating of nugget mentions (enable --use-nuggets)
  • NuggetExtractionPrompt: extraction of nugget mentions, for explanation and verification (to be used with --use-nuggets)
  • QuestionSelfRatedUnanswerablePromptWithChoices: self-rating answerability of exam questions
  • QuestionCompleteConcisePromptWithAnswerKey2: extract answers for exam questions (informational or for test banks with known correct answers)
  • FagB, FagB_few, HELM, Sun, Sun_few, Thomas: direct grading prompts
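
The graded file now carries exam_grades entries per passage (see the data model above). For a quick spot check before moving to Phase 3, one can print self-ratings and extracted answers directly; this is only an illustrative sketch, not a workbench command:

import gzip
import json

with gzip.open("data/dl20/dl20-graded.jsonl.gz", "rt", encoding="utf-8") as f:
    query_id, paragraphs = json.loads(next(f))  # first query only

for para in paragraphs[:3]:                     # first few passages
    for grade in para.get("exam_grades", []):
        for rating in grade.get("self_ratings", []):
            nugget_id = rating.get("nugget_id") or rating.get("question_id")
            print(query_id, para["paragraph_id"], nugget_id, rating["self_rating"])
        for nugget_id, answer in grade.get("answers", []):
            print("  answer for", nugget_id, ":", answer)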

Phase 3: Manual Verification

We support manual verification and process supervision with the following commands.

To display all answers to the grading prompts (self-rated and extraction), grouped by question/nugget:

$ poetry run autograder-verify \
   data/dl20/dl20-graded.jsonl.gz \
   --verify-grading \
   --question-path data/dl20/dl20-questions.jsonl.gz  \
   --question-type question-bank \
    > data/dl20/dl20--verify-grading.txt

To identify questions/nuggets that are frequently covered by non-relevant passages (these should be removed from the test bank):

$ poetry run autograder-verify \
   data/dl20/dl20-graded.jsonl.gz \
   --uncovered-passages \
   --min-judgment 1  \
   --min-rating 4  \
   --question-path data/dl20/dl20-questions.jsonl.gz  \
   --question-type question-bank \
    > data/dl20/dl20-uncovered-passages.txt

To identify relevant passages that are not covered by any question/nugget (these call for additional test nuggets/questions):

$ poetry run autograder-verify \
   data/dl20/dl20-graded.jsonl.gz \
   --bad-question \
   --min-judgment 1  \
   --min-rating 4  \
   --question-path data/dl20/dl20-questions.jsonl.gz  \
   --question-type question-bank \
    >  data/dl20/dl20--bad-question.txt

We envision that human verification will lead to iterating over the previous phases, with manual refinements of the test bank and adjustments to the grading prompts and models.

Phase 4: Evaluation

To evaluate systems with Autograde-Qrels, a trec_eval-compatible qrels file is exported.

$ poetry run autograder-evaluate \
     data/dl20/dl20-graded.jsonl.gz \
     -q data/dl20/dl20-autograde-qrels.qrels \
     --min-self-rating 4 \
     --prompt-class $promptclass  \
     --model google/flan-t5-large \
     --question-set question-bank 

The workbench can automatically run trec_eval with this qrels file on a directory of run files when the following options are added (only supported under bash; trec_eval needs to be on the PATH):

    --run-dir data/dl20/dl20runs  
    --qrel-leaderboard-out data/dl20/dl20-autograde-qrels-leaderboard.tsv 

To evaluate systems with Autograde Cover, system information needs to be included in the passage file (e.g. dl20-passages.jsonl.gz). This information is preserved during the grading process. The leaderboard is produced with:

$ poetry run autograder-evaluate \
    data/dl20/dl20-graded.jsonl.gz \
    --leaderboard-out data/dl20/dl20-autograde-cover-leaderboard.tsv \
    --min-self-rating 4 \
    --prompt-class $promptclass \
    --model google/flan-t5-large \
    --question-set question-bank

Direct grading prompts are only supported via Autograde Qrels.
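
Conceptually, a coverage-style score rewards systems whose responses mention many distinct test nuggets with a high self-rating. The sketch below illustrates this idea only; it is not the workbench's exact Autograde-Cover definition, and it assumes the graded file produced in Phase 2:

import gzip
import json
from collections import defaultdict

THRESHOLD = 4  # same threshold as --min-self-rating

covered = defaultdict(set)      # query ID -> nugget IDs covered at or above THRESHOLD
all_nuggets = defaultdict(set)  # query ID -> all nugget IDs seen in the grades

with gzip.open("data/dl20/dl20-graded.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        query_id, paragraphs = json.loads(line)
        for para in paragraphs:
            for grade in para.get("exam_grades", []):
                for rating in grade.get("self_ratings", []):
                    nugget_id = rating.get("nugget_id") or rating.get("question_id")
                    all_nuggets[query_id].add(nugget_id)
                    if rating["self_rating"] >= THRESHOLD:
                        covered[query_id].add(nugget_id)

for query_id in sorted(all_nuggets):
    frac = len(covered[query_id]) / len(all_nuggets[query_id])
    print(f"{query_id}: {frac:.2f} of nuggets covered")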

Additional Analyses

Rank correlation with official leaderboards using Autograde qrels.

$ poetry run autograder-analyze \
    data/dl20/dl20-graded.jsonl.gz \ 
    -q data/dl20/dl20-autograde-qrels.qrels \
    --run-dir data/dl20/dl20runs  \
    --official-leaderboard data/dl20/official_dl20_leaderboard.json \
    --qrel-leaderboard-out data/dl20/dl20-autograde-qrels-leaderboard.tsv \
    --min-relevant-judgment 2 \
    --use-ratings \
    --min-trec-eval-level 4 \
    --prompt-class $promptclass  \
    --model google/flan-t5-large \
    --question-set question-bank 

Rank correlation with official leaderboards using Autograde Cover.

$ poetry run autograder-analyze \
    data/dl20/dl20-graded.jsonl.gz \ 
    --leaderboard-out data/dl20/dl20-autograde-cover-leaderboard.tsv \
    --official-leaderboard data/dl20/official_dl20_leaderboard.json \
    --use-ratings \
    --min-self-rating  4 \
    --prompt-class $promptclass  \
    --model google/flan-t5-large \
    --question-set question-bank 

Inter-annotator agreement of manual judgments and self-ratings.

$ poetry run autograder-analyze \
      data/dl20/dl20-graded.jsonl.gz \
      --inter-annotator-out data/dl20/dl20-autograde-inter-annotator.tex \
      --min-relevant-judgment 2 \ 
      --use-ratings 4 \
      --prompt-class $promptclass \
      --model google/flan-t5-large \
      --question-set question-bank
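
As a rough, stand-alone illustration of such an agreement analysis (not the workbench's implementation), one could binarize manual judgments and nugget self-ratings on the same passages and compute Cohen's kappa; the thresholds and the scikit-learn dependency below are assumptions:

import gzip
import json
from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn is installed

manual, auto = [], []
with gzip.open("data/dl20/dl20-graded.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        _query_id, paragraphs = json.loads(line)
        for para in paragraphs:
            judgments = para.get("paragraph_data", {}).get("judgments", [])
            grades = para.get("exam_grades", [])
            if not judgments or not grades:
                continue  # need both a manual judgment and an LLM grade
            best_rating = max((r["self_rating"]
                               for g in grades for r in g.get("self_ratings", [])),
                              default=0)
            manual.append(int(judgments[0]["relevance"] >= 2))  # manually judged relevant
            auto.append(int(best_rating >= 4))                  # self-rated as covering a nugget

print("Cohen's kappa:", cohen_kappa_score(manual, auto))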

Code walkthrough on the example of TREC DL 2020

A bash script with data for the code walkthrough is provided in walkthrough-dl20.sh.

Unabridged results and manual verification analyses.

Alternative Installation Methods

Installation via nix

The easiest way to use exampp is via the Nix package manager:

  1. Install Nix
  2. Run nix develop <repo url>#cuda
  3. Clone this repository and cd into it
  4. In a shell, type: nix develop

If you get error messages about unfree packages or experimental commands, run one of these longer commands instead:

  • nix --extra-experimental-features 'nix-command flakes' develop
  • NIXPKGS_ALLOW_UNFREE=1 nix --extra-experimental-features 'nix-command flakes' develop --impure

Usage:

Command line utilities are called directly via python -O -m <command>.

