# Benchmark Evaluation Demo

This notebook walks you through running predictions on the LLM-AggreFact benchmark and obtaining the evaluation results.

In [1]:
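import os

# Select the GPU before any CUDA-backed library is imported
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import pandas as pd
from datasets import load_dataset
from sklearn.metrics import balanced_accuracy_score
from minicheck.minicheck import MiniCheck

# Load the LLM-AggreFact test split into a DataFrame
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values      # grounding documents
claims = df.claim.values  # claims to verify against the documents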

## Load the fact-checking model
There are four models to choose from: ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B'], where:

(1) `MiniCheck-Flan-T5-Large` is the best fact-checking model with size < 1B and reaches GPT-4 performance. \
(2) `Bespoke-MiniCheck-7B` is the most performant fact-checking model in the MiniCheck series, and \
   it outperforms all existing specialized fact-checkers and off-the-shelf LLMs regardless of size.

In [None]:
model_name = 'Bespoke-MiniCheck-7B'
scorer = MiniCheck(model_name=model_name, cache_dir='./ckpts')

## Predict the labels
In this demo, `Bespoke-MiniCheck-7B` (implemented with vLLM) takes ~50 minutes to predict on the entire test set (29K examples) using a single NVIDIA A6000 (48GB VRAM). The average throughput is > 500 docs/min, the same as `MiniCheck-Flan-T5-Large`.
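
As a quick sanity check on these figures, the sketch below derives the throughput from the rounded numbers quoted above (29K examples, ~50 minutes) and uses it to estimate the runtime of a smaller run. The helper names and the 5,000-example subset are hypothetical, and actual timings will depend on hardware and document length.

```python
def docs_per_minute(num_examples, minutes):
    """Average throughput for a completed run."""
    return num_examples / minutes


def estimated_minutes(num_examples, throughput):
    """Rough runtime estimate at a given throughput (docs/min)."""
    return num_examples / throughput


# Rounded figures quoted in the text above
full_run_throughput = docs_per_minute(29_000, 50)

# Hypothetical subset size, only to illustrate the estimate
subset_eta = estimated_minutes(5_000, full_run_throughput)

print(f"throughput: {full_run_throughput:.0f} docs/min")
print(f"estimated time for 5,000 examples: {subset_eta:.1f} min")
```

The derived throughput is consistent with the "> 500 docs/min" figure stated above.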

In [None]:
# pred_label converts the raw probability (raw_prob) into 1/0 using the threshold 0.5
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)
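
The comment in the cell above says that `pred_label` is `raw_prob` binarized at a threshold of 0.5. The sketch below shows how labels could be re-derived from the raw probabilities at a different threshold without re-running the model. The helper `probs_to_labels` and the example probabilities are made up for illustration, and the strict `>` comparison is an assumption, since the library's handling of a probability exactly equal to the threshold is not stated here.

```python
def probs_to_labels(raw_prob, threshold=0.5):
    """Map support probabilities to 1 (supported) / 0 (unsupported).

    Assumption: a probability strictly above the threshold counts as supported.
    """
    return [1 if p > threshold else 0 for p in raw_prob]


# Made-up probabilities, not real model output
example_probs = [0.93, 0.12, 0.51, 0.49]

labels_default = probs_to_labels(example_probs)                # threshold 0.5
labels_strict = probs_to_labels(example_probs, threshold=0.9)  # stricter cutoff

print(labels_default)
print(labels_strict)
```

A higher threshold trades recall on supported claims for fewer false "supported" predictions. The benchmark results in this notebook use the default labels returned by `scorer.score`.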

## Check performance on LLM-AggreFact

In [1]:
df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])

# Balanced accuracy (in %) for each dataset in the benchmark
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

# Unweighted mean of the per-dataset scores
result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)

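|    | Dataset         | BAcc |
|---:|:----------------|-----:|
|  0 | AggreFact-CNN   | 65.5 |
|  1 | AggreFact-XSum  | 77.8 |
|  2 | TofuEval-MediaS | 76.0 |
|  3 | TofuEval-MeetB  | 78.3 |
|  4 | Wice            | 83.0 |
|  5 | Reveal          | 88.0 |
|  6 | ClaimVerify     | 75.3 |
|  7 | FactCheck-GPT   | 77.7 |
|  8 | ExpertQA        | 59.2 |
|  9 | Lfqa            | 86.7 |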
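
The `BAcc` column is balanced accuracy: the mean of the recall on each class, so a dataset dominated by one label cannot be scored well by always predicting that label. The sketch below recomputes the metric by hand on made-up labels to show how it differs from plain accuracy. The helper `balanced_accuracy` is illustrative only; the notebook itself uses `sklearn.metrics.balanced_accuracy_score`.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Positions whose gold label is this class
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Made-up labels: four supported (1) and two unsupported (0) claims
toy_labels = [1, 1, 1, 1, 0, 0]
toy_preds = [1, 1, 1, 0, 0, 1]

toy_bacc = balanced_accuracy(toy_labels, toy_preds)
toy_acc = sum(t == p for t, p in zip(toy_labels, toy_preds)) / len(toy_labels)

print(f"balanced accuracy: {toy_bacc:.3f}")
print(f"plain accuracy:    {toy_acc:.3f}")
```

In this toy case the recall is 0.75 on class 1 and 0.5 on class 0, so balanced accuracy is lower than plain accuracy because the minority class is predicted less reliably.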
