# Example Demo

This notebook walks you through running predictions on the LLM-AggreFact benchmark and obtaining the evaluation results reported in the paper.

In [1]:
import pandas as pd
from datasets import load_dataset
from sklearn.metrics import balanced_accuracy_score
from minicheck.minicheck import MiniCheck


df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

## Load the fact-checking model
There are three models to choose from: ['roberta-large', 'deberta-v3-large', 'flan-t5-large']. Of these, 'flan-t5-large' is our best-performing model, reaching GPT-4-level performance while being 400x cheaper.

In [None]:
model_name = 'flan-t5-large'
scorer = MiniCheck(model_name=model_name, device='cuda:0', cache_dir='./ckpts')

## Predict the labels
Predicting on the entire test set (13K examples) takes ~10-20 mins, depending on the chosen model and hardware setup. In this demo, we use 'flan-t5-large', which takes ~20 mins (>500 docs/min on average).

A GPU with 16 GB of VRAM should be sufficient. On our local machine, GPU memory usage stays below 10 GB for most of the prediction process.

In [None]:
# pred_label is the raw probability (raw_prob) converted to 1/0 using the threshold 0.5
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)
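The thresholding described in the comment above can be sketched in plain Python. This is only an illustration, not MiniCheck's internal code: the helper name `labels_from_probs` and the example probabilities are hypothetical, and treating a probability strictly greater than the threshold as "supported" is an assumption about how ties are handled.

```python
def labels_from_probs(probs, threshold=0.5):
    """Map each support probability to 1 (supported) or 0 (unsupported).

    Assumption: a probability must be strictly greater than the
    threshold to count as supported.
    """
    return [1 if p > threshold else 0 for p in probs]


# Made-up probabilities, standing in for the raw_prob values returned by the scorer.
example_probs = [0.91, 0.48, 0.07]

default_labels = labels_from_probs(example_probs)                 # [1, 0, 0]
lenient_labels = labels_from_probs(example_probs, threshold=0.4)  # [1, 1, 0]
print(default_labels, lenient_labels)
```

Because `raw_prob` is returned alongside `pred_label`, a different operating point can be explored this way without re-running the model.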

## Check performance on LLM-AggreFact

In [4]:
df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)

| | Dataset | BAcc |
|---|---|---|
| 0 | AggreFact-CNN | 69.9 |
| 1 | AggreFact-XSum | 74.3 |
| 2 | TofuEval-MediaS | 73.6 |
| 3 | TofuEval-MeetB | 77.3 |
| 4 | Wice | 72.2 |
| 5 | Reveal | 86.2 |
| 6 | ClaimVerify | 74.6 |
| 7 | FactCheck-GPT | 74.7 |
| 8 | ExpertQA | 59.0 |
| 9 | Lfqa | 85.2 |
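The BAcc column is balanced accuracy, which `balanced_accuracy_score` defines as the mean of the per-class recalls. As a rough sketch of what that metric measures, the function below recomputes it by hand; the name `balanced_accuracy` and the toy labels are hypothetical and are not taken from the benchmark.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, taken over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Positions of all examples whose gold label is this class.
        idx = [i for i, y in enumerate(y_true) if y == cls]
        # How many of those the predictions got right.
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy, imbalanced example: four supported claims, two unsupported ones.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

bacc = balanced_accuracy(y_true, y_pred)  # (3/4 + 1/2) / 2 = 0.625
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 4/6
print(bacc, plain_acc)
```

Plain accuracy rewards getting the majority class right, while balanced accuracy weights both classes equally, which matters when the share of supported claims differs across datasets. Note also that the 'Average' row computed in the cell above is the unweighted mean of the per-dataset scores, so each dataset counts equally regardless of its size.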
