
ERBench

Binary Tasks

How to Run

binary/run_qa.py

  • Description

    This Python script triggers the data preprocessing steps using the data in the {task}_data/crafted folders and runs the models (GPT, Gemini, Llama, Mistral). The output of the program is saved under the results folder.

  • Arguments

    • task

      the dataset to test

    • tasktype

      whether to run the validation step or the main test step

    • index

      LLM APIs occasionally abort mid-run due to issues such as timeout errors or sensitive-content errors. Add the failing index to the code so it is skipped, then pass that index as this parameter so the model continues testing from it (see the example after this list).

    • demo

      Used for few-shot prompting (parameter value: demo) or Chain-of-Thought prompting (parameter value: cot)

    • rag

      Used for Retrieval-Augmented Generation with Wikipedia; pass True to enable RAG

    • model

      the model to test
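
  • Example

    For instance, resuming a run from entity index 120 with Chain-of-Thought prompting might look like this; the task and model values are illustrative, so check the script's argparse choices for the exact accepted values:

    python run_qa.py --task movie --model gpt --index 120 --demo cot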

binary/error_analysis.py

This Python script computes numerical analysis results from the log files produced by run_qa.py. Four lines of output are printed to the terminal, corresponding to A, R, AR, and H. In each row, the first numeric value is for the basic prompt and the second for the negated prompt (see the example after the argument list).

  • Arguments
    • task

      the dataset to test

    • demo

      Used for few-shot prompting (parameter value: demo) or Chain-of-Thought prompting (parameter value: cot)

    • model

      the model to test

    • rag

      Used for Retrieval-Augmented Generation with Wikipedia; pass True to enable RAG
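
  • Example

    For instance, analyzing a Chain-of-Thought run with illustrative values (the flags presumably mirror those used for run_qa.py):

    python error_analysis.py --task movie --model gpt --demo cot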

binary/finetune_dataset.py

This Python script creates the dataset needed for fine-tuning GPT models.

  • Arguments

    • n

      the number of data points to use

    To change which datasets are used for fine-tuning, modify the datasets parameter on line 313.
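
    As a rough illustration, OpenAI's chat fine-tuning consumes JSONL with one chat example per line. The sketch below shows that shape with made-up prompts; the exact questions and labels written by finetune_dataset.py are assumptions here:

    import json

    # Hypothetical examples; the actual prompts/answers produced by
    # finetune_dataset.py may differ.
    examples = [
        {
            "messages": [
                {"role": "system", "content": "Answer with Yes or No."},
                {"role": "user", "content": "Is there a movie released in 1999 and directed by X?"},
                {"role": "assistant", "content": "Yes."},
            ]
        },
    ]

    # One JSON object per line, as the OpenAI fine-tuning API expects.
    with open("finetune_dataset.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")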

finetune.ipynb

This notebook executes fine-tuning based on the dataset created by finetune_dataset.py.
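
A minimal sketch of the fine-tuning calls such a notebook typically makes with the openai Python client; the file name and base model below are assumptions, not the notebook's actual settings:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL produced by finetune_dataset.py (file name assumed).
upload = client.files.create(
    file=open("finetune_dataset.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a fine-tuning job; the base model is an assumption.
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) for completion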

Experiment Procedure

General Tasks

python run_qa.py --model [MODEL] --task [TASK] --tasktype validate
python run_qa.py --model [MODEL] --task [TASK]
python error_analysis.py --model [MODEL] --task [TASK]

Finetuning

python finetune_dataset.py --n [N]
run finetune.ipynb in a Jupyter kernel
python run_qa.py --model [FINETUNED_MODEL] --task [TASK] --tasktype validate
python run_qa.py --model [FINETUNED_MODEL] --task [TASK]
python error_analysis.py --model [FINETUNED_MODEL] --task [TASK]

finetune_dataset.py -> finetune.ipynb -> run_qa.py -> error_analysis.py

Images for Multimodal Models (Gemini Vision Pro)

https://drive.google.com/drive/folders/1WXCGCG4ZPzkV1qUjR2Z0IegzkPgSSKwl?usp=drive_link

Multi-choice Tasks

How to Run

multi_choice/source/run_qa.py

  • Description

    This Python script (1) preprocesses the dataset and (2) runs the QA/validation task.

    (1) preprocessing

    Use dataset/crafted to reproduce the results in the paper. Define your own preprocessing function if you want to use your own database.

    (2) running tasks

    • QA

      Run main QA tasks with LLMs. The output of the program (log file) will be saved under the results folder.

      For example,

      python run_qa.py --task movie --model gpt35 --tasktype multiqa

    • Validation

      Run validation tasks with LLMs. The output of the program (log file) will be saved under the dataset/validated folder.

      For example,

      python run_qa.py --task movie --model gpt35 --tasktype validate

  • Arguments

    • task

      choose dataset (movie/soccer/airport/music/book)

    • tasktype

      choose QA or validation task (multiqa/validate)

    • index

      entity ID from which to resume. This is useful when an LLM API call fails mid-run. When specified, entities before the given index are skipped.

    • mixed

      proportion in [0, 1) of None-of-the-above questions mixed with normal questions. For example, 0.2 means 20% of the questions are None-of-the-above type (see the example after this list).

    • random_seed

      random seed to reproduce results.

    • model

      choose model (gpt35/gpt4/mistral/llama/gemini/gemini_v)

    • rag

      run QA with knowledge augmentation (RAG). Run the normal QA first before running in RAG mode.

    • demo

      run QA with few-shot demonstrations. Before running in demo mode, make sure you have demonstrations in dataset/demo.
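
    For example, to run QA where 20% of the questions are None-of-the-above type with a fixed random seed (illustrative values):

    python run_qa.py --task movie --model gpt35 --tasktype multiqa --mixed 0.2 --random_seed 42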

multi_choice/source/error_analysis.py

  • Description

    This Python script must run after run_qa.py. It (1) processes the validation log file into a dataframe (.csv), (2) processes the QA log file into a dataframe (.csv), and (3) analyzes performance metrics w.r.t. these processed dataframes.

    (1) Processing validation log file

    Make sure you have the validation log file from run_qa.py. The output of the program (csv) will be saved under the dataset/validated folder.

    (2) Processing QA log file

    Make sure you have the QA log file from run_qa.py. The output of the program (csv) will be saved under the results folder.

    (3) Analyzing performances

    We provide 3 modes w.r.t. the validation output when analyzing the QA output: (a) consider all QA pairs (no validation), (b) consider QA pairs w.r.t. the given LLM’s own valid entities, and (c) consider QA pairs w.r.t. all LLMs’ valid entities. We use “valid entities” to mean “entities that the LLM already knows”. Please refer to our paper for more details.

    python error_analysis.py --task movie --model gpt35 # get types (a), (b), (c) at once
    python error_analysis.py --task movie --model gpt35 --only_val # get type (b) only
    python error_analysis.py --task movie --model gpt35 --only_common # get type (c) only
    
  • Arguments

    • task

      choose dataset (movie/soccer/airport/music/book)

    • mixed

      proportion in [0, 1) of None-of-the-above questions mixed with normal questions. For example, 0.2 means 20% of the questions are None-of-the-above type (see the example after this list).

    • model

      choose model (gpt35/gpt4/mistral/llama/gemini/gemini_v)

    • rag

      analyze the QA output produced in RAG mode (knowledge augmentation). The normal QA must have been run first.

    • demo

      analyze the QA output produced with few-shot demonstrations (demo mode).

    • only_val

      save only type (b) analysis results.

    • only_common

      save only type (c) analysis results.
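
    For example, when the QA run used --mixed 0.2, the analysis presumably takes the same flag (illustrative values):

    python error_analysis.py --task movie --model gpt35 --mixed 0.2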

Experiment Procedure

python run_qa.py --model [MODEL] --task [TASK] --tasktype validate
python run_qa.py --model [MODEL] --task [TASK]
python error_analysis.py --model [MODEL] --task [TASK]
