
NLPBench: Evaluating NLP-Related Problem-solving Ability in Large Language Models

NLPBench is a novel benchmark for Natural Language Processing problems, consisting of 378 questions sourced from NLP course final exams at Yale University.

Data

Example questions are shown in the Example Questions figure. Our dataset is licensed under CC BY-ND, and you can download it through this link.

Environment Preparation

You can create our conda environment from environment.yml by running

conda env create -f environment.yml

then activate our conda environment by

conda activate NLPBench

Evaluation

Our evaluations cover both online LLMs (GPT-3.5, GPT-4, and PaLM 2) and open-source LLMs (LLAMA 2, Falcon, Bloom, etc.).

For Online LLMs

Online LLMs usually require an API key. To access OpenAI models, add OPENAI_API_KEY to your system environment as follows:

export OPENAI_API_KEY="YOUR OPENAI API KEY"

For PaLM 2, add PALM_API_KEY to your system environment as follows:

export PALM_API_KEY="YOUR PALM API KEY"
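
If you are unsure whether the keys were exported correctly, a minimal check like the following (not part of the repo) prints whether each key is visible to Python:

import os

# Minimal sanity check: confirm the API keys are visible to Python
# before running the evaluation scripts.
for key in ("OPENAI_API_KEY", "PALM_API_KEY"):
    value = os.environ.get(key)
    print(f"{key}: {'set' if value else 'missing'}")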

For Open-source LLMs

We use vLLM to start an OpenAI-compatible endpoint for evaluation. All configurations are summarized in ./utils/utils.py:oai_llm_config. Check this list for the supported open-source models.

To evaluate other open-source models, add your model's configuration to oai_llm_config in the following format:

"HUGGINGFACE REPO": {
    "model": "HUGGINGFACE REPO",
    "api_key": "empty",
    "api_base": "YOUR ENDPOINT HOST, DEFAULT: http://127.0.0.1:8000/v1",
}

then start the endpoint with the following script:

bash scripts/serving.sh [-m HUGGINGFACE REPO] [-n NUMBER OF GPUs] [-a HOST ADDRESS, DEFAULT: 127.0.0.1] [-p PORT, DEFAULT: 8000]
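
Once the endpoint is up, you may want to verify it responds before running the full evaluation. The snippet below is a minimal sanity check, not part of the repo; it assumes the openai>=1.0 Python client and that the served model name matches the HuggingFace repo ID passed via -m:

from openai import OpenAI  # assumes the openai>=1.0 Python client is installed

# Hypothetical sanity check against the local vLLM endpoint started by
# scripts/serving.sh; adjust base_url and model to your serving config.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="empty")
response = client.chat.completions.create(
    model="HUGGINGFACE REPO",  # the same repo ID passed via -m
    messages=[{"role": "user", "content": "What is a bigram language model?"}],
)
print(response.choices[0].message.content)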

Run Evaluation

Evaluation has two steps: (1) solving the problems and (2) calculating the accuracy. We use sacred to manage our configurations; all configs can be found in ./configs, and you can add your own by creating a new YAML file. For example, run the following command to have GPT-3.5 answer the questions with zero-shot prompting and without context:

python run.py with configs/zero-shot.yaml model_name='gpt-3.5-turbo' ctx=False

The answers will be saved to ./res/{SEED}/no_ctx/zero-shot_gpt-3.5-turbo.json. You can then compute the accuracy by running:

python evaluate.py

The evaluation results will then be saved in ./res/{SEED}/.
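
If you want to inspect the raw answers before or after running evaluate.py, a quick spot check might look like the following (the seed value and the internal JSON structure here are assumptions, not guaranteed by the repo):

import json

# Hypothetical spot check of a saved answer file; the exact path and the
# internal JSON structure depend on your seed and config.
path = "./res/42/no_ctx/zero-shot_gpt-3.5-turbo.json"
with open(path) as f:
    answers = json.load(f)
print(f"Loaded {len(answers)} records from {path}")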

Prompt

All prompts used in our evaluation can be found in ./prompts, including the question-answering prompts (qa_prompt.py), system prompts (sys_prompt.py), and tree-of-thought prompts (tot_prompt.py). You can customize the prompts by modifying these three files. A sketch of what a custom prompt could look like follows below.
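
For example, a customized system prompt could be as simple as a string constant added to sys_prompt.py; the variable name below is hypothetical, and the real files may use different templates:

# Hypothetical custom system prompt; the actual variable names and templates
# used in ./prompts/sys_prompt.py may differ.
CUSTOM_SYS_PROMPT = (
    "You are an expert in natural language processing. "
    "Answer the following exam question concisely and show your reasoning."
)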

Citation

If you find our repository and results useful, please cite our paper:

@misc{song2023nlpbench,
      title={NLPBench: Evaluating Large Language Models on Solving NLP Problems}, 
      author={Linxin Song and Jieyu Zhang and Lechao Cheng and Pengyuan Zhou and Tianyi Zhou and Irene Li},
      year={2023},
      eprint={2309.15630},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
