
Alignment for Honesty

This is the official repository for Alignment for Honesty.

🚀Overview

Within the "HHH" alignment principle, considerable effort has gone into making LLMs more helpful and harmless, while honesty has received comparatively little attention. In this work, we define the honesty of an LLM as its ability to proactively refuse to answer questions when it lacks the relevant knowledge, without being overly conservative, as illustrated in the following figure. In this way, alignment for honesty can mitigate hallucinations and enhance the trustworthiness of LLMs without resorting to external resources.

Illustration of Alignment for Honesty. Given a knowledge-intensive question, an aligned model is expected to provide the correct answer if it has knowledge of the question, or alternatively, refuse to answer the question.

📖Resources

Data

We provide the training and evaluation data, as well as the processing code in data. Please refer to the corresponding README for more information.

Train

We provide the code for processing training data following our proposed honesty-oriented supervised fine-tuning methods in train. An overview of the alignment strategies is shown in the following figure.

Please note that our resources do not include code for full-parameter fine-tuning of LLMs. We use CoLLiE in the paper; however, you are free to choose any LLM training repository that suits your preferences.
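For intuition, here is a minimal, hypothetical sketch of honesty-oriented SFT data construction: questions the model is judged to know keep their gold answers, while questions it is judged not to know are mapped to an explicit refusal. The `expected_accuracy` field, the threshold, the refusal template, and the field names are illustrative assumptions, not the exact scheme used in the paper or in train.

```python
import json

# Illustrative refusal template (an assumption; the actual training code may use different wording).
IDK_RESPONSE = "I apologize, but I'm not able to provide a confident answer to this question."

def build_honesty_sft_data(samples, k_threshold=0.5):
    """Turn (question, gold_answer, expected_accuracy) records into SFT pairs.

    `expected_accuracy` is assumed to be the fraction of sampled model responses
    judged correct, used as a proxy for whether the model "knows" the answer.
    """
    sft_data = []
    for s in samples:
        knows = s["expected_accuracy"] >= k_threshold
        sft_data.append({
            "instruction": s["question"],
            # Keep the gold answer for "known" questions, refuse otherwise.
            "output": s["gold_answer"] if knows else IDK_RESPONSE,
        })
    return sft_data

if __name__ == "__main__":
    demo = [
        {"question": "Who wrote 'Pride and Prejudice'?", "gold_answer": "Jane Austen", "expected_accuracy": 0.9},
        {"question": "What did the author eat on 7 March 1811?", "gold_answer": "Unknown", "expected_accuracy": 0.0},
    ]
    print(json.dumps(build_honesty_sft_data(demo), indent=2))
```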

Evaluation

In the paper, we measure the performance of aligned models on several datasets, including the public datasets TriviaQA, Non-AmbigQA, and MMLU, as well as PUQA and PKQA, two datasets constructed by ourselves. Detailed evaluation code can be found in evaluation.
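As a rough illustration of the bookkeeping involved, the sketch below classifies each model response as answered correctly, answered wrongly, or refused, and reports the corresponding rates. The refusal-detection heuristic and the metric names are assumptions for illustration only, not the exact implementation in evaluation.

```python
from collections import Counter

# Illustrative refusal markers (an assumption; the evaluation code may detect refusals differently).
REFUSAL_MARKERS = ("i'm not able to", "i don't know", "i am not sure", "i apologize")

def classify_response(response, gold_answer):
    """Label a single response as 'refused', 'correct', or 'wrong'."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refused"
    return "correct" if gold_answer.lower() in text else "wrong"

def summarize(records):
    """records: iterable of (model_response, gold_answer) pairs."""
    counts = Counter(classify_response(r, g) for r, g in records)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("correct", "wrong", "refused")}

if __name__ == "__main__":
    preds = [
        ("Jane Austen wrote it.", "Jane Austen"),
        ("I don't know the answer to that.", "Charles Dickens"),
        ("It was Mark Twain.", "Charles Dickens"),
    ]
    print(summarize(preds))
```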

👴Confucius

To say "I know" when you know, and "I don't know" when you don't, that is wisdom. — The Analects of Confucius

The two best honesty-aligned models are now available on huggingface-hub:

| Model Name | HF Checkpoint | Size | License |
| --- | --- | --- | --- |
| confucius-confidence-verb | 🤗 GAIR/confucius-confidence-verb | 13B | Llama2-Chat |
| confucius-multisample | 🤗 GAIR/confucius-multisample | 13B | Llama2-Chat |
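Below is a minimal sketch of loading one of these checkpoints with the Hugging Face `transformers` library. The prompt format assumes the standard Llama-2-Chat `[INST]` template, which may differ from the exact template used during alignment; check the model card before relying on it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GAIR/confucius-confidence-verb"  # or "GAIR/confucius-multisample"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumes the standard Llama-2-Chat prompt format; adjust if the model card specifies otherwise.
question = "Who was the first person to walk on the moon?"
prompt = f"[INST] {question} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```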

Case Study

The following two examples underscore the significance and vast potential of alignment for honesty.

We acknowledge that there is still significant room for improvement, particularly in areas such as calibration and generalization across different families of LLMs. We will focus on these refinements in future work.

🥳Citation

If you find our work useful, please cite our paper:

@article{yang2023alignment,
  title={Alignment for Honesty},
  author={Yang, Yuqing and Chern, Ethan and Qiu, Xipeng and Neubig, Graham and Liu, Pengfei},
  journal={arXiv preprint arXiv:2312.07000},
  year={2023}
}