
Alignment for Honesty

This is the official repository for Alignment for Honesty.

🚀Overview

Within the "HHH" alignment principle, considerable effort has gone into making LLMs more helpful and harmless, while honesty has received comparatively little attention. In this work, we define the honesty of an LLM as its ability to proactively refuse to answer questions when it lacks the relevant knowledge, without being overly conservative, as illustrated in the following figure. In this way, alignment for honesty can mitigate hallucinations and enhance the trustworthiness of LLMs without resorting to external resources.

Illustration of Alignment for Honesty. Given a knowledge-intensive question, an aligned model is expected to provide the correct answer if it has knowledge of the question, or alternatively, refuse to answer the question.

📖Resources

Data

We provide the training and evaluation data, as well as the processing code in data. Please refer to the corresponding README for more information.

Train

We provide the code for processing training data following our proposed honesty-oriented supervised fine-tuning methods in train. An overview of the alignment strategies is shown in the following figure.

Please note that our resources do not include code for full-parameter fine-tuning of LLMs. We use CoLLiE in the paper; however, you are free to choose any LLM training repository that suits your preferences.
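For intuition, here is a minimal, hypothetical sketch of honesty-oriented SFT data construction: questions the model is judged to know keep their gold answers, while questions it is judged not to know are mapped to an explicit refusal. The `expected_accuracy` field, the threshold, the refusal template, and the field names are illustrative assumptions, not the exact scheme used in the paper or in train.

```python
import json

# Illustrative refusal template (an assumption; the actual training code may use different wording).
IDK_RESPONSE = "I apologize, but I'm not able to provide a confident answer to this question."

def build_honesty_sft_data(samples, k_threshold=0.5):
    """Turn (question, gold_answer, expected_accuracy) records into SFT pairs.

    `expected_accuracy` is assumed to be the fraction of sampled model responses
    judged correct, used as a proxy for whether the model "knows" the answer.
    """
    sft_data = []
    for s in samples:
        knows = s["expected_accuracy"] >= k_threshold
        sft_data.append({
            "instruction": s["question"],
            # Keep the gold answer for "known" questions, refuse otherwise.
            "output": s["gold_answer"] if knows else IDK_RESPONSE,
        })
    return sft_data

if __name__ == "__main__":
    demo = [
        {"question": "Who wrote 'Pride and Prejudice'?", "gold_answer": "Jane Austen", "expected_accuracy": 0.9},
        {"question": "What did the author eat on 7 March 1811?", "gold_answer": "Unknown", "expected_accuracy": 0.0},
    ]
    print(json.dumps(build_honesty_sft_data(demo), indent=2))
```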

Evaluation

In the paper, we measure the performance of aligned models on several datasets, including the public datasets TriviaQA, Non-AmbigQA, and MMLU, as well as PUQA and PKQA, two datasets constructed by ourselves. Detailed evaluation code can be found in evaluation.
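As a rough illustration of the bookkeeping involved, the sketch below classifies each model response as answered correctly, answered wrongly, or refused, and reports the corresponding rates. The refusal-detection heuristic and the metric names are assumptions for illustration only, not the exact implementation in evaluation.

```python
from collections import Counter

# Illustrative refusal markers (an assumption; the evaluation code may detect refusals differently).
REFUSAL_MARKERS = ("i'm not able to", "i don't know", "i am not sure", "i apologize")

def classify_response(response, gold_answer):
    """Label a single response as 'refused', 'correct', or 'wrong'."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refused"
    return "correct" if gold_answer.lower() in text else "wrong"

def summarize(records):
    """records: iterable of (model_response, gold_answer) pairs."""
    counts = Counter(classify_response(r, g) for r, g in records)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("correct", "wrong", "refused")}

if __name__ == "__main__":
    preds = [
        ("Jane Austen wrote it.", "Jane Austen"),
        ("I don't know the answer to that.", "Charles Dickens"),
        ("It was Mark Twain.", "Charles Dickens"),
    ]
    print(summarize(preds))
```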

👴Confucius

To say "I know" when you know, and "I don't know" when you don't, that is wisdom. — The Analects of Confucius

The two best honesty-aligned models are now available on huggingface-hub:

| Model Name | HF Checkpoint | Size | License |
| --- | --- | --- | --- |
| confucius-confidence-verb | 🤗 GAIR/confucius-confidence-verb | 13B | Llama2-Chat |
| confucius-multisample | 🤗 GAIR/confucius-multisample | 13B | Llama2-Chat |
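Below is a minimal sketch of loading one of these checkpoints with the Hugging Face `transformers` library. The prompt format assumes the standard Llama-2-Chat `[INST]` template, which may differ from the exact template used during alignment; check the model card before relying on it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GAIR/confucius-confidence-verb"  # or "GAIR/confucius-multisample"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumes the standard Llama-2-Chat prompt format; adjust if the model card specifies otherwise.
question = "Who was the first person to walk on the moon?"
prompt = f"[INST] {question} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```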

Case Study

The following two examples underscore the significance and vast potential of alignment for honesty.

We acknowledge that there is still significant room for improvement, particularly in areas such as calibration and generalization across different families of LLMs. We will focus on these refinements in future work.

🥳Citation

If you find our work useful, please cite our paper:

@article{yang2023alignment,
  title={Alignment for Honesty},
  author={Yang, Yuqing and Chern, Ethan and Qiu, Xipeng and Neubig, Graham and Liu, Pengfei},
  journal={arXiv preprint arXiv:2312.07000},
  year={2023}
}