FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

This repository hosts the code and pre-trained models for our paper FaithDial: A Faithful Benchmark for Information-Seeking Dialogue. It also hosts the data annotations for our NAACL paper On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?. For more information, please visit the project page.


Overview

The goal of information-seeking dialogue is to respond to user queries with natural language utterances that are grounded on knowledge sources. Dialogue systems, however, often hallucinate, i.e., generate unsupported utterances, as they amplify the noise found in existing training datasets. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues. Annotators were asked to edit the hallucinated utterances in a pre-existing dataset to ensure they are faithful to the knowledge sources, and to re-purpose the role of the interlocutor from a human wizard to a domain-expert bot.

Data

The dataset is hosted on Huggingface's datasets:

from datasets import load_dataset

dataset = load_dataset("McGill-NLP/FaithDial")
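
Each record follows the layout described in the Data Format section below. As a quick sanity check, you can list the splits and inspect one example (a minimal sketch; the exact split names and field layout are best confirmed against the dataset card):

from datasets import load_dataset

dataset = load_dataset("McGill-NLP/FaithDial")

# List the available splits and their sizes.
print(dataset)

# Inspect a single record; its fields mirror the Data Format section below
# (history, knowledge, response, and optionally original_response, BEGIN, VRM).
print(dataset["train"][0])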

Use with Huggingface

We'll release our fine-tuned models soon! Stay tuned!

Train Your Models

The code for all the models in the paper is available in models, which can be used to reproduce our results or to train your own models.

Requirements

First, install PyTorch 1.7+ from the official website, then clone this repository and install the dependencies:

git clone git@github.com:McGill-NLP/FaithDial.git
cd FaithDial
pip install -r requirements.txt

Our code has been tested with Python 3.8 and PyTorch 1.7.1 with CUDA 11.0.
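
If you want to verify your environment against this setup, a quick check (a minimal sketch that only reads the installed versions) is:

import sys
import torch

# Print the interpreter and PyTorch/CUDA versions for comparison with the tested setup.
print(f"Python: {sys.version.split()[0]}")    # tested with 3.8
print(f"PyTorch: {torch.__version__}")        # tested with 1.7.1
print(f"CUDA: {torch.version.cuda}, available: {torch.cuda.is_available()}")  # tested with 11.0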

Data Format

By default, our code loads data from Huggingface's datasets, but you can also provide your own data in the following format:

[
  {
    "utterances": [
      ... // prior utterances, 
      {
        "history": [
          "Have you ever been to a concert? They're so fun!",
          "No I cannot as a bot. However, have you been to Madonna's? Her 10th concert was used to help her 13th album called \"Rebel Heart\".",
          "Yeah I've heard of it but never went or what it was for. Can you tell me more about it?"
        ],
        "speaker": "Wizard",
        "knowledge": "It began on September 9, 2015, in Montreal, Canada, at the Bell Centre and concluded on March 20, 2016, in Sydney, Australia at Allphones Arena.",
        "original_response": "It started in September of 2015 and ran all the way through March of 2016. Can you imagine being on the road that long?",
        "response": "Sure. The concert started in September 9th of 2015 at Montreal, Canada. It continued till 20th of March of 2016, where it ended at Sydney, Australia.",
        "BEGIN": [
          "Hallucination",
          "Entailment"
        ],
        "VRM": [
          "Disclosure",
          "Question"
        ]
      }, 
      ... // more utterances
    ]
  }, 
  ... // more dialogues
]

In the above example, original_response, BEGIN, and VRM are optional and don't have to be provided for your own data.
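
If you build your own data programmatically, here is a minimal sketch that writes a single dialogue in this format (the file name and example strings are placeholders; the resulting file can be passed via --train_dataset_path or --eval_dataset_path, described under Training below):

import json

# One dialogue with a single grounded utterance; original_response, BEGIN,
# and VRM are optional and omitted here.
dialogues = [
    {
        "utterances": [
            {
                "history": [
                    "Have you ever been to a concert? They're so fun!"
                ],
                "speaker": "Wizard",
                "knowledge": "It began on September 9, 2015, in Montreal, Canada, at the Bell Centre.",
                "response": "Sure. The concert started on September 9th of 2015 in Montreal, Canada."
            }
        ]
    }
]

with open("my_train_data.json", "w") as f:
    json.dump(dialogues, f, indent=2)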

Training

Here is how to train a model:

python models/dialog.py --model_name_or_path t5-base \
  --do_train \
  --output_dir /path/to/output_dir \
  --fp16 \
  --train_batch_size 16 \
  --num_train_epochs 10 \
  --warmup_ratio 0.04 \
  --max_seq_length 512

To run on multiple GPUs, set CUDA_VISIBLE_DEVICES. By default, early stopping is applied and the best model is saved at /path/to/output_dir/best_model.

Other arguments for training are as follows:

  • --learning_rate: Initial learning rate for Adam.
  • --gradient_accumulation_steps: Number of steps to accumulate gradient before performing a backward/update pass.
  • --enable_infonce: Whether to use the InfoNCE model. Note that negative_samples must be present in the input data for contrastive learning. Also, --fp16 should not be set.
  • --max_negative_samples: The number of negative samples per training example (Works only when InfoNCE is enabled).
  • --inbatch_negatives: Whether to use inbatch negative sampling (Works only when InfoNCE is enabled).
  • --loss_truncation: Whether to use loss truncation.
  • --ctrl: Whether to use controlled generation. Note that control_tokens must be present in the input data. To learn about how to compute control tokens, see here.
  • --train_dataset_path (optional): Path to your own training dataset.
  • --eval_dataset_path (optional): Path to your own validation dataset.

For a complete list of arguments, take a look at models/dialog.py and models/lightning_base.py.

Evaluation

To compute perplexity of a model on the validation data, simply run:

python models/dialog.py --model_name_or_path /path/to/model/best_model \
  --do_eval \
  --eval_batch_size 16

For the test data, --do_eval should be replaced with --do_test. Note that evaluation should be run on a single GPU.

To compute the other metrics reported in the paper (BLEU, ROUGE, F1, BERTScore, and Q^2), we used the scripts provided in https://github.com/orhonovich/q-squared.
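
Purely as an illustration (this is not one of the scripts used in the paper), a token-level F1 between a generated response and a reference can be computed as follows:

from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated response and a reference string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("The concert started in 2015.", "It began on September 9, 2015."))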

Generation

To generate a response, simply run:

python models/generate.py --model_name_or_path /path/to/model/best_model --do_sample --top_p 0.6

Arguments for generation are as follows:

  • --output (optional): Path of the output directory to save the generated responses.
  • --dataset_path (optional): Path to your own dataset.
  • --control_tokens (optional): Control tokens, prepended to the sequence, for controlled generation.
  • --max_length (default: 100): Maximum length of the generated sequence.

For a complete list of arguments, refer to models/generate.py.
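
If you prefer to call a fine-tuned checkpoint directly from Python instead of going through models/generate.py, the sketch below shows generic nucleus sampling with transformers (the checkpoint path and the way knowledge and history are concatenated into a single input string are assumptions, not the repository's exact preprocessing):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path to a checkpoint produced by models/dialog.py.
model_path = "/path/to/model/best_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Illustrative input only; the repository's scripts define the actual formatting.
text = "It began on September 9, 2015, in Montreal, Canada. Can you tell me more about it?"
inputs = tokenizer(text, return_tensors="pt")

# Nucleus sampling, mirroring --do_sample --top_p 0.6 above.
outputs = model.generate(**inputs, do_sample=True, top_p=0.6, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))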

Critic

We also use our collected data to frame the problem of identifying hallucination as a binary classification task where the goal is to predict whether an utterance is faithful or not, given the source knowledge. We call this model FaithCritic.

Huggingface

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("McGill-NLP/roberta-large-faithcritic")
model = AutoModelForSequenceClassification.from_pretrained("McGill-NLP/roberta-large-faithcritic")

knowledge = "A cardigan is a type of knitted garment (sweater) that has an open front."
response = "The old version is the regular one, knitted garment that has open front and buttons!"
inputs = tokenizer(knowledge, response, return_tensors="pt")
print(torch.argmax(model(**inputs).logits))
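
To score several knowledge-response pairs at once and obtain class probabilities instead of a hard label, a small sketch is shown below (check the model card for which class index corresponds to a hallucinated response):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("McGill-NLP/roberta-large-faithcritic")
model = AutoModelForSequenceClassification.from_pretrained("McGill-NLP/roberta-large-faithcritic")

knowledge = [
    "A cardigan is a type of knitted garment (sweater) that has an open front.",
    "A cardigan is a type of knitted garment (sweater) that has an open front.",
]
responses = [
    "A cardigan is a knitted sweater with an open front.",
    "Cardigans were first worn on the moon.",
]

# Tokenize as (knowledge, response) pairs and compute class probabilities.
inputs = tokenizer(knowledge, responses, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)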

Training

python models/critic.py --model_name_or_path roberta-large --do_train --train_batch_size 16 \
    --learning_rate 1e-5 --weight_decay 0.1 --warmup_ratio 0.08 --pad_to_multiple_of 8 --fp16 \
    --output_dir /path/to/output

Testing

python models/critic.py --model_name_or_path /path/to/model --eval_batch_size 16 --do_test

To test on other datasets, you need to pass --test_task {BEGIN|MNLI}. For BEGIN and MNLI, --test_dataset_path is required and can be downloaded from here and here, respectively. For MNLI, it is possible to use the version that is hosted on 🤗 Datasets by not passing --test_dataset_path, but the results would be slightly different.

Bugs or questions?

If you have any questions (:question:) related to the code, or encounter any problems (:hammer_and_wrench:), or want to report a bug (:bug:), feel free to open an issue.

Citation

If you want to cite our papers, please use:

@article{dziri2022faithdial,
  title = "{FaithDial: A Faithful Benchmark for Information-Seeking Dialogue}",
  author = {Dziri, Nouha and Kamalloo, Ehsan and Milton, Sivan and Zaiane, Osmar and Yu, Mo and Ponti, Edoardo M and Reddy, Siva},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {10},
  pages = {1473--1490},
  year = {2022},
  month = {12},
  publisher = {MIT Press},
  doi={10.1162/tacl_a_00529}
}

and

@inproceedings{dziri2022origin,
  title = "On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?",
  author = {Dziri, Nouha and Milton, Sivan and Yu, Mo and Zaiane, Osmar and Reddy, Siva},
  booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  year = {2022},
  pages = "5271--5285",
  address = "Seattle, United States",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.naacl-main.387"
}

Bibkey in aclanthology: dziri-etal-2022-origin.

License

This work is licensed under the MIT license. See LICENSE for details.
