
Attribution & Alignment

Effects of Local Context Repetition on
Utterance Production and Comprehension in Dialogue

Aron Molnar*, Jaap Jumelet^, Mario Giulianelli^, Arabella Sinclair*

* Department of Computing Science, University of Aberdeen
^ Institute for Logic, Language and Computation, University of Amsterdam

🥳 The paper will be presented at CoNLL 2023, co-located with EMNLP in Singapore! 🥳

📝 Paper PDF on arXiv


Language models are often used as the backbone of modern dialogue systems. These models are pre-trained on large amounts of fluent written language. Repetition is typically penalised when evaluating language model generations. However, it is a key component of dialogue. Humans use local and partner-specific repetitions; these are preferred by human users and lead to more successful communication in dialogue. In this study, we evaluate (a) whether language models produce human-like levels of repetition in dialogue, and (b) which processing mechanisms related to lexical re-use they rely on during comprehension. We believe that such a joint analysis of model production and comprehension behaviour can inform the development of cognitively inspired dialogue generation systems.


Please use the following format to cite this work.

  title={Attribution and Alignment: {E}ffects of Local Context Repetition on Utterance Production and Comprehension in Dialogue},
  author={Molnar, Aron and Jumelet, Jaap and Giulianelli, Mario and Sinclair, Arabella},
  booktitle={Proceedings of the 27th Conference on Computational Natural Language Learning},


🔬 Find out about other work going on at The Context Lab.


Experiment Pipeline

  • Data is split into train and test sets and pre-processed into our format of 10-utterance samples (see model_train/ and samples.tsv).

  • Trained models that we use in our experiments are available here. If training your own, the training script is

  • Generation: Models are then used to generate utterances directly in the generation scripts here using the contexts from the test samples.

  • Computing sample properties: All analysis .py files augment or label the samples in the samples.tsv file, which can then be used to analyse the human- or model-produced samples.

  • Analysis: Once all properties have been extracted, the analysis is conducted at the turn-level so that local factors can be explored and compared between human- vs. model-produced utterances.
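
As a rough illustration of the sample format used in the first step, the sketch below slices a dialogue into fixed-length excerpts. The function name `make_samples`, the sliding-window stride of one utterance, and the list-of-strings representation are assumptions made for illustration; the actual windowing and labelling logic lives in the scripts under prepare/ and may differ.

```python
from typing import List


def make_samples(utterances: List[str], context_length: int = 10) -> List[List[str]]:
    """Slice a dialogue into overlapping windows of `context_length` utterances.

    Hypothetical helper: the repository's own preparation script may
    window, filter, or label utterances differently.
    """
    if context_length < 1:
        raise ValueError("context_length must be positive")
    # One window per start position that still leaves a full-length excerpt.
    return [
        utterances[start:start + context_length]
        for start in range(len(utterances) - context_length + 1)
    ]


dialogue = [f"utterance {i}" for i in range(12)]
samples = make_samples(dialogue, context_length=10)
print(len(samples))  # 3 windows, starting at utterances 0, 1 and 2
```

Dialogues shorter than `context_length` yield no samples in this sketch, which keeps every excerpt the same length.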

Test Run

Follow the steps described below to run the whole experiment pipeline on a very small sub-set of our data (samples_mini.tsv).

Note: Running the pipeline with all of the data might take hours or days depending on the hardware configuration of your system.

🚀 If you are on Windows, the run_test_pipeline.ps1 PowerShell script will execute all the steps described below. A similar script for UNIX-like systems is coming soon.

  1. Prep (a): Download and prepare Switchboard and Map Task.

    # download data from github
    python prepare/ download
    # create samples.tsv
    python prepare/ prepare --context_length 10
  2. Prep (b): Create a sub-set for testing.

    python prepare/
  3. Generate: Generate model responses and extract attribution scores (with GPT-2 on Switchboard for testing).

    python generate/ full_attribution \
      --corpus switchboard \
      --model_id gpt2 \
      --input_file data/samples_mini.tsv \
      --output_file data/samples_mini_gpt2.tsv

    Then extract attribution scores for the human responses (human response comprehension).

    python generate/ full_attribution \
      --corpus switchboard \
      --model_id gpt2 \
      --style comprehend \
      --input_file data/samples_mini_gpt2.tsv \
      --output_file data/samples_mini_gpt2.tsv
  4. Quality: Compute generation quality metrics.

    python analysis/compute_properties/ run \
      --input_file data/samples_mini_gpt2.tsv \
      --output_file data/samples_mini_gpt2_genq.tsv \
      --corpus switchboard
  5. Constructions: Extract constructions from responses.

    python analysis/compute_properties/ \
      --input_file data/samples_mini_gpt2_genq.tsv \
      --output_file data/samples_mini_gpt2_genq_constr.tsv \
      --working_dir data/_tmp_dialign/ \
      --delete_working_dir False
  6. Surprisal: Compute surprisal of model- and human-produced responses.

    python analysis/compute_properties/ compute_surprisal \
      --input_file data/samples_mini_gpt2_genq_constr.tsv \
      --output_file data/samples_mini_gpt2_genq_constr_ppl.tsv
  7. Overlaps: Compute lexical, structural and construction overlap scores.

    python analysis/compute_properties/ \
      --input_file data/samples_mini_gpt2_genq_constr_ppl.tsv \
      --output_file data/samples_mini_gpt2_genq_constr_ppl_ol.tsv \
      --dialogues_dir data/_tmp_dialign/dialogues/switchboard/ \
      --lexica_dir data/_tmp_dialign/lexica/switchboard \
      --corpus switchboard
  8. PMI: Compute pointwise mutual information (PMI) of extracted constructions.

    python analysis/compute_properties/ \
      --input_file data/samples_mini_gpt2_genq_constr_ppl_ol.tsv \
      --output_file data/samples_mini_gpt2_genq_constr_ppl_ol_pmi.tsv \
      --dialign_output_dir data/_tmp_dialign/lexica/switchboard/ \
      --dialign_input_dir data/_tmp_dialign/dialogues/switchboard/
  9. Cleanup: Clean up the temporary working folders created during the augmentation process.

    python analysis/compute_properties/
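
Each step above reads the previous step's output and writes a new file whose name gains one suffix (`_gpt2`, `_genq`, `_constr`, `_ppl`, `_ol`, `_pmi`). The sketch below reproduces that naming chain so intermediate files can be located or checked programmatically. The helper `chained_outputs` and the `STEP_SUFFIXES` list are illustrative names, not part of the repository.

```python
from pathlib import Path

# Suffixes appended by each step, in the order used in the test run above.
STEP_SUFFIXES = ["gpt2", "genq", "constr", "ppl", "ol", "pmi"]


def chained_outputs(base: str, suffixes=STEP_SUFFIXES):
    """Return the output path of every step, given the initial samples file."""
    path = Path(base)
    stem = path.stem
    outputs = []
    for suffix in suffixes:
        # Each step keeps the earlier suffixes and adds its own.
        stem = f"{stem}_{suffix}"
        outputs.append(path.with_name(stem + path.suffix).as_posix())
    return outputs


outputs = chained_outputs("data/samples_mini.tsv")
for name in outputs:
    print(name)
```

The last entry is the fully augmented file, `data/samples_mini_gpt2_genq_constr_ppl_ol_pmi.tsv`, which is the one the PMI step writes.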

Adapting to Other Corpora

  • Other corpora can be processed and evaluated using this pipeline, following the procedure within the prepare folder.
  • Parameters of the processing and attribution scripts can be varied (e.g. to use more or fewer than 10 utterances per sample).

Evaluating Other Metrics

  • Other evaluation metrics and properties can be added to the final turn-level tsv to investigate other factors which may contribute to repetition.
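
A minimal sketch of adding such a property, assuming the turn-level file is a tab-separated table with one row per sample. The column names `context` and `response`, the helper `add_property`, and the example metric (response length in tokens) are hypothetical; the real schema of samples.tsv may differ.

```python
import csv
import io


def add_property(tsv_text: str, column: str, compute) -> str:
    """Append `column` to every row of a TSV, filled by `compute(row)`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + [column]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[column] = compute(row)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical column names; the real samples.tsv schema may differ.
example = "context\tresponse\nhow are you\ti am fine thanks\n"
augmented = add_property(
    example, "response_length", lambda r: len(r["response"].split())
)
print(augmented)
```

Writing the new property as an extra column keeps the existing columns untouched, so later analysis steps that read the earlier properties should continue to work.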

Repository Structure

This repository is structured as follows.


The data/ folder contains all of our prepared and generated data used and created during evaluations.

  • The model_train/ folder contains Switchboard and Map Task data prepared for LLM training. Contents of this folder can be re-generated with
  • The samples.tsv file contains prepared dialogue excerpts (samples). This file is used for all of our evaluations. This file can be re-generated with


The prepare/ folder contains scripts that prepare data and models for analysis.


The generate/ folder contains scripts that generate the data we use in our evaluations.

  • generates responses to dialogue excerpts (samples) and extracts raw attribution scores. The script can also extract attribution scores while comprehending human-produced responses to dialogue excerpts.
  • implements our attribution aggregation algorithm. It takes in raw attribution matrices extracted with Inseq during generation and transforms them to utterance-level attribution scores.
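
To give an idea of what aggregating token-level attributions to utterance level can look like, here is a simplified sketch. It is not the repository's algorithm: the aggregation scheme (summing absolute scores per utterance span, then normalising), the function name `aggregate_to_utterances`, and the plain nested-list matrix are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def aggregate_to_utterances(
    attributions: Sequence[Sequence[float]],
    utterance_spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Collapse a (context tokens x generated tokens) matrix to one score per utterance.

    Assumed scheme: sum absolute token-level scores inside each utterance
    span, then normalise so that the utterance scores sum to 1.
    """
    totals = []
    for start, end in utterance_spans:  # `end` is exclusive
        totals.append(sum(abs(v) for row in attributions[start:end] for v in row))
    grand_total = sum(totals)
    if grand_total == 0:
        return [0.0 for _ in totals]
    return [t / grand_total for t in totals]


# Three context tokens (rows), two generated tokens (columns).
matrix = [
    [0.1, 0.1],
    [0.2, 0.0],
    [0.3, 0.3],
]
spans = [(0, 2), (2, 3)]  # utterance 1 covers tokens 0-1, utterance 2 covers token 2
scores = aggregate_to_utterances(matrix, spans)
print(scores)
```

In this toy example the second utterance receives the larger share of the attribution mass, even though it is a single token long.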


The analysis/ folder contains scripts and notebooks that analyse, evaluate and enrich (e.g. with construction annotations) the prepared and generated data.

  • compute_properties/ contains scripts to extract the properties we examine and use in our analysis and evaluation

    • contains helper functions for string operations.
    • computes perplexities of LLM-generated and human responses.
    • computes metrics (BERTScore, BLEU and MAUVE) with which we aim to gauge the relative quality of the LLM-generated responses.
    • extracts constructions (i.e. shared word sequences) from LLM-generated and human responses with Dialign.
    • computes lexical, structural and construction overlap scores between responses and each utterance of the context (dialogue excerpt).
    • computes the pointwise mutual information (PMI) of extracted constructions. Implementation adopted from here.
  • scripts/ contains notebooks that combine the data and extracted properties for analysis.
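
Two of the properties listed above, lexical overlap and PMI, can be sketched in a few lines. The definitions used here are generic textbook versions and are assumptions: whitespace tokenisation, overlap measured over unique response tokens, and PMI computed from raw co-occurrence counts. The repository's scripts may tokenise, count, and normalise differently.

```python
import math


def lexical_overlap(response: str, utterance: str) -> float:
    """Fraction of unique response tokens that also occur in the utterance."""
    response_tokens = set(response.lower().split())
    utterance_tokens = set(utterance.lower().split())
    if not response_tokens:
        return 0.0
    return len(response_tokens & utterance_tokens) / len(response_tokens)


def pmi(joint_count: int, x_count: int, y_count: int, total: int) -> float:
    """Pointwise mutual information in bits: log2(p(x, y) / (p(x) * p(y)))."""
    if min(joint_count, x_count, y_count, total) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = joint_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


overlap = lexical_overlap("i think so too", "i think it is fine")
print(overlap)  # 0.5: "i" and "think" are shared, "so" and "too" are not

score = pmi(joint_count=10, x_count=20, y_count=50, total=1000)
print(score)
```

A positive PMI means the two events co-occur more often than expected under independence; in the example the pair co-occurs about ten times more often than chance.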


Creative Commons.

