This repository contains the code and data for our reproduction paper for the 2022 ReproGen shared task, in which we attempted to exactly reproduce the work of Santhanam and Shaikh (2019). Parts of the data and code here are reproductions of their work. See the README in each directory for more precise instructions.
To calculate the automatic metrics, we used the stimuli from the original paper (Santhanam & Shaikh, 2019; https://github.com/sashank06/INLG_eval), which were also used for the participant surveys. See the directory for more information on the automatic metrics.
This directory contains the data we obtained throughout the experiment.
Code in this directory explores the reliability of our participants' scores.
This directory contains the PDF and QSF files of our Qualtrics experiment. The stimuli we used were provided by Santhanam & Shaikh (2019).
The statistics directory contains the code needed to run the statistical analyses. These scripts are either taken directly from Santhanam & Shaikh (2019) or slightly adapted.
Santhanam, S., & Shaikh, S. (2019). Towards Best Experiment Design for Evaluating Dialogue System Output. In Proceedings of the 12th International Conference on Natural Language Generation, pages 88–94, Tokyo, Japan. Association for Computational Linguistics.
The parts of the code that we wrote (the outlier code, parts of the automatic-metrics code, and parts of the statistics code) fall under an MIT license. Parts adopted from others fall under their respective licenses. All stimuli and parts of the code are taken from https://github.com/sashank06/INLG_eval.