Akshay Kumar Gupta, Shantanu Kumar, Surag Nair, Barun Patra
Automatic evaluation of dialogue response generation systems is a long-standing challenge in the field: most automatic metrics in common use have been shown to correlate weakly, if at all, with human judgments of dialogue quality. We propose a novel automatic evaluation method that uses a trained deep learning model to score responses. We hope that this method addresses the shortcomings of traditional evaluation metrics and aligns more closely with human scoring.