Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate such limitations. Specifically, we treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences. Candidate sentences are also compared to sentences in the source documents to allow grounding (e.g., factuality) evaluation. Our results show that system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. The latter does not use any neural model, which is useful during model development phases where resources can be limited and fast evaluation is required. Finally, we also conducted extensive analyses showing that our proposed metrics work well with longer summaries and are less biased towards specific models.
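To make the idea concrete, here is a minimal sketch of sentence-level soft matching in the spirit the abstract describes. It is not the exact SMART formulation from the paper: the string-based matching function (a token-overlap F1) and the mean-of-max precision/recall aggregation are illustrative assumptions, and all names (`token_f1`, `soft_match_score`) are hypothetical.

```python
def token_f1(a: str, b: str) -> float:
    """Assumed string-based sentence matcher: F1 over lowercased token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)

def soft_match_score(cand_sents, ref_sents, match=token_f1):
    """Soft-match candidate sentences against reference sentences.

    Precision: each candidate sentence takes its best-matching reference
    sentence; recall is the symmetric quantity; the score is their F1.
    Sentences, not tokens, are the units of matching.
    """
    if not cand_sents or not ref_sents:
        return 0.0
    precision = sum(max(match(c, r) for r in ref_sents)
                    for c in cand_sents) / len(cand_sents)
    recall = sum(max(match(r, c) for c in cand_sents)
                 for r in ref_sents) / len(ref_sents)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Grounding (e.g., factuality) reuses the same machinery with source-document
# sentences in place of the reference, as the abstract suggests.
cand = ["The company reported record profits.", "Shares rose five percent."]
ref = ["Profits hit a record high.", "The stock gained 5% after the report."]
src = ["In its quarterly report, the company posted record profits.",
       "Its shares climbed five percent in after-hours trading."]
print(f"reference match: {soft_match_score(cand, ref):.3f}")
print(f"source grounding: {soft_match_score(cand, src):.3f}")
```

Swapping `token_f1` for a model-based similarity (e.g., a sentence-embedding cosine) would give the model-based variant the abstract mentions; the string-based version needs no neural model, which is the property the abstract highlights for fast evaluation during development.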