# Abstractive Summarizers Get Emotional on News Summarization


# Introduction

The objective is to analyze whether abstractive models can generate summaries that exhibit emotions similar to those a human would express in their summaries, and to attempt to improve the models in this respect.

The following figure shows a subset of sentences from an article about a plane crash in Canada. Some of these sentences are used in the reference summary. Which ones would you say?


<br>
<details>
    <summary><u><b>Click here to display the figure with the text</b></u></summary>
<img src="https://iili.io/RGPLP4.png" width="500" height="500"/>

</details>

<br>
<details>
    <summary><u><b>Click here to display the sentences that appear in the human summary</b></u></summary>
<img src="https://iili.io/RGPsVf.png" width="500" height="500"/>

</details>

<br>

On this topic, we only found [[1]](https://www.cs.utoronto.ca/~akennedy/publications/summarization_cai_2012.pdf) (see also [its slides](https://pdfs.semanticscholar.org/55da/aeff6d8684d65fb8981afe8e89a655f54f54.pdf)), which served as our foundation (framing, some formulas, etc.).


> News is not simply a straight re-telling of events, but rather an interpretation of those events by a reporter, whose feelings and opinions can often become part of the story itself. The emotion of a story is also important to its meaning.

> The human summarizers appear to favour emotional content when generating summaries. It is difficult to explain precisely why this is so; perhaps some aspects of the queries are more naturally answered with emotion.
 
We analyzed four aspects with two corpora (CNNDM and XSUM) and several models (**Abstractive**: BART, PEGASUS, T5, BART-JES, and BART-JES-Oracle. **Extractive**: Lead, Random, and Extractive Oracle):

* [**Emotional Coherence**](#coherenciaysesgo): Are the emotions in the generated summaries coherent with the emotions in the references?
* [**Emotional Bias**](#coherenciaysesgo): The tendency of humans to [over/under]represent certain emotions in the summaries relative to the documents [[1]](https://www.cs.utoronto.ca/~akennedy/publications/summarization_cai_2012.pdf). Do the models exhibit an emotional bias similar to that of humans, i.e., do they reinforce [more/less] the emotions that humans reinforce [more/less] in the summaries than in the documents?
* [**ROUGE Correlation**](#rougecorrel): How do coherence and emotional bias metrics correlate with ROUGE?

* [**Emotions of novel words**](#alucinacion): Are the emotions of the novel words in the generated summaries coherent with the emotions in the references?

All of the analyses rely on the **Emotion Density** (ED) and the **Emotion Ratio** between (reference, article) or (generated, article), which we call $R_{M/D}$ and $R_{G/D}$ respectively [[1]](https://www.cs.utoronto.ca/~akennedy/publications/summarization_cai_2012.pdf):

<img src="https://iili.io/RGirgV.png" width="600" height="200"/>
<br>
<img src="https://iili.io/RGivWb.png" width="600" height="200"/>


The authors of [[1]](https://www.cs.utoronto.ca/~akennedy/publications/summarization_cai_2012.pdf) use only $ED_M$, $ED_D$, and $R_{M/D}$ in their analysis, but we also need $ED_G$ and $R_{G/D}$ (computed from the generated summaries instead of the references).
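As a minimal sketch of how these quantities relate, the snippet below computes a density per label and the summary/document ratio on a toy example. It loosely mirrors the normalization used later in `compute_state_dict` (label counts divided by the total count, with a `none` bucket for words without emotions), but the lexicon, the word lists, and the function names `emotion_density` and `emotion_ratio` are hypothetical and only for illustration; the notebook itself relies on the NRC lexicon.

```python
from collections import Counter

# Hypothetical toy lexicon (word -> emotions); NOT the NRC lexicon.
TOY_LEXICON = {
    "crash": ["fear", "sadness"],
    "dead": ["fear", "sadness"],
    "rescue": ["trust", "joy"],
    "hope": ["anticipation", "joy"],
}
LABELS = ["fear", "sadness", "trust", "anticipation", "joy", "none"]


def emotion_density(words, lexicon=TOY_LEXICON, labels=LABELS):
    """Share of each label among all label counts of a text."""
    counts = Counter({label: 0 for label in labels})
    for word in words:
        emotions = lexicon.get(word.lower(), [])
        if emotions:
            counts.update(emotions)
        else:
            # Words without any emotion go to the "none" bucket.
            counts["none"] += 1
    total = sum(counts.values())
    return {label: (counts[label] / total if total else 0.0) for label in labels}


def emotion_ratio(ed_summary, ed_document):
    """Ratio of summary density to document density (0 if absent in the document)."""
    return {
        label: (ed_summary[label] / ed_document[label] if ed_document[label] > 0 else 0.0)
        for label in ed_document
    }


document = "the plane crash left two dead but the rescue team kept hope".split()
summary = "plane crash left two dead".split()

ed_d = emotion_density(document)   # plays the role of ED_D
ed_m = emotion_density(summary)    # plays the role of ED_M (or ED_G for a generated summary)
r_md = emotion_ratio(ed_m, ed_d)   # plays the role of R_{M/D}
print(r_md)
```

In this toy case the ratio for `fear` is above 1 (the summary concentrates on the fearful words) while the ratio for `none` is below 1, which is the kind of pattern the ratios are meant to expose.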

To detect words with emotions/polarity, we use the lemmatized NRC lexicon and the [NRCLex](https://github.com/metalcorebear/NRCLex) library.

We use Pearson correlations throughout the analysis, except for **Emotions of novel words**, where we use Precision, Recall, and $F_1$.
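For the novel-word analysis, the scores are computed over sets of emotions rather than over words. The following is a small sketch of that computation, following the logic of `EmotionalHallucinationMetric.compute` further down; the function name `emotion_prf` and the example sets are assumptions made for illustration.

```python
def emotion_prf(novel_word_emotions, reference_emotions):
    """Set-based precision, recall, and F1 between two collections of emotions.

    Returns None when either set is empty, mirroring the notebook's choice
    of discarding such samples.
    """
    novel = set(novel_word_emotions)
    ref = set(reference_emotions)
    if not novel or not ref:
        return None
    overlap = len(novel & ref)
    precision = overlap / len(novel)
    recall = overlap / len(ref)
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return precision, recall, f1


# Emotions of the novel words in a generated summary vs. emotions in the reference.
scores = emotion_prf({"fear", "joy"}, {"fear", "sadness", "anger"})
print(scores)
```

Here one of the two novel-word emotions appears in the reference (precision 1/2) and one of the three reference emotions is covered (recall 1/3).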

This notebook also contains ideas to try to [**improve the models**](#modelado) and several [**examples**](#visualizaciones). To improve the models, we propose "Joint Emotion and Summary generation" (JES), in the spirit of [[2]](https://arxiv.org/pdf/2104.07606.pdf) and [[3]](https://arxiv.org/abs/2102.09130), but using prefixes of emotional words instead of entities. In our experiments, we implemented JES with BART (BART-JES). BART-JES generally performs worse than BART-CNNDM in terms of both emotions and ROUGE.
However, it has the advantage of being controllable: we can steer the emotions/content of the generated summaries with keywords that trigger those emotions (there are two examples [**here**](#controlled)). Additionally, conditioning the summary on good prompts improves the results significantly on all metrics (BART-JES-Oracle).
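To illustrate the idea of controlling generation through a forced prefix, here is a toy sketch that does not use any real model. It is a simplified analogue of the `prefix_allowed_tokens_fn` cell below: while the decoder is still inside the prefix, only the next prefix token is allowed; afterwards decoding is unconstrained. The names `make_prefix_fn`, `greedy_decode`, and `toy_score`, as well as the integer "vocabulary", are hypothetical, and the sketch omits the special-token handling of the real function.

```python
def make_prefix_fn(prefix_ids):
    """Build a function returning the allowed next tokens (None = no constraint)."""

    def allowed_tokens(generated_ids):
        # generated_ids[0] is assumed to be the decoder start token, so the
        # token produced at this step corresponds to prefix_ids[step - 1].
        step = len(generated_ids)
        if step - 1 < len(prefix_ids):
            return [prefix_ids[step - 1]]
        return None

    return allowed_tokens


def greedy_decode(score_fn, allowed_fn, start_id, eos_id, vocab_size, max_len=10):
    """Greedy decoding that honours the constraint function at every step."""
    ids = [start_id]
    while len(ids) < max_len:
        allowed = allowed_fn(ids)
        candidates = allowed if allowed is not None else range(vocab_size)
        next_id = max(candidates, key=lambda tok: score_fn(ids, tok))
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_score(generated_ids, token):
    # Toy "model": always prefers the successor of the last generated token.
    return -abs(token - (generated_ids[-1] + 1))


free = greedy_decode(toy_score, make_prefix_fn([]), start_id=0, eos_id=5, vocab_size=6)
forced = greedy_decode(toy_score, make_prefix_fn([3, 4]), start_id=0, eos_id=5, vocab_size=6)
print(free)    # [0, 1, 2, 3, 4, 5]
print(forced)  # [0, 3, 4, 5]
```

The forced prefix changes what the toy model generates afterwards, which is the mechanism that makes the keyword prefix a control handle.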

## Related Work


https://www.cs.utoronto.ca/~akennedy/publications/summarization_cai_2012.pdf

https://link.springer.com/chapter/10.1007/978-81-322-2523-2_39

https://link.springer.com/chapter/10.1007/978-981-13-0589-4_7

# Code

In [1]:
import pickle as pkl
import numpy as np
import spacy
import json
from nltk import word_tokenize
from copy import copy
from scipy.stats import pearsonr, spearmanr
from collections import Counter
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
from generation_hyperparameters import generation_hyperparameters
from IPython.display import display, HTML
from functools import partial
import torch



In [2]:
nlp = spacy.load("en_core_web_sm")

emotion_labels = [
    "fear",
    "anger",
    "anticipation",
    "trust",
    "surprise",
    "sadness",
    "disgust",
    "joy",
    "none",
]

sentiment_labels = [
    "negative",
    "positive",
    "none"
]



In [3]:
class EmotionDistributionMetrics:
    @staticmethod
    def compute_state_dict(dataset, labels):
        """
        For each example in the dataset, calculate ED_M(E_i) [reference],
        ED_D(E_i) [document], and ED_G(E_i) [generated] as described in
        https://www.cs.utoronto.ca/~akennedy/publications/summarization_cai_2012.pdf.
        """
        n_samples = len(dataset["nrclex_documents"])
        state_dict = {}
        for key in ["ed_m", "ed_d", "ed_g", "ratio_m-d", "ratio_g-d"]:
            state_dict[key] = [
                {label: 0 for label in labels} for _ in range(n_samples)
            ]

        for i in range(n_samples):
            # Compute densities
            for key1, key2 in [
                ("ed_m", "nrclex_ref_summaries"),
                ("ed_d", "nrclex_documents"),
                ("ed_g", "nrclex_gen_summaries"),
            ]:
                # Fill the raw counts of every label found in the text
                state_dict[key1][i] = {
                    **state_dict[key1][i],
                    **{
                        label: dataset[key2][i].raw_emotion_scores[label]
                        for label in labels
                        if label in dataset[key2][i].raw_emotion_scores
                    },
                }

                # Fill the count of "none": words that carry none of the labels
                n_total_words = len(dataset[key2][i].words)
                n_words_with_label = len(
                    [
                        word
                        for word in dataset[key2][i].affect_dict
                        if len(
                            set(labels).intersection(
                                set(dataset[key2][i].affect_dict[word])
                            )
                        )
                        >= 1
                    ]
                )
                state_dict[key1][i]["none"] = n_total_words - n_words_with_label

                # Normalize to compute ED
                denom = sum([state_dict[key1][i][label] for label in labels])
                state_dict[key1][i] = {
                    label: label_freq / denom if denom > 0 else 0
                    for label, label_freq in state_dict[key1][i].items()
                }

            # Compute R_{M/D} and R_{G/D}
            for ratio, summ_key, doc_key in [
                ("ratio_m-d", "ed_m", "ed_d"),
                ("ratio_g-d", "ed_g", "ed_d"),
            ]:
                state_dict[ratio][i] = {
                    label: state_dict[summ_key][i][label]
                    / state_dict[doc_key][i][label]
                    if state_dict[doc_key][i][label] > 0
                    else 0
                    for label in labels
                }

        return state_dict

    @staticmethod
    def emotion_correlation(
        state_dict, labels, func="pearson", mode="densities"
    ):
        """
        Calculate the Pearson or Spearman correlation between:
        - Ratios: the ratios R_{M/D} and R_{G/D} for each emotion.
        - Densities: the densities ED_M and ED_G for each emotion.
        """
        corrs = {label: 0 for label in labels}

        if mode == "densities":
            key_ref = "ed_m"
            key_gen = "ed_g"
        elif mode == "ratios":
            key_ref = "ratio_m-d"
            key_gen = "ratio_g-d"
        else:
            raise NotImplementedError(f"Unknown mode: {mode}")

        func = pearsonr if func == "pearson" else spearmanr
        for label in labels:
            ref_emo = []
            gen_emo = []
            for i in range(len(state_dict[key_ref])):
                ref_emo.append(state_dict[key_ref][i][label])
                gen_emo.append(state_dict[key_gen][i][label])
            corrs[label] = func(ref_emo, gen_emo)

        return corrs

    @staticmethod
    def emotion_summary_metric_correlation(
        state_dict,
        dataset,
        emo_metric_func="pearson",
        mode="densities",
        correlation_func="pearson",
        summary_metric="rouge",
    ):
        """
        Calculate the Pearson or Spearman correlation between the emotional metric
        (computed from densities or ratios) and another summary evaluation metric:
        ROUGE (average of the R1, R2, and RL F1 scores) or BERTScore.
        - Densities: correlation_func(emo_metric_func(ED_M[i], ED_G[i]), ROUGE[i])
        - Ratios: correlation_func(emo_metric_func(R_{M/D}[i], R_{G/D}[i]), ROUGE[i])
        - emo_metric_func: correlation between ED_M[i] and ED_G[i] (or R_{M/D}[i] and R_{G/D}[i])
        - correlation_func: correlation between the emotional metric and ROUGE
        """

        if mode == "densities":
            key_ref = "ed_m"
            key_gen = "ed_g"
        elif mode == "ratios":
            key_ref = "ratio_m-d"
            key_gen = "ratio_g-d"
        else:
            raise NotImplementedError(f"Unknown mode: {mode}")

        emo_metric_func = (
            pearsonr if emo_metric_func == "pearson" else spearmanr
        )
        correlation_func = (
            pearsonr if correlation_func == "pearson" else spearmanr
        )

        emo_scores = []
        summary_scores = []

        for i in range(len(state_dict[key_ref])):
            emo_dist_ref = list(state_dict[key_ref][i].values())
            emo_dist_gen = list(state_dict[key_gen][i].values())
            emo_scores.append(emo_metric_func(emo_dist_ref, emo_dist_gen)[0])
            if summary_metric == "rouge":
                summary_scores.append(
                    (
                        dataset["rouge_scores"]["rouge1_f1"][i]
                        + dataset["rouge_scores"]["rouge2_f1"][i]
                        + dataset["rouge_scores"]["rougeLsum_f1"][i]
                    )
                    / 3.0
                )
            else:
                summary_scores.append(dataset["bert_scores"]["f1"][i])

        emo_scores = np.nan_to_num(emo_scores)
        summary_scores = np.nan_to_num(summary_scores)

        return correlation_func(emo_scores, summary_scores)


In [4]:
class EmotionalHallucinationMetric:
    """
    Metric to compute Emotions of novel words.
    """

    @staticmethod
    def compute(dataset, lemmatized=False):
        outputs = {"precision": [], "recall": [], "f1": []}
        n_samples = len(dataset["documents"])

        for i in range(n_samples):
            # Pick emotions in the reference summary.
            ref_emotions = set(dataset["nrclex_ref_summaries"][i].affect_list)

            # Pick emotions of the novel words in the generated summary.
            if not lemmatized:
                novel_words = list(
                    set(
                        [
                            word.lower()
                            for word in dataset["nrclex_gen_summaries"][
                                i
                            ].affect_dict.keys()
                        ]
                    ).difference(
                        set(
                            [
                                word.lower()
                                for word in dataset["nrclex_documents"][i].words
                            ]
                        )
                    )
                )
            else:
                novel_words = list(
                    set(
                        [
                            word.lower()
                            for word in word_tokenize(
                                dataset["gen_summaries"][i]
                            )
                        ]
                    ).difference(
                        set(
                            [
                                word.lower()
                                for word in word_tokenize(
                                    dataset["documents"][i]
                                )
                            ]
                        )
                    )
                )

                lemmatized_novel_words = []
                for novel_word in nlp.pipe(novel_words):
                    lemmatized_novel_words.append(novel_word[0].lemma_)

                novel_words = [
                    novel_word
                    for novel_word in set(lemmatized_novel_words)
                    if novel_word
                    in dataset["nrclex_gen_summaries"][i].affect_dict
                ]

            novel_words_emotions = set(
                sum(
                    [
                        dataset["nrclex_gen_summaries"][i].affect_dict[word]
                        for word in novel_words
                    ],
                    [],
                )
            )

            # If there are no emotions in the reference, there are no
            # novel words, or there are novel words w/o emotions, then
            # discard the sample.
            if len(ref_emotions) == 0 or len(novel_words_emotions) == 0:
                continue

            # Otherwise, compute P, R, and F1
            else:
                precision = len(
                    novel_words_emotions.intersection(ref_emotions)
                ) / len(novel_words_emotions)
                recall = len(
                    novel_words_emotions.intersection(ref_emotions)
                ) / len(ref_emotions)
                f1 = (
                    (2 * precision * recall) / (precision + recall)
                    if (precision + recall) > 0
                    else 0
                )
                outputs["precision"].append(precision)
                outputs["recall"].append(recall)
                outputs["f1"].append(f1)

        print("Evaluating with %d samples" % len(outputs["precision"]))

        outputs["precision"] = np.array(outputs["precision"])
        outputs["recall"] = np.array(outputs["recall"])
        outputs["f1"] = np.array(outputs["f1"])

        n_samples = len(outputs["precision"])
        conf_interval_95_f = lambda std: 1.96 * (std / np.sqrt(n_samples))

        outputs["precision"] = (
            outputs["precision"].mean(),
            outputs["precision"].std(),
            conf_interval_95_f(outputs["precision"].std()),
        )
        outputs["recall"] = (
            outputs["recall"].mean(),
            outputs["recall"].std(),
            conf_interval_95_f(outputs["recall"].std()),
        )
        outputs["f1"] = (
            outputs["f1"].mean(),
            outputs["f1"].std(),
            conf_interval_95_f(outputs["f1"].std()),
        )

        return outputs


In [5]:
class Visualization:
    @staticmethod
    def get_top_k_examples_emotional_hallucination(
        dataset,
        k=10,
        reverse=False,
        sort_metric="precision",
        lemmatized=False,
    ):
        outputs = {
            "documents": [],
            "ref_summaries": [],
            "gen_summaries": [],
            "nrclex_documents": [],
            "nrclex_ref_summaries": [],
            "nrclex_gen_summaries": [],
            "novel_words_with_emotions": [],
            "precision": [],
            "recall": [],
            "f1": [],
        }
        n_samples = len(dataset["documents"])

        for i in range(n_samples):
            # Pick emotions from the reference summary.
            ref_emotions = set(dataset["nrclex_ref_summaries"][i].affect_list)

            # Pick emotions of the novel words in the generated summary.
            if not lemmatized:
                novel_words = list(
                    set(
                        [
                            word.lower()
                            for word in dataset["nrclex_gen_summaries"][
                                i
                            ].affect_dict.keys()
                        ]
                    ).difference(
                        set(
                            [
                                word.lower()
                                for word in dataset["nrclex_documents"][i].words
                            ]
                        )
                    )
                )
            else:
                novel_words = list(
                    set(
                        [
                            word.lower()
                            for word in word_tokenize(dataset["gen_summaries"][i])
                        ]
                    ).difference(
                        set(
                            [
                                word.lower()
                                for word in word_tokenize(dataset["documents"][i])
                            ]
                        )
                    )
                )

                lemmatized_novel_words = []
                for novel_word in nlp.pipe(novel_words):
                    lemmatized_novel_words.append(novel_word[0].lemma_)

                novel_words = [
                    novel_word
                    for novel_word in set(lemmatized_novel_words)
                    if novel_word in dataset["nrclex_gen_summaries"][i].affect_dict
                ]

            novel_words_emotions = set(
                sum(
                    [
                        dataset["nrclex_gen_summaries"][i].affect_dict[word]
                        for word in novel_words
                    ],
                    [],
                )
            )

            # If there are no emotions in the reference, there are no
            # novel words, or there are novel words w/o emotions, then
            # discard the sample.
            if len(ref_emotions) == 0 or len(novel_words_emotions) == 0:
                continue
            
            # Otherwise, compute P, R, and F1
            else:
                precision = len(novel_words_emotions.intersection(ref_emotions)) / len(
                    novel_words_emotions
                )
                recall = len(novel_words_emotions.intersection(ref_emotions)) / len(
                    ref_emotions
                )
                f1 = (
                    (2 * precision * recall) / (precision + recall)
                    if (precision + recall) > 0
                    else 0
                )

                outputs["precision"].append(precision)
                outputs["recall"].append(recall)
                outputs["f1"].append(f1)
                for key in [
                    "documents",
                    "ref_summaries",
                    "gen_summaries",
                    "nrclex_documents",
                    "nrclex_ref_summaries",
                    "nrclex_gen_summaries",
                ]:
                    outputs[key].append(dataset[key][i])

                outputs["novel_words_with_emotions"].append(
                    list(
                        zip(
                            novel_words,
                            [
                                dataset["nrclex_gen_summaries"][i].affect_dict[word]
                                for word in novel_words
                            ],
                        )
                    )
                )

        for key in outputs:
            outputs[key] = np.array(outputs[key])

        sorted_ids = outputs[sort_metric].argsort()

        if reverse:
            sorted_ids = sorted_ids[::-1]

        for key in outputs:
            outputs[key] = outputs[key][sorted_ids][:k]

        return outputs

In [6]:
class Statistics:
    def __init__(self, dataset):
        self.dataset = dataset
        self.n_samples = len(dataset["nrclex_documents"])
        self.stats = {
            "percentage_documents_with_emotion": {
                emotion: 0.0 for emotion in emotion_labels
            },
            "percentage_documents_with_sentiment": {
                sentiment: 0.0 for sentiment in sentiment_labels
            },
            "percentage_summaries_with_emotion": {
                emotion: 0.0 for emotion in emotion_labels
            },
            "percentage_summaries_with_sentiment": {
                sentiment: 0.0 for sentiment in sentiment_labels
            },
            "average_document_emotion_density": {
                emotion: 0.0 for emotion in emotion_labels
            },
            "average_document_sentiment_density": {
                sentiment: 0.0 for sentiment in sentiment_labels
            },
            "average_summary_emotion_density": {
                emotion: 0.0 for emotion in emotion_labels
            },
            "average_summary_sentiment_density": {
                sentiment: 0.0 for sentiment in sentiment_labels
            },
            "average_ratio_md_emotion": {
                emotion: 0.0 for emotion in emotion_labels
            },
            "average_ratio_md_sentiment": {
                sentiment: 0.0 for sentiment in sentiment_labels
            },
        }

        self.compute_percentage_samples()
        self.compute_average_densities_and_ratios()
        self.normalize()

    def normalize(self):
        # Normalize
        for key in self.stats:
            for label in self.stats[key]:
                self.stats[key][label] /= self.n_samples

    def compute_percentage_samples(self):
        for label_type, labels in [
            ("sentiment", sentiment_labels),
            ("emotion", emotion_labels),
        ]:
            for i in range(self.n_samples):
                doc = self.dataset["nrclex_documents"][i]
                summ = self.dataset["nrclex_ref_summaries"][i]
                doc_unique_labels = set(doc.affect_list)
                summ_unique_labels = set(summ.affect_list)

                for label in labels:
                    if label_type == "emotion":
                        if label in doc_unique_labels:
                            self.stats["percentage_documents_with_emotion"][
                                label
                            ] += 1
                        if label in summ_unique_labels:
                            self.stats["percentage_summaries_with_emotion"][
                                label
                            ] += 1
                    else:
                        if label in doc_unique_labels:
                            self.stats["percentage_documents_with_sentiment"][
                                label
                            ] += 1
                        if label in summ_unique_labels:
                            self.stats["percentage_summaries_with_sentiment"][
                                label
                            ] += 1

    def compute_average_densities_and_ratios(self):
        for label_type, labels in [
            ("sentiment", sentiment_labels),
            ("emotion", emotion_labels),
        ]:
            state_dict = EmotionDistributionMetrics.compute_state_dict(
                self.dataset, labels
            )

            map_names = {
                "ed_d": {
                    "emotion": "average_document_emotion_density",
                    "sentiment": "average_document_sentiment_density",
                },
                "ed_m": {
                    "emotion": "average_summary_emotion_density",
                    "sentiment": "average_summary_sentiment_density",
                },
                "ratio_m-d": {
                    "emotion": "average_ratio_md_emotion",
                    "sentiment": "average_ratio_md_sentiment",
                },
            }
            for key in map_names:
                name_stat = map_names[key][label_type]
                for sample in state_dict[key]:
                    for label in sample:
                        self.stats[name_stat][label] += sample[label]

    def get_stats(self):
        return self.stats

In [7]:
def prefix_allowed_tokens_fn(batch_id, input_ids, tok_prefixes, tokenizer):
    position = len(input_ids)
    if position < len(tok_prefixes["input_ids"][batch_id]):
        if (
            tok_prefixes["input_ids"][batch_id][position - 1]
            in tokenizer.all_special_ids
        ):
            return None
        return (
            tok_prefixes["input_ids"][batch_id][position - 1].view(1).tolist()
        )
    return None

In [8]:
def uncontrolled_generation(document, tokenizer, model_name, dataset_name):
    inputs = tokenizer(
        [document],
        max_length=512,
        return_tensors="pt",
        truncation=True,
        padding=True,
    )["input_ids"].to("cuda")

    gen_args = copy(generation_hyperparameters[model_name][dataset_name])

    # NOTE: `model` is a notebook-level global; it is not passed as an argument.
    gen_ids = model.generate(inputs, **gen_args)

    gen_summary = [
        tokenizer.decode(
            g, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        .replace(". ", " .\n")
        .replace("<n>", "\n")
        for g in gen_ids
    ][0]
    return gen_summary

In [9]:
def controlled_generation(
    document, tokenizer, emotion_chain, model_name, dataset_name
):
    inputs = tokenizer(
        [document],
        max_length=512,
        return_tensors="pt",
        truncation=True,
        padding=True,
    )["input_ids"].to("cuda")

    tok_prefix = tokenizer(
        [emotion_chain],
        max_length=512,
        return_tensors="pt",
        truncation=True,
        padding=True,
    ).to("cuda")

    gen_args = copy(generation_hyperparameters[model_name][dataset_name])

    gen_args["prefix_allowed_tokens_fn"] = partial(
        prefix_allowed_tokens_fn, tok_prefixes=tok_prefix, tokenizer=tokenizer
    )

    # NOTE: `model` is a notebook-level global; it is not passed as an argument.
    gen_ids = model.generate(inputs, **gen_args)

    gen_summary = [
        tokenizer.decode(
            g, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        .replace(". ", " .\n")
        .replace("<n>", "\n")
        for g in gen_ids
    ][0]
    return gen_summary

# Emotion statistics in the corpora

We consider the following statistics on emotions and sentiments in the CNNDM and XSUM corpora:

- The percentage of documents and summaries that have words for each emotion/sentiment.
- Averages of emotion/sentiment densities in documents and summaries.
- Averages of emotion/sentiment ratios M/D.
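The first statistic in the list above can be sketched as follows: a text counts towards a label if at least one of its words carries that label. This roughly follows `Statistics.compute_percentage_samples`, which checks membership in each text's `affect_list`; the function name `percentage_with_label` and the toy affect lists are assumptions made for illustration.

```python
def percentage_with_label(affect_lists, labels):
    """Percentage of texts whose affect list contains each label at least once."""
    n_texts = len(affect_lists)
    percentages = {}
    for label in labels:
        # Count each text once, no matter how often the label occurs in it.
        n_with_label = sum(1 for affects in affect_lists if label in set(affects))
        percentages[label] = 100.0 * n_with_label / n_texts if n_texts else 0.0
    return percentages


# Hypothetical affect lists for four texts (one list of emotions per text).
toy_affect_lists = [
    ["fear", "sadness", "fear"],
    ["joy"],
    ["fear"],
    [],
]
toy_percentages = percentage_with_label(toy_affect_lists, ["fear", "sadness", "joy"])
print(toy_percentages)
```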

In [None]:
cnn_nrclex = "./NRCLexDatasetWLemmatization/cnn_dailymail-nrclex+lemmatized.pkl"
xsum_nrclex = "./NRCLexDatasetWLemmatization/xsum-nrclex+lemmatized.pkl"

with open(cnn_nrclex, "rb") as fr:
    dataset = pkl.load(fr)

# Repair names...
dataset["nrclex_ref_summaries"] = copy(dataset["nrclex_summaries"])
del(dataset["nrclex_summaries"])
dataset["nrclex_gen_summaries"] = copy(dataset["nrclex_ref_summaries"])

print(json.dumps(Statistics(dataset).get_stats(), indent=1))

| Dataset | Statistic     | Fear  | Anger | Anticipation | Trust | Surprise | Sadness | Disgust | Joy   | None (emotion) | Negative | Positive | None (sentiment) |
|---------|---------------|-------|-------|--------------|-------|----------|---------|---------|-------|-------|----------|----------|-------|
| CNNDM   | % docs        | 99.66% | 98.81% | 99.58%        | 99.98% | 99.24%    | 99.67%   | 97.17%   | 99.40% | 0.0%   | 99.97%    | 97.17%    | 0.0%   |
| CNNDM   | % summs       | 79.22% | 69.63% | 84.23%        | 90.12% | 60.36%    | 76.38%   | 53.81%   | 69.79% | 0.0%   | 91.05%    | 96.03%    | 0.0%   |
| CNNDM   | Avg $ED_{D}$  | 3.18%  | 2.17%  | 3.52%         | 4.56%  | 1.55%     | 2.49%    | 1.23%    | 2.34%  | 78.95% | 5.43%     | 7.61%     | 86.96% |
| CNNDM   | Avg $ED_{M}$  | 4.10%  | 2.86%  | 3.76%         | 5.00%  | 1.76%     | 3.25%    | 1.55%    | 2.55%  | 75.15% | 7.25%     | 8.44%     | 84.31% |
| CNNDM   | Avg $R_{M/D}$ | 1.25  | 1.28  | 1.07         | 1.09  | 1.14     | 1.29    | 1.21    | 1.08  | 0.95  | 1.31     | 1.10     | 0.97  |
| XSUM    | % docs        | 95.72% | 91.74% | 98.60%        | 99.04% | 93.58%    | 96.26%   | 85.70%   | 94.97% | 0.0%   | 98.57%    | 99.59%    | 0.0%   |
| XSUM    | % summs       | 59.44% | 46.89% | 61.19%        | 70.52% | 38.54%    | 54.17%   | 31.75%   | 44.38% | 0.0%   | 73.96%    | 82.65%    | 0.0%   |
| XSUM    | Avg $ED_{D}$  | 3.01%  | 2.04%  | 3.58%         | 4.54%  | 1.54%     | 2.39%    | 1.11%    | 2.17%  | 79.59% | 5.05%     | 7.69%     | 87.25% |
| XSUM    | Avg $ED_{M}$  | 4.06%  | 2.65%  | 3.77%         | 4.88%  | 1.78%     | 3.16%    | 1.42%    | 2.32%  | 75.99% | 6.91%     | 8.48%     | 84.59% |
| XSUM    | Avg $R_{M/D}$ | 1.34  | 1.28  | 1.09         | 1.13  | 1.18     | 1.33    | 1.15    | 1.08  | 0.95  | 1.42     | 1.15     | 0.97  |


All documents and summaries in CNNDM and XSUM show some kind of emotion/sentiment (None = 0.0% in % docs and % summs). The percentage of documents containing each emotion/sentiment is high in both corpora (above 97% in CNNDM and above 85% in XSUM), indicating that most documents contain words that evoke every emotion/sentiment considered. For each emotion/sentiment, the percentage of summaries is lower than the percentage of documents, especially in XSUM. Since all summaries still convey some kind of emotion/sentiment (None = 0.0%), this suggests that summaries highlight only specific emotions/sentiments from the documents.

The highest average density corresponds to None, for both emotions and sentiments and for both documents and summaries (75-80% for emotions and 84-88% for sentiments). This means that the majority of words in both documents and summaries do not convey any emotion/sentiment. The three emotions with the highest density in CNNDM and XSUM documents and summaries are Trust, Fear, and Anticipation, and the sentiment with the highest density is Positive in all cases. The density of every emotion is higher in summaries than in documents.

Summaries in CNNDM and XSUM reinforce negative emotions (fear, anger, sadness) and the negative sentiment more than the others (their average $R_{M/D}$ is always above 1.20). Additionally, the average $R_{M/D}$ for None is always below 1 for both emotions and sentiments, indicating that the proportion of words carrying emotions/sentiments is higher in the summaries than in the documents.

<a id='coherenciaysesgo'></a>
# Emotional coherence and bias

We want to see:

1. Are the emotions in the generated summaries coherent with the emotions in the references?

2. Do the models exhibit a similar emotional bias to humans? i.e., do they reinforce [more/less] the emotions that humans reinforce [more/less] in the summaries than in the documents? From [1]:
  > ... emotional words that appear more frequently in the human-written model summaries than in the document sets for summarizing should also be more numerous in an automatically generated summary.

We use two metrics:

1. Correlation of emotion density $ED_M$ and $ED_G$

2. Correlation of ratios $R_{M/D}$ and $R_{G/D}$

<br><br>
<img src="https://iili.io/RGs6Gf.png" width="700" height="700"/>
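A minimal sketch of the first metric, assuming it is computed per emotion as the Pearson correlation, across samples, between the density in the reference ($ED_M$) and the density in the generated summary ($ED_G$). The `pearson` helper and the toy densities are hypothetical; the notebook uses `EmotionDistributionMetrics.emotion_correlation`, whose internals are not shown here. Under the same assumption, the second metric would replace the densities with the ratios $R_{M/D}$ and $R_{G/D}$.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


# Toy densities for a single emotion (e.g. fear), one value per sample.
ed_m_fear = [0.10, 0.00, 0.25, 0.05]  # reference summaries
ed_g_fear = [0.08, 0.02, 0.20, 0.05]  # generated summaries

# Scaled to 0-100, as in the tables of this section.
fear_correlation = 100 * pearson(ed_m_fear, ed_g_fear)
print(round(fear_correlation, 2))
```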


In [None]:
dataset_paths = [
    "./AnalysisDatasetWOLemmatization/bart-large-cnn+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/pegasus-cnn_dailymail+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/t5-base+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/bart-large-xsum+xsum.pkl",
    "./AnalysisDatasetWOLemmatization/pegasus-xsum+xsum.pkl",
    "./AnalysisDatasetWOLemmatization/t5-base+xsum.pkl",
]

dataset_lemmatized_paths = [
    "./AnalysisDatasetWLemmatization/lead+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/random+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/extractive_oracle+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-large-cnn+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/pegasus-cnn_dailymail+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/t5-base+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-cnn_dailymail+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-cnn_dailymail+cnn_dailymail+lemmatized+oracle.pkl",
    "./AnalysisDatasetWLemmatization/lead+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/random+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/extractive_oracle+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-large-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/pegasus-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/t5-base+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized+oracle.pkl",
]

for path in dataset_lemmatized_paths:
    print("Path (model+dataset):", path)

    with open(path, "rb") as fr:
        dataset = pkl.load(fr)
        for label_type, labels in [
            ("sentiment", sentiment_labels),
            ("emotion", emotion_labels),
        ]:
            print("\nLabel type:", label_type)
            state_dict = EmotionDistributionMetrics.compute_state_dict(
                dataset, labels
            )
            print("\nCorrelation between densities:")
            print(
                "\nPearson:",
                EmotionDistributionMetrics.emotion_correlation(
                    state_dict, labels, func="pearson", mode="densities"
                ),
            )
            print(
                "\nSpearman:",
                EmotionDistributionMetrics.emotion_correlation(
                    state_dict, labels, func="spearman", mode="densities"
                ),
            )

            print("\n\nCorrelation between ratios:")
            print(
                "\nPearson:",
                EmotionDistributionMetrics.emotion_correlation(
                    state_dict, labels, func="pearson", mode="ratios"
                ),
            )
            print(
                "\nSpearman:",
                EmotionDistributionMetrics.emotion_correlation(
                    state_dict, labels, func="spearman", mode="ratios"
                ),
            )

    print("\n" * 3)

* Pearson correlation of emotion density $ED_M$ and $ED_G$ (Note these are Pearson correlations scaled to 0-100).
    
| Dataset | Model             | Fear  | Anger | Anticipation | Trust | Surprise | Sadness | Disgust | Joy   | None (emotion) | Negative | Positive | None (sentiment) |
|---------|-------------------|-------|-------|--------------|-------|----------|---------|---------|-------|-------|----------|----------|-------|
| CNNDM   | Lead              | 64.16 | 59.69 | 38.23        | 41.71 | 37.07    | 54.72   | 48.75   | 49.82 | 51.05 | 56.85    | 45.62    | 44.08 |
| CNNDM   | Random            | 49.49 | 44.18 | 22.56        | 27.93 | 22.81    | 35.84   | 33.06   | 34.30 | 34.50 | 40.00    | 28.75    | 27.52 |
| CNNDM   | Extractive oracle | 76.29 | 71.67 | 56.91        | 59.80 | 55.85    | 68.29   | 65.89   | 65.97 | 66.42 | 71.31    | 61.83    | 60.60 |
| CNNDM   | BART-CNNDM        | 64.57 | 59.33 | 39.79        | 43.12 | 39.27    | 53.89   | 50.97   | 50.20 | 50.97 | 57.33    | 45.66    | 44.32 |
| CNNDM   | PEGASUS-CNNDM     | 63.15 | 57.99 | 40.17        | 44.15 | 39.41    | 53.41   | 49.82   | 49.79 | 51.60 | 56.34    | 45.73    | 45.13 |
| CNNDM   | T5-BASE           | 62.41 | 57.26 | 36.39        | 40.45 | 35.27    | 51.42   | 48.05   | 45.49 | 48.19 | 55.46    | 43.14    | 42.30 |
| CNNDM   | BART-JES          | 64.59 | 59.21 | 37.85        | 42.24 | 40.94    | 54.72   | 51.05   | 47.82 | 50.64 | 57.18    | 42.92    | 42.04 |
| CNNDM   | BART-JES-Oracle   | 77.59 | 79.71 | 70.82        | 67.25 | 78.57    | 76.51   | 79.25   | 76.76 | 58.90 | 70.00    | 61.09    | 48.87 |
| XSUM    | Lead              | 30.72 | 25.64 | 13.07        | 15.28 | 9.27     | 19.98   | 14.79   | 15.82 | 18.24 | 23.59    | 16.39    | 13.11 |
| XSUM    | Random            | 28.62 | 21.80 | 8.39         | 10.56 | 10.25    | 17.88   | 11.56   | 14.08 | 16.52 | 20.60    | 11.91    | 11.91 |
| XSUM    | Extractive oracle | 44.26 | 37.24 | 25.10        | 27.05 | 22.60    | 34.95   | 30.67   | 29.89 | 31.83 | 36.22    | 28.58    | 27.05 |
| XSUM    | BART-XSUM         | 61.09 | 55.65 | 40.62        | 47.69 | 39.81    | 56.16   | 49.09   | 45.38 | 50.96 | 55.56    | 47.37    | 46.02 |
| XSUM    | PEGASUS-XSUM      | 63.23 | 57.62 | 41.89        | 49.55 | 41.72    | 57.48   | 51.82   | 47.97 | 53.52 | 57.65    | 48.81    | 47.96 |
| XSUM    | T5-BASE           | 44.36 | 35.77 | 19.74        | 21.40 | 16.90    | 33.34   | 23.45   | 24.72 | 30.61 | 35.89    | 24.18    | 24.36 |
| XSUM    | BART-JES          | 60.88 | 53.88 | 39.25        | 45.57 | 37.61    | 55.34   | 48.50   | 44.29 | 48.08 | 54.28    | 44.39    | 42.52 |
| XSUM    | BART-JES-Oracle   | 89.44 | 89.36 | 84.20        | 83.76 | 85.96    | 89.00   | 88.06   | 85.35 | 81.92 | 86.67    | 81.17    | 76.64 |


* Pearson correlation of ratios $R_{M/D}$ and $R_{G/D}$ (same 0-100 scale).

| Dataset | Model             | Fear  | Anger | Anticipation | Trust | Surprise | Sadness | Disgust | Joy   | None (emotion) | Negative | Positive | None (sentiment) |
|---------|-------------------|-------|-------|--------------|-------|----------|---------|---------|-------|-------|----------|----------|-------|
| CNNDM   | Lead              | 15.09 | 16.24 | 16.19        | 15.50 | 14.64    | 19.18   | 15.23   | 17.59 | 23.21 | 20.59    | 20.01    | 21.93 |
| CNNDM   | Random            | 0.94  | 0.00  | 1.88         | 3.33  | -0.01    | 1.55    | 1.75    | 1.60  | 1.67  | -0.01    | 0.00     | 0.15  |
| CNNDM   | Extractive oracle | 39.03 | 37.68 | 40.38        | 42.31 | 39.59    | 41.41   | 40.16   | 41.58 | 47.82 | 44.41    | 44.55    | 45.15 |
| CNNDM   | BART-CNNDM        | 19.76 | 19.79 | 19.94        | 20.97 | 17.01    | 21.59   | 19.62   | 20.53 | 24.82 | 20.42    | 21.77    | 23.17 |
| CNNDM   | PEGASUS-CNNDM     | 23.01 | 23.15 | 20.11        | 24.24 | 20.34    | 22.10   | 22.44   | 22.49 | 28.77 | 24.54    | 25.12    | 26.56 |
| CNNDM   | T5-BASE           | 17.56 | 18.22 | 16.79        | 18.74 | 14.95    | 19.40   | 18.72   | 17.03 | 22.06 | 19.97    | 20.01    | 21.86 |
| CNNDM   | BART-JES          | 21.14 | 21.01 | 19.44        | 21.14 | 19.57    | 24.11   | 20.72   | 19.52 | 25.34 | 22.82    | 22.85    | 22.43 |
| CNNDM   | BART-JES-Oracle   | 76.59 | 77.11 | 70.10        | 66.76 | 76.87    | 74.67   | 79.77   | 76.73 | 45.63 | 62.68    | 58.10    | 39.91 |
| XSUM    | Lead              | 5.85  | 6.33  | 2.08         | 1.17  | 2.82     | 5.48    | 6.17    | 2.47  | 2.08  | 4.27     | 5.64     | 2.38  |
| XSUM    | Random            | 3.95  | 2.02  | 2.38         | 2.28  | 1.79     | 3.78    | 2.52    | 3.77  | 0.90  | 2.96     | 1.08     | 1.56  |
| XSUM    | Extractive oracle | 15.79 | 12.95 | 9.52         | 12.23 | 10.20    | 17.02   | 15.21   | 11.82 | 16.45 | 13.83    | 15.21    | 16.91 |
| XSUM    | BART-XSUM         | 39.29 | 31.13 | 37.51        | 40.35 | 33.36    | 39.61   | 35.49   | 35.31 | 41.89 | 42.21    | 41.66    | 41.06 |
| XSUM    | PEGASUS-XSUM      | 37.58 | 33.40 | 38.98        | 44.04 | 30.16    | 40.37   | 41.35   | 41.94 | 45.04 | 41.35    | 43.94    | 42.30 |
| XSUM    | T5-BASE           | 9.89  | 8.54  | 5.52         | 3.48  | 4.51     | 8.59    | 4.75    | 9.15  | 9.72  | 7.98     | 7.47     | 10.68 |
| XSUM    | BART-JES          | 38.98 | 29.54 | 35.51        | 38.22 | 26.95    | 39.52   | 36.23   | 35.13 | 37.78 | 39.81    | 40.92    | 36.41 |
| XSUM    | BART-JES-Oracle   | 86.15 | 86.22 | 85.29        | 82.61 | 87.02    | 87.42   | 87.53   | 86.71 | 78.69 | 84.40    | 82.08    | 74.68 |

The p-value is 0 in all cases, so we reject the null hypothesis $H_0$ of no correlation.

The models approximate the emotions of the reference summaries well (moderate/strong correlations in the first table). Interestingly, in CNNDM, the state-of-the-art abstractive models show correlations very similar to those of the Lead baseline.

They do not exhibit the same emotional bias as humans (weak correlations in the second table), i.e., they do not reinforce more/less the same emotions that humans do.

It is noteworthy that emotional bias is better captured in XSUM (the correlations between the ratios are stronger than in CNNDM). We currently have no explanation for this.

<a id='rougecorrel'></a>
# ROUGE-emotion correlation

To what extent does ROUGE measure emotion densities and ratios? To answer this, we study two correlations: between $pearson(ED_M(E_{1...N}), ED_G(E_{1...N}))$ and ROUGE (how much ROUGE increases/decreases when the model approximates the emotion/sentiment densities of the reference better/worse), and between $pearson(R_{M/D}(E_{1...N}), R_{G/D}(E_{1...N}))$ and ROUGE (how much ROUGE increases/decreases when the model approximates the emotional ratios of the reference better/worse).

<br><br>
<img src="https://iili.io/RGsDu9.png" width="700" height="700"/>
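A minimal sketch of this two-level computation, assuming that a Pearson coefficient between the reference and generated density vectors is first computed for each sample (over the emotions $E_{1...N}$), and that these per-sample coefficients are then correlated with the per-sample ROUGE scores. All values are toy numbers, and `pearson` is a hypothetical helper, not the notebook's `emotion_summary_metric_correlation`.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


# Each toy sample: (ED_M over emotions, ED_G over emotions, ROUGE of the generated summary).
samples = [
    ([0.20, 0.05, 0.10], [0.18, 0.06, 0.09], 0.48),
    ([0.10, 0.10, 0.30], [0.25, 0.05, 0.10], 0.21),
    ([0.05, 0.25, 0.15], [0.06, 0.20, 0.18], 0.40),
    ([0.30, 0.10, 0.05], [0.10, 0.12, 0.20], 0.15),
]

# Level 1: per-sample agreement between reference and generated emotion densities.
agreement = [pearson(ed_m, ed_g) for ed_m, ed_g, _ in samples]
# Level 2: correlation between that agreement and ROUGE across samples.
rouge = [score for _, _, score in samples]
rouge_emotion_correlation = pearson(agreement, rouge)
print(round(100 * rouge_emotion_correlation, 2))
```

A positive value means that samples whose emotion densities match the reference better also tend to obtain higher ROUGE.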

In [None]:
dataset_paths = [
    "./AnalysisDatasetWOLemmatization/bart-large-cnn+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/pegasus-cnn_dailymail+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/t5-base+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/bart-large-xsum+xsum.pkl",
    "./AnalysisDatasetWOLemmatization/pegasus-xsum+xsum.pkl",
    "./AnalysisDatasetWOLemmatization/t5-base+xsum.pkl",
]

dataset_lemmatized_paths = [
    "./AnalysisDatasetWLemmatization/lead+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/random+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/extractive_oracle+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-large-cnn+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/pegasus-cnn_dailymail+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/t5-base+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-cnn_dailymail+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-cnn_dailymail+cnn_dailymail+lemmatized+oracle.pkl",
    "./AnalysisDatasetWLemmatization/lead+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/random+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/extractive_oracle+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-large-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/pegasus-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/t5-base+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized+oracle.pkl",
]

for path in dataset_lemmatized_paths:
    print("Path (model+dataset):", path)

    with open(path, "rb") as fr:
        dataset = pkl.load(fr)
        for label_type, labels in [
            ("sentiment", sentiment_labels),
            ("emotion", emotion_labels),
        ]:
            print("\nLabel type:", label_type)
            state_dict = EmotionDistributionMetrics.compute_state_dict(
                dataset, labels
            )
            print("\nCorrelation between densities and ROUGE:")
            print(
                "\nPearson:",
                EmotionDistributionMetrics.emotion_summary_metric_correlation(
                    state_dict,
                    dataset,
                    emo_metric_func="pearson",
                    mode="densities",
                    correlation_func="pearson",
                    summary_metric="rouge",
                ),
            )
            print(
                "\nSpearman:",
                EmotionDistributionMetrics.emotion_summary_metric_correlation(
                    state_dict,
                    dataset,
                    emo_metric_func="pearson",
                    mode="densities",
                    correlation_func="spearman",
                    summary_metric="rouge",
                ),
            )
            print("\nCorrelation between ratios and ROUGE:")
            print(
                "\nPearson:",
                EmotionDistributionMetrics.emotion_summary_metric_correlation(
                    state_dict,
                    dataset,
                    emo_metric_func="pearson",
                    mode="ratios",
                    correlation_func="pearson",
                    summary_metric="rouge",
                ),
            )
            print(
                "\nSpearman:",
                EmotionDistributionMetrics.emotion_summary_metric_correlation(
                    state_dict,
                    dataset,
                    emo_metric_func="pearson",
                    mode="ratios",
                    correlation_func="spearman",
                    summary_metric="rouge",
                ),
            )

    print("\n" * 3)

* Correlation between pearson($ED_M$, $ED_G$) and ROUGE ("if it improves/worsens capturing emotion density, does ROUGE improve/worsen?")

| Dataset | Model             | Emotion | Sentiment |
|---------|-------------------|---------|-----------|
| CNNDM   | Lead              | 25.97   | 12.67     |
| CNNDM   | Random            | 24.10   | 17.07     |
| CNNDM   | Extractive oracle | 27.80   | 18.97     |
| CNNDM   | BART-CNNDM        | 28.83   | 19.73     |
| CNNDM   | PEGASUS-CNNDM     | 29.92   | 20.44     |
| CNNDM   | T5-BASE           | 27.57   | 17.48     |
| CNNDM   | BART-JES          | 25.06   | 17.11     |
| CNNDM   | BART-JES-Oracle   | 39.12   | 28.50     |
| XSUM    | Lead              | 10.90   | 8.71      |
| XSUM    | Random            | 13.47   | 11.97     |
| XSUM    | Extractive oracle | 9.26    | 7.89      |
| XSUM    | BART-XSUM         | 21.55   | 12.76     |
| XSUM    | PEGASUS-XSUM      | 22.48   | 13.31     |
| XSUM    | T5-BASE           | 11.93   | 9.66      |
| XSUM    | BART-JES          | 17.94   | 11.73     |
| XSUM    | BART-JES-Oracle   | 16.51   | 10.25     |

* Correlation between pearson($R_{M/D}$, $R_{G/D}$) and ROUGE ("if it improves/worsens capturing emotional bias, does ROUGE improve/worsen?")

| Dataset | Model             | Emotion | Sentiment |
|---------|-------------------|---------|-----------|
| CNNDM   | Lead              | 27.12   | 17.79     |
| CNNDM   | Random            | 21.03   | 12.92     |
| CNNDM   | Extractive oracle | 31.21   | 15.35     |
| CNNDM   | BART-CNNDM        | 31.87   | 16.90     |
| CNNDM   | PEGASUS-CNNDM     | 30.89   | 17.92     |
| CNNDM   | T5-BASE           | 30.39   | 16.91     |
| CNNDM   | BART-JES          | 28.31   | 18.22     |
| CNNDM   | BART-JES-Oracle   | 64.44   | 58.29     |
| XSUM    | Lead              | 10.69   | 4.16      |
| XSUM    | Random            | 13.02   | 8.47      |
| XSUM    | Extractive oracle | 15.24   | 8.66      |
| XSUM    | BART-XSUM         | 34.20   | 21.08     |
| XSUM    | PEGASUS-XSUM      | 35.09   | 21.62     |
| XSUM    | T5-BASE           | 12.06   | 8.74      |
| XSUM    | BART-JES          | 31.05   | 20.77     |
| XSUM    | BART-JES-Oracle   | 27.36   | 23.45     |

The correlation is positive in all cases: when emotion/sentiment densities and ratios are better approximated, ROUGE increases. However, the correlation is weak in almost all cases. Emotions correlate with ROUGE more than sentiments do. Emotional bias (emotion ratios) seems to correlate with ROUGE better than density does, especially in XSUM, where the emotion correlation increases by more than 10 points for BART-XSUM and PEGASUS-XSUM.

<a id='alucinacion'></a>
# Emotions of novel words

This section explores whether the emotions of the novel words in the generated summaries are 'unfaithful' with respect to the emotions in the reference summaries.

<img src="https://iili.io/RGsAhu.png" width="1000" height="1000"/>
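A minimal sketch of how such a measure could be computed for one sample. It assumes that a novel word is a word of the generated summary that does not appear in the document, that precision is the fraction of emotions evoked by the novel words that also occur in the reference, and that recall is the fraction of reference emotions covered by the novel words. The toy lexicon and function names are hypothetical; the notebook's `EmotionalHallucinationMetric.compute` may differ in its details (e.g., lemmatization).

```python
# Hypothetical toy lexicon (word -> emotions), standing in for a real emotion lexicon.
TOY_LEXICON = {
    "tragic": {"sadness", "fear"},
    "crash": {"fear", "sadness"},
    "hope": {"joy", "anticipation"},
    "victims": {"sadness"},
}


def emotions_of(words):
    """Set of emotions evoked by a collection of words."""
    found = set()
    for word in words:
        found |= TOY_LEXICON.get(word, set())
    return found


def novel_word_emotion_scores(document, reference, generated):
    """Precision, recall and F1 of the emotions carried by novel words."""
    doc_words = set(document.lower().split())
    # Novel words: generated words that never occur in the document.
    novel_words = [w for w in generated.lower().split() if w not in doc_words]
    novel_emotions = emotions_of(novel_words)
    ref_emotions = emotions_of(reference.lower().split())
    shared = novel_emotions & ref_emotions
    precision = len(shared) / len(novel_emotions) if novel_emotions else 0.0
    recall = len(shared) / len(ref_emotions) if ref_emotions else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


document = "a plane went down near the lake and two people died"
reference = "tragic plane crash leaves two victims"
generated = "tragic plane crash brings hope"

precision, recall, f1 = novel_word_emotion_scores(document, reference, generated)
print(precision, recall, f1)
```

Here the novel word "hope" introduces emotions (joy, anticipation) that are absent from the reference, which lowers precision, while the novel words "tragic" and "crash" cover all the reference emotions, which gives full recall.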

In [None]:
dataset_paths = [
    "./AnalysisDatasetWOLemmatization/bart-large-cnn+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/pegasus-cnn_dailymail+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/t5-base+cnn_dailymail.pkl",
    "./AnalysisDatasetWOLemmatization/bart-large-xsum+xsum.pkl",
    "./AnalysisDatasetWOLemmatization/pegasus-xsum+xsum.pkl",
    "./AnalysisDatasetWOLemmatization/t5-base+xsum.pkl",
]


dataset_lemmatized_paths = [
    "./AnalysisDatasetWLemmatization/bart-large-cnn+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/pegasus-cnn_dailymail+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/t5-base+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-cnn_dailymail+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-cnn_dailymail+cnn_dailymail+lemmatized+oracle.pkl",
    "./AnalysisDatasetWLemmatization/bart-large-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/pegasus-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/t5-base+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized+oracle.pkl",
]


for path in dataset_lemmatized_paths:
    print("Path (model+dataset):", path)

    with open(path, "rb") as fr:
        dataset = pkl.load(fr)
        print(EmotionalHallucinationMetric.compute(dataset, lemmatized=True))

| Dataset | Model           | Precision      | Recall         | $F_1$          |
|---------|-----------------|----------------|----------------|----------------|
| CNNDM   | BART-CNNDM      | 89.54$\pm$1.28 | 37.07$\pm$1.32 | 48.05$\pm$1.29 |
| CNNDM   | PEGASUS-CNNDM   | 90.04$\pm$0.98 | 34.71$\pm$0.97 | 45.89$\pm$0.97 |
| CNNDM   | T5-BASE         | 89.82$\pm$1.11 | 30.68$\pm$1.01 | 41.87$\pm$1.35 |
| CNNDM   | BART-JES        | 89.46$\pm$0.90 | 33.71$\pm$0.86 | 44.81$\pm$0.85 |
| CNNDM   | BART-JES-Oracle | 98.85$\pm$0.16 | 51.63$\pm$0.55 | 63.60$\pm$0.49 |
| XSUM    | BART-XSUM       | 79.59$\pm$0.67 | 49.85$\pm$0.66 | 55.57$\pm$0.58 |
| XSUM    | PEGASUS-XSUM    | 81.23$\pm$0.64 | 50.82$\pm$0.65 | 56.91$\pm$0.58 |
| XSUM    | T5-BASE         | 73.08$\pm$1.89 | 35.08$\pm$1.44 | 41.87$\pm$1.35 |
| XSUM    | BART-JES        | 79.66$\pm$0.67 | 50.64$\pm$0.67 | 56.13$\pm$0.59 |
| XSUM    | BART-JES-Oracle | 94.36$\pm$0.34 | 70.38$\pm$0.56 | 76.44$\pm$0.46 |

In general, the models seem to hallucinate very little: most of the novel words they generate carry emotions that are present in the reference (high precision). It is expected that recall is not very high, since it is hard to cover all the emotions of the reference with novel words alone, but the values still do not seem bad. Even in the worst case in the table (T5-BASE on CNNDM, recall 30.68), novel words cover close to one-third of the emotions in the reference. In XSUM, the models hallucinate more (lower precision), but they cover more emotions from the reference by generating more novel words (higher recall). This boosts $F_1$ compared to CNNDM, since the recall difference is greater than the precision difference.

# ROUGE and BERTScore of summarization models

In [None]:
dataset_lemmatized_paths = [
    "./AnalysisDatasetWLemmatization/lead+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/random+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/extractive_oracle+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-large-cnn+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/pegasus-cnn_dailymail+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/t5-base+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-cnn_dailymail+cnn_dailymail+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-cnn_dailymail+cnn_dailymail+lemmatized+oracle.pkl",
    "./AnalysisDatasetWLemmatization/lead+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/random+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/extractive_oracle+xsum+lemmatized.pkl",    
    "./AnalysisDatasetWLemmatization/bart-large-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/pegasus-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/t5-base+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized.pkl",
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized+oracle.pkl",
]


for path in dataset_lemmatized_paths:
    print("Path (model+dataset):", path)

    with open(path, "rb") as fr:
        dataset = pkl.load(fr)
        print("ROUGE-1:", np.array(dataset["rouge_scores"]["rouge1_f1"]).mean())
        print("ROUGE-2:", np.array(dataset["rouge_scores"]["rouge2_f1"]).mean())
        print("ROUGE-L:", np.array(dataset["rouge_scores"]["rougeLsum_f1"]).mean())
        print("BertScore:", np.array(dataset["bert_scores"]["f1"]).mean())

| Dataset | Model             | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore |
|---------|-------------------|---------|---------|---------|-----------|
| CNNDM   | Lead              | 40.05   | 17.48   | 36.34   | 23.45     |
| CNNDM   | Random            | 28.48   | 8.34    | 25.51   | 11.88     |
| CNNDM   | Extractive oracle | 52.34   | 30.23   | 48.86   | 39.77     |
| CNNDM   | BART-CNNDM        | 43.76   | 20.86   | 40.68   | 33.64     |
| CNNDM   | PEGASUS-CNNDM     | 43.96   | 21.38   | 41.07   | 35.18     |
| CNNDM   | T5-BASE           | 39.95   | 17.47   | 36.93   | 22.93     |
| CNNDM   | BART-JES          | 42.04   | 19.19   | 39.09   | 31.24     |
| CNNDM   | BART-JES-Oracle   | 48.60   | 28.45   | 45.98   | 24.30     |
| XSUM    | Lead              | 16.71   | 1.65    | 12.30   | 14.27     |
| XSUM    | Random            | 15.23   | 1.77    | 11.38   | 11.71     |
| XSUM    | Extractive oracle | 29.38   | 8.68    | 22.43   | 22.66     |
| XSUM    | BART-XSUM         | 45.23   | 22.13   | 37.02   | 50.13     |
| XSUM    | PEGASUS-XSUM      | 47.16   | 24.58   | 39.31   | 52.74     |
| XSUM    | T5-BASE           | 20.68   | 3.18    | 16.36   | 8.08      |
| XSUM    | BART-JES          | 42.35   | 19.45   | 34.76   | 48.02     |
| XSUM    | BART-JES-Oracle   | 58.63   | 34.06   | 51.36   | 58.88     |

<a id='modelado'></a>
# Possible improvements in modeling

While the models already appear to be quite good, can changes in modeling improve the previous metrics? That is, can the models correlate more strongly with the emotion densities and ratios of the references (especially the ratios) and reduce emotional hallucination (increasing precision, recall, and $F_1$)?

Some ideas:

* **Emotional loss**: <br>
<img src="https://iili.io/RGsIp9.png" width="1000" height="1000"/>

* **Joint Emotion and Summary generation**: Similar to [Planning with Entity Chains for Abstractive Summarization](https://arxiv.org/abs/2104.07606). Train a model to generate, as a prefix, the sequence of words with emotions from the reference (using lemmas to avoid restricting to exact words), followed by the reference summary. For example, given a document, generate -> <em>[chain] afraid | fright | sad ||| happy [summary] I'm afraid, frightened, and sad. No, I'm happy!</em>. This forces the model to create a plan of emotional words to use in the summary (the summary is conditioned on the plan/prefix and the document). It also has the advantage that, during inference, the generation can be controlled by providing the emotional word sequence on which the summary is conditioned.
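A minimal sketch of how such a training target could be assembled, assuming the format of the example above: emotional lemmas separated by ` | ` within a sentence, ` ||| ` between sentences, and the `[chain]` and `[summary]` markers around the plan. The toy lemma lexicon, the naive sentence splitter, and `build_target` are hypothetical and only illustrate the idea; they are not the code used to train BART-JES.

```python
import re

# Hypothetical mapping from surface words to emotional lemmas.
TOY_EMOTION_LEMMAS = {
    "afraid": "afraid",
    "frightened": "fright",
    "sad": "sad",
    "happy": "happy",
}


def build_target(reference):
    """Prefix the reference summary with its chain of emotional lemmas."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", reference.strip()) if s]
    chains = []
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        lemmas = [TOY_EMOTION_LEMMAS[w] for w in words if w in TOY_EMOTION_LEMMAS]
        chains.append(" | ".join(lemmas))
    return "[chain] " + " ||| ".join(chains) + " [summary] " + reference


target = build_target("I'm afraid, frightened, and sad. No, I'm happy!")
print(target)
```

A sentence without emotional words yields an empty segment between the ` ||| ` separators in this sketch; how that case is handled in practice is a design choice not covered here.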


<span style="color:red"> We briefly experimented with "Joint Emotion and Summary generation" using BART as the backbone (BART-JES) and its oracle (BART-JES-Oracle).</span>

<a id='controlled'></a>
# Generation with Emotion-Conditioned BART-JES


In BART-JES, the summary is conditioned on a chain of emotional words. At inference time, we can provide this chain as a prefix to control the generation (the prefix can later be removed if only the generated summary is needed). The following cells illustrate how to do this with samples from the XSUM test set.
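A minimal sketch of the two string manipulations involved, assuming the `[emotions] ... [summary]` format used for the emotion chains in the cells of this section. `build_emotion_chain` and `strip_emotion_chain` are hypothetical helpers written for illustration; they are not part of the notebook's `controlled_generation`, whose implementation is not shown here.

```python
def build_emotion_chain(words):
    """Format a list of emotional words as a prefix that conditions the summary."""
    return "[emotions] " + " | ".join(words) + " [summary]"


def strip_emotion_chain(generated):
    """Keep only the text after the '[summary]' marker, if the marker is present."""
    marker = "[summary]"
    _, found, summary = generated.partition(marker)
    return summary.strip() if found else generated.strip()


chain = build_emotion_chain(["love", "hope"])
# Toy stand-in for a decoded model output that still contains the prefix.
decoded = chain + " A toy generated summary."
print(chain)
print(strip_emotion_chain(decoded))
```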

In [10]:
dataset_name = "xsum"
doc_summ_keys = {
    "document": "document",
    "summary": "summary",
}
version = None

dataset = load_dataset(dataset_name, version=version)

documents = [doc for doc in dataset["test"][doc_summ_keys["document"]]]
ref_summaries = [
    ref_summ for ref_summ in dataset["test"][doc_summ_keys["summary"]]
]

Using custom data configuration default
Reusing dataset xsum (/home/jogonba2/.cache/huggingface/datasets/xsum/default/1.2.0/f9abaabb5e2b2a1e765c25417264722d31877b34ec34b437c53242f6e5c30d6d)


In [None]:
model_name = "jogonba2/bart-JES-xsum"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_gen_args = copy(generation_hyperparameters[model_name][dataset_name])

### Example: Reinforcing Negative Emotion in a Sample with Negative Emotions

In [42]:
idx_example = 110

document, ref_summary = documents[idx_example], ref_summaries[idx_example]

uncontrolled_summary = uncontrolled_generation(
    document, tokenizer, model_name, dataset_name
)

emotion_chain = "[emotions] mediocre | bully | harm [summary]"
controlled_summary = controlled_generation(
    document, tokenizer, emotion_chain, model_name, dataset_name
)

display(HTML("<u><h4>Document</h4></u></br></br>%s" % document))

display(HTML("<u><h4>Reference summary</h4></u></br></br>%s" % ref_summary))

display(
    HTML("<u><h4>Generated summary (uncontrolled)</h4></u></br></br>%s" % uncontrolled_summary)
)

display(
    HTML(
        "<u><h4>Generated summary (controlled <font size='2'>using the chain '%s'</font>)</h4></u></br></br>%s"
        % (emotion_chain, controlled_summary)
    )
)

### Example: Reinforcing Positive Emotions in a Sample with Negative Emotions

In [56]:
idx_example = 115

document, ref_summary = documents[idx_example], ref_summaries[idx_example]


uncontrolled_summary = uncontrolled_generation(
    document, tokenizer, model_name, dataset_name
)

emotion_chain = "[emotions] love | hope [summary]"
controlled_summary = controlled_generation(
    document, tokenizer, emotion_chain, model_name, dataset_name
)

display(HTML("<u><h4>Document</h4></u></br></br>%s" % document))

display(HTML("<u><h4>Reference summary</h4></u></br></br>%s" % ref_summary))


display(
    HTML("<u><h4>Generated summary (uncontrolled)</h4></u></br></br>%s" % uncontrolled_summary)
)

display(
    HTML(
        "<u><h4>Generated summary (controlled <font size='2'>using the chain '%s'</font>)</h4></u></br></br>%s"
        % (emotion_chain, controlled_summary)
    )
)

<a id='visualizaciones'></a>
# Visualizations

### Examples with Lower and Higher Emotional Hallucination for a Specific Corpus and Model

In [16]:
dataset_path = "./AnalysisDatasetWOLemmatization/pegasus-xsum+xsum.pkl"
dataset_lemmatized_path = (
    "./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized+oracle.pkl"
)
lemmatized = True

with open(dataset_lemmatized_path if lemmatized else dataset_path, "rb") as fr:
    dataset = pkl.load(fr)
    outputs = Visualization.get_top_k_examples_emotional_hallucination(
        dataset,
        k=5000,
        reverse=True,
        sort_metric="f1",
        lemmatized=lemmatized,
    )



In [18]:
c = 3015

print("Example %d with %s" % (c, dataset_lemmatized_path))
print("-" * 50 + "\n")
print("Precision:", outputs["precision"][c])
print("Recall:", outputs["recall"][c])
print("F1:", outputs["f1"][c])

print("\n\nDocument:", outputs["documents"][c])
print("\n\nRef summary:", outputs["ref_summaries"][c])
print("\n\nGen summary:", outputs["gen_summaries"][c])

if not lemmatized:
    print(
        "\n\nEmotion words in ref summary:",
        outputs["nrclex_ref_summaries"][c].affect_dict,
    )
    print(
        "\n\nEmotion words in gen summary:",
        outputs["nrclex_gen_summaries"][c].affect_dict,
    )
    print(
        "\n\nNovel words with emotions:",
        outputs["novel_words_with_emotions"][c],
    )

else:
    print(
        "\n\nEmotion lemmas in ref summary:",
        outputs["nrclex_ref_summaries"][c].affect_dict,
    )
    print(
        "\n\nEmotion lemmas in gen summary:",
        outputs["nrclex_gen_summaries"][c].affect_dict,
    )
    print(
        "\n\nLemmas of the novel words with emotions (the whole word in the gen summary does not appear in the document):",
        outputs["novel_words_with_emotions"][c],
    )

Example 3015 with ./AnalysisDatasetWLemmatization/bart-JES-xsum+xsum+lemmatized+oracle.pkl
--------------------------------------------------

Precision: 1.0
Recall: 0.8888888888888888
F1: 0.9411764705882353


Document: The 25-year-old from Plymouth, who also dived at the 2006 Commonwealth Games and 2009 World Championship, has quit after a
"I just feel like my body is telling me to give up," the 10m platform diver told BBC South West Sport.
"I'm 25-years-old and knowing there's another four years until Rio, I just don't think my body will hold out."
My dream, even as a little girl, was to make an Olympic Games. Not making London was heartbreaking and my dreams were shattered
Graddon, who won a bronze medal at the 2009 European Diving Championship in Turin, cites 2008 as the best year of her career as a diver.
She said: "I'd come back from injury and illness and that year I was British champion and qualified for the Olympic place for Great Britain, although I didn't make Beijing."
Grad

### Example with all the information

In [7]:
dataset_path = "./AnalysisDatasetWOLemmatization/bart-large-xsum+xsum.pkl"
dataset_lemmatized_path = "./AnalysisDatasetWLemmatization/bart-large-cnn+cnn_dailymail+lemmatized.pkl"

with open(dataset_lemmatized_path, "rb") as fr:
    dataset = pkl.load(fr)
    state_dict = EmotionDistributionMetrics.compute_state_dict(
        dataset, emotion_labels
    )

In [8]:
c = 1338

print("Example %d with %s" % (c, dataset_lemmatized_path))
print("-" * 50 + "\n")

print(
    "ROUGEs F1 (1-2-L):",
    dataset["rouge_scores"]["rouge1_f1"][c],
    dataset["rouge_scores"]["rouge2_f1"][c],
    dataset["rouge_scores"]["rougeLsum_f1"][c],
)
print("\n\nDocument:", dataset["documents"][c])
print("\n\nReference:", dataset["ref_summaries"][c])
print("\n\nGenerated:", dataset["gen_summaries"][c])
print("\n\nED of the document:", state_dict["ed_d"][c])
print("\n\nED of the reference:", state_dict["ed_m"][c])
print("\n\nED of the generated summary:", state_dict["ed_g"][c])
print("\n\nR{M/D}:", state_dict["ratio_m-d"][c])
print("\n\nR{G/D}:", state_dict["ratio_g-d"][c])
print(
    "\n\naffect_dict of the reference:",
    dataset["nrclex_ref_summaries"][c].affect_dict,
)
print(
    "\n\naffect_dict of the generated summary:",
    dataset["nrclex_gen_summaries"][c].affect_dict,
)
print(
    "\n\nEmotions in the reference:",
    set(dataset["nrclex_ref_summaries"][c].affect_list),
)
print(
    "\n\nEmotions in the generated summary:",
    set(dataset["nrclex_gen_summaries"][c].affect_list),
)

Example 1338 with ./AnalysisDatasetWLemmatization/bart-large-cnn+cnn_dailymail+lemmatized.pkl
--------------------------------------------------

ROUGEs F1 (1-2-L): 0.46280991735537186 0.20168067226890757 0.4297520661157025


Document: A game of cat and mouse has been captured in a series of striking images as the pair battle it out on a shed rooftop like a real life version of much-loved cartoon duo Tom and Jerry. It is an age-old rivalry that rarely ends well for one of its parties and so it proved in this remarkable set of photos. The snaps of a cat playing with a mouse on a roof in Shepton Mallet, Somerset, illustrate the perils the tiny rodents face in the town. Ironically the pet cat's name is Mouse. Unfortunately for this mouse that's where all similarities between the moggy and its namesakes end. The pictures were taken by the cat's owner Jason Bryant who confirmed the inevitable outcome of the encounter. 'My cat is a very good mouser,' he said. 'She's done it before. She often