# Subtask 3: Persuasion Techniques Detection

## Description
Given a news article, identify the persuasion techniques in each paragraph. This is a multi-label task at paragraph level.
Multiple languages are available for this task, and the label scheme is hierarchical, with 6 broad categories subdivided into
23 fine-grained categories. The number of techniques per language may vary slightly. A paragraph can carry several
persuasion techniques at once.
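
As a minimal sketch of what "multi-label at paragraph level" means in practice, each paragraph's set of techniques can be encoded as a 0/1 vector over the label vocabulary. The `to_multi_hot` helper and `LABEL_VOCAB` below are hypothetical: the vocabulary lists only four techniques described later in this notebook, not the full set of 23, and the label strings in the data files may be spelled differently. Treating an empty list as "no technique annotated" is also an assumption.

```python
def to_multi_hot(paragraph_labels, label_vocab):
    """Encode one paragraph's techniques as a 0/1 vector aligned with label_vocab."""
    present = set(paragraph_labels)
    return [1 if label in present else 0 for label in label_vocab]


# Hypothetical vocabulary: only the techniques named in this notebook, not all 23.
LABEL_VOCAB = [
    "Name Calling or Labelling",
    "Guilt by association",
    "Casting Doubt",
    "Appeal to Hypocrisy",
]

# One entry per paragraph; an empty list is assumed to mean "no technique annotated".
paragraphs = [
    ["Casting Doubt", "Name Calling or Labelling"],
    [],
    ["Appeal to Hypocrisy"],
]

encoded = [to_multi_hot(labels, LABEL_VOCAB) for labels in paragraphs]
for labels, vector in zip(paragraphs, encoded):
    print(vector, labels)
```

Each position in a vector corresponds to one technique, so a classifier can make an independent yes/no decision per label.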

Input data will be sentences from news and web articles in plain text format. Templates listing the sentence numbers in each
article were provided. Articles were collected from 2020 to mid-2022 and revolve around a fixed range of topics such as
COVID-19, climate change, migration, the war on Ukraine, and country-specific local events such as elections. A large
fraction of the articles were identified by fact-checkers and experts. The title, if present, is always paragraph 1 and
is separated from the rest of the article body by a blank line. Spans marking the annotated text that characterizes each label
inside the paragraph are also provided.

## Submission
The official measure is micro-F1 computed over the 23 fine-grained labels. The coarse-grained labels will also be evaluated
and communicated to the participating teams.
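
For reference, micro-F1 pools true positives, false positives and false negatives over all paragraphs and labels before computing a single F1 score. The sketch below follows that standard definition; it is not the official scorer, which may treat edge cases (such as paragraphs without labels) differently. The function name `micro_f1` and the toy data are illustrative only.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over paragraphs, each given as a set of labels."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)  # labels predicted and correct
        fp += len(pred_labels - gold_labels)  # labels predicted but not in gold
        fn += len(gold_labels - pred_labels)  # gold labels that were missed
    denominator = 2 * tp + fp + fn
    # Returning 0.0 when nothing is annotated or predicted is an assumption.
    return 2 * tp / denominator if denominator else 0.0


# Toy data: three paragraphs with gold and predicted label sets.
gold = [{"Casting Doubt", "Name Calling or Labelling"}, set(), {"Appeal to Hypocrisy"}]
pred = [{"Casting Doubt"}, {"Guilt by association"}, {"Appeal to Hypocrisy"}]

score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.3f}")
```

Because counts are pooled before averaging, frequent labels weigh more heavily in the score than rare ones.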

## Dates
* January 12, 2023 - Release of test set
* January 22, 2023 - Test submission site closes
* February 2023 - Paper submission deadline


## Labels
Persuasive text is characterized by a specific use of language in order to influence readers. We
distinguish the following six high-level approaches: Justification, Simplification, Distraction,
Call, Attack on Reputation and Manipulative Wording.

* **Justification**: an argument made of two parts is given: a statement + justification.
  
* **Simplification**: a statement is made that excessively simplifies a problem, usually regarding
the cause, the consequence or the existence of choices.

* **Distraction**: a statement is made that changes the focus away from the main topic or argument.
  
* **Call**: the text is not an argument but an encouragement to act or think in a particular way.
  
* **Manipulative wording**: a statement is made that is not an argument or specific language
is used, which contains words/phrases that are either non-neutral, confusing, exaggerating,
loaded, etc., in order to impact the reader, for instance emotionally.

* **Attack on reputation**: an argument whose object is not the topic of the conversation,
but the personality of a participant, his experience and deeds, typically in order to question
and/or undermine his credibility. The object of the argumentation can also refer to a group
of individuals, organization, or an activity.
  * **Name Calling or Labelling**: Typically used in an *insulting* or demeaning way. Labels an object as something the
  target audience fears, hates, etc. Calls for a qualitative judgement that disregards facts and focuses on the subject.
  It is *similar to manipulative wording*. What distinguishes it from Loaded Language is that it is only concerned with
  the characterization of the subject.

    Example: "**’Fascist’ Anti-Vax** Riot Sparks COVID Outbreak in Australia."

  * **Guilt by association**: Associating the subject with another thing that has negative connotations. The difference
  between this and Name Calling is that this requires an **association**, while Name Calling simply uses the insult word.

    Example: "**Do you know who else was doing that? Hitler!**"

  * **Casting Doubt**: Tries to discredit something by raising questions.

    Example: "A candidate talks about his opponent and says: **Is he ready to be the Mayor?**"

  * **Appeal to Hypocrisy**: Attacking the target by pointing out that they criticize others for something they do
    themselves. This is related to Whataboutism, but this attacks the target directly, while Whataboutism focuses on
    the topic. This can also be seen as a specific type of Casting Doubt.

    Example: How can you demand that I eat less meat to reduce my carbon footprint if you yourself drive
    a big SUV and fly for holidays to Bali?
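
Since the coarse-grained labels are also evaluated, the hierarchy above can be sketched as a lookup from fine-grained technique to coarse category. The mapping below covers only the four Attack on reputation techniques described in this section; the names `FINE_TO_COARSE` and `coarse_labels` are hypothetical, and the label strings used in the data files may be spelled differently.

```python
# Partial, illustrative mapping: only the techniques described above are listed.
FINE_TO_COARSE = {
    "Name Calling or Labelling": "Attack on reputation",
    "Guilt by association": "Attack on reputation",
    "Casting Doubt": "Attack on reputation",
    "Appeal to Hypocrisy": "Attack on reputation",
}


def coarse_labels(fine_labels):
    """Return the sorted, de-duplicated coarse categories for a paragraph.

    Raises KeyError for a fine-grained label missing from the mapping, so
    gaps in the (partial) mapping are noticed rather than silently dropped.
    """
    return sorted({FINE_TO_COARSE[label] for label in fine_labels})


print(coarse_labels(["Casting Doubt", "Guilt by association"]))
```

Two fine-grained techniques from the same category collapse into a single coarse label, which is why coarse-grained scores can differ from fine-grained ones.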


In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

from utils import make_dataframe

np.set_printoptions(suppress=True)

In [None]:
languages = ["en", "fr", "ge", "it", "po", "ru"]

In [None]:
train_frames, test_frames = [], []
for lang in languages:
    train_folder = f"../data/{lang}/train-articles-subtask-3/"
    dev_folder = f"../data/{lang}/dev-articles-subtask-3/"
    labels_file = f"../data/{lang}/train-labels-subtask-3.txt"

    aux_train = make_dataframe(train_folder, labels_file)
    aux_train["lang"] = lang
    train_frames.append(aux_train)

    # The dev articles are loaded without a labels file and used as the test split.
    aux_test = make_dataframe(dev_folder)
    aux_test["lang"] = lang
    test_frames.append(aux_test)

    # out_folder = f"../results/result-subtask3-dev-{lang}.txt"

# Concatenate once at the end instead of growing the frames inside the loop.
train_df = pd.concat(train_frames, ignore_index=True)
test_df = pd.concat(test_frames, ignore_index=True)

In [None]:
# One row per (label, lang) pair: split the comma-separated label string, then explode the lists.
train_labels_count = (
    train_df[["labels", "lang"]]
        .assign(labels=lambda df: df["labels"].str.split(","))
        .explode("labels")
)

labels = train_labels_count["labels"].unique().tolist()

In [None]:
print("Label Count")
train_labels_count.groupby(["labels"]).count().sort_values(by="lang", ascending=False).plot(kind="bar")
plt.tight_layout()
plt.savefig("figures/label_counts.jpeg")

In [None]:
print("Count of label occurrences for each language.")
train_labels_count.groupby(["lang"]).count().sort_values(by="labels", ascending=False).plot(kind="bar")
plt.tight_layout()
plt.savefig("figures/language_counts.jpeg")

In [None]:
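missing_labels = set(labels).difference(set(train_labels_count[train_labels_count["lang"] == "en"]["labels"].unique()))
# Report the actual number of missing labels instead of a hard-coded count.
print(f"English is missing {len(missing_labels)} labels:")
for lbl in sorted(missing_labels):
    print("-", lbl)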
train_labels_count.groupby("lang").nunique()

In [None]:
# For each label, record the language with the most training examples.
rows = []
for lbl in labels:
    lang_counts = train_labels_count[train_labels_count["labels"] == lbl]["lang"].value_counts()
    rows.append({"label": lbl, "lang": lang_counts.index[0]})  # value_counts sorts in descending order

count_biggest_df = pd.DataFrame(rows, columns=["label", "lang"])

print("Which language has the highest number of examples for each label?")
count_biggest_df

In [None]:
print("Rank of languages with highest number of samples per label.")
count_biggest_df.groupby("lang").agg({"lang": "count", "label": lambda x: list(x)})

In [None]:
plt.figure(figsize=(15,7.5))
# train_labels_count.plot(kind="bar")
sorted_categories = train_labels_count.groupby(["labels"]).count().sort_values(by="lang", ascending=False).index
sns.countplot(data=train_labels_count, x="labels", hue="lang", order=sorted_categories)
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("figures/label_counts_by_language.jpeg")

In [None]:
print("Co-occurrence matrix: each cell represents the ratio of co-occurrence of label i with label j. Rows sum to 100%.")
plt.figure(figsize=(10,8))
example_labels = train_df["labels"].apply(lambda x: x.split(","))
labels = example_labels.explode().unique()

# co_ocurrences[i][j]: how often label j appears in paragraphs tagged with label i
# (the diagonal holds the count of label i itself).
co_ocurrences = {label: dict.fromkeys(labels, 0) for label in labels}
for row in example_labels:
    for label in set(row):
        for lbl in row:
            co_ocurrences[label][lbl] += 1

coo_occurence_matrix = np.zeros((len(labels), len(labels)))
label_mapping = {label: idx for idx, label in enumerate(labels)}
for lbl_i, aux in co_ocurrences.items():
    i = label_mapping[lbl_i]
    # Normalise by the co-occurrences with all other labels; the diagonal stays at zero.
    total = sum(count for lbl_j, count in aux.items() if lbl_j != lbl_i)
    for lbl_j, count in aux.items():
        j = label_mapping[lbl_j]
        # A label that never co-occurs with another keeps an all-zero row (avoids dividing by zero).
        if i != j and total > 0:
            coo_occurence_matrix[i][j] = count / total

sns.heatmap(coo_occurence_matrix, xticklabels=labels, yticklabels=labels)
plt.tight_layout()
plt.savefig("figures/class_cooccurrence.jpeg")
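
To make the row normalisation above concrete, here is a small stdlib-only sketch on toy data. It is not the cell's exact code: the function name `row_normalised_cooccurrence` and the labels `A`, `B`, `C` are made up for illustration. For each ordered pair (i, j) with i different from j, it counts the paragraphs carrying both labels and divides by the total co-occurrences of label i.

```python
from collections import Counter
from itertools import permutations


def row_normalised_cooccurrence(label_sets):
    """Map each ordered pair (i, j), i != j, to count(i, j) / sum_k count(i, k)."""
    pair_counts = Counter()
    for labels in label_sets:
        # Every ordered pair of distinct labels in a paragraph co-occurs once.
        pair_counts.update(permutations(sorted(set(labels)), 2))

    row_totals = Counter()
    for (label_i, _), count in pair_counts.items():
        row_totals[label_i] += count

    return {pair: count / row_totals[pair[0]] for pair, count in pair_counts.items()}


# Toy paragraphs with made-up labels.
toy = [["A", "B"], ["A", "B", "C"], ["A"], ["C"]]
ratios = row_normalised_cooccurrence(toy)
for (label_i, label_j), ratio in sorted(ratios.items()):
    print(f"{label_i} -> {label_j}: {ratio:.2f}")
```

Here `A` co-occurs with `B` twice and with `C` once, so its row reads 2/3 and 1/3. The matrix is not symmetric, because each row is normalised by its own total.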

In [None]:
print("What is the ratio between train and development sets for each language?")
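# .assign returns a new frame, which avoids pandas' SettingWithCopyWarning.
aux_test = test_df[["lang"]].assign(split="dev")

# Note: train_labels_count has one row per label occurrence, not one per paragraph.
aux_train = train_labels_count[["lang"]].assign(split="train")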

aux_df = pd.concat([aux_train, aux_test], axis=0)
sns.countplot(data=aux_df, x="lang", hue="split")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("figures/train_vs_dev_langs.jpeg")