### Group 4, Text Analytics 2022/23
- Simona Sette
- Giulio Canapa
- Sara Quattrone
- Diego Borsetto

# Generation of a multi-class dataset for a project inspired by SemEval-2023 Task 3 (Persuasion Techniques)

This notebook documents the process used to generate the dataset.

## Import

In [64]:
import pandas as pd
import csv
import os.path
pd.options.mode.chained_assignment = None  # silence pandas' chained-assignment warning

In [65]:
# File with the persuasion technique labels, keyed by article ID and paragraph ID.
completeFile = "train-labels-subtask-3.txt"

We first display the number of distinct labels and their names within the original dataset:

In [66]:
filelabels = 'techniques_subtask3.txt'
df_labels = pd.read_csv(filelabels, header=None)
df_labels.columns = ["Techniques"]
df_labels

Unnamed: 0,Techniques
0,Appeal_to_Authority
1,Appeal_to_Popularity
2,Appeal_to_Values
3,Appeal_to_Fear-Prejudice
4,Flag_Waving
5,Causal_Oversimplification
6,False_Dilemma-No_Choice
7,Consequential_Oversimplification
8,Straw_Man
9,Red_Herring


Total number of distinct persuasion techniques: 23

In [67]:
print(df_labels.nunique())

Techniques    23
dtype: int64


## File generation process

### Label dataset generation

We first generate a dataset containing the persuasion techniques for each article paragraph.

In [68]:
df = pd.read_csv(completeFile, sep='\t', header=None)
df.columns = ["Article", "Paragraph", "Technique"]

In [69]:
df

Unnamed: 0,Article,Paragraph,Technique
0,111111111,1,
1,111111111,3,Doubt
2,111111111,5,Appeal_to_Authority
3,111111111,7,
4,111111111,9,
...,...,...,...
9493,999001970,9,
9494,999001970,10,
9495,999001970,11,
9496,999001970,12,


In [70]:
print(df['Article'].nunique())

446


The original dataset contains 446 distinct news articles.

We then remove the records without any persuasion technique from this dataset, since they are of no use for our purpose:

In [71]:
df = df[df['Technique'].notna()]
df

Unnamed: 0,Article,Paragraph,Technique
1,111111111,3,Doubt
2,111111111,5,Appeal_to_Authority
6,111111111,13,Repetition
8,111111111,17,Appeal_to_Fear-Prejudice
9,111111111,19,Appeal_to_Fear-Prejudice
...,...,...,...
9488,999001970,4,"Exaggeration-Minimisation,Slogans"
9489,999001970,5,Exaggeration-Minimisation
9490,999001970,6,Name_Calling-Labeling
9492,999001970,8,"Exaggeration-Minimisation,Name_Calling-Labeling"


In [72]:
print(df['Article'].nunique())

431


This first filtering step dropped 15 news articles (446 to 431), because none of their paragraphs were annotated.
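
To see which articles disappear in a step like this, one option is to check, per article, whether any paragraph carries a label. The following is a minimal sketch on toy data; the frame `labels` and its values are made up for illustration and only mimic the Article/Paragraph/Technique layout used above.

```python
import pandas as pd

# Toy stand-in for the label file: article 2 has no annotated paragraph.
labels = pd.DataFrame({
    "Article":   [1, 1, 2, 2, 3],
    "Paragraph": [1, 3, 1, 2, 1],
    "Technique": [None, "Doubt", None, None, "Slogans"],
})

# For each article, check whether at least one paragraph carries a label.
has_label = labels["Technique"].notna().groupby(labels["Article"]).any()

# Articles that vanish once the unlabeled paragraphs are dropped.
lost_articles = has_label[~has_label].index.tolist()
print(lost_articles)
```

Applied to the real label dataset before the `notna()` filter, the same check should list the articles that are lost.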

When a paragraph carries several labels, they are stored in a single field, separated by commas.
Since multi-label paragraphs are out of scope for our objective, we remove these rows by checking the Technique field for a comma.

In [73]:
# Keep only the rows whose Technique field holds a single label (no comma).
moreThanOne = ","
df = df[~df["Technique"].str.contains(moreThanOne, regex=False)]

Label dataset suitable for multi-class classification tasks (no longer multi-label):

In [74]:
df

Unnamed: 0,Article,Paragraph,Technique
1,111111111,3,Doubt
2,111111111,5,Appeal_to_Authority
6,111111111,13,Repetition
8,111111111,17,Appeal_to_Fear-Prejudice
9,111111111,19,Appeal_to_Fear-Prejudice
...,...,...,...
9465,999001621,41,Doubt
9487,999001970,3,Loaded_Language
9489,999001970,5,Exaggeration-Minimisation
9490,999001970,6,Name_Calling-Labeling


In [75]:
print(df["Technique"].nunique())

19


Number of techniques automatically excluded because they only ever appear in combination with others: 3
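
The techniques that only occur in combination can be listed by comparing the set of all technique names with the set of those that appear alone at least once. The sketch below runs on toy data; `labels` and its contents are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy label column: some fields hold one technique, some hold several.
labels = pd.DataFrame({
    "Technique": ["Doubt", "Doubt,Slogans", "Loaded_Language", "Slogans,Whataboutism"],
})

# Split each comma-separated field into individual technique names.
all_techniques = set(labels["Technique"].str.split(",").explode())

# Techniques that occur at least once as the sole label of a paragraph.
is_single = ~labels["Technique"].str.contains(",", regex=False)
standalone = set(labels.loc[is_single, "Technique"])

# Techniques that never appear on their own.
only_combined = sorted(all_techniques - standalone)
print(only_combined)
```

On the real data this would be run on the label dataset before the multi-label rows are removed.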

### Paragraph textual content data generation

In [76]:
# File with the paragraph text, keyed by article ID and paragraph ID.
phar = "train-labels-subtask-3-Copy1.txt"
df_paragraph = pd.read_csv(phar, sep='\t', header=None)
df_paragraph.columns = ["Article", "Paragraph", "Text"]
df_paragraph

Unnamed: 0,Article,Paragraph,Text
0,111111111,1,Next plague outbreak in Madagascar could be 's...
1,111111111,3,Geneva - The World Health Organisation chief o...
2,111111111,5,The next transmission could be more pronounced...
3,111111111,7,"An outbreak of both bubonic plague, which is s..."
4,111111111,9,Madagascar has suffered bubonic plague outbrea...
...,...,...,...
9191,999001970,9,"Patel pushed back on the officials’ remarks, a..."
9192,999001970,10,The real world? This is Columbia.
9193,999001970,11,"For Sofia Jao, BC ‘22, problems with the perfo..."
9194,999001970,12,Patel is 32.


### Merged dataset with article ID, paragraph ID, text and technique annotation

In [77]:
# Join labels and text on the two key columns they share.
df3 = pd.merge(df, df_paragraph, on=["Article", "Paragraph"], how='inner')
df3

Unnamed: 0,Article,Paragraph,Technique,Text
0,111111111,3,Doubt,Geneva - The World Health Organisation chief o...
1,111111111,5,Appeal_to_Authority,The next transmission could be more pronounced...
2,111111111,13,Repetition,"But Tedros voiced alarm that ""plague in Madaga..."
3,111111111,17,Appeal_to_Fear-Prejudice,He also pointed to the presence of the pneumon...
4,111111111,19,Appeal_to_Fear-Prejudice,He praised the rapid response from WHO and Mad...
...,...,...,...,...
2212,999001621,41,Doubt,The story was completely false and the Guardia...
2213,999001970,3,Loaded_Language,Andy Warhol was only half-right. In the future...
2214,999001970,5,Exaggeration-Minimisation,Saturday Night Live writer and comedian Nimesh...
2215,999001970,6,Name_Calling-Labeling,That's what Columbia snowflakes thought was of...


We use an inner merge rather than a left merge because, surprisingly, some article-paragraph combinations __have a label but no text__:

In [78]:
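# Rows that appear in the left merge but not in the inner merge are labels without text.
df4 = pd.merge(df, df_paragraph, on=["Article", "Paragraph"], how='left')
b = pd.concat([df3, df4]).drop_duplicates(keep=False)
b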

Unnamed: 0,Article,Paragraph,Technique,Text
693,729668796,8,Repetition,
1281,766942310,3,Exaggeration-Minimisation,
1282,766942310,4,Name_Calling-Labeling,
1283,766942310,9,Doubt,
1284,766942310,11,Appeal_to_Fear-Prejudice,
...,...,...,...,...
2107,999000136,17,Exaggeration-Minimisation,
2108,999000136,18,Repetition,
2264,999001323,11,Loaded_Language,
2265,999001323,14,Name_Calling-Labeling,
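
An alternative way to isolate labeled paragraphs that have no text is the `indicator` option of `merge`, which tags each row with the side it came from. This is a minimal sketch on toy frames (`labels` and `texts` are hypothetical names and values), not the approach used in the cell above.

```python
import pandas as pd

# Toy frames: paragraph (2, 8) has a label but no matching text.
labels = pd.DataFrame({
    "Article":   [1, 1, 2],
    "Paragraph": [3, 5, 8],
    "Technique": ["Doubt", "Repetition", "Slogans"],
})
texts = pd.DataFrame({
    "Article":   [1, 1],
    "Paragraph": [3, 5],
    "Text":      ["first paragraph", "second paragraph"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge.
merged = labels.merge(texts, on=["Article", "Paragraph"], how="left", indicator=True)

# Labeled paragraphs for which no text was found.
missing_text = merged.loc[merged["_merge"] == "left_only",
                          ["Article", "Paragraph", "Technique"]]
print(missing_text)
```

This avoids building two merged frames and comparing them, at the cost of an extra helper column.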


Saving the first version of the dataset (the export is left commented out):

In [79]:
#df3.to_csv('Multiclass_problem_withText.csv', index=None)  

In [80]:
# The number of distinct paragraph IDs is not meaningful, because the numbering restarts in each article.
print(df.nunique())

Article      416
Paragraph    101
Technique     19
dtype: int64


At this point the label dataset (df) contains 416 distinct articles and 19 distinct persuasion techniques.

We then counted the number of examples per class in the merged dataset, since classifier training is particularly sensitive to classes with few examples.

In [81]:
occur = df3.groupby(['Technique']).size()
# display occurrences for each persuasion technique
display(occur)

Technique
Appeal_to_Authority                 64
Appeal_to_Fear-Prejudice           122
Appeal_to_Hypocrisy                 14
Appeal_to_Popularity                 3
Causal_Oversimplification           61
Conversation_Killer                 39
Doubt                              210
Exaggeration-Minimisation          102
False_Dilemma-No_Choice             48
Flag_Waving                        102
Guilt_by_Association                19
Loaded_Language                    806
Name_Calling-Labeling              318
Obfuscation-Vagueness-Confusion      8
Red_Herring                         15
Repetition                         218
Slogans                             57
Straw_Man                            7
Whataboutism                         4
dtype: int64

We decided to keep only the "significant" classes, defined as those with more than 100 occurrences.

This leaves 7 definitive classes and 1878 records for the analysis.

In [82]:
# Classes with more than 100 occurrences in the table above.
keptTechniques = ["Appeal_to_Fear-Prejudice", "Doubt", "Exaggeration-Minimisation", "Flag_Waving",
                  "Loaded_Language", "Name_Calling-Labeling", "Repetition"]
dfFilter = df3.loc[df3['Technique'].isin(keptTechniques)]
print("Number of definitive classes: ", dfFilter['Technique'].nunique())
dfFilter

Number of definitive classes:  7


Unnamed: 0,Article,Paragraph,Technique,Text
0,111111111,3,Doubt,Geneva - The World Health Organisation chief o...
2,111111111,13,Repetition,"But Tedros voiced alarm that ""plague in Madaga..."
3,111111111,17,Appeal_to_Fear-Prejudice,He also pointed to the presence of the pneumon...
4,111111111,19,Appeal_to_Fear-Prejudice,He praised the rapid response from WHO and Mad...
5,111111111,25,Appeal_to_Fear-Prejudice,That means that Madagascar could be affected m...
...,...,...,...,...
2212,999001621,41,Doubt,The story was completely false and the Guardia...
2213,999001970,3,Loaded_Language,Andy Warhol was only half-right. In the future...
2214,999001970,5,Exaggeration-Minimisation,Saturday Night Live writer and comedian Nimesh...
2215,999001970,6,Name_Calling-Labeling,That's what Columbia snowflakes thought was of...
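
Instead of naming the retained classes by hand, they could also be derived from the class counts and the frequency threshold. The sketch below uses toy data and a lowered threshold; `min_count`, `data`, `kept` and `filtered` are illustrative names, not part of the notebook.

```python
import pandas as pd

# The notebook uses a threshold of 100; it is lowered here to fit the toy data.
min_count = 2

data = pd.DataFrame({
    "Technique": ["Doubt"] * 3 + ["Slogans"] * 1 + ["Repetition"] * 3 + ["Straw_Man"] * 2,
})

# Count the records per class and keep the classes strictly above the threshold.
counts = data["Technique"].value_counts()
kept = sorted(counts[counts > min_count].index)

# Keep only the records that belong to a retained class.
filtered = data[data["Technique"].isin(kept)]
print(kept, len(filtered))
```

Deriving the list this way keeps the filter in sync with the counts if the input data or the threshold changes.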


## Saving the definitive dataset

In [83]:
dfFilter.to_csv('Multiclass_problem_7Classes.csv', index=False)
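
Because the paragraph text contains commas and quotation marks, it can be worth checking that an export survives a round trip. This is a minimal sketch, assuming a small made-up frame and an in-memory buffer in place of the real CSV file.

```python
import io
import pandas as pd

# Toy rows in the same layout as the final dataset; the text includes a comma and quotes.
sample = pd.DataFrame({
    "Article":   [111111111, 999001970],
    "Paragraph": [3, 6],
    "Technique": ["Doubt", "Name_Calling-Labeling"],
    "Text":      ['He said, "wait"', "A plain sentence."],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded["Text"].tolist() == sample["Text"].tolist())
```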