This notebook logs the preparation of the bcs sentiment analysis datasets.

# Preliminary exploration and corrections

* Dataset is based on the file `"bcs_polsent_20220502.xlsx"`, sent to peter.rupnik@ijs.si 2022-05-02T22:50+02:00.
* As agreed in a Skype meeting, only the first sheet will be used and the labels will be downcast to three classes (positive, negative, neutral).



In [1]:
import pandas as pd

df = pd.read_excel("bcs_polsent_20220502.xlsx", "1-1300", index_col="id")
df.head(2)

id,sentence,country,type,annotator1,annotator2,gold,reconciliation_hard,id_meta,term,doc_id,sentence_id,date,fullname,party,gender,yob,edu_y,ideology,no_seats,ruling
1,Ja shvatam da međunarodna zajednica i oni koji...,BiH,pilot,Negative,M_Negative,soft_disagreement,0,15262,0,3055,4,19991110,"Špirić, Nikola",SDS,0.0,1956.0,22,,4.0,1.0
2,"Npr. mene i moje braće, npr. mi tražimo našu i...",BiH,pilot,Negative,Negative,Negative,0,72709,0,13704,8,20020417,"Kulenović, Salih",SDA,0.0,1944.0,16,,8.0,0.0


Prepare a column `label` that we will use:

In [2]:
df["label"] = df.gold

Overwrite `label` with `reconciliation_hard`, where the latter is non-trivial (i.e. non-zero):

In [3]:
condition_reconciliation_not_zero = df.reconciliation_hard != 0
df.loc[condition_reconciliation_not_zero, "label"] = df.reconciliation_hard[condition_reconciliation_not_zero]

df.label.value_counts()

Negative             539
P_Neutral            218
Positive             184
soft_disagreement    175
N_Neutral             80
M_Negative            60
M_Positive            44
Name: label, dtype: int64

Note the presence of 175 instances of soft disagreement. Let us check this out:

In [7]:
df[df.label=="soft_disagreement"].head(3)

id,sentence,country,type,annotator1,annotator2,gold,reconciliation_hard,id_meta,term,doc_id,...,date,fullname,party,gender,yob,edu_y,ideology,no_seats,ruling,label
1,Ja shvatam da međunarodna zajednica i oni koji...,BiH,pilot,Negative,M_Negative,soft_disagreement,0,15262,0,3055,...,19991110,"Špirić, Nikola",SDS,0.0,1956.0,22,,4.0,1.0,soft_disagreement
14,Ponekad u potrebi da se rješava mnogo stvari p...,BiH,pilot,Negative,M_Negative,soft_disagreement,0,534181,0,114977,...,20160627,"Škaljić, Fehim",SBB,0.0,1949.0,12,,4.0,1.0,soft_disagreement
16,Da li Vijeće ministara raspolaže sa analizom k...,BiH,pilot,N_Neutral,P_Neutral,soft_disagreement,0,555612,0,120407,...,20170726,"Mehmedović, Šemsudin",SDA,0.0,1961.0,18,,10.0,1.0,soft_disagreement
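
To see which annotator label pairs sit behind `soft_disagreement`, one option is to cross-tabulate the two annotator columns for those rows. The sketch below is only illustrative: `toy` is a hypothetical frame whose label pairs mirror the three rows shown above, plus one agreeing row, and is not the real sheet. In the notebook the same `pd.crosstab` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy rows (assumption): label pairs mirror the three rows
# displayed above, plus one row where both annotators agree.
toy = pd.DataFrame(
    {
        "annotator1": ["Negative", "Negative", "N_Neutral", "Positive"],
        "annotator2": ["M_Negative", "M_Negative", "P_Neutral", "Positive"],
        "gold": [
            "soft_disagreement",
            "soft_disagreement",
            "soft_disagreement",
            "Positive",
        ],
    }
)

# Keep only the soft-disagreement rows and count each (annotator1, annotator2) pair.
soft = toy[toy.gold == "soft_disagreement"]
pair_counts = pd.crosstab(soft.annotator1, soft.annotator2)
print(pair_counts)
```

If every non-zero cell pairs two labels of the same polarity (for example `Negative` with `M_Negative`), downcasting should be enough to resolve the disagreement.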


We might resolve this by downcasting the labels to just three classes. Let us do this now.

In [6]:
labels = df.label.unique()
labels

array(['soft_disagreement', 'Negative', 'Positive', 'M_Negative',
       'N_Neutral', 'P_Neutral', 'M_Positive'], dtype=object)

In [8]:
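def correct_label(label: str) -> str:
    """Map a fine-grained label to one of three classes."""
    downcast_dict = {
        "Negative": "Negative",
        "Positive": "Positive",
        "M_Negative": "Negative",
        "N_Neutral": "Neutral",
        "P_Neutral": "Neutral",
        "M_Positive": "Positive",
    }
    # Labels without an entry (e.g. "soft_disagreement") are returned as-is.
    return downcast_dict.get(label, label)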

df["annotator1_downcast"] = df.annotator1.apply(correct_label)
df["annotator2_downcast"] = df.annotator2.apply(correct_label)

df["label"] = df.label.apply(correct_label)
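
As an aside, the same downcast can probably be expressed without a helper function by using `Series.replace` with a dictionary, which leaves unmapped values untouched. This is a minimal sketch on hypothetical toy labels, not a replacement for the cell above. The names `downcast` and `toy_labels` are introduced here purely for illustration.

```python
import pandas as pd

# Only labels that actually change need an entry; everything else
# (including "soft_disagreement") passes through unchanged.
downcast = {
    "M_Negative": "Negative",
    "M_Positive": "Positive",
    "N_Neutral": "Neutral",
    "P_Neutral": "Neutral",
}

# Hypothetical toy labels (assumption), not taken from the real sheet.
toy_labels = pd.Series(
    [
        "Negative",
        "M_Negative",
        "P_Neutral",
        "N_Neutral",
        "M_Positive",
        "soft_disagreement",
    ]
)

downcast_labels = toy_labels.replace(downcast)
print(downcast_labels.tolist())
```

Either form should give the same result; the dictionary version just keeps the mapping and its application in one place.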



In [9]:
condition_label_is_soft_disagreement = df.label == "soft_disagreement"
condition_downcast_annotations_differ = df.annotator1_downcast != df.annotator2_downcast

sum(condition_downcast_annotations_differ & condition_label_is_soft_disagreement)

0

As suspected, once the annotators' labels are downcast, the two annotators agree on every `soft_disagreement` row. We can now overwrite the `soft_disagreement` labels with the downcast annotation:

In [10]:
df.loc[condition_label_is_soft_disagreement, "label"] = df.loc[condition_label_is_soft_disagreement, "annotator1_downcast"]

df.label.value_counts()

Negative    666
Neutral     362
Positive    272
Name: label, dtype: int64
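
The three counts above sum to 1300, matching the sheet name `"1-1300"`. A small sanity check along these lines could guard the final column against regressions. This is a sketch under assumptions: `check_final_labels` and `EXPECTED_LABELS` are hypothetical names, and the toy series stands in for `df.label`.

```python
import pandas as pd

EXPECTED_LABELS = {"Negative", "Neutral", "Positive"}


def check_final_labels(labels: pd.Series, expected_rows: int) -> None:
    """Raise AssertionError unless the column is in its final 3-class form."""
    assert labels.notna().all(), "some rows have no label"
    unexpected = set(labels.unique()) - EXPECTED_LABELS
    assert not unexpected, f"unexpected labels: {unexpected}"
    assert len(labels) == expected_rows, "row count changed"


# Hypothetical toy column (assumption); in the notebook one would call
# check_final_labels(df.label, expected_rows=1300) instead.
toy_label = pd.Series(["Negative", "Neutral", "Positive", "Negative"])
check_final_labels(toy_label, expected_rows=4)
```

The check passes silently when the column is clean and fails loudly if a label such as `soft_disagreement` survives or rows go missing.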
