# Notebook: Check Annotation Majority Vote and Create Folds

This notebook determines, for each tweet, the label that the annotators assigned most often (majority vote). At the end, train/test datasets (CSV, 5 folds) are created.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [1]:
from collections import Counter
import pandas as pd
import glob

## Parameters

In [2]:
ANNOTATED_DATASET_PATH = "../Datasets/annotated_dataset/*.xlsx"
SAVE_ANNOTATIONS_PATH = "../Datasets/annotations.csv"

## Code

### 1. Load Annotations

In [3]:
file_list = sorted(glob.glob(ANNOTATED_DATASET_PATH))
file_list

['../Datasets/annotated_dataset/tweets_session_1_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_3.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_3.xlsx']

In [4]:
# Read every annotation file and stack the rows into a single DataFrame
df_all = pd.concat([pd.read_excel(file) for file in file_list])
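Because the majority vote below depends on how many annotations each tweet received, a quick count per tweet can be a useful sanity check after loading. The following is a minimal sketch on hypothetical toy data; the expected number of three annotations per tweet is an assumption for illustration, not something stated in this notebook.

```python
import pandas as pd

# Hypothetical long-format annotations: one row per (tweet, annotator) pair
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "sentiment_label": ["Positive", "Positive", "Negative", "Negative", "Negative"],
})

# Number of annotations collected for each tweet id
votes_per_tweet = toy.groupby("id")["sentiment_label"].size()
print(votes_per_tweet)

# Tweets with fewer annotations than the assumed three
incomplete = votes_per_tweet[votes_per_tweet < 3]
print(incomplete.index.tolist())
```

On the real data, the same check could be run on `df_all` before grouping.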

### 2. Add Function to Compare Annotations
If no label was chosen by more than half of the annotators, the tweet is assigned the label `NO_MAJORITY`.

In [5]:
def get_majority(votes):
    # Use the Counter class to count the frequency of each element
    counter = Counter(votes)

    # Get the most common element
    most_common = counter.most_common(1)

    # Check if the most common element was chosen by more than half of the annotators
    if most_common[0][1] > len(votes) / 2:
        # Return the most common element
        return most_common[0][0]
    else:
        return "NO_MAJORITY"
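To illustrate the strict "more than half" rule, here is a small sketch that applies the same logic to hypothetical votes. The helper name `majority_label` and the toy vote lists are assumptions for this example; the labels are taken from those that appear in this notebook.

```python
from collections import Counter

import pandas as pd


def majority_label(votes, fallback="NO_MAJORITY"):
    """Return the label chosen by more than half of the votes, else the fallback."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else fallback


# Hypothetical votes for three tweets
clear = pd.Series(["Positive", "Positive", "Negative"])  # 2 of 3 agree
split = pd.Series(["Positive", "Negative", "MIXED"])     # no label above 1.5 votes
tie = pd.Series(["Positive", "Negative"])                # 1 of 2 is not more than half

print(majority_label(clear))
print(majority_label(split))
print(majority_label(tie))
```

Note that a tie between two labels also falls back to `NO_MAJORITY`, because the comparison is strict.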

In [6]:
annotated_df = df_all.groupby(['id', 'source_account', 'tweet'])['sentiment_label'].apply(get_majority)
annotated_df = annotated_df.reset_index()
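The grouping step collapses one row per annotation into one row per tweet. The sketch below shows the same pattern on a hypothetical toy frame; the data and the helper `majority_label` are assumptions for illustration, while the column names mirror those used above.

```python
from collections import Counter

import pandas as pd


def majority_label(votes):
    """Label chosen by more than half of the votes, else 'NO_MAJORITY'."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "NO_MAJORITY"


# Hypothetical annotations: two tweets, three annotations each
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "source_account": ["CDU"] * 3 + ["fdp"] * 3,
    "tweet": ["tweet a"] * 3 + ["tweet b"] * 3,
    "sentiment_label": ["Positive", "Positive", "Negative",
                        "Positive", "Negative", "MIXED"],
})

# One row per tweet; the group keys become regular columns after reset_index()
result = (
    toy.groupby(["id", "source_account", "tweet"])["sentiment_label"]
    .apply(majority_label)
    .reset_index()
)
print(result)
```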

In [7]:
annotated_df

Unnamed: 0,id,source_account,tweet,sentiment_label
0,1345000784115552000,larsklingbeil,@JuliaMaiano @EskenSaskia @NowaboFM @spdde @Ol...,Negative
1,1345296806947779072,cducsubt,@JM_Luczak @Linksfraktion @cducsubt @MIT_bund ...,Negative
2,1345311824439348992,CDU,@Sarayatennis @_FriedrichMerz @CDU Das ist ja ...,Positive
3,1345330989275508992,CDU,@LeBoomio @theNeo42 @InRi5555 @n_roettgen @CDU...,Negative
4,1345336738328284928,CDU,@MickyBeisenherz @n_roettgen @_FriedrichMerz @...,Positive
...,...,...,...,...
1995,1476608889500185088,fdp,@JoelThuering @CDU @fdp Aber sonst geht es dir...,Positive
1996,1476626653577158912,fdp,@Chrissip81 @Judith__Sauer @fdp @_MartinHagen ...,Negative
1997,1476841285873020928,MarcoBuschmann,@MarcoBuschmann Ich habe mir gerade den Koalit...,Negative
1998,1476869223544667904,MarcoBuschmann,@MarcoBuschmann Dann macht es doch endlich bes...,Positive


### 3. Filtering

Remove tweets for which no majority label was found (`NO_MAJORITY`).

In [8]:
annotated_df = annotated_df.loc[annotated_df['sentiment_label'] != 'NO_MAJORITY'].reset_index(drop=True)

Remove tweets that the majority of annotators labeled as `MIXED`.

In [9]:
annotated_df = annotated_df.loc[annotated_df['sentiment_label'] != 'MIXED'].reset_index(drop=True)
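As a possible alternative, both exclusions can be expressed in one step with `isin`, and `value_counts` shows how many rows each label accounts for before filtering. This is a sketch on hypothetical toy data; the list name `excluded_labels` is an assumption for the example.

```python
import pandas as pd

# Hypothetical per-tweet majority labels
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sentiment_label": ["Positive", "NO_MAJORITY", "MIXED", "Negative"],
})

# Inspect the label distribution before dropping anything
print(toy["sentiment_label"].value_counts())

# Drop both excluded labels at once and renumber the index
excluded_labels = ["NO_MAJORITY", "MIXED"]
kept = toy.loc[~toy["sentiment_label"].isin(excluded_labels)].reset_index(drop=True)
print(kept)
```

Checking the counts first also reveals whether the label strings in the data match the spelling and capitalization used in the filter.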

In [10]:
annotated_df

Unnamed: 0,id,source_account,tweet,sentiment_label
0,1345000784115552000,larsklingbeil,@JuliaMaiano @EskenSaskia @NowaboFM @spdde @Ol...,Negative
1,1345296806947779072,cducsubt,@JM_Luczak @Linksfraktion @cducsubt @MIT_bund ...,Negative
2,1345311824439348992,CDU,@Sarayatennis @_FriedrichMerz @CDU Das ist ja ...,Positive
3,1345330989275508992,CDU,@LeBoomio @theNeo42 @InRi5555 @n_roettgen @CDU...,Negative
4,1345336738328284928,CDU,@MickyBeisenherz @n_roettgen @_FriedrichMerz @...,Positive
...,...,...,...,...
1994,1476608889500185088,fdp,@JoelThuering @CDU @fdp Aber sonst geht es dir...,Positive
1995,1476626653577158912,fdp,@Chrissip81 @Judith__Sauer @fdp @_MartinHagen ...,Negative
1996,1476841285873020928,MarcoBuschmann,@MarcoBuschmann Ich habe mir gerade den Koalit...,Negative
1997,1476869223544667904,MarcoBuschmann,@MarcoBuschmann Dann macht es doch endlich bes...,Positive


### 4. Save Annotations

In [11]:
annotated_df.to_csv(SAVE_ANNOTATIONS_PATH)
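By default, `to_csv` also writes the DataFrame index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below demonstrates this with an in-memory buffer and hypothetical toy data; whether the index should be kept is a design choice that depends on how the file is consumed later.

```python
import io

import pandas as pd

# Hypothetical two-row result
toy = pd.DataFrame({
    "id": [1345311824439348992, 1476608889500185088],
    "sentiment_label": ["Positive", "Negative"],
})

# Default behavior: the index is written as an extra, unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)
print(list(with_index.columns))

# With index=False only the data columns are written
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)
print(list(without_index.columns))
```

If the index column is kept in the file, passing `index_col=0` to `pd.read_csv` restores it as the index instead of a data column.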