# Notebook: Check Annotation Majority Vote and Create Folds

This notebook is used to check which label was most often assigned to a tweet. At the end, train/test datasets (CSV) (5 folds) are created.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [279]:
from sklearn.model_selection import KFold
from collections import Counter
import pandas as pd
import glob
import os

## Parameters

In [280]:
ANNOTATED_DATASET_PATH = "../Datasets/annotated_dataset/*.xlsx"
SAVE_ANNOTATIONS_PATH = "../Datasets/annotations_matched_filtered.csv"
SPLIT_PATH = "../Datasets/k_fold_splits"
SEED_VALUE = 0
N_SPLITS = 5

## Code

### 1. Load Annotations

In [281]:
file_list = sorted(glob.glob(ANNOTATED_DATASET_PATH))
file_list

['../Datasets/annotated_dataset/tweets_session_1_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_3.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_3.xlsx']

In [282]:
df_all = pd.DataFrame()
for file in file_list:
    df = pd.read_excel(file)
    df_all = pd.concat([df_all, df])

### 2. Add Function to Compare Annotations
In the case that no majority of annotators have decided on a label, then `NO_MAJORITY` is set as the label.

In [283]:
def get_majority(votes):
    # Use the Counter class to count the frequency of each element
    counter = Counter(votes)

    # Get the most common element
    most_common = counter.most_common(1)

    # Check if the most common element was chosen by more than half of the annotators
    if most_common[0][1] > (len(votes.to_numpy()) / 2):
        # Return the most common element
        return most_common[0][0]
    else:
        return "NO_MAJORITY"

In [284]:
annotated_df = df_all.groupby(['id', 'source_account', 'tweet'])['sentiment'].apply(lambda x: get_majority(x))
annotated_df = annotated_df.reset_index()

In [285]:
annotated_df

Unnamed: 0,id,source_account,tweet,sentiment
0,1345310763456613888,CDU,@CDU Was hat dieses planwirtschaftliche Subven...,NEGATIVE
1,1345325468510265088,cducsubt,@CDU @cducsubt @akk @PaulZiemiak @warrings Und...,NEUTRAL
2,1345332075398885120,cducsubt,@markusH107 @Conava2 @SylvieAndresen @realfcki...,NEGATIVE
3,1345341161175719936,CDU,@marcobuelow @CDU Grosspenden bei der Parteien...,NEUTRAL
4,1345385778487184896,CDU,@danhoehrpiano @NiemalsD @CDU Gesagt ja. Aber ...,MIXED
...,...,...,...,...
1995,1476842869117932032,MarcoBuschmann,"@MarcoBuschmann Wenn durch andere „Magie“, z.B...",NEUTRAL
1996,1476850413383130880,MarcoBuschmann,@MarcoBuschmann gut zu wissen. prinzipiell pas...,MIXED
1997,1476868523955740928,MarcoBuschmann,@MarcoBuschmann Die Zumutungen hat die Politik...,NEUTRAL
1998,1476904179918589952,MarcoBuschmann,@MarcoBuschmann Beenden wir endlich corona für...,NEUTRAL


### 3. Filtering

Filter annotations without majority (`NO_MAJORITY`)

In [286]:
annotated_df = annotated_df.loc[annotated_df['sentiment'] != 'NO_MAJORITY'].reset_index(drop=True)

Filter annotations that were labeled as `MIXED` by the Majority

In [287]:
annotated_df = annotated_df.loc[annotated_df['sentiment'] != 'MIXED'].reset_index(drop=True)

In [288]:
annotated_df

Unnamed: 0,id,source_account,tweet,sentiment
0,1345310763456613888,CDU,@CDU Was hat dieses planwirtschaftliche Subven...,NEGATIVE
1,1345325468510265088,cducsubt,@CDU @cducsubt @akk @PaulZiemiak @warrings Und...,NEUTRAL
2,1345332075398885120,cducsubt,@markusH107 @Conava2 @SylvieAndresen @realfcki...,NEGATIVE
3,1345341161175719936,CDU,@marcobuelow @CDU Grosspenden bei der Parteien...,NEUTRAL
4,1345635631788130048,ArminLaschet,@NiAuKo @ArminLaschet Jede Art von Bildung ist...,POSITIVE
...,...,...,...,...
1545,1476801661972659968,RenateKuenast,"@ToWo75590957 @RenateKuenast Leute, die nur an...",NEGATIVE
1546,1476842869117932032,MarcoBuschmann,"@MarcoBuschmann Wenn durch andere „Magie“, z.B...",NEUTRAL
1547,1476868523955740928,MarcoBuschmann,@MarcoBuschmann Die Zumutungen hat die Politik...,NEUTRAL
1548,1476904179918589952,MarcoBuschmann,@MarcoBuschmann Beenden wir endlich corona für...,NEUTRAL


In [289]:
annotated_df = annotated_df.sample(frac=1, random_state=SEED_VALUE).reset_index(drop=True)

### 4. Save Annotations

In [290]:
annotated_df.to_csv(SAVE_ANNOTATIONS_PATH)

### 5. Create Folds

In [291]:
def create_fold_split_dir(index:str):
    # Try to create the directory for the split
    try:
        os.makedirs(SPLIT_PATH + "/TRAIN_TEST_" + str(index))
    except FileExistsError:
        # The directory already exists, so do nothing
        pass

In [292]:
# Create a KFold object with 5 folds
kf = KFold(n_splits=N_SPLITS, random_state=SEED_VALUE, shuffle=True)

# Iterate over the folds
for i, (train_index, test_index) in enumerate(kf.split(X)):
    create_fold_split_dir(index=i)
    # Split the data into train and test sets
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print(X_train.shape[0], X_test.shape[0])
    
    # Save the train set to a CSV file
    train_df = pd.concat([X_train, y_train], axis=1)
    train_df.to_csv(f'{SPLIT_PATH + "/TRAIN_TEST_" + str(i)}/train.csv', index=False)
    
    # Save the test set to a CSV file
    test_df = pd.concat([X_test, y_test], axis=1)
    test_df.to_csv(f'{SPLIT_PATH + "/TRAIN_TEST_" + str(i)}/test.csv', index=False)

1212 303
1212 303
1212 303
1212 303
1212 303
