# Notebook: Check Annotation Majority Vote and Create Folds

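This notebook determines the majority label of each tweet, i.e. the label that more than half of the annotators assigned to it. Tweets without a majority label, or whose majority label is `MIXED`, are removed. Finally, the remaining tweets are split into 5 stratified train/test folds, which are saved as CSV files.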
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [1]:
from sklearn.model_selection import StratifiedKFold
from collections import Counter
import pandas as pd
import glob
import os

## Parameters

In [2]:
ANNOTATED_DATASET_PATH = "../Datasets/annotated_dataset/*.xlsx"
SAVE_ANNOTATIONS_PATH = "../Datasets/annotations_matched_filtered.csv"
SPLIT_PATH = "../Datasets/k_fold_splits"
SEED_VALUE = 0
N_SPLITS = 5

## Code

### 1. Load Annotations

In [3]:
file_list = sorted(glob.glob(ANNOTATED_DATASET_PATH))
file_list

['../Datasets/annotated_dataset/tweets_session_1_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_3.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_3.xlsx']

In [4]:
# Read every annotation file and stack the rows into a single DataFrame
df_all = pd.concat([pd.read_excel(file) for file in file_list], ignore_index=True)

### 2. Add Function to Compare Annotations
If no label was chosen by more than half of the annotators, the tweet is labeled `NO_MAJORITY`.

In [5]:
def get_majority(votes):
    # Count how often each label was assigned
    counter = Counter(votes)

    # Get the most common label and its number of votes
    label, count = counter.most_common(1)[0]

    # Accept the label only if more than half of the annotators chose it
    if count > len(votes) / 2:
        return label
    return "NO_MAJORITY"
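
The strict "more than half" threshold is easiest to see on a few hand-made vote lists. The following is a minimal sketch, not the notebook's own cell: `majority_label` is a hypothetical stand-alone version of the function above that works on plain lists, and the vote lists are invented for illustration.

```python
from collections import Counter


def majority_label(votes, fallback="NO_MAJORITY"):
    """Return the label chosen by more than half of the voters, else `fallback`.

    Hypothetical helper for illustration; assumes `votes` is a non-empty list.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else fallback


# Invented vote lists, one per tweet
clear_majority = ["POSITIVE", "POSITIVE", "NEGATIVE"]           # 2 of 3 agree
three_way_split = ["POSITIVE", "NEGATIVE", "NEUTRAL"]           # 1 of 3 each
two_way_tie = ["POSITIVE", "NEGATIVE"]                          # 1 of 2 each
exactly_half = ["NEUTRAL", "NEUTRAL", "POSITIVE", "NEGATIVE"]   # 2 of 4

for votes in (clear_majority, three_way_split, two_way_tie, exactly_half):
    print(votes, "->", majority_label(votes))
```

Note that exactly half of the votes is not enough: with four annotators, two matching votes still yield `NO_MAJORITY`.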

In [6]:
annotated_df = df_all.groupby(['id', 'source_account', 'source_party', 'tweet'])['sentiment'].apply(get_majority)
annotated_df = annotated_df.reset_index()
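
To illustrate what the grouping step does, here is a small sketch on an invented DataFrame. The column names `id`, `tweet` and `sentiment` mirror the notebook, but the rows are made up and only two grouping columns are used to keep the example short.

```python
from collections import Counter

import pandas as pd


def get_majority(votes):
    # Same rule as in the notebook: more than half of the votes must agree
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "NO_MAJORITY"


# Invented data: every row is one annotator's vote for one tweet
toy_votes = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "tweet": ["tweet a", "tweet a", "tweet a", "tweet b", "tweet b", "tweet b"],
    "sentiment": ["POSITIVE", "POSITIVE", "NEUTRAL",
                  "POSITIVE", "NEGATIVE", "NEUTRAL"],
})

# One group per tweet; the votes of each group collapse into a single label
toy_result = (
    toy_votes.groupby(["id", "tweet"])["sentiment"]
    .apply(get_majority)
    .reset_index()
)
print(toy_result)
```

Each tweet ends up as exactly one row: tweet 1 receives `POSITIVE`, while tweet 2 has three different votes and receives `NO_MAJORITY`.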

In [7]:
annotated_df

Unnamed: 0,id,source_account,source_party,tweet,sentiment
0,1345053831080641024,larsklingbeil,SPD,@EskenSaskia @NowaboFM @spdde @OlafScholz @spd...,NEGATIVE
1,1345288989541076992,CDU,CDU_CSU,@MBiadaczMdB @CDU @cducsubt Das entspricht mei...,POSITIVE
2,1345337052557156096,CDU,CDU_CSU,@rRockxter @europeika @CDU Was hat denn die CD...,NEUTRAL
3,1345418571627834880,Linksfraktion,LINKE,@Linksfraktion @jankortemdb Spahn hat genau da...,NEUTRAL
4,1345428053053349888,CDU,CDU_CSU,@JM_Luczak @jensspahn @cducsubt @csu_bt @CDU @...,NEGATIVE
...,...,...,...,...,...
1995,1475957960308310016,cem_oezdemir,GRUENE,@TopcuElmas @cem_oezdemir Wenn sie sich die bi...,NEUTRAL
1996,1476247864967930112,CDU,CDU_CSU,@MikeSchuster_ @_FriedrichMerz @CDU hat er wie...,NEGATIVE
1997,1476485257243374080,fdp,FDP,@Judith__Sauer @fdp FDP = Erigierungspartei,NEUTRAL
1998,1476525854264069888,JoanaCotar,AFD,"@JoanaCotar Ist es strafbar, wenn man sie als ...",NEGATIVE


### 3. Filtering

Remove tweets without a majority label (`NO_MAJORITY`).

In [8]:
annotated_df = annotated_df.loc[annotated_df['sentiment'] != 'NO_MAJORITY'].reset_index(drop=True)

Remove tweets that the majority labeled as `MIXED`.

In [9]:
annotated_df = annotated_df.loc[annotated_df['sentiment'] != 'MIXED'].reset_index(drop=True)

In [10]:
annotated_df

Unnamed: 0,id,source_account,source_party,tweet,sentiment
0,1345053831080641024,larsklingbeil,SPD,@EskenSaskia @NowaboFM @spdde @OlafScholz @spd...,NEGATIVE
1,1345288989541076992,CDU,CDU_CSU,@MBiadaczMdB @CDU @cducsubt Das entspricht mei...,POSITIVE
2,1345337052557156096,CDU,CDU_CSU,@rRockxter @europeika @CDU Was hat denn die CD...,NEUTRAL
3,1345418571627834880,Linksfraktion,LINKE,@Linksfraktion @jankortemdb Spahn hat genau da...,NEUTRAL
4,1345428053053349888,CDU,CDU_CSU,@JM_Luczak @jensspahn @cducsubt @csu_bt @CDU @...,NEGATIVE
...,...,...,...,...,...
1868,1475957960308310016,cem_oezdemir,GRUENE,@TopcuElmas @cem_oezdemir Wenn sie sich die bi...,NEUTRAL
1869,1476247864967930112,CDU,CDU_CSU,@MikeSchuster_ @_FriedrichMerz @CDU hat er wie...,NEGATIVE
1870,1476485257243374080,fdp,FDP,@Judith__Sauer @fdp FDP = Erigierungspartei,NEUTRAL
1871,1476525854264069888,JoanaCotar,AFD,"@JoanaCotar Ist es strafbar, wenn man sie als ...",NEGATIVE


In [11]:
# Shuffle the rows reproducibly before saving and splitting
annotated_df = annotated_df.sample(frac=1, random_state=SEED_VALUE).reset_index(drop=True)

### 4. Save Annotations

In [12]:
annotated_df.to_csv(SAVE_ANNOTATIONS_PATH)
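
Because `to_csv` is called here without `index=False`, the row index is written as an extra, unnamed first column. The sketch below shows, on an invented two-row DataFrame and an in-memory buffer instead of a real file, how that column shows up when the CSV is read back and how `index_col=0` restores the original layout.

```python
import io

import pandas as pd

# Invented stand-in for the annotated data
toy_df = pd.DataFrame({"id": [10, 20], "sentiment": ["POSITIVE", "NEUTRAL"]})

# Write with the default settings, i.e. including the row index
buffer = io.StringIO()
toy_df.to_csv(buffer)

# Reading back naively turns the index into a column named "Unnamed: 0"
buffer.seek(0)
naive = pd.read_csv(buffer)

# Declaring the first column as the index restores the original columns
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Code that later loads the saved annotations may therefore want to pass `index_col=0` or drop the extra column.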

### 5. Create Folds

In [13]:
def create_fold_split_dir(index: int):
    # Create the directory for this split; do nothing if it already exists
    os.makedirs(f"{SPLIT_PATH}/TRAIN_TEST_{index}", exist_ok=True)

In [14]:
# Split the data into features and target
X = annotated_df.drop('sentiment', axis=1)
y = annotated_df['sentiment']

In [15]:
# Create a StratifiedKFold object with N_SPLITS folds
kf = StratifiedKFold(n_splits=N_SPLITS, random_state=SEED_VALUE, shuffle=True)

# Iterate over the folds
for i, (train_index, test_index) in enumerate(kf.split(X, y)):
    create_fold_split_dir(index=i)
    # Split the data into train and test sets
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print(X_train.shape[0], X_test.shape[0])
    
    # Save the train set to a CSV file
    train_df = pd.concat([X_train, y_train], axis=1)
    train_df.to_csv(f"{SPLIT_PATH}/TRAIN_TEST_{i}/train.csv", index=False)
    
    # Save the test set to a CSV file
    test_df = pd.concat([X_test, y_test], axis=1)
    test_df.to_csv(f"{SPLIT_PATH}/TRAIN_TEST_{i}/test.csv", index=False)


1498 375
1498 375
1498 375
1499 374
1499 374
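
The printed sizes follow from dividing the data as evenly as possible. Each line sums to 1873 tweets (1498 + 375), and 1873 = 5 · 374 + 3, so three folds receive 375 test tweets and two receive 374. The sketch below reproduces this arithmetic. `fold_sizes` is a hypothetical helper and not part of scikit-learn; it only models an even split, whereas `StratifiedKFold` additionally balances the classes, so its fold sizes are not guaranteed to follow this rule in every case.

```python
def fold_sizes(n_samples, n_splits):
    """Return (train_size, test_size) per fold for an as-even-as-possible split.

    Hypothetical helper: the first `n_samples % n_splits` folds receive one
    extra test sample.
    """
    base, remainder = divmod(n_samples, n_splits)
    sizes = []
    for fold in range(n_splits):
        test_size = base + 1 if fold < remainder else base
        sizes.append((n_samples - test_size, test_size))
    return sizes


# 1873 is the total implied by the output above (1498 + 375)
for train_size, test_size in fold_sizes(1873, 5):
    print(train_size, test_size)
```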
