## Dataset Preparation

This notebook splits the Truth Seeker dataset into subsets. The tweets are first divided into a training set and an evaluation set, and the training set is then divided further into an annotation sample and two training subsets. The train/evaluation split is made by topic (the `statement` column) rather than by individual tweet. As a result, every topic in the evaluation set is absent from the training set, so the evaluation measures performance on unseen topics.
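
The idea of a topic-based split can be illustrated on toy data. The following is a minimal sketch, not the notebook's actual data: the frame `toy` and its contents are hypothetical, and only the `statement` and `tweet` column names are taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): four topics with two tweets each.
toy = pd.DataFrame({
    "statement": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "tweet": [f"tweet {i}" for i in range(8)],
})

# Split the unique topics, not the individual tweets.
topics = toy["statement"].unique()
train_topics, eval_topics = train_test_split(topics, test_size=0.25, random_state=0)

# Every tweet follows its topic into exactly one subset.
toy_train = toy[toy["statement"].isin(train_topics)]
toy_eval = toy[toy["statement"].isin(eval_topics)]

# No topic is shared between the two subsets.
shared = set(toy_train["statement"]) & set(toy_eval["statement"])
print(f"Shared topics: {len(shared)}")
```

Splitting tweets directly would let tweets about the same statement land on both sides, which is what the topic-level split avoids.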

### Imports

In [2]:
import pandas as pd
import csv
import numpy as np
from sklearn.model_selection import train_test_split

### Load Truth Seeker Dataset

In [3]:
df = pd.read_csv("Dataset/Features_For_Traditional_ML_Techniques.csv")

In [4]:
print(df.columns)

Index(['Unnamed: 0', 'majority_target', 'statement', 'BinaryNumTarget',
       'tweet', 'followers_count', 'friends_count', 'favourites_count',
       'statuses_count', 'listed_count', 'following', 'embeddings', 'BotScore',
       'BotScoreBinary', 'cred', 'normalize_influence', 'mentions', 'quotes',
       'replies', 'retweets', 'favourites', 'hashtags', 'URLs', 'unique_count',
       'total_count', 'ORG_percentage', 'NORP_percentage', 'GPE_percentage',
       'PERSON_percentage', 'MONEY_percentage', 'DATE_percentage',
       'CARDINAL_percentage', 'PERCENT_percentage', 'ORDINAL_percentage',
       'FAC_percentage', 'LAW_percentage', 'PRODUCT_percentage',
       'EVENT_percentage', 'TIME_percentage', 'LOC_percentage',
       'WORK_OF_ART_percentage', 'QUANTITY_percentage', 'LANGUAGE_percentage',
       'Word count', 'Max word length', 'Min word length',
       'Average word length', 'present_verbs', 'past_verbs', 'adjectives',
       'adverbs', 'adpositions', 'pronouns', 'TOs', 'deter

### Split Dataset based on Topics

In [5]:
unique_statements = df["statement"].unique()
train_statements, test_statements = train_test_split(unique_statements, test_size=0.2, random_state=42)

In [6]:
print(f"Evaluation Dataset: {len(test_statements)} Topics")
print(f"Train Dataset: {len(train_statements)} Topics")

Evaluation Dataset: 212 Topics
Train Dataset: 846 Topics


### Collect Tweets from Topics

In [7]:
selected_rows = []
for statement in test_statements:
    matching_rows = df[df["statement"] == statement]
    
    selected_rows.append(matching_rows)

test_df = pd.concat(selected_rows, ignore_index=True)

In [8]:
selected_rows = []
for statement in train_statements:
    matching_rows = df[df["statement"] == statement]
    
    selected_rows.append(matching_rows)

train_df = pd.concat(selected_rows, ignore_index=True)
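
The two loops above can also be expressed with a single boolean mask. The sketch below uses a hypothetical toy frame to compare both approaches. Note that the mask keeps the rows in their original order, whereas the loop groups them by topic, so swapping one for the other in this notebook would change which rows the later random splits select.

```python
import pandas as pd

# Toy data (hypothetical), with tweets of the same topic interleaved.
toy = pd.DataFrame({
    "statement": ["A", "B", "A", "C", "B"],
    "tweet": ["t0", "t1", "t2", "t3", "t4"],
})
wanted = ["B", "A"]  # hypothetical list of selected topics

# Loop-and-concat, as in the cells above: rows end up grouped by topic.
by_loop = pd.concat([toy[toy["statement"] == s] for s in wanted], ignore_index=True)

# Single boolean mask: rows keep their original order.
by_mask = toy[toy["statement"].isin(wanted)].reset_index(drop=True)

# Both contain the same rows, possibly in a different order.
same_rows = sorted(by_loop["tweet"]) == sorted(by_mask["tweet"])
print(same_rows)
```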

In [9]:
print(f"Tweets Training Set: {len(train_df)} Tweets")
print(f"Tweets Evaluation: {len(test_df)} Tweets")

Tweets Training Set: 104005 Tweets
Tweets Evaluation: 30193 Tweets


### Check Balance

In [10]:
print(f'Balance Evaluation set: {len(test_df[test_df["BinaryNumTarget"]==0]) / len(test_df)}')
print(f'Balance Training: {len(train_df[train_df["BinaryNumTarget"]==0])/len(train_df)}')

Balance Evaluation set: 0.4978637432517471
Balance Training: 0.4830152396519398
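
The balance values above are the fraction of tweets whose `BinaryNumTarget` is 0. A more compact way to compute the same kind of ratio is sketched below on hypothetical toy labels; `value_counts(normalize=True)` additionally reports the share of every class at once.

```python
import pandas as pd

# Toy labels (hypothetical): two tweets with label 0, three with label 1.
toy = pd.DataFrame({"BinaryNumTarget": [0.0, 1.0, 1.0, 0.0, 1.0]})

# Fraction of rows with label 0: the mean of a boolean mask.
share_zero = (toy["BinaryNumTarget"] == 0).mean()

# Full class distribution in one call.
distribution = toy["BinaryNumTarget"].value_counts(normalize=True)

print(share_zero)
print(distribution)
```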


### Save Sets

In [12]:
test_df = test_df.rename(columns={"Unnamed: 0": "id"})
train_df = train_df.rename(columns={"Unnamed: 0": "id"})

test_df = test_df.astype(str)
train_df = train_df.astype(str)
test_df.to_csv("Dataset/evaluation.csv", index=False, quoting=csv.QUOTE_ALL, escapechar='\\')
train_df.to_csv("Dataset/train.csv", index=False, quoting=csv.QUOTE_ALL, escapechar='\\')
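
Because both frames are cast with `astype(str)` before saving, numeric columns such as `BinaryNumTarget` hold text from this point on, and any comparison made after the cast has to use string values. A small sketch on a hypothetical toy series:

```python
import pandas as pd

# Toy labels (hypothetical), stored as floats like the original column.
labels = pd.Series([1.0, 0.0, 1.0], name="BinaryNumTarget")

# After the cast, each value is the text form of the float.
as_text = labels.astype(str)
print(as_text.tolist())

# Comparisons must now use the string "1.0" rather than the number 1.
text_match = int((as_text == "1.0").sum())

# The numeric values can be recovered by casting back.
restored = as_text.astype(float)
```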

### Annotation Set

In [13]:
train_annotation, label_annotation = train_test_split(train_df, test_size=0.01, random_state=42)
len(label_annotation)

1041

In [14]:
print(f'Balance Label Set: {len(label_annotation[label_annotation["BinaryNumTarget"]=="1.0"])/len(label_annotation)}')
print(f'Unique Topics: {len(label_annotation["statement"].unique())}')

Balance Label Set: 0.5504322766570605
Unique Topics: 339
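
The annotation sample is drawn purely at random, so its class balance can drift from that of the full training set. If a closer match is wanted, one option is the `stratify` argument of `train_test_split`. This is an alternative and not what the notebook does; the sketch below uses hypothetical toy data with string labels, as they appear after the `astype(str)` cast.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 tweets with perfectly balanced string labels.
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(1000)],
    "BinaryNumTarget": ["1.0", "0.0"] * 500,
})

# Stratifying on the label keeps the class proportions in both parts.
rest, sample = train_test_split(
    toy, test_size=0.1, random_state=42, stratify=toy["BinaryNumTarget"]
)

share_true = (sample["BinaryNumTarget"] == "1.0").mean()
print(len(sample), share_true)
```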


In [15]:
label_annotation = label_annotation.rename(columns={"Unnamed: 0": "id"})
# Copy the selection so that adding columns later does not trigger a SettingWithCopyWarning
label_set = label_annotation[['id', 'tweet']].copy()

### Create Test Label Set

In [17]:
df_label = label_set

In [20]:
df_label['Valence_Positive'] = None
df_label['Valence_Neutral'] = None
df_label['Valence_Negative'] = None
df_label['Arousal_High'] = None
df_label['Arousal_Medium'] = None
df_label['Arousal_Low'] = None
df_label.to_csv("Dataset/label_annotation.csv", index=False, quoting=csv.QUOTE_ALL, escapechar='\\', sep=';')

### Split Training Data into "Each" and "Together" Sets

In [23]:
ls = df_label['id']
train = train_df[~train_df['id'].isin(ls)]
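
The cell above removes every tweet that was sampled for annotation from the training data by filtering on `id`. A minimal sketch of the same pattern on hypothetical toy data, with a check that no annotated id remains:

```python
import pandas as pd

# Toy data (hypothetical): ids are strings, as after the astype(str) cast.
toy_train = pd.DataFrame({"id": ["0", "1", "2", "3", "4"], "tweet": list("abcde")})
annotated_ids = pd.Series(["1", "3"])

# Keep only the rows whose id is NOT in the annotation sample.
remaining = toy_train[~toy_train["id"].isin(annotated_ids)]

# Sanity check: the two id sets must be disjoint.
overlap = set(remaining["id"]) & set(annotated_ids)
print(len(remaining), len(overlap))
```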

In [27]:
train_each_annotation, train_together_annotation = train_test_split(train, test_size=0.4, random_state=42)
len(train_each_annotation)

61778
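
The size shown above can be cross-checked by hand from the numbers reported earlier. The sketch assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample and that the `id` values are unique, so that exactly the annotated rows are removed.

```python
import math

n_train_tweets = 104005  # size of the training set reported above

# 1% annotation sample, rounded up.
n_annotation = math.ceil(0.01 * n_train_tweets)

# Rows left after removing the annotation sample (assumes unique ids).
n_remaining = n_train_tweets - n_annotation

# 40% go to the "together" set (rounded up); the rest goes to the "each" set.
n_together = math.ceil(0.4 * n_remaining)
n_each = n_remaining - n_together

print(n_annotation, n_remaining, n_together, n_each)
# prints: 1041 102964 41186 61778
```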

In [29]:
train_each_annotation.to_csv("Dataset/annotation_emotion_and_label_bert.csv", index=False)
train_together_annotation.to_csv("Dataset/train_emotion_together.csv", index=False)