# Split data into training and test sets

Author: Sara Hoxha

In this notebook, we split each of our labeled datasets (ground truths) into a training set and a test set while ensuring that all emotion classes are proportionally represented in both.

Using the train_test_split function from sklearn.model_selection, we split the data into training (80%) and test (20%) sets while maintaining the class distribution.
- The stratify parameter, set to the label column, keeps each class's share roughly the same in both sets as in the full dataset.
- The random_state parameter fixes the shuffle seed, so rerunning the notebook reproduces the same split.
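
To make the effect of stratify concrete, here is a minimal sketch on made-up data. The toy DataFrame, its class names, and the 60/30/10 class balance are assumptions chosen for illustration and are not taken from the project's datasets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, made-up data: 100 texts with an imbalanced three-class label column.
toy = pd.DataFrame({
    'text': [f'sample {i}' for i in range(100)],
    'label': ['joy'] * 60 + ['anger'] * 30 + ['fear'] * 10,
})

# Same call pattern as in this notebook: 80/20 split, fixed seed, stratified on the labels.
X_train, X_test, y_train, y_test = train_test_split(
    toy['text'], toy['label'],
    test_size=0.2, random_state=42, stratify=toy['label'],
)

# Per-class share in each split; with stratification the two should closely match.
train_share = y_train.value_counts(normalize=True).sort_index()
test_share = y_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```

Without stratify, a small class such as the toy 'fear' class could end up over- or under-represented in the 20% test split purely by chance.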

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Yangswei 85

In [20]:
# Load the Reddit texts together with their emotion labels
df_yangswei_85 = pd.read_csv('../Pretrained_Model Implementation/labeled_data_reddit_text_yangswei_85.csv')
train_data = df_yangswei_85['text']
train_labels = df_yangswei_85['predictions']

# 80/20 split, stratified on the labels so class proportions are preserved
X_train, X_test, Y_train, Y_test = train_test_split(train_data, train_labels, test_size=0.2, random_state=42, stratify=train_labels)
train_df = pd.DataFrame({'text': X_train, 'label': Y_train})
test_df = pd.DataFrame({'text': X_test, 'label': Y_test})

# Save both splits without the pandas index column
train_df.to_csv('data/train_yangswei_85.csv', index=False)
test_df.to_csv('data/test_yangswei_85.csv', index=False)
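
One caveat with a stratified split: train_test_split raises a ValueError when a class has only a single example, because that class cannot appear in both sets. Whether this affects the datasets here is not stated, so the following is only an optional guard. The helper name drop_rare_classes, its min_count parameter, and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def drop_rare_classes(df, label_col, min_count=2):
    """Return a copy of df without rows whose label occurs fewer than min_count times.

    Hypothetical helper for illustration: stratification needs at least
    two examples per class.
    """
    counts = df[label_col].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[label_col].isin(keep)].reset_index(drop=True)

# Toy, made-up data: 'fear' occurs only once and would break a stratified split.
toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e'],
    'predictions': ['joy', 'joy', 'anger', 'anger', 'fear'],
})
filtered = drop_rare_classes(toy, 'predictions')
print(filtered['predictions'].value_counts())
```

Dropping rows changes the dataset, so it may be preferable to inspect the rare classes first rather than filter them silently.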


# T5

In [4]:
# Load the texts together with the T5 emotion labels
df_t5 = pd.read_csv('../Pretrained_Model Implementation/t5_model_final.csv')
train_data = df_t5['text']
train_labels = df_t5['predicted_label']

# 80/20 split, stratified on the labels so class proportions are preserved
X_train, X_test, Y_train, Y_test = train_test_split(train_data, train_labels, test_size=0.2, random_state=42, stratify=train_labels)
train_df = pd.DataFrame({'text': X_train, 'label': Y_train})
test_df = pd.DataFrame({'text': X_test, 'label': Y_test})

# Save both splits without the pandas index column
train_df.to_csv('data/train_t5.csv', index=False)
test_df.to_csv('data/test_t5.csv', index=False)
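
The two cells above follow the same load, split, and save pattern, differing only in the label column and file names. As a sketch of how that pattern could be factored out, the function below wraps it. The name stratified_split_to_csv, its parameters, the directory-creation step, and the demo data are assumptions for illustration and not part of the original notebook.

```python
import os
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split_to_csv(df, text_col, label_col, train_path, test_path,
                            test_size=0.2, random_state=42):
    """Stratified split of df, saved as two CSVs with 'text' and 'label' columns."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[text_col], df[label_col],
        test_size=test_size, random_state=random_state, stratify=df[label_col],
    )
    train_df = pd.DataFrame({'text': X_train, 'label': y_train})
    test_df = pd.DataFrame({'text': X_test, 'label': y_test})

    # Create the output directories if needed (an addition to the notebook's behavior).
    for path in (train_path, test_path):
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)

    train_df.to_csv(train_path, index=False)
    test_df.to_csv(test_path, index=False)
    return train_df, test_df

# Demo on made-up data, written to a temporary directory instead of 'data/'.
demo = pd.DataFrame({
    'text': [f'post {i}' for i in range(50)],
    'predicted_label': ['joy'] * 30 + ['sadness'] * 20,
})
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, 'data', 'train_demo.csv')
test_path = os.path.join(out_dir, 'data', 'test_demo.csv')
train_df, test_df = stratified_split_to_csv(
    demo, 'text', 'predicted_label', train_path, test_path)
print(len(train_df), len(test_df))
```

With such a helper, each dataset in this notebook would need one call with its own label column ('predictions' or 'predicted_label') and output paths.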