# Train Test Split

This notebook splits the dataset into train and test partitions according to a configurable test size. Saving both partitions to disk lets every learner train and evaluate on the same data.

The dataset split here combines the researchers' own dataset gathered from Reddit (`annotated-dataset.csv`) with the [2016 & 2022 Hate Speech Filipino dataset](https://huggingface.co/datasets/mapsoriano/2016_2022_hate_speech_filipino).

## Imports

In [1]:
import pandas as pd
import numpy as np
import math

pd.set_option('max_colwidth', 800)



## Settings

In [2]:
# Fraction of the dataset to assign to the test split
# Must be a float strictly between 0 and 1
# For example, a TEST_SIZE of 0.2 gives an 80:20 split
# between train and test, respectively
TEST_SIZE = 0.2

# Locate the dataset to be split
DATASET = 'datasets/datasetall.csv'

# Set the filenames of the train and test datasets
SAVE_TRAIN_DATASET_FILENAME = 'datasets/dataset-train.csv'
SAVE_TEST_DATASET_FILENAME = 'datasets/dataset-test.csv'

# Set to False to run the notebook without writing the
# train and test files (useful for testing)
SAVE_DATA = False

## Read the dataset

In [3]:
def read_csv_file(filename: str) -> pd.DataFrame:
    try:
        data = pd.read_csv(filename, lineterminator='\n', usecols=range(2))
        print("CSV file read successfully!")
        return data
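    except FileNotFoundError:
        # Report the missing path, then re-raise so the cell stops
        # without shutting down the notebook kernel
        print(f"ERROR: File not found: {filename}")
        raise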

dataset = read_csv_file(DATASET)
dataset

CSV file read successfully!


Unnamed: 0,text,label
0,Binay: Patuloy ang kahirapan dahil sa maling pamamahala [USERNAME],0
1,SA GOBYERNONG TAPAT WELCOME SA BAGUO ANG LAHAT! Kulay Rosas Ang Bukas Let Leni Kiko Lead Let Leni Lead panalo Natin Para Sa Lahat : [USERNAME],0
2,wait so ur telling me Let Leni Lead mo pero NYONG UYAB BBM SUPPORTER?!??!!!????? to me thats like dating a trump supporter. fuck no bye,1
3,[USERNAME]wish this is just a nightmare that could end. Ma Pa we failed again. let leni lead never again kakampink Sa Gobyernong Tapat Angat Buhay Lahat,0
4,doc willie ong and isko sabunutan po,0
...,...,...
28456,"Bisaya, Probinsyano/a, mostly Bisaya = katulong",1
28457,Amnesia. In my whole life wala pa ako nakasalamuha na nagkaamnesia. Sa telenovela akala mo sipon lang yung amnesia. Nag-maynila yung lead actress. Naging pokpok. Or napasama sa human trafficking eme eme. Baril. Aside sa security guards madalang ako makakita nito. Pero sa telenovela akala mo nabibili sa sari sari store yung baril sa sobrang common. Deadbeat dad and abusive step father. Puta lagi na lang ganito yung cannon event ng bida. Di ba pwedeng normal lang na buhay?,1
28458,Kontrabida na ilang beses na tinalo at obvious naghihirap pero somehow may resource para maghire ng goons and sht... Like btch how are you paying for all these schemes???,1
28459,Yung antagonist laging kailangang sobrang sama. Lalong lalo na sa mga GMA soap. Yung tipong romance tapos sa dulo magiging parang action dahil yung kontrabida may papatayin or i hostage. Ayun kabwiset.,1


In [4]:
dataset['label'].value_counts(ascending=True)

label
0    14115
1    14346
Name: count, dtype: int64
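
The counts above show that the two classes are nearly balanced. As a quick check, the class shares can be computed directly from those counts. This is a small sketch; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 14115, 1: 14346}

total = sum(counts.values())

# Share of each class in the full dataset
shares = {label: count / total for label, count in counts.items()}

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
```

Because the shares are close to one half each, a stratified split mainly guards against an unlucky shuffle skewing one partition.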

## Functions for splitting

In [5]:
random_number_generator = np.random.default_rng()
def shuffle_data_frame(data_frame):
    text = list(data_frame['text'])
    label = list(data_frame['label'])

    assert(len(text) == len(label))

    indices = list(range(len(label)))

    # Shuffle the list of indices in place using the module-level generator
    random_number_generator.shuffle(indices)

    shuffled_text = []
    shuffled_labels = []

    # Iterate through the list of indices and add the original data
    # from those shuffled indices
    for index in indices:
        shuffled_text.append(text[index])
        shuffled_labels.append(label[index])

    return pd.DataFrame({
        'text': shuffled_text,
        'label': shuffled_labels,
    })


def get_train_test_split(data_frame: pd.DataFrame, test_size: float):
    """
    Makes a stratified train-test split.
    This aims to keep the class distribution of the full dataset
    in both partitions.

    Returns (X_train, X_test, y_train, y_test).
    """
    if not (0 < test_size < 1):
        raise ValueError('test_size must be strictly between 0 and 1')

    data_frame = shuffle_data_frame(data_frame)

    data_frame_length = len(data_frame)
    train_size = 1 - test_size

    nonhate_rows = data_frame[data_frame['label'] == 0] 
    nonhate_row_length = len(nonhate_rows)

    nonhate_row_train_size = math.ceil(nonhate_row_length * train_size)

    nonhate_row_train = nonhate_rows[0:nonhate_row_train_size]
    nonhate_row_test = nonhate_rows[nonhate_row_train_size:nonhate_row_length]

    assert(len(nonhate_row_train) + len(nonhate_row_test) == nonhate_row_length)

    hate_rows = data_frame[data_frame['label'] == 1] 
    hate_row_length = len(hate_rows)

    hate_row_train_size = math.ceil(hate_row_length * train_size)

    hate_row_train = hate_rows[0:hate_row_train_size]
    hate_row_test = hate_rows[hate_row_train_size:hate_row_length]

    assert(len(hate_row_train) + len(hate_row_test) == hate_row_length)

    combined_train = pd.concat([nonhate_row_train, hate_row_train])
    combined_test = pd.concat([nonhate_row_test, hate_row_test])

    shuffled_train = shuffle_data_frame(combined_train)
    shuffled_test = shuffle_data_frame(combined_test)

    return (
        shuffled_train['text'],
        shuffled_test['text'],
        shuffled_train['label'],
        shuffled_test['label'],
    )
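
The same idea can be sketched without pandas. The following is a minimal illustration, not the notebook's implementation: it groups rows by label, shuffles each group with a seeded generator, and cuts every group at `ceil(n * train_size)`. The function name `stratified_split`, the `seed` parameter, and the toy rows are assumptions made for this example. The notebook itself uses an unseeded `np.random.default_rng()`, so its shuffles differ between runs.

```python
import math
import random
from collections import defaultdict


def stratified_split(rows, test_size, seed=0):
    """Split (text, label) pairs so each label keeps the same train share.

    Hypothetical helper, for illustration only.
    """
    if not 0 < test_size < 1:
        raise ValueError("test_size must be strictly between 0 and 1")

    rng = random.Random(seed)  # seeded, so the split is repeatable

    # Group the rows by label
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Same rounding rule as the notebook: round the train part up
        cut = math.ceil(len(group) * (1 - test_size))
        train.extend(group[:cut])
        test.extend(group[cut:])

    # Shuffle again so the classes are interleaved in each partition
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data: 10 rows with label 0 and 5 rows with label 1
rows = [(f"text {i}", 0) for i in range(10)]
rows += [(f"text {i}", 1) for i in range(10, 15)]

train, test = stratified_split(rows, test_size=0.2, seed=42)
print(len(train), len(test))
```

Passing the same `seed` reproduces the same partitions. In the notebook, a comparable effect could be obtained by giving `np.random.default_rng` a seed, although the saved CSV files already let every learner share one split.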

## Split the dataset

In [6]:
X_train, X_test, y_train, y_test = get_train_test_split(dataset, TEST_SIZE)

## Train Data

In [7]:
pd.DataFrame({
  'text': X_train,
  'label': y_train,
})

Unnamed: 0,text,label
0,Matthew Chang [USERNAME] Remind ko lang di ba galit na galit ka dun sa taong di marunong magbayad ng utang? Tapos kay marcos hindi iboboto mo pa? Well Marcos Magnanakaw Never Again,1
1,Yay! The interview served its purpose wellJessica Soho Interviews Angat Buhay LahatKakampink,0
2,I say DASURV,0
3,TayNew said Let Leni Lead,0
4,Gloc 9 is not endorsing Jejomar Binay as his presidential bet 2016 Elections 2016 Polls,0
...,...,...
22764,Nov. 11: on [USERNAME] saw tv ads of Jojo BinayFrancis TolentinoAlan CayetanoMartin RomualdezMar RoxasRisa Hontiveros epal watch,1
22765,Mar Roxas your call for unity describes one thing! SELFISHNESS! You don't deserve to be the PRESIDENT!!!,1
22766,Buti nalang nagdecide nakong hindi manood ng TV. Hindi ko pa napapakinggan yung Only Binay na yan.,0
22767,sang boto para sa pagbabago. Let Leni Lead philippine elections para sa pagbabago laban para sa bayan ofw dubai election,0


In [8]:
y_train.value_counts(ascending=True)

label
0    11292
1    11477
Name: count, dtype: int64

## Test Data

In [9]:
pd.DataFrame({
  'text': X_test,
  'label': y_test,
})

Unnamed: 0,text,label
0,Hindi susuportahan ng theatre and literary establishmentmafiasi Ka Leody de Guzman dahil at huwag na tayong maglokohan dito mga middle class matapobre from the centre centre left to the far left ang mga espasyo na ito sa PILI pinas PH Literary Mafia,0
1,BABAE LABAN SA FAKE AT FRAUDBFFSUMBONGDAYA DESKS The Sumbong Daya booth is our way of encouraging people to get involved in monitoring the elections one expression of peoples vigilance of beingmapagbantay eleksyon,1
2,Im proud to be a Filipino and a kakampink like BroArminAnimo La Salle! LSA NY here,0
3,Grabe noThe hypocrisy of the church to preach the Word of the Lord but then endorse politicians who clearly apparently definitely certainly evidently violated even just the Ten CommandmentsLike how can u do that?Yikes Halalan2022,1
4,BBMSARAUniteam Ph Arena BBMSARA,0
...,...,...
5687,[USERNAME] Rizalito David is a good man you can feel the sincerity everytime he talked And this is my first time i saw him not even before This is the kind of candidate we need,0
5688,A very famous religious cult in the Philippines will vote for Duterte and Marcos. ?? If only I have the means to get out of this country. ??,1
5689,Tama sir VP Leni Di dapat iboto SI BBM Kase No1 SINUNGALING Angat Buhay Lahat,0
5690,RT [USERNAME]: Mar Roxas forever arrogantI can't imagine him as a president plus the irritating first lady on his side.,1


In [10]:
y_test.value_counts(ascending=True)

label
0    2823
1    2869
Name: count, dtype: int64
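
The per-class counts in the train and test outputs follow from the rounding rule in `get_train_test_split`: each class sends `ceil(n * (1 - TEST_SIZE))` rows to the training set and the remainder to the test set. Below is a small check of that arithmetic, using the class counts of the full dataset reported earlier. The helper name `expected_split` is an assumption made for this example.

```python
import math


def expected_split(class_count, test_size):
    """Return (train_rows, test_rows) for one class under the ceil rule."""
    train_rows = math.ceil(class_count * (1 - test_size))
    return train_rows, class_count - train_rows


TEST_SIZE = 0.2

# Class counts of the full dataset, from the earlier value_counts() output
nonhate_train, nonhate_test = expected_split(14115, TEST_SIZE)
hate_train, hate_test = expected_split(14346, TEST_SIZE)

print("train:", nonhate_train, hate_train)
print("test: ", nonhate_test, hate_test)
```

Rounding the train part up means any leftover fractional row goes to the training set, so the test partition can be at most one row per class smaller than an exact 20% share.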

## Saving Data

In [11]:
if SAVE_DATA:
  pd.DataFrame({
    'text': X_train,
    'label': y_train,
  }).to_csv(SAVE_TRAIN_DATASET_FILENAME, index=False)

In [12]:
if SAVE_DATA:
  pd.DataFrame({
    'text': X_test,
    'label': y_test,
  }).to_csv(SAVE_TEST_DATASET_FILENAME, index=False)