# Training

This notebook walks through training a Bernoulli Naive Bayes classifier from a provided CSV file. It covers loading the data, making stratified train, validation, and test splits, searching over the smoothing parameter `alpha`, and saving the evaluation metrics.


## Importing Necessary Libraries

First, we import the libraries and modules this notebook needs: pandas, NumPy, and PyTorch for data handling and seeding, scikit-learn for model selection and metrics, and the project's own `Bayes` and `Utils` modules from `src`.

In [1]:
import pandas as pd
import torch
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, 
    recall_score, 
    precision_score, 
    f1_score
)
from src import Bayes, Utils
import math



## Utility Functions

In [2]:
MODEL_FOLDER = 'model_bayes'

### Function to Read CSV File

The `read_csv_file` function reads the first two columns of the CSV file and returns them as a pandas DataFrame. If the file is not found, it prints an error message and exits.

In [3]:
def read_csv_file(filename: str) -> pd.DataFrame:
    try:
        data = pd.read_csv(filename, lineterminator='\n', usecols=range(2))
        print("CSV file read successfully!")
        return data
    except FileNotFoundError:
        print("ERROR: File not found")
        # exit() is an interactive helper; raising SystemExit works in any context
        raise SystemExit(1)

# Demonstrate reading a CSV file (use a sample or mock filename)
# dataset = read_csv_file('datasets/datasetall.csv')

dataset = Utils.read_csv_file('datasets/datasetall.csv')
dataset

CSV file read successfully!


Unnamed: 0,text,label
0,Binay: Patuloy ang kahirapan dahil sa maling p...,0
1,SA GOBYERNONG TAPAT WELCOME SA BAGUO ANG LAHAT...,0
2,wait so ur telling me Let Leni Lead mo pero NY...,1
3,[USERNAME]wish this is just a nightmare that ...,0
4,doc willie ong and isko sabunutan po,0
...,...,...
28456,"Bisaya, Probinsyano/a, mostly Bisaya = katulong",1
28457,Amnesia. In my whole life wala pa ako nakasala...,1
28458,Kontrabida na ilang beses na tinalo at obvious...,1
28459,Yung antagonist laging kailangang sobrang sama...,1


In [4]:
dataset['label'].value_counts(ascending=True)

label
0    14115
1    14346
Name: count, dtype: int64

### Function to Seed Random Number Generators

To ensure reproducibility, the `seed_random_number_generators` function seeds the random number generators for PyTorch and NumPy.

In [5]:
def seed_random_number_generators(seed=0):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    print("Random number generators seeded.")

# Seed the random number generators
# seed_random_number_generators()

Utils.seed_random_number_generators()

Random number generators seeded.
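
One detail worth keeping in mind: `np.random.seed` seeds NumPy's legacy global generator, whereas a `Generator` created with `np.random.default_rng(seed=...)`, which is what `shuffle_data_frame` in the next section uses, carries its own independent state. The sketch below illustrates this with a hypothetical helper, `shuffled_indices`, that is not part of `Utils`.

```python
import numpy as np


def shuffled_indices(length: int, seed: int = 0) -> list:
    """Hypothetical helper: return indices 0..length-1 in a seeded random order."""
    indices = list(range(length))
    # A fresh Generator with a fixed seed always yields the same permutation
    rng = np.random.default_rng(seed=seed)
    rng.shuffle(indices)
    return indices


np.random.seed(123)            # seed the legacy global generator
first = shuffled_indices(10)

np.random.seed(456)            # a different global seed...
second = shuffled_indices(10)  # ...does not change the Generator's output

print(first == second)
```

In other words, the shuffle order in this notebook is fixed by the `seed=0` passed to `default_rng`, not by the call to `seed_random_number_generators`.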


### Function for Train-Test Split

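The `get_train_test_split` function makes a stratified split of the dataset into training and testing sets and returns them. Each class is split separately, so both sets keep the class balance of the full dataset, and `test_size` sets the fraction of rows that goes to the testing set.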

In [6]:
TEST_SIZE = 0.2

def shuffle_data_frame(data_frame):
    text = list(data_frame['text'])
    label = list(data_frame['label'])

    assert(len(text) == len(label))

    indices = list(range(len(label)))

    # Make a random number generator that will shuffle list of indices
    # It is seeded to be reproducible
    random_number_generator = np.random.default_rng(seed=0)
    random_number_generator.shuffle(indices)

    shuffled_text = []
    shuffled_labels = []

    # Iterate through the list of indices and add the original data
    # from those shuffled indices
    for index in indices:
        shuffled_text.append(text[index])
        shuffled_labels.append(label[index])

    return pd.DataFrame({
        'text': shuffled_text,
        'label': shuffled_labels,
    })


def get_train_test_split(data_frame: pd.DataFrame, test_size: float):
    """
    Makes a stratified train test split.
    This aims to preserve the distribution between classes.
    """
    if not (0 < test_size < 1):
        # Returning None here would only fail later, when the caller unpacks four values
        raise ValueError('test_size must be between 0 and 1')

    data_frame = shuffle_data_frame(data_frame)

    train_size = 1 - test_size

    nonhate_rows = data_frame[data_frame['label'] == 0] 
    nonhate_row_length = len(nonhate_rows)

    nonhate_row_train_size = math.ceil(nonhate_row_length * train_size)

    nonhate_row_train = nonhate_rows[0:nonhate_row_train_size]
    nonhate_row_test = nonhate_rows[nonhate_row_train_size:nonhate_row_length]

    assert(len(nonhate_row_train) + len(nonhate_row_test) == nonhate_row_length)

    hate_rows = data_frame[data_frame['label'] == 1] 
    hate_row_length = len(hate_rows)

    hate_row_train_size = math.ceil(hate_row_length * train_size)

    hate_row_train = hate_rows[0:hate_row_train_size]
    hate_row_test = hate_rows[hate_row_train_size:hate_row_length]

    assert(len(hate_row_train) + len(hate_row_test) == hate_row_length)

    combined_train = pd.concat([nonhate_row_train, hate_row_train])
    combined_test = pd.concat([nonhate_row_test, hate_row_test])

    shuffled_train = shuffle_data_frame(combined_train)
    shuffled_test = shuffle_data_frame(combined_test)

    return (
        shuffled_train['text'],
        shuffled_test['text'],
        shuffled_train['label'],
        shuffled_test['label'],
    )

X_train, X_test, y_train, y_test = Utils.get_train_test_split(dataset, TEST_SIZE)
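
For comparison, scikit-learn's `train_test_split` (imported in the first cell) can produce a stratified split through its `stratify` argument. This is a sketch of an alternative, not the method this notebook uses; the toy data and the `_toy` variable names are made up for illustration, and scikit-learn's rounding can differ from the `math.ceil` approach above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    'text': [f'sample {i}' for i in range(20)],
    'label': [0] * 10 + [1] * 10,
})

# stratify keeps the 0/1 proportion the same in both parts;
# random_state makes the shuffle reproducible
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    toy['text'],
    toy['label'],
    test_size=0.2,
    stratify=toy['label'],
    random_state=0,
)

print(y_train_toy.value_counts().to_dict())
print(y_test_toy.value_counts().to_dict())
```

With 20 balanced rows and `test_size=0.2`, the training part holds 16 rows and the testing part 4, each with equal numbers of both labels.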

## Train Data

In [7]:
pd.DataFrame(X_train)

Unnamed: 0,text
0,pag hindi nanalo si Norberto Gonzales pwede ba...
1,Ngayon lang ako super proud sa PRESIDENTE na i...
2,JUST SAW SOMEONE CALL BBM BLENGBLONG HAHAHAHAH...
3,Rep. Binay on her leadership style: I am very ...
4,Liwanag o dilim? May oras pa. Kakampink Leni L...
...,...
25611,"""Kala ko wala andito pala si Marcos.""*pertaini..."
25612,cathy [USERNAME] Dec [USERNAME] parang tanga i...
25613,Nognog+pandak= BINAY ftw
25614,BINAY:Did your enormous wealth all come from y...


In [8]:
y_train_dataframe = pd.DataFrame(y_train, columns=['label'])
y_train_dataframe

Unnamed: 0,label
0,1
1,0
2,1
3,0
4,0
...,...
25611,0
25612,1
25613,1
25614,1


In [9]:
y_train_dataframe.value_counts(ascending=True)

label
0        12704
1        12912
Name: count, dtype: int64

## Test Data

In [10]:
pd.DataFrame(X_test)

Unnamed: 0,text
0,PRESIDENTE DUTERTE I'm sure in last debateitao...
1,CHANGE IS BADLY NEEDED No To Mar Roxas2016 Dut...
2,One Pink March Leni Kiko
3,see youuu later Leni Kiko
4,[USERNAME] Nangyari na yan eh pero kahit anong...
...,...
2840,kaya siguro umabot ng milyon yung boto kay MAR...
2841,Dedicating my 21km run for my chosen Presand V...
2842,Bakit si Mar? Because DuterteGrace Poe and VP ...
2843,patalo po ung patalastas ni Mar Roxas....malas...


In [11]:
VAL_SPLIT = 0.5

X_val, X_test, y_val, y_test = Utils.get_train_test_split(
  pd.DataFrame({
    'text': X_test,
    'label': y_test,
  }), 
  VAL_SPLIT,
)

In [12]:
y_test_dataframe = pd.DataFrame(y_test, columns=['label'])
y_test_dataframe

Unnamed: 0,label
0,0
1,0
2,0
3,1
4,0
...,...
1417,0
1418,0
1419,0
1420,1


In [13]:
y_test_dataframe.value_counts(ascending=True)

label
0        705
1        717
Name: count, dtype: int64
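
The printed class counts can be cross-checked by hand with the same `math.ceil` rounding that `get_train_test_split` applies per class. The sketch below uses a hypothetical helper, `stratified_sizes`, and assumes a test fraction of 0.1: that value is an assumption, chosen because it reproduces the printed training counts (12704 and 12912), whereas the same arithmetic with `TEST_SIZE = 0.2` would give smaller training sets.

```python
import math


def stratified_sizes(class_count: int, test_size: float) -> tuple:
    """Return (train_rows, test_rows) for one class, rounding the train part up."""
    train_rows = math.ceil(class_count * (1 - test_size))
    return train_rows, class_count - train_rows


# Class counts taken from the value_counts() output of the full dataset
CLASS_COUNTS = {0: 14115, 1: 14346}

# Assumed fraction: 0.1 is the value that reproduces the printed training counts
ASSUMED_TEST_SIZE = 0.1

for label, count in CLASS_COUNTS.items():
    train_rows, holdout_rows = stratified_sizes(count, ASSUMED_TEST_SIZE)
    # The holdout part is split again with VAL_SPLIT = 0.5 into validation and test
    val_rows, test_rows = stratified_sizes(holdout_rows, 0.5)
    print(label, train_rows, val_rows, test_rows)
# prints:
# 0 12704 706 705
# 1 12912 717 717
```

Under that assumption the final test set has 705 and 717 rows per class, which agrees with the `value_counts` output above.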

## Grid Search

### Bernoulli Naive Bayes
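
The search loop in the next cell clones `Bayes.BayesPipeline` and sets `bayes__alpha` on the clone. The definition of `Bayes.BayesPipeline` is not shown in this notebook, so the following is only a sketch of what such a pipeline might look like: the step name `bayes` is implied by the `bayes__alpha` parameter, while the `vectorizer` step, the use of `CountVectorizer`, and the toy texts are assumptions.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Assumed structure; only the step name 'bayes' is implied by `bayes__alpha`
sketch_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('bayes', BernoulliNB()),
])

# clone() returns an unfitted copy with the same parameters,
# so changing alpha on the copy leaves the template untouched
candidate = clone(sketch_pipeline)
candidate.set_params(bayes__alpha=0.5)

# Tiny made-up corpus: label 0 for "good" texts, 1 for "bad" texts
texts = ['good day', 'bad day', 'good work', 'bad work']
labels = [0, 1, 0, 1]
candidate.fit(texts, labels)

print(candidate.predict(['good']))
print(sketch_pipeline.get_params()['bayes__alpha'])  # template keeps the default, 1.0
```

Cloning inside the loop matters because it guarantees that each value of `alpha` is evaluated on a freshly fitted model rather than on state left over from the previous iteration.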

In [14]:
from sklearn.base import clone

alpha = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

total_accuracy = []
total_recall = []
total_precision = []
total_f1 = []
total_test_accuracy = []
total_test_recall = []
total_test_precision = []
total_test_f1 = []

for alpha_value in alpha:
  # Clone the pipeline so every alpha starts from an unfitted copy
  train_bayes = clone(Bayes.BayesPipeline)

  train_bayes.set_params(
    bayes__alpha=alpha_value,
  )

  train_bayes.fit(X_train, y_train)

  accuracy, recall, precision, f1 = Utils.get_prediction_results(
    X_val,
    y_val,
    train_bayes,
  )

  test_accuracy, test_recall, test_precision, test_f1 = Utils.get_prediction_results(
    X_test,
    y_test,
    train_bayes,
  )

  total_accuracy.append(accuracy)
  total_recall.append(recall)
  total_precision.append(precision)
  total_f1.append(f1)
  total_test_accuracy.append(test_accuracy)
  total_test_recall.append(test_recall)
  total_test_precision.append(test_precision)
  total_test_f1.append(test_f1)

Accuracy: 0.7687983134223472
Recall: 0.8242677824267782
Precision: 0.7443324937027708
F1-score: 0.7822634017207147
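
`Utils.get_prediction_results` is not shown in this notebook. Assuming it reports the standard binary-classification metrics with label 1 as the positive class, the four numbers relate to the confusion counts as in the sketch below; `binary_metrics` is a hypothetical name and the toy labels are made up.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, recall, precision, f1), treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, recall, precision, f1


# Toy labels: 3 true positives, 1 false negative, 2 false positives, 2 true negatives
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0]
print(binary_metrics(toy_true, toy_pred))

# The printed F1-score is the harmonic mean of the printed precision and recall
printed_precision = 0.7443324937027708
printed_recall = 0.8242677824267782
f1_from_printed = (
    2 * printed_precision * printed_recall / (printed_precision + printed_recall)
)
print(f1_from_printed)
```

The last value agrees with the printed F1-score of about 0.7823, which is consistent with the assumption that the metrics are the standard ones.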

In [15]:
metrics_data_frame = pd.DataFrame(
  {
    'accuracy': total_accuracy,
    'recall': total_recall,
    'precision': total_precision,
    'f1': total_f1,
    'test_accuracy': total_test_accuracy,
    'test_recall': total_test_recall,
    'test_precision': total_test_precision,
    'test_f1': total_test_f1,
  },
  columns=[
    'accuracy', 
    'recall', 
    'precision', 
    'f1',
    'test_accuracy', 
    'test_recall', 
    'test_precision', 
    'test_f1',
  ],
)
metrics_data_frame.to_csv(f'{MODEL_FOLDER}/eval_metrics.csv')
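
Once the metrics are collected, one way to pick a value of `alpha` is to take the row with the highest validation F1 and only then read off its test scores. The sketch below uses made-up scores and adds an `alpha` column that the saved `eval_metrics.csv` does not contain; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical scores, one row per alpha value tried
toy_metrics = pd.DataFrame({
    'alpha': [0.0, 0.5, 1.0],
    'f1': [0.70, 0.78, 0.75],       # validation F1, used for selection
    'test_f1': [0.69, 0.77, 0.76],  # test F1, only reported afterwards
})

# idxmax returns the index label of the row with the highest validation F1
best_row = toy_metrics.loc[toy_metrics['f1'].idxmax()]
best_alpha = float(best_row['alpha'])

print(best_alpha)
print(float(best_row['test_f1']))
```

Selecting on the validation columns and leaving the `test_` columns for the final report keeps the test set from influencing the choice of hyperparameter.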