# Training

This notebook demonstrates training a Bernoulli Naive Bayes text classifier from a provided CSV file. It covers loading the data, making a stratified train-test split, searching for the smoothing parameter with cross-validation, evaluating on the held-out test set, and saving the trained model.


## Importing Necessary Libraries

First, we import the libraries and modules this notebook needs: pandas and NumPy for data manipulation, PyTorch (used here for seeding), scikit-learn for model selection and metrics, and the project's own `Bayes` and `Utils` modules from `src`.

In [1]:
import pandas as pd
import torch
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, 
    recall_score, 
    precision_score, 
    f1_score
)
from src import Bayes, Utils
import math



## Utility Functions

In [2]:
MODEL_FOLDER = 'model_bayes'

### Function to Read CSV File

The `read_csv_file` function reads the CSV file and returns a pandas DataFrame. If the file is not found, the script will exit with an error message.

In [3]:
def read_csv_file(filename: str) -> pd.DataFrame:
    try:
        data = pd.read_csv(filename, lineterminator='\n', usecols=range(2))
        print("CSV file read successfully!")
        return data
    except FileNotFoundError:
        print("ERROR: File not found")
        exit(1)

# Demonstrate reading a CSV file (use a sample or mock filename)
# dataset = read_csv_file('datasets/datasetall.csv')

dataset = Utils.read_csv_file('datasets/datasetall.csv')
dataset

CSV file read successfully!


| | text | label |
|---|---|---|
| 0 | Binay: Patuloy ang kahirapan dahil sa maling p... | 0 |
| 1 | SA GOBYERNONG TAPAT WELCOME SA BAGUO ANG LAHAT... | 0 |
| 2 | wait so ur telling me Let Leni Lead mo pero NY... | 1 |
| 3 | [USERNAME]wish this is just a nightmare that ... | 0 |
| 4 | doc willie ong and isko sabunutan po | 0 |
| ... | ... | ... |
| 28456 | Bisaya, Probinsyano/a, mostly Bisaya = katulong | 1 |
| 28457 | Amnesia. In my whole life wala pa ako nakasala... | 1 |
| 28458 | Kontrabida na ilang beses na tinalo at obvious... | 1 |
| 28459 | Yung antagonist laging kailangang sobrang sama... | 1 |


In [4]:
dataset['label'].value_counts(ascending=True)

label
0    14115
1    14346
Name: count, dtype: int64

### Function to Seed Random Number Generators

To ensure reproducibility, the `seed_random_number_generators` function seeds the random number generators for PyTorch and NumPy.

In [5]:
def seed_random_number_generators(seed=0):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    print("Random number generators seeded.")

# Seed the random number generators
# seed_random_number_generators()

Utils.seed_random_number_generators()

Random number generators seeded.


### Function for Train-Test Split

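The `get_train_test_split` function makes a stratified split: it shuffles the dataset, then splits each class separately into training and testing portions according to `test_size` (0.2 here, an 80-20 split), so both sets keep roughly the original class balance. It returns the training text, testing text, training labels, and testing labels, in that order.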

In [6]:
TEST_SIZE = 0.2

def shuffle_data_frame(data_frame):
    text = list(data_frame['text'])
    label = list(data_frame['label'])

    assert(len(text) == len(label))

    indices = list(range(len(label)))

    # Make a random number generator that will shuffle list of indices
    # It is seeded to be reproducible
    random_number_generator = np.random.default_rng(seed=0)
    random_number_generator.shuffle(indices)

    shuffled_text = []
    shuffled_labels = []

    # Iterate through the list of indices and add the original data
    # from those shuffled indices
    for index in indices:
        shuffled_text.append(text[index])
        shuffled_labels.append(label[index])

    return pd.DataFrame({
        'text': shuffled_text,
        'label': shuffled_labels,
    })


def get_train_test_split(data_frame: pd.DataFrame, test_size: float):
    """
    Makes a stratified train test split.
    This aims to preserve the distribution between classes.
    """
    if not (0 < test_size < 1):
        # Fail loudly: returning None here would break the caller's tuple unpacking
        raise ValueError('test_size must be between 0 and 1')

    data_frame = shuffle_data_frame(data_frame)

    data_frame_length = len(data_frame)
    train_size = 1 - test_size

    nonhate_rows = data_frame[data_frame['label'] == 0] 
    nonhate_row_length = len(nonhate_rows)

    nonhate_row_train_size = math.ceil(nonhate_row_length * train_size)

    nonhate_row_train = nonhate_rows[0:nonhate_row_train_size]
    nonhate_row_test = nonhate_rows[nonhate_row_train_size:nonhate_row_length]

    assert(len(nonhate_row_train) + len(nonhate_row_test) == nonhate_row_length)

    hate_rows = data_frame[data_frame['label'] == 1] 
    hate_row_length = len(hate_rows)

    hate_row_train_size = math.ceil(hate_row_length * train_size)

    hate_row_train = hate_rows[0:hate_row_train_size]
    hate_row_test = hate_rows[hate_row_train_size:hate_row_length]

    assert(len(hate_row_train) + len(hate_row_test) == hate_row_length)

    combined_train = pd.concat([nonhate_row_train, hate_row_train])
    combined_test = pd.concat([nonhate_row_test, hate_row_test])

    shuffled_train = shuffle_data_frame(combined_train)
    shuffled_test = shuffle_data_frame(combined_test)

    return (
        shuffled_train['text'],
        shuffled_test['text'],
        shuffled_train['label'],
        shuffled_test['label'],
    )

X_train, X_test, y_train, y_test = Utils.get_train_test_split(dataset, TEST_SIZE)
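
As a point of comparison, scikit-learn's `train_test_split` (already imported at the top of this notebook) can make a stratified split directly through its `stratify` argument. The sketch below uses a small made-up DataFrame in place of the real dataset. It is not the method `Utils.get_train_test_split` uses, and the rows it selects will differ from the custom shuffle.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical rows, balanced classes).
toy = pd.DataFrame({
    "text": [f"sample {i}" for i in range(20)],
    "label": [0, 1] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    stratify=toy["label"],  # keep the class ratio the same in both splits
    random_state=0,         # fixed seed for reproducibility
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With a balanced toy set like this one, both splits should come out balanced as well, which is the property the custom function aims for.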

## Train Data

In [7]:
pd.DataFrame(X_train)

| | text |
|---|---|
| 0 | [USERNAME] Palangga ka man sang mga taga Baco... |
| 1 | Who dafuq is Jose Montemayor Jr.??? |
| 2 | Di na nakakatuwa yung mukha ni Mar Roxas sa TV... |
| 3 | national elections. \| via[USERNAME] |
| 4 | Binay will be staring in a movie called "The D... |
| ... | ... |
| 22764 | "Kala ko wala andito pala si Marcos."*pertaini... |
| 22765 | sie ~ [USERNAME]Marcos Magnanakaw Marcos Dikta... |
| 22766 | If Mar is BatMarBinay is Bane-ay. |
| 22767 | to my moots im sorry in not sorry for flooding... |


In [8]:
y_train_dataframe = pd.DataFrame(y_train, columns=['label'])
y_train_dataframe

| | label |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| ... | ... |
| 22764 | 0 |
| 22765 | 1 |
| 22766 | 1 |
| 22767 | 1 |


In [9]:
y_train_dataframe.value_counts(ascending=True)

label
0        11292
1        11477
Name: count, dtype: int64

## Test Data

In [10]:
pd.DataFrame(X_test)

| | text |
|---|---|
| 0 | Bakit trending ang Only Binay? |
| 1 | Mare @ Cebu [USERNAME][USERNAME] Marcos Never ... |
| 2 | Kahit anong gawin ko bakit di ko ma appreciate... |
| 3 | Oras na para tayo'y bumoto ng taong mag tataas... |
| 4 | VP[USERNAME]is currently in Zamboanga Sibugay ... |
| ... | ... |
| 5687 | [USERNAME] Laban LeniAngat Buhay LahatLeni Kiko |
| 5688 | Nagconcede ka man Maimarwala ka prinnagdala ka... |
| 5689 | Did You Know that former Philippine secretary ... |
| 5690 | Bakit nakakairita commercial ni Mar Roxas? |


In [11]:
y_test_dataframe = pd.DataFrame(y_test, columns=['label'])
y_test_dataframe

| | label |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 5687 | 0 |
| 5688 | 1 |
| 5689 | 0 |
| 5690 | 1 |


In [12]:
y_test_dataframe.value_counts(ascending=True)

label
0        2823
1        2869
Name: count, dtype: int64
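
The train and test class counts can be reproduced by hand from the rounding rule in `get_train_test_split`, assuming `Utils.get_train_test_split` behaves like the version defined in this notebook. The sketch below applies `math.ceil(count * train_size)` to the per-class totals reported by `dataset['label'].value_counts()`; the `class_counts` and `splits` names are introduced here for illustration only.

```python
import math

TEST_SIZE = 0.2
train_size = 1 - TEST_SIZE

# Per-class totals reported for the full dataset earlier in this notebook.
class_counts = {0: 14115, 1: 14346}

splits = {}
for label, count in class_counts.items():
    # Same rounding rule as get_train_test_split: round the train share up,
    # and give whatever is left to the test set.
    n_train = math.ceil(count * train_size)
    n_test = count - n_train
    splits[label] = (n_train, n_test)
    print(f"label {label}: train={n_train}, test={n_test}")
```

This gives 11292 training and 2823 testing rows for label 0, and 11477 training and 2869 testing rows for label 1, which agrees with the `value_counts` outputs for the train and test sets.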

## Grid Search

In [13]:
cross_validator = StratifiedKFold(
  n_splits=10,
  # random_state=0,
)

cross_validator

StratifiedKFold(n_splits=10, random_state=None, shuffle=False)

### Bernoulli Naive Bayes

In [14]:
nb_grid_search = GridSearchCV(
  estimator=Bayes.BayesPipeline,
  param_grid={
    'bayes__alpha': [
      0.0, 
      0.1,
      0.2, 
      0.3,
      0.4, 
      0.5,
      0.6, 
      0.7,
      0.8, 
      0.9,
      1.0,
    ],
    # 'bayes__force_alpha': [False, True],
  },
  scoring=['accuracy', 'precision', 'recall', 'f1'],
  refit='f1',
  cv=cross_validator,
)

nb_grid_search.fit(X_train, y_train)

  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(
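
The truncated warning fragment above points at the line where scikit-learn's naive Bayes code takes the logarithm of the smoothed feature counts. It most likely comes from the `alpha=0.0` candidate in the grid: with no smoothing, a feature that never occurs in a class has a count of zero, and `np.log(0)` is negative infinity. The sketch below mimics that calculation for a Bernoulli model with made-up counts; the variable and function names are illustrative, not scikit-learn's API.

```python
import numpy as np

# Toy counts for one class (made-up numbers): the second feature
# never appears in any document of this class.
feature_count = np.array([3.0, 0.0])
class_count = 5.0


def feature_log_prob(alpha: float) -> np.ndarray:
    """Log of the smoothed estimate of P(feature present | class)."""
    smoothed_fc = feature_count + alpha
    smoothed_cc = class_count + 2 * alpha  # two outcomes: present or absent
    # Silence the divide-by-zero warning so the -inf result is easy to inspect.
    with np.errstate(divide="ignore"):
        return np.log(smoothed_fc) - np.log(smoothed_cc)


unsmoothed = feature_log_prob(0.0)
smoothed = feature_log_prob(0.4)
print(unsmoothed)
print(smoothed)
```

With `alpha=0.0` the unseen feature gets a log probability of negative infinity, while any positive `alpha` keeps every entry finite. That is consistent with the `alpha=0.0` row scoring noticeably worse than the others in the results table.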


In [15]:
results_data_frame = pd.DataFrame(nb_grid_search.cv_results_)

results_data_frame

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_bayes__alpha | params | split0_test_accuracy | split1_test_accuracy | split2_test_accuracy | split3_test_accuracy | ... | split3_test_f1 | split4_test_f1 | split5_test_f1 | split6_test_f1 | split7_test_f1 | split8_test_f1 | split9_test_f1 | mean_test_f1 | std_test_f1 | rank_test_f1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.967886 | 0.105381 | 0.099693 | 0.014795 | 0.0 | {'bayes__alpha': 0.0} | 0.745121 | 0.746682 | 0.747853 | 0.758002 | ... | 0.758942 | 0.765974 | 0.725421 | 0.739542 | 0.753328 | 0.757289 | 0.745483 | 0.748682 | 0.010712 | 11 |
| 1 | 0.624879 | 0.056171 | 0.070811 | 0.004826 | 0.1 | {'bayes__alpha': 0.1} | 0.813817 | 0.814208 | 0.813817 | 0.823575 | ... | 0.832344 | 0.833577 | 0.814733 | 0.823091 | 0.830189 | 0.838088 | 0.825735 | 0.827379 | 0.006209 | 10 |
| 2 | 0.680163 | 0.093623 | 0.072874 | 0.009705 | 0.2 | {'bayes__alpha': 0.2} | 0.813817 | 0.81655 | 0.813427 | 0.827088 | ... | 0.835865 | 0.835766 | 0.816176 | 0.825893 | 0.828148 | 0.838949 | 0.826295 | 0.828589 | 0.006413 | 5 |
| 3 | 0.63387 | 0.024327 | 0.074457 | 0.003588 | 0.3 | {'bayes__alpha': 0.3} | 0.813817 | 0.818111 | 0.813427 | 0.825917 | ... | 0.834324 | 0.836072 | 0.817647 | 0.825715 | 0.829016 | 0.840504 | 0.826725 | 0.829013 | 0.006316 | 4 |
| 4 | 0.651875 | 0.027567 | 0.07699 | 0.005374 | 0.4 | {'bayes__alpha': 0.4} | 0.816159 | 0.819672 | 0.814598 | 0.825917 | ... | 0.834447 | 0.836191 | 0.816881 | 0.826409 | 0.828582 | 0.840311 | 0.826725 | 0.829451 | 0.006242 | 1 |
| 5 | 0.635241 | 0.014432 | 0.074938 | 0.00336 | 0.5 | {'bayes__alpha': 0.5} | 0.81694 | 0.818891 | 0.814988 | 0.824356 | ... | 0.832962 | 0.835036 | 0.817582 | 0.825878 | 0.828402 | 0.839259 | 0.827283 | 0.829192 | 0.005622 | 2 |
| 6 | 0.636064 | 0.014779 | 0.075119 | 0.004375 | 0.6 | {'bayes__alpha': 0.6} | 0.81655 | 0.819672 | 0.815769 | 0.824356 | ... | 0.83321 | 0.834307 | 0.818149 | 0.825444 | 0.827917 | 0.838328 | 0.827713 | 0.829135 | 0.005354 | 3 |
| 7 | 0.648195 | 0.012057 | 0.075571 | 0.002667 | 0.7 | {'bayes__alpha': 0.7} | 0.81694 | 0.819282 | 0.81733 | 0.825137 | ... | 0.833951 | 0.832176 | 0.819145 | 0.82379 | 0.826923 | 0.835368 | 0.826803 | 0.828578 | 0.004719 | 6 |
| 8 | 0.640989 | 0.016035 | 0.073511 | 0.00277 | 0.8 | {'bayes__alpha': 0.8} | 0.815379 | 0.818111 | 0.818111 | 0.825137 | ... | 0.834442 | 0.830409 | 0.820607 | 0.823182 | 0.827357 | 0.835181 | 0.825467 | 0.828236 | 0.004539 | 8 |
| 9 | 0.630265 | 0.013631 | 0.075066 | 0.003355 | 0.9 | {'bayes__alpha': 0.9} | 0.814988 | 0.81733 | 0.818111 | 0.825527 | ... | 0.834994 | 0.829803 | 0.81911 | 0.823486 | 0.828275 | 0.835733 | 0.825467 | 0.828143 | 0.004889 | 9 |


In [16]:
nb_grid_search.best_params_

{'bayes__alpha': 0.4}

In [17]:
nb_grid_search.best_score_

0.8294507634828383
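
When `refit` names a metric, `best_score_` is the mean cross-validated value of that metric for the best parameter setting, that is, `cv_results_['mean_test_f1'][best_index_]`. The sketch below shows this on made-up data. The pipeline is a hypothetical stand-in for `Bayes.BayesPipeline`, with step names taken from the representation printed when the model is saved (note that the step called `tfidf` holds a `CountVectorizer`); the texts, labels, and reduced grid are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for Bayes.BayesPipeline.
pipeline = Pipeline(steps=[
    ("tfidf", CountVectorizer()),
    ("bayes", BernoulliNB()),
])

# Made-up toy data: four examples per class.
texts = [
    "good day", "bad day", "good vibes", "bad vibes",
    "good morning", "bad morning", "good news", "bad news",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

search = GridSearchCV(
    estimator=pipeline,
    param_grid={"bayes__alpha": [0.1, 0.5, 1.0]},  # step name + "__" + parameter
    scoring=["accuracy", "f1"],
    refit="f1",  # metric used to pick and refit the best estimator
    cv=StratifiedKFold(n_splits=2),
)
search.fit(texts, labels)

# best_score_ is the mean F1 of the best-ranked row in cv_results_.
best_row_f1 = search.cv_results_["mean_test_f1"][search.best_index_]
print(search.best_params_, search.best_score_ == best_row_f1)
```

The same lookup explains the value in this notebook: the reported `best_score_` matches the `mean_test_f1` entry of the row ranked 1 in the results table.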

In [18]:
results_data_frame.to_csv('bayes_metrics.csv')

In [19]:
nb_grid_search.best_estimator_

In [20]:
Utils.save_trained_model(nb_grid_search.best_estimator_, f"{MODEL_FOLDER}/Bayes")

Ensemble model saved to Pipeline(steps=[('tfidf', CountVectorizer()),
                ('bayes', BernoulliNB(alpha=0.4, force_alpha=True))]).pkl


In [13]:
train_bayes = Bayes.BayesPipeline

train_bayes.set_params(
  bayes__alpha=0.3,
)

train_bayes

In [14]:
train_bayes.fit(X_train, y_train)

after count vocab
10168
  (0, 0)	4
  (0, 1)	2
  (0, 2)	2
  :	:
  (22768, 502)	1
  (22768, 5897)	1
before sort features
  (0, 0)	4
  (0, 1)	2
  (0, 2)	2
  :	:

In [15]:
accuracy, recall, precision, f1 = Utils.get_prediction_results(
  X_test,
  y_test,
  train_bayes,
)

after count vocab
34293
  (0, 1768)	1
  (0, 2906)	1
  (0, 3873)	1
  :	:
  (5691, 39478)	2
  (5691, 39507)	1
10957
Accuracy: 0.8222066057624736
Recall: 0.8759149529452771
Precision: 0.7929946355317135
F1-score: 0.8323948327260683
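
The four numbers above correspond to the scikit-learn metric functions imported at the top of the notebook. The implementation of `Utils.get_prediction_results` is not shown here, so the sketch below is only a guess at its metric step: `summarize_predictions` is a hypothetical helper and the labels are made up. The last lines check that the reported F1-score is the harmonic mean of the reported precision and recall.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarize_predictions(y_true, y_pred):
    """Hypothetical helper: return (accuracy, recall, precision, f1)."""
    return (
        accuracy_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        precision_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    )


# Made-up labels: 3 true positives, 1 false negative,
# 2 false positives, 2 true negatives.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 1, 1]
toy_accuracy, toy_recall, toy_precision, toy_f1 = summarize_predictions(toy_true, toy_pred)
print(toy_accuracy, toy_recall, toy_precision, toy_f1)

# Consistency check on the values reported in this notebook:
# F1 = 2 * precision * recall / (precision + recall).
reported_precision = 0.7929946355317135
reported_recall = 0.8759149529452771
reported_f1 = 0.8323948327260683
harmonic_mean = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(abs(harmonic_mean - reported_f1) < 1e-6)
```

Recall is noticeably higher than precision here, so the model flags most of the label-1 examples at the cost of more false positives.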


In [16]:
Utils.save_trained_model(train_bayes, f"{MODEL_FOLDER}/Bayes")

Ensemble model saved to Pipeline(steps=[('tfidf', CountVectorizer()),
                ('bayes', BernoulliNB(alpha=0.3))]).pkl
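
The message above interpolates the pipeline's representation where a file path would be expected, which suggests the logging line in `Utils.save_trained_model` formats the model rather than the destination. Its source is not shown here, so the sketch below is only one possible shape for such a helper: `save_model` is a hypothetical name, it uses the standard-library `pickle` module, and it writes to a temporary folder with made-up training data.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline


def save_model(model, path_without_suffix: str) -> Path:
    """Hypothetical helper: pickle `model` to '<path_without_suffix>.pkl'."""
    path = Path(f"{path_without_suffix}.pkl")
    path.parent.mkdir(parents=True, exist_ok=True)  # create the model folder if needed
    with path.open("wb") as file:
        pickle.dump(model, file)
    print(f"Model saved to {path}")  # log the path, not the model object
    return path


# Small pipeline fitted on made-up data, mirroring the saved pipeline's steps.
model = Pipeline(steps=[
    ("tfidf", CountVectorizer()),
    ("bayes", BernoulliNB(alpha=0.3)),
])
model.fit(["good day", "bad day", "good vibes", "bad vibes"], [0, 1, 0, 1])

with tempfile.TemporaryDirectory() as folder:
    saved_path = save_model(model, f"{folder}/model_bayes/Bayes")
    with saved_path.open("rb") as file:
        restored = pickle.load(file)
    # The restored pipeline should predict exactly like the original.
    roundtrip_ok = list(restored.predict(["good day"])) == list(model.predict(["good day"]))

print(roundtrip_ok)
```

Pickled scikit-learn models should only be loaded from trusted sources and with a compatible scikit-learn version.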
