# Training

This notebook trains an ensemble learning model on a dataset stored in a CSV file. It covers loading and splitting the data, training the model, evaluating it, and saving the trained model to disk. The ensemble method is configurable: hard voting, soft voting, or stacking.
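
As a rough illustration of how the two voting variants differ, the sketch below combines made-up class probabilities from three hypothetical base classifiers. (Stacking, in general, instead trains a meta-model on the base models' outputs.) This is not the implementation in the `Ensemble` module; the function names and numbers are assumptions chosen for illustration.

```python
from collections import Counter

def hard_vote(probabilities):
    """Each model votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the class probabilities across models, then take the argmax."""
    n_classes = len(probabilities[0])
    mean = [
        sum(p[c] for p in probabilities) / len(probabilities)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=mean.__getitem__)

# Made-up [P(class 0), P(class 1)] outputs from three hypothetical models.
model_outputs = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]

print(hard_vote(model_outputs))  # two of three models prefer class 0
print(soft_vote(model_outputs))  # the averaged probabilities favor class 1
```

The two methods can disagree, as here: one confident model outweighs two lukewarm ones under soft voting but not under hard voting.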


## Importing Necessary Libraries

First, we import the libraries this notebook needs: pandas and NumPy for data manipulation, PyTorch, the scikit-learn utilities for splitting and scoring, and the custom `Ensemble` module, which contains the ensemble learning methods.

In [3]:
import pandas as pd
import torch
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, 
    recall_score, 
    precision_score, 
    f1_score
)
from src import Ensemble

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Utility Functions

### Function to Read CSV File

The `read_csv_file` function reads the first two columns of the CSV file (the text and its label) into a pandas DataFrame. If the file is not found, it prints an error message and exits.

In [4]:
def read_csv_file(filename: str) -> pd.DataFrame:
    try:
        data = pd.read_csv(filename, lineterminator='\n', usecols=range(2))
        print("CSV file read successfully!")
        return data
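    except FileNotFoundError:
        print(f"ERROR: File not found: {filename}")
        # Raise SystemExit directly; the bare `exit` helper is meant for interactive shells.
        raise SystemExit(1)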

# Read the dataset used throughout this notebook
dataset = read_csv_file('datasets/datasetall.csv')
dataset

CSV file read successfully!


| | text | label |
|---|---|---|
| 0 | Binay: Patuloy ang kahirapan dahil sa maling p... | 0 |
| 1 | SA GOBYERNONG TAPAT WELCOME SA BAGUO ANG LAHAT... | 0 |
| 2 | wait so ur telling me Let Leni Lead mo pero NY... | 1 |
| 3 | [USERNAME]wish this is just a nightmare that ... | 0 |
| 4 | doc willie ong and isko sabunutan po | 0 |
| ... | ... | ... |
| 28456 | Bisaya, Probinsyano/a, mostly Bisaya = katulong | 1 |
| 28457 | Amnesia. In my whole life wala pa ako nakasala... | 1 |
| 28458 | Kontrabida na ilang beses na tinalo at obvious... | 1 |
| 28459 | Yung antagonist laging kailangang sobrang sama... | 1 |


In [5]:
dataset['label'].value_counts(ascending=True)

label
0    14115
1    14346
Name: count, dtype: int64

### Function to Seed Random Number Generators

To ensure reproducibility, the `seed_random_number_generators` function seeds the random number generators for PyTorch and NumPy.

In [6]:
def seed_random_number_generators(seed=0):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    print("Random number generators seeded.")

# Seed the random number generators
seed_random_number_generators()

Random number generators seeded.
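
The function above seeds PyTorch and NumPy only. Python's built-in `random` module keeps its own generator, so any code that relies on it would need a separate seed. The sketch below shows the effect of seeding using the standard library alone; `draw_numbers` is a hypothetical helper, and whether this project uses `random` at all is an assumption.

```python
import random

def draw_numbers(seed, count=3):
    """Seed Python's built-in generator, then draw a few random floats."""
    random.seed(seed)
    return [random.random() for _ in range(count)]

# Re-seeding with the same value reproduces the same sequence.
first_run = draw_numbers(seed=0)
second_run = draw_numbers(seed=0)
print(first_run == second_run)

# A different seed gives a different sequence.
print(draw_numbers(seed=1) == first_run)
```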


### Function for Train-Test Split

The `get_train_test_split` function splits the dataset into training and testing sets with an 80-20 split ratio and returns them.

In [15]:
def get_train_test_split(data_frame: pd.DataFrame):
    text = data_frame['text']
    labels = data_frame['label'].to_numpy()

    X_train, X_test, y_train, y_test = train_test_split(
        text, 
        labels, 
        test_size=0.2, 
        random_state=42,
    )
    print("Data split into training and testing sets.")
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = get_train_test_split(dataset)

Data split into training and testing sets.
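
With a fractional `test_size`, scikit-learn rounds the test-set size up and assigns the remaining rows to training. The sketch below checks that arithmetic against the label counts reported earlier; `split_sizes` is a hypothetical helper written for this check, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test size, rounding the test set up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

# 14115 rows with label 0 plus 14346 rows with label 1
n_train, n_test = split_sizes(14115 + 14346)
print(n_train, n_test)  # 22768 5693
```

Note that the split above is not stratified. Passing `stratify=labels` to `train_test_split` would keep the label ratio the same in both sets.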


## Train Data

In [16]:
pd.DataFrame(X_train)

| | text |
|---|---|
| 15652 | Jose Montemayor Jr. is giving avibe umuwi ka n... |
| 1977 | [USERNAME] Oct Leody de Guzman should foster ... |
| 15609 | Ano ba guys. Hindi pwede basta-basta magpasaga... |
| 24355 | "[USERNAME]: Galawang Binay! " Benta to !! Hah... |
| 9202 | [USERNAME] [USERNAME]andothers May sinabi bang... |
| ... | ... |
| 21575 | Cheap ! Mar roxas ???????? |
| 5390 | Si Binay ba ang pambato mo sa pagkapangulo nga... |
| 860 | ABS-CBN News [USERNAME] Oct BREAKING: Sen. Pa... |
| 15795 | [USERNAME]and[USERNAME] NOW on ANC: Halalan |


In [17]:
y_train_dataframe = pd.DataFrame(y_train, columns=['label'])
y_train_dataframe

| | label |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 22763 | 1 |
| 22764 | 0 |
| 22765 | 0 |
| 22766 | 0 |


In [18]:
y_train_dataframe.value_counts(ascending=True)

label
0        11354
1        11414
Name: count, dtype: int64

## Test Data

In [19]:
pd.DataFrame(X_test)

| | text |
|---|---|
| 21414 | cutie ng mga quezonduan activitiessssIsama Si ... |
| 21420 | Kae [USERNAME]Lisod kaau mahimong social climb... |
| 3971 | Wala bang readers for leni-kiko frame? Badly l... |
| 14028 | Doc Willie Ong isn't ready for this post. Kail... |
| 14983 | [USERNAME] si binay? Si binay na walang malay?... |
| ... | ... |
| 18560 | RT [USERNAME]: Nognog. Pandak. hahahaha buti a... |
| 21737 | Ang arte ng mga Binay shet |
| 26053 | Leni Kiko |
| 20067 | [USERNAME]things |


In [20]:
y_test_dataframe = pd.DataFrame(y_test, columns=['label'])
y_test_dataframe

| | label |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
| ... | ... |
| 5688 | 1 |
| 5689 | 1 |
| 5690 | 0 |
| 5691 | 0 |


In [21]:
y_test_dataframe.value_counts(ascending=True)

label
0        2761
1        2932
Name: count, dtype: int64

## Ensemble Training Function

The `train_ensemble` function trains an already-constructed ensemble model on the provided training data. It takes the training texts, their labels, and the ensemble instance, re-seeds the random number generators, and returns the trained model.

In [22]:
def train_ensemble(X_train: pd.Series, y_train: np.ndarray, ensemble):
    seed_random_number_generators()  # Ensure reproducibility
    ensemble.fit(X_train, y_train)
    print("Ensemble model trained.")
    return ensemble

## Prediction and Evaluation Function

The `get_prediction_results` function uses the trained ensemble model to make predictions on the test set and then evaluates these predictions by calculating the accuracy, recall, precision, and F1-score. It returns these metrics for further analysis.

In [None]:
def get_prediction_results(X_test: pd.Series, y_test: np.ndarray, ensemble):
    with torch.inference_mode():
        y_pred = ensemble.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        print(f"Accuracy: {accuracy}\nRecall: {recall}\nPrecision: {precision}\nF1-score: {f1}")
        return accuracy, recall, precision, f1
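
For reference, all four metrics can be derived from the confusion-matrix counts. The sketch below recomputes them in plain Python for a made-up set of labels, treating class 1 as the positive class (the default for binary targets in the scikit-learn functions used above). `binary_metrics` is a hypothetical helper, not part of this notebook's code.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, recall, precision, and F1 with class 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, precision, f1

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so every metric comes out to 0.75.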

## Save Model Function

The `save_trained_model` function saves the trained ensemble model to disk using the joblib library. This allows for the model to be reloaded and used for predictions without the need for retraining.

In [None]:
def save_trained_model(ensemble, filename="Ensemble"):
    import joblib
    joblib.dump(ensemble, f'{filename}.pkl', compress=True)
    print(f"Ensemble model saved to {filename}.pkl")
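
To reuse a saved model later, `joblib.load` reads the file back. The sketch below round-trips a plain dictionary as a stand-in for the trained ensemble, since loading the real model also requires the `src.Ensemble` classes to be importable. The file name and the stand-in object are assumptions for illustration.

```python
import os
import tempfile

import joblib

# Made-up stand-in for a trained ensemble; any picklable object works the same way.
stand_in_model = {"method": "hard", "note": "placeholder"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ensemble-hard.pkl")
    joblib.dump(stand_in_model, path, compress=True)  # same call as save_trained_model
    restored = joblib.load(path)

print(restored == stand_in_model)
```

Only load model files from sources you trust, because unpickling can execute arbitrary code.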

## Main Execution Workflow

This cell combines the previous steps into a single workflow: reading the dataset, splitting it into training and testing sets, selecting the ensemble method, training the model, evaluating its performance, and saving the trained model. Set `FILENAME` to the path of your dataset and `ENSEMBLE_METHOD` to `'hard'`, `'soft'`, or `'stacking'`.

In [None]:
FILENAME = 'datasets/datasetall.csv'
ENSEMBLE_METHOD = 'hard'

# Read data and prepare train-test split
data_frame = read_csv_file(FILENAME)
X_train, X_test, y_train, y_test = get_train_test_split(data_frame)

# Initialize and train the ensemble
# Map each method name to its class so that only the selected ensemble is constructed
ensemble_methods = {
    'hard': Ensemble.HardVotingEnsemble,
    'soft': Ensemble.SoftVotingEnsemble,
    'stacking': Ensemble.StackingEnsemble,
}
ensemble = train_ensemble(X_train, y_train, ensemble_methods[ENSEMBLE_METHOD]())

# Evaluate the trained ensemble and display results
accuracy, recall, precision, f1 = get_prediction_results(X_test, y_test, ensemble)

# Save the trained model
save_trained_model(ensemble, f'ensemble-{ENSEMBLE_METHOD}')

## Results

To better visualize the evaluation results, this cell creates a pandas DataFrame to display the accuracy, recall, precision, and F1-score in a tabular format.

In [None]:
results_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Recall', 'Precision', 'F1-Score'],
    'Value': [accuracy, recall, precision, f1]
})
results_df