# Training (fine-tuning) the SloBERTa Transformer model on SLED trainsmall

This notebook is prepared to be viewed on Google Colab. However, I ran most of the experiments on Kaggle (by importing the same notebook and data), because Kaggle gives you 35-40 hours of GPU time per week, whereas Google Colab does not state how much GPU time it allows. After one day of hyperparameter search on Colab, I got a message that I had used up my GPU quota, with no indication of when I would be able to use it again. I therefore recommend Kaggle for lengthy GPU experiments.

Before starting, click the "RAM" and "Disk" indicator in the top right corner of the page, click "Change runtime type", and choose "GPU" as the "Hardware accelerator".
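Once the runtime has been switched, it can be worth confirming that a GPU is actually visible before starting a long run. The check below is a minimal sketch: `describe_device` is a helper name introduced here for illustration, and it assumes that PyTorch is the backend used for training.

```python
import importlib.util


def describe_device() -> str:
    """Return a short description of the device PyTorch would train on."""
    # Avoid a hard failure if PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"

    import torch

    if torch.cuda.is_available():
        # Index 0 is the first visible GPU.
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected, training would run on the CPU"


print(describe_device())
```

If the message reports that no GPU was detected, change the runtime type as described above and rerun the cell.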

In [1]:
# Import the libraries needed for data wrangling, prediction and result analysis
import json
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from numba import cuda
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score

In [None]:
# Install transformers
# (this needs to be done on Colab each time you start the session).
# For information on how to install transformers and simpletransformers on your machine,
# follow the simpletransformers instructions: https://simpletransformers.ai/docs/installation/
!pip install -q transformers

# Install simpletransformers
!pip install -q simpletransformers
from simpletransformers.classification import ClassificationModel

In [3]:
# Install wandb - this will be useful for inspecting the results of a hyperparameter search.
# You need to create an account on Wandb: https://wandb.ai/
!pip install -q wandb

import wandb

# Login to wandb
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

### Import the data

The data needs to be in the format that Simple Transformers accepts: a dataframe whose first column is "text" and whose second column is "labels". The data was prepared with the code here: https://github.com/TajaKuzman/FastText-Classification-SLED/blob/main/0-analyse-data-prepare-for-transformers.ipynb. Before importing, upload the data to Google Colab by clicking the Folder icon on the left side of the page and then the "Upload to session storage" icon.
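To make the expected file layout concrete, here is a small sketch of a tab-separated file with an unnamed index column followed by "text" and "labels". The two rows are made up, and the layout is an assumption inferred from the `sep="\t", index_col=0` arguments used in the import cell; the sketch uses only the standard library, so it does not depend on pandas.

```python
import csv
import io

# Two made-up rows; the real files contain Slovene news texts and topic labels.
rows = [
    (0, "na cesti se je zgodila prometna nesreča", "crnakronika"),
    (1, "jutri bo pretežno jasno", "vreme"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
# The first header cell is left empty: it is the unnamed index column
# that `index_col=0` turns into the DataFrame index.
writer.writerow(["", "text", "labels"])
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)

# Read the text back to confirm the layout: after dropping the index
# column, exactly the columns "text" and "labels" remain.
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header = next(reader)
data_columns = header[1:]
print(data_columns)
```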

In [6]:
Colab_path = "/content"

train_df = pd.read_csv(f"{Colab_path}/SLED-for-Transformers-train.csv", sep="\t", index_col=0)
dev_df = pd.read_csv(f"{Colab_path}/SLED-for-Transformers-dev.csv", sep="\t", index_col=0)
test_df = pd.read_csv(f"{Colab_path}/SLED-for-Transformers-test.csv", sep="\t", index_col=0)

# Check the sizes of the splits.
# On Google Colab, the train split was sometimes not imported in full
# (the size was a couple of instances smaller than it should be).
# If this happens, rerun this cell until the train size is correct (it should be 9990).
# I did not have this problem on Kaggle.
print("Train shape: {}, Dev shape: {}, Test shape: {}.".format(train_df.shape, dev_df.shape, test_df.shape))

Train shape: (9990, 2), Dev shape: (1296, 2), Test shape: (1299, 2).


In [7]:
# Inspect the beginning of the train split.
train_df.head()

Unnamed: 0,text,labels
0,na tolminskem se je dopoldne zgodila prometna ...,crnakronika
1,na cesti zadlaz-žabče se je zgodila prometna n...,crnakronika
2,v sredo ob 1321 je bila novogoriška policija o...,crnakronika
3,malo po 16 uri je v svetem duhu na ostrem vrhu...,crnakronika
4,v eksploziji pirotehnike je bil v soboto popol...,crnakronika


## Training and saving

We will use the monolingual SloBERTa model: https://huggingface.co/EMBEDDIA/sloberta

For training, I'll use the Simple Transformers library, which is much more user-friendly than the Hugging Face Transformers library. It also has very good documentation, including tutorials, which I recommend reading: https://simpletransformers.ai/docs/installation/

In [8]:
# Set the "TOKENIZERS_PARALLELISM" to false to avoid getting an error when training the model.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [9]:
# Create a list of labels
LABELS = train_df.labels.unique().tolist()
LABELS

['crnakronika',
 'druzba',
 'gospodarstvo',
 'izobrazevanje',
 'kultura',
 'okolje',
 'politika',
 'prosticas',
 'sport',
 'vreme',
 'zabava',
 'zdravje',
 'znanost']

In [None]:
# Initialize Wandb (change the names of project, entity and "name" according to your project in Wandb)
wandb.init(project="SLED-categorization", entity="tajak", name="SloBERTa-hyperparameter-search")

In [10]:
# Calculate how many steps each epoch will have
# Number of steps in an epoch = number of training samples / batch size
steps_per_epoch = int(9990/8)
steps_per_epoch

1248
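Note that 9990 is not divisible by 8, so `int(9990/8)` rounds down. The sketch below spells out the arithmetic. Whether the final, smaller batch counts as an extra step depends on the data loader (PyTorch's `DataLoader` keeps it by default); that is an assumption about the library internals, not something checked in this notebook.

```python
import math

train_size = 9990  # number of training instances reported above
batch_size = 8     # the "train_batch_size" used in this notebook

full_batches = train_size // batch_size            # what int(9990/8) computes
all_batches = math.ceil(train_size / batch_size)   # also counts the final partial batch
leftover = train_size % batch_size                 # instances in that partial batch

print(full_batches, all_batches, leftover)

# Interval (in steps) passed to "evaluate_during_training_steps" below
eval_interval = full_batches * 10
print(eval_interval)
```

The difference of one step per epoch is negligible here, but it shifts the evaluation points slightly relative to the true epoch boundaries.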

### Hyperparameter search

I evaluated the model every 10th epoch, i.e. every 12,480 steps. I first trained the model while evaluating it to find the optimal number of epochs. Based on the changes in training and evaluation loss, which I observed on Wandb, I identified the candidate epochs (the epochs right before the evaluation loss starts rising again). Then I trained the model and tested it on dev for each of the candidate epoch numbers.
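The selection rule described above (pick the epochs right before the evaluation loss starts rising again) can be sketched in a few lines. The loss values below are invented for illustration and are not the values logged to Wandb; `candidate_epochs` is a hypothetical helper name.

```python
def candidate_epochs(eval_losses):
    """Return the 1-based epochs that are local minima of the evaluation loss.

    An epoch qualifies if its loss is lower than the next epoch's loss and
    not higher than the previous epoch's loss.
    """
    candidates = []
    for index in range(len(eval_losses) - 1):
        rises_afterwards = eval_losses[index + 1] > eval_losses[index]
        fell_before = index == 0 or eval_losses[index] <= eval_losses[index - 1]
        if rises_afterwards and fell_before:
            candidates.append(index + 1)  # convert 0-based index to epoch number
    return candidates


# Invented evaluation losses for epochs 1 to 7
hypothetical_eval_losses = [1.00, 0.70, 0.72, 0.66, 0.64, 0.69, 0.75]
print(candidate_epochs(hypothetical_eval_losses))
```

If the loss decreases monotonically, the function returns an empty list, which suggests that training for more epochs may still help.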

In [None]:
# Create a ClassificationModel
sloberta_model = ClassificationModel(
    # For each model, you need to specify its model type and its name.
    # You can find this information on the hugging face page of the model (https://huggingface.co/EMBEDDIA/sloberta):
    # you'll find the model type in files > config.json; and the name if you click on the button "Use in Transformers"
        "camembert", "EMBEDDIA/sloberta",
        num_labels=len(LABELS),
        use_cuda=True,
        # Define the hyperparameters (I'll only experiment with epochs, for others,
        # I just use the values that worked nicely in past experiments)
        args= {
            "num_train_epochs": 30,
            "train_batch_size":8,
            "learning_rate": 1e-5,
            "labels_list": LABELS,
            "max_seq_length": 512,
            # Here, write in the name of your Wandb project
            "wandb_project": 'SLED-categorization',
            "silent": True,                        
            # Use these parameters if you want to evaluate during training
            "evaluate_during_training": True,
            "evaluate_during_training_steps": steps_per_epoch*10,
            "evaluate_during_training_verbose": True,
            "use_cached_eval_features": True,
            'reprocess_input_data': True,
            # I use the hyperparameters below to prevent filling up the memory.
            # Remove "no_save": True and "no_cache": True if you want to save the model.
            "overwrite_output_dir": True,
            "no_cache": True,
            "no_save": True,
            "save_steps": -1,
            "save_model_every_epoch":False
            }
        )

In [None]:
# Train the model

# Log time to see how long it takes
training_start_time = time.time()

# Train the model and evaluate it - you need to specify the evaluation split as well
sloberta_model.train_model(train_df, eval_df = dev_df)

print(f"Training and evaluation took {round((time.time() - training_start_time)/60,2)} minutes.")

Based on the evaluation during training, analysed on Wandb, the optimal number of epochs is between 2 and 8 (see the graph on GitHub: https://github.com/TajaKuzman/FastText-Classification-SLED#hyperparameter-search).

In [None]:
# Create a list into which you'll save the results.
previous_results = []

In [None]:
# Create a function that you'll use for testing the model

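def testing(test_df, test_name, epoch):
    """
    Apply the trained model to a dataset to infer predictions, then report and store the results.

    Prints the macro and micro F1 scores and a classification report, plots and saves
    a confusion matrix, and appends the results to the previous_results list.

    Args:
    - test_df (pandas DataFrame): the split to evaluate, with the columns "text" and "labels"
    - test_name (str): experiment name, used in the plot title and in the output file names
    - epoch (int): the num_train_epochs value that the model was trained with
    """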
    # Get the true labels
    y_true = test_df.labels

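    # Use the most recently trained model (a global variable)
    model = sloberta_model

    # Predict the labels for all texts in one call, which is much faster than
    # predicting one text at a time. predict() returns a tuple of
    # (predictions, raw model outputs); only the predictions are needed here.
    predictions = model.predict(test_df.text.tolist())[0]
    y_pred = pd.Series(predictions, index=test_df.index)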

    # Calculate the scores
    macro = f1_score(y_true, y_pred, labels=LABELS, average="macro")
    micro = f1_score(y_true, y_pred, labels=LABELS,  average="micro")
    print(f"Macro f1: {macro:0.3}, Micro f1: {micro:0.3}")

    # Plot the confusion matrix:
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    plt.figure(figsize=(9, 9))
    plt.imshow(cm, cmap="Oranges")
    for (i, j), z in np.ndenumerate(cm):
        plt.text(j, i, '{:d}'.format(z), ha='center', va='center')
    classNames = LABELS
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=90)
    plt.yticks(tick_marks, classNames)
    plt.title(f"{test_name}")

    plt.tight_layout()
    # Save the confusion matrix, then display it
    fig1 = plt.gcf()
    fig1.savefig(f"Confusion-matrix-{test_name}.png", dpi=100)
    plt.show()

    # Print classification report
    print(classification_report(y_true, y_pred, labels = LABELS))

    # Save the results:
    rezdict = {
        "experiment": test_name,
        "num_train_epochs": epoch,
        "train_batch_size":8,
        "learning_rate": 1e-5,
        "microF1": micro,
        "macroF1": macro,
        "y_true": y_true.to_dict(),
        "y_pred": y_pred.to_dict(),
        }
    previous_results.append(rezdict)

    # Save intermediate results (just in case)
    with open(f"backup-results-{test_name}.json", "w") as backup_file:
        json.dump([rezdict], backup_file, indent="")
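The function reports both macro and micro F1, and the two can differ considerably when the classes are imbalanced. The sketch below recomputes both scores in plain Python on a tiny made-up example to show where the difference comes from. It is an illustration of the definitions, not a replacement for scikit-learn's `f1_score`, and `f1_scores` is a helper name introduced here.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (macro_f1, micro_f1) for single-label classification."""
    per_class_f1 = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # A class that is never predicted correctly gets an F1 of 0.
        per_class_f1.append(2 * tp / denominator if denominator else 0.0)
        total_tp += tp
        total_fp += fp
        total_fn += fn
    # Macro: unweighted mean of the per-class scores.
    macro = sum(per_class_f1) / len(per_class_f1)
    # Micro: one score computed from the pooled counts.
    micro_denominator = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denominator if micro_denominator else 0.0
    return macro, micro


# Made-up example: "sport" dominates and the single "vreme" text is misclassified.
toy_true = ["sport", "sport", "sport", "sport", "vreme", "kultura"]
toy_pred = ["sport", "sport", "sport", "sport", "sport", "kultura"]

toy_macro, toy_micro = f1_scores(toy_true, toy_pred, ["sport", "vreme", "kultura"])
print(f"Macro f1: {toy_macro:.3f}, Micro f1: {toy_micro:.3f}")
```

Micro F1 stays high because five of the six texts are correct, while macro F1 drops because the missed "vreme" class counts as much as the frequent "sport" class.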

In [None]:
# Train the model for various epochs to find the optimum number of epochs
#epochs = [2, 4, 6, 8]
epochs = [8, 10]

for epoch in epochs:
    sloberta_model = ClassificationModel(
                "camembert", "EMBEDDIA/sloberta",
                num_labels=len(LABELS),
                use_cuda=True,
                args= {
                    "overwrite_output_dir": True,
                    "num_train_epochs": epoch,
                    "train_batch_size":8,
                    "learning_rate": 1e-5,
                    "labels_list": LABELS,
                    # The following parameters (no_cache, no_save) prevent saving; comment them out if you want to save the model
                    "no_cache": True,
                    "no_save": True,
                    "max_seq_length": 512,
                    "save_steps": -1,
                    # Only the trained model will be saved - to prevent filling all of the space
                    "save_model_every_epoch":False,
                    "wandb_project": 'SLED-categorization',
                    "silent": True,
                    }
                )

    # Train the model
    sloberta_model.train_model(train_df)
    
    # Test the model on dev_df
    testing(dev_df, f"SLED-trainsmall-SLOBERTA-dev-epoch-search:{epoch}", epoch)

The optimal number of epochs is 8.

In [None]:
# Compare the results by creating a dataframe from the previous_results list:
results_df = pd.DataFrame(previous_results)

results_df
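Besides inspecting the dataframe by eye, the best run can also be picked programmatically. The sketch below assumes that macro F1 is the selection criterion (micro F1 would work the same way). The entries use the same keys as the dictionaries built in `testing`, but the names and scores are invented for illustration and do not reflect the actual runs.

```python
# Hypothetical results with invented scores, for illustration only.
example_results = [
    {"experiment": "toy-run-a", "num_train_epochs": 1, "microF1": 0.52, "macroF1": 0.50},
    {"experiment": "toy-run-b", "num_train_epochs": 2, "microF1": 0.61, "macroF1": 0.60},
    {"experiment": "toy-run-c", "num_train_epochs": 3, "microF1": 0.58, "macroF1": 0.55},
]

# Pick the entry with the highest macro F1.
best_result = max(example_results, key=lambda result: result["macroF1"])
print(best_result["experiment"], best_result["num_train_epochs"])
```

With the real data, `example_results` would be replaced by `previous_results`.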

In [None]:
# Save the file with updated results.
with open("SLED-SloBERTa-trainsmall-hyperparameter-search-results.json", "w") as results_file:
    json.dump(previous_results,results_file, indent= "")

Train and save the model, using 8 epochs.

In [None]:
# Create a ClassificationModel
sloberta_model = ClassificationModel(
        "camembert", "EMBEDDIA/sloberta",
        num_labels=len(LABELS),
        use_cuda=True,
        args= {
            "overwrite_output_dir": True,
            "num_train_epochs": 8,
            "train_batch_size":8,
            "learning_rate": 1e-5,
            "labels_list": LABELS,
            # no_cache and no_save are commented out because I want to save the model
            #"no_cache": True,
            #"no_save": True,
            "max_seq_length": 512,
            "save_steps": -1,
            # Only the trained model will be saved - to prevent filling all of the space
            "save_model_every_epoch":False,
            "wandb_project": 'SLED-categorization',
            "silent": True,
            }
        )

In [None]:
# Train and save the model
sloberta_model.train_model(train_df)

In [None]:
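# Save the trained model to Wandb so that you can access it later. For information on how to use saved models,
# see https://github.com/TajaKuzman/FastText-Classification-SLED/blob/main/Saved_models.md
# Note: the add_dir path below is the output directory on Kaggle. Adjust it if you run the notebook
# elsewhere (for example, on Google Colab the working directory is /content).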
run = wandb.init(project="SLED-categorization", entity="tajak", name="saving-trained-model")
trained_model_artifact = wandb.Artifact("SLED-SloBERTa-trainsmall-classifier", type="model", description="a SloBERTa model fine-tuned on the Slovene SLED (trainsmall) dataset, annotated with topic.")
trained_model_artifact.add_dir("/kaggle/working/outputs")
run.log_artifact(trained_model_artifact)
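For reference, a logged artifact can later be downloaded and loaded again roughly as follows. This is a sketch under assumptions, not the procedure from Saved_models.md: the artifact path is assembled from the entity, project and artifact name used above, the `:latest` alias is assumed to point at the version you want, and `load_saved_classifier` is a helper name introduced here. It needs a Wandb login and network access, so the call itself is left commented out.

```python
def load_saved_classifier(
    artifact_name="tajak/SLED-categorization/SLED-SloBERTa-trainsmall-classifier:latest",
    use_cuda=True,
):
    """Download the model artifact from Wandb and load it for prediction."""
    # Imported lazily so that defining the function needs neither library.
    import wandb
    from simpletransformers.classification import ClassificationModel

    # Fetch the artifact through the public API and download its files
    # to a local directory.
    artifact = wandb.Api().artifact(artifact_name, type="model")
    model_dir = artifact.download()

    # A local directory can be passed in place of the Hugging Face model name.
    return ClassificationModel("camembert", model_dir, use_cuda=use_cuda)


# Example usage (requires a Wandb login and access to the project):
# model = load_saved_classifier()
# predictions, raw_outputs = model.predict(["jutri bo pretežno jasno"])
```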

## Testing the model

In [None]:
# Create a list to save results into
previous_results = []

In [None]:
# Test the model on the test split
testing(test_df, "SloBERTa-test", 8)

In [None]:
# Compare the results by creating a dataframe from the previous_results list:
results_df = pd.DataFrame(previous_results)

results_df

In [None]:
# Save the results
with open("SloBERTa-SLED-trainsmall-experiments-Results.json", "w") as results_file:
    json.dump(previous_results,results_file, indent= "")