We decided that it might be better to train the joint classifier only on labels that are present in all of the datasets. We therefore discarded the Forum and Other labels from the train, dev and test splits of X-GENRE. In this notebook, I train the model again on the remaining labels. The new model is called X-GENRE-2.
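
The filtering step itself is not part of this notebook, so the following is only a minimal sketch of what discarding the two labels could look like. It uses plain Python so that it runs on its own; the function name `drop_discarded` and the toy rows are hypothetical and are not taken from the original data preparation.

```python
# Labels dropped for X-GENRE-2 (named in the text above).
DISCARDED_LABELS = {"Forum", "Other"}


def drop_discarded(rows, discarded=DISCARDED_LABELS):
    """Keep only the rows whose label is not in the discarded set."""
    return [row for row in rows if row["labels"] not in discarded]


# Made-up toy rows, for illustration only.
toy_split = [
    {"text": "Breaking story ...", "labels": "News"},
    {"text": "Re: my question ...", "labels": "Forum"},
    {"text": "Step 1: open the box ...", "labels": "Instruction"},
    {"text": "Unclassifiable page ...", "labels": "Other"},
]

filtered = drop_discarded(toy_split)
print([row["labels"] for row in filtered])
```

On a pandas DataFrame with a `labels` column, the same filter can be written as `df[~df["labels"].isin(["Forum", "Other"])]`.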

Import all necessary libraries and install everything you need for training:

In [1]:
# Import the libraries necessary for data wrangling, prediction and result analysis
import json
import numpy as np
import pandas as pd
import logging
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score
import torch
from numba import cuda
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

In [2]:
# Install transformers
# (this needs to be done on Kaggle each time you start the session)
!pip install -q transformers

# Install simpletransformers
!pip install -q simpletransformers
from simpletransformers.classification import ClassificationModel

In [3]:
# Install wandb
!pip install -q wandb

import wandb

# Login to wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mtajak[0m (use `wandb login --relogin` to force relogin)


True

In [4]:
# Clean the GPU cache

cuda.select_device(0)
cuda.close()
cuda.select_device(0)
torch.cuda.empty_cache()


### Import the data

In [5]:
# X-GENRE-2
train_df = pd.read_csv("/home/tajak/Genre-Datasets-Comparison/Genre-Datasets-Comparison/Creation-of-classifiers-and-cross-prediction/data-splits/X-GENRE-train.csv-2.csv", index_col=0)
dev_df = pd.read_csv("/home/tajak/Genre-Datasets-Comparison/Genre-Datasets-Comparison/Creation-of-classifiers-and-cross-prediction/data-splits/X-GENRE-dev.csv-2.csv", index_col=0)
test_df = pd.read_csv("/home/tajak/Genre-Datasets-Comparison/Genre-Datasets-Comparison/Creation-of-classifiers-and-cross-prediction/data-splits/X-GENRE-test.csv-2.csv", index_col=0)

print("X-GENRE-2 train shape: {}, Dev shape: {}, Test shape: {}.".format(train_df.shape, dev_df.shape, test_df.shape))

X-GENRE-2 train shape: (1562, 2), Dev shape: (522, 2), Test shape: (522, 2).


In [6]:
train_df.head()

|    | text                                              | labels                  |
|---:|:--------------------------------------------------|:------------------------|
|  2 | Abstract Objective: Reporting bias due to soci... | Information/Explanation |
|  3 | In 2009 the song was the focus of a successful... | Information/Explanation |
|  4 | QuotW This was the week when neither rumours o... | News                    |
|  5 | KaZaA claims it can't stop users sharing music... | News                    |
|  6 | When you first sign up with an online casino a... | Instruction             |


In [10]:
# Merge the splits and display the label distribution
joined_splits = pd.concat([train_df, dev_df, test_df])

joined_splits.describe()

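|        | text                                              | labels |
|:-------|:--------------------------------------------------|:-------|
| count  | 2606                                              | 2606   |
| unique | 2606                                              | 7      |
| top    | Abstract Objective: Reporting bias due to soci... | News   |
| freq   | 1                                                 | 573    |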


In [14]:
print(joined_splits.labels.value_counts(normalize=True).to_markdown())

|                         |    labels |
|:------------------------|----------:|
| News                    | 0.219877  |
| Information/Explanation | 0.196086  |
| Promotion               | 0.183423  |
| Opinion/Argumentation   | 0.155027  |
| Instruction             | 0.134689  |
| Prose/Lyrical           | 0.0698388 |
| Legal                   | 0.0410591 |
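
To put later F1 scores into context, the distribution above can be turned into a majority-class baseline. The sketch below is an illustration only: the per-label counts are reconstructed from the proportions in the table (proportion × 2606 instances), and the baseline is computed on the merged splits rather than on the dev or test split alone.

```python
# Per-label counts reconstructed from the proportions above (proportion * 2606 instances).
label_counts = {
    "News": 573,
    "Information/Explanation": 511,
    "Promotion": 478,
    "Opinion/Argumentation": 404,
    "Instruction": 351,
    "Prose/Lyrical": 182,
    "Legal": 107,
}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# A classifier that always predicts the majority label is right exactly
# as often as that label occurs, so micro F1 equals its share of the data.
micro_f1 = label_counts[majority_label] / total

# Only the majority label has a non-zero F1: precision is its share, recall is 1.
precision, recall = micro_f1, 1.0
majority_f1 = 2 * precision * recall / (precision + recall)

# Macro F1 averages over all labels; the other six labels contribute 0.
macro_f1 = majority_f1 / len(label_counts)

print(f"Majority label: {majority_label}")
print(f"Micro F1: {micro_f1:.3f}, Macro F1: {macro_f1:.3f}")
```

Because the labels are imbalanced, such a baseline scores far lower on macro F1 than on micro F1, which is one reason to report both.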


## Training and saving

We will use the multilingual XLM-RoBERTa base model: https://huggingface.co/xlm-roberta-base

In [7]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [8]:
# Create a list of labels
LABELS = train_df.labels.unique().tolist()
LABELS

['Information/Explanation',
 'News',
 'Instruction',
 'Opinion/Argumentation',
 'Prose/Lyrical',
 'Legal',
 'Promotion']

In [9]:
# Initialize Wandb
wandb.init(project="X-GENRE classifiers", entity="tajak", name="X-GENRE-2-hyperparameter-search")

In [15]:
# Calculate how many steps each epoch will have
# Number of steps in an epoch = training samples / batch size
steps_per_epoch = len(train_df) // 8
steps_per_epoch

195

I evaluated the model every 10 epochs, that is, every 1950 steps (10 × 195). I first trained the model with evaluation during training to find the optimal number of epochs, and then trained it again and saved it.
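
The arithmetic behind these numbers can be checked with a short sketch. The variable names are illustrative. Whether the trainer also processes the last, smaller batch is an assumption here; if it does, an epoch has one step more than the value computed in the cell above.

```python
import math

train_size = 1562   # rows in the training split (from the shape printed earlier)
batch_size = 8      # train_batch_size used in this notebook

# Floor division counts only full batches, as in the cell above.
full_batches = train_size // batch_size

# Assumption: if the last, smaller batch is also processed, an epoch has one more step.
batches_with_remainder = math.ceil(train_size / batch_size)

# Evaluating every 10 epochs corresponds to this many training steps.
eval_every_n_epochs = 10
eval_steps = full_batches * eval_every_n_epochs

print(full_batches, batches_with_remainder, eval_steps)
```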

In [16]:
# Create a TransformerModel
roberta_base_model = ClassificationModel(
        "xlmroberta", "xlm-roberta-base",
        num_labels=len(LABELS),
        use_cuda=True,
        args= {
            "overwrite_output_dir": True,
            "num_train_epochs": 30,
            "train_batch_size":8,
            "learning_rate": 1e-5,
            # Use these parameters if you want to evaluate during training
            "evaluate_during_training": True,
            "evaluate_during_training_steps": steps_per_epoch*10,
            "evaluate_during_training_verbose": True,
            "use_cached_eval_features": True,
            'reprocess_input_data': True,
            "labels_list": LABELS,
            # no_cache and no_save are enabled because this run only searches for the number of epochs
            "no_cache": True,
            # Comment out "no_save": True if you want to save the model
            "no_save": True,
            "max_seq_length": 512,
            "save_steps": -1,
            # Do not save a checkpoint after every epoch, to avoid filling up the disk
            "save_model_every_epoch":False,
            "wandb_project": 'X-GENRE classifiers',
            "silent": True,
            }
        )

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense



In [17]:
# Train the model
roberta_base_model.train_model(train_df, eval_df = dev_df)







KeyboardInterrupt: 

Based on the evaluation during training, the optimal number of epochs is between 5 and 15.

In [18]:
def testing(test_df, test_name, epoch):
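    """
    Apply the trained model (the global roberta_base_model) to a test dataset and evaluate its predictions.

    Prints the macro and micro F1 scores, plots a confusion matrix and
    writes the results to a backup JSON file.

    Args:
    - test_df (pandas DataFrame): must contain the columns "text" and "labels"
    - test_name (str): name of the experiment, used as the plot title and in the backup file name
    - epoch (int): number of training epochs (num_train_epochs), stored with the results
    """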
    # Get the true labels
    y_true = test_df.labels

    model = roberta_base_model
    
    # Calculate the model's predictions on test
    def make_prediction(input_string):
        return model.predict([input_string])[0][0]

    y_pred = test_df.text.apply(make_prediction)

    # Calculate the scores
    macro = f1_score(y_true, y_pred, labels=LABELS, average="macro")
    micro = f1_score(y_true, y_pred, labels=LABELS,  average="micro")
    print(f"Macro f1: {macro:0.3}, Micro f1: {micro:0.3}")

    # Plot the confusion matrix:
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    plt.figure(figsize=(9, 9))
    plt.imshow(cm, cmap="Oranges")
    for (i, j), z in np.ndenumerate(cm):
        plt.text(j, i, '{:d}'.format(z), ha='center', va='center')
    classNames = LABELS
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=90)
    plt.yticks(tick_marks, classNames)
    plt.title(f"{test_name}")

    plt.tight_layout()
    fig1 = plt.gcf()
    plt.show()
    plt.draw()
    #fig1.savefig(f"Confusion-matrix-{test_name}.png",dpi=100)

    # Save the results:
    rezdict = {
        "experiment": test_name,
        "num_train_epochs": epoch,
        "train_batch_size":8,
        "learning_rate": 1e-5,
        "microF1": micro,
        "macroF1": macro,
        "y_true": y_true.to_dict(),
        "y_pred": y_pred.to_dict(),
        }
    #previous_results.append(rezdict)

    #Save intermediate results (just in case)
    backup = []
    backup.append(rezdict)
    with open(f"backup-results-{test_name}.json", "w") as backup_file:
        json.dump(backup,backup_file, indent= "")
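
The function above reports both macro and micro F1. As a minimal sketch of how the two differ, the scores can be computed by hand on a toy example. The helper names and the toy labels below are made up for illustration, and the sketch assumes that every text has exactly one label, as in this dataset.

```python
def f1_per_label(y_true, y_pred, label):
    """F1 for one label, treating it as the positive class (0.0 if undefined)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_per_label(y_true, y_pred, label) for label in labels) / len(labels)


def micro_f1(y_true, y_pred):
    """With exactly one label per text, micro F1 is the share of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy labels, for illustration only.
toy_labels = ["News", "Legal", "Promotion"]
y_true = ["News", "News", "News", "Legal", "Promotion", "Promotion"]
y_pred = ["News", "News", "Promotion", "News", "Promotion", "Promotion"]

print(f"Macro F1: {macro_f1(y_true, y_pred, toy_labels):.3f}")
print(f"Micro F1: {micro_f1(y_true, y_pred):.3f}")
```

In this toy example the rare label Legal is never predicted correctly, which pulls macro F1 well below micro F1. The same effect can occur with the small Legal and Prose/Lyrical classes in X-GENRE-2.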

In [19]:
# Train the model for various epochs to find the optimum number
epochs = [5, 8, 10, 15, 20]

for epoch in epochs:
    roberta_base_model = ClassificationModel(
                "xlmroberta", "xlm-roberta-base",
                num_labels=len(LABELS),
                use_cuda=True,
                args= {
                    "overwrite_output_dir": True,
                    "num_train_epochs": epoch,
                    "train_batch_size":8,
                    "learning_rate": 1e-5,
                    "labels_list": LABELS,
                    # Comment out no_cache and no_save if you want to save the model
                    "no_cache": True,
                    "no_save": True,
                    "max_seq_length": 512,
                    "save_steps": -1,
                    # Do not save a checkpoint after every epoch, to avoid filling up the disk
                    "save_model_every_epoch":False,
                    "wandb_project": 'X-GENRE classifiers',
                    "silent": True,
                    }
                )

    # Train the model
    roberta_base_model.train_model(train_df)
    
    # Test the model on dev_df
    testing(dev_df, f"GINCO-X-GENRE-dev-epoch-search:{epoch}", epoch)

The optimal number of epochs is 20.
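
The notebook does not state which score the choice was based on. As a sketch under the assumption that the run with the highest macro F1 on dev is preferred, the selection could look as follows. The function name is hypothetical, the dictionary keys follow the `rezdict` structure used in `testing()`, and the scores are placeholders, not the results of this notebook.

```python
def best_by_macro_f1(results):
    """Return the result dict with the highest macro F1 (ties: fewest epochs)."""
    return max(results, key=lambda r: (r["macroF1"], -r["num_train_epochs"]))


# Placeholder scores, NOT the scores obtained in this notebook.
toy_results = [
    {"num_train_epochs": 5, "macroF1": 0.50, "microF1": 0.55},
    {"num_train_epochs": 10, "macroF1": 0.60, "microF1": 0.62},
    {"num_train_epochs": 15, "macroF1": 0.60, "microF1": 0.63},
]

best = best_by_macro_f1(toy_results)
print(best["num_train_epochs"])
```

Breaking ties in favour of fewer epochs is a design choice that keeps training time down when two settings score the same.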

In [None]:
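# Compare the results by creating a dataframe from the backup files written by testing()
previous_results = []
for epoch in epochs:
    with open(f"backup-results-GINCO-X-GENRE-dev-epoch-search:{epoch}.json") as backup_file:
        previous_results.extend(json.load(backup_file))
results_df = pd.DataFrame(previous_results)

results_df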

In [None]:
# Save the file with updated results.
with open("FTD-X-GENRE-Experiments-Results.json", "w") as results_file:
    json.dump(previous_results,results_file, indent= "")

In [21]:
# Create a TransformerModel
roberta_base_model = ClassificationModel(
        "xlmroberta", "xlm-roberta-base",
        num_labels=len(LABELS),
        use_cuda=True,
        args= {
            "overwrite_output_dir": True,
            "num_train_epochs": 20,
            "train_batch_size":8,
            "learning_rate": 1e-5,
            "labels_list": LABELS,
            # no_cache and no_save are commented out here so that the trained model is saved
            #"no_cache": True,
            #"no_save": True,
            "max_seq_length": 512,
            "save_steps": -1,
            # Only the trained model will be saved - to prevent filling all of the space
            "save_model_every_epoch":False,
            "wandb_project": 'X-GENRE classifiers',
            "silent": True,
            }
        )

In [22]:
# Train the model
roberta_base_model.train_model(train_df)

In [24]:
# Save the trained model to Wandb
run = wandb.init(project="X-GENRE classifiers", entity="tajak", name="saving-trained-model")
trained_model_artifact = wandb.Artifact("SI-GINCO-X-GENRE-classifier", type="model", description="a model trained on the Slovene GINCO dataset, annotated with X-GENRE labels.")
trained_model_artifact.add_dir("/kaggle/working/outputs")
run.log_artifact(trained_model_artifact)