This notebook includes:
- the first SloBERTa training setup, based on Peter's demo and the final code from his repository, but trained on my own prepared data;
- the experiments with max_seq_length, which were run on this code.

# Preparing the dataset

We'll import the CSV files created in 1-GINCO-and-Transformers_initial_experiments_with_the_code.ipynb.
The text column holds the deduplicated text, and the labels are taken from primary_level_2.


In [1]:
# quote the version specifier, otherwise the shell treats ">" as an output redirect
!conda install --yes "pytorch>=1.6" cudatoolkit=11.0 -c pytorch

In [2]:
import pandas as pd

train_df = pd.read_csv("/kaggle/input/gincodataframededup/GINCO_dataframe_dedup_train.csv")
test_df = pd.read_csv("/kaggle/input/gincodataframededup/GINCO_dataframe_dedup_test.csv")

print("Train shape: {}, Test shape: {}.".format(train_df.shape, test_df.shape))

In [3]:
test_df.tail()

We will need to specify the exact number of labels, so we calculate it from our dataframe.

In [4]:
LABELS = train_df.labels.unique().tolist()
NUM_LABELS = len(LABELS)
NUM_LABELS

Let's transform the labels to integers:

In [5]:
LABELS = ['Information/Explanation',
 'Promotion of Services',
 'News/Reporting',
 'Invitation',
 'Promotion of a Product',
 'Forum',
 'Opinion/Argumentation',
 'Opinionated News',
 'Instruction',
 'List of Summaries/Excerpts',
 'Legal/Regulation',
 'Promotion',
 'Other',
 'Review',
 'Prose',
 'Announcement',
 'Call',
 'Recipe',
 'Correspondence',
 'Research Article',
 'Interview']

STR_TO_NUM = {s: i for i, s in enumerate(LABELS)}
NUM_TO_STR = {i: s for i, s in enumerate(LABELS)}
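
Before applying the mapping, it can be worth confirming that every label string in the data has an entry in STR_TO_NUM, since a missing key would raise a KeyError in the list comprehension used below. A minimal sketch, using a shortened stand-in for the label list and a hypothetical helper named unmapped_labels (neither is part of the original notebook):

```python
# Shortened stand-in for the full LABELS list (assumption for illustration).
labels = ["News/Reporting", "Forum", "Recipe"]
str_to_num = {s: i for i, s in enumerate(labels)}
num_to_str = {i: s for s, i in str_to_num.items()}


def unmapped_labels(values, mapping):
    """Return the sorted label strings in `values` that have no entry in `mapping`."""
    return sorted(set(values) - set(mapping))


# Hypothetical column values, one of which is missing from the label list.
column = ["Forum", "Recipe", "Forum", "Blog"]
print(unmapped_labels(column, str_to_num))  # ['Blog']

# The two dictionaries are inverses of each other.
assert all(num_to_str[str_to_num[s]] == s for s in labels)
```

In the notebook, the same check would presumably be run on train_df.labels and test_df.labels while they still contain strings.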

In [6]:
STR_TO_NUM

In [7]:
train_df.labels = [STR_TO_NUM[i] for i in train_df.labels]

In [8]:
test_df.labels = [STR_TO_NUM[i] for i in test_df.labels]
test_df.tail()

In [9]:
LABELS_NUM = train_df.labels.unique().tolist()

LABELS_NUM

Because we are using deduplicated text, some instances may have no text at all (NaN instead of a string). We need to drop them.

In [10]:
train_df = train_df.dropna()
train_df.shape

In [11]:
test_df = test_df.dropna()

test_df.shape

# Training the baseline - SloBERTa

We will use the hyperparameters from the article *The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild* and from the code published on GitHub (https://github.com/5roop/ginco_demo/blob/main/demo.ipynb), because the hyperparameter search showed them to be the most suitable for this task.

In [12]:
# install transformers (upgraded to the latest version) and simpletransformers
!pip install -q --upgrade transformers
!pip install -q simpletransformers

# check installed version
!pip freeze | grep simpletransformers

In [13]:
!pip uninstall -q torch -y
!pip install -q torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

In [14]:
# pytorch libraries
import torch # the main pytorch library
import torch.nn as nn # the sub-library containing Softmax, Module and other useful functions
import torch.optim as optim # the sub-library containing the common optimizers (SGD, Adam, etc.)
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

In [15]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [16]:
# import other necessary packages
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')

from scipy.special import softmax

In [18]:
from simpletransformers.classification import ClassificationModel

# define hyperparameters
model_args = {"overwrite_output_dir": True,
             "num_train_epochs": 90,
             "labels_list": LABELS_NUM,
             "learning_rate": 1e-5,
             "train_batch_size": 32,
             "no_cache": True,
             "no_save": True,
             "max_seq_length": 512,
             "save_steps": -1,
             }

model = ClassificationModel(
    "camembert", "EMBEDDIA/sloberta",
    use_cuda = (device == "cuda"),  # use_cuda expects a bool; the string 'cpu' would count as True
    num_labels = NUM_LABELS,
    args = model_args)

model.train_model(train_df)
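
The max_seq_length experiments mentioned in the introduction amount to repeating the training cell above with a different value each time. A minimal sketch of how the argument dictionaries could be generated: the helper name args_for_seq_lengths and the candidate lengths are assumptions for illustration, not the values actually tested, and labels_list is left out to keep the example self-contained.

```python
# Shared settings, copied from the training cell above.
base_args = {
    "overwrite_output_dir": True,
    "num_train_epochs": 90,
    "learning_rate": 1e-5,
    "train_batch_size": 32,
    "no_cache": True,
    "no_save": True,
    "save_steps": -1,
}


def args_for_seq_lengths(base, seq_lengths):
    """Build one independent args dict per candidate max_seq_length."""
    return {n: {**base, "max_seq_length": n} for n in seq_lengths}


# Hypothetical candidate lengths; 512 is the value used in this notebook.
sweep = args_for_seq_lengths(base_args, [128, 256, 512])

for n, args in sweep.items():
    print(n, args["max_seq_length"], args["num_train_epochs"])
    # In the notebook, each iteration would build a fresh ClassificationModel
    # with args=args, train it on train_df and evaluate it on test_df.
```

Building a new dictionary per run keeps base_args untouched, so one run's settings cannot leak into the next.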

## SloBERTa Prediction

Let's check whether the model works:

In [None]:
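# Slovene test sentence: 'Today we report on an event that happened on 1 January 2020. The person said: "This is a truly abnormal event"'
Instance_predictions, raw_outputs = model.predict(['Danes poročamo o dogodku, ki se je zgodil 1. 1. 2020. Oseba je dejala:"To je res nenormalen dogodek"'])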

In [None]:
Instance_predictions

In [None]:
NUM_TO_STR
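
model.predict returns the predicted class ids together with the raw model outputs (logits), one row per input text. A minimal sketch of turning one such row into probabilities and a label name, with a pure-Python softmax standing in for scipy.special.softmax. The logit values and the three-label mapping are invented for illustration. The sketch also assumes that position i in a row corresponds to label id i, which is worth checking against the order of labels_list before relying on it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical mapping and logits for a single input text.
num_to_str = {
    0: "Information/Explanation",
    1: "Promotion of Services",
    2: "News/Reporting",
}
raw_output = [0.2, -1.3, 2.1]

probs = softmax(raw_output)
# Index of the highest probability, i.e. the predicted class id.
best = max(range(len(probs)), key=probs.__getitem__)
print(num_to_str[best], round(probs[best], 3))
```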

In [None]:
from sklearn.metrics import f1_score, confusion_matrix

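def eval_model(test_df):
    """Predict labels for test_df.text with the global model and return micro- and macro-F1 plus the true and predicted labels."""
    y_true = test_df.labels
    # model.predict returns (predictions, raw_outputs); keep only the predictions
    y_pred = model.predict(test_df.text.tolist())[0]

    microF1 = f1_score(y_true, y_pred, labels=LABELS_NUM, average="micro")
    macroF1 = f1_score(y_true, y_pred, labels=LABELS_NUM, average="macro")

    return {"microF1": microF1,
            "macroF1": macroF1,
            "y_true": y_true,
            "y_pred": y_pred}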

In [None]:
eval_model(test_df)
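
Micro-F1 pools all decisions before computing the score, so frequent classes dominate. Macro-F1 averages the per-class F1 scores, so every class counts equally. A minimal pure-Python sketch on invented toy labels (not GINCO data) shows how far the two can diverge. The helper f1_scores is hypothetical and is only meant to mirror sklearn.metrics.f1_score for the single-label case shown here.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances gets F1 = 0.
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Toy data: class 0 is frequent and always recognised,
# class 1 is rare and always missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
micro, macro = f1_scores(y_true, y_pred, labels=[0, 1])
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.800 macro=0.444
```

With 21 unevenly sized genre labels, a gap of this kind between the two scores would point to weak performance on the smaller classes.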

In [None]:
import numpy as np

def plot_cm(y_true, y_pred, labels, save=False, title=None):
    """Plot the confusion matrix, print micro- and macro-F1, and save the figure to the path given in `save` (if any)."""
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import f1_score
    import matplotlib.pyplot as plt
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    plt.figure(figsize=(9, 9))
    plt.imshow(cm, cmap="Oranges")
    for (i, j), z in np.ndenumerate(cm):
        plt.text(j, i, '{:d}'.format(z), ha='center', va='center')
    classNames = labels
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=90)
    plt.yticks(tick_marks, classNames)
    microF1 = f1_score(y_true, y_pred, labels=labels, average ="micro")
    macroF1 = f1_score(y_true, y_pred, labels=labels, average ="macro")

    print(f"microF1: {microF1:0.4}")
    print(f"macroF1: {macroF1:0.4}")

    metrics = f"microF1: {microF1:0.4}, macroF1: {macroF1:0.4}"
    if title:
        plt.title(title +";\n" + metrics)
    else:
        plt.title(metrics)
    plt.tight_layout()
    if save:
        plt.savefig(save)
    plt.show()
    return microF1, macroF1

In [None]:
y_true = test_df.labels
y_pred = model.predict(test_df.text.tolist())[0]

plot_cm(y_true, y_pred, LABELS_NUM,
    save="SLOBERTA_baseline_90_epochs_no-sliding_window.png",
    title="SLOBERTA, baseline, 90 epochs, no sliding window")