# Preparing the dataset

In [4]:
# Import the file
import json

with open("/kaggle/input/genre-identification-corpus-ginco-10/GINCO-1.0-suitable.json") as f:
    dataset = json.load(f)

dataset[0]

## Extract text from paragraphs into one string

We'll create two additional fields for each text: "full_text" with the text from all paragraphs, and "dedup_text" with the text from the deduplicated paragraphs only (no near-duplicates).

In [6]:
for instance in dataset:
    paragraphs = instance["paragraphs"]

    # Joining texts:
    instance_full_text = " <p/> ".join([p["text"] for p in paragraphs])

    # Assigning texts to a new field:
    instance["full_text"] = instance_full_text

for instance in dataset:
    paragraphs = instance["paragraphs"]
    # Removing duplicates:
    paragraphs = [p for p in paragraphs if not p["duplicate"]]

    # Joining texts:
    instance_dedup_text = " <p/> ".join([p["text"] for p in paragraphs])

    # Assigning texts to a new field:
    instance["dedup_text"] = instance_dedup_text

dataset[0]
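
To make the difference between the two fields concrete, here is a minimal sketch on a toy instance. The paragraph contents are invented for illustration and are not taken from GINCO, and `join_paragraphs` is a hypothetical helper, not part of the original notebook. It only assumes the same `paragraphs`, `text` and `duplicate` keys used above.

```python
# Toy instance mimicking the fields used above ("paragraphs", "text", "duplicate").
# The contents are invented for illustration; they are not taken from GINCO.
toy_instance = {
    "paragraphs": [
        {"text": "Welcome to our shop.", "duplicate": False},
        {"text": "Cookies notice.", "duplicate": True},
        {"text": "Opening hours: 8-16.", "duplicate": False},
    ]
}

SEPARATOR = " <p/> "


def join_paragraphs(paragraphs, skip_duplicates=False):
    """Join paragraph texts with the <p/> separator.

    If skip_duplicates is True, paragraphs flagged as near-duplicates are dropped first.
    """
    if skip_duplicates:
        paragraphs = [p for p in paragraphs if not p["duplicate"]]
    return SEPARATOR.join(p["text"] for p in paragraphs)


full_text = join_paragraphs(toy_instance["paragraphs"])
dedup_text = join_paragraphs(toy_instance["paragraphs"], skip_duplicates=True)

print(full_text)   # all three paragraphs
print(dedup_text)  # only the two paragraphs not flagged as duplicates
```

Folding both variants into one helper would also let the two loops above be merged into a single pass over `dataset`.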

## Create the test-train-dev split

In [7]:
train = [i for i in dataset if i["split"] == "train"]
test = [i for i in dataset if i["split"] == "test"]
dev = [i for i in dataset if i["split"] == "dev"]

print("The train-dev-test splits consist of the following numbers of examples:", len(train), len(dev), len(test))

## Transform the dataset into tabular form

As simpletransformers expects a pandas DataFrame as input, we now construct a DataFrame with the columns `text` and `labels`.

For the labels we will use the primary_level_2 label, and for the text we'll use the deduplicated text.

In [9]:
import pandas as pd
train_df = pd.DataFrame(data=train, columns=["dedup_text", "primary_level_2"])
# Renaming columns to `text` and `labels`
train_df.columns = ["text", "labels"]

# Let's look at the beginning of the train dataframe

train_df.head()

We will need to specify the exact number of labels, so we calculate it from our dataframe.

In [10]:
LABELS = train_df.labels.unique().tolist()
NUM_LABELS = len(LABELS)
NUM_LABELS

Repeat the process with the test and dev splits:

In [12]:
test_df = pd.DataFrame(data=test, columns=["dedup_text", "primary_level_2"])
test_df.columns = ["text", "labels"]
test_df.tail()

In [13]:
dev_df = pd.DataFrame(data=dev, columns=["dedup_text", "primary_level_2"])
dev_df.columns = ["text", "labels"]
dev_df.tail()
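
Since `LABELS` is computed from the train split only, it may be worth checking that the dev and test splits contain no label missing from it. The following is a minimal sketch of such a check. `unseen_labels` is a hypothetical helper and the label names in the toy lists are placeholders, not necessarily actual GINCO categories.

```python
def unseen_labels(reference_labels, other_labels):
    """Return the sorted labels that occur in other_labels but not in reference_labels."""
    known = set(reference_labels)
    return sorted(set(other_labels) - known)


# Toy label lists with placeholder names, standing in for the `labels` columns.
toy_train_labels = ["News", "Forum", "News", "Promotion"]
toy_test_labels = ["Forum", "Recipe", "News"]

missing = unseen_labels(toy_train_labels, toy_test_labels)
print(missing)
```

With the dataframes above, the same check would be `unseen_labels(train_df.labels, test_df.labels)` and `unseen_labels(train_df.labels, dev_df.labels)`. An empty list means every label in that split is also present in the train split.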

In [14]:
# Save the train and test dataframes to CSV files:
train_df.to_csv("GINCO_dataframe_dedup_train.csv", index=False)
test_df.to_csv("GINCO_dataframe_dedup_test.csv", index=False)

Merge the train and dev splits to get a larger training set:

In [15]:
train_dev_df = pd.concat([train_df, dev_df], ignore_index=True)
train_dev_df.shape

In [16]:
# Save the dataframe to a csv:
train_dev_df.to_csv("GINCO_dataframe_dedup_train_dev.csv", index=False)

# Training the baseline - SloBERTa

Resources:
- https://towardsdatascience.com/bert-text-classification-in-a-different-language-6af54930f9cb
- Peter's demo code: https://github.com/TajaKuzman/Transformers-GINCO-Experiments/blob/main/Peters-code/Peter-GINCO-demo.ipynb
- Peter's final code (for the LREC article): https://github.com/5roop/task5_webgenres/

We will use the hyperparameters from the article *The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild*, as the hyperparameter search reported there revealed them to be the most suitable for the task. That is, the models will be trained for 30 epochs with a learning rate of 10^-5 and a maximum sequence length of 512 tokens.

In [None]:
# install simpletransformers
!pip install -q simpletransformers

# check installed version
!pip freeze | grep simpletransformers

In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs


model_args = ClassificationArgs()

model_args.num_train_epochs = 30
model_args.learning_rate = 1e-5
model_args.overwrite_output_dir = True
model_args.train_batch_size = 32
model_args.no_cache = True
model_args.no_save = True
model_args.fp16 = False
model_args.save_steps = -1
model_args.max_seq_length = 512
model_args.labels_list = LABELS


model = ClassificationModel("camembert", "EMBEDDIA/sloberta",
                            num_labels = NUM_LABELS,
                            use_cuda = True,
                            args = model_args,
                            )
model.train_model(train_df)

There are some issues with the model: the training code above produces an error. To rule out the environment, I checked whether Google Colab otherwise works with simpletransformers (using an example from a BERT tutorial), and it does.

Since the tutorial setup works, we must find out which of the settings in the initial code produces the error.

First, I commented out the additional parameters taken from the BERT tutorial. The code still worked, so I deleted them. I then tested the settings from the initial code one by one:

1. overwrite_output_dir - okay
2. num_train_epochs - okay
3. labels_list - okay
4. learning_rate - okay
5. train_batch_size - okay
6. no_cache - okay
7. no_save - okay
8. save_steps - okay

--> The problem is in the parameter **"max_seq_length": 512**. When I tried the sliding_window method, which can be used for texts longer than 512 tokens, the output stated that the model's maximum sequence length is already 512 (`Token indices sequence length is longer than the specified maximum sequence length for this model (622 > 512).`), so there may be no need to set this parameter explicitly.
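
To illustrate what a sliding window does with a text longer than the limit, here is a minimal sketch in plain Python. It is not the simpletransformers implementation: `sliding_windows` and its `step` parameter are hypothetical, special tokens are ignored, and the token ids are just a range of integers. The input length of 622 is taken from the message quoted above.

```python
def sliding_windows(token_ids, window_size=512, step=400):
    """Split a sequence of token ids into overlapping windows.

    Consecutive windows start `step` positions apart, so they overlap by
    `window_size - step` tokens. Sequences that already fit are returned
    as a single window.
    """
    if window_size <= 0 or step <= 0 or step > window_size:
        raise ValueError("need 0 < step <= window_size")
    if len(token_ids) <= window_size:
        return [list(token_ids)]

    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window_size]))
        if start + window_size >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += step
    return windows


# Stand-in for a tokenized document of 622 tokens (the length in the message above).
toy_ids = list(range(622))
windows = sliding_windows(toy_ids)
print([len(w) for w in windows])
```

Each window then fits within the 512-token limit and could be classified separately. How the per-window predictions are combined into one document label is a separate design choice that this sketch does not cover.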

Peter's code (from the demo) does not work either.