# Importing the dataset

In [1]:
import json
import os

os.system("""curl --remote-name-all https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1467{/GINCO-1.0-suitable.json.zip,/GINCO-1.0-nonsuitable.json.zip}""")
os.system("unzip GINCO-1.0-suitable.json.zip")

ginco_path = "GINCO-1.0-suitable.json/GINCO-1.0-suitable.json"
with open(ginco_path, encoding="utf-8") as f:
    content = json.load(f)
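
`os.system` only returns an exit status and does not raise on failure, so a failed download would go unnoticed until `open` fails. As a hedged alternative, the sketch below reads a JSON member directly out of a zip archive using only the standard library. The helper name `load_json_from_zip` is hypothetical, and the layout of the archive is an assumption inferred from `ginco_path`.

```python
import json
import zipfile


def load_json_from_zip(zip_path, member):
    """Parse one JSON file stored inside a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.open returns a binary file object; json.load accepts its bytes.
        with archive.open(member) as f:
            return json.load(f)
```

Assuming the archive contains the same path as `ginco_path`, the call would look like `load_json_from_zip("GINCO-1.0-suitable.json.zip", ginco_path)`.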


Let's inspect the first instance:

In [2]:
content[0]

{'id': '3949',
 'url': 'http://www.pomurje.si/aktualno/sport/zimska-liga-malega-nogometa/',
 'crawled': '2014',
 'hard': False,
 'paragraphs': [{'text': 'Šport', 'duplicate': False, 'keep': True},
  {'text': 'Zimska liga malega nogometa sobota, 12.02.2011',
   'duplicate': False,
   'keep': True},
  {'text': 'avtor: Tonček Gider', 'duplicate': False, 'keep': True},
  {'text': "V 7. krogu zimske lige v malem nogometu v Križevcih pri Ljutomeru je v prvi ligi vodilni 100 plus iz Križevec izgubil s tretjo ekipo na lestvici Rock'n roll iz Križevec z rezultatom 1:2, druga na lestvici Top Finedika iz Križevec je bila poražena z ekipo Bar Milene iz Ključarovec z rezultatom 7:8. V drugi križevski ligi je vodilni Cafe del Mar iz Vučje vasi premagal Montažo Vrbnjak iz Stare Nove vasi z rezultatom 3:2.",
   'duplicate': False,
   'keep': True},
  {'text': 'oglasno sporočilo', 'duplicate': False, 'keep': True},
  {'text': 'Ocena', 'duplicate': False, 'keep': True},
  {'text': 'Komentiraj Za komenti

The text is stored in the `paragraphs` field. In our machine learning experiments, the paragraphs were joined with the separator `<p/>`. The best results were obtained when only the paragraphs with `duplicate == False` were included.

In [3]:
for instance in content:
    paragraphs = instance["paragraphs"]
    # Removing duplicates:
    paragraphs = [p for p in paragraphs if not p["duplicate"]]

    # Joining texts:
    instance_text = " <p/> ".join([p["text"] for p in paragraphs])

    # Assigning texts to a new field:
    instance["text"] = instance_text
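
The same filtering and joining can be wrapped in a small function, which makes it easy to try on a toy record before running it over the whole dataset. This is a minimal sketch: the name `join_paragraphs` and its parameters are hypothetical, and the example texts are made up.

```python
def join_paragraphs(paragraphs, separator=" <p/> ", drop_duplicates=True):
    """Join paragraph texts, optionally skipping those flagged as duplicates."""
    kept = [
        p["text"]
        for p in paragraphs
        if not (drop_duplicates and p["duplicate"])
    ]
    return separator.join(kept)


# Toy paragraphs in the same shape as the GINCO records (texts are made up).
example = [
    {"text": "Title", "duplicate": False, "keep": True},
    {"text": "Menu", "duplicate": True, "keep": False},
    {"text": "Body", "duplicate": False, "keep": True},
]
joined = join_paragraphs(example)
print(joined)  # Title <p/> Body
```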

In [4]:
content[0]

{'id': '3949',
 'url': 'http://www.pomurje.si/aktualno/sport/zimska-liga-malega-nogometa/',
 'crawled': '2014',
 'hard': False,
 'paragraphs': [{'text': 'Šport', 'duplicate': False, 'keep': True},
  {'text': 'Zimska liga malega nogometa sobota, 12.02.2011',
   'duplicate': False,
   'keep': True},
  {'text': 'avtor: Tonček Gider', 'duplicate': False, 'keep': True},
  {'text': "V 7. krogu zimske lige v malem nogometu v Križevcih pri Ljutomeru je v prvi ligi vodilni 100 plus iz Križevec izgubil s tretjo ekipo na lestvici Rock'n roll iz Križevec z rezultatom 1:2, druga na lestvici Top Finedika iz Križevec je bila poražena z ekipo Bar Milene iz Ključarovec z rezultatom 7:8. V drugi križevski ligi je vodilni Cafe del Mar iz Vučje vasi premagal Montažo Vrbnjak iz Stare Nove vasi z rezultatom 3:2.",
   'duplicate': False,
   'keep': True},
  {'text': 'oglasno sporočilo', 'duplicate': False, 'keep': True},
  {'text': 'Ocena', 'duplicate': False, 'keep': True},
  {'text': 'Komentiraj Za komenti

Let's separate the instances into train, dev, and test sets according to their `split` field:

In [5]:
train = [i for i in content if i["split"] == "train"]
test = [i for i in content if i["split"] == "test"]
dev = [i for i in content if i["split"] == "dev"]
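
Before training, it can be useful to check how many instances fall into each split and whether any instance carries an unexpected `split` value, since the list comprehensions above would silently drop it. The sketch below does this on toy records; the records themselves are made up for illustration.

```python
from collections import Counter

# Toy records; only the `split` field matters here.
records = [
    {"id": "a", "split": "train"},
    {"id": "b", "split": "train"},
    {"id": "c", "split": "dev"},
    {"id": "d", "split": "test"},
]

# Count instances per split in a single pass.
split_counts = Counter(r["split"] for r in records)
print(split_counts)

# Any split name outside the three expected ones would be worth inspecting.
unexpected = set(split_counts) - {"train", "dev", "test"}
print(unexpected)
```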

As `simpletransformers` expects a pandas DataFrame as input, we now construct one with the columns `text` and `labels`.

For the labels, we will use the `primary_level_2` field.

In [6]:
import pandas as pd
train_df = pd.DataFrame(data=train, columns=["text", "primary_level_2"])
# Renaming columns to `text` and `labels`
train_df.columns = ["text", "labels"]


In [7]:
train_df.tail()


| | text | labels |
|---|---|---|
| 597 | Sedmošolci so imeli tehniški dan na temo progr... | List of Summaries/Excerpts |
| 598 | Projektne novine <p/> Promocijski projektni ča... | Information/Explanation |
| 599 | Pri opremljanju kopalnice ne pozabite na kakov... | List of Summaries/Excerpts |
| 600 | O izdelku Seks igračka Satisfyer Partner Plus ... | List of Summaries/Excerpts |
| 601 | Razprava pogosto potegne na plano najprej tist... | Opinion/Argumentation |


Now we are ready to use this dataset in `simpletransformers`. We will need to specify the exact number of labels, so we calculate it from our dataframe:

In [8]:
LABELS = train_df.labels.unique().tolist()
NUM_LABELS = len(LABELS)
NUM_LABELS

21
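
Because `LABELS` is computed from the training split only, a label that occurs only in the dev or test split would be missing from it. The sketch below shows one way to check for that on toy label lists; the helper name `unseen_labels` is hypothetical and the lists are made up.

```python
def unseen_labels(train_labels, other_labels):
    """Return labels present in `other_labels` but absent from `train_labels`."""
    return sorted(set(other_labels) - set(train_labels))


# Toy label lists for illustration only.
train_labels = ["Information/Explanation", "Opinion/Argumentation", "Announcement"]
test_labels = ["Announcement", "Promotion of a Product"]

missing = unseen_labels(train_labels, test_labels)
print(missing)  # ['Promotion of a Product']
```

In the notebook, the same check could be applied to `train_df.labels` and the labels of the other splits.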

In [9]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

model_args = ClassificationArgs()
model_args.num_train_epochs = 90
model_args.learning_rate = 1e-5
model_args.overwrite_output_dir = True
model_args.train_batch_size = 32
# Skip caching features and saving checkpoints to disk:
model_args.no_cache = True
model_args.no_save = True
model_args.save_steps = -1
model_args.max_seq_length = 512
# Pass the string labels so that they are mapped to integer ids internally:
model_args.labels_list = LABELS

# SloBERTa uses the CamemBERT architecture, hence the "camembert" model type:
model = ClassificationModel(
    "camembert",
    "EMBEDDIA/sloberta",
    num_labels=NUM_LABELS,
    use_cuda=True,
    args=model_args,
)
model.train_model(train_df)

Some weights of the model checkpoint at EMBEDDIA/sloberta were not used when initializing CamembertForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at EMBEDDIA/sloberta and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.weight', 

(1710, 0.8440781195871314)
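
The first element of the returned tuple is consistent with the total number of training steps. The arithmetic below is a sketch under two assumptions: that the training set has 602 rows (the last index shown by `train_df.tail()` is 601), and that one step is taken per batch, with no gradient accumulation.

```python
import math

n_train = 602      # assumption: last index shown by train_df.tail() is 601
batch_size = 32    # model_args.train_batch_size
epochs = 90        # model_args.num_train_epochs

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 19 1710
```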

To predict labels for the test split, we repeat the DataFrame construction for the test data:

In [10]:
test_df = pd.DataFrame(data=test, columns=["text", "primary_level_2"])
test_df.columns = ["text", "labels"]
test_df.tail()

| | text | labels |
|---|---|---|
| 195 | 95.000 € <p/> Opis <p/> Hiša se nahaja na mirn... | Promotion of a Product |
| 196 | - občasno razstavo V tem domu luč prosvete sij... | Information/Explanation |
| 197 | EuroBasket in spremembe v prometnem režimu <p/... | Announcement |
| 198 | Omogočamo vam 20 % popusta za nakup kopalnih k... | List of Summaries/Excerpts |
| 199 | Pogajalska akadmija v letu 2014 02.02.2014 <p/... | List of Summaries/Excerpts |


Now the `text` column can be fed into the model as a list of strings:

In [11]:
y_pred, raw_outputs = model.predict(test_df.text.values.tolist())

In [12]:
y_true = test_df.labels.values.tolist()

Let's perform a brief evaluation:

In [13]:
from sklearn.metrics import f1_score, confusion_matrix
macro = f1_score(y_true, y_pred, labels=LABELS, average="macro")
micro = f1_score(y_true, y_pred, labels=LABELS, average="micro")

print(f"F1 score: {macro=:0.3}, {micro=:0.3}")

F1 score: macro=0.578, micro=0.65
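
The gap between the two scores comes from how they aggregate: macro F1 averages the per-class F1 scores, so rare classes weigh as much as frequent ones, while micro F1 pools the counts over all classes. The plain-Python sketch below illustrates the usual definitions on toy data. It is not scikit-learn's implementation, and the function name `f1_scores` is hypothetical.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted instances gets F1 = 0 here.
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class) / len(per_class)
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    return macro, micro


# Toy data: class "c" is never predicted, which pulls the macro score down.
y_true_toy = ["a", "a", "a", "b", "c"]
y_pred_toy = ["a", "a", "b", "b", "a"]
macro_toy, micro_toy = f1_scores(y_true_toy, y_pred_toy, labels=["a", "b", "c"])
print(f"{macro_toy:.3f} {micro_toy:.3f}")  # 0.444 0.600
```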


In [14]:
cm = confusion_matrix(y_true, y_pred, labels=LABELS)
print(cm)

[[25  0  2  0  1  0  0  0  0  1  1  1  1  0  0  0  1  0  0  0  0]
 [ 0  3  0  0  1  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0]
 [ 1  0 18  1  1  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 1  0  2  3  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0]
 [ 1  0  0  0 14  0  0  1  1  0  0  3  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0 10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1  0  0  0  0  9  3  0  0  0  0  2  0  0  0  0  0  1  0  0]
 [ 0  0  3  0  0  0  0 16  0  1  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  4  0  1  0  0  0  0  0  1  0  0  0  0]
 [ 0  1  1  1  1  0  1  1  0 10  0  1  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  0  0  0  0  0  2  0  0  0  0  0  3  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0