# GermEval 2019 Task 1: Hierarchical Classification of Blurbs

The default project is based on a shared task (a kind of competition) held in 2019. It is about multi-label classification of so-called blurbs: short summaries of books, like the ones you see in an online bookstore. Both the blurbs and the labels are in German.

The task is multi-label classification: each blurb can be assigned to one or more classes. The classes are hierarchical (e.g., Fantasy -> Urban Fantasy), but we don't really use the hierarchical nature of the classes.

GermEval 2019 Task 1 actually consists of two subtasks: Task A is about predicting the most general class, and Task B is about predicting *all* labels. We focus only on Task B.
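
To make the difference between the two subtasks concrete, here is a minimal sketch. It assumes that each category of a book is given as a path ordered from the most general to the most specific topic. The helper names `task_a_labels` and `task_b_labels` are made up for illustration and are not part of the shared-task tooling.

```python
def task_a_labels(paths):
    """Task A: only the most general (root) topic of each path."""
    return {path[0] for path in paths if path}


def task_b_labels(paths):
    """Task B: every topic on every path, at any depth."""
    return {topic for path in paths for topic in path}


# One book may carry several category paths; this one has a single path.
paths = [
    ["Literatur & Unterhaltung", "Fantasy", "Heroische Fantasy"],
]
print(sorted(task_a_labels(paths)))
print(sorted(task_b_labels(paths)))
```

For Task B the target is simply the union of all topics, which is why the hierarchy can be ignored in a first baseline.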

There are a total of 343 different categories and sub-categories.

The paper describing the task is available [here](https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2019-hmc/gest19-1-description.pdf).


What makes the project exciting is that many of the teams that competed in 2019 have published papers describing their approaches, and at least some open-source implementations are available. The leaderboard of the best submissions is also included in the paper linked above. **Can you outperform the best teams with 2023 technology?**


You can read the papers of the winners here:
 - [Multi-Label Multi-Class Hierarchical Classification using Convolutional Seq2Seq](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/germeval/Germeval_Task1_paper_2.pdf)
 - [TwistBytes - Hierarchical Classification at GermEval 2019: walking the fine line (of recall and precision)](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/germeval/Germeval_Task1_paper_6.pdf)
 - [Code and paper of the COMTRAVO-DS team](https://github.com/davidsbatista/GermEval-2019-Task_1)


The papers of all competitors are available here: https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/ (scroll down).


I have implemented the data loading, the evaluation and a baseline model for you.

First, let's download the data:

In [None]:
import sys
import warnings
import torch

In [None]:
! wget https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2019-hmc/germeval2019t1-public-data-final.zip

--2023-05-31 20:25:18--  https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2019-hmc/germeval2019t1-public-data-final.zip
Resolving www.inf.uni-hamburg.de (www.inf.uni-hamburg.de)... 134.100.36.5
Connecting to www.inf.uni-hamburg.de (www.inf.uni-hamburg.de)|134.100.36.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘germeval2019t1-public-data-final.zip.1’

germeval2019t1-publ     [                <=> ]  13.27M  1.12MB/s    in 12s     

2023-05-31 20:25:31 (1.09 MB/s) - ‘germeval2019t1-public-data-final.zip.1’ saved [13915224]



In [None]:
! unzip -u -q germeval2019t1-public-data-final.zip

In [None]:
! ls -la

total 62960
drwxr-xr-x 1 root root     4096 May 31 20:25 .
drwxr-xr-x 1 root root     4096 May 31 17:15 ..
-rw-r--r-- 1 root root   233276 Jun  7  2019 blurbs_dev_label.txt
-rw-r--r-- 1 root root  1920753 Jun  7  2019 blurbs_dev_nolabel.txt
-rw-r--r-- 1 root root  2630464 Jun  7  2019 blurbs_dev.txt
-rw-r--r-- 1 root root   469047 Aug  9  2019 blurbs_test_label.txt
-rw-r--r-- 1 root root  3787630 Jun  7  2019 blurbs_test_nolabel.txt
-rw-r--r-- 1 root root  5217031 Aug 21  2019 blurbs_test.txt
-rw-r--r-- 1 root root  1630376 Aug  9  2019 blurbs_train_label.txt
-rw-r--r-- 1 root root 18174587 Jun  7  2019 blurbs_train.txt
drwxr-xr-x 2 root root     4096 Sep  2  2019 classification_models
drwxr-xr-x 4 root root     4096 May 30 13:33 .config
-rw-r--r-- 1 root root   251606 Jun  7  2019 description.pdf
drwxr-xr-x 3 root root     4096 Dec  6  2019 evaluation
-rw-r--r-- 1 root root 13915224 Dec  6  2019 germeval2019t1-public-data-final.zip
-rw-r--r-- 1 root root 13915224 Dec  6  2019 germeval

The train, dev and test sets are in the blurbs_{train,dev,test}.txt files. Despite the extension, they are actually XML files. Let's see:

In [None]:
!head blurbs_train.txt

<book date="2019-01-04" xml:lang="de">
<title>Die Klinik</title>
<body>Ein Blick hinter die Kulissen eines Krankenhauses vom Autor der Bestseller "Der Medicus" und "Der Medicus von Saragossa". Der Wissenschaftler Adam Silverstone, der kubanische Aristokrat Rafael Meomartino und der Farbige Spurgeon Robinson - sie sind drei grundverschiedene Klinik-Ärzte, die unter der unerbittlichen Aufsicht von Dr. Longwood praktizieren. Eines Tages stirbt eine Patientin, und Dr. Longwood wittert einen Behandlungsfehler. Sofort macht er sich auf die Suche nach einem Schuldigen, dem er die Verantwortung in die Schuhe schieben könnte ...</body>
<copyright>(c) Verlagsgruppe Random House GmbH</copyright>
<categories>
<category>
<topic d="0">Literatur & Unterhaltung</topic>
<topic d="1" label="True">Romane & Erzählungen</topic>
</category>
</categories>


Let's implement data reading using the BeautifulSoup library:

In [None]:
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm


def load_data(filename):
    """
    Loads the dataset as a list of (blurb text, set of labels) tuples.
    """
    data = []
    # Close the file promptly and read it as UTF-8 regardless of the platform default
    with open(filename, 'rt', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for book in tqdm(soup.find_all('book')):
        # Collect every topic on every category path of this book (all depths)
        categories = {str(t.string) for t in book.find_all('topic')}
        data.append((str(book.find("body").string), categories))
    return data
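
Judging from the `head` output above, the topic names contain a bare `&`, which a strict XML parser rejects. That is probably why a lenient HTML parser is convenient here. The sketch below shows the problem with the standard library's `xml.etree.ElementTree` and one possible workaround. `SAMPLE` is an abbreviated copy of the first book with a closing `</book>` tag added by me. The names `BARE_AMP` and `topics_with_depth` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the first book; note the bare "&" in the topic names.
SAMPLE = """<book date="2019-01-04" xml:lang="de">
<title>Die Klinik</title>
<categories>
<category>
<topic d="0">Literatur & Unterhaltung</topic>
<topic d="1" label="True">Romane & Erzählungen</topic>
</category>
</categories>
</book>"""

# Matches "&" that does not start an entity such as &amp; or &#228;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def topics_with_depth(book_xml):
    """Return (depth, topic) pairs for one <book> element."""
    root = ET.fromstring(BARE_AMP.sub("&amp;", book_xml))
    return [(int(t.get("d")), t.text) for t in root.iter("topic")]


try:
    ET.fromstring(SAMPLE)
    strict_ok = True
except ET.ParseError:
    strict_ok = False  # the unescaped "&" is rejected by the strict parser

print(strict_ok)
print(topics_with_depth(SAMPLE))
```

The `d` attribute gives the depth of a topic in the hierarchy, so the same idea could be used to keep only the `d="0"` topics if you want Task A labels.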

In [None]:
train_data = load_data("blurbs_train.txt")

  0%|          | 0/14548 [00:00<?, ?it/s]

In [None]:
train_data[1]

('Die Bedrohungen für Midkemia und Kelewan wollen nicht enden: Obwohl das Konklave der Schatten Leso Varen und seinen Nachtfalken dicht auf den Fersen ist, schmieden sie weiter ihre finsteren Umsturzpläne gegen das Herrscherhaus von Kesh. Zugleich stellt sich heraus, dass von den mysteriösen Talnoy eine bisher ungekannte Gefahr ausgeht: durch ihre magischen Kräfte können die fürchterlichen Dasati ins Reich Midkemia eindringen und alle ins Unheil stürzen …',
 {'Fantasy', 'Heroische Fantasy', 'Literatur & Unterhaltung'})

In [None]:
dev_data = load_data("blurbs_dev.txt")

  0%|          | 0/2079 [00:00<?, ?it/s]

In [None]:
dev_data[0]

('Die Konfirmandenzeit wird für Jugendliche besonders dann zu einer nachhaltigen Erfah\xadrung, wenn ihre Eltern Anteil nehmen und sie hilfreich begleiten. Um sie dabei zu unterstützen, bietet diese Broschüre praktische Hinweise zu Formen und Organisation der Konfirmandenzeit in den Gemeinden, Bilder und Berichte zur religiösen und pädagogischen Gestaltung der Konfirmandenarbeit heute sowie Hinweise zur Vorbereitung und Feier der Konfirmation in der Familie. Besonders für die Eltern, für die die Konfirmation ihrer Kinder eine Wiederbegegnung mit Kirche ist, hält dieses Heft zudem Erklärungen zu den Festen und Feiertagen der Kirche, Antworten auf häufig gestellte Fragen und ein kleines Glossar kirchlicher Begriffe bereit. Mit diesem Heft wird die Konfirmandenzeit zu einer Bereicherung auch für die Eltern.',
 {'Gemeindearbeit',
  'Gemeindearbeit mit Kindern & Jugendlichen',
  'Glaube & Ethik',
  'Konfirmation'})

In [None]:
test_data = load_data("blurbs_test.txt")

  0%|          | 0/4157 [00:00<?, ?it/s]

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
print(max(len(tokenizer.encode(x[0])) for x in test_data))

Token indices sequence length is longer than the specified maximum sequence length for this model (771 > 512). Running this sequence through the model will result in indexing errors


771
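
The longest test blurb has 771 tokens, which is more than the 512 tokens that BERT accepts, so some text has to be cut. Plain truncation keeps only the beginning. An alternative that is sometimes tried for long texts is to keep both the beginning and the end. Below is a minimal sketch of that idea on plain lists of token ids. The function `head_tail_truncate` and its `head_fraction` parameter are hypothetical and not part of the `transformers` library.

```python
def head_tail_truncate(token_ids, max_len=512, head_fraction=0.25):
    """Keep the first and the last tokens of a sequence that is too long.

    token_ids is assumed to exclude special tokens such as [CLS] and [SEP];
    reserve room for them by passing a smaller max_len.
    """
    if len(token_ids) <= max_len:
        return list(token_ids)
    head_len = int(max_len * head_fraction)
    tail_len = max_len - head_len
    head = list(token_ids[:head_len])
    tail = list(token_ids[len(token_ids) - tail_len:])
    return head + tail


ids = list(range(771))  # stand-in for a 771-token blurb
short = head_tail_truncate(ids, max_len=510)  # 510 leaves room for two special tokens
print(len(short), short[0], short[-1])
```

Whether this helps on blurbs is an open question; it is only worth trying if many blurbs are actually longer than the limit.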


In [None]:
# Collect every label seen in the training data, in order of first appearance.
# The position of a label in this list is its column in the multi-hot target vector.
train_labels = []
seen_labels = set()  # set membership keeps the lookup fast
for sample in train_data:
  for label in sample[1]:
    if label not in seen_labels:
      seen_labels.add(label)
      train_labels.append(label)
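
The label list defines the columns of the target vector: each blurb's label set becomes a fixed-length vector of zeros and ones. The following sketch shows that encoding in plain Python on two toy samples. The helpers `build_label_index` and `multi_hot` are made up for this example; the notebook itself relies on `MultiLabelBinarizer` for this step.

```python
def build_label_index(samples):
    """Map each label to a column index, in order of first appearance."""
    index = {}
    for _text, labels in samples:
        for label in sorted(labels):  # sorted for a reproducible order
            index.setdefault(label, len(index))
    return index


def multi_hot(labels, label_index):
    """Encode a set of labels as a 0/1 vector; unknown labels are skipped."""
    vector = [0.0] * len(label_index)
    for label in labels:
        if label in label_index:
            vector[label_index[label]] = 1.0
    return vector


toy = [
    ("blurb one", {"Literatur & Unterhaltung", "Fantasy"}),
    ("blurb two", {"Glaube & Ethik"}),
]
label_index = build_label_index(toy)
print(label_index)
print(multi_hot({"Fantasy", "Unseen label"}, label_index))
```

Note that a label which never occurs in the training data has no column, so the model can never predict it.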

In [None]:
type(test_data[0][0])

str

In [None]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

def collate_batch(batch):
   """Turn a list of (text, label set) samples into padded tensors."""
   texts = [sample[0] for sample in batch]
   label_list = [list(sample[1]) for sample in batch]  # any number of labels per sample

   # Tokenize the whole batch at once: pad to the longest blurb in the batch
   # and truncate anything beyond BERT's 512-token limit
   encoded = tokenizer(
       texts,
       add_special_tokens=True,
       max_length=512,
       padding='longest',
       truncation=True,
       return_token_type_ids=True,
       return_tensors='pt'
   )

   # Multi-hot targets with one column per training label; labels that never
   # occur in the training data have no column and are ignored (with a warning)
   mlb = MultiLabelBinarizer(classes=train_labels)
   res_labels = torch.tensor(mlb.fit_transform(label_list), dtype=torch.float)
   return {
            "labels": res_labels,
            "ids": encoded['input_ids'],
            "masks": encoded['attention_mask'],
            "token_type_ids": encoded['token_type_ids']
          }

train_dataloader = DataLoader(train_data, batch_size=8, shuffle=True, 
                              collate_fn=collate_batch)

test_dataloader = DataLoader(test_data, batch_size=8, shuffle=False, 
                              collate_fn=collate_batch)

In [None]:
#batch = next(iter(train_dataloader))
#print(batch)

In [None]:
#print(batch["ids"].shape)
#print(batch["masks"].shape)
#print(batch["token_type_ids"].shape)

In [None]:
train_dataloader = DataLoader(train_data, batch_size=16, shuffle=True, 
                              collate_fn=collate_batch)

test_dataloader = DataLoader(test_data, batch_size=16, shuffle=False, 
                              collate_fn=collate_batch)

In [None]:
import torch.nn as nn
import torch.nn.functional as F

In [None]:
device = 'cpu'
if torch.cuda.is_available():
  device = torch.device('cuda')

print(device)

cuda


In [None]:
from transformers import BertModel


class BertNERModel(torch.nn.Module):
  """BERT encoder with a linear multi-label head (despite the name, this is not an NER model)."""

  def __init__(self, num_classes, dropout_prob, device):
    super(BertNERModel, self).__init__()
    self.l1 = BertModel.from_pretrained('bert-base-german-cased')
    self.l2 = torch.nn.Dropout(dropout_prob)
    # 768 is the hidden size of bert-base models
    self.l3 = torch.nn.Linear(768, num_classes)
    self.device = device

  def forward(self, ids, mask, token_type_ids):
      output_1 = self.l1(ids, attention_mask=mask, token_type_ids=token_type_ids)
      # output_1[1] is the pooled [CLS] representation, shape (batch_size, 768)
      output_2 = self.l2(output_1[1])
      # Return raw logits; the sigmoid is applied inside BCEWithLogitsLoss and in evaluate()
      output = self.l3(output_2)
      return output

In [None]:
#model = BertNERModel(343, 0.3, device).to(device)

In [None]:
#print(model.forward(batch["ids"].to(device), batch["masks"].to(device), batch["token_type_ids"].to(device)))

In [None]:
from sklearn.metrics import f1_score

def calculate_metrics(pred, target, threshold=0.2):
    """Binarize the predicted probabilities at `threshold` and compute the micro-averaged F1."""
    pred = np.array(pred > threshold, dtype=float)
    return {
            'micro/f1': f1_score(y_true=target, y_pred=pred, average='micro'),
            }
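
To see what the micro-averaged F1 measures and how the threshold affects it, here is a small worked example in plain Python. It counts true positives, false positives and false negatives over all (sample, label) cells at once, which is my understanding of what `average='micro'` does for multi-label indicator matrices. The probabilities and targets are made up.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) cells of two 0/1 matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


probs = [[0.9, 0.3, 0.1], [0.1, 0.6, 0.05]]  # made-up probabilities
y_true = [[1, 0, 1], [0, 1, 0]]
scores = {}
for threshold in (0.2, 0.5):
    y_pred = [[int(p > threshold) for p in row] for row in probs]
    scores[threshold] = micro_f1(y_true, y_pred)
    print(threshold, round(scores[threshold], 3))
```

In this toy case the higher threshold removes one false positive and raises the score, so the default of 0.2 is worth tuning, ideally on the dev set rather than on the test set.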

In [None]:
from tqdm.notebook import tqdm

def train(model, num_epochs, train_iter, test_iter):
  # Fine-tuning BERT usually calls for a small learning rate (around 2e-5);
  # with a rate as large as 1e-2 the pretrained weights are quickly destroyed
  # and the model tends to predict the same vector for every input
  optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
  # Multi-label loss: an independent sigmoid + binary cross-entropy per class
  criterion = nn.BCEWithLogitsLoss()

  for epoch in range(1, num_epochs+1):
    print("Epoch %d" % epoch)
    model.train()
    # We wrap the dataloader iterator to tqdm, so that we'll get a nice progress bar
    for batch in tqdm(train_iter, total=len(train_iter)):
      ids = batch["ids"].to(device)
      target = batch["labels"].to(device)
      token_type_ids = batch["token_type_ids"].to(device)
      masks = batch["masks"].to(device)

      optimizer.zero_grad()
      output = model(ids, masks, token_type_ids)
      loss = criterion(output, target)
      loss.backward()
      optimizer.step()

    # evaluate() switches the model to eval mode; model.train() above switches it back
    train_f1 = evaluate("train", train_iter, model)
    test_f1 = evaluate("test", test_iter, model)

def evaluate(dataset_name, data_iter, model):
  model.eval()
  criterion = nn.BCEWithLogitsLoss()
  batch_losses = []
  model_result = []
  targets = []
  with torch.inference_mode():
    for batch in data_iter:
      ids = batch["ids"].to(device)
      target = batch["labels"].to(device)
      token_type_ids = batch["token_type_ids"].to(device)
      masks = batch["masks"].to(device)

      output = model(ids, masks, token_type_ids)
      batch_losses.append(criterion(output, target).item())

      # Convert logits to probabilities before thresholding
      model_result.extend(torch.sigmoid(output).cpu().numpy())
      targets.extend(target.cpu().numpy())

  # Stack the per-sample rows once, after all batches have been processed
  fin_output = np.array(model_result)
  fin_targets = np.array(targets)

  result = calculate_metrics(fin_output, fin_targets)
  print("Evaluation on {} - micro f1: {:.3f} ".format(dataset_name,
                                    result['micro/f1']))
  loss_value = np.mean(batch_losses)
  print("Evaluation on {} - loss:{:.3f}".format(dataset_name, loss_value))
  return result['micro/f1']
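
Since `evaluate` returns the micro F1, it is easy to keep track of the best epoch and to stop once the score no longer improves. The sketch below is one possible way to do that. The class `BestScoreTracker` is hypothetical, the scores are made up, and in practice the score should come from the dev set rather than from the test set.

```python
class BestScoreTracker:
    """Remember the best score and signal when training should stop."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, score):
        """Record a score; return True once `patience` epochs brought no improvement."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


tracker = BestScoreTracker(patience=2)
stopped_at = None
for epoch, dev_f1 in enumerate([0.40, 0.55, 0.54, 0.53], start=1):  # made-up scores
    if tracker.update(epoch, dev_f1):
        stopped_at = epoch
        break
print(tracker.best_epoch, tracker.best_score, stopped_at)
```

Inside `train`, the call `tracker.update(epoch, evaluate("dev", dev_iter, model))` would play the role of the loop body, where `dev_iter` is a hypothetical dataloader over `dev_data`.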

In [None]:
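# One output per training label (343 in this dataset), so the logits match the target columns
model = BertNERModel(len(train_labels), 0.3, device).to(device)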
train(model, 5, train_dataloader, test_dataloader)

Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch 1


  0%|          | 0/910 [00:00<?, ?it/s]

before
[[4.4240275e-01 5.6041002e-01 9.8616574e-03 ... 2.9586404e-06
  2.8800699e-07 2.2511686e-06]
 [4.4240275e-01 5.6041002e-01 9.8616574e-03 ... 2.9586404e-06
  2.8800699e-07 2.2511686e-06]
 [4.4240275e-01 5.6041002e-01 9.8616574e-03 ... 2.9586404e-06
  2.8800699e-07 2.2511686e-06]
 ...
 [4.4240275e-01 5.6041002e-01 9.8616574e-03 ... 2.9586404e-06
  2.8800699e-07 2.2511686e-06]
 [4.4240275e-01 5.6041002e-01 9.8616574e-03 ... 2.9586404e-06
  2.8800699e-07 2.2511686e-06]
 [4.4240275e-01 5.6041002e-01 9.8616574e-03 ... 2.9586404e-06
  2.8800699e-07 2.2511686e-06]]
after
[[1. 1. 0. ... 0. 0. 0.]
 [1. 1. 0. ... 0. 0. 0.]
 [1. 1. 0. ... 0. 0. 0.]
 ...
 [1. 1. 0. ... 0. 0. 0.]
 [1. 1. 0. ... 0. 0. 0.]
 [1. 1. 0. ... 0. 0. 0.]]
Evaluation on train - micro f1: 0.274 
Evaluation on train - loss:0.061
before
[[4.4240275e-01 5.6041002e-01 9.8616574e-03 ... 2.9586404e-06
  2.8800699e-07 2.2511686e-06]
 [4.4240275e-01 5.6041002e-01 9.8616574e-03 ... 2.9586404e-06
  2.8800699e-07 2.2511686e-06]
 [

  0%|          | 0/910 [00:00<?, ?it/s]

before
[[1.0332047e-02 5.8821690e-01 8.8676461e-04 ... 1.2302041e-15
  3.0831842e-08 1.3295425e-13]
 [1.0332047e-02 5.8821690e-01 8.8676461e-04 ... 1.2302041e-15
  3.0831842e-08 1.3295425e-13]
 [1.0332047e-02 5.8821690e-01 8.8676461e-04 ... 1.2302041e-15
  3.0831842e-08 1.3295425e-13]
 ...
 [1.0332047e-02 5.8821690e-01 8.8676461e-04 ... 1.2302041e-15
  3.0831842e-08 1.3295425e-13]
 [1.0332047e-02 5.8821690e-01 8.8676461e-04 ... 1.2302041e-15
  3.0831842e-08 1.3295425e-13]
 [1.0332047e-02 5.8821690e-01 8.8676461e-04 ... 1.2302041e-15
  3.0831842e-08 1.3295425e-13]]
after
[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
Evaluation on train - micro f1: 0.206 
Evaluation on train - loss:0.071
before
[[1.0332047e-02 5.8821690e-01 8.8676461e-04 ... 1.2302041e-15
  3.0831842e-08 1.3295425e-13]
 [1.0332047e-02 5.8821690e-01 8.8676461e-04 ... 1.2302041e-15
  3.0831842e-08 1.3295425e-13]
 [

  0%|          | 0/910 [00:00<?, ?it/s]

KeyboardInterrupt: ignored
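
In the output above every printed row of probabilities is identical, which suggests that the model predicts the same label distribution regardless of the input. One frequent cause of this when fine-tuning is a learning rate that is too large, although I have not verified that this is the only problem here. A quick way to detect such a collapse automatically is sketched below. The function `predictions_collapsed` is made up for this example and works on plain lists of probabilities.

```python
def predictions_collapsed(prob_rows, tol=1e-6):
    """Return True if every row equals the first row up to `tol`."""
    if len(prob_rows) < 2:
        return False
    first = prob_rows[0]
    return all(
        abs(a - b) <= tol
        for row in prob_rows[1:]
        for a, b in zip(first, row)
    )


collapsed = [[0.44, 0.56, 0.01], [0.44, 0.56, 0.01], [0.44, 0.56, 0.01]]
healthy = [[0.90, 0.10, 0.05], [0.20, 0.70, 0.30]]
print(predictions_collapsed(collapsed), predictions_collapsed(healthy))
```

Running such a check on a handful of batches after the first few hundred steps is cheaper than waiting for a full epoch.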