# Fine-tuning a Danish BERT

This tutorial walks you through how to fine-tune a BERT model for both sentence and token classification. We will start off by downloading [the Danish BERT model by botXO](https://github.com/botxo/nordic_bert). You should put the content of the downloaded .zip file into the folder `danish_bert_uncased_v2`, where the script `convert_bert_original_tf_checkpoint_to_pytorch.py` should already be located. If it is not, download it from [this GitHub repository](https://github.com/KennethEnevoldsen/tutorial_fine-tuning_danish_bert).

Once the content of the downloaded model is in that folder, run the following line from a terminal (inside that folder):

```bash
python convert_bert_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path model.ckpt --bert_config_file config.json --pytorch_dump_path pytorch_model.bin
```

Several versions of the model are available for download; we recommend choosing the newest one. Lastly, we will need to import the following packages. You might need to `pip install` some of these.

In [22]:
# native packages
import os

# widely used packages
import pandas as pd

# other packages
from simpletransformers.ner import NERModel
from simpletransformers.classification import ClassificationModel

from danlp.datasets import DDT
import pyconll

---

## Token Classification
Let's start with token classification. Token classification is the task of assigning a label to each token in a text, for example whether a token is an entity and what type of entity it is, e.g. person, organization or location. This particular task is typically called named-entity recognition. Other token classification tasks include part-of-speech tagging. For this example we will train a BERT model for named-entity recognition using the data tagged by DaNLP, which is derived from the Danish Dependency Treebank. We will start by loading the data and examining it.
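
To make the label format concrete, here is a minimal sketch of how token-level tags can be grouped back into entities. It assumes the tags follow the common BIO scheme (`B-` starts an entity, `I-` continues it, `O` is outside any entity), which the `B-ORG` and `B-LOC` labels in the data suggest. The helper `extract_entities` is hypothetical and not part of DaNLP or simpletransformers.

```python
def extract_entities(words, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # a new entity starts; close any entity that was still open
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # continuation of the currently open entity
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) closes the open entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# tokens and tags taken from the first sentence of the training data
words = ["På", "fredag", "har", "SID", "inviteret", "til", "reception", "i", "SID-huset"]
tags = ["O", "O", "O", "B-ORG", "O", "O", "O", "O", "B-LOC"]
print(extract_entities(words, tags))  # [('SID', 'ORG'), ('SID-huset', 'LOC')]
```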

In [6]:
# Loading the Danish Dependency Treebank data
ddt = DDT()
conllu_format = ddt.load_as_conllu(predefined_splits=True)

# for every split, collect one (sentence number, word, tag) tuple per token
data = []
for split in conllu_format:
    data.append([(i, token.form, token.misc.get("name").pop())
                 for i, sent in enumerate(split) for token in sent])

In [27]:
# this dataset contains a training dataset
train = pd.DataFrame(data[0], columns = ['sentence_id', 'words', 'labels']) # note that the names of the columns are important for the model
# a development test dataset
test = pd.DataFrame(data[1], columns = ['sentence_id', 'words', 'labels'])
# and lastly a validation dataset
validation = pd.DataFrame(data[2], columns = ['sentence_id', 'words', 'labels'])

# examining the first ten rows we see some of the structure of the data
train.head(10)

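|   | sentence_id | words | labels |
|---|---|---|---|
| 0 | 0 | På | O |
| 1 | 0 | fredag | O |
| 2 | 0 | har | O |
| 3 | 0 | SID | B-ORG |
| 4 | 0 | inviteret | O |
| 5 | 0 | til | O |
| 6 | 0 | reception | O |
| 7 | 0 | i | O |
| 8 | 0 | SID-huset | B-LOC |
| 9 | 0 | i | O |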


Okay, so now we are ready to train the model. Beware that this process might take some time, so it might be a good idea to use only some of the data.
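
One way to use only some of the data is to sample whole sentences rather than individual rows, so that no sentence is cut in half. The following is a rough sketch under that assumption; the helper `subsample_sentences` and the toy rows are hypothetical and only mirror the column names used above.

```python
import pandas as pd


def subsample_sentences(df, n_sentences, seed=42):
    """Keep all rows belonging to a random sample of `n_sentences` sentences."""
    sentence_ids = df["sentence_id"].drop_duplicates()
    n = min(n_sentences, len(sentence_ids))
    keep = sentence_ids.sample(n=n, random_state=seed)
    return df[df["sentence_id"].isin(keep)].reset_index(drop=True)


# toy data in the same format as `train` (made-up rows for illustration)
toy = pd.DataFrame({
    "sentence_id": [0, 0, 1, 1, 2, 2],
    "words": ["SID", "inviterer", "Han", "sover", "Vi", "ses"],
    "labels": ["B-ORG", "O", "O", "O", "O", "O"],
})
small = subsample_sentences(toy, n_sentences=2)
print(small["sentence_id"].nunique())  # 2
```

With the real data this could be called as `subsample_sentences(train, n_sentences=500)`, where 500 is an arbitrary number chosen for illustration.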

In [30]:
# get list of unique labels
unique_labels = list(train['labels'].unique())

# we will need to rename the config file from bert_config.json to config.json
# os.rename('danish_bert_uncased_v2/bert_config.json', 'danish_bert_uncased_v2/config.json')

# preparing the model
model = NERModel('bert', model_name = 'danish_bert_uncased_v2/', labels=unique_labels, use_cuda=False, args={'overwrite_output_dir': True, 'reprocess_input_data': True})

I0323 21:45:40.435643 4657845696 configuration_utils.py:252] loading configuration file danish_bert_uncased_v2/config.json
I0323 21:45:40.437238 4657845696 configuration_utils.py:290] Model config BertConfig {
  "architectures": null,
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "directionality": "bidi",
  "do_sample": false,
  "eos_token_ids": 0,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_labels": 9,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": t

In [2]:
# Training the model
model.train_model(train)

## Sentence Classification
Sentence classification is the task of assigning a label to a whole sentence. This could be classifying the topic of a sentence, or classifying whether a sentence is positive or negative. In this case we will try to predict the score of a Trustpilot review based on the text of the review. The dataset used for this is available in the [GitHub repository](https://github.com/KennethEnevoldsen/tutorial_fine-tuning_danish_bert).


In [19]:
tp = pd.read_csv("trustpilot.csv")
tp.columns = ['text', 'labels'] # rename the columns - note that the new column names are important for the model

tp['text'] = tp['text'].astype('str')
tp['labels'] = tp['labels'] - 1 # shift the labels so that they start at zero
tp.head(10)

|   | text | labels |
|---|---|---|
| 0 | bestilte airpods der var på lager i følge over... | 0 |
| 1 | bøvl bøvl bøvl | 0 |
| 2 | dårlig oplevelse | 0 |
| 3 | bestiller en varer som er på lager og vil bli... | 0 |
| 4 | modtog ikke min pakke | 0 |
| 5 | der kom ikke alt hvad jeg havde bestilt | 0 |
| 6 | bestilte en dvd der var markeret med levering... | 0 |
| 7 | jeg bestilte et dykkersæt til børn mellem 5 og... | 0 |
| 8 | jeg fik ingen varer på hjemmesiden fandt jeg ... | 0 |
| 9 | virkelig dårlig service jeg havde bestilt en ... | 0 |
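
The first ten rows all have the label 0, which suggests that the file may be ordered by score. If you want to hold out part of the data for evaluation, it is therefore safer to shuffle before splitting. Below is a minimal sketch of such a split; the helper `train_eval_split`, the 20% evaluation fraction and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def train_eval_split(df, eval_fraction=0.2, seed=42):
    """Shuffle the rows and split them into a training and an evaluation part."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled.iloc[n_eval:], shuffled.iloc[:n_eval]


# toy data in the same format as `tp` (made-up rows for illustration)
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "labels": [i % 5 for i in range(10)],
})
train_df, eval_df = train_eval_split(toy)
print(len(train_df), len(eval_df))  # 8 2
```

The training part could then be passed to `train_model` in place of the full data frame.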


In [27]:
# number of unique labels
n_labels = len(tp['labels'].unique())

# initialize the model
sent_model = ClassificationModel('bert', 'danish_bert_uncased_v2/', num_labels=n_labels, use_cuda=False, args={'reprocess_input_data': True, 'overwrite_output_dir': True})

I0323 22:12:02.215758 4678493632 configuration_utils.py:252] loading configuration file danish_bert_uncased_v2/config.json
I0323 22:12:02.217087 4678493632 configuration_utils.py:290] Model config BertConfig {
  "architectures": null,
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "directionality": "bidi",
  "do_sample": false,
  "eos_token_ids": 0,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_labels": 5,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": t

In [31]:
# train the model
sent_model.train_model(tp)

## Conclusion
That is it, you have now trained two BERT models for Danish natural language processing 🥳! To use the models, simply call `model.predict()` or `sent_model.predict()`.
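
As far as we know, `NERModel.predict()` takes a list of sentences and returns the predictions together with the raw model outputs, where each predicted sentence is a list of single-entry `{word: label}` dictionaries. Assuming that format, the sketch below flattens the predictions into `(word, label)` pairs. The helper `flatten_predictions` is hypothetical, and the example input is hand-written rather than real model output.

```python
def flatten_predictions(predictions):
    """Turn [[{word: label}, ...], ...] into [[(word, label), ...], ...]."""
    return [
        [(word, label) for token in sentence for word, label in token.items()]
        for sentence in predictions
    ]


# hand-written example in the assumed output format (not real model output)
example = [[{"SID": "B-ORG"}, {"holder": "O"}, {"reception": "O"}]]
print(flatten_predictions(example))
# [[('SID', 'B-ORG'), ('holder', 'O'), ('reception', 'O')]]

# with a trained model the input would come from something like:
# predictions, raw_outputs = model.predict(["SID holder reception"])
```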

This tutorial was made by [L. Hansen](https://github.com/HLasse), [M. Bertelsen](https://github.com/MalteHB) and [K. Enevoldsen](https://github.com/KennethEnevoldsen). Feel free to ask any questions in the GitHub issues.
