<a href="https://colab.research.google.com/github/JanIvarMoldekleiv/AiDataSet/blob/main/Skjerming_fra_journalposttittel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Klassifisering av journalposttitler som skjermet eller uskjermet


I denne Jupyter-notatboken ønsker vi å se om vi kan benytte [BERT](https://arxiv.org/abs/1810.04805) til å trene opp en ai-modell for å avgjøre om en journalpost skal skjermes basert på journalpostittelen. 

Nasjoalbibiloteket i Norge har trent flere norsk modeller, og vi kan her veksle mellom modellene som de har laget. Som default benytter vi [NB-BERTbase Model](https://github.com/NBAiLab/notram). 

Vi fintuner modellen på data hentet fra [eInnsyn](https://www.einnsyn.no). 15000 journalposttitler samt skjermingsstatus er hentet ut og lagret i csv-format. Testdata er lagret i et [Github-Repo](https://github.com/JanIvarMoldekleiv/AiDataSet). 


Notatboken er bare testet i Google Colab. Koden er primært hentet fra NbAiLab - [How to finetune a classification model (advanced)](https://colab.research.google.com/gist/peregilk/3c5e838f365ab76523ba82ac595e2fcc/nbailab-finetuning-and-evaluating-a-bert-model-for-classification.ipynb). Det er gjort endringer for å laste andre datasett, samt eksplisitt definert loss-funksjon. 

I første steg installerer vi nødvendig programvare for BERT. 


In [3]:
!pip install transformers

import pandas as pd
import numpy as np
import tensorflow as tf
import json
import math
import os
from transformers import BertTokenizer, AutoConfig, TFAutoModelForSequenceClassification, optimization_tf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.0-py3-none-any.whl (4.7 MB)
[K     |████████████████████████████████| 4.7 MB 14.7 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 83.1 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
[K     |████████████████████████████████| 101 kB 13.0 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 63.9 MB/s 
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transfo

# Innstillinger
Default-innstillinger skal gi et ok resultat. Hvis modellen ikke konvergerer kan antallet epochs økes. 

In [17]:
#@markdown Set the main model that the training should start from
model_name = 'NbAiLab/nb-bert-base' #@param ["NbAiLab/nb-bert-base", "NbAiLab/nb-bert-large", "bert-base-multilingual-cased"]
#@markdown ---
#@markdown Set training parameters
batch_size =  16#@param {type: "integer"} 
init_lr = 2e-5 #@param {type: "number"}
end_lr = 0  #@param {type: "number"}
warmup_proportion = 0.1#@param {type: "number"}
num_epochs =   5#@param {type: "integer"}

#You might increase this for bert-base
max_seq_length = 128 

# Import og forberedelse av datasett
Datasettet lastes direkte fra github. Hvis du vil bytte ut datasettet må det være i csv-format, der første kolonne er en integer. 0 for False, 1 for True. 

Datasettet blir deretter tokenisert og det blir opprettet et datasett for tensorflow. 

In [18]:
train_data = pd.read_csv(
    'https://raw.githubusercontent.com/JanIvarMoldekleiv/AiDataSet/main/train.csv',
    names=["label", "text"]
)
dev_data = pd.read_csv(
    'https://raw.githubusercontent.com/JanIvarMoldekleiv/AiDataSet/main/dev.csv',
    names=["label", "text"]
)
test_data = pd.read_csv(
    'https://raw.githubusercontent.com/JanIvarMoldekleiv/AiDataSet/main/test.csv',
    names=["label", "text"]
)

from transformers import AutoTokenizer
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Turn text into tokens
train_encodings = tokenizer(
    train_data["text"].tolist(), truncation=True, padding=True, max_length=max_seq_length
)
dev_encodings = tokenizer(
    dev_data["text"].tolist(), truncation=True, padding=True, max_length=max_seq_length
)
test_encodings = tokenizer(
    test_data["text"].tolist(), truncation=True, padding=True, max_length=max_seq_length
)

# Create a tensorflow dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings), train_data["label"].tolist()
)).shuffle(1000).batch(batch_size)
dev_dataset = tf.data.Dataset.from_tensor_slices((
    dict(dev_encodings), dev_data["label"].tolist()
)).batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings), test_data["label"].tolist()
)).batch(batch_size)


print(f'The dataset is imported.\n\nThe training dataset has {len(train_data)} items.\nThe development dataset has {len(dev_data)} items. \nThe test dataset has {len(test_data)} items')
steps = math.ceil(len(train_data)/batch_size)
num_warmup_steps = round(steps*warmup_proportion*num_epochs)
print(f'You are planning to train for a total of {steps} steps * {num_epochs} epochs = {num_epochs*steps} steps. Warmup is {num_warmup_steps}, {round(100*num_warmup_steps/(steps*num_epochs))}%. We recommend at least 10%.')


Downloading tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/972k [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

The dataset is imported.

The training dataset has 10999 items.
The development dataset has 2346 items. 
The test dataset has 1105 items
You are planning to train for a total of 688 steps * 5 epochs = 3440 steps. Warmup is 344, 10%. We recommend at least 10%.


# Trening av datasettet
Her har jeg knabbet koden direkte fra NoTram-prosjektet[[1]](#1).
Vi bruker Tensorflow gjennom interfacet Huggingface. 
For eksempeldatasettet på totalt 15000 journalposter, og der 11000 journalposter er med i selve treningsgrunnlaget, tar en epoch 11 minutter - og dermed tar en kjøring ca 1 time på Colab Pro med GPU-backing. 

## References
<a id="1">[1]</a>  
Kummervold et.al (2021).
Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model.
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) 2021



In [19]:
# Estimate the number of training steps
train_steps_per_epoch = int(len(train_dataset)/batch_size)
num_train_steps = train_steps_per_epoch * num_epochs

# Initialise a Model for Sequence Classification with 2 labels
config = AutoConfig.from_pretrained(model_name, num_labels=2)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy()

# Creating a scheduler gives us a bit more control
optimizer, lr_schedule = optimization_tf.create_optimizer(
    init_lr=init_lr,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps
)
# Compile the model
model.compile(optimizer=optimizer, loss=loss,metrics=metrics) # can also use any keras loss fn

# Start training
history = model.fit(train_dataset, validation_data=dev_dataset, epochs=num_epochs, batch_size=batch_size)

print(f'\nThe training has finished training after {num_epochs} epochs.')



Downloading tf_model.h5:   0%|          | 0.00/1.01G [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at NbAiLab/nb-bert-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

The training has finished training after 5 epochs.


## Lagring av modellen
Koden under viser hvordan modellen kan lagres og lastes igjen. Kommentert ut for å unngå lagring når det ikke er behov. 

In [20]:
# Save the model
model.save_weights("/content/mymodel_base.h5")

# Load the saved model
#config = AutoConfig.from_pretrained(model_name, num_labels=2)
#model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config)
#model.load_weights("/content/mymodel.h5")

# Validering av modellen med test-datasettet. 
Etter trening av modellen kan den kjøres mot test-datasettet. Testdatasettet er ca 2000 journalposter hentet fra opprinnelig eInnsyn-dump. 

Her kalkulerer vi F1-score ved hjelp av [Sckikit](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). Score går fra 1 til 0 der 1 er best. 

Tabell hentet fra [What is a good F1 Score](https://stephenallwright.com/good-f1-score/)

|F1 score	|Fortolkning
---|---
|>0.9|	Very good|
|0.8 - 0.9|	Good|
|0.5 - 0.8	|OK|
|< 0.5	|Not good|

In [21]:
from sklearn.metrics import classification_report

print("Evaluate test set")
y_pred = model.predict(test_dataset)
y_pred_bool = y_pred["logits"].argmax(-1)
print(classification_report(test_data["label"], y_pred_bool, digits=4))

Evaluate test set
              precision    recall  f1-score   support

           0     0.8024    0.8590    0.8297       312
           1     0.9429    0.9168    0.9297       793

    accuracy                         0.9005      1105
   macro avg     0.8727    0.8879    0.8797      1105
weighted avg     0.9033    0.9005    0.9014      1105

