# Swedish BERT models: SweBERT masked token prediction

***

## Introduction

This example checks that a pre-trained SweBERT model from Arbetsförmedlingen (the Swedish Public Employment Service) is accessible, loads it, and then performs a simple word prediction task: one word is removed from a sample sentence and the model predicts which word should be in its place.

THIS NOTEBOOK IS BASED ON THE ORIGINAL NOTEBOOK PUBLISHED BY AF-AI at https://github.com/af-ai-center/SweBERT.git

We have modified the prediction example at the end to use TensorFlow instead of PyTorch.

#### Note: Make sure to run this notebook in a virtual environment with the required packages (see README) installed

***

# Setup

In [69]:
import torch
import tensorflow as tf
from transformers import BertTokenizer, BertModel, TFBertModel, BertForMaskedLM 
from tokenizers import BertWordPieceTokenizer

import warnings; warnings.filterwarnings('ignore')

### Choose SweBERT model

We have to choose one of the available pretrained SweBERT models. For demonstration purposes, the base model is sufficient:

In [70]:
pretrained_model_name = 'af-ai-center/bert-base-swedish-uncased'
# pretrained_model_name = 'af-ai-center/bert-large-swedish-uncased'

### Check SweBERT Model Accessibility

First, we are going to check that the chosen pretrained SweBERT model is accessible through the transformers library.
If it is, we should be able to instantiate a tokenizer and a (PyTorch/TensorFlow) model from it. 

Note that this may take a while the first time you run it as the model needs to be downloaded. 

### a. Load a tokenizer

In [71]:
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)

### b. Load a PyTorch model

In [72]:
model = BertModel.from_pretrained(pretrained_model_name)

### c. (Load a TensorFlow model)

In [None]:
model = TFBertModel.from_pretrained(pretrained_model_name)

# Masked Word Completion with SweBERT

We are now going to apply the (PyTorch) SweBERT model to an example sentence, loosely following https://huggingface.co/transformers/quickstart.html#quick-tour-usage. To make BERT work with text strings, we first have to prepare the text, then mask one of the tokens (here, a whole word), and finally use SweBERT to predict the masked token.

We will:
1. Tokenize the example using BertTokenizer.
2. Tokenize the example using BertWordPieceTokenizer.
3. Mask one of the tokens.
4. Use SweBERT to predict back the masked token.

In [73]:
example = 'Av alla städer i världen, är du den stad som fått allt.'
example

'Av alla städer i världen, är du den stad som fått allt.'

### 1. Tokenize the example using BertTokenizer

The pretrained SweBERT models are uncased. 

In principle, we could account for this by instantiating the BertTokenizer (https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) with the parameter `do_lower_case=True`.
However, with that setting the BertTokenizer does not handle the Swedish letters `å`, `ä` and `ö` properly (they get replaced by `a` and `o`).

To avoid this problem, we instantiate `bert_tokenizer` with `do_lower_case=False` and convert the text to lowercase ourselves before tokenizing.

In [74]:
bert_tokenizer = BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)
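
The loss of `å`, `ä` and `ö` described above can be reproduced without any model. The following is a minimal sketch, assuming the replacement comes from Unicode accent stripping (NFD decomposition followed by removal of combining marks). The notebook does not state the mechanism, and the helper `strip_accents` is a hypothetical name used only for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


sample = "Av alla städer i världen, är du den stad som fått allt."

# Plain lowercasing keeps the Swedish letters intact.
lowered = sample.lower()

# Lowercasing plus accent stripping (the assumed mechanism) loses them.
stripped = strip_accents(lowered)

print(lowered)
print(stripped)
```

Comparing the two printed lines shows why the notebook lowercases the text manually: `str.lower()` alone preserves the letters that accent stripping would fold into `a` and `o`.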

#### a. lowercase 

In [75]:
example_uncased = example.lower()
example_uncased

'av alla städer i världen, är du den stad som fått allt.'

#### b. add special tokens 

The input to BERT models needs special tokens that mark the beginning and the end of a string, '[CLS]' and '[SEP]':

In [76]:
example_preprocessed = f'[CLS] {example_uncased} [SEP]'
example_preprocessed

'[CLS] av alla städer i världen, är du den stad som fått allt. [SEP]'

#### c. tokenize

Now we will tokenize the text, i.e. split it into individual tokens (words, punctuation marks and the special tokens).

In [77]:
tokens = bert_tokenizer.tokenize(example_preprocessed)

print(f'{len(tokens)} tokens:')
print(tokens)

16 tokens:
['[CLS]', 'av', 'alla', 'städer', 'i', 'världen', ',', 'är', 'du', 'den', 'stad', 'som', 'fått', 'allt', '.', '[SEP]']
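
In this sentence every word happens to be a single vocabulary entry. For words that are not in the vocabulary, WordPiece tokenizers fall back to subword pieces. The sketch below illustrates the greedy longest-match-first idea on a made-up vocabulary. Both `wordpiece` and `toy_vocab` are hypothetical and do not reflect the actual SweBERT vocabulary or the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary, for illustration only.
toy_vocab = {"stad", "städer", "värld", "##en", "##s"}

print(wordpiece("städer", toy_vocab))   # ['städer']
print(wordpiece("världen", toy_vocab))  # ['värld', '##en']
print(wordpiece("stads", toy_vocab))    # ['stad', '##s']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```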


#### d. convert tokens to ids

In [78]:
indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokens)
print(indexed_tokens)

[101, 1101, 1186, 3548, 1045, 1596, 1010, 1100, 1153, 1108, 1767, 1099, 1302, 1223, 1012, 102]


### 2. Tokenize the example using BertWordPieceTokenizer

An alternative is to use the BertWordPieceTokenizer from the tokenizers library (https://github.com/huggingface/tokenizers).
It handles the special Swedish letters properly if the parameters `lowercase=True` & `strip_accents=False` are used. 

In [None]:
bert_word_piece_tokenizer = BertWordPieceTokenizer("vocab_swebert.txt", lowercase=True, strip_accents=False)

#### a. lowercase, b. add special tokens, c. tokenize & d. convert tokens to ids

In [None]:
output = bert_word_piece_tokenizer.encode(example)  # attributes: output.ids, output.tokens, output.offsets

In [None]:
tokens_2 = output.tokens

print(f'{len(tokens_2)} tokens:')
print(tokens_2)

In [None]:
indexed_tokens_2 = output.ids
print(indexed_tokens_2)

In [None]:
# check that BertTokenizer & BertWordPieceTokenizer lead to the same results
assert tokens == tokens_2
assert indexed_tokens == indexed_tokens_2

### 3. Mask one of the tokens

In [79]:
masked_index = 10  # 'stad'

In [81]:
tokens[masked_index] = '[MASK]'
print(tokens)

['[CLS]', 'av', 'alla', 'städer', 'i', 'världen', ',', 'är', 'du', 'den', '[MASK]', 'som', 'fått', 'allt', '.', '[SEP]']


In [82]:
# Mask token with BertTokenizer
indexed_tokens[masked_index] = bert_tokenizer.convert_tokens_to_ids('[MASK]')
print(indexed_tokens)

[101, 1101, 1186, 3548, 1045, 1596, 1010, 1100, 1153, 1108, 103, 1099, 1302, 1223, 1012, 102]


In [None]:
# Mask token with BertWordPieceTokenizer
indexed_tokens[masked_index] = bert_word_piece_tokenizer.token_to_id('[MASK]')
print(indexed_tokens)
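
Masking the token list and masking the id list are two views of the same operation. The sketch below checks this with a lookup table built only from the tokens and ids printed earlier in this notebook. It covers just this one sentence and is not the real vocabulary; the names `sentence_tokens`, `sentence_ids` and `token_to_id` are introduced here for illustration.

```python
# Tokens and ids copied from the outputs printed earlier in this notebook.
sentence_tokens = ['[CLS]', 'av', 'alla', 'städer', 'i', 'världen', ',', 'är',
                   'du', 'den', 'stad', 'som', 'fått', 'allt', '.', '[SEP]']
sentence_ids = [101, 1101, 1186, 3548, 1045, 1596, 1010, 1100,
                1153, 1108, 1767, 1099, 1302, 1223, 1012, 102]

# Sentence-local lookup table (not the full vocabulary).
token_to_id = dict(zip(sentence_tokens, sentence_ids))
token_to_id['[MASK]'] = 103  # id that appears at the masked position in the printed output

# Mask on the token level, then convert to ids.
position = sentence_tokens.index('stad')
masked_tokens = list(sentence_tokens)
masked_tokens[position] = '[MASK]'
ids_from_masked_tokens = [token_to_id[t] for t in masked_tokens]

# Mask directly on the id level.
ids_masked_in_place = list(sentence_ids)
ids_masked_in_place[position] = token_to_id['[MASK]']

print(position)
print(ids_from_masked_tokens == ids_masked_in_place)
```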

### 4. Use SweBERT to predict back the masked token

In [83]:
# instantiate model
model = BertForMaskedLM.from_pretrained(pretrained_model_name)
_ = model.eval()

Some weights of the model checkpoint at af-ai-center/bert-base-swedish-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [84]:
# predict all tokens
with torch.no_grad():
    outputs = model(torch.tensor([indexed_tokens]))

predictions = outputs[0]
print(predictions.shape)  # 1 example, 16 tokens, 30522 vocab size

torch.Size([1, 16, 30522])


In [85]:
# show prediction for masked token's index
predicted_index = torch.argmax(predictions[0, masked_index])
print(predicted_index)

tensor(1767)


In [86]:
# show prediction for masked token
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)

stad


Making this prediction was our goal. Now let's confirm that the predicted word is the same as the word we masked in the beginning.

In [None]:
assert predicted_token == 'stad'

# Conclusions

- We have checked the accessibility of the SweBERT models through the transformers library. 
- We have demonstrated a very simple model application, where the SweBERT model successfully predicts a masked token.

For additional use cases and information, we refer to the documentation of the transformers library. 

## Next step
Now that we have a trained model, we will create a model in STACKn and then deploy it there to give it an endpoint.

## Appendix

Now the big question: was the SweBERT model simply trained on all of Lasse Berghagen's lyrics, so that we have just produced a travesty of 'Stockholm i mitt hjärta', or does the model evaluate potential substitutions for the masked word based on its training corpus of Swedish language texts?

Instead of looking only at the top model prediction, let's have a look at the top 5 predictions.

In [88]:
# show top5 predictions for masked token's index
predicted_index_top5 = torch.argsort(predictions[0, masked_index], descending=True)[:5]
predicted_index_top5

tensor([1767, 1532, 4394, 1192, 1630])

In [89]:
# show top5 predictions for masked token
for predicted_index in predicted_index_top5:
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
    print(predicted_token)

stad
enda
ort
första
person


Of these suggestions, the first three are actually fully reasonable, but both 'enda' and 'ort' would give a lyric very far in style and meaning from the original text.
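
The ranking above uses the raw scores that the model returns for each vocabulary entry. To judge how strongly the model prefers one candidate over another, these scores can be converted into probabilities with a softmax. The following is a minimal sketch in plain Python: the values in `toy_logits` are made up and are not the scores SweBERT produced, and `toy_tokens`, `softmax` and `top_k` are hypothetical names used only for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k):
    """Return (index, probability) pairs for the k highest scores, best first."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(i, probs[i]) for i in order]


# Made-up scores over a six-word toy vocabulary, for illustration only.
toy_tokens = ['stad', 'enda', 'ort', 'första', 'person', 'bil']
toy_logits = [4.0, 2.5, 2.0, 1.5, 1.0, -1.0]

for index, prob in top_k(toy_logits, 5):
    print(f'{toy_tokens[index]}: {prob:.3f}')
```

Because the softmax preserves the order of the scores, the ranking is the same as with `torch.argsort`; only the scale changes, which makes the gap between the first and the following candidates easier to interpret.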