# BERT

In today's meeting we're going to focus on BERT, a model that transforms our tokens into high-quality contextual word embeddings!

We will use the `transformers` library for that purpose, so let's install it first:

In [8]:
# !pip install transformers


## Task 1: Tokenization using WordPiece

Below you can find a short fragment of code that downloads a `bert-base` model and instantiates the appropriate tokenizer for it. We use the `uncased` version, which means that the model's training data was lowercased during preprocessing. You can learn about this particular model here: https://huggingface.co/bert-base-uncased (I encourage you to read the description! Many models hosted on the Hugging Face website have great documentation, and it's always worth checking it).

Then, in line 4, we define a text to be tokenized, run the tokenizer in line 5, and pass the tokenized text as the input to the BERT model, which is called in line 6.

Please run the code below. We generated BERT embeddings using 6 lines of code! ;)

In [1]:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)



Running the tokenizer by simply calling the tokenizer object (`tokenizer(text, return_tensors='pt')`) returns a dictionary-like object containing 3 keys: `input_ids`, `token_type_ids`, and `attention_mask`. For now, let's focus on `input_ids` only.

You may visit the website: https://huggingface.co/docs/transformers/glossary to learn more about the role of `token_type_ids` and `attention_mask`.
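
As a rough illustration of what `attention_mask` is for, the sketch below pads two id sequences of different lengths to a common length and builds the matching mask by hand. The helper name `pad_batch` and the two short sequences are assumptions made up for this example; the pad id 0 corresponds to the `[PAD]` token of `bert-base-uncased`. In practice the tokenizer can do this for you when you pass it several texts with `padding=True`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad id sequences to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Illustrative batch: two short sequences wrapped in [CLS] (101) and [SEP] (102)
batch = [
    [101, 5672, 2033, 102],
    [101, 3793, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The shorter sequence receives one padding id at the end, and the mask has a 0 at exactly that position.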


`input_ids` is a list of lists (represented as a 2-D tensor). The outer list collects documents, while each inner list collects the tokens of one document. Here we processed only one document (a sentence), so the outer list contains only one inner list.

Each inner list contains a sequence of identifiers. These are the positions of the tokens in the vocabulary. They can be used to build one-hot encodings of our tokens: if we know the size of the vocabulary and the position of a given token in it, we can create a vector of vocabulary length filled with zeros and then set the entry at the token's position to 1.
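
The one-hot construction described above can be sketched in a few lines. This is only an illustration on a toy vocabulary of 6 entries; the function name `one_hot` and the example ids are made up for this sketch.

```python
def one_hot(token_id, vocab_size):
    """Return a list of zeros of length vocab_size with a single 1 at token_id."""
    if not 0 <= token_id < vocab_size:
        raise ValueError("token_id must lie inside the vocabulary")
    vector = [0] * vocab_size
    vector[token_id] = 1
    return vector


# Toy vocabulary of 6 entries (real BERT vocabularies are much larger)
toy_ids = [1, 4, 2]
for token_id in toy_ids:
    print(token_id, one_hot(token_id, vocab_size=6))
```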

To produce the ids, we have to tokenize our text first. 

Moreover, those identifiers require less memory than storing tokens as strings!

Run the code below to see the tokenizer's output. Please note that the number of token identifiers generated by the tokenizer is not equal to the number of words in the text. We'll see why in a minute.


In [2]:
encoded_input

{'input_ids': tensor([[ 101, 5672, 2033, 2011, 2151, 3793, 2017, 1005, 1040, 2066, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

The ids generated by the tokenizer are not human-readable. However, the tokenizer contains a mapping that relates known tokens to their positions, so we can use it to map ids back to tokens!

The first line of code obtains the input ids defined for the first sequence provided. 
Then, we call `convert_ids_to_tokens` to transform those ids to tokens. Please run the code and analyze the output.

In [3]:
first_sentence_ids = encoded_input['input_ids'][0]
tokenizer.convert_ids_to_tokens(first_sentence_ids)

['[CLS]',
 'replace',
 'me',
 'by',
 'any',
 'text',
 'you',
 "'",
 'd',
 'like',
 '.',
 '[SEP]']

Whoa, the tokenizer not only tokenized our text (split it into tokens) but also added the special tokens required by our model! Nice. Please note that after tokenization there are no capital letters in our tokens. This is because we use the `bert-base-uncased` model: since it was trained on lowercased data, the tokenizer lowercases its input as well.
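
To make the id-to-token mapping less magical, here is a minimal sketch with a hand-made vocabulary. The vocabulary, the plain whitespace splitting, and the function names `toy_encode` and `toy_decode` are assumptions for illustration only; the real tokenizer uses a much larger vocabulary and a more careful splitting procedure.

```python
# Hand-made vocabulary: a token's id is simply its position in the list
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "replace", "me", "by", "any", "text"]
token_to_id = {token: idx for idx, token in enumerate(toy_vocab)}


def toy_encode(text):
    """Lowercase, split on whitespace, wrap in [CLS]/[SEP], map tokens to ids."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    unk = token_to_id["[UNK]"]
    return [token_to_id.get(token, unk) for token in tokens]


def toy_decode(ids):
    """Map ids back to tokens by indexing into the vocabulary."""
    return [toy_vocab[i] for i in ids]


ids = toy_encode("Replace me by any text")
print(ids)
print(toy_decode(ids))
```

Words that are missing from the toy vocabulary fall back to `[UNK]` here, which is cruder than what BERT's tokenizer does with unknown words.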


## Subword units

However, the generated `input_ids` may be even longer relative to the number of words in our text. Sometimes the vocabulary doesn't contain a given word as a whole. BERT uses WordPiece tokenization, so in such cases the tokenizer tries to split the word into smaller fragments that are stored in the vocabulary. Let's tokenize a document containing rare words and see the result.

In [4]:
tokenizer_output = tokenizer(['NVIDIA DGX A100'])

input_ids = tokenizer_output['input_ids'][0]
input_ids

[101, 1050, 17258, 2401, 1040, 2290, 2595, 17350, 8889, 102]

Whoa, we see that as a result of tokenization, we obtained many tokens. Let's see what they represent: 

In [5]:
tokenizer.convert_ids_to_tokens(input_ids)

['[CLS]', 'n', '##vid', '##ia', 'd', '##g', '##x', 'a1', '##00', '[SEP]']

Ah, as we can see, the vocabulary of `bert-base-uncased` doesn't contain tokens such as "nvidia", "dgx", or "a100". That's why they are split into subword units.

Each time a given subword unit starts with a double hash (##), we know that this subword unit is a continuation of the previous token. 
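
The splitting itself can be sketched as a greedy longest-match-first search, which is how BERT's WordPiece tokenization is usually described. The toy vocabulary below is hand-picked so that the example reproduces the pieces shown above, and `wordpiece_split` is a hypothetical helper written for this sketch, not the library's implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subword units."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hand-picked toy vocabulary (an assumption for this example)
toy_vocab = {"n", "##vid", "##ia", "d", "##g", "##x", "a1", "##00", "text"}
for word in ["nvidia", "dgx", "a100", "text"]:
    print(word, wordpiece_split(word, toy_vocab))
```

A word that is in the vocabulary as a whole, such as "text" here, comes back as a single piece without any `##` marker.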

We can use this information to reconstruct the original text, by joining those subword units (sometimes called subtokens) into full tokens. We can achieve this goal using the following line of code:

In [6]:
tokenizer.decode(input_ids)

'[CLS] nvidia dgx a100 [SEP]'
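
Conceptually, the joining step only needs the `##` marker. The sketch below does it by hand on the tokens from the previous cell; `merge_subwords` is a made-up helper, and the real `decode` may additionally tidy up details such as spacing around punctuation, which this sketch ignores.

```python
def merge_subwords(tokens):
    """Join WordPiece tokens back into words using the ## continuation marker."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation onto the previous word
        else:
            words.append(token)
    return " ".join(words)


tokens = ['[CLS]', 'n', '##vid', '##ia', 'd', '##g', '##x', 'a1', '##00', '[SEP]']
print(merge_subwords(tokens))  # [CLS] nvidia dgx a100 [SEP]
```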

If we know how to map tokens to their positions in the vocabulary, the only missing part is to determine how long our one-hot vector should be (that is, how big our vocabulary is).

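The `transformers` library handles this step automatically. In practice the model looks up the row of its embedding matrix indexed by each id, which gives the same result as multiplying a one-hot vector by that matrix. You can check the size of the vocabulary easily using the following line of code: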

In [7]:
tokenizer.vocab_size

30522
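
In practice, models such as BERT typically do not build vectors of this length explicitly: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix, so a direct lookup by id is enough. The tiny embedding matrix and the helper names below are made up for this sketch.

```python
def one_hot(token_id, vocab_size):
    """One-hot vector as a list of floats."""
    vector = [0.0] * vocab_size
    vector[token_id] = 1.0
    return vector


def vector_times_matrix(vector, matrix):
    """Multiply a row vector (length V) by a V x D matrix, returning a length-D list."""
    n_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


# Made-up embedding matrix: vocabulary of 4 tokens, embedding size 3
embeddings = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

token_id = 2
via_one_hot = vector_times_matrix(one_hot(token_id, len(embeddings)), embeddings)
via_lookup = embeddings[token_id]
print(via_one_hot == via_lookup)
```

Both routes give the same embedding, but the lookup avoids creating a vector with one entry per vocabulary item for every token.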

## Task 2: Using a pre-trained BERT to generate features that may be used to solve a classification task.

As we discussed during the lecture, we can generate a fixed-length vector representing any input by taking the embedding produced for the `[CLS]` token (which serves as a representation of the whole sequence).
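
As a small sketch of this idea, assume the model returns one vector per token, arranged as (batch, sequence length, hidden size), with `[CLS]` at position 0 of every sequence. The numbers and the helper name `cls_features` are made up; with a tensor or NumPy array the same selection is usually written as `hidden_states[:, 0, :]`.

```python
# Made-up model output with shape (batch=2, seq_len=3, hidden=4);
# position 0 of every sequence corresponds to the [CLS] token
hidden_states = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]


def cls_features(batch_hidden_states):
    """Keep only the first token's vector of each sequence."""
    return [sequence[0] for sequence in batch_hidden_states]


features = cls_features(hidden_states)
# One vector of size `hidden` per input, regardless of the sequence length
print(len(features), len(features[0]))
```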

In this task, your goal is to use a pre-trained BERT model to obtain representations for a given dataset. Then, these representations will be used to train a logistic regression model. 

We will try to build a solution that can detect whether a given review is positive or not!

For that purpose we're going to use a BERT variant called `distilBERT`. We didn't cover it during the lecture because it relies on the concept of distillation, which will be very important to us, so I'll devote a separate lecture to it.

For now, we can treat `distilBERT` the same way as BERT. In fact, it is a compressed BERT model. It behaves the same way, but it was distilled to achieve similar quality with fewer parameters.

Please follow the tutorial you can find here: https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb (there is even a blogpost related to this tutorial: http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)


Just copy and paste fragments of code here and observe the results:

In [None]:
import numpy as np
import pandas as pd
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import torch

# Reading the dataset and limiting only to first 2 000 entries
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
batch_1 = df[:2000]

# Loading DistilBERT model and tokenizer
model_class, tokenizer_class, pretrained_weights = (DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

# Tokenizing
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

# Padding
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

# Masking
attention_mask = np.where(padded != 0, 1, 0)

# Running sentences through BERT
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# Output of DistilBERT will be input for Logistic Regression
features = last_hidden_states[0][:,0,:].numpy()

# Labels - whether sentence is positive or negative
labels = batch_1[1]

# Splitting data into train and test
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

# Loading and fitting Logistic Regression model
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)


### **Evaluation of our model** ###

In [13]:
print(f"Accuracy achieved by Logistic Regression: {lr_clf.score(test_features, test_labels)}")

Accuracy achieved by Logistic Regression: 0.806
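
For reference, the score printed above is plain classification accuracy: the fraction of test examples whose predicted label matches the true one. A minimal sketch with made-up labels follows; the `accuracy` helper is written for illustration and is not part of scikit-learn.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


true_labels = [1, 0, 1, 1, 0]        # made-up sentiment labels
predicted_labels = [1, 0, 0, 1, 0]   # made-up predictions
print(accuracy(true_labels, predicted_labels))  # 4 of 5 correct
```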


## Task 3 (optional): Fine-tuning BERT

In the example above, a BERT model (or more precisely its smaller family member, distilBERT) was used only to deliver vectors representing whole documents.

Now we would like to fine-tune an existing model. As we discussed during the lecture, we can achieve this by simply replacing the top layer of the network. Instead of solving the Masked Language Model and Next Sentence Prediction tasks, we add our own classification layer (also referred to as a classification head) and train the whole network to solve a given task.
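
As a rough sketch of what such a head computes, one common formulation is a single linear layer followed by a softmax; library implementations may add extra layers such as dropout. The symbols are notation chosen for this sketch: $h_{\mathrm{CLS}} \in \mathbb{R}^{H}$ is the final hidden vector of the `[CLS]` token, $K$ is the number of classes, $W \in \mathbb{R}^{K \times H}$ and $b \in \mathbb{R}^{K}$ are the newly added parameters, and $y$ is the true class of the example.

```latex
$$
p = \mathrm{softmax}\left(W h_{\mathrm{CLS}} + b\right),
\qquad
\mathcal{L} = -\log p_{y}
$$
```

During fine-tuning, the loss $\mathcal{L}$ is minimized with respect to both the new head parameters and the pre-trained BERT weights.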

There is a great and easy to follow tutorial on fine-tuning, available here: https://github.com/huggingface/notebooks/blob/main/transformers_doc/training.ipynb 

If you want, please follow that tutorial (you can copy-and-paste the code snippets from the notebook here).

In [17]:
### -------------------------------------------------------------
### a placeholder for code copied-and-pasted from the tutorial ;)
### -------------------------------------------------------------