## Assignment 7 - working with pre-trained BERT-based models

Today, we will work with a BERT variant from Hugging Face (https://huggingface.co/), specifically the TensorFlow version of ALBERT.

This notebook does **NOT** require a GPU to run so you can use it in your existing GCP instance.

This notebook requires the Hugging Face transformers library and the sentencepiece subword tokenization library. Make sure you pip install them in your instance or run the cell below.

In [None]:
!pip install sentencepiece
!pip install tensorflow
!pip install transformers

In [None]:
import numpy as np

import tensorflow as tf

import transformers

from transformers import AlbertTokenizer, TFAlbertModel

We are going to look at the embeddings produced by the pre-trained model and examine your understanding of how BERT-based models work.

Since we're only working with embeddings, will we need to create an output layer to make predictions?  No, we will not.  We can just use the raw outputs from ALBERT.

In [None]:
#Your tensorflow version should be 2.6.2
tf.__version__

In [None]:
#Your transformers version should be 4.15.0
transformers.__version__

In [None]:
test_sentence = "Children mark the inexorable march of time."

Let's start by tokenizing a sentence. All BERT-based models have their own tokenizers. These are built from the texts used in pre-training and are designed to minimize the number of 'UNK' tokens that will be encountered while capping the overall number of tokens in the vocabulary. This means that words are often broken into frequently occurring subwords. During inference, previously unseen words can be built out of the subwords.
    
ALBERT has its own tokenizer, and it must be used whenever you're working with an ALBERT model.

In [None]:
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')

Let's tokenize our input sentence to see how it gets broken up.

In [None]:
tokenizer.tokenize(test_sentence)

The prefix '▁' indicates a word boundary. This allows the original input string to be reconstructed from the tokens. Note that the word 'inexorable' has been broken into subwords, and only the initial subword has the prefix. Each token has an associated input embedding that gets passed into the model.
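
The reconstruction property described above can be sketched in plain Python. The token list here is hand-written for illustration (the subword split shown is an assumption, not actual tokenizer output), and `detokenize` is a hypothetical helper, not part of the transformers library.

```python
# Hypothetical helper: rebuild a string from SentencePiece-style tokens.
WORD_BOUNDARY = '\u2581'  # the '▁' prefix character

def detokenize(tokens):
    # Concatenate the pieces, then turn every boundary marker into a space.
    return ''.join(tokens).replace(WORD_BOUNDARY, ' ').strip()

# Hand-written example tokens; the real split of 'inexorable' may differ.
example_tokens = ['▁the', '▁in', 'ex', 'or', 'able', '▁march']
print(detokenize(example_tokens))
```

Because only word-initial pieces carry the prefix, joining the pieces and replacing the marker with a space is enough to recover the words. The tokenizer's own `convert_tokens_to_string` method should serve the same purpose in the notebook.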

Let's do a short exercise to get familiar with BERT-based models. BERT gives us *contextualized embeddings*, i.e. embeddings for the same word in different contexts should be different. Let's see if it's true!

Let's compare the context-based embedding vectors for '*mark*' in the following 6 sentences. To do so, we first need to tokenize the input, which is very straightforward with the appropriate Hugging Face tokenizer.

How do we deal with sentences of different lengths? The Hugging Face tokenizer accepts a padding argument that handles it for us. With `padding=True`, the tokenizer finds the longest sentence in the batch and pads the other sentences to that length.

In [None]:
albert_inputs = tokenizer(["Mark your calendars for the event",
                    "It ended with a question mark",
                    "Mark really enjoys teaching the W266 class",
                    "Mark left a mark on the wall",
                    "He left a mark on the professional literature",
                    "They prefer the ride in a Lincoln Mark IV" ],
                  padding=True,
                  return_tensors='tf')

albert_inputs

There are actually three outputs: the token ids for the input sentences (starting with the '[CLS]' token by default), the token_type_ids which are useful when one has separate segments, and the attention masks which are used to mask out padding tokens.
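
The relationship between padding and the attention mask can be sketched with NumPy. This is a minimal illustration, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the ids are made up, and `pad_id=0` is an assumption chosen for the sketch.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    # pad_id=0 is an assumption for this sketch; each tokenizer defines its own pad id.
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up ids, not real ALBERT vocabulary entries.
toy_ids, toy_mask = pad_batch([[11, 12, 13, 14], [21, 22]])
print(toy_ids)
print(toy_mask)
```

Every row ends up with the same length, and the mask records which positions hold real tokens so the model can ignore the rest.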

**Questions:**

1. Looking at the input_ids tensor, what is the integer id for the '[CLS]' token?
2. Looking at the input_ids tensor, what is the integer id for the word 'mark'?

Next, let's define a **Keras layer using the pre-trained ALBERT model** from Hugging Face. Make sure the model's name begins with 'TF' so we're using the TensorFlow version!

In [None]:
albert_layer = TFAlbertModel.from_pretrained('albert-base-v2')

Since we are using the model exactly as it was pre-trained (i.e. just using the embeddings that emerge from the top of the model), we can ignore the warning messages.

Let's get the ALBERT encoding for all of our test sentences using the Functional API approach: 

layer_output = layer(layer_input)

In [None]:
albert_outputs = albert_layer(albert_inputs)

print('shape of token outputs: \t\t', albert_outputs[0].shape)

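The first ALBERT output gives us the token-level embeddings. Let's define a function that shows the pairwise cosine similarities between a list of vectors. (It is named `cosine_distances`, but the value it prints is the cosine similarity, so values near 1 mean the vectors point in nearly the same direction.)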

In [None]:
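def cosine_distances(vecs):
    # Despite the name, this prints pairwise cosine *similarities* (1.0 = same direction).
    for v_1 in vecs:
        distances = ''
        for v_2 in vecs:
            # dot product divided by the product of the two vector norms
            similarity = np.dot(v_1, v_2) / np.sqrt(np.dot(v_1, v_1) * np.dot(v_2, v_2))
            distances += '\t' + f'{similarity:.2f}'
        print(distances)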
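
As an alternative sketch, the same pairwise computation can be vectorized with NumPy. `cosine_similarity_matrix` is a hypothetical helper introduced here for illustration, and the toy vectors are made up. It assumes every input can be converted with `np.asarray`, which should also hold for eager TensorFlow tensors.

```python
import numpy as np

def cosine_similarity_matrix(vecs):
    # Stack the inputs into an (n, d) array.
    mat = np.stack([np.asarray(v, dtype=np.float64) for v in vecs])
    # Scale each row to unit length; dot products of unit vectors are cosines.
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return unit @ unit.T

# Made-up 2-d vectors, not model outputs.
toy_vecs = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 3.0])]
sims = cosine_similarity_matrix(toy_vecs)
print(np.round(sims, 2))
print(np.round(1.0 - sims, 2))  # cosine distance = 1 - cosine similarity
```

Returning a matrix instead of printing makes it easy to inspect individual pairs or convert similarities to distances.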

Now, we designate the 'mark'-token positions in the *encoded* input and extract the proper components: 

In [None]:
mark_1 = albert_outputs[0][0, 1]
mark_2 = albert_outputs[0][1, 6]
mark_3 = albert_outputs[0][2, 1]
mark_4 = albert_outputs[0][3, 1]
mark_5 = albert_outputs[0][3, 4]
mark_6 = albert_outputs[0][4, 4]
mark_7 = albert_outputs[0][5, 9]

marks = [mark_1, mark_2, mark_3, mark_4, mark_5, mark_6, mark_7]
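
Hard-coded positions are easy to get wrong, so it can help to look them up from the token strings. The sketch below uses a hypothetical helper, `find_token_positions`, and a hand-written token list that is not actual tokenizer output.

```python
def find_token_positions(tokens, target):
    # Return every index at which the target token occurs.
    return [i for i, tok in enumerate(tokens) if tok == target]

# Hand-written token list for illustration; not actual tokenizer output.
toy_tokens = ['[CLS]', '▁mark', '▁left', '▁a', '▁mark', '▁on', '▁the', '▁wall', '[SEP]']
print(find_token_positions(toy_tokens, '▁mark'))  # → [1, 4]
```

In the notebook, the token strings for sentence `i` could presumably be obtained with `tokenizer.convert_ids_to_tokens(albert_inputs['input_ids'][i].numpy().tolist())` and passed to the helper to confirm each index.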

Print the pairwise cosine similarities in a table where both the rows and the columns are our seven mentions of the word 'mark', in the order listed above:

In [None]:
cosine_distances(marks)

Looks right! The embedding for the name 'Mark' in the fourth sentence, 'Mark left a mark on the wall', is similar to the embedding for the name in the third sentence but different from the embedding for the 'mark' on the wall.

**Questions:**

3. How are the embeddings contextualized by the model?

4. Which sentence has a 'mark' *least* similar to the name 'Mark' in sentence three?

