# Advanced Text Analytics Lab 2

This notebook is the second of two lab notebooks that you will submit as part of your assessment for the Advanced Data Analytics unit. The notebook contains three required sections plus an optional section:

4. **Introducing Transformers:** This section introduces the Transformers library from HuggingFace, showing you how to use it to obtain contextualised embeddings from pretrained transformer models.

5. **Question Answering with Pretrained Transformers:** Learn about how to use a pretrained model to perform automatic question answering. 

6. **Transformers for Text Classification:** Here we show you how to construct a classifier using Transformers.

7. **OPTIONAL: More on Transformers:** Some pointers to other materials if you want to learn more about transformers, e.g., if using them in your summer project. 

Example code for all the tasks has been tested on a three-year-old MacBook Pro, and the longest training process took under 10 minutes. If you find that the code takes too long to run on your own machine, you can try [Google Colab](https://colab.research.google.com/), Amazon Sagemaker Studio, or use lab machines on campus provided by the school. 

## Learning Outcomes

These sections will contain tutorial-like instructions, as you have seen in previous text analytics labs. On completing these sections, the intended learning outcomes are that you will be able to...
1. Use pretrained transformers to obtain contextualised word and sentence embeddings.
1. Apply a pretrained QA model to a new dataset. 
1. Construct classifiers with pretrained transformers. 
1. Find documentation on pretrained models in the Transformers library.

## Your Tasks

Inside each of these sections there are several **'To-do's**, which you must complete for your summative assessment. Your marks will be based on your answers to these to-dos. Please make sure to:
1. Include the output of your code in the saved notebook. Plots and printed output should be visible without re-running the code. 
1. Include all code needed to generate your answers.
1. Provide sufficient comments to understand how your method works.
1. Write text in a cell in markdown format where a written answer is required. You can convert a cell to markdown format by pressing Escape-M. 

There are also some unmarked 'to-do's that are part of the tutorial to help you learn how to implement and use the methods studied here.

## Good Academic Practice

Please follow [the guidance on academic integrity provided by the university](http://www.bristol.ac.uk/students/support/academic-advice/academic-integrity/).
You are required to write your own answers -- do not share your notebooks or copy someone else's writing. Do not copy text or long blocks of code directly into the notebook from online sources -- always rewrite in your own way. Breaking the rules can lead to strong penalties. 

## Marking Criteria

1. The coursework (both notebooks) is worth 30% of the unit in total. 
1. There is a total of 100 marks available for both lab notebooks. 
1. This notebook is worth 50 of those marks.
1. The number of marks for each to-do out of 100 is shown alongside each to-do.
1. For to-dos that require you to write code, a good solution would meet the following criteria (in order of importance):
   1. Solves the task or answers the question asked in the to-do. This means, if the code cells in the notebook are executed in order, we will get the output shown in your notebook.
   1. The code is easy to follow and does not contain unnecessary steps.
   1. The comments show that you understand how your solution works.
   1. A very good answer will also provide code that is computationally efficient but easy to read.
1. You can use any suitable publicly available libraries. Unless the task explicitly asks you to implement something from scratch, there is no penalty for using libraries to implement some steps.

## Support

The main source of support will be during the remaining lab sessions (Fridays 3-6pm) for this unit. 

The TAs and lecturer will help you with questions about the lectures, the code provided for you in this notebook, and general questions about the topics we cover. For the marked 'to-dos', they can only answer clarifying questions about what you have to do. 

Office hours: You can book office hours with Edwin on Mondays 3pm-5pm by sending him an email (edwin.simpson@bristol.ac.uk). If those times are not possible for you, please contact him by email to request an alternative. 

## Deadline

The notebook must be submitted along with the first notebook on Blackboard before **Wednesday 24th May at 13.00**. 

## Submission

You will need to zip up this notebook with the previous notebook into a single .zip file, which you will submit to Blackboard through the 'assessment, submission and feedback' link on the left sidebar. 

Please name your files like this:
   * Name this notebook ADA2_<student_number>.ipynb
   * Name the zip file <student_number>.zip
   * Please don't use your name anywhere as we want to mark anonymously. 

In [2]:
import numpy as np
import torch 
from datasets import load_dataset

cache_dir = "./data_cache"

# 4. Pretrained Transformers (max. 15 marks)

HuggingFace is a company that has developed an open source library for loading pretrained transformer models. They also distribute many models that have been pretrained using language modelling tasks, or fine-tuned to specific downstream NLP tasks. It is currently one of the most widely used libraries for building NLP models on top of large, deep neural networks. This is especially useful for tasks where simpler, feature-based methods or smaller LSTM models do not perform well enough, for example, when complex processing of syntax and semantics is required (natural language 'understanding'). 

Let's start by looking at two key types of object in the transformers library: models and tokenizers.

## 4.1. Models

The neural network models available in the Transformers library are accessed through wrapper classes such as `AutoModel`. If we want to load a pretrained model, we can simply pass its name to the `from_pretrained` function, and the pretrained model weights will be downloaded from HuggingFace and a neural network model will be created with those weights. For example:

In [3]:
from transformers import AutoModel # For BERTs

model = AutoModel.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'fit_denses.4.weight', 'fit_denses.0.bias', 'fit_denses.2.bias', 'fit_denses.4.bias', 'fit_denses.3.bias', 'fit_denses.1.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'fit_denses.0.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'fit_denses.1.weight', 'cls.predictions.transform.dense.bias', 'fit_denses.2.weight', 'cls.predictions.bias', 'fit_denses.3.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing

This code loads the TinyBERT model, which is a compressed version of BERT. This 4-layer variant has roughly 14.5 million parameters, compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While TinyBERT will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D).


The same functions can be used to load other models from HuggingFace's repository simply by changing the model's name. Take a look at [the Models page](https://huggingface.co/models) to see what there is on offer. Do you recognise any of the models' names?

## 4.2. Tokenizers

Before we can apply a model to some text, we need to create a Tokenizer object. In Transformers, Tokenizer objects convert raw text to a sequence of numbers. First, the tokenizer performs the actual tokenization, then it maps each token to its numerical ID. There are lots of different tokenizers that we can use to preprocess text. If we are loading a pretrained model, we will need to choose the tokenizer that corresponds to that model. 

We can load the right tokenizer as follows, in the same way we loaded the model itself:

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

Let's see what the TinyBERT tokenizer does to an example sentence:

In [5]:
sentence = "The transformer architecture has transformed the field of NLP."

tokens = tokenizer.tokenize(sentence)
print(tokens)

['the', 'transform', '##er', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'nl', '##p', '.']


Let's compare with the NLTK tokenizer we have seen before:

In [6]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)



['The', 'transformer', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'NLP', '.']


While NLTK keeps whole words as tokens, the BERT tokenizer splits some words into sub-words and inserts some special characters into the tokens. Splitting is applied to words with low frequency in the training set, such as 'transformer'. 
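
To make the sub-word splitting concrete, here is a minimal sketch of greedy longest-match-first splitting, which is the strategy used by WordPiece tokenizers for BERT-style models. The function `wordpiece_split` and the vocabulary `toy_vocab` are made up for illustration; the real TinyBERT tokenizer has a far larger vocabulary and also handles lowercasing and punctuation, which this sketch ignores.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into sub-word tokens by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # '##' marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (not the real TinyBERT vocabulary).
toy_vocab = {"the", "transform", "##er", "##ed", "architecture", "nl", "##p"}

split_results = {
    word: wordpiece_split(word, toy_vocab)
    for word in ["transformer", "nlp", "the", "xyz"]
}
for word, pieces in split_results.items():
    print(word, "->", pieces)
```

In this toy setting, "transformer" is covered by two known pieces, whereas "xyz" cannot be built from any piece and falls back to `[UNK]`.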

**TO-DO 4a:** What is the benefit of splitting rare words into sub-word tokens? **(2 marks)**

WRITE YOUR ANSWER HERE:

Because BERT is a pretrained model, its vocabulary was fixed when the model was first trained. Although the model was trained on a massive amount of data, the size of the vocabulary is still limited. Therefore, when a sentence is passed to the tokenizer, some of its words may not be in the vocabulary; these are known as OOV (out-of-vocabulary) items. The BERT tokenizer offers two ways of handling them: the first is to replace the OOV item with the special token [UNK], and the second is to split the OOV item into sub-words.

The question describes the second method. In this method, a low-frequency word is represented by a high-frequency word piece plus several affixes. Compared with the first method, the benefit is that as much of the information in the sentence as possible is retained. Moreover, this method allows the BERT model to learn embeddings that cover more words without expanding the vocabulary.

---

It is important to use the right tokenizer with a pretrained model as each model was trained with text tokenized in a particular way. After tokenization, the Tokenizer object can also map the tokens to their IDs (indexes in the vocabulary):

In [7]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1996, 10938, 2121, 4294, 2038, 8590, 1996, 2492, 1997, 17953, 2361, 1012]


## 4.3. Contextualised Embeddings

Now that we have a sequence of tokens, we are almost ready to process the sequence using the pretrained model. 

Our model takes as input a PyTorch `tensor` object. In PyTorch, a `tensor` is a multi-dimensional matrix. Here, we need a two-dimensional matrix, where each row is a sequence of input tokens corresponding to a single sentence or document. Let's convert our list of IDs to a 2-D tensor with a single row:

In [10]:
ids_tensor = torch.tensor([ids])

print(ids_tensor)

tensor([[ 1996, 10938,  2121,  4294,  2038,  8590,  1996,  2492,  1997, 17953,
          2361,  1012]])


Now we can process the sequence using our model. The model maps the sequence of input IDs to a sequence of output vectors, which are contextualised word embeddings. The hidden state values produced in the last hidden layer of the model are used as the contextualised embeddings:

In [11]:
model_outputs = model(ids_tensor)
print('The complete model outputs: ')
print(model_outputs)

print()
print('The last hidden states for the first sequence in the batch (one embedding per token): ')
embeddings = model_outputs['last_hidden_state'][0]
print(embeddings)

The complete model outputs: 
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3608,  0.2862, -0.1549,  ..., -0.2064,  0.2663, -0.0109],
         [ 0.0149,  0.7223, -0.0508,  ..., -0.5505,  0.2355, -0.2962],
         [ 0.1531,  0.5903, -0.1244,  ..., -0.4263,  0.0417, -0.1839],
         ...,
         [ 0.1742, -0.1091, -0.1963,  ..., -0.6736,  0.0472, -0.1840],
         [ 0.2434,  0.1021, -0.2241,  ..., -0.5400, -0.1691, -0.1314],
         [ 0.0854,  0.3272, -0.3016,  ..., -0.2154, -0.5632, -0.1921]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-1.1380e-02, -6.3006e-03,  1.8521e-02,  7.1139e-03, -3.1795e-02,
          1.3882e-02, -1.5459e-02, -1.0610e-03, -1.8263e-02, -3.6515e-02,
         -2.1257e-02, -1.5479e-02, -2.8090e-04, -4.1092e-02, -2.5315e-02,
         -4.3338e-02, -1.1616e-03, -1.3931e-02,  6.0733e-03,  4.3790e-03,
          2.7093e-04, -2.1810e-02, -4.8026e-02,  2.5493e-02, -1.6502e-02,
         -1.2034e-03,  4.2757e-02,  3.

We can retrieve the embedding vector for "transform" like this ("transform" is the second token in the sequence):

In [12]:
emb = embeddings[1]  # get second embedding in the sequence

# convert it to a numpy array so we can perform various operations on it later on
emb = emb.detach().numpy()

print(emb)
print(f'The TinyBERT embeddings have {emb.shape[0]} dimensions.')

[ 1.49150752e-02  7.22318411e-01 -5.07878214e-02 -2.74205089e-01
 -1.38932168e-01  1.00099742e+00  7.11441785e-03  2.71392167e-01
 -3.92800421e-02  6.04102612e-02  1.25739470e-01  4.60632950e-01
  6.25243736e-03  1.61930203e-01  1.23913184e-01 -4.08096403e-01
  1.24866992e-01 -4.71535861e-01  2.24769294e-01  6.35196269e-02
  8.56177211e-02 -1.88045606e-01  1.77258268e-01  3.40049952e-01
 -1.95546120e-01  1.58553332e-01  9.62870792e-02  1.12650014e-01
  2.21045271e-01 -9.56113100e-01 -3.85948241e-01  1.39221326e-01
  5.90011060e-01 -8.06728244e-01 -1.34287983e-01  2.35692129e-01
 -1.02274224e-01  2.78303713e-01  7.94321179e-01 -2.49363691e-01
  1.72771931e-01 -2.07583815e-01  3.00156236e-01 -8.59339088e-02
 -2.25285321e-01 -9.75413024e-02 -3.52348655e-01  3.81160498e-01
 -3.87680382e-01 -1.77613512e-01 -4.13685918e-01  1.38048023e-01
  1.29870996e-02  6.52683675e-01  1.16503254e-01 -5.10778904e-01
 -8.30418169e-02 -2.67046541e-02  3.12863946e-01 -2.62848705e-01
 -1.43285021e-01  1.10269

TO-DO 4b: Retrieve the embedding for "architecture" (this to-do will not be marked).

In [13]:
# WRITE YOUR ANSWER HERE
word = "architecture"
idx_word = tokens.index(word)
print(f'The embedding of "{word}" is ')
print(embeddings[idx_word].detach().numpy())

The embedding of "architecture" is 
[ 2.71387458e-01  7.74581611e-01 -3.24257761e-01 -7.14331269e-02
 -4.95254993e-04  9.37310636e-01 -4.40156646e-03 -4.26921584e-02
  1.27411429e-02  1.89266484e-02  1.02528304e-01  4.54657137e-01
  2.70436347e-01  2.30988786e-01  4.03654762e-03 -1.08995169e-01
 -4.59915325e-02 -3.51154566e-01 -1.34710193e-01  8.29398781e-02
  1.86496884e-01  5.00264913e-02  7.21680447e-02  2.28659511e-01
 -2.19696805e-01  9.40186232e-02  1.65541068e-01  1.85795456e-01
  3.17783564e-01 -5.09367466e-01 -5.00949144e-01  1.52488261e-01
  4.57999438e-01 -8.51876259e-01 -1.58632576e-01  1.58965573e-01
  4.16198075e-02  2.30998188e-01  8.78503203e-01 -6.23165891e-02
  1.87220111e-01 -1.23378038e-02  2.10083514e-01  3.48064750e-02
 -2.51240164e-01 -1.37914658e-01 -3.88697356e-01  2.98188627e-01
 -2.92033523e-01 -3.19503576e-01 -1.98435009e-01  1.32033393e-01
 -6.46379739e-02  7.43183851e-01  7.14249015e-02 -3.02117795e-01
  3.49781036e-01 -5.81779853e-02  2.85069764e-01 -4.09

Sentences and documents usually have varying lengths. So, to put multiple sentences into a single tensor, we need to pad the sequences up to a maximum length. Luckily, the tokenizer class takes care of this for us. When we pass in a list of sentences, the tokenizer creates a matrix, where each row is a sequence of the same length:

In [14]:
sentences = [
    "They received a loan from the bank.",
    "It was not good for either his bank balance or his blood pressure.",
    "She walked along the bank of the river towards the city.",
    "They bank their cheques on Thursdays.",
    "She walked along the embankment towards the city."
]

model_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  

print(model_inputs)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': tensor([[  101,  2027,  2363,  1037,  5414,  2013,  1996,  2924,  1012,   102,
             0,     0,     0,     0,     0,     0],
        [  101,  2009,  2001,  2025,  2204,  2005,  2593,  2010,  2924,  5703,
          2030,  2010,  2668,  3778,  1012,   102],
        [  101,  2016,  2939,  2247,  1996,  2924,  1997,  1996,  2314,  2875,
          1996,  2103,  1012,   102,     0,     0],
        [  101,  2027,  2924,  2037, 18178, 10997,  2006,  9432,  2015,  1012,
           102,     0,     0,     0,     0,     0],
        [  101,  2016,  2939,  2247,  1996, 22756,  2875,  1996,  2103,  1012,
           102,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': t

`model_inputs` is a dictionary containing three objects:
 * The `input_ids` are the list of token IDs in the input sequences. 
 * The `attention_mask` records which tokens are special padding tokens and which are real tokens. Tokens with a 0 in the attention mask will be ignored.
 * `token_type_ids` is needed when two sequences are passed together as input to the model for tasks such as next sentence prediction that involve comparing two sentences. Here, each input is a single sentence, so we have only one type of token in the output above. 
 
TO-DO 4c: What value do the special padding tokens have? (this to-do is unmarked)

ANSWER: 
The value (ID) of the special padding token is 0.

There are five special tokens in BERT:
1. [CLS] => 101
2. [SEP] => 102
3. [PAD] => 0
4. [UNK] => 100
5. [MASK] => 103

---

Notice that the input_ids all start with the same token ID, 101, even though they have different first words. They also have token ID 102 before the padding tokens. This is because the tokenizer inserts two special tokens, which are used in some applications of BERT. 101 is the '[CLS]' token, which is a dummy token whose embedding can be trained to represent the whole sequence. The [CLS] token's embedding can then be used as input to a text classifier to classify a sentence or document. Token 102 is '[SEP]', which can be used to separate multiple input sequences in a single example. This is needed in tasks where multiple pieces of text are provided as input, e.g., to build a classifier that can determine whether two sentences contradict each other. 
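
As a rough illustration of what the tokenizer does when `padding=True`, the sketch below wraps each ID sequence in the special tokens and pads the batch to the length of its longest sequence. The helper `pad_batch` is hypothetical and is not part of the Transformers library; the special token IDs are the ones visible in the printed `input_ids` above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs seen in the tokenizer output above


def pad_batch(sequences):
    """Wrap each ID sequence in [CLS] ... [SEP] and pad to the longest one."""
    wrapped = [[CLS_ID] + list(seq) + [SEP_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (IDs taken from the examples above).
toy_batch = pad_batch([[2027, 2363], [2016, 2939, 2247, 1996]])
for key, rows in toy_batch.items():
    print(key, rows)
```

Every row ends up the same length, and the attention mask records which positions are padding.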

We can now pass all of the model inputs to the model to produce a set of contextualised embeddings:

In [22]:
# model_inputs is a dictionary, so to provide the arguments to model(), 
# we use the double star to unpack the dictionary so that each key in the dictionary is
# an argument to model() and each value is the value of the argument. 
model_outputs = model(**model_inputs) 
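
If the double-star syntax is unfamiliar, the following small sketch shows that unpacking a dictionary is equivalent to writing out each keyword argument by hand. The function `describe` is a hypothetical stand-in for `model()`, used only so the example runs without loading a network.

```python
def describe(input_ids, attention_mask=None, token_type_ids=None):
    """Stand-in for model(): report which keyword arguments were provided."""
    received = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }
    return sorted(name for name, value in received.items() if value is not None)


toy_inputs = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]]}

# These two calls pass exactly the same arguments.
unpacked = describe(**toy_inputs)
explicit = describe(
    input_ids=toy_inputs["input_ids"],
    attention_mask=toy_inputs["attention_mask"],
)
print(unpacked)
print(explicit)
```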

**TO-DO 4d:** The first four example sentences above all contain the word "bank", and the last example contains "embankment". Obtain a list of contextualised word embeddings for 'bank' and 'embankment' in the example sentences using our model. **(4 marks)**

Hint: you may need to convert tensors to numpy arrays.

In [29]:
#WRITE YOUR OWN CODE HERE

# get the token IDs of "bank" and "embankment"
bank_ids = tokenizer.convert_tokens_to_ids(tokens='bank')
embankment_ids = tokenizer.convert_tokens_to_ids(tokens='embankment')

# build an index matrix with the same shape as model_inputs['input_ids'], i.e. (5, 16):
# the positions of 'bank' and 'embankment' are True (1) and all other positions are False (0)
word_idxs = torch.zeros_like(model_inputs['input_ids'])
for sent_idx, sent_ids in enumerate(model_inputs['input_ids']):
    if sent_idx == len(sentences)-1:
        word_idxs[sent_idx] = (sent_ids==embankment_ids)
    else:
        word_idxs[sent_idx] = (sent_ids==bank_ids)
word_idxs = (word_idxs == 1)
print('A 2-D tensor to index the position of word "bank" and "embankment":')
print(word_idxs)

# retrieve the embedding tensors from the last_hidden_state of model_outputs
contextualised_word_embeddings = model_outputs.last_hidden_state[word_idxs].detach().numpy()
print(f'A tensor to store the embeddings of "bank" and "embankment" in all \
example sentences. Its shape is {contextualised_word_embeddings.shape}')
print(contextualised_word_embeddings)

A 2-D tensor to index the position of word "bank" and "embankment":
tensor([[False, False, False, False, False, False, False,  True, False, False,
         False, False, False, False, False, False],
        [False, False, False, False, False, False, False, False,  True, False,
         False, False, False, False, False, False],
        [False, False, False, False, False,  True, False, False, False, False,
         False, False, False, False, False, False],
        [False, False,  True, False, False, False, False, False, False, False,
         False, False, False, False, False, False],
        [False, False, False, False, False,  True, False, False, False, False,
         False, False, False, False, False, False]])
A tensor to store the embeddings of "bank" and "embankment" in all example sentences. Its shape is (5, 312)
[[ 0.52827054 -0.04592283  0.13089302 ... -0.40409508 -0.02605382
   0.5743132 ]
 [ 0.25252566 -0.5110978   0.08738894 ...  0.06659798  0.33056778
   0.4004468 ]
 [ 0.38

**TO-DO 4e:** Compute the similarities between these embeddings in the cell below, and show the results. Which embeddings are most similar to one another and why? **(6 marks)**

WRITE YOUR ANSWER HERE:

The result contains ten values, each giving the **cosine similarity** between one pair of the five occurrences of "bank" or "embankment" in the example sentences. The largest value is 0.7245, which is the cosine similarity between the embedding of "bank" in the third sentence and the embedding of "embankment" in the last sentence. Although they are two different words, they have the same meaning in their contexts. According to the Oxford English Dictionary, this meaning is "a wall of stone or earth made to keep water back or to carry a road or railway/railroad over low ground".

This suggests that TinyBERT correctly captures the words' meanings from their context.

In [31]:
# WRITE YOUR OWN CODE HERE

# write a function to calculate the cosine similarity between two input vectors
def get_cosine_similarity(vector_A, vector_B):
    assert vector_A.shape == vector_B.shape     # check if two vectors have the same shape
    return np.sum(vector_A * vector_B) / (np.sqrt(np.sum(vector_A**2)) * np.sqrt(np.sum(vector_B**2)))

# define a (5,5) matrix, named cosine_similarity, to store the cosine similarity between each pair of words
# cosine similarity is symmetric, i.e. the similarity between A and B equals that between B and A,
# so cosine_similarity is a symmetric matrix and only its upper triangle needs to be computed
n = 5
cosine_similarity = np.zeros(shape=(n,n))
for i in range(n):
    for j in range(i+1, n):
        cosine_similarity[i,j] = get_cosine_similarity(vector_A=contextualised_word_embeddings[i],
                                                       vector_B=contextualised_word_embeddings[j])
print('Cosine similarity between each two of the five words "bank" or "embankment" in the example sentence:')
print(cosine_similarity)

Cosine similarity between each two of the five words "bank" or "embankment" in the example sentence:
[[0.         0.63313478 0.48778656 0.52329594 0.36788744]
 [0.         0.         0.43572855 0.47056487 0.2787157 ]
 [0.         0.         0.         0.47878537 0.72452569]
 [0.         0.         0.         0.         0.31861022]
 [0.         0.         0.         0.         0.        ]]
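
The pairwise loop above fills only the upper triangle of the matrix. As an alternative sketch, the full symmetric matrix can be computed in one step by normalising every embedding to unit length and taking dot products. The function name `cosine_similarity_matrix` and the toy vectors are illustrative assumptions, not part of the lab code.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the full symmetric matrix of pairwise cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms   # scale every row to unit length
    return unit @ unit.T        # dot products of unit vectors are cosines


# Toy 2-D vectors standing in for the 312-dimensional embeddings.
toy_embeddings = np.array([[1.0, 0.0],
                           [0.0, 2.0],
                           [3.0, 3.0]])
sim = cosine_similarity_matrix(toy_embeddings)
print(np.round(sim, 4))
```

The diagonal is 1 because every vector is identical to itself, so when searching for the most similar pair the diagonal should be ignored.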


**TO-DO 4f:** Use the [CLS] token's embedding to find the most similar **sentence** to "She walked along the embankment towards the city." from the first four sentences. Print the similarities and the selected sentence. **(3 marks)**

In [34]:
# WRITE YOUR OWN CODE HERE

# get the index of the target sentence "She walked along ..." in the list named 'sentences'
aim_sentence_idx = len(sentences)-1

# the embedding of [CLS] is used to represent the whole sequence
# it is the first token of each sequence's last hidden state
# define an index matrix to obtain a list of [CLS] embeddings for all example sentences
cls_embeddings_idx = torch.zeros_like(model_inputs['input_ids'])    # a matrix filled with True and False used as an index
cls_embeddings_idx[:,0] = 1                                         # set the value at the [CLS] position to True (1)
cls_embeddings_idx = (cls_embeddings_idx == 1)
# obtain the contextualised embeddings of the special token [CLS] for each example sentence
cls_embeddings = model_outputs.last_hidden_state[cls_embeddings_idx].detach().numpy()
# calculate the cosine similarity between the target sentence and each of the other sentences
print('According to the last hidden state [CLS]:')
scores = np.zeros(shape=[4])
for idx in range(len(sentences)-1):
    scores[idx] = get_cosine_similarity(vector_A=cls_embeddings[idx], vector_B=cls_embeddings[aim_sentence_idx])
print(f'The cosine similarity: {scores} .')
print(f'The max cosine similarity value is {np.max(scores)}.')
print(f'The index of max cosine similarity value is {np.argmax(scores)} (The first index is 0).')
print(f'The most similar sentence is "{sentences[np.argmax(scores)]}".')

# in addition, I use the pooler output of the model to assess the similarity between two sentences and I get the same result
print()
print('According to the pooler output [CLS]:')
pooler_output = model_outputs.pooler_output.detach().numpy()
scores = np.zeros(shape=[4])
for idx in range(len(sentences)-1):
    scores[idx] = get_cosine_similarity(vector_A=pooler_output[idx], vector_B=pooler_output[aim_sentence_idx])
print(f'The cosine similarity: {scores} .')
print(f'The max cosine similarity value is {np.max(scores)}.')
print(f'The index of max cosine similarity value is {np.argmax(scores)} (The first index is 0).')
print(f'The most similar sentence is "{sentences[np.argmax(scores)]}".')

According to the last hidden state [CLS]:
The cosine similarity: [0.90930408 0.79363501 0.99479598 0.89889157] .
The max cosine similarity value is 0.9947959780693054.
The index of max cosine similarity value is 2 (The first index is 0).
The most similar sentence is "She walked along the bank of the river towards the city.".

According to the pooler output [CLS]:
The cosine similarity: [0.91320854 0.79239655 0.99556798 0.90125519] .
The max cosine similarity value is 0.9955679774284363.
The index of max cosine similarity value is 2 (The first index is 0).
The most similar sentence is "She walked along the bank of the river towards the city.".


# 5. Question Answering with Pretrained Transformers (max. 11 marks)

The previous section showed us how to obtain a sequence of contextualised word embeddings using a pretrained transformer. How are these embeddings used to extract answers from documents to a given question?

First, let's load up the [Tweet QA](https://huggingface.co/datasets/tweet_qa) dataset, which we will use to test a pretrained question answering (QA) model. This dataset contains tweets along with questions about the information in the tweets, and a list of correct answers. As we are not going to train our own QA model (it requires a lot of compute time), we will only need the validation set:

In [35]:
from sklearn.metrics import f1_score

val_dataset = load_dataset(
    "tweet_qa",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

Using custom data configuration default
Reusing dataset tweet_qa (./data_cache/tweet_qa/default/1.0.0/7d588f7f477946b10f60c035ca55175737315ac446102b015218af38d2638777)


Validation dataset with 1086 instances loaded


Now we are working with a complete dataset using the HuggingFace datasets library. In the next cell, we create a tokenizer to tokenize the examples in the dataset. We need to choose the right tokenizer for the QA model we want to use, so let's use `"distilbert-base-cased-distilled-squad"` as our pretrained model. This is based on a smaller version of BERT, called DistilBERT, which was fine-tuned on the SQuAD question answering dataset.

In [36]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad") 

def tokenize_function(dataset):
    # Pass two strings to the tokenizer -- it will concatenate them with a [SEP] special token between them. 
    model_inputs = tokenizer(dataset['Question'], dataset['Tweet'], padding="max_length", max_length=200, truncation='only_second')
    return model_inputs
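
To show roughly what happens when two strings are passed together with `padding="max_length"` and `truncation='only_second'`, here is a simplified sketch that works on lists of token strings instead of IDs. The helper `encode_pair` and the example tokens are hypothetical; the real tokenizer also converts tokens to IDs and handles sub-word splitting.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def encode_pair(question_tokens, context_tokens, max_length):
    """Pack a question and a context into one fixed-length token sequence."""
    # Room left for the context after the question and three special tokens.
    budget = max_length - len(question_tokens) - 3
    if budget < 0:
        raise ValueError("max_length is too small for the question")
    context_tokens = context_tokens[:budget]  # truncate only the second sequence
    tokens = [CLS] + question_tokens + [SEP] + context_tokens + [SEP]
    attention_mask = [1] * len(tokens)
    n_pad = max_length - len(tokens)
    return tokens + [PAD] * n_pad, attention_mask + [0] * n_pad


q = ["who", "won", "?"]
ctx = ["the", "home", "team", "won", "the", "match", "today"]

# With a small max_length the context is cut short; the question is kept whole.
short_tokens, short_mask = encode_pair(q, ctx, max_length=10)
# With a large max_length nothing is cut and padding fills the remainder.
long_tokens, long_mask = encode_pair(q, ctx, max_length=16)
print(short_tokens)
print(long_tokens)
```

Truncating only the second sequence protects the question, since losing part of the question would usually be more harmful than losing the end of a long tweet.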

Again, we can use the `map()` method to apply the tokenizer to each example in the dataset. 

In [40]:
val_dataset = val_dataset.map(tokenize_function, batched=True) 

  0%|          | 0/2 [00:00<?, ?ba/s]

The type of QA model we are going to work with is _extractive_, meaning that the model will extract the answer from the 'context' (also known as the 'passage' or 'source document'). It does this by identifying the index of the start and end tokens of the answer span within the context, or returning `(0, 0)` (the index 0 for both the start and end token) if the context does not contain an answer to the given question. 
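
As a sketch of how an answer span might be chosen once start and end probabilities are available, the function below searches for the pair of positions with the highest combined score, subject to the start not coming after the end. This is one common decoding strategy rather than necessarily the one used later in this notebook; `best_span`, `max_answer_len` and the toy probabilities are assumptions made for illustration, and the "no answer" case is not handled.

```python
def best_span(start_probs, end_probs, max_answer_len=30):
    """Return (start, end, score) maximising start_probs[s] * end_probs[e], s <= e."""
    best = (0, 0, 0.0)
    for s, p_start in enumerate(start_probs):
        # Only consider end positions at or after the start, within the length limit.
        last = min(len(end_probs), s + max_answer_len)
        for e in range(s, last):
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy probabilities over five token positions (each list sums to 1).
toy_start = [0.05, 0.10, 0.70, 0.10, 0.05]
toy_end = [0.05, 0.05, 0.10, 0.60, 0.20]

start, end, score = best_span(toy_start, toy_end)
print(start, end, round(score, 2))
```

Here the selected span covers token positions 2 to 3 inclusive, so the extracted answer would be the text of those two tokens.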

As explained in the lectures, BERT forms the basis of the QA model, and maps each token to a contextualised embedding. The QA model then maps each token's contextualised embedding to the probability that the token is the start of the answer span, and to the probability that the token is the end of the answer span. The layers that map the embeddings to the start and end probabilities are known as the 'head' of the model. [The original BERT paper](https://arxiv.org/pdf/1810.04805.pdf) depicts the QA model like this (Devlin et al., 2018):

<img src="bert_qa.png" alt="BERT QA diagram from the slides in week 10 showing the embedding of each token connected to the start and end output layers" width="400px"/>
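
The following sketch illustrates the idea of a QA head numerically, assuming the head is a single linear layer with two outputs per token followed by a softmax over token positions, in line with the description in the BERT paper. The embeddings and weights are random stand-ins, so the resulting probabilities are meaningless; only the shapes and the normalisation are of interest.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size = 6, 8

# Stand-in for the transformer output: one embedding per token.
token_embeddings = rng.normal(size=(seq_len, hidden_size))

# Hypothetical head parameters: one weight column for 'start', one for 'end'.
W = rng.normal(size=(hidden_size, 2))
b = np.zeros(2)

logits = token_embeddings @ W + b  # shape (seq_len, 2)
start_logits, end_logits = logits[:, 0], logits[:, 1]


def softmax(x):
    """Normalise scores over token positions so that they sum to 1."""
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()


start_probs = softmax(start_logits)
end_probs = softmax(end_logits)
print(start_probs.round(3))
print(end_probs.round(3))
```

Note that the softmax runs across the tokens of one sequence, not across classes: each token competes with the others to be the start (or end) of the answer.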

We can see a similar structure in most neural network models. Our original text classifier from the first notebook used a fully-connected layer to produce a hidden representation of the whole sentence (rather than using BERT to produce a sequence of embeddings). This hidden representation was then fed to an output layer to produce a probability distribution over class labels (rather than the start and end probabilities):

<img src="neural_text_classifier_smaller.png" alt="Neural text classifier diagram from the slides in lecture 8.1" width="400px"/>


<!--With transformers, 
we can do something very similar, by connecting the transfomer's output to a fully-connected layer. However, with BERT, we do not need to pass the embedding of each individual word to the fully-connected layer because there is a special [CLS] token that represents the whole sentence:

The code below shows how to access a tensor containing the [CLS] embeddings:-->

Now that we have the dataset in the right format, let's see how to load a pretrained QA model based on a pretrained transformer. The QA model was trained by taking a pretrained BERT model (pretrained on masked language modelling with unlabelled text), adding the QA head, then further training the complete model on a QA dataset. 

The transformers library provides some useful wrapper classes for loading pretrained models for various NLP tasks, such as QA or text classification. These 'auto' classes are documented here: https://huggingface.co/docs/transformers/model_doc/auto . Let's use an auto class to load the `"distilbert-base-cased-distilled-squad"` pretrained QA model (this code will try to reload the model from a cache or download the model from HuggingFace):

In [41]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

As our model was pretrained, we can use it directly on our Tweet_QA dataset (you may see a message to this effect when you run the cell above the first time). 

So, how do we get a prediction from the model? Let's take a single example from Tweet_QA and obtain the start and end probabilities for all tokens in the 'context':

In [54]:
def predict_nn(qa_model, dataset):
    
    # Switch off dropout
    qa_model.eval()

    # Pass the required inputs from the dataset to the model    
    output = qa_model(attention_mask=torch.tensor(dataset["attention_mask"]), input_ids=torch.tensor(dataset["input_ids"]))
        
    # The output contains start_logits and end_logits: unnormalised scores for every token position in each example.
    # Softmax over dim=1 (the token dimension) turns them into start and end probabilities:
    probs_start = torch.nn.Softmax(dim=1)(output["start_logits"]).detach().numpy()
    probs_end = torch.nn.Softmax(dim=1)(output["end_logits"]).detach().numpy()
        
    return probs_start, probs_end

# Run the prediction function to get the results for the first 20 examples:
probs_start, probs_end = predict_nn(model, val_dataset[0:20])

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[ -3.2792,  -1.5579,  -5.2233,  ...,  -9.1283,  -9.1421,  -9.1385],
        [ -2.9796,  -0.7380,  -4.5796,  ...,  -9.6077,  -9.6307,  -9.6197],
        [ -4.6679,  -4.2097,  -8.2780,  ..., -12.3520, -12.3545, -12.3360],
        ...,
        [ -6.4368,  -7.6735,  -9.3097,  ..., -12.8042, -12.8305, -12.8321],
        [ -3.8143,  -6.6175,  -7.9168,  ..., -11.7859, -11.8477, -11.8434],
        [ -1.5484,  -2.3328,  -3.5560,  ...,  -8.0880,  -8.0825,  -8.1029]],
       grad_fn=<CloneBackward0>), end_logits=tensor([[ -1.6764,  -3.5921,  -6.7713,  ...,  -8.7122,  -8.7293,  -8.7356],
        [ -1.3737,  -2.3578,  -6.9687,  ...,  -9.2173,  -9.1775,  -9.1721],
        [ -3.7538,  -4.5660,  -9.0925,  ..., -12.0416, -12.0330, -12.0327],
        ...,
        [ -4.8956,  -8.0379, -10.4365,  ..., -12.2881, -12.2993, -12.3126],
        [ -2.4702,  -7.1073,  -8.6548,  ..., -11.0500, -11.0612, -11.1155],
        [ -0.0421,  -2.1054,  -4.2895, 
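
The logits printed above are unnormalised scores, one per token position. As a minimal sketch of what the softmax step in `predict_nn` does to a single example, the following uses only the standard library. The logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert a list of unnormalised scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical start/end logits for a five-token input (illustrative values only)
start_logits = [-3.0, 0.5, 4.0, -1.0, -2.0]
end_logits = [-2.5, -1.0, 0.0, 3.5, -3.0]

probs_start = softmax(start_logits)
probs_end = softmax(end_logits)

# The most probable start and end positions, considered separately
best_start = max(range(len(probs_start)), key=probs_start.__getitem__)
best_end = max(range(len(probs_end)), key=probs_end.__getitem__)
print(best_start, best_end)
```

Because softmax preserves the ordering of the scores, the most probable position is the one with the largest logit.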

Now that we have the probabilities that each token is a start or end token, we combine these probabilities to estimate the probability of each possible answer span. This will allow us to choose the answer span with highest probability. 
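
One common way to combine the two distributions, assuming the start and end positions are treated as independent, is to multiply them: the probability of the span from token `s` to token `e` is `probs_start[s] * probs_end[e]`. The sketch below applies this to made-up probabilities for a four-token input. It enumerates every pair, including pairs where the end comes before the start, so it does not yet filter out invalid spans.

```python
from itertools import product

# Hypothetical start/end probabilities for a four-token input (illustrative values only)
probs_start = [0.1, 0.6, 0.2, 0.1]
probs_end = [0.05, 0.15, 0.7, 0.1]

n_tokens = len(probs_start)

# Joint probability of every (start, end) pair under the independence assumption
span_probs = {
    (s, e): probs_start[s] * probs_end[e]
    for s, e in product(range(n_tokens), repeat=2)
}

best_span = max(span_probs, key=span_probs.get)
print(best_span, span_probs[best_span])
```

With this choice the scores over all pairs sum to 1, so each score can be read as a probability.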

In the next cell is our first attempt, which you will need to improve to get valid answers. This code loops through each possible combination of start and end tokens, obtains the start and end probabilities, and extracts the answer text for the corresponding span.

**TO-DO 5a:** Use the start and end probabilities to compute the answer span probability at the place marked inside the predict_answer() function below. **2 marks**

In [61]:
# our example:
example_index = 18

example = val_dataset[example_index]
print(f'CONTEXT = {example["Tweet"]}')
print(f'QUESTION = {example["Question"]}')
print(f'LIST OF POSSIBLE ANSWERS = {example["Answer"]}')

CONTEXT = Started researching this novel in 2009. Now it is almost ready for you to read. Excited! #InTheUnlikelyEvent Judy Blume (@judyblume) December 15, 2014
QUESTION = what is the name of the novel?
LIST OF POSSIBLE ANSWERS = ['in the unlikely event.', 'in the unlikely event']


In [44]:
def predict_answer(probs_start, probs_end, input_ids, tokenizer):
    
    input_length = len(input_ids)  # length of the input sequence, in the form "[CLS] question [SEP] context"

    SEP_SPECIAL_TOKEN = 102  # the input id for the [SEP] special token that separates the question from the context. The context starts after this token. 
    PAD_SPECIAL_TOKEN = 0  # the input id for padding tokens added to the end of the context
    
    span_probabilities = []  # save the probabilities here
    spans = []  # save the possible answer spans here
    
    for start_idx in range(0, input_length):
        for end_idx in range(0, input_length):
            
            start_prob = probs_start[start_idx]
            end_prob = probs_end[end_idx]
            
            ### WRITE YOUR ANSWER HERE
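            # Sum the start and end probabilities to score each possible start/end combination.
            span_probabilities.append(start_prob + end_prob)
            # Note: probs_start and probs_end are softmax outputs (probabilities, not logits),
            # so a joint span probability would be start_prob * end_prob. The sum is only a
            # ranking score and can exceed 1, as the printed values show.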
            ###
            
            span = tokenizer.decode(input_ids[start_idx:end_idx+1])
            spans.append(span)

    # sort the spans according to probability:
    sorted_span_index = np.argsort(span_probabilities)
    
    # print the top 20 answers:
    for i in range(20):
        print(f'Span prob = {span_probabilities[sorted_span_index[-i-1]]}, answer = {spans[sorted_span_index[-i-1]]}')
            
predict_answer(probs_start[example_index], probs_end[example_index], example['input_ids'], tokenizer)

Span prob = 1.5330551862716675, answer = endangeredriver
Span prob = 1.0912036895751953, answer = The # endangeredriver
Span prob = 1.0646520853042603, answer = # endangeredriver
Span prob = 0.9734905958175659, answer = 
Span prob = 0.9423065781593323, answer = 
Span prob = 0.9349806904792786, answer = 
Span prob = 0.9338609576225281, answer = ##river
Span prob = 0.9330174922943115, answer = 
Span prob = 0.9280965924263, answer = 
Span prob = 0.9277582764625549, answer = 
Span prob = 0.9274642467498779, answer = 
Span prob = 0.9273731112480164, answer = 
Span prob = 0.9272754788398743, answer = 
Span prob = 0.9271112680435181, answer = 
Span prob = 0.9269941449165344, answer = 
Span prob = 0.9269336462020874, answer = 
Span prob = 0.926930844783783, answer = 
Span prob = 0.9268699884414673, answer = 
Span prob = 0.9268397092819214, answer = 
Span prob = 0.9268389940261841, answer = 


Are all of the top 20 valid and unique answers? If not, what do you think is going wrong? 

**TO-DO 5b:** Use the cell below to define a new and improved version of `predict_answer()` that only includes valid answers. Summarise in a couple of sentences what kind of invalid answers your code removes. **4 marks**

WRITE YOUR ANSWER HERE:

No. Most of the answers are empty strings, which are invalid. 

I think there are two main reasons for this. 

> The first is that the search range is wrong. The cell above searches the entire input, which includes the question, the [CLS] and [SEP] special tokens, and the many [PAD] tokens added during tokenisation. None of these can be part of the answer.

> The second is that the code in the cell above does not enforce the constraint that end_idx must be greater than or equal to start_idx, i.e. **end_idx >= start_idx**. When end_idx is less than start_idx, the slice `input_ids[start_idx:end_idx+1]` is an empty list.

I made two improvements to the code based on these two reasons.

> **IMPROVEMENT 1**: Detect the boundaries of the tweet (context) tokens and search for answers only within this range.

> **IMPROVEMENT 2**: Change the loop ranges so that end_idx is always greater than or equal to start_idx.
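
As a minimal sketch of these two constraints on their own, the function below enumerates the candidate spans for a toy input. It assumes the same special-token ids as the cell above (`102` for [SEP], `0` for [PAD]). The other token ids are hypothetical, and the closing [SEP] is treated as an exclusive upper bound so that the last context token can still be part of an answer.

```python
SEP_SPECIAL_TOKEN = 102  # assumed to match the id used in the notebook cell above
PAD_SPECIAL_TOKEN = 0

def valid_spans(input_ids):
    """Return (start, end) pairs that lie inside the context and satisfy end >= start."""
    context_start = input_ids.index(SEP_SPECIAL_TOKEN) + 1           # first token after the first [SEP]
    context_end = input_ids.index(SEP_SPECIAL_TOKEN, context_start)  # closing [SEP], exclusive bound
    return [
        (start, end)
        for start in range(context_start, context_end)
        for end in range(start, context_end)
    ]

# Toy input: [CLS] q q [SEP] c c c [SEP] [PAD] [PAD]  (ids other than 102 and 0 are made up)
toy_ids = [101, 7, 8, 102, 21, 22, 23, 102, 0, 0]
spans = valid_spans(toy_ids)
print(spans)
```

For a context of three tokens this leaves six candidate spans, none of which touches the question, the special tokens or the padding.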

In [62]:
### WRITE YOUR OWN CODE HERE
def predict_answer(probs_start, probs_end, input_ids, tokenizer):
    
    input_length = len(input_ids)  # length of the input sequence, in the form "[CLS] question [SEP] context"

    SEP_SPECIAL_TOKEN = 102  # the input id for the [SEP] special token that separates the question from the context. The context starts after this token. 
    PAD_SPECIAL_TOKEN = 0  # the input id for padding tokens added to the end of the context
    
    span_probabilities = []  # save the probabilities here
    spans = []  # save the possible answer spans here

    # IMPROVEMENT 1: detect the boundaries of the context (tweet) tokens
    context_sequence_start = input_ids.index(SEP_SPECIAL_TOKEN) + 1  # first token after the first [SEP]
    context_sequence_end = input_ids.index(SEP_SPECIAL_TOKEN, context_sequence_start)  # index of the closing [SEP], used as an exclusive bound
    
    for start_idx in range(context_sequence_start, context_sequence_end):
        for end_idx in range(start_idx, context_sequence_end):                # IMPROVEMENT 2: end_idx >= start_idx
            
            start_prob = probs_start[start_idx]
            end_prob = probs_end[end_idx]
            span_probabilities.append(start_prob + end_prob)

            span = tokenizer.decode(input_ids[start_idx:end_idx+1])
            spans.append(span)

    # sort the spans according to probability:
    sorted_span_index = np.argsort(span_probabilities)
    
    # print the top 20 answers:
    for i in range(20):
        print(f'Span prob = {span_probabilities[sorted_span_index[-i-1]]}, answer = {spans[sorted_span_index[-i-1]]}')
        # print(spans)
            
predict_answer(probs_start[example_index], probs_end[example_index], example['input_ids'], tokenizer)

Span prob = 1.6801480054855347, answer = Excited! # InTheUnlikelyEvent Judy Blume
Span prob = 0.9847623109817505, answer = Excited!
Span prob = 0.9311493635177612, answer = InTheUnlikelyEvent Judy Blume
Span prob = 0.9155838489532471, answer = Excited
Span prob = 0.8738287091255188, answer = Excited! # InTheUnlikelyEvent
Span prob = 0.873778223991394, answer = Excited! # InTheUnlikelyEvent Judy
Span prob = 0.8694409132003784, answer = Excited! # InTheUnlikelyEvent Judy Blume ( @ judyblume )
Span prob = 0.8692814707756042, answer = Excited! # InTheUnlikely
Span prob = 0.8692633509635925, answer = Excited! #
Span prob = 0.8679510354995728, answer = Excited! # InTheUnlikelyEvent Judy Blume ( @ judyblume
Span prob = 0.8676655292510986, answer = Excited! # InTheUnlikelyEvent Judy Blu
Span prob = 0.867581844329834, answer = Ex
Span prob = 0.8675663471221924, answer = Excited! # InTheUnlikelyE
Span prob = 0.8675362467765808, answer = Excite
Span prob = 0.8674971461296082, answer = Excited! # 

You can try out the pretrained QA model on a few examples and try to identify its common mistakes.

**TO-DO 5c:** State one way that we could improve the performance of our extractive QA model on the Tweet QA dataset.  **2 marks**

WRITE YOUR ANSWER HERE

We could fine-tune the model on the training set of the Tweet QA dataset.

The model used in the cells above was fine-tuned only on the SQuAD question answering dataset, so it has not learned the features specific to tweets, such as hashtags, user mentions and informal language.

--- 

As well as answering ad-hoc queries, question answering models can help us to extract structured information about entities of interest from a large set of documents. Suppose that we want to automatically collect information on tech companies, such as Apple and OpenAI. We want to extract information about each company's activities from social media, including the names and release dates of new products and services, the company's earnings in a specific year, and who its CEO is.  

**TO-DO 5d:** Given a list of tech company names, how could we use question answering to extract the required information for each company from a set of tweets?  **(3 marks)** 

WRITE YOUR ANSWER HERE

The first step is to design or collect a set of questions for each company, based on the information to be obtained.

The second step is to tokenise each question together with a tweet and pass the pair to the model in the format "[CLS] question [SEP] tweet [SEP]".

The third step is to find the answer span with the maximum probability in the tweet, based on the output of the model.
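
The three steps above can be sketched as a loop over companies, question templates and tweets. Everything named here is hypothetical: the templates, the `extract_company_info` helper, the score threshold, and especially `toy_answer_fn`, which is a stand-in for a real QA model that returns an answer string and a score. The company names and tweets are invented.

```python
# Hypothetical question templates, one per piece of information we want
QUESTION_TEMPLATES = {
    "ceo": "Who is the CEO of {company}?",
    "product": "What new product did {company} release?",
    "release_date": "When did {company} release its new product?",
    "earnings": "How much did {company} earn?",
}

def extract_company_info(companies, tweets, answer_fn, min_score=0.5):
    """For each company and field, keep the best-scoring answer over the tweets that mention it."""
    results = {}
    for company in companies:
        relevant = [t for t in tweets if company.lower() in t.lower()]
        fields = {}
        for field, template in QUESTION_TEMPLATES.items():
            question = template.format(company=company)
            scored = [answer_fn(question, tweet) for tweet in relevant]
            # Drop empty answers ("no answer in this tweet") and low-confidence answers
            scored = [(ans, score) for ans, score in scored if ans and score >= min_score]
            if scored:
                fields[field] = max(scored, key=lambda pair: pair[1])[0]
        results[company] = fields
    return results

def toy_answer_fn(question, context):
    """Stand-in for a QA model: returns (answer_text, score). Not a real model."""
    words = context.split()
    if "CEO" in question and "CEO" in words:
        i = words.index("CEO")
        return " ".join(words[i + 1:i + 3]), 0.9  # pretend the two words after "CEO" are the span
    return "", 0.0

tweets = ["Acme CEO Jane Doe announces a new phone", "Lovely weather today"]
info = extract_company_info(["Acme", "Globex"], tweets, toy_answer_fn)
print(info)
```

Filtering tweets by company name first keeps the number of model calls down, and the score threshold is one simple way to discard tweets that do not contain an answer.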

# 6. Transformer-based Text Classifiers (max. 24 marks)

The previous section showed us how to use a pretrained QA model based on a pretrained transformer. In this section, you will learn how to construct and train a text classifier on top of a pretrained transformer. 

We will use the [Poem Sentiment](https://huggingface.co/datasets/poem_sentiment) dataset to train and test a classifier. The task is to classify lines from poems into one of four classes: 0 (negative), 1 (positive), 2 (no impact), or 3 (mixed sentiment). For more information, see [Sheng and Uthus, 2020](https://arxiv.org/pdf/2011.02686.pdf). 

To begin you will need to instantiate a suitable model.

**TO-DO 6a:** Find an AutoModel class that constructs a text classifier from the pretrained TinyBERT model, "huawei-noah/TinyBERT_General_4L_312D". Create the `model` object in the cell below using this class. Refer to the [Hugging Face documentation for auto models](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) as needed. **(2 marks)**

In [63]:
### WRITE YOUR ANSWER TO 6a HERE ###
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'fit_denses.4.weight', 'fit_denses.0.bias', 'fit_denses.2.bias', 'fit_denses.4.bias', 'fit_denses.3.bias', 'fit_denses.1.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'fit_denses.0.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'fit_denses.1.weight', 'cls.predictions.transform.dense.bias', 'fit_denses.2.weight', 'cls.predictions.bias', 'fit_denses.3.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a

**TO-DO 6b:** Provide a link to the documentation for your chosen auto model for text classification. Briefly describe how the text classifier `model` it creates differs from the QA model created by `AutoModelForQuestionAnswering`. Note: A useful reference may be the original BERT paper (https://arxiv.org/pdf/1810.04805.pdf), which includes diagrams (Figure 4) showing how BERT can be adapted to different tasks. **(2 marks)** 

WRITE YOUR ANSWER TO 6b HERE:

The two models take different inputs. The input to the QA model is a pair of sequences, a question and a context, separated by [SEP]. The input to the sequence classification (SC) model is a single sequence.

The two models have different 'heads'. The QA model has a span classification head, i.e. a linear layer on top of the hidden-state outputs that computes the span start logits and span end logits for every token. The SC model has a sequence classification head, i.e. a linear layer on top of the pooled output.

Overall, the QA model performs a token-level task, and the SC model performs a sequence-level task.
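
The difference in output shapes can be illustrated with a toy linear layer written in plain Python. This is only a sketch: the embeddings and weights are made up, the hidden size is 2 rather than 312, and the first token's embedding stands in for the pooled output.

```python
def linear(vector, weights, bias):
    """Apply y = W x + b, with the weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

# Hypothetical encoder output: 3 tokens, hidden size 2
token_embeddings = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8]]

# QA head: applied to every token, producing 2 numbers per token (start logit, end logit)
qa_weights, qa_bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
qa_logits = [linear(tok, qa_weights, qa_bias) for tok in token_embeddings]
start_logits = [row[0] for row in qa_logits]  # one start score per token
end_logits = [row[1] for row in qa_logits]    # one end score per token

# Sequence classification head: applied once, producing num_labels numbers per sequence
pooled = token_embeddings[0]  # stand-in for the pooled [CLS] representation
clf_weights = [[1.0, 1.0], [0.5, -0.5], [-1.0, 0.0], [0.0, 2.0]]
clf_bias = [0.0, 0.0, 0.0, 0.0]
class_logits = linear(pooled, clf_weights, clf_bias)

print(len(start_logits), len(end_logits), len(class_logits))
```

The QA head yields one start score and one end score per token, while the classification head yields one score per class for the whole sequence.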

---

For the QA task, the complete model was pretrained and we could apply it to a dataset without further training. However, for our poem sentiment classification task, we will need to train our model before we can use it (you may see a message in the output of the last cell telling you this). 

**TO-DO 6c:** The emotion classifier is built on top of a pretrained TinyBERT model, so why do we need to train it before we can use it? **(2 marks)**

WRITE YOUR ANSWER TO 6C HERE:

In the emotion classification task, the pretrained TinyBERT model outputs the embedding of the [CLS] token to represent the whole sentence. This [CLS] embedding is the input to a classification head, which maps it to the target labels. The head is not part of the pretrained checkpoint, so its weights are newly initialised, and we need to train it on labelled data before its predictions are meaningful.

---

Next, let's learn how to train our model. For some tasks it is not necessary to update the weights in the BERT model itself, so we can freeze them to save a lot of computation time. We can do this as follows. Since our pretrained model is based on BERT, we can access the weights inside BERT through the variable `model.bert`.

In [64]:
for param in model.bert.parameters():
    param.requires_grad = False

To train our model, we can make use of the Trainer class, which encapsulates a lot of the complex training steps and avoids the need to define our own training function, as we did in the previous notebook (we don't need to write our own `train_nn`).

First, define some settings for the training process. This is where we can set training hyperparameters:

In [202]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # the directory where model weights will be saved at certain points during training (checkpoints)
    num_train_epochs=10, # A sensible and sufficient number to use for the to-dos below
    per_device_train_batch_size=8,  # you can decrease this if memory usage is too high while training
    logging_steps=50,  # how often to print progress during training
)

Next, create a trainer object. Note that the next cell will currently fail with an error, because the variables `poem_train_dataset` and `poem_val_dataset` do not exist yet! Don't worry, we'll fix this later. 

In [203]:
from transformers import Trainer
from torch import nn

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=poem_train_dataset,
    eval_dataset=poem_val_dataset,
)

NameError: name 'poem_train_dataset' is not defined

To train the model, you will need to call `trainer.train()`.

Once the model is trained, we can obtain predictions using the function below. Notice that it is simpler than obtaining the spans for QA, because we simply get the logits for each example in the test set, then apply argmax over the classes to find the most probable class for each example:

In [204]:
def predict_nn(trained_model, test_dataset):

    # Switch off dropout
    trained_model.eval()
    
    # Pass the required items from the dataset to the model    
    output = trained_model(attention_mask=torch.tensor(test_dataset["attention_mask"]), input_ids=torch.tensor(test_dataset["input_ids"]))
                        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    pred_labs = np.argmax(output["logits"].detach().numpy(), axis=1)

    return pred_labs
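
As a small illustration of the argmax step in isolation, the sketch below picks the highest-scoring class for each row of a made-up logits table (one row per example, one column per class). The values are invented and the helper name `argmax_rows` is hypothetical.

```python
def argmax_rows(logits):
    """Return the index of the largest logit in each row (one row per example)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Hypothetical logits for three examples and four classes
example_logits = [
    [0.1, 2.3, -1.0, 0.0],
    [1.5, 0.2, 0.3, -0.7],
    [-0.4, -0.2, 3.1, 0.9],
]

pred_labs = argmax_rows(example_logits)
print(pred_labs)  # [1, 0, 2]
```

No softmax is needed here: softmax preserves the ordering of the scores, so the largest logit is also the most probable class.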

You should now have all the bits and pieces needed to build and train a text classifier. Let's put them all together...

**TO-DO 6d:** Implement and test a classifier for the [Poem Sentiment](https://huggingface.co/datasets/poem_sentiment) dataset using a pretrained transformer. Evaluate the classifier with both frozen and unfrozen (i.e., fine-tuned) parameters in the pretrained transformer. Choose a suitable evaluation metric and provide a comparison of the results below, including a brief explanation  (1-2 sentences) for any differences you observe between the frozen and unfrozen variants. Make sure to comment your code.  **(10 marks)**

Notes: 
 * Strong classifier performance is not required to achieve good marks; rather, we award marks for implementing and testing a transformer-based classifier correctly.
 * You may implement any suitable kind of classifier you like, as long as you are using a pretrained transformer model.
 * 'tiny' BERT variants such as TinyBERT and roberta-tiny are recommended because they are small enough to fine-tune with a typical laptop CPU. We recommend sticking with these smaller pretrained models unless you have access to a GPU, e.g., via Google Colab. 

WRITE YOUR ANSWER HERE (DESCRIPTION OF RESULTS FOR 6d):

Because the classes are unevenly distributed in the test set, I selected the weighted average F1 score as the evaluation metric. The values are shown in the table below.

Model | Weighted avg F1 score
-------- | -----
Classifier with frozen transformer | 0.53
Classifier with unfrozen (fine-tuned) transformer | 0.84

The fine-tuned model performs better than the model with the frozen transformer. 
Moreover, all of the frozen model's predictions are label 2, which further illustrates its poor performance. A likely explanation is that with the transformer frozen only the classification head is updated, and 30 epochs were not enough for it to move away from predicting the majority class (the training loss stayed near 1.36).
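
To make the weighted average concrete, the sketch below recomputes the frozen model's score by hand. It uses the class supports from the classification report in section 6d-4 (19, 16 and 69 lines for labels 0, 1 and 2) and the observation that every prediction is label 2. The helper functions are written for illustration and are not the scikit-learn implementation.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed as 2TP / (2TP + FP + FN); 0 if the class is never involved."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Average of the per-class F1 scores, weighted by each class's share of the true labels."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, c) * n / total for c, n in support.items())

y_true = [0] * 19 + [1] * 16 + [2] * 69  # supports from the classification report
y_pred = [2] * len(y_true)               # the frozen model predicts label 2 everywhere

score = weighted_f1(y_true, y_pred)
print(round(score, 2))
```

Classes 0 and 1 contribute an F1 of 0, so the whole score comes from class 2 scaled by its share of the test set, which gives the 0.53 reported in the table.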

In [65]:
### WRITE YOUR ANSWER HERE (Code for 6d; feel free to use multiple cells and copy code from above) ###

## 6d-1 Load Dataset (Poem Sentiment)

In [110]:
from datasets import load_dataset

cache_dir = "./data_cache"

train_dataset = load_dataset(
    path="poem_sentiment",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir
)

valid_dataset = load_dataset(
    path="poem_sentiment",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

test_dataset = load_dataset(
    path="poem_sentiment",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

Using custom data configuration default
Reusing dataset poem_sentiment (./data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)
Using custom data configuration default
Reusing dataset poem_sentiment (./data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)
Using custom data configuration default
Reusing dataset poem_sentiment (./data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)


In [111]:
# list the lengths (in characters) of the verse_text entries, longest first
max_len = []
for i in train_dataset['verse_text']:
    max_len.append(len(i))
max_len.sort(reverse=True)
max_len

[109, 104, 83, 78, 78, 76, 75, 73, 73, 73, 72, 72, 72, 72, 71, 70, 69, 69, 69, 69,
 68, 67, 67, 67, 67, 67, 66, 66, 66, 66, 65, 65, 65, 64, 64, 64, 63, 63, 63, 63,
 62, 62, 61, 60, 60, 60, 59, 59, 58, 57, 57, 56, 55, 55, 55, 55, 55, 55, 55, 54,
 54, 54, 54, 53, 53, 53, 53, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52,
 52, 51, 51, 51, 51, 51, 50, 50, 50, 50, 50, 50, 50, 49, 49, 49, 49, 49, 49, 49,
 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 48, 48, 48, 48, 48, 48, 48, 48, 48,
 48, 48, 48, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,
 47, 47, 47, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46,
 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45

## 6d-2 Tokenization & DataLoader

In [113]:
from transformers import AutoTokenizer

# import a tokenizer from a pretrained model
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

def tokenize_function(dataset):
    # pass the text of the poem verses to the tokenizer
    model_inputs = tokenizer(dataset['verse_text'], 
                             padding="max_length", 
                             max_length=120,            # pad every sequence up to a fixed length of 120 tokens
                             truncation=True)           # truncate any sequence longer than max_length
    return model_inputs


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /Users/kan/.cache/huggingface/hub/models--huawei-noah--TinyBERT_General_4L_312D/snapshots/34707a33cd59a94ecde241ac209bf35103691b43/config.json
Model config BertConfig {
  "_name_or_path": "huawei-noah/TinyBERT_General_4L_312D",
  "attention_probs_dropout_prob": 0.1,
  "cell": {},
  "classifier_dropout": null,
  "emb_size": 312,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "initializer_range": 0.02,
  "intermediate_size": 1200,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "pre_trained": "",
  "structure": [],
  "transformers_version": "4.25.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file vocab.txt from cache at /User
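
The effect of `padding="max_length"` and `truncation=True` can be sketched without the tokenizer. The function below is a simplified stand-in, not the library's implementation: it cuts a list of token ids at `max_length` and pads shorter lists with the pad id `0`, returning the matching attention mask. A real tokenizer also keeps the final [SEP] token when it truncates, which this sketch ignores. The token ids are made up.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # truncation: drop anything past max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids: one short sequence and one that is too long
short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Since the longest verse in the training set is 109 characters, a limit of 120 tokens should mean that truncation rarely, if ever, applies here.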

Use the `map()` method to apply the tokenizer to each example in the training, validation and test sets.

In [114]:
poem_train_dataset = train_dataset.map(tokenize_function, batched=True)
poem_valid_dataset = valid_dataset.map(tokenize_function, batched=True)
poem_test_dataset = test_dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at ./data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099/cache-1c80317fa3b1799d.arrow
Loading cached processed dataset at ./data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099/cache-bdd640fb06671ad1.arrow
Loading cached processed dataset at ./data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099/cache-3eb13b9046685257.arrow


## 6d-3 Train

In [117]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # the directory where model weights will be saved at certain points during training (checkpoints)
    num_train_epochs=30, # A sensible and sufficient number to use for the to-dos below
    per_device_train_batch_size=128,  # you can decrease this if memory usage is too high while training
    logging_steps=100,  # how often to print progress during training
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Train a classifier model with **frozen** parameters in the pretrained transformer

In [118]:
from transformers import AutoModelForSequenceClassification
from transformers import Trainer

num_labels = 4
# import the model from a pretrained model and set the parameter num_labels as 4
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=num_labels) 

# freeze the parameters of the pretrained transformer so that only the classification head is trained
for param in model.bert.parameters():
    param.requires_grad = False

frozen_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=poem_train_dataset,
    eval_dataset=poem_valid_dataset,
)
frozen_trainer.train()

loading configuration file config.json from cache at /Users/kan/.cache/huggingface/hub/models--huawei-noah--TinyBERT_General_4L_312D/snapshots/34707a33cd59a94ecde241ac209bf35103691b43/config.json
Model config BertConfig {
  "_name_or_path": "huawei-noah/TinyBERT_General_4L_312D",
  "attention_probs_dropout_prob": 0.1,
  "cell": {},
  "classifier_dropout": null,
  "emb_size": 312,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 1200,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "pre_trained": "",
  "structure": [],
  "transformers_version": "4.25.0",
  "type_vocab

  0%|          | 0/210 [00:00<?, ?it/s]

{'loss': 1.3695, 'learning_rate': 2.6190476190476192e-05, 'epoch': 14.29}
{'loss': 1.3548, 'learning_rate': 2.3809523809523808e-06, 'epoch': 28.57}




Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 270.2913, 'train_samples_per_second': 99.004, 'train_steps_per_second': 0.777, 'train_loss': 1.3616205624171667, 'epoch': 30.0}


TrainOutput(global_step=210, training_loss=1.3616205624171667, metrics={'train_runtime': 270.2913, 'train_samples_per_second': 99.004, 'train_steps_per_second': 0.777, 'train_loss': 1.3616205624171667, 'epoch': 30.0})
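
The numbers in this log can be reproduced with a little arithmetic. The sketch below assumes a training set of 892 lines, which is inferred from the logged throughput (99.004 samples per second over 270.29 seconds is about 26,760 samples, i.e. 30 epochs of 892). It also assumes a single device, an initial learning rate of 5e-5 and linear decay to zero with no warm-up. These are the `TrainingArguments` defaults as far as we know, and they are consistent with the logged learning rates, but treat them as assumptions.

```python
import math

num_examples = 892   # assumed training-set size, inferred from the logged throughput
batch_size = 128     # per_device_train_batch_size (single device assumed)
num_epochs = 30
logging_steps = 100
initial_lr = 5e-5    # assumed default learning rate

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step):
    """Learning rate under linear decay to zero with no warm-up (assumed schedule)."""
    return initial_lr * (1 - step / total_steps)

# One log line is expected every `logging_steps` optimisation steps
for step in range(logging_steps, total_steps + 1, logging_steps):
    print(step, round(step / steps_per_epoch, 2), linear_lr(step))
```

With 7 steps per epoch there are 210 steps in total, so `logging_steps=100` produces only two log lines, at epochs 14.29 and 28.57.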

Train a classifier model with **unfrozen** parameters in the pretrained transformer

In [120]:
from transformers import AutoModelForSequenceClassification
from transformers import Trainer

num_labels = 4
# import the model from a pretrained model and set the parameter num_labels as 4
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=num_labels) 

# leave the parameters of the pretrained transformer trainable (this is the default for a freshly loaded model)
for param in model.bert.parameters():
    param.requires_grad = True

unfrozen_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=poem_train_dataset,
    eval_dataset=poem_valid_dataset,
)

unfrozen_trainer.train()

loading configuration file config.json from cache at /Users/kan/.cache/huggingface/hub/models--huawei-noah--TinyBERT_General_4L_312D/snapshots/34707a33cd59a94ecde241ac209bf35103691b43/config.json
Model config BertConfig {
  "_name_or_path": "huawei-noah/TinyBERT_General_4L_312D",
  "attention_probs_dropout_prob": 0.1,
  "cell": {},
  "classifier_dropout": null,
  "emb_size": 312,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 1200,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "pre_trained": "",
  "structure": [],
  "transformers_version": "4.25.0",
  "type_vocab

  0%|          | 0/210 [00:00<?, ?it/s]

{'loss': 0.8489, 'learning_rate': 2.6190476190476192e-05, 'epoch': 14.29}
{'loss': 0.3988, 'learning_rate': 2.3809523809523808e-06, 'epoch': 28.57}




Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 508.5744, 'train_samples_per_second': 52.618, 'train_steps_per_second': 0.413, 'train_loss': 0.6107900653566632, 'epoch': 30.0}


TrainOutput(global_step=210, training_loss=0.6107900653566632, metrics={'train_runtime': 508.5744, 'train_samples_per_second': 52.618, 'train_steps_per_second': 0.413, 'train_loss': 0.6107900653566632, 'epoch': 30.0})

## 6d-4 Predict & Evaluate

In [84]:
import torch
import numpy as np

def predict_nn(trained_model, test_dataset):

    # Switch off dropout
    trained_model.eval()
    
    # Pass the required items from the dataset to the model    
    output = trained_model(attention_mask=torch.tensor(test_dataset["attention_mask"]), input_ids=torch.tensor(test_dataset["input_ids"]))
                        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    pred_labs = np.argmax(output["logits"].detach().numpy(), axis=1)

    return pred_labs

Test & evaluate the classifier with **frozen** parameters in the TinyBERT model

In [119]:
from sklearn.metrics import classification_report

pred_labs = predict_nn(trained_model=frozen_trainer.model, test_dataset=poem_test_dataset)
print(pred_labs)

print(classification_report(y_true=np.array(poem_test_dataset['label']), y_pred=pred_labs))

[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        19
           1       0.00      0.00      0.00        16
           2       0.66      1.00      0.80        69

    accuracy                           0.66       104
   macro avg       0.22      0.33      0.27       104
weighted avg       0.44      0.66      0.53       104
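
The frozen model predicts class 2 for every test example, so the report above is the score of a majority-class baseline. The following sketch recomputes the summary rows from the class supports alone. It assumes, as the report does, that an undefined precision (a class that is never predicted) is scored as 0. `prf` is a hypothetical helper, not a scikit-learn function.

```python
# Class supports taken from the report above; every prediction is class 2.
support = {0: 19, 1: 16, 2: 69}
predicted_class = 2
total = sum(support.values())


def prf(label):
    """Precision, recall and F1 for one class when all predictions are `predicted_class`."""
    tp = support[label] if label == predicted_class else 0
    n_predicted = total if label == predicted_class else 0
    # Precision is undefined when a class is never predicted; score it as 0.
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / support[label]
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


scores = {label: prf(label) for label in support}
accuracy = support[predicted_class] / total
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
weighted_f1 = sum(scores[label][2] * support[label] for label in support) / total

print(f"accuracy    = {accuracy:.2f}")
print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
```

The macro average weights all three classes equally, which is why it falls so far below the accuracy: two of the three classes contribute an F1 of zero.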





Test & evaluate the classifier with **unfrozen** parameters in the TinyBERT model

In [121]:
from sklearn.metrics import classification_report

pred_labs = predict_nn(trained_model=unfrozen_trainer.model, test_dataset=poem_test_dataset)
print(pred_labs)

print(classification_report(y_true=np.array(poem_test_dataset['label']), y_pred=pred_labs))

[2 1 2 2 1 0 0 2 0 2 1 2 2 2 2 0 2 1 2 2 2 2 1 2 0 0 2 2 0 0 1 2 2 2 1 0 2
 2 2 1 0 2 2 1 2 2 0 2 0 0 1 2 0 2 0 2 2 2 2 2 0 1 0 2 0 2 2 2 0 2 1 2 2 2
 2 2 2 2 2 2 0 2 1 2 0 0 2 2 0 2 2 2 0 2 1 2 2 2 2 2 2 2 2 0]
              precision    recall  f1-score   support

           0       0.72      0.95      0.82        19
           1       0.71      0.62      0.67        16
           2       0.91      0.86      0.88        69

    accuracy                           0.84       104
   macro avg       0.78      0.81      0.79       104
weighted avg       0.84      0.84      0.84       104



**TO-DO 6e:** Did your sentiment classifier make use of any kind of model transfer or transfer learning? If so, what kinds of transfer were used and what benefit do they provide? **(4 marks)**

WRITE YOUR ANSWER HERE:

Yes. Training the sentiment classifier with the TinyBERT parameters frozen is **direct transfer**. In contrast, training it with the TinyBERT parameters unfrozen is **inductive transfer learning**.

Transfer provides the following three benefits:
1. It leverages large unlabelled datasets for pretraining. The TinyBERT model has learned general knowledge of language this way, so it can provide [CLS] embeddings that represent the meaning of the whole sentence.
2. It uses the easiest task for learning a particular feature. Because of transfer, the sentiment classifier is a combination of TinyBERT and one additional output layer, so only a minimal number of parameters need to be learned from scratch during training.
3. It shares information between related tasks.
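
To put a rough number on "a minimal number of parameters", the sketch below counts the weights in the added output layer. It assumes the head is a single linear layer mapping the sentence representation to one logit per class, and it reads `hidden_size` and the number of labels from the model config printed earlier in this notebook.

```python
# Values read from the model config printed earlier in this notebook.
hidden_size = 312   # "hidden_size"
num_labels = 4      # number of entries in "id2label"

# Assumption: the head is one linear layer, logits = W @ h + b,
# with W of shape (num_labels, hidden_size) and b of shape (num_labels,).
weight_params = hidden_size * num_labels
bias_params = num_labels
head_params = weight_params + bias_params

print(f"Parameters learned from scratch in the output layer: {head_params}")
```

Everything else in the classifier starts from pretrained values, whether those values are then kept frozen or updated during training.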

---

**TO-DO 6f:** Use your model to compute the probability of sentiment for a sentence of your choosing. Comment your code and print the sentence with its probability distribution. Label the values so that we know which class they refer to. **(4 marks)**

Hint: you could use a poem generator, such as [this one](https://www.poemofquotes.com/tools/poetry-generator/ai-poem-generator), to generate a test sentence. 

In [134]:
# WRITE YOUR ANSWER HERE   
import pandas as pd

test_sentence = ["The dreaded deadline.The deadline that leaves you with nightmares"]

# Tokenise the sentence (named `inputs` to avoid shadowing Python's built-in `input`)
inputs = tokenizer(test_sentence, padding="max_length", max_length=120, truncation=True)

# Switch off dropout
unfrozen_trainer.model.eval()
    
# Pass the test_sentence to the model    
output = unfrozen_trainer.model(attention_mask=torch.tensor(inputs["attention_mask"]), input_ids=torch.tensor(inputs["input_ids"]))

# The output dictionary contains logits, which are the unnormalised scores for each class for each example
# A softmax function is applied to re-scale the output so that the elements lie in the range [0, 1] and sum to 1.
probability = torch.softmax(output["logits"].detach(), dim=1)

# print the result
labels = ['negative', 'positive', 'no impact', 'mixed sentiment']
print(f'{"Label":>18} \t Probability')
for i in range(num_labels):
    print(f'{str(i) + ":" + labels[i]:>18} \t {probability[0][i]:.4f}')


             Label 	 Probability
        0:negative 	 0.6237
        1:positive 	 0.1075
       2:no impact 	 0.0396
 3:mixed sentiment 	 0.2292
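
For reference, the softmax step can be written out by hand. This is a minimal sketch using only the standard library. The logits are hypothetical values chosen for illustration, not the model's actual output, and `softmax` here is a stand-in for `torch.softmax` applied to a single example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["negative", "positive", "no impact", "mixed sentiment"]
example_logits = [2.0, 0.2, -0.8, 1.0]  # hypothetical values, not model output

probs = softmax(example_logits)
for label, p in zip(labels, probs):
    print(f"{label:>16}: {p:.4f}")
print(f"{'sum':>16}: {sum(probs):.4f}")
```

Because the exponential is monotonic, the class with the largest logit always receives the largest probability, so taking the `argmax` of the logits and of the probabilities gives the same predicted label.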


# 7. OPTIONAL: More on Transformers

There are many great resources out there to show you how to use this kind of model in practice:
* An extensive online course is provided by HuggingFace: https://huggingface.co/course/chapter1/1. The pages linked from the HuggingFace course website have an 'open in Colab' button on the top right. You can open the notebook and run it on a Google server there to access GPUs.
* Chapters that may be particularly useful: 
   * Transformers, what can they do? https://huggingface.co/course/chapter1/3?fw=pt
   * Using Transformers: https://huggingface.co/course/chapter2/2?fw=pt
* HuggingFace provides information on fine-tuning transformer models here: https://huggingface.co/docs/transformers/training. Fine-tuning updates the weights inside the pretrained network and, for larger models, usually requires substantial GPU or TPU computing. 
* Text Generation: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb. This topic goes well beyond the data analytics covered in this unit, but it shows you another powerful feature of pretrained transformers.


