# Advanced Text Analytics Lab 2

This notebook is the second of two lab notebooks that you will submit as part of your assessment for the Advanced Data Analytics unit. The notebook contains three required sections plus an optional section:

4. **Introducing Transformers:** This section introduces the Transformers library from HuggingFace, showing you how to use it to obtain contextualised embeddings from pretrained transformer models.

5. **Question Answering with Pretrained Transformers:** Learn about how to use a pretrained model to perform automatic question answering. 

6. **Transformers for Text Classification:** Here we show you how to construct a classifier using Transformers.

7. **OPTIONAL: More on Transformers:** Some pointers to other materials if you want to learn more about transformers, e.g., if using them in your summer project. 

Example code for all the tasks has been tested on a three-year-old MacBook Pro, and the longest training process took under 10 minutes. If you find that the code takes too long to run on your own machine, you can try [Google Colab](https://colab.research.google.com/) or Amazon SageMaker Studio, or use the lab machines on campus provided by the school. 

## Learning Outcomes

These sections will contain tutorial-like instructions, as you have seen in previous text analytics labs. On completing these sections, the intended learning outcomes are that you will be able to...
1. Use pretrained transformers to obtain contextualised word and sentence embeddings.
1. Apply a pretrained QA model to a new dataset. 
1. Construct classifiers with pretrained transformers. 
1. Find documentation on pretrained models in the Transformers library.

## Your Tasks

Inside each of these sections there are several **'To-do's**, which you must complete for your summative assessment. Your marks will be based on your answers to these to-dos. Please make sure to:
1. Include the output of your code in the saved notebook. Plots and printed output should be visible without re-running the code. 
1. Include all code needed to generate your answers.
1. Provide sufficient comments to understand how your method works.
1. Write text in a cell in markdown format where a written answer is required. You can convert a cell to markdown format by pressing Escape-M. 

There are also some unmarked 'to-do's that are part of the tutorial to help you learn how to implement and use the methods studied here.

## Good Academic Practice

Please follow [the guidance on academic integrity provided by the university](http://www.bristol.ac.uk/students/support/academic-advice/academic-integrity/).
You are required to write your own answers -- do not share your notebooks or copy someone else's writing. Do not copy text or long blocks of code directly into the notebook from online sources -- always rewrite in your own way. Breaking the rules can lead to strong penalties. 

## Marking Criteria

1. The coursework (both notebooks) is worth 30% of the unit in total. 
1. There is a total of 100 marks available for both lab notebooks. 
1. This notebook is worth 50 of those marks.
1. The number of marks for each to-do out of 100 is shown alongside each to-do.
1. For to-dos that require you to write code, a good solution would meet the following criteria (in order of importance):
   1. Solves the task or answers the question asked in the to-do. This means, if the code cells in the notebook are executed in order, we will get the output shown in your notebook.
   1. The code is easy to follow and does not contain unnecessary steps.
   1. The comments show that you understand how your solution works.
   1. A very good answer will also provide code that is computationally efficient but easy to read.
1. You can use any suitable publicly available libraries. Unless the task explicitly asks you to implement something from scratch, there is no penalty for using libraries to implement some steps.

## Support

The main source of support will be during the remaining lab sessions (Fridays 3-6pm) for this unit. 

The TAs and lecturer will help you with questions about the lectures, the code provided for you in this notebook, and general questions about the topics we cover. For the marked 'to-dos', they can only answer clarifying questions about what you have to do. 

Office hours: You can book office hours with Edwin on Mondays 3pm-5pm by sending him an email (edwin.simpson@bristol.ac.uk). If those times are not possible for you, please contact him by email to request an alternative. 

## Deadline

The notebook must be submitted along with the first notebook on Blackboard before **Wednesday 24th May at 13.00**. 

## Submission

You will need to zip up this notebook with the previous notebook into a single .zip file, which you will submit to Blackboard through the 'assessment, submission and feedback' link on the left sidebar. 

Please name your files like this:
   * Name this notebook ADA2_<student_number>.ipynb
   * Name the zip file <student_number>.zip
   * Please don't use your name anywhere as we want to mark anonymously. 

In [1]:
import numpy as np
import torch 
from datasets import load_dataset

cache_dir = "./data_cache"

# 4. Pretrained Transformers (max. 15 marks)

HuggingFace is a company that has developed Transformers, an open-source library for loading pretrained transformer models. They also distribute many models that have been pretrained using language modelling tasks, or fine-tuned to specific downstream NLP tasks. Transformers is currently one of the most widely used libraries for creating NLP models on top of large, deep neural networks. This is especially useful for tasks where simpler, feature-based methods or smaller LSTM models do not perform well enough, for example, when complex processing of syntax and semantics is required (natural language 'understanding'). 

Let's start by looking at two key types of object in the transformers library: models and tokenizers.

## 4.1. Models

The neural network models available in the Transformers library are accessed through wrapper classes such as `AutoModel`. If we want to load a pretrained model, we can simply pass its name to the `from_pretrained` function, and the pretrained model weights will be downloaded from HuggingFace and a neural network model will be created with those weights. For example:

In [2]:
from transformers import AutoModel # For BERTs

model = AutoModel.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'fit_denses.2.weight', 'fit_denses.2.bias', 'fit_denses.3.bias', 'fit_denses.0.bias', 'fit_denses.1.weight', 'fit_denses.0.weight', 'fit_denses.3.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'fit_denses.1.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'fit_denses.4.bias', 'cls.seq_relationship.bias', 'fit_denses.4.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing

This code loads the TinyBERT model, which is a compressed version of BERT. It has about 14.5 million parameters (as reported by its authors), compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While TinyBERT will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D).

<!--the RoBERTa variant of BERT. It has 4.4 million parameters, compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While RoBERTa-tiny will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/arampacha/roberta-tiny).  -->

The same functions can be used to load other models from HuggingFace's repository simply by changing the model's name. Take a look at [the Models page](https://huggingface.co/models) to see what there is on offer. Do you recognise any of the models' names?

## 4.2. Tokenizers

Before we can apply a model to some text, we need to create a Tokenizer object. In Transformers, Tokenizer objects convert raw text to a sequence of numbers. First, the tokenizer performs the tokenization itself, then it maps each token to its numerical ID. There are lots of different tokenizers that we can use to preprocess text. If we are loading a pretrained model, we will need to choose the tokenizer that corresponds to that model. 

We can load the right tokenizer as follows, in the same way we loaded the model itself:

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

Let's see what the TinyBERT tokenizer does to an example sentence:

In [4]:
sentence = "The transformer architecture has transformed the field of NLP."

tokens = tokenizer.tokenize(sentence)
print(tokens)

['the', 'transform', '##er', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'nl', '##p', '.']


Let's compare with the NLTK tokenizer we have seen before:

In [5]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)

['The', 'transformer', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'NLP', '.']


While NLTK keeps whole words as tokens, the BERT tokenizer lowercases the text, splits some words into sub-words, and marks each continuation piece with a `##` prefix. Splitting is applied to words with low frequency in the tokenizer's training data, such as 'transformer'. 
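
To make the sub-word splitting concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy used by WordPiece-style tokenizers such as BERT's. The function name `wordpiece_split` and the tiny `toy_vocab` are hypothetical and made up for illustration; the real tokenizer has a far larger vocabulary and also handles lowercasing and punctuation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first search."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece in the vocabulary fits: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the real TinyBERT vocabulary).
toy_vocab = {"the", "transform", "##er", "##ed", "nl", "##p"}

print(wordpiece_split("transformer", toy_vocab))  # ['transform', '##er']
print(wordpiece_split("nlp", toy_vocab))          # ['nl', '##p']
print(wordpiece_split("xyz", toy_vocab))          # ['[UNK]']
```

Note that a rare word is only mapped to `[UNK]` when none of its pieces can be found, which is why sub-word vocabularies rarely produce unknown tokens.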

**TO-DO 4a:** What is the benefit of splitting rare words into sub-word tokens? **(2 marks)**

WRITE YOUR ANSWER HERE:

The benefit is that the model can handle words that were not seen during training: instead of being treated as out-of-vocabulary, they are broken into known sub-word tokens. It can also drastically reduce the size of the vocabulary that the model needs to handle, making the model more memory-efficient and speeding up training and inference. 


---

It is important to use the right tokenizer with a pretrained model as each model was trained with text tokenized in a particular way. After tokenization, the Tokenizer object can also map the tokens to their IDs (indexes in the vocabulary):

In [6]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1996, 10938, 2121, 4294, 2038, 8590, 1996, 2492, 1997, 17953, 2361, 1012]


## 4.3. Contextualised Embeddings

Now that we have a sequence of tokens, we are almost ready to process the sequence using the pretrained model. 

Our model takes as input a PyTorch `tensor` object. In PyTorch, a `tensor` is a multi-dimensional array. Here, we need a two-dimensional tensor, where each row is a sequence of input tokens corresponding to a single sentence or document. Let's convert our list of IDs to a 2-D tensor with a single row:

In [7]:

ids_tensor = torch.tensor([ids])

print(ids_tensor)

tensor([[ 1996, 10938,  2121,  4294,  2038,  8590,  1996,  2492,  1997, 17953,
          2361,  1012]])


Now we can process the sequence using our model. The model maps the sequence of input IDs to a sequence of output vectors, which are contextualised word embeddings. The hidden state values produced in the last hidden layer of the model are used as the contextualised embeddings:

In [8]:
model_outputs = model(ids_tensor)
print('The complete model outputs: ')
print(model_outputs)

print()
print('The last hidden states for the first sequence in the batch (one embedding per token): ')
embeddings = model_outputs['last_hidden_state'][0]
print(embeddings)

The complete model outputs: 
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3608,  0.2862, -0.1549,  ..., -0.2064,  0.2663, -0.0109],
         [ 0.0149,  0.7223, -0.0508,  ..., -0.5505,  0.2355, -0.2962],
         [ 0.1531,  0.5903, -0.1244,  ..., -0.4263,  0.0417, -0.1839],
         ...,
         [ 0.1742, -0.1091, -0.1963,  ..., -0.6736,  0.0472, -0.1840],
         [ 0.2434,  0.1021, -0.2241,  ..., -0.5400, -0.1691, -0.1314],
         [ 0.0854,  0.3272, -0.3016,  ..., -0.2154, -0.5632, -0.1921]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-1.1380e-02, -6.3006e-03,  1.8521e-02,  7.1139e-03, -3.1795e-02,
          1.3882e-02, -1.5459e-02, -1.0610e-03, -1.8263e-02, -3.6515e-02,
         -2.1257e-02, -1.5479e-02, -2.8092e-04, -4.1093e-02, -2.5315e-02,
         -4.3338e-02, -1.1616e-03, -1.3931e-02,  6.0733e-03,  4.3790e-03,
          2.7090e-04, -2.1810e-02, -4.8026e-02,  2.5493e-02, -1.6502e-02,
         -1.2034e-03,  4.2757e-02,  3.

We can retrieve the embedding vector for "transform" like this ("transform" is the second token in the sequence):

In [9]:
emb = embeddings[1]  # get second embedding in the sequence

# convert it to a numpy array so we can perform various operations on it later on
emb = emb.detach().numpy()

print(emb)
print(f'The TinyBERT embeddings have {emb.shape[0]} dimensions.')

[ 1.49155427e-02  7.22318113e-01 -5.07865176e-02 -2.74206609e-01
 -1.38931692e-01  1.00099540e+00  7.11429864e-03  2.71392524e-01
 -3.92823443e-02  6.04107268e-02  1.25740260e-01  4.60632175e-01
  6.25270605e-03  1.61929548e-01  1.23913646e-01 -4.08096492e-01
  1.24867544e-01 -4.71536160e-01  2.24768862e-01  6.35189414e-02
  8.56178999e-02 -1.88045263e-01  1.77257776e-01  3.40049475e-01
 -1.95545897e-01  1.58554405e-01  9.62863415e-02  1.12649783e-01
  2.21045017e-01 -9.56112981e-01 -3.85949075e-01  1.39220759e-01
  5.90011537e-01 -8.06727767e-01 -1.34287819e-01  2.35692143e-01
 -1.02274813e-01  2.78302789e-01  7.94321358e-01 -2.49362707e-01
  1.72772393e-01 -2.07582787e-01  3.00156832e-01 -8.59342963e-02
 -2.25284770e-01 -9.75407809e-02 -3.52349520e-01  3.81161153e-01
 -3.87681544e-01 -1.77613750e-01 -4.13686156e-01  1.38046280e-01
  1.29867718e-02  6.52684391e-01  1.16502836e-01 -5.10778427e-01
 -8.30404162e-02 -2.67044231e-02  3.12863350e-01 -2.62848616e-01
 -1.43285260e-01  1.10270

TO-DO 4b: Retrieve the embedding for "architecture" (this to-do will not be marked).

In [10]:
# WRITE YOUR ANSWER HERE

position = tokens.index('architecture')

# Get the embedding for 'architecture'
architecture_emb = embeddings[position]

# Convert it to a numpy array
architecture_emb = architecture_emb.detach().numpy()

print(architecture_emb)
print(f'The TinyBERT embeddings for "architecture" have {architecture_emb.shape[0]} dimensions.')


[ 2.71387130e-01  7.74581850e-01 -3.24257016e-01 -7.14323670e-02
 -4.95385379e-04  9.37310040e-01 -4.40284982e-03 -4.26921584e-02
  1.27404267e-02  1.89277846e-02  1.02527849e-01  4.54655915e-01
  2.70436138e-01  2.30988413e-01  4.03663330e-03 -1.08995482e-01
 -4.59910110e-02 -3.51154387e-01 -1.34710342e-01  8.29388499e-02
  1.86496913e-01  5.00274561e-02  7.21661821e-02  2.28657156e-01
 -2.19696999e-01  9.40199867e-02  1.65539995e-01  1.85794756e-01
  3.17783266e-01 -5.09367108e-01 -5.00949562e-01  1.52487800e-01
  4.57999200e-01 -8.51875782e-01 -1.58632979e-01  1.58965096e-01
  4.16193828e-02  2.30997890e-01  8.78503144e-01 -6.23159632e-02
  1.87219620e-01 -1.23370588e-02  2.10084260e-01  3.48071381e-02
 -2.51240134e-01 -1.37914822e-01 -3.88696730e-01  2.98189640e-01
 -2.92033404e-01 -3.19503993e-01 -1.98435634e-01  1.32032171e-01
 -6.46373034e-02  7.43182778e-01  7.14239180e-02 -3.02118361e-01
  3.49781871e-01 -5.81784546e-02  2.85069406e-01 -4.09580857e-01
 -1.03297204e-01  1.03767

Sentences and documents usually have varying lengths. So, to put multiple sentences into a single tensor, we need to pad the sequences up to a maximum length. Luckily, the tokenizer class takes care of this for us. When we pass in a list of sentences, the tokenizer creates a matrix, where each row is a sequence of the same length:

In [11]:
sentences = [
    "They received a loan from the bank.",
    "It was not good for either his bank balance or his blood pressure.",
    "She walked along the bank of the river towards the city.",
    "They bank their cheques on Thursdays.",
    "She walked along the embankment towards the city."
]

model_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  

print(model_inputs)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': tensor([[  101,  2027,  2363,  1037,  5414,  2013,  1996,  2924,  1012,   102,
             0,     0,     0,     0,     0,     0],
        [  101,  2009,  2001,  2025,  2204,  2005,  2593,  2010,  2924,  5703,
          2030,  2010,  2668,  3778,  1012,   102],
        [  101,  2016,  2939,  2247,  1996,  2924,  1997,  1996,  2314,  2875,
          1996,  2103,  1012,   102,     0,     0],
        [  101,  2027,  2924,  2037, 18178, 10997,  2006,  9432,  2015,  1012,
           102,     0,     0,     0,     0,     0],
        [  101,  2016,  2939,  2247,  1996, 22756,  2875,  1996,  2103,  1012,
           102,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': t

In [12]:
# Confirming the value of the padding token 
print(tokenizer.pad_token_id)

0


`model_inputs` is a dictionary containing three objects:
 * The `input_ids` are the list of token IDs in the input sequences. 
 * The `attention_mask` records which tokens are special padding tokens and which are real tokens. Tokens with a 0 in the attention mask will be ignored.
 * `token_type_ids` is needed when two sequences are passed together as input to the model for tasks such as next sentence prediction that involve comparing two sentences. Here, each input is a single sentence, so we have only one type of token in the output above. 
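
To see how padding and the attention mask relate, here is a minimal pure-Python sketch. The helper `pad_batch` is hypothetical (it is not part of the Transformers library), and it assumes padding is added on the right with a padding ID of 0, as in the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token IDs to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding token that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two toy sequences of different lengths (IDs chosen only for illustration).
toy_batch = [[101, 2027, 2924, 102], [101, 2016, 2939, 2247, 1996, 2924, 102]]
padded_ids, mask = pad_batch(toy_batch)

for row_ids, row_mask in zip(padded_ids, mask):
    print(row_ids, row_mask)
```

Every row now has the same length, so the batch can be stored in a single 2-D tensor.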
 
TO-DO 4c: What value do the special padding tokens have? (this to-do is unmarked)

ANSWER: 

0

---

Notice that the input_ids all start with the same token ID, 101, even though they have different first words. They also have token ID 102 before the padding tokens. This is because the tokenizer inserts two special tokens, which are used in some applications of BERT. 101 is the '[CLS]' token, which is a dummy token whose embedding can be trained to represent the whole sequence. The [CLS] token's embedding can then be used as input to a text classifier to classify a sentence or document. Token 102 is '[SEP]', which can be used to separate multiple input sequences in a single example. This is needed in tasks where multiple pieces of text are provided as input, e.g., to build a classifier that can determine whether two sentences contradict each other. 
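
As an illustration of how the special tokens frame a pair of inputs, the sketch below lays out two already-tokenized sentences in the usual BERT arrangement, `[CLS] A [SEP] B [SEP]`, together with the matching `token_type_ids`. The helper `build_pair_input` and the example sentences are hypothetical; in practice the tokenizer does this for you when you pass it two strings.

```python
def build_pair_input(tokens_a, tokens_b):
    """Arrange two token lists as [CLS] A [SEP] B [SEP] with segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


sentence_a = ["it", "is", "raining"]
sentence_b = ["the", "sky", "is", "clear"]
pair_tokens, pair_types = build_pair_input(sentence_a, sentence_b)

for token, segment in zip(pair_tokens, pair_types):
    print(f"{token:10s} segment {segment}")
```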

We can now pass all of the model inputs to the model to produce a set of contextualised embeddings:

In [13]:
# model_inputs is a dictionary, so to provide the arguments to model(), 
# we use the double star to unpack the dictionary so that each key in the dictionary is
# an argument to model() and each value is the value of the argument. 
model_outputs = model(**model_inputs) 

**TO-DO 4d:** The first four example sentences above all contain the word "bank", and the last example contains "embankment". Obtain a list of contextualised word embeddings for 'bank' and 'embankment' in the example sentences using our model. **(4 marks)**

Hint: you may need to convert tensors to numpy arrays.

In [14]:
#WRITE YOUR OWN CODE HERE

# Define the words of interest
words_of_interest = ['bank', 'embankment']

# Tokenize the words of interest
tokenized_words = [tokenizer.tokenize(word) for word in words_of_interest]

# Initialize an empty list to store the embeddings
embeddings_list = []

# Iterate over each tokenized sentence
for i, input_ids in enumerate(model_inputs['input_ids']):
    # Convert input_ids tensor to a list
    input_ids_list = input_ids.tolist()
    # Decode the input_ids to get the tokenized sentence
    tokenized_sentence = tokenizer.convert_ids_to_tokens(input_ids_list)
    # Initialize an empty list to store the embeddings for this sentence
    embeddings_sentence = []
    # Iterate over each tokenized word of interest
    for tokenized_word in tokenized_words:
        # If the word was split into several sub-word tokens, use only the first one
        if tokenized_word[0] in tokenized_sentence:
            # Get the position of the word in the tokenized sentence
            position = tokenized_sentence.index(tokenized_word[0])
            # Get the embedding for the word
            word_embedding = model_outputs[0][i][position]
            # Convert it to a numpy array and append to the list
            embeddings_sentence.append(word_embedding.detach().numpy())
        else:
            embeddings_sentence.append(None)
    # Append the embeddings for this sentence to the main list
    embeddings_list.append(embeddings_sentence)

# Convert the main list to a numpy array
embeddings_array = np.array(embeddings_list, dtype=object)

# Print the shape of the array
print(embeddings_array.shape)



(5, 2)


**TO-DO 4e:** Compute the similarities between these embeddings in the cell below, and show the results. Which embeddings are most similar to one another and why? **(6 marks)**

WRITE YOUR ANSWER HERE:

The 'bank' embeddings from the first and second sentences are the most similar to each other, with a cosine similarity of approximately 0.633. This is because both sentences use 'bank' in a financial context, so the model produces similar contextualised embeddings for the word. As for 'embankment', it appears in only one sentence, and the cell below compares it only with itself, so no similarity between 'embankment' and the 'bank' embeddings is reported. However, since its meaning is closest to the riverbank sense of 'bank', we would expect its embedding to be most similar to that of 'bank' in the third sentence.
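
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The following is a minimal pure-Python sketch; the function name `cosine` and the toy vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
print(cosine([1.0, 2.0], [2.0, 4.0]))  # approximately 1.0 (same direction)
print(cosine([1.0, 0.0], [-1.0, 0.0])) # -1.0 (opposite directions)
```

The same function could be applied to any pair of embeddings of equal length, for example the 'embankment' embedding and each of the 'bank' embeddings.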


In [15]:
from sklearn.metrics.pairwise import cosine_similarity

# Separate the embeddings for "bank" and "embankment"
bank_embeddings = embeddings_array[:,0]
embankment_embeddings = embeddings_array[:,1]

# Remove None values
bank_embeddings = [emb for emb in bank_embeddings if emb is not None]
embankment_embeddings = [emb for emb in embankment_embeddings if emb is not None]

# Compute similarities between all pairs of "bank" embeddings
bank_similarities = cosine_similarity(bank_embeddings)

# Compute similarities between all pairs of "embankment" embeddings
# In this case, it's just one embedding, so the similarity is 1
embankment_similarities = cosine_similarity(embankment_embeddings)

# Print the similarities
print('Similarities between "bank" embeddings:')
print(bank_similarities)

print('Similarities between "embankment" embeddings:')
print(embankment_similarities)


Similarities between "bank" embeddings:
[[0.9999998  0.6331346  0.4877863  0.5232957 ]
 [0.6331346  0.9999998  0.43572852 0.4705649 ]
 [0.4877863  0.43572852 0.99999994 0.47878546]
 [0.5232957  0.4705649  0.47878546 1.0000001 ]]
Similarities between "embankment" embeddings:
[[1.0000001]]


**TO-DO 4f:** Use the [CLS] token's embedding to find the most similar **sentence** to "She walked along the embankment towards the city." from the first four sentences. Print the similarities and the selected sentence. **(3 marks)**

In [16]:
# WRITE YOUR OWN CODE HERE

# Get the [CLS] embeddings for each sentence
cls_embeddings = model_outputs.last_hidden_state[:, 0, :].detach().numpy()

# Compute similarities between the [CLS] embedding for the last sentence and for the other sentences
similarities = cosine_similarity(cls_embeddings[-1].reshape(1, -1), cls_embeddings[:-1])

# Print the similarities
for i, similarity in enumerate(similarities[0]):
    print(f'Similarity between sentence 5 and sentence {i+1}: {similarity}')

# Find the most similar sentence
most_similar_index = np.argmax(similarities)
most_similar_sentence = sentences[most_similar_index]

print('The most similar sentence to "She walked along the embankment towards the city." is:')
print(most_similar_sentence)



Similarity between sentence 5 and sentence 1: 0.909304141998291
Similarity between sentence 5 and sentence 2: 0.7936346530914307
Similarity between sentence 5 and sentence 3: 0.9947961568832397
Similarity between sentence 5 and sentence 4: 0.8988915681838989
The most similar sentence to "She walked along the embankment towards the city." is:
She walked along the bank of the river towards the city.


# 5. Question Answering with Pretrained Transformers (max. 11 marks)

The previous section showed us how to obtain a sequence of contextualised word embeddings using a pretrained transformer. How are these embeddings used to extract the answer to a given question from a document?

First, let's load up the [Tweet QA](https://huggingface.co/datasets/tweet_qa) dataset, which we will use to test a pretrained question answering (QA) model. This dataset contains tweets along with questions about the information in the tweets, and a list of correct answers. As we are not going to train our own QA model (it requires a lot of compute time), we will only need the validation set:

In [17]:
from sklearn.metrics import f1_score

val_dataset = load_dataset(
    "tweet_qa",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

Using custom data configuration default
Reusing dataset tweet_qa (./data_cache/tweet_qa/default/1.0.0/7d588f7f477946b10f60c035ca55175737315ac446102b015218af38d2638777)


Validation dataset with 1086 instances loaded


Now we are working with a complete dataset using the HuggingFace datasets library. In the next cell, we create a tokenizer to tokenize the examples in the dataset. We need to choose the right tokenizer for the QA model we want to use, so let's decide to use `"distilbert-base-cased-distilled-squad"` as our pretrained model. This is based on a smaller version of BERT, called DistilBERT, which was fine-tuned on the SQuAD question answering dataset.

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad") 

def tokenize_function(dataset):
    # Pass two strings to the tokenizer -- it will concatenate them with a [SEP] special token between them. 
    model_inputs = tokenizer(dataset['Question'], dataset['Tweet'], padding="max_length", max_length=200, truncation='only_second')
    return model_inputs

Again, we can use the `map()` method to apply the tokenizer to each example in the dataset. 

In [19]:
val_dataset = val_dataset.map(tokenize_function, batched=True) 



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



The type of QA model we are going to work with is _extractive_, meaning that the model will extract the answer from the 'context' (also known as the 'passage' or 'source document'). It does this by identifying the index of the start and end tokens of the answer span within the context, or returning `(0, 0)` (the index 0 for both the start and end token) if the context does not contain an answer to the given question. 

As explained in the lectures, BERT forms the basis of the QA model, and maps each token to a contextualised embedding. The QA model then maps each token's contextualised embedding to the probability that the token is the start of the answer span, and to the probability that the token is the end of the answer span. The layers that map the embeddings to the start and end probabilities are known as the 'head' of the model. [The original BERT paper](https://arxiv.org/pdf/1810.04805.pdf) depicts the QA model like this (Devlin et al., 2018):

<img src="bert_qa.png" alt="BERT QA diagram from the slides in week 10 showing the embedding of each token connected to the start and end output layers" width="400px"/>
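
The QA head described above can be sketched in a few lines of NumPy. This is a minimal illustration with random numbers, assuming the head is a single linear layer that produces two scores per token (one for 'start', one for 'end'). The toy sizes, the random weights and all variable names are made up; in the real model the weights are learned during fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size = 6, 8  # toy sizes, far smaller than a real model

# Stand-in for the contextualised embeddings of one input sequence.
token_embeddings = rng.normal(size=(seq_len, hidden_size))

# Hypothetical head: one linear layer mapping each embedding to two logits.
head_weights = rng.normal(size=(hidden_size, 2))
head_bias = np.zeros(2)

logits = token_embeddings @ head_weights + head_bias  # shape (seq_len, 2)
start_logits, end_logits = logits[:, 0], logits[:, 1]


def softmax(x):
    """Normalise a vector of logits into probabilities that sum to 1."""
    exp = np.exp(x - x.max())  # subtract the max for numerical stability
    return exp / exp.sum()


# The softmax runs across token positions, so each distribution sums to 1.
toy_probs_start = softmax(start_logits)
toy_probs_end = softmax(end_logits)

print(toy_probs_start.round(3), toy_probs_start.sum())
print(toy_probs_end.round(3), toy_probs_end.sum())
```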

We can see a similar structure in most neural network models. Our original text classifier from the first notebook used a fully-connected layer to produce a hidden representation of the whole sentence (rather than using BERT to produce a sequence of embeddings). This hidden representation was then fed to an output layer to produce a probability distribution over class labels (rather than the start and end probabilities):

<img src="neural_text_classifier_smaller.png" alt="Neural text classifier diagram from the slides in lecture 8.1" width="400px"/>


<!--With transformers, 
we can do something very similar, by connecting the transfomer's output to a fully-connected layer. However, with BERT, we do not need to pass the embedding of each individual word to the fully-connected layer because there is a special [CLS] token that represents the whole sentence:

The code below shows how to access a tensor containing the [CLS] embeddings:-->

Now that we have the dataset in the right format, let's see how to load a pretrained QA model based on a pretrained transformer. The QA model was trained by taking a pretrained BERT model (pretrained on masked language modelling with unlabelled text), adding the QA head, then further training the complete model on a QA dataset. 

The transformers library provides some useful wrapper classes for loading pretrained models for various NLP tasks, such as QA or text classification. These 'auto' classes are documented here: https://huggingface.co/docs/transformers/model_doc/auto . Let's use an auto class to load the `"distilbert-base-cased-distilled-squad"` pretrained QA model (this code will try to reload the model from a cache or download the model from HuggingFace):

In [20]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

As our model has already been fine-tuned for QA, we can use it directly on our Tweet_QA dataset (you may see a message to this effect when you run the cell above the first time). 

So, how do we get a prediction from the model? Let's take the first few examples from Tweet_QA and obtain the start and end probabilities for every token in each input sequence:

In [21]:
def predict_nn(qa_model, dataset):
    
    # Switch off dropout
    qa_model.eval()

    # Pass the required inputs from the dataset to the model    
    output = qa_model(attention_mask=torch.tensor(dataset["attention_mask"]), input_ids=torch.tensor(dataset["input_ids"]))
        
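    # the output contains start and end logits, which are unnormalised scores for each token position in each example;
    # the softmax over dim=1 normalises these scores across the token positions of each example: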
    probs_start = torch.nn.Softmax(dim=1)(output["start_logits"]).detach().numpy()
    probs_end = torch.nn.Softmax(dim=1)(output["end_logits"]).detach().numpy()
        
    return probs_start, probs_end

# Run the prediction function to get the results for the first 20 examples:
probs_start, probs_end = predict_nn(model, val_dataset[0:20])

Now that we have the probabilities that each token is a start or end token, we combine these probabilities to estimate the probability of each possible answer span. This will allow us to choose the answer span with highest probability. 

In the next cell is our first attempt, which you will need to improve to get valid answers. This code loops through each possible combination of start and end tokens, obtains the start and end probabilities, and extracts the answer text for the corresponding span.
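
As a small worked example of combining the two distributions, the sketch below scores every (start, end) pair as the product of the start and end probabilities, which assumes the two are independent, and then keeps only spans whose end is not before their start. The toy probabilities and variable names are hypothetical and are not taken from the model.

```python
import numpy as np

# Toy start/end probabilities for a three-token sequence (made-up values).
toy_probs_start = np.array([0.1, 0.7, 0.2])
toy_probs_end = np.array([0.6, 0.1, 0.3])

# span_probs[i, j] = P(start = i) * P(end = j) for every pair of positions.
span_probs = np.outer(toy_probs_start, toy_probs_end)

# Without any constraint, the best pair may end before it starts.
unconstrained = np.unravel_index(np.argmax(span_probs), span_probs.shape)

# Keep only the upper triangle, i.e. spans with start <= end.
valid_span_probs = np.triu(span_probs)
best_start, best_end = np.unravel_index(
    np.argmax(valid_span_probs), valid_span_probs.shape
)

print("Unconstrained best (start, end):", tuple(int(i) for i in unconstrained))  # (1, 0)
print("Best valid (start, end):", (int(best_start), int(best_end)))              # (1, 2)
```

Here the highest-scoring pair overall ends before it starts, so it cannot be an answer span; restricting the search to valid spans selects a different pair.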

**TO-DO 5a:** Use the start and end probabilities to compute the answer span probability at the place marked inside the predict_answer() function below. **2 marks**

In [22]:
# our example:
example_index = 3

example = val_dataset[example_index]
print(f'CONTEXT = {example["Tweet"]}')
print(f'QUESTION = {example["Question"]}')
print(f'LIST OF POSSIBLE ANSWERS = {example["Answer"]}')

CONTEXT = The #endangeredriver would be a sexy bastard in this channel if it had water. Quick turns. Narrow. (I'm losing it) John D. Sutter (@jdsutter) June 21, 2014
QUESTION = what hashtag was used?
LIST OF POSSIBLE ANSWERS = ['#endangeredriver', '#endangereddriver']


In [23]:
def predict_answer(probs_start, probs_end, input_ids, tokenizer):
    
    input_length = len(input_ids)  # length of the input sequence, in the form "[CLS] question [SEP] context"

    SEP_SPECIAL_TOKEN = 102  # the input id for the sep special token that separates the question from the context. The context starts after this token. 
    PAD_SPECIAL_TOKEN = 0  # the input id for padding tokens added to the end of the context
    
    span_probabilities = []  # save the probabilities here
    spans = []  # save the possible answer spans here
    
    for start_idx in range(0, input_length):
        for end_idx in range(0, input_length):
            
            start_prob = probs_start[start_idx]
            end_prob = probs_end[end_idx]
            
            ### WRITE YOUR ANSWER HERE
            
            span_prob = start_prob * end_prob

            span_probabilities.append(span_prob)
            
            span = tokenizer.decode(input_ids[start_idx:end_idx+1])
            spans.append(span)

    # sort the spans according to probability:
    sorted_span_index = np.argsort(span_probabilities)
    
    # print the top 20 answers:
    for i in range(20):
        print(f'Span prob = {span_probabilities[sorted_span_index[-i-1]]}, answer = {spans[sorted_span_index[-i-1]]}')
            
predict_answer(probs_start[example_index], probs_end[example_index], example['input_ids'], tokenizer)

Span prob = 0.5619075298309326, answer = endangeredriver
Span prob = 0.15243792533874512, answer = The # endangeredriver
Span prob = 0.12783198058605194, answer = # endangeredriver
Span prob = 0.04335250332951546, answer = 
Span prob = 0.018347227945923805, answer = endangeredriver would be a sexy bastard in this channel if it had water. Quick turns. Narrow
Span prob = 0.01445400808006525, answer = 
Span prob = 0.0079822838306427, answer = endangeredriver would be a sexy bastard in this channel if it had water. Quick turns
Span prob = 0.007700338959693909, answer = endangeredriver would be a sexy bastard in this channel if it had water
Span prob = 0.007665039040148258, answer = 
Span prob = 0.006627385504543781, answer = ##river
Span prob = 0.005845780950039625, answer = 
Span prob = 0.004977355245500803, answer = The # endangeredriver would be a sexy bastard in this channel if it had water. Quick turns. Narrow
Span prob = 0.004173929803073406, answer = # endangeredriver would be a sex

Are all of the top 20 valid and unique answers? If not, what do you think is going wrong? 

**TO-DO 5b:** Use the cell below to define a new and improved version of `predict_answer()` that only includes valid answers. Summarise in a couple of sentences what kind of invalid answers your code removes. **4 marks**

WRITE YOUR ANSWER HERE:

The updated function removes answers that cannot be valid responses to the question. It discards spans that end before they start, spans that start in the question part of the input sequence (which also excludes the [CLS] token and the [SEP] token that separates the question from the context), and spans that run into [PAD] tokens. It also limits spans to at most 15 tokens, which removes overly long answers. This filtering helps to ensure that only potentially meaningful spans from the context are considered as valid answers.



In [24]:
def predict_answer(probs_start, probs_end, input_ids, tokenizer):
    
    input_length = len(input_ids)  # length of the input sequence, in the form "[CLS] question [SEP] context"

    SEP_SPECIAL_TOKEN = 102  # the input id for the sep special token that separates the question from the context. The context starts after this token. 
    PAD_SPECIAL_TOKEN = 0  # the input id for padding tokens added to the end of the context
    
    span_probabilities = []  # save the probabilities here
    spans = []  # save the possible answer spans here

    # Find the position of [SEP] token so we don't include it in spans
    sep_position = input_ids.index(SEP_SPECIAL_TOKEN)
    
    for start_idx in range(sep_position+1, input_length):  # start after the [SEP] token
        for end_idx in range(start_idx, min(input_length, start_idx+15)):  # only consider spans up to length 15
            
            if input_ids[end_idx] in (PAD_SPECIAL_TOKEN, SEP_SPECIAL_TOKEN):  # stop at the final [SEP] token or at padding
                break

            start_prob = probs_start[start_idx]
            end_prob = probs_end[end_idx]
            
            # Only consider it if end index is greater or equal to start index
            if end_idx >= start_idx:
                span_prob = start_prob * end_prob
                span = tokenizer.decode(input_ids[start_idx:end_idx+1])
                span_probabilities.append(span_prob)
                spans.append(span)

    # sort the spans according to probability:
    sorted_span_index = np.argsort(span_probabilities)
    
    # print the top 20 answers:
    for i in range(min(20, len(span_probabilities))):
        print(f'Span prob = {span_probabilities[sorted_span_index[-i-1]]}, answer = {spans[sorted_span_index[-i-1]]}')
            
predict_answer(probs_start[example_index], probs_end[example_index], example['input_ids'], tokenizer)


Span prob = 0.5619075298309326, answer = endangeredriver
Span prob = 0.15243792533874512, answer = The # endangeredriver
Span prob = 0.12783198058605194, answer = # endangeredriver
Span prob = 0.007700338959693909, answer = endangeredriver would be a sexy bastard in this channel if it had water
Span prob = 0.006627385504543781, answer = ##river
Span prob = 0.0017517999513074756, answer = # endangeredriver would be a sexy bastard in this channel if it had water
Span prob = 0.0014155323151499033, answer = Quick turns. Narrow
Span prob = 0.0013854490825906396, answer = endangered
Span prob = 0.0010366060305386782, answer = endangeredriver would be a sexy bastard
Span prob = 0.000647262146230787, answer = endangeredriver would be a sexy bastard in this channel if it had water.
Span prob = 0.0006158521864563227, answer = Quick turns
Span prob = 0.000375853618606925, answer = The # endangered
Span prob = 0.0003151847922708839, answer = # endangered
Span prob = 0.00028121721697971225, answer 

You can try out the pretrained QA model on a few examples and try to identify its common mistakes.

**TO-DO 5c:** State one way that we could improve the performance of our extractive QA model on the Tweet QA dataset.  **2 marks**

WRITE YOUR ANSWER HERE

One way in which we could improve the performance is to fine-tune the model on a related task or dataset. The model was originally trained on the SQuAD dataset, which may not be optimal for understanding the nuances, slang, etc. in tweets. Fine-tuning the model on data that closely resembles the Tweet QA dataset could improve its performance.

--- 

As well as answering ad-hoc queries, question answering models can help us to extract structured information about entities of interest from a large set of documents. Suppose that we want to automatically collect information on tech companies, such as Apple and Open AI. We want to extract information about each company's activities from social media, including the names and release dates of new products and services, the company's earnings in a specific year, and who its CEO is.  

**TO-DO 5d:** Given a list of tech company names, how could we use question answering to extract the required information for each company from a set of tweets?  **(3 marks)** 

WRITE YOUR ANSWER HERE

First, we could filter the tweets to keep only those that mention one of the tech companies. Then we create a question for each piece of information we want to extract about a company, such as its CEO or the release date of a new product. Next, we use the QA model to answer each question, using each relevant tweet as the context; if the tweet contains an answer to the question, we extract it. Finally, the extracted answers may need to be post-processed to provide meaningful insights.
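
The steps described above could be sketched roughly as follows. This is only an illustration: `answer_question` is a hypothetical stand-in for a call to an extractive QA model (it is assumed to return an answer string and a confidence score), and the question templates, score threshold and example tweets are invented.

```python
# Hypothetical question templates, one per piece of information we want to extract
QUESTION_TEMPLATES = {
    "ceo": "Who is the CEO of {company}?",
    "new_product": "What new product did {company} release?",
    "release_date": "When was the new {company} product released?",
}

def extract_company_info(companies, tweets, answer_question, min_score=0.5):
    """Build a {company: {field: [answers]}} table from tweets using a QA function.

    answer_question(question, context) is assumed to return (answer_text, score).
    """
    results = {company: {field: [] for field in QUESTION_TEMPLATES} for company in companies}
    for company in companies:
        # Step 1: keep only the tweets that mention the company
        relevant = [t for t in tweets if company.lower() in t.lower()]
        for tweet in relevant:
            # Step 2: ask one question per field, using the tweet as the context
            for field, template in QUESTION_TEMPLATES.items():
                answer, score = answer_question(template.format(company=company), tweet)
                # Step 3: keep only confident, non-empty answers
                if answer and score >= min_score:
                    results[company][field].append(answer)
    return results

def dummy_answer_question(question, context):
    """Toy stand-in for a QA model so that the sketch runs without downloading anything."""
    if question.startswith("Who is the CEO") and "CEO" in context:
        return context.split("CEO ")[1].split(" ")[0], 0.9
    return "", 0.0

tweets = ["Apple CEO Tim announced new results today", "Nice weather today"]
info = extract_company_info(["Apple"], tweets, dummy_answer_question)
print(info)
```

In practice the score threshold would need tuning, because an extractive QA model returns some span even when the tweet does not contain the answer.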


# 6. Transformer-based Text Classifiers (max. 24 marks)

The previous section showed us how to use a pretrained QA model based on a pretrained transformer. In this section, you will learn how to construct and train a text classifier on top of a pretrained transformer. 

We will use the [Poem Sentiment](https://huggingface.co/datasets/poem_sentiment) dataset to train and test a classifier. The task is to classify lines from poems into one of four classes: 0 (negative), 1 (positive), 2 (no impact), or 3 (mixed sentiment). For more information, see [Sheng and Uthus, 2020](https://arxiv.org/pdf/2011.02686.pdf). 

To begin you will need to instantiate a suitable model.

**TO-DO 6a:** Find an AutoModel class that constructs a text classifier from the pretrained TinyBERT model, "huawei-noah/TinyBERT_General_4L_312D". Create the `model` object in the cell below using this class. Refer to the [Hugging Face documentation for auto models](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) as needed. **(2 marks)**

In [25]:
### WRITE YOUR ANSWER TO 6a HERE ###

# Importing relevant library
from transformers import AutoModelForSequenceClassification

# Define the number of labels in the task (0: negative, 1: positive, 2: no impact, 3: mixed sentiment)
num_labels = 4

#instantiating my chosen model
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=num_labels)


Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'fit_denses.2.weight', 'fit_denses.2.bias', 'fit_denses.3.bias', 'fit_denses.0.bias', 'fit_denses.1.weight', 'fit_denses.0.weight', 'fit_denses.3.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'fit_denses.1.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'fit_denses.4.bias', 'cls.seq_relationship.bias', 'fit_denses.4.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a

**TO-DO 6b:** Provide a link to the documentation for your chosen auto model for text classification. Briefly describe how the text classifier `model` it creates differs from the QA model created by `AutoModelForQuestionAnswering`. Note: A useful reference may be the original BERT paper (https://arxiv.org/pdf/1810.04805.pdf), which includes diagrams (Figure 4) showing how BERT can be adapted to different tasks. **(2 marks)** 

WRITE YOUR ANSWER TO 6b HERE:

Here is the link: https://huggingface.co/docs/transformers/model_doc/auto#automodelforsequenceclassification 

This model is used for tasks that involve classifying a sequence of text into one or more categories, whereas the QA model created by `AutoModelForQuestionAnswering` is used for tasks that involve finding the answer to a question in a given context. Both models share the architecture of the base transformer up to the final layer. In my selected model, the final layer is a linear layer that transforms the transformer's output for the [CLS] token into one logit per class. In the QA model, the final linear layer is instead applied to the output for every token and produces two logits per token, which score that token as the start and as the end of the answer span within the context. 
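
To illustrate the difference in output shapes, here is a toy, library-free sketch of the two kinds of output layer. The hidden states and weights are random numbers rather than real BERT values, the sizes are chosen for readability, and the pooling layer that the real BERT classifier applies to the [CLS] vector before the final linear layer is omitted.

```python
import random

random.seed(0)

SEQ_LEN, HIDDEN, NUM_LABELS = 6, 4, 4  # toy sizes, much smaller than a real model

def linear(weights, vector):
    """Apply a bias-free linear layer: one output per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def random_matrix(rows, cols):
    """Create a rows x cols matrix of random values in [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# One hidden vector per input token, standing in for the transformer encoder output
hidden_states = random_matrix(SEQ_LEN, HIDDEN)

# Sequence classification: a linear layer applied to the first ([CLS]) position only
classifier_weights = random_matrix(NUM_LABELS, HIDDEN)
class_logits = linear(classifier_weights, hidden_states[0])

# Extractive QA: a linear layer with two outputs applied to every token position
qa_weights = random_matrix(2, HIDDEN)
token_outputs = [linear(qa_weights, h) for h in hidden_states]
start_logits = [out[0] for out in token_outputs]
end_logits = [out[1] for out in token_outputs]

print("class logits:", len(class_logits))  # one score per class
print("start logits:", len(start_logits), "end logits:", len(end_logits))  # one score per token
```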

---

For the QA task, the complete model was pretrained and we could apply it to a dataset without further training. However, for our poem sentiment classification task,
we will need to train our model before we can use it (you may see a message in the output of the last cell telling you this). 

**TO-DO 6c:** The sentiment classifier is built on top of a pretrained TinyBERT model, so why do we need to train it before we can use it? **(2 marks)**

WRITE YOUR ANSWER TO 6C HERE:

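We need to train it because the pretrained TinyBERT model doesn't know anything about the specific task we're interested in, which is sentiment classification. The classification layer added on top of TinyBERT is newly created with randomly initialised weights, so its outputs are meaningless until it has been trained. The lower layers of the model, learned during pretraining, capture general language understanding, but the classification layer (and optionally the layers beneath it) needs to be trained on our labelled data to capture task-specific patterns.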


---

Next, let's learn how to train our model. For some tasks it is not necessary to update the weights in the BERT model itself, so we can freeze them to save a lot of computation time. We can do this as follows. Since our pretrained model is based on BERT, we can access the weights inside BERT through the variable `model.bert`.

In [26]:
for param in model.bert.parameters():
    param.requires_grad = False

To train our model, we can make use of the Trainer class, which encapsulates a lot of the complex training steps and avoids the need to define our own training function, as we did in the previous notebook (we don't need to write our own `train_nn`).

First, define some settings for the training process. This is where we can set training hyperparameters:

In [30]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # specify the directory where model weights will be saved at certain points during training (checkpoints)
    num_train_epochs=10, # A sensible and sufficient number to use for the to-dos below
    per_device_train_batch_size=8,  # you can decrease this if memory usage is too high while training
    logging_steps=50,  # how often to print progress during training
)
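
To get a feel for what these settings imply, the sketch below estimates the number of optimizer steps and logging events for a run. The training-set size used here is a hypothetical value for illustration, and the calculation assumes a single device and no gradient accumulation.

```python
import math

def training_schedule(num_examples, batch_size, num_epochs, logging_steps):
    """Estimate optimizer steps and logging events for a simple training run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be smaller
    total_steps = steps_per_epoch * num_epochs
    num_logs = total_steps // logging_steps
    return steps_per_epoch, total_steps, num_logs

# Hypothetical dataset size; batch size, epochs and logging interval as in the cell above
steps_per_epoch, total_steps, num_logs = training_schedule(
    num_examples=1000, batch_size=8, num_epochs=10, logging_steps=50
)
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total, {num_logs} log messages")
```

The same arithmetic is useful for checking that step-based settings, such as a warm-up period, are not longer than the whole training run.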

In [36]:
poem_train_dataset = load_dataset(
    "poem_sentiment",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

poem_val_dataset = load_dataset(
    "poem_sentiment",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

Using custom data configuration default
Reusing dataset poem_sentiment (./data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)
Using custom data configuration default
Reusing dataset poem_sentiment (./data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)


Next, create a trainer object. It uses the `poem_train_dataset` and `poem_val_dataset` variables loaded in the cell above. Note that these datasets have not been tokenized yet, so they cannot be used for training as they stand. Don't worry, we'll fix this later. 

In [38]:


from transformers import Trainer
from torch import nn

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=poem_train_dataset,
    eval_dataset=poem_val_dataset,
)

To train the model, you will need to call `trainer.train()`.

Once the model is trained, we can obtain predictions using the function below. Notice that it is simpler than obtaining the spans for QA -- we simply get the logits for each example in the test set, then apply argmax over the classes to find the most probable class for each example:

In [39]:
def predict_nn(trained_model, test_dataset):

    # Switch off dropout
    trained_model.eval()
    
    # Pass the required items from the dataset to the model    
    output = trained_model(attention_mask=torch.tensor(test_dataset["attention_mask"]), input_ids=torch.tensor(test_dataset["input_ids"]))
                        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    pred_labs = np.argmax(output["logits"].detach().numpy(), axis=1)

    return pred_labs
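
A small library-free sketch of the argmax step is shown below. The label order follows the class indices given earlier in this section (0: negative, 1: positive, 2: no impact, 3: mixed), and the logit values are invented. Because softmax preserves the ordering of the scores, taking the argmax of the logits gives the same class as taking the argmax of the probabilities, which is why the function above does not need to apply softmax.

```python
import math

LABELS = ["negative", "positive", "no impact", "mixed"]  # class indices 0-3

def softmax(logits):
    """Convert unnormalised scores into probabilities that sum to 1."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for three poem lines
toy_logits = [
    [1.2, -0.3, 2.5, 0.1],
    [-1.0, 3.0, 0.2, 0.0],
    [0.4, 0.1, 0.3, 0.2],
]

pred_from_logits = [argmax(row) for row in toy_logits]
pred_from_probs = [argmax(softmax(row)) for row in toy_logits]

for row, pred in zip(toy_logits, pred_from_logits):
    print(row, "->", LABELS[pred])
```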

You should now have all the bits and pieces needed to build and train a text classifier. Let's put them all together...

**TO-DO 6d:** Implement and test a classifier for the [Poem Sentiment](https://huggingface.co/datasets/poem_sentiment) dataset using a pretrained transformer. Evaluate the classifier with both frozen and unfrozen (i.e., fine-tuned) parameters in the pretrained transformer. Choose a suitable evaluation metric and provide a comparison of the results below, including a brief explanation  (1-2 sentences) for any differences you observe between the frozen and unfrozen variants. Make sure to comment your code.  **(10 marks)**

Notes: 
 * Strong classifier performance is not required to achieve good marks -- rather, we award marks for implementing and testing a transformer-based classifier correctly.
 * You may implement any suitable kind of classifier you like, as long as you are using a pretrained transformer model.
 * 'tiny' BERT variants such as TinyBERT and roberta-tiny are recommended because they are small enough to fine-tune with a typical laptop CPU. We recommend sticking with these smaller pretrained models unless you have access to a GPU, e.g., via Google Colab. 

WRITE YOUR ANSWER HERE (DESCRIPTION OF RESULTS FOR 6d):


In [42]:
# Importing relevant libraries

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch
from torch import nn
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


In [46]:
# Load the dataset
poem_train_dataset = load_dataset(
    "poem_sentiment",
    split="train",
)

poem_val_dataset = load_dataset(
    "poem_sentiment",
    split="validation",
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

# Tokenize the dataset
def tokenize(batch):
    return tokenizer(batch['verse_text'], padding=True, truncation=True)

poem_train_dataset = poem_train_dataset.map(tokenize, batched=True, batch_size=len(poem_train_dataset))
poem_val_dataset = poem_val_dataset.map(tokenize, batched=True, batch_size=len(poem_val_dataset))

poem_train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
poem_val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])


Using custom data configuration default
Reusing dataset poem_sentiment (/Users/thomascourts/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)
Using custom data configuration default
Reusing dataset poem_sentiment (/Users/thomascourts/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [47]:
# Create a function to calculate performance metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
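
For reference, the macro-averaged scores requested from `precision_recall_fscore_support` can be reproduced by hand: precision, recall and F1 are computed separately for each class and then averaged with equal weight, so rare classes count as much as frequent ones. The labels and predictions below are invented. Classes with no predictions are given a score of 0 here, which is assumed to match scikit-learn's default handling of that case.

```python
def macro_scores(labels, preds, num_classes):
    """Compute macro-averaged precision, recall and F1 from two label lists."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    # Macro averaging: every class contributes equally, regardless of its frequency
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical gold labels and predictions for eight examples
toy_labels = [0, 0, 1, 1, 2, 2, 2, 3]
toy_preds = [0, 1, 1, 1, 2, 2, 0, 2]

precision, recall, f1 = macro_scores(toy_labels, toy_preds, num_classes=4)
accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy example the accuracy is higher than the macro F1 because the class that is never predicted pulls the macro average down, which is the kind of effect that matters when classes are imbalanced.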

In [48]:
# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=4)

# Freeze the BERT parameters
for param in model.base_model.parameters():
    param.requires_grad = False


Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'fit_denses.2.weight', 'fit_denses.2.bias', 'fit_denses.3.bias', 'fit_denses.0.bias', 'fit_denses.1.weight', 'fit_denses.0.weight', 'fit_denses.3.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'fit_denses.1.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'fit_denses.4.bias', 'cls.seq_relationship.bias', 'fit_denses.4.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a

In [51]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=poem_train_dataset,
    eval_dataset=poem_val_dataset,
)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()


AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. 
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

In [52]:
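# Reload the pretrained model so that the fine-tuned run starts from the same
# weights as the frozen run; all parameters are trainable (unfrozen) by default
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=4)

# Create a new trainer for the unfrozen model
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=poem_train_dataset,
    eval_dataset=poem_val_dataset,
)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()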


AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. 
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

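Unfortunately, I was unable to get the training code to run, although the cells above show my attempted approach. Both calls to `trainer.train()` fail with the `AttributeError` shown above, which is raised because code in the installed libraries still refers to `np.object`, an alias that the installed NumPy version no longer provides. As a result, I have no evaluation results to compare for the frozen and unfrozen models.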

**TO-DO 6e:** Did your sentiment classifier make use of any kind of model transfer or transfer learning? If so, what kinds of transfer were used and what benefit do they provide? **(4 marks)**

WRITE YOUR ANSWER HERE:

Yes, my sentiment classifier made use of transfer learning. It is built on TinyBERT, a pretrained transformer that was trained on a large text corpus with a different training objective, and the knowledge gained during pretraining is transferred to the sentiment classification task. With frozen weights, the pretrained model is reused as a fixed feature extractor and only the new classification layer is trained; with unfrozen weights, the pretrained weights are also fine-tuned on the poem data. In both cases the classifier benefits from the general language understanding learned by the transformer, which lets it learn more effectively from a small labelled dataset. Compared with training a classifier from scratch, this should help the model to generalise better, capture relevant patterns and achieve higher accuracy.

---

**TO-DO 6f:** Use your model to compute the probability of sentiment for a sentence of your choosing. Comment your code and print the sentence with its probability distribution. Label the values so that we know which class they refer to. **(4 marks)**

Hint: you could use a poem generator, such as [this one](https://www.poemofquotes.com/tools/poetry-generator/ai-poem-generator), to generate a test sentence. 

In [54]:
# Define the sentence for which we want to compute the sentiment probability 
test_sentence = "The sun sets on the horizon, painting the sky in hues of gold and crimson."


# Tokenize the test sentence
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
tokens = tokenizer.encode_plus(test_sentence, padding=True, truncation=True, return_tensors="pt")

# Move the input tensors to the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_ids = tokens["input_ids"].to(device)
attention_mask = tokens["attention_mask"].to(device)

# Move the model to the same device as the inputs and set it to evaluation mode
model.to(device)
model.eval()

# Compute the logits (unnormalized scores) for each sentiment class
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Compute the softmax probabilities for each sentiment class
probs = torch.softmax(logits, dim=1)

# Get the sentiment labels and their corresponding probabilities
sentiment_labels = ["Negative", "Positive", "No Impact", "Mixed Sentiment"]
probabilities = probs.squeeze().tolist()

# Print the test sentence and its probability distribution
print("Test Sentence: ", test_sentence)
print("Probability Distribution:")
for label, prob in zip(sentiment_labels, probabilities):
    print(label + ": {:.2f}%".format(prob * 100))


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Test Sentence:  The sun sets on the horizon, painting the sky in hues of gold and crimson.
Probability Distribution:
Negative: 24.89%
Positive: 25.09%
No Impact: 24.92%
Mixed Sentiment: 25.10%


# 7. OPTIONAL: More on Transformers

There are many great resources out there to show you how to use this kind of model in practice:
* An extensive online course is provided by HuggingFace: https://huggingface.co/course/chapter1/1. The pages linked from the HuggingFace course website have an 'open in Colab' button on the top right. You can open the notebook and run it on a Google server there to access GPUs.
* Chapters that may be particularly useful: 
   * Transformers, what can they do? https://huggingface.co/course/chapter1/3?fw=pt
   * Using Transformers: https://huggingface.co/course/chapter2/2?fw=pt
* HuggingFace provides information on fine-tuning transformer models here: https://huggingface.co/docs/transformers/training. Fine-tuning updates the weights inside the pretrained network and, for larger models, can require substantial GPU or TPU computing. 
* Text Generation: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb. This topic goes way beyond data analytics on this unit and shows you another powerful feature of pretrained transformers.


