# Advanced Text Analytics Lab 2

This notebook is the second of two lab notebooks that you will submit as part of your assessment for the Advanced Data Analytics unit. The notebook contains three sections:
1. **Introducing Transformers:** This section introduces the Transformers library from HuggingFace, showing you how to use it to obtain contextualised embeddings from pretrained transformer models.
2. **Classification with transformers:** Here we show you how to construct a neural network classifier using Transformers, and give you the task of applying it to irony detection in tweets.
3. **OPTIONAL: More on Transformers:** Some pointers to other materials if you want to learn more about transformers, e.g., if using them in your summer project. 

## Learning Outcomes

These sections will contain tutorial-like instructions, as you have seen in previous text analytics labs. On completing these sections, the intended learning outcomes are that you will be able to...
1. Use pretrained transformers to obtain contextualised word embeddings.
1. Construct classifiers with pretrained transformers. 
1. Find documentation on pretrained models in the Transformers library.

## Your Tasks

Inside each of these sections there are several **'To-do's**, which you must complete for your summative assessment. Your marks will be based on your answers to these to-dos. Please make sure to:
1. Include the output of your code in the saved notebook. Plots and printed output should be visible without re-running the code. 
1. Include all code needed to generate your answers.
1. Provide sufficient comments to understand how your method works.
1. Write text in a cell in markdown format where a written answer is required. You can convert a cell to markdown format by pressing Escape-M. 

There are also some unmarked 'to-do's that are part of the tutorial to help you learn how to implement and use the methods studied here.

## Marking Criteria

1. The coursework (both notebooks) is worth 30% of the unit in total. 
1. There is a total of 100 marks available for both lab notebooks. 
1. This notebook is worth 34 of those marks.
1. The number of marks for each to-do out of 100 is shown alongside each to-do.
1. For to-dos that require you to write code, a good solution would meet the following criteria (in order of importance):
   1. Solves the task or answers the question asked in the to-do. This means, if the code cells in the notebook are executed in order, we will get the output shown in your notebook.
   1. The code is easy to follow and does not contain unnecessary steps.
   1. The comments show that you understand how your solution works.
   1. A very good answer will also provide code that is computationally efficient but easy to read.
1. You can use any suitable publicly available libraries. Unless the task explicitly asks you to implement something from scratch, there is no penalty for using libraries to implement some steps.

## Support

The main source of support will be during the remaining lab sessions (Fridays 2-5pm) for this unit. 

The TAs and lecturer will help you with questions about the lectures, the code provided for you in this notebook, and general questions about the topics we cover. For the marked 'to-dos', they can only answer clarifying questions about what you have to do. 

Office hours: You can book office hours with Edwin on Tuesdays 3pm-5pm by sending him an email (edwin.simpson@bristol.ac.uk). If those times are not possible for you, please contact him by email to request an alternative. 

## Deadline

The notebook must be submitted along with the second notebook on Blackboard before **Wednesday 11th May at 13.00**. 

## Submission

You will need to zip up this notebook with the previous notebook into a single .zip file, which you will submit to Blackboard through the 'assessment, submission and feedback' link on the left sidebar. 

Please name your files like this:
   * Name this notebook ADA2_<student_number>.ipynb
   * Name the zip file <student_number>.zip
   * Please don't use your name anywhere as we want to mark anonymously. 

# 0. Packages

You will need to upgrade the version of transformers to transformers=4.14.1 . An older version was previously included by mistake in the crossplatform_environment.yml, but this is not needed. You can upgrade using:

``conda upgrade -c huggingface transformers``

# 4. Pretrained Transformers (max. 12 marks)

HuggingFace is a company that has developed an open source library for loading pretrained transformer models. They also distribute many models themselves.  It is therefore the best library to use to create NLP models on top of large, deep neural networks. This is especially useful for tasks where simpler, feature-based methods or smaller LSTM models do not perform well enough, for example, when complex processing of syntax and semantics is required (natural language 'understanding'). 

Let's start by looking at two key types of object in the transformers library: models and tokenizers.

## 4.1. Models

The neural network models available in the Transformers library are accessed through wrapper classes such as `AutoModel`. If we want to load a pretrained model, we can simply pass its name to the `from_pretrained` function, and the pretrained model weights will be downloaded from HuggingFace and a neural network model will be created with those weights. For example:

In [126]:
from transformers import AutoModel # For BERTs

model = AutoModel.from_pretrained("prajjwal1/bert-tiny") 

loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at C:\Users\12055/.cache\huggingface\transformers\3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/pytorch_model.bin from cache at C:\Users\12055/.cache\huggin

This code loads the BERT-tiny model, which is a compressed version of BERT with 4.4 million parameters, compared to the standard version of BERT, called 'BERT-base', which has 110 million parameters. While BERT-tiny will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/prajjwal1/bert-tiny).  

The same functions can be used to load other models from HuggingFace's repository simply by changing the model's name. Take a look at [the Models page](https://huggingface.co/models) so see what there is on offer. Do you recognise any of the models' names?

# 4.2. Tokenizers

Before we can apply a model to some text, we need to a create Tokenizer object. In Transfomers, Tokenizer objects convert raw text to a sequence of numbers. First, the tokenizer actually performs tokenization, then it maps each token to its numerical ID. There are lots of different tokenizers that we can use to preprocess text. If we are loading a pretrained model, we will need to choose the tokenizer that corresponds to that model. 

We can load the right tokenizer as follows, in the same way we loaded the model itself:

In [127]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny") 

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at C:\Users\12055/.cache\huggingface\transformers\3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal

Let's see what the BERT-tiny tokenizer does to an example sentence:

In [128]:
sentence = "The transformer architecture is widely used in NLP."

tokens = tokenizer.tokenize(sentence)
print(tokens)

['the', 'transform', '##er', 'architecture', 'is', 'widely', 'used', 'in', 'nl', '##p', '.']


Let's compare with the NLTK tokenizer we have seen before:

In [129]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)

['The', 'transformer', 'architecture', 'is', 'widely', 'used', 'in', 'NLP', '.']


While NLTK keeps whole words as tokens, the BERT tokenizer splits some words into sub-words. Splitting is applied to words with low frequency in the training set, such as 'transformer'. 

**TO-DO 4.2a:** What is the benefit of splitting words into sub-words? **(2 marks)**

WRITE YOUR ANSWER HERE.





1. To some extents, Sub-Word Tokenization alleviates OOV (out-of-vocabulary) problem. When a unknown word token out of the vocabulary appears, the sub word tokenizer can split this token into several fragments, and some of these fragments have probobility of existing in the vocabulary.  
1. When the amount of words in corpus is massive, the sub-word tokenization can actually decrease the size of vocabulary, thus reduced computing resources



It is important to use the right tokenizer with a pretrained model as each model was trained with text tokenized in a particular way. After tokenization, the Tokenizer object can also map the tokens to their IDs (indexes in the vocabulary):

In [130]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[1996, 10938, 2121, 4294, 2003, 4235, 2109, 1999, 17953, 2361, 1012]


## 4.3. Contextualised Embeddings

Now that we have a sequence of tokens, we are almost ready to process the sequence using the pretrained model. 

Our model takes as input a PyTorch `tensor` object. In PyTorch, `tensor` is a muli-dimensional matrix. Here, we need a two-dimensional matrix, where each row is a sequence of input tokens corresponding to a single sentence or document. Let's convert our list of IDs to a 2-D tensor with a single row:

In [131]:
import torch 

ids_tensor = torch.tensor([ids])

print(ids_tensor)

tensor([[ 1996, 10938,  2121,  4294,  2003,  4235,  2109,  1999, 17953,  2361,
          1012]])


Now we can process the sequence using our model. The model maps the sequence of input IDs to a sequence of output vectors, which are contextualised word embeddings. The hidden state values produced in the last hidden layer of the model are used as the contextualised embeddings:

In [132]:
model_outputs = model(ids_tensor)
print(model_outputs)
embeddings = model_outputs['last_hidden_state'][0]
print(embeddings)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.6550,  0.3572, -1.8545,  ..., -2.2321, -2.4890,  0.8569],
         [-0.7762,  0.7065, -0.4053,  ..., -1.0436, -1.4757,  1.0586],
         [-0.0331,  0.0583, -0.5069,  ..., -3.0095, -0.8549,  0.6007],
         ...,
         [-0.1059,  0.2619,  0.2993,  ..., -0.9318, -2.3270,  1.5508],
         [ 0.2457,  0.2863,  0.6015,  ..., -2.2550, -2.1556,  0.7440],
         [ 0.6512, -0.0050,  0.4048,  ..., -1.6719, -2.0540,  0.0476]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9995, -0.0462, -0.9893,  0.8689, -0.9952,  0.6976, -0.7959, -0.8617,
         -0.0733,  0.0631, -0.2004, -0.0188,  0.0957,  0.9973,  0.3465, -0.5966,
         -0.0231,  0.2026, -0.5178, -0.9471,  0.9236, -0.1754, -0.5852, -0.9344,
         -0.9866, -0.0732, -0.9979,  0.8367,  0.6465,  0.0414,  0.0751,  0.0109,
         -0.9963, -0.0876,  0.9558,  0.9860, -0.8553,  0.1024,  0.2728, -0.9718,
          0.8817,  0.7812, -0.97

We can retrieve the embedding vector for "transform" like this:

In [133]:
emb = embeddings[1]

# convert it to a numpy array so we can perform various operations on it later on
emb = emb.detach().numpy()

print(emb)
print(f'The BERT-tiny embeddings have {emb.shape[0]} dimensions.')

[-0.7762389   0.7064995  -0.40526837 -1.0537281   0.5996382  -1.4787799
 -0.06114486  1.0198277  -0.2592845  -1.3763579   0.15379119  1.0128261
  0.7475126  -0.17591816  2.0003283  -1.0197401  -0.7689709   0.0530691
  0.130649    0.19979407 -1.0313494  -0.5410453   1.0834258   0.49249512
  2.2506454   1.3008718  -0.16233611  0.22524881  0.7293377   0.37714246
  0.07086009  0.39800274 -0.37489387 -0.18650481  0.5223738  -2.747382
 -0.5368228   0.35264593 -1.8976287  -0.35527697  0.07477656 -0.39572445
 -0.55447954  0.6223204   1.0455049  -2.1943057   0.40990472 -0.62277496
  2.219217   -0.13648622  0.89714205  0.8076682   0.1879431  -0.01698842
  0.5216419  -0.32894918  0.07476728 -1.1039577   1.2602047   3.4293036
 -0.91396147 -1.8800973  -0.08931094 -0.7966867   0.06266165  0.69099706
 -0.73700416 -0.23590541 -0.42857617 -0.68002903 -0.6193421   0.01592653
  1.6605312   0.66483396 -1.6665554   2.1701148   0.797216   -0.5222837
  0.6280762  -0.41740733 -0.11712596 -1.3964881  -0.486962

TO-DO 4.3a: Retrieve the embedding for "architecture" (this to-do will not be marked).

In [134]:
# WRITE YOUR ANSWER HERE
emb2 = embeddings[2]
emb2 = emb2.detach().numpy()

Sentences and documents usually have varying lengths. So, to put multiple sentences into a single tensor, we need to pad the sequences up to a maximum length. Luckily, the tokenizer class takes care of this for us. When we pass in a list of sentences, the tokenizer creates a matrix, where each row is a sequence of the same length:

In [135]:
sentences = [
    "She opened the book to page 37 and began to read aloud.",
    "Many readers find the first book of A Tale of Two Cities to be confusing.",
    "I can book tickets for the concert next week.",
    "The police wanted to book him for driving too fast.",
    "I can reserve tickets for the concert next week."
]

model_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  

print(model_inputs)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': tensor([[  101,  2016,  2441,  1996,  2338,  2000,  3931,  4261,  1998,  2211,
          2000,  3191, 12575,  1012,   102,     0,     0,     0],
        [  101,  2116,  8141,  2424,  1996,  2034,  2338,  1997,  1037,  6925,
          1997,  2048,  3655,  2000,  2022, 16801,  1012,   102],
        [  101,  1045,  2064,  2338,  9735,  2005,  1996,  4164,  2279,  2733,
          1012,   102,     0,     0,     0,     0,     0,     0],
        [  101,  1996,  2610,  2359,  2000,  2338,  2032,  2005,  4439,  2205,
          3435,  1012,   102,     0,     0,     0,     0,     0],
        [  101,  1045,  2064,  3914,  9735,  2005,  1996,  4164,  2279,  2733,
          1012,   102,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

`model_inputs` is a dictionary containing three objects. The `input_ids` are the list of token IDs in the input sequences. 

TO-DO 4.3b: What value do the special padding tokens have? (this to-do is unmarked):  
0

Notice that the input_ids all start with the same token ID, 101, even though they have different first words. They also have token ID 102 before the padding tokens. This is because the tokenizer inserts two special tokens, which are used in some applicaions of BERT. 101 is the '[CLS]' token, which is a dummy token whose embedding can be trained to represent the whole sequence. the [CLS] token's embedding can then be used as input to a text classifier to classify a sentence or document. Token 102 is '[SEP]', which can be used to separate multiple input sequences in a single example. This is needed in tasks where multiple pieces of text are provided as input, e.g., a to build a classifier that can determine whether two sentences contradict each other. 

The `attention_mask` records which tokens are special padding tokens and which are real tokens. Tokens with a 0 in the attention mask will be ignored.

`token_type_ids` is needed when two sequences are passed together as input to the model. Here, each input is a single sentence, so we have only one type of token in the output above. 

In [136]:
# model_inputs is a dictionary, so to provide the arguments to model(), 
# we use the double star to unpack the dictionary so that each key in the dictionary is
# an argument to model() and each value is the value of the argument. 
model_outputs = model(**model_inputs) 

**TO-DO 4.3c:** The first four example sentences above all contain the word "book", and the last examples contains "reserve". Obtain the contextualised word embeddings for 'book' and 'reserve' in the example sentences using our model. **(4 marks)**

Hint: you may need to convert tensors to numpy arrays.

In [137]:
#WRITE YOUR OWN CODE HERE
import numpy as np

book_id = tokenizer.convert_tokens_to_ids("book")
reserve_id = tokenizer.convert_tokens_to_ids("reserve")
model_inputIDs_list = model_inputs["input_ids"].detach().numpy()

embedding_list = model_outputs["last_hidden_state"]

bookANDreserve_embeddings = []
for idx, sent in enumerate(model_inputIDs_list):
    if idx != 4:
        # Find the location of the word "book" in the first sentence
        loc = list(sent).index(book_id)
    else:
        # Find the location of the word "reserve" in the last sentence
        loc = list(sent).index(reserve_id)
    bookANDreserve_embeddings.append(embedding_list[idx][loc].detach().numpy())
print(np.array(bookANDreserve_embeddings).shape)

(5, 128)


**TO-DO 4.3d:** Write code to compare these embeddings in the cell below. In a few sentences, explain what your comparison tells us about the contextualised embeddings for "book". **(6 marks)**

WRITE YOUR ANSWER HERE

 

I made a similarity matrix for the words "book" and "reserve" in these five sentences based on euclidean distance. Following, I will illustrate the effectiveness of contextualised embedding based on this case.   
  
It can be seen from the matrix that the “book” word embeddings in the first two sentences are very close together, compared with other embeddings. And from the perspective of human understanding, "book" in the first two sentences both indicate the noun "book" that can be read.  
  
In the third sentence, the word “book” is similar in meaning to the word “reserve” in the fifth sentence. This is also reflected in the embedding space, where the word embedding of "book" in the third sentence and the word embedding of "reserve" in the fifth sentence are closer to each other than other shown word embeddings.   
  
As for "book" in the fourth sentence, its wordiness is a verb, the same as "book" in the third sentence, which may explain why the distance between the two word embeddings is shorter in the embedding space than other word embeddings.  


In [138]:
# WRITE YOUR ANSWER HERE
import pandas as pd
from scipy.spatial.distance import cdist

dists = np.empty((5,5))
for row, word_emb_row in enumerate(bookANDreserve_embeddings):
    for col, word_emb_col in enumerate(bookANDreserve_embeddings):
        dists[row][col] = cdist([word_emb_row], [word_emb_col], 'euclidean')
row_name = ["book sent 1","book sent 2","book sent 3","book sent 4","reserve sent 5"]
sim_mat = pd.DataFrame(data = dists, columns= row_name)
print(sim_mat)

   book sent 1  book sent 2  book sent 3  book sent 4  reserve sent 5
0     0.000000     8.520038     9.546872    10.407844       13.018836
1     8.520038     0.000000    10.206396    10.488382       13.664334
2     9.546872    10.206396     0.000000     9.426083        7.451720
3    10.407844    10.488382     9.426083     0.000000       11.671614
4    13.018836    13.664334     7.451720    11.671614        0.000000


# 5. Transformer-basd Text Classifiers (max. 22 marks)

The previous section showed us how to obtain a sequence of contextualised word embeddings using a pretrained transformer. How can we use a pretrained model to build a classifier?

First, let's load up the [Tweet Eval](https://huggingface.co/datasets/tweet_eval) emotion analysis dataset, which we will use to train and test a classifier. The Emotion dataset is relatively small compared to the sentiment dataset we used earlier. The task is to classify tweets into one of  0: anger, 1: joy, 2: optimism, or 3: sadness.

In [139]:
from datasets import load_dataset
import numpy as np
cache_dir = "./data_cache"

train_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

val_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")
num_classes = np.unique(train_dataset['label']).size

Reusing dataset tweet_eval (./data_cache\tweet_eval\emotion\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Training dataset with 3257 instances loaded


Reusing dataset tweet_eval (./data_cache\tweet_eval\emotion\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Validation dataset with 374 instances loaded


Reusing dataset tweet_eval (./data_cache\tweet_eval\emotion\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Test dataset with 1421 instances loaded


Now we are working with a proper dataset, which uses the datasets library. We can use our tokenizer to tokenize the examples in the dataset using the code in the next cell. Here, we use the ``map()`` method again to apply the tokenizer to each example in the dataset. 

In [140]:
def tokenize_function(dataset):
    model_inputs = tokenizer(dataset['text'], padding="max_length", max_length=100, truncation=True)
    return model_inputs

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at ./data_cache\tweet_eval\emotion\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343\cache-c34cae45b308f8b3.arrow
Loading cached processed dataset at ./data_cache\tweet_eval\emotion\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343\cache-dd50e62ebbf10e21.arrow
Loading cached processed dataset at ./data_cache\tweet_eval\emotion\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343\cache-5880ad1ede6e3d2a.arrow


Now, we have the dataset in the right format, let's see how to create a classifier based on a pretrained transformer.

Our original text classifier from the first notebook used a fully-connected layer to produce a hidden representation of the whole sentence. This hidden representation was then fed to an output layer to produce a probability distribution over class labels:

<img src="neural_text_classifier_smaller.png" alt="Neural text classifier diagram from the slides in lecture 8.1" width="400px"/>

With transformers, we can do something very similar, by connecting the transfomer's output to a fully-connected layer. However, with BERT, we do not need to pass the embedding of each individual word to the fully-connected layer because there is a special [CLS] token that represents the whole sentence:

<img src="bert_text_classifier.png" alt="BERT text classifier diagram from the slides in lecture 9.2" width="400px"/>

Diagram from ["BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"](https://teaching.bb-ai.net/Student-Projects/Winograd-Challenge-Papers/2018-Devlin-BERT.pdf), Devlin et al., 2018.

The code below shows how to access a tensor containing the [CLS] embeddings:

In [141]:
cls_embs = model(**model_inputs)['last_hidden_state'][:, 0]

print(cls_embs.shape)

torch.Size([5, 128])


So, given the pretrained BERT model, we need to put a classifier 'head' (fully connected layers that map the CLS embedding to a class probability) onto it, and train the classifier head for emotion classification. 

The transformers library provides some useful wrappers around the pretrained models that construct complete models for typical tasks such as text classification. These auto classes are documented here: https://huggingface.co/docs/transformers/model_doc/auto

The code below will create a complete model for sequence classification, based on the BERT-tiny model:

In [142]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=num_classes)

loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at C:\Users\12055/.cache\huggingface\transformers\3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.18.0",
  "type_vocab_s

**TO-DO 5a:** Can you use this model straight away to classify emotions? Explain your answer. **(2 marks)**

WRITE YOUR ANSWER HERE

Yes. In order to directly classify with the BERT model, we need to fine-tune the BERT model. On a BERTmodel, we can retrain model with the target dataset, which is much smaller than the dataset for pretraining. Then, through backpropagation, the pre-trained weights of the model are updated based on the new dataset (of course, we also can freeze all layers except the final layer). When predicting categories, the model would feed the output to the softmax layer, and the predicted results would be stored in the first token [ELC] of the final layer.  

**TO-DO 5b:** If you want to perform NER, you would need to replace the use of `AutoModelForSequenceClassification` in the cell above to load a suitable model. Which auto class could you use, and how would the model differ from the model we have loaded above? Reference the documentation for the chosen class in your answer. **(4 marks)** 

WRITE YOUR ANSWER HERE  
AutoModelForTokenClassification  
From the names, we can understand that AutoModelForTokenClassification is a generic model class that can instantiate the token classification model, and AutoModelForSequenceClassification is for instantiating the sequence classification model. I think the biggest difference between them is the granularity of output embeddings, one is word embedding and another is sequence embedding.  [Document](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelfortokenclassification)

Next, we are going to train our model. Sometimes it is not necessary to update the weights in the BERT model itself, so we can freeze them. This can save a lot of computation time. We can do this as follows:

In [143]:
for param in model.bert.parameters():
    param.requires_grad = False

To train our model on the emotion data, we can make use of the Trainer class. This class encapsulates a lot of the complex training steps and avoids the need to define our own training function (``train_nn`` in the previous notebook).

Run the code below to train the model.

In [144]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # specify the directory where models weights will be saved a certain points during training (checkpoints)
    num_train_epochs=3,  # change this if it is taking too long on your computer
)  

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [145]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3257
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1224
 41%|████      | 500/1224 [00:16<00:20, 35.84it/s]Saving model checkpoint to transformer_checkpoints\checkpoint-500
Configuration saved in transformer_checkpoints\checkpoint-500\config.json
Model weights saved in transformer_checkpoints\checkpoint-500\pytorch_model.bin
 41%|████▏     | 506/1224 [00:16<00:21, 33.64it/s]

{'loss': 1.3026, 'learning_rate': 2.957516339869281e-05, 'epoch': 1.23}


 82%|████████▏ | 1000/1224 [00:31<00:08, 25.21it/s]Saving model checkpoint to transformer_checkpoints\checkpoint-1000
Configuration saved in transformer_checkpoints\checkpoint-1000\config.json
Model weights saved in transformer_checkpoints\checkpoint-1000\pytorch_model.bin
 82%|████████▏ | 1003/1224 [00:31<00:09, 23.62it/s]

{'loss': 1.262, 'learning_rate': 9.150326797385621e-06, 'epoch': 2.45}


100%|█████████▉| 1223/1224 [00:39<00:00, 29.12it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 1224/1224 [00:39<00:00, 31.15it/s]

{'train_runtime': 39.2278, 'train_samples_per_second': 249.083, 'train_steps_per_second': 31.202, 'train_loss': 1.2793277167027293, 'epoch': 3.0}





TrainOutput(global_step=1224, training_loss=1.2793277167027293, metrics={'train_runtime': 39.2278, 'train_samples_per_second': 249.083, 'train_steps_per_second': 31.202, 'train_loss': 1.2793277167027293, 'epoch': 3.0})

Let's make some predictions with our model on the test dataset:

In [146]:
def predict_nn(trained_model, test_dataset):

    # Pass the required items from the dataset to the model    
    output = trained_model(attention_mask=torch.tensor(test_dataset["attention_mask"]), input_ids=torch.tensor(test_dataset["input_ids"]))
        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    pred_labs = np.argmax(output["logits"].detach().numpy(), axis=1)
    
    gold_labs = test_dataset["label"]
    
    return gold_labs, pred_labs

# Run the prediction function to get the results:
gold_labs, pred_labs = predict_nn(model, test_dataset)

**TO-DO 5c:** 
Implement and test a classifier for the "irony" subset of the [Tweet_eval dataset](https://huggingface.co/datasets/tweet_eval) using a pretrained transformer. Evaluate the classifier with both frozen and unfrozen (i.e., fine-tuned) BERT layers. Choose a suitable evaluation metric and provide a comparison of the results below, including a brief explanation (1-2 sentences) for any differences you observe. Make sure to comment your code.  **(8 marks)**

Note: you may implement any kind of classifier you like, as long as you are using a pretrained transformer model. 

WRITE YOUR ANSWER HERE   


In [147]:
# WRITE YOUR ANSWER HERE
# Loda the train and test dataset
from sklearn.metrics import f1_score
from collections import Counter
cache_dir = "./data_cache"

train_dataset = load_dataset(
    "tweet_eval",
    name="irony",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

val_dataset = load_dataset(
    "tweet_eval",
    name="irony",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")
test_dataset = load_dataset(
    "tweet_eval",
    name="irony",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")
num_classes = np.unique(train_dataset['label']).size

print(f"training dataset{dict(Counter(train_dataset['label']))}")
print(f"validation dataset{dict(Counter(val_dataset['label']))}")
print(f"test dataset{dict(Counter(test_dataset['label']))}")

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Reusing dataset tweet_eval (./data_cache\tweet_eval\irony\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Training dataset with 2862 instances loaded


Reusing dataset tweet_eval (./data_cache\tweet_eval\irony\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Validation dataset with 955 instances loaded


Reusing dataset tweet_eval (./data_cache\tweet_eval\irony\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)
Loading cached processed dataset at ./data_cache\tweet_eval\irony\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343\cache-0a3f795c693ba54c.arrow


Test dataset with 784 instances loaded
training dataset{1: 1445, 0: 1417}
validation dataset{1: 456, 0: 499}
test dataset{0: 473, 1: 311}


100%|██████████| 1/1 [00:00<00:00, 15.87ba/s]
Loading cached processed dataset at ./data_cache\tweet_eval\irony\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343\cache-19de43ce85521bc7.arrow


In [148]:
model_forzen = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=num_classes)
model_tuneAll = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=num_classes)
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", )
for param in model_forzen.bert.parameters():
    param.requires_grad = False
def get_trainer(model):
    trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset)
    return trainer
trainer_forzen = get_trainer(model_forzen)
trainer_tuneAll = get_trainer(model_tuneAll)
trainer_forzen.train()
trainer_tuneAll.train()
gold_labs_forzen, pred_labs_forzen = predict_nn(model_forzen, test_dataset)
gold_labs_tuneAll, pred_labs_tuneAll = predict_nn(model_tuneAll, test_dataset)

https://huggingface.co/vinai/bertweet-base/resolve/main/config.json not found in cache or force_download set to True, downloading to C:\Users\12055\.cache\huggingface\transformers\tmpsaxvx8v0
Downloading: 100%|██████████| 558/558 [00:00<00:00, 558kB/s]
storing https://huggingface.co/vinai/bertweet-base/resolve/main/config.json in cache at C:\Users\12055/.cache\huggingface\transformers\356366feedcea0917e30f7f235e1e062ffc2d28138445d5672a184be756c8686.a2b6026e688d1b19cebc0981d8f3a5b1668eabfda55b2c42049d5eac0bc8cb2d
creating metadata file for C:\Users\12055/.cache\huggingface\transformers\356366feedcea0917e30f7f235e1e062ffc2d28138445d5672a184be756c8686.a2b6026e688d1b19cebc0981d8f3a5b1668eabfda55b2c42049d5eac0bc8cb2d
loading configuration file https://huggingface.co/vinai/bertweet-base/resolve/main/config.json from cache at C:\Users\12055/.cache\huggingface\transformers\356366feedcea0917e30f7f235e1e062ffc2d28138445d5672a184be756c8686.a2b6026e688d1b19cebc0981d8f3a5b1668eabfda55b2c42049d5eac0

AttributeError: 'RobertaForSequenceClassification' object has no attribute 'bert'

In [None]:
print("F1 Score for classifier with frozen")
print(f1_score(gold_labs_forzen, pred_labs_forzen))

print("F1 Score for classifier with unfrozen")
print(f1_score(gold_labs_tuneAll, pred_labs_tuneAll))


F1 Score for classifier with frozen
0.4954128440366973
F1 Score for classifier with unfrozen
0.5608308605341247


**TO-DO 5d:** Briefly describe how the classifiers you implemented for the irony dataset use transfer learning. **(4 marks)**

WRITE YOUR ANSWER HERE


**TO-DO 5e:** Use your model to compute the probability of irony for a sentence of your choosing. Comment your code and print the sentence with its probability. **(4 marks)**

Hint: you could choose a sentence from [this page on verbal irony](https://examples.yourdictionary.com/examples-of-verbal-irony.html). 

In [None]:
# WRITE YOUR ANSWER HERE   


# 6. OPTIONAL: More on Transformers

There are many great resources out there to show you how to use this kind of model in practice:
* An extensive online course is provided by HuggingFace: https://huggingface.co/course/chapter1/1. The pages linked from the HuggingFace course website have an 'open in Colab' button on the top right. You can open the notebook and run it on a Google server there to access GPUs.
* Chapters that may be particularly useful: 
   * Transformers, what can they do? https://huggingface.co/course/chapter1/3?fw=pt
   * Using Transformers: https://huggingface.co/course/chapter2/2?fw=pt
* They provide information on fine-tuning the transformer models here: https://huggingface.co/docs/transformers/training. Fine-tuning updates the weights inside the pretrained network and requires extensive GPU or TPU computing. 
* Text Generation: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb. This topic goes way beyond data analytics on this unit and shows you another powerful feature of pretrained transformers.


