In [1]:
__author__ = "Carina Silberer"
__version__ = "Multimodal Semantics, IMS Stuttgart, Summer 2023" 

*Parts of this notebook are based on those of
Christopher Potts' notebook of CS224u, Stanford, Spring 2021.*

# Assignment: Transformer-Based Contextual Representation
Documentation: https://huggingface.co/transformers/

In this notebook, you will be working with RoBERTa's pre-trained contextual embeddings. We will use the `transformers` package from [Hugging Face](https://huggingface.co) and the pre-trained models it provides.

If you have never worked with transformers, go through the Warm-Up first. Otherwise, you can jump directly to the exercise below.

## Warm-Up: Basics of `transformers`
### Setup: 

In [2]:
try:
    import transformers
except ModuleNotFoundError:
    !pip install transformers
    # Uncomment the line below if you want to use conda for installation
finally:
    import transformers

In [3]:
#import json
import numpy as np
import pandas as pd
# Uncomment the line below if you cannot import torch (or install it using pip):
#!conda install -y pytorch torchvision torchaudio -c pytorch
import torch

### RoBERTa
#### Loading the model
I recommend using the base model, at least for setting everything up. If your computing power is limited, it may be best to stick to the base model for the final results as well. Otherwise, you can also compare the base model with the large model by running both on the same examples.
* https://huggingface.co/roberta-base
* https://huggingface.co/roberta-large

***Example Usage: Masked Language Modelling***

*Conveniently, with the pipeline*

In [4]:
from transformers import pipeline
robertaUnmasker = pipeline('fill-mask', model='roberta-large')

In [5]:
robertaUnmasker("Hello I'm a <mask> gnome.")

[{'score': 0.1163899302482605,
  'token': 410,
  'token_str': ' little',
  'sequence': "Hello I'm a little gnome."},
 {'score': 0.04736984893679619,
  'token': 5192,
  'token_str': ' friendly',
  'sequence': "Hello I'm a friendly gnome."},
 {'score': 0.047068677842617035,
  'token': 5671,
  'token_str': ' garden',
  'sequence': "Hello I'm a garden gnome."},
 {'score': 0.037068430334329605,
  'token': 3034,
  'token_str': ' computer',
  'sequence': "Hello I'm a computer gnome."},
 {'score': 0.03495166078209877,
  'token': 5262,
  'token_str': ' tiny',
  'sequence': "Hello I'm a tiny gnome."}]

*Or, if you want to, e.g., extract embeddings or use it in any other customised form*

In [6]:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
robertaModel = RobertaModel.from_pretrained('roberta-base')
robertaModel.eval()
# Side note: You will receive a warning when loading the model. The one shown
# below says that the checkpoint's language-modelling head (`lm_head.*`) was
# not used; some versions also report randomly initialised weights. In short:
# fine-tune the model if you want to use it for a supervised inference task
# (which we don't).

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Drop

In [7]:
robertaModel

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Drop

In [8]:
text = "I am a gnome"
encoded_input = tokenizer(text, return_tensors='pt')
output = robertaModel(**encoded_input)

**Q:** *What does the output below, `[1, 768]`, mean?*

In [9]:
output["pooler_output"].shape

torch.Size([1, 768])

---
#### Basics: Tokenisation
Documentation: https://huggingface.co/docs/tokenizers/python/latest/ <br/>
API: https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizer

Let's see what tokenisation does with an input text:

In [10]:
sentence = "Commonly, koalas are black and white"

The `tokenize` method breaks up the input into 'tokens':

In [11]:
tokenizer.tokenize(sentence)

['Comm', 'only', ',', 'Ġko', 'al', 'as', 'Ġare', 'Ġblack', 'Ġand', 'Ġwhite']

**Q:** *What does the `Ġ` do?*

RoBERTa's vocabulary is fixed to a certain number of tokens (50,265 for `roberta-base`, as the `word_embeddings` size in the model printout above shows). Therefore, as you can see above, the tokenizer splits up words that are not part of the vocabulary into smaller subword units, using byte-level Byte-Pair Encoding (BPE; see the [tokenizer summary](https://huggingface.co/transformers/tokenizer_summary.html)). <br/>
Here is another example:

In [12]:
tokenizer.tokenize("bananas")

['ban', 'anas']
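
To build some intuition for how a word ends up as several pieces, here is a toy sketch of the merge step of Byte-Pair Encoding (BPE), the family of algorithms that RoBERTa's tokenizer belongs to. The merge table `toy_merges` and the function `bpe_encode` are made up for illustration. They are not RoBERTa's actual merge rules (the real tokenizer works on bytes and uses a much larger learned merge table); the table was chosen by hand so that it reproduces the split shown above.

```python
def bpe_encode(word, merges):
    """Split `word` into subwords by repeatedly applying the best-ranked merge.

    `merges` is an ordered list of symbol pairs; earlier pairs have priority.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table (NOT RoBERTa's), ordered by priority
toy_merges = [("a", "n"), ("b", "an"), ("a", "s"), ("an", "as")]

pieces = bpe_encode("bananas", toy_merges)
unknown = bpe_encode("kiwi", toy_merges)
print(pieces)
print(unknown)
```

With this toy table, no merge applies to "kiwi", so it falls back to single characters. This is the general pattern: the less a word is covered by learned merges, the more pieces it is split into.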

---
#### Encoding and Decoding Input
The `encode` method maps a string to a list of indices into the vocabulary, i.e., into the rows of the embedding matrix used by the model. Recall that BERT has the special tokens `[SEP]` and `[CLS]` (and `[MASK]` for Masked Language Modelling (MLM)). RoBERTa's counterparts are `</s>`, `<s>` and `<mask>`.

In [13]:
ex_ids = tokenizer.encode(sentence, add_special_tokens=True)
ex_ids

[0, 33479, 8338, 6, 12546, 337, 281, 32, 909, 8, 1104, 2]

To see what these tokens look like, we can `convert_ids_to_tokens`, i.e., map the ids back to the tokens:

In [14]:
tokenizer.convert_ids_to_tokens(ex_ids)

['<s>',
 'Comm',
 'only',
 ',',
 'Ġko',
 'al',
 'as',
 'Ġare',
 'Ġblack',
 'Ġand',
 'Ġwhite',
 '</s>']

The `decode` method maps a list of indices back to a single string.

### Basics of Representations
See also the documentation for more details: https://huggingface.co/transformers/model_doc/bert.html#bertmodel

In [15]:
sentence = "Everyone knows that most bananas are yellow."
ex_ids = tokenizer.encode(sentence, add_special_tokens=True)

In [16]:
print(len(ex_ids))

10


With the `forward` method of the model (invoked by calling the model directly), we can obtain representations for a batch of example inputs. If we set the optional boolean argument `output_hidden_states` to `True`, the hidden states of all layers are returned as well:

In [17]:
with torch.no_grad():
    reps = robertaModel(input_ids=torch.tensor([ex_ids]), output_hidden_states=True)

In [18]:
type(reps)

transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions

The return value `reps` is an instance of a special `transformers` class comprising various representations.<br/>
To just get the final output representations for each token, we use `last_hidden_state`:

In [19]:
reps.last_hidden_state.shape

torch.Size([1, 10, 768])

As we see from the `shape` of the last hidden state, our batch has 1 example, with 10 tokens, where each token is represented by a vector of dimensionality 768. 

**Side note:** The `transformers` models also have a `pooler_output` value. 
In the case of RoBERTa, this is the output representation above the `<s>` token (RoBERTa's counterpart of BERT's [CLS]). Recall **that this is often used as a summary representation for the entire sequence**, on the basis of which a classification decision is computed.
However, __we cannot use `pooler_output` in the current context__, as `transformers` adds new randomised parameters on top of it, to facilitate fine-tuning. (See also the warning message above when loading the model.)<br/>

**If we want the [CLS] representation, we need to use `reps.last_hidden_state[:, 0]`:**

In [20]:
reps.last_hidden_state[:, 0].shape

torch.Size([1, 768])

Since we set `output_hidden_states` to `True`, we can access the output representations from each layer of the model via `hidden_states`. 
(With `output_hidden_states=False`, `reps.hidden_states` would be `None`.) <br/>
For the RoBERTa-base model, there are 13 entries in total: the embeddings, followed by the 12 hidden layers:

In [21]:
len(reps.hidden_states)

13

That is, `reps.hidden_states[-1]` is the same as `reps.hidden_states[12]` for RoBERTa-base, and the embeddings can be accessed via `reps.hidden_states[0]`.

In [22]:
tokenizer.decode(ex_ids)

'<s>Everyone knows that most bananas are yellow.</s>'

## Exercise: Language Perception of Colour

*We will use the data provided by the coda dataset (for Colour Perception)*
https://github.com/nala-cub/coda

### Instructions
* Run the code below. It loads the underlying coda dataset. We will only use the *object names* and *colours* for this exercise.
* Choose 20 object names from different categories (e.g., vegetable/fruits, vehicles, ...).
* For each object, determine its typical colours from the given list of colours. This is your personal reference output.
* Now, get the top colours that RoBERTa predicts for each of these objects, using the prompts below.
  * That is, filter the predictions/scores of RoBERTa to contain only a ranking of the target colour terms we consider. 
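
The filtering step could look roughly like the sketch below. It assumes the list-of-dicts format that the fill-mask pipeline returned in the Warm-Up (keys `token_str` and `score`), where `token_str` carries a leading space. The helper name `rank_target_colours` and the scores in `example_predictions` are made up for illustration; they are not real model output.

```python
target_vocab = ["red", "orange", "yellow", "green", "blue",
                "purple", "pink", "white", "black", "gray", "brown"]


def rank_target_colours(predictions, vocab):
    """Keep only predictions whose token is a target colour, best score first."""
    allowed = set(vocab)
    ranked = [
        (pred["token_str"].strip(), pred["score"])  # drop the leading space
        for pred in predictions
        if pred["token_str"].strip() in allowed
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical predictions in the pipeline's output format (scores invented)
example_predictions = [
    {"token_str": " ripe", "score": 0.05},
    {"token_str": " edible", "score": 0.02},
    {"token_str": " yellow", "score": 0.02},
    {"token_str": " black", "score": 0.005},
    {"token_str": " green", "score": 0.004},
]

colour_ranking = rank_target_colours(example_predictions, target_vocab)
print(colour_ranking)
```

Note that a colour can only show up in the ranking if it is among the predictions passed in, so the number of predictions requested from the model needs to be large enough.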
  
### Tasks
1. Compute the accuracy of colour prediction using your small reference data. 
2. Are there objects for which RoBERTa predicts wrong colours? Why could this be the case?
3. Summarise in a few sentences your conclusions regarding the colour knowledge one can learn from language use (i.e., large text corpora). 
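
For Task 1, one possible way to define accuracy is sketched below. It assumes top-1 accuracy, where a prediction counts as correct if the model's best-ranked colour is among the colours you listed as typical for that object; other definitions (e.g., top-3) are equally reasonable. The function name `top1_accuracy` and both dictionaries are hypothetical; in particular, the `predicted` values are invented and not real RoBERTa output.

```python
def top1_accuracy(reference, predicted):
    """Fraction of objects whose predicted colour is in the reference set.

    reference: dict mapping object name -> set of acceptable colours
    predicted: dict mapping object name -> best-ranked colour (or missing)
    """
    if not reference:
        raise ValueError("reference must contain at least one object")
    hits = sum(
        1 for obj, colours in reference.items()
        if predicted.get(obj) in colours
    )
    return hits / len(reference)


# Hypothetical reference annotations and (invented) model predictions
reference = {
    "banana": {"yellow"},
    "apple": {"red", "green"},
    "coal": {"black"},
    "cloud": {"white"},
}
predicted = {"banana": "yellow", "apple": "red", "coal": "black", "cloud": "blue"}

accuracy = top1_accuracy(reference, predicted)
print(f"Top-1 accuracy: {accuracy:.2f}")
```

Storing the reference as a set per object allows for objects with more than one typical colour, such as apples.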

***Submit the following files through ILIAS***:
* `ex1_transformers_<yourname>.ipynb`: your completed jupyter notebook
* `ex1_transformers_<yourname>.[txt|md]`: a plain text file that contains your answers to Tasks 1-3.

In [23]:
import json

with open('coda_objects.jsonl', 'r') as json_file:
    json_list = list(json_file)

for json_str in json_list:
    result = json.loads(json_str)
    print(f"{result['display_name']}")

Aircraft
Airplane
Alarm clock
Ambulance
Ant
Artichoke
Apple
Apricot
Asparagus
Auto part
Avocado
Axe
Backpack
Bagel
Baked goods
Ball
Balloon
Bamboo
Banana
Band-aid
Barrel
Baseball bat
Baseball glove
Basketball (Ball)
Bean
Bear
Bed
Bee
Beehive
Beer
Beet
Beetle
Bell pepper
Belt
Bench
Bicycle
Bicycle helmet
Bicycle wheel
Billboard
Billiard table
Binoculars
Blood
Blue jay
Blueberry
Boat
Bomb
Bone
Book
Bookcase
Boot
Bottle opener
Bowl
Box
Bread
Briefcase
Broccoli
Bronze sculpture
Brown bear
Building
Bull
Burrito
Bus
Bust
Butter
Butterfly
Cabbage
Cake
Cake stand
Calculator
Camel
Camera
Can opener
Canary
Candle
Candy
Cannon
Canoe
Cantaloupe
Car
Carrot
Cassette deck
Castle
Cat
Caterpillar
Cattle
Cauliflower
Ceiling fan
Centipede
Chainsaw
Chair
Cheese
Cheetah
Cherry
Chicken
Chime
Chisel
Chocolate
Chopsticks
Christmas tree
Clock
Cloud
Coal
Coat
Cocktail
Cocktail shaker
Coconut
Coffee
Coffee cup
Coffee table
Coffeemaker
Coin
Common sunflower
Computer keyboard
Computer monitor
Computer mouse
Cookie

In [24]:
target_vocab = ["red", "orange", "yellow", "green", "blue", 
                "purple", "pink", "white", "black", "gray", "brown"]

In [25]:
prompts = [
    "Most {object_areis_pl} <mask>",
    "Everyone knows that most {object_areis_pl} <mask>.",
    "Everyone knows that {object_areis_pl} <mask>.",
    "Commonly {object_areis_pl} <mask>.",
    "All {object_areis_pl} <mask>.",
    "It is known that {object_areis_pl} <mask>.",
    "It's known that {object_areis_pl} <mask>.",
    "It is known that most {object_areis_pl} <mask>.",
    "It's known that most {object_areis_pl} <mask>.",
    "This {object_areis_s} <mask>."
]

*Example of instantiated prompt:*

In [26]:
prompt = "Most {object_areis_pl} <mask>".format(
    object_areis_pl="bananas are")
print(prompt)

Most bananas are <mask>
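
To fill all prompt templates for one object at once, something like the following sketch may help. It relies on the fact that `str.format` ignores keyword arguments a template does not use, so the plural and the singular phrase can always be passed together. The helper name `instantiate_prompts` is made up, `example_prompts` is just a two-item excerpt of the `prompts` list above, and the plural and singular phrases are assumed to be written by hand for each object.

```python
# Two templates copied from the `prompts` list above
example_prompts = [
    "Everyone knows that most {object_areis_pl} <mask>.",
    "This {object_areis_s} <mask>.",
]


def instantiate_prompts(templates, plural_phrase, singular_phrase):
    """Fill every template with the matching plural or singular phrase."""
    return [
        template.format(object_areis_pl=plural_phrase,
                        object_areis_s=singular_phrase)
        for template in templates
    ]


filled = instantiate_prompts(example_prompts, "bananas are", "banana is")
for filled_prompt in filled:
    print(filled_prompt)
```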


In [27]:
print(robertaUnmasker(prompt))

[{'score': 0.0479198694229126, 'token': 23318, 'token_str': ' ripe', 'sequence': 'Most bananas are ripe'}, {'score': 0.023059047758579254, 'token': 27532, 'token_str': ' edible', 'sequence': 'Most bananas are edible'}, {'score': 0.02161657065153122, 'token': 35, 'token_str': ':', 'sequence': 'Most bananas are:'}, {'score': 0.019580114632844925, 'token': 5718, 'token_str': ' yellow', 'sequence': 'Most bananas are yellow'}, {'score': 0.01915721222758293, 'token': 34382, 'token_str': ' poisonous', 'sequence': 'Most bananas are poisonous'}]


In [28]:
predictions1 = robertaUnmasker(prompt, top_k=100)
print([pred["token_str"] for pred in predictions1])
print([pred["score"] for pred in predictions1])

[' ripe', ' edible', ':', ' yellow', ' poisonous', ' sweet', ' bitter', '</s>', '.', ' expensive', ' not', ' delicious', ' bananas', ' black', '…', ' small', ' eaten', ' green', '...', ' a', ' brown', ' grown', ' harvested', ' white', ' red', ' acidic', ' imported', ' …', ' seeds', ' cheap', '.', ' fermented', ' processed', ' toxic', ' sold', ' nutritious', ' bruised', ' dried', ' hard', ' cultivated', ' produced', ' organic', '—', '...', ' :', ' flavored', ',', ' purple', ' available', ' stored', ' in', ' canned', ' soft', ' dry', ' very', ' safe', ' wild', ' protected', ' heavy', ' also', ' fruit', ' fat', ' free', ' sour', ' tasty', ' rotten', ' inexpensive', ' consumed', ' large', ' fruits', ',', ' like', ' strong', ' natural', ' contaminated', ' nuts', ' raw', ' huge', ' tropical', ' frozen', ' good', ' known', ' labeled', ' tough', ' from', ' mixed', ' oily', ' concentrated', ' used', ' growing', ' seasonal', ' sliced', ' preserved', ' healthy', ' planted', ' cooked', ' sticky', 