FEEDBACK: https://forms.gle/FvmxWWisFqCmsoDf9

*permalink: https://tinyurl.com/CuAINLP2*

BERT attention visualisation: https://colab.research.google.com/drive/1hXIQ77A4TYS4y3UthWF-Ci7V7vVUoxmQ

If you want to do a task more similar to what we did last week, we recommend the Stanford CS224N exercise here: https://web.stanford.edu/class/cs224n/assignments/a5.zip , instructions here: https://web.stanford.edu/class/cs224n/assignments/a5.pdf (requires running Python locally).

Change PyTorch classes: https://colab.research.google.com/drive/1dQVo-krTVoj0b9ekJNdGAd0lirjUQV70#scrollTo=MdYXZvhrGLIW

# Introduction to Transformers



In [None]:
%pip install transformers
%pip install sentencepiece ## 16 seconds
import torch as t
import matplotlib.pyplot as plt

## Machine Translation

In [None]:
import sentencepiece
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
translation_tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl") 
translation_model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-nl") ## 1 minute

In [None]:
translation_text = "Welcome to our natural language processing workshop"
# Call the tokenizer directly; `prepare_seq2seq_batch` is deprecated.
tokenized_text = translation_tokenizer([translation_text], return_tensors="pt")
translation_input_ids = tokenized_text["input_ids"]
translation = translation_model.generate(input_ids = translation_input_ids)
translated_text = translation_tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print('-' * 50)
print(f"ORIGINAL: {translation_text}")
print(f"TRANSLATION: {translated_text}")

--------------------------------------------------
ORIGINAL: Welcome to our natural language processing workshop
TRANSLATION: Welkom op onze workshop natuurlijke taalverwerking


In [None]:
translation_model

## Language Modelling

In [None]:
# from transformers import AutoModelForCausalLM
# language_model_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
# language_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B") # a better model, but colab fails loading it -_-

from transformers import GPT2Tokenizer, GPT2LMHeadModel

gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
inputs = gpt2_tokenizer("""def fib(n):\n # implement Fibonacci numbers""", return_tensors="pt")["input_ids"]

gen_len = 50
batch_size = 5

samples = gpt2_model.generate(
    inputs, 
    max_length=inputs.shape[-1]+gen_len, 
    min_length=inputs.shape[-1]+gen_len, 
    do_sample=True, 
    temperature=0.6, 
    top_k=len(gpt2_tokenizer), 
    top_p=1.0, 
    num_return_sequences=batch_size, 
    use_cache=True
)

for sample_no in range(1, batch_size + 1):
    print('-' * 50)
    print(f"Completion {sample_no} of {batch_size}:")
    print(gpt2_tokenizer.decode(samples[sample_no - 1])) ## this cell takes 

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--------------------------------------------------
Completion 1 of 5:
def fib(n):
 # implement Fibonacci numbers

for i in range(100): print (i[i]) # output fib(i[1]) fib(1) # output fib(1) fib(1)

And then, using the above method, we can define
--------------------------------------------------
Completion 2 of 5:
def fib(n):
 # implement Fibonacci numbers

#

# Note: It is possible to use fib() to return a number in a larger unit.

#

# The Fibonacci number is the sum of the sum of the

# integers in a given
--------------------------------------------------
Completion 3 of 5:
def fib(n):
 # implement Fibonacci numbers.

return fib(n + 1, fib(n))

def fib(n):

# implement Fibonacci numbers.

return fib(n + 1, fib(n))

def fib_index(
--------------------------------------------------
Completion 4 of 5:
def fib(n):
 # implement Fibonacci numbers # for i in range(n): # for i in range(n) % fib(n) def fib(n): return fib(n)

This code is far from being complete. I have already written some 
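
The `generate` call above samples with `temperature=0.6`. Setting `top_k` to the vocabulary size and `top_p=1.0` keeps every token as a candidate, so temperature is the only setting shaping the distribution here. As a minimal sketch of what temperature does (the toy scores and the helper name `softmax_with_temperature` are made up for illustration and are not part of `transformers`), dividing the scores by a temperature below 1 concentrates probability on the highest-scoring tokens:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - largest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical scores for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.0]

for temperature in (1.0, 0.6):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(f"temperature={temperature}: {[round(p, 3) for p in probs]}")
```

Lower temperatures make the completions more repetitive but less random; higher temperatures do the opposite.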

## Vision Transformer

Source: <a href="https://www.facebook.com/cambridge.university/photos/a.10150161669754864/10159007779694864/">University of Cambridge Facebook</a>, Jan 2022.
<img src="https://i.imgur.com/ChLjNFc.png">

Model: https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k


## What is a Transformer?

In [None]:
print(gpt2_model.transformer.h[0].attn.c_attn._parameters['weight'].shape)
gpt2_model

    # (9): GPT2Block(
    # (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    # (attn): GPT2Attention(
    #     (c_attn): Conv1D()
    #     (c_proj): Conv1D()
    #     (attn_dropout): Dropout(p=0.1, inplace=False)
    #     (resid_dropout): Dropout(p=0.1, inplace=False)
    # )
    # (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    # (mlp): GPT2MLP(
    #     (c_fc): Conv1D()
    #     (c_proj): Conv1D()
    #     (dropout): Dropout(p=0.1, inplace=False)
    # )
    # )

torch.Size([768, 2304])


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )


* Just an abstraction for a composition of many functions (mostly matrix multiplication)?

* Composed of *blocks* that are stacked not only deep (like powerful image models) but also wide.

![](https://jalammar.github.io/images/t/Transformer_encoder.png)

* Confusion alert: read 'encoder' as 'transformation of word embeddings'. In the context of translation, however, a transformer model has separate (though very similar) encoding and decoding parts:

![](https://jalammar.github.io/images/t/Transformer_decoder.png)

* More detail on a block:

![](https://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png)

* A vision transformer works very similarly (see https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html):

![](https://1.bp.blogspot.com/-_mnVfmzvJWc/X8gMzhZ7SkI/AAAAAAAAG24/8gW2AHEoqUQrBwOqjhYB37A7OOjNyKuNgCLcBGAsYHQ/s16000/image1.gif)

* A language model produces an encoding for every token it has processed. We only use the last encoding for language modelling (to predict the next token), but the other encodings are very useful for other tasks (see below).
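
To make the "mostly matrix multiplication" point concrete, here is a minimal sketch of single-head causal self-attention in PyTorch. The toy sizes, the random weights and the names `w_qkv` and `w_out` are assumptions for illustration. GPT-2 itself uses several attention heads plus layer norm, dropout and an MLP in each block, all of which are left out here. The `[768, 2304]` shape printed above for `c_attn` fits this picture: 2304 = 3 × 768, i.e. queries, keys and values come from one matrix multiplication.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_model = 4, 8  # toy sizes; the GPT-2 model above uses d_model = 768

x = torch.randn(seq_len, d_model)                # one vector per token
w_qkv = torch.randn(d_model, 3 * d_model) * 0.1  # stands in for `c_attn`
w_out = torch.randn(d_model, d_model) * 0.1      # stands in for `c_proj`

# One matrix multiplication produces queries, keys and values together.
q, k, v = (x @ w_qkv).split(d_model, dim=-1)

# Scaled dot-product scores between every pair of positions.
scores = q @ k.T / math.sqrt(d_model)

# Causal mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)  # each row sums to 1
encodings = x + (weights @ v) @ w_out    # residual connection around attention

# A language model predicts the next token from the last position's encoding.
last_encoding = encodings[-1]
print(encodings.shape, last_encoding.shape)
```

Every position gets its own encoding, but only `last_encoding` has seen the whole sequence, which is why it is the one used to predict the next token.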

## Using and finetuning a pretrained model

In this extension section, we provide code that uses a pretrained model for a downstream task: sentiment classification!

Some code is borrowed from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/ (which is Apache-licensed :) ).

Download the dataset:

In [None]:
!wget -q -nc http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf /content/aclImdb_v1.tar.gz

### Boring setup:

No need to understand anything from the below cell.

In [None]:
import io
import os
import torch
from tqdm.notebook import tqdm
from torch.utils.data import Dataset, DataLoader
# from ml_things import plot_dict, plot_confusion_matrix, fix_text
from sklearn.metrics import classification_report, accuracy_score
from transformers import (set_seed,
                          TrainingArguments,
                          Trainer,
                          GPT2Config,
                          GPT2Tokenizer,
                          get_linear_schedule_with_warmup,
                          GPT2ForSequenceClassification)
# `transformers.AdamW` is deprecated; use the PyTorch optimizer instead.
from torch.optim import AdamW

class Gpt2ClassificationCollator(object):
    r"""
    Data collator used for GPT-2 in a classification task.
    
    It uses a given tokenizer and label encoder to convert any text and labels to numbers that 
    can go straight into a GPT-2 model.

    This class is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed 
    straight into the model - `model(**batch)`.

    Arguments:

      use_tokenizer (:obj:`transformers.tokenization_?`):
          Transformer type tokenizer used to process raw text into numbers.

      labels_encoder (:obj:`dict`):
          Dictionary to encode label names into numbers. Keys are label
          names and values are the numbers associated with those labels.

      max_sequence_len (:obj:`int`, `optional`):
          Maximum sequence length used to truncate or pad text sequences.
          If no value is passed, the maximum sequence size supported by the
          tokenizer and model is used.

    """

    def __init__(self, use_tokenizer, labels_encoder, max_sequence_len=None):

        # Tokenizer to be used inside the class.
        self.use_tokenizer = use_tokenizer
        # Check max sequence length.
        self.max_sequence_len = use_tokenizer.model_max_length if max_sequence_len is None else max_sequence_len
        # Label encoder used inside the class.
        self.labels_encoder = labels_encoder

        return

    def __call__(self, sequences):
        r"""
        This function allows the class object to be used as a function call.
        Since the PyTorch DataLoader needs a collator function, this class
        can be used as one.

        Arguments:

          sequences (:obj:`list`):
              List of dictionaries, each holding a text and its label.

        Returns:
          :obj:`Dict[str, object]`: Dictionary of inputs that feed into the model,
          so that `model(**returned_dictionary)` works.
        """

        # Get all texts from sequences list.
        texts = [sequence['text'] for sequence in sequences]
        # Get all labels from sequences list.
        labels = [sequence['label'] for sequence in sequences]
        # Encode all labels using label encoder.
        labels = [self.labels_encoder[label] for label in labels]
        # Call tokenizer on all texts to convert into tensors of numbers with 
        # appropriate padding.
        inputs = self.use_tokenizer(text=texts, return_tensors="pt", padding=True, truncation=True,  max_length=self.max_sequence_len)
        # Update the inputs with the associated encoded labels as tensor.
        inputs.update({'labels':torch.tensor(labels)})

        return inputs

# Set seed for reproducibility.
set_seed(123)

# Number of training epochs (the BERT authors recommend between 2 and 4 for fine-tuning).
epochs = 4

# Batch size - depends on the max sequence length and GPU memory.
# For a sequence length of 512, a batch of 10 works without CUDA memory issues.
# For small sequence lengths you can try a batch of 32 or higher.
batch_size = 32

# Pad or truncate text sequences to a specific length.
# If `None`, the maximum sequence length allowed by the model is used.
max_length = 60

# Look for gpu to use. Will use `cpu` by default if no gpu found.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Name of transformers model - will use already pretrained model.
# Path of transformer model - will load your own model from local disk.
model_name_or_path = 'gpt2'

# Dictionary of labels and their ids - this will be used to convert
# string labels to number ids.
labels_ids = {'neg': 0, 'pos': 1}

# How many labels are we using in training.
# This is used to decide size of classification head.
n_labels = len(labels_ids)

class MovieReviewsDataset(Dataset):
  r"""PyTorch Dataset class for loading data.

  This is where the data parsing happens.

  This class is built with reusability in mind.

  Arguments:

    path (:obj:`str`):
        Path to the data partition.

    use_tokenizer:
        Tokenizer passed in by the caller; it is not used by this class.

  """

  def __init__(self, path, use_tokenizer):

    # Check if path exists.
    if not os.path.isdir(path):
      # Raise error if path is invalid.
      raise ValueError('Invalid `path` variable! Needs to be a directory')
    self.texts = []
    self.labels = []
    # Since the labels are defined by folders with data we loop 
    # through each label.
    for label in ['pos', 'neg']:
      sentiment_path = os.path.join(path, label)

      # Get all files from path.
      files_names = os.listdir(sentiment_path)#[:10] # Sample for debugging.
      # Go through each file and read its content.
      for file_name in tqdm(files_names, desc=f'{label} files'):
        file_path = os.path.join(sentiment_path, file_name)

        # Read content (the `with` block closes the file afterwards).
        with io.open(file_path, mode='r', encoding='utf-8') as review_file:
          content = review_file.read()
        # Fix any unicode issues.
        # content = fix_text(content)
        # Save content.
        self.texts.append(content)
        # Save the label (it is encoded to a number later, in the collator).
        self.labels.append(label)

    # Number of examples.
    self.n_examples = len(self.labels)
    

    return

  def __len__(self):
    r"""When used `len` return the number of examples.

    """
    
    return self.n_examples

  def __getitem__(self, item):
    r"""Given an index return an example from the position.
    
    Arguments:

      item (:obj:`int`):
          Index position to pick an example to return.

    Returns:
      :obj:`Dict[str, str]`: Dictionary of inputs that contain the text and 
      its associated label.

    """

    return {'text':self.texts[item],
            'label':self.labels[item]}

def validation(model, dataloader, device_, valid_batches):

  model.eval()

  correct = 0
  incorrect = 0

  for batch_idx, batch in tqdm(enumerate(dataloader), total=min(valid_batches, len(dataloader))):
    if batch_idx == valid_batches:
        break
    true_labels = batch['labels']
    batch = {k:v.type(torch.long).to(device_) for k,v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
        loss, logits = outputs[:2]
        
        # Predicted class per review: the index of the larger logit.
        predictions = logits.argmax(dim=-1).cpu()
        print(predictions.tolist())

        # `true_labels` is already a CPU tensor, so compare the tensors directly.
        correct += (predictions == true_labels).sum().item()
        incorrect += (predictions != true_labels).sum().item()

  model.train()  
  print(f"On the validation dataset, the model got {correct} sentiments correct and {incorrect} sentiments incorrect")
  return

gpt2_tokenizer.padding_side = "left"
# Define PAD Token = EOS Token = 50256
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

gpt2_classificaiton_collator = Gpt2ClassificationCollator(use_tokenizer=gpt2_tokenizer, 
                                                          labels_encoder=labels_ids, 
                                                          max_sequence_len=max_length)

print('Dealing with Train...')
# Create pytorch dataset.
train_dataset = MovieReviewsDataset(path='/content/aclImdb/train', 
                               use_tokenizer=gpt2_tokenizer)
print('Created `train_dataset` with %d examples!'%len(train_dataset))

# Move pytorch dataset into dataloader.
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=gpt2_classificaiton_collator)
print('Created `train_dataloader` with %d batches!'%len(train_dataloader))

print()

print('Dealing with Validation...')
# Create pytorch dataset.
valid_dataset =  MovieReviewsDataset(path='/content/aclImdb/test', 
                               use_tokenizer=gpt2_tokenizer)
print('Created `valid_dataset` with %d examples!'%len(valid_dataset))

# Move pytorch dataset into dataloader.
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True, collate_fn=gpt2_classificaiton_collator)
print('Created `eval_dataloader` with %d batches!'%len(valid_dataloader))

Dealing with Train...


Created `train_dataset` with 25000 examples!
Created `train_dataloader` with 782 batches!

Dealing with Validation...


Created `valid_dataset` with 25000 examples!
Created `eval_dataloader` with 782 batches!


## Training

Let's see some examples from the dataset. Note that all reviews are truncated to just 60 tokens.

In [None]:
for batch in train_dataloader:
    true_labels = batch['labels'].numpy().flatten().tolist()
    batch = {k:v.type(torch.long).to(device) for k,v in batch.items()}
    print("Dataset example (decoded):")
    print(gpt2_tokenizer.decode(batch['input_ids'][0]))
    print(f"This has {['negative', 'positive'][true_labels[0]]} sentiment")
    break

Dataset example (decoded):
What a sad sight these TV stalwarts make, running out the clock on their careers stumbling about a little rusting hulk of a ship - boat might be more appropriate. The whole production feels cheap and shabby, and it's not helped by a "big name" star who is barely capable
This has negative sentiment
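
The truncation and left padding that the collator asks the tokenizer for can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the helper name `pad_or_truncate` and the toy token ids are made up, and it pads to a fixed length, whereas `padding=True` pads to the longest sequence in the batch.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = token_ids[:max_length]            # truncation drops tokens from the end
    n_pad = max_length - len(kept)
    input_ids = [pad_id] * n_pad + kept      # left padding: pads go in front
    attention_mask = [0] * n_pad + [1] * len(kept)
    return input_ids, attention_mask

PAD_ID = 50256  # GPT-2's EOS token id, reused as the PAD token in this notebook

short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding on the left keeps the real tokens at the end of the sequence, so the last position always holds a real token rather than padding.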


We now load the GPT-2 model.

In [None]:
# Get model configuration.
print('Loading config...')
model_config = GPT2Config.from_pretrained(pretrained_model_name_or_path=model_name_or_path, num_labels=n_labels)

# Get model's tokenizer.
print('Loading tokenizer...')
tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path)
# default to left padding
tokenizer.padding_side = "left"
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token

def get_model():
    # Get the actual model.
    print('Loading model...')
    model = GPT2ForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_name_or_path, config=model_config)

    # resize model embedding to match new tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # fix model padding token id
    model.config.pad_token_id = model.config.eos_token_id

    # Load model to defined device.
    model.to(device)
    print('Model loaded to `%s`'%device)
    return model

Loading config...
Loading tokenizer...


We can run the model on a few validation batches to check that the classification head is randomly initialised (the accuracy should be close to chance):

In [None]:
model = get_model()
validation(model, valid_dataloader, device, 3)

Loading model...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded to `cuda`


[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]




[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
On the validation dataset, the model got 45 sentiments correct and 51 sentiments incorrect


Here is how you can extract the outputs of GPT-2's sentiment prediction:

In [None]:
model = get_model()
true_labels = []
losses = []
DEVICE = device  # defined in the setup cell; falls back to the CPU if no GPU is found

for batch_idx, batch in tqdm(enumerate(train_dataloader)):
    loss = 0.0
    true_labels += batch['labels'].numpy().flatten().tolist()
    batch = {k:v.type(torch.long).to(DEVICE) for k,v in batch.items()}
    model.zero_grad()
    outputs = model(**batch)
    print("The output of the model has keys:", outputs.keys())
    break

Loading model...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded to `cuda`


The output of the model has keys: odict_keys(['loss', 'logits', 'past_key_values'])


You should now have all the tools to finetune GPT-2 for sentiment classification (which element of that dictionary will be *very* useful?).

This task may require some digging around to work out how to debug things. To proceed, we recommend using i) the above cell and ii) https://pytorch.org/tutorials/beginner/introyt/trainingyt.html#the-training-loop to set up a training loop.

You should be able to get a loss similar to the following plot within just a few iterations.

![](https://i.imgur.com/hjtDipf.png)
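
For orientation, the general shape of such a training loop can be sketched on a toy problem. Everything below is a stand-in chosen so the cell runs in seconds on a CPU: `ToyClassifier`, the synthetic data and the learning rate are assumptions, not the GPT-2 setup. The one property it shares with the model above is that the forward pass returns a dictionary containing `loss` and `logits`.

```python
import torch
from torch import nn

torch.manual_seed(0)

class ToyClassifier(nn.Module):
    """Hypothetical stand-in for GPT-2: returns a dict with `loss` and `logits`."""

    def __init__(self, n_features=8, n_labels=2):
        super().__init__()
        self.score = nn.Linear(n_features, n_labels)

    def forward(self, features, labels):
        logits = self.score(features)
        loss = nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}

# Synthetic data: the label is 1 when the first feature is positive.
features = torch.randn(256, 8)
labels = (features[:, 0] > 0).long()

toy_model = ToyClassifier()
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=5e-2)
toy_losses = []

toy_model.train()
for step in range(100):
    # Draw a random mini-batch of 32 examples.
    batch_idx = torch.randint(0, len(features), (32,))
    batch = {"features": features[batch_idx], "labels": labels[batch_idx]}

    optimizer.zero_grad()         # clear gradients from the previous step
    outputs = toy_model(**batch)  # forward pass
    loss = outputs["loss"]
    loss.backward()               # backpropagate
    optimizer.step()              # update the weights
    toy_losses.append(loss.item())

print(f"first loss: {toy_losses[0]:.3f}, last loss: {toy_losses[-1]:.3f}")
```

For the real task, the batches would come from `train_dataloader` and the model from `get_model()`. A much smaller learning rate is usually needed when finetuning a pretrained model.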

Which movie reviews does the model fail on (look at the `validation` code to see how to implement this)?
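
One possible way to collect the failures, sketched in plain Python: keep the texts, true labels and predicted labels side by side and filter for mismatches. The helper name `find_failures` and the three example reviews are hypothetical.

```python
def find_failures(texts, true_labels, predicted_labels):
    """Return (text, true_label, predicted_label) for every misclassified example."""
    return [
        (text, true, predicted)
        for text, true, predicted in zip(texts, true_labels, predicted_labels)
        if true != predicted
    ]

# Hypothetical reviews and predictions, with 0 = negative and 1 = positive.
example_texts = ["Loved it", "Terrible plot", "Not bad at all"]
example_true = [1, 0, 1]
example_predicted = [1, 1, 0]

label_names = ["negative", "positive"]
for text, true, predicted in find_failures(example_texts, example_true, example_predicted):
    print(f"{text!r}: true={label_names[true]}, predicted={label_names[predicted]}")
```

Because the dataloaders shuffle the data, the simplest way to recover each review's text is to decode `batch['input_ids']` with `gpt2_tokenizer.decode`, as in the dataset example earlier.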

Write a review with positive or negative sentiment - can your pretrained model classify it?


## Footnotes

### Resources:

* Very visual explanation: https://jalammar.github.io/illustrated-transformer/

* Quick summary of `transformers` library: https://neptune.ai/blog/hugging-face-pre-trained-models-find-the-best

* Vision transformers: https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html

* Best (?) **intuitive** explanation of attention: https://nostalgebraist.tumblr.com/post/185326092369/1-classic-fully-connected-neural-networks-these