# Homework 6: Transformers

The goals of this assignment are:
1. Develop a better understanding of the *self-attention mechanism* in Transformers by implementing it in numpy. 
2. Understand and train a BERT-based model. 
3. Strengthen your understanding of using HuggingFace's `transformers` package. 

## Organization and Instructions
Execute the code cells in Part 1 to understand the background for this assignment. You will not need to modify or add anything to Part 1. Part 2 is where your solution begins.

**Part 1: Background.** 
- 1A. Environment set-up 
- 1B. Data exploration 

**Part 2: Your implementation.** 
- 2A. Self-attention 
- 2B. Zero-shot predictions 
- 2C. Fine-tuning 


**Additional instructions.** 
- Please follow the 50-foot rule. Your submitted solution and code must be yours alone. Copying and pasting a solution from the internet or another source is considered a violation of the honor code. 

**Evaluation.** Your solution will be evaluated *manually* by the TAs and instructor. 

To help bridge the gap between previous homeworks and the final project, we are **not giving you an autograder**. We hope this helps wean you off the grader and gives you practice testing your own code.

Please come see us during help hours if you need additional assistance! 

## 1A. Environment Set-up 

If you set up your conda environment correctly in HW0, you should see `Python [conda env:cs375]` as the kernel in the upper right-hand corner of the Jupyter webpage you are currently on. Run the cell below to make sure your environment is correctly installed. 

In [108]:
# Environment check 
# Return to HW0 if you run into errors in this cell 
# Do not modify this cell 
import os
assert os.environ['CONDA_DEFAULT_ENV'] == "cs375"

import sys
assert sys.version_info.major == 3 and sys.version_info.minor == 11

If there are any errors after running the cell above, return to the instructions from `HW0`. If you are still having difficulty, reach out to the instructor or TAs via Piazza. 

#### Installing other packages

In [109]:
import re
import typing
from typing import List
import numpy as np
import torch
import torch.nn.functional as F
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          TrainingArguments, Trainer, DataCollatorWithPadding)
from datasets import Dataset, load_dataset
from sklearn.metrics import f1_score

In [110]:
import util #inspect util.py to see what is in this file 

## 1B. Data exploration

In this homework, we will use the WinoGrande dataset. This is an **extremely challenging dataset**: even a BERT-based model leaves **a lot of room for performance improvements!** 

You can read more about the dataset in [this paper](https://cdn.aaai.org/ojs/6399/6399-13-9624-1-10-20200517.pdf). 

Here is Table 1 from the WinoGrande paper with examples:  

![](figs/winograd.png)

HuggingFace provides a Python package for loading (and uploading) datasets. You can read more about the `datasets` Python package [here](https://huggingface.co/docs/datasets/en/index). 

In [111]:
# Load the WinoGrande dataset
dataset = load_dataset("allenai/winogrande", "winogrande_s", trust_remote_code=True)

# Access the training and validation splits
train_dataset = dataset["train"]
validation_dataset = dataset["validation"].select(range(100)) #We'll just look at 100 dev exs

print(f"Num. train exs= {len(train_dataset)}")
print(f"Num. dev exs= {len(validation_dataset)}")

Num. train exs= 640
Num. dev exs= 100


In [112]:
# Let's look at one example from the validation dataset
print(validation_dataset[12])

{'sentence': 'I had to read an entire story for class tomorrow. Luckily, the _ was short.', 'option1': 'story', 'option2': 'class', 'answer': '1'}


Above, `sentence` is the full sentence, with a `_` marking where one of the two options should go. 

`option1` and `option2` are the two spans from the sentence that the model will eventually choose between, and `answer` says which option is correct (`'1'` or `'2'`). 
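
To make the format concrete, here is a minimal sketch that substitutes each option into the blank of the example printed above. The helper name `fill_blank` is hypothetical and is not part of `util.py` or the `datasets` package.

```python
# The dev example printed above, copied as a plain dict.
example = {
    "sentence": "I had to read an entire story for class tomorrow. Luckily, the _ was short.",
    "option1": "story",
    "option2": "class",
    "answer": "1",
}

def fill_blank(example: dict, option_key: str) -> str:
    """Substitute the chosen option for the first "_" placeholder (hypothetical helper)."""
    return example["sentence"].replace("_", example[option_key], 1)

# Build both candidate sentences and look up the one marked correct.
candidates = {key: fill_blank(example, key) for key in ("option1", "option2")}
correct_key = f"option{example['answer']}"

for key, text in candidates.items():
    marker = "(correct)" if key == correct_key else ""
    print(key, "->", text, marker)
```

Reading the two filled-in sentences side by side shows why the task is hard: both are grammatical, and only world knowledge tells you that a story, not a class, is what was "short" to read.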

## 2A. Self-attention

In this part, you will implement the parallelized version of the *masked* self-attention mechanism in Transformers using only numpy.


Recall that, for each layer $k$ in the transformer and for a single example with $n$ tokens and embedding dimension $d$, we start with $X^k$, the contextual embedding matrix (size $n\times d$) for layer $k$. 

Then, we introduce the weights, 

$$ Q = X^k \times W_Q$$ 
$$ K = X^k \times W_K$$
$$ V = X^k \times W_V $$

and use the new matrices to get the contextual embedding matrix for the next layer, 

$$ X^{k+1} = \text{softmax} \bigg( \text{mask} \bigg( \frac{QK^T}{\sqrt{d}} \bigg) \bigg) V$$

This is computationally efficient in a matrix-multiplication-optimized library like `numpy` because it should have **no for-loops!** 

Let's implement self-attention for the (modified) example we were looking at in Part 1:

*"I had to read an entire story for class tomorrow. Luckily, it was short."*

In [113]:
# Tokens for our example 
toks = ["i", "had", "to", "read", "an", 
        "entire", "story", "for", "class", "tomorrow", ".",
       "luckily", "it", "was", "short", "."]

In [114]:
# Load pre-specified embeddings and weights (for testing)
X, W_Q, W_K, W_V = util.load_attention_data(toks)

In [115]:
# TODO: Implement your approach in this function
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def self_attention(X: np.ndarray, W_Q: np.ndarray, 
                   W_K: np.ndarray, W_V: np.ndarray) -> np.ndarray: 
    """
    Implements (masked) self-attention mechanism for a single layer 
    (and a single example)
    
    Returns: X_new, a np.ndarray that is the same shape as X
    
    Notes:
    - You can only use numpy for this part of the homework and no other packages
    - Your solution must not have any for-loops!
    
    Tips: 
        - Double-check the shapes of all the matrices you're working with. 
        - We recommend making a helper function for the softmax.
        - You may find a subset of these numpy methods and operators helpful: `np.exp`, `@`, `np.triu_indices`, `np.reshape`, `np.inf`, `np.broadcast_to`, `np.choose`.  
    """
    Q = X @ W_Q 
    K = X @ W_K  
    V = X @ W_V  
    
    n = X.shape[0]
    d = X.shape[1]
   
    scores = (Q @ K.T) / np.sqrt(d)

    indexes = np.triu_indices(n, k = 1)
    scores[indexes] = -np.inf
   
    attention_weights = softmax(scores)  

    X_new = attention_weights @ V  
    
    return X_new

In [116]:
X_new = self_attention(X, W_Q, W_K, W_V)
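
Since there is no autograder, one way to sanity-check masked self-attention is to test properties that must hold for any input. The sketch below is self-contained and uses small random matrices rather than the data from `util.load_attention_data`. The helper `causal_attention_weights` is a hypothetical name, and the checks are suggestions, not the grading criteria.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax after masking out future positions (column j > row i)."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy inputs (assumed shapes: n tokens, embedding dimension d).
rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
A = causal_attention_weights(Q @ K.T / np.sqrt(d))
X_new = A @ V

assert X_new.shape == X.shape             # output keeps the input shape
assert np.allclose(A.sum(axis=-1), 1.0)   # each row is a probability distribution
assert np.allclose(np.triu(A, k=1), 0.0)  # no weight on future tokens
assert np.allclose(X_new[0], V[0])        # the first token can only attend to itself
```

The last check is the most informative: with a causal mask, the first row of the attention weights is `[1, 0, ..., 0]`, so the first output row must equal the first row of `V`.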

## 2B. Zero-shot predictions

Now, we will use DistilBERT (a distilled version of BERT) to make zero-shot predictions on the WinoGrande dataset. 

#### Tokenization and pre-processing

In [117]:
model_name = "distilbert/distilbert-base-uncased"

In [118]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

In [119]:
# First example in the dev dataset (see dataset loading in Part 1B above)
text1 = validation_dataset[0]['sentence']
text1

'Sarah was a much better surgeon than Maria so _ always got the easier cases.'

In [120]:
# Converts to tokens and attention mask 
# The attention mask will be 0 if there are special "PAD" tokens
# offset_mapping gives the start and end character index for each token in the original text 
inputs = tokenizer(text1, return_tensors="pt", return_offsets_mapping=True)
inputs

{'input_ids': tensor([[ 101, 4532, 2001, 1037, 2172, 2488, 9431, 2084, 3814, 2061, 1035, 2467,
         2288, 1996, 6082, 3572, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'offset_mapping': tensor([[[ 0,  0],
         [ 0,  5],
         [ 6,  9],
         [10, 11],
         [12, 16],
         [17, 23],
         [24, 31],
         [32, 36],
         [37, 42],
         [43, 45],
         [46, 47],
         [48, 54],
         [55, 58],
         [59, 62],
         [63, 69],
         [70, 75],
         [75, 76],
         [ 0,  0]]])}

In [121]:
# Examine the tokens 
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tokens

['[CLS]',
 'sarah',
 'was',
 'a',
 'much',
 'better',
 'surgeon',
 'than',
 'maria',
 'so',
 '_',
 'always',
 'got',
 'the',
 'easier',
 'cases',
 '.',
 '[SEP]']

Note: When you see `##` characters at the start of a token, this means the BERT tokenizer has split a word into subwords. (No token in the example above was split.)
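
To see what `offset_mapping` encodes, the following sketch slices the original sentence with the character offsets printed above. The offsets are copied by hand from that output, so no tokenizer is needed to run it.

```python
sentence = "Sarah was a much better surgeon than Maria so _ always got the easier cases."

# (start, end) character offsets copied from the tokenizer output above.
# The special tokens [CLS] and [SEP] map to (0, 0).
offsets = [
    (0, 0), (0, 5), (6, 9), (10, 11), (12, 16), (17, 23), (24, 31), (32, 36),
    (37, 42), (43, 45), (46, 47), (48, 54), (55, 58), (59, 62), (63, 69),
    (70, 75), (75, 76), (0, 0),
]

# Each offset pair recovers the token's text in the original (cased) sentence.
spans = [sentence[start:end] for start, end in offsets]
for index, span in enumerate(spans):
    print(index, repr(span))

# The character position of a word lines up with the start offset of its token.
maria_start = sentence.find("Maria")
print(maria_start, offsets[8])
```

Note that the offsets index into the original cased string, while the tokens themselves are lowercased by the uncased tokenizer.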

In [122]:
# TODO: fill in the function below 
def which_tok_index(tok_string: str, sentence: str, inputs: torch.tensor) -> int:
    """
    Inputs: the token string (tok_string) of interest, the original sentence
    and the tokenized input tensors 
    
    Returns: (int) the first index that matches the tok_string. 
    
        If tok_string gets tokenized into multiple tokens, return the *first* index corresponding
        to the multi-token span. 

        If there is no match (which could happen), return 0. 
    
    Example: 
        tok_string="sarah"
        
        input_ids = {'input_ids': tensor([[ 101, 4532, 2001, 1037, 2172, 2488, 9431, 2084, 3814, 2061, 1035, 2467,
         2288, 1996, 6082, 3572, 1012,  102]]), 
                     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 
                     'offset_mapping': tensor([[[ 0,  0],
                     [ 0,  5],
                     [ 6,  9],
                     [10, 11],
                     [12, 16],
                     [17, 23],
                     [24, 31],
                     [32, 36],
                     [37, 42],
                     [43, 45],
                     [46, 47],
                     [48, 54],
                     [55, 58],
                     [59, 62],
                     [63, 69],
                     [70, 75],
                     [75, 76],
                     [ 0,  0]]])}
        
        Returns: 1 
        
        This example returns 1 since the token id at index 1 (value 4532)
        starts at character index 0 in the original sentence and corresponds to "sarah".
        
    Tips:  
        - Understanding 'offset_mapping' and Python string methods may be helpful here
        - tokenizer.convert_ids_to_tokens could help with debugging 
    """
    start_idx = sentence.find(tok_string)
    
    if start_idx == -1:
        return 0  
    
    offset_mapping = inputs['offset_mapping'][0]
    
    for i, (token_start, token_end) in enumerate(offset_mapping):
        if token_start == start_idx and token_end != 0:
            return i
    
    return 0

In [123]:
# Unit test
sent = validation_dataset[0]['sentence']
inputs = tokenizer(sent, return_tensors="pt", return_offsets_mapping=True)

t1 = which_tok_index("sarah", sent.lower(), inputs)
print(t1, "== 1?")
t2 = which_tok_index("maria", sent.lower(), inputs)
print(t2, "== 8?")
t3 = which_tok_index("_", sent.lower(), inputs)
print(t3, "== 10?")

1 == 1?
8 == 8?
10 == 10?


In [124]:
# Another unit test 
sent3 = validation_dataset[3]['sentence']
inputs3 = tokenizer(sent3, return_tensors="pt", return_offsets_mapping=True)

t1 = which_tok_index(validation_dataset[3]['option1'].lower(), sent3, inputs3)
print(t1, "== 7?")
t2 = which_tok_index(validation_dataset[3]['option2'].lower(), sent3, inputs3)
print(t2, "== 12?")
t3 = which_tok_index("blah", sent3, inputs3)
print(t3, "== 0?")

7 == 7?
12 == 12?
0 == 0?


#### Zero-shot prediction

Now, we'll use the model to make zero-shot predictions. Note, this is "zero-shot" because we haven't ever trained the model on this particular task or dataset.  

Here's how we will make zero-shot predictions: 
1. Pass the sentence (after tokenization) into the pre-trained model 
2. Obtain the final layer contextual embeddings for the `"_"` token as well as the *first* token for the substring of `option1` and likewise for `option2`. 
3. Compute the cosine similarity between the contextual embedding for `"_"` and the embedding we chose for `option1`, and likewise for `option2`. 
4. Whichever has the higher cosine similarity (`option1` or `option2`), make this the prediction. 
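
Steps 3 and 4 can be sketched with plain Python on made-up vectors. The three embeddings below are hypothetical toy values, not model outputs, and `pick_option` is an illustrative name. In the real function you would take the embeddings from the model's final hidden states instead.

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_option(blank: Sequence[float], opt1: Sequence[float], opt2: Sequence[float]) -> int:
    """Return 1 if the blank is more similar to option1, else 2 (ties go to 2)."""
    return 1 if cosine_similarity(blank, opt1) > cosine_similarity(blank, opt2) else 2

# Hypothetical 3-dimensional "embeddings" for illustration only.
blank_embedding = [1.0, 0.0, 1.0]
option1_embedding = [0.9, 0.1, 0.8]    # points in nearly the same direction as the blank
option2_embedding = [-1.0, 0.5, 0.0]   # points away from the blank

prediction = pick_option(blank_embedding, option1_embedding, option2_embedding)
print(prediction)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing contextual embeddings.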

In [125]:
# TODO: Your implementation
def zero_shot_predictions(model, tokenizer, dataset) -> List[int]: 
    """
    Make zero-shot predictions with the last layer contextual embedding
    cosine similarity method described in the previous cell. 
    
    Returns: 
        List[int], a list of ints, one element for each 
        example in the input dataset. Each element is an int: 
            - 1 corresponding to "option1" in the dataset
            - 2 corresponding to "option2" in the dataset
    
    Note: 
        - For now, it's ok if you have a for-loop over examples. 
          (In an actual industry setting, you would make this all parallelized)
        - You might make use of the `which_tok_index()` helper function 
        you just implemented 
        - The documentation on AutoModelForSequenceClassification may be helpful here. 
        - torch.nn.functional may have some helpful methods 
        - Using model.eval() and torch.no_grad() will speed things up (since PyTorch will not
            have to build the computation graph)
    """
    predictions = []

    
    model.eval()
    with torch.no_grad():
        for example in dataset:
            sentence = example["sentence"]
            option1 = example["option1"]
            option2 = example["option2"]
        
            inputs = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
            
            blank_index = which_tok_index("_", sentence, inputs)
            option1_index = which_tok_index(option1, sentence, inputs)
            option2_index = which_tok_index(option2, sentence, inputs)

            inputs.pop("offset_mapping")
            outputs = model(**inputs, output_hidden_states = True)
            last_hidden_states = outputs.hidden_states[-1][0] 

            blank_embedding = last_hidden_states[blank_index]
            option1_embedding = last_hidden_states[option1_index]
            option2_embedding = last_hidden_states[option2_index]
            
            similarity1 = F.cosine_similarity(blank_embedding, option1_embedding, dim=0)
            similarity2 = F.cosine_similarity(blank_embedding, option2_embedding, dim=0)
            
            prediction = 1 if similarity1 > similarity2 else 2
            predictions.append(prediction)

        return predictions

In [126]:
print(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# You might see a warning below. 
# We'll end up doing this in the next part of the homework;)

distilbert/distilbert-base-uncased


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [127]:
# Test your code on just a single example 
zero_shot_predictions(model, tokenizer, validation_dataset.select(range(1)))

[2]

In [128]:
# Make the full predictions 
preds = zero_shot_predictions(model, tokenizer, validation_dataset)

In [129]:
# Uncomment and check the following
len(preds) == len(validation_dataset)

True

#### Evaluation

In [130]:
try: 
    truth = [int(x['answer']) for x in validation_dataset]
    assert len(truth) == len(preds)
    y_true = np.array(truth)
    y_pred = np.array(preds)
    y_baseline = np.ones(len(y_true)) *2
    print("F1 of baseline (maj. class)=", np.round(f1_score(y_true, y_baseline, pos_label=2), 2))
    print("F1 of zero-shot=", np.round(f1_score(y_true, y_pred, pos_label=2), 2))
except Exception: 
    print("Need preds to be the same length as truth for eval")

F1 of baseline (maj. class)= 0.67
F1 of zero-shot= 0.43
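
To interpret the baseline number, it helps to compute F1 by hand. The sketch below assumes a perfectly balanced label set, which is a hypothetical stand-in because the true label counts of the 100 dev examples are not shown here. `f1_for_label` is an illustrative helper, not the scikit-learn implementation.

```python
from typing import Sequence

def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int], pos_label: int) -> float:
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical balanced labels: 50 examples with answer 1 and 50 with answer 2.
y_true = [1, 2] * 50
# The baseline predicts option 2 for every example.
y_baseline = [2] * len(y_true)

baseline_f1 = f1_for_label(y_true, y_baseline, pos_label=2)
print(round(baseline_f1, 2))
```

Under this assumption, always predicting 2 gives recall 1.0 and precision 0.5, so F1 is 2/3. A constant predictor can therefore score well on F1 without learning anything, which is worth keeping in mind when comparing against the zero-shot number.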


**To what do you attribute the zero-shot model's performance?** 

This task largely depends on the model somehow "understanding" the meanings of words like 'because' and 'but' (as in the second example). Since the input is just a single, relatively short sentence, the model does not have much context to work with, so it does not effectively capture the meanings of the words that change between the two sentences. The pre-trained model is also too general: it was never trained on this specific task or dataset. The model might have done better if we had more context on either side of the transition words (for example, a paragraph on each side). It might also have done better if it had more layers. 

## 2C. Extra Credit: Fine-tuning

Try fine-tuning a model on this training dataset (or other training datasets). Can you do better than the zero-shot model? 

Tips:
- [This](https://huggingface.co/docs/transformers/en/tasks/sequence_classification) HuggingFace tutorial may be helpful for learning the transformers syntax. 
- Think **carefully** about the dataset and task when you're reasoning about the model's performance. 


In [107]:
# TODO: Put your code here




TypeError: Trainer.__init__() got an unexpected keyword argument 'processing_class'
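
The `TypeError` above is what `Trainer` raises when it is given a keyword that its installed version does not know. As far as we can tell, newer releases of `transformers` accept `processing_class`, while older ones expect `tokenizer` for the same purpose. A minimal sketch, assuming that is the cause, picks whichever keyword the installed class supports. `tokenizer_keyword` and the two stand-in classes are hypothetical names used only for illustration.

```python
import inspect

def tokenizer_keyword(trainer_cls: type) -> str:
    """Return the keyword that trainer_cls.__init__ accepts for the tokenizer."""
    params = inspect.signature(trainer_cls.__init__).parameters
    return "processing_class" if "processing_class" in params else "tokenizer"

# Stand-ins for the two constructor styles (not the real Trainer class).
class OldStyleTrainer:
    def __init__(self, model=None, args=None, tokenizer=None):
        self.tokenizer = tokenizer

class NewStyleTrainer:
    def __init__(self, model=None, args=None, processing_class=None):
        self.processing_class = processing_class

for cls in (OldStyleTrainer, NewStyleTrainer):
    keyword = tokenizer_keyword(cls)
    instance = cls(**{keyword: "my-tokenizer"})  # works for either style
    print(cls.__name__, "->", keyword)
```

With the real class, the same idea would look like `Trainer(model=model, args=training_args, **{tokenizer_keyword(Trainer): tokenizer})`. Alternatively, upgrading `transformers` or simply switching the keyword by hand would also avoid the error.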

## Submission

In [131]:
%%bash

if [[ ! -f "./hw6.ipynb" ]]
then
    echo "WARNING: Did not find notebook in Jupyter working directory. Manual solution: go to File->Download .ipynb to download your notebook and other files, then zip them locally."
else
    echo "Found notebook file, creating submission zip..."
    zip -r submission.zip hw6.ipynb
fi

Found notebook file, creating submission zip...
  adding: hw6.ipynb (deflated 73%)


