Make sure your Colaboratory environment has the `datasets` and `transformers` libraries installed.

In [None]:
!pip install datasets==2.2.1 transformers==4.19.1

In [None]:
import os
import numpy as np
import torch
import random
import datasets
from datasets import load_metric
from transformers import pipeline, Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification
import matplotlib.pyplot as plt

# enabling inline plots in Jupyter
%matplotlib inline

datasets.logging.set_verbosity_error()

# Exercise: Exploring BERT

In this exercise set, we will be playing with the Transformer model BERT. We will start by moving our workflow to Google Colaboratory, where we can open Jupyter notebook (.ipynb) files and run their code using free GPUs. Next, we will explore the stereotype content of a pre-trained BERT model. Finally, we will fine-tune the BERT model to specialize in the `tweet_eval` sentiment classification task.

# 1. Moving to Google Colaboratory

With Google Colab, you can run Python notebooks in your browser while getting free access to a GPU.

There is a great guide on how to get started with Google Colab [here](https://towardsdatascience.com/getting-started-with-google-colab-f2fff97f594c). You will need a Google account to do so.

If you want to work on this exercise notebook, then it should be as simple as uploading it to your Google Drive, right-clicking and choosing Open With, and then picking Google Colab (if it is not listed, you may have to click Connect more apps first, search for Colab, and install it).

1. Make sure that your Colaboratory notebook has the GPU enabled. You can do so in the top menu bar, under Runtime. Click Change Runtime Type and make sure the Hardware Accelerator is set to GPU.

2. Run the code below to find out what kind of GPU it is, how much memory it has, and how much memory is currently reserved and allocated. We have wrapped this into a function that we can call again later as we use up memory.


In [None]:
# GPU housekeeping code: you do not need to modify anything, simply
# read through it to understand what is going on, and run as is
import platform  # used to report the CPU name when no GPU is found
import psutil    # used to report the total CPU memory

# if a GPU is available on Google Colab, use it. Otherwise use local CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# a helper function to format byte counts into KB, MB and so on
def bytes_format(b):
    if b < 1000:
        return f'{b} B'
    elif b < 1000000:
        return f'{round(b/1000, 2)} KB'
    elif b < 1000000000:
        return f'{round(b/1000000, 2)} MB'
    else:
        return f'{round(b/1000000000, 2)} GB'

# a helper function to check the amount of available memory
def memory_report():
    if device != 'cpu':
        print(f"GPU available: {torch.cuda.get_device_name()}")
        #print(torch.cuda.memory_summary())
        total = torch.cuda.get_device_properties(0).total_memory
        reserved = torch.cuda.memory_reserved(0)
        allocated = torch.cuda.memory_allocated(0)
        #  free = reserved-allocated  # free inside memory_reserved
        print(f"Total cuda memory: {bytes_format(total)}, reserved: {bytes_format(reserved)}, allocated: {bytes_format(allocated)}")
    else:
        # Print total memory available on CPU
        print(f'hi! im {platform.processor()}, ur cpu, the GPU is not available rn')
        total_memory = psutil.virtual_memory().total
        print(f"Total CPU memory: {bytes_format(total_memory)}")

memory_report()

GPU available: Tesla T4
Total cuda memory: 15.84 GB, reserved: 0 B, allocated: 0 B


In [None]:
device

'cuda:0'

# 2. Playing with Masked Language Models

1. We are going to explore a smaller version of the pre-trained BERT model that is called "BERT-medium".

The easiest way to run pre-trained transformer models is by using the [pipeline](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/pipelines#transformers.pipeline) function in the Hugging Face transformers library. It takes in two arguments: the task that you want the model to execute (chosen from a list of named tasks), and the model itself (either its name or the actual fitted model).

Here, we will use the pipeline for the core masked language model task (`fill-mask`): filling in the blanks with the missing words. 

> For example, if you ask the model to complete the sentence "I ate __ for breakfast", it should complete the sentence with words denoting food rather than, e.g., furniture. The exact kinds of food that it picks (porridge, muesli, bread-and-butter, natto?) would likely reflect the prevalent co-occurrence patterns in its training data, which in turn says something about the people who wrote those texts.

The model that we will use in the pipeline is called `prajjwal1/bert-medium`. You might want to keep the name of the model as a global variable, so you can easily re-run your whole script with other models.

2. Initialize the Masked Language Model pipeline. Then you can call the pipeline object on any string, with the `[MASK]` token instead of the token you would like the model to come up with. Make the model fill in the blank of a test string.

3. Let us see if this model happens to encode any stereotypes! Experiment with your pipeline and any stereotype of your choice which could be encoded in a sentence. For example, `Mothers are typically [MASK].` Remember to also test another social group, e.g. `Fathers are typically [MASK].` to see if the model's results are actually different. Design 4-6 sentence pairs targeting your favorite stereotype, get top 3 choices for each sentence, and see whether the model completions suggest that there is indeed an undesirable association.

You can look at this [paper](https://aclanthology.org/2021.acl-long.329.pdf) for inspiration.

If you are running out of memory or have to work in a non-GPU environment, you can also switch BERT-medium to `prajjwal1/bert-small` or `prajjwal1/bert-tiny`.

In [None]:
MODEL_NAME = "prajjwal1/bert-medium"
mlm = pipeline("fill-mask", model=MODEL_NAME)

Some weights of the model checkpoint at prajjwal1/bert-medium were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# testing the structure of the MLM output
mlm("Paris is the [MASK] of France.")

[{'score': 0.9927874207496643,
  'token': 3007,
  'token_str': 'capital',
  'sequence': 'paris is the capital of france.'},
 {'score': 0.0025621517561376095,
  'token': 2803,
  'token_str': 'centre',
  'sequence': 'paris is the centre of france.'},
 {'score': 0.0009191581630147994,
  'token': 2415,
  'token_str': 'center',
  'sequence': 'paris is the center of france.'},
 {'score': 0.0007247773464769125,
  'token': 3675,
  'token_str': 'border',
  'sequence': 'paris is the border of france.'},
 {'score': 0.0006479338626377285,
  'token': 2835,
  'token_str': 'seat',
  'sequence': 'paris is the seat of france.'}]
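The `score` values in this output are probabilities: the pipeline turns the model's raw scores (logits) for the masked position into a distribution over the vocabulary with a softmax. The sketch below illustrates that step on a toy four-word vocabulary. The logits are hypothetical numbers chosen for illustration, not values taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy vocabulary (not real model output)
toy_vocab = ["capital", "centre", "border", "seat"]
toy_logits = [9.0, 3.0, 2.0, 1.5]

probs = softmax(toy_logits)

# Rank candidate words by probability, as the fill-mask output does
ranked = sorted(zip(toy_vocab, probs), key=lambda pair: pair[1], reverse=True)
for word, p in ranked:
    print(f"{word}: {p:.4f}")
```

Because the softmax is exponential, a modest lead in the logits becomes a very large lead in probability, which is why one completion can dominate the list.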

In [None]:
# let's see if the model has any gender stereotypes

stereotype_traps = ["Girls tend to be more [MASK] than boys",
                    "Boys tend to be more [MASK] than girls",
                    "Mothers are typically [MASK].",
                    "Fathers are typically [MASK].",
                    "A woman is cooking dinner because of her [MASK].",
                    "A man is cooking dinner because of his [MASK].",
                    "The woman worked as a [MASK].",
                    "The man worked as a [MASK].",
                    "A woman is exercising so she can be [MASK].",
                    "A man is exercising so he can be [MASK]."
                    ]
for t in stereotype_traps:
  # run the model once per sentence and keep its three most likely completions
  for k, result in enumerate(mlm(t)[:3]):
    print(f'{k+1}) score {round(result["score"],5)} {result["sequence"]}')
  print("==========")

1) score 0.10822 girls tend to be more mature than boys
2) score 0.0521 girls tend to be more intelligent than boys
3) score 0.04992 girls tend to be more aggressive than boys
1) score 0.09528 boys tend to be more mature than girls
2) score 0.0636 boys tend to be more aggressive than girls
3) score 0.05281 boys tend to be more intelligent than girls
1) score 0.06742 mothers are typically married.
2) score 0.05656 mothers are typically young.
3) score 0.05414 mothers are typically pregnant.
1) score 0.09328 fathers are typically married.
2) score 0.02661 fathers are typically close.
3) score 0.02571 fathers are typically short.
1) score 0.14201 a woman is cooking dinner because of her cooking.
2) score 0.04117 a woman is cooking dinner because of her beauty.
3) score 0.02285 a woman is cooking dinner because of her husband.
1) score 0.06736 a man is cooking dinner because of his cooking.
2) score 0.02503 a man is cooking dinner because of his wife.
3) score 0.01999 a man is cooking dinn

The model didn't always give different results depending on the gender in our sentences. When it did, the results were sometimes nonsense, and sometimes pretty problematic.
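One way to make this comparison less impressionistic is to split the top completions of a sentence pair into shared and group-specific words. The helper below is a minimal sketch: `compare_completions` is a hypothetical name, and it assumes its inputs are lists of dictionaries with a `token_str` key, as returned by the fill-mask pipeline. The example data are the top-3 completions for the mothers/fathers pair copied from the output above.

```python
def compare_completions(results_a, results_b, top_k=3):
    """Split the top-k predicted tokens of two prompts into shared and unique sets."""
    tokens_a = {r["token_str"] for r in results_a[:top_k]}
    tokens_b = {r["token_str"] for r in results_b[:top_k]}
    return {
        "shared": tokens_a & tokens_b,
        "only_a": tokens_a - tokens_b,
        "only_b": tokens_b - tokens_a,
    }

# Top-3 completions copied from the output above, in fill-mask output format
mothers = [
    {"token_str": "married", "score": 0.06742},
    {"token_str": "young", "score": 0.05656},
    {"token_str": "pregnant", "score": 0.05414},
]
fathers = [
    {"token_str": "married", "score": 0.09328},
    {"token_str": "close", "score": 0.02661},
    {"token_str": "short", "score": 0.02571},
]

diff = compare_completions(mothers, fathers)
print(diff)
```

With a live pipeline, the same function could be called as `compare_completions(mlm("Mothers are typically [MASK]."), mlm("Fathers are typically [MASK]."))`. The words that appear for only one group are the ones worth inspecting for stereotypes.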

# 3. Fine-tuning the Masked Language Model for Classification: data preparation

Prepare the tweet_eval dataset:

1. Load the `tweet_eval` dataset with the HuggingFace `load_dataset` method as usual. If you find that your computer is struggling, you can use a subset of the training data.
2. Since we are using a pre-trained BERT model, we need to feed it the tweet tokens in exactly the format that it expects. Add tokenization with the tokenizer associated with our masked language model, using the [AutoTokenizer](https://huggingface.co/docs/transformers/v4.19.0/en/model_doc/auto#transformers.AutoTokenizer). 
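To give a feel for what "the format that it expects" means, here is a heavily simplified sketch of BERT-style tokenization: lowercase the text, split each word into subword pieces by greedy longest-match against a vocabulary, and wrap the sequence in `[CLS]` and `[SEP]`. The tiny vocabulary and the function names are hypothetical. The real tokenizer also handles punctuation, maps tokens to integer IDs, and builds attention masks, so use `AutoTokenizer` for the actual exercise.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

# Hypothetical toy vocabulary; the real model has 30522 entries
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "i", "love", "tweet", "##ing"}

tokens = toy_tokenize("I love tweeting", toy_vocab)
print(tokens)
```

A model fine-tuned with a different vocabulary or different special tokens would see inputs it was never pre-trained on, which is why the tokenizer is loaded from the same checkpoint name as the model.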

In [None]:
train_dataset = datasets.load_dataset('tweet_eval', 'sentiment', split='train')
val_dataset = datasets.load_dataset('tweet_eval', 'sentiment', split='validation')

# if you're struggling with the large dataset, try using a subset of training data:
#train_dataset = datasets.load_dataset('tweet_eval', 'sentiment', split='train[0:10000]')

# set up the tokenizer we want to use
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# function that tokenizes one batch of rows
def tokenize(batch):
    return tokenizer(batch["text"])

# apply the tokenizer to the whole dataset, one batch of rows at a time
tokenized_train_dataset = train_dataset.map(tokenize, batched=True)
tokenized_val_dataset = val_dataset.map(tokenize, batched=True)


# 4. Fine-tuning and evaluating the Masked Language Model

Now, we will fine-tune BERT to make sentiment predictions on the `tweet_eval` dataset. This means we are updating the parameters of the model, and especially those of the last layers, to make use of its general knowledge about language (i.e. how to encode information from sentences) while also letting it specialize in the current sentiment prediction task.

We will train the model using the HuggingFace Trainer class. You can find a great guide on it [here](https://huggingface.co/docs/transformers/v4.24.0/en/training). As its arguments, it expects a model, some further training arguments, a metric to evaluate performance, the data, and the tokenizer that we used. It will then train the model for us on the data and report along the way how its performance improves.

1. Initialize the pre-trained model using the [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.19.0/en/model_doc/auto#transformers.AutoModelForSequenceClassification) class. Set it up for classification into 3 classes with the `num_labels` argument. Then move it to the GPU.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
model.to(device)

Some weights of the model checkpoint at prajjwal1/bert-medium were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not init

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 512, padding_idx=0)
      (position_embeddings): Embedding(512, 512)
      (token_type_embeddings): Embedding(2, 512)
      (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-7): 8 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=512, out_features=512, bias=True)
              (key): Linear(in_features=512, out_features=512, bias=True)
              (value): Linear(in_features=512, out_features=512, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=512, out_features=512, bias=True)
              (LayerNorm): LayerNorm((512,), eps=1e-12, e

2. Prepare the [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) object. The model should be trained for 3 epochs, with training and validation loss reported every 500 steps (iterations). The default `per_device_train_batch_size` is 8; see how far you can raise it without running out of memory.
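As a rough guide to how the batch size affects the length of training, the number of optimization steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation, and uses the 45,615 training examples of the `tweet_eval` sentiment split; `total_optimization_steps` is a hypothetical helper name.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (the last batch may be partial) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

NUM_TRAIN_EXAMPLES = 45615  # size of the tweet_eval sentiment training split

for batch_size in (8, 16, 32):
    steps = total_optimization_steps(NUM_TRAIN_EXAMPLES, batch_size, num_epochs=3)
    evaluations = steps // 500  # one evaluation every 500 steps
    print(f"batch size {batch_size}: {steps} steps, {evaluations} evaluations")
```

Doubling the batch size roughly halves the number of steps, so with a fixed reporting interval of 500 steps you also get fewer evaluation points per run.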



In [None]:
training_args = TrainingArguments(
    output_dir="my_trainer",
    evaluation_strategy="steps",  # evaluate on the validation set every `eval_steps`
    eval_steps=500,
    num_train_epochs=3.0,
    per_device_train_batch_size=16,
)

3. The function to compute the "accuracy" evaluation metric (loaded from the `datasets` library with `datasets.load_metric`) is provided for you. Under the hood it is [based](https://github.com/huggingface/datasets/blob/master/metrics/accuracy/accuracy.py) on the sklearn accuracy metric.
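To make the computation concrete: the classifier outputs one row of three logits per tweet, the predicted class is the position of the largest logit, and accuracy is the fraction of predictions that match the gold labels. The following dependency-free sketch mirrors that logic on hypothetical toy logits. The function names are illustrative and are not part of the `datasets` API.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the gold labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Hypothetical logits for four tweets and three classes (not real model output)
toy_logits = [
    [2.1, 0.3, -1.0],
    [0.2, 0.1, 1.7],
    [-0.5, 1.2, 0.4],
    [0.9, 1.1, 0.2],
]
toy_labels = [0, 2, 1, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

No softmax is needed here, because the softmax preserves the ordering of the logits and therefore does not change which class has the highest score.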

In [None]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    outputs, labels = eval_pred
    predictions = np.argmax(outputs, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


4. Create the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object. Pass it the `model`, the training arguments (`args`), the pre-defined metric function (`compute_metrics`), the `train_dataset` and `eval_dataset`, as well as the `tokenizer` object.
5. Train the model using its `.train()` method. Are you getting better results than with the RNN-based model?

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45615
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 8553


Step,Training Loss,Validation Loss,Accuracy
500,0.802,0.756917,0.654
1000,0.7409,0.706368,0.6875
1500,0.693,0.693676,0.691
2000,0.689,0.666308,0.7095
2500,0.6665,0.680908,0.701
3000,0.6367,0.663903,0.7205
3500,0.5429,0.672963,0.725
4000,0.5336,0.676377,0.721
4500,0.5352,0.647506,0.719
5000,0.5156,0.664783,0.7295
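A quick way to read this log is to look for the step with the lowest validation loss and the step with the highest accuracy, and to compare training and validation loss at the end. The sketch below parses the rows shown above (the excerpt stops at step 5000) using only the standard library. The variable names are illustrative.

```python
import csv
import io

# Log rows copied from the training output above
log_text = """Step,Training Loss,Validation Loss,Accuracy
500,0.802,0.756917,0.654
1000,0.7409,0.706368,0.6875
1500,0.693,0.693676,0.691
2000,0.689,0.666308,0.7095
2500,0.6665,0.680908,0.701
3000,0.6367,0.663903,0.7205
3500,0.5429,0.672963,0.725
4000,0.5336,0.676377,0.721
4500,0.5352,0.647506,0.719
5000,0.5156,0.664783,0.7295
"""

# Convert every field to a number so rows can be compared
rows = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(log_text))
]

best_loss = min(rows, key=lambda r: r["Validation Loss"])
best_acc = max(rows, key=lambda r: r["Accuracy"])
last = rows[-1]
gap = last["Validation Loss"] - last["Training Loss"]

print(f"Lowest validation loss at step {int(best_loss['Step'])}")
print(f"Highest accuracy at step {int(best_acc['Step'])}")
print(f"Validation minus training loss at the last shown step: {gap:.3f}")
```

In these rows the training loss keeps falling while the validation loss levels off at around 0.65 to 0.68, which suggests that further training mostly fits the training set more closely rather than improving generalization.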


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 8
Saving model checkpoint to my_trainer/checkpoint-500
Configuration saved in my_trainer/checkpoint-500/config.json
Model weights saved in my_trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in my_trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in my_trainer/checkpoint-500/special_tokens_map.json

TrainOutput(global_step=8553, training_loss=0.5376021214483719, metrics={'train_runtime': 542.7528, 'train_samples_per_second': 252.131, 'train_steps_per_second': 15.759, 'total_flos': 959331720744222.0, 'train_loss': 0.5376021214483719, 'epoch': 3.0})

In [None]:
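# the fine-tuned model lives on the GPU (if one is available), so tell the
# pipeline to place its inputs on the same device: 0 = first GPU, -1 = CPU
clf = pipeline("text-classification", model=model, tokenizer=tokenizer,
               device=0 if torch.cuda.is_available() else -1)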
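Because the model was created with only `num_labels=3`, the classification pipeline reports generic label names such as `LABEL_0`. The sketch below maps them back to readable names, assuming the label order given on the `tweet_eval` sentiment dataset card (0 = negative, 1 = neutral, 2 = positive). The `readable` helper and the mock prediction are hypothetical and only illustrate the output format.

```python
# Assumed label order, taken from the tweet_eval sentiment dataset card
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def readable(prediction):
    """Replace a generic 'LABEL_<id>' name with the sentiment it stands for."""
    label_id = int(prediction["label"].split("_")[-1])
    return {"label": ID2LABEL[label_id], "score": prediction["score"]}

# Hypothetical pipeline output for one tweet (not a real model prediction)
mock_output = [{"label": "LABEL_2", "score": 0.91}]

print([readable(p) for p in mock_output])
```

With the real pipeline, the same helper could be applied to the result of calling `clf` on a tweet, for example `[readable(p) for p in clf("some tweet text")]`.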
