Make sure your Colaboratory environment has the `datasets` and `transformers` libraries installed.

In [None]:
!pip install datasets==2.2.1 transformers==4.19.1

In [3]:
import os
import numpy as np
import torch
import random
import datasets
from datasets import load_dataset, load_metric
from transformers import pipeline, Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification
import matplotlib.pyplot as plt
import platform
import psutil

# enabling inline plots in Jupyter
%matplotlib inline

datasets.logging.set_verbosity_error()

# Exercise: Exploring BERT

In this exercise set, we will be playing with the Transformer model BERT. We will start by moving our workflow to Google Colaboratory, where we can open Jupyter notebook (.ipynb) files and run their code using free GPUs. Next, we will explore the stereotype content of a pre-trained BERT model. Finally, we will fine-tune the BERT model to specialize in the `tweet_eval` sentiment classification task.

# 1. Moving to Google Colaboratory

With Google Colab, you can run Python notebooks in your browser while getting free access to a GPU.

There is a great guide on how to get started with Google Colab [here](https://towardsdatascience.com/getting-started-with-google-colab-f2fff97f594c). You will need a Google account to do so.

If you want to work on this exercise notebook, then it should be as simple as uploading it to your Google Drive, right-clicking and choosing Open With, and then picking Google Colab (if it is not listed, you may have to click Connect more apps first, search for Colab, and install it).

1. Make sure that your Colaboratory notebook has the GPU enabled. You can do so in the top menu bar, under Runtime. Click Change Runtime Type and make sure the Hardware Accelerator is set to GPU.

2. The `memory_report()` function below tells you which GPU you have been assigned, how much memory it has in total, and how much of that memory is currently reserved and allocated. We have wrapped this in a function so that we can call it again later as we use up memory.

In [None]:
# GPU housekeeping code: you do not need to modify anything, simply
# read through it to understand what is going on, and run as is

# If a GPU is available on Google Colab, use it. Otherwise use the local CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# a helper function to format byte counts into KB, MB and so on
def bytes_format(b):
    if b < 1000:
        return f'{b} B'
    elif b < 1000000:
        return f'{round(b / 1000, 2)} KB'
    elif b < 1000000000:
        return f'{round(b / 1000000, 2)} MB'
    else:
        return f'{round(b / 1000000000, 2)} GB'

# a helper function to check the amount of available memory
def memory_report():
    if device != 'cpu':
        print(f"GPU available: {torch.cuda.get_device_name()}")
        # print(torch.cuda.memory_summary())
        total = torch.cuda.get_device_properties(0).total_memory
        reserved = torch.cuda.memory_reserved(0)
        allocated = torch.cuda.memory_allocated(0)
        # free = reserved - allocated  # free inside memory_reserved
        print(f"Total cuda memory: {bytes_format(total)}, reserved: {bytes_format(reserved)}, allocated: {bytes_format(allocated)}")
    else:
        # Print the total system memory (RAM) available to the CPU
        print(f'hi! im {platform.processor()}, ur cpu, the GPU is not available rn')
        total_memory = psutil.virtual_memory().total
        print(f"Total CPU memory: {bytes_format(total_memory)}")

memory_report()
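
Note that the byte formatter above divides by powers of 1000, so it reports decimal units rather than the binary units (GiB) in which GPU memory is often advertised. The short check below is a self-contained sketch of that behavior; `format_bytes_decimal` is a hypothetical stand-in that mirrors `bytes_format`, not a library function.

```python
def format_bytes_decimal(b):
    """Format a byte count using decimal units (1 KB = 1000 B), like bytes_format."""
    for threshold, unit in ((1_000_000_000, "GB"), (1_000_000, "MB"), (1_000, "KB")):
        if b >= threshold:
            return f"{round(b / threshold, 2)} {unit}"
    return f"{b} B"


# A card with 16 GiB of memory holds 16 * 2**30 bytes,
# which shows up as a little over 17 decimal gigabytes.
print(format_bytes_decimal(16 * 2**30))  # 17.18 GB
print(format_bytes_decimal(999))         # 999 B
```

So if the reported total looks slightly larger than the number in the GPU's product name, that is likely a unit difference rather than extra memory.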

# 2. Playing with Masked Language Models.

1. We are going to explore a smaller version of the pre-trained BERT model that is called "BERT-medium".

The easiest way to run pre-trained transformer models is with the [pipeline](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/pipelines#transformers.pipeline) function in the Hugging Face transformers library. Its two main arguments are the task that you want the model to execute (chosen from a list of named tasks) and the model itself (either its name or an already loaded model object).

Here, we will use the pipeline for the core masked language model task (`fill-mask`): filling in the blanks with the missing words. 

> For example, if you ask the model to complete the sentence "I ate __ for breakfast", it should complete the sentence with words denoting food rather than e.g. furniture. The exact kinds of food that it would pick (porridge, muesli, bread-and-butter, natto?) would likely reflect the prevalent co-occurrence patterns in its training data, which in turn says something about the people who wrote those texts.

The model that we will use in the pipeline is called `prajjwal1/bert-medium`. You might want to keep the name of the model as a global variable, so you can easily re-run your whole script with other models.

2. Initialize the Masked Language Model pipeline. You can then call the pipeline object on any string, with the `[MASK]` token in place of the word you would like the model to predict. Make the model fill in the blank of a test string.

3. Let us see if this model happens to encode any stereotypes! Experiment with your pipeline and any stereotype of your choice that can be encoded in a sentence. For example, `Mothers are typically [MASK].` Remember to also test another social group, e.g. `Fathers are typically [MASK].`, to see if the model's results are actually different. Design 4-6 sentence pairs targeting your favorite stereotype, get the top 3 completions for each sentence, and see whether they suggest that there is indeed an undesirable association.

You can look at this [paper](https://aclanthology.org/2021.acl-long.329.pdf) for inspiration.

If you are running out of memory or have to work in a non-GPU environment, you can also switch BERT-medium to `prajjwal1/bert-small` or `prajjwal1/bert-tiny`.
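
One way to organize the sentence pairs from step 3 is sketched below. It assumes that the model is exposed as a callable that takes one sentence and returns a list of dictionaries with `token_str` and `score` keys, which is how the `fill-mask` pipeline reports completions for a single sentence with one mask. All helper names are hypothetical, and `fake_fill_mask` is a stand-in with made-up output, so the sketch runs without downloading a model and says nothing about what BERT actually predicts.

```python
from typing import Callable, Dict, List


def make_pair(template: str, group_a: str, group_b: str) -> Dict[str, str]:
    """Fill the {group} slot of one template with each of two group names."""
    return {g: template.format(group=g) for g in (group_a, group_b)}


def top_tokens(results: List[dict], k: int = 3) -> List[str]:
    """Return the k highest-scoring completions from fill-mask style output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["token_str"].strip() for r in ranked[:k]]


def compare_pair(
    fill_mask: Callable[[str], List[dict]], pair: Dict[str, str], k: int = 3
) -> Dict[str, List[str]]:
    """Run both sentences of a pair through the model and keep the top-k words."""
    return {group: top_tokens(fill_mask(sentence), k) for group, sentence in pair.items()}


def fake_fill_mask(sentence: str) -> List[dict]:
    """Hypothetical stand-in for the pipeline: made-up words and scores, no model."""
    return [
        {"token_str": "placeholder_b", "score": 0.2},
        {"token_str": "placeholder_a", "score": 0.5},
        {"token_str": "placeholder_d", "score": 0.05},
        {"token_str": "placeholder_c", "score": 0.1},
    ]


pair = make_pair("{group} are typically [MASK].", "Mothers", "Fathers")
print(pair["Mothers"])  # Mothers are typically [MASK].
print(compare_pair(fake_fill_mask, pair))
```

Generating both sentences from a single template keeps the pair identical except for the group word, so any difference in the completions can be attributed to that word alone.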

In [None]:
MODEL_NAME = 
mlm = pipeline(FILLINTHEBLANK)
mlm(FILLINTHEBLANK) #try the pipeline on a test string

Some weights of the model checkpoint at prajjwal1/bert-medium were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# 3. Fine-tuning the Masked Language Model for Classification: data preparation

Prepare the `tweet_eval` dataset:

1. Load the `tweet_eval` dataset with the Hugging Face `load_dataset` function as usual, selecting its `sentiment` configuration. If you find that your computer is struggling, you can use a subset of the training data.
2. Since we are using a pre-trained BERT model, we need to feed it the tweet tokens in exactly the format that it expects. Tokenize the tweets with the tokenizer associated with our masked language model, loaded via [AutoTokenizer](https://huggingface.co/docs/transformers/v4.19.0/en/model_doc/auto#transformers.AutoTokenizer).
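
To make "the format that it expects" more concrete, here is a minimal sketch of a batch tokenization function and the layout of its output. It assumes that the tweets live in a column named `text` and that the tokenizer accepts the `truncation`, `padding`, and `max_length` keyword arguments, as Hugging Face tokenizers do. `tokenize_batch` and `WhitespaceTokenizer` are hypothetical names; the latter is a toy stand-in so that the example runs without downloading anything, and it omits fields such as `token_type_ids` that a real BERT tokenizer also returns.

```python
def tokenize_batch(batch, tokenizer, max_length=64):
    """Tokenize the "text" column of a batch, padding and truncating to max_length."""
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )


class WhitespaceTokenizer:
    """Hypothetical stand-in with the same call shape as a Hugging Face tokenizer.

    It splits on whitespace and uses each word's length as a fake token id,
    so it only illustrates the output layout, not real WordPiece tokenization.
    """

    def __call__(self, texts, truncation=True, padding="max_length", max_length=8):
        input_ids, attention_mask = [], []
        for text in texts:
            ids = [len(word) for word in text.split()][:max_length]
            n_pad = max_length - len(ids)
            # 0 plays the role of the padding token id
            input_ids.append(ids + [0] * n_pad)
            # 1 marks a real token, 0 marks padding
            attention_mask.append([1] * len(ids) + [0] * n_pad)
        return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = {"text": ["good morning", "no"]}
encoded = tokenize_batch(batch, WhitespaceTokenizer(), max_length=4)
print(encoded["input_ids"])       # [[4, 7, 0, 0], [2, 0, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 0, 0], [1, 0, 0, 0]]
```

With the real tokenizer, a function of this shape could be applied to a whole dataset with something like `dataset.map(lambda batch: tokenize_batch(batch, tokenizer), batched=True)`; the right `max_length` depends on how long the tweets are and how much memory you have.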

In [None]:
train_dataset = 
val_dataset = 

# set up the tokenizer we want to use
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

#apply this tokenizer to the texts in both datasets
tokenized_train_dataset = 
tokenized_val_dataset = 

# 4. Fine-tuning and evaluating the Masked Language Model.

Now, we will fine-tune BERT to make sentiment predictions on the `tweet_eval` dataset. This means we are updating the parameters of the model, and especially those of the last layers, to make use of its general knowledge about language (i.e. how to encode information from sentences) while also letting it specialize in the current sentiment prediction task.

We will train the model using the Hugging Face Trainer class. You can find a great guide on it [here](https://huggingface.co/docs/transformers/v4.24.0/en/training). As its arguments, it expects a model, a set of training arguments, a function that computes the evaluation metric, the training and evaluation data, and the tokenizer that we used. It will then train the model for us on the data and report along the way how its performance changes.

1. Initialize the pre-trained model using the [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.19.0/en/model_doc/auto#transformers.AutoModelForSequenceClassification) class. Set it up for classification into 3 classes with the `num_labels` argument. Then move it to the GPU.

In [None]:
model = 
model.to(device) #move to GPU

2. Prepare the [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) object. The model should be trained for 3 epochs, with training and validation loss reported every 500 steps (iterations). The `per_device_train_batch_size` is 8 by default; see how far you can raise it without running out of memory.
3. The function to compute the "accuracy" evaluation metric (loaded from the datasets library with `datasets.load_metric`) is provided for you. Under the hood, it is [based](https://github.com/huggingface/datasets/blob/master/metrics/accuracy/accuracy.py) on the sklearn accuracy metric.
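
When choosing the batch size, it helps to know how many optimization steps a run will take, since the reporting interval is measured in steps. The arithmetic below is a rough sketch under stated assumptions: a single device, no gradient accumulation, the last partial batch of each epoch kept, and a hypothetical training set of 40,000 examples (substitute the size of the data you actually load). Both function names are made up for illustration.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Total optimizer steps when every batch, including the last partial one, is used."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def num_reports(total_steps, every=500):
    """How many times a report is produced if one is made every `every` steps."""
    return total_steps // every


for batch_size in (8, 32):
    total = training_steps(40_000, batch_size, num_epochs=3)
    print(batch_size, total, num_reports(total))
# 8 15000 30
# 32 3750 7
```

Raising the batch size shortens the run in steps, so with a fixed interval of 500 steps you get fewer intermediate reports.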



In [None]:
training_args = TrainingArguments(SET_THE_PARAMETERS)

In [None]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    """the HuggingFace Trainer expects a function that processes outputs and labels"""
    outputs, labels = eval_pred
    predictions = np.argmax(outputs, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
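
To see what `compute_metrics` does with the model outputs, the following dependency-free sketch reproduces the same two steps on toy numbers: pick the highest-scoring class per row, then compare with the reference labels. The logits and labels are invented for illustration, and `accuracy_from_logits` is a hypothetical helper rather than part of any library.

```python
def argmax(row):
    """Index of the largest value in a row (the first one in case of ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four toy examples with three class scores each (made-up numbers)
logits = [
    [2.0, 0.1, -1.0],  # predicted class 0
    [0.2, 0.3, 1.5],   # predicted class 2
    [0.0, 0.9, 0.4],   # predicted class 1
    [1.1, 1.0, 0.2],   # predicted class 0
]
labels = [0, 2, 2, 0]
print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```

The returned dictionary has the same shape as what the Trainer expects from `compute_metrics`: metric names mapped to values.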



---


4. Create the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object. Pass it the `model`, the training arguments (`args`), the pre-defined metric function (`compute_metrics`), the `train_dataset` and `eval_dataset`, as well as the `tokenizer` object.
5. Train the model. Are you getting better results than with the RNN-based model?

In [None]:
trainer = Trainer(FILLINTHEBLANK)

In [None]:
# train the model here using the trainer object you defined