In [None]:
# Installing datasets and transformers for Colab
!pip install datasets==2.2.1 transformers==4.19.1

In [None]:
import os
import numpy as np
from collections import Counter
import torch
import datasets
datasets.logging.set_verbosity_error()
from datasets import load_metric
from google.colab import drive
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import f1_score
import pandas as pd

# psutil and platform are used by memory_report() when no GPU is available
import psutil
import platform

In [None]:
# GPU housekeeping code: you do not need to modify anything, simply
# read through it to understand what is going on, and run as is

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# a helper function to format byte counts into KB, MB and so on
def bytes_format(b):
    if b < 1000:
        return f'{b} B'
    elif b < 1000000:
        return f'{round(b / 1000, 2)} KB'
    elif b < 1000000000:
        return f'{round(b / 1000000, 2)} MB'
    else:
        return f'{round(b / 1000000000, 2)} GB'

# a helper function to check the amount of available memory
def memory_report():
    if device != 'cpu':
        print(f"GPU available: {torch.cuda.get_device_name()}")
        # print(torch.cuda.memory_summary())
        total = torch.cuda.get_device_properties(0).total_memory
        reserved = torch.cuda.memory_reserved(0)
        allocated = torch.cuda.memory_allocated(0)
        # free = reserved - allocated  # free inside memory_reserved
        print(f"Total cuda memory: {bytes_format(total)}, reserved: {bytes_format(reserved)}, allocated: {bytes_format(allocated)}")
    else:
        # Print total memory available on CPU (requires the psutil and platform modules)
        print(f'Device is CPU {platform.processor()}. GPU is not available right now')
        total_memory = psutil.virtual_memory().total
        print(f"Total CPU memory: {bytes_format(total_memory)}")

memory_report()

# Exercise: sentence classification

In this exercise, we will look more closely at using supervised machine learning to classify sentences (and other short documents). Of course, classifying short documents is what we have been doing throughout sections 4 and 5 of the course. Here, we will look at irony prediction and stance detection as examples of tasks that go beyond sentiment classification. We will (1) take a closer look at annotations to understand the difficulty of coding (annotating) text, even for human coders; and (2) evaluate the performance of a fine-tuned, pre-trained BERT model on these tasks.

We will once again run this notebook on Google Colab (as in exercise set 4.3), so that we can use GPUs for fine-tuning BERT. Note that below, you will need to use a file with hand-coded annotations that you create yourself. This means you will have to give the notebook access to the Google drive folder where you store this file; the code for that is included below. 

# 1. Understand the irony detection data

Download the `tweet_eval` data set for the irony detection task. The whole suite of `tweet_eval` data sets is described [here](https://huggingface.co/datasets/tweet_eval); select "irony" as the subset to see examples of the irony detection task.

1. How many tweets are in the training and validation set? How many are in the irony and no-irony categories?
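
As a minimal sketch of the counting step, the snippet below tallies labels with `Counter`. The list `toy_labels` is hypothetical toy data rather than the real dataset, and the coding 0 = non-irony, 1 = irony is an assumption you should check against the dataset card.

```python
from collections import Counter

# Hypothetical toy labels; assumed coding: 0 = non-irony, 1 = irony
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

# Counter maps each distinct label to the number of times it occurs
label_counts = Counter(toy_labels)

print(f"Number of examples: {len(toy_labels)}")
print(f"Non-irony: {label_counts[0]}, irony: {label_counts[1]}")
```

The same pattern should carry over to the label column of the real training and validation sets once they are loaded.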

In [None]:
# load the tweet_eval irony datasets
train_dataset = 
val_dataset = 

2. Have a look at the [paper](https://aclanthology.org/S18-1005.pdf) that explains this dataset and task:

Van Hee, Cynthia, Els Lefever, and Véronique Hoste. "Semeval-2018 task 3: Irony detection in English tweets." In Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 39-50. 2018.

How were the tweets for this task selected (before being hand-coded)? How could this influence the performance of the task on other tweets? Discuss this with a neighbor, if you can.

3. Hand-annotate the irony of 50 randomly selected tweets yourself. Calculate the [Cohen's kappa](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) for interrater agreement between yourself and the original coder(s). Compare your annotations to those of the trained coder, and look at the disagreements: how many of them would you consider to be mistakes on your end, mistakes on their end, or tweets whose true irony label is simply unclear?

**Comments on the hand-annotating**

Normally, we would export a sample of the data without the labels, since we don't want to see the original coder's labels while we are doing our own annotations. Then, we would merge the original labels back in using their IDs. However, since data wrangling isn't the focus here, you can just export the data with labels to an Excel file, and "hide" the original labels as you do your own annotations. We provide some skeleton code below.
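
For reference, the label-free workflow described above could look roughly like the sketch below. It is not required for the exercise. The toy tweets, the `tweet_id` column, and the hard-coded `my_label` values are all hypothetical stand-ins for the real sample and for the manual annotation step.

```python
import pandas as pd

# Hypothetical toy sample standing in for the 50 sampled tweets
df_sample = pd.DataFrame({
    "text": ["Great, another Monday.", "I love sunny days.", "Oh sure, that will work."],
    "label": [1, 0, 1],
})

# Add an explicit ID so that the original labels can be merged back in later
df_sample = df_sample.reset_index(drop=True)
df_sample["tweet_id"] = df_sample.index

# The frame one would export for annotation: the original labels are left out
df_to_annotate = df_sample[["tweet_id", "text"]].copy()

# Stand-in for the manual annotation step (normally done in a spreadsheet)
df_to_annotate["my_label"] = [1, 0, 0]

# Merge the original labels back in by ID
df_merged = df_to_annotate.merge(
    df_sample[["tweet_id", "label"]], on="tweet_id", how="left"
)
print(df_merged)
```

Merging on an explicit ID rather than relying on row order keeps the two label columns aligned even if the annotation file is sorted or filtered along the way.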

**Comments on Cohen's Kappa Score**

The kappa score is a number between -1 and 1. Neuendorf (2002) rates levels of annotator agreement (also called intercoder reliability or ICR) as follows:

- Above 0.80: nearly perfect agreement
- Between 0.61 and 0.80: substantial agreement
- Between 0.41 and 0.60: moderate agreement
- Between 0.21 and 0.40: fair agreement
- Between 0 and 0.20: slight agreement

Zero or lower means no agreement (practically random labels). 
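
To make the score concrete, here is a small hand computation of kappa as (observed agreement - chance agreement) / (1 - chance agreement). The helper `cohen_kappa` and both label lists are hypothetical illustrations. For your actual annotations, `cohen_kappa_score` from scikit-learn (already imported above) is the function to use, and it should give the same value on these lists.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    n = len(labels_a)
    # Observed agreement: share of items on which the two coders agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both coders pick the same label
    # if each labels at random according to their own label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Undefined (division by zero) if both coders only ever use one identical label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy annotations for 10 tweets
my_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
original_labels = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

kappa = cohen_kappa(my_labels, original_labels)
print(round(kappa, 2))  # 0.4
```

Here the coders agree on 7 of 10 tweets (70%), but chance alone would produce 50% agreement, so kappa is only 0.4, at the top of the "fair" band.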

In [None]:
# In order to work in a directory on our Google drive, we first have to mount our drive

# NB: The code will trigger permission prompts 
drive.mount('/content/drive')

# Setting path to current working directory
path = '/content/drive/My Drive/path/to/folder/where/you/will/put/your/annotation/file'
#change directory
os.chdir(path)

In [None]:
# Selecting random sample of 50 tweets

# Setting a seed to make sure that everyone gets the same sample 
seed=42

# Converting dataset to a pandas dataframe
df_sample=pd.DataFrame(train_dataset)

# Taking sample 
df_sample=df_sample.sample(FILLINTHEBLANK, random_state=seed)

# Saving sample to excel-file that can be manually annotated 
# We open the xlsx-file in our drive as a google sheets file and manually annotate the data in a new column called "my_label"
# During annotation we hide the column "label" containing the existing annotations  
df_sample.to_excel("irony_annotation.xlsx", index = False)

# 2. Finetune BERT for irony detection

Building on the work that you did in exercise set 4.3, fine-tuning BERT for the `tweet_eval` sentence classification task, now it's time to fine-tune BERT for the irony detection task. You can use your own code or the solutions code from the previous exercise to accomplish this. That means:

1. Setting up `transformer` for the medium-size BERT model "prajjwal1/bert-medium"
2. Tokenizing the tweets with the tokenizer associated with our masked language model, using the [AutoTokenizer](https://huggingface.co/docs/transformers/v4.19.0/en/model_doc/auto#transformers.AutoTokenizer).
3. Initializing the pre-trained model using the [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.19.0/en/model_doc/auto#transformers.AutoModelForSequenceClassification) module, setting it up for classification into the right number of classes, and then moving it to GPU.
4. Preparing a [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) object and a function that computes the F1 evaluation metric, to be passed as arguments to the Trainer. Set the number of epochs to 5 (or 2-3 if you don't have a GPU) and the batch size to 16 in the training arguments.
5. Creating a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object and passing it the model, the training arguments (args), the pre-defined metric (compute_metric), the train_dataset and eval_dataset, as well as the tokenizer object.
6. Training the model using its `.train()` method.

What kind of performance do you see in terms of F1? How does this compare to the F1 scores reported in the paper?

In [None]:
# 1. Defining the model

# NB: Try using bert-small if running on CPU (not GPU)
bert_medium = 

In [None]:
# 2. Set up the tokenizer we want to use
tokenizer = 

# Note: the tokenizer itself runs on the CPU and does not need to be moved
# to the GPU; only the model (and the tensors fed to it) live on the device

# Apply the tokenizer to each row in the dataset
tokenized_train_dataset =
tokenized_val_dataset =

In [None]:
# 3. Initializing the pre-trained model using the AutoModelForSequenceClassification module
irony_classifier = 

# Moving model to GPU
irony_classifier.to(device)

In [None]:
# 4.  Setting the training arguments
# NB: If you are not connected to the GPU, try lowering the number of epochs to 2 or 3
training_args = 


In [None]:
# 4 (continued). Defining the F1 score metric
metric = #use the load_metric method from the datasets library to load f1 from sklearn

# Defining a function that computes it given a tuple of outputs and labels
def compute_f1(eval_pred):
    outputs, labels = eval_pred
    predictions = np.argmax(outputs, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
# 5. Defining a trainer object with the information from above 
trainer = Trainer(FILLINTHEBLANK)

# 6. Training and evaluating model using its .train() method


# 3. Repeat the exercise for climate stance detection.

Download the `tweet_eval` data set for the `stance_climate` task. Also have a look at the relevant [paper](https://aclanthology.org/S16-1003.pdf):

Mohammad, Saif, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. "Semeval-2016 task 6: Detecting stance in tweets." In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), pp. 31-41. 2016.

Repeat exercises 1 and 2 for this dataset and task.

For 1:

Looking at your own Cohen's kappa with the original coders, is this a more or less difficult task? (Note: Cohen's kappa can be compared between tasks that have different numbers of output classes and different degrees of balance, because it takes into account the baseline probability that two coders would agree on a label.)

For 2:

This time, since we have three outcome categories, we need to define a slightly different performance metric. Complete the example code to define an evaluation metric that matches the F_avg metric used in the paper, which takes the average of the F1 metrics for the categories "favor" and "against". After fine-tuning BERT, how does this F_avg metric compare to the F1 score you found for the irony task? Would you have expected this given the size of the datasets, the balance in the classes, and the difficulty of the coding task?
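
Written out, with $P_c$ and $R_c$ denoting the precision and recall for class $c$ (notation introduced here for illustration), the metric is:

$$
F_c = \frac{2 \, P_c \, R_c}{P_c + R_c}, \qquad
F_{\text{avg}} = \frac{F_{\text{favor}} + F_{\text{against}}}{2}
$$

Precision and recall are computed over the full evaluation set, so tweets whose true label is "none" still count against $P_{\text{favor}}$ or $P_{\text{against}}$ when they are wrongly predicted as one of those classes. Only the final average leaves out the "none" class.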



In [None]:
# Defining a function for finding the average f1 for favor and against labels 
def compute_f_avg(eval_pred):
    outputs, labels = eval_pred 
    predictions = np.argmax(outputs, axis=-1) 
    
    # Per-class F1 for "against" (label 1) and "favor" (label 2), computed over
    # the full evaluation set so that precision also counts tweets whose true
    # label is "none" (label 0) but that were predicted as favor or against.
    # zero_division=0 handles the case where a class is never predicted.
    f1_against, f1_favor = f1_score(
        labels, predictions, labels=[1, 2], average=None, zero_division=0
    )
    
    # Finding average
    f_avg = FILLINTHEBLANK
    
    return {'f_avg': f_avg}