In [1]:
# Installing datasets and transformers for Colab
!pip install datasets==2.2.1 transformers==4.19.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets==2.2.1
  Downloading datasets-2.2.1-py3-none-any.whl (342 kB)
Collecting transformers==4.19.1
  Downloading transformers-4.19.1-py3-none-any.whl (4.2 MB)
Collecting dill (from datasets==2.2.1)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets==2.2.1)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collect

In [2]:
import os
import numpy as np
from collections import Counter
import torch
import datasets
datasets.logging.set_verbosity_error()
from datasets import load_metric
from google.colab import drive
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import f1_score
import pandas as pd

# psutil and platform are only used by memory_report() when no GPU is available (it happens...)
import psutil
import platform

In [3]:
# GPU housekeeping code: you do not need to modify anything, simply
# read through it to understand what is going on, and run as is

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# a helper function to format byte counts into KB, MB and so on
def bytes_format(b):
    if b < 1000:
        return f'{b} B'
    elif b < 1000000:
        return f'{round(float(b/1000),2)} KB'
    elif b < 1000000000:
        return f'{round(float(b/1000000),2)} MB'
    else:
        return f'{round(float(b/1000000000),2)} GB'

# a helper function to check the amount of available memory
def memory_report():
    if device != 'cpu':
        print(f"GPU available: {torch.cuda.get_device_name()}")
        # print(torch.cuda.memory_summary())
        total = torch.cuda.get_device_properties(0).total_memory
        reserved = torch.cuda.memory_reserved(0)
        allocated = torch.cuda.memory_allocated(0)
        # free = reserved - allocated  # free inside memory_reserved
        print(f"Total cuda memory: {bytes_format(total)}, reserved: {bytes_format(reserved)}, allocated: {bytes_format(allocated)}")
    else:
        # Print total memory available on CPU (needs the psutil and platform imports from the import cell)
        print(f'Device is CPU {platform.processor()}. GPU is not available right now')
        total_memory = psutil.virtual_memory().total
        print(f"Total CPU memory: {bytes_format(total_memory)}")

memory_report()

GPU available: Tesla T4
Total cuda memory: 15.84 GB, reserved: 0 B, allocated: 0 B


# Exercise: sentence classification

In this exercise, we will look more closely at using supervised machine learning to classify sentences (and other short documents). Of course, classifying short documents is what we have been doing throughout sections 4 and 5 of the course. Here, we will look at irony prediction and stance detection as examples of tasks that go beyond sentiment classification. We will (1) take a closer look at annotations to understand the difficulty of coding (annotating) text, even for human coders; and (2) evaluate the performance of a fine-tuned, pre-trained BERT model on these tasks.

We will once again run this notebook on Google Colab (as in exercise set 4.3), so that we can use GPUs for fine-tuning BERT. Note that below, you will need a file with hand-coded annotations that you create yourself. This means you will have to give the notebook access to the Google Drive folder where you store this file; the code for that is included below.

# 1. Understand the irony detection data

Download the `tweet_eval` data set for the irony detection task. The whole suite of `tweet_eval` data sets is described [here](https://huggingface.co/datasets/tweet_eval); select "irony" as the subset to see examples of the irony detection task.

1. How many tweets are in the training and validation set? How many are in the irony and no-irony categories?

In [4]:
# load the tweet_eval irony datasets
train_dataset = datasets.load_dataset('tweet_eval', 'irony', split='train')
val_dataset = datasets.load_dataset('tweet_eval', 'irony', split='validation')

Downloading builder script:   0%|          | 0.00/2.37k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

Downloading and preparing dataset tweet_eval/irony (download: 376.58 KiB, generated: 411.24 KiB, post-processed: Unknown size, total: 787.82 KiB) to /root/.cache/huggingface/datasets/tweet_eval/irony/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343...


Downloading data files:   0%|          | 0/6 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/108k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/32.4k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/211 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/36.9k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/244 [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/6 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/2862 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/784 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/955 [00:00<?, ? examples/s]

Dataset tweet_eval downloaded and prepared to /root/.cache/huggingface/datasets/tweet_eval/irony/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.


In [5]:
# Examining size of each dataset 
for d in [train_dataset,val_dataset]:
  print(f'{d.split}',d.shape)

train (2862, 2)
validation (955, 2)


In [10]:
# Checking number of cases in each category
print(Counter(train_dataset["label"]))
print(Counter(val_dataset["label"]))


Counter({1: 1445, 0: 1417})
Counter({0: 499, 1: 456})


The dataset is almost perfectly balanced between positive and negative categories.
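To put a number on "almost perfectly balanced", the short sketch below turns the label counts printed above into class shares. The counts are copied from the output; the helper name `class_shares` is our own, and we assume the `tweet_eval` coding in which 0 means "not ironic" and 1 means "ironic".

```python
from collections import Counter

# Label counts copied from the output above (assumed coding: 0 = not ironic, 1 = ironic)
train_counts = Counter({1: 1445, 0: 1417})
val_counts = Counter({0: 499, 1: 456})

def class_shares(counts):
    """Return each label's share of the total, rounded to 3 decimals."""
    total = sum(counts.values())
    return {label: round(n / total, 3) for label, n in sorted(counts.items())}

train_shares = class_shares(train_counts)
val_shares = class_shares(val_counts)
print(train_shares)
print(val_shares)
```

Both splits stay within a few percentage points of an even split, so a classifier that always predicts one class would reach only about 50% accuracy.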

2. Have a look at the [paper](https://aclanthology.org/S18-1005.pdf) that explains this dataset and task:

Van Hee, Cynthia, Els Lefever, and Véronique Hoste. "Semeval-2018 task 3: Irony detection in English tweets." In Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 39-50. 2018.

How were the tweets for this task selected (before being hand-coded)? How could this influence the performance of the task on other tweets? Discuss this with a neighbor, if you can.

Solution: The tweets were selected by filtering on the hashtags #not, #sarcasm, and #irony. One way this could influence performance is that these users like to be really obvious (after all, they insisted on adding an irony hashtag to their tweet), so these tweets might be easier to classify than the average ironic tweet. On the other hand, it could also be that these are statements where a reader would hardly be able to tell whether the statement is sarcastic without the hashtag, which is why the user added it. In that case, these tweets might be especially hard.

3. Hand-annotate the irony of 50 randomly selected tweets yourself. Calculate the [Cohen's kappa](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) for interrater agreement between yourself and the original coder(s). Compare your annotations to those of the trained coder, and look at the disagreements: how many of them would you consider to be mistakes on your end, mistakes on their end, or tweets whose true irony label is simply unclear?

**Comments on Cohen's Kappa Score**

The kappa score is a number between -1 and 1. Neuendorf (2002) rates levels of annotator agreement (also called intercoder reliability, or ICR) as follows:

- Above 0.80: nearly perfect agreement
- Between 0.61 and 0.80: substantial agreement
- Between 0.41 and 0.60: moderate agreement
- Between 0.21 and 0.40: fair agreement
- 0.20 or below: slight agreement

Zero or lower means no agreement (practically random labels). 
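To make the score less of a black box, here is a minimal sketch of how Cohen's kappa is computed: the observed agreement is corrected for the agreement we would expect by chance, given how often each coder uses each label. The function name `cohen_kappa` and the ten toy annotations are hypothetical; in the notebook itself we use `cohen_kappa_score` from scikit-learn.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    # Observed agreement: share of items where both coders chose the same label
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance that both pick the same label independently,
    # given how often each coder uses each label
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    # Note: undefined (division by zero) if both coders only ever use one and the same label
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical annotations for ten tweets (not taken from the dataset)
coder_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
coder_2 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 2))
```

In this toy example the two coders agree on 7 of 10 tweets, but half of that agreement is expected by chance, so the kappa comes out at 0.4 rather than 0.7.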

In [27]:
# In order to work in a directory on our Google drive, we first have to mount our drive

# NB: The code will trigger permission prompts 
drive.mount('/content/drive')

# Setting path to current working directory
path = '/content/drive/My Drive/ASDS II/problem sets/TA versions'
#change directory
os.chdir(path)

Mounted at /content/drive


In [28]:
# Selecting random sample of 50 tweets

# Setting a seed to make sure that we get the same sample 
seed=42

# Converting dataset to dataframe
df_sample=pd.DataFrame(train_dataset)

# Taking sample 
df_sample=df_sample.sample(n=50, random_state=seed)

# Saving sample to excel-file that can be manually annotated 
# We open the xlsx-file in our drive as a google sheets file and manually annotate the data in a new column called "my_label"
# During annotation we hide the column "label" containing the existing annotations  
df_sample.to_excel("irony_annotation_sample.xlsx", index = False)

In [30]:
# After annotating, we reload sample 
df_annotated=pd.read_excel("irony_annotation.xlsx")

# We check that it has the extra manually annotated column
df_annotated.columns

Index(['Unnamed: 0', 'text', 'label', 'my_label'], dtype='object')

In [None]:
# Defining original and our own annotations
y1_original_annotations= df_annotated.label
y2_my_annotations= df_annotated.my_label

# Calculating kappa_score for data set
kappa_score = cohen_kappa_score(y1_original_annotations, y2_my_annotations)
print("Cohen's Kappa score:", kappa_score)

Cohen's Kappa score: 0.28



We get an inter-rater reliability (Cohen's kappa) of 0.28, which Neuendorf (2002) would rate as "fair" agreement. Such a low kappa for our irony annotations suggests that this is a difficult task even for humans. Of course, the original coders would have been trained in using the agreed-upon codebook (coding guidelines), so with training, perhaps our score could be improved.
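To look at the disagreements themselves, it can help to count every combination of original and own label and to list the rows where the two differ. The sketch below uses eight hypothetical labels in plain Python lists; in the notebook, the two lists would come from the `label` and `my_label` columns of `df_annotated`.

```python
from collections import Counter

# Hypothetical labels for eight tweets; in the notebook these would come from
# df_annotated["label"] and df_annotated["my_label"]
original = [1, 0, 1, 1, 0, 0, 1, 0]
mine = [1, 0, 0, 1, 1, 0, 0, 0]

# Count each (original, mine) combination
pair_counts = Counter(zip(original, mine))

# Row positions where the two codings differ, for manual review
disagreements = [i for i, (o, m) in enumerate(zip(original, mine)) if o != m]

print(pair_counts)
print(disagreements)
```

The off-diagonal pairs show the direction of the disagreement: here `(1, 0)` is more common than `(0, 1)`, meaning we missed irony more often than we saw irony that the original coders did not.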

# 2. Fine-tune BERT for irony detection

Building on the work that you did in exercise set 4.3, fine-tuning BERT for the `tweet_eval` sentence classification task, now it's time to fine-tune BERT for the irony detection task. You can use your own code or the solutions code from the previous exercise to accomplish this. That means:

1. Setting up `transformers` for the medium-size BERT model "prajjwal1/bert-medium"
2. Tokenizing the tweets with the tokenizer associated with our masked language model, using the [AutoTokenizer](https://huggingface.co/docs/transformers/v4.19.0/en/model_doc/auto#transformers.AutoTokenizer).
3. Initializing the pre-trained model using the [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.19.0/en/model_doc/auto#transformers.AutoModelForSequenceClassification) module, setting it up for classification into the right number of classes, and then moving it to GPU.
4. Preparing a [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) object and a function that computes the F1 evaluation metric, to be passed as arguments to the Trainer. Set the number of epochs to 5 (or 2-3 if you don't have a GPU) and the batch size to 16 in the training arguments.
5. Creating a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object and passing it the model, the training arguments (`args`), the pre-defined metric function (`compute_metrics`), the `train_dataset` and `eval_dataset`, as well as the tokenizer object.
6. Training the model using its `.train()` method.

What kind of performance do you see in terms of F1? How does this compare to the F1 scores reported in the paper?

In [11]:
# 1. Defining the model

# NB: Try using bert-small if running on CPU (not GPU)
bert_medium = "prajjwal1/bert-medium"

In [12]:
# 2. Set up the tokenizer we want to use
tokenizer = AutoTokenizer.from_pretrained(bert_medium)

# NB: The tokenizer runs on the CPU and does not need to be moved to the GPU;
# only the model is moved to the GPU below

# Function to apply that tokenizer once
def tokenize(dataset):
    return tokenizer(dataset["text"])

# Apply the tokenizer to each row in the dataset
tokenized_train_dataset = train_dataset.map(tokenize, batched=True)
tokenized_val_dataset = val_dataset.map(tokenize, batched=True)

Downloading:   0%|          | 0.00/286 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [13]:
# 3.  initializing the pre-trained model using the AutoModelForSequenceClassification module 
irony_classifier = AutoModelForSequenceClassification.from_pretrained(bert_medium, num_labels=2)

# Moving model to GPU
irony_classifier.to(device)

Downloading:   0%|          | 0.00/159M [00:00<?, ?B/s]

Some weights of the model checkpoint at prajjwal1/bert-medium were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not init

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 512, padding_idx=0)
      (position_embeddings): Embedding(512, 512)
      (token_type_embeddings): Embedding(2, 512)
      (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-7): 8 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=512, out_features=512, bias=True)
              (key): Linear(in_features=512, out_features=512, bias=True)
              (value): Linear(in_features=512, out_features=512, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=512, out_features=512, bias=True)
              (LayerNorm): LayerNorm((512,), eps=1e-12, e

In [14]:
# 4.  Setting the training arguments
# NB:  If you are not connected to the GPU, try lowering the number of epochs to 2 or 3

training_args = TrainingArguments(output_dir="my_trainer_irony",
                                  evaluation_strategy="steps",  # evaluate at specific steps rather than after epochs
                                  num_train_epochs=5,
                                  per_device_train_batch_size=16, # Model processes 16 docs at a time
                                  logging_steps=100, # logging every 100 steps, simultaneously with evaluation
                                  eval_steps=100) # Evaluation at every 100 steps
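As a quick sanity check on these settings, we can work out how many optimization steps the run will take and how often it will be evaluated. This is a minimal sketch, assuming a single device, no gradient accumulation, and that the last (smaller) batch of each epoch is kept; the helper name `total_optimization_steps` is our own.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# 2862 training tweets, batch size 16, 5 epochs
irony_steps = total_optimization_steps(2862, 16, 5)
# Number of evaluations triggered by eval_steps=100
evaluations = irony_steps // 100
print(irony_steps, evaluations)
```

With 2862 training tweets this gives 179 steps per epoch and 895 steps in total, so with `eval_steps=100` we should expect eight evaluation rows in the training log.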


In [15]:
# 5.  Defining the f1 score metric  
metric = load_metric("f1") 

# Defining a function
def compute_f1(eval_pred):
    outputs, labels = eval_pred
    predictions = np.argmax(outputs, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Downloading builder script:   0%|          | 0.00/2.32k [00:00<?, ?B/s]
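To see what `compute_f1` does with the model output, the sketch below reproduces the two steps in plain Python: picking the class with the highest logit for each tweet, and computing F1 for the positive class as the harmonic mean of precision and recall. The logits and labels are hypothetical, the helper names are our own, and we assume the metric reports F1 for class 1 (which, as far as we know, is the default of the `f1` metric).

```python
def argmax_rows(logits):
    """Index of the largest value in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def binary_f1(predictions, references, positive=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical logits for five tweets: one row per tweet, one column per class
logits = [[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [2.0, 0.5], [-0.7, 0.9]]
labels = [1, 0, 0, 1, 1]

predictions = argmax_rows(logits)
f1 = binary_f1(predictions, labels)
print(predictions, round(f1, 3))
```

In this toy example there are two true positives, one false positive and one false negative, so precision, recall and F1 all equal 2/3.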

In [16]:
# Defining a trainer object with the information from above 
trainer = Trainer(
    model=irony_classifier,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    compute_metrics=compute_f1,
    tokenizer=tokenizer)

# Training and evaluating model 
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2862
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 895


Step,Training Loss,Validation Loss,F1
100,0.6762,0.655051,0.564958
200,0.6117,0.648608,0.573957
300,0.5207,0.674302,0.579365
400,0.4633,0.736177,0.662564
500,0.3773,0.688326,0.661555
600,0.2711,1.028922,0.682701
700,0.2245,1.199555,0.686222
800,0.1282,1.198411,0.68


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 955
  Batch size = 8

TrainOutput(global_step=895, training_loss=0.380261556529466, metrics={'train_runtime': 58.7312, 'train_samples_per_second': 243.652, 'train_steps_per_second': 15.239, 'total_flos': 89613807272112.0, 'train_loss': 0.380261556529466, 'epoch': 5.0})

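We get an F1 score of about .68 on the validation set, which would place us second in the ranking reported in the paper. :) Note that the scores in the paper were computed on the held-out test set, so the comparison is only indicative.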

# 3. Repeat the exercise for climate stance detection.

Download the `tweet_eval` data set for the `stance_climate` task. Also have a look at the relevant [paper](https://aclanthology.org/S16-1003.pdf):

Mohammad, Saif, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. "Semeval-2016 task 6: Detecting stance in tweets." In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), pp. 31-41. 2016.

Repeat exercise 1 and 2 for this dataset and task.

For 1:

Looking at your own Cohen's kappa with the original coders, is this a more or less difficult task? (Note: Cohen's kappa can be compared between tasks that have different numbers of output classes and different degrees of balance, because it takes into account the baseline probability that two coders would agree on a label.)

For 2:

This time, since we have three outcome categories, we need to define a slightly different performance metric. Use the example code to define an evaluation metric that matches the F_avg metric used in the paper, which takes the average of the F1 metrics for the categories "favor" and "against". After fine-tuning BERT, how does this F_avg metric compare to the F1 you found for the irony task? Would you have expected this given the size of the datasets, the balance in the classes, and the difficulty of the coding task?



In [17]:
# Loading data
climate_train_dataset = datasets.load_dataset('tweet_eval', 'stance_climate', split='train')
climate_val_dataset = datasets.load_dataset('tweet_eval', 'stance_climate', split='validation')

# Selecting random sample
seed=42
climate_df=pd.DataFrame(climate_train_dataset)
climate_annotation_sample=climate_df.sample(n=50, random_state=seed)

# Run before annotating
climate_annotation_sample.to_excel("climate_annotation1.xlsx", index=False)

Downloading and preparing dataset tweet_eval/stance_climate (download: 59.05 KiB, generated: 63.46 KiB, post-processed: Unknown size, total: 122.51 KiB) to /root/.cache/huggingface/datasets/tweet_eval/stance_climate/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343...


Downloading data files:   0%|          | 0/6 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/16.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/133 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/8.38k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/81.0 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/45.0 [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/6 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/355 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/169 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/40 [00:00<?, ? examples/s]

Dataset tweet_eval downloaded and prepared to /root/.cache/huggingface/datasets/tweet_eval/stance_climate/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.


In [21]:
# Examining size of each dataset 
for d in [climate_train_dataset,climate_val_dataset]:
  print(f'{d.split}',d.shape)

train (355, 2)
validation (40, 2)


In [22]:
# Checking number of cases in each category
print(Counter(climate_train_dataset["label"]))
print(Counter(climate_val_dataset["label"]))

Counter({2: 191, 0: 151, 1: 13})
Counter({2: 21, 0: 17, 1: 2})


The dataset is quite small and unbalanced. There are almost no examples of "against" stances (coded as 1) in the data set.

In [None]:
# Run after annotating
df_annotated_climate=pd.read_excel("climate_annotation.xlsx")

In [None]:
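# Calculating Cohen's kappa
y1_original= df_annotated_climate.label
# NB: the column name must match the one in your annotated file
# ("my_labels" here; the irony file above used "my_label")
y2_yours= df_annotated_climate.my_labels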

kappa_score = cohen_kappa_score(y1_original, y2_yours)
print("Cohen's Kappa score:", kappa_score)

**Comment on Cohen's Kappa for climate stance annotations**

For this annotation task, the intercoder agreement is somewhat higher than in the previous task. This appears to be an easier task for a human.

In [24]:
# Defining a function for finding the average f1 for favor and against labels 
def compute_f_avg(eval_pred):
    outputs, labels = eval_pred 
    predictions = np.argmax(outputs, axis=-1) 
    
    # Filter labels and predictions for "favor" and "against" categories
    favor_labels = labels[labels == 2]
    favor_predictions = predictions[labels == 2]
    against_labels = labels[labels == 1]
    against_predictions = predictions[labels == 1]
    
    # Calculating f1 for favor and against
    f1_favor = f1_score(favor_labels, favor_predictions, average='weighted', zero_division=0) # The zero_division parameter is set to 0 to handle the case when there are no instances of a particular class.
    f1_against = f1_score(against_labels, against_predictions, average='weighted', zero_division=0)
    
    # Finding average
    f_avg = (f1_favor + f1_against) / 2
    
    return {'f_avg': f_avg}
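One thing to keep in mind about the function above: it selects rows by their true label before scoring, so tweets from other classes that are wrongly predicted as "favor" or "against" never count against precision. The sketch below shows one possible alternative that computes each class's F1 over all examples and then averages the two, which is how we read the F_avg definition in the paper. It is written in plain Python with our own helper names (`class_f1`, `f_avg`) and assumes the coding 1 = "against", 2 = "favor"; with scikit-learn, the same per-class scores should be obtainable from `f1_score(labels, predictions, labels=[1, 2], average=None)`.

```python
def class_f1(predictions, references, target):
    """F1 for one class, computed over all examples."""
    pairs = list(zip(predictions, references))
    tp = sum(p == target and r == target for p, r in pairs)
    fp = sum(p == target and r != target for p, r in pairs)
    fn = sum(p != target and r == target for p, r in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f_avg(predictions, references, favor=2, against=1):
    """Mean of the F1 scores for the "favor" and "against" classes."""
    return (class_f1(predictions, references, favor)
            + class_f1(predictions, references, against)) / 2

# Validation label counts from the output above: 17 x 0, 2 x 1, 21 x 2
references = [0] * 17 + [1] * 2 + [2] * 21
# A baseline that predicts "favor" (2) for every tweet
always_favor = [2] * len(references)

baseline = f_avg(always_favor, references)
print(round(baseline, 3))
```

Under this definition, a baseline that always predicts "favor" scores about 0.344 on the validation labels, which gives a reference point for judging the fine-tuned model. The label-filtered version above would, as far as we can tell, give the same baseline 0.5, because the 19 non-favor tweets predicted as "favor" are never counted as false positives.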

In [25]:
#apply the tokenizer to each row in the dataset
climate_tokenized_train_dataset = climate_train_dataset.map(tokenize, batched=True)
climate_tokenized_val_dataset = climate_val_dataset.map(tokenize, batched=True)

# initializing the pre-trained model using the AutoModelForSequenceClassification module 
climate_stance_classifer = AutoModelForSequenceClassification.from_pretrained(bert_medium, num_labels=3)
climate_stance_classifer.to(device)

# Setting training arguments
# It is a small dataset so we update the model weights more often with smaller steps and make batches smaller
climate_training_args = TrainingArguments(output_dir="climate_trainer",
                                          evaluation_strategy="steps",
                                          num_train_epochs=5,
                                          per_device_train_batch_size=8,
                                          eval_steps=10,
                                          logging_steps=10)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

loading configuration file https://huggingface.co/prajjwal1/bert-medium/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/288b0ee1e79a7c3fe770ab8a84ece013c573e7d226ccb5d9ffad317b3419faac.4344f82f77799c092b30b2e0d3749c809f82df14c5993e43dbbdc52f5a0d86e0
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-medium",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_siz

In [26]:
# Training and evaluating model
climate_trainer = Trainer(
    model=climate_stance_classifer,
    args=climate_training_args,
    train_dataset=climate_tokenized_train_dataset,
    eval_dataset=climate_tokenized_val_dataset,
    compute_metrics=compute_f_avg,
    tokenizer=tokenizer)

climate_trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 355
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 225


Step,Training Loss,Validation Loss,F Avg
10,0.8478,0.863685,0.045455
20,0.9438,0.857757,0.045455
30,0.7976,0.849381,0.5
40,0.7312,0.750322,0.5
50,0.7498,0.645731,0.461538
60,0.5828,0.618892,0.487805
70,0.5925,0.550983,0.416667
80,0.5283,0.470251,0.487805
90,0.4822,0.518379,0.475
100,0.4222,0.471467,0.475


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 40
  Batch size = 8

TrainOutput(global_step=225, training_loss=0.41112090508143107, metrics={'train_runtime': 23.137, 'train_samples_per_second': 76.717, 'train_steps_per_second': 9.725, 'total_flos': 10443003734562.0, 'train_loss': 0.41112090508143107, 'epoch': 5.0})

**Comments on final model**

Our average F1 is close to the one in the paper. It is not surprising that the F1 is not great, as the dataset is both very small and unbalanced, making it very difficult for the model to learn how to predict the "against" category.