# **Assignment 3 for Computational Semantics**

**Topic**: SemEval 2020 Task 4 Commonsense validation, explanation and generation

**Member**: Sijie Ju

**Introduction**: The task tests whether a model can distinguish natural language statements that make sense from those that do not. It contains three subtasks; the code below is a solution to subtask C.



### **Subtask C**: Commonsense explanation

**Subtask C** is to generate the reason why a given statement goes against common sense.

**Examples**

>Statement: He put an elephant into the fridge.
>
>Referential Reasons:
>1. An elephant is much bigger than a fridge.
>2. A fridge is much smaller than an elephant.
>3. Most of the fridges aren’t large enough to contain an elephant.


### **1. General preparation**

In [None]:
!pip install datasets
!pip uninstall -y transformers
!pip install accelerate==0.15.0
!pip install transformers==4.28.1

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6
Found existing installation: transformers 4.35.2
Uninstalling transformers-4.35.2:
  Woul

In [None]:
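! git clone https://github.com/google-research/bleurt.git
# each `!` command runs in its own subshell, so a separate `! cd bleurt` would not persist; install by path instead
! pip install /content/bleurt/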

Cloning into 'bleurt'...
remote: Enumerating objects: 134, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 134 (delta 0), reused 17 (delta 0), pack-reused 116
Receiving objects: 100% (134/134), 31.28 MiB | 19.38 MiB/s, done.
Resolving deltas: 100% (49/49), done.
Processing ./bleurt
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from BLEURT==0.0.2)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: BLEURT
  Building wheel for BLEURT (setup.py) ... done
  Created wheel for BLEURT: filename=BLEURT-0.0.2-py3-none-any.whl size=16456765 sha256=f38b45ce1e7c880b45630573a44ffeb5ea161e8292ba680f5217bc57c07a9037
  Stored in directory: /tmp/pip-ephem-wheel-cache-j79k562z

In [None]:
# import libraries
import transformers
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer,TrainingArguments, EvalPrediction
from transformers import TrainerCallback, AutoConfig
from accelerate import Accelerator

import torch
import pandas as pd
import matplotlib.pyplot as plt
from datasets import Dataset, DatasetDict

from bleurt import score
import bleurt

### **2. Data processing**

In [None]:
# download the data
!git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git

Cloning into 'SemEval2020-Task4-Commonsense-Validation-and-Explanation'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 88 (delta 36), reused 64 (delta 19), pack-reused 0
Receiving objects: 100% (88/88), 2.22 MiB | 5.83 MiB/s, done.
Resolving deltas: 100% (36/36), done.


In [None]:
# read the data from csv files
def read_data(text_path,answer_path):
  text = pd.read_csv(text_path, header = 0, names = ['ID','FalseSen'])
  answer = pd.read_csv(answer_path, header = None, names = ['ID','Answer 1','Answer 2', 'Answer 3'])
  return text, answer

train_text, train_answer = read_data ('/content/SemEval2020-Task4-Commonsense-Validation-and-Explanation/ALL data/Training  Data/subtaskC_data_all.csv','/content/SemEval2020-Task4-Commonsense-Validation-and-Explanation/ALL data/Training  Data/subtaskC_answers_all.csv')
val_text,val_answer = read_data('/content/SemEval2020-Task4-Commonsense-Validation-and-Explanation/ALL data/Dev Data/subtaskC_dev_data.csv','/content/SemEval2020-Task4-Commonsense-Validation-and-Explanation/ALL data/Dev Data/subtaskC_gold_answers.csv')
test_text, test_answer = read_data ('/content/SemEval2020-Task4-Commonsense-Validation-and-Explanation/ALL data/Test Data/subtaskC_test_data.csv','/content/SemEval2020-Task4-Commonsense-Validation-and-Explanation/ALL data/Test Data/subtaskC_gold_answers.csv')

# normalize the original data: some sentences are not capitalized or lack a final period
def modify(sentence):
    if not sentence.endswith('.'):
        sentence = sentence + '.'
    if not sentence[0].isupper():
        # upper-case only the first character; str.capitalize() would also lower-case the rest
        sentence = sentence[0].upper() + sentence[1:]
    return sentence

# process and combine the table
def data_process(text,answer):
  data = text.merge(answer, on = 'ID',how = 'left')
  data = data.drop(labels = 'ID',axis =1)
  data[['FalseSen', 'Answer 1', 'Answer 2', 'Answer 3']] = data[['FalseSen', 'Answer 1', 'Answer 2', 'Answer 3']].applymap(modify)
  data['Text'] = data.apply(lambda row: ' '.join([row['FalseSen'],row['Answer 1'], row['Answer 2'], row['Answer 3']]), axis=1)
  data_processed = pd.DataFrame(data['Text'])
  return data_processed,data

train_texts,train_data = data_process(train_text,train_answer)
val_texts, val_data = data_process(val_text, val_answer)
test_texts, test_data = data_process(test_text,test_answer)
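
To make the preprocessing concrete, here is a minimal pure-Python sketch of how one training text is assembled, using the example statement from the task description. The helper names `normalize` and `build_text` are hypothetical; they mirror the intent of `modify` and the row-wise join in `data_process` without using pandas.

```python
def normalize(sentence):
    # Hypothetical stand-in for `modify`: ensure a final period
    # and an upper-case first letter, leaving the rest untouched.
    if not sentence.endswith('.'):
        sentence += '.'
    return sentence[0].upper() + sentence[1:]


def build_text(false_sentence, reasons):
    # Join the false statement and its reference reasons into one training string.
    return ' '.join(normalize(s) for s in [false_sentence, *reasons])


example = build_text(
    'he put an elephant into the fridge',
    ['An elephant is much bigger than a fridge.',
     'a fridge is much smaller than an elephant'],
)
print(example)
```

Each training example is therefore a single string in which the false statement is followed directly by its explanations, which is the pattern the language model is trained to continue.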

In [None]:
# calculate the maximum text length in characters (not tokens); it is reused later as the tokenizer's max_length
max_len = train_texts['Text'].str.len().max()
print(max_len)

561


In [None]:
# load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
eos_token, eos_token_id = tokenizer.eos_token, tokenizer.eos_token_id

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [None]:
# convert to dataset
train_dataset = Dataset.from_pandas(train_texts)
val_dataset = Dataset.from_pandas(val_texts)
test_dataset = Dataset.from_pandas(test_texts)

dataset = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

In [None]:
# encode the data
def encode_batch(batch):
  return tokenizer(batch['Text'], max_length = max_len, truncation=True)

dataset = dataset.map(encode_batch, batched=True)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/997 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

### **3. Train the model**

In [None]:
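# build a GPT-2 model from the configuration
# note: GPT2LMHeadModel(config) initializes the weights randomly; it does not load the pretrained checkpoint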
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size = len(tokenizer),
    bos_token_id = tokenizer.bos_token_id,
    eos_token_id = tokenizer.eos_token_id
)

model = GPT2LMHeadModel(config)



In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer,mlm=False)
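
With `mlm=False`, the collator prepares batches for causal language modeling: sequences are padded to a common length, and the labels are a copy of the inputs in which padding positions are set to `-100` so that the loss ignores them. The sketch below is a simplified pure-Python illustration of that behaviour, not the library's implementation; `collate_causal_lm` is a hypothetical name. Because the pad token is set to the EOS token here, a genuine EOS token in the input would be masked in the labels as well.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal_lm(sequences, pad_id):
    # Pad every sequence on the right to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded = list(seq) + [pad_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; every position holding the pad id is ignored.
        labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels}


batch = collate_causal_lm([[11, 12, 13], [21, 22]], pad_id=50256)
print(batch['labels'])
```

The shift between inputs and next-token targets is not shown here because the model applies it internally when computing the loss.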

In [None]:
# set training arguments
training_args = TrainingArguments(
    learning_rate = 2e-4,
    lr_scheduler_type = 'cosine',
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,
    overwrite_output_dir=True,
    output_dir="./training_output"
    )

In [None]:
# set accelerator
accelerator = Accelerator()
model, training_args, dataset['train'],dataset['validation'] = accelerator.prepare(model, training_args, dataset['train'],dataset['validation'])

In [None]:
# start training
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer = tokenizer,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    data_collator=data_collator
)

trainer.train()



| Step | Training Loss |
|------|---------------|
| 100  | 7.2957 |
| 200  | 6.3463 |
| 300  | 5.9943 |
| 400  | 5.8764 |
| 500  | 5.6748 |
| 600  | 5.6305 |
| 700  | 5.4874 |
| 800  | 5.4151 |
| 900  | 5.3915 |
| 1000 | 5.2872 |


TrainOutput(global_step=6250, training_loss=4.510665546875, metrics={'train_runtime': 1116.951, 'train_samples_per_second': 44.765, 'train_steps_per_second': 5.596, 'total_flos': 1343775928320000.0, 'train_loss': 4.510665546875, 'epoch': 5.0})
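
The reported `global_step=6250` follows from the training settings: 10,000 training examples (the size shown in the `Map` progress output) at a batch size of 8 give 1,250 steps per epoch, times 5 epochs. The sketch below reproduces that arithmetic and shows the shape of a cosine learning-rate decay without warmup, assuming a single device and no gradient accumulation. `cosine_lr` is a hypothetical helper written for illustration, not a call into `transformers`.

```python
import math

num_examples = 10_000   # size of the training split
batch_size = 8          # per_device_train_batch_size
num_epochs = 5          # num_train_epochs
base_lr = 2e-4          # learning_rate

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def cosine_lr(step, total=total_steps, peak=base_lr):
    # Half a cosine period: starts at `peak` and decays smoothly to zero.
    progress = step / total
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step))
```

Under these assumptions the learning rate is at its peak at step 0, at half the peak midway through training, and close to zero at the final step.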

### **4. Test the model**

In [None]:
from transformers import pipeline

generator = pipeline('text-generation',model = model,tokenizer = tokenizer,device=training_args.device.index)

In [None]:
# test the model manually
text = generator('He loves to stroll at the park with his bed.', max_new_tokens = 128)
print(text[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


He loves to stroll at the park with his bed. You can't wear a bed with a person to draw. A wall cannot be built to store. One cannot put a bus in a movie shop. You cannot call an area inside a key. People don't go. A bike in a bicycle cannot hold. A fork shop that will break an ball a backpack. Humans


In [None]:
# test the model
explanations = []

for prompt in test_data['FalseSen']:
  result = generator(prompt,max_new_tokens = 128)
  explanations.append(result[0]['generated_text'])

In [None]:
# convert the result into a table
generation = pd.DataFrame({'Explanation': explanations})
generation.sample(10)
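
Note that the `text-generation` pipeline returns the prompt followed by the continuation, so each entry in `explanations` starts with the false statement. If a single reason is wanted, one option is to strip the prompt and keep only the first generated sentence. The sketch below assumes that approach; `extract_explanation` is a hypothetical helper, and cutting at the first period is a simplification.

```python
def extract_explanation(prompt, generated_text):
    # Drop the echoed prompt, if present, so only the continuation remains.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    continuation = continuation.strip()
    # Keep everything up to and including the first period.
    end = continuation.find('.')
    return continuation if end == -1 else continuation[:end + 1]


prompt = 'He loves to stroll at the park with his bed.'
generated = (prompt + " You can't wear a bed with a person to draw."
             " A wall cannot be built to store.")
print(extract_explanation(prompt, generated))
```

Whether to post-process this way depends on what the evaluation compares against; the notebook as written scores the full generated text.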

In [None]:
# evaluate the model performance
candidates_list = generation['Explanation'].tolist()
references_list = test_data['Text'].tolist()

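# test_checkpoint is the small checkpoint bundled with the BLEURT repository; scores from a full checkpoint may differ
checkpoint = '/content/bleurt/bleurt/test_checkpoint'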
scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references_list, candidates=candidates_list)

# average over all test examples (a loop variable named `score` would shadow the imported `score` module)
avg = sum(scores) / len(scores)

# highest and lowest scores and the positions of the corresponding generations
max_value = max(scores)
best = scores.index(max_value)
min_value = min(scores)
worst = scores.index(min_value)

print(f'The score of this model:{avg}.')
print(f'The highest score of this model: {max_value}.')
print(f'The lowest score of this model: {min_value}.')
print(generation['Explanation'][best])
print(generation['Explanation'][worst])

The score of this model:-0.3501493607535958.
The highest score of this model: 0.18506819009780884.
The lowest score of this model: -0.9189718961715698.
She put the giraffe in the freezer. A giraff is too big to big to fit into a pen. A giraffe is not a place to fit on a fridge. Giraffe are much smaller than a closet. Giraffe are located in a ovens and very large. Giraffe. Giraffe. A giraffe for put in ocean in kitchen.
The cow flew away. Fcrete is too heavy for water, but it is too small to stand in space. The giraffe is too far for a whale,it should not fit into road. Bats are found in space and a giraffroom. Spens are not much bigger than a giraffe. A girOWaffes do
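
The evaluation above compares each generation with the concatenation of the statement and all three reference reasons. One alternative, which is an assumption here and not what the notebook does, is to score a candidate against each reference separately and keep the best match. The sketch below shows that aggregation with a toy word-overlap function standing in for a real metric; `overlap_score` and `best_reference_score` are hypothetical names, and the overlap measure is not BLEURT.

```python
def overlap_score(reference, candidate):
    # Toy stand-in for a learned metric: Jaccard overlap of lower-cased words.
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    if not ref_words or not cand_words:
        return 0.0
    return len(ref_words & cand_words) / len(ref_words | cand_words)


def best_reference_score(candidate, references, score_fn=overlap_score):
    # Score against every reference and keep the most favourable one.
    return max(score_fn(ref, candidate) for ref in references)


references = [
    'An elephant is much bigger than a fridge.',
    'A fridge is much smaller than an elephant.',
]
candidate = 'An elephant is much bigger than a fridge.'
print(best_reference_score(candidate, references))
```

Any callable taking a reference and a candidate can be passed as `score_fn`, so the same aggregation could wrap a real scorer.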
