# RDF-to-Text: Fine-tuning GPT2 with WebNLG Corpus
### Fina Emilova Yilmaz Polat

This is the last notebook in a series of four; it covers Part 4, the evaluation.

Across the series we:
* pre-process the WebNLG dataset - Part 1
* fine-tune the GPT2 language model on the WebNLG dataset - Part 2
* generate text with the trained model - Part 3
* evaluate the generated text - Part 4

The WebNLG data (Gardent et al., 2017) was created to promote the development of (i) RDF verbalisers and (ii) microplanners able to handle a wide range of linguistic constructions.

Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017, September). The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation (pp. 124-133).

GPT2 language model: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.


Evaluation code is partially adapted from https://github.com/cltl/ma-communicative-robots/blob/2021/projects/transformers/Caya's%20project/evaluation_metrics.ipynb 

In [1]:
!pip install bert-score
!pip install -U nltk
!pip install sacrebleu
!pip install datasets
!pip install rouge-score



In [15]:
!pip install git+https://github.com/google-research/bleurt.git

Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-palmspdn
  Running command git clone -q https://github.com/google-research/bleurt.git /tmp/pip-req-build-palmspdn
Collecting tf-slim>=1.1
  Downloading tf_slim-1.1.0-py2.py3-none-any.whl (352 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting tf-estimator-nightly==2.8.0.dev2021122109
  Downloading tf_estimator_nightly-2.8.0.dev2021122109-py2.py3-none-any.whl (462 kB)
Building wheels for collected packages: BLEURT
  Building wheel for BLEURT (setup.py) ... done
  Created wheel for BLEURT: filename=BLEURT-0.0.2-py3-none-any.whl size=16456761 sha256=bd5d57039bc5ab62

In [2]:
#import required libraries
from google.colab import drive
import pandas as pd
import os

In [3]:
from bert_score import BERTScorer
import sacrebleu
from nltk.translate.meteor_score import single_meteor_score
from rouge_score import rouge_scorer
from nltk.tokenize import word_tokenize

In [4]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [18]:
import datasets

In [5]:
MOUNTPOINT = '/content/gdrive'
Working_Dir = os.path.join(MOUNTPOINT, 'My Drive', 'WebNLG with GPT2')
drive.mount(MOUNTPOINT)
print(Working_Dir)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/WebNLG with GPT2


In [6]:
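# Load the WebNLG 2020 test set together with the generated outputs.
gen_df = pd.read_csv(os.path.join(Working_Dir, 'data', 'webNLG2020_test_with_generated_outputs.csv'), index_col=[0])
# Note: `gen_df.head` without parentheses displays the bound method (and with it the whole frame);
# call `gen_df.head()` to preview only the first five rows.
gen_df.head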

<bound method NDFrame.head of                                             input_text  \
0                        Darlington | areaCode | 01325   
1                        Darlington | areaCode | 01325   
2                        Darlington | areaCode | 01325   
3            Israel | officialLanguage | Modern_Hebrew   
4            Israel | officialLanguage | Modern_Hebrew   
..                                                 ...   
731  English_Without_Tears | writer | Anatole_de_Gr...   
732  English_Without_Tears | writer | Anatole_de_Gr...   
733                Nurhan_Atasoy | birthPlace | Turkey   
734                Nurhan_Atasoy | birthPlace | Turkey   
735                Nurhan_Atasoy | birthPlace | Turkey   

                                           target_text  \
0       The Darlington town has an area code of 01325.   
1     The telephone area code for Darlington is 01325.   
2                The area code in Darlington is 01325.   
3    The official language of Israel is m

In [7]:
reference_text_list = gen_df["target_text"].tolist()
generated_text_list = gen_df["generated_text"].tolist()
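
The frame holds one row per (input, reference) pair, so the same triple can appear several times with different reference texts (rows 0 to 2 above all verbalise `Darlington | areaCode | 01325`). The scores computed below compare each generated text against the single reference on its own row. As a minimal sketch, assuming one wanted to see how many references each triple has (for instance as a first step towards multi-reference scoring), the references could be grouped by input. The `rows` list and the `group_references` helper are hypothetical stand-ins, not part of the notebook:

```python
from collections import defaultdict

# Hypothetical stand-in for (input_text, target_text) pairs taken from gen_df.
rows = [
    ("Darlington | areaCode | 01325", "The Darlington town has an area code of 01325."),
    ("Darlington | areaCode | 01325", "The telephone area code for Darlington is 01325."),
    ("Darlington | areaCode | 01325", "The area code in Darlington is 01325."),
    ("Nurhan_Atasoy | birthPlace | Turkey", "Nurhan Atasoy's birthplace is Turkey."),
]


def group_references(pairs):
    """Map each input triple string to the list of its reference texts."""
    grouped = defaultdict(list)
    for input_text, target_text in pairs:
        grouped[input_text].append(target_text)
    return dict(grouped)


references_by_input = group_references(rows)
for triple, refs in references_by_input.items():
    print(f"{triple}: {len(refs)} reference(s)")
```

With the real data, the pairs could be obtained with `zip(gen_df["input_text"], gen_df["target_text"])`.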

We start with BLEU Score: 

In [8]:
bleu_scores = []
for model_output, gold_references in zip(generated_text_list, reference_text_list):
    bleu = sacrebleu.sentence_bleu(model_output, [gold_references], smooth_method='exp').score
    bleu_scores.append(bleu)

gen_df['BLEU score'] = bleu_scores
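
To make the metric less of a black box, here is a rough, dependency-free sketch of what a sentence-level BLEU score measures: the geometric mean of clipped n-gram precisions, multiplied by a brevity penalty. It is an illustration under simplifying assumptions (whitespace tokenisation, no smoothing), so it will not reproduce the values from `sacrebleu.sentence_bleu`, which applies its own tokeniser and the `exp` smoothing requested above. The function name `simple_sentence_bleu` is hypothetical:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens, scaled to 0-100."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


print(simple_sentence_bleu("The area code in Darlington is 01325.",
                           "The telephone area code for Darlington is 01325."))
```

The early return on a zero precision shows why smoothing matters for short sentences: a single missing 4-gram match would otherwise drive the whole score to zero.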

ROUGE Score:

In [9]:
r_scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

rouge_scores = {'precision': [], 'recall': [], 'fmeasure': []}

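for (model_output, gold_references) in zip(generated_text_list, reference_text_list):
  # RougeScorer.score expects (target, prediction), i.e. the reference comes first.
  score = r_scorer.score(gold_references, model_output)
  precision, recall, fmeasure = score['rouge1']
  rouge_scores['precision'].append(precision)
  rouge_scores['recall'].append(recall)
  rouge_scores['fmeasure'].append(fmeasure)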

In [10]:
gen_df['ROUGE (f1) score']= rouge_scores['fmeasure']
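
`RougeScorer.score` takes the reference (`target`) first and the candidate (`prediction`) second. Swapping the two exchanges precision and recall but leaves the F-measure unchanged, which is why only the F-measure is safe to read if the order is uncertain. The sketch below illustrates this with a simplified ROUGE-1 (lowercased whitespace tokens, no stemming), so its numbers are not expected to match `rouge_score` exactly. The helper name `rouge1` is hypothetical:

```python
from collections import Counter


def rouge1(target, prediction):
    """Simplified ROUGE-1: unigram precision, recall and F1 on whitespace tokens."""
    tgt = Counter(target.lower().split())
    pred = Counter(prediction.lower().split())
    overlap = sum((tgt & pred).values())  # clipped unigram matches
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(tgt.values()) if tgt else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "the area code in darlington is 01325"
candidate = "01325 is the area code"

p, r, f = rouge1(reference, candidate)
p_swapped, r_swapped, f_swapped = rouge1(candidate, reference)
print(p, r, f)
print(p_swapped, r_swapped, f_swapped)
```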

METEOR Score:

In [11]:
meteor_scores = []

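for (model_output, gold_references) in zip(generated_text_list, reference_text_list):
  hypothesis_tokens = word_tokenize(model_output)
  reference_tokens = word_tokenize(gold_references)
  # single_meteor_score expects (reference, hypothesis). METEOR weights recall more
  # heavily than precision, so the order matters; the values shown below were
  # recorded with the two arguments in the opposite order.
  meteor_score = single_meteor_score(reference_tokens, hypothesis_tokens)
  meteor_scores.append(meteor_score)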


In [12]:
gen_df['METEOR Scores'] = meteor_scores
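
Unlike the ROUGE F-measure, METEOR is not symmetric in its two arguments: NLTK's `single_meteor_score` takes the reference first and the hypothesis second, and the score combines precision and recall with a mean that favours recall. The sketch below shows only that recall-weighted mean on exact unigram matches. It leaves out stemming, synonym matching and the fragmentation penalty, so it is not a METEOR implementation, and the default `alpha=0.9` is an assumption about NLTK's default rather than something stated in this notebook. The helper name `unigram_fmean` is hypothetical:

```python
from collections import Counter


def unigram_fmean(reference_tokens, hypothesis_tokens, alpha=0.9):
    """Recall-weighted harmonic mean of exact-match unigram precision and recall."""
    ref = Counter(reference_tokens)
    hyp = Counter(hypothesis_tokens)
    matches = sum((ref & hyp).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp.values())
    recall = matches / sum(ref.values())
    # alpha close to 1 makes the mean follow recall much more than precision.
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


reference = "the area code in darlington is 01325".split()
hypothesis = "01325 is the area code".split()

correct_order = unigram_fmean(reference, hypothesis)
swapped_order = unigram_fmean(hypothesis, reference)
print(correct_order, swapped_order)
```

Here the short hypothesis has perfect precision but misses two reference words, so the score is lower in the correct order than in the swapped one.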

BLEURT Score:

In [19]:
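# Load BLEURT through the datasets library; with no configuration name it uses the default checkpoint.
bleurt = datasets.load_metric("bleurt")
# Available checkpoints, listed for reference only (this dictionary is not used below).
CHECKPOINT_URLS = {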
    "bleurt-tiny-128": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-128.zip",
    "bleurt-tiny-512": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-512.zip",
    "bleurt-base-128": "https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip",
    "bleurt-base-512": "https://storage.googleapis.com/bleurt-oss/bleurt-base-512.zip",
    "bleurt-large-128": "https://storage.googleapis.com/bleurt-oss/bleurt-large-128.zip",
    "bleurt-large-512": "https://storage.googleapis.com/bleurt-oss/bleurt-large-512.zip",
}

Using default BLEURT-Base checkpoint for sequence maximum length 128. You can use a bigger model for better results with e.g.: datasets.load_metric('bleurt', 'bleurt-large-512').


INFO:tensorflow:Reading checkpoint /root/.cache/huggingface/metrics/bleurt/default/downloads/extracted/887f2dc36c17f53c287f696681b8f7c947278407c1cf9f226662e16c8c0dc417/bleurt-base-128.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.


In [20]:
bleurt_scores = []
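for (model_output, gold_references) in zip(generated_text_list, reference_text_list):
  results = bleurt.compute(predictions=[model_output], references=[gold_references])
  # 'scores' is a list with one value per prediction; keep it as a float
  # rather than slicing the brackets off its string representation.
  bleurt_scores.append(results['scores'][0])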

In [22]:
gen_df['BLEURT Scores'] = bleurt_scores

BERTScore:

In [24]:
scorer = BERTScorer(lang="en", rescale_with_baseline=True)
P, R, F1 = scorer.score(generated_text_list, reference_text_list)

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [25]:
gen_df['bertscore F1'] = F1
gen_df['bertscore P'] = P
gen_df['bertscore R'] = R

In [26]:
gen_df

Unnamed: 0,input_text,target_text,generated_text,BLEU score,ROUGE (f1) score,METEOR Scores,BLEURT Scores,bertscore F1,bertscore P,bertscore R
0,Darlington | areaCode | 01325,The Darlington town has an area code of 01325.,"01325 is the area code for Darlington, in Engl...",3.499026,0.226415,0.079186,-0.5757420063018799,0.248634,0.103195,0.400114
1,Darlington | areaCode | 01325,The telephone area code for Darlington is 01325.,"01325 is the area code of Darlington, in Engla...",6.180307,0.350000,0.203252,-0.015493102371692657,0.394794,0.250160,0.545631
2,Darlington | areaCode | 01325,The area code in Darlington is 01325.,The area codes for the area codes of Darlingto...,8.493099,0.500000,0.419776,0.5403301119804382,0.693669,0.627686,0.760212
3,Israel | officialLanguage | Modern_Hebrew,The official language of Israel is modern Hebrew.,Modern Hebrew is a native language of the Isra...,3.989383,0.259259,0.153016,-0.0546884723007679,0.406802,0.267287,0.551954
4,Israel | officialLanguage | Modern_Hebrew,Israel’s official language is Modern Hebrew.,Modern Hebrew is a language of the Israeli peo...,10.552670,0.500000,0.375758,0.24549144506454468,0.609954,0.634244,0.584619
...,...,...,...,...,...,...,...,...,...,...
731,English_Without_Tears | writer | Anatole_de_Gr...,The writer of English Without Tears was Anatol...,The writer of English Without Tears is Anatoli...,47.987821,0.700000,0.708097,0.09726543724536896,0.593602,0.574926,0.611072
732,English_Without_Tears | writer | Anatole_de_Gr...,"""English Without Tears"" was written by Anatole...",Anatolian De Grunwald was the author of the no...,3.234245,0.500000,0.384663,-0.5824552178382874,0.536845,0.477058,0.596422
733,Nurhan_Atasoy | birthPlace | Turkey,Nurhan Atasoy's birthplace is Turkey.,"The birthplace of Nurhan Asoy is Turkey.""",20.164946,0.615385,0.506818,0.2720884382724762,0.577012,0.541755,0.611329
734,Nurhan_Atasoy | birthPlace | Turkey,The place where Nurhan Atasoy was born is Turkey.,"The birthplace of Nurhan Atasoy was Ankara, Tu...",9.560409,0.538462,0.345865,0.28510770201683044,0.605535,0.567538,0.642755


In [74]:
# save the scores
gen_df.to_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/output/webNLG2020_GPT2_evaluation_scores.csv')

In [28]:
import numpy as np

In [63]:
df_evaluation_mean = pd.DataFrame()

In [64]:
bleu_scores = np.array(bleu_scores).astype(np.float64)
bleu_mean = np.mean(bleu_scores)
df_evaluation_mean["Bleu Mean"] = [bleu_mean]
print(bleu_mean)

14.544828645506522


In [65]:
rouge_scores = gen_df['ROUGE (f1) score'].tolist() 
rouge_scores = np.array(rouge_scores).astype(np.float64)
rouge_mean = np.mean(rouge_scores)
df_evaluation_mean["Rouge Mean"] = [rouge_mean]
print(rouge_mean)

0.5112762659918528


In [66]:
meteor_scores = np.array(meteor_scores).astype(np.float64)
meteor_mean = np.mean(meteor_scores)
df_evaluation_mean["Meteor Mean"] = [meteor_mean]
print(meteor_mean)

0.39164307713572216


In [67]:
bleurt_scores = np.array(bleurt_scores).astype(np.float64)
bleurt_mean = np.mean(bleurt_scores)
df_evaluation_mean["Bleurt Mean"] = [bleurt_mean]
print(bleurt_mean)

-0.21171647527649917


In [68]:
F1 = np.array(F1).astype(np.float64)
F1_mean = np.mean(F1)
df_evaluation_mean["Bertscore F1 Mean"] = [F1_mean]
print(F1_mean)

0.4986154903409719


In [69]:
P = np.array(P).astype(np.float64)
P_mean = np.mean(P)
df_evaluation_mean["Bertscore Precision Mean"] = [P_mean]
print(P_mean)

0.43061864324408816


In [70]:
R = np.array(R).astype(np.float64)
R_mean = np.mean(R)
df_evaluation_mean["Bertscore Recall Mean"] = [R_mean]
print(R_mean)

0.5699330617018227


In [71]:
df_evaluation_mean

Unnamed: 0,Bleu Mean,Rouge Mean,Meteor Mean,Bleurt Mean,Bertscore F1 Mean,Bertscore Precision Mean,Bertscore Recall Mean
0,14.544829,0.511276,0.391643,-0.211716,0.498615,0.430619,0.569933


In [73]:
# save the scores
df_evaluation_mean.to_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/output/webNLG2020_GPT2_evaluation_mean_scores.csv')
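
The per-metric cells above all follow the same pattern: cast the scores to float, take the mean, and store it under a "... Mean" name. A dependency-free sketch of that pattern as a single helper is shown below, assuming the scores are available as plain lists whose values may be floats or numeric strings. The `mean_scores` helper and the `example_columns` dictionary are hypothetical, and the example values are the first three rows of the score table shown earlier:

```python
from statistics import fmean


def mean_scores(columns):
    """Return {"<metric> Mean": mean} for each metric, casting every value to float."""
    return {
        f"{name} Mean": fmean([float(value) for value in values])
        for name, values in columns.items()
    }


# Illustrative input only: three rows of two score columns.
example_columns = {
    "Bleu": [3.499026, 6.180307, 8.493099],
    "Bleurt": ["-0.5757420063018799", "-0.015493102371692657", "0.5403301119804382"],
}

summary = mean_scores(example_columns)
print(summary)
```

Inside the notebook, selecting the score columns of `gen_df`, casting them with `astype(float)` and calling `mean()` would be the pandas counterpart of this helper.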


End of the notebook and project