#RDF-to-Text: Fine-tuning GPT2 with WebNLG Corpus
###Fina Emilova Yilmaz Polat

This is the third notebook of a series of 4.

We are going to:
* pre-process WebNLG Dataset - Part 1
* fine-tune the GPT2 language model with the WebNLG Dataset - Part 2
* generate text with the trained model - Part 3
* evaluate generated text - Part 4

The WebNLG data (Gardent et al., 2017) was created to promote the development (i) of RDF verbalisers and (ii) of microplanners able to handle a wide range of linguistic constructions.

Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017, September). The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation (pp. 124-133).

GPT2 Language Model : Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.


In [None]:
#install required libraries
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Fo

In [5]:
#import required libraries
from google.colab import drive
import pandas as pd
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import os
import glob
import re
import xml.etree.ElementTree as ET
import csv
from csv import reader

In [6]:
MOUNTPOINT = '/content/gdrive'
Working_Dir = os.path.join(MOUNTPOINT, 'My Drive', 'WebNLG with GPT2')
drive.mount(MOUNTPOINT)
print(Working_Dir)

Mounted at /content/gdrive
/content/gdrive/My Drive/WebNLG with GPT2


We start with some pre-processing:
1. parse the test XML file with the reference texts.
2. keep only the instances with a single triple (in this study we evaluate only 1-triple instances).
3. save them in a CSV file.
4. remove unwanted characters from the input text.
5. clean up the generated text for the evaluation.
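
As a rough illustration of step 1, the sketch below parses a tiny inline sample instead of the real test file. The tag names (`entry`, `mtriple`, `lex`) and the sample content are assumptions modelled on the WebNLG XML layout, and `parse_entries` is a hypothetical helper, not part of this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; the real test file is much larger.
SAMPLE_XML = """
<benchmark>
  <entries>
    <entry eid="1">
      <modifiedtripleset>
        <mtriple>Darlington | areaCode | 01325</mtriple>
      </modifiedtripleset>
      <lex lid="Id1">The area code in Darlington is 01325.</lex>
      <lex lid="Id2">The telephone area code for Darlington is 01325.</lex>
    </entry>
  </entries>
</benchmark>
"""

def parse_entries(xml_text):
    """Return one (triples, reference) pair per reference text."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("entry"):
        # join all triples of the entry with the same separator the notebook uses
        triples = [t.text.strip() for t in entry.iter("mtriple")]
        joined = " && ".join(triples)
        # skip empty reference texts
        refs = [lex.text.strip() for lex in entry.iter("lex")
                if lex.text and lex.text.strip()]
        for ref in refs:
            rows.append((joined, ref))
    return rows

rows = parse_entries(SAMPLE_XML)
for triples, ref in rows:
    print(triples, "->", ref)
```

Each entry yields as many rows as it has reference texts, which is the same "one row per reference" shape the notebook writes to the CSV file.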

In [None]:
#Parse the test file

file = "/content/gdrive/My Drive/WebNLG with GPT2/data/test/rdf-to-text-generation-test-data-with-refs-en.xml"
triple_re=re.compile(r'(\d)triples')
data_dct={}
tree = ET.parse(file)
root = tree.getroot()
for sub_root in root:
    for ss_root in sub_root:
        strutured_master=[]
        unstructured=[]
        for entry in ss_root:
            unstructured.append(entry.text)
            strutured=[triple.text for triple in entry]
            strutured_master.extend(strutured)
        unstructured=[i for i in unstructured if i.replace('\n','').strip()!='' ]
        strutured_master_str=(' && ').join(strutured_master)
        data_dct[strutured_master_str]=unstructured
mdata_dct={"prefix":[], "input_text":[], "target_text":[]}
for st,unst in data_dct.items():
    for i in unst:
        mdata_dct['prefix'].append('webNLG')
        mdata_dct['input_text'].append(st)
        mdata_dct['target_text'].append(i)


df=pd.DataFrame(mdata_dct)
df.to_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/webNLG2020_test.csv')

In [None]:
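# Let's check the file:
test_df=pd.read_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/webNLG2020_test.csv', index_col=[0])
# Let's inspect the dataset. Note: `head` without parentheses shows the bound method
# together with the whole (truncated) frame; call test_df.head() to see only the first rows.
test_df.head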

<bound method NDFrame.head of       prefix                                         input_text  \
0     webNLG  Estádio_Municipal_Coaracy_da_Mata_Fonseca | lo...   
1     webNLG  Estádio_Municipal_Coaracy_da_Mata_Fonseca | lo...   
2     webNLG  Nie_Haisheng | birthDate | 1964-10-13 && Nie_H...   
3     webNLG  Nie_Haisheng | birthDate | 1964-10-13 && Nie_H...   
4     webNLG  Nie_Haisheng | birthDate | 1964-10-13 && Nie_H...   
...      ...                                                ...   
5145  webNLG  Turn_Me_On_(album) | genre | Punk_blues && Tur...   
5146  webNLG  Turn_Me_On_(album) | genre | Punk_blues && Tur...   
5147  webNLG  Ciudad_Ayala | country | Mexico && Ciudad_Ayal...   
5148  webNLG  Ciudad_Ayala | country | Mexico && Ciudad_Ayal...   
5149  webNLG  Ciudad_Ayala | country | Mexico && Ciudad_Ayal...   

                                            target_text  
0     Estádio Municipal Coaracy da Mata Fonseca is t...  
1     Estádio Municipal Coaracy da Mata Fonseca i

In [None]:
input_list = test_df['input_text'].tolist()
target_list = test_df['target_text'].tolist()

In [None]:
# keep only the instances with a single distinct triple
# (duplicate triples within an instance are collapsed by the set)
filtered_instances = []
for triples, targets in zip(input_list, target_list):
  #print(f"number of triples: {len(re.findall('&&', triples))}")
  #print(f"triples : {triples}")
  #print(f"targets : {targets}")
  triple_set = set()
  triples_list = triples.split("&&")
  #print(len(triples_list))
  for t in triples_list:
    t = t.strip()
    #print(t)
    triple_set.add(t)
  if len(triple_set) == 1:
    filtered_instances.append(tuple((triple_set, targets)))

In [None]:
# this cell is just for inspection
print(f"Number of 1 triple instances in the test set: {len(filtered_instances)}")
#for x, y in filtered_instances:
  #if len(x) == 1:
    #print(x)
    #print(len(x))
    #print(y)

Number of 1 triple instances in the test set: 736


In [None]:
# create a new dataframe and save it to the csv file
df_test_1triple = pd.DataFrame(filtered_instances, columns =['input_text', 'target_text'])
df_test_1triple.to_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/webNLG2020_test_1triple.csv')
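
Because each kept instance stores its triples as a Python `set`, the CSV file written above contains the set's string form (for example `{'Darlington | areaCode | 01325'}`), which is why braces and quotes are stripped further down. A possible alternative, sketched here with a hypothetical helper `single_triple_instances`, keeps the single triple as a plain string so that this clean-up is not needed:

```python
def single_triple_instances(inputs, targets, sep="&&"):
    """Keep (triple, target) pairs whose input has exactly one distinct triple."""
    kept = []
    for triples, target in zip(inputs, targets):
        # a set collapses repeated triples within one instance
        unique = {t.strip() for t in triples.split(sep)}
        if len(unique) == 1:
            # store the triple itself, not the set that contains it
            kept.append((next(iter(unique)), target))
    return kept

# toy data, not taken from the corpus
inputs = ["A | p | B && A | p | B", "A | p | B && C | q | D", "E | r | F"]
targets = ["t1", "t2", "t3"]
kept = single_triple_instances(inputs, targets)
print(kept)
# [('A | p | B', 't1'), ('E | r | F', 't3')]
```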

In [None]:
# let's load the saved file and check it:
test1triple_df=pd.read_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/webNLG2020_test_1triple.csv', index_col=[0])


In [None]:
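def remove_unwanted_chars(text):
  """Remove brackets, braces, parentheses, angle brackets and single quotes from text."""
  remove_list = ["[", "]", ")", "(", "'", "{", "}", "<", ">"]
  # iterate over the characters to remove rather than over the text being modified
  for char in remove_list:
    text = text.replace(char, "")
  return text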
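
The same clean-up can also be done in a single pass with `str.translate`. This is only a sketch: `strip_chars` is a hypothetical name, and it removes the same nine characters as the function above. Note that parentheses are dropped too, so an entity such as `Turn_Me_On_(album)` would become `Turn_Me_On_album`.

```python
# the nine characters removed by remove_unwanted_chars
UNWANTED = "[]()'{}<>"
# translation table that maps each unwanted character to "deleted"
_TABLE = str.maketrans("", "", UNWANTED)

def strip_chars(text):
    """Delete every unwanted character from text in one pass."""
    return text.translate(_TABLE)

print(strip_chars("{'Darlington | areaCode | 01325'}"))
# Darlington | areaCode | 01325
```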

In [None]:
test1triple_list = test1triple_df['input_text'].tolist()
clean_list = []
for x in test1triple_list:
  x = remove_unwanted_chars(x)
  clean_list.append(x)

In [None]:
test1triple_df['input_text'] = clean_list

In [None]:
test1triple_df['input_text']

0                          Darlington | areaCode | 01325
1                          Darlington | areaCode | 01325
2                          Darlington | areaCode | 01325
3              Israel | officialLanguage | Modern_Hebrew
4              Israel | officialLanguage | Modern_Hebrew
                             ...                        
731    English_Without_Tears | writer | Anatole_de_Gr...
732    English_Without_Tears | writer | Anatole_de_Gr...
733                  Nurhan_Atasoy | birthPlace | Turkey
734                  Nurhan_Atasoy | birthPlace | Turkey
735                  Nurhan_Atasoy | birthPlace | Turkey
Name: input_text, Length: 736, dtype: object

In [None]:
test1triple_df['target_text']

0         The Darlington town has an area code of 01325.
1       The telephone area code for Darlington is 01325.
2                  The area code in Darlington is 01325.
3      The official language of Israel is modern Hebrew.
4           Israel’s official language is Modern Hebrew.
                             ...                        
731    The writer of English Without Tears was Anatol...
732    "English Without Tears" was written by Anatole...
733                Nurhan Atasoy's birthplace is Turkey.
734    The place where Nurhan Atasoy was born is Turkey.
735                    Nurhan Atasoy was born in Turkey.
Name: target_text, Length: 736, dtype: object

It is time to generate sentences:



In [None]:
#set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#load the trained model
model = GPT2LMHeadModel.from_pretrained("/content/gdrive/My Drive/WebNLG with GPT2/model")
model.to(device)
print(device)

cuda


In [None]:
#load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')

Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/718 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
#inference function

def gen_text(triple, tokenizer, model):
  """Generate one sentence for the given triple with the fine-tuned model."""
  # move the prompt to the model's device instead of hard-coding .cuda()
  prompt = tokenizer("<|startoftext|>Triple:{} Target".format(triple), return_tensors="pt").input_ids.to(model.device)
  outputs = model.generate(prompt, do_sample=True, top_k=2, max_length=70, top_p=0.95, temperature=1.9, num_return_sequences=1)
  # change num_return_sequences if you want to generate more than one sentence for the given input;
  # note that only the last decoded sequence is returned
  for i, sample_output in enumerate(outputs):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    # note: str.strip removes any of these characters from both ends, not the literal substrings
    text = text.strip("<endoftext><startoftext>\n ")
    # keep only what follows the last "Target: " marker
    text = text.split("Target: ")
    text = text[-1]
  return str(text)
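
One detail worth knowing about the post-processing: `str.strip` treats its argument as a set of characters, not as a literal substring, so it can also remove ordinary letters at the ends of a sentence. The sketch below shows a more literal alternative on plain strings. `extract_target` and the token list are assumptions for illustration, and no model is needed to run it.

```python
# special tokens used by the tokenizer in this notebook
SPECIAL_TOKENS = ("<|startoftext|>", "<|endoftext|>", "<|pad|>")

def extract_target(decoded, marker="Target: "):
    """Drop literal special tokens and return the text after the last marker."""
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    # keep what follows the last marker and trim surrounding whitespace only
    return decoded.split(marker)[-1].strip()

decoded = ("<|startoftext|>Triple:Fina | birthDate | 30 April "
           "Target: The date of birth of the Fina is 30 April.<|endoftext|>")
print(extract_target(decoded))
# The date of birth of the Fina is 30 April.
```

Whether this is preferable depends on what the fine-tuned model actually emits, so it is worth checking a few raw decoded outputs before swapping the post-processing.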

In [None]:
test_triple = "Fina | birthDate | 30 April"
test_output = gen_text(test_triple, tokenizer, model)

print(test_output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The date of birth of the Fina is 30 April.


In [None]:
gen_text_list = []
for rdf in clean_list:
  text = gen_text(rdf, tokenizer, model)
  gen_text_list.append(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [None]:
for text in gen_text_list:
  #text = text.strip("<endoftext><startoftext>\n ")
  #text = text.split("Target: ")
  #text = text[-1]
  print(text)

01325 is the area code for Darlington, in England and Wales. The area codes for Darlington are 01326 and 01327. The area codes for the area codes of the area code of the area code is 01326 and 01327. The area codes of th
01325 is the area code of Darlington, in England, in England, in England. The area code of Darlington, is 01 325. The total area code for the city of Darlington, is 01325.
The area codes for the area codes of Darlington are 01325 and 01325.
Modern Hebrew is a native language of the Israelites. The language spoken in the country of Israel is spoken in the Modern Hebrew. The language spoken is spoken in the country of the Jews. The language spoken in the country of the Jews is Modern Hebrew.
Modern Hebrew is a language of the Israeli people.
Modern Hebrew is the official language in Israel.
Chinabank is located within the Philippine city of Manila. The location of Chinabank in the Philippines is the city of Manila, Philippines.
Chinabank are located in the Philippines.<s

In [None]:
test1triple_df['generated_text'] = gen_text_list

In [None]:
test1triple_df['generated_text']

0      01325 is the area code for Darlington, in Engl...
1      01325 is the area code of Darlington, in Engla...
2      The area codes for the area codes of Darlingto...
3      Modern Hebrew is a native language of the Isra...
4      Modern Hebrew is a language of the Israeli peo...
                             ...                        
731    The writer of English Without Tears is Anatoli...
732    Anatolian De Grunwald was the author of the no...
733            The birthplace of Nurhan Asoy is Turkey."
734    The birthplace of Nurhan Atasoy was Ankara, Tu...
735           The birthplace of Nurhan Atasoy is Ankara.
Name: generated_text, Length: 736, dtype: object

In [None]:
test1triple_df.head

<bound method NDFrame.head of                                             input_text  \
0                        Darlington | areaCode | 01325   
1                        Darlington | areaCode | 01325   
2                        Darlington | areaCode | 01325   
3            Israel | officialLanguage | Modern_Hebrew   
4            Israel | officialLanguage | Modern_Hebrew   
..                                                 ...   
731  English_Without_Tears | writer | Anatole_de_Gr...   
732  English_Without_Tears | writer | Anatole_de_Gr...   
733                Nurhan_Atasoy | birthPlace | Turkey   
734                Nurhan_Atasoy | birthPlace | Turkey   
735                Nurhan_Atasoy | birthPlace | Turkey   

                                           target_text  \
0       The Darlington town has an area code of 01325.   
1     The telephone area code for Darlington is 01325.   
2                The area code in Darlington is 01325.   
3    The official language of Israel is m

In [None]:
test1triple_df.to_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/output/webNLG2020_test_with_generated_outputs.csv')

In [7]:
# Let's check how the file looks:
gen_df=pd.read_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/output/webNLG2020_test_with_generated_outputs.csv', index_col=[0])
gen_df.head

<bound method NDFrame.head of                                             input_text  \
0                        Darlington | areaCode | 01325   
1                        Darlington | areaCode | 01325   
2                        Darlington | areaCode | 01325   
3            Israel | officialLanguage | Modern_Hebrew   
4            Israel | officialLanguage | Modern_Hebrew   
..                                                 ...   
731  English_Without_Tears | writer | Anatole_de_Gr...   
732  English_Without_Tears | writer | Anatole_de_Gr...   
733                Nurhan_Atasoy | birthPlace | Turkey   
734                Nurhan_Atasoy | birthPlace | Turkey   
735                Nurhan_Atasoy | birthPlace | Turkey   

                                           target_text  \
0       The Darlington town has an area code of 01325.   
1     The telephone area code for Darlington is 01325.   
2                The area code in Darlington is 01325.   
3    The official language of Israel is m

End of the notebook.