# RDF-to-Text: Fine-tuning GPT2 with the WebNLG Corpus
### Fina Emilova Yilmaz Polat

This is the first notebook in a series of four.

Over the series, we are going to:
* pre-process the WebNLG dataset (Part 1, this notebook)
* fine-tune the GPT2 language model on the WebNLG dataset (Part 2)
* generate text with the trained model (Part 3)
* evaluate the generated text (Part 4)

The WebNLG data (Gardent et al., 2017) was created to promote the development of (i) RDF verbalisers and (ii) microplanners able to handle a wide range of linguistic constructions.

Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017, September). The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation (pp. 124-133).

GPT2 language model: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

The code in this notebook is partially adapted from https://github.com/MathewAlexander/T5_nlg


In [None]:
# Import the required libraries

import pandas as pd
import os
from google.colab import drive
import glob
import re
import xml.etree.ElementTree as ET

In [None]:
MOUNTPOINT = '/content/gdrive'
Working_Dir = os.path.join(MOUNTPOINT, 'My Drive', 'WebNLG with GPT2')
drive.mount(MOUNTPOINT)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
print(Working_Dir)

/content/gdrive/My Drive/WebNLG with GPT2


Training set:

In [None]:
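# Collect the XML files that hold the 1-triple training entries.
files = glob.glob(os.path.join(Working_Dir, 'data', 'train', '1triples', '*.xml'))
# The folder name (e.g. "1triples") encodes how many triples each entry contains.
triple_re = re.compile(r'(\d)triples')
data_dct = {}
for file in files:
    tree = ET.parse(file)
    root = tree.getroot()
    triples_num = int(triple_re.findall(file)[0])
    for sub_root in root:
        for ss_root in sub_root:  # one corpus entry per iteration
            structured_master = []
            unstructured = []
            for entry in ss_root:
                # Children with reference text carry the sentence in .text;
                # children that wrap triples only carry whitespace there.
                unstructured.append(entry.text)
                # Grandchildren are the individual triples (none for reference texts).
                structured = [triple.text for triple in entry]
                structured_master.extend(structured)
            # Drop the missing or whitespace-only strings.
            unstructured = [i for i in unstructured if i and i.strip()]
            # Keep only the last triples_num triples of the entry.
            structured_master = structured_master[-triples_num:]
            structured_master_str = ' && '.join(structured_master)
            data_dct[structured_master_str] = unstructured

# Flatten to one row per (triples, reference text) pair.
mdata_dct = {"prefix": [], "input_text": [], "target_text": []}
for st, unst in data_dct.items():
    for i in unst:
        mdata_dct['prefix'].append('webNLG')
        mdata_dct['input_text'].append(st)
        mdata_dct['target_text'].append(i)

df = pd.DataFrame(mdata_dct)
df.to_csv(os.path.join(Working_Dir, 'data', 'webNLG2020_train.csv'))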
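
To make the nested loops in the cell above easier to follow, here is a minimal sketch that parses one small hand-written entry. The sample XML, including its element and attribute names (`entry`, `originaltripleset`, `modifiedtripleset`, `mtriple`, `lex`), is an assumption modelled on the WebNLG release format and not a file from the corpus; the two reference sentences are invented for illustration, and `extract_pairs` is a hypothetical helper that is not part of this notebook. Unlike the cell above, it selects elements by tag name instead of by position, and it strips surrounding whitespace.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (assumed layout, invented reference sentences).
SAMPLE_XML = """
<benchmark>
  <entries>
    <entry category="SportsTeam" eid="Id1" size="1">
      <originaltripleset>
        <otriple>VfL_Wolfsburg | league | Bundesliga</otriple>
      </originaltripleset>
      <modifiedtripleset>
        <mtriple>VfL_Wolfsburg | league | Bundesliga</mtriple>
      </modifiedtripleset>
      <lex comment="good" lid="Id1">VfL Wolfsburg plays in the Bundesliga.</lex>
      <lex comment="good" lid="Id2">The league of VfL Wolfsburg is the Bundesliga.</lex>
    </entry>
  </entries>
</benchmark>
"""


def extract_pairs(xml_text):
    """Return (linearised triples, reference text) pairs from WebNLG-style XML."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for entry in root.iter("entry"):
        # Join the triples of the entry with the same ' && ' separator used above.
        triples = [m.text.strip() for m in entry.iter("mtriple")]
        source = " && ".join(triples)
        # Every non-empty <lex> element yields one training pair.
        for lex in entry.iter("lex"):
            if lex.text and lex.text.strip():
                pairs.append((source, lex.text.strip()))
    return pairs


pairs = extract_pairs(SAMPLE_XML)
for source, target in pairs:
    print(source, "->", target)
```

Selecting by tag name avoids relying on the order of the child elements, at the cost of hard-coding names that should be checked against the actual files first.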

In [None]:
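# Let's check the file:
train_df = pd.read_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/webNLG2020_train.csv', index_col=[0])
# Let's inspect the dataset. Note that `head` is referenced without parentheses, so the
# output is the bound method's repr, which embeds a truncated view of the whole frame;
# call train_df.head() to display only the first five rows.
train_df.head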

<bound method NDFrame.head of       prefix                                         input_text  \
0     webNLG  11th_Mississippi_Infantry_Monument | category ...   
1     webNLG  11th_Mississippi_Infantry_Monument | category ...   
2     webNLG  11th_Mississippi_Infantry_Monument | category ...   
3     webNLG  11th_Mississippi_Infantry_Monument | country |...   
4     webNLG  11th_Mississippi_Infantry_Monument | country |...   
...      ...                                                ...   
7460  webNLG  United_Petrotrin_F.C. | ground | Palo_Seco_Vel...   
7461  webNLG                VfL_Wolfsburg | league | Bundesliga   
7462  webNLG                VfL_Wolfsburg | league | Bundesliga   
7463  webNLG           VfL_Wolfsburg | manager | Dieter_Hecking   
7464  webNLG           VfL_Wolfsburg | manager | Dieter_Hecking   

                                            target_text  
0     The 11th Mississippi Infantry Monument falls u...  
1     The 11th Mississippi Infantry Monument is c

Development set:

In [None]:
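# Collect the XML files that hold the 1-triple development entries.
files = glob.glob(os.path.join(Working_Dir, 'data', 'dev', '1triples', '*.xml'))
# The folder name (e.g. "1triples") encodes how many triples each entry contains.
triple_re = re.compile(r'(\d)triples')
data_dct = {}
for file in files:
    tree = ET.parse(file)
    root = tree.getroot()
    triples_num = int(triple_re.findall(file)[0])
    for sub_root in root:
        for ss_root in sub_root:  # one corpus entry per iteration
            structured_master = []
            unstructured = []
            for entry in ss_root:
                # Children with reference text carry the sentence in .text;
                # children that wrap triples only carry whitespace there.
                unstructured.append(entry.text)
                # Grandchildren are the individual triples (none for reference texts).
                structured = [triple.text for triple in entry]
                structured_master.extend(structured)
            # Drop the missing or whitespace-only strings.
            unstructured = [i for i in unstructured if i and i.strip()]
            # Keep only the last triples_num triples of the entry.
            structured_master = structured_master[-triples_num:]
            structured_master_str = ' && '.join(structured_master)
            data_dct[structured_master_str] = unstructured

# Flatten to one row per (triples, reference text) pair.
mdata_dct = {"prefix": [], "input_text": [], "target_text": []}
for st, unst in data_dct.items():
    for i in unst:
        mdata_dct['prefix'].append('webNLG')
        mdata_dct['input_text'].append(st)
        mdata_dct['target_text'].append(i)

df = pd.DataFrame(mdata_dct)
df.to_csv(os.path.join(Working_Dir, 'data', 'webNLG2020_dev.csv'))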
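
The training and development cells differ only in the split name, so the shared logic could be factored into a single function. The following is a minimal sketch under the same directory layout (`data/<split>/1triples/*.xml`); `build_rows` is a hypothetical helper, the sample XML is hand-written with a placeholder reference text, and the demo writes into a temporary directory instead of Google Drive. One difference from the cells above: rows are appended to a list and not stored in a dictionary keyed by the triple string, so repeated triple sets would not overwrite each other.

```python
import glob
import os
import re
import tempfile
import xml.etree.ElementTree as ET

TRIPLE_RE = re.compile(r'(\d)triples')


def build_rows(data_dir, split, folder='1triples'):
    """Return (input_text, target_text) rows for one split."""
    # The folder name encodes the number of triples per entry.
    triples_num = int(TRIPLE_RE.findall(folder)[0])
    rows = []
    pattern = os.path.join(data_dir, split, folder, '*.xml')
    for path in sorted(glob.glob(pattern)):
        for entries in ET.parse(path).getroot():
            for entry in entries:
                texts, triples = [], []
                for child in entry:
                    texts.append(child.text)
                    triples.extend(t.text for t in child)
                source = ' && '.join(triples[-triples_num:])
                # Skip the missing or whitespace-only texts of the triple wrappers.
                rows.extend((source, t) for t in texts if t and t.strip())
    return rows


# Hand-written sample entry (assumed layout, placeholder reference text).
SAMPLE = """<benchmark>
  <entries>
    <entry category="City" eid="Id1" size="1">
      <originaltripleset>
        <otriple>Point_Fortin | country | Trinidad</otriple>
      </originaltripleset>
      <modifiedtripleset>
        <mtriple>Point_Fortin | country | Trinidad</mtriple>
      </modifiedtripleset>
      <lex comment="good" lid="Id1">Placeholder reference text.</lex>
    </entry>
  </entries>
</benchmark>
"""

with tempfile.TemporaryDirectory() as data_dir:
    sample_dir = os.path.join(data_dir, 'dev', '1triples')
    os.makedirs(sample_dir)
    with open(os.path.join(sample_dir, 'sample.xml'), 'w', encoding='utf-8') as f:
        f.write(SAMPLE)
    rows = build_rows(data_dir, 'dev')

print(rows)
```

With such a helper, each split would reduce to one call followed by `pd.DataFrame(rows, columns=['input_text', 'target_text'])` and the `prefix` column assignment.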

In [None]:
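# Let's check the file:
dev_df = pd.read_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/webNLG2020_dev.csv', index_col=[0])
# Let's inspect the dataset. As with the training set, `head` is referenced without
# parentheses, so the output is the bound method's repr with a truncated view of the
# whole frame; call dev_df.head() to display only the first five rows.
dev_df.head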

<bound method NDFrame.head of      prefix                                         input_text  \
0    webNLG  Accademia_di_Architettura_di_Mendrisio | acade...   
1    webNLG  Accademia_di_Architettura_di_Mendrisio | acade...   
2    webNLG  Accademia_di_Architettura_di_Mendrisio | acade...   
3    webNLG  Acharya_Institute_of_Technology | campus | "In...   
4    webNLG  Acharya_Institute_of_Technology | campus | "In...   
..      ...                                                ...   
954  webNLG         Peñarol | manager | Jorge_Orosmán_da_Silva   
955  webNLG         Peñarol | manager | Jorge_Orosmán_da_Silva   
956  webNLG                  Point_Fortin | country | Trinidad   
957  webNLG                  Point_Fortin | country | Trinidad   
958  webNLG                  Point_Fortin | country | Trinidad   

                                           target_text  
0    The academic staff size of Accademia di Archit...  
1    The academic staff number 100 at the Accademia...  
2    T
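
Beyond eyeballing the first and last rows, a few aggregate checks can confirm that a CSV has the expected shape. The sketch below runs on a toy frame with the same three columns; its `input_text` values reuse triples visible in the outputs above, the `<reference N>` strings are placeholders and not corpus text, and `toy_df` is a hypothetical name. Swapping in `train_df` or `dev_df` should work the same way, assuming the CSVs were written as shown.

```python
import pandas as pd

# Toy frame mirroring the columns of the CSV files (placeholder targets).
toy_df = pd.DataFrame({
    "prefix": ["webNLG"] * 5,
    "input_text": [
        "VfL_Wolfsburg | league | Bundesliga",
        "VfL_Wolfsburg | league | Bundesliga",
        "VfL_Wolfsburg | manager | Dieter_Hecking",
        "VfL_Wolfsburg | manager | Dieter_Hecking",
        "Point_Fortin | country | Trinidad",
    ],
    "target_text": [
        "<reference 1>",
        "<reference 2>",
        "<reference 3>",
        "<reference 4>",
        "<reference 5>",
    ],
})

# Number of reference texts available for each linearised triple set.
refs_per_input = toy_df.groupby("input_text")["target_text"].count()

# Rows whose target is missing (NaN), which can happen after a CSV round trip.
n_missing = int(toy_df["target_text"].isna().sum())

# Exact duplicate (input, target) pairs.
n_duplicates = int(toy_df.duplicated(subset=["input_text", "target_text"]).sum())

print(refs_per_input)
print("missing targets:", n_missing)
print("duplicate pairs:", n_duplicates)
```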

End of Part 1