# RDF-to-Text: Fine-tuning GPT2 with WebNLG Corpus
### Fina Emilova Yilmaz Polat

This is the second notebook of a series of 4.

We are going to:
* pre-process the WebNLG dataset - Part 1
* fine-tune the GPT2 language model on the WebNLG dataset - Part 2
* generate text with the trained model - Part 3
* evaluate the generated text - Part 4

The WebNLG data (Gardent et al., 2017) was created to promote the development (i) of RDF verbalisers and (ii) of microplanners able to handle a wide range of linguistic constructions.

Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017, September). The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation (pp. 124-133).

GPT2 Language Model: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

The code in this notebook is partially adapted from https://towardsdatascience.com/guide-to-fine-tuning-text-generation-models-gpt-2-gpt-neo-and-t5-dc5de6b3bc5e

In [None]:
#install required libraries
!pip install transformers



In [None]:
!pip install pynvml



In [None]:
#import required libraries
import os
from google.colab import drive
import pandas as pd
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
from torch.utils.data import Dataset
from transformers import TrainingArguments, Trainer

# only the NVML calls used by print_gpu_utilization()
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

In [None]:
MOUNTPOINT = '/content/gdrive'
Working_Dir = os.path.join(MOUNTPOINT, 'My Drive', 'WebNLG with GPT2')
drive.mount(MOUNTPOINT)
print(Working_Dir)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/WebNLG with GPT2


In [None]:
# Dataset that tokenizes each (triple, target) pair into one training sequence
class MyDataset(Dataset):
    def __init__(self, input_list, target_list, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        for inputs, targets in zip(input_list, target_list):
            prep_input = f'Triple: {inputs} '
            pred_output = f'Target: {targets}'

            # wrap each example in the tokenizer's BOS/EOS special tokens exactly once;
            # padding="max_length" without max_length pads to the tokenizer's model maximum
            encodings_dict = tokenizer('<|startoftext|>' + prep_input + pred_output + '<|endoftext|>', truncation=True, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # labels are built later, in the data collator passed to Trainer
        return self.input_ids[idx], self.attn_masks[idx]
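
Because `__getitem__` returns only the input ids and the attention mask, the labels have to be derived when batches are assembled. This notebook later reuses the unmasked input ids as labels, which means the padded positions also contribute to the loss. A common alternative, sketched here with plain lists, is to set the labels at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. The helper name `mask_padding_labels` and the toy ids are assumptions made for illustration and are not part of the original code.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


# Toy example (illustrative ids): three real tokens followed by two padding tokens
example_ids = [50257, 15, 50256, 50258, 50258]
example_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(example_ids, example_mask)
print(labels)
```

With tensors, the same idea could be applied inside the data collator before the batch is handed to the model.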

In [None]:
# to handle cuda memory issues
def print_gpu_utilization():
    """Print the memory currently in use on GPU 0, to help diagnose memory issues."""
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

Load Training Data:

In [None]:
# load training data
train_df=pd.read_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/webNLG2020_train.csv', index_col=[0])
#train_df.head

In [None]:
train_input_list = train_df['input_text'].tolist()
print(len(train_input_list))
print(train_input_list[1])
#train_input_list = train_input_list[:100]
train_target_list = train_df['target_text'].tolist()
print(len(train_target_list))
print(train_target_list[1])
#train_target_list = train_target_list[:100]

7465
11th_Mississippi_Infantry_Monument | category | Contributing_property
7465
The 11th Mississippi Infantry Monument is categorized as a Contributing Property.
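
The printed example shows the input format: subject, predicate and object separated by ` | `, with underscores in place of spaces inside entity names. As a minimal sketch of that format, assuming each row holds exactly one triple (rows with several triples would need a different split), a hypothetical helper `parse_triple` could break such a line into its three fields:

```python
def parse_triple(line):
    """Split 'subject | predicate | object' into a 3-tuple, turning underscores into spaces."""
    parts = [p.strip().replace('_', ' ') for p in line.split(' | ')]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return tuple(parts)


subject, predicate, obj = parse_triple(
    "11th_Mississippi_Infantry_Monument | category | Contributing_property")
print(subject, '->', predicate, '->', obj)
```

The notebook itself feeds the raw string to the tokenizer; this helper is only meant to make the structure of the input explicit.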


Load Validation Data:

In [None]:
# load validation data
val_df=pd.read_csv('/content/gdrive/My Drive/WebNLG with GPT2/data/webNLG2020_dev.csv', index_col=[0])
#val_df.head

In [None]:
val_input_list = val_df['input_text'].tolist()
print(val_input_list[1])
print(len(val_input_list))
#val_input_list = val_input_list[:10]
val_target_list = val_df['target_text'].tolist()
print(val_target_list[1])
print(len(val_target_list))
#val_target_list = val_target_list[:10]

Accademia_di_Architettura_di_Mendrisio | academicStaffSize | 100
959
The academic staff number 100 at the Accademia di Architettura di Mendrisio.
959


In [None]:
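model_name = "gpt2"
torch.manual_seed(42)  # seed PyTorch's random number generator for reproducibility
# load the tokenizer that matches model_name and register the special tokens
tokenizer = GPT2Tokenizer.from_pretrained(model_name, bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')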


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
train_dataset = MyDataset(train_input_list, train_target_list, tokenizer)
val_dataset = MyDataset(val_input_list, val_target_list, tokenizer)

In [None]:
model_dir = "/content/gdrive/My Drive/WebNLG with GPT2/model"

In [None]:
torch.cuda.empty_cache()
print_gpu_utilization()

GPU memory occupied: 0 MB.


In [None]:
model = GPT2LMHeadModel.from_pretrained(model_name).cuda()
model.resize_token_embeddings(len(tokenizer))

Embedding(50259, 768)
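
The embedding size of 50259 can be checked by hand. The pretrained GPT-2 vocabulary has 50257 entries and already contains `<|endoftext|>` (id 50256), so only two of the three special tokens passed to the tokenizer are actually new. The following sketch reproduces that count; the variable names are illustrative.

```python
BASE_VOCAB_SIZE = 50257            # size of the pretrained GPT-2 vocabulary
already_present = {'<|endoftext|>'}  # part of the base vocabulary (id 50256)
requested = ['<|startoftext|>', '<|endoftext|>', '<|pad|>']

# only tokens missing from the base vocabulary enlarge the embedding matrix
new_tokens = [t for t in requested if t not in already_present]
expected_vocab_size = BASE_VOCAB_SIZE + len(new_tokens)
print(new_tokens, expected_vocab_size)
```

This is why `resize_token_embeddings(len(tokenizer))` is needed before training: the two new rows do not exist in the pretrained embedding matrix.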

In [None]:
print_gpu_utilization()

GPU memory occupied: 1290 MB.


In [None]:
training_args = TrainingArguments(output_dir= model_dir, num_train_epochs=1, logging_steps=1000, save_steps=5000,
                                  per_device_train_batch_size=1, per_device_eval_batch_size=1, gradient_accumulation_steps=4,
                                  gradient_checkpointing=True, warmup_steps=10, weight_decay=0.05, logging_dir='./logs')

In [None]:
Trainer(model=model,  args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset, data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                                              'attention_mask': torch.stack([f[1] for f in data]),
                                                              'labels': torch.stack([f[0] for f in data])}).train()


***** Running training *****
  Num examples = 7465
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 4
  Total optimization steps = 1866
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
1000,0.0928



TrainOutput(global_step=1866, training_loss=0.06940370255089011, metrics={'train_runtime': 6742.9503, 'train_samples_per_second': 1.107, 'train_steps_per_second': 0.277, 'total_flos': 3900567453696000.0, 'train_loss': 0.06940370255089011, 'epoch': 1.0})
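
The figures in the training log follow from the training arguments. The sketch below recomputes them under the assumption of a single GPU and of the trainer counting one optimization step per full group of `gradient_accumulation_steps` batches (floor division); the variable names are illustrative.

```python
num_examples = 7465        # "Num examples" in the log
per_device_batch = 1       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
num_epochs = 1             # num_train_epochs
train_runtime_s = 6742.9503  # 'train_runtime' from TrainOutput

# batches per epoch (ceil division), then optimizer steps per epoch (floor division)
batches_per_epoch = -(-num_examples // per_device_batch)
steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s
print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The results agree with the logged 1866 optimization steps, 1.107 samples per second and 0.277 steps per second. The run took roughly 1 hour 52 minutes for one epoch.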

In [None]:
model.save_pretrained(model_dir)

Configuration saved in /content/gdrive/My Drive/WebNLG with GPT2/model/config.json
Model weights saved in /content/gdrive/My Drive/WebNLG with GPT2/model/pytorch_model.bin


The model is trained and saved. Here is a quick test with one example.

In [None]:
# put the model in evaluation mode
_ = model.eval()

In [None]:
triple = "Angelina Jolie | birth name | Angelina Jolie Voight"

prompt = tokenizer("<|startoftext|>Triple:{} \nTarget: ".format(triple), return_tensors="pt").input_ids.cuda()

outputs = model.generate(prompt, do_sample=True, top_k=2, max_length=100, top_p=0.95, temperature=1.9, num_return_sequences=5)

for i, sample_output in enumerate(outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: Triple:Angelina Jolie | birth name | Angelina Jolie Voight 
Target: シャラショウォール is the birth name of Angelina Jolie, born onth December 1941.
1: Triple:Angelina Jolie | birth name | Angelina Jolie Voight 
Target: シャラウィンドウィル is the name of the birth name of Angelina Jolie.
2: Triple:Angelina Jolie | birth name | Angelina Jolie Voight 
Target: シャルシャル Jolie is the name of an angelina Jolie born in the Philippines.
3: Triple:Angelina Jolie | birth name | Angelina Jolie Voight 
Target: シャラシュ_Angelina_Jolie was born on 31st of December 1941. The name Angelina Jolie is the name of the birth of the singer, Angelina Jolie.
4: Triple:Angelina Jolie | birth name | Angelina Jolie Voight 
Target: シャラウィンド is a reference to Angelina Jolie's birth name, Angelina Jolie.
Triple: Angelina Jolie's maiden name is Angelina Joliet.
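
Each decoded sample repeats the prompt, and the last sample shows the model continuing with a further `Triple:` line after the target sentence. A minimal post-processing sketch, assuming the generated sentence is whatever follows the first `Target:` marker up to the next line break, could look like this. The helper `extract_target` and the sample string are hypothetical; the string is hand-written for illustration and is not model output.

```python
def extract_target(decoded, marker="Target:"):
    """Return the text after the first marker, cut at the first line break."""
    _, found, tail = decoded.partition(marker)
    if not found:
        return ""
    tail = tail.strip()
    return tail.splitlines()[0].strip() if tail else ""


# Hand-written sample in the same layout as the decoded outputs
sample = ("Triple:Angelina Jolie | birth name | Angelina Jolie Voight \n"
          "Target: Angelina Jolie's birth name is Angelina Jolie Voight.\n"
          "Triple: trailing text that should be dropped")
print(extract_target(sample))
```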


End of the second notebook. 