# Attempt 2

## Node Generation

Node generation relies on the representational power of pre-trained language models. It can be framed as a sequence-to-sequence task: the input is a sequence of text and the output is a sequence of nodes.

The pre-trained model used in the paper is T5 (Raffel et al.), which casts every text-based NLP problem as a text-to-text generation task. In our case, the input is the text sequence, and the sequence of nodes is likewise generated as a text sequence.

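Two special tokens are added for this task:
1. `<NODE_SEP>`: separates consecutive nodes, so that multi-word nodes stay intact.
2. `<NO_NODE>`: a filler token. When a maximum number of nodes is set and fewer nodes are generated, the remaining slots are filled with this token.

In the tokenizer cell below, these tokens are registered under the names `__node_sep__` and `__no_node__`.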
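The role of the two tokens can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `build_target` and `parse_target`, the `max_nodes` parameter, and the exact layout (spaces around the separator, fillers also separated by the separator) are hypothetical and not taken from the paper.

```python
from typing import List

# Token names as registered in the tokenizer cell of this notebook.
NODE_SEP = "__node_sep__"
NO_NODE = "__no_node__"


def build_target(nodes: List[str], max_nodes: int) -> str:
    """Join nodes with NODE_SEP and pad with NO_NODE up to max_nodes slots.

    Hypothetical helper: the slot layout is an assumption for illustration.
    """
    if len(nodes) > max_nodes:
        raise ValueError("more nodes than max_nodes")
    slots = nodes + [NO_NODE] * (max_nodes - len(nodes))
    return f" {NODE_SEP} ".join(slots)


def parse_target(text: str) -> List[str]:
    """Invert build_target: split on NODE_SEP and drop the NO_NODE fillers."""
    parts = [part.strip() for part in text.split(NODE_SEP)]
    return [part for part in parts if part and part != NO_NODE]


target = build_target(["Aarhus Airport", "Aarhus", "Jacob Bundsgaard"], max_nodes=5)
print(target)
print(parse_target(target))
```

Because the separator is a dedicated token, a multi-word node such as "Aarhus Airport" survives the round trip as a single node.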

In [2]:
!pip install sentencepiece



In [3]:
!pip install xmltodict


In [4]:
from transformers import T5Tokenizer

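tokenizer = T5Tokenizer.from_pretrained('t5-small')
# Register the two task-specific tokens; each call returns the number of tokens added.
# If these tokens are used for fine-tuning, make sure the model's embedding matrix
# covers them, e.g. model.resize_token_embeddings(len(tokenizer)).
tokenizer.add_tokens('__no_node__')
tokenizer.add_tokens('__node_sep__')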

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


1

In [5]:
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('t5-small')

ip = tokenizer("generate nodes: A pizza trailer business in Galway East set up in the depths of the pandemic lockdown has landed a major national restaurant award. Ugly Doughlicious – or Ugly D’s – trades at Yeats Lodge in Peterswell Friday to Sunday, and on Saturday mornings at the Ardrahan Farmers Market. Selling DIY at home pizza kits as well as pizzas cooked in a wood fired oven in the trailer, the business has proved popular with lovers of the Italian staple since it set up. Despite being up against some much larger and more established brands at the national finals of the Irish Restaurant Awards, Ugly D’s was named Innovator of the Year for Connacht. Born during family pizza nights in the Covid pandemic lockdowns, Michelle Creavan and Darren Hoare tried just about very oven pizza on the market before they began experimenting with their own dough recipes.",return_tensors="pt")
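# Generate with the off-the-shelf checkpoint (not yet fine-tuned for node generation),
# so the output is not expected to be a list of nodes.
output = model.generate(inputs=ip.input_ids, attention_mask=ip.attention_mask, max_length=150)
print(tokenizer.decode(output[0], skip_special_tokens=True))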

in the depths of the pandemic lockdown has landed a major national restaurant award. Ugly Doughlicious – or Ugly D’s – trades at Yeats Lodge in Peterswell. despite being up against some much larger and more established brands at the national finals of the Irish Restaurant Awards.


In [6]:
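import torch

# Single forward pass, feeding only the decoder start token
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

output = model(input_ids=ip.input_ids, attention_mask=ip.attention_mask, decoder_input_ids=decoder_input_ids, output_hidden_states=True)
logits_nodes = output.logits                       # (batch, decoder_len, vocab_size)
joint_features = output.decoder_hidden_states[-1]  # hidden states of the last decoder layer
logits_nodes.argmax(-1)                            # id of the most likely first token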

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


tensor([[32099]])

In [7]:
import xmltodict
import pprint

with open('Airport.xml', 'r', encoding='utf-8') as file:
    my_xml = file.read()

my_dict = xmltodict.parse(my_xml)

airport = my_dict['benchmark']['entries']['entry']
airport[0].keys()

dict_keys(['@category', '@eid', '@shape', '@shape_type', '@size', 'originaltripleset', 'modifiedtripleset', 'lex'])

In [8]:
print(airport[0]['originaltripleset']['otriple'])
print(airport[0]['modifiedtripleset']['mtriple'])
print(airport[0]['lex'][0]['#text'])

['Aarhus | leaderName | Jacob_Bundsgaard', 'Aarhus_Airport | city | Aarhus']
['Aarhus | leader | Jacob_Bundsgaard', 'Aarhus_Airport | cityServed | Aarhus']
Aarhus airport serves the city of Aarhus whose leader is Jacob Bundsgaard.


In [9]:
len(airport)

193

In [10]:
import xmltodict

# Split a 'subject | predicate | object' triple into its cleaned components
def get_sub_obj(triple_text,sep):
  components = triple_text.split(sep)
  sub = components[0].strip().replace('_',' ')
  pred = components[1].strip().replace('_',' ')
  obj = components[2].strip().replace('_',' ')

  # The whole triple is returned so the model can be trained on full triples;
  # return (sub,obj) instead to train on the nodes only.
  return (sub,pred,obj)

# xml_doc: xml filename with .xml extension
def get_input_target_pairs(xml_doc,node_sep):
  with open(xml_doc, 'r', encoding='utf-8') as file:
    xml_text = file.read()

  pairs = []
  topics = xmltodict.parse(xml_text)['benchmark']['entries']['entry']
  for topic in topics:
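    # target nodes extraction
    otriple = topic['originaltripleset']
    # xmltodict yields a list only when an element repeats, so normalize both levels
    triplesets = otriple if isinstance(otriple,list) else [otriple]
    target_nodes = []
    for tripleset in triplesets:
      triples = tripleset['otriple']
      # a single <otriple> comes back as a str; extend() would split it into characters
      target_nodes.extend([triples] if isinstance(triples,str) else triples)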

    seps_added = []
    for trip_str in target_nodes:
      seps_added.append(node_sep.join(get_sub_obj(trip_str,sep='|')))

    target_nodes = seps_added

    # input text extraction
    lex = topic['lex']
    input_texts = []
    if isinstance(lex,list):
      for lexmini in lex:
        input_texts.append(lexmini['#text'])
    else:
      input_texts.append(lex['#text'])

    for input_text in input_texts:
      temp_dict = {
          'input':'generate_triples:'+input_text,
          'output': target_nodes
      }
      pairs.append(temp_dict)

  return pairs
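As a worked example of what the function above produces for one entry, the sketch below linearizes the two `Aarhus` triples shown earlier and builds the corresponding record. It is self-contained and does not read any XML; the helper name `linearize_triple` is hypothetical, while the `__sep__` separator, the `generate_triples:` prefix, and the `input`/`output` keys mirror the notebook code.

```python
from typing import Dict, List, Union


def linearize_triple(triple_text: str, node_sep: str, sep: str = "|") -> str:
    """Turn 'subject | predicate | object' into one string joined by node_sep."""
    parts = [part.strip().replace("_", " ") for part in triple_text.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 components, got {len(parts)}: {triple_text!r}")
    return node_sep.join(parts)


# Triples and sentence copied from the first Airport.xml entry printed above.
triples = [
    "Aarhus | leaderName | Jacob_Bundsgaard",
    "Aarhus_Airport | city | Aarhus",
]
sentence = "Aarhus airport serves the city of Aarhus whose leader is Jacob Bundsgaard."

record: Dict[str, Union[str, List[str]]] = {
    "input": "generate_triples:" + sentence,
    "output": [linearize_triple(t, "__sep__") for t in triples],
}
print(record["output"])
# ['Aarhus__sep__leaderName__sep__Jacob Bundsgaard', 'Aarhus Airport__sep__city__sep__Aarhus']
```

Each verbalization (`lex`) of an entry yields one such record, so an entry with several sentences contributes several input/target pairs that share the same target list.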

In [46]:
airport_dataset = get_input_target_pairs('Airport.xml','__sep__')

In [12]:
import json

with open('Airport.json','w',encoding='utf-8') as file:
  file.write(json.dumps(airport_dataset,indent=4))

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [20]:
import xmltodict

artist_xml_file = '/content/drive/MyDrive/webnlg-train/Artist.xml'
artist_dataset = get_input_target_pairs(artist_xml_file,'__sep__')

In [21]:
import json

with open('Artist.json','w',encoding='utf-8') as file:
  file.write(json.dumps(artist_dataset,indent=4))

In [None]:
airport_dataset.extend(artist_dataset)

In [30]:
import json

with open('Flyartist.json','w',encoding='utf-8') as file:
  file.write(json.dumps(airport_dataset,indent=4))

In [49]:
import os
import json
import random

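training_dir = '/content/drive/MyDrive/webnlg-train'
xml_paths = []
master_dataset = []
# Collect every .xml file under training_dir, accumulating across directories
# instead of overwriting the list on each os.walk iteration
for (root,dirs,files) in os.walk(training_dir):
  xml_paths.extend(os.path.join(root,name) for name in files if name.endswith('.xml'))

for xml_path in xml_paths:
  dataset = get_input_target_pairs(xml_path,'__sep__')
  print(os.path.basename(xml_path)+':'+str(len(dataset)))
  master_dataset.extend(dataset)

random.shuffle(master_dataset)
print('---')
print('Master_dataset'+':'+str(len(master_dataset)))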

Artist.xml:642
City.xml:500
Monument.xml:101
WrittenWork.xml:554
Airport.xml:491
Athlete.xml:534
Politician.xml:702
Food.xml:760
University.xml:105
Company.xml:232
CelestialBody.xml:360
MeanOfTransportation.xml:611
Building.xml:451
Astronaut.xml:194
SportsTeam.xml:498
ComicsCharacter.xml:213
---
Master_dataset:6948


In [50]:
import json

with open('Master_dataset.json','w',encoding='utf-8') as file:
  file.write(json.dumps(master_dataset,indent=4))