#KRW Group Project: Building Narratives from Knowledge Graphs
##Group number: P5-2
###Group members: Fina Polat, Hein Kolk, Jelle Wassenaar, Siddharth Chaubal

This is the last notebook in a series of three.

Research goal: We are a newspaper agency and want to develop a system that creates articles semi-automatically. The goal is to create a newspaper story using information from existing KGs, and to help readers better understand the content and setting of the story (e.g. visualising a timeline of the big events and actors in a political or historical event, or summarising a movie or someone's life).

We are going to generate a gossip story using the T5 language model (LM). To do that, we are going to:
* pre-process the WebNLG dataset - Part 1
* fine-tune the T5 language model on the WebNLG dataset - Part 2
* automatically generate stories (template + automatically generated text) - Part 3

The WebNLG data (Gardent et al., 2017) was created to promote the development (i) of RDF verbalisers and (ii) of microplanners able to handle a wide range of linguistic constructions.

T5 language model: Colin Raffel et al. “Exploring the limits of transfer learning with a unified text-to-text transformer”. In: arXiv preprint arXiv:1910.10683 (2019).

The code in this notebook is adapted from https://github.com/MathewAlexander/T5_nlg

In [None]:
!pip install transformers
!pip install sentencepiece

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Fo

In [None]:
### import the required libraries ###

import pandas as pd
import numpy as np
from google.colab import files
from google.colab import drive
import os
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
import time
import warnings
warnings.filterwarnings('ignore')

In [None]:
MOUNTPOINT = '/content/gdrive'
DATADIR = os.path.join(MOUNTPOINT, 'My Drive', 'KRW_P5-2')
drive.mount(MOUNTPOINT)
print(DATADIR)

Mounted at /content/gdrive
/content/gdrive/My Drive/KRW_P5-2
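
Before loading the model, it can help to confirm that the files this notebook reads are actually present in the mounted folder. The sketch below is a hypothetical helper (`missing_files` is not part of the original notebook); it only assumes that the expected file names are known in advance.

```python
import os

def missing_files(directory, expected_names):
    """Return the names from expected_names that are not regular files in directory."""
    return [name for name in expected_names
            if not os.path.isfile(os.path.join(directory, name))]
```

For example, `missing_files(DATADIR, ['t5-base-config.json', 'data4stories.tsv'])` should return an empty list if both files used in the cells below are in place.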


In [None]:
#Loading the trained model from the path

model = T5ForConditionalGeneration.from_pretrained('/content/gdrive/My Drive/KRW_P5-2/pytoch_model.bin', 
                                                   return_dict=True,
                                                   config='/content/gdrive/My Drive/KRW_P5-2/t5-base-config.json')

In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')


In [None]:
#function to generate text

def generate(text):
  model.eval()  # inference mode (disables dropout)
  # Prompt: task prefix + triple text + end-of-sequence marker
  input_ids = tokenizer.encode("WebNLG:{} </s>".format(text), return_tensors="pt")  # Batch size 1
  outputs = model.generate(input_ids)
  # Decode the single generated sequence and drop the special tokens
  gen_text = tokenizer.decode(outputs[0]).replace('<pad>', '').replace('</s>', '')
  return gen_text
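
The prompt passed to the tokenizer is the triple text prefixed with `WebNLG:` and followed by `</s>`. A minimal sketch of that formatting step on its own, so the prompt can be inspected without loading the model; `build_prompt` is a hypothetical helper and the triple is made up for illustration:

```python
def build_prompt(triple_text):
    """Format a triple string the same way generate() does before tokenising."""
    return "WebNLG:{} </s>".format(triple_text)

# Made-up triple in "subject | predicate | object" form
print(build_prompt("Jane Doe | occupation | Singer"))
# WebNLG:Jane Doe | occupation | Singer </s>
```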

#Gossip Time: Let's generate a story!

A gossip story consists of four sections:
1. Introduction: Here we introduce the celebrities. 
2. Body paragraph 1: Information about the first celebrity.
3. Body paragraph 2: Information about the second celebrity.
4. Conclusion: concluding with a question.
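
The four-part structure above can be sketched as a function that returns one string per section. This is only an illustration: `outline_story` is a hypothetical helper, and its wording is placeholder text rather than the template used by `write_a_story` below.

```python
def outline_story(ent1, ent2, ent1_sentences, ent2_sentences):
    """Return the four story sections as a list of strings (illustrative only)."""
    intro = f"Rumour has it that {ent1} and {ent2} have been in touch."
    body1 = " ".join(ent1_sentences)   # information about the first celebrity
    body2 = " ".join(ent2_sentences)   # information about the second celebrity
    conclusion = f"Are the rumours about {ent1} and {ent2} true?"
    return [intro, body1, body2, conclusion]
```

Keeping the sections separate like this makes it easy to check each part before it is written to a file.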

###Let's write our functions:

In [None]:
#some text processing
def remove_unwanted_chars(text):
  # Strip list brackets, parentheses and single quotes from the text
  remove_list = ["[", "]", ")", "(", "'"]
  for char in remove_list:
    text = text.replace(char, "")
  return text
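
The same character removal can also be expressed with a translation table, which deletes all listed characters in a single pass. This is an alternative sketch, not the notebook's own function; `strip_unwanted` and `_DELETE_TABLE` are hypothetical names.

```python
# Characters to delete: list brackets, parentheses and single quotes
_DELETE_TABLE = str.maketrans("", "", "[]()'")

def strip_unwanted(text):
    """Remove the unwanted characters from text in a single pass."""
    return text.translate(_DELETE_TABLE)

print(strip_unwanted("['A (b) | c | d']"))
# A b | c | d
```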

In [None]:
def get_entity_names(ent1, ent2):
    ent1 = ent1.replace("_", " ")
    ent2 = ent2.replace("_", " ")
    return ent1, ent2

In [None]:
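import ast

def prepare_triples(triples):
    # Each cell holds the string form of a Python list of triples, e.g.
    # "['A | p | B', 'A | q | C']". Parsing it with ast.literal_eval keeps
    # commas inside a value (such as "City, State") intact, whereas
    # splitting on "," would break such a triple into fragments.
    triple_list = ast.literal_eval(triples)
    prepared_triples = []
    for triple in triple_list:
        prepared_triples.append(remove_unwanted_chars(triple).strip())
    return prepared_triples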
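
Object values in a triple can themselves contain commas, so how the stored list string is split matters. The sketch below compares a plain comma split with `ast.literal_eval` on a made-up cell value. It assumes the triple columns hold the string form of a Python list, as the preview of the data further down suggests.

```python
import ast

# Made-up cell value in the same style as the triple columns
cell = "['Jane Doe | birth place | Springfield, Illinois', 'Jane Doe | occupation | Singer']"

by_comma = cell.split(",")          # also splits inside "Springfield, Illinois"
by_literal = ast.literal_eval(cell)  # parses the list, one item per triple

print(len(by_comma))    # 3
print(len(by_literal))  # 2
```

With the comma split, the fragment `Illinois` would be sent to the model as if it were a triple of its own.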


In [None]:
def write_a_story(filename, ent1, ent2, ent1_text, ent2_text):

    with open(filename, "w") as f:
        f.write(f"According to our birds, there might be something going on between {ent1} and {ent2}. \nSince they have recently been seen in touch quite a lot.")
        f.write(f"\nLet us give you some background about {ent1} and {ent2}:")
        f.write("\n ")

        f.write(f"\nLadies first! Let's start with {ent1}:")
        for sent1 in ent1_text:
            f.write(f"\n{sent1}")
        f.write("\n ")
        
        f.write(f"\n{ent2}:")
        for sent2 in ent2_text:
            f.write(f"\n{sent2}")
        f.write("\n ")

        f.write(f"\nSo, what do you think? Are the rumours true about {ent1} and {ent2}?")

###Read the data and start generating:

In [None]:
KB_df = pd.read_csv('/content/gdrive/My Drive/KRW_P5-2/data4stories.tsv', index_col=[0], sep="\t")
KB_df = KB_df.fillna(value=np.nan)
# Let's inspect the dataset:
KB_df.head(10)

              entity 1          entity 2  connected  \
1           Emma_Stone       Ram_Charan       False   
2       Britney_Spears     Ranbir_Kapoor      False   
3     Jennifer_Aniston     Harry_Knowles      False   
4        Vicky_Kaushal       Phil_McGraw      False   
5           Alia_Bhatt     Justin_Bieber      False   
6   Scarlett_Johansson     Harry_Knowles       True   
7    Jennifer_Lawrence     Justin_Bieber       True   
8           Alia_Bhatt      Pawan_Kalyan       True   
9     Jennifer_Aniston  Robert_Pattinson       True   
10     Keira_Knightley    Bradley_Cooper       True   

                                      Triple Entity 1  \
1   ['Emma Stone | birth place | Scottsdale, Arizo...   
2   ['Britney Spears | has occupation | Singer', '...   
3   ['Jennifer Aniston | had partner | Justin Ther...   
4   ['Vicky Kaushal | birth date | 1988-05-16', 'V...   
5   ['Alia Bhatt | birth date | 1993-03-15', 'Alia...   
6   ['Scarlett Johanss

In [None]:
entity1_list = KB_df["entity 1"].tolist()
entity2_list = KB_df["entity 2"].tolist()
entity1_triples = KB_df["Triple Entity 1"].tolist()
entity2_triples = KB_df["Triple Entity 2"].tolist()

In [None]:
for ent1, ent2, triples1, triples2 in zip(entity1_list, entity2_list, entity1_triples, entity2_triples):
  ent1_name, ent2_name = get_entity_names(ent1, ent2)

  # Verbalise every triple of each entity with the fine-tuned model
  gen_text_list1 = [generate(triple) for triple in prepare_triples(triples1)]
  gen_text_list2 = [generate(triple) for triple in prepare_triples(triples2)]

  # Write the story to a text file and download it from Colab
  filename = f"gossip_about_{ent1}_{ent2}.txt"
  write_a_story(filename, ent1_name, ent2_name, gen_text_list1, gen_text_list2)
  files.download(filename)
