# Knowledge Graphs and Semantic Technologies -- Information Extraction


## Setup

The code in this cell prepares the files and libraries needed. You can run it without expanding its contents (but of course you can peek into it if you're curious!)

In [None]:
%%bash
# Transformers installation
pip install transformers
# To install from source instead of the last release, comment the command above and uncomment the following one.
#pip install git+https://github.com/huggingface/transformers.git
pip install Wikipedia-API
pip install pyspotlight


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 85.5 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 112.1 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 KB 26.6 MB/s eta 0:00:00
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Wikipedia-API
  Downloading Wikipedia_API-0.5.8-py3-none-any.whl

In [None]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import os, random, json, logging, csv
import wikipediaapi
import spotlight

import torch

# Entity linking methods

We first look at entity linking on natural language text using a popular online tool: DBpedia Spotlight.
It takes sentences as input and returns entity URIs from DBpedia, a knowledge graph built from Wikipedia.


## DBpedia Spotlight 🔦

[DBpedia Spotlight](https://www.dbpedia-spotlight.org/) is a tool for annotating text with metadata about entities. It is based on a pipeline that performs named entity recognition, candidate generation, and entity linking.

The following code defines a function that takes as input some text and returns the annotated response from the Spotlight API.

In [None]:


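def spotlight_link(sentence):
    """Annotate a sentence with DBpedia Spotlight and return the HTML body as text."""
    parameters = {'text': sentence}
    r = requests.get("https://api.dbpedia-spotlight.org/en/annotate", params=parameters)
    r.raise_for_status()  # fail early on HTTP errors
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(r.text, 'html.parser')
    body = soup.find('body')
    if body is None:  # fall back to the whole document if there is no <body>
        body = soup
    return body.prettify()

# print (rather than pprint) shows the HTML as-is instead of as a quoted string
print(spotlight_link('The president of the United States visited Vietnam.'))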

<body>
<div>
The president of the <a href="http://dbpedia.org/resource/United_States" target="_blank" title="http://dbpedia.org/resource/United_States">United States</a> visited <a href="http://dbpedia.org/resource/Vietnam" target="_blank" title="http://dbpedia.org/resource/Vietnam">Vietnam</a>.
</div>
</body>
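
The annotated HTML is easier to work with once the links are pulled out as pairs. The following is a minimal sketch, not part of the original notebook: it uses Python's standard `html.parser` and a shortened copy of the response shown above (the `target` and `title` attributes are omitted). The class name `EntityLinkCollector` is a hypothetical helper.

```python
from html.parser import HTMLParser


class EntityLinkCollector(HTMLParser):
    """Collect (surface form, URI) pairs from Spotlight's annotated HTML."""

    def __init__(self):
        super().__init__()
        self.entities = []   # finished (surface form, URI) pairs
        self._href = None    # URI of the <a> tag that is currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.entities.append(("".join(self._text).strip(), self._href))
            self._href = None


# Shortened copy of the annotated response from the cell above
annotated = (
    '<body><div>The president of the '
    '<a href="http://dbpedia.org/resource/United_States">United States</a> visited '
    '<a href="http://dbpedia.org/resource/Vietnam">Vietnam</a>.</div></body>'
)

collector = EntityLinkCollector()
collector.feed(annotated)
for surface_form, uri in collector.entities:
    print(surface_form, "->", uri)
```

Each pair keeps the text span together with its DBpedia URI, which is the form needed for later steps such as fetching abstracts.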


## Prompt-based Relation Extraction

Instead of fine-tuning a relation extraction model, which often takes several GPU hours/days for training, existing pre-trained language models can be directly used for relation extraction.

Here, we only demonstrate a very basic approach for prompt-based relation extraction:

We prompt the language model with the input sentence that we want to extract the triple from and the subject/object entities. The goal of the model is to find a word which best fits between the subject and the object entity. 

#### Problems: 🙅

We still need to map back from the predicted word to the relation in the knowledge graph.

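*Try out the code from the cell below.*

In the recorded run below, the model only echoes the prompt: the prompt is 26 tokens long, but the default `max_length` is 21, so no new tokens are generated. Passing `max_new_tokens` to the `generator(...)` call (for example `max_new_tokens=20`) gives the model room to answer; a good answer would contain a word such as *directed*.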


In [None]:
generator = pipeline(model='facebook/opt-1.3b')

input_sentence = "Inception is a 2010 science fiction action film directed by Christopher Nolan."
subj = "Inception"
obj = "Christopher Nolan"
generator(f"{input_sentence} What is the relation between {subj} and {obj}?")

Input length of input_ids is 26, but `max_length` is set to 21. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.


[{'generated_text': 'Inception is a 2010 science fiction action film directed by Christopher Nolan. What is the relation between Inception and Christopher Nolan?\n'}]
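
The mapping problem noted above (from a predicted word back to a relation in the knowledge graph) could be approached with a small lookup table. The sketch below is an illustration only: `WORD_TO_PROPERTY` and `map_to_relation` are hypothetical names, the table is hand-written and far from complete, and the model answer in the usage example is made up rather than real output. The property IDs are Wikidata's (P57 director, P50 author, P175 performer).

```python
# Hypothetical lookup table: answer words -> Wikidata property IDs.
# Illustrative entries only, not an exhaustive or official mapping.
WORD_TO_PROPERTY = {
    "directed": "P57",    # director
    "director": "P57",
    "written": "P50",     # author
    "author": "P50",
    "performed": "P175",  # performer
}


def map_to_relation(generated_text, prompt):
    """Return the first known property ID mentioned in the model's answer."""
    # Drop the echoed prompt so that words from the input sentence are not counted
    if generated_text.startswith(prompt):
        answer = generated_text[len(prompt):]
    else:
        answer = generated_text
    for word in answer.lower().split():
        word = word.strip(".,;:!?\"'()")
        if word in WORD_TO_PROPERTY:
            return WORD_TO_PROPERTY[word]
    return None


prompt = (
    "Inception is a 2010 science fiction action film directed by Christopher Nolan. "
    "What is the relation between Inception and Christopher Nolan?"
)
# Made-up answer for illustration; a real run would use the generator's output
example_output = prompt + " Inception was directed by Christopher Nolan."
print(map_to_relation(example_output, prompt))
```

Stripping the prompt first matters here because the input sentence already contains the word *directed*; without that step, an empty answer would still appear to yield a relation.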

## End-to-End Information Extraction

In this part, we will explore end-to-end information extraction with the widely used NLP library 🤗 Transformers.
Instead of performing entity recognition, entity linking, and relation extraction separately, recent Transformer models can be used to perform all tasks in a single step.


### Extract Information with the REBEL Model

Instead of training our own machine learning model here, we download a model that was trained on a distantly supervised dataset built from Wikipedia and Wikidata. The model we are working with is a generative language model fine-tuned for information extraction: the Transformer model 🤖 [REBEL](https://huggingface.co/Babelscape/rebel-large).

The Hugging Face library offers many pre-trained models that can be easily downloaded and used for a variety of NLP tasks.

#### Helper methods

In [None]:
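def call_wiki_api(item):
  """Return the Wikidata ID of the first search hit for `item`, or `item` itself on failure."""
  try:
    url = "https://www.wikidata.org/w/api.php"
    # Let requests URL-encode the search term (spaces, accents, '&', ...)
    params = {"action": "wbsearchentities", "search": item, "language": "en", "format": "json"}
    data = requests.get(url, params=params).json()
    # Return the first id (Could upgrade this in the future)
    return data['search'][0]['id']
  except (requests.RequestException, ValueError, KeyError, IndexError):
    # Network error, invalid JSON, or no search hit: fall back to the raw label
    return item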


def write2csv(file_name, triples):
  with open(file_name, "w", newline="") as f:
      writer = csv.writer(f)
      writer.writerows(triples)

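def annotate(text):
  """Return the DBpedia URIs that Spotlight finds in `text` (empty list if none)."""
  try:
    spotlight_results = spotlight.annotate('https://api.dbpedia-spotlight.org/en/annotate', text)
    return [r['URI'] for r in spotlight_results]
  except Exception:
    # pyspotlight raises when nothing is found; HTTP errors end up here as well
    print(f'No entity found for {text}')
    return []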

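# Return the Wikipedia abstract for a DBpedia resource URL
# Assumption: written for Wikipedia-API 0.5.x; releases from 0.6.0 on expect a user-agent string as the first argument
def get_wikipedia_abstract(url):
    wiki_wiki = wikipediaapi.Wikipedia('en')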
    page_name = url.replace('http://dbpedia.org/resource/','')
    page_py = wiki_wiki.page(page_name)
    return page_py.summary

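def extract_triplets(text):
    """Parse REBEL output (<triplet> subject <subj> object <obj> relation) into [subject, relation, object] lists."""
    triplets = []
    subject, relation, object_ = '', '', ''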
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append([subject.strip(), relation.strip(), object_.strip()])
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append([subject.strip(), relation.strip(), object_.strip()])
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append([subject.strip(), relation.strip(), object_.strip()])
    return triplets
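
To make the linearized format that `extract_triplets` consumes more concrete, here is a compact alternative parser applied to a hand-written string. This is a sketch under assumptions: `parse_linearized` is a hypothetical name, and the sample string was written by hand to follow the `<triplet> subject <subj> object <obj> relation` layout rather than copied from a model run.

```python
import re


def parse_linearized(text):
    """Parse '<triplet> head <subj> tail <obj> relation ...' into [head, relation, tail] lists."""
    # Remove sequence markers; the pattern does not touch <subj>, <obj>, or <triplet>
    text = re.sub(r"</?s>|<pad>", "", text)
    triples = []
    # Every block after a <triplet> marker shares one head entity
    for block in text.split("<triplet>")[1:]:
        head, *pairs = block.split("<subj>")
        for pair in pairs:
            if "<obj>" not in pair:
                continue  # incomplete pair, e.g. generation was cut off
            tail, relation = pair.split("<obj>", 1)
            triples.append([head.strip(), relation.strip(), tail.strip()])
    return triples


# Hand-written sample in the linearized format (not actual model output)
sample = (
    "<s><triplet> Punta Cana <subj> La Altagracia Province <obj> "
    "located in the administrative territorial entity "
    "<subj> Dominican Republic <obj> country</s>"
)
for triple in parse_linearized(sample):
    print(triple)
```

Note that one `<triplet>` block can carry several `<subj> ... <obj> ...` pairs, all of which reuse the same head entity.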

#### Extraction

This code extracts triples and writes them to a .csv file.

In [None]:
# Text to extract triplets from
text = 'Punta Cana is a resort town in the municipality of Higüey, in La Altagracia Province, the easternmost province of the Dominican Republic.'


# Use GPU if available
device = "cuda:0" if torch.cuda.is_available() else "cpu"



#Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large").to(device)
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 1,
}



# Tokenize text
model_inputs = tokenizer(text, max_length=256, padding=True, truncation=True, return_tensors='pt').to(device)

# Generate
generated_tokens = model.generate(
    model_inputs["input_ids"].to(device),
    attention_mask=model_inputs["attention_mask"].to(device),
    **gen_kwargs,
)

# Extract text
decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

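# Extract triplets
all_triples = []
for idx, sentence in enumerate(decoded_preds):
  print(f'Prediction triplets sentence {idx}')
  triples = extract_triplets(sentence)
  print(triples)
  all_triples.extend(triples)

# Write once after the loop: write2csv opens the file in "w" mode,
# so writing inside the loop would keep only the last sentence's triples
write2csv('output.csv', all_triples)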

Prediction triplets sentence 0
[['Punta Cana', 'located in the administrative territorial entity', 'La Altagracia Province'], ['Punta Cana', 'country', 'Dominican Republic'], ['Higüey', 'located in the administrative territorial entity', 'La Altagracia Province'], ['Higüey', 'country', 'Dominican Republic'], ['La Altagracia Province', 'country', 'Dominican Republic'], ['Dominican Republic', 'contains administrative territorial entity', 'La Altagracia Province']]
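
When several texts are processed, the same triple often shows up more than once. One possible way to handle this before writing the .csv file is sketched below; `dedupe_triples` and `write_triples` are hypothetical helpers, the header row is an assumption, and the input triples are taken from the output above. The sketch writes to an in-memory buffer so that it runs without touching the file system.

```python
import csv
import io


def dedupe_triples(batches):
    """Flatten several lists of triples, keeping only the first copy of each triple."""
    seen, unique = set(), []
    for triples in batches:
        for subject, relation, object_ in triples:
            key = (subject, relation, object_)
            if key not in seen:
                seen.add(key)
                unique.append([subject, relation, object_])
    return unique


def write_triples(handle, triples):
    """Write a header row followed by one triple per row to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["subject", "relation", "object"])
    writer.writerows(triples)


# Two batches that overlap in one triple (taken from the output above)
batches = [
    [["Punta Cana", "country", "Dominican Republic"]],
    [["Punta Cana", "country", "Dominican Republic"],
     ["La Altagracia Province", "country", "Dominican Republic"]],
]

unique = dedupe_triples(batches)
buffer = io.StringIO()
write_triples(buffer, unique)
print(buffer.getvalue())
```

To write to disk instead, the same `write_triples` call can be given a handle from `open('output.csv', 'w', newline='')`.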


### Hands-On 💻

So far, we have seen how to use DBpedia Spotlight for entity linking, prompt-based relation extraction, and REBEL as an end-to-end model for relation extraction from text.

We will now combine this with the previous hands-on exercises by extracting additional triples from Wikipedia to enrich your ontology:
1. Link artist names to their DBpedia (and thus Wikipedia) entries with DBpedia Spotlight.
2. Get the Wikipedia abstracts for the artists.
3. Perform relation extraction with REBEL on these abstracts and save the triples to a .csv file.
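
The three steps above can be chained in a single loop. The following is only a sketch of the control flow, assuming the three steps are passed in as functions: `enrich_from_artists` is a hypothetical name, and the `demo_*` stand-ins (including the placeholder triple) exist only so that the sketch runs offline. In the notebook, `annotate` and `get_wikipedia_abstract` would take the place of the first two, and a function wrapping the REBEL generation code would take the place of the third.

```python
def enrich_from_artists(artists, link_fn, abstract_fn, extract_fn):
    """Run link -> abstract -> extract for each artist and collect all triples."""
    all_triples = []
    for name in artists:
        uris = link_fn(name) or []        # step 1: entity linking
        if not uris:
            continue                      # skip names that could not be linked
        abstract = abstract_fn(uris[0])   # step 2: fetch the abstract
        if not abstract:
            continue                      # skip pages without a summary
        all_triples.extend(extract_fn(abstract))  # step 3: relation extraction
    return all_triples


# Offline stand-ins, for illustration only (not real API or model output)
FAKE_LINKS = {"Travis Scott": ["http://dbpedia.org/resource/Travis_Scott"]}


def demo_link(name):
    return FAKE_LINKS.get(name)


def demo_abstract(uri):
    return "Travis Scott is an American rapper."


def demo_extract(text):
    return [["Travis Scott", "occupation", "rapper"]]  # placeholder triple


triples = enrich_from_artists(
    ["Travis Scott", "Unknown Artist"], demo_link, demo_abstract, demo_extract
)
print(triples)
```

Passing the steps in as arguments keeps the loop independent of network access and model loading, so it can be tried out before the real functions are plugged in.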

Below is some starter Python code showing how to use the two APIs involved.



In [None]:
#Example code for usage of DBpedia Spotlight and for getting the respective Wikipedia Abstract.

#Read the artist names from .csv
artists = []
with open('artists.csv', newline='') as csvfile:
    artistreader = csv.reader(csvfile, delimiter='\t')
    #skip header
    next(artistreader)
    for row in artistreader:
      artists.append(row[1])


# Use the DBpedia Spotlight API to link an entity
urls = annotate('Travis Scott')
print(urls)
# Take the first linked URI and fetch its Wikipedia abstract
abstract = get_wikipedia_abstract(urls[0])
print(abstract)



['http://dbpedia.org/resource/Travis_Scott']
Jacques Bermon Webster II (born April 30, 1991), better known by his stage name Travis Scott (formerly stylized as Travi$ Scott), is an American rapper, singer, songwriter, and record producer. His stage name is the namesake of a favorite uncle combined with the first name of one of his inspirations, Kid Cudi (whose real name is Scott Mescudi).In 2012, Scott signed his first major-label contract with Epic Records and a publishing deal with Kanye West's GOOD Music. In April 2013, he signed a joint-recording contract with Epic and T.I.'s Grand Hustle imprint. Scott's first full-length project, the mixtape Owl Pharaoh, was self-released in 2013. It was followed with a second mixtape, Days Before Rodeo, in 2014. His debut studio album, Rodeo (2015), was led by the hit single "Antidote". His second album, Birds in the Trap Sing McKnight (2016), became his first number one album on the Billboard 200. The following year, Scott released a collaborat