# WebSem Project: Constructing and Querying a Knowledge Graph in the Cycling Domain

## Introduction

The goal of this project is to extract information from multilingual text documents about cycling and to build a knowledge graph (KG) from the extracted entities and relations. The KG is designed to be compatible with a cycling ontology, and SPARQL queries retrieve specific information from it. The project is implemented in a Jupyter notebook and follows these steps:

* Collect multilingual textual documents about cycling.
* Pre-process the documents to get clean text files.
* Run named entity recognition (NER) on the documents to extract named entities of the type Person, Organization and Location using spaCy and LLMs.
* Run co-reference resolution on the input text using Stanford CoreNLP.
* Disambiguate the entities with Wikidata using OpenTapioca and LLMs.
* Run relation extraction using Stanford OpenIE and LLMs.
* Implement mappings between the extracted entity types and relations and the cycling ontology you developed during Assignment 1, in order to create a knowledge graph of the domain represented in RDF.
* Load the data into the Corese engine, as you did for Assignment 2, with your cycling ontology and the knowledge graph built in the previous step, and write SPARQL queries to retrieve specific information from the KG.

### Useful resources
* The GitHub repository "Building knowledge graph from input data" at https://github.com/varun196/knowledge_graph_from_unstructured_text can be used as inspiration.

### References
* NLTK: https://www.nltk.org/
* spaCy: https://spacy.io/
* Stanford OpenIE: https://nlp.stanford.edu/software/openie.html
* OpenTapioca: https://opentapioca.org/
* Corese engine: https://project.inria.fr/corese/
* Wikidata: https://www.wikidata.org/

## Step 1: Collect multilingual textual documents about cycling
For this mini project, we will collect multilingual textual documents about cycling from various sources such as news articles, blog posts, and race reports. We will download the documents and save them in a directory called `cycling_docs`.

The documents to download are:

* English:
 - https://en.wikipedia.org/wiki/2022_Tour_de_France
 - https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11
 - https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21
 - https://www.bbc.com/sport/cycling/61940037
 - https://www.bbc.com/sport/cycling/62017114 (stage 1)
 - https://www.bbc.com/sport/cycling/62097721 (stage 7)
 - https://www.bbc.com/sport/cycling/62153759 (stage 11)
 - https://www.bbc.co.uk/sport/cycling/62285420 (stage 21)

* French:
 - https://fr.wikipedia.org/wiki/Tour_de_France_2022
 - https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-epoustouflant-jonas-vingegaard-remporte-la-11e-etape-et-s-empare-du-maillot-jaune-de-tadej-pogacar_5254102.html
 - https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-jonas-vingegaard-vainqueur-de-sa-premiere-grande-boucle-jasper-philipsen-s-offre-au-sprint-la-21e-etape_5275612.html

In [None]:
#
# Feel free to install more dependencies if needed!
#

# Install requests for querying HTTP endpoints
!pip install --quiet requests

# Install jusText for automatically extracting text from web pages
!pip install --quiet jusText

# Install nltk for text processing
!pip install --quiet nltk

# Install spaCy for NER extraction
!pip install --quiet spacy

# Install pycorenlp for Stanford CoreNLP
!pip install --quiet pycorenlp

# Install pandas for data visualization
!pip install --quiet pandas

# Install rdflib for writing RDF
!pip install --quiet rdflib

In [None]:
# Import necessary modules
import requests
import justext
import os
from urllib.parse import urlsplit


# Define a function to get filename from URL
def get_filename_from_url(url):
  urlpath = urlsplit(url).path
  return os.path.basename(urlpath)


# Define a function to download URLs and extract text
def download_urls(urls_list, language):
  # Loop over each URL in the list
  for url in urls_list:
    # Fetch and extract text from the URL using jusText
    response = requests.get(url)
    paragraphs = justext.justext(
      response.content,
      justext.get_stoplist(language.capitalize()),
      no_headings=True,
      max_heading_distance=150,
      length_low=70,
      length_high=140,
      stopwords_low=0.2,
      stopwords_high=0.3,
      max_link_density=0.4
    )
    extracted_text = '\n'.join(list(filter(None, map(
      lambda paragraph: paragraph.text if not paragraph.is_boilerplate else '',
      paragraphs
    ))))

    # Truncate text if it's too long
    extracted_text = extracted_text[0:10000]

    # Create the output directory if it does not exist
    output_dir = os.path.join('cycling_docs', language)
    os.makedirs(output_dir, exist_ok=True)

    # Save extracted text as a .txt file
    filename = get_filename_from_url(url)
    output_path = os.path.join(output_dir, f'{filename}.txt')
    with open(output_path, 'w') as f:
      f.write(extracted_text)

    print(f'Downloaded {url} into {output_path}')


# List of URLs to download
urls_list_english = [
  'https://en.wikipedia.org/wiki/2022_Tour_de_France',
  'https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11',
  'https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21',
  'https://www.bbc.com/sport/cycling/61940037',
  'https://www.bbc.com/sport/cycling/62017114',
  'https://www.bbc.com/sport/cycling/62097721',
  'https://www.bbc.com/sport/cycling/62153759',
  'https://www.bbc.co.uk/sport/cycling/62285420',
]
urls_list_french = [
  'https://fr.wikipedia.org/wiki/Tour_de_France_2022',
  'https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-epoustouflant-jonas-vingegaard-remporte-la-11e-etape-et-s-empare-du-maillot-jaune-de-tadej-pogacar_5254102.html',
  'https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-jonas-vingegaard-vainqueur-de-sa-premiere-grande-boucle-jasper-philipsen-s-offre-au-sprint-la-21e-etape_5275612.html',
]

# Download the listed URLs
download_urls(urls_list_english, 'english')
download_urls(urls_list_french, 'french')
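
Each output file is named after the last path segment of its URL, so it can help to check those names before downloading. The sketch below repeats `get_filename_from_url` so that it runs on its own; the `output_name` wrapper and the `example.org` URL are hypothetical and only illustrate how `download_urls` appends `.txt`.

```python
import os
from urllib.parse import urlsplit


def get_filename_from_url(url):
    # Same logic as the notebook helper: keep only the last path segment
    return os.path.basename(urlsplit(url).path)


def output_name(url):
    # Hypothetical wrapper mirroring how `download_urls` builds the file name
    return f'{get_filename_from_url(url)}.txt'


for url in [
    'https://en.wikipedia.org/wiki/2022_Tour_de_France',
    'https://www.bbc.com/sport/cycling/61940037',
    'https://example.org/news/stage-11.html',  # hypothetical URL
]:
    print(output_name(url))
```

BBC articles end up with purely numeric names, and a URL ending in `.html` keeps that suffix before `.txt`. Two URLs sharing the same last path segment would overwrite each other within a language folder.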

## Step 2: Pre-process the documents to get clean txt files
We will pre-process the documents by tokenizing the text, lowercasing it, and removing punctuation and stopwords with [nltk](https://www.nltk.org/). Python's [re](https://docs.python.org/3/library/re.html) module can handle any additional character-level cleanup. The results are meant to be saved in a `clean_docs` folder; the cells below keep the cleaned text on each `Document` object.

In [None]:
"""
Document class which holds all the necessary variables for the purpose of this
project.
"""
class Document:
  def __init__(self, text, language = None, raw_text = None, filepath = None):
    self.filepath = filepath    # Path to the document file
    self.language = language    # Language of the document
    self.raw_text = raw_text    # Original text before cleaning
    self.cleaned_text = text    # Text after cleaning (Step 2)
    self.spacy_entities = []    # List of spaCy entities (Step 3a)
    self.llm_entities = []      # List of LLM entities (Step 3b)
    self.resolved_text = None   # Text after resolving co-references (Step 4)
    self.coreferences = None    # CoreNLP coreferences object (Step 4)
    self.wiki_entities = {}     # Dictionary of Wikidata entities extracted with OpenTapioca (Step 5a)
    self.llm_wiki_entities = {} # Dictionary of Wikidata entities extracted with LLMs (Step 5b)
    self.relations = []         # List of OpenIE relations (Step 6a)
    self.llm_relations = []     # List of LLM relations (Step 6b)

In [None]:
# 📝 TODO: Import the necessary libraries for natural language processing
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

nltk.download('punkt_tab')
nltk.download('stopwords')

def clean_text(dirty_text, language):
  # 📝 TODO: Define a function to clean text (words tokenization, stopwords
  #          removal, ...).
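  # Tokenize the text into words, using the tokenizer for the given language
  tokens = word_tokenize(dirty_text, language=language)

  # Lowercase the tokens and drop those that are not purely alphanumeric
  # (this removes punctuation, but also hyphenated words such as "time-trial")
  tokens = [word.lower() for word in tokens if word.isalnum()]

  # Remove stopwords
  stop_words = set(stopwords.words(language))
  cleaned_tokens = [word for word in tokens if word not in stop_words]

  # Return the cleaned text as a single string
  return ' '.join(cleaned_tokens)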

In [None]:
# Define a function to process a file and write the result to a new file
def process_file(file, language):
  # Open the file in read-only mode and read its whole content.
  # (Joining the result of `readlines()` with '\n' would double every line
  # break, because each line already ends with one.)
  with open(file, 'r') as f:
    raw_text = f.read()

  # Clean the text using the `clean_text` function
  cleaned_text = clean_text(raw_text, language)

  # Create a new document and return it
  doc = Document(cleaned_text, language=language, raw_text=raw_text, filepath=os.path.abspath(file))
  return doc


# Create a list to store all our documents
docs = []

# Loop through all the files in the "cycling_docs" folder
folder = 'cycling_docs'
for language in os.listdir(folder):
  for filename in os.listdir(os.path.join(folder, language)):
    # Construct the full path to the file
    file = os.path.join(folder, language, filename)

    # Check if the file is a regular file and has a .txt extension
    if os.path.isfile(file) and file.endswith('.txt'):
      # Process the file and append the new Document to our list
      doc = process_file(file, language)
      docs.append(doc)

In [None]:
# Display the text of the first document
display(docs[0].cleaned_text)
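
The introduction to this step mentions `re` and a `clean_docs` folder, which the cells above do not use. A minimal sketch of both is given here, under stated assumptions: `normalize_text` and `save_clean_text` are hypothetical helpers, and stripping bracketed citation markers such as `[12]` is just one example of regex cleanup that Wikipedia text might need.

```python
import os
import re


def normalize_text(text):
    # Drop bracketed citation markers like "[12]", then collapse whitespace
    text = re.sub(r'\[\d+\]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


def save_clean_text(cleaned_text, language, filename, root='clean_docs'):
    # Accept either a string or a list of tokens
    if isinstance(cleaned_text, (list, tuple)):
        cleaned_text = ' '.join(cleaned_text)

    # Mirror the per-language layout used for `cycling_docs`
    output_dir = os.path.join(root, language)
    os.makedirs(output_dir, exist_ok=True)

    output_path = os.path.join(output_dir, filename)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(cleaned_text)
    return output_path


print(normalize_text('The race[3]   started\nin Copenhagen.'))
```

With the notebook's objects, a call could look like `save_clean_text(doc.cleaned_text, doc.language, os.path.basename(doc.filepath))` for each `doc` in `docs`.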

## Step 3: Run named entity recognition (NER) on the documents

The goal of this step is to extract named entities from the text of our documents. We will use two methods: spaCy and LLMs.

### Step 3a: Using spaCy

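We will use [spaCy](https://spacy.io)'s pre-trained models to perform NER on the documents. The entities of interest are persons, organizations, and locations (PER/ORG/LOC); the code below keeps every entity spaCy returns, together with its label, and stores the list on each `Document`.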

**⚠️ Important Note:** We must use the raw text files (before the cleaning step in Step 2) for NER to ensure we do not lose context or critical information needed for accurate entity recognition.

In [None]:
# 📝 TODO: Import spaCy and other libraries that might be required for entity
#          extraction

import spacy

def extract_entities(text, language):
  # 📝 TODO: Use spaCy to extract named entities and store them into a list.
  # The format of the end result should look like this:
  # ```
  # entities = [
  #   { "text": "Tour de France", "label": "ORG" },
  #   { "text": "Peter Sagan", "label": "PERSON" },
  # ]
  # ```

  # Map the language to the matching spaCy model
  if language.lower() == "english":
      model = "en_core_web_sm"
  elif language.lower() == "french":
      model = "fr_core_news_sm"
  else:
      raise ValueError(f"Unsupported language: {language}")

  # Load the SpaCy model
  nlp = spacy.load(model)

  # Process the text with spaCy
  doc = nlp(text)

  # Extract named entities
  entities = []
  for ent in doc.ents:
    entities.append({"text": ent.text, "label": ent.label_})

  return entities

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

In [None]:
# Extract entities for each document
for doc in docs:
  doc.spacy_entities = extract_entities(doc.raw_text, doc.language)

Display entities which have been extracted:

In [None]:
# 📝 TODO: Display the extracted entities for the first document
# Check if there are documents in the list
if docs:
    # Get the first document
    first_doc = docs[0]

    # Print the file path and language of the document
    print(f"Document Path: {first_doc.filepath}")
    print(f"Language: {first_doc.language}")

    # Print the extracted entities
    print("\nExtracted Entities:")
    for entity in first_doc.spacy_entities:
        print(f"Text: {entity['text']}, Label: {entity['label']}")
else:
    print("No documents available to display.")
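
`extract_entities` keeps every label spaCy produces, while this step targets persons, organizations, and locations. One possible way to narrow the list is sketched below. It assumes that spaCy's English models emit `PERSON`, `ORG`, `GPE`, and `LOC` while the French models emit `PER`, `ORG`, and `LOC`; folding `GPE` into `LOC` is a simplification made here, and `LABEL_MAP` and `filter_entities` are hypothetical names.

```python
# Assumed mapping from model-specific labels to the three target types
LABEL_MAP = {
    'PERSON': 'PER',  # English models
    'PER': 'PER',     # French models
    'ORG': 'ORG',
    'GPE': 'LOC',     # countries and cities, folded into locations here
    'LOC': 'LOC',
}


def filter_entities(entities):
    kept = []
    seen = set()
    for entity in entities:
        label = LABEL_MAP.get(entity['label'])
        if label is None:
            continue  # skip dates, numbers, and other label types
        key = (entity['text'], label)
        if key in seen:
            continue  # skip repeated mentions of the same entity
        seen.add(key)
        kept.append({'text': entity['text'], 'label': label})
    return kept


# Illustrative input in the same shape as `doc.spacy_entities`
sample = [
    {'text': 'Jonas Vingegaard', 'label': 'PERSON'},
    {'text': 'Copenhagen', 'label': 'GPE'},
    {'text': '2022', 'label': 'DATE'},
    {'text': 'Jonas Vingegaard', 'label': 'PERSON'},
]
print(filter_entities(sample))
```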


### Step 3b: Using LLMs

**⚠️ Important Note:** We must use the raw text files (before the cleaning step in Step 2) for NER to ensure we do not lose context or critical information needed for accurate entity recognition.

First we create a function `llm_generate` which calls an [Ollama](https://ollama.com/) server hosted at EURECOM. For the purpose of this exercise, you are limited to using a specific model (`mistral-nemo:12b-instruct-2407-fp16`). The full documentation for the API is available at: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion

In [None]:
import json
import requests

# Define a function to call the Ollama WebSem endpoint using a given payload and return the response.
def llm_generate(payload):
  # Define the API endpoint and payload
  url = "https://websem:eurecom@ollama-websem.tools.eurecom.fr/api/generate"
  payload["model"] = "mistral-nemo:12b-instruct-2407-fp16"
  payload["stream"] = False  # Must be a JSON boolean, not the string "false"

  # Define the headers
  headers = {
    "Content-Type": "application/json"
  }

  # Send the POST request
  response = requests.post(url, headers=headers, data=json.dumps(payload))

  # Check if the request was successful
  if response.status_code == 200:
    # Parse the JSON response
    response_data = response.json()

    # Return the parsed response
    return response_data
  else:
    # Handle errors
    print(f"Error {response.status_code}: {response.text}")
    return None

Now let's learn how to use it:

In [None]:
# Example with basic prompt
payload = {
  "prompt": """
  Compose a haiku about web semantics, emphasizing the interconnectedness of
  data, the elegance of structured knowledge, and the power of understanding
  through linked information.
  """
}
llm_response = llm_generate(payload)
print(llm_response["response"])

You can also use structured outputs. More information is available at:
* https://ollama.com/blog/structured-outputs
* https://github.com/ollama/ollama/blob/main/docs/api.md#request-structured-outputs

In [None]:
# Example with structured outputs
payload = {
    "prompt": """
    Solve the following math problem and explain the steps clearly:

    Problem:
    What is 2 + 2?

    Provide the solution and explanation in the following structure:
    """,
    "format": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "solution": {"type": "integer"},
            "explanation": {"type": "string"}
        },
        "required": ["solution", "explanation"]
    }
}
llm_response = llm_generate(payload)
print(llm_response["response"])
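
Even with a `format` schema, the `response` field is a JSON-encoded string, so it has to be parsed before its fields can be used. The sketch below shows one defensive way to do that; `parse_structured_response` is a hypothetical helper, and the sample dictionary is hand-written for illustration rather than actual model output.

```python
import json


def parse_structured_response(llm_response, required_keys):
    # `llm_generate` returns None when the HTTP request fails
    if not llm_response or 'response' not in llm_response:
        return None
    try:
        data = json.loads(llm_response['response'])
    except json.JSONDecodeError:
        return None
    # Check that the model returned an object with every expected key
    if not isinstance(data, dict) or any(key not in data for key in required_keys):
        return None
    return data


# Hand-written sample in the shape returned by `llm_generate`
sample = {'response': '{"solution": 4, "explanation": "2 plus 2 equals 4."}'}
parsed = parse_structured_response(sample, ['solution', 'explanation'])
print(parsed['solution'])
```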

Now let's create a function which can extract entities using `llm_generate` with a custom prompt. There are many ways to achieve this. For example, you could try providing examples, and/or using a JSON schema.

In [None]:
def extract_entities_with_llm(text):
    # Define the payload for the LLM request
    payload = {
        "prompt": f"""
        Extract named entities from the following text and classify them into appropriate categories such as PERSON, ORG, EVENT, LOCATION, etc.:

        Text:
        {text}

        Provide the result in the following JSON structure:
        [
          {{ "text": "<entity_text>", "label": "<entity_label>" }},
          ...
        ]
        """,
        "format": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "label": {"type": "string"}
                },
                "required": ["text", "label"]
            }
        }
    }

    # Call the LLM generate function
    llm_response = llm_generate(payload)

    # Check if the response is valid
    if llm_response and "response" in llm_response:
        try:
            # Parse the JSON response
            entities = json.loads(llm_response["response"])
            return entities
        except json.JSONDecodeError:
            print("Error decoding JSON response.")
            return []
    else:
        print("Error generating response from LLM.")
        return []


In [None]:
# Extract entities for each document
for doc in docs:
  doc.llm_entities = extract_entities_with_llm(doc.raw_text)

Display entities which have been extracted:

In [None]:
# Display entities for the first document
display(docs[0].llm_entities)
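
The loop above sends each document to the model in a single prompt. If long documents turn out to give truncated or unreliable output, one option is to split the text into chunks, as Step 5 does with 4000-character slices, and merge the per-chunk results. The sketch below assumes that approach; `chunk_text` and `merge_entities` are hypothetical helpers, and the sample lists are illustrative.

```python
def chunk_text(text, size=4000):
    # Fixed-size character slices; an entity may be cut at a chunk boundary
    return [text[i:i + size] for i in range(0, len(text), size)]


def merge_entities(entity_lists):
    merged = []
    seen = set()
    for entities in entity_lists:
        for entity in entities:
            # Treat entities as duplicates regardless of case or padding
            key = (entity['text'].strip().lower(), entity['label'].strip().upper())
            if key not in seen:
                seen.add(key)
                merged.append(entity)
    return merged


# Illustrative per-chunk results in the shape returned by `extract_entities_with_llm`
chunk_results = [
    [{'text': 'Tadej Pogačar', 'label': 'PERSON'}],
    [{'text': 'tadej pogačar ', 'label': 'person'},
     {'text': 'UAE Team Emirates', 'label': 'ORG'}],
]
print(merge_entities(chunk_results))
```

In the notebook this could be applied as `merge_entities(extract_entities_with_llm(chunk) for chunk in chunk_text(doc.raw_text))`.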

## Step 4: Run co-reference resolution on the input text
We will use CoreNLP to perform [co-reference resolution](https://en.wikipedia.org/wiki/Coreference) on the input text.

For this project, we will use a hosted version of CoreNLP at: https://corenlp.tools.eurecom.fr/ (username: `websem`, password: `eurecom`). Feel free to try out the web interface before writing the code.

First, we compute the annotations and store them into the `coreferences` variable of our Document:

In [None]:
import json
from pycorenlp import StanfordCoreNLP


# Set up the CoreNLP client
nlp = StanfordCoreNLP('https://websem:eurecom@corenlp.tools.eurecom.fr')

# Define a function which computes coreferences for a given text and language
def compute_coreferences(text, language):
  props = {
    'timeout': 300000,
    'annotators': 'tokenize,ssplit,coref',
    'pipelineLanguage': language[:2],
    'outputFormat': 'json'
  }

  # Annotate the text for co-reference resolution
  corenlp_output = nlp.annotate(text, properties=props)
  try:
    corenlp_output = json.loads(corenlp_output)
  except Exception as err:
    print(f'Unexpected response: {corenlp_output}')
    raise

  return corenlp_output

In [None]:
# Test co-references computation
example = compute_coreferences("John is a software engineer. He is very talented. Sarah is a designer. She works with him.", language="en")

# Pretty-print them
print(json.dumps(example, indent=2))

In [None]:
# Compute co-references for all documents
for doc in docs:
  if doc.language == "english":  # CoreNLP co-reference resolution only supports English
    doc.coreferences = compute_coreferences(doc.raw_text, doc.language)

The first step is to display the co-reference of each mention in the text.

For example:

> "He" -> "John"
>
> "She" -> "Sarah"
>
> "him" -> "John"

In [None]:
for coref_cluster in example['corefs'].values():
  # 📝 TODO: Print each co-references like so: "He" -> "John"
  # 💡 Each cluster has one representative mention, flagged with `isRepresentativeMention: True`
  representative_mention = None
  for mention in coref_cluster:
      if mention['isRepresentativeMention']:
          representative_mention = mention['text']
          break

  if representative_mention:
      for mention in coref_cluster:
          if not mention['isRepresentativeMention']:
              print(f'"{mention["text"]}" -> "{representative_mention}"')

### 🏆 Challenge

Replace values within the text with their resolved co-reference. For example, with the following text:

> **John** is a software engineer. **He** is very talented.

In the second sentence, the pronoun "He" would be replaced with its co-reference, and the final text would become:

> **John** is a software engineer. **John** is very talented.

In [None]:
import re

# Define a function which resolves coreferences inside a document
def resolve_coreferences(corenlp_output):

  # 📝 TODO: Replace values within the text with their resolved co-reference.
  # 💡 You can start by printing the `corenlp_output` object to understand its
  #    structure.

  # Get the coreference clusters and the sentences
  coref_clusters = corenlp_output.get('corefs', {})
  sentences = corenlp_output.get('sentences', [])

  # Map each non-representative mention to its cluster's representative mention
  mention_to_representative = {}
  for coref_cluster in coref_clusters.values():
      representative_mention = None
      for mention in coref_cluster:
          if mention['isRepresentativeMention']:
              representative_mention = mention['text']
              break

      # Skip clusters without a representative mention
      if representative_mention is None:
          continue

      for mention in coref_cluster:
          if mention['text'] != representative_mention:
              mention_to_representative[mention['text']] = representative_mention

  # Build a single pattern matching any mention as a whole word (longest
  # first), so that "He" is not replaced inside "The" and replaced text is
  # never replaced a second time
  pattern = None
  if mention_to_representative:
      mentions = sorted(mention_to_representative, key=len, reverse=True)
      pattern = re.compile(r'(?<!\w)(' + '|'.join(map(re.escape, mentions)) + r')(?!\w)')

  # Replace mentions with their representatives, sentence by sentence
  resolved_text = []
  for sentence in sentences:
      sentence_text = ' '.join(token['originalText'] for token in sentence['tokens'])
      if pattern:
          sentence_text = pattern.sub(lambda m: mention_to_representative[m.group(1)], sentence_text)
      sentence_text = sentence_text.strip()

      # Ensure the sentence ends with a period
      if sentence_text and not sentence_text.endswith('.'):
          sentence_text += '.'

      resolved_text.append(sentence_text)

  # Join the sentences and remove the space left before each final period
  return ' '.join(resolved_text).replace(' .', '.')

In [None]:
# Test resolving co-references
original_text = "John is a software engineer. He is very talented. Sarah is a designer. She works with him."
corefs = compute_coreferences(original_text, language="en")
resolved_text = resolve_coreferences(corefs)
print(original_text)
print(resolved_text)
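
Replacing mentions by their text treats every occurrence of the same word alike, even when two occurrences belong to different clusters. An alternative is to use the token offsets that CoreNLP attaches to each mention. The sketch below assumes the JSON output carries `sentNum`, `startIndex`, and `endIndex` (1-based, end exclusive) on each mention; the `sample_output` dictionary is hand-written to mimic that structure and is not an actual server response, and `resolve_by_token_offsets` is a hypothetical name.

```python
def resolve_by_token_offsets(corenlp_output):
    # Rebuild each sentence as a list of its original tokens
    sentences = [
        [token['originalText'] for token in sentence['tokens']]
        for sentence in corenlp_output.get('sentences', [])
    ]

    # Collect (sentence index, start, end, replacement) for every
    # non-representative mention, converting to 0-based indices
    replacements = []
    for cluster in corenlp_output.get('corefs', {}).values():
        representative = next(
            (m['text'] for m in cluster if m['isRepresentativeMention']), None
        )
        if representative is None:
            continue
        for m in cluster:
            if not m['isRepresentativeMention']:
                replacements.append(
                    (m['sentNum'] - 1, m['startIndex'] - 1, m['endIndex'] - 1, representative)
                )

    # Apply replacements from right to left so earlier offsets stay valid
    for sent_idx, start, end, representative in sorted(replacements, reverse=True):
        sentences[sent_idx][start:end] = [representative]

    # Tokens are joined with spaces, so punctuation stays separated
    return ' '.join(' '.join(tokens) for tokens in sentences)


# Hand-written structure mimicking CoreNLP output for two sentences
sample_output = {
    'sentences': [
        {'tokens': [{'originalText': t} for t in ['John', 'is', 'a', 'software', 'engineer', '.']]},
        {'tokens': [{'originalText': t} for t in ['He', 'is', 'very', 'talented', '.']]},
    ],
    'corefs': {
        '1': [
            {'text': 'John', 'sentNum': 1, 'startIndex': 1, 'endIndex': 2,
             'isRepresentativeMention': True},
            {'text': 'He', 'sentNum': 2, 'startIndex': 1, 'endIndex': 2,
             'isRepresentativeMention': False},
        ],
    },
}
print(resolve_by_token_offsets(sample_output))
```

This sketch does not handle nested or overlapping mentions, which would need extra care before replacing.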

In [None]:
# Resolve co-references for all documents
for doc in docs:
  if doc.coreferences is not None:
    doc.resolved_text = resolve_coreferences(doc.coreferences)

In [None]:
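# 📝 TODO: Display text with resolved co-references for any document of your choice
# Pick the first document that has resolved text (only English documents do)
document_of_choice = next((doc for doc in docs if doc.resolved_text), None)

print("Resolved Text for the chosen document:")
print(document_of_choice.resolved_text if document_of_choice else "No resolved text available.")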

## Step 5: Disambiguate the entities

We will reuse the `llm_generate` function from Step 3, with a different prompt, to disambiguate the entities against Wikidata and DBpedia.

In [None]:
def disambiguate_with_llm(text):
    # Define the payload for the LLM request to disambiguate entities based on Wikidata
    payload_wikidata = {
        "prompt": f"""
        Disambiguate the following named entities based on Wikidata. For each entity, provide its correct Wikidata entity ID and label:

        Text:
        {text}

        Provide the result in the following JSON structure:
        [
          {{ "text": "<entity_text>", "wikidata_id": "<wikidata_entity_id>", "label": "<wikidata_label>" }},
          ...
        ]
        """,
        "format": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "wikidata_id": {"type": "string"},
                    "label": {"type": "string"}
                },
                "required": ["text", "wikidata_id", "label"]
            }
        }
    }

    # Call the LLM generate function for Wikidata disambiguation
    llm_response_wikidata = llm_generate(payload_wikidata)

    # Define the payload for the LLM request to disambiguate entities based on DBpedia
    payload_dbpedia = {
        "prompt": f"""
        Disambiguate the following named entities based on DBpedia. For each entity, provide its correct DBpedia resource URI and label:

        Text:
        {text}

        Provide the result in the following JSON structure:
        [
          {{ "text": "<entity_text>", "dbpedia_uri": "<dbpedia_resource_uri>", "label": "<dbpedia_label>" }},
          ...
        ]
        """,
        "format": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "dbpedia_uri": {"type": "string"},
                    "label": {"type": "string"}
                },
                "required": ["text", "dbpedia_uri", "label"]
            }
        }
    }

    # Call the LLM generate function for DBpedia disambiguation
    llm_response_dbpedia = llm_generate(payload_dbpedia)

    # Process the responses for Wikidata
    if llm_response_wikidata and "response" in llm_response_wikidata:
        try:
            # Parse the JSON response for Wikidata
            disambiguated_wikidata_entities = json.loads(llm_response_wikidata["response"])
        except json.JSONDecodeError:
            print("Error decoding Wikidata JSON response.")
            disambiguated_wikidata_entities = []
    else:
        print("Error generating response from LLM for Wikidata.")
        disambiguated_wikidata_entities = []

    # Process the responses for DBpedia
    if llm_response_dbpedia and "response" in llm_response_dbpedia:
        try:
            # Parse the JSON response for DBpedia
            disambiguated_dbpedia_entities = json.loads(llm_response_dbpedia["response"])
        except json.JSONDecodeError:
            print("Error decoding DBpedia JSON response.")
            disambiguated_dbpedia_entities = []
    else:
        print("Error generating response from LLM for DBpedia.")
        disambiguated_dbpedia_entities = []

    # Combine the results from Wikidata and DBpedia
    disambiguated_entities = {
        "wikidata": disambiguated_wikidata_entities,
        "dbpedia": disambiguated_dbpedia_entities
    }

    return disambiguated_entities

In [None]:
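for doc in docs:
  # Accumulate results over all chunks; merging the returned dictionaries with
  # `|=` would keep only the entities of the last chunk
  doc.wiki_entities = {"wikidata": [], "dbpedia": []}
  for j in range(0, len(doc.raw_text), 4000):
    chunk_entities = disambiguate_with_llm(doc.raw_text[j:j+4000])
    doc.wiki_entities["wikidata"].extend(chunk_entities["wikidata"])
    doc.wiki_entities["dbpedia"].extend(chunk_entities["dbpedia"])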
Display the entities disambiguated against Wikidata and DBpedia:

In [None]:
# 📝 TODO: Display extracted Wikidata entities for the first document
if docs:
    first_doc = docs[0]
    print("Extracted Wikidata entities for the first document:")
    print(first_doc.wiki_entities.get("wikidata", []))

In [None]:
# Populate `dbpedia_entities` from the DBpedia results that
# `disambiguate_with_llm` already returned, rather than querying the LLM again
for doc in docs:
    doc.dbpedia_entities = doc.wiki_entities.get("dbpedia", [])

# Display extracted DBpedia entities for the first document
if docs:
    first_doc = docs[0]
    print("Extracted DBpedia entities for the first document:")
    print(first_doc.dbpedia_entities)
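
Identifiers produced by an LLM are not guaranteed to exist, so it may be worth filtering them before they reach the graph. The sketch below only checks that a `wikidata_id` has the shape of a Wikidata item identifier and builds the matching entity URI; it does not verify that the identifier points to the right entity. `keep_valid_wikidata_entities` is a hypothetical helper and the sample entries are illustrative.

```python
import re

# Wikidata item identifiers are "Q" followed by a positive integer
QID_PATTERN = re.compile(r'Q[1-9]\d*')


def keep_valid_wikidata_entities(entities):
    valid = []
    for entity in entities:
        qid = str(entity.get('wikidata_id', '')).strip()
        if QID_PATTERN.fullmatch(qid):
            valid.append({
                **entity,
                'wikidata_id': qid,
                'uri': f'http://www.wikidata.org/entity/{qid}',
            })
    return valid


# Illustrative entries in the shape requested from the LLM
sample = [
    {'text': '2022 Tour de France', 'wikidata_id': 'Q98043180', 'label': '2022 Tour de France'},
    {'text': 'Copenhagen', 'wikidata_id': 'unknown', 'label': 'Copenhagen'},
    {'text': 'Paris', 'wikidata_id': '', 'label': 'Paris'},
]
for entity in keep_valid_wikidata_entities(sample):
    print(entity['text'], entity['uri'])
```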

## Step 6: Run relation extraction

### Step 6a: Using OpenIE

We will use [Stanford OpenIE](https://nlp.stanford.edu/software/openie.html) to extract the relations between the entities in the input text.

In [None]:
import json
from pycorenlp import StanfordCoreNLP

# Create a StanfordCoreNLP object
nlp = StanfordCoreNLP('https://websem:eurecom@corenlp.tools.eurecom.fr')

# Define a function to extract relations from input text using Stanford OpenIE
def extract_relations_with_openie(input_text, language):
  output = nlp.annotate(input_text, properties={
    'timeout': 300000,
    'annotators': 'tokenize,ssplit,openie',
    'outputFormat': 'json',
    'pipelineLanguage': language[:2]
  })
  try:
    output = json.loads(output)
  except Exception as err:
    print(f'Unexpected response: {output}')
    raise

  # 📝 TODO: Get relations from the `output` object (subject, relation, object)
  #    and append them to a `extracted_relations` list.
  # 💡 You can start by printing the `output` object to understand its structure.
  # Initialize the list to store extracted relations
  extracted_relations = []

  for sentence in output.get('sentences', []):
      for triple in sentence.get('openie', []):
          subject = triple.get('subject')
          relation = triple.get('relation')
          object_ = triple.get('object')

          if subject and relation and object_:
              extracted_relations.append({
                  'subject': subject,
                  'relation': relation,
                  'object': object_
              })

  # Return relations
  return extracted_relations

In [None]:
for doc in docs:
  if doc.language == "english":  # CoreNLP OpenIE only supports English
    doc.relations = extract_relations_with_openie(doc.raw_text, doc.language)

Display relations which have been extracted:

In [None]:
# 📝 TODO: Display extracted relations for the first document
if docs:
    first_doc = docs[0]
    print("Extracted relations for the first document:")
    print(first_doc.relations)
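
OpenIE output can contain many triples that do not involve any of the named entities found in Step 3. One possible filter, sketched below, keeps only the triples whose subject and object both mention a known entity. `filter_relations_by_entities` is a hypothetical helper, the substring match is a deliberate simplification, and the sample data reuses the John and Sarah example from Step 4.

```python
def filter_relations_by_entities(relations, entities):
    # Lowercased entity names used for a simple substring match
    names = {entity['text'].lower() for entity in entities}

    def mentions_entity(phrase):
        phrase = phrase.lower()
        return any(name in phrase for name in names)

    return [
        relation for relation in relations
        if mentions_entity(relation['subject']) and mentions_entity(relation['object'])
    ]


# Illustrative data in the shapes used by `doc.spacy_entities` and `doc.relations`
entities = [
    {'text': 'John', 'label': 'PERSON'},
    {'text': 'Sarah', 'label': 'PERSON'},
]
relations = [
    {'subject': 'Sarah', 'relation': 'works with', 'object': 'John'},
    {'subject': 'John', 'relation': 'is', 'object': 'very talented'},
]
print(filter_relations_by_entities(relations, entities))
```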

### Step 6b: Using LLMs

As an alternative to OpenIE, we will use LLMs to do the same task and compare the results.

In [None]:
def extract_relations_with_llm(text):
    # 📝 TODO: Create a prompt and query the LLM to get relations (subject, relation, object)
    #    from the text and append them to a `extracted_relations` list.
    payload = {
        "prompt": f"""
        Extract relations (subject, relation, object) from the following text:

        Text:
        {text}

        Provide the result in the following JSON structure:
        [
            {{ "subject": "<subject_text>", "relation": "<relation_text>", "object": "<object_text>" }},
            ...
        ]
        """,
        "format": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "subject": {"type": "string"},
                    "relation": {"type": "string"},
                    "object": {"type": "string"}
                },
                "required": ["subject", "relation", "object"]
            }
        }
    }

    # Call the LLM generate function
    llm_response = llm_generate(payload)

    # Check if the response is valid
    if llm_response and "response" in llm_response:
        try:
            # Parse the JSON response
            extracted_relations = json.loads(llm_response["response"])
            return extracted_relations
        except json.JSONDecodeError:
            print("Error decoding JSON response.")
            return []
    else:
        print("Error generating response from LLM.")
        return []

In [None]:
for doc in docs:
  doc.llm_relations = extract_relations_with_llm(doc.raw_text)

Display relations which have been extracted:

In [None]:
# Display the extracted relations for the first document
display(docs[0].llm_relations)
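
To compare the two approaches, the triples can be normalized and compared as sets. The sketch below is one simple way to do this, assuming exact matches after lowercasing and whitespace normalization; paraphrased relations such as "works with" and "collaborates with" will not match. `normalize_triple` and `compare_relations` are hypothetical helpers and the sample lists are illustrative.

```python
def normalize_triple(relation):
    # Lowercase each part and collapse repeated whitespace
    return tuple(
        ' '.join(relation[key].lower().split())
        for key in ('subject', 'relation', 'object')
    )


def compare_relations(openie_relations, llm_relations):
    openie_set = {normalize_triple(r) for r in openie_relations}
    llm_set = {normalize_triple(r) for r in llm_relations}
    return {
        'shared': sorted(openie_set & llm_set),
        'openie_only': sorted(openie_set - llm_set),
        'llm_only': sorted(llm_set - openie_set),
    }


# Illustrative data in the shapes of `doc.relations` and `doc.llm_relations`
openie_sample = [
    {'subject': 'Sarah', 'relation': 'works with', 'object': 'John'},
    {'subject': 'John', 'relation': 'is', 'object': 'software engineer'},
]
llm_sample = [
    {'subject': 'sarah', 'relation': 'Works  with', 'object': 'John'},
    {'subject': 'Sarah', 'relation': 'is', 'object': 'designer'},
]
comparison = compare_relations(openie_sample, llm_sample)
print(comparison['shared'])
```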

## Step 7: Implement some mappings between the entity types and relations returned with a given cycling ontology
We will implement mappings between the entity types and relations returned with the cycling ontology available at https://nextcloud.eurecom.fr/s/yKaMDEnRoSqjNAL.

In [None]:
import rdflib
from rdflib import Graph, URIRef, Literal, Namespace

g = Graph()

CYCLING = Namespace("http://example.org/cycling#")
WIKI = Namespace("http://example.org/wiki#")

g.bind("cycling", CYCLING)
g.bind("wiki", WIKI)

entities_en = [
    {"id": "cyclist1", "name": "Chris Froome", "type": "Cyclist"},
    {"id": "team1", "name": "Team Sky", "type": "Team"},
]

relations_en = [
    {"subject": "cyclist1", "predicate": "is_part_of", "object": "team1"},
]

wiki_entities_en = [
    {"entity": "Chris Froome", "wiki_url": "https://en.wikipedia.org/wiki/Chris_Froome"},
]

for entity in entities_en:
    entity_uri = URIRef(CYCLING[entity["id"]])
    g.add((entity_uri, rdflib.RDF.type, URIRef(CYCLING[entity["type"]])))
    g.add((entity_uri, CYCLING.name, Literal(entity["name"])))

    for wiki in wiki_entities_en:
        if wiki["entity"] == entity["name"]:
            g.add((entity_uri, CYCLING.hasWikiLink, URIRef(wiki["wiki_url"])))

for relation in relations_en:
    subject_uri = URIRef(CYCLING[relation["subject"]])
    predicate_uri = URIRef(CYCLING[relation["predicate"]])
    object_uri = URIRef(CYCLING[relation["object"]])

    g.add((subject_uri, predicate_uri, object_uri))

print(g.serialize(format="turtle"))

In [None]:
# Save the result into a file
g.serialize(destination='output.ttl')
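
The graph above is built from hand-written records. To feed it from the entities and relations extracted in Steps 3 and 6, those outputs first need to be mapped onto ontology terms. The sketch below produces records in the same shape as `entities_en` and `relations_en`. The two mapping tables are hypothetical placeholders, since the real class and property names should come from the cycling ontology, and `make_id`, `map_entities`, and `map_relations` are hypothetical helpers.

```python
import re

# Hypothetical mappings from extracted labels and relation phrases to ontology terms
ENTITY_TYPE_MAP = {'PERSON': 'Cyclist', 'PER': 'Cyclist', 'ORG': 'Team'}
RELATION_MAP = {'rides for': 'is_part_of', 'is member of': 'is_part_of'}


def make_id(name):
    # Build a URI-safe local name from lowercase ASCII letters and digits
    return re.sub(r'[^a-z0-9]+', '_', name.lower()).strip('_')


def map_entities(entities):
    # Keep only the entities whose label maps to an ontology class
    return [
        {'id': make_id(e['text']), 'name': e['text'], 'type': ENTITY_TYPE_MAP[e['label']]}
        for e in entities
        if e['label'] in ENTITY_TYPE_MAP
    ]


def map_relations(relations, mapped_entities):
    ids = {e['name']: e['id'] for e in mapped_entities}
    mapped = []
    for r in relations:
        predicate = RELATION_MAP.get(r['relation'].lower())
        # Keep a triple only if the relation and both ends are known
        if predicate and r['subject'] in ids and r['object'] in ids:
            mapped.append({
                'subject': ids[r['subject']],
                'predicate': predicate,
                'object': ids[r['object']],
            })
    return mapped


# Illustrative inputs reusing the names from the example above
entities = [
    {'text': 'Chris Froome', 'label': 'PERSON'},
    {'text': 'Team Sky', 'label': 'ORG'},
    {'text': '2022', 'label': 'DATE'},
]
relations = [
    {'subject': 'Chris Froome', 'relation': 'rides for', 'object': 'Team Sky'},
    {'subject': 'Chris Froome', 'relation': 'won', 'object': 'Team Sky'},
]
mapped_entities = map_entities(entities)
print(mapped_entities)
print(map_relations(relations, mapped_entities))
```

`make_id` replaces accented characters with underscores, so names such as "Pogačar" would need transliteration first if readable identifiers matter.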

## Step 8: Load the data in the Corese engine with the ontology and write the SPARQL queries to retrieve specific information from the KG
We will load the data in the [Corese](http://wimmics.inria.fr/doc/tutorial/corese-3.2.3c.jar) engine (the same you used in the Assignment 2) with the ontology and write the SPARQL queries to retrieve specific information from the KG. We will write the following queries:

* 📝 List the name of the cycling teams

In [None]:
PREFIX : <http://example.org/ontology#>

SELECT ?team_name
WHERE {
  ?team a :CyclingTeam ;
        :hasName ?team_name .
}

* 📝 List the name of the cycling riders

In [None]:
PREFIX : <http://example.org/ontology#>

SELECT ?rider_name
WHERE {
  ?rider a :CyclingRider ;
         :hasName ?rider_name .
}

* 📝 Retrieve the name of the winner of the Prologue

In [None]:
PREFIX : <http://example.org/ontology#>

SELECT ?winner_name
WHERE {
  ?race a :Prologue ;
        :hasWinner ?winner .
  ?winner :hasName ?winner_name .
}

📝 We will also write the same 3 queries on Wikidata starting from `Q98043180` to compare the results.

In [None]:
SELECT ?team_name WHERE {
  wd:Q98043180 wdt:P1923 ?team.  # Teams participating (P1923) in the race Q98043180
  ?team rdfs:label ?team_name.
  FILTER(LANG(?team_name) = "en")
}

In [None]:
SELECT ?rider_name WHERE {
  ?rider wdt:P31 wd:Q13393280;  # Cyclist (Q13393280)
         rdfs:label ?rider_name.
  FILTER(LANG(?rider_name) = "en")
}

In [None]:
SELECT ?winner_name WHERE {
  ?race wdt:P31 wd:Q31431;  # Prologue (Q31431)
        wdt:P1346 ?winner.  # Winner (P1346); P1344 is "participant in"
  ?winner rdfs:label ?winner_name.
  FILTER(LANG(?winner_name) = "en")
}