# Exploring code generation for KIARA's Network Analysis module by integrating HILDEGARD's Knowledge Graph Builder

Acknowledgments:

*Mariella De Crouy Chanel (General Ideation and Design of prompt-based module builder)*

*Markus Binsteiner (Technical Support)*

Notebook Author: *Cosimo Palma*  
cosimo.palma@phd.unipi.it

This notebook gathers some outputs of a GPT-4o-based [Kiara Module Builder](https://chatgpt.com/g/g-Z2RwpuJbw-kiara-module-builder). Its knowledge base was built from the kiara code and documentation, freely available at https://github.com/DHARPA-Project and https://dharpa.org/kiara.documentation/latest/ (see the Appendix for implementation details).
As a proof of concept for the network analysis module, we chose to create a Kiara plugin that integrates the [HILDEGARD](https://github.com/Glottocrisio/HILDEGARD/tree/main) workflow.

HILDEGARD (an acronym for "Human In the Loop Data Extraction and Graphically Augmented Relation Discovery") is a Digital Heritage Management Tool that aims to retrieve relationships between Heritage Objects conserved in museums. The following functions make up a "lightweight" version of HILDEGARD tailored to Digital Historians. It creates a Knowledge Graph from two seed Wikipedia entities and saves it in a .csv file that can easily be stored and queried in a kuzu knowledge base, or used in kiara with the [network analysis module](https://github.com/DHARPA-Project/kiara_plugin.dh_tagung_2023/blob/main/docs/notebooks/Network_Analysis.ipynb).

First of all, let us install the necessary packages (the latest versions of kiara and its core plugins).

In [1]:
!pip install kiara kiara-plugin.core-types kiara-plugin.html kiara-plugin.jupyter kiara-plugin.language-processing kiara-plugin.network-analysis kiara-plugin.onboarding kiara-plugin.streamlit kiara-plugin.tabular

Collecting kiara
  Downloading kiara-0.5.12-py3-none-any.whl.metadata (9.6 kB)
Collecting kiara-plugin.core-types
  Downloading kiara_plugin.core_types-0.5.1-py3-none-any.whl.metadata (5.1 kB)
Collecting kiara-plugin.html
  Downloading kiara_plugin.html-0.5.0-py3-none-any.whl.metadata (6.9 kB)
Collecting kiara-plugin.jupyter
  Downloading kiara_plugin.jupyter-0.5.0-py3-none-any.whl.metadata (6.7 kB)
Collecting kiara-plugin.language-processing
  Downloading kiara_plugin.language_processing-0.5.0-py3-none-any.whl.metadata (6.6 kB)
Collecting kiara-plugin.network-analysis
  Downloading kiara_plugin.network_analysis-0.5.1-py3-none-any.whl.metadata (6.5 kB)
Collecting kiara-plugin.onboarding
  Downloading kiara_plugin.onboarding-0.5.1-py3-none-any.whl.metadata (5.2 kB)
Collecting kiara-plugin.streamlit
  Downloading kiara_plugin.streamlit-0.5.1-py3-none-any.whl.metadata (7.1 kB)
Collecting kiara-plugin.tabular
  Downloading kiara_plugin.tabular-0.5.5-py3-none-any.whl.metadata (5.3 kB)
Colle

Then, we install the modules needed to build the web scraper, the KG relationship finder, and the queryable knowledge base.

In [2]:
!pip install kuzu requests selenium beautifulsoup4 SPARQLWrapper


Collecting kuzu
  Downloading kuzu-0.6.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.6 kB)
Collecting selenium
  Downloading selenium-4.26.1-py3-none-any.whl.metadata (7.1 kB)
Collecting SPARQLWrapper
  Downloading SPARQLWrapper-2.0.0-py3-none-any.whl.metadata (2.0 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.27.0-py3-none-any.whl.metadata (8.6 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting rdflib>=6.1.1 (from SPARQLWrapper)
  Downloading rdflib-7.1.1-py3-none-any.whl.metadata (11 kB)
Collecting isodate<1.0.0,>=0.7.2 (from rdflib>=6.1.1->SPARQLWrapper)
  Downloading isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Do

The following cell installs the ChromeDriver required by the web scraper.

In [3]:
!apt-get update
!apt-get install -y chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


0% [Working]            Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:8 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,456 kB]
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1,452 kB]
Get:13 http://security.ubuntu.com/ubuntu jammy-

The following command gives a quick view of the installed version of every package.

In [4]:
!pip list --format=freeze

absl-py==1.4.0
accelerate==0.34.2
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
airium==0.2.6
alabaster==0.7.16
albucore==0.0.19
albumentations==1.4.20
altair==4.2.2
annotated-types==0.7.0
anyio==3.7.1
anywidget==0.9.13
appdirs==1.4.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
array_record==0.5.1
arrow==1.3.0
arviz==0.20.0
astor==0.8.1
astropy==6.1.4
astropy-iers-data==0.2024.10.28.0.34.7
astunparse==1.6.3
async-timeout==4.0.3
atpublic==4.1.0
attrs==24.2.0
audioread==3.0.1
autograd==1.7.0
babel==2.16.0
backcall==0.2.0
backoff==2.2.1
bases==0.3.0
beautifulsoup4==4.12.3
bibtexparser==1.4.2
bidict==0.23.1
bigframes==1.25.0
bigquery-magics==0.4.0
black==24.10.0
bleach==6.2.0
blinker==1.4
blis==0.7.11
blosc2==2.0.0
bokeh==3.4.3
boltons==24.1.0
Bottleneck==1.4.2
bqplot==0.12.43
branca==0.8.0
CacheControl==0.14.0
cachetools==5.5.0
catalogue==2.0.10
certifi==2024.8.30
cffi==1.17.1
chardet==5.2.0
charset-normalizer==3.4.0
chex==0.1.87
clarabel==0.9.0
click==8.1.7
click-default

# Input validation

This cell validates the two seed entities by checking that a Wikipedia page exists for each of them. If an entity is not found, the user is asked to enter a valid one. Spaces must be replaced by underscores ("_").

In [5]:
import requests
import json
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
from SPARQLWrapper import SPARQLWrapper, JSON

def validate_entity(entity):
    """Validate if a Wikipedia entity exists."""
    response = requests.get(f"https://en.wikipedia.org/wiki/{entity}")
    return response.status_code == 200

# Take user input and validate
entity_start = input("Enter the starting Wikipedia entity: ")
while not validate_entity(entity_start):
    print(f"{entity_start} is not a valid Wikipedia entity. Try again.")
    entity_start = input("Enter the starting Wikipedia entity: ")

entity_end = input("Enter the target Wikipedia entity: ")
while not validate_entity(entity_end):
    print(f"{entity_end} is not a valid Wikipedia entity. Try again.")
    entity_end = input("Enter the target Wikipedia entity: ")

print(f"Valid entities: {entity_start} and {entity_end}")


Enter the starting Wikipedia entity: Albert_Einstein
Enter the target Wikipedia entity: Willibrord
Valid entities: Albert_Einstein and Willibrord
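
Since the validation expects underscores instead of spaces, the replacement could also be done programmatically before the check. The following is a minimal sketch; the helper name `normalize_title` is hypothetical and not part of the original notebook.

```python
def normalize_title(raw_title):
    """Trim the input and join its words with single underscores."""
    # str.split() without arguments splits on any whitespace and drops empty parts
    return "_".join(raw_title.split())


print(normalize_title("  Albert   Einstein "))   # Albert_Einstein
print(normalize_title("Tale of Two Brothers"))   # Tale_of_Two_Brothers
```

The normalized string could then be passed to `validate_entity` in place of the raw user input.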


# Shortest Path algorithm between two input entities by Web Scraping

The following function scrapes the website "Six Degrees of Wikipedia" to retrieve the intermediate entities on a shortest path between the two input entities. For each entity, the title, the description and the URL are stored. Consecutive entities are then grouped into overlapping groups of three.

In [6]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import time

def related_entities_triples(start, end):
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=options)
    driver.get(f"https://www.sixdegreesofwikipedia.com/?source={start}&target={end}")

    # Click the button to generate the shortest path
    try:
        driver.find_element(By.CSS_SELECTOR, "button").click()
        time.sleep(5)  # Allow time for content to load
    except Exception as e:
        print("Error clicking button:", e)
        driver.quit()
        return []

    # Scroll to load the "INDIVIDUAL PATHS" content
    try:
        webtext = driver.find_elements(By.XPATH, "//div[1]/div[2]/div[5]")[0]  # Container for paths content
        for _ in range(5):  # Scroll down several times to ensure content loads
            driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.PAGE_DOWN)
            time.sleep(1)  # Wait briefly for new content to load

        webtexto = webtext.text
    except Exception as e:
        print("Error retrieving 'INDIVIDUAL PATHS' content:", e)
        driver.quit()
        return []

    # Process the extracted text from "INDIVIDUAL PATHS"
    hrefs_list = []
    titles_list = []
    captions_list = []

    # Split webtext by lines to parse titles and captions
    lines = webtexto.split("\n")
    for i in range(0, len(lines), 2):  # Assuming title and caption alternate in lines
        if i < len(lines):
            titles_list.append(lines[i])  # Title on even lines
        if i + 1 < len(lines):
            captions_list.append(lines[i + 1])  # Caption on odd lines

    # Create triples with titles, captions, and hrefs
    triples = []
    for title, caption in zip(titles_list, captions_list):
        href = f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}"
        triples.append({
            "title": title,
            "caption": caption,
            "href": href
        })

    # Generate triple groups
    triple_groups = []
    for i in range(len(triples) - 2):
        triple_groups.append((triples[i], triples[i+1], triples[i+2]))

    driver.quit()
    return triple_groups

# Example usage
#start_entity = "Anubis"
#end_entity = "Tale of Two Brothers"
entity_triples = related_entities_triples(entity_start, entity_end)
print(entity_triples)


[({'title': 'Albert Einstein', 'caption': 'German-born theoretical physicist (1879–1955)', 'href': 'https://en.wikipedia.org/wiki/Albert_Einstein'}, {'title': 'Turkey', 'caption': 'Country straddling Southeast Europe and West Asia', 'href': 'https://en.wikipedia.org/wiki/Turkey'}, {'title': 'Luxembourg', 'caption': 'Country in Northwestern Europe', 'href': 'https://en.wikipedia.org/wiki/Luxembourg'}), ({'title': 'Turkey', 'caption': 'Country straddling Southeast Europe and West Asia', 'href': 'https://en.wikipedia.org/wiki/Turkey'}, {'title': 'Luxembourg', 'caption': 'Country in Northwestern Europe', 'href': 'https://en.wikipedia.org/wiki/Luxembourg'}, {'title': 'Willibrord', 'caption': 'Christian bishop and Roman Catholic saint', 'href': 'https://en.wikipedia.org/wiki/Willibrord'}), ({'title': 'Luxembourg', 'caption': 'Country in Northwestern Europe', 'href': 'https://en.wikipedia.org/wiki/Luxembourg'}, {'title': 'Willibrord', 'caption': 'Christian bishop and Roman Catholic saint', 'h
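
The parsing step can be tried without a browser. The sketch below reproduces, under simplifying assumptions, how alternating title and caption lines are paired and then grouped into overlapping windows of three. The function names `lines_to_records` and `sliding_triples` are hypothetical, and, unlike the cell above, blank lines are skipped before pairing.

```python
def lines_to_records(text):
    """Pair alternating title/caption lines into dictionaries."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    records = []
    # Titles sit on even positions, captions on odd ones; an unpaired last line is dropped
    for title, caption in zip(lines[0::2], lines[1::2]):
        records.append({
            "title": title,
            "caption": caption,
            "href": "https://en.wikipedia.org/wiki/" + title.replace(" ", "_"),
        })
    return records


def sliding_triples(records):
    """Group consecutive records into overlapping windows of three."""
    return [tuple(records[i:i + 3]) for i in range(len(records) - 2)]


sample_text = (
    "Albert Einstein\nGerman-born theoretical physicist (1879–1955)\n"
    "Turkey\nCountry straddling Southeast Europe and West Asia\n"
    "Luxembourg\nCountry in Northwestern Europe\n"
    "Willibrord\nChristian bishop and Roman Catholic saint"
)
groups = sliding_triples(lines_to_records(sample_text))
print(len(groups))                        # 2
print([r["title"] for r in groups[0]])    # ['Albert Einstein', 'Turkey', 'Luxembourg']
```

A path of four entities therefore yields two groups, which matches the structure of the output printed above.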

# CIDOC-CRM Ontology Harmonization

This procedure connects the titles, descriptions and URLs retrieved in the previous step through properties of the CIDOC-CRM ontology.

Mapping:

P67: refersTo

P102: hasTitle

P104: isSubjectTo

P196: defines

In [7]:
def harmonize_triples_to_crm(triple_groups):
    """
    Harmonizes a list of triple groups into CIDOC-CRM ontology format.

    Parameters:
    - triple_groups: List of tuple groups, where each group contains dictionaries
                     with "title", "caption", and "href" keys.

    Returns:
    - A list of dictionaries in CIDOC-CRM harmonized format.
    """
    harmonized_triples = []

    for group in triple_groups:
        for triple in group:
            title = triple["title"]
            caption = triple["caption"]
            href = triple["href"]

            # Map to CIDOC-CRM relations
            harmonized_triples.extend([
                {"title": title, "cidoc-relation": "P104", "descr": caption},
                {"descr": caption, "cidoc-relation": "P196", "uri": href},
                {"uri": href, "cidoc-relation": "P102", "title": title},
                {"descr": caption, "cidoc-relation": "P196", "title": title},
                {"title": title, "cidoc-relation": "P104", "descr": caption},
                {"uri": href, "cidoc-relation": "P67", "descr": caption},
                {"title": title, "cidoc-relation": "P67", "uri": href}
            ])

        # Add a "prev_title" relation to link the last entity to the next one
        for idx in range(1, len(group)):
            previous_triple = group[idx - 1]
            current_triple = group[idx]
            harmonized_triples.append({
                "prev_title": previous_triple["title"],
                "cidoc-relation": "P67",
                "title": current_triple["title"]
            })

    return harmonized_triples

# Example usage
crm_harmonized_triples = harmonize_triples_to_crm(entity_triples)
print(crm_harmonized_triples)


[{'title': 'Albert Einstein', 'cidoc-relation': 'P104', 'descr': 'German-born theoretical physicist (1879–1955)'}, {'descr': 'German-born theoretical physicist (1879–1955)', 'cidoc-relation': 'P196', 'uri': 'https://en.wikipedia.org/wiki/Albert_Einstein'}, {'uri': 'https://en.wikipedia.org/wiki/Albert_Einstein', 'cidoc-relation': 'P102', 'title': 'Albert Einstein'}, {'descr': 'German-born theoretical physicist (1879–1955)', 'cidoc-relation': 'P196', 'title': 'Albert Einstein'}, {'title': 'Albert Einstein', 'cidoc-relation': 'P104', 'descr': 'German-born theoretical physicist (1879–1955)'}, {'uri': 'https://en.wikipedia.org/wiki/Albert_Einstein', 'cidoc-relation': 'P67', 'descr': 'German-born theoretical physicist (1879–1955)'}, {'title': 'Albert Einstein', 'cidoc-relation': 'P67', 'uri': 'https://en.wikipedia.org/wiki/Albert_Einstein'}, {'title': 'Turkey', 'cidoc-relation': 'P104', 'descr': 'Country straddling Southeast Europe and West Asia'}, {'descr': 'Country straddling Southeast Eu
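
To make the harmonized triples easier to read, the property codes can be translated back into the labels of the mapping listed above. This is only an illustrative sketch: the key `relation-label` and the function `label_relations` are hypothetical additions, not part of the original workflow.

```python
# Labels taken from the mapping given in this section
CIDOC_LABELS = {
    "P67": "refersTo",
    "P102": "hasTitle",
    "P104": "isSubjectTo",
    "P196": "defines",
}


def label_relations(harmonized_triples):
    """Return copies of the triples with a readable 'relation-label' key added."""
    labeled = []
    for triple in harmonized_triples:
        code = triple["cidoc-relation"]
        # Unknown codes are kept unchanged as their own label
        labeled.append({**triple, "relation-label": CIDOC_LABELS.get(code, code)})
    return labeled


sample = [{
    "title": "Willibrord",
    "cidoc-relation": "P67",
    "uri": "https://en.wikipedia.org/wiki/Willibrord",
}]
print(label_relations(sample)[0]["relation-label"])  # refersTo
```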

# DBpedia relationship finder

This function takes as input the entity pairs retrieved through the shortest path algorithm and, for each pair, executes a SPARQL query to find non-trivial DBpedia relationships.
Change the parameter *num_mids* to modify the number of intermediate nodes between the given entities.


In [8]:
from SPARQLWrapper import SPARQLWrapper, JSON
from urllib.error import HTTPError
import time
import json

def execute_query(query, retries=3, wait=2):
    """Executes a SPARQL query on DBpedia with retry logic."""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)

    for attempt in range(retries):
        try:
            results = sparql.query().convert()
            return results["results"]["bindings"]
        except HTTPError as e:
            print(f"HTTPError: {e} - Retrying ({attempt + 1}/{retries})...")
            time.sleep(wait)  # Wait before retrying
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            break

    print(f"Query failed after {retries} attempts.")
    return []  # Return an empty list if the query fails

def generate_query(entity1, entity2, num_mids=5):
    """Generates a SPARQL query with intermediate nodes and filters."""

    # Define prefixes and initial part of the query
    query = f"""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?entity1 {" ".join([f"?pf{i} ?mid{i}" for i in range(1, num_mids + 1)])} ?pf{num_mids + 1} ?entity2
    WHERE {{
      VALUES (?entity1 ?entity2) {{ (dbr:{entity1} dbr:{entity2}) }}
      ?entity1 ?pf1 ?mid1 .
    """

    # Chain the intermediate nodes: mid1 -> mid2 -> ... -> mid{num_mids}
    for i in range(1, num_mids):
        query += f"?mid{i} ?pf{i+1} ?mid{i+1} .\n"

    # Final connection to the target entity
    query += f"?mid{num_mids} ?pf{num_mids + 1} ?entity2 .\n"

    # Filter to ensure distinct consecutive nodes in the path
    query += "FILTER(?entity1 != ?mid1 && ?entity2 != ?mid1 "
    for i in range(1, num_mids):
        query += f"&& ?mid{i} != ?mid{i+1} "
    query += "&& ?entity1 != ?entity2) \n"

    # Additional FILTER to exclude unwanted properties
    for i in range(1, num_mids + 2):
        if i != 5:  # Skip filter for certain relationships (if needed)
            query += f"FILTER (?pf{i} NOT IN (dbo:Person, dbo:wikiPageWikiLink, owl:Thing)) \n"

    # Close the query
    query += "} LIMIT 20"

    return query.strip()

def find_relationships_for_entity_pairs(triple_groups):
    """
    Finds DBpedia relationships between each pair of entities in `triple_groups`.

    Parameters:
    - triple_groups: List of triple groups, where each group is a tuple of
                     dictionaries with a "title" key for each entity.

    Returns:
    - A dictionary where each key is an (entity1, entity2) pair, and the value is
      a list of relationships between them.
    """
    relationships = {}
    failed_queries = []  # Track failed queries

    # Collect all unique entity pairs across triple groups
    pairs = set()
    for group in triple_groups:
        titles = [triple["title"].replace(" ", "_") for triple in group]
        pairs.update((titles[i], titles[j]) for i in range(len(titles)) for j in range(i + 1, len(titles)))

    # Execute SPARQL queries for each unique entity pair
    for entity1, entity2 in pairs:
        query = generate_query(entity1, entity2)
        print(f"Finding relationships between {entity1} and {entity2}...")  # Debug output
        results = execute_query(query)

        if results:
            # Parse results to capture relationships
            relationship_data = []
            for result in results:
                relationship_path = []
                for key, value in result.items():
                    relationship_path.append(value["value"])
                relationship_data.append(relationship_path)

            # Store the results in the dictionary
            relationships[(entity1, entity2)] = relationship_data
        else:
            print(f"Failed to retrieve relationships for {entity1} and {entity2}.")
            failed_queries.append((entity1, entity2))

    if failed_queries:
        print("The following queries failed and were retried without success:")
        for entity1, entity2 in failed_queries:
            print(f" - {entity1} to {entity2}")

    return relationships

dbpedia_relationships = find_relationships_for_entity_pairs(entity_triples)



Finding relationships between Culture_of_Germany and Carolingian_dynasty...
Failed to retrieve relationships for Culture_of_Germany and Carolingian_dynasty.
Finding relationships between Albert_Einstein and German_Empire...
Finding relationships between Willibrord and Institute_for_Advanced_Study...
Finding relationships between Martin_Luther_King_Jr. and Netherlands...
An unexpected error occurred: QueryBadFormed: A bad request has been sent to the endpoint: probably the SPARQL query is badly formed. 

Response:
b"Virtuoso 37000 Error SP030: SPARQL compiler, line 7: syntax error at '.' before 'dbr:Netherlands'\n\nSPARQL query:\n#output-format:application/sparql-results+json\nPREFIX dbo: <http://dbpedia.org/ontology/>\n    PREFIX dbr: <http://dbpedia.org/resource/>\n    PREFIX owl: <http://www.w3.org/2002/07/owl#>\n    SELECT ?entity1 ?pf1 ?mid1 ?pf2 ?mid2 ?pf3 ?mid3 ?pf4 ?mid4 ?pf5 ?mid5 ?pf6 ?entity2\n    WHERE {\n      VALUES (?entity1 ?entity2) { (dbr:Martin_Luther_King_Jr. dbr:Net
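
The error above appears when an entity name such as `Martin_Luther_King_Jr.` is written as a prefixed name (`dbr:...`): SPARQL does not accept a trailing period there, and characters such as apostrophes or parentheses would need escaping. One possible workaround, sketched below, is to write the resources as full IRIs in angle brackets. The helper names `dbpedia_iri` and `values_clause` are hypothetical, and the sketch assumes that every entity corresponds to a resource under `http://dbpedia.org/resource/`, as in the `PREFIX dbr:` declaration of the cell above.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

# Characters that may not appear inside a SPARQL IRI reference (<...>)
FORBIDDEN_IN_IRI = set('<>"{}|^`\\')


def dbpedia_iri(entity):
    """Return the entity as a full IRI in angle brackets."""
    if any(ch in FORBIDDEN_IN_IRI or ord(ch) <= 0x20 for ch in entity):
        raise ValueError(f"Entity name cannot be used in an IRI: {entity!r}")
    return f"<{DBPEDIA_RESOURCE}{entity}>"


def values_clause(entity1, entity2):
    """Build the VALUES line of the query with full IRIs instead of dbr: names."""
    return (
        "VALUES (?entity1 ?entity2) "
        f"{{ ({dbpedia_iri(entity1)} {dbpedia_iri(entity2)}) }}"
    )


print(values_clause("Martin_Luther_King_Jr.", "Netherlands"))
```

The returned line could replace the `VALUES` line built inside `generate_query`; whether the query then returns any relationships still depends on the DBpedia endpoint.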

# Saving DBpedia relationships in JSON format

In [9]:
#import json

# Convert tuple keys to strings for JSON compatibility
dbpedia_relationships_str_keys = {str(key): value for key, value in dbpedia_relationships.items()}

# Save the modified dictionary to a .txt file in JSON format
with open("dbpedia_relationships.txt", "w", encoding="utf-8") as file:
    json.dump(dbpedia_relationships_str_keys, file, ensure_ascii=False, indent=4)

print("DBpedia relationships saved to dbpedia_relationships.txt")
print(dbpedia_relationships)

DBpedia relationships saved to dbpedia_relationships.txt
{('Albert_Einstein', 'German_Empire'): [['http://dbpedia.org/resource/Albert_Einstein', 'http://dbpedia.org/ontology/citizenship', 'http://dbpedia.org/resource/Free_State_of_Prussia', 'http://dbpedia.org/property/today', 'http://dbpedia.org/resource/Russia', 'http://dbpedia.org/property/establishedEvent', 'http://dbpedia.org/resource/Tsardom_of_Russia', 'http://dbpedia.org/property/titleLeader', 'http://dbpedia.org/resource/List_of_Russian_monarchs', 'http://dbpedia.org/property/name', 'http://dbpedia.org/resource/Canonization_of_the_Romanovs', 'http://dbpedia.org/ontology/birthPlace', 'http://dbpedia.org/resource/German_Empire'], ['http://dbpedia.org/resource/Albert_Einstein', 'http://dbpedia.org/ontology/citizenship', 'http://dbpedia.org/resource/Free_State_of_Prussia', 'http://dbpedia.org/property/today', 'http://dbpedia.org/resource/Russia', 'http://dbpedia.org/property/establishedEvent', 'http://dbpedia.org/resource/Tsardom_
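
Because JSON only allows string keys, the tuple keys are stored as their string representation. If the original `(entity1, entity2)` tuples are needed again after loading the file, they can be recovered as in the following sketch, which uses a small made-up dictionary in place of the real results.

```python
import ast
import json

# Toy data standing in for the real dbpedia_relationships dictionary
relationships = {("Albert_Einstein", "German_Empire"): [["a", "p", "b"]]}

# Serialize: tuple keys become their string representation
as_text = json.dumps({str(key): value for key, value in relationships.items()})

# Deserialize: ast.literal_eval safely parses each key string back into a tuple
restored = {
    ast.literal_eval(key): value
    for key, value in json.loads(as_text).items()
}

print(restored == relationships)  # True
```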

# Saving retrieved DBpedia triples in a CSV file

In [10]:
import json
import csv

# Step 1: Load the JSON data
with open('/content/dbpedia_relationships.txt', 'r', encoding='utf-8') as file:
    dbpedia_relationships = json.load(file)

# Step 2: Collect triples
unique_triples = set()  # Using a set to avoid duplicate triples

# Process each entry in dbpedia_relationships
for paths in dbpedia_relationships.values():
    for path in paths:
        # Build triples along each path
        for i in range(0, len(path) - 2, 2):
            subject = path[i]
            predicate = path[i + 1]
            obj = path[i + 2]
            unique_triples.add((subject, predicate, obj))

# Step 3: Save unique triples to a CSV file
with open("dbpedia_relationships_triples.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["subject", "predicate", "object"])  # Header

    for triple in unique_triples:
        writer.writerow(triple)

print("Unique triples saved to dbpedia_relationships_triples.csv")


Unique triples saved to dbpedia_relationships_triples.csv
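
As a worked example of the splitting step, consider the first five values of the path printed in the previous section. Each path alternates nodes and predicates, so every node except the first and the last appears once as an object and once as a subject. The helper name `path_to_triples` is hypothetical.

```python
def path_to_triples(path):
    """Split [node, predicate, node, predicate, node, ...] into (s, p, o) triples."""
    # Subjects: nodes except the last; predicates: odd positions; objects: nodes except the first
    return list(zip(path[0:-2:2], path[1:-1:2], path[2::2]))


sample_path = [
    "http://dbpedia.org/resource/Albert_Einstein",
    "http://dbpedia.org/ontology/citizenship",
    "http://dbpedia.org/resource/Free_State_of_Prussia",
    "http://dbpedia.org/property/today",
    "http://dbpedia.org/resource/Russia",
]
for triple in path_to_triples(sample_path):
    print(triple)
```

A full path with five intermediate nodes has 13 values and therefore yields six triples.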


# Saving shortest-path-retrieved triples in a CSV file

In [11]:
import csv

def convert_and_save_to_csv(harmonized_triples, csv_filename="knowledge_graph_triples.csv"):
    """
    Convert the output of harmonize_triples_to_crm into CSV format
    compatible with Kuzu and Kiara.

    Args:
        harmonized_triples (list): Output from harmonize_triples_to_crm.
        csv_filename (str): Name of the CSV file to save.
    """

    # Prepare data for CSV format: (subject, predicate, object)
    csv_data = []

    # Process harmonized triples
    for triple in harmonized_triples:
        # Extract the subject, predicate, and object based on available keys
        if "title" in triple and "descr" in triple:
            csv_data.append((triple["title"], triple["cidoc-relation"], triple["descr"]))
        elif "title" in triple and "uri" in triple:
            csv_data.append((triple["title"], triple["cidoc-relation"], triple["uri"]))
        elif "descr" in triple and "uri" in triple:
            csv_data.append((triple["descr"], triple["cidoc-relation"], triple["uri"]))
        elif "prev_title" in triple and "title" in triple:
            csv_data.append((triple["prev_title"], triple["cidoc-relation"], triple["title"]))
        else:
            print(f"Skipping incomplete data in dictionary format: {triple}")

    # Save to CSV
    with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["subject", "predicate", "object"])  # Header
        writer.writerows(csv_data)

    print(f"Data successfully saved to {csv_filename}")


convert_and_save_to_csv(crm_harmonized_triples)


Data successfully saved to knowledge_graph_triples.csv


# Merging the files

In [12]:
import csv

# Define the paths to the CSV files
file1 = "/content/dbpedia_relationships_triples.csv"
file2 = "/content/knowledge_graph_triples.csv"
merged_file = "/content/merged_knowledge_graph_triples.csv"

# Step 1: Collect triples from both files into a set to ensure uniqueness
unique_triples = set()

# Read the first CSV file
with open(file1, mode="r", newline="", encoding="utf-8") as f1:
    reader = csv.reader(f1)
    next(reader)  # Skip header
    for row in reader:
        if len(row) == 3:  # Ensure row has subject, predicate, object
            unique_triples.add(tuple(row))

# Read the second CSV file
with open(file2, mode="r", newline="", encoding="utf-8") as f2:
    reader = csv.reader(f2)
    next(reader)  # Skip header
    for row in reader:
        if len(row) == 3:  # Ensure row has subject, predicate, object
            unique_triples.add(tuple(row))

# Step 2: Write the merged unique triples to a new CSV file
with open(merged_file, mode="w", newline="", encoding="utf-8") as mf:
    writer = csv.writer(mf)
    writer.writerow(["subject", "predicate", "object"])  # Header
    for triple in unique_triples:
        writer.writerow(triple)

print(f"Data successfully merged into {merged_file}")


Data successfully merged into /content/merged_knowledge_graph_triples.csv


# Initializing the kuzu knowledge base

In [13]:
import kuzu
import csv
import shutil
import os

# Specify the database path
database_path = "knowledge_graph_db"

# Check if the directory exists and delete it (including WAL files)
if os.path.exists(database_path):
    shutil.rmtree(database_path)
    print(f"Database directory '{database_path}' deleted.")

# Re-initialize the Kuzu database
db = kuzu.Database(database_path)
conn = kuzu.Connection(db)

# Step 2: Create the schema for entities and relationships without additional attributes
try:
    # Define node type for entities
    conn.execute("""
    CREATE NODE TABLE Entity (
        uri STRING,
        PRIMARY KEY (uri)
    )
    """)

    # Define a basic relationship type without additional attributes
    conn.execute("""
    CREATE REL TABLE RELATIONSHIP (FROM Entity TO Entity)
    """)

    print("Schema created successfully.")
except Exception as e:
    print(f"Error creating schema: {e}")


Schema created successfully.


# Populating the kuzu knowledge base with the generated CSV file

In [14]:
# Step 3: Define function to insert nodes and relationships
def insert_triple(conn, subject, predicate, obj):
    try:
        # Insert the subject and object nodes if they don't already exist
        conn.execute(f"MERGE (s:Entity {{uri: '{subject}'}})")
        conn.execute(f"MERGE (o:Entity {{uri: '{obj}'}})")

        # Insert the relationship without storing the predicate directly
        conn.execute(f"""
        MATCH (s:Entity {{uri: '{subject}'}}), (o:Entity {{uri: '{obj}'}})
        MERGE (s)-[:RELATIONSHIP]->(o)
        """)
    except Exception as e:
        print(f"Error inserting triple ({subject}, {predicate}, {obj}): {e}")

# Step 4: Upload data from CSV to Kuzu
csv_file = "/content/merged_knowledge_graph_triples.csv"

with open(csv_file, mode="r", newline="", encoding="utf-8") as file:
    reader = csv.reader(file)
    next(reader)  # Skip header

    for row in reader:
        subject, predicate, obj = row
        insert_triple(conn, subject, predicate, obj)

print("Data successfully uploaded to Kuzu database.")

Error inserting triple (King's College London, P196, Public research university in London, United Kingdom): Parser exception: Invalid input <MERGE (s:Entity {uri: 'King's>: expected rule oC_SingleQuery (line: 1, offset: 28)
"MERGE (s:Entity {uri: 'King's College London'})"
                             ^
Error inserting triple (King's College London, P102, https://en.wikipedia.org/wiki/King's_College_London): Parser exception: Invalid input <MERGE (s:Entity {uri: 'King's>: expected rule oC_SingleQuery (line: 1, offset: 28)
"MERGE (s:Entity {uri: 'King's College London'})"
                             ^
Error inserting triple (King's College London, P104, Public research university in London, United Kingdom): Parser exception: Invalid input <MERGE (s:Entity {uri: 'King's>: expected rule oC_SingleQuery (line: 1, offset: 28)
"MERGE (s:Entity {uri: 'King's College London'})"
                             ^
Error inserting triple (http://dbpedia.org/resource/People's_Consultative_Assembly, ht
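
The parser exceptions above are caused by apostrophes in titles and URIs (for example "King's College London"), which close the quoted string inside the query text. A possible remedy is to pass the values as query parameters rather than formatting them into the query. The sketch below assumes that `conn.execute` accepts a dictionary of parameters as its second argument, as the kuzu Python API is understood to do; the function name `insert_triple_safely` and the `RecordingConnection` stand-in are hypothetical, the latter only so that the example runs without a database.

```python
def insert_triple_safely(conn, subject, obj):
    """Insert two nodes and an edge, passing the values as query parameters."""
    # Values are bound to $uri, $s and $o instead of being pasted into the query text,
    # so quotes inside a title or URI cannot break the statement.
    conn.execute("MERGE (n:Entity {uri: $uri})", {"uri": subject})
    conn.execute("MERGE (n:Entity {uri: $uri})", {"uri": obj})
    conn.execute(
        "MATCH (s:Entity {uri: $s}), (o:Entity {uri: $o}) "
        "MERGE (s)-[:RELATIONSHIP]->(o)",
        {"s": subject, "o": obj},
    )


class RecordingConnection:
    """Stand-in for a database connection that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        self.calls.append((query, parameters))


demo = RecordingConnection()
insert_triple_safely(
    demo,
    "King's College London",
    "https://en.wikipedia.org/wiki/King's_College_London",
)
print(len(demo.calls))  # 3
```

With a real connection, `demo` would be replaced by the `conn` object created in the initialization cell.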

# Importing the CSV file in KIARA

In [23]:
from kiara.api import KiaraAPI
import pandas as pd
import networkx as nx

# Initialize Kiara instance
kiara = KiaraAPI.instance()
# Load the CSV file as a pandas DataFrame
csv_file_path = "/content/merged_knowledge_graph_triples.csv"
data = pd.read_csv(csv_file_path)

# Preview the data (optional)
print("Data preview:")
print(data.head())

KG = kiara.run_job('import.local.file', inputs={'path': csv_file_path}, comment="")
KG


Data preview:
                                        subject  \
0                           Walhalla (memorial)   
1       http://dbpedia.org/resource/Ueli_Maurer   
2                               Albert Einstein   
3  http://dbpedia.org/resource/Friedrich_Dickel   
4   http://dbpedia.org/resource/Asian_Americans   

                                predicate  \
0                                     P67   
1  http://dbpedia.org/ontology/birthPlace   
2                                     P67   
3  http://dbpedia.org/ontology/birthPlace   
4      http://dbpedia.org/property/region   

                                              object  
0  https://en.wikipedia.org/wiki/Walhalla_(memorial)  
1       http://dbpedia.org/resource/Canton_of_Zürich  
2                       Institute for Advanced Study  
3          http://dbpedia.org/resource/German_Empire  
4                 http://dbpedia.org/resource/Hawaii  


From this point on, it is possible to exploit the extracted knowledge graph in the network analysis module as described in the [related tutorial](https://github.com/DHARPA-Project/kiara_plugin.dh_tagung_2023/blob/main/docs/notebooks/Network_Analysis.ipynb).
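
Network analysis tools usually expect an edge list rather than subject-predicate-object triples. The following sketch converts the triples into such a list. The column names `source`, `target` and `label` are assumptions made for illustration; the tutorial linked above may expect different names. The function name `triples_to_edges` is hypothetical, and in-memory files are used so that the example is self-contained.

```python
import csv
import io


def triples_to_edges(triples_file, edges_file):
    """Copy (subject, predicate, object) rows into (source, target, label) rows."""
    reader = csv.DictReader(triples_file)
    writer = csv.writer(edges_file)
    writer.writerow(["source", "target", "label"])
    count = 0
    for row in reader:
        # The predicate is kept as an edge label so that it is not lost
        writer.writerow([row["subject"], row["object"], row["predicate"]])
        count += 1
    return count


triples_csv = io.StringIO(
    "subject,predicate,object\n"
    "Albert Einstein,P67,Institute for Advanced Study\n"
)
edges_csv = io.StringIO()
print(triples_to_edges(triples_csv, edges_csv))  # 1
print(edges_csv.getvalue())
```

To process the merged file, the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="", encoding="utf-8")`.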

# Lessons learnt and future work

As with the Colab tutorial for [journal harvesting using EUROPEANA API](https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/blob/develop/docs/jupyter/kiarapeana_topic_modeling.ipynb), code generation with the *Kiara Module Builder* was only slightly more useful than with general-purpose ChatGPT. This time the reason is that the share of the code dealing with *kiara* and its data structures was even smaller than in the previous project.

The present contribution can be improved by proposing a workflow that helps users without programming experience to query the knowledge graph in kuzu, for instance by offering template queries that can be run off the shelf.
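
Such template queries might look like the following sketch, written against the `Entity` and `RELATIONSHIP` tables defined earlier. The template names, the dictionary `TEMPLATE_QUERIES` and the function `run_template` are hypothetical, and the sketch assumes that `conn.execute` accepts a dictionary of parameters as its second argument. A small stand-in connection is used so that the example runs without a database.

```python
TEMPLATE_QUERIES = {
    # Entities that a given entity points to
    "outgoing": "MATCH (s:Entity {uri: $uri})-[:RELATIONSHIP]->(o:Entity) RETURN o.uri",
    # Entities that point to a given entity
    "incoming": "MATCH (s:Entity)-[:RELATIONSHIP]->(o:Entity {uri: $uri}) RETURN s.uri",
    # Total number of entities in the graph
    "count_entities": "MATCH (e:Entity) RETURN count(*)",
}


def run_template(conn, name, **parameters):
    """Run one of the predefined queries with the given parameter values."""
    if name not in TEMPLATE_QUERIES:
        raise KeyError(f"Unknown template: {name}")
    return conn.execute(TEMPLATE_QUERIES[name], parameters)


class EchoConnection:
    """Stand-in connection that returns the query and parameters it was given."""

    def execute(self, query, parameters=None):
        return query, parameters


query, params = run_template(EchoConnection(), "outgoing", uri="Albert Einstein")
print(params)  # {'uri': 'Albert Einstein'}
```

The user would then only choose a template name and supply an entity, without writing any query text.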

Furthermore, the whole pipeline and its functions can be packaged as a kiara plug-in or kiara pipeline.

# Appendix


Code snippet to save the whole kiara codebase into a single txt file. First, clone the repositories and list their files from a shell:

# Clone the main kiara repository
git clone https://github.com/DHARPA-Project/kiara.git

# Clone the kiara_plugin.network_analysis repository
git clone https://github.com/DHARPA-Project/kiara_plugin.network_analysis.git

# Clone the NetworkAnalysis repository
git clone https://github.com/DHARPA-Project/NetworkAnalysis.git

# Clone the TopicModelling- repository
git clone https://github.com/DHARPA-Project/TopicModelling-.git

# Clone the jupyterlab-extension-example repository
git clone https://github.com/DHARPA-Project/jupyterlab-extension-example.git

# Clone the asciinet repository
git clone https://github.com/DHARPA-Project/asciinet.git


# Navigate to the kiara repository
cd kiara
# List all files
find . > ../kiara_files.txt
# Return to the parent directory
cd ..

# Repeat for each repository
cd kiara_plugin.network_analysis
find . > ../kiara_plugin_network_analysis_files.txt
cd ..

cd NetworkAnalysis
find . > ../NetworkAnalysis_files.txt
cd ..

cd TopicModelling-
find . > ../TopicModelling_files.txt
cd ..

cd jupyterlab-extension-example
find . > ../jupyterlab_extension_example_files.txt
cd ..

cd asciinet
find . > ../asciinet_files.txt
cd ..


# Combine all listings into a single file
cat kiara_files.txt kiara_plugin_network_analysis_files.txt NetworkAnalysis_files.txt TopicModelling_files.txt jupyterlab_extension_example_files.txt asciinet_files.txt > DHARPA_Project_files.txt

Then run:





In [None]:
import os

# List of repository directories
repos = [
    "kiara",
    "kiara_plugin.network_analysis",
    "NetworkAnalysis",
    "TopicModelling-",
    "jupyterlab-extension-example",
    "asciinet"
]

# Output file
output_file = "DHARPA_Project_code.txt"

with open(output_file, 'w', encoding='utf-8') as outfile:
    for repo in repos:
        for root, _, files in os.walk(repo):
            for file in files:
                file_path = os.path.join(root, file)
                # Only include source, documentation and configuration files
                if file.endswith(('.py', '.md', '.txt', '.sh', '.json', '.js', '.yml')):
                    outfile.write(f"\n\n# {file_path}\n")
                    # Ignore undecodable bytes so that one odd file does not stop the export
                    with open(file_path, 'r', encoding='utf-8', errors='ignore') as infile:
                        outfile.write(infile.read())