---
title: "Working with JSON-LD"
author: "Charles F. Vardeman II"
date: "2023-09-12"
format:
    html:
        code-fold: true
        grid:
          margin-width: 350px
reference-location: margin
citation-location: margin
bibliography: kg.bib
---

## Using JSON-LD and Generative AI to construct a Knowledge Graph
We have been using the [kglab tutorial](https://derwen.ai/docs/kgl/tutorial/) to learn how W3C web standards are used to construct knowledge graphs. One particularly useful standard, JSON-LD, is touched on only briefly in the tutorial's [serialization exercise](https://derwen.ai/docs/kgl/ex1_0/#serialization-as-json-ld). That exercise corresponds to [building a knowledge graph with rdflib](kglab/examples/ex1_0.ipynb) in the kglab/examples subdirectory.

## Using Generative AI to create column types
[@koriniColumnTypeAnnotation2023]

----
**ChatGPT Advanced Data Analysis:** 

Absolutely, having a summary and a set of steps can make the process easily reproducible. Here it is:

## Summary

To generate Turtle data from a Pandas DataFrame:

1. **Pandas to JSON**: Convert the DataFrame to JSON format using `to_json()` with `orient='records'`.

2. **Add UUID and Type to JSON**: Iterate through the JSON objects and add a UUID prefixed by `urn:uuid:` and a `@type` with the value `Person`.

3. **Create JSON-LD Context**: Define a JSON-LD context to map JSON keys to RDF vocabularies. Include mappings for the UUID and `rdf:type`.

4. **Combine to Form JSON-LD**: Create a JSON-LD object that incorporates both the JSON data and the JSON-LD context. Put the JSON objects inside a `@graph` array.

5. **Load into RDFlib**: Use RDFlib's `parse()` method to load the JSON-LD data into an RDF graph.

6. **Serialize to Turtle**: Use RDFlib's `serialize()` method to convert the RDF graph to Turtle format.

## Code Prompt

Here's a condensed Python code snippet that includes all the steps:

```python
import pandas as pd
import json
import uuid
from rdflib import Graph

# Step 1: Convert DataFrame to JSON
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30], 'Occupation': ['Engineer', 'Doctor']}
df = pd.DataFrame(data)
json_data = df.to_json(orient='records')

# Step 2: Add UUID and Type to JSON
json_data_dict = json.loads(json_data)
for obj in json_data_dict:
    obj["UUID"] = "urn:uuid:" + str(uuid.uuid4())
    obj["@type"] = "Person"

# Step 3: Create JSON-LD Context
context = {
    "@context": {
        "UUID": "@id",
        "Name": "http://schema.org/name",
        "Age": "http://schema.org/age",
        "Occupation": "http://schema.org/occupation",
        "@type": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "Person": "http://xmlns.com/foaf/0.1/Person"
    }
}

# Step 4: Combine to Form JSON-LD
json_ld_data = {
    "@context": context["@context"],
    "@graph": json_data_dict
}

# Step 5: Load into RDFlib
g = Graph()
g.parse(data=json.dumps(json_ld_data), format='json-ld')

# Step 6: Serialize to Turtle
turtle_data = g.serialize(format='turtle')
print(turtle_data)
```

This will generate Turtle data that includes UUIDs and a `rdf:type` of `Person`, aligning with your requirements.

---

### Transforming Standard JSON to a Knowledge Graph
JSON-LD can turn standard JSON data into a knowledge graph with very little extra work. Adding a `@context` to a JSON document defines how each key should be interpreted semantically, by mapping it to an IRI. This turns a flat data structure into an interconnected web of information that can be queried and linked to other datasets.


## Our general procedure towards automated KG Construction using LLMs
I understand the workflow you've outlined for constructing a knowledge graph from CSV data using JSON-LD. The process includes six major steps:

1. **ETL from CSV to JSON**: Extract, transform, and load the data from a CSV file into a JSON object. Here, we'll use Python libraries like pandas.
  
2. **Creating JSON-LD Vocabulary File**: Establish the terms, data types, and definitions in a JSON-LD vocabulary file based on the LLM template.
  
3. **Convert from JSON to JSON-LD**: Extend the JSON data to JSON-LD by adding the context, linking it with the vocabulary's IRI, wrapping the data in a @graph object, and assigning unique IDs.
  
4. **Introduce New Ontology File**: Create another JSON-LD vocabulary file to serve as an ontology.
  
5. **Link Vocabularies in Another JSON-LD File**: Establish relationships between the vocabularies, definitions, and ontologies in a separate JSON-LD file.
  
6. **Capturing Provenance Information**: Document metadata about the @graph object to capture its origin and transformation history.
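
Steps 3 and 6 can be sketched with nothing more than Python dictionaries. This is only an illustration under stated assumptions: the vocabulary IRI `http://example.org/vocab#`, the source file IRI and the two stand-in records are hypothetical, and using the W3C PROV terms `prov:wasDerivedFrom` and `prov:generatedAtTime` is one possible way to record provenance, not a choice the workflow above prescribes.

```python
import json
import uuid
from datetime import datetime, timezone

# Stand-in records, as step 1 might produce them (illustrative values).
records = [{"name": "Alice"}, {"name": "Bob"}]

# Step 3: give every record a unique ID.
for rec in records:
    rec["@id"] = f"urn:uuid:{uuid.uuid4()}"

# The @graph object gets its own ID so that metadata can be attached to it.
graph_id = f"urn:uuid:{uuid.uuid4()}"

doc = {
    "@context": {
        "@vocab": "http://example.org/vocab#",  # hypothetical vocabulary IRI (step 2)
        "prov": "http://www.w3.org/ns/prov#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "generatedAtTime": {"@id": "prov:generatedAtTime", "@type": "xsd:dateTime"},
        "wasDerivedFrom": {"@id": "prov:wasDerivedFrom", "@type": "@id"},
    },
    "@id": graph_id,
    # Step 6: provenance statements about the @graph object itself.
    "wasDerivedFrom": "http://example.org/data/source.csv",  # hypothetical source file
    "generatedAtTime": datetime.now(timezone.utc).isoformat(),
    # Step 3: the records are wrapped in @graph.
    "@graph": records,
}

print(json.dumps(doc, indent=2))
```

Because the top-level object carries both an `@id` and a `@graph`, JSON-LD treats the records as a named graph, and the provenance properties describe that graph rather than any single record.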

## Example: Using JSON-LD to construct a Knowledge Graph from Wikipedia tables

Wikipedia can provide a starting point for structuring and enriching knowledge graphs. For our Navy project, we want to provide LLM agents with context for factual information that may have changed since "pre-training", using [`Retrieval Augmented Generation`](https://learn.microsoft.com/en-us/azure/machine-learning/concept-retrieval-augmented-generation?view=azureml-api-2#Technical%20overview%20of%20using%20RAG%20on%20Large%20Language%20Models%20(LLMs)). To that end, we want to augment our KG with broader contextual information that may be useful to agents for question answering. One source of such context is the [List of current ships of the United States Navy](https://en.wikipedia.org/wiki/List_of_current_ships_of_the_United_States_Navy), which contains a number of tables that could be useful additions to our knowledge graph. [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) does contain most of the `ships`, but the entities are not always complete. For example, the Wikidata entity page for [USS Abraham Lincoln](https://www.wikidata.org/wiki/Q587557) shows a rather rich set of relations, including `events` such as `ship naming ceremony` and `ship commissioning`. However, entries for ships like [USS Carter Hall](https://www.wikidata.org/wiki/Q2468798) are very basic and contain little information we can extract. The [History for `List of current ships of the United States Navy`](https://en.wikipedia.org/w/index.php?title=List_of_current_ships_of_the_United_States_Navy&action=history) shows that the page is fairly active, which suggests that Wikipedia contributors are keeping it more up to date than the Wikidata entries.

Let's do a little [Exploratory Data Analysis](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners) using large language models, similar to [Getting Started With LLMs](https://www.kaggle.com/code/jhoward/getting-started-with-llms/). We will use [Beautiful Soup](https://pypi.org/project/beautifulsoup4/) to retrieve the Wikipedia page and find the HTML tables in the document.

In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_current_ships_of_the_United_States_Navy"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find_all("table")[0]


Now that we have the tables, let's create a pandas dataframe from the first one and sanity check that it matches [the first table on the page](https://en.wikipedia.org/wiki/List_of_current_ships_of_the_United_States_Navy#Commissioned).

In [2]:
#| label: fig-commissioned
#| fig-cap: Dataframe from Wikipedia Commissioned ship table.
#| echo: true

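from io import StringIO

import pandas as pd

# Wrap the HTML in StringIO: newer pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(str(table)))[0]
df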



| | Ship name | Hull number | Class | Type | Commission date | Homeport[2] | Note |
|---|---|---|---|---|---|---|---|
| 0 | USS Abraham Lincoln | CVN-72 | Nimitz | Aircraft carrier | 11 November 1989 | San Diego, CA | [3] |
| 1 | USS Alabama | SSBN-731 | Ohio | Ballistic missile submarine | 25 May 1985 | Bangor, WA | [4] |
| 2 | USS Alaska | SSBN-732 | Ohio | Ballistic missile submarine | 25 January 1986 | Kings Bay, GA | [5] |
| 3 | USS Albany | SSN-753 | Los Angeles | Attack submarine | 7 April 1990 | Norfolk, VA | [6] |
| 4 | USS Alexandria | SSN-757 | Los Angeles | Attack submarine | 29 June 1991 | San Diego, CA | [7] Scheduled to be decommissioned 2026[8] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | USS William P. Lawrence | DDG-110 | Arleigh Burke | Destroyer | 19 May 2011 | San Diego, CA | [241] |
| 234 | USS Winston S. Churchill | DDG-81 | Arleigh Burke | Destroyer | 10 March 2001 | Norfolk, VA | [242] |
| 235 | USS Wichita | LCS-13 | Freedom | Littoral combat ship | 12 January 2019 | Mayport, FL | [243] Proposed to be decommissioned 2023[17] |
| 236 | USS Wyoming | SSBN-742 | Ohio | Ballistic missile submarine | 13 July 1996 | Kings Bay, GA | [244] |


To use the column names as the basis for URIs, we first need to make them web safe by removing spaces and other problematic characters. The other question is what a row represents. If we want to build a knowledge graph of all of the ships of the Navy, we should start from a basic ontology. The first table lists "commissioned" ships, and `commissioned` is best modeled as a role relationship, since it has a start time and an end time.

In [3]:
column_names = df.columns.tolist()
column_names

['Ship name',
 'Hull number',
 'Class',
 'Type',
 'Commission date',
 'Homeport[2]',
 'Note']

Before we construct the KG, we need to do a bit of data cleaning so we have a properly constructed Knowledge Graph. We will use ChatGPT "Advanced Data Analysis" as a surrogate for a Large Language Model based Agent.

---

**Prompt:** Before we proceed, lets think critically about the dataframe structure. We will use the to_json method to construct a JSON document and then combine it with the context. However, there are some issues with the data, and in particular the format of dates in the dataframe. Think critically and look for issues in the pandas dataframe that will result in inconsistent or invalid KG construction.

---

**ChatGPT:** You're correct; a critical review of the DataFrame structure is essential before proceeding to the knowledge graph construction. This will help ensure that the resulting KG is both consistent and valid. Here are some aspects to consider:

### Issues with the DataFrame:

1. **Date Format**: The "Commission date" column is formatted as a string in "YYYY-MM-DD" format. While the `to_json` method would convert it into a string in JSON, it's crucial that the context file specifies the date's type as "xsd:date" to ensure that it's interpreted correctly in the resulting KG.

2. **Notes Column**: The "Note" column contains bracketed numbers, which could be references or footnotes. The meaning is unclear, and depending on the intended use, this could introduce ambiguity into the KG.

3. **Homeport Formatting**: The "Homeport[2]" column contains location data as a string, but it's not in a standardized geographical format. This could be an issue if geographical querying is required later.

4. **Column Names**: The column name "Homeport[2]" includes a special character, which may require special handling in both JSON and RDF. It's generally good practice to use simple, descriptive column names.

5. **Uniqueness**: The "Hull number" is assumed to be a unique identifier for each ship, but the DataFrame doesn't enforce this. If there are duplicates, this would result in invalid KG construction.

6. **Implicit Relationships**: The data implicitly suggests relationships (e.g., a ship belongs to a class and type), but these are not explicitly modeled in the DataFrame. They will be modeled in the KG, but it's good to be aware of these.

7. **Role Modeling**: Since we decided to model "Commissioned Ship" as a role, the DataFrame structure doesn't capture this. Specifically, roles like "Commissioned Ship" often have start and end dates, but our DataFrame only has the start date ("Commission date").

8. **Missing Data**: The DataFrame doesn't indicate missing or null values explicitly, which could be an issue when constructing the KG.

9. **Data Types**: While pandas will infer the data types, it's good to explicitly set them, especially for columns that should be interpreted as categorical data (e.g., "Class", "Type").

Addressing these issues before KG construction will help ensure that the resulting KG is both valid and useful for downstream applications. Would you like to address these issues step-by-step?
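
Several of these points can be checked directly rather than taken on trust. For instance, the "Commission date" column in our dataframe actually holds strings such as `11 November 1989`, not `YYYY-MM-DD`. The sketch below runs three such checks (uniqueness, missing values, date parsing) on a small stand-in dataframe; the duplicate hull number and the missing date are deliberately invented for illustration, and the same calls could be pointed at `df`.

```python
import pandas as pd

# Small stand-in for the ship table; the duplicate and the gap are invented on purpose.
sample = pd.DataFrame({
    "Hull number": ["CVN-72", "SSBN-731", "SSBN-731"],
    "Commission date": ["11 November 1989", "25 May 1985", None],
})

# Issue 5: is the intended identifier actually unique?
is_dup = sample["Hull number"].duplicated(keep=False)
duplicated_hulls = sample.loc[is_dup, "Hull number"].unique().tolist()

# Issue 8: count missing values per column.
missing_counts = sample.isna().sum().to_dict()

# Issue 1: parse the dates explicitly; anything unparseable or missing becomes NaT.
parsed = pd.to_datetime(sample["Commission date"], format="%d %B %Y", errors="coerce")
iso_dates = parsed.dt.strftime("%Y-%m-%d")

print("Duplicated hull numbers:", duplicated_hulls)
print("Missing values per column:", missing_counts)
print("ISO dates:", iso_dates.tolist())
```

Using `errors="coerce"` keeps the check from stopping at the first bad value, so every problematic row can be inspected in one pass before deciding how to clean it.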

In [4]:
#| label: fig-pandasJSON
#| fig-cap: JSON generated directly from Pandas dataframe.
#| echo: true

import json
import uuid

from pprint import pprint

# Normalize column names in DataFrame
normalized_columns = {col: col.replace(" ", "_").replace("[", "").replace("]", "") for col in column_names}
df.rename(columns=normalized_columns, inplace=True)

# Let's change the date to be consistent with xsd:date
# df['Commission_date'] = pd.to_datetime(df['Commission_date']).dt.strftime('%Y-%m-%d')

# Convert DataFrame to JSON-formatted string
df_json_str = df.to_json(orient="records", indent=4)

# Convert JSON-formatted string to Python object
df_json = json.loads(df_json_str)

# Pretty-print the first record
pprint(df_json[0])

{'Class': 'Nimitz',
 'Commission_date': '11 November 1989',
 'Homeport2': 'San Diego, CA',
 'Hull_number': 'CVN-72',
 'Note': '[3]',
 'Ship_name': 'USS\xa0Abraham Lincoln',
 'Type': 'Aircraft carrier'}


Looking at this JSON version of the first row of the dataframe, we can see some small data problems that we might want to address. First, `Commission_date` is not in the `xsd:date` format that our knowledge graph expects for a date literal. Second, the `Note` field isn't very useful without the text of the note it refers to. Third, `Ship_name` contains a non-breaking space (`\xa0`). For `Hull_number`, `Homeport2`, `Class`, `Type` and `Ship_name`, we will eventually want to align to Wikidata properties. Remember, the column names serve as properties that link whatever the row represents to the literal values in the table.

In [6]:

from datetime import datetime


# Data Cleaning Steps
# 1. Replace special characters in 'Ship_name'
df['Ship_name'] = df['Ship_name'].str.replace('\xa0', ' ')

# 2. Standardize 'Commission_date' to ISO 8601 format
df['Commission_date'] = pd.to_datetime(df['Commission_date']).dt.strftime('%Y-%m-%d')

# 3. Remove 'Note' field
df = df.drop(columns=['Note'])

# 4. Standardize 'Homeport' field name
df.rename(columns={'Homeport2': 'Homeport'}, inplace=True)

# Convert cleaned DataFrame to JSON
json_data_cleaned = df.to_json(orient='records')

# Convert JSON-formatted string to Python object
df_json = json.loads(json_data_cleaned)

# Pretty-print the first record
pprint(df_json[0])

{'Class': 'Nimitz',
 'Commission_date': '1989-11-11',
 'Homeport': 'San Diego, CA',
 'Hull_number': 'CVN-72',
 'Ship_name': 'USS Abraham Lincoln',
 'Type': 'Aircraft carrier'}


::: {.callout-note}
## Conventions for naming in RDF

The prevalent practice in RDF ontology naming is to employ [Camel case](https://en.wikipedia.org/wiki/Camel_case). Specifically, **lower camel case** is preferred for relationships, which include properties and edges in the knowledge graph. In contrast, **upper camel case** is advised for Classes and types, representing nodes in the knowledge graph. It's worth noting that while camel case serves as an implicit identifier, more explicit identifiers such as `rdf:Property` are also commonly used, especially in the context of large language models. 

**For example:**
```json
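{
  "@context": {
    "@vocab": "http://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "name": {
      "@id": "http://schema.org/name",
      "@type": "xsd:string"
    }
  },
  "@graph": [
    {
      "@id": "http://schema.org/name",
      "@type": "rdf:Property"
    },
    {
      "@type": "Person",
      "name": "John Doe"
    }
  ]
}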
```

:::
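
As a small illustration of these conventions, the helpers below turn table labels into lower camel case (for properties) and upper camel case (for classes). The function names `to_lower_camel` and `to_upper_camel` are hypothetical, and treating bracketed numbers as footnote markers to be dropped is an assumption based on the Wikipedia column `Homeport[2]`.

```python
import re


def to_lower_camel(name: str) -> str:
    """Turn a label such as 'Hull number' or 'Homeport[2]' into lowerCamelCase."""
    name = re.sub(r"\[\d+\]", "", name)        # drop footnote markers like [2]
    words = re.findall(r"[A-Za-z0-9]+", name)  # split on spaces, underscores, punctuation
    if not words:
        raise ValueError(f"no usable characters in {name!r}")
    first, rest = words[0], words[1:]
    return first.lower() + "".join(w[:1].upper() + w[1:].lower() for w in rest)


def to_upper_camel(name: str) -> str:
    """Same as to_lower_camel, but with a leading capital, for class names."""
    lower = to_lower_camel(name)
    return lower[:1].upper() + lower[1:]


# Property-style names for columns, class-style names for types.
for label in ["Ship name", "Hull number", "Commission_date", "Homeport[2]"]:
    print(label, "->", to_lower_camel(label))
print("Aircraft carrier", "->", to_upper_camel("Aircraft carrier"))
```

The rest of this walkthrough keeps the underscore-style keys produced earlier (for example `Ship_name`), so these helpers are an optional alternative rather than part of the pipeline.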

To generate an RDF knowledge graph from the JSON, we also need a `@context` that defines our JSON `keys`.

In [33]:
#| label: fig-turtle
#| fig-cap: RDF graph built from the cleaned JSON and serialized to Turtle.
#| echo: true

from rdflib import Graph, URIRef

# Step 1: Add UUID and Type to JSON
json_data_cleaned_dict = json.loads(json_data_cleaned)
for obj in json_data_cleaned_dict:
    obj["UUID"] = "urn:uuid:" + str(uuid.uuid4())
    obj["@type"] = "Ship"

# Step 2: Create JSON-LD Context with SKOS definitions and XSD literals
context_with_skos_and_xsd = {
    "@context": {
        "UUID": "@id",
        "Class": {
            "@id": "http://example.org/Class",
            "@type": "xsd:string",
            "skos:definition": "The class of the ship."
        },
        "Commission_date": {
            "@id": "http://example.org/Commission_date",
            "@type": "xsd:date",
            "skos:definition": "The date the ship was commissioned."
        },
        "Homeport": {
            "@id": "http://example.org/Homeport",
            "@type": "xsd:string",
            "skos:definition": "The homeport of the ship."
        },
        "Hull_number": {
            "@id": "http://example.org/Hull_number",
            "@type": "xsd:string",
            "skos:definition": "The hull number of the ship."
        },
        "Ship_name": {
            "@id": "http://example.org/Ship_name",
            "@type": "xsd:string",
            "skos:definition": "The name of the ship."
        },
        "Type": {
            "@id": "http://example.org/Type",
            "@type": "xsd:string",
            "skos:definition": "The type of the ship."
        },
        "@type": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "Ship": "http://example.org/Ship"
    }
}

# Step 3: Combine to Form JSON-LD
json_ld_data_cleaned = {
    "@context": context_with_skos_and_xsd["@context"],
    "@graph": json_data_cleaned_dict
}

# Step 4: Load into RDFlib
g_cleaned = Graph()
g_cleaned.parse(data=json.dumps(json_ld_data_cleaned), format='json-ld')

# Step 5: Serialize to Turtle
turtle_data = g_cleaned.serialize(format='turtle')

In [43]:
# Function to pretty-print Turtle string with indentation and line returns
def pretty_print_turtle(turtle_str):
    lines = turtle_str.strip().split('\n')
    pretty_lines = []
    
    for line in lines:
        if ';' in line and '.' in line:
            triples = line.split(';')
            formatted_triples = ' ;\n    '.join(triple.strip() for triple in triples)
            pretty_lines.append(formatted_triples.rstrip(' .') + ' .' )
        else:
            pretty_lines.append(line)

    return '\n'.join(pretty_lines)

In [44]:
#| label: fig-turtleout
#| fig-cap: Knowledge Graph output in Turtle.
#| echo: false

# Initialize variables
result = []
current_statement = []
count_statements = 0

# Parse lines
for line in turtle_data.splitlines():
    # If the line is a prefix, add it directly to the result
    if line.startswith("@prefix"):
        result.append(line)
    else:
        # Accumulate lines for each statement
        current_statement.append(line.strip())
        if line.endswith("."):
            count_statements += 1
            result.append(" ".join(current_statement))
            current_statement = []
    
    # Stop if we have reached 2 statements
    if count_statements == 2:
        break

# Combine result lines into a string
turtle_str = "\n".join(result)
# Split the string by the end of triple statements, and then rejoin with line returns and indentation
# pretty_turtle_str = '.\n'.join([triplet.strip() for triplet in turtle_str.split('. ') if triplet.strip()])

# Pretty-print the Turtle string
pretty_turtle_str = pretty_print_turtle(turtle_str)
print(pretty_turtle_str)

@prefix ns1: <http://example.org/> .
<urn:uuid:0080c026-d273-450a-a687-5482a792447a> a ns1:Ship ;
    ns1:Class "Ticonderoga"^^<xsd:string> ;
    ns1:Commission_date "1991-01-12"^^<xsd:date> ;
    ns1:Homeport "Pearl Harbor, HI"^^<xsd:string> ;
    ns1:Hull_number "CG-65"^^<xsd:string> ;
    ns1:Ship_name "USS Chosin"^^<xsd:string> ;
    ns1:Type "Cruiser"^^<xsd:string> .
<urn:uuid:00d02856-4403-45c4-a056-1e9e7ec826ee> a ns1:Ship ;
    ns1:Class "Arleigh Burke"^^<xsd:string> ;
    ns1:Commission_date "1995-10-14"^^<xsd:date> ;
    ns1:Homeport "San Diego, CA"^^<xsd:string> ;
    ns1:Hull_number "DDG-62"^^<xsd:string> ;
    ns1:Ship_name "USS Fitzgerald"^^<xsd:string> ;
    ns1:Type "Destroyer"^^<xsd:string> .