# core

> jsonldnavigator 

# JSON-LD Navigation Tools for LLMs: Project Summary

## Motivation and Goals

We're developing a set of tools to help LLMs effectively navigate and understand JSON-LD documents, with a particular focus on Croissant metadata for ML datasets. The motivation is to:

1. Create semantic navigation capabilities for JSON-LD similar to what toolslm.md_hier provides for Markdown
2. Enable LLMs to access and understand complex semantic web data without requiring deep JSON-LD expertise
3. Support agentic workflows where LLMs can autonomously explore and extract meaning from JSON-LD documents
4. Make Croissant dataset metadata more accessible to LLM-based tools

## Approach

We're following a fast.ai-inspired methodology characterized by:

1. **Simple, focused functions** that do one thing well
2. **Minimal abstractions** with clean, pythonic interfaces
3. **Composable tools** that can be combined in flexible ways
4. **Leveraging LLM capabilities** rather than building complex parsing logic

Our approach uses prompt chaining to progressively build understanding of JSON-LD documents:

1. **Document Overview Analysis**: Identify types, properties, and the document's purpose
2. **Context Resolution**: Map namespace prefixes to URIs and understand vocabularies
3. **Type-Based Exploration**: Analyze each type and its relationships
4. **Path Generation**: Create navigation paths for accessing important properties
5. **Document Summary**: Synthesize a comprehensive understanding of the document
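
Step 2 (context resolution) can be illustrated without an LLM. The sketch below assumes a flat `@context` that maps prefixes to namespace strings. `expand_term` is a hypothetical helper, not part of this toolkit, and it ignores `@vocab`, nested contexts, and expanded term definitions, which a full processor such as pyld handles.

```python
def expand_term(term, context):
    "Expand a compact IRI like 'schema:Dataset' using a flat prefix map (hypothetical helper)"
    # Keywords (e.g. '@id') and absolute IRIs pass through unchanged
    if term.startswith('@') or '://' in term:
        return term
    prefix, sep, suffix = term.partition(':')
    ns = context.get(prefix)
    # Only expand when the prefix is declared as a plain string namespace
    if sep and isinstance(ns, str):
        return ns + suffix
    return term

context = {
    "schema": "https://schema.org/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
print(expand_term("schema:Dataset", context))  # https://schema.org/Dataset
print(expand_term("rdfs:label", context))      # http://www.w3.org/2000/01/rdf-schema#label
print(expand_term("name", context))            # name (no prefix, left as-is)
```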

## Implementation

We've built several key components:

1. **JSON-LD Analysis Class**: A container for analysis results with helper methods
2. **Namespace Resolution**: Tools for fetching and understanding vocabulary definitions
3. **Term Explanation**: Using LLMs to extract meaning from vocabulary definitions
4. **XML-Tagged Response Parsing**: Extracting structured information from LLM responses
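
As an illustration of the XML-tagged parsing idea, here is a minimal sketch using a regular expression. The helper name `extract_tag`, the tag names, and the sample response are assumptions made for this example; the actual prompts and tags may differ.

```python
import re

def extract_tag(text, tag):
    "Return the stripped contents of every <tag>...</tag> block in text (hypothetical helper)"
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    return [m.strip() for m in pattern.findall(text)]

# A made-up LLM response, for illustration only
response = """
<types>schema:Dataset, schema:DataDownload</types>
<purpose>
Describes a dataset and its downloadable files.
</purpose>
"""
print(extract_tag(response, "types"))    # ['schema:Dataset, schema:DataDownload']
print(extract_tag(response, "purpose"))  # ['Describes a dataset and its downloadable files.']
print(extract_tag(response, "missing"))  # []
```

Returning a list keeps the helper usable when a response repeats a tag, and an empty list signals that the tag was absent.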

We've integrated these with toolslm-inspired patterns for:
- Content fetching and parsing
- Caching results for efficiency
- Providing clean, notebook-friendly representations
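
One simple way to get the caching behaviour described above is `functools.lru_cache`. The sketch below is only an illustration: `fake_fetch` is a stand-in for a real network request (a real version might call `httpx.get`), and `fetch_vocab` is a hypothetical name.

```python
from functools import lru_cache

calls = []  # records which URLs were actually "fetched"

def fake_fetch(url):
    "Stand-in for a network request, used here so the example runs offline"
    calls.append(url)
    return f"<definition of {url}>"

@lru_cache(maxsize=128)
def fetch_vocab(url):
    "Fetch a vocabulary definition, reusing the cached result on repeat calls"
    return fake_fetch(url)

fetch_vocab("https://schema.org/Dataset")
fetch_vocab("https://schema.org/Dataset")  # served from the cache
fetch_vocab("https://schema.org/name")
print(len(calls))  # 2
```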

## Agentic Workflow Integration

The tools are designed to support agentic workflows by:

1. Providing ground truth from JSON-LD documents
2. Enabling semantic exploration of document structure
3. Supporting term definition lookup and explanation
4. Generating navigation paths for accessing data

The LLM acts as the semantic analyzer, while our code provides the scaffolding to make JSON-LD documents accessible and navigable.

The end goal is a clean, efficient toolkit that enables LLMs to work with JSON-LD data as easily as they work with natural language text.

In [None]:
#| default_exp core

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| export
from fastcore.foundation import patch  # @patch
from httpx import get as xget, post as xpost
from bs4 import BeautifulSoup as bs
from fastcore.utils import *
import json
from typing import Any, Dict, List, Optional, Union
from pyld import jsonld  # For JSON-LD processing
from claudette import *

In [None]:
#| export
class JsonLdNavigator:
    "Navigator for JSON-LD documents with semantic understanding"
    def __init__(self, jsonld_data):
        "Initialize with JSON-LD data"
        self.data = jsonld_data if not isinstance(jsonld_data, str) else json.loads(jsonld_data)
        self.index = create_schema_index(self.data)
        
    def get_class(self, class_id):
        "Get details about a class"
        return get_class_details(self.data, class_id)
    
    def get_property(self, property_id):
        "Get details about a property"
        return get_property_details(self.data, property_id)
    
    def get_properties_for(self, class_id, include_inherited=True):
        "Get properties applicable to a class"
        return get_class_properties(self.data, class_id, include_inherited)
    
    def find_related(self, class_id):
        "Find classes related to the specified class"
        return find_related_classes(self.data, class_id)
    
    def search(self, query):
        "Search for classes or properties matching query"
        return search_terms(self.data, query)
    
    def find_path(self, source_class, target_class, max_depth=2):
        "Find property paths between classes"
        return get_property_path(self.data, source_class, target_class, max_depth)
    
    def show_index(self):
        "Show a summary of the index"
        vocab = self.index["vocabularies"]["schema.org"]
        print("Schema.org Index:")
        print(f"- Classes: {vocab['class_count']}")
        print(f"- Properties: {vocab['property_count']}")
        print("\nAvailable Affordances:")
        for name, desc in self.index["affordances"].items():
            print(f"- {name}: {desc}")
        
    def __repr__(self):
        vocab = self.index["vocabularies"]["schema.org"]
        return f"JsonLdNavigator(classes={vocab['class_count']}, properties={vocab['property_count']})"

In [None]:
#| export
# Core indexing function
def create_schema_index(jsonld_data):
    "Create a semantic index from JSON-LD data"
    index = {
        "vocabularies": {
            "schema.org": {
                "prefix": "schema",
                "namespace": "https://schema.org/",
                "description": "Schema.org vocabulary for structured data",
                "classes": [],
                "properties": [],
                "class_count": 0,
                "property_count": 0
            }
        },
        "class_hierarchy": {},  # Tree structure of classes
        "property_groups": {},  # Properties grouped by domain/function
        "affordances": {
            "get_class(class_id)": "Get detailed information about a class",
            "get_property(property_id)": "Get detailed information about a property",
            "get_properties_for(class_id, include_inherited=True)": "Get properties for a class",
            "find_related(class_id)": "Find classes semantically related to this class",
            "search(query)": "Search for classes or properties matching a query",
            "find_path(source_class, target_class)": "Find property paths between classes"
        }
    }
    
    # Process classes and build class hierarchy
    class_hierarchy = {}
    for item in jsonld_data.get('@graph', []):
        if item.get('@type') == 'rdfs:Class':
            class_id = item.get('@id')
            class_name = safe_text(item.get('rdfs:label'))
            
            if class_name:
                index["vocabularies"]["schema.org"]["classes"].append(class_name)
                index["vocabularies"]["schema.org"]["class_count"] += 1
                
                # Add to class hierarchy; rdfs:subClassOf may be one node or a list of nodes
                for parent in L(item.get('rdfs:subClassOf')):
                    parent_class = parent.get('@id') if isinstance(parent, dict) else None
                    if parent_class:
                        class_hierarchy.setdefault(parent_class, []).append(class_id)
    
    index["class_hierarchy"] = class_hierarchy
    
    # Process properties and group them
    property_groups = {
        "identification": [],  # Properties for identifying things
        "descriptive": [],     # Properties for describing things
        "relational": [],      # Properties linking to other entities
        "temporal": [],        # Time-related properties
        "other": []            # Other properties
    }
    
    for item in jsonld_data.get('@graph', []):
        if item.get('@type') == 'rdf:Property':
            prop_id = item.get('@id')
            prop_name = safe_text(item.get('rdfs:label'))
            
            if prop_name:
                index["vocabularies"]["schema.org"]["properties"].append(prop_name)
                index["vocabularies"]["schema.org"]["property_count"] += 1
                
                # Simple property categorization
                if prop_name in ['identifier', 'url', 'name', 'id']:
                    property_groups["identification"].append(prop_id)
                elif prop_name in ['description', 'text', 'keywords']:
                    property_groups["descriptive"].append(prop_id)
                elif any(x in prop_name.lower() for x in ['date', 'time', 'duration']):
                    property_groups["temporal"].append(prop_id)
                else:
                    # Check if it links to another entity by examining ranges
                    ranges = item.get('schema:rangeIncludes', [])
                    if not isinstance(ranges, list): ranges = [ranges]
                    
                    if any('schema:' in r.get('@id', '') for r in ranges if r):
                        property_groups["relational"].append(prop_id)
                    else:
                        property_groups["other"].append(prop_id)
    
    index["property_groups"] = property_groups
    
    return index
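
The index above compares `@type` to a single string. In JSON-LD, `@type` may also be a list, and nodes typed that way would be skipped by an equality check. A minimal sketch of a more tolerant test follows; `has_type` is a hypothetical helper that is not part of the exported module, and the sample nodes are illustrative.

```python
def has_type(item, type_id):
    "True if a JSON-LD node declares type_id, whether @type is a string or a list"
    types = item.get('@type', [])
    if not isinstance(types, list): types = [types]
    return type_id in types

# Illustrative nodes: one with a single type, one with several
single = {"@id": "ex:Dataset", "@type": "rdfs:Class"}
multi = {"@id": "ex:Text", "@type": ["rdfs:Class", "ex:DataType"]}

print(has_type(single, "rdfs:Class"))   # True
print(has_type(multi, "rdfs:Class"))    # True
print(has_type(multi, "rdf:Property"))  # False
```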


In [None]:
#| export
# Core navigation functions
def get_class_details(jsonld_data, class_id):
    "Get detailed information about a specific class"
    for item in jsonld_data.get('@graph', []):
        if item.get('@id') == class_id:
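            # rdfs:subClassOf may be a single node or a list; report the first parent
            parents = [p.get('@id') for p in L(item.get('rdfs:subClassOf')) if isinstance(p, dict)]
            return {
                "id": item.get('@id'),
                "label": item.get('rdfs:label'),
                "description": item.get('rdfs:comment'),
                "parent_class": parents[0] if parents else None,
                "equivalent_classes": [c.get('@id') for c in L(item.get('owl:equivalentClass'))]
            }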
    return {"error": f"Class {class_id} not found"}

In [None]:
#| export
def get_property_details(jsonld_data, property_id):
    "Get detailed information about a specific property"
    for item in jsonld_data.get('@graph', []):
        if item.get('@id') == property_id:
            # Handle domains and ranges which might be lists or single items
            domains = item.get('schema:domainIncludes', [])
            if not isinstance(domains, list): domains = [domains]
            domain_ids = [d.get('@id') for d in domains if d]
            
            ranges = item.get('schema:rangeIncludes', [])
            if not isinstance(ranges, list): ranges = [ranges]
            range_ids = [r.get('@id') for r in ranges if r]
            
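            # rdfs:subPropertyOf may be a single node or a list; report the first one
            supers = [p.get('@id') for p in L(item.get('rdfs:subPropertyOf')) if isinstance(p, dict)]
            return {
                "id": item.get('@id'),
                "label": item.get('rdfs:label'),
                "description": item.get('rdfs:comment'),
                "domains": domain_ids,
                "ranges": range_ids,
                "subproperty_of": supers[0] if supers else None
            }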
    return {"error": f"Property {property_id} not found"}


In [None]:
#| export
def get_class_properties(jsonld_data, class_id, include_inherited=True):
    "Get properties applicable to a class"
    properties = []
    
    # Get class hierarchy if including inherited properties
    class_hierarchy = [class_id]
    if include_inherited:
        current_class = get_class_details(jsonld_data, class_id)
        while current_class and 'error' not in current_class:
            parent_class = current_class.get('parent_class')
            if parent_class and parent_class not in class_hierarchy:
                class_hierarchy.append(parent_class)
                current_class = get_class_details(jsonld_data, parent_class)
            else:
                break
    
    # Find properties for each class in the hierarchy
    for item in jsonld_data.get('@graph', []):
        if item.get('@type') == 'rdf:Property':
            domains = item.get('schema:domainIncludes', [])
            if not isinstance(domains, list): domains = [domains]
            domain_ids = [d.get('@id') for d in domains if d]
            
            # Check if any class in the hierarchy is in the domains
            if any(cls_id in domain_ids for cls_id in class_hierarchy):
                comment = item.get('rdfs:comment')
                prop = {
                    "id": item.get('@id'),
                    "label": item.get('rdfs:label'),
                    # Truncate long descriptions to keep listings compact
                    "description": comment[:100] + "..." if comment and len(comment) > 100 else comment,
                    # Inherited if the property is declared on an ancestor rather than on class_id itself
                    "inherited": class_id not in domain_ids
                }
                properties.append(prop)
    
    return properties
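
The hierarchy walk above follows one parent per class. Because `rdfs:subClassOf` can list several parents, a fuller traversal would visit all of them. The following is a minimal sketch under that assumption; `ancestors` is a hypothetical helper and the `ex:` graph is made-up sample data.

```python
def ancestors(graph, class_id):
    "Collect all ancestor ids of class_id, following every rdfs:subClassOf link"
    by_id = {item.get('@id'): item for item in graph}
    seen, stack = [], [class_id]
    while stack:
        current = stack.pop()
        parents = by_id.get(current, {}).get('rdfs:subClassOf', [])
        if not isinstance(parents, list): parents = [parents]
        for parent in parents:
            pid = parent.get('@id')
            # Record each ancestor once, which also guards against cycles
            if pid and pid not in seen:
                seen.append(pid)
                stack.append(pid)
    return seen

graph = [
    {"@id": "ex:LocalBusiness", "rdfs:subClassOf": [{"@id": "ex:Organization"}, {"@id": "ex:Place"}]},
    {"@id": "ex:Organization", "rdfs:subClassOf": {"@id": "ex:Thing"}},
    {"@id": "ex:Place", "rdfs:subClassOf": {"@id": "ex:Thing"}},
    {"@id": "ex:Thing"},
]
print(sorted(ancestors(graph, "ex:LocalBusiness")))
# ['ex:Organization', 'ex:Place', 'ex:Thing']
```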



In [None]:
#| export
def find_related_classes(jsonld_data, class_id):
    """Find classes related to the specified class through properties"""
    related = {
        "parent_classes": [],
        "child_classes": [],
        "referenced_by": [],
        "references": []
    }
    
    # Find parent classes (rdfs:subClassOf may be a single node or a list)
    for item in jsonld_data.get('@graph', []):
        if item.get('@id') == class_id and item.get('@type') == 'rdfs:Class':
            for parent in L(item.get('rdfs:subClassOf')):
                if isinstance(parent, dict) and parent.get('@id'):
                    related["parent_classes"].append(parent['@id'])
    
    # Find child classes
    for item in jsonld_data.get('@graph', []):
        if item.get('@type') == 'rdfs:Class':
            parents = [p.get('@id') for p in L(item.get('rdfs:subClassOf')) if isinstance(p, dict)]
            if class_id in parents:
                related["child_classes"].append(item.get('@id'))
    
    # Find classes referenced by this class's properties
    properties = get_class_properties(jsonld_data, class_id)
    for prop in properties:
        prop_details = get_property_details(jsonld_data, prop.get('id'))
        for range_class in prop_details.get('ranges', []):
            if range_class not in related["references"]:
                related["references"].append(range_class)
    
    # Find classes that reference this class
    for item in jsonld_data.get('@graph', []):
        if item.get('@type') == 'rdf:Property':
            domains = item.get('schema:domainIncludes', [])
            if not isinstance(domains, list): domains = [domains]
            domain_ids = [d.get('@id') for d in domains if d]
            
            ranges = item.get('schema:rangeIncludes', [])
            if not isinstance(ranges, list): ranges = [ranges]
            range_ids = [r.get('@id') for r in ranges if r]
            
            if class_id in range_ids:
                for domain in domain_ids:
                    if domain not in related["referenced_by"]:
                        related["referenced_by"].append(domain)
    
    return related


In [None]:
#| export
def search_terms(jsonld_data, query):
    """Search for classes or properties matching a query string"""
    results = {
        "classes": [],
        "properties": []
    }
    
    query = query.lower()
    
    for item in jsonld_data.get('@graph', []):
        label = str(item.get('rdfs:label', '')).lower()
        description = str(item.get('rdfs:comment', '')).lower()
        
        if query in label or query in description:
            result = {
                "id": item.get('@id'),
                "label": item.get('rdfs:label'),
                "description": item.get('rdfs:comment')[:100] + "..." 
                    if item.get('rdfs:comment') and len(item.get('rdfs:comment')) > 100 
                    else item.get('rdfs:comment')
            }
            
            if item.get('@type') == 'rdfs:Class':
                results["classes"].append(result)
            elif item.get('@type') == 'rdf:Property':
                results["properties"].append(result)
    
    return results



In [None]:
#| export
def get_property_path(jsonld_data, source_class, target_class, max_depth=2):
    """Find property paths between two classes"""
    paths = []
    
    # Get properties of the source class
    properties = get_class_properties(jsonld_data, source_class)
    
    # First level - direct connections
    for prop in properties:
        prop_details = get_property_details(jsonld_data, prop.get('id'))
        if target_class in prop_details.get('ranges', []):
            paths.append({
                "path": [{"class": source_class, "property": prop.get('id'), "target": target_class}],
                "description": f"{source_class} → {prop.get('label')} → {target_class}"
            })
    
    # Second level - if max_depth > 1
    if max_depth > 1 and not paths:
        for prop in properties:
            prop_details = get_property_details(jsonld_data, prop.get('id'))
            for intermediate_class in prop_details.get('ranges', []):
                # Skip if not a class (like literal values)
                if not intermediate_class.startswith('schema:'):
                    continue
                    
                # Check if this intermediate class has properties to the target
                intermediate_properties = get_class_properties(jsonld_data, intermediate_class)
                for int_prop in intermediate_properties:
                    int_prop_details = get_property_details(jsonld_data, int_prop.get('id'))
                    if target_class in int_prop_details.get('ranges', []):
                        paths.append({
                            "path": [
                                {"class": source_class, "property": prop.get('id'), "target": intermediate_class},
                                {"class": intermediate_class, "property": int_prop.get('id'), "target": target_class}
                            ],
                            "description": f"{source_class} → {prop.get('label')} → {intermediate_class} → {int_prop.get('label')} → {target_class}"
                        })
    
    return paths
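
The function above handles one and two hops explicitly, so `max_depth` values above 2 have no further effect. A breadth-first search generalizes the same idea to any depth. The sketch below works on a simplified edge map (class id to `(property, target)` pairs) rather than on raw JSON-LD; `find_paths` and the `ex:` sample data are hypothetical. Unlike the function above, it returns every path up to `max_depth`, not only the shortest level found.

```python
from collections import deque

def find_paths(edges, source, target, max_depth=2):
    "Breadth-first search for property paths from source to target, up to max_depth hops"
    paths, queue = [], deque([(source, [])])
    while queue:
        current, path = queue.popleft()
        if len(path) >= max_depth:
            continue
        for prop, nxt in edges.get(current, []):
            # Skip classes already on this path to avoid cycles
            if nxt == source or any(step[2] == nxt for step in path):
                continue
            new_path = path + [(current, prop, nxt)]
            if nxt == target:
                paths.append(new_path)
            else:
                queue.append((nxt, new_path))
    return paths

edges = {
    "ex:Dataset": [("ex:distribution", "ex:DataDownload"), ("ex:creator", "ex:Person")],
    "ex:DataDownload": [("ex:encodingFormat", "ex:Text")],
    "ex:Person": [("ex:worksFor", "ex:Organization")],
}
for p in find_paths(edges, "ex:Dataset", "ex:Text"):
    print(" → ".join([p[0][0]] + [f"{prop} → {nxt}" for _, prop, nxt in p]))
# ex:Dataset → ex:distribution → ex:DataDownload → ex:encodingFormat → ex:Text
```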




In [None]:
#| export
# Helper function to always return a list
def L(x): return x if isinstance(x, list) else [x] if x is not None else []


In [None]:
#| export
def safe_text(value):
    "Safely extract text from a JSON-LD value which might be a string or a dict"
    if value is None:
        return ""
    if isinstance(value, dict):
        return str(value.get('@value', ''))
    return str(value)

In [None]:
#| hide
def fetch_schema_org_term_info(term):
    "Fetch detailed information about a Schema.org term using BeautifulSoup"
    url = f"https://schema.org/{term}"
    response = xget(url)
    
    if response.status_code != 200:
        return {"error": f"Failed to fetch {url}, status code: {response.status_code}"}
    
    # Parse the HTML
    soup = bs(response.text, 'html.parser')
    
    # Look for JSON-LD script in the page
    jsonld_script = soup.find('script', {'type': 'application/ld+json'})
    
    if jsonld_script:
        try:
            # Extract and parse the JSON-LD data
            jsonld_data = json.loads(jsonld_script.string)
            return {
                "source": "schema.org",
                "term": term,
                "jsonld": jsonld_data,
                "url": url
            }
        except json.JSONDecodeError:
            return {"error": "Failed to parse JSON-LD from the page"}
    
    # If no JSON-LD, extract basic info from HTML
    description = soup.find('div', class_='description')
    if description:
        return {
            "source": "schema.org",
            "term": term,
            "description": description.text.strip(),
            "url": url
        }
    
    return {"error": "Could not extract term information"}

# Let's try it with "Dataset"
dataset_info = fetch_schema_org_term_info("Dataset")
dataset_info


{'source': 'schema.org',
 'term': 'Dataset',
 'jsonld': {'@context': {'brick': 'https://brickschema.org/schema/Brick#',
   'csvw': 'http://www.w3.org/ns/csvw#',
   'dc': 'http://purl.org/dc/elements/1.1/',
   'dcam': 'http://purl.org/dc/dcam/',
   'dcat': 'http://www.w3.org/ns/dcat#',
   'dcmitype': 'http://purl.org/dc/dcmitype/',
   'dcterms': 'http://purl.org/dc/terms/',
   'doap': 'http://usefulinc.com/ns/doap#',
   'foaf': 'http://xmlns.com/foaf/0.1/',
   'odrl': 'http://www.w3.org/ns/odrl/2/',
   'org': 'http://www.w3.org/ns/org#',
   'owl': 'http://www.w3.org/2002/07/owl#',
   'prof': 'http://www.w3.org/ns/dx/prof/',
   'prov': 'http://www.w3.org/ns/prov#',
   'qb': 'http://purl.org/linked-data/cube#',
   'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
   'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
   'schema': 'https://schema.org/',
   'sh': 'http://www.w3.org/ns/shacl#',
   'skos': 'http://www.w3.org/2004/02/skos/core#',
   'sosa': 'http://www.w3.org/ns/sosa/',
   's

In [None]:
# Example usage
nav = JsonLdNavigator(dataset_info['jsonld'])
nav.show_index()

Schema.org Index:
- Classes: 3
- Properties: 136

Available Affordances:
- get_class(class_id): Get detailed information about a class
- get_property(property_id): Get detailed information about a property
- get_properties_for(class_id, include_inherited=True): Get properties for a class
- find_related(class_id): Find classes semantically related to this class
- search(query): Search for classes or properties matching a query
- find_path(source_class, target_class): Find property paths between classes


In [None]:
# Get Dataset class details
dataset = nav.get_class('schema:Dataset')
print(f"\nDataset: {dataset['label']}")
print(f"Description: {dataset['description'][:100]}...")


Dataset: Dataset
Description: A body of structured information describing some topic(s) of interest....


In [None]:
# Find related classes
related = nav.find_related('schema:Dataset')
print(f"\nClasses related to Dataset:")
print(f"- Parent classes: {related['parent_classes']}")
print(f"- Referenced classes: {related['references'][:3]}...")


Classes related to Dataset:
- Parent classes: ['schema:CreativeWork']
- Referenced classes: ['schema:CreativeWork', 'schema:URL', 'schema:Product']...


In [None]:
# Search for a term
results = nav.search('catalog')
print(f"\nSearch results for 'catalog':")
print(f"- Found {len(results['properties'])} properties and {len(results['classes'])} classes")


Search results for 'catalog':
- Found 4 properties and 0 classes


In [None]:
#| hide
import nbdev; nbdev.nbdev_export()