# 07 - Graph Model JSON Generator

**Epic:** F5 - Fabric Graph Integration  
**Feature:** F5.1 - Graph Model JSON Generator  
**Priority:** P0

## Purpose

Generate a Fabric Graph Model JSON definition from the translated schema. The resulting JSON file defines the graph schema for import into a Fabric Graph database.

## Input

- `silver_node_types` - Node type definitions from RDF classes
- `silver_properties` - Property definitions (datatype and object)

## Output

- Graph Model JSON file saved to `Files/graph_models/`
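
For orientation, the JSON this notebook writes has roughly the shape sketched below. The field names mirror the dictionaries assembled later in the notebook; the `Person`, `Company`, `age`, and `worksFor` entries and the timestamp are made-up placeholders. This is the notebook's own layout, not a confirmed Fabric Graph schema.

```python
import json

# Illustrative model only: Person, Company, age and worksFor are placeholder names.
example_model = {
    "name": "RdfTranslatedGraph",
    "version": "1.0",
    "created": "2024-01-01T00:00:00Z",  # placeholder timestamp
    "nodes": [
        {
            "name": "Person",
            "properties": [
                {"name": "uri", "type": "string"},  # every node carries its RDF IRI
                {"name": "age", "type": "int"},
            ],
        },
        {"name": "Company", "properties": [{"name": "uri", "type": "string"}]},
    ],
    "edges": [
        {"name": "worksFor", "source": "Person", "target": "Company", "properties": []}
    ],
}

print(json.dumps(example_model, indent=2))
```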

## Key Features

- Maps RDF datatypes to Fabric Graph types
- Sanitizes names for Graph API compatibility
- Validates JSON structure before saving

## Setup

In [None]:
import json
import re
import os
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

## Configuration

In [None]:
# Graph model configuration
GRAPH_MODEL_NAME = "RdfTranslatedGraph"
GRAPH_MODEL_VERSION = "1.0"

# Output path (relative to lakehouse Files)
OUTPUT_DIR = "Files/graph_models"

# Reserved words that cannot be used as names in Fabric Graph
RESERVED_WORDS = {
    "node", "edge", "graph", "vertex", "source", "target",
    "id", "label", "type", "name", "properties"
}

## Verify Required Tables Exist

In [None]:
# Check that required tables exist before proceeding (using Spark catalog)
required_tables = ["silver_node_types", "silver_properties"]
missing_tables = []

for table in required_tables:
    try:
        # Try to read the table from Spark catalog
        spark.table(table).limit(1)
    except Exception:
        missing_tables.append(table)

if missing_tables:
    print("ERROR: Required tables not found:")
    for t in missing_tables:
        print(f"  - {t}")
    print("\nPlease run the following notebooks first:")
    print("  - 01_rdf_parser.ipynb (creates bronze_triples)")
    print("  - 02_schema_analyzer.ipynb (creates bronze_schema_analysis)")
    print("  - 03_class_mapper.ipynb (creates silver_node_types)")
    print("  - 04_property_mapper.ipynb (creates silver_properties)")
    raise RuntimeError(f"Missing required tables: {missing_tables}")
else:
    print("All required tables exist")

## Helper Functions

In [None]:
def map_rdf_to_fabric_type(rdf_datatype: str) -> str:
    """
    Map RDF/XSD datatypes to Fabric Graph types.
    
    Fabric Graph supports: string, int, double, boolean, datetime
    """
    if rdf_datatype is None:
        return "string"  # Default to string

    # Compare only the local name (after the last '#', '/' or ':'), so substrings
    # elsewhere in the IRI (e.g. "int" in "http://example.org/points#x") cannot match
    local_name = re.split(r"[#/:]", rdf_datatype.strip().strip("<>"))[-1].lower()

    # String types
    if local_name in {"string", "langstring", "anyuri", "token", "normalizedstring"}:
        return "string"

    # Integer types
    if local_name in {"integer", "int", "long", "short", "byte", "nonpositiveinteger", "nonnegativeinteger", "positiveinteger", "negativeinteger", "unsignedlong", "unsignedint", "unsignedshort", "unsignedbyte"}:
        return "int"

    # Double/Float types
    if local_name in {"double", "float", "decimal"}:
        return "double"

    # Boolean
    if local_name == "boolean":
        return "boolean"

    # DateTime types
    if local_name in {"datetime", "datetimestamp", "date", "time", "gyear", "gyearmonth", "gmonth", "gmonthday", "gday"}:
        return "datetime"

    # Default to string for unknown types
    return "string"


def sanitize_name(name: str) -> str:
    """
    Sanitize a name for Fabric Graph compatibility.
    
    - Remove special characters
    - Replace spaces with underscores
    - Handle reserved words
    - Ensure name starts with letter
    """
    if name is None:
        return "Unknown"
    
    # Remove namespace prefixes if present (take local name)
    if ":" in name and not name.startswith("http"):
        name = name.split(":")[-1]
    
    # Replace spaces and hyphens with underscores
    name = re.sub(r"[\s-]+", "_", name)
    
    # Remove special characters (keep only alphanumeric and underscore)
    name = re.sub(r"[^a-zA-Z0-9_]", "", name)
    
    # Ensure starts with letter
    if name and not name[0].isalpha():
        name = "N_" + name
    
    # Handle reserved words
    if name.lower() in RESERVED_WORDS:
        name = name + "_type"
    
    return name if name else "Unknown"


def validate_graph_model(model: dict) -> tuple[bool, list[str]]:
    """
    Validate the graph model structure.
    
    Returns tuple of (is_valid, list of errors)
    """
    errors = []
    
    # Check required top-level fields
    if "name" not in model:
        errors.append("Missing required field: name")
    if "nodes" not in model:
        errors.append("Missing required field: nodes")
    if "edges" not in model:
        errors.append("Missing required field: edges")
    
    # Validate nodes
    node_names = set()
    for i, node in enumerate(model.get("nodes", [])):
        if "name" not in node:
            errors.append(f"Node {i}: missing name")
        else:
            if node["name"] in node_names:
                errors.append(f"Duplicate node name: {node['name']}")
            node_names.add(node["name"])
        
        if "properties" not in node:
            errors.append(f"Node {node.get('name', i)}: missing properties array")
        else:
            for j, prop in enumerate(node["properties"]):
                if "name" not in prop:
                    errors.append(f"Node {node.get('name', i)}, property {j}: missing name")
                if "type" not in prop:
                    errors.append(f"Node {node.get('name', i)}, property {j}: missing type")
    
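    # Validate edges
    # Uniqueness is checked per (name, source, target), matching how the edge
    # builder deduplicates: one object property can yield several same-named edges
    edge_keys = set()
    for i, edge in enumerate(model.get("edges", [])):
        if "name" not in edge:
            errors.append(f"Edge {i}: missing name")
        else:
            edge_key = (edge["name"], edge.get("source"), edge.get("target"))
            if edge_key in edge_keys:
                errors.append(f"Duplicate edge: {edge['name']} ({edge.get('source')} -> {edge.get('target')})")
            edge_keys.add(edge_key)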
        
        if "source" not in edge:
            errors.append(f"Edge {edge.get('name', i)}: missing source")
        elif edge["source"] not in node_names:
            errors.append(f"Edge {edge.get('name', i)}: source '{edge['source']}' not in node types")
        
        if "target" not in edge:
            errors.append(f"Edge {edge.get('name', i)}: missing target")
        elif edge["target"] not in node_names:
            errors.append(f"Edge {edge.get('name', i)}: target '{edge['target']}' not in node types")
    
    return len(errors) == 0, errors
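
Because sanitization strips characters, two different source names can collapse to the same sanitized name, which later surfaces as a duplicate-name validation error. Below is a minimal sketch of detecting such collisions up front. `simple_sanitize` is a simplified stand-in for `sanitize_name` (it skips prefix and reserved-word handling), and `find_collisions` is a hypothetical helper that is not part of this notebook.

```python
import re
from collections import defaultdict


def simple_sanitize(name: str) -> str:
    """Simplified stand-in for sanitize_name: underscores for spaces/hyphens, drop other symbols."""
    name = re.sub(r"[\s-]+", "_", name)
    return re.sub(r"[^a-zA-Z0-9_]", "", name)


def find_collisions(names: list[str]) -> dict[str, list[str]]:
    """Return each sanitized name that is produced by more than one input name."""
    groups = defaultdict(list)
    for original in names:
        groups[simple_sanitize(original)].append(original)
    return {clean: originals for clean, originals in groups.items() if len(originals) > 1}


# Invented input names, chosen so that two pairs collide
collisions = find_collisions(["Sales Order", "Sales-Order", "Customer", "Cust@omer!"])
print(collisions)
```

Running a check like this on the node type and property names before building the model makes the cause of a later duplicate-name error easier to trace.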

## Load Schema Data

In [None]:
# Load node types
df_node_types = spark.table("silver_node_types")
print(f"Loaded {df_node_types.count()} node types")
df_node_types.show(truncate=False)

In [None]:
# Load properties
df_properties = spark.table("silver_properties")
print(f"Loaded {df_properties.count()} properties")
df_properties.show(truncate=False)

In [None]:
# Separate datatype and object properties
df_datatype_props = df_properties.filter(F.col("property_type") == "datatype")
df_object_props = df_properties.filter(F.col("property_type") == "object")

print(f"Datatype properties: {df_datatype_props.count()}")
print(f"Object properties: {df_object_props.count()}")

## Build Node Definitions

In [None]:
# Collect node types
node_types = df_node_types.select("node_type", "class_uri").collect()
node_type_set = {row["node_type"] for row in node_types}

# Collect datatype properties with their domains
datatype_props = df_datatype_props.select(
    "property_name", "data_type", "source_types"
).collect()

print(f"Building definitions for {len(node_types)} node types")
print(f"Using {len(datatype_props)} datatype properties")

In [None]:
# Build node definitions with properties
nodes = []

for node_row in node_types:
    node_type = node_row["node_type"]
    class_uri = node_row["class_uri"]
    
    # Sanitize node name
    node_name = sanitize_name(node_type)
    
    # Find properties for this node type
    # Check if node type is in source_types array
    node_properties = []
    
    for prop_row in datatype_props:
        source_types = prop_row["source_types"] or []
        
        # Check if this property applies to this node type
        # Match either by exact node_type or by class_uri IRI
        if node_type in source_types or class_uri in source_types:
            prop_name = sanitize_name(prop_row["property_name"])
            prop_type = map_rdf_to_fabric_type(prop_row["data_type"])
            
            # Avoid duplicate properties
            if not any(p["name"] == prop_name for p in node_properties):
                node_properties.append({
                    "name": prop_name,
                    "type": prop_type
                })
    
    # Add default 'uri' property for the original RDF IRI
    node_properties.insert(0, {"name": "uri", "type": "string"})
    
    nodes.append({
        "name": node_name,
        "properties": node_properties
    })

print(f"Built {len(nodes)} node definitions")
for node in nodes[:5]:  # Show first 5
    print(f"  {node['name']}: {len(node['properties'])} properties")
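
The matching logic above can be exercised without Spark. The sketch below uses plain dictionaries in place of the collected rows. The `Person` and `Company` types, the `http://example.org/` IRIs, and the property rows are invented for illustration. Type mapping and name sanitization are omitted to keep it short, and all variables carry a `toy_` prefix so the notebook's real variables are not overwritten.

```python
# Toy stand-ins for the collected Spark rows (all values are invented).
toy_node_types = [
    {"node_type": "Person", "class_uri": "http://example.org/Person"},
    {"node_type": "Company", "class_uri": "http://example.org/Company"},
]
toy_datatype_props = [
    {"property_name": "age", "data_type": "int", "source_types": ["Person"]},
    # Domain given as an IRI rather than a node type name
    {"property_name": "founded", "data_type": "datetime", "source_types": ["http://example.org/Company"]},
    # No domain recorded: this property attaches to no node
    {"property_name": "orphan", "data_type": "string", "source_types": None},
]

toy_nodes = []
for node in toy_node_types:
    properties = [{"name": "uri", "type": "string"}]  # default IRI property comes first
    for prop in toy_datatype_props:
        domains = prop["source_types"] or []
        if node["node_type"] in domains or node["class_uri"] in domains:
            properties.append({"name": prop["property_name"], "type": prop["data_type"]})
    toy_nodes.append({"name": node["node_type"], "properties": properties})

for node in toy_nodes:
    print(node["name"], [p["name"] for p in node["properties"]])
```

Note that a property with no recorded domain, such as `orphan` here, is not attached to any node.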

## Build Edge Definitions

In [None]:
# Collect object properties (these become edges)
object_props = df_object_props.select(
    "property_name", "source_types", "target_types"
).collect()

print(f"Processing {len(object_props)} object properties for edge definitions")

In [None]:
# Build edge definitions
edges = []
seen_edges = set()  # Track unique (name, source, target) combinations

# Get sanitized node names for lookup
node_name_map = {sanitize_name(row["node_type"]): sanitize_name(row["node_type"]) for row in node_types}
class_uri_to_node = {row["class_uri"]: sanitize_name(row["node_type"]) for row in node_types}

for prop_row in object_props:
    edge_name = sanitize_name(prop_row["property_name"])
    source_types = prop_row["source_types"] or []
    target_types = prop_row["target_types"] or []
    
    # Skip if no targets specified
    if not target_types:
        continue
    
    # Create edge for each source-target combination
    for source_type in source_types:
        source = class_uri_to_node.get(source_type) or node_name_map.get(sanitize_name(source_type))
        
        if not source:
            continue  # Skip if source not found
        
        for target_type in target_types:
            target = class_uri_to_node.get(target_type) or node_name_map.get(sanitize_name(target_type))
            
            if not target:
                continue  # Skip if target not found
            
            # Avoid duplicates
            edge_key = (edge_name, source, target)
            if edge_key in seen_edges:
                continue
            seen_edges.add(edge_key)
            
            edges.append({
                "name": edge_name,
                "source": source,
                "target": target,
                "properties": []
            })

print(f"Built {len(edges)} edge definitions")
for edge in edges[:5]:  # Show first 5
    print(f"  {edge['source']} --[{edge['name']}]--> {edge['target']}")
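
Each object property expands into one edge per resolvable (source, target) pair, and pairs whose endpoints do not resolve to a known node type are dropped without a message. Here is a Spark-free sketch of that expansion, using invented names and a hypothetical `expand_edges` helper:

```python
from itertools import product


def expand_edges(edge_name, source_types, target_types, known_nodes):
    """Build one edge per (source, target) pair whose endpoints are both known node types."""
    edges = []
    seen = set()
    for source, target in product(source_types, target_types):
        if source not in known_nodes or target not in known_nodes:
            continue  # unresolved endpoint: skipped, mirroring the loop above
        key = (edge_name, source, target)
        if key in seen:
            continue  # duplicate combination
        seen.add(key)
        edges.append({"name": edge_name, "source": source, "target": target, "properties": []})
    return edges


toy_edges = expand_edges(
    "worksFor",
    source_types=["Person", "Contractor", "Person"],  # repeated entry is deduplicated
    target_types=["Company", "Ghost"],                # "Ghost" is not a known node type
    known_nodes={"Person", "Contractor", "Company"},
)
for edge in toy_edges:
    print(f"{edge['source']} --[{edge['name']}]--> {edge['target']}")
```

Since skipped pairs leave no trace, comparing the number of candidate pairs with the number of edges built is one way to spot object properties whose domain or range did not map to a node type.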

## Assemble Graph Model JSON

In [None]:
# Create the graph model
graph_model = {
    "name": GRAPH_MODEL_NAME,
    "version": GRAPH_MODEL_VERSION,
    "created": datetime.utcnow().isoformat() + "Z",
    "nodes": nodes,
    "edges": edges
}

# Display summary
print(f"Graph Model: {graph_model['name']} v{graph_model['version']}")
print(f"  Node types: {len(graph_model['nodes'])}")
print(f"  Edge types: {len(graph_model['edges'])}")
print(f"  Total properties: {sum(len(n['properties']) for n in nodes)}")

In [None]:
# Validate the model
is_valid, errors = validate_graph_model(graph_model)

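if is_valid:
    print("Graph model validation passed")
else:
    print(f"Graph model validation failed with {len(errors)} errors:")
    for error in errors[:10]:  # Show first 10 errors
        print(f"  - {error}")
    if len(errors) > 10:
        print(f"  ... and {len(errors) - 10} more errors")
    # Stop here so an invalid model is never written to disk
    raise ValueError(f"Graph model validation failed with {len(errors)} errors")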

## Save Graph Model JSON

In [None]:
# Create output directory if needed
# In Fabric, the default lakehouse is mounted at /lakehouse/default,
# so OUTPUT_DIR ("Files/graph_models") resolves beneath it
output_base = os.path.join("/lakehouse/default", OUTPUT_DIR)
os.makedirs(output_base, exist_ok=True)

# Generate filename with timestamp
timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
filename = f"graph_model_{GRAPH_MODEL_NAME}_{timestamp}.json"
output_path = os.path.join(output_base, filename)

# Also save a "latest" version for easy access
latest_path = os.path.join(output_base, f"graph_model_{GRAPH_MODEL_NAME}_latest.json")

print("Output paths:")
print(f"  Timestamped: {output_path}")
print(f"  Latest: {latest_path}")

In [None]:
# Write the JSON file
json_output = json.dumps(graph_model, indent=2)

# Save timestamped version
with open(output_path, "w", encoding="utf-8") as f:
    f.write(json_output)
print(f"Saved timestamped model: {output_path}")

# Save latest version
with open(latest_path, "w", encoding="utf-8") as f:
    f.write(json_output)
print(f"Saved latest model: {latest_path}")
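
A quick round-trip check can confirm that the file on disk parses back to the same model. The sketch below writes to a temporary directory so it runs anywhere; `toy_model` and the file name are placeholders. In the notebook, the same comparison would apply to `latest_path` and `graph_model`.

```python
import json
import os
import tempfile

toy_model = {"name": "Example", "version": "1.0", "nodes": [], "edges": []}  # placeholder

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "graph_model_Example_latest.json")

    # Write the model the same way the notebook does
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(toy_model, indent=2))

    # Read it back and compare with the in-memory model
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

round_trip_ok = reloaded == toy_model
print(f"Round trip OK: {round_trip_ok}")
```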

In [None]:
# Preview the JSON output
print("=" * 60)
print("GRAPH MODEL JSON PREVIEW")
print("=" * 60)
# Show condensed version for preview (slicing already handles lists shorter than 3)
preview_model = {
    "name": graph_model["name"],
    "version": graph_model["version"],
    "created": graph_model["created"],
    "nodes": graph_model["nodes"][:3],
    "edges": graph_model["edges"][:3],
    "_truncated": {
        "total_nodes": len(graph_model["nodes"]),
        "total_edges": len(graph_model["edges"])
    }
}
print(json.dumps(preview_model, indent=2))

## Summary

This notebook generates a Fabric Graph Model JSON definition from the translated RDF schema.

**Pipeline Position:** Step 7 (after gold layer is written)

**Dependencies:**
- `silver_node_types` - Node type definitions
- `silver_properties` - Property definitions

**Outputs:**
- `Files/graph_models/graph_model_{name}_{timestamp}.json` - Timestamped version
- `Files/graph_models/graph_model_{name}_latest.json` - Latest version for easy access

**Next Steps:**
- Use the Graph Model JSON with the Fabric Graph API
- Import the `gold_nodes` and `gold_edges` data after creating the graph schema