# 07 - Ontology Definition Generator

**Epic:** F5 - Fabric Ontology Integration  
**Feature:** F5.1 - Ontology Definition Generator  
**Priority:** P0

## Purpose

Generate a Fabric Ontology definition from the silver-layer schema tables. The output is a structured definition whose parts are base64-encoded JSON, ready for upload via the Fabric Ontology REST API.

## Input

- `silver_node_types` - Node type definitions from RDF classes
- `silver_properties` - Property definitions (datatype and object)

## Output

- Ontology definition JSON saved to `Files/ontology_definitions/`
- Structure ready for `POST /ontologies/{id}/updateDefinition`

## Definition Structure

```
definition.json                                    → Root definition
EntityTypes/{id}/definition.json                   → Entity type (name, properties, key)  
EntityTypes/{id}/DataBindings/{bindingId}.json     → Data binding to lakehouse table
RelationshipTypes/{id}/definition.json             → Relationships between entity types
.platform                                          → Metadata
```
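
The layout above can be expressed as a few path helpers. This is a minimal sketch: the function names are hypothetical and not part of this notebook, and the IDs used are placeholders.

```python
# Hypothetical helpers that build part paths following the layout listed above.

def entity_type_path(entity_id: str) -> str:
    """Path of an entity type definition part."""
    return f"EntityTypes/{entity_id}/definition.json"


def data_binding_path(entity_id: str, binding_id: str) -> str:
    """Path of a data binding part that belongs to an entity type."""
    return f"EntityTypes/{entity_id}/DataBindings/{binding_id}.json"


def relationship_type_path(relationship_id: str) -> str:
    """Path of a relationship type definition part."""
    return f"RelationshipTypes/{relationship_id}/definition.json"


# Placeholder IDs, for illustration only
print(entity_type_path("1001"))
print(data_binding_path("1001", "b1"))
print(relationship_type_path("2001"))
```

Keeping path construction in one place makes it harder for the assembly and validation steps to drift apart.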

## Setup

In [None]:
import json
import re
import base64
import hashlib
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

## Configuration

In [None]:
# Ontology configuration
ONTOLOGY_NAMESPACE = "rdftranslation"  # Custom namespace for generated types
ONTOLOGY_VERSION = "1.0"

# Output path (relative to lakehouse Files)
OUTPUT_DIR = "Files/ontology_definitions"

# Fabric Ontology naming constraints
# - 1-26 characters
# - Alphanumeric + hyphens/underscores
# - Start and end with alphanumeric
MAX_NAME_LENGTH = 26

# Reserved words that cannot be used as names in Fabric Ontology
RESERVED_WORDS = {
    "node", "edge", "graph", "vertex", "source", "target",
    "id", "label", "type", "name", "properties", "entity",
    "relationship", "property", "ontology", "binding"
}

## Verify Required Tables Exist

In [None]:
# Check that required tables exist before proceeding (spark.table raises if a table cannot be resolved)
required_tables = ["silver_node_types", "silver_properties"]
missing_tables = []

for table in required_tables:
    try:
        spark.table(table).limit(1)
    except Exception:
        missing_tables.append(table)

if missing_tables:
    print("ERROR: Required tables not found:")
    for t in missing_tables:
        print(f"  - {t}")
    print("\nPlease run the following notebooks first:")
    print("  - 01_rdf_parser.ipynb (creates bronze_triples)")
    print("  - 02_schema_analyzer.ipynb (creates bronze_schema_analysis)")
    print("  - 03_class_mapper.ipynb (creates silver_node_types)")
    print("  - 04_property_mapper.ipynb (creates silver_properties)")
    raise RuntimeError(f"Missing required tables: {missing_tables}")
else:
    print("All required tables exist")
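
The same check can be written as a small pure function that takes the existence test as a parameter, which makes it easy to try without a Spark session. This is a sketch: `find_missing_tables` is a hypothetical name, and the lambda below is a stand-in for a real catalog. In the notebook the callable could be `spark.catalog.tableExists`, assuming the runtime's PySpark version provides it (it was added in PySpark 3.3).

```python
from typing import Callable, Iterable, List


def find_missing_tables(required: Iterable[str],
                        table_exists: Callable[[str], bool]) -> List[str]:
    """Return the required tables for which table_exists() is False, in input order."""
    return [name for name in required if not table_exists(name)]


# Stand-in for a real catalog so the sketch runs without Spark.
available = {"silver_node_types"}
missing = find_missing_tables(
    ["silver_node_types", "silver_properties"],
    lambda table: table in available,
)
print(missing)  # ['silver_properties']
```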

## Helper Functions - Naming & Validation

In [None]:
def sanitize_name(name: str) -> str:
    """
    Sanitize a name for Fabric Ontology compatibility.
    
    Constraints:
    - 1-26 characters
    - Alphanumeric + hyphens/underscores
    - Start and end with alphanumeric
    """
    if name is None:
        return "Unknown"
    
    # Remove namespace prefixes if present (take local name)
    if ":" in name and not name.startswith("http"):
        name = name.split(":")[-1]
    
    # For URIs, extract local name
    if "/" in name:
        name = name.rsplit("/", 1)[-1]
    if "#" in name:
        name = name.rsplit("#", 1)[-1]
    
    # Replace spaces and hyphens with underscores
    name = re.sub(r"[\s-]+", "_", name)
    
    # Remove special characters (keep only alphanumeric and underscore)
    name = re.sub(r"[^a-zA-Z0-9_]", "", name)
    
    # Ensure starts with alphanumeric (only underscores can remain at this point)
    while name and not name[0].isalnum():
        name = name[1:]
    if not name:
        name = "Unknown"
    
    # Ensure ends with alphanumeric
    while name and not name[-1].isalnum():
        name = name[:-1]
    
    # Handle reserved words
    if name.lower() in RESERVED_WORDS:
        name = name + "Type"
    
    # Truncate to max length while keeping alphanumeric end
    if len(name) > MAX_NAME_LENGTH:
        name = name[:MAX_NAME_LENGTH]
        while name and not name[-1].isalnum():
            name = name[:-1]
    
    return name if name else "Unknown"


def generate_id(seed: str) -> str:
    """
    Generate a deterministic numeric ID from a seed string.
    Used for entity type IDs, property IDs, etc.
    """
    # Hash the seed with MD5 (used for determinism, not for security)
    hash_hex = hashlib.md5(seed.encode('utf-8')).hexdigest()
    # Interpret the first 13 hex characters as an integer, then keep at most
    # the first 13 decimal digits of that integer
    numeric_id = str(int(hash_hex[:13], 16))[:13]
    return numeric_id


def validate_name(name: str) -> tuple[bool, str]:
    """
    Validate a name against Fabric Ontology rules.
    Returns (is_valid, error_message)
    """
    if not name:
        return False, "Name cannot be empty"
    
    if len(name) > MAX_NAME_LENGTH:
        return False, f"Name must be 1-{MAX_NAME_LENGTH} chars, got {len(name)}"
    
    if not name[0].isalnum():
        return False, f"Name must start with alphanumeric, got '{name[0]}'"
    
    if not name[-1].isalnum():
        return False, f"Name must end with alphanumeric, got '{name[-1]}'"
    
    if not re.match(r'^[a-zA-Z0-9][a-zA-Z0-9_-]*[a-zA-Z0-9]$|^[a-zA-Z0-9]$', name):
        return False, f"Name contains invalid characters: '{name}'"
    
    return True, ""


# Test sanitize_name (expected=None means only validity is checked)
test_cases = [
    ("Person", "Person"),
    ("my_class", "my_class"),
    ("This Is A Very Long Name That Should Be Truncated", None),  # Should be <= 26
    ("123numeric", "123numeric"),  # Digits are alphanumeric, so no prefix is added
    ("http://example.org/Person", "Person"),  # URI extraction
    ("ex:Person", "Person"),  # Prefix removal
    ("id", "idType"),  # Reserved word
]

print("Testing sanitize_name():")
for input_name, expected in test_cases:
    result = sanitize_name(input_name)
    is_valid, error = validate_name(result)
    if not is_valid:
        status = f"✗ {error}"
    elif expected is not None and result != expected:
        status = f"✗ expected '{expected}'"
    else:
        status = "✓"
    print(f"  '{input_name}' → '{result}' [{status}]")
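
Because names are truncated to 26 characters, two different inputs can end up with the same sanitized name. One way to handle that is to append a numeric suffix while staying inside the length limit. This is a sketch under that assumption: `make_unique_name` is a hypothetical helper, not part of the notebook, and whether Fabric compares names case-sensitively is not addressed here.

```python
def make_unique_name(name: str, seen: set, max_length: int = 26) -> str:
    """Return name, or a numbered variant not yet in seen, and record it in seen."""
    candidate = name
    counter = 2
    while candidate in seen:
        suffix = str(counter)
        # Trim the base so that base + suffix still fits within max_length
        candidate = name[: max_length - len(suffix)] + suffix
        counter += 1
    seen.add(candidate)
    return candidate


seen_names = set()
print(make_unique_name("Person", seen_names))  # Person
print(make_unique_name("Person", seen_names))  # Person2
print(make_unique_name("Person", seen_names))  # Person3
```

The suffix is always a digit, so the result still ends with an alphanumeric character.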

## Helper Functions - Datatype Mapping

In [None]:
def map_to_ontology_datatype(rdf_datatype: str) -> str:
    """
    Map RDF/XSD datatypes to Fabric Ontology types.
    
    Fabric Ontology supports: String, Int32, Int64, Double, Boolean, DateTime
    """
    if rdf_datatype is None:
        return "String"  # Default to String
    
    # Match on the local name only (text after the last '#', '/' or ':').
    # Substring matching would classify "integer" as Int32 because it contains "int".
    local_name = re.split(r"[#/:]", rdf_datatype.lower())[-1]

    # String types
    if local_name in {"string", "langstring", "anyuri", "token", "normalizedstring", "literal"}:
        return "String"

    # 32-bit Integer types
    if local_name in {"int", "short", "byte", "unsignedshort", "unsignedbyte"}:
        return "Int32"

    # 64-bit Integer types (larger ranges)
    if local_name in {"integer", "long", "unsignedint", "unsignedlong", "nonpositiveinteger", "nonnegativeinteger", "positiveinteger", "negativeinteger"}:
        return "Int64"

    # Double/Float types
    if local_name in {"double", "float", "decimal"}:
        return "Double"

    # Boolean
    if local_name == "boolean":
        return "Boolean"

    # DateTime types
    if local_name in {"datetime", "datetimestamp", "date", "time", "gyear", "gyearmonth", "gmonth", "gmonthday", "gday"}:
        return "DateTime"

    # Default to String for unknown types
    return "String"


# Test datatype mapping
test_types = [
    ("xsd:string", "String"),
    ("xsd:integer", "Int64"),
    ("xsd:int", "Int32"),
    ("xsd:double", "Double"),
    ("xsd:boolean", "Boolean"),
    ("xsd:dateTime", "DateTime"),
    ("http://www.w3.org/2001/XMLSchema#string", "String"),
    (None, "String"),
]

print("Testing map_to_ontology_datatype():")
for rdf_type, expected in test_types:
    result = map_to_ontology_datatype(rdf_type)
    status = "✓" if result == expected else f"✗ expected {expected}"
    print(f"  {rdf_type} → {result} [{status}]")
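
Datatype identifiers can arrive as prefixed names (`xsd:integer`) or as full IRIs, and several XSD names contain one another (`int` is a substring of `integer` and `unsignedInt`). The sketch below shows one way to normalise an identifier to its local name before comparing. `datatype_local_name` is a hypothetical helper introduced for illustration.

```python
import re


def datatype_local_name(rdf_datatype: str) -> str:
    """Return the lower-cased text after the last '#', '/' or ':'."""
    return re.split(r"[#/:]", rdf_datatype.lower())[-1]


for raw in ["xsd:integer", "xsd:int", "http://www.w3.org/2001/XMLSchema#dateTime"]:
    print(raw, "->", datatype_local_name(raw))

# Substring checks cannot tell "int" and "integer" apart; exact comparison can.
print("int" in "xsd:integer")                       # True
print(datatype_local_name("xsd:integer") == "int")  # False
```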

## Helper Functions - Base64 Encoding

In [None]:
def encode_payload(data: dict) -> str:
    """
    Encode a dictionary as base64 JSON string for Fabric API.
    """
    json_str = json.dumps(data, indent=2)
    return base64.b64encode(json_str.encode('utf-8')).decode('utf-8')


def decode_payload(encoded: str) -> dict:
    """
    Decode a base64 JSON string back to a dictionary.
    """
    json_str = base64.b64decode(encoded.encode('utf-8')).decode('utf-8')
    return json.loads(json_str)


# Test encoding/decoding
test_data = {"name": "Test", "value": 123}
encoded = encode_payload(test_data)
decoded = decode_payload(encoded)
print(f"Encoding test: {test_data} → '{encoded[:30]}...'")
print(f"Decode matches original: {decoded == test_data}")
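
The round trip also holds for non-ASCII text, and the JSON does not need to be indented before encoding. The variant below is a sketch of a more compact encoding. `encode_compact` and `decode` are hypothetical names, and whether payload size matters for the API is not stated in this notebook.

```python
import base64
import json


def encode_compact(data: dict) -> str:
    """Base64-encode data as compact UTF-8 JSON (no indentation, non-ASCII kept as-is)."""
    json_str = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
    return base64.b64encode(json_str.encode("utf-8")).decode("ascii")


def decode(encoded: str) -> dict:
    """Decode a base64 string produced by encode_compact()."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


sample = {"name": "Café", "dataType": "String"}
compact = encode_compact(sample)
indented = base64.b64encode(json.dumps(sample, indent=2).encode("utf-8")).decode("ascii")

print(decode(compact) == sample)     # True
print(len(compact) < len(indented))  # True
```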

## Load Schema Data

In [None]:
# Load node types
df_node_types = spark.table("silver_node_types")
print(f"Loaded {df_node_types.count()} node types")
df_node_types.show(truncate=False)

In [None]:
# Load properties
df_properties = spark.table("silver_properties")
print(f"Loaded {df_properties.count()} properties")
df_properties.show(truncate=False)

In [None]:
# Separate datatype and object properties
df_datatype_props = df_properties.filter(F.col("property_type") == "datatype")
df_object_props = df_properties.filter(F.col("property_type") == "object")

print(f"Datatype properties (become entity properties): {df_datatype_props.count()}")
print(f"Object properties (become relationship types): {df_object_props.count()}")
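
The two filters above silently drop any row whose `property_type` is neither `datatype` nor `object`. A quick count of unexpected values makes that visible. The sketch uses plain dictionaries as stand-ins for Spark rows; the sample rows and the helper name are hypothetical.

```python
from collections import Counter

EXPECTED_PROPERTY_TYPES = {"datatype", "object"}


def unexpected_property_types(rows) -> dict:
    """Count rows whose property_type is not one of the expected values."""
    counts = Counter(row.get("property_type") for row in rows)
    return {ptype: n for ptype, n in counts.items()
            if ptype not in EXPECTED_PROPERTY_TYPES}


# Hypothetical rows standing in for silver_properties records.
rows = [
    {"property_name": "age", "property_type": "datatype"},
    {"property_name": "knows", "property_type": "object"},
    {"property_name": "seeAlso", "property_type": "annotation"},
    {"property_name": "misc", "property_type": None},
]
print(unexpected_property_types(rows))
```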

## Build Entity Type Definitions

In [None]:
# Collect node types to build entity types
node_types = df_node_types.select(
    "node_type", "class_uri", "display_name", "description"
).collect()

# Build lookup for node_type to entity_id mapping (for relationships)
node_type_to_entity_id = {}
for row in node_types:
    entity_id = generate_id(f"entity_{row['class_uri']}")
    node_type_to_entity_id[row['node_type']] = entity_id
    node_type_to_entity_id[row['class_uri']] = entity_id  # Also index by URI

# Collect datatype properties with their domains
datatype_props = df_datatype_props.select(
    "property_name", "property_uri", "data_type", "source_types"
).collect()

print(f"Building entity type definitions for {len(node_types)} types")
print(f"Using {len(datatype_props)} datatype properties")
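
The generated IDs are truncated hashes, so two distinct seeds could in principle map to the same ID. The sketch below groups seeds by ID and reports any ID shared by more than one seed. It repeats the notebook's ID derivation so that it runs on its own; `find_id_collisions` and the example seeds are hypothetical.

```python
import hashlib
from collections import defaultdict


def generate_id(seed: str) -> str:
    """Same derivation as the notebook's generate_id()."""
    hash_hex = hashlib.md5(seed.encode("utf-8")).hexdigest()
    return str(int(hash_hex[:13], 16))[:13]


def find_id_collisions(seeds, id_fn=generate_id) -> dict:
    """Map each ID produced by more than one distinct seed to the sorted seeds."""
    by_id = defaultdict(set)
    for seed in seeds:
        by_id[id_fn(seed)].add(seed)
    return {gid: sorted(group) for gid, group in by_id.items() if len(group) > 1}


seeds = [
    "entity_http://example.org/Person",
    "entity_http://example.org/Company",
]
collisions = find_id_collisions(seeds)
print(f"{len(collisions)} colliding ID(s)")
```

Running such a check over all entity, property, and relationship seeds before assembling the parts would surface a collision early rather than at upload time.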

In [None]:
# Build entity type definitions
entity_types = []

for node_row in node_types:
    node_type = node_row["node_type"]
    class_uri = node_row["class_uri"]
    
    # Generate entity ID
    entity_id = node_type_to_entity_id[node_type]
    
    # Sanitize entity name
    entity_name = sanitize_name(node_type)
    
    # Build properties list
    # Start with required 'uri' property for the original RDF IRI
    uri_prop_id = generate_id(f"prop_{class_uri}_uri")
    properties = [
        {
            "id": uri_prop_id,
            "name": "uri",
            "dataType": "String"
        }
    ]
    
    # Find datatype properties for this entity type
    seen_prop_names = {"uri"}
    
    for prop_row in datatype_props:
        source_types = prop_row["source_types"] or []
        
        # Check if this property applies to this entity
        if node_type in source_types or class_uri in source_types:
            prop_name = sanitize_name(prop_row["property_name"])
            prop_uri = prop_row["property_uri"]
            
            # Avoid duplicate property names
            if prop_name not in seen_prop_names:
                prop_id = generate_id(f"prop_{class_uri}_{prop_uri}")
                prop_type = map_to_ontology_datatype(prop_row["data_type"])
                
                properties.append({
                    "id": prop_id,
                    "name": prop_name,
                    "dataType": prop_type
                })
                seen_prop_names.add(prop_name)
    
    # Build entity type definition
    entity_definition = {
        "id": entity_id,
        "namespace": ONTOLOGY_NAMESPACE,
        "name": entity_name,
        "entityIdParts": [uri_prop_id],  # Use URI as entity key
        "displayNamePropertyId": uri_prop_id,  # Use URI as display name
        "properties": properties
    }
    
    entity_types.append({
        "id": entity_id,
        "name": entity_name,
        "class_uri": class_uri,
        "definition": entity_definition
    })

print(f"\nBuilt {len(entity_types)} entity type definitions:")
for et in entity_types:
    prop_count = len(et['definition']['properties'])
    print(f"  {et['name']}: {prop_count} properties")
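
A datatype property whose `source_types` match no known node type or class URI is never attached to an entity and therefore disappears from the output without a message. The sketch below lists such properties. It uses plain dictionaries in place of Spark rows; the helper name and the sample data are hypothetical.

```python
def find_unattached_properties(props, known_types) -> list:
    """Return names of properties whose source_types match no known type."""
    known = set(known_types)
    return [
        p["property_name"]
        for p in props
        if not known.intersection(p.get("source_types") or [])
    ]


# Hypothetical data for illustration
known_types = {"Person", "http://example.org/Person"}
props = [
    {"property_name": "age", "source_types": ["Person"]},
    {"property_name": "founded", "source_types": ["Company"]},
    {"property_name": "note", "source_types": None},
]
print(find_unattached_properties(props, known_types))  # ['founded', 'note']
```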

## Build Relationship Type Definitions

In [None]:
# Collect object properties (these become relationship types)
object_props = df_object_props.select(
    "property_name", "property_uri", "source_types", "target_types"
).collect()

print(f"Building relationship type definitions from {len(object_props)} object properties")

In [None]:
# Build relationship type definitions
relationship_types = []

for prop_row in object_props:
    prop_name = prop_row["property_name"]
    prop_uri = prop_row["property_uri"]
    source_types = prop_row["source_types"] or []
    target_types = prop_row["target_types"] or []
    
    # Sanitize relationship name
    rel_name = sanitize_name(prop_name)
    
    # Generate relationship ID
    rel_id = generate_id(f"rel_{prop_uri}")
    
    # Find source entity types
    sources = []
    for st in source_types:
        if st in node_type_to_entity_id:
            sources.append(node_type_to_entity_id[st])
    
    # Find target entity types
    targets = []
    for tt in target_types:
        if tt in node_type_to_entity_id:
            targets.append(node_type_to_entity_id[tt])
    
    # Skip if we can't resolve source or target
    if not sources or not targets:
        print(f"  Warning: Skipping '{rel_name}' - unresolved source/target")
        print(f"    Sources: {source_types} → {sources}")
        print(f"    Targets: {target_types} → {targets}")
        continue
    
    # Build relationship definition
    # Note: Fabric allows multiple source/target entity types
    relationship_definition = {
        "id": rel_id,
        "namespace": ONTOLOGY_NAMESPACE,
        "name": rel_name,
        "fromEntityTypeIds": sorted(set(sources)),  # Deduplicate, stable order
        "toEntityTypeIds": sorted(set(targets)),    # Deduplicate, stable order
        "properties": []  # Relationships can have properties too
    }
    
    relationship_types.append({
        "id": rel_id,
        "name": rel_name,
        "property_uri": prop_uri,
        "definition": relationship_definition
    })

print(f"\nBuilt {len(relationship_types)} relationship type definitions:")
for rt in relationship_types[:10]:  # Show first 10
    from_count = len(rt['definition']['fromEntityTypeIds'])
    to_count = len(rt['definition']['toEntityTypeIds'])
    print(f"  {rt['name']}: {from_count} source(s) → {to_count} target(s)")
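
Source and target resolution follow the same pattern: look each type reference up, keep the IDs that resolve, and note the ones that do not. A sketch of that step as a single function is shown below. `resolve_entity_ids` is a hypothetical name and the lookup contents are placeholders.

```python
def resolve_entity_ids(type_refs, lookup):
    """Split type_refs into (sorted unique resolved IDs, refs with no lookup entry)."""
    resolved = sorted({lookup[ref] for ref in type_refs if ref in lookup})
    unresolved = [ref for ref in type_refs if ref not in lookup]
    return resolved, unresolved


# Placeholder lookup: a node type and its class URI map to the same entity ID
lookup = {"Person": "1001", "http://example.org/Person": "1001", "Company": "1002"}

resolved, unresolved = resolve_entity_ids(
    ["Person", "http://example.org/Person", "Robot"], lookup
)
print(resolved)    # ['1001']
print(unresolved)  # ['Robot']
```

Returning the unresolved references separately lets the caller report partially resolved relationships, which the loop above only reports when nothing resolves at all.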

## Assemble Definition Parts

In [None]:
# Assemble the full definition structure with base64-encoded parts
definition_parts = []

# Add .platform metadata
platform_metadata = {
    "$schema": "https://developer.microsoft.com/json-schemas/fabric/gitIntegration/platformProperties/2.0.0/schema.json",
    "metadata": {
        "type": "Ontology",
        "displayName": "RDF Translated Ontology"
    },
    "config": {
        "version": ONTOLOGY_VERSION
    }
}
definition_parts.append({
    "path": ".platform",
    "payload": encode_payload(platform_metadata),
    "payloadType": "InlineBase64"
})

# Add root definition.json
root_definition = {
    "version": ONTOLOGY_VERSION,
    "namespace": ONTOLOGY_NAMESPACE,
    "entityTypeIds": [et['id'] for et in entity_types],
    "relationshipTypeIds": [rt['id'] for rt in relationship_types]
}
definition_parts.append({
    "path": "definition.json",
    "payload": encode_payload(root_definition),
    "payloadType": "InlineBase64"
})

print(f"Added root definition with {len(entity_types)} entity types, {len(relationship_types)} relationship types")

In [None]:
# Add entity type definitions
for et in entity_types:
    path = f"EntityTypes/{et['id']}/definition.json"
    definition_parts.append({
        "path": path,
        "payload": encode_payload(et['definition']),
        "payloadType": "InlineBase64"
    })

print(f"Added {len(entity_types)} entity type definition parts")

In [None]:
# Add relationship type definitions
for rt in relationship_types:
    path = f"RelationshipTypes/{rt['id']}/definition.json"
    definition_parts.append({
        "path": path,
        "payload": encode_payload(rt['definition']),
        "payloadType": "InlineBase64"
    })

print(f"Added {len(relationship_types)} relationship type definition parts")

In [None]:
# Summary of all parts
print(f"\nTotal definition parts: {len(definition_parts)}")
print("\nPart paths:")
for part in definition_parts[:15]:  # Show first 15
    print(f"  {part['path']}")
if len(definition_parts) > 15:
    print(f"  ... and {len(definition_parts) - 15} more")
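
Every part appended above has the same three keys, so the repetition can be folded into one helper. This is a sketch: `make_part` is a hypothetical name, and the encoding mirrors the notebook's `encode_payload()` so that the block runs on its own.

```python
import base64
import json


def make_part(path: str, data: dict) -> dict:
    """Build one definition part with a base64-encoded JSON payload."""
    payload = base64.b64encode(
        json.dumps(data, indent=2).encode("utf-8")
    ).decode("utf-8")
    return {"path": path, "payload": payload, "payloadType": "InlineBase64"}


# Placeholder content for illustration
part = make_part("definition.json", {"version": "1.0", "entityTypeIds": []})
print(part["path"], part["payloadType"])

# The payload decodes back to the original dictionary
decoded = json.loads(base64.b64decode(part["payload"]).decode("utf-8"))
print(decoded)
```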

## Validate Definition

In [None]:
def validate_ontology_definition(parts: list[dict]) -> tuple[bool, list[str]]:
    """
    Validate the ontology definition structure.
    
    Returns (is_valid, list of errors)
    """
    errors = []
    
    # Check required files exist
    paths = {p['path'] for p in parts}
    if '.platform' not in paths:
        errors.append("Missing .platform metadata")
    if 'definition.json' not in paths:
        errors.append("Missing definition.json")
    
    # Validate each part
    for part in parts:
        try:
            payload = decode_payload(part['payload'])
            
            # Validate entity type definitions
            if part['path'].startswith('EntityTypes/') and part['path'].endswith('/definition.json'):
                required_fields = ['id', 'namespace', 'name', 'properties']
                for field in required_fields:
                    if field not in payload:
                        errors.append(f"Entity {part['path']}: missing {field}")
                
                # Validate name
                if 'name' in payload:
                    is_valid, error = validate_name(payload['name'])
                    if not is_valid:
                        errors.append(f"Entity {part['path']}: {error}")
                
                # Validate properties
                for i, prop in enumerate(payload.get('properties', [])):
                    for pf in ['id', 'name', 'dataType']:
                        if pf not in prop:
                            errors.append(f"Entity {part['path']}, property {i}: missing {pf}")
            
            # Validate relationship type definitions
            if part['path'].startswith('RelationshipTypes/') and part['path'].endswith('/definition.json'):
                required_fields = ['id', 'namespace', 'name', 'fromEntityTypeIds', 'toEntityTypeIds']
                for field in required_fields:
                    if field not in payload:
                        errors.append(f"Relationship {part['path']}: missing {field}")
                
                # Validate name
                if 'name' in payload:
                    is_valid, error = validate_name(payload['name'])
                    if not is_valid:
                        errors.append(f"Relationship {part['path']}: {error}")
        
        except Exception as e:
            errors.append(f"Part {part['path']}: decode error - {e}")
    
    return len(errors) == 0, errors


# Validate
is_valid, validation_errors = validate_ontology_definition(definition_parts)

if is_valid:
    print("✓ Ontology definition is valid")
else:
    print(f"✗ Ontology definition has {len(validation_errors)} errors:")
    for error in validation_errors[:20]:  # Show first 20
        print(f"  - {error}")
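
The validation above checks each part in isolation. A complementary check is referential: every entity type ID that a relationship refers to should exist among the entity types. The sketch below works on already-decoded definitions; `find_dangling_references` and the sample data are hypothetical.

```python
def find_dangling_references(entity_ids, relationship_definitions) -> list:
    """Return one message per relationship endpoint that names an unknown entity type."""
    known = set(entity_ids)
    problems = []
    for rel in relationship_definitions:
        for field in ("fromEntityTypeIds", "toEntityTypeIds"):
            for ref in rel.get(field, []):
                if ref not in known:
                    problems.append(
                        f"Relationship {rel.get('name')}: {field} "
                        f"references unknown entity type {ref}"
                    )
    return problems


# Placeholder definitions for illustration
entity_ids = ["1001", "1002"]
relationships = [
    {"name": "worksFor", "fromEntityTypeIds": ["1001"], "toEntityTypeIds": ["1002"]},
    {"name": "owns", "fromEntityTypeIds": ["1001"], "toEntityTypeIds": ["9999"]},
]
for problem in find_dangling_references(entity_ids, relationships):
    print(problem)
```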

## Save Definition to Files

In [None]:
# Create output directory if it doesn't exist
import os

# In Fabric, Files/ is a mounted path under /lakehouse/default
output_dir = os.path.join("/lakehouse/default", OUTPUT_DIR)
os.makedirs(output_dir, exist_ok=True)

print(f"Output directory: {output_dir}")

In [None]:
# Save the full definition (for API upload)
generated_at = datetime.now()  # Single timestamp shared by the filename and the metadata
timestamp = generated_at.strftime("%Y%m%d_%H%M%S")
definition_filename = f"ontology_definition_{timestamp}.json"

full_definition = {
    "metadata": {
        "generated_at": generated_at.isoformat(),
        "generator": "07_ontology_definition_generator",
        "version": ONTOLOGY_VERSION,
        "entity_type_count": len(entity_types),
        "relationship_type_count": len(relationship_types)
    },
    "definition": {
        "parts": definition_parts
    }
}

definition_path = os.path.join(output_dir, definition_filename)
with open(definition_path, 'w') as f:
    json.dump(full_definition, f, indent=2)

print(f"Saved: {definition_filename}")
print(f"  Entity types: {len(entity_types)}")
print(f"  Relationship types: {len(relationship_types)}")
print(f"  Definition parts: {len(definition_parts)}")
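
The saved file wraps the parts in a `definition` object next to a generator-specific `metadata` block. A later upload step would presumably send only the `definition` object. The sketch below reads a saved file and returns that portion. It assumes the request body has the shape `{"definition": {"parts": [...]}}`, which this notebook does not confirm, and `load_definition_body` is a hypothetical helper.

```python
import json
import os
import tempfile


def load_definition_body(path: str) -> dict:
    """Read a saved definition file and return only its 'definition' object."""
    with open(path, "r", encoding="utf-8") as f:
        saved = json.load(f)
    return {"definition": saved["definition"]}


# Demonstration with a temporary file shaped like the notebook's output.
# "e30=" is the base64 encoding of "{}".
sample = {
    "metadata": {"generator": "07_ontology_definition_generator"},
    "definition": {
        "parts": [
            {"path": "definition.json", "payload": "e30=", "payloadType": "InlineBase64"}
        ]
    },
}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "ontology_definition_sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    body = load_definition_body(sample_path)

print(sorted(body))                      # ['definition']
print(len(body["definition"]["parts"]))  # 1
```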

In [None]:
# Also save human-readable versions for debugging

# Entity types summary
entity_summary = []
for et in entity_types:
    entity_summary.append({
        "id": et['id'],
        "name": et['name'],
        "class_uri": et['class_uri'],
        "property_count": len(et['definition']['properties']),
        "properties": [p['name'] for p in et['definition']['properties']]
    })

entity_summary_path = os.path.join(output_dir, f"entity_types_{timestamp}.json")
with open(entity_summary_path, 'w') as f:
    json.dump(entity_summary, f, indent=2)

# Relationship types summary
rel_summary = []
for rt in relationship_types:
    rel_summary.append({
        "id": rt['id'],
        "name": rt['name'],
        "property_uri": rt['property_uri'],
        "from_entity_ids": rt['definition']['fromEntityTypeIds'],
        "to_entity_ids": rt['definition']['toEntityTypeIds']
    })

rel_summary_path = os.path.join(output_dir, f"relationship_types_{timestamp}.json")
with open(rel_summary_path, 'w') as f:
    json.dump(rel_summary, f, indent=2)

print(f"\nSaved human-readable summaries:")
print(f"  entity_types_{timestamp}.json")
print(f"  relationship_types_{timestamp}.json")

## Summary & Next Steps

In [None]:
print("="*60)
print("Ontology Definition Generation Complete")
print("="*60)
print(f"\nGenerated:")
print(f"  - {len(entity_types)} entity types")
print(f"  - {len(relationship_types)} relationship types")
print(f"  - {len(definition_parts)} definition parts")

print(f"\nOutput files in {output_dir}:")
print(f"  - {definition_filename} (API-ready definition)")
print(f"  - entity_types_{timestamp}.json (human-readable)")
print(f"  - relationship_types_{timestamp}.json (human-readable)")

print(f"\nValidation: {'✓ PASSED' if is_valid else '✗ FAILED'}")

print(f"\n" + "="*60)
print("Next Steps:")
print("="*60)
print("1. Run F5.2 (Fabric Ontology REST API Client) to upload definition")
print("2. Run F5.3 (Lakehouse Data Binding) to bind gold tables")
print("3. Query the materialized graph via Fabric Graph!")

## Display Generated Definition (Sample)

In [None]:
# Show first entity type definition (decoded) as example
if entity_types:
    print("Sample Entity Type Definition:")
    print("-" * 40)
    print(json.dumps(entity_types[0]['definition'], indent=2))

print()

# Show first relationship type definition (decoded) as example
if relationship_types:
    print("Sample Relationship Type Definition:")
    print("-" * 40)
    print(json.dumps(relationship_types[0]['definition'], indent=2))