# Exploring Baroque Ceiling Painting Data in the NFDI4Culture Knowledge Graph

This notebook is a starting point for a data story about baroque art and ceiling paintings using the NFDI4Culture Knowledge Graph.

Focus:
- Work with **data portals** (especially CbDD and the Color Slide Archive of Wall and Ceiling Painting)
- Use **SPARQL** to query the KG
- Prepare results for visualisation (maps, timelines, comparisons)

You can adapt the queries step by step as you learn more about the concrete RDF schema of the datasets.

In [1]:
# Install dependencies (run once per environment)
!pip install SPARQLWrapper pandas matplotlib --quiet

In [2]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 20)
pd.set_option("display.width", 120)

# NFDI4Culture SPARQL endpoint
ENDPOINT_URL = "https://nfdi4culture.de/sparql"

# Prefixes used in queries
# NOTE: The KG uses http://schema.org/ (not https://)
PREFIXES = """\
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nfdicore: <https://nfdi.fiz-karlsruhe.de/ontology/>
PREFIX schema:  <http://schema.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dcat:    <http://www.w3.org/ns/dcat#>
PREFIX n4c:     <https://nfdi4culture.de/id/>
"""

def run_sparql(query: str) -> pd.DataFrame:
    """Run a SPARQL query against the NFDI4Culture endpoint and return a pandas DataFrame.

    The query body should *not* include prefixes; they are prepended automatically.
    This version accesses the JSON result safely to avoid indexing errors in static type checkers.
    """
    sparql = SPARQLWrapper(ENDPOINT_URL)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(PREFIXES + "\n" + query)
    results = sparql.query().convert()

    # Be defensive: ensure results is a dict and extract bindings safely
    if not isinstance(results, dict):
        return pd.DataFrame()

    bindings = results.get("results", {}).get("bindings", [])
    rows = []
    for binding in bindings:
        # each binding is a dict of variable -> { "type": ..., "value": ... }
        row = {var: val.get("value") for var, val in binding.items()}
        rows.append(row)
    return pd.DataFrame(rows)
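
To make the flattening step in `run_sparql` easier to follow, here is a minimal offline sketch of the same idea. It assumes the standard SPARQL JSON results layout (`head.vars` plus `results.bindings`); the sample data and the helper name `bindings_to_rows` are hypothetical and not taken from the endpoint.

```python
# Hand-written sample in the SPARQL JSON results layout (not real endpoint output)
sample_results = {
    "head": {"vars": ["painting", "label", "year"]},
    "results": {
        "bindings": [
            {
                "painting": {"type": "uri", "value": "https://example.org/painting/1"},
                "label": {"type": "literal", "value": "Example ceiling painting"},
                "year": {"type": "literal", "value": "1720"},
            },
            {
                # A variable from an unmatched OPTIONAL is simply absent from the binding
                "painting": {"type": "uri", "value": "https://example.org/painting/2"},
                "label": {"type": "literal", "value": "Another example"},
            },
        ]
    },
}


def bindings_to_rows(results):
    """Flatten SPARQL JSON bindings into plain dicts, one per result row."""
    variables = results.get("head", {}).get("vars", [])
    bindings = results.get("results", {}).get("bindings", [])
    rows = []
    for binding in bindings:
        # Missing variables become None so that every row has the same keys
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return variables, rows


variables, rows = bindings_to_rows(sample_results)
for row in rows:
    print(row)
```

The resulting rows could be passed to `pd.DataFrame(rows, columns=variables)`, which keeps a column for every selected variable even when no row binds it.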

## 1. Inspect the CbDD portal (Corpus of Baroque Ceiling Painting in Germany)

- Portal ID from the registry: `n4c:E4264`
- Goal: See which properties connect the portal to data feeds, homepages, subjects, etc.

Run this once and scan the property list. It tells you which predicates to use in later queries.

In [3]:
query_inspect_cbdd = """\
SELECT ?p ?o
WHERE {
  n4c:E4264 ?p ?o .
}
ORDER BY ?p
LIMIT 200
"""

df_cbdd_props = run_sparql(query_inspect_cbdd)
df_cbdd_props

Unnamed: 0,p,o
0,http://schema.org/contributor,nodeID://b695742
1,http://schema.org/contributor,nodeID://b696184
2,http://schema.org/contributor,nodeID://b696827
3,http://schema.org/contributor,nodeID://b699558
4,http://schema.org/description,\n The Corpus of Baroque Ceiling Painting i...
5,http://schema.org/hasPart,https://nfdi4culture.de/id/E6077
6,http://schema.org/image,https://nfdi4culture.de//fileadmin/user_upload...
7,http://schema.org/keywords,https://nfdi4culture.de/id/E3953
8,http://schema.org/keywords,https://nfdi4culture.de/id/E3959
9,http://schema.org/keywords,https://nfdi4culture.de/id/E3968


## 2. Discover the CbDD Data Feed

The CbDD portal (`n4c:E4264`) contains a data feed that holds all painting records. 
Let's find the feed and understand how paintings are connected to it.

In [4]:
# Find what points TO the CbDD portal - this reveals the data feed
query_find_feed = """
SELECT ?feed ?feedLabel ?feedType ?predicate
WHERE {
  ?feed ?predicate n4c:E4264 .
  OPTIONAL { ?feed rdfs:label ?feedLabel . }
  OPTIONAL { ?feed rdf:type ?feedType . }
}
LIMIT 20
"""

df_feeds = run_sparql(query_find_feed)
print("Entities pointing to the CbDD portal:")
print(df_feeds)

# The main feed is E6077 - let's verify its structure
print("\n" + "="*60)
print("Verifying E6077 feed structure:")

query_feed_structure = """
SELECT ?p (COUNT(?o) AS ?count) 
WHERE {
  n4c:E6077 ?p ?o .
}
GROUP BY ?p
ORDER BY DESC(?count)
LIMIT 10
"""
df_feed_struct = run_sparql(query_feed_structure)
print(df_feed_struct)

Entities pointing to the CbDD portal:
                                feed                                          feedLabel  \
0   https://nfdi4culture.de/id/E2971                                               JPEG   
1   https://nfdi4culture.de/id/E2971                                               JPEG   
2   https://nfdi4culture.de/id/E3978                                            CC0 1.0   
3   https://nfdi4culture.de/id/E3978                                            CC0 1.0   
4   https://nfdi4culture.de/id/E2312                                       Architecture   
5   https://nfdi4culture.de/id/E2312                                       Architecture   
6   https://nfdi4culture.de/id/E2313                                        Art History   
7   https://nfdi4culture.de/id/E2313                                        Art History   
8   https://nfdi4culture.de/id/E2957                                  Image File Format   
9   https://nfdi4culture.de/id/E3596                

In [5]:
# Define the CbDD feed URI - this is the main entry point for querying paintings
CBDD_FEED_URI = "n4c:E6077"

# Verify the data path: Feed -> DataFeedItem -> Painting
query_verify_path = f"""
SELECT (COUNT(DISTINCT ?painting) AS ?totalPaintings)
WHERE {{
  {CBDD_FEED_URI} schema:dataFeedElement ?feedItem .
  ?feedItem schema:item ?painting .
}}
"""
df_verify = run_sparql(query_verify_path)
print(f"✓ CbDD Feed URI: {CBDD_FEED_URI}")
print(f"✓ Total paintings accessible: {df_verify['totalPaintings'].iloc[0]}")
print("\nData path: Feed → schema:dataFeedElement → DataFeedItem → schema:item → Painting")

✓ CbDD Feed URI: n4c:E6077
✓ Total paintings accessible: 6228

Data path: Feed → schema:dataFeedElement → DataFeedItem → schema:item → Painting
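
With several thousand paintings behind the feed, it may be useful to retrieve them in pages rather than in one request. The sketch below only builds the query strings, using standard SPARQL `LIMIT`/`OFFSET`; the helper name `paged_painting_queries` and the page size are assumptions, and whether the endpoint enforces its own result cap is not known here.

```python
def paged_painting_queries(feed_uri, total, page_size=1000):
    """Yield one SPARQL query body per page along the Feed -> DataFeedItem -> Painting path."""
    for offset in range(0, total, page_size):
        yield f"""
SELECT ?painting ?label
WHERE {{
  {feed_uri} schema:dataFeedElement ?feedItem .
  ?feedItem schema:item ?painting .
  ?painting rdfs:label ?label .
}}
ORDER BY ?painting
LIMIT {page_size}
OFFSET {offset}
""".strip()


# 6228 is the painting count reported above; it serves as an upper bound for paging
queries = list(paged_painting_queries("n4c:E6077", total=6228, page_size=1000))
print(len(queries))
print(queries[-1])
```

Each query body could then be passed to `run_sparql` and the resulting frames combined with `pd.concat`. The `ORDER BY` clause is there because paging without a stable order may return overlapping or missing rows.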


## 3. Explore Painting Properties

Now let's discover what properties are available on the painting records.

In [6]:
# Discover all predicates used by paintings in the dataset
query_painting_predicates = f"""
SELECT ?predicate (COUNT(?o) AS ?count) (SAMPLE(?o) AS ?sampleValue)
WHERE {{
  {CBDD_FEED_URI} schema:dataFeedElement ?feedItem .
  ?feedItem schema:item ?painting .
  ?painting ?predicate ?o .
}}
GROUP BY ?predicate
ORDER BY DESC(?count)
LIMIT 30
"""

df_painting_preds = run_sparql(query_painting_predicates)

# Add resolved labels using the ontology resolver (defined in the 'Automatic Ontology Resolution' cell)
# This will be populated after running the ontology resolution cell
def add_resolved_labels(df):
    """Add a 'resolved_label' column with human-readable property names."""
    # globals() is used because a bare dir() inside a function lists only local names
    resolver = globals().get('resolve_property_name')
    if resolver is not None:
        df['resolved_label'] = df['predicate'].apply(resolver)
    else:
        # Fallback: extract last part of URI
        df['resolved_label'] = df['predicate'].apply(
            lambda x: x.split('/')[-1] if '/' in x else x
        )
    return df

df_painting_preds = add_resolved_labels(df_painting_preds)

print("All predicates used by paintings (with resolved ontology labels):")
print("="*80)
print("\nRun the 'Automatic Ontology Resolution' cell first to get full CTO/NFDI labels.\n")

# Display with resolved labels
df_painting_preds[['resolved_label', 'count', 'predicate', 'sampleValue']]

All predicates used by paintings (with resolved ontology labels):

Run the 'Automatic Ontology Resolution' cell first to get full CTO/NFDI labels.



Unnamed: 0,resolved_label,count,predicate,sampleValue
0,CTO_0001026,23359,https://nfdi4culture.de/ontology/CTO_0001026,http://vocab.getty.edu/aat/300004792
1,CTO_0001009,6672,https://nfdi4culture.de/ontology/CTO_0001009,nodeID://b2646823
2,CTO_0001025,6230,https://nfdi4culture.de/ontology/CTO_0001025,nodeID://b2640114
3,rdf-schema#label,6228,http://www.w3.org/2000/01/rdf-schema#label,"Hofhegnenberg, Schloss"
4,CTO_0001049,6228,https://nfdi4culture.de/ontology/CTO_0001049,https://nfdi4culture.de/ontology/CTO_0001047
5,NFDI_0001008,6228,https://nfdi.fiz-karlsruhe.de/ontology/NFDI_00...,https://www.deckenmalerei.eu/50c603ef-f42c-43f...
6,CTO_0001006,6228,https://nfdi4culture.de/ontology/CTO_0001006,https://nfdi4culture.de/id/E6077
7,22-rdf-syntax-ns#type,6228,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,https://nfdi4culture.de/ontology/CTO_0001005
8,NFDI_0000142,6228,https://nfdi.fiz-karlsruhe.de/ontology/NFDI_00...,https://nfdi4culture.de/id/E6404
9,NFDI_0000191,6228,https://nfdi.fiz-karlsruhe.de/ontology/NFDI_00...,https://nfdi4culture.de/id/E2430


In [7]:
# Get a sample of paintings with key properties to understand the data
# Key properties: CTO_0001073 = creation period/year
query_sample_paintings = f"""
SELECT ?painting ?label ?year ?lat ?lon 
WHERE {{
  {CBDD_FEED_URI} schema:dataFeedElement ?feedItem .
  ?feedItem schema:item ?painting .
  ?painting rdfs:label ?label .
  OPTIONAL {{ ?painting <https://nfdi4culture.de/ontology/CTO_0001073> ?year . }}
  OPTIONAL {{
    ?painting schema:latitude ?lat .
    ?painting schema:longitude ?lon .
  }}
}}
LIMIT 10
"""

df_sample_paintings = run_sparql(query_sample_paintings)
print(f"Sample paintings ({len(df_sample_paintings)} records):")
print(df_sample_paintings)

# =============================================================================
# Function to get ALL metadata for a specific painting
# Uses the automatic ontology resolver for human-readable property names
# =============================================================================
def get_painting_metadata(painting_uri: str, use_ontology_labels: bool = True) -> pd.DataFrame:
    """
    Retrieve ALL properties (predicates and values) for a specific painting URI.
    This shows the complete metadata stored in the knowledge graph.
    
    Integrates with the CTO/NFDI ontology resolver for human-readable names.
    
    Args:
        painting_uri: The full URI of the painting (e.g., 'https://nfdi4culture.de/id/...')
        use_ontology_labels: If True, use resolved ontology labels (requires the 'Automatic Ontology Resolution' cell to be run)
        
    Returns:
        DataFrame with columns: property_name, value, value_type, property
    """
    query = f"""
    SELECT ?property ?value
    WHERE {{
      <{painting_uri}> ?property ?value .
    }}
    ORDER BY ?property
    """
    
    df = run_sparql(query)
    
    if not df.empty:
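        # Add a readable property name column using ontology resolver if available.
        # globals() is used because a bare dir() inside a function lists only local names.
        resolver = globals().get('resolve_property_name')
        if use_ontology_labels and resolver is not None:
            df['property_name'] = df['property'].apply(resolver)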
        else:
            # Fallback: extract last part of URI
            df['property_name'] = df['property'].apply(
                lambda x: x.split('/')[-1] if '/' in x else x
            )
        
        # Detect value type (URI vs literal)
        df['value_type'] = df['value'].apply(
            lambda x: 'URI' if x.startswith('http') else 'Literal'
        )
        # Reorder columns for better readability
        df = df[['property_name', 'value', 'value_type', 'property']]
    
    return df

# Show all metadata for the first painting in our sample
print("\n" + "="*80)
print("📋 COMPLETE METADATA for first painting:")
print("   (Property names resolved via CTO/NFDI ontology when available)")
print("="*80)

if not df_sample_paintings.empty:
    first_painting_uri = df_sample_paintings.iloc[0]['painting']
    first_painting_label = df_sample_paintings.iloc[0]['label']
    print(f"\n🖼️  {first_painting_label}")
    print(f"URI: {first_painting_uri}\n")
    
    df_metadata = get_painting_metadata(first_painting_uri)
    print(f"Found {len(df_metadata)} property values:\n")
    
    # Group by property for cleaner display
    for prop_name in df_metadata['property_name'].unique():
        prop_rows = df_metadata[df_metadata['property_name'] == prop_name]
        values = prop_rows['value'].tolist()
        value_type = prop_rows['value_type'].iloc[0]
        
        if len(values) == 1:
            val_display = values[0][:80] + '...' if len(values[0]) > 80 else values[0]
            print(f"  • {prop_name}: {val_display}")
        else:
            print(f"  • {prop_name}: ({len(values)} values)")
            for v in values[:3]:  # Show first 3 values
                val_display = v[:70] + '...' if len(v) > 70 else v
                print(f"      - {val_display}")
            if len(values) > 3:
                print(f"      ... and {len(values)-3} more")

print("\n✅ Function defined: get_painting_metadata(painting_uri)")
print("   Use it to explore any painting: get_painting_metadata(df_sample_paintings.iloc[N]['painting'])")
print("   Set use_ontology_labels=False to disable ontology resolution")

Sample paintings (10 records):
                                            painting                                            label  \
0  https://www.deckenmalerei.eu/00e1625e-0ac7-423...                        Burggen, Kapelle St. Anna   
1  https://www.deckenmalerei.eu/021afb11-438b-4f7...                       Iffeldorf, Heuwinklkapelle   
2  https://www.deckenmalerei.eu/02f7125d-cfb1-4fa...  Hessental, Hällische Erbschänke, Gasthaus Krone   
3  https://www.deckenmalerei.eu/03414469-1219-4fc...                             Lauchheim, Pfarrhaus   
4  https://www.deckenmalerei.eu/037d1d8a-4487-439...                             Berlin, Stadtschloss   
5  https://www.deckenmalerei.eu/043e1e20-2c95-42b...        Eisenberg, Residenzschloss Christiansburg   
6  https://www.deckenmalerei.eu/0656df8b-2e41-4cc...    Schmidmühlen, Unteres Schloss (Hammerschloss)   
7  https://www.deckenmalerei.eu/0678f9cc-e52d-46e...                           Weimar, Römisches Haus   
8  https://www.decke

### Automatic Ontology Resolution for CTO/NFDI Codes

The painting metadata uses property codes from two namespaces:

1. **CTO (Culture Ontology)**: `https://nfdi4culture.de/ontology/CTO_XXXXXXX`
   - Domain-specific extension for NFDI4Culture cultural heritage data
   - Example: `CTO_0001009` = "has related person", `CTO_0001011` = "has related location"

2. **NFDIcore**: `https://nfdi.fiz-karlsruhe.de/ontology/NFDI_XXXXXXX`
   - Mid-level ontology for all NFDI consortia
   - Example: `NFDI_0001006` = "has external identifier" (links to GND, etc.)

**Automatic Resolution:**

Instead of hardcoding property labels, we dynamically fetch and parse the official ontology files from the GitHub repositories:

- **CTO**: [cto.ttl](https://github.com/ISE-FIZKarlsruhe/nfdi4culture/blob/main/cto.ttl)
- **NFDIcore**: [nfdicore.ttl](https://github.com/ISE-FIZKarlsruhe/nfdicore/blob/main/nfdicore.ttl)

The `rdfs:label` annotations are extracted for each CTO/NFDI entity, providing human-readable names automatically.
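
As a rough illustration of the label extraction, the following sketch runs a line-based scan over a small hand-written Turtle snippet. The snippet layout is an assumption (the real `cto.ttl` may be formatted differently); only the two codes and their labels are taken from the examples above.

```python
import re

# Hand-written Turtle snippet for illustration; the real ontology file may be laid out differently
ttl_snippet = '''
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<https://nfdi4culture.de/ontology/CTO_0001009> a owl:ObjectProperty ;
    rdfs:label "has related person"@en .

<https://nfdi4culture.de/ontology/CTO_0001011> a owl:ObjectProperty ;
    rdfs:label "has related location"@en .
'''

namespace = "https://nfdi4culture.de/ontology/"
labels = {}
current_code = None

for line in ttl_snippet.splitlines():
    # A line starting with a full subject URI opens a new entity block
    subject = re.match(rf"<{re.escape(namespace)}(CTO_\d+)>", line)
    if subject:
        current_code = subject.group(1)

    # The first rdfs:label seen inside the block is recorded for that entity
    label = re.search(r'rdfs:label\s+"([^"]+)"', line)
    if label and current_code and current_code not in labels:
        labels[current_code] = label.group(1)

    # A blank line closes the current entity block
    if not line.strip():
        current_code = None

print(labels)
```

Regular expressions are fragile against Turtle's many equivalent serialisations, so a proper Turtle parser such as `rdflib` would likely be the more robust choice if an extra dependency is acceptable.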

In [8]:
# =============================================================================
# Automatic CTO/NFDI Ontology Resolution
# =============================================================================
# Dynamically resolve ontology codes to human-readable labels by parsing
# the official OWL/TTL files from the GitHub repositories.
#
# Sources:
#   - CTO (NFDI4Culture Ontology): https://github.com/ISE-FIZKarlsruhe/nfdi4culture
#   - NFDIcore (Mid-level Ontology): https://github.com/ISE-FIZKarlsruhe/nfdicore
#
# This approach fetches the ontology files once and extracts rdfs:label
# for all CTO_* and NFDI_* entities, avoiding hardcoded mappings.

import requests
from functools import lru_cache
import re

# =============================================================================
# Ontology Sources (Raw TTL files from GitHub)
# =============================================================================
ONTOLOGY_SOURCES = {
    'CTO': {
        'url': 'https://raw.githubusercontent.com/ISE-FIZKarlsruhe/nfdi4culture/main/cto.ttl',
        'namespace': 'https://nfdi4culture.de/ontology/',
        'prefix_pattern': r'CTO_\d+',
    },
    'NFDIcore': {
        'url': 'https://raw.githubusercontent.com/ISE-FIZKarlsruhe/nfdicore/main/nfdicore.ttl',
        'namespace': 'https://nfdi.fiz-karlsruhe.de/ontology/',
        'prefix_pattern': r'NFDI_\d+',
    }
}

# Global cache for resolved ontology labels
_ontology_cache = {}
_ontology_loaded = False

def _parse_ttl_labels(ttl_content: str, namespace: str, prefix_pattern: str) -> dict:
    """
    Parse a TTL file and extract rdfs:label for entities matching the prefix pattern.
    Handles both full URI format and prefix notation (used in nfdicore.ttl).
    
    Args:
        ttl_content: The TTL file content as a string
        namespace: The namespace URI (e.g., 'https://nfdi4culture.de/ontology/')
        prefix_pattern: Regex pattern for codes (e.g., 'CTO_\\d+')
    
    Returns:
        dict mapping code -> label (e.g., 'CTO_0001009' -> 'has related person')
    """
    labels = {}
    
    # Pattern 1: Full URI format - <namespace/CODE> ... rdfs:label "Label"@en .
    entity_pattern = re.compile(
        rf'<{re.escape(namespace)}({prefix_pattern})>\s+[^;]*?'
        rf'rdfs:label\s+"([^"]+)"(?:@en)?\s*[;.]',
        re.MULTILINE | re.DOTALL
    )
    
    for match in entity_pattern.finditer(ttl_content):
        code = match.group(1)
        label = match.group(2)
        labels[code] = label
    
    # Pattern 2: Prefix notation - ontology:NFDI_XXXXXX ... rdfs:label "Label"@en
    # First find the prefix definition
    prefix_match = re.search(r'@prefix\s+(\w+):\s+<' + re.escape(namespace) + r'>\s*\.', ttl_content)
    if prefix_match:
        prefix_name = prefix_match.group(1)
        # Now find entities using that prefix
        prefix_entity_pattern = re.compile(
            rf'{prefix_name}:({prefix_pattern})\s+[^;]*?'
            rf'rdfs:label\s+"([^"]+)"(?:@en)?\s*[;.]',
            re.MULTILINE | re.DOTALL
        )
        for match in prefix_entity_pattern.finditer(ttl_content):
            code = match.group(1)
            label = match.group(2)
            if code not in labels:
                labels[code] = label
    
    # Pattern 3: Multi-line format with entity definition on one line, label on another
    lines = ttl_content.split('\n')
    current_entity = None
    
    for line in lines:
        # Check for full URI entity definition
        entity_match = re.match(rf'^<{re.escape(namespace)}({prefix_pattern})>', line)
        if entity_match:
            current_entity = entity_match.group(1)
        
        # Check for prefix notation entity definition (e.g., "ontology:NFDI_0000004")
        if prefix_match:
            prefix_name = prefix_match.group(1)
            prefix_entity_match = re.match(rf'^{prefix_name}:({prefix_pattern})\s', line)
            if prefix_entity_match:
                current_entity = prefix_entity_match.group(1)
        
        # Check for rdfs:label in the current context
        if current_entity:
            label_match = re.search(r'rdfs:label\s+"([^"]+)"(?:@en)?', line)
            if label_match and current_entity not in labels:
                labels[current_entity] = label_match.group(1)
            
            # Reset current entity on blank line or new entity definition
            if line.strip() == '':
                current_entity = None
    
    return labels

def load_ontology_labels(force_reload: bool = False) -> dict:
    """
    Load and cache all ontology labels from CTO and NFDIcore.
    
    Args:
        force_reload: If True, reload even if already cached
    
    Returns:
        dict mapping code -> {'label': str, 'namespace': str, 'uri': str}
    """
    global _ontology_cache, _ontology_loaded
    
    if _ontology_loaded and not force_reload:
        return _ontology_cache
    
    print("Loading ontology labels from GitHub...")
    
    for source_name, source_info in ONTOLOGY_SOURCES.items():
        try:
            print(f"   Fetching {source_name} from {source_info['url'][:50]}...")
            response = requests.get(source_info['url'], timeout=30)
            response.raise_for_status()
            
            labels = _parse_ttl_labels(
                response.text,
                source_info['namespace'],
                source_info['prefix_pattern']
            )
            
            for code, label in labels.items():
                _ontology_cache[code] = {
                    'label': label,
                    'namespace': source_info['namespace'],
                    'uri': f"{source_info['namespace']}{code}",
                    'source': source_name
                }
            
            print(f"   Loaded {len(labels)} labels from {source_name}")
            
        except Exception as e:
            print(f"   Failed to load {source_name}: {e}")
    
    _ontology_loaded = True
    print(f"\nTotal: {len(_ontology_cache)} ontology codes resolved")
    return _ontology_cache

@lru_cache(maxsize=500)
def resolve_ontology_code(code: str) -> dict:
    """
    Resolve a CTO/NFDI ontology code to its label.
    
    Args:
        code: Ontology code like 'CTO_0001009' or 'NFDI_0001006'
    
    Returns:
        dict with 'code', 'label', 'uri', 'source', 'resolved' keys
    """
    result = {'code': code, 'label': code, 'uri': None, 'source': None, 'resolved': False}
    
    # Ensure ontology is loaded
    if not _ontology_loaded:
        load_ontology_labels()
    
    if code in _ontology_cache:
        cached = _ontology_cache[code]
        result['label'] = cached['label']
        result['uri'] = cached['uri']
        result['source'] = cached['source']
        result['resolved'] = True
    else:
        # Construct URI even if label not found
        if code.startswith('CTO_'):
            result['uri'] = f"https://nfdi4culture.de/ontology/{code}"
            result['source'] = 'CTO'
        elif code.startswith('NFDI_'):
            result['uri'] = f"https://nfdi.fiz-karlsruhe.de/ontology/{code}"
            result['source'] = 'NFDIcore'
    
    return result

def resolve_property_name(property_uri: str) -> str:
    """
    Convert a full property URI to a human-readable label.
    
    Args:
        property_uri: Full URI like 'https://nfdi4culture.de/ontology/CTO_0001009'
    
    Returns:
        Human-readable label like 'has related person (CTO_0001009)'
    """
    # Extract the code from the URI
    code = property_uri.split('/')[-1] if '/' in property_uri else property_uri
    
    # Handle standard vocabularies
    if 'schema.org' in property_uri:
        return code
    if 'w3.org' in property_uri:
        return code.split('#')[-1] if '#' in code else code
    
    # Resolve CTO/NFDI codes
    if code.startswith('CTO_') or code.startswith('NFDI_'):
        resolved = resolve_ontology_code(code)
        if resolved['resolved'] and resolved['label'] != code:
            return f"{resolved['label']} ({code})"
    
    return code

def get_ontology_reference_table() -> pd.DataFrame:
    """
    Get a DataFrame with all resolved ontology codes for reference.
    
    Returns:
        DataFrame with columns: code, label, source, uri
    """
    if not _ontology_loaded:
        load_ontology_labels()
    
    rows = []
    for code, info in sorted(_ontology_cache.items()):
        rows.append({
            'code': code,
            'label': info['label'],
            'source': info['source'],
            'uri': info['uri']
        })
    
    return pd.DataFrame(rows)

# =============================================================================
# Load ontology on first run
# =============================================================================
ontology_labels = load_ontology_labels()

# Display summary
print("\n" + "="*70)
print("CTO/NFDI Ontology Code Reference (Auto-loaded from GitHub)")
print("="*70)

# Show some key properties used in CbDD dataset
key_codes = ['CTO_0001005', 'CTO_0001009', 'CTO_0001010', 'CTO_0001011',
             'CTO_0001019', 'CTO_0001026', 'CTO_0001073', 'CTO_0001021',
             'NFDI_0000004', 'NFDI_0000005', 'NFDI_0000008', 'NFDI_0000015']

print("\nKey properties used in the CbDD ceiling painting dataset:\n")
for code in key_codes:
    resolved = resolve_ontology_code(code)
    status = '[OK]' if resolved['resolved'] else '[??]'
    print(f"  {status} {code:15} -> {resolved['label']}")

print("\n" + "="*70)
print("\nOntology Sources:")
for name, info in ONTOLOGY_SOURCES.items():
    print(f"  - {name}: {info['url']}")

print("\nFunctions defined:")
print("   - resolve_ontology_code(code) -> resolve CTO/NFDI codes to labels")
print("   - resolve_property_name(uri) -> human-readable property names")
print("   - get_ontology_reference_table() -> DataFrame with all codes")
print("   - load_ontology_labels(force_reload=True) -> refresh from GitHub")

Loading ontology labels from GitHub...
   Fetching CTO from https://raw.githubusercontent.com/ISE-FIZKarlsruhe...
   Loaded 70 labels from CTO
   Fetching NFDIcore from https://raw.githubusercontent.com/ISE-FIZKarlsruhe...
   Loaded 197 labels from NFDIcore

Total: 267 ontology codes resolved

CTO/NFDI Ontology Code Reference (Auto-loaded from GitHub)

Key properties used in the CbDD ceiling painting dataset:

  [OK] CTO_0001005     -> source item
  [OK] CTO_0001009     -> has related person
  [OK] CTO_0001010     -> has related organization
  [OK] CTO_0001011     -> has related location
  [OK] CTO_0001019     -> has related item
  [OK] CTO_0001026     -> has external classifier
  [OK] CTO_0001073     -> has creation period
  [OK] CTO_0001021     -> has content url
  [OK] NFDI_0000004    -> person
  [OK] NFDI_0000005    -> place
  [OK] NFDI_0000008    -> creative work
  [OK] NFDI_0000015    -> identifier


Ontology Sources:
  - CTO: https://raw.githubusercontent.com/ISE-FIZKarlsruhe/nf

In [9]:
# =============================================================================
# Subject Resolution via External SPARQL Endpoints
# =============================================================================
# Resolves subject URIs from CTO_0001026 ("has external classifier") to labels
# using the official ICONCLASS and Getty AAT SPARQL endpoints.
#
# Integrates with the CTO/NFDI ontology resolver for consistent
# property name resolution throughout the notebook.

import requests
import time
from functools import lru_cache
import urllib.parse

@lru_cache(maxsize=500)
def query_iconclass_sparql(notation):
    """Query ICONCLASS SPARQL endpoint for a label."""
    try:
        # URL-decode the notation (e.g., "48C14%28SCHEINARCHITEKTUR%29" -> "48C14(SCHEINARCHITEKTUR)")
        notation_decoded = urllib.parse.unquote(notation)
        
        endpoint = "https://iconclass.org/sparql"
        query = f"""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        
        SELECT ?label
        WHERE {{
          <https://iconclass.org/{notation_decoded}> skos:prefLabel ?label .
          FILTER(LANG(?label) = "en")
        }}
        LIMIT 1
        """.strip()  # IMPORTANT: strip whitespace!
        
        resp = requests.get(
            endpoint,
            params={'query': query, 'format': 'json'},
            headers={'Accept': 'application/sparql-results+json'},
            timeout=10
        )
        if resp.ok:
            data = resp.json()
            bindings = data.get("results", {}).get("bindings", [])
            if bindings:
                return bindings[0].get("label", {}).get("value")
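    except Exception:
        # Network and parsing errors are swallowed; the caller treats None as "unresolved".
        # Note that lru_cache also caches this None, so a transient failure is not retried.
        pass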
    return None

@lru_cache(maxsize=500)
def query_getty_sparql(aat_id):
    """Query Getty AAT SPARQL endpoint for a label using gvp:prefLabelGVP."""
    try:
        endpoint = "http://vocab.getty.edu/sparql"
        # Getty uses gvp:prefLabelGVP/xl:literalForm for preferred labels
        # IMPORTANT: Must strip whitespace - Getty returns empty response if query has leading whitespace!
        query = f"""
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX xl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX aat: <http://vocab.getty.edu/aat/>

SELECT ?label
WHERE {{
  aat:{aat_id} gvp:prefLabelGVP/xl:literalForm ?label .
}}
LIMIT 1
""".strip()
        
        resp = requests.get(
            endpoint,
            params={'query': query, 'format': 'json'},
            headers={'Accept': 'application/sparql-results+json'},
            timeout=10
        )
        if resp.ok and resp.text:  # Also check response is not empty
            data = resp.json()
            bindings = data.get("results", {}).get("bindings", [])
            if bindings:
                return bindings[0].get("label", {}).get("value")
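    except Exception:
        # Network and parsing errors are swallowed; the caller treats None as "unresolved".
        # Note that lru_cache also caches this None, so a transient failure is not retried.
        pass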
    return None

def resolve_subject_from_sparql(uri):
    """
    Resolve a subject URI to its label using external SPARQL endpoints.
    
    Handles subjects from CTO_0001026 ("has external classifier"):
    - ICONCLASS: iconographic classification for art
    - Getty AAT: Art & Architecture Thesaurus
    
    Args:
        uri: Subject URI (e.g., 'https://iconclass.org/92D1521' or 'http://vocab.getty.edu/aat/300004792')
    
    Returns:
        dict with 'uri', 'code', 'label', 'source', 'resolved' keys
    """
    code = uri.split('/')[-1]
    
    if 'iconclass.org' in uri:
        label = query_iconclass_sparql(code)
        source = 'ICONCLASS'
    elif 'vocab.getty.edu' in uri:
        label = query_getty_sparql(code)
        source = 'Getty AAT'
    else:
        label = None
        source = 'Unknown'
    
    return {
        'uri': uri,
        'code': code,
        'label': label or f'[{code}]',
        'source': source,
        'resolved': label is not None
    }

# Test with sample codes
print("Testing external SPARQL endpoints for subject resolution...")
print("="*70)
print(f"\nSubjects come from CTO_0001026", end="")
if 'resolve_ontology_code' in dir():
    resolved = resolve_ontology_code('CTO_0001026')
    print(f" ({resolved['label']})")
else:
    print(" (has external classifier)")

print("\n1. ICONCLASS tests:")
for code in ["92D1521", "25HH", "5"]:
    label = query_iconclass_sparql(code)
    print(f"   {code}: {label}")

print("\n2. Getty AAT tests (using gvp:prefLabelGVP/xl:literalForm):")
for code in ["300004792", "300411453"]:
    label = query_getty_sparql(code)
    print(f"   {code}: {label}")

print("\n" + "="*70)
print("✅ Functions defined:")
print("   - resolve_subject_from_sparql(uri) -> resolve ICONCLASS/AAT URIs to labels")
print("   - query_iconclass_sparql(notation) -> query ICONCLASS endpoint")
print("   - query_getty_sparql(aat_id) -> query Getty AAT endpoint")
print("\nThese integrate with CTO_0001026 ('has external classifier') property.")

Testing external SPARQL endpoints for subject resolution...

Subjects come from CTO_0001026 (has external classifier)

1. ICONCLASS tests:
   92D1521: Cupid shooting a dart
   25HH: landscapes - HH - ideal landscapes
   5: Abstract Ideas and Concepts

2. Getty AAT tests (using gvp:prefLabelGVP/xl:literalForm):
   300004792: buildings (structures)
   300411453: ceiling paintings

✅ Functions defined:
   - resolve_subject_from_sparql(uri) -> resolve ICONCLASS/AAT URIs to labels
   - query_iconclass_sparql(notation) -> query ICONCLASS endpoint
   - query_getty_sparql(aat_id) -> query Getty AAT endpoint

These integrate with CTO_0001026 ('has external classifier') property.
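
Before any network call is made, a classifier URI has to be split into its vocabulary and its code. The offline sketch below shows one way to do that, including the URL-decoding of ICONCLASS notations; the helper name `split_classifier_uri` is hypothetical, and the `/aat/` path check is an added assumption, since `vocab.getty.edu` also hosts vocabularies other than AAT.

```python
from urllib.parse import unquote, urlparse


def split_classifier_uri(uri):
    """Return the vocabulary name and decoded code for a classifier URI."""
    parsed = urlparse(uri)
    raw_code = uri.rstrip("/").split("/")[-1]

    if parsed.netloc == "iconclass.org":
        source = "ICONCLASS"
    elif parsed.netloc == "vocab.getty.edu" and parsed.path.startswith("/aat/"):
        source = "Getty AAT"
    else:
        source = "Unknown"

    # Percent-encoded characters such as %28 and %29 are turned back into ( and )
    return {"source": source, "code": unquote(raw_code)}


example_uris = [
    "https://iconclass.org/92D1521",
    "https://iconclass.org/48C14%28SCHEINARCHITEKTUR%29",
    "http://vocab.getty.edu/aat/300004792",
]
parts = [split_classifier_uri(uri) for uri in example_uris]
for part in parts:
    print(part)
```

Matching on the parsed host rather than on a substring of the whole URI avoids false positives when a vocabulary name merely appears somewhere in a path or query string.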


### CbDD Graph Data Integration

The CbDD (Corpus of Baroque Ceiling Painting in Germany) provides a pre-exported graph dataset (`graphData.json`) with rich relational data that complements the NFDI4Culture Knowledge Graph.

**Data available from the CbDD graph:**
- **Painters** (link type: `PAINTERS`) - directly named, no GND resolution needed
- **Commissioners** (link type: `COMMISSIONERS`) - patrons who commissioned the work
- **Rooms** (link type: `PART` to `OBJECT_ROOM`) - where the painting is located
- **Buildings** (link type: `PART` to `OBJECT_BUILDING`) - the church/palace containing the room
- **Dates** (link type: `DATE`) - creation dates
- **Architects, Plasterers, etc.** - other related persons
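
The node-and-link layout described above can be sketched on a toy graph. This is a minimal illustration, assuming that `graphData.json` holds a `nodes` list and a `links` list with `source`, `target`, and `type` fields, and that `PART` links point from parent to child. All ids and names below are made up, and `find_building` is a hypothetical helper, not part of the CbDD export.

```python
from collections import defaultdict
from typing import Optional

# Toy graph in the assumed graphData.json shape; every id and name is invented.
toy_graph = {
    "nodes": [
        {"id": "b1", "type": "OBJECT_BUILDING", "name": "Example Town, Abbey Church"},
        {"id": "r1", "type": "OBJECT_ROOM", "name": "Nave"},
        {"id": "p1", "type": "OBJECT_PAINTING", "name": "Example Fresco"},
        {"id": "a1", "type": "ACTOR_PERSON", "name": "Painter, Example"},
    ],
    "links": [
        {"source": "b1", "target": "r1", "type": "PART"},      # building contains room
        {"source": "r1", "target": "p1", "type": "PART"},      # room contains painting
        {"source": "p1", "target": "a1", "type": "PAINTERS"},  # painting -> painter
    ],
}

# Index nodes by id and links by both endpoints for constant-time lookups.
nodes_by_id = {n["id"]: n for n in toy_graph["nodes"]}
links_by_source = defaultdict(list)
links_by_target = defaultdict(list)
for link in toy_graph["links"]:
    links_by_source[link["source"]].append(link)
    links_by_target[link["target"]].append(link)


def find_building(node_id: str, max_depth: int = 10) -> Optional[dict]:
    """Follow incoming PART links upward until an OBJECT_BUILDING is reached."""
    if max_depth <= 0:
        return None
    for link in links_by_target.get(node_id, []):
        if link["type"] != "PART":
            continue
        parent = nodes_by_id.get(link["source"])
        if parent is None:
            continue
        if parent["type"] == "OBJECT_BUILDING":
            return parent
        found = find_building(parent["id"], max_depth - 1)
        if found:
            return found
    return None


# Painters are outgoing PAINTERS links; the building is found via incoming PART links.
painters = [nodes_by_id[l["target"]]["name"]
            for l in links_by_source["p1"] if l["type"] == "PAINTERS"]
building = find_building("p1")
print(painters, building["name"])
```

Indexing links by both source and target keeps person lookups (outgoing links) and location lookups (incoming `PART` links) cheap, which matters once the graph has tens of thousands of links.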

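**Matching strategy:** Paintings are matched by comparing their `rdfs:label` from NFDI4Culture with the `name` field in the CbDD graph, first exactly and then case-insensitively. Painting names are not unique in the graph, so when several paintings share a name the first one found is used.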

This approach is more reliable than GND resolution because:
1. Names are pre-resolved and curated in the CbDD database
2. Role classification (painter vs commissioner) is explicit in the graph structure
3. No external API calls needed, making it faster and more robust
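
Because matching relies on names alone, it helps to see how many labels match uniquely. The following is a minimal sketch, assuming nodes carry `id`, `type`, and `name` fields as in the CbDD export. `normalize_label`, `build_name_index`, and `match_label` are hypothetical helpers, and the sample nodes are invented for illustration.

```python
import unicodedata
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Unicode-normalize, collapse whitespace, and casefold a label for comparison."""
    text = unicodedata.normalize("NFC", label)
    return " ".join(text.split()).casefold()


def build_name_index(nodes):
    """Map each normalized painting name to all node ids that carry that name."""
    index = defaultdict(list)
    for node in nodes:
        if node.get("type") == "OBJECT_PAINTING" and node.get("name"):
            index[normalize_label(node["name"])].append(node["id"])
    return index


def match_label(label, index):
    """Return (status, ids), where status is 'unique', 'ambiguous', or 'missing'."""
    ids = index.get(normalize_label(label), [])
    if not ids:
        return "missing", []
    return ("unique" if len(ids) == 1 else "ambiguous"), ids


# Hypothetical nodes for illustration only.
nodes = [
    {"id": "p1", "type": "OBJECT_PAINTING", "name": "Spes"},
    {"id": "p2", "type": "OBJECT_PAINTING", "name": "Spes"},
    {"id": "p3", "type": "OBJECT_PAINTING", "name": "Fides "},
]
index = build_name_index(nodes)
print(match_label("spes", index))
print(match_label(" Fides", index))
print(match_label("Caritas", index))
```

Reporting ambiguous matches separately makes it visible where a name-based join may attach the wrong painter or building to a painting.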

In [10]:
# =============================================================================
# CbDD Graph Data Loader (Enhanced)
# =============================================================================
# Loads the pre-exported CbDD graph data (graphData.json) and provides
# comprehensive functions to enrich painting data with:
#   - People: painters, commissioners, architects, plasterers, sculptors, etc.
#   - Locations: room → building → state hierarchy
#   - Building metadata: function, architects, commissioners
#   - Relationships: painter networks, template providers
#
# Link types extracted: PAINTERS, COMMISSIONERS, ARCHITECTS, PLASTERERS,
#   SCULPTORS, DESIGNERS, TEMPLATE_PROVIDERS, BUILDERS, FUNCTION, LOCATION,
#   DATE, METHOD, MATERIAL, PART, and more.

import json
import os
from typing import Optional, Dict, List, Any
from collections import defaultdict

# =============================================================================
# Load and Parse CbDD Graph
# =============================================================================
# Notebooks do not define __file__, so look for the export in the current working directory
CBDD_GRAPH_PATH = os.path.join(os.getcwd(), 'graphData.json')

# Global cache for the graph data and indices
_cbdd_graph = None
_cbdd_nodes_by_id = None
_cbdd_nodes_by_name = None
_cbdd_paintings_by_name = None
_cbdd_links_by_source = None
_cbdd_links_by_target = None
_cbdd_buildings_by_name = None
_cbdd_painter_to_paintings = None
_cbdd_graph_loaded = False

def load_cbdd_graph(force_reload: bool = False) -> Optional[dict]:
    """
    Load the CbDD graph data from graphData.json and build lookup indices.
    
    Returns:
        dict with 'nodes', 'links' and 'exportDate', or None if the file
        cannot be loaded. Lookup indices are stored in module-level globals.
    """
    global _cbdd_graph, _cbdd_nodes_by_id, _cbdd_nodes_by_name
    global _cbdd_paintings_by_name, _cbdd_links_by_source, _cbdd_links_by_target
    global _cbdd_buildings_by_name, _cbdd_painter_to_paintings, _cbdd_graph_loaded
    
    if _cbdd_graph_loaded and not force_reload:
        return _cbdd_graph
    
    print("📥 Loading CbDD graph data from graphData.json...")
    
    try:
        with open(CBDD_GRAPH_PATH, encoding='utf-8') as f:
            _cbdd_graph = json.load(f)
        
        # Build lookup indices for fast access
        _cbdd_nodes_by_id = {n['id']: n for n in _cbdd_graph['nodes']}
        
        # Build name lookup (case-insensitive, normalized)
        _cbdd_nodes_by_name = {}
        for n in _cbdd_graph['nodes']:
            name = n.get('name', '').strip()
            if name:
                key = name.lower()
                if key not in _cbdd_nodes_by_name:
                    _cbdd_nodes_by_name[key] = []
                _cbdd_nodes_by_name[key].append(n)
        
        # Build painting-specific lookup by exact name
        _cbdd_paintings_by_name = {}
        paintings = [n for n in _cbdd_graph['nodes'] if n.get('type') == 'OBJECT_PAINTING']
        for p in paintings:
            name = p.get('name', '').strip()
            if name:
                if name not in _cbdd_paintings_by_name:
                    _cbdd_paintings_by_name[name] = []
                _cbdd_paintings_by_name[name].append(p)
        
        # Build building lookup by name
        _cbdd_buildings_by_name = {}
        buildings = [n for n in _cbdd_graph['nodes'] if n.get('type') == 'OBJECT_BUILDING']
        for b in buildings:
            name = b.get('name', '').strip()
            if name:
                _cbdd_buildings_by_name[name] = b
        
        # Build links index by source AND target for fast lookup
        _cbdd_links_by_source = {}
        _cbdd_links_by_target = {}
        for link in _cbdd_graph['links']:
            src, tgt = link['source'], link['target']
            if src not in _cbdd_links_by_source:
                _cbdd_links_by_source[src] = []
            _cbdd_links_by_source[src].append(link)
            if tgt not in _cbdd_links_by_target:
                _cbdd_links_by_target[tgt] = []
            _cbdd_links_by_target[tgt].append(link)
        
        # Build painter -> paintings index for network analysis
        _cbdd_painter_to_paintings = defaultdict(list)
        for link in _cbdd_graph['links']:
            if link['type'] == 'PAINTERS':
                painter_id = link['target']
                painting_id = link['source']
                painter = _cbdd_nodes_by_id.get(painter_id)
                painting = _cbdd_nodes_by_id.get(painting_id)
                if painter and painting:
                    _cbdd_painter_to_paintings[painter.get('name', '')].append({
                        'id': painting_id,
                        'name': painting.get('name', '')
                    })
        
        _cbdd_graph_loaded = True
        
        # Statistics
        node_types = {}
        link_types = {}
        for n in _cbdd_graph['nodes']:
            t = n.get('type', 'UNKNOWN')
            node_types[t] = node_types.get(t, 0) + 1
        for l in _cbdd_graph['links']:
            t = l['type']
            link_types[t] = link_types.get(t, 0) + 1
        
        print(f"   ✓ Loaded {len(_cbdd_graph['nodes']):,} nodes, {len(_cbdd_graph['links']):,} links")
        print(f"   ✓ Export date: {_cbdd_graph.get('exportDate', 'unknown')}")
        print("\n   Node types:")
        for t, count in sorted(node_types.items(), key=lambda x: -x[1])[:8]:
            print(f"      {t}: {count:,}")
        print("\n   Key link types:")
        for t in ['PAINTERS', 'COMMISSIONERS', 'ARCHITECTS', 'FUNCTION', 'LOCATION', 'PART', 'TEMPLATE_PROVIDERS']:
            print(f"      {t}: {link_types.get(t, 0):,}")
        print(f"\n   ✓ Indices built: {len(_cbdd_paintings_by_name):,} paintings, {len(_cbdd_buildings_by_name):,} buildings")
        print(f"   ✓ Painter network: {len(_cbdd_painter_to_paintings):,} painters tracked")
        
        return _cbdd_graph
        
    except FileNotFoundError:
        print("   ⚠ graphData.json not found! Download it from the CbDD portal.")
        _cbdd_graph_loaded = False
        return None
    except Exception as e:
        print(f"   ⚠ Error loading graph: {e}")
        _cbdd_graph_loaded = False
        return None


def get_painting_from_graph(painting_name: str) -> Optional[Dict]:
    """Find a painting in the CbDD graph by its name."""
    if not _cbdd_graph_loaded:
        load_cbdd_graph()
    
    if not _cbdd_paintings_by_name:
        return None
    
    name = painting_name.strip()
    if name in _cbdd_paintings_by_name:
        return _cbdd_paintings_by_name[name][0]
    
    # Try case-insensitive match
    name_lower = name.lower()
    for key, paintings in _cbdd_paintings_by_name.items():
        if key.lower() == name_lower:
            return paintings[0]
    
    return None


def get_building_info(building_id: str) -> Dict[str, Any]:
    """
    Extract comprehensive information about a building from the CbDD graph.
    
    Args:
        building_id: The UUID of the building in the CbDD graph
    
    Returns:
        dict with building details: name, function, location, architects, commissioners, etc.
    """
    if not _cbdd_graph_loaded:
        load_cbdd_graph()
    
    result = {
        'building_name': None,
        'building_id': building_id,
        'function': None,
        'location_state': None,
        'architects': [],
        'building_commissioners': [],
        'builders': [],
        'construction_date': None,
    }
    
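    # If the graph failed to load, the indices are still None
    if not _cbdd_nodes_by_id or _cbdd_links_by_source is None:
        return result
    
    building = _cbdd_nodes_by_id.get(building_id)
    if not building:
        return result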
    
    result['building_name'] = building.get('name')
    
    # Get all outgoing links from this building
    links = _cbdd_links_by_source.get(building_id, [])
    
    for link in links:
        target = _cbdd_nodes_by_id.get(link['target'])
        if not target:
            continue
        
        link_type = link['type']
        target_name = target.get('name', '')
        
        if link_type == 'FUNCTION':
            # Clean up function name (e.g., "Funktion: Kirche -> Abteikirche" -> "Abteikirche")
            func = target_name
            if func.startswith('Funktion: '):
                func = func[10:]
            if ' -> ' in func:
                func = func.split(' -> ')[-1]  # Take most specific function
            result['function'] = func
        elif link_type == 'LOCATION':
            result['location_state'] = target_name
        elif link_type == 'ARCHITECTS':
            result['architects'].append(target_name)
        elif link_type == 'COMMISSIONERS':
            result['building_commissioners'].append(target_name)
        elif link_type == 'BUILDERS':
            result['builders'].append(target_name)
        elif link_type == 'DATE':
            # Get construction date
            if not result['construction_date']:
                result['construction_date'] = target_name
    
    return result


def get_painting_relations(painting_id: str) -> Dict[str, Any]:
    """
    Extract ALL relationships for a painting from the CbDD graph.
    
    Args:
        painting_id: The UUID of the painting in the CbDD graph
    
    Returns:
        dict with comprehensive relationship data
    """
    if not _cbdd_graph_loaded:
        load_cbdd_graph()
    
    result = {
        # People
        'painters': [],
        'commissioners': [],
        'architects': [],
        'plasterers': [],
        'sculptors': [],
        'designers': [],
        'template_providers': [],
        'other_artists': [],
        # Location hierarchy
        'room': None,
        'room_id': None,
        'building': None,
        'building_id': None,
        'building_function': None,
        'location_state': None,
        'building_architects': [],
        'building_commissioners': [],
        # Artwork metadata
        'date': None,
        'method': None,
        'material': None,
    }
    
    if not _cbdd_links_by_source or not _cbdd_nodes_by_id:
        return result
    
    # Get all links FROM this painting (outgoing)
    links = _cbdd_links_by_source.get(painting_id, [])
    
    for link in links:
        target = _cbdd_nodes_by_id.get(link['target'])
        if not target:
            continue
        
        link_type = link['type']
        target_name = target.get('name', '')
        
        # People relationships
        if link_type == 'PAINTERS':
            result['painters'].append(target_name)
        elif link_type == 'COMMISSIONERS':
            result['commissioners'].append(target_name)
        elif link_type == 'ARCHITECTS':
            result['architects'].append(target_name)
        elif link_type == 'PLASTERERS':
            result['plasterers'].append(target_name)
        elif link_type == 'SCULPTORS':
            result['sculptors'].append(target_name)
        elif link_type == 'DESIGNERS':
            result['designers'].append(target_name)
        elif link_type == 'TEMPLATE_PROVIDERS':
            result['template_providers'].append(target_name)
        elif link_type in ('ARTISTS', 'IMAGE_CARVERS', 'CABINETMAKERS', 'CARPENTERS'):
            result['other_artists'].append(target_name)
        # Metadata
        elif link_type == 'DATE':
            result['date'] = target_name
        elif link_type == 'METHOD':
            method = target_name
            if method.startswith('Technik: '):
                method = method[9:]
            result['method'] = method
        elif link_type == 'MATERIAL':
            result['material'] = target_name
    
    # Find room/building via PART links by traversing ALL the way up the hierarchy
    # PART links go from PARENT → CHILD (source → target)
    # Hierarchy can be: PAINTING -> ROOM -> ROOM -> ... -> BUILDING -> ENSEMBLE
    # We need to traverse until we find OBJECT_BUILDING
    
    def traverse_to_building(node_id: str, depth: int = 0, max_depth: int = 10) -> Optional[Dict]:
        """Recursively traverse up the PART hierarchy to find the building."""
        if depth >= max_depth:
            return None
        
        part_links = _cbdd_links_by_target.get(node_id, [])
        for link in part_links:
            if link['type'] != 'PART':
                continue
            
            parent = _cbdd_nodes_by_id.get(link['source'])
            if not parent:
                continue
            
            parent_type = parent.get('type', '')
            
            if parent_type == 'OBJECT_BUILDING':
                return parent
            elif parent_type in ('OBJECT_ROOM', 'OBJECT_ENSEMBLE'):
                # Continue traversing up
                found = traverse_to_building(parent['id'], depth + 1, max_depth)
                if found:
                    return found
        
        return None
    
    # Get immediate parent (room) first
    part_links = _cbdd_links_by_target.get(painting_id, [])
    
    for link in part_links:
        if link['type'] != 'PART':
            continue
        
        parent = _cbdd_nodes_by_id.get(link['source'])
        if not parent:
            continue
        
        parent_type = parent.get('type', '')
        
        if parent_type == 'OBJECT_ROOM':
            result['room'] = parent.get('name')
            result['room_id'] = parent['id']
            
            # Traverse ALL the way up to find the building
            building = traverse_to_building(parent['id'])
            if building:
                result['building'] = building.get('name')
                result['building_id'] = building['id']
                
                # Get building info
                building_info = get_building_info(building['id'])
                result['building_function'] = building_info.get('function')
                result['location_state'] = building_info.get('location_state')
                result['building_architects'] = building_info.get('architects', [])
                result['building_commissioners'] = building_info.get('building_commissioners', [])
            break
        
        elif parent_type == 'OBJECT_BUILDING':
            # Painting directly in building (no room)
            result['building'] = parent.get('name')
            result['building_id'] = parent['id']
            building_info = get_building_info(parent['id'])
            result['building_function'] = building_info.get('function')
            result['location_state'] = building_info.get('location_state')
            result['building_architects'] = building_info.get('architects', [])
            result['building_commissioners'] = building_info.get('building_commissioners', [])
            break
    
    return result


def enrich_painting_from_graph(painting_name: str) -> Optional[Dict[str, Any]]:
    """
    Get all enrichment data for a painting by its name.
    
    This is the main function to use for enriching NFDI4Culture data with CbDD graph data.
    
    Args:
        painting_name: The painting label (rdfs:label from NFDI4Culture)
    
    Returns:
        dict with all available data, or None if painting not found in graph
    """
    painting = get_painting_from_graph(painting_name)
    if not painting:
        return None
    
    relations = get_painting_relations(painting['id'])
    
    return {
        'cbdd_id': painting['id'],
        'cbdd_name': painting.get('name'),
        **relations
    }


def get_painter_network(painter_name: str) -> Dict[str, Any]:
    """
    Get network information for a painter: their paintings and co-painters.
    
    Args:
        painter_name: Name of the painter
        
    Returns:
        dict with paintings list, co_painters, building_count, etc.
    """
    if not _cbdd_graph_loaded:
        load_cbdd_graph()
    
    result = {
        'painter_name': painter_name,
        'painting_count': 0,
        'paintings': [],
        'co_painters': {},
        'buildings_worked_in': set(),
        'commissioners_worked_for': set(),
    }
    
    # Fall back to an empty index if the graph failed to load
    paintings = (_cbdd_painter_to_paintings or {}).get(painter_name, [])
    result['painting_count'] = len(paintings)
    result['paintings'] = paintings[:20]  # Limit for display
    
    # Find co-painters and other info
    for painting_info in paintings:
        painting_id = painting_info['id']
        
        # Get other painters on same painting
        links = _cbdd_links_by_source.get(painting_id, [])
        for link in links:
            if link['type'] == 'PAINTERS':
                other_painter = _cbdd_nodes_by_id.get(link['target'])
                if other_painter:
                    other_name = other_painter.get('name', '')
                    if other_name != painter_name:
                        result['co_painters'][other_name] = result['co_painters'].get(other_name, 0) + 1
            elif link['type'] == 'COMMISSIONERS':
                commissioner = _cbdd_nodes_by_id.get(link['target'])
                if commissioner:
                    result['commissioners_worked_for'].add(commissioner.get('name', ''))
        
        # Get building
        relations = get_painting_relations(painting_id)
        if relations.get('building'):
            result['buildings_worked_in'].add(relations['building'])
    
    # Convert sets to sorted lists
    result['buildings_worked_in'] = sorted(result['buildings_worked_in'])
    result['commissioners_worked_for'] = sorted(result['commissioners_worked_for'])
    result['co_painters'] = dict(sorted(result['co_painters'].items(), key=lambda x: -x[1]))
    
    return result


def enrich_dataframe_from_graph(df: pd.DataFrame, name_column: str = 'label') -> pd.DataFrame:
    """
    Enrich a DataFrame of paintings with comprehensive data from the CbDD graph.
    The DataFrame is modified in place and also returned.
    
    Adds columns: painters, commissioners, architects, plasterers, template_providers,
                  room, building, building_address, building_function, location_state,
                  building_architects, date_cbdd, method, cbdd_id
    
    Args:
        df: DataFrame with painting data (must have a name/label column)
        name_column: Name of the column containing painting names
    
    Returns:
        DataFrame with additional columns from CbDD graph
    """
    if not _cbdd_graph_loaded:
        load_cbdd_graph()
    
    if df.empty:
        return df
    
    # Initialize new columns (including building_address)
    enrichment_cols = [
        'painters', 'commissioners', 'architects', 'plasterers', 'template_providers',
        'room', 'building', 'building_address', 'building_function', 'location_state',
        'building_architects', 'date_cbdd', 'method', 'cbdd_id'
    ]
    for col in enrichment_cols:
        if col not in df.columns:
            df[col] = None
    
    matched = 0
    for idx, row in df.iterrows():
        name = row.get(name_column)
        if not name:
            continue
        
        enrichment = enrich_painting_from_graph(name)
        if enrichment:
            matched += 1
            df.at[idx, 'cbdd_id'] = enrichment.get('cbdd_id')
            df.at[idx, 'painters'] = ', '.join(enrichment.get('painters', [])) or None
            df.at[idx, 'commissioners'] = ', '.join(enrichment.get('commissioners', [])) or None
            df.at[idx, 'architects'] = ', '.join(enrichment.get('architects', [])) or None
            df.at[idx, 'plasterers'] = ', '.join(enrichment.get('plasterers', [])) or None
            df.at[idx, 'template_providers'] = ', '.join(enrichment.get('template_providers', [])) or None
            df.at[idx, 'room'] = enrichment.get('room')
            df.at[idx, 'building'] = enrichment.get('building')
            df.at[idx, 'building_function'] = enrichment.get('building_function')
            df.at[idx, 'location_state'] = enrichment.get('location_state')
            df.at[idx, 'building_architects'] = ', '.join(enrichment.get('building_architects', [])) or None
            df.at[idx, 'date_cbdd'] = enrichment.get('date')
            df.at[idx, 'method'] = enrichment.get('method')
            
            # Extract address from building name (format: "City, Building Name" or "City, Street Number")
            building_name = enrichment.get('building', '')
            if building_name:
                df.at[idx, 'building_address'] = building_name  # The building name IS the address
    
    print(f"   ✓ Matched {matched}/{len(df)} paintings ({100*matched/len(df):.1f}%) with CbDD graph")
    return df


def get_top_painters(limit: int = 20) -> List[Dict]:
    """Get list of most prolific painters with their painting counts."""
    if not _cbdd_graph_loaded:
        load_cbdd_graph()
    
    # Fall back to an empty index if the graph failed to load
    painters = [(name, len(paintings)) for name, paintings in (_cbdd_painter_to_paintings or {}).items()]
    painters.sort(key=lambda x: -x[1])
    
    return [{'name': name, 'count': count} for name, count in painters[:limit]]


# =============================================================================
# Load the graph on first run
# =============================================================================
cbdd_graph = load_cbdd_graph()

# Test with a sample painting name
print("\n" + "="*70)
print("Testing enhanced CbDD graph lookup:")
print("="*70)

test_names = ["Spes", "Der Goldene Saal", "Mannheim, Kurfürstliches Residenzschloss"]
for name in test_names:
    result = enrich_painting_from_graph(name)
    if result:
        print(f"\n✓ '{name}':")
        if result.get('painters'):
            print(f"   🎨 Painters: {', '.join(result['painters'][:3])}")
        if result.get('commissioners'):
            print(f"   👤 Commissioners: {', '.join(result['commissioners'][:2])}")
        if result.get('template_providers'):
            print(f"   📐 Template providers: {', '.join(result['template_providers'][:2])}")
        if result.get('room'):
            print(f"   🚪 Room: {result['room']}")
        if result.get('building'):
            print(f"   🏛️ Building: {result['building']}")
        if result.get('building_function'):
            print(f"   ⛪ Function: {result['building_function']}")
        if result.get('location_state'):
            print(f"   📍 State: {result['location_state']}")
        if result.get('building_architects'):
            print(f"   🏗️ Building architects: {', '.join(result['building_architects'][:2])}")
    else:
        print(f"\n✗ '{name}': Not found in graph")

# Show top painters
print("\n" + "="*70)
print("Top 10 most prolific painters in CbDD:")
print("="*70)
for p in get_top_painters(10):
    print(f"   🎨 {p['name']}: {p['count']} paintings")

print("\n" + "="*70)
print("✅ Enhanced CbDD Graph functions defined:")
print("   - load_cbdd_graph() -> load/reload the graph data")
print("   - get_painting_from_graph(name) -> find painting by name")
print("   - get_painting_relations(id) -> get all relations for a painting")
print("   - get_building_info(id) -> get building details (function, architects)")
print("   - enrich_painting_from_graph(name) -> get enrichment data by name")
print("   - enrich_dataframe_from_graph(df) -> enrich a whole DataFrame")
print("   - get_painter_network(name) -> painter's works and collaborators")
print("   - get_top_painters(limit) -> most prolific painters")

📥 Loading CbDD graph data from graphData.json...
   ✓ Loaded 13,835 nodes, 60,150 links
   ✓ Export date: 2025-12-01

   Node types:
      OBJECT_PAINTING: 5,839
      ACTOR_PERSON: 2,772
      OBJECT_ROOM: 2,376
      OBJECT_BUILDING: 1,260
      TEXT: 1,230
      FUNCTION: 200
      ACTOR_SOCIETY: 59
      OBJECT_ENSEMBLE: 32

   Key link types:
      PAINTERS: 7,051
      COMMISSIONERS: 11,160
      ARCHITECTS: 1,743
      FUNCTION: 2,910
      LOCATION: 1,308
      PART: 8,276
      TEMPLATE_PROVIDERS: 1,646

   ✓ Indices built: 5,109 paintings, 1,260 buildings
   ✓ Painter network: 553 painters tracked

Testing enhanced CbDD graph lookup:

✓ 'Spes':
   🎨 Painters: Messmer, Johann Georg
   👤 Commissioners: Stadion, Maria Maximiliana von
   🚪 Room: Die Stiftskirche
   🏛️ Building: Bad Buchau, Stiftskirche
   ⛪ Function: Klosterkirche
   📍 State: Baden-Württemberg
   🏗️ Building architects: D'Ixnard, Pierre Michel

✗ 'Der Goldene Saal': Not fou
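
The coordinate lookup in the next cell interpolates building names into a SPARQL `FILTER`. A minimal sketch of escaping such a term before it goes into a double-quoted SPARQL string literal follows. `escape_sparql_literal` and `build_label_filter` are hypothetical helpers, not part of SPARQLWrapper or the notebook, and the example term is invented.

```python
def escape_sparql_literal(value: str) -> str:
    """Escape a Python string for use inside a double-quoted SPARQL literal."""
    replacements = [
        ("\\", "\\\\"),  # backslash first, so later escapes are not doubled
        ('"', '\\"'),
        ("\n", "\\n"),
        ("\r", "\\r"),
        ("\t", "\\t"),
    ]
    for raw, escaped in replacements:
        value = value.replace(raw, escaped)
    return value


def build_label_filter(term: str) -> str:
    """Build a case-insensitive CONTAINS filter on ?label for the given term."""
    return f'FILTER(CONTAINS(LCASE(?label), LCASE("{escape_sparql_literal(term)}")))'


# Invented example term containing double quotes.
print(build_label_filter('Schloss "Favorite"'))
```

Single quotes need no escaping inside a double-quoted literal, so names such as `D'Ixnard` pass through unchanged.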

In [11]:
# =============================================================================
# Building Coordinates Lookup from NFDI4Culture KG
# =============================================================================
# The CbDD graph provides building names/addresses but not coordinates.
# We query the NFDI4Culture KG to get lat/lon for buildings.
#
# Building names in CbDD follow patterns like:
#   - "Bad Buchau, Stiftskirche" (City, Building)
#   - "Altenburg, Haus Moritzstraße 6" (City, Street Address)
#   - "München, Schloss Nymphenburg, Hauptschloss" (City, Complex, Building)
#
# Strategy:
#   1. Extract city name from building address (first part before comma)
#   2. Search KG for items containing city name with coordinates
#   3. Match against building name parts
#   4. Cache results for efficiency

from functools import lru_cache
import re

# Cache for building coordinates
_building_coordinates_cache = {}


def extract_address_parts(building_name: str) -> Dict[str, str]:
    """
    Extract city, street, and building parts from a CbDD building name.
    
    Examples (the result also contains 'full_name'; 'street' is only set when detected):
        "Bad Buchau, Stiftskirche" -> {city: "Bad Buchau", building: "Stiftskirche"}
        "Altenburg, Haus Moritzstraße 6" -> {city: "Altenburg", building: "Haus Moritzstraße 6", street: "Haus Moritzstraße 6"}
        "München, Schloss Nymphenburg, Hauptschloss" -> {city: "München", complex: "Schloss Nymphenburg", building: "Hauptschloss"}
    """
    if not building_name:
        return {}
    
    parts = [p.strip() for p in building_name.split(',')]
    result = {
        'full_name': building_name,
        'city': parts[0] if parts else None,
        'building': parts[-1] if len(parts) > 1 else None,
        'complex': parts[1] if len(parts) > 2 else None,
    }
    
    # Check for street address patterns (contains numbers or street keywords)
    street_patterns = ['straße', 'str.', 'gasse', 'platz', 'weg', 'allee']
    for part in parts[1:]:
        part_lower = part.lower()
        if any(p in part_lower for p in street_patterns) or re.search(r'\d+', part):
            result['street'] = part
            break
    
    return result


@lru_cache(maxsize=500)
def get_building_coordinates_from_kg(building_name: str) -> Optional[Dict]:
    """
    Query NFDI4Culture KG to find coordinates for a building.
    Uses multiple search strategies for better matching.
    
    Args:
        building_name: Building name/address from CbDD (e.g., "Bad Buchau, Stiftskirche")
    
    Returns:
        dict with lat, lon, uri, matched_label, or None if not found
    """
    if not building_name:
        return None
    
    # Extract address components
    addr = extract_address_parts(building_name)
    city = addr.get('city', '')
    
    if not city:
        return None
    
    # Try multiple search strategies, most specific first. The city-only search
    # comes last because it matches every building in that city.
    search_terms = [
        f"{city}, {addr.get('building', '')}" if addr.get('building') else None,
        addr.get('complex'),
        city,
    ]
    search_terms = [t for t in search_terms if t]
    
    for search_term in search_terms:
        # Escape backslashes and quotes so the term is a valid SPARQL string literal
        search_clean = search_term.replace('\\', '\\\\').replace('"', '\\"').replace("'", "\\'")
        
        query = f"""
        SELECT ?building ?label ?lat ?lon
        WHERE {{
          {CBDD_FEED_URI} schema:dataFeedElement ?feedItem .
          ?feedItem schema:item ?painting .
          
          # Find parent items via CTO_0001019 (is part of); the * path follows zero or more levels
          ?painting <https://nfdi4culture.de/ontology/CTO_0001019>* ?building .
          ?building rdfs:label ?label .
          ?building schema:latitude ?lat .
          ?building schema:longitude ?lon .
          
          FILTER(CONTAINS(LCASE(?label), LCASE("{search_clean}")))
        }}
        LIMIT 10
        """
        
        try:
            df = run_sparql(query)
            if not df.empty:
                # Find best match - prefer exact matches
                building_lower = building_name.lower()
                
                for idx, row in df.iterrows():
                    label = row.get('label', '')
                    label_lower = label.lower()
                    
                    # Exact match
                    if building_lower == label_lower:
                        return {
                            'lat': float(row['lat']),
                            'lon': float(row['lon']),
                            'uri': row['building'],
                            'matched_label': label,
                            'match_type': 'exact'
                        }
                
                # Partial match - building name in label or vice versa
                for idx, row in df.iterrows():
                    label = row.get('label', '')
                    label_lower = label.lower()
                    
                    if building_lower in label_lower or label_lower in building_lower:
                        return {
                            'lat': float(row['lat']),
                            'lon': float(row['lon']),
                            'uri': row['building'],
                            'matched_label': label,
                            'match_type': 'partial'
                        }
                
                # City match - if city matches, use it as fallback
                for idx, row in df.iterrows():
                    label = row.get('label', '')
                    if city.lower() in label.lower():
                        return {
                            'lat': float(row['lat']),
                            'lon': float(row['lon']),
                            'uri': row['building'],
                            'matched_label': label,
                            'match_type': 'city'
                        }
                
                # Last resort: return first result
                row = df.iloc[0]
                return {
                    'lat': float(row['lat']),
                    'lon': float(row['lon']),
                    'uri': row['building'],
                    'matched_label': row.get('label', ''),
                    'match_type': 'first'
                }
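        except Exception as e:
            # Do not abort the lookup on a failed query; report it and try the next search term
            print(f"   ⚠ Coordinate query failed for '{search_term}': {e}")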
    
    return None


def get_coordinates_for_painting(painting_row: pd.Series) -> Dict:
    """
    Get coordinates for a painting, trying multiple sources:
    1. Direct coordinates on painting (from NFDI4Culture)
    2. Building coordinates (via CbDD building -> KG lookup)
    
    Args:
        painting_row: DataFrame row with painting data
    
    Returns:
        dict with lat, lon, source, building_name (if from building)
    """
    # First check if painting has direct coordinates
    lat = painting_row.get('lat')
    lon = painting_row.get('lon')
    
    if lat is not None and lon is not None and str(lat) != 'nan' and str(lon) != 'nan':
        try:
            return {
                'lat': float(lat),
                'lon': float(lon),
                'source': 'painting',
                'building_name': None
            }
        except (ValueError, TypeError):
            pass
    
    # Try to get coordinates from building
    building_name = painting_row.get('building')
    if building_name:
        coords = get_building_coordinates_from_kg(building_name)
        if coords:
            return {
                'lat': coords['lat'],
                'lon': coords['lon'],
                'source': 'building',
                'building_name': building_name,
                'matched_label': coords.get('matched_label'),
                'match_type': coords.get('match_type')
            }
    
    return {'lat': None, 'lon': None, 'source': None, 'building_name': None}


def enrich_dataframe_with_coordinates(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    """
    Add coordinate columns to a DataFrame, trying painting then building coordinates.
    
    Args:
        df: DataFrame with painting data (should have 'building' column from CbDD enrichment)
        verbose: Print progress information
    
    Returns:
        DataFrame with added/updated lat, lon, coord_source columns
    """
    if 'coord_source' not in df.columns:
        df['coord_source'] = None
    
    direct_coords = 0
    building_coords = 0
    no_coords = 0
    
    for idx, row in df.iterrows():
        coords = get_coordinates_for_painting(row)
        
        if coords['lat'] is not None:
            df.at[idx, 'lat'] = coords['lat']
            df.at[idx, 'lon'] = coords['lon']
            df.at[idx, 'coord_source'] = coords['source']
            
            if coords['source'] == 'painting':
                direct_coords += 1
            else:
                building_coords += 1
        else:
            no_coords += 1
    
    if verbose:
        print("   ✓ Coordinates enrichment:")
        print(f"      Direct (painting): {direct_coords}")
        print(f"      From building: {building_coords}")
        print(f"      No coordinates: {no_coords}")
    
    return df


# =============================================================================
# Alternative: Query KG for all buildings with coordinates at once
# =============================================================================
def load_all_building_coordinates() -> Dict[str, Dict]:
    """
    Pre-load coordinates for all buildings in the CbDD dataset.
    This is more efficient than individual queries.
    
    Returns:
        dict mapping building_name -> {lat, lon, uri}
    """
    global _building_coordinates_cache
    
    if _building_coordinates_cache:
        return _building_coordinates_cache
    
    print("📍 Loading building coordinates from NFDI4Culture KG...")
    
    # Query for all items with coordinates
    query = f"""
    SELECT DISTINCT ?item ?label ?lat ?lon ?itemType
    WHERE {{
      {CBDD_FEED_URI} schema:dataFeedElement ?feedItem .
      ?feedItem schema:item ?painting .
      
      # Get paintings and their parent items
      {{
        ?painting schema:latitude ?lat .
        ?painting schema:longitude ?lon .
        ?painting rdfs:label ?label .
        BIND(?painting AS ?item)
        BIND("painting" AS ?itemType)
      }}
      UNION
      {{
        ?painting <https://nfdi4culture.de/ontology/CTO_0001019> ?parent .
        ?parent schema:latitude ?lat .
        ?parent schema:longitude ?lon .
        ?parent rdfs:label ?label .
        BIND(?parent AS ?item)
        BIND("parent" AS ?itemType)
      }}
    }}
    """
    
    try:
        df = run_sparql(query)
        
        if not df.empty:
            for idx, row in df.iterrows():
                label = row.get('label', '')
                if label:
                    _building_coordinates_cache[label] = {
                        'lat': float(row['lat']),
                        'lon': float(row['lon']),
                        'uri': row['item'],
                        'type': row.get('itemType', 'unknown')
                    }
            
            print(f"   ✓ Loaded coordinates for {len(_building_coordinates_cache)} items")
        else:
            print("   ⚠ No coordinate data found")

    except Exception as e:
        print(f"   ⚠ Error loading coordinates: {e}")
    
    return _building_coordinates_cache


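def get_cached_coordinates(name: str) -> Optional[Dict]:
    """Get coordinates from cache by exact or partial name match."""
    # Guard: an empty name is a substring of every label and would match the first cache entry
    if not name or not isinstance(name, str):
        return None

    if not _building_coordinates_cache:
        load_all_building_coordinates()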
    
    # Try exact match
    if name in _building_coordinates_cache:
        return _building_coordinates_cache[name]
    
    # Try partial match
    name_lower = name.lower()
    for cached_name, coords in _building_coordinates_cache.items():
        if name_lower in cached_name.lower() or cached_name.lower() in name_lower:
            return coords
    
    return None


print("✅ Building coordinates functions defined:")
print("   - get_building_coordinates_from_kg(building_name) -> query KG for single building")
print("   - get_coordinates_for_painting(row) -> get coords from painting or building")
print("   - enrich_dataframe_with_coordinates(df) -> add coords to DataFrame")
print("   - load_all_building_coordinates() -> pre-load all coords for efficiency")
print("   - get_cached_coordinates(name) -> lookup from cache")

✅ Building coordinates functions defined:
   - get_building_coordinates_from_kg(building_name) -> query KG for single building
   - get_coordinates_for_painting(row) -> get coords from painting or building
   - enrich_dataframe_with_coordinates(df) -> add coords to DataFrame
   - load_all_building_coordinates() -> pre-load all coords for efficiency
   - get_cached_coordinates(name) -> lookup from cache
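
The partial match in `get_cached_coordinates` is a plain substring test on the raw labels. A minimal sketch of a slightly stricter lookup follows, assuming a cache shaped like `_building_coordinates_cache`. The helpers `normalise` and `lookup_coordinates` are hypothetical and not part of the notebook; the two sample entries reuse building names and coordinates printed elsewhere in this notebook.

```python
from typing import Dict, Optional

def normalise(name: str) -> str:
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def lookup_coordinates(name: str, cache: Dict[str, Dict]) -> Optional[Dict]:
    """Exact match on normalised labels first, then the shortest partial match."""
    if not isinstance(name, str) or not name.strip():
        return None  # an empty name would otherwise match every label
    wanted = normalise(name)
    by_norm = {normalise(label): coords for label, coords in cache.items()}
    if wanted in by_norm:
        return by_norm[wanted]
    partial = [label for label in by_norm if wanted in label or label in wanted]
    if not partial:
        return None
    # Prefer the shortest candidate so the result depends less on dict order
    return by_norm[min(partial, key=len)]

# Hypothetical stand-in for the real cache
sample_cache = {
    "Oettingen, Neues Schloss": {"lat": 48.9545, "lon": 10.6050},
    "Allensbach, Schloss Freudental": {"lat": 47.7478, "lon": 9.0787},
}
print(lookup_coordinates("oettingen,  neues schloss", sample_cache))
print(lookup_coordinates("", sample_cache))
```

Partial matching remains a heuristic: two buildings in the same town can still be confused, so it may be worth keeping the matched label next to the coordinates for later review.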


In [12]:
# =============================================================================
# GND Resolution (Optional - for additional research)
# =============================================================================
# NOTE: For painter/commissioner names, we now use the CbDD Graph (graphData.json)
# which provides direct names without needing GND resolution.
#
# These GND functions are kept for optional research purposes:
#   - Looking up additional person details
#   - Resolving GND URIs found in other contexts
#   - Cross-referencing with the German National Library

import requests
from functools import lru_cache

@lru_cache(maxsize=1000)
def resolve_gnd_uri(gnd_uri: str) -> dict:
    """
    Resolve a GND URI to its preferred name using lobid.org API.
    
    NOTE: For painter/commissioner names, prefer using the CbDD graph
    via enrich_painting_from_graph() which is faster and more reliable.
    
    Args:
        gnd_uri: A GND URI like 'https://d-nb.info/gnd/118636960'
        
    Returns:
        dict with 'name', 'type', 'uri', 'resolved' keys
    """
    result = {'uri': gnd_uri, 'name': None, 'type': None, 'resolved': False}
    
    if not gnd_uri or not isinstance(gnd_uri, str):
        return result
    
    try:
        gnd_id = gnd_uri.split('/')[-1].strip()
        if not gnd_id or len(gnd_id) < 3:
            return result
        
        response = requests.get(
            f'https://lobid.org/gnd/{gnd_id}.json',
            headers={'Accept': 'application/json'},
            timeout=10
        )
        
        if response.ok:
            data = response.json()
            result['name'] = data.get('preferredName')
            type_val = data.get('type', [])
            if isinstance(type_val, list) and type_val:
                result['type'] = type_val[0]
            elif isinstance(type_val, str):
                result['type'] = type_val
            else:
                result['type'] = 'Unknown'
            result['resolved'] = result['name'] is not None
            
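    except Exception:
        # Network or JSON errors leave the result unresolved; note that lru_cache
        # also keeps this unresolved result for the rest of the session
        pass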
    
    return result


print("✅ GND resolution functions defined (optional, for research):")
print("   - resolve_gnd_uri(gnd_uri) -> resolve single GND URI via lobid.org")
print()
print("📌 NOTE: For painter/commissioner names in this dataset, use:")
print("   - enrich_painting_from_graph(painting_name)")
print("   - enrich_dataframe_from_graph(df)")
print("   These use the CbDD graph data which is faster and more reliable.")

✅ GND resolution functions defined (optional, for research):
   - resolve_gnd_uri(gnd_uri) -> resolve single GND URI via lobid.org

📌 NOTE: For painter/commissioner names in this dataset, use:
   - enrich_painting_from_graph(painting_name)
   - enrich_dataframe_from_graph(df)
   These use the CbDD graph data which is faster and more reliable.
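
The parsing steps inside `resolve_gnd_uri` can be checked without a network call. The sketch below separates them into two helpers. `gnd_id_from_uri`, `parse_gnd_record` and the sample payload are hypothetical; they only mirror the `preferredName` and `type` fields that the function above reads. Unlike the original, the ID helper also tolerates a trailing slash.

```python
from typing import Optional

def gnd_id_from_uri(gnd_uri: str) -> Optional[str]:
    """Return the last path segment of a GND URI, or None if it looks invalid."""
    if not isinstance(gnd_uri, str):
        return None
    gnd_id = gnd_uri.rstrip("/").split("/")[-1].strip()
    return gnd_id if len(gnd_id) >= 3 else None

def parse_gnd_record(data: dict) -> dict:
    """Extract the name and the first type from an already decoded JSON record."""
    type_val = data.get("type", [])
    if isinstance(type_val, list) and type_val:
        first_type = type_val[0]
    elif isinstance(type_val, str):
        first_type = type_val
    else:
        first_type = "Unknown"
    name = data.get("preferredName")
    return {"name": name, "type": first_type, "resolved": name is not None}

# Made-up payload for illustration, not a real lobid.org response
sample = {"preferredName": "Beispiel, Person", "type": ["Person"]}
print(gnd_id_from_uri("https://d-nb.info/gnd/118636960"))
print(parse_gnd_record(sample))
```

Keeping the parsing separate from the HTTP request makes it possible to test the logic offline and to cache only successful lookups if desired.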


In [13]:
# =============================================================================
# Fetch Paintings from NFDI4Culture Knowledge Graph
# =============================================================================
# This query fetches the core data from NFDI4Culture:
#   - Painting URI, label, year, coordinates, image URL
#   - ICONCLASS/AAT subjects (for thematic analysis)
#   - Parent entity (part-of relationships)
#
# Person data (painters, commissioners) and location details are enriched
# from the CbDD graph (graphData.json) in the next step.

query_paintings = f"""
SELECT DISTINCT ?painting ?label ?year ?lat ?lon ?imageUrl ?license
       (GROUP_CONCAT(DISTINCT ?iconclass; separator="|") AS ?subjects)
       ?parentUri ?parentLabel
WHERE {{
  {CBDD_FEED_URI} schema:dataFeedElement ?feedItem .
  ?feedItem schema:item ?painting .
  
  # Required: Title and image
  ?painting rdfs:label ?label .
  ?painting schema:associatedMedia ?image .
  ?image <https://nfdi4culture.de/ontology/CTO_0001021> ?imageUrl .
  
  # Optional properties from NFDI4Culture
  OPTIONAL {{ ?image <https://nfdi4culture.de/ontology/CTO_0001007> ?license . }}
  OPTIONAL {{ ?painting <https://nfdi4culture.de/ontology/CTO_0001073> ?year . }}
  OPTIONAL {{
    ?painting schema:latitude ?lat .
    ?painting schema:longitude ?lon .
  }}
  OPTIONAL {{ ?painting <https://nfdi4culture.de/ontology/CTO_0001026> ?iconclass . }}
  
  # Part-of relationships (CTO_0001019)
  OPTIONAL {{
    ?painting <https://nfdi4culture.de/ontology/CTO_0001019> ?parentUri .
    FILTER(?parentUri != ?painting)
    ?parentUri rdfs:label ?parentLabel .
  }}
}}
GROUP BY ?painting ?label ?year ?lat ?lon ?imageUrl ?license ?parentUri ?parentLabel
LIMIT 50
"""

df_paintings = run_sparql(query_paintings)

# Ensure optional columns exist
for col in ['parentLabel', 'parentUri', 'subjects', 'lat', 'lon']:
    if col not in df_paintings.columns:
        df_paintings[col] = None

print(f"Fetched {len(df_paintings)} paintings from NFDI4Culture Knowledge Graph")
print(f"  - With coordinates: {len(df_paintings[df_paintings['lat'].notna()])}")
print(f"  - With subjects: {len(df_paintings[df_paintings['subjects'].notna() & (df_paintings['subjects'] != '')])}")
print(f"  - With year: {len(df_paintings[df_paintings['year'].notna()])}")

# Show property references
print("\n📋 SPARQL Properties used (from CTO/NFDI ontology):")
if 'resolve_ontology_code' in dir():
    for code in ['CTO_0001021', 'CTO_0001073', 'CTO_0001026', 'CTO_0001019']:
        resolved = resolve_ontology_code(code)
        print(f"   {code}: {resolved['label']}")

df_paintings[['label', 'year', 'lat', 'lon']].head(10)

Fetched 50 paintings from NFDI4Culture Knowledge Graph
  - With coordinates: 12
  - With subjects: 50
  - With year: 46

📋 SPARQL Properties used (from CTO/NFDI ontology):
   CTO_0001021: has content url
   CTO_0001073: has creation period
   CTO_0001026: has external classifier
   CTO_0001019: has related item


index,label,year,lat,lon
0,"Erpfting, Kapelle Maria-Eich",1696/1697,48.02702019535047,10.839428936798988
1,IDEM AMBO: Pfirsich und gespaltenes Blatt an e...,1595-1605,,
2,Venus beweint Adonis,um 1700,,
3,Der Glaube,nach 1679,,
4,Deckenmalerei in Kabinett I,1700-1800,,
5,Pomona,,,
6,"Münster, ehem. Prinzipalmarkt 39","1400-1500, 1495-1505, 1662",51.96241,7.62801
7,Josef flieht vor Potifars Frau,zwischen 1746 und 1755,,
8,"Hamburg, Valentinskamp 34","1644, 1750, 1772, 1856",53.5554848,9.9832873
9,"Juno, Venus und Minerva","Datierung: 1729, Zerstörung: 1940/1943, Rekons...",,
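
The `year` column mixes single years, ranges and free text (`1696/1697`, `um 1700`, `zwischen 1746 und 1755`), so it cannot be placed on a timeline directly. One possible preparation step is sketched here, under the assumption that any standalone four-digit number between 1000 and 2099 in the string is a year. `extract_years` and `earliest_year` are hypothetical helper names.

```python
import re
from typing import List, Optional

# Four-digit numbers from 1000 to 2099 that are not part of a longer number
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")

def extract_years(value) -> List[int]:
    """Return all four-digit years found in a free-text date string."""
    if not isinstance(value, str):
        return []  # None / NaN coming from the DataFrame
    return [int(match) for match in YEAR_PATTERN.findall(value)]

def earliest_year(value) -> Optional[int]:
    """Earliest year mentioned, or None if the string contains no year."""
    years = extract_years(value)
    return min(years) if years else None

for text in ["1696/1697", "um 1700", "zwischen 1746 und 1755",
             "1400-1500, 1495-1505, 1662", None]:
    print(repr(text), "->", extract_years(text), earliest_year(text))
```

Applied to the DataFrame this could look like `df_paintings['year'].apply(earliest_year)`. Some entries also mention destruction or reconstruction dates, so the earliest year is only a rough proxy for the creation date.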


In [14]:
# =============================================================================
# Enrich Paintings with CbDD Graph Data (Extended)
# =============================================================================
# Match paintings from NFDI4Culture with the CbDD graph by name and add:
#   - Painters (directly from graph, no GND resolution needed)
#   - Commissioners
#   - Room and Building information
#   - Building Function (e.g., Kloster, Schloss)
#   - Location State (Bundesland)
#   - Building Architects
#   - Template Providers (Vorlagenlieferanten)
#   - Technique/Method

print("Enriching paintings with CbDD graph data (extended)...")
print("="*70)

# Enrich the dataframe with ALL available CbDD data
df_enriched = enrich_dataframe_from_graph(df_paintings.copy(), name_column='label')

# Show results summary with NEW fields
print("\n" + "="*70)
print("📊 Enrichment Summary (Extended):")
print("="*70)

# Count non-null values for each enriched column
enrichment_stats = {
    'painters': ('🎨 Painters', df_enriched[df_enriched['painters'].notna()]),
    'commissioners': ('👤 Commissioners', df_enriched[df_enriched['commissioners'].notna()]),
    'room': ('🚪 Room', df_enriched[df_enriched['room'].notna()]),
    'building': ('🏛️ Building', df_enriched[df_enriched['building'].notna()]),
    'building_function': ('⚙️ Building Function', df_enriched[df_enriched['building_function'].notna()]),
    'location_state': ('📍 Location State', df_enriched[df_enriched['location_state'].notna()]),
    'building_architects': ('🏗️ Building Architects', df_enriched[df_enriched['building_architects'].notna()]),
    'template_providers': ('📐 Template Providers', df_enriched[df_enriched['template_providers'].notna()]),
    'method': ('🖌️ Technique', df_enriched[df_enriched['method'].notna()]),
}

for col, (label, df_subset) in enrichment_stats.items():
    count = len(df_subset)
    pct = 100 * count / len(df_enriched)
    print(f"   {label}: {count}/{len(df_enriched)} ({pct:.1f}%)")

# Show sample results with ALL new fields
print("\n" + "="*70)
print("SAMPLE ENRICHED DATA (showing new fields):")
print("="*70)

for idx, row in df_enriched[df_enriched['painters'].notna()].head(5).iterrows():
    print(f"\n🖼️  {row['label'][:70]}")
    if row.get('painters'):
        print(f"   🎨 Painter(s): {row['painters']}")
    if row.get('commissioners'):
        print(f"   👤 Commissioner(s): {row['commissioners']}")
    if row.get('template_providers') and pd.notna(row.get('template_providers')):
        print(f"   📐 Template provider(s): {row['template_providers']}")
    if row.get('room'):
        print(f"   🚪 Room: {row['room']}")
    if row.get('building'):
        func = f" ({row['building_function']})" if row.get('building_function') and pd.notna(row.get('building_function')) else ""
        print(f"   🏛️ Building: {row['building']}{func}")
    if row.get('building_architects') and pd.notna(row.get('building_architects')):
        print(f"   🏗️ Building architect(s): {row['building_architects']}")
    if row.get('location_state') and pd.notna(row.get('location_state')):
        print(f"   📍 State: {row['location_state']}")
    if row.get('method'):
        print(f"   🖌️ Technique: {row['method']}")
    if row.get('year'):
        print(f"   📅 Year: {row['year']}")

# Show distribution of building functions
print("\n" + "="*70)
print("Building Function Distribution:")
print("="*70)
if 'building_function' in df_enriched.columns:
    func_counts = df_enriched['building_function'].value_counts().head(10)
    for func, count in func_counts.items():
        print(f"   {func}: {count}")

# Show distribution of states
print("\n" + "="*70)
print("Location State Distribution:")
print("="*70)
if 'location_state' in df_enriched.columns:
    state_counts = df_enriched['location_state'].value_counts().head(10)
    for state, count in state_counts.items():
        print(f"   {state}: {count}")

# Display all columns
print("\n" + "="*70)
print("Available columns in enriched DataFrame:")
print(df_enriched.columns.tolist())

Enriching paintings with CbDD graph data (extended)...
   ✓ Matched 38/50 paintings (76.0%) with CbDD graph

📊 Enrichment Summary (Extended):
   🎨 Painters: 21/50 (42.0%)
   👤 Commissioners: 25/50 (50.0%)
   🚪 Room: 38/50 (76.0%)
   🏛️ Building: 38/50 (76.0%)
   ⚙️ Building Function: 36/50 (72.0%)
   📍 Location State: 38/50 (76.0%)
   🏗️ Building Architects: 27/50 (54.0%)
   📐 Template Providers: 1/50 (2.0%)
   🖌️ Technique: 24/50 (48.0%)

SAMPLE ENRICHED DATA (showing new fields):

🖼️  Der Glaube
   🎨 Painter(s): Zink, Matthias, Merz, Johann Michael
   👤 Commissioner(s): Oettingen-Spielberg, Ludovika Rosalie von
   🚪 Room: Hauptraum
   🏛️ Building: Oettingen, Neues Schloss (Residenz)
   🏗️ Building architect(s): Weiß, Matthias
   📍 State: Bayern
   🖌️ Technique: Ölmalerei
   📅 Year: nach 1679

🖼️  Pomona
   🎨 Painter(s): Zick, Januarius
   🚪 Room: Dianasaal
   🏛️ Building: Engers, Schloss (Jagd
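
The per-column coverage figures in the summary can also be computed in one step with `notna()`. A minimal sketch on a toy DataFrame follows; the rows are invented and only the column names follow `df_enriched`.

```python
import pandas as pd

# Invented rows for illustration; None marks a missing enrichment value
toy = pd.DataFrame({
    "label": ["A", "B", "C", "D"],
    "painters": ["Painter X", None, None, "Painter Y"],
    "building": ["Building 1", "Building 2", None, None],
    "method": [None, None, None, "Technique Z"],
})

enriched_cols = ["painters", "building", "method"]
present = toy[enriched_cols].notna()

# One row per enriched column: absolute count and share of non-missing values
coverage = pd.DataFrame({
    "count": present.sum(),
    "percent": (present.mean() * 100).round(1),
})
print(coverage)
```

This avoids building one filtered DataFrame per column, which matters little for 50 rows but keeps the summary short when more columns are added.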

In [16]:
# =============================================================================
# Test Building Hierarchy Traversal and Coordinate Enrichment
# =============================================================================
# This verifies:
# 1. The recursive hierarchy traversal reaches buildings (not stopping at rooms)
# 2. We can get coordinates from building names via the KG

print("Testing Building Hierarchy Traversal...")
print("="*70)

# Count how many have rooms vs buildings
with_room = df_enriched[df_enriched['room'].notna()]
with_building = df_enriched[df_enriched['building'].notna()]

print("\n📊 Location Hierarchy Coverage:")
print(f"   Paintings with Room: {len(with_room)}/{len(df_enriched)} ({100*len(with_room)/len(df_enriched):.1f}%)")
print(f"   Paintings with Building: {len(with_building)}/{len(df_enriched)} ({100*len(with_building)/len(df_enriched):.1f}%)")

# Show paintings that have room but NO building (these would be the problematic cases)
room_but_no_building = df_enriched[(df_enriched['room'].notna()) & (df_enriched['building'].isna())]
print(f"\n⚠️ Paintings with Room but NO Building (hierarchy traversal issue): {len(room_but_no_building)}")
if len(room_but_no_building) > 0:
    for idx, row in room_but_no_building.head(5).iterrows():
        print(f"   - {row['label'][:50]} (Room: {row['room']})")

# Now test coordinate enrichment from buildings
print("\n" + "="*70)
print("Testing Coordinate Enrichment from Buildings...")
print("="*70)

# Check initial coordinates vs coordinates after building lookup
initial_coords = df_enriched['lat'].notna().sum()
print(f"\n📍 Initial coordinates (from painting directly): {initial_coords}/{len(df_enriched)}")

# Enrich with building coordinates
df_with_coords = enrich_dataframe_with_coordinates(df_enriched.copy(), verbose=True)

# Compare
final_coords = df_with_coords['lat'].notna().sum()
print(f"\n📍 Final coordinates (after building lookup): {final_coords}/{len(df_with_coords)}")
print(f"   🆕 Additional coordinates from buildings: {final_coords - initial_coords}")

# Show some examples of coordinates from buildings
building_coord_rows = df_with_coords[(df_with_coords['coord_source'] == 'building')]
if len(building_coord_rows) > 0:
    print("\n📍 Sample paintings with coordinates from building lookup:")
    for idx, row in building_coord_rows.head(5).iterrows():
        print(f"   - {row['label'][:40]}")
        print(f"     Building: {row['building']}")
        print(f"     Coords: ({row['lat']:.4f}, {row['lon']:.4f})")

# Store for later use
df_enriched = df_with_coords

Testing Building Hierarchy Traversal...

📊 Location Hierarchy Coverage:
   Paintings with Room: 38/50 (76.0%)
   Paintings with Building: 38/50 (76.0%)

⚠️ Paintings with Room but NO Building (hierarchy traversal issue): 0

Testing Coordinate Enrichment from Buildings...

📍 Initial coordinates (from painting directly): 12/50
   ✓ Coordinates enrichment:
      Direct (painting): 12
      From building: 38
      No coordinates: 0

📍 Final coordinates (after building lookup): 50/50
   🆕 Additional coordinates from buildings: 38

📍 Sample paintings with coordinates from building lookup:
   - IDEM AMBO: Pfirsich und gespaltenes Blat
     Building: Dillingen, Fürstbischöfliche Residenz
     Coords: (48.5763, 10.4952)
   - Venus beweint Adonis
     Building: Allensbach, Schloss Freudental
     Coords: (47.7478, 9.0787)
   - Der Glaube
     Building: Oettingen, Neues Schloss
     Coords: (48.9545, 10.6050)
   - Deckenmalerei in Kabinett I
     Building: Körtlinghausen, 
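
The building lookup appears to fall back to a city match or to the first result (`match_type` of `'city'` or `'first'`), so some coordinates may belong to a different object than the painting's building. A rough plausibility check is sketched below. The bounding box is an approximate assumption for Germany, not a value from the datasets, and `flag_implausible_coordinates` is a hypothetical helper.

```python
import pandas as pd

# Approximate bounding box for Germany (assumption for illustration only)
LAT_RANGE = (47.0, 55.5)
LON_RANGE = (5.5, 15.5)

def flag_implausible_coordinates(df: pd.DataFrame) -> pd.Series:
    """True where lat/lon are present but fall outside the bounding box."""
    lat = pd.to_numeric(df["lat"], errors="coerce")
    lon = pd.to_numeric(df["lon"], errors="coerce")
    inside = lat.between(*LAT_RANGE) & lon.between(*LON_RANGE)
    return lat.notna() & lon.notna() & ~inside

# Two rows reuse coordinates from the sample output; the other two are invented
toy = pd.DataFrame({
    "label": ["Der Glaube", "Venus beweint Adonis", "made-up outlier", "no coordinates"],
    "lat": [48.9545, 47.7478, 0.0, None],
    "lon": [10.6050, 9.0787, 0.0, None],
})
toy["implausible"] = flag_implausible_coordinates(toy)
print(toy)
```

A check like this only catches gross errors. Rows whose `match_type` is not an exact match could additionally be reviewed by hand before they are placed on a map.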

In [17]:
# =============================================================================
# Enhanced Display Function for Enriched Paintings
# =============================================================================
# Displays paintings with all available metadata from:
#   - NFDI4Culture KG: title, year, coordinates, subjects, image
#   - CbDD Graph: painters, commissioners, architects, room, building, 
#                 function, state, technique, template providers
from IPython.display import HTML, display

def display_painting_card(row, max_width=500, resolve_subjects=True, show_all_details=True):
    """
    Display a painting with complete metadata as an HTML card.
    
    Data sources:
    - NFDI4Culture: title (label), year, coordinates, subjects (ICONCLASS/AAT), image
    - CbDD Graph: painters, commissioners, architects, room, building, function, 
                  location_state, method/technique, template_providers
    
    Args:
        row: DataFrame row or dict with painting data
        max_width: Maximum image width in pixels
        resolve_subjects: Whether to resolve ICONCLASS/AAT URIs to labels
        show_all_details: Whether to show all available metadata
    """
    # Basic info (from NFDI4Culture)
    label = row.get('label', 'Unknown')
    year = row.get('year') or row.get('date_cbdd') or 'Unknown date'
    image_url = row.get('imageUrl', '')
    subjects = row.get('subjects', '')
    lat = row.get('lat')
    lon = row.get('lon')
    painting_uri = row.get('painting', '')
    parent_label = row.get('parentLabel', '')
    coord_source = row.get('coord_source', '')
    
    # Enriched info (from CbDD Graph)
    painters = row.get('painters', '')
    commissioners = row.get('commissioners', '')
    architects = row.get('architects', '')
    plasterers = row.get('plasterers', '')
    template_providers = row.get('template_providers', '')
    room = row.get('room', '')
    building = row.get('building', '')
    building_function = row.get('building_function', '')
    location_state = row.get('location_state', '')
    building_architects = row.get('building_architects', '')
    method = row.get('method', '')
    cbdd_id = row.get('cbdd_id', '')
    
    # Geo enrichment info (if present)
    geo_source = row.get('geo_source', 'original')
    matched_place = row.get('matched_place', '')
    wikidata_place = row.get('wikidata_place', '')
    
    # Build HTML sections
    html_parts = []
    
    # Title
    html_parts.append(f'<h3 style="margin-top: 0; color: #333;">{label}</h3>')
    
    # Year/Date
    html_parts.append(f'<p style="color: #000;"><strong>📅 Date:</strong> {year}</p>')
    
    # Technique (from CbDD)
    if method and pd.notna(method):
        html_parts.append(f'<p style="color: #000;"><strong>🖌️ Technique:</strong> {method}</p>')
    
    # Painters (from CbDD)
    if painters and pd.notna(painters):
        html_parts.append(f'<p style="color: #000;"><strong>🎨 Painter(s):</strong> {painters}</p>')
    
    # Commissioners (from CbDD)
    if commissioners and pd.notna(commissioners):
        html_parts.append(f'<p style="color: #000;"><strong>👤 Commissioner(s):</strong> {commissioners}</p>')
    
    # Template providers (from CbDD)
    if show_all_details and template_providers and pd.notna(template_providers):
        html_parts.append(f'<p style="color: #000;"><strong>📐 Template provider(s):</strong> {template_providers}</p>')
    
    # Architects for this painting (from CbDD)
    if show_all_details and architects and pd.notna(architects):
        html_parts.append(f'<p style="color: #000;"><strong>🏗️ Architect(s):</strong> {architects}</p>')
    
    # Plasterers (from CbDD)
    if show_all_details and plasterers and pd.notna(plasterers):
        html_parts.append(f'<p style="color: #000;"><strong>🧱 Plasterer(s):</strong> {plasterers}</p>')
    
    # Room (from CbDD)
    if room and pd.notna(room):
        html_parts.append(f'<p style="color: #000;"><strong>🚪 Room:</strong> {room}</p>')
    
    # Building with function (from CbDD)
    if building and pd.notna(building):
        building_text = building
        if building_function and pd.notna(building_function):
            building_text += f' <span style="color: #666;">({building_function})</span>'
        html_parts.append(f'<p style="color: #000;"><strong>🏛️ Building:</strong> {building_text}</p>')
    
    # Building architects (from CbDD)
    if show_all_details and building_architects and pd.notna(building_architects):
        html_parts.append(f'<p style="color: #000; margin-left: 20px;"><small>🏗️ Building architect(s): {building_architects}</small></p>')
    
    # Location/State (from CbDD)
    if location_state and pd.notna(location_state):
        html_parts.append(f'<p style="color: #000;"><strong>📍 State:</strong> {location_state}</p>')
    
    # Part-of hierarchy (from NFDI4Culture)
    if parent_label and pd.notna(parent_label):
        html_parts.append(f'<p style="color: #000;"><strong>📦 Part of:</strong> {parent_label}</p>')
    
    # Subjects (ICONCLASS/AAT)
    subject_html = ''
    if subjects and resolve_subjects:
        separator = '|' if '|' in str(subjects) else ','
        subject_list = [s.strip() for s in str(subjects).split(separator) if s.strip()]
        subject_items = []
        for uri in subject_list[:5]:
            try:
                resolved = resolve_subject_from_sparql(uri)
            except NameError:
                code = uri.split('/')[-1]
                resolved = {'label': f'[{code}]', 'source': 'ICONCLASS' if 'iconclass' in uri else 'AAT', 'code': code}
            
            badge_color = '#4CAF50' if 'iconclass' in uri.lower() else '#2196F3'
            subject_items.append(
                f'<span style="background: {badge_color}; color: white; padding: 2px 8px; '
                f'border-radius: 12px; font-size: 12px; margin: 2px; display: inline-block;" '
                f'title="{resolved.get("source", "")}: {resolved.get("code", "")}">{resolved["label"]}</span>'
            )
        if subject_items:
            subject_html = f'''
            <div style="margin: 10px 0;">
                <strong style="color: #000;">Subjects:</strong><br>
                <div style="margin-top: 5px;">{"".join(subject_items)}</div>
            </div>'''
    
    html_parts.append(subject_html)
    
    # Coordinates (shown only when both latitude and longitude are present,
    # so a missing longitude is never displayed as 0)
    if pd.notna(lat) and pd.notna(lon) and lat != '' and lon != '':
        try:
            lat_f = float(lat)
            lon_f = float(lon)
            
            coord_badge = ''
            if coord_source == 'building':
                coord_badge = '<span style="background: #FF9800; color: white; padding: 2px 6px; border-radius: 4px; font-size: 11px;">Building</span> '
            elif geo_source == 'wikidata' and matched_place:
                coord_badge = '<span style="background: #9C27B0; color: white; padding: 2px 6px; border-radius: 4px; font-size: 11px;">Wikidata</span> '
            
            html_parts.append(f'<p style="color: #000;">📍 {coord_badge}{lat_f:.4f}, {lon_f:.4f}</p>')
        except (ValueError, TypeError):
            pass
    
    # Data source badges: NFDI4Culture is always a source; CbDD only when the row was matched
    source_badges = []
    source_badges.append('<span style="background: #1976D2; color: white; padding: 2px 6px; border-radius: 4px; font-size: 10px;">NFDI4Culture</span>')
    if cbdd_id:
        source_badges.append('<span style="background: #388E3C; color: white; padding: 2px 6px; border-radius: 4px; font-size: 10px;">CbDD</span>')
    
    html_parts.append(f'<p style="margin-top: 10px;">{" ".join(source_badges)}</p>')
    
    # Link to the painting's record in NFDI4Culture
    if painting_uri:
        html_parts.append(f'<p><a href="{painting_uri}" target="_blank" style="color: #0066cc;">🔗 View in NFDI4Culture</a></p>')
    
    # Image
    if image_url:
        html_parts.append(f'''
            <img src="{image_url}" style="max-width: {max_width}px; max-height: 500px; border-radius: 4px;" 
                 onerror="this.onerror=null; this.src=''; this.alt='Image could not be loaded';">
        ''')
    
    # Combine all parts
    html = f"""
    <div style="border: 1px solid #ddd; padding: 15px; margin: 10px 0; border-radius: 8px; background: #fafafa;">
        {''.join(html_parts)}
    </div>
    """
    display(HTML(html))


def display_painter_profile(painter_name: str, show_paintings: int = 5):
    """
    Display a profile card for a painter showing their works and collaborators.
    
    Args:
        painter_name: Name of the painter
        show_paintings: Number of paintings to list
    """
    try:
        network = get_painter_network(painter_name)
    except NameError:
        print("⚠ Painter network function not available. Run the CbDD Graph Loader cell first.")
        return
    
    if network['painting_count'] == 0:
        print(f"⚠ No paintings found for '{painter_name}'")
        return
    
    html_parts = []
    
    # Header
    html_parts.append(f'<h3 style="margin-top: 0; color: #333;">🎨 {painter_name}</h3>')
    html_parts.append(f'<p style="color: #000;"><strong>Total paintings:</strong> {network["painting_count"]}</p>')
    
    # Buildings worked in
    if network['buildings_worked_in']:
        bldgs = network['buildings_worked_in'][:5]
        html_parts.append(f'<p style="color: #000;"><strong>🏛️ Buildings worked in:</strong> {", ".join(bldgs)}</p>')
    
    # Co-painters
    if network['co_painters']:
        co_list = [f"{name} ({count})" for name, count in list(network['co_painters'].items())[:5]]
        html_parts.append(f'<p style="color: #000;"><strong>👥 Co-painters:</strong> {", ".join(co_list)}</p>')
    
    # Commissioners worked for
    if network['commissioners_worked_for']:
        comms = list(network['commissioners_worked_for'])[:5]
        html_parts.append(f'<p style="color: #000;"><strong>👤 Commissioners:</strong> {", ".join(comms)}</p>')
    
    # Sample paintings
    if network['paintings']:
        paintings_list = '<ul style="margin: 5px 0; padding-left: 20px;">'
        for p in network['paintings'][:show_paintings]:
            paintings_list += f'<li style="color: #333;">{p["name"][:60]}</li>'
        if len(network['paintings']) > show_paintings:
            paintings_list += f'<li style="color: #666;"><em>...and {len(network["paintings"]) - show_paintings} more</em></li>'
        paintings_list += '</ul>'
        html_parts.append(f'<p style="color: #000;"><strong>🖼️ Sample works:</strong></p>{paintings_list}')
    
    html = f"""
    <div style="border: 2px solid #1976D2; padding: 15px; margin: 10px 0; border-radius: 8px; background: #E3F2FD;">
        {''.join(html_parts)}
    </div>
    """
    display(HTML(html))


# Backward-compatible alias
display_painting_full = display_painting_card

print("‚úÖ Display functions defined:")
print("   - display_painting_card(row) -> show painting with all metadata")
print("   - display_painter_profile(name) -> show painter's works & collaborators")
print("\nData sources integrated:")
print("   üìä NFDI4Culture KG: title, year, coordinates, subjects, image")
print("   üìä CbDD Graph: painters, commissioners, architects, plasterers,")
print("                  template_providers, room, building, building_function,")
print("                  location_state, building_architects, technique")

✅ Display functions defined:
   - display_painting_card(row) -> show painting with all metadata
   - display_painter_profile(name) -> show painter's works & collaborators

Data sources integrated:
   📊 NFDI4Culture KG: title, year, coordinates, subjects, image
   📊 CbDD Graph: painters, commissioners, architects, plasterers,
                  template_providers, room, building, building_function,
                  location_state, building_architects, technique


In [18]:
# =============================================================================
# Coordinate Enrichment via Building Lookup
# =============================================================================
# For paintings without direct coordinates, try to get them from the parent building

print("="*70)
print("📍 COORDINATE ENRICHMENT VIA BUILDING LOOKUP")
print("="*70)

# Count paintings without coordinates
without_coords = df_enriched[(df_enriched['lat'].isna()) | (df_enriched['lat'] == '')]
with_building = without_coords[without_coords['building'].notna()]

print(f"\n📊 Before enrichment:")
print(f"   Paintings without coordinates: {len(without_coords)}/{len(df_enriched)}")
print(f"   Of those, with building info: {len(with_building)}")

# Try to enrich coordinates from buildings
if len(with_building) > 0:
    print(f"\n🔍 Attempting to get coordinates for {with_building['building'].nunique()} buildings...")
    
    coords_found = 0
    buildings_checked = set()
    
    for idx, row in with_building.iterrows():
        building = row['building']
        if building and building not in buildings_checked:
            buildings_checked.add(building)
            
            # Try to get coordinates from KG
            coords = get_building_coordinates_from_kg(building)
            if coords and coords.get('lat'):
                coords_found += 1
                # Update all paintings in this building
                mask = df_enriched['building'] == building
                df_enriched.loc[mask, 'lat'] = coords['lat']
                df_enriched.loc[mask, 'lon'] = coords['lon']
                df_enriched.loc[mask, 'coord_source'] = 'building'
                print(f"   ✓ {building[:50]}: {coords['lat']:.4f}, {coords['lon']:.4f}")
    
    print(f"\n📊 After enrichment:")
    new_without_coords = df_enriched[(df_enriched['lat'].isna()) | (df_enriched['lat'] == '')]
    print(f"   Buildings with coordinates found: {coords_found}/{len(buildings_checked)}")
    print(f"   Paintings still without coordinates: {len(new_without_coords)}/{len(df_enriched)}")
else:
    print("   All paintings already have coordinates or building info!")

# Show summary by coordinate source
print("\n📊 Coordinate source summary:")
if 'coord_source' in df_enriched.columns:
    coord_sources = df_enriched['coord_source'].value_counts()
    for src, count in coord_sources.items():
        print(f"   From {src}: {count}")
    # Count paintings with original coords (no source set)
    original_coords = len(df_enriched[(df_enriched['lat'].notna()) & (df_enriched['lat'] != '') & (df_enriched['coord_source'].isna())])
    print(f"   From painting (direct): {original_coords}")

📍 COORDINATE ENRICHMENT VIA BUILDING LOOKUP

📊 Before enrichment:
   Paintings without coordinates: 0/50
   Of those, with building info: 0
   All paintings already have coordinates or building info!

📊 Coordinate source summary:
   From building: 38
   From painting: 12
   From painting (direct): 0


In [19]:
# =============================================================================
# Painter Network Analysis
# =============================================================================
# Explore relationships between painters and their works using the CbDD graph

print("="*70)
print("🎨 PAINTER NETWORK ANALYSIS")
print("="*70)

# Show top painters
print("\n📊 Most prolific Baroque ceiling painters in Germany:")
print("-"*70)
top = get_top_painters(15)
for i, p in enumerate(top, 1):
    print(f"  {i:2d}. {p['name']}: {p['count']} paintings")

# Profile a famous painter
print("\n" + "="*70)
print("🎨 PAINTER PROFILE: Cosmas Damian Asam")
print("="*70)
display_painter_profile("Asam, Cosmas Damian")

# Show another painter for comparison
print("\n" + "="*70)
print("🎨 PAINTER PROFILE: Johann Oswald Harms")
print("="*70)
display_painter_profile("Harms, Johann Oswald")

# Find painters from our sample who worked together
print("\n" + "="*70)
print("🤝 CO-PAINTER RELATIONSHIPS IN OUR SAMPLE")
print("="*70)
painters_in_sample = df_enriched[df_enriched['painters'].notna()]['painters'].unique()
co_painter_found = []
for p in painters_in_sample:
    # Handle comma-separated painters (caution: names such as "Asam, Cosmas Damian" contain commas themselves)
    for painter in str(p).split(','):
        painter = painter.strip()
        network = get_painter_network(painter)
        if network['co_painters']:
            for co_painter, count in list(network['co_painters'].items())[:3]:
                co_painter_found.append((painter, co_painter, count))

if co_painter_found:
    print("\nPainters from our sample who worked with others:")
    seen = set()
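    for p1, p2, count in sorted(co_painter_found, key=lambda x: -x[2]):
        pair = tuple(sorted([p1, p2]))
        if pair in seen:
            continue  # skip the mirrored (B, A) entry of a pair already shown
        seen.add(pair)
        print(f"  • {p1} worked with {p2} ({count} times)")
        if len(seen) == 10:
            break  # stop after ten distinct pairs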
else:
    print("  No co-painter relationships found in sample.")

🎨 PAINTER NETWORK ANALYSIS

📊 Most prolific Baroque ceiling painters in Germany:
----------------------------------------------------------------------
   1. Harms, Johann Oswald: 146 paintings
   2. Castelli, Carlo Ludovico: 127 paintings
   3. Asam, Cosmas Damian: 123 paintings
   4. Lammers, Seivert: 107 paintings
   5. Kager, Johann Matthias: 100 paintings
   6. Giusti, Tommaso: 86 paintings
   7. Asam, Hans Georg: 80 paintings
   8. Hermann, Franz Georg: 78 paintings
   9. Asam, Maria Theresia: 77 paintings
  10. Aloisi, Andrea: 68 paintings
  11. Colomba, Luca Antonio: 64 paintings
  12. Marchini, Giovanni Francesco: 58 paintings
  13. Peiker, Hermenegild: 56 paintings
  14. Rode, Bernhard: 55 paintings
  15. Gumpp, Johann Anton: 51 paintings

🎨 PAINTER PROFILE: Cosmas Damian Asam



🎨 PAINTER PROFILE: Johann Oswald Harms



🤝 CO-PAINTER RELATIONSHIPS IN OUR SAMPLE
  No co-painter relationships found in sample.
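
A possible reason for the empty result above lies in how the painter strings are split, not in the data. Names in this dataset follow the `Surname, Forename` pattern, so splitting on commas turns one name into two fragments that are unlikely to match anything in `get_painter_network`. The sketch below shows the effect and one possible alternative. The `;` separator, the helper `split_painter_names`, and the two-painter cell value are assumptions for illustration only; check how the `painters` column is actually joined before adopting this.

```python
def split_painter_names(cell, sep=";"):
    """Split a multi-painter cell on `sep`, keeping 'Surname, Forename' intact.

    The default separator is an assumption, not the notebook's confirmed format.
    """
    return [name.strip() for name in str(cell).split(sep) if name.strip()]


# Splitting on "," breaks a single name into two fragments
comma_split = [part.strip() for part in "Asam, Cosmas Damian".split(",")]
print(comma_split)

# Splitting on the assumed ";" separator keeps each name whole
# (the cell value is made up for illustration)
names = split_painter_names("Asam, Cosmas Damian; Asam, Hans Georg")
print(names)
```

If the column turns out to be joined with commas, the names cannot be recovered by a simple split, and storing painters as a list during enrichment would be the safer route.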


In [20]:
# Display paintings with full metadata from both sources
import time  # For rate-limiting API calls

print("Displaying paintings with combined NFDI4Culture + CbDD Graph data:")
print("="*70)
print("  📊 NFDI4Culture: title, year, coordinates, subjects, image URL")
print("  📊 CbDD Graph: painters, commissioners, room, building, technique")
print("  🟢 ICONCLASS | 🔵 Getty AAT subjects")
print("="*70 + "\n")

# Display paintings that have painter info from CbDD
paintings_with_painters = df_enriched[df_enriched['painters'].notna()]
print(f"Found {len(paintings_with_painters)} paintings with painter information.\n")

for idx, row in paintings_with_painters.head(5).iterrows():
    display_painting_card(row)
    time.sleep(0.2)  # Small delay for subject resolution API calls

Displaying paintings with combined NFDI4Culture + CbDD Graph data:
  📊 NFDI4Culture: title, year, coordinates, subjects, image URL
  📊 CbDD Graph: painters, commissioners, room, building, technique
  🟢 ICONCLASS | 🔵 Getty AAT subjects

Found 21 paintings with painter information.



In [21]:
# =============================================================================
# Final Summary: Complete Data Extraction Results
# =============================================================================
# Show the comprehensive enrichment from graphData.json and NFDI4Culture KG

print("="*70)
print("📊 FINAL DATA EXTRACTION SUMMARY")
print("="*70)

print("\n🗂️ DATA SOURCES:")
print("-"*70)
print("  NFDI4Culture KG:")
print("    - Title (rdfs:label)")
print("    - Year/Date")  
print("    - Coordinates (lat/lon)")
print("    - Subjects (ICONCLASS/Getty AAT)")
print("    - Image URL")
print("    - Parent structure")
print()
print("  CbDD graphData.json:")
print("    - Painters (PAINTERS links)")
print("    - Commissioners (COMMISSIONERS links)")
print("    - Architects (ARCHITECTS links)")
print("    - Template Providers (TEMPLATE_PROVIDERS links)")
print("    - Plasterers (PLASTERERS links)")
print("    - Room (via PART hierarchy)")
print("    - Building (via PART hierarchy)")
print("    - Building Function (FUNCTION links)")
print("    - Location State (LOCATION links)")
print("    - Building Architects")
print("    - Technique/Method (METHOD links)")

print("\n📈 ENRICHMENT RESULTS:")
print("-"*70)
total = len(df_enriched)
stats = {
    'Painters': len(df_enriched[df_enriched['painters'].notna()]),
    'Commissioners': len(df_enriched[df_enriched['commissioners'].notna()]),
    'Room': len(df_enriched[df_enriched['room'].notna()]),
    'Building': len(df_enriched[df_enriched['building'].notna()]),
    'Building Function': len(df_enriched[df_enriched['building_function'].notna()]),
    'Location State': len(df_enriched[df_enriched['location_state'].notna()]),
    'Building Architects': len(df_enriched[df_enriched['building_architects'].notna()]),
    'Template Providers': len(df_enriched[df_enriched['template_providers'].notna()]),
    'Technique': len(df_enriched[df_enriched['method'].notna()]),
    'Coordinates (total)': len(df_enriched[(df_enriched['lat'].notna()) & (df_enriched['lat'] != '')]),
    'Coords from Building': len(df_enriched[df_enriched['coord_source'] == 'building']),
}

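for field, count in stats.items():
    pct = 100 * count / total if total else 0.0  # guard against an empty DataFrame
    filled = int(pct / 5)  # one bar segment per 5 percentage points
    bar = '█' * filled + '░' * (20 - filled)
    print(f"  {field:20s}: {count:3d}/{total} ({pct:5.1f}%) {bar}")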

print("\n🌍 GEOGRAPHIC DISTRIBUTION:")
print("-"*70)
state_counts = df_enriched['location_state'].value_counts()
for state, count in state_counts.head(10).items():
    print(f"  {state:30s}: {count} paintings")

print("\n🏛️ BUILDING FUNCTION DISTRIBUTION:")
print("-"*70)
func_counts = df_enriched['building_function'].value_counts()
for func, count in func_counts.head(8).items():
    print(f"  {func:50s}: {count}")

print("\n✅ Data extraction complete!")
print(f"   Total paintings analyzed: {total}")
print(f"   Matched with CbDD graph: {len(df_enriched[df_enriched['cbdd_id'].notna()])}")
print(f"   With full location data: {len(df_enriched[(df_enriched['lat'].notna()) & (df_enriched['lat'] != '')])}")

📊 FINAL DATA EXTRACTION SUMMARY

🗂️ DATA SOURCES:
----------------------------------------------------------------------
  NFDI4Culture KG:
    - Title (rdfs:label)
    - Year/Date
    - Coordinates (lat/lon)
    - Subjects (ICONCLASS/Getty AAT)
    - Image URL
    - Parent structure

  CbDD graphData.json:
    - Painters (PAINTERS links)
    - Commissioners (COMMISSIONERS links)
    - Architects (ARCHITECTS links)
    - Template Providers (TEMPLATE_PROVIDERS links)
    - Plasterers (PLASTERERS links)
    - Room (via PART hierarchy)
    - Building (via PART hierarchy)
    - Building Function (FUNCTION links)
    - Location State (LOCATION links)
    - Building Architects
    - Technique/Method (METHOD links)

📈 ENRICHMENT RESULTS:
----------------------------------------------------------------------
  Painters            :  21/50 ( 42.0%) ████████░░░░░░░░░░░░
  Commissioners       :  25/50 ( 50.0%) ██████████░░

### Data Pipeline Summary

The notebook implements a **dual-source data pipeline** combining:
1. **NFDI4Culture Knowledge Graph** (SPARQL) - structured linked data
2. **CbDD Graph Export** (graphData.json) - rich relational data from the source database

| Step | Source | Data Retrieved |
|------|--------|----------------|
| 0. Ontology Resolution | GitHub (cto.ttl, nfdicore.ttl) | Human-readable labels for 267 CTO/NFDI codes |
| 1. Core Data | NFDI4Culture SPARQL | Title, year, image, coordinates, ICONCLASS/AAT subjects |
| 2. Graph Enrichment | CbDD graphData.json | **Painters**, **commissioners**, room, building, technique |
| 3. Subject Resolution | ICONCLASS/Getty SPARQL | Human-readable subject labels |
| 4. Geo Enrichment | Wikidata SPARQL | Missing coordinates from place names |

**🔄 Why Two Data Sources?**

| Aspect | NFDI4Culture KG | CbDD Graph |
|--------|-----------------|------------|
| Access | SPARQL endpoint | Local JSON file |
| Persons | GND URIs (need resolution) | **Direct names with roles** |
| Locations | GND URIs | Room → Building hierarchy |
| Subjects | ICONCLASS/AAT URIs | N/A |
| Coordinates | Yes (some) | N/A |
| Images | Yes (URLs) | N/A |

The CbDD graph provides **explicit role information** (painter vs commissioner) directly, avoiding the need to:
- Fetch GND URIs and call lobid.org API
- Parse profession keywords to classify persons
- Handle API failures and timeouts
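
To make the benefit of typed links concrete, here is a minimal sketch of how role information can be read straight off link records. It assumes a hypothetical structure in which each link is a dict with `source`, `target`, and `type` keys; the actual layout of graphData.json may differ, and `group_roles` and the placeholder names are illustrative only.

```python
from collections import defaultdict

# Hypothetical link records: the real graphData.json layout may differ.
links = [
    {"source": "painting-1", "target": "Painter A", "type": "PAINTERS"},
    {"source": "painting-1", "target": "Painter B", "type": "PAINTERS"},
    {"source": "painting-1", "target": "Commissioner C", "type": "COMMISSIONERS"},
    {"source": "painting-2", "target": "Commissioner C", "type": "COMMISSIONERS"},
]


def group_roles(link_records):
    """Map each source node to a dict of {link type: [target names]}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for link in link_records:
        grouped[link["source"]][link["type"]].append(link["target"])
    # Convert nested defaultdicts to plain dicts so missing keys raise KeyError
    return {source: dict(by_type) for source, by_type in grouped.items()}


roles_by_painting = group_roles(links)
print(roles_by_painting["painting-1"]["PAINTERS"])
```

Because the role is carried by the link type itself, no profession keywords or external lookups are needed to tell a painter from a commissioner.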

**📋 Schema Reference:**

| Source | Property/Link | Description |
|--------|--------------|-------------|
| NFDI4Culture | `CTO_0001073` | Creation date/year |
| NFDI4Culture | `CTO_0001026` | ICONCLASS/AAT subjects |
| NFDI4Culture | `CTO_0001021` | Image URL |
| CbDD Graph | `PAINTERS` link | Painter names (direct) |
| CbDD Graph | `COMMISSIONERS` link | Commissioner names (direct) |
| CbDD Graph | `PART` link | Room/Building hierarchy |
| CbDD Graph | `METHOD` link | Painting technique |

**🔧 Key Functions:**
- `load_cbdd_graph()` → load graphData.json with indices
- `enrich_painting_from_graph(name)` → get all CbDD data for a painting
- `enrich_dataframe_from_graph(df)` → batch enrich a DataFrame
- `display_painting_card(row)` → rich HTML display with all data

## 4. Compare CbDD and Color Slide Archive of Wall and Ceiling Painting

Portal IDs from the registry:
- CbDD: `n4c:E4264`
- Color Slide Archive: `n4c:E4267`

Goal: Count how many records in the KG come from each of these portals.

We assume a pattern similar to:
- `?item schema:isPartOf ?feed`
- `?feed schema:isPartOf ?portal` or `?feed dcterms:isPartOf ?portal`

You may need to adjust the feed-to-portal property, depending on what an inspection of the feed nodes shows.
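
One way to find out which property actually links nodes to a portal is to list every predicate that points at the portal IDs, together with a usage count. The sketch below only builds the query string; the helper name `build_predicate_inspection_query` is hypothetical, and the query relies on the `n4c:` prefix that `run_sparql` prepends. Whether the endpoint returns any rows for this pattern is not verified here.

```python
def build_predicate_inspection_query(portal_ids):
    """Return a SPARQL query listing predicates whose object is one of the portals."""
    values = " ".join(f"n4c:{pid}" for pid in portal_ids)
    return (
        "SELECT ?portal ?predicate (COUNT(?subject) AS ?uses)\n"
        "WHERE {\n"
        f"  VALUES ?portal {{ {values} }}\n"
        "  ?subject ?predicate ?portal .\n"
        "}\n"
        "GROUP BY ?portal ?predicate\n"
        "ORDER BY ?portal DESC(?uses)\n"
    )


query_inspect = build_predicate_inspection_query(["E4264", "E4267"])
print(query_inspect)
# In the notebook, the result could then be fetched with:
# df_predicates = run_sparql(query_inspect)
```

Any predicate that shows up with a high count is a candidate to replace `schema:isPartOf` or `dcterms:isPartOf` in the feed-to-portal step.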

In [119]:
query_ceiling_portal_counts = """\
SELECT ?portal ?portalLabel (COUNT(DISTINCT ?item) AS ?records)
WHERE {
  VALUES ?portal { n4c:E4264  n4c:E4267 }

  # feed belongs to one of the two portals
  ?feed ?isPartOfPortal ?portal .
  FILTER(?isPartOfPortal IN (schema:isPartOf, dcterms:isPartOf))

  # items belong to that feed
  ?item schema:isPartOf ?feed .

  ?portal schema:name ?portalLabel .
}
GROUP BY ?portal ?portalLabel
ORDER BY DESC(?records)
"""

df_ceiling_portal_counts = run_sparql(query_ceiling_portal_counts)
df_ceiling_portal_counts

In [120]:
# Simple bar chart of records per portal (CbDD vs Color Slide Archive)
if not df_ceiling_portal_counts.empty:
    plt.figure(figsize=(6, 4))
    plt.bar(df_ceiling_portal_counts["portalLabel"], df_ceiling_portal_counts["records"].astype(int))
    plt.xticks(rotation=20, ha="right")
    plt.ylabel("Number of records in KG")
    plt.title("Records from baroque wall & ceiling painting portals")
    plt.tight_layout()
    plt.show()
else:
    print("No results yet. Check if the intermediate predicate (?isPartOfPortal) is correct.")

No results yet. Check if the intermediate predicate (?isPartOfPortal) is correct.
