# GeoCroissant to GeoDCAT Conversion

<img src="../asset/GeoCroissant.jpg" alt="GeoCroissant" width="150" style="float: right; margin-left: 50px;">

This notebook demonstrates how to convert metadata from **GeoCroissant**, a geospatial extension of MLCommons Croissant, into **GeoDCAT** (DCAT-AP for geospatial datasets).

GeoDCAT is a standardized RDF-based metadata model for publishing geospatial datasets, enabling:
- Metadata interoperability with CKAN, INSPIRE, and EU data portals
- Semantic web support through RDF serializations such as JSON-LD and Turtle
- Cataloging of spatial, temporal, and distribution metadata

## Field Mapping: GeoCroissant → GeoDCAT

| **GeoCroissant Field**              | **GeoDCAT Field**              |
|-------------------------------------|--------------------------------|
| `name`                              | `dct:title`                    |
| `description`                       | `dct:description`              |
| `license`                           | `dct:license`                  |
| `version`                           | `adms:version`                 |
| `datePublished`                     | `dct:issued`                   |
| `conformsTo`                        | `dct:conformsTo`               |
| `keywords`                          | `dcat:keyword`                 |
| `spatialCoverage.geo.box`           | `dct:spatial` + `geo:asWKT`    |
| `temporalCoverage`                  | `dct:temporal` + `dcat:startDate` / `dcat:endDate` |
| `geocr:coordinateReferenceSystem`   | `geocr:coordinateReferenceSystem` |
| `geocr:spatialResolution`           | `geocr:spatialResolution`      |
| `geocr:temporalResolution`          | `geocr:temporalResolution`     |
| `distribution` (cr:FileObject)      | `dcat:Distribution`            |
| `distribution.contentUrl`           | `dcat:accessURL`               |
| `distribution.encodingFormat`       | `dcat:mediaType`               |
| `recordSet` (cr:RecordSet)          | `geocr:RecordSet`              |
| `recordSet.field` (cr:Field)        | `geocr:Field`                  |
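
To make the table concrete, the dictionary below sketches a minimal GeoCroissant record that exercises the mapped fields. It is illustrative and is not the actual `croissant.json` used later: the name, description, bounding box, CRS, resolutions, and distribution URL mirror the outputs shown in this notebook, while every value marked `assumed` (license URL, version, dates, keywords, `@id` values) is a placeholder chosen for the example.

```python
# Illustrative GeoCroissant record; values marked "assumed" are placeholders.
sample_geocroissant = {
    "@type": "sc:Dataset",
    "name": "NASA POWER T2M 2020",
    "description": "Temperature at 2 Meters monthly data for 2020",
    "license": "https://creativecommons.org/licenses/by/4.0/",  # assumed
    "version": "1.0.0",  # assumed
    "datePublished": "2020-12-31",  # assumed
    "conformsTo": ["http://mlcommons.org/croissant/1.1"],
    "keywords": ["temperature", "climate"],  # assumed
    "spatialCoverage": {
        # Box order is "south west north east"
        "geo": {"@type": "GeoShape", "box": "-90.0 -180.0 90.0 179.375"},
    },
    "temporalCoverage": "2020-01-01/2020-12-31",  # assumed
    "geocr:coordinateReferenceSystem": "EPSG:4326",
    "geocr:spatialResolution": {
        "@type": "QuantitativeValue", "value": 0.5, "unitText": "degrees",
    },
    "geocr:temporalResolution": {
        "@type": "QuantitativeValue", "value": 1, "unitText": "month",
    },
    "distribution": [{
        "@type": "cr:FileObject",
        "@id": "zarr-data",  # assumed
        "name": "zarr-data",
        "contentUrl": "https://nasa-power.s3.us-west-2.amazonaws.com/merra2/temporal/power_merra2_monthly_temporal_utc.zarr/",
        "encodingFormat": "application/zarr",
    }],
    "recordSet": [{
        "@type": "cr:RecordSet",
        "@id": "t2m_data",
        "name": "t2m_data",  # assumed
        "field": [{
            "@type": "cr:Field",
            "@id": "longitude",
            "name": "longitude",
            "description": "Longitude coordinate",
            "dataType": "sc:Float",
        }],
    }],
}

# Top-level keys from the left-hand column of the mapping table.
mapped_keys = [
    "name", "description", "license", "version", "datePublished",
    "conformsTo", "keywords", "spatialCoverage", "temporalCoverage",
    "geocr:coordinateReferenceSystem", "geocr:spatialResolution",
    "geocr:temporalResolution", "distribution", "recordSet",
]
missing = [key for key in mapped_keys if key not in sample_geocroissant]
print(missing)
```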

## Install Required Libraries

We use:
- `rdflib` for manipulating RDF graphs
- `pyshacl` for validating metadata using SHACL constraints

In [8]:
!pip install -q rdflib pyshacl


## Define GeoCroissant to GeoDCAT Conversion Function

This function maps each GeoCroissant field listed in the table above to its GeoDCAT-AP counterpart, builds an RDF graph, and writes the graph out as both JSON-LD and Turtle.

In [2]:
import json
from rdflib import Graph, Namespace, URIRef, Literal, BNode
from rdflib.namespace import DCTERMS, DCAT, FOAF, XSD, RDF
from urllib.parse import quote


def geocroissant_to_geodcat_jsonld(geocroissant_json, output_file="geodcat.jsonld"):
    """Convert GeoCroissant JSON-LD to GeoDCAT-AP compliant format"""
    g = Graph()

    # Namespaces
    GEO = Namespace("http://www.opengis.net/ont/geosparql#")
    SCHEMA = Namespace("https://schema.org/")
    SPDX = Namespace("http://spdx.org/rdf/terms#")
    ADMS = Namespace("http://www.w3.org/ns/adms#")
    PROV = Namespace("http://www.w3.org/ns/prov#")
    GEOCR = Namespace("http://mlcommons.org/croissant/geocr/")

    g.bind("dct", DCTERMS)
    g.bind("dcat", DCAT)
    g.bind("foaf", FOAF)
    g.bind("geo", GEO)
    g.bind("schema", SCHEMA)
    g.bind("spdx", SPDX)
    g.bind("adms", ADMS)
    g.bind("prov", PROV)
    g.bind("geocr", GEOCR)

    # Create dataset URI
    dataset_name = geocroissant_json.get("name", "dataset")
    # URL-encode the dataset name to handle spaces and special characters
    safe_name = quote(dataset_name, safe='')
    dataset_uri = URIRef(f"https://example.org/{safe_name}")
    
    # Basic dataset properties
    g.add((dataset_uri, RDF.type, DCAT.Dataset))
    g.add((dataset_uri, RDF.type, SCHEMA.Dataset))
    g.add((dataset_uri, DCTERMS.title, Literal(geocroissant_json["name"])))
    g.add((dataset_uri, DCTERMS.description, Literal(geocroissant_json["description"])))
    
    # License
    if "license" in geocroissant_json:
        g.add((dataset_uri, DCTERMS.license, URIRef(geocroissant_json["license"])))
    
    # Version
    if "version" in geocroissant_json:
        g.add((dataset_uri, ADMS.version, Literal(geocroissant_json["version"])))
    
    # Date published
    if "datePublished" in geocroissant_json:
        g.add((dataset_uri, DCTERMS.issued, Literal(geocroissant_json["datePublished"], datatype=XSD.date)))
    
    # ConformsTo
    for conformance in geocroissant_json.get("conformsTo", []):
        g.add((dataset_uri, DCTERMS.conformsTo, URIRef(conformance)))
    
    # Keywords
    for keyword in geocroissant_json.get("keywords", []):
        g.add((dataset_uri, DCAT.keyword, Literal(keyword)))
    
    # Spatial coverage
    spatial_coverage = geocroissant_json.get("spatialCoverage", {})
    if spatial_coverage and "geo" in spatial_coverage:
        geo_shape = spatial_coverage["geo"]
        if "box" in geo_shape:
            # Parse the bounding box (south west north east format)
            bbox = geo_shape["box"].split()
            if len(bbox) == 4:
                spatial_uri = URIRef(f"{dataset_uri}/spatial")
                g.add((dataset_uri, DCTERMS.spatial, spatial_uri))
                g.add((spatial_uri, RDF.type, DCTERMS.Location))
                
                # Create WKT polygon from bounding box
                south, west, north, east = bbox
                wkt_bbox = f"POLYGON(({west} {south}, {east} {south}, {east} {north}, {west} {north}, {west} {south}))"
                g.add((spatial_uri, GEO.asWKT, Literal(wkt_bbox, datatype=GEO.wktLiteral)))
    
    # Temporal coverage
    if "temporalCoverage" in geocroissant_json:
        temporal_coverage = geocroissant_json["temporalCoverage"]
        if "/" in temporal_coverage:
            start_date, end_date = temporal_coverage.split("/")
            temporal_uri = URIRef(f"{dataset_uri}/temporal")
            g.add((dataset_uri, DCTERMS.temporal, temporal_uri))
            g.add((temporal_uri, RDF.type, DCTERMS.PeriodOfTime))
            g.add((temporal_uri, DCAT.startDate, Literal(start_date, datatype=XSD.date)))
            g.add((temporal_uri, DCAT.endDate, Literal(end_date, datatype=XSD.date)))
    
    # GeoCroissant specific properties
    if "geocr:coordinateReferenceSystem" in geocroissant_json:
        crs_uri = URIRef(f"http://www.opengis.net/def/crs/{geocroissant_json['geocr:coordinateReferenceSystem']}")
        g.add((dataset_uri, GEOCR.coordinateReferenceSystem, crs_uri))
    
    # Spatial resolution
    if "geocr:spatialResolution" in geocroissant_json:
        spatial_res = geocroissant_json["geocr:spatialResolution"]
        if isinstance(spatial_res, dict) and "@type" in spatial_res:
            res_node = BNode()
            g.add((dataset_uri, GEOCR.spatialResolution, res_node))
            g.add((res_node, RDF.type, SCHEMA.QuantitativeValue))
            if "value" in spatial_res:
                g.add((res_node, SCHEMA.value, Literal(spatial_res["value"])))
            if "unitText" in spatial_res:
                g.add((res_node, SCHEMA.unitText, Literal(spatial_res["unitText"])))
    
    # Temporal resolution
    if "geocr:temporalResolution" in geocroissant_json:
        temporal_res = geocroissant_json["geocr:temporalResolution"]
        if isinstance(temporal_res, dict) and "@type" in temporal_res:
            res_node = BNode()
            g.add((dataset_uri, GEOCR.temporalResolution, res_node))
            g.add((res_node, RDF.type, SCHEMA.QuantitativeValue))
            if "value" in temporal_res:
                g.add((res_node, SCHEMA.value, Literal(temporal_res["value"])))
            if "unitText" in temporal_res:
                g.add((res_node, SCHEMA.unitText, Literal(temporal_res["unitText"])))
    
    # Distributions
    for dist in geocroissant_json.get("distribution", []):
        if dist.get("@type") == "cr:FileObject":
            dist_id = dist.get("@id", "distribution")
            dist_uri = URIRef(f"{dataset_uri}/distribution/{dist_id}")
            g.add((dataset_uri, DCAT.distribution, dist_uri))
            g.add((dist_uri, RDF.type, DCAT.Distribution))
            
            if "name" in dist:
                g.add((dist_uri, DCTERMS.title, Literal(dist["name"])))
            if "description" in dist:
                g.add((dist_uri, DCTERMS.description, Literal(dist["description"])))
            if "contentUrl" in dist:
                g.add((dist_uri, DCAT.accessURL, URIRef(dist["contentUrl"])))
            if "encodingFormat" in dist:
                g.add((dist_uri, DCAT.mediaType, Literal(dist["encodingFormat"])))
            if "md5" in dist:
                checksum_node = BNode()
                g.add((dist_uri, SPDX.checksum, checksum_node))
                g.add((checksum_node, RDF.type, SPDX.Checksum))
                g.add((checksum_node, SPDX.algorithm, SPDX.checksumAlgorithm_md5))
                g.add((checksum_node, SPDX.checksumValue, Literal(dist["md5"])))
        
        elif dist.get("@type") == "cr:FileSet":
            # Handle FileSet as a special type of distribution
            dist_id = dist.get("@id", "fileset")
            dist_uri = URIRef(f"{dataset_uri}/distribution/{dist_id}")
            g.add((dataset_uri, DCAT.distribution, dist_uri))
            g.add((dist_uri, RDF.type, DCAT.Distribution))
            g.add((dist_uri, RDF.type, GEOCR.FileSet))
            
            if "name" in dist:
                g.add((dist_uri, DCTERMS.title, Literal(dist["name"])))
            if "description" in dist:
                g.add((dist_uri, DCTERMS.description, Literal(dist["description"])))
            if "encodingFormat" in dist:
                g.add((dist_uri, DCAT.mediaType, Literal(dist["encodingFormat"])))
            if "includes" in dist:
                g.add((dist_uri, GEOCR.includes, Literal(dist["includes"])))
    
    # Record sets and fields (as additional metadata)
    for record_set in geocroissant_json.get("recordSet", []):
        if record_set.get("@type") == "cr:RecordSet":
            rs_id = record_set.get("@id", record_set.get("name", "recordset"))
            rs_uri = URIRef(f"{dataset_uri}/recordset/{rs_id}")
            g.add((dataset_uri, GEOCR.recordSet, rs_uri))
            g.add((rs_uri, RDF.type, GEOCR.RecordSet))
            
            if "name" in record_set:
                g.add((rs_uri, DCTERMS.title, Literal(record_set["name"])))
            if "description" in record_set:
                g.add((rs_uri, DCTERMS.description, Literal(record_set["description"])))
            
            # Handle fields
            for field in record_set.get("field", []):
                if field.get("@type") == "cr:Field":
                    field_id = field.get("@id", field.get("name", "field"))
                    field_uri = URIRef(f"{rs_uri}/field/{field_id}")
                    g.add((rs_uri, GEOCR.field, field_uri))
                    g.add((field_uri, RDF.type, GEOCR.Field))
                    
                    if "name" in field:
                        g.add((field_uri, DCTERMS.title, Literal(field["name"])))
                    if "description" in field:
                        g.add((field_uri, DCTERMS.description, Literal(field["description"])))
                    if "dataType" in field:
                        g.add((field_uri, GEOCR.dataType, Literal(field["dataType"])))

    # Serialize outputs
    g.serialize(destination=output_file, format="json-ld", indent=2)
    print(f"GeoDCAT JSON-LD metadata written to {output_file}")

    # Swap only a trailing ".jsonld" so the Turtle file never overwrites the JSON-LD file
    ttl_file = output_file.removesuffix(".jsonld") + ".ttl"
    g.serialize(destination=ttl_file, format="turtle")
    print(f"✓ GeoDCAT Turtle metadata written to {ttl_file}")
    
    return g
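
The spatial coverage step above interpolates the four box tokens straight into the WKT string, so a malformed box would still produce a polygon. As a minimal sketch, the hypothetical helper `bbox_to_wkt` (not part of the function above) applies the same "south west north east" to WKT conversion but first checks that the values are numeric and within latitude/longitude ranges. It assumes the box does not cross the antimeridian.

```python
def bbox_to_wkt(box):
    """Convert a 'south west north east' box string to a WKT polygon.

    Returns None when the box does not hold four numeric values inside
    valid latitude/longitude ranges.
    """
    parts = box.split()
    if len(parts) != 4:
        return None
    try:
        south, west, north, east = (float(p) for p in parts)
    except ValueError:
        return None
    lat_ok = -90 <= south <= north <= 90
    lon_ok = -180 <= west <= east <= 180
    if not (lat_ok and lon_ok):
        return None
    # Reuse the original tokens so the numeric formatting is preserved.
    s, w, n, e = parts
    # WKT coordinates are written "x y", i.e. longitude before latitude.
    return f"POLYGON(({w} {s}, {e} {s}, {e} {n}, {w} {n}, {w} {s}))"


print(bbox_to_wkt("-90.0 -180.0 90.0 179.375"))
```

For the box used in this notebook, the helper yields the same polygon that appears in the generated `geo:asWKT` literal.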

## Load GeoCroissant Metadata and Generate GeoDCAT RDF

We load the **croissant.json** file and convert it using our function. 

This will produce:
- `geodcat.jsonld`: GeoDCAT in JSON-LD format
- `geodcat.ttl`: GeoDCAT in Turtle (RDF) format
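
The conversion function indexes `name` and `description` directly and iterates over `keywords` and `conformsTo`, so a record that omits a required key raises `KeyError`, and a single string in place of a list is iterated character by character. The sketch below shows one way to check for this before converting. `missing_required_fields` and `as_list` are hypothetical helper names introduced for illustration.

```python
def missing_required_fields(metadata, required=("name", "description")):
    """Return the required keys that are absent or empty in the metadata."""
    return [key for key in required if not metadata.get(key)]


def as_list(value):
    """Wrap a scalar in a list so a lone string is not iterated per character."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


record = {"name": "NASA POWER T2M 2020", "keywords": "temperature"}
print(missing_required_fields(record))  # ['description']
print(as_list(record["keywords"]))      # ['temperature']
```

A record could be normalized with `record["keywords"] = as_list(record.get("keywords"))` before it is passed to the conversion function.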

In [3]:
# Load GeoCroissant metadata and convert to GeoDCAT
with open("croissant.json", "r") as f:
    geocroissant = json.load(f)

# Perform conversion
graph = geocroissant_to_geodcat_jsonld(geocroissant, output_file="geodcat.jsonld")

print("\nConversion complete!")
print("  - Input: croissant.json")
print("  - Output JSON-LD: geodcat.jsonld")
print("  - Output Turtle: geodcat.ttl")

GeoDCAT JSON-LD metadata written to geodcat.jsonld
✓ GeoDCAT Turtle metadata written to geodcat.ttl

Conversion complete!
  - Input: croissant.json
  - Output JSON-LD: geodcat.jsonld
  - Output Turtle: geodcat.ttl


## Inspect GeoDCAT JSON-LD

We reload and pretty-print the generated RDF in JSON-LD format to verify key fields like:
- Dataset identifiers
- Distributions and access URLs
- License, spatial extent, and temporal coverage
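
For a quicker overview than the full dump, the nodes can also be summarized with the standard `json` module. This is a sketch that assumes the file holds expanded JSON-LD, i.e. a top-level list of node objects. `summarize_nodes` is a hypothetical helper, and the sample nodes are copied from the generated output.

```python
import json


def summarize_nodes(nodes):
    """Map each node's @id to its list of @type IRIs (expanded JSON-LD)."""
    return {node["@id"]: node.get("@type", []) for node in nodes if "@id" in node}


# Two nodes in the same shape as the generated geodcat.jsonld.
sample_nodes = json.loads("""
[
  {"@id": "https://example.org/NASA%20POWER%20T2M%202020/spatial",
   "@type": ["http://purl.org/dc/terms/Location"]},
  {"@id": "https://example.org/NASA%20POWER%20T2M%202020/recordset/t2m_data/field/longitude",
   "@type": ["http://mlcommons.org/croissant/geocr/Field"]}
]
""")

for node_id, types in summarize_nodes(sample_nodes).items():
    print(node_id, "->", ", ".join(types))
```

To run it against the real file, pass the result of `json.load(open("geodcat.jsonld"))` instead of `sample_nodes`.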

In [4]:
# Load and display the GeoDCAT JSON-LD content
g = Graph()
g.parse("geodcat.jsonld", format="json-ld")

print("GeoDCAT JSON-LD Output:")
print("=" * 80)
print(g.serialize(format="json-ld", indent=2))

GeoDCAT JSON-LD Output:
[
  {
    "@id": "https://example.org/NASA%20POWER%20T2M%202020/spatial",
    "@type": [
      "http://purl.org/dc/terms/Location"
    ],
    "http://www.opengis.net/ont/geosparql#asWKT": [
      {
        "@type": "http://www.opengis.net/ont/geosparql#wktLiteral",
        "@value": "POLYGON((-180.0 -90.0, 179.375 -90.0, 179.375 90.0, -180.0 90.0, -180.0 -90.0))"
      }
    ]
  },
  {
    "@id": "https://example.org/NASA%20POWER%20T2M%202020/recordset/t2m_data/field/longitude",
    "@type": [
      "http://mlcommons.org/croissant/geocr/Field"
    ],
    "http://mlcommons.org/croissant/geocr/dataType": [
      {
        "@value": "sc:Float"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@value": "Longitude coordinate"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@value": "longitude"
      }
    ]
  },
  {
    "@id": "https://example.org/NASA%20POWER%20T2M%202020/recordset/t2m_data/field/time",
    "@type"

## Validate TTL Output

We parse the Turtle (.ttl) file to confirm that it is well-formed RDF, then report basic graph statistics and the bound namespaces.

In [5]:
# Validate the Turtle output
print("Loading Turtle file for validation...")
ttl_graph = Graph()
ttl_graph.parse("geodcat.ttl", format="turtle")

# Print statistics
print(f"\n✓ TTL file successfully parsed!")
print(f"  - Total triples: {len(ttl_graph)}")
print(f"  - Unique subjects: {len(set(ttl_graph.subjects()))}")
print(f"  - Unique predicates: {len(set(ttl_graph.predicates()))}")
print(f"  - Unique objects: {len(set(ttl_graph.objects()))}")

# List namespaces
print("\nNamespaces:")
for prefix, namespace in sorted(ttl_graph.namespaces()):
    print(f"  {prefix}: {namespace}")

Loading Turtle file for validation...

✓ TTL file successfully parsed!
  - Total triples: 62
  - Unique subjects: 12
  - Unique predicates: 27
  - Unique objects: 55

Namespaces:
  adms: http://www.w3.org/ns/adms#
  brick: https://brickschema.org/schema/Brick#
  csvw: http://www.w3.org/ns/csvw#
  dc: http://purl.org/dc/elements/1.1/
  dcam: http://purl.org/dc/dcam/
  dcat: http://www.w3.org/ns/dcat#
  dcmitype: http://purl.org/dc/dcmitype/
  dct: http://purl.org/dc/terms/
  doap: http://usefulinc.com/ns/doap#
  foaf: http://xmlns.com/foaf/0.1/
  geo: http://www.opengis.net/ont/geosparql#
  geocr: http://mlcommons.org/croissant/geocr/
  odrl: http://www.w3.org/ns/odrl/2/
  org: http://www.w3.org/ns/org#
  owl: http://www.w3.org/2002/07/owl#
  prof: http://www.w3.org/ns/dx/prof/
  prov: http://www.w3.org/ns/prov#
  qb: http://purl.org/linked-data/cube#
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  rdfs: http://www.w3.org/2000/01/rdf-schema#
  schema: https://schema.org/
  sh: http

## Display Complete Turtle Output

View the full RDF Turtle serialization of the GeoDCAT metadata.

In [6]:
# Display the complete Turtle output
print("Complete GeoDCAT Turtle (TTL) Output:")
print("=" * 80)
with open("geodcat.ttl", "r", encoding="utf-8") as f:
    print(f.read())

Complete GeoDCAT Turtle (TTL) Output:
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix geocr: <http://mlcommons.org/croissant/geocr/> .
@prefix schema: <https://schema.org/> .
@prefix spdx: <http://spdx.org/rdf/terms#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/NASA%20POWER%20T2M%202020> a dcat:Dataset,
        schema:Dataset ;
    geocr:coordinateReferenceSystem <http://www.opengis.net/def/crs/EPSG:4326> ;
    geocr:recordSet <https://example.org/NASA%20POWER%20T2M%202020/recordset/t2m_data> ;
    geocr:spatialResolution [ a schema:QuantitativeValue ;
            schema:unitText "degrees" ;
            schema:value 5e-01 ] ;
    geocr:temporalResolution [ a schema:QuantitativeValue ;
            schema:unitText "month" ;
            schema:value 1 ] ;
    dct:conformsTo <http://mlcommons.org/croissant/1.1>,
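
The CRS IRI in this output is built by appending the raw `EPSG:4326` string to the OGC base path. OGC's definition server instead publishes EPSG systems under the form `http://www.opengis.net/def/crs/EPSG/0/<code>`. As a minimal sketch, the hypothetical helper `crs_to_ogc_uri` produces that form for EPSG codes and returns `None` for anything else, since other authorities use different version segments.

```python
import re


def crs_to_ogc_uri(code):
    """Turn an 'EPSG:<number>' string into the matching OGC CRS IRI.

    Returns None for values that are not EPSG codes, so the caller can
    decide how to handle other authorities.
    """
    match = re.fullmatch(r"EPSG:(\d+)", code.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    return f"http://www.opengis.net/def/crs/EPSG/0/{match.group(1)}"


print(crs_to_ogc_uri("EPSG:4326"))
```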
 

## Query GeoDCAT Metadata with SPARQL

Use SPARQL queries to extract specific metadata from the GeoDCAT RDF graph.

In [7]:
# Query the GeoDCAT metadata using SPARQL
from rdflib import Graph

# Load the TTL file
g = Graph()
g.parse("geodcat.ttl", format="turtle")

# Query 1: Get dataset basic info
query = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX schema: <https://schema.org/>

SELECT ?dataset ?title ?description ?license
WHERE {
    ?dataset a dcat:Dataset .
    ?dataset dct:title ?title .
    ?dataset dct:description ?description .
    OPTIONAL { ?dataset dct:license ?license }
}
"""
print("Dataset Information:")
print("=" * 80)
for row in g.query(query):
    print(f"Title: {row.title}")
    print(f"Description: {row.description}")
    if row.license:
        print(f"License: {row.license}")

# Query 2: Get all distributions
query = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT ?dist ?title ?url ?format
WHERE {
    ?dataset a dcat:Dataset .
    ?dataset dcat:distribution ?dist .
    OPTIONAL { ?dist dct:title ?title }
    OPTIONAL { ?dist dcat:accessURL ?url }
    OPTIONAL { ?dist dcat:mediaType ?format }
}
"""
print("\n\nDistribution URLs:")
print("=" * 80)
for row in g.query(query):
    if row.title:
        print(f"Distribution: {row.title}")
    if row.url:
        print(f"  Access URL: {row.url}")
    if row.format:
        print(f"  Format: {row.format}")
    print()

Dataset Information:
Title: NASA POWER T2M 2020
Description: Temperature at 2 Meters monthly data for 2020
License: file:///teamspace/studios/this_studio/dcai/GeoCroissant%20to%20GeoDCAT/CC-BY-4.0


Distribution URLs:
Distribution: zarr-data
  Access URL: https://nasa-power.s3.us-west-2.amazonaws.com/merra2/temporal/power_merra2_monthly_temporal_utc.zarr/
  Format: application/zarr
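
The license above appears as a `file://` IRI because the source value (`CC-BY-4.0`) is not an absolute IRI: it is written to Turtle as a relative reference and resolved against the file location when the graph is parsed again. One possible remedy, sketched below, is to normalize the value before it is wrapped in `URIRef`. `normalize_license` is a hypothetical helper, and treating non-URL values as SPDX identifiers is an assumption about the input.

```python
from urllib.parse import urlparse


def normalize_license(value):
    """Return an absolute license IRI.

    Absolute http(s) URLs pass through unchanged. Any other value is
    assumed to be an SPDX identifier and mapped to its spdx.org page.
    """
    if urlparse(value).scheme in ("http", "https"):
        return value
    return f"https://spdx.org/licenses/{value}.html"


print(normalize_license("CC-BY-4.0"))
```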

