
# Jupyter Notebook for Gemini Metadata Validation using pyschematron

This notebook demonstrates how to use the `pyschematron` Python library to validate an XML metadata document against a Schematron schema. The example validates a real BGS metadata record, encoded as ISO 19139 XML, against the GEMINI 2.3 Schematron schema.

In [1]:
import sys

# Print the current Python version
print(f"Python Version: {sys.version}\n")

Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]



## Setup: Install `pyschematron`

In [2]:
!pip install pyschematron==1.1.8 lxml requests

print("pyschematron, lxml, and requests installed.")


Defaulting to user installation because normal site-packages is not writeable
pyschematron, lxml, and requests installed.





## Prepare Files: Fetch Gemini Metadata XML and Schematron

We download a real GEMINI metadata record (XML) and the GEMINI 2.3 Schematron schema from the URLs defined in the cell below, and save both to the working directory.

In [5]:
import os
import requests

# Define URLs for the metadata and schematron files
metadata_url = "https://metadata.bgs.ac.uk/geonetwork/srv/api/records/9df8df52-d788-37a8-e044-0003ba9b0d98/formatters/xml"
schematron_url = "https://raw.githubusercontent.com/agiorguk/gemini-schematron/main/GEMINI_2.3_Schematron_Schema-v1.0.sch"

# Define file paths for saving
xml_file_path = "bgs_gemini_metadata.xml"
schematron_file_path = "GEMINI_2.3_Schematron_Schema-v1.0.sch"

# Function to download content
def download_file(url, filename):
    print(f"Downloading {filename} from {url}...")
    try:
        response = requests.get(url, stream=True, timeout=30)  # Time out rather than hang indefinitely
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"Successfully downloaded '{filename}'.")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {filename}: {e}")
        raise

# Download the metadata XML
try:
    download_file(metadata_url, xml_file_path)
    # Read content to verify
    with open(xml_file_path, "r", encoding="utf-8") as f:
        xml_content = f.read()
    print(f"First 500 characters of '{xml_file_path}':\n{xml_content[:500]}...")

    # Download the Schematron
    download_file(schematron_url, schematron_file_path)
    # Read content to verify
    with open(schematron_file_path, "r", encoding="utf-8") as f:
        schematron_content = f.read()
    print(f"First 500 characters of '{schematron_file_path}':\n{schematron_content[:500]}...")

except Exception as e:
    print(f"Failed to prepare files. Please check URLs and network connection: {e}")


Downloading bgs_gemini_metadata.xml from https://metadata.bgs.ac.uk/geonetwork/srv/api/records/9df8df52-d788-37a8-e044-0003ba9b0d98/formatters/xml...
Successfully downloaded 'bgs_gemini_metadata.xml'.
First 500 characters of 'bgs_gemini_metadata.xml':
<?xml version="1.0" encoding="UTF-8"?>
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gmx="http://www.isotc211.org/2005/gmx" xmlns:gsr="http://www.isotc211.org/2005/gsr" xmlns:gss="http://www.isotc211.org/2005/gss" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:srv="http://www.isotc211.org/2005/srv" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema...
Downloading GEMINI_2.3_Schematron_Schema-v1.0.sch from https://raw.githubusercontent.com/agiorguk/gemini-schematron/main/GEMINI_2.3_Schematron_Schema-v1.0.sch...
Successfully downloaded 'GEMINI_2.3_Schematron_Schema-v1.0.sch'.
First 500 charact
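
Before running the validator, it can help to confirm that each downloaded file is well-formed XML with the expected root element, since an error page saved to disk would otherwise only surface later as a parse failure. The sketch below is one way to do that with the standard library's `xml.etree.ElementTree` rather than `lxml`. The helper name `check_root` and the trimmed sample record are assumptions made for illustration and are not part of the notebook.

```python
import xml.etree.ElementTree as ET

GMD_NS = "http://www.isotc211.org/2005/gmd"


def check_root(xml_data, expected_tag=f"{{{GMD_NS}}}MD_Metadata"):
    """Return True if xml_data is well-formed XML whose root has expected_tag."""
    try:
        root = ET.fromstring(xml_data)
    except ET.ParseError:
        # Not well-formed (truncated download, unbound prefix, and so on)
        return False
    return root.tag == expected_tag


# Hypothetical stand-in for the downloaded record, trimmed for illustration
sample = (
    '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd">'
    "<gmd:fileIdentifier/>"
    "</gmd:MD_Metadata>"
)

print(check_root(sample))                                  # well-formed, expected root
print(check_root("<html><body>Not found</body></html>"))  # well-formed, wrong root
print(check_root("<gmd:MD_Metadata"))                      # not well-formed
```

For the real files, the same check could be applied to the bytes read from `xml_file_path` in binary mode.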

## Perform Validation

Now, we'll use `pyschematron` to validate our downloaded XML file against the Schematron schema.


In [9]:
from pyschematron import validate_document
from lxml import etree # Used for pretty printing SVRL and navigating XML

# Perform validation
try:
    print(f"Validating '{xml_file_path}' against '{schematron_file_path}'...")

    # Parse both the XML metadata file and the Schematron schema file into lxml ElementTree objects
    xml_tree = etree.parse(xml_file_path)
    schematron_tree = etree.parse(schematron_file_path)
    
    # Pass the parsed lxml ElementTree objects to validate_document
    validation_result = validate_document(xml_tree, schematron_tree)

    # Check if the document is valid based on assertions
    is_valid = validation_result.is_valid()
    print(f"\nValidation Result: {'VALID' if is_valid else 'INVALID'}")

    # Get the SVRL report (Schematron Validation Report Language)
    svrl_report = validation_result.get_svrl()

    if svrl_report is not None:
        print("\n--- Full Schematron Validation Report (SVRL) ---")
        # Pretty print the SVRL XML for detailed inspection
        print(etree.tostring(svrl_report, pretty_print=True, encoding='unicode'))
        print("------------------------------------------")
    else:
        print("No SVRL report generated (this might happen if the Schematron has no rules or if an error occurred).")

    # Extract and display failed assertions (an empty list if no SVRL report was produced)
    failed_assertions = svrl_report.xpath("//svrl:failed-assert", namespaces={"svrl": "http://purl.oclc.org/dsdl/svrl"}) if svrl_report is not None else []

    if failed_assertions:
        print("\n--- Summary of Failed Assertions ---")
        for fa in failed_assertions:
            test_expression = fa.get("test")
            location = fa.get("location")
            message_element = fa.xpath("svrl:text", namespaces={"svrl": "http://purl.oclc.org/dsdl/svrl"})
            message_text = message_element[0].text.strip() if message_element else "No message provided in Schematron."
            print(f"  Location: {location}")
            print(f"  Test: {test_expression}")
            print(f"  Message: {message_text}\n")
    else:
        print("\nNo failed assertions found. The document passed all 'assert' rules.")

    # Extract and display successful reports (optional, as reports are informational)
    # Note: namespace URIs are case-sensitive, so the SVRL URI must be all lowercase
    successful_reports = svrl_report.xpath("//svrl:successful-report", namespaces={"svrl": "http://purl.oclc.org/dsdl/svrl"}) if svrl_report is not None else []
    if successful_reports:
        print("\n--- Summary of Successful Reports ---")
        for sr in successful_reports:
            test_expression = sr.get("test")
            location = sr.get("location")
            message_element = sr.xpath("svrl:text", namespaces={"svrl": "http://purl.oclc.org/dsdl/svrl"})
            message_text = message_element[0].text.strip() if message_element else "No message provided in Schematron."
            print(f"  Location: {location}")
            print(f"  Test: {test_expression}")
            print(f"  Message: {message_text}\n")
    else:
        print("\nNo successful reports found.")

except Exception as e:
    print(f"An error occurred during validation: {e}")

Validating 'bgs_gemini_metadata.xml' against 'GEMINI_2.3_Schematron_Schema-v1.0.sch'...

Validation Result: VALID

--- Full Schematron Validation Report (SVRL) ---
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl" xmlns:sch="http://purl.oclc.org/dsdl/schematron" xmlns:xs="http://www.w3.org/2001/XMLSchema" schemaVersion="1.2">
  <svrl:metadata xmlns:dct="http://purl.org/dc/terms/" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:pysch="https://github.com/robbert-harms/pyschematron">
    <dct:creator>
      <dct:agent>
        <skos:prefLabel>PySchematron 1.1.8</skos:prefLabel>
      </dct:agent>
    </dct:creator>
    <dct:created>2025-07-02T11:30:34.627323+01:00</dct:created>
    <dct:source>
      <rdf:Description>
        <dct:creator>
          <dct:Agent>
            <skos:prefLabel>PySchematron 1.1.8</skos:prefLabel>
          </dct:Agent>
        </dct:creator>
        <dct:created>2025-07-02T11:30:3

## Interpreting the Validation Output

The output provides several key pieces of information:

* `Validation Result: VALID/INVALID`: This indicates whether the document passed all `sch:assert` rules in the Schematron. If any `assert` fails, the document is considered `INVALID`.
* Full Schematron Validation Report (SVRL): This is an XML document that provides a detailed log of all validation events.
    * `<svrl:failed-assert>`: These elements indicate a rule that the XML document *failed* to satisfy. This is what makes the document `INVALID`.
        * `@test`: The XPath expression that was evaluated.
        * `@location`: The XPath to the element in your XML document that caused the assertion to fail.
        * `<svrl:text>`: The human-readable message defined in your Schematron for this assertion.
    * `<svrl:successful-report>`: These elements indicate a rule where a `sch:report` condition was met. Reports are typically for warnings, suggestions, or informational messages, and do not cause the document to be `INVALID`.
        * Similar attributes (`@test`, `@location`) and `<svrl:text>` as `failed-assert`.
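
To make the structure above concrete, the following sketch reads a small SVRL document and pulls out the `@location`, `@test`, and `<svrl:text>` of each event. The SVRL fragment is hand-written for illustration and is not real validator output. The helper name `collect` is hypothetical, and the standard library's `xml.etree.ElementTree` is used here in place of `lxml`.

```python
import xml.etree.ElementTree as ET

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Hand-written SVRL fragment for illustration only; not real validator output
SAMPLE_SVRL = """\
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:failed-assert test="count(title) = 1" location="/record">
    <svrl:text>A record must have exactly one title.</svrl:text>
  </svrl:failed-assert>
  <svrl:successful-report test="not(abstract)" location="/record">
    <svrl:text>Consider adding an abstract.</svrl:text>
  </svrl:successful-report>
</svrl:schematron-output>
"""


def collect(root, element_name):
    """Return (location, test, message) tuples for every svrl:<element_name>."""
    rows = []
    for el in root.iterfind(f".//svrl:{element_name}", SVRL_NS):
        text_el = el.find("svrl:text", SVRL_NS)
        message = (text_el.text or "").strip() if text_el is not None else ""
        rows.append((el.get("location"), el.get("test"), message))
    return rows


root = ET.fromstring(SAMPLE_SVRL)
failures = collect(root, "failed-assert")
reports = collect(root, "successful-report")

# Only failed assertions affect validity; successful reports are informational
is_valid = len(failures) == 0

print(f"valid: {is_valid}")
for location, test, message in failures:
    print(f"FAILED at {location}: {message} (test: {test})")
for location, test, message in reports:
    print(f"REPORT at {location}: {message} (test: {test})")
```

In this sample the single failed assertion makes the document invalid, while the successful report is listed without affecting the result.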

The output will show the validation results for the real BGS metadata record against the GEMINI 2.3 Schematron. You can examine the "Summary of Failed Assertions" to see any non-conformities or the "Summary of Successful Reports" for informational messages.

If the validation result is `INVALID`, the failed assertions will guide you on what needs to be corrected in the metadata record to make it conform to the GEMINI 2.3 standard.
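
When a record fails many assertions, grouping the messages by `@location` can make it easier to see which elements need attention. The sketch below shows one possible approach on a hand-written SVRL sample. The function name `group_failures_by_location` and the sample content are assumptions for illustration, and the standard library parser stands in for `lxml`.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Hand-written SVRL sample for illustration only; not real validator output
SAMPLE_SVRL = """\
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:failed-assert test="count(title) = 1" location="/record[1]">
    <svrl:text>Exactly one title is required.</svrl:text>
  </svrl:failed-assert>
  <svrl:failed-assert test="normalize-space(abstract)" location="/record[1]">
    <svrl:text>An abstract is required.</svrl:text>
  </svrl:failed-assert>
  <svrl:failed-assert test="email" location="/record[1]/contact[1]">
    <svrl:text>A contact email is required.</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>
"""


def group_failures_by_location(svrl_root):
    """Map each @location to the list of failed-assert messages reported there."""
    grouped = defaultdict(list)
    for fa in svrl_root.iterfind(".//svrl:failed-assert", SVRL_NS):
        text_el = fa.find("svrl:text", SVRL_NS)
        message = (text_el.text or "").strip() if text_el is not None else "(no message)"
        grouped[fa.get("location", "(unknown location)")].append(message)
    return dict(grouped)


grouped = group_failures_by_location(ET.fromstring(SAMPLE_SVRL))
for location, messages in grouped.items():
    print(location)
    for message in messages:
        print(f"  - {message}")
```

The same function could be applied to the root of the SVRL report returned by the validation cell, provided the report is first serialized and re-parsed or the equivalent `lxml` calls are used.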