
# Jupyter Notebook for Gemini Metadata Validation using pyschematron

This notebook demonstrates how to use the `pyschematron` Python library to validate XML metadata documents against a Schematron schema. It fetches metadata records from a GeoNetwork catalogue over CSW and validates each one against the MEDIN Discovery Metadata Profile Schematron.

In [1]:
import sys

# Print the current Python version
print(f"Python Version: {sys.version}\n")

Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]



## Setup: Install `pyschematron`

In [2]:
!pip install pyschematron==1.1.8 lxml requests

print("pyschematron, lxml, and requests installed.")


Defaulting to user installation because normal site-packages is not writeable
pyschematron, lxml, and requests installed.




In [3]:
# import logging

# Configure logging to show DEBUG messages and above
# You can change logging.DEBUG to logging.INFO, logging.WARNING, etc.
# logging.basicConfig(level=logging.DEBUG,
#                     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# If you want to be more specific, you can target the pyschematron logger
# logger = logging.getLogger('pyschematron')
# logger.setLevel(logging.DEBUG)
# handler = logging.StreamHandler(sys.stdout)
# formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
# handler.setFormatter(formatter)
# logger.addHandler(handler)


## Prepare Files: Fetch Gemini Metadata UUID and Schematron

We download the Schematron schema from the URL defined below; as the file name and header show, it is the MEDIN Discovery Metadata Profile schema (v3.1.2). The metadata records themselves come from the GeoNetwork CSW service, which is queried in the next section.


In [4]:
import os
import requests
from lxml import etree # Used for parsing XML responses

# Define URLs for the Schematron file and the GeoNetwork CSW endpoint
schematron_url = "https://raw.githubusercontent.com/medin-marine/Discovery-Standard-public-content/refs/heads/main/medin_schematron/MedinMetadataProfile_v3_1_2_and_nonGeographicDataset_schematron.sch"
geonetwork_csw_url = "https://metadata.bgs.ac.uk/geonetwork/medindatacatalogue/eng/csw"
geonetwork_record_base_url = "https://metadata.bgs.ac.uk/geonetwork/srv/api/records/"

# Define file paths for saving the Schematron
schematron_file_path = "MedinMetadataProfile_v3_1_2_and_nonGeographicDataset_schematron.sch"

# Function to download content
def download_file(url, filename):
    print(f"Downloading {filename} from {url}...")
    try:
        # A timeout keeps the cell from hanging indefinitely if the server stops responding
        response = requests.get(url, stream=True, timeout=60)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"Successfully downloaded '{filename}'.")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {filename}: {e}")
        raise

# Download the Schematron
try:
    download_file(schematron_url, schematron_file_path)
    with open(schematron_file_path, "r", encoding="utf-8") as f:
        schematron_content = f.read()
    print(f"First 500 characters of '{schematron_file_path}':\n{schematron_content[:500]}...")

except Exception as e:
    print(f"Failed to prepare Schematron file. Please check URL and network connection: {e}")

Downloading MedinMetadataProfile_v3_1_2_and_nonGeographicDataset_schematron.sch from https://raw.githubusercontent.com/medin-marine/Discovery-Standard-public-content/refs/heads/main/medin_schematron/MedinMetadataProfile_v3_1_2_and_nonGeographicDataset_schematron.sch...
Successfully downloaded 'MedinMetadataProfile_v3_1_2_and_nonGeographicDataset_schematron.sch'.
First 500 characters of 'MedinMetadataProfile_v3_1_2_and_nonGeographicDataset_schematron.sch':
﻿<?xml version="1.0" encoding="utf-8" ?>

<!-- Schematron Schema for the MEDIN Disovery Metadata Profile                                  -->

<!-- 
     James Rapaport                                
     SeaZone Solutions                                                   
     2009-10-20          ...
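
The preview above starts with an invisible byte order mark (U+FEFF), which suggests the schema was saved as UTF-8 with a BOM and then read with `encoding="utf-8"`. A minimal sketch of one way to get a clean preview, assuming that is all you need: Python's `utf-8-sig` codec drops the mark on read. The sample bytes and temporary file are illustrative stand-ins, not the real schema.

```python
import tempfile
from pathlib import Path

# Illustrative stand-in for a schema file saved as UTF-8 with a byte order mark
sample_bytes = b"\xef\xbb\xbf" + b'<?xml version="1.0" encoding="utf-8" ?>\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.sch"
    sample_path.write_bytes(sample_bytes)

    # Plain "utf-8" keeps the mark as the first character of the string
    with_bom = sample_path.read_text(encoding="utf-8")
    # "utf-8-sig" strips the mark if present and otherwise behaves like "utf-8"
    without_bom = sample_path.read_text(encoding="utf-8-sig")

print(repr(with_bom[0]))
print(without_bom[:5])
```

The schema is later loaded with `etree.parse` directly from the file and loads successfully, so the mark only affects this text preview.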


## Fetch All Record UUIDs from GeoNetwork CSW

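We send `csw:GetRecords` requests to the GeoNetwork CSW service and collect the `dc:identifier` (UUID) of every metadata record. The service returns results in pages, so the function advances `startPosition` by `maxRecords` on each request until it has passed the `numberOfRecordsMatched` total reported by the server.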


In [5]:
def get_all_record_uuids(csw_url, max_records_per_page=200):
    """
    Fetches all record UUIDs from a GeoNetwork CSW service using pagination.
    """
    uuids = []
    start_position = 1
    total_records = float('inf') # Initialize with infinity to enter the loop

    print(f"\nFetching record UUIDs from CSW endpoint: {csw_url}")

    while start_position <= total_records:
        get_records_xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                service="CSW" version="2.0.2" resultType="results"
                maxRecords="{max_records_per_page}" startPosition="{start_position}"
                xmlns:gmd="http://www.isotc211.org/2005/gmd"
                xmlns:dc="http://purl.org/dc/elements/1.1/">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
  </csw:Query>
</csw:GetRecords>"""

        headers = {'Content-Type': 'application/xml'}
        try:
            # A timeout keeps the loop from hanging indefinitely on an unresponsive server
            response = requests.post(csw_url, data=get_records_xml.encode('utf-8'), headers=headers, timeout=60)
            response.raise_for_status()
            
            csw_response_tree = etree.fromstring(response.content)

            # Extract total number of records
            search_results = csw_response_tree.xpath("//csw:SearchResults", namespaces={"csw": "http://www.opengis.net/cat/csw/2.0.2"})
            if search_results:
                total_records = int(search_results[0].get("numberOfRecordsMatched"))
            else:
                print("Warning: Could not find csw:SearchResults in CSW response. Assuming no more records.")
                break

            # Extract UUIDs from current page
            # Use dc:identifier for the UUID within csw:Record
            current_page_uuids = csw_response_tree.xpath("//csw:Record/dc:identifier/text()", namespaces={"csw": "http://www.opengis.net/cat/csw/2.0.2", "dc": "http://purl.org/dc/elements/1.1/"})
            uuids.extend(current_page_uuids)
            
            print(f"Fetched {len(current_page_uuids)} records from start position {start_position}. Total matched: {total_records}")

            # Stop if a page comes back empty, so an inaccurate total cannot cause an endless loop
            if not current_page_uuids:
                break

            start_position += max_records_per_page

        except requests.exceptions.RequestException as e:
            print(f"Error fetching records from CSW: {e}")
            break
        except etree.XMLSyntaxError as e:
            print(f"Error parsing CSW response XML: {e}")
            print(f"Response content: {response.text[:500]}...")
            break
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            break
            
    print(f"Finished fetching UUIDs. Total UUIDs found: {len(uuids)}")
    return uuids

# Get all UUIDs
all_record_uuids = get_all_record_uuids(geonetwork_csw_url)


Fetching record UUIDs from CSW endpoint: https://metadata.bgs.ac.uk/geonetwork/medindatacatalogue/eng/csw
Fetched 200 records from start position 1. Total matched: 1669
Fetched 200 records from start position 201. Total matched: 1669
Fetched 200 records from start position 401. Total matched: 1669
Fetched 200 records from start position 601. Total matched: 1669
Fetched 200 records from start position 801. Total matched: 1669
Fetched 200 records from start position 1001. Total matched: 1669
Fetched 200 records from start position 1201. Total matched: 1669
Fetched 200 records from start position 1401. Total matched: 1669
Fetched 69 records from start position 1601. Total matched: 1669
Finished fetching UUIDs. Total UUIDs found: 1669
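
The paging logic can be tried offline against a small sample response. The sketch below uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it runs without extra packages. It also reads the `nextRecord` attribute, which CSW 2.0.2 sets to `0` once the last page has been returned, as a possible alternative stopping condition. The sample XML and the `parse_csw_page` helper are hypothetical and hand-written for illustration; they are not taken from the catalogue.

```python
import xml.etree.ElementTree as ET

NS = {
    "csw": "http://www.opengis.net/cat/csw/2.0.2",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample shaped like a CSW 2.0.2 GetRecordsResponse (not real catalogue data)
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <csw:SearchResults numberOfRecordsMatched="5" numberOfRecordsReturned="2" nextRecord="3">
    <csw:Record><dc:identifier>uuid-0001</dc:identifier></csw:Record>
    <csw:Record><dc:identifier>uuid-0002</dc:identifier></csw:Record>
  </csw:SearchResults>
</csw:GetRecordsResponse>"""


def parse_csw_page(xml_bytes):
    """Return (identifiers, total_matched, next_record) for one response page."""
    root = ET.fromstring(xml_bytes)
    results = root.find("csw:SearchResults", NS)
    if results is None:
        raise ValueError("csw:SearchResults not found in response")
    identifiers = [
        el.text.strip()
        for el in results.findall("csw:Record/dc:identifier", NS)
        if el.text
    ]
    total_matched = int(results.get("numberOfRecordsMatched", "0"))
    # nextRecord is 0 once the last page has been returned
    next_record = int(results.get("nextRecord", "0"))
    return identifiers, total_matched, next_record


identifiers, total_matched, next_record = parse_csw_page(SAMPLE_RESPONSE)
print(identifiers, total_matched, next_record)
```

Following `nextRecord` instead of adding a fixed page size would also cope with a server that returns fewer records per page than requested, at the cost of trusting the server to report the attribute.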


## Loop Through Records and Perform Validation

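Now we iterate through the fetched UUIDs. For each one we download the full XML record from the GeoNetwork records API, parse it, and validate it against the Schematron schema downloaded earlier. The schema is parsed once, before the loop, and download, parsing, and validation errors are recorded per record so that one bad record does not stop the run.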


In [6]:
from pyschematron import validate_document
from lxml import etree # Used for pretty printing SVRL and navigating XML

# Load the Schematron schema once
try:
    schematron_tree = etree.parse(schematron_file_path)
    print(f"\nSchematron schema '{schematron_file_path}' loaded successfully.")
except Exception as e:
    print(f"Error loading Schematron schema: {e}")
    schematron_tree = None # Set to None to prevent further validation attempts

if not all_record_uuids:
    print("\nNo record UUIDs found to validate. Please check the CSW fetching step.")
elif schematron_tree is None:
    print("\nCannot proceed with validation as the Schematron schema could not be loaded.")
else:
    print(f"\nStarting validation for {len(all_record_uuids)} records...")
    validation_results_summary = {} # To store a summary of results

    for i, uuid in enumerate(all_record_uuids):
        record_xml_url = f"{geonetwork_record_base_url}{uuid}/formatters/xml"
        record_file_name = f"record_{uuid}.xml"

        print(f"\n--- Validating Record {i+1}/{len(all_record_uuids)}: {uuid} ---")
        try:
            # Download the individual record XML
            # The whole body is read via .content below, so streaming is not needed here
            record_response = requests.get(record_xml_url, timeout=60)
            record_response.raise_for_status()
            
            # Parse the XML content into an lxml ElementTree object
            # pyschematron 1.1.8 expects an lxml.etree._ElementTree object for the XML document
            xml_tree = etree.ElementTree(etree.fromstring(record_response.content))

            # Perform validation
            # Pass the parsed lxml ElementTree object for the XML document
            validation_result = validate_document(xml_tree, schematron_tree)

            is_valid = validation_result.is_valid()
            validation_results_summary[uuid] = "VALID" if is_valid else "INVALID"
            print(f"Validation Result for {uuid}: {'VALID' if is_valid else 'INVALID'}")

            svrl_report = validation_result.get_svrl()

            if svrl_report is not None:
                failed_assertions = svrl_report.xpath("//svrl:failed-assert", namespaces={"svrl": "http://purl.oclc.org/dsdl/svrl"})
                if failed_assertions:
                    print("  --- Failed Assertions ---")
                    for fa in failed_assertions:
                        test_expression = fa.get("test")
                        location = fa.get("location")
                        message_element = fa.xpath("svrl:text", namespaces={"svrl": "http://purl.oclc.org/dsdl/svrl"})
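                        # itertext() avoids an AttributeError when svrl:text is empty or starts with a child element
                        message_text = "".join(message_element[0].itertext()).strip() if message_element else "No message provided."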
                        print(f"    Location: {location}")
                        print(f"    Test: {test_expression}")
                        print(f"    Message: {message_text}\n")
                else:
                    print("  No failed assertions found for this record.")
            else:
                print("  No SVRL report generated for this record.")

        except requests.exceptions.RequestException as e:
            print(f"  Error downloading record {uuid}: {e}")
            validation_results_summary[uuid] = f"ERROR: Download failed ({e})"
        except etree.XMLSyntaxError as e:
            print(f"  Error parsing XML for record {uuid}: {e}")
            validation_results_summary[uuid] = f"ERROR: XML parsing failed ({e})"
        except Exception as e:
            print(f"  An unexpected error occurred during validation for record {uuid}: {e}")
            validation_results_summary[uuid] = f"ERROR: Validation failed ({e})"

    print("\n--- Overall Validation Summary ---")
    for uuid, status in validation_results_summary.items():
        print(f"Record {uuid}: {status}")


Schematron schema 'MedinMetadataProfile_v3_1_2_and_nonGeographicDataset_schematron.sch' loaded successfully.

Starting validation for 1669 records...

--- Validating Record 1/1669: abc9f747-5412-0f38-e044-0003ba9b0d98 ---
Validation Result for abc9f747-5412-0f38-e044-0003ba9b0d98: VALID
  No failed assertions found for this record.

--- Validating Record 2/1669: abc9f747-546e-0f38-e044-0003ba9b0d98 ---
Validation Result for abc9f747-546e-0f38-e044-0003ba9b0d98: VALID
  No failed assertions found for this record.

--- Validating Record 3/1669: a2b1143b-5c9d-23d6-e054-002128a47908 ---
Validation Result for a2b1143b-5c9d-23d6-e054-002128a47908: VALID
  No failed assertions found for this record.

--- Validating Record 4/1669: a2b1143b-5c9f-23d6-e054-002128a47908 ---
Validation Result for a2b1143b-5c9f-23d6-e054-002128a47908: VALID
  No failed assertions found for this record.

--- Validating Record 5/1669: a2b1143b-5c9a-23d6-e054-002128a47908 ---
Validation Result for a2b1143b-5c9a-23d6-
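
The failed-assertion extraction used in the loop can be tried without running a validation. The sketch below parses a small SVRL fragment with the standard library's `xml.etree.ElementTree`, so it runs without extra packages. The fragment and the `summarise_failures` helper are hypothetical and hand-written for illustration; the fragment is not output from the schema used in this notebook.

```python
import xml.etree.ElementTree as ET

SVRL_URI = "http://purl.oclc.org/dsdl/svrl"
SVRL_NS = {"svrl": SVRL_URI}

# Hand-written SVRL fragment for illustration only
SAMPLE_SVRL = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:fired-rule context="/record"/>
  <svrl:failed-assert test="count(title) = 1" location="/record">
    <svrl:text>  A record must have exactly
      one title.  </svrl:text>
  </svrl:failed-assert>
  <svrl:failed-assert test="string-length(abstract) &gt; 0" location="/record/abstract"/>
</svrl:schematron-output>"""


def summarise_failures(svrl_xml):
    """Return one dict per svrl:failed-assert with its location, test, and message."""
    root = ET.fromstring(svrl_xml)
    failures = []
    for failed in root.iter(f"{{{SVRL_URI}}}failed-assert"):
        text_el = failed.find("svrl:text", SVRL_NS)
        # itertext() also picks up text nested in child elements;
        # split/join collapses line breaks and runs of spaces
        message = ""
        if text_el is not None:
            message = " ".join("".join(text_el.itertext()).split())
        failures.append({
            "location": failed.get("location"),
            "test": failed.get("test"),
            "message": message or "No message provided.",
        })
    return failures


failures = summarise_failures(SAMPLE_SVRL)
for failure in failures:
    print(failure)
```

Collecting the failures as dictionaries rather than printing them straight away makes it easier to write them to a CSV file or group them by test later.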

## Interpreting the Validation Output

The output provides several key pieces of information for each record:
* **`Validation Result for <uuid>: VALID` or `INVALID`**: Whether the document passed all `sch:assert` rules in the Schematron. If any `assert` fails, the document is considered `INVALID`.
* **Failed Assertions**: For `INVALID` records, each failed assertion is listed with:
    * `Location`: The XPath to the element in the XML document that caused the assertion to fail.
    * `Test`: The XPath expression that was evaluated.
    * `Message`: The human-readable message defined in the Schematron for this assertion.
* **Overall Validation Summary**: A final list showing the status (`VALID`, `INVALID`, or `ERROR` with the reason) for each processed UUID.

Because each record is downloaded and validated independently and errors are caught per record, a failed download or malformed document is reported as an `ERROR` for that UUID and the run continues with the next record.
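
With well over a thousand records, a per-status tally is often easier to read than the full list. A minimal sketch, assuming a dictionary shaped like `validation_results_summary` from the validation cell; the identifiers and statuses below are made up for illustration.

```python
from collections import Counter

# Illustrative stand-in for validation_results_summary; the identifiers are made up
sample_summary = {
    "uuid-0001": "VALID",
    "uuid-0002": "INVALID",
    "uuid-0003": "ERROR: Download failed (timeout)",
    "uuid-0004": "VALID",
}

# Error statuses carry a detail message, so group on the part before the first colon
status_counts = Counter(status.split(":", 1)[0] for status in sample_summary.values())

for status, count in status_counts.most_common():
    print(f"{status}: {count}")
```

In the notebook itself, `sample_summary` would be replaced by `validation_results_summary`.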
