feat(generators): add --default-language flag for language-tagged literals#9

Open
jdsika wants to merge 3 commits into develop from feat/generators-default-language

jdsika commented Apr 25, 2026

Summary

Adds a --default-language CLI option to both gen-owl and gen-shacl that emits BCP 47 language-tagged string literals (e.g. "Person"@en) for human-readable annotations.

This enables ontology producers to comply with RDF 1.1 §3.3 (language-tagged strings as rdf:langString) and OWL 2 §6.3 (annotation property values) without manual post-processing.

Problem

LinkML generators currently emit all string literals as plain xsd:string values, even for human-readable annotations like rdfs:label, rdfs:comment, sh:name, and sh:description. This prevents downstream consumers from:

  • Filtering labels by language in SPARQL (FILTER(lang(?label) = "en"))
  • Supporting multilingual ontologies
  • Complying with W3C best practices for language-tagged metadata

The LinkML metamodel already has an in_language metaslot, but no generator uses it.

Changes

gen-owl (owlgen.py)

  • New default_language field on OwlSchemaGenerator
  • _LANGUAGE_TAGGABLE_RANGES frozenset (string, ncname) guards tagging — technical types (URI, integer, boolean, datetime) are never tagged
  • _resolve_language() resolves element-level in_language → generator-level default_language → None
  • _literal() helper creates properly tagged Literal objects
  • add_metadata() tags string-range and fallback-range annotation literals
  • add_enum() PV labels respect language tags (constraint values in owl:oneOf are correctly NOT tagged)
  • New --default-language Click CLI option
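
The resolution order and the range guard described above can be sketched roughly as follows. This is not the code from owlgen.py: the function names, the tuple return value, and the exact fallback logic are assumptions made for illustration; only the set contents (string, ncname) and the precedence order come from the description.

```python
from typing import Optional, Tuple

# Ranges eligible for tagging, as listed in the PR description.
LANGUAGE_TAGGABLE_RANGES = frozenset({"string", "ncname"})


def resolve_language(element_language: Optional[str],
                     default_language: Optional[str]) -> Optional[str]:
    """Element-level in_language wins, then the generator default, then None."""
    return element_language or default_language or None


def tagged_value(value: str, range_name: str,
                 language: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (value, language); the tag is dropped for non-taggable ranges.

    A real generator would build an RDF literal here; a tuple keeps the
    sketch dependency-free.
    """
    if language is not None and range_name in LANGUAGE_TAGGABLE_RANGES:
        return (value, language)
    return (value, None)
```

With these definitions, a label on an element that declares in_language "de" would be tagged "de" even when the generator default is "en", and a URI-ranged value would stay untagged in both cases.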

gen-shacl (shaclgen.py)

  • New default_language field with __post_init__ whitespace normalisation
  • NodeShape rdfs:label / rdfs:comment get language tags
  • PropertyShape sh:name / sh:description get language tags via prop_pv_text()
  • _add_annotations() tags string annotation values
  • Numeric literals (sh:order, sh:minCount, etc.) are never tagged
  • New --default-language Click CLI option

Tests

  • 7 new OWL tests: tagged labels, backward-compat plain literals, URI ranges, in_language override, annotations, empty string, whitespace-only
  • 7 new SHACL tests: NodeShape, PropertyShape, plain literals, numeric guards, annotations, empty string, whitespace-only

Backward compatibility

  • Default is None (no language tags) — existing behaviour is completely unchanged
  • Empty strings and whitespace-only values are normalised to None
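
A minimal sketch of the normalisation rule, assuming a dataclass-style generator. `GeneratorOptions` is a hypothetical stand-in, not the real generator class, and stripping surrounding whitespace from non-empty values is an assumption beyond what the description states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneratorOptions:  # hypothetical stand-in for the generator class
    default_language: Optional[str] = None

    def __post_init__(self) -> None:
        # Empty and whitespace-only values collapse to None, so the
        # generator falls back to emitting plain (untagged) literals.
        if self.default_language is not None:
            self.default_language = self.default_language.strip() or None
```

Under this rule, passing an accidentally blank value behaves the same as omitting the option.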

Standards compliance

| Standard | Requirement | Status |
| --- | --- | --- |
| RDF 1.1 §3.3 | rdf:langString vs xsd:string distinction | |
| OWL 2 §6.3 | Annotation properties accept rdf:langString | |
| SHACL §2.3.2.1 | sh:name / sh:description range includes rdf:langString | |
| BCP 47 (RFC 5646) | Language tag format | ✅ (no pre-validation, consistent with rdflib) |

Testing

  • 113 tests pass (91 owlgen + 22 shaclgen), 4 skipped
  • 3 rounds of adversarial review completed with 0 open bugs

jdsika added 2 commits April 25, 2026 14:02
…lib serialization

Add a --deterministic / --no-deterministic CLI flag (default off) to OWL,
SHACL, JSON-LD Context, and JSON-LD generators that produces byte-identical
output across invocations.

Three-phase hybrid pipeline for Turtle generators:
1. RDFC-1.0 canonicalization (W3C Recommendation) via pyoxigraph
2. Weisfeiler-Lehman structural hashing for diff-stable blank node IDs
3. Hybrid rdflib re-serialization for idiomatic Turtle (inline blank
   nodes, collection syntax, prefix filtering)

JSON generators use deterministic_json() with recursive deep-sort and
JSON-LD-aware key ordering that preserves conventional @context structure.
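
A rough sketch of what a recursive deep-sort with JSON-LD-aware key ordering could look like. This is not the deterministic_json() from the commit: the function name, the rule that "@"-prefixed keywords sort before other keys, and leaving list order untouched are all assumptions.

```python
import json
from typing import Any


def deterministic_json_sketch(obj: Any) -> str:
    """Serialise obj with a stable key order, independent of insertion order."""

    def sort_value(value: Any) -> Any:
        if isinstance(value, dict):
            # Assumed ordering: JSON-LD keywords ("@context", "@id", ...)
            # first, then all other keys, each group sorted alphabetically.
            keys = sorted(value, key=lambda k: (not k.startswith("@"), k))
            return {k: sort_value(value[k]) for k in keys}
        if isinstance(value, list):
            # List order is preserved in this sketch; only nested dicts are sorted.
            return [sort_value(item) for item in value]
        return value

    return json.dumps(sort_value(obj), indent=2, ensure_ascii=False)
```

Two documents that differ only in key insertion order serialise to the same string, which is the property needed for byte-identical output.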

Collection items (owl:oneOf, sh:in, sh:ignoredProperties) are sorted
when --deterministic is set to ensure reproducible RDF list order.

pyoxigraph >= 0.4.0 is imported lazily only when --deterministic is used.
Tests skip gracefully when pyoxigraph is unavailable.

Refs: linkml#1847
Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
Signed-off-by: jdsika <carlo.van-driesten@bmw.de>
… names

Add an opt-in --normalize-prefixes flag to OWL, SHACL, and JSON-LD
Context generators that normalises non-standard prefix aliases to
well-known names from a static prefix map (derived from rdflib 7.x
defaults, cross-checked against prefix.cc consensus).

Key design decisions:
- Static frozen map (MappingProxyType) instead of runtime
  Graph().namespaces() lookup eliminates rdflib version dependency
- Both http://schema.org/ and https://schema.org/ map to 'schema'
- Shared normalize_graph_prefixes() helper used by OWL and SHACL
- Two-phase graph normalisation: Phase 1 normalises schema-declared
  prefixes, Phase 2 cleans up runtime-injected bindings
- Collision detection: skip with warning when standard prefix name
  is already user-declared for a different namespace
- Phase 2 guard prevents overwriting HTTPS bindings with HTTP variants
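
The static map and the collision rule might be sketched like this. The map below is a three-entry excerpt chosen for illustration, and `normalize_prefixes` is a hypothetical helper operating on a plain prefix-to-namespace dict, not the normalize_graph_prefixes() from the commit.

```python
import warnings
from types import MappingProxyType
from typing import Dict

# Read-only excerpt of a namespace -> well-known prefix map (illustrative only).
WELL_KNOWN_PREFIXES = MappingProxyType({
    "http://schema.org/": "schema",
    "https://schema.org/": "schema",
    "http://purl.org/dc/elements/1.1/": "dc",
})


def normalize_prefixes(declared: Dict[str, str]) -> Dict[str, str]:
    """Rename non-standard aliases; declared maps prefix -> namespace."""
    result = dict(declared)
    for prefix, namespace in declared.items():
        standard = WELL_KNOWN_PREFIXES.get(namespace)
        if standard is None or standard == prefix:
            continue  # custom namespace, or already using the standard name
        if standard in result and result[standard] != namespace:
            # Collision: the standard name is taken by another namespace.
            warnings.warn(f"prefix '{standard}' already bound; keeping '{prefix}'")
            continue
        del result[prefix]
        result[standard] = namespace
    return result
```

For example, `{"sdo": "http://schema.org/"}` becomes `{"schema": "http://schema.org/"}`, while a schema that already binds `schema` to an unrelated namespace is left unchanged and a warning is issued.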

The flag defaults to off, preserving existing behaviour.

Tests cover OWL, SHACL, and context generators with sdo->schema,
dce->dc, http/https edge case, custom prefix preservation, flag-off
backward compatibility, cross-generator consistency, prefix collision
detection, schema1 regression prevention, Phase 2 HTTPS guard, empty
schema edge case, and static map integrity.

Signed-off-by: jdsika <carlo.van-driesten@bmw.de>
…erals

Add a --default-language CLI option to gen-owl and gen-shacl that wraps
human-readable annotation literals (rdfs:label, rdfs:comment,
skos:definition, sh:name, sh:description, dcterms:title) with a BCP 47
language tag.

- Element-level in_language overrides the generator default
- Technical literals (URIs, numerics, XSD facets) are never tagged
- Non-string annotation values preserve their native RDF datatype
- Whitespace-only values are normalised to None

Signed-off-by: Carlo van Driesten <carlo.van-driesten@2last.eu>
jdsika force-pushed the feat/generators-default-language branch from c3751dd to 06ac9fc on April 26, 2026 13:19