This notebook explores the structure of [Schema.org](http://schema.org/) types.

In [1]:
import json

In [2]:
with open("../rocrate/data/schema.jsonld") as f:
    schema = json.load(f)
entities = schema["@graph"]
len(entities)

2588

In [3]:
entities[0]

{'@id': 'http://schema.org/SteeringPositionValue',
 '@type': 'rdfs:Class',
 'http://schema.org/source': {'@id': 'http://www.w3.org/wiki/WebSchemas/SchemaDotOrgSources#Automotive_Ontology_Working_Group'},
 'rdfs:comment': 'A value indicating a steering position.',
 'rdfs:label': 'SteeringPositionValue',
 'rdfs:subClassOf': {'@id': 'http://schema.org/QualitativeValue'}}

Normalize: ensure that the value of `"@type"` and `"rdfs:subClassOf"` is always a list.

In [4]:
for e in entities:
    types = e.get("@type", [])
    if isinstance(types, str):
        e["@type"] = [types]
    subclasses = e.get("rdfs:subClassOf", [])
    if isinstance(subclasses, dict):
        e["rdfs:subClassOf"] = [subclasses]

Flatten `"rdfs:subClassOf"`

In [5]:
for e in entities:
    try:
        subclasses = e["rdfs:subClassOf"]
    except KeyError:
        pass
    else:
        e["rdfs:subClassOf"] = [_["@id"] for _ in subclasses]

In [6]:
entities[0]

{'@id': 'http://schema.org/SteeringPositionValue',
 '@type': ['rdfs:Class'],
 'http://schema.org/source': {'@id': 'http://www.w3.org/wiki/WebSchemas/SchemaDotOrgSources#Automotive_Ontology_Working_Group'},
 'rdfs:comment': 'A value indicating a steering position.',
 'rdfs:label': 'SteeringPositionValue',
 'rdfs:subClassOf': ['http://schema.org/QualitativeValue']}

Schema.org entities have one or more types. Since (some) entities are the actual types we are looking for, we'll call their types "metatypes". Is `rdfs:Class` the only metatype?

In [7]:
metatypes = set.union(*(set(e["@type"]) for e in entities))
metatypes

{'http://schema.org/ActionStatusType',
 'http://schema.org/BoardingPolicyType',
 'http://schema.org/BookFormatType',
 'http://schema.org/Boolean',
 'http://schema.org/CarUsageType',
 'http://schema.org/ContactPointOption',
 'http://schema.org/DataType',
 'http://schema.org/DayOfWeek',
 'http://schema.org/DeliveryMethod',
 'http://schema.org/DigitalDocumentPermissionType',
 'http://schema.org/DriveWheelConfigurationValue',
 'http://schema.org/DrugCostCategory',
 'http://schema.org/DrugPregnancyCategory',
 'http://schema.org/DrugPrescriptionStatus',
 'http://schema.org/EUEnergyEfficiencyEnumeration',
 'http://schema.org/EnergyStarEnergyEfficiencyEnumeration',
 'http://schema.org/EventAttendanceModeEnumeration',
 'http://schema.org/EventStatusType',
 'http://schema.org/GamePlayMode',
 'http://schema.org/GameServerStatus',
 'http://schema.org/GenderType',
 'http://schema.org/GovernmentBenefitsType',
 'http://schema.org/HealthAspectEnumeration',
 'http://schema.org/InfectiousAgentClass',
 '

There's also `rdf:Property` and several other entities we'll explore later.

We are interested in the root(s) of the type hierarchy. Which types have no superclass? Probably properties: not because they're root types, but because they're not types at all. Let's check this:

In [8]:
for e in entities:
    if "rdf:Property" in e["@type"]:
        assert "rdfs:subClassOf" not in e

Let's filter out properties then:

In [9]:
types = [e for e in entities if "rdf:Property" not in e["@type"]]
len(types)

1214

Which types have no superclass?

In [10]:
no_superclass = [e for e in types if "rdfs:subClassOf" not in e]
len(no_superclass)

344

Are all these root types?

In [11]:
no_superclass[0]

{'@id': 'http://schema.org/Neck',
 '@type': ['http://schema.org/PhysicalExam'],
 'http://schema.org/isPartOf': {'@id': 'http://health-lifesci.schema.org'},
 'rdfs:comment': 'Neck assessment with clinical examination.',
 'rdfs:label': 'Neck'}

This is not an `rdfs:Class`, but a `PhysicalExam`. That's one of the "other" metatypes seen above.

In [12]:
type_map = {e["@id"]: e for e in types}
type_map['http://schema.org/PhysicalExam']

{'@id': 'http://schema.org/PhysicalExam',
 '@type': ['rdfs:Class'],
 'http://schema.org/isPartOf': {'@id': 'http://health-lifesci.schema.org'},
 'rdfs:comment': 'A type of physical examination of a patient performed by a physician.',
 'rdfs:label': 'PhysicalExam',
 'rdfs:subClassOf': ['http://schema.org/MedicalEnumeration',
  'http://schema.org/MedicalProcedure']}

`PhysicalExam` is a subclass of `MedicalEnumeration` (and of `MedicalProcedure`)

In [13]:
type_map['http://schema.org/MedicalEnumeration']

{'@id': 'http://schema.org/MedicalEnumeration',
 '@type': ['rdfs:Class'],
 'http://schema.org/isPartOf': {'@id': 'http://health-lifesci.schema.org'},
 'rdfs:comment': 'Enumerations related to health and the practice of medicine: A concept that is used to attribute a quality to another concept, as a qualifier, a collection of items or a listing of all of the elements of a set in medicine practice.',
 'rdfs:label': 'MedicalEnumeration',
 'rdfs:subClassOf': ['http://schema.org/Enumeration']}

`MedicalEnumeration` is, in turn, a subclass of `Enumeration`.

In [14]:
enumeration = type_map['http://schema.org/Enumeration']
enumeration

{'@id': 'http://schema.org/Enumeration',
 '@type': ['rdfs:Class'],
 'rdfs:comment': 'Lists or enumerations—for example, a list of cuisines or music genres, etc.',
 'rdfs:label': 'Enumeration',
 'rdfs:subClassOf': ['http://schema.org/Intangible']}

Thus, `Neck` is one of the possible values for the `PhysicalExam` enum. Let's get all enums (descendants of `Enumeration`) recursively:

In [15]:
def r_enumerations(pid='http://schema.org/Enumeration'):
    for e in types:
        if pid in e.get('rdfs:subClassOf', []):
            yield e
            for se in r_enumerations(e["@id"]):
                yield se
enums = set(_["@id"] for _ in r_enumerations())
len(enums)

70

In [16]:
next(iter(enums))

'http://schema.org/GenderType'

Let's see what's left if we filter out these from the metatypes

In [17]:
metatypes.difference(enums)

{'http://schema.org/Boolean',
 'http://schema.org/DataType',
 'rdf:Property',
 'rdfs:Class'}

In [18]:
type_map["http://schema.org/DataType"]

{'@id': 'http://schema.org/DataType',
 '@type': ['rdfs:Class'],
 'rdfs:comment': 'The basic data types such as Integers, Strings, etc.',
 'rdfs:label': 'DataType',
 'rdfs:subClassOf': ['rdfs:Class']}

`DataType` is both an instance and a subclass of `rdfs:Class` (both a type and a metatype).

In [19]:
[_ for _ in types if "rdfs:Class" in _.get("rdfs:subClassOf", [])]

[{'@id': 'http://schema.org/DataType',
  '@type': ['rdfs:Class'],
  'rdfs:comment': 'The basic data types such as Integers, Strings, etc.',
  'rdfs:label': 'DataType',
  'rdfs:subClassOf': ['rdfs:Class']}]

So `DataType` is the only type that's a subclass of `rdfs:Class`.

In [20]:
datatype_values = [_ for _ in types if "http://schema.org/DataType" in _["@type"]]
datatype_values

[{'@id': 'http://schema.org/Text',
  '@type': ['rdfs:Class', 'http://schema.org/DataType'],
  'rdfs:comment': 'Data type: Text.',
  'rdfs:label': 'Text'},
 {'@id': 'http://schema.org/Number',
  '@type': ['http://schema.org/DataType', 'rdfs:Class'],
  'rdfs:comment': "Data type: Number.<br/><br/>\n\nUsage guidelines:<br/><br/>\n\n<ul>\n<li>Use values from 0123456789 (Unicode 'DIGIT ZERO' (U+0030) to 'DIGIT NINE' (U+0039)) rather than superficially similiar Unicode symbols.</li>\n<li>Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator.</li>\n</ul>\n",
  'rdfs:label': 'Number'},
 {'@id': 'http://schema.org/Time',
  '@type': ['http://schema.org/DataType', 'rdfs:Class'],
  'rdfs:comment': 'A point in time recurring on multiple days in the form hh:mm:ss[Z|(+|-)hh:mm] (see <a href="http://www.w3.org/TR/xmlschema-2/#time">XML schema for details</a>).',
  'rdfs:label': 'Time'},
 {'@id': 'http://schema.org/Date'

So `DataType` is effectively also an enum, with six possible values. What about `Boolean`?

In [21]:
boolean_values = [_ for _ in types if "http://schema.org/Boolean" in _["@type"]]
boolean_values

[{'@id': 'http://schema.org/True',
  '@type': ['http://schema.org/Boolean'],
  'rdfs:comment': 'The boolean value true.',
  'rdfs:label': 'True'},
 {'@id': 'http://schema.org/False',
  '@type': ['http://schema.org/Boolean'],
  'rdfs:comment': 'The boolean value false.',
  'rdfs:label': 'False'}]

Also an enum, and a subclass of the `DataType` enum. To summarize, metatypes are either `rdf:Property`, `rdfs:Class` or enums.

In [22]:
for t in "DataType", "Boolean":
    enums.add(f"https://schema.org/{t}")

Some enum values like `Neck` or `False` have no superclass, so they are part of the candidate root types set we've built above. Let's build a set of all enum values and subtract it from the candidate root types.

In [23]:
enum_values = set()
for entity in types:
    if set(entity["@type"]) <= enums:
        enum_values.add(entity["@id"])
len(enum_values)

355

In [24]:
next(iter(enum_values))

'http://schema.org/Osteopathic'

Add values from the "special" `DataType` and `Boolean` enums.

In [25]:
enum_values |= set(_["@id"] for _ in datatype_values)
enum_values |= {"http://schema.org/False", "http://schema.org/True"}

Back to the types with no superclass:

In [26]:
no_superclass_ids = set(_["@id"] for _ in no_superclass)
no_superclass_ids - enum_values

{'http://schema.org/Thing'}

So the only "real" root type is `Thing`. All other types that don't have a superclass are enum values. However, some enum values have a superclass:

In [27]:
enum_values_with_superclass = enum_values - no_superclass_ids
enum_values_with_superclass

{'http://schema.org/CommunityHealth',
 'http://schema.org/Dermatology',
 'http://schema.org/DietNutrition',
 'http://schema.org/Emergency',
 'http://schema.org/Geriatric',
 'http://schema.org/Gynecologic',
 'http://schema.org/Midwifery',
 'http://schema.org/Nursing',
 'http://schema.org/Obstetric',
 'http://schema.org/Oncologic',
 'http://schema.org/Optometric',
 'http://schema.org/Otolaryngologic',
 'http://schema.org/Pediatric',
 'http://schema.org/Physiotherapy',
 'http://schema.org/PlasticSurgery',
 'http://schema.org/Podiatric',
 'http://schema.org/PrimaryCare',
 'http://schema.org/Psychiatric',
 'http://schema.org/PublicHealth',
 'http://schema.org/RespiratoryTherapy'}

These are the values of which enum?

In [28]:
set.union(*(set(type_map[_]["@type"]) for _ in enum_values_with_superclass))

{'http://schema.org/MedicalSpecialty'}

And what are their superclasses?

In [29]:
set.union(*(set(type_map[_]["rdfs:subClassOf"]) for _ in enum_values_with_superclass))

{'http://schema.org/MedicalBusiness', 'http://schema.org/MedicalTherapy'}

## Mapping the schema.org type hierarchy to Python classes

We can map enums to Python [enums](https://docs.python.org/3.6/library/enum.html).

In [30]:
from enum import Enum

`Enum` can be considered as the mapping for `Enumeration`. All other enumerations can be mapped to Enum subclasses. Note that only leaf enumerations have values, while superclasses must have no members. We already know the `Enumeration -> MedicalEnumeration -> PhysicalExam` hierarchy: let's try to map it.

In [31]:
physical_exam_values = sorted(
    _.rsplit("/", 1)[-1] for _ in enum_values
    if "http://schema.org/PhysicalExam" in type_map[_]["@type"]
)
physical_exam_values

['Abdomen',
 'Appearance',
 'CardiovascularExam',
 'Ear',
 'Eye',
 'Genitourinary',
 'Head',
 'Lung',
 'MusculoskeletalExam',
 'Neck',
 'Neuro',
 'Nose',
 'Skin',
 'Throat']

In [32]:
class MedicalEnumeration(Enum):
    pass

PhysicalExam = MedicalEnumeration("PhysicalExam", physical_exam_values)
e = PhysicalExam.Neck
e == PhysicalExam.Ear

False

In [33]:
isinstance(e, MedicalEnumeration)

True

With respect to the current ro-crate-py model, in principle, these types could avoid being derived from `Entity`, but then the logic for adding and manipulating entities must take into account this different kind of entity.

What about "regular" (non-enum) types? We can have `Thing` map to our `Entity`, then it's only a matter of deriving other classes as required. Or is it?

In [34]:
regular_types = set(_["@id"] for _ in types) - enums
superclasses = set.union(*[set(type_map[_].get("rdfs:subClassOf", [])) for _ in regular_types])
leaves = regular_types - superclasses
sorted(leaves)[0]

'http://schema.org/3DModel'

Here's our first problem: `3DModel` is not a valid Python identifier since it starts with a digit. Let's ignore this for now and pick another one.

In [35]:
sorted(leaves)[-1]

'http://schema.org/Zoo'

In [36]:
def parents(t):
    return [type_map[_] for _ in type_map[t]["rdfs:subClassOf"]]
parents('http://schema.org/Zoo')

[{'@id': 'http://schema.org/CivicStructure',
  '@type': ['rdfs:Class'],
  'rdfs:comment': 'A public structure, such as a town hall or concert hall.',
  'rdfs:label': 'CivicStructure',
  'rdfs:subClassOf': ['http://schema.org/Place']}]

In [37]:
parents('http://schema.org/CivicStructure')

[{'@id': 'http://schema.org/Place',
  '@type': ['rdfs:Class'],
  'rdfs:comment': 'Entities that have a somewhat fixed, physical extension.',
  'rdfs:label': 'Place',
  'rdfs:subClassOf': ['http://schema.org/Thing']}]

In [38]:
parents('http://schema.org/Place')

[{'@id': 'http://schema.org/Thing',
  '@type': ['rdfs:Class'],
  'rdfs:comment': 'The most generic type of item.',
  'rdfs:label': 'Thing'}]

In the general case there might be multiple inheritance. Also, of course the above naive code is not the most efficient way of generating the whole class hierarchy. In this case, though, the structure could be:

In [39]:
class Entity:  # in the real case this would be the ro-crate-py Entity class
    pass

class Place(Entity):
    pass

class CivicStructure(Place):
    pass

class Zoo(CivicStructure):
    pass

However, we currently have a `ContextEntity, DataEntity` level below `Entity`. Most entities in schema.org could probably be derived from `ContextEntity`, but some are data entities. Also, custom RO-Crate aliases must be taken into account: for instance, `File` is an RO-Crate alias for http://schema.org/MediaObject, a subclass of http://schema.org/CreativeWork. Incidentally, `CreativeWork` is already present in our current model, but it's a direct subclass of `Entity`.