## 03. Schemas: Types! Types everywhere!

In this tutorial you will learn everything you need for defining your own schemas and schema plugins.

**Prerequisites:**

* basic understanding of Metador schemas and plugins (covered in previous tutorials)
* Optional, but helpful: some experience with [dataclasses](https://docs.python.org/3/library/dataclasses.html) and/or [pydantic](https://pydantic-docs.helpmanual.io/)

**Learning Goals:**

* Learn how to create a new schema based on an existing schema
* Learn how to model your metadata using powerful type hints
* Learn how to make a schema semantic using JSON-LD annotations
* Understand best practices for schema design and some common pitfalls to avoid

### Defining a schema

<div class="alert alert-block alert-info">
    Every Metador schema is either a direct subclass of <tt>MetadataSchema</tt>, or subclass of an existing schema.
</div>

So this is a perfectly valid, very simple schema:

In [1]:
from metador_core.schema import MetadataSchema

class SimpleSchema(MetadataSchema):
    """My new schema, totally unrelated to any other schemas."""
        
    fun: bool
    """Flag whether the person reading this tutorial is having fun."""
    
SimpleSchema(fun=True)

SimpleSchema(fun=True)

As you have seen in the last tutorial, a schema can be exposed as a plugin (just as any other kind of Metador plugin) by:

* adding a `Plugin` inner class that (at least) defines a `name` and a `version`
* declaring an entrypoint with the same `name` in the correct plugin group (for schemas: `"metador_schema"`)

But there are still (at least) two questions that must be answered before you can become productive with schema development:

* how can you extend an existing schema correctly to get all the advantages provided by Metador?
* how can you express the requirements for the values that can go into all the different fields?

By the end of this tutorial, you will have an answer to both of these and many other questions.

### Extending an existing schema: The absolute minimum

In the first tutorial, we mentioned that one core feature of schemas in Metador is that they can be easily **extended** and encouraged you to do so. We discussed metadata for a custom image format as an example. In the following, you will learn everything you need to do this in practice and we will arrive at a reasonable schema by the end of this tutorial.

We want to extend the `core.imagefile` schema to have some format-specific extra fields, so we take that schema as the base class for our schema. Furthermore, at the time we start writing our schema the `core.imagefile` schema is available in a certain version. To make sure that our schema works as expected in the future, we must also state which version of `core.imagefile` we intend our schema to be based on. Therefore, our "schema skeleton" looks like this:

In [2]:
from metador_core.plugins import schemas

ImageFile = schemas["core.imagefile"]

class NicheImage(ImageFile):
    """Schema for the .niche image format."""
    
    class Plugin:
        name = "dummy.imagefile.niche"
        version = (0, 1, 0)
        parent_schema = ImageFile.Plugin.ref(version=(0, 1, 0))

<div class="alert alert-block alert-warning">
In practice, you register your schema plugin as an entrypoint in your own Python package, as described in the previous tutorial. For practical purposes, we will bypass this step in this and following tutorials, and instead will manually "load" the schema into the plugin system using the <tt>register_in_group</tt> function or decorator. That way you can conveniently stay in this notebook and follow along, without needing to copy-paste everything into a project.
</div>

In [3]:
from metador_core.plugins.util import register_in_group

register_in_group(schemas, NicheImage)  # <- only used in tutorials, must not and cannot be used in practice!

Notebook: Plugin 'dummy.imagefile.niche' registered in 'schema' group!


On the surface, nothing exciting happened. Internally, though, your schema was processed the same way it would be when loaded from an entrypoint: many checks were performed, and now it is accessible through the plugin interface.

Now the schema is registered, so let's try to access it through the plugin system:

In [4]:
MySchema = schemas["dummy.imagefile.niche"]
print(MySchema)

<class '__main__.NicheImage'>


So we have successfully registered a schema, but it is the same as the parent, except in name - we did not change or define any field! Before we actually add some fields for our new image type, we first need to clarify the rules of schema inheritance and understand better how useful fields can be defined. We will revisit our niche image format again a bit later.

### Digression: One thing can have many names

<div class="alert alert-block alert-info">
  If you did not feel confused by the last cell and its output, feel free to go ahead to the next section.
</div>

If you *are* confused, it could be by the fact that the schema now is both `NicheImage` and `MySchema`. The source of your confusion is that you are used to `import`ing classes and relying on their class names being stable and unique.

**In the Metador plugin system, the *actually* important name is the one we declared as the entrypoint, `dummy.imagefile.niche`** - this is the name you should choose carefully, keep stable and avoid changing at all costs once your schema is publicly available. The Python variable referring to that class is just a variable, only by convention we give classes names that start with a capital letter.

Someone who accesses a Metador plugin by its entrypoint name can assign it to any variable name they want - getting a class and calling it `MySchema` is the same as assigning a number to a variable you name `x`. You probably just have not thought about classes also being something that you can assign to a variable. If you are still confused, meditate over the following code snippet and compare it to the `NicheImage`/`MySchema` situation:

In [5]:
myDict = dict(foo="hello", bar="world")    # we create a dict and give it the name myDict
myDict2 = myDict                           # we assign it to a variable myDict2, this is NOT a copy!
myDict2["bar"] = "Metador"
print(myDict, myDict2, myDict is myDict2)  # myDict and myDict2 refer to the same underlying thing!

{'foo': 'hello', 'bar': 'Metador'} {'foo': 'hello', 'bar': 'Metador'} True
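The same holds for classes. The following plain-Python sketch uses a stand-in class and an ordinary dict in place of the plugin group (both are hypothetical and not part of Metador) to show that looking up a class by a string key and binding it to a new variable gives you the very same class object:

```python
class NicheImageDemo:
    """Stand-in class for illustration, not a real Metador schema."""


# Hypothetical stand-in for the plugin group: maps entrypoint names to classes.
registry = {"dummy.imagefile.niche": NicheImageDemo}

# Look the class up by its entrypoint name and bind it to another variable.
MySchemaDemo = registry["dummy.imagefile.niche"]

same_class = MySchemaDemo is NicheImageDemo  # both names refer to one object
class_name = MySchemaDemo.__name__           # the class still knows its own name
print(same_class, class_name)
```

The variable name you choose has no effect on the class itself, which is why the entrypoint name is the one that matters.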


### The family life of schemas

**Schemas in Metador only support single inheritance**, which means that a schema can declare only one parent schema plugin that it claims to extend. As a result, each Metador schema has a neat linear inheritance chain. We can inspect the inheritance chain of registered parent schemas like this:

In [6]:
schemas.parent_path("dummy.imagefile.niche")

['core.file', 'core.imagefile', 'dummy.imagefile.niche']

This tells us that `core.file` is the parent schema of `core.imagefile`, which in turn is the parent schema of our new schema.

We can also check the "descendant" schemas of a schema, like this:

In [7]:
schemas.children("core.file")

{'core.imagefile', 'dummy.imagefile.niche'}

We get back a set of all schemas that, directly or indirectly, are based on `core.file` and can be used in every 
place where `core.file` is expected - so one can say that this set is the set of installed schemas which are "compatible" with `core.file` as-is, without needing to do anything to the metadata.
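
To make the idea of a linear chain concrete, here is a minimal sketch that models the plugin lineage as a plain dict from schema name to declared parent name. The dict and both helper functions are assumptions made up for illustration; they only mimic the results shown above and say nothing about how Metador implements `parent_path` and `children` internally.

```python
# Hypothetical model: each schema name maps to its single declared parent.
parents = {
    "core.file": None,
    "core.imagefile": "core.file",
    "dummy.imagefile.niche": "core.imagefile",
}


def parent_path(name):
    """Walk up the single-parent chain and return it from root to `name`."""
    path = []
    while name is not None:
        path.append(name)
        name = parents[name]
    return path[::-1]


def children(name):
    """Return all schemas that have `name` somewhere in their parent path."""
    return {n for n in parents if n != name and name in parent_path(n)}


print(parent_path("dummy.imagefile.niche"))
print(children("core.file"))
```

Because every schema has at most one parent, walking upwards can never branch, which is what makes the chain linear.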


<div class="alert alert-block alert-info"> <b>Parent Compatibility:</b><br />
    Every metadata object that is valid according to a schema must also be valid according to its declared parent schema (if it has one).
</div>
  
This is the most important requirement for writing schemas, as without this, building a hierarchy of schemas would have no value. A schema is like a "micro-standard": it can be designed in better or worse ways, but as long as people agree to use it, it has value. Extending a schema in a compatible way is like building on an existing standard.

In our example, your responsibility is to ensure this for the declared parent plugin, `core.imagefile`. The authors of `core.imagefile` are responsible for making sure that each valid `core.imagefile` is a valid `core.file`, so you do not have to worry or think about that. If everyone is doing their part, the chain will work and your new schema will also be a valid `core.file` "for free", by transitivity.

### All schemas are equal, but some schemas are more equal: To Plugin or not

You might already have noticed that not every schema you define must be a plugin - you can define, use and even inherit from schemas that are not plugins just fine. Being registered as a plugin with a nice `name` and a well-maintained `version` is just an extra property, one which important schemas usually have. So what kind of schemas should be a plugin?

* Every schema that you want others to put into containers **must** be a plugin
* Every schema that you want other people to actually use and possibly extend **must** be a plugin
* Every schema that does not make sense on its own **must not** be a plugin
* Apart from these guidelines, it is up to you

Remember that nested schemas can be accessed through the `Fields` interface, so there is no need to declare all nested schemas your schema is built from. At the same time, this communicates to the users of your schema: "the nested schemas are only parts of the larger schema you use, they are not supposed to be used for anything else".

**Every schema you register as a plugin can be attached as-is to a node in a Metador container**, that is why something like a general schema for a 3D position should not be a registered plugin.

<div class="alert alert-block alert-info"> <b>TODO</b> Maybe there should better be a flag whether a schema is usable in a container instead of coupling this ability with "being a schema plugin". To be revised</div>

### Digression: A Tale Of Two Lineages

Now let us look at a toy example:

In [8]:
from metador_core.schema import MetadataSchema

class NotAPlugin(MetadataSchema):
    a: int
    b: str
        
class AlsoNotAPlugin(ImageFile):
    c: NotAPlugin

@register_in_group(schemas)
class FinallyAPlugin(AlsoNotAPlugin):
    
    class Plugin:
        name = "dummy.imagefile.abc"
        version = (0, 1, 0)

print(schemas.parent_path("dummy.imagefile.abc"))
print(schemas.children("core.file"))

Notebook: Plugin 'dummy.imagefile.abc' registered in 'schema' group!
['dummy.imagefile.abc']
{'dummy.imagefile.niche', 'core.imagefile'}


`NotAPlugin` is a schema, but not a plugin. The top level class all schemas are derived from, `MetadataSchema`, is also not a plugin. Now the class `AlsoNotAPlugin` inherits from the `ImageFile` schema, which is a plugin, but it can use `NotAPlugin` we defined before.

Now for a more interesting example: `FinallyAPlugin` is, in fact, a plugin, **but for Metador, it is not considered a child of an existing schema**!

Metador only cares about inheritance between plugin schemas that have a name and version, and what determines the parent schema from the point of view of Metador plugins is what is stated as `parent_schema`. As we omitted it, for Metador this is a totally independent schema.

<div class="alert alert-block alert-info">
While the Python class inheritance chain and the plugin inheritance chain are deeply related and will agree most of the time, <b>they are distinct</b>.
</div>

* The Python inheritance chain is technical, it is about reuse of implementation
* The Metador schema inheritance is semantic, it is about representing layers of meaning
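
The difference between the two chains can be sketched with plain Python classes. The classes below are stand-ins invented for this illustration (they are not Metador schemas); the point is only that the Python method resolution order still contains every ancestor class, while the declared plugin parent is simply whatever the `Plugin` inner class states, or nothing at all.

```python
class ImageFileDemo:
    """Stand-in for a plugin schema."""

    class Plugin:
        name = "core.imagefile"


class AlsoNotAPluginDemo(ImageFileDemo):
    """Inherits technically, declares nothing of its own."""


class FinallyAPluginDemo(AlsoNotAPluginDemo):
    class Plugin:
        name = "dummy.imagefile.abc"
        # no parent_schema attribute is declared here


# Technical lineage: all Python ancestors are still there.
python_ancestors = [cls.__name__ for cls in FinallyAPluginDemo.__mro__]

# Semantic lineage: only what is explicitly declared counts.
declared_parent = getattr(FinallyAPluginDemo.Plugin, "parent_schema", None)

print(python_ancestors)
print(declared_parent)
```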

<!--

In some situations it might be useful to:

* extend a schema, but put it on the same level as one of the parents, instead of "below"
* create an schema that should not be used on its own

To satisfy such needs:

* not every schema has to be a plugin - this is totally optional
* `parent_schema`, if declared, can be **any** ancestor listed by `parent_path`

#### Q: So what should I list as `parent_schema`?

* Answer 1: Just use the same class you inherited from as `parent_schema` to provide a version
* (**Advanced**) Answer 2: In some very rare cases you might want to deviate, we will discuss this later

-->

#### Q: Why do I have to declare the `parent_schema` even if it is the same as the class, isn't this redundant?

**A:** As mentioned above, to provide Metador with the assumed version of the schema (the version you used when developing your child schema). If we did not do this, there would be no way to detect whether the installed parent schema is the right one - after all, you do not state the version of the schema when asking for it, you always access the currently available version. With the declared version, a mismatch can be detected.

### On the shoulders of giants: Extending a schema correctly

You have already seen in the last tutorial that schemas are defined using Python **type hints**, the same way [dataclasses](https://docs.python.org/3/library/dataclasses.html) are defined. We will now look at how you can extend your schema, and how you can make sure that the parent compatibility is not violated. There are only two guidelines that you need to keep in mind:

<div class="alert alert-block alert-info">
    <b>Parent Compatibility (1)</b>: Adding a new field that is not defined in the parent schema is safe.
</div>

The parent schema will simply ignore your additional field, so adding a new field can cause no problems up the "ancestry chain". 

The case actually needing some care and consideration is the following:

<div class="alert alert-block alert-info">
    <b>Parent Compatibility (2)</b>: You may only change a field which exists in the parent schema to a more restricted type.
</div>

This means that **you must not replace an existing field in a way that your schema would accept a value that the parent schema would not accept**. Notice that we say "exists" and not "is defined", because the parent itself can have inherited fields on its own which you also have to keep in mind.

<div class="alert alert-block alert-info">
    <b>Implication:</b> Removing a field existing in the parent is impossible.
</div>

If you could remove a field, you could also re-define it to something else entirely, so this cannot be possible.

If a parent field is mandatory, you are out of luck and will have to provide a value. If it is optional, you are lucky and can simply ignore it.

If you cannot live with these constraints, then maybe the chosen parent class is not really suitable for you.

<!--
<div class="alert alert-block alert-warning">
    <b>Advanced Users:</b> Here is the promised example use-case for having a different <tt>parent_schema</tt>.
    These rules are required and enforced on the level of the <tt>plugin schema</tt> parent lineage. This means that you can bend these rules a bit by setting a deviating <tt>parent_schema</tt> (or none at all), so that your schema can "branch off" at a higher point in the chain, but still inherit. Note that this means from the perspective of Metador, that your schema and the class you inherited from have <b>no relation to each other whatsoever</b>. This way you could make use of technical inheritance (saving some time and duplication) without implying semantic compatibility and everything it entails.
    
    <b>TODO:</b> Is this actually that useful?
</div>
-->

Now let us look at a toy example. Consider the following parent class:

In [9]:
from metador_core.schema import MetadataSchema
from typing import Union, Optional

@register_in_group(schemas)
class Parent(MetadataSchema):
    
    class Plugin:
        name = "dummy.parent"
        version = (0, 1, 0)
        
    foo: Union[int, str]
    bar: Optional[float]
    qux: str

Notebook: Plugin 'dummy.parent' registered in 'schema' group!


Now we will register a child class that satisfies the rules:

In [10]:
from metador_core.schema.decorators import make_mandatory

@register_in_group(schemas)
@make_mandatory("bar")
class Child1(Parent):
    
    class Plugin:
        name = "dummy.child1"
        version = (0, 1, 0)
        
    foo: int
    new_field: bool

Notebook: Plugin 'dummy.child1' registered in 'schema' group!


You can see that we used the `make_mandatory` decorator. It takes an inherited field from the parent and makes sure that it is not `Optional`. The advantage is that you do not have to "duplicate" the inherited type just to get rid of the `Optional`, and it clearly states in what way this field is changed, covering a very common use-case.

<div class="alert alert-block alert-info">
    You can <b>use the <tt>make_mandatory</tt> decorator</b> to ensure that an inherited field is mandatory.<br />
    It will ensure that the field actually exists in the parent and will define the correct non-optional type for you automatically.
</div>
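
Conceptually, making a field mandatory means removing `None` from the set of accepted types. The following sketch shows that type-level operation using only the `typing` module; `strip_optional` is a hypothetical helper written for this tutorial and is not how `make_mandatory` is implemented.

```python
from typing import Optional, Union, get_args, get_origin


def strip_optional(tp):
    """Return `tp` without the None alternative (hypothetical helper)."""
    if get_origin(tp) is Union:
        rest = tuple(arg for arg in get_args(tp) if arg is not type(None))
        return Union[rest]  # a Union with one member collapses to that member
    return tp


print(strip_optional(Optional[float]))            # the type of `bar` in Parent
print(strip_optional(Optional[Union[int, str]]))  # other alternatives are kept
print(strip_optional(str))                        # non-optional types are unchanged
```

Applied to `bar: Optional[float]` from `Parent`, this yields plain `float`, which matches the type reported for `bar` in `Child1`.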

Let us see what happens if we violate the parent compatibility rules:

In [11]:
try:  # we know it will go wrong

    @register_in_group(schemas)
    class Child2(Parent):

        class Plugin:
            name = "dummy.child2"
            version = (0, 1, 0)

        foo: float
            
except TypeError as e:
    print(e)  # show the message

<class '__main__.Child2'>: The type assigned to field 'foo'
    <class 'float'>
does not look like a valid subtype of the inherited type
    typing.Union[int, str]
from
    __main__.Parent (plugin: dummy.parent 0.1.0)

If you are ABSOLUTELY sure that this is a false alarm,
use the @overrides decorator to silence this error
and live forever with the burden of responsibility.



In `Child1`, we re-declared `foo` to accept `int` values, which is fine - the parent is declared as `Union[int, str]`, which means that it can accept either an `int` or a `str`. Now in `Child2` we tried to declare a field `foo` in our schema as a float, but the parent schema does not allow floats. The system can infer that something is wrong and will refuse to register this faulty schema.
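
To build some intuition for such a check, here is a deliberately simplified sketch that treats a type hint as the set of plain types it accepts and tests whether the child set is contained in the parent set. Both functions are assumptions for illustration only; the real check in Metador handles many more kinds of type hints.

```python
from typing import Optional, Union, get_args, get_origin


def allowed_types(tp):
    """Return the set of alternatives a (possibly Union) type hint accepts."""
    if get_origin(tp) is Union:
        return set(get_args(tp))
    return {tp}


def looks_like_subtype(child_tp, parent_tp):
    """Simplified rule: the child may only accept what the parent accepts."""
    return allowed_types(child_tp) <= allowed_types(parent_tp)


print(looks_like_subtype(int, Union[int, str]))    # like foo in Child1
print(looks_like_subtype(float, Union[int, str]))  # like foo in Child2
print(looks_like_subtype(float, Optional[float]))  # like bar made mandatory
```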

Always remember the fields the parent schema inherited itself. The following attempt will also fail:

In [12]:
try:  # we know it will go wrong

    @register_in_group(schemas)
    class Child3(Child1):

        class Plugin:
            name = "dummy.child3"
            version = (0, 1, 0)

        qux: float
            
except TypeError as e:
    print(e)  # show the message
    

<class '__main__.Child3'>: The type assigned to field 'qux'
    <class 'float'>
does not look like a valid subtype of the inherited type
    <class 'str'>
from
    __main__.Parent (plugin: dummy.parent 0.1.0)

If you are ABSOLUTELY sure that this is a false alarm,
use the @overrides decorator to silence this error
and live forever with the burden of responsibility.



`Child3` inherits from `Child1`, which does not define `qux`, but it does inherit `qux` from `Parent` unchanged. Remember that you can use `Fields` to inspect **all** the fields declared in a schema. This will also tell you where a field is actually "coming from":

In [13]:
print(Child1.Fields)

foo
	type: <class 'int'>
	origin: __main__.Child1 (plugin: dummy.child1 0.1.0)
	
new_field
	type: <class 'bool'>
	origin: __main__.Child1 (plugin: dummy.child1 0.1.0)
	
bar
	type: <class 'float'>
	origin: __main__.Child1 (plugin: dummy.child1 0.1.0)
	
qux
	type: <class 'str'>
	origin: __main__.Parent (plugin: dummy.parent 0.1.0)
	


### Python Typology

After seeing how schema and field inheritance work and learning how to declare a child schema correctly without violating parent compatibility, now you might wonder what kind of types you can use in your schemas in order to express as precisely as possible what values are supposed to go into which fields. The possibilities, in fact, are limitless and there are many ways to model the same requirement. To get you started and give you an idea, we will give a quick overview of the most common and most useful types, give some guidance on how and when to use them, and equally important - what to avoid.

#### Primitives, times, dates and pydantic built-in types

You can use the corresponding built-in Python types - you may simply use `bool`, `int`, `float` and `str` as annotations (however, you should rarely need them and will probably use *constrained types* which are discussed further below).

You can also use `date`, `datetime` and `time` from the `datetime` package.

The `pydantic` library, the beating heart of Metador schemas, also provides [many useful type hints](https://pydantic-docs.helpmanual.io/usage/types/#pydantic-types) you can use for various kinds of entities such as IP addresses, URLs, colors and more (for some things we advise against using the types pydantic provides, which again is discussed further below).

Furthermore, in `metador_core.schema.types` we provide a number of generally useful types for different purposes which are meant to be imported directly. Take a look at them and prefer reusing or extending them before you start defining your own types.

#### Metador schemas

You can always use any other existing Metador schemas when defining a field. In fact, **this should be your preferred and default go-to choice for modelling nested, complex metadata**. For example, we can use one of the schemas we defined above inside another schema and it will work as expected:

In [14]:
class SomeSchema(MetadataSchema):
    some_field: str
    nested_obj: Child1
        
print(SomeSchema(some_field="hello", nested_obj=Child1(foo=1, bar=3.14, qux="hi", new_field=True)).yaml())

nested_obj:
  bar: 3.14
  foo: 1
  new_field: true
  qux: hi
some_field: hello



#### Optional values

If you want to declare a field that is relevant, but might not always be available, then you can make it `Optional` (from the `typing` module), so omitting the value will not be an error (in this case, the field will automatically be set to `None`).

When you define a new field in a schema that you develop on your own, you should **try to stick with mandatory fields as much as possible** in your initial versions, and relax a field to `Optional` only when there is really no feasible way to provide the desired information consistently. This way you will break fewer potential child schemas that are based on your schema. If a child schema relies on a field being optional, but you suddenly make it mandatory, the child schema will also have to make it mandatory (otherwise it violates the parent rule discussed above by allowing a "missing value", `None`, that is not allowed in the parent, which is your schema). Also, obtaining missing information for a schema after-the-fact is harder or even impossible (e.g. you cannot re-do the scientific experiment), so it is better to err on the side of "strictness" up-front.

**Examples:** `Optional[int]`, `Optional[SIValue]`

#### Collections

To describe a collection of values or objects of the same kind, you can use `List` or `Set` from `typing`.

**Use `List` if you need to keep duplicate values or if the order of the elements matters, otherwise use a `Set`**. Many things where you might first instinctively use a `List` are *actually*, semantically `Set`s, so make sure that you choose one or the other consciously. This choice has actual practical consequences for the behavior of harvesters (that we will talk about in a different tutorial).

**Examples:** `List[float]`, `Set[AnyHttpUrl]`
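
If you are unsure which one you need, it helps to recall how the underlying Python containers behave. This small sketch uses made-up example values:

```python
# A list keeps duplicates and remembers the order of its elements.
measurements = [0.5, 0.7, 0.5]
list_len = len(measurements)
order_matters = [1, 2] != [2, 1]

# A set keeps each distinct value once and has no meaningful order.
keywords = {"steel", "fatigue", "steel"}
set_len = len(keywords)
order_ignored = {1, 2} == {2, 1}

print(list_len, order_matters, set_len, order_ignored)
```

Repeated measurements are a typical `List`, whereas keywords or tags are a typical `Set`.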


#### Literals and Enums

Sometimes you want a field to have a fixed value, or a value from a controlled list. The tools that you can use are `Literal`s (from `typing`) and `Enum`s (from `enum`). The rule of thumb is that you probably should use a simple `Literal` when there is just one or a handful of values that are permitted, but for a longer list you should define an [`Enum` class](https://docs.python.org/3/library/enum.html).

**Examples:** `Literal["a", "b", "c"]`, `Literal[0, 42]`, `Literal["always_this_value"]`, `SimAlgorithm`

with `SimAlgorithm` e.g. defined as:

```python
from enum import Enum

class SimAlgorithm(str, Enum):
    simple = "simple-simulation"
    fancy = "fancy-simulation"
    # ... other supported algorithms
```
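
The following sketch shows how such types behave in plain Python, independent of Metador. The `Mode` alias and the rejected value `"unknown-simulation"` are made up for illustration.

```python
from enum import Enum
from typing import Literal, get_args


class SimAlgorithm(str, Enum):
    simple = "simple-simulation"
    fancy = "fancy-simulation"


# A Literal type lists its permitted values directly in the type hint.
Mode = Literal["a", "b", "c"]
permitted = get_args(Mode)

# An Enum member can be looked up by its value ...
parsed = SimAlgorithm("fancy-simulation")

# ... and values outside the controlled list are rejected.
try:
    SimAlgorithm("unknown-simulation")
    rejected = False
except ValueError:
    rejected = True

print(permitted, parsed.name, rejected)
```

Deriving the enum from `str` as well means that each member also compares equal to its plain string value.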

#### Constrained types

Constrained types are variants of the primitive and collection types that we discussed above, where you want or need to restrict the range of permitted values. Examples are requirements such as:

* the number must be positive
* the value must be in $[-\pi, \pi)$
* the string must have a length between 100 and 1000 characters
* the list can have at most 7 elements

For constrained types, in Metador schemas we prefer using types based on the [phantom](https://phantom-types.readthedocs.io/en/0.7.1/pages/builtin-types.html) library. The advantage of using `phantom` types is that these work well with schema inheritance and the automatic checks that make sure that your types are compatible with the parent schema, whereas constrained types provided by `pydantic` do not have these nice properties.
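
To give a feeling for what a predicate-checked type is, here is a self-contained sketch using only the standard library. It is a simplified stand-in that mimics the general idea and is **not** the `phantom` API; the names `PredicateMeta`, `PositiveNumber` and `Angle` are invented for this example.

```python
import math


class PredicateMeta(type):
    """Metaclass that turns isinstance() into a runtime value check."""

    def __instancecheck__(cls, value):
        return isinstance(value, cls.bound) and cls.predicate(value)


class PositiveNumber(metaclass=PredicateMeta):
    """Numbers that are strictly greater than zero."""

    bound = (int, float)
    predicate = staticmethod(lambda v: v > 0)


class Angle(metaclass=PredicateMeta):
    """Floats in the half-open interval [-pi, pi)."""

    bound = float
    predicate = staticmethod(lambda v: -math.pi <= v < math.pi)


print(isinstance(5, PositiveNumber), isinstance(-1, PositiveNumber))
print(isinstance(-math.pi, Angle), isinstance(math.pi, Angle))
```

For actual schemas, use the ready-made types from `phantom` or `metador_core.schema.types` instead of rolling your own.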

#### Alternative choices

If you must accept values of multiple different types, you can use `Union` (from `typing`).

**We strongly advise against modelling your schema with `Union`s** if this can be avoided, because there are certain unintuitive subtleties that can lead to unexpected or unintended behavior. For example, due to how parsing of the metadata by `pydantic` works, when loading a metadata object, a value will always be of the first type (in the order as listed in the `Union`) for which the conversion succeeds. This means that **the order of listed types matters in `Union`s, so always list more specific types before more general types**.

Completely avoiding `Union`s could be difficult if your schema is implementing an existing standard where such alternatives (e.g. giving either a number or a string) are supported. So please just be especially careful with `Union`s and when testing your schema, make sure to pay special attention to how fields with `Union` types behave.
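
The "first successful conversion wins" behavior described above can be modelled in a few lines of plain Python. This is only a toy model of the described behavior, not pydantic's actual parsing code, and `coerce_first_match` is a hypothetical helper.

```python
def coerce_first_match(value, candidates):
    """Convert `value` to the first candidate type that accepts it."""
    for tp in candidates:
        try:
            return tp(value)
        except (TypeError, ValueError):
            continue
    raise ValueError(f"no candidate type accepts {value!r}")


# General type first: str() accepts almost anything, so int is never reached.
general_first = coerce_first_match(42, (str, int))

# Specific type first: numbers stay numbers, other strings fall through to str.
specific_first = coerce_first_match(42, (int, str))
fallback = coerce_first_match("abc", (int, str))

print(repr(general_first), repr(specific_first), repr(fallback))
```

With the general type listed first, the number silently becomes a string, which is exactly the kind of surprise to watch out for when testing fields with `Union` types.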


### What to avoid

**unconstrained numbers and strings:**

Avoid "plain" `int`, `float` or `str` in your schemas. We used them in the examples to keep it simple, but in actual use cases, there usually are restrictions that should apply - values you actually should exclude because they do not make sense in your context. Always try to formulate a constrained type and only fall back to these primitive types when there is no other solution.

One type you might want to use often is the `NonEmptyStr` type. If the string may be missing, you should always use `Optional[NonEmptyStr]` instead of `str`. The reason is that an empty string is still a string and is not considered a missing value, so if an empty string is not something you see as a relevant value, it must be excluded.

This kind of discipline might feel unfamiliar to you or look very "un-Pythonic", but you will see that it can prevent many avoidable mistakes and removes ambiguity. **Making sure that missing values are represented in an unambiguous way is crucial to make harvesters work correctly**. There already exists a unique, unambiguous value representing missing information in Python, which is `None`, and the way to state that information is missing is wrapping the type of value in `Optional` to allow `None`.
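
In practice this often means normalizing raw input before it reaches a schema. The helper below is a hypothetical sketch, not part of Metador; treating whitespace-only strings as missing is an assumption you may or may not want in your own context.

```python
from typing import Optional


def to_optional_non_empty(raw: Optional[str]) -> Optional[str]:
    """Map empty or whitespace-only strings to None (hypothetical helper)."""
    if raw is None:
        return None
    stripped = raw.strip()
    return stripped or None  # "" is falsy, so it becomes None


print(to_optional_non_empty(""))
print(to_optional_non_empty("   "))
print(to_optional_non_empty(" hello "))
```

After this step there is exactly one representation of "no value", which is `None`.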

<div class="alert alert-block alert-warning">
    <b>Always make sure to sufficiently constrain your values and allow only actually meaningful ones, make optionality explicit using <tt>Optional</tt>!</b>
</div>

**Tuple:**

There is a type hint `Tuple` in `typing` that can be used for tuples, but there are not many cases where a tuple would be the best solution. Instead of a tuple where each component is semantically different or has a different type, such as `Tuple[int, str, bool]`, you should write a schema instead and give those components a name.

In cases where you expect a sequence of elements of the same type, you usually want a `List`, possibly with a constraint specifying the minimal/maximal number of items.

One defensible case for a `Tuple` would be a vector in a space with fixed dimensions, such as a 2D or 3D vector. For such use cases, assuming that you provide documentation about the meaning (e.g. using `Tuple[float, float]` and documenting that it is supposed to be a `(x,y)` position), this would be fine. Even then, defining a schema with `x: float` and `y: float` could be the better choice, because the meaning is made explicit in the field name and does not rely on external documentation.

**Dict**:

There is a type hint `Dict` in `typing`, but you should basically never use it. **Do not use dicts** for modelling, unless you have no information about the possible keys and the types of permitted values for those keys, which is rarely the case. If you have something which looks like a `dict`, you should define another schema and use that instead. Remember that you can freely define schemas which are not registered as plugins and use them as "bricks" to build up your larger plugin schema.

**dataclasses and other dataclass-like things that are not Metador schemas**:

There are many libraries that superficially look like they work the same way as, or very similarly to, Metador schemas. In general, however, they will not be compatible, or not fully. So do not use plain `dataclass`es and do not use arbitrary pydantic `BaseModel`-based classes. When you consult pydantic documentation, keep in mind that instead of `BaseModel`, in Metador the top level class that is used is `MetadataSchema`.


**Pydantic constrained types:**

As mentioned above, avoid pydantic constrained types, i.e. types such as `PositiveInt`, `NegativeFloat`, and their general form based on the `conint`, `constr`, etc. helper functions in situations where you expect that some schemas extending your schema will want to override the fields.

<div class="alert alert-block alert-warning">
    When you do use pydantic type constraints, <b>you must use <tt>Annotated</tt></b>,
    as described <a href="https://pydantic-docs.helpmanual.io/usage/schema/#typingannotated-fields">here</a>.
</div>


### It's just a matter of semantics: Adding JSON-LD annotations to schemas

Each Metador schema represents a structural encoding for objects that belong to some abstract or real-world category we have in our mind - we know what it means, because we can read the documentation and look at the code if needed. To make this information visible to a machine, service or tool that uses your metadata but does not know anything about Metador, this "human level metadata" must be provided to give the metadata meaning - **semantics**. This does not only help machines, but also other people interested in your data and metadata.

In the context of Metador, we mean that a schema is **semantic** if it is aligned and compatible with existing linked-data standards centered around [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework), [OWL](https://en.wikipedia.org/wiki/Web_Ontology_Language) and [SPARQL](https://en.wikipedia.org/wiki/SPARQL).
We assume that you have at least a rough idea about the vision, purpose and existing tooling for the [semantic web](https://en.wikipedia.org/wiki/Semantic_Web), and you probably want to know how to connect your schemas to existing vocabularies and ontologies, so that your metadata is interoperable with existing tools, can be unambiguously interpreted, added into a knowledge graph, queried using SPARQL and profit from all the other nice features that the semantic web ecosystem provides.

Until now, all the schemas we defined were structured, provided validation of the metadata, but were lacking a formal semantic interpretation, unless they inherited from a schema that already was providing some semantics (at least covering the inherited fields). If your schema represents a type for which a semantic standard already exists, you can easily make your schema fully semantic.

This is done by attaching [JSON-LD](https://json-ld.org/) annotations to a schema. The consequence of doing this is that **every serialized object** (as JSON or YAML, etc) **will contain a `@context` and a `@type` field**. These additional fields are "tacked on" automatically to each metadata object. Consequently, if you feed *non-semantic* metadata (by hand, or using harvesters) into Metador schemas, you will obtain *semantic* metadata on the "output", meaning that tools and humans working with Metador containers will be able to make sense of your metadata with much less effort, given that the schemas you use are semantically "enriched" with the JSON-LD fields. For this to work, you have to do some preparations first.

#### Step 1: Understand the relevant object types in the standard(s) you use

A semantic standard such as an OWL ontology will define multiple kinds of entities, properties they can have and interrelationships. Each schema you define should ideally be a representation of one such entity, which is exactly what will be declared as the `@type` for your schema.

#### Step 2: Find or define your semantic context

**Semantics Beginner:** You will usually find a default context you can use in the documentation of your semantic metadata standard. It will typically be a URL (which points to the context object), such as `https://w3id.org/ro/crate/1.1/context`, or it can be as simple as just `https://schema.org`.

**Semantics Expert:**
If you are combining multiple standards or have another use-case for a custom context, you can use an arbitrary JSON-like object as a context. This means, your context can be a simple Python dict that defines the interpretations for your fields, e.g.:

In [15]:
# this is just a Python dict encoding the JSON-LD context:
my_context = {
    "name": "http://schema.org/name",
    "image": {
      "@id": "http://schema.org/image",
      "@type": "@id"
    },
    "homepage": {
      "@id": "http://schema.org/url",
      "@type": "@id"
    }
}

<div class="alert alert-block alert-warning">
    <b>Metador cannot perform semantic validation of JSON-LD annotations!</b><br /> You have to make sure that your context makes sense, using other external tools if necessary.
</div>

#### Step 3: Give schema fields names which are interpretable in your context

**Semantics Beginner:** The context is like a dictionary for looking up semantic interpretations, so your schema only makes semantic sense if the names you use have a definition. If you are using an existing standard, consult its documentation for the correct property names for various object types. 

**Semantics Expert:**
If you are using a custom context, you probably have a good understanding of this. One limitation you have to keep in mind is that you cannot use "qualified" names with a prefix, as is often done, because you cannot easily have a colon in a field name in Python. Your context therefore must fully define all the concrete field names you use. This means that you cannot call a schema field `foaf:name`, but you are free to use any valid JSON-LD, including this kind of abbreviation, within your context definition - as long as in the end all the actual field names used in the schema are declared without any namespace prefix.

<div class="alert alert-block alert-info">
    When designing a JSON-LD <tt>@context</tt>, remember that you must explicitly define the field names you use in your schema. Metador does not understand or process JSON-LD prefixes.
</div>
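
As a small illustration of this rule, the following sketch shows a custom context that uses a `foaf` prefix internally while still defining every schema field name as a plain term. The context contents and the helper `undefined_fields` are made up for this example and are not part of Metador.

```python
# Hypothetical context: the prefix "foaf" is declared and used inside the context,
# but the terms that correspond to schema fields ("name", "homepage") carry no prefix.
prefixed_context = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "name": "foaf:name",
    "homepage": {"@id": "foaf:homepage", "@type": "@id"},
}


def undefined_fields(field_names, context):
    """Return the schema field names that the context does not define as terms."""
    return [name for name in field_names if name not in context]


# field names as they would appear in a schema class
print(undefined_fields(["name", "homepage"], prefixed_context))  # []
print(undefined_fields(["name", "nick"], prefixed_context))      # ['nick']
```

A check like this only tells you that a term exists in the context. Whether the term's definition makes sense is a question for external JSON-LD tooling.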

### Adding JSON-LD annotations to schemas

Now assuming that you understand your semantic context and types, let us see how this can actually be implemented.

Everything you need for development of semantic schemas lives in the `metador_core.schema.ld` module.
Assuming that you have one context that you want to use for a whole collection of schemas, you should define a decorator to quickly attach both a `@context` and a `@type` to your schemas:

In [16]:
from metador_core.schema.ld import ld_type_decorator

my_context = "https://www.example.com/my/context"
my_semantic_type = ld_type_decorator(my_context)

If you intend to define a semantic schema and do not extend another already semantic schema, use `LDSchema` as the base class (instead of `MetadataSchema`) to distinguish it from non-semantic schemas. So your first semantic schema could look like this:

In [17]:
from metador_core.schema.ld import LDSchema
from pydantic.color import Color

@my_semantic_type("Animal")
class MySemanticSchema(LDSchema):
    furColor: Color

myAnimal = MySemanticSchema(id_="https://www.animalid.org/01234", furColor="#ff8000")

animalJson = myAnimal.json(indent=2)  # serialize it
print(animalJson)
sameAnimal = MySemanticSchema.parse_raw(animalJson)  # load it back
print("Loaded back same animal?", myAnimal == sameAnimal)

{
  "@id": "https://www.animalid.org/01234",
  "furColor": "#ff8000",
  "@context": "https://www.example.com/my/context",
  "@type": "Animal"
}
Loaded back same animal? True


Notice that we did not specify `@context` and `@type` for `myAnimal` - the schema knows them already and just "tacks them on" to each animal metadata object, and when it is stored (in a Metador container, JSON file or any other way), it will have the correct annotation. When loading a serialized animal with these annotations, the schema will also not complain as long as these are **exactly** the ones we attached to the schema (remember, Metador does not actually understand semantics!).

You see that we used `id_` even though we did not declare it. The `id_` field is automatically available to all semantic schemas derived from `LDSchema`, in order to set the JSON-LD `@id` of a semantic object. This is a property specific to each *instance* of a schema, a concrete metadata object, whereas the `@type` and `@context` are identical for all the instances. In Python/OOP jargon - `@context` and `@type` are class variables (with the special property of being constant values) and are attached using schema decorators, whereas `@id` is an actual instance variable specific to individual objects - just as all the fields you usually define in your schema. Naturally, in Metador schemas we call fields that behave like `@context` and `@type` simply **constant fields**.

**(Semantics Beginner) Q: I still don't get it, how exactly is the schema "better" now by adding these fields?**

**A:** Never forget that machines are really, really stupid. Using structured ways to organize the metadata (using Metador schemas, JSON, etc.) instead of using free-form natural language helps a technical system to understand structure, the "shape" of your metadata - which is an important step forward. Technical systems can do a lot with data and metadata without any understanding, because the required understanding is provided by humans - the software developers who understand the domain and metadata and write software which uses it. The advantage of adding such semantic "hints" might not be obvious, if you are able to understand the field names and read the corresponding documentation. But imagine a schema designed in a language you don't know - could you make sense of a schema like this?

In [18]:
class RuffleMeta(MetadataSchema):
    """A Ruffle is just a simple combination of the Quirzl phase and the Shpongle factor
    measured during the Xylic-Yzgel process at a fixed time step."""
    
    quirzl: str
    """Quirzl phase of the Ruffle."""
    
    shpongle: int
    """Shpongle factor of the Ruffle."""

Maybe this is something you are familiar with, but with weird names, or maybe this is a field of science you have never seen before - you have no chance to know either way. Semantic methods such as JSON-LD annotations solve this problem by connecting your schemas and their fields to a formalized system for knowledge representation - objects that refer to the same entity in an ontology are supposed to **mean** the same kind of thing, regardless of how the field, schema or object is named. This helps a human who does not understand your language or domain, and also helps a machine which is trying to process your data without knowing all of its context as well as you do.

### Advanced: Constant fields in general

Understanding the mechanism Metador uses to attach JSON-LD annotations, you might wonder if this can be used to add other information beyond `@context` and `@type`. The answer is yes. You can use the same machinery to attach values for other JSON-LD keywords that are schema-specific (i.e. equal for all objects of that schema). In fact, you can use it to attach arbitrary fields with the same properties, i.e.:

* constant fields are **not required** when creating or loading a metadata object
* if those fields already exist in the object that is loaded into a schema, they are **ignored**
* when serializing an object, the schema will attach the constant fields **as defined for the schema**

Constant fields are useful for enriching objects with additional information that is fully determined by their schema and which they should always "carry along with them" in settings where the metadata is used outside of the Metador ecosystem.
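
To make the three rules above more tangible, here is a minimal sketch in plain Python that mimics them with dictionaries. It is only an illustration of the behaviour described in the list, not how Metador implements constant fields; the names `CONSTANT_FIELDS`, `load` and `dump` are invented for this example.

```python
# Plain-Python illustration of the constant field rules (NOT Metador's implementation).
CONSTANT_FIELDS = {
    "@context": "https://www.example.com/my/context",
    "@type": "Animal",
}


def load(raw: dict) -> dict:
    """Constant fields are not required, and are ignored if they are present."""
    return {key: value for key, value in raw.items() if key not in CONSTANT_FIELDS}


def dump(obj: dict) -> dict:
    """On serialization, constant fields are attached as defined for the schema."""
    return {**obj, **CONSTANT_FIELDS}


without_constants = load({"furColor": "#ff8000"})
with_other_constants = load({"furColor": "#ff8000", "@type": "SomethingElse"})

print(without_constants == with_other_constants)  # True
print(dump(with_other_constants)["@type"])        # Animal
```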

### Documenting Schemas

Use Python docstrings for the schema class as well as for the fields, and document their meaning and purpose. This is the information others will consult when trying to use your schema, and it helps them decide whether your schema is useful for their purpose. You should include information that you technically encode into constraint types, if it is not absolutely obvious - you don't have to spell out that a `NonEmptyStr` is a nonempty string, but if `MySpecialParameter` is actually a constrained type representing a number range, this should also be explained in the documentation of the field that uses your custom constrained type. Furthermore, documentation should include all non-technical, human-level information about the intended context for the schema that helps others to use and interpret it correctly.

So a well-documented schema might actually contain more documentation text than actual "code", e.g. like this:

In [19]:
from typing import Literal
from phantom.interval import Closed

class FluffinessScore(float, Closed, low=0, high=10):
    ...


animalo_type = ld_type_decorator("https://www.example.com/animal-ontology")

@animalo_type("Animal")
class AnimalMeta(LDSchema):
    """Metadata for representing animals in the jungle pet discovery project."""
    
    voice: Literal["dog-like", "cat-like", "bird-like", "other"]
    """Voice category of the animal.
    
    We classify animals into:
    * dog-like (if they bark) 
    * cat-like (if they meow)
    * bird-like (if they chirp)
    * other
    """
    
    fluffiness: FluffinessScore
    """Fluffiness score for the animal, carefully estimated by petting it.
    
    The value is in the closed interval [0, 10], where the score is
    estimated on a linear scale assuming that
    * 0 means "it has no hair"
    * 10 means "absolute fluffball"
    """

# let's describe an animal!
print(AnimalMeta(id_="https://petid.org/874", voice="dog-like", fluffiness=5).yaml())

'@context': https://www.example.com/animal-ontology
'@id': https://petid.org/874
'@type': Animal
fluffiness: 5
voice: dog-like



### At last: A wonderful niche schema

Now let's use what we learned to properly define our custom image file format, as promised in the beginning:

In [20]:
from metador_core.schema.types import MimeTypeStr

class NicheMimetype(MimeTypeStr, pattern=r"image/niche"):
    """The MIME type of .niche image files."""

# we extend the semantic context of ROCrate (ImageFile is based on ROCrate), as described here: 
# https://www.researchobject.org/ro-crate/1.1/appendix/jsonld.html#extending-ro-crate
my_context = ["https://w3id.org/ro/crate/1.1/context", {
    "animalMeta": "https://www.example.com/animal-ontology/Animal"
}]

ext_rocrate_type = ld_type_decorator(my_context)

@register_in_group(schemas)
@ext_rocrate_type("File")
class NicheImage(ImageFile):
    """Schema for the .niche image format.
    
    The format achieves improved compression for images of animals,
    given some information about the depicted animal.
    """
    
    class Plugin:
        name = "dummy.imagefile.niche"
        version = (0, 1, 0)
        parent_schema = ImageFile.Plugin.ref(version=(0, 1, 0))
    
    # we constrain the allowed MIME type field from core.imagefile
    encodingFormat: NicheMimetype
    """The MIME type of .niche image files, must be 'image/niche'."""
    
    # we add a new field with the new relevant information for our format,
    # which we conveniently already have defined earlier
    animalMeta: AnimalMeta

Notebook: Plugin 'dummy.imagefile.niche' registered in 'schema' group!


Now let's take our new schema for a ride and create a metadata object with information about an image file encoded in our `.niche` format:

In [21]:
img_meta = NicheImage(
    # some dummy values (that a harvester would usually get for you):
    filename="someimage.niche",
    sha256="abc",
    contentSize=123,
    height=100, width=200,
    # now our custom added fields:
    encodingFormat="image/niche",
    animalMeta=AnimalMeta(voice="cat-like", fluffiness=3)
)
print(img_meta.yaml())

'@context':
- https://w3id.org/ro/crate/1.1/context
- animalMeta: https://www.example.com/animal-ontology/Animal
'@type': File
animalMeta:
  '@context': https://www.example.com/animal-ontology
  '@type': Animal
  fluffiness: 3
  voice: cat-like
contentSize: 123
encodingFormat: image/niche
filename: someimage.niche
height:
  '@context': https://schema.org
  '@type': QuantitativeValue
  unitText: px
  value: 100
sha256: abc
width:
  '@context': https://schema.org
  '@type': QuantitativeValue
  unitText: px
  value: 200



You can see how metadata objects created with our new schema are reusable and interoperable, both within the Metador tool ecosystem and beyond, due to our use of schema inheritance and addition of JSON-LD annotations.

### Versioning Schemas

When you write the first version of a schema, you have a lot of freedom in how you want to design it. But once others start using it, you have the responsibility to be careful with the changes you make, to avoid changes that will break child schemas based on yours, and in any case to make sure that the severity of changes is reflected in the version of your schema plugin - which has to follow strict [semantic versioning](https://semver.org).

If your schema plugin $X$ had version `(MAJOR, MINOR, PATCH)` and you changed it directly or indirectly, resulting in an updated schema $X'$, you have to update the version of your schema as well.

A non-exhaustive list of relevant changes includes:

* Adding, changing or removing fields of $X$
* Adding, changing or removing schema decorators that affect fields of $X$ (such as the LD annotations)
* Making changes that affect the inheritance chain of $X$
* Making any of the above changes to a schema that $X$ depends on (e.g. nested schemas)
* Updating the required version of any plugin that $X$ depends on (e.g. ones you reuse from others)

**It is important to understand the effect your changes have on the ability to process metadata that was already created with the previous version**. Some changes do not require any action, but others do. Changes that "break" things should be as rare as possible, but of course are sometimes unavoidable. Breaking changes are always annoying, but not necessarily a horrible experience - if managed well, they just require some extra work which is usually straightforward. Good management of breaking changes is the reason why versioning discipline is important. This includes updating the semantic versioning triple to reflect the change severity (so that machines can see whether a schema and some metadata are compatible), and communicating the changes by other means (to inform users and provide ways to upgrade their existing metadata to the new version of the schema).

Some changes are **backward-compatible**, meaning that your schema can be put in place of the older version and nothing will break - every metadata object a previous version of your schema created still must be valid for the new version.

Some changes are even **forward-compatible**, meaning that older versions of your schema will work with metadata objects created by your newer version.

Let $v(S)$ denote the set of metadata objects that are valid according to schema $S$. Then the rules for **version bumping** are as follows (*bumping* the version here means incrementing the corresponding component and resetting the less important components back to $0$):

* If $v(X') = v(X)$, you have to bump `PATCH`.

* If $v(X') \supset v(X)$, you have to bump `MINOR`.

* If $v(X') \subset v(X)$, you have to bump `MAJOR`.
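
The rules above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: real sets of valid objects are usually infinite, so the example uses tiny finite sets as stand-ins, and it treats every change that rejects a previously valid object as `MAJOR`. The function names `required_component` and `bump` are invented for this example and are not part of Metador.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def required_component(old_valid: set, new_valid: set) -> str:
    """Pick the version component to bump by comparing v(X) with v(X')."""
    if new_valid == old_valid:
        return "PATCH"  # exactly the same objects are valid
    if new_valid > old_valid:
        return "MINOR"  # proper superset: everything old is still valid
    return "MAJOR"      # some previously valid object is now rejected


def bump(version: Version, component: str) -> Version:
    """Increment one component and reset the less important ones to 0."""
    major, minor, patch = version
    if component == "MAJOR":
        return (major + 1, 0, 0)
    if component == "MINOR":
        return (major, minor + 1, 0)
    if component == "PATCH":
        return (major, minor, patch + 1)
    raise ValueError(f"unknown component: {component!r}")


# toy example: the new schema version additionally accepts the value 3
old_valid, new_valid = {1, 2}, {1, 2, 3}
component = required_component(old_valid, new_valid)
print(component, bump((0, 1, 3), component))  # MINOR (0, 2, 0)
```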


<div class="alert alert-block alert-warning">
    <b>Design your schemas to be initially as "strict" as possible!</b><br />
    You can always loosen up your requirements in future versions later without breaking existing metadata, but not the other way round.
</div>


This is quite abstract, so here are a few concrete examples:

* Any new field added in $X'$ or constrained more than before makes $v(X') \subset v(X)$

This is also true if the new field is optional - because if the field is provided, it is validated and **must have the correct type**, the schema cannot just "ignore" values if they are wrong. Think of the case that someone extended your schema and added a field with the same name to it which has an incompatible type. Your new field then breaks "parent compatibility" for that schema - a breaking change.

* If your changes strictly increase the set of parsable instances, that is, your new version can parse all older metadata of the same `MAJOR`, you may increment only `MINOR` (resetting `PATCH` to 0).

* If your changes could make some older metadata invalid, you must increment `MAJOR` (resetting `MINOR` and `PATCH` to 0).

* If you add, remove or change the name of a parent schema, you must increment `MAJOR`.

* If you change the version in the `parent_schema` to a version with a higher component (`MAJOR`, `MINOR` or `PATCH`), the same component of your schema version must be incremented as well.
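
The last rule, about changing the parent schema version, can be sketched as a small helper that reports which component of the parent version went up. The helper `parent_bump_component` is hypothetical and only illustrates the rule; it is not part of Metador.

```python
from typing import Optional, Tuple

Version = Tuple[int, int, int]
COMPONENTS = ("MAJOR", "MINOR", "PATCH")


def parent_bump_component(old_parent: Version, new_parent: Version) -> Optional[str]:
    """Return the most significant component in which the parent version increased."""
    if new_parent < old_parent:
        raise ValueError("the parent version was lowered, which this sketch does not cover")
    for name, old, new in zip(COMPONENTS, old_parent, new_parent):
        if new != old:
            return name
    return None  # parent version unchanged, nothing to bump for this reason


print(parent_bump_component((0, 1, 0), (0, 1, 2)))  # PATCH
print(parent_bump_component((0, 1, 0), (0, 2, 0)))  # MINOR
print(parent_bump_component((0, 1, 5), (1, 0, 0)))  # MAJOR
```

Your own schema version then needs at least the same kind of bump as the one reported for the parent.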


### Testing Schemas

Writing a schema is only one half of the job, though. In order to make sure that everything works correctly, a schema must be properly tested. Does the schema accept only the values in fields that are supposed to be accepted, and reject values that do not make sense? Does the parent compatibility actually work, if we try it for concrete instances? All these questions and our expectations about how the schema behaves must be codified into a proper set of tests. Especially as a schema can be developed further over time, a test suite will help to detect actual mistakes as well as accidental "breaking changes" you did not consider.
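
As a rough, framework-independent sketch of what such tests can look like, the following example checks valid instances, boundary values and invalid instances. To keep it runnable without any dependencies, `validate_animal` is a hypothetical plain-Python stand-in for the animal example from this tutorial; with a real schema you would instantiate the schema class instead and expect a validation error for invalid input.

```python
ALLOWED_VOICES = {"dog-like", "cat-like", "bird-like", "other"}


def validate_animal(voice, fluffiness):
    """Stand-in for schema validation: return the data or raise ValueError."""
    if voice not in ALLOWED_VOICES:
        raise ValueError(f"invalid voice: {voice!r}")
    if not 0 <= fluffiness <= 10:
        raise ValueError(f"fluffiness out of range: {fluffiness!r}")
    return {"voice": voice, "fluffiness": fluffiness}


def is_rejected(**fields):
    """Return True if validation fails for the given field values."""
    try:
        validate_animal(**fields)
    except ValueError:
        return True
    return False


def test_valid_instances():
    # both ends of the closed interval [0, 10] must be accepted
    for score in (0, 5, 10):
        assert validate_animal("dog-like", score)["fluffiness"] == score


def test_invalid_instances():
    # values just outside the interval and unknown categories must be rejected
    assert is_rejected(voice="dog-like", fluffiness=-0.1)
    assert is_rejected(voice="dog-like", fluffiness=10.1)
    assert is_rejected(voice="fish-like", fluffiness=5)


for test in (test_valid_instances, test_invalid_instances):
    test()
print("all checks passed")
```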

<div class="alert alert-block alert-info"> <b>TODO</b>: Write when schema testing framework is in place</div>

### Advanced: Custom Parsers

<div class="alert alert-block alert-info"> <b>TODO</b>: Write here or in another, separate tutorial</div>


This was a long-winded tutorial. If you are here - congratulations! The good news is that now you have a deep understanding of schemas - the most important entities in Metador. You will see that most other plugin types are actually simpler! You really earned a break, see you next time!

### Summary

#### Schema inheritance

* A schema can have one parent schema that it extends or specializes
* Python class inheritance and Metador schema inheritance are deeply connected, but conceptually different
* Not every defined schema must be a registered plugin

#### Type hints

* Use default Python types when you have no special requirements
* Use `Literal` types and `Enum` classes for discrete, fixed controlled lists of allowed values
* Use classes from the `datetime` package for times and dates
* Use classes from the `phantom` package for constrained types such as number ranges
* Use default `pydantic` types for things like URLs
* Use `Optional` when a field is not mandatory, but prefer mandatory fields
* Avoid `Union` unless you really need and understand it
* Avoid `Tuple` unless the meaning of the components is rather trivial
* Avoid `Dict` unless you have no idea what it can contain

#### Semantics

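* Use `LDSchema` as the base class for semantic schemas that do not extend an already semantic schema
* Use `ld_type_decorator` to attach a `@context` and a `@type` to your schemas
* Make sure that all field names are defined in your context, without namespace prefixes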

#### Testing

* Write tests for your schemas
* Test both for valid instances as well as invalid instances
* Take special care to cover boundary cases when using constrained types
* Take special care to cover fields that have advanced normalization
* Take special care to cover `Union` fields