This is a local version of Schema Overseer, intended for use in a single repository.
For the multi-repository service, see schema-overseer-service.

Schema Overseer ensures strict adherence to defined data formats and raises an exception when attempting to process input with an unsupported schema.
In more technical terms, it is an adapter between inputs with different schemas and the other application components.
- Data formats evolve over time
- Developers need to simultaneously support both legacy and new data formats
- Mismatches between input data format and the corresponding code can lead to unexpected and hard-to-debug runtime errors
- As the number of supported data formats increases, application code often becomes less maintainable
- Straightforward extensibility
- Static analysis checks via type checking
- Detailed runtime checks
- Incoming data validation with pydantic
```shell
pip install schema-overseer-local
```
- Create a file `adapter.py` to define the adapter logic. For the quick start we will use a single file, but in a real application it is better to use multiple files.

- **Output.** Define the output schema you plan to work with. The output schema can be any object; for this tutorial we will use a `dataclass`. The output schema attributes can be any Python objects, including non-serializable ones. The output can have the same behavior as the original input object, or a completely different one. Here is an example with different behavior:

  ```python
  @dataclass
  class Output:
      value: int
      function: Callable
  ```
- **Registry.** Create the `SchemaRegistry` instance for `Output`:

  ```python
  schema_registry = SchemaRegistry(Output)
  ```
- **Input schemas.** Define the input schemas using pydantic and register them in `schema_registry`:

  ```python
  @schema_registry.add_schema
  class OldInputFormat(BaseModel):
      value: str

  @schema_registry.add_schema
  class NewInputFormat(BaseModel):
      renamed_value: int
  ```
- **Builders.** Implement functions to convert each registered input to `Output`. Builders require type hints to link input formats and `Output`:

  ```python
  @schema_registry.add_builder
  def old_builder(data: OldInputFormat) -> Output:
      return Output(
          value=int(data.value),  # OldInputFormat.value is a str, so convert it
          function=my_function,
      )

  @schema_registry.add_builder
  def new_builder(data: NewInputFormat) -> Output:
      return Output(
          value=data.renamed_value,
          function=my_other_function,
      )
  ```
- Finally, use `schema_registry` inside the application to get the validated output or handle the exception:

  ```python
  schema_registry.setup()  # see the "Discovery" chapter in the documentation

  def my_service(raw_data: dict[str, Any]):
      try:
          output = schema_registry.build(source_dict=raw_data)  # build the output object
      except BuildError as error:
          raise MyApplicationError() from error  # handle the exception

      # use the output object
      output.function()
      return output.value
  ```
The full quickstart example is here. Run it:

```shell
git clone git@github.com:Schema-Overseer/schema-overseer-local.git
cd schema-overseer-local
poetry install
poetry run python -m tutorial.quickstart.app
```
While you can define the registry, models, and builders in one or two files, it is usually better to split them into separate files, i.e. Python modules.
There are different ways to organize the file structure; we recommend one of the following:
- Minimal — start with this one while you are still figuring out the best way to work
- Expanded builders — useful when there is a lot of code for each builder
- Detached output — useful when the output is a big or complex entity
Models (i.e., input data formats) are decoupled first for two reasons:

- If models contain inner models, it becomes harder to distinguish between the inner models of different root models (see the FAQ).
- If you transition to schema-overseer-service, the models are sourced from outside your code, so this split will come naturally.
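As an illustration, a minimal layout along these lines might look as follows (the exact file names here are hypothetical; only the models/builders split mirrors the discovery paths used later in this document):

```
example_project/
└── payload/
    ├── __init__.py
    ├── models/          # input schemas, one module per format (loaded as a package)
    │   ├── __init__.py
    │   ├── old_format.py
    │   └── new_format.py
    ├── builders.py      # builders (loaded as a module)
    └── registry.py      # SchemaRegistry definition and Output
```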
Note: Python will not load modules automatically unless they are explicitly imported. `SchemaRegistry` has a `discovery_paths: Sequence[str]` argument to load all required modules. The specified modules and packages are loaded at `SchemaRegistry.setup()`.

Definition (`SchemaRegistry(...)`) is decoupled from loading (`SchemaRegistry.setup()`) to prevent cyclic imports; that is why calling `setup()` is required.

The `discovery_paths` argument takes a sequence of strings in absolute import format. Entries can be either Python modules (single files) or Python packages (folders with `__init__.py` and other `*.py` files inside).
For example, this will work for the minimal option mentioned above:

```python
schema_registry = SchemaRegistry(
    Output,
    discovery_paths=[
        'example_project.payload.models',  # loaded as a package
        'example_project.payload.builders',  # loaded as a module
    ],
)
```
In addition to static type hint checks, `schema-overseer-local` performs runtime checks to ensure that:

- Each registered model has exactly one corresponding builder.
- Every builder has a proper call signature: one argument for the input data and no additional non-default arguments.
- Every builder has proper type hints.

Additional optional runtime checks:

- If `validate_output=True` is set (the default is `False`), pydantic verifies that each builder returns an object of the annotated type.
- By default, `schema-overseer-local` selects the builder for the first valid schema. If `check_for_single_valid_schema=True` is enabled, it ensures that only one schema is valid for the input data; if multiple valid schemas are found, a `MultipleValidSchemasError` is raised.
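To make the signature and type-hint checks concrete, here is a rough sketch of how such a runtime check could be implemented with the standard library (an illustration under my own assumptions, not the library's actual code):

```python
import inspect
from typing import get_type_hints

def check_builder(builder, output_type):
    """Illustrative runtime check: one non-default argument, full type hints."""
    signature = inspect.signature(builder)
    required = [
        p for p in signature.parameters.values()
        if p.default is inspect.Parameter.empty
    ]
    if len(required) != 1:
        raise TypeError('builder must take exactly one non-default argument')

    hints = get_type_hints(builder)
    if hints.get('return') is not output_type:
        raise TypeError('builder must be annotated to return the output type')

    input_schema = hints.get(required[0].name)
    if input_schema is None:
        raise TypeError('the builder argument must be type-hinted')
    return input_schema  # the input format this builder handles
```

A registry can run such a check for every builder at `setup()` time and map each input schema to its builder from the returned hint.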
The `SchemaRegistry.build()` method operates in two modes:

- Dict-like objects as input, with fields extracted via `__getitem__` — use `build(source_dict=...)` for this option.
- Objects with data as attributes, with fields extracted via `getattr` — use `build(source_object=...)` for this option.

The `source_dict` and `source_object` arguments are mutually exclusive.
TODO
A: It depends on how many different formats you need to support. With only a few formats, `schema-overseer-local` would indeed be overhead. But in projects with lots of different formats, such an extensive adapter layer can be helpful. Another goal of `schema-overseer-local` is to serve as a fast and simple introduction to `schema-overseer-service`, which targets sophisticated use cases with multiple teams and repositories.
A: `schema-overseer-local` has three important benefits:

- it provides type checking;
- it has very detailed runtime checks;
- and it is easily extensible.
A: `schema-overseer-local` uses the same pattern as `pydantic` and `FastAPI` for input and output validation, both at runtime and in static analysis. It provides an extra layer of defense against coding errors: even if your code is not entirely correctly typed or is not checked with static analysis tools like mypy, the data is still validated.
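For instance, a registered pydantic schema rejects malformed input on its own, before any application code runs (a standalone pydantic illustration, separate from the registry):

```python
from pydantic import BaseModel, ValidationError

class NewInputFormat(BaseModel):
    renamed_value: int

# Lax coercion: the numeric string '7' is accepted and converted to int
print(NewInputFormat(renamed_value='7').renamed_value)  # prints 7

try:
    NewInputFormat(renamed_value='not a number')
except ValidationError:
    print('rejected before any application code runs')
```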
Code example:

```python
class InnerModel(BaseModel):
    value: int

class InputFormatV1(BaseModel):
    inner: InnerModel

class InnerModelV2(BaseModel):
    value: int

class InputFormatV2(BaseModel):
    inner: InnerModelV2  # or re-use InnerModel?
```
A: Not really. While it might be tempting to adhere to the DRY principle here, it is generally better to fully separate nested pydantic models into distinct modules and avoid reusing them, even if they are identical.
The primary rationale is future maintainability: tracking modifications in reused models can be challenging, and the introduction of a new format version could require changes to the inner model, which would then demand separation regardless.