Create JSON schema for MIT TIMDEX records #111

jonavellecuerdo · 2024-01-05T16:26:47Z

Purpose and background context

Create a JSON schema that can be used to validate generated MIT TIMDEX records.

Here are a few things I learned about structuring the JSON schemas:

How to represent custom classes: Initially, I struggled to understand how to represent custom classes. For instance, how do we represent the following in a JSON schema:

alternate_titles: Optional[list[AlternateTitle]] = field(
     default=None, validator=optional(list_of(AlternateTitle))
 )

Eventually, I realized that we just have to represent these custom classes--like AlternateTitle--as instances of the JSON object type, which translates to a Python dictionary. Therefore, the example above can be represented in a JSON schema as:

"alternate_titles": {
         "type": "array",
         "items": {
             "type": "object",
             "properties": {
                 "value": {
                     "type": "string"
                 },
                 "kind": {
                     "type": "string"
                 }
             },
             "required": [
                 "value"
             ]
         }

Referencing $defs to simplify JSON schemas: While creating the JSON schema, I noticed I was repeating several lines of code just to define a "list of strings". This is when JSON schema's $defs functionality can come in handy! There may be additional repeated steps we can capture into a $def 🤔

How can a reviewer manually see the effects of these changes?

To perform some quick, high-level tests on the JSON schema, I created a script in tests/test_jsonschema.py. If you look at this test, I essentially copied the schema from the JSON document (transmogrifier/schemas/mit-schema-timdex.json) into this module. To see the JSON schema validation in action, you can run:

pipenv run coverage run --source=transmogrifier -m pytest -vv tests/test_jsonschema.py

Note: I'm sure this is a temporary module and will be improved when figure out how to implement JSON schema validation effectively into transmogrifier!

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Include links to Jira Software and/or Jira Service Management tickets here.

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: * The JSON schema can be used to validate generated TIMDEX records, while also serving as a detailed description of the TIMDEX data model. How this addresses that need: * Side effects of this change: * None Relevant ticket(s): * TBD

jonavellecuerdo · 2024-01-11T17:47:43Z

This comment documents some the feedback from @ghukill and @ehanson8 on this spike work:

[GH]: First and foremost, I think this is really neat, and can appreciate it's an excellent way to familiarize with the Transmog project and the data model in general. But before a proper PR, had a couple thoughts.
While I think ultimately we may find that a JSONSchema validation approach -- like GeoHarvester -- is more expressive and portable than the attrs classes, that feels like a distinct conversation if we would want to update Transmog in that context.
What I think might have more immediate value would be a JSONSchema that validates JSON documents that we might attempt to index into OpenSearch. And, with that, comes complexity. Thinking of the field types geo_point and geo_shape specifically. How can we write JSONSchema for those field types? what kind of effort would that be?
If we had a JSONSchema of an OpenSearch document -- keeping in mind that it's extremely similar to the TIMDEX data model, but not identical -- than we could validate documents on the way out of Transmogrifier, after they are transformed, to ensure they would successfully index into OpenSearch. As mentioned, that would have caught a few bugs that are only known once you actually attempt the indexing via TIM.
So in summary, I might propose waiting to implement a "TIMDEX data model" JSONSchema, but keep this pattern in mind in the event we want to implement an "OpenSearch data model" JSONSchema. And if we go that route, we'd probably want some plans on how to keep it up-to-date. For example, if the mapping lives in TIM, maybe the JSONSchema should also live there? However, I think TIM bulk indexes, so it doesn't lend itself to per-record validation, therefore making Transmog the place to implement that check. This all boils down to: where does the JSONSchema live, how is created and maintained, and where is it applied?
I feel confident it has a place, but I'd be wary of jumping in too quickly until we've had some discussions on bang-for-the-buck kind of thing.

[EH]: I agree that OpenSearch is the more useful target for JSONSchema. I lean towards doing this in transmogrifier since it’s purpose is to get records ready for OpenSearch. Also, we keep hitting this tension between TIMDEX’s data model with the OpenSearch data model and the minor differences which need to be fixed to avoid OpenSearch ingest errors. Not sure if there’s anything to be done but it point to OpenSearch as the ultimate validation target.

[GH] Agreed about the tension. If and when we scope out this work, I'd propose that we classify the kinds of bugs it would catch, or improvements it would provide. For example:
contract errors: the JSONSchema would have caught that Transmog's data model had locations.geodata while OpenSearch mapping had locations.geopoint
values errors: this summer I passed along a date string in a format OpenSearch didn't like (timezone related), but this could be readily caught

Given the information above, I will close this PR and refocus this spike to creating a JSON schema for the OpenSearch data model for TIMDEX records. 🤔

jonavellecuerdo requested review from ghukill and ehanson8 January 5, 2024 16:26

jonavellecuerdo force-pushed the mit-timdex-json-schema branch 2 times, most recently from 4b1d561 to 6aae5dd Compare January 5, 2024 16:36

jonavellecuerdo force-pushed the mit-timdex-json-schema branch from 6aae5dd to 6d150be Compare January 5, 2024 16:36

jonavellecuerdo closed this Jan 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create JSON schema for MIT TIMDEX records #111

Create JSON schema for MIT TIMDEX records #111

jonavellecuerdo commented Jan 5, 2024

jonavellecuerdo commented Jan 11, 2024 •

edited

Loading

Create JSON schema for MIT TIMDEX records #111

Create JSON schema for MIT TIMDEX records #111

Conversation

jonavellecuerdo commented Jan 5, 2024

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

jonavellecuerdo commented Jan 11, 2024 • edited Loading

jonavellecuerdo commented Jan 11, 2024 •

edited

Loading