
Internalization of external sources that don't define Class or MimeType equivalents #51

Closed
cutz opened this issue Dec 12, 2017 · 10 comments

@cutz

cutz commented Dec 12, 2017

What's the proper way of going about using the internalization mechanisms, in particular object factory resolution, for data that comes from third-party APIs that don't provide any explicit declaration of type like Class or MimeType on nested objects? This seems like a common case when working with data from non-NTI sources. For example, an API that returns a list of orders structured something like this:

{
    "orders": [
        {"foo": "bar"},
        {"foo": "baz"}
    ]
}

There may not be properties that explicitly identify the nested objects in the orders field as Order objects; however, I as the programmer know that the contract of the API defines the contents of orders to be Order objects. Obviously this is a recursive problem: each Order may have a caption field that we want to resolve to a Caption object.

Is the solution to recursively preprocess the raw JSON dict to add MimeType to any nested objects? That feels nasty. It seems like there could be a solution where the nested factory lookups take into account the containing object being parsed as well as the field being parsed, i.e. use this factory when parsing this field on this type of object.
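
For concreteness, such a preprocessing pass might look something like this (the mime types and the field-to-type mapping are purely illustrative):

from nti.externalization.interfaces import StandardExternalFields

# Illustrative only: map a field name to the mime type we "know" its
# contents should carry, then walk the raw JSON and tag nested objects.
FIELD_MIMETYPES = {
    'orders': 'application/vnd.example.order',
    'caption': 'application/vnd.example.caption',
}

def annotate_mimetypes(data):
    if isinstance(data, dict):
        for key, value in data.items():
            mime = FIELD_MIMETYPES.get(key)
            if mime is not None:
                targets = value if isinstance(value, list) else [value]
                for target in targets:
                    if isinstance(target, dict):
                        target[StandardExternalFields.MIMETYPE] = mime
            annotate_mimetypes(value)
    elif isinstance(data, list):
        for item in data:
            annotate_mimetypes(item)
    return data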

I think I could almost make that work using the deprecated functionality in Issue 29, but seeing as that is going away, that obviously wouldn't be a good approach. This does seem like a general problem I've encountered when working with a number of external sources. What are the thoughts on how to handle something like this?

@cutz cutz added the question label Dec 12, 2017
@papachoco
Contributor

FWIW, I was thinking about using JQ to add the mimeTypes to the higher elements and then registering a mimeType factory to handle the JSON inside, e.g.

<adapter factory=".datastructures.CourseOverViewGroupFactory"
         for="nti.base.interfaces.IDict"
         provides="nti.externalization.interfaces.IMimeObjectFactory"
         name="application/vnd.nextthought.nticourseoverviewgroup" />

@papachoco
Contributor

nti.contenttypes.presentation/src/nti/contenttypes/presentation/datastructures.py

@jamadden
Member

What are the thoughts on how to handle something like this?

My first thought is that this is exactly the sort of problem this library was explicitly not intended to solve. This library deals with defining a serialization format, meaning that it can write objects out and read them back in. It was not intended to deal with arbitrary incoming data that it had no control over.

There are libraries that are intended to do that, in one way or another. They all make various tradeoffs among security, robustness, and degree of developer input required. And they're all different and use their own conventions and schema languages, so they're not necessarily a clean fit in a zope.schema/interface/component world.

Still, there are ways to handle that sort of thing in the current structure. The easiest and safest way is to apply the appropriate metadata ahead of time. Something like PyQuery could make that relatively simple. Depending on the exact object structure, custom internalizers can be written to handle that without too much effort (I think we have code for Note objects or Canvas objects or something along those lines that does this; I think that's what _ext_excluded_in is for).

Ultimately I can imagine making it easier to do, potentially based on context, as you suggested, especially if we continue to embrace zope.schema (I can see field objects being consulted for more control over their contents).

But generalizing this project beyond dealing with the formats it creates and produces would be a direction change we would explicitly need to decide to take.

@papachoco
Contributor

There is also ..

simplejson.load(fp, encoding='utf-8', cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, use_decimal=None, **kw)

object_hook is an optional function that will be called with the result of any object literal decode (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders (e.g. JSON-RPC class hinting).

object_pairs_hook is an optional function that will be called with the result of any object literal decode with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority.

@cutz
Author

cutz commented Dec 12, 2017

Fair enough. A preprocessing approach is workable unless and until we want to take this in another direction. What is the proper protocol for these question-related issues now? Do I close this?

@cutz cutz closed this as completed Mar 20, 2018
@cutz cutz added the wontfix label Mar 20, 2018
@cutz cutz removed the wontfix label Jul 3, 2018
@cutz cutz reopened this Jul 3, 2018
@jamadden
Member

jamadden commented Jul 3, 2018

We have several more examples internally where this is useful now, so let's take another look. I understand there are a variety of approaches in use; for something to land here, I would really like it to be declarative and based on the schema and its fields. This should not only be simpler to use and reuse, but we also have an architecture for caching based on schemas and fields that could be helpful for speed. @cutz, can you give a summary of the various current options?

@cutz
Author

cutz commented Jul 3, 2018

This has come up in a couple of new places internally. The use of zope.interface and zope.schema is prolific, and this library has the flexibility and power to make externalization to a variety of formats convenient (the serialization format this library produces by default, but also more custom formats using named externalizers where appropriate). What is still lacking is the ability to go the other direction: consuming JSON from an external source, or from multiple sources. Unfortunately, the library lacks the necessary flexibility when finding factories for nested external data.

A couple of approaches have been used to get around this. In the internal nti.orgsync project, a technique similar to what @papachoco mentioned above is used: the initial raw external data is annotated with a mime type. This can happen in a couple of ways. One such example is using an object_hook at parse time:

import simplejson
from collections import Iterable, Mapping  # collections.abc on Python 3

from nti.externalization.interfaces import StandardExternalFields

# Account and parse_account_from_external are application-level objects/helpers.

def account_object_hook(data):
    # Tag anything that looks like an account with our mime type so the
    # normal factory lookup can find it later.
    if isinstance(data, Mapping):
        if 'id' in data and 'username' in data:
            data[StandardExternalFields.MIMETYPE] = Account.mimeType
    return data

def read_accounts_source(source, encoding="utf-8"):
    ext_obj = simplejson.load(source, encoding=encoding,
                              object_hook=account_object_hook)
    assert isinstance(ext_obj, Iterable)
    result = []
    for data in ext_obj:
        group = parse_account_from_external(data)
        if group is not None:
            result.append(group)
    return result

Then a custom mime factory is used (registered for the mime type injected above) to mutate the JSON structure so that nested objects parse:

from zope.interface.common.idatetime import IDateTime

from nti.externalization.interfaces import StandardExternalFields

ID = StandardExternalFields.ID
MIMETYPE = StandardExternalFields.MIMETYPE

# Account, ProfileResponse, Element, and Classification are application model classes.

def AccountFactory(ext_obj):
    # Note the side effects: the incoming external dict is mutated so that
    # the nested objects can be found by the normal factory machinery.
    ext_obj[ID] = ext_obj.get('id')
    last_login = ext_obj.get('last_login')
    if last_login:
        ext_obj['last_login'] = IDateTime(last_login)
    responses = ext_obj.get('profile_responses')
    for response in responses or ():
        response[MIMETYPE] = ProfileResponse.mimeType
        element = response.get('element')
        if element:
            element[ID] = element.get('id')
            element[MIMETYPE] = Element.mimeType
    classifications = ext_obj.get('classifications')
    for classification in classifications or ():
        classification[MIMETYPE] = Classification.mimeType
    return Account

That's doable, although it certainly seems odd that the IMimeObjectFactory has side effects that mutate the provided external object.

Another solution for the nested objects would be to adapt to the appropriate type when setting nested external values (likely a dict) onto a field expecting an interface. That could easily be done with an AdaptingFieldProperty[1]. In that case the setter ends up knowing what interface it expects and calls an adapter to get the actual value.

@account.setter
def account(self, v):
    if not IAccount.providedBy(v):
        v = IAccount(v)
    self.__dict__['_account'] = v  # or whatever storage mechanism makes sense here

Then the adapter uses the existing internalization framework to populate the object.

from zope import component
from zope import interface

from nti.externalization.internalization import update_from_external_object

@component.adapter(dict)
@interface.implementer(IAccount)
def _external_account_factory(ext_value):
    account = Account()
    update_from_external_object(account, ext_value)
    return account

Again, that seems workable, although it feels less than ideal to wrap that internalization logic up in the object's implementation. Ideally this remains in internalisation-based hooks.

A middle ground could be to use custom implementations of IInternalObjectUpdater that override updateFromExternalObject to create the proper types and use the internalization framework to populate them prior to calling super. That seems like a bit of the best of both worlds, but the lack of named adapters for IInternalObjectUpdater means you can't support one input format from one system and another input format from another system. Depending on the application, that could make things difficult.

def updateFromExternalObject(self, ext_obj):
    if 'account' in ext_obj and not IAccount.providedBy(ext_obj['account']):
        ext_obj['account'] = IAccount(ext_obj['account'])
    # then call super(...).updateFromExternalObject(ext_obj)

It turns out that the existing implementation of update_from_external_object actually allows for essentially the same solution without a custom internaliser: validate_field_value will adapt field values to the interface required by the field's schema (if possible and necessary). That can be leveraged by adding an adapter from dict to the desired interface. This test case demonstrates that in use.

That is my favourite approach thus far, although it is a bit more implicit than I would like. Turning that adaptation into a named adapter call (with the name provided from the parent update_from_external_object) may help with that.
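
For reference, a sketch of the wiring for that, reusing the _external_account_factory adapter from the snippet above and a hypothetical IOrder schema whose account field is an Object typed to IAccount:

from zope import component
from zope import interface
from zope.schema import Object

class IOrder(interface.Interface):
    # Hypothetical containing schema; an Object field typed to IAccount is
    # what lets validate_field_value attempt IAccount(value) on a plain dict.
    account = Object(IAccount, title=u"The account that placed the order")

# Registering the dict -> IAccount adapter is the only extra wiring needed.
component.provideAdapter(_external_account_factory)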

All that said, this seems like a common problem that has come up in a couple of different areas. Even if we don't opt to make any changes to make this easier to deal with, it does seem like we should at least produce some recommendations for how to deal with this sort of thing.

Thoughts?

@jamadden jamadden self-assigned this Jul 11, 2018
jamadden added a commit that referenced this issue Jul 12, 2018
Move the factory finding responsibility to a new object that is given the key to assign to. The normal implementation will check the matching schema field's tagged value if one can't be found from the data. Currently this must be a callable object but there are some options for using strings.

This could also allow us to get the actual object that's currently on the parent object, so that we really are updating in place instead of constructing all new objects. This could have some nice database properties.
jamadden added a commit that referenced this issue Jul 13, 2018
Move the factory finding responsibility to a new object that is given the key to assign to. The normal implementation will check the matching schema field's tagged value if one can't be found from the data. Currently this must be a callable object but there are some options for using strings.

This could also allow us to get the actual object that's currently on the parent object, so that we really are updating in place instead of constructing all new objects. This could have some nice database properties.
@jamadden
Member

jamadden commented Jul 13, 2018

It seems to me that the biggest problem here is that we get factories in a vacuum, based only on the external data. If we know the target where the output of the factory will eventually go, we can do better. Specifically, when we have a schema, knowing the target attribute lets us use the schema field to get a factory. In pseudo code:

from nti.externalization.internalization import find_factory_for

def schema_find_factory(parent_object, external_key, external_data):
    factory = find_factory_for(external_data)  # this is all we currently do
    if factory is None:
        schema = interface.providedBy(parent_object)
        try:
            field = schema[external_key]
        except KeyError:
            pass
        else:
            factory = field.getTaggedValue('__external_factory__')
    return factory

The update loop changes slightly to call this function instead of just find_factory_for:

for k, v in iteritems(externalObject):
    factory = schema_find_factory(containedObject, k, v)
    if factory is not None:
        externalObject[k] = update_from_external_object(factory(), v)

This works well if every level of the object tree is modelled by a schema, but it breaks if for some reason some level of the tree is not modelled. I think we'd have to use a custom IInternalObjectUpdater there now anyway, so that's probably not a problem.

For fields that are sequences, this assumes that the sequence is homogeneous. Do we ever deal with sequences of heterogeneous types? (This could be worked around, but if we can avoid the complexity in general it would be good.)

One subtlety is that the key in the dictionary needs to match the schema field. I think we generally assume this anyway, though. (Again, this could be worked around but better to avoid the complexity if we don't need it.)

Another subtlety: since the place where we're getting the factory has the parent object, we could actually return the object that's already there instead of a new one. Thus we would actually be updating the object tree instead of throwing everything except the root away. That might be a win for persistent objects. (But it's a big behaviour change so I haven't done that yet.)

def schema_find_factory(parent_object, external_key, external_data):
    factory = find_factory_for(external_data)  # this is all we currently do
    if factory is None:
        existing_obj = getattr(parent_object, external_key, None)
        if existing_obj is not None:
            factory = lambda: existing_obj
        else:
            schema = interface.providedBy(parent_object)
            ...

I have an implementation of this I'm working on in the issue51 branch. Using it currently looks like this:

from zope import interface
from zope.schema import Int, Object

from nti.externalization.internalization import update_from_external_object

class INestedThing(interface.Interface):
    value = Int(title=u"An integer")

class IRoot(interface.Interface):
    field = Object(INestedThing)

@interface.implementer(IRoot)
class Root(object):

    def __init__(self):
        self.field = None

@interface.implementer(INestedThing)
class NestedThing(object):

    def __init__(self):
        self.value = -1

IRoot['field'].setTaggedValue('__external_factory__', NestedThing)

external = {'field': {'value': 42}}

root = Root()

update_from_external_object(root, external, require_updater=True)

I don't particularly like that the schema field winds up directly referring to a class as the factory. Better would be a component lookup for a factory based on a name; this could answer the "different types of input in different sites" question. I would imagine a ZCML directive that sets all this up:

<ext:anonymousObjectFactory
    for='IRoot'
    field='field'
    factory='NestedThing' />

So far I haven't done anything with primitives as in the example:

   if last_login:
       ext_obj['last_login'] = IDateTime(last_login)

but I'm not sure that we would need to. I would expect the field adapting process mentioned above would handle that case.
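
For what it's worth, a rough sketch of the kind of adapter that the field adapting process could lean on for a value like last_login (assuming the raw value is a parseable date string; dateutil is just one way to do the parsing):

from dateutil import parser as dateutil_parser
from zope import component
from zope import interface
from zope.interface.common.idatetime import IDateTime

@component.adapter(str)
@interface.implementer(IDateTime)
def _datetime_from_string(value):
    # datetime instances already declare IDateTime, so a plain parse suffices
    return dateutil_parser.parse(value)

component.provideAdapter(_datetime_from_string)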

@jamadden
Member

jamadden commented Jul 13, 2018

Worked example using ZCML.

Given this python:

from zope import interface
from zope.schema import Int, Object

class IGlobalNestedThing(interface.Interface):
    value = Int(title=u"An integer")

class IGlobalMiddleThing(interface.Interface):
    nested = Object(IGlobalNestedThing)

class IGlobalRoot(interface.Interface):
    field = Object(IGlobalMiddleThing)

@interface.implementer(IGlobalRoot)
class GlobalRoot(object):
    def __init__(self):
        self.field = None

@interface.implementer(IGlobalMiddleThing)
class GlobalMiddleThing(object):
    def __init__(self):
        self.nested = None

@interface.implementer(IGlobalNestedThing)
class GlobalNestedThing(object):
    def __init__(self):
        self.value = -1

And this ZCML:

<configure xmlns:ext="http://nextthought.com/ntp/ext">
   <include package="nti.externalization" file="meta.zcml" />
   <ext:registerAutoPackageIO
      root_interfaces=".IGlobalNestedThing .IGlobalMiddleThing .IGlobalRoot"
      iobase=".IOBase"
      modules="{0}"
      />
    <ext:anonymousObjectFactory
       for=".IGlobalRoot"
       field="field"
       factory=".GlobalMiddleThing"
       />
    <ext:anonymousObjectFactory
       for=".IGlobalMiddleThing"
       field="nested"
       factory=".GlobalNestedThing"
       />
</configure>

This works:

external = {'field': {'nested': {'value': 42}}}

root = GlobalRoot()

update_from_external_object(root, external, require_updater=True)

assert_that(root, has_attr('field', is_(GlobalMiddleThing)))
assert_that(root.field, has_attr('nested', is_(GlobalNestedThing)))
assert_that(root.field, has_attr('nested', has_attr('value', 42)))

With some effort, I could imagine trimming the amount of ZCML needed in the common case, but I'm not sure the convenience is worth the magic.

@cutz
Author

cutz commented Jul 13, 2018

I like this approach. I think it solves the problems I am aware of. @jzuech3 has gone through this process again with the webinar integration, so perhaps, having just gone through that, he can comment as well.

This works well if every level of the object tree is modelled by a schema, but it breaks if for some reason some level of the tree is not modelled. I think we'd have to use a custom IInternalObjectUpdater there now anyway, so that's probably not a problem.

Yes, that seems like a non-issue to me.

For fields that are sequences, this assumes that the sequence is homogeneous. Do we ever deal with sequences of heterogeneous types? (This could be worked around, but if we can avoid the complexity in general it would be good.)

I can't think of a heterogeneous example off the top of my head. If that were the case, it seems like you could have a factory registered that knows how to inspect the incoming object and determine the proper type. That would have to be deterministic for it to work at all. Probably a non-issue given the component lookup approach.
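
Something like this, where a single registered factory dispatches on the shape of the incoming data (the item types and the discriminating key are made up):

from zope import interface

class IOrderItem(interface.Interface):
    """Marker for items appearing in an order's line-item sequence."""

@interface.implementer(IOrderItem)
class OrderItem(object):
    pass

@interface.implementer(IOrderItem)
class RefundItem(OrderItem):
    pass

def order_item_factory(ext_obj):
    # One factory for the whole sequence that picks the concrete type by
    # inspecting the incoming external data; the check must be deterministic.
    return RefundItem() if 'refund_id' in ext_obj else OrderItem()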

One subtlety is that the key in the dictionary needs to match the schema field. I think we generally assume this anyway, though. (Again, this could be worked around but better to avoid the complexity if we don't need it.)

That sounds like something we might run into, but probably a non-issue given the component lookup approach?

I don't particularly like that the schema field winds up directly referring to a class as the factory.

Agreed. I like the component lookup approach and anonymousObjectFactory directive.

With some effort, I could imagine trimming the amount of ZCML needed in the common case, but I'm not sure the convenience is worth the magic.

Agreed.

jamadden added a commit that referenced this issue Jul 18, 2018
Move the factory finding responsibility to a new object that is given the key to assign to. The normal implementation will check the matching schema field's tagged value if one can't be found from the data. Currently this must be a callable object but there are some options for using strings.

This could also allow us to get the actual object that's currently on the parent object, so that we really are updating in place instead of constructing all new objects. This could have some nice database properties.