Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support deserialization of partial trees #132

Merged
merged 3 commits into from
Feb 20, 2024

Conversation

ftomassetti
Copy link
Contributor

Until now we were expecting to receive entire trees. In other words to receive some node with a null parent and other nodes directly or indirectly connected to such roots. In this way we were able to unserialize the tree properly, setting the parents for all nodes.

However, with support for proper partitions being introduced, we may want to ask for specific root nodes (or smaller trees) and not always to get the whole partition. So we may need to be able to unserialize trees that refer to parents we did not receive. We therefore ask support for this. In that case the parents not transmitted in the chunk will be set to null.

In the future we may want to specify some form of NodeResolver.

@enikao
Copy link
Contributor

enikao commented Feb 20, 2024

I think we can make a clear distinction: The lowlevel stuff (SerializedClassifierInstance etc.) should work just fine with unknown node ids. As soon as we cross over to the higher-level API, they must be resolved -- otherwise we cannot hold any of the semantic guarantees.

I only had a glimpse on this PR, so I don't know how it compares to this idea.

@ftomassetti
Copy link
Contributor Author

I think we can make a clear distinction: The lowlevel stuff (SerializedClassifierInstance etc.) should work just fine with unknown node ids.

Yes, I agree and this is how the code works

As soon as we cross over to the higher-level API, they must be resolved -- otherwise we cannot hold any of the semantic guarantees.

Well, yes, but that means that if I am interested in retrieve any node, I must get all the nodes connected and referred on the Client side, otherwise I cannot unserialize the tree.

So if I have a large partition, containing many trees directly under it, when I get any node within that partition I also need to get (or already have somehow on the client) all the parents, up to the partition, and then all the children and descendant of the partition. So I need to get a lot of stuff.

I suggest having the possibility to ask for subtrees and get the subtree that is of interest "cutting away" the parent and their ancestors, but doing that only explicitly (i.e., passing a flag indicating this is the behavior expected). When not setting the flag asking for anything but an entire partition just fails, as we do not have the nodes that we need to refer from one subtree

@ftomassetti
Copy link
Contributor Author

Perhaps we can use some form of proxy or placeholder, but I think it would be important to have a way to retrieve and unserialize something smaller than an entire partition. And if we decide to forbid that, then we should redefine the bulk APIs to specify that only partitions can be retrieve, and not anything smaller.

@enikao
Copy link
Contributor

enikao commented Feb 20, 2024

Bulk API explicitly supports partial trees, so your approach makes sense.

Regarding the implementation, it looks good.
The only think I'm not so sure about is the API with the boolean flag. We already see that you had to change lots of places. Any future change (e.g. providing a NodeResolver) would involve at least as many changes. Maybe we should introduce a builder or config to encapsulate that?

@ftomassetti
Copy link
Contributor Author

The only think I'm not so sure about is the API with the boolean flag. We already see that you had to change lots of places. Any future change (e.g. providing a NodeResolver) would involve at least as many changes. Maybe we should introduce a builder or config to encapsulate that?

We could make this a setting of JsonSerialization. It would probably make sense. I will refactor the code accordingly.

* partitions) or we will get references to nodes (for example, parents or ancestors) outside of the
* scope of the tree extracted. This policy specifies what we do with such references.
*/
enum UnknownNodePolicy {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This policy is only about unknown parents, correct? Unknown annotations / children / reference targets are not affected. We should reflect that in the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I will do that

@enikao
Copy link
Contributor

enikao commented Feb 20, 2024

By the way, we can have results from bulk API that don't contain children -- if we set the maxDepth parameter. We might want to deal with that too (maybe in a different PR).

@ftomassetti
Copy link
Contributor Author

By the way, we can have results from bulk API that don't contain children -- if we set the maxDepth parameter. We might want to deal with that too (maybe in a different PR).

Good point. I will create an issue for that

@ftomassetti ftomassetti force-pushed the feature/allowDeserializationOfPartialTrees branch from 5fdc628 to b20d060 Compare February 20, 2024 14:41
@ftomassetti
Copy link
Contributor Author

I have created a follow-up issue for dealing with missing children, have renamed the enum to UnknownParentPolicy, and rebased on master

@ftomassetti
Copy link
Contributor Author

Thank you @enikao !

@ftomassetti ftomassetti merged commit e49c924 into master Feb 20, 2024
7 checks passed
@ftomassetti ftomassetti deleted the feature/allowDeserializationOfPartialTrees branch April 16, 2024 12:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants