Skip to content

Fix deserialization bug when JSON is not as nested as expected#325

Merged
bfolie merged 1 commit intomasterfrom
PLA-3998/predictor-report-bugfix
May 5, 2020
Merged

Fix deserialization bug when JSON is not as nested as expected#325
bfolie merged 1 commit intomasterfrom
PLA-3998/predictor-report-bugfix

Conversation

@bfolie
Copy link
Copy Markdown

@bfolie bfolie commented May 5, 2020

Citrine Python PR

Description

Predictor report deserialization was failing because pending reports return an empty reports field, and our deserialization code assumes that the JSON includes sufficiently-nested entries.

PR Type:

  • Breaking change (fix or feature that would cause existing functionality to change)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Adherence to team decisions

  • I have added tests for 100% coverage
  • I have written Numpy-style docstrings for every method and class.
  • I have communicated the downstream consequences of the PR to others.
  • I have bumped the version in setup.py

@bfolie bfolie requested review from andyczerwonka and maxhutch May 5, 2020 00:08
Comment on lines +82 to +85
if isinstance(value, dict):
value = value.get(field, self.default)
else:
value = self.default
Copy link
Copy Markdown
Author

@bfolie bfolie May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this instance we're deserializing the field 'report.descriptors'. If report key does not exist in the JSON then it takes the default value (edit: for the descriptors and models fields), which is an empty list in both cases. But it then tries to extract the descriptors key from that object, and an empty list does not have a get method.

The assumption that was being made here is that while the JSON might have missing terminal fields its basic structure is intact. That's not true in all cases.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the default value of "report" an empty list? I agree with @asantas93 that this fix is smelly.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not. The default value of report.descriptors is an empty list. report isn't a field we deserialize.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If report key does not exist in the JSON then it takes the default value, which is an empty list.

That's what I was commenting on.

Which of the three added tests was the one that was failing?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment with better prepositions.

They were all failing. report=None corresponds to what the backend returns for still-validating predictors.

@bfolie bfolie merged commit f764e96 into master May 5, 2020
fields = self.serialization_path.split('.')
for field in fields:
value = value.get(field, self.default)
if isinstance(value, dict):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check seems a little vague to me. Do we really want to ignore all non-dict values here? What if this is a string or a list? I would expect us to raise an exception in that case since the field was specified but (presumably) incorrect. Would value is None be a sufficient check for this fix?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is only used for serialization paths that are more than one level deep, in which case we do want to ignore all non-dict values. Any time we get to a non-dict value at this point, it means that the JSON did not have the expected structure and we should return the default value.

I actually think I should have made it more strict: any time value isn't a dict or it is a dict but doesn't have key field, we should set value = self.default and break out of the loop.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time we get to a non-dict value at this point, it means that the JSON did not have the expected structure and we should return the default value.

It's not obvious to me that it's appropriate to use a default value when the structure isn't correct. It makes sense to use a default value when an optional field is missing, but when the value is present and wrong, it probably indicates a bug that should be fixed rather than suppressed.

In the event that graceful recovery is appropriate (I would probably argue this is only when an optional field is missing), it does seem correct to break after falling back on the default value.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, in what case is it a string or a list? Aren’t we always expecting a dict? Agree on narrowing the scope of the check, in that throwing an exception is better so we don’t silently fail. Saying that, it is a programming error if it’s anything other than a dict or nothing, or am I missing something?

Copy link
Copy Markdown
Author

@bfolie bfolie May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the value is present and wrong, it probably indicates a bug that should be fixed rather than suppressed.

I agree, and that would throw an exception. But in this case the value isn't present and wrong, it's just not present.

Let's say the serialization path 'foo.bar' holds an integer which we want to deserialize as the field 'baz' and the default value is 0. This would be encoded as baz = properties.Integer(serialization_path='foo.bar', default=0).

A typical JSON looks like:

{
    'foo': {
        'bar': 17
    }
}

What if we get a non-integer as the value?

{
    'foo': {
        'bar': 'a string'
    }
}

That's an error and this PR doesn't change that.

What if the field bar is missing?

{
    'foo': {
    }
}

We would use the default value. That was the behavior before and is still the behavior.

What if the field foo is missing?

{
}

Previously this would result in an error, but now we'll use the default value.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying that you should use is None to check for absence. Otherwise you are allowing incorrect data to be treated the same as absent data.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Andy be right that correct usage of this function will never lead to strings or lists here. I wouldn't necessarily try to write code that handles contrived scenarios -- I am just advocating for doing precisely the check that you want to do, which seems to be key absence based on what I've seen in this conversation so far.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, to be clear, had I submitted this review before merge (which I thought I was doing at the time), I would have just left this as a comment. This wasn't intended as blocking feedback.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example @bfolie I am specifically worried about a case where we try to extract foo.bar but we have

{"foo":"whyisthisnotadict"}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it should always be a dict, and I agree that {"foo":"whyisthisnotadict"} should lead to an error. I've opened another PR and I'll include a test to check that that is the case.

@bfolie bfolie mentioned this pull request May 5, 2020
7 tasks
@bfolie bfolie deleted the PLA-3998/predictor-report-bugfix branch October 2, 2020 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants