Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support LazyNumber #893

Closed
wants to merge 30 commits into from
Closed

support LazyNumber #893

wants to merge 30 commits into from

Conversation

pjfanning
Copy link
Member

@pjfanning pjfanning commented Jan 14, 2023

@pjfanning pjfanning marked this pull request as draft January 17, 2023 14:04
@pjfanning pjfanning changed the title WIP: support LazyNumber support LazyNumber Jan 20, 2023
@pjfanning pjfanning marked this pull request as ready for review January 20, 2023 17:22
@pjfanning
Copy link
Member Author

@cowtowncoder could you re-review?

LazyNumber lazyNumber = null;
if ((_numTypesValid & NR_BIGDECIMAL) == 0) {
if (_numTypesValid == NR_UNKNOWN) {
_parseNumericValue(NR_BIGDECIMAL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong: if the underlying number is integer, it should not be coerced into BigDecimal?
Or do I misremember logic in _parseNumericValue
I guess I should have a look at unit test here first. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the if stmt is if ((_numTypesValid & NR_BIGDECIMAL) == 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmh ok. What is needed here are unit tests of seeing which LazyNumber is actually returned in various cases. They should match getNumber(), 1-to-1 I think (contents that is, Float/LazyFloat, etc etc).

While there are various possible edge cases (if number value is first accessed with non-lazy, then lazy), but the most important one to me is the one where nextToken() is immediately followed by getLazyNumber call -- that should work in a way that decoding is deferred if not yet done, but eagerly decoded value (for int and long) is used otherwise.
And I think we can actually count on "short" (int, long) integers to be eagerly decoded in that way I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, from later comments it does look like getLazyNumber() would always give LazyBigInteger for VALUE_NUMBER_INT token. If so, test should verify that (assuming we don't want to add LazyInteger / LazyLong).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pjfanning Right, but I think that just means that no BigDecimal value has been decoded for current token -- but we do not automatically always want BigDecimal.

I think this should become clear with unit tests checking expected behavior: what I am worried about is that what is a short integral number (int) would become LazyBigDecimal here.

@cowtowncoder
Copy link
Member

@pjfanning Looks good wrt lazy value types. I think the important part to add would be unit tests to check behavior for basic json backend (can just test against String parser as logic is the same across backends), so that, at things point:

  • If getNumber() would return Integer, Long or BigInteger, getLazyNumber() will return LazyBigInteger
  • For getNumber() return of Float / Double / BigDecimal, getLazyNumber() will return matching LazyXxx

For second part, setting that determines if BigDecimal or Double is may need to be tested.
Actually I don't think JSON parser would ever return Float at all, only either Double or BigDecimal...

But now that I write this I realize that maybe we do not even have JsonReadFeature/StreamReadFeature for deciding "natural" type for floating-point values: there is only DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS.
Perhaps we need StreamReadFeature.USE_BIG_DECIMAL_FOR_FLOATS to actually force this behavior from low level; it would affect getNumberType() and so on.

(there is probably also need to then copy and modify this test for binary backends, later on -- but that can wait for now)

Hmmh. Ok, Double vs BigDecimal is tricky. I wonder if we really only should specifically have one LazyDecimal value which would store one of:

  1. BigDecimal (if already decoded, perhaps for binary format)
  2. Double (ditto)
  3. String -- for decoding into one (or even both) of above

This would be more complex, but would actually (I think!) defer decision on "is it Double or BigDecimal" until caller indicates their wish by calling specific accessor.
And so FINALLY @JsonCreator annotated BigDecimal (and Double) would be decoded from String only when actually needed.

WDYT? I know it sounds complicated but... I think that'd work.

btw also HUGE thank you for tackling this: I have thought about this for long time, but without getting started.

@pjfanning pjfanning marked this pull request as draft January 23, 2023 01:31
@pjfanning
Copy link
Member Author

@cowtowncoder I've made some of the changes you suggested. It does break a jackson-databind test case.

getNumberValue (and now getLazyNumber) start with:

        if (_numTypesValid == NR_UNKNOWN) {
            _parseNumericValue(NR_UNKNOWN); // will also check event type
        }

_parseNumericValue(NR_UNKNOWN) calls _parseSlowFloat and if you pass in NR_UNKNOWN, that code defaults to the lossy Double parsing approach.

Can I suggest that _parseSlowFloat needs some logic for the NR_UNKNOWN where the length of the _textBuffer.contentsAsString() matters - so we would use Doubles for small strings and lazily parsed BigDecimals for longer strings?

@cowtowncoder
Copy link
Member

@cowtowncoder I've made some of the changes you suggested. It does break a jackson-databind test case.

getNumberValue (and now getLazyNumber) start with:

        if (_numTypesValid == NR_UNKNOWN) {
            _parseNumericValue(NR_UNKNOWN); // will also check event type
        }

_parseNumericValue(NR_UNKNOWN) calls _parseSlowFloat and if you pass in NR_UNKNOWN, that code defaults to the lossy Double parsing approach.

Can I suggest that _parseSlowFloat needs some logic for the NR_UNKNOWN where the length of the _textBuffer.contentsAsString() matters - so we would use Doubles for small strings and lazily parsed BigDecimals for longer strings?

While I can see why such handling would have benefits, I think it would be confusing for users, when actual type we get is unpredictable. So I would be hesitant to add more specialized logic here with default settings.
However, as usual if there was a StreamReadFeature for opt-in, that would be fine.

I guess there are different use cases here, and I have been more focused in making sure that explicit target type will work with buffering as well as directly: that is, if target is indicated as BigDecimal, then that must be the handling; if Double, then that.

But handling of JsonNode, Object and Number is different now that I think about it.
Whether lazy/deferred or eager, there's the problem to consider -- Double or BigDecimal? It should be somewhat dynamic -- to use exact type from Binary formats -- but then again configurable for Textual case where there is no exact type.

This is why I was thinking a StreamReadFeature would make sense, to sort of fix low-level number type.

But I would be open to alternate Enum, say, that could have 3 states:

  1. Always expose Double if no native type known (textual formats)
  2. Always expose BigDecimal (same as ^^^)
  3. Use heuristics to decide ("small" FP as Double, "big" as BigDecimal)

and would only apply to Textual formats (or if binary formats do not have type, store FPs as text? If such format exists).

WDYT?

@cowtowncoder
Copy link
Member

Looking a bit at existing implementation, I noticed that handling in most cases is divided first for integral numbers (JsonToken.VALUE_NUMBER_INT) vs floating point (VALUE_NUMBER_FLOAT); and:

  • For integrals, getNumberType() used to distinguish and then use specific accessor
  • For floats, see if access is forced to coerce into BigDecimal -- and if so, use getDecimalValue() -- otherwise use getNumberValue(). In some cases tho alternatively use JsonParser.getNumberValueExact() (which is either native type from binary or BigDecimal for textual)

This for "untyped" (Object), JsonNode, and to some degree TokenBuffer.

Not sure what design would keep it all working and improve... just collecting observations so far.

@pjfanning
Copy link
Member Author

@cowtowncoder I was able to work around the test failure in jackson-databind by making a change in TokenBuffer.

I had started another jackson-core change that still might be worth merging into this PR. BigDecimal/BigInteger parsing is lazy in ParserBase. For instance, _parseSlowFloat just updates a _numberString variable instead of actually parsing the number and the actual parse is done later when needed. Locally, I've made Float/Double work the same way. Is that something that would be useful?

I still need to add more tests to this PR anyway.

@cowtowncoder
Copy link
Member

@cowtowncoder I was able to work around the test failure in jackson-databind by making a change in TokenBuffer.

I had started another jackson-core change that still might be worth merging into this PR. BigDecimal/BigInteger parsing is lazy in ParserBase. For instance, _parseSlowFloat just updates a _numberString variable instead of actually parsing the number and the actual parse is done later when needed. Locally, I've made Float/Double work the same way. Is that something that would be useful?

Yes, I think that lazy parsing of float and double would make sense too, regardless of "lazy"-work here.
So I could merge separate PR and we could verify it works fine with textual formats.

@cowtowncoder
Copy link
Member

Ok, so going with postpone decoding of double/Double as per above is a good addition (and maybe ditto for float?).

But beyond that I think that fundamentally handling of lazy wrappers is to be driven from TokenBuffer perspective, when it is buffering content without knowing eventual target types.
Method is TokenBuffer._copyBufferValue().

Process for individual number is divided between integers (JsonToken.VALUE_NUMBER_INT) and floating-point numbers (JsonToken.VALUE_NUMBER_FLOAT). Former is much simpler so I'll start with that.
(note: for some "untyped" formats like XML, CSV, no number tokens are found; they will buffer Strings etc; it's not problematic... or at least solution needs to be different for any problems).

Checking JsonParser.getNumberType() is called to figure out, reliably, whether we get Integer, Long or BigInteger. In first 2 cases there's no point in building wrapper and we should just leave code as-is.
But BigInteger is different: here we do not want to force decoding. But we also should not call JsonParser.getText() since CBOR/Smile/Protobuf/Avro will have binary representation to give,
So this is where we would want to call getLazyNumber() -- there's a question if we want common base type, or separate method for getLazyIntegralNumber() -- but for this discussion I'll assume shared base type.
Value we get, LazyBigInteger etc, only ever contains that type, containing either BigInteger or String that contains one, to be decoded on-demand. No need to have accessors for anything else but BigInteger value (and I guess getText() because why not).

The other part is more interesting: I think we must have "polymorphic" value type LazyFPNumber, from which float/double/BigDecimal can be decoded on-demand: not separate implementation types, but one that has String, Float (or float), Double (or double). This is trickier to implement and has state (wrt what has been decoded). It might need to contain NumberType if strict type (from binary formats) is to be detected?
TokenBuffer should always just call getLazyNumber() for all JsonToken.VALUE_NUMBER_FLOATs: JsonParser will create and return LazyFPNumber.

Accessors from TokenBuffer will now need check for Number (for eager numbers from binary codecs, and for int/long from JSON, YAML).

I don't know if this is helpful, but basically I think that LazyNumber subtypes from this PR should be combined into just 2 implementations, and doing that will allow solving the problem efficiently and reliably.
I'll try to think it through some more and might add some more notes. :)

@cowtowncoder
Copy link
Member

Ok, here's another thought that occurred to me: instead of adding wrapper type(s) in here, we could alternatively add method like:

     public Object getNumberDeferred() { 
          // return either eagerly decoded `Number` OR `String` for lazy/deferred decoding
          // default impl in base class simply does:
          return getNumber(); // that is, if not overridden calls eager access method
     }

and let caller (TokenBuffer) handle it the best it seems, using optimal wrapper.

This may seem crude but with logic outlined above (with TokenBuffer separating calls for integral vs floating-point types) would work and could be slightly simpler.

@pjfanning pjfanning marked this pull request as ready for review January 26, 2023 18:43
@cowtowncoder
Copy link
Member

@pjfanning sorry I ask all these things but... would it be possible to have the "lazier double/float parsing" (with _numberString) as a separate PR? I could review and merge that sooner -- I am not sure LazyNumber functionality here is yet ready. But would be good to get the other part in.

@pjfanning
Copy link
Member Author

pjfanning commented Jan 27, 2023

@cowtowncoder I created #899

@pjfanning
Copy link
Member Author

@cowtowncoder would it be possible to go back to a TokenBuffer internal solution and abandon putting any of these changes in jackson-core (other than #899)?

I can build on FasterXML/jackson-databind#3751 (which is useful in its own right) and do a follow up to 3751 which makes the TokenBuffer code that writes numbers to the buffer to store them as strings (ints and longs would be left to work as is) - TokenBuffer has the code to convert strings to numbers when reading them back off the buffer.

@pjfanning pjfanning marked this pull request as draft January 27, 2023 13:23
@cowtowncoder
Copy link
Member

@pjfanning One thing I could do is show what I mean by getNumberDeferred() by... well, implementing it.

The basic idea is just to return:

  1. Number just like getNumber() if (and only if!) number has already been decoded (or is in format from which one might as well do that, like byte array or buffer for binary formats)
  2. String otherwise: caller will know whether this integral or floating-point from JsonToken type

and then on TokenBuffer side there is need for handling deferred operation, using new wrapper type that can distinguish between deferred BigInteger (int and long are never deferred I think; I think this can be considered a hard constraint at this point in time) or one of floating-point types.
I think it is reasonable to fix NumberType on typed access (if specific number type is requested), but for getNumberType() to use heuristics otherwise, using DeserializationFeature to determine between DOUBLE and BIG_DECIMAL (FLOAT would only be indicated if pre-decode value gotten, or getFloatValue() called).

@pjfanning
Copy link
Member Author

closing due to #903

@pjfanning pjfanning closed this Jan 29, 2023
@pjfanning pjfanning deleted the lazy-num branch February 3, 2023 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants