@sugmanue sugmanue commented Sep 29, 2025

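There are two bugs in the `_finishLongTextAscii` method, introduced in #519 (via #568), that cause text to be truncated.

  1. The `outPtr` is always set to zero (see [here](https://github.com/FasterXML/jackson-dataformats-binary/blob/b20075ff0c029d659cb24adc6c65d2be748a8753/cbor/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORParser.java#L2636)) before the read loop. If the output buffer still has room, its existing contents are overwritten instead of being appended to (overwritten or missing chunks).

  2. If the method exits after fully reading the expected text length, the length of the last segment is not set after the [outer loop](https://github.com/FasterXML/jackson-dataformats-binary/blob/b20075ff0c029d659cb24adc6c65d2be748a8753/cbor/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORParser.java#L2657), so the calling code drops that segment when the string is finished (truncated text).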
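
To make the two failure modes concrete, here is a minimal, self-contained sketch. It is not the actual `CBORParser` or `TextBuffer` code: `SegmentedText`, `append`, and `finish` are hypothetical names that only model a segmented output buffer whose reader trusts a recorded length for the last segment. The two commented lines mark where, under that assumption, each bug would bite.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SegmentedTextSketch {
    /** Hypothetical stand-in for a segmented text buffer; not Jackson's TextBuffer. */
    static final class SegmentedText {
        private final List<char[]> fullSegments = new ArrayList<>();
        private char[] current;
        private int currentLength; // number of valid chars in `current`

        SegmentedText(int segmentSize) {
            current = new char[segmentSize];
        }

        /** Appends ASCII bytes to the buffer, continuing where the last call stopped. */
        void append(byte[] input, int offset, int len) {
            // Bug 1 analogue: starting at 0 here would overwrite chars already stored.
            int outPtr = currentLength;
            for (int i = 0; i < len; i++) {
                if (outPtr == current.length) {
                    // Current segment is full: keep it and start a fresh one.
                    fullSegments.add(current);
                    current = new char[current.length];
                    outPtr = 0;
                }
                current[outPtr++] = (char) input[offset + i];
            }
            // Bug 2 analogue: without this, finish() would drop the last segment's chars.
            currentLength = outPtr;
        }

        /** Concatenates all full segments plus the valid part of the last one. */
        String finish() {
            StringBuilder sb = new StringBuilder();
            for (char[] segment : fullSegments) {
                sb.append(segment);
            }
            sb.append(current, 0, currentLength);
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        byte[] data = "abcdefghij".getBytes(StandardCharsets.US_ASCII);
        SegmentedText text = new SegmentedText(4);
        text.append(data, 0, 3); // leaves room in the first segment
        text.append(data, 3, 7); // must continue after "abc", not overwrite it
        System.out.println(text.finish()); // prints abcdefghij
    }
}
```

In this model, dropping either commented assignment reproduces the symptoms described above: resetting `outPtr` loses earlier characters, and skipping the final length update loses the tail.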

Notes

  1. This code path is only triggered by long texts, where "long" means that the full length of the text cannot be read into the input buffer at once.
  2. This change includes two tests: one for non-chunked text (the case being fixed here) and another for chunked text, to validate that the issue is not present in that path as well.
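
For readers unfamiliar with the distinction in note 2, the sketch below hand-encodes the same ASCII text both ways, following the text-string rules of RFC 8949 (major type 3, definite length versus indefinite length with a break byte). It is only an illustration: it is not the PR's test code, it does not use Jackson's generator API, and the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CborTextEncodingSketch {
    private static final int MAJOR_TEXT = 0x60; // major type 3 in the high three bits

    /** Writes a definite-length text header for a non-negative byte length. */
    static void writeTextHeader(ByteArrayOutputStream out, int len) {
        if (len < 24) {
            out.write(MAJOR_TEXT | len);      // length fits in the initial byte
        } else if (len <= 0xFF) {
            out.write(MAJOR_TEXT | 24);       // 1-byte length follows
            out.write(len);
        } else if (len <= 0xFFFF) {
            out.write(MAJOR_TEXT | 25);       // 2-byte big-endian length follows
            out.write(len >> 8);
            out.write(len);
        } else {
            out.write(MAJOR_TEXT | 26);       // 4-byte big-endian length follows
            out.write(len >> 24);
            out.write(len >> 16);
            out.write(len >> 8);
            out.write(len);
        }
    }

    /** Non-chunked: one header carrying the full length, then all the bytes. */
    static byte[] encodeNonChunked(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeTextHeader(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    /**
     * Chunked: 0x7F, then definite-length chunks, then the 0xFF break byte.
     * Assumes ASCII input so that splitting by char count never splits a code point.
     */
    static byte[] encodeChunked(String text, int chunkChars) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x7F);
        for (int start = 0; start < text.length(); start += chunkChars) {
            int end = Math.min(text.length(), start + chunkChars);
            byte[] utf8 = text.substring(start, end).getBytes(StandardCharsets.UTF_8);
            writeTextHeader(out, utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        out.write(0xFF);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        char[] chars = new char[100_000]; // arbitrary size chosen for illustration
        Arrays.fill(chars, 'a');
        String longText = new String(chars);
        System.out.println("non-chunked bytes: " + encodeNonChunked(longText).length);
        System.out.println("chunked bytes:     " + encodeChunked(longText, 1000).length);
    }
}
```

A parser sees the total length up front only in the non-chunked form, which is why the two forms can take different decoding paths and are worth testing separately.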

Fixes #616.

@cowtowncoder
Member

Whoa! Thank you very much for reporting #616 and providing this fix. I'll need to read it carefully.
I think we have a CLA for you (as per #568), so that's good.

But due to the nature of the bug, I think we'd want the fix all the way back to the 2.18 branch (that's the intended LTS release). I could try cherry-picking or, if it's easy enough for you, re-creating the PR with 2.18 as the target would be great.

One possibly gnarly change there is the JUnit 4 -> 5 conversion (see #550), done for 2.19.

So alternatively we could consider merging the full PR in 2.19 and backporting only the fix, not the tests (not ideal, but it would do).

@sugmanue
Contributor Author

sugmanue commented Sep 29, 2025

> Whoa! Thank you very much for reporting #616 and providing this fix. I'll need to read it carefully. I think we have a CLA for you (as per #568), so that's good.

I introduced it in the first place; somehow I didn't fully test it. I added JaCoCo locally and verified that all the code introduced in the previous PR is covered. Apologies for my sloppiness.

Unrelated to this change: with JaCoCo, I see some code paths related to reading numbers that are not covered. Is there any reason not to add JaCoCo? If there's none, I can send a PR for that.

@sugmanue
Contributor Author

> But due to the nature of the bug, I think we'd want the fix all the way back to the 2.18 branch (that's the intended LTS release). I could try cherry-picking or, if it's easy enough for you, re-creating the PR with 2.18 as the target would be great.

As far as I understand, this change is only present from 2.19 onwards. I double-checked the code on 2.18 and this code path is not there, so 2.18 is not affected by this particular issue.

@cowtowncoder
Member

@sugmanue I should have checked before I wrote the above: yes, this was changed in 2.19.0, so the fix need not (and cannot) go in 2.18 anyway. But I think it'd be good to merge it into 2.19, in case we release 2.19.3.

@cowtowncoder
Member

@sugmanue np, these things happen. I did not review the code well enough either. Glad it got caught now at least.

int inPtr = _inputPtr;
int i = 0;
// Tight loop to copy into the output buffer, bail if a non-ascii char is found
while (outPtr < outEnd && i >= 0) {
Member

(just noting for posterity, not suggesting change within this fix)

The check for `i >= 0` seems sub-optimally placed: it runs before the next byte is actually read and output, so a non-ASCII byte is only noticed after it has been copied. That leads to the need to "undo" the copy, instead of changing control flow where the problem is encountered.

Contributor Author

The idea was to remove as many branches from the loop as possible. The cost is the need to undo, but at least that branch is outside the hot code-path. I didn't do performance testing to validate the idea, but I will do some and post back the results.

Member

Yeah, measuring is good; actual performance is not always obvious. So in this case there's just one extra comparison (`i` is checked before the first copy), but that's only once per segment/run, so probably insignificant over non-trivial data.
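
The two loop shapes being compared can be sketched side by side. This is an illustration only: the loop condition matches the excerpt quoted at the top of this thread, but the loop bodies, method names, and return convention are assumptions, not the code in `CBORParser`. Both methods assume the caller has made sure enough input bytes are available.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCopyLoopSketch {
    /**
     * Copy-then-check: a single loop condition, no branch in the body.
     * A non-ASCII (negative) byte is copied first and undone after the loop.
     * Returns the number of ASCII chars copied.
     */
    static int copyThenUndo(byte[] in, int inPtr, char[] out, int outPtr, int outEnd) {
        final int start = outPtr;
        int i = 0;
        while (outPtr < outEnd && i >= 0) {
            i = in[inPtr++];
            out[outPtr++] = (char) i;
        }
        if (i < 0) {
            --outPtr; // undo the copy of the non-ASCII byte
        }
        return outPtr - start;
    }

    /**
     * Check-before-copy: an extra branch inside the loop, but nothing to undo.
     * Returns the number of ASCII chars copied.
     */
    static int checkBeforeCopy(byte[] in, int inPtr, char[] out, int outPtr, int outEnd) {
        final int start = outPtr;
        while (outPtr < outEnd) {
            int i = in[inPtr];
            if (i < 0) {
                break; // leave the non-ASCII byte unread and uncopied
            }
            ++inPtr;
            out[outPtr++] = (char) i;
        }
        return outPtr - start;
    }

    public static void main(String[] args) {
        byte[] input = "abc\u00e9def".getBytes(StandardCharsets.UTF_8);
        char[] out = new char[input.length];
        System.out.println(copyThenUndo(input, 0, out, 0, out.length));    // prints 3
        System.out.println(checkBeforeCopy(input, 0, out, 0, out.length)); // prints 3
    }
}
```

Both variants stop at the same position; the difference is whether the sign check sits in the loop condition (with a fix-up afterwards) or in the loop body.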

Contributor Author

Benchmarks with the code as is (using these benchmarks).

Benchmark                (flavor)    (size)  Mode  Cnt      Score     Error  Units
MyBenchmark.cbor  ASCII_PRINTABLE  XX_LARGE  avgt    5  24318.121 ± 329.952  ns/op

And with this patch applied.

Benchmark                (flavor)    (size)  Mode  Cnt      Score      Error  Units
MyBenchmark.cbor  ASCII_PRINTABLE  XX_LARGE  avgt    5  26899.778 ± 1239.516  ns/op

Looks like the current version is slightly faster, but not by much. This input has 4 fields of about 16 KB.
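
As a quick sanity check on "slightly faster", the two reported scores can be compared directly. The sketch below uses only the numbers from the two result tables above; the helper names are made up for this example, and with only 5 iterations per run the comparison should be read as indicative rather than conclusive.

```java
public class BenchmarkDeltaSketch {
    /** Relative change of candidate versus baseline (positive means a higher score). */
    static double relativeChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline;
    }

    /** True if the ranges [score - error, score + error] of the two runs overlap. */
    static boolean intervalsOverlap(double scoreA, double errA, double scoreB, double errB) {
        return (scoreA - errA) <= (scoreB + errB) && (scoreB - errB) <= (scoreA + errA);
    }

    public static void main(String[] args) {
        double current = 24318.121, currentErr = 329.952;   // code as is, ns/op
        double patched = 26899.778, patchedErr = 1239.516;  // alternative, ns/op
        // For avgt in ns/op, a higher score means slower.
        System.out.printf("relative change: %.1f%%%n", 100 * relativeChange(current, patched));
        System.out.println("error ranges overlap: "
                + intervalsOverlap(current, currentErr, patched, patchedErr));
    }
}
```

The second run's score comes out roughly 10.6% higher, and the two reported error ranges do not overlap.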

Member

Ok. I'll take that. :)

Thank you for humoring me.

Member

@cowtowncoder left a comment

LGTM, will merge, backport

@cowtowncoder cowtowncoder merged commit d57fa46 into FasterXML:2.x Sep 30, 2025
4 checks passed
cowtowncoder pushed a commit that referenced this pull request Sep 30, 2025
@cowtowncoder
Member

Merged, backported in 2.19(.3), 2.20(.1).
