Skip to content

KAFKA-19367: Fix InitProducerId with TV2 double-increments epoch if ongoing transaction is aborted #19910

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

rreddy-22
Copy link
Contributor

@rreddy-22 rreddy-22 commented Jun 6, 2025

When InitProducerId is handled on the transaction coordinator, the
producer epoch is incremented (so that we fence stale requests), then if
a transaction was ongoing during this time, it's aborted. With
transaction version 2 (a.k.a. KIP-890 part 2), abort increments the
producer epoch again (it's the part of the new abort / commit protocol),
so the epoch ends up incremented twice.

In most cases, this is benign, but in the case where the epoch of the
ongoing transaction is 32766, it's incremented to 32767, which is the
maximum value for short. Then, when it's incremented for the second
time, it goes negative, causing an illegal argument exception.

To fix this we just avoid bumping the epoch a second time.

Reviewers: Justine Olshan jolshan@confluent.io, Artem Livshits
alivshits@confluent.io

@github-actions github-actions bot added triage PRs from the community core Kafka Broker small Small PRs labels Jun 6, 2025
@github-actions github-actions bot removed the triage PRs from the community label Jun 6, 2025
@github-actions github-actions bot removed the small Small PRs label Jun 11, 2025
Copy link
Contributor

@artemlivshits artemlivshits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jolshan jolshan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@jolshan jolshan merged commit 0b2e410 into apache:trunk Jun 12, 2025
23 checks passed
@jolshan
Copy link
Member

jolshan commented Jun 12, 2025

Will pick to 4.0 and 4.1

jolshan pushed a commit that referenced this pull request Jun 12, 2025
…ngoing transaction is aborted (#19910)

When InitProducerId is handled on the transaction coordinator, the
producer epoch is incremented (so that we fence stale requests), then if
a transaction was ongoing during this time, it's aborted.  With
transaction version 2 (a.k.a. KIP-890 part 2), abort increments the
producer epoch again (it's the part of the new abort / commit protocol),
so the epoch ends up incremented twice.

In most cases, this is benign, but in the case where the epoch of the
ongoing transaction is 32766, it's incremented to 32767, which is the
maximum value for short. Then, when it's incremented for the second
time, it goes negative, causing an illegal argument exception.

To fix this we just avoid bumping the epoch a second time.

Reviewers: Justine Olshan <jolshan@confluent.io>, Artem Livshits
 <alivshits@confluent.io>
rreddy-22 added a commit to rreddy-22/kafka-rreddy that referenced this pull request Jun 12, 2025
…ngoing transaction is aborted (apache#19910)

When InitProducerId is handled on the transaction coordinator, the
producer epoch is incremented (so that we fence stale requests), then if
a transaction was ongoing during this time, it's aborted.  With
transaction version 2 (a.k.a. KIP-890 part 2), abort increments the
producer epoch again (it's the part of the new abort / commit protocol),
so the epoch ends up incremented twice.

In most cases, this is benign, but in the case where the epoch of the
ongoing transaction is 32766, it's incremented to 32767, which is the
maximum value for short. Then, when it's incremented for the second
time, it goes negative, causing an illegal argument exception.

To fix this we just avoid bumping the epoch a second time.

Reviewers: Justine Olshan <jolshan@confluent.io>, Artem Livshits
 <alivshits@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants