public interface for offset resetting #194
Conversation
Cool cool, this is a feature I've good use for.
Ah, good point - I hadn't realised that when I posted on #187. So iiuc, that means if you start the consumer, then change the offsets, you might get a message from
That wouldn't be completely terrible either. Presumably, someone using this feature has some other place where they store offsets (I've an app where the offsets sit in postgres), and so they would usually know which partitions they want to begin with. Although... it might get messy when the number of partitions is changed (by a cluster operator). If you have to provide a list of partitions+offsets on init, you can't then make it grab the new partitions. So I guess for that scenario your current solution is better: you can grab whatever partitions there are (i.e. don't specify a list on init), and then for the ones you already knew about you can reset the offsets. Note to self: I think I have to handle
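The changed-partition-count scenario described above can be sketched as a merge: combine the offsets you persisted externally (e.g. in postgres) with whatever partitions currently exist, letting newly created partitions fall back to a sentinel. This is an illustrative sketch, not pykafka code; `merge_offsets` is a hypothetical helper, and the -2/-1 sentinels follow Kafka's earliest/latest offset convention.

```python
# Sketch: merging externally stored offsets with the live partition set.
# merge_offsets is hypothetical; -2 = earliest, -1 = latest (Kafka convention).

OFFSET_EARLIEST = -2

def merge_offsets(stored, live_partition_ids):
    """Return {partition_id: offset} covering every live partition.

    `stored` holds offsets previously persisted elsewhere; partitions
    created since the last run fall back to OFFSET_EARLIEST.
    """
    return {pid: stored.get(pid, OFFSET_EARLIEST)
            for pid in sorted(live_partition_ids)}

stored = {0: 1500, 1: 980}          # offsets we knew about
live = {0, 1, 2}                    # partition 2 was added by an operator
print(merge_offsets(stored, live))  # → {0: 1500, 1: 980, 2: -2}
```

This way the consumer can still grab all partitions on init, while anything it already knew about gets reset to the stored position.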
I'm not totally convinced that there's a good reason to allow setting offsets on
Yes, no indeed - after re-reading my earlier ramble, I agree :)
I've added the option to lock and flush the partition queues during an offset reset. This means that users no longer need to manually pause consumption before calling
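A toy model of the lock-and-flush behaviour described here (not actual pykafka internals; all names are illustrative): the reset takes the same lock as fetches and drains the buffered message queue, so callers need not pause consumption themselves.

```python
import threading
from queue import Queue, Empty

class PartitionQueue:
    """Toy model: an offset reset takes the fetch lock and flushes
    buffered messages, so fetch and reset can never run concurrently."""

    def __init__(self):
        self._fetch_lock = threading.Lock()
        self._messages = Queue()
        self.next_offset = 0

    def fetch(self, payloads):
        with self._fetch_lock:  # fetch and reset are mutually exclusive
            for p in payloads:
                self._messages.put(p)
                self.next_offset += 1

    def reset_offset(self, offset):
        with self._fetch_lock:  # block concurrent fetches...
            while True:         # ...and flush anything already buffered
                try:
                    self._messages.get_nowait()
                except Empty:
                    break
            self.next_offset = offset

pq = PartitionQueue()
pq.fetch([b"a", b"b", b"c"])
pq.reset_offset(0)
print(pq._messages.qsize(), pq.next_offset)  # 0 0
```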
This is ready for review @yungchin @kbourgoin. I'll be testing it against real streams as soon as possible, too.
pykafka/simpleconsumer.py
Outdated
Could this become "Iterable of (int, int)" - that is, (partition_id, offset) - instead? Otherwise a user would have to go through self._partitions_by_id to obtain instances of OwnedPartition first before they'd call this, I think.
They can actually use the partitions property that's currently in master. I think it's cleaner and more consistent with the rest of the API to supply Partition instances over their ids.
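As a sketch of what using that property implies (stub objects, not pykafka; this assumes the partitions property maps partition ids to partition instances), rehydrating saved (partition_id, offset) pairs is one lookup per pair:

```python
# Sketch of turning saved (partition_id, offset) pairs back into
# (partition_object, offset) pairs via an id -> partition mapping.
# Partition here is a stub, not the real pykafka class.

class Partition:
    def __init__(self, id_):
        self.id = id_

partitions = {i: Partition(i) for i in range(3)}  # stand-in for the property
saved = [(0, 1500), (2, 42)]                      # pairs loaded from disk

partition_offsets = [(partitions[pid], offset) for pid, offset in saved]
print([(p.id, o) for p, o in partition_offsets])  # [(0, 1500), (2, 42)]
```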
I see what you mean that that feels inconsistent, but on the other hand, as a lazy user of the interface, I'd say the nice thing about ids is that you can save them to disk, and when you read them back from disk you can pass them straight in here. Maybe I'm missing a trick, though, and there's a shortcut somehow. Am I right in thinking you'd have to get the ids, then map them to partition.Partitions (which is what you get from the BalancedConsumer.partitions property), and then map again to simpleconsumer.OwnedPartitions to call reset_offsets?
Oh, sure - if you're saving partition IDs to disk then you would have to do a bit of work to construct the input to reset_offsets. I think a cleaner solution is to stop this function from accepting OwnedPartition and have it take Partition instead, since OwnedPartition is supposed to be private. I'll hold off on merging until I've figured that out, then.
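A minimal sketch of that signature change (stub classes, not the real pykafka types): reset_offsets accepts public Partition objects and resolves them to private OwnedPartitions internally, so callers never touch OwnedPartition.

```python
# Sketch: reset_offsets takes (Partition, offset) pairs and maps each
# Partition to the consumer's private OwnedPartition by id.
# All classes here are stand-ins for illustration only.

class Partition:
    def __init__(self, id_):
        self.id = id_

class OwnedPartition:
    def __init__(self, partition):
        self.partition = partition
        self.next_offset = 0

class Consumer:
    def __init__(self, partitions):
        self._partitions_by_id = {p.id: OwnedPartition(p) for p in partitions}

    def reset_offsets(self, partition_offsets):
        """partition_offsets: iterable of (Partition, offset)."""
        for partition, offset in partition_offsets:
            owned = self._partitions_by_id[partition.id]  # private lookup
            owned.next_offset = offset

parts = [Partition(0), Partition(1)]
c = Consumer(parts)
c.reset_offsets([(parts[1], 99)])
print(c._partitions_by_id[1].next_offset)  # 99
```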
2876a58 to a3d7d24
Ok, so basically I've only lazy complaints about the function signature :)
Awesome, thanks for the feedback. Your comments did make me realize it's no good to expose
b7881b1 to 3388a6b
…m consume() during an offset reset
… remove the kwarg
…ore their existence
It turns out that kafka-python allows resetting to custom offsets by promising to fetch the given offset next, not by actually committing the given offset to kafka. This is a problem because it creates a mismatch between the offset stored in kafka and the offset stored in the consumer; this is contrary to the design we've tried to maintain in pykafka so far.
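The mismatch can be illustrated with a toy model (plain dicts standing in for the broker's committed-offset store and the consumer's fetch position; this is not real kafka-python or pykafka code):

```python
# Toy illustration: resetting only the fetch position leaves the offset
# committed to the broker out of sync; committing as part of the reset
# keeps the two stores consistent.

broker_committed = {"partition-0": 500}   # offset stored in Kafka
consumer_position = {"partition-0": 500}  # offset the consumer fetches next

def reset_fetch_only(partition, offset):
    consumer_position[partition] = offset      # kafka-python-style reset

def reset_and_commit(partition, offset):
    consumer_position[partition] = offset
    broker_committed[partition] = offset       # keep broker in sync

reset_fetch_only("partition-0", 100)
print(broker_committed["partition-0"],
      consumer_position["partition-0"])        # 500 100 (mismatch)

reset_and_commit("partition-0", 100)
print(broker_committed["partition-0"] ==
      consumer_position["partition-0"])        # True
```

A crash between the fetch-only reset and the next commit would make the consumer resume from the stale broker-side offset, which is exactly the inconsistency described above.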
This is now ready for review from @yungchin and @kbourgoin. @yungchin, this hasn't changed a ton since you looked at it, but I fixed some deadlocks and improved the interface for user-supplied offsets.
pykafka/simpleconsumer.py
Outdated
Given that partitions has been dropped from the code above this, I think L523-525 may need rewriting to write into owned_partition_offsets instead?
Cool, very happy with that addition. The new way of handling both timestamps and actual offsets seems a neat solution to me. I should add, though, that I don't feel particularly qualified to comment on that bit, given that I've only a poor understanding of Kafka's OffsetRequest semantics so far. Probably best if @kbourgoin can give that a read too. I've added one new note (see above), which seems important for repeating failed requests. Other than that, all happy!
public interface for offset resetting
As of 2dc86ce (#194) reset_offsets() is a public method that should be working correctly on a running consumer. This commit deals with that by completely nuking the internal rdkafka consumer, which should be ok unless someone wanted to call reset_offsets() a lot more than expected. Tests for this are currently on a separate branch, see #213. As a bonus, this also made fetch_offsets work for a running consumer. Signed-off-by: Yung-Chin Oei <yungchin@yungchin.nl>
This pull request adds a public interface to the consumers that allows client code to set the current partition offsets to whatever they choose. Previously, this had to be done via a semi-public undocumented interface as detailed in this issue.
Given these changes, the new workflow for specifying offsets from which to consume looks like this:
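(The snippet originally shown here wasn't captured in this copy. The following is a self-contained stand-in that models the intended shape - construct the consumer, call the public reset_offsets() with (partition, offset) pairs, then consume from the chosen position. The class names mirror pykafka's, but these are stubs, not the real client.)

```python
# Stand-in model of the new workflow; not the real pykafka classes.

class Partition:
    def __init__(self, id_, messages):
        self.id = id_
        self.messages = messages

class SimpleConsumer:
    def __init__(self, partitions):
        self.partitions = {p.id: p for p in partitions}
        self._positions = {p.id: 0 for p in partitions}

    def reset_offsets(self, partition_offsets):
        """partition_offsets: iterable of (partition, offset)."""
        for partition, offset in partition_offsets:
            self._positions[partition.id] = offset

    def consume(self, partition_id):
        msgs = self.partitions[partition_id].messages
        pos = self._positions[partition_id]
        if pos >= len(msgs):
            return None                       # nothing new at this position
        self._positions[partition_id] = pos + 1
        return msgs[pos]

part = Partition(0, [b"m0", b"m1", b"m2"])
consumer = SimpleConsumer([part])
consumer.reset_offsets([(part, 2)])  # start from offset 2, not the head
print(consumer.consume(0))           # b'm2'
```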
This pull request also ties reset_offsets to the fetch locking mechanism so that a partition can never be fetching and resetting its offsets at the same time. This makes the unlocking logic at the end of fetch() notably more complicated, since one of its error conditions involves calling reset_offsets.

The following has been resolved in this pull request:

The only issue I see with this method is that the consumer has a chance to start fetching messages between get_simple_consumer and reset_offsets. To get around this, we could accept the partition_offset_pairs in get_simple_consumer, but that would require some way to obtain a list of partitions before actually instantiating a consumer. This sounds like a road I'd rather not go down, so the question is: how bad is it that it's impossible to start the consumer at an arbitrary offset that's not the head or tail of the partition? My gut tells me it doesn't matter and this solution is ok.

Let me know what you think, @kbourgoin and @yungchin.