
Add two new optional channels to the ProducerConfig for handling failures and acks#128

Closed
schleppy wants to merge 4 commits into IBM:master from CrowdStrike:FailedAndAckingChannels

Conversation

@schleppy
Contributor

@schleppy schleppy commented Jul 3, 2014

...
FailedChannel, when provided, will have all messages that are not acked
sent to it as type *ConsumerEvent. The application using this package is
responsible for reading from the failed channel or it will eventually
block.

AckingChannel, when provided, will receive per-partition, per-topic
counts of acked messages after each flush. The application using this
package is responsible for reading from the acking channel or it will
eventually block.

This commit also includes a change for the error sent in the
DroppedMessagesError struct to the errorCb to be the block error instead
of the error returned from Produce.
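The contract described above can be sketched as a drain loop. This is an illustrative mock, not sarama's actual API: the `event` type and the channel values stand in for the PR's `*ConsumerEvent` failures and per-partition ack counts.

```go
package main

import "fmt"

// Stand-in for the event type the PR description mentions; in the real
// patch, failed messages arrive as *ConsumerEvent on FailedChannel.
type event struct {
	Value []byte
	Err   error
}

func main() {
	// Hypothetical channels mirroring FailedChannel and AckingChannel.
	failed := make(chan *event, 16)
	acks := make(chan map[int32]int64, 16)

	// Simulate what the producer would deliver after a flush.
	failed <- &event{Value: []byte("lost message"), Err: fmt.Errorf("broker down")}
	acks <- map[int32]int64{0: 10, 1: 10}
	close(failed)
	close(acks)

	// The application must drain both channels, or the producer will
	// eventually block, as the description warns.
	for ev := range failed {
		fmt.Printf("will retry %q after error: %v\n", ev.Value, ev.Err)
	}
	for counts := range acks {
		var total int64
		for _, n := range counts {
			total += n
		}
		fmt.Println("messages acked this flush:", total)
	}
}
```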

@eapache
Contributor

eapache commented Jul 3, 2014

LGTM. @burke @fw42 @sirupsen ?

@eapache eapache closed this Jul 3, 2014
@eapache
Contributor

eapache commented Jul 3, 2014

sorry, wrong button

@eapache eapache reopened this Jul 3, 2014
@sirupsen
Contributor

sirupsen commented Jul 3, 2014

Makes sense to me.

Are you guys seeing this happening often?

@schleppy
Contributor Author

schleppy commented Jul 3, 2014

@sirupsen I assume the question was for me so I will answer as if it were.

Failures, no - not often, but they do happen, and the system Kafka is supporting needs 100% message guarantees. As for the acking, we see it all the time :)

The acking channel simply gives us the ability to generate metrics about how many messages we are acking per unit time, how many messages are "outstanding", etc.
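The accounting described here reduces to simple arithmetic: tally the ack counts coming off the channel and compare against messages sent. A minimal sketch, with every name and number illustrative rather than from the patch:

```go
package main

import "fmt"

// Rough sketch of the metrics described above: sum per-partition ack
// counts from the acking channel and derive an "outstanding" gauge.
func main() {
	sent := int64(1000) // messages handed to the producer so far

	acks := make(chan map[int32]int64, 2)
	acks <- map[int32]int64{0: 300, 1: 300} // one map per flush
	acks <- map[int32]int64{0: 150, 1: 150}
	close(acks)

	var acked int64
	for counts := range acks {
		for _, n := range counts {
			acked += n
		}
	}
	fmt.Printf("sent=%d acked=%d outstanding=%d\n", sent, acked, sent-acked)
	// → sent=1000 acked=900 outstanding=100
}
```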

@sirupsen
Contributor

sirupsen commented Jul 3, 2014

👍 We should use this too, I'm surprised we had nothing like this already.

@eapache eapache mentioned this pull request Jul 8, 2014
@otoolep

otoolep commented Jul 8, 2014

This design makes sense to me, however I am very new to this code.

producer.go Outdated

nitpick: think you meant partition (typo in the AckingChannel comment)

@fw42
Contributor

fw42 commented Jul 8, 2014

Looks good to me, besides the name of that variable.

@graemej

graemej commented Jul 8, 2014

Couple nitpicks, but looks sane to me.

producer.go Outdated
Contributor

I don't like the idea of naming a channel .*Channel. How about something like FailedEvents and BrokerAcks, TopicPartitionAcks, Acknowledgements, or even just Acks.

@burke
Contributor

burke commented Jul 9, 2014

Channel naming aside, I like the implementation. 👍

@schleppy
Contributor Author

schleppy commented Jul 9, 2014

Not sure why the name change caused the 1.1 tests to fail. I don't have gvm or 1.1 on this machine. Would someone else mind taking a look?

…gger a Travis build and see if test failure was a fluke.
@schleppy
Contributor Author

schleppy commented Jul 9, 2014

I changed a comment and the go1.1 tests passed. Not sure what the problem is with the test, or if this was some sort of TravisCI fluke, but with the comment changed, tests pass.

@eapache
Contributor

eapache commented Jul 10, 2014

That test failure seems to happen randomly ~1/5 times; I have been trying to track it down for over a week now without success. Needless to say, it isn't your fault but if you have any ideas please share :)

Contributor

how does the receiver know which topic this is for?

Contributor Author

It doesn't, and it was never my intention to include that information as it is not needed in my current application. What I do care about is purely the values to know I sent N messages, and N messages have been acked. I could certainly add it by wrapping the map in a struct that includes the topic if that is desired.

type AcksPerPartition struct {
    topic         string          // topic these counts belong to
    partitionAcks map[int32]int64 // partition ID -> acked message count
}

On a slightly unrelated note, I just had a thought. Is the "acksPerPartition" map actually reflecting "acksPerReplica"? I have inspected the data before, so I feel pretty confident that it is per partition, but I am not 100% sure now and don't have access to verify at the moment.

Contributor

Oh, OK, I just thought topic would be something you'd need.

len(prb) is, I believe, the total number of messages sent in this request - so you're not overcounting with replication, but you are overcounting if the request includes multiple topics/partitions. I'm not actually sure that there's a nice way to get the right count at the moment; the grouping is done in prb.toRequest and isn't returned (I guess you could dig it out of the ProduceRequest object itself, but that gets ugly when you try to deal with compressed messages).
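The overcount being described can be shown with a toy model. The `topicPartition` type below is illustrative, not sarama's internal request builder: the point is that crediting every partition with the request's total length inflates the counts whenever one request spans multiple topics or partitions.

```go
package main

import "fmt"

// Illustrative stand-in for a pending message's destination; not the
// library's actual internal type.
type topicPartition struct {
	topic     string
	partition int32
}

func main() {
	// One produce request batching messages for three distinct
	// topic/partitions.
	prb := []topicPartition{
		{"logs", 0}, {"logs", 0}, {"logs", 1}, {"metrics", 0},
	}

	// Naive: every partition gets credited with len(prb) acks.
	fmt.Println("naive count per partition:", len(prb)) // 4 (wrong)

	// Correct: group by (topic, partition) before counting.
	grouped := map[topicPartition]int64{}
	for _, tp := range prb {
		grouped[tp]++
	}
	fmt.Println(`acks for {"logs", 0}:`, grouped[topicPartition{"logs", 0}]) // 2
}
```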

@wvanbergen
Contributor

Anything else that needs to happen to this, except a rebase?

@eapache
Contributor

eapache commented Jul 23, 2014

We were discussing whether this over or under-counted the acks - I suggested it was probably overcounting, and haven't heard back... (see discussion on line 419)

@schleppy
Contributor Author

tl;dr: For my specific use case, this code is in production and works exactly the way I want it to. I am okay with maintaining our fork as we vendor the code anyway. If there is a desire for me to dig deeper and possibly augment the code base a bit, I think I can get this working in a general way. (That was a long tldr)

Sorry for not responding for so long, but I have had a really strong work commitment over the past couple weeks and haven't really had time to respond.

So, I am not entirely sure where to go with this, to be honest. I have mostly built this feature around my own requirements rather than those of general users. For my use case, a single producer maps directly to a single topic, so the errors and acks received are specific to that topic, and I don't have the "cross-pollination" in the prb that others may have. I have several producers, but they work off different sources, and the source defines the topic; so while I see the issue with the acking code, it isn't something I have had to deal with.

I don't over-count based on partitions either because I iterate over the map of partition_index:count pairs and first ensure they are all equal, then ack that number if they all match. This is what the pertinent code looks like:

    for {
        select {
        case ackCounts := <-kafkaAcks:
            if valid, num := ensureConsistentAcks(ackCounts); valid {
                ackNotifier <- num
                ...
            } else {
                ...
            }
        }
    }
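The ensureConsistentAcks helper itself is not shown in the thread; based only on the description above (every partition must report the same count, which is then acked), a guess at its shape might be:

```go
package main

import "fmt"

// A guess at schleppy's ensureConsistentAcks helper, inferred from his
// description; the real implementation does not appear in the PR. It
// verifies that every partition reports the same ack count and returns
// that count if so.
func ensureConsistentAcks(counts map[int32]int64) (bool, int64) {
	var first int64
	seen := false
	for _, n := range counts {
		if !seen {
			first, seen = n, true
			continue
		}
		if n != first {
			return false, 0 // partitions disagree
		}
	}
	return seen, first
}

func main() {
	fmt.Println(ensureConsistentAcks(map[int32]int64{0: 5, 1: 5, 2: 5})) // true 5
	fmt.Println(ensureConsistentAcks(map[int32]int64{0: 5, 1: 4}))       // false 0
}
```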

I realize my use case might not be generic enough, but I am not sure how far you would like me to go in reconstructing the counts from requests with multiple topics by pulling apart the ProduceRequest object. I am certainly willing to put in more effort to make it general enough for all use cases, but I have a feeling that will add to the code base quite a bit. I am also willing to maintain our fork if that is necessary.

I appreciate the work that has gone into this project, and really appreciate it being open-sourced.

@eapache
Contributor

eapache commented Jul 24, 2014

So I've looked at this again; it definitely doesn't return the right values when multiple topics are produced to at once; we shouldn't merge it until that is fixed. Unfortunately, fixing that correctly looks like a relatively hard problem given the current architecture (especially regarding compressed messages). If it works for your current case and you're OK maintaining the separate patch on top then I think that is the best solution for now (though I would add a big XXX caveat to your patch so that if somebody tries to use it for multiple topics in the future they'll understand what's going on).

I'm kind of hoping that eventually #132 will turn into something useful; that architecture makes this whole problem trivial from the get-go.

@eapache eapache closed this Jul 24, 2014

8 participants