Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Producer deadlock: overlimit=0 #65

Closed
guyanb opened this issue Jan 21, 2014 · 8 comments · Fixed by #124
Closed

Producer deadlock: overlimit=0 #65

guyanb opened this issue Jan 21, 2014 · 8 comments · Fixed by #124
Labels

Comments

@guyanb
Copy link

guyanb commented Jan 21, 2014

file:produce_message.go

func (msg *produceMessage) enqueue(p *Producer) error {
    ......
    errs := make(chan error, 1)
    bp.flushRequest(p, prb, func(err error) {
        errs <- err
    })
    //:3 block
    return <-errs
}

file:producer.go

func (bp *brokerProducer) flushRequest(p *Producer, prb produceRequestBuilder, errorCb func(error)) (shutdownRequired bool) {
    ......
    switch err {
    case nil:
        break
    case EncodingError:
        ......
    default:
        overlimit := 0
        prb.reverseEach(func(msg *produceMessage) {
            //:1 if msg.reenqueue(p) return a nil, then overlimit=0
            if err := msg.reenqueue(p); err != nil {
                overlimit++
            }
        })
        //:2 if overlimit=0, not call errorCb
        if overlimit > 0 {
            errorCb(DroppedMessagesError{overlimit, nil})
        }
        return true
    }
    ......
}

func (msg *produceMessage) reenqueue(p *Producer) error {
    if !msg.retried {
        msg.retried = true
        return msg.enqueue(p)
    }
    return nil
}
@eapache
Copy link
Contributor

eapache commented Jun 24, 2014

I believe the correct fix for this is:

  • have reenqueue return non-nil when msg.retried so we know it hasn't been delivered
  • have flushRequest call errorCb(nil) when overlimit == 0

On inspection there are two other issues:

  • duplicate of this code around line 378 of producer.go, as above
  • if the response is well formed but empty (no blocks, or blocks but no partitions) then we return without ever calling errorCb - to fix this we should probably set a flag when we see a block-partition, and call errorCb if the flag is unset after the main loop

@eapache
Copy link
Contributor

eapache commented Jun 24, 2014

@burke @fw42 @wvanbergen does this make sense to you?

@wvanbergen
Copy link
Contributor

I am not quite sure what the initial problem was here? :/

@fw42
Copy link
Contributor

fw42 commented Jun 25, 2014

Yeah ditto, no clue what this is about :-) Can you elaborate?

@eapache
Copy link
Contributor

eapache commented Jun 25, 2014

It's a deadlock, read the inserted comments in the code block in order (1, 2, 3). #90 also includes a stack trace which I believe is the same issue.

@eapache
Copy link
Contributor

eapache commented Jun 25, 2014

Basically: enqueue expects that flushRequest calls errorCb exactly once, so that there is something on the errs channel for it to return. There are some cases where we never call errorCb (or call errorCb multiple times) which both cause a deadlock reading or writing to the errs channel.

@wvanbergen
Copy link
Contributor

OK, now it makes sense to me. Your solutions makes sense to me. I wonder whether we can build tests around this.

@wvanbergen wvanbergen changed the title Block: overlimit=0 Producer deadlock: overlimit=0 Jul 1, 2014
@eapache
Copy link
Contributor

eapache commented Jul 2, 2014

https://groups.google.com/forum/#!topic/sarama-users/8smSO57Ubzg found a way to consistently reproduce the hang, but it requires some dns hackery - I will use that to at least test a local copy of the fix

eapache added a commit that referenced this issue Jul 2, 2014
Also return an error when we fail to re-enqueue, and make sure to close the
channel in the errorCb used for synchronous sends.

In combination, fixes #65
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants