[CRITICAL] Consumer encountered an unexpected error, AttributeError: 'NoneType' object has no attribute 'decode' #266

Closed
vkatochka opened this issue Jan 15, 2020 · 21 comments


@vkatochka

dramatiq 1.7.0

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic

dramatiq bwfut -Q home -p1 -t1 --pid-file /var/www/.../tmp/dsrv-home-bwfut.pid --log-file /var/www/.../logs/dsrv-home-bwfut.log --path /var/www/.../ & disown

[2020-01-15 22:39:17,008] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Sending signal 1 to worker processes...
[2020-01-15 22:39:17,011] [PID 17867] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Stopping worker process...
[2020-01-15 22:39:17,012] [PID 17867] [MainThread] [dramatiq.worker.Worker] [INFO] Shutting down...
[2020-01-15 22:39:18,127] [PID 17867] [MainThread] [dramatiq.worker.Worker] [INFO] Worker has been shut down.
[2020-01-15 22:39:18,260] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Dramatiq '1.7.0' is booting up.
[2020-01-15 22:39:18,280] [PID 19075] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Worker process is ready for action.
[2020-01-15 22:43:48,642] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Sending signal 1 to worker processes...
[2020-01-15 22:43:48,643] [PID 19075] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Stopping worker process...
[2020-01-15 22:43:49,357] [PID 19075] [MainThread] [dramatiq.worker.Worker] [INFO] Shutting down...
[2020-01-15 22:43:50,450] [PID 19075] [MainThread] [dramatiq.worker.Worker] [INFO] Worker has been shut down.
[2020-01-15 22:43:50,613] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Dramatiq '1.7.0' is booting up.
[2020-01-15 22:43:50,634] [PID 20225] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Worker process is ready for action.
[2020-01-15 22:45:18,960] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Sending signal 1 to worker processes...
[2020-01-15 22:45:18,962] [PID 20225] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Stopping worker process...
[2020-01-15 22:45:18,987] [PID 20225] [MainThread] [dramatiq.worker.Worker] [INFO] Shutting down...
[2020-01-15 22:45:20,018] [PID 20225] [MainThread] [dramatiq.worker.Worker] [INFO] Worker has been shut down.
[2020-01-15 22:45:20,155] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Dramatiq '1.7.0' is booting up.
[2020-01-15 22:45:20,176] [PID 21376] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Worker process is ready for action.
[2020-01-15 22:46:48,734] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Sending signal 1 to worker processes...
[2020-01-15 22:46:48,735] [PID 21376] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Stopping worker process...
[2020-01-15 22:46:49,521] [PID 21376] [MainThread] [dramatiq.worker.Worker] [INFO] Shutting down...
[2020-01-15 22:46:51,074] [PID 21376] [MainThread] [dramatiq.worker.Worker] [INFO] Worker has been shut down.
[2020-01-15 22:46:51,207] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Dramatiq '1.7.0' is booting up.
[2020-01-15 22:46:51,230] [PID 22532] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Worker process is ready for action.
[2020-01-15 22:50:26,492] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Sending signal 1 to worker processes...
[2020-01-15 22:50:26,494] [PID 22532] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Stopping worker process...
[2020-01-15 22:50:27,059] [PID 22532] [MainThread] [dramatiq.worker.Worker] [INFO] Shutting down...
[2020-01-15 22:50:28,141] [PID 22532] [MainThread] [dramatiq.worker.Worker] [INFO] Worker has been shut down.
[2020-01-15 22:50:28,276] [PID 17861] [MainThread] [dramatiq.MainProcess] [INFO] Dramatiq '1.7.0' is booting up.
[2020-01-15 22:50:28,297] [PID 23687] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Worker process is ready for action.
[2020-01-15 22:57:00,399] [PID 23687] [Thread-3] [dramatiq.worker.ConsumerThread(home)] [CRITICAL] Consumer encountered an unexpected error.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/dramatiq/worker.py", line 253, in run
    for message in self.consumer:
  File "/usr/local/lib/python3.6/dist-packages/dramatiq/brokers/redis.py", line 318, in __next__
    message = Message.decode(data)
  File "/usr/local/lib/python3.6/dist-packages/dramatiq/message.py", line 96, in decode
    return cls(**global_encoder.decode(data))
  File "/usr/local/lib/python3.6/dist-packages/dramatiq/encoder.py", line 52, in decode
    return json.loads(data.decode("utf-8"))
AttributeError: 'NoneType' object has no attribute 'decode'
[2020-01-15 22:57:00,402] [PID 23687] [Thread-3] [dramatiq.worker.ConsumerThread(home)] [INFO] Restarting consumer in 3.00 seconds.

I copied the worker, reloaded it with the command "kill -HUP 17861", and sent one task to the queue (repeated several times, as can be seen from the log). At some point, the error appeared in the log. After this error, tasks ran and finished normally.

@AgeofCreations

Are you sure you have access to your message broker?
I got the same error when my broker was down.

@vkatochka
Author

Yes, Redis was working. Before and after that, I queued tasks in Redis and everything worked. I cannot reproduce the error, but I will keep observing.

@synweap15
Contributor

I've seen this pop up after worker reloads with HUP since 1.3.0 using the Redis broker, but it seems no messages are being dropped.

@AgeofCreations

Well, this error also occurred for me while Redis was working, so I'm subscribing to this issue.

@Bogdanp
Owner

Bogdanp commented Feb 2, 2020

I think this is a duplicate of #224. Please give 1.8.0 a try (it will be released later today) and let me know in that issue if you still encounter this problem.

@Bogdanp Bogdanp closed this as completed Feb 2, 2020
@tonycosentini

I was seeing this on 1.7.0 and am still seeing it on 1.8.0.

We're running Dramatiq on Heroku; whenever a new release is deployed, we get one instance of this crash per Dramatiq worker.

Let me know if there's anything else that would be helpful to debug this.

@Bogdanp Bogdanp reopened this Feb 6, 2020
@Bogdanp Bogdanp added this to the v1.9.0 milestone Feb 6, 2020
@tonycosentini

Some follow-up on this: I think this bug is actually fixed in 1.8.0.

Sorry for the hazy details, but we had some events that were stuck in a weird state from this bug; the broker would hold on to them forever. I ended up re-queuing them and everything is good now - not seeing this exception anymore during deploys.

@Bogdanp
Owner

Bogdanp commented Feb 25, 2020

Thank you for following up! Closing this again. If anyone else runs into this problem on 1.8, please let me know.

@Bogdanp Bogdanp closed this as completed Feb 25, 2020
@mparent61

mparent61 commented Jul 9, 2020

We started seeing this exact issue recently with both 1.8 and 1.9, after initially running without any problems for several months.

I suspect it's related to worker reloads re-enqueuing messages.

A possible clue: our Sentry Redis integration indicates that the last Redis command prior to the exception sets the "do maintenance" flag to 1; not sure if it's related (see the second-to-last argument below):

redis | EVALSHA 'e9668bc413bd4a2d63c8108b124f5b7df0d01263' 1 'api-broker' 'fetch' 1594306262596 'default.DQ' '43422391-f71b-4493-82b5-25ca6f01534a' 60000 604800000 1 7992

Is there any command we could run to purge/clear everything in Redis, including any messages held on to by workers? Or will these stale messages eventually time out in a week (if I read the expiration code correctly)?

@Bogdanp
Owner

Bogdanp commented Jul 10, 2020

Thanks for the report, @mparent61. Was there any other relevant information in your Redis log around that time? How often does this occur?

As far as clearing out Redis goes, the safest approach would be to stop all the workers and then delete all the keys in Redis prefixed by dramatiq. If you only use Redis with Dramatiq, then it would also be safe to just run flushall.
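
For reference, a minimal sketch of the first approach using redis-py, assuming the default dramatiq: key prefix and a locally reachable Redis instance; stop all workers before running it:

```python
import redis

client = redis.Redis(host="localhost", port=6379, db=0)

# SCAN instead of KEYS so a large keyspace isn't blocked while iterating.
for key in client.scan_iter(match="dramatiq:*", count=1000):
    client.delete(key)

# If this Redis instance is dedicated to Dramatiq, client.flushall()
# achieves the same result in one call.
```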

@Bogdanp Bogdanp reopened this Jul 10, 2020
@Bogdanp Bogdanp modified the milestones: v1.9.0, 1.10.0 Jul 10, 2020
@mparent61

It occurs on every deployment (stop workers, start new workers), usually 1-10 minutes after the new workers start up. Only one new worker hits the error each time; it then restarts and seems to be fine.

This only occurs in 1 of our 2 separate Redis/Dramatiq clusters, so it seems like that cluster is somehow stuck in a bad state.

I ran flushall via the Redis CLI and that seems to have fixed the problem - thanks for the suggestion.

Unfortunately I don't have any Redis logs available (we're using AWS ElastiCache and they're not on by default) - but will check logs if this starts happening again.

@Bogdanp Bogdanp removed this from the 1.10.0 milestone Oct 18, 2020
@denizdogan

For what it's worth, I'm seeing this in 1.10.0 as well. I don't recall it ever happening before, but now it is. I'm curious whether it has anything to do with the redis dependency version or something.

@Incognito

Incognito commented Mar 2, 2021

I see lots of these errors per week in one project running 1.9.0, so it is at least that old. Not sure which version it was introduced in, however.

We run the Redis broker, so maybe that's the common factor here (flushall seemed to resolve it in a previous comment). Is bad data possibly being requeued? We also use Sentry (a previous post that discusses the requeue theory has this setup too).

@denizdogan

We run the Redis broker, so maybe that's the common factor here (flushall seemed to resolve it in a previous comment). Is bad data possibly being requeued? We also use Sentry (a previous post that discusses the requeue theory has this setup too).

We use Sentry as well, so +1 on that possibility I guess?

@synweap15
Contributor

I'm using Bugsnag, not Sentry, and I'm seeing these as well.

@mwek

mwek commented Apr 7, 2021

The exact same exception is raised whenever one of the Redis message IDs is not present in the .msgs HSET. This is because the HMGET call returns empty results (later translated to None) if the message ID is not found. IIUC, this will cause all Redis messages in the cache (RedisConsumer.message_cache) to be lost and not requeued. One can reproduce this by pushing a bogus Redis ID onto the queue (RPUSH dramatiq:default this_id_does_not_exist).

IIUC, duplicate message IDs can be pushed to the queue when two workers independently perform maintenance on the same dead worker.

@Bogdanp the simplest fix for this would be to skip over the None messages returned from Redis in the __next__ call in the RedisBroker. However, I'm not sure whether this would just make the error even harder to debug in the case of a race condition (the Redis ID is first put on the queue, then into the .msgs HSET). I'm happy to prepare a pull request with such an implementation. Any thoughts?
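
For anyone who wants to try this locally, a small reproduction sketch with redis-py, under the assumptions that the queue key is dramatiq:default and the corresponding message hash is dramatiq:default.msgs (per the .msgs suffix above); key names may differ with a custom namespace:

```python
import redis

client = redis.Redis()

# Push a message ID that has no entry in the .msgs hash. A worker consuming
# this queue gets None back from HMGET for this ID and crashes with
# "AttributeError: 'NoneType' object has no attribute 'decode'".
client.rpush("dramatiq:default", "this_id_does_not_exist")

# The broker-side lookup for the bogus ID returns [None]:
print(client.hmget("dramatiq:default.msgs", "this_id_does_not_exist"))
```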

@Bogdanp
Owner

Bogdanp commented Apr 8, 2021

IIUC, duplicate message IDs can be pushed to the queue when two workers independently perform maintenance on the same dead worker.

That's not possible. When that Lua script runs, it's the only thing that can run on the Redis DB. Think of it kind of like a serializable transaction; other scripts have to wait for it to finish before they can run.

For that same reason, it seems unlikely that you'd have an id in the queue without a corresponding entry in the .msgs hash, unless you have key expiration turned on. Then again, there could definitely be some kind of weird interaction where the state of things ends up that way.

Maybe the right fix here is to make the Lua script a little more defensive (a rough client-side sketch follows the list):

  • change ack and nack to check that srem succeeds before doing anything else
  • change the rpush in the do_maintenance block to only push if the message hexists in queue_messages
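
As mentioned above, here is a rough client-side Python sketch of those two guards. The actual fix belongs in Dramatiq's server-side Lua script, where these steps run atomically; the key names here (including the ack-set name) are illustrative assumptions only:

```python
import redis

client = redis.Redis()

QUEUE = "dramatiq:default"
MSGS = "dramatiq:default.msgs"
ACKS = "dramatiq:default.acks"  # hypothetical per-worker ack/unacked set name


def ack(message_id: str) -> None:
    # Guard 1: only touch the message data if SREM actually removed the id.
    if client.srem(ACKS, message_id):
        client.hdel(MSGS, message_id)


def requeue_dead_worker_message(message_id: str) -> None:
    # Guard 2: only re-queue ids that still have message data; orphaned ids
    # are skipped instead of being pushed back onto the queue.
    if client.hexists(MSGS, message_id):
        client.rpush(QUEUE, message_id)
```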

Bogdanp added commits that referenced this issue Apr 18, 2021:

Previously, there could be a race condition between a worker hitting
its heartbeat timeout and then shutting down, which could lead to the
same message id being put back into the queue twice, which might be
what caused the issues reported in #266.
@william-scalexp

william-scalexp commented May 9, 2021

Would love to see this patch go out in a release.
We're using the Dramatiq Redis broker and ran into this error. It started happening when I was testing the workers' behavior after an unexpected shutdown while running a long task.

@azhard4int

We are noticing this error frequently in our production environment too. A patch/fix would be really helpful, as it is flooding our Sentry with errors.

@Bogdanp
Owner

Bogdanp commented Aug 10, 2021

@azhard4int are you running v1.11?

@Bogdanp
Owner

Bogdanp commented Jul 14, 2023

Closing for now since no new issues have been reported in a while. Feel free to reopen if you're still impacted by this.

@Bogdanp Bogdanp closed this as completed Jul 14, 2023