Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: race condition on join() in the redis broker. #481

Merged
merged 8 commits into from
Mar 17, 2022

Conversation

CaselIT
Copy link
Contributor

@CaselIT CaselIT commented Mar 14, 2022

This fixes #480

I'm not sure how to test this, since the race condition is not easy to trigger. @Bogdanp do you have any suggestions?

Also looking at the stub broker it seems that it may have a similar race condition, but I've never triggered a similar issue with it, so I'm not sure if the change (swapping the order or the delay queue and normal one) is needed

queues = [
self.queues[queue_name],
self.queues[dq_name(queue_name)],
]
except KeyError:
raise QueueNotFound(queue_name) from None
deadline = timeout and time.monotonic() + timeout / 1000
while True:
for queue in queues:
timeout = deadline and deadline - time.monotonic()
join_queue(queue, timeout=timeout)

@Bogdanp
Copy link
Owner

Bogdanp commented Mar 15, 2022

I think this likely doesn't get rid of the race. A better fix would be to change qsize so that it returns the size of the queue and its associated delay queue. Then we can replace the for loop with size = self.do_qsize(queue_name). Let me know if you'd like to do that. Otherwise, I'll do it when I have some free time.

Re. testing, something like the following pseudocode might work:

qsize_ev = threading.Event()
qsize_ev.set()
actor_ev = threading.Event()
old_qsize = broker.do_qsize
def qsize(*args, **kwargs):
  qsize_ev.wait()
  qsize_ev.clear()
  res = old_qsize(*args, *kwargs)
  actor_ev.set()
  return res
broker.do_qsize = qsize

@actor
def a():
  actor_ev.wait()
  b.send_with_options(delay=1000)
  qsize_ev.set()  # this might need to be delayed somehow

@actor
def b():
  pass

broker.join()

The idea being to force an ordering of events that we know is wrong, specifially:

* qsize checks delay, it's empty
* a() launches delayed b()
* qsize checks non-delay, a is finished so it's empty
* b() is in delay queue, but the broker isn't going to check it again

@CaselIT
Copy link
Contributor Author

CaselIT commented Mar 15, 2022

I think this likely doesn't get rid of the race

yes, the race is still there, but instead of not counting the message, it would be double counted (the delayed message in the delay queue and the normal message in the normal queue) fixing the issue from the join() prospective.

That said I agree that's probably better to update qsize, since that would most likely get rid of the rage condition altogether.

Thanks also for the test suggestion, I'll try working it and update the PR

@CaselIT
Copy link
Contributor Author

CaselIT commented Mar 15, 2022

I've added a test for that specific regression, but I think we may as well improve qsize

@CaselIT
Copy link
Contributor Author

CaselIT commented Mar 15, 2022

updated. the test now does not make much sense, but it continues to pass

@Bogdanp
Copy link
Owner

Bogdanp commented Mar 16, 2022

Thanks! I think we might as well drop the test at this point since that kind of ordering is no longer possible (and we're unlikely to ever regress).

@CaselIT
Copy link
Contributor Author

CaselIT commented Mar 16, 2022

Maybe I can change it to check that qsize returns all the messages in the queue & delay queue

@CaselIT
Copy link
Contributor Author

CaselIT commented Mar 16, 2022

btw @Bogdanp it seems that the ci on windows should be updated, since windows 2016 is being retired

@CaselIT
Copy link
Contributor Author

CaselIT commented Mar 16, 2022

Thanks! I think we might as well drop the test at this point since that kind of ordering is no longer possible (and we're unlikely to ever regress).

Updated the test. I think it's ready for a review

@CaselIT
Copy link
Contributor Author

CaselIT commented Mar 16, 2022

btw @Bogdanp it seems that the ci on windows should be updated, since windows 2016 is being retired

doing a simple update breaks the libmemcached compile: see the action run on this commit CaselIT@39285f3

@Bogdanp
Copy link
Owner

Bogdanp commented Mar 17, 2022

doing a simple update breaks the libmemcached compile: see the action run on this commit CaselIT@39285f3

Yes, I'm afraid that's going to be stuck until someone can update https://github.com/ryansm1/libmemcached-win/tree/yshurik-win to work w/ 2019.

@Bogdanp Bogdanp merged commit 0b72f13 into Bogdanp:master Mar 17, 2022
@CaselIT CaselIT deleted the fix_race_condition branch March 17, 2022 06:37
@CaselIT
Copy link
Contributor Author

CaselIT commented Mar 17, 2022

doing a simple update breaks the libmemcached compile: see the action run on this commit CaselIT@39285f3

Yes, I'm afraid that's going to be stuck until someone can update https://github.com/ryansm1/libmemcached-win/tree/yshurik-win to work w/ 2019.

I guess an option is to disable mamcached test on windows. In it doesn't compile chances are that people are not going to use it. I can open a PR of that's an acceptable solution

@brianseodd
Copy link

brianseodd commented Apr 19, 2022

@CaselIT would it be safe to install build the module from 1.13.0 to absorb this change? I'm getting a lot of issues with dramatiq that I suspect are resolved by this PR

edit: using debian

@CaselIT
Copy link
Contributor Author

CaselIT commented Apr 19, 2022

I don't think this patch is released on pypi yet

@brianseodd
Copy link

@CaselIT Yeah I was planning on just building it from master offline if you don't think there is anything stopping this from being viable. I'm getting something related to this and getting errors the below and it seems to be related to when the queue is completely exhausted or something.

ResponseError: Error running script (call to f_1b0bca8bf4560c1f4a84ff992da836e49515f830): @user_script:128: @user_script: 128: Unknown Redis command called from Lua script ```


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition in redis broker join
3 participants