Forking doesn't work sometimes #297
Comments
Tell me more about what you're doing at the worker level that is affecting your forks. I think it'd be safe to move the forks to after worker init, but I want to understand your use case better. Also, are you able to work around this problem by making the forks wait/retry?
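For reference, a wait/retry workaround could look something like the sketch below. This is not dramatiq's API; `_run_exposition_server` is the real fork target from `dramatiq.middleware.prometheus`, but the wrapper name and the retry parameters here are made up for illustration, and the assumption is that the fork fails with `OSError` while the port is contended:

```python
import time


def run_with_retries(fork_fn, attempts=5, delay=2.0):
    """Call fork_fn, retrying on OSError (e.g. the port is not free yet)."""
    for attempt in range(1, attempts + 1):
        try:
            return fork_fn()
        except OSError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


# Hypothetical usage as a fork function:
# def run_prometheus_fork():
#     from dramatiq.middleware.prometheus import _run_exposition_server
#     return run_with_retries(_run_exposition_server)
```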
I'll try...

I could try to help with this problem by contributing to dramatiq, if you prefer.

I think it shouldn't be a hard change to make and I think I'd prefer to make it myself since there are a few rough edges in the cli module. I just want to make sure I have a good understanding of the problem and the implications of making the fork functions wait for initialization. Can you post that middleware, maybe with irrelevant stuff stripped?
I was wrong, we are not redefining `init`. Our middleware:

```python
import os

import requests
from dramatiq.common import current_millis
from dramatiq.middleware.prometheus import Prometheus as OriginalPrometheus, DB_PATH


class PrometheusMiddleware(OriginalPrometheus):
    default_labels = ("queue_name", "actor_name")

    def _init_meta(self):
        self._meta = requests.get("...").json()

    def get_labels(self):
        return list(self.default_labels) + list(self._meta.keys())

    def get_message_labels(self, message):
        return tuple([message.queue_name, message.actor_name, *list(self._meta.values())])

    def after_process_boot(self, broker):
        self._init_meta()
        os.environ["prometheus_multiproc_dir"] = DB_PATH
        # This import MUST happen at runtime, after process boot and
        # after the env variable has been set up.
        import prometheus_client as prom

        labels = self.get_labels()
        self.logger.debug("Setting up metrics...")
        registry = prom.CollectorRegistry()
        self.total_messages = prom.Counter(
            "dramatiq_messages_total", "The total number of messages processed.", labels, registry=registry
        )
        ...
        self.add_custom_metrics(prom, labels, registry)

    def after_nack(self, broker, message):
        labels = self.get_message_labels(message)
        self.total_rejected_messages.labels(*labels).inc()

    def after_enqueue(self, broker, message, delay):
        if "retries" in message.options:
            labels = self.get_message_labels(message)
            self.total_retried_messages.labels(*labels).inc()

    def before_delay_message(self, broker, message):
        ...

    def before_process_message(self, broker, message):
        ...

    def after_process_message(self, broker, message, *, result=None, exception=None):
        del result
        labels = self.get_message_labels(message)
        message_start_time = self.message_start_times.pop(message.message_id, current_millis())
        message_duration = current_millis() - message_start_time
        self.message_durations.labels(*labels).observe(message_duration)
        self.inprogress_messages.labels(*labels).dec()
        self.total_messages.labels(*labels).inc()
        if exception is not None:
            self.total_errored_messages.labels(*labels).inc()
        self.set_custom_metrics(message)

    after_skip_message = after_process_message

    def set_custom_metrics(self, message):
        ...  # copy-paste from OriginalPrometheus middleware
```
Thanks! Just to make sure changing this would fix things for you, can you try making your own version of the fork function that sleeps for a few seconds before calling `_run_exposition_server`? Something along these lines:

```python
def run_prometheus_fork():
    sleep(10)
    from dramatiq.middleware.prometheus import _run_exposition_server
    _run_exposition_server()
```
We can confirm this issue also for the default Prometheus middleware. We have been able to successfully work around the issue by using a fork function that sleeps before starting the server:

```python
from time import sleep

from dramatiq.middleware.prometheus import _run_exposition_server


def run_prometheus_fork():
    sleep(10)
    logger.debug("Starting Prometheus server after 10s sleep.")
    try:
        _run_exposition_server()
    except OSError:
        logger.debug("Prometheus server already started.")
```

We noticed that we could reproduce the issue by increasing the number of processes from 2 (matching the CPU count) to a higher number like 4-20.
Thanks @Ecno92. I think there is definitely a race here that I'll have to fix. I'll take a look sometime this week.
This worked for me: in `_run_exposition_server`, ask the OS for a free port instead of binding the fixed `HTTP_PORT`:

```python
def _run_exposition_server():
    logger = get_logger(__name__, "_run_exposition_server")
    logger.debug("Starting exposition server...")
    try:
        import socketserver

        # Bind to port 0 so the OS picks a free port, then release it.
        # (A small race remains: the port could be taken again before
        # HTTPServer binds it, but in practice it is free.)
        with socketserver.TCPServer(("localhost", 0), None) as s:
            free_port = s.server_address[1]

        address = (HTTP_HOST, free_port)
        # address = (HTTP_HOST, HTTP_PORT)
        httpd = HTTPServer(address, _metrics_handler)
        httpd.serve_forever()
    except KeyboardInterrupt:
        logger.debug("Stopping exposition server...")
        httpd.shutdown()

    return 0
```
Thank you for Dramatiq.

We use dramatiq and Prometheus widely in production, and after updating the dramatiq version we found that sometimes (roughly 50% of the time) dramatiq does not start listening on the port (the forks from the Prometheus middleware).

My suggestion is that we should wait until worker init completes (we have additional code in one of our middlewares whose execution may take 0-1 second) and only then run the forks. What do you think? (1ff83f5#diff-597db348179f5c72c5ad215480791da6R443)

Here are the logs from a process that did not start:

python:3.7-slim
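This is not dramatiq's implementation, but the ordering being proposed (run the forks only after worker init) can be sketched with a `multiprocessing.Event` as the synchronization primitive; the function names here are illustrative, and in a real setup the Event would have to be created before forking:

```python
import multiprocessing
import time


def init_worker(ready):
    # Stand-in for worker/middleware initialization; in the case above
    # this is the after_process_boot work that takes 0-1 second.
    time.sleep(0.1)
    ready.set()  # signal that init is complete


def fork_target(ready, timeout=30.0):
    # The fork (e.g. the Prometheus exposition server) blocks here until
    # worker init has signalled completion, then proceeds to bind its port.
    if not ready.wait(timeout):
        raise RuntimeError("worker init never completed")
    return "started"
```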