cpu time used in sr3 is higher than v2 for equivalent configuration. #1035

Comparing two pumps, it looks like the sr3 one consumes more cpu time than the v2 one for an equivalent configuration. This issue will be used to investigate a bit.

Comments
I notice that in v2, in the amqp consumer, when no message arrives, it does exponential backoff in `self.get_message()`:

```python
self.raw_msg = self.consumer.consume(self.queue_name)

# if no message from queue, perhaps we have a message to retry
if self.raw_msg is None:
    self.raw_msg = self.retry.get()

# when no message, sleep for 1 sec. (value taken from old metpx)
# *** value 0.01 was tested and would simply raise cpu usage of broker
# to unacceptable level with very few processes (~20) trying to consume messages
# remember that instances and broker sharing messages add up to a lot of consumers
should_sleep = False
if self.raw_msg is None:
    should_sleep = True
elif self.raw_msg.isRetry and self.last_msg_failed:
    should_sleep = True

if should_sleep:
    try:
        time.sleep(self.sleep_now)
    except:
        self.logger.info("woke from sleep by alarm.. %s " % self.msg.notice)

    self.sleep_now = self.sleep_now * 2
    if self.sleep_now > self.sleep_max:
        self.sleep_now = self.sleep_max
```

In sr3, it instantaneously checks whether a message is there, and returns immediately, relying on the global loop.
... but it looks like sr3 has the same logic in the main loop:

```python
if not stopping:
    self.gather()

last_gather_len = len(self.worklist.incoming)

if (last_gather_len == 0):
    spamming = True
else:
    current_sleep = self.o.sleep
    spamming = False

# ...

if spamming and (current_sleep < 5):
    current_sleep *= 2
```
Over the period of a week, we see that, on the servers in question, the cpu usage keeps climbing.

My guess: memory fragmentation... the GC is spending more and more time walking ever more fragmented chains.
Reading about GC: https://docs.python.org/3/library/gc.html

Note: no memory leak observed... is that true? Looking at the housekeeping outputs, memory usage is stable over several days: it starts at 148 MiB and grows to 179 MiB after running for a week. Based on the above, a bunch of experiments to try (the gc knobs involved are sketched below).

ugh... gc.freeze() is tagged as added in 3.7...
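Since the experiments that follow revolve around these knobs, here is a minimal, self-contained sketch of the gc controls in question (thresholds, manual collection, freeze). The threshold value is illustrative only, not the value actually used in the experiments:

```python
import gc
import sys

# inspect the default generational thresholds (typically (700, 10, 10))
print(gc.get_threshold())

# "tuned gc": raise the generation-0 threshold so collections run less often.
# 50000 is an illustrative value, not the one used in the experiments below.
gc.set_threshold(50000, 10, 10)

# "manual gc": turn off automatic collection and invoke it at convenient
# moments (e.g. at housekeeping time, when the process is otherwise idle).
gc.disable()
print("objects collected:", gc.collect())

# gc.freeze() moves everything currently tracked into a permanent generation
# that the collector never scans again... but it only exists in python >= 3.7.
if sys.version_info >= (3, 7):
    gc.freeze()
    print("frozen objects:", gc.get_freeze_count())
```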
Created a plugin, mem_gc_monitor... given multiple instances, divide them into three groups by instance number, and look for a difference in cpu usage between the instances. A rough sketch of the idea is below.
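Roughly, the idea is something like the following sketch. It assumes the sr3 flow-callback interface (a sarracenia.flowcb.FlowCB subclass with an on_housekeeping entry point) and assumes the instance number is available as options.no; the actual mem_gc_monitor may differ in detail:

```python
import gc
import logging
import time

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)


class Mem_gc_monitor(FlowCB):
    """
    Split instances into three gc regimes by instance number modulo 3:
      0 -> leave the gc alone (baseline)
      1 -> tuned gc (larger generation-0 threshold)
      2 -> automatic gc disabled; collect manually at housekeeping time
    then compare cpu usage per instance in the logs.
    """

    def __init__(self, options):
        super().__init__(options)
        self.o = options
        # assumption: the instance number is available as options.no
        self.regime = self.o.no % 3
        if self.regime == 1:
            gc.set_threshold(50000, 10, 10)   # illustrative value
        elif self.regime == 2:
            gc.disable()

    def on_housekeeping(self):
        start = time.perf_counter()
        collected = gc.collect() if self.regime == 2 else 0
        elapsed = time.perf_counter() - start
        logger.info("regime=%d gc_counts=%s collected=%d gc_time=%.4fs"
                    % (self.regime, gc.get_count(), collected, elapsed))
```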
Sample run... after running for 7 minutes: instance 1 has the most cpu time (un-tuned gc) and the tuned gc has the least; manually invoking the gc is in between. Need to watch for longer.
After running overnight: no obvious pattern. Had debug on for the gc runs as well, and they seem fast too.
Notes: so a work-around for this issue is periodic restarts.
After leaving it for a five-day run: instances 2 and 3 consumed about 11% less cpu. Instance 1 might be special... but if it isn't, then the other two got about an 11% cpu benefit from gc tuning, and slightly more from just invoking the gc manually when convenient. On the last day, grepping the logs: way fewer gc calls, but they are more expensive... yet the total user time is less... a complete mystery.
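For reference, one way to get greppable per-collection log lines like that is the standard gc.callbacks hook (python 3.3+), sketched below; this is not necessarily how mem_gc_monitor produces its lines:

```python
import gc
import logging
import time

logger = logging.getLogger(__name__)
_gc_start = {}


def _gc_timer(phase, info):
    # called by the interpreter around every collection; keep it lightweight
    gen = info["generation"]
    if phase == "start":
        _gc_start[gen] = time.perf_counter()
    else:  # phase == "stop"
        elapsed = time.perf_counter() - _gc_start.pop(gen, time.perf_counter())
        logger.info("gc gen=%d collected=%d elapsed=%.4fs"
                    % (gen, info["collected"], elapsed))


gc.callbacks.append(_gc_timer)
```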
Looking at another configuration, where top shows cpu consumption between 98% and 100%, I look at the reported cpu times between housekeeping calls... so in five minutes it's reporting about 8 seconds of cpu usage... which is kind of strange for something at 100% cpu.
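For context, a minimal sketch of that kind of measurement: cpu seconds consumed by the process between two housekeeping passes versus the wall-clock interval. Names here are illustrative, not the actual sr3 code; time.process_time() counts user+system cpu of the current process only, not children:

```python
import time


class CpuReporter:
    """Report cpu seconds used by this process between housekeeping calls."""

    def __init__(self):
        self.last_cpu = time.process_time()     # user+system cpu of this process
        self.last_wall = time.monotonic()

    def on_housekeeping(self):
        cpu, wall = time.process_time(), time.monotonic()
        used, interval = cpu - self.last_cpu, wall - self.last_wall
        self.last_cpu, self.last_wall = cpu, wall
        pct = 100.0 * used / interval if interval > 0 else 0.0
        print("cpu %.1fs over %.0fs wall (%.1f%%)" % (used, interval, pct))
```

For scale, 8 cpu-seconds over a 300-second interval is about 2.7%, so whatever the report is counting, it is not the same quantity top is showing at ~100%.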
I'm worried that instance 1 does extra work, and so the comparison might not be good. Adding a fourth instance... the fourth should be directly comparable to 2 and 3.
Looking through with ptrace, I saw excessive file operations in the retry logic... found one issue... but I don't think it will make a big difference.
I wrote this, but I think it's wrong... It does short-circuit, because after the first failure to retrieve the file it tries... these two file ops are completely useless most (>99%) of the time.
The branch origin/issue1035_retry_too_much eliminates all i/o from the main run loop, so now there is no file i/o most times the loop runs. (A sketch of the general idea is below.)
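The branch itself has the details; this is only a sketch of the general idea, with hypothetical names: remember whether the on-disk retry queue had anything in it the last time it was checked, and skip the stat/open calls in the hot path while it is known to be empty.

```python
import os
import time


class RetryQueue:
    """Sketch: a disk-backed retry queue that avoids touching the
    filesystem on every pass through the main loop (hypothetical names)."""

    def __init__(self, path, recheck_interval=30):
        self.path = path
        self.recheck_interval = recheck_interval
        self.known_empty_until = 0.0            # monotonic timestamp

    def get(self):
        now = time.monotonic()
        # hot path: while the file is known to be empty, do no i/o at all
        if now < self.known_empty_until:
            return None
        try:
            empty = os.path.getsize(self.path) == 0
        except OSError:                          # file doesn't exist yet
            empty = True
        if empty:
            # remember that, and skip the per-loop stat/open for a while
            self.known_empty_until = now + self.recheck_interval
            return None
        with open(self.path, "r") as f:
            line = f.readline().rstrip("\n")
        # (a real retry queue would also have to consume/mark the entry;
        #  omitted -- this sketch is only about skipping per-loop file ops.)
        return line or None
```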
Another weird behaviour: at the end of each loop... so even when there is nothing going on, it does that... which is fairly pointless, fairly often. Moved this logic to be executed only at housekeeping intervals (sketched below).
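A minimal sketch of that change, with illustrative names (the real sr3 loop is more involved): gate the bookkeeping on the housekeeping interval instead of running it at the bottom of every pass.

```python
import time


def run_loop(o, gather, do_housekeeping, stop_requested):
    """Run the expensive bookkeeping only every o.housekeeping seconds,
    instead of at the bottom of every pass through the loop."""
    next_housekeeping = time.monotonic() + o.housekeeping
    while not stop_requested():
        gather()
        if time.monotonic() >= next_housekeeping:
            do_housekeeping()                    # metrics, cleanup, etc.
            next_housekeeping = time.monotonic() + o.housekeeping
        time.sleep(o.sleep)
```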
Another weird behaviour:
I'm seeing this error on dev, running d06ac13. EDIT: I did a git pull to use the latest development branch commit, and I think the bug is gone.
The performance is still better (reduced i/o overhead) and the logging is lighter, but it is not a huge difference.
But after two weeks, the load still climbs into the 20s, and flows that should be fast... are slow. So keeping this open for now... the work-around is to restart everything once a week, which is very safe to do (unlike v2) because state recovers well from restarts.
I profiled a couple of sarras and a sender over the weekend to see if that would give any more insight into the increasing CPU usage. I'll also document the process I've been using:

```bash
# start the profiling - I use an instance number 1 higher than the configured maximum number of
# instances, so it doesn't conflict with other running processes, and so sanity doesn't kill it.
python3 -m cProfile -o /local/home/sarra/cprofile_output_file.dat /local/home/sarra/sr3/sarracenia/instance.py --no 7 start component/config &
profilepid=$!
disown $profilepid

# when you want to stop, send a signal to the pid
kill $profilepid

# Generate the graph:
# https://stackoverflow.com/questions/843671/profiling-in-python-who-called-the-function
gprof2dot -n 0.05 -e 0.01 -f pstats $file | dot -Tpng -o ./"${file/dat/png}"
gprof2dot -n 0.05 -e 0.01 -f pstats $file | dot -Tsvg -o ./"${file/dat/svg}"
```

Or if you want to get the results in text format:

```python
import pstats

ps = pstats.Stats('cprofile_output_file.dat')
ps.sort_stats('cumulative').print_stats()
```

The only thing that jumps out is that SFTP accounts for a lot of time in the sarra. I think we expect SFTP to be a bit CPU intensive, since the encryption takes time, but the SFTP code in the sarra took a huge percentage of the CPU usage/time, and I don't see the same in the sender. The feed I was testing with downloaded 197899 files, 156.649 GB total, over ~3 days. I'm going to try to confirm there isn't a config issue on the server we're pulling from for the SFTP sarra.

[profile graphs attached: sarra (sftp), sarra (http), sender (sftp)]
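To narrow that text output down, pstats also accepts restrictions on print_stats(): an integer keeps the top N lines, and a string is treated as a regex on the function name, which is handy for pulling out just the sftp entries:

```python
import pstats

ps = pstats.Stats('cprofile_output_file.dat')

# top 20 entries by cumulative time
ps.sort_stats('cumulative').print_stats(20)

# only entries matching 'sftp', sorted by time spent in the function itself
ps.sort_stats('tottime').print_stats('sftp')
```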
It would be cool to have some v2's profiled the same way... as a baseline.