cpu time used in sr3 is higher than v2 for equivalent configuration. #1035

petersilva · 2024-05-01T21:17:20Z

Comparing two pumps, it looks like the sr3 one consumes more cpu time than the v2.
This issue will be used to investigate a bit.

petersilva · 2024-05-01T21:20:23Z

I notice that in v2 in the amqp consumer, when no message arrives, it does exponential backoff....

        self.get_message()
        self.raw_msg = self.consumer.consume(self.queue_name)

        # if no message from queue, perhaps we have message to retry

        if self.raw_msg is None:
            self.raw_msg = self.retry.get()

        # when no message sleep for 1 sec. (value taken from old metpx)
        # *** value 0.01 was tested and would simply raise cpu usage of broker
        # to unacceptable level with very fews processes (~20) trying to consume messages
        # remember that instances and broker sharing messages add up to a lot of consumers

        should_sleep = False

        if self.raw_msg is None:
            should_sleep = True
        elif self.raw_msg.isRetry and self.last_msg_failed:
            should_sleep = True

        if should_sleep:
            try:
               time.sleep(self.sleep_now)
            except:
               self.logger.info("woke from sleep by alarm.. %s " % self.msg.notice)

            self.sleep_now = self.sleep_now * 2
            if self.sleep_now > self.sleep_max:
                self.sleep_now = self.sleep_max

in sr3, it instantaneously checks if a message is there, and returns immediately relying on the global loop.

petersilva · 2024-05-01T21:32:06Z

... but it looks like sr3 has the same logic... in the main loop...

                if not stopping:
                    self.gather()

                last_gather_len = len(self.worklist.incoming)
                if (last_gather_len == 0):
                    spamming = True
                else:
                    current_sleep = self.o.sleep
                    spamming = False
 .
 .
 . 
               if spamming and (current_sleep < 5):
                    current_sleep *= 2

petersilva · 2024-05-13T15:31:17Z

Over the period of a week, we see that, on the servers in question:

when things initially start, there is high cpu usage for a few polls and some sarra's, but polls calm down.
over a number of days, the cpu usage of busy sarras, which is initially quite low (2%) gradually creeps up. after a week or so, it's hanging at 99%.
restarting resets the clock.

my guess: memory fragmentation. the GC is spending more and more time to find bytes in ever more fragmented chains.

petersilva · 2024-05-13T15:53:00Z

reading about GC:

note: no memory leak observed... is that true? looking at housekeeping outputs, the memory usage is stable over several days... starts at 148 MiB... grows to 179 MB after running for a week.

based on the above... a bunch of experiments to try:

import gc, and gc.freeze() after all the configs are read, since they are stable and unlikely to ever be recovered.
disable gc
manually trigger GC at intervals? so that it runs less often but more thorougly, and a times that make sense.

petersilva · 2024-05-13T16:15:39Z

https://docs.python.org/3/library/gc.html

... you can disable the collector if you are sure your program does not create reference cycles. Automatic collection can be disabled by calling gc.disable().

petersilva · 2024-05-13T16:17:32Z

ugh.... freeze() is tagged ... added in 3.7 ...

petersilva · 2024-05-18T00:59:44Z

created a plugin mem_gc_monitor ... given multiple instances. divide them by 3:

1st (of 3) instances will use default gc settings.
2nd (of 3) instance will have gc turned off, but run manually, explicitly every housekeeping interval
3rd (of 3) will have the threshold for automatic gc raised from 700-> 70000 (a hundred-fold.)

Look for a difference in cpu usage between the instances.
Running on a cluster in dev on some configurations.

edcm-dirt22-1% cat mem_gc_monitor.py

import gc
import logging
import time

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Mem_gc_monitor(FlowCB):

    def __init__(self,options):
       super().__init__(options, logger)

    def on_start(self):

        # every first instance has a stock experience.

        # every 2nd the instances, disable gc, and do it only at housekeeping time.
        if self.o.no % 3 == 2:
            gc.disable()
        # every 3rd instance, raise the gc threshold.
        if self.o.no % 3 == 0:
            gc.set_threshold(70000, 5, 5)

        # enormous output (for every struct it looks at...)
        gc.set_debug(gc.DEBUG_STATS)
        pass

    def on_housekeeping(self):

        logger.info( f"before gc.isenabled={gc.isenabled()} ")
        logger.info( f"before gc.get_stats={gc.get_stats()} ")
        logger.info( f"before gc.get_count={gc.get_count()} ")
        logger.info( f"before gc.get_threshold={gc.get_threshold()} ")

        
        if not gc.isenabled():
             gc_start=time.perf_counter()
            gc.collect()
            logger.info(f"gc took {time.perf_counter()-gc_start}")

        logger.info( f"after gc.isenabled={gc.isenabled()} ")
        logger.info( f"after gc.get_stats={gc.get_stats()} ")
        logger.info( f"after gc.get_count={gc.get_count()} ")
        logger.info( f"after gc.get_threshold={gc.get_threshold()} ")

petersilva · 2024-05-18T01:21:06Z

sample run... after running 7 minutes:

subscribe_dd_swob_03.log:.. 00:59:04,189 ... on_housekeeping Current cpu_times: user=3.33 system=0.52
subscribe_dd_swob_02.log:.. 00:59:10,708 ... on_housekeeping Current cpu_times: user=4.31 system=0.64
subscribe_dd_swob_01.log:... 01:00:51,980 ... on_housekeeping Current cpu_times: user=5.22 system=0.63

instance 1 has the most cpu time (un-tuned gc... ) and the tuned gc has the least. Manually invoking the gc is between the two. Need to watch for longer.

petersilva · 2024-05-18T14:25:55Z

after running over-night:

bunny@host:~/.cache/sr3/log$ grep cpu_times sarra_get_dms-products_0?.log | sort -k2 | tail -8 | sort
sarra_get_dms-products_01.log:2024-05-18 14:18:16,576 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1081.32 system=39.33
sarra_get_dms-products_02.log:2024-05-18 14:18:46,546 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1095.48 system=54.05
sarra_get_dms-products_03.log:2024-05-18 14:18:28,276 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1076.5 system=41.04
sarra_get_dms-products_04.log:2024-05-18 14:18:07,449 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1054.98 system=46.55
sarra_get_dms-products_05.log:2024-05-18 14:18:31,412 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1074.32 system=40.74
sarra_get_dms-products_06.log:2024-05-18 14:18:42,301 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1044.78 system=39.54
sarra_get_dms-products_07.log:2024-05-18 14:18:29,128 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1035.61 system=46.3
sarra_get_dms-products_08.log:2024-05-18 14:17:55,235 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1061.29 system=42.2
bunny@host:~/.cache/sr3/log$

no obvious pattern. had debug on the gc runs also, and they seem fast also.

petersilva · 2024-05-18T14:52:20Z

notes:

a client noticed this behaviour a few years ago... It wasn't clear if it was a one off.
now that we have long running systems (ddsr-dev) we can see the cpu time building over days or a week.
when we restart, the cpu time (aka %cpu as reported by top) drops from 99% to something healthy ... typically below 1%.

so a work-around for this issue is periodic restarts.

petersilva · 2024-05-22T14:04:16Z

after left it for a five day run:

instance 01 (normal gc, no tuning) user=45597.73 system=2072.02
instance 02 (disabled gc, explicit invocations) user=40387.77 system=2067.65
instance 03 (tuned gc, much higher thresholds) user=40580.63 system=2078.31

so instances 2 and 3 consumed about 11% less cpu... instance 1 might be special... but if it isn't then the other two got about 11% cpu benefit from gc tuning, and slightly better from just invoking gc manually when convenient.

on the last day, grepping the logs...

instance 01 (normal gc, no tuning) 52188 calls to the garbage collector. 10.2 sec. gc: run time.
instance 02 (disabled gc, explicit calls) 395 ... 82.86 sec. gc run time. (99 sec reported externally.)
instance 03 (tuned gc) 659 ... 20.48 sec. of gc run time (reported by gc: diags)

so... way fewer gc calls, but they are more expensive... but the total user time is less ... mystère et boule de gomme.

petersilva · 2024-05-22T14:22:43Z

Looking at another configuration, where top shows cpu consumption at between 98% and 100%, I look at the reportted cpu times between housekeeping calls...

2024-05-22 13:57:06,135 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10322.8 system=118.07
2024-05-22 14:02:09,291 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10330.7 system=118.15
2024-05-22 14:07:13,542 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10339.71 system=118.19
2024-05-22 14:12:16,314 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10351.26 system=118.22
2024-05-22 14:17:16,982 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10363.41 system=118.32

so in five minutes ... it's reporting about 8 seconds of cpu usage... which is kind of strange... for 100% cpu.
and the system overtime, isn't budging...

petersilva · 2024-05-22T16:35:56Z

I'm worried that instance 1 does extra work, and so the comparison might not be good. adding a fourth instance.... the forth should be comparable to 2 and 3 directly.

petersilva · 2024-05-23T18:36:11Z

looking through with ptrace I saw excessive fileops in retry logic... found one issue... but I don't think it will make a big difference.

petersilva · 2024-06-01T14:33:01Z

I wrote this, but I think it's wrong...

k.. yes it will make a big difference... the logic as far as I can tell is that ... especially when idle,

as part of gather() ... diskqueue.get() will be called with self.o.batch could be 50 or 100 or something like that...

diskqueue.get() is a loop over the batch size, regardless of whether the queue has any messages in it.

every pass it finds no message it also tries to delete the queue file (because it is empty.) This should cause a short circuit to avoid checking things again, but it doesn't... the logic is slightly off.

so it loops batch times trying the unlink the file every time.

gather() is called A LOT. basically many flows have very short sleeps between gathers.

I could see that getting expensive... just not sure how expensive. Fixing it.

It does short circuit because after the first failure to retrieve the file it tries
to delete it and returns... so each gather() phase with disk queue involves:

check if the file exists, return None
delete the file.
return None

these two file ops are completely useless most (>99%) of the time.
slight logic change will mean no i/o at all, when the retry_queue is known to be empty.

petersilva · 2024-06-03T19:30:13Z

branch origin/issue1035_retry_too_much eliminates all i/o from the main run loop.

check for retry queue existence: gone.
delete retry queue: gone.
checking for existence of metrics directory: gone.
write of metrics file (added metrics_writeInterval set to 5 seconds, vs. every loop)
noticed very expensive write of diff output in a plugin used by the flow_test. now if diff is >10 lines, just say "big"

so now there is no file i/o most times the loop runs.

petersilva · 2024-06-04T20:18:57Z

another weird behaviour:

2024-06-04 20:06:40,633 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:06:40,639 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:06:51,279 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:06:57,694 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:07:04,111 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:07:10,521 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json

at the end of each loop... so when there is nothing going on, it does that... which is relatively pointless relatively often. Moved this logic to be only executed at housekeeping intervals.

9c8ce5e

petersilva · 2024-06-04T21:12:07Z

another weird behaviour:

 sr3 status | head
status:
Component/Config                         Processes   Connection        Lag                              Rates
                                         State   Run Retry  msg data   Queued  LagMax LagAvg  Last  %rej     pubsub messages   RxData     TxData
                                         -----   --- -----  --- ----   ------  ------ ------  ----  ----     ------ --------   ------     ------
cpost/bulletins_from_sundew              idle    1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
cpost/msc-radar                          idle    1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
flow/opg_wiski                           idle    1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
flow/scheduled_aviation_wind_fax_charts  idle    1/1     0 100%   0%      0    0.00s    0.00s 948.4s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
flow/scheduled_dfo_spine                 idle    1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
flow/sealevelmonitoring                  idle    1/1     0 100%   0%      0    0.01s    0.01s 129.9s  0.0%  0 Bytes/s   0 msgs/s 685 Bytes/s  0 Bytes/s
Traceback (most recent call last):
  File "/local/home/sarra/.local/bin/sr3", line 11, in <module>
    load_entry_point('metpx-sr3', 'console_scripts', 'sr3')()
  File "/local/home/sarra/sr3/sarracenia/sr.py", line 3096, in main
    sys.exit(gs.status())
  File "/local/home/sarra/sr3/sarracenia/sr.py", line 2623, in status
    print(line)
BrokenPipeError: [Errno 32] Broken pipe

reidsunderland · 2024-06-06T18:03:04Z

2024-06-06 17:56:04,904 [INFO] sarracenia.flowcb.log after_post posted to a file with baseUrl: http://example.com/ relPath: data/wxofeed/dd/public_data/citypage_weather/xml/MB/s0000566_f.xml
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sarracenia/instance.py", line 248, in <module>
    i.start()
  File "/usr/lib/python3/dist-packages/sarracenia/instance.py", line 239, in start
    self.running_instance.run()
  File "/local/home/sarra/sr3/sarracenia/flow/__init__.py", line 586, in run
    self.post(now)
  File "/local/home/sarra/sr3/sarracenia/flow/__init__.py", line 1223, in post
    and now > self.metrics_lastWrite+self.o.metrics_writeInterval:
AttributeError: 'Config' object has no attribute 'metrics_writeInterval'

I'm seeing this error on dev, running d06ac13

EDIT: I did a git pull to use the latest development branch commit and I think the bug is gone.

petersilva · 2024-06-06T21:30:48Z

improvements were published to 1/2 of a cluster to compare with the un-patched other half.
no discernible difference between the two, but...
the load is reasonable now (single digits) where it would climb to mid 20's after a few days before.
there were a number of errors in the configuration causing large retry queues. Those configuration errors were corrected, and it looks like that might have fixed it... rather than the performance tweaks done.

The performance is still better (reduced i/o overhead.) and the logging is lighter, but not a huge difference.
Will watch for another week or two.

petersilva · 2024-06-19T16:35:36Z

but after two weeks, the load still climbs into the 20's. and flows that should be fast... are slow.
If I restart, everything perks up.

so keeping this open for now... work-around is to restart everything once a week... which is very safe to do (unlike v2) because state recovers well from restarts.

reidsunderland · 2024-10-21T13:43:29Z

I profiled a couple sarras and a sender over the weekend to see if that would give any more insight into the increasing CPU usage.

I'll also document the process I've been using:

# start the profiling - I use an instance number 1 higher than the configured maximum number of 
# instances, so it doesn't conflict with other running processes, and so sanity doesn't kill it.
python3 -m cProfile -o /local/home/sarra/cprofile_output_file.dat /local/home/sarra/sr3/sarracenia/instance.py --no 7 start component/config &
profilepid=$!
disown $profilepid

# when you want to stop, send a signal to the pid
kill $profilepid

# Generate the graph:
# https://stackoverflow.com/questions/843671/profiling-in-python-who-called-the-function
gprof2dot -n 0.05 -e 0.01 -f pstats $file | dot -Tpng -o ./"${file/dat/png}"
gprof2dot -n 0.05 -e 0.01 -f pstats $file | dot -Tsvg -o ./"${file/dat/svg}"

Or if you want to get the results in text format:

import pstats
import cProfile
ps = pstats.Stats('cprofile_output_file.dat')
ps.sort_stats('cumulative').print_stats()

The only thing that jumps out is _runCallbackMetrics.

SFTP accounts for a lot of time in the sarra. I think we expect SFTP to be a bit CPU intensive, since the encryption takes time. But the SFTP code in the sarra took a huge percentage of the CPU usage/time, but I don't see the same in the sender. The feed I was testing with downloaded 197899 files, 156.649 GB total over ~3 days. I'm going to try to confirm there isn't a config issue on the server we're pulling from for the SFTP sarra/

sarra (sftp)

sarra (http)

sender (sftp)

petersilva · 2024-10-21T13:49:45Z

This profie data is eye opening... It's looks suspicious that sftp acquiring a lock takes so much time. It could just be how it works, but ... fishy.

petersilva · 2024-10-21T15:04:12Z

It would be cool to have some v2's profiled the same way... as a baseline.

petersilva added bug Something isn't working regression Broke something that was working before. v3only Only affects v3 branches. labels May 1, 2024

petersilva self-assigned this May 1, 2024

petersilva added the Priority 2 - Critical High Criticality or High Visibility Problem label May 31, 2024

petersilva mentioned this issue Jun 3, 2024

Issue1035 retry too much, reduce cpu to some extent... #1086

Merged

petersilva added a commit that referenced this issue Jun 4, 2024

sr3 status | head results in stack trace... weird. #1035

ff67465

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cpu time used in sr3 is higher than v2 for equivalent configuration. #1035

cpu time used in sr3 is higher than v2 for equivalent configuration. #1035

petersilva commented May 1, 2024

petersilva commented May 1, 2024

petersilva commented May 1, 2024

petersilva commented May 13, 2024 •

edited

Loading

petersilva commented May 13, 2024 •

edited

Loading

petersilva commented May 13, 2024

petersilva commented May 13, 2024 •

edited

Loading

petersilva commented May 18, 2024 •

edited

Loading

petersilva commented May 18, 2024

petersilva commented May 18, 2024 •

edited

Loading

petersilva commented May 18, 2024

petersilva commented May 22, 2024 •

edited

Loading

petersilva commented May 22, 2024

petersilva commented May 22, 2024

petersilva commented May 23, 2024

petersilva commented Jun 1, 2024 •

edited

Loading

petersilva commented Jun 3, 2024 •

edited

Loading

petersilva commented Jun 4, 2024 •

edited

Loading

petersilva commented Jun 4, 2024

reidsunderland commented Jun 6, 2024 •

edited

Loading

petersilva commented Jun 6, 2024

petersilva commented Jun 19, 2024 •

edited

Loading

reidsunderland commented Oct 21, 2024

petersilva commented Oct 21, 2024

petersilva commented Oct 21, 2024

cpu time used in sr3 is higher than v2 for equivalent configuration. #1035

cpu time used in sr3 is higher than v2 for equivalent configuration. #1035

Comments

petersilva commented May 1, 2024

petersilva commented May 1, 2024

petersilva commented May 1, 2024

petersilva commented May 13, 2024 • edited Loading

petersilva commented May 13, 2024 • edited Loading

petersilva commented May 13, 2024

petersilva commented May 13, 2024 • edited Loading

petersilva commented May 18, 2024 • edited Loading

petersilva commented May 18, 2024

petersilva commented May 18, 2024 • edited Loading

petersilva commented May 18, 2024

petersilva commented May 22, 2024 • edited Loading

petersilva commented May 22, 2024

petersilva commented May 22, 2024

petersilva commented May 23, 2024

petersilva commented Jun 1, 2024 • edited Loading

petersilva commented Jun 3, 2024 • edited Loading

petersilva commented Jun 4, 2024 • edited Loading

petersilva commented Jun 4, 2024

reidsunderland commented Jun 6, 2024 • edited Loading

petersilva commented Jun 6, 2024

petersilva commented Jun 19, 2024 • edited Loading

reidsunderland commented Oct 21, 2024

sarra (sftp)

sarra (http)

sender (sftp)

petersilva commented Oct 21, 2024

petersilva commented Oct 21, 2024

petersilva commented May 13, 2024 •

edited

Loading

petersilva commented May 13, 2024 •

edited

Loading

petersilva commented May 13, 2024 •

edited

Loading

petersilva commented May 18, 2024 •

edited

Loading

petersilva commented May 18, 2024 •

edited

Loading

petersilva commented May 22, 2024 •

edited

Loading

petersilva commented Jun 1, 2024 •

edited

Loading

petersilva commented Jun 3, 2024 •

edited

Loading

petersilva commented Jun 4, 2024 •

edited

Loading

reidsunderland commented Jun 6, 2024 •

edited

Loading

petersilva commented Jun 19, 2024 •

edited

Loading