
cpu time used in sr3 is higher than v2 for equivalent configuration. #1035

Open

petersilva opened this issue May 1, 2024 · 24 comments

Labels: bug · Priority 2 - Critical · regression · v3only

@petersilva
Contributor

Comparing two pumps, it looks like the sr3 one consumes more cpu time than the v2 one.
This issue will be used to investigate a bit.

petersilva added the bug, regression and v3only labels May 1, 2024
petersilva self-assigned this May 1, 2024
@petersilva
Contributor Author

I notice that in v2, in the amqp consumer, when no message arrives, it does exponential backoff....

        self.get_message()
        self.raw_msg = self.consumer.consume(self.queue_name)

        # if no message from queue, perhaps we have message to retry

        if self.raw_msg is None:
            self.raw_msg = self.retry.get()

        # when no message sleep for 1 sec. (value taken from old metpx)
        # *** value 0.01 was tested and would simply raise cpu usage of broker
        # to unacceptable level with very fews processes (~20) trying to consume messages
        # remember that instances and broker sharing messages add up to a lot of consumers

        should_sleep = False

        if self.raw_msg is None:
            should_sleep = True
        elif self.raw_msg.isRetry and self.last_msg_failed:
            should_sleep = True

        if should_sleep:
            try:
               time.sleep(self.sleep_now)
            except:
               self.logger.info("woke from sleep by alarm.. %s " % self.msg.notice)

            self.sleep_now = self.sleep_now * 2
            if self.sleep_now > self.sleep_max:
                self.sleep_now = self.sleep_max

in sr3, it instantaneously checks whether a message is there and returns immediately, relying on the main loop to do any sleeping.

@petersilva
Contributor Author

... but it looks like sr3 has the same logic... in the main loop...

                if not stopping:
                    self.gather()

                last_gather_len = len(self.worklist.incoming)
                if (last_gather_len == 0):
                    spamming = True
                else:
                    current_sleep = self.o.sleep
                    spamming = False

                # ...

                if spamming and (current_sleep < 5):
                    current_sleep *= 2


@petersilva
Contributor Author

petersilva commented May 13, 2024

Over the period of a week, we see that, on the servers in question:

  • when things initially start, there is high cpu usage for a few polls and some sarras, but the polls calm down.
  • over a number of days, the cpu usage of busy sarras, which is initially quite low (2%), gradually creeps up. after a week or so, it's hanging at 99%.
  • restarting resets the clock.

my guess: memory fragmentation. the GC is spending more and more time to find bytes in ever more fragmented chains.

@petersilva
Contributor Author

petersilva commented May 13, 2024

reading about GC:

note: no memory leak observed... is that true? looking at housekeeping outputs, the memory usage is fairly stable over several days... it starts at 148 MiB and grows to 179 MB after running for a week.

based on the above... a bunch of experiments to try (sketched below):

  • import gc, and call gc.freeze() after all the configs are read, since they are stable and unlikely to ever be reclaimed.
  • disable automatic gc entirely.
  • manually trigger GC at intervals, so that it runs less often but more thoroughly, and at times that make sense.
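
A minimal sketch of those three knobs, using plain gc module calls (exactly where they would hook into sr3 is left as an assumption):

import gc

# experiment 1: after startup / config parsing, freeze everything allocated so
# far into the permanent generation so the collector never re-scans it
# (gc.freeze() only exists in Python >= 3.7)
gc.freeze()

# experiment 2: turn automatic collection off entirely, and collect by hand at
# a convenient moment (e.g. housekeeping)
gc.disable()
unreachable = gc.collect()      # returns the number of unreachable objects found

# experiment 3: keep automatic collection, but raise the generation-0 threshold
# from its default of 700 so collection triggers far less often
gc.set_threshold(70000, 10, 10)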

@petersilva
Contributor Author

https://docs.python.org/3/library/gc.html

... you can disable the collector if you are sure your program does not create reference cycles. Automatic collection can be disabled by calling gc.disable().

@petersilva
Contributor Author

petersilva commented May 13, 2024

ugh.... freeze() is tagged ... added in 3.7 ...
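
If older interpreters are still in play, a guarded call is enough; a minimal sketch (not current sr3 code):

import gc

# only call freeze() where it exists (Python >= 3.7); on older interpreters
# the optimization is simply skipped
if hasattr(gc, "freeze"):
    gc.freeze()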

@petersilva
Contributor Author

petersilva commented May 18, 2024

created a plugin, mem_gc_monitor ... given multiple instances, divide them into groups of 3:

  • 1st (of 3) instance will use default gc settings.
  • 2nd (of 3) instance will have gc turned off, but run manually and explicitly every housekeeping interval.
  • 3rd (of 3) will have the threshold for automatic gc raised from 700 -> 70000 (a hundred-fold).

Look for a difference in cpu usage between the instances.
Running on a cluster in dev on some configurations.

edcm-dirt22-1% cat mem_gc_monitor.py

import gc
import logging
import time

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Mem_gc_monitor(FlowCB):

    def __init__(self,options):
       super().__init__(options, logger)

    def on_start(self):

        # every first instance has a stock experience.

        # every 2nd instance: disable gc, and run it only at housekeeping time.
        if self.o.no % 3 == 2:
            gc.disable()
        # every 3rd instance, raise the gc threshold.
        if self.o.no % 3 == 0:
            gc.set_threshold(70000, 5, 5)

        # enormous output (for every struct it looks at...)
        gc.set_debug(gc.DEBUG_STATS)
        pass

    def on_housekeeping(self):

        logger.info( f"before gc.isenabled={gc.isenabled()} ")
        logger.info( f"before gc.get_stats={gc.get_stats()} ")
        logger.info( f"before gc.get_count={gc.get_count()} ")
        logger.info( f"before gc.get_threshold={gc.get_threshold()} ")

        
        if not gc.isenabled():
            gc_start = time.perf_counter()
            gc.collect()
            logger.info(f"gc took {time.perf_counter()-gc_start}")

        logger.info( f"after gc.isenabled={gc.isenabled()} ")
        logger.info( f"after gc.get_stats={gc.get_stats()} ")
        logger.info( f"after gc.get_count={gc.get_count()} ")
        logger.info( f"after gc.get_threshold={gc.get_threshold()} ")

@petersilva
Contributor Author

sample run... after running 7 minutes:

subscribe_dd_swob_03.log:.. 00:59:04,189 ... on_housekeeping Current cpu_times: user=3.33 system=0.52
subscribe_dd_swob_02.log:.. 00:59:10,708 ... on_housekeeping Current cpu_times: user=4.31 system=0.64
subscribe_dd_swob_01.log:... 01:00:51,980 ... on_housekeeping Current cpu_times: user=5.22 system=0.63

instance 1 has the most cpu time (un-tuned gc... ) and the tuned gc has the least. Manually invoking the gc is between the two. Need to watch for longer.

@petersilva
Contributor Author

petersilva commented May 18, 2024

after running over-night:

bunny@host:~/.cache/sr3/log$ grep cpu_times sarra_get_dms-products_0?.log | sort -k2 | tail -8 | sort
sarra_get_dms-products_01.log:2024-05-18 14:18:16,576 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1081.32 system=39.33
sarra_get_dms-products_02.log:2024-05-18 14:18:46,546 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1095.48 system=54.05
sarra_get_dms-products_03.log:2024-05-18 14:18:28,276 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1076.5 system=41.04
sarra_get_dms-products_04.log:2024-05-18 14:18:07,449 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1054.98 system=46.55
sarra_get_dms-products_05.log:2024-05-18 14:18:31,412 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1074.32 system=40.74
sarra_get_dms-products_06.log:2024-05-18 14:18:42,301 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1044.78 system=39.54
sarra_get_dms-products_07.log:2024-05-18 14:18:29,128 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1035.61 system=46.3
sarra_get_dms-products_08.log:2024-05-18 14:17:55,235 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1061.29 system=42.2
bunny@host:~/.cache/sr3/log$

no obvious pattern. had debug on for the gc runs as well, and they seem fast.

@petersilva
Contributor Author

notes:

  • a client noticed this behaviour a few years ago... It wasn't clear if it was a one-off.
  • now that we have long-running systems (ddsr-dev) we can see the cpu time building over days or a week.
  • when we restart, the cpu usage (%cpu as reported by top) drops from 99% to something healthy... typically below 1%.

so a work-around for this issue is periodic restarts.

@petersilva
Contributor Author

petersilva commented May 22, 2024

after leaving it for a five-day run:

  • instance 01 (normal gc, no tuning) user=45597.73 system=2072.02
  • instance 02 (disabled gc, explicit invocations) user=40387.77 system=2067.65
  • instance 03 (tuned gc, much higher thresholds) user=40580.63 system=2078.31

so instances 2 and 3 consumed about 11% less cpu than instance 1... instance 1 might be special... but if it isn't, then the other two got about an 11% cpu benefit from gc tuning, with manually invoking gc when convenient doing slightly better than just raising the thresholds.

on the last day, grepping the logs...

  • instance 01 (normal gc, no tuning) 52188 calls to the garbage collector. 10.2 sec. gc: run time.
  • instance 02 (disabled gc, explicit calls) 395 ... 82.86 sec. gc run time. (99 sec reported externally.)
  • instance 03 (tuned gc) 659 ... 20.48 sec. of gc run time (reported by gc: diags)

so... way fewer gc calls, and each call is more expensive... yet the total user time is less... a complete mystery.

@petersilva
Contributor Author

Looking at another configuration, where top shows cpu consumption between 98% and 100%, I look at the reported cpu times between housekeeping calls...

2024-05-22 13:57:06,135 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10322.8 system=118.07
2024-05-22 14:02:09,291 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10330.7 system=118.15
2024-05-22 14:07:13,542 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10339.71 system=118.19
2024-05-22 14:12:16,314 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10351.26 system=118.22
2024-05-22 14:17:16,982 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=10363.41 system=118.32

so in five minutes... it's reporting about 8 seconds of cpu usage, which works out to roughly 3%... kind of strange for a process that top says is at 100% cpu.
and the system time, from one housekeeping report to the next, is barely budging... (cross-check sketch below)
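
One way to cross-check top against the process's own accounting is to diff cpu_times over a known wall-clock interval; a minimal sketch using psutil (the pid is whichever instance looks busy, filled in by hand):

import time
import psutil

pid = 12345                       # pid of the suspect instance (fill in)
p = psutil.Process(pid)

t0, c0 = time.monotonic(), p.cpu_times()
time.sleep(300)                   # roughly one housekeeping interval
t1, c1 = time.monotonic(), p.cpu_times()

busy = (c1.user - c0.user) + (c1.system - c0.system)
print(f"{busy:.1f}s cpu over {t1 - t0:.0f}s wall = {100 * busy / (t1 - t0):.1f}%")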

@petersilva
Contributor Author

I'm worried that instance 1 does extra work, and so the comparison might not be good. adding a fourth instance.... the fourth should be comparable to 2 and 3 directly.

@petersilva
Contributor Author

looking through with ptrace I saw excessive file ops in the retry logic... found one issue... but I don't think it will make a big difference.

petersilva added the Priority 2 - Critical label May 31, 2024
@petersilva
Contributor Author

petersilva commented Jun 1, 2024

I wrote this, but I think it's wrong...

k.. yes, it will make a big difference... the logic, as far as I can tell, especially when idle, is:

  • as part of gather() ... diskqueue.get() will be called with self.o.batch, which could be 50 or 100 or something like that...
  • diskqueue.get() is a loop over the batch size, regardless of whether the queue has any messages in it.
  • on every pass where it finds no message, it also tries to delete the queue file (because it is empty). This should short-circuit and avoid checking things again, but it doesn't... the logic is slightly off.
  • so it loops batch times, trying to unlink the file every time.
  • gather() is called A LOT. basically many flows have very short sleeps between gathers.

I could see that getting expensive... just not sure how expensive. Fixing it.

It does short-circuit: after the first failure to retrieve the file, it tries
to delete it and returns... so each gather() pass with a disk queue involves:

  • check if the file exists, return None
  • delete the file.
  • return None

these two file ops are completely useless most (>99%) of the time.
a slight logic change will mean no i/o at all when the retry queue is known to be empty (see the sketch below).
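
A minimal sketch of that short-circuit; the class and attribute names are illustrative, not the actual sarracenia DiskQueue API, and message consumption/requeue details are glossed over:

import os

class RetryQueueSketch:
    """Illustrative only: remember emptiness so an idle flow does zero file i/o."""

    def __init__(self, path):
        self.path = path
        self.known_empty = not os.path.exists(path)

    def get(self, batch):
        # the short-circuit: no stat(), no open(), no unlink() when we already
        # know the queue is empty
        if self.known_empty:
            return []

        if not os.path.exists(self.path):
            self.known_empty = True
            return []

        with open(self.path) as f:
            messages = [line.rstrip() for line in f][:batch]

        if not messages:
            os.unlink(self.path)      # drained: delete the file once, not once per pass
            self.known_empty = True
        return messages

    def put(self, message):
        with open(self.path, "a") as f:
            f.write(message + "\n")
        self.known_empty = False      # new work invalidates the cached emptiness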

@petersilva
Contributor Author

petersilva commented Jun 3, 2024

branch origin/issue1035_retry_too_much eliminates all i/o from the main run loop.

  • check for retry queue existence: gone.
  • delete retry queue: gone.
  • checking for existence of metrics directory: gone.
  • write of the metrics file: now throttled by a new metrics_writeInterval option (5 seconds, vs. every loop pass; sketch after this list.)
  • noticed a very expensive write of diff output in a plugin used by the flow_test. now, if the diff is >10 lines, it just says "big".

so now there is no file i/o most times the loop runs.
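
The metrics change follows the same pattern as the retry fix: remember the last write time and skip the file entirely until metrics_writeInterval has elapsed. A rough sketch (every name other than metrics_writeInterval is illustrative):

import json
import time

class MetricsWriterSketch:
    def __init__(self, path, metrics_writeInterval=5):
        self.path = path
        self.metrics_writeInterval = metrics_writeInterval
        self.metrics_lastWrite = 0.0

    def maybe_write(self, metrics, now=None):
        now = time.time() if now is None else now
        # skip the write (and any directory/existence checks) on most loop passes
        if now < self.metrics_lastWrite + self.metrics_writeInterval:
            return False
        with open(self.path, "w") as f:
            json.dump(metrics, f)
        self.metrics_lastWrite = now
        return True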

@petersilva
Contributor Author

petersilva commented Jun 4, 2024

another weird behaviour:

2024-06-04 20:06:40,633 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:06:40,639 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:06:51,279 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:06:57,694 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:07:04,111 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json
2024-06-04 20:07:10,521 [INFO] sarracenia.flow post looking for old metrics for /local/home/sarra/.cache/sr3/metrics/sarra_get-urp-dev_01.json

at the end of each loop... so even when nothing is going on, it does that... which is pointless and happens very often. Moved this logic so it only executes at housekeeping intervals.

9c8ce5e

@petersilva
Contributor Author

another weird behaviour:

 sr3 status | head
status:
Component/Config                         Processes   Connection        Lag                              Rates
                                         State   Run Retry  msg data   Queued  LagMax LagAvg  Last  %rej     pubsub messages   RxData     TxData
                                         -----   --- -----  --- ----   ------  ------ ------  ----  ----     ------ --------   ------     ------
cpost/bulletins_from_sundew              idle    1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
cpost/msc-radar                          idle    1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
flow/opg_wiski                           idle    1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
flow/scheduled_aviation_wind_fax_charts  idle    1/1     0 100%   0%      0    0.00s    0.00s 948.4s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
flow/scheduled_dfo_spine                 idle    1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
flow/sealevelmonitoring                  idle    1/1     0 100%   0%      0    0.01s    0.01s 129.9s  0.0%  0 Bytes/s   0 msgs/s 685 Bytes/s  0 Bytes/s
Traceback (most recent call last):
  File "/local/home/sarra/.local/bin/sr3", line 11, in <module>
    load_entry_point('metpx-sr3', 'console_scripts', 'sr3')()
  File "/local/home/sarra/sr3/sarracenia/sr.py", line 3096, in main
    sys.exit(gs.status())
  File "/local/home/sarra/sr3/sarracenia/sr.py", line 2623, in status
    print(line)
BrokenPipeError: [Errno 32] Broken pipe

@reidsunderland
Member

reidsunderland commented Jun 6, 2024

2024-06-06 17:56:04,904 [INFO] sarracenia.flowcb.log after_post posted to a file with baseUrl: http://example.com/ relPath: data/wxofeed/dd/public_data/citypage_weather/xml/MB/s0000566_f.xml
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sarracenia/instance.py", line 248, in <module>
    i.start()
  File "/usr/lib/python3/dist-packages/sarracenia/instance.py", line 239, in start
    self.running_instance.run()
  File "/local/home/sarra/sr3/sarracenia/flow/__init__.py", line 586, in run
    self.post(now)
  File "/local/home/sarra/sr3/sarracenia/flow/__init__.py", line 1223, in post
    and now > self.metrics_lastWrite+self.o.metrics_writeInterval:
AttributeError: 'Config' object has no attribute 'metrics_writeInterval'

I'm seeing this error on dev, running d06ac13

EDIT: I did a git pull to use the latest development branch commit and I think the bug is gone.

@petersilva
Contributor Author

  • improvements were published to 1/2 of a cluster to compare with the un-patched other half.
  • no discernible difference between the two, but...
  • the load is reasonable now (single digits), where before it would climb to the mid 20's after a few days.
  • there were a number of errors in the configuration causing large retry queues. Those configuration errors were corrected, and it looks like that may be what fixed it... rather than the performance tweaks.

The performance is still better (reduced i/o overhead) and the logging is lighter, but it's not a huge difference.
Will watch for another week or two.

@petersilva
Contributor Author

petersilva commented Jun 19, 2024

but after two weeks, the load still climbs into the 20's, and flows that should be fast... are slow.
If I restart, everything perks up.

so keeping this open for now... work-around is to restart everything once a week... which is very safe to do (unlike v2) because state recovers well from restarts.

@reidsunderland
Member

I profiled a couple of sarras and a sender over the weekend to see if that would give any more insight into the increasing CPU usage.

I'll also document the process I've been using:

# start the profiling - I use an instance number 1 higher than the configured maximum number of 
# instances, so it doesn't conflict with other running processes, and so sanity doesn't kill it.
python3 -m cProfile -o /local/home/sarra/cprofile_output_file.dat /local/home/sarra/sr3/sarracenia/instance.py --no 7 start component/config &
profilepid=$!
disown $profilepid

# when you want to stop, send a signal to the pid
kill $profilepid

# Generate the graph:
# https://stackoverflow.com/questions/843671/profiling-in-python-who-called-the-function
gprof2dot -n 0.05 -e 0.01 -f pstats $file | dot -Tpng -o ./"${file/dat/png}"
gprof2dot -n 0.05 -e 0.01 -f pstats $file | dot -Tsvg -o ./"${file/dat/svg}"

Or if you want to get the results in text format:

import pstats
import cProfile
ps = pstats.Stats('cprofile_output_file.dat')
ps.sort_stats('cumulative').print_stats()

The only thing that jumps out is _runCallbackMetrics.

SFTP accounts for a lot of time in the sarra. I think we expect SFTP to be a bit CPU intensive, since the encryption takes time. But the SFTP code in the sarra took a huge percentage of the CPU time, while I don't see the same in the sender. The feed I was testing with downloaded 197899 files, 156.649 GB total, over ~3 days. I'm going to try to confirm there isn't a config issue on the server we're pulling from for the SFTP sarra.
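
To pin down where the sftp time goes, the same .dat file can be filtered in pstats, e.g. restricting the printout to functions whose path or name matches 'sftp' and sorting by own time (filename as in the commands above):

import pstats

ps = pstats.Stats('cprofile_output_file.dat')
ps.sort_stats('tottime').print_stats('sftp', 20)   # top 20 sftp-related entries
ps.print_callers('sftp')                           # and who calls into them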

sarra (sftp): [attached profile graph: cprofile_sarra_sftp]

sarra (http): [attached profile graph: cprofile_sarra_http]

sender (sftp): [attached profile graph: cprofile_sender_sftp]

@petersilva
Contributor Author

  • This profile data is eye opening... It looks suspicious that sftp acquiring a lock takes so much time. It could just be how it works, but... fishy.

@petersilva
Contributor Author

It would be cool to have some v2's profiled the same way... as a baseline.
