
sr status is too expensive/slow #174

Closed
petersilva opened this issue Mar 15, 2019 · 23 comments
Assignee: petersilva
Labels: Design impacts API, or code structure changes; efficiency (improve performance and/or reduce overhead)

@petersilva (Contributor)

sr status was implemented, because it was easy, by forking processes that each read all the configurations and then check the status... it should, in principle, be possible to do this in a single process.

@benlapETS (Contributor)

Is there a metric for the time we are targeting that would be acceptable for checking the status and getting the results?

petersilva added the "Design impacts API, or code structure changes" label on Mar 15, 2019
@petersilva (Contributor, Author)

This is another one of those design things... we likely need to move some code into the config module/class, and that is hard.

@petersilva (Contributor, Author)

@benlapETS, I would set a target: it should take less than 5 seconds on a setup with 300 components.
I think the problem is that some of the configuration work (figuring out where to find state files) is done in the instances, instead of in the configuration, so we need to invoke instances in order to find the state files in order to query the status. If the state file locations were set as part of the configuration, it might be possible to query them directly, rather than triggering a fork/reap cycle.

I think that is hard to do.

@benlapETS (Contributor) commented Mar 15, 2019

I may be off track with this issue, but since both concepts are involved, I can say I am not a big fan of sr_config being the parent of sr_instances, as it doesn't make sense conceptually. I think the way to show that is to ask the questions: Is sr_instances a configuration? Or does sr_instances use a configuration? I believe the latter is more representative of the relationship between instances and configurations. That being said, "uses" is better represented by an association than by a generalization (which would be an "is a" relationship). See the class diagram wiki...

@petersilva (Contributor, Author)

To me, the class hierarchy makes perfect sense: a configuration is an abstract definition of what a process should do. sr_instance adds a layer of process management around the configuration, and the components that inherit from sr_instances are specialized processes. Yes, sr_subscribe is an instance, and yes, an instance is a configuration. There's nothing conceptually wrong with it as-is, but this conversation sounds like debating how many angels can dance on the head of a pin. We could debate it forever, but it isn't fruitful. It makes more sense to look at the change in terms of what it would take to implement.

While I understand the OO purity quest, this isn't a greenfield project. Rather than starting from a platonic ideal, we need to deal with the existing codebase. Making sr_instance not inherit from sr_config would mean re-writing the entire application: all settings are set in config, inherited by instance, and then inherited by all of the component main routines. As there are hundreds of settings, and almost everything in the application involves applying or interpreting settings, it would change at least... (guess) 70% of the lines of code. In addition, all plugins use those settings, so it would break every single plugin included in the code and, worse, those developed by users. There are thousands of them. It's a bad idea.

So while changing the hierarchy may appeal to your sense of propriety (and it doesn't even do that for me), the practical thing to do is to keep the same hierarchy and move some methods around to keep it compatible. Changing the inheritance structure of sr_config -> sr_instance -> components would cause massive codebase disruption, and it is extremely doubtful that it would be worthwhile.

@petersilva (Contributor, Author)

What probably should be done: implement a second configuration parser that scans the hierarchy differently. It does:

  • one read of all the configuration files into a single hierarchy of configs in memory, where each config only records its number of instances and its on/off/disabled state.
  • one query of the process table, keeping only the sr_ processes, so that subsequent sr operations go through the in-memory structure.
  • one scan of all the state file directories.

Then use the foreground process to work on the result of this fixed data set. A foreground process should be able to execute this in under a second.
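A minimal sketch of that single-pass gather, in Python (the directory layout, file extensions, and the shape of the returned dict are assumptions for illustration, not the actual sr code):

import os
import psutil

def gather_global_state(config_root, state_root):
    state = {'configs': {}, 'procs': {}, 'states': {}}

    # one walk over all configuration files: record only the instance
    # count and on/off state, without fully parsing each file
    for dirpath, _, files in os.walk(config_root):
        for f in files:
            name, ext = os.path.splitext(f)
            if ext in ('.conf', '.off'):
                state['configs'][os.path.join(dirpath, name)] = {
                    'enabled': ext == '.conf', 'instances': 1}

    # one query of the process table, keeping only the sr_ processes
    for p in psutil.process_iter(['pid', 'name', 'cmdline']):
        if p.info['name'] and p.info['name'].startswith('sr_'):
            state['procs'][p.info['pid']] = p.info['cmdline']

    # one scan of all the state-file directories
    for dirpath, _, files in os.walk(state_root):
        state['states'].setdefault(dirpath, []).extend(files)

    return state

All subsequent status/start/stop logic then operates on this in-memory snapshot instead of re-reading anything from disk.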

petersilva self-assigned this on Aug 23, 2019
@petersilva (Contributor, Author)

I've started a branch (issue174) with a partial implementation. For a flow_test config, it's about 6x faster for status. I think the difference should increase with the size of the config... Only status is implemented so far; stop is modelled, but the kills are commented out for now.

@petersilva (Contributor, Author)

It's called sr2.py on the branch and is a standalone script.

@petersilva (Contributor, Author)

renamed to srp.py (parallel) so it is installed as srp. start, stop, and sanity are done now...
the output will comfort analysts (it isn't ridiculous like the current one).
it also properly redirects stdout and stderr for launched processes.

@petersilva (Contributor, Author)

last case to deal with is when the configured number of instances changes... need to understand what has happened and do the right thing in sanity.
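The reconciliation would be something like this (a rough sketch; start_instance/stop_instance and the inputs are hypothetical names, not the actual sr code):

def sanity(configured, running, start_instance, stop_instance):
    wanted = set(range(1, configured + 1))  # instance numbers we expect
    found = set(running)                    # instance numbers actually running
    for i in sorted(wanted - found):        # configured, but not running
        start_instance(i)
    for i in sorted(found - wanted):        # running, but no longer configured
        stop_instance(i)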

@petersilva (Contributor, Author)

Dunno if it is worth doing right now, but... I think @benlapETS was asking about this... we could add pattern matching to the command line here, to operate on subsets of the configurations, e.g. start up all the configs that match:

srp start "*/f[0-9][0-9]*"

and now this is cheap, because it is just a filter on the operation over all configs.
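Something like this, using fnmatch from the standard library (all_configs as the in-memory list of component/config names is an assumption about the gathered state):

import fnmatch

def select(all_configs, pattern='*'):
    return [c for c in all_configs if fnmatch.fnmatch(c, pattern)]

# select(all_configs, '*/f[0-9][0-9]*') keeps only the matching configs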

@petersilva (Contributor, Author)

stopping for now... I think the core is there as a proof of concept.

@petersilva (Contributor, Author)

renamed things to encourage use of the new one... srp is now sr (so it is the default, normal thing), but keeping the old one around as sr1.py (read it as either version 1, or the sequential, single-threaded version). had to add declare and setup, but will never add cleanup or remove... both are too dangerous.

@petersilva (Contributor, Author) commented Aug 25, 2019

Method to get a sample config to test with:


./flow_setup.sh
./flow_limit.sh 1
sr_cpump remove pelle_dd1_f04
sr_cpump remove pelle_dd2_f05
sr_shovel remove t_dd1_f00
sr_shovel remove t_dd2_f00

now have a bunch of configs set up, with no traffic going by...


blacklab% time sr1 status >/dev/null 2>&1

real	0m6.742s
user	0m6.076s
sys	0m0.679s
blacklab% time sr status >/dev/null 2>&1

real	0m2.807s
user	0m2.110s
sys	0m0.693s
blacklab% 

so status is faster... what about declare?

blacklab% time sr1 declare >/dev/null 2>&1

real	0m1.095s
user	0m0.924s
sys	0m0.118s
blacklab% time sr declare >/dev/null 2>&1

real	0m3.961s
user	0m9.832s
sys	0m1.425s
blacklab% 

declare and setup are slower in the new version... which is parallel.
my guess is that the overhead of reading the global state is about 2 seconds (based on the status run).
That overhead isn't present in the sequential version, so for smaller configurations
the new one will be slower, but when you go to a much bigger configuration it will be faster.
On the other hand, it might be that launching all the tasks in parallel and then reaping them carries so much more overhead that one at a time is faster... but I doubt it. Would have to see it run on a larger configuration.

blacklab% time sr1 stop >/dev/null 2>&1

real	0m19.842s
user	0m6.714s
sys	0m0.857s
blacklab% time sr1 start >/dev/null 2>&1

real	0m10.814s
user	0m9.135s
sys	0m1.137s

now with parallel version:

blacklab% time sr start
gathering global state: procs, configs, state files, logs, analysis - Done. 
starting............................................................................................Done

real	0m4.673s
user	0m1.713s
sys	0m0.482s
blacklab% time sr stop
gathering global state: procs, configs, state files, logs, analysis - Done. 
stopping.............................................................................................Done
Waiting 1 sec. to check if 93 processes stopped (try: 0)
All stopped after try 0

real	0m5.645s
user	0m3.520s
sys	0m1.052s
blacklab% 

So it's maybe 3x or 4x faster for this small testing configuration. Advantage should increase with the size of the configuration.

@petersilva (Contributor, Author)

basically you pay a 2-second overhead cost on startup, and then the rest should be much faster, as there is no more reading of configurations...

@petersilva (Contributor, Author)

so this implementation uses subprocess. It is now a standalone thing (not tangled up with other classes many levels deep), so it could be re-formulated to use another API (multiprocessing) if desired. Not convinced that would be helpful, but at least now it is easy to experiment.
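For illustration, the launch-and-reap pattern looks roughly like this (the command lines and log paths are assumptions, not the actual sr code):

import subprocess

def start_all(commands, logdir):
    procs = []
    for i, cmd in enumerate(commands):
        # each launched process gets its own log; stderr folded into stdout
        log = open('%s/instance_%03d.log' % (logdir, i), 'ab')
        procs.append(subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT))
    return procs

def reap_all(procs):
    return [p.wait() for p in procs]  # collect exit codes once, at the end

Swapping subprocess for multiprocessing would mostly change start_all; the reap step keeps the same shape.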

@petersilva (Contributor, Author)

handing to Benoit to look at adding filtering.

@petersilva (Contributor, Author) commented Aug 29, 2019

the first implementation of declare is probably dumb. We should probably do it over as a global op:
improve the parser to understand exchange and queue declarations, and do them all in a single process.
That should give a big improvement in performance... but declare isn't a performance pain point, so probably just spawn a new issue once this is merged, and deal with it in time.
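The global op would amount to collecting declarations across all configs and de-duplicating before acting (a sketch; the cfg fields and the declare_* callables are hypothetical, not the actual parser API):

def declare_all(configs, declare_exchange, declare_queue):
    exchanges, queues = set(), set()
    for cfg in configs:
        exchanges.update(cfg.get('exchanges', []))
        queues.update(cfg.get('queues', []))
    for x in sorted(exchanges):  # each declaration is issued exactly once
        declare_exchange(x)
    for q in sorted(queues):
        declare_queue(q)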

@petersilva (Contributor, Author)

once declare is done, we can do a really safe cleanup, because it will understand all the declarations of all the configurations as a single unit.

@petersilva (Contributor, Author)

merged the work-in-progress for now... it passes the flow check and there is some positive feedback.
it is useful as-is, but it would be even better to add the filtering, so this branch should continue at least until that is done.

@petersilva (Contributor, Author)

released in 2.19.09

@petersilva (Contributor, Author)

psutil on linux is still disastrously slow... going to add a shortcut on linux to call ps directly.
dunno what psutil is doing, but it is like... 240x slower than it should be.
opened a bug:

giampaolo/psutil#1751
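The shortcut amounts to something like this (a sketch; the ps field list and the 'sr_' filter are assumptions about what the status scan needs):

import subprocess

def ps_snapshot():
    out = subprocess.check_output(['ps', '-e', '-o', 'pid=,cmd='], text=True)
    procs = {}
    for line in out.splitlines():
        pid, _, cmd = line.strip().partition(' ')
        if 'sr_' in cmd:  # keep only sarracenia processes
            procs[int(pid)] = cmd
    return procs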

petersilva added a commit that referenced this issue on May 6, 2020

… command.

that python module is way too slow to use at scale; bug opened with them.

fix #174, help with #315 and #180, and #187
@petersilva (Contributor, Author)

well, Giampaolo's fix was simple and effective... a second, simpler patch is inbound...
