
sr status is too expensive/slow #174

Closed
petersilva opened this issue Mar 15, 2019 · 23 comments
Assignee: petersilva
Labels: Design impacts API, or code structure changes; efficiency (improve performance and/or reduce overhead)

@petersilva (Contributor)

sr status was implemented, because it was easy, by forking processes that each read all the configurations and then check the status... it should, in principle, be possible to do this in a single process.

@benlapETS (Contributor)

Is there a metric for the time we are targeting that would be acceptable for checking the status and getting the results?

petersilva added the "Design impacts API, or code structure changes" label on Mar 15, 2019
@petersilva (Contributor, Author)

This is another one of those design things... we likely need to move some code into the config module/class, and that is hard.

@petersilva (Contributor, Author)

@benlapETS, I would set a target: it should take less than 5 seconds on a setup with 300 components.
I think the problem is that some of the configuration work (figuring out where to find state files) is done in the instances, instead of in the configuration, so we need to invoke instances in order to find the state files in order to query the status. If the state file locations were set as part of the configuration, it might be possible to query them directly, rather than triggering a fork/reap cycle.

I think that is hard to do.

@benlapETS (Contributor) commented Mar 15, 2019

I may be off track with this issue, but since both concepts are involved, I can say I am not a big fan of sr_config being the parent of sr_instances, as it doesn't make sense conceptually. I think the way to show that is to ask the questions: Is sr_instances a configuration? Or does sr_instances use a configuration? I believe the latter is more representative of the relationship between instances and configurations. That being said, "uses" is better represented by an association than by a generalization (which would be an "is a" relationship). See the class diagram wiki...

@petersilva (Contributor, Author)

To me, the class hierarchy makes perfect sense: a configuration is an abstract definition of what a process should do. sr_instance adds a layer of process management around the configuration, and the components that inherit from sr_instances are specialized processes. Yes, sr_subscribe is an instance, and yes, an instance is a configuration. There's nothing conceptually wrong with it as-is, but this conversation sounds like debating how many angels can dance on the head of a pin. We could debate it forever, but it isn't fruitful. It makes more sense to look at the change in terms of what it would take to implement.

While I understand the OO purity quest, this isn't a greenfield project. Rather than starting from a platonic ideal, we need to deal with the existing codebase. Making sr_instance not inherit from sr_config would mean re-writing the entire application: all settings are set in config, inherited by instance, and then inherited by all of the component main routines. As there are hundreds of settings, and almost everything in the application involves applying or interpreting settings, it would change at least... (guess) 70% of the lines of code. In addition, all plugins use those settings, so it would break every single plugin included in the code and, worse, those developed by users. There are thousands of them. It's a bad idea.

So while changing the hierarchy may appeal to your sense of propriety (and it doesn't even do that for me), the practical thing to do is to keep the same hierarchy and move some methods around to keep it compatible. Changing the inheritance structure of sr_config -> sr_instance -> components would cause massive codebase disruption, and it is extremely doubtful that it would be worthwhile.

@petersilva (Contributor, Author)

What probably should be done: implement a second configuration parser that scans the hierarchy differently. It does:

  • one read of all the configuration files into a single hierarchy of configs in memory, where each config only records its number of instances and its on/off/disabled state.
  • one query of the process table, keeping only the sr_ processes, so that subsequent sr operations go through the in-memory structure.
  • one scan of all the state file directories.

Then use the foreground process to work on the result of this fixed data set. A foreground process should be able to execute this in under a second.
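A minimal sketch of that single-pass gather, in Python (the directory layout, file extensions, and the shape of the returned dict are assumptions for illustration, not the actual sr code):

import os
import psutil

def gather_global_state(config_root, state_root):
    state = {'configs': {}, 'procs': {}, 'states': {}}

    # one walk over all configuration files: record only the instance
    # count and on/off state, without fully parsing each file
    for dirpath, _, files in os.walk(config_root):
        for f in files:
            name, ext = os.path.splitext(f)
            if ext in ('.conf', '.off'):
                state['configs'][os.path.join(dirpath, name)] = {
                    'enabled': ext == '.conf', 'instances': 1}

    # one query of the process table, keeping only the sr_ processes
    for p in psutil.process_iter(['pid', 'name', 'cmdline']):
        if p.info['name'] and p.info['name'].startswith('sr_'):
            state['procs'][p.info['pid']] = p.info['cmdline']

    # one scan of all the state-file directories
    for dirpath, _, files in os.walk(state_root):
        state['states'].setdefault(dirpath, []).extend(files)

    return state

All subsequent status/start/stop logic then operates on this in-memory snapshot instead of re-reading anything from disk.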

petersilva self-assigned this on Aug 23, 2019
@petersilva (Contributor, Author)

I've started a branch (issue174) with a partial implementation. For a flow_test config, it's about 6x faster for status. I think the difference should increase with the size of the config... Only status is implemented so far; stop is modelled, but the kills are commented out for now.

@petersilva (Contributor, Author)

It's called sr2.py on the branch and is a standalone script.

@petersilva (Contributor, Author)

renamed to srp.py (parallel) so it is installed as srp. start, stop, and sanity are done now...
the output will comfort analysts (it isn't ridiculous like the current one).
it also properly redirects stdout and stderr for launched processes.

@petersilva (Contributor, Author)

last case to deal with is when the configured number of instances changes... need to understand what has happened and do the right thing in sanity.
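The reconciliation would be something like this (a rough sketch; start_instance/stop_instance and the inputs are hypothetical names, not the actual sr code):

def sanity(configured, running, start_instance, stop_instance):
    wanted = set(range(1, configured + 1))  # instance numbers we expect
    found = set(running)                    # instance numbers actually running
    for i in sorted(wanted - found):        # configured, but not running
        start_instance(i)
    for i in sorted(found - wanted):        # running, but no longer configured
        stop_instance(i)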

@petersilva (Contributor, Author)

Dunno if it is worth doing right now, but... I think @benlapETS was asking about this... we could add pattern matching to the command line here, to operate on subsets of the configurations, e.g. start up all the configs that match:

srp start "*/f[0-9][0-9]*"

and now this is cheap, because it is just a filter on the operation over all configs.
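Something like this, using fnmatch from the standard library (all_configs as the in-memory list of component/config names is an assumption about the gathered state):

import fnmatch

def select(all_configs, pattern='*'):
    return [c for c in all_configs if fnmatch.fnmatch(c, pattern)]

# select(all_configs, '*/f[0-9][0-9]*') keeps only the matching configs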

@petersilva (Contributor, Author)

stopping for now... I think the core is there as a proof of concept.

@petersilva (Contributor, Author)

renamed things to encourage use of the new one... srp is now sr (so it is the default, normal thing), but keeping the old one around as sr1.py (read it as either version 1, or the sequential, single-threaded version). had to add declare and setup, but will never add cleanup or remove... both are too dangerous.

@petersilva (Contributor, Author) commented Aug 25, 2019

Method to get a sample config to test with:


./flow_setup.sh
./flow_limit.sh 1
sr_cpump remove pelle_dd1_f04
sr_cpump remove pelle_dd2_f05
sr_shovel remove t_dd1_f00
sr_shovel remove t_dd2_f00

now have a bunch of configs set up, with no traffic going by...


blacklab% time sr1 status >/dev/null 2>&1

real	0m6.742s
user	0m6.076s
sys	0m0.679s
blacklab% time sr status >/dev/null 2>&1

real	0m2.807s
user	0m2.110s
sys	0m0.693s
blacklab% 

so status is faster... what about declare?

blacklab% time sr1 declare >/dev/null 2>&1

real	0m1.095s
user	0m0.924s
sys	0m0.118s
blacklab% time sr declare >/dev/null 2>&1

real	0m3.961s
user	0m9.832s
sys	0m1.425s
blacklab% 

declare and setup are slower in the new version... which is parallel.
my guess is that the overhead of reading the global state is about 2 seconds (based on the status run).
That overhead isn't present in the sequential version, so for smaller configurations
the new one will be slower, but when you go to a much bigger configuration it will be faster.
On the other hand, it might be that launching all the tasks in parallel and then reaping them carries so much more overhead that one at a time is faster... but I doubt it. Would have to see it run on a larger configuration.

blacklab% time sr1 stop >/dev/null 2>&1

real	0m19.842s
user	0m6.714s
sys	0m0.857s
blacklab% time sr1 start >/dev/null 2>&1

real	0m10.814s
user	0m9.135s
sys	0m1.137s

now with parallel version:

blacklab% time sr start
gathering global state: procs, configs, state files, logs, analysis - Done. 
starting............................................................................................Done

real	0m4.673s
user	0m1.713s
sys	0m0.482s
blacklab% time sr stop
gathering global state: procs, configs, state files, logs, analysis - Done. 
stopping.............................................................................................Done
Waiting 1 sec. to check if 93 processes stopped (try: 0)
All stopped after try 0

real	0m5.645s
user	0m3.520s
sys	0m1.052s
blacklab% 

So it's maybe 3x or 4x faster for this small testing configuration. Advantage should increase with the size of the configuration.

@petersilva (Contributor, Author)

basically you pay a 2-second overhead cost on startup, and then the rest should be much faster, as there is no more reading of configurations...

@petersilva (Contributor, Author)

so this implementation uses subprocess. It is now a standalone thing (not tangled up with other classes many levels deep), so it could be re-formulated to use another API (multiprocessing) if desired. Not convinced that would be helpful, but at least now it is easy to experiment.
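For illustration, the launch-and-reap pattern looks roughly like this (the command lines and log paths are assumptions, not the actual sr code):

import subprocess

def start_all(commands, logdir):
    procs = []
    for i, cmd in enumerate(commands):
        # each launched process gets its own log; stderr folded into stdout
        log = open('%s/instance_%03d.log' % (logdir, i), 'ab')
        procs.append(subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT))
    return procs

def reap_all(procs):
    return [p.wait() for p in procs]  # collect exit codes once, at the end

Swapping subprocess for multiprocessing would mostly change start_all; the reap step keeps the same shape.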

@petersilva (Contributor, Author)

handing to Benoit to look at adding filtering.

@petersilva (Contributor, Author) commented Aug 29, 2019

the first implementation of declare is probably dumb. We should probably do it over as a global op:
improve the parser to understand exchange and queue declarations, and do them all in a single process.
That should give a big improvement in performance... but declare isn't a performance pain point, so probably just spawn a new issue once this is merged, and deal with it in time.
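The global op would amount to collecting declarations across all configs and de-duplicating before acting (a sketch; the cfg fields and the declare_* callables are hypothetical, not the actual parser API):

def declare_all(configs, declare_exchange, declare_queue):
    exchanges, queues = set(), set()
    for cfg in configs:
        exchanges.update(cfg.get('exchanges', []))
        queues.update(cfg.get('queues', []))
    for x in sorted(exchanges):  # each declaration is issued exactly once
        declare_exchange(x)
    for q in sorted(queues):
        declare_queue(q)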

@petersilva (Contributor, Author)

once declare is done, we can do a really safe cleanup, because it will understand all the declarations of all the configurations as a single unit.

@petersilva (Contributor, Author)

merged the work-in-progress for now... it passes the flow check and there is some positive feedback.
it is useful as-is, but it would be even better to add the filtering, so this branch should continue at least until that is done.

@petersilva (Contributor, Author)

released in 2.19.09

@petersilva (Contributor, Author)

psutil on linux is still disastrously slow... going to add a shortcut on linux to call ps directly.
dunno what psutil is doing, but it is like... 240x slower than it should be.
opened a bug:

giampaolo/psutil#1751
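The shortcut amounts to something like this (a sketch; the ps field list and the 'sr_' filter are assumptions about what the status scan needs):

import subprocess

def ps_snapshot():
    out = subprocess.check_output(['ps', '-e', '-o', 'pid=,cmd='], text=True)
    procs = {}
    for line in out.splitlines():
        pid, _, cmd = line.strip().partition(' ')
        if 'sr_' in cmd:  # keep only sarracenia processes
            procs[int(pid)] = cmd
    return procs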

petersilva added a commit that referenced this issue on May 6, 2020

… command.

that python module is way too slow to use at scale; bug opened with them.

fix #174, help with #315 and #180, and #187
@petersilva (Contributor, Author)

well, Giampaolo's fix was simple and effective... a second, simpler patch is inbound...
