
Parallelization only part-time #204

Open
Optiuse opened this issue Feb 19, 2020 · 16 comments

@Optiuse

Optiuse commented Feb 19, 2020

Hello,

When I want multithreaded parallel function evaluations on 8 threads, I do this:

omp_set_dynamic(false);        // OpenMP setting
omp_set_num_threads(8);        // OpenMP setting
cmaparams.set_mt_feval(true);  // libcmaes setting

This should activate the following pragma in esostrategy.cc:
#pragma omp parallel for if (_parameters._mt_feval)
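For context, a minimal end-to-end sketch of this setup (a sketch only, assuming the standard libcmaes API from cmaes.h; the objective function and its artificial sleep are invented here as a stand-in for an expensive evaluation of variable cost):

#include <cmaes.h>
#include <chrono>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>
#include <omp.h>

using namespace libcmaes;

// stand-in objective whose runtime (artificially) varies per candidate
FitFunc expensive = [](const double *x, const int N)
{
  int ms = 100 + static_cast<int>(std::fabs(x[0]) * 1000.0) % 1900;
  std::this_thread::sleep_for(std::chrono::milliseconds(ms));
  double val = 0.0;
  for (int i = 0; i < N; ++i)
    val += x[i] * x[i];
  return val;
};

int main()
{
  omp_set_dynamic(0);              // OpenMP: no dynamic adjustment of the team size
  omp_set_num_threads(8);          // OpenMP: request 8 threads

  std::vector<double> x0(10, 1.0); // starting point, dimension 10
  CMAParameters<> cmaparams(x0, 0.5 /* sigma */);
  cmaparams.set_mt_feval(true);    // libcmaes: parallelize f-evaluations within a generation

  CMASolutions cmasols = cmaes<>(expensive, cmaparams);
  std::cout << "best solution: " << cmasols << std::endl;
  return 0;
}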

My objective functions are very expensive, and some parameter sets take much longer to evaluate than others. The problem is that often not all threads are busy. Overall this makes the optimization take much longer, because the CPU frequently runs at only 1/8 of its capacity, so the parallelization is effective only part of the time.
It looks as if the parallelization is interrupted as soon as a "slow function evaluation" occurs, as if the start of new work had to wait for it.

What is the reason for this, and is there a way to keep all 8 threads busy so that the CPU runs at full capacity? That would significantly speed up the overall optimization.

Greetings

@beniz
Collaborator

beniz commented Feb 19, 2020

Hi, my guess is that some of your external function calls take much longer than others, and thus the parallelized for loop waits until all calls have finished.
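To illustrate the implicit barrier described here (a toy OpenMP program, not libcmaes code): the parallel for does not return until its slowest iteration has finished, so threads that finished early sit idle in the meantime.

#include <chrono>
#include <cstdio>
#include <thread>
#include <omp.h>

int main()
{
  omp_set_num_threads(8);
  #pragma omp parallel for
  for (int i = 0; i < 8; ++i)
  {
    // iteration 0 is deliberately much slower than the others
    std::this_thread::sleep_for(std::chrono::seconds(i == 0 ? 10 : 1));
    std::printf("iteration %d done on thread %d\n", i, omp_get_thread_num());
  }
  // reached only after ALL iterations, including the slow one, have completed
  std::printf("parallel for finished\n");
  return 0;
}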

@Optiuse
Author

Optiuse commented Feb 19, 2020

OK, so all 8 runs always have to complete before the next 8 are started? Is it possible for it not to block, so that the next evaluations can already start?

I have found: "OpenMP will automatically wait for all threads to finish before execution continues."
http://ppc.cs.aalto.fi/ch3/nowait/
Is it possible to use something like "nowait" so that it no longer blocks?
#pragma omp for nowait

@beniz
Collaborator

beniz commented Feb 19, 2020

If you'd like to reacquire the threads so that they can somehow be used by the still-ongoing eval function calls, you may want to try nowait, but my guess is that this would only be of interest if your eval function spawns new threads during its execution... Seems like a niche case to me. I guess the main question here is: is your eval function multithreaded to begin with?

@Optiuse
Author

Optiuse commented Feb 19, 2020

Inside the FitFunc, execution blocks until the return value has been calculated. The FitFunc does start a thread internally, but it waits for that thread to finish and for the return value to be computed, so the FitFunc is not bypassed and always returns a real value.

But I think I found the reason: it's related to the value of lambda.
I tracked the CPU load in the Task Manager with different lambda settings, and the effect that the CPU is not always working at capacity appears to be related to lambda.
Shortly before the number of completed runs reaches lambda, the CPU load drops.
You can see in the picture that this drop always occurs, but the interval between drops differs depending on lambda.

[Figure: Task Manager CPU load over time, showing periodic drops in utilization]

So now I think the effect has to do with the CMA-ES algorithm (the value of lambda) and not with OpenMP.
Do you think there is a way to keep all threads busy at all times? That would speed up the overall optimization.

@yasamoka

yasamoka commented Mar 6, 2020

Well, show us some code.

@Optiuse
Author

Optiuse commented Mar 26, 2020

The code doesn't matter, because any function that takes some time will show this effect. You can put a random sleep into the function (e.g. one second to two minutes) and the same thing will happen.

I think the problem is that libcmaes wants to finish all lambda evaluations before going any further; it blocks until a population is completely done. During the last runs of a population only one or a few threads are busy, until a new population can start. This effect can be seen in the graphic above and prevents the CPU from being fully used, which slows down the total optimization considerably.

libcmaes is great and can find the global optimum in a small number of steps, but this advantage is nullified because the parallelization is only part-time.
Is it that genetic optimizers always have to block before starting a new population?

As an example, there is another optimization algorithm, simulated annealing, that also uses OpenMP but doesn't block anything:
https://github.com/epwalsh/CSA
It needs many more runs to reach the best result (it is not as effective as CMA-ES), but it is faster overall because nothing blocks.

Would it be possible to design cmaes so that the multithreading is not limited?
It would be great if libcmaes could use the CPU fully at all times.

@nikohansen
Collaborator

Is it that genetic optimizers always have to block before starting a new population?

No.

Would it be possible to design cmaes so that the multithreading is not limited?

Without knowledge of the implementation details (that is, without knowing how simple, complicated or impossible it may be to implement), I can see, off the top of my head, two "simple" ways to go about this (a sketch of the second option follows below):

  • if one solution is finished, send a new solution to the idle CPU. Complete the iteration when enough (i.e. lambda) solutions have finished and kill the still running solutions.

  • set lambda to, say, 90% of the #CPUs and occupy all CPUs when starting the iteration. Then the iteration is completed when lambda solutions are finished and the remainders can be killed. This can still lead to a long iteration, if more than 10% of the solutions are stuck. This may however not be likely, because long running solutions have a selection disadvantage.

In principle, CMA-ES is robust to small changes in lambda or, equivalently, to assigning arbitrarily bad fitness values to 15% or so of the solutions.
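A conceptual sketch of the second option (plain C++ with std::async, not libcmaes code; the candidate sampling and the CMA-ES update are omitted, and since standard C++ cannot forcibly kill a running task, the "kill" is a cooperative cancellation flag):

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <future>
#include <vector>

// stand-in for one expensive evaluation; checks the cancellation flag
double evaluate(const std::vector<double> &x, std::atomic<bool> &cancel)
{
  double val = 0.0;
  for (double xi : x)
  {
    if (cancel.load()) return 1e30;  // cooperative "kill": bail out with a bad value
    val += xi * xi;                  // stand-in for the real work
  }
  return val;
}

int main()
{
  const int n_cpus = 32, lambda = 29;  // lambda is roughly 90% of the CPUs
  std::vector<std::vector<double>> pop(n_cpus, std::vector<double>(10, 1.0));  // sampled candidates
  std::atomic<bool> cancel{false};

  std::vector<std::future<double>> futures;
  for (auto &x : pop)                  // occupy all CPUs at the start of the iteration
    futures.push_back(std::async(std::launch::async, evaluate, std::cref(x), std::ref(cancel)));

  int finished = 0;
  std::vector<double> fvals(n_cpus, 1e30);  // unfinished candidates keep an arbitrary bad value
  while (finished < lambda)                 // the iteration ends once lambda solutions are done
    for (std::size_t i = 0; i < futures.size(); ++i)
      if (futures[i].valid() &&
          futures[i].wait_for(std::chrono::milliseconds(10)) == std::future_status::ready)
      {
        fvals[i] = futures[i].get();
        ++finished;
      }

  cancel.store(true);                       // ask the stragglers to stop
  std::printf("iteration done with %d of %d evaluations finished\n", finished, n_cpus);
  return 0;
}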

A note on the above figure: to me it suggests (if my interpretation is correct) that the loss in time is less than a factor of two: the reddish area is more than half of the white area.

@Optiuse
Author

Optiuse commented Mar 26, 2020

Thanks for the feedback!

A note on the above figure: to me it suggests (if my interpretation is correct) that the loss in time is less than a factor of two: the reddish area is more than half of the white area.

That was just a harmless example. Here is a current example of the CPU load; you can see that close to half of the CPU is not used. If the solutions take even slightly different times, less than 50% of the CPU is used.
32 cores, lambda=50:
[Figure: CPU load, 32 cores, lambda=50]

if one solution is finished, send a new solution to the idle CPU. Complete the iteration when enough (i.e. lambda) solutions have finished and kill the still running solutions.

Currently it looks like there is a loop over lambda evaluations; only when all of them are processed does a new loop start. This is probably what causes the repeatedly falling CPU curve.
If you kill calculations that take too long, they might be the successful ones; the function runs that take long might be the good ones.

set lambda to, say, 90% of the #CPUs and occupy all CPUs when starting the iteration. Then the iteration is completed when lambda solutions are finished and the remainders can be killed. This can still lead to a long iteration, if more than 10% of the solutions are stuck. This may however not be likely, because long running solutions have a selection disadvantage.

As said, the long-running solutions can be the good solutions; if these are killed, the optimum might not be found.
For the optimization to work, lambda has to depend on the dimension of the problem (the default population size is 4 + floor(3 ln n) for dimension n). If lambda is too small, it will not find a solution; if it is too big, the optimization takes far too long.
I therefore think lambda should be chosen independently of the core count, because it depends on the dimension.

If lambda is smaller than the number of cores, the CPU is never 100% utilized.
In the future CPUs will have more and more cores. I am currently optimizing with 32 cores; if I set lambda to 16, it looks like this:

[Figure: CPU load, 32 cores, lambda=16]

At most 50% of the CPU is used, and then it immediately drops to near 0%.
In total, only about 25% of the CPU is used in such an optimization.

In the near future there will be CPUs with 128, 256, 512 or more cores. If you optimize with, e.g., lambda=50, you will never be able to fully use the CPU.
It would be perfect if a great library like libcmaes were fully scalable (independent of values like lambda); otherwise the benefit of multithreading cannot be fully used.

@beniz
Collaborator

beniz commented Mar 26, 2020

I guess that theoretically (and @nikohansen can validate/invalidate), you should be able to hold on to the points for which the simulation takes longer, and reinject them when their f-value becomes available. If this is a valid option, then you may be able to hack this into the libcmaes code without too much trouble, approximately like this:

  • write a return signal for FitFunc to signify that the call hasn't finished in time
  • remove those points from the lambda set
  • keep track of the points whose values are still being computed in a pool, reinject them whenever they are ready, and have FitFunc access the pool to get the value.

In all cases, we can only give you directions on how to hack libcmaes so that it best serves your purpose; you'll have to code it up yourself. A rough sketch of such an external pool follows below.
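A rough sketch of such an external pool (all names here are invented for illustration; nothing below is libcmaes API except FitFunc, and the actual re-injection of a finished value into a later iteration is left out, since that is exactly the part that needs support on the optimizer side):

#include <cmaes.h>
#include <chrono>
#include <future>
#include <map>
#include <mutex>
#include <vector>

using namespace libcmaes;

// Thread-safe pool keyed by the candidate's parameter vector. FitFunc asks the
// pool for a value; if the expensive evaluation has not finished yet, it gets a
// large placeholder ("hasn't finished in time") back instead of blocking.
class EvalPool
{
public:
  double query(const std::vector<double> &x)
  {
    std::lock_guard<std::mutex> lock(_mutex);
    auto done = _done.find(x);
    if (done != _done.end())
      return done->second;                 // value was computed earlier
    auto it = _jobs.find(x);
    if (it == _jobs.end())                 // not started yet: launch the evaluation
    {
      _jobs.emplace(x, std::async(std::launch::async, &EvalPool::expensive_eval, this, x));
      return _placeholder;
    }
    if (it->second.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
    {
      _done[x] = it->second.get();         // a future can only be consumed once
      _jobs.erase(it);
      return _done[x];
    }
    return _placeholder;                   // still running
  }

private:
  double expensive_eval(const std::vector<double> &x)
  {
    double val = 0.0;
    for (double xi : x) val += xi * xi;    // stand-in for the real simulation
    return val;
  }

  const double _placeholder = 1e30;        // "call hasn't finished in time" signal
  std::map<std::vector<double>, std::future<double>> _jobs;
  std::map<std::vector<double>, double> _done;
  std::mutex _mutex;
};

// pool <-> fitfunc <-> libcmaes, as described above
EvalPool pool;
FitFunc pooled_eval = [](const double *x, const int N)
{
  return pool.query(std::vector<double>(x, x + N));
};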

@Optiuse
Author

Optiuse commented Mar 26, 2020

I hope I got it. So if something takes too long, it continues independently, and to let the for-loop in libcmaes continue I have to return a pseudo-value, right?

When the run is finished in the background, do I then overwrite this pseudo-value in the libcmaes pool?

How can I access the pool, and is there an ID to identify the correct entry?

@beniz
Collaborator

beniz commented Mar 26, 2020

You'd need to build a pool outside of libcmaes: pool <-> fitfunc <-> libcmaes. You'd have to maintain whatever ID you need. It's sketchy, but it may work.

@nikohansen hello! What would it mean for CMA-ES to deal with asynchronous f-values, meaning having CMA-ES sample a bunch of points but getting the point values back asynchronously? The distribution at time t would use whatever points are available, possibly from past sampling rounds, but every sampled point would eventually be considered. Would that hamper convergence?

@Optiuse
Author

Optiuse commented Mar 26, 2020

OK, I can start a separate, independent thread in FitFunc that gets the parameters, and then return a pseudo-value from FitFunc so that the libcmaes loop does not block.
But when the thread is finished, how do I get the return value back to libcmaes? Into which libcmaes data structure must the value be placed?

FitFunc fsphere = [](const double *x, const int N)
{
  // start a separate std::thread that does the real work...
  return val; // ...and immediately return a pseudo-value
};

void worker_thread()
{
  // expensive calculation...
  // how do I get the resulting value back into cmaes?
}

@nikohansen
Collaborator

you should be able to hold on to the points for which the simulation takes longer, and reinject them when their f-value becomes available.

What would it mean for CMA-ES to deal with asynchronous f-values, meaning having CMA-ES sample a bunch of points but getting the point values back asynchronously? The distribution at time t would use whatever points are available, possibly from past sampling rounds, but every sampled point would eventually be considered. Would that hamper convergence?

@beniz Hi! Right, I didn't think about this. It is possible. It is hard to predict how reliable it is (it should usually work, but what do I know). What needs to be available is something akin to check_points in the Python module, which calls a repair function. This function basically shortens the difference step between the incoming solution and the mean if it is longer than one should expect under the current sample distribution. In other words, the incoming solution is then interpreted as a direction relative to the current mean rather than as a solution. It is a generic technique that allows injecting arbitrary proposal solutions (and it works more often than not, as long as we don't inject too many of them).

If I find a simple way to mimic the situation in a test function, I may check out how robust the Python module is when faced with such delayed evaluations.

In the near future there will be CPUs with 128, 256, 512 or more cores. If you optimize with, e.g., lambda=50, you will never be able to fully use the CPU.

I don't see how one can get around this easily. If 500 CPUs are populated from the current distribution, the update will reflect the current distribution for 500 evaluations. It seems impossible to mimic 10 iterations without updating the samples in between. In this scenario I usually consider running several independent optimizations in parallel. If setting lambda=500 is not a viable option, it's hard to see how to fully exploit the available CPUs otherwise.

@nikohansen
Collaborator

A small update: if older solutions are fed back, it seems advisable to turn off active CMA (which updates the covariance matrix from bad solutions with negative weights) for understandable reasons. Also the TPA step-size adaptation is not likely to work out-of-the-box anymore (it relies on the first two solutions fed back).

This is the code I used to make a quick experiment:

import numpy as np
import cma
class LaggingFitness:
    """delivers X, F like the input argument to tell 
    """
    def __init__(self, f, fraction, X):
        """`fraction` is the fraction of solutions with delayed evaluation"""
        self.f = f
        self.fraction = fraction
        self.X = [np.array(x) for x in X]
        
    def __call__(self, X):
        """return X_out, F, where X_out != X"""
        Xout = []
        lenX = len(X)
        while len(Xout) < self.fraction * lenX:
            # TODO: we may want to choose the delay by fitness value rather than uniform
            Xout += [self.X.pop(np.random.randint(len(self.X)))]
        while len(Xout) < lenX:
            Xout += [X.pop(np.random.randint(len(X)))]
        self.X += X  # keep the rest for later
        return Xout, [self.f(x) for x in Xout]

es = cma.CMAEvolutionStrategy(11 * [1], 1, {'ftarget': 1e-9, 'CMA_active':False})
fit = LaggingFitness(cma.ff.elli, 0.8, es.ask() + es.ask()[:1])
while not es.stop():
    X = es.ask()
    X, F = fit(X)
    es.tell(X, F)
    es.disp()
    es.logger.add()
cma.plot()

@yasamoka

Is your fitness function itself parallelizable? I went that route with OpenMP and turned multithreading off for CMAES. That way the level of parallelism would not be dependent on the number of samples taken every iteration.
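A sketch of this approach (the data set and per-sample model below are invented placeholders; the only libcmaes parts are FitFunc, CMAParameters and cmaes<>): libcmaes is left single-threaded (mt_feval stays at its default, off) and OpenMP parallelizes over the data inside each evaluation, so the degree of parallelism no longer depends on lambda.

#include <cmaes.h>
#include <iostream>
#include <vector>

using namespace libcmaes;

int main()
{
  const std::vector<double> data(1000000, 1.0);   // placeholder for the real data set

  FitFunc parallel_eval = [&data](const double *x, const int N)
  {
    double err = 0.0;
    // parallelize over the data points instead of over the lambda candidates
    #pragma omp parallel for reduction(+:err)
    for (long i = 0; i < (long)data.size(); ++i)
    {
      double pred = 0.0;
      for (int j = 0; j < N; ++j)
        pred += x[j];                             // stand-in for the per-sample model
      err += (pred - data[i]) * (pred - data[i]);
    }
    return err;
  };

  std::vector<double> x0(10, 0.5);
  CMAParameters<> cmaparams(x0, 0.3);
  // mt_feval is left off: candidates are evaluated one after another, each using all cores internally
  CMASolutions cmasols = cmaes<>(parallel_eval, cmaparams);
  std::cout << "best solution: " << cmasols << std::endl;
  return 0;
}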

@Optiuse
Author

Optiuse commented Mar 28, 2020

I don't see how one can get around this easily. If 500 CPUs are populated from the current distribution, the update will reflect the current distribution for 500 evaluations. It seems impossible to mimic 10 iterations without updating the samples in between. In this scenario I usually consider running several independent optimizations in parallel. If setting lambda=500 is not a viable option, it's hard to see how to fully exploit the available CPUs otherwise.

Yes, that's true. If the number of CPU cores becomes too high relative to lambda, it will not work; maybe from about a factor of 3 on it no longer makes sense. If you increase lambda the CPU usage is better, but then the total time gets longer.
Yes, maybe it makes sense to run independent optimizations in this case; one possibility would be to split the parameter bounds.

Is your fitness function itself parallelizable? I went that route with OpenMP and turned multithreading off for CMAES. That way the level of parallelism would not be dependent on the number of samples taken every iteration.

Yes, that would be a good idea, but in my case it is not so easy to parallelize the function itself because the data is processed serially.
