
Precompile nested functions #29

Closed
alecandido opened this issue Apr 3, 2020 · 17 comments
Labels
enhancement New feature or request

Comments

@alecandido
Member

alecandido commented Apr 3, 2020

Since we are using a bunch of lambdas and function calls nested inside other functions (cf. the whole DistributionVec API), at some point this may slow down execution.

Instead of finding a way to avoid this nesting by rewriting things with even more tricks, it would be nice to find a way to precompile the functions before using them (e.g. for integration in scipy.integrate.quad, where a lot of function calls are involved).
Of course, if needed, this can involve rewriting the function definitions in a suitable fashion, adding decorators or whatever, but the goal is to avoid changing the structure critically because of this.

My idea of precompiling: of course I'm not an expert, otherwise I would already have figured out the proper way, but in the worst case it means defining a function that takes another function as input, goes through its calls collecting everything into a single expanded expression, and evaluates it again as Python code (hopefully an external library can do this for us; I don't know if numba or something else can).

@alecandido alecandido added the enhancement New feature or request label Apr 6, 2020
@alecandido
Member Author

Probably we should take a look at numba; that might mean #27 is enough to solve this one as well.

@alecandido
Member Author

Numba is doing more damage than good...

@scarrazza @scarlehoff do you have any advice?

Fri Apr 17 17:32:26 2020    apfel_w_numbify.prof

         169133690 function calls (160232623 primitive calls) in 183.925 seconds

   Ordered by: cumulative time
   List reduced from 8601 to 11 due to restriction <11>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1382/1    0.013    0.000  184.000  184.000 {built-in method builtins.exec}
        1    0.000    0.000  184.000  184.000 benchmark_against_apfel.py:4(<module>)
        7    0.001    0.000  182.688   26.098 benchmark_against_apfel.py:55(run_against_apfel)
        7    0.000    0.000  181.581   25.940 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:190(apply)
        7    0.018    0.003  181.581   25.940 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:140(__call__)
        7    0.000    0.000  181.116   25.874 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:130(get_output)
        7    0.000    0.000  181.116   25.874 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/__init__.py:49(get_output)
      126    0.000    0.000  181.116    1.437 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:65(get_output)
      126    0.002    0.000  181.115    1.437 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:35(_compute)
      252    0.065    0.000  181.114    0.719 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:48(_compute_component)
    14868    0.361    0.000  181.019    0.012 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/convolution.py:139(convolution)

Fri Apr 17 17:29:04 2020    apfel_wo_numbify.prof

         61928349 function calls (61900659 primitive calls) in 61.382 seconds

   Ordered by: cumulative time
   List reduced from 8070 to 11 due to restriction <11>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1382/1    0.015    0.000   61.399   61.399 {built-in method builtins.exec}
        1    0.000    0.000   61.399   61.399 benchmark_against_apfel.py:4(<module>)
        7    0.001    0.000   60.217    8.602 benchmark_against_apfel.py:55(run_against_apfel)
        7    0.000    0.000   59.400    8.486 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:190(apply)
        7    0.018    0.003   59.400    8.486 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:140(__call__)
        7    0.000    0.000   58.927    8.418 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:130(get_output)
        7    0.000    0.000   58.927    8.418 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/__init__.py:49(get_output)
      126    0.001    0.000   58.926    0.468 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:65(get_output)
      126    0.002    0.000   58.926    0.468 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:35(_compute)
      252    0.058    0.000   58.924    0.234 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:48(_compute_component)
    14868    0.292    0.000   58.839    0.004 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/convolution.py:139(convolution)

@scarlehoff
Member

Without knowing which functions you are actually trying to compile:
numba doing worse than no-numba usually means you are compiling things that you shouldn't, or that you are running things through numba that do not need it.

As a general rule, try to have one function that takes as many things as possible as arguments, compile that one, and pass everything as an argument.
I.e., ensure that you are not, for instance, compiling the call for each different value of the momentum.
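A toy sketch of the rule above (pure Python, hypothetical names, not the yadism code): if the expensive "compile" step is paid once per momentum value, the cost multiplies; if the momentum is passed as a runtime argument, the cost is paid a single time.

```python
compile_count = 0

def build_compiled(q2):
    # BAD: bakes q2 into the function, so every new value
    # pays the expensive step again (with numba: a recompilation).
    global compile_count
    compile_count += 1
    return lambda x: x * q2

for q2 in (1.0, 2.0, 3.0):
    build_compiled(q2)(0.5)
assert compile_count == 3   # paid three times

compile_count = 0

def build_once():
    # GOOD: q2 is a runtime argument, the expensive step runs once.
    global compile_count
    compile_count += 1
    return lambda x, q2: x * q2

f = build_once()
for q2 in (1.0, 2.0, 3.0):
    f(0.5, q2)
assert compile_count == 1   # paid once
```

With numba the same pattern applies: decorate one function whose signature covers all varying quantities, instead of generating a fresh function object per value.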

Another possible pitfall is having intermediate objects that do not allow numba to fully compile (basically, complicated Python objects, where "complicated" I think starts at dictionary; I can't remember right now though).

For more information I guess I'd have to read the code you are trying to numba.

@alecandido
Member Author

Thanks, we thought this could be the issue, and we will try to address it in some way.

The main problem is that we are not really calling numba inside yadism; rather, we are calling methods from eko that are numbified (BasisFunctions in particular), and as you said this can generate the issue.

@scarlehoff
Member

Ah, right: there, the structure/organization of what is compiled and when was designed with the specific use case of eko in mind, which might be highly suboptimal here.

@felixhekhorn
Contributor

Well, I think not really: we're dealing with BasisFunctions here and there, and they should compile once and for all (for a given configuration). I don't know if we (in eko) really do this, and how can we actually check it?

@scarlehoff
Member

A quick way is to check where nb.njit is called and print "compiling!" just before.

(When used as a decorator it shouldn't be much of a problem, because I think we only use it like that in a few functions that are very basic; nothing fancy there.)
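The trick suggested above can be sketched like this (a stand-in `njit_stub` replaces numba's `nb.njit` so the snippet is self-contained; with numba installed you would pass `nb.njit` instead):

```python
def njit_stub(func):
    # Stand-in for numba.njit; with numba: import numba as nb, pass nb.njit.
    return func

def loud_njit(func, njit=njit_stub):
    # Announce every call site that issues a fresh compilation request.
    print(f"compiling! {func.__name__}")
    return njit(func)

# If eko invokes nb.njit once per BasisFunction instance, the log will show
# one "compiling!" line per instance instead of one per function definition.
evaluate = loud_njit(lambda x: 2 * x)
assert evaluate(3) == 6
```

Note that with lazy numba the actual machine-code compilation happens at the first call per signature, not at decoration time, but counting the `nb.njit` call sites is enough to spot a per-instance compilation pattern.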

@alecandido
Member Author

@scarlehoff we gave it a try and discovered that we are compiling once per BasisFunction, but for the limited size of the required output the effort is not worth it. At this stage we are wondering what to do:

  • implement some kind of threshold (handled purely on our side, exposed to the user, or together with numba) to decide whether to compile or not
  • parallelize compilation; but in order to do this we need eager compilation (while numba is lazy by default) and a way to tell Python to continue while the compilation threads run on their own, until it hits a line where it needs to evaluate a function numba is still compiling on another thread
    • or we can just make a blocking compilation step for the basis functions, when instantiating the InterpolatorDispatcher or right after, compiling them in parallel and waiting for the last one to finish

@scarlehoff
Member

scarlehoff commented Apr 20, 2020

I think I put a flag to avoid compilation? Or maybe I just thought about putting it... I did it for vegasflow for sure.

In any case, if compilation is a problem, parallelizing just recreates the problem in every thread; the right thing is not to compile.

(Maybe I'm failing to understand the issue, but looking at the benchmark you linked before it seems what you want is to avoid compiling; lowering the time from 1 minute will create more problems than it will solve.)

@felixhekhorn
Contributor

relevant timings are:

  • w/o numba: 1.5s = 27s/17 = time spent in log_evaluate_x / number of observables (=: n_obs)
  • w/ numba: 2.5s = 145s/59 = time spent in _compile_for_args / number of basis functions (=: n_bf)

so our current setup is

  • n_obs = 17 and n_bf = 59, and we are losing 120s = 2.5*59 - 1.5*17
  • if we had e.g. 180 observables, we would gain 120s

or more generally: if and only if n_obs/n_bf > 5/3, it is worth running numba

The common number of data points in a DIS-only PDF fit is ~3000
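The arithmetic above can be restated in a few lines (same rounded per-unit costs; this is just the comment's cost model, with numba's execution time taken as negligible):

```python
t_eval = 1.5   # s per observable spent in log_evaluate_x, without numba (27s/17)
t_comp = 2.5   # s per basis function spent in _compile_for_args, with numba (145s/59)

def numba_balance(n_obs, n_bf):
    # positive -> time gained by compiling, negative -> time lost
    return t_eval * n_obs - t_comp * n_bf

assert round(numba_balance(17, 59)) == -122   # current setup: ~120s lost
assert numba_balance(180, 59) > 0             # ~180 observables: compiling wins
# break-even ratio: n_obs / n_bf must exceed t_comp / t_eval = 5/3
assert abs(t_comp / t_eval - 5 / 3) < 1e-12
```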

@scarlehoff
Member

Ok, so you are "losing" time in the trial benchmark, but "in real life" you would prefer to compile. Is that it?

If that's the case I would just avoid compilation during development (but keep it in tests to ensure it works).

@alecandido
Member Author

Ok, if all those points are going to be included in the relevant fits, our threshold corresponds to an xgrid size of about 2000, so we can safely say that if you are going to use an xgrid with fewer than 1000 points it's worthwhile to let numba compile the BasisFunctions.

@alecandido
Member Author

> relevant timings are:
>
> * w/o numba: 1.5s = 27s/17 = time spent in `log_evaluate_x` / number of observables (=: n_obs)
> * w/ numba: 2.5s = 145s/59 = time spent in `_compile_for_args` / number of basis functions (=: n_bf)
>
> so our current setup is
>
> * n_obs = 17 and n_bf = 59, and we are losing 120s = 2.5*59 - 1.5*17
> * if we had e.g. 180 observables, we would gain 120s
>
> or more generally: if and only if n_obs/n_bf > 5/3, it is worth running numba
>
> The common number of data points in a DIS-only PDF fit is ~3000

We updated the convolution function, using some clever caching to speed it up. Because of this the trade-off value also moved, mainly because we optimized the number of calls to log_evaluate_x and to the coefficient functions.
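The kind of caching meant above can be sketched with functools.lru_cache (the function name here is hypothetical; the actual yadism implementation may differ): repeated evaluations of the same basis function at the same x point are computed only once.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def basis_value(basis_id, x):
    # Stand-in for an expensive basis-function evaluation (log_evaluate_x)
    return (basis_id + 1) * x ** 2

# 300 lookups during a convolution, but only 3 distinct (basis_id, x)
# pairs are actually evaluated; the rest are cache hits.
values = [basis_value(b, 0.5) for _ in range(100) for b in range(3)]
assert basis_value.cache_info().misses == 3
assert basis_value.cache_info().hits == 297
```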

The updated trade-off report:

  • w/o numba: 0.6s = 10s/17 = time spent in log_evaluate_x / number of observables (=: n_obs)
  • w/ numba: 2.6s = 157s/59 = time spent in _compile_for_args / number of basis functions (=: n_bf)

Then the new trade-off is: n_obs/n_bf > 5

Conclusion: compiling is still convenient, but exactly 3 times less so...

@alecandido
Member Author

Ok, the discussion on compiling eko with numba has been sufficiently investigated, and we also investigated the option of compiling the coefficient functions, still with numba.

We concluded that with the current structure of the code it is not possible to compile a coefficient function inside Python, because it depends on external objects (quark_0 is a function, to be summed with quark_1 multiplied by eko's alpha; they are three different functions to be joined), so we will keep Python evaluation for the time being.
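For illustration, the shape of the problematic object (toy coefficient functions, not the real expressions): the full coefficient function is built by combining separate function objects weighted by the coupling, and numba's nopython mode cannot digest generic Python callables passed around like this; the workaround would be one flat wrapper whose calls are statically known.

```python
def quark_0(x):          # toy LO piece (hypothetical, not the real expression)
    return 1.0 - x

def quark_1(x):          # toy NLO piece (hypothetical)
    return (1.0 - x) ** 2

# What we have now: a function assembled from other function objects...
coefficient = lambda x, a_s: quark_0(x) + a_s * quark_1(x)

# ...which numba cannot compile when quark_0/quark_1 arrive as plain Python
# callables; a flat wrapper with the calls inlined would be compilable.
def coefficient_flat(x, a_s):
    return (1.0 - x) + a_s * (1.0 - x) ** 2

assert coefficient(0.5, 0.1) == coefficient_flat(0.5, 0.1)
```

(With numba itself, calling other @njit functions from an @njit function works; what fails is passing arbitrary function objects around as data.)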

@scarlehoff if you have any idea about compiling sums and products of functions with numba, that would be perfect, but experimenting we found that it really does not like functional programming :)

@scarlehoff
Member

scarlehoff commented Apr 21, 2020

it really does not like functional programming :)

Because it's python not js!

In any case, as I said, if timing is not a problem, over-optimizing is not worth it.
But if at some point you need better speed, it is better to keep in mind the limitations of numba and slowly move away from passing lambdas around. In particular because eventually I want to work on making eko run on GPU*, as the people from numba seem to be working heavily on GPU support now. This should be very good for eko.**

*Personally I'm ignoring it for now, but making eko run on GPU is in the "cool things to do" section of my TODO list (cc @scarrazza, they claim to be compatible with AMD cards).

** @felixhekhorn (I'm using GitHub as a Slack chat at this point) about eko: maybe we need to have a chat at some point to decide priorities, whether we want to keep it on the side for now or start pushing forward to have it for the summer meeting (I'll be busy for a few weeks on n3fit-related stuff but will be much freer afterwards).

@alecandido
Member Author

Probably, if in the end speed turns out to be an issue, the solution to inlining lambdas is working out a proper computational graph with a suitable library.

Does this mean tensorflow?
