
Fair performance comparison with QuantLib #80

Open
DmitriGoloubentsev opened this issue Sep 12, 2022 · 9 comments

@DmitriGoloubentsev
Hi guys,

In the "Monte Carlo via Euler Scheme" example you compare TF with QuantLib pricing and conclude that TF Quant Finance is 100x faster (or more).

I want to note that in QL you evolve 100 time steps of a lognormal process, but in TF you work in log space and only apply exp() at the end.
I agree QL may not be very fast, but in this example you are comparing 100 exponentials per path in QL to just 1 exponential in TF...
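The point can be illustrated with a toy NumPy Euler sketch (not the actual benchmark code; all parameter values are made up): the per-step-exp and log-space versions produce identical paths, but with 100x different exp() counts.

```python
import numpy as np

def gbm_per_step_exp(s0, mu, sigma, dt, dw):
    # QL-style: evolve the lognormal process directly, one exp() per step
    s = np.full(dw.shape[1], s0)
    for k in range(dw.shape[0]):
        s = s * np.exp((mu - 0.5 * sigma**2) * dt + sigma * dw[k])
    return s

def gbm_log_space(s0, mu, sigma, dt, dw):
    # TF-style: accumulate increments in log space, a single exp() at the end
    log_s = (np.log(s0)
             + (mu - 0.5 * sigma**2) * dt * dw.shape[0]
             + sigma * dw.sum(axis=0))
    return np.exp(log_s)

rng = np.random.default_rng(42)
dw = rng.standard_normal((100, 10_000)) * np.sqrt(0.01)  # 100 steps, 10k paths
a = gbm_per_step_exp(100.0, 0.03, 0.2, 0.01, dw)
b = gbm_log_space(100.0, 0.03, 0.2, 0.01, dw)
assert np.allclose(a, b)  # same paths: 100 exp() calls per path vs. 1
```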

Thank you!

@cyrilchim
Contributor

Thanks for reaching out, Dmitri!

I think the point was to demonstrate GPU speed up rather than direct comparison to QL. We would very much welcome a contribution for a better CPU benchmark!

@DmitriGoloubentsev
Author

DmitriGoloubentsev commented Sep 12, 2022

Sounds good! I'll get back to you later with a CPU benchmark for this.

Also, you do not include graph optimization time in the reported timings.
// # Second run (excludes graph optimization time)

I know it does not depend on the number of paths, but it's still part of the total pricing time. And for QL's CPU execution it's 0.

Shouldn't you report this separately?
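The overhead in question can be measured directly (a minimal sketch with a toy stand-in for the pricing graph, assuming TensorFlow with XLA is available; the actual numbers are machine-dependent):

```python
import time
import tensorflow as tf

@tf.function(jit_compile=True)
def price(z):
    # toy stand-in for a pricing graph: log-space terminal value, one exp()
    log_s = 0.2 * tf.reduce_sum(z, axis=0)  # sum of increments per path
    return tf.reduce_mean(tf.exp(log_s))

z = tf.random.normal((100, 100_000), seed=7, dtype=tf.float64)

t0 = time.perf_counter()
first = price(z)    # first call: includes tracing + XLA compilation
t1 = time.perf_counter()
second = price(z)   # second call: graph is cached, execution only
t2 = time.perf_counter()

print(f"first call:  {t1 - t0:.3f}s (with graph build/compile)")
print(f"second call: {t2 - t1:.3f}s (execution only)")
```

Reporting both numbers separately would make the comparison with QL (where the "compile" cost is 0) easier to interpret.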

@DmitriGoloubentsev
Author

DmitriGoloubentsev commented Sep 12, 2022

On second thought, if you simulate 100 time steps and only apply one exp() at the end, you don't really do much computation per path.

So your problem is basically reduced to an RNG algorithm competition.

You should somehow increase the complexity of your SDE. Perhaps use the Heston local vol model to make this benchmark more relevant to the real world. With flat vols, flat rates and a simple normal process, I don't know how relevant this benchmark is for practitioners.

@DmitriGoloubentsev
Author

What random generator is used if "PSEUDO_ANTITHETIC" is set?
For QL you don't use antithetic sampling. I suspect antithetic sampling halves the number of required random numbers... Am I correct in this?
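For reference, antithetic sampling draws half the normals and mirrors them, so only half as many random numbers need to be generated (a plain NumPy sketch of the general technique, not the tff implementation):

```python
import numpy as np

def antithetic_normals(n_samples, n_dims, seed=0):
    """Draw n_samples/2 normals and mirror them: half the RNG work,
    plus variance reduction from the negative correlation."""
    rng = np.random.default_rng(seed)
    half = rng.standard_normal((n_samples // 2, n_dims))
    return np.concatenate([half, -half], axis=0)

z = antithetic_normals(10_000, 100)
assert z.shape == (10_000, 100)
assert np.allclose(z.mean(axis=0), 0.0)  # exactly zero mean by construction
```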

@SergK13GH

SergK13GH commented Nov 17, 2022

There is an additional question about memory consumption, especially when running with XLA optimization.
I hit an error message when running the example with just num_timesteps = 5000, without XLA:
2022-11-17 14:50:46.371456: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 4000000000 exceeds 10% of free system memory.
Memory available: 16 GB + 22 GB swap.
And there is a kernel crash when running with XLA with these parameters:
(ResourceExhaustedError: Out of memory while trying to allocate 72003200088 bytes. [Op:__inference_price_eu_options_1037])
Are there settings which control the limit for memory allocation?
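For scale, the 4 GB allocation is consistent with a precomputed float64 normal-draw tensor of shape [num_timesteps, num_paths], assuming 100,000 paths (an assumption on my part; the path count isn't shown above):

```python
num_timesteps = 5_000
num_paths = 100_000        # assumed path count; not confirmed in the report
bytes_per_float64 = 8

# Memory for one precomputed [num_timesteps, num_paths] float64 tensor
draws_bytes = num_timesteps * num_paths * bytes_per_float64
print(draws_bytes)         # 4_000_000_000, matching the allocation warning
```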

@cyrilchim
Contributor

To answer @DmitriGoloubentsev's question: yes, antithetic sampling does use fewer samples. I think we could measure the time it takes to simulate the random numbers and then subtract that from the runtime. At the time of writing that colab I was mainly motivated by the GPU speed-up, not by comparing CPU performance. The colab can be extended to sample from the Heston model as well (just update the GenericItoProcess drift and volatility definitions).
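A rough sketch of what such an extension would compute per path (a plain full-truncation Euler scheme for Heston in NumPy, independent of the tff `GenericItoProcess` API; all parameter values are illustrative):

```python
import numpy as np

def heston_euler(s0, v0, kappa, theta, xi, rho, r, dt, n_steps, n_paths, seed=0):
    """Full-truncation Euler for
       dS = r S dt + sqrt(v) S dW1,
       dv = kappa (theta - v) dt + xi sqrt(v) dW2,  corr(dW1, dW2) = rho."""
    rng = np.random.default_rng(seed)
    s = np.full(n_paths, s0)
    v = np.full(n_paths, v0)
    for _ in range(n_steps):
        z1 = rng.standard_normal(n_paths)
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
        v_pos = np.maximum(v, 0.0)  # full truncation: clamp variance at 0
        s = s * np.exp((r - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z1)
        v = v + kappa * (theta - v_pos) * dt + xi * np.sqrt(v_pos * dt) * z2
    return s

s = heston_euler(100.0, 0.04, 1.5, 0.04, 0.3, -0.7, 0.0, 1 / 100, 100, 50_000)
print(s.mean())  # ~100 under r = 0 (martingale sanity check)
```

Unlike the log-space lognormal benchmark, this does real per-step work (two correlated draws, a square root, and an exp() per path per step), which is closer to what practitioners run.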

As for graph compilation time, normally you would deploy a TensorFlow graph to avoid any compilation time overhead.

@SergK13GH, the samples are precomputed for vectorization purposes. You could switch to tff.math.random.RandomType.PSEUDO and set precompute_normal_draws=False in the sampler. We try to vectorize computations where possible to ensure good GPU performance. One could, of course, rewrite the whole thing using while loops, but then you'd lose the benefits of vectorization. As for memory-controlling measures, I think you'd need to control that on your side.

@DmitriGoloubentsev
Author

> As for graph compilation time, normally you would deploy a TensorFlow graph to avoid any compilation time overhead.

Sorry, can you please elaborate on what "deploy a TensorFlow graph" means?

Do you assume you can compile graph once and use it for all valuations in the future?

@cyrilchim
Contributor

Yes, you could build a graph in Python and save its proto definition, which you can then deploy using TensorFlow Serving.
See, e.g., here. You'd need to wrap your function in a tf.Module like here.
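Roughly, "deploying a graph" means saving the traced function once and reloading it later without retracing (a minimal sketch with a toy pricing function; the `Pricer` class and export path are illustrative):

```python
import tempfile
import tensorflow as tf

class Pricer(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.float64)])
    def price(self, spots):
        # toy stand-in for a real pricing graph
        return tf.sqrt(spots)

export_dir = tempfile.mkdtemp()          # illustrative export location
tf.saved_model.save(Pricer(), export_dir)

# Later (or inside TensorFlow Serving): load the saved graph -- no Python
# source and no retracing needed, the concrete function is restored as-is.
restored = tf.saved_model.load(export_dir)
out = restored.price(tf.constant([100.0, 121.0], tf.float64))
print(out.numpy())  # [10. 11.]
```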

@DmitriGoloubentsev
Author

I can see how that may work for a simple case (flat model parameters and the same number of time steps).

But am I right that in real problems you need to recompile the graph every day for all models and all trades?
I.e., as trades age, model parameter interpolations change, trade cash flows are paid, and simulation time points move (they are usually defined w.r.t. the current time), so you need to redefine the valuation graph and hence recompile it.

I think you can only reuse a valuation graph within the same trading day, and it's still a good idea to report how much time and memory this step needs.

Simulating a normal process with an Euler scheme for 1000 time steps is a very basic problem. What happens when you have 1000 IR swaps to price for xVA? Your graph is going to be huge and the compilation time significant, regardless of whether you use GPU or CPU.
