Working example of fwd_gradient #27

Open
feribg opened this issue Mar 6, 2020 · 15 comments

feribg commented Mar 6, 2020

Do you have a running example of tff.math.fwd_gradient, either in the notebook or somewhere in the tests? I only saw a very contrived one in https://github.com/google/tf-quant-finance/blob/39a1c03cc9938c6e08fceb63e405eff0fda835be/tf_quant_finance/math/integration/integration_test.py

Mainly for the purposes of calculating:
a) first order greeks
b) higher order greeks

@saxena-ashish-g

Yes, we have a pending PR for this.
#24

feribg commented Mar 6, 2020

Thanks, I'll take a look at it. Meanwhile, do you have some high-level docs, or could you explain in a few words which parts of the FD scheme are parallelized on the GPU? It's mostly a sequential algorithm: you need to loop over each timestep and solve the tridiagonal system step by step, since the solution at step N+1 depends on the solution at step N. Do you constrain the types of schemes that can be used (maybe just implicit)? I was reading through the code, but it's not immediately apparent. Thanks!

@saxena-ashish-g

The primary benefit of the GPU probably comes from the tridiagonal solver. We wrote a GPU kernel for that, which we upstreamed to TensorFlow core (https://www.tensorflow.org/api_docs/python/tf/linalg/tridiagonal_solve). The rest of the computation is pretty serial, as you say.
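
For reference, a minimal sketch of calling the batched solver directly (the shapes and values below are purely illustrative):

import tensorflow as tf

# A batch of 4 tridiagonal systems, each 5x5, solved in a single call.
# In 'compact' format the diagonals tensor has shape [..., 3, n]:
# row 0 is the superdiagonal, row 1 the main diagonal, row 2 the subdiagonal.
batch, n = 4, 5
dtype = tf.float64
main = tf.fill([batch, 1, n], tf.constant(2.0, dtype))
off = tf.fill([batch, 1, n], tf.constant(0.5, dtype))
diagonals = tf.concat([off, main, off], axis=1)  # shape [batch, 3, n]
rhs = tf.ones([batch, n], dtype)

x = tf.linalg.tridiagonal_solve(diagonals, rhs, diagonals_format='compact')
# x has shape [batch, n]; on a GPU the whole batch is solved in parallel.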

feribg commented Mar 11, 2020

@saxena-ashish-g do you have more details on the implementation or the CUDA kernel? I couldn't see it in the TF repository.

@saxena-ashish-g

It is located under tensorflow/core/kernels/tridiagonal_solve_op.cu.cc, but you can't find it on GitHub because the size of the kernels directory causes truncation in the directory listing. If you clone the repo, you should be able to find it there.

Where it ought to be:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/tridiagonal_solve_op.cu.cc

@cyrilchim

Hi Feras,

The GPU kernel for the tridiagonal solver is imported from cuSPARSE:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/tridiagonal_solve_op_gpu.cu.cc

As I remember, the solver should be this one:
https://docs.nvidia.com/cuda/cusparse/index.html#gtsv

As for the forward gradient, we had to take a crack at differentiating the while loop to make it performant (I assume you are interested in Greeks for Monte Carlo). I'll try to send a hacky colab version tomorrow, but it will take some time before we check it in properly.

feribg commented Mar 12, 2020

Thanks Cyril, I took a look, but the Nvidia docs don't say which algorithm the solver uses. So although the underlying library is clear, I'm still not sure how they parallelized an inherently iterative scheme. I guess it's just parallel in the batch dimension, not the actual computation?
https://docs.nvidia.com/cuda/cusparse/index.html#gtsv2

My use case is the FD scheme, no Monte Carlo yet. I just want to compare the accuracy of derivatives computed off the grid directly vs. ones computed numerically. I assume that when doing the fwd_gradient, the gain is that we get delta, gamma, vega, theta and rho at once, rather than having to do 5 numerical estimates, each requiring 2 evaluations (or is this not how autograd works?).
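
For illustration, a minimal sketch of one first-order Greek via fwd_gradient on the closed-form Black-Scholes price. This assumes the usual volatilities/strikes/expiries/spots keyword arguments of tff.black_scholes.option_price; note that each fwd_gradient call differentiates with respect to one chosen input, so different Greeks need separate calls:

import tensorflow as tf
import tf_quant_finance as tff

dtype = tf.float64
vols = tf.constant([0.2], dtype)
strikes = tf.constant([100.0], dtype)
expiries = tf.constant([1.0], dtype)
spots = tf.constant([105.0], dtype)

def price(s):
  return tff.black_scholes.option_price(
      volatilities=vols, strikes=strikes, expiries=expiries, spots=s,
      dtype=dtype)

# Delta: one forward-mode pass with respect to the spot.
delta = tff.math.fwd_gradient(price, spots)
# Vega would be another fwd_gradient call with respect to vols;
# higher-order Greeks would need nested gradient calls (not shown).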

I'm also working on comparing that to two other probabilistic solutions: a neural network mapping inputs -> price (works reasonably well for the BS case, not yet validated for Heston or fBM models), and more of a neural-PDE-solver approach. So I'm using tf-quant mostly as a data generator, testing convergence on training data generated by it and by QuantLib for cross-reference.

cyrilchim commented Mar 12, 2020

It seems they use cyclic reduction on the GPU, since the documentation for cusparse<t>gtsv_nopivot() says:
"The routine does not perform any pivoting and uses a combination of the Cyclic Reduction (CR) and the Parallel Cyclic Reduction (PCR) algorithms to find the solution." I am not sure you can get the implementation details. The parallelism should be along the batch as well as within the algorithm itself, and how to parallelize is decided by the algorithm.

As for the gradient, you can differentiate through the PDE solver, but it is not cost free. Say
you have a matrix function A = A(a); then (A^{-1})' = -A^{-1} A' A^{-1}, so the tridiagonal solver needs to be invoked twice. Used within an algorithm, automatic differentiation of the tridiagonal solver might still give you substantial performance gains.

Please note that the gradients need to be aggregated (see the pull request Ashish mentioned above).

Here is an example of gradient computation:

import tensorflow as tf
import tf_quant_finance as tff

dtype = tf.float64
d = tf.constant([1, 1, 1], dtype=dtype)
super_d = tf.constant([0.1, 0.1, 0.1], dtype=dtype)
sub_d = tf.constant([0.2, 0.2, 0.2], dtype=dtype)
rhs = tf.constant([2, 2, 2], dtype=dtype)

# This is the function you are going to differentiate.
def f(super_d):
  return tf.linalg.tridiagonal_solve([super_d, d, sub_d], rhs, diagonals_format='sequence')

# Here is a gradient with aggregation df/d super_d[0] + df/d super_d[1] + df/d super_d[2]
tff.math.fwd_gradient(f, super_d)
# Outputs : <tf.Tensor: shape=(3,), dtype=float64, numpy=array([-1.31076389, -1.47569444,  0.29513889])>

# This equals the limit, as e -> 0, of the finite difference:
e = 0.00001
(f(super_d + e) - f(super_d)) / e
# Outputs: <tf.Tensor: shape=(3,), dtype=float64, numpy=array([-1.31074852, -1.47570059,  0.29514012])>

# Here is a gradient without aggregation, simply df/d super_d[0]
tff.math.fwd_gradient(f, super_d, input_gradients=tf.constant([1, 0, 0], dtype=dtype))
# Outputs : <tf.Tensor: shape=(3,), dtype=float64, numpy=array([-1.48871528,  0.30381944, -0.06076389])>

# Again, the limit, as e -> 0, of the finite difference:
e = 0.00001
e1 = tf.constant([e, 0, 0], dtype=dtype)
(f(super_d + e1) - f(super_d)) / e
# Outputs: <tf.Tensor: shape=(3,), dtype=float64, numpy=array([-1.48871838,  0.30382008, -0.06076402])>
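
For the directional case just above, the same numbers can be recovered from the identity (A^{-1})' = -A^{-1} A' A^{-1} using two solver calls; here is a sketch reusing the tensors defined in the example:

# dA/d super_d[0] has a single 1 at position (0, 1), so applying it to
# x = A^{-1} rhs picks out x[1] in the first row. The directional
# derivative is then -A^{-1} [x[1], 0, 0].
x = f(super_d)
zero = tf.zeros([], dtype=dtype)
-tf.linalg.tridiagonal_solve([super_d, d, sub_d],
                             tf.stack([x[1], zero, zero]),
                             diagonals_format='sequence')
# Outputs approximately: [-1.48871528, 0.30381944, -0.06076389], matching above.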

Hope this helps. Do not hesitate to reach out.

feribg commented Mar 17, 2020

Thanks @cyrilchim, your example is actually a lot more detailed than the PR. But since the solver returns some kind of tuple, how do you differentiate through it without aggregation? Do I return tf.Variable(estimate[:, -1]), or do I have to differentiate over the entire grid even though we only care about the final time step?

    (estimate_values, estimate_grid, _, _) = pde.fd_solvers.solve_backward(
        start_time=expiry,
        end_time=0,
        values_transform_fn=values_transform_fn,
        coord_grid=grid,
        values_grid=final_values_grid,
        time_step=time_delta,
        boundary_conditions=[(lower_boundary_fn, upper_boundary_fn)],
        second_order_coeff_fn=second_order_coeff_fn,
        first_order_coeff_fn=first_order_coeff_fn,
        zeroth_order_coeff_fn=zeroth_order_coeff_fn,
        dtype=dtype,
    )
    return (estimate_values, estimate_grid[0])

Another side question: it seems the grid is fixed per batch, regardless of what vol, timestep and spot are present in the batch. What scheme do you use internally to guarantee convergence and avoid ill-defined grids, or is that up to the caller to ensure? (That would depend on the scheme used.)

saxena-ashish-g self-assigned this Jun 7, 2020
@Shellcat-Zero

Sorry if this is a bit off-topic, but is there any reason that JAX has not been leveraged for these functionalities here? I have found it very straightforward to implement the things discussed here with JAX, and all of the option greeks with batching and GPU functionality. My understanding was that there was already quite a bit of cross-collaboration between JAX and TensorFlow as well. There is a brief example shown on this Towards Data Science post.

@cyrilchim

Thank you, @Shellcat-Zero. Indeed, JAX has native forward-mode differentiation. It is definitely possible to add this functionality to TensorFlow, but it requires quite a bit of effort. For now you could use tff.math.fwd_gradient.

It is indeed true that JAX and TensorFlow have a lot of overlap (there is even a JAX-to-TF converter), as they both can use the XLA compiler. We stick to TensorFlow as it has a lot of production-level infrastructure (e.g., TF Serving), community support, as well as more core ops (for example, I am not sure JAX has the tridiagonal solver that is used for numerical PDE methods).

@luweizheng

Hi @cyrilchim! The Monte Carlo algorithm mainly uses tf.while_loop to simulate different time steps. I am wondering if tf.while_loop can do forward gradients?

@cyrilchim

Hi @luweizheng,

Yes, you can do forward gradient through the while loop. You can do some experimenting in the colab example here.

It can be more efficient/beneficial to use the custom for_loop, for which we have a better gradient implementation. An example of using the custom loop is in the aforementioned colab.
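
For example, here is a minimal sketch of a forward gradient through a tf.while_loop (a toy deterministic loop rather than the library's Monte Carlo sampler), using tff.math.fwd_gradient as earlier in this thread:

import tensorflow as tf
import tf_quant_finance as tff

dtype = tf.float64
num_steps = 10
mu = tf.constant(0.05, dtype)
dt = tf.constant(0.01, dtype)

def terminal_value(s0):
  # Deterministic Euler steps inside a tf.while_loop.
  def body(i, s):
    return i + 1, s * (1.0 + mu * dt)
  _, s_t = tf.while_loop(lambda i, s: i < num_steps, body,
                         (tf.constant(0), s0))
  return s_t

s0 = tf.constant([100.0], dtype)
# Forward-mode sensitivity of the loop output to the initial state:
# d s_T / d s_0 = (1 + mu * dt) ** num_steps.
delta = tff.math.fwd_gradient(terminal_value, s0)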

luweizheng commented Jan 5, 2022

Thank you so much for all this information, @cyrilchim.
I'm not familiar with auto-differentiation and computation graphs, so I'm not sure if the following is correct.
From your explanation, the custom for_loop has better performance. The colab example prices European options. I read some of the source code and found that if watch_params is not None, tff uses the custom for loop. Is this custom for loop mainly for Monte Carlo sampling, and does it only need forward gradients (forward gradients for Greeks)?
Also, I guess tf.while_loop may have some performance issues, mainly because building the graph takes a lot of time?

@cyrilchim

That is about correct. Just a few notes:

  1. I don't think that while_loop has any performance issues, and it should not affect graph construction time (it is just an op in the graph).
  2. When you use JAX and invoke compilation, it also creates a graph. There is also an op for the while loop.
  3. JAX has native forward differentiation mode (see this example for an explanation of the forward and backward differentiation modes).
  4. The custom for-loop was designed for the forward gradient only. This is because a backward gradient through a while loop requires going through the loop twice (first computing values in a forward pass and then getting the backward gradients), as sketched below.
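
As a rough illustration of point 4 with a toy loop (not the library's code): reverse mode records the forward pass and then traverses the loop again backwards, while forward mode propagates sensitivities alongside the single forward pass.

import tensorflow as tf
import tf_quant_finance as tff

dtype = tf.float64
x = tf.constant([2.0], dtype)

def f(x):
  # Three squarings inside a while loop: f(x) = x ** 8.
  def body(i, y):
    return i + 1, y * y
  _, y = tf.while_loop(lambda i, y: i < 3, body, (tf.constant(0), x))
  return y

# Forward mode: a single pass through the loop.
fwd = tff.math.fwd_gradient(f, x)

# Reverse mode: the forward pass is recorded, then traversed backwards.
with tf.GradientTape() as tape:
  tape.watch(x)
  y = f(x)
bwd = tape.gradient(y, x)
# Both equal 8 * x ** 7 = 1024 here.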

Hope this helps
