### Asian Barrier Options Pricing using GPU Acceleration
    1. CuPy, Numba and CPU Monte-Carlo Pricing
    2. Batched Monte-Carlo Pricing
    3. Approximation using Deep Learning derivatives 
    4. Mixed Precision and multiple GPUs 
    5. TensorRT inference

### Introduction

European options can be priced accurately and efficiently with the [Black–Scholes model](https://en.wikipedia.org/wiki/Black%E2%80%93Scholes_model). Options such as the [Barrier Option](https://en.wikipedia.org/wiki/Barrier_option) and the [Basket Option](https://en.wikipedia.org/wiki/Basket_option) have a more complicated structure and no simple analytical solution. Monte Carlo simulation is an effective way to price them. Getting an accurate price with a small variance requires a large number of simulation paths, which is computationally intensive. Fortunately, the simulation paths are independent of each other, so we can use the many cores of a GPU to run them in parallel and speed up the computation by orders of magnitude. Even that may not be fast enough. Recently, the [Deep learning derivatives method](https://arxiv.org/pdf/1809.02233.pdf) was introduced to value derivatives, and it achieves an even larger speedup.

In this tutorial, we use Monte Carlo methods to price a [Down-and-Out](https://www.investopedia.com/terms/d/daoo.asp) [Asian](https://www.investopedia.com/terms/a/asianoption.asp) [Barrier](https://www.investopedia.com/terms/b/barrieroption.asp) [Call Option](https://www.investopedia.com/terms/c/calloption.asp).

Steps:
    1. Use Python GPU libraries to accelerate the Monte Carlo pricing on the GPU
    2. Use the Monte Carlo pricing dataset to train a simple Barrier Option Pricing Neural Network Model
    3. Accelerate the neural network inference by TensorRT
    
### Barrier Option pricing

An Asian Barrier Option is a mixture of an [Asian Option](https://en.wikipedia.org/wiki/Asian_option) and a [Barrier Option](https://en.wikipedia.org/wiki/Barrier_option). Its price depends on the average underlying Asset Price `S`, the Strike Price `K` and the Barrier Price `B`. There are 4 types of Barrier Options:-
   * [Up-and-out](https://www.investopedia.com/terms/u/up-and-outoption.asp): spot price starts below the barrier level and has to move up for the option to be knocked out.
   * [Down-and-out](https://www.investopedia.com/terms/d/daoo.asp): spot price starts above the barrier level and has to move down for the option to be knocked out.
   * [Up-and-in](https://www.investopedia.com/terms/u/up-and-inoption.asp): spot price starts below the barrier level and has to move up for the option to become activated.
   * [Down-and-in](https://www.investopedia.com/terms/d/daio.asp): spot price starts above the barrier level and has to move down for the option to become activated.

Without loss of generality, in this notebook we will use the [Down-and-Out Call Discretized Asian Barrier Option](https://ieeexplore.ieee.org/document/6327776/metrics#metrics) as an example. The option becomes void if the average price of the underlying asset falls to or below the barrier. The asset Spot Price `S` is usually modeled as [Geometric Brownian motion](https://en.wikipedia.org/wiki/Geometric_Brownian_motion), which has 3 free parameters:- [Spot Price](https://www.investopedia.com/terms/s/spotprice.asp), [Percent Volatility](https://www.investopedia.com/terms/v/volatility.asp) and [Percent Drift](https://en.wikipedia.org/wiki/Stochastic_drift). The price of the option is the expected payoff at maturity, discounted to its present value.
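For reference, the discretized model that the simulation loops in this notebook implement can be written as follows. This is a sketch reconstructed from the code, not a formula quoted from the cited paper. The symbols are notation introduced here: $N$ is the number of time steps, $\Delta t = T/N$, $Z_n$ are independent standard normal draws, and $A_n$ is the running arithmetic average after $n$ steps.

```latex
\begin{aligned}
S_{n+1} &= S_n + \mu\, S_n\, \Delta t + \sigma\, S_n \sqrt{\Delta t}\; Z_n, \qquad n = 0, \dots, N-1 \\
A_n &= \frac{1}{n} \sum_{k=1}^{n} S_k, \qquad n = 1, \dots, N \\
V &= e^{-rT}\; \mathbb{E}\!\left[ \max(A_N - K,\, 0)\; \mathbf{1}\{ A_n > B \ \text{for all } n \le N \} \right]
\end{aligned}
```

The indicator expresses the knock-out condition: a path contributes nothing once its running average touches or falls below `B`.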

Because of the barrier and the arithmetic averaging of the price, there is no analytical solution for this example of [exotic option](https://www.investopedia.com/terms/e/exoticoption.asp). We can use the Monte Carlo simulation method to estimate the expected payoff on the maturity day.
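Before turning to the accelerated versions, it can help to see the estimator in its simplest form. The following is a minimal plain-Python sketch, assuming the same Euler-style update as the kernels in this notebook. The function name `reference_asian_barrier_price` and the explicit `knocked_out` flag are illustrative choices made here, and the small path and step counts are only meant to keep the run time short.

```python
import math
import random


def reference_asian_barrier_price(T, K, B, S0, sigma, mu, r,
                                  n_steps, n_paths, seed=0):
    """Plain-Python Monte Carlo estimate of a down-and-out Asian barrier call."""
    rng = random.Random(seed)          # seeded for reproducibility
    dt = T / n_steps
    drift = mu * dt                    # per-step drift factor
    vol = sigma * math.sqrt(dt)        # per-step volatility factor
    discount = math.exp(-r * T)
    total_payoff = 0.0
    for _ in range(n_paths):
        s = S0
        running_average = 0.0
        knocked_out = False
        for n in range(n_steps):
            s += drift * s + vol * s * rng.gauss(0.0, 1.0)
            # incremental mean of the prices seen so far on this path
            running_average += (s - running_average) / (n + 1.0)
            if running_average <= B:
                knocked_out = True     # the option is void for this path
                break
        if not knocked_out and running_average > K:
            total_payoff += running_average - K
    return discount * total_payoff / n_paths


# Example parameters from this notebook, with far fewer paths and steps
price = reference_asian_barrier_price(T=1.0, K=110.0, B=100.0, S0=120.0,
                                      sigma=0.35, mu=0.1, r=0.05,
                                      n_steps=50, n_paths=2000)
print(f"reference estimate: {price:.2f}")
```

With so few paths the estimate is noisy; it is meant as a readable baseline for the logic, not as a benchmark.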

Following are the parameters we choose to price the example option:-

    Maturity (T): 1 year
    Spot (S) : 120
    Strike (K): 110
    Volatility (sigma): 35.0 %
    Risk Free Rate (r): 5.0 %
    Stock Drift Rate (mu): 10.0 %
    Barrier (B): 100

The [Standard Error of the Mean](https://en.wikipedia.org/wiki/Standard_error) is proportional to the inverse square root of the number of samples. Hence, the more simulation paths we have, the more accurate the pricing will be. We set the constants for the option and load the necessary libraries:-

In [1]:
import cupy
import numpy as np
import math
import time
import numba
from numba import cuda
from numba import njit
from numba import prange
import cudf

N_PATHS = 8192000
N_STEPS = 365
T = 1.0
K = 110.0
B = 100.0
S0 = 120.0
sigma = 0.35
mu = 0.1
r = 0.05

We will simulate about 8 million paths (`N_PATHS = 8192000`) with 365 steps, where each step represents a day.
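As a rough illustration of why the number of paths matters, the sketch below estimates the standard error of a sample mean for two sample sizes. The samples are stand-in draws from a standard normal distribution, not option payoffs, and `standard_error` is a helper defined here for illustration.

```python
import math
import random
import statistics


def standard_error(samples):
    """Standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))


rng = random.Random(42)
# Stand-in samples drawn from a fixed distribution (purely illustrative)
small = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
large = [rng.gauss(0.0, 1.0) for _ in range(40_000)]

ratio = standard_error(small) / standard_error(large)
print(f"standard error shrinks by a factor of about {ratio:.2f}")
```

Using four times as many samples roughly halves the standard error, which is why large path counts are needed for a tight price estimate.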

#### Single Thread CPU
The single thread CPU code for the Monte Carlo simulation has two nested for-loops. The outer loop iterates over the paths, while the inner loop iterates over time and computes the underlying asset price for each day. Note that this code is accelerated via [Numba `@njit`](http://numba.pydata.org/), so it is compiled into machine code at runtime.

In [2]:
@njit(fastmath=True)
def cpu_barrier_option(d_s, T, K, B, S0, sigma, mu, r, d_normals, N_STEPS, N_PATHS):
    tmp1 = mu*T/N_STEPS
    tmp2 = math.exp(-r*T)
    tmp3 = math.sqrt(T/N_STEPS)
    running_average = 0.0
    for i in range(N_PATHS):
        s_curr = S0
        for n in range(N_STEPS):
            s_curr += tmp1 * s_curr + sigma*s_curr*tmp3*d_normals[i + n * N_PATHS]
            running_average = running_average + 1.0/(n + 1.0) * (s_curr - running_average)
            if running_average <= B:
                break

        payoff = running_average - K if running_average>K else 0
        d_s[i] = tmp2 * payoff

We use CuPy to generate Gaussian random numbers on the GPU, copy them to the host for the CPU runs, and allocate an array to store the discounted payoff of each path.


In [3]:
randoms_gpu = cupy.random.normal(0, 1, N_PATHS * N_STEPS, dtype=cupy.float32)
randoms_cpu = cupy.asnumpy(randoms_gpu)
output =  np.zeros(N_PATHS, dtype=np.float32)

Now we will run the Monte Carlo simulation and time it. When the Numba accelerated function is called for the first time, there is some overhead to compile it. So to time it accurately, we run this method twice and consider the run time of the second attempt.


In [4]:
cpu_barrier_option(output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r), randoms_cpu, N_STEPS, N_PATHS)
s = time.time()
cpu_barrier_option(output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r), randoms_cpu, N_STEPS, N_PATHS)
v = output.mean()
e = time.time()
print('time', e-s, 'v', v)

time 33.31781530380249 v 18.7093
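The inner loop above maintains the arithmetic average with an incremental update instead of storing the whole path. The small check below, using made-up prices, illustrates that this update reproduces the ordinary mean at every step. `running_means` is a helper written for this illustration.

```python
def running_means(values):
    """Yield the mean of values[:n+1] using the incremental update."""
    average = 0.0
    for n, x in enumerate(values):
        average += (x - average) / (n + 1.0)
        yield average


prices = [120.0, 118.5, 121.0, 119.25]   # hypothetical daily prices
for n, avg in enumerate(running_means(prices), start=1):
    direct = sum(prices[:n]) / n          # mean computed the usual way
    assert abs(avg - direct) < 1e-9
```

Note that the first update sets the average to the first price regardless of its previous value, because the weight `1 / (n + 1)` equals 1 when `n` is 0.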


#### Multiple Cores CPU
A CPU has multiple cores, so to make a fair comparison we modify the code slightly to take advantage of all of them. Note how we parallelize the outer loop with `prange`:-

In [5]:
@njit(fastmath=True, parallel=True)
def cpu_multiplecore_barrier_option(d_s, T, K, B, S0, sigma, mu, r, d_normals, N_STEPS, N_PATHS):
    tmp1 = mu*T/N_STEPS
    tmp2 = math.exp(-r*T)
    tmp3 = math.sqrt(T/N_STEPS)
    for i in prange(N_PATHS):
        s_curr = S0
        running_average = 0.0
        for n in range(N_STEPS):
            s_curr += tmp1 * s_curr + sigma*s_curr*tmp3*d_normals[i + n * N_PATHS]
            running_average = running_average + 1.0/(n + 1.0) * (s_curr - running_average)
            if running_average <= B:
                break
        payoff = running_average - K if running_average>K else 0
        d_s[i] = tmp2 * payoff

Running this parallel code and timing it:-

In [6]:
cpu_multiplecore_barrier_option(output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r), randoms_cpu, N_STEPS, N_PATHS)
s = time.time()
cpu_multiplecore_barrier_option(output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r), randoms_cpu, N_STEPS, N_PATHS)
v = output.mean()
e = time.time()
print('time', e-s, 'v', v)

time 4.055903911590576 v 18.7093


We see approximately an 8x speedup (33.3 s down to 4.06 s) from using the 32 cores of the CPU.

#### NUMBA GPU
The multiple cores CPU code can be modified easily to run on the GPU via `numba.cuda.jit`. The code below is very similar to the CPU multiple core code, except that we parallelize the outer loop across GPU threads. Running this code and timing it:-

In [7]:
@cuda.jit
def numba_gpu_barrier_option(d_s, T, K, B, S0, sigma, mu, r, d_normals, N_STEPS, N_PATHS):
    # ii - overall thread index
    ii = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    stride = cuda.gridDim.x * cuda.blockDim.x
    tmp1 = mu*T/N_STEPS
    tmp2 = math.exp(-r*T)
    tmp3 = math.sqrt(T/N_STEPS)
    running_average = 0.0
    for i in range(ii, N_PATHS, stride):
        s_curr = S0
        for n in range(N_STEPS):
            s_curr += tmp1 * s_curr + sigma*s_curr*tmp3*d_normals[i + n * N_PATHS]
            running_average += (s_curr - running_average) / (n + 1.0)
            if running_average <= B:
                break
        payoff = running_average - K if running_average>K else 0
        d_s[i] = tmp2 * payoff

In [8]:

number_of_threads = 256
number_of_blocks = (N_PATHS-1) // number_of_threads + 1
output = cupy.zeros(N_PATHS, dtype=cupy.float32)
numba_gpu_barrier_option[(number_of_blocks,), (number_of_threads,)](output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r), randoms_gpu, N_STEPS, N_PATHS)
s = time.time()
numba_gpu_barrier_option[(number_of_blocks,), (number_of_threads,)](output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r), randoms_gpu, N_STEPS, N_PATHS)
v = output.mean()
cuda.synchronize()
e = time.time()
print('time', e-s, 'v', v)

time 0.07474923133850098 v 18.709291


We get roughly a 54x speedup compared to the multiple core version (4.06 s vs. 0.075 s) and roughly a 446x speedup compared to the single core version (33.3 s vs. 0.075 s).
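The kernel above uses a grid-stride loop, and the launch code computes the number of blocks with a ceiling division so that every path is covered. The CPU-only emulation below is a sketch of that indexing logic; `ceil_div` and `paths_visited_by_thread` are hypothetical helpers written here, not part of Numba, and the sizes are deliberately tiny.

```python
def ceil_div(n, d):
    """Same arithmetic as (N_PATHS - 1) // number_of_threads + 1."""
    return (n - 1) // d + 1


def paths_visited_by_thread(thread_id, n_threads_total, n_paths):
    """Path indices that one thread handles in a grid-stride loop."""
    return list(range(thread_id, n_paths, n_threads_total))


n_paths = 1000
threads_per_block = 256
blocks = ceil_div(n_paths, threads_per_block)       # 4 blocks
total_threads = blocks * threads_per_block          # 1024 threads

visited = []
for tid in range(total_threads):
    visited.extend(paths_visited_by_thread(tid, total_threads, n_paths))

# Every path is processed exactly once; the 24 surplus threads do nothing
assert sorted(visited) == list(range(n_paths))
```

Because the loop bound is `N_PATHS`, threads whose index exceeds the path count simply skip the loop, so no explicit bounds check is needed in the kernel.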

#### NUMBA Shared Memory 
When the Gaussian random numbers are read from global memory, the accesses are already coalesced and each number is read only once. Using shared memory therefore does not improve performance, as shown below:-

In [9]:
@cuda.jit
def numba_gpu_barrier_option_shared_mem(d_s, T, K, B, S0, sigma, mu, r, d_normals, N_STEPS, N_PATHS):
    shared = cuda.shared.array(shape=0, dtype=numba.float32)
    # load to shared memory
    path_offset = cuda.blockIdx.x * cuda.blockDim.x
    ii = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    stride = cuda.gridDim.x * cuda.blockDim.x
    tmp1 = mu*T/N_STEPS
    tmp2 = math.exp(-r*T)
    tmp3 = math.sqrt(T/N_STEPS)
    running_average = 0.0
    for i in range(ii, N_PATHS, stride):
        s_curr = S0
        for n in range(N_STEPS):
            shared[cuda.threadIdx.x] = d_normals[path_offset + cuda.threadIdx.x + n * N_PATHS]
            s_curr += tmp1 * s_curr + sigma*s_curr*tmp3*shared[cuda.threadIdx.x]
            running_average += (s_curr - running_average) / (n + 1.0)
            if running_average <= B:
                break
        payoff = running_average - K if running_average>K else 0
        d_s[i] = tmp2 * payoff

In [10]:
number_of_threads = 256
number_of_blocks = (N_PATHS-1) // number_of_threads + 1
output = cupy.zeros(N_PATHS, dtype=cupy.float32)
shared_buffer_size = number_of_threads * 4
numba_gpu_barrier_option_shared_mem[(number_of_blocks,), (number_of_threads,), 0, shared_buffer_size](output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r), randoms_gpu, N_STEPS, N_PATHS)
s = time.time()
numba_gpu_barrier_option_shared_mem[(number_of_blocks,), (number_of_threads,), 0, shared_buffer_size](output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r), randoms_gpu, N_STEPS, N_PATHS)
v = output.mean()
cuda.synchronize()
e = time.time()
print('time', e-s, 'v', v)

time 0.08823490142822266 v 18.709291
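The kernels read the random numbers with the flat index `i + n * N_PATHS`, which is a step-major layout. The small sketch below, with tiny illustrative sizes and a helper `flat_index` defined here, shows that neighbouring threads then read neighbouring array elements at each time step, which is the access pattern a GPU can serve efficiently from global memory.

```python
def flat_index(path, step, n_paths):
    """Flat position of the draw for (path, step), as used in the kernels."""
    return path + step * n_paths


n_paths = 8
step = 2
# Elements read by threads 0..3 at the same time step
addresses = [flat_index(path, step, n_paths) for path in range(4)]
print(addresses)  # [16, 17, 18, 19]: consecutive elements
```

A path-major layout (`n + i * N_STEPS`) would instead place the elements read by neighbouring threads `N_STEPS` positions apart.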


#### CUPY GPU
CuPy provides an easy way to define GPU kernels from raw CUDA source. The `RawKernel` object lets you call the kernel through CUDA’s `cuLaunchKernel` interface. Here is an example where we wrap the Barrier Option computation code inside a `RawKernel`:

In [11]:
cupy_barrier_option = cupy.RawKernel(r'''
extern "C" __global__ void barrier_option(
    float *d_s,
    const float T,
    const float K,
    const float B,
    const float S0,
    const float sigma,
    const float mu,
    const float r,
    const float * d_normals,
    const long N_STEPS,
    const long N_PATHS)
{
  unsigned idx =  threadIdx.x + blockIdx.x * blockDim.x;
  unsigned stride = blockDim.x * gridDim.x;
  unsigned tid = threadIdx.x;

  const float tmp1 = mu*T/N_STEPS;
  const float tmp2 = exp(-r*T);
  const float tmp3 = sqrt(T/N_STEPS);
  double running_average = 0.0;

  for (unsigned i = idx; i<N_PATHS; i+=stride)
  {
    float s_curr = S0;
    for(unsigned n = 0; n < N_STEPS; n++){
       s_curr += tmp1 * s_curr + sigma*s_curr*tmp3*d_normals[i + n * N_PATHS];
       running_average += (s_curr - running_average) / (n + 1.0) ;
       if (running_average <= B){
           break;
       }
    }

    float payoff = (running_average>K ? running_average-K : 0.f);
    d_s[i] = tmp2 * payoff;
  }
}

''', 'barrier_option')

We can launch it to compute the same Barrier Option price:-

In [12]:
number_of_threads = 256
number_of_blocks = (N_PATHS-1) // number_of_threads + 1
s = time.time()
cupy_barrier_option((number_of_blocks,), (number_of_threads,),
                   (output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r),  randoms_gpu, N_STEPS, N_PATHS))
v = output.mean()
cupy.cuda.stream.get_current_stream().synchronize()
e = time.time()
print('time', e-s, 'v',v)

time 0.03548622131347656 v 18.70929


This approach is the most efficient way to use the GPU among those tried here: it is about 2x faster than the Numba GPU kernel and about 114x faster than the 32 core CPU version (4.06 s vs. 0.035 s).

### Multiple GPUs Option Pricing

To get a more accurate estimate of the option price, more paths are needed for the Monte Carlo simulation. The single V100 GPU used in the above example has only 32GB of memory, and the 8M-path simulation is already close to that limit. [DASK](https://dask.org/) is an integrated component of RAPIDS for distributed computation on GPUs. We can use it to distribute the Monte Carlo simulation across multiple nodes and multiple GPUs. First, we wrap all the computation inside a function so that the allocated GPU memory is released at the end of each function call. Note that the function takes an extra argument for the random number seed, so that each function call uses its own sequence of random numbers. We define the function, then load the DASK library and set up the local CUDA cluster:-

In [13]:
# clear the GPU memory
del randoms_gpu 
del randoms_cpu
del output

def get_option_price(T, K, B, S0, sigma, mu, r, N_PATHS = 8192000, N_STEPS = 365, seed=3):
    number_of_threads = 256
    number_of_blocks = (N_PATHS-1) // number_of_threads + 1
    cupy.random.seed(seed)
    randoms_gpu = cupy.random.normal(0, 1, N_PATHS * N_STEPS, dtype=cupy.float32)
    output =  cupy.zeros(N_PATHS, dtype=cupy.float32)
    cupy_barrier_option((number_of_blocks,), (number_of_threads,),
                   (output, np.float32(T), np.float32(K), 
                    np.float32(B), np.float32(S0), 
                    np.float32(sigma), np.float32(mu), 
                    np.float32(r),  randoms_gpu, N_STEPS, N_PATHS))
    v = output.mean()
    out_df = cudf.DataFrame()
    out_df['p'] = cudf.Series([v.item()])
    return out_df
o = get_option_price(T=1.0, K=120.0, B=90.0, S0=100.0, sigma=0.2, mu=0.1, r=0.05)

In [14]:
import dask
import dask_cudf
from dask.delayed import delayed
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()
from dask.distributed import Client
client = Client(cluster)
client

Client: Scheduler: tcp://127.0.0.1:40199, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 270.39 GB


There are 4 GPUs in the system. To distribute the above function, we wrap it with `delayed` to integrate it into the DASK computation graph. We use `from_delayed` to gather all the distributed dataframes into a single dask_cudf dataframe. We can then call `mean` and `std` on the dask_cudf dataframe to calculate the mean and standard deviation of the prices.

In [15]:
x = dask_cudf.from_delayed([delayed(get_option_price)(T=1.0, K=110.0, B=100.0, S0=120.0, sigma=0.35, mu=0.1, r=0.05, seed=3000+i) for i in range(1600)])

In [16]:
x.mean().compute()

p    18.711432
dtype: float64

In [None]:
x.std().compute()

The code ran 1600 Monte Carlo simulations of `8192000` paths each. Averaging the 1600 prices gives a better estimate: the standard deviation of the averaged price is smaller than that of a single simulation by a factor of 1/sqrt(1600) = 1/40.
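The relationship between the spread of the individual batch prices and the uncertainty of their average can be sketched as follows. The input numbers are illustrative only and are not results from the simulation above; `summarize_batches` is a helper defined here for the example.

```python
import math
import statistics


def summarize_batches(batch_prices):
    """Mean of the batch estimates, their spread, and the standard error of the mean."""
    mean = statistics.mean(batch_prices)
    std = statistics.stdev(batch_prices)            # spread between batches
    stderr = std / math.sqrt(len(batch_prices))     # uncertainty of the average
    return mean, std, stderr


# Illustrative numbers only, not output of the pricing code
mean, std, stderr = summarize_batches([10.0, 12.0, 11.0, 11.0])
print(mean, std, stderr)
```

With 4 batches the divisor is sqrt(4) = 2; with the 1600 batches used above it would be sqrt(1600) = 40.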