## Extra functions to be used in `nanpercentile` example

The two device functions in the cell below are taken from [this comment on Numba issue #4283](https://github.com/numba/numba/issues/4283#issuecomment-908511793).

The `gpu_quickSortIterative()` function is used in the `nanpercentile` example of the `getting_started_with_numba_and_cupy.ipynb` Jupyter notebook to sort the input array on a GPU.

In [1]:
from numba import cuda

device = True
inline = True

@cuda.jit(device=device, inline=inline)
def gpu_partition(arr, ids, l, h):
    """
    Partition using pivot.

    Function takes the last element as pivot, places the pivot element at its correct
    position in the sorted array, and places all elements smaller than or equal to the
    pivot to its left and all greater elements to its right.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).
    l : int
        Starting index for sorting.
    h : int
        Ending index for sorting.

    Returns
    -------
    int
        Final index of the pivot element after partitioning.

    """
    # index of smaller element
    i = l - 1

    pivot = arr[h]

    for j in range(l, h):

        # If current element is smaller than or equal to pivot
        if arr[j] <= pivot:

            # increment index of smaller element
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
            ids[i], ids[j] = ids[j], ids[i]
            
        #added to test something
        #cuda.syncthreads()

    arr[i + 1], arr[h] = arr[h], arr[i + 1]
    ids[i + 1], ids[h] = ids[h], ids[i + 1]

    return i + 1


@cuda.jit(device=device, inline=inline)
def gpu_quickSortIterative(arr, stack, ids):
    """
    Perform iterative quicksort on array and an unsorted ID list of the array.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
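    stack : vector of ints
        Scratch array used as the explicit stack. It is zeroed by this function
        and must hold at least max(2, len(arr)) elements.
    ids : vector of ints
        Output array with at least len(arr) elements. It is filled with
        range(len(arr)) and permuted alongside arr, so ids[k] ends up as the
        original index of arr[k].

    Returns
    -------
    None
        arr and ids are modified in place.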

    """
    # low and high indices.
    l, h = (0, len(arr) - 1)
    # stack = [0] * size
    # ids = list(range(len(arr)))

    # initialize top of stack
    top = -1

    # fill ids with range(len(arr)) and zero the stack
    for i in range(len(arr)):
        ids[i] = i
        stack[i] = 0

    # push initial values of l and h to stack
    top = top + 1
    stack[top] = l
    top = top + 1
    stack[top] = h

    # Keep popping from the stack while it is not empty
    while top >= 0:

        # Pop h and l
        h = stack[top]
        top = top - 1
        l = stack[top]
        top = top - 1

        # Set pivot element at its correct position in
        # sorted array
        p = gpu_partition(arr, ids, l, h)

        # If there are elements on left side of pivot,
        # then push left side to stack
        if p - 1 > l:
            top = top + 1
            stack[top] = l
            top = top + 1
            stack[top] = p - 1

        # If there are elements on right side of pivot,
        # then push right side to stack
        if p + 1 < h:
            top = top + 1
            stack[top] = p + 1
            top = top + 1
            stack[top] = h
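
To make the contract of `gpu_quickSortIterative()` concrete, the following CPU-only sketch mirrors the same algorithm with plain Python lists. The names `partition` and `quicksort_iterative` are hypothetical stand-ins used only for illustration; they are not part of Numba or of the original source, and the sketch does not run on a GPU. It shows that the array is sorted in place and that `ids` ends up holding the original index of each sorted element.

```python
def partition(arr, ids, l, h):
    """Partition arr[l..h] around the pivot arr[h]; ids is permuted alongside arr."""
    i = l - 1
    pivot = arr[h]
    for j in range(l, h):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
            ids[i], ids[j] = ids[j], ids[i]
    arr[i + 1], arr[h] = arr[h], arr[i + 1]
    ids[i + 1], ids[h] = ids[h], ids[i + 1]
    return i + 1


def quicksort_iterative(arr, stack, ids):
    """CPU mirror of the device function: sorts arr in place, tracks indices in ids."""
    for i in range(len(arr)):
        ids[i] = i
        stack[i] = 0

    # push the initial (low, high) pair
    stack[0], stack[1] = 0, len(arr) - 1
    top = 1

    while top >= 0:
        # pop high, then low
        h = stack[top]
        l = stack[top - 1]
        top -= 2

        p = partition(arr, ids, l, h)

        # push the left part if it holds more than one element
        if p - 1 > l:
            stack[top + 1], stack[top + 2] = l, p - 1
            top += 2
        # push the right part if it holds more than one element
        if p + 1 < h:
            stack[top + 1], stack[top + 2] = p + 1, h
            top += 2


original = [3.0, 1.0, 4.0, 1.5, 9.0, 2.0]
values = list(original)
stack = [0] * max(2, len(values))  # scratch space, assumed large enough
ids = [0] * len(values)

quicksort_iterative(values, stack, ids)
print(values)  # [1.0, 1.5, 2.0, 3.0, 4.0, 9.0]
print(ids)     # [1, 3, 5, 0, 2, 4]
```

Because `ids` records the permutation, `original[ids[k]] == values[k]` holds for every `k`, which is the same information `np.argsort` would return.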

The functions in the following cell are taken from [this blog post](https://krstn.eu/np.nanpercentile()-there-has-to-be-a-faster-way/).

They provide a faster alternative to `np.nanpercentile`: a Python implementation of the quantile calculation built on vectorized NumPy operations. It calculates percentiles along the first axis of a 3D array.

The general idea is to:

* find the number of valid (non-NaN) observations
* replace NaN with the maximum value of the array
* sort the values along the axis
* find the position of the quantile based on the number of valid observations
* interpolate linearly if the desired quantile falls between two positions (like NumPy's linear interpolation)
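
The steps above can be sketched for a single 1D array as follows. This is a minimal illustration, not the implementation used in the notebook: the function name `nan_percentile_1d` and the sample data are assumptions made for this example, and `np.nanpercentile` is used only as a reference for comparison.

```python
import numpy as np


def nan_percentile_1d(values, q):
    """Percentile q (0-100) of a 1D array, ignoring NaN, following the listed steps."""
    values = np.asarray(values, dtype=float).copy()  # work on a copy

    # 1. number of valid (non-NaN) observations
    n_valid = int(np.sum(~np.isnan(values)))
    if n_valid == 0:
        return np.nan

    # 2. replace NaN with the maximum so that they sort to the end
    values[np.isnan(values)] = np.nanmax(values)

    # 3. sort the values
    values.sort()

    # 4. fractional position of the quantile among the valid observations
    k = (n_valid - 1) * (q / 100.0)
    f, c = int(np.floor(k)), int(np.ceil(k))

    # 5. linear interpolation between the two neighbouring positions
    if f == c:
        return values[f]
    return values[f] * (c - k) + values[c] * (k - f)


sample = [2.0, np.nan, 8.0, 4.0, np.nan, 6.0]
result = nan_percentile_1d(sample, 30)
print(result, np.nanpercentile(sample, 30))
```

For this sample there are four valid observations, so the 30th percentile sits at position `3 * 0.3 = 0.9` between the sorted values `2.0` and `4.0`, which gives approximately `3.8`.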

In [2]:
import numpy as np


def _zvalue_from_index(arr, ind):
    """Private helper that works around the limitation of np.choose() by using np.take().

    arr has to be a 3D array.
    ind has to be a 2D array containing the z-indices to take from arr.
    See: http://stackoverflow.com/a/32091712/4169585
    This is faster and more memory efficient than the ogrid-based solution with fancy indexing.
    """
    # get number of columns and rows
    _,nC,nR = arr.shape

    # get linear indices and extract elements with np.take()
    idx = nC*nR*ind + np.arange(nC*nR).reshape((nC,nR))
    return np.take(arr, idx)

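def nan_percentile(arr, q):
    """Percentile(s) q of a 3D array along the first axis, ignoring NaN.

    q is a single percentile or a list of percentiles on the 0-100 scale.
    Note: NaN values in arr are overwritten in place with the array maximum.
    """
    # valid (non NaN) observations along the first axis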
    valid_obs = np.sum(np.isfinite(arr), axis=0)
    # replace NaN with maximum
    max_val = np.nanmax(arr)
    arr[np.isnan(arr)] = max_val
    # sort - former NaNs will move to the end
    arr = np.sort(arr, axis=0)

    # normalise q to a list of requested quantiles
    qs = list(q) if isinstance(q, (list, tuple)) else [q]

    result = []
    for i in range(len(qs)):
        quant = qs[i]
        # desired position as well as floor and ceiling of it
        k_arr = (valid_obs - 1) * (quant / 100.0)
        f_arr = np.floor(k_arr).astype(np.int32)
        c_arr = np.ceil(k_arr).astype(np.int32)
        fc_equal_k_mask = f_arr == c_arr

        # linear interpolation (like numpy percentile) takes the fractional part of desired position
        floor_val = _zvalue_from_index(arr=arr, ind=f_arr) * (c_arr - k_arr)
        ceil_val = _zvalue_from_index(arr=arr, ind=c_arr) * (k_arr - f_arr)

        quant_arr = floor_val + ceil_val
        quant_arr[fc_equal_k_mask] = _zvalue_from_index(arr=arr, ind=k_arr.astype(np.int32))[fc_equal_k_mask]  # if floor == ceiling take floor value

        result.append(quant_arr)

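    # mask out areas with less than 1 valid value in the timeseries;
    # broadcast the 2D mask so that it also matches a result with several quantiles
    result = np.asarray(result)
    mask = np.broadcast_to(valid_obs >= 1, result.shape)
    return np.ma.MaskedArray(result, mask=~mask)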
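
The linear-index trick used by `_zvalue_from_index` can be checked in isolation. The sketch below uses made-up random data and hypothetical variable names (`cube`, `z_idx`, `flat_idx`); it compares the `np.take` approach against `np.take_along_axis`, which is assumed here as an independent reference and is not used in the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.random((4, 2, 3))             # 3D array, first axis is z
z_idx = rng.integers(0, 4, size=(2, 3))  # one z-index per cell of the 2D grid

# np.take without an axis indexes the flattened array, so element (z, i, j)
# of a C-ordered array lives at z * n1 * n2 + i * n2 + j.
_, n1, n2 = cube.shape
flat_idx = n1 * n2 * z_idx + np.arange(n1 * n2).reshape((n1, n2))
picked = np.take(cube, flat_idx)

# Independent reference: select along axis 0 with take_along_axis.
expected = np.take_along_axis(cube, z_idx[np.newaxis, ...], axis=0)[0]

print(np.array_equal(picked, expected))
```

Both approaches return a 2D array with the same shape as `z_idx`, holding one value per grid cell.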