From 8d6296ecbcd86606e86d7114acc8c0a4d6870303 Mon Sep 17 00:00:00 2001 From: samaid Date: Thu, 17 Nov 2022 20:12:00 -0600 Subject: [PATCH] Complete documentation draft Missing a few items documented as To-Do --- docs/sources/examples.rst | 30 +++++++ docs/sources/ext_links.txt | 2 + docs/sources/programming_dpep.rst | 125 ++++++++++++++++++++++++++++++ docs/sources/useful_links.rst | 40 ++++++++++ examples/03-dpnp2numba-dpex.py | 11 ++- 5 files changed, 204 insertions(+), 4 deletions(-) diff --git a/docs/sources/examples.rst b/docs/sources/examples.rst index 2c88626..7052dfd 100644 --- a/docs/sources/examples.rst +++ b/docs/sources/examples.rst @@ -3,3 +3,33 @@ List of examples ================ + +.. literalinclude:: ../../examples/01-hello_dpnp.py + :language: python + :lines: 27- + :caption: Your first NumPy code running on GPU + :name: examples_01_hello_dpnp + +.. literalinclude:: ../../examples/02-dpnp_device.py + :language: python + :lines: 27- + :caption: Select device type while creating array + :name: examples_02_dpnp_device + +.. literalinclude:: ../../examples/03-dpnp2numba-dpex.py + :language: python + :lines: 27- + :caption: Compile dpnp code with numba-dpex + :name: examples_03_dpnp2numba_dpex + +Benchmarks +********** + +.. todo:: + Provide instructions for dpbench + +Jupyter* Notebooks +****************** + +.. todo:: + Provide instructions for Jupyter Notebook samples illustrating Data Parallel Extensions for Python diff --git a/docs/sources/ext_links.txt b/docs/sources/ext_links.txt index 7019ec2..105a73a 100644 --- a/docs/sources/ext_links.txt +++ b/docs/sources/ext_links.txt @@ -12,3 +12,5 @@ .. _SYCL*: https://www.khronos.org/sycl/ .. _Data Parallel Control: https://intelpython.github.io/dpctl/latest/index.html .. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/ +.. _IEEE 754-2019 Standard for Floating-Point Arithmetic: https://standards.ieee.org/ieee/754/6210/ +.. 
_David Goldberg, What every computer scientist should know about floating-point arithmetic: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf
diff --git a/docs/sources/programming_dpep.rst b/docs/sources/programming_dpep.rst
index 650d2c5..e8c55a5 100644
--- a/docs/sources/programming_dpep.rst
+++ b/docs/sources/programming_dpep.rst
@@ -71,3 +71,128 @@ It takes just a few lines to modify your CPU `Numba*`_ script to run on GPU.
    :caption: Compile dpnp code with numba-dpex
    :name: ex_03_dpnp2numba_dpex
 
+In this example we implement a custom function ``sum_it()`` that takes an array input. We compile it with
+`Data Parallel Extension for Numba*`_. Being a just-in-time compiler, Numba derives the queue from the input argument ``x``,
+which is associated with the default device (``"gpu"`` on systems with an integrated or discrete GPU), and
+dynamically compiles the kernel submitted to that queue. The result will reside as a 0-dimensional array on the device
+associated with the queue, and on exit from the offload kernel it will be assigned to the tensor ``y``.
+
+The ``parallel=True`` setting in ``@njit`` is essential to enable the generation of data parallel kernels.
+Please also note that we use ``fastmath=True`` in the ``@njit`` decorator. This is an important setting
+that instructs the compiler that you are okay with NOT preserving the order of floating-point operations.
+It enables the generation of faster instructions (such as SIMD) for greater performance.
+
+Data Parallel Control - dpctl
+*****************************
+
+Both ``dpnp`` and ``numba-dpex`` provide enough API versatility for programming data parallel devices, but
+there are some situations when you will need the advanced capabilities of ``dpctl``:
+
+1. **Advanced device management.** Both ``dpnp`` and ``numba-dpex`` support NumPy array creation routines
+   with additional parameters that specify the device on which the data is allocated and the type of memory to be used
+   (``"device"``, ``"host"``, or ``"shared"``). 
However, if you need more advanced device and data management
+   capabilities, you will also need to import ``dpctl`` in addition to ``dpnp`` and/or ``numba-dpex``.
+
+   One frequent use of ``dpctl`` is to query the list of devices present on the system and the available driver backends
+   (such as ``"opencl"``, ``"level_zero"``, ``"cuda"``, etc.).
+
+   Another frequent use is the creation of additional queues for the purpose of profiling or choosing out-of-order
+   execution of offload kernels.
+
+2. **Cross-platform development using the Python Array API standard.** If you are a Python developer
+   writing NumPy-like codes that target different hardware vendors and different tensor implementations,
+   then following the `Python* Array API Standard`_ is a good choice for writing portable NumPy-like code.
+   ``dpctl.tensor`` implements the `Python* Array API Standard`_ for `SYCL*`_ devices. Accompanied by the
+   respective SYCL device drivers from different vendors, ``dpctl.tensor`` becomes a portable solution
+   for writing numerical codes for any SYCL device.
+
+   For example, some Python communities, such as the Scikit-Learn* community, are already establishing
+   a path for having algorithms (re-)implemented using the `Python* Array API Standard`_.
+   This is a reliable path for extending their capabilities beyond CPU-only, or beyond a single GPU vendor.
+
+3. **Zero-copy data exchange between tensor implementations.** Certain Python projects may have their own tensor
+   implementations relying on neither ``dpctl.tensor`` nor ``dpnp.ndarray`` tensors. Can users still exchange data
+   between these tensors without copying it back and forth through the host?
+   The `Python* Array API Standard`_ specifies a protocol for zero-copy data exchange
+   between tensors through ``dlpack``. As an implementation of the `Python* Array API Standard`_,
+   ``dpctl`` provides the ``dpctl.tensor.from_dlpack()`` function for creating a zero-copy view of another tensor. 
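The zero-copy exchange idea can be sketched with plain NumPy, which also implements the DLPack protocol of the Array API standard. NumPy here is only a stand-in so the snippet runs without a SYCL device; ``dpctl.tensor.from_dlpack()`` plays the same role for tensors living on SYCL devices:

```python
import numpy as np

# Producer: any tensor implementation that supports the DLPack protocol
# (plain NumPy stands in for dpctl.tensor or dpnp here).
a = np.arange(6, dtype=np.float32)

# Consumer: import the buffer without copying. NumPy >= 1.22 provides
# np.from_dlpack(); dpctl.tensor.from_dlpack() is the analogous call
# for SYCL devices.
b = np.from_dlpack(a)

# No data was copied: both names refer to the same underlying buffer.
print(np.shares_memory(a, b))  # True
print(b.dtype)                 # float32
```

Because the consumer only creates a view, ownership and lifetime of the buffer remain with the producer, which is what makes the exchange safe without a round trip through host memory.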
+
+
+Debugging and profiling Data Parallel Extensions for Python
+***********************************************************
+
+.. todo::
+   Document debugging and profiling section
+
+Writing robust numerical codes for heterogeneous computing
+**********************************************************
+
+The default primitive type (``dtype``) in `Numpy*`_ is double precision (``float64``), which is supported by
+the majority of modern CPUs. When it comes to programming GPUs, and especially specialized accelerators,
+the set of supported primitive data types may be limited. For example, certain GPUs may not support
+double precision or half precision. **Data Parallel Extensions for Python** select the default ``dtype`` depending on the
+device's default type, in accordance with the `Python* Array API Standard`_. It can be either ``float64`` or ``float32``.
+This means that, unlike traditional `Numpy*`_ programming on a CPU, heterogeneous computing requires
+careful management of hardware peculiarities to keep the Python script portable and robust on any device.
+
+There are several hints on how to make numerical code portable and robust.
+
+Sensitivity to floating-point errors
+------------------------------------
+
+Floating-point arithmetic has finite precision, which implies that only a tiny fraction of real numbers can be
+represented in floating-point arithmetic. It is almost certain that every floating-point operation
+will induce a rounding error because the exact result cannot be represented as a floating-point number.
+The `IEEE 754-2019 Standard for Floating-Point Arithmetic`_ sets the upper bound for the rounding error of each
+arithmetic operation to 0.5 *ulp*, meaning that each arithmetic operation must be accurate to the last bit of the
+floating-point mantissa, which is on the order of :math:`10^{-16}` in double precision and :math:`10^{-7}`
+in single precision. 
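These bounds can be checked directly: NumPy exposes the machine epsilon (the gap between 1.0 and the next representable number, roughly twice the 0.5 *ulp* rounding bound) for each ``dtype``, and even a single addition of decimal constants shows the rounding error at work. A small sketch, assuming only NumPy:

```python
import numpy as np

# Machine epsilon per dtype: the relative rounding error of one
# operation is at most eps/2 (the 0.5 ulp bound).
print(np.finfo(np.float64).eps)  # ~2.22e-16 -> ~16 decimal digits
print(np.finfo(np.float32).eps)  # ~1.19e-07 -> ~7 decimal digits

# A single rounding error in action: 0.1, 0.2, and 0.3 are not exactly
# representable in binary floating point, so the identity fails.
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004
```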
+
+In robust numerical codes these errors tend to accumulate slowly, so that single precision is enough to
+calculate the result accurately to 3-5 decimal digits.
+
+However, there is a situation known as *catastrophic cancellation*, when small accumulated errors
+result in a significant (or even a complete) loss of accuracy. Catastrophic cancellation happens
+when two close floating-point numbers with small rounding errors are subtracted. As a result, the relative size of the
+original rounding errors is amplified in proportion to the number of identical leading digits that cancel:
+
+.. image:: ./_images/fp-cancellation.png
+   :scale: 50%
+   :align: center
+   :alt: Floating-Point Cancellation
+
+In the above example, the green digits are accurate, while the few trailing digits in red are inaccurate due to
+induced errors. As a result of the subtraction, only one accurate digit remains.
+
+Situations with catastrophic cancellations must be handled carefully. An example where catastrophic
+cancellation happens naturally is numerical differentiation, where two close numbers are subtracted
+to approximate the derivative:
+
+.. math::
+
+   df/dx \approx \frac{f(x+\delta) - f(x-\delta)}{2\delta}
+
+The smaller you take :math:`\delta`, the greater the catastrophic cancellation. At the same time, a bigger :math:`\delta`
+results in a bigger approximation error. Books on numerical computing and floating-point arithmetic discuss a
+variety of techniques to keep catastrophic cancellations under control. For more details about floating-point
+arithmetic, please refer to the `IEEE 754-2019 Standard for Floating-Point Arithmetic`_ and the article by
+`David Goldberg, What every computer scientist should know about floating-point arithmetic`_.
+
+
+Switch between single and double precision
+******************************************
+
+1. Implement your code to switch easily between single and double precision in a controlled fashion. 
For example, implement a utility function or introduce a constant that selects the ``dtype`` for
+   the rest of the `Numpy*`_ code.
+
+2. Run your code on a representative set of inputs in both single and double precision.
+   Observe the sensitivity of the computed results to switching between single and double precision.
+   If the results remain identical to 3-5 digits for different inputs, it is a good sign that your code
+   is not sensitive to floating-point errors.
+
+3. Write your code with catastrophic cancellations in mind. These blocks of code will require special
+   care, such as the use of extended precision or other techniques to control cancellations.
+   It is likely that this part of the code will require a hardware-specific implementation.
+
diff --git a/docs/sources/useful_links.rst b/docs/sources/useful_links.rst
index 3084b4a..2d08c8d 100644
--- a/docs/sources/useful_links.rst
+++ b/docs/sources/useful_links.rst
@@ -3,3 +3,43 @@
 Useful links
 ============
+
+.. list-table:: **Companion documentation**
+   :widths: 70 200
+   :header-rows: 1
+
+   * - Document
+     - Description
+   * - `Data Parallel Extension for Numpy*`_
+     - Documentation for programming NumPy-like codes on data parallel devices
+   * - `Data Parallel Extension for Numba*`_
+     - Documentation for programming Numba codes on data parallel devices the same way as you program Numba on CPU
+   * - `Data Parallel Control`_
+     - Documentation on how to manage data and devices, how to interchange data between different tensor implementations,
+       and how to write data parallel extensions
+   * - `Python* Array API Standard`_
+     - Standard for writing portable NumPy-like codes targeting different hardware vendors and frameworks
+       operating with tensor data
+   * - `SYCL*`_
+     - Standard for writing C++-like codes for heterogeneous computing
+   * - `DPC++`_
+     - Free e-book on how to program data parallel devices using Data Parallel C++
+   * - `OpenCl*`_
+     - The OpenCL* standard for heterogeneous programming
+   * - `IEEE 754-2019 Standard for Floating-Point Arithmetic`_
+     - Standard for floating-point arithmetic, essential for writing robust numerical codes
+   * - `David Goldberg, What every computer scientist should know about floating-point arithmetic`_
+     - Scientific paper important for understanding how to write robust numerical code
+   * - `Numpy*`_
+     - Documentation for NumPy, the foundational CPU library for array programming. Used in conjunction with
+       `Data Parallel Extension for Numpy*`_.
+   * - `Numba*`_
+     - Documentation for Numba, a Just-In-Time compiler for NumPy-like codes. Used in conjunction with
+       `Data Parallel Extension for Numba*`_.
+
+
+To-Do
+=====
+.. todolist::
diff --git a/examples/03-dpnp2numba-dpex.py b/examples/03-dpnp2numba-dpex.py
index f788b53..c147f63 100644
--- a/examples/03-dpnp2numba-dpex.py
+++ b/examples/03-dpnp2numba-dpex.py
@@ -25,17 +25,20 @@
 # *****************************************************************************
 
 import dpnp as np
 from numba_dpex import njit
 
 
 @njit(parallel=True, fastmath=True)
-def sum(x):
+def sum_it(x):
     return np.sum(x)
 
-x = np.empty(3)
+x = None
 try:
     x = np.asarray([1, 2, 3], device="gpu")
-except:
+except Exception:
     print("GPU device is not available")
+
+if x is not None:
+    y = sum_it(x)
+    print(y.shape)  # Must be a 0-dimensional array
+    print(y)  # Expect 6
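
For readers without a SYCL device at hand, the semantics of the example above can be checked with plain NumPy, since ``dpnp`` mirrors the NumPy API. This is only a stand-in sketch (no compilation or offload happens here), showing why the result is a 0-dimensional array:

```python
import numpy as np


def sum_it(x):
    # np.sum, like dpnp.sum, reduces over all axes and returns a
    # 0-dimensional result whose shape is the empty tuple ().
    return np.sum(x)


x = np.asarray([1, 2, 3])
y = sum_it(x)
print(y.shape)  # ()
print(int(y))   # 6
```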