Commit

Draft of tutorial notebooks. Some exercises missing.
seibert committed Apr 24, 2017
1 parent 942658d commit 978bdd0
Showing 12 changed files with 3,648 additions and 0 deletions.
545 changes: 545 additions & 0 deletions 1 - Numba Basics.ipynb


520 changes: 520 additions & 0 deletions 2 - CUDA Basics.ipynb


229 changes: 229 additions & 0 deletions 3 - Memory Management.ipynb
@@ -0,0 +1,229 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GTC 2017 Numba Tutorial Notebook 3: Memory Management\n",
"\n",
"## Managing GPU Memory\n",
"\n",
"During the benchmarking in the previous notebook, we used NumPy arrays on the CPU as inputs and outputs. If you want to reduce the impact of host-to-device/device-to-host bandwidth, it is best to copy data to the GPU explicitly and leave it there to amortize the cost over multiple function calls. In addition, allocating device memory can be relatively slow, so allocating GPU arrays once and refilling them with data from the host can also improve performance.\n",
"\n",
"Let's create our example addition ufunc again:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from numba import vectorize\n",
"import numpy as np\n",
"\n",
"@vectorize(['float32(float32, float32)'], target='cuda')\n",
"def add_ufunc(x, y):\n",
" return x + y"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"n = 100000\n",
"x = np.arange(n).astype(np.float32)\n",
"y = 2 * x"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 137.56 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"1000 loops, best of 3: 1.25 ms per loop\n"
]
}
],
"source": [
"%timeit add_ufunc(x, y) # Baseline performance with host arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `numba.cuda` module provides `cuda.to_device()`, which copies host data to the GPU and returns a CUDA device array:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x112327f28>\n",
"(100000,)\n",
"float32\n"
]
}
],
"source": [
"from numba import cuda\n",
"\n",
"x_device = cuda.to_device(x)\n",
"y_device = cuda.to_device(y)\n",
"\n",
"print(x_device)\n",
"print(x_device.shape)\n",
"print(x_device.dtype)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Device arrays can be passed to CUDA functions just like NumPy arrays, but without the copy overhead:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 429 µs per loop\n"
]
}
],
"source": [
"%timeit add_ufunc(x_device, y_device)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's a big performance improvement already, but we are still allocating a device array for the output of the ufunc and copying it back to the host on every call. We can allocate the output buffer ourselves with the `numba.cuda.device_array()` function:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"out_device = cuda.device_array(shape=(n,), dtype=np.float32) # does not initialize the contents, like np.empty()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then we can use a special `out` keyword argument to the ufunc to specify the output buffer:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 235 µs per loop\n"
]
}
],
"source": [
"%timeit add_ufunc(x_device, y_device, out=out_device)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have removed the device allocation and copy steps, the computation runs *much* faster than before. When we want to bring the device array back to the host memory, we can use the `copy_to_host()` method:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0. 3. 6. 9. 12. 15. 18. 21. 24. 27.]\n"
]
}
],
"source": [
"out_host = out_device.copy_to_host()\n",
"print(out_host[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise\n",
"\n",
"(Convert example functions to ufuncs and measure impact of computing with host arrays vs device arrays)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
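Taken together, the cells in this notebook amount to the following workflow. This is a sketch, not part of the committed notebook: it assumes Numba is installed and a CUDA-capable GPU is present, and timings will vary by hardware.

```python
import numpy as np
from numba import cuda, vectorize

# Compile the element-wise addition as a CUDA ufunc
@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y

n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x

# Copy the inputs to the device once, then reuse them across calls
x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

# Pre-allocate an uninitialized output buffer on the device (like np.empty)
out_device = cuda.device_array(shape=(n,), dtype=np.float32)

# With device-resident inputs and an `out` buffer, no host<->device
# copies happen during the call itself
add_ufunc(x_device, y_device, out=out_device)

# Copy the result back only when the host actually needs it
out_host = out_device.copy_to_host()
print(out_host[:10])
```

The three timed variants in the notebook differ only in where the arrays live: host arrays on both sides (slowest), device inputs with an implicit device output, and device inputs with a pre-allocated `out` buffer (fastest).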
