Draft of tutorial notebooks. Some exercises missing.
# GTC 2017 Numba Tutorial Notebook 3: Memory Management

## Managing GPU Memory

During the benchmarking in the previous notebook, we used NumPy arrays on the CPU as inputs and outputs. If you want to reduce the impact of host-to-device/device-to-host bandwidth, it is best to copy data to the GPU explicitly and leave it there to amortize the cost over multiple function calls. In addition, allocating device memory can be relatively slow, so allocating GPU arrays once and refilling them with data from the host can also be a performance improvement.

Let's create our example addition ufunc again:
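Before moving to the GPU, a rough host-side analogue (NumPy only, not part of the original notebook) illustrates the same amortization idea: allocate one output buffer up front and reuse it across calls, instead of allocating a fresh array on every call:

```python
import numpy as np

x = np.arange(100_000, dtype=np.float32)
y = 2 * x

# Allocate the output once, then reuse it across calls
out = np.empty_like(x)
for _ in range(10):
    np.add(x, y, out=out)  # writes into the existing buffer, no new allocation

print(out[:4])  # [0. 3. 6. 9.]
```

The GPU version of this pattern is exactly what the rest of this notebook builds up to.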
```python
from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y
```
```python
n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x
```
```python
%timeit add_ufunc(x, y)  # Baseline performance with host arrays
```

```
The slowest run took 137.56 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.25 ms per loop
```
The `numba.cuda` module includes a function that will copy host data to the GPU and return a CUDA device array:
```python
from numba import cuda

x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

print(x_device)
print(x_device.shape)
print(x_device.dtype)
```

```
<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x112327f28>
(100000,)
float32
```
Device arrays can be passed to CUDA functions just like NumPy arrays, but without the copy overhead:
```python
%timeit add_ufunc(x_device, y_device)
```

```
1000 loops, best of 3: 429 µs per loop
```
That's a big performance improvement already, but we are still allocating a device array for the output of the ufunc and copying it back to the host. We can create the output buffer with the `numba.cuda.device_array()` function:
```python
out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # does not initialize the contents, like np.empty()
```
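Like `cuda.device_array()`, NumPy's `np.empty()` allocates without initializing, so the buffer holds arbitrary values until you write to it. A quick host-side sketch (an illustration added here, not from the original notebook):

```python
import numpy as np

# Allocation without initialization: contents are arbitrary until written
buf = np.empty(5, dtype=np.float32)
print(buf.shape, buf.dtype)  # (5,) float32

# If the code reads the buffer before fully writing it, initialize explicitly:
buf[:] = 0.0
```

For a ufunc output buffer this does not matter, since every element is overwritten.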
And then we can use a special `out` keyword argument to the ufunc to specify the output buffer:
```python
%timeit add_ufunc(x_device, y_device, out=out_device)
```

```
1000 loops, best of 3: 235 µs per loop
```
Now that we have removed the device allocation and copy steps, the computation runs *much* faster than before. When we want to bring the device array back to the host memory, we can use the `copy_to_host()` method:
```python
out_host = out_device.copy_to_host()
print(out_host[:10])
```

```
[ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]
```
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Exercise\n", | ||
"\n", | ||
"(Convert example functions to ufuncs and measure impact of computing with host arrays vs device arrays)" | ||
] | ||
}, | ||
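As a starting point for the exercise, here is a small host-only timing helper (pure NumPy and the standard library; `bench` is a hypothetical name introduced here, not part of the notebook). The device-array measurements would follow the same pattern, substituting `cuda.to_device` inputs as shown earlier:

```python
import time
import numpy as np

def bench(fn, *args, repeat=100):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float('inf')
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

x = np.arange(100_000, dtype=np.float32)
y = 2 * x

t_host = bench(np.add, x, y)  # host-array baseline
print(f"host add: {t_host * 1e6:.1f} µs")
```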