<img src='img/anaconda-logo.png' align='left' style="padding:10px">
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Accelerate FFT Convolution with Dask

## Table of Contents
* [Distributed GPU FFT convolution (with Dask Distributed)](#Distributed-GPU-FFT-convolution-%28with-Dask-Distributed%29)
	* [GPU FFT Convolution Code](#GPU-FFT-Convolution-Code)
* [Using ``dask.distributed``](#Using-dask.distributed)
	* [Apply dask.distributed](#Apply-dask.distributed)
	* [Things to Try](#Things-to-Try)


# Distributed GPU FFT convolution (with Dask Distributed)

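Setup (later releases of `distributed` renamed the `dscheduler` and `dworker` commands to `dask-scheduler` and `dask-worker`):

Launch the dask.distributed scheduler: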

```bash
$ dscheduler
```

Launch the dask workers, pointing each one at the scheduler address:

```bash
$ dworker <dscheduler_address>:8786
```

In [None]:
!dworker --help

## GPU FFT Convolution Code

The following code is the same as in the earlier lesson on FFT convolution.

In [None]:
from __future__ import division, print_function

import sys

import numpy as np
from scipy.signal import fftconvolve
from scipy.misc import imresize
from scipy.ndimage import imread
import skimage.data
from skimage.color import rgb2gray
from matplotlib import pyplot as plt

from numba import cuda, vectorize
from timeit import default_timer as timer

%matplotlib inline

In [None]:
# Build 5x5 laplacian filter
laplacian_pts = '''
-4 -1 0 -1 -4
-1  2 3  2 -1
 0  3 4  3  0
-1  2 3  2 -1
-4 -1 0 -1 -4
'''.split()

laplacian = np.array(laplacian_pts, dtype=np.float32).reshape(5, 5)

In [None]:
import accelerate.cuda.fft as cufft

@vectorize(['complex64(complex64, complex64)'], target='cuda')
def gpu_mult(a, b):
    # a GPU ufunc to compute the elementwise product 
    return a * b

def gpu_fftconvolve(image):
    image_complex = image.astype(np.complex64)
    response_complex = np.zeros_like(image_complex)
    response_complex[:5, :5] = laplacian.astype(np.complex64)
    
    # explicit CPU->GPU memory transfer
    d_image_complex = cuda.to_device(image_complex)
    d_response_complex = cuda.to_device(response_complex)

    # GPU forward FFT
    cufft.fft_inplace(d_image_complex)
    cufft.fft_inplace(d_response_complex)

    # GPU ufunc
    gpu_mult(d_image_complex, d_response_complex, out=d_image_complex)

    # GPU inverse FFT
    cufft.ifft_inplace(d_image_complex)

    # explicit GPU->CPU memory transfer
    cvimage_gpu = d_image_complex.copy_to_host().real
    return cvimage_gpu
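For checking results on a machine without a GPU, the same steps (zero-pad the filter to the image size, forward FFT, elementwise product, inverse FFT) can be sketched with NumPy alone. This is a hypothetical reference helper, not part of the original lesson; the name `cpu_fftconvolve_reference` and the impulse test are assumptions for illustration. Note that `np.fft.ifft2` normalizes by the number of elements. If the GPU inverse transform is unnormalized, as raw cuFFT transforms are, the GPU output would be larger by a factor of `image.size`; this sketch does not verify that for `accelerate.cuda.fft`.

```python
import numpy as np


def cpu_fftconvolve_reference(image, kernel):
    """Circular FFT convolution on the CPU, mirroring the GPU steps."""
    image_complex = image.astype(np.complex64)

    # Zero-pad the kernel into the top-left corner of an image-sized array
    response_complex = np.zeros_like(image_complex)
    kh, kw = kernel.shape
    response_complex[:kh, :kw] = kernel.astype(np.complex64)

    # Forward FFTs followed by the elementwise product
    product = np.fft.fft2(image_complex) * np.fft.fft2(response_complex)

    # np.fft.ifft2 divides by the number of elements (normalized inverse)
    return np.fft.ifft2(product).real


# Usage: convolving with a unit impulse should return the input unchanged
impulse = np.zeros((5, 5), dtype=np.float32)
impulse[0, 0] = 1.0
test_image = np.random.RandomState(0).rand(16, 16).astype(np.float32)
result = cpu_fftconvolve_reference(test_image, impulse)
```

Because the filter is padded rather than the image, the convolution wraps around at the image borders, which is also what the GPU version above computes.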

# Using ``dask.distributed``

Function to generate random images

In [None]:
def generate_image(size):
    return skimage.data.binary_blobs(length=size).astype(np.float32)

View the sample image

In [None]:
im = generate_image(512)
plt.figure(figsize=(8,8))
plt.imshow(im, cmap=plt.cm.gray)

Test our GPU FFT convolve function

In [None]:
out = gpu_fftconvolve(im)

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(out, cmap=plt.cm.gray)

## Apply dask.distributed

Connect to the scheduler.
The `Executor` interface follows the same pattern as [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html). (Later releases of `distributed` renamed `Executor` to `Client`.)

In [None]:
from dask.distributed import Executor
e = Executor('127.0.0.1:8786')
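For readers unfamiliar with that pattern, here is a minimal standard-library sketch of it. It uses a local thread pool rather than a dask cluster, and the `square` function is a hypothetical stand-in for real work.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Hypothetical stand-in for a real computation
    return x * x


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a future immediately; result() blocks until it is done
    future = pool.submit(square, 3)
    single = future.result()

    # map() applies the function to every input and yields results in order
    mapped = list(pool.map(square, range(5)))

print(single, mapped)
```

One difference to keep in mind: in this notebook the `map` call on the dask executor returns futures, and the results are collected afterwards with `gather`.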

Generate 10 random images

In [None]:
images = [generate_image(size=512) for _ in range(10)]

Scatter our images to all workers

In [None]:
future_images = e.scatter(images)

Apply our GPU FFT convolution function on the loaded images.

The function references GPU ufuncs and cuFFT functions.  The JIT-compiled GPU ufuncs can be serialized and transferred to the worker nodes, where they are deserialized and finalized to machine code.

In [None]:
future_convolved = e.map(gpu_fftconvolve, future_images)

We use the `.gather` method to get the results of the convolution.  Unlike `.map`, it does not return futures: it waits for the computations to finish and returns a list of arrays, one per input image.

In [None]:
convolved = e.gather(future_convolved)

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(convolved[0], cmap=plt.cm.gray)

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(convolved[1], cmap=plt.cm.gray)

## Things to Try

With `dask.distributed`, multi-node and multi-GPU usage becomes a separate problem of worker configuration.  Users can launch many instances of `dworker` on different machines and connect them to the `dscheduler` process.  For the above example, it is necessary to change how the image data is accessed.  A distributed filesystem is perhaps the simplest way to ensure that all worker processes can access the images.

To use multiple GPUs, launch multiple `dworker` processes on each machine.  It is advisable to reduce the number of threads per worker (the default is one thread per CPU core) to prevent oversubscription.  To designate the GPU for each worker, set the environment variable `CUDA_VISIBLE_DEVICES`.  This variable is read by the NVIDIA CUDA driver and applies to every CUDA process started in the same environment.

For example:

```bash
$ CUDA_VISIBLE_DEVICES=0 dworker <dscheduler_address>:8786 --nthreads=2
```

This launches a `dworker` with 2 threads that can see only GPU 0 of the system.
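To start one worker per GPU, the same command can be repeated with a different `CUDA_VISIBLE_DEVICES` value each time. The sketch below only builds the argument list and environment for each worker and does not launch anything. The helper name `worker_launch_specs`, the scheduler address, and the GPU ids are assumptions for illustration.

```python
import os


def worker_launch_specs(scheduler_address, gpu_ids, nthreads=2):
    """Return one (argv, env) pair per GPU; nothing is launched here."""
    specs = []
    for gpu_id in gpu_ids:
        # Copy the current environment and restrict this worker to one GPU
        env = dict(os.environ)
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        argv = ["dworker", scheduler_address, "--nthreads={}".format(nthreads)]
        specs.append((argv, env))
    return specs


specs = worker_launch_specs("127.0.0.1:8786", gpu_ids=[0, 1])
for argv, env in specs:
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

Each pair could then be passed to `subprocess.Popen(argv, env=env)` on the machine that hosts the GPUs.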

To get a list of available devices, run the following:

In [None]:
from numba import cuda
cuda.detect()
