Further Optimization using NumbaPro
---

One of the most exciting new products from [Continuum Analytics](www.continuum.io) is NumbaPro, which lets code written in Python target CUDA-capable GPUs for parallel computation.

For a quick primer on how parallel computation on GPUs works, check out ???

Now, for a brief proof-of-concept look at what NumbaPro can do, let's return to our old friend, 1D nonlinear convection.
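
For reference, every implementation in this notebook is meant to apply the same forward-time, backward-space update for $\partial u/\partial t + u\,\partial u/\partial x = 0$. Written with the superscript $n$ for the previous time level (the array `un` in the code) and $n+1$ for the new one (`u`), it reads:

$$
u_i^{n+1} = u_i^n - u_i^n \frac{\Delta t}{\Delta x}\left(u_i^n - u_{i-1}^n\right)
$$

This matches the NumPy expression `u[1:] = -un[1:]*dt/dx*(un[1:]-un[:-1])+un[1:]` term by term.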

Yes, this is a trivial problem, but it demonstrates well the speed gains that NumbaPro and GPU computation can offer.

We'll start by importing the usual libraries, plus the `time` library so we can measure our performance gains, and the names we need from `numbapro`.
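
As a rough sketch of the timing pattern used later in this notebook, the helper below wraps a call between two clock reads and keeps the best of several runs. The name `time_call` and the `repeats` parameter are hypothetical, and `time.perf_counter` (Python 3) is swapped in here for the `time.time` calls the notebook itself uses. Repeating the call also helps with JIT-compiled functions, whose first call usually includes compilation time.

```python
import time

def time_call(func, *args, repeats=3):
    """Hypothetical helper: return (best_seconds, result) over `repeats` calls."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()          # high-resolution clock read
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)            # keep the fastest run
    return best, result

best, total = time_call(sum, range(100000))
print("best of 3: %.6f seconds" % best)
```

Note that a function which modifies its input in place, as the convection solvers here do, should be given a fresh copy of the initial condition on each repeat.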

`autojit` is the same decorator we used with regular `numba`, and we'll use it the same way here, to provide a comparison between regular Numba and NumbaPro.

`cuda` is the NumbaPro module that provides the CUDA intrinsics which allow us to target the GPU for computation.

`float32` is a data type. Python generally takes care of whether we want an `int` or a `str` for us, but when we start delving into the dark depths of memory management, it can be helpful (and sometimes required) to be more specific about our data formats.
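
To make the difference concrete, here is a small sketch using only the Python standard library (not NumbaPro) that compares a 4-byte single-precision float with Python's default 8-byte double. It assumes a typical platform where the `array` type code `'f'` is 4 bytes and `'d'` is 8 bytes.

```python
import struct
from array import array

# Storage size per element: 'f' is single precision, 'd' is double precision.
single_bytes = array("f").itemsize
double_bytes = array("d").itemsize

# Round-trip 0.1 through a 4-byte IEEE 754 single to see the precision loss.
as_float32 = struct.unpack("f", struct.pack("f", 0.1))[0]

print(single_bytes, double_bytes)
print(as_float32 == 0.1)      # the single-precision value is not the same number
print(abs(as_float32 - 0.1))  # size of the rounding difference
```

Half the storage per value means half the memory traffic to and from the GPU, at the cost of roughly seven significant decimal digits instead of sixteen.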

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import time
from numbapro import autojit, cuda, jit, float32

The first function we'll try is a simple implementation using array operations in NumPy.

In [3]:
###1-D Nonlinear convection implemented using Numpy
def NonLinNumpy(u, un, nx, nt, dx, dt):

    ###Run through nt timesteps, updating u from a copy of the previous step
    for n in range(nt): ##loop across number of time steps
        un = u.copy()
        u[1:] = -un[1:]*dt/dx*(un[1:]-un[:-1])+un[1:]
    
    return u

The 'vanilla' version is what we used for Step 2: two nested loops, and not that efficient.
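
One detail matters when the array expression is unrolled into explicit loops: the scheme reads every value from the previous time level, which is why `NonLinNumpy` copies `u` into `un` at the start of each time step. The sketch below, in plain Python lists with hypothetical helper names and made-up values for `dt` and `dx`, shows that a loop which updates in place without that copy gives a different answer after a single step.

```python
def step_with_copy(u, dt, dx):
    un = list(u)  # previous time level, left untouched during the sweep
    for i in range(1, len(u)):
        u[i] = un[i] - un[i] * dt / dx * (un[i] - un[i - 1])
    return u

def step_in_place(u, dt, dx):
    for i in range(1, len(u)):
        # u[i - 1] was already overwritten earlier in this same sweep
        u[i] = u[i] - u[i] * dt / dx * (u[i] - u[i - 1])
    return u

dx, dt = 1.0, 0.25                      # illustrative values only
initial = [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]  # small hat function
copied = step_with_copy(list(initial), dt, dx)
in_place = step_in_place(list(initial), dt, dx)
print(copied)
print(in_place)
```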

In [4]:
###1-D Nonlinear convection implemented using 'vanilla' Python
def NonLinVanilla(u, nx, nt, dx, dt):

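    for n in range(nt):
        un = u.copy() ##keep the previous time level so the sweep does not read values it has just overwritten
        for i in range(1,nx):
            u[i] = -un[i]*dt/dx*(un[i]-un[i-1])+un[i]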

    return u

Here we've implemented the same 'vanilla' version, but we've added the `@autojit` decorator, which tells Numba to JIT compile this function for a nice speed boost.

In [5]:
###1-D Nonlinear convection implemented using Numba JIT compiler (built on LLVM)
@autojit
def NonLinNumba(u,un, nx, nt, dx, dt):

    for n in range(nt):
        for i in range(nx): ##copy the current solution into un before updating
            un[i] = u[i]
        for i in range(1,nx):
            u[i] = -un[i]*dt/dx*(un[i]-un[i-1])+un[i]

    return u

CUDA JIT
---

There's a lot going on here that will be new to you, so we'll go through it piece by piece.

`@jit(argtypes=[float32[:], float32, float32, float32, float32[:]], target='gpu')`

Instead of `@autojit`, which automatically figures out data types for us, we have to specify what kind of variables will be sent to this function (which is actually a CUDA 'kernel'). The `argtypes` above tell the kernel that there will be five arguments: two float arrays (`u` and `un`) and three scalar floats (`dx`, `dt`, and `nt`).

In [6]:
###1-D Nonlinear convection implemented using NumbaPro CUDA-JIT
@jit(argtypes=[float32[:], float32, float32, float32, float32[:]], target='gpu')
def NonLinCudaJit(u, dx, dt, nt, un):
    tid = cuda.threadIdx.x
    blkid = cuda.blockIdx.x
    blkdim = cuda.blockDim.x
    i = tid + blkid * blkdim

    if i >= u.shape[0]:
        return

    for n in range(nt):
        un[i] = -u[i]*dt/dx*(u[i]-u[i-1])+u[i]
        
        cuda.syncthreads() ##barrier for the threads of this block only; it does not synchronize across blocks
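
The first four lines of the kernel compute which array element each GPU thread is responsible for. The sketch below emulates that arithmetic on the CPU in plain Python, with no GPU or NumbaPro needed. The function name `global_indices` and the small sizes are hypothetical and chosen only for illustration.

```python
def global_indices(n, blocks, threads_per_block):
    """Emulate i = threadIdx.x + blockIdx.x * blockDim.x on the CPU."""
    handled = []
    for blkid in range(blocks):                  # plays the role of cuda.blockIdx.x
        for tid in range(threads_per_block):     # plays the role of cuda.threadIdx.x
            i = tid + blkid * threads_per_block  # threads_per_block is cuda.blockDim.x
            if i >= n:
                continue  # same guard as the kernel's early return
            handled.append(i)
    return handled

indices = global_indices(n=10, blocks=3, threads_per_block=4)
print(indices)
```

With 3 blocks of 4 threads there are 12 threads for 10 elements, so each element is handled exactly once and the last two threads do nothing. On a real GPU the threads run concurrently rather than in this loop order.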

In [14]:
def main(nx):
    ##System Conditions    
    #nx = 500 
    nt = 500
    c = 1
    xmax = 15.0
    dx = xmax/(nx-1)
    sigma = 0.25
    dt = sigma*dx

    ##Initial Conditions for wave
    ui = np.ones(nx) ##create a 1xn vector of 1's
    ui[int(.5/dx):int(1/dx+1)]=2 ##set hat function I.C. : .5<=x<=1 is 2 (slice bounds must be integers)
    un = np.ones(nx)    

    if nx < 20001:
        t1 = time.time()
        u = NonLinVanilla(ui, nx, nt, dx, dt)
        t2 = time.time()
        print("Vanilla version took: %.6f seconds" % (t2-t1))
    
    
    ui = np.ones(nx) ##create a 1xn vector of 1's
    ui[int(.5/dx):int(1/dx+1)]=2 ##set hat function I.C. : .5<=x<=1 is 2 (slice bounds must be integers)
    
    t1 = time.time()
    u = NonLinNumpy(ui, un, nx, nt, dx, dt)
    t2 = time.time()
    print("Numpy version took: %.6f seconds" % (t2-t1))
    numpytime = t2-t1
    #plt.plot(numpy.linspace(0,xmax,nx),u[:],marker='o',lw=2)

    
    ui = np.ones(nx) ##create a 1xn vector of 1's
    ui[int(.5/dx):int(1/dx+1)]=2 ##set hat function I.C. : .5<=x<=1 is 2 (slice bounds must be integers)
    
    t1 = time.time()
    u = NonLinNumba(ui, un, nx, nt, dx, dt)
    t2 = time.time()
    print("Numbapro Vectorize version took: %.6f seconds" % (t2-t1))
    vectime = t2-t1
    #plt.plot(numpy.linspace(0,xmax,nx),u[:],marker='o',lw=2)

    u = np.ones(nx)
    u = ui.copy()
    griddim = 320, 1
    blockdim = 768, 1, 1
    NonLinCudaJit_conf = NonLinCudaJit[griddim, blockdim]
    t1 = time.time()
    NonLinCudaJit_conf(u, dx, dt, nt, un) ##launch the kernel with the grid/block configuration set above
    t2 = time.time()

    print("Numbapro Cuda version took: %.6f seconds" % (t2-t1))
    cudatime = t2-t1
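
The launch configuration in `main` uses 320 blocks of 768 threads each. As a quick sanity check, the sketch below works out how many threads that provides and how many blocks a given `nx` would actually need, using ceiling division. The helper `blocks_needed` is hypothetical and not part of NumbaPro.

```python
def blocks_needed(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block

threads_per_block = 768
total_threads = 320 * threads_per_block  # threads launched by griddim = 320, 1
print(total_threads)

for nx in (500, 20000, 200000):
    print(nx, blocks_needed(nx, threads_per_block))
```

The fixed grid is large enough for the biggest case run in this notebook (`nx = 200000`), and for smaller `nx` the surplus threads simply hit the `i >= u.shape[0]` guard and return.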

In [10]:
main(500)

Vanilla version took: 0.581475 seconds
Numpy version took: 0.007635 seconds
Numbapro Vectorize version took: 0.000966 seconds
Numbapro Cuda version took: 0.002658 seconds


In [11]:
main(1000)

Vanilla version took: 1.140803 seconds
Numpy version took: 0.008878 seconds
Numbapro Vectorize version took: 0.001837 seconds
Numbapro Cuda version took: 0.002678 seconds


In [12]:
main(5000)

Vanilla version took: 5.336566 seconds
Numpy version took: 0.023648 seconds
Numbapro Vectorize version took: 0.009166 seconds
Numbapro Cuda version took: 0.002717 seconds


In [13]:
main(10000)

Vanilla version took: 10.719647 seconds
Numpy version took: 0.043988 seconds
Numbapro Vectorize version took: 0.018464 seconds
Numbapro Cuda version took: 0.002899 seconds


In [15]:
main(20000)

Vanilla version took: 21.414605 seconds
Numpy version took: 0.083821 seconds
Numbapro Vectorize version took: 0.036616 seconds
Numbapro Cuda version took: 0.002943 seconds


In [16]:
main(50000)

Numpy version took: 0.207808 seconds
Numbapro Vectorize version took: 0.093922 seconds
Numbapro Cuda version took: 0.003228 seconds


In [17]:
main(100000)

Numpy version took: 0.456931 seconds
Numbapro Vectorize version took: 0.189677 seconds
Numbapro Cuda version took: 0.004876 seconds


In [18]:
main(200000)

Numpy version took: 1.255342 seconds
Numbapro Vectorize version took: 0.393786 seconds
Numbapro Cuda version took: 0.005403 seconds
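
To read the trend more easily, the sketch below computes speedup ratios from the timings printed above. The numbers are copied from the output cells, and the line labelled "Numbapro Vectorize" there is the `@autojit` function `NonLinNumba`. These are single-run timings on one machine, so treat the ratios as indicative only.

```python
# nx: (numpy_seconds, numba_autojit_seconds, cuda_seconds), copied from the runs above
timings = {
    500:    (0.007635, 0.000966, 0.002658),
    1000:   (0.008878, 0.001837, 0.002678),
    5000:   (0.023648, 0.009166, 0.002717),
    10000:  (0.043988, 0.018464, 0.002899),
    20000:  (0.083821, 0.036616, 0.002943),
    50000:  (0.207808, 0.093922, 0.003228),
    100000: (0.456931, 0.189677, 0.004876),
    200000: (1.255342, 0.393786, 0.005403),
}

speedups = {}
for nx, (t_numpy, t_numba, t_cuda) in sorted(timings.items()):
    # ratio > 1 means the CUDA version was faster
    speedups[nx] = (t_numpy / t_cuda, t_numba / t_cuda)
    print("nx=%6d  numpy/cuda=%7.1fx  numba/cuda=%6.1fx" % ((nx,) + speedups[nx]))
```

For the smallest grids the CUDA version is slower than the `@autojit` version, since its run time stays nearly constant at a few milliseconds regardless of `nx`. The advantage only appears as `nx` grows.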

