# Benchmarking Julia on a PDE: the Kuramoto-Sivashinsky equation

### The benchmark algorithm: KS-CNAB2

The Kuramoto-Sivashinsky (KS) equation is a nonlinear time-evolving partial differential equation (PDE) on a 1d spatial domain.

\begin{equation*}
u_t = -u u_{x} - u_{xx} - u_{xxxx}
\end{equation*}

where $x$ is space, $t$ is time, and subscripts indicate differentiation. We assume a spatial domain $x \in [0, L_x]$ with periodic boundary conditions.

The KS-CNAB2 benchmark algorithm is a simple numerical integration scheme for the KS equation that uses Fourier expansion in space, collocation calculation of the nonlinear term $u u_x$, and finite-differencing in time, specifically 2nd-order Crank-Nicolson Adams-Bashforth (CNAB2) timestepping. Mathematical details of the KS-CNAB2 algorithm are provided [below](## Mathematics of the CNAB2 algorithm).

As PDE solvers go, this one is super-simple, about twenty lines of code, and so comparable in scale to the Fibonacci, pi-sum, etc. benchmarks at https://julialang.org/benchmarks/.
However, this benchmark is a little different from those in that the dominant cost of the algorithm should be the fast Fourier transforms (FFTs), which in all languages are performed with calls to the same external C library, [FFTW](http://fftw.org/). So what I'm testing here is the overhead each language imposes over FFTW for things like function calls, index bounds checking, allocation of temporary arrays, and pointer dereferencing.

This benchmark is meant as a preliminary investigation towards using Julia for classic high-performance computations (HPC) for PDEs from engineering and physics.

###  Languages and implementation differences

I implemented the KS-CNAB2 benchmark algorithm in Python, Matlab, C++, Fortran, and in three slightly different forms in Julia.

 * The Python, Matlab, and Julia naive codes ([ksintegrate.py](codes/ksintegrate.py), [ksintegrate.m](codes/ksintegrate.m), [ksintegrateNaive.jl](codes/ksintegrateNaive.jl)) are line-by-line identical except for differences in language syntax. These codes are very Matlabby in style, in that they employ vectorized operations, do out-of-place FFTs, and don't take any particular care to avoid allocating temporary arrays. 
 
 
 * The Julia in-place code ([ksintegrateInplace.jl](codes/ksintegrateInplace.jl)) uses a bit of FFTW expertise to do the FFTs in-place and julia-0.6's new loop-fusion capability to eliminate the generation of temporaries in the vector arithmetic expressions inside the timestepping loop. Both these changes are pretty simple; you just have to understand what the issues are and the right Julia syntax to fix them.
 
 
 * The Julia unrolled code [ksintegrateUnrolled.jl](codes/ksintegrateUnrolled.jl) does the FFTs in-place and manually unrolls all the vector operations into `for` loops, instead of utilizing loop fusion. It's slightly lengthier but turns out to be a bit faster. 
 
 
 * The C++ code [ksintegrate.cpp](codes/ksintegrate.cpp) doesn't use any object orientation or heavy C++ libraries; it's essentially straight array-oriented C code, but using C++'s Complex type and I/O methods. 


 * The Fortran code [ksintegrate.f90](codes/ksintegrate.f90) is Fortran 90. Ashley Willis of U. Sheffield wrote it for me since I know little about Fortran. For reasons I don't understand, the Fortran code seg faults for $N_x \ge 8192$ in a memory out-of-bounds error during a call to FFTW. So I don't have complete benchmarks for Fortran. If anyone can help me figure this out, I'd appreciate it.
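
To make the difference between the three Julia variants concrete, here is a minimal sketch of the CNAB2 update line written three ways. The function names `step_naive`, `step_fused!`, and `step_unrolled!` are made up for illustration and are not taken from the benchmark codes; only the update expression itself follows the timestepping loop shown later in this notebook.

```julia
# Three ways to write the update u <- B*(A*u + dt32*Nn - dt2*Nn1).
# Function names are illustrative, not taken from the benchmark codes.

# Matlab-style: the vectorized operations can allocate temporary arrays
step_naive(u, A, B, Nn, Nn1, dt32, dt2) = B .* (A .* u + dt32*Nn - dt2*Nn1)

# Loop fusion: dotted operators plus .= fuse into one in-place loop
function step_fused!(u, A, B, Nn, Nn1, dt32, dt2)
    u .= B .* (A .* u .+ dt32 .* Nn .- dt2 .* Nn1)
    return u
end

# Manual unrolling: an explicit loop over the elements
function step_unrolled!(u, A, B, Nn, Nn1, dt32, dt2)
    for i = 1:length(u)
        @inbounds u[i] = B[i]*(A[i]*u[i] + dt32*Nn[i] - dt2*Nn1[i])
    end
    return u
end
```

All three compute the same values; the second and third overwrite `u` instead of returning a freshly allocated vector.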
 
All the codes use complex-to-complex FFTs instead of real-to-complex. Real-to-complex would be appropriate and faster for this problem, but I don't yet understand how to call the FFTW r2c from Python or Matlab. I would very much like to include results for expertly-tuned algorithms in each language in addition to the straightforward and mildly tuned codes above, including real-to-complex transforms, multicore & distributed-memory codes, Numba and Dedalus implementations for Python, and `ApproxFun.jl + DifferentialEquations.jl` for Julia. If you have expertise in those and can help, have at it and send a pull request.

The codes above show only the KS-CNAB2 integration algorithm encoded as a function in the given language. Self-contained codes including the benchmark algorithms and the driver programs are given [below](### Benchmark codes).

### Benchmark problem

The KS-CNAB2 benchmarks consist of running the above codes on $N_x$ uniform gridpoints or $N_x/2-1$ complex Fourier modes, a domain of length $L_x = 2\pi N_x/32$, giving constant gridspacing $\Delta x = \pi/16$, and the initial condition

\begin{equation*}
u(x,0) = \cos x + 0.1 \, \sin(x/8) + 0.01 \cos(2\pi x/L_x).
\end{equation*}

The integrations run from $t=0$ to $t=200$ with step size $\Delta t = 1/16$, totalling 3200 time steps. $N_x$ was set to powers of two running from $2^5 = 32$ to $2^{17} = 131072$. Each complex double-precision array for the largest simulation then occupies about 2 MB ($131072 \times 16$ bytes). The benchmarks were run on a single core of a six-core Intel Core i7-3960X CPU at 3.30GHz running
openSUSE Leap 42.2, kernel 4.4.73-18.17-default [(details on compilers, language versions, etc.)](benchmark-data/cputime.asc).
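
The problem setup above can be sketched as a small helper. The name `ksparams` and its keyword arguments are assumptions made for this illustration, not part of the benchmark codes.

```julia
# Benchmark parameters for a given number of gridpoints Nx.
# `ksparams` is a hypothetical helper, not part of the benchmark codes.
function ksparams(Nx; T=200.0, dt=1/16)
    Lx = 2pi*Nx/32             # domain length grows with Nx
    dx = Lx/Nx                 # grid spacing, pi/16 for every Nx
    Nt = round(Int, T/dt)      # number of time steps
    x  = Lx*(0:Nx-1)/Nx        # uniform gridpoints on [0, Lx)
    u0 = cos.(x) .+ 0.1*sin.(x/8) .+ 0.01*cos.((2pi/Lx)*x)   # initial condition
    return Lx, dx, Nt, u0
end
```

For example, `ksparams(1024)` gives $L_x = 64\pi$, which is the value used in the example run further down.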

 

## Results

### Execution time versus simulation size

In [26]:
using Plots
gr()
d = readdlm("benchmark-data/cputime.asc")
Nx = d[:,1]
plot( Nx, d[:,5], label="Python", marker=:circ, color="magenta", 
      yscale=:log10, xscale=:log10,xlim=(10,1e07),ylim=(1e-03,1e02))
plot!(Nx, d[:,3], label="Matlab", marker=:circ, color="green")
plot!(Nx, d[:,2], label="C++", marker=:circ, color="blue")
plot!(Nx, d[:,7], label="Julia naive", marker=:circ, color="orange")
plot!(Nx, d[:,9], label="Julia in-place", marker=:circ, color="yellow")
plot!(Nx, d[:,8], label="Julia unrolled", marker=:circ, color="red")
plot!(Nx, d[:,10], label="Fortran", marker=:circ, color="black")
plot!(Nx, 1e-05*Nx .* log10.(Nx), label="Nx log Nx", xlabel="Nx", ylabel="cpu time", 
     linestyle=:dash, title="execution time, Kuramoto-Sivashinsky simulation")



Note the termination of the Fortran data at $N_x = 4096$. 

In [3]:
plot!(xlim=(10^4,10^6), ylim=(1,1e2))

Timings for the last datapoint, $N_x = 2^{17} = 131072$, are


cputime (s) | language 
------------|-----------
37.1 | Python 
26.8 | Matlab 
22.5 | Julia naive 
15.8 | Julia in-place 
15.5 | C++ 
13.8 | Julia unrolled 

   

### Execution time versus lines of code

In [10]:
d = readdlm("benchmark-data/linecount.asc")
plot([d[1,2]], [d[1,1]],  label="Python", marker=:circ, color="magenta")
plot!([d[2,2]], [d[2,1]], label="Matlab", marker=:circ, color="green" )
plot!([d[3,2]], [d[3,1]], label="Julia naive", marker=:circ, color="orange")
plot!([d[4,2]], [d[4,1]], label="Julia in-place", marker=:circ, color="yellow")
plot!([d[5,2]], [d[5,1]], label="Julia unrolled", marker=:circ, color="red")
plot!([d[6,2]], [d[6,1]], label="Fortran (estimate)", marker=:circ, color="cyan")
plot!([d[7,2]], [d[7,1]], label="C++", marker=:circ, color="blue")
plot!(xlabel="lines of code", ylabel="cpu time, seconds", xlim=(0,80), ylim=(0,40))

The Fortran datapoint here is an estimate, since the Fortran code seg faults for $N_x \geq 8192$. Prior to this the Fortran execution time is about 9/10ths of Julia unrolled.

In [28]:
; cat benchmark-data/linecount.asc


# cputime for Nx=131072, lines of code, bytes of code, language
35.7   19  532  # Python
26.8   21  451  # Matlab
21.9   21  495  # Julia naive
15.7   29  682  # Julia inplace
13.8   37 1076  # Julia unrolled
12.4   64 2109  # Fortran (estimated cputime)
15.4   77 2487  # C++





## Mathematics of the CNAB2 algorithm

Start from the Kuramoto-Sivashinsky equation

\begin{equation*}
u_t = - u_{xx} - u_{xxxx} - u u_{x}
\end{equation*}

on the domain $[0,L_x]$ with periodic boundary conditions and initial condition $u(x,0) = u_0(x)$. We will use a finite Fourier expansion to discretize space and finite-differencing to discretize time, specifically the 2nd-order Crank-Nicolson/Adams-Bashforth (CNAB2) timestepping formula. CNAB2 is low-order but straightforward to describe and easy to implement for this simple benchmark.

Write the KS equation as 

\begin{equation*}
u_t = Lu + N(u)
\end{equation*}

where $Lu = - u_{xx} - u_{xxxx}$ contains the linear terms and $N(u) = -u u_{x}$ is the nonlinear term. In practice we'll calculate $N(u)$ in the equivalent form $N(u) = - 1/2 \, d/dx \, u^2$.
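
The two forms of the nonlinear term agree by the chain rule:

\begin{equation*}
\frac{d}{dx} u^2 = 2 u u_x \quad \Longrightarrow \quad -u u_x = -\frac{1}{2} \frac{d}{dx} u^2
\end{equation*}

The right-hand form is convenient in a Fourier code because it needs only the square of $u$ on the gridpoints, one forward FFT, and a multiplication by the spectral derivative operator.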

Discretize time by letting $u^n(x) = u(x, n\Delta t)$ for some small $\Delta t$. The CNAB2 timestepping formula approximates $u_t = Lu + N(u)$ at time $t = (n+1/2) \, \Delta t$ as

\begin{equation*}
\frac{u^{n+1} - u^n}{\Delta t} = \frac{1}{2} L\left(u^{n+1} + u^n\right) + \frac{3}{2} N(u^n) - \frac{1}{2} N(u^{n-1})
\end{equation*}


Put the unknown future $u^{n+1}$'s on the left-hand side of the equation and the present $u^{n}$ and past $u^{n-1}$ on the right.

\begin{equation*}
\left(I  - \frac{\Delta t}{2} L \right) u^{n+1} = \left(I  + \frac{\Delta t}{2}L \right) u^{n} + \frac{3 \Delta t}{2} N(u^n) - \frac{\Delta t}{2} N(u^{n-1})
\end{equation*}

Note that the linear operator $L$ applies to the unknown $u^{n+1}$ on the LHS, but that the nonlinear operator $N$ applies only to the knowns $u^n$ and $u^{n-1}$ on the RHS. This is an *implicit* treatment of the linear terms, which keeps the algorithm stable for large time steps, and an *explicit* treatment of the nonlinear term, which makes the timestepping equation linear in the unknown $u^{n+1}$.
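
Before bringing in Fourier modes, the same update can be tried on a scalar ODE $u' = Lu + N(u)$, where $L$ is just a number. The sketch below is an illustration only: the function name `cnab2` and the logistic test problem $u' = u - u^2$ are choices made here, not part of the benchmark.

```julia
# CNAB2 for the scalar ODE u' = L*u + N(u): implicit in L, explicit in N.
# `cnab2` is an illustrative name; L is a number and N is a function.
function cnab2(L, N, u0, dt, Nt)
    A = 1 - dt/2*L          # multiplies the unknown u^{n+1}
    B = 1 + dt/2*L          # multiplies the known u^n
    u = u0
    Nn  = N(u)              # N^n
    Nn1 = Nn                # N^{n-1}, set equal to N^n for the first step
    for n = 1:Nt
        u = (B*u + 3dt/2*Nn - dt/2*Nn1)/A   # solve A u^{n+1} = rhs
        Nn1 = Nn                            # shift nonlinear term in time
        Nn  = N(u)
    end
    return u
end

# Logistic equation u' = u - u^2 with u(0) = 0.1, integrated to t = 1
u = cnab2(1.0, v -> -v^2, 0.1, 0.01, 100)
exact = 0.1*exp(1)/(0.9 + 0.1*exp(1))       # closed-form logistic solution at t = 1
```

Comparing `u` against `exact` for a few values of `dt` is a quick way to check the second-order accuracy of the scheme.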

Now we discretize space with a finite Fourier expansion, so that $u$ now represents a vector of Fourier coefficients and $L$ turns into a matrix (and a diagonal matrix, since Fourier modes are eigenfunctions of the linear operator). Let matrix $A = (I  - \Delta t/2 \; L)$, matrix $B =  (I  + \Delta t/2 \; L)$, and let vector $N^n$ be the Fourier transform of a collocation calculation of $N(u^n)$. That is, $N^n$ is the Fourier transform of $- u u_x = - 1/2 \, d/dx \, u^2$ calculated at $N_x$ uniformly spaced gridpoints on the domain $[0, L_x]$.

With the spatial discretization, the CNAB2 timestepping formula becomes

\begin{equation*}
A \, u^{n+1} = B \, u^n + \frac{3 \Delta t}{2} N^n -  \frac{\Delta t}{2}N^{n-1}
\end{equation*}

This is a simple $Ax=b$ linear algebra problem whose iteration approximates the time-evolution of the Kuramoto-Sivashinsky PDE.
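
Because $A$ and $B$ are diagonal, they can be stored as vectors and the $Ax=b$ solve reduces to elementwise division. Here is a small sketch with $A$ and $B$ named as in the text; the values $N_x = 8$, $L_x = 2\pi$ and the stand-in coefficient vectors are arbitrary choices for illustration.

```julia
# Diagonal CNAB2 operators for a tiny example (Nx = 8, Lx = 2pi chosen for illustration)
Nx, Lx, dt = 8, 2pi, 1/16
n2 = div(Nx, 2)
kx = vcat(0:n2-1, 0, -n2+1:-1)      # integer wavenumbers, Nyquist mode set to zero
alpha = 2pi*kx/Lx                   # real wavenumbers
L = alpha.^2 - alpha.^4             # diagonal of L = -D^2 - D^4, with D = i*alpha
A = 1 .- (dt/2)*L                   # diagonal of I - dt/2 L
B = 1 .+ (dt/2)*L                   # diagonal of I + dt/2 L

# One CNAB2 step on stand-in Fourier coefficients
u   = ones(Complex{Float64}, Nx)    # stand-in for u^n
Nn  = zeros(Complex{Float64}, Nx)   # stand-in for N^n
Nn1 = zeros(Complex{Float64}, Nx)   # stand-in for N^{n-1}
unew = (B .* u + (3dt/2)*Nn - (dt/2)*Nn1) ./ A   # elementwise solve of A x = b
```

With the nonlinear terms switched off as here, each mode is simply multiplied by $B/A$ per step, so modes with negative $L$ decay.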

### Naive Julia implementation of CNAB2 algorithm

The naive Julia code is pretty much a line-by-line translation of the same thing in Matlab, about 30 lines of code excluding comments and whitespace. Here's a slight modification of the benchmarked algorithm which saves and plots $u(x,t)$ data.

In [19]:
function ksintegrateNaive(u, Lx, dt, Nt, nsave);
    
    Nx = length(u)                  # number of gridpoints
    x = collect((0:Nx-1)/Nx)*Lx     # gridpoints on [0, Lx)
    kx = vcat(0:Nx/2-1, 0, -Nx/2+1:-1)  # integer wavenumbers: exp(2*pi*i*kx*x/L)
    alpha = 2*pi*kx/Lx              # real wavenumbers:    exp(i*alpha*x)
    D = 1im*alpha;                  # D = d/dx operator in Fourier space
    L = alpha.^2 - alpha.^4         # linear operator -D^2 - D^4 in Fourier space
    G = -0.5*D                      # -1/2 D operator in Fourier space
    
    Nsave = div(Nt, nsave)+1        # number of saved time steps, including t=0
    t = (0:Nsave-1)*(dt*nsave)      # times of the saved data, one per row of U
    U = zeros(Nsave, Nx)            # matrix of u(xⱼ, tᵢ) values
    U[1,:] = u                      # assign initial condition to U
    s = 2                           # counter for saved data
    
    # Express PDE as u_t = Lu + N(u), L is linear part, N nonlinear part.
    # Then Crank-Nicolson Adams-Bashforth discretization is 
    # 
    # (I - dt/2 L) u^{n+1} = (I + dt/2 L) u^n + 3dt/2 N^n - dt/2 N^{n-1}
    #
    # let A = (I + dt/2 L) 
    #     B = (I - dt/2 L)^{-1}, then the CNAB2 timestep formula is
    # 
    # u^{n+1} = B (A u^n + 3dt/2 N^n - dt/2 N^{n-1}) 
    #
    # (the code's A and B correspond to B and A^{-1} in the text above)

    # convenience variables
    dt2  = dt/2;
    dt32 = 3*dt/2;
    A =  ones(Nx) + dt2*L;
    B = (ones(Nx) - dt2*L).^(-1);

    Nn  = G.*fft(u.*u); # -u u_x (spectral), notation Nn = N^n     = N(u(n dt))
    Nn1 = Nn;           #                   notation Nn1 = N^{n-1} = N(u((n-1) dt))
    u  = fft(u);        # transform u to spectral

    # timestepping loop
    for n = 1:Nt
        Nn1 = Nn;                       # shift nonlinear term in time: N^{n-1} <- N^n
        Nn  = G.*fft(real(ifft(u)).^2); # compute Nn = -u u_x

        u = B .* (A .* u + dt32*Nn - dt2*Nn1);
        
        if mod(n, nsave) == 0
            U[s,:] = real(ifft(u))
            s += 1            
        end
    end

    t,U
end


ksintegrateNaive (generic function with 2 methods)

### Run the Julia code and plot results

In [22]:
Lx = 64*pi
Nx = 1024
dt = 1/16
nsave = 16
Nt = 3200

x = Lx*(0:Nx-1)/Nx
u = cos.(x) + 0.1*sin.(x/8) + 0.01*cos.((2*pi/Lx)*x);
t,U = ksintegrateNaive(u, Lx, dt, Nt, nsave)
;

In [21]:
heatmap(x,t,U, xlim=(x[1], x[end]), ylim=(t[1], t[end]), xlabel="x", ylabel="t", 
title="Kuramoto-Sivashinsky dynamics", fillcolor=:bluesreds)


In [25]:
norm(U[end,:])

40.97799051783059

### Tuned Julia code

The naive Julia code (a straight Matlab translation) runs in the same range as Matlab (22.5 s versus 26.8 s at $N_x = 2^{17}$ in the timings above). We can tune the Julia code by
  * doing FFTs in-place 
  * removing temporary vectors in time-stepping loop
  * using @inbounds and @fastmath macros
  
The tuned Julia code is slightly faster than C++.


In [None]:
function ksintegrateTuned(u, Lx, dt, Nt);
    u = (1+0im)*u                       # force u to be complex
    Nx = length(u)                      # number of gridpoints
    kx = vcat(0:Nx/2-1, 0:0, -Nx/2+1:-1)# integer wavenumbers: exp(2*pi*kx*x/L)
    alpha = 2*pi*kx/Lx                  # real wavenumbers:    exp(alpha*x)
    D = 1im*alpha                       # spectral D = d/dx operator 
    L = alpha.^2 - alpha.^4             # spectral L = -D^2 - D^4 operator
    G = -0.5*D                          # spectral -1/2 D operator, to eval -u u_x = -1/2 d/dx u^2

    # convenience variables
    dt2  = dt/2
    dt32 = 3*dt/2
    A =  ones(Nx) + dt2*L
    B = (ones(Nx) - dt2*L).^(-1)

    # compute in-place FFTW plans
    FFT! = plan_fft!(u, flags=FFTW.ESTIMATE)
    IFFT! = plan_ifft!(u, flags=FFTW.ESTIMATE)

    # compute Nn == Fourier coeffs of -u u_x, then transform u to Fourier coeffs in place
    Nn  = G.*fft(u.^2); # Nn == -1/2 d/dx (u^2) = -u u_x, spectral
    Nn1 = copy(Nn);     # use Nn1 = Nn at first time step
    FFT!*u;

    # timestepping loop, many vector ops unrolled to eliminate temporary vectors
    for n = 1:Nt

        for i = 1:length(Nn)
            @inbounds Nn1[i] = Nn[i];
            @inbounds Nn[i] = u[i];            
        end

        IFFT!*Nn; # in-place inverse FFT

        for i = 1:length(Nn)
            @fastmath @inbounds Nn[i] = Nn[i]*Nn[i];
        end

        FFT!*Nn;

        for i = 1:length(Nn)
            @fastmath @inbounds Nn[i] = G[i]*Nn[i];
        end

        for i = 1:length(u)
            @fastmath @inbounds u[i] = B[i]* (A[i] * u[i] + dt32*Nn[i] - dt2*Nn1[i]);
        end

    end
    u = real(ifft(u))
end


### Benchmark codes

Here are the benchmark codes, which include both the integration algorithm and a driver program to run and time the algorithm at a given $N_x$. I haven't bothered to automate the running of the benchmark codes or the production of the benchmark data files. I run them manually as follows, from within the ``codes`` directory. 

**Python:** [ksbenchmark.py](codes/ksbenchmark.py) From an interactive python shell run 

```
Python 2.7.13 (default, Mar 22 2017, 12:31:17) [GCC] 
IPython 3.2.2 -- An enhanced Interactive Python.
In [1]: execfile("ksbenchmark.py")

In [2]: ksbenchmark(512)
```

**Matlab:** [ksbenchmark.m](codes/ksbenchmark.m) From a Matlab prompt 

```
>> ksbenchmark(512)
```


**Julia:** [ksbenchmark.jl](codes/ksbenchmark.jl) At the Julia REPL run 

```
julia> include("ksbenchmark.jl")
julia> ksbenchmark(512, ksintegrateNaive)
```

etc. for ``ksintegrateInplace`` and ``ksintegrateUnrolled``.
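
Since these runs are manual, a small timing helper can cut down on repeated typing. The following is only a sketch: `besttime` is a hypothetical name, not part of `ksbenchmark.jl`, and it measures wall-clock time with `@elapsed`, which on an otherwise idle machine is close to but not the same thing as CPU time.

```julia
# Hypothetical helper: call f(args...) nruns times and return the fastest wall-clock time.
function besttime(f, args...; nruns=3)
    f(args...)      # warm-up call so that JIT compilation is not included in the timing
    return minimum([@elapsed(f(args...)) for i = 1:nruns])
end
```

With the notebook version of the integrator defined above, a call would look like `besttime(ksintegrateNaive, u, Lx, dt, Nt, nsave)`.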

**C++:** [ksbenchmark.cpp](codes/ksbenchmark.cpp) At Unix prompt

```
bash$ g++ -O3 -o ksbenchmark-c++ ksbenchmark.cpp -lfftw3 -lm
bash$ ./ksbenchmark-c++ 512
```

**Fortran:** [ksbenchmark.f90](codes/ksbenchmark.f90) Edit the file to set $N_x$, then at Unix prompt

```
bash$ gfortran -O3 -o ksbenchmark-f90 ksbenchmark.f90 -lfftw3
bash$ ./ksbenchmark-f90
```

 