# Backward transform

The real-to-complex transforms are not documented very well. Our guess is that `r2cf` and `r2cb` are the forward and backward transforms. They both take **two** arguments for the real data. There are two reasonable possibilities: either the even/odd elements or the lower/upper halves of the data are passed through the two pointers. It turns out to be even/odd, which also makes the most sense. To see this, we first test the backward transform.
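To make the two candidate layouts concrete, here is a small NumPy sketch of how a 16-element real array would be split in each case. The variable names are illustrative only and are not part of `genfft`.

```python
import numpy as np

x = np.arange(16, dtype='float32')

# Candidate 1: even/odd split. Both pointers refer to the same array,
# one element apart, and each is read with stride 2.
even, odd = x[0::2], x[1::2]

# Candidate 2: lower half/upper half. Two contiguous blocks of 8 elements.
lower, upper = x[:8], x[8:]

print(even)
print(odd)
print(lower)
print(upper)
```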

Clean the Noodles cache:

In [44]:
!rm -f lib/db

In [45]:
import os

In [54]:
os.environ["PYOPENCL_CTX"] = "Intel(R) OpenCL HD Graphics"

In [55]:
import pyopencl as cl
import numpy as np
from copy import copy
from genfft.opencl import run, single_stage_r2c, default_config

def max_err(a, b):
    return np.abs(a - b).max()

Generate the codelet and set up OpenCL:

In [56]:
ctx = cl.create_some_context()
cfg = copy(default_config)
codelet = run(single_stage_r2c(cfg, 16, direction='b', sign=1))
prog = cl.Program(ctx, codelet).build()
queue = cl.CommandQueue(ctx)

Create an array of complex values for which we know the backward transform, and hope that FFTW uses the same data layout as the NumPy FFT does. Important: make sure that we convert to single precision (`float32` for the real data, `complex64` for the complex data).

In [57]:
x = np.arange(16, dtype='float32')
y = np.fft.rfft(x).astype('complex64')

In [58]:
mf = cl.mem_flags
y_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
x0_g = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes//2)
x1_g = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes//2)

We can slice the `cl.Buffer` object to get pointer offsets, but note that these slices are in **bytes**, so `y_g[4:]` skips one `float32` and points at the first imaginary part. The stride for the real data is 1; for the complex data it is 2, since the real and imaginary parts are interleaved. The last three arguments give the number of times the transform should be repeated and the respective strides for the outer loop.
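The byte offset and the stride of 2 can be checked on the host without OpenCL. This is a minimal sketch that only assumes NumPy's `complex64` memory layout (interleaved `float32` pairs); it does not call the kernel.

```python
import numpy as np

x = np.arange(16, dtype='float32')
y = np.fft.rfft(x).astype('complex64')

# A complex64 array is stored as re0, im0, re1, im1, ...
flat = y.view('float32')
itemsize = flat.itemsize            # 4 bytes per float32

re = flat[0::2]                     # byte offset 0, stride 2 floats
im = flat[4 // itemsize::2]         # byte offset 4, stride 2 floats

print(np.array_equal(re, y.real), np.array_equal(im, y.imag))
```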

In [59]:
prog.r2cb_16(queue, (1,), None, x0_g, x1_g, y_g, y_g[4:],
             np.int32(1), np.int32(2), np.int32(2),
             np.int32(1), np.int32(1), np.int32(1))

<pyopencl._cl.Event at 0x7fa598fde530>

Now we read out the results in both `x0` and `x1`.

In [60]:
x0 = np.zeros(8, dtype='float32')
x1 = np.zeros(8, dtype='float32')
cl.enqueue_copy(queue, x0, x0_g)
cl.enqueue_copy(queue, x1, x1_g)

<pyopencl._cl.NannyEvent at 0x7fa598eea650>

These are the even and odd parts of the real data, so we can join them using `np.c_` and flattening. The backward transform is unnormalized, hence the division by 16.

In [61]:
np.c_[x0, x1].flatten() / 16

array([ 0.       ,  1.0000002,  1.9999998,  3.0000005,  4.       ,
        5.       ,  6.       ,  7.       ,  8.       ,  9.       ,
       10.       , 11.       , 12.       , 13.       , 14.       ,
       15.       ], dtype=float32)
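The same reassembly can be reproduced on the host. In this sketch NumPy's `irfft`, multiplied back by `n`, stands in for the unnormalized kernel output; that substitution is an assumption made for illustration and is not the generated codelet.

```python
import numpy as np

n = 16
x = np.arange(n, dtype='float32')
y = np.fft.rfft(x).astype('complex64')

# Stand-in for the kernel: NumPy's irfft divides by n, so multiply it back
# to mimic an unnormalized backward transform.
unnormalized = np.fft.irfft(y, n) * n

# The kernel writes even and odd samples to separate buffers.
x0 = unnormalized[0::2]
x1 = unnormalized[1::2]

# Interleave the two halves again and normalize.
restored = np.c_[x0, x1].flatten() / n
print(np.abs(restored - x).max())
```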

## Forward transform

The forward transform should now be easy.

In [62]:
codelet = run(single_stage_r2c(cfg, 16, direction='f'))
prog = cl.Program(ctx, codelet).build()
queue = cl.CommandQueue(ctx)

In [63]:
mf = cl.mem_flags
x = np.arange(16, dtype='float32')
x_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_g = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes + 8)
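The `x.nbytes + 8` above follows from the size of a real-to-complex result: 16 real inputs yield 16/2 + 1 = 9 complex outputs. A short check of that arithmetic, using only NumPy:

```python
import numpy as np

n = 16
x = np.arange(n, dtype='float32')

n_complex = n // 2 + 1                                  # 9 output bins
nbytes_out = n_complex * np.dtype('complex64').itemsize # 9 * 8 bytes

print(nbytes_out, x.nbytes + 8)
```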

In [64]:
prog.r2cf_16(queue, (1,), None, x_g, x_g[4:], y_g, y_g[4:],
             np.int32(2), np.int32(2), np.int32(2),
             np.int32(1), np.int32(1), np.int32(1))

<pyopencl._cl.Event at 0x7fa598fe5410>

In [65]:
y = np.zeros(9, dtype='complex64')
cl.enqueue_copy(queue, y, y_g)
np.fft.irfft(y)

array([ 0.        ,  1.00000002,  1.99999978,  3.0000002 ,  3.99999976,
        5.0000003 ,  6.00000002,  7.00000012,  8.        ,  8.99999988,
        9.99999998, 10.9999997 , 12.00000024, 12.9999998 , 14.00000022,
       14.99999998])

In [66]:
np.abs(np.fft.rfft(x) - y).max()

1.7893790626999362e-06

Whoohoo!

## Staged half-complex transform

In [14]:
from genfft.opencl import multi_stage_hc2hc

In [15]:
codelet = run(multi_stage_hc2hc(cfg, 4, 4))
prog = cl.Program(ctx, codelet).build()
queue = cl.CommandQueue(ctx)

This is a bit trickier. Say we want to do a 4x4 -> 16 transform. With a full complex transform this would be:

In [16]:
from genfft.fft import make_twiddle

tw = make_twiddle(4,4).conj()
x = np.arange(16)
a = np.fft.fft(x.reshape(4,4).T)
b = np.fft.fft((a * tw).T)
print(b.T.flatten())

[120. +0.j          -8.+40.21871519j  -8.+19.31370831j  -8.+11.97284555j
  -8. +8.j          -8. +5.34542894j  -8. +3.31370831j  -8. +1.59129858j
  -8. +0.j          -8. -1.59129858j  -8. -3.31370831j  -8. -5.34542894j
  -8. -8.j          -8.-11.97284555j  -8.-19.31370831j  -8.-40.21871519j]
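The two-stage computation above can be checked against a direct FFT. The definition of `make_twiddle` is not shown here, so this sketch builds its own twiddle matrix, assuming the conjugated factors are exp(-2πi·j·k/16); that assumption is consistent with the printed output, which matches the DFT of `arange(16)`.

```python
import numpy as np

n1 = n2 = 4
n = n1 * n2
x = np.arange(n)

# Assumed stand-in for make_twiddle(4, 4).conj(): tw[j, k] = exp(-2*pi*i*j*k/16)
tw = np.exp(-2j * np.pi * np.outer(np.arange(n2), np.arange(n1)) / n)

a = np.fft.fft(x.reshape(n1, n2).T)   # first stage: 4 transforms of length 4
b = np.fft.fft((a * tw).T)            # twiddle, transpose, second stage
staged = b.T.flatten()

print(np.abs(staged - np.fft.fft(x)).max())
```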


We did twice the work needed: for real input, the latter half of the output is the complex conjugate of the first half (in reverse order),

$$F(y)(k) = F(y)^*(-k).$$
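A quick numerical check of this symmetry for the real input used here (negative indices are taken modulo 16):

```python
import numpy as np

n = 16
x = np.arange(n)
X = np.fft.fft(x)

k = np.arange(n)
mirrored = np.conj(X[(-k) % n])     # F*(−k), with indices wrapped modulo n

print(np.allclose(X, mirrored))
# The real-input transform therefore only needs bins 0..n/2.
print(np.allclose(np.fft.rfft(x), X[: n // 2 + 1]))
```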

In [28]:
x = np.arange(16).reshape(4,4)

In [29]:
a = np.fft.rfft(x.T)
a

array([[24.+0.j, -8.+8.j, -8.+0.j],
       [28.+0.j, -8.+8.j, -8.+0.j],
       [32.+0.j, -8.+8.j, -8.+0.j],
       [36.+0.j, -8.+8.j, -8.+0.j]])

In [30]:
b = np.fft.fft((a * tw[:,:3]).T)

Now, on the columns we need to do a **full** complex transform. Also, we need to figure out what to do with the `n+1` column. Compare this with the full-complex version:

In [31]:
b.T

array([[120. +0.j        ,  -8.+40.21871519j,  -8.+19.31370831j],
       [ -8. +8.j        ,  -8. +5.34542894j,  -8. +3.31370831j],
       [ -8. +0.j        ,  -8. -1.59129858j,  -8. -3.31370831j],
       [ -8. -8.j        ,  -8.-11.97284555j,  -8.-19.31370831j]])
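For comparison, the 12 values computed this way already determine the whole spectrum: entry `[k2, k1]` of `b.T` is bin `k1 + 4*k2`, and the four bins that are not computed (3, 7, 11, 15) follow from conjugate symmetry. The sketch below is one possible way to see this, not necessarily how the generated kernels handle it; the twiddle matrix is again an assumed stand-in for `make_twiddle(4, 4).conj()`.

```python
import numpy as np

n1 = n2 = 4
n = n1 * n2
x = np.arange(n, dtype=float)

# Assumed twiddle factors exp(-2*pi*i*j*k/16)
tw = np.exp(-2j * np.pi * np.outer(np.arange(n2), np.arange(n1)) / n)

a = np.fft.rfft(x.reshape(n1, n2).T)      # a[j, k1], k1 = 0..2
b = np.fft.fft((a * tw[:, :3]).T)         # b[k1, k2]
partial = b.T                             # partial[k2, k1] = X[k1 + 4*k2]

# Scatter the 12 computed bins into a length-16 spectrum.
full = np.full(n, np.nan, dtype=complex)
k2, k1 = np.indices(partial.shape)
full[k1 + n1 * k2] = partial

# Fill the remaining bins from X[k] = conj(X[n - k]).
idx = np.flatnonzero(np.isnan(full.real))
full[idx] = np.conj(full[(n - idx) % n])

print(idx, np.allclose(full, np.fft.fft(x)))
```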

The way around this may be decimation in time.

In [None]:
a = np.fft.fft(x)

## Integer transform

In [1]:
from genfft.opencl import multi_stage_int, int_macros, run, default_config
from copy import copy
cfg = copy(default_config)
codelet = run(multi_stage_int(cfg, 4, 4))

In [2]:
int_macros

{'R': 'int16',
 'E': 'int32',
 'stride': 'int',
 'INT': 'int',
 'K(x)': '((E) x)',
 'DK(name,value)': 'const E name = K(value)',
 'WS(s,i)': 's*i',
 'MAKE_VOLATILE_STRIDE(x,y)': '0',
 'FMA(a,b,c)': 'a * b + c',
 'FMS(a,b,c)': 'a * b - c',
 'FNMA(a,b,c)': '-a * b - c',
 'FNMS(a,b,c)': '-a * b + c'}
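With `R` and `E` mapped to integer types, the twiddle factors have to be stored in fixed point. The `twiddle_4_4` table in the generated code printed by the next cell is consistent with exp(+2πi·j·k/16) for k = 0..3 and j = 1..3, scaled by 1023 and rounded, with real and imaginary parts interleaved. The scale factor and the rounding rule are inferred from the printed values, not taken from the `genfft` source. A sketch that reproduces the table under those assumptions:

```python
import numpy as np

n1 = n2 = 4
scale = 1023                              # inferred from the printed table

k = np.arange(n1)[:, None]                # 0..3
j = np.arange(1, n2)[None, :]             # 1..3 (the trivial j = 0 is omitted)
w = np.exp(2j * np.pi * j * k / (n1 * n2))

# Interleave real and imaginary parts and round to the nearest integer.
table = np.rint(scale * np.stack([w.real, w.imag], axis=-1))
table = table.astype(np.int16).ravel()

print(table.tolist())
```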

In [3]:
print(codelet)

#define R int16
#define E int32
#define stride int
#define INT int
#define K(x) ((E) x)
#define DK(name,value) const E name = K(value)
#define WS(s,i) s*i
#define MAKE_VOLATILE_STRIDE(x,y) 0
#define FMA(a,b,c) a * b + c
#define FMS(a,b,c) a * b - c
#define FNMA(a,b,c) -a * b - c
#define FNMS(a,b,c) -a * b + c

__constant int16 twiddle_4_4[24] = {
  1023, 0, 1023, 0, 1023, 0, 945, 391, 723, 723, 391, 945, 723, 723, 0, 1023,
    -723, 723, 391, 945, -723, 723, -945, -391
};


/* Generated by: ../genfft/gen_notw.native -name notw4 -opencl -n 4 -compact -standalone */

/*
 * This function contains 16 FP additions, 0 FP multiplications,
 * (or, 16 additions, 0 multiplications, 0 fused multiply/add),
 * 13 stack variables, 0 constants, and 16 memory accesses
 */
__kernel void
notw4 (__global const R * ri, __global const R * ii, __global R * ro,
       __global R * io, stride is, stride os, INT v, INT ivs, INT ovs)
{
  {
    INT i;
    for (i = v; i > 0;
         i = i - 1, ri = ri + ivs, ii 