8x8 DCT

JulianKemmerer edited this page May 30, 2022 · 11 revisions

I wanted to do this amateur 'copy and paste some DCT algorithm' example to show what PipelineC does at its core, when used at the most basic level.

You can often trade off latency vs. resource usage. As written, the code below is heavily parallelized and uses quite a bit of resources.

// 8x8 DCT
// https://www.geeksforgeeks.org/discrete-cosine-transform-algorithm-program/
#define DCT_PI 3.14159265359
#define DCT_M 8
#define DCT_N 8
#define dct_iter_t uint3_t // 0-7
#define dct_pixel_t uint8_t // 0-255
typedef struct dct_t
{
	float matrix[DCT_M][DCT_N];
} dct_t;
  
// Function to find discrete cosine transform and return it 
dct_t dctTransform(dct_pixel_t matrix[DCT_M][DCT_N]) 
{ 
    dct_iter_t i;
    dct_iter_t j;
    dct_iter_t k;
    dct_iter_t l;
  
    // dct will store the discrete cosine transform 
    dct_t dct;
  
    float ci;
    float cj;
    float dct1;
    float sum; 
  
    for (i = 0; i < DCT_M; i=i+1) { 
        for (j = 0; j < DCT_N; j=j+1) { 
  
            // ci and cj depends on frequency as well as 
            // number of row and columns of specified matrix 
            if (i == 0) 
                ci = 1.0 / sqrt(DCT_M); 
            else
                ci = sqrt(2) / sqrt(DCT_M);
            if (j == 0) 
                cj = 1.0 / sqrt(DCT_N);
            else
                cj = sqrt(2) / sqrt(DCT_N); 
  
            // sum will temporarily store the sum of  
            // cosine signals 
            sum = 0.0; 
            for (k = 0; k < DCT_M; k=k+1) { 
                for (l = 0; l < DCT_N; l=l+1) { 
                    dct1 = (float)matrix[k][l] *  // Float * constant       
                             ( cos((float)((2 * k + 1) * i) * DCT_PI / (float)(2 * DCT_M)) *  
                               cos((float)((2 * l + 1) * j) * DCT_PI / (float)(2 * DCT_N))
                             );
                    sum = sum + dct1;
                }
            } 
            dct.matrix[i][j] = (ci * cj) * sum;
        } 
    }
    
    return dct; 
}

What does this synthesize to?

A lot of the above code becomes constant once the loops are unrolled. So all that really remains, in time order, is:

  • 4096 multipliers in parallel: The innermost 'l' loop body runs O(n^4) = 8^4 = 4096 times, so there are 4096 multipliers. 'dct1' in each iteration does not depend on any previous iteration, so all the multiplies occur in parallel.

  • 64x64=4096 adder array: Accumulating into 'sum' inside the 'l' loop creates a sequential chain of 8 adders. The surrounding 'k' loop then extends that chain 8 more times, for a total of 8x8=64 adders in series. The 'i' and 'j' loops duplicate that 64-adder chain in parallel 8x8=64 times, because 'sum' is reset to 0.0 inside the 'j' loop, which breaks the sequential accumulation between (i, j) iterations. The result is a 64x64 adder array.

  • 64 multipliers in parallel: The multiply at the end of the 'j' loop can occur in parallel in each iteration. 'dct.matrix[i][j]' does not depend on any previous iterations of the 'i' or 'j' loop.

All of that is fully pipelined, matrix in, matrix out, each clock cycle with some N>=0 clock latency.

I would love to give you some results (MHz, latency in clocks, resource usage), but my desktop PC could not synthesize this behemoth. I gave up after ~16 hours.

Results will be more useful for what I am currently working on: the opposite end of the spectrum in terms of latency and resource usage. Use fewer resources by unrolling the execution over time.


This opposite-end-of-the-spectrum example uses far fewer resources at the cost of much higher latency.

The below code hides the logic for block RAMs in the function 'dct_lookup_increment'. The entire code is here. You may want to read about volatile state variables too.

// This is the unrolled version of the original dct copy-and-pasted algorithm
// https://www.geeksforgeeks.org/discrete-cosine-transform-algorithm-program/
// PipelineC iterations of dctTransformUnrolled are used
// to unroll the calculation serially in O(n^4) time

// Input 'matrix' and start=1 to begin calculation
// Input 'matrix' must stay constant until return .done

// 'sum' accumulates over iterations/clocks and should be pipelined
// So 'sum' must be a volatile stateful variable
// Keep track of when sum is valid and can be read+written
volatile uint1_t dct_volatiles_valid;
// sum will temporarily store the sum of cosine signals
volatile float dct_sum;
// dct_result will store the discrete cosine transform
// Signal that this is the iteration containing the 'done' result 
typedef struct dct_done_t
{
	float matrix[DCT_M][DCT_N];
	uint1_t done;
} dct_done_t;
volatile dct_done_t dct_result;
dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start)
{
	// Assume not done yet
	dct_result.done = 0;
	
	// Start validates volatiles
	if(start)
	{
		dct_volatiles_valid = 1;
	}
	
	// Stateful func to handle getting to BRAM
	//     1) Lookup constants from BRAM (using iterators)
	//     2) Increment iterators
	// Returns next iterators and constants and will increment when requested
	dct_lookup_increment_t lookup_increment;
	uint1_t do_increment;
	// Only increment when volatiles valid
	do_increment = dct_volatiles_valid;
	lookup_increment = dct_lookup_increment(do_increment);
	
	// Unpack struct for ease of reading calculation code below
	float const_val;
	const_val = lookup_increment.lookup.const_val;
	float cos_val;
	cos_val = lookup_increment.lookup.cos_val;
	dct_iter_t i;
	i = lookup_increment.incrementer.curr_iters.i;
	dct_iter_t j;
	j = lookup_increment.incrementer.curr_iters.j;
	dct_iter_t k;
	k = lookup_increment.incrementer.curr_iters.k;
	dct_iter_t l;
	l = lookup_increment.incrementer.curr_iters.l;
	uint1_t reset_k;
	reset_k = lookup_increment.incrementer.increment.reset_k;
	uint1_t reset_l;
	reset_l = lookup_increment.incrementer.increment.reset_l;
	uint1_t done;
	done = lookup_increment.incrementer.increment.done;
	
	
	// Do math for this volatile iteration only when
	// can safely read+write volatiles
	if(dct_volatiles_valid)
	{
		// ~~~ The primary calculation ~~~:
		// 1) Float * cosine constant from lookup table  
		float dct1;
		dct1 = (float)matrix[k][l] * cos_val;
		// 2) Increment sum
		dct_sum = dct_sum + dct1;
		// 3) constant * Float and assign into the output matrix
		dct_result.matrix[i][j] = const_val * dct_sum;
		
		// Sum accumulates during the k and l loops
		// So reset when they are rolling over
		if(reset_k & reset_l)
		{
			dct_sum = 0.0;
		}
		
		// Done yet?
		dct_result.done = done;
		
		// Reset volatiles once done
		if(done)
		{
			dct_volatiles_valid = 0;
		}
	}
	
	return dct_result;
}

What does this synthesize to?

Essentially a state machine where each state uses the same N clocks' worth of logic (the body of dctTransformUnrolled) to do work.

Consider the 'execution' of the function in time order. The logic consists of:

  • ~17% of time for getting lookup constants & incrementing the iterators (dct_lookup_increment), reading the [k][l] value out of input 'matrix'
  • ~21% of time for the 1) Float * cosine constant from lookup table, a floating point multiplier
  • ~34% of time for the 2) Increment sum addition, a floating point adder
  • ~21% of time for the 3) constant * Float, a floating point multiplier
  • ~5% of time for the 3) assignment into the output matrix at [i][j]

That pipeline takes some fixed number of clock cycles N. That means every N clock cycles 'dct_volatiles_valid' will =1 (after being set at the start). The algorithm unrolls as O(n^4) for 4096 total iterations. So the total latency in clock cycles is N * 4096.

(You'll notice that no DSPs are used. For now all floating point operations are implemented in fabric.)

Here are results for various operating frequencies:

Device: xc7a35ticsg324-1L

| Pipeline Latency N (clocks) | Freq (MHz) | Total Latency (us) | CARRY4 | Total LUTs | Registers | Total MUXs | SRL16E |
|---|---|---|---|---|---|---|---|
| 1 | 24.54 | 166.91 | 250 | 6181 | 4863 | 96 | 0 |
| 2 | 44.97 | 182.16 | 253 | 4069 | 7076 | 96 | 0 |
| 3 | 67.95 | 180.84 | 253 | 4052 | 9060 | 96 | 65 |
| 4 | 81.19 | 201.80 | 255 | 4063 | 7118 | 96 | 2090 |
| 5 | 94.61 | 216.47 | 257 | 4051 | 7305 | 96 | 2122 |
| 6 | 112.98 | 217.52 | 257 | 4025 | 5671 | 96 | 2122 |
| 7 | 124.02 | 231.18 | 257 | 4103 | 5242 | 96 | 2154 |
| 8 | 138.08 | 237.31 | 257 | 4055 | 5714 | 64 | 2123 |
| 9 | 152.00 | 242.53 | 257 | 4045 | 6132 | 96 | 2171 |
| 10 | 144.89 | 282.71 | 285 | 4147 | 5751 | 24 | 2163 |