A taxonomy of LUT features
It's been a while since I last wrote something for my blog!
That's kind of because most of my good ideas for blog posts became documentation.
This is something I'm writing for myself; I want to have some kind of document defining some of the terms I use informally to describe what a LUT can do.
Conventional LUTs
When you crack open a textbook and it tells you that FPGAs use LUTs, the LUT it describes will usually look like this:

Four inputs, one output, no frills.
Think Lattice iCE40, Altera Cyclone IV, or even the NanoXplore NG-Ultra.
Why does the "conventional" LUT have four inputs?
A LUT-K is internally a small memory of 2^K bits, where the K inputs are the address lines for the read port.
That means that increasing K by one roughly doubles the physical size of the LUT.
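As a minimal sketch of that view (the helper name make_lut and the bit ordering are illustrative assumptions, not taken from any vendor tool), a LUT can be modelled in Python as an integer holding 2^K bits that is indexed by its inputs:

```python
def make_lut(k, init):
    """Model a K-input LUT as a 2**k-bit memory; `init` holds the contents."""
    assert 0 <= init < (1 << (1 << k)), "init must fit in 2**k bits"

    def lut(*inputs):
        assert len(inputs) == k
        # The inputs form the read address; input 0 is the least significant bit.
        addr = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return (init >> addr) & 1

    return lut


# A LUT4 programmed as a 4-input AND: only address 0b1111 holds a 1.
and4 = make_lut(4, 1 << 15)
# A LUT4 programmed as a 4-input XOR (parity of the address bits).
xor4 = make_lut(4, 0x6996)

# Memory size in bits for each K: it doubles with every extra input.
lut_bits = {k: 1 << k for k in range(2, 9)}
```

Any K-input function fits in this model; only the contents of init change.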
One can think of the critical-path delay of a design (which sets its maximum frequency) as the sum of:
- logic delay, determined by how fast a LUT is.
- routing delay, determined by how many LUTs are in the critical path.
Increasing the number of inputs to a LUT increases logic delay, because of the number of muxes the LUT inputs have to go through.
However, increasing the number of inputs to a LUT decreases routing delay, because fewer LUTs are required for the critical path.
It's also important to consider that one cannot always fill a LUT; a design might not have enough terms between flops to fill very large LUTs, or the LUT mapper maps the design inefficiently.
Some experiments
I'm going to use my chess move generator to provide numbers; since it is a heavily combinational design without RAMs, it's probably maximally illustrative of the differences between LUT sizes.
My Yosys command line here is:
yosys -p "synth -lut K -abc9 -flatten; stat -width; ltp -noff" *.sv
I modified synth to remove the -fast flag from the abc call, because I think that provides more realistic numbers.
With this in mind, you can map a design to various LUT sizes, and compute the relative area (number of LUTs * 2^K) of an FPGA, as well as the number of logic levels of the critical path.
These can be plotted against K (the number of LUT inputs) to produce a graph:

One can also consider the "efficiency" of an FPGA by multiplying the relative area and logic levels together to form an approximation of the delay-area product.
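The two metrics are simple enough to write down directly. The sketch below uses made-up LUT counts and logic levels purely for illustration; they are not the measurements behind the graphs, and the function names are hypothetical.

```python
def relative_area(num_luts, k):
    """Relative area of a mapping: number of LUTs times 2**k bits per LUT."""
    return num_luts * (1 << k)


def delay_area_product(num_luts, k, logic_levels):
    """Approximate delay-area product: relative area times logic levels."""
    return relative_area(num_luts, k) * logic_levels


# Hypothetical (num_luts, logic_levels) per K, for illustration only.
example = {4: (1000, 10), 6: (600, 6)}
for k, (num_luts, levels) in example.items():
    print(k, relative_area(num_luts, k), delay_area_product(num_luts, k, levels))
```

The LUT count would come from the stat output and the logic levels from the ltp output of the Yosys run.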
This can also be plotted against K to produce a graph:

These graphs together show that while larger LUTs do reduce logic levels, they have diminishing returns, while the relative area increases exponentially.
As a result, a LUT4 architecture is the best compromise between delay and area.
So that's it? Why would anybody use something else?
Because a LUT4 is the best compromise under the assumption that we can't do any trickery.
Vendors have plenty of trickery available to them.
Fusible LUTs
From the area and logic levels graph, we can see that while the logic levels plateau above 6, the number of logic levels still halves between LUT4 and LUT6.
The issue with large LUT sizes is that they are expensive when only partially filled, so why not use a smaller, more efficient LUT size and then multiplex those together?
That way you're only paying for the large LUT size when you can fill it.

This is a signature of Lattice (other than the iCE40 that they bought SiliconBlue for) and Xilinx architectures.
We can model the effects of this by mapping a design for various K given a base LUT size (I'm using LUT4), and treating the area of a LUT-K where K > 4 as LUT4 * 2^(K - 4).
The abc and abc9 commands in Yosys describe this as -lut 4:K, so fusing LUT4s to make a LUT6 would be 4:6.
(Using this syntax required more synth modification...)
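A rough sketch of this area model in Python follows. It assumes the mapper reports how many LUTs of each size it used, and that any LUT of four or fewer inputs still occupies a whole LUT4. The histogram values are hypothetical, not results from the chess move generator.

```python
def fused_area(lut_histogram, base=4):
    """Relative area when a LUT-K (K > base) is built from 2**(K - base) base LUTs."""
    total = 0
    for k, count in lut_histogram.items():
        # A LUT no larger than the base size occupies exactly one base LUT.
        base_luts = 1 << max(k - base, 0)
        total += count * base_luts * (1 << base)
    return total


# Hypothetical result of mapping with -lut 4:6 (LUT size -> count).
histogram = {2: 10, 4: 50, 5: 20, 6: 5}
print(fused_area(histogram))
```

Under this model a fused LUT6 costs the same as a conventional LUT6, but a function that only needs four inputs costs a quarter of that.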

Here we can see that fusing LUT4s up to LUT7s produces relatively low area overhead.
In exchange for this:

The delay-area product approximation decreases significantly past LUT4, because the decrease in logic levels pays for the extra LUTs.
Fracturable LUTs
Fusible LUTs have a major flaw, however.
While each LUT is individually smaller, each LUT input has its own multiplexer to source the signal.
By the time we reach LUT8, we are using 16*4 + 8 + 4 + 2 + 1 = 79 input multiplexers to source 8 unique signals.
This is, itself, a distinct area overhead.
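The count can be sketched as a small function. It assumes a fused LUT-K is built from 2^(K - 4) LUT4s combined by a binary tree of two-input muxes, and that every LUT4 input and every mux select line has its own input multiplexer. The function name is hypothetical.

```python
def fused_input_muxes(k, base=4):
    """Input multiplexers needed to build one LUT-K out of fused base-size LUTs."""
    base_luts = 1 << (k - base)
    lut_inputs = base_luts * base
    # A binary tree joining n leaves uses n - 1 two-input muxes, one select each.
    mux_selects = base_luts - 1
    return lut_inputs + mux_selects


for k in range(4, 9):
    print(k, fused_input_muxes(k))
```

Under these assumptions, eight LUT4s plus their 4 + 2 + 1 muxes need 39 input multiplexers, and the count roughly doubles with each extra input while the number of unique signals grows by only one.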
So another approach is to start off with a (larger) LUT, and then, by having a second output, pack two smaller LUTs into the larger LUT.
This mitigates the area overhead of partially-filled large LUTs.
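As an illustration of the second-output idea, here is a sketch loosely modelled on the O5/O6 outputs of Xilinx's LUT6_2 primitive. O6 reads the full 64-bit table and O5 reads only the lower 32 bits, using the first five inputs. The function names and the packing helper are assumptions made for this example.

```python
def fracturable_lut6(init, inputs):
    """Dual-output LUT6 sketch: returns (o6, o5) for a 64-bit table `init`."""
    assert len(inputs) == 6 and 0 <= init < (1 << 64)
    addr5 = sum((b & 1) << i for i, b in enumerate(inputs[:5]))
    addr6 = addr5 | ((inputs[5] & 1) << 5)
    o6 = (init >> addr6) & 1  # full 6-input lookup
    o5 = (init >> addr5) & 1  # lower half only, ignores input 5
    return o6, o5


def pack_two_lut5(init_a, init_b):
    """Pack two 5-input tables sharing inputs: a is read on O5, b on O6 when input 5 is 1."""
    return (init_b << 32) | init_a


and5 = 1 << 31                                                   # 5-input AND
xor5 = sum((bin(a).count("1") & 1) << a for a in range(32))      # 5-input parity
packed = pack_two_lut5(and5, xor5)
print(fracturable_lut6(packed, (1, 1, 1, 1, 1, 1)))
```

With input 5 tied high, one physical LUT6 yields both a 5-input AND and a 5-input XOR of the same signals.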
Modern Altera and Xilinx FPGAs use fracturable LUT6s, but Xilinx LUT6s look like this:

and Altera LUT6s look like this:

(Honestly, whenever you see a LUT6 architecture, you might as well assume it's fracturable.)
Unfortunately, to give illustrative numbers I have to do some hand-waving: it's not always possible to fit two smaller LUTs into a larger LUT.
However, I am going to assume the best case, where this is always possible, because modelling LUT packing properly is challenging.
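One possible reading of this best-case assumption is sketched below. The pairing rule (any two LUTs smaller than K share one physical LUT-K) is an assumption made for illustration, not necessarily the model behind the charts, and the histogram is hypothetical.

```python
def best_case_fracturable_luts(lut_histogram, k):
    """Physical LUT-Ks needed if every pair of smaller LUTs packs perfectly."""
    full = sum(n for size, n in lut_histogram.items() if size == k)
    small = sum(n for size, n in lut_histogram.items() if size < k)
    # Round up: an odd leftover small LUT still occupies one physical LUT.
    return full + (small + 1) // 2


# Hypothetical mapping result (LUT size -> count), for illustration only.
histogram = {3: 5, 5: 10, 6: 20}
physical = best_case_fracturable_luts(histogram, 6)
print(physical, physical * (1 << 6))
```

A real packer would also have to check that the paired LUTs share enough inputs, which is what this best case ignores.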

The chart roughly suggests that having fracturable LUTs means we can increase K by one while keeping the same die area.

We can achieve a better delay-area product with fracturable LUT6s than conventional LUT4s, even.
You could also read this graph to suggest that fracturable LUT4s are even better than fracturable LUT6s, but I haven't seen any fracturable LUT4s in the wild.
LUT structures
Here's another idea: a LUT6 can represent all 2^64 possible functions, but a lot of these are permutations or negations of each other.
To give an example, A & B, ~A & B, A & ~B, ~(A & B) and so on are all very similar, and you could transform one of these into another.
So the actual number of 6-input functions is much lower when you consider this.
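This kind of equivalence is usually called NPN equivalence (input negation, input permutation, output negation). A brute-force sketch in Python can count the classes for small input counts; it is far too slow for 6 inputs and is only meant to illustrate the idea. The function names are hypothetical.

```python
from itertools import permutations


def npn_canonical(tt, n):
    """Smallest truth table reachable by permuting/negating inputs and negating the output."""
    size = 1 << n
    mask = (1 << size) - 1
    best = None
    for perm in permutations(range(n)):
        for neg in range(1 << n):  # bit i of neg: negate input i
            out = 0
            for addr in range(size):
                # Address seen by the original function after negation and permutation.
                src = 0
                for i in range(n):
                    bit = ((addr >> i) & 1) ^ ((neg >> i) & 1)
                    src |= bit << perm[i]
                out |= ((tt >> src) & 1) << addr
            for cand in (out, out ^ mask):  # without and with output negation
                if best is None or cand < best:
                    best = cand
    return best


def count_npn_classes(n):
    """Number of distinct n-input functions up to NPN equivalence."""
    return len({npn_canonical(tt, n) for tt in range(1 << (1 << n))})


print(count_npn_classes(2), count_npn_classes(3))
```

For 2 inputs this gives 4 classes out of 16 functions, and for 3 inputs 14 out of 256, so the reduction grows quickly with the number of inputs.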
What if you tried to break down a 6-input function x(a, b, c, d, e, f) into x(a, b, c, y(d, e, f)) or similar?
Then you would have a LUT6 for half the area cost.
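Whether a given function splits this way can be tested with the classic cofactor-counting argument. Assume y takes exactly the three inputs d, e, f. Then the function decomposes if and only if fixing (d, e, f) in all eight ways leaves at most two distinct functions of (a, b, c). The sketch below checks only this one input split; the bit ordering and function name are assumptions.

```python
def decomposes_3_plus_3(tt):
    """True if the 64-bit truth table `tt` can be written as x(a, b, c, y(d, e, f))."""
    # Inputs a, b, c are address bits 0..2; d, e, f are address bits 3..5.
    cofactors = set()
    for bound in range(8):  # every assignment of (d, e, f)
        # The 8-bit function of (a, b, c) that remains for this assignment.
        cofactors.add((tt >> (bound * 8)) & 0xFF)
    # y only needs to tell the two cofactors apart; x then selects between them.
    return len(cofactors) <= 2


and6 = 1 << 63
xor6 = sum((bin(a).count("1") & 1) << a for a in range(64))
# 3-bit equality comparator: 1 when (a, b, c) == (d, e, f).
eq3 = sum(1 << (v | (v << 3)) for v in range(8))

print(decomposes_3_plus_3(and6), decomposes_3_plus_3(xor6), decomposes_3_plus_3(eq3))
```

AND and XOR split cleanly, while the equality comparator has eight distinct cofactors under this split and does not. A mapper would also try other ways of dividing the inputs before giving up.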
This leads to the idea of "LUT structures", where a LUT has a direct feed from its output to the input of another LUT.
ABC calls this an "S44", so I'll use that term.
You can even have two LUTs, with their outputs feeding into the inputs of a third.
This is an "S444" in ABC terminology.

Here's an S44; the S444 is pretty similar.
The only FPGA I know of which uses this is the Hercules Micro HME-M7.
(It doesn't seem to be in production anymore.)
Again, I must handwave packing here; although LUT structures aren't too painful, it can be a bit rough to use a LUT4 as nothing but a passthrough.
Further, due to tooling limitations, I have polyfilled in the LUT4 and LUT5 numbers from the fusible-LUT section; ABC doesn't support mapping anything below LUT6 to an S44.

That graph is almost flat.
Even though the LUT structure can't express the complete set of functions above LUT4, it can express enough of them to pay for the extra LUTs used, right up to finding a useful set of LUT10s to map to an S444.

And so, the delay-area product is flat, because an S44 can achieve the logic levels of a LUT6, and an S444 can achieve the logic levels of a LUT8, while being significantly less expensive.
There is undoubtedly madness in the method, but the results speak for themselves.