07/10/2025: a taxonomy of LUT features

# A taxonomy of LUT features

> It's been a while since I last wrote something for my blog!
> That's kind of because most of my good ideas for blog posts [became](https://yosyshq.readthedocs.io/projects/yosys/en/latest/using_yosys/synthesis/abc.html) [documentation](https://yosyshq.readthedocs.io/projects/yosys/en/latest/using_yosys/synthesis/memory.html#supported-memory-patterns).

This is something I'm writing for myself; I want to have some kind of document defining some of the terms I use informally to describe what a LUT can do.

## Conventional LUTs

When you crack open a textbook and it tells you that FPGAs use LUTs, the LUT it shows will usually look like this:

<img width="285" height="125" alt="A rectangle labelled LUT4, with four inputs numbered I0-I3 and an output labelled O" src="https://github.com/user-attachments/assets/9603e8a7-9b22-4c0a-bd7a-05240f9667ac" />

Four inputs, one output, no frills.
Think Lattice iCE40, Altera Cyclone IV, or even the NanoXplore NG-Ultra.

### Why does the "conventional" LUT have four inputs?

A LUT-K is internally a small memory of 2^K bits, where the K inputs are the address lines for the read port.
That means that increasing K by one roughly doubles the physical size of the LUT.
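
To make the "small memory" view concrete, here is a minimal Python sketch. The names `make_lut` and `lut_read` are mine, and treating I0 as the least significant address bit is an assumption for illustration, not any vendor's convention.

```python
def make_lut(k, func):
    """Build the 2**k-bit memory contents for a k-input boolean function."""
    init = 0
    for addr in range(1 << k):
        # Bit i of the address is the value of input i (assumed ordering).
        bits = [(addr >> i) & 1 for i in range(k)]
        if func(*bits):
            init |= 1 << addr
    return init


def lut_read(init, inputs):
    """Read the memory: the LUT inputs form the read address."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (init >> addr) & 1


# A LUT4 programmed as a 4-input XOR occupies 2**4 = 16 bits.
xor4 = make_lut(4, lambda a, b, c, d: a ^ b ^ c ^ d)
```

Going from `make_lut(4, ...)` to `make_lut(5, ...)` doubles the number of addresses, which is the doubling in physical size described above.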

One can think of the critical-path delay of a design (which sets its maximum frequency) as the sum of:
- logic delay, determined by how fast a LUT is.
- routing delay, determined by how many LUTs are in the critical path.

Increasing the number of inputs to a LUT increases logic delay, because of the number of muxes the LUT inputs have to go through.
However, increasing the number of inputs to a LUT decreases routing delay, because fewer LUTs are required for the critical path.

It's also important to consider that one cannot always fill a LUT; a design might not have enough terms between flops to fill very large LUTs, or the LUT mapper might map the design inefficiently.

### Some experiments

> I'm going to use my [chess move generator](https://github.com/Ravenslofty/tt07-chess) to provide numbers, as being a heavily-combinational design without RAMs, it's probably maximally illustrative of the differences between LUT sizes.
> My Yosys command line here is `yosys -p "synth -lut K -abc9 -flatten; stat -width; ltp -noff" *.sv`.
> I modified `synth` to remove the `-fast` flag from the abc call, because I think that provides more realistic numbers.

With this in mind, you can map a design to various LUT sizes, and compute the relative area (number of LUTs * 2^K) of an FPGA, as well as the number of logic levels of the critical path.
These can be plotted against K (the number of LUT inputs) to produce a graph:

<img width="927" height="577" alt="A graph with two curves; a blue line represents total area and is exponentially increasing with LUT input count, while a red line represents logic levels, and exponentially decreases with LUT input count until plateauing at about 20 levels" src="https://github.com/user-attachments/assets/f2ef4463-7ffe-445d-8ebe-175aa4440527" />

One can also consider the "efficiency" of an FPGA by multiplying the relative area and logic levels together to form an approximation of the delay-area product.
This can also be plotted against K to produce a graph:

<img width="927" height="574" alt="A bowl-shaped graph that decreases until its minimum at about 4 inputs per LUT, before exponentially rising again" src="https://github.com/user-attachments/assets/bfd513d0-66f6-4907-b32b-6b02bae88d04" />

These graphs together show that while larger LUTs do reduce logic levels, they have diminishing returns, while the relative area increases exponentially.
As a result, a LUT4 architecture is the best compromise between delay and area.
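
The two metrics used in this section are simple enough to sketch in Python. The function names are mine, and the inputs are whatever LUT count and logic-level count your own run reports; the numbers in the usage lines are made-up placeholders, not my measurements.

```python
def relative_area(num_luts, k):
    """Relative area of a mapping: each LUT-K costs 2**k memory bits."""
    return num_luts * (1 << k)


def delay_area_product(num_luts, k, logic_levels):
    """Approximate delay-area product: relative area times logic levels."""
    return relative_area(num_luts, k) * logic_levels


# Placeholder values purely to show the arithmetic (NOT real results).
print(relative_area(10, 4))          # 10 LUT4s * 16 bits each
print(delay_area_product(10, 4, 3))  # the same area times 3 logic levels
```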

### So that's it? Why would anybody use something else?

Because a LUT4 is the best compromise under the assumption that we can't do any trickery.
Vendors have plenty of trickery available to them.

## Fusible LUTs

From the area and logic levels graph, we can see that while the logic levels plateau above 6, the number of logic levels still halves between LUT4 and LUT6.
The issue with large LUT sizes is that they are expensive when only partially filled, so why not use a smaller, more efficient LUT size and then multiplex those together?
That way you're only paying for the large LUT size when you can fill it.

<img width="524" height="303" alt="two LUT4s, each with fully independent inputs A0-A3/B0-B3 and outputs AOUT and BOUT, but the outputs are also connected to a multiplexer with its select input labelled I4 and its output labelled MUXOUT" src="https://github.com/user-attachments/assets/059460d6-85c1-4e71-9df7-b6f0714e617d" />

This is a signature of Lattice (other than the iCE40 that they bought SiliconBlue for) and Xilinx architectures.

We can model the effects of this by mapping a design for various K given a base LUT size (I'm using LUT4), and treating the area of a LUT-K where K > 4 as LUT4 * 2^(K - 4).
The `abc` and `abc9` commands in Yosys describe this as `-lut 4:K`, so, fusing LUT4s to make a LUT6 would be `4:6`.
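
As a rough sketch of that area model in Python: given a count of mapped LUTs per size, anything at or below the base size costs one LUT4, and a LUT-K above it costs 2^(K - 4) LUT4s. The function name and the dictionary input format are assumptions of mine for illustration.

```python
def fused_area_in_lut4s(lut_size_counts, base=4):
    """Total area, in units of one base LUT, of a fused-LUT mapping.

    lut_size_counts maps a LUT input count K to how many LUT-Ks were used.
    """
    total = 0
    for k, count in lut_size_counts.items():
        # A LUT no bigger than the base still occupies one whole base LUT.
        total += count * (1 << max(k - base, 0))
    return total
```

For example, ten LUT4s plus two LUT6s would cost `fused_area_in_lut4s({4: 10, 6: 2})`, which is 10 + 2 * 4 = 18 LUT4s.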

> (using this syntax required more `synth` modification...)

<img width="850" height="526" alt="A bar chart showing relatively small area overhead of 5% for LUT5, 15% for LUT6, 30% for LUT7, before jumping up to 90% overhead for LUT8" src="https://github.com/user-attachments/assets/61fee766-d3ff-4931-b847-f678a4f583ae" />

Here we can see that fusing LUT4s up to LUT7s produces relatively low area overhead.
In exchange for this:

<img width="932" height="577" alt="The bowl-shaped delay-area product graph from above, which now has a red line that at LUT4 drops into a lower bowl with its minimum at LUT6" src="https://github.com/user-attachments/assets/c6e74f56-2433-4dad-a9b5-a26137462057" />

The delay-area product approximation decreases significantly past LUT4, because the decrease in logic levels pays for the extra LUTs.

## Fracturable LUTs

Fusible LUTs have a major flaw, however.
While each LUT is individually smaller, each LUT input has its own multiplexer to source the signal.
By the time we have fused eight LUT4s into a LUT7, we are using `8*4 + 4 + 2 + 1 = 39` input multiplexers to source 7 unique signals.
This is, itself, a distinct area overhead.
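
The count generalises to any power-of-two number of fused LUT4s. A small Python sketch, assuming the LUT4s are joined by a binary tree of 2:1 muxes and that every LUT input and every mux select has its own input multiplexer (the function name is mine):

```python
def fused_input_muxes(num_lut4s, inputs_per_lut=4):
    """Input multiplexers needed when num_lut4s LUT4s are fused by a mux tree."""
    if num_lut4s < 1 or num_lut4s & (num_lut4s - 1):
        raise ValueError("num_lut4s must be a power of two")
    # A binary tree joining N leaves needs N - 1 two-input muxes,
    # and each mux select line is assumed to have its own input mux.
    tree_muxes = num_lut4s - 1
    return num_lut4s * inputs_per_lut + tree_muxes


# Eight LUT4s and seven muxes: 8*4 + 4 + 2 + 1 = 39.
print(fused_input_muxes(8))
```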

So another approach is to start off with a (larger) LUT, and then, by having a second output, pack two smaller LUTs into the larger LUT.
This mitigates the area overhead of partially-filled large LUTs.

Modern Altera and Xilinx FPGAs use fracturable LUT6s, but Xilinx LUT6s look like this:

<img width="475" height="303" alt="Two LUT5s that share all inputs. The lower LUT5 is connected to an output marked O5. Both LUT5s connect to a multiplexer, whose output is marked O6." src="https://github.com/user-attachments/assets/2649d35a-74fe-48cb-aa99-36cf53f98086" />

and Altera LUT6s look like this:

<img width="527" height="258" alt="Two LUT5s that share four inputs, but each has its own E input. There is a top and a bottom multiplexer, each of which has its data lines connected to the outputs of the LUTs, while they have independent select lines labelled F." src="https://github.com/user-attachments/assets/be278e17-f801-4e40-88df-cfa9a1228c96" />

(Honestly, whenever you see a LUT6 architecture, you might as well assume it's fracturable.)

> Unfortunately, to give illustrative numbers I have to do some hand-waving: it's not always possible to fit two smaller LUTs into a larger LUT.
> However, I am going to assume the best case, that this *is* always possible, because modelling LUT packing properly is challenging.

<img width="941" height="582" alt="Two exponential curves; one in blue represents the conventional LUT area, while a slower-rising curve (maybe two-thirds the height of the blue curve) represents the fracturable LUT area" src="https://github.com/user-attachments/assets/80651f0e-c835-4a28-8a75-bb255683bb72" />

The chart roughly suggests that having fracturable LUTs means we can increase K by one while keeping the same die area.

<img width="937" height="580" alt="Two bowl-shaped curves. One represents the conventional LUT delay-area product with its minimum at LUT4, while a lower, flatter, curve undercuts the conventional LUT." src="https://github.com/user-attachments/assets/0b207ef6-1a2b-4542-8326-a2adb481e465" />

We can achieve a better delay-area product with fracturable LUT6s than conventional LUT4s, even.
You could also read this graph to suggest that fracturable LUT4s are even better than fracturable LUT6s, but I haven't seen any fracturable LUT4s in the wild.

## LUT structures

Here's another idea: a LUT6 can represent all 2^64 possible functions, but a lot of these are permutations or negations of each other.
To give an example, `A & B`, `~A & B`, `A & ~B`, `~(A & B)` and so on are all very similar, and you could transform one of these into another.
So the actual number of 6-input functions is much lower when you consider this.
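
This kind of equivalence (negate inputs, permute inputs, negate the output) is usually called NPN equivalence. Enumerating all 2^64 LUT6 functions is out of reach for a blog post, but a brute-force Python sketch for tiny input counts shows the effect; the function names here are mine.

```python
from itertools import permutations


def npn_canonical(tt, n):
    """Smallest truth table reachable by permuting/negating inputs and negating the output."""
    size = 1 << n
    all_ones = (1 << size) - 1
    best = None
    for perm in permutations(range(n)):
        for neg in range(size):  # bitmask of which inputs are negated
            new = 0
            for idx in range(size):
                flipped = idx ^ neg
                # Route input i of the new function to input perm[i] of the old one.
                src = 0
                for i in range(n):
                    src |= ((flipped >> i) & 1) << perm[i]
                if (tt >> src) & 1:
                    new |= 1 << idx
            for cand in (new, new ^ all_ones):  # with and without output negation
                if best is None or cand < best:
                    best = cand
    return best


def count_npn_classes(n):
    """Number of distinct n-input functions up to NPN equivalence."""
    return len({npn_canonical(tt, n) for tt in range(1 << (1 << n))})


print(count_npn_classes(2))  # 4 classes out of 16 functions
print(count_npn_classes(3))  # 14 classes out of 256 functions
```

This is exhaustive search, so it is only practical for two or three inputs, but the shrinkage from 256 functions to 14 classes already hints at how much redundancy a full LUT6 is paying for.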

What if you tried to break down a 6-input function `x(a, b, c, d, e, f)` into two 4-input functions, `x(a, b, y(c, d, e, f))` or similar?
Then you would have a LUT6 for half the area cost.

This leads to the idea of "LUT structures", where a LUT has a direct feed from its output to the input of another LUT.
ABC calls this an "S44", so I'll use that term.

You can even have two LUTs with their outputs feeding into the inputs of a third.
This is an "S444" in ABC terminology.

<img width="525" height="260" alt="One LUT4 with inputs marked A0-A3, feeds its output to another LUT4, with inputs marked B1-B3" src="https://github.com/user-attachments/assets/c5ae30f0-63e0-4d4f-8e21-293b1dd26a49" />

Here's an S44; the S444 is pretty similar.
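
As a behavioural sketch of the S44 in Python: two 16-bit truth tables, with the first LUT4's output wired to one input of the second. Which input of the second LUT takes the feed (I've used input 0, matching the B1-B3 labelling in the figure) and all the names are assumptions for illustration.

```python
from itertools import product


def lut4(init, i0, i1, i2, i3):
    """Read a 16-bit truth table; i0 is the least significant address bit."""
    addr = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
    return (init >> addr) & 1


def s44(init_a, init_b, a, b):
    """Evaluate an S44: a holds A0-A3, b holds B1-B3 (assumed wiring)."""
    y = lut4(init_a, *a)        # first LUT4
    return lut4(init_b, y, *b)  # its output drives input 0 of the second


AND4 = 0x8000  # only address 0b1111 reads as 1

# A 7-input AND fits in an S44: 32 bits of memory instead of 2**7 = 128.
for bits in product((0, 1), repeat=7):
    assert s44(AND4, AND4, bits[:4], bits[4:]) == int(all(bits))
```

Not every 7-input function decomposes like this, which is exactly the trade the LUT structure makes.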

The only FPGA I know of which uses this is the Hercules Micro HME-M7.
(It doesn't seem to be in production anymore.)

> Again, I must handwave packing here; although LUT structures aren't *too* painful, it can be a bit rough to use a LUT4 as nothing but a passthrough.
> Further, due to tooling limitations, I have polyfilled in the LUT4 and LUT5 numbers from the fusible-LUT section; ABC doesn't support mapping anything below LUT6 to an S44.

<img width="847" height="527" alt="A bar chart, showing relative area increasing very slowly with LUT size" src="https://github.com/user-attachments/assets/137afbd2-c2cf-41ac-b1c3-709cdd9dc76d" />

That graph is almost *flat*.
Even though the LUT structure can't express the complete set of functions above LUT4, it can express *enough* of them to pay for the extra LUTs used, right up to finding a useful set of LUT10s to map to an S444.

<img width="1117" height="691" alt="The bowl-shaped curve representing conventional LUT delay-area product, which at LUT4 drops into a line which is flat between LUT6 and LUT10" src="https://github.com/user-attachments/assets/6eda17dd-23ae-4262-9ddb-680811d77605" />

And so, the delay-area product *is* flat, because an S44 can achieve the logic levels of a LUT6, and an S444 can achieve the logic levels of a LUT8, while being significantly less expensive.

There is undoubtedly madness in the method, but the results speak for themselves.
