Skip to content

Bynaryman/ttsky26a_tensor_loops_flopoco2_mlir

Repository files navigation

Pattern-Guided Arithmetic Optimizations with MLIR & TinyTapeout (ttsky26a)

This repository is a TinyTapeout artifact for a simple but important compiler idea: keep arithmetic decisions visible until they become hardware, then validate that path on a small, real design.

The design is tt_um_lledoux_s3fdp_seqcomb, a streaming wrapper around a generated floating-point accumulation core. It targets open-source silicon through TinyTapeout on Sky130.

The repository is also a compact, reproducible view of a line of work spanning several levels of the hardware/software stack: arithmetic-oriented MLIR design (a multi-level compiler IR), lowering passes in emeraude-mlir (my arithmetic-oriented MLIR compiler repository, not yet open), supporting CIRCT contributions (hardware-oriented compiler infrastructure built on MLIR), and my FloPoCo2 refactoring work toward IR generation for arithmetic circuits, needed to keep floating-point datapaths compiler-visible instead of opaque.

Motivation

MLIR and CIRCT already provide a strong path from high-level programs to structural hardware, especially for control flow, memory, and integer-style datapaths. The weak point is floating-point realization near the circuit boundary. At exactly the stage where ASIC and FPGA flows want explicit wires, operators, and registers, floating-point regions often become opaque, remain trapped in HLS-style abstractions, or are handed off to external generators as black-box blocks.

That loss of visibility matters because this is where the interesting tradeoffs become concrete: area, latency, rounding points, exception handling, and opportunities for fusion or specialization. In other words, floating-point arithmetic is still "just hardware" at this level, but current compiler flows do not always expose it that way.

This repository focuses on that missing step. Concretely, it shows how a small floating-point kernel can be lowered into explicit compiler-visible comb/hw/seq structure before final RTL export, so arithmetic remains available to the compiler as something that can still be inspected, transformed, and specialized.

The motivation is therefore practical. The artifact is meant to show the continuity between several pieces of work that are often separated in practice: arithmetic-oriented IR choices, lowering passes, structural compiler cleanup near the CIRCT boundary, and arithmetic-generator integration. The value of the example is not only that it produces a small chip, but that the intermediate transformations remain visible and reproducible.

In this flow, CIRCT utilities such as map-arith-to-comb are useful, but they are not a complete floating-point structuralization path for the kernels targeted here. This artifact documents the additional lowering step used to close that gap on a compact example.

What Is Implemented Here

This repository brings together work I contributed at several layers:

  • arithmetic-aware lowering in emeraude-mlir, my current arithmetic-oriented MLIR compiler repository, including the path used here to move from arith/scf/memref programs to explicit hardware-oriented structure
  • arithmetic IR work around multi-level compilation, real-number expressions, fixed-point or specialized accumulation, and compiler-visible numeric choices
  • supporting CIRCT-side transformations that make the later structural lowering stages cleaner for this kind of flow
  • FloPoCo2 refactoring work for IR generation, exposed in this development branch, that exports floating-point hardware as an internal combinational graph rather than stopping at standalone HDL emission
  • TinyTapeout integration, wrapper design, and test scaffolding that tie those pieces into a reproducible open artifact

FloPoCo Integration In This Flow

In this repository, FloPoCo is not used only as an external generator. The relevant piece is my FloPoCo2 refactoring work toward IR generation, available in this development branch, which exposes arithmetic as an internal combinational graph (CombAST-style) that compiler passes can import and lower instead of only emitting standalone HDL.

This makes the generated datapath part of the compiler pipeline itself. In practice, it keeps floating-point structure visible across lowering stages and allows it to interact with the MLIR/CIRCT transformations used in this artifact.

Active branch used by this work:

The concrete flow in this repository combines:

From Llama context to this tiny chip

The kernel used here is intentionally small, but it comes from the same family of Python/LLM workloads discussed in the companion material: SwiGLU-style and matmul-dominated regions where multiply-add structure is abundant and arithmetic lowering choices matter.

One lowered MLIR stage from that broader LLaMA-derived path is shown below.

LLaMA-derived MLIR lowering context (from codez examples)

Companion context figure:

Llama-to-hardware context

In this artifact, we keep one compact loop-accum kernel so it fits TinyTapeout while still showing the full compiler path.

scf.for %k = %c0 to %c2 step %c1 {
  %x = memref.load %a[%k] : memref<2xf32>
  %y = memref.load %b[%k] : memref<2xf32>
  %acc = memref.load %c[%c0] : memref<2xf32>
  %m = arith.mulf %x, %y : f32
  %s = arith.addf %acc, %m : f32
  memref.store %s, %c[%c0] : memref<2xf32>
}

Source: flow/mlir/s3fdp_loop_accum.mlir

Arithmetic specialization and comb stage

Specialization is configured in scripts/generate_s3fdp_core.sh:

  • s3fdp.ovf=2
  • s3fdp.msb=4
  • s3fdp.lsb=-6
  • s3fdp.chunk_size=16

A representative comb-level view produced in the flow is shown below:

MLIR to SystemVerilog / comb-focused view (from codez examples)

You can inspect the exact IR stages in:

  • generated/ir-stages/20-flopoco-comb.mlir
  • generated/ir-stages/60-hw-aggregate.mlir
  • generated/ir-stages/90-hw-to-sv.mlir

Here CIRCT serves as the structural backend, while the floating-point path is made explicit through flopoco-arith-to-comb, the surrounding arithmetic-specialization work in emeraude-mlir, and the FloPoCo2 IR-generation refactoring above.

TinyTapeout wrapper behavior

Top module: tt_um_lledoux_s3fdp_seqcomb

Protocol:

  • input frame: 20 bytes (a[2], b[2], c0, little-endian)
  • run latency: 3 cycles
  • output: one 32-bit result over 4 bytes
  • slot cadence: 20 + 3 + 4 = 27 cycles

Waveform snapshot:

Streaming waveform

Reproduce

Generate core + IR artifacts:

./scripts/generate_s3fdp_core.sh

Run RTL test:

python3 -m venv .venv
. .venv/bin/activate
pip install -r test/requirements.txt
make -C test -B

Test vector in test/test.py:

  • a=[1.0, 0]
  • b=[1.0, 0]
  • c0=0.0
  • expected result 0x3F800000

Related material

Cite This Work

Selected references related to this repository:

@inproceedings{cochard2025multilevel,
  author = {Pierre Cochard and Luc Forget and Florent de Dinechin and Louis Ledoux},
  title = {Towards Multi-Level Arithmetic Optimizations},
  booktitle = {EuroLLVM 2025},
  address = {Berlin, Germany},
  year = {2025},
  url = {https://hal.science/hal-05063466v1},
  note = {Poster}
}

@article{ledoux2025optimizedmlir,
  author = {Louis Ledoux and Pierre Cochard and Florent de Dinechin},
  title = {Towards Optimized Arithmetic Circuits with MLIR},
  journal = {WiPiEC Journal: Works in Progress in Embedded Computing},
  volume = {11},
  number = {1},
  year = {2025},
  doi = {10.64552/wipiec.v11i1.90},
  url = {https://hal.science/hal-05385229v1},
  note = {Associated conference presentation at DSD 2025}
}

@misc{ledoux2026circtflopoco,
  author = {Louis Ledoux and Pierre Cochard and Florent de Dinechin},
  title = {Floating-Point Datapaths in CIRCT via FloPoCo AST Export and flopoco-arith-to-comb Lowering},
  howpublished = {Poster presentation at the EuroLLVM Developers' Meeting 2026},
  address = {Dublin, Ireland},
  year = {2026}
}

@misc{ledoux2026emeraude,
  author = {Louis Ledoux and Pierre Cochard and Florent de Dinechin},
  title = {Arithmetic Lowering with Emeraude-MLIR: Bridging Tensor and DSP Kernels to Silicon Datapaths},
  howpublished = {Presentation at the PEPR IA Embarqu{\'e}e Workshop},
  address = {Aussois, France},
  year = {2026},
  url = {https://hal.science/hal-05489427v1}
}

@misc{ledoux2025holigrail,
  author = {Louis Ledoux and Pierre Cochard},
  title = {FloPoCo and MLIR: a Multi-Level Compilation Framework for Many Intents},
  howpublished = {Holigrail Seminar, Sorbonne University},
  address = {Paris, France},
  year = {2025}
}

License

Apache-2.0 (see LICENSE).

About

mlir specialization pass

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors