Releases: Intuity/nexus

v0.3: Hardware capacity and capabilities

03 Jan 15:17

This release focusses on improving the capacity and capabilities of the Nexus hardware, in preparation for a complete rewrite of nxcompile. The changes all stem from attempting to scale the v0.2 hardware beyond a 36 node mesh with more than 8 inputs and outputs per node: when the v0.2 design was scaled up, complexity exploded and each node more than trebled in size, completely exhausting the resources of the FPGA.

Using a Xilinx XC7A200T FPGA, timing has been achieved at 200 MHz for the mesh in a 10x10 configuration with 32 inputs, 32 outputs, and 16 working registers per node. Over a PCIe link to a host system, Nexus has been observed to run at 10 million cycles per second with output messages disabled, and around 2.4 million cycles per second with output messages enabled (a data rate of approximately 1 Gb/s). Both figures assume the minimum amount of simulation work happening per cycle, so they are 'ideal' figures, but this is still far above the performance achieved in release v0.2.

The headline changes in v0.3 are:

  • nxmodel previously used SimPy, but is now written in C++. This change yielded a huge speed increase (simulations now run at tens of kilohertz), and it still integrates well with the cocotb verification environment using the awesome pybind11 framework.
  • Constants, enumerations, structs, and unions are now defined using Packtype, which allows the definition to be mastered in Python and then used from generated code in the RTL, the Python-based verification environment, and the C++ model and driver.
  • The implementation of the node RTL has changed dramatically:
    • Output message start and end positions are now held in a lookup table in the node's RAM, which resulted in a substantial reduction in both flop and LUT usage.
    • Loopback of output messages has been replaced by a mask, which directly drives inputs from output values of the same node in matching positions.
    • The decoder now directly loads the node's RAM for all entries, rather than output mappings being fed through the controller.
    • The controller now supports output value trace generation, which is off by default and can be selectively enabled.
    • If configured with support, node inputs can be directly driven from external sources - this forms part of the support for simulated memory access.
    • The logical core now evaluates three-input truth table operations rather than fixed functions such as AND, OR, etc.
    • Message routing now prioritises horizontal dispatch (across columns) over vertical dispatch (across rows), which results in a better balance of traffic across the mesh and helps to support the column aggregators.
  • New aggregator components now sit at the bottom of each column of nodes, collecting all signal messages and exposing a wide output bus. All other message types are passed through the aggregators and forwarded to the host.
  • The top-level controller has substantially changed:
    • Host-facing interfaces to the controller have been widened to 128 bits, and the encoding has been changed to achieve better transfer efficiency.
    • The host-facing interfaces to the mesh have been removed, and instead specific controller request and response types have been introduced which allow messages to be forwarded into and out of the mesh.
    • The controller is now responsible for generating a summary of the wide output bus from the mesh and sending it to the host; sections of the summary can be suppressed when not required, reducing traffic.
    • Multiple on-device memories exist in the controller which can be accessed by the mesh once per simulated cycle; the host can also read and write the contents of these memories at any point during the simulation.
  • nxlink has been rewritten to support the new controller interface, and the protobuf-based gRPC framework has been dropped in favour of integrating with tools as a library rather than via socket connections.
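The loopback mask described above can be pictured as a simple bitwise select between a node's own outputs and its externally driven inputs. A minimal Python sketch of the idea - apply_loopback is an illustrative name, not part of the Nexus codebase:

```python
def apply_loopback(inputs: int, outputs: int, mask: int) -> int:
    """Drive each input bit from the same-position output bit wherever the
    mask is set; all other input bits keep their externally driven value."""
    return (inputs & ~mask) | (outputs & mask)

# Bit 1 is looped back from the node's own outputs; bits 0, 2, 3 are not.
looped = apply_loopback(0b1101, 0b0010, 0b0010)  # → 0b1111
```

Because the mask drives inputs combinatorially from same-position outputs, no message needs to transit the mesh for a node to observe its own results.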
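The shift from fixed gate functions to three-input truth tables means every operation the logical core performs is just an 8-bit constant indexed by the input bits, so any of the 256 possible functions costs the same hardware. A hedged Python sketch of the principle - make_lut and evaluate are illustrative names, not taken from nxmodel:

```python
def make_lut(fn) -> int:
    """Pack a three-input boolean function into an 8-bit truth table."""
    table = 0
    for idx in range(8):
        a, b, c = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
        if fn(a, b, c):
            table |= 1 << idx
    return table

def evaluate(table: int, a: int, b: int, c: int) -> int:
    """Look up the result for inputs (a, b, c) in the packed table."""
    return (table >> ((a << 2) | (b << 1) | c)) & 1

# Any fixed gate becomes just a different 8-bit constant:
AND3 = make_lut(lambda a, b, c: a & b & c)        # 0b10000000
MAJ3 = make_lut(lambda a, b, c: (a + b + c) >= 2) # majority vote
```

This is the same trick FPGA LUTs use, which is presumably why it maps so cheaply onto the target device.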
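The horizontal-before-vertical routing priority is a form of dimension-ordered routing: a message first travels across columns until it reaches its destination column, then down the rows towards that column's aggregator. A minimal sketch of such a policy, assuming (row, column) coordinates - next_hop is a hypothetical helper, not the RTL implementation:

```python
def next_hop(cur: tuple, dest: tuple) -> tuple:
    """Pick the next node on the path, dispatching horizontally (across
    columns) before vertically (across rows)."""
    (row, col), (drow, dcol) = cur, dest
    if col != dcol:  # horizontal dispatch takes priority
        return (row, col + (1 if dcol > col else -1))
    if row != drow:  # then travel vertically within the column
        return (row + (1 if drow > row else -1), col)
    return cur       # already at the destination
```

Resolving one dimension at a time in a fixed order keeps paths deterministic and spreads traffic more evenly than letting every hop choose freely.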

The next focus will be on improving the compiler to take full advantage of the new capabilities and capacity of the Nexus hardware platform.

v0.2: Functional on FPGA

22 Aug 13:26

This release marks the design reaching sufficient maturity to run Nexus on an FPGA, meeting timing at 200 MHz on a Xilinx Artix-7 200T. The basic multilayer_8bit design was seen to operate correctly over many thousands of clock cycles, running at ~45 kHz. This simulated clock speed is far below what is expected (~1 MHz), and quite likely limited by inefficient use of the PCIe link to the host.

  • RTL updates:
    • New node I/O handling scheme;
    • Mesh traffic regulated through rotating tokens;
    • Refactoring and re-pipelining of many blocks to ease timing;
    • Adding a wrapper layer of hierarchy to convert to/from AXI4-Stream to interface with Xilinx PCIe DMA;
    • Lint issues have been resolved.
  • Tool updates:
    • nxcompile updated to work with new node I/O handling scheme;
    • nxmodel updated to work with new compiler output.
  • Flow updates:
    • Verilator lint has been integrated as a single make lint target within the hardware folder;
    • Out-of-context synthesis of Nexus with both Yosys and Vivado, targeting 7-series logic, is supported using make syn_yosys or make syn_vivado within the hardware folder;
    • Support for running regressions of all testbenches using make regress within the hardware folder.
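The rotating-token regulation of mesh traffic behaves like round-robin arbitration: the token position determines which requester currently has priority, and it advances past each winner so that no node is starved. A hedged Python sketch of that behaviour - grant is an illustrative name, not taken from the Nexus RTL:

```python
def grant(requests: list, token: int):
    """Grant the first active requester at or after the token position,
    then advance the token past the winner (rotating priority)."""
    count = len(requests)
    for offset in range(count):
        idx = (token + offset) % count
        if requests[idx]:
            return idx, (idx + 1) % count  # winner, next token position
    return None, token  # nobody requesting; token stays put
```

Each cycle the token effectively rotates around the contenders, so over time every node gets an equal share of the shared mesh links.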

v0.1: Initial release

22 Aug 13:27

Initial release of Nexus, with RTL at a minimal functioning level under simulation alone.