
Support for memories with read latency higher than 1 #1392

Open
fayalalebrun opened this issue Apr 22, 2024 · 25 comments

Comments

@fayalalebrun (Contributor)

At the moment, the readSync method of Mem produces its result in the following cycle (a read latency of 1). However, some memory IPs (see here) support read latencies of 2 or more. Especially on FPGAs, this can be important to achieve a high operating frequency.

Ideally, Mem would allow the read latency to be configured, and convey this information to the blackboxing step.
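For illustration, this is roughly what users have to write today to get more than one cycle of latency (a sketch only; the extra register lands in the fabric and nothing tells the blackboxing step it could be folded into the RAM primitive):

```scala
import spinal.core._

// Sketch only: readSync has a fixed 1-cycle latency today, so any extra
// latency has to be built by hand out of ordinary registers.
class ReadLatencyToday extends Component {
  val io = new Bundle {
    val addr = in UInt (8 bits)
    val data = out Bits (32 bits)
  }
  val ram    = Mem(Bits(32 bits), 256)
  val stage1 = ram.readSync(io.addr) // 1 cycle of latency
  val stage2 = RegNext(stage1)       // second cycle, added manually in fabric
  io.data := stage2
}
```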

@Dolu1990 (Member)

Agree.

One question I have: how do those 2-cycle-latency memory IPs support a clock enable for the second stage (if at all)?

@KireinaHoro (Contributor) commented Apr 22, 2024

XPM does not support a clock enable for the memory itself, only for the last register stage for output:

```
regcea   Input   1   clka   LEVEL_HIGH   1   Clock Enable for the last register stage on the output data path.
regceb   Input   1   clkb   LEVEL_HIGH   1   Clock Enable for the last register stage on the output data path.
```

Definition for memory latency:

> READ_LATENCY_A: Specify the number of register stages in the port A read data pipeline. Read data output to port douta takes this number of clka cycles. To target block memory, a value of 1 or larger is required; 1 causes use of memory latch only, 2 causes use of output register. To target distributed memory, a value of 0 or larger is required; 0 indicates combinatorial output. Values larger than 2 synthesize additional flip-flops that are not retimed into memory primitives.
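For reference, forwarding that generic from SpinalHDL by hand would look roughly like this (a sketch only; the definition name and port list are illustrative and abridged, not a verified UG974 port map):

```scala
import spinal.core._

// Sketch: hand-wrapping a fixed-read-latency memory primitive as a BlackBox.
// Only the latency-related generic and enables are shown; a real wrapper
// needs the full generic/port list from the vendor documentation.
class XpmLikeRam(readLatency: Int) extends BlackBox {
  setDefinitionName("xpm_memory_tdpram")    // illustrative target primitive
  addGeneric("READ_LATENCY_A", readLatency) // register stages on the douta path
  val clka   = in Bool ()
  val ena    = in Bool () // first stage: enable of the memory read itself
  val regcea = in Bool () // last stage: enable of the output register
  // address/data ports omitted in this sketch
  mapCurrentClockDomain(clka)
}
```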

@Dolu1990 (Member)

> does not support a clock enable for the memory itself

Isn't that what ena / enb are for? A first-stage "clock enable"?

So:
ena / enb => first stage
regcea / regceb => second stage
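In fabric terms the two enables behave roughly like this (a behavioral sketch of a 2-cycle read, not the primitive itself; names are illustrative):

```scala
import spinal.core._

// Behavioral sketch of the two enable stages discussed above:
// `en` gates the memory read (ena/enb), `regce` gates the output register
// (regcea/regceb).
class TwoStageRead extends Component {
  val io = new Bundle {
    val addr  = in UInt (8 bits)
    val en    = in Bool ()  // first stage enable
    val regce = in Bool ()  // second (output register) stage enable
    val data  = out Bits (32 bits)
  }
  val ram    = Mem(Bits(32 bits), 256)
  val stage1 = ram.readSync(io.addr, enable = io.en)
  val stage2 = RegNextWhen(stage1, io.regce)
  io.data := stage2
}
```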

@Dolu1990 (Member)

The way I see it, the way forward would be to change the current:

```scala
def readSyncImpl(address: UInt, data : Data, enable: Bool = null, readUnderWrite: ReadUnderWritePolicy = dontCare, clockCrossing: Boolean = false, allowMixedWidth: Boolean = false): Unit = {
```

into

```scala
def readSyncImpl(address: UInt, data : Data, enable: Bool = null, readUnderWrite: ReadUnderWritePolicy = dontCare, clockCrossing: Boolean = false, allowMixedWidth: Boolean = false, latency : Int = 1, stageEnable : Seq[Bool] = Nil): Unit = {
```

and then adjust things around it.

Same for the readWrite ports.

Would that fit?
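Until something like that lands, the intent can be approximated in user code on top of the existing API (a sketch with a hypothetical helper name; the extra stages stay in fabric unless the blackboxing step learns to merge them):

```scala
import spinal.core._

// Sketch: emulate a configurable read latency on top of today's 1-cycle
// readSync by appending one enabled register per extra stage.
// stageEnable(i) gates stage i + 2 of the read pipeline.
def readSyncMultiCycle[T <: Data](mem: Mem[T], address: UInt, enable: Bool,
                                  latency: Int, stageEnable: Seq[Bool]): T = {
  require(latency >= 1 && stageEnable.size == latency - 1)
  var stage = mem.readSync(address, enable)
  for (ce <- stageEnable) stage = RegNextWhen(stage, ce)
  stage
}
```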

@KireinaHoro (Contributor) commented Apr 22, 2024

Hmm, I'm a bit confused about the stage enable signals being separate; what's the point of having them all separate? For instance, I'm not sure how this will map to the en/regce scheme of the XPM.

In a broader sense I also don't understand why you would want to read the value from the RAM without enabling the output registers.

@Dolu1990 (Member)

The point is to be able to implement streamed memory reads with multi-cycle latency, with backpressure and the ability to collapse bubbles.
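For example (a sketch, assuming a 2-cycle read built on today's API): each stage's enable depends on whether the stage downstream of it can move, so data keeps advancing into any bubble ahead even while the output is back-pressured, which is exactly what a single global enable cannot express:

```scala
import spinal.core._
import spinal.lib._

// Sketch of a streamed 2-cycle-latency read with backpressure and bubble
// collapsing: a stage advances when it is empty or its content moves on,
// hence one enable per stage instead of a single global one.
class StreamedRead extends Component {
  val io = new Bundle {
    val cmd = slave Stream(UInt(8 bits))    // read address
    val rsp = master Stream(Bits(32 bits))  // read data, two cycles later
  }
  val ram = Mem(Bits(32 bits), 256)

  val s1Valid  = RegInit(False)
  val s2Valid  = RegInit(False)
  val s2Enable = !s2Valid || io.rsp.ready   // output register may load
  val s1Enable = !s1Valid || s2Enable       // RAM latch stage may load

  io.cmd.ready := s1Enable
  when(s1Enable) { s1Valid := io.cmd.valid }
  when(s2Enable) { s2Valid := s1Valid }

  val s1Data = ram.readSync(io.cmd.payload, enable = io.cmd.fire) // stage 1: RAM latch
  val s2Data = RegNextWhen(s1Data, s2Enable && s1Valid)           // stage 2: output register

  io.rsp.valid   := s2Valid
  io.rsp.payload := s2Data
}
```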

@KireinaHoro (Contributor)

That's indeed interesting. I don't have a detailed idea of how that would work, but from the viewpoint of the XPM, stageEnable.head should map to ena/enb, and stageEnable.tail.reduce(_ && _) should map to regcea/regceb?

@Dolu1990 (Member)

regcea => enable
ena => stageEnable(0)

If that were a 3-cycle latency:
ena_stage3 => stageEnable(1)

So in other words:

Stage 1 => enable
Stage 2 => stageEnable(0)
Stage 3 => stageEnable(1)
and so on.

@KireinaHoro (Contributor)

No, but this is not possible, since en and regce are both single-bit signals in the XPM, regardless of how many reg stages there are. Or did I understand it wrong?

@Dolu1990 (Member)

I would say it is possible up to a 2-cycle latency for the XPM.
Not sure there is any interest in using the XPM with a latency larger than 2 anyway: "Values larger than 2 synthesize additional flip-flops that are not retimed into memory primitives."

@KireinaHoro (Contributor) commented Apr 23, 2024

Sounds great. I'm not too familiar with the internals of the Mem API, so it would be great if you could implement the fix, @Dolu1990; I can then start upstreaming the XPM blackboxing pass I have, as seen on Gitter :)

@Dolu1990 (Member)

Will do.
Keep in mind that on FPGAs, unless you are chasing a very high fmax, you often don't want to use the output flip-flop of the RAM block; instead you want those flops in the FPGA fabric, which allows them to travel closer to where their output is used.

@KireinaHoro (Contributor)

Yeah, that makes sense. I think whether to use the output flop on the RAM primitive or separate fabric stages should be left to the blackboxing pass to decide?

@Dolu1990 (Member)

I don't know.
Another idea would be that the blackboxing phase would be able to recognize a Reg after the RAM and fuse it into the blackbox XD

@KireinaHoro (Contributor) commented Apr 29, 2024

Ah, regarding the extra pipeline stages, after reading this thread on the Xilinx forum: they're useful for reducing routing congestion on the FPGA. Pipelining with extra flops allows the placer to move logic away from the congestion site, which in turn improves QoR. Not completely useless :)

@Dolu1990 (Member)

Hmm, I'm thinking now that maybe a way to implement it would be to keep MemReadSync as it is and add regular Regs on its read data; then we can add a special SpinalTag to help inference / blackboxing figure out that those registers can be merged.

That would have the advantage of keeping things "simple" and not requiring any update to the Verilog/VHDL backends.
It would only require the blackboxer to know about those special tags.

@KireinaHoro (Contributor)

That sounds exciting, but how would you guarantee that the unregistered RAM output is not consumed by anything other than the reg stage?

@Dolu1990 (Member)

What do you mean by consumed?

@developfpga

Hi @KireinaHoro @Dolu1990

I am interested in this issue too.

Let's check Xilinx UG974 as an example:
[figure: XPM memory excerpt from UG974]

```scala
def readSyncImpl(address: UInt, data : Data, enable: Bool = null, readUnderWrite: ReadUnderWritePolicy = dontCare, clockCrossing: Boolean = false, allowMixedWidth: Boolean = false, latency : Int = 1, stageEnable : Seq[Bool] = Nil): Unit = {
```

enb is mapped to enable like before. And how about mapping stageEnable(0) to regceb?

Or we can use setBlackBoxName to generate the RAM blackbox and just add one more port, stageEnable(0).

@Dolu1990 (Member)

Hi @developfpga

I did some experiments locally by extending the readSyncImpl API, but now I think the best option would instead be for the blackboxing phase to be capable of recognizing a Reg after a readSync and fusing it into the blackbox.

That way, we could define pipelines in a regular way, and as long as there is a simple register after the readSync, everything can be "magically" merged.

So what would be needed is to implement a phase which iterates over the AST and adds tags to Mem to document "fusion" possibilities, which can then be used by the blackboxer.
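A minimal skeleton of that mechanism might look as follows (a sketch with hypothetical names; the full experiment appears further down in this thread):

```scala
import spinal.core._
import spinal.core.internals._

// Sketch: a custom tag plus an empty netlist phase, registered through
// SpinalConfig. The real phase would walk the netlist, find readSync ports
// whose data only feeds plain registers, and tag them for the blackboxer.
class MergeableReadRegTag extends SpinalTag // hypothetical tag name

class TagMergeableReadRegsPhase extends PhaseNetlist {
  override def impl(pc: PhaseContext): Unit = {
    // walk pc, find candidate MemReadSync + Reg pairs, then addTag(...) on them
  }
}

object RegisterThePhase {
  val config = SpinalConfig()
  config.addTransformationPhase(new TagMergeableReadRegsPhase)
}
```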

@developfpga

Hi @Dolu1990

From my perspective and the Xilinx user guide, here is my suggestion.

```scala
readData := ram.readSync(addr, rdEn, latency, stageEnable)
```

The default is latency = 1 with no stageEnable needed, and stageEnable.getBitsWidth == latency - 1.
For example, if I write SpinalHDL like

```scala
readData := ram.readSync(addr, rdEn, 2, stageEnable)
```

the Verilog code I expect to be generated looks like this:

```verilog
always @(posedge rd_clk) begin
  if (rdEn) begin
    read_data <= ram[addr];
  end
  if (stageEnable) begin
    read_data_stage_1 <= read_data;
  end
end
assign readData = read_data_stage_1;
```

The Verilog blackbox would look like the following; the parameter rdLatency and the IO rd_stage_enable are added.
rdLatency and rd_stage_enable are handled by the blackbox, with nothing more to do on the SpinalHDL side.

```verilog
  blackbox #(
    .wordCount(1024),
    .wordWidth(36),
    .clockCrossing(1'b0),
    .technology("auto"),
    .readUnderWrite("dontCare"),
    .wrAddressWidth(10),
    .wrDataWidth(36),
    .wrMaskWidth(1),
    .wrMaskEnable(1'b0),
    .rdAddressWidth(10),
    .rdDataWidth(36),
    .rdLatency(2) // added
  ) u_ram(
    .wr_clk  (), //i
    .wr_en   (), //i
    .wr_mask (), //i
    .wr_addr (), //i
    .wr_data (), //i
    .rd_clk  (), //i
    .rd_en   (), //i
    .rd_addr (), //i
    .rd_data (),  //o
    .rd_stage_enable() // added
  );
```

So I don't understand the "add tags" part; I don't think any tags need to be added here.

@developfpga

Hi @Dolu1990

Any updates on this issue?

@Dolu1990 (Member)

No progress so far, I'm quite underwater :/

@Dolu1990 (Member)

Got some progress:

The following code will tag every readSync port with a special tag when its output is only used by registers.
From there, it only remains to implement the blackboxification itself.

```scala
// Imports needed to compile this experiment standalone
import spinal.core._
import spinal.core.internals._
import spinal.lib.graphic.Rgb
import scala.collection.mutable.ArrayBuffer

object PlayMemSync3 extends App{
  class MemDataRegTag(reg : BaseType, rs : MemPortStatement, through : List[BaseNode]) extends SpinalTag
  class InferTagPhase extends PhaseNetlist{
    override def impl(pc: PhaseContext): Unit = {
      val algo = pc.globalData.allocateAlgoIncrementale()
      val portsHits = ArrayBuffer[MemPortStatement]()
      pc.walkDeclarations{
        case reg : BaseType if reg.isReg && !reg.hasInit && reg.hasOnlyOneStatement && reg.isDirectionLess =>{
          def walkStm(s : LeafStatement, through : List[BaseNode]): Unit = s match {
            case s : DataAssignmentStatement if s.target.isInstanceOf[BaseType] =>
              val target = s.target.asInstanceOf[BaseType]
              if(through == Nil || s.parentScope == target.parentScope) {
                walkExp(s.source, s :: through)
              }
            case _ =>
          }

          def walkExp(e: Expression, through : List[BaseNode]): Unit = e match {
            case e: CastEnumToBits =>
            case e: CastEnumToEnum =>
            case e: CastBitsToEnum =>
            case e: Cast => walkExp(e.input, e :: through)
            case e: BitsRangedAccessFixed => walkExp(e.source, e :: through)
            case bt: BaseType if bt.isComb && bt.hasOnlyOneStatement => walkStm(bt.head, bt :: through)
            case rs: MemReadSync if reg.component == rs.component => {
              println(s"hit $rs through ${through.map(e => "- " + e).mkString("\n")}")
              val hitId = portsHits.size
              portsHits += rs
              rs.addTag(new MemDataRegTag(reg, rs, through))
              through.foreach{ bn =>
                bn.algoIncrementale = algo
                bn.algoInt = hitId
              }
            }
            case _ =>
          }
          walkStm(reg.head.asInstanceOf[DataAssignmentStatement], Nil)
        }
        case _ =>
      }
      pc.walkStatements{s =>
        if(s.algoIncrementale != algo) {
          s.walkDrivingExpressions { e =>
            if (e.algoIncrementale == algo) {
              println("Remove it")
              val port = portsHits(e.algoInt)
              port.removeTags(port.getTags().filter(_.isInstanceOf[MemDataRegTag]))
            }
          }
        }
      }
    }
  }
  val config = SpinalConfig()
  config.addTransformationPhase(new InferTagPhase)
  config.addStandardMemBlackboxing(blackboxAllWhatsYouCan)
  config.generateVerilog(new Component{
    val dataType = HardType(Rgb(5,6,5))
    val rom = Mem.fill(256)(dataType)
    val addr = in(rom.addressType)
    val readed = rom.readSync(addr)
    val miaou = dataType()
    miaou := readed
    val buffer = RegNext(miaou)
    val result = out(CombInit(buffer))
  })
}
```

@Dolu1990 (Member)

With the commit above, the following code seems fine:

```scala
// Imports needed to compile this experiment standalone;
// MemReadBufferPhase comes from the commit referenced above.
import spinal.core._
import spinal.core.internals._
import spinal.lib._
import spinal.lib.graphic.Rgb

object PlayMemSync3 extends App{
  val config = SpinalConfig()
  config.addTransformationPhase(new MemReadBufferPhase)
  config.addStandardMemBlackboxing(blackboxAllWhatsYouCan)
  config.generateVerilog(new Component{
    val dataType = HardType(Rgb(5,6,5))
    val ram = Mem.fill(256)(dataType)
    val write = ram.writePort().asSlave
    val addr = in(ram.addressType)
    val readed = ram.readSync(addr)
    val miaou = dataType()
    miaou := readed
    val condA = in Bool()
    val buffer = RegNextWhen(miaou, condA)
    val result = out(CombInit(buffer))
  })
}
```
