
Support for memories with read latency higher than 1 #1392

Open
fayalalebrun opened this issue Apr 22, 2024 · 25 comments

Comments

@fayalalebrun (Contributor)

At the moment, the readSync method of Mem produces its result in the following cycle (a read latency of 1). However, some memory IPs (see here) support read latencies of 2 or more. Especially on FPGAs, this can be important to achieve a high operating frequency.

Ideally, Mem would allow the read latency to be configured, and convey this information to the blackboxing step.
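For illustration, this is roughly what users have to write today to get more than one cycle of latency (a sketch only; the extra register lands in the fabric and nothing tells the blackboxing step it could be folded into the RAM primitive):

```scala
import spinal.core._

// Sketch only: readSync has a fixed 1-cycle latency today, so any extra
// latency has to be built by hand out of ordinary registers.
class ReadLatencyToday extends Component {
  val io = new Bundle {
    val addr = in UInt (8 bits)
    val data = out Bits (32 bits)
  }
  val ram    = Mem(Bits(32 bits), 256)
  val stage1 = ram.readSync(io.addr) // 1 cycle of latency
  val stage2 = RegNext(stage1)       // second cycle, added manually in fabric
  io.data := stage2
}
```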

@Dolu1990 (Member)

Agree.

One question I have: how do those 2-cycle-latency memory IPs support a clock enable for the second stage (if at all)?

@KireinaHoro (Contributor) commented Apr 22, 2024

XPM does not support a clock enable for the memory itself, only for the last register stage for output:

```
regcea   Input   1   clka   LEVEL_HIGH   1   Clock Enable for the last register stage on the output data path.
regceb   Input   1   clkb   LEVEL_HIGH   1   Clock Enable for the last register stage on the output data path.
```

Definition for memory latency:

> READ_LATENCY_A: Specify the number of register stages in the port A read data pipeline. Read data output to port douta takes this number of clka cycles. To target block memory, a value of 1 or larger is required; 1 causes use of memory latch only, 2 causes use of output register. To target distributed memory, a value of 0 or larger is required; 0 indicates combinatorial output. Values larger than 2 synthesize additional flip-flops that are not retimed into memory primitives.
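For reference, forwarding that generic from SpinalHDL by hand would look roughly like this (a sketch only; the definition name and port list are illustrative and abridged, not a verified UG974 port map):

```scala
import spinal.core._

// Sketch: hand-wrapping a fixed-read-latency memory primitive as a BlackBox.
// Only the latency-related generic and enables are shown; a real wrapper
// needs the full generic/port list from the vendor documentation.
class XpmLikeRam(readLatency: Int) extends BlackBox {
  setDefinitionName("xpm_memory_tdpram")    // illustrative target primitive
  addGeneric("READ_LATENCY_A", readLatency) // register stages on the douta path
  val clka   = in Bool ()
  val ena    = in Bool () // first stage: enable of the memory read itself
  val regcea = in Bool () // last stage: enable of the output register
  // address/data ports omitted in this sketch
  mapCurrentClockDomain(clka)
}
```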

@Dolu1990 (Member)

> does not support a clock enable for the memory itself

Isn't that what ena / enb are for? A first-stage "clock enable"?

So:
ena / enb => first stage
regcea / regceb => second stage
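In fabric terms the two enables behave roughly like this (a behavioral sketch of a 2-cycle read, not the primitive itself; names are illustrative):

```scala
import spinal.core._

// Behavioral sketch of the two enable stages discussed above:
// `en` gates the memory read (ena/enb), `regce` gates the output register
// (regcea/regceb).
class TwoStageRead extends Component {
  val io = new Bundle {
    val addr  = in UInt (8 bits)
    val en    = in Bool ()  // first stage enable
    val regce = in Bool ()  // second (output register) stage enable
    val data  = out Bits (32 bits)
  }
  val ram    = Mem(Bits(32 bits), 256)
  val stage1 = ram.readSync(io.addr, enable = io.en)
  val stage2 = RegNextWhen(stage1, io.regce)
  io.data := stage2
}
```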

@Dolu1990 (Member)

The way I see it, the way forward would be to change the current:

```scala
def readSyncImpl(address: UInt, data : Data, enable: Bool = null, readUnderWrite: ReadUnderWritePolicy = dontCare, clockCrossing: Boolean = false, allowMixedWidth: Boolean = false): Unit = {
```

into

```scala
def readSyncImpl(address: UInt, data : Data, enable: Bool = null, readUnderWrite: ReadUnderWritePolicy = dontCare, clockCrossing: Boolean = false, allowMixedWidth: Boolean = false, latency : Int = 1, stageEnable : Seq[Bool] = Nil): Unit = {
```

and then adjust things around it.

Same for the readWrite ports.

Would that fit?
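Until something like that lands, the intent can be approximated in user code on top of the existing API (a sketch with a hypothetical helper name; the extra stages stay in fabric unless the blackboxing step learns to merge them):

```scala
import spinal.core._

// Sketch: emulate a configurable read latency on top of today's 1-cycle
// readSync by appending one enabled register per extra stage.
// stageEnable(i) gates stage i + 2 of the read pipeline.
def readSyncMultiCycle[T <: Data](mem: Mem[T], address: UInt, enable: Bool,
                                  latency: Int, stageEnable: Seq[Bool]): T = {
  require(latency >= 1 && stageEnable.size == latency - 1)
  var stage = mem.readSync(address, enable)
  for (ce <- stageEnable) stage = RegNextWhen(stage, ce)
  stage
}
```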

@KireinaHoro (Contributor) commented Apr 22, 2024

Hmm, I'm a bit confused about the stage enable signals being separate; what's the point of having them all separate? For instance, I'm not sure how this will map to the en/regce scheme of the XPM.

In a broader sense I also don't understand why you would want to read the value from the RAM without enabling the output registers.

@Dolu1990 (Member)

The point is to be able to implement streamed memory reads with multi-cycle latency, with backpressure and the ability to collapse bubbles.
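For example (a sketch, assuming a 2-cycle read built on today's API): each stage's enable depends on whether the stage downstream of it can move, so data keeps advancing into any bubble ahead even while the output is back-pressured, which is exactly what a single global enable cannot express:

```scala
import spinal.core._
import spinal.lib._

// Sketch of a streamed 2-cycle-latency read with backpressure and bubble
// collapsing: a stage advances when it is empty or its content moves on,
// hence one enable per stage instead of a single global one.
class StreamedRead extends Component {
  val io = new Bundle {
    val cmd = slave Stream(UInt(8 bits))    // read address
    val rsp = master Stream(Bits(32 bits))  // read data, two cycles later
  }
  val ram = Mem(Bits(32 bits), 256)

  val s1Valid  = RegInit(False)
  val s2Valid  = RegInit(False)
  val s2Enable = !s2Valid || io.rsp.ready   // output register may load
  val s1Enable = !s1Valid || s2Enable       // RAM latch stage may load

  io.cmd.ready := s1Enable
  when(s1Enable) { s1Valid := io.cmd.valid }
  when(s2Enable) { s2Valid := s1Valid }

  val s1Data = ram.readSync(io.cmd.payload, enable = io.cmd.fire) // stage 1: RAM latch
  val s2Data = RegNextWhen(s1Data, s2Enable && s1Valid)           // stage 2: output register

  io.rsp.valid   := s2Valid
  io.rsp.payload := s2Data
}
```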

@KireinaHoro (Contributor)

That's indeed interesting. I don't have a detailed idea of how that would work, but from the viewpoint of the XPM, stageEnable.head should map to ena/enb, and stageEnable.tail.reduce(_ && _) should map to regcea/regceb?

@Dolu1990 (Member)

regcea => enable
ena => stageEnable(0)

If that were a 3-cycle latency:
ena_stage3 => stageEnable(1)

So in other words:

Stage 1 => enable
Stage 2 => stageEnable(0)
Stage 3 => stageEnable(1)
and so on.

@KireinaHoro (Contributor)

No, but this is not possible, since en and regce are both single-bit signals in the XPM, regardless of how many reg stages there are. Or did I understand it wrong?

@Dolu1990 (Member)

I would say it is possible up to a 2-cycle latency for the XPM.
Not sure there is any interest in using the XPM with a latency larger than 2 anyway: "Values larger than 2 synthesize additional flip-flops that are not retimed into memory primitives."

@KireinaHoro (Contributor) commented Apr 23, 2024

Sounds great. I'm not too familiar with the internals of the Mem API, so it would be great if you could implement the fix, @Dolu1990; I can then start upstreaming the XPM blackboxing pass I have, as seen on Gitter :)

@Dolu1990 (Member)

Will do.
Keep in mind that on FPGAs, unless you are chasing a very high fmax, you often don't want to use the output flip-flop of the RAM block; instead you want those flops in the FPGA fabric, which allows them to travel closer to where their output is used.

@KireinaHoro (Contributor)

Yeah, that makes sense. I think whether to use the output flop on the RAM primitive or separate fabric stages should be left to the blackboxing pass to decide?

@Dolu1990 (Member)

I don't know.
Another idea would be that the blackboxing phase would be able to recognize a Reg after the RAM and fuse it into the blackbox XD

@KireinaHoro (Contributor) commented Apr 29, 2024

Ah, regarding the extra pipeline stages, after reading this thread on the Xilinx forum: they're useful for reducing routing congestion on the FPGA. Pipelining with extra flops allows the placer to move logic away from the congestion site, which in turn improves QoR. Not completely useless :)

@Dolu1990 (Member)

Hmm, I'm thinking now that maybe a way to implement it would be to keep MemReadSync as it is and add regular Regs on its read data; then we can add a special SpinalTag to help inference / blackboxing figure out that those registers can be merged.

That would have the advantage of keeping things "simple" and not requiring any update to the Verilog/VHDL backends.
It would only require the blackboxer to know about those special tags.

@KireinaHoro (Contributor)

That sounds exciting, but how would you guarantee that the unregistered RAM output is not consumed by anything other than the reg stage?

@Dolu1990 (Member)

What do you mean by consumed?

@developfpga

Hi @KireinaHoro @Dolu1990

I am interested in this issue too.

Let's check Xilinx UG974 as an example:
[figure: XPM memory excerpt from UG974]

```scala
def readSyncImpl(address: UInt, data : Data, enable: Bool = null, readUnderWrite: ReadUnderWritePolicy = dontCare, clockCrossing: Boolean = false, allowMixedWidth: Boolean = false, latency : Int = 1, stageEnable : Seq[Bool] = Nil): Unit = {
```

enb is mapped to enable like before. And how about mapping stageEnable(0) to regceb?

Or we can use setBlackBoxName to generate the RAM blackbox and just add one more port, stageEnable(0).

@Dolu1990 (Member)

Hi @developfpga

I did some experiments locally by extending the readSyncImpl API, but now I think the best option would instead be for the blackboxing phase to be capable of recognizing a Reg after a readSync and fusing it into the blackbox.

That way, we could define pipelines in a regular way, and as long as there is a simple register after the readSync, everything can be "magically" merged.

So what would be needed is to implement a phase which iterates over the AST and adds tags to Mem to document "fusion" possibilities, which can then be used by the blackboxer.
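A minimal skeleton of that mechanism might look as follows (a sketch with hypothetical names; the full experiment appears further down in this thread):

```scala
import spinal.core._
import spinal.core.internals._

// Sketch: a custom tag plus an empty netlist phase, registered through
// SpinalConfig. The real phase would walk the netlist, find readSync ports
// whose data only feeds plain registers, and tag them for the blackboxer.
class MergeableReadRegTag extends SpinalTag // hypothetical tag name

class TagMergeableReadRegsPhase extends PhaseNetlist {
  override def impl(pc: PhaseContext): Unit = {
    // walk pc, find candidate MemReadSync + Reg pairs, then addTag(...) on them
  }
}

object RegisterThePhase {
  val config = SpinalConfig()
  config.addTransformationPhase(new TagMergeableReadRegsPhase)
}
```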

@developfpga

Hi @Dolu1990

From my perspective and the Xilinx user guide, here is my suggestion.

```scala
readData := ram.readSync(addr, rdEn, latency, stageEnable)
```

The default is latency = 1 with no stageEnable needed, and stageEnable.getBitsWidth == latency - 1.
For example, if I write SpinalHDL like

```scala
readData := ram.readSync(addr, rdEn, 2, stageEnable)
```

the Verilog code I expect to be generated looks like this:

```verilog
always @(posedge rd_clk) begin
  if (rdEn) begin
    read_data <= ram[addr];
  end
  if (stageEnable) begin
    read_data_stage_1 <= read_data;
  end
end
assign readData = read_data_stage_1;
```

The Verilog blackbox would look like the following; the parameter rdLatency and the IO rd_stage_enable are added.
rdLatency and rd_stage_enable are handled by the blackbox, with nothing more to do on the SpinalHDL side.

```verilog
  blackbox #(
    .wordCount(1024),
    .wordWidth(36),
    .clockCrossing(1'b0),
    .technology("auto"),
    .readUnderWrite("dontCare"),
    .wrAddressWidth(10),
    .wrDataWidth(36),
    .wrMaskWidth(1),
    .wrMaskEnable(1'b0),
    .rdAddressWidth(10),
    .rdDataWidth(36),
    .rdLatency(2) // added
  ) u_ram(
    .wr_clk  (), //i
    .wr_en   (), //i
    .wr_mask (), //i
    .wr_addr (), //i
    .wr_data (), //i
    .rd_clk  (), //i
    .rd_en   (), //i
    .rd_addr (), //i
    .rd_data (),  //o
    .rd_stage_enable() // added
  );
```

So I don't understand the "add tags" part; I don't think any tags need to be added here.

@developfpga

Hi @Dolu1990

Any updates on this issue?

@Dolu1990 (Member)

No progress so far, I'm quite underwater :/

@Dolu1990 (Member)

Got some progress:

The following code will tag every readSync port with a special tag when its output is only used by registers.
From there, it only remains to implement the blackboxification itself.

```scala
// Imports needed to compile this experiment standalone
import spinal.core._
import spinal.core.internals._
import spinal.lib.graphic.Rgb
import scala.collection.mutable.ArrayBuffer

object PlayMemSync3 extends App{
  class MemDataRegTag(reg : BaseType, rs : MemPortStatement, through : List[BaseNode]) extends SpinalTag
  class InferTagPhase extends PhaseNetlist{
    override def impl(pc: PhaseContext): Unit = {
      val algo = pc.globalData.allocateAlgoIncrementale()
      val portsHits = ArrayBuffer[MemPortStatement]()
      pc.walkDeclarations{
        case reg : BaseType if reg.isReg && !reg.hasInit && reg.hasOnlyOneStatement && reg.isDirectionLess =>{
          def walkStm(s : LeafStatement, through : List[BaseNode]): Unit = s match {
            case s : DataAssignmentStatement if s.target.isInstanceOf[BaseType] =>
              val target = s.target.asInstanceOf[BaseType]
              if(through == Nil || s.parentScope == target.parentScope) {
                walkExp(s.source, s :: through)
              }
            case _ =>
          }

          def walkExp(e: Expression, through : List[BaseNode]): Unit = e match {
            case e: CastEnumToBits =>
            case e: CastEnumToEnum =>
            case e: CastBitsToEnum =>
            case e: Cast => walkExp(e.input, e :: through)
            case e: BitsRangedAccessFixed => walkExp(e.source, e :: through)
            case bt: BaseType if bt.isComb && bt.hasOnlyOneStatement => walkStm(bt.head, bt :: through)
            case rs: MemReadSync if reg.component == rs.component => {
              println(s"hit $rs through ${through.map(e => "- " + e).mkString("\n")}")
              val hitId = portsHits.size
              portsHits += rs
              rs.addTag(new MemDataRegTag(reg, rs, through))
              through.foreach{ bn =>
                bn.algoIncrementale = algo
                bn.algoInt = hitId
              }
            }
            case _ =>
          }
          walkStm(reg.head.asInstanceOf[DataAssignmentStatement], Nil)
        }
        case _ =>
      }
      pc.walkStatements{s =>
        if(s.algoIncrementale != algo) {
          s.walkDrivingExpressions { e =>
            if (e.algoIncrementale == algo) {
              println("Remove it")
              val port = portsHits(e.algoInt)
              port.removeTags(port.getTags().filter(_.isInstanceOf[MemDataRegTag]))
            }
          }
        }
      }
    }
  }
  val config = SpinalConfig()
  config.addTransformationPhase(new InferTagPhase)
  config.addStandardMemBlackboxing(blackboxAllWhatsYouCan)
  config.generateVerilog(new Component{
    val dataType = HardType(Rgb(5,6,5))
    val rom = Mem.fill(256)(dataType)
    val addr = in(rom.addressType)
    val readed = rom.readSync(addr)
    val miaou = dataType()
    miaou := readed
    val buffer = RegNext(miaou)
    val result = out(CombInit(buffer))
  })
}
```

@Dolu1990 (Member)

With the commit above, the following code seems fine:

```scala
// Imports needed to compile this experiment standalone;
// MemReadBufferPhase comes from the commit referenced above.
import spinal.core._
import spinal.core.internals._
import spinal.lib._
import spinal.lib.graphic.Rgb

object PlayMemSync3 extends App{
  val config = SpinalConfig()
  config.addTransformationPhase(new MemReadBufferPhase)
  config.addStandardMemBlackboxing(blackboxAllWhatsYouCan)
  config.generateVerilog(new Component{
    val dataType = HardType(Rgb(5,6,5))
    val ram = Mem.fill(256)(dataType)
    val write = ram.writePort().asSlave
    val addr = in(ram.addressType)
    val readed = ram.readSync(addr)
    val miaou = dataType()
    miaou := readed
    val condA = in Bool()
    val buffer = RegNextWhen(miaou, condA)
    val result = out(CombInit(buffer))
  })
}
```
