Shrinking Microwatt #294

Open
cr1901 opened this issue Jun 6, 2021 · 21 comments

@cr1901

cr1901 commented Jun 6, 2021

I'd like to experiment using Microwatt as a 64-bit microcontroller on an ECP5 LFE5U-25F. Unfortunately, even with the following patch, I can't seem to get Microwatt and SoC peripherals to fit:

diff --git a/pythondata_cpu_microwatt/vhdl/core.vhdl b/pythondata_cpu_microwatt/vhdl/core.vhdl
index 4a83d69..e67ef4f 100644
--- a/pythondata_cpu_microwatt/vhdl/core.vhdl
+++ b/pythondata_cpu_microwatt/vhdl/core.vhdl
@@ -200,9 +200,9 @@ begin
     icache_0: entity work.icache
         generic map(
             SIM => SIM,
-            LINE_SIZE => 64,
-            NUM_LINES => 64,
-	    NUM_WAYS => 2
+            LINE_SIZE => 16,
+            NUM_LINES => 1,
+	    NUM_WAYS => 1
             )
         port map(
             clk => clk,
@@ -342,9 +342,9 @@ begin
 
     dcache_0: entity work.dcache
         generic map(
-            LINE_SIZE => 64,
-            NUM_LINES => 64,
-	    NUM_WAYS => 2
+            LINE_SIZE => 16,
+            NUM_LINES => 1,
+	    NUM_WAYS => 1
             )
         port map (
             clk => clk,
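
To put the patch above in perspective, here is a rough sketch of how much cache data storage each setting implies. It assumes `LINE_SIZE` is in bytes (consistent with the 64-byte cacheline mentioned later in this thread) and that data capacity is simply line size x lines x ways; tag, valid-bit and replacement-state storage are ignored. The `CacheGeometry` class is a hypothetical helper, not part of Microwatt or LiteX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical helper mirroring the three generics touched by the patch."""
    line_size: int  # bytes per cache line (assumed unit)
    num_lines: int  # lines per way
    num_ways: int   # associativity

    def data_bytes(self) -> int:
        # Data array only; tags and valid bits are not counted.
        return self.line_size * self.num_lines * self.num_ways


original = CacheGeometry(line_size=64, num_lines=64, num_ways=2)
patched = CacheGeometry(line_size=16, num_lines=1, num_ways=1)

print(original.data_bytes())  # 8192
print(patched.data_bytes())   # 16
```

In other words, the patch shrinks each cache from 8 KiB of data to a single 16-byte line, so whatever remains of the cache logic is control and tag handling rather than storage.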

Command Line:

python3 litex_boards/targets/gsd_orangecrab.py --l2-size=0 --cpu-type=microwatt --cpu-variant=standard+ghdl+irq --build

Resource Usage (nextpnr):

Info: Device utilisation:
Info:          TRELLIS_SLICE: 15856/12144   130%
Info:             TRELLIS_IO:    64/  197    32%
Info:                   DCCA:     5/   56     8%
Info:                 DP16KD:    54/   56    96%
Info:             MULT18X18D:    16/   28    57%
Info:                 ALU54B:     0/   14     0%
Info:                EHXPLLL:     2/    2   100%
Info:                EXTREFB:     0/    1     0%
Info:                   DCUA:     0/    1     0%
Info:              PCSCLKDIV:     0/    2     0%
Info:                IOLOGIC:    49/  128    38%
Info:               SIOLOGIC:     0/   69     0%
Info:                    GSR:     0/    1     0%
Info:                  JTAGG:     0/    1     0%
Info:                   OSCG:     0/    1     0%
Info:                  SEDGA:     0/    1     0%
Info:                    DTR:     0/    1     0%
Info:                USRMCLK:     0/    1     0%
Info:                CLKDIVF:     1/    4    25%
Info:              ECLKSYNCB:     1/   10    10%
Info:                DLLDELD:     0/    8     0%
Info:                 DDRDLL:     1/    4    25%
Info:                DQSBUFM:     2/    8    25%
Info:        TRELLIS_ECLKBUF:     3/    8    37%
Info:           ECLKBRIDGECS:     1/    2    50%

I'd like to discuss whether core variants would be possible (standard, lite, minimal) based on the FPGA size someone has on hand. However, I first would like to actually get Microwatt to fit at all :). Do I have any recourse, or is Microwatt (plus peripherals) considered "inherently too big" to be practical for a 25k LUT FPGA?

@paulusmack
Collaborator

I assume you have set HAS_FPU to false and LOG_LENGTH to 0?

There probably are quite a few things that could be done to reduce the resource usage, though getting it down to 25k LUTs would be quite a challenge. Have you looked at Bill Flynn's A2P cpu?

@cr1901
Author

cr1901 commented Jun 18, 2021

I assume you have set HAS_FPU to false and LOG_LENGTH to 0?

LOG_LENGTH was unmodified. Setting it helps a bit, but...

Info: Device utilisation:
Info:          TRELLIS_SLICE: 15430/12144   127%
Info:             TRELLIS_IO:    64/  197    32%
Info:                   DCCA:     5/   56     8%
Info:                 DP16KD:    46/   56    82%
Info:             MULT18X18D:    16/   28    57%
Info:                 ALU54B:     0/   14     0%
Info:                EHXPLLL:     2/    2   100%
Info:                EXTREFB:     0/    1     0%
Info:                   DCUA:     0/    1     0%
Info:              PCSCLKDIV:     0/    2     0%
Info:                IOLOGIC:    49/  128    38%
Info:               SIOLOGIC:     0/   69     0%
Info:                    GSR:     0/    1     0%
Info:                  JTAGG:     0/    1     0%
Info:                   OSCG:     0/    1     0%
Info:                  SEDGA:     0/    1     0%
Info:                    DTR:     0/    1     0%
Info:                USRMCLK:     0/    1     0%
Info:                CLKDIVF:     1/    4    25%
Info:              ECLKSYNCB:     1/   10    10%
Info:                DLLDELD:     0/    8     0%
Info:                 DDRDLL:     1/    4    25%
Info:                DQSBUFM:     2/    8    25%
Info:        TRELLIS_ECLKBUF:     3/    8    37%
Info:           ECLKBRIDGECS:     1/    2    50%
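
For anyone comparing runs like the two above, a small script can pull the numbers out of the nextpnr log. This is only a sketch: the regular expression is written against the `Info:` lines as they appear in this thread, and `parse_utilisation` is a made-up helper name, not a nextpnr or LiteX API. Only the two rows that changed between the runs are included.

```python
import re

# Matches lines such as "Info:   TRELLIS_SLICE: 15856/12144   130%".
LINE_RE = re.compile(r"Info:\s+(\w+):\s*(\d+)/\s*(\d+)\s+(\d+)%")


def parse_utilisation(report: str) -> dict:
    """Return {resource: (used, available)} for each utilisation line."""
    result = {}
    for line in report.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            result[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return result


before = parse_utilisation("""
Info:          TRELLIS_SLICE: 15856/12144   130%
Info:                 DP16KD:    54/   56    96%
""")
after = parse_utilisation("""
Info:          TRELLIS_SLICE: 15430/12144   127%
Info:                 DP16KD:    46/   56    82%
""")

for name, (used_before, available) in before.items():
    used_after = after[name][0]
    saved = used_before - used_after
    over = max(0, used_after - available)
    print(f"{name}: saved {saved}, still over by {over}")
```

With the figures from this thread, setting LOG_LENGTH to 0 saves 426 slices and 8 block RAMs, but the design is still 3286 slices over what the 25F provides.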

Aside: a number of config options at the tip of HEAD don't exist in the version of the core I'm using (ce0205b). Are the TLB and FPU enabled by default in that version of the core (and not configurable)? I'm going to assume "no" in both cases.

There probably are quite a few things that could be done to reduce the resource usage, though getting it down to 25k LUTs would be quite a challenge.

It's even worse in this case, since the core would need to be under 25k LUTs to support peripherals :D! I have a few FPGAs on hand that could fit MicroWatt, but I still wanted to try this specific board b/c its form factor makes it convenient to keep around on the workbench.

Have you looked at Bill Flynn's A2P cpu?

I've never heard of it, so I can't say that I have :). Can you give me more information on it?

@mikey
Collaborator

mikey commented Jun 23, 2021

Info: Device utilisation:
Info:          TRELLIS_SLICE: 15430/12144   127%

Just looking at LUT4 usage, the biggest items are:

3893		rotator
2333		loadstore1_0_5ba93c9db0cff93f52b521d7420e43f6eda2784f
2058		logical
1977		decode1_0_5ba93c9db0cff93f52b521d7420e43f6eda2784f
1429		dcache_16_1_1_1_1_12_0
1405		mmu
940		writeback
918		multiply_4
735		divider
611		decode2_0_0e356ba505631fbf715758bed27d503f8b260e3a
611		core_debug_0
537		soc_8192_40000000_0_0_1_0_2_0_1_1_1_1_1_1_1_0_7d82c4e805dcf6b85d0748a287442d8cfbc69098
488		wishbone_arbiter_4
412		icache_16_8_1_1_1_12_56_0_5ba93c9db0cff93f52b521d7420e43f6eda2784f
301		xics_ics_16_3
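
As a quick way to read the list above, the sketch below totals the listed modules and shows what share each of the largest ones takes. The module names are shortened by hand (generic and hash suffixes dropped), and the total covers only the modules listed here, not the whole design.

```python
# LUT4 counts copied from the list above; names shortened for readability.
lut4_by_module = {
    "rotator": 3893,
    "loadstore1": 2333,
    "logical": 2058,
    "decode1": 1977,
    "dcache": 1429,
    "mmu": 1405,
    "writeback": 940,
    "multiply": 918,
    "divider": 735,
    "decode2": 611,
    "core_debug": 611,
    "soc": 537,
    "wishbone_arbiter": 488,
    "icache": 412,
    "xics_ics": 301,
}

listed_total = sum(lut4_by_module.values())
print(f"listed total: {listed_total} LUT4")

# Show the three largest consumers and their share of the listed total.
largest = sorted(lut4_by_module.items(), key=lambda item: item[1], reverse=True)
for name, count in largest[:3]:
    share = 100 * count / listed_total
    print(f"{name}: {count} LUT4 ({share:.1f}% of listed)")
```

The listed modules add up to 18648 LUT4s, and the rotator alone accounts for roughly a fifth of that.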

@cr1901
Author

cr1901 commented Jun 23, 2021

Hrm, I'm assuming rotator is a 64-bit barrel shifter? That would explain a lot :P.

@mikey
Collaborator

mikey commented Jun 23, 2021

Hrm, I'm assuming rotator is a 64-bit barrel shifter? That would explain a lot :P.

yeah, and then some. powerpc has lots of rotate and mask/clear instructions

We could do a multi-cycle version that would take less LUT4s but be slower.

@cr1901
Author

cr1901 commented Jun 23, 2021

@mikey It's worth a shot to me, tbh. I don't need the MMU or dcache either. This particular board only has enough LUTs to run Microwatt as a 64-bit microcontroller. Linux on LiteX for Microwatt would be nice, but I can accept that my board is too small for that :P.

I don't know what "A2P" is that @paulusmack is referencing... is it another, smaller OpenPOWER core?

@mikey
Collaborator

mikey commented Jun 23, 2021

A2P is probably not really ready for the real world yet. I'd just ignore that comment for now. Sorry.

Removing the MMU and dcache is certainly an option. It might be something we could add as a generic option for microcontroller-only configurations.

Do you need external DRAM?

@cr1901
Author

cr1901 commented Jun 23, 2021

Do you need external DRAM?

Would be "nice to have" but certainly not a requirement for a microcontroller-only config. E.g. Micropython will work just fine using BRAM resources. Idk how practical it is to have Microwatt, a DRAM controller, and peripherals on OrangeCrab :D!

@paulusmack
Collaborator

Just looking at LUT4 usage, the biggest items are:

3893		rotator
2333		loadstore1_0_5ba93c9db0cff93f52b521d7420e43f6eda2784f
2058		logical
1977		decode1_0_5ba93c9db0cff93f52b521d7420e43f6eda2784f
1429		dcache_16_1_1_1_1_12_0
1405		mmu
940		writeback
918		multiply_4
735		divider
611		decode2_0_0e356ba505631fbf715758bed27d503f8b260e3a
611		core_debug_0
537		soc_8192_40000000_0_0_1_0_2_0_1_1_1_1_1_1_1_0_7d82c4e805dcf6b85d0748a287442d8cfbc69098
488		wishbone_arbiter_4
412		icache_16_8_1_1_1_12_56_0_5ba93c9db0cff93f52b521d7420e43f6eda2784f
301		xics_ics_16_3

The rotator shouldn't be anything like that large. Maybe yosys is synthesizing it badly. The main 64-bit barrel shifter is done as 3 layers of 4-input 64-bit wide multiplexers. It should be possible to do a 4-input multiplexer using 3 LUT4s, meaning a total of 576 LUT4s for the rotator. There are also the mask generator, an input multiplexer (2 wide) and an output multiplexer (4 wide), but they shouldn't amount to anything like 3000 LUT4s. The mask generators should be about 72 LUT4s each, the input mux (2 inputs x 32 bits) should be 32 LUT4s, and the output mux should be 192 LUT4s. That should all come to less than 1000 LUT4s.
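
The back-of-envelope estimate above can be written out as follows. The per-component figures are the ones given in the comment; the only added assumption is that there are two mask generators, since the comment says "each" without giving a count.

```python
WIDTH = 64          # datapath width in bits
MUX_LAYERS = 3      # three layers of 4-input muxes cover 4**3 = 64 rotate amounts
LUT4_PER_MUX4 = 3   # one 4-input mux bit built from three LUT4s

# Main barrel shifter: one 4-input mux per bit per layer.
barrel_shifter = MUX_LAYERS * WIDTH * LUT4_PER_MUX4

# Assumption: two mask generators at about 72 LUT4s each.
mask_generators = 2 * 72

input_mux = 32                       # 2-input mux over 32 bits
output_mux = WIDTH * LUT4_PER_MUX4   # 4-input mux over 64 bits

total = barrel_shifter + mask_generators + input_mux + output_mux

print(barrel_shifter)  # 576
print(output_mux)      # 192
print(total)           # 944
```

Under those assumptions the total is 944 LUT4s, which is consistent with the "less than 1000" figure and about a quarter of the 3893 reported for the rotator.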

@mikey
Collaborator

mikey commented Jul 16, 2021

@paulusmack a quick check of just the rotator confirms it's this big. I ran this:

podman run --rm -v /home/mikey/src/microwatt:/src:z -w /src hdlc/ghdl:yosys yosys -m ghdl -p "ghdl --std=08 --no-formal decode_types.vhdl utils.vhdl common.vhdl rotator.vhdl  -e rotator; synth_ecp5 -json rotator.json  -noflatten"

which runs in less than a minute and gives:

     CCU2C                         326
     L6MUX21                       758
     LUT4                         3349
     PFUMX                        1447

@umarcor
Contributor

umarcor commented Jul 21, 2021

I have a few FPGAs on hand that could fit MicroWatt, but I still wanted to try this specific board b/c its form factor makes it convenient to keep around on the workbench.

@cr1901 excuse me if this is obvious, but I got the impression that you are not aware of it. Did you see https://github.com/antonblanchard/microwatt/blob/master/.github/workflows/test.yml#L67 ? There is a make target for generating a bitstream for the OrangeCrab, which is executed in CI. For about a month it seems to have been reaching the limit, so implementation fails depending on the seed (https://github.com/antonblanchard/microwatt/actions). Sometimes it fits: https://github.com/antonblanchard/microwatt/runs/3082247675?check_suite_focus=true.

 Info: Device utilisation:
Info: 	       TRELLIS_SLICE: 39752/41820    95%
Info: 	          TRELLIS_IO:     4/  365     1%
Info: 	                DCCA:     2/   56     3%
Info: 	              DP16KD:    56/  208    26%
Info: 	          MULT18X18D:    32/  156    20%
Info: 	             EHXPLLL:     1/    4    25%

However, it seems that the Makefiles in this repo have used device --um5g-85k from the beginning (https://github.com/antonblanchard/microwatt/blame/6326efaca421a0eb1e91cb70f9c7b324812c6dd0/Makefile.synth#L33).

@antonblanchard is the OrangeCrab you got from @gregdavill different from the "regular" one that people can buy? The cable in https://twitter.com/antonblanchard/status/1219448773333487616 hides that FPGA 😢

I think it would be desirable to have nextpnr arguments and https://github.com/antonblanchard/microwatt/blob/master/constraints/orange-crab.lpf match what "typical" OrangeCrab users will get (https://github.com/gregdavill/OrangeCrab-examples/tree/main/verilog). If not possible, might be good to use OrangeCrab-r0.0 or some other name here, which makes it explicit.

A2P is probably not really ready for the real world yet. I'd just ignore that comment for now. Sorry.

@mikey, I'm aware of a2i but not A2P. Do you mind providing a reference, even if it's not ready for the real world yet?

a quick check of just the rotator confirms it's this big. I run this:

@mikey, you might try the following for getting a diagram representation:

# yosys -m ghdl -p 'ghdl --std=08 --no-formal decode_types.vhdl utils.vhdl common.vhdl rotator.vhdl  -e rotator; prep; write_json rotator.json'
# netlistsvg rotator.json -o rotator.svg
# convert rotator.svg rotator.png

rotator.png

@gregdavill

@antonblanchard is the OrangeCrab you got from @gregdavill different from the "regular" one that people can buy? The cable in https://twitter.com/antonblanchard/status/1219448773333487616 hides that FPGA 😢

Yes, that OrangeCrab specifically has a 5G-85F installed. I built that one for him before our initial group-buy ran.
You can now buy OrangeCrabs with 85Fs installed through Farnell; the part number is orangecrab-r0d2-85.

@mikey
Collaborator

mikey commented Jul 22, 2021

@gregdavill Am I reading the data sheet right on the 85K OrangeCrab as having 512MB of RAM? 25K version only has 128MB of RAM?

@gregdavill

@gregdavill Am I reading the data sheet right on the 85K OrangeCrab as having 512MB of RAM? 25K version only has 128MB of RAM?

Yes, on the released orangecrab-85F we also increase the DDR3 capacity, over the 25F version.
FYI, technically you can install a 1GB DDR3L, but they're ~10x the cost of the 512MB parts.

@cr1901
Author

cr1901 commented Jul 22, 2021

To be 100% clear, I wanted Microwatt to fit into the 12/25F part :D!

@gregdavill

To be 100% clear, I wanted Microwatt to fit into the 12/25F part :D!

Understood. I know your board is a 25F. I was just confirming that the existing/proven examples of this running on the ECP5, did use larger devices.

@mikey
Collaborator

mikey commented Jul 22, 2021

To be 100% clear, I wanted Microwatt to fit into the 12/25F part :D!

Yep I understand and getting down to that size is a goal.

That being said, I've been looking for a while for a bigger ECP5 + RAM + SPI + SD card board so we can run Microwatt + Linux on it. I didn't realise you could still get the OrangeCrab.

@mikey
Collaborator

mikey commented Jul 29, 2021

I have a few FPGAs on hand that could fit MicroWatt, but I still wanted to try this specific board b/c its form factor makes it convenient to keep around on the workbench.

@cr1901 excuse me if this is obvious, but I interpreted that you are not aware of it. Did you see https://github.com/antonblanchard/microwatt/blob/master/.github/workflows/test.yml#L67 ? So, there is a make target for generating a bitstream for the OrangeCrab, which is executed in CI. Since a month ago, it seems to be reaching the limit, so implementation fails depending on the seed (https://github.com/antonblanchard/microwatt/actions). Sometimes it fits: https://github.com/antonblanchard/microwatt/runs/3082247675?check_suite_focus=true.

 Info: Device utilisation:
Info: 	       TRELLIS_SLICE: 39752/41820    95%
Info: 	          TRELLIS_IO:     4/  365     1%
Info: 	                DCCA:     2/   56     3%
Info: 	              DP16KD:    56/  208    26%
Info: 	          MULT18X18D:    32/  156    20%
Info: 	             EHXPLLL:     1/    4    25%

It looks like the icache RAM is not being inferred as block RAM by yosys, and this is creating huge bloat.

I've reduced the size of the icache with a workaround (#303), which gets us from 95% usage down to 76%. This should keep us going until we can fix the RAM inference issue.

@umarcor
Contributor

umarcor commented Jul 29, 2021

@mikey does the icache use byte enables or global enables? We found issues in NEORV32 too when trying to infer BRAMs: ghdl/ghdl#1781, ghdl/ghdl#1782, ghdl/ghdl#1780.

@mikey
Collaborator

mikey commented Jul 30, 2021

@umarcor For the icache, we only write a cacheline at a time which is 64 bytes wide.

@cr1901
Author

cr1901 commented Jul 30, 2021

The icache inference may have been a problem, but the core still didn't fit into OrangeCrab even when I essentially disabled the caches :P.
