Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardware math division: 5211/193 > 27 #786

Open
dansanderson opened this issue Feb 10, 2024 · 10 comments
Open

Hardware math division: 5211/193 > 27 #786

dansanderson opened this issue Feb 10, 2024 · 10 comments
Assignees
Labels
bug Confirmed bug. staged Ready to be merged with stable master branch.

Comments

@dansanderson
Copy link
Contributor

Test Environment (required)
You can use MEGA65INFO to retrieve this.

  • Platform: MEGA65
  • Core Commit: 683-cartflash,20240204.22,7e58caf (though unlikely to be a v0.96 regression)
  • ROM Release: 920395

Describe the bug
The hardware math registers are returning an inexact result for 5211/193 (= 27). This Issue is to track down whether this is expected, or whether there is a flaw in the hardware division.

This was noticed in the BASIC, which might be exacerbating the imprecision with its conversion from fixed point to floating point, or might be using the hardware math registers incorrectly in some other way. MEGA65/mega65-rom-public#101

To Reproduce

!cpu m65
!to "tdiv.prg", cbm

* = $2001

!8 $12,$20,$0a,$00,$fe,$02,$20,$30,$3a,$9e,$20
!pet "$2014"
!8 $00,$00,$00

multina = $d770
multinb = $d774
mathbusy = $d70f
divquotient = $d76c
divremain = $d768

  ldq dividend
  stq multina
  ldq divisor
  stq multinb
loop:
  bit mathbusy
  bmi loop
  ldq divquotient
  stq $1804
  ldq divremain
  stq $1800
  brk           ; 

dividend:
!16 5211,0
divisor:
!16 193,0

This breaks to the monitor. Type M1800 to see the result. The result is DIVOUT whole (D76C-F) of 1B 00 00 00 = +27 and DIVOUT frac (D768-B) of 03 00 00 00 = +3 != 0.

Expected behavior
DIVOUT frac of 0 is desired.

@dansanderson dansanderson added the new New report, not classified yet label Feb 10, 2024
@lydon42
Copy link
Member

lydon42 commented Feb 12, 2024

There is a python vunit testbench in tests/hardware_divider.vhdl which can be used to debug the math unit.

@lydon42 lydon42 added bug Confirmed bug. and removed new New report, not classified yet labels Feb 12, 2024
@Rhialto
Copy link

Rhialto commented Feb 14, 2024

Ok, so I'm absolutely a beginner in reading VHDL. But it seems the divider in question is this one: https://github.com/MEGA65/mega65-core/blob/development/src/vhdl/fast_divide.vhdl Is there any reference where I can find the algorithm and compare it to my understanding of it, after staring at it for a bit?

But I think I get (mostly) how it is supposed to work. It is an iterative approximation of the desired result, with as invariant that the value nn / dd remains the same throughout (at least in the mathematical sense, if no bits are lost or overflow happens or that sort of thing).

This is clearly true in the "normalise" stage: both values get shifted left the same amount. Nonzero bits from nn will not fall out on the left because nn has more bits than dd.

In the "step" phase, they are both multiplied with the same value every time. (I didn't verify how this particular factor 2 - dd is supposed to bring us closer to the desired end state, but I'll assume for now that that is true). At the end, nn contains the quotient and the remainder.

The stopping condition of the step phase surprises me a bit. In particular, it seems that the desired end state here is that dd == x"FFFFFFFFF". However, I would have expected a value 1 more, i.e. x"1000000000". With this value, the division nn / dd is exactly equivalent to a shift-right of a number of bits, and the quotient is the result. The fraction would be the part that's shifted out.

Now the difference between dividing by either of those two numbers is likely to be fairly small, especially for larger values in nn. So probably in many cases the extra 4 less significant bits that are used in nn and dd plus a little bit of rounding could compensate for this. But probably not in all cases. It may explain the comment about adding one (although dividing by a number that is slightly too small would give a result that is already slightly too large).

The number of step iterations may be too low, too. I did some imperfect calculation for our dd=normalized(193) and I came to about 8 iterations until some kind of end condition arose (and it wasn't all FFs) (but I don't trust completely that python did what I wanted).

I searched around, and It seems this division is a form of Goldschmidt division ( https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division ) with fixed-point numbers (instead of floating point?). The text under the next header suggests that 5 iterations will give you 2^5 = 32 bits of precision (under the right circumstances). That sounds like enough, but I think we want at least 64 bits here (for 32 bits of quotient plus 32 bits of fraction), which is at iteration extra. 68 bits would require another iteration.

@MJoergen
Copy link
Contributor

Here are some more results (reported by the user "raxrip" on Discord):

image

MJoergen added a commit to MJoergen/mega65-core that referenced this issue Jul 25, 2024
This fixes issue MEGA65#786

The problem was lack of precision in the intermediate calculations. This
fix increases the used number of DSPs from 12 to 32.
@Rhialto
Copy link

Rhialto commented Jul 26, 2024

(let me ramble a bit, I'll probably get around to try out the latest ideas in vhdl)

I've been playing with simulation of my ideas a bit. What was the main idea? That the end condition dd /= x"FFFFFFFFF" is a bit unfortunate, since it's not-quite 1,0. It is just 0,FFFFFFF which is slightly off. In theory, if you add enough FFs at the end, 0,FFFF... is the same as 1.0 (just like in decimal 0,99999... == 1). But we don't have an infinite number of bits, so it's not exact. So in fact, dd can never reach the value 1,0 since it's missing a bit on the left to get to 1.

So what did I do? I added an extra msb (on the left) of dd, so that it can represent 1,0 exactly. Later on I realised I could have done that by just interpreting the binary comma in a different place - maybe I'll make a code variant of that if it's useful.
Making dd larger means some other things need to grow a bit too, and some positioning of the original values into dd and nn has to move as well.

The very nice result of this is that if you now divide by 1, or any other power of 2, the first step to "normalize" the values already does the actual division! No more steps are needed. And dd is nicely 10000...00 so the end condition triggers right away.
There is just one wasteful step in the "output" phase in this case: it takes it result values from the multiplication, but this isn't needed for this case; dd already is the correct value. I suppose this just wastes a clock cycle to wait for the multiplier, since the hardware multiplier is there and doing its work anyway.

Still with this change, the original case of 5211/193 is just as in-precise, so @MJoergen 's fix to add precision bits is the right thing to do.
But probably my change can be combined with his for even better results. Or at least faster (for easy divisions).

I'm not sure what the easiest way is to show my changes; creating a fork and a branch doesn't let you see a diff easily I think. So I'll just paste a diff below and a bit of the debugging output to show a nice case like I just mentioned.

I also increased the number of iterations to see if that helps for some cases. That's more for diagnostic purposes than anything else (although I suspect that one extra iteration might be needed for some unlucky cases)

101:9:@2250ns:(report note): Calculating $0000145B / $00000010
108:9:@2250ns:(report note): Normalised to $00000145.B0000000.0 / $100000000.0
52:7:@2270ns:(report note): state is step
63:11:@2270ns:(report note): nn=$00000145.B0000000.0 / $100000000.0
70:11:@2270ns:(report note): f = $1000000000
75:11:@2270ns:(report note): temp96=$000000145B00000000000000000
79:11:@2270ns:(report note): temp64=$1000000000000000000
52:7:@2290ns:(report note): state is output
94:11:@2290ns:(report note): temp64=$0000000145B00000000

By contrast, the original code looks like this:

100:9:@2250ns:(report note): Calculating $0000145B / $00000010
107:9:@2250ns:(report note): Normalised to $000000A2.D8000000.0 / $80000000.0

and it needs all the steps with multiplications and stuff to reach the approximation of the answer:

75:11:@2370ns:(report note): temp96=$000000145AFFFFFFFFFFFFEBA50

which needs to rounded up to be correct. All that work and it doesn't even get a simple bit-shift exact.

--- fast_divide.vhdl    2024-07-26 20:32:08.821552509 +0200
+++ fast_divide_new.vhdl        2024-07-26 20:57:06.425252705 +0200
@@ -22,9 +22,9 @@
 architecture wattle_and_daub of fast_divide is
   type state_t is (idle, step, output);
   signal state : state_t := idle;
-  signal steps_remaining : integer range 0 to 5 := 0;
+  signal steps_remaining : integer range 0 to 7 := 0;
 
-  signal dd : unsigned(35 downto 0) := to_unsigned(0,36);
+  signal dd : unsigned(36 downto 0) := to_unsigned(0,37);
   signal nn : unsigned(67 downto 0) := to_unsigned(0,68);
 
   pure function count_leading_zeros(arg : unsigned(31 downto 0)) return natural is
@@ -40,11 +40,11 @@
 begin
 
   process (clock) is
-    variable temp64 : unsigned(73 downto 0) := to_unsigned(0,74);
+    variable temp64 : unsigned(74 downto 0) := to_unsigned(0,75);
     variable temp96 : unsigned(105 downto 0) := to_unsigned(0,106);
     variable f : unsigned(37 downto 0) := to_unsigned(0,38);
     variable leading_zeros : natural range 0 to 31;
-    variable new_dd : unsigned( 35 downto 0);
+    variable new_dd : unsigned( 36 downto 0);
     variable new_nn : unsigned( 67 downto 0);
     variable padded_d : unsigned(63 downto 0);
   begin
@@ -55,13 +55,13 @@
       case state is
         when idle =>
           -- Deal with divide by zero
-          if dd = to_unsigned(0,36) then
+          if dd = to_unsigned(0,37) then
             q <= (others => '1');
             busy <= '0';
           end if;
         when step =>
           report "nn=$" & to_hstring(nn(67 downto 36)) & "." & to_hstring(nn(35 downto 4)) & "." & to_hstring(nn(3 downto 0))
-            & " / $" & to_hstring(dd(35 downto 4)) & "." & to_hstring(dd(3 downto 0));
+            & " / $" & to_hstring(dd(36 downto 4)) & "." & to_hstring(dd(3 downto 0));
 
         -- f = 2 - dd
           f := to_unsigned(0,38);
@@ -75,12 +75,13 @@
           report "temp96=$" & to_hstring(temp96);
 
           temp64 := dd * f;
-          dd <= temp64(71 downto 36);
+          dd <= temp64(72 downto 36);
           report "temp64=$" & to_hstring(temp64);
 
           -- Perform number of required steps, or abort early if we can
-          if steps_remaining /= 0 and dd /= x"FFFFFFFFF" then
+          if steps_remaining /= 0 and dd /= x"1000000000" then
             steps_remaining <= steps_remaining - 1;
+            report "steps_remaining=" & to_string(steps_remaining);
           else
             state <= output;
           end if;
@@ -89,7 +90,7 @@
           -- giving a result of 1.999999999
           temp64(67 downto 0) := nn;
           temp64(73 downto 68) := (others => '0');
-          temp64 := temp64 + 1;
+          --temp64 := temp64 + 1;
           report "temp64=$" & to_hstring(temp64);
           busy <= '0';
           q <= temp64(67 downto 4);
@@ -101,16 +102,16 @@
         leading_zeros := count_leading_zeros(d);
         padded_d := d & X"00000000";
         new_dd := (others => '0');
-        new_dd(35 downto 4) := padded_d(63-leading_zeros downto 32-leading_zeros);
+        new_dd(36 downto 5) := padded_d(63-leading_zeros downto 32-leading_zeros);
         new_nn := (others => '0');
-        new_nn(35+leading_zeros downto 4+leading_zeros) := n;
+        new_nn(36+leading_zeros downto 5+leading_zeros) := n;
         report "Normalised to $" & to_hstring(new_nn(67 downto 36)) & "." &
           to_hstring(new_nn(35 downto 4)) & "." & to_hstring(new_nn(3 downto 0))
-          & " / $" & to_hstring(new_dd(35 downto 4)) & "." & to_hstring(new_dd(3 downto 0));
+          & " / $" & to_hstring(new_dd(36 downto 4)) & "." & to_hstring(new_dd(3 downto 0));
         dd <= new_dd;
         nn <= new_nn;
         state <= step;
-        steps_remaining <= 5;
+        steps_remaining <= 7;
         busy <= '1';
       elsif start_over='1' then
         report "Ignoring divide by zero";  

@Rhialto
Copy link

Rhialto commented Jul 26, 2024

Here a diff for the case where I do not change the size of dd, so this should be an easier guide how to apply the same idea to @MJoergen 's version with increased lsbs.

f can probably made 1 bit smaller but I didn't get that quite right yet.

--- fast_divide.vhdl    2024-07-26 21:50:38.474444754 +0200
+++ fast_divide_new2.vhdl       2024-07-26 22:46:37.292384747 +0200
@@ -19,10 +19,10 @@
     );
 end entity;
 
-architecture wattle_and_daub of fast_divide is
+architecture wattle_and_daub_new2 of fast_divide is
   type state_t is (idle, step, output);
   signal state : state_t := idle;
-  signal steps_remaining : integer range 0 to 5 := 0;
+  signal steps_remaining : integer range 0 to 9 := 0;
 
   signal dd : unsigned(35 downto 0) := to_unsigned(0,36);
   signal nn : unsigned(67 downto 0) := to_unsigned(0,68);
@@ -65,21 +65,21 @@
 
         -- f = 2 - dd
           f := to_unsigned(0,38);
-          f(37) := '1';
+          f(36) := '1';
           f := f - dd;
           report "f = $" & to_hstring(f);
 
           -- Now multiply both nn and dd by f
           temp96 := nn * f;
-          nn <= temp96(103 downto 36);
+          nn <= temp96(102 downto 35);
           report "temp96=$" & to_hstring(temp96);
 
           temp64 := dd * f;
-          dd <= temp64(71 downto 36);
-          report "temp64=$" & to_hstring(temp64);
+          dd <= temp64(70 downto 35);
+          report "temp64=        $" & to_hstring(temp64);
 
           -- Perform number of required steps, or abort early if we can
-          if steps_remaining /= 0 and dd /= x"FFFFFFFFF" then
+          if steps_remaining /= 0 and dd /= x"7FFFFFFFF" and dd /= x"800000000" then
             steps_remaining <= steps_remaining - 1;
           else
             state <= output;
@@ -87,12 +87,10 @@
         when output =>
           -- No idea why we need to add one, but we do to stop things like 4/2
           -- giving a result of 1.999999999
-          temp64(67 downto 0) := nn;
-          temp64(73 downto 68) := (others => '0');
-          temp64 := temp64 + 1;
-          report "temp64=$" & to_hstring(temp64);
+          report "nn=$" & to_hstring(nn);
+          report "dd=$" & to_hstring(dd);
           busy <= '0';
-          q <= temp64(67 downto 4);
+          q <= nn(67 downto 4);
           state <= idle;
       end case;
 
@@ -103,14 +101,14 @@
         new_dd := (others => '0');
         new_dd(35 downto 4) := padded_d(63-leading_zeros downto 32-leading_zeros);
         new_nn := (others => '0');
-        new_nn(35+leading_zeros downto 4+leading_zeros) := n;
+        new_nn(36+leading_zeros downto 5+leading_zeros) := n;
         report "Normalised to $" & to_hstring(new_nn(67 downto 36)) & "." &
           to_hstring(new_nn(35 downto 4)) & "." & to_hstring(new_nn(3 downto 0))
           & " / $" & to_hstring(new_dd(35 downto 4)) & "." & to_hstring(new_dd(3 downto 0));
         dd <= new_dd;
         nn <= new_nn;
         state <= step;
-        steps_remaining <= 5;
+        steps_remaining <= 9;
         busy <= '1';
       elsif start_over='1' then
         report "Ignoring divide by zero";
@@ -118,4 +116,4 @@
 
     end if;
   end process;
-end wattle_and_daub;
+end wattle_and_daub_new2;

@MJoergen
Copy link
Contributor

MJoergen commented Jul 27, 2024

@Rhialto :

I've modified the unit test, to make it more extensive: https://github.com/MJoergen/mega65-core/blob/development/tests/hardware_divider.vhdl

It now automatically calculates the expected result (including rounding).

It will count the number of cases where the result differs by only 1 in the LSB. Any greater difference is an immediate failure.

I suggest you try this, and please do extend the test with additional values too.

@Rhialto
Copy link

Rhialto commented Jul 27, 2024

Looks nice.

I added test cases

    (25705, 1),
    (25705, 2),
    (25705, 4),
    (25705, 8),
    (25705, 16),
    (25705, 32),
    (25705, 64),
    (25705, 128),
    (25705, 25705),
    (1, 1),
    (128, 128),
    --(16#FFFFFFFF#, 16#FFFFFFFF#),
    --(-1, -1),
    --(16#80000000#, 16#80000000#),
    (16#7FFFFFFF#, 16#7FFFFFFF#)

The compiler didn't accept the commented-out values:

/home/rhialto/mega65/mega65-core/tests/hardware_divider.vhdl:55:6:warning: static expression violates bounds [-Wruntime-error]
    (16#FFFFFFFF#, 16#FFFFFFFF#),
     ^
/home/rhialto/mega65/mega65-core/tests/hardware_divider.vhdl:55:20:warning: static expression violates bounds [-Wruntime-error]
    (16#FFFFFFFF#, 16#FFFFFFFF#),
                   ^

and

/home/rhialto/mega65/mega65-core/tests/vunit_out/test_output/lib.tb_example.Divider_passes_collected_test_values_905730febb0bc6f5f7031248c6a9062eacbaf98c/ghdl/tb_example-tb:error: error during elaboration

The divider should work for these, though.

I applied my second version to yours, and I found that while divisions by powers of 2 are faster, the precision is reduced. It's not so strange, since I essentially trade a lsb for an msb. My version increases the reported round-off errors from 8 to 13 or so.

But while my version somewhat improves the divisions by powers of 2, it leaves another weak spot of the algorithm untouched: numbers divided by themselves, where the result should be 1.0 exactly. I don't know if there is a generic way to improve this case or if it is worth to special-case it.

@lydon42
Copy link
Member

lydon42 commented Jul 27, 2024

From my tests I would now merge @MJoergen 's PR. And as far as I read your comments, @Rhialto , you are fine with the solution, too. Correct?

@Rhialto
Copy link

Rhialto commented Jul 27, 2024

@lydon42 Correct.

lydon42 added a commit that referenced this issue Jul 28, 2024
@lydon42 lydon42 added the staged Ready to be merged with stable master branch. label Jul 28, 2024
@lydon42
Copy link
Member

lydon42 commented Jul 28, 2024

PR#812 merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed bug. staged Ready to be merged with stable master branch.
Projects
None yet
Development

No branches or pull requests

4 participants