Added 48 bit Routines #453

fundudeone · 2023-11-07T02:57:14Z

Added the following 48 bit crt routines:

slland: 348 cycles
sllor: 348 cycles
sllxor: 381 cycles
sllnot: 158 cycles
sllcmpu: 210 cycles
sllcmpzero: 45 cycles
sllshl: varies. It's quite long so it could be rewritten with a loop like shr routines if need be.
sllshru: varies
sllshrs: varies
sllmulu: 1592 cycles

Added: slland: 348 cycles sllor: 348 cycles sllxor: 381 cycles sllnot: 158 cycles sllcmpu: 210 cycles sllcmpzero: 45 cycles sllshl: varies sllshru: varies sllshrs: varies sllmulu: 1592 cycles

src/crt/sllnot.src

Fixed interrupt issue and optimized to 108 cycles

src/crt/sllmulu.src

src/crt/sllshl.src

src/crt/sllshrs.src

src/crt/sllshru.src

src/crt/sllcmpzero.src

src/crt/sllcmpu.src

src/crt/slland.src

No longer backs up alt registers. Disables interrupts during the routine.

Updated documentation and made a very minor optimization

Comparison of de to 0 simplified to comparison of de to hl

runer112 · 2023-11-08T02:07:35Z

Before I really get into review, just checking, does the calling convention for these routines match what ez80-clang emits?

fundudeone · 2023-11-08T02:15:41Z

Before I really get into review, just checking, does the calling convention for these routines match what ez80-clang emits?

Yes, they should work with the Z80_LibCall calling convention (once everything's fixed at least)

Edit: sllcmpu and sllcmpzero both use Z80_LibCall_F actually

Interrupt enabling and disabling now handled conditionally

Now uses ude:uhl for minor shifts instead of uiy:uhl. Now uses stack for large shifts

Condensed routines. Eliminated most of sllshru. Fixed a bug where shifting 0 would cause a shift of 256.

Added slldvrmu for 48 bit unsigned division and modulo (~6000 cycles). This is called from the slldivu and sllremu which are in one file as they share cleanup behavior.

mateoconlechuga · 2023-11-25T04:59:00Z

I don't like the fact that some CRT routines are disabling interrupts... but we can't really use them anyway so maybe I'm less conflcted 🙃

runer112

I only reviewed a small portion of the changes so far. Before I continue, we really ought to get @jacobly0's support to (a) confirm the signature expected for each of these functions and (b) produce a version of ez80-clang that actually emits calls to dedicated 48-bit math functions so they can be tested.

runer112 · 2023-12-06T18:17:17Z

src/crt/slland.src

+	; Push 2/3 done values
+	push iy
+	ld iy, 0
+	add iy, sp
+	push de
+	push hl
+	push bc
+
+	; Deal with Upper Byte
+
+	ld a, (iy-1) ; a = u[de] (modified)
+	and a, (iy+2) ; a &= u[iy]
+	ld (iy-1), a ; u[de] (modified) = a
+
+	ld a, (iy-4) ; a = u[hl] (modified)
+	and a, (iy-7) ; a &= u[bc]
+	ld (iy-4), a ; u[hl] (modified) = a
+
+	; Load values back into registers and clean up stack
+	pop bc
+	pop hl
+	pop de
+	pop iy


This optimization seems valid? -10 bytes, -10F cycles.

Suggested change

; Push 2/3 done values

push iy

ld iy, 0

add iy, sp

push de

push hl

push bc

; Deal with Upper Byte

ld a, (iy-1) ; a = u[de] (modified)

and a, (iy+2) ; a &= u[iy]

ld (iy-1), a ; u[de] (modified) = a

ld a, (iy-4) ; a = u[hl] (modified)

and a, (iy-7) ; a &= u[bc]

ld (iy-4), a ; u[hl] (modified) = a

; Load values back into registers and clean up stack

pop bc

pop hl

pop de

pop iy

; Push 2/3 done values

push de

push iy

push hl

sbc hl, hl

add hl, sp ; uhl = &l

push bc

; Deal with Upper Byte

ld de, 3

dec hl

ld a, (hl) ; a = bcu

add hl, de

and a, (hl) ; a &= hlu

ld (hl), a ; hlu = a

add hl, de

ld a, (hl) ; a = iyu

add hl, de

and a, (hl) ; a &= deu

ld (hl), a ; deu = a

; Load values back into registers and clean up stack

pop bc

pop hl

pop iy

pop de

If valid, the same pattern can be applied to sllor and sllxor, of course.

Sorry this took me so long to get back to. This makes sense and works with everything I've thrown at it, so go ahead with or and xor.

src/crt/sllcmpzero.src

calc84maniac · 2023-12-14T20:28:46Z

Just FYI, support has now been added to the compiler for these routines, and they use an i48 naming scheme rather than sll. I've added stub implementations for these in this file which ideally should make the routines "work" (suboptimally) by wrapping the 64-bit routines, and those can be replaced with full implementations via this PR.

My observations about the libcall calling convention (hopefully accurate, haven't confirmed but should be testable now) are:
Routines that take one 48-bit parameter: input in UDEUHL
Routines that take two 48-bit parameters: inputs in UDEUHL, UIYUBC
Routines that take one 48-bit parameter and one 8-bit parameter: inputs in UDEUHL, C
Routines that return one 48-bit result: output in UDEUHL
Routines that return two 48-bit results (just __i48dvrmu): output in UDEUHL, UIYUBC
Routines that return one 8-bit result: output in A
__i48cmpu outputs C and Z flags
__i48cmpzero outputs S and Z flags (just fixed the 32-bit implementation based on this information)
__i48cmps outputs S and V flags (does it also need the Z flag? probably not, the 32-bit and 64-bit implementations seem to indicate no, and the LLVM implementation seems to reorder comparisons to avoid explicit equality checks)

Updated file names and documentation to use i48 instead of sll

i48cmpu has been changed to the old commented out version which doesn't use arguments off stack. Also minor optimization.

Removed multiple stack uses, now about 100 cycles

Changed i48dvrmu to put remainder in uiy:ubc rather than on the stack. i48dvrmu now restores interrupts and fully cleans up rather than partially relying on i48divu and i48remu.

Added i48cmps for signed int48_t comparison. 105 cycles.

Added i48bitrev, i48bswap, i48ctlz, i48popcnt. Significant optimization on i48bitrev and i48bswap may be possible. i48bitrev just does ibitrev twice without restoring af until the end, but an entirely new approach may perform better. i48bswap could be improved by using sp directly to manipulate stack values rather than hl (as lbswap does), but I was warned against doing this on prior routines.

Switched the order of register comparison so that routine outputs a sign flag matching the sign of the number.

i48neg contains routines for getting the negative of both uhl:ude and uiy:ubc. Signed division and remainder are about 10-500 more cycles than unsigned.

Cycles per loop is now 52 instead of 143. Base cost is slightly higher however (80 --> 279 cycles).

Added 48 bit Routines

23079e3

Added: slland: 348 cycles sllor: 348 cycles sllxor: 381 cycles sllnot: 158 cycles sllcmpu: 210 cycles sllcmpzero: 45 cycles sllshl: varies sllshru: varies sllshrs: varies sllmulu: 1592 cycles

mateoconlechuga requested review from calc84maniac, runer112, mateoconlechuga and jacobly0 November 7, 2023 03:21