Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added 48 bit Routines #453

Open
wants to merge 21 commits into
base: master
Choose a base branch
from
Open

Conversation

fundudeone
Copy link

Added the following 48 bit crt routines:

slland: 348 cycles
sllor: 348 cycles
sllxor: 381 cycles
sllnot: 158 cycles
sllcmpu: 210 cycles
sllcmpzero: 45 cycles
sllshl: varies. It's quite long so it could be rewritten with a loop like shr routines if need be.
sllshru: varies
sllshrs: varies
sllmulu: 1592 cycles

Added:

slland: 348 cycles
sllor: 348 cycles
sllxor: 381 cycles
sllnot: 158 cycles
sllcmpu: 210 cycles
sllcmpzero: 45 cycles
sllshl: varies
sllshru: varies
sllshrs: varies
sllmulu: 1592 cycles
src/crt/sllnot.src Outdated Show resolved Hide resolved
Fixed interrupt issue and optimized to 108 cycles
src/crt/sllmulu.src Outdated Show resolved Hide resolved
src/crt/sllshl.src Outdated Show resolved Hide resolved
src/crt/sllshl.src Outdated Show resolved Hide resolved
src/crt/sllshrs.src Outdated Show resolved Hide resolved
src/crt/sllshrs.src Outdated Show resolved Hide resolved
src/crt/sllshrs.src Outdated Show resolved Hide resolved
src/crt/sllshru.src Outdated Show resolved Hide resolved
src/crt/sllshru.src Outdated Show resolved Hide resolved
src/crt/sllcmpzero.src Outdated Show resolved Hide resolved
src/crt/sllcmpu.src Outdated Show resolved Hide resolved
src/crt/slland.src Outdated Show resolved Hide resolved
No longer backs up alt registers. Disables interrupts during the routine.
Updated documentation and made a very minor optimization
Comparison of de to 0 simplified to comparison of de to hl
@runer112
Copy link
Member

runer112 commented Nov 8, 2023

Before I really get into review, just checking, does the calling convention for these routines match what ez80-clang emits?

@fundudeone
Copy link
Author

fundudeone commented Nov 8, 2023

Before I really get into review, just checking, does the calling convention for these routines match what ez80-clang emits?

Yes, they should work with the Z80_LibCall calling convention (once everything's fixed at least)

Edit: sllcmpu and sllcmpzero both use Z80_LibCall_F actually

Interrupt enabling and disabling now handled conditionally
Now uses ude:uhl for minor shifts instead of uiy:uhl. Now uses stack for large shifts
Condensed routines. Eliminated most of sllshru. Fixed a bug where shifting 0 would cause a shift of 256.
Added slldvrmu for 48 bit unsigned division and modulo (~6000 cycles). This is called from the slldivu and sllremu which are in one file as they share cleanup behavior.
@mateoconlechuga
Copy link
Member

I don't like the fact that some CRT routines are disabling interrupts... but we can't really use them anyway so maybe I'm less conflcted 🙃

@runer112 runer112 self-assigned this Dec 6, 2023
Copy link
Member

@runer112 runer112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed a small portion of the changes so far. Before I continue, we really ought to get @jacobly0's support to (a) confirm the signature expected for each of these functions and (b) produce a version of ez80-clang that actually emits calls to dedicated 48-bit math functions so they can be tested.

Comment on lines 37 to 59
; Push 2/3 done values
push iy
ld iy, 0
add iy, sp
push de
push hl
push bc

; Deal with Upper Byte

ld a, (iy-1) ; a = u[de] (modified)
and a, (iy+2) ; a &= u[iy]
ld (iy-1), a ; u[de] (modified) = a

ld a, (iy-4) ; a = u[hl] (modified)
and a, (iy-7) ; a &= u[bc]
ld (iy-4), a ; u[hl] (modified) = a

; Load values back into registers and clean up stack
pop bc
pop hl
pop de
pop iy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This optimization seems valid? -10 bytes, -10F cycles.

Suggested change
; Push 2/3 done values
push iy
ld iy, 0
add iy, sp
push de
push hl
push bc
; Deal with Upper Byte
ld a, (iy-1) ; a = u[de] (modified)
and a, (iy+2) ; a &= u[iy]
ld (iy-1), a ; u[de] (modified) = a
ld a, (iy-4) ; a = u[hl] (modified)
and a, (iy-7) ; a &= u[bc]
ld (iy-4), a ; u[hl] (modified) = a
; Load values back into registers and clean up stack
pop bc
pop hl
pop de
pop iy
; Push 2/3 done values
push de
push iy
push hl
sbc hl, hl
add hl, sp ; uhl = &l
push bc
; Deal with Upper Byte
ld de, 3
dec hl
ld a, (hl) ; a = bcu
add hl, de
and a, (hl) ; a &= hlu
ld (hl), a ; hlu = a
add hl, de
ld a, (hl) ; a = iyu
add hl, de
and a, (hl) ; a &= deu
ld (hl), a ; deu = a
; Load values back into registers and clean up stack
pop bc
pop hl
pop iy
pop de

If valid, the same pattern can be applied to sllor and sllxor, of course.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this took me so long to get back to. This makes sense and works with everything I've thrown at it, so go ahead with or and xor.

src/crt/sllcmpzero.src Outdated Show resolved Hide resolved
@runer112 runer112 removed their assignment Dec 6, 2023
@calc84maniac
Copy link
Contributor

calc84maniac commented Dec 14, 2023

Just FYI, support has now been added to the compiler for these routines, and they use an i48 naming scheme rather than sll. I've added stub implementations for these in this file which ideally should make the routines "work" (suboptimally) by wrapping the 64-bit routines, and those can be replaced with full implementations via this PR.

My observations about the libcall calling convention (hopefully accurate, haven't confirmed but should be testable now) are:
Routines that take one 48-bit parameter: input in UDEUHL
Routines that take two 48-bit parameters: inputs in UDEUHL, UIYUBC
Routines that take one 48-bit parameter and one 8-bit parameter: inputs in UDEUHL, C
Routines that return one 48-bit result: output in UDEUHL
Routines that return two 48-bit results (just __i48dvrmu): output in UDEUHL, UIYUBC
Routines that return one 8-bit result: output in A
__i48cmpu outputs C and Z flags
__i48cmpzero outputs S and Z flags (just fixed the 32-bit implementation based on this information)
__i48cmps outputs S and V flags (does it also need the Z flag? probably not, the 32-bit and 64-bit implementations seem to indicate no, and the LLVM implementation seems to reorder comparisons to avoid explicit equality checks)

Updated file names and documentation to use i48 instead of sll
i48cmpu has been changed to the old commented out version which doesn't use arguments off stack. Also minor optimization.
Removed multiple stack uses, now about 100 cycles
Changed i48dvrmu to put remainder in uiy:ubc rather than on the stack. i48dvrmu now restores interrupts and fully cleans up rather than partially relying on i48divu and i48remu.
Added i48cmps for signed int48_t comparison. 105 cycles.
Added i48bitrev, i48bswap, i48ctlz, i48popcnt. Significant optimization on i48bitrev and i48bswap may be possible. i48bitrev just does ibitrev twice without restoring af until the end, but an entirely new approach may perform better.  i48bswap could be improved by using sp directly to manipulate stack values rather than hl (as lbswap does), but I was warned against doing this on prior routines.
Switched the order of register comparison so that routine outputs a sign flag matching the sign of the number.
i48neg contains routines for getting the negative of both uhl:ude and uiy:ubc. Signed division and remainder are about 10-500 more cycles than unsigned.
Cycles per loop is now 52 instead of 143. Base cost is slightly higher however (80 --> 279 cycles).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

5 participants