Added 48 bit Routines #453
base: master
Conversation
Added: slland: 348 cycles, sllor: 348 cycles, sllxor: 381 cycles, sllnot: 158 cycles, sllcmpu: 210 cycles, sllcmpzero: 45 cycles, sllshl: varies, sllshru: varies, sllshrs: varies, sllmulu: 1592 cycles
Fixed interrupt issue and optimized to 108 cycles
No longer backs up alt registers. Disables interrupts during the routine.
Updated documentation and made a very minor optimization
Comparison of de to 0 simplified to comparison of de to hl
Before I really get into review, just checking: does the calling convention for these routines match what ez80-clang emits?
Yes, they should work with the Z80_LibCall calling convention (once everything's fixed, at least). Edit: sllcmpu and sllcmpzero both use Z80_LibCall_F, actually.
Interrupt enabling and disabling now handled conditionally
Now uses ude:uhl for minor shifts instead of uiy:uhl. Now uses stack for large shifts
Condensed routines. Eliminated most of sllshru. Fixed a bug where shifting 0 would cause a shift of 256.
Added slldvrmu for 48 bit unsigned division and modulo (~6000 cycles). This is called from slldivu and sllremu, which live in one file as they share cleanup behavior.
I don't like the fact that some CRT routines are disabling interrupts... but we can't really use them anyway, so maybe I'm less conflicted 🙃
I only reviewed a small portion of the changes so far. Before I continue, we really ought to get @jacobly0's support to (a) confirm the signature expected for each of these functions and (b) produce a version of ez80-clang that actually emits calls to dedicated 48-bit math functions so they can be tested.
src/crt/slland.src
Outdated
```asm
; Push 2/3 done values
	push	iy
	ld	iy, 0
	add	iy, sp
	push	de
	push	hl
	push	bc

; Deal with Upper Byte

	ld	a, (iy-1)	; a = u[de] (modified)
	and	a, (iy+2)	; a &= u[iy]
	ld	(iy-1), a	; u[de] (modified) = a

	ld	a, (iy-4)	; a = u[hl] (modified)
	and	a, (iy-7)	; a &= u[bc]
	ld	(iy-4), a	; u[hl] (modified) = a

; Load values back into registers and clean up stack
	pop	bc
	pop	hl
	pop	de
	pop	iy
```
This optimization seems valid? -10 bytes, -10F cycles.
Current:

```asm
; Push 2/3 done values
	push	iy
	ld	iy, 0
	add	iy, sp
	push	de
	push	hl
	push	bc
; Deal with Upper Byte
	ld	a, (iy-1)	; a = u[de] (modified)
	and	a, (iy+2)	; a &= u[iy]
	ld	(iy-1), a	; u[de] (modified) = a
	ld	a, (iy-4)	; a = u[hl] (modified)
	and	a, (iy-7)	; a &= u[bc]
	ld	(iy-4), a	; u[hl] (modified) = a
; Load values back into registers and clean up stack
	pop	bc
	pop	hl
	pop	de
	pop	iy
```

Suggested:

```asm
; Push 2/3 done values
	push	de
	push	iy
	push	hl
	sbc	hl, hl
	add	hl, sp	; uhl = &l
	push	bc
; Deal with Upper Byte
	ld	de, 3
	dec	hl
	ld	a, (hl)	; a = bcu
	add	hl, de
	and	a, (hl)	; a &= hlu
	ld	(hl), a	; hlu = a
	add	hl, de
	ld	a, (hl)	; a = iyu
	add	hl, de
	and	a, (hl)	; a &= deu
	ld	(hl), a	; deu = a
; Load values back into registers and clean up stack
	pop	bc
	pop	hl
	pop	iy
	pop	de
```
If valid, the same pattern can be applied to sllor and sllxor, of course.
Sorry this took me so long to get back to. This makes sense and works with everything I've thrown at it, so go ahead with or and xor.
Just FYI, support has now been added to the compiler for these routines. My observations about the libcall calling convention (hopefully accurate; haven't confirmed, but should be testable now) are:
Updated file names and documentation to use i48 instead of sll
i48cmpu has been changed to the old commented out version which doesn't use arguments off stack. Also minor optimization.
Removed multiple stack uses, now about 100 cycles
Changed i48dvrmu to put remainder in uiy:ubc rather than on the stack. i48dvrmu now restores interrupts and fully cleans up rather than partially relying on i48divu and i48remu.
Added i48cmps for signed int48_t comparison. 105 cycles.
Added i48bitrev, i48bswap, i48ctlz, i48popcnt. Significant optimization on i48bitrev and i48bswap may be possible. i48bitrev just does ibitrev twice without restoring af until the end, but an entirely new approach may perform better. i48bswap could be improved by using sp directly to manipulate stack values rather than hl (as lbswap does), but I was warned against doing this on prior routines.
Switched the order of register comparison so that routine outputs a sign flag matching the sign of the number.
Force-pushed from 88cf7d0 to c6152ff.
i48neg contains routines for negating both uhl:ude and uiy:ubc. Signed division and remainder cost about 10 to 500 more cycles than unsigned.
Cycles per loop is now 52 instead of 143. Base cost is slightly higher, however (80 → 279 cycles).
Added the following 48 bit crt routines:
slland: 348 cycles
sllor: 348 cycles
sllxor: 381 cycles
sllnot: 158 cycles
sllcmpu: 210 cycles
sllcmpzero: 45 cycles
sllshl: varies. It's quite long, so it could be rewritten with a loop like the shr routines if need be.
sllshru: varies
sllshrs: varies
sllmulu: 1592 cycles