Results tracking - Paste your results here! #1
Results as of 2024/01/05

M3 family
- M3
- M3 Pro
- M3 Max (3 channel)
- M3 Max (4 channel)
- M3 Ultra (at time of writing, no such product exists)

M2 family
- M2
- M2 Pro
- M2 Max
- M2 Ultra

M1 family
- M1
- M1 Pro
- M1 Max
- M1 Ultra
The last two are questionable, and total power consumption was constantly around 18-20 W vs. up to 40 W when doing the
Ran out of memory? M1 Ultra 128GB
That's strange. Max value tested is edge length = 59049, and 59049² × 8 bytes × 3 matrices would be ~83 GB. Even allowing for a massive transpose matrix, that should add no more than +1/3 of the memory footprint, still well below the 128 GB available. All of the memory allocations (outside of any supplementary matrix transposes done by Accelerate) are done on application start. You'd only "run out" if Accelerate ballooned? I pushed a version to git a few hours ago ( fd7382e ) that takes this into account and divides by 4 instead of 3 when figuring out max edge length. Even then, you'd still end up with a max exponent for 3^N of 10 and testing for 59049 🤔 If you do a run with "only" 64 GB, what do those results look like?
Beyond the memory bug, thanks for running it! Wowed to see ~750 GFLOPS on that machine
M1 Ultra using 64 out of 128 GB. More GFLOPS for ya.
1.1 TFLOPS, let's go! That's (just!) enough to surpass Sandia's ASCI Red, the first ever TFLOP-class supercomputer! To reach that performance mark, ASCI Red needed 850 kW of power for the systems alone, excluding the needs of the rest of the infrastructure. It also needed over 7k sockets! Thank you so much for providing the data!
M2 Max (MacBook Pro 14")
Peak of 651 GFLOPS. EDIT: I built with the recommended flags of
Unsure why it's 20 GFLOPS faster, but I'll take it :)
The new interfaces seem to provide a much bigger bump for very small N. Without checking the instruction stream, I'd assume part of the difference is not firing up the AMX tile for small problem sets: you're better off with lower throughput there, because your latency is so much lower that it comes out in the wash. I'd also assume it's being smarter about when it starts using a single AMX tile, when it switches from the E-core AMX tile to the P-core-attached AMX tile, and when it starts issuing calls to both.
I got these results with M2 Max and 32GB RAM config:
Pretty consistent with @willkill07, but a smidge faster. Hope it helps :)
M3 Max, 128 GB, got 784 max, but maybe because I was doing other stuff? I can try again later. [2, 3, 4, 8, 9, 10, 16, 27, 32, 64, 81, 100, 128, 243, 256, 512, 729, 1000, 1024, 2048, 2187, 4096, 6561, 8192, 10000, 16384, 19683, 32768, 59049]
I entered "72" into the prompt and got: [2, 3, 4, 8, 9, 10, 16, 27, 32, 64, 81, 100, 128, 243, 256, 512, 729, 1000, 1024, 2048, 2187, 4096, 6561, 8192, 10000, 16384, 19683, 32768]. 796 max this time.
I don't have a great place to store different people's results, so let's add them here!
You can place them directly below, ideally within a code tag to make for easier parsing!
Otherwise, feel free to send a Pastebin link, GitHub gist, etc.
Thanks,
-FCLC