chriselrod
Collaborator

No description provided.

@chriselrod chriselrod requested a review from DilumAluthge May 6, 2021 02:02
@chriselrod
Collaborator Author

chriselrod commented May 6, 2021

Two items:

  1. The M1 seems to have higher threading overhead than x86 chips, so I increased the threading threshold for non-x86 architectures. Until I've tried other ARM CPUs or Power, I figured it's safer to go with the higher threshold. I've already done the same thing in LoopVectorization.
  2. The M1 has two cache levels: a 64 KiB L1D and a 4 MiB L2, both core-local. These are massive! For comparison, Haswell, the first AVX2 chip, has a 32 KiB L1D and a 0.25 MiB L2, while Tiger Lake, Intel's latest, has a 48 KiB L1D and a 1.25 MiB L2. More to the point, the x86 chips I tested on have three cache levels: the first two core-local (like the M1's), and a third that is shared. On the x86 chips, A (in C = A*B) is blocked to fit in the L2 cache, while B is blocked in the L3 cache and shared among threads (as the L3 cache is shared). On Octavian master with the M1, A and B were instead blocked to fit in the L1 and L2 caches, respectively. This was bad for two reasons: the M1's L1D cache, while very large at 64 KiB, is still much smaller than an L2 cache, and blocking a shared B in the L2 caches makes no sense because they are not shared between cores. Therefore, the updated behavior matches three-cache-level x86: block A in the L2 (allowing tremendous reuse, thanks to the L2's size), and leave B in a higher level. Unfortunately, that higher level on the M1 isn't a cache but system RAM, so more testing and optimization will eventually be needed to figure out the best thing to do there. My guess: it's probably best not to block B at all.
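
As a rough illustration of the two items above, here is a minimal Python sketch. It is not Octavian's code. The names `CacheInfo`, `blocking_plan`, and `threading_threshold`, the `l2_fraction` parameter, the base threshold, the non-x86 multiplier, and the Haswell L3 size are all hypothetical placeholders. Only the L1D and L2 sizes come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

KiB = 1024
MiB = 1024 * KiB


@dataclass(frozen=True)
class CacheInfo:
    """Cache sizes in bytes; l3 is None when there is no shared third level."""
    l1d: int
    l2: int
    l3: Optional[int]


def blocking_plan(cache: CacheInfo, elem_bytes: int = 8,
                  l2_fraction: float = 0.5) -> Tuple[int, str]:
    """Return (max elements in a block of A, where B is kept).

    A is always blocked for the core-local L2. B goes to the shared L3
    when one exists, and otherwise stays in RAM (the M1 case).
    l2_fraction is a made-up tuning knob, not an Octavian parameter.
    """
    a_block_elems = int(cache.l2 * l2_fraction) // elem_bytes
    b_home = "L3" if cache.l3 is not None else "RAM"
    return a_block_elems, b_home


def threading_threshold(is_x86: bool, base: int = 100,
                        non_x86_factor: int = 2) -> int:
    """Use a higher threshold off x86; base and factor are placeholders."""
    return base if is_x86 else base * non_x86_factor


# L1D/L2 sizes as quoted above; the Haswell L3 size is a hypothetical value.
M1 = CacheInfo(l1d=64 * KiB, l2=4 * MiB, l3=None)
HASWELL = CacheInfo(l1d=32 * KiB, l2=256 * KiB, l3=8 * MiB)

if __name__ == "__main__":
    print(blocking_plan(M1))
    print(blocking_plan(HASWELL))
    print(threading_threshold(is_x86=False))
```

With these placeholder settings, the M1 entry leaves room for 262144 `Float64` elements per block of A, versus 16384 for the Haswell entry. That 16x gap is the extra reuse that blocking A in the M1's L2 is meant to exploit.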

Unfortunately, Octavian isn't going to compete with Accelerate anytime soon, because Octavian uses NEON, while Accelerate uses Apple's secret AMX matrix instructions.

@codecov

codecov bot commented May 6, 2021

Codecov Report

Merging #82 (02149f7) into master (9457cdb) will decrease coverage by 1.85%.
The diff coverage is 42.10%.

@@            Coverage Diff             @@
##           master      #82      +/-   ##
==========================================
- Coverage   86.54%   84.69%   -1.86%     
==========================================
  Files          10       10              
  Lines         565      575      +10     
==========================================
- Hits          489      487       -2     
- Misses         76       88      +12     

| Impacted Files | Coverage Δ |
|---|---|
| src/global_constants.jl | 50.00% <0.00%> (-16.67%) ⬇️ |
| src/matmul.jl | 89.25% <80.00%> (-0.18%) ⬇️ |
| src/block_sizes.jl | 94.91% <0.00%> (-1.70%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9457cdb...02149f7.

@chriselrod
Collaborator Author

The B-blocking optimizations can come in a separate PR. I'll merge this for now after tests pass.

@chriselrod chriselrod enabled auto-merge (squash) May 6, 2021 02:27
@chriselrod chriselrod merged commit 2dd77ea into master May 6, 2021
@chriselrod chriselrod deleted the m1threadthresh branch May 6, 2021 02:41