Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore AVX2 via 2xNEON support for Apple silicon #330

Conversation

Developer-Ecosystem-Engineering
Copy link
Contributor

- Restore changes from Lighttransport/embree-aarch64 related to AVX2
- Add bootstrap scripts for linux/arm64 and mac/arm64
@cbenthin
Copy link
Contributor

cbenthin commented Aug 5, 2021

We actually experimented with 8-wide SIMD code paths on Apple silicon a couple of months back but the 4-wide path provided better performance. Do you have additional data that shows that the 8-wide path is better than the 4-wide?

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

We actually experimented with 8-wide SIMD code paths on Apple silicon a couple of months back but the 4-wide path provided better performance. Do you have additional data that shows that the 8-wide path is better than the 4-wide?

Will run some benchmarks and provide results, standby.

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

We actually experimented with 8-wide SIMD code paths on Apple silicon a couple of months back but the 4-wide path provided better performance. Do you have additional data that shows that the 8-wide path is better than the 4-wide?

We see about 8% improvement in BMW scene when rendered in Blender using these changes. Note Cycles is currently setting RTC_SCENE_FLAG_COMPACT which means it would ignore bvh8, if thats patched these improvements can be verified.

269s with changes vs 292s without changes

SIMD256 was being disabled which was prohibiting projects from benefiting from these changes.

Also minor update to CMAKE so by default it will support AVX2 without additional flags on AS_MAC.
@Developer-Ecosystem-Engineering
Copy link
Contributor Author

Minor change to enable SIMD256 on CPU::ARM. With this + RTC_SCENE_FLAG_COMPACT disabled in Cycles, you should be able to see the performance improvements directly.

@cbenthin
Copy link
Contributor

cbenthin commented Sep 8, 2021

Thanks, we'll have a look at it. However, there needs to be a consistent speedup across many apps and scenes to warrant a change.

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

Thanks, we'll have a look at it. However, there needs to be a consistent speedup across many apps and scenes to warrant a change.

How many apps/scenes would be sufficient?
Thanks!

@svenwoop
Copy link
Contributor

I did just try to integrate your pull request into Embree and run all of our CI testing. I created a branch neon_avx_emulation on github which also includes my changes. To test this just enable SSE2, SSE4.2, AVX and AVX2 ISA in cmake.
I see a speedup of about 8% which is great, however, a lot of codepaths are broken. Ray packets seem broken entirely and even if I disable them I get these failures:

viewer_curves_round_line_segments_0 (Failed)
35 - viewer_curves_round_line_segments_1 (Failed)
36 - viewer_curves_round_line_segments_2 (Failed)
37 - viewer_curves_round_line_segments_3 (Failed)
38 - viewer_curves_round_line_segments_4 (Failed)
39 - viewer_curves_round_line_segments_5 (Failed)
40 - viewer_curves_round_line_segments_6 (Failed)
41 - viewer_curves_round_line_segments_7 (Failed)
42 - viewer_curves_round_line_segments_8 (Failed)
43 - viewer_curves_round_line_segments_9 (Failed)
44 - viewer_curves_curve1 (Failed)
45 - viewer_curves_oriented_curve1 (Failed)
46 - viewer_points_points (Failed)
51 - viewer_msmblur_curves_msmblur (Failed)
52 - viewer_msmblur_lines_msmblur (Failed)
60 - pathtracer_curves_curve1 (Failed)
61 - pathtracer_curves_oriented_curve1 (Failed)
62 - pathtracer_points_points (Failed)
67 - pathtracer_msmblur_curves_msmblur (Failed)
68 - pathtracer_msmblur_lines_msmblur (Failed)
73 - hair_geometry (Failed)
76 - grid_geometry (Failed)
80 - motion_blur_geometry (Failed)
81 - interpolation (Failed)
82 - curve_geometry (Subprocess aborted)
83 - point_geometry (Failed)
85 - closest_point (Failed)

We cannot accept the pull request with so many tests broken. Would you be willing looking into this? At appears curve support is the cause for most of these failures.

As a first step one could look into this failing rendering where the curve is missing entirely:

./viewer -c ../tutorials/models/curve_round.ecs

There are also other models in the tutorials/models folder to test.

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

Ah! Apologies, we will investigate, absolutely!

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

We are still working on this, found some other issues that needed investigating

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

Have some changes ready, working on reviews and rebasing, will update the PR when completed.

@wjakob
Copy link

wjakob commented Jan 31, 2022

Dear all,

I also gave this a test with Mitsuba and saw a 7% speedup in a benchmark. However, tracing 4 or 8-wide packets crashes the renderer when rebased onto this branch. Some parts of the build system changes didn't make sense to me -- where are the CMake variables DIST_ARCH and AS_MAC supposed to come from? For example, I don't think CMake sets DIST_ARCH by default.

Thanks,
Wenzel

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

We are working on approval for the latest change set to resolve the outstanding feedback.

- Restore changes from Lighttransport/embree-aarch64 related to AVX2 (Neon2x for Apple silicon lighttransport/embree-aarch64#34)
- Add bootstrap scripts for linux/arm64 and mac/arm64
- Restore a couple missed PRs from the lighttransport merge around threading
- Fix broken tests highlighted by @svenwoop
@Developer-Ecosystem-Engineering
Copy link
Contributor Author

@svenwoop Investigating the reported failures revealed several prior fixes were missing, which we added back. Believe we've resolved the outstanding issues at this time, apologies for the delay

@svenwoop
Copy link
Contributor

I pushed a branch 330-restore-AVX2-via-2xNEON which fixes some issues with that pull request. In particular the 8-wide BVH was not used, and disabling the EMBREE_NEON_AVX2_EMULATION caused compile issues on Apple platform. You can now also disable EMBREE_NEON_AVX2_EMULATION in cmake.

The code now passes all of our CI tests, thanks for fixing the issues.

The performance difference I experience is just 3-4%, could you please verify this. With crown model from here https://github.com/embree/models/releases/download/release/crown.zip I get these numbers:

./viewer -c crown/crown.ecs --benchmark 4 100 --verbose 2
AVX2_EMU_ON: 38.14 fps avg
AVX2_EMU_OFF: 37.00fps avg

Is that the expected performance different, or should the speedup be higher? I remember once having measured 8%, but I cannot reproduce this with current code.

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

Hi @svenwoop, here are our results using ./viewer -c crown/crown.ecs --benchmark 4 100 --verbose 2 on various configs across the three variations

Master 330-PR-NEON-OFF 330-PR-NEON-ON
M1 8 Core
Average Render FPS 35.14 36.79 38.11
Improvement vs master 4.68% 8.44%
Improvement vs NEON Off 3.59%
M1 Pro 10 Core
Average Render FPS 51.14 55.58 57.48
Improvement vs master 8.69% 12.41%
Improvement vs NEON Off 3.42%
M1 Ultra 20 Core
Average Render FPS 101.41 110.64 115.06
Improvement vs master 9.10% 13.46%
Improvement vs NEON Off 4.00%

@wjakob
Copy link

wjakob commented Mar 18, 2022

Really curious about one thing: it makes sense that FPS values are higher on the larger M1 CPUs, but why are the speedups bigger? Is vectorization somehow becoming more effective there?

Another question I would have is whether these improvements mainly apply to scalar ray tracing, or also packet ray tracing?

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

why are the speedups bigger?

Performance vs Efficiency core ratios

Another question I would have is whether these improvements mainly apply to scalar ray tracing, or also packet ray tracing?

Ray packets performance is highly dependent on how the "packet" behaves to the various conditions along the compute path. In general the same speed up is expected, with higher variance.

@DaveHarwoodNZ
Copy link

Would utilising AMX2 matrix multiplication improve ray tracing performance? If so, would this be considered legitimate for a CPU-bound renderer?

@wjakob
Copy link

wjakob commented Jun 2, 2022

Are there outstanding issues to be resolved with this PR, or can it be merged?

@svenwoop
Copy link
Contributor

svenwoop commented Jun 2, 2022

I want to spend a day to clean some things up, but generally the merge request is fine. I will try to get this done next week.

@Developer-Ecosystem-Engineering
Copy link
Contributor Author

Thanks @svenwoop and everyone else for your patience, let us know if you need anything else to integrate these changes.

@svenwoop
Copy link
Contributor

I did do some cleanups over the merge request and we will release this improvement soon. I updated the 330-restore-AVX2-via-2xNEON branch.
I found one issue with the merge request that did cause performance issues, the non fused mad instruction was used due to some define in avx2neon.h. With that issue fixed I see now much better performance:

viewer, crown model, 1024x1024:
NEON baseline 36.0fps
NEON improved 37.7fps
NEON2X 40.8fps

You see that the NEON2X gives now an 8% improvement over using just NEON, and a 13% improvement over the NEON version that is in current Embree.

@svenwoop svenwoop closed this Jun 14, 2022
@wjakob
Copy link

wjakob commented Jun 29, 2022

Just in case it is of interests, also some data points from me. This is the speedup on the 'staircase' scene from Benedikt Bitterli's scene repository, rendered with the upcoming Mitsuba 3.

out

NEON is always active, and the _old vs _new bars correspond to embree versions before/after this set of changes (including @svenwoop's improvements). The percentage values quantify the relative difference between the _old and _new bars.

This plot benchmarks two different variants of the system. scalar refers to classic precompiled single-ray tracing, which gives a speedup of ~8%. That's a little smaller than the 13% which @svenwoop reported above, not sure why.

The llvm_* bars refers to a 4-wide vectorized/packet ray tracer that is just-in-time compiled for this scene. Curiously, the impact of the patch is more significant for this configuration reaching >17%. I would not have expected this, since those ray packets are expected to be quite divergent after the first bounce.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants