Skip to content

Conversation

@avik-pal
Copy link
Collaborator

@avik-pal avik-pal commented Mar 1, 2025

  • add sharding
    • sharding tests
  • get non-sharding tests to pass
  • add ifrt to buildkite runners
  • distributed workflow
  • update precompile workflow
  • reenable precompile workflow

@avik-pal avik-pal force-pushed the ap/ifrt_high_level branch 2 times, most recently from d2fc751 to 97582df Compare March 2, 2025 17:34
Base automatically changed from ap/implicit_padding to main March 4, 2025 02:38
@avik-pal avik-pal force-pushed the ap/ifrt_high_level branch 2 times, most recently from 31f5c5e to 0ecac94 Compare March 4, 2025 02:39
@avik-pal avik-pal changed the base branch from main to ap/ifrt_fix_cpp March 4, 2025 02:40
Base automatically changed from ap/ifrt_fix_cpp to main March 4, 2025 16:49
@avik-pal avik-pal force-pushed the ap/ifrt_high_level branch 11 times, most recently from e84d20e to 073ff00 Compare March 5, 2025 23:25
@avik-pal avik-pal force-pushed the ap/ifrt_high_level branch from c6fa5a6 to 9197461 Compare March 6, 2025 07:36
@avik-pal
Copy link
Collaborator Author

avik-pal commented Mar 6, 2025

Locally even distributed works 🎉

@avik-pal
Copy link
Collaborator Author

avik-pal commented Mar 6, 2025

One thing that doesn't work is IFRT-PJRT doesn't have a proper interface to broadcast data to all processes. Currently we just fill in the ones that is addressable. With the Proxy backend we should be able to fix this

@avik-pal avik-pal requested a review from wsmoses March 6, 2025 07:45
@avik-pal
Copy link
Collaborator Author

avik-pal commented Mar 6, 2025

not sure why the arm ones are failing more consistently than before

@avik-pal avik-pal force-pushed the ap/ifrt_high_level branch from 9197461 to 07985bb Compare March 6, 2025 07:51
@mofeing
Copy link
Collaborator

mofeing commented Mar 6, 2025

Thanks for pushing this @avik-pal

I should have done more for IFRT earlier but I'm incredibly busy

@avik-pal avik-pal force-pushed the ap/ifrt_high_level branch from 07985bb to fd40a0f Compare March 6, 2025 16:15
@avik-pal
Copy link
Collaborator Author

avik-pal commented Mar 6, 2025

I am disabling aarch64 precompilation for now

@avik-pal avik-pal marked this pull request as ready for review March 6, 2025 16:57
@avik-pal avik-pal force-pushed the ap/ifrt_high_level branch 11 times, most recently from aa0cbf1 to c414082 Compare March 8, 2025 04:30
@avik-pal avik-pal force-pushed the ap/ifrt_high_level branch from c414082 to 7cc62f8 Compare March 8, 2025 04:33
@avik-pal
Copy link
Collaborator Author

avik-pal commented Mar 8, 2025

can we test that locally somewhere with a debug jll to see what's happening

I have re-enabled the pre-compilation. But it seems like the failure is independent of the IFRT work (https://github.com/EnzymeAD/Reactant.jl/actions/runs/13731962902/job/38410549236#step:8:330). It was just hidden since the tests passed even when precompilation failed. I have updated how the preferences are set to match the previous behavior

@wsmoses
Copy link
Member

wsmoses commented Mar 9, 2025

we should clearly still fix that, but can move past that for now

Copy link
Member

@wsmoses wsmoses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable to me, but i presume you need to separate jll change

@avik-pal
Copy link
Collaborator Author

avik-pal commented Mar 9, 2025

I ended up not needing the jll change since some of the ifrt sharding types don't implement the whole api.

But I kept it since it still wraps useful functionality

@avik-pal avik-pal merged commit b36f525 into main Mar 9, 2025
67 of 72 checks passed
@avik-pal avik-pal deleted the ap/ifrt_high_level branch March 9, 2025 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants