Fix ray conflict changes #100

FanhaiLu1 · 2024-05-24T00:53:58Z

This PR makes the following changes:

1: Move torch_xla2.default_env() into a function. jax_mode = torch_xla2.default_env() blocks the JAX multi-controller setup in the init state.
2: The Ray engine is created differently from the one used by the default run server; it will have prefill and decode engines later.
3: Remove the duplicated JetEngineEnvironment.
4: shard_on_batch and ragged attention are not supported in Ray multiple host for now.

wang2yn84

Can you help me understand how jax_mode = torch_xla2.default_env() blocks the JAX multi-controller setup in the init state?

jetstream_pt/ray_worker.py

run_interactive_multiple_host.py

wang2yn84 · 2024-05-24T22:36:42Z

Can you help me understand how jax_mode = torch_xla2.default_env() blocks the JAX multi-controller setup in the init state?

Can you help me understand why that is?

FanhaiLu1 · 2024-05-24T22:57:35Z

Can you help me understand how jax_mode = torch_xla2.default_env() blocks the JAX multi-controller setup in the init state?

Can you help me understand why that is?

There is a JAX call under this function (or deeper). Any JAX function call tries to initialize the multi-controller environment (through an MPI barrier), which means it has to wait until all the chips are ready. So in Ray multiple host, if there is a JAX function call on the head node, it waits for all the chips to be ready, but only the head node's chip is ready at that point, so the whole application gets stuck there.

For the current use case, this happens as soon as the Ray head loads the class: it calls JAX and gets stuck there before it even starts executing the main function.
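To make the difference concrete, here is a minimal sketch of deferring an initializer from load time to first use. It does not use torch_xla2, JAX, or Ray: fake_default_env, get_env, and run_worker_step are hypothetical names made up for illustration, and fake_default_env only stands in for a call like torch_xla2.default_env() that might block in a multiple host setup. It is not the code from this PR.

```python
import functools

# Records each time the stand-in initializer runs, so the ordering is visible.
init_calls = []


def fake_default_env():
    """Hypothetical stand-in for a call such as torch_xla2.default_env().

    In the multiple host case described above, this is where the process
    could block until every host is ready. Here it only records that it ran.
    """
    init_calls.append("default_env")
    return {"name": "fake-env"}


# Problematic pattern, left as a comment so this sketch does not execute it:
#   jax_mode = fake_default_env()   # would run as soon as the module is loaded


@functools.lru_cache(maxsize=None)
def get_env():
    """Create the environment on first use instead of at load time."""
    return fake_default_env()


def run_worker_step():
    """Example caller: the environment is only created once work starts."""
    env = get_env()
    return env["name"]
```

With this layout, loading the module does not trigger the initializer. It runs on the first run_worker_step() call, and the cached result is reused after that.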

FanhaiLu1 added 3 commits May 24, 2024 00:25

Fixed multiple host bugs

b36bae9

remove shard_on_batch and ragged_mha

f22c07b

lint fix

0236508

FanhaiLu1 requested review from bhavya01, lsy323, qihqi and wang2yn84 May 24, 2024 15:33

wang2yn84 reviewed May 24, 2024


jetstream_pt/ray_worker.py

run_interactive_multiple_host.py

run_interactive_multiple_host.py

lsy323 approved these changes May 24, 2024


FanhaiLu1 merged commit 2880904 into AI-Hypercomputer:main May 28, 2024

3 participants
