Integrates ragged attention to JetStream Pytorch #93
Conversation
mask,
start,
input_pos,
pre_batch,
A few questions: what are pre_batch and pre_block?
Also, should gemma/model_exportable.py be modified as well?
Thanks for pointing this out! Pushed commits to update the Gemma model.
slot,
)

def precompute_ragged_block_indices(self, decode_state: DecodeState):
Nit: if you don't need any other attributes from decode_state besides start and input_pos, pass start and input_pos directly instead of the heavy decode_state object.
Only start and input_pos are used, but since we only pass a reference to decode_state, it should not affect performance at all.
sounds good
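For context, here is a minimal sketch of the two calling conventions discussed above. The field names, shapes, and block size are illustrative assumptions rather than the repo's actual definitions; the point is simply that only start and input_pos are read, and that passing the whole state passes a reference, never a copy.

```python
from typing import NamedTuple, Tuple
import jax.numpy as jnp


class DecodeState(NamedTuple):
  # Only the last two fields are read by the helpers below; the others stand in
  # for the "heavy" per-slot buffers mentioned in the review comment.
  tokens: jnp.ndarray
  caches: Tuple
  start: jnp.ndarray      # [B] per-slot start position in the KV ring buffer
  input_pos: jnp.ndarray  # [B] per-slot current decode position


def precompute_ragged_block_indices(start, input_pos, bk: int = 128):
  # Number of full KV blocks between start and input_pos (bk is an assumed
  # block size used only for this illustration).
  return (input_pos - start) // bk


def precompute_from_state(decode_state: DecodeState, bk: int = 128):
  # Same computation, but taking the whole state; only a reference is passed.
  return (decode_state.input_pos - decode_state.start) // bk


state = DecodeState(
    tokens=jnp.zeros((8, 1), dtype=jnp.int32),
    caches=(),
    start=jnp.zeros((8,), dtype=jnp.int32),
    input_pos=jnp.full((8,), 256, dtype=jnp.int32),
)

# Passing the two fields directly or passing the whole state gives the same
# result, and neither copies the state object.
a = precompute_ragged_block_indices(state.start, state.input_pos)
b = precompute_from_state(state)
assert (a == b).all()
```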
end_ref,
line_end_ref,
pre_b_ref,
pre_i_ref,
q, k, v and related names are easy to read. What are b, i, o, m, l, bk, and pre? Can you add a brief description for them?
Added. Done.
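To make the naming question above concrete, here is one plausible reading of the kernel argument names, following common flash-/ragged-attention conventions. Treat the signature and the descriptions as assumptions about the intent, not the PR's actual docstring.

```python
def ragged_attention_kernel(       # illustrative signature only
    start_ref,     # [B] first valid KV position for each batch slot
    end_ref,       # [B] one past the last valid KV position for each batch slot
    line_end_ref,  # [B] end position wrapped onto the ring-buffer line
    pre_b_ref,     # precomputed batch index ("pre_batch") for each grid step
    pre_i_ref,     # precomputed KV-block index ("pre_block") for each grid step
    q_ref,         # query block
    k_ref,         # key block
    v_ref,         # value block
    o_ref,         # output block: accumulated attention result
    m_ref,         # running row-wise max of the logits (flash-attention state)
    l_ref,         # running row-wise sum of exp(logits - max), the normalizer
    *,
    bk: int,       # KV block size along the sequence dimension
):
  """b = batch index and i = KV-block index in the kernel grid; o, m, and l are
  the flash-attention output, max, and sum accumulators."""
  ...
```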
…al performance. Fix the typo that should use jax.lax.div instead of jnp.div
…ocessing API from JetStream.
dense_attention_quantized and use an option to control whether quantization is enabled. Use the new torch_xla2 API.
…a ring buffer. Will cause an error.
* refactor flags
* clean up
* fix run_server
* move common flags to global
* format
* update
* update readme
* update run_interactive
…flags for debugging and performance tuning.
… align with main.
Force-pushed cbb2fe9 to ab38726
…nts. The error message is missing positional arguments.
…nput_pos) back to original to avoid unnecessary issues.
…nel. Fix other lint errors.
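A quick note on the jax.lax.div fix mentioned in one of the commits above: jax.numpy has no div function (only divide and floor_divide), so jnp.div raises an AttributeError, while jax.lax.div performs truncating integer division, which is what block-index arithmetic needs. A minimal example, with a block size of 128 assumed for illustration:

```python
import jax
import jax.numpy as jnp

pos = jnp.array([0, 130, 257], dtype=jnp.int32)   # example decode positions
bk = jnp.full_like(pos, 128)                      # assumed KV block size
block_idx = jax.lax.div(pos, bk)                  # integer division -> [0, 1, 2]
print(block_idx)
```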
Currently the performance is on par with dense attention. We can keep improving the performance in follow-up PRs.
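To make the comparison concrete, below is a small, self-contained sketch (my own illustration, not the PR's kernel) of the dense decode-time baseline being compared against: it scores every slot of the padded KV cache and masks the invalid ones, whereas a ragged kernel only visits KV blocks inside each sequence's [start, input_pos) range, which is where further speedups over the dense path can come from.

```python
import jax
import jax.numpy as jnp


def dense_decode_attention(q, k_cache, v_cache, start, input_pos):
  """Dense baseline: q [B, H, D], k_cache/v_cache [B, H, S, D], start/input_pos [B].
  Scores every cache slot, then masks positions outside [start, input_pos)."""
  seq_len = k_cache.shape[2]
  pos = jnp.arange(seq_len)[None, :]                             # [1, S]
  valid = (pos >= start[:, None]) & (pos < input_pos[:, None])   # [B, S]
  scores = jnp.einsum("bhd,bhsd->bhs", q, k_cache) / jnp.sqrt(q.shape[-1])
  scores = jnp.where(valid[:, None, :], scores, -1e9)            # mask padding
  probs = jax.nn.softmax(scores, axis=-1)
  return jnp.einsum("bhs,bhsd->bhd", probs, v_cache)


# Tiny smoke test: a padded cache of length 16, with two sequences occupying
# positions [0, 5) and [0, 9).
B, H, S, D = 2, 4, 16, 8
key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (B, H, D))
k_cache = jax.random.normal(key, (B, H, S, D))
v_cache = jax.random.normal(key, (B, H, S, D))
out = dense_decode_attention(
    q, k_cache, v_cache,
    start=jnp.array([0, 0]), input_pos=jnp.array([5, 9]))
print(out.shape)  # (2, 4, 8)
```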