
TE Gemma tutorial attempt#2 #1839


Open
sudhakarsingh27 wants to merge 42 commits into main from te_gemma_tutorial_base

Conversation

sudhakarsingh27
Collaborator

Description

Adds a tutorial to showcase how to:

  1. use Transformer Engine (TE) layers in place of HuggingFace's GemmaDecoderLayer in Gemma models (see the sketch below),
  2. use TE's non-paged and paged KV cache, and
  3. use CUDA Graphs and fp8_model_init to speed up generation.

Attempt#1 @ #829
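As a rough illustration of items 1 and 3 (a minimal sketch under stated assumptions, not the tutorial's code): the model can be built under fp8_model_init so that TE parameters are kept in FP8, and the HuggingFace decoder layers can be replaced with TE's TransformerLayer. The checkpoint name and the GemmaConfig attribute names below are assumptions, and the tutorial additionally wraps the TE layer so its forward signature matches GemmaDecoderLayer.

# Sketch only: swap HF GemmaDecoderLayer for TE's TransformerLayer, with weights
# allocated in FP8 via fp8_model_init. Names below are assumptions, not this PR's code.
import transformer_engine.pytorch as te
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("google/gemma-7b")  # hypothetical checkpoint

with te.fp8_model_init(enabled=True):  # TE parameters created here are kept in FP8
    model = AutoModelForCausalLM.from_config(config)
    for i in range(len(model.model.layers)):
        model.model.layers[i] = te.TransformerLayer(
            hidden_size=config.hidden_size,
            ffn_hidden_size=config.intermediate_size,
            num_attention_heads=config.num_attention_heads,
            num_gqa_groups=config.num_key_value_heads,
            layernorm_epsilon=config.rms_norm_eps,
            normalization="RMSNorm",
            activation="geglu",
            self_attn_mask_type="causal",
        )

# Pretrained weights still have to be copied into the TE layers, and a thin wrapper
# is needed so the TE layer accepts GemmaDecoderLayer's forward arguments.

For the CUDA Graphs part of item 3, the decode step can then be captured (e.g. with transformer_engine.pytorch.make_graphed_callables) to reduce kernel-launch overhead during generation.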

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)

@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 03729bc to 2a514cf Compare June 2, 2025 21:10
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 2a514cf to 4757bfa Compare June 2, 2025 21:19
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch 3 times, most recently from 5d7538e to 93960fd Compare June 16, 2025 22:09
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 588fcd6 to 6cd3c1a Compare June 17, 2025 22:27
pre-commit-ci bot and others added 20 commits June 17, 2025 22:27
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…ransformerEngine into te_gemma_tutorial_base_test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…mma_tutorial_base

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 016ff52 to 502fd1e Compare July 23, 2025 21:22
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 40b596a to c15e2cc Compare July 23, 2025 21:26
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from c263819 to 65778f1 Compare July 23, 2025 21:27
pre-commit-ci bot and others added 3 commits July 23, 2025 21:28
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 marked this pull request as ready for review August 7, 2025 08:43
@pggPL
Collaborator

pggPL commented Aug 8, 2025

This tutorial is not visible in the docs.

Collaborator

@pggPL pggPL left a comment


I was focusing more on the documentation. The first part, the HF finetuning, is fine (I have some small suggestions, but it would be acceptable even if we merged it without any changes).

For the second part, I generally agree with the high-level concepts, but it needs polishing. It would be good if you read it once more and look for minor issues, and then I will take another look. I left comments on areas where something is missing or where text remains from parts that were removed.

I also think that if we present something as an example, we should be sure it works correctly, so consider copying the code from all the cells into a test file and adding it to the CI pipeline (e.g. as sketched below). That could be part of another PR, though; this one is big enough.
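One possible way to do that (a sketch, not part of this PR; the notebook path and output file name are assumptions) is to dump the notebook's code cells into a plain script that CI can execute:

# Sketch: extract code cells from the tutorial notebook into a runnable test script.
# The notebook path below is an assumption.
import nbformat

nb = nbformat.read(
    "docs/examples/te_gemma/tutorial_generation_gemma_with_te.ipynb", as_version=4
)
with open("test_te_gemma_tutorial.py", "w") as f:
    for cell in nb.cells:
        if cell.cell_type == "code":
            f.write(cell.source + "\n\n")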

I did not look deeply into the code; it seems some code is left commented out, and I am not sure whether this is the final form. If you want me to look into that, let me know.

I also think this is a good moment to think more about the future of the TE docs. We have been adding many unrelated tutorials/examples lately (attention, export to ONNX, offloading in the near future), and our docs seem quite unstructured at this point. We have a "Tutorials and examples" section that mixes code explaining some FP8 recipes with these tutorials. Maybe it is worth splitting it into a tutorials part and an examples part: the tutorials part would contain only descriptions of the most important features, and the examples part compositions of those features, like Gemma/Llama finetuning and generation.

We also lack good descriptions of many features we have: blockwise scaling is not in the docs, nor are MoE support and context parallelism. Some refactoring, more structure, and descriptions of these things would be a good idea for the future. @ptrendx

Collaborator


This file is nice in its current form: a simple extension of the Llama tutorial.
Having the Llama and Gemma tutorials next to each other seems quite weird, but I do not see a better solution, so we should leave it as it is, imo.

"source": [
"\n",
"<figure align=\"center\">\n",
"<img src=\"./media/plot.svg\">\n",
Collaborator


This picture looks weird in the rendered HTML.


Collaborator Author


I now think this picture is overkill. I plan to remove it.

@@ -266,6 +271,11 @@ def pre_step(
        for k, v in self.sequences.items():
            self.sequences_pre_step[k] = v - step_dict[k]

        pre_step_seqlens = torch.Tensor(list(self.sequences_pre_step.values())).to(
            dtype=torch.int32, device="cpu"
        )
Collaborator


Why is it on CPU? I haven't tried to understand deeply what's going on here yet.

Collaborator Author

@sudhakarsingh27 sudhakarsingh27 Aug 13, 2025


There's a self.pre_step_seqlens variable in InferenceParams which lives on the GPU, and this temporary variable is used to populate it. But I see the confusion: the .to() call is redundant, since torch.Tensor is created on the CPU by default. I'll remove it.
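For context, a minimal sketch of the point above (variable names and sizes are hypothetical, and the last two lines need a CUDA device):

import torch

sequences_pre_step = {0: 7, 1: 12}  # hypothetical per-sequence lengths

# A tensor built from a Python list defaults to the CPU, so an explicit
# .to(device="cpu") on it is a no-op.
pre_step_seqlens = torch.tensor(list(sequences_pre_step.values()), dtype=torch.int32)
assert pre_step_seqlens.device.type == "cpu"

# The GPU-resident buffer (self.pre_step_seqlens in InferenceParams) is then
# populated with an explicit host-to-device copy:
gpu_seqlens = torch.zeros(len(sequences_pre_step), dtype=torch.int32, device="cuda")
gpu_seqlens[: pre_step_seqlens.numel()].copy_(pre_step_seqlens, non_blocking=True)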

sudhakarsingh27 and others added 11 commits August 11, 2025 16:47
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>