
TE Gemma tutorial attempt#2 #1839


Open
sudhakarsingh27 wants to merge 42 commits into main from te_gemma_tutorial_base

Conversation

sudhakarsingh27
Collaborator

Description

Adds a tutorial to showcase how to:

  1. use Transformer Engine (TE) layers in place of HuggingFace's GemmaDecoderLayer in Gemma models (see the sketch below),
  2. use TE's non-paged and paged KV cache, and
  3. use CUDA Graphs and fp8_model_init to speed up generation.

Attempt#1 @ #829
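As a rough illustration of items 1 and 3 (a minimal sketch under stated assumptions, not the tutorial's code): the model can be built under fp8_model_init so that TE parameters are kept in FP8, and the HuggingFace decoder layers can be replaced with TE's TransformerLayer. The checkpoint name and the GemmaConfig attribute names below are assumptions, and the tutorial additionally wraps the TE layer so its forward signature matches GemmaDecoderLayer.

# Sketch only: swap HF GemmaDecoderLayer for TE's TransformerLayer, with weights
# allocated in FP8 via fp8_model_init. Names below are assumptions, not this PR's code.
import transformer_engine.pytorch as te
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("google/gemma-7b")  # hypothetical checkpoint

with te.fp8_model_init(enabled=True):  # TE parameters created here are kept in FP8
    model = AutoModelForCausalLM.from_config(config)
    for i in range(len(model.model.layers)):
        model.model.layers[i] = te.TransformerLayer(
            hidden_size=config.hidden_size,
            ffn_hidden_size=config.intermediate_size,
            num_attention_heads=config.num_attention_heads,
            num_gqa_groups=config.num_key_value_heads,
            layernorm_epsilon=config.rms_norm_eps,
            normalization="RMSNorm",
            activation="geglu",
            self_attn_mask_type="causal",
        )

# Pretrained weights still have to be copied into the TE layers, and a thin wrapper
# is needed so the TE layer accepts GemmaDecoderLayer's forward arguments.

For the CUDA Graphs part of item 3, the decode step can then be captured (e.g. with transformer_engine.pytorch.make_graphed_callables) to reduce kernel-launch overhead during generation.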

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)

@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 03729bc to 2a514cf Compare June 2, 2025 21:10
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 2a514cf to 4757bfa Compare June 2, 2025 21:19
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch 3 times, most recently from 5d7538e to 93960fd Compare June 16, 2025 22:09
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 588fcd6 to 6cd3c1a Compare June 17, 2025 22:27
pre-commit-ci bot and others added 20 commits June 17, 2025 22:27
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…ransformerEngine into te_gemma_tutorial_base_test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…mma_tutorial_base

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 016ff52 to 502fd1e Compare July 23, 2025 21:22
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from 40b596a to c15e2cc Compare July 23, 2025 21:26
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the te_gemma_tutorial_base branch from c263819 to 65778f1 Compare July 23, 2025 21:27
pre-commit-ci bot and others added 3 commits July 23, 2025 21:28
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 marked this pull request as ready for review August 7, 2025 08:43
@pggPL
Collaborator

pggPL commented Aug 8, 2025

This tutorial is not visible in the docs.

Collaborator

@pggPL pggPL left a comment


I was focusing more on the documentation. The first part, the HF finetuning, is fine (I have some small suggestions, but it would be acceptable even if we merged it without any changes).

For the second part, I generally agree with the high-level concepts, but it needs polishing. It would be good if you read it once more and look for minor issues, and then I will take another look. I left comments on areas where something is missing or where text remains from parts that were removed.

I also think that if we present something as an example, we should be sure it works correctly, so consider copying the code from all the cells into a test file and adding it to the CI pipeline (e.g. as sketched below). That could be part of another PR, though; this one is big enough.
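One possible way to do that (a sketch, not part of this PR; the notebook path and output file name are assumptions) is to dump the notebook's code cells into a plain script that CI can execute:

# Sketch: extract code cells from the tutorial notebook into a runnable test script.
# The notebook path below is an assumption.
import nbformat

nb = nbformat.read(
    "docs/examples/te_gemma/tutorial_generation_gemma_with_te.ipynb", as_version=4
)
with open("test_te_gemma_tutorial.py", "w") as f:
    for cell in nb.cells:
        if cell.cell_type == "code":
            f.write(cell.source + "\n\n")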

I did not look deeply into the code; it seems some code is left commented out, and I am not sure whether this is the final form. If you want me to look into that, let me know.

I also think this is a good moment to think more about the future of the TE docs. We have been adding many unrelated tutorials/examples lately (attention, export to ONNX, offloading in the near future), and our docs seem quite unstructured at this point. We have a "Tutorials and examples" section that mixes code explaining some FP8 recipes with these tutorials. Maybe it is worth splitting it into a tutorials part and an examples part: the tutorials part would contain only descriptions of the most important features, and the examples part compositions of those features, like Gemma/Llama finetuning and generation.

We also lack good descriptions of many features we have: blockwise scaling is not in the docs, nor are MoE support and context parallelism. Some refactoring, more structure, and descriptions of these things would be a good idea for the future. @ptrendx

Collaborator


This file is nice in its current form: a simple extension of the Llama tutorial.
Having the Llama and Gemma tutorials next to each other seems quite weird, but I do not see a better solution, so we should leave it as it is, imo.

"source": [
"\n",
"<figure align=\"center\">\n",
"<img src=\"./media/plot.svg\">\n",
Collaborator


This picture looks weird in the rendered HTML.


Collaborator Author


I now think this picture is overkill. I plan to remove it.

@@ -266,6 +271,11 @@ def pre_step(
        for k, v in self.sequences.items():
            self.sequences_pre_step[k] = v - step_dict[k]

        pre_step_seqlens = torch.Tensor(list(self.sequences_pre_step.values())).to(
            dtype=torch.int32, device="cpu"
        )
Collaborator


Why is it on CPU? I haven't tried to understand deeply what's going on here yet.

Collaborator Author

@sudhakarsingh27 sudhakarsingh27 Aug 13, 2025


There's a self.pre_step_seqlens variable in InferenceParams which lives on the GPU, and this temporary variable is used to populate it. But I see the confusion: the .to() call is redundant, since torch.Tensor is created on the CPU by default. I'll remove it.
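For context, a minimal sketch of the point above (variable names and sizes are hypothetical, and the last two lines need a CUDA device):

import torch

sequences_pre_step = {0: 7, 1: 12}  # hypothetical per-sequence lengths

# A tensor built from a Python list defaults to the CPU, so an explicit
# .to(device="cpu") on it is a no-op.
pre_step_seqlens = torch.tensor(list(sequences_pre_step.values()), dtype=torch.int32)
assert pre_step_seqlens.device.type == "cpu"

# The GPU-resident buffer (self.pre_step_seqlens in InferenceParams) is then
# populated with an explicit host-to-device copy:
gpu_seqlens = torch.zeros(len(sequences_pre_step), dtype=torch.int32, device="cuda")
gpu_seqlens[: pre_step_seqlens.numel()].copy_(pre_step_seqlens, non_blocking=True)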

sudhakarsingh27 and others added 11 commits August 11, 2025 16:47
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>