
Add SFT LLaMA3 GPU demo notebook#3285

Merged
copybara-service[bot] merged 1 commit into AI-Hypercomputer:main from katjasrz:sft_llama3_gpu
Mar 6, 2026

Conversation

@katjasrz katjasrz (Collaborator) commented Mar 2, 2026

Description

Supersedes #3146 (closed due to branch rename/history rewrite).

  • Adds src/MaxText/examples/sft_llama3_gpu.ipynb: a GPU-focused, end-to-end notebook for SFT of Llama 3.1-8B on NVIDIA GPUs (HF auth → gated-access note → HF-to-MaxText checkpoint conversion (CPU) → SFT run → TensorBoard → inference sanity check).
  • Motivation: complements existing notebook/docs that emphasize TPU SFT flows (e.g. sft_llama3_demo.ipynb) with a clear NVIDIA GPU path.
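
For reference, the HF auth step typically reduces to something like the sketch below (notebook_login is an assumption about the cell's contents; the committed notebook may differ). Llama 3.1-8B is a gated repo, so the account behind the token must have accepted Meta's license on the model page.

# Hedged sketch of the HF auth step (the notebook's actual cell may differ).
from huggingface_hub import notebook_login

notebook_login()  # interactive token prompt inside Jupyter; the token needs gated access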

Tests

Executed the notebook in the NGC container nvcr.io/nvidia/jax:26.01-maxtext-py3 on a cluster node with 8 NVIDIA H100 GPUs, CUDA 13.1, driver 580.105.08, JAX 0.8.1.dev20260217.

Verified:

  • Checkpoint conversion completes.
  • SFT runs for 100 steps and writes checkpoints/logs.
  • TensorBoard logs are created.
  • Inference sanity check runs and produces output.
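
A quick spot-check for the last three items (BASE_OUTPUT_DIR is an assumption; substitute the base output directory configured for the run):

# Hedged spot-check: count TensorBoard event files and checkpoint items
# under the run's output directory (the path below is hypothetical).
import glob
import os

BASE_OUTPUT_DIR = '/workspace/output'  # hypothetical; match the run config
events = glob.glob(os.path.join(BASE_OUTPUT_DIR, '**', 'events.out.tfevents.*'), recursive=True)
ckpts = glob.glob(os.path.join(BASE_OUTPUT_DIR, '**', 'checkpoints', '*'), recursive=True)
print(f'{len(events)} TensorBoard event file(s), {len(ckpts)} checkpoint item(s)')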

To reproduce, execute as follows:

cd <YOUR_ROOT_DIRECTORY>

# clone this repo
git clone https://github.com/katjasrz/maxtext.git

export ROOT_DIR=$(pwd)
export PROJECT_DIR=$ROOT_DIR/maxtext/src/maxtext/examples
export HF_CACHE_DIR=$ROOT_DIR/huggingface

docker run -it --rm --ipc=host \
  --gpus=all \
  -p 8889:8889 \
  -p 6006:6006 \
  --shm-size=16g \
  --ulimit memlock=-1 \
  -v "$PROJECT_DIR":/workspace \
  -v "$HF_CACHE_DIR":/hf_cache \
  -e HF_HOME=/hf_cache \
  -e LOCAL_UID=$(id -u) \
  -e LOCAL_GID=$(id -g) \
  nvcr.io/nvidia/jax:26.01-maxtext-py3 \
  bash -lc 'set -e
    groupadd -g $LOCAL_GID hostgrp 2>/dev/null || true
    useradd -u $LOCAL_UID -g $LOCAL_GID -M -d /workspace hostusr 2>/dev/null || true
    
    python3 -m pip install --upgrade pip
    pip install jupyterlab ipywidgets
    pip install -U git+https://github.com/google/tunix
    pip install torch --index-url https://download.pytorch.org/whl/cpu
    
    su hostusr -c "cd /workspace && HOME=/workspace HF_HOME=/hf_cache \
      jupyter lab --ip=0.0.0.0 --port=8889 --no-browser"'

Then follow the instructions in the Jupyter notebook src/maxtext/examples/sft_llama3_gpu.ipynb.

Two container-specific fixes, needed because the container ships an older maxtext version, are excluded from the notebook itself:

Fix 1. The code below includes a workaround for a known container issue where create_nnx_model defaults model_mode to None instead of "train". This is patched at runtime.

# Fix for container bug: model_creation_utils.create_nnx_model defaults model_mode=None
# but it should default to "train". Set the correct default.
from MaxText import model_creation_utils
model_creation_utils.create_nnx_model.__defaults__ = (None, None, "train", None)
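
One way to confirm the patch took effect (a sanity check; it assumes create_nnx_model keeps four trailing defaulted parameters, since __defaults__ maps onto the last defaulted parameters in order):

# Sanity check (assumption: four trailing defaulted parameters); the printed
# signature should now show model_mode="train".
import inspect

print(inspect.signature(model_creation_utils.create_nnx_model))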

Fix 2. The code below is a workaround for a known container issue where empty-string defaults for hf_train_files/hf_eval_files/hf_data_dir cause datasets.load_dataset to fail. These are patched to None at runtime.

# Fix for container bug: empty string defaults for hf_train_files/hf_eval_files/hf_data_dir
# cause datasets.load_dataset to fail. Monkey-patch to convert empty strings to None.
# Guard against multiple applications to avoid recursion.
import datasets
if not hasattr(datasets, '_original_load_dataset'):
    datasets._original_load_dataset = datasets.load_dataset

    def _patched_load_dataset(*args, **kwargs):
        for key in ['data_files', 'data_dir']:
            if key in kwargs and kwargs[key] == '':
                kwargs[key] = None
        return datasets._original_load_dataset(*args, **kwargs)

    datasets.load_dataset = _patched_load_dataset
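
A quick, purely illustrative self-test of the wrapper: temporarily stub the saved original so nothing is downloaded, then confirm that empty-string kwargs come through as None.

# Illustrative self-test: stub the saved original so no dataset is fetched.
_saved = datasets._original_load_dataset
datasets._original_load_dataset = lambda *args, **kwargs: kwargs
try:
    out = datasets.load_dataset('dummy', data_files='', data_dir='')
    assert out == {'data_files': None, 'data_dir': None}
finally:
    datasets._original_load_dataset = _saved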

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@A9isha A9isha (Collaborator) left a comment


Thank you so much!

There are a few failing tests which would need to be fixed, of course.

@codecov codecov (Bot) commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@A9isha A9isha (Collaborator) commented Mar 5, 2026

Ignoring the Jupyter notebook test failure w.r.t. getpass().

Gemini's explanation:

Because the PR is likely submitted from a fork, GitHub Actions automatically restricts access to repository secrets to prevent malicious code from exfiltrating them. As a result, the HF_TOKEN environment variable is empty. The code then falls back to getpass() to ask for interactive input. However, the CI job runs the notebook headlessly using papermill (source), which has no interactive input stream, so it crashes.
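
The failure mode corresponds to a fallback pattern roughly like this (a sketch of what the notebook presumably does, not its exact code):

# Sketch of the fallback described above (not the notebook's exact code).
import os
from getpass import getpass

hf_token = os.environ.get('HF_TOKEN')
if not hf_token:
    # papermill runs the notebook without an interactive stdin, so this
    # call raises and the headless CI job crashes here.
    hf_token = getpass('Enter your Hugging Face token: ')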

@igorts-git igorts-git mentioned this pull request Mar 5, 2026
@copybara-service copybara-service Bot merged commit 7656eb8 into AI-Hypercomputer:main Mar 6, 2026
51 of 61 checks passed