Add MXFP8 and NVFP4 quantization support to LLaMA3 #1500
jomitchellnv wants to merge 6 commits into main from
Conversation
Important: Review skipped. Auto reviews are disabled on this repository; check the settings in the CodeRabbit UI or the ⚙️ Run configuration. Configuration used: Path: `.coderabbit.yaml` · Review profile: CHILL · Plan: Pro
📝 Walkthrough

This PR adds layer-wise quantization support (FP8/FP4/BF16): new config field for per-layer precision, recipe attachment APIs on TE models, per-layer autocast orchestration in forward, quantization utilities and logging init, training-script wiring, tests, docs/config updates, and removal of legacy FP8 debugging initialization.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script as Training Script
    participant Config as Resolver
    participant Recipes as Recipe Factory
    participant Model as NVLlama Model
    participant Decoder as Decoder Layers
    participant TE as TransformerEngine Autocast
    Script->>Config: resolve_layer_precision(fp8_enabled, fp4_enabled, fp8_layers, fp4_layers)
    Config-->>Script: layer_precision list
    Script->>Recipes: create fp8_recipe / fp4_recipe (conditional)
    Recipes-->>Script: recipe objects or None
    Script->>Model: set_recipes(fp8_recipe, fp4_recipe)
    Model->>Model: store _fp8_recipe, _fp4_recipe and config.layer_precision
    Script->>Model: forward(input_ids,...)
    Model->>TE: enter outer FP8 autocast if _fp8_recipe
    TE->>Decoder: iterate layers
    Decoder->>Model: get_layer_autocast(layer_number)
    Model-->>Decoder: return per-layer context (nullcontext / FP4 autocast / BF16-disabled)
    Decoder->>TE: execute layer under returned autocast
    Decoder-->>Model: layer outputs
    Model-->>Script: logits / outputs
```
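The resolution step at the top of the diagram can be sketched as a small pure function. This is a hedged reconstruction from the walkthrough, using 1-indexed layer lists as in the PR's `fp8_layers`/`fp4_layers` config; the actual `resolve_layer_precision` in `quantization.py` may differ in signature and edge-case handling.

```python
def resolve_layer_precision(num_layers, fp8_enabled, fp4_enabled,
                            fp8_layers=None, fp4_layers=None):
    """Map each decoder layer (1-indexed) to "fp8", "fp4", or "bf16"."""
    fp8 = set(fp8_layers or [])
    fp4 = set(fp4_layers or [])
    overlap = sorted(fp8 & fp4)
    if overlap:
        raise ValueError(f"Layers assigned to both FP8 and FP4: {overlap}")
    # An enabled mode with no explicit layer list claims every unclaimed layer.
    if fp8_enabled and not fp8:
        fp8 = set(range(1, num_layers + 1)) - fp4
    if fp4_enabled and not fp4:
        fp4 = set(range(1, num_layers + 1)) - fp8
    return ["fp8" if i in fp8 else "fp4" if i in fp4 else "bf16"
            for i in range(1, num_layers + 1)]
```

The model then stores this list on its config, and `get_layer_autocast(layer_number)` picks the matching TE autocast (or a null context) per decoder layer.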
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
cb05a45 to 4067915
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
🧹 Nitpick comments (5)
bionemo-recipes/recipes/llama3_native_te/tests/conftest.py (1)
54-64: Use `item.originalname` for more robust test name matching.

The stats tests are not currently parametrized, but the reordering logic would silently break if they become parametrized in the future (pytest appends `[param]` suffixes to parametrized test names). Using `item.originalname` or stripping parameter suffixes makes this logic resilient to future changes.

Suggested change:

```diff
 def pytest_collection_modifyitems(items):
     """Run FP8 stats logging tests first to avoid late debug initialization."""
     stats_test_names = {
         "test_sanity_ddp_fp8_stats_logging",
         "test_sanity_fsdp2_fp8_stats_logging",
         "test_sanity_ddp_fp8_partial_layers_stats_logging",
         "test_sanity_fsdp2_fp8_partial_layers_stats_logging",
     }
-    stats_tests = [item for item in items if item.name in stats_test_names]
-    other_tests = [item for item in items if item.name not in stats_test_names]
+    def _base_name(item):
+        return getattr(item, "originalname", item.name.split("[", 1)[0])
+
+    stats_tests = [item for item in items if _base_name(item) in stats_test_names]
+    other_tests = [item for item in items if _base_name(item) not in stats_test_names]
     items[:] = stats_tests + other_tests
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/tests/conftest.py` around lines 54-64, the reordering currently matches on `item.name`, which breaks for parametrized tests; change the comparisons in `pytest_collection_modifyitems` to use the test's original name (e.g., `item.originalname` or `getattr(item, "originalname", item.name)`) when building `stats_tests` and `other_tests` so parameter suffixes like `[param]` are ignored; update the two list comprehensions that reference `item.name` and keep the rest of the function unchanged.

bionemo-recipes/recipes/llama3_native_te/quantization.py (1)
86-94: Redundant YAML serialization and temp file cleanup consideration.

The config is serialized twice: once to the temp file (line 87) and again for logging (line 90). Consider reusing the serialized string. Also, the temp file created with `delete=False` is never cleaned up; this is acceptable for a short-lived training run, but consider documenting the behavior or adding cleanup in the caller.

♻️ Optional: reuse serialized config

```diff
-    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
-    yaml.dump(config, temp_file, default_flow_style=False)
-    temp_file.close()
-
-    config_str = yaml.dump(config, default_flow_style=False)
+    config_str = yaml.dump(config, default_flow_style=False)
+    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
+    temp_file.write(config_str)
+    temp_file.close()
+
     logger.info(f"Created updated quant stats config at: {temp_file.name}")
     logger.info(f"Updated quant stats config contents:\n{config_str}")
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/quantization.py` around lines 86-94, the code serializes `config` twice and the temporary file created with `tempfile.NamedTemporaryFile(..., delete=False)` is neither documented nor cleaned up; serialize `config` once into a string, write that string to `temp_file` instead of calling `yaml.dump` twice, reuse the string for the log message, and either document that the caller must remove the returned file or add cleanup in the caller.

bionemo-recipes/recipes/llama3_native_te/perf_logger.py (1)
94-94: Consider renaming to clarify this is a boolean flag.

The attribute `quant_stats_config` stores a boolean (`args.quant_stats_config.enabled`), but the name suggests it holds a config object. Consider `quant_stats_enabled` for clarity and consistency with the previous `fp8_stats_enabled` naming convention.

Suggested rename for clarity:

```diff
-        self.quant_stats_config = args.quant_stats_config.enabled
+        self.quant_stats_enabled = args.quant_stats_config.enabled
```

And update usages at lines 153 and 204 accordingly.

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/perf_logger.py` at line 94, rename the boolean attribute `self.quant_stats_config` to `self.quant_stats_enabled` to match the `fp8_stats_enabled` naming and avoid implying a config object; update the constructor assignment and every usage of `self.quant_stats_config` in the class, and adjust any related docs and tests that reference the old name so references remain consistent.

bionemo-recipes/models/llama3/modeling_llama_te.py (1)
191-211: Consider adding a bounds check in `get_layer_autocast`.

If `layer_number` exceeds `len(self.config.layer_precision)`, line 205 will raise an `IndexError`. Consider adding a bounds check or documenting this precondition.

Optional bounds check:

```diff
 def get_layer_autocast(self, layer_number: int):
     """Return the appropriate TE autocast context manager for a given layer.
     ...
     """
-    precision = self.config.layer_precision[layer_number] if self.config.layer_precision is not None else None
+    if self.config.layer_precision is None or layer_number >= len(self.config.layer_precision):
+        precision = None
+    else:
+        precision = self.config.layer_precision[layer_number]
     if precision == "fp8":
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/models/llama3/modeling_llama_te.py` around lines 191-211, `get_layer_autocast` currently indexes `self.config.layer_precision` without validating `layer_number`; add a bounds check at the start to verify `self.config.layer_precision` is not None and `0 <= layer_number < len(self.config.layer_precision)`, and if the check fails raise a clear exception (e.g., `IndexError` or `ValueError`) with an explanatory message; keep the existing behavior for valid indices.

bionemo-recipes/recipes/llama3_native_te/tests/test_quantization.py (1)
24-26: `sys.path` manipulation for imports.

The `sys.path.append` pattern works but is fragile. Consider whether a package structure could avoid this, or add a comment explaining why this approach is necessary for the recipe test layout.

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/tests/test_quantization.py` around lines 24-26, the test mutates `sys.path` using `sys.path.append(Path(__file__).parent.parent.as_posix())` to import `quantization`, which is fragile; either make the tests an actual package (add an `__init__.py` and import via package path) or load the module via a test-time PYTHONPATH/pytest configuration (e.g., set `pythonpath` in `pytest.ini`); if the append must remain, add a brief comment above it explaining why a direct import is necessary for the recipe layout and noting the intended long-term workaround for future maintainers.
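For the pytest-configuration alternative mentioned above, pytest's built-in `pythonpath` ini option (available since pytest 7.0) can replace the `sys.path.append`. A minimal fragment, assuming the recipe root contains both `quantization.py` and the `tests/` directory:

```ini
# pytest.ini at the recipe root (or [tool.pytest.ini_options] in pyproject.toml)
[pytest]
pythonpath = .
testpaths = tests
```

With this in place, `import quantization` works from any test module without mutating `sys.path`.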
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/llama3_native_te/quantization.py`:
- Around line 140-175: The function resolve_layer_precision lacks validation
that entries in fp8_layers and fp4_layers fall within 1..num_layers; add checks
after all_layers is defined to validate any provided fp8_layers and fp4_layers
(and similarly handle duplicates/overlap) by iterating the lists and raising
ValueError listing invalid layer numbers (or a clear message) if any number <1
or >num_layers or not an int; reference fp8_layers, fp4_layers, num_layers,
all_layers and raise the error before proceeding with the existing
overlapping/none checks so invalid indices are caught early.
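The validation described in this prompt could look roughly like the sketch below; the helper name and message format are illustrative, not taken from the PR:

```python
def validate_layer_lists(num_layers, fp8_layers=None, fp4_layers=None):
    """Raise ValueError if any requested layer index falls outside 1..num_layers."""
    for name, layers in (("fp8_layers", fp8_layers), ("fp4_layers", fp4_layers)):
        if layers is None:
            continue
        # Collect every entry that is not an integer in the valid 1-indexed range.
        bad = [n for n in layers if not isinstance(n, int) or not 1 <= n <= num_layers]
        if bad:
            raise ValueError(
                f"{name} contains invalid layer numbers {bad}; "
                f"expected integers in 1..{num_layers}"
            )
```

Running this before the overlap/none checks means out-of-range indices fail fast with a message naming the offending entries.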
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py`:
- Line 53: The file currently imports only resolve_layer_precision from
quantization but misses initialize_quant_stats_logging and does not initialize
quant stats; add initialize_quant_stats_logging to the imports and, where
argument parsing/setup occurs (same area as resolve_layer_precision usage), call
initialize_quant_stats_logging(process_logger, args.quant_stats_config) when
args.quant_stats_config.enabled is true so quantization stats are initialized
the same way as in train_fsdp2.py/train_ddp.py.
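The intended wiring reads roughly as below; `initialize_quant_stats_logging` is stubbed here since the real implementation lives in the PR's `quantization.py`, so treat this as a shape sketch only:

```python
from types import SimpleNamespace

def initialize_quant_stats_logging(log, quant_stats_config):
    """Stub standing in for the helper in the PR's quantization.py."""
    log.append(f"quant stats logging from {quant_stats_config.config_file}")

def maybe_init_quant_stats(args, log):
    # Mirror train_ddp.py / train_fsdp2.py: only initialize when enabled.
    if args.quant_stats_config.enabled:
        initialize_quant_stats_logging(log, args.quant_stats_config)

log = []
args = SimpleNamespace(
    quant_stats_config=SimpleNamespace(enabled=True, config_file="fp8_debugging_stats.yaml")
)
maybe_init_quant_stats(args, log)
```

The point of the review comment is simply that `train_fsdp2_cp.py` should gain this same conditional call, not that the call site looks exactly like this.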
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 39eb17c1-c1ff-4756-90d6-a1eae8aa0d07
📒 Files selected for processing (18)
- bionemo-recipes/models/llama3/modeling_llama_te.py
- bionemo-recipes/models/llama3/tests/test_distributed_fp8.py
- bionemo-recipes/models/llama3/tests/test_layer_quantization.py
- bionemo-recipes/recipes/llama3_native_te/README.md
- bionemo-recipes/recipes/llama3_native_te/fp4_debugging_stats.yaml
- bionemo-recipes/recipes/llama3_native_te/fp8_debugging.py
- bionemo-recipes/recipes/llama3_native_te/fp8_debugging_stats.yaml
- bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/llama3_native_te/modeling_llama_te.py
- bionemo-recipes/recipes/llama3_native_te/perf_logger.py
- bionemo-recipes/recipes/llama3_native_te/quantization.py
- bionemo-recipes/recipes/llama3_native_te/tests/conftest.py
- bionemo-recipes/recipes/llama3_native_te/tests/test_perf_logger.py
- bionemo-recipes/recipes/llama3_native_te/tests/test_quantization.py
- bionemo-recipes/recipes/llama3_native_te/tests/test_train.py
- bionemo-recipes/recipes/llama3_native_te/train_ddp.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
💤 Files with no reviewable changes (1)
- bionemo-recipes/recipes/llama3_native_te/fp8_debugging.py
@coderabbitai resolve

@coderabbitai review

✅ Actions performed: Comments resolved. Auto-approval is disabled.

✅ Actions performed: Review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py (1)
92-100: Consider guarding the `layer_precision` assignment for TE models only.

The `layer_precision` attribute is assigned to the config regardless of whether it's `NVLlamaConfig` or `LlamaConfig`. While this won't cause errors (Python allows dynamic attribute assignment), the attribute is only meaningful for TE models that use `get_layer_autocast()`. For clarity and consistency, consider wrapping this in a check.

♻️ Optional: Guard layer_precision assignment

```diff
 layer_precision = resolve_layer_precision(
     num_layers=config.num_hidden_layers,
     fp8_enabled=args.fp8_config.enabled,
     fp4_enabled=args.fp4_config.enabled,
     fp8_layers=OmegaConf.to_container(args.fp8_layers, resolve=True) if args.fp8_layers is not None else None,
     fp4_layers=OmegaConf.to_container(args.fp4_layers, resolve=True) if args.fp4_layers is not None else None,
 )
-config.layer_precision = layer_precision
+if args.use_te:
+    config.layer_precision = layer_precision
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py` around lines 92-100, the code always assigns `config.layer_precision` after calling `resolve_layer_precision`, but `layer_precision` is only meaningful for TE models that use `get_layer_autocast`; keep calling `resolve_layer_precision` as needed but guard the assignment with an `isinstance(config, NVLlamaConfig)` (or equivalent TE config) check so `LlamaConfig` instances are not mutated with a TE-only attribute.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/llama3_native_te/README.md`:
- Around line 68-77: The convergence benchmarks image paths in the README.md
reference a non-existent recipes/llama3 subdirectory; update the src attributes
in the <img> tags for the two convergence images (lingua-1b-loss-curve.png and
lingua-1b-step-time.png) to point to ../../../docs/docs/assets/images/llama3/…
(remove the recipes/ component) so they match the actual location; ensure you
modify the two src strings currently containing
.../assets/images/recipes/llama3/... to .../assets/images/llama3/... in the
README.md convergence section.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6703b6b6-e632-489e-a1b5-96f595dfafc4
⛔ Files ignored due to path filters (3)
- `docs/docs/assets/images/llama3/lingua-1b-loss-curve.png` is excluded by `!**/*.png`
- `docs/docs/assets/images/llama3/lingua-1b-step-time.png` is excluded by `!**/*.png`
- `docs/docs/assets/images/llama3/llama3_1b_fsdp2_tflops.png` is excluded by `!**/*.png`
📒 Files selected for processing (3)
- bionemo-recipes/recipes/llama3_native_te/README.md
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main    #1500   +/-   ##
=======================================
  Coverage   76.45%   76.45%
=======================================
  Files         102      102
  Lines        7952     7952
=======================================
  Hits         6080     6080
  Misses       1872     1872
```
f82b896 to 8dda89b
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
8dda89b to 6079013
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
@coderabbitai re-review

✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (5)
bionemo-recipes/recipes/llama3_native_te/quantization.py (1)
86-94: Consider simplifying the redundant YAML dump.

The config is serialized twice: once to write to the temp file (line 87) and again for logging (line 90). This is minor but could be optimized.

♻️ Suggested optimization

```diff
-    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
-    yaml.dump(config, temp_file, default_flow_style=False)
-    temp_file.close()
-
-    config_str = yaml.dump(config, default_flow_style=False)
+    config_str = yaml.dump(config, default_flow_style=False)
+    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
+    temp_file.write(config_str)
+    temp_file.close()
+
     logger.info(f"Created updated quant stats config at: {temp_file.name}")
     logger.info(f"Updated quant stats config contents:\n{config_str}")
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/quantization.py` around lines 86-94, call `yaml.dump(config, default_flow_style=False)` once into a string (e.g., `config_str`), write that string to the `NamedTemporaryFile`, log `config_str` with `logger.info`, close the temp file, and return `temp_file.name`.

bionemo-recipes/recipes/llama3_native_te/tests/test_train.py (1)
510-538: Consider adding log file assertions for consistency.

The DDP partial layers test only asserts directory existence, while the FSDP2 test also asserts that specific log files exist. For consistency and more thorough validation, consider adding the file existence checks here as well.

♻️ Suggested enhancement

```diff
     assert (quant_log_dir / "rank_0" / "nvdlfw_inspect_statistics_logs").exists(), (
         "nvdlfw_inspect_statistics_logs directory was not created"
     )
+
+    # Verify log files exist (consistent with FSDP2 test)
+    assert (quant_log_dir / "rank_0" / "nvdlfw_inspect_logs" / "nvdlfw_inspect_globalrank-0.log").exists()
+    assert (quant_log_dir / "rank_0" / "nvdlfw_inspect_statistics_logs" / "nvdlfw_inspect_globalrank-0.log").exists()
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/tests/test_train.py` around lines 510-538, add assertions in `test_sanity_ddp_fp8_partial_layers_stats_logging` to verify that the expected log files were created inside the directories, not just the directories themselves: after the existing directory asserts for `quant_log_dir / "rank_0" / "nvdlfw_inspect_logs"` and `quant_log_dir / "rank_0" / "nvdlfw_inspect_statistics_logs"`, assert that each directory contains at least one file (e.g., using `any(path.iterdir())` or `len(list(path.glob("*.log"))) > 0`).

bionemo-recipes/recipes/llama3_native_te/README.md (1)
97-99: Minor grammar nit: use a hyphen in the compound adjective.

The phrase "Low Precision convergence benchmarks" should be "Low-Precision convergence benchmarks" for grammatical consistency with line 68.

📝 Suggested fix

```diff
-### Low Precision convergence benchmarks
+### Low-Precision convergence benchmarks
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/README.md` around lines 97-99, update the header "Low Precision convergence benchmarks" to use a hyphenated compound adjective: "Low-Precision convergence benchmarks".

bionemo-recipes/models/llama3/modeling_llama_te.py (1)
158-159: Clarify the intentional recipe reset pattern.

The `_fp8_recipe` and `_fp4_recipe` attributes are set at lines 158-159 for constructor validation, then explicitly reset to `None` at lines 221-222 before `post_init()`. This appears intentional to ensure recipes (which are not serializable) are attached via `set_recipes()` after model sharding.

Consider adding a brief comment at lines 221-222 explaining this design decision:

📝 Suggested clarification

```diff
+        # Reset recipes to None - they must be re-attached via set_recipes() after model
+        # creation/sharding since Recipe objects are not serializable.
         self._fp8_recipe: transformer_engine.common.recipe.Recipe | None = None
         self._fp4_recipe: transformer_engine.common.recipe.Recipe | None = None
```

Also applies to: 221-222

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/models/llama3/modeling_llama_te.py` around lines 158-159, add a short inline comment where the code resets `self._fp8_recipe` and `self._fp4_recipe` to `None` before calling `post_init()`, explaining the intentional pattern: these Recipe objects are set in the constructor only for validation, are not serializable, and must be attached later via `set_recipes()` after model sharding, so future readers understand the design.

bionemo-recipes/recipes/llama3_native_te/tests/test_quantization.py (1)
248-314: Consider cleaning up temp files created by `update_quant_stats_config`.

The tests properly use `tmp_path` for input fixtures, but `update_quant_stats_config` creates output files in the system temp directory with `delete=False`. These files accumulate across test runs. Consider adding cleanup or using `tmp_path` for output verification.

♻️ Optional: Add temp file cleanup in tests

```diff
 def test_fp8_layers_updates_regex(fp8_only_config):
     """FP8 layer list should update the regex in the output config."""
     output_path = update_quant_stats_config(config_file=fp8_only_config, fp4_layers=None, fp8_layers=[1, 2, 3])
-    with open(output_path) as f:
-        result = yaml.safe_load(f)
-    regex = result["example_fp8_tensor_stat_collection"]["layers"]["layer_name_regex_pattern"]
-    assert re.search(regex, "model.model.layers.1.self_attention.layernorm_qkv")
-    assert re.search(regex, "model.model.layers.3.layernorm_mlp.fc2")
-    assert not re.search(regex, "model.model.layers.4.self_attention.proj")
+    try:
+        with open(output_path) as f:
+            result = yaml.safe_load(f)
+        regex = result["example_fp8_tensor_stat_collection"]["layers"]["layer_name_regex_pattern"]
+        assert re.search(regex, "model.model.layers.1.self_attention.layernorm_qkv")
+        assert re.search(regex, "model.model.layers.3.layernorm_mlp.fc2")
+        assert not re.search(regex, "model.model.layers.4.self_attention.proj")
+    finally:
+        Path(output_path).unlink(missing_ok=True)
```

Alternatively, consider creating a pytest fixture that wraps `update_quant_stats_config` with automatic cleanup.

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `bionemo-recipes/recipes/llama3_native_te/tests/test_quantization.py` around lines 248-314, tests call `update_quant_stats_config`, which creates temp files with `delete=False` that are never cleaned up; update each test that calls it to remove the produced file after assertions (`os.remove(output_path)`), pass a `tmp_path`-based output location if supported, or add a small pytest fixture that wraps `update_quant_stats_config` and unlinks the returned temp file in teardown to avoid accumulating temp files.
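The regex behavior these tests exercise can be reproduced with a small sketch of `generate_layer_regex`; the anchoring details of the PR's actual pattern may differ:

```python
import re

def generate_layer_regex(layers):
    """Match module names under the given 1-indexed decoder layers only."""
    alternatives = "|".join(str(n) for n in sorted(layers))
    # The trailing \. keeps layer 1 from also matching layers 10, 11, ...
    return rf"model\.model\.layers\.({alternatives})\."

pattern = generate_layer_regex([1, 2, 3])
```

Writing a pattern of this shape into the stats config's `layer_name_regex_pattern` restricts nvdlfw-inspect stat collection to exactly the quantized layers.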
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml`:
- Around line 44-45: Add a runtime validation that enforces the constraint
between quantized_model_init_kwargs.enabled and fp8_config.enabled: when
quantized_model_init_kwargs.enabled is true but fp8_config.enabled is false,
raise a clear error (ValueError/RuntimeError) with a message instructing the
user to enable fp8_config. Implement this check the same way as the existing
pattern in fp8_debugging.py (lines 49-53) — add it to the configuration/recipe
initialization path so it runs at startup and prevents passing recipe=None with
enabled=true to the quantized_model_init context manager.
---
Nitpick comments:
In `@bionemo-recipes/models/llama3/modeling_llama_te.py`:
- Around line 158-159: Add a short inline comment where the code resets
self._fp8_recipe and self._fp4_recipe to None (before calling post_init())
explaining the intentional pattern: these Recipe objects are set in the
constructor only for validation but are not serializable and must be attached
later via set_recipes() after model sharding; this clarifies why we clear them
here and expect set_recipes() to provide the runtime recipes. Ensure the comment
references the attributes _fp8_recipe and _fp4_recipe and the set_recipes() and
post_init() flow so future readers understand the design.
In `@bionemo-recipes/recipes/llama3_native_te/quantization.py`:
- Around line 86-94: The code currently calls yaml.dump twice; instead, call
yaml.dump(config, default_flow_style=False) once into a string (e.g.,
config_str), write that string to the NamedTemporaryFile
(temp_file.write(config_str) or temp_file.write(config_str.encode() if opened in
binary), log config_str with logger.info, close the temp_file and return
temp_file.name; update the logic around temp_file, config_str, yaml.dump,
logger.info, and return to use the single serialized string.
In `@bionemo-recipes/recipes/llama3_native_te/README.md`:
- Around line 97-99: Update the header text "Low Precision convergence
benchmarks" to use a hyphenated compound adjective: change the string "Low
Precision convergence benchmarks" to "Low-Precision convergence benchmarks"
(look for that exact header text in README.md, around the section titled "Low
Precision convergence benchmarks").
In `@bionemo-recipes/recipes/llama3_native_te/tests/test_quantization.py`:
- Around lines 248-314: Tests that call update_quant_stats_config (e.g.,
test_fp8_layers_updates_regex, test_none_layers_disables_matching,
test_fp4_section_disabled_fp8_still_updated, test_original_file_not_modified,
test_preserves_other_config_fields, test_missing_section_is_skipped) create
temp files with delete=False and never clean them up. Remove the produced file
after the assertions (os.remove(output_path)) or, better, pass a
tmp_path-based output location if update_quant_stats_config supports an output
path parameter. Alternatively, add a small pytest fixture that wraps
update_quant_stats_config and unlinks the returned temp file in teardown so
temp files stop accumulating.
In `@bionemo-recipes/recipes/llama3_native_te/tests/test_train.py`:
- Around lines 510-538: In test_sanity_ddp_fp8_partial_layers_stats_logging,
assert that the expected log files were created inside the directories, not
just that the directories exist. After the existing directory asserts for
quant_log_dir / "rank_0" / "nvdlfw_inspect_logs" and quant_log_dir / "rank_0"
/ "nvdlfw_inspect_statistics_logs", assert that each directory contains at
least one file (e.g., any(path.iterdir()) or
len(list(path.glob("*.log"))) > 0).
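The extra assertions can be factored into a small helper like this sketch; the directory names come from the review comment, while the helper name is illustrative.

```python
from pathlib import Path


def assert_nonempty_log_dirs(quant_log_dir: Path) -> None:
    """Each expected log directory must exist AND contain at least one file."""
    for sub in ("nvdlfw_inspect_logs", "nvdlfw_inspect_statistics_logs"):
        log_dir = quant_log_dir / "rank_0" / sub
        assert log_dir.is_dir(), f"missing directory: {log_dir}"
        assert any(log_dir.iterdir()), f"no log files written in {log_dir}"
```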
ℹ️ Review info
⚙️ Run configuration
Run ID: 7e63c8a0-7ffd-4ff6-a4e8-711e7a796b54
⛔ Files ignored due to path filters (4)
- docs/docs/assets/images/llama3/lingua-1b-loss-curve.png is excluded by !**/*.png
- docs/docs/assets/images/llama3/lingua-1b-step-time.png is excluded by !**/*.png
- docs/docs/assets/images/llama3/llama3_1b_fsdp2_tflops.png is excluded by !**/*.png
- docs/docs/assets/images/llama3/llama3_8gpu_tflops.png is excluded by !**/*.png
📒 Files selected for processing (18)
- bionemo-recipes/models/llama3/modeling_llama_te.py
- bionemo-recipes/models/llama3/tests/test_distributed_fp8.py
- bionemo-recipes/models/llama3/tests/test_layer_quantization.py
- bionemo-recipes/recipes/llama3_native_te/README.md
- bionemo-recipes/recipes/llama3_native_te/fp4_debugging_stats.yaml
- bionemo-recipes/recipes/llama3_native_te/fp8_debugging.py
- bionemo-recipes/recipes/llama3_native_te/fp8_debugging_stats.yaml
- bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/llama3_native_te/modeling_llama_te.py
- bionemo-recipes/recipes/llama3_native_te/perf_logger.py
- bionemo-recipes/recipes/llama3_native_te/quantization.py
- bionemo-recipes/recipes/llama3_native_te/tests/conftest.py
- bionemo-recipes/recipes/llama3_native_te/tests/test_perf_logger.py
- bionemo-recipes/recipes/llama3_native_te/tests/test_quantization.py
- bionemo-recipes/recipes/llama3_native_te/tests/test_train.py
- bionemo-recipes/recipes/llama3_native_te/train_ddp.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
💤 Files with no reviewable changes (1)
- bionemo-recipes/recipes/llama3_native_te/fp8_debugging.py
Replace broken fp8_first_last_bf16 mechanism with resolve_layer_precision() from PR #1500. The old approach set config attributes that were never read by the forward pass, causing all layers to default to FP8 regardless of the setting.

Key changes:
- Delete fp8_debugging.py, add quantization.py with resolve_layer_precision() and initialize_quant_stats_logging()
- Add set_recipes()/get_layer_autocast() to OG2 model (from lepton branch); model now handles per-layer autocast internally
- Model constructor accepts fp8_recipe/fp4_recipe; set_recipes() is called after FSDP wrapping since recipes aren't serializable
- Remove outer te.autocast() from training loop (model handles it)
- Rename fp8_stats_config -> quant_stats_config throughout
- Add _parse_layers_cfg() for CLI string support
- Add og2_7b_fp8_fl1_pq2.yaml with explicit fp8_layers=[2..31]
- Expand fp8_debugging_stats.yaml with all layer types + LogTensorStats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
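The semantics the commit message implies for resolve_layer_precision() can be sketched as below. This is an assumption-laden illustration, not the actual function from quantization.py: the signature, the "unlisted layers default to bf16" rule, and FP4 taking precedence over FP8 on overlap are all guesses from the description.

```python
def resolve_layer_precision(num_layers, fp8_enabled=False, fp4_enabled=False,
                            fp8_layers=None, fp4_layers=None):
    """Build a per-layer precision list; unlisted layers stay bf16 (sketch,
    semantics assumed from the commit message)."""
    precision = ["bf16"] * num_layers
    if fp8_enabled:
        # No explicit layer list means "all layers" in this sketch.
        for i in (fp8_layers if fp8_layers is not None else range(num_layers)):
            precision[i] = "fp8"
    if fp4_enabled:
        for i in (fp4_layers if fp4_layers is not None else range(num_layers)):
            precision[i] = "fp4"
    return precision
```

Under this reading, fp8_layers=[2..31] on a 32-layer model leaves layers 0-1 in bf16 and runs the rest in FP8, which is what the og2_7b_fp8_fl1_pq2.yaml config described above expresses explicitly.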

CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
Commits from authorized contributors will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
Leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review
To trigger a code review from CodeRabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.