
[tx] Add LLama3 support#657

Merged
pcmoritz merged 34 commits into NovaSky-AI:main from atemaguer:feat-llama3-support
Dec 15, 2025

Conversation

@atemaguer
Contributor

This PR adds Llama 3.2 model support to Tx. Llama3 and Qwen3 mostly share the same architecture, except for the QK-Norm layers present in Qwen3 but absent in Llama3, so the two models share certain layers.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Llama3 models, including a new implementation in tx.models.llama3, along with comprehensive unit and integration tests. A significant improvement is the refactoring of common layers like RMSNorm and SwiGLUMLP into a shared tx.layers.common module, which cleans up the existing Qwen3 model implementation and promotes code reuse.

My review has identified two main issues. First, a critical bug is introduced by the change to allow string-based stop sequences in the sampling parameters. The API is updated, but the backend generation logic is not, which will cause runtime failures. Second, there is a high-severity bug in the is_lora_param method within the new Llama3ForCausalLM model that prevents LoRA parameters from being identified for training, rendering LoRA fine-tuning ineffective.

Apart from these issues, the overall implementation and testing strategy are excellent.

  max_tokens: int | None = None
  seed: int | None = None
- stop: Sequence[int] | None = None
+ stop: Sequence[int] | Sequence[str] | None = None
Contributor


critical

The stop parameter in SamplingParams now accepts Sequence[str], but the generation logic is not updated to handle string-based stop sequences. The current implementation in tx.utils.generator.GeneratorMixin.generate expects a list of integer token IDs and will raise an error when it tries to convert a list containing strings into a jnp.array of integers. This will lead to a runtime crash for any request that uses string stop sequences.

To fix this, the string stop sequences need to be tokenized before being used in the generation loop. This likely requires passing the tokenizer to the generation function or handling tokenization within the API layer before creating the generation request.

@pcmoritz pcmoritz added the tx label Nov 12, 2025
@atemaguer
Contributor Author

@pcmoritz , any thoughts about this PR?

@atemaguer
Contributor Author

@tyler-griggs, got thoughts about this PR?

Member

@tyler-griggs tyler-griggs left a comment


Thanks for writing this up @atemaguer!

  max_tokens: int | None = None
  seed: int | None = None
- stop: Sequence[int] | None = None
+ stop: Sequence[int] | Sequence[str] | None = None
Member


I'm curious why these updates to stop were needed -- could you please explain?

Contributor Author


I think it's because the Tinker client API update supports providing stop sequences as well. Also, some of the examples were failing, especially when using Llama 3 as the base model.

Member


It makes sense to update this if the Tinker client API should support stop sequences, but I don't think our generation logic actually supports stop sequences right now. It seems like we should push this set of changes into a later PR, when we actually add support for stop string sequences rather than just stop tokens.

Member


Based on the observation you made that Qwen3 and Llama3 are very similar, could we simplify this even further such that Qwen3 inherits from Llama3 and adds fairly light-weight additional logic? E.g., the transformers library takes this pattern: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3/modular_qwen3.py

We could likely further reduce the duplicate code between the two models, which will make it easier to keep them consistent going forward. But let me know if this actually doesn't make sense.

Contributor Author


On it

@atemaguer
Contributor Author

@tyler-griggs , let me know what you think of these new updates

self.config = config

# Token embeddings
self.embed_tokens = nnx.Embed(
Member


I assume we'd like to have LoRA in the embedding layer for Llama3 (as we do in Qwen3). Is the plan to do this in a follow up PR?

# Get rope_theta from config
from tx.layers.common import apply_rope

rope_theta = getattr(self.config, "rope_theta", 500000.0)
Member


Ideally we would not set default parameters in the code like this, but rather inherit the defaults from the base config itself. Is there a reason to use getattr here or can we avoid it?
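A small illustrative sketch of what the reviewer is asking for: the default lives on the config dataclass itself, so call sites never need `getattr` with a fallback. Field names and values here mirror the discussion but are assumptions, not the actual tx config:

```python
from dataclasses import dataclass

@dataclass
class Llama3Config:
    # Illustrative fields; only rope_theta is taken from the discussion.
    hidden_size: int = 2048
    num_attention_heads: int = 32
    # The default is declared once on the config, so attention code can
    # write `self.config.rope_theta` instead of
    # `getattr(self.config, "rope_theta", 500000.0)`.
    rope_theta: float = 500000.0

config = Llama3Config()
theta = config.rope_theta  # 500000.0, no getattr fallback needed
```

Centralizing the default also means a checkpoint-specific override (e.g. `Llama3Config(rope_theta=10000.0)`) is visible in one place rather than scattered across `getattr` call sites.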

@tyler-griggs
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully adds support for Llama3 models and refactors the Qwen3 implementation to share common layers, which is a great step towards code reuse. The new Llama3 model and its associated tests are well-implemented. My review primarily focuses on improving the inheritance structure between the Llama3 and Qwen3 models. Currently, several Qwen3 classes inherit from their Llama3 counterparts but completely override methods with nearly identical code, leading to significant code duplication. I've provided suggestions to refactor the Llama3 base classes to be more extensible, which will reduce duplication and improve long-term maintainability. I also noted a minor issue with a local import.



- class Qwen3Attention(nnx.Module):
+ class Qwen3Attention(Llama3Attention):
Contributor


high

While it's great that Qwen3Attention inherits from Llama3Attention, the __call__ method is almost a complete copy of the parent's implementation, with the only addition being the QK-Norm step. This code duplication makes maintenance harder.

Consider refactoring Llama3Attention.__call__ to include a hook for processing Q and K tensors, which can be overridden by subclasses. This would allow Qwen3Attention to simply implement the QK-Norm logic without duplicating the entire attention mechanism.

For example, in Llama3Attention:

    def _process_q_k(self, q: jax.Array, k: jax.Array) -> tuple[jax.Array, jax.Array]:
        # No-op for Llama3
        return q, k

    def __call__(self, ...):
        # ... project q, k, v ...
        q, k = self._process_q_k(q, k)
        # ... apply RoPE, attention, etc. ...

Then, Qwen3Attention can just override _process_q_k and inherit __call__:

class Qwen3Attention(Llama3Attention):
    # ... __init__ ...
    def _process_q_k(self, q: jax.Array, k: jax.Array) -> tuple[jax.Array, jax.Array]:
        q = self.q_norm(q)
        k = self.k_norm(k)
        return q, k

Since llama3.py is part of this PR, this refactoring should be straightforward to apply.



- class Qwen3Model(nnx.Module):
+ class Qwen3Model(Llama3Model):
Contributor


high

Qwen3Model inherits from Llama3Model but completely overrides both __init__ and __call__ without calling super(). This makes the inheritance misleading and introduces significant code duplication, which will be a maintenance burden. The only differences are the types of the embedding and decoder layers, and the arguments passed to embed_tokens.

A better approach would be to make Llama3Model more generic so that Qwen3Model can reuse its structure. For example, you could parameterize the layer classes in Llama3Model.__init__:

# In Llama3Model
def __init__(self, config, *, dtype, rngs, embedding_cls=nnx.Embed, decoder_layer_cls=Llama3DecoderLayer, **embedding_kwargs):
    ...
    self.embed_tokens = embedding_cls(..., **embedding_kwargs)
    self.layers = nnx.List([decoder_layer_cls(...)])
    ...

This would simplify Qwen3Model to:

class Qwen3Model(Llama3Model):
    def __init__(self, config: Qwen3Config, *, dtype: jnp.dtype, rngs: nnx.Rngs) -> None:
        # Pass Qwen3-specific classes and LoRA-related kwargs for LoRAEmbed
        super().__init__(config, dtype=dtype, rngs=rngs,
                         embedding_cls=LoRAEmbed,
                         decoder_layer_cls=Qwen3DecoderLayer,
                         max_lora_adapters=config.max_lora_adapters,
                         max_lora_rank=config.max_lora_rank)

A similar refactoring could be applied to __call__ to handle the different arguments for embed_tokens.



- class Qwen3ForCausalLM(nnx.Module, GeneratorMixin):
+ class Qwen3ForCausalLM(Llama3ForCausalLM):
Contributor


high

Similar to Qwen3Model, Qwen3ForCausalLM inherits from Llama3ForCausalLM but overrides __init__ and __call__ with almost identical code, which is not ideal for maintainability. The key differences are the underlying model class (Qwen3Model vs Llama3Model) and the lm_head implementation (LoRALinear vs nnx.Linear).

To improve this, you could refactor Llama3ForCausalLM to be more configurable, allowing subclasses to specify the model class and lm_head implementation. This would allow Qwen3ForCausalLM to inherit the core logic and only define what's different, greatly reducing code duplication.
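One way to sketch the suggested refactor is with overridable class attributes: the base class owns `__init__`, and subclasses only swap the pieces that differ. The toy stand-in classes below are assumptions for illustration, not the actual tx modules:

```python
# Toy stand-ins for the real modules, to keep the sketch self-contained.
class Llama3Model:
    def __init__(self, config):
        self.config = config

class Qwen3Model(Llama3Model):
    pass

class Linear:  # stand-in for nnx.Linear
    def __init__(self, config):
        self.config = config

class LoRALinear(Linear):  # stand-in for the LoRA-aware head
    pass

class Llama3ForCausalLM:
    # Hooks subclasses override instead of copying __init__ wholesale.
    model_cls = Llama3Model
    lm_head_cls = Linear

    def __init__(self, config):
        self.model = self.model_cls(config)
        self.lm_head = self.lm_head_cls(config)

class Qwen3ForCausalLM(Llama3ForCausalLM):
    # Only the pieces that differ from Llama3 are swapped out.
    model_cls = Qwen3Model
    lm_head_cls = LoRALinear
```

With this shape, the shared `__call__` logic can also live on the base class, since both subclasses expose the same `self.model` / `self.lm_head` attributes.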

k = self.k_norm(k)

# Get rope_theta from config
from tx.layers.common import apply_rope
Contributor


medium

This local import should be moved to the top of the file to follow standard Python conventions and improve readability.

@atemaguer
Contributor Author

@tyler-griggs , let me know what you think about these new updates

@atemaguer
Contributor Author

Btw @pcmoritz, it looks like this PR will conflict with the FSDP support changes. Any chance you could review and merge this PR first, and then add the FSDP changes after?

@@ -1,8 +1,7 @@
"""Background engine for processing training requests."""

import argparse
Collaborator


I'm a little surprised about the diff in this file; maybe you need to merge main into the PR? These changes should already be on the current master :)

@atemaguer
Contributor Author

@pcmoritz, I merged the changes from main. Let me know if there's anything else that needs changing or improving.

@pcmoritz
Collaborator

Thanks a lot for all the work, I'll get this PR merged next :)

I'm planning to do a little bit of restructuring; it is probably easiest to just edit it directly :)

@pcmoritz
Collaborator

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Llama3 models, including LoRA training and generation. The changes are well-structured, introducing a generic ModelConfig and moving common layers like RMSNorm and apply_rope into separate files for reuse. New tests for the Llama3 model and its LoRA training capabilities are also included, which is great.

I've found one critical issue in the new Llama3Attention implementation related to batched decoding that needs to be addressed. This same issue appears to exist in the Qwen3Attention model as well and should be fixed there too. I also have a minor suggestion for code cleanup in one of the new test files. Overall, this is a solid contribution.

Comment on lines +100 to +101
k = jax.lax.dynamic_update_slice(k_cache, k, (0, cache_position, 0, 0))
v = jax.lax.dynamic_update_slice(v_cache, v, (0, cache_position, 0, 0))
Contributor


critical

The use of jax.lax.dynamic_update_slice with hardcoded start indices (0, cache_position, 0, 0) will not work correctly for batched decoding (i.e., when batch size > 1). It will only update the KV cache for the first sequence in the batch.

To support batching correctly, you should use the .at[...].set(...) syntax, which is aware of batch dimensions.

Suggested change

- k = jax.lax.dynamic_update_slice(k_cache, k, (0, cache_position, 0, 0))
- v = jax.lax.dynamic_update_slice(v_cache, v, (0, cache_position, 0, 0))
+ k = k_cache.at[:, cache_position:cache_position+1].set(k)
+ v = v_cache.at[:, cache_position:cache_position+1].set(v)
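A NumPy stand-in for the suggested update (plain slice assignment here mirrors JAX's `.at[:, pos:pos+1].set(k)`), showing that the batched form writes the new K entry for every sequence in the batch. Shapes follow the `(batch, seq_len, num_heads, head_dim)` layout from the diff; the concrete sizes are illustrative:

```python
import numpy as np

# Illustrative cache sizes; the real shapes come from the model config.
batch, seq_len, heads, dim = 2, 8, 4, 16
k_cache = np.zeros((batch, seq_len, heads, dim))
k_new = np.ones((batch, 1, heads, dim))  # one decode step of K per sequence

cache_position = 3
# NumPy equivalent of k_cache.at[:, pos:pos+1].set(k_new) in JAX:
# the full batch dimension is addressed, so all sequences get updated.
k_cache[:, cache_position:cache_position + 1] = k_new

# Every sequence in the batch now has its step-3 entry filled in,
# and every other position is untouched.
assert (k_cache[:, cache_position] == 1.0).all()
assert (k_cache[:, cache_position + 1] == 0.0).all()
```

In JAX the `.at[...].set(...)` form also stays functional (it returns a new array), which is why the suggested change rebinds `k` and `v` rather than mutating in place.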


optimizer.update(lora_params, lora_grads)

print(f"Step {step}: loss = {float(loss):.4f}")
Contributor


medium

This print statement appears to be for debugging purposes. It's best to remove it from the final test code to keep the test output clean. If you need to output information during tests, consider using Python's logging module.
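A minimal sketch of the suggested swap, routing the per-step loss through the standard `logging` module so normal test runs stay quiet. The logger name is a placeholder, not the actual tx module path:

```python
import logging

# Placeholder logger name for illustration.
logger = logging.getLogger("tx.tests.lora")

def report_step(step: int, loss: float) -> None:
    # Emitted only when the test runner enables DEBUG-level logging,
    # unlike print(), which always writes to stdout.
    logger.debug("Step %d: loss = %.4f", step, loss)

report_step(0, 1.2345)  # silent under the default WARNING level
```

A test runner (e.g. `pytest --log-cli-level=DEBUG`) can then opt in to the output without it polluting normal runs.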

Collaborator

@pcmoritz pcmoritz left a comment


I updated the code now. It turns out that trying to share too much between llama3 and qwen3 actually just makes the code harder to understand and doesn't have many benefits (e.g. vllm, sglang, and torchtitan also don't share), so I refactored the code to only share truly common layers.

@pcmoritz pcmoritz merged commit 1fe9dd4 into NovaSky-AI:main Dec 15, 2025
4 of 5 checks passed
pcmoritz added a commit that referenced this pull request Dec 17, 2025
In order to be able to run the rl_loop.py with
#657, we need to implement
string stop sequences.
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
This PR adds Llama 3.2 model support to Tx. Llama3 and Qwen3 mostly share the same architecture, except for the QK-Norm layers present in Qwen3 but absent in Llama3, so the two models share certain layers.

---------

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
In order to be able to run the rl_loop.py with
NovaSky-AI#657, we need to implement
string stop sequences.