Add tensor parallelism support for HF wrapper forward and lm_eval integration #340
Conversation
Commits:
- …aster according to torch src
- …e to right gpu after scatter
```python
    " If not set, it is inferred from the Fast-LLM model config or tokenizer.",
)

communication_timeout_sec: float = Field(
```
Unnecessarily long timeouts are often bad, so I recommend making the `timeout` optional (default `None`) and enabling it only as needed.
Context

Conceptually, wait primitives in places like `worker_forward` or the data-parallel worker should only exit under three conditions:
- They receive work
- They receive a finish message
- The connection with peers/coordinator is lost (after some timeout)

However, this is not how `torch.distributed` works. It is designed for more or less synchronous communication, while here we are trying to adapt it for asynchronous communication.
Problem

If we set the default timeout to `None`, users will end up seeing random timeouts in different places.
Discussion

A better long-term solution would be to use a distributed messaging framework that is more appropriate for sending work and finish messages. However, introducing another communication layer into `fast_llm` is likely outside the scope of this PR.
Proposal

- Keep the default timeout as it is, applied only to these entry points; reset the timeout back to the default of 60 sec after each wait operation.
- Clarify the naming/description to avoid confusion.
- Add a TODO to revisit this later with a more suitable communication framework.
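The proposal could look roughly like this; a minimal sketch with a plain dataclass, where `CommConfig`, `effective_wait_timeout`, and the 60 s constant are illustrative names, not Fast-LLM's actual `Field` API:

```python
from dataclasses import dataclass
from typing import Optional

# Default short timeout restored after each wait operation (assumed value).
DEFAULT_TIMEOUT_SEC = 60.0


@dataclass
class CommConfig:
    # Long timeout applied only while workers block waiting for work or a
    # finish message; None means "use the default timeout everywhere".
    communication_timeout_sec: Optional[float] = None

    def effective_wait_timeout(self) -> float:
        # Fall back to the short default when no long wait timeout is set.
        if self.communication_timeout_sec is None:
            return DEFAULT_TIMEOUT_SEC
        return self.communication_timeout_sec
```

With this shape, the long timeout is opt-in: leaving the field unset keeps the ordinary 60 s behavior everywhere.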
```python
# Meant to be overridden in derived classes
raise NotImplementedError()

def forward(
```
This doesn't seem relevant outside lm_eval. Any way to move it there?
I initially thought about handling this differently, but since each subclass of the model has its own class, the only practical way I found was to use a dynamic class that constructs itself on the fly with `type`. This lets us encapsulate `forward` of the Fast-LLM Hugging Face class and then pass it to `generate`.
Something like:

```python
def wrap_hf_model(model):
    inner_forward = get_bounded_method(model.forward, model)
    wrapper_class = get_new_type(
        model.__class__,
        {
            "inner_forward": inner_forward,
            "forward": coordinator_forward,
            "worker_forward": worker_forward,
        },
    )
    model.__class__ = wrapper_class
    return model
```
Another option would be to create a static wrapper class, but that would require exposing and forwarding a lot of functionality that `generate` expects.
So instead, I decided to implement this in our HF wrapper, since it sits before any class specialization.
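For illustration, the dynamic-class trick can be reduced to a self-contained toy; `BaseModel` and the `* 10` "orchestration" are made up stand-ins, and the real code would patch the HF specialization instead:

```python
class BaseModel:
    # Stand-in for the HF model class being wrapped.
    def forward(self, x):
        return x + 1


def wrap_model(model):
    # Capture the original (unbound) forward so the wrapper can delegate to it.
    inner_forward = model.__class__.forward

    def coordinator_forward(self, x):
        # Stand-in for TP orchestration: scatter/gather would happen here,
        # then we delegate to the original forward.
        return inner_forward(self, x) * 10

    # Build a subclass on the fly and swap it in as the instance's class, so
    # anything inherited from the base (e.g. generate) keeps working unchanged.
    wrapper_class = type(
        model.__class__.__name__ + "Wrapped",
        (model.__class__,),
        {"inner_forward": inner_forward, "forward": coordinator_forward},
    )
    model.__class__ = wrapper_class
    return model


model = wrap_model(BaseModel())
```

Because `wrapper_class` subclasses the original class, `isinstance` checks and inherited methods continue to work on the patched instance.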
I'm not sure I'm following here. From what I understand, these methods are called by `FastLLMLmEvalWrapper` above, which we are free to adjust as we want, and there isn't any dependency on the HF model, so moving should be easy. Or are there some calls in the lm_eval base class that absolutely enforce this structure?
This isn't present in typical HF models, so I'd prefer to avoid it.
No, `lm_eval` calls `FastLLMLmEvalWrapper`, which then calls our `model.generate`, which in turn makes multiple calls to `forward`; all of this runs entirely on TP rank 0. Our `forward` must be overridden to handle data distribution across all TP ranks.
```
----------------------------------------------------------------------------------------------------
|                                        Tensor Parallel Setup                                     |
----------------------------------------------------------------------------------------------------
[ lm_eval ]                 [ forward_worker ]       [ forward_worker ]  ... more
     |                              |                        |
     v                              v                        v
+-------------------------+        |                        |
| HF generate mixin       |        |                        |
| model must be HF        |        |                        |
| (model.generate)        |        |                        |
| - runs only on TP rank0 |        |                        |
| - does multiple forward |        |                        |
|   calls + sampling      |        |                        |
|   beam search, etc.     |        |                        |
+-------------------------+        |                        |
     |                             |                        |
     v                             |                        |
+-------------------------+        |                        |
| model.forward()         |--------+------------------------+--> [wait for data]
| must be overridden      |        |                        |    [long timeout]
| - orchestrates TP calls |        |                        |
+-------------------------+        |                        |
     |                             v                        v
+---------------------+   +---------------------+   +---------------------+  ... TP N-1
| TP rank 0           |   | TP rank 1           |   | TP rank 2           |
| model.forward_inner |   | model.forward_inner |   | model.forward_inner |
+---------------------+   +---------------------+   +---------------------+
```
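The coordinator/worker protocol in the diagram can be emulated with threads and queues standing in for `torch.distributed`; `coordinator_forward`, `worker_forward`, and the `* 2` "shard computation" below are toy names, not the PR's actual implementation:

```python
import queue
import threading

# Sentinel finish message: workers exit their wait loop when they receive it.
FINISH = object()


def worker_forward(inbox, results):
    # Worker loop: block until work or a finish message arrives
    # (this is the long-timeout wait in the real code).
    while True:
        msg = inbox.get()
        if msg is FINISH:
            break
        results.append(msg * 2)  # stand-in for the TP shard's forward pass


inboxes = [queue.Queue() for _ in range(2)]  # one "rank" per queue
results = []
threads = [
    threading.Thread(target=worker_forward, args=(q, results)) for q in inboxes
]
for t in threads:
    t.start()


def coordinator_forward(x):
    # Rank 0 scatters the work to all TP workers.
    for q in inboxes:
        q.put(x)


coordinator_forward(3)
for q in inboxes:
    q.put(FINISH)  # tell workers to exit their wait loop
for t in threads:
    t.join()
```

The point of the sketch is the shape of the protocol: workers exit only on work, a finish message, or (in the real distributed case) a timeout.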
Alternatives I have considered:

1. Composition (wrapper model around the HF specialization)
   - Define a class `OrchestratorModel` that looks like an HF model (has `forward`, `generate`, etc.).
   - It contains an inner `HuggingfaceBaseModelForCausalLM` (or subclass) that runs on workers.
   - The orchestrator's `forward` does the TP dispatch/gathering, then calls into the inner worker models as needed. `generate` (from HF) runs on this outer orchestrator class, which works because it just calls `self.forward`.

   This is clean, explicit, and stable, but involves boilerplate to replicate the HF interface.

2. Dynamic class injection (multiple inheritance / runtime patching)
   - Build a class at runtime that inherits both the HF specialization (`HuggingfaceGPTModelForCausalLM`, etc.) and an `OrchestratorMixin` (which overrides `forward`).
   - Register that as the actual model class for the model object. `generate` is inherited unmodified from the HF mixin, but calls `OrchestratorMixin.forward`.

   This avoids extra wrapper code, but is "hackier" and could break with HF updates.
That's why I dismissed the other options, but if we really want to keep `HuggingfaceBaseModelForCausalLM` completely unmodified, option 1 (composition) is likely the safer and more maintainable approach for our use case.
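A minimal sketch of the composition alternative (option 1); `ToyWorkerModel`, `OrchestratorModel`, and the `steps` parameter are illustrative, and a real version would have to forward the rest of the HF `generate` surface:

```python
class ToyWorkerModel:
    # Stand-in for HuggingfaceBaseModelForCausalLM running on a worker.
    def forward(self, x):
        return x + 1


class OrchestratorModel:
    """Composition wrapper: exposes an HF-like surface (forward/generate)
    and delegates the actual compute to the inner worker model."""

    def __init__(self, inner):
        self.inner = inner

    def forward(self, x):
        # Real code would dispatch to all TP ranks and gather results here.
        return self.inner.forward(x)

    def generate(self, x, steps=3):
        # HF-style generate loop: repeated forward calls on the orchestrator,
        # so the TP dispatch in forward() is exercised on every step.
        for _ in range(steps):
            x = self.forward(x)
        return x


model = OrchestratorModel(ToyWorkerModel())
```

The boilerplate cost mentioned above is exactly the set of methods and attributes, beyond `forward` and `generate`, that HF's `generate` machinery expects the outer object to expose.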
✨ Description
Add tensor parallelism support for HF wrapper forward and lm_eval integration
Closes #334
🔍 Type of change
Select all that apply:
📝 Changes
Key updates introduced in this PR:
- `_object_to_tensor` for faster performance (following PyTorch sources).
- `forward`.
- `generate` to run only on data-parallel leader ranks while tensor-parallel workers participate through `worker_forward`.
- `lm_eval` wrapper.
- `lm_eval` tasks or when batches are incomplete and some data-parallel ranks have no data.

🗒️ Notes and Known Issues