BioNeMo Conversion to Recipes #4001

Merged
holgerroth merged 2 commits into NVIDIA:main from holgerroth:bionemo_recipe_update
Jan 21, 2026

@holgerroth holgerroth commented Jan 21, 2026

Fixes # .

Description

Cherry-pick #3943 and #3982

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@holgerroth holgerroth requested a review from ZiyueXu77 January 21, 2026 20:01
@holgerroth
Collaborator Author

/build

ZiyueXu77 previously approved these changes Jan 21, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Jan 21, 2026

Greptile Summary

This PR cherry-picks #3982 to add BioNeMo Task Fitting with PyTorch. The changes replace the previous BioNeMo implementation with a cleaner PyTorch-based approach for federated protein embeddings and MLP training.

Major Changes:

  • Replaced custom BioNeMo launchers and learners with PyTorch-based implementations using NVFlare's FedAvgRecipe
  • Added new MLP training client (task_fitting/job_fedavg/client.py) with proper federated learning workflow
  • Added inference job (task_fitting/job_inference/) for ESM2 embedding extraction
  • Refactored downstream examples (SAbDab, TAP, SCL) to use new recipe-based approach
  • Removed legacy implementation files (bionemo_mlp_learner.py, bionemo_mlp_model_persistor.py, etc.)
  • Updated notebooks and documentation to reflect PyTorch workflow

Key Improvements:

  • Cleaner separation between inference (embedding extraction) and training (MLP classification)
  • Better integration with NVFlare's client API using flare.init(), flare.receive(), flare.send()
  • Support for both federated and local training modes via SIM_LOCAL environment variable
  • Added TensorBoard integration for experiment tracking
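The `SIM_LOCAL` switch mentioned above could look roughly like the following. This is a minimal sketch, assuming the variable is read as a truthy string; the accepted values and the helper names `is_local_mode` and `describe_mode` are hypothetical and not taken from the PR.

```python
import os

# Values treated as "enabled"; this set is an assumption for illustration.
_TRUTHY = {"1", "true", "yes", "on"}


def is_local_mode(env=None):
    """Return True when SIM_LOCAL requests local (non-federated) training."""
    env = os.environ if env is None else env
    return env.get("SIM_LOCAL", "").strip().lower() in _TRUTHY


def describe_mode(env=None):
    """Map the flag to a human-readable mode name."""
    return "local" if is_local_mode(env) else "federated"
```

Passing a plain dict as `env` keeps the check easy to exercise without touching the real process environment.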

Issues Found:

  • Potential division by zero in evaluate_model function when dataloader is empty (critical fix needed)
  • Interactive input() in job configuration blocks automation (previously noted)
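One way to avoid the blocking `input()` prompt is to take the mode from a command-line flag with a default. The sketch below only illustrates that idea; the flag name `--mode` and its choices are hypothetical and may not match what `job.py` actually offers.

```python
import argparse


def parse_args(argv=None):
    """Parse the training mode from the command line instead of prompting."""
    parser = argparse.ArgumentParser(
        description="Select the training mode without an interactive prompt."
    )
    parser.add_argument(
        "--mode",
        choices=["local", "fedavg"],  # hypothetical mode names
        default="fedavg",
        help="Training mode to run (default: fedavg).",
    )
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

With a default in place, scripted runs (CI, batch jobs) proceed without any user interaction, while `--mode local` still allows an explicit override.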

Confidence Score: 4/5

  • This PR is generally safe to merge with one critical division by zero bug that needs fixing
  • The PR successfully refactors BioNeMo examples to use PyTorch-based federated learning with cleaner architecture. Code quality is good with proper error handling, comprehensive documentation, and well-structured examples. However, there's a division by zero bug in the evaluation function that could cause runtime errors if an empty dataloader is passed. The interactive input() was already noted in previous reviews. The hardcoded /tmp paths are acceptable for examples. Overall the refactoring improves maintainability and follows NVFlare best practices.
  • Pay close attention to examples/advanced/bionemo/task_fitting/job_fedavg/client.py which has a division by zero bug that needs fixing before merge
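The division by zero described above typically comes from dividing accumulated loss and correct counts by a sample total that is zero for an empty dataloader. The following is a framework-free sketch of the guard, not the PR's code: the real `evaluate_model` works on a PyTorch dataloader, and the signature here (`batches`, `predict`, `loss_fn`) is hypothetical. Returning zeros is one possible policy; returning NaN or raising an error would be equally valid choices.

```python
def evaluate_model(batches, predict, loss_fn):
    """Return (mean_loss, accuracy) over an iterable of (inputs, labels) batches."""
    total_loss = 0.0
    correct = 0
    total = 0
    for inputs, labels in batches:
        predictions = predict(inputs)
        # Accumulate per-sample loss and the number of exact matches.
        total_loss += sum(loss_fn(p, y) for p, y in zip(predictions, labels))
        correct += sum(1 for p, y in zip(predictions, labels) if p == y)
        total += len(labels)
    if total == 0:
        # Empty dataloader: report zeros instead of dividing by zero.
        return 0.0, 0.0
    return total_loss / total, correct / total
```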

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/advanced/bionemo/task_fitting/job_fedavg/client.py | New PyTorch-based federated MLP training client with proper error handling and metrics tracking |
| examples/advanced/bionemo/task_fitting/job_fedavg/job.py | FedAvg recipe configuration with interactive mode selection, uses hardcoded paths |
| examples/advanced/bionemo/task_fitting/job_inference/client.py | ESM2 inference client using subprocess to run external BioNeMo command |
| examples/advanced/bionemo/downstream/client.py | BioNeMo ESM2 fine-tuning client with complex training logic, uses os._exit(0) for cleanup |
| examples/advanced/bionemo/downstream/sabdab/job.py | SAbDab dataset job configuration with custom filters for BioNeMo model parameters |

Sequence Diagram

sequenceDiagram
    participant User
    participant InferenceJob as Inference Job (job.py)
    participant InferenceClient as Inference Client
    participant ESM2Model as ESM2 Model
    participant TrainingJob as Training Job (job.py)
    participant TrainingClient as MLP Training Client
    participant Server as FedAvg Server
    
    Note over User,Server: Phase 1: Embedding Extraction
    User->>InferenceJob: Run inference job
    InferenceJob->>Server: Initialize FedAvgRecipe (1 round)
    Server->>InferenceClient: Send inference task
    InferenceClient->>ESM2Model: Run infer_esm2 subprocess
    ESM2Model-->>InferenceClient: Return protein embeddings
    InferenceClient->>InferenceClient: Save embeddings to /tmp/data/.../results
    InferenceClient->>Server: Send metadata (num_sequences, shapes)
    
    Note over User,Server: Phase 2: MLP Training
    User->>TrainingJob: Run training job (select mode)
    TrainingJob->>Server: Initialize FedAvgRecipe (50 rounds)
    
    loop For each round (1-50)
        Server->>TrainingClient: Send global model weights
        TrainingClient->>TrainingClient: Load embeddings and labels
        TrainingClient->>TrainingClient: Evaluate global model
        TrainingClient->>TrainingClient: Train locally for N epochs
        TrainingClient->>TrainingClient: Evaluate trained model
        TrainingClient->>Server: Send updated weights + metrics
        Server->>Server: Aggregate weights from all clients
    end
    
    Server-->>User: Final global model
    User->>User: View results in TensorBoard

@greptile-apps greptile-apps Bot left a comment

11 files reviewed, 1 comment

Comment thread examples/advanced/bionemo/task_fitting/job_fedavg/job.py
holgerroth and others added 2 commits January 21, 2026 15:07
Fixes # .

### Description

Convert bionemo examples to use FedAvgRecipe

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: Holger Roth <hroth@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>

Fixes # .

### Description

Switch to PyTorch for task fitting experiments.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: Holger Roth <hroth@nvidia.com>
Co-authored-by: root <root@r1u14.cm.cluster>
@holgerroth holgerroth force-pushed the bionemo_recipe_update branch from a1d008d to ebbbeb8 Compare January 21, 2026 20:08
@holgerroth holgerroth changed the title from "BioNeMo Task Fitting with PyTorch" to "BioNeMo Conversion to Recipes" Jan 21, 2026
@holgerroth holgerroth requested a review from ZiyueXu77 January 21, 2026 20:09

@greptile-apps greptile-apps Bot left a comment

27 files reviewed, 1 comment

Comment thread examples/advanced/bionemo/task_fitting/job_fedavg/client.py
@holgerroth
Collaborator Author

/build
