RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable #41

@EuanPyle

Description

Hi!
I've picked some particles nicely using Surforama, and I'm now trying to train the NN to pick them in the future.

I am running the following command:

membrain_pick train --data-dir . --training-dir training_output

Within the current dir (data-dir), I have the directories train and var. Within these are the .star and .h5 files with matching names. When I run the command, though, I get the following error (this is an abridged log, as the full message is extremely long; let me know if I should post the entire thing):

Loading membranes into dataset.
Projecting points to nearest hyperplane.
Projecting points to nearest hyperplane.
Precomputing partitioning of the mesh.
Loaded partitioning for membrane 0.
Loaded partitioning for membrane 1.
Loading membranes into dataset.
Projecting points to nearest hyperplane.
Projecting points to nearest hyperplane.
Precomputing partitioning of the mesh.
Loaded partitioning for membrane 0.
Loaded partitioning for membrane 1.
/g/mattei/euan/miniforge/envs/membrainv2/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python3.10 /g/mattei/euan/miniforge/envs/membrainv2/bin/mem ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Loading membranes into dataset.
Projecting points to nearest hyperplane.
Projecting points to nearest hyperplane.
Precomputing partitioning of the mesh.
Loaded partitioning for membrane 0.
Loaded partitioning for membrane 1.
Loading membranes into dataset.
Projecting points to nearest hyperplane.
Projecting points to nearest hyperplane.
Precomputing partitioning of the mesh.
Loaded partitioning for membrane 0.
Loaded partitioning for membrane 1.
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

| Name | Type | Params | Mode

0 | model | DiffusionNet | 9.7 K | train
1 | criterion | CombinedLoss | 0 | train

9.7 K Trainable params
0 Non-trainable params
9.7 K Total params
0.039 Total estimated model params size (MB)
76 Modules in train mode
0 Modules in eval mode
Sanity Checking DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.44it/s]
Validation epoch loss: 54.443806
Validation epoch loss: 59.996666
/g/mattei/euan/miniforge/envs/membrainv2/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:434: It is recommended to use self.log('val_loss', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0: 5%|█████▎ | 16/347 [03:16<1:07:35, 0.08it/s, v_num=3]

ERROR MESSAGE STARTS HERE:

[rank0]: │ /g/mattei/euan/miniforge/envs/membrainv2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py:1528 in _pre_forward │
[rank0]: │ │
[rank0]: │ 1525 │ │ # call _rebuild_buckets before the peak memory usage increases │
[rank0]: │ 1526 │ │ # during forward computation. │
[rank0]: │ 1527 │ │ # This should be called only once during whole training period. │
[rank0]: │ ❱ 1528 │ │ if torch.is_grad_enabled() and self.reducer._rebuild_buckets(): │
[rank0]: │ 1529 │ │ │ logger.info("Reducer buckets have been rebuilt in this iteration.") │
[rank0]: │ 1530 │ │ │ self._has_rebuilt_buckets = True │
[rank0]: │ 1531 │
[rank0]: ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[rank0]: RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable
[rank0]: the detection of unused parameters in DDP, either by setting the string value strategy='ddp_find_unused_parameters_true' or by setting the flag in the strategy with
[rank0]: strategy=DDPStrategy(find_unused_parameters=True).
Epoch 0: 5%|▍ | 16/347 [06:59<2:24:45, 0.04it/s, v_num=3]
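For context, the condition DDP is flagging can be reproduced with a toy module (a minimal sketch, not the membrain_pick architecture): if forward() never touches one of the module's registered parameters, that parameter gets no gradient, and DDP's reducer waits for it indefinitely unless told to look for unused parameters.

```python
import torch
import torch.nn as nn

# Toy module whose forward pass does not use all registered parameters —
# the situation the DDP RuntimeError above is complaining about.
class PartiallyUsed(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never touched in forward()

    def forward(self, x):
        return self.used(x)

model = PartiallyUsed()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# The unused branch receives no gradient. Under DDP, the reducer waits
# for a gradient from every parameter and raises the RuntimeError seen
# above unless configured with strategy="ddp_find_unused_parameters_true"
# (or DDPStrategy(find_unused_parameters=True)), as the message suggests.
print(model.used.weight.grad is None)    # False: gradient was produced
print(model.unused.weight.grad is None)  # True: no gradient arrives
```

Since membrain_pick constructs the Lightning Trainer internally, applying that strategy flag would presumably need either a CLI/config option exposing it or a small change in the package's trainer setup.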

Any idea on how I can get around this?
Thanks
Euan
