RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable #41

@EuanPyle

Description

Hi!
I've picked some particles nicely using Surforama, and I'm now trying to train the NN to pick them in the future.

I am running the following command:

membrain_pick train --data-dir . --training-dir training_output

Within the current dir (data-dir), I have the directories train and var. Within these are the .star and .h5 files with matching names. When I run the command, though, I get the following error (this is an abridged log, as the full message is extremely long; let me know if I should post the entire thing):

Loading membranes into dataset.
Projecting points to nearest hyperplane.
Projecting points to nearest hyperplane.
Precomputing partitioning of the mesh.
Loaded partitioning for membrane 0.
Loaded partitioning for membrane 1.
Loading membranes into dataset.
Projecting points to nearest hyperplane.
Projecting points to nearest hyperplane.
Precomputing partitioning of the mesh.
Loaded partitioning for membrane 0.
Loaded partitioning for membrane 1.
/g/mattei/euan/miniforge/envs/membrainv2/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python3.10 /g/mattei/euan/miniforge/envs/membrainv2/bin/mem ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Loading membranes into dataset.
Projecting points to nearest hyperplane.
Projecting points to nearest hyperplane.
Precomputing partitioning of the mesh.
Loaded partitioning for membrane 0.
Loaded partitioning for membrane 1.
Loading membranes into dataset.
Projecting points to nearest hyperplane.
Projecting points to nearest hyperplane.
Precomputing partitioning of the mesh.
Loaded partitioning for membrane 0.
Loaded partitioning for membrane 1.
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

| Name | Type | Params | Mode

0 | model | DiffusionNet | 9.7 K | train
1 | criterion | CombinedLoss | 0 | train

9.7 K Trainable params
0 Non-trainable params
9.7 K Total params
0.039 Total estimated model params size (MB)
76 Modules in train mode
0 Modules in eval mode
Sanity Checking DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.44it/s]
Validation epoch loss: 54.443806
Validation epoch loss: 59.996666
/g/mattei/euan/miniforge/envs/membrainv2/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:434: It is recommended to use self.log('val_loss', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0: 5%|█████▎ | 16/347 [03:16<1:07:35, 0.08it/s, v_num=3]

ERROR MESSAGE STARTS HERE:

[rank0]: │ /g/mattei/euan/miniforge/envs/membrainv2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py:1528 in _pre_forward │
[rank0]: │ │
[rank0]: │ 1525 │ │ # call _rebuild_buckets before the peak memory usage increases │
[rank0]: │ 1526 │ │ # during forward computation. │
[rank0]: │ 1527 │ │ # This should be called only once during whole training period. │
[rank0]: │ ❱ 1528 │ │ if torch.is_grad_enabled() and self.reducer._rebuild_buckets(): │
[rank0]: │ 1529 │ │ │ logger.info("Reducer buckets have been rebuilt in this iteration.") │
[rank0]: │ 1530 │ │ │ self._has_rebuilt_buckets = True │
[rank0]: │ 1531 │
[rank0]: ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[rank0]: RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable
[rank0]: the detection of unused parameters in DDP, either by setting the string value strategy='ddp_find_unused_parameters_true' or by setting the flag in the strategy with
[rank0]: strategy=DDPStrategy(find_unused_parameters=True).
Epoch 0: 5%|▍ | 16/347 [06:59<2:24:45, 0.04it/s, v_num=3]
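For context, the condition DDP is flagging can be reproduced with a toy module (a minimal sketch, not the membrain_pick architecture): if forward() never touches one of the module's registered parameters, that parameter gets no gradient, and DDP's reducer waits for it indefinitely unless told to look for unused parameters.

```python
import torch
import torch.nn as nn

# Toy module whose forward pass does not use all registered parameters —
# the situation the DDP RuntimeError above is complaining about.
class PartiallyUsed(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never touched in forward()

    def forward(self, x):
        return self.used(x)

model = PartiallyUsed()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# The unused branch receives no gradient. Under DDP, the reducer waits
# for a gradient from every parameter and raises the RuntimeError seen
# above unless configured with strategy="ddp_find_unused_parameters_true"
# (or DDPStrategy(find_unused_parameters=True)), as the message suggests.
print(model.used.weight.grad is None)    # False: gradient was produced
print(model.unused.weight.grad is None)  # True: no gradient arrives
```

Since membrain_pick constructs the Lightning Trainer internally, applying that strategy flag would presumably need either a CLI/config option exposing it or a small change in the package's trainer setup.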

Any idea on how I can get around this?
Thanks
Euan
