RuntimeError in DataLoader: Stack Expects Each Tensor to Be Equal Size During Training #2728

@Chap5732

Description

Is there an existing issue for this?

  • I have searched the existing issues

Bug description

My training dataset consists of 338 images, each containing 1-10 mice, with 6 annotated bodyparts
(nose, left_ear, right_ear, mid_back, mouse_center and tail1; those are just the names I assigned, and according to confusion_matrix.png they do not truly correspond to the SuperAnimal keypoints of the same name).

I wanted to fine-tune a SuperAnimal pretrained model (superanimal_topviewmouse, 27 keypoints) on my data (6 keypoints), and I get the following error:
```
Traceback (most recent call last):
  File "/ssd01/user_acc_data/oppa/deeplabcut/code/satvm_training.py", line 29, in <module>
    main()
  File "/ssd01/user_acc_data/oppa/deeplabcut/code/satvm_training.py", line 26, in main
    train_network(config_path, shuffle, device, net_type)
  File "/ssd01/user_acc_data/oppa/deeplabcut/code/satvm_training.py", line 6, in train_network
    deeplabcut.train_network(
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/deeplabcut/compat.py", line 245, in train_network
    return train_network(
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py", line 326, in train_network
    train(
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py", line 189, in train
    runner.fit(
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py", line 170, in fit
    train_loss = self._epoch(
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py", line 220, in _epoch
    for i, batch in enumerate(loader):
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 316, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 154, in collate
    clone.update({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 154, in <dictcomp>
    clone.update({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 141, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 222, in collate_numpy_array_fn
    return collate([torch.as_tensor(b) for b in batch], collate_fn_map=collate_fn_map)
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 141, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/share/user_data/oppa/.conda/envs/dlc-oppa/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 213, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [3, 1688, 1695] at entry 0 and [3, 1688, 1726] at entry 11
```
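
For context, the final RuntimeError just means that torch's default collate function received images of different widths inside one batch and could not stack them. A minimal sketch (plain PyTorch with the two sizes taken from the traceback, not DeepLabCut code) that reproduces the same message:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class VariableWidthImages(Dataset):
    """Toy dataset: 3-channel images with equal height but different widths."""

    def __init__(self):
        # The two widths reported in the traceback above.
        self.widths = [1695, 1726]

    def __len__(self):
        return len(self.widths)

    def __getitem__(self, idx):
        return {"image": torch.zeros(3, 1688, self.widths[idx])}


# default_collate tries torch.stack on the "image" tensors and fails,
# just like in the DataLoader worker during training.
loader = DataLoader(VariableWidthImages(), batch_size=2)
next(iter(loader))  # RuntimeError: stack expects each tensor to be equal size ...
```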

Operating System

Linux

DeepLabCut version

3.0.0.rc2

DeepLabCut mode

multi animal

Device type

gpu

Steps To Reproduce

Create-shuffle step:

```python
from pathlib import Path
import deeplabcut
from deeplabcut.core.engine import Engine
from deeplabcut.core.weight_init import WeightInitialization
from deeplabcut.modelzoo.utils import (
    create_conversion_table,
    read_conversion_table_from_csv,
)
from deeplabcut.utils.pseudo_label import keypoint_matching

def create_training_dataset(config_path, shuffle, super_animal_name, model_name, conversion_table_path, net_type):
    # Step 1: Keypoint matching before creating the training dataset
    keypoint_matching(config_path, super_animal_name, model_name)

    # Step 2: Initialize weights for memory replay
    table = create_conversion_table(
        config=config_path,
        super_animal=super_animal_name,
        project_to_super_animal=read_conversion_table_from_csv(conversion_table_path),
    )
    
    weight_init = WeightInitialization(
        dataset=super_animal_name,
        conversion_array=table.to_array(),
        with_decoder=True,
        memory_replay=True,
    )

    # Step 3: Create training dataset
    deeplabcut.create_training_dataset(
        config_path,
        Shuffles=[shuffle],
        net_type=net_type,
        weight_init=weight_init,
        engine=Engine.PYTORCH,
        userfeedback=False
    )

def main():
    dlc_proj_root = Path("/ssd01/user_acc_data/oppa/deeplabcut/projects/oppamousetracker-Oppa-2024-08-23")
    super_animal_name = "superanimal_topviewmouse"
    net_type = 'top_down_hrnet_w32'
    model_name = 'hrnetw32'
    shuffle = 1

    config_path = str(dlc_proj_root / "config.yaml")
    conversion_table_path = dlc_proj_root / "memory_replay" / "conversion_table.csv"

    # Step 1: Create training dataset
    create_training_dataset(config_path, shuffle, super_animal_name, model_name, conversion_table_path, net_type)

if __name__ == "__main__":
    main()
```
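
For reference, my reading of the conversion table this step uses (it shows up as conversion_array: [0, 1, 2, 7, 9, 13] in the pytorch_config below) is the mapping sketched here, assuming project_to_super_animal takes a plain dict of project bodypart to SuperAnimal bodypart, which is what I understand read_conversion_table_from_csv to return:

```python
# Hypothetical inline equivalent of memory_replay/conversion_table.csv.
# Indices refer to the SuperAnimal-TopViewMouse keypoint order listed in the
# pytorch_config metadata below (my reading of conversion_array: [0, 1, 2, 7, 9, 13]).
project_to_super_animal = {
    "nose": "nose",                  # 0
    "left_ear": "left_ear",          # 1
    "right_ear": "right_ear",        # 2
    "mid_back": "neck",              # 7
    "mouse_center": "mouse_center",  # 9
    "tail1": "tail_base",            # 13
}
```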

Training step:

```python
from pathlib import Path
import deeplabcut

def train_network(config_path, shuffle, device, net_type):
    # Train the network with memory replay
    deeplabcut.train_network(
        config_path, 
        shuffle=shuffle, 
        device=device, 
        pose_threshold=0.1, 
        net_type=net_type,
        detector_batch_size=32,
        batch_size=64,
        freeze_bn_stats=False
    )

def main():
    dlc_proj_root = Path("/ssd01/user_acc_data/oppa/deeplabcut/projects/oppamousetracker-Oppa-2024-08-23")
    net_type = 'top_down_hrnet_w32'
    device = "cuda"
    shuffle = 1

    config_path = str(dlc_proj_root / "config.yaml")

    # Step 2: Train the network
    train_network(config_path, shuffle, device, net_type)

if __name__ == "__main__":
    main()
```

Error report: the same traceback as shown in the Bug description above.
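
Since the collate error reports two different image widths ([3, 1688, 1695] vs [3, 1688, 1726]), the distinct frame sizes under labeled-data could be listed with a quick sketch like this (the .png glob is an assumption about how the extracted frames are stored):

```python
from collections import Counter
from pathlib import Path

from PIL import Image

# Hypothetical check: count the distinct (width, height) sizes of the labeled
# frames, since the collate error implies mixed image sizes within one batch.
project_root = Path(
    "/ssd01/user_acc_data/oppa/deeplabcut/projects/oppamousetracker-Oppa-2024-08-23"
)
sizes = Counter(
    Image.open(frame).size
    for video_dir in (project_root / "labeled-data").iterdir()
    if video_dir.is_dir()
    for frame in video_dir.glob("*.png")
)
print(sizes)  # presumably more than one size, e.g. (1695, 1688) and (1726, 1688)
```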

Relevant log output

pytorch_config:

```yaml
Task:
scorer:
date:
multianimalproject:
identity:
project_path:
engine: tensorflow
video_sets:
bodyparts:
start:
stop:
numframes2pick:
skeleton: []
skeleton_color: black
pcutoff:
dotsize:
alphavalue:
colormap:
TrainingFraction:
iteration:
default_net_type:
default_augmenter:
snapshotindex:
detector_snapshotindex:
batch_size:
cropping:
x1:
x2:
y1:
y2:
corner2move2:
move2corner:
SuperAnimalConversionTables:
data:
  colormode: RGB
  inference:
    auto_padding:
      pad_width_divisor: 32
      pad_height_divisor: 32
    normalize_images: true
  train:
    affine:
      p: 0.5
      scaling:
      - 1.0
      - 1.0
      rotation: 30
      translation: 0
    gaussian_noise: 12.75
    normalize_images: true
    auto_padding:
      pad_width_divisor: 32
      pad_height_divisor: 32
detector:
  data:
    colormode: RGB
    inference:
      normalize_images: true
    train:
      hflip: true
      normalize_images: true
  device: auto
  model:
    type: FasterRCNN
    variant: fasterrcnn_resnet50_fpn_v2
    box_score_thresh: 0.6
    pretrained: false
  runner:
    type: DetectorTrainingRunner
    eval_interval: 50
    optimizer:
      type: AdamW
      params:
        lr: 1e-05
    scheduler:
      type: LRListScheduler
      params:
        milestones:
        - 90
        lr_list:
        - - 1e-06
    snapshots:
      max_snapshots: 5
      save_epochs: 50
      save_optimizer_state: false
  train_settings:
    batch_size: 32
    dataloader_workers: 32
    dataloader_pin_memory: true
    display_iters: 500
    epochs: 250
device: auto
metadata:
  project_path: /ssd01/user_acc_data/oppa/deeplabcut/projects/oppamousetracker-Oppa-2024-08-23
  pose_config_path: 
    /ssd01/user_acc_data/oppa/deeplabcut/projects/oppamousetracker-Oppa-2024-08-23/dlc-models-pytorch/iteration-0/oppamousetrackerAug23-trainset90shuffle1/train/pose_cfg.yaml
  bodyparts:
  - nose
  - left_ear
  - right_ear
  - left_ear_tip
  - right_ear_tip
  - left_eye
  - right_eye
  - neck
  - mid_back
  - mouse_center
  - mid_backend
  - mid_backend2
  - mid_backend3
  - tail_base
  - tail1
  - tail2
  - tail3
  - tail4
  - tail5
  - left_shoulder
  - left_midside
  - left_hip
  - right_shoulder
  - right_midside
  - right_hip
  - tail_end
  - head_midpoint
  unique_bodyparts: []
  individuals:
  - individual1
  - individual2
  - individual3
  - individual4
  - individual5
  - individual6
  - individual7
  - individual8
  - individual9
  - individual10
  with_identity: false
method: td
model:
  backbone:
    type: HRNet
    model_name: hrnet_w32
    pretrained: false
    freeze_bn_stats: false
    freeze_bn_weights: false
    interpolate_branches: false
    increased_channel_count: false
  backbone_output_channels: 32
  heads:
    bodypart:
      type: HeatmapHead
      weight_init: normal
      predictor:
        type: HeatmapPredictor
        apply_sigmoid: false
        clip_scores: true
        location_refinement: false
        locref_std: 7.2801
      target_generator:
        type: HeatmapGaussianGenerator
        num_heatmaps: 27
        pos_dist_thresh: 17
        heatmap_mode: KEYPOINT
        generate_locref: false
        locref_std: 7.2801
      criterion:
        heatmap:
          type: WeightedMSECriterion
          weight: 1.0
      heatmap_config:
        channels:
        - 32
        - 27
        kernel_size:
        - 1
        strides:
        - 1
net_type: top_down_hrnet_w32
runner:
  type: PoseTrainingRunner
  key_metric: test.mAP
  key_metric_asc: true
  eval_interval: 10
  optimizer:
    type: AdamW
    params:
      lr: 1e-05
  scheduler:
    type: LRListScheduler
    params:
      lr_list:
      - - 1e-06
      - - 1e-07
      milestones:
      - 160
      - 190
  snapshots:
    max_snapshots: 5
    save_epochs: 25
    save_optimizer_state: false
train_settings:
  batch_size: 64
  dataloader_workers: 64
  dataloader_pin_memory: true
  display_iters: 500
  epochs: 200
  pretrained_weights:
  seed: 42
  weight_init:
    dataset: superanimal_topviewmouse
    with_decoder: true
    memory_replay: true
    conversion_array:
    - 0
    - 1
    - 2
    - 7
    - 9
    - 13
freeze_bn_stats: false
```
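
One thing I noticed while reading the config above: if auto_padding with pad_width_divisor: 32 only rounds each image up to the next multiple of 32 (my assumption about its behaviour), the two widths from the traceback would still differ after padding and could not be stacked into one batch. A small sanity check of that arithmetic:

```python
import math

def pad_to_divisor(size: int, divisor: int = 32) -> int:
    # Round up to the next multiple of `divisor`; my assumption of what
    # auto_padding's pad_width_divisor / pad_height_divisor do per image.
    return math.ceil(size / divisor) * divisor

print(pad_to_divisor(1695), pad_to_divisor(1726))  # 1696 1728 -> widths still differ
print(pad_to_divisor(1688))                        # 1696     -> heights would match
```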

Anything else?
