batch_norm error in fit() at end of training epoch #1723

Closed
lcadalzo opened this issue May 30, 2022 · 2 comments · Fixed by #1883
Labels: bug
@lcadalzo
Collaborator

Describe the bug
At the end of this for loop, depending on batch_size and the first dimension of x_preprocessed, the final i_batch can end up containing a single sample. This happens whenever x_preprocessed.shape[0] % batch_size == 1. In my case, x_preprocessed has shape (39209, 48, 48, 3) and batch_size is 8, and 39209 % 8 == 1. When i_batch has batch size 1, PyTorch's batch normalization fails, specifically in this function, which is called here. The end result is an error that looks like this:

File "/opt/conda/lib/python3.8/site-packages/art/estimators/classification/pytorch.py", line 1115, in forward
    x = module_(x)
        │       └ tensor([[  7.4983,  33.5128,   3.2305,   0.0000,  60.0542,   0.0000,   0.0000,
        │                    0.0000,   0.0000,  11.4205,   0.000...
        └ BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
           │             │        └ {}
           │             └ (tensor([[  7.4983,  33.5128,   3.2305,   0.0000,  60.0542,   0.0000,   0.0000,
           │                          0.0000,   0.0000,  11.4205,   0.00...
           └ <bound method _BatchNorm.forward of BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)>
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 168, in forward
    return F.batch_norm(
           │ └ <function batch_norm at 0x7f1e6b6bc550>
           └ <module 'torch.nn.functional' from '/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py'>
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2280, in batch_norm
    _verify_batch_size(input.size())
    │                  │     └ <method 'size' of 'torch._C._TensorBase' objects>
    │                  └ tensor([[  7.4983,  33.5128,   3.2305,   0.0000,  60.0542,   0.0000,   0.0000,
    │                               0.0000,   0.0000,  11.4205,   0.000...
    └ <function _verify_batch_size at 0x7f1e6b6bc4c0>
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2248, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
                                                                                                      └ torch.Size([1, 300])

ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 300])

This doesn't appear to be strictly an ART bug, but rather an error arising from the interaction between ART and PyTorch.
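
For context, the underlying failure can be reproduced with plain PyTorch alone. Below is a minimal sketch; the feature size of 300 matches the BatchNorm1d layer in the traceback above, everything else is illustrative:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(300)   # same feature size as the layer in the traceback
bn.train()                 # fit() runs the model in training mode

out = bn(torch.randn(8, 300))  # batch of 8: works
out = bn(torch.randn(1, 300))  # batch of 1: raises
# ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 300])

In eval mode (bn.eval()) the single-sample batch would pass, since batch norm then uses the running statistics instead of per-batch statistics, which is why the problem only appears during training.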

To Reproduce
Call a PyTorch estimator's fit() method with a dataset and batch_size such that the number of elements in the dataset modulo batch_size equals 1. Below is a snippet that does so:

import numpy as np
from armory.baseline_models.pytorch import micronnet_gtsrb
from armory.scenarios.utils import to_categorical

model = micronnet_gtsrb.get_art_model({}, {})
num_batches = 5  # number of samples; 5 % 4 == 1, so the final batch holds a single sample
x = np.random.randn(num_batches, 48, 48, 3).astype(np.float32)
y = to_categorical(np.random.randint(10, size=num_batches)).astype(np.float32)
model.fit(x, y, batch_size=4, nb_epochs=1)

Notice that if you change num_batches to 6, for example, the error goes away (6 % 4 == 2, so no batch contains a single sample).

Using Armory run: set dataset "batch_size" in this config to 4 or 8 and run armory run <config>.

System information (please complete the following information):

  • OS: Ubuntu
  • Python version: 3.8.10
  • ART version: 1.10.1
  • PyTorch version: 1.10.2
@beat-buesser
Collaborator

Hi @lcadalzo, thank you very much for reporting this issue!

@beat-buesser beat-buesser removed this from Issues open in ART 1.11.0 Jun 29, 2022
@beat-buesser beat-buesser removed this from the ART 1.11.0 milestone Jun 29, 2022
@davidslater
Collaborator

One way to deal with this would be to add a drop_last kwarg to fit(), similar to what PyTorch DataLoaders do. Here is how it is defined in https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html:

        drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
            if the dataset size is not divisible by the batch size. If ``False`` and
            the size of dataset is not divisible by the batch size, then the last batch
            will be smaller. (default: ``False``)
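
A minimal, self-contained sketch of that idea (the batch_indices helper and its signature are illustrative, not ART's actual API):

import numpy as np

def batch_indices(n_samples, batch_size, drop_last=False):
    # Mirrors the DataLoader semantics quoted above: with drop_last=True the
    # trailing incomplete batch is discarded, so a size-1 batch never reaches
    # BatchNorm while the model is in training mode.
    num_batch = n_samples // batch_size if drop_last else int(np.ceil(n_samples / batch_size))
    for m in range(num_batch):
        yield np.arange(m * batch_size, min((m + 1) * batch_size, n_samples))

# 39209 samples with batch_size=8 would otherwise end in a batch of size 1
print(len(list(batch_indices(39209, 8, drop_last=False))[-1]))  # 1
print(len(list(batch_indices(39209, 8, drop_last=True))[-1]))   # 8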

@beat-buesser beat-buesser added this to the ART 1.12.2 milestone Oct 18, 2022
@beat-buesser beat-buesser self-assigned this Oct 19, 2022
@beat-buesser beat-buesser linked a pull request Oct 19, 2022 that will close this issue
@beat-buesser beat-buesser added this to Issues open in ART 1.12.2 Oct 19, 2022
@beat-buesser beat-buesser moved this from Issues open to Issues in progress in ART 1.12.2 Oct 19, 2022
@beat-buesser beat-buesser moved this from Issues in progress to Issues closed in ART 1.12.2 Nov 10, 2022