CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx` after the first couple of epochs #580

Open
tastyminerals opened this issue Nov 1, 2019 · 14 comments

@tastyminerals commented Nov 1, 2019

I am training a variant of U-Net with joint classification and semantic segmentation using the O1 opt level. The training crashes after I explicitly cast box_coord_tensor in the roi_pool call.

rois = roi_pool(
    input=classification_feature_map_tensor,  # float16 (output of the amp-handled model)
    boxes=box_coord_tensor.half(),            # float32 from the batch unless cast explicitly
    output_size=roi_size,
    spatial_scale=1,
)

The thing is, classification_feature_map_tensor comes out as float16 since it is handled by amp, while box_coord_tensor comes from the input batch and is float32. However, roi_pool requires both tensors to have the same precision and throws:

RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIPool_forward_cuda) (checkSameType at /pytorch/aten/src/ATen/TensorUtils.cpp:140)

But if I cast box_coord_tensor to float16, CUDA throws the illegal memory access error below.

  File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
    scale_override=grads_have_scale/out_scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
    self.dynamic)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered

Is there anything else I could try? So far all my attempts result in the error above.

@mcarilli (Contributor) commented Nov 3, 2019

When in doubt, always prefer casting to FP32. In this case (I think) you're calling into a custom torchvision op that may not have an FP16 implementation. Cast both inputs to FP32 instead of FP16 and it should work.
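
If you want amp (O1) to do that cast for you, another option is to register the op as a float function before calling amp.initialize. A minimal sketch, assuming roi_pool here is the one from torchvision.ops, with a placeholder model and optimizer:

import torch
import torchvision
from apex import amp

# Ask O1 amp to always cast roi_pool's inputs to FP32.
# Registration must happen before amp.initialize is called.
amp.register_float_function(torchvision.ops, "roi_pool")

model = torch.nn.Conv2d(65, 64, 3, padding=1).cuda()      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # placeholder optimizer
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

After that, every call to torchvision.ops.roi_pool inside the model runs with FP32 inputs, which is the same effect as casting both arguments manually.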

@tastyminerals (Author) commented Nov 3, 2019

I cast everything to float32:

rois = roi_pool(
    input=classification_feature_map_tensor.float(), 
    boxes=box_coord_tensor.float(),
    output_size=self.roi_size,
    spatial_scale=1,
)

roi_pool now works and passes, but the exception is thrown in apex here:

with amp.scale_loss(loss, self.optimizer) as scaled_loss:
    scaled_loss.backward() # exception is thrown

inside the training loop below

        for epoch in range(1, self.num_epochs + 1):
            logger.info(f"running epoch {epoch}")
            avg_train_loss = 0

            self.model.train()
            for step, sample_batch in enumerate(self.train_data, start=1):
                sample_batch = self._sample_to_device(sample_batch)
                self.optimizer.zero_grad()

                doc_id_batch = sample_batch[DOC_ID]

                logits_dict = self.model(sample_batch)
                loss = self.criterion(logits_dict, sample_batch)
                with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                    scaled_loss.backward()  # exception is thrown

                self.optimizer.step()

                avg_train_loss += loss.item()

            epoch_end_time = timeit.default_timer()
            epoch_time = epoch_end_time - epoch_start_time
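
For context, the amp setup that drives this loop is the standard pattern from the apex docs; a self-contained sketch with a placeholder model and optimizer (the real ones come from my config) looks like this:

import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()                      # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder optimizer

# opt_level matches whatever the run was launched with (O1 here)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

criterion = torch.nn.CrossEntropyLoss()
inputs = torch.randn(4, 10, device="cuda")
targets = torch.randint(0, 2, (4,), device="cuda")

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()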

@tastyminerals (Author) commented Nov 4, 2019

Below are some training logs with O2 just before the crash. You can see that epoch 1 completed with a nan loss, though.

2019-11-04 10:35:43,186 - INFO - __main__ - starting training
2019-11-04 10:35:43,186 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 10:35:43,190 - INFO - net.train.trainer - running epoch 1
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
2019-11-04 10:35:53,378 - INFO - net.train.trainer - epoch 1; average train loss nan; processed 10 batches in 10.19 seconds, 1.02 sec per batch on average
2019-11-04 10:35:53,379 - INFO - net.train.trainer - epoch 1; starting validation
2019-11-04 10:35:56,085 - INFO - net.train.trainer - epoch 1: validation loss nan
2019-11-04 10:35:56,085 - INFO - net.train.trainer - epoch 1: validation loss did not decrease, patience left 9
2019-11-04 10:35:56,085 - INFO - net.train.trainer - running epoch 2
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
(...)
  File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights
    models_are_masters=False)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
    self.dynamic)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered

Now with O3 we get a little bit further, and then crash while summing the loss.

Selected optimization level O3:  Pure FP16 training.
Defaults for this optimization level are:
enabled                : True
opt_level              : O3
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : False
master_weights         : False
loss_scale             : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O3
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : False
master_weights         : False
loss_scale             : 1.0
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-04 13:19:25,347 - INFO - __main__ - starting training
2019-11-04 13:19:25,347 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 13:19:25,351 - INFO - net.train.trainer - running epoch 1
2019-11-04 13:19:35,604 - INFO - net.train.trainer - epoch 1; average train loss 3.7108697175979612; processed 10 batches in 10.25 seconds, 1.03 sec per batch on average
2019-11-04 13:19:35,605 - INFO - net.train.trainer - epoch 1; starting validation
2019-11-04 13:19:38,362 - INFO - net.train.trainer - epoch 1: validation loss 3.0665794213612876
2019-11-04 13:19:38,362 - INFO - net.train.trainer - epoch 1: better model found, new best validation loss: 3.0665794213612876
2019-11-04 13:19:38,367 - INFO - net.train.trainer - running epoch 2
2019-11-04 13:19:48,451 - INFO - net.train.trainer - epoch 2; average train loss 2.4132291316986083; processed 10 batches in 10.08 seconds, 1.01 sec per batch on average
2019-11-04 13:19:48,451 - INFO - net.train.trainer - epoch 2; starting validation
2019-11-04 13:19:51,411 - INFO - net.train.trainer - epoch 2: validation loss 2.798730452855428
2019-11-04 13:19:51,411 - INFO - net.train.trainer - epoch 2: better model found, new best validation loss: 2.798730452855428
2019-11-04 13:19:51,416 - INFO - net.train.trainer - running epoch 3
...
  File "/home/user/net/train/trainer.py", line 138, in train
    avg_train_loss += loss.item()
RuntimeError: CUDA error: an illegal memory access was encountered

Running the training with CUDA_LAUNCH_BLOCKING=1 gives us:

   trained_model_state, optimizer_state, metrics = trainer.train()
  File "/home/user/net/train/trainer.py", line 131, in train
    scaled_loss.backward()
  File "/home/user/.local/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
@tastyminerals

This comment has been minimized.

Copy link
Author

@tastyminerals tastyminerals commented Nov 4, 2019

Could it be related to this? Does that mean we are running out of memory? But nvidia-smi shows only about 50% GPU memory usage.

2.1.10. GEMM Algorithms Numerical Behavior
Some GEMM algorithms split the computation along the dimension K to increase the GPU occupancy, especially when the dimension K is large compared to dimensions M and N. When this type of algorithm is chosen by the cuBLAS heuristics or explicitly by the user, the results of each split is summed deterministically into the resulting matrix to get the final result.
For the routines cublas<t>gemmEx and cublasGemmEx, when the compute type is greater than the output type, the sum of the split chunks can potentially lead to some intermediate overflows, thus producing a final resulting matrix with some overflows. Those overflows might not have occurred if all the dot products had been accumulated in the compute type before being converted at the end into the output type.
This computation side-effect can be easily exposed when the computeType is CUDA_R_32F and Atype, Btype and Ctype are in CUDA_R_16F.
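
A minimal PyTorch illustration of that kind of overflow (hypothetical values, not from my model):

import torch

# fp16 can represent values only up to ~65504; a GEMM result beyond that
# becomes inf once it is written back to an fp16 output tensor.
a = torch.full((1, 1024), 16.0, dtype=torch.float16, device="cuda")
b = torch.full((1024, 1), 16.0, dtype=torch.float16, device="cuda")
print((a @ b).item())                  # inf: 1024 * 16 * 16 = 262144 > 65504
print((a.float() @ b.float()).item())  # 262144.0 when stored in fp32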

@mcarilli (Contributor) commented Nov 4, 2019

I don't think it's running out of memory. With O1, for the backward pass (#580 (comment)) does it error on the very first backward pass? And what is the exception trace that is thrown?

@tastyminerals (Author) commented Nov 4, 2019

Correct, with O1 it fails on the first backward pass. With O2 it finishes two epochs and with O3 finishes three epochs. With O0 it does not crash.
Below is the run with O1 opt-level.

CUDA_LAUNCH_BLOCKING=1 python train.py --config-file config/config.gin --log-level INFO                                                                                                 
2019-11-04 19:29:08,258 - INFO - __main__ - setting random seed to 42
2019-11-04 19:29:08,258 - INFO - __main__ - setting up train data
2019-11-04 19:29:08,264 - INFO - __main__ - split data with valid fraction 0.2 --> # train data: 40, # valid data: 10
2019-11-04 19:29:08,268 - INFO - net.utils.class_weights - calculating class weights with c=1.04 for box weights and c=1.04 for segmentation weights
2019-11-04 19:29:16,816 - INFO - net.utils.class_weights - calculated box class weights: tensor([ 1.5608, 21.2831, 22.9914, 16.3494, 23.2191, 21.6754, 25.2760, 25.3858,
        23.1732, 25.0054, 19.9499, 10.7810, 19.6184, 20.9051])
2019-11-04 19:29:16,817 - INFO - net.utils.class_weights - calculated segmentation class weights: tensor([0.0821, 0.1714, 0.1662, 0.1396, 0.1677, 0.1864, 0.1912, 0.2489, 0.1080])
2019-11-04 19:29:16,832 - INFO - __main__ - setting up loss function
2019-11-04 19:29:16,832 - INFO - __main__ - combining loss by sum with box loss weight 1.0 and segmentation loss weight 1.0
2019-11-04 19:29:16,832 - INFO - __main__ - setting up model
2019-11-04 19:29:16,891 - INFO - __main__ - setting up trainer instance
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-04 19:29:22,263 - INFO - __main__ - starting training
2019-11-04 19:29:22,263 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 19:29:22,263 - INFO - net.train.trainer - running epoch 1
...
  File "train.py", line 267, in train
    trained_model_state, optimizer_state, metrics = trainer.train()
  File "/home/user/net/train/trainer.py", line 132, in train
    scaled_loss.backward()
  File "/home/user/.local/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`

@tastyminerals (Author) commented Nov 4, 2019

According to these docs, CUBLAS_STATUS_EXECUTION_FAILED means "the function failed to launch on the GPU". I wonder what the possible reasons for that could be, since the same call launches on the GPU successfully several times before it crashes.

Batch size does not change the behavior. I also tried nightly PyTorch builds: same results. I tried different machines, a GTX 1070 and a GTX 1080 Ti: same error. The apex ImageNet example runs without errors though, so it must be something in our model.
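
To narrow it down further, one could also try an isolated half-precision matmul outside the training loop, since that hits the same cublasGemmEx path; a sketch with placeholder shapes (not the model's real ones):

import torch

# Placeholder shapes; an fp16 matmul on CUDA dispatches to cublasGemmEx.
a = torch.randn(595, 1024, dtype=torch.float16, device="cuda")
b = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
torch.cuda.synchronize()
c = a @ b
torch.cuda.synchronize()                # surface any asynchronous CUDA error here
print(c.shape, float(c.float().sum()))  # force a read-back, like the amp scaler does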

@tastyminerals tastyminerals changed the title "CUDA error: an illegal memory access" with explicit cast to float16 CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx after couple of first epochs Nov 4, 2019
@tastyminerals tastyminerals changed the title CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx after couple of first epochs CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx after couple of first epochs Nov 4, 2019

@ptrblck (Contributor) commented Nov 7, 2019

@tastyminerals Are you using variable input sizes, i.e. are some inputs larger than others?
If so, could it be related to this issue?
If you are using CUDA 10.0, could you update to 10.1 and check whether it works?
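
A quick way to double-check which toolkit and cuDNN your PyTorch binaries were built with (just a sketch; the system-wide CUDA install can differ from this):

import torch

print(torch.__version__)
print(torch.version.cuda)              # CUDA version the binaries were built with, e.g. '10.1'
print(torch.backends.cudnn.version())  # cuDNN version, e.g. 7604
print(torch.cuda.get_device_name(0))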

@tastyminerals (Author) commented Nov 7, 2019

I read the issue; I don't think it is related.
No, the input sizes do not change, but we have multi-modal input, that is:

chargrid_tensor [2 x 65 x 512 x 368]  # batch_size x char_vocab x h x w
box_coordinates [595 x 4] # word_boxes x coords

CUDA and cuDNN versions: CUDA 10.1.243-1, cuDNN 7.6.4.38-1
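
For reference, dummy inputs matching those shapes can be created like this (random values; the [x1, y1, x2, y2] pixel box layout is my assumption):

import torch

# Random stand-ins matching the shapes above, just for shape/dtype experiments.
chargrid_tensor = torch.randn(2, 65, 512, 368, device="cuda")  # batch_size x char_vocab x h x w
box_coord_tensor = torch.rand(595, 4, device="cuda") * 368     # word_boxes x coords
box_coord_tensor[:, 2:] = box_coord_tensor[:, :2] + box_coord_tensor[:, 2:]  # ensure x2 > x1, y2 > y1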

The model structure is the following:

BoxClassificationSemanticSegmentation(
  (backbone): UNetWideOneEncoderOneDecoder(
    (conv_encode_1): Sequential(
      (0): Conv2d(65, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): ReLU()
      (5): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (7): ReLU()
    )
    (conv_encode_2): Sequential(
      (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): ReLU()
      (5): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (7): ReLU()
    )
    (conv_encode_3): Sequential(
      (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): ReLU()
      (5): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (7): ReLU()
    )
    (bottleneck): Sequential(
      (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): ReLU()
      (5): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (conv_decode_3): Sequential(
      (0): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): ReLU()
      (5): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (7): ReLU()
    )
    (conv_decode_2): Sequential(
      (0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): ReLU()
      (5): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): ConvTranspose2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (7): ReLU()
    )
    (conv_decode_1): Sequential(
      (0): Conv2d(192, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): ReLU()
      (5): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (7): ReLU()
    )
  )
  (box_classification_head): BoxClassificationHeadHighway(
    (box_classifier): Sequential(
      (0): Highway(
        (proj): Linear(in_features=1024, out_features=1024, bias=True)
        (transform): Linear(in_features=1024, out_features=1024, bias=True)
      )
      (1): Tanh()
      (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Linear(in_features=1024, out_features=14, bias=True)
      (4): BatchNorm1d(14, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (semantic_segmentation_head): SemanticSegmentationHead1Conv(
    (conv): Conv2d(64, 9, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
)

So I did some light model pruning, e.g. reducing the model to the box classification task only, leaving a single cross_entropy loss (U-Net -> cross_entropy). No change.

@someAdjectiveNoun commented Nov 7, 2019

I get a similar error in the forward pass. After some batches, it gives the following error(s).

Sometimes it is error 1 and sometimes error 2 or error 3. Sometimes the error is thrown after processing the 1st batch and sometimes after the 2nd, 9th, 13th, 17th, or 21st batch.

Error 1
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Error 2
RuntimeError: CUDA error: device-side assert triggered

Error 3
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCBlas.cu:258

Maybe this issue discussion can bring more perspective to it.

@tastyminerals (Author) commented Nov 9, 2019

I managed to train the model without crashing (at least reaching the 10th epoch) with batch_size=1 and the O2 opt level. Anything else leads to an exception.

batch_size=1, opt-level=O1 --> crashes after couple of epochs
batch_size=1, opt-level=O2 --> works fine
batch_size=1, opt-level=O3 --> crashes after couple of epochs

batch_size=2, opt-level=O1 --> crashes after couple of epochs
batch_size=2, opt-level=O2 --> crashes after couple of epochs
batch_size=2, opt-level=O3 --> crashes after couple of epochs

Unfortunately, even though I am able to train with O2, the loss is still nan right after the first epoch :(

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-09 19:31:13,427 - INFO - __main__ - starting training
2019-11-09 19:31:13,427 - INFO - unet.train.trainer - starting training of model, going to train 100 epochs
2019-11-09 19:31:13,429 - INFO - unet.train.trainer - running epoch 1
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.103515625e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
2019-11-09 19:31:23,699 - INFO - unet.train.trainer - epoch 1; average train loss nan; processed 40 batches in 10.27 seconds, 0.26 sec per batch on average
2019-11-09 19:31:23,699 - INFO - unet.train.trainer - epoch 1; starting validation
2019-11-09 19:31:26,067 - INFO - unet.train.trainer - epoch 1: validation loss nan
2019-11-09 19:31:26,068 - INFO - unet.train.trainer - epoch 1: validation loss did not decrease, patience left 9
2019-11-09 19:31:26,068 - INFO - unet.train.trainer - running epoch 2
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.725290298461914e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.313225746154785e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.9103830456733704e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.8189894035458565e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.7763568394002505e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.440892098500626e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1102230246251565e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.3877787807814457e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.469446951953614e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.673617379884035e-19
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
2019-11-09 19:31:36,441 - INFO - unet.train.trainer - epoch 2; average train loss nan; processed 40 batches in 10.37 seconds, 0.26 sec per batch on average
2019-11-09 19:31:36,442 - INFO - unet.train.trainer - epoch 2; starting validation
2019-11-09 19:31:38,790 - INFO - unet.train.trainer - epoch 2: validation loss nan
2019-11-09 19:31:38,791 - INFO - unet.train.trainer - epoch 2: validation loss did not decrease, patience left 8
2019-11-09 19:31:38,791 - INFO - unet.train.trainer - running epoch 3
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.710505431213761e-20
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.470329472543003e-22
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0587911840678754e-22
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.6469779601696886e-23
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.617444900424222e-24
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6543612251060553e-24
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.1359030627651384e-25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0339757656912846e-25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.2924697071141057e-26
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.2311742677852644e-27
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.077935669463161e-28
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0194839173657902e-28
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.524354896707238e-29
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.1554436208840472e-30
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.944304526105059e-31
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.860761315262648e-32
2019-11-09 19:31:49,216 - INFO - unet.train.trainer - epoch 3; average train loss nan; processed 40 batches in 10.43 seconds, 0.26 sec per batch on average
2019-11-09 19:31:49,217 - INFO - unet.train.trainer - epoch 3; starting validation
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - epoch 3: validation loss nan
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - epoch 3: validation loss did not decrease, patience left 7
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - running epoch 4
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.465190328815662e-32
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.162975822039155e-33
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.62964972193618e-35
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.407412430484045e-35
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.018531076210112e-36
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.52316384526264e-37
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.4039548065783e-38
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.350988701644575e-38
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.877471754111438e-39
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.346839692639297e-40
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.8367099231598242e-40
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.591774807899561e-41
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.739718509874451e-42
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4349296274686127e-42
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.793662034335766e-43
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44
2019-11-09 19:32:02,018 - INFO - unet.train.trainer - epoch 4; average train loss nan; processed 40 batches in 10.42 seconds, 0.26 sec per batch on average
2019-11-09 19:32:02,018 - INFO - unet.train.trainer - epoch 4; starting validation
2019-11-09 19:32:04,435 - INFO - unet.train.trainer - epoch 4: validation loss nan
2019-11-09 19:32:04,436 - INFO - unet.train.trainer - epoch 4: validation loss did not decrease, patience left 6

I have CUDA 10.1.243-2, torchvision 0.4.2-3 and PyTorch 1.3.0 installed.
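
To find out which layer first produces the overflowing gradients, I could call a helper like this right after scaled_loss.backward() (hypothetical debugging code, not part of my trainer):

import torch

def report_nonfinite_grads(model):
    # Print every parameter whose gradient contains NaN or Inf.
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad
        if torch.isnan(g).any() or torch.isinf(g).any():
            print(f"non-finite gradient in {name}")

Called inside the loop, this would point at the module where the NaN/Inf gradients appear first.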

@tastyminerals (Author) commented Nov 9, 2019

@tastyminerals Are you using variable input sizes, i.e. are some inputs larger than others?
If so, could it be related to this issue?
If you are using CUDA 10.0, could you update to 10.1 and check whether it works?

I cannot reproduce the bug, the code below works fine on my machine.

torch.zeros((16*2**20 - 512)//2 + 1, 1, dtype=torch.float16, device='cuda:0') @ torch.zeros(1, 2, dtype=torch.float16, device='cuda:0')

@ptrblck (Contributor) commented Nov 10, 2019

@tastyminerals @someAdjectiveNoun
Could you try to post a (small) code snippet to reproduce this issue?

@someAdjectiveNoun commented Nov 11, 2019

@tastyminerals @someAdjectiveNoun
Could you try to post a (small) code snippet to reproduce this issue?

The problem is solved now. It was actually caused by the BioBERT model I was using. Using plain BERT in PyTorch works smoothly; the problem seems to come from BioBERT.
