Transfer learning fails and cannot be restarted #44

Closed
dwinkler1 opened this issue Sep 20, 2018 · 5 comments

Comments

@dwinkler1

dwinkler1 commented Sep 20, 2018

I have trained a model on my text corpus (full_model.pt) and now want to see how well it does on a labeled dataset. So I labeled the data and ran the following:

python transfer.py --load_model full_model.pt --data ./labeled.csv --neurons 30 --epochs 5 --split 10,1,1
configuring data
generating csv at ./labeled.sentence.label.csv
Creating mlstm
writing results to full_model_transfer/sentiment
transforming train
batch     1/  162 | ch/s 8.56E+03 | time 7.25E+02 | time left 1.17E+05
batch     2/  162 | ch/s 1.39E+04 | time 4.03E+02 | time left 9.02E+04
batch     3/  162 | ch/s 1.33E+04 | time 5.10E+02 | time left 8.68E+04
batch     4/  162 | ch/s 1.13E+04 | time 5.68E+02 | time left 8.71E+04
batch     5/  162 | ch/s 1.29E+04 | time 5.46E+02 | time left 8.64E+04
batch     6/  162 | ch/s 1.13E+04 | time 5.78E+02 | time left 8.66E+04
batch     7/  162 | ch/s 1.33E+04 | time 4.90E+02 | time left 8.46E+04
batch     8/  162 | ch/s 1.19E+04 | time 6.36E+02 | time left 8.58E+04
batch     9/  162 | ch/s 1.27E+04 | time 5.48E+02 | time left 8.51E+04
batch    10/  162 | ch/s 1.27E+04 | time 6.60E+02 | time left 8.61E+04
batch    11/  162 | ch/s 1.40E+04 | time 5.55E+02 | time left 8.54E+04
batch    12/  162 | ch/s 1.36E+04 | time 6.53E+02 | time left 8.59E+04
batch    13/  162 | ch/s 1.11E+04 | time 7.29E+02 | time left 8.71E+04
batch    14/  162 | ch/s 1.30E+04 | time 8.20E+02 | time left 8.90E+04
batch    15/  162 | ch/s 1.51E+04 | time 7.54E+02 | time left 8.99E+04
batch    16/  162 | ch/s 1.39E+04 | time 8.07E+02 | time left 9.11E+04
batch    17/  162 | ch/s 1.11E+04 | time 1.10E+03 | time left 9.45E+04
batch    18/  162 | ch/s 1.25E+04 | time 9.17E+02 | time left 9.60E+04
batch    19/  162 | ch/s 1.25E+04 | time 9.85E+02 | time left 9.77E+04
batch    20/  162 | ch/s 1.19E+04 | time 1.01E+03 | time left 9.94E+04
batch    21/  162 | ch/s 1.28E+04 | time 1.04E+03 | time left 1.01E+05
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=4 : unspecified launch failure
Traceback (most recent call last):
  File "transfer.py", line 328, in <module>
    trXt, trY = transform(model, train_data)
  File "transfer.py", line 138, in transform
    cell = model(text_batch, length_batch, args.get_hidden)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/model/model.py", line 93, in forward
    cell = get_valid_outs(i, seq_len, cell, last_cell)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/model/model.py", line 130, in get_valid_outs
    if (invalid_steps.long().sum() == 0):
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THC/generated/../THCReduceAll.cuh:317
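
A quick sanity check after an unspecified launch failure like this is whether the driver and PyTorch can still reach the GPU; this is the same check that fails in the restart attempt below. Roughly:

# Does the driver still respond, and can PyTorch still see a CUDA device?
nvidia-smi
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"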

When I try to restart the training, it fails immediately with this error:

python transfer.py --load_model full_model.pt --data ./labeled.csv --neurons 30 --epochs 5 --split 10,1,1
configuring data
Creating mlstm
Traceback (most recent call last):
  File "transfer.py", line 89, in <module>
    sd = x = torch.load(f)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 508, in persistent_load
    data_type(size), location)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 104, in default_restore_location
    result = fn(storage, location)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 75, in _cuda_deserialize
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
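
Following the suggestion at the end of that traceback, the checkpoint itself can still be opened by mapping the CUDA storages to the CPU; this only confirms that full_model.pt is intact, it does not make the GPU available again. A minimal sketch:

# Open the CUDA-saved checkpoint on the CPU, as the error message suggests.
# This only verifies that the file is readable; it does not bring the GPU back.
python -c "import torch; sd = torch.load('full_model.pt', map_location='cpu'); print(type(sd))"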

Some more details:

torch.version.cuda
'9.2.148'

python --version
Python 3.6.6

lspci | grep VGA 
04:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
b3:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

nvidia-settings --version
nvidia-settings:  version 396.37  (buildmeister@swio-display-x86-rhel47-05)  Tue Jun 12 14:49:22 PDT 2018

uname -a
Linux imsm-gpu2 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Any ideas?

@raulpuric
Contributor

Let me check with our PyTorch frameworks team. I've never seen this before. Any chance I can get you to run in a PyTorch Docker container with CUDA 9.0?
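
For reference, something along these lines should do it (the exact image tag here is an assumption, not a tested setup):

# Start an official PyTorch image built against CUDA 9.0; mount the current
# directory so transfer.py and the checkpoint are visible inside the container.
nvidia-docker run -it --rm -v "$(pwd)":/workspace pytorch/pytorch:0.4.1-cuda9-cudnn7-devel bash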

@raulpuric
Contributor

Also, what version of PyTorch are you using?

@dwinkler1
Author

I am using PyTorch 0.4.1. It seems to be related to automatic suspend in Ubuntu. I've disabled it, and it has been training without error since last night. I will try with CUDA 9.0 as soon as the current run either fails or finishes (likely tomorrow), but I don't want to mess with it right now.
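
On a GNOME-based Ubuntu desktop, automatic suspend can also be switched off from the command line; one possible way:

# Stop the GNOME settings daemon from suspending the machine when idle,
# both on AC power and on battery.
gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-ac-type 'nothing'
gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-battery-type 'nothing'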

@raulpuric
Contributor

OK, thanks for letting us know. Going to close this; hopefully not too many people have automatic suspend enabled.

@dwinkler1
Author

Thanks. Sorry I didn't get to do more testing. I'll try to find a proper solution to this in the future and create a pull request. Anyway, the workaround seems solid.
