Transfer learning fails and cannot be restarted #44

Closed
dwinkler1 opened this issue Sep 20, 2018 · 5 comments

Comments

@dwinkler1

dwinkler1 commented Sep 20, 2018

I have trained a model on my text corpus (full_model.pt) and now want to see how well it does on a labeled dataset. So I labeled the data and ran the following:

python transfer.py --load_model full_model.pt --data ./labeled.csv --neurons 30 --epochs 5 --split 10,1,1
configuring data
generating csv at ./labeled.sentence.label.csv
Creating mlstm
writing results to full_model_transfer/sentiment
transforming train
batch     1/  162 | ch/s 8.56E+03 | time 7.25E+02 | time left 1.17E+05
batch     2/  162 | ch/s 1.39E+04 | time 4.03E+02 | time left 9.02E+04
batch     3/  162 | ch/s 1.33E+04 | time 5.10E+02 | time left 8.68E+04
batch     4/  162 | ch/s 1.13E+04 | time 5.68E+02 | time left 8.71E+04
batch     5/  162 | ch/s 1.29E+04 | time 5.46E+02 | time left 8.64E+04
batch     6/  162 | ch/s 1.13E+04 | time 5.78E+02 | time left 8.66E+04
batch     7/  162 | ch/s 1.33E+04 | time 4.90E+02 | time left 8.46E+04
batch     8/  162 | ch/s 1.19E+04 | time 6.36E+02 | time left 8.58E+04
batch     9/  162 | ch/s 1.27E+04 | time 5.48E+02 | time left 8.51E+04
batch    10/  162 | ch/s 1.27E+04 | time 6.60E+02 | time left 8.61E+04
batch    11/  162 | ch/s 1.40E+04 | time 5.55E+02 | time left 8.54E+04
batch    12/  162 | ch/s 1.36E+04 | time 6.53E+02 | time left 8.59E+04
batch    13/  162 | ch/s 1.11E+04 | time 7.29E+02 | time left 8.71E+04
batch    14/  162 | ch/s 1.30E+04 | time 8.20E+02 | time left 8.90E+04
batch    15/  162 | ch/s 1.51E+04 | time 7.54E+02 | time left 8.99E+04
batch    16/  162 | ch/s 1.39E+04 | time 8.07E+02 | time left 9.11E+04
batch    17/  162 | ch/s 1.11E+04 | time 1.10E+03 | time left 9.45E+04
batch    18/  162 | ch/s 1.25E+04 | time 9.17E+02 | time left 9.60E+04
batch    19/  162 | ch/s 1.25E+04 | time 9.85E+02 | time left 9.77E+04
batch    20/  162 | ch/s 1.19E+04 | time 1.01E+03 | time left 9.94E+04
batch    21/  162 | ch/s 1.28E+04 | time 1.04E+03 | time left 1.01E+05
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=4 : unspecified launch failure
Traceback (most recent call last):
  File "transfer.py", line 328, in <module>
    trXt, trY = transform(model, train_data)
  File "transfer.py", line 138, in transform
    cell = model(text_batch, length_batch, args.get_hidden)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/model/model.py", line 93, in forward
    cell = get_valid_outs(i, seq_len, cell, last_cell)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/model/model.py", line 130, in get_valid_outs
    if (invalid_steps.long().sum() == 0):
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THC/generated/../THCReduceAll.cuh:317
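
A quick sanity check after an unspecified launch failure like this is whether the driver and PyTorch can still reach the GPU; this is the same check that fails in the restart attempt below. Roughly:

# Does the driver still respond, and can PyTorch still see a CUDA device?
nvidia-smi
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"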

When I try to restart the training, it fails immediately with this error:

python transfer.py --load_model full_model.pt --data ./labeled.csv --neurons 30 --epochs 5 --split 10,1,1
configuring data
Creating mlstm
Traceback (most recent call last):
  File "transfer.py", line 89, in <module>
    sd = x = torch.load(f)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 508, in persistent_load
    data_type(size), location)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 104, in default_restore_location
    result = fn(storage, location)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 75, in _cuda_deserialize
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
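
Following the suggestion at the end of that traceback, the checkpoint itself can still be opened by mapping the CUDA storages to the CPU; this only confirms that full_model.pt is intact, it does not make the GPU available again. A minimal sketch:

# Open the CUDA-saved checkpoint on the CPU, as the error message suggests.
# This only verifies that the file is readable; it does not bring the GPU back.
python -c "import torch; sd = torch.load('full_model.pt', map_location='cpu'); print(type(sd))"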

Some more details:

torch.version.cuda
'9.2.148'

python --version
Python 3.6.6

lspci | grep VGA 
04:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
b3:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

nvidia-settings --version
nvidia-settings:  version 396.37  (buildmeister@swio-display-x86-rhel47-05)  Tue Jun 12 14:49:22 PDT 2018

uname -a
Linux imsm-gpu2 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Any ideas?

@raulpuric
Contributor

Let me check with our PyTorch frameworks team. I've never seen this before. Any chance I can get you to run in a PyTorch Docker container with CUDA 9.0?
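
For reference, something along these lines should do it (the exact image tag here is an assumption, not a tested setup):

# Start an official PyTorch image built against CUDA 9.0; mount the current
# directory so transfer.py and the checkpoint are visible inside the container.
nvidia-docker run -it --rm -v "$(pwd)":/workspace pytorch/pytorch:0.4.1-cuda9-cudnn7-devel bash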

@raulpuric
Contributor

Also, what version of PyTorch are you using?

@dwinkler1
Author

I am using PyTorch 0.4.1. It seems to be related to automatic suspend in Ubuntu. I've disabled it, and it has been training without error since last night. I will try with CUDA 9.0 as soon as the current run either fails or finishes (likely tomorrow), but I don't want to mess with it right now.
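
On a GNOME-based Ubuntu desktop, automatic suspend can also be switched off from the command line; one possible way:

# Stop the GNOME settings daemon from suspending the machine when idle,
# both on AC power and on battery.
gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-ac-type 'nothing'
gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-battery-type 'nothing'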

@raulpuric
Contributor

OK, thanks for letting us know. Going to close this; hopefully not too many people have automatic suspend enabled.

@dwinkler1
Author

Thanks. Sorry I didn't get to do more testing. I'll try to find a proper solution to this in the future and create a pull request. Anyway, the workaround seems solid.
