Cuda support #103
Conversation
Merge commits (conflicts resolved in):
- src/evaluation/evaluator.py
- main.py, src/evaluation/evaluator.py
- main.py
- src/algorithms/lstm_enc_dec_axl.py
minor fix
- main.py, src/algorithms/autoencoder.py, src/algorithms/dagmm.py, src/algorithms/lstm_enc_dec_axl.py
How about branching from master next time so there aren't 57 commits? :)
@property
def torch_device(self):
    return torch.device(f'cuda:{self.gpu}' if torch.cuda.is_available() else 'cpu')
What if there are multiple GPUs we can use?
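One way to address the multi-GPU question: instead of hard-wiring a single index, map each worker's preferred index onto the GPUs that actually exist. The helper below is a hypothetical sketch, not code from this PR; `select_device` is a made-up name, and the torch calls it would be fed from (`torch.cuda.is_available()`, `torch.cuda.device_count()`) are kept in comments so the sketch runs without CUDA.

```python
# Hypothetical sketch: pick a device string per worker so several GPUs
# can be used at once. `gpu` is the worker's preferred index; `available`
# and `n_gpus` would come from torch.cuda.is_available() and
# torch.cuda.device_count() in the real property.
def select_device(gpu, available, n_gpus):
    """Return a torch-style device string, falling back safely to CPU."""
    if not available or n_gpus == 0:
        return 'cpu'
    # Wrap around so a worker index larger than the GPU count still maps
    # to a valid device instead of raising at torch.device(...) time.
    return f'cuda:{gpu % n_gpus}'

# With torch installed, the property could then read:
#   @property
#   def torch_device(self):
#       return torch.device(select_device(
#           self.gpu, torch.cuda.is_available(), torch.cuda.device_count()))
```

The modulo keeps the fallback deterministic: running four detectors with `gpu=0..3` on a two-GPU box spreads them over `cuda:0`/`cuda:1` rather than crashing.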
2018-06-22 08:06:38 [INFO] src.evaluation.evaluator: Training DAGMM_LSTMAutoEncoder_withWindow on Synthetic Combined Outliers 4-dimensional
2018-06-22 08:06:38 [ERROR] src.evaluation.evaluator: An exception occurred while training DAGMM_LSTMAutoEncoder_withWindow on Synthetic Combined Outliers 4-dimensional: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorCopy.c:20
2018-06-22 08:06:38 [ERROR] src.evaluation.evaluator: Traceback (most recent call last):
File "/repo/src/evaluation/evaluator.py", line 71, in evaluate
det.fit(X_train, y_train)
File "/repo/src/algorithms/dagmm.py", line 188, in fit
self.to_device(self.dagmm)
File "/repo/src/algorithms/cuda_utils.py", line 28, in to_device
model.to(self.torch_device)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 393, in to
return self._apply(lambda t: t.to(device))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 176, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 176, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 111, in _apply
ret = super(RNNBase, self)._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 182, in _apply
param.data = fn(param.data)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 393, in <lambda>
return self._apply(lambda t: t.to(device))
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorCopy.c:20
2018-06-22 08:06:38 [ERROR] src.evaluation.evaluator: An exception occurred while training Donut on Synthetic Combined Outliers 4-dimensional: CUDA runtime implicit initialization on GPU:0 failed. Status: an illegal memory access was encountered
2018-06-22 08:06:38 [ERROR] src.evaluation.evaluator: Traceback (most recent call last):
File "/repo/src/evaluation/evaluator.py", line 71, in evaluate
det.fit(X_train, y_train)
File "/repo/src/algorithms/donut.py", line 154, in fit
with self.tf_device:
File "/repo/src/algorithms/cuda_utils.py", line 13, in tf_device
local_device_protos = device_lib.list_local_devices()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/device_lib.py", line 41, in list_local_devices
for s in pywrap_tensorflow.list_devices(session_config=session_config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 1675, in list_devices
return ListDevices(status)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: an illegal memory access was encountered
2018-06-22 08:06:59.731177: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0x6339750: CUDA_ERROR_ILLEGAL_ADDRESS
Traceback (most recent call last):
File "main.py", line 126, in <module>
main()
File "main.py", line 17, in main
run_experiments()
File "main.py", line 94, in run_experiments
detectors = [RecurrentEBM(num_epochs=15), LSTMAD(), Donut(), LSTMED(num_epochs=40),
File "/repo/src/algorithms/lstm_ad.py", line 67, in __init__
torch.manual_seed(0)
File "/usr/local/lib/python3.6/dist-packages/torch/random.py", line 33, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/random.py", line 86, in manual_seed_all
_lazy_call(lambda: _C._cuda_manualSeedAll(seed))
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 121, in _lazy_call
callable()
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/random.py", line 86, in <lambda>
_lazy_call(lambda: _C._cuda_manualSeedAll(seed))
RuntimeError: Creating MTGP constants failed. at /pytorch/aten/src/THC/THCTensorRandom.cu:34
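Note that because CUDA launches are asynchronous, the tracebacks above (the copy in `THCTensorCopy.c`, the MTGP seeding failure) likely point at the first call that *noticed* the corruption, not the kernel that caused it. A standard first debugging step is to force synchronous launches with the `CUDA_LAUNCH_BLOCKING` environment variable, which must be set before the CUDA context is initialized:

```python
# Sketch: force synchronous CUDA launches so the traceback points at the
# kernel that actually triggered the illegal memory access. This must be
# set before torch (or tensorflow) initializes the CUDA context, i.e.
# before the first CUDA-touching import/call in main.py.
import os

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# import torch  # import *after* setting the variable
# ... re-run the failing detector; the RuntimeError should now surface
# at the offending op instead of at a later, unrelated tensor copy.
```

Equivalently, `CUDA_LAUNCH_BLOCKING=1 python main.py` on the command line avoids any ordering concerns inside the script.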
    nn.Dropout(p=0.5),
    nn.Linear(10, n_gmm),
    nn.Softmax(dim=1)
]
👍🏻
* Adapt LSTMAD, Donut and DAGMM with NNAutoencoder
* Ignore device placement on ReEBM as well
* Adapt LSTMED
* Separate running detectors
* Use magic GPUWrapper
* Sort by framework
Should work the way we want it to. Still has to be tested, though.
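The squash commit mentions a "magic GPUWrapper", and the tracebacks show `src/algorithms/cuda_utils.py` exposing `torch_device`, `tf_device`, and `to_device`. The excerpt doesn't include that file, so the following is purely a guess at its shape: `GPUWrapper` and `device_name` are hypothetical names, and the framework-specific parts are left as comments so the sketch runs without torch or tensorflow installed.

```python
# Hypothetical sketch of a GPUWrapper mixin in the spirit of
# src/algorithms/cuda_utils.py. Only the method names torch_device,
# tf_device, and to_device appear in the logs above; the rest is guesswork.
class GPUWrapper:
    """Mixin giving each detector one place to resolve its device."""

    def __init__(self, gpu=0):
        self.gpu = gpu

    def device_name(self, framework, available):
        """Framework-specific device string with a CPU fallback.

        PyTorch uses 'cuda:<i>' / 'cpu'; TensorFlow uses
        '/device:GPU:<i>' / '/cpu:0'.
        """
        if framework == 'torch':
            return f'cuda:{self.gpu}' if available else 'cpu'
        return f'/device:GPU:{self.gpu}' if available else '/cpu:0'

    # With the frameworks installed, the real members might look like:
    #   @property
    #   def torch_device(self):
    #       return torch.device(
    #           self.device_name('torch', torch.cuda.is_available()))
    #
    #   def to_device(self, model):
    #       model.to(self.torch_device)
```

Keeping the string logic separate from the framework calls also makes it unit-testable on a CPU-only CI box, which matters given that this PR's GPU paths clearly still need testing.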