
CTM training fails. #21

Closed
cidrugHug8 opened this issue Jul 7, 2021 · 5 comments

@cidrugHug8

  • OCTIS version: 1.8.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04.2

Description

CTM training fails.

What I Did

from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.models.model import save_model_output

dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATASET_PATH)
model = CTM(num_topics=TOPIC_SIZE)
model_output = model.train_model(dataset)
save_model_output(model_output, MODEL_OUTPUT_PATH)
save_model_output(model, MODEL_PATH)

The following error message was displayed:

Batches:  84%|████████████████████████████████████████████████████████████████████████████████████████▌                 | 21790/26093 [59:43<11:47,  6.08it/s]
Traceback (most recent call last):
  File "train.py", line 62, in <module>
    model = ProdLDA(num_topics=TOPIC_SIZE)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 95, in train_model
    x_train, x_test, x_valid, input_size = self.preprocess(
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 175, in preprocess
    b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 208, in load_bert_data
    bert_ouput = bert_embeddings_from_list(texts, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/contextualized_topic_models/utils/data_preparation.py", line 35, in bert_embeddings_from_list
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/SentenceTransformer.py", line 160, in encode
    out_features = self.forward(features)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/models/Transformer.py", line 51, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 991, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 582, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 470, in forward
    self_attention_outputs = self.attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 401, in forward
    self_outputs = self.self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 305, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
@cidrugHug8
Author

Thanks for the advice.
I'll check my data set.

@cidrugHug8
Author

corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).

Is it acceptable to leave the data in the third column empty?

@cidrugHug8
Author

Python 3.8.10 (default, Jun  2 2021, 10:49:15) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from octis.dataset.dataset import Dataset
>>> dataset = Dataset()
>>> dataset.fetch_dataset("20NewsGroup")
>>> dataset.save('/home/root/20newsgroup')
>>> del dataset
>>> dataset = Dataset()
>>> dataset.load_custom_dataset_from_folder('/home/root/20newsgroup')
>>> from octis.dataset.dataset import Dataset
>>> from octis.models.CTM import CTM
>>> model = CTM(num_topics=25)
>>> model_output = model.train_model(dataset)
Batches:  77%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▎                              | 89/115 [00:14<00:04,  5.95it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 95, in train_model
    x_train, x_test, x_valid, input_size = self.preprocess(
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 175, in preprocess
    b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 208, in load_bert_data
    bert_ouput = bert_embeddings_from_list(texts, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/contextualized_topic_models/utils/data_preparation.py", line 35, in bert_embeddings_from_list
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/SentenceTransformer.py", line 160, in encode
    out_features = self.forward(features)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/models/Transformer.py", line 51, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 991, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 582, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 470, in forward
    self_attention_outputs = self.attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 401, in forward
    self_outputs = self.self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 305, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

Is the code wrong?

@silviatti
Collaborator

Yes, the third column can be missing.
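For example, a two-column corpus.tsv along these lines is valid (illustrative content; columns are tab-separated, and I'm assuming the usual train/val/test partition labels):

first preprocessed document	train
second preprocessed document	train
third preprocessed document	val
fourth preprocessed document	test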

I don't understand why you fetch the dataset, then save it, delete the variable, and reload it from disk. You could just call fetch_dataset and run the model. Do you do something in between? In any case, the code should work (I tried it on Colab).
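That is, something like this should be enough:

dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")
model = CTM(num_topics=25)
model_output = model.train_model(dataset)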

I wonder if this is related to your GPU and CUDA version. Can you try to run the code on the CPU and see if it works?
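One way to do that, as a minimal sketch (assuming no device option is passed to CTM), is to hide the GPU from PyTorch via an environment variable set before torch is first imported:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # must run before torch is imported anywhere

from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM

dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")
model = CTM(num_topics=25)
model_output = model.train_model(dataset)  # with no GPU visible, embeddings and training run on CPU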

Thanks,

Silvia

@cidrugHug8
Author

I wanted to confirm that the dataset was correct, so I saved it once.

Can you try to run the code on the CPU and see if it works?

It worked well. Hmm.

I wonder if this is related to your GPU and CUDA version.

Your guess seems to be right. My environment is as follows:

GPU: Pascal TITAN X
Driver Version: 470.42.01
CUDA Version: 11.4

My CUDA version may be too high. Anyway, I'll try it with the CPU. Thank you!
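
For reference, a generic one-liner to check which CUDA version the installed PyTorch wheel was built against (which can differ from the 11.4 the driver reports):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"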
