This repository was archived by the owner on Nov 26, 2025. It is now read-only.

fix: stratified train/test split in notebook#60

Merged
micedre merged 1 commit into main from
54-failure-in-trying-to-reproduce-the-example-notebook-provided-by-the-repository
Jun 25, 2025

Conversation

@meilame-tayebjee
Member

avoid errors due to smaller num_classes than the max label
@meilame-tayebjee meilame-tayebjee requested a review from micedre May 19, 2025 11:33
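To illustrate the fix this PR describes (the identifiers below are illustrative, not the notebook's actual variables): with a plain random split, a rare class can land entirely in the test set, so the label set seen at training time is smaller than the maximum label and out-of-range targets later cause errors. Passing `stratify=y` to scikit-learn's `train_test_split` keeps every class in both partitions. A minimal sketch:

```python
# Minimal sketch of a stratified split, assuming integer class labels.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 8 + [2] * 4)  # class 2 is rare

# stratify=y preserves class proportions in both train and test sets,
# so no class is absent from the training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

assert set(y_train) == set(y)  # every class appears in the training set
```

Without `stratify`, the same split can drop class 2 from `y_train`, after which `num_classes` inferred from the training labels no longer covers `max(y_test)`.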
@micedre
Contributor

micedre commented May 19, 2025

Still have this error when training on GPU (Nvidia T4 15G) :

2025-05-19 12:26:12 - torchFastText.torchFastText - Checking inputs...
2025-05-19 12:26:12 - torchFastText.torchFastText - Inputs successfully checked. Starting the training process..
2025-05-19 12:26:12 - torchFastText.torchFastText - Running on: cuda
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[31], line 1
----> 1 model.train(
      2     X_train,
      3     y_train,
      4     X_test,
      5     y_test,
      6     num_epochs=parameters_train['num_epochs'],
      7     batch_size=parameters_train['batch_size'],
      8     patience_scheduler=parameters_train['patience'],
      9     patience_train=parameters_train['patience'],
     10     lr=parameters_train['lr'],
     11     verbose = True
     12 )

File /usr/local/lib/python3.12/site-packages/torchFastText/torchFastText.py:589, in torchFastText.train(self, X_train, y_train, X_val, y_val, num_epochs, batch_size, cpu_run, num_workers, optimizer, optimizer_params, lr, scheduler, patience_scheduler, loss, patience_train, verbose, trainer_params)
    586         end = time.time()
    587         logger.info("Model successfully built in {:.2f} seconds.".format(end - start))
--> 589 self.pytorch_model = self.pytorch_model.to(self.device)
    591 # Dataloaders
    592 train_dataloader, val_dataloader = self.__build_data_loaders(
    593     train_categorical_variables=train_categorical_variables,
    594     training_text=training_text,
   (...)    600     num_workers=num_workers,
    601 )

File /usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py:1355, in Module.to(self, *args, **kwargs)
   1352         else:
   1353             raise
-> 1355 return self._apply(convert)

File /usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py:915, in Module._apply(self, fn, recurse)
    913 if recurse:
    914     for module in self.children():
--> 915         module._apply(fn)
    917 def compute_should_use_set_data(tensor, tensor_applied):
    918     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    919         # If the new tensor has compatible tensor type as the existing tensor,
    920         # the current behavior is to change the tensor in-place using `.data =`,
   (...)    925         # global flag to let the user control whether they want the future
    926         # behavior of overwriting the existing tensor or not.

File /usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py:942, in Module._apply(self, fn, recurse)
    938 # Tensors stored in modules are graph leaves, and we don't want to
    939 # track autograd history of `param_applied`, so we have to use
    940 # `with torch.no_grad():`
    941 with torch.no_grad():
--> 942     param_applied = fn(param)
    943 p_should_use_set_data = compute_should_use_set_data(param, param_applied)
    945 # subclasses may have multiple child tensors so we need to use swap_tensors

File /usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py:1341, in Module.to.<locals>.convert(t)
   1334     if convert_to_format is not None and t.dim() in (4, 5):
   1335         return t.to(
   1336             device,
   1337             dtype if t.is_floating_point() or t.is_complex() else None,
   1338             non_blocking,
   1339             memory_format=convert_to_format,
   1340         )
-> 1341     return t.to(
   1342         device,
   1343         dtype if t.is_floating_point() or t.is_complex() else None,
   1344         non_blocking,
   1345     )
   1346 except NotImplementedError as e:
   1347     if str(e) == "Cannot copy out of meta tensor; no data!":

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
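A "device-side assert triggered" in a classification setup is frequently caused by a target label outside `[0, num_classes)`, which is exactly the label/`num_classes` mismatch this PR's commit message mentions; the GPU kernel asserts, and the error surfaces later at an unrelated call (hence the suggestion to rerun with `CUDA_LAUNCH_BLOCKING=1`). A cheap CPU-side sanity check before training, using a hypothetical helper (not part of torchFastText's API):

```python
# Hypothetical pre-flight check: verify labels fit the model's output range
# before moving anything to the GPU, where range violations only surface as
# opaque device-side asserts.
import numpy as np

def check_labels(y, num_classes):
    y = np.asarray(y)
    if y.min() < 0 or y.max() >= num_classes:
        raise ValueError(
            f"labels must lie in [0, {num_classes - 1}], "
            f"got range [{y.min()}, {y.max()}]"
        )

check_labels([0, 1, 2], num_classes=3)  # passes silently
```

Running the check on CPU gives an immediate, readable error instead of an asynchronous CUDA assert.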

Comment thread notebooks/example.ipynb
],
"source": [
"# Stable version\n",
"pip install torchFastText \n",
Contributor


Suggested change
"pip install torchFastText \n",
"!pip install torchFastText \n",

Comment thread notebooks/example.ipynb
"metadata": {},
"outputs": [],
"source": [
"model.load_from_checkpoint(model.best_model_path) # or any other checkpoint path (string)"
Contributor


This line causes an error when predicting (when training was done on GPU).
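A checkpoint saved during GPU training stores CUDA tensors, so loading it on a CPU-only machine fails unless the storages are remapped. `torch.load` takes a `map_location` argument for this (PyTorch Lightning's `load_from_checkpoint` accepts it as well). A self-contained sketch, using an in-memory buffer in place of a real checkpoint file:

```python
# Sketch: remap a saved checkpoint's tensors to CPU at load time.
# The buffer stands in for a checkpoint file written during GPU training.
import io
import torch

buf = io.BytesIO()
torch.save({"w": torch.zeros(2)}, buf)
buf.seek(0)

# map_location forces all storages onto the CPU, so a checkpoint written
# on a CUDA machine can still be loaded where no GPU is available.
state = torch.load(buf, map_location=torch.device("cpu"))
assert state["w"].device.type == "cpu"
```

The same `map_location="cpu"` argument would apply when loading the best-model checkpoint for prediction on a CPU-only machine.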

@micedre
Contributor

micedre commented May 20, 2025


Could not reproduce it today... Either way, it works on CPU.

@micedre micedre merged commit 156d374 into main Jun 25, 2025
3 checks passed

2 participants