[BUG] Bugs in examples/tutorial #589

Closed

lendle opened this issue Dec 28, 2022 · 3 comments
Labels: bug (Something isn't working), P2, status/needs-triage
Milestone: Merlin 23.02

Comments

lendle commented Dec 28, 2022

Bug description

Bug 1
examples/tutorial/03-Session-based-recsys.ipynb, section "3.2.4 Train XLNET with Side Information for Next Item Prediction": the cell that runs training fails.

Log with stack trace
***** Running training *****
  Num examples = 112128
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 1314
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <timed exec>:15

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1316, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1314         tr_loss_step = self.training_step(model, inputs)
   1315 else:
-> 1316     tr_loss_step = self.training_step(model, inputs)
   1318 if (
   1319     args.logging_nan_inf_filter
   1320     and not is_torch_tpu_available()
   1321     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1322 ):
   1323     # if loss is nan or inf simply add the average of previous logged losses
   1324     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1849, in Trainer.training_step(self, model, inputs)
   1847         loss = self.compute_loss(model, inputs)
   1848 else:
-> 1849     loss = self.compute_loss(model, inputs)
   1851 if self.args.n_gpu > 1:
   1852     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1881, in Trainer.compute_loss(self, model, inputs, return_outputs)
   1879 else:
   1880     labels = None
-> 1881 outputs = model(**inputs)
   1882 # Save past state if it exists
   1883 # TODO: this needs to be fixed and made cleaner later.
   1884 if self.args.past_index >= 0:

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/trainer.py:830, in HFWrapper.forward(self, *args, **kwargs)
    828 def forward(self, *args, **kwargs):
    829     inputs = kwargs
--> 830     return self.wrapper_module(inputs, *args)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/model/base.py:553, in Model.forward(self, inputs, training, **kwargs)
    550 outputs = {}
    551 for head in self.heads:
    552     outputs.update(
--> 553         head(inputs, call_body=True, training=training, always_output_dict=True, **kwargs)
    554     )
    556 if len(outputs) == 1:
    557     outputs = outputs[list(outputs.keys())[0]]

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/model/base.py:398, in Head.forward(self, body_outputs, training, call_body, always_output_dict, ignore_masking, **kwargs)
    395 outputs = {}
    397 if call_body:
--> 398     body_outputs = self.body(body_outputs, training=training, ignore_masking=ignore_masking)
    400 for name, task in self.prediction_task_dict.items():
    401     outputs[name] = task(
    402         body_outputs, ignore_masking=ignore_masking, training=training, **kwargs
    403     )

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:152, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    150 elif "training" in inspect.signature(module.forward).parameters:
    151     if "ignore_masking" in inspect.signature(module.forward).parameters:
--> 152         input = module(input, training=training, ignore_masking=ignore_masking)
    153     else:
    154         input = module(input, training=training)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/tabular/base.py:390, in TabularModule.__call__(self, inputs, pre, post, merge_with, aggregation, *args, **kwargs)
    387 inputs = self.pre_forward(inputs, transformations=pre)
    389 # This will call the `forward` method implemented by the super class.
--> 390 outputs = super().__call__(inputs, *args, **kwargs)  # noqa
    392 if isinstance(outputs, dict):
    393     outputs = self.post_forward(
    394         outputs, transformations=post, merge_with=merge_with, aggregation=aggregation
    395     )

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/sequence.py:257, in TabularSequenceFeatures.forward(self, inputs, training, ignore_masking, **kwargs)
    254     outputs = self.aggregation(outputs)
    256 if self.projection_module:
--> 257     outputs = self.projection_module(outputs)
    259 if self.masking and (not ignore_masking or training):
    260     outputs = self.masking(
    261         outputs, item_ids=self.to_merge["categorical_module"].item_seq, training=training
    262     )

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:148, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    146 if i == len(self) - 1:
    147     filtered_kwargs = filter_kwargs(kwargs, module, filter_positional_or_keyword=False)
--> 148     input = module(input, **filtered_kwargs)
    150 elif "training" in inspect.signature(module.forward).parameters:
    151     if "ignore_masking" in inspect.signature(module.forward).parameters:

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:156, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    154             input = module(input, training=training)
    155     else:
--> 156         input = module(input)
    158 return input

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: expected scalar type Float but found Double

I believe this is because the product_recency_days_log_norm-list_seq created in the prior notebook (02-ETL-with-NVTabular) is float64 rather than float32. I was able to get things to run by adding >> nvt.ops.ReduceDtypeSize() to the cell where that feature is defined in the prior notebook, section 5.3. I'm not sure if this is the correct fix though.
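
For concreteness, a minimal sketch of that workaround, assuming the section 5.3 cell defines the recency feature with a LogOp/Normalize chain roughly like the one below (recency_features is a placeholder for the notebook's actual upstream column selector, not its real name):

import nvtabular as nvt

# Approximate reconstruction of the section 5.3 cell in 02-ETL-with-NVTabular;
# only the added ReduceDtypeSize() op is the workaround described above.
recency_features_norm = (
    recency_features
    >> nvt.ops.LogOp()
    >> nvt.ops.Normalize()
    >> nvt.ops.ReduceDtypeSize()  # shrinks the float64 output down to float32
    >> nvt.ops.Rename(name="product_recency_days_log_norm")
)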

Bug 2

The XLNet-MLM with side information accuracy results that get written to results.txt in 03-Session-based-recsys should have metric names and values separated by ':' rather than by a space. Metrics from the other two models trained in the notebook are written correctly. The mismatch causes the call to create_bar_chart('results.txt') to fail.

Easy fix: the cell

with open("results.txt", 'a') as f:
    f.write('\n')
    f.write('XLNet-MLM with side information accuracy results:')
    f.write('\n')
    for key, value in  model.compute_metrics().items(): 
        f.write('%s %s\n' % (key, value.item()))

should use f.write('%s:%s\n' % (key, value.item())) in the last line instead.
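
Applied to the cell above, the corrected loop would look like:

with open("results.txt", 'a') as f:
    f.write('\n')
    f.write('XLNet-MLM with side information accuracy results:')
    f.write('\n')
    for key, value in model.compute_metrics().items():
        # ':' as separator, matching the format create_bar_chart('results.txt') expects
        f.write('%s:%s\n' % (key, value.item()))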

Steps/Code to reproduce bug

Run the tutorial notebooks.

Expected behavior

Environment details

Google Cloud Workbench managed notebook with image version nvcr.io/nvidia/merlin/merlin-pytorch:22.11

Machine info: a2-highgpu-1g (Accelerator Optimized: 1 NVIDIA Tesla A100 GPU, 12 vCPUs, 85GB RAM)

I'm using the versions of the example notebooks that are available in the image.

  • Transformers4Rec version: 0.1.15
  • Platform: Google Cloud Workbench managed notebook, image nvcr.io/nvidia/merlin/merlin-pytorch:22.11, Machine type a2-highgpu-1g (Accelerator Optimized: 1 NVIDIA Tesla A100 GPU, 12 vCPUs, 85GB RAM)
  • Python version: 3.8.10
  • Huggingface Transformers version: 4.12.0
  • PyTorch version (GPU?): 1.13.0a0+d321be6
  • Tensorflow version (GPU?):

Additional context

rnyak commented Jan 3, 2023

@lendle Thanks for reporting that. We'll take a look shortly.

rnyak commented Jan 5, 2023

@lendle I cannot reproduce the first error message you shared from examples/tutorial/03-Session-based-recsys.ipynb, section "3.2.4 Train XLNET with Side Information for Next Item Prediction". Please note that we already fixed the dtype of the product_recency_days_log_norm-list_seq created in the prior 02-ETL notebook to float32. We do it this way in the notebook:

price_log = ['price'] >> nvt.ops.LogOp() >> nvt.ops.Normalize(out_dtype=np.float32) >> nvt.ops.Rename(name='price_log_norm')

You might want to use the merlin-pytorch:22.12 docker image for the recent changes, or just fix the line above in your 02-ETL notebook.
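
For anyone patching the 02-ETL notebook by hand, a hedged sketch of applying the same out_dtype pattern to the recency feature (the upstream selector and op chain are assumptions; only the Normalize(out_dtype=np.float32) call is taken from the comment above):

import numpy as np
import nvtabular as nvt

# recency_features is a placeholder for the notebook's upstream column selector.
recency_features_norm = (
    recency_features
    >> nvt.ops.LogOp()
    >> nvt.ops.Normalize(out_dtype=np.float32)  # cast the normalized output to float32
    >> nvt.ops.Rename(name="product_recency_days_log_norm")
)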

For the second bug, we'll fix that. Thanks.

rnyak added this to the Merlin 23.02 milestone on Jan 11, 2023
rnyak added the P2 label on Jan 11, 2023

rnyak commented Jan 17, 2023

Closing due to lack of activity.

rnyak closed this as completed on Jan 17, 2023