
[BUG] to_parquet throws an error when ListSlice(..., pad=True) and ValueCount() are combined. #1700

Closed
edknv opened this issue Nov 8, 2022 · 0 comments · Fixed by NVIDIA-Merlin/core#169
Labels: bug (Something isn't working), P1

Comments

edknv commented Nov 8, 2022

Describe the bug
to_parquet throws an error when ListSlice(..., pad=True) and ValueCount() are combined.

Steps/Code to reproduce bug
Container version: nvcr.io/nvidia/merlin/merlin-tensorflow:22.10

  1. Notebook: https://github.com/NVIDIA-Merlin/models/blob/a507022f1350f6d496cfa1bb5ee070a49acc9aa4/examples/usecases/ecommerce-session-based-next-item-prediction-for-fashion.ipynb
    Add ValueCount to Cell 12:
SESSIONS_MAX_LENGTH = 3
truncated_features = (
    groupby_features[list_features]
    >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH, pad=True)
    >> nvt.ops.Rename(postfix="_seq")
    >> nvt.ops.ValueCount()  # Adding this produces an error
)
final_features = groupby_features[nonlist_features] + truncated_features
  2. Or run the following script:
import os

import nvtabular as nvt
from nvtabular.ops import AddMetadata
from merlin.schema import Schema, Tags
from merlin.datasets.synthetic import generate_data


train, valid = generate_data("dressipi2022-preprocessed", num_rows=1_000, set_sizes=(0.8, 0.2))

item_features_names = ["f_" + str(col) for col in [47, 68]]
cat_features = [["item_id", "purchase_id"]] + item_features_names >> nvt.ops.Categorify()

features = ["session_id", "timestamp", "date"] + cat_features

to_aggregate = {}
to_aggregate["date"] = ["first"]
to_aggregate["item_id"] = ["last", "list"]
to_aggregate["purchase_id"] = ["first"]

for name in item_features_names:
    to_aggregate[name] = ["list"]

groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    sort_cols=["date"],
    aggs=to_aggregate,
    name_sep="_",
)

item_last = groupby_features["item_id_last"] >> AddMetadata(tags=[Tags.ITEM, Tags.ITEM_ID])
item_list = groupby_features["item_id_list"] >> AddMetadata(
    tags=[Tags.ITEM, Tags.ITEM_ID, Tags.LIST, Tags.SEQUENCE]
)
feature_list = groupby_features[[name + "_list" for name in item_features_names]] >> AddMetadata(
    tags=[Tags.SEQUENCE, Tags.ITEM, Tags.LIST]
)
target_feature = groupby_features["purchase_id_first"] >> AddMetadata(tags=[Tags.TARGET])
other_features = groupby_features["session_id", "date_first"]

groupby_features = item_last + item_list + feature_list + other_features + target_feature

list_features = [name + "_list" for name in item_features_names] + ["item_id_list"]
nonlist_features = ["session_id", "date_first", "item_id_last", "purchase_id_first"]

SESSIONS_MAX_LENGTH = 3
truncated_features = (
    groupby_features[list_features]
    >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH, pad=True, pad_value=0)
    >> nvt.ops.Rename(postfix="_seq")
    >> nvt.ops.ValueCount()
)

final_features = groupby_features[nonlist_features] + truncated_features

workflow = nvt.Workflow(final_features)

workflow.fit(train)
transformed_workflow = workflow.transform(train)
transformed_workflow.to_parquet(os.path.join("/tmp", "train/"))
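For context, the interaction being exercised can be sketched in plain Python (an illustration only, not NVTabular internals; the padding side is an assumption): ListSlice(-SESSIONS_MAX_LENGTH, pad=True, pad_value=0) keeps the last N items of each list and pads to a fixed length, so a downstream ValueCount observes min == max == N for every row — the fixed-length case that ends up producing the bad schema proto.

```python
def list_slice_pad(values, length=3, pad_value=0):
    """Illustrative stand-in for ListSlice(-length, pad=True, pad_value=0):
    keep the last `length` items, then pad to a fixed length.
    (Right-padding is an assumption, not verified against NVTabular.)"""
    sliced = list(values)[-length:]
    return sliced + [pad_value] * (length - len(sliced))

def value_count(list_column):
    """Min/max list length across rows, as ValueCount records in the schema."""
    lengths = [len(v) for v in list_column]
    return {"min": min(lengths), "max": max(lengths)}

rows = [[5], [1, 2, 3, 4], [7, 8]]
padded = [list_slice_pad(r) for r in rows]
print(padded)               # [[5, 0, 0], [2, 3, 4], [7, 8, 0]]
print(value_count(padded))  # {'min': 3, 'max': 3} -- every row is fixed-length
```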

Error:

Traceback (most recent call last):
  File "nvt.py", line 60, in <module>
    transformed_workflow.to_parquet(os.path.join("/tmp", "train/"))
  File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 902, in to_parquet
    tf_metadata.to_proto_text_file(output_path)
  File "/usr/local/lib/python3.8/dist-packages/merlin/schema/io/tensorflow_metadata.py", line 153, in to_proto_text_file
    _write_file(self.to_proto_text(), path, file_name)
  File "/usr/local/lib/python3.8/dist-packages/merlin/schema/io/tensorflow_metadata.py", line 139, in to_proto_text
    return proto_utils.better_proto_to_proto_text(self.proto_schema, schema_pb2.Schema())
  File "/usr/local/lib/python3.8/dist-packages/merlin/schema/io/proto_utils.py", line 84, in better_proto_to_proto_text
    message.ParseFromString(bytes(better_proto_message))
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1128, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1195, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/decoder.py", line 705, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1195, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/decoder.py", line 726, in DecodeField
    if value._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1195, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/decoder.py", line 705, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1178, in InternalParse
    raise message_mod.DecodeError('Field number 0 is illegal.')
google.protobuf.message.DecodeError: Field number 0 is illegal.
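For what it's worth, the final DecodeError comes from the protobuf wire format itself: every serialized field key encodes `(field_number << 3) | wire_type`, and field number 0 is reserved, so a key that decodes to field 0 makes the parser bail. A minimal sketch of that key decoding (the conclusion about which message serialized badly is a guess, not verified):

```python
def decode_key(key_byte):
    # protobuf wire format: key = (field_number << 3) | wire_type
    field_number = key_byte >> 3
    wire_type = key_byte & 0x07
    return field_number, wire_type

# A key byte of 0 decodes to field number 0, which the parser rejects
# ("Field number 0 is illegal") -- suggesting the better-proto schema
# message serialized a field with number 0.
print(decode_key(0x00))  # (0, 0)
print(decode_key(0x0A))  # (1, 2): field 1, length-delimited
```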

Expected behavior
Dataset is written to parquet files with padded lists and value counts.

Environment details (please complete the following information):

  • Environment location: nvcr.io/nvidia/merlin/merlin-tensorflow:22.10
  • Method of NVTabular install: Docker
    • docker run --rm -it --net host --gpus 0 -v /home/edwardk/data:/root/data nvcr.io/nvidia/merlin/merlin-tensorflow:22.10 bash