
[BUG] to_parquet throws an error when ListSlice(..., pad=True) and ValueCount() are combined. #1700

Closed
edknv opened this issue Nov 8, 2022 · 0 comments · Fixed by NVIDIA-Merlin/core#169
Labels: bug (Something isn't working), P1

Comments

edknv commented Nov 8, 2022

Describe the bug
to_parquet throws an error when ListSlice(..., pad=True) and ValueCount() are combined.

Steps/Code to reproduce bug
Container version: nvcr.io/nvidia/merlin/merlin-tensorflow:22.10

  1. Notebook: https://github.com/NVIDIA-Merlin/models/blob/a507022f1350f6d496cfa1bb5ee070a49acc9aa4/examples/usecases/ecommerce-session-based-next-item-prediction-for-fashion.ipynb
    Add ValueCount to Cell 12:
SESSIONS_MAX_LENGTH = 3
truncated_features = (
    groupby_features[list_features]
    >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH, pad=True)
    >> nvt.ops.Rename(postfix="_seq")
    >> nvt.ops.ValueCount()  # Adding this produces an error
)
final_features = groupby_features[nonlist_features] + truncated_features
  2. Or run the following script:
import os

import nvtabular as nvt
from nvtabular.ops import AddMetadata
from merlin.schema import Schema, Tags
from merlin.datasets.synthetic import generate_data


train, valid = generate_data("dressipi2022-preprocessed", num_rows=1_000, set_sizes=(0.8, 0.2))

item_features_names = ["f_" + str(col) for col in [47, 68]]
cat_features = [["item_id", "purchase_id"]] + item_features_names >> nvt.ops.Categorify()

features = ["session_id", "timestamp", "date"] + cat_features

to_aggregate = {}
to_aggregate["date"] = ["first"]
to_aggregate["item_id"] = ["last", "list"]
to_aggregate["purchase_id"] = ["first"]

for name in item_features_names:
    to_aggregate[name] = ["list"]

groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    sort_cols=["date"],
    aggs=to_aggregate,
    name_sep="_",
)

item_last = groupby_features["item_id_last"] >> AddMetadata(tags=[Tags.ITEM, Tags.ITEM_ID])
item_list = groupby_features["item_id_list"] >> AddMetadata(
    tags=[Tags.ITEM, Tags.ITEM_ID, Tags.LIST, Tags.SEQUENCE]
)
feature_list = groupby_features[[name + "_list" for name in item_features_names]] >> AddMetadata(
    tags=[Tags.SEQUENCE, Tags.ITEM, Tags.LIST]
)
target_feature = groupby_features["purchase_id_first"] >> AddMetadata(tags=[Tags.TARGET])
other_features = groupby_features["session_id", "date_first"]

groupby_features = item_last + item_list + feature_list + other_features + target_feature

list_features = [name + "_list" for name in item_features_names] + ["item_id_list"]
nonlist_features = ["session_id", "date_first", "item_id_last", "purchase_id_first"]

SESSIONS_MAX_LENGTH = 3
truncated_features = (
    groupby_features[list_features]
    >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH, pad=True, pad_value=0)
    >> nvt.ops.Rename(postfix="_seq")
    >> nvt.ops.ValueCount()
)

final_features = groupby_features[nonlist_features] + truncated_features

workflow = nvt.Workflow(final_features)

workflow.fit(train)
transformed_workflow = workflow.transform(train)
transformed_workflow.to_parquet(os.path.join("/tmp", "train/"))
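For context, the interaction being exercised can be sketched in plain Python (an illustration only, not NVTabular internals; the padding side is an assumption): ListSlice(-SESSIONS_MAX_LENGTH, pad=True, pad_value=0) keeps the last N items of each list and pads to a fixed length, so a downstream ValueCount observes min == max == N for every row — the fixed-length case that ends up producing the bad schema proto.

```python
def list_slice_pad(values, length=3, pad_value=0):
    """Illustrative stand-in for ListSlice(-length, pad=True, pad_value=0):
    keep the last `length` items, then pad to a fixed length.
    (Right-padding is an assumption, not verified against NVTabular.)"""
    sliced = list(values)[-length:]
    return sliced + [pad_value] * (length - len(sliced))

def value_count(list_column):
    """Min/max list length across rows, as ValueCount records in the schema."""
    lengths = [len(v) for v in list_column]
    return {"min": min(lengths), "max": max(lengths)}

rows = [[5], [1, 2, 3, 4], [7, 8]]
padded = [list_slice_pad(r) for r in rows]
print(padded)               # [[5, 0, 0], [2, 3, 4], [7, 8, 0]]
print(value_count(padded))  # {'min': 3, 'max': 3} -- every row is fixed-length
```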

Error:

Traceback (most recent call last):
  File "nvt.py", line 60, in <module>
    transformed_workflow.to_parquet(os.path.join("/tmp", "train/"))
  File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 902, in to_parquet
    tf_metadata.to_proto_text_file(output_path)
  File "/usr/local/lib/python3.8/dist-packages/merlin/schema/io/tensorflow_metadata.py", line 153, in to_proto_text_file
    _write_file(self.to_proto_text(), path, file_name)
  File "/usr/local/lib/python3.8/dist-packages/merlin/schema/io/tensorflow_metadata.py", line 139, in to_proto_text
    return proto_utils.better_proto_to_proto_text(self.proto_schema, schema_pb2.Schema())
  File "/usr/local/lib/python3.8/dist-packages/merlin/schema/io/proto_utils.py", line 84, in better_proto_to_proto_text
    message.ParseFromString(bytes(better_proto_message))
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1128, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1195, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/decoder.py", line 705, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1195, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/decoder.py", line 726, in DecodeField
    if value._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1195, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/decoder.py", line 705, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/internal/python_message.py", line 1178, in InternalParse
    raise message_mod.DecodeError('Field number 0 is illegal.')
google.protobuf.message.DecodeError: Field number 0 is illegal.
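For what it's worth, the final DecodeError comes from the protobuf wire format itself: every serialized field key encodes `(field_number << 3) | wire_type`, and field number 0 is reserved, so a key that decodes to field 0 makes the parser bail. A minimal sketch of that key decoding (the conclusion about which message serialized badly is a guess, not verified):

```python
def decode_key(key_byte):
    # protobuf wire format: key = (field_number << 3) | wire_type
    field_number = key_byte >> 3
    wire_type = key_byte & 0x07
    return field_number, wire_type

# A key byte of 0 decodes to field number 0, which the parser rejects
# ("Field number 0 is illegal") -- suggesting the better-proto schema
# message serialized a field with number 0.
print(decode_key(0x00))  # (0, 0)
print(decode_key(0x0A))  # (1, 2): field 1, length-delimited
```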

Expected behavior
Dataset is written to parquet files with padded lists and value counts.

Environment details (please complete the following information):

  • Environment location: nvcr.io/nvidia/merlin/merlin-tensorflow:22.10
  • Method of NVTabular install: Docker
    • docker run --rm -it --net host --gpus 0 -v /home/edwardk/data:/root/data nvcr.io/nvidia/merlin/merlin-tensorflow:22.10 bash