Updates Models to support new dataloader format for lists (__values and __offsets in dict) and scalar (1D) #999

Merged · 29 commits · Mar 15, 2023

Conversation

@gabrielspmoreira (Member) commented Feb 24, 2023

Fixes #819
It also resolves #823 (Capture shapes everywhere) for the Merlin Models library.

Goals ⚽

This PR changes Merlin Models to support dataloader changes from PR #101:

  • List features are now represented by two dict keys, {feature_name}__values and {feature_name}__offsets, instead of a tuple (a short example follows this list)
  • Scalar features are now 1D tensors instead of 2D
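For illustration, here is a hypothetical batch in the new format and one way to rebuild a tf.RaggedTensor from the values/offsets pair (the feature names are made up, not taken from this PR):

```python
import tensorflow as tf

# Hypothetical batch in the new dataloader format (feature names are illustrative)
batch = {
    # list feature: flat values plus row offsets, instead of a (values, lengths) tuple
    "categories__values": tf.constant([10, 11, 12, 13, 14], dtype=tf.int64),
    "categories__offsets": tf.constant([0, 2, 5], dtype=tf.int32),  # rows: [10, 11], [12, 13, 14]
    # scalar feature: a 1D tensor of shape (batch_size,), no trailing dimension of 1
    "user_age": tf.constant([25.0, 41.0]),
}

# One way to rebuild the ragged list feature from the values/offsets pair
categories = tf.RaggedTensor.from_row_splits(
    values=batch["categories__values"],
    row_splits=batch["categories__offsets"],
)
print(categories)  # <tf.RaggedTensor [[10, 11], [12, 13, 14]]>
```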

It also updates code that used the ColumnSchema "value_count" property to use the new ColumnSchema shape API.

Implementation Details 🚧

Merlin Models had code in many places that converted the tuple representation into RaggedTensor, SparseTensor, or Tensor.
This PR centralizes the conversion from the dataloader representation of list features and eliminates the duplicated list-conversion code.

  • Created centralized blocks to prepare scalar and list features (PrepareFeatures, PrepareListFeatures); a rough sketch of the idea follows this list
  • Removed duplicated blocks that managed lists (ListToDense, ListToRagged, ListToSparse)
  • Replaced usage of "properties/value_count" with the new shape API
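To give a sense of what such a preparation block does, here is a minimal sketch (not the actual PrepareFeatures / PrepareListFeatures implementation) that pairs up the __values/__offsets keys, emits tf.RaggedTensor for list features, and passes scalar features through unchanged:

```python
import tensorflow as tf


class PrepareListFeaturesSketch(tf.keras.layers.Layer):
    """Illustrative only: converts {name}__values / {name}__offsets pairs into tf.RaggedTensor."""

    def call(self, inputs):
        outputs = {}
        for name, tensor in inputs.items():
            if name.endswith("__offsets"):
                continue  # consumed together with the matching __values entry
            if name.endswith("__values"):
                base = name[: -len("__values")]
                offsets = inputs[f"{base}__offsets"]
                # Rebuild the ragged list feature from the flat values and row offsets
                outputs[base] = tf.RaggedTensor.from_row_splits(
                    tensor, tf.cast(offsets, tf.int64)
                )
            else:
                outputs[name] = tensor  # scalar (1D) features pass through unchanged
        return outputs
```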

Other

  • Changed the NVTabular preprocessing workflow in examples/usecases/transformers-next-item-prediction.ipynb to fix tagging and make it simpler

Testing Details 🔍

  • Fixed dozens of unit and integration tests to match the new dataloader format and shape API

@review-notebook-app commented
Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter Notebooks.

@gabrielspmoreira marked this pull request as draft February 24, 2023 06:13
@gabrielspmoreira self-assigned this Feb 24, 2023
@gabrielspmoreira added the enhancement (New feature or request) and chore (Maintenance for the repository) labels Feb 24, 2023
@gabrielspmoreira added this to the Merlin 23.03 milestone Feb 24, 2023
@github-actions commented
Documentation preview: https://nvidia-merlin.github.io/models/review/pr-999

@@ -208,7 +200,13 @@ def InputBlock(
name="continuous_projection",
)

return ParallelBlock(branches, aggregation=aggregation, post=post, is_input=True, **kwargs)
_pre = PrepareFeatures(schema)
Member:
Does every model have PrepareFeatures as the first layer? If so, maybe we don't need this in the InputBlock?

Member Author (@gabrielspmoreira):
I changed that. Since many blocks don't use InputBlockV2 and InputBlock, I moved the PrepareFeatures call to Model, Encoder, the blocks that can be used as pre for fit() / evaluate(), and the ones that can be used as a transform for Loader.map().

@oliverholworthy (Member) commented

Re: the issue with the loader returning SparseTensors. I think this is because in some cases we're now setting the value_count.max property (specifying the shape of ragged dimensions), which wasn't previously set. Until we remove the sparse output option in the dataloader (NVIDIA-Merlin/dataloader#103), one intermediate step could be to update the _augment_schema helper function to always remove value_count from the properties if it's there, so that ragged columns always return the ragged representation.
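A rough sketch of that intermediate step, assuming the merlin.schema API exposes Schema.column_schemas as a name → ColumnSchema mapping (this is not the actual _augment_schema code):

```python
from merlin.schema import ColumnSchema, Schema


def drop_value_count(schema: Schema) -> Schema:
    """Sketch: rebuild each column without the value_count property so ragged
    list columns keep the ragged (__values/__offsets) representation."""
    columns = []
    for col in schema.column_schemas.values():
        properties = {k: v for k, v in col.properties.items() if k != "value_count"}
        columns.append(
            ColumnSchema(
                col.name,
                dtype=col.dtype,
                tags=col.tags,
                properties=properties,
                is_list=col.is_list,
                is_ragged=col.is_ragged,
            )
        )
    return Schema(columns)
```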

@jperez999 marked this pull request as ready for review March 8, 2023 19:09
@jperez999 marked this pull request as draft March 8, 2023 20:08
@gabrielspmoreira marked this pull request as ready for review March 8, 2023 20:38
tox.ini — review thread resolved (outdated)
@@ -26,8 +26,7 @@
"name": "item_age_days_norm",
"type": "FLOAT",
"valueCount": {
"min": "1",
"max": "4"
Member:

Why do we need to remove the max value count from this schema and others?

Member Author (@gabrielspmoreira):

The issue is that when min and max are provided and are different, the dataloader returns a tf.SparseTensor, but now MM can only take either tf.Tensor or our convention for ragged tensors (__values and __offsets). So I removed the max to force it to return the ragged representation.
I know you already have a draft PR that will make it return ragged tensors in such cases. As soon as it is merged, we can add back value_count.max to the test dataset schemas if needed.
Does that make sense?
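In other words, the change to the test schema properties amounts to the following (illustrative values):

```python
# Before: min != max, so the dataloader emits tf.SparseTensor for this column
value_count = {"min": 1, "max": 4}

# After: only min is kept, so the loader falls back to the ragged
# (__values/__offsets) representation that Merlin Models now expects
value_count = {"min": 1}
```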

layer_inputs = {
k: v
for k, v in inputs.items()
if k.replace("__values", "").replace("__offsets", "") in maybe_schema.column_names
Member:

What situation is this change required for? If we're handling the ragged inputs in the first layer, how are the values/offsets making their way through to a ParallelBlock?

Member Author (@gabrielspmoreira):

You are right. In general we call PrepareFeatures right at the beginning of Model to ensure that lists are converted from our ragged tensor convention (__values and __offsets) to tf.RaggedTensor.
But for RetrievalModelV2 we cannot call PrepareFeatures in Model, because in that case the Encoders (which are tf.keras.Model) would receive tf.RaggedTensor as inputs, and that would cause problems when the query encoder is saved to be served on Triton.
That is why the RetrievalModelV2 constructor sets prep_features=False and leaves prep_features=True to the Encoder class. Since RetrievalModelV2 is a ParallelBlock of Encoders, that is where it would receive the ragged convention (__values and __offsets), and this change ensures that the ragged representation is cascaded to the Encoder block.
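To make the filtering concrete, here is the same key matching applied to a hypothetical inputs dict (feature names are made up):

```python
# Inputs reaching the ParallelBlock of Encoders, still in the ragged convention
inputs = {
    "item_id__values": [1, 2, 3],
    "item_id__offsets": [0, 1, 3],
    "user_age": [25.0, 41.0],
}
column_names = ["item_id"]  # columns selected by this branch's schema

# Strip the __values/__offsets suffixes before the membership check, so both
# halves of a list feature are forwarded to the Encoder that converts them.
layer_inputs = {
    k: v
    for k, v in inputs.items()
    if k.replace("__values", "").replace("__offsets", "") in column_names
}
# layer_inputs keeps "item_id__values" and "item_id__offsets"; "user_age" is dropped
```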

@@ -159,11 +166,14 @@ def batch_predict(

return merlin.io.Dataset(predictions)

def call(self, inputs, training=False, testing=False, targets=None, **kwargs):
def call(self, inputs, targets=None, training=False, testing=False, **kwargs):
Member:

Is the re-ordering of the params a functional change? E.g. are there any places where these are being passed as positional params instead of keyword arguments? One option could be to enforce this by adding a * after inputs; that way we can be sure that the order of the keyword arguments doesn't matter.

def call(self, inputs, *, targets=None, training=False, testing=False, **kwargs):

Member Author (@gabrielspmoreira):

Done

is_list=True,
is_ragged=is_ragged,
properties=properties,
dims=((0, None), (shape[1], shape[1])),
Member:

This can be simplified to dims=(None, shape[1]), which is equivalent to what is already here.

Member Author (@gabrielspmoreira):

Done

@bschifferer (Contributor) commented
rerun tests

Labels
breaking (Breaking change), chore (Maintenance for the repository), enhancement (New feature or request)
Projects
None yet
6 participants