In [19]:
# %%bash
# cd /core
# git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
# pip install . --no-deps

# cd /dataloader
# git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
# pip install . --no-deps

# cd /nvtabular
# git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
# pip install . --no-deps

# cd /systems
# git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
# pip install . --no-deps

# cd /transformers4rec
# git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && git fetch && git checkout main
# pip install . --no-deps

In [2]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_models-transformers-net-item-prediction/nvidia_logo.png" style="width: 90px; float: right;">

# Transformer-based architecture for next-item prediction task with pretrained embeddings

This notebook is created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container.

## Overview

In this use case, we will train a Transformer-based architecture on the next-item prediction task using pretrained embeddings.

**You can choose to download the full dataset manually or use synthetic data.**

We will use the [SIGIR eCOM 2021 Data Challenge Dataset](https://github.com/coveooss/SIGIR-ecom-data-challenge) to train a session-based model. The dataset contains 36M events of users browsing an online store.

We will reshape the data to organize it into 'sessions'. Each session will be a full customer online journey in chronological order. The goal will be to predict the `url` of the next action taken.
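
To make the prediction target concrete, here is a minimal plain-Python sketch of how an ordered session can be framed as next-item examples. The session contents and the `next_item_examples` helper are made up for illustration; they show the framing of the task only, not how the model in this notebook is trained.

```python
# A made-up session: urls visited, already in chronological order.
session = ["url_a", "url_b", "url_c", "url_d"]


def next_item_examples(session):
    """Pair every non-empty prefix of a session with the item that follows it."""
    return [(session[:i], session[i]) for i in range(1, len(session))]


for prefix, target in next_item_examples(session):
    print(prefix, "->", target)
```

Each pair holds the part of the journey seen so far and the `url` that comes next.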


### Learning objectives

- Training a Transformer-based architecture for next-item prediction task

## Downloading and preparing the dataset

In [16]:
import cudf
import numpy as np
import pandas as pd
import nvtabular as nvt
from merlin.schema import ColumnSchema, Schema, Tags

You can download the full dataset by registering [here](https://www.coveo.com/en/ailabs/sigir-ecom-data-challenge). If you choose to download the data, please place it alongside this notebook in the `sigir_dataset` directory and extract it.

To process the downloaded data, uncomment the cell below.

By default, in this notebook, we will be using synthetically generated data based on the SIGIR dataset.

In [2]:
# # Uncomment this cell to use the original SIGIR dataset.

# train = nvt.Dataset('/workspace/sigir_dataset/train/browsing_train.csv', part_size='500MB')
# skus = nvt.Dataset('/workspace/sigir_dataset/train/sku_to_content.csv')

# skus = pd.read_csv('/workspace/sigir_dataset/train/sku_to_content.csv')

# skus['description_vector'] = skus['description_vector'].replace(np.nan, '')
# skus['image_vector'] = skus['image_vector'].replace(np.nan, '')

# skus['description_vector'] = skus['description_vector'].apply(lambda x: [] if len(x) == 0 else eval(x))
# skus['image_vector'] = skus['image_vector'].apply(lambda x: [] if len(x) == 0 else eval(x))
# skus = skus[skus.description_vector.apply(len) > 0]
# skus = nvt.Dataset(skus)

In [3]:
from merlin.datasets.synthetic import generate_data

train = generate_data('sigir-browsing', 1000)
skus = generate_data('sigir-sku', 1000)

The `skus` dataset contains the mapping from the `product_sku_hash` (essentially an item id) to the `description_vector`, an embedding obtained from the description.

To use this information in our model, we need to map the `product_sku_hash` information to an id.

We also need to make sure that `skus` and the `train` dataset (event information) are processed consistently, so that the same `product_sku_hash` is mapped to the same id in both.

We do so by defining and fitting a `Categorify` op and using it to process both datasets.

In [4]:
cat_op = nvt.ops.Categorify()
out = ['product_sku_hash'] >> cat_op >> nvt.ops.TagAsItemID()
out += ['event_type', 'product_action', 'session_id_hash', 'hashed_url'] >> nvt.ops.Categorify()
out += ['server_timestamp_epoch_ms'] >> nvt.ops.NormalizeMinMax()

wf = nvt.Workflow(out)

train = wf.fit_transform(train)

train.head()

Unnamed: 0,product_sku_hash,event_type,product_action,session_id_hash,hashed_url,server_timestamp_epoch_ms
0,375,3,5,15,570,0.136624
1,23,4,4,13,403,0.393882
2,549,4,3,12,3,0.557738
3,148,3,5,28,505,0.612437
4,328,3,3,31,253,0.575691
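
The idea of fitting a mapping once and reusing it can be sketched in plain Python. This is a simplified illustration, not how `Categorify` is implemented: the `fit_mapping` and `apply_mapping` helpers, the sample hashes, and the choice of reserved ids are all assumptions made for the example.

```python
# Ids chosen for this sketch only (an assumption, not the library's contract).
OOV_ID = 2          # shared id for values that were not seen when fitting
FIRST_REAL_ID = 3   # ids below this are treated as reserved


def fit_mapping(values):
    """Assign a stable integer id to every distinct value, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = FIRST_REAL_ID + len(mapping)
    return mapping


def apply_mapping(mapping, values):
    """Encode values with an already fitted mapping; unseen values share OOV_ID."""
    return [mapping.get(value, OOV_ID) for value in values]


train_hashes = ["sku_a", "sku_b", "sku_a", "sku_c"]  # made-up event data
sku_hashes = ["sku_c", "sku_a", "sku_z"]             # made-up embedding table keys

mapping = fit_mapping(train_hashes)                  # fit once, on the event data
train_ids = apply_mapping(mapping, train_hashes)
sku_ids = apply_mapping(mapping, sku_hashes)         # reuse the mapping, do not refit

print(train_ids)
print(sku_ids)
```

Because both calls share one `mapping`, `sku_a` receives the same id in the events and in the embedding table, which is the property the notebook relies on.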


Now that we have processed the train set, we can use the mapping preserved in the `cat_op` to process the `skus` dataset containing the embeddings we are after.

Let's now `Categorify` the `product_sku_hash` in `skus` and grab just the description embedding information.

In [5]:
skus.head()

Unnamed: 0,product_sku_hash,description_vector,category_hash,price_bucket
0,49,"[0.19074470948308836, -0.03131100529154618, 0....",26,0.092083
1,32,"[-0.2518207082929709, -0.20303765932361634, 0....",54,0.04374
2,9,"[-0.3638074882104804, -0.10183149710489742, -0...",99,0.185579
3,97,"[-0.38035569346313397, 0.06832374946505887, -0...",33,0.497887
4,14,"[0.35193821537813347, 0.41277890253984556, 0.1...",82,0.544694


In [6]:
out = ['product_sku_hash'] >> cat_op
wf = nvt.Workflow(out + 'description_vector')
skus_ds = wf.transform(skus)

skus_ds.head()

Unnamed: 0,product_sku_hash,description_vector
0,2,"[0.19074470948308836, -0.03131100529154618, 0...."
1,106,"[-0.2518207082929709, -0.20303765932361634, 0...."
2,2,"[-0.3638074882104804, -0.10183149710489742, -0..."
3,298,"[-0.38035569346313397, 0.06832374946505887, -0..."
4,270,"[0.35193821537813347, 0.41277890253984556, 0.1..."


Let us now export the embedding information to a `numpy` array and write it to disk.

We will later pass this information so that the `Loader` will load the correct embedding for the products corresponding to the given step of a customer journey.

The embeddings are linked to the train set using the `product_sku_hash` information.

In [7]:
skus_ds.to_npy('skus.npy')

How will the `Loader` know which embedding to associate with a given row of the train set?

The `product_sku_hash` ids have been exported along with the embeddings and are contained in the first column of the output `numpy` array.

Here is the id of the first embedding stored in `skus.npy`.

In [8]:
np.load('skus.npy')[0, 0]

2.0

And here is the embedding vector for the `product_sku_hash` id referenced above:

In [9]:
np.load('skus.npy')[0, 1:]

array([ 0.19074471, -0.03131101,  0.0201479 ,  0.53804358,  0.39359061,
       -0.32382233,  0.51149587,  0.49506202,  0.60158905,  0.3474365 ,
        0.47490941,  0.3104798 , -0.2218935 , -0.34076992, -0.24330408,
       -0.39893738,  0.26564596,  0.58610945,  0.40313457,  0.13064291,
        0.34864956,  0.1488015 , -0.38335331,  0.42396508, -0.0792708 ,
        0.56811159, -0.38731376, -0.43323464, -0.3575653 ,  0.02976547,
        0.3375143 , -0.39471757, -0.11737858,  0.57075452,  0.14806672,
       -0.01940817,  0.20723742,  0.07139346, -0.28549599,  0.44750621,
       -0.28758708, -0.25481674,  0.06444519,  0.43473896, -0.33112008,
        0.58701177,  0.47687082, -0.25761298,  0.37786294,  0.35886267])
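
The layout described above (id in the first column, vector in the remaining columns) makes the lookup straightforward. The sketch below mimics it with nested Python lists instead of a `numpy` array. The table values and the `lookup_embeddings` helper are made up for illustration and do not reproduce how the `Loader` performs the lookup internally.

```python
# Each row: [id, v0, v1, v2]. Made-up values standing in for skus.npy.
table = [
    [2.0, 0.19, -0.03, 0.02],
    [106.0, -0.25, -0.20, 0.11],
    [298.0, -0.38, 0.07, -0.14],
]

# Column 0 holds the ids, the remaining columns hold the vectors.
ids = [int(row[0]) for row in table]
vectors = [row[1:] for row in table]

# Map each id to its row position so lookups do not scan the whole table.
row_of_id = {item_id: row for row, item_id in enumerate(ids)}


def lookup_embeddings(item_ids):
    """Return one vector per id in a session, in the same order."""
    return [vectors[row_of_id[item_id]] for item_id in item_ids]


session = [298, 2, 298]
print(lookup_embeddings(session))
```

The ids are stored as floats in the array, which is why the sketch casts them to `int` before using them as keys.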

Let us now construct the `Loader` that will provide the data to our model.

Let us first rearrange the `train` dataset to group the actions by `session_id_hash`. Actions within a session will be contained in a single row.

In [10]:
groupby_features = train.head().columns.tolist() >> nvt.ops.Groupby(
    groupby_cols=['session_id_hash'],
    aggs={
        'product_sku_hash': ['list'],
        'event_type': ['list'],
        'product_action': ['list'],
        'hashed_url': ['list', 'count'],
        'server_timestamp_epoch_ms': ['list']
    },
    sort_cols="server_timestamp_epoch_ms"
)

MINIMUM_SESSION_LENGTH = 5
filtered_sessions = groupby_features >> nvt.ops.Filter(f=lambda df: df["hashed_url_count"] >= MINIMUM_SESSION_LENGTH) 

wf = nvt.Workflow(filtered_sessions)
train_processed = wf.fit_transform(train)

train_processed.head()

Unnamed: 0,session_id_hash,product_sku_hash_list,event_type_list,product_action_list,hashed_url_list,server_timestamp_epoch_ms_list,hashed_url_count
0,3,"[237, 261, 88, 22, 610, 159, 275, 156, 611, 48...","[4, 4, 4, 3, 4, 3, 3, 4, 4, 3, 4, 4, 3, 4, 4, ...","[3, 4, 3, 3, 4, 6, 4, 4, 3, 5, 6, 5, 6, 5, 4, ...","[435, 3, 35, 272, 167, 42, 502, 512, 180, 146,...","[0.004735263430550937, 0.050407877989059435, 0...",36
1,4,"[154, 78, 417, 162, 71, 6, 3, 369, 91, 259, 25...","[3, 4, 3, 3, 4, 4, 4, 3, 3, 3, 4, 3, 4, 4, 4, ...","[4, 3, 6, 4, 5, 3, 5, 3, 3, 4, 5, 6, 5, 5, 3, ...","[75, 29, 175, 212, 423, 134, 8, 485, 61, 245, ...","[0.026263786969227345, 0.0973746012991237, 0.1...",31
2,5,"[508, 117, 85, 43, 40, 15, 128, 170, 578, 87, ...","[4, 4, 3, 4, 4, 4, 3, 3, 4, 3, 4, 3, 3, 3, 4, ...","[4, 5, 5, 4, 6, 4, 3, 4, 3, 6, 5, 4, 6, 4, 5, ...","[18, 611, 140, 17, 140, 90, 96, 162, 230, 68, ...","[0.004323552121669836, 0.030039676699837904, 0...",31
3,6,"[222, 566, 505, 210, 266, 138, 43, 142, 41, 27...","[3, 4, 3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 4, 3, 3, ...","[4, 5, 3, 6, 3, 4, 5, 3, 6, 6, 4, 5, 3, 6, 4, ...","[256, 368, 10, 251, 325, 72, 173, 181, 585, 19...","[0.03764613117847081, 0.1714338844084301, 0.17...",31
4,7,"[307, 367, 284, 305, 354, 9, 183, 205, 112, 27...","[3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 4, 4, 3, ...","[4, 3, 6, 4, 3, 4, 5, 5, 4, 4, 6, 5, 4, 5, 4, ...","[14, 321, 580, 54, 108, 574, 181, 427, 566, 58...","[0.024009896481330915, 0.054163690625947336, 0...",29
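
For readers less familiar with `Groupby`, the plain-Python sketch below approximates what the cell above does: collect each session's values into lists ordered by timestamp, count the events, and keep only sessions that meet a minimum length. The rows and the threshold of 2 are made up to keep the example small.

```python
from collections import defaultdict

# Hypothetical rows: (session_id_hash, hashed_url, server_timestamp_epoch_ms).
rows = [
    (3, 435, 0.05),
    (4, 75, 0.02),
    (3, 3, 0.01),
    (3, 35, 0.09),
]
MIN_LEN = 2  # smaller than the notebook's MINIMUM_SESSION_LENGTH, for brevity

by_session = defaultdict(list)
for session_id, url, ts in rows:
    by_session[session_id].append((ts, url))

processed = []
for session_id, items in sorted(by_session.items()):
    items.sort()  # chronological order within the session
    record = {
        "session_id_hash": session_id,
        "hashed_url_list": [url for _, url in items],
        "server_timestamp_epoch_ms_list": [ts for ts, _ in items],
        "hashed_url_count": len(items),
    }
    if record["hashed_url_count"] >= MIN_LEN:  # same kind of condition as the Filter op
        processed.append(record)

print(processed)
```

Session 4 has a single event, so it is dropped, while session 3 survives with its urls reordered by timestamp.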


We are now ready to construct the `Loader` that will feed the data to our model.

We begin by reading in the embeddings information.

In [11]:
embeddings = np.load('skus.npy')

Let's discard from the schema the columns that we will not be using to train our model.

In [12]:
train_processed.schema = train_processed.schema.remove_col('session_id_hash').remove_col('hashed_url_count')

We are now ready to define the `Loader`.

In [13]:
from merlin.dataloader.tensorflow import Loader
from merlin.dataloader.ops.embeddings import EmbeddingOperator

loader = Loader(
    train_processed,
    batch_size=10,
    transforms=[
        EmbeddingOperator(
            embeddings[:, 1:],
            id_lookup_table=embeddings[:, 0].astype(int),
            lookup_key="product_sku_hash_list",
        )
    ],
    shuffle=True
)

2023-06-13 02:31:28.552435: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-13 02:31:28.553593: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-13 02:31:28.553786: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-13 02:31:28.553930: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must 

Using the `EmbeddingOperator` object, we referenced our `embeddings` and told the `Loader` which column to use as the key for looking them up.

Below is an example batch of data that our model will consume.

In [14]:
batch = loader.peek()
batch

({'product_sku_hash_list__values': <tf.Tensor: shape=(161,), dtype=int64, numpy=
  array([356,  46, 592, 500, 501, 434, 509, 556, 320, 210, 604, 247, 140,
         539, 557, 257, 182, 526, 232, 165, 160, 222, 566, 505, 210, 266,
         138,  43, 142,  41, 279, 493,  97, 269, 163,   6, 131, 147, 110,
          89, 494,  40, 258, 231,   8, 461, 173,  33, 227,  59, 585, 391,
          42, 516,  72, 545,  48, 410, 426,  17,  69, 436, 148, 530,  78,
         189, 330,  48, 215, 250, 402, 271, 327, 571, 261, 342,  50, 560,
          89, 529, 339,  20, 113, 130, 144, 203, 125, 169, 129, 254, 104,
         364, 147, 300,  55, 119,  70,  11, 180, 491, 221,  65, 127, 111,
         109, 120, 393, 224, 547, 467, 554, 312, 158, 387,   9, 343, 407,
         468,  60, 552, 374, 238, 419, 109, 466, 255, 453,  56,  64,  23,
           6, 365,  96, 217, 181,  79,  90, 175, 422, 187,  68, 490, 247,
          16, 230,  18,   8, 583,  15,  28, 270,  55,   4, 215,  88, 581,
          26, 175,  45,  86,  5
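
Ragged list columns such as `product_sku_hash_list` arrive in the batch in flattened form: a `__values` tensor with the items of all rows concatenated, paired with an `__offsets` tensor marking the row boundaries. The sketch below uses plain Python lists with made-up numbers to show how such a pair maps back to per-session lists; `split_ragged` is a hypothetical helper, not part of the Merlin API.

```python
def split_ragged(values, offsets):
    """Rebuild per-row lists from flattened values and row boundaries."""
    return [values[start:end] for start, end in zip(offsets[:-1], offsets[1:])]


values = [356, 46, 592, 500, 501, 434]  # made-up flattened item ids
offsets = [0, 2, 3, 6]                  # 3 rows need 4 boundaries

rows = split_ragged(values, offsets)
print(rows)
```

With this representation, a batch of 10 sessions carries 11 offsets, and the length of the values tensor equals the total number of events across those sessions.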

We are now ready to construct our model.

In [17]:
import merlin.models.tf as mm

input_block = mm.InputBlockV2(
    loader.output_schema,
    embeddings=mm.Embeddings(
        loader.output_schema.select_by_tag(Tags.CATEGORICAL),
        sequence_combiner=None,
    ),
    pretrained_embeddings=mm.PretrainedEmbeddings(
        loader.output_schema.select_by_tag(Tags.EMBEDDING),
        sequence_combiner=None,
        normalizer="l2-norm",
        output_dims={"embeddings": 128},
    )
)

We have now constructed an `input_block` that takes our batch and transforms it into a form that the subsequent layers of our model can process.

To test that everything has worked okay, we can pass our example `batch` through the `input_block`.

In [18]:
input_block(batch)

ValueError: as_list() is not defined on an unknown TensorShape.

In [20]:
%debug

> /usr/local/lib/python3.8/dist-packages/merlin/models/tf/inputs/embedding.py(398)build()
    396         """
    397         if not self.table.built:
--> 398             self.table.build(input_shapes)
    399         return super(EmbeddingTable, self).build(input_shapes)
    400
ipdb> input_shapes
({'product_sku_hash_list__values': TensorShape([161]), 'product_sku_hash_list__offsets': TensorShape([11]), 'event_type_list__values': TensorShape([161]), 'event_type_list__offsets': TensorShape([11]), 'product_act

In [74]:
loader.input_schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.cat_path,properties.domain.min,properties.domain.max,properties.domain.name,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.value_count.min,properties.value_count.max
0,product_sku_hash_list,"(Tags.ITEM, Tags.CATEGORICAL, Tags.ID)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,.//categories/unique.product_sku_hash.parquet,0.0,57485.0,product_sku_hash,57486.0,512.0,0,
1,event_type_list,(Tags.CATEGORICAL),"DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,.//categories/unique.event_type.parquet,0.0,4.0,event_type,5.0,16.0,0,
2,product_action_list,(Tags.CATEGORICAL),"DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,.//categories/unique.product_action.parquet,0.0,6.0,product_action,7.0,16.0,0,
3,hashed_url_list,(Tags.CATEGORICAL),"DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,.//categories/unique.hashed_url.parquet,0.0,489302.0,hashed_url,489303.0,512.0,0,
4,server_timestamp_epoch_ms_list,(Tags.CONTINUOUS),"DType(name='float64', element_type=<ElementTyp...",True,True,,,,,,,,,,0,


In [75]:
loader.output_schema

Unnamed: 0,name,tags,dtype,is_list,is_ragged,properties.num_buckets,properties.freq_threshold,properties.max_size,properties.cat_path,properties.domain.min,properties.domain.max,properties.domain.name,properties.embedding_sizes.cardinality,properties.embedding_sizes.dimension,properties.value_count.min,properties.value_count.max
0,product_sku_hash_list,"(Tags.ITEM, Tags.CATEGORICAL, Tags.ID)","DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,.//categories/unique.product_sku_hash.parquet,0.0,57485.0,product_sku_hash,57486.0,512.0,0,
1,event_type_list,(Tags.CATEGORICAL),"DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,.//categories/unique.event_type.parquet,0.0,4.0,event_type,5.0,16.0,0,
2,product_action_list,(Tags.CATEGORICAL),"DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,.//categories/unique.product_action.parquet,0.0,6.0,product_action,7.0,16.0,0,
3,hashed_url_list,(Tags.CATEGORICAL),"DType(name='int64', element_type=<ElementType....",True,True,,0.0,0.0,.//categories/unique.hashed_url.parquet,0.0,489302.0,hashed_url,489303.0,512.0,0,
4,server_timestamp_epoch_ms_list,(Tags.CONTINUOUS),"DType(name='float64', element_type=<ElementTyp...",True,True,,,,,,,,,,0,
5,embeddings,"(Tags.ITEM, Tags.EMBEDDING)","DType(name='float64', element_type=<ElementTyp...",True,True,,,,,,,,,,0,


In [76]:
inputs = mm.sample_batch(loader, batch_size=10, include_targets=False, prepare_features=True)
input_batch = input_block(inputs)

In [77]:
input_batch.shape

TensorShape([10, None, 233])

In [78]:
target = 'hashed_url_list'

In [79]:
dmodel = 128
mlp_block = mm.MLPBlock(
    [128, dmodel],
    activation='relu',
    no_activation_last_layer=True,
)
transformer_block = mm.XLNetBlock(d_model=dmodel, n_head=4, n_layer=2)
model = mm.Model(
    input_block,
    mlp_block,
    transformer_block,
    mm.CategoricalOutput(
        train_processed.schema.select_by_name(target),
        default_loss="categorical_crossentropy",
    ),
)

In [80]:
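model.compile(run_eagerly=False, optimizer='adam', loss="categorical_crossentropy")
# `schema` is not defined earlier in this notebook; passing the loader's output schema is an assumption.
model.fit(
    loader,
    batch_size=64,
    epochs=5,
    pre=mm.SequenceMaskRandom(
        schema=loader.output_schema, target=target, masking_prob=0.3, transformer=transformer_block
    ),
)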

2023-06-12 01:00:02.308265: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8700


Epoch 1/5


2023-06-12 01:00:24.143373: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: model_2/xl_net_block_2/sequential_block_21/replace_masked_embeddings_4/RaggedWhere/Assert/AssertGuard/branch_executed/_111


   996/188411 [..............................] - ETA: 2:21:59 - loss: 10.2552 - recall_at_10: 0.1281 - mrr_at_10: 0.0678 - ndcg_at_10: 0.0822 - map_at_10: 0.0678 - precision_at_10: 0.0128 - regularization_loss: 0.0000e+00 - loss_batch: 10.2335

KeyboardInterrupt: 
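
The `pre=mm.SequenceMaskRandom(...)` argument in the `fit` call hides random positions of the target sequence during training so that the model learns to predict them from the rest of the session. The sketch below is a simplified, plain-Python illustration of that idea with `masking_prob=0.3`. The `mask_sequence` helper, the `MASK` placeholder, and the rule of forcing one masked position when none is sampled are assumptions for illustration and are not taken from the library's implementation.

```python
import random

MASK = -1  # hypothetical placeholder id for a masked position


def mask_sequence(items, masking_prob, rng):
    """Hide each position with probability masking_prob; return inputs and targets."""
    masked = [rng.random() < masking_prob for _ in items]
    if not any(masked):
        # Assumption: guarantee at least one position to learn from.
        masked[rng.randrange(len(items))] = True
    inputs = [MASK if m else item for item, m in zip(items, masked)]
    targets = {pos: item for pos, (item, m) in enumerate(zip(items, masked)) if m}
    return inputs, targets


session = [435, 3, 35, 272, 167]  # made-up hashed_url ids
rng = random.Random(0)            # seeded so repeated runs give the same mask
inputs, targets = mask_sequence(session, masking_prob=0.3, rng=rng)
print(inputs, targets)
```

The `targets` dictionary records the original item at every hidden position, which is what the loss would be computed against in this simplified picture.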

In [None]:
    embeddings = np.load(npy_path)
    # second workflow that categorifies the embedding table data
    df = make_df({"string_id": np.random.choice(string_ids, 30)})
    graph2 = ["string_id"] >> cat_op
    train_res = Workflow(graph2).transform(Dataset(df, cpu=(cpu is not None)))

    data_loader = Loader(
        train_res,
        batch_size=1,
        transforms=[
            EmbeddingOperator(
                embeddings[:, 1:],
                id_lookup_table=embeddings[:, 0].astype(int),
                lookup_key="string_id",
            )
        ],
        shuffle=False,
        device=cpu,
    )
    origin_df = train_res.to_ddf().merge(emb_res.to_ddf(), on="string_id", how="left").compute()
    for idx, batch in enumerate(data_loader):
        batch
        b_df = batch[0].to_df()
        org_df = origin_df.iloc[idx]
        if not cpu:
            assert (b_df["string_id"].to_numpy() == org_df["string_id"].to_numpy()).all()
            assert (b_df["embeddings"].list.leaves == org_df["embeddings"].list.leaves).all()
        else:
            assert (b_df["string_id"].values == org_df["string_id"]).all()
            assert b_df["embeddings"].values[0] == org_df["embeddings"].tolist()

In [None]:
import cudf
import numpy as np

In [None]:
df = cudf.DataFrame(data={'id': [1,2,3], 'val': [0, np.nan, 10], 'another_col': ['a', 'b', 'c']})

In [None]:
df.val[df.val.isna()]

In [None]:
df.val[~df.val.isna()]

In [None]:
ds = nvt.Dataset(df)

out = ['val'] >> nvt.ops.Filter(f=lambda col: ~col.isna())

wf = nvt.Workflow(out)
wf.fit_transform(ds).compute()

In [None]:
out = ['id', 'val'] >> nvt.ops.Filter(f=lambda df: ~df['val'].isna())

wf = nvt.Workflow(out)
wf.fit_transform(ds).compute()

In [None]:
out = ['id', 'val'] >> nvt.ops.Filter(f=lambda df: ~df['val'].isna())

wf = nvt.Workflow(out + ['another_col'])
wf.fit_transform(ds).compute()

In [None]:
train.head()

In [None]:
skus.to_npy('embeddings.npy')

In [None]:
out = ['product_sku_hash', 'category_hash'] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemID()
out += ['description_vector'] >> nvt.ops.TagAsItemFeatures()
out += ['price_bucket'] >> nvt.ops.NormalizeMinMax()

wf = nvt.Workflow(out)
skus = wf.fit_transform(skus)

In [None]:
train.head()

In [None]:
skus.head()

In [111]:
df = cudf.DataFrame(data={'id': [1,2,3], 'label': [1,2,1]})
ds = nvt.Dataset(df)

out = ['label'] >> nvt.ops.AddMetadata(Tags.TARGET)

wf = nvt.Workflow(out + ['id'])

ds_out = wf.fit_transform(ds)

loader = Loader(
    ds_out,
    batch_size=1,
)

loader.peek()

To use synthetically generated data, uncomment the following cell:

In [19]:
%%bash

cd /workspace && pip install . 

Processing /workspace
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'


Building wheels for collected packages: merlin-models
  Building wheel for merlin-models (PEP 517): started
  Building wheel for merlin-models (PEP 517): finished with status 'done'
  Created wheel for merlin-models: filename=merlin_models-23.5.dev0+63.g6d83a8b1.dirty-py3-none-any.whl size=384113 sha256=743286edc1e20e76705ea7980a6cbe6f45accd6b753f72a0815da260a3102641
  Stored in directory: /tmp/pip-ephem-wheel-cache-myac_4jv/wheels/59/14/70/d94958f41745fe226f3bc60bb3cabbbc8a98e4d6679e91038a
Successfully built merlin-models


ERROR: transformers4rec 23.5.0+9.gf4946bfa requires torchmetrics>=0.10.0, which is not installed.


Installing collected packages: merlin-models
  Attempting uninstall: merlin-models
    Found existing installation: merlin-models 23.5.dev0+57.gad9fa2bb.dirty
    Uninstalling merlin-models-23.5.dev0+57.gad9fa2bb.dirty:
      Successfully uninstalled merlin-models-23.5.dev0+57.gad9fa2bb.dirty
Successfully installed merlin-models-23.5.dev0+63.g6d83a8b1.dirty


In [3]:
from merlin.datasets.synthetic import KNOWN_DATASETS

In [4]:
KNOWN_DATASETS

{'e-commerce': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/small'),
 'e-commerce-large': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/large'),
 'music-streaming': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/entertainment/music_streaming'),
 'social': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/social'),
 'testing': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/testing'),
 'sequence-testing': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/testing/sequence_testing'),
 'movielens-25m': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/entertainment/movielens/25m'),
 'movielens-1m': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/entertainment/movielens/1m'),
 'movielens-1m-raw-ratings': PosixPath('/usr/local/lib/python3.8/dist-packages/merlin/datasets/entertainment/movielens/1m-raw/ratings'),
 'movielens-100k': 

Unnamed: 0,session_id_hash,event_type,product_action,product_sku_hash,hashed_url,server_timestamp_epoch_ms
0,50,0,1,399,949,0.578108
1,8,0,3,505,818,0.471705
2,16,1,0,423,771,0.613716
3,87,1,2,791,49,0.503518
4,378,1,3,424,166,0.864198


Unnamed: 0,product_sku_hash,description_vector,category_hash,price_bucket
0,35,"[0.28320169822835645, 0.28876255653408484, 0.3...",114,0.148844
1,11,"[0.0498882587601161, 0.4050611778162572, 0.489...",78,0.983131
2,12,"[-0.20383781352758598, 0.2821339201063496, -0....",78,0.268082
3,14,"[0.08096254937074815, 0.5582722991824396, 0.22...",156,0.310764
4,49,"[-0.29878517778423397, -0.3313343019075635, -0...",121,0.097739
