In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# TensorFlow: Using Sparse Features with NVTabular

## Sparse Features

As the name indicates, sparse features are a way to represent data that is very sparse, meaning that most values are 0. TensorFlow provides `tf.sparse.SparseTensor` as an optimized representation of such data for deep learning training. This notebook shows how to use data containing sparse features with NVTabular and TensorFlow.
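As a minimal sketch of the idea, the NumPy snippet below stores a mostly-zero vector as positions, values and a shape, which are the same three pieces of information a `tf.sparse.SparseTensor` holds in its `indices`, `values` and `dense_shape` fields. The vector itself is made up for illustration and is not part of the dataset used later.

```python
import numpy as np

# A mostly-zero vector in dense form (illustrative values only).
dense = np.array([0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.5, 0.0])

# Sparse (coordinate) form: keep only the positions and values of the
# non-zero entries, plus the full shape so the vector can be rebuilt.
indices = np.flatnonzero(dense)
values = dense[indices]
dense_shape = dense.shape

# Rebuild the dense vector from the sparse pieces to confirm nothing was lost.
rebuilt = np.zeros(dense_shape)
rebuilt[indices] = values

print(indices.tolist(), values.tolist(), dense_shape)  # [2, 6] [3.0, 1.5] (8,)
```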

We are using a simple toy example to show the functionality of the workflow. This can scale to much larger examples.

### Example of Sparse Features

First, let's take a look at what sparse features are.<br><br>

Suppose the dataset comes from an e-commerce shop and we want to predict whether a customer will purchase a product.

In [2]:
import pandas as pd

df = pd.DataFrame(
    {"customer": ["a", "a", "b", "b", "c"], "product": [0, 1, 0, 2, 4], "purchase": [0, 1, 0, 0, 1]}
)
df.head()

Unnamed: 0,customer,product,purchase
0,a,0,0
1,a,1,1
2,b,0,0
3,b,2,0
4,c,4,1


So far, we have no sparse features in the dataset. Let's say that we want to add an input feature describing the customer's historical ratings over the last 90 days for every product in the catalog.

Customer a rated product 0 with 2.5, product 4 with 4.0 and product 5 with 5.0<br>
Customer b rated product 1 with 1.5, product 7 with 2.0, product 8 with 3.5 and product 9 with 4.0<br>
Customer c rated product 3 with 1.0 and product 6 with 4.0

This is a sparse representation.

In [3]:
customer_ratings = [
    {0: 2.5, 4: 4.0, 5: 5.0},
    {0: 2.5, 4: 4.0, 5: 5.0},
    {1: 1.5, 7: 2.0, 8: 3.5, 9: 4.0},
    {1: 1.5, 7: 2.0, 8: 3.5, 9: 4.0},
    {3: 1.0, 6: 4.0},
]
df["customer_ratings"] = customer_ratings
df.head()

Unnamed: 0,customer,product,purchase,customer_ratings
0,a,0,0,"{0: 2.5, 4: 4.0, 5: 5.0}"
1,a,1,1,"{0: 2.5, 4: 4.0, 5: 5.0}"
2,b,0,0,"{1: 1.5, 7: 2.0, 8: 3.5, 9: 4.0}"
3,b,2,0,"{1: 1.5, 7: 2.0, 8: 3.5, 9: 4.0}"
4,c,4,1,"{3: 1.0, 6: 4.0}"


We can convert the sparse representation to a dense one. For each example, we create a dense vector `(rating_1, ... rating_n)`. If a customer has not rated a product yet, we insert a 0 at that position.

In [4]:
import numpy as np

max_dense = 10


def sparse_to_dense(x):
    dense = np.zeros(max_dense)
    for ind, value in x.items():
        dense[ind] = value
    return dense


df["customer_ratings_dense"] = df["customer_ratings"].apply(sparse_to_dense)
df.head()

Unnamed: 0,customer,product,purchase,customer_ratings,customer_ratings_dense
0,a,0,0,"{0: 2.5, 4: 4.0, 5: 5.0}","[2.5, 0.0, 0.0, 0.0, 4.0, 5.0, 0.0, 0.0, 0.0, ..."
1,a,1,1,"{0: 2.5, 4: 4.0, 5: 5.0}","[2.5, 0.0, 0.0, 0.0, 4.0, 5.0, 0.0, 0.0, 0.0, ..."
2,b,0,0,"{1: 1.5, 7: 2.0, 8: 3.5, 9: 4.0}","[0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 3.5, ..."
3,b,2,0,"{1: 1.5, 7: 2.0, 8: 3.5, 9: 4.0}","[0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 3.5, ..."
4,c,4,1,"{3: 1.0, 6: 4.0}","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 0.0, ..."


We can see that the dense representation is not optimal. It requires much more disk space or memory, because every product gets a slot whether or not it was rated. If our product catalog has 10,000 products, then each vector has 10,000 elements, almost all of them 0. That would be a big overhead.
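To put a rough number on that overhead, the sketch below compares the bytes needed for customer a's three ratings in a dense 10,000-element vector with the bytes needed for an index array plus a value array. The dtypes (`float64` values, `int64` indices) are assumptions chosen to match NumPy defaults; other dtypes change the absolute numbers but not the conclusion.

```python
import numpy as np

catalog_size = 10_000  # number of products, as in the paragraph above

# Dense: one float64 slot per product, whether rated or not.
dense = np.zeros(catalog_size, dtype=np.float64)
dense[[0, 4, 5]] = [2.5, 4.0, 5.0]  # customer a's three ratings

# Sparse: only the rated product indices and their ratings.
sparse_index = np.array([0, 4, 5], dtype=np.int64)
sparse_values = np.array([2.5, 4.0, 5.0], dtype=np.float64)

dense_bytes = dense.nbytes
sparse_bytes = sparse_index.nbytes + sparse_values.nbytes
print(dense_bytes, sparse_bytes)  # 80000 48
```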

In [5]:
df = df.drop("customer_ratings_dense", axis=1)

We have seen an example of sparse features and why they should be stored in a sparse form. Next, we use sparse features with NVTabular and TensorFlow.

## Sparse Features with NVTabular and TensorFlow

NVTabular supports list columns in dataframes. We will transform the sparse feature into two columns:
1. a list of the product indices
2. a list of the product ratings

In [6]:
df["customer_ratings_index"] = df["customer_ratings"].apply(lambda x: list(x.keys()))
df["customer_ratings_values"] = df["customer_ratings"].apply(lambda x: list(x.values()))
df = df.drop("customer_ratings", axis=1)
df.head()

Unnamed: 0,customer,product,purchase,customer_ratings_index,customer_ratings_values
0,a,0,0,"[0, 4, 5]","[2.5, 4.0, 5.0]"
1,a,1,1,"[0, 4, 5]","[2.5, 4.0, 5.0]"
2,b,0,0,"[1, 7, 8, 9]","[1.5, 2.0, 3.5, 4.0]"
3,b,2,0,"[1, 7, 8, 9]","[1.5, 2.0, 3.5, 4.0]"
4,c,4,1,"[3, 6]","[1.0, 4.0]"


Our product IDs are already contiguous integers in increasing order (0, ... n). Otherwise, we could use `nvt.ops.Categorify` to convert the indices into the correct format.
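To illustrate what "the correct format" means, here is a plain-Python sketch that maps arbitrary product IDs to contiguous integers. The raw IDs are hypothetical, and this is not how `nvt.ops.Categorify` is implemented; its actual encoding may differ (for example, in which codes it reserves for missing or unseen values), so treat the exact numbers as illustrative.

```python
# Hypothetical raw product IDs that are neither contiguous nor zero-based.
raw_index_lists = [[1001, 2050, 77], [2050, 9], [77]]

# Assign each distinct ID the next free integer, in order of first appearance.
id_to_code = {}
for row in raw_index_lists:
    for product_id in row:
        id_to_code.setdefault(product_id, len(id_to_code))

# Rewrite every list using the contiguous codes.
encoded = [[id_to_code[p] for p in row] for row in raw_index_lists]
dense_dim = len(id_to_code)  # length a dense vector would need

print(encoded)    # [[0, 1, 2], [1, 3], [2]]
print(dense_dim)  # 4
```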

In [7]:
import os
import cudf
import tensorflow as tf
import nvtabular as nvt

# we can control how much memory to give tensorflow with this environment variable
# IMPORTANT: make sure you do this before you initialize TF's runtime, otherwise
# TF will have claimed all free GPU memory
os.environ["TF_MEMORY_ALLOCATION"] = "0.7"  # fraction of free memory
from nvtabular.loader.tensorflow import KerasSequenceLoader
from nvtabular.framework_utils.tensorflow import layers


We initialize the NVTabular dataloader `KerasSequenceLoader` from the dataframe. As we want to show the functionality of the sparse features, we will use only the columns `customer_ratings_index`, `customer_ratings_values` and `purchase`.

In [8]:
train_dataset_tf = KerasSequenceLoader(
    nvt.Dataset(cudf.from_pandas(df)),  # you could also use a glob pattern
    batch_size=5,
    label_names=["purchase"],
    cat_names=["customer_ratings_index"],
    cont_names=["customer_ratings_values"],
    shuffle=False,
    buffer_size=0.06,  # how many batches to load at once
    parts_per_chunk=1,
)

Let's take a look at the structure of a batch from the NVTabular data loader.

In [9]:
batch = next(iter(train_dataset_tf))


In [10]:
batch[0]["customer_ratings_index"]

(<tf.Tensor: shape=(16, 1), dtype=int64, numpy=
 array([[0],
        [4],
        [5],
        [0],
        [4],
        [5],
        [1],
        [7],
        [8],
        [9],
        [1],
        [7],
        [8],
        [9],
        [3],
        [6]])>,
 <tf.Tensor: shape=(5, 1), dtype=int32, numpy=
 array([[3],
        [3],
        [4],
        [4],
        [2]], dtype=int32)>)

The feature `customer_ratings_index` is represented as a tuple of two tensors. The first tensor contains the actual values and the second tensor contains the row lengths. As each row can contain a different number of elements in the list, we have a [RaggedTensor](https://www.tensorflow.org/api_docs/python/tf/RaggedTensor). Using two tensors to construct the RaggedTensor is an efficient representation.<br><br>
In our example,
* the first and second rows each have 3 elements
* the third and fourth rows each have 4 elements
* the fifth row has 2 elements

This is captured in the second tensor. <br><br>
The feature `customer_ratings_values` has the same structure.

In [11]:
batch[0]["customer_ratings_values"]

(<tf.Tensor: shape=(16, 1), dtype=float64, numpy=
 array([[2.5],
        [4. ],
        [5. ],
        [2.5],
        [4. ],
        [5. ],
        [1.5],
        [2. ],
        [3.5],
        [4. ],
        [1.5],
        [2. ],
        [3.5],
        [4. ],
        [1. ],
        [4. ]])>,
 <tf.Tensor: shape=(5, 1), dtype=int32, numpy=
 array([[3],
        [3],
        [4],
        [4],
        [2]], dtype=int32)>)
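The flat-values-plus-row-lengths layout in the two outputs above can be unpacked back into one list per row. The following is a NumPy sketch of that bookkeeping, with the arrays copied by hand from the `customer_ratings_index` output; it illustrates the layout and is not code taken from the data loader.

```python
import numpy as np

# Flat values and per-row lengths, copied from the batch output above.
flat_index = np.array([0, 4, 5, 0, 4, 5, 1, 7, 8, 9, 1, 7, 8, 9, 3, 6])
row_lengths = np.array([3, 3, 4, 4, 2])

# The running sum of the lengths gives the position where each new row starts.
split_points = np.cumsum(row_lengths)[:-1]
rows = [chunk.tolist() for chunk in np.split(flat_index, split_points)]

print(rows)
# [[0, 4, 5], [0, 4, 5], [1, 7, 8, 9], [1, 7, 8, 9], [3, 6]]
```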

NVTabular provides a custom TensorFlow layer, `nvtabular.framework_utils.tensorflow.layers.SparseTensor`, which can handle the sparse features.<br><br>
The `dense_dim` defines the length of the vector in a dense representation. In our example, it is `10` as the largest index is `9`.

In [12]:
x = layers.SparseTensor(dense_dim=10)(
    batch[0]["customer_ratings_index"][0],
    batch[0]["customer_ratings_index"][1],
    batch[0]["customer_ratings_values"][0],
)
x

<tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7f15106509a0>

We can convert the TensorFlow SparseTensor to a dense representation.

In [13]:
tf.sparse.to_dense(x)

<tf.Tensor: shape=(5, 10), dtype=float64, numpy=
array([[2.5, 0. , 0. , 0. , 4. , 5. , 0. , 0. , 0. , 0. ],
       [2.5, 0. , 0. , 0. , 4. , 5. , 0. , 0. , 0. , 0. ],
       [0. , 1.5, 0. , 0. , 0. , 0. , 0. , 2. , 3.5, 4. ],
       [0. , 1.5, 0. , 0. , 0. , 0. , 0. , 2. , 3.5, 4. ],
       [0. , 0. , 0. , 1. , 0. , 0. , 4. , 0. , 0. , 0. ]])>
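To make the conversion above concrete, here is a NumPy sketch of how flat indices, row lengths and values can be combined into (row, column) coordinates and then scattered into a dense matrix. It reproduces the dense output shown above, but it is only an illustration of the idea and not NVTabular's implementation of `layers.SparseTensor`.

```python
import numpy as np

# Arrays copied from the batch outputs earlier in the notebook.
flat_index = np.array([0, 4, 5, 0, 4, 5, 1, 7, 8, 9, 1, 7, 8, 9, 3, 6])
flat_values = np.array([2.5, 4.0, 5.0, 2.5, 4.0, 5.0, 1.5, 2.0, 3.5, 4.0,
                        1.5, 2.0, 3.5, 4.0, 1.0, 4.0])
row_lengths = np.array([3, 3, 4, 4, 2])
dense_dim = 10

# Each flat entry belongs to a row: repeat row number i row_lengths[i] times.
row_ids = np.repeat(np.arange(len(row_lengths)), row_lengths)

# (row, column) pairs are the coordinates of the non-zero entries.
coords = np.stack([row_ids, flat_index], axis=1)

# Scatter the values into a zero matrix to get the dense view.
dense = np.zeros((len(row_lengths), dense_dim))
dense[coords[:, 0], coords[:, 1]] = flat_values

print(dense[4].tolist())
# [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]
```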

### Building a Neural Network with tf.keras

Let's define our neural network architecture. We use `tf.keras.Input` to define our input layers.

In [14]:
inputs = {}
col = "customer_ratings"

inputs[col + "_values"] = (
    tf.keras.Input(name=col + "_values_values", dtype=tf.float32, shape=(1,)),
    tf.keras.Input(name=col + "_values_index", dtype=tf.int64, shape=(1,)),
)
inputs[col + "_index"] = (
    tf.keras.Input(name=col + "_index_values", dtype=tf.int64, shape=(1,)),
    tf.keras.Input(name=col + "_index_index", dtype=tf.int64, shape=(1,)),
)

We define our `layers.SparseTensor` layer.

In [15]:
x_sparse = layers.SparseTensor(dense_dim=10)(
    inputs[col + "_index"][0], inputs[col + "_index"][1], inputs[col + "_values"][0]
)

We can add a fully connected (`Dense`) layer.

In [16]:
x = tf.keras.layers.Dense(1)(x_sparse)
x = tf.keras.activations.sigmoid(x)
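A `Dense(1)` layer followed by a sigmoid computes `sigmoid(x · w + b)`. The NumPy sketch below, with randomly drawn hypothetical weights and a made-up bias, shows why a sparse input is cheap here: only the rated products contribute to the sum, so the dense and sparse paths give the same logit.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=10)  # hypothetical weights of a Dense(1) layer over 10 inputs
bias = 0.1                     # hypothetical bias

# Customer c's ratings in sparse form: product 3 -> 1.0, product 6 -> 4.0.
index = np.array([3, 6])
values = np.array([1.0, 4.0])

# Dense path: multiply the full 10-element vector by the weights.
dense_row = np.zeros(10)
dense_row[index] = values
logit_dense = dense_row @ weights + bias

# Sparse path: only the rated products contribute to the sum.
logit_sparse = values @ weights[index] + bias

# Sigmoid squashes the logit into a purchase probability between 0 and 1.
probability = 1.0 / (1.0 + np.exp(-logit_sparse))
print(np.isclose(logit_dense, logit_sparse), probability)
```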

We compile our model.

In [17]:
model = tf.keras.Model(inputs=inputs, outputs=x)
model.compile("sgd", "binary_crossentropy")
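For reference, the binary cross-entropy loss named in `compile` averages `-(y * log(p) + (1 - y) * log(1 - p))` over the batch. The following NumPy version is a sketch of that formula, not Keras' own code; the clipping constant `eps` is an assumption added here to avoid `log(0)`. The labels are the `purchase` column from the toy dataframe.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps is an assumed clipping constant."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(per_example))


labels = np.array([0, 1, 0, 0, 1], dtype=np.float64)  # the purchase column

# A model that always predicts 0.5 scores log(2), about 0.693.
uninformed_loss = binary_crossentropy(labels, np.full(5, 0.5))

# Confident, correct predictions drive the loss towards 0.
confident_loss = binary_crossentropy(labels, np.array([0.01, 0.99, 0.01, 0.01, 0.99]))

print(round(uninformed_loss, 3))  # 0.693
```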

We can train the model with `.fit` and the NVTabular data loader.

In [18]:
model.fit(train_dataset_tf, epochs=100)


<tensorflow.python.keras.callbacks.History at 0x7f15104c0fd0>

This is a very small toy example that shows how to use sparse features with NVTabular and TensorFlow. It has only 5 examples, so the batches are tiny, but the same approach can scale to much larger datasets.