This example notebook shows how to train a simple regression model on data stored in TileDB.
We use TileDB as the storage engine for our training data and labels.
We will use the MovieLens 100K public dataset, available [here](https://grouplens.org/datasets/movielens/100k/),
which contains 100,000 ratings by 943 users on 1682 items.
We will then use the TileDB support for the PyTorch sparse DataLoader API to train the model.
First, let's import what we need and download our data. We will transform the data to a sparse format
to show how TileDB ingests sparse datasets and serves them to the PyTorch framework. Sparse
datasets are common in applications such as recommender systems. For example, by designing
a Factorization Machine ([FM model](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf)), one could take advantage of data sparsity and build a refined recommendation system.
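
As a rough illustration of why sparsity matters for such a model, the sketch below evaluates the second-order Factorization Machine prediction from the paper linked above using NumPy only. It is not part of this notebook's pipeline: the function names `fm_predict` and `fm_predict_naive`, the sizes, and the random parameters are assumptions made for the example.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction for one feature vector.

    x: (n,) features, w0: scalar bias, w: (n,) linear weights,
    V: (n, k) latent factors. The pairwise term uses the O(n*k) reformulation.
    """
    xv = x @ V                    # (k,) sum_i v_if * x_i
    x2v2 = (x ** 2) @ (V ** 2)    # (k,) sum_i v_if^2 * x_i^2
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return w0 + w @ x + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same prediction with the explicit double sum over feature pairs."""
    n = x.shape[0]
    pairwise = sum(
        (V[i] @ V[j]) * x[i] * x[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return w0 + w @ x + pairwise

# Hypothetical toy sizes: 8 one-hot columns, 3 latent factors.
rng = np.random.default_rng(0)
n, k = 8, 3
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))

# A "two-hot" row: one active user column and one active item column.
x = np.zeros(n)
x[1] = 1.0
x[5] = 1.0

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
print(fast, slow)
```

For a row with only two active columns, the prediction reduces to the bias, two linear weights, and a single dot product between the two latent vectors, which is why storing only the non-zero entries loses nothing.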

In [1]:
import os
import urllib.request
import numpy as np
import tiledb
import torch
import pandas as pd

## Dataset
Download the MovieLens dataset.

In [2]:
data_home = os.path.join(os.path.pardir, 'data')
data_dir = os.path.join(data_home, 'readers', 'pytorch', 'sparse')
os.makedirs(data_dir, exist_ok=True)

filename = os.path.join(data_home, "movielens-ml-100k-u.data")
if not os.path.exists(filename):
    url = "https://files.grouplens.org/datasets/movielens/ml-100k/u.data"
    urllib.request.urlretrieve(url, filename)

Use pandas to display the dataset in a readable form.

In [3]:
data = pd.read_csv(filename, sep="\t", usecols=[0,1,2], names=["user_id", "item_id", "rating"])
display(data.head())
display(data.shape)

|   | user_id | item_id | rating |
|---|---------|---------|--------|
| 0 | 196 | 242 | 3 |
| 1 | 186 | 302 | 3 |
| 2 | 22 | 377 | 1 |
| 3 | 244 | 51 | 2 |
| 4 | 166 | 346 | 1 |


(100000, 3)

## Data Analysis / Sparsity
Before we apply the one-hot transformation let’s check the memory usage of our original data frame.

In [4]:
BYTES_TO_MB_DIV = 0.000001
def print_memory_usage_of_data_frame(df):
    mem = round(df.memory_usage().sum() * BYTES_TO_MB_DIV, 3)
    print("Memory usage is " + str(mem) + " MB")

print_memory_usage_of_data_frame(data)

Memory usage is 2.4 MB


## Data transformation

Now, let’s apply the transformation and check the memory usage of the transformed data frame.

In [5]:
data_one_hot = pd.get_dummies(data, columns=['user_id', 'item_id'])
display(data_one_hot.head())
display(data_one_hot.shape)
print_memory_usage_of_data_frame(data_one_hot)

Unnamed: 0,rating,user_id_1,user_id_2,user_id_3,user_id_4,user_id_5,user_id_6,user_id_7,user_id_8,user_id_9,...,item_id_1673,item_id_1674,item_id_1675,item_id_1676,item_id_1677,item_id_1678,item_id_1679,item_id_1680,item_id_1681,item_id_1682
0,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


(100000, 2626)

Memory usage is 263.3 MB
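
The two memory figures above can be reproduced with simple arithmetic, which also shows how sparse the one-hot frame is. This is a back-of-the-envelope sketch: it assumes the original columns are stored as 8-byte integers and each one-hot indicator column takes 1 byte per row, which is consistent with the reported numbers but is not checked against the actual dtypes here.

```python
n_ratings, n_users, n_items = 100_000, 943, 1682
n_onehot_cols = n_users + n_items  # 2625 indicator columns

# Original frame: three 8-byte integer columns (index overhead is negligible).
original_mb = n_ratings * 3 * 8 / 1e6

# One-hot frame: one 8-byte rating column plus 2625 one-byte indicators (assumed).
one_hot_mb = n_ratings * (8 + n_onehot_cols * 1) / 1e6

# Each rating activates exactly one user column and one item column.
nonzeros = 2 * n_ratings
density = nonzeros / (n_ratings * n_onehot_cols)

print(f"original: {original_mb:.1f} MB")
print(f"one-hot:  {one_hot_mb:.1f} MB")
print(f"non-zero fraction: {density:.4%}")
```

Under these assumptions fewer than 0.1% of the cells in the one-hot frame are non-zero, so a sparse representation only needs to keep 200,000 entries.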


We now slice the dataset. `user_movie` will be our `x_train` data, transformed with one-hot encoding.
We therefore expect one row per rating, with one binary column per user plus one binary column
per item (in our case, movies). This gives a shape of (100000, 2625), since 943 + 1682 = 2625.

The target data will be `ratings`, which holds the rating values as a one-dimensional series of shape (100000,); it is reshaped to (100000, 1) batch by batch during training.
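
The slicing step can be tried on a tiny, made-up frame first. The `toy` data below is hypothetical and only mirrors the column names of the real dataset; the point is to check the resulting shapes and the column order.

```python
import pandas as pd

# Hypothetical toy ratings: 3 distinct users, 3 distinct items, 4 ratings.
toy = pd.DataFrame(
    {"user_id": [1, 1, 2, 3], "item_id": [10, 20, 10, 30], "rating": [5, 3, 4, 1]}
)

toy_one_hot = pd.get_dummies(toy, columns=["user_id", "item_id"])

# Same slicing as in the notebook: every column except the target.
features = toy_one_hot[toy_one_hot.columns.difference(["rating"])]
target = toy["rating"]

# Each row should contain exactly two active indicator columns.
row_sums = features.to_numpy().astype("float32").sum(axis=1)

print(features.shape, target.shape)
print(list(features.columns))
print(row_sums)
```

Note that `Index.difference` returns the labels sorted, so the `item_id_*` columns end up before the `user_id_*` columns. That is harmless for a linear model, but `toy_one_hot.drop(columns=["rating"])` would be an alternative if the original column order matters.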

In [6]:
user_movie = data_one_hot[data_one_hot.columns.difference(['rating'])]
ratings = data['rating']

## Data Ingestion
Next, we ingest `user_movie` into a sparse TileDB array as our training data and `ratings` into a dense TileDB
array as our target data. Although TileDB is flexible in defining a schema
(multiple dimensions, multiple attributes, compression, etc.),
we choose a simple one here: for a NumPy array with D dimensions we create a TileDB array (sparse or dense)
with the same number of dimensions and a single attribute of type NumPy float32.

In [7]:
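def get_schema(data: np.ndarray, batch_size: int, sparse: bool) -> tiledb.ArraySchema:
    # One dimension per NumPy axis; the first axis is tiled by batch size,
    # every other axis is covered by a single tile spanning its full extent.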
    dims = [
        tiledb.Dim(
            name="dim_" + str(dim),
            domain=(0, data.shape[dim] - 1),
            tile=data.shape[dim] if dim > 0 else batch_size,
            dtype=np.int32,
        )
        for dim in range(data.ndim)
    ]

    # TileDB schema
    schema = tiledb.ArraySchema(
        domain=tiledb.Domain(*dims),
        sparse=sparse,
        attrs=[tiledb.Attr(name="features", dtype=np.float32)],
    )

    return schema

# Ingestion function: creates the array on disk and writes `data` into it.
def ingest_in_tiledb(data: np.ndarray, batch_size: int, uri: str, sparse: bool):
    schema = get_schema(data, batch_size, sparse)

    # Create the (empty) array on disk.
    tiledb.Array.create(uri, schema)

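    # Ingest: for a sparse array, write only the coordinates of the non-zero cells;
    # for a dense array, write every cell.
    with tiledb.open(uri, "w") as tiledb_array:
        idx = np.nonzero(data) if sparse else slice(None)
        tiledb_array[idx] = {"features": data[idx]}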
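
The sparse branch of `ingest_in_tiledb` relies on `np.nonzero` to turn a dense matrix into coordinates plus values. The NumPy-only sketch below shows what that produces on a small, made-up matrix; it does not touch TileDB, and the matrix is an assumption chosen for illustration.

```python
import numpy as np

# Hypothetical 3 x 4 matrix with one non-zero entry per row.
dense = np.array(
    [[0, 1, 0, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]],
    dtype=np.float32,
)

# Tuple of index arrays, one per dimension: (row indices, column indices).
coords = np.nonzero(dense)

# Values at those coordinates; this is what gets written as the attribute.
values = dense[coords]

# The dense matrix can be rebuilt from coordinates and values alone.
rebuilt = np.zeros_like(dense)
rebuilt[coords] = values

print(coords)
print(values)
```

Only three coordinate pairs and three values are needed to describe the twelve cells, and the saving grows with the number of zero cells.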

We ingest `user_movie` as a sparse TileDB array and `ratings` as a dense TileDB array.

In [8]:
# Ingest the one-hot user/movie features (the URI keeps the name 'training_images')
training_images = os.path.join(data_dir, 'training_images')
if not os.path.exists(training_images):
    ingest_in_tiledb(data=user_movie.to_numpy(), batch_size=64, uri=training_images, sparse=True)

# Ingest the ratings as labels
training_labels = os.path.join(data_dir, 'training_labels')
if not os.path.exists(training_labels):
    ingest_in_tiledb(data=ratings.to_numpy(), batch_size=64, uri=training_labels, sparse=False)

We can now explore our TileDB arrays and check their structure.

In [9]:
user_movie_array = tiledb.open(training_images)
ratings_array = tiledb.open(training_labels)

print(user_movie_array.schema)
print(ratings_array.schema)

ArraySchema(
  domain=Domain(*[
    Dim(name='dim_0', domain=(0, 99999), tile=64, dtype='int32'),
    Dim(name='dim_1', domain=(0, 2624), tile=2625, dtype='int32'),
  ]),
  attrs=[
    Attr(name='features', dtype='float32', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=False,
)

ArraySchema(
  domain=Domain(*[
    Dim(name='dim_0', domain=(0, 99999), tile=64, dtype='int32'),
  ]),
  attrs=[
    Attr(name='features', dtype='float32', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=False,
)



## Model training
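Although we used Factorization Machines as a reference model to motivate
our training set, here we train a simple single-layer linear model in PyTorch
for demonstration purposes only. The class below is named `LogisticRegression`, but since it applies
no sigmoid and is trained with a mean squared error loss, it behaves as a linear regression on the ratings.
Any other model can be trained on the same data.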

### Declare Model Class

In [10]:
import torch.nn as nn
import torch.optim as optim

class LogisticRegression(nn.Module):
    def __init__(self, shape):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(shape[0], shape[1])

    def forward(self, x):
        outputs = self.linear(x)
        return outputs
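
To see what this single linear layer computes on one-hot rows, here is a NumPy-only sketch of the same forward pass and of a mean squared error. The sizes, weights, and targets are made up; `W` simply follows the `(out_features, in_features)` layout that `nn.Linear` uses for its weight.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 6                          # stands in for the 2625 real columns
W = rng.normal(size=(1, n_features))    # (out_features, in_features)
b = np.array([0.5])

# Two hypothetical ratings, each with two active indicator columns.
X = np.zeros((2, n_features), dtype=np.float32)
X[0, [0, 3]] = 1.0
X[1, [1, 5]] = 1.0
y = np.array([[4.0], [2.0]])            # targets shaped (batch, 1)

pred = X @ W.T + b                      # forward pass, shape (2, 1)
mse = np.mean((pred - y) ** 2)          # mean squared error over the batch

print(pred.ravel(), mse)
```

Each prediction is just the bias plus the two weights selected by the active columns, so the model learns one offset per user and one per movie.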

### Train the Model


In [11]:
from tiledb.ml.readers.pytorch import PyTorchTileDBDataLoader, ArrayParams

ctx = tiledb.Ctx({'py.init_buffer_bytes': 1024**2})
with tiledb.open(training_images, ctx=ctx) as x, tiledb.open(training_labels, ctx=ctx) as y:
    train_loader = PyTorchTileDBDataLoader(ArrayParams(x), ArrayParams(y), batch_size=32)
    # Number of ratings x (users + movies); kept for reference, not used below
    datashape_x = (100000, 2625)

    logre = LogisticRegression(shape=(2625, 1))
    criterion = nn.MSELoss()
    optimizer = optim.SGD(logre.parameters(), lr=0.01, momentum=0.5)

    for epoch in range(1, 3):
        logre.train()
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = logre(inputs.to(torch.float))
            loss = criterion(outputs, labels.type(torch.FloatTensor).view(-1,1))
            loss.backward()
            optimizer.step()
            if batch_idx % 500 == 0:
                print('Train Epoch: {} Batch: {} Loss: {:.6f}'.format(epoch, batch_idx, loss.item()))



Train Epoch: 1 Batch: 0 Loss: 11.926880
Train Epoch: 1 Batch: 500 Loss: 1.468971
Train Epoch: 1 Batch: 1000 Loss: 0.879726
Train Epoch: 1 Batch: 1500 Loss: 1.019065
Train Epoch: 1 Batch: 2000 Loss: 1.534842
Train Epoch: 1 Batch: 2500 Loss: 1.282928
Train Epoch: 1 Batch: 3000 Loss: 1.501987
Train Epoch: 2 Batch: 0 Loss: 1.401106
Train Epoch: 2 Batch: 500 Loss: 1.085804
Train Epoch: 2 Batch: 1000 Loss: 0.793254
Train Epoch: 2 Batch: 1500 Loss: 0.954215
Train Epoch: 2 Batch: 2000 Loss: 1.347507
Train Epoch: 2 Batch: 2500 Loss: 1.206381
Train Epoch: 2 Batch: 3000 Loss: 1.451027
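
The batch indices in the log follow from the dataset size and the batch size. The sketch below assumes the loader yields every sample once per epoch in batches of 32, with a possibly smaller last batch; this matches the log above but is an assumption about the loader's behaviour, not something verified here.

```python
import math

n_samples, batch_size, log_every = 100_000, 32, 500

# Batches per epoch, counting a final partial batch if there is one.
batches_per_epoch = math.ceil(n_samples / batch_size)

# Batch indices at which the training loop prints the loss.
logged_batches = list(range(0, batches_per_epoch, log_every))

print(batches_per_epoch)
print(logged_batches)
```

With 3125 batches per epoch, the last logged index is 3000, as in the output above.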
