Because we're using TensorFlow, we can make use of the tf.data.Dataset object. First, we'll load our NumPy binaries from file:

In [3]:

import numpy as np

with open('movie-xids.npy', 'rb') as f:
    Xids = np.load(f, allow_pickle=True)
with open('movie-xmask.npy', 'rb') as f:
    Xmask = np.load(f, allow_pickle=True)
with open('movie-labels.npy', 'rb') as f:
    labels = np.load(f, allow_pickle=True)

In [1]:
import tensorflow as tf

In [4]:
Xids.shape

(156060, 512)

In [5]:
labels

array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.]])
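Before building the dataset, it can be worth confirming that the three arrays line up. The sketch below is one possible check, not part of the original notebook: `check_alignment` is a hypothetical helper, and the small `demo_*` arrays are stand-ins with made-up values rather than the real movie data.

```python
import numpy as np

def check_alignment(ids, mask, labels):
    """Return True if the arrays share a sample count and labels are one-hot."""
    same_length = ids.shape[0] == mask.shape[0] == labels.shape[0]
    same_width = ids.shape == mask.shape
    # every one-hot row should sum to exactly 1
    one_hot = bool(np.all(labels.sum(axis=1) == 1))
    return same_length and same_width and one_hot

# Stand-in arrays: 3 samples, sequence length 4, 5 classes (illustrative only)
demo_ids = np.array([[101, 7, 8, 102], [101, 9, 102, 0], [101, 5, 6, 102]])
demo_mask = np.array([[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]])
demo_labels = np.eye(5)[[1, 2, 3]]

aligned = check_alignment(demo_ids, demo_mask, demo_labels)
print(aligned)
```

The same call on the real arrays would be `check_alignment(Xids, Xmask, labels)`.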

We can take these three arrays and create a TF dataset object with them using from_tensor_slices like so:

In [6]:
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

In [7]:
dataset.take(1)

<TakeDataset element_spec=(TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(5,), dtype=tf.float64, name=None))>
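Note that `dataset.take(1)` only displays the element spec. To see what from_tensor_slices actually produces, we can iterate over the dataset. The following is a minimal sketch using small stand-in arrays (made-up values, not the real data); it shows that each element is one row from each array, sliced along the first axis.

```python
import numpy as np
import tensorflow as tf

# Stand-in arrays (illustrative values): 3 samples, sequence length 4, 5 classes
demo_ids = np.array([[101, 7, 8, 102], [101, 9, 102, 0], [101, 5, 6, 102]])
demo_mask = np.array([[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]])
demo_labels = np.eye(5)[[1, 2, 3]]

demo_ds = tf.data.Dataset.from_tensor_slices((demo_ids, demo_mask, demo_labels))

# Pull the first element: a three-item tuple of tensors
first_ids, first_mask, first_label = next(iter(demo_ds))
n_elements = len(list(demo_ds))

print(first_ids.numpy(), first_mask.numpy(), first_label.numpy())
print(n_elements)
```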

Each sample in our dataset is a tuple containing a single Xids, Xmask, and labels tensor. However, when feeding data into our model we need a two-item tuple in the format (inputs, labels). We have two input tensors, so we pass them as a dictionary:

    {
    'input_ids': Xids tensor,
    'attention_mask': Xmask tensor
    }
To rearrange the dataset format we can map a function that modifies the format like so:

In [8]:
def map_func(input_ids, masks, labels):
    # convert the three-item tuple into a two-item tuple where the input item is a dictionary
    return {'input_ids':input_ids, 'attention_mask':masks}, labels

In [9]:
# map method to apply the transformation

dataset = dataset.map(map_func)

In [10]:
dataset

<MapDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int64, name=None)}, TensorSpec(shape=(5,), dtype=tf.float64, name=None))>
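To make the new structure concrete, here is a small sketch that applies the same map_func to stand-in arrays (made-up values) and inspects one element. It assumes the dictionary keys will later match the names of the model's input layers, which is how Keras usually pairs dictionary inputs with a multi-input model.

```python
import numpy as np
import tensorflow as tf

def map_func(input_ids, masks, labels):
    # same transformation as above: (ids, mask, labels) -> ({...}, labels)
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

# Stand-in arrays (illustrative values): 3 samples, sequence length 4, 5 classes
demo_ids = np.array([[101, 7, 8, 102], [101, 9, 102, 0], [101, 5, 6, 102]])
demo_mask = np.array([[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]])
demo_labels = np.eye(5)[[1, 2, 3]]

demo_ds = tf.data.Dataset.from_tensor_slices((demo_ids, demo_mask, demo_labels))
demo_ds = demo_ds.map(map_func)

# Each element is now a two-item tuple: (dict of inputs, labels)
inputs, label = next(iter(demo_ds))
print(sorted(inputs.keys()))
print(label.numpy())
```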

In [11]:
# Batch the samples in groups of 16 and drop any leftover samples that don't fill a complete batch

In [12]:
batch_size = 16

dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)

dataset.take(1)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>
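The effect of drop_remainder is easiest to see on a toy dataset. This sketch uses `tf.data.Dataset.range(10)` as a stand-in for the real data and a batch size of 4 (both chosen only for illustration): with drop_remainder=True the final, incomplete batch is discarded.

```python
import tensorflow as tf

demo = tf.data.Dataset.range(10)  # elements 0..9

# Without drop_remainder the last batch holds the 2 leftover elements
all_batches = [b.numpy().tolist() for b in demo.batch(4)]

# With drop_remainder=True only complete batches of 4 are kept
kept_batches = [b.numpy().tolist() for b in demo.batch(4, drop_remainder=True)]

print(all_batches)
print(kept_batches)
```

Dropping the remainder is what gives every batch the fixed shape (16, 512) seen in the element spec above.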

Now our dataset samples are organized into batches of 16. The final step is to split our data into training and validation sets. For this we use the take and skip methods, creating a 90-10 split.

In [13]:
split = 0.9

# calculate how many batches must be taken to create a 90% training set
size = int((Xids.shape[0] / batch_size) * split)

size

8778
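The arithmetic behind this number can be worked through in plain Python. The sketch below uses the sample count from Xids.shape (156060) and the batch size of 16; the variable names are only for illustration.

```python
n_samples = 156060   # rows in Xids, from Xids.shape
batch_size = 16
split = 0.9

# Complete batches that survive drop_remainder=True, and samples left over
full_batches = n_samples // batch_size
dropped_samples = n_samples % batch_size

# Same calculation as in the cell above
train_batches = int((n_samples / batch_size) * split)

# Whatever is not taken for training is left for validation
val_batches = full_batches - train_batches

print(full_batches, dropped_samples, train_batches, val_batches)
```

So the batched dataset holds 9753 complete batches (12 samples are dropped), of which 8778 go to training and the remaining 975 to validation.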

In [14]:
train_ds = dataset.take(size)
val_ds = dataset.skip(size)  # skip the first 8,778 batches; the rest form the validation set

# free up memory
del dataset
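One thing worth checking with a take/skip split is how it interacts with shuffle. By default, shuffle reorders the data on every pass (reshuffle_each_iteration=True), so the train and validation subsets are each drawn from a different ordering and may share samples. The sketch below is a toy illustration on `tf.data.Dataset.range(100)`, not the notebook's data; `split_overlap` is a hypothetical helper, and the seed value is arbitrary.

```python
import tensorflow as tf

def split_overlap(reshuffle):
    """Count elements that land in both the take() and skip() subsets."""
    ds = tf.data.Dataset.range(100).shuffle(
        100, seed=42, reshuffle_each_iteration=reshuffle
    )
    train = {int(x) for x in ds.take(90).as_numpy_iterator()}
    val = {int(x) for x in ds.skip(90).as_numpy_iterator()}
    return len(train & val)

# With a fixed order, the two subsets are disjoint
overlap_fixed = split_overlap(False)

# With reshuffling, the subsets come from different orderings and can overlap
overlap_reshuffled = split_overlap(True)

print(overlap_fixed, overlap_reshuffled)
```

If a strictly disjoint split matters, passing reshuffle_each_iteration=False to shuffle before calling take and skip is one way to get it.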

Our two datasets are fully prepared for our model inputs. Now, we can save both to file using tf.data.experimental.save.

In [15]:
tf.data.experimental.save(train_ds, 'train')
tf.data.experimental.save(val_ds, 'val')

Instructions for updating:
Use `tf.data.Dataset.save(...)` instead.

In [16]:
train_ds.element_spec

({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None),
  'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)},
 TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))

In [17]:
val_ds.element_spec == train_ds.element_spec

True

In [18]:
ds = tf.data.Dataset.load('train', element_spec=train_ds.element_spec)

In [19]:
ds

<_LoadDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>
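The deprecation message above points to the `tf.data.Dataset.save` method as the replacement for tf.data.experimental.save. A minimal round-trip sketch, assuming a TensorFlow version that provides both `Dataset.save` and `Dataset.load`, and using a toy dataset and a temporary directory in place of the real 'train' and 'val' paths:

```python
import os
import tempfile

import tensorflow as tf

# Toy stand-in for the batched dataset: two batches of 4 integers
demo = tf.data.Dataset.range(8).batch(4, drop_remainder=True)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_ds')

    # Method form of tf.data.experimental.save
    demo.save(path)

    # The element spec is stored with the data, so it can be omitted here
    restored = tf.data.Dataset.load(path)
    restored_spec = restored.element_spec

    # Read the batches back while the files still exist
    restored_batches = sorted(b.numpy().tolist() for b in restored)

print(restored_spec)
print(restored_batches)
```

The batches are sorted here only so the comparison does not depend on the order in which saved shards are read back.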