# Datasets
Datasets in TensorFlow (`tf.data.Dataset`) represent a sequence of elements and are used to feed data into a model for training or inference. They can be created from a variety of sources, such as NumPy arrays, Python generators, CSV files, and TFRecord files, and then transformed with methods such as `map`, `batch`, `shuffle`, and `repeat` to build an input pipeline. To consume a dataset, iterate over it with a `for` loop or create an iterator with Python's built-in `iter()`.
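
The `for` loop form appears later in this notebook; as a minimal sketch of the iterator form, the snippet below pulls elements one at a time with `iter()` and `next()`. The variable names are only illustrative.

```python
import tensorflow as tf

# A small dataset with the elements 0, 1, 2
dataset = tf.data.Dataset.range(3)

# iter() returns a Python iterator over the dataset's elements
iterator = iter(dataset)

# next() pulls one element at a time; each element is a scalar tensor
first = next(iterator).numpy()
second = next(iterator).numpy()
print(first, second)

# Calling iter() again starts a fresh pass over the data
restarted = next(iter(dataset)).numpy()
print(restarted)
```

This is handy when you only want to peek at one or two elements without writing a loop.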

## Creating Datasets
You can create a dataset either from a data source, such as a tensor or a CSV file, or by transforming an existing dataset. Here are some examples:

In [90]:
import tensorflow as tf
from tensorflow import keras

# create a dataset from a tensor
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
list(dataset.as_numpy_iterator())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [91]:
# create a dataset from a numpy array
import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
dataset = tf.data.Dataset.from_tensor_slices(data)
list(dataset.as_numpy_iterator())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [92]:
dataset = tf.data.Dataset.range(12)
list(dataset.as_numpy_iterator())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [93]:
# Once you have obtained a dataset, you can iterate over it using a for loop
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for item in dataset:
    print(item.numpy())

1
2
3
4
5
6
7
8
9
10


In [94]:
# A more complex example of features and labels
features = tf.constant([[1, 2], [3, 4], [5, 6]])
labels = tf.constant(['cat', 'dog', 'fox'])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
list(dataset.as_numpy_iterator())

[(array([1, 2]), b'cat'), (array([3, 4]), b'dog'), (array([5, 6]), b'fox')]

In [95]:
import pandas as pd

# create a dataset from a CSV file
titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")

# load as dataframe
df = pd.read_csv(titanic_file)
df.head()

# create a dataset from a pandas dataframe
dataset = tf.data.Dataset.from_tensor_slices(dict(df))
for line in dataset.take(1):
    for key, value in line.items():
        print(f"{key:20s}: {value}")

survived            : 0
sex                 : b'male'
age                 : 22.0
n_siblings_spouses  : 1
parch               : 0
fare                : 7.25
class               : b'Third'
deck                : b'unknown'
embark_town         : b'Southampton'
alone               : b'n'
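
The cell above goes through pandas, which loads the whole file into memory first. As a rough sketch of a more direct route, `tf.data.experimental.make_csv_dataset` reads a CSV file in batches. The tiny file written here, its path, and its three columns are made up for illustration and only mimic a few of the Titanic columns.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Write a tiny, hypothetical CSV file so the example is self-contained
csv_path = os.path.join(tempfile.mkdtemp(), "tiny.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["survived", "sex", "age"])
    writer.writerow([0, "male", 22.0])
    writer.writerow([1, "female", 38.0])
    writer.writerow([1, "female", 26.0])

# Each element is a (features, label) pair; features is a dict of column batches
csv_ds = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=2,
    label_name="survived",  # column to split off as the label
    num_epochs=1,           # read the file once instead of repeating forever
    shuffle=False)          # keep file order so the output is predictable

first_features, first_labels = next(iter(csv_ds))
for key, value in first_features.items():
    print(f"{key:10s}: {value.numpy()}")
print(f"labels    : {first_labels.numpy()}")
```

Note that, unlike `from_tensor_slices(dict(df))`, this dataset is already batched.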


In [96]:
# bring in the mnist dataset. This is NOT a tf.data.Dataset object though
mnist = keras.datasets.mnist

# load the mnist dataset and extract them as numpy arrays
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# now if you wanted to, you could create a dataset from the numpy arrays
# for preprocessing
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
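
Python generators were mentioned as a source at the top but are not shown above. A minimal sketch with `tf.data.Dataset.from_generator` follows; `count_up_to` is a hypothetical generator written just for this example.

```python
import tensorflow as tf

def count_up_to(limit):
    """Hypothetical generator that yields 0, 1, ..., limit - 1."""
    for i in range(limit):
        yield i

# output_signature tells TensorFlow the shape and dtype of each yielded element
dataset = tf.data.Dataset.from_generator(
    count_up_to,
    args=[4],  # passed to the generator function as its arguments
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

values = list(dataset.as_numpy_iterator())
print(values)
```

Because the generator runs in Python, this route is convenient but usually slower than tensor-based sources.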

## Transforming Datasets
Datasets can be transformed and manipulated in a variety of ways. Some common transformations include batching, shuffling, and repeating the data. Here are some examples:

In [97]:
# batch the dataset: instead of iterating over individual items, you can iterate over batches of items
# Reminder: batching is useful when you want to train on multiple examples at once
# which can be more efficient and can take advantage of parallelism
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
batched_dataset = dataset.batch(3)
for batch in batched_dataset:
    print(batch.numpy())

[1 2 3]
[4 5 6]
[7 8 9]
[10]
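
The last batch above holds a single element because 10 is not divisible by 3. If every batch needs the same size, `batch` accepts `drop_remainder=True`; a short sketch:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# drop_remainder=True discards the final, smaller batch
full_batches = [
    batch.numpy().tolist()
    for batch in dataset.batch(3, drop_remainder=True)
]
print(full_batches)
```

The element `10` is dropped here, so this option trades a little data for a fixed batch shape.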


In [98]:
# batching when there is a tuple of features and labels
features = tf.constant([[1, 2], [3, 4], [5, 6]])
labels = tf.constant(['cat', 'dog', 'fox'])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
batched_dataset = dataset.batch(2)
for features, labels in batched_dataset:
    print(f"features: {features.numpy()}, labels: {labels.numpy()}")

features: [[1 2]
 [3 4]], labels: [b'cat' b'dog']
features: [[5 6]], labels: [b'fox']


In [109]:
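# combining transformations: shuffle with a buffer of 3 elements, then batch
# (the shuffled order below changes from run to run)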
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
dataset = dataset.shuffle(3).batch(3)
for item in dataset:
    print(item.numpy())

[1 3 4]
[5 6 2]
[ 9 10  7]
[8]
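
In the output above the numbers are only mixed locally. That is a consequence of the small shuffle buffer: each output element is drawn at random from a buffer holding the next `buffer_size` inputs. The sketch below checks that property and contrasts it with a buffer as large as the dataset; `BUFFER_SIZE` and the seed are arbitrary choices for illustration.

```python
import tensorflow as tf

BUFFER_SIZE = 3
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

shuffled = [int(x) for x in dataset.shuffle(BUFFER_SIZE).as_numpy_iterator()]
print(shuffled)

# The element at output position i was drawn from the first i + BUFFER_SIZE inputs,
# so with the inputs 1..10 its value can be at most i + BUFFER_SIZE
locally_mixed = all(value <= i + BUFFER_SIZE for i, value in enumerate(shuffled))
print(locally_mixed)

# A buffer at least as large as the dataset allows any ordering
fully_shuffled = [int(x) for x in dataset.shuffle(10, seed=42).as_numpy_iterator()]
print(fully_shuffled)
```

For a thorough shuffle, pick a buffer size close to the dataset size when memory allows.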


In [100]:
# mapping a function to the dataset. Useful if you need to transform the data to a different format too
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
dataset = dataset.map(lambda x: x * 2)
for item in dataset:
    print(item.numpy())

2
4
6
8
10
12
14
16
18
20
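
Repeating was listed among the common transformations but is not shown above. A small sketch of `repeat`, which also illustrates that the order of `repeat` and `batch` matters:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# batch first, then repeat: every pass ends with its own partial batch
batch_then_repeat = [b.numpy().tolist() for b in dataset.batch(2).repeat(2)]
print(batch_then_repeat)

# repeat first, then batch: batches can straddle the boundary between passes
repeat_then_batch = [b.numpy().tolist() for b in dataset.repeat(2).batch(2)]
print(repeat_then_batch)
```

Calling `repeat()` without an argument repeats indefinitely, so combine it with `take` or a fixed number of steps when iterating.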


## Consuming files from a directory
You can create a dataset from files in a directory. This is useful when you have a large number of files that you want to feed into a model. Here is an example of how to create a dataset from image files:

In [101]:
import pathlib

# download the flower photos dataset
flowers_root = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
# get the path to the directory
flowers_root = pathlib.Path(flowers_root)
flowers_root

WindowsPath('C:/Users/kingk/.keras/datasets/flower_photos')

In [102]:
import os

# Preprocessing for image files

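# this will create a dataset of the file paths
# the '*/*' glob pattern matches every file inside each class subdirectory
# shuffle=False keeps the listing deterministic; the paths are shuffled once below
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'), shuffle=False)

# shuffle once with a fixed order, so that the train/test split further down
# stays the same on every pass over the data
list_ds = list_ds.shuffle(len(list_ds), reshuffle_each_iteration=False)

for f in list_ds.take(5):
    print(f.numpy())

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
    # the name of the parent directory is the class label
    parts = tf.strings.split(filename, os.sep)
    label = parts[-2]

    image = tf.io.read_file(filename)
    image = tf.io.decode_jpeg(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [128, 128])
    return image, label

# now, let's create a dataset of the images and labels
images_ds = list_ds.map(parse_image)
for image, label in images_ds.take(1):
    print(f"image shape: {image.shape}, label: {label}")

# split into training and test before batching, so the split is counted
# in images rather than in batches
train_size = int(0.7 * len(images_ds))
train_ds = images_ds.take(train_size)
test_ds = images_ds.skip(train_size)

# now you can shuffle and batch the datasets
# shuffle before batch, so individual images (not whole batches) are mixed
train_ds = train_ds.shuffle(buffer_size=1000).batch(32)
test_ds = test_ds.batch(32)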

b'C:\\Users\\kingk\\.keras\\datasets\\flower_photos\\tulips\\14957470_6a8c272a87_m.jpg'
b'C:\\Users\\kingk\\.keras\\datasets\\flower_photos\\tulips\\15052586652_56a82de133_m.jpg'
b'C:\\Users\\kingk\\.keras\\datasets\\flower_photos\\sunflowers\\145303599_2627e23815_n.jpg'
b'C:\\Users\\kingk\\.keras\\datasets\\flower_photos\\sunflowers\\18097401209_910a46fae1_n.jpg'
b'C:\\Users\\kingk\\.keras\\datasets\\flower_photos\\roses\\17051448596_69348f7fce_m.jpg'
image shape: (128, 128, 3), label: b'tulips'
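
The labels produced by `parse_image` are byte strings such as `b'tulips'`, while a loss like sparse categorical cross-entropy expects integer class ids. One possible way to convert them is sketched below. The file paths are made up, `'/'` is used as the separator instead of `os.sep` so the snippet behaves the same on every platform, and the class names are listed by hand rather than read from the directory.

```python
import tensorflow as tf

# Class names written out by hand for this sketch (the flower_photos subdirectories)
class_names = tf.constant(['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips'])

def label_to_index(filename):
    # the second-to-last path component is the class directory
    parts = tf.strings.split(filename, '/')
    # compare the directory name against every class name, then take the matching position
    one_hot = tf.cast(parts[-2] == class_names, tf.int32)
    return tf.argmax(one_hot)

# Hypothetical paths standing in for the real file listing
paths = tf.data.Dataset.from_tensor_slices([
    'flower_photos/tulips/a.jpg',
    'flower_photos/roses/b.jpg',
    'flower_photos/daisy/c.jpg',
])

indices = [int(i) for i in paths.map(label_to_index).as_numpy_iterator()]
print(indices)
```

The same comparison could be placed inside `parse_image` so that it returns `(image, class_id)` pairs directly.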


## Using Datasets with Keras
A `tf.data.Dataset` can be passed directly to `model.fit` and `model.evaluate`. The example below builds an input pipeline for MNIST, trains a small classifier on it, and plots the training history.


In [88]:
import tensorflow as tf
from tensorflow import keras

# load mnist from keras
mnist = keras.datasets.mnist

# load the mnist dataset and extract them as numpy arrays
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# create a dataset from the numpy arrays
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))

# preprocess the data
def preprocess_image(image, label):
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

train_dataset = train_dataset.map(preprocess_image)
test_dataset = test_dataset.map(preprocess_image)

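# split the training dataset into training and validation
# split before shuffling: shuffle() reshuffles on every epoch by default, so
# take/skip on a shuffled dataset would leak validation examples into training
validation_split = 0.2
num_train = len(x_train)
num_val = int(validation_split * num_train)

val_dataset = train_dataset.take(num_val)
train_dataset = train_dataset.skip(num_val)

# shuffle only the training data
train_dataset = train_dataset.shuffle(buffer_size=1024)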

# batch the data after splitting
BATCH_SIZE = 32

# prefetching will speed up the training process by prefetching the next batch while the current batch is being processed
# that way the pipeline is always full and the data is always ready to be used.
# AUTOTUNE will let tensorflow determine the optimal prefetch size.
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_dataset = val_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# create the model
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Add early stopping to prevent overfitting
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)

history = model.fit(
    train_dataset,
    epochs=25,
    validation_data=val_dataset,
    callbacks=[early_stopping]
)

# plot the training loss and validation loss
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, loss_values, 'bo-', label='Training loss')
plt.plot(epochs, val_loss_values, 'r.-', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

# plot the training accuracy and validation accuracy
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.subplot(1, 2, 2)
plt.plot(epochs, acc_values, 'bo-', label='Training accuracy')
plt.plot(epochs, val_acc_values, 'r.-', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()

# evaluate the model
test_loss, test_acc = model.evaluate(test_dataset)
print(f'Test accuracy: {test_acc:.4f}')

TypeError: in user code:

    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\engine\training.py", line 1972, in test_function  *
        return step_function(self, iterator)
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\engine\training.py", line 1956, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\engine\training.py", line 1944, in run_step  **
        outputs = model.test_step(data)
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\engine\training.py", line 1853, in test_step
        return self.compute_metrics(x, y, y_pred, sample_weight)
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\engine\training.py", line 1179, in compute_metrics
        self.compiled_metrics.update_state(y, y_pred, sample_weight)
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\engine\compile_utils.py", line 605, in update_state
        metric_obj.update_state(y_t, y_p, sample_weight=mask)
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\utils\metrics_utils.py", line 77, in decorated
        update_op = update_state_fn(*args, **kwargs)
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\metrics\base_metric.py", line 140, in update_state_fn
        return ag_update_state(*args, **kwargs)
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\metrics\base_metric.py", line 723, in update_state  **
        matches = ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "D:\Development\AI\deeplearning-tensorflow\.venv\Lib\site-packages\keras\src\metrics\accuracy_metrics.py", line 462, in sparse_categorical_accuracy
        if matches.shape.ndims > 1 and matches.shape[-1] == 1:

    TypeError: '>' not supported between instances of 'NoneType' and 'int'
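
The error above is raised because a tensor reaching the accuracy metric has no known rank. That alone does not pinpoint the cause, but a quick way to see what shapes a pipeline will hand to Keras is to inspect `element_spec` before calling `fit` or `evaluate`. The sketch below uses small synthetic arrays as a stand-in for MNIST, so the sizes are assumptions chosen for illustration.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST: 10 blank 28x28 images with labels 0..9
x = np.zeros((10, 28, 28), dtype=np.uint8)
y = np.arange(10, dtype=np.uint8)

ds = tf.data.Dataset.from_tensor_slices((x, y))
ds = ds.map(lambda image, label: (tf.cast(image, tf.float32) / 255.0, label))

# Split first, then shuffle only the training part
num_val = 2
val_ds = ds.take(num_val).batch(4)
train_ds = ds.skip(num_val).shuffle(8).batch(4)

# element_spec describes the shape and dtype of every element the dataset yields
image_spec, label_spec = train_ds.element_spec
print(image_spec)
print(label_spec)

# cardinality() reports the number of batches when it can be determined
num_train_batches = int(train_ds.cardinality().numpy())
num_val_batches = int(val_ds.cardinality().numpy())
print(num_train_batches, num_val_batches)
```

A `None` in the leading position is just the variable batch size; a shape that is unknown altogether would be the thing to look for.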
