**Welcome to Deep Learning with Keras and TensorFlow in Python**

**Presented by: Reza Saadatyar (2024-2025)**<br/>
**E-mail: Reza.Saadatyar@outlook.com**<br/>
**[GitHub](https://github.com/RezaSaadatyar/Deep-Learning-in-python)**

**Outline:**<br/>
▪ [Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)<br/>
▪ [Data Shuffling](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle)<br/>
▪ [Repeat Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#repeat)<br/>
▪ [Batching](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch)

**Extract, Transform, Load (ETL) pipeline:**<br/>
▪ `Extract:` Data is gathered from various sources: cloud storage (e.g., Google Cloud Storage, AWS S3, or Azure Blob Storage), databases (e.g., MySQL, PostgreSQL), and the local file system (e.g., CSV files, JSON files, or other raw data stored locally).<br/>
▪ `Transform:` Data is processed, cleaned, or reformatted to make it suitable for analysis or model training. Common transformations include: normalizing numerical data (e.g., scaling values between 0 and 1), encoding categorical data (e.g., one-hot encoding), handling missing values, and resizing images or tokenizing text (if working with image or NLP datasets).<br/>
▪ `Load:` The transformed data is loaded into a target system, such as a compute device or a storage system, for further use.<br/>

`tf.data`, a TensorFlow API, streamlines loading, preprocessing, and feeding data into models. It excels with large datasets, supporting streaming and parallel processing for efficiency.
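
The three ETL stages can be sketched as one short `tf.data` pipeline. This is a minimal illustration, not a prescribed workflow: the arrays `features` and `labels` and the helper `scale` are hypothetical stand-ins for a real data source and a real preprocessing step.

```python
import numpy as np
import tensorflow as tf

# Extract: small in-memory arrays stand in for a real source (hypothetical values)
features = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]], dtype=np.float32)
labels = np.array([0, 1, 0], dtype=np.int32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Transform: min-max scale every feature column to the range [0, 1]
col_min = features.min(axis=0)
col_max = features.max(axis=0)

def scale(x, y):
    # x is one row of features, y is its label
    return (x - col_min) / (col_max - col_min), y

dataset = dataset.map(scale)

# Load: group into batches and prefetch so the consumer is not kept waiting
dataset = dataset.batch(2).prefetch(tf.data.AUTOTUNE)

batches = list(dataset.as_numpy_iterator())
for x_batch, y_batch in batches:
    print(x_batch, y_batch)
```

With three rows and a batch size of 2, the pipeline yields one full batch and one partial batch.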

**Key tf.data methods for extraction:**<br/>
▪ `tf.data.Dataset.from_tensor_slices():` Create a dataset from in-memory tensors (e.g., NumPy arrays).<br/>
▪ `tf.data.TextLineDataset:` Load text files line by line (e.g., for CSVs or raw text).<br/>
▪ `tf.data.TFRecordDataset:` Load data stored in TFRecord format, which is optimized for TensorFlow.<br/>
▪ `tf.keras.utils.image_dataset_from_directory():` A Keras utility that returns a `tf.data.Dataset`; it loads image datasets directly from a directory structure (useful for image classification tasks).<br/>
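
As a rough illustration of file-based extraction, the sketch below reads a small CSV file with `tf.data.TextLineDataset`. The file name `sample.csv` and its contents are made up for the example; the file is written to a temporary directory so the snippet is self-contained.

```python
import os
import tempfile
import tensorflow as tf

# Write a tiny CSV file to a temporary directory (hypothetical contents)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample.csv")
with open(csv_path, "w") as f:
    f.write("height,weight\n1.70,65\n1.82,80\n")

# Each element is one line of the file as a scalar string tensor;
# skip(1) drops the header row
lines = tf.data.TextLineDataset(csv_path).skip(1)

rows = [line.numpy().decode("utf-8") for line in lines]
print(rows)  # ['1.70,65', '1.82,80']
```

The lines arrive as raw strings, so a later `map` step would still be needed to split and convert the fields.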

**Key tf.data methods for transformation:**<br/>
▪ `dataset.map():` Apply a transformation function to each element.<br/>
▪ `dataset.filter():` Keep only the elements that satisfy a condition.<br/>
▪ `dataset.shuffle():` Randomize the order of elements using a fixed-size buffer.<br/>
▪ `dataset.batch():` Group elements into batches.<br/>
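
The transformation methods listed above are usually chained. The sketch below is one possible combination on a toy range dataset; `shuffle` is left out so that the printed result is deterministic.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)               # 0, 1, ..., 9

# filter keeps the elements for which the predicate is True (even numbers here)
dataset = dataset.filter(lambda v: v % 2 == 0)    # 0, 2, 4, 6, 8

# map applies a function to every element (squaring here)
dataset = dataset.map(lambda v: v * v)            # 0, 4, 16, 36, 64

# batch groups consecutive elements; the last batch may be smaller
dataset = dataset.batch(2)

result = [batch.tolist() for batch in dataset.as_numpy_iterator()]
print(result)  # [[0, 4], [16, 36], [64]]
```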


<font color='#FF000e' size="4.5" face="Arial"><b>Import modules</b></font>

In [None]:
import pprint
import numpy as np
import tensorflow as tf

In [4]:
# Create a NumPy array with the given values
x = np.array([8, 3, 20, -1, 0, 1])

# Create a TensorFlow Dataset from the NumPy array using tf.data.Dataset.from_tensor_slices
# This creates a dataset where each element is a slice of the input array
dataset = tf.data.Dataset.from_tensor_slices(x)

# The dataset is now ready for iteration or further processing
dataset, x

(<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>,
 array([ 8,  3, 20, -1,  0,  1]))

In [13]:
# Iterate over the dataset and print each element along with its index
for ind, tensor in enumerate(dataset):
    print(f"{ind} → {tensor = }")

0 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=8>
1 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=3>
2 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=20>
3 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=-1>
4 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=0>
5 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=1>


In [None]:
# Inspect the element specification of the dataset
dataset.element_spec

TensorSpec(shape=(), dtype=tf.int64, name=None)

In [85]:
# Create a 2D tensor with random uniform values (shape [100, 5])
x = tf.random.uniform([100, 5])

# Create a 1D tensor of random integer labels (shape [100]); each value is 0 or 1 because maxval is exclusive
y = tf.random.uniform([100], maxval=2, dtype=tf.int32)

# Create a TensorFlow Dataset from a tuple of tensors (x, y) using tf.data.Dataset.from_tensor_slices
dataset = tf.data.Dataset.from_tensor_slices((x, y))

# Inspect the element specification of the dataset
dataset.element_spec

(TensorSpec(shape=(5,), dtype=tf.float32, name=None),
 TensorSpec(shape=(), dtype=tf.int32, name=None))

In [93]:
# Create a TensorFlow Dataset from the 2D tensor `x` using tf.data.Dataset.from_tensor_slices
x_dataset = tf.data.Dataset.from_tensor_slices(x)

# Create a TensorFlow Dataset from the 1D tensor `y` using tf.data.Dataset.from_tensor_slices
y_dataset = tf.data.Dataset.from_tensor_slices(y)

# Combine the two datasets into a single dataset using tf.data.Dataset.zip
# This pairs each element of `x_dataset` with the corresponding element of `y_dataset`
dataset = tf.data.Dataset.zip((x_dataset, y_dataset))

# Inspect the element specification of the zipped dataset (it matches the tuple-based dataset above)
dataset.element_spec

(TensorSpec(shape=(5,), dtype=tf.float32, name=None),
 TensorSpec(shape=(), dtype=tf.int32, name=None))

In [94]:
# Iterate over the first 5 elements of the dataset and print each pair as label → features
for ind_x, ind_y in dataset.take(5):
    print(f"{ind_y} → {ind_x}")

0 → [0.69098985 0.80467045 0.7604947  0.0914799  0.6327827 ]
1 → [0.99304974 0.5970018  0.21458507 0.7159656  0.7758702 ]
1 → [0.31689167 0.5630431  0.2784543  0.00234151 0.65439403]
1 → [0.50773513 0.10693932 0.40303254 0.27550995 0.6557487 ]
0 → [0.4887359  0.44025254 0.05140471 0.75439227 0.35550952]


In [95]:
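# Repeat the loop with the `=` f-string specifier to show the full tensor representation (shape, dtype, values)
for ind_x, ind_y in dataset.take(5):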
    print(f"{ind_y = } → {ind_x = }")

ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=0> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.69098985, 0.80467045, 0.7604947 , 0.0914799 , 0.6327827 ],
      dtype=float32)>
ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=1> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.99304974, 0.5970018 , 0.21458507, 0.7159656 , 0.7758702 ],
      dtype=float32)>
ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=1> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.31689167, 0.5630431 , 0.2784543 , 0.00234151, 0.65439403],
      dtype=float32)>
ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=1> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.50773513, 0.10693932, 0.40303254, 0.27550995, 0.6557487 ],
      dtype=float32)>
ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=0> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.4887359 , 0.44025254, 0.05140471, 0.75439227, 0.35550952],
      dtype=float32)>


**Data Shuffling**

In [98]:
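# Shuffle the dataset with a buffer size of 5
# `shuffle` draws each output element at random from a buffer of 5 elements, so with
# 100 elements the shuffle is only approximate; use buffer_size >= dataset size for a full shuffle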
dataset = dataset.shuffle(buffer_size=5)

# Iterate over the first 5 elements of the shuffled dataset and print each pair of (x, y) values
for ind_x, ind_y in dataset.take(5):
    print(f"{ind_y} → {ind_x}")

1 → [0.50773513 0.10693932 0.40303254 0.27550995 0.6557487 ]
1 → [0.39901233 0.60831    0.1106385  0.68864775 0.3791287 ]
1 → [0.31689167 0.5630431  0.2784543  0.00234151 0.65439403]
1 → [0.6434109  0.12706244 0.13220489 0.9911444  0.3176396 ]
1 → [0.99304974 0.5970018  0.21458507 0.7159656  0.7758702 ]


**Repeat Dataset**

In [109]:
# Create a 1D tensor with values [0, 1, 2] using tf.range
x = tf.range(3)

# Create a TensorFlow Dataset from the tensor using tf.data.Dataset.from_tensor_slices
x_dataset = tf.data.Dataset.from_tensor_slices(x)

# Repeat the dataset 2 times using the `repeat` method
# This creates a dataset that iterates through the original dataset twice
ds = x_dataset.repeat(2)

# Request up to 10 elements; only 6 are printed because the repeated dataset has 3 × 2 = 6 elements
for ind_x in ds.take(10):
    print(ind_x)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
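
Calling `repeat()` without a count repeats the dataset indefinitely, so iteration has to be bounded explicitly, for example with `take`. The sketch below also uses `cardinality()` to compare the finite and infinite cases; the variable names are illustrative only.

```python
import tensorflow as tf

x_dataset = tf.data.Dataset.range(3)

# repeat() with no argument never ends; take(7) bounds the iteration
endless = x_dataset.repeat()
first_seven = [int(v) for v in endless.take(7).as_numpy_iterator()]
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]

# cardinality() reports the number of elements when it is known statically
finite_len = int(x_dataset.repeat(2).cardinality())
is_infinite = bool(endless.cardinality().numpy() == tf.data.INFINITE_CARDINALITY)
print(finite_len, is_infinite)  # 6 True
```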


**Batching**

In [114]:
# Create a TensorFlow Dataset with values from 0 to 99 using tf.data.Dataset.range
dataset = tf.data.Dataset.range(100)

# Iterate over the first 5 elements of the dataset and print each element
for ind in dataset.take(5):
    print(ind)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)


In [118]:
# Batch the dataset into groups of 4 elements using the `batch` method
ds = dataset.batch(4)

# Iterate over the first 5 batches of the dataset and print each batch
for ind in ds.take(5):
    print(ind)

tf.Tensor([0 1 2 3], shape=(4,), dtype=int64)
tf.Tensor([4 5 6 7], shape=(4,), dtype=int64)
tf.Tensor([ 8  9 10 11], shape=(4,), dtype=int64)
tf.Tensor([12 13 14 15], shape=(4,), dtype=int64)
tf.Tensor([16 17 18 19], shape=(4,), dtype=int64)
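
When the dataset size is not a multiple of the batch size, the final batch is smaller unless `drop_remainder=True` is passed. A minimal sketch on a 10-element dataset:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Default: the last batch keeps the leftover elements
sizes_default = [len(b) for b in dataset.batch(4).as_numpy_iterator()]
print(sizes_default)  # [4, 4, 2]

# drop_remainder=True discards the incomplete final batch
sizes_dropped = [len(b) for b in dataset.batch(4, drop_remainder=True).as_numpy_iterator()]
print(sizes_dropped)  # [4, 4]

# The static batch dimension is unknown (None) by default and fixed when dropping the remainder
shape_default = dataset.batch(4).element_spec.shape.as_list()
shape_dropped = dataset.batch(4, drop_remainder=True).element_spec.shape.as_list()
print(shape_default, shape_dropped)  # [None] [4]
```

Dropping the remainder is mainly useful when downstream code requires a fixed batch dimension.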
