**Welcome to Deep Learning with Keras and TensorFlow in Python**

**Presented by: Reza Saadatyar (2024-2025)**<br/>
**E-mail: Reza.Saadatyar@outlook.com**<br/>
**[GitHub](https://github.com/RezaSaadatyar/Deep-Learning-in-python)**

**Outline:**<br/>
▪ [Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)<br/>
▪ [Data Shuffling](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle)<br/>
▪ [Repeat Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#repeat)<br/>
▪ [Batching](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch)<br/>
▪ Prepare TensorFlow datasets for training, validation, and testing

**Extract, Transform, Load (ETL) pipeline:**<br/>
▪ `Extract:` Data is gathered from various sources: cloud storage (e.g., Google Cloud Storage, AWS S3, or Azure Blob Storage), databases (e.g., MySQL, PostgreSQL), and the local file system (e.g., CSV files, JSON files, or other raw data stored locally).<br/>
▪ `Transform:` Data is processed, cleaned, or reformatted to make it suitable for analysis or model training. Common transformations include: normalizing numerical data (e.g., scaling values between 0 and 1), encoding categorical data (e.g., one-hot encoding), handling missing values, and resizing images or tokenizing text (if working with image or NLP datasets).<br/>
▪ `Load:` The transformed data is loaded into a target system, such as a storage location or the device (CPU, GPU, or TPU) that trains the model.<br/>

`tf.data`, a TensorFlow API, streamlines loading, preprocessing, and feeding data into models. It is well suited to large datasets because it supports streaming and parallel processing.
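
The three ETL stages map onto a `tf.data` pipeline roughly as follows. This is a minimal sketch on a made-up 3x2 array; the names `raw`, `col_min`, and `col_max` are illustrative assumptions, and min-max scaling is just one possible transformation.

```python
import numpy as np
import tensorflow as tf

# Hypothetical raw data: 3 samples, 2 features on different scales
raw = np.array([[0.0, 50.0], [5.0, 100.0], [10.0, 75.0]], dtype=np.float32)
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)

# Extract: build a dataset from the in-memory array
ds = tf.data.Dataset.from_tensor_slices(raw)

# Transform: scale every feature to the range [0, 1]
ds = ds.map(
    lambda row: (row - col_min) / (col_max - col_min),
    num_parallel_calls=tf.data.AUTOTUNE,
)

# Load: batch the elements and prefetch them for the consumer (e.g., a model)
ds = ds.batch(2).prefetch(tf.data.AUTOTUNE)

batches = [batch.numpy() for batch in ds]
for batch in batches:
    print(batch.shape)
```

With three samples and a batch size of 2, the last batch holds a single sample.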

**Key tf.data methods for extraction:**<br/>
▪ `tf.data.Dataset.from_tensor_slices():` Create a dataset from in-memory tensors (e.g., NumPy arrays).<br/>
▪ `tf.data.TextLineDataset:` Load text files line by line (e.g., for CSVs or raw text).<br/>
▪ `tf.data.TFRecordDataset:` Load data stored in TFRecord format, which is optimized for TensorFlow.<br/>
▪ `tf.keras.utils.image_dataset_from_directory():` A Keras utility (not part of `tf.data` itself) that loads images from a directory structure and returns a `tf.data.Dataset` (useful for image classification tasks).<br/>
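
As a small illustration of file-based extraction, the sketch below writes a throwaway CSV file and reads it back line by line with `tf.data.TextLineDataset`. The file name `toy.csv` and its contents are assumptions made up for this example.

```python
import os
import tempfile

import tensorflow as tf

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical CSV file with a header row and two data rows
    path = os.path.join(tmp_dir, "toy.csv")
    with open(path, "w") as f:
        f.write("a,b\n1,2\n3,4\n")

    # Each element is one line of the file as a string tensor; skip(1) drops the header
    lines_ds = tf.data.TextLineDataset(path).skip(1)

    # Read everything while the temporary file still exists
    rows = [line.numpy().decode("utf-8") for line in lines_ds]

print(rows)
```

Each line arrives as a raw string, so parsing it into numeric columns would be a separate `map` step.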

**Key tf.data methods for transformation:**<br/>
▪ `dataset.map():` Apply a transformation function to each element.<br/>
▪ `dataset.filter():` Keep only the elements for which a predicate returns `True`.<br/>
▪ `dataset.shuffle():` Randomly shuffle elements using a fixed-size buffer.<br/>
▪ `dataset.batch():` Group elements into batches.<br/>
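
The transformation methods listed above can be chained. The following sketch uses a toy range dataset (an assumption for illustration) and leaves out `shuffle` so that the result is deterministic.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)            # 0, 1, ..., 9
ds = ds.map(lambda v: v * 2)              # 0, 2, ..., 18
ds = ds.filter(lambda v: v % 4 == 0)      # keep multiples of 4: 0, 4, 8, 12, 16
ds = ds.batch(2)                          # group into batches of 2

batches = [batch.numpy().tolist() for batch in ds]
print(batches)  # [[0, 4], [8, 12], [16]]
```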

<font color='#FF000e' size="4.5" face="Arial"><b>Import modules</b></font>

In [3]:
import pprint
import numpy as np
import tensorflow as tf
from typing import Tuple, Union
from sklearn.model_selection import train_test_split

<font color=#e4e706 size="4.8" face="Arial"><b>1️⃣ Dataset</b></font>

In [4]:
# Create a NumPy array with the given values
x = np.array([8, 3, 20, -1, 0, 1])

# Create a TensorFlow Dataset from the NumPy array using tf.data.Dataset.from_tensor_slices
# This creates a dataset where each element is a slice of the input array
dataset = tf.data.Dataset.from_tensor_slices(x)

# The dataset is now ready for iteration or further processing
dataset, x

(<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>,
 array([ 8,  3, 20, -1,  0,  1]))

In [13]:
# Iterate over the dataset and print each element along with its index
for ind, tensor in enumerate(dataset):
    print(f"{ind} → {tensor = }")

0 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=8>
1 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=3>
2 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=20>
3 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=-1>
4 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=0>
5 → tensor = <tf.Tensor: shape=(), dtype=int64, numpy=1>


In [None]:
# Inspect the element specification of the dataset
dataset.element_spec

TensorSpec(shape=(), dtype=tf.int64, name=None)
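
To see what "slice" means here, it can help to compare `from_tensor_slices` with `from_tensors` on a 2D array. This is a minimal sketch; the array `matrix` is an illustrative assumption.

```python
import numpy as np
import tensorflow as tf

matrix = np.arange(6).reshape(3, 2)  # 3 rows, 2 columns

# from_tensor_slices: one element per row (slices along the first axis)
sliced = tf.data.Dataset.from_tensor_slices(matrix)

# from_tensors: the whole array becomes a single element
whole = tf.data.Dataset.from_tensors(matrix)

n_sliced = len(list(sliced))
n_whole = len(list(whole))
print(n_sliced, sliced.element_spec.shape)
print(n_whole, whole.element_spec.shape)
```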

In [120]:
# Create a 2D tensor with random uniform values (shape [100, 5])
x = tf.random.uniform([100, 5])

# Create a 1D tensor of random integer labels (shape [100]); maxval is exclusive, so the values are 0 or 1
y = tf.random.uniform([100], maxval=2, dtype=tf.int32)

# Create a TensorFlow Dataset from a tuple of tensors (x, y) using tf.data.Dataset.from_tensor_slices
dataset = tf.data.Dataset.from_tensor_slices((x, y))

# Inspect the element specification of the dataset
dataset.element_spec

(TensorSpec(shape=(5,), dtype=tf.float32, name=None),
 TensorSpec(shape=(), dtype=tf.int32, name=None))

In [None]:
# Create a TensorFlow Dataset from the 2D tensor `x` using tf.data.Dataset.from_tensor_slices
x_dataset = tf.data.Dataset.from_tensor_slices(x)

# Create a TensorFlow Dataset from the 1D tensor `y` using tf.data.Dataset.from_tensor_slices
y_dataset = tf.data.Dataset.from_tensor_slices(y)

# Combine the two datasets into a single dataset using tf.data.Dataset.zip
# This pairs each element of `x_dataset` with the corresponding element of `y_dataset`
dataset = tf.data.Dataset.zip((x_dataset, y_dataset))

# Inspect the element specification of the zipped dataset
dataset.element_spec

(TensorSpec(shape=(5,), dtype=tf.float32, name=None),
 TensorSpec(shape=(), dtype=tf.int32, name=None))
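
The two construction routes (a tuple passed to `from_tensor_slices`, or two datasets combined with `zip`) should yield the same elements. The sketch below checks this on a small made-up array; the variable names are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

x = np.arange(6, dtype=np.float32).reshape(3, 2)
y = np.array([0, 1, 0], dtype=np.int32)

# Route 1: slice a tuple of arrays directly
ds_tuple = tf.data.Dataset.from_tensor_slices((x, y))

# Route 2: build one dataset per array, then pair them element by element
ds_zip = tf.data.Dataset.zip(
    (tf.data.Dataset.from_tensor_slices(x), tf.data.Dataset.from_tensor_slices(y))
)

# Compare the two datasets element by element
same = all(
    np.array_equal(a[0].numpy(), b[0].numpy()) and a[1].numpy() == b[1].numpy()
    for a, b in zip(ds_tuple, ds_zip)
)
print(same)
```

`zip` is mainly useful when the two datasets come from different sources or need different preprocessing before being paired.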

In [94]:
# Iterate over the first 5 elements of the dataset and print each pair of (x, y) values
for ind_x, ind_y in dataset.take(5):
    print(f"{ind_y} → {ind_x}")

0 → [0.69098985 0.80467045 0.7604947  0.0914799  0.6327827 ]
1 → [0.99304974 0.5970018  0.21458507 0.7159656  0.7758702 ]
1 → [0.31689167 0.5630431  0.2784543  0.00234151 0.65439403]
1 → [0.50773513 0.10693932 0.40303254 0.27550995 0.6557487 ]
0 → [0.4887359  0.44025254 0.05140471 0.75439227 0.35550952]


In [95]:
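# Print the same five pairs using the f-string `=` specifier, which shows the full tensor representation
for ind_x, ind_y in dataset.take(5):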
    print(f"{ind_y = } → {ind_x = }")

ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=0> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.69098985, 0.80467045, 0.7604947 , 0.0914799 , 0.6327827 ],
      dtype=float32)>
ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=1> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.99304974, 0.5970018 , 0.21458507, 0.7159656 , 0.7758702 ],
      dtype=float32)>
ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=1> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.31689167, 0.5630431 , 0.2784543 , 0.00234151, 0.65439403],
      dtype=float32)>
ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=1> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.50773513, 0.10693932, 0.40303254, 0.27550995, 0.6557487 ],
      dtype=float32)>
ind_y = <tf.Tensor: shape=(), dtype=int32, numpy=0> → ind_x = <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.4887359 , 0.44025254, 0.05140471, 0.75439227, 0.35550952],
      dtype=float32)>
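
When the full tensor representation is too verbose, `as_numpy_iterator()` yields plain NumPy values instead. A minimal sketch on a small stand-in dataset (the tensors below are assumptions, not the data used above):

```python
import tensorflow as tf

x = tf.random.uniform([4, 3])
y = tf.constant([0, 1, 1, 0])
ds = tf.data.Dataset.from_tensor_slices((x, y))

# Each element is converted to NumPy before it reaches the loop
pairs = list(ds.take(2).as_numpy_iterator())
for features, label in pairs:
    print(label, features)
```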


<font color= #5ff309 size="4.8" face="Arial"><b>2️⃣ Data Shuffling</b></font>

In [98]:
# Shuffle the dataset with a buffer size of 5
# `shuffle` fills a buffer with 5 elements and draws from it at random, so with 100 elements
# the shuffling is only local; use a buffer size >= the dataset size for a full shuffle
dataset = dataset.shuffle(buffer_size=5)

# Iterate over the first 5 elements of the shuffled dataset and print each pair of (x, y) values
for ind_x, ind_y in dataset.take(5):
    print(f"{ind_y} → {ind_x}")

1 → [0.50773513 0.10693932 0.40303254 0.27550995 0.6557487 ]
1 → [0.39901233 0.60831    0.1106385  0.68864775 0.3791287 ]
1 → [0.31689167 0.5630431  0.2784543  0.00234151 0.65439403]
1 → [0.6434109  0.12706244 0.13220489 0.9911444  0.3176396 ]
1 → [0.99304974 0.5970018  0.21458507 0.7159656  0.7758702 ]
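
The effect of `buffer_size` is easier to see on a small ordered dataset. The sketch below is illustrative only; the buffer sizes and the seed are arbitrary choices.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# buffer_size=1: the buffer holds a single element, so the order is unchanged
unshuffled = [int(v.numpy()) for v in ds.shuffle(buffer_size=1)]

# buffer_size=3: output position i can only hold an input index <= i + 2
local = [int(v.numpy()) for v in ds.shuffle(buffer_size=3, seed=0)]

# buffer_size >= dataset size: any permutation of the elements is possible
full = [int(v.numpy()) for v in ds.shuffle(buffer_size=10, seed=0)]

print(unshuffled)
print(local)
print(full)
```

A small buffer keeps memory use low but mixes elements only with their near neighbours, which matters when the source data is sorted by class.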


<font color=#0ec3f0 size="4.8" face="Arial"><b>3️⃣ Repeat Dataset</b></font>

In [109]:
# Create a 1D tensor with values [0, 1, 2] using tf.range
x = tf.range(3)

# Create a TensorFlow Dataset from the tensor using tf.data.Dataset.from_tensor_slices
x_dataset = tf.data.Dataset.from_tensor_slices(x)

# Repeat the dataset 2 times using the `repeat` method
# This creates a dataset that iterates through the original dataset twice
ds = x_dataset.repeat(2)

# Request up to 10 elements; the repeated dataset holds only 6 (3 values x 2 passes), so 6 are printed
for ind_x in ds.take(10):
    print(ind_x)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
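
Calling `repeat()` without a count repeats the dataset indefinitely, so it is usually paired with `take` or a fixed number of training steps. A minimal sketch:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(3)

# No count given: the dataset restarts every time it is exhausted
endless = ds.repeat()
is_infinite = endless.cardinality().numpy() == tf.data.INFINITE_CARDINALITY

# take(7) stops the otherwise endless iteration
first_seven = [int(v.numpy()) for v in endless.take(7)]
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]
```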


<font color=#e706af size="4.8" face="Arial"><b>4️⃣ Batching</b></font>

In [114]:
# Create a TensorFlow Dataset with values from 0 to 99 using tf.data.Dataset.range
dataset = tf.data.Dataset.range(100)

# Iterate over the first 5 elements of the dataset and print each element
for ind in dataset.take(5):
    print(ind)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)


In [118]:
# Batch the dataset into groups of 4 elements using the `batch` method
ds = dataset.batch(4)

# Iterate over the first 5 batches of the dataset and print each batch
for ind in ds.take(5):
    print(ind)

tf.Tensor([0 1 2 3], shape=(4,), dtype=int64)
tf.Tensor([4 5 6 7], shape=(4,), dtype=int64)
tf.Tensor([ 8  9 10 11], shape=(4,), dtype=int64)
tf.Tensor([12 13 14 15], shape=(4,), dtype=int64)
tf.Tensor([16 17 18 19], shape=(4,), dtype=int64)
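
When the dataset size is not a multiple of the batch size, the final batch is smaller unless `drop_remainder=True` is set. The sketch below uses 10 elements (an arbitrary choice) to show both behaviours.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Default: the last batch keeps the 2 leftover elements
keep_partial = [b.numpy().tolist() for b in ds.batch(4)]

# drop_remainder=True: the incomplete last batch is discarded
drop_partial = [b.numpy().tolist() for b in ds.batch(4, drop_remainder=True)]

print(keep_partial)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(drop_partial)  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# The batch dimension is unknown (None) by default and fixed (4) when the remainder is dropped
print(ds.batch(4).element_spec.shape, ds.batch(4, drop_remainder=True).element_spec.shape)
```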


<font color=#5b0cee size="4.5" face="Arial"><b>5️⃣ Prepare TensorFlow datasets for training, validation, and testing</b></font>

In [4]:
def prepare_dataset(
    data: Union[np.ndarray, tf.Tensor], 
    labels: Union[np.ndarray, tf.Tensor], 
    train_size: float = 0.8, 
    valid_size: float = 0.16, 
    batch_size: int = 16, 
    shuffle_train: bool = True, 
    shuffle_buffer_size: int = 1000
) -> Tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset]:
    """
    Split features and labels into train, validation, and test sets and wrap each in a tf.data.Dataset.

    The training set is optionally shuffled; all three sets are batched and prefetched.
    The test set receives the remaining proportion, 1.0 - (train_size + valid_size).
    
    Args:
        data: Input features as either numpy array or TensorFlow tensor
               Shape should be (num_samples, ...features_dims)
        labels: Corresponding labels as either numpy array or TensorFlow tensor
                Shape should be (num_samples, ...label_dims)
        train_size: Proportion of total data to use for training (0.0 to 1.0)
        valid_size: Proportion of total data to use for validation (0.0 to 1.0)
        batch_size: Number of samples per training batch (positive integer)
        shuffle_train: Whether to shuffle training data (recommended for training)
        shuffle_buffer_size: Size of buffer used for shuffling (larger = better shuffling but more memory)
    
    Returns:
        A tuple containing three tf.data.Dataset objects in order:
        - train_dataset: Dataset for model training
        - valid_dataset: Dataset for validation during training
        - test_dataset: Dataset for final evaluation
    
    Raises:
        ValueError: If input sizes are invalid or data/labels have mismatched lengths
    """
    
    # ============================================ INPUT VALIDATION ============================================
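    # Validate the split proportions: both must be positive and must leave a non-empty test set
    if train_size <= 0.0 or valid_size <= 0.0:
        raise ValueError("train_size and valid_size must both be greater than 0.0")
    if train_size + valid_size >= 1.0:
        raise ValueError("train_size + valid_size must be less than 1.0 so that the test set is non-empty")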
    if train_size < valid_size:
        raise ValueError("Training set should typically be larger than validation set")

    # Convert TensorFlow tensors to numpy arrays for sklearn splitting
    if tf.is_tensor(data):
        data = data.numpy()
    if tf.is_tensor(labels):
        labels = labels.numpy()

    # Verify data and labels have compatible shapes
    if len(data) != len(labels):
        raise ValueError(f"Mismatched lengths: data has {len(data)} samples but labels has {len(labels)}")

    # ======================================== DATA SPLITTING ==================================================
    # First split separates test set from training+validation
    x_train_val, x_test, y_train_val, y_test = train_test_split(
        data,
        labels,
        train_size=train_size + valid_size,  # Combined size for train+val
        test_size=1.0 - (train_size + valid_size),  # Remainder for test
        random_state=24,  # Fixed seed for reproducibility
        # stratify=labels if len(set(labels)) > 1 else None  # Optional stratification
    )

    # Second split divides train+val into separate sets
    # Calculate relative proportion of validation within train+val subset
    valid_proportion = valid_size / (train_size + valid_size)
    x_train, x_valid, y_train, y_valid = train_test_split(
        x_train_val,
        y_train_val,
        train_size=1.0 - valid_proportion,  # Relative train size
        test_size=valid_proportion,          # Relative validation size
        random_state=24,  # Same seed for consistency
        # stratify=y_train_val if len(set(y_train_val)) > 1 else None
    )

    # ======================================= DATASET CREATION =================================================
    # Create TensorFlow Dataset objects with proper type casting
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (tf.cast(x_train, tf.float32),  # Convert features to float32
         tf.cast(y_train, tf.float32))  # Convert labels to float32
    )
    valid_dataset = tf.data.Dataset.from_tensor_slices(
        (tf.cast(x_valid, tf.float32), 
        tf.cast(y_valid, tf.float32))
    )
    test_dataset = tf.data.Dataset.from_tensor_slices(
        (tf.cast(x_test, tf.float32), 
        tf.cast(y_test, tf.float32))
    )
    
    # ====================================== DATASET OPTIMIZATION ==============================================
    # Shuffle training data if enabled (recommended for better training)
    if shuffle_train:
        train_dataset = train_dataset.shuffle(
            buffer_size=min(shuffle_buffer_size, len(x_train)),  # Don't exceed dataset size
            reshuffle_each_iteration=True  # Important for proper epoch training
        )
    
    # Batch all datasets for efficient processing
    train_dataset = train_dataset.batch(batch_size)
    valid_dataset = valid_dataset.batch(batch_size)
    test_dataset = test_dataset.batch(batch_size)
    
    # Prefetch data to overlap preprocessing and model execution
    train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)  # Let TensorFlow optimize buffer size
    valid_dataset = valid_dataset.prefetch(tf.data.AUTOTUNE)
    test_dataset = test_dataset.prefetch(tf.data.AUTOTUNE)

    # ============================================ VERIFICATION OUTPUT =========================================
    print(f"Training set:   {x_train.shape} features, {y_train.shape} labels")
    print(f"Validation set: {x_valid.shape} features, {y_valid.shape} labels")
    print(f"Test set:       {x_test.shape} features, {y_test.shape} labels")
    print(f"\nBatch size:     {batch_size}")
    print(f"Training shuffle: {'enabled' if shuffle_train else 'disabled'}")
    print("\t")
    pprint.pprint(train_dataset.element_spec, width=80)
    
    return train_dataset, valid_dataset, test_dataset
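
The function leaves stratification commented out. If class balance matters, one possible way to enable it is sketched below on made-up imbalanced data. Note that labels of shape `(n, 1)` need to be flattened first, and `np.unique` is used here in place of `set`, since a set cannot be built from the rows of a 2D array. All names in this sketch are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.random((100, 4))

# Hypothetical imbalanced labels: 80 zeros and 20 ones, shape (100, 1)
labels = np.array([0] * 80 + [1] * 20).reshape(-1, 1)

# Stratify only when there is more than one class
flat_labels = labels.ravel()
stratify = flat_labels if len(np.unique(flat_labels)) > 1 else None

x_tr, x_te, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=24,
    stratify=stratify,
)

# Both subsets keep the original 20% share of class 1
print(y_tr.mean(), y_te.mean())
```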

In [5]:
# Define dimensions for synthetic data
num_samples = 1000    # Number of samples
num_features = 10     # Number of features per sample (e.g., tabular data)
num_classes = 2       # Binary classification (adjustable)

# Generate synthetic feature data
# Random values between 0 and 1 (simulating normalized features)
data = np.random.random(size=(num_samples, num_features))

# Generate synthetic labels
# Random binary labels (0 or 1) for classification
labels = np.random.randint(0, num_classes, size=(num_samples, 1))


batch_size = 20

# Example usage with the synthetic data generated above
train_dataset, valid_dataset, test_dataset = prepare_dataset(
    data,
    labels,
    train_size=0.64,  # 64% of total
    valid_size=0.16,  # 16% of total (20% of train+valid)
    batch_size=batch_size,
    shuffle_train=True,
    shuffle_buffer_size=1000,
)

Training set:   (640, 10) features, (640, 1) labels
Validation set: (160, 10) features, (160, 1) labels
Test set:       (200, 10) features, (200, 1) labels

Batch size:     20
Training shuffle: enabled
	
(TensorSpec(shape=(None, 10), dtype=tf.float32, name=None),
 TensorSpec(shape=(None, 1), dtype=tf.float32, name=None))
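
Datasets of this form can be passed straight to Keras. The sketch below rebuilds small stand-in datasets with the same element shapes (it does not call `prepare_dataset`) and trains a tiny model for one epoch. The model architecture and all variable names are assumptions chosen only for illustration.

```python
import numpy as np
import tensorflow as tf

# Stand-in data with the same shapes as above: 10 features, 1 binary label
x_train = np.random.random((640, 10)).astype("float32")
y_train = np.random.randint(0, 2, size=(640, 1)).astype("float32")
x_valid = np.random.random((160, 10)).astype("float32")
y_valid = np.random.randint(0, 2, size=(160, 1)).astype("float32")

train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(640)
    .batch(20)
    .prefetch(tf.data.AUTOTUNE)
)
valid_ds = tf.data.Dataset.from_tensor_slices((x_valid, y_valid)).batch(20)

# 640 samples / 20 per batch = 32 batches per epoch
num_batches = int(train_ds.cardinality().numpy())
print(num_batches)

# A deliberately small binary classifier
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The datasets are already batched, so batch_size is not passed to fit
history = model.fit(train_ds, validation_data=valid_ds, epochs=1, verbose=0)
print(sorted(history.history.keys()))
```

Because the labels are random, the accuracy should stay near chance; the point is only to show how the datasets plug into `fit`.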
