# Chapter 3: Keras and Data Retrieval in TensorFlow 2

## 1️⃣ Chapter Overview

In the previous chapters, we dealt with the low-level building blocks of TensorFlow: Tensors, Variables, and Operations. While powerful, building complex deep learning models using only these primitives is time-consuming and error-prone. 

This chapter introduces **Keras**, the high-level API built into TensorFlow 2, which streamlines model development. We will explore the three ways to build models in Keras, ranging from simple layer stacks to highly flexible architectures. Because a model is useless without data, we will also build efficient, scalable data pipelines using `tf.data` and related utilities to feed our models.

**Key Machine Learning Concepts:**
* **Model Abstractions:** Sequential vs. Functional vs. Subclassing paradigms.
* **Data Pipelines:** The ETL (Extract, Transform, Load) process in Deep Learning.
* **Input Optimization:** Prefetching, caching, and batching.

**Practical Skills:**
* Building models using the **Sequential**, **Functional**, and **Subclassing** APIs.
* Implementing custom Keras layers.
* Creating robust data pipelines using the **tf.data API**.
* Using **Keras DataGenerators** for legacy support and **TensorFlow Datasets (TFDS)** for benchmark datasets.

## 2️⃣ Theoretical Explanation

### 2.1 The Three Flavors of Keras Model Building

Keras offers three distinct ways to define a neural network, offering a trade-off between simplicity and flexibility.

#### 1. The Sequential API
* **Definition:** A linear stack of layers. Each layer has exactly one input tensor and one output tensor.
* **Intuition:** Think of it as a simple assembly line. Data goes in one end, passes through stations A, B, and C in order, and comes out the other end.
* **Use Case:** Well suited to the many standard models in which each layer feeds only the next one (e.g., simple CNNs, MLPs).
* **Limitation:** Cannot handle models with multiple inputs (e.g., image + metadata), multiple outputs, or non-linear topology (e.g., residual connections).

#### 2. The Functional API
* **Definition:** A graph-based approach where you treat layers as functions that take tensors as inputs and return tensors as outputs.
* **Intuition:** Think of it as a complex plumbing system. You can split pipes, merge them, have multiple inlets, and multiple outlets.
* **Use Case:** Essential for complex architectures like ResNet (skip connections), InceptionNet (branching), or multi-modal models.

#### 3. The Subclassing API
* **Definition:** A fully object-oriented approach where you define a class inheriting from `tf.keras.Model` and define the forward pass logic in the `call()` method.
* **Intuition:** This gives you full control over the "forward pass." You can use Python control flow (`if`, `for`) inside the model execution.
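* **Use Case:** Research on exotic architectures, dynamic networks, or models whose forward pass depends on Python control flow. A subclassed `tf.keras.Model` can also override `train_step()` when you need to customize what `fit()` does on each batch.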

### 2.2 Data Retrieval Strategies

Deep learning models are data-hungry. If your GPU has to wait for the CPU to load and process images, your training will be slow. TensorFlow provides tools to optimize this.

1.  **`tf.data` API:** The standard, most performant way to build pipelines. It treats data as a stream that can be mapped, filtered, batched, and prefetched asynchronously.
2.  **Keras DataGenerators:** Older utilities (like `ImageDataGenerator`) specifically designed for image augmentation. Easier to use for simple image tasks but less performant than `tf.data`.
3.  **TensorFlow Datasets (TFDS):** A library of ready-to-use datasets (like MNIST, CIFAR-10) managed by Google, handling download and caching automatically.
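
To make the "stream" idea behind `tf.data` concrete, here is a minimal sketch on toy data. The integers 0 to 9 are a stand-in for real records (an assumption made purely for illustration); the same `filter`, `map`, `batch`, and `prefetch` calls apply to images or CSV rows.

```python
import tensorflow as tf

# Extract: a toy source, the integers 0..9 stand in for real records.
ds = tf.data.Dataset.range(10)

# Transform: keep the even numbers, then square them.
ds = ds.filter(lambda v: v % 2 == 0)
ds = ds.map(lambda v: v * v, num_parallel_calls=tf.data.AUTOTUNE)

# Load: group elements into batches of 2 and prepare upcoming batches in the background.
ds = ds.batch(2).prefetch(tf.data.AUTOTUNE)

batches = [b.numpy().tolist() for b in ds]
print(batches)  # [[0, 4], [16, 36], [64]]
```

Nothing is computed until the dataset is iterated, which is what lets TensorFlow overlap these steps with training.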

## 3️⃣ Part 1: Keras Model-Building APIs

We will implement three models to solve a classification problem on the **Iris Dataset**. 
1.  **Model A (Sequential):** A standard baseline.
2.  **Model B (Functional):** A model that takes two separate inputs (Raw features + PCA features).
3.  **Model C (Subclassing):** A model utilizing a custom layer with a multiplicative bias.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import requests
import os
from sklearn.decomposition import PCA
from tensorflow.keras.layers import Dense, Input, Concatenate
from tensorflow.keras.models import Sequential, Model
import tensorflow.keras.backend as K

# Ensure reproducibility
def fix_random_seed(seed):
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy not imported")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow not imported")

fix_random_seed(42)

### 3.1 Data Preparation (Iris Dataset)
We will download the Iris dataset, clean it, and prepare it for training.

In [None]:
# 1. Download Data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
file_path = "iris.data"

if not os.path.exists(file_path):
    r = requests.get(url)
    with open(file_path, 'wb') as f:
        f.write(r.content)

# 2. Load Data with Pandas
iris_df = pd.read_csv(file_path, header=None)
# Column order in iris.data: sepal length, sepal width, petal length, petal width, class
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label']

# 3. Preprocessing
# Map string labels to integers
iris_df["label"] = iris_df["label"].map(
    {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
)

# Shuffle the data
iris_df = iris_df.sample(frac=1.0, random_state=42)

# Separate Features (x) and Labels (y)
x = iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = iris_df['label']

# Standardize features (zero mean, unit standard deviation per column)
x = (x - x.mean()) / x.std()

# Convert to One-Hot Encoding for the output
y = tf.one_hot(y, depth=3)

# Convert to numpy arrays for Keras
x = x.values
y = y.numpy()

print("Data Shape:", x.shape)
print("Labels Shape:", y.shape)
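
The two preprocessing steps above can be checked on a tiny example. This is a sketch on a hypothetical two-column frame (`toy`, not the Iris data), showing that the z-score expression yields zero mean and unit standard deviation per column, and that one-hot encoding simply selects rows of an identity matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Iris feature columns.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Same z-score expression as in the cell above.
toy_norm = (toy - toy.mean()) / toy.std()

col_means = toy_norm.mean().to_numpy()   # approximately 0 for every column
col_stds = toy_norm.std().to_numpy()     # approximately 1 (pandas uses ddof=1 by default)

# One-hot encoding with plain NumPy: pick rows of a 3x3 identity matrix.
toy_labels = np.array([0, 2, 1, 2])
toy_one_hot = np.eye(3, dtype=np.float32)[toy_labels]

print(col_means, col_stds)
print(toy_one_hot)
```

Note that pandas computes the sample standard deviation (`ddof=1`), whereas `numpy.std` defaults to `ddof=0`; for a dataset of 150 rows the difference is small.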

### 3.2 Model A: The Sequential API
This is the simplest approach. We stack layers linearly.

In [None]:
K.clear_session()

# Define the model using Sequential
model_seq = Sequential([
    # Declare the input shape explicitly with an Input object
    Input(shape=(4,)),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(3, activation='softmax') # Output layer: 3 classes
])

# Compile the model
model_seq.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['acc']
)

model_seq.summary()

# Train
print("\n--- Training Sequential Model ---")
model_seq.fit(x, y, batch_size=64, epochs=10, verbose=0)
print("Training Complete.")
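
Because the cell above trains with `verbose=0`, nothing is printed about how training went. One way to inspect it is the `History` object returned by `fit()`. The sketch below uses random stand-in data (`x_demo`, `y_demo` are hypothetical, not the Iris arrays) and adds `validation_split`, which the original cell does not use, to hold out 20% of the samples.

```python
import numpy as np
import tensorflow as tf

# Random stand-in data with the same shapes as the Iris arrays (assumption).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(60, 4)).astype("float32")
y_demo = np.eye(3, dtype="float32")[rng.integers(0, 3, size=60)]

demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
demo_model.compile(loss="categorical_crossentropy", optimizer="adam")

# fit() returns a History object; .history maps metric names to per-epoch values.
history = demo_model.fit(
    x_demo, y_demo, batch_size=16, epochs=3, validation_split=0.2, verbose=0
)
losses = history.history["loss"]
val_losses = history.history["val_loss"]
print("Training loss per epoch:  ", losses)
print("Validation loss per epoch:", val_losses)
```

Comparing the two curves is a quick check for overfitting, which a final accuracy on the training data alone cannot reveal.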

### 3.3 Model B: The Functional API
We will create a model that takes **two inputs**:
1. The original raw features (4 dimensions).
2. PCA-reduced features (2 dimensions).

The model will process them in parallel branches and merge them.

In [None]:
K.clear_session()

# Prepare PCA data (2nd Input source)
pca_model = PCA(n_components=2, random_state=42)
x_pca = pca_model.fit_transform(x)

# --- Defining the Functional Graph ---

# 1. Define Inputs explicitly
input_raw = Input(shape=(4,), name='input_raw')
input_pca = Input(shape=(2,), name='input_pca')

# 2. Branch 1: Process Raw Data
x1 = Dense(16, activation='relu')(input_raw)

# 3. Branch 2: Process PCA Data
x2 = Dense(16, activation='relu')(input_pca)

# 4. Merge Branches
concat = Concatenate(axis=1)([x1, x2])

# 5. Post-Merge Processing
h = Dense(16, activation='relu')(concat)
output = Dense(3, activation='softmax')(h)

# 6. Instantiate Model
model_func = Model(inputs=[input_raw, input_pca], outputs=output)

model_func.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model_func.summary()

# Train (Pass inputs as a list)
print("\n--- Training Functional Model ---")
model_func.fit([x, x_pca], y, batch_size=64, epochs=10, verbose=0)
print("Training Complete.")
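
The Functional API also covers the residual (skip) connections mentioned in Section 2.1, which the Sequential API cannot express. The following is a minimal sketch, not a real ResNet: the layer sizes and the names `res_in`, `res_model` are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Add, Dense, Input

res_in = Input(shape=(4,), name="features")
h = Dense(16, activation="relu")(res_in)

# Residual block: transform h, then add the untouched h back to the result.
r = Dense(16, activation="relu")(h)
r = Dense(16)(r)
h = Add()([h, r])   # both tensors must have the same shape (here 16 units)

res_out = Dense(3, activation="softmax")(h)
res_model = tf.keras.Model(inputs=res_in, outputs=res_out)

preds = res_model.predict(np.zeros((2, 4), dtype="float32"), verbose=0)
print(preds.shape)  # (2, 3)
```

The tensor `h` is consumed twice, once by the block and once by `Add`, which is exactly the non-linear topology a linear stack of layers cannot describe.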

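### 3.4 Model C: The Subclassing API (Custom Layers)
Here we apply subclassing at the layer level: we define a custom layer `MulBiasDense` by inheriting from `tf.keras.layers.Layer`, then use it inside a Functional model. Subclassing `tf.keras.Model` follows the same pattern, with the forward pass written in `call()`. Unlike a standard dense layer, which calculates $y = \sigma(Wx + b)$, this layer calculates:
$$ y = \sigma((Wx + b) \times b_{mul}) $$
where $b_{mul}$ is a learned multiplicative bias, applied element-wise with one value per unit.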
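
Before writing the Keras layer, the formula can be worked through by hand. This NumPy sketch uses made-up numbers and ReLU as $\sigma$; the helper `mul_bias_dense` is hypothetical. Samples are stored as rows, so the product is written `x @ w`, which is also how the layer's `call()` computes it.

```python
import numpy as np

def mul_bias_dense(x, w, b, b_mul):
    """ReLU((x @ w + b) * b_mul), with samples stored as rows of x."""
    return np.maximum((x @ w + b) * b_mul, 0.0)

x_row = np.array([[1.0, 2.0]])              # one sample, two features
w = np.array([[1.0, -1.0], [0.5, 2.0]])     # two inputs -> two units
b = np.array([0.0, 1.0])                    # additive bias
b_mul = np.array([2.0, 0.5])                # multiplicative bias, one value per unit

# x @ w = [2, 3]; adding b gives [2, 4]; scaling by b_mul gives [4, 2].
y_mul = mul_bias_dense(x_row, w, b, b_mul)
# With b_mul fixed to ones, the layer reduces to an ordinary dense layer: [2, 4].
y_plain = mul_bias_dense(x_row, w, b, np.ones(2))

print(y_mul, y_plain)
```

This is why the layer initializes `mul_bias` with ones: at the start of training it behaves exactly like a standard dense layer.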

In [None]:
from tensorflow.keras.layers import Layer
from tensorflow.keras import activations

class MulBiasDense(Layer):
    def __init__(self, units=32, activation=None):
        super(MulBiasDense, self).__init__()
        self.units = units
        self.activation = activations.get(activation)

    def build(self, input_shape):
        # Create Weights (w)
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer='glorot_uniform',
            trainable=True,
            name='kernel'
        )
        # Create Additive Bias (b)
        self.b = self.add_weight(
            shape=(self.units,),
            initializer='zeros',
            trainable=True,
            name='bias'
        )
        # Create Multiplicative Bias (b_mul)
        self.b_mul = self.add_weight(
            shape=(self.units,),
            initializer='ones',
            trainable=True,
            name='mul_bias'
        )

    def call(self, inputs):
        # The computation logic
        out = (tf.matmul(inputs, self.w) + self.b) * self.b_mul
        return self.activation(out)

K.clear_session()

# Using the Custom Layer in a Functional Model
inp = Input(shape=(4,))
out = MulBiasDense(units=32, activation='relu')(inp)
out = Dense(3, activation='softmax')(out)

model_sub = Model(inputs=inp, outputs=out)
model_sub.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

# Train
print("\n--- Training Custom Layer Model ---")
model_sub.fit(x, y, batch_size=64, epochs=10, verbose=0)
print("Training Complete.")
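
The custom layer above subclasses `Layer` and is wired into a Functional model. For comparison, here is a minimal sketch of subclassing `tf.keras.Model` itself, as described in Section 2.1. The class name `SmallClassifier`, its arguments, and the random training data are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

class SmallClassifier(tf.keras.Model):
    """Hypothetical example model: a configurable stack of hidden layers."""

    def __init__(self, num_hidden_layers=2, hidden_units=16, num_classes=3):
        super().__init__()
        self.hidden = [
            tf.keras.layers.Dense(hidden_units, activation="relu")
            for _ in range(num_hidden_layers)
        ]
        self.classifier = tf.keras.layers.Dense(num_classes, activation="softmax")

    def call(self, inputs, training=False):
        h = inputs
        # Ordinary Python loop: the depth is chosen when the object is created.
        for layer in self.hidden:
            h = layer(h)
        return self.classifier(h)

# Random stand-in data with the same shapes as the Iris arrays (assumption).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(30, 4)).astype("float32")
y_demo = np.eye(3, dtype="float32")[rng.integers(0, 3, size=30)]

sub_model = SmallClassifier()
sub_model.compile(loss="categorical_crossentropy", optimizer="adam")
sub_model.fit(x_demo, y_demo, batch_size=10, epochs=2, verbose=0)

probs = sub_model.predict(x_demo[:5], verbose=0)
print(probs.shape)  # (5, 3)
```

A subclassed model still works with `compile()` and `fit()`. The trade-off is that Keras cannot inspect the layer graph ahead of time, so the weights are only created on the first call.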

## 4️⃣ Part 2: Retrieving Data for TensorFlow

In this section, we will build a production-grade data pipeline for image data. We will simulate a scenario where we have images on disk and labels in a CSV file.

### 4.1 Setup: Downloading Dummy Image Data
Since we don't have the local files mentioned in the book, we will download the specific 'flower_photos' dataset often used in TF tutorials to simulate the environment.

In [None]:
# Download Flower Dataset
import pathlib
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

# Create a CSV file simulating a real-world scenario (filename, label)
import csv

# We will only use 'roses' and 'daisy' for this small example to keep it fast
image_paths = list(data_dir.glob('roses/*')) + list(data_dir.glob('daisy/*'))
image_paths = [str(path) for path in image_paths]
labels = [0] * len(list(data_dir.glob('roses/*'))) + [1] * len(list(data_dir.glob('daisy/*')))

# Shuffle
rng = np.random.default_rng(42)
combined = list(zip(image_paths, labels))
rng.shuffle(combined)
image_paths, labels = zip(*combined)

# Save to CSV
csv_file = 'flower_labels.csv'
with open(csv_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['filename', 'label'])
    for img, lbl in zip(image_paths, labels):
        writer.writerow([img, lbl])

print(f"Created CSV with {len(image_paths)} records.")
print(f"Example path: {image_paths[0]}")

### 4.2 The `tf.data` API Pipeline

We will build a pipeline that:
1. Reads the CSV file.
2. Parses filenames and labels.
3. Loads the actual image from the disk.
4. Resizes and normalizes the image.
5. Batches and prefetches the data.

This is the most efficient way to feed data in TensorFlow.

In [None]:
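# 1. Create a Dataset from the CSV file
# header=True treats the first line as column names rather than data
csv_ds = tf.data.experimental.make_csv_dataset(
    csv_file,
    batch_size=1, # Read one by one initially
    header=True,
    num_epochs=1, # The default (None) repeats the file forever
    shuffle=False
)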

# 2. Transformation Functions
def process_csv_row(row):
    # Extract filename and label from the dictionary row returned by make_csv_dataset
    return row['filename'], row['label']

def load_and_preprocess_image(filename, label):
    # Read file from disk
    img_raw = tf.io.read_file(filename)
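    # Decode image (detects format automatically).
    # expand_animations=False guarantees a 3-D tensor; without it the shape is
    # unknown inside map() and tf.image.resize raises an error.
    img = tf.image.decode_image(img_raw, channels=3, expand_animations=False)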
    # Resize to fixed size (e.g., 64x64)
    img = tf.image.resize(img, [64, 64])
    # Normalize to [0, 1]
    img = img / 255.0
    return img, label

# 3. Construct the Pipeline
# Unbatch first because make_csv_dataset returns batches
train_ds = csv_ds.unbatch().map(process_csv_row)

# Map the image loading function
# num_parallel_calls=AUTOTUNE allows TF to load images in parallel using multiple CPU cores
train_ds = train_ds.map(load_and_preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)

# 4. Optimization: Shuffle, Batch, and Prefetch
BATCH_SIZE = 32
train_ds = train_ds.shuffle(buffer_size=100)
train_ds = train_ds.batch(BATCH_SIZE)
train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

# 5. Test the pipeline
print("\n--- Testing tf.data Pipeline ---")
for images, labels in train_ds.take(1):
    print("Batch of Images Shape:", images.shape)
    print("Batch of Labels Shape:", labels.shape)
    print("Sample Label:", labels[0].numpy())

### Step-by-Step Explanation of the Pipeline
1.  **`make_csv_dataset`**: Parses the CSV file into a dataset of column dictionaries, inferring the column types and reading rows lazily instead of loading the whole file at once.
2.  **`map`**: Applies transformations. `load_and_preprocess_image` contains the critical logic: `tf.io.read_file` brings bytes into memory, and `tf.image.decode_image` turns bytes into pixel tensors.
3.  **`num_parallel_calls=AUTOTUNE`**: Lets TensorFlow load and process several images concurrently on the available CPU cores, choosing the degree of parallelism at runtime.
4.  **`prefetch`**: Overlaps input processing with training. While the model trains on the current batch, the pipeline prepares the next ones, so the GPU rarely has to wait for data.
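
The chapter overview also lists caching, which the pipeline above does not use. Here is a minimal sketch of where `cache()` usually goes, assuming the preprocessed data fits in memory. The arrays `fake_images` and `fake_labels` are synthetic stand-ins for decoded images, and `scale` is a placeholder for expensive preprocessing.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for decoded images and labels (assumption: data fits in memory).
fake_images = np.random.rand(100, 8, 8, 3).astype("float32")
fake_labels = np.arange(100) % 2

def scale(image, label):
    # Placeholder for expensive, deterministic preprocessing.
    return image * 2.0 - 1.0, label

demo_ds = (
    tf.data.Dataset.from_tensor_slices((fake_images, fake_labels))
    .map(scale, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                     # store preprocessed elements after the first pass
    .shuffle(buffer_size=100)    # buffer >= dataset size gives a full shuffle
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

batch_sizes = [int(images.shape[0]) for images, _ in demo_ds]
print(batch_sizes)  # [32, 32, 32, 4]
```

Placing `cache()` after the deterministic `map` but before `shuffle` means the expensive work runs once, while every epoch still sees a different order. Random augmentation, if any, belongs after `cache()` so that it is not frozen.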

### 4.3 Keras DataGenerators
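Before `tf.data` became the standard, `ImageDataGenerator` was the usual way to load and augment images. The TensorFlow documentation now marks it as deprecated in favor of `tf.keras.utils.image_dataset_from_directory` and Keras preprocessing layers, but you will still meet it in older code, and it remains handy for quick prototypes involving image augmentation.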

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 1. Define Generator with Augmentation
datagen = ImageDataGenerator(
    rescale=1./255,         # Normalize
    rotation_range=20,      # Random rotation
    width_shift_range=0.2,  # Random shift
    horizontal_flip=True    # Random flip
)

# 2. Flow from DataFrame
# Load the CSV written earlier back into a DataFrame
df_flow = pd.read_csv(csv_file)
# flow_from_dataframe expects string labels when class_mode is 'binary' or 'categorical'
df_flow['label'] = df_flow['label'].astype(str)

generator = datagen.flow_from_dataframe(
    dataframe=df_flow,
    x_col='filename',
    y_col='label',
    target_size=(64, 64),
    batch_size=32,
    class_mode='binary'
)

print("\n--- Testing Keras ImageDataGenerator ---")
batch_x, batch_y = next(generator)
print("Batch Shape:", batch_x.shape)
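
For new code, TensorFlow's documentation recommends Keras preprocessing layers over `ImageDataGenerator`. The sketch below approximates the rescaling, rotation, and flip settings used above inside a `tf.data` pipeline. It assumes TensorFlow 2.6 or later, where these layers live directly under `tf.keras.layers`, and it uses random pixels (`fake_batch`) instead of the flower images. The width shift is left out; `RandomTranslation` would be the corresponding layer.

```python
import numpy as np
import tensorflow as tf

# Augmentation expressed as layers; random layers are only active when training=True.
augment = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),       # normalize to [0, 1]
    tf.keras.layers.RandomFlip("horizontal"),   # random horizontal flip
    tf.keras.layers.RandomRotation(20 / 360),   # factor is a fraction of a full turn (about 20 degrees)
])

# Random pixels standing in for real images (assumption for illustration).
fake_batch = np.random.randint(0, 256, size=(8, 64, 64, 3)).astype("float32")
fake_targets = np.zeros(8, dtype="int32")

aug_ds = (
    tf.data.Dataset.from_tensor_slices((fake_batch, fake_targets))
    .batch(4)
    .map(lambda img, lbl: (augment(img, training=True), lbl),
         num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

aug_images, aug_labels = next(iter(aug_ds))
print("Augmented Batch Shape:", aug_images.shape)
```

Because the augmentation runs inside `map`, it benefits from the same parallelism and prefetching as the rest of the pipeline.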

### 4.4 TensorFlow Datasets (TFDS)
Finally, the easiest way to access standard datasets. We will load **CIFAR-10** as an example.

In [None]:
import tensorflow_datasets as tfds

# Load CIFAR-10
# with_info=True returns metadata about the dataset
data, info = tfds.load("cifar10", with_info=True)

train_data = data['train']
test_data = data['test']

print("\n--- TFDS Info ---")
print("Dataset Size:", info.splits['train'].num_examples)
print("Features:", info.features)

# Pipeline for TFDS
def format_data(data):
    image = tf.cast(data['image'], tf.float32) / 255.0
    image = tf.image.resize(image, [64, 64])
    return image, data['label']

train_data = (
    train_data
    .map(format_data, num_parallel_calls=tf.data.AUTOTUNE)  # process examples in parallel
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

print("\n--- Testing TFDS Pipeline ---")
for img, label in train_data.take(1):
    print("Image Batch Shape:", img.shape)

## 5️⃣ Chapter Summary

In this chapter, we moved from basic TensorFlow primitives to professional-grade model building and data handling.

* **Keras APIs:**
    * Use **Sequential** for simple stacks of layers.
    * Use **Functional** for complex topologies (multi-input/output, shared layers).
    * Use **Subclassing** for custom layers and for models whose forward pass needs Python control flow or other dynamic behavior.
* **Data Pipelines:**
    * **`tf.data`** is the gold standard. It creates highly optimized, asynchronous pipelines that prevent GPU starvation.
    * **`ImageDataGenerator`** is convenient for quick augmentation but less scalable.
    * **`TFDS`** provides instant access to standard academic datasets.

In the next chapter, we will combine these skills to dip our toes into deep learning by building Fully Connected Networks, CNNs, and RNNs.