<a href="https://colab.research.google.com/github/Monaa48/TensorFlow-in-Action-starter/blob/main/notebooks/Ch03_Keras_and_data_retrieval_in_TF2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 03 — Keras and Data Retrieval in TensorFlow 2


## 1) Summary

This chapter is primarily about two things that I *need* before doing “real” deep learning:

- **How to define models in Keras** (Sequential, Functional, Subclassing)
- **How to feed data** in a clean way (tf.data, Keras generators, tfds)

I test each approach with small examples so I can see the differences without getting lost.


## 2) Setup


In [1]:
import random
import numpy as np
import tensorflow as tf

print("TensorFlow version:", tf.__version__)

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)


TensorFlow version: 2.19.0


## 3) Keras model-building APIs (with a small tabular dataset)

To keep it simple, I use the Iris dataset:
- 4 numeric features
- 3 classes

My goal here is not to beat SOTA accuracy, but to make sure I understand how each Keras API works.


### 3.1 Load and prepare the Iris dataset


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cols = ["sepal_len","sepal_wid","petal_len","petal_wid","label"]

df = pd.read_csv(url, header=None, names=cols).dropna()

label_to_id = {name:i for i,name in enumerate(sorted(df["label"].unique()))}
df["label_id"] = df["label"].map(label_to_id)

X = df[["sepal_len","sepal_wid","petal_len","petal_wid"]].values.astype("float32")
y = df["label_id"].values.astype("int32")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype("float32")
X_test  = scaler.transform(X_test).astype("float32")

num_classes = len(label_to_id)
print("Classes:", label_to_id)
print("Train:", X_train.shape, y_train.shape)
print("Test :", X_test.shape, y_test.shape)


Classes: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
Train: (120, 4) (120,)
Test : (30, 4) (30,)
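
The scaling step above fits the scaler on the training split only and then reuses those statistics on the test split. Here is a minimal NumPy sketch of the same idea, using small made-up arrays (`train` and `test` are hypothetical, not the Iris data):

```python
import numpy as np

# Hypothetical toy data: 4 training rows, 2 test rows, 2 features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]], dtype="float32")
test = np.array([[2.5, 25.0], [5.0, 50.0]], dtype="float32")

# "fit": learn per-feature statistics from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# "transform": apply the training statistics to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("train mean after scaling:", train_scaled.mean(axis=0))
print("test row 0 after scaling:", test_scaled[0])
```

The test split is never used to compute `mean` or `std`, so no information from it leaks into preprocessing.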


I wrap the arrays in a `tf.data.Dataset` so the same batched pipeline can be passed to `fit()` and `evaluate()` for the models below.

In [3]:
BATCH_SIZE = 16

train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(len(X_train), seed=SEED).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
test_ds = test_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

next(iter(train_ds))


(<tf.Tensor: shape=(16, 4), dtype=float32, numpy=
 array([[-1.124492  ,  0.1258048 , -1.2902186 , -1.4516375 ],
        [ 1.8608855 , -0.5501623 ,  1.3233619 ,  0.91480553],
        [-1.0050771 ,  0.35112754, -1.4606694 , -1.3201684 ],
        [-1.4827378 ,  0.35112754, -1.3470356 , -1.3201684 ],
        [ 1.6220549 ,  0.35112754,  1.2665449 ,  0.78333646],
        [-1.124492  ,  0.1258048 , -1.2902186 , -1.4516375 ],
        [ 0.5473195 ,  0.8017725 ,  1.039277  ,  1.5721511 ],
        [-0.16917162,  3.0549965 , -1.2902186 , -1.0572304 ],
        [-0.40800163,  1.0270947 , -1.4038525 , -1.3201684 ],
        [-1.3633227 ,  0.35112754, -1.4038525 , -1.3201684 ],
        [-0.88566214,  1.0270947 , -1.3470356 , -1.1886994 ],
        [ 1.144395  ,  0.35112754,  1.209728  ,  1.4406818 ],
        [-0.28858662, -0.09951741,  0.18702246,  0.12599114],
        [ 2.2191305 , -0.09951741,  1.3233619 ,  1.4406818 ],
        [-0.40800163, -1.0008073 ,  0.35747346, -0.0054778 ],
        [ 0.3084889 
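
To see what `shuffle(...).batch(...)` does to the 120 training rows, here is a small sketch with placeholder zeros in place of the Iris arrays (the array contents are an assumption; only the shapes match the notebook):

```python
import numpy as np
import tensorflow as tf

# Placeholder arrays with the same shapes as X_train / y_train above.
features = np.zeros((120, 4), dtype="float32")
labels = np.zeros((120,), dtype="int32")

ds = tf.data.Dataset.from_tensor_slices((features, labels))
ds = ds.shuffle(len(features), seed=42).batch(16).prefetch(tf.data.AUTOTUNE)

# element_spec describes one batch; the leading None is the batch dimension,
# which is not fixed because the last batch can be smaller.
print(ds.element_spec)

batch_sizes = [int(x.shape[0]) for x, _ in ds]
print("batches per epoch:", len(batch_sizes))  # ceil(120 / 16)
print("last batch size:", batch_sizes[-1])     # 120 - 7 * 16
```

Passing `drop_remainder=True` to `batch()` would drop that short final batch, which matters only when a model needs a fixed batch size.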

### 3.2 Sequential API

This is the “stack layers in order” approach. It works well when the model is a single straight pipeline: one input, one output, no branches.


In [4]:
seq_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(num_classes)  # logits
])

seq_model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

seq_model.fit(train_ds, validation_data=test_ds, epochs=20, verbose=0)
loss, acc = seq_model.evaluate(test_ds, verbose=0)
print("Sequential — test loss:", loss, "| test acc:", acc)


Sequential — test loss: 0.41009896993637085 | test acc: 0.7666666507720947
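
Because the last `Dense` layer has no activation, the model returns logits rather than probabilities, which is why the loss is built with `from_logits=True`. Below is a minimal sketch of turning logits into probabilities and class ids at prediction time; the `logits` and `y_true` values are made up for illustration:

```python
import tensorflow as tf

# Hypothetical logits for 2 samples and 3 classes (stand-in for seq_model(X)).
logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 0.2, 3.0]])

probs = tf.nn.softmax(logits, axis=-1)   # each row now sums to 1
pred_ids = tf.argmax(probs, axis=-1)     # index of the most likely class

# The same loss class the notebook compiles with, applied to made-up labels.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
y_true = tf.constant([0, 2])

print("probabilities:\n", probs.numpy())
print("predicted ids:", pred_ids.numpy())
print("loss:", float(loss_fn(y_true, logits)))
```

Keeping the softmax out of the model and inside the loss is a common choice because the combined computation is numerically more stable.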


### 3.3 Functional API (multi-input example)

The Functional API feels like wiring a graph by hand: each layer is called on the output of another layer. To show why it matters, I split the Iris features into 2 inputs:
- sepal (2 features)
- petal (2 features)

Then I process them separately and concatenate.


In [5]:
# Split features: first 2 = sepal, last 2 = petal
X_train_sepal, X_train_petal = X_train[:, :2], X_train[:, 2:]
X_test_sepal,  X_test_petal  = X_test[:, :2],  X_test[:, 2:]

train_ds2 = tf.data.Dataset.from_tensor_slices(((X_train_sepal, X_train_petal), y_train))
train_ds2 = train_ds2.shuffle(len(X_train), seed=SEED).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_ds2 = tf.data.Dataset.from_tensor_slices(((X_test_sepal, X_test_petal), y_test))
test_ds2 = test_ds2.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

in_sepal = tf.keras.Input(shape=(2,), name="sepal")
in_petal = tf.keras.Input(shape=(2,), name="petal")

x1 = tf.keras.layers.Dense(8, activation="relu")(in_sepal)
x2 = tf.keras.layers.Dense(8, activation="relu")(in_petal)

x = tf.keras.layers.Concatenate()([x1, x2])
x = tf.keras.layers.Dense(16, activation="relu")(x)
out = tf.keras.layers.Dense(num_classes)(x)

func_model = tf.keras.Model(inputs=[in_sepal, in_petal], outputs=out)

func_model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

func_model.fit(train_ds2, validation_data=test_ds2, epochs=20, verbose=0)
loss, acc = func_model.evaluate(test_ds2, verbose=0)
print("Functional — test loss:", loss, "| test acc:", acc)


Functional — test loss: 0.39869073033332825 | test acc: 0.8999999761581421
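
One way to see the graph wiring is to inspect the symbolic tensors as they pass through the layers. The sketch below rebuilds the same two-branch layout (untrained, with the number of classes fixed to 3 as an assumption) and checks the shapes at each step:

```python
import numpy as np
import tensorflow as tf

in_sepal = tf.keras.Input(shape=(2,), name="sepal")
in_petal = tf.keras.Input(shape=(2,), name="petal")

x1 = tf.keras.layers.Dense(8, activation="relu")(in_sepal)
x2 = tf.keras.layers.Dense(8, activation="relu")(in_petal)
merged = tf.keras.layers.Concatenate()([x1, x2])
out = tf.keras.layers.Dense(3)(merged)  # 3 = number of Iris classes

# Symbolic tensors carry static shapes; None is the batch dimension.
print("branch:", tuple(x1.shape), "| merged:", tuple(merged.shape))

model = tf.keras.Model(inputs=[in_sepal, in_petal], outputs=out)

# The model expects one array per input, in the order given to `inputs=`.
sepal = np.random.rand(5, 2).astype("float32")
petal = np.random.rand(5, 2).astype("float32")
logits = model([sepal, petal])
print("output:", tuple(logits.shape))
```

The two 8-unit branches concatenate into 16 features, which is why the dataset has to yield a `(sepal, petal)` pair for every label.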


### 3.4 Subclassing API (custom behavior)

Subclassing is for when I want full control. I create the layers in `__init__` and write the forward pass in `call()`, like normal Python.

I also try writing a tiny custom Dense-like layer just to see how `build()` and weights work.


In [6]:
class MyDense(tf.keras.layers.Layer):
    def __init__(self, units, activation=None):
        super().__init__()
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.W = self.add_weight(shape=(in_dim, self.units), initializer="glorot_uniform", trainable=True)
        self.b = self.add_weight(shape=(self.units,), initializer="zeros", trainable=True)

    def call(self, x):
        y = tf.matmul(x, self.W) + self.b
        return self.activation(y) if self.activation is not None else y

class IrisSubclassModel(tf.keras.Model):
    def __init__(self, num_classes):
        super().__init__()
        self.d1 = MyDense(16, activation="relu")
        self.d2 = MyDense(16, activation="relu")
        self.out = MyDense(num_classes)

    def call(self, x):
        x = self.d1(x)
        x = self.d2(x)
        return self.out(x)

sub_model = IrisSubclassModel(num_classes)

sub_model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

sub_model.fit(train_ds, validation_data=test_ds, epochs=20, verbose=0)
loss, acc = sub_model.evaluate(test_ds, verbose=0)
print("Subclassing — test loss:", loss, "| test acc:", acc)


Subclassing — test loss: 0.6938612461090088 | test acc: 0.800000011920929
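
The part of this pattern that is easy to miss is that `build()` runs lazily: a layer has no weights until it sees its first input. Here is a small sketch with a hypothetical `TinyDense` layer (same pattern as `MyDense`, minus the activation) that makes this visible:

```python
import tensorflow as tf

class TinyDense(tf.keras.layers.Layer):
    """Hypothetical minimal layer, following the same pattern as MyDense."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        # Runs once, the first time the layer is called on an input.
        self.W = self.add_weight(
            shape=(int(input_shape[-1]), self.units),
            initializer="glorot_uniform",
            trainable=True,
        )
        self.b = self.add_weight(
            shape=(self.units,), initializer="zeros", trainable=True
        )

    def call(self, x):
        return tf.matmul(x, self.W) + self.b

layer = TinyDense(3)
print("weights before first call:", len(layer.weights))

y = layer(tf.zeros((2, 4)))  # triggers build() with a last dimension of 4
print("weights after first call:", [tuple(w.shape) for w in layer.weights])
print("output shape:", tuple(y.shape))
```

This is why the constructor only needs `units`: the input dimension is read from the data at the first call.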


## 4) Data retrieval options (how I can feed data)

In the chapter, data input shows up in multiple forms. The three common ones I see are:

1. **`tf.data` pipelines** (most flexible, good for custom logic)
2. **Keras generators** (very convenient for directory-based images)
3. **`tensorflow-datasets` (tfds)** (fast access to standardized datasets)


### 4.1 `tf.data` pipeline for images via a CSV (filename, label)

A common “real dataset” pattern is: images on disk, plus a CSV file that maps filename → label.

Steps I follow:
- read CSV lines
- parse filename + label
- load image bytes, decode, resize, normalize
- batch + prefetch

I use TensorFlow's flower photos dataset because it’s a simple folder dataset I can download quickly.


In [9]:
import csv, pathlib

flowers_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
tgz_path = tf.keras.utils.get_file("flower_photos.tgz", flowers_url, extract=True)
data_root = pathlib.Path(tgz_path).with_suffix("") / "flower_photos"  # the images sit in a flower_photos/ folder inside the extracted archive
print("Data root:", data_root)

class_names = sorted([p.name for p in data_root.iterdir() if p.is_dir()])
name_to_id = {name:i for i,name in enumerate(class_names)}
print("Classes:", class_names)

rows = []
max_per_class = 50
for cname in class_names:
    files = sorted((data_root/cname).glob("*.jpg"))[:max_per_class]  # sort so the same files are picked on every run
    for f in files:
        rows.append((str(f), name_to_id[cname]))

csv_path = "/tmp/flowers_small.csv"
with open(csv_path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["filename","label"])
    w.writerows(rows)

print("CSV rows:", len(rows))
print("CSV path:", csv_path)

Data root: /root/.keras/datasets/flower_photos_extracted/flower_photos
Classes: ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
CSV rows: 250
CSV path: /tmp/flowers_small.csv


In [11]:
IMG_SIZE = (64, 64)
NUM_CLASSES = len(class_names)

def parse_csv_line(line):
    parts = tf.strings.split(line, sep=",")
    fname = parts[0]
    label = tf.strings.to_number(parts[1], out_type=tf.int32)
    return fname, label

def load_and_preprocess(path, label):
    img_bytes = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img_bytes, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    img = tf.cast(img, tf.float32) / 255.0
    label_oh = tf.one_hot(label, depth=NUM_CLASSES)
    return img, label_oh

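ds = tf.data.TextLineDataset(csv_path).skip(1)  # skip the header row
# The CSV is ordered by class, so the shuffle buffer must cover all rows;
# shuffling the text lines (before decoding) keeps the buffer cheap.
ds = ds.shuffle(len(rows), seed=SEED)
ds = ds.map(parse_csv_line, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(16).prefetch(tf.data.AUTOTUNE)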

batch_imgs, batch_labels = next(iter(ds))
print("Image batch:", batch_imgs.shape, batch_imgs.dtype)
print("Label batch:", batch_labels.shape, batch_labels.dtype)


Image batch: (16, 64, 64, 3) <dtype: 'float32'>
Label batch: (16, 5) <dtype: 'float32'>
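
Splitting on a comma with `tf.strings.split` works here because none of the file paths contain a comma. If that cannot be guaranteed, `tf.io.decode_csv` is one alternative, since it understands quoted fields. Here is a small sketch with made-up lines (the file names are hypothetical, and the function name is reused only for illustration):

```python
import tensorflow as tf

def parse_csv_line(line):
    # record_defaults fixes the dtype of each column; an empty default
    # marks the column as required.
    fname, label = tf.io.decode_csv(
        line,
        record_defaults=[tf.constant([], tf.string), tf.constant([], tf.int32)],
    )
    return fname, label

lines = tf.data.Dataset.from_tensor_slices([
    'daisy/img_001.jpg,0',
    '"roses/red, large.jpg",2',  # quoted field that contains a comma
])

parsed = [(f.numpy().decode(), int(l)) for f, l in lines.map(parse_csv_line)]
for fname, label in parsed:
    print(fname, "->", label)
```

Python's `csv.writer` quotes such fields automatically when writing, so a CSV produced the way this notebook does it would round-trip through this parser.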


### 4.2 Keras generators (`ImageDataGenerator`)

If my image dataset already follows the folder layout:

```
root/
  classA/
  classB/
```

Then `flow_from_directory()` can produce batches with labels automatically.
This is usually my fast way to prototype.


In [12]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    data_root,
    target_size=IMG_SIZE,
    batch_size=16,
    subset="training",
    seed=SEED
)

val_gen = datagen.flow_from_directory(
    data_root,
    target_size=IMG_SIZE,
    batch_size=16,
    subset="validation",
    seed=SEED
)

x_batch, y_batch = next(train_gen)
print("Generator batch:", x_batch.shape, y_batch.shape)


Found 2939 images belonging to 5 classes.
Found 731 images belonging to 5 classes.
Generator batch: (16, 64, 64, 3) (16, 5)
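
The TensorFlow documentation marks `ImageDataGenerator` as deprecated and points to `tf.keras.utils.image_dataset_from_directory`, which reads the same folder layout but returns a `tf.data.Dataset`. Below is a minimal sketch, using a temporary folder of synthetic images so it runs on its own (the folder, class names, and image contents are all made up; in this notebook `data_root` would take the place of `root`):

```python
import pathlib
import tempfile

import numpy as np
import tensorflow as tf

# Build a tiny throwaway dataset: 2 classes x 5 random JPEGs each.
root = pathlib.Path(tempfile.mkdtemp())
rng = np.random.default_rng(0)
for cname in ["classA", "classB"]:
    (root / cname).mkdir()
    for i in range(5):
        pixels = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
        tf.io.write_file(str(root / cname / f"{i}.jpg"), tf.io.encode_jpeg(pixels))

def make_split(subset):
    ds = tf.keras.utils.image_dataset_from_directory(
        str(root),
        image_size=(64, 64),
        batch_size=4,
        label_mode="categorical",  # one-hot labels, like flow_from_directory
        validation_split=0.2,
        subset=subset,
        seed=42,                   # same seed for both subsets so they do not overlap
    )
    # Images come back as float32 in [0, 255]; rescale like rescale=1./255.
    return ds.map(lambda x, y: (x / 255.0, y)).prefetch(tf.data.AUTOTUNE)

train_img_ds = make_split("training")
val_img_ds = make_split("validation")

x_batch, y_batch = next(iter(train_img_ds))
print("Dataset batch:", x_batch.shape, y_batch.shape)
```

Because the result is a regular `tf.data.Dataset`, it can be combined with the same `map`, `cache`, and `prefetch` steps used in section 4.1.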


### 4.3 `tensorflow-datasets` (tfds)

`tfds` mainly provides ready-to-use datasets that download and prepare themselves on first use. I load MNIST to show the workflow.


In [13]:
import tensorflow_datasets as tfds

mnist = tfds.load("mnist", split="train", as_supervised=True)
mnist = mnist.map(lambda x,y: (tf.cast(x, tf.float32)/255.0, y)).batch(32).prefetch(tf.data.AUTOTUNE)

img, lbl = next(iter(mnist))
print("MNIST batch:", img.shape, lbl.shape)




Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
MNIST batch: (32, 28, 28, 1) (32,)
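
The pipeline above only normalizes and batches. For training, a typical ordering adds caching and shuffling before the batch step. Below is a sketch of that ordering, using random arrays shaped like MNIST as a stand-in so it runs without downloading anything (with tfds, `raw` would be the dataset returned by `tfds.load`; the sizes here are arbitrary):

```python
import numpy as np
import tensorflow as tf

# Stand-in for tfds.load("mnist", split="train", as_supervised=True):
# 256 fake uint8 images of shape (28, 28, 1) with integer labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(256, 28, 28, 1), dtype=np.uint8)
labels = rng.integers(0, 10, size=(256,), dtype=np.int64)
raw = tf.data.Dataset.from_tensor_slices((images, labels))

def normalize(img, label):
    # uint8 [0, 255] -> float32 [0, 1]
    return tf.cast(img, tf.float32) / 255.0, label

train = (
    raw.map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                 # keep the normalized examples in memory
    .shuffle(256, seed=42)   # buffer covers the whole (small) dataset
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

img, lbl = next(iter(train))
print("batch:", img.shape, lbl.shape, img.dtype)
```

Shuffling before batching mixes individual examples; shuffling after batching would only reorder whole batches.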


## 5) Takeaways

- Sequential is easiest for straight stacks; Functional is for graphs (multi-input/multi-output); Subclassing is for full flexibility.
- For tabular data, `tf.data.Dataset.from_tensor_slices()` is simple and clean.
- For images: with a class-per-folder layout, a generator is the quickest to set up; with a CSV or custom logic, `tf.data` gives the most control.
- tfds is great when I just want a standard benchmark dataset quickly.
