# Chapter 7: Teaching machines to see better: Improving CNNs and making them confess

This notebook reproduces the code and summarizes the theoretical concepts from Chapter 7 of *'TensorFlow in Action'* by Thushan Ganegedara.

In Chapter 6, we built an Inception v1 model that suffered from severe **overfitting** (high training accuracy, low validation accuracy). This chapter focuses on practical techniques to solve that problem and significantly improve our model's performance. 

We will cover:
1.  **Regularization Techniques**: Using Image Data Augmentation, Dropout, and Early Stopping to combat overfitting.
2.  **A Better Architecture (Minception)**: Implementing a more modern architecture inspired by Inception-ResNet, which uses Batch Normalization and Residual Connections.
3.  **Transfer Learning**: Using a large, pretrained model (Inception-ResNet v2) to get state-of-the-art results.
4.  **Model Explainability (Grad-CAM)**: Visualizing *why* our CNN makes certain decisions.

## Setup: Data Pipeline from Chapter 6

Before we can improve the model, we need the same data pipeline from Chapter 6. We'll use the **tiny-imagenet-200** dataset and the `ImageDataGenerator`.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models, Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import CSVLogger, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.models import load_model
import tensorflow.keras.backend as K
from functools import partial
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Set a random seed for reproducibility
random_seed = 4321
np.random.seed(random_seed)
tf.random.set_seed(random_seed)

# Define file paths (assuming data is in 'data/tiny-imagenet-200')
data_dir = os.path.join('data', 'tiny-imagenet-200')
train_image_dir = os.path.join(data_dir, 'train')
val_dir = os.path.join(data_dir, 'val')
val_ann_path = os.path.join(val_dir, 'val_annotations.txt')

# Helper function to read the test (validation) annotations
def get_test_labels_df(test_labels_path):
    test_df = pd.read_csv(test_labels_path, sep='\t', index_col=None, header=None)
    test_df = test_df.iloc[:, [0, 1]].rename({0: "filename", 1: "class"}, axis=1)
    return test_df

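# Helper function to create the auxiliary data generator
# Our Inception model has 3 outputs, so the generator must yield (x, (y, y, y))
def data_gen_aux(gen):
    for x, y in gen:
        yield x, (y, y, y)

# Helper function to compute the number of batches per epoch
# (used later as steps_per_epoch / validation_steps in model.fit)
def get_steps_per_epoch(n_data, batch_size):
    return int(np.ceil(n_data / batch_size))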

---

## 7.1 Techniques for Reducing Overfitting

**Overfitting** is when a model learns the training data *too* well, including its noise and random fluctuations. It memorizes the training examples instead of learning the general patterns. This results in high training accuracy but poor performance on new, unseen data (like the validation or test set).

We will apply three techniques to fight this.

### 7.1.1 Image Data Augmentation with Keras

Data augmentation artificially creates more training data by applying random transformations to the existing images (e.g., rotating, shifting, zooming, and flipping). This teaches the model that these transformed images all belong to the same class, making it more robust and less likely to overfit on specific orientations.

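Ideally, augmentation is applied only to the **training set**, while the validation and test sets remain unchanged to serve as a consistent benchmark. Note that the code below draws the validation subset from the same augmenting generator (via `validation_split`), so validation images are augmented as well; only the test generator is free of augmentation.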
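
To make the idea concrete, here is a minimal NumPy sketch of two label-preserving transformations: a horizontal flip and a small horizontal shift whose gap is filled by mirroring edge pixels. It is not how Keras implements augmentation internally; the helper names (`horizontal_flip`, `shift_right`, `augment`) and the toy 4x4 image are assumptions made purely for illustration.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an (H, W, C) image left-to-right."""
    return image[:, ::-1, :]

def shift_right(image, pixels):
    """Shift an (H, W, C) image right by `pixels`, mirroring edge pixels into the gap."""
    if pixels == 0:
        return image.copy()
    padded = np.pad(image, ((0, 0), (pixels, 0), (0, 0)), mode='symmetric')
    return padded[:, :image.shape[1], :]

def augment(image, rng, max_shift=2):
    """Apply a random flip and a random right shift; the class label is untouched."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return shift_right(image, int(rng.integers(0, max_shift + 1)))

rng = np.random.default_rng(4321)
toy_image = np.arange(16, dtype=np.float32).reshape(4, 4, 1)

# Every call yields a different variant of the same underlying image
augmented = [augment(toy_image, rng) for _ in range(3)]
for variant in augmented:
    print(variant[..., 0])
```

Each variant keeps the original shape and the original label, which is what lets the model see "more" data without any new annotation effort.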

In [None]:
batch_size = 128
# Tiny ImageNet images are 64x64. We keep that size here because the Minception
# model defined later expects 64x64 inputs and crops them to 56x56 internally.
target_size = (64, 64)

# 1. Define the generator for Training and Validation WITH AUGMENTATION
image_gen_aug = ImageDataGenerator(
    samplewise_center=True,      # Subtract each image's own mean (centering only)
    rotation_range=30,           # Randomly rotate up to 30 degrees
    width_shift_range=0.2,       # Randomly shift horizontally by up to 20% of the width
    height_shift_range=0.2,      # Randomly shift vertically by up to 20% of the height
    brightness_range=(0.5, 1.5), # Randomly scale brightness between 0.5x and 1.5x
    shear_range=5,               # Random shear of up to 5 degrees
    zoom_range=0.2,              # Randomly zoom in or out by up to 20%
    horizontal_flip=True,        # Randomly flip horizontally
    fill_mode='reflect',         # Fill pixels exposed by a transform by mirroring the image
    validation_split=0.1         # Reserve 10% of the data for validation
)

# 2. Define the generator for Test data (NO AUGMENTATION, only normalization)
image_gen_test = ImageDataGenerator(samplewise_center=True)

# 3. Create the Training and Validation Generators
partial_flow_func = partial(
    image_gen_aug.flow_from_directory,
    directory=train_image_dir,
    target_size=target_size,
    class_mode='categorical',
    batch_size=batch_size,
    shuffle=True,
    seed=random_seed
)
train_gen = partial_flow_func(subset='training')
valid_gen = partial_flow_func(subset='validation')

# 4. Create the Test Generator
test_df = get_test_labels_df(val_ann_path)
test_gen = image_gen_test.flow_from_dataframe(
    dataframe=test_df,
    directory=os.path.join(val_dir, 'images'),
    x_col='filename',
    y_col='class',
    target_size=target_size,
    class_mode='categorical',
    batch_size=batch_size,
    shuffle=False
)

print("Data generators with augmentation are ready.")

### 7.1.2 Dropout

**Dropout** is a regularization technique where, during each training step, a random fraction of neurons (e.g., 40%) is "dropped out", i.e., temporarily switched off. At inference time, all neurons are active.

This prevents the network from becoming too reliant on any single neuron or feature. It forces the network to learn redundant representations, which makes it more generalizable.
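
The mechanism can be sketched in a few lines of NumPy. This is an illustrative "inverted dropout" implementation, not the Keras source: the function name `dropout` and its arguments are assumptions. Like the Keras `Dropout` layer, it rescales the surviving units by `1 / (1 - rate)` so the expected activation stays the same between training and inference.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction `rate` of units while training."""
    if not training or rate == 0.0:
        return activations  # inference: every unit stays active
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(4321)
x = np.ones((1000, 100))

train_out = dropout(x, rate=0.4, training=True, rng=rng)
infer_out = dropout(x, rate=0.4, training=False, rng=rng)

print("fraction dropped while training:", np.mean(train_out == 0))
print("mean activation while training: ", train_out.mean())
print("unchanged at inference:         ", np.array_equal(infer_out, x))
```

Roughly 40% of the units are zeroed in the training pass, yet the mean activation stays close to 1 because of the rescaling.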

We will re-define the `aux_out` function and the main model's output to include `Dropout` layers, as specified in the Inception v1 paper (which we omitted in Ch. 6 for simplicity).

In [None]:
# Re-defining the 'aux_out' function from Chapter 6, but with Dropout
def aux_out_with_dropout(inp, name=None):
    avgpool1 = layers.AvgPool2D((5,5), strides=(3,3), padding='valid')(inp) 
    conv1 = layers.Conv2D(128, (1,1), activation='relu', padding='same')(avgpool1) 
    flat = layers.Flatten()(conv1) 
    dense1 = layers.Dense(1024, activation='relu')(flat) 
    # Add Dropout(0.7) as specified in the original Inception paper
    dropout1 = layers.Dropout(0.7)(dense1)
    aux_out = layers.Dense(200, activation='softmax', name=name)(dropout1) # 200 classes
    return aux_out

# We would then build the Inception v1 model, but add a Dropout(0.4) layer
# before the final prediction layer.

# (Conceptual model snippet showing where Dropout is added)
# ... (Inception blocks) ...
# avgpool_final = layers.AvgPool2D((7,7), strides=(1,1), padding='valid')(inc_5b)
# flat_out = layers.Flatten()(avgpool_final)
# -- ADD DROPOUT HERE --
# dropout_final = layers.Dropout(0.4)(flat_out)
# main_output = layers.Dense(200, activation='softmax', name='final')(dropout_final)
# ... (rest of the model) ...
print("Dropout concept added to model definition.")

### 7.1.3 Early Stopping

**Early Stopping** is a technique to stop the training process automatically when the model's performance on the *validation set* stops improving. 

We monitor a specific metric (e.g., `val_loss`). If that metric doesn't improve for a set number of epochs (called `patience`), we halt training. This prevents the model from continuing to train into an overfitted state.
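
The patience rule can be sketched in plain Python. This is a simplified illustration (it ignores options such as `min_delta` that the Keras callback offers); the function name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never triggers."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value is reached at epoch 3
val_losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.4, 1.35, 1.5, 1.1]
stop_epoch = early_stopping_epoch(val_losses, patience=5)
print("training stops after epoch", stop_epoch)
```

With `patience=5`, training halts five epochs after the last improvement, so the lower loss at epoch 9 is never reached. This is the trade-off behind choosing the patience value.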

In [None]:
# Define the EarlyStopping callback
es_callback = EarlyStopping(
    monitor='val_loss', # Monitor the validation loss
    patience=5          # Stop if it doesn't improve for 5 epochs
)

print("EarlyStopping callback defined.")

# When fitting the model, we would pass this in the 'callbacks' list:
# model.fit(
#     train_gen_aux, 
#     validation_data=valid_gen_aux, 
#     epochs=50, 
#     callbacks=[es_callback, csv_logger]
# )

---

## 7.2 Toward minimalism: Minception instead of Inception

The Inception v1 architecture is effective but somewhat outdated. The book proposes building a *new* model, **"Minception"**, inspired by the more modern Inception-ResNet v2. This model introduces two powerful concepts: Batch Normalization and Residual Connections.

### Batch Normalization (BN)
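BN normalizes the output of a layer by re-centering and re-scaling the activations using the mean and variance of the current mini-batch. It was introduced to reduce "internal covariate shift", where the distribution of each layer's inputs changes during training as the weights of the preceding layers are updated.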

**Benefits:**
* Allows for much faster training (higher learning rates).
* Stabilizes the training process.
* Acts as a regularizer, sometimes replacing the need for Dropout.

It's typically applied **after** the convolution/dense layer and **before** the activation function.
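
As a rough illustration of the training-time computation, the NumPy sketch below normalizes each channel of a batch and then applies a learnable scale (`gamma`) and shift (`beta`). The function name `batch_norm_train` is an assumption for this example, and the sketch omits the moving averages that a real layer maintains for use at inference.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize an (N, H, W, C) batch per channel, then scale and shift."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # re-center and re-scale
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(4321)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 4, 2))  # activations far from zero mean
gamma = np.ones(2)
beta = np.zeros(2)

y = batch_norm_train(x, gamma, beta)
print("per-channel mean before:", x.mean(axis=(0, 1, 2)))
print("per-channel mean after: ", y.mean(axis=(0, 1, 2)))
print("per-channel std after:  ", y.std(axis=(0, 1, 2)))
```

After normalization each channel has approximately zero mean and unit standard deviation, regardless of the scale of the incoming activations.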

### Residual Connections (Skip Connections)
A residual connection allows the input of a layer (or block) to be added directly to its output. 

`output = layers.Add()([layer_output, layer_input])`

This creates a "shortcut" for the gradient, allowing it to flow directly back through the network. This makes it possible to train much deeper networks (e.g., 100+ layers) without suffering from the vanishing gradient problem.
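
A toy calculation shows why the shortcut helps. The sketch below assumes a deliberately simplified network in which each "layer" is a scalar multiplication `y = w * x`; the function name `chain_gradient` and the weight value are illustrative only. Without the shortcut the gradient is the product of the weights, while with it each layer contributes `w + 1`.

```python
def chain_gradient(weights, residual):
    """d(output)/d(input) for a stack of scalar layers y = w * x (optionally + x)."""
    grad = 1.0
    for w in weights:
        # Local derivative: d(w*x + x)/dx with the shortcut, d(w*x)/dx without it
        local = w + 1.0 if residual else w
        grad *= local
    return grad

weights = [0.01] * 50  # 50 layers with small weights

plain_grad = chain_gradient(weights, residual=False)
residual_grad = chain_gradient(weights, residual=True)

print("plain stack:   ", plain_grad)
print("residual stack:", residual_grad)
```

In the plain stack the gradient shrinks to a vanishingly small number, whereas the residual stack keeps it at a usable magnitude because the identity path always contributes a derivative of 1.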

### 7.2.1-7.2.5 Implementing the Minception Model

We will now build the Minception model piece by piece using the Functional API.

In [None]:
from tensorflow.keras.layers import (
    Input, Conv2D, MaxPool2D, AvgPool2D, Dense, 
    Concatenate, Flatten, BatchNormalization, Activation, Add
)
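try:
    # Newer TensorFlow releases expose the preprocessing layers directly
    from tensorflow.keras.layers import RandomCrop, RandomContrast
except ImportError:
    # Older TensorFlow 2.x releases keep them under the experimental namespace
    from tensorflow.keras.layers.experimental.preprocessing import RandomCrop, RandomContrast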

# We'll use a standard initializer
init = 'glorot_uniform'

def bn_relu(inp):
    """Helper function for Batch Norm -> ReLU."""
    bn = BatchNormalization()(inp)
    return Activation('relu')(bn)

# 1. The Stem (based on Listing 7.6, simplified for clarity)
def stem(inp, activation='relu', bn=True):
    conv1_1 = Conv2D(32, (3,3), strides=(2,2), activation=None, kernel_initializer=init, padding='same')(inp)
    conv1_1 = bn_relu(conv1_1)
    conv1_2 = Conv2D(32, (3,3), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(conv1_1)
    conv1_2 = bn_relu(conv1_2)
    conv1_3 = Conv2D(64, (3,3), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(conv1_2)
    conv1_3 = bn_relu(conv1_3)
    
    split_1_pool = MaxPool2D((3,3), strides=(2,2), padding='same')(conv1_3)
    split_1_conv = Conv2D(96, (3,3), strides=(2,2), activation=None, kernel_initializer=init, padding='same')(conv1_3)
    split_1_conv = bn_relu(split_1_conv)
    
    out_split_1 = Concatenate(axis=-1)([split_1_pool, split_1_conv])
    # ... (omitting the rest of the complex stem for this summary)
    # The book's stem is quite complex. We will use a simplified stem 
    # for this notebook to focus on the Inception-ResNet blocks.
    return out_split_1

# 2. Inception-ResNet Block A (based on Listing 7.7)
def inception_resnet_a(inp, n_filters, activation='relu', bn=True, res_w=0.1):
    # Branch 1 (1x1)
    out1_1 = Conv2D(n_filters[0][0], (1,1), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(inp)
    out1_1 = bn_relu(out1_1)
    
    # Branch 2 (1x1 -> 3x3)
    out2_1 = Conv2D(n_filters[1][0], (1,1), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(inp)
    out2_1 = bn_relu(out2_1)
    out2_2 = Conv2D(n_filters[1][1], (3,3), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(out2_1)
    out2_2 = bn_relu(out2_2)

    # Branch 3 (1x1 -> 3x3 -> 3x3)
    out3_1 = Conv2D(n_filters[2][0], (1,1), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(inp)
    out3_1 = bn_relu(out3_1)
    out3_2 = Conv2D(n_filters[2][1], (3,3), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(out3_1)
    out3_2 = bn_relu(out3_2)
    out3_3 = Conv2D(n_filters[2][2], (3,3), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(out3_2)
    out3_3 = bn_relu(out3_3)
    
    # Concatenate all branches
    out_concat = Concatenate(axis=-1)([out1_1, out2_2, out3_3])
    
    # Final 1x1 convolution (Linear activation)
    out_final_conv = Conv2D(n_filters[3][0], (1,1), strides=(1,1), activation=None, kernel_initializer=init, padding='same')(out_concat)
    
    # --- Residual Connection ---
    # Add the input (shortcut) to the output of the conv block
    out_final = Add()([out_final_conv, inp])
    out_final = Activation(activation)(out_final) # Apply activation *after* adding
    return out_final

# Note: Inception-ResNet-B and Reduction blocks are similar in principle
# We will use just Block A for this simplified example.

# 3. Build the full Minception model (Simplified from Listing 7.10)
def build_minception(input_shape=(64, 64, 3), num_classes=200):
    K.clear_session()
    
    inp = Input(shape=input_shape)
    
    # Preprocessing layers
    crop_inp = RandomCrop(56, 56, seed=random_seed)(inp)
    contrast_inp = RandomContrast(0.3, seed=random_seed)(crop_inp)
    
    # Stem
    stem_out = stem(contrast_inp)
    
    # Body (A few Inception-ResNet blocks)
    # Filter numbers are simplified from the book's version. The last entry must
    # equal the block's input channels (64 + 96 = 160 from the stem) so that the
    # residual Add() receives tensors of the same shape.
    inc_a_1 = inception_resnet_a(stem_out, [(32,), (32, 32), (32, 48, 64), (160,)])
    inc_a_2 = inception_resnet_a(inc_a_1, [(32,), (32, 32), (32, 48, 64), (160,)])
    
    # Classification Head
    avgpool1 = layers.GlobalAveragePooling2D()(inc_a_2)
    dropout1 = layers.Dropout(0.5)(avgpool1)
    out_main = Dense(num_classes, activation='softmax', kernel_initializer=init, name='final')(dropout1)
    
    model = Model(inputs=inp, outputs=out_main)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

minception_model = build_minception()
minception_model.summary()

### 7.2.6 Training Minception

When training this model, we introduce another callback: `ReduceLROnPlateau`. This will automatically reduce the learning rate (e.g., by a factor of 10) if the `val_loss` stops improving. This helps the model settle into a good minimum.
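
The plateau rule can be sketched in plain Python. This is a simplified illustration that ignores the `cooldown` and `min_delta` options of the Keras callback; the function name `reduce_lr_schedule` and the loss values are assumptions for the example.

```python
def reduce_lr_schedule(val_losses, initial_lr, factor=0.1, patience=5, min_lr=0.0):
    """Return the learning rate in effect after each epoch under a plateau rule."""
    lr = initial_lr
    best = float('inf')
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # shrink the step size
                wait = 0                       # start counting again
        history.append(lr)
    return history

# Hypothetical losses: improvement stalls for five epochs, then resumes
val_losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.8]
lr_history = reduce_lr_schedule(val_losses, initial_lr=1e-3)
print(lr_history)
```

The learning rate stays at its initial value until the fifth consecutive epoch without improvement, at which point it is multiplied by `factor`.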

In [None]:
# Define callbacks for Minception training
es_callback_min = EarlyStopping(monitor='val_loss', patience=10)
os.makedirs('eval', exist_ok=True) # Make sure the log directory exists
csv_logger_min = CSVLogger(os.path.join('eval', '3_eval_minception.log'))
lr_callback_min = ReduceLROnPlateau(
    monitor='val_loss', 
    factor=0.1, 
    patience=5, 
    verbose=1
)

# Note: Minception has only 1 output, so we use the original (non-aux) generators
train_gen_single = train_gen # From 7.1.1
valid_gen_single = valid_gen # From 7.1.1

print("Training Minception model...")
history_minception = minception_model.fit(
    train_gen_single,
    validation_data=valid_gen_single,
    steps_per_epoch=get_steps_per_epoch(len(train_gen_single.filenames), batch_size),
    validation_steps=get_steps_per_epoch(len(valid_gen_single.filenames), batch_size),
    epochs=5, # Book uses 50, we use 5 for speed
    callbacks=[es_callback_min, csv_logger_min, lr_callback_min]
)
print("Minception training complete.")

---

## 7.3 Transfer Learning: Using Pretrained Networks

**Transfer Learning** is one of the most powerful techniques in deep learning. Instead of training a model from scratch, we use a model that has already been trained on a massive dataset (like ImageNet, with over 1 million images).

The idea is that this model has already learned rich, general-purpose features (edges, textures, shapes). We can then use this model as a **feature extractor** and simply add a new, small classification head on top, which we train on our specific (and smaller) dataset.

We will use the full `InceptionResNetV2` model, pretrained on ImageNet.

In [None]:
from tensorflow.keras.applications import InceptionResNetV2

K.clear_session()

# 1. Define the input shape. InceptionResNetV2 defaults to 299x299, but with
#    include_top=False it accepts other sizes (at least 75x75); we use 224x224
INPUT_SHAPE = (224, 224, 3)

# 2. Load the base model (pretrained on ImageNet)
base_model = InceptionResNetV2(
    include_top=False,     # <-- DO NOT include the final 1000-class ImageNet classifier
    weights='imagenet',    # <-- Load pretrained weights
    input_shape=INPUT_SHAPE,
    pooling='avg'        # <-- Apply Global Average Pooling to the output
)

# 3. Freeze the base model (optional, but good for initial training)
# This prevents its weights from being updated.
# base_model.trainable = False

# 4. Create our new model
model_tl = models.Sequential([
    layers.Input(shape=INPUT_SHAPE), 
    base_model, # The pretrained base
    layers.Dropout(0.4),
    layers.Dense(200, activation='softmax') # Our new 200-class head
])

# 5. Compile with a low learning rate for fine-tuning
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model_tl.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

model_tl.summary()

Now we need to create new data generators that resize our images to `(224, 224)`.

In [None]:
# Create data generators for 224x224 images (based on Listing 7.13)
tl_target_size = (224, 224)
tl_batch_size = 32 # Use a smaller batch size for this large model

# We can re-use the image_gen_aug from section 7.1.1
partial_flow_func_tl = partial(
    image_gen_aug.flow_from_directory,
    directory=train_image_dir,
    target_size=tl_target_size, # New target size
    class_mode='categorical',
    batch_size=tl_batch_size,   # New batch size
    shuffle=True,
    seed=random_seed,
    interpolation='bilinear'   # Specify interpolation for resizing
)

train_gen_tl = partial_flow_func_tl(subset='training')
valid_gen_tl = partial_flow_func_tl(subset='validation')

# Wrap the iterators in plain Python generators. Unlike data_gen_aux, this model
# has a single output, so each batch is passed through unchanged as (x, y)
def data_gen_single(gen):
    for x, y in gen:
        yield x, y

train_gen_tl_single = data_gen_single(train_gen_tl)
valid_gen_tl_single = data_gen_single(valid_gen_tl)

print("Training Transfer Learning model...")
history_tl = model_tl.fit(
    train_gen_tl_single,
    validation_data=valid_gen_tl_single,
    steps_per_epoch=get_steps_per_epoch(len(train_gen_tl.filenames), tl_batch_size),
    validation_steps=get_steps_per_epoch(len(valid_gen_tl.filenames), tl_batch_size),
    epochs=3 # This will take a long time to train. We'll keep it short.
)
print("Transfer Learning training complete.")

---

## 7.4 Grad-CAM: Making CNNs Confess

**Grad-CAM (Gradient-weighted Class Activation Mapping)** is a technique to visualize where a CNN is "looking" when it makes a prediction. It produces a heatmap that highlights the most important regions in the input image for a given class.

**How it works (simplified):**
1.  Get the model's prediction for an image.
2.  Get the output feature map of the **last convolutional layer** (just before pooling and flatten).
3.  Calculate the **gradient** of the predicted class's score with respect to the feature map from step 2.
4.  Average these gradients for each feature map (channel) to get "weights" (this is `alpha_k` in the paper).
5.  Compute a weighted sum of all the feature maps using these weights.
6.  Apply a ReLU to the result (we only care about features that have a *positive* influence).
7.  The result is a coarse heatmap, which we can resize and overlay on the original image.
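
Steps 4 to 6 can be sketched with plain NumPy arrays, assuming the feature maps and their gradients have already been obtained. The function name `grad_cam_heatmap` and the tiny 2x2 inputs are hypothetical and only serve to make the arithmetic easy to follow.

```python
import numpy as np

def grad_cam_heatmap(feature_maps, grads):
    """Combine (H, W, K) feature maps and matching gradients into an (H, W) heatmap."""
    alphas = grads.mean(axis=(0, 1))                 # step 4: one weight per channel
    weighted = (feature_maps * alphas).sum(axis=-1)  # step 5: weighted sum over channels
    heatmap = np.maximum(weighted, 0.0)              # step 6: ReLU keeps positive evidence
    peak = heatmap.max()
    return heatmap / peak if peak > 0 else heatmap   # scale to [0, 1] for display

feature_maps = np.zeros((2, 2, 2))
feature_maps[0, 0, 0] = 4.0   # channel 0 fires in the top-left corner
feature_maps[1, 1, 1] = 3.0   # channel 1 fires in the bottom-right corner

grads = np.zeros((2, 2, 2))
grads[..., 0] = 0.5    # channel 0 pushes the class score up
grads[..., 1] = -1.0   # channel 1 pushes the class score down

heatmap = grad_cam_heatmap(feature_maps, grads)
print(heatmap)
```

Only the top-left cell lights up: the bottom-right activation belongs to a channel with a negative weight, so the ReLU removes it from the heatmap.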

*(Note: The full code (based on Appendix B) is complex. We will implement the core logic here.)*

In [None]:
# We need a trained model. Let's assume we saved our 'model_tl'.
# model_tl.save(os.path.join('models', 'inception_resnet_v2_tl.h5'))
# model = load_model(os.path.join('models', 'inception_resnet_v2_tl.h5'))

# For this example, we'll just use the model_tl we just defined.
model = model_tl

# 1. Find the name of the last convolutional layer in the base model
base_model = model.get_layer('inception_resnet_v2')
last_conv_layer_name = "conv_7b_ac" # Found by inspecting base_model.summary()

# 2. Create a model that maps the input to the last conv layer's features and to
#    the base model's pooled output. The base model is nested inside `model`, so
#    `model.output` is not reachable from `base_model.input`; the final Dense
#    layer is therefore applied by hand in step 4.
grad_model = Model(
    inputs=base_model.input,
    outputs=[base_model.get_layer(last_conv_layer_name).output, base_model.output]
)
classifier_layer = model.layers[-1] # The 200-class Dense head

# 3. Get a sample image
x_sample_test, y_sample_test = next(iter(valid_gen_tl_single))
sample_image = x_sample_test[0:1] # Get first image, keep batch dim
sample_label_idx = np.argmax(y_sample_test[0])

# 4. Use tf.GradientTape to record the forward pass
with tf.GradientTape() as tape:
    conv_outputs, pooled = grad_model(sample_image, training=False)
    # Dropout is the identity at inference time, so only the Dense head is applied
    predictions = classifier_layer(pooled)
    # Score of the target class (here the true label of the sample)
    loss = predictions[:, sample_label_idx]

# 5. Get the gradients of the score w.r.t the feature map
grads = tape.gradient(loss, conv_outputs)

# 6. Calculate channel weights (Global Average Pooling of gradients)
weights = tf.reduce_mean(grads, axis=(1, 2), keepdims=True)

# 7. Create the heatmap (weighted sum of feature maps)
heatmap = conv_outputs * weights
heatmap = tf.reduce_sum(heatmap, axis=-1) # Sum across channels

# 8. Apply ReLU (we only want positive contributions)
heatmap = tf.nn.relu(heatmap)

# 9. Normalize to [0, 1] (the small epsilon avoids division by zero for an all-zero map)
heatmap /= (tf.reduce_max(heatmap) + 1e-8)
heatmap = tf.squeeze(heatmap) # Remove batch dim

print("Grad-CAM Heatmap generated.")

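# 10. Visualize the heatmap and overlay it
# The generators only center each image (samplewise_center), so the pixel values
# are not in [-1, 1]; rescale them to [0, 1] with min-max scaling for display
display_image = sample_image[0]
display_image = (display_image - display_image.min()) / (display_image.max() - display_image.min() + 1e-8)

# Resize the heatmap to match the image and drop the channel axis for imshow
heatmap_resized = tf.image.resize(heatmap[..., tf.newaxis], (224, 224))[..., 0]

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.imshow(display_image)
plt.title("Original Image")
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(display_image)
plt.imshow(heatmap_resized, cmap='jet', alpha=0.5)
plt.title("Grad-CAM Overlay")
plt.axis('off')
plt.show()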