# Chapter 14: Deep Computer Vision Using CNNs

## 1. Chapter Overview
**Goal:** In this chapter, we will learn about Convolutional Neural Networks (CNNs). A dense network flattens the image and treats every pixel as an unrelated input feature, whereas a CNN exploits spatial structure (e.g., "this pixel is part of a line, which is part of a square"). We will start with the building blocks (Convolution, Pooling) and then explore famous architectures like ResNet and Xception. Finally, we will use **Transfer Learning** to adapt state-of-the-art pretrained models to our own tasks.

**Key Concepts:**
* **The Visual Cortex:** How biological vision inspired CNNs.
* **Convolutional Layers:** Filters (Kernels), Feature Maps, Stride, and Padding.
* **Pooling Layers:** Max Pooling and Average Pooling for downsampling.
* **CNN Architectures:**
    * **LeNet-5 (1998):** The grandfather of CNNs.
    * **AlexNet (2012):** The model that kicked off the deep learning revolution in computer vision.
    * **VGGNet:** Simplicity with 3x3 filters.
    * **GoogLeNet (Inception):** Using Inception modules to capture features at different scales.
    * **ResNet (2015):** Using skip connections to train ultra-deep networks (100+ layers).
    * **Xception:** Extreme Inception with depthwise separable convolutions.
* **Transfer Learning:** Using a model pretrained on ImageNet (millions of images) to classify specific datasets with high accuracy.

**Practical Skills:**
* Building a CNN from scratch using `Conv2D` and `MaxPooling2D`.
* Implementing a simplified ResNet Unit.
* Using Pretrained Keras Models (`Xception`, `ResNet50`) for image classification.
* Preprocessing images for specific models (`preprocess_input`).

In [None]:
# Setup
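import sys
assert sys.version_info >= (3, 6)  # f-strings are used later in this notebook

def version_tuple(version):
    # Compare versions numerically: as plain strings, "10.0" would sort below "2.0"
    return tuple(int(part) for part in version.split(".")[:2])

import sklearn
assert version_tuple(sklearn.__version__) >= (0, 20)

import tensorflow as tf
from tensorflow import keras
assert version_tuple(tf.__version__) >= (2, 0)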

import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
tf.random.set_seed(42)

print("TensorFlow version:", tf.__version__)

## 2. Theoretical Explanation (In-Depth)

### 1. Convolutional Layers
A standard Dense layer connects every input pixel to every neuron. This is wasteful and ignores geometry. A **Convolutional Layer** uses small "filters" (or kernels) that slide across the image.
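
To make the "sliding filter" idea concrete, here is a minimal NumPy sketch of what a single filter computes on a single-channel image. It is an illustration only, not how Keras implements `Conv2D`; the function name `conv2d_single` and the toy image are assumptions made for this example.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one 2D kernel over one 2D image (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The patch of the image currently covered by the kernel
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Toy 5x5 image containing a vertical line in the middle column
image = np.zeros((5, 5))
image[:, 2] = 1.0

# A 3x3 filter that responds to vertical lines
vertical_filter = np.array([[0., 1., 0.],
                            [0., 1., 0.],
                            [0., 1., 0.]])

feature_map = conv2d_single(image, vertical_filter)
print(feature_map)  # the middle column is 3, everything else is 0
```

The same nine weights are reused at every position, which is why a convolutional layer needs far fewer parameters than a dense layer on the same image.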

**Key Parameters:**
* **Filters:** The number of feature maps to output (e.g., 64 filters might learn 64 different edge types).
* **Kernel Size:** The size of the sliding window (usually 3x3, 5x5, or 7x7).
* **Stride:** How many pixels the filter moves at each step. With `'same'` padding, stride=1 keeps the spatial dimensions unchanged, while stride=2 halves them (rounding up).
* **Padding:** 
    * `'valid'`: No padding. The output is smaller than the input.
    * `'same'`: Zero padding is added so the output size equals the input size (if stride=1).
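
The effect of kernel size, stride, and padding on the output size can be checked with a few lines of arithmetic. The helper below is a sketch (the name `conv_output_size` is made up for this example) that follows the usual output-size rules for `'valid'` and `'same'` padding along one spatial dimension.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension of a convolution."""
    if padding == "same":
        # Enough zero padding is added that only the stride shrinks the output
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(28, 7, stride=1, padding="same"))   # 28
print(conv_output_size(28, 7, stride=1, padding="valid"))  # 22
print(conv_output_size(28, 3, stride=2, padding="same"))   # 14
print(conv_output_size(28, 3, stride=2, padding="valid"))  # 13
```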

### 2. Pooling Layers
Pooling layers reduce the spatial dimensions (width, height) to reduce computational load and memory usage, and to make the network robust to small shifts in the image (invariance).
* **Max Pooling:** Takes the maximum value in the window. Extracts the most prominent feature.
* **Average Pooling:** Takes the average value in the window. Smooths the features.
* **Global Average Pooling:** Computes the mean of the entire feature map. Often used just before the final classification layer.
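
The three pooling variants can be compared on a tiny feature map. This is a NumPy sketch for intuition, assuming non-overlapping windows (stride equal to the window size); `pool2d` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over a 2D feature map (stride == window size)."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a window
    # Axes 1 and 3 index the positions inside each window
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "avg":
        return windows.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'avg'")

feature_map = np.arange(16, dtype=float).reshape(4, 4)

max_pooled = pool2d(feature_map, mode="max")   # [[ 5.  7.] [13. 15.]]
avg_pooled = pool2d(feature_map, mode="avg")   # [[ 2.5  4.5] [10.5 12.5]]
global_avg = feature_map.mean()                # 7.5: one number per feature map

print(max_pooled)
print(avg_pooled)
print(global_avg)
```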

### 3. ResNet (Residual Networks)
As networks got deeper (20+ layers), they became harder to train: gradients tend to vanish on the way back to the early layers, and very deep plain networks can even end up with a higher training error than shallower ones. ResNet introduced the **Skip Connection** (or Residual Connection).
Instead of trying to learn $h(x)$ directly, the layers learn the residual $f(x) = h(x) - x$. The output becomes $f(x) + x$.
If the optimal function is close to the identity function (doing nothing), the weights of the residual branch can simply shrink toward zero, letting the signal pass through the skip connection. This allows training networks with 100+ layers.
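
A simplified residual unit could look like the sketch below. It is one possible layout (two 3x3 convolutions with batch normalization, plus a 1x1 convolution on the shortcut when the shape changes), not the exact unit from the ResNet paper; the function name `residual_unit` and the layer sizes in the usage lines are assumptions.

```python
import numpy as np
from tensorflow import keras

def residual_unit(x, filters, strides=1):
    """Simplified residual unit: output = relu(f(x) + shortcut(x))."""
    shortcut = x
    # Main path f(x): two 3x3 convolutions
    y = keras.layers.Conv2D(filters, 3, strides=strides, padding="same", use_bias=False)(x)
    y = keras.layers.BatchNormalization()(y)
    y = keras.layers.Activation("relu")(y)
    y = keras.layers.Conv2D(filters, 3, strides=1, padding="same", use_bias=False)(y)
    y = keras.layers.BatchNormalization()(y)
    # If f(x) has a different shape than x, project the shortcut with a 1x1 convolution
    if strides > 1 or x.shape[-1] != filters:
        shortcut = keras.layers.Conv2D(filters, 1, strides=strides, padding="same", use_bias=False)(x)
        shortcut = keras.layers.BatchNormalization()(shortcut)
    y = keras.layers.Add()([y, shortcut])  # the skip connection: f(x) + x
    return keras.layers.Activation("relu")(y)

# Usage: stack two units on a dummy 28x28 input with 64 channels
inputs = keras.Input(shape=(28, 28, 64))
x = residual_unit(inputs, filters=64)              # same shape, identity shortcut
x = residual_unit(x, filters=128, strides=2)       # halves the size, doubles the channels
demo_model = keras.Model(inputs, x)

output = demo_model(np.zeros((1, 28, 28, 64), dtype="float32"))
print(output.shape)  # (1, 14, 14, 128)
```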

### 4. Transfer Learning
Training a CNN from scratch typically requires a huge dataset and a lot of GPU time. Instead, we use models pretrained on **ImageNet** (1.2 million images, 1000 classes).
We can remove the top layer (classification head) and replace it with our own, then train only that new layer. The lower layers have already learned to detect edges, shapes, and textures, and these features transfer well to other image tasks.

## 3. Code Reproduction

### 3.1 Building a CNN from Scratch
We will build a CNN to classify Fashion MNIST images. We use `Conv2D` layers followed by `MaxPooling2D` layers.

In [None]:
# Load Data
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Preprocessing
# Conv2D expects each image as a 3D array (height, width, channels), so a batch is 4D.
# Since Fashion MNIST is grayscale, we must add the channel dimension: [28, 28, 1]
# We also scale the pixel values from 0-255 down to 0-1.
X_train_full = X_train_full[..., np.newaxis] / 255.0
X_test = X_test[..., np.newaxis] / 255.0

X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

# Building the Model
model = keras.models.Sequential([
    # Layer 1: Conv2D with 64 filters, 7x7 kernel. Giving input_shape here lets Keras build the model immediately (needed for summary()).
    keras.layers.Conv2D(64, 7, activation="relu", padding="same", input_shape=[28, 28, 1]),
    keras.layers.MaxPooling2D(2),
    
    # Layer 2 & 3: Stacked Conv2D with 128 filters, 3x3 kernel
    keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),
    
    # Layer 4 & 5: Going deeper with 256 filters
    keras.layers.Conv2D(256, 3, activation="relu", padding="same"),
    keras.layers.Conv2D(256, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),
    
    # Fully Connected Head
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
model.summary()

In [None]:
# Train (Note: CNNs are slower to train than MLPs without a GPU)
# We limit epochs to 3 for demonstration purposes
history = model.fit(X_train, y_train, epochs=3, validation_data=(X_valid, y_valid))

### 3.2 Using Pretrained Models (ResNet-50)
We will load a ResNet-50 model trained on ImageNet and use it to classify real-world images. The pretrained network expects 224x224 RGB inputs, so we resize the images first.

In [None]:
from sklearn.datasets import load_sample_images

# Load ResNet50 model with ImageNet weights
model = keras.applications.resnet50.ResNet50(weights="imagenet")

# Load the two sample images (in order: china.jpg, a Chinese temple, then flower.jpg)
images = load_sample_images().images
X_raw = np.array(images)

# Preprocessing for ResNet50
# 1. Resize to 224x224, the input size this model was trained with.
#    Note: a plain resize distorts the aspect ratio; cropping first would avoid that.
X_resized = tf.image.resize(X_raw, [224, 224])

# 2. Use the model-specific preprocess_input function (it expects pixel values in 0-255).
#    For ResNet50 it converts RGB to BGR and subtracts the ImageNet channel means (no rescaling).
inputs = keras.applications.resnet50.preprocess_input(X_resized)

# Prediction
Y_proba = model.predict(inputs)

# Decode predictions into human-readable class names
top_K = keras.applications.resnet50.decode_predictions(Y_proba, top=3)

for image_index in range(len(images)):
    print(f"Image #{image_index}")
    for class_id, name, y_proba in top_K[image_index]:
        print(f"  {name} - {class_id}: {y_proba*100:.2f}%")
    print()

### 3.3 Transfer Learning with Xception
How do we adapt a powerful model like Xception to our own dataset (e.g., flowers)?
1.  Load Xception **without the top layer** (`include_top=False`).
2.  Freeze the base layers (make them non-trainable).
3.  Add our own GlobalAveragePooling and Dense output layer.
4.  Train only the new layers.
5.  Unfreeze some base layers and fine-tune.

In [None]:
# Example setup (we won't run full training here as it requires a large dataset)
# A real run would first load and preprocess a dataset, e.g. with tensorflow_datasets.

# 1. Base Model
base_model = keras.applications.xception.Xception(weights="imagenet", include_top=False)

# 2. Add Custom Head
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(10, activation="softmax")(avg) # Assuming 10 classes
model = keras.models.Model(inputs=base_model.input, outputs=output)

# 3. Freeze Base Layers
for layer in base_model.layers:
    layer.trainable = False

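# 4. Compile (always compile after changing `trainable`, otherwise the change has no effect)
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

# 5. Train only the new head
# (`dataset` is a placeholder for batches of (image, label) pairs,
#  with images preprocessed by keras.applications.xception.preprocess_input)
# history = model.fit(dataset, epochs=5)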

model.summary()
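
The cell above stops after training the new head. The last step of the recipe (unfreezing some base layers and fine-tuning) could be sketched as follows. To keep the sketch runnable offline it rebuilds the model with `weights=None`; in a real run you would keep `weights="imagenet"`. The number of unfrozen layers, the learning rate, and the choice to keep `BatchNormalization` layers frozen are assumptions for illustration, not values from this chapter.

```python
from tensorflow import keras

# Rebuild the same model as above (weights=None only to avoid a download in this sketch)
base_model = keras.applications.Xception(weights=None, include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(10, activation="softmax")(avg)  # assuming 10 classes
model = keras.Model(inputs=base_model.input, outputs=output)

# Phase 1: freeze the whole base and train only the new head (training omitted here)
for layer in base_model.layers:
    layer.trainable = False

# Phase 2: unfreeze the top of the base model for fine-tuning
n_unfrozen = 30  # hypothetical choice: how many of the last base layers to unfreeze
for layer in base_model.layers[-n_unfrozen:]:
    # Keeping BatchNormalization frozen preserves its pretrained statistics
    if not isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = True

# Recompile with a small learning rate so the pretrained weights are only nudged
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              metrics=["accuracy"])

n_trainable = sum(layer.trainable for layer in base_model.layers)
print(f"{n_trainable} of {len(base_model.layers)} base layers are trainable")
# history = model.fit(dataset, epochs=5)  # `dataset` is a placeholder, as above
```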

## 4. Step-by-Step Explanation

### 1. The CNN Architecture
Our Fashion MNIST model follows a classic structure:
* **C-P-C-C-P-C-C-P:** Convolution -> Pooling -> Conv -> Conv -> Pooling -> Conv -> Conv -> Pooling, followed by a dense classification head.
* **Filter Pyramid:** We start with 64 filters, then go to 128, then 256. As we go deeper into the network, the spatial dimensions decrease ($28\times28 \rightarrow 14\times14 \rightarrow 7\times7 \rightarrow 3\times3$) while the number of feature maps increases. This makes sense because there are relatively few distinct low-level features (like "lines" or "curves") but many ways to combine them into high-level features (like "sleeves" or "buttons").
* **Padding='same':** Zero padding keeps the size constant during each convolution, so only the pooling layers shrink the feature maps and the border pixels are not discarded.
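
The shapes and parameter counts of the convolutional part can be traced by hand. The plain-Python sketch below mirrors the layer list of the Fashion MNIST model from section 3.1 (stride-1 `'same'` convolutions and 2x2 max pooling); the tuple encoding of the layers is just a convenience for this example.

```python
# (type, filters, kernel_size) for conv layers, ("pool",) for 2x2 max pooling
layers = [
    ("conv", 64, 7), ("pool",),
    ("conv", 128, 3), ("conv", 128, 3), ("pool",),
    ("conv", 256, 3), ("conv", 256, 3), ("pool",),
]

size, channels = 28, 1   # Fashion MNIST input: 28x28, 1 channel
conv_params = 0

for layer in layers:
    if layer[0] == "conv":
        _, filters, kernel = layer
        # Each filter has kernel*kernel*in_channels weights plus one bias
        conv_params += kernel * kernel * channels * filters + filters
        channels = filters           # 'same' padding, stride 1: size is unchanged
    else:
        size //= 2                   # 2x2 pooling halves the size (rounding down)
    print(f"{layer[0]:4s} -> {size}x{size}x{channels}")

flattened = size * size * channels
print("Flatten output size:", flattened)       # 2304
print("Conv parameters:", conv_params)         # 1109888
```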

### 2. Why ResNet works
In standard deep networks, the gradient is multiplied by one factor per layer during backpropagation. If these factors are consistently smaller than 1 in magnitude, the product shrinks exponentially with depth and the gradient vanishes. In ResNet, the gradient can also flow directly through the skip connection ($+ x$), which contributes an identity term that is not attenuated. This acts like a "highway" for gradients to reach the early layers.
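
The argument can be written out with the chain rule. As a simplifying assumption, ignore the activation applied after the addition, so that unit $l$ computes $x_{l+1} = x_l + f_l(x_l)$. Then for a loss $\mathcal{L}$ and a stack of units from $l$ to $L$:

$$
\frac{\partial \mathcal{L}}{\partial x_l}
= \frac{\partial \mathcal{L}}{\partial x_L} \prod_{i=l}^{L-1} \left( I + \frac{\partial f_i}{\partial x_i} \right)
\qquad \text{versus} \qquad
\frac{\partial \mathcal{L}}{\partial x_l}
= \frac{\partial \mathcal{L}}{\partial x_L} \prod_{i=l}^{L-1} \frac{\partial f_i}{\partial x_i}
\quad \text{(no skip connections)}
$$

Expanding the left-hand product always yields the term $\frac{\partial \mathcal{L}}{\partial x_L} \cdot I$, so part of the gradient reaches $x_l$ unchanged even when every $\frac{\partial f_i}{\partial x_i}$ is small. In the plain network, small factors multiply together and the whole gradient shrinks.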

### 3. Preprocessing for Pretrained Models
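Every pretrained model expects its input in the format it was trained on. ResNet50 expects BGR channel order with the ImageNet channel means subtracted (zero-centered, not rescaled), while Xception expects pixel values scaled to the range -1 to 1.
Always use the model's own helper, e.g. `keras.applications.resnet50.preprocess_input(x)` or `keras.applications.xception.preprocess_input(x)`, instead of scaling manually.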
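
To see why the helpers are not interchangeable, here is a NumPy sketch of the two conventions as they are commonly described for Keras: mean subtraction in BGR order (used by ResNet50) and scaling to -1..1 (used by Xception). The function names are made up for this example; in real code, call the model's own `preprocess_input`.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by the mean-subtraction convention
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68])

def caffe_style_preprocess(x):
    """RGB pixels in 0-255 -> BGR, zero-centered per channel, no rescaling."""
    x = x[..., ::-1].astype("float32")   # reverse the channel axis: RGB -> BGR
    return x - IMAGENET_BGR_MEAN

def tf_style_preprocess(x):
    """Pixels in 0-255 -> values in the range -1 to 1."""
    return x.astype("float32") / 127.5 - 1.0

white_pixel = np.array([[[255, 255, 255]]])   # a 1x1 RGB image
print(caffe_style_preprocess(white_pixel))    # roughly [151.06, 138.22, 131.32]
print(tf_style_preprocess(white_pixel))       # [1., 1., 1.]
```

The same white pixel ends up as values above 130 in one convention and as 1.0 in the other, so feeding a model the wrong format silently degrades its predictions.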

## 5. Chapter Summary

* **CNNs** exploit the spatial structure of images using **Filters** (local patterns) and **Pooling** (subsampling).
* **Architectures:** Modern CNNs like **ResNet** and **Xception** are extremely deep but trainable thanks to Residual connections and efficient convolutions.
* **Transfer Learning:** Often the most practical way to apply deep learning to a small image dataset. Download a model trained on ImageNet, remove the top, and train a new classifier on top.
* **Data Augmentation:** Shifting, rotating, and flipping images during training artificially enlarges the dataset and reduces overfitting. The models above were trained without it, but it is crucial in practice.
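
As a rough illustration of data augmentation, the NumPy sketch below applies a random horizontal flip and a small random shift to one image. It is not part of this chapter's code, and the function name and `max_shift` value are assumptions; in a Keras pipeline the same idea is usually applied on the fly to each training batch.

```python
import numpy as np

def random_flip_and_shift(image, rng, max_shift=2):
    """Randomly flip an (H, W, C) image left-right and shift it by up to max_shift pixels."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                      # horizontal flip
    h, w, _ = image.shape
    # Zero-pad the borders, then take a random crop of the original size
    padded = np.pad(image, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
    dy, dx = rng.integers(0, 2 * max_shift + 1, size=2)
    return padded[dy:dy + h, dx:dx + w, :]

rng = np.random.default_rng(42)
image = rng.random((28, 28, 1))                        # stand-in for one training image
augmented = random_flip_and_shift(image, rng)
print(augmented.shape)                                 # (28, 28, 1): same shape as the input
```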