# Chapter 12: Custom Models and Training with TensorFlow

## 1. Chapter Overview
**Goal:** Up until now, we have used the high-level Keras API (`Sequential`, `Functional`). But what if you need a loss function that doesn't exist in Keras? Or a layer that behaves differently? Or a training loop that does something exotic? This chapter teaches you how to use TensorFlow's lower-level API to customize every part of your Deep Learning pipeline.

**Key Concepts:**
* **TensorFlow Basics:** Tensors, Operations, and how they differ from NumPy.
* **Custom Loss Functions:** Defining your own mathematical criteria for errors.
* **Custom Layers:** Building layers with internal weights (kernels/biases) using the Subclassing API.
* **Custom Models:** Creating complex architectures that behave dynamically.
* **Automatic Differentiation:** Using `GradientTape` to manually compute gradients.
* **Custom Training Loops:** Writing the `for` loop for training from scratch, bypassing `model.fit()`.
* **TensorFlow Functions:** Using `@tf.function` to compile Python code into efficient TensorFlow Graphs (AutoGraph).

**Practical Skills:**
* Manipulating Tensors (slicing, reshaping, math operations).
* Implementing a **Huber Loss** function manually.
* Creating a custom **Dense Layer** from scratch.
* Writing a full training loop using `tape.gradient` and `optimizer.apply_gradients`.
* Debugging TensorFlow code and optimizing it with Graph Mode.

In [None]:
# Setup
import sys
assert sys.version_info >= (3, 5)

import sklearn
assert sklearn.__version__ >= "0.20"

import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
tf.random.set_seed(42)

## 2. Theoretical Explanation (In-Depth)

### 1. Tensors and Operations
At the heart of TensorFlow is the **Tensor**. It is very similar to a NumPy `ndarray`, but with two critical differences:
1.  **Immutability:** You cannot modify a Tensor in place (unlike NumPy arrays). You must create a new one.
2.  **GPU Acceleration:** Tensors can immediately be processed by a GPU or TPU, whereas NumPy arrays live on the CPU.

**Variables:** Since Tensors are immutable, how do we store weights that need to change during training? We use `tf.Variable`. These are mutable containers for Tensors.
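
The difference between an immutable tensor and a mutable variable can be sketched in a few lines. This is a minimal illustration, not code from the chapter; the names `t_doubled` and `tensor_is_mutable` are made up for the example.

```python
import tensorflow as tf

t = tf.constant([[1., 2., 3.], [4., 5., 6.]])

# A tensor does not support item assignment: this raises a TypeError.
try:
    t[0, 1] = 42.0
    tensor_is_mutable = True
except TypeError:
    tensor_is_mutable = False

# To "change" a tensor you build a new one; the original is untouched.
t_doubled = t * 2

# A tf.Variable holds a tensor whose values can be updated in place.
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v.assign(2 * v)        # replace every value
v[0, 1].assign(42.0)   # update a single element

print("Tensor is mutable:", tensor_is_mutable)
print("Variable after updates:\n", v.numpy())
```

In practice you rarely create variables by hand: layers create them for you through `add_weight`.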

### 2. Custom Components (Losses, Layers, Models)
* **Loss Function:** A simple Python function that takes `y_true` and `y_pred` and returns a tensor of losses (typically one value per instance; Keras then averages them into the scalar it minimizes). If your loss needs its own hyperparameters (for example, a threshold), you subclass `keras.losses.Loss`.
* **Layers:** To build a custom layer with weights, you subclass `keras.layers.Layer` and implement three methods:
    * `__init__`: Save hyperparameters.
    * `build(input_shape)`: Create the weights (kernels, biases) knowing the input shape. This happens only once.
    * `call(inputs)`: The forward pass computation.
* **Models:** Subclass `keras.models.Model`. Similar to layers, but usually contains other layers.
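
As an illustration of the loss case, here is a minimal sketch of a Huber loss with a configurable threshold, written by subclassing `keras.losses.Loss`. The class name `HuberLoss` and the `threshold` argument are assumptions chosen for this example, not part of Keras.

```python
import tensorflow as tf
from tensorflow import keras

class HuberLoss(keras.losses.Loss):
    """Hypothetical Huber loss whose threshold (delta) is a hyperparameter."""

    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)

    def call(self, y_true, y_pred):
        # Return one loss value per element; the base class reduces them.
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold ** 2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

    def get_config(self):
        # Include the hyperparameter so the loss can be re-created from its config.
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

loss_fn = HuberLoss(threshold=2.0)
y_true = tf.constant([[0.0], [0.0]])
y_pred = tf.constant([[1.0], [3.0]])
# Errors 1 and 3 with threshold 2: 1**2 / 2 = 0.5 and 2*3 - 2**2 / 2 = 4.0
loss_value = loss_fn(y_true, y_pred)
print("Mean Huber loss:", float(loss_value))
```

An instance can be passed to `model.compile(loss=HuberLoss(2.0), ...)` in the same way as a plain loss function.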

### 3. Automatic Differentiation (Autodiff)
How does TensorFlow know how to compute gradients for *any* custom function you write? It uses **Reverse-Mode Autodiff**.
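In TF 2.x, this is handled by `tf.GradientTape`. Operations executed inside a `with tf.GradientTape() as tape:` block are recorded when they involve a watched tensor (trainable `tf.Variable`s are watched automatically). Afterward, you call `tape.gradient(target, sources)` to get the derivatives of `target` with respect to `sources`. By default, the tape is released after a single `gradient()` call.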
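
A small sketch of what the tape tracks: variables are watched automatically, whereas a plain constant tensor yields no gradient unless you call `tape.watch` on it. The names `x_var` and `x_const` are made up for this example.

```python
import tensorflow as tf

x_var = tf.Variable(3.0)
x_const = tf.constant(3.0)

# Variables are watched automatically; constants are not.
with tf.GradientTape() as tape:
    y = x_var ** 2 + x_const ** 2
grad_var, grad_const = tape.gradient(y, [x_var, x_const])
print("d/dx_var:", grad_var.numpy())   # derivative of x**2 at x = 3
print("d/dx_const:", grad_const)       # None: the constant was not watched

# Explicitly watching the constant makes the tape record operations on it.
with tf.GradientTape() as tape:
    tape.watch(x_const)
    y = x_const ** 2
grad_watched = tape.gradient(y, x_const)
print("d/dx_const (watched):", grad_watched.numpy())
```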

### 4. TensorFlow Functions (Graph Mode)
Eager Python execution is flexible and easy to debug, but running one operation at a time carries overhead. TensorFlow graphs can be optimized as a whole and are portable.
In TF 1.x, you had to build graphs manually. In TF 2.x, Eager Execution (running like standard Python) is the default. However, for performance, we often want to convert our functions into graphs.
We use the decorator `@tf.function`. It traces your Python code and, through **AutoGraph**, converts control flow such as `for` loops and `if` statements into graph operations, producing an optimized computation graph.
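
A minimal sketch of the idea, using a made-up function `sum_squares`. Calling `tf.function(fn)` is equivalent to decorating `fn` with `@tf.function`; wrapping it explicitly here keeps the plain Python version available for comparison.

```python
import tensorflow as tf

def sum_squares(n):
    # Sum of i**2 for i in [0, n). Looping over tf.range lets AutoGraph
    # turn the Python for loop into a graph-level loop.
    total = tf.constant(0)
    for i in tf.range(n):
        total += i ** 2
    return total

# Graph version of the same function.
tf_sum_squares = tf.function(sum_squares)

eager_result = sum_squares(tf.constant(5))      # runs op by op in Python
graph_result = tf_sum_squares(tf.constant(5))   # runs as a traced graph

print("Eager:", eager_result.numpy())
print("Graph:", graph_result.numpy())
```

Both calls compute 0 + 1 + 4 + 9 + 16. The graph version pays a one-time tracing cost on its first call, so the benefit shows up when the function is called many times.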

## 3. Code Reproduction

### 3.1 Tensors and NumPy
Let's explore the basics of Tensors.

In [None]:
# Creating a tensor
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])
print("Tensor:\n", t)
print("Shape:", t.shape)
print("Dtype:", t.dtype)

# Indexing (same as NumPy)
print("Slice:", t[:, 1:])

# Operations
print("Addition:", t + 10)
print("Square:", tf.square(t))
print("Matrix Multiplication:", t @ tf.transpose(t)) # @ operator is matmul

# TensorFlow and NumPy interoperability
a = t.numpy() # Convert to numpy
print("NumPy array:", a)
t_from_np = tf.constant(np.array([1, 2, 3])) # Convert from numpy
print("Tensor from NumPy:", t_from_np)

### 3.2 Custom Loss Function
We will implement the **Huber Loss**, which is less sensitive to outliers than MSE. 
$$ L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}(y-f(x))^2 & \text{for } |y-f(x)| \le \delta, \\ \delta |y-f(x)| - \frac{1}{2}\delta^2 & \text{otherwise.} \end{cases} $$
We will train a simple model on the California Housing dataset using this custom loss.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare Data
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

# Define Huber Loss Function
def my_huber_loss(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    # tf.where selects elements from squared_loss or linear_loss based on condition
    return tf.where(is_small_error, squared_loss, linear_loss)

# Compile and Train with custom loss
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="selu", kernel_initializer="lecun_normal", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)
])

model.compile(loss=my_huber_loss, optimizer="nadam", metrics=["mae"])
model.fit(X_train_scaled, y_train, epochs=2, validation_data=(X_valid_scaled, y_valid))

### 3.3 Custom Layers
Let's build a simplified `Dense` layer from scratch to understand the mechanics.

In [None]:
class MyDense(keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)

    def build(self, batch_input_shape):
        # Create weights (kernel) and bias
        self.kernel = self.add_weight(
            name="kernel", shape=[batch_input_shape[-1], self.units],
            initializer="glorot_normal")
        self.bias = self.add_weight(
            name="bias", shape=[self.units], initializer="zeros")
        super().build(batch_input_shape) # Must be at the end

    def call(self, X):
        # Forward pass: X * W + b
        return self.activation(X @ self.kernel + self.bias)

    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(batch_input_shape.as_list()[:-1] + [self.units])

# Testing the custom layer
model = keras.models.Sequential([
    MyDense(30, activation="relu", input_shape=X_train.shape[1:]),
    MyDense(1)
])

model.compile(loss="mse", optimizer="nadam")
model.fit(X_train_scaled, y_train, epochs=2, verbose=0)
print("Custom Layer Model trained successfully.")

### 3.4 Automatic Differentiation (GradientTape)
Let's compute the partial derivatives of $z(w_1, w_2) = 3w_1^2 + 2w_1w_2$ at $(w_1, w_2) = (5, 3)$.

In [None]:
w1, w2 = tf.Variable(5.), tf.Variable(3.)

with tf.GradientTape() as tape:
    z = 3 * w1 ** 2 + 2 * w1 * w2

gradients = tape.gradient(z, [w1, w2])

print("Function z = 3*w1^2 + 2*w1*w2")
print("dz/dw1 at (5,3):", gradients[0].numpy()) # Should be 6*5 + 2*3 = 36
print("dz/dw2 at (5,3):", gradients[1].numpy()) # Should be 2*5 = 10

### 3.5 Custom Training Loop
This is the ultimate level of control. We will not use `model.fit()`. Instead, we will write the loop that iterates over the dataset, computes gradients, and updates weights manually.

In [None]:
# 1. Define the model (the optimizer, loss, and metric are created further down)
l2_reg = keras.regularizers.l2(0.05)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="elu", kernel_initializer="he_normal",
                       kernel_regularizer=l2_reg),
    keras.layers.Dense(1, kernel_regularizer=l2_reg)
])

# Helper function to sample random batches
def random_batch(X, y, batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

def print_status_bar(iteration, total, loss, metrics=None):
    metrics = " - ".join(["{}: {:.4f}".format(m.name, m.result()) for m in [loss] + (metrics or [])])
    end = "" if iteration < total else "\n"
    print("\r{}/{} - ".format(iteration, total) + metrics, end=end)

n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.MeanSquaredError()
mean_loss = keras.metrics.Mean()

print("Starting Custom Training Loop...")
for epoch in range(1, n_epochs + 1):
    print("Epoch {}/{}".format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        # A. Sampling
        X_batch, y_batch = random_batch(X_train_scaled, y_train, batch_size)
        
        # B. Forward Pass & Recording Operations
        with tf.GradientTape() as tape:
            y_pred = model(X_batch, training=True) # training=True is important for Dropout/BN
            main_loss = loss_fn(y_batch, y_pred)
            loss = main_loss + tf.add_n(model.losses) # Add regularization losses
        
        # C. Backward Pass (Compute Gradients)
        gradients = tape.gradient(loss, model.trainable_variables)
        
        # D. Optimizer Step (Update Weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        
        # E. Update Metrics (for display)
        mean_loss(loss)
        print_status_bar(step, n_steps, mean_loss)
    
    # Reset metrics at end of epoch
    # (newer Keras releases rename this method to reset_state())
    mean_loss.reset_states()

## 4. Step-by-Step Explanation

### 1. Vectorized Huber Loss
**Logic:** We want a loss that behaves quadratically for small errors (like MSE) but linearly for large errors (like MAE). 
**Implementation:** We calculate `error = y_true - y_pred`. `tf.where(condition, x, y)` works like a vectorized if-else statement: for every element of the error tensor, if the absolute error is small (< 1, i.e. $\delta = 1$), it selects the quadratic term; otherwise, it selects the linear term. Both branches are computed for every element; `tf.where` only chooses between them. Because the loss is built entirely from TensorFlow operations, it is differentiable by autodiff and can run on a GPU.
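
To see the two branches at work, the sketch below evaluates the same function on three hand-picked errors and compares the average with Keras's built-in `keras.losses.Huber`. The input values are made up for illustration.

```python
import tensorflow as tf
from tensorflow import keras

def my_huber_loss(y_true, y_pred):
    # Same logic as the loss defined earlier (delta fixed at 1).
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

y_true = tf.constant([[0.0], [0.0], [0.0]])
y_pred = tf.constant([[0.5], [1.0], [3.0]])

per_instance = my_huber_loss(y_true, y_pred)
# |error| = 0.5 -> 0.5**2 / 2 = 0.125  (quadratic branch)
# |error| = 1.0 -> 1.0 - 0.5  = 0.5    (linear branch; both branches agree here)
# |error| = 3.0 -> 3.0 - 0.5  = 2.5    (linear branch)
print("Per-instance losses:", per_instance.numpy().ravel())

# The built-in loss returns the mean over the batch.
builtin_mean = keras.losses.Huber(delta=1.0)(y_true, y_pred)
print("Mean (custom): ", float(tf.reduce_mean(per_instance)))
print("Mean (built-in):", float(builtin_mean))
```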

### 2. Custom Layer Anatomy
* `build()`: This is where we create the layer's variables with `add_weight`. It's crucial to do this in `build` and not `__init__` because in `__init__` we don't yet know the input shape (the number of input features, which determines the kernel's first dimension). Keras calls `build` once, the first time the layer is used.
* `call()`: This effectively performs the matrix multiplication $X \cdot W + b$. We then pass it through the activation function.
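
The deferred weight creation can be observed directly. The sketch below uses a stripped-down layer, `TinyDense` (a hypothetical name, with no activation or bias), and inspects it before and after its first call.

```python
import tensorflow as tf
from tensorflow import keras

class TinyDense(keras.layers.Layer):
    """Hypothetical minimal layer: a kernel only, no bias or activation."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, batch_input_shape):
        # The kernel's first dimension comes from the input shape,
        # which is only known once the layer sees data.
        self.kernel = self.add_weight(
            name="kernel", shape=[batch_input_shape[-1], self.units],
            initializer="glorot_normal")
        super().build(batch_input_shape)

    def call(self, X):
        return X @ self.kernel

layer = TinyDense(2)
built_before = layer.built              # no weights exist yet
n_weights_before = len(layer.weights)

outputs = layer(tf.ones((4, 3)))        # first call triggers build()
built_after = layer.built
kernel_shape = tuple(layer.kernel.shape)

print("Built before first call:", built_before, "- weights:", n_weights_before)
print("Built after first call: ", built_after, "- kernel shape:", kernel_shape)
```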

### 3. The Custom Loop
This loop replaces the `fit()` method entirely.
1.  **GradientTape:** We wrap the forward pass (`model(X_batch)`) and the loss calculation inside the `with` block. TF records every operation involving a Variable.
2.  **Regularization:** `model.losses` contains the regularization penalties (L2 loss) we added when defining the layers. We must manually add this to our main loss, or regularization won't happen.
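3.  **Tape.gradient:** TF walks back through the recorded operations, from `loss` to `model.trainable_variables`, applying the chain rule to compute the gradient of the loss with respect to each weight.
4.  **Optimizer.apply_gradients:** This applies the optimizer's update rule to each (gradient, variable) pair. For plain SGD, that means subtracting the gradient multiplied by the learning rate; Nadam, used here, also rescales the step using running averages of past gradients.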
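
The four steps can be checked by hand on a toy problem. The sketch below assumes plain `SGD` rather than the Nadam optimizer used in the loop, because its update is simple enough to verify: each weight moves by the learning rate times its gradient. The layer, data, and learning rate are made up for illustration.

```python
import tensorflow as tf
from tensorflow import keras

# Toy setup: one linear unit with both weights initialized to 1.
layer = keras.layers.Dense(1, use_bias=False, kernel_initializer="ones")
optimizer = keras.optimizers.SGD(learning_rate=0.1)

X = tf.constant([[1.0, 2.0]])
y = tf.constant([[0.0]])

# Forward pass and loss, recorded on the tape.
with tf.GradientTape() as tape:
    y_pred = layer(X)                              # 1*1 + 2*1 = 3
    loss = tf.reduce_mean(tf.square(y - y_pred))   # (0 - 3)**2 = 9

# Backward pass: dloss/dW = 2 * (y_pred - y) * x = [6, 12]
gradients = tape.gradient(loss, layer.trainable_variables)

# Optimizer step: W <- W - learning_rate * gradient
w_before = layer.kernel.numpy().copy()
optimizer.apply_gradients(zip(gradients, layer.trainable_variables))
w_after = layer.kernel.numpy()

print("Gradient:      ", gradients[0].numpy().ravel())
print("Weights before:", w_before.ravel())
print("Weights after: ", w_after.ravel())
```

With an adaptive optimizer such as Nadam, the same call would produce a different step, since the gradient is rescaled before being applied.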

## 5. Chapter Summary

* **TensorFlow Low-Level API:** Gives you full control over tensors, operations, and graphs.
* **Customizing:** You can subclass `Loss`, `Layer`, `Model`, and `Metric` to create components that don't exist in the standard library.
* **GradientTape:** The engine of Deep Learning in TF2. It enables automatic differentiation for custom training flows.
* **Custom Training Loops:** Necessary for advanced research (e.g., GANs, Reinforcement Learning) where the standard `fit()` method is too rigid.
* **Performance:** Use `@tf.function` to compile your custom Python functions into high-performance TensorFlow Graphs.