# **CHAPTER 10: NEURAL NETWORK FUNDAMENTALS**

*The Universal Approximator*

## **Chapter Overview**

Neural networks are among the most flexible classes of machine learning models and underpin much of the recent progress in computer vision and natural language processing. This chapter builds the theoretical and practical foundation: from the mathematical neuron, through the backpropagation algorithm (the engine of deep learning), to the modern training techniques that make deep networks trainable.

**Estimated Time:** 60-70 hours (4-5 weeks)  
**Prerequisites:** Chapters 1 (Math), 6 (Optimization/Gradient Descent), 9 (Regularization)

---

## **10.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Implement a multi-layer neural network from scratch using only NumPy (forward and backward pass)
2. Derive and compute gradients using the chain rule for any network architecture
3. Select appropriate activation functions and understand their impact on gradient flow
4. Apply modern optimization algorithms (Adam, RMSprop) with proper hyperparameter tuning
5. Implement regularization techniques (Dropout, Batch Normalization) to prevent overfitting
6. Initialize networks effectively to avoid vanishing/exploding gradients

---

## **10.1 From Biological Inspiration to Artificial Neurons**

#### **10.1.1 The Perceptron**

The fundamental unit: a weighted sum of inputs passed through an activation function.

$$z = \sum_{i=1}^n w_i x_i + b = \mathbf{w}^T \mathbf{x} + b$$

$$a = \sigma(z)$$

Where:
- $\mathbf{x} \in \mathbb{R}^n$: Input features
- $\mathbf{w} \in \mathbb{R}^n$: Weights (synaptic strengths)
- $b \in \mathbb{R}$: Bias (threshold adjustment)
- $\sigma$: Activation function (non-linearity)

**Implementation:**
```python
import numpy as np

class Perceptron:
    def __init__(self, input_dim):
        # Xavier initialization (explained later)
        self.weights = np.random.randn(input_dim) * np.sqrt(1.0 / input_dim)
        self.bias = 0.0
    
    def forward(self, x):
        """Forward pass"""
        z = np.dot(self.weights, x) + self.bias
        a = self.activation(z)
        return a
    
    def activation(self, z):
        """Step function (historical) or modern alternatives"""
        return 1 if z > 0 else 0  # Step function
        # return 1 / (1 + np.exp(-z))  # Sigmoid (modern)
```

**Limitation:** A single perceptron is a linear classifier (like logistic regression): its decision boundary is a hyperplane, so it cannot solve problems that are not linearly separable, such as XOR.
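
To make the XOR limitation concrete, the sketch below hand-picks weights for a two-layer network of step units that does compute XOR. The weights and the helper names (`step`, `xor_network`) are illustrative choices made for this example, not learned values.

```python
import numpy as np

def step(z):
    """Heaviside step activation, applied element-wise."""
    return (z > 0).astype(int)

def xor_network(x):
    """Hidden units compute OR and AND; the output fires for OR-and-not-AND."""
    # Columns of W1 are the two hidden units (hand-picked, not trained)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])   # thresholds: OR fires at sum >= 1, AND at sum >= 2
    h = step(x @ W1 + b1)
    w2 = np.array([1.0, -1.0])    # +OR, -AND
    b2 = -0.5
    return step(h @ w2 + b2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_outputs = xor_network(X)
print(xor_outputs)  # [0 1 1 0]
```

No single choice of `w` and `b` in one perceptron reproduces this truth table, which is why the hidden layer is needed.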

#### **10.1.2 Multi-Layer Perceptron (MLP)**

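Stacking layers with non-linear activations creates non-linear decision boundaries. By the Universal Approximation Theorem, a network with a single hidden layer and a suitable non-linear activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units. The theorem guarantees that such a network exists; it does not say how to find its weights or how many units are needed.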

**Architecture:**
- **Input Layer:** Receives data (no computation)
- **Hidden Layer(s):** Learned representations
- **Output Layer:** Predictions (dimensions = number of classes/outputs)

```python
class MLP:
    def __init__(self, layer_sizes):
        """
        layer_sizes: [input_dim, hidden1, hidden2, ..., output_dim]
        """
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            in_dim = layer_sizes[i]
            out_dim = layer_sizes[i + 1]
            # Weight matrix: (in_dim, out_dim)
            W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)  # He init
            b = np.zeros(out_dim)
            self.layers.append({'W': W, 'b': b})
    
    def forward(self, X):
        """Full forward pass through all layers"""
        self.activations = [X]  # Store for backprop
        current = X
        
        for i, layer in enumerate(self.layers):
            z = np.dot(current, layer['W']) + layer['b']
            # Apply activation (ReLU for hidden, identity/softmax for output)
            if i < len(self.layers) - 1:  # Hidden layers
                a = np.maximum(0, z)  # ReLU
            else:  # Output layer
                a = z  # Linear (for regression) or softmax (for classification)
            self.activations.append(a)
            current = a
        
        return current
```

---

## **10.2 Activation Functions: The Non-Linearity**

Without non-linear activations, deep networks collapse to linear models: $\mathbf{W}_3(\mathbf{W}_2(\mathbf{W}_1\mathbf{x})) = \mathbf{W}_{eff}\mathbf{x}$.
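
The collapse can be checked numerically. This is a small sketch with arbitrary random matrices (the shapes and the seed are assumptions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))
W3 = rng.standard_normal((2, 3))
x = rng.standard_normal(4)

# Three stacked linear layers...
deep_output = W3 @ (W2 @ (W1 @ x))

# ...equal one linear layer with the product matrix
W_eff = W3 @ W2 @ W1
collapsed_output = W_eff @ x
layers_collapse = np.allclose(deep_output, collapsed_output)
print(layers_collapse)  # True

# Inserting ReLU between layers breaks this equivalence in general
relu_output = W3 @ np.maximum(0, W2 @ np.maximum(0, W1 @ x))
```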

#### **10.2.1 Sigmoid**

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Properties:** Output $\in (0,1)$, smooth gradient.  
**Problems:** 
- **Vanishing gradient:** Saturation at $\sigma(z) \approx 0$ or $1$ gives $\nabla \approx 0$.
- **Not zero-centered:** Outputs always positive (causes zig-zag dynamics in gradient descent).

**Derivative:** $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
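
A short sketch of the derivative makes the saturation visible: it peaks at 0.25 when $z = 0$ and is close to zero for large $|z|$, so each sigmoid layer can scale a backpropagated gradient by at most 0.25 (ignoring the weights). The function names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, 0.0, 10.0])
grads = sigmoid_derivative(z)
print(grads)  # tiny at both ends, 0.25 in the middle
```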

#### **10.2.2 Tanh (Hyperbolic Tangent)**

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

**Properties:** Output $\in (-1, 1)$, zero-centered (better than sigmoid).  
**Problem:** Still suffers from vanishing gradients.

#### **10.2.3 ReLU (Rectified Linear Unit)**

$$\text{ReLU}(z) = \max(0, z)$$

**Properties:** 
- Computationally cheap (no exponentials)
- No vanishing gradient for $z > 0$ (constant gradient 1)
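- Sparse activation (roughly half of the units output zero at initialization, when pre-activations are symmetric around zero)

**Problem:** **Dying ReLU**: when $z < 0$ the gradient is 0, so a neuron whose pre-activation is negative for every input stops updating (it is "dead").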

**Variants:**
- **Leaky ReLU:** $\max(\alpha z, z)$ where $\alpha = 0.01$ (small negative slope)
- **PReLU:** $\alpha$ is learned parameter
- **ELU:** Smooth version with negative saturation

```python
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)  # 1 if z>0, else 0

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)
```

#### **10.2.4 Softmax (Output Layer)**

For multi-class classification. Converts logits to probability distribution.

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$

**Numerical Stability:** Subtract $\max(z)$ before exponentiation to prevent overflow.

```python
def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
```

#### **10.2.5 GELU and Swish (Modern)**

**GELU (Gaussian Error Linear Unit):** Used in BERT, GPT.

$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$

Approximation: $z \cdot \sigma(1.702z)$

**Swish (SiLU):** $z \cdot \sigma(z)$ (smooth and non-monotonic; reported to match or outperform ReLU in some deep models)
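
The following sketch implements the exact GELU, its sigmoid approximation, and Swish so they can be compared on a grid. It uses `math.erf` from the standard library wrapped with `np.vectorize`; the function names and the evaluation grid are assumptions for illustration.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu_exact(z):
    """z * Phi(z), with Phi written in terms of the error function."""
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1.0 + erf(z / np.sqrt(2.0)))

def gelu_sigmoid_approx(z):
    """The z * sigmoid(1.702 z) approximation from the text."""
    return z * sigmoid(1.702 * z)

def swish(z):
    return z * sigmoid(z)

z = np.linspace(-4.0, 4.0, 81)
max_gap = np.max(np.abs(gelu_exact(z) - gelu_sigmoid_approx(z)))
print(max_gap)  # largest absolute difference on this grid
```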

---

## **10.3 Loss Functions**

#### **10.3.1 Regression Losses**

**Mean Squared Error (MSE):**
$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

**Mean Absolute Error (MAE):**
$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$$
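
As a quick illustration of how differently the two losses react to an outlier, here is a minimal sketch with made-up data in which a single prediction is off by 10:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 14.0])  # one large error of 10

mse_value = mse(y_true, y_pred)  # 10^2 / 4 = 25.0
mae_value = mae(y_true, y_pred)  # 10 / 4 = 2.5
```

The squared term lets the single outlier dominate MSE, while MAE grows only linearly with the error.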

**Huber Loss:** Quadratic near zero, linear far from zero (robust to outliers).

```python
def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small_error = np.abs(error) <= delta
    squared_loss = 0.5 * error ** 2
    linear_loss = delta * (np.abs(error) - 0.5 * delta)
    return np.where(is_small_error, squared_loss, linear_loss)
```

#### **10.3.2 Classification Losses**

**Binary Cross-Entropy:**
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$

**Categorical Cross-Entropy:**
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K y_{i,k} \log(\hat{y}_{i,k})$$

**Implementation with Numerical Stability:**
```python
def cross_entropy(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Sum over classes, then average over samples (matches the formula above)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))
```
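
The binary case can be handled the same way. The sketch below follows the binary cross-entropy formula above with the same clipping trick; the example probabilities are made up.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    # Clip so that log(0) never occurs
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.9, 0.1]))
confident_wrong = binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.1, 0.9]))
hard_wrong = binary_cross_entropy(y_true, np.array([0.0, 1.0, 0.0, 1.0]))  # finite thanks to clipping
```

Confidently wrong predictions are penalized far more heavily than confidently right ones are rewarded.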

---

## **10.4 Backpropagation: Learning by Gradient Descent**

The algorithm that makes deep learning possible. Computes gradients of loss w.r.t. all parameters using the **chain rule**.

#### **10.4.1 The Chain Rule Review**

If $L = f(g(x))$, then $\frac{dL}{dx} = \frac{dL}{dg} \cdot \frac{dg}{dx}$

For neural networks: composite functions of layers.
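
A worked example: take $g(x) = 3x + 1$ and $f(u) = u^2$, so $\frac{dL}{dx} = 2g(x) \cdot 3$. The sketch below compares that chain-rule result with a central finite difference (the functions and the step size `h` are illustrative choices).

```python
def g(x):
    return 3.0 * x + 1.0

def f(u):
    return u ** 2

def dL_dx(x):
    """Chain rule: dL/dg * dg/dx = 2 * g(x) * 3."""
    return 2.0 * g(x) * 3.0

x0 = 2.0
h = 1e-6
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)  # finite-difference estimate
analytic = dL_dx(x0)                                # 2 * 7 * 3 = 42.0
```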

#### **10.4.2 Backprop Algorithm**

For layer $l$ with pre-activation $z^{[l]}$ and activation $a^{[l]} = \sigma(z^{[l]})$:

1. **Forward pass:** Compute and cache all $z^{[l]}$, $a^{[l]}$
2. **Backward pass (output layer):**
   $$\delta^{[L]} = \nabla_a L \odot \sigma'(z^{[L]})$$
   (Element-wise product of loss gradient and activation derivative)
   
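3. **Propagate backwards:**
   $$\delta^{[l]} = \left(\delta^{[l+1]} (W^{[l+1]})^T\right) \odot \sigma'(z^{[l]})$$
   
4. **Compute gradients:**
   $$\nabla_{W^{[l]}} L = (a^{[l-1]})^T \delta^{[l]}$$
   $$\nabla_{b^{[l]}} L = \sum_{i} \delta^{[l]}_{(i)}$$

These formulas use the row-major convention of the code in this chapter: each row of $a^{[l]}$ and $\delta^{[l]}$ is one sample, $z^{[l]} = a^{[l-1]} W^{[l]} + b^{[l]}$, and $W^{[l]}$ has shape $(n_{l-1}, n_l)$. The sum in the bias gradient runs over the samples $i$ in the batch. If the loss is a mean over $m$ samples, the factor $1/m$ appears either in $\delta^{[L]}$ or, as in the implementation, in the final gradients.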

**Implementation:**
```python
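class NeuralNetwork:
    def __init__(self, layers):
        """layers: [input_dim, hidden1, ..., output_dim]"""
        self.parameters = {}
        for l in range(1, len(layers)):
            fan_in, fan_out = layers[l - 1], layers[l]
            # He initialization, suited to the ReLU hidden layers
            self.parameters[f'W{l}'] = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
            # Shape (1, fan_out) matches db computed with keepdims=True
            self.parameters[f'b{l}'] = np.zeros((1, fan_out))
    
    @staticmethod
    def softmax(Z):
        """Numerically stable row-wise softmax"""
        exp_Z = np.exp(Z - np.max(Z, axis=-1, keepdims=True))
        return exp_Z / np.sum(exp_Z, axis=-1, keepdims=True)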
    
    def forward(self, X):
        self.cache = {'A0': X}
        A = X
        L = len(self.parameters) // 2
        
        for l in range(1, L + 1):
            W = self.parameters[f'W{l}']
            b = self.parameters[f'b{l}']
            Z = np.dot(A, W) + b
            self.cache[f'Z{l}'] = Z
            A = np.maximum(0, Z) if l < L else self.softmax(Z)  # ReLU hidden, softmax output
            self.cache[f'A{l}'] = A
        
        return A
    
    def backward(self, X, Y, AL):
        """Compute gradients"""
        gradients = {}
        L = len(self.parameters) // 2
        m = X.shape[0]
        
        # Output layer gradient
        dZL = AL - Y  # For softmax + cross-entropy, this simplifies nicely
        
        for l in range(L, 0, -1):
            A_prev = self.cache[f'A{l-1}']
            
            # Gradients for current layer
            dW = np.dot(A_prev.T, dZL) / m
            db = np.sum(dZL, axis=0, keepdims=True) / m
            gradients[f'dW{l}'] = dW
            gradients[f'db{l}'] = db
            
            if l > 1:
                W = self.parameters[f'W{l}']
                dA_prev = np.dot(dZL, W.T)
                Z_prev = self.cache[f'Z{l-1}']
                dZL = dA_prev * (Z_prev > 0).astype(float)  # ReLU derivative
        
        return gradients
    
    def update_parameters(self, gradients, learning_rate):
        L = len(self.parameters) // 2
        for l in range(1, L + 1):
            self.parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}']
            self.parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}']
```
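
A standard way to validate a hand-written backward pass is to compare it with finite differences. The sketch below does this for a single softmax layer, where the analytic gradient uses the same `A - Y` simplification as `dZL` above. It is self-contained rather than tied to the `NeuralNetwork` class, and all function names are illustrative.

```python
import numpy as np

def softmax(Z):
    exp_Z = np.exp(Z - np.max(Z, axis=-1, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=-1, keepdims=True)

def loss_fn(W, b, X, Y):
    """Mean categorical cross-entropy of one softmax layer."""
    A = softmax(X @ W + b)
    return -np.mean(np.sum(Y * np.log(A), axis=1))

def analytic_grad_W(W, b, X, Y):
    """dL/dW = X^T (A - Y) / m."""
    A = softmax(X @ W + b)
    return X.T @ (A - Y) / X.shape[0]

def numerical_grad_W(W, b, X, Y, h=1e-5):
    """Central finite differences, perturbing one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        grad[idx] = (loss_fn(W_plus, b, X, Y) - loss_fn(W_minus, b, X, Y)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
Y = np.eye(3)[rng.integers(0, 3, size=8)]  # one-hot labels
W = rng.standard_normal((4, 3)) * 0.1
b = np.zeros((1, 3))

max_abs_diff = np.max(np.abs(analytic_grad_W(W, b, X, Y) - numerical_grad_W(W, b, X, Y)))
print(max_abs_diff)  # should be very small if the analytic gradient is right
```

The same check extends to deeper networks by perturbing each parameter in turn, at the cost of two forward passes per parameter.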

#### **10.4.3 Computational Graphs**

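Modern frameworks represent a computation as a graph of operations and differentiate it automatically. PyTorch (and TensorFlow 2 in eager mode) builds this graph dynamically as the code runs: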

```python
# PyTorch automatic differentiation
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()  # Computes dy/dx automatically
print(x.grad)  # dy/dx = 2x + 3 = 7
```

---

## **10.5 Optimization Algorithms**

#### **10.5.1 Stochastic Gradient Descent (SGD)**

Update: $\theta := \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$

**Variants:**
- **Batch GD:** Use all $m$ samples (accurate, slow)
- **Mini-batch GD:** Use $b$ samples ($b=32, 64, 128$) — standard
- **SGD:** Use 1 sample (noisy; the noise can help escape shallow local minima and saddle points)
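
Mini-batch iteration with per-epoch shuffling might look like the following sketch. The generator name `iterate_minibatches` and the toy data are assumptions for illustration.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once."""
    indices = rng.permutation(X.shape[0])  # new order on every call (epoch)
    for start in range(0, X.shape[0], batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

batches = list(iterate_minibatches(X, y, batch_size=4, rng=rng))
batch_sizes = [len(yb) for _, yb in batches]  # [4, 4, 2]: the last batch may be smaller
```

Setting `batch_size` to the dataset size gives batch gradient descent, and `batch_size=1` gives single-sample SGD.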

#### **10.5.2 Momentum**

Accumulate velocity to dampen oscillations:

$$v_t = \beta v_{t-1} + \nabla_\theta J(\theta)$$
$$\theta = \theta - \alpha v_t$$

$\beta = 0.9$ typically. Like a ball rolling downhill, accelerates in consistent directions.
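
The two update equations translate directly into code. This sketch applies two steps to the toy objective $J(\theta) = \frac{1}{2}\theta^2$ (whose gradient is $\theta$); the function name and learning rate are illustrative.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    velocity = beta * velocity + grad   # v_t = beta * v_{t-1} + grad
    theta = theta - lr * velocity       # theta = theta - alpha * v_t
    return theta, velocity

theta = np.array([1.0])
velocity = np.zeros_like(theta)

# Step 1: v = 1.0, theta = 1.0 - 0.1 * 1.0 = 0.9
theta, velocity = momentum_step(theta, velocity, grad=theta)
# Step 2: v = 0.9 * 1.0 + 0.9 = 1.8, theta = 0.9 - 0.1 * 1.8 = 0.72
theta, velocity = momentum_step(theta, velocity, grad=theta)
```

The second step is twice as large as the first even though the gradient shrank, which is the acceleration the rolling-ball picture describes.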

#### **10.5.3 RMSprop**

Adapts learning rate per parameter using moving average of squared gradients.

$$s_t = \beta_2 s_{t-1} + (1 - \beta_2) (\nabla_\theta J)^2$$
$$\theta = \theta - \alpha \frac{\nabla_\theta J}{\sqrt{s_t} + \epsilon}$$

Good for non-stationary objectives (RNNs).
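
A minimal sketch of one RMSprop step follows. The default `beta2=0.9` and `lr=0.01` are assumptions chosen for the example (the text does not fix them), and the two gradient components are deliberately on very different scales.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.01, beta2=0.9, eps=1e-8):
    s = beta2 * s + (1 - beta2) * grad ** 2          # moving average of squared gradients
    theta = theta - lr * grad / (np.sqrt(s) + eps)   # per-parameter scaled step
    return theta, s

theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)
grad = np.array([100.0, 0.01])  # gradients differing by four orders of magnitude

new_theta, s = rmsprop_step(theta, s, grad)
step_sizes = theta - new_theta
print(step_sizes)  # both components move by nearly the same amount
```

Dividing by $\sqrt{s_t}$ cancels the gradient's scale, so the step size is set mostly by the learning rate rather than by the raw gradient magnitude.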

#### **10.5.4 Adam (Adaptive Moment Estimation)**

Combines Momentum + RMSprop. Default choice for most problems.

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J \quad \text{(first moment)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta J)^2 \quad \text{(second moment)}$$

Bias correction:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update:
$$\theta = \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

**Hyperparameters:** $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$

```python
class Adam:
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = params
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = {k: np.zeros_like(v) for k, v in params.items()}
        self.v = {k: np.zeros_like(v) for k, v in params.items()}
        self.t = 0
    
    def step(self, gradients):
        self.t += 1
        for key in self.params:
            g = gradients[f'd{key}']
            
            # Update biased first moment estimate
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * g
            
            # Update biased second raw moment estimate  
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (g ** 2)
            
            # Compute bias-corrected estimates
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            
            # Update parameters
            self.params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```
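
One consequence of bias correction is easy to verify: at $t = 1$, $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the first update is approximately $\alpha \cdot \text{sign}(g)$ regardless of the gradient's scale. The helper below is a standalone sketch of that first step, not part of the `Adam` class above.

```python
import numpy as np

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the parameter update Adam makes at t = 1, starting from m = v = 0."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)   # equals grad
    v_hat = v / (1 - beta2)   # equals grad ** 2
    return lr * m_hat / (np.sqrt(v_hat) + eps)

grad = np.array([1000.0, -0.5, 0.001])
first_update = adam_first_step(grad)
print(first_update)  # each entry has magnitude close to lr
```

Without the correction, the zero-initialized moments would bias the early estimates toward zero and distort the first steps.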

#### **10.5.5 Learning Rate Schedules**

- **Step Decay:** Reduce by factor every $N$ epochs
- **Exponential Decay:** $\alpha = \alpha_0 e^{-kt}$
- **Cosine Annealing:** $\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\frac{t}{T}\pi))$
- **ReduceLROnPlateau:** Reduce when validation loss stops improving
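
The first three schedules are simple closed-form functions of the epoch. In the sketch below the parameter names and defaults (`drop`, `epochs_per_drop`, `k`, `lr_min`, `lr_max`) are illustrative assumptions.

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0, t, k=0.1):
    return lr0 * math.exp(-k * t)

def cosine_annealing(t, T, lr_min=0.0, lr_max=0.1):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

lr_start = cosine_annealing(0, 100)    # equals lr_max
lr_mid = cosine_annealing(50, 100)     # halfway between lr_min and lr_max
lr_end = cosine_annealing(100, 100)    # equals lr_min
```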

---

## **10.6 Regularization for Deep Networks**

#### **10.6.1 L2 Regularization (Weight Decay)**

Add $\frac{\lambda}{2}\|\mathbf{w}\|^2$ to loss. Penalizes large weights.

**In AdamW:** Decouple weight decay from gradient update (more effective than L2 penalty in Adam).
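
For plain SGD, an L2 penalty and decoupled weight decay produce the same update, which the sketch below checks numerically. With Adam they differ, because the $\lambda \mathbf{w}$ term added to the gradient is also divided by $\sqrt{\hat{v}_t}$; decoupling avoids that rescaling. The function names are illustrative.

```python
import numpy as np

def sgd_step_l2(w, grad, lr, lam):
    """L2 penalty: lam * w is added to the gradient of the data loss."""
    return w - lr * (grad + lam * w)

def sgd_step_decoupled(w, grad, lr, lam):
    """Decoupled decay: shrink the weights separately from the gradient step."""
    return w - lr * grad - lr * lam * w

w = np.array([1.0, -2.0, 3.0])
grad = np.array([0.1, 0.2, -0.3])

same_for_sgd = np.allclose(sgd_step_l2(w, grad, 0.1, 0.01),
                           sgd_step_decoupled(w, grad, 0.1, 0.01))
print(same_for_sgd)  # True
```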

#### **10.6.2 Dropout**

Randomly set neurons to zero during training. Forces network to learn redundant representations.

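$$a^{[l]}_{dropped} = a^{[l]} \odot \mathbf{m}, \quad m_i \sim \text{Bernoulli}(p)$$

Here $p$ is the probability of *keeping* a unit, so the dropout rate is $1 - p$. At test time, multiply activations by $p$; alternatively, scale by $1/p$ during training (inverted dropout, used in the code below) and leave test-time activations unchanged.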

```python
def dropout_forward(X, dropout_rate=0.5):
    if dropout_rate == 0:
        return X, None
    
    mask = (np.random.rand(*X.shape) < (1 - dropout_rate)) / (1 - dropout_rate)
    out = X * mask
    return out, mask  # Cache mask for backward pass

def dropout_backward(dout, mask):
    return dout * mask
```
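
The point of the $1/p$ scaling is that the expected activation is the same in training and at test time. The following self-contained sketch adds a `training` flag (an assumption, not present in the functions above) and checks this on an array of ones.

```python
import numpy as np

def inverted_dropout(X, dropout_rate, rng, training=True):
    if not training or dropout_rate == 0:
        return X                      # identity at test time
    keep_prob = 1.0 - dropout_rate
    mask = (rng.random(X.shape) < keep_prob) / keep_prob
    return X * mask

rng = np.random.default_rng(0)
X = np.ones((1000, 100))

train_out = inverted_dropout(X, 0.5, rng, training=True)
eval_out = inverted_dropout(X, 0.5, rng, training=False)
train_mean = train_out.mean()  # close to 1.0: kept units are scaled up to 2.0
```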

#### **10.6.3 Batch Normalization**

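Normalize layer inputs to have mean 0 and variance 1 over each mini-batch. This allows higher learning rates and was originally motivated as a way to reduce internal covariate shift. At inference, running averages of the batch statistics collected during training are used in place of per-batch statistics.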

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$y^{(k)} = \gamma \hat{x}^{(k)} + \beta$$

$\gamma$ and $\beta$ are learned parameters (allows network to learn identity if needed).

**Placement:** Usually after linear layer, before activation.

```python
def batchnorm_forward(x, gamma, beta, eps=1e-8):
    # Mini-batch mean and variance
    mu = np.mean(x, axis=0)
    var = np.var(x, axis=0)
    
    # Normalize
    x_hat = (x - mu) / np.sqrt(var + eps)
    
    # Scale and shift
    out = gamma * x_hat + beta
    
    cache = (x, x_hat, mu, var, gamma, eps)
    return out, cache
```
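
The forward pass above covers training only. A sketch of how training and inference modes could be combined in one layer follows; the class name `BatchNorm1D`, the `momentum` convention for the running averages, and the `eps` default are assumptions made for this example.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training=True):
        if training:
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            # Exponential moving averages, used later at inference
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)
batch = rng.standard_normal((64, 3)) * 5.0 + 2.0

train_out = bn.forward(batch, training=True)
single_out = bn.forward(batch[:1], training=False)  # inference works with batch size 1
```

Using stored statistics at inference keeps a sample's output independent of whatever else happens to be in the batch.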

---

## **10.7 Weight Initialization**

Bad initialization can cause vanishing or exploding gradients.

#### **10.7.1 Xavier/Glorot Initialization**

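For sigmoid/tanh. Chosen so that the variance of each layer's outputs is approximately equal to the variance of its inputs.

$$W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma^2 = \frac{2}{n_{in} + n_{out}} \quad \text{or} \quad \sigma^2 = \frac{1}{n_{in}}$$

The uniform variant $\mathcal{U}(-r, r)$ with $r = \sqrt{6 / (n_{in} + n_{out})}$ has the same variance, $r^2/3 = 2/(n_{in} + n_{out})$.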

#### **10.7.2 He Initialization**

For ReLU activations. Doubles the variance to compensate for ReLU zeroing roughly half of the pre-activations.

$$W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma^2 = \frac{2}{n_{in}}$$

```python
def initialize_weights(shape, method='he'):
    fan_in, fan_out = shape
    if method == 'xavier':
        limit = np.sqrt(6 / (fan_in + fan_out))
        return np.random.uniform(-limit, limit, shape)
    elif method == 'he':
        return np.random.randn(*shape) * np.sqrt(2.0 / fan_in)
    else:
        raise ValueError(f"Unknown initialization method: {method!r}")
```
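
The effect of the scale can be observed by pushing random data through a stack of ReLU layers and measuring the final activations. This is a rough sketch: the depth, width, seed, and the "too small" standard deviation of 0.01 are arbitrary choices for illustration.

```python
import numpy as np

def final_activation_std(weight_std_fn, depth=10, width=256, seed=0):
    """Std of activations after `depth` ReLU layers with weights of the given std."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((512, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        a = np.maximum(0, a @ W)
    return a.std()

he_std = final_activation_std(lambda fan_in: np.sqrt(2.0 / fan_in))
small_std = final_activation_std(lambda fan_in: 0.01)
print(he_std, small_std)  # He keeps the signal at order 1; the small init collapses toward 0
```

When the forward signal collapses like this, the backpropagated gradients shrink in the same way, which is the vanishing-gradient failure the section warns about.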

---

## **10.8 Workbook Labs**

### **Lab 1: Neural Network from Scratch**
Implement a 2-layer NN (input → hidden → output) using only NumPy on MNIST:
1. Forward pass with ReLU and Softmax
2. Backpropagation computing all gradients manually
3. SGD and Adam optimizers
4. Achieve >95% accuracy on test set

**Deliverable:** `neural_network_numpy.py` with <5% accuracy gap to sklearn MLP.

### **Lab 2: Vanishing Gradient Demonstration**
Train deep networks (10+ layers) with:
1. Sigmoid activation (should fail/vanish)
2. ReLU activation (should train)
3. He initialization vs naive initialization (e.g., unscaled standard normal weights)

Visualize gradient norms at each layer.

**Deliverable:** Plots showing gradient flow comparison.

### **Lab 3: Optimizer Comparison**
Implement SGD, Momentum, RMSprop, Adam from scratch.
Train identical network with each, plot loss curves.
Check whether Adam converges fastest and whether SGD with momentum reaches minima that generalize better (compare test accuracy, not just training loss).

**Deliverable:** Comparison notebook with convergence plots.

### **Lab 4: Regularization Effects**
On CIFAR-10 (or synthetic data):
1. Train without regularization (overfitting)
2. Add Dropout (0.5)
3. Add Batch Normalization
4. Combine Dropout + BatchNorm + Weight Decay

Measure train/val gap for each.

**Deliverable:** Table showing regularization impact on generalization gap.

---

## **10.9 Common Pitfalls**

1. **Not Shuffling Data:** Reshuffle the training data every epoch before splitting it into mini-batches. Batches drawn in a fixed or sorted order (for example, grouped by class) give biased gradient estimates.

2. **Applying Softmax Twice:** In PyTorch, pass raw logits to `CrossEntropyLoss`; it applies log-softmax internally for numerical stability, so adding your own softmax first distorts the loss.

3. **Learning Rate Too High:** Causes divergence (NaN losses). Too low: never converges. Use LR finder.

4. **Forgetting Train Mode vs Eval Mode:** Dropout and BatchNorm behave differently! Call `model.train()` and `model.eval()` in PyTorch.

5. **Exploding Gradients:** In RNNs or deep networks. Solution: Gradient clipping.
   ```python
   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
   ```

6. **Testing on Training Data:** Always evaluate on a separate validation set.
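
For pitfall 5 in the from-scratch NumPy setting, clipping by global norm can be sketched as below. The function name `clip_by_global_norm` and the gradient dictionary layout (matching the `dW`/`db` keys used earlier in the chapter) are assumptions for this example.

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients.values()))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale up
    return {k: g * scale for k, g in gradients.items()}, total_norm

gradients = {'dW1': np.array([[3.0, 0.0]]), 'db1': np.array([4.0])}
clipped, norm_before = clip_by_global_norm(gradients, max_norm=1.0)  # norm_before = 5.0
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped.values()))
```

Scaling every gradient by the same factor preserves the update direction and only limits its length.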

---

## **10.10 Interview Questions**

**Q1:** Why do we need activation functions? What happens if we don't use them?
*A: Without activation functions, a composition of linear transformations is still linear: $W_3(W_2(W_1x)) = W_{eff}x$. The network couldn't learn non-linear decision boundaries, regardless of depth. Activations introduce non-linearity, allowing approximation of complex functions.*

**Q2:** Explain the vanishing gradient problem and how ReLU helps.
*A: In deep networks with sigmoid/tanh, gradients become increasingly small as they propagate backward (derivatives < 1 multiply together, approaching zero). This prevents early layers from learning. ReLU has derivative 1 for positive inputs, allowing gradients to flow unchanged through active neurons, mitigating vanishing gradients.*

**Q3:** What is the difference between Batch Norm and Layer Norm?
*A: Batch Norm normalizes across the batch dimension (computes mean/var over batch for each feature). Layer Norm normalizes across the feature dimension (computes mean/var over all features for each sample). Batch Norm requires batch statistics; Layer Norm works with batch size 1. Layer Norm is preferred in RNNs and Transformers where sequence lengths vary.*
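
In code, the difference comes down to the axis over which statistics are computed. This sketch omits the learned scale and shift for brevity; the function names are illustrative.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Statistics per feature, computed across the batch (axis 0)."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Statistics per sample, computed across the features (axis 1)."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6)) * 3.0 + 1.0  # (batch, features)

bn_out = batch_norm(x)   # each column has mean ~0
ln_out = layer_norm(x)   # each row has mean ~0
```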

**Q4:** Why does Adam often converge faster than SGD, but SGD sometimes generalizes better?
*A: Adam adapts learning rates per parameter and uses momentum, navigating complex loss landscapes faster initially. However, SGD with momentum (and proper LR decay) can find wider, flatter minima that generalize better. Adam's adaptive learning rates might cause it to settle in sharp minima. Solutions: AdamW (decoupled weight decay), or fine-tuning with SGD after Adam pretraining.*

**Q5:** Implement backprop for a single linear layer: $y = Wx + b$, loss $L = \frac{1}{2}(y - t)^2$. Compute $\frac{\partial L}{\partial W}$.
*A: $\frac{\partial L}{\partial y} = (y - t)$. $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial W} = (y - t)\, x^T$ (an outer product when $x$ and $y$ are vectors, with $L = \frac{1}{2}\|y - t\|^2$). For a batch of $N$ samples with mean loss: $\frac{\partial L}{\partial W} = \frac{1}{N} \sum_{i=1}^N (y_i - t_i)\, x_i^T$.*

---

## **10.11 Further Reading**

**Books:**
- *Deep Learning* (Goodfellow, Bengio, Courville) - Chapters 6 (Feedforward), 7 (Regularization), 8 (Optimization)
- *Neural Networks and Deep Learning* (Michael Nielsen) - Free online, excellent for backprop intuition

**Papers:**
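- "ImageNet Classification with Deep Convolutional Neural Networks" (Krizhevsky et al., 2012) - ReLU popularization
- "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe & Szegedy, 2015)
- "Adam: A Method for Stochastic Optimization" (Kingma & Ba, 2015)
- "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (He et al., 2015) - He initialization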

---

## **10.12 Checkpoint Project: Deep Learning Library (MicroTorch)**

Build a miniature PyTorch-like library with automatic differentiation.

**Requirements:**

1. **Tensor Class:** Wrapper around NumPy array with `requires_grad` flag
2. **Operations:** Add, Multiply, MatMul, ReLU, Softmax
3. **Autograd:** Build computation graph dynamically, implement backward() using reverse-mode autodiff
4. **Optimizer:** SGD and Adam implementations
5. **NN Module:** Linear layers, Sequential container
6. **Training Loop:** Fit on MNIST, achieve >90% accuracy
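
As a starting point for requirement 3, here is a minimal scalar-only sketch of reverse-mode autodiff. The `Value` class and its method names are hypothetical and far simpler than the tensor-based library the project asks for; it only illustrates how a graph can be recorded during the forward pass and traversed in reverse topological order.

```python
class Value:
    """Scalar node in a dynamically built computation graph."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order: a node's grad is complete before it is propagated
        order, visited = [], set()
        def visit(node):
            if node not in visited:
                visited.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

x = Value(2.0)
y = x * x + x * 3.0 + 1.0  # same function as the PyTorch example: x^2 + 3x + 1
y.backward()
print(y.data, x.grad)  # 11.0 7.0
```

Gradients are accumulated with `+=` because a value such as `x` can feed several operations, and each use contributes to its total derivative.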

**API Design:**
```python
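import numpy as np
from microtorch import Tensor, nn, optim

# Should work like PyTorch
x = Tensor(np.random.randn(32, 784), requires_grad=True)
y = Tensor(np.random.randint(0, 10, size=32))  # placeholder integer class labels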
model = nn.Sequential([
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
])
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training
for epoch in range(10):
    optimizer.zero_grad()
    logits = model(x)
    loss = nn.cross_entropy(logits, y)
    loss.backward()  # Compute all gradients automatically
    optimizer.step()
```

**Deliverables:**
- `microtorch/` package with `tensor.py`, `nn.py`, `optim.py`
- Tests comparing gradients to numerical differentiation (finite differences)
- MNIST training script

**Success Criteria:**
- Backward pass computes correct gradients (within 1e-5 of numerical)
- Training converges on MNIST
- Memory efficient (no storing unnecessary intermediate values)

---

**End of Chapter 10**

*You now understand how neural networks learn. Chapter 11 will cover Deep Learning Frameworks (PyTorch, TensorFlow, JAX) and how to use them efficiently for large-scale models.*

