
# Deep Learning & Artificial Neural Networks — A Hands‑On Primer

**Format:** Notebook with math in $$...$$ for easy copy into Markdown cells.  
**What you’ll learn:**  
1) CNN layers (convolution, ReLU, pooling, flattening)  
2) Neurons & synapses (weights & biases)  
3) Activation functions (what/why/derivatives)  
4) Gradient descent vs brute‑force optimization  
5) Full training loop of a tiny Neural Network (from scratch, NumPy)

> Tip: Run each code cell in order. The math blocks are self-contained and can be pasted into Markdown cells.



## 0) What is Deep Learning and an Artificial Neural Network (ANN)?

**Deep Learning** uses **stacked layers** of simple units (**neurons**) to learn complex functions from data.  
Each neuron computes a **weighted sum** of its inputs, adds a **bias**, then applies a **nonlinear activation**.

$$
\text{Given input vector } x \in \mathbb{R}^{d}, \quad
z = w^\top x + b, \quad
a = \phi(z).
$$

**Symbols:**  
- $$x \in \mathbb{R}^d $$ — input (features).  
- $$w \in \mathbb{R}^d$$ — weights (strength of each input connection).  
- $$b \in \mathbb{R}$$ — bias (threshold).  
- $$z \in \mathbb{R}$$ — pre-activation (linear response).  
- $$a = \phi(z)$$ — post-activation (nonlinear output).

A **layer** bundles many neurons in parallel. If the previous layer output is $$a^{(l-1)} \in \mathbb{R}^{n_{l-1}}$$ and the current has $$n_l$$ neurons, then

$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad
a^{(l)} = \phi\!\left(z^{(l)}\right).
$$

**Shapes:**  
- $$W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$$, $$b^{(l)} \in \mathbb{R}^{n_l}$$, $$z^{(l)}, a^{(l)} \in \mathbb{R}^{n_l}$$.
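
The shapes above can be checked with a minimal sketch. The sizes (4 inputs, 3 neurons), the random values, and the choice of ReLU as $$\phi$$ are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical sizes for illustration: 4 inputs feeding a layer of 3 neurons.
n_prev, n_l = 4, 3
rng = np.random.default_rng(0)

a_prev = rng.normal(size=n_prev)       # a^{(l-1)}, shape (4,)
W_l = rng.normal(size=(n_l, n_prev))   # W^{(l)},   shape (3, 4)
b_l = np.zeros(n_l)                    # b^{(l)},   shape (3,)

z_l = W_l @ a_prev + b_l               # pre-activation, shape (3,)
a_l = np.maximum(0.0, z_l)             # ReLU assumed as the activation phi

# Row k of W_l holds the weights of neuron k, so z_l[k] = w_k^T a_prev + b_k.
z_single = W_l[0] @ a_prev + b_l[0]
print(z_l.shape, a_l.shape, np.isclose(z_l[0], z_single))
```

Each row of the weight matrix is one neuron's weight vector, which is why the layer formula reduces to the single-neuron formula row by row.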

Stacking layers gives a **deep** network: a function composition that can model rich patterns.



## 1) CNN Layers: Convolution, ReLU, Pooling, Flattening

We use a **shop image** analogy: imagine an input image (e.g., a product photo). CNNs scan it with small **filters** to detect edges/textures.

### 1.1 Convolution (2D, single channel)

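Deep learning calls the operation below convolution even though the kernel is not flipped (strictly, it is cross-correlation). Given an input image $$X \in \mathbb{R}^{H \times W}$$ and a filter/kernel $$K \in \mathbb{R}^{k_h \times k_w}$$, the **valid** (no padding, stride 1) convolution at output location $$(i,j)$$, which is also the top-left corner of the input window, is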

$$
Y_{i,j} = \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} X_{i+u,\; j+v} \; K_{u,v}.
$$

**With stride $$s$$ and zero-padding $$p$$:** output size is  
$$
H_{\text{out}} = \left\lfloor \frac{H - k_h + 2p}{s} \right\rfloor + 1,\quad
W_{\text{out}} = \left\lfloor \frac{W - k_w + 2p}{s} \right\rfloor + 1.
$$
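
The output-size formula can be wrapped in a small helper. The function name `conv_output_size` and the example sizes are hypothetical, chosen only to illustrate the arithmetic.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one axis: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

print(conv_output_size(4, 3))              # 4x4 input, 3x3 kernel, no padding -> 2
print(conv_output_size(32, 3, p=1, s=1))   # padding 1 keeps a 3x3 kernel at size 32
print(conv_output_size(28, 5, p=0, s=2))   # floor(23 / 2) + 1 -> 12
```

The same helper applies to height and width independently.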

**Multi-channel (e.g., RGB):** for input depth $$C$$ and filter $$K \in \mathbb{R}^{C \times k_h \times k_w}$$,
$$
Y_{i,j} = \sum_{c=0}^{C-1} \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} X_{c,i+u,\; j+v} \; K_{c,u,v} + b.
$$

**Symbols:**  
- $$X$$ — input image/feature map, $$K$$ — kernel/filter, $$b$$ — per-filter bias, $$Y$$ — output feature map.  
- $$s$$ — stride (step size), $$p$$ — padding (border zeros), $$(i,j)$$ — output spatial indices.
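
The multi-channel formula can be sketched as follows for the valid, stride-1 case. The function name `conv2d_multichannel` and the all-ones test arrays are assumptions for illustration, not part of the notebook's own cells.

```python
import numpy as np

def conv2d_multichannel(X, K, b=0.0):
    """Valid, stride-1 multi-channel convolution: X is (C, H, W), K is (C, kh, kw)."""
    C, H, W = X.shape
    Ck, kh, kw = K.shape
    assert C == Ck, "input and kernel must have the same number of channels"
    Y = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Sum over all channels and the kh x kw window, then add the bias.
            Y[i, j] = np.sum(X[:, i:i+kh, j:j+kw] * K) + b
    return Y

X_rgb = np.ones((3, 4, 4))   # hypothetical 3-channel 4x4 input, all ones
K_rgb = np.ones((3, 2, 2))   # hypothetical 3-channel 2x2 kernel, all ones
Y_rgb = conv2d_multichannel(X_rgb, K_rgb, b=0.5)
print(Y_rgb.shape)    # (3, 3)
print(Y_rgb[0, 0])    # 3 * 2 * 2 + 0.5 = 12.5
```

One filter collapses all input channels into a single output map; a layer with several filters would repeat this once per filter.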

**Intuition:** kernels detect **local patterns** (edges/corners). Stacking many filters yields multiple feature maps.


In [None]:

import numpy as np

# Tiny numeric example: 4x4 image, 3x3 kernel, stride=1, no padding
X = np.array([
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 1, 0, 1],
    [1, 0, 2, 3]
], dtype=float)

K = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1]
], dtype=float)  # simple "vertical edge" detector (Prewitt-style; Sobel would weight the middle row by 2)

def conv2d_valid(X, K):
    H, W = X.shape
    kh, kw = K.shape
    Hout, Wout = H - kh + 1, W - kw + 1
    Y = np.zeros((Hout, Wout), dtype=float)
    for i in range(Hout):
        for j in range(Wout):
            window = X[i:i+kh, j:j+kw]
            Y[i, j] = np.sum(window * K)
    return Y

Y = conv2d_valid(X, K)
print("Input X:\n", X)
print("\nKernel K:\n", K)
print("\nConvolution Y:\n", Y)



### 1.2 ReLU (Rectified Linear Unit)

Applies an elementwise **nonlinearity**: it keeps positive values and zeroes out negative ones:

$$
\text{ReLU}(z) = \max(0, z), \qquad
\frac{d}{dz}\text{ReLU}(z) = \begin{cases}1 & z>0 \\ 0 & z\le 0 \end{cases}.
$$

**Why:** nonlinearity lets networks compose features into richer patterns; it’s cheap and reduces vanishing gradients compared to sigmoids.
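
A rough way to see the vanishing-gradient point is to multiply the activation derivatives along a chain of layers. The sketch below assumes a hypothetical depth of 10, ignores the weight matrices entirely, and uses the sigmoid derivative $$\sigma(z)(1-\sigma(z))$$ from Section 3, so it only illustrates the trend.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

depth = 10  # hypothetical number of stacked layers

# Best case for sigmoid: its derivative peaks at 0.25 when z = 0.
best_sigmoid_factor = sigmoid_grad(0.0) ** depth
# ReLU's derivative is exactly 1 for any active unit (z > 0).
relu_factor = 1.0 ** depth

print(best_sigmoid_factor)   # 0.25 ** 10, roughly 9.5e-07
print(relu_factor)           # 1.0
```

Even in its best case, the sigmoid factor shrinks the gradient by about six orders of magnitude over ten layers, while active ReLU units pass it through unchanged.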


In [None]:

import numpy as np

def relu(X):
    return np.maximum(0, X)

# Apply ReLU to the previous convolution output
Y_relu = relu(Y)
print("ReLU(Y):\n", Y_relu)



### 1.3 Pooling (Downsampling)

**Max pooling** keeps the strongest activation in a window; **average pooling** averages.  
For window size $$p_h \times p_w$$ (unrelated to the padding $$p$$ of Section 1.1) and stride $$s$$:

- Max pooling:
$$
Y_{i,j} = \max_{0 \le u < p_h,\; 0 \le v < p_w} X_{i\cdot s + u,\; j\cdot s + v}.
$$

- Average pooling:
$$
Y_{i,j} = \frac{1}{p_h p_w} \sum_{u=0}^{p_h-1} \sum_{v=0}^{p_w-1} X_{i\cdot s + u,\; j\cdot s + v}.
$$

**Why:** reduces spatial size, creates local invariance, and saves parameters/computation later.
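
Average pooling follows the same window logic as max pooling, with a mean in place of the maximum. This is a minimal sketch; the name `avg_pool2d` and the 4x4 demo input are assumptions for illustration.

```python
import numpy as np

def avg_pool2d(X, pool=(2, 2), stride=2):
    ph, pw = pool
    H, W = X.shape
    Hout = (H - ph) // stride + 1
    Wout = (W - pw) // stride + 1
    Y = np.zeros((Hout, Wout))
    for i in range(Hout):
        for j in range(Wout):
            # Mean of the ph x pw window whose top-left corner is (i*stride, j*stride).
            Y[i, j] = X[i*stride:i*stride+ph, j*stride:j*stride+pw].mean()
    return Y

X_demo = np.arange(16, dtype=float).reshape(4, 4)
print(avg_pool2d(X_demo))
# [[ 2.5  4.5]
#  [10.5 12.5]]
```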


In [None]:

def max_pool2d(X, pool=(2,2), stride=2):
    ph, pw = pool
    H, W = X.shape
    Hout = (H - ph)//stride + 1
    Wout = (W - pw)//stride + 1
    Y = np.zeros((Hout, Wout), dtype=float)
    for i in range(Hout):
        for j in range(Wout):
            window = X[i*stride:i*stride+ph, j*stride:j*stride+pw]
            Y[i, j] = np.max(window)
    return Y

pooled = max_pool2d(Y_relu, pool=(2,2), stride=2)
print("MaxPool(ReLU(conv)):\n", pooled)



### 1.4 Flattening

Converts a multi-dimensional feature map into a **1‑D vector** to feed **fully connected** layers.

If the feature map is $$X \in \mathbb{R}^{C \times H \times W}$$, **flatten** reshapes it into $$x \in \mathbb{R}^{CHW}$$ without changing values.

**Why:** Dense layers expect vectors; flatten bridges CNN features to classifiers/regressors.


In [None]:

# Example flatten
C, H, W = 1, pooled.shape[0], pooled.shape[1]
feat_map = pooled.reshape(C, H, W)
flat = feat_map.reshape(-1)  # CHW
print("Feature map shape:", feat_map.shape, "-> Flattened vector length:", flat.shape[0])



## 2) Neurons and Synapses (Weights)

A **neuron** computes a weighted sum (synapses = weighted connections) plus bias, then applies a nonlinearity:

$$
z = w^\top x + b, \qquad a = \phi(z).
$$

In a **layer** with $$n_l$$ neurons, compact form:

$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \phi\!\left(z^{(l)}\right).
$$

**Connections = parameters:** every edge (synapse) has a weight. The goal of training is to **learn** these weights and biases to minimize a **loss**.


In [None]:

# Tiny dense layer forward example (from scratch):
rng = np.random.default_rng(42)
x = flat  # vector from CNN toy pathway
in_dim = x.size
out_dim = 3
W = rng.normal(scale=0.1, size=(out_dim, in_dim))
b = np.zeros(out_dim)
z = W @ x + b
a = np.maximum(0, z)  # ReLU
print("Dense layer output (3 units):", a)



## 3) Activation Functions

Nonlinearities let networks approximate complex functions. Common choices:

**Sigmoid** (outputs lie in (0,1), so they can be read as probabilities; saturates for large $$|z|$$):
$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1-\sigma(z)).
$$

**Tanh** (zero-centered; still can saturate):
$$
\tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}, \quad \frac{d}{dz}\tanh(z) = 1 - \tanh^2(z).
$$
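
The two derivative identities above can be checked numerically with central differences. The helper name `central_diff`, the step size, and the test points are assumptions chosen for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def central_diff(f, z, h=1e-5):
    """Numerical derivative (f(z+h) - f(z-h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

zs = np.linspace(-3.0, 3.0, 7)

# Compare the numerical slope with the closed-form derivative at each point.
sig_err = np.max(np.abs(central_diff(sigmoid, zs) - sigmoid(zs) * (1 - sigmoid(zs))))
tanh_err = np.max(np.abs(central_diff(np.tanh, zs) - (1 - np.tanh(zs) ** 2)))
print(sig_err, tanh_err)   # both should be tiny
```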

**ReLU** (sparse activations; simple; popular):
$$
\text{ReLU}(z) = \max(0,z), \quad \text{ReLU}'(z) = \begin{cases}1 & z>0\\ 0 & z\le 0\end{cases}.
$$

**Leaky ReLU** (mitigates “dead” ReLUs by keeping a small slope $$\alpha$$ for negative inputs):
$$
\text{LeakyReLU}_\alpha(z) = \begin{cases}z & z\ge 0\\ \alpha z & z<0 \end{cases}, \quad \alpha \in (0,1).
$$

**Softmax** (multi-class output; converts scores to a probability simplex):
$$
\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j} e^{z_j}}.
$$

**Notes:**  
- Use **ReLU-family** in hidden layers typically.  
- Use **sigmoid** for binary output and **softmax** for multi-class output (with cross-entropy loss).
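
Softmax is not plotted in this section, so here is a minimal sketch of it. Subtracting the maximum score before exponentiating is a common stability trick (it leaves the result unchanged because softmax is shift-invariant); the example scores are hypothetical.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    # Shift by the max so the largest exponent is exp(0) = 1 and cannot overflow.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 1000.0])   # a naive exp(1000) would overflow to inf

print(probs, probs.sum())
print(big)
```

The outputs are non-negative and sum to 1, and the largest score receives the largest probability.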


In [None]:

# Visualize activations (one plot per function)
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-5, 5, 400)

def sigmoid(z): return 1/(1+np.exp(-z))
def tanh(z): return np.tanh(z)
def relu(z): return np.maximum(0,z)
def leaky_relu(z, alpha=0.1): return np.where(z>0, z, alpha*z)

for name, f in [("Sigmoid", sigmoid), ("Tanh", tanh), ("ReLU", relu)]:
    plt.figure()
    plt.plot(xs, f(xs))
    plt.title(name)
    plt.xlabel("z"); plt.ylabel(f"{name}(z)")
    plt.show()

plt.figure()
plt.plot(xs, leaky_relu(xs, 0.1))
plt.title("Leaky ReLU (alpha=0.1)")
plt.xlabel("z"); plt.ylabel("LeakyReLU(z)")
plt.show()



## 4) Gradient Descent vs Brute Force Optimization

We want to minimize a **loss** $$L(\theta)$$ over parameters $$\theta$$.  
**Gradient Descent (GD)** updates iteratively in the negative gradient direction:

$$
\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_\theta L\!\left(\theta^{(t)}\right).
$$

**Symbols:**  
- $$\theta$$ — parameters (weights/biases).  
- $$\eta>0$$ — learning rate (step size).  
- $$\nabla_\theta L$$ — gradient (vector of partial derivatives).

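**Why not brute force?** A grid with $$k$$ candidate values per parameter needs $$k^n$$ loss evaluations for $$n$$ parameters, so the cost grows exponentially (and the continuous parameter space itself is infinite). Grid search is feasible for one parameter, but not for the millions of parameters in a typical neural network.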
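
To make the scaling concrete, the sketch below counts the loss evaluations a full grid would need. The helper name `grid_evaluations` and the resolution of 10 points per parameter are assumptions for illustration.

```python
def grid_evaluations(points_per_param, n_params):
    """Number of loss evaluations for a full grid: k ** n."""
    return points_per_param ** n_params

for n_params in (1, 2, 6, 100):
    count = grid_evaluations(10, n_params)
    print(f"{n_params:>3} parameters -> {count:.1e} evaluations")
```

Already at 100 parameters a coarse 10-point grid needs 1e100 evaluations, far beyond anything computable.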

### Toy 1‑D example
Fit $$w$$ to minimize mean squared error on pairs $$(x_i,y_i)$$ with model $$\hat{y}_i = w x_i$$:

$$
L(w) = \frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2
= \frac{1}{N}\sum_{i=1}^N (w x_i - y_i)^2, \quad
\frac{dL}{dw} = \frac{2}{N}\sum_{i=1}^N x_i (w x_i - y_i).
$$

We’ll compare **GD** vs **brute-force grid** on $$w \in [-5,5]$$.
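
For this particular model the minimum also has a closed form: setting $$\frac{dL}{dw}=0$$ gives $$w^\star = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$. The sketch below checks that on a tiny hand-made dataset (the arrays `x_demo` and `y_demo` are hypothetical); it serves as a reference value that both GD and the grid should approach.

```python
import numpy as np

x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 4.0, 6.0])   # exactly y = 2x, so the best w is 2

w_closed = np.sum(x_demo * y_demo) / np.sum(x_demo ** 2)
grad_at_min = (2.0 / len(x_demo)) * np.sum(x_demo * (w_closed * x_demo - y_demo))

print(w_closed)      # 28 / 14 = 2.0
print(grad_at_min)   # 0.0
```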


In [None]:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 60
x = rng.uniform(-2, 2, size=N)
true_w = 1.7
y = true_w * x + rng.normal(scale=0.2, size=N)

def loss(w):
    return np.mean((w*x - y)**2)

def grad(w):
    return (2.0/len(x)) * np.sum(x * (w*x - y))

# Brute-force grid (for 1D only, feasible here)
ws = np.linspace(-5,5,1000)
Ls = np.array([loss(w) for w in ws])
w_star_grid = ws[np.argmin(Ls)]

# Gradient Descent
w = -4.0
eta = 0.05
hist = []
for t in range(200):
    g = grad(w)
    w = w - eta * g
    hist.append((t, w, loss(w)))

w_gd = w

# Plot loss landscape and GD path
plt.figure()
plt.plot(ws, Ls)
plt.title("1D Loss Landscape (grid)")
plt.xlabel("w"); plt.ylabel("L(w)")
plt.show()

plt.figure()
plt.plot([h[0] for h in hist], [h[2] for h in hist])
plt.title("Gradient Descent Loss over Iterations")
plt.xlabel("iteration"); plt.ylabel("L(w)")
plt.show()

print(f"True w ~ {true_w:.3f}, Grid argmin ~ {w_star_grid:.3f}, GD final ~ {w_gd:.3f}")



**Takeaways:**  
- In low dimensions, a grid can find the minimum, but its cost grows exponentially with the number of parameters.  
- **GD** works in high dimensions by following the slope (gradient).  
- In practice, we use **mini‑batch SGD**, **Momentum**, **Adam**, etc., but the core idea is the same: move parameters downhill.
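
As a rough illustration of the last point, here is a sketch of mini-batch SGD with momentum on the same kind of 1-D problem. The data, batch size, learning rate, and momentum coefficient are all assumptions chosen for the example, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
x_sgd = rng.uniform(-2, 2, size=200)
y_sgd = 1.7 * x_sgd + rng.normal(scale=0.2, size=200)   # true slope 1.7 plus noise

w_sgd, velocity = -4.0, 0.0
eta, beta, batch_size = 0.02, 0.9, 20   # hypothetical hyperparameters

for epoch in range(30):
    order = rng.permutation(len(x_sgd))            # reshuffle every epoch
    for start in range(0, len(x_sgd), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x_sgd[idx], y_sgd[idx]
        g = 2.0 * np.mean(xb * (w_sgd * xb - yb))  # gradient on this mini-batch only
        velocity = beta * velocity - eta * g       # momentum: running blend of past steps
        w_sgd = w_sgd + velocity

print(w_sgd)   # should land near the true slope 1.7
```

Each update sees only 20 samples, so individual gradients are noisy, but the momentum term averages them and the parameter still moves downhill.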



## 5) Training a Neural Network — Step by Step (from scratch)

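We’ll train a tiny **2‑layer** network (one hidden layer plus an output layer) on a simple 2D dataset. Unlike Section 0, samples are stacked as rows of $$X$$, so weight matrices multiply on the right.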

### 5.1 Forward pass

Layer 1 (hidden, ReLU):
$$
Z^{[1]} = X W^{[1]} + \mathbf{1} b^{[1]\top}, \quad A^{[1]} = \text{ReLU}(Z^{[1]}).
$$

Layer 2 (output, binary logistic):
$$
Z^{[2]} = A^{[1]} W^{[2]} + \mathbf{1} b^{[2]\top}, \quad \hat{y} = \sigma(Z^{[2]}) = \frac{1}{1+e^{-Z^{[2]}}}.
$$

**Shapes:**  
- $$X \in \mathbb{R}^{N \times d}$$, $$W^{[1]} \in \mathbb{R}^{d \times h}$$, $$b^{[1]} \in \mathbb{R}^{h}$$.  
- $$A^{[1]} \in \mathbb{R}^{N \times h}$$, $$W^{[2]} \in \mathbb{R}^{h \times 1}$$, $$b^{[2]} \in \mathbb{R}^{1}$$, $$\hat{y} \in \mathbb{R}^{N \times 1}$$.

### 5.2 Loss (binary cross-entropy)

$$
L = -\frac{1}{N}\sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right].
$$
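
A small sketch shows how this loss behaves. The function name `binary_cross_entropy`, the clipping constant, and the example labels and probabilities are assumptions for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Clip so log() never sees exactly 0 or 1.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 0.0, 1.0])
confident_right = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.9]))
confident_wrong = binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.1]))

print(confident_right)   # -log(0.9), about 0.105
print(confident_wrong)   # -log(0.1), about 2.303
```

Confident correct predictions cost little, while confident wrong ones are penalized heavily.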

### 5.3 Backpropagation (gradients via chain rule)

For sigmoid + cross-entropy, the output derivative simplifies:
$$
\frac{\partial L}{\partial Z^{[2]}} = \frac{1}{N}(\hat{y} - y).
$$
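
One way to see where this simplification comes from is to apply the chain rule to a single example with $$\hat{y} = \sigma(z)$$. The per-example loss symbol $$\ell$$ is introduced here only for this derivation.

$$
\ell = -\left[ y \log \hat{y} + (1-y)\log(1-\hat{y}) \right], \quad
\frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}, \quad
\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}).
$$

$$
\frac{\partial \ell}{\partial z} = \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y.
$$

The factor $$\hat{y}(1-\hat{y})$$ cancels, and since $$L$$ is the mean of $$\ell$$ over $$N$$ examples, each row of $$\frac{\partial L}{\partial Z^{[2]}}$$ picks up the factor $$\frac{1}{N}$$.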

Then propagate back:
$$
\frac{\partial L}{\partial W^{[2]}} = A^{[1]\top} \frac{\partial L}{\partial Z^{[2]}}, \quad
\frac{\partial L}{\partial b^{[2]}} = \sum_{i=1}^N \left(\frac{\partial L}{\partial Z^{[2]}}\right)_i.
$$

Through the hidden layer (ReLU), first to its activations, then to its pre-activations:
$$
\frac{\partial L}{\partial A^{[1]}} = \frac{\partial L}{\partial Z^{[2]}} W^{[2]\top}, \quad
\frac{\partial L}{\partial Z^{[1]}} = \left(\frac{\partial L}{\partial A^{[1]}}\right) \odot \mathbf{1}[Z^{[1]} > 0].
$$

Finally:
$$
\frac{\partial L}{\partial W^{[1]}} = X^\top \frac{\partial L}{\partial Z^{[1]}}, \quad
\frac{\partial L}{\partial b^{[1]}} = \sum_{i=1}^N \left(\frac{\partial L}{\partial Z^{[1]}}\right)_i.
$$

**Symbols:** $$\odot$$ is elementwise product; $$\mathbf{1}[\,\cdot\,]$$ is an indicator mask (1 if condition true else 0).
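
A common sanity check for formulas like these is to compare them against finite differences. The sketch below does that on a tiny random network; the sizes, the random data, the `loss_and_grads` helper, and the step size are all assumptions for illustration, and the check can misfire if a pre-activation sits exactly at the ReLU kink.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h = 5, 2, 3   # hypothetical tiny sizes
X_chk = rng.normal(size=(N, d))
y_chk = rng.integers(0, 2, size=(N, 1)).astype(float)
params = {
    "W1": rng.normal(size=(d, h)), "b1": rng.normal(size=(1, h)),
    "W2": rng.normal(size=(h, 1)), "b2": rng.normal(size=(1, 1)),
}

def loss_and_grads(p):
    # Forward pass (Section 5.1) and loss (Section 5.2).
    Z1 = X_chk @ p["W1"] + p["b1"]
    A1 = np.maximum(0, Z1)
    Z2 = A1 @ p["W2"] + p["b2"]
    Yhat = 1 / (1 + np.exp(-Z2))
    L = -np.mean(y_chk * np.log(Yhat) + (1 - y_chk) * np.log(1 - Yhat))
    # Backward pass (Section 5.3).
    dZ2 = (Yhat - y_chk) / N
    dA1 = dZ2 @ p["W2"].T
    dZ1 = dA1 * (Z1 > 0)
    grads = {"W2": A1.T @ dZ2, "b2": dZ2.sum(axis=0, keepdims=True),
             "W1": X_chk.T @ dZ1, "b1": dZ1.sum(axis=0, keepdims=True)}
    return L, grads

_, analytic = loss_and_grads(params)
max_err, step = 0.0, 1e-6
for name, value in params.items():
    for idx in np.ndindex(value.shape):
        old = value[idx]
        value[idx] = old + step          # nudge one parameter up
        L_plus, _ = loss_and_grads(params)
        value[idx] = old - step          # and down
        L_minus, _ = loss_and_grads(params)
        value[idx] = old                 # restore it
        numeric = (L_plus - L_minus) / (2 * step)
        max_err = max(max_err, abs(numeric - analytic[name][idx]))

print(max_err)   # should be very small if the backprop formulas are right
```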

### 5.4 Update (gradient descent)

With learning rate $$\eta$$:
$$
\Theta \leftarrow \Theta - \eta \nabla_\Theta L, \quad \Theta \in \{W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}\}.
$$

We’ll code all of this and plot the training loss.


In [None]:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Create a toy 2D binary dataset: two interleaved half-circles ("two moons"-like); the second arc is flipped and shifted
N = 400
# Class 0
theta0 = rng.uniform(0, np.pi, N//2)
r0 = 1.0 + 0.05*rng.standard_normal(N//2)
x0 = np.c_[r0*np.cos(theta0), r0*np.sin(theta0)] + np.array([0.0, 0.0])
# Class 1
theta1 = rng.uniform(0, np.pi, N//2)
r1 = 1.0 + 0.05*rng.standard_normal(N//2)
x1 = np.c_[r1*np.cos(theta1), -r1*np.sin(theta1)] + np.array([0.8, 0.2])

X = np.vstack([x0, x1])
y = np.vstack([np.zeros((N//2,1)), np.ones((N//2,1))])

# Initialize 2-layer NN
d = 2       # input dimension
h = 8       # hidden units
W1 = rng.normal(scale=0.7, size=(d, h))
b1 = np.zeros((1, h))
W2 = rng.normal(scale=0.7, size=(h, 1))
b2 = np.zeros((1, 1))

def relu(Z): return np.maximum(0, Z)
def relu_grad(Z): return (Z > 0).astype(Z.dtype)
def sigmoid(Z): return 1/(1+np.exp(-Z))

def forward(X):
    Z1 = X @ W1 + b1
    A1 = relu(Z1)
    Z2 = A1 @ W2 + b2
    Yhat = sigmoid(Z2)
    return Z1, A1, Z2, Yhat

def loss(Yhat, y):
    eps = 1e-8
    return -np.mean(y*np.log(Yhat+eps) + (1-y)*np.log(1-Yhat+eps))

eta = 0.1
epochs = 4000
loss_hist = []

for t in range(epochs):
    # Forward
    Z1, A1, Z2, Yhat = forward(X)
    L = loss(Yhat, y)
    loss_hist.append(L)

    # Backward
    dZ2 = (Yhat - y)/len(X)         # (N,1)
    dW2 = A1.T @ dZ2                # (h,1)
    db2 = np.sum(dZ2, axis=0, keepdims=True)

    dA1 = dZ2 @ W2.T                # (N,h)
    dZ1 = dA1 * relu_grad(Z1)       # (N,h)
    dW1 = X.T @ dZ1                 # (d,h)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    # Update
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

# Plot loss
plt.figure()
plt.plot(loss_hist)
plt.title("Training Loss (2-layer NN)")
plt.xlabel("epoch"); plt.ylabel("cross-entropy")
plt.show()

# Evaluate accuracy
_, A1, _, Yhat = forward(X)
pred = (Yhat >= 0.5).astype(int)
acc = (pred == y).mean()
print(f"Training accuracy: {acc:.3f}")



### 5.5 How pieces connect

1. **Convolution** extracts local patterns from images (edges, textures).  
2. **ReLU** injects nonlinearity so stacked layers can model complex functions.  
3. **Pooling** compresses spatial info for invariance & efficiency.  
4. **Flatten** reshapes feature maps into vectors for dense layers.  
5. **Neurons & synapses** carry the weights and biases, which are the parameters we optimize.  
6. **Activations** shape signal flow and gradients.  
7. **Backpropagation** computes the gradients; **gradient descent** uses them to update weights and reduce loss.  
8. Repeat forward→loss→backward→update over many **epochs** until convergence.
