# Core Equations from "Neural Networks and Deep Learning" Chapter 1

## Perceptron Output Rule
The perceptron output is determined as follows:
$$
\text{output} =
\begin{cases} 
0 & \text{if } \sum_j w_j x_j \leq \text{threshold}, \\
1 & \text{if } \sum_j w_j x_j > \text{threshold}.
\end{cases}
$$

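Writing the sum as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, and moving the threshold to the other side of the inequality as the bias $b \equiv -\text{threshold}$, the rule becomes: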
$$
\text{output} =
\begin{cases} 
0 & \text{if } w \cdot x + b \leq 0, \\
1 & \text{if } w \cdot x + b > 0.
\end{cases}
$$
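As a minimal sketch of the bias form of the rule, the Python function below evaluates a perceptron on a list of inputs. The function name `perceptron` and the example weights $(-2, -2)$ with bias $3$ are illustrative choices for this sketch; with them the perceptron behaves like a NAND gate on binary inputs.

```python
def perceptron(weights, inputs, bias):
    """Return 1 if w . x + b > 0, otherwise 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0


# Illustrative parameters (an assumption of this sketch, not part of the rule):
# w = (-2, -2) and b = 3 give NAND behaviour on binary inputs.
nand_weights, nand_bias = (-2, -2), 3
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(nand_weights, x, nand_bias))
```

Only the input $(1, 1)$ gives $w \cdot x + b = -1 \leq 0$, so it is the only input mapped to 0.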

---

## Sigmoid Neuron Output
The sigmoid function is defined as:
$$
\sigma(z) = \frac{1}{1 + e^{-z}},
$$
where $z = w \cdot x + b$.

Expanded form:
$$
\text{output} = \frac{1}{1 + \exp(-\sum_j w_j x_j - b)}.
$$
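The expanded form translates almost directly into code. The sketch below uses only the standard library; the names `sigmoid` and `sigmoid_neuron` are chosen here for illustration.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(weights, inputs, bias):
    """Output of a sigmoid neuron: sigma(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)
```

Unlike the perceptron, the output varies smoothly between 0 and 1, with `sigmoid(0.0)` equal to 0.5. Note that `math.exp(-z)` can overflow for very negative `z`, so this direct form is meant for illustration rather than for numerically robust code.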

---

## Notation for Gradients and Updates
For a sigmoid neuron, small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias change the output by an amount that is approximately linear in those changes:
$$
\Delta \text{output} \approx \sum_j \frac{\partial \text{output}}{\partial w_j} \Delta w_j + \frac{\partial \text{output}}{\partial b} \Delta b.
$$
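The linear approximation can be checked numerically for a single sigmoid neuron. The sketch below assumes the sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so that $\partial \text{output} / \partial w_j = \sigma'(z)\, x_j$ and $\partial \text{output} / \partial b = \sigma'(z)$; the specific numbers are arbitrary illustrative values.

```python
import math


def neuron_output(w, x, b):
    """Sigmoid neuron output sigma(w . x + b)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary illustrative parameters and small perturbations.
w, x, b = [0.5, -0.3], [1.0, 2.0], 0.1
dw, db = [1e-3, -2e-3], 5e-4

out = neuron_output(w, x, b)
slope = out * (1.0 - out)  # sigma'(z) evaluated at the current z

# Right-hand side of the approximation: sum_j (d out / d w_j) dw_j + (d out / d b) db
predicted = sum(slope * xj * dwj for xj, dwj in zip(x, dw)) + slope * db

# Actual change in output after applying the perturbations.
w_new = [wj + dwj for wj, dwj in zip(w, dw)]
actual = neuron_output(w_new, x, b + db) - out

print(predicted, actual)
```

The two printed values agree closely because the perturbations are small; the gap grows as $\Delta w_j$ and $\Delta b$ get larger.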

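The gradient vector $\nabla C$ stacks the partial derivatives of the cost $C$ with respect to each parameter (here the weights $w_1, w_2, \ldots$ and the bias $b$):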
$$
\nabla C = 
\begin{pmatrix}
\frac{\partial C}{\partial w_1} \\
\frac{\partial C}{\partial w_2} \\
\vdots \\
\frac{\partial C}{\partial b}
\end{pmatrix}.
$$

---

## Quadratic Cost Function
The quadratic cost (also called the mean squared error) measures how far the network's actual outputs are from the expected outputs, averaged over the training inputs:
$$
C(w, b) = \frac{1}{2n} \sum_x \| y(x) - a \|^2,
$$
where:
- $w$ are the weights,
- $b$ are the biases,
- $n$ is the number of training inputs,
- $y(x)$ is the expected output for input $x$,
- $a$ is the network's actual output when $x$ is the input (it depends on $x$, $w$, and $b$).
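A direct translation of this definition is sketched below. The function name `quadratic_cost` and the tiny data set are illustrative assumptions; each expected output and each actual output is a list of numbers of the same length.

```python
def quadratic_cost(expected_outputs, actual_outputs):
    """C = 1/(2n) * sum over x of ||y(x) - a||^2."""
    n = len(expected_outputs)
    total = 0.0
    for y, a in zip(expected_outputs, actual_outputs):
        # Squared Euclidean distance between expected and actual output.
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)


# Two made-up training examples with two output neurons each.
ys = [[1.0, 0.0], [0.0, 1.0]]
outs = [[0.8, 0.1], [0.3, 0.6]]
print(quadratic_cost(ys, outs))
```

Here the squared distances are $0.05$ and $0.25$, so the cost is $0.30 / (2 \cdot 2) = 0.075$ up to floating-point rounding.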

---

## Gradient Descent Rule
Weights and biases are updated using the gradient of the cost function:
$$
w_k \to w_k' = w_k - \eta \frac{\partial C}{\partial w_k},
$$
$$
b_l \to b_l' = b_l - \eta \frac{\partial C}{\partial b_l},
$$
where $\eta > 0$ is the learning rate, $w_k$ ranges over all weights, and $b_l$ ranges over all biases.

---

## Stochastic Gradient Descent for Mini-Batches
For a mini-batch of size $m$, update weights and biases as:
$$
w_k \to w_k' = w_k - \frac{\eta}{m} \sum_{j=1}^m \frac{\partial C_{X_j}}{\partial w_k},
$$
$$
b_l \to b_l' = b_l - \frac{\eta}{m} \sum_{j=1}^m \frac{\partial C_{X_j}}{\partial b_l},
$$
where $X_j$ is the $j$-th training example in the mini-batch and $C_{X_j}$ is the cost for that single example.

---

## Feedforward Output Calculation
The activation $a'$ of the next layer is computed as:
$$
a' = \sigma(w a + b),
$$
where:
- $a$ is the activation vector of the current layer,
- $w$ is the weight matrix,
- $b$ is the bias vector,
- $\sigma$ is applied element-wise.
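A minimal sketch of this single-layer step in plain Python, assuming the weight matrix is stored as a list of rows (one row per neuron in the next layer). The name `layer_forward` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(w, a, b):
    """Compute a' = sigma(w a + b).

    w: list of rows, one row of weights per neuron in the next layer
    a: activations of the current layer
    b: biases, one per neuron in the next layer
    """
    return [sigmoid(sum(wjk * ak for wjk, ak in zip(row, a)) + bj)
            for row, bj in zip(w, b)]


# Illustrative values: both weighted inputs come out as 0, so both outputs are 0.5.
print(layer_forward([[0.0, 0.0], [1.0, -1.0]], [0.3, 0.3], [0.0, 0.0]))
```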

---

## Gradient Descent Approximation
The change in cost $C$ due to a small step $\Delta v$ is approximated as:
$$
\Delta C \approx \nabla C \cdot \Delta v.
$$

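Choosing $\Delta v = -\eta \nabla C$ with a small learning rate $\eta > 0$ gives $\Delta C \approx -\eta \|\nabla C\|^2 \leq 0$, so to first order the cost never increases. This choice gives the gradient descent update rule: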
$$
v \to v' = v - \eta \nabla C.
$$
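To make the update rule concrete, the sketch below runs it on a toy cost $C(v) = v_1^2 + v_2^2$, whose gradient is $(2 v_1, 2 v_2)$. The cost function, starting point, and learning rate are all illustrative assumptions.

```python
def cost(v):
    """Toy cost C(v) = v1^2 + v2^2 with its minimum at the origin."""
    return v[0] ** 2 + v[1] ** 2


def grad(v):
    """Gradient of the toy cost: (2 v1, 2 v2)."""
    return [2 * v[0], 2 * v[1]]


def gradient_descent(v, eta, steps):
    """Repeatedly apply v -> v - eta * grad C(v)."""
    for _ in range(steps):
        g = grad(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v


v_start = [1.0, -2.0]
v_final = gradient_descent(v_start, eta=0.1, steps=50)
print(cost(v_start), cost(v_final))
```

For this particular cost, each step scales $v$ by $1 - 2\eta = 0.8$, so the cost shrinks steadily toward zero. With a learning rate that is too large (here, $\eta > 1$) the same rule would diverge, which is why $\eta$ must be small enough for the linear approximation to hold.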

---

## Derivative of the Sigmoid Function
The derivative of the sigmoid function is:
$$
\sigma'(z) = \sigma(z) (1 - \sigma(z)).
$$
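The identity can be sanity-checked against a central finite difference. This is a sketch with illustrative names (`sigmoid_prime`, `central_difference`) and an arbitrary step size `h`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_difference(f, z, h=1e-5):
    """Numerical derivative (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2 * h)


for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_prime(z), central_difference(sigmoid, z))
```

The derivative peaks at $z = 0$, where it equals $0.25$, and decays toward zero for large $|z|$.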

---

## Mini-Batch Updates for Training
To update the weights and biases from a mini-batch of $m$ training examples, with the per-example gradients computed by backpropagation:
1. Compute the gradients for each example $(x_i, y_i)$, $i = 1, \ldots, m$:
   $$ \delta w_k^{(i)} = \frac{\partial C_{x_i,y_i}}{\partial w_k}, \quad \delta b_l^{(i)} = \frac{\partial C_{x_i,y_i}}{\partial b_l}. $$
2. Average the gradients across all examples in the mini-batch:
   $$ \nabla w_k = \frac{1}{m} \sum_{i=1}^m \delta w_k^{(i)}, \quad \nabla b_l = \frac{1}{m} \sum_{i=1}^m \delta b_l^{(i)}. $$
3. Update weights and biases:
   $$ w_k \to w_k - \eta \nabla w_k, \quad b_l \to b_l - \eta \nabla b_l. $$
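The averaging and update steps can be sketched as follows, assuming the per-example gradients have already been computed and that all parameters (weights and biases alike) are flattened into one list. The function name `minibatch_update` and this flat layout are assumptions made for brevity.

```python
def minibatch_update(params, per_example_grads, eta):
    """Average per-example gradients and take one gradient descent step.

    params: flat list of parameters (weights and biases)
    per_example_grads: one gradient list per example, same length as params
    eta: learning rate
    """
    m = len(per_example_grads)
    # Step 2: average each gradient component over the mini-batch.
    averaged = [sum(g[k] for g in per_example_grads) / m
                for k in range(len(params))]
    # Step 3: move each parameter against its averaged gradient.
    return [p - eta * g for p, g in zip(params, averaged)]


# Two parameters, a mini-batch of two examples with made-up gradients.
print(minibatch_update([1.0, 2.0], [[0.5, -1.0], [1.5, 3.0]], eta=0.5))
```

In this example both averaged gradient components equal $1.0$, so each parameter decreases by $\eta \cdot 1.0 = 0.5$.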

---

These equations cover the core tools of Chapter 1: the perceptron and sigmoid neuron models, the quadratic cost, and training by (stochastic) gradient descent.


# Core Equations from "Neural Networks and Deep Learning" Chapter 2

## Feedforward Activation (Matrix Form)
The activation of the \( l \)-th layer is computed using the following equation:
$$
a^l = \sigma(w^l a^{l-1} + b^l),
$$
where:
- \( a^l \) is the activation vector of the \( l \)-th layer,
- \( w^l \) is the weight matrix for the \( l \)-th layer,
- \( b^l \) is the bias vector for the \( l \)-th layer,
- \( \sigma \) is the activation function applied element-wise.

---

## Weighted Input
The weighted input \( z^l \) to the \( l \)-th layer is defined as:
$$
z^l = w^l a^{l-1} + b^l.
$$
The activation is related to the weighted input by:
$$
a^l = \sigma(z^l).
$$
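Backpropagation needs both the weighted inputs and the activations of every layer, so a forward pass usually records them as it goes. The sketch below does this in plain Python; the function name `feedforward`, the list-of-rows weight layout, and the example network are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feedforward(weights, biases, x):
    """Return (zs, activations) for every layer.

    weights[i] is a list of rows mapping activations[i] to the next layer;
    activations[0] is the input x itself.
    """
    activations = [list(x)]
    zs = []
    for w, b in zip(weights, biases):
        a_prev = activations[-1]
        # Weighted input: z = w a_prev + b
        z = [sum(wjk * ak for wjk, ak in zip(row, a_prev)) + bj
             for row, bj in zip(w, b)]
        zs.append(z)
        # Activation: a = sigma(z), applied element-wise
        activations.append([sigmoid(zj) for zj in z])
    return zs, activations


# A 2-2-1 network with made-up parameters.
weights = [[[1.0, -1.0], [0.0, 0.0]], [[2.0, -2.0]]]
biases = [[0.0, 0.0], [0.0]]
zs, activations = feedforward(weights, biases, [0.5, 0.5])
print(zs, activations)
```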

---

## Quadratic Cost Function
The cost function for the network is given by:
$$
C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2,
$$
where:
- \( n \) is the number of training examples,
- \( y(x) \) is the desired output for input \( x \),
- \( a^L(x) \) is the output of the network for input \( x \),
- \( L \) is the index of the output layer.

---

## Error in the Output Layer
The error \( \delta^L \) for the output layer is computed as:
$$
\delta^L = \nabla_a C \odot \sigma'(z^L),
$$
where:
- \( \nabla_a C \) is the gradient of the cost with respect to the output activations \( a^L \), with components \( \partial C / \partial a^L_j \),
- \( \sigma'(z^L) \) is the derivative of the activation function with respect to \( z^L \),
- \( \odot \) denotes the Hadamard (element-wise) product.

For the quadratic cost on a single training example, \( C = \frac{1}{2} \| y - a^L \|^2 \), this gradient is:
$$
\nabla_a C = a^L - y.
$$
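Combining the two formulas gives \( \delta^L = (a^L - y) \odot \sigma'(z^L) \) for the quadratic cost. A minimal sketch, with the hypothetical name `output_error` and vectors stored as Python lists:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def output_error(z_L, a_L, y):
    """delta^L = (a^L - y) * sigma'(z^L), element-wise, for the quadratic cost."""
    return [(a - t) * sigmoid_prime(z) for z, a, t in zip(z_L, a_L, y)]


# With z^L = 0 the activations are 0.5 and sigma'(0) = 0.25.
print(output_error([0.0, 0.0], [0.5, 0.5], [1.0, 0.0]))
```

The sign of each component shows the direction of the mismatch: negative where the output is below the target, positive where it is above.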

---

## Error Propagation (Backpropagation)
The error \( \delta^l \) for layer \( l \) is related to the error in the next layer \( l+1 \):
$$
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
$$
where \( (w^{l+1})^T \) is the transpose of the weight matrix for the \( (l+1) \)-th layer. The rule applies to every layer below the output layer, \( l < L \).

---

## Gradients for Weights and Biases
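The gradients of the cost function with respect to the weights and biases are given by the following, where the weight gradient has the same shape as \( w^l \) (componentwise, \( \partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j \)):
$$
\frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^T,
$$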
$$
\frac{\partial C}{\partial b^l} = \delta^l.
$$
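The output-layer error, the error propagation rule, and the two gradient formulas can be combined into a complete backward pass for one training example. The sketch below is one possible plain-Python arrangement, not a reference implementation: the names `feedforward`, `backprop`, and `cost`, the list-of-rows weight layout, and the small 2-2-1 network are all assumptions. It ends with a finite-difference check on a single weight.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def feedforward(weights, biases, x):
    """Return weighted inputs and activations of every layer."""
    activations, zs = [list(x)], []
    for w, b in zip(weights, biases):
        z = [sum(wjk * ak for wjk, ak in zip(row, activations[-1])) + bj
             for row, bj in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(zj) for zj in z])
    return zs, activations


def backprop(weights, biases, x, y):
    """Gradients of C_x = 0.5 * ||y - a^L||^2 for a single example."""
    zs, acts = feedforward(weights, biases, x)
    # Output layer: delta^L = (a^L - y) * sigma'(z^L), element-wise.
    delta = [(a - t) * sigmoid_prime(z) for a, t, z in zip(acts[-1], y, zs[-1])]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        # dC/db = delta; dC/dw[j][k] = delta[j] * (input activation k).
        grad_b[l] = delta
        grad_w[l] = [[dj * ak for ak in acts[l]] for dj in delta]
        if l > 0:
            # Propagate back: (w^T delta) * sigma'(z of the previous layer).
            w = weights[l]
            delta = [sum(w[j][k] * delta[j] for j in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][k])
                     for k in range(len(zs[l - 1]))]
    return grad_w, grad_b


def cost(weights, biases, x, y):
    _, acts = feedforward(weights, biases, x)
    return 0.5 * sum((t - a) ** 2 for t, a in zip(y, acts[-1]))


# Made-up 2-2-1 network and training example.
weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
biases = [[0.0, 0.1], [-0.1]]
x, y = [1.0, 0.5], [1.0]
grad_w, grad_b = backprop(weights, biases, x, y)

# Central finite-difference check on one first-layer weight.
h = 1e-6
weights[0][0][1] += h
c_plus = cost(weights, biases, x, y)
weights[0][0][1] -= 2 * h
c_minus = cost(weights, biases, x, y)
weights[0][0][1] += h  # restore the original value
numeric = (c_plus - c_minus) / (2 * h)
print(grad_w[0][0][1], numeric)
```

Each `grad_w[l]` has the same shape as `weights[l]`, which is a quick way to confirm that the outer product is taken as error times transposed input activations rather than the other way round.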

---

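## Fully Matrix-Based Backpropagation (Mini-Batch)
Stacking the mini-batch inputs as the columns of a matrix \( X = [x_1, x_2, \dots, x_m] \), with \( A^0 = X \) and column \( i \) of \( Z^l \), \( A^l \), and \( \Delta^l \) holding the values for example \( x_i \), the backpropagation equations become:
1. Feedforward (\( B^l \) is the bias vector \( b^l \) repeated across all \( m \) columns):
   $$ Z^l = W^l A^{l-1} + B^l, \quad A^l = \sigma(Z^l). $$
2. Compute output layer error, where \( \nabla_A C \) holds the per-example gradients \( \nabla_a C_{x_i} \) as its columns:
   $$ \Delta^L = \nabla_A C \odot \sigma'(Z^L). $$
3. Backpropagate errors:
   $$ \Delta^l = ((W^{l+1})^T \Delta^{l+1}) \odot \sigma'(Z^l). $$
4. Gradients for weights and biases, summed over the mini-batch (\( \mathbf{1} \) is a column vector of \( m \) ones); dividing by \( m \) gives the average used in the update:
   $$ \sum_{i=1}^m \frac{\partial C_{x_i}}{\partial W^l} = \Delta^l (A^{l-1})^T, \quad \sum_{i=1}^m \frac{\partial C_{x_i}}{\partial b^l} = \Delta^l \mathbf{1}. $$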
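One way to read the gradient step: when each column of the error matrix holds the error for one example, the product of the error matrix with the transposed activation matrix equals the sum over examples of the per-example outer products, and summing the error columns gives the summed bias gradient. The sketch below checks this on made-up numbers; the helpers `matmul` and `transpose` are hypothetical names written here to avoid external dependencies.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(M):
    return [list(row) for row in zip(*M)]


# Made-up values: 2 neurons in layer l, 2 in layer l-1, mini-batch of m = 3.
Delta = [[1.0, 2.0, 3.0],
         [0.0, -1.0, 1.0]]   # column i = error for example i
A_prev = [[1.0, 0.0, 2.0],
          [0.0, 1.0, 1.0]]   # column i = previous-layer activation for example i

# Matrix form: summed weight gradient and summed bias gradient.
grad_W = matmul(Delta, transpose(A_prev))
grad_b = [sum(row) for row in Delta]

# Per-example form: add up the outer products delta_i * a_i^T.
summed = [[0.0, 0.0], [0.0, 0.0]]
for delta_i, a_i in zip(transpose(Delta), transpose(A_prev)):
    for j in range(2):
        for k in range(2):
            summed[j][k] += delta_i[j] * a_i[k]

print(grad_W, grad_b, summed)
```

Both routes give the same weight gradient, which is why a single matrix product can replace the loop over examples.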

---

Together, the output-layer error, the error propagation rule, and the weight and bias gradients make up the backpropagation algorithm for computing the gradient of the cost.
