# **Complete Mathematical Equations for Major Deep Learning Models**

---

# **1. Feedforward Neural Network (FNN)**

Given input:

$$
\mathbf{x} \in \mathbb{R}^d
$$

Hidden layer:

$$
\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)
$$

Output:

$$
\hat{y} = \sigma(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2)
$$

Activation:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
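
The forward pass above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`dense`, `fnn_forward`) and all parameter values are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, x, b):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def fnn_forward(x, W1, b1, W2, b2):
    """One hidden layer: h = sigma(W1 x + b1), y_hat = sigma(W2 h + b2)."""
    h = [sigmoid(z) for z in dense(W1, x, b1)]
    y_hat = [sigmoid(z) for z in dense(W2, h, b2)]
    return h, y_hat

# Illustrative (made-up) parameters: d = 2 inputs, 3 hidden units, 1 output.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.0]
h, y_hat = fnn_forward([1.0, 2.0], W1, b1, W2, b2)
```

Because the output passes through a sigmoid, `y_hat[0]` always lies strictly between 0 and 1, which is why this form suits binary classification.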

---

# **2. Convolutional Neural Network (CNN)**

## **2.1 Convolution Operation**

For an image \(X\) and a \(k \times k\) kernel \(K\) (the kernel is not flipped, so strictly speaking this is a cross-correlation):

$$
Y_{i,j} = \sum_{m=1}^{k} \sum_{n=1}^{k} K_{m,n}\, X_{i+m, j+n}
$$
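
A direct translation of the sum above, assuming stride 1 and no padding (neither is specified in the equation). The code uses zero-based offsets `m, n = 0..k-1`, so `Y[i][j]` starts at `X[i][j]`; the function name `conv2d_valid` is illustrative.

```python
def conv2d_valid(X, K):
    """Slide kernel K over X with stride 1 and no padding.

    Zero-based form of the sum: Y[i][j] = sum_{m,n} K[m][n] * X[i+m][j+n].
    """
    kh, kw = len(K), len(K[0])
    out_h = len(X) - kh + 1
    out_w = len(X[0]) - kw + 1
    return [[sum(K[m][n] * X[i + m][j + n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, -1]]
Y = conv2d_valid(X, K)  # each entry is X[i][j] - X[i+1][j+1]
```

With this kernel every output entry is the difference between a pixel and its lower-right neighbour, so a 3x3 input yields a 2x2 output.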

## **2.2 ReLU Activation**

$$
\text{ReLU}(z) = \max(0, z)
$$

## **2.3 Max Pooling**

$$
Y_{i,j} = \max_{p,q \in \text{window}} X_{p,q}
$$
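
ReLU followed by max pooling can be sketched as below. The 2x2 window and stride of 2 are assumptions for the example; the equation above only says "window".

```python
def relu(X):
    """Apply max(0, z) to every entry of a 2-D list."""
    return [[max(0, v) for v in row] for row in X]

def max_pool2d(X, size=2, stride=2):
    """Take the maximum over each size x size window, moving by `stride`."""
    out_h = (len(X) - size) // stride + 1
    out_w = (len(X[0]) - size) // stride + 1
    return [[max(X[i * stride + p][j * stride + q]
                 for p in range(size) for q in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

X = [[ 1, -2,  3,  0],
     [-4,  5, -6,  2],
     [ 7, -8,  9, -1],
     [ 0,  1, -3,  4]]
P = max_pool2d(relu(X))  # 4x4 input -> 2x2 output
```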

---

# **3. Recurrent Neural Network (RNN)**

Hidden state update:

$$
h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h)
$$

Prediction:

$$
\hat{y} = \sigma(W_y h_T + b_y)
$$
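
The recurrence above, unrolled over a short sequence. This is a sketch with made-up weights; `matvec` and `rnn_forward` are illustrative helper names, and the initial state `h0` is assumed to be zero.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_forward(xs, W_h, W_x, b_h, h0):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t + b_h) to each x_t in order."""
    h = h0
    states = []
    for x_t in xs:
        pre = [a + b + c
               for a, b, c in zip(matvec(W_h, h), matvec(W_x, x_t), b_h)]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

# Made-up parameters: 2 hidden units, 1 input feature, 3 time steps.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
b_h = [0.0, 0.0]
xs = [[1.0], [0.5], [-1.0]]

states = rnn_forward(xs, W_h, W_x, b_h, h0=[0.0, 0.0])
h_T = states[-1]

# Prediction from the final hidden state: y_hat = sigma(W_y h_T + b_y).
W_y, b_y = [[1.0, -1.0]], [0.0]
y_hat = [sigmoid(z + b) for z, b in zip(matvec(W_y, h_T), b_y)]
```

Only the last state `h_T` feeds the prediction here, matching the equation above; sequence-labelling variants would apply the output layer to every entry of `states` instead.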

---

# **4. Long Short-Term Memory (LSTM)**

Given \( x_t, h_{t-1}, c_{t-1} \):

### Forget gate:

$$
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
$$

### Input gate:

$$
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
$$

### Candidate cell:

$$
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
$$

### Cell state:

$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
$$

### Output gate:

$$
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
$$

### Hidden state:

$$
h_t = o_t \odot \tanh(c_t)
$$
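
The six equations above combine into a single step function. The sketch below assumes parameters are stored in a dict keyed by gate name, which is a layout chosen for this example only; the all-zero weights make the arithmetic easy to follow by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-rows matrices W and U."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(W, U, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'c', 'o' to (W, U, b)."""
    def gate(name, act):
        W, U, b = params[name]
        return [act(z) for z in affine(W, x_t, U, h_prev, b)]

    f_t = gate("f", sigmoid)        # forget gate
    i_t = gate("i", sigmoid)        # input gate
    c_tilde = gate("c", math.tanh)  # candidate cell
    o_t = gate("o", sigmoid)        # output gate
    # c_t = f_t * c_{t-1} + i_t * c_tilde (elementwise)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # h_t = o_t * tanh(c_t) (elementwise)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# All-zero parameters: every gate equals 0.5 and the candidate equals 0.
zero = ([[0.0], [0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
params = {name: zero for name in ("f", "i", "c", "o")}
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [2.0, -4.0], params)
```

With zero weights the forget gate halves the previous cell state and nothing new is written, so `c_t` is exactly half of `c_prev`.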

---

# **5. Gated Recurrent Unit (GRU)**

Update gate:

$$
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
$$

Reset gate:

$$
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
$$

Candidate hidden:

$$
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
$$

Final hidden:

$$
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
$$
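
A GRU step following the four equations above, with all products between gates and states taken elementwise. As with any sketch, the parameter layout (a dict keyed by `'z'`, `'r'`, `'h'`) is an assumption made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-rows matrices W and U."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(W, U, b)]

def gru_step(x_t, h_prev, params):
    """One GRU step; params maps 'z', 'r', 'h' to (W, U, b)."""
    W_z, U_z, b_z = params["z"]
    W_r, U_r, b_r = params["r"]
    W_h, U_h, b_h = params["h"]
    z_t = [sigmoid(a) for a in affine(W_z, x_t, U_z, h_prev, b_z)]  # update gate
    r_t = [sigmoid(a) for a in affine(W_r, x_t, U_r, h_prev, b_r)]  # reset gate
    gated = [r * h for r, h in zip(r_t, h_prev)]                    # r_t * h_{t-1}
    h_tilde = [math.tanh(a) for a in affine(W_h, x_t, U_h, gated, b_h)]
    # Interpolate between the old state and the candidate.
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_tilde)]

# All-zero parameters: z_t = r_t = 0.5 and the candidate is 0.
zero = ([[0.0], [0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
params = {name: zero for name in ("z", "r", "h")}
h_t = gru_step([1.0], [2.0, -4.0], params)
```

The update gate acts as an interpolation weight: `z_t` near 0 copies `h_prev` forward unchanged, while `z_t` near 1 replaces it with the candidate.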

---

# **6. Transformer — Self-Attention**

Input embeddings:

$$
X \in \mathbb{R}^{T \times d}
$$

### Linear projections:

$$
Q = XW^Q,\qquad K = XW^K,\qquad V = XW^V
$$

### Scaled dot-product attention:

Scores:

$$
S = \frac{QK^\top}{\sqrt{d_k}}
$$

Softmax weights:

$$
A = \text{softmax}(S)
$$

Output:

$$
O = A V
$$
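
The three steps above (scores, softmax, weighted sum) fit in a few lines. This sketch assumes the softmax is taken row-wise, so each query position gets a distribution over key positions; the projection matrices in the example are identity matrices chosen purely for illustration.

```python
import math

def matmul(A, B):
    """Product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax, subtracting each row's max for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, W_Q, W_K, W_V):
    """Return (A, O) with A = softmax(Q K^T / sqrt(d_k)) and O = A V."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    S = [[s / scale for s in row] for row in matmul(Q, transpose(K))]
    A = softmax_rows(S)
    return A, matmul(A, V)

I2 = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 tokens, d = 2
A, O = self_attention(X, I2, I2, I2)
```

Each row of `A` sums to 1, so every row of `O` is a convex combination of the rows of `V`.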

### Multi-Head Attention:

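For heads \(h = 1,\dots,H\), each head applies the attention above with its own projections \(W_h^Q, W_h^K, W_h^V\) to produce \(O_h\):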

$$
\text{MHA}(X) =
\text{Concat}(O_1, O_2, \dots, O_H)\, W^O
$$

### Transformer Feed-Forward Network:

$$
\text{FFN}(x) = W_2\, \text{ReLU}(W_1 x + b_1) + b_2
$$

### Residual + LayerNorm:

$$
x' = \text{LayerNorm}(x + \text{MHA}(x))
$$

$$
x_{\text{out}} = \text{LayerNorm}(x' + \text{FFN}(x'))
$$
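
The two residual-plus-normalization steps can be sketched for a single token vector. Several simplifications are assumed here: LayerNorm has no learned gain or bias, and `fake_attention` is a hypothetical stand-in for MHA (real attention mixes information across tokens, which a single vector cannot show).

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def matvec(W, v):
    return [sum(w * t for w, t in zip(row, v)) for row in W]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 ReLU(W1 x + b1) + b2."""
    hidden = [max(0.0, z + b) for z, b in zip(matvec(W1, x), b1)]
    return [z + b for z, b in zip(matvec(W2, hidden), b2)]

def add_and_norm(x, sublayer):
    """LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def fake_attention(x):
    """Hypothetical placeholder for MHA; it only rescales the input."""
    return [0.5 * v for v in x]

# Made-up FFN weights: model width 4, hidden width 2.
W1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]]
b2 = [0.0, 0.0, 0.0, 0.0]

x = [1.0, 2.0, 3.0, 4.0]
x_prime = add_and_norm(x, fake_attention)
x_out = add_and_norm(x_prime, lambda v: ffn(v, W1, b1, W2, b2))
```

This is the post-norm ordering written in the equations above, where normalization follows the residual sum.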

---

# **Summary Table**

| Model | Key Equations |
|-------|---------------|
| **FNN** | $$h = \sigma(W_1 x + b_1), \qquad \hat{y} = \sigma(W_2 h + b_2)$$ |
| **CNN** | $$\text{Convolution}, \quad \text{ReLU}, \quad \text{MaxPool}$$ |
| **RNN** | $$h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h)$$ |
| **LSTM** | $$f_t,\, i_t,\, o_t,\qquad c_t,\qquad h_t$$ |
| **GRU** | $$z_t,\, r_t,\qquad \tilde{h}_t,\qquad h_t$$ |
| **Transformer** | $$Q = XW^Q,\quad K = XW^K,\quad V = XW^V$$ $$A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right), \qquad O = AV$$ |





# **All Generative Models**

| **Model** | **Core Mathematical Equations** |
|----------|----------------------------------|
| **VAE** | $$\text{ELBO} = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_{\mathrm{KL}}(q_\phi(z\mid x)\,\Vert\,p(z))$$ $$z = \mu_\phi(x) + \sigma_\phi(x)\odot\epsilon,\quad \epsilon \sim \mathcal{N}(0,I)$$ |
| **GAN** | $$\min_G \max_D\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z\sim p(z)}[\log(1 - D(G(z)))]$$ $$L_G = -\mathbb{E}_{z}[\log D(G(z))]$$ |
| **DDPM** | $$q(x_t\mid x_{t-1})=\mathcal{N}\!\left(x_t;\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\right)$$ $$\alpha_t=1-\beta_t,\qquad \bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s$$ $$q(x_t\mid x_0)=\mathcal{N}\!\left(x_t;\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)I\right)$$ $$L = \mathbb{E}\left[\lVert \epsilon - \epsilon_\theta(x_t,t)\rVert^2\right]$$ |
| **Normalizing Flow** | $$x = f_K\circ\cdots\circ f_1(z_0),\qquad z_0\sim p_0(z_0)$$ $$\log p(x)=\log p_0(z_0)-\sum_{k=1}^K \log\left\lvert\det\frac{\partial f_k}{\partial z_{k-1}}\right\rvert$$ |
| **Autoregressive Models** | $$p(x)=\prod_{t=1}^T p(x_t \mid x_{<t})$$ $$A=\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$$ |
| **Energy-Based Models (EBM)** | $$p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}$$ $$Z_\theta = \int e^{-E_\theta(x)}\,dx$$ |
| **RBM** | $$E(v,h)=-v^\top Wh - b^\top v - c^\top h$$ $$p(v,h)=\frac{e^{-E(v,h)}}{Z}$$ $$\Delta W \propto \langle vh^\top\rangle_{\text{data}} - \langle vh^\top\rangle_{\text{model}}$$ |
| **Score-Based Models** | $$s_\theta(x)\approx\nabla_x \log p(x)$$ $$x_{t+1}=x_t + \frac{\epsilon}{2}s_\theta(x_t) + \sqrt{\epsilon}\,z_t,\quad z_t\sim\mathcal{N}(0,I)$$ |
| **GMM** | $$p(x)=\sum_{k=1}^K \pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k)$$ $$\gamma_{nk}=\frac{\pi_k\,\mathcal{N}(x_n;\mu_k,\Sigma_k)}{\sum_j\pi_j\,\mathcal{N}(x_n;\mu_j,\Sigma_j)}$$ $$\mu_k=\frac{\sum_n\gamma_{nk}x_n}{\sum_n\gamma_{nk}},\quad \Sigma_k=\frac{\sum_n\gamma_{nk}(x_n-\mu_k)(x_n-\mu_k)^\top}{\sum_n\gamma_{nk}},\quad \pi_k=\frac{1}{N}\sum_n\gamma_{nk}$$ |
| **Transformer Decoder (Generative)** | $$A=\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$$ $$P(x_t\mid x_{<t})=\text{softmax}(W h_t)$$ $$x_t \sim P(x_t\mid x_{<t})$$ |

---
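
As one worked case from the table above, the DDPM closed form lets us jump from \(x_0\) to \(x_t\) in a single step using \(\bar{\alpha}_t=\prod_{s\le t}(1-\beta_s)\). The sketch below assumes a linear \(\beta\) schedule from \(10^{-4}\) to \(0.02\) over 1000 steps; the schedule and the function names are choices made for this example, not something the table specifies.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, a_bars, rng):
    """Draw x_t ~ q(x_t | x_0); t is a 1-based step index."""
    a_bar = a_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

T = 1000
# Assumed linear schedule; other schedules plug in the same way.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
a_bars = alpha_bars(betas)

x0 = [1.0, -1.0]
x_t, eps = q_sample(x0, 500, a_bars, random.Random(0))
# eps is the regression target for eps_theta(x_t, t) in the loss L.
```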



# **Activation Summary Table (All Equations)**

| **Activation** | **Mathematical Equation** |
|----------------|----------------------------|
| **Sigmoid** | $$\sigma(x)=\frac{1}{1+e^{-x}}$$ |
| **Tanh** | $$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$$ |
| **ReLU** | $$\text{ReLU}(x)=\max(0,x)$$ |
| **Leaky ReLU** | $$\text{LeakyReLU}(x)=\begin{cases}\alpha x & x<0 \\ x & x\ge 0\end{cases}$$ |
| **PReLU** | $$\text{PReLU}(x)=\begin{cases}a x & x<0 \\ x & x\ge 0\end{cases}$$ |
| **ELU** | $$\text{ELU}(x)=\begin{cases}x & x\ge 0 \\ \alpha(e^{x}-1) & x<0\end{cases}$$ |
| **SELU** | $$\text{SELU}(x)=\lambda\begin{cases}x & x\ge 0 \\ \alpha(e^{x}-1) & x<0\end{cases}$$ |
| **GELU (exact)** | $$\text{GELU}(x)=x\,\Phi(x)$$ |
| **GELU (approx.)** | $$\text{GELU}(x)\approx 0.5x\left(1+\tanh\!\left[\sqrt{\frac{2}{\pi}}\,(x+0.044715x^3)\right]\right)$$ |
| **Softplus** | $$\text{Softplus}(x)=\ln(1+e^{x})$$ |
| **Mish** | $$\text{Mish}(x)=x\,\tanh(\ln(1+e^{x}))$$ |
| **Swish** | $$\text{Swish}(x)=x\,\sigma(x)$$ |
| **Hard Sigmoid** | $$\text{HardSigmoid}(x)=\max(0,\min(1,\,0.2x+0.5))$$ |
| **Hard Tanh** | $$\text{HardTanh}(x)=\begin{cases}-1 & x<-1 \\ x & -1\le x\le 1 \\ 1 & x>1\end{cases}$$ |
| **Softsign** | $$\text{Softsign}(x)=\frac{x}{1+\lvert x\rvert}$$ |
| **Softmax** | $$\text{Softmax}(x_i)=\frac{e^{x_i}}{\sum_j e^{x_j}}$$ |
| **Maxout** | $$\text{Maxout}(x)=\max_k(W_k x + b_k)$$ |

---
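
A few of the activations above, written with the standard library only. The comparison at the end measures how far the tanh approximation of GELU strays from the exact form on a grid over \([-5, 5]\); the grid itself is an arbitrary choice for the example.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with the standard normal CDF Phi written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU from the table above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def softplus(x):
    """ln(1 + e^x)."""
    return math.log1p(math.exp(x))

def mish(x):
    """x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

def swish(x):
    """x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def leaky_relu(x, alpha=0.01):
    """alpha * x for x < 0, x otherwise; alpha = 0.01 is an assumed default."""
    return x if x >= 0 else alpha * x

grid = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
```

`math.exp` overflows for very large positive inputs, so production implementations of Softplus usually switch to the identity above some threshold.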

# **All Optimizers**

| **Optimizer** | **Core Mathematical Equation** |
|---------------|--------------------------------|
| **Gradient Descent (GD)** | $$\theta_{t+1}=\theta_t-\eta\,\nabla_\theta L(\theta_t)$$ |
| **SGD** | $$\theta_{t+1}=\theta_t-\eta\,\nabla_\theta L(\theta_t;x_i)$$ |
| **Momentum** | $$v_t=\beta v_{t-1}+(1-\beta)g_t,\qquad \theta_{t+1}=\theta_t-\eta v_t$$ |
| **NAG** | $$v_t=\beta v_{t-1}+\nabla_\theta L(\theta_t-\eta\beta v_{t-1}),\qquad \theta_{t+1}=\theta_t-\eta v_t$$ |
| **AdaGrad** | $$G_t=G_{t-1}+g_t^2,\qquad \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{G_t+\epsilon}}g_t$$ |
| **RMSProp** | $$E[g^2]_t=\beta E[g^2]_{t-1}+(1-\beta)g_t^2,\qquad \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t$$ |
| **Adam** | $$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t,\quad v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$ $$\hat m_t=\frac{m_t}{1-\beta_1^t},\;\hat v_t=\frac{v_t}{1-\beta_2^t}$$ $$\theta_{t+1}=\theta_t-\eta\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}$$ |
| **AdamW** | $$\theta_{t+1}=\theta_t-\eta\left(\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}+\lambda\theta_t\right)$$ |
| **AdaDelta** | $$E[g^2]_t=\rho E[g^2]_{t-1}+(1-\rho)g_t^2$$ $$\Delta\theta_t=-\frac{\sqrt{E[\Delta\theta^2]_{t-1}+\epsilon}}{\sqrt{E[g^2]_t+\epsilon}}g_t$$ $$\theta_{t+1}=\theta_t+\Delta\theta_t$$ |
| **AMSGrad** | $$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2,\qquad \hat v_t=\max(\hat v_{t-1},v_t)$$ $$\theta_{t+1}=\theta_t-\eta\frac{m_t}{\sqrt{\hat v_t}+\epsilon}$$ |
| **Nadam** | $$\theta_{t+1}=\theta_t-\eta\frac{\beta_1\hat m_t+\frac{(1-\beta_1)g_t}{1-\beta_1^t}}{\sqrt{\hat v_t}+\epsilon}$$ |
| **Lion** | $$c_t=\beta_1m_{t-1}+(1-\beta_1)g_t,\qquad \theta_{t+1}=\theta_t-\eta\left(\text{sign}(c_t)+\lambda\theta_t\right)$$ $$m_t=\beta_2m_{t-1}+(1-\beta_2)g_t$$ |
| **LAMB** | $$r_t=\frac{\lVert\theta_t\rVert}{\left\lVert\frac{m_t}{\sqrt{v_t}+\epsilon}\right\rVert}$$ $$\theta_{t+1}=\theta_t-\eta\,r_t\,\frac{m_t}{\sqrt{v_t}+\epsilon}$$ |
| **Adafactor** | $$v_t \approx \frac{r_t c_t^\top}{\mathbf{1}^\top r_t},\qquad \theta_{t+1}=\theta_t-\eta\frac{g_t}{\sqrt{v_t}+\epsilon}$$ where \(r_t\) and \(c_t\) are moving averages of the row and column sums of \(g_t^2\) |
| **Lookahead** | $$\theta^{(f)}_{t+1}=\text{opt}(\theta^{(f)}_t)$$ Every \(k\) fast steps: $$\theta^{(s)}_{t+1}=\theta^{(s)}_t+\alpha(\theta^{(f)}_{t+1}-\theta^{(s)}_t)$$ $$\theta^{(f)}_{t+1}=\theta^{(s)}_{t+1}$$ |

---
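
The Adam row above can be exercised on a one-dimensional quadratic. This is a sketch under simple assumptions: a scalar parameter, the loss \(L(\theta)=(\theta-3)^2\), and hyperparameters picked for the example (`lr=0.1` is larger than the usual default so that convergence is visible in a few hundred steps).

```python
import math

def adam_minimize(grad, theta, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a scalar parameter and return the list of iterates."""
    m = 0.0  # first-moment estimate
    v = 0.0  # second-moment estimate
    history = [theta]
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(theta)
    return history

# Gradient of L(theta) = (theta - 3)^2.
history = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, steps=500)
first_step = history[1] - history[0]
```

On the first step the bias-corrected ratio \(\hat m_1/\sqrt{\hat v_1}\) equals the sign of the gradient, so the parameter moves by almost exactly `lr` regardless of the gradient's magnitude.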



# **All Backpropagation Equations**

| **Component** | **Key Backpropagation Equation** |
|---------------|----------------------------------|
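| **Feedforward (Linear Layer)** | With \(z=Wx+b\), \(a=f(z)\), and \(\delta=\partial L/\partial a\): $$\frac{\partial L}{\partial z}=\delta\odot f'(z)$$ $$\frac{\partial L}{\partial W}=\left(\frac{\partial L}{\partial z}\right)x^\top,\qquad \frac{\partial L}{\partial b}=\frac{\partial L}{\partial z}$$ $$\frac{\partial L}{\partial x}=W^\top\left(\frac{\partial L}{\partial z}\right)$$ |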
| **Binary Cross-Entropy** | $$\frac{\partial L}{\partial \hat y}=\frac{\hat y - y}{\hat y(1-\hat y)}$$ |
| **Softmax + Cross-Entropy** | $$\frac{\partial L}{\partial z_i}=\hat y_i - y_i$$ |
| **Convolution (CNN)** | $$\frac{\partial L}{\partial K_{m,n}}=\sum_{i,j}\frac{\partial L}{\partial Y_{i,j}}X_{i+m,j+n}$$ $$\frac{\partial L}{\partial X_{i,j}}=\sum_{m,n}\frac{\partial L}{\partial Y_{i-m,j-n}}K_{m,n}$$ |
| **RNN** | $$\frac{\partial L}{\partial h_t}=\left.\frac{\partial L}{\partial h_t}\right\rvert_{\text{local}}+W_h^\top\left(\frac{\partial L}{\partial h_{t+1}}\odot(1-h_{t+1}^2)\right)$$ |
| **LSTM — Cell** | $$\frac{\partial L}{\partial c_t}=\frac{\partial L}{\partial h_t}\odot o_t\odot(1-\tanh^2(c_t))+\frac{\partial L}{\partial c_{t+1}}\odot f_{t+1}$$ |
| **LSTM — Gates** | $$\frac{\partial L}{\partial f_t}=\frac{\partial L}{\partial c_t}\odot c_{t-1}$$ $$\frac{\partial L}{\partial i_t}=\frac{\partial L}{\partial c_t}\odot\tilde c_t$$ $$\frac{\partial L}{\partial \tilde c_t}=\frac{\partial L}{\partial c_t}\odot i_t$$ $$\frac{\partial L}{\partial o_t}=\frac{\partial L}{\partial h_t}\odot\tanh(c_t)$$ |
| **GRU — Update Gate** | $$\frac{\partial L}{\partial z_t}=(\tilde h_t - h_{t-1})\odot\frac{\partial L}{\partial h_t}$$ |
| **GRU — Candidate State** | $$\frac{\partial L}{\partial \tilde h_t}=z_t\odot\frac{\partial L}{\partial h_t}$$ |
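| **GRU — Previous Hidden** | $$\frac{\partial L}{\partial r_t}=h_{t-1}\odot U_h^\top\left((1-\tilde h_t^2)\odot\frac{\partial L}{\partial \tilde h_t}\right)$$ $$\frac{\partial L}{\partial h_{t-1}}=(1-z_t)\odot\frac{\partial L}{\partial h_t}+r_t\odot U_h^\top\left((1-\tilde h_t^2)\odot\frac{\partial L}{\partial \tilde h_t}\right)+U_z^\top\left(z_t\odot(1-z_t)\odot\frac{\partial L}{\partial z_t}\right)+U_r^\top\left(r_t\odot(1-r_t)\odot\frac{\partial L}{\partial r_t}\right)$$ |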
| **Self-Attention — Value** | $$\frac{\partial L}{\partial V}=A^\top\frac{\partial L}{\partial O}$$ |
| **Self-Attention — Attention Weights** | $$\frac{\partial L}{\partial A}=\frac{\partial L}{\partial O}V^\top$$ |
| **Softmax (Jacobian)** | $$\frac{\partial A_i}{\partial S_i}=\mathrm{diag}(A_i)-A_iA_i^\top$$ |
| **Self-Attention — Query** | $$\frac{\partial L}{\partial S_i}=\left(\mathrm{diag}(A_i)-A_iA_i^\top\right)\frac{\partial L}{\partial A_i}$$ $$\frac{\partial L}{\partial Q}=\frac{1}{\sqrt{d_k}}\,\frac{\partial L}{\partial S}\,K$$ |
| **Self-Attention — Key** | $$\frac{\partial L}{\partial K}=\frac{1}{\sqrt{d_k}}\left(\frac{\partial L}{\partial S}\right)^\top Q$$ |
| **LayerNorm** | $$\frac{\partial L}{\partial x_i}=\frac{1}{\sqrt{\sigma^2+\epsilon}}\left[g_i-\frac{1}{n}\sum_j g_j-\frac{(x_i-\mu)}{n(\sigma^2+\epsilon)}\sum_j g_j(x_j-\mu)\right],\qquad g_j=\gamma_j\,\delta_j$$ |
| **Residual Connection** | $$\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}+\frac{\partial L}{\partial F(x)}\frac{\partial F(x)}{\partial x}$$ |
| **Attention Mask** | $$\frac{\partial L}{\partial S_{ij}}=0\quad \text{(if masked, since } A_{ij}=0\text{)}$$ |

---
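
Any row of the table above can be sanity-checked against finite differences. The sketch below does this for the softmax plus cross-entropy row, comparing \(\hat y_i - y_i\) with a central-difference estimate; the logits, target, and step size `h` are arbitrary values chosen for the example.

```python
import math

def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """L = -sum_i y_i log softmax(z)_i for a target y that sums to 1."""
    return -sum(t * math.log(p) for t, p in zip(y, softmax(z)))

def analytic_grad(z, y):
    """Closed form from the table: dL/dz_i = y_hat_i - y_i."""
    return [p - t for p, t in zip(softmax(z), y)]

def numeric_grad(z, y, h=1e-6):
    """Central differences, perturbing one logit at a time."""
    grads = []
    for i in range(len(z)):
        z_plus = z[:i] + [z[i] + h] + z[i + 1:]
        z_minus = z[:i] + [z[i] - h] + z[i + 1:]
        grads.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))
    return grads

z = [2.0, -1.0, 0.5]
y = [0.0, 1.0, 0.0]  # one-hot target
max_err = max(abs(a - n) for a, n in zip(analytic_grad(z, y), numeric_grad(z, y)))
```

The closed form relies on the target summing to 1; with that condition the gradient components also sum to zero.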




# **Backpropagation for All Generative AI Models**

| **Model** | **Core Backpropagation Equation** |
|-----------|------------------------------------|
| **VAE** | $$\nabla_\theta \mathcal{L} = \mathbb{E}_{q_\phi(z\mid x)}\left[\nabla_\theta \log p_\theta(x\mid z)\right]$$ $$\nabla_\phi \mathcal{L} = \nabla_z \mathcal{L}\,\frac{\partial z}{\partial \phi} - \nabla_\phi D_{\mathrm{KL}}(q_\phi(z\mid x)\,\Vert\,p(z))$$ |
| **VAE – KL Gradients** | $$\frac{\partial D_{KL}}{\partial \mu_i}=\mu_i,\qquad \frac{\partial D_{KL}}{\partial \sigma_i}=\sigma_i-\frac{1}{\sigma_i}$$ |
| **GAN – Discriminator** | $$\nabla_{\theta_D}L_D = -\frac{\nabla D(x)}{D(x)} + \frac{\nabla D(G(z))}{1-D(G(z))}$$ |
| **GAN – Generator** | $$\nabla_{\theta_G}L_G = -\frac{1}{D(G(z))}\nabla_{\theta_G}D(G(z))$$ $$\nabla_{\theta_G}D(G(z)) = \nabla_x D(x)\big\rvert_{x=G(z)}\cdot \nabla_{\theta_G}G(z)$$ |
| **DDPM / Diffusion Models** | $$\nabla_\theta L = \mathbb{E}\left[2(\epsilon_\theta - \epsilon)\,\nabla_\theta \epsilon_\theta\right]$$ $$\frac{\partial L}{\partial x_t}=2(\epsilon_\theta-\epsilon)\frac{\partial \epsilon_\theta}{\partial x_t}$$ |
| **Normalizing Flows** | $$\nabla_\theta \log p(x)=\left(\frac{\partial z_0}{\partial \theta}\right)^{\!\top}\nabla_{z_0}\log p_0(z_0)-\sum_{k=1}^{K}\nabla_\theta \log\lvert \det J_k\rvert,\qquad z_0=f_1^{-1}\circ\cdots\circ f_K^{-1}(x)$$ |
| **Autoregressive Models** | $$L=-\sum_t\log p(x_t\mid x_{<t}),\qquad \frac{\partial L}{\partial z_t}=\hat y_t - y_t$$ $$\frac{\partial L}{\partial S_i}=\left(\mathrm{diag}(A_i)-A_iA_i^\top\right)\frac{\partial L}{\partial A_i}$$ $$\frac{\partial L}{\partial Q}=\frac{1}{\sqrt{d_k}}\,\frac{\partial L}{\partial S}\,K$$ $$\frac{\partial L}{\partial K}=\frac{1}{\sqrt{d_k}}\left(\frac{\partial L}{\partial S}\right)^\top Q$$ $$\frac{\partial L}{\partial V}=A^\top\frac{\partial L}{\partial O}$$ |
| **Energy-Based Models (EBM)** | $$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x'\sim p_\theta}[\nabla_\theta E_\theta(x')]$$ |
| **RBM (Contrastive Divergence)** | $$\frac{\partial L}{\partial W}=\langle vh^\top\rangle_{\text{data}} - \langle vh^\top\rangle_{\text{model}}$$ $$\frac{\partial L}{\partial b}=\langle v\rangle_{\text{data}} - \langle v\rangle_{\text{model}}$$ |
| **GMM (EM Algorithm)** | $$\gamma_{nk}=\frac{\pi_k\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_j\pi_j\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}$$ $$\frac{\partial L}{\partial \mu_k}=\sum_n\gamma_{nk}\,\Sigma_k^{-1}(x_n-\mu_k)$$ $$\frac{\partial L}{\partial \Sigma_k}=\frac{1}{2}\sum_n\gamma_{nk}\left[\Sigma_k^{-1}(x_n-\mu_k)(x_n-\mu_k)^\top\Sigma_k^{-1}-\Sigma_k^{-1}\right]$$ |
| **Transformer Decoders (LLMs)** | $$\frac{\partial L}{\partial z_t}=\hat y_t - y_t$$ $$\frac{\partial L}{\partial S_i}=\left(\mathrm{diag}(A_i)-A_iA_i^\top\right)\frac{\partial L}{\partial A_i}$$ $$\nabla_Q L=\frac{1}{\sqrt{d_k}}\,\frac{\partial L}{\partial S}\,K,\qquad \nabla_K L=\frac{1}{\sqrt{d_k}}\left(\frac{\partial L}{\partial S}\right)^\top Q$$ $$\nabla_V L=A^\top\nabla_O L$$ |

---
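
The VAE KL gradients in the table above follow from the closed form \(D_{\mathrm{KL}}=\tfrac{1}{2}\sum_i(\mu_i^2+\sigma_i^2-\ln\sigma_i^2-1)\) for a diagonal Gaussian posterior against a standard normal prior. The sketch below checks them numerically; the test point and step size are arbitrary, and the helper names are illustrative.

```python
import math

def kl_diag_gaussian(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

def kl_grads(mu, sigma):
    """Gradients from the table: dKL/dmu_i = mu_i, dKL/dsigma_i = sigma_i - 1/sigma_i."""
    return list(mu), [s - 1.0 / s for s in sigma]

def central_diff(f, v, i, h=1e-6):
    """Central-difference estimate of df/dv_i."""
    up = v[:i] + [v[i] + h] + v[i + 1:]
    down = v[:i] + [v[i] - h] + v[i + 1:]
    return (f(up) - f(down)) / (2 * h)

mu, sigma = [0.3, -1.2], [0.5, 2.0]
g_mu, g_sigma = kl_grads(mu, sigma)

err_mu = max(abs(g_mu[i] - central_diff(lambda m: kl_diag_gaussian(m, sigma), mu, i))
             for i in range(len(mu)))
err_sigma = max(abs(g_sigma[i] - central_diff(lambda s: kl_diag_gaussian(mu, s), sigma, i))
                for i in range(len(sigma)))
```

Both gradients vanish at \(\mu=0,\ \sigma=1\), where the posterior equals the prior and the KL term is zero.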




