# Assignment 1: Equivalence of Explicit Score Matching (ESM) and Implicit Score Matching (ISM)

This document, based on the paper [Vincent]Denoising_Score_Matching.pdf, explains the mathematical equivalence between Explicit Score Matching (ESM) and Implicit Score Matching (ISM).

---

## 1. Basic Definitions

First, we define the relevant notation:

* **Model probability density:** $p(x;\theta)$, parameterized by $\theta$.
* **True data probability density:** $q(x)$ (unknown).
* **Model Score Function:**
    $$\psi(x;\theta) = \nabla_x \log p(x;\theta)$$
* **True data Score Function:**
    $$\nabla_x \log q(x)$$
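
As a concrete illustration (not taken from the paper), assume a one-dimensional Gaussian model $p(x;\theta) = \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$. Its score is $\psi(x;\theta) = -(x-\mu)/\sigma^2$. The sketch below compares that closed form with a finite-difference derivative of the log-density; the function names are hypothetical and chosen only for this example.

```python
import math


def log_gaussian(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def gaussian_score(x, mu, sigma):
    """Closed-form score: d/dx log p(x; mu, sigma) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2


def numerical_score(x, mu, sigma, eps=1e-5):
    """Central finite difference of the log-density with respect to x."""
    upper = log_gaussian(x + eps, mu, sigma)
    lower = log_gaussian(x - eps, mu, sigma)
    return (upper - lower) / (2 * eps)


# Example values (arbitrary): the two numbers should agree closely.
closed_form = gaussian_score(1.5, mu=0.5, sigma=2.0)
finite_diff = numerical_score(1.5, mu=0.5, sigma=2.0)
```

The normalizing constant of $p$ only adds a term to $\log p$ that does not depend on $x$, so it drops out of the score.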

## 2. Explicit Score Matching (ESM)

The objective of ESM is to directly minimize the expected squared L2 distance between the model score and the true data score.

**ESM Objective Function (Eq. 2):**
$$J_{ESMq}(\theta) = \mathbb{E}_{q(x)} \left[ \frac{1}{2} || \psi(x;\theta) - \nabla_x \log q(x) ||^2 \right]$$

This objective is intuitive but **cannot be computed directly**, as we do not know the true data score $\nabla_x \log q(x)$.
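
To make this limitation concrete, the sketch below estimates $J_{ESMq}$ by a sample mean in a toy setting where we pretend $q$ is a known Gaussian $\mathcal{N}(m, s^2)$. This is an assumption made purely for illustration: in a real problem the second call to `gaussian_score` (the true score) would not be available. All names are hypothetical.

```python
import random


def gaussian_score(x, mean, std):
    """Score of N(mean, std^2) at x."""
    return -(x - mean) / std**2


def esm_estimate(samples, mu, sigma, m, s):
    """Sample-mean estimate of J_ESM for a Gaussian model and a Gaussian q."""
    total = 0.0
    for x in samples:
        # Requires the true score of q, which is unknown in practice.
        diff = gaussian_score(x, mu, sigma) - gaussian_score(x, m, s)
        total += 0.5 * diff**2
    return total / len(samples)


rng = random.Random(0)
samples = [rng.gauss(1.0, 2.0) for _ in range(10_000)]  # draws from q

# Model equal to q: the two scores coincide, so the objective is zero.
esm_matched = esm_estimate(samples, mu=1.0, sigma=2.0, m=1.0, s=2.0)
# Mismatched model: the objective is strictly positive.
esm_mismatched = esm_estimate(samples, mu=0.0, sigma=1.0, m=1.0, s=2.0)
```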

## 3. Implicit Score Matching (ISM)

ISM, proposed by Hyvärinen (2005), is an equivalent objective function that cleverly avoids the need to compute $\nabla_x \log q(x)$.

**ISM Objective Function (Eq. 3):**
$$J_{ISMq}(\theta) = \mathbb{E}_{q(x)} \left[ \text{tr}(\nabla_x \psi(x;\theta)) + \frac{1}{2} || \psi(x;\theta) ||^2 \right]$$

Where $\text{tr}(\nabla_x \psi(x;\theta))$ is the trace of the Jacobian matrix of the model score function $\psi$, i.e., $\sum_{i} \frac{\partial \psi_i(x;\theta)}{\partial x_i}$.
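
The trace term can be illustrated with a small sketch that approximates $\sum_i \partial \psi_i / \partial x_i$ by central finite differences. The diagonal Gaussian score and all names below are assumptions made for this example; in practice the derivatives would typically come from a closed form or automatic differentiation.

```python
def jacobian_trace(score_fn, x, eps=1e-5):
    """Approximate sum_i d psi_i / d x_i at x with central differences."""
    trace = 0.0
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        # Only the i-th output component contributes to the trace.
        trace += (score_fn(x_plus)[i] - score_fn(x_minus)[i]) / (2 * eps)
    return trace


def diagonal_gaussian_score(x, mu=(0.0, 1.0, -2.0), var=(1.0, 4.0, 0.25)):
    """Score of a Gaussian with independent coordinates (illustrative values)."""
    return [-(xi - mi) / vi for xi, mi, vi in zip(x, mu, var)]


trace = jacobian_trace(diagonal_gaussian_score, [0.3, -1.2, 2.0])
exact = -(1 / 1.0 + 1 / 4.0 + 1 / 0.25)  # sum of -1/var_i
```

This finite-difference version calls the score function twice per input dimension, so its cost grows linearly with the dimension of $x$.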

This objective function **is computable**: the integrand depends only on the model score $\psi(x;\theta)$ and its derivatives, and the expectation $\mathbb{E}_{q(x)}$ can be estimated by a sample mean over data drawn from $q(x)$.
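
A minimal sketch of such a sample-mean estimate, assuming a one-dimensional Gaussian model with $\psi(x;\theta) = -(x-\mu)/\sigma^2$ and $\partial\psi/\partial x = -1/\sigma^2$ (an illustrative choice, not from the paper). The data are simulated here only so the example runs; the estimator itself never touches the true score.

```python
import random


def model_score(x, mu, sigma):
    """psi(x; theta) for the Gaussian model."""
    return -(x - mu) / sigma**2


def model_score_derivative(x, mu, sigma):
    """d psi / dx; in one dimension this is the whole trace term."""
    return -1.0 / sigma**2


def ism_estimate(samples, mu, sigma):
    """Sample-mean estimate of J_ISM: needs only the model and the data."""
    total = 0.0
    for x in samples:
        total += model_score_derivative(x, mu, sigma)
        total += 0.5 * model_score(x, mu, sigma) ** 2
    return total / len(samples)


rng = random.Random(0)
samples = [rng.gauss(1.0, 2.0) for _ in range(10_000)]  # stand-in for real data

ism_good = ism_estimate(samples, mu=1.0, sigma=2.0)  # model close to the data
ism_bad = ism_estimate(samples, mu=0.0, sigma=1.0)   # poorly fitting model
```

Unlike ESM, the ISM value is not a distance and can be negative; only comparisons between parameter settings are meaningful.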

## 4. Equivalence Proof (ESM $\Leftrightarrow$ ISM)

To prove that $J_{ESMq}(\theta)$ and $J_{ISMq}(\theta)$ are equivalent as objectives for $\theta$ (that is, they have the same minimizers), we show that they differ only by a constant that does not depend on $\theta$.

1.  We expand the ESM objective function $J_{ESMq}(\theta)$:
    $$J_{ESMq}(\theta) = \mathbb{E}_{q(x)} \left[ \frac{1}{2} ||\psi(x;\theta)||^2 - \psi(x;\theta)^T (\nabla_x \log q(x)) + \frac{1}{2} ||\nabla_x \log q(x)||^2 \right]$$

2.  This expression can be split into three terms:
    * (a) $\mathbb{E}_{q(x)} \left[ \frac{1}{2} ||\psi(x;\theta)||^2 \right]$
    * (b) $\mathbb{E}_{q(x)} \left[ - \psi(x;\theta)^T (\nabla_x \log q(x)) \right]$
    * (c) $\mathbb{E}_{q(x)} \left[ \frac{1}{2} ||\nabla_x \log q(x)||^2 \right]$

3.  Observe the third term (c). It depends only on the true data distribution $q(x)$ and not on the model parameters $\theta$. Therefore, it is a constant (let $C = (c)$) and can be ignored during minimization.

4.  We focus on transforming the second term (b), using $\nabla_x \log q(x) = \frac{\nabla_x q(x)}{q(x)}$:
    $$
    \begin{align*}
    (b) &= \mathbb{E}_{q(x)} \left[ - \psi(x;\theta)^T \frac{\nabla_x q(x)}{q(x)} \right] \\
    &= -\int q(x) \left[ \psi(x;\theta)^T \frac{\nabla_x q(x)}{q(x)} \right] dx \\
    &= -\int \psi(x;\theta)^T \nabla_x q(x) dx
    \end{align*}
    $$

5.  Apply **integration by parts** (or Gauss's divergence theorem) to the expression above, assuming $q(x)\psi(x;\theta)$ vanishes as $||x|| \to \infty$:
    $$
    \begin{align*}
    (b) &= - \left[ \psi(x;\theta)^T q(x) \right]_{-\infty}^{\infty} + \int q(x) (\nabla_x \cdot \psi(x;\theta)) dx \\
    &= 0 + \int q(x) \text{tr}(\nabla_x \psi(x;\theta)) dx \\
    &= \mathbb{E}_{q(x)} \left[ \text{tr}(\nabla_x \psi(x;\theta)) \right]
    \end{align*}
    $$
    *Note: The divergence $\nabla_x \cdot \psi$ is equal to the trace of the Jacobian $\text{tr}(\nabla_x \psi)$.*

6.  Substitute the results for (a) and (b) back into the expression for $J_{ESMq}(\theta)$, writing the third term as the constant $C$:
    $$
    \begin{align*}
    J_{ESMq}(\theta) &= \mathbb{E}_{q(x)} \left[ \text{tr}(\nabla_x \psi(x;\theta)) \right] + \mathbb{E}_{q(x)} \left[ \frac{1}{2} ||\psi(x;\theta)||^2 \right] + C \\
    &= \mathbb{E}_{q(x)} \left[ \text{tr}(\nabla_x \psi(x;\theta)) + \frac{1}{2} || \psi(x;\theta) ||^2 \right] + C
    \end{align*}
    $$

7.  We find that:
    $$J_{ESMq}(\theta) = J_{ISMq}(\theta) + C$$
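
The identity can be sanity-checked numerically in a toy case. The sketch below assumes $q = \mathcal{N}(m, s^2)$ and a Gaussian model, for which term (c) evaluates to $C = \frac{1}{2s^2}$. It estimates both objectives on the same samples for several values of $\theta$ and looks at the gap. The setup and names are illustrative assumptions, and the gap matches $C$ only up to Monte Carlo error.

```python
import random


def gaussian_score(x, mean, std):
    """Score of N(mean, std^2) at x."""
    return -(x - mean) / std**2


def esm_and_ism(samples, mu, sigma, m, s):
    """Sample-mean estimates of J_ESM and J_ISM for a Gaussian model."""
    esm = 0.0
    ism = 0.0
    for x in samples:
        psi = gaussian_score(x, mu, sigma)
        esm += 0.5 * (psi - gaussian_score(x, m, s)) ** 2
        ism += -1.0 / sigma**2 + 0.5 * psi**2  # trace term + half squared norm
    n = len(samples)
    return esm / n, ism / n


m, s = 1.0, 2.0  # parameters of the toy data distribution q
rng = random.Random(0)
samples = [rng.gauss(m, s) for _ in range(100_000)]
expected_c = 0.5 / s**2  # C = E_q[0.5 * (d/dx log q)^2] for a Gaussian q

gaps = []
for mu, sigma in [(0.0, 1.0), (1.0, 2.0), (-2.0, 1.5), (3.0, 3.0)]:
    esm, ism = esm_and_ism(samples, mu, sigma, m, s)
    gaps.append(esm - ism)  # should be close to expected_c for every theta
```

Each entry of `gaps` should stay near `expected_c` regardless of the model parameters, which is what "differ only by a constant" means in practice.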

---

## 5. Conclusion

The Explicit Score Matching (ESM) and Implicit Score Matching (ISM) objective functions differ only by a constant $C$ that does not depend on $\theta$. Therefore, minimizing $J_{ESMq}(\theta)$ is equivalent to minimizing $J_{ISMq}(\theta)$.

The key advantage of ISM is that it replaces the dependency on the unknown term $\nabla_x \log q(x)$ with the trace of the Jacobian of the model score $\psi(x;\theta)$, i.e., the trace of the Hessian of $\log p(x;\theta)$ with respect to $x$. Every term in the objective can therefore be evaluated from the model and data samples alone.
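
As a small worked case, assume again a one-dimensional Gaussian model. The sample ISM objective is then $-\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}\cdot\frac{1}{N}\sum_n (x_n-\mu)^2$, and setting its derivatives with respect to $\mu$ and $\sigma^2$ to zero gives the sample mean and the (biased) sample variance. The sketch below checks that this closed-form solution beats nearby parameter values; the names and perturbation sizes are arbitrary choices for illustration.

```python
import random
import statistics


def ism_estimate(samples, mu, sigma):
    """Sample-mean ISM objective for the model N(mu, sigma^2)."""
    total = 0.0
    for x in samples:
        psi = -(x - mu) / sigma**2
        total += -1.0 / sigma**2 + 0.5 * psi**2
    return total / len(samples)


rng = random.Random(0)
samples = [rng.gauss(1.0, 2.0) for _ in range(5_000)]  # stand-in for real data

# Closed-form minimizer of the sample ISM objective for this model.
mu_hat = statistics.fmean(samples)
sigma_hat = statistics.pstdev(samples)  # population (biased) standard deviation
best = ism_estimate(samples, mu_hat, sigma_hat)

# Objective values at slightly perturbed parameters.
neighbours = [
    ism_estimate(samples, mu_hat + d_mu, sigma_hat * scale)
    for d_mu in (-0.2, 0.0, 0.2)
    for scale in (0.9, 1.0, 1.1)
    if (d_mu, scale) != (0.0, 1.0)
]
```

No knowledge of $\nabla_x \log q(x)$ is used at any point: the fit is driven entirely by the samples.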

---
Also, [this site](https://bobondemon.github.io/2022/01/08/Estimation-of-Non-Normalized-Statistical-Models-by-Score-Matching/) explains the derivation in detail.

# Questions:
1. Do other score functions exist?
2. What about its efficiency?