## Chapter 10 - Dimensionality Reduction and Metric Learning

_**Author:** Zitong Su_

*This note includes some formula derivations (and/or extended materials) additional to the __Machine Learning ("Watermelon Book")__ or the __Pumpkin Book__.*


### 10.3 Principal Component Analysis
#### 10.3.1 Simplified covariance in formula 10.15
Consider $m$ samples in $d$-dimensional space, collected as the columns of the data matrix $\mathbf{X}=[\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_m] \in \mathbb{R}^{d \times m}$.

For mean-centered samples ($\sum_{i} \mathbf{x}_i = \mathbf{0}$), the unbiased sample covariance matrix of $\mathbf{X}$ is
$$\mathbf{\Sigma} = \frac{1}{m-1} \mathbf{X} \mathbf{X}^{\top}$$

<br>The book, however, omits the unbiased normalization factor $\frac{1}{m-1}$ when it writes the covariance matrix of the projected samples as $\mathbf{W}^{\top} \mathbf{X} \mathbf{X}^{\top} \mathbf{W}$, where $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_{d'}] \in \mathbb{R}^{d \times d'}$.
<br>In PCA, MDS, or SVD-based methods, the scaling factor $\frac{1}{m-1}$ is often omitted because:
- It doesn't affect the eigenvectors (directions of variance)
- It only scales the eigenvalues (magnitudes of variance)

So for dimensionality reduction, the principal directions and their ranking by variance are the same with or without the normalization.

<br>The data are also mean-centered before the transformation, so the mean vector $\boldsymbol{\mu}$ does not appear in the projection.
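
The scaling argument above can be checked numerically. The following is a minimal NumPy sketch on synthetic data (the sizes, the random seed, and the variable names are illustrative assumptions, not taken from the book): it compares the eigendecomposition of $\mathbf{X}\mathbf{X}^{\top}$ with that of $\frac{1}{m-1}\mathbf{X}\mathbf{X}^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 200  # illustrative sizes: d features, m samples

# Synthetic data; columns are samples, with a different spread along each axis.
X = rng.normal(size=(d, m)) * np.array([[3.0], [1.0], [0.3]])
X = X - X.mean(axis=1, keepdims=True)  # mean-center every feature (row)

scatter = X @ X.T        # unnormalized matrix, as used in the book
cov = scatter / (m - 1)  # unbiased sample covariance

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
vals_s, vecs_s = np.linalg.eigh(scatter)
vals_c, vecs_c = np.linalg.eigh(cov)

# Eigenvectors are only defined up to sign, so compare absolute values.
same_directions = np.allclose(np.abs(vecs_s), np.abs(vecs_c))
# Eigenvalues differ exactly by the factor (m - 1).
same_up_to_scale = np.allclose(vals_s, (m - 1) * vals_c)
# Sanity check against NumPy's own covariance (rows are variables, ddof=1).
matches_numpy = np.allclose(cov, np.cov(X))

print(same_directions, same_up_to_scale, matches_numpy)
```

All three checks should hold for data whose eigenvalues are well separated, as they are here by construction.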

<br>

#### 10.3.2 Formula 10.16
Formula 10.16 is a constrained optimization problem. The constraint $\mathbf{W}^{\top}\mathbf{W} = \mathbf{I}$ states that the columns of $\mathbf{W}$ are orthonormal, i.e., $\mathbf{w}^{\top}_{i} \mathbf{w}_{i} = 1$ and $\mathbf{w}^{\top}_{i} \mathbf{w}_{j} = 0 \; (i \neq j)$. ($\mathbf{W}$ is a square orthogonal matrix only in the special case $d' = d$.)
<br>This ensures:
- The columns of $\mathbf{W}$ (i.e., the principal directions) are mutually orthogonal and unit-length (normalized), so the projection introduces no arbitrary scaling. For the eigenvector solution, the projected coordinates are also uncorrelated.
- The variance interpretation is preserved. Without this constraint, you could artificially inflate the variance by scaling $\mathbf{W}$, making the optimization meaningless. The constraint ensures that the variance captured is due to the direction, not the magnitude, of the projection vectors.
- The problem fits into a standard optimization framework. With $\mathbf{W}^{\top}\mathbf{W} = \mathbf{I}$, the optimization becomes a constrained eigenvalue problem whose solution is given by the top-$d'$ eigenvectors (those with the largest eigenvalues) of $\mathbf{X} \mathbf{X}^{\top}$.
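
The points above can be illustrated with a short NumPy sketch. It is not the book's algorithm verbatim; the data, sizes, and names such as `d_prime` are assumptions made for the example. It builds $\mathbf{W}$ from the leading eigenvectors, checks the constraint, compares the objective $\operatorname{tr}(\mathbf{W}^{\top}\mathbf{X}\mathbf{X}^{\top}\mathbf{W})$ with a random orthonormal alternative, and shows how dropping the constraint inflates the objective.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, d_prime = 5, 300, 2  # illustrative sizes

# Synthetic centered data; columns are samples.
X = rng.normal(size=(d, m)) * np.arange(1, d + 1)[:, None]
X = X - X.mean(axis=1, keepdims=True)

S = X @ X.T
eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # indices sorted by descending eigenvalue
W = eigvecs[:, order[:d_prime]]        # d x d' matrix of leading eigenvectors

# The constraint W^T W = I holds for the eigenvector solution.
constraint_ok = np.allclose(W.T @ W, np.eye(d_prime))

# The objective equals the sum of the d' largest eigenvalues.
objective = np.trace(W.T @ S @ W)
top_sum = eigvals[order[:d_prime]].sum()

# Any other matrix with orthonormal columns (here from a QR factorization
# of a random matrix) cannot achieve a larger objective.
Q, _ = np.linalg.qr(rng.normal(size=(d, d_prime)))
random_objective = np.trace(Q.T @ S @ Q)

# Without the constraint, scaling W by 2 multiplies the objective by 4.
inflated_objective = np.trace((2 * W).T @ S @ (2 * W))

print(constraint_ok, objective, top_sum, random_objective, inflated_objective)
```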






## Finding the Coordinates of x in the New Basis W

### 1. Form the Basis Matrix

Let the new basis vectors be
$w_1, w_2, \dots, w_n \;\in\; \mathbb{R}^n$.

Stack them as columns to form the matrix  
$W \;=\; [\,w_1 \;\; w_2 \;\;\dots\;\; w_n\,]\;\in\;\mathbb{R}^{n\times n}$.

<br>

### 2. Solve for x'

We want scalars $(c_1,\dots,c_n)$ such that $x \;=\; c_1\,w_1 \;+\; c_2\,w_2 \;+\;\dots+\;c_n\,w_n$.
  
In matrix form,  
$$
x \;=\; W\,x',
\quad
x' \;=\;
\begin{bmatrix}
c_1\\
\vdots\\
c_n
\end{bmatrix}.
$$

#### Case A: Invertible W

If the columns of $W$ are linearly independent (so $W^{-1}$ exists), compute $x' \;=\; W^{-1}\,x$.
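
As a small numerical sketch of Case A (the matrix and vector below are made-up example values), the coordinates can be obtained with `np.linalg.solve`, which solves $W x' = x$ directly and is generally preferable to forming $W^{-1}$ explicitly:

```python
import numpy as np

# Example basis: columns w1 = (2, 0) and w2 = (1, 1), linearly independent.
W = np.array([[2.0, 1.0],
              [0.0, 1.0]])
x = np.array([3.0, 2.0])

# Solve W @ coords = x for the coordinates of x in the basis {w1, w2}.
coords = np.linalg.solve(W, x)

# Reconstruct x as c1 * w1 + c2 * w2 to confirm the coordinates.
reconstructed = W @ coords

print(coords)         # coordinates (c1, c2)
print(reconstructed)  # should reproduce x
```

Here $x = 0.5\,w_1 + 2\,w_2$, so the coordinates are $(0.5, 2)$.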

<br>

#### Case B: Orthonormal Basis

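If $\{w_i\}$ is orthonormal, then $W^{-1} = W^\top$ and $x' \;=\; W^\top\,x$.

Note: distinguish between an _orthonormal basis_ (mutually orthogonal, unit-length vectors) and an _orthogonal basis_ (mutually orthogonal vectors of arbitrary nonzero length). For a merely orthogonal basis, $W^\top W$ is diagonal rather than the identity, so each coordinate must be divided by the squared length: $c_i = \dfrac{w_i^\top x}{w_i^\top w_i}$.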
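
The difference between the two kinds of basis can be seen in a short NumPy sketch (the rotation angle and the scale factors are arbitrary example values). For an orthonormal basis, $W^\top x$ gives the coordinates directly; for an orthogonal but unnormalized basis, each inner product has to be divided by the squared length of the corresponding vector.

```python
import numpy as np

theta = np.pi / 4
# Orthonormal basis: the columns of a 2-D rotation matrix.
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.array([1.0, 1.0])

# Orthonormal case: the transpose acts as the inverse.
coords = W.T @ x
is_orthonormal = np.allclose(W.T @ W, np.eye(2))

# Orthogonal but not unit-length basis: scale the columns by 2 and 3.
V = W * np.array([2.0, 3.0])
sq_norms = np.sum(V * V, axis=0)   # squared length of each column
coords_V = (V.T @ x) / sq_norms    # divide each inner product by |v_i|^2

print(coords, coords_V)
print(np.allclose(W @ coords, x), np.allclose(V @ coords_V, x))
```

Using $V^\top x$ alone in the second case would give coordinates that are too large by the factors $4$ and $9$.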

<br>

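#### Case C: Non-square or Overcomplete $W$

If $W \in \mathbb{R}^{n \times k}$ is not square, use the Moore-Penrose pseudoinverse, $x' = W^+ x$.

When $W$ is tall ($k < n$) with linearly independent columns,
$$
x' \;=\; W^+\,x
\;=\;(W^\top W)^{-1}W^\top\,x,
$$
which is the least-squares solution; it reconstructs $x$ exactly only if $x$ lies in the span of the columns.

When $W$ is overcomplete ($k > n$) with linearly independent rows, $W^\top W$ is singular and the formula above does not apply. In that case
$$
x' \;=\; W^+\,x
\;=\;W^\top (W W^\top)^{-1}\,x,
$$
which is the minimum-norm choice among the infinitely many $x'$ satisfying $x = W x'$.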
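
A minimal NumPy sketch of the pseudoinverse, using made-up example matrices. The closed form $(W^\top W)^{-1}W^\top$ assumes a tall $W$ with linearly independent columns; for a wide (overcomplete) $W$, $W^\top W$ is singular, and `np.linalg.pinv` returns the minimum-norm coordinates instead.

```python
import numpy as np

# Tall W (3 x 2) with linearly independent columns.
W_tall = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
x = np.array([1.0, 2.0, 3.0])  # lies in the column space: 1*w1 + 2*w2

coords_pinv = np.linalg.pinv(W_tall) @ x
# Same result from the normal equations (W^T W) c = W^T x.
coords_normal = np.linalg.solve(W_tall.T @ W_tall, W_tall.T @ x)

# Wide / overcomplete W (2 x 3): more basis vectors than dimensions.
W_wide = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

coords_wide = np.linalg.pinv(W_wide) @ y
# Closed form for full row rank: W^T (W W^T)^{-1} y.
coords_wide_closed = W_wide.T @ np.linalg.solve(W_wide @ W_wide.T, y)

# Another valid coordinate vector for y; the pinv one has a smaller norm.
other = np.array([1.0, 2.0, 0.0])

print(coords_pinv, coords_normal)
print(coords_wide, np.linalg.norm(coords_wide), np.linalg.norm(other))
```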

<br>

### 3. Intuition

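- For an orthonormal $W$, applying $W^\top$ takes the inner product of $x$ with each basis vector $w_i$, i.e., it projects $x$ onto each $w_i$. For a general invertible $W$, applying $W^{-1}$ takes inner products with the rows of $W^{-1}$ (the dual basis) rather than with the $w_i$ themselves.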
- The entries of $x'$ tell you exactly how much of each $w_i$ is needed to reconstruct $x$.

