This posting has two different looks at how we can minimize the length of some vector $\mathbf x$ in an underdetermined (but full row rank) system of equations $\mathbf{Ax} = \mathbf b$.

The underlying scalars are in $\mathbb R$


Consider the equation $\mathbf {A x} = \mathbf b$, where we know $\mathbf A$ and $\mathbf b$, and need to solve for $\mathbf x$. For avoidance of doubt $\mathbf b \neq \mathbf 0$.  $\mathbf A$ is an m x n matrix.

In the case where $\mathbf A$ is tall and skinny (and with noise in the data, this should equate to an overdetermined system of equations, with full column rank), we can use ordinary least squares to solve for the $\mathbf x$ that minimizes the squared length (2 norm) of $\mathbf v = \mathbf {Ax} - \mathbf b$; that is, we minimize $\mathbf v^T \mathbf v$.  We may use the Normal Equations, or QR factorization, or many other tools at our disposal.  

If $\mathbf A$ is square and of full rank, then we can directly invert $\mathbf A$, or use Gaussian elimination or whatever tool we want. 

Now consider the case where $\mathbf A$ has more columns than rows -- i.e. n > m -- (and again there is noise in the data, so we have full row rank). This means that there are *many* solutions to $\mathbf {A x} = \mathbf b$, because $\mathbf A$ has a non-trivial nullspace.  In this case, we first will want to question why we have this situation, and perhaps gather more data. If we still want to 'solve' this equation, what form might the solution take?  We have many solutions at our disposal, so perhaps one that minimizes the length (2 norm) of $\mathbf x$ is the one we want. 

For avoidance of doubt we have $\mathbf A \in \mathbb R^{\text{m x n}}$ where $m \lt n$.  This also means that $\mathbf x \in \mathbb R^{\text{n}}$ and  $\mathbf b\in \mathbb R^{\text{m}} $.  This means we need at most $m$ linearly independent vectors to generate *any* given $\mathbf b$.  Suppose we choose said linearly independent vectors to be mutually orthonormal, and we purge any vectors not in that set from the solution.  In such a case we would have a solution $\mathbf x$ satisfying the equation $\mathbf{Ax} = \mathbf b$.  We'd also clearly have a smaller (L2 norm) solution than any one using the above solution plus additional mutually orthonormal vectors, which must in some sense be in the nullspace... 
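As a quick numerical illustration of the "many solutions" claim, here is a minimal NumPy sketch. The random `A` and `b`, the seed, and the sizes `m = 3`, `n = 5` are assumptions made purely for illustration. It relies on `np.linalg.lstsq` returning the minimum 2-norm solution when the system is underdetermined.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                      # short and fat: m < n
A = rng.standard_normal((m, n))  # full row rank with probability 1
b = rng.standard_normal(m)

# For an underdetermined system, lstsq returns the minimum 2-norm solution.
x_min, *_ = np.linalg.lstsq(A, b, rcond=None)

# Rows m..n-1 of Vt from the full SVD span the nullspace of A.
_, _, Vt = np.linalg.svd(A, full_matrices=True)
null_vec = Vt[m:].T @ rng.standard_normal(n - m)

# Adding a nullspace vector gives another exact solution, but a longer one.
x_other = x_min + null_vec

print(np.allclose(A @ x_min, b), np.allclose(A @ x_other, b))
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))
```

Both vectors satisfy the equation, and the one carrying a nullspace component is strictly longer.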

There are basically two approaches to solving this.  

First the algebraic one.

$\mathbf {A x} =\big( \mathbf {U \Sigma V}^T\big)\mathbf x = \mathbf b$, using the Singular Value Decomposition, where $\mathbf U$ and $\mathbf V$ are both full rank, square, orthogonal matrices, but because $\mathbf A$ is not square, $\mathbf \Sigma$ is a diagonal matrix that has more columns than rows.  

That is $\mathbf A$ is an m x n matrix with rank m (meaning that each singular value > 0)

$\mathbf A =
\bigg[\begin{array}{c|c|c|c}
\mathbf u_1 & \mathbf u_2 &\cdots & \mathbf u_{m}
\end{array}\bigg] \begin{bmatrix}
\sigma_1 & 0 &0  &0 & ... &0 \\ 
0 & \sigma_2& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \sigma_m& ... &0  
\end{bmatrix} \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T
$   

if we left multiply both sides of $\mathbf {A x} $ by $\mathbf U^T$, we get 

$\mathbf {\Sigma V}^T \mathbf x = \begin{bmatrix}
\sigma_1 & 0 &0  &0 & ... &0 \\ 
0 & \sigma_2& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \sigma_m& ... &0  
\end{bmatrix} \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x = \mathbf U^T \mathbf b$

now with an **abuse of notation**, consider left multiplying by $\mathbf \Sigma^{-1}$:

where $\mathbf \Sigma^{-1} = \begin{bmatrix}
\frac{1}{\sigma_1} & 0 &0  &0 &  ... &0 \\ 
0 & \frac{1}{\sigma_2}& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \frac{1}{\sigma_m}& ... &0  
\end{bmatrix}^T$ 

Thus it is not technically an inverse or a left inverse... $\mathbf \Sigma^{-1}$ **is actually a right inverse** but we ultimately are multiplying on the left because that is all we can do here -- hence this is an abuse of notation.

$\mathbf {(D)V}^T \mathbf x = 
\Big(\begin{bmatrix}
\frac{1}{\sigma_1} & 0 &0  &0 &  ... &0 \\ 
0 & \frac{1}{\sigma_2}& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \frac{1}{\sigma_m}& ... &0  
\end{bmatrix}^T \begin{bmatrix}
\sigma_1 & 0 &0  &0 & ... &0 \\ 
0 & \sigma_2& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \sigma_m& ... &0  
\end{bmatrix}\Big) \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x =  \mathbf \Sigma^{-1}\mathbf U^T \mathbf b$


$\mathbf {(D)V}^T \mathbf x = \begin{bmatrix}
1 & 0 &0  &0 & \mathbf 0^T \\ 
0 & 1 & 0 & 0& \mathbf 0^T\\ 
0 & 0 &  \ddots & 0& \mathbf 0^T \\ 
0 & 0 & 0 & 1 & \mathbf 0^T  \\
\mathbf 0 & \mathbf 0 & \mathbf 0 & \mathbf 0 & \mathbf 0\mathbf 0^T  \\ 
\end{bmatrix} \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x = \begin{bmatrix}
\mathbf I & \mathbf{00}^T \\ 
\mathbf {00}^T & \mathbf {00}^T  \\ 
\end{bmatrix} \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x = \mathbf \Sigma^{-1}\mathbf U^T \mathbf b$

Which is to say that $\mathbf D$ is the spectra for an idempotent matrix (i.e. a projection matrix's eigenvalues).  (Note that to deal with notational overload, $\mathbf {0}$ is to be the appropriately sized zero vector, and $\mathbf {00}^T$ is the appropriately sized zero matrix.)  
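To make the "right inverse" remark concrete, here is a small sketch. The singular values `3.0` and `0.5` and the sizes `m = 2`, `n = 4` are hypothetical, chosen only so the matrices are easy to read.

```python
import numpy as np

m, n = 2, 4
sigma = np.array([3.0, 0.5])              # hypothetical singular values, all > 0

Sigma = np.zeros((m, n))
Sigma[:m, :m] = np.diag(sigma)            # m x n: diagonal block padded with zero columns

Sigma_inv = np.zeros((n, m))
Sigma_inv[:m, :m] = np.diag(1.0 / sigma)  # n x m: the "abuse of notation" inverse

right = Sigma @ Sigma_inv                 # m x m
D = Sigma_inv @ Sigma                     # n x n

print(right)  # the identity: Sigma_inv is a genuine right inverse
print(D)      # diag(1, 1, 0, 0): idempotent, but not the identity
```

Multiplying on the right recovers the identity, while multiplying on the left (which is what the derivation does) only recovers $\mathbf D$.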

From here multiply both sides by $\mathbf V$, and we get 

$\mathbf {VDV}^T \mathbf x =  \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg] \begin{bmatrix}
\mathbf I & \mathbf{00}^T \\ 
\mathbf {00}^T & \mathbf {00}^T  \\ 
\end{bmatrix}\bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x= \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$


note that 

$\mathbf P = \mathbf {VDV}^T$  

obeys 

$\mathbf P^2 = \mathbf P$  

hence $\mathbf P$ is idempotent, and indeed it is orthogonally diagonalizable -- i.e. it is a projection matrix.  (Note: there are some different conventions -- some texts refer to all idempotent matrices as projection matrices, while others only refer to real symmetric -- or Hermitian -- ones such as this as projection matrices.)  
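The properties of $\mathbf P$ can be checked numerically. This is a minimal sketch assuming a random full row rank `A`; the seed and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 6
A = rng.standard_normal((m, n))

_, _, Vt = np.linalg.svd(A, full_matrices=True)  # Vt is n x n
V = Vt.T

# D has m ones followed by n - m zeros on its diagonal.
D = np.diag([1.0] * m + [0.0] * (n - m))
P = V @ D @ V.T

print(np.allclose(P @ P, P))       # idempotent
print(np.allclose(P, P.T))         # symmetric
print(np.isclose(np.trace(P), m))  # trace equals rank, which is m
```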

Note that if $\mathbf A$ were square and full rank, we would have $\mathbf D = \mathbf I$, and $\mathbf \Sigma^{-1}$ would be an actual inverse rather than an abuse of notation (here it is only a right inverse), and hence we would have solved our equation.  

That is, if $\mathbf A$ were square and full rank (so $m = n$), we would have had:


$\mathbf {VIV}^T \mathbf x = \mathbf {VV}^T \mathbf x = \big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_n v_n}^T\big) \mathbf x = \mathbf {I} \mathbf x = \mathbf x = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$

but instead what we have is

$\mathbf P \mathbf x =  \Big(\big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_m v_m}^T\big) + \big(0\mathbf{v_{m+1} v_{m+1}}^T + 0 \mathbf{v_{m+2} v_{m+2}}^T + ... + 0\mathbf{v_n v_n}^T\big)\Big) \mathbf x = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$

or more simply 

$\mathbf {Px}= \big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_m v_m}^T\big) \mathbf x =  \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b $

Now recall that $\mathbf V = \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]$, which is an n x n orthogonal matrix -- that is $\mathbf V$ can be thought of as a coordinate system.  Thus our solution vector $\mathbf x$ can be a linear combination of the columns of $\mathbf V$.  We can write this as 

$\mathbf x = \mathbf{Vy} = y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m +  y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}$  

and recalling that since $\mathbf V$ is orthogonal, it is length (2 norm) preserving, so:  

$\big \vert\big \vert \mathbf x \big \vert \big \vert_2^{2} =  \big \vert\big \vert \mathbf{Vy} \big \vert\big \vert_2^{2} = \big \vert \big \vert \mathbf y \big \vert\big \vert_2^{2} = \mathbf y^T \mathbf y = y_1^2 + y_2^2 + ... + y_m^2 + y_{m+1}^2 + ... + y_n^2$

we substitute in and get 

$\big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_m v_m}^T\big) \big(y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m +  y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}\big) = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$

which by the orthogonality of the columns in $\mathbf V$ gives us:

$\mathbf{Px} = y_1  \mathbf{v_1} + y_2\mathbf v_2 +... + y_m  \mathbf{v_m} = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b $

From here we notice that any $y_k$, for $k \gt m$ contributes to the length of $\mathbf x$ but does not contribute to the solution of the problem (i.e. they are in the null space).  

Thus the minimal length solution to the underdetermined $\mathbf {Ax} = \mathbf b$ comes in the form of a solution to the equation that has $\mathbf x $ written purely as a linear combination of $\{\mathbf v_1, \mathbf v_2, ..., \mathbf v_m \}$

The way to interpret this, then, is we solve for any legal $\mathbf x$ that is a valid solution, and then project such a solution down to a subspace that only is spanned by $m$ mutually orthonormal vectors ($\mathbf v_1, ..., \mathbf v_m$)  
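A minimal sketch of the algebraic recipe $\mathbf x = \mathbf{V\Sigma}^{-1}\mathbf U^T \mathbf b$ follows, assuming a random `A` and `b` (seed and sizes are illustrative). The comparison against `np.linalg.pinv` is just a cross-check.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # U: m x m, s: (m,), Vt: n x n

Sigma_inv = np.zeros((n, m))
Sigma_inv[:m, :m] = np.diag(1.0 / s)             # n x m right inverse of Sigma

x = Vt.T @ Sigma_inv @ U.T @ b

# Coordinates of x in the basis given by the columns of V.
y = Vt @ x

print(np.allclose(A @ x, b))                  # x solves the system
print(np.allclose(y[m:], 0.0))                # no weight on v_{m+1}, ..., v_n
print(np.allclose(x, np.linalg.pinv(A) @ b))  # agrees with the pseudoinverse
```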

**a nice insight**   
suppose, for instance we solve a Linear Program and compute 

$\mathbf x_{\text{L1 norm minimized}}$  

if we project it down to the subspace of $\{\mathbf v_1, \mathbf v_2, ...., \mathbf v_m\}$ using, of course, our projector $\mathbf P$, to do so, we see 


$\mathbf P \mathbf x_{\text{L1 norm minimized}} = \mathbf x_{\text{L2 norm minimized}} = \mathbf x$  

of course $\mathbf P$ is not invertible (unless $\mathbf A$ is square) so this is a one way relation, but nevertheless interesting.  
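The projection claim holds for any satisfying vector, not only an L1-minimized one, so it can be sketched without a Linear Program. In the snippet below, no LP solver is used: a sparse "basic" solution built from the first `m` columns of `A` stands in for a solution found some other way. That stand-in, the random data, and the seed are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# A sparse basic solution: only the first m entries are nonzero.
# (Assumes the leading m x m block of A is invertible, true with probability 1 here.)
x_basic = np.zeros(n)
x_basic[:m] = np.linalg.solve(A[:, :m], b)

_, _, Vt = np.linalg.svd(A, full_matrices=True)
V_m = Vt[:m].T                # v_1, ..., v_m as columns
P = V_m @ V_m.T               # projector onto span{v_1, ..., v_m}

x_l2 = np.linalg.pinv(A) @ b  # minimum 2-norm solution, for comparison

print(np.allclose(A @ x_basic, b))
print(np.allclose(P @ x_basic, x_l2))
```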

- - - - - 

**alternative takes:**  

1.)  
Some math texts, and some computational solvers (e.g. Julia's), will do SVD for short fat matrices in the form of 

$\mathbf A = \mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T$   

$\mathbf V^T$ is $m$ x $n$, or equivalently, $\mathbf V$ is $n$ x $m$, hence it has mutually orthonormal vectors of dimension $n$, but only has $m$ of them and hence is not an orthogonal matrix, just a matrix with mutually orthonormal columns -- but not enough to form a basis.  


To verify that the end results are unchanged, we can see 


$\mathbf A = \mathbf U \big(\mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T\big) = \mathbf U\big(\mathbf \Sigma\mathbf V^T\big)$  

where $\mathbf \Sigma_{\text{square}}$ is $m$ x $m$ 



because 

$\mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T = \mathbf \Sigma\mathbf V^T = 
\begin{bmatrix}
\sigma_1 \mathbf v_1^T \\
 \sigma_2 \mathbf v_2^T \\ 
\vdots\\ 
\sigma_{m-1} \mathbf v_{m-1}^T \\ 
\sigma_m \mathbf v_m^T
\end{bmatrix}$ 

and as before 

$\mathbf P = \big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_m v_m}^T\big) = \mathbf V_{\text{non-square}} \mathbf V_{\text{non-square}}^T$  
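In NumPy the two SVD conventions correspond to the `full_matrices` flag of `np.linalg.svd`. The sketch below, on assumed random data, checks that the reduced form reproduces $\mathbf A$ and yields the same projector.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 6
A = rng.standard_normal((m, n))

U, s, Vt_thin = np.linalg.svd(A, full_matrices=False)  # Vt_thin: m x n
_, _, Vt_full = np.linalg.svd(A, full_matrices=True)   # Vt_full: n x n

A_rebuilt = U @ np.diag(s) @ Vt_thin                   # U Sigma_square V_non-square^T

P_thin = Vt_thin.T @ Vt_thin                           # V_non-square V_non-square^T
P_full = Vt_full[:m].T @ Vt_full[:m]                   # v_1 v_1^T + ... + v_m v_m^T

print(Vt_thin.shape, Vt_full.shape)
print(np.allclose(A_rebuilt, A))
print(np.allclose(P_thin, P_full))
```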

2.) *An alternative computationally efficient approach:*  

While the SVD approach allows us to compute $\mathbf P$ and get insights into the geometry of the minimum cost (L2 norm) solution to our system of equations, what the below section (using Lagrange multipliers) shows is that the solution is given by 

$\mathbf x = \mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b$ 

As with ordinary least squares, we generally don't want to actually compute $\big(\mathbf{AA}^T\big)$ for cost and numeric stability reasons.  But the reader may also recall that computing the SVD is the most expensive of the typical matrix factorizations.  The nice middle ground here (much like in least squares) is to use QR factorization.  An outline of the computational approach using QR factorization is shown below.  


$\mathbf A^T = \mathbf {QR}$

where $\mathbf Q$ is tall and skinny and $\mathbf R$ is a square upper triangular matrix.  Since $\mathbf A$ has full row rank this means $\mathbf A^T$ has full column rank, which means $\mathbf R$ has no zeros along its diagonal, which means $\mathbf R^{-1}$ exists.  

If you did this, you'd get 

$\mathbf x = \mathbf A^T\big(\mathbf {AA}^T\big)^{-1}\mathbf b =\big(\mathbf {QR}\big)\Big(\big(\mathbf Q \mathbf R\big)^T\big(\mathbf Q\mathbf R\big)\Big)^{-1}\mathbf b = \mathbf {QR}\Big(\big(\mathbf R^T \mathbf Q^T\big)\big(\mathbf Q\mathbf R\big)\Big)^{-1}\mathbf b $ 

$ = \mathbf {QR}\big(\mathbf R^T\mathbf R \big)^{-1}\mathbf b = \mathbf {QR}\big(\mathbf R \big)^{-1}\big(\mathbf R^T\big)^{-1}\mathbf b = \mathbf Q\big(\mathbf R^T\big)^{-1}\mathbf b$  

hence we have 

$\mathbf x = \mathbf x_{\text{L2 norm minimized}} = \mathbf Q\big(\mathbf R^T\big)^{-1}\mathbf b$ 

which is to say that the solution is equivalent to running QR factorization on $\mathbf A^T$  

and then solving the lower triangular system of equations for $\mathbf y$ in 

$\mathbf R^T \mathbf y = \mathbf b$ 

and then after solving for $\mathbf y$, left multiplying by $\mathbf Q$


$\mathbf x = \mathbf Q\mathbf y =\mathbf Q\Big(\big(\mathbf R^T\big)^{-1}\mathbf b\Big)$   
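The QR recipe above can be sketched as follows, assuming random data. `np.linalg.solve` is used for the triangular system purely for simplicity; a dedicated triangular solver would be the natural choice in practice. The normal-equations formula appears only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Reduced QR of A^T: Q is n x m (tall and skinny), R is m x m upper triangular.
Q, R = np.linalg.qr(A.T, mode="reduced")

# Solve the lower triangular system R^T y = b.
# np.linalg.solve is a general solver and does not exploit the triangular structure.
y = np.linalg.solve(R.T, b)
x = Q @ y

# A^T (A A^T)^{-1} b, formed explicitly for comparison only.
x_normal = A.T @ np.linalg.solve(A @ A.T, b)

print(np.allclose(A @ x, b))
print(np.allclose(x, x_normal))
```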



we can streamline the above by considering that 

$\mathbf A^T = \mathbf {QR}$  is equivalent to  
$\mathbf A = \mathbf {R}^T\mathbf Q^T =  \mathbf {L}\mathbf Q^T$   

where $\mathbf Q$ is tall and skinny, so $\mathbf Q^T \in \mathbb R^{\text{m x n}}$ is short and fat.   
Now consider the projector given by   
$\mathbf P:= \mathbf Q \mathbf Q^T $  

$\mathbf P^2 = \mathbf Q \mathbf Q^T\mathbf Q \mathbf Q^T =  \mathbf Q\big(\mathbf Q^T\mathbf Q\big) \mathbf Q^T = \mathbf Q\big(\mathbf I_m \big) \mathbf Q^T = \mathbf P$  

and $\mathbf P = \mathbf P^T$ so it is a projector  

note: $\mathbf P$ has rank $m$ -- given by its trace.    


note that 

$\mathbf {AP} = \mathbf {LQ}^T\mathbf Q \mathbf Q^T = \mathbf {LQ}^T = \mathbf A$  

so when we consider any satisfying $\mathbf x$ where 
$\mathbf {Ax} = \mathbf b$ 

and we consider its squared length, we have, in effect via Pythagorean theorem    

$\Big \Vert \mathbf x\Big \Vert_2^2$  
$= \Big \Vert\big(\mathbf I\big)\mathbf x\Big \Vert_2^2$  
$= \Big \Vert\Big(\mathbf P +\big( \mathbf I - \mathbf P\big)\Big)\mathbf x\Big \Vert_2^2$  
$= \Big \Vert\mathbf {Px} +\big( \mathbf I - \mathbf P\big)\mathbf x\Big \Vert_2^2$  
$= \Big \Vert\mathbf {Px}\Big \Vert_2^2 + \Big \Vert \big( \mathbf I - \mathbf P\big)\mathbf x\Big \Vert_2^2$  
$\geq \Big \Vert\mathbf {Px}\Big \Vert_2^2 $  

with equality **iff**   
$\Big \Vert \big( \mathbf I - \mathbf P\big)\mathbf x\Big \Vert_2^2 = 0$ 

which we get by selecting our optimal $\mathbf x^* := \mathbf {Px}$, and of course  
$\Big \Vert \big( \mathbf I - \mathbf P\big)\mathbf x^*\Big \Vert_2^2= \Big \Vert \big( \mathbf I - \mathbf P\big)\mathbf P \mathbf x\Big \Vert_2^2 = \Big \Vert  \mathbf {Px} - \mathbf P^2 \mathbf x \Big \Vert_2^2 = \Big \Vert  \mathbf {Px} - \mathbf P \mathbf x \Big \Vert_2^2= 0$ 

so we are able to achieve this lower bound.  Furthermore, we know $\mathbf {x}^*$ still satisfies the original equation because  

$\mathbf {Ax}^* $  
$=\mathbf {A}\big(\mathbf {Px}\big)$  
$=\big(\mathbf {AP}\big)\mathbf {x}$  
$=\big(\mathbf {A}\big)\mathbf {x}$  
$=\mathbf {A}\mathbf {x}$  
$= \mathbf b$  
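The Pythagorean split and the fact that $\mathbf{AP} = \mathbf A$ can be checked numerically. In this sketch the starting `x` is a hypothetical satisfying vector built as a particular solution plus a random nullspace component; the data and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, _ = np.linalg.qr(A.T, mode="reduced")  # Q: n x m with orthonormal columns
P = Q @ Q.T                               # projector onto the row space of A
I = np.eye(n)

# Some satisfying x: a particular solution plus an arbitrary nullspace component.
x = np.linalg.pinv(A) @ b + (I - P) @ rng.standard_normal(n)

x_star = P @ x
lhs = x @ x
rhs = x_star @ x_star + ((I - P) @ x) @ ((I - P) @ x)

print(np.allclose(A @ P, A))       # AP = A
print(np.isclose(lhs, rhs))        # squared length splits into two orthogonal parts
print(np.allclose(A @ x_star, b))  # x* still solves the system
print(x_star @ x_star <= lhs)      # and x* is no longer than x
```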

and of course with  
$\mathbf{R}^T\mathbf Q^T\mathbf x = \mathbf{Ax} = \mathbf b$  
$\mathbf Q^T\mathbf x = \big(\mathbf{R}^T\big)^{-1}\mathbf b$  
where we recall that $\det\big(\mathbf R\big) \neq 0$   
so the right hand side is uniquely specified by the problem at this point.  From here we multiply each side by $\mathbf Q$, which is tall and skinny and full rank    

$\mathbf Q \mathbf Q^T\mathbf x = \mathbf {Px} = \mathbf x^* = \mathbf Q\big(\mathbf{R}^T\big)^{-1}\mathbf b $  
 
this is enough to confirm the uniqueness of $\mathbf x^*$... 

suppose we have two distinct (non-optimized) solution vectors $\mathbf x_1$ and $\mathbf x_2$  

Then  
$\mathbf A\big(\mathbf x_1 - \mathbf x_2 \big) = \mathbf A\mathbf x_1 - \mathbf A \mathbf x_2 = \mathbf b - \mathbf b = \mathbf 0$  

left multiplying each side by 
$\big(\mathbf R^T\big)^{-1}$  gives  

$\mathbf Q^T \big(\mathbf x_1 - \mathbf x_2 \big) = \mathbf 0$  
and left multiplying each side by $\mathbf Q$ gives  
$\mathbf Q \mathbf Q^T \big(\mathbf x_1 - \mathbf x_2 \big) = \mathbf P \big(\mathbf x_1 - \mathbf x_2 \big) = \mathbf 0$  
or  
$\mathbf P \mathbf x_1 = \mathbf P \mathbf x_2 = \mathbf x^*$  
the contrapositive is if we think we find an even better minimal solution vector $\mathbf x^{**}$ where 
$\Big \Vert\mathbf {Px}\Big \Vert_2^2 \gt \Big \Vert\mathbf {x^{**}}\Big \Vert_2^2 $  
then we know $\mathbf {Px} \neq \mathbf x^{**}$ and in fact that $\mathbf {A}\mathbf x^{**} \neq \mathbf b$   



For avoidance of doubt we have $\mathbf A \in \mathbb R^{\text{m x n}}$ where $m \lt n$. 


*Radical streamlining of the above:*  
The idea, at its core is that 

$\mathbf A$ has rows that span a portion of the vector space that $\mathbf x$ lives in.  Extend this to make a basis, and write all legal solutions $\mathbf x$ as a linear combination of these vectors.  Then consider that $\mathbf{Ax}$ is unchanged if we drop from $\mathbf x$ each 'basis vector' not in the span of $\mathbf A$'s rows.  With respect to minimizing a 2 norm, we can make this argument crisp if we first make our basis consist of mutually orthonormal vectors, and of course this is exactly what we did whether via QR factorization or SVD.  



consider invertible matrix $\mathbf B$ given by  
$\mathbf B = 
\bigg[\begin{array}{c|c|c|c|c|c} 
\mathbf a_1  &\cdots & \mathbf a_{m} & \mathbf b_{m+1} & \cdots & \mathbf b_p 
\end{array}\bigg]$  

$\underbrace{\{\mathbf a_1,...,\mathbf a_m\}}_{\text{m linearly independent columns}}$  
$\underbrace{\{\mathbf b_{m+1},...,\mathbf b_p\}}_{\text{p - m linearly independent vectors from left nullspace}}$  

i.e. 
$\mathbf b_k^* \mathbf A^*  = \mathbf 0^*$  

we know   
$\mathbf P \mathbf B = \bigg[\begin{array}{c|c|c|c|c|c} 
\mathbf P \mathbf a_1  &\cdots & \mathbf P \mathbf a_{m} & \mathbf P \mathbf b_{m+1} & \cdots & \mathbf P \mathbf b_p 
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c|c|c} 
\mathbf a_1  &\cdots & \mathbf a_{m} & \mathbf 0 & \cdots & \mathbf 0 
\end{array}\bigg]$  
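Here is one way to read the construction of $\mathbf B$ numerically. The reading assumed here is that $\mathbf a_1, ..., \mathbf a_m$ are the rows of the original $\mathbf A$ written as columns, that the $\mathbf b_k$ form a basis for the nullspace of $\mathbf A$, and that $p = n$. The random data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 2, 5
A = rng.standard_normal((m, n))

_, _, Vt = np.linalg.svd(A, full_matrices=True)
N = Vt[m:].T             # n x (n - m): orthonormal basis for the nullspace of A

B = np.hstack([A.T, N])  # columns a_1..a_m (rows of A), then b_{m+1}..b_n

Q, _ = np.linalg.qr(A.T, mode="reduced")
P = Q @ Q.T              # projector onto the span of a_1..a_m

PB = P @ B
expected = np.hstack([A.T, np.zeros((n, n - m))])

print(np.linalg.matrix_rank(B) == n)  # B is invertible
print(np.allclose(PB, expected))      # P keeps the a's and annihilates the b's
```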

- - - - 
and if we find some other projector $\mathbf S$ that has $\mathbf {SA} = \mathbf A$ where    

$\text{rank}\big(\mathbf P\big)= \text{rank}\big(\mathbf S\big)$    
(or what is equivalent $\text{trace}\big(\mathbf P\big)= \text{trace}\big(\mathbf S\big)$ )   

then  
$\mathbf P = \mathbf S$    

- - - - 
First consider 
$\mathbf B$  

$\text{rank}\big(\mathbf {PB}\big)= \text{rank}\big(\mathbf {SB}\big)$  
because multiplication by an invertible matrix preserves rank  

hence  
$\mathbf {SB} = \bigg[\begin{array}{c|c|c|c|c|c} 
\mathbf a_1  &\cdots & \mathbf a_{m} & \mathbf S \mathbf b_{m+1} & \cdots & \mathbf S \mathbf b_p 
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c|c|c} 
\mathbf a_1  &\cdots & \mathbf a_{m} & \sum_{i=1}^m c_i^{(m+1)} \mathbf a_i & \cdots & \sum_{i=1}^m c_i^{(p)} \mathbf a_i
\end{array}\bigg]$  

$= \bigg[\begin{array}{c|c|c|c|c|c} 
\mathbf a_1  &\cdots & \mathbf a_{m} & \mathbf 0 & \cdots & \mathbf 0 
\end{array}\bigg]$  

we prove the second line as follows:  

a projector satisfies $\mathbf S = \mathbf S^2$ or equivalently is annihilated by  
$\mathbf S - \mathbf S^2 = \big(\mathbf I - \mathbf S\big)\mathbf S = \mathbf 0$    

hence  
$\big(\mathbf I - \mathbf S\big)\mathbf S\mathbf B = \mathbf 0$    




we claim that for each $\mathbf b_k$ we have  
$\mathbf S \mathbf b_{k} = \sum_{i=1}^m c_i^{(k)} \mathbf a_i = \mathbf 0$   
(i.e. each coefficient $c_i^{(k)} = 0$, since the $\mathbf a_i$ are linearly independent). 


This is most easily confirmed by running Gram-Schmidt on $\{\mathbf a_1,...,\mathbf a_m\}$ and decomposing each $\mathbf a_j$ into a linear combination of mutually orthonormal vectors $\{\mathbf u_1,...,\mathbf u_m\}$. Note we may do this in a triangular manner similar to QR decomposition, which immediately implies $\mathbf a_1^* \mathbf b_k = 0 \longrightarrow \mathbf u_1^* \mathbf b_k = 0 $, and for larger $j$ (suppressing the scalar coefficients, where the coefficient on $\mathbf u_j$ is nonzero) 

$0 = \mathbf a_j^*\mathbf b_k = \mathbf u_j^* \mathbf b_k  + \sum_{r =1 }^{j-1} \mathbf u_r^* \mathbf b_k  = \mathbf u_j^* \mathbf b_k $  

where $\mathbf u_r^* \mathbf b_k = 0$ by induction hypothesis   


So examine the above, decompose into $\mathbf u_r$'s and left multiply by $\mathbf u_j$

We confirm this by left multiplying by any $\mathbf u_j^*$ to see 

$0 = \mathbf u_j^* \mathbf b_{k}  = \big(\mathbf u_j^* \mathbf S\big) \mathbf b_{k} = \mathbf u_j^* \big(\mathbf S \mathbf b_{k}\big) =\mathbf u_j^* \sum_{i=1}^m c_i^{(k)}\mathbf a_i  =\mathbf u_j^* \sum_{i=1}^m c_i^{(k)}\sum_{t=1}^m \gamma_t \mathbf u_t = \alpha_j \big \Vert\mathbf u_j \big \Vert_2^2  + 0 = \alpha_j$   

where $\alpha_j = c_i^{(k)} \cdot \gamma_t$ and $\gamma_t \neq 0\longrightarrow c_i^{(k)} = 0$   


where the right hand side follows (i.e. each coefficient must be zero) by orthogonality  


**additional look at projection brought to us by SVD**  
In some sense this may be the cleanest approach, though it is overkill to read this *and* the above  

note: remember that $\mathbf A$ is assumed to be full row rank, which means it has column rank of $m$.  We want to have a linear combination of $\mathbf A$'s columns such that said combination is equal to $\mathbf b\in \mathbb R^{\text{m}}$.  That is, we know we can cleverly select from $\mathbf A$'s columns to form a basis.  The goal, then, is to select said basis in such a way that we can isolate the vectors which are in the (right) nullspace of $\mathbf A$ and remove them from our solution.  We use orthogonality to get a 'clean look' at the various vectors being used to construct a solution and the ones that only contribute to the length of $\mathbf x$ but do not contribute to the actual solution in $\mathbf b$.  Given that we are using orthogonality for a non square matrix, a very natural question is: what are the implications of using SVD on $\mathbf A$ in order to (attempt to) solve this problem?  

- - - - 
Each row vector in $\mathbf V^T $ must be $n$ dimensional, like $\mathbf x$.  

Let us consider the orthonormal basis given by the columns of $\mathbf V$.  

However, for our short fat $\mathbf A$, we may choose to have $\mathbf V_{\text{non-square}}^T$ (i.e. short and fat), with the other two matrices in the SVD being square, as given below  

$\mathbf A = \mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T$  

- - - - -  
*note on dimensions*  
i.e. $\mathbf V_{\text{non-square}} \in \mathbb R^{\text{n x m}}$  (i.e. the column vectors are n dimensional but there are only $m \lt n$ of them)   

equivalently, 

$\mathbf V_{\text{non-square}} = \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{m}
\end{array}\bigg]$  

where $m \lt n$  
- - - - -  
*The key insight:*  we know that we can solve for a satisfying $\mathbf x$ given $\mathbf A$, $\mathbf b$, exactly -- why? Because the columns of $\mathbf A$ span $\mathbb R^{\text{m}}$ and so contain a basis (this is implied by full row rank).  However there are just too many columns relative to the dimension so at least one column is linearly dependent on the others. But since we have a basis in $\mathbf A$'s columns, why not choose a smart basis and write any satisfying $\mathbf x$ in terms of it?  

Furthermore, because  $\mathbf U$ and $\mathbf \Sigma_{\text{square}}$ are both invertible we can simplify the results with a change of variables and assume WLOG that we are solving for a linear transformation of $\mathbf x$ that gives $\mathbf c$ for some $\mathbf c \neq 0$ where $\mathbf c: = \big(\mathbf U \mathbf \Sigma_{\text{square}}\big)^{-1}\mathbf b$.  

But this tells us that the smart basis to write our solution in terms of is $\mathbf V_{\text{non-square}}$ -- except first we'll extend this to an orthonormal basis, by including $\mathbf v_{m+1}, \mathbf v_{m+2}, ..., \mathbf v_{n}$  

- - - - -   

So we write $\mathbf x$ as a linear combination of $\mathbf v_k$'s     

$\mathbf x = \mathbf{Vy} = y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m +  y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}$

but this makes our problem:  

$\mathbf V_{\text{non-square}}^T  \mathbf x  = \mathbf V_{\text{non-square}}^T\mathbf{Vy}= \mathbf c_{\in \mathbb R^m}$  


now, we know 

$\mathbf V_{\text{non-square}}^T\big( y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}\big) = \mathbf A\big( y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}\big) = \mathbf 0_{\in \mathbb R^m} $  

because  

$\mathbf V_{\text{non-square}}^T \mathbf v_j = \mathbf 0$  for $j \gt m$   

but we also know 

$\mathbf V_{\text{non-square}}^T\big( y_{1}\mathbf v_{1} + y_{2}\mathbf v_{2} + ... + y_{m}\mathbf v_{m}\big) =  y_{1}\mathbf e_{1\in \mathbb R^m} + y_{2}\mathbf e_{2\in \mathbb R^m} + ... + y_m\mathbf e_{m\in \mathbb R^m} = \begin{bmatrix}
y_1\\ 
y_2\\ 
\vdots \\ 
y_m\\
\end{bmatrix} = \begin{bmatrix}
c_1\\ 
c_2\\ 
\vdots \\ 
c_m\\
\end{bmatrix} = \mathbf c $  


hence each $y_k$ for $1 \leq k \leq m$ is uniquely specified by $c_k$.  

equivalently, we know, that for any given $\mathbf c$, the $\mathbf x$ must at least include the linear combination of 

$\mathbf x = \mathbf{Vy} = y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m  =  c_1\mathbf v_1 + c_2\mathbf v_2 + ... + c_m\mathbf v_m = \sum_{k=1}^m c_k \mathbf v_k = \mathbf V_{\text{non-square}}\mathbf{c}$

If any proposed solution does not include the above linear combination $\mathbf V_{\text{non-square}}\mathbf{c}$ for a given $\mathbf c \neq \mathbf 0$ then it is not an accurate solution.  This solution has length (squared 2 norm) given by $\big \Vert \mathbf V_{\text{non-square}}\mathbf{c}\big \Vert_2^2 = \big \Vert \mathbf{c}\big \Vert_2^2 = \sum_{k=1}^m c_k^2 = \sum_{k=1}^m y_k^2 $.  Taking advantage of orthogonality, we can see that including any additional nonzero $y_j$  for $m \lt j \leq n$  does not change the configuration of $y_k$ but does necessarily increase the length of the solution (by positive definiteness of the 2 norm).   Hence we deem $\mathbf V_{\text{non-square}}\mathbf{c}$ as necessarily the valid and minimal length solution.  
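The change of variables to $\mathbf c$ can be sketched directly. The random data, the seed, and the weight `0.7` placed on $\mathbf v_{m+1}$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

U, s, Vt_thin = np.linalg.svd(A, full_matrices=False)
V_ns = Vt_thin.T                        # n x m, orthonormal columns

c = np.linalg.solve(U @ np.diag(s), b)  # c = (U Sigma_square)^{-1} b
x = V_ns @ c                            # candidate minimal length solution

# Add weight on v_{m+1}, taken from the full SVD, which lies in the nullspace of A.
_, _, Vt_full = np.linalg.svd(A, full_matrices=True)
x_longer = x + 0.7 * Vt_full[m]

print(np.allclose(A @ x, b), np.allclose(A @ x_longer, b))
print(np.isclose(x @ x, c @ c))                       # ||x||^2 = ||c||^2
print(np.isclose(x_longer @ x_longer, c @ c + 0.49))  # extra y_j only adds length
```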

- - - - 


This gives a nice interpretation to our projection matrix $\mathbf P$.  We have 


$\mathbf P\mathbf x^{(0)} =  \mathbf V \begin{bmatrix}
\mathbf I_m & \mathbf{00}^T \\ 
\mathbf {00}^T & \mathbf {00}^T  \\ 
\end{bmatrix} \mathbf V^T \mathbf x^{(0)} = \mathbf V \big(\mathbf {DV}^T \mathbf x^{(0)}\big) = \mathbf V \big(\mathbf D\mathbf y^{(0)}\big) = \mathbf V \big(\begin{bmatrix}
\mathbf I_m & \mathbf{00}^T \\ 
\mathbf {00}^T & \mathbf {00}^T  \\ 
\end{bmatrix}\begin{bmatrix}
y_1\\ 
y_2\\ 
\vdots \\ 
y_m\\
y_{m+1}\\
\vdots\\
y_n
\end{bmatrix}\big) = \mathbf V \begin{bmatrix}
y_1\\ 
y_2\\ 
\vdots \\ 
y_m\\
0\\
\vdots\\
0
\end{bmatrix} = \sum_{k=1}^m y_k \mathbf v_k = \mathbf x_{\text{optimal}}$  





That is, we know there is only one $\mathbf x$ that will solve any given $\mathbf {Ax} = \mathbf b \neq \mathbf 0$, so long as we confine ourselves to writing $\mathbf x$ as a linear combination of $\mathbf v_k$'s for $1 \leq k \leq m$.  




# alternative approach using Lagrange multipliers

to derive the minimum cost (L2 norm) solution

- - - -

In this case, let 

$\mathbf d  =  \begin{bmatrix}
\lambda_1\\ 
\lambda_2\\ 
\vdots \\ 
\lambda_m
\end{bmatrix}
$

i.e. $\mathbf d$ is a vector containing the Lagrange multipliers

we setup the Lagrangian we want to minimize as

$L(\mathbf x) = \mathbf x^T \mathbf x + \mathbf d^T  \big(\mathbf{Ax} - \mathbf b \big)$

$\nabla_{\mathbf x} = 2 \mathbf x + \mathbf A^T \mathbf d := \mathbf 0$

Note: check dimensions to see why it is not $\mathbf d^T \mathbf A$

$\nabla_{\mathbf d } = \mathbf{Ax} - \mathbf b := \mathbf 0$

solving $\nabla_{\mathbf x}$ first, we see:

$\mathbf x = \frac{-1}{2} \mathbf{A}^T \mathbf d$

now substitute into the second equation $\nabla_{\mathbf d }$, we see

$\mathbf{A}\big(\frac{-1}{2} \mathbf{A}^T \mathbf d \big) - \mathbf b := \mathbf 0$

hence $\mathbf d = -2\big(\mathbf{AA}^T\big)^{-1} \mathbf b$

and here we plug this back into $\nabla_{\mathbf x}$  

$\mathbf 0 = 2 \mathbf x + \mathbf A^T \mathbf d  = 2 \mathbf x + \mathbf A^T \Big(-2\big(\mathbf{AA}^T\big)^{-1} \mathbf b\Big)$


$-2 \mathbf x = -2\mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b$ 

or 

$\mathbf x = \mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b $
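The two stationarity conditions can also be solved as one linear system, which gives a numerical check on the closed form. The block matrix `K` below is just those two gradient equations stacked; the random data and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Stack the conditions 2x + A^T d = 0 and Ax = b into one (n + m) x (n + m) system.
K = np.block([[2.0 * np.eye(n), A.T],
              [A, np.zeros((m, m))]])
rhs = np.concatenate([np.zeros(n), b])

sol = np.linalg.solve(K, rhs)
x_kkt, d_kkt = sol[:n], sol[n:]

# Closed forms from the derivation.
x_closed = A.T @ np.linalg.solve(A @ A.T, b)
d_closed = -2.0 * np.linalg.solve(A @ A.T, b)

print(np.allclose(x_kkt, x_closed))
print(np.allclose(d_kkt, d_closed))
```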

and of course, if we used the SVD on $\mathbf A$, we'd see

$\mathbf x = \mathbf{V \Sigma}^T \mathbf U^T \big(\mathbf{U \Sigma \Sigma }^T \mathbf U^T\big)^{-1} \mathbf b = \mathbf{V \Sigma}^T \mathbf U^T \mathbf{U \big(\Sigma \Sigma }^T\big)^{-1} \mathbf U^T \mathbf b = \mathbf{V \Sigma}^T \big(\mathbf{\Sigma \Sigma }^T\big)^{-1} \mathbf U^T \mathbf b = \mathbf{V \Sigma}^{-1} \mathbf U^T \mathbf b $

where we recover the equation from the earlier approach, and as before 

where $\mathbf \Sigma^{-1} = \begin{bmatrix}
\frac{1}{\sigma_1} & 0 &0  &0 &  ... &0 \\ 
0 & \frac{1}{\sigma_2}& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \frac{1}{\sigma_m}& ... &0  
\end{bmatrix}^T$, 

which is to say that $\mathbf \Sigma^{-1}$ is the right inverse of the $\mathbf \Sigma$ matrix

- - - - 

it is also, of course, easy to verify that the proposed solution is valid (ignoring cost) without knowing anything about Lagrange Multipliers.  I.e. we can verify 

$\mathbf A\Big(\mathbf x\Big) = \mathbf A\Big(\mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b\Big) = \big(\mathbf A\mathbf A^T\big) \big(\mathbf{AA}^T\big)^{-1} \mathbf b = \mathbf b$    

and hence this choice of $\mathbf x$ is valid  
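A tiny worked example may help here; the specific numbers are made up for illustration. Take $m = 1$, $n = 2$ with

$\mathbf A = \begin{bmatrix} 1 & 2 \end{bmatrix}, \qquad \mathbf b = \begin{bmatrix} 5 \end{bmatrix}$

then $\mathbf{AA}^T = 5$ and

$\mathbf x = \mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \frac{1}{5} \cdot 5 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$

so $\mathbf{Ax} = 1 + 4 = 5 = \mathbf b$ and $\big \Vert \mathbf x \big \Vert_2^2 = 5$.  For comparison, another valid solution is $\begin{bmatrix} 5 & 0 \end{bmatrix}^T$, which has squared length $25$.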



note that converting the final equation to the SVD equivalent one may be easier to interpret with the decomposition

$\mathbf A = \mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T$  

in which case we'd have 


$\mathbf x = \mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b = \big(\mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T\big)^T \Big(\mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T \big(\mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T\big)^T\Big)^{-1}\mathbf b $    

$= \big( \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}} \mathbf U^T\big) \Big(\mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}}\mathbf U^T \Big)^{-1}\mathbf b = \big( \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}} \mathbf U^T\big) \Big(\mathbf U \mathbf \Sigma_{\text{square}}^2 \mathbf U^T \Big)^{-1}\mathbf b $    

$ = \big( \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}} \mathbf U^T\big)\big(\mathbf U \mathbf \Sigma_{\text{square}}^{-2} \mathbf U^T\big) \mathbf b  = \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}}^{-1} \mathbf U^T \mathbf b $  


and a quick associativity check tells us 

$\mathbf{V \Sigma}^{-1} = \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}}^{-1}$   

hence this solution agrees with the earlier one 
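The associativity check can be done numerically as well. In this sketch both forms are built from a single full SVD of an assumed random `A`, with $\mathbf V_{\text{non-square}}$ taken as the first `m` columns of $\mathbf V$.

```python
import numpy as np

rng = np.random.default_rng(11)
m, n = 3, 6
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=True)
V = Vt.T                              # n x n
V_ns = V[:, :m]                       # n x m: first m columns of V

Sigma_inv = np.zeros((n, m))
Sigma_inv[:m, :m] = np.diag(1.0 / s)  # n x m right inverse of Sigma

lhs = V @ Sigma_inv                   # V Sigma^{-1}
rhs = V_ns @ np.diag(1.0 / s)         # V_non-square Sigma_square^{-1}

print(np.allclose(lhs, rhs))
```

The zero rows of $\mathbf \Sigma^{-1}$ simply discard the columns $\mathbf v_{m+1}, ..., \mathbf v_n$, which is why the two products coincide.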




