This posting has two different looks at how we can minimize the length of some vector $\mathbf x$ in an underdetermined (but full row rank) system of equations $\mathbf{Ax} = \mathbf b$.

The underlying scalars are in $\mathbb R$.


Consider the equation $\mathbf {A x} = \mathbf b$, where we know $\mathbf A$ and $\mathbf b$, and need to solve for $\mathbf x$. For avoidance of doubt $\mathbf b \neq \mathbf 0$.  $\mathbf A$ is an m x n matrix.

In the case where $\mathbf A$ is tall and skinny (with noise in the data, this should equate to an overdetermined system of equations with full column rank), we can use ordinary least squares to solve for the $\mathbf x$ that minimizes the squared length (2 norm) of $\mathbf v = \mathbf {Ax} - \mathbf b$; that is, we are minimizing $\mathbf v^T \mathbf v$.  We may use the Normal Equations, or QR factorization, or many other tools at our disposal.  

If $\mathbf A$ is square and of full rank, then we can directly invert $\mathbf A$, or use Gaussian elimination or whatever tool we want. 

Now consider the case where $\mathbf A$ has more columns than rows -- i.e. n > m -- (and again there is noise in the data, so we have full row rank).  This means that there are *many* solutions to $\mathbf {A x} = \mathbf b$, because $\mathbf A$ has a non-trivial nullspace.  In this case, we first will want to question why we have this situation, and perhaps gather more data. If we still want to 'solve' this equation, what form might the solution take?  We have many solutions at our disposal, so perhaps one that minimizes the length (2 norm) of $\mathbf x$ is the one we want. 

For avoidance of doubt we have $\mathbf A \in \mathbb R^{\text{m x n}}$ where $m \lt n$.  This also means that $\mathbf x \in \mathbb R^{\text{n}}$ and $\mathbf b\in \mathbb R^{\text{m}}$.  This means we need at most $m$ linearly independent vectors to generate *any* given $\mathbf b$.  Suppose we choose said linearly independent vectors to be mutually orthonormal, and we purge any vectors not in that set from the solution.  In such a case we would have a solution $\mathbf x$ satisfying the equation $\mathbf{Ax} = \mathbf b$.  We'd also clearly have a smaller (L2 norm) solution than any one using the above solution plus additional mutually orthonormal vectors, which must in some sense be in the nullspace... 
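
As a small worked illustration of this idea (the specific matrix below is an assumption chosen only for illustration; it is not used elsewhere in this posting), take $m = 1$, $n = 2$ with 

$\mathbf A = \begin{bmatrix} 1 & 1 \end{bmatrix}, \qquad \mathbf b = \begin{bmatrix} 2 \end{bmatrix}$

Every solution can be written as 

$\mathbf x(t) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} + t \begin{bmatrix} 1 \\ -1 \end{bmatrix}$

where the second vector spans the nullspace of $\mathbf A$ and is orthogonal to the first.  The squared length is 

$\big \Vert \mathbf x(t) \big \Vert_2^2 = (1+t)^2 + (1-t)^2 = 2 + 2t^2$

which is minimized at $t = 0$.  So the shortest solution is the one with no nullspace component at all.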

There are basically two approaches to solving this.  

First the algebraic one.

$\mathbf {A x} =\big( \mathbf {U \Sigma V}^T\big)\mathbf x = \mathbf b$, using the Singular Value Decomposition, where $\mathbf U$ (m x m) and $\mathbf V$ (n x n) are both full rank, square, orthogonal matrices, but because $\mathbf A$ is not square, $\mathbf \Sigma$ is a (rectangular) diagonal matrix that has more columns than rows.  

That is $\mathbf A$ is an m x n matrix with rank m (meaning that each singular value > 0)

$\mathbf A =
\bigg[\begin{array}{c|c|c|c}
\mathbf u_1 & \mathbf u_2 &\cdots & \mathbf u_{m}
\end{array}\bigg] \begin{bmatrix}
\sigma_1 & 0 &0  &0 & ... &0 \\ 
0 & \sigma_2& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \sigma_m& ... &0  
\end{bmatrix} \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T
$   

if we left multiply both sides of $\mathbf {A x} $ by $\mathbf U^T$, we get 

$\mathbf {\Sigma V}^T \mathbf x = \begin{bmatrix}
\sigma_1 & 0 &0  &0 & ... &0 \\ 
0 & \sigma_2& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \sigma_m& ... &0  
\end{bmatrix} \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x = \mathbf U^T \mathbf b$

now with an **abuse of notation**, consider left multiplying by $\mathbf \Sigma^{-1}$:

where $\mathbf \Sigma^{-1} = \begin{bmatrix}
\frac{1}{\sigma_1} & 0 &0  &0 &  ... &0 \\ 
0 & \frac{1}{\sigma_2}& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \frac{1}{\sigma_m}& ... &0  
\end{bmatrix}^T$ 

Thus it is not technically an inverse or a left inverse... $\mathbf \Sigma^{-1}$ **is actually a right inverse** but we ultimately are multiplying on the left because that is all we can do here -- hence this is an abuse of notation.

$\mathbf {(D)V}^T \mathbf x = 
\Big(\begin{bmatrix}
\frac{1}{\sigma_1} & 0 &0  &0 &  ... &0 \\ 
0 & \frac{1}{\sigma_2}& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \frac{1}{\sigma_m}& ... &0  
\end{bmatrix}^T \begin{bmatrix}
\sigma_1 & 0 &0  &0 & ... &0 \\ 
0 & \sigma_2& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \sigma_m& ... &0  
\end{bmatrix}\Big) \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x =  \mathbf \Sigma^{-1}\mathbf U^T \mathbf b$


$\mathbf {(D)V}^T \mathbf x = \begin{bmatrix}
1 & 0 &0  &0 & \mathbf 0^T \\ 
0 & 1 & 0 & 0& \mathbf 0^T\\ 
0 & 0 &  \ddots & 0& \mathbf 0^T \\ 
0 & 0 & 0 & 1 & \mathbf 0^T  \\
\mathbf 0 & \mathbf 0 & \mathbf 0 & \mathbf 0 & \mathbf 0\mathbf 0^T  \\ 
\end{bmatrix} \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x = \begin{bmatrix}
\mathbf I & \mathbf{00}^T \\ 
\mathbf {00}^T & \mathbf {00}^T  \\ 
\end{bmatrix} \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x = \mathbf \Sigma^{-1}\mathbf U^T \mathbf b$

Which is to say that the diagonal of $\mathbf D$ is the spectrum of an idempotent matrix (i.e. a projection matrix's eigenvalues, each either 0 or 1).  (Note that to deal with notational overload, $\mathbf {0}$ is to be the appropriately sized zero vector, and $\mathbf {00}^T$ is the appropriately sized zero matrix.)  

From here multiply both sides by $\mathbf V$, and we get 

$\mathbf {VDV}^T \mathbf x =  \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg] \begin{bmatrix}
\mathbf I & \mathbf{00}^T \\ 
\mathbf {00}^T & \mathbf {00}^T  \\ 
\end{bmatrix}\bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]^T \mathbf x= \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$


note that 

$\mathbf P = \mathbf {VDV}^T$  

obeys 

$\mathbf P^2 = \mathbf P$  

hence $\mathbf P$ is idempotent, and indeed it is orthogonally diagonalizable -- i.e. it is a projection matrix.  (Note: there are some different conventions -- some texts refer to all idempotent matrices as projection matrices, while others only refer to real symmetric -- or Hermitian -- ones such as this as projection matrices.)  
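
The claims about $\mathbf P$ can be checked numerically.  What follows is a minimal sketch, assuming NumPy and a small 2 x 3 example matrix that is made up here purely for illustration; the variable names are likewise hypothetical.

```python
import numpy as np

# Illustrative full-row-rank matrix (an assumption, not from the text above).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
m, n = A.shape

# Full SVD: U is m x m, Vt is n x n, s holds the m singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
V = Vt.T

# D has m ones followed by n - m zeros on its diagonal.
D = np.diag(np.concatenate([np.ones(m), np.zeros(n - m)]))
P = V @ D @ V.T

print(np.allclose(P @ P, P))        # idempotent: P^2 = P
print(np.allclose(P, P.T))          # symmetric
print(np.isclose(np.trace(P), m))   # trace equals the number of unit eigenvalues
```

All three checks should print `True` for any full row rank $\mathbf A$, up to floating point tolerance.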

Note that if $\mathbf A$ were square and full rank, we would have $\mathbf D = \mathbf I$, and $\mathbf \Sigma^{-1}$ would be an actual inverse, not an abuse of notation (right inverse in this case), and hence we would have solved our equation.  

That is, if $\mathbf A$ were square and full rank, we would have had:


$\mathbf {VIV}^T \mathbf x = \mathbf {VV}^T \mathbf x = \big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_n v_n}^T\big) \mathbf x = \mathbf {I} \mathbf x = \mathbf x = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$

but instead what we have is

$\mathbf P \mathbf x =  \Big(\big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_m v_m}^T\big) + \big(0\mathbf{v_{m+1} v_{m+1}}^T + 0 \mathbf{v_{m+2} v_{m+2}}^T + ... + 0\mathbf{v_n v_n}^T\big)\Big) \mathbf x = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$

or more simply 

$\mathbf {Px}= \big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_m v_m}^T\big) \mathbf x =  \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b $

Now recall that $\mathbf V = \bigg[\begin{array}{c|c|c|c}
\mathbf v_1 & \mathbf v_2 &\cdots & \mathbf v_{n}
\end{array}\bigg]$, which is an n x n orthogonal matrix -- that is $\mathbf V$ can be thought of as a coordinate system.  Thus our solution vector $\mathbf x$ can be a linear combination of the columns of $\mathbf V$.  We can write this as 

$\mathbf x = \mathbf{Vy} = y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m +  y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}$  

and recalling that since $\mathbf V$ is orthogonal, it is length (2 norm) preserving, so:  

$\big \vert\big \vert \mathbf x \big \vert \big \vert_2^{2} =  \big \vert\big \vert \mathbf{Vy} \big \vert\big \vert_2^{2} = \big \vert \big \vert \mathbf y \big \vert\big \vert_2^{2} = \mathbf y^T \mathbf y = y_1^2 + y_2^2 + ... + y_m^2 + y_{m+1}^2 + ... + y_n^2$

we substitute in and get 

$\big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_m v_m}^T\big) \big(y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m +  y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}\big) = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$

which by the orthogonality of the columns in $\mathbf V$ gives us:

$\mathbf{Px} = y_1  \mathbf{v_1} + y_2\mathbf v_2 +... + y_m  \mathbf{v_m} = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b $

From here we notice that any $y_k$, for $k \gt m$ contributes to the length of $\mathbf x$ but does not contribute to the solution of the problem (i.e. they are in the null space).  

Thus the minimal length solution to the underdetermined $\mathbf {Ax} = \mathbf b$ comes in the form of a solution to the equation that has $\mathbf x $ written purely as a linear combination of $\{\mathbf v_1, \mathbf v_2, ..., \mathbf v_m \}$
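
The algebra above can be sketched in a few lines of NumPy.  This is only an illustration under stated assumptions: the 2 x 3 matrix and right hand side are made up, and names such as `Sigma_right_inv` are hypothetical labels for the $\mathbf \Sigma^{-1}$ used above.

```python
import numpy as np

# Illustrative data (an assumption, not from the text above).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
m, n = A.shape

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# The "Sigma^{-1}" of the text: an n x m matrix with 1/sigma_k on its diagonal.
Sigma_right_inv = np.zeros((n, m))
Sigma_right_inv[:m, :m] = np.diag(1.0 / s)

# x = V Sigma^{-1} U^T b
x_min = Vt.T @ Sigma_right_inv @ U.T @ b

# Coordinates of x_min in the V basis; entries past index m should vanish.
y = Vt @ x_min

print(x_min)
print(np.allclose(A @ x_min, b))                   # it solves the system
print(np.allclose(y[m:], 0.0))                     # no nullspace component
print(np.allclose(x_min, np.linalg.pinv(A) @ b))   # agrees with the pseudoinverse
```

For this particular example the result is approximately $(0, 1, 1)$, which can be confirmed by hand.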

The way to interpret this, then, is we solve for any legal $\mathbf x$ that is a valid solution, and then project such a solution down to a subspace that only is spanned by $m$ mutually orthonormal vectors ($\mathbf v_1, ..., \mathbf v_m$)  

# the wording and focus is running here and needs cleaned up

hence the minimal solution is to take any $\mathbf x_k$ that satisfies $\mathbf A \mathbf x_k = \mathbf b$ and left multiply said $\mathbf x_k$ by $\mathbf P$ 

that is we have 

$\big(\mathbf \Sigma \mathbf V^T\big)\mathbf x_k = \mathbf U^T \mathbf b$  

$\mathbf {Px}_k = \mathbf {VDV}^T \mathbf x_k = \mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b$   

and for completeness we may verify that the above solution is in fact a solution, i.e. that 

$\mathbf A\Big(\mathbf {Px}_k\Big) = \mathbf A\Big(\mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b\Big) = \mathbf U \mathbf \Sigma \mathbf V^T  \Big(\mathbf {V\Sigma}^{-1}\mathbf U^T \mathbf b\Big) = \mathbf b$ 

(since $\mathbf \Sigma^{-1}$ is in fact a right inverse, $\mathbf \Sigma \mathbf \Sigma^{-1} = \mathbf I$; we also have $\mathbf V^T \mathbf V = \mathbf I$ and $\mathbf U \mathbf U^T = \mathbf I$ because both are orthogonal matrices.)  
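
To make the "take any solution and project it" recipe concrete, here is a small sketch, assuming NumPy and an illustrative 2 x 3 system whose particular solution `x_k` was picked by hand (all of these are assumptions for the example).

```python
import numpy as np

# Illustrative data (an assumption, not from the text above).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
m, n = A.shape

U, s, Vt = np.linalg.svd(A, full_matrices=True)
V = Vt.T

# P = v_1 v_1^T + ... + v_m v_m^T, built from the first m columns of V.
P = V[:, :m] @ V[:, :m].T

# A hand-picked solution that is valid but not of minimal length.
x_k = np.array([1.0, 2.0, 0.0])
x_min = P @ x_k

print(np.allclose(A @ x_k, b), np.allclose(A @ x_min, b))   # both solve A x = b
print(np.linalg.norm(x_k), np.linalg.norm(x_min))           # the projection is shorter
```

Here `x_k` has squared length 5 while the projected solution has squared length 2, and both satisfy the original equation.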

**a nice insight**   
suppose, for instance we solve a Linear Program and compute 

$\mathbf x_{\text{L1 norm minimized}}$  

if we project it down to the subspace of $\{\mathbf v_1, \mathbf v_2, ...., \mathbf v_m\}$ using, of course, our projector $\mathbf P$, to do so, we see 


$\mathbf P \mathbf x_{\text{L1 norm minimized}} = \mathbf x_{\text{L2 norm minimized}} = \mathbf x$  

of course $\mathbf P$ is not invertible (unless $\mathbf A$ is square) so this is a one way relation, but nevertheless interesting.  

- - - - - 

**alternative takes:**  

1.)  
some math texts, and some computational solvers (e.g. Julia's), will do SVD for short fat matrices in the form of 

$\mathbf A = \mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T$   

$\mathbf V^T$ is $m$ x $n$, or equivalently, $\mathbf V$ is $n$ x $m$, hence it has mutually orthonormal vectors of dimension $n$, but only has $m$ of them and hence is not an orthogonal matrix, just a matrix with mutually orthonormal columns -- but not enough to form a basis for $\mathbb R^n$.  


To verify that the end results are unchanged, we can see 


$\mathbf A = \mathbf U \big(\mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T\big) = \mathbf U \big(\mathbf \Sigma\mathbf V^T\big)$  

where $\mathbf \Sigma_{\text{square}}$ is $m$ x $m$ 



because 

$\mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T = \mathbf \Sigma\mathbf V^T = 
\begin{bmatrix}
\sigma_1 \mathbf v_1^T \\
 \sigma_2 \mathbf v_2^T \\ 
\vdots\\ 
\sigma_{m-1} \mathbf v_{m-1}^T \\ 
\sigma_m \mathbf v_m^T
\end{bmatrix}$ 

and as before 

$\mathbf P = \big(\mathbf{v_1 v_1}^T + \mathbf{v_2 v_2}^T +... + \mathbf{v_m v_m}^T\big) = \mathbf V_{\text{non-square}} \mathbf V_{\text{non-square}}^T$  
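
A quick way to see that both SVD conventions give the same projector is to compute it both ways.  The sketch below assumes NumPy, where `np.linalg.svd(A, full_matrices=False)` returns the reduced form with a square $\mathbf \Sigma$ and a non-square $\mathbf V^T$; the example matrix is made up for illustration.

```python
import numpy as np

# Illustrative matrix (an assumption, not from the text above).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
m, n = A.shape

# Reduced ("thin") SVD: Vt_thin is m x n, s has m entries.
U, s, Vt_thin = np.linalg.svd(A, full_matrices=False)
V_thin = Vt_thin.T                        # n x m, orthonormal columns

# Full SVD for comparison: Vt_full is n x n.
_, _, Vt_full = np.linalg.svd(A, full_matrices=True)
D = np.diag(np.concatenate([np.ones(m), np.zeros(n - m)]))

P_thin = V_thin @ V_thin.T
P_full = Vt_full.T @ D @ Vt_full

print(Vt_thin.shape)                                   # (m, n)
print(np.allclose(U @ np.diag(s) @ Vt_thin, A))        # A is recovered
print(np.allclose(V_thin.T @ V_thin, np.eye(m)))       # columns are orthonormal
print(np.allclose(P_thin, P_full))                     # same projector either way
```

Note that `V_thin.T @ V_thin` is the m x m identity, while `V_thin @ V_thin.T` is the n x n projector $\mathbf P$ and not an identity matrix.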

2.) *An alternative computationally efficient approach:*  

While the SVD approach allows us to compute $\mathbf P$ and get insights into the geometry of the minimum cost (L2 norm) solution to our system of equations, what the below section (using Lagrange multipliers) shows is that the solution is given by 

$\mathbf x = \mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b$ 

as with ordinary least squares, we generally don't want to actually compute $\big(\mathbf{AA}^T\big)$ for cost and numeric stability reasons.  But the reader may also recall that computing the SVD is the most expensive of the typical matrix factorizations.  The nice middle ground here (much like in least squares) is to use QR factorization.  An outline of the computational approach using QR factorization is shown below.  


$\mathbf A^T = \mathbf {QR}$

where $\mathbf Q$ is tall and skinny (n x m, with orthonormal columns) and $\mathbf R$ is a square (m x m) upper triangular matrix.  Since $\mathbf A$ has full row rank, $\mathbf A^T$ has full column rank, which means $\mathbf R$ has no zeros along its diagonal, which means $\mathbf R^{-1}$ exists.  

If you did this, you'd get 

$\mathbf x = \mathbf A^T\big(\mathbf {AA}^T\big)^{-1}\mathbf b =\big(\mathbf {QR}\big)\Big(\big(\mathbf Q \mathbf R\big)^T\big(\mathbf Q\mathbf R\big)\Big)^{-1}\mathbf b = \mathbf {QR}\Big(\big(\mathbf R^T \mathbf Q^T\big)\big(\mathbf Q\mathbf R\big)\Big)^{-1}\mathbf b $ 

$ = \mathbf {QR}\big(\mathbf R^T\mathbf R \big)^{-1}\mathbf b = \mathbf {QR}\big(\mathbf R \big)^{-1}\big(\mathbf R^T\big)^{-1}\mathbf b = \mathbf Q\big(\mathbf R^T\big)^{-1}\mathbf b$  

hence we have 

$\mathbf x = \mathbf x_{\text{L2 norm minimized}} = \mathbf Q\big(\mathbf R^T\big)^{-1}\mathbf b$ 

which is to say that the solution is equivalent to running QR factorization on $\mathbf A^T$  

and then solving the lower triangular system of equations for $\mathbf y$ in 

$\mathbf R^T \mathbf y = \mathbf b$ 

and then after solving for $\mathbf y$, left multiplying by $\mathbf Q$


$\mathbf x = \mathbf Q\mathbf y =\mathbf Q\Big(\big(\mathbf R^T\big)^{-1}\mathbf b\Big)$   
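
The QR recipe above can be sketched as follows.  This assumes NumPy's `np.linalg.qr` (whose default mode returns the reduced factorization); the helper `forward_substitution` and the example data are hypothetical and written here only for illustration.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for a lower triangular L with a nonzero diagonal."""
    y = np.zeros(len(b), dtype=float)
    for i in range(len(b)):
        # Subtract the contribution of the already-solved entries, then divide.
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

# Illustrative data (an assumption, not from the text above).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

# Factor A^T = QR: Q is n x m with orthonormal columns, R is m x m upper triangular.
Q, R = np.linalg.qr(A.T)

# Solve the lower triangular system R^T y = b, then map back with Q.
y = forward_substitution(R.T, b)
x_min = Q @ y

print(x_min)
print(np.allclose(A @ x_min, b))
print(np.allclose(x_min, A.T @ np.linalg.solve(A @ A.T, b)))
```

The last line compares against the closed form $\mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b$, which is formed here only as a check; the point of the QR route is to avoid forming $\mathbf{AA}^T$ in practice.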





note: remember that $\mathbf A$ is assumed to be full row rank, which means it has column rank of $m$.  We want to have a linear combination of $\mathbf A$'s columns such that said combination is equal to $\mathbf b\in \mathbb R^{\text{m}}$.  That is, we know we can cleverly select from $\mathbf A$'s columns to form a basis.  The goal, then, is to select said basis in such a way that we can isolate the vectors which are in the (right) nullspace of $\mathbf A$ and remove them from our solution.  We use orthogonality to get a 'clean look' at the various vectors being used to construct a solution and the ones that only contribute to the length of $\mathbf x$ but do not contribute to the actual solution in $\mathbf b$.  Given that we are using orthogonality for a non square matrix, a very natural question is: what are the implications of using SVD on $\mathbf A$ in order to (attempt to) solve this problem?  

- - - - 
note each row vector in $\mathbf V^T $ must be $n$ dimensional, like $\mathbf x$.  In the square $\mathbf V$ case there are $n$ mutually orthonormal $\mathbf v_k$'s and our matrix is orthogonal and forms a nice set of coordinates to write $\mathbf x$ in.  

$\mathbf A = \mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T$  

$\mathbf x = \mathbf{Vy} = y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m +  y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}$

$\mathbf A\mathbf x = \mathbf A \mathbf {Vy} = \mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T\big(y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m +  y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}\big)$  

note that $\big(\mathbf U \mathbf \Sigma_{\text{square}}\big)$ is invertible, so we may assume WLOG that we are solving for $\mathbf c:= \mathbf \Sigma_{\text{square}}^{-1} \mathbf U^T \mathbf b \neq \mathbf 0$ 

where $\mathbf c$ has the same dimensions as $\mathbf b$  

hence the problem is to solve for an $\mathbf x \neq \mathbf 0$ 

such that 

$\mathbf V_{\text{non-square}}^T \mathbf x = \mathbf c_{\in \mathbb R^m}$  


now, we know 

$\mathbf V_{\text{non-square}}^T\big( y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}\big) = \mathbf A\big( y_{m+1}\mathbf v_{m+1} + ... + y_{n}\mathbf v_{n}\big) = \mathbf 0_{\in \mathbb R^m} $  

because  

$\mathbf V_{\text{non-square}}^T \mathbf v_j = \mathbf 0$  for $j \gt m$   

but we also know 

$\mathbf V_{\text{non-square}}^T\big( y_{1}\mathbf v_{1} + y_{2}\mathbf v_{2} + ... + y_{m}\mathbf v_{m}\big) =  y_{1}\mathbf e_{1\in \mathbb R^m} + y_{2}\mathbf e_{2\in \mathbb R^m} + ... + y_{m}\mathbf e_{m\in \mathbb R^m} = \begin{bmatrix}
y_1\\ 
y_2\\ 
\vdots \\ 
y_m\\
\end{bmatrix} = \begin{bmatrix}
c_1\\ 
c_2\\ 
\vdots \\ 
c_m\\
\end{bmatrix} = \mathbf c $  


hence each $y_k$ for $1 \leq k \leq m$ is uniquely specified by $c_k$.  

equivalently, we know, that for any given $\mathbf c$, the $\mathbf x$ must at least include the linear combination of 

$\mathbf x = \mathbf{Vy} = y_1\mathbf v_1 + y_2\mathbf v_2 + ... + y_m\mathbf v_m  =  c_1\mathbf v_1 + c_2\mathbf v_2 + ... + c_m\mathbf v_m = \sum_{k=1}^m c_k \mathbf v_k = \mathbf V_{\text{non-square}}\mathbf{c}$

If any proposed solution does not include the above linear combination $\mathbf V_{\text{non-square}}\mathbf{c}$ for a given $\mathbf c \neq \mathbf 0$, then it is not an accurate solution.  This solution has length (squared 2 norm) given by $\big \Vert \mathbf V_{\text{non-square}}\mathbf{c}\big \Vert_2^2 = \big \Vert \mathbf{c}\big \Vert_2^2 = \sum_{k=1}^m c_k^2 = \sum_{k=1}^m y_k^2 $.  Taking advantage of orthogonality, we can see that including any additional nonzero $y_j$ for $m \lt j \leq n$ does not change the configuration of the $y_k$ but does necessarily increase the length of the solution (by positive definiteness of the 2 norm).   Hence we deem $\mathbf V_{\text{non-square}}\mathbf{c}$ as necessarily the valid and minimal length solution.  
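
The length argument can be illustrated numerically.  In the sketch below (NumPy assumed, example data made up for illustration, variable names hypothetical), we form $\mathbf c$ and $\mathbf V_{\text{non-square}}\mathbf c$ as above and then add multiples $t$ of a nullspace direction $\mathbf v_{m+1}$.

```python
import numpy as np

# Illustrative data (an assumption, not from the text above).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
m, n = A.shape

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# c = Sigma_square^{-1} U^T b, and the candidate solution V_nonsquare c.
c = (U.T @ b) / s
x_min = Vt[:m].T @ c

# v_{m+1}: a unit vector in the nullspace of A.
v_null = Vt[m]

sq_norms = []
for t in (0.0, 0.5, 1.0, 2.0):
    x = x_min + t * v_null
    assert np.allclose(A @ x, b)      # still a valid solution
    sq_norms.append(x @ x)            # squared length is ||c||^2 + t^2

print(np.isclose(x_min @ x_min, c @ c))
print(sq_norms)
```

Every choice of $t$ still solves the system, but the squared length grows as $\Vert \mathbf c \Vert_2^2 + t^2$, so $t = 0$ is the shortest.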

- - - - 

**delete the below?** 

This gives a nice interpretation to our projection matrix $\mathbf P$.  We have 


$\mathbf P\mathbf x^{(0)} =  \mathbf V \begin{bmatrix}
\mathbf I_m & \mathbf{00}^T \\ 
\mathbf {00}^T & \mathbf {00}^T  \\ 
\end{bmatrix} \mathbf V^T \mathbf x^{(0)} = \mathbf V \big(\mathbf {DV}^T \mathbf x^{(0)}\big) = \mathbf V \big(\mathbf D\mathbf y^{(0)}\big) = \mathbf V \big(\begin{bmatrix}
\mathbf I_m & \mathbf{00}^T \\ 
\mathbf {00}^T & \mathbf {00}^T  \\ 
\end{bmatrix}\begin{bmatrix}
y_1\\ 
y_2\\ 
\vdots \\ 
y_m\\
y_{m+1}\\
\vdots\\
y_n
\end{bmatrix}\big) = \mathbf V \begin{bmatrix}
y_1\\ 
y_2\\ 
\vdots \\ 
y_m\\
0\\
\vdots\\
0
\end{bmatrix} = \sum_{k=1}^m y_k \mathbf v_k = \mathbf x_{\text{optimal}}$  





That is, we know there is only one $\mathbf x$ that will solve any given $\mathbf {Ax} = \mathbf b \neq \mathbf 0$, so long as we confine ourselves to writing $\mathbf x$ as a linear combination of $\mathbf v_k$'s for $1 \leq k \leq m$.  




# alternative approach using Lagrange multipliers

to derive the minimum cost (L2 norm) solution

- - - -

In this case, let 

$\mathbf d  =  \begin{bmatrix}
\lambda_1\\ 
\lambda_2\\ 
\vdots \\ 
\lambda_m
\end{bmatrix}
$

i.e. $\mathbf d$ is a vector containing the Lagrange multipliers

we set up the Lagrangian we want to minimize as

$L(\mathbf x, \mathbf d) = \mathbf x^T \mathbf x + \mathbf d^T  \big(\mathbf{Ax} - \mathbf b \big)$

$\nabla_{\mathbf x} = 2 \mathbf x + \mathbf A^T \mathbf d := \mathbf 0$

Note: check dimensions to see why it is not $\mathbf d^T \mathbf A$

$\nabla_{\mathbf d } = \mathbf{Ax} - \mathbf b := \mathbf 0$

solving $\nabla_{\mathbf x}$ first, we see:

$\mathbf x = \frac{-1}{2} \mathbf{A}^T \mathbf d$

now substituting into the second equation, $\nabla_{\mathbf d }$, we see

$\mathbf{A}\big(\frac{-1}{2} \mathbf{A}^T \mathbf d \big) - \mathbf b := \mathbf 0$

hence $\mathbf d = -2\big(\mathbf{AA}^T\big)^{-1} \mathbf b$

and here we plug this back into $\nabla_{\mathbf x}$  

$\mathbf 0 = 2 \mathbf x + \mathbf A^T \mathbf d  = 2 \mathbf x + \mathbf A^T \Big(-2\big(\mathbf{AA}^T\big)^{-1} \mathbf b\Big)$


$-2 \mathbf x = -2\mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b$ 

or 

$\mathbf x = \mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b $
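
As a sanity check on the derivation, the two stationarity conditions can be stacked into one linear system in $(\mathbf x, \mathbf d)$ and solved directly.  The sketch below assumes NumPy and a made-up 2 x 3 example; the block matrix `K` is just a hypothetical name for the stacked system.

```python
import numpy as np

# Illustrative data (an assumption, not from the text above).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
m, n = A.shape

# Stack  2x + A^T d = 0  and  A x = b  into a single (n + m) x (n + m) system.
K = np.block([[2.0 * np.eye(n), A.T],
              [A, np.zeros((m, m))]])
rhs = np.concatenate([np.zeros(n), b])

sol = np.linalg.solve(K, rhs)
x_stacked, d = sol[:n], sol[n:]

# Closed form from the derivation: x = A^T (A A^T)^{-1} b, d = -2 (A A^T)^{-1} b.
w = np.linalg.solve(A @ A.T, b)
x_closed = A.T @ w

print(np.allclose(x_stacked, x_closed))
print(np.allclose(d, -2.0 * w))
print(np.allclose(A @ x_closed, b))
```

Both routes should agree, and the multipliers recovered from the stacked system match $\mathbf d = -2\big(\mathbf{AA}^T\big)^{-1} \mathbf b$.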

and of course, if we used the SVD on $\mathbf A$, we'd see

$\mathbf x = \mathbf{V \Sigma}^T \mathbf U^T \big(\mathbf{U \Sigma \Sigma }^T \mathbf U^T\big)^{-1} \mathbf b = \mathbf{V \Sigma}^T \mathbf U^T \mathbf{U} \big(\mathbf{\Sigma \Sigma }^T\big)^{-1} \mathbf U^T \mathbf b = \mathbf{V \Sigma}^T \big(\mathbf{\Sigma \Sigma }^T\big)^{-1} \mathbf U^T \mathbf b = \mathbf{V \Sigma}^{-1} \mathbf U^T \mathbf b $

where we recover the equation from the earlier approach, and as before 

where $\mathbf \Sigma^{-1} = \begin{bmatrix}
\frac{1}{\sigma_1} & 0 &0  &0 &  ... &0 \\ 
0 & \frac{1}{\sigma_2}& 0 & 0& ... &0\\ 
0 & 0 &  \ddots & 0& ... &0 \\ 
0 & 0 & 0 & \frac{1}{\sigma_m}& ... &0  
\end{bmatrix}^T$, 

which is to say that $\mathbf \Sigma^{-1}$ is the right inverse of the $\mathbf \Sigma$ matrix

- - - - 

it is also, of course, easy to verify that the proposed solution is valid (ignoring cost) without knowing anything about Lagrange Multipliers.  I.e. we can verify 

$\mathbf A\Big(\mathbf x\Big) = \mathbf A\Big(\mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b\Big) = \big(\mathbf A\mathbf A^T\big) \big(\mathbf{AA}^T\big)^{-1} \mathbf b = \mathbf b$    

and hence this choice of $\mathbf x$ is valid  



note that converting the final equation to the SVD equivalent one may be easier to interpret with the decomposition

$\mathbf A = \mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T$  

in which case we'd have 


$\mathbf x = \mathbf A^T \big(\mathbf{AA}^T\big)^{-1} \mathbf b = \big(\mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T\big)^T \Big(\mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T \big(\mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T\big)^T\Big)^{-1}\mathbf b $    

$= \big( \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}} \mathbf U^T\big) \Big(\mathbf U \mathbf \Sigma_{\text{square}}\mathbf V_{\text{non-square}}^T \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}}\mathbf U^T \Big)^{-1}\mathbf b = \big( \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}} \mathbf U^T\big) \Big(\mathbf U \mathbf \Sigma_{\text{square}}^2 \mathbf U^T \Big)^{-1}\mathbf b $    

$ = \big( \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}} \mathbf U^T\big)\big(\mathbf U \mathbf \Sigma_{\text{square}}^{-2} \mathbf U^T\big) \mathbf b  = \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}}^{-1} \mathbf U^T \mathbf b $  


and a quick associativity check tells us 

$\mathbf{V \Sigma}^{-1} = \mathbf V_{\text{non-square}} \mathbf \Sigma_{\text{square}}^{-1}$   

hence this solution agrees with the earlier one 




