Quite often in basic machine learning applications -- say with linear regression -- we gather n samples of data and look to fit a model to them.  Note: we often have *a lot* of data, and in fact n can be any natural number.  For illustrative purposes, and without loss of generality, this posting will use n = 5. 

Note that we typically also have multiple different features in our data, but **the goal of this posting is to strip down ideas to their very core**, so we consider the one feature case.  Also note that in machine learning we may use notation like $\mathbf {Xw} = \mathbf y$, where we solve for the weights in $\mathbf w$.  However, this posting uses the typical Linear Algebra setup of $\mathbf{Ax} = \mathbf b$, where we are interested in solving for $\mathbf x$.  

So initially we may just have the equation

$\mathbf{Ax} = \begin{bmatrix}
a_1\\ 
a_2\\ 
a_3\\ 
a_4\\ 
a_5
\end{bmatrix} \begin{bmatrix}
x_1\\ 
\end{bmatrix} = \mathbf b$

**This original 'data' matrix will also be written as**

$\mathbf a = \begin{bmatrix}
a_1\\ 
a_2\\ 
a_3\\ 
a_4\\ 
a_5
\end{bmatrix}$

Note that when we gather real world data there is noise in the data, so we would be *extremely* surprised if any of the entries in $\mathbf a$ are duplicates.  So, unless otherwise noted, assume that the entries are pairwise distinct, i.e. $a_i \neq a_j$ whenever $i \neq j$. (We address the issue of duplicates later on.)  With 5 rows, the rank of $\mathbf A$ is at most 5.  However, since there is only one column, the column rank of $\mathbf A$ is one, and since column rank = row rank, we know that the row rank = 1. 

Then we decide to insert a bias / affine translation piece (in index position zero -- to use notation from Caltech's "Learning From Data").  

Thus we end up with the following equation

$\mathbf{Ax} = \begin{bmatrix}
1 & a_1\\ 
1 & a_2\\ 
1 & a_3\\ 
1 & a_4\\ 
1 & a_5
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
\end{bmatrix} = \mathbf b$

Column 0 of $\mathbf A$ is the ones vector, also denoted as $\mathbf 1$.  

At this point we know that $\mathbf A$ still has full column rank (i.e. rank = 2) -- if this weren't the case, we could scale column 0 to get column 1 (i.e. every entry in column 1 would have to be identical).   

From here we may simply decide to do least squares and solve (which we always can do when we have full column rank, and $\mathbf A $ has m rows and n columns, where $m \geq n$).  
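
As a minimal sketch of this least squares step, the snippet below builds the two-column $\mathbf A$ and solves for $x_0$ and $x_1$ with NumPy. The data values in `a` and `b` are made up purely for illustration and are not from this posting.

```python
import numpy as np

# Hypothetical one-feature data (n = 5); values are made up for illustration.
a = np.array([0.5, 1.2, 2.0, 3.1, 4.4])
b = np.array([1.1, 2.3, 4.2, 6.0, 9.1])

# Design matrix: the ones vector in column 0, the data in column 1.
A = np.column_stack([np.ones_like(a), a])

# Full column rank (rank 2) because the entries of a are not all identical.
rank = np.linalg.matrix_rank(A)

# Least squares solution; x[0] plays the role of x_0 (bias), x[1] of x_1.
x, _, _, _ = np.linalg.lstsq(A, b, rcond=None)

# At the least squares solution the residual is orthogonal to the columns of A.
normal_eq_gap = A.T @ (A @ x - b)
print(rank, x, normal_eq_gap)
```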

Or we may decide to map this to a higher dimensional space that has a quadratic term.  

$\mathbf{Ax} = \begin{bmatrix}
1 & a_1 & a_1^2\\ 
1 & a_2 & a_2^2\\ 
1 & a_3 & a_3^2\\ 
1 & a_4 & a_4^2\\ 
1 & a_5 & a_5^2
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
x_2\\
\end{bmatrix} = \mathbf b$


At this point we may just do least squares and solve.  But that requires $\mathbf A$ to have full column rank.  How do we know that $\mathbf A$ has full column rank?  One way to think about it is that squaring each $a_i$ to get column 2 is not a linear transformation, so we would not expect it to be a linear combination of the prior columns.  

$\mathbf a \circ \mathbf a \neq \gamma_0 \mathbf 1 + \gamma_1 \mathbf a$

where $\circ$ denotes the Hadamard product.  And by the earlier argument, we know $\mathbf a \neq \gamma_0 \mathbf 1$, hence the columns are linearly independent.  There is another way to verify linear independence of these columns -- which comes from the Vandermonde Matrix, and we will address this shortly.  
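
The claim can be sanity-checked numerically. The sketch below (with made-up, distinct sample values) computes the rank of the three-column matrix and also tries, in the least squares sense, to write $\mathbf a \circ \mathbf a$ as $\gamma_0 \mathbf 1 + \gamma_1 \mathbf a$; a clearly nonzero misfit means no such $\gamma_0, \gamma_1$ exist for this data.

```python
import numpy as np

a = np.array([0.5, 1.2, 2.0, 3.1, 4.4])  # hypothetical distinct samples

# Columns: 1, a, and the elementwise (Hadamard) square of a.
A_quad = np.column_stack([np.ones_like(a), a, a * a])
rank_quad = np.linalg.matrix_rank(A_quad)

# Best attempt at a*a = gamma_0 * 1 + gamma_1 * a.
gammas, _, _, _ = np.linalg.lstsq(A_quad[:, :2], a * a, rcond=None)
misfit = np.linalg.norm(A_quad[:, :2] @ gammas - a * a)
print(rank_quad, misfit)
```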

We may however decide we want an even higher dimensional space for our data, so we add a cubic term:

$\mathbf{Ax} = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3\\ 
1 & a_2 & a_2^2 & a_2^3\\ 
1 & a_3 & a_3^2 & a_3^3\\ 
1 & a_4 & a_4^2 & a_4^3\\ 
1 & a_5 & a_5^2 & a_5^3
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
x_2\\
x_3\\
\end{bmatrix} = \mathbf b$

Again we may be confident that the columns are linearly independent because our new column -- cubing $\mathbf a$ -- is not a linear transformation (or alternatively, using the Hadamard product is not a linear transformation), so we write: 

$\mathbf a \circ \mathbf a \circ \mathbf a \neq \gamma_0 \mathbf 1 + \gamma_1 \mathbf a + \gamma_2 \big(\mathbf a \circ \mathbf a\big)$

And if the above is *still* not enough, we may add a term to the fourth power:

$\mathbf{Ax} = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3 & a_1^4\\ 
1 & a_2 & a_2^2 & a_2^3 & a_2^4\\ 
1 & a_3 & a_3^2 & a_3^3 & a_3^4\\ 
1 & a_4 & a_4^2 & a_4^3 & a_4^4\\ 
1 & a_5 & a_5^2 & a_5^3 & a_5^4
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
x_2\\
x_3\\
x_4\\
\end{bmatrix} = \mathbf b$

Again we are quite confident that the above has full column rank because 

$\mathbf a \circ \mathbf a \circ \mathbf a \circ \mathbf a \neq \gamma_0 \mathbf 1 + \gamma_1 \mathbf a + \gamma_2 \big(\mathbf a \circ \mathbf a\big) + \gamma_3 \big(\mathbf a \circ \mathbf a \circ \mathbf a \big)$

We may be tempted to go to an even higher dimensional space at this point, but this requires considerable justification.  Notice that $\mathbf A$ is a square matrix now, and as we've argued, it has full column rank -- which means it also has full row rank.  Thus we can be sure to solve the above equation for a unique, exact solution, where $\mathbf x = \mathbf A^{-1}\mathbf b$.  If we were to go to a higher dimensional space we would be entering the world of an underdetermined system of equations -- see postings titled "Underdetermined_System_of_Equations.ipynb" for the L2 norm oriented solution, and "underdetermined_regression_minimize_L1_norm.ipynb" for the L1 norm oriented solution.  Since we can already be certain of solving for a single exact solution in this problem, we will stop mapping to higher dimensions here.  
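
A minimal sketch of the square case, again with made-up data: `np.vander` with `increasing=True` produces the columns $1, a, a^2, a^3, a^4$ in the same order as above, and `np.linalg.solve` is used in place of forming $\mathbf A^{-1}$ explicitly (a common numerical choice, not something this posting prescribes).

```python
import numpy as np

# Hypothetical data; 5 distinct samples give a 5 x 5 system.
a = np.array([0.5, 1.2, 2.0, 3.1, 4.4])
b = np.array([1.1, 2.3, 4.2, 6.0, 9.1])

# Columns are 1, a, a^2, a^3, a^4.
A = np.vander(a, N=5, increasing=True)

# Unique exact solution (up to floating point) since A is square and full rank.
x = np.linalg.solve(A, b)
max_error = np.max(np.abs(A @ x - b))
print(x, max_error)
```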

In the above equation of $\mathbf{Ax} = \mathbf b$, the square $\mathbf A$ is a Vandermonde matrix.  Technical note: some texts say that $\mathbf A$ is the Vandermonde matrix, while others say $\mathbf A^T$ is the Vandermonde matrix.  The calculation of the determinant is identical, and for other properties, it is a small book-keeping adjustment to transpose the matrix.  

Note that the Vandermonde matrix is well studied, has special fast matrix vector multiplication (i.e. $\lt O(n^2)$) algorithms associated with it -- and a very special type of Vandermonde matrix is the Discrete Fourier Transform matrix.  It also has some very interesting properties for thinking about eigenvalues. 
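
To illustrate the DFT remark: if the nodes are chosen as the $n$-th roots of unity, the Vandermonde matrix coincides with the DFT matrix. The sketch below assumes NumPy's forward FFT sign convention, $e^{-2\pi i jk/n}$.

```python
import numpy as np

n = 5
# Nodes a_j = exp(-2*pi*i*j/n), the n-th roots of unity.
nodes = np.exp(-2j * np.pi * np.arange(n) / n)

# Row j is 1, a_j, a_j^2, ..., a_j^(n-1).
V = np.vander(nodes, N=n, increasing=True)

# Transforming the identity column by column yields the DFT matrix.
F = np.fft.fft(np.eye(n), axis=0)
is_dft = np.allclose(V, F)

# Multiplying by V agrees with calling the FFT directly.
x = np.arange(n, dtype=float)
matches_fft = np.allclose(V @ x, np.fft.fft(x))
print(is_dft, matches_fft)
```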
- - - -
As a slight digression, consider the case where $a_1 = a_2$.  If this were true, the maximal row rank of $\mathbf A$ would be 4, and hence the maximal column rank would also be 4, and thus $\mathbf A$ would not be full rank, i.e. $det\big(\mathbf A\big) = 0$.  
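
A quick numerical illustration of the duplicate case, using made-up values with $a_1 = a_2$:

```python
import numpy as np

# Hypothetical samples in which the first two entries coincide.
a_dup = np.array([1.2, 1.2, 2.0, 3.1, 4.4])
A_dup = np.vander(a_dup, N=5, increasing=True)

# Rows 0 and 1 are identical, so their difference is exactly the zero row.
row_gap = np.max(np.abs(A_dup[0] - A_dup[1]))

rank_dup = np.linalg.matrix_rank(A_dup)
det_dup = np.linalg.det(A_dup)
print(row_gap, rank_dup, det_dup)
```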
- - - -
There is another way to verify that $\mathbf A$ is full rank.  Let's look at the determinant of $\mathbf A^T$.  There are a few different ways to prove the formula for it.  Serge Winitzki had an interesting proof using wedge products -- that I may revisit in the not too distant future.  For the moment I'll just notice that there is a somewhat obvious 'pattern' to these Vandermonde matrices, so we'll do the proof using mathematical induction, which takes advantage of this pattern / progression in polynomial terms.  

**claim**: 

For natural number $n \geq 2$ where $\mathbf A \in \mathbb R^{n \times n}$, and $\mathbf A$ is a Vandermonde matrix, 

$det \big(\mathbf A \big) = det \big(\mathbf A^T \big) = \prod_{1 \leq i \lt j \leq n} (a_j - a_i)$
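
Before proving the claim, it can be checked numerically. The helper `vandermonde_det_formula` below is a name introduced here for illustration, and the sample values are made up.

```python
import numpy as np
from itertools import combinations

def vandermonde_det_formula(a):
    """Product of (a_j - a_i) over all index pairs i < j."""
    return np.prod([a[j] - a[i] for i, j in combinations(range(len(a)), 2)])

a = np.array([0.5, 1.2, 2.0, 3.1, 4.4])  # hypothetical distinct samples
A = np.vander(a, increasing=True)

det_numeric = np.linalg.det(A)
det_transpose = np.linalg.det(A.T)
det_formula = vandermonde_det_formula(a)
print(det_numeric, det_transpose, det_formula)
```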

*Base Case:* 

$n = 2$

$\mathbf A^T = \begin{bmatrix}
1 & 1\\ 
a_1 & a_2
\end{bmatrix}$

$det \big(\mathbf A^T \big) = (a_2 - a_1) = \prod_{1 \leq i \lt j \leq n} (a_j - a_i)$


*Inductive Case:*

For $n \gt 2$, assume the formula is true for every $(n-1) \times (n-1)$ matrix of this form (built from any $n-1$ values), e.g. where $\mathbf C \in \mathbb R^{(n-1) \times (n -1)}$

i.e. assume true where 

$\mathbf C = \begin{bmatrix}
1 & 1 & 1 & \dots & 1\\ 
a_1 & a_2 & a_3 & \dots & a_{n-1}\\ 
a_1^2 & a_2^2 & a_3^2 & \dots & a_{n-1}^2\\ 
\vdots & \vdots & \vdots & \ddots & \vdots\\ 
a_{1}^{n-2} & a_{2}^{n-2} & a_{3}^{n-2} & \dots & a_{n-1}^{n-2}
\end{bmatrix}$

Note that we call this submatrix $\mathbf C$ -- it will make a reappearance shortly!


We need to show that the formula holds where the dimension of $\mathbf A$ is $n \times n$. Thus consider the case where:

$\mathbf A^T  = \begin{bmatrix}
1 & 1 & 1 & \dots & 1 & 1\\ 
a_1 & a_2 & a_3 & \dots & a_{n-1} & a_n \\ 
a_1^2 & a_2^2 & a_3^2 & \dots & a_{n-1}^2 & a_{n}^2\\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
a_{1}^{n-2} & a_{2}^{n-2} & a_{3}^{n-2} & \dots & a_{n-1}^{n-2} & a_{n}^{n-2}\\
a_{1}^{n-1} & a_{2}^{n-1} & a_{3}^{n-1} & \dots & a_{n-1}^{n-1} & a_{n}^{n-1}
\end{bmatrix} $

Subtract $a_1$ times row $i - 1$ from row $i$, for $0 \lt i \leq n-1$ (rows indexed from zero), **starting from the bottom of the matrix and working our way up** (i.e. each operation uses a row that has not yet been modified, so the operations / subproblems do not overlap in this regard).  

- - - - - 
**Justification:**

First, the reason we'd like to do this is because we see an obvious pattern in the polynomial progression in each column of $\mathbf A^T$.  Thus by following this procedure, we can zero out all entries in the zeroth column of $\mathbf A^T$ except the 1 located in the top left (i.e. in position $(0,0)$).  This will allow us to, in effect, reduce our problem to the $(n - 1) \times (n - 1)$ dimensional case.  

Also recall that the determinant of $\mathbf A^T$ is equivalent to the determinant of $\mathbf A$. Thus the above procedure is equivalent to subtracting a scaled version of column 0 of the original $\mathbf A$ from column 1, and a scaled version of column 1 in the original $\mathbf A$ from column 2, and so on.  We could consider that $\mathbf A = \mathbf{QR}$, thus $det \big(\mathbf A \big) = det \big(\mathbf{QR} \big) = det \big(\mathbf{Q} \big)det \big(\mathbf{R} \big)$.  Notice that these column operations will have no impact on $\mathbf Q$, and will only change the value of entries above the diagonal in $\mathbf R$, thus there is no change in $det \big(\mathbf{Q} \big)$ or $det \big(\mathbf{R} \big)$ (which is given by the product of its diagonal entries).  This means there is no change in $det \big(\mathbf{A} \big)$.  

Technical note: there are other ways to interpret / prove that these sorts of row operations on $\mathbf A^T$ do not alter the determinant.  However your author particularly likes Gram–Schmidt and orthogonality.
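
The QR argument can be illustrated numerically. In the sketch below (made-up sample values), the column operations are applied to $\mathbf A$ and the diagonals of the two $\mathbf R$ factors are compared in absolute value, since a numerical QR routine is only unique up to the signs of the diagonal of $\mathbf R$.

```python
import numpy as np

a = np.array([0.5, 1.2, 2.0, 3.1, 4.4])  # hypothetical distinct samples
A = np.vander(a, increasing=True)

# Subtract a_1 times column j-1 from column j, working right to left
# so that each step uses a column that has not yet been modified.
A_ops = A.copy()
for j in range(A.shape[1] - 1, 0, -1):
    A_ops[:, j] -= a[0] * A_ops[:, j - 1]

_, R_before = np.linalg.qr(A)
_, R_after = np.linalg.qr(A_ops)

# Only entries above the diagonal of R change, so the determinant is unchanged.
same_diag = np.allclose(np.abs(np.diag(R_before)), np.abs(np.diag(R_after)))
same_det = np.isclose(np.linalg.det(A), np.linalg.det(A_ops))
print(same_diag, same_det)
```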

- - - - - 

$ = \begin{bmatrix}
1 & 1 & 1 & \dots & 1 & 1\\ 
0 & a_2 - a_1 & a_3 - a_1 & \dots & a_{n-1} - a_1 & a_n - a_1 \\ 
0 & a_2^2 - a_1 a_2 & a_3^2 - a_1 a_3 & \dots & a_{n-1}^2 - a_1 a_{n-1} & a_{n}^2 - a_1 a_{n}\\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
0 & a_{2}^{n-2} - a_1 a_{2}^{n-3} & a_{3}^{n-2} - a_1 a_{3}^{n-3} & \dots & a_{n-1}^{n-2} - a_1 a_{n-1}^{n-3} & a_{n}^{n-2} - a_1 a_{n}^{n-3}\\
0 & a_{2}^{n-1} - a_1 a_2^{n-2} & a_{3}^{n-1} - a_1 a_3^{n-2}& \dots & a_{n-1}^{n-1} -  a_1 a_{n-1}^{n-2}& a_{n}^{n-1} - a_1 a_{n}^{n-2}
\end{bmatrix} $

$ = \begin{bmatrix}
1 & 1 & 1 & \dots & 1 & 1\\ 
0 & (a_2 - a_1) 1 & (a_3 - a_1)1 & \dots & (a_{n-1} - a_1) 1 & (a_n - a_1) 1 \\ 
0 & (a_2 - a_1) a_2 & (a_3 - a_1) a_3 & \dots & (a_{n-1} - a_1) a_{n-1} & (a_n - a_1) a_{n}\\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
0 & (a_2 - a_1)a_{2}^{n-3} & (a_3 - a_1)a_{3}^{n-3} & \dots & (a_{n-1} - a_1)a_{n-1}^{n-3} & (a_n - a_1)a_{n}^{n-3}\\
0 & (a_2 - a_1)a_{2}^{n-2} & (a_3 - a_1)a_{3}^{n-2} & \dots & (a_{n-1} - a_1)a_{n-1}^{n-2} & (a_n - a_1)a_{n}^{n-2} 
\end{bmatrix}  $


we can rewrite this as 

$= \begin{bmatrix}
1 & \mathbf 1^T\\ 
\mathbf 0 & \mathbf{CD}
\end{bmatrix}$


where 

$\mathbf D = \begin{bmatrix}
(a_2-a_1) & 0 &  0& \dots & 0\\ 
0 & (a_3 - a_1) &0  &\dots  &0 \\ 
0 & 0 & (a_4 - a_1) & \dots & 0\\ 
0 & 0 & 0 & \ddots & \vdots \\ 
0 & 0 & 0 & \dots & (a_n - a_1)
\end{bmatrix}$
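
The elimination step and the resulting block structure can be checked with a short sketch (made-up sample values). It assumes the $\mathbf C$ inside $\mathbf{CD}$ is the $(n-1) \times (n-1)$ matrix of the same form built from $a_2, ..., a_n$, which is what the factored entries above imply.

```python
import numpy as np

a = np.array([0.5, 1.2, 2.0, 3.1, 4.4])  # hypothetical distinct samples
n = len(a)

# A^T: row r holds the r-th powers, r = 0, ..., n-1 (zero-indexed rows).
At = np.vander(a, increasing=True).T
M = At.copy()

# Bottom-up: each step uses the not-yet-modified row directly above.
for i in range(n - 1, 0, -1):
    M[i, :] -= a[0] * M[i - 1, :]

# C uses the values a_2, ..., a_n; D is diagonal with entries a_j - a_1.
C = np.vander(a[1:], increasing=True).T
D = np.diag(a[1:] - a[0])

first_col_ok = np.allclose(M[:, 0], np.eye(n)[:, 0])
block_ok = np.allclose(M[1:, 1:], C @ D)
det_ok = np.isclose(np.linalg.det(M), np.linalg.det(At))
print(first_col_ok, block_ok, det_ok)
```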

Clearly there is an eigenvalue of 1 associated with the top left entry of the block upper triangular matrix $\begin{bmatrix}
1 & \mathbf 1^T\\ \mathbf 0 & \mathbf{CD} \end{bmatrix}$. We'll call this $\lambda_1$.  The remaining eigenvalues $\lambda_2, ..., \lambda_n$ are those of $\mathbf{CD}$.  Thus the determinant can be written as 

$det\big(\mathbf A^T \big) = (\lambda_1) * \big(\lambda_2  * \lambda_3 * ... * \lambda_n\big) = (1) * \det\big(\mathbf{CD}\big) = \det\big(\mathbf{C}\big) \det\big(\mathbf{D}\big)$

and 

$\det\big(\mathbf{D}\big) = (a_2-a_1) * (a_3 - a_1) * ... * (a_n - a_1)$

because the determinant is the product of the eigenvalues of a matrix, which are the diagonal entries of a triangular (or diagonal) matrix.  

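and, since the $\mathbf C$ appearing in $\mathbf{CD}$ has the same form as the $(n-1) \times (n-1)$ matrix from the inductive hypothesis but is built from $a_2, a_3, ..., a_n$ (rather than $a_1, ..., a_{n-1}$), 

$det \big(\mathbf C \big) = \prod_{2 \leq i \lt j \leq n} (a_j - a_i)$ 

by inductive hypothesis.  

Thus we can say 

$ det\big(\mathbf A^T \big) = \big(\prod_{2 \leq i \lt j \leq n} (a_j - a_i)\big) \big((a_2-a_1) * (a_3 - a_1) * ... * (a_n - a_1)\big) = \prod_{1 \leq i \lt j \leq n} (a_j - a_i)$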

And the induction is proved.  

Finally, we note that $det \big(\mathbf A \big) = det \big(\mathbf A^T \big)$ because $\mathbf A$ and $\mathbf A^T$ have the same characteristic polynomials (or equivalently, they have the same eigenvalues).  We have thus proved the determinant formula for $\mathbf A$.  

(Technical note: if $\mathbf A \in \mathbb C^{n \times n}$, the determinant formula and $det\big(\mathbf A\big) = det\big(\mathbf A^T\big)$ still hold exactly for the plain transpose.  If we instead work with the conjugate transpose $\mathbf A^H$, the results hold with respect to the magnitude of the determinant.  This includes the very important special case of whether or not the determinant has zero magnitude -- i.e. whether or not $\mathbf A^{-1}$ exists.  However, with respect to the exact determinant, it is more proper to state that $det\big(\mathbf A\big) = conjugate\Big(\det\big(\mathbf A^H\big)\Big)$.) 
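
A small numerical check of the complex case, with made-up complex values: the determinant of the plain transpose matches the product formula, while the conjugate transpose matches its complex conjugate.

```python
import numpy as np
from itertools import combinations

# Hypothetical distinct complex values.
z = np.array([1 + 2j, -0.5 + 1j, 2 - 1j, 0.3j])
A = np.vander(z, increasing=True)

formula = np.prod([z[j] - z[i] for i, j in combinations(range(len(z)), 2)])

transpose_ok = np.isclose(np.linalg.det(A.T), formula)
hermitian_ok = np.isclose(np.linalg.det(A.conj().T), np.conj(formula))
print(transpose_ok, hermitian_ok)
```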
- - - -

This gives us another way to confirm that our Vandermonde Matrix is full rank.  We know that a square, finite dimensional matrix is singular iff it has a determinant of 0.  We then see that 

$\det \big(\mathbf A\big) = \big(\prod_{1 \leq i \lt j \leq n} (a_j - a_i)\big) = 0$ iff there is some $a_j = a_i$ where $i \neq j$.  

This of course is another way of saying that our Vandermonde Matrix is not full rank if some entry in our 'original' matrix of 

$\mathbf a = \begin{bmatrix}
a_1\\ 
a_2\\ 
a_3\\ 
a_4\\ 
a_5
\end{bmatrix}$

is not unique.  


Furthermore, notice that this determinant formula gives us a proof that we have full column rank in any thinner (i.e. more rows than columns) version of our Vandermonde matrix.  E.g. consider the case of 


$\mathbf{A} = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3\\ 
1 & a_2 & a_2^2 & a_2^3\\ 
1 & a_3 & a_3^2 & a_3^3\\ 
1 & a_4 & a_4^2 & a_4^3\\ 
1 & a_5 & a_5^2 & a_5^3
\end{bmatrix}$


These columns must be linearly independent, so long as each $a_i \neq a_j$ where $i \neq j$.  If that was not the case, then appending additional columns until square (i.e. append $\mathbf a \circ \mathbf a \circ \mathbf a \circ \mathbf a$) would mean that 

$\mathbf A = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3 & a_1^4\\ 
1 & a_2 & a_2^2 & a_2^3 & a_2^4\\ 
1 & a_3 & a_3^2 & a_3^3 & a_3^4\\ 
1 & a_4 & a_4^2 & a_4^3 & a_4^4\\ 
1 & a_5 & a_5^2 & a_5^3 & a_5^4
\end{bmatrix} $

could not have full column rank either.  Yet we know this matrix is full rank via our determinant formula (again so long as each $a_i$ is unique), thus we know that the columns of any smaller "long and skinny" version of this matrix must also be linearly independent.

Also, when each $a_i$ is unique, since we know that our Vandermonde matrix is full rank, we know that each of its rows is linearly independent.  If for some reason we had a 'short and fat' version of the above matrix, like:

$\mathbf A = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3 & a_1^4\\ 
1 & a_2 & a_2^2 & a_2^3 & a_2^4\\ 
1 & a_3 & a_3^2 & a_3^3 & a_3^4\\ 
\end{bmatrix} $

we would know that it is full row rank -- i.e. its rows are linearly independent.
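
Both the 'long and skinny' and the 'short and fat' cases can be checked numerically; the sample values below are made up.

```python
import numpy as np

a = np.array([0.5, 1.2, 2.0, 3.1, 4.4])  # hypothetical distinct samples

# 5 x 4: columns 1, a, a^2, a^3.
tall = np.vander(a, N=4, increasing=True)
# 3 x 5: first three samples, columns 1, a, ..., a^4.
wide = np.vander(a[:3], N=5, increasing=True)

rank_tall = np.linalg.matrix_rank(tall)  # full column rank expected
rank_wide = np.linalg.matrix_rank(wide)  # full row rank expected
print(tall.shape, rank_tall, wide.shape, rank_wide)
```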


**Application of Vandermonde Matrices: Proof of Linear Independence of Eigenvectors associated with Unique Eigenvalues**

The determinant formula lets us prove that if a square (finite dimensional) matrix -- aka an operator -- has eigenvalues that are all unique, then its eigenvectors must be linearly independent.  Put differently, it proves that such an operator is diagonalizable.  

The typical argument for linear independence is in fact a touch shorter than this and does not need Vandermonde matrices -- however it relies on a contradiction that is not particularly satisfying.  The following proof -- adapted from Winitzki's *Linear Algebra via Exterior Products* -- is direct, and to your author, very intuitive.  Note that I had long ago considered a similar approach, except rather than looking at the eigenvectors over $\{0, 1, 2, ..., n-1\}$ iterations, I looked at what happens as the number of iterations tends to infinity.  This, however, no doubt introduced some heavier machinery than needed (including some subtleties from analysis that should have been addressed though perhaps weren't).  Furthermore, it was an effective way to consider the issue with respect to the magnitudes of the eigenvalues, but dealing with periodic behavior over equivalent magnitudes became rather difficult. 

Consider a matrix $\mathbf B \in \mathbb C^{n \times n}$ which has n unique eigenvalues -- i.e. $\lambda_i \neq \lambda_j$ whenever $i \neq j$ -- and let $\mathbf v_k$ be an eigenvector associated with $\lambda_k$.  


When looking for linear independence, we consider 

$\gamma_1 \mathbf v_1 + \gamma_2 \mathbf v_2 + ... + \gamma_n \mathbf v_n = \mathbf 0$  

and can say that **the eigenvectors are linearly independent iff** the only solution is $\gamma_1 = \gamma_2 = ... = \gamma_n = 0$

Further, for $k \in \{1, 2, ..., n\}$, we know that  
$\mathbf B^0 \mathbf v_k = \mathbf I \mathbf v_k  = \mathbf v_k$  
$\mathbf B \mathbf v_k = \lambda_k \mathbf v_k$  
$\mathbf B \mathbf B \mathbf v_k = \mathbf B^2 \mathbf v_k = \lambda_k^2 \mathbf v_k$  
$\vdots $  

$\mathbf B^{n-1} \mathbf v_k = \lambda_k^{n-1} \mathbf v_k$  


Thus we can take our original linear independence test,

$\gamma_1 \mathbf v_1 + \gamma_2 \mathbf v_2 + ... + \gamma_n \mathbf v_n = \mathbf 0$  

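and, multiplying on the left by $\mathbf B^r$ and using $\mathbf B^r \mathbf v_k = \lambda_k^r \mathbf v_k$, further generalize it to also include:

$ \lambda_1^r \gamma_1 \mathbf v_1 + \lambda_2^r  \gamma_2 \mathbf v_2 + ... + \lambda_n^r \gamma_n \mathbf v_n = \mathbf 0$  

for $r \in \{1, 2, ..., n-1\}$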
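
The identity behind this generalization, $\mathbf B^r \big(\sum_k \gamma_k \mathbf v_k\big) = \sum_k \lambda_k^r \gamma_k \mathbf v_k$, holds for any coefficients $\gamma_k$ and can be checked numerically. The matrix `B` and the coefficients `gamma` below are made up for illustration; `B` is upper triangular, so its eigenvalues are its distinct diagonal entries.

```python
import numpy as np

# Hypothetical upper triangular B with distinct diagonal entries.
B = np.array([[1.0, 2.0, 0.5, 1.0],
              [0.0, 2.0, 1.0, 3.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 0.0, 4.0]])
lam, V = np.linalg.eig(B)  # column k of V is an eigenvector for lam[k]

gamma = np.array([0.7, -1.3, 2.0, 0.4])  # arbitrary coefficients
G = V * gamma        # columns are gamma_k * v_k
s = G.sum(axis=1)    # gamma_1 v_1 + ... + gamma_n v_n

# B^r s equals the sum of lam_k^r * gamma_k * v_k for r = 0, ..., n-1.
identity_holds = all(
    np.allclose(np.linalg.matrix_power(B, r) @ s, G @ lam**r)
    for r in range(B.shape[0])
)
print(identity_holds)
```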
- - - -

Now let's collect these $n$ relationships in a system of equations:

$\bigg[\begin{array}{c|c|c|c}
\gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg]
 \mathbf W = \bigg[\begin{array}{c|c|c|c}
\gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg]\begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{n-1}\\ 
1 & \lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{n-1} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots \\ 
1 & \lambda_{n} & \lambda_{n}^{2} & \dots  & \lambda_{n}^{n-1}
\end{bmatrix} = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$

Notice that $\mathbf W$ is a Vandermonde matrix. Since each $\lambda_k$ is unique, we know that $det \big(\mathbf W\big) \neq 0$, and thus $\mathbf W^{-1}$ exists as a unique operator.  We multiply each term on the right by $\mathbf W^{-1}$.  

$\bigg[\begin{array}{c|c|c|c}
\gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg]
 \mathbf W \mathbf W^{-1}= \bigg[\begin{array}{c|c|c|c}
\gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg]\mathbf I = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg] \mathbf W^{-1} = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$  


Thus we know that 

$\bigg[\begin{array}{c|c|c|c}
\gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$

By definition each eigenvector $\mathbf v_k \neq \mathbf 0$.  This means that each scalar $\gamma_k = 0$.  The eigenvectors have thus been proven to be linearly independent.  
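
As a closing sanity check, the sketch below takes a made-up matrix with distinct eigenvalues, builds $\mathbf W$ from those eigenvalues, and confirms that $\mathbf W$ is invertible, that the eigenvector matrix has full rank, and that the matrix is diagonalizable. It illustrates the result rather than replacing the proof.

```python
import numpy as np

# Hypothetical upper triangular B; eigenvalues are the distinct diagonal entries.
B = np.array([[1.0, 2.0, 0.5, 1.0],
              [0.0, 2.0, 1.0, 3.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 0.0, 4.0]])
lam, V = np.linalg.eig(B)
n = B.shape[0]

# Vandermonde matrix in the eigenvalues: row k is 1, lam_k, ..., lam_k^(n-1).
W = np.vander(lam, increasing=True)
det_W = np.linalg.det(W)  # nonzero because the eigenvalues are distinct

# Linearly independent eigenvectors <=> V has full rank <=> B is diagonalizable.
rank_V = np.linalg.matrix_rank(V)
diag_ok = np.allclose(V @ np.diag(lam) @ np.linalg.inv(V), B)
print(det_W, rank_V, diag_ok)
```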
