# Motivating Background: figuring out a function from data points

Quite often in basic machine learning applications -- say with linear regression -- we gather $n$ samples of data and look to fit a model to them.  Note: we often have *a lot* of data, and in fact $n$ can be any natural number.  For illustrative purposes, we start with the case of $n = 5$.

Note that we typically also have multiple different features in our data, but *the goal of this posting is to strip down ideas to their very core*, so we consider the one feature case.  Also note that in machine learning we may use notation like $\mathbf {Xw} = \mathbf y$, where we solve for the weights in $\mathbf w$.  However, this posting uses the typical Linear Algebra setup of $\mathbf{Ax} = \mathbf b$, where we are interested in solving for $\mathbf x$.  

So initially we may just have the equation

$\mathbf{Ax} = \begin{bmatrix}
a_1\\ 
a_2\\ 
a_3\\ 
a_4\\ 
a_5
\end{bmatrix} \begin{bmatrix}
x_1\\ 
\end{bmatrix} = \mathbf b$

**This original 'data' matrix will also be written as**

$\mathbf a = \begin{bmatrix}
a_1\\ 
a_2\\ 
a_3\\ 
a_4\\ 
a_5
\end{bmatrix}$

Note that when we gather real world data there is noise in the data, so we would be *extremely* surprised if any of the entries in $\mathbf a$ are duplicates.  So, unless otherwise noted, assume that each entry $a_i$ of $\mathbf a$ is unique.  Since there is only one column, the column rank of $\mathbf A$ is one, and since column rank = row rank, we know that the row rank is also 1.

Then we decide to insert a bias /affine translation piece (in index position zero -- to use notation from Caltech's "Learning From Data").  

Thus we end up with the following equation

$\mathbf{Ax} = \begin{bmatrix}
1 & a_1\\ 
1 & a_2\\ 
1 & a_3\\ 
1 & a_4\\ 
1 & a_5
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
\end{bmatrix} = x_0 \mathbf 1 + x_1 \mathbf a = \mathbf b$

Column 0 of $\mathbf A$ is the ones vector, also denoted as $\mathbf 1$.  

At this point we know that $\mathbf A$ still has full column rank (i.e. rank = 2) -- if this wasn't the case, this would imply that we could scale column 0 to get column 1 (i.e. everything in column 1 would have to be identical).   

From here we may simply decide to do least squares and solve (which we can always do when $\mathbf A$ has full column rank, which in turn requires at least as many rows as columns).
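As a minimal sketch of the least squares step just described, the snippet below fits the bias plus one feature model with NumPy. The sample values in `a` and `b` are made up purely for illustration; only the structure of $\mathbf A$ (a ones column followed by $\mathbf a$) comes from the text.

```python
import numpy as np

# Hypothetical data: five distinct sample locations and noisy targets.
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix: column 0 is the ones vector, column 1 is the feature a.
A = np.column_stack([np.ones_like(a), a])

# Full column rank (rank 2 here) is what makes the least squares solution unique.
rank = np.linalg.matrix_rank(A)

# Minimize ||Ax - b||_2; x[0] plays the role of x_0 (bias), x[1] of x_1 (slope).
x, residuals, _, _ = np.linalg.lstsq(A, b, rcond=None)

# The same solution from the normal equations A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
```

The two routes should agree up to rounding; `lstsq` is usually the safer choice numerically because it never forms $\mathbf A^T \mathbf A$.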

Or we may decide to map this to a higher dimensional space that has a quadratic term.  

$\mathbf{Ax} = \begin{bmatrix}
1 & a_1 & a_1^2\\ 
1 & a_2 & a_2^2\\ 
1 & a_3 & a_3^2\\ 
1 & a_4 & a_4^2\\ 
1 & a_5 & a_5^2
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
x_2\\
\end{bmatrix} = \mathbf b$


At this point we may just do least squares and solve.  But that requires $\mathbf A$ to have full column rank.  How do we know that $\mathbf A$ has full column rank?  An intuitive way to think about it is that squaring each $a_i$ to get column 2 is not a linear transformation, so we would not expect it to be a linear combination of the prior columns.

$\mathbf a \circ \mathbf a \neq \gamma_0 \mathbf 1 + \gamma_1 \mathbf a$

where $\circ$ denotes the Hadamard product.  And by the earlier argument, we know $\mathbf a \neq \gamma_0 \mathbf 1$, hence the columns are linearly independent.  There is another (more mathematically exact) way to verify linear independence of these columns -- which comes from the Vandermonde Matrix, and we will address this shortly.

We may however decide we want an even higher dimensional space for our data, so we add a cubic term:

$\mathbf{Ax} = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3\\ 
1 & a_2 & a_2^2 & a_2^3\\ 
1 & a_3 & a_3^2 & a_3^3\\ 
1 & a_4 & a_4^2 & a_4^3\\ 
1 & a_5 & a_5^2 & a_5^3
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
x_2\\
x_3\\
\end{bmatrix} = \mathbf b$

Again we may be confident that the columns are linearly independent because our new column -- cubing $\mathbf a$ -- is not a linear transformation (or alternatively, using the Hadamard product is not a linear transformation), so we write:

$\mathbf a \circ \mathbf a \circ \mathbf a \neq \gamma_0 \mathbf 1 + \gamma_1 \mathbf a + \gamma_2 \big(\mathbf a \circ \mathbf a\big)$

And if the above is *still* not enough, we may add a term to the fourth power:

$\mathbf{Ax} = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3 & a_1^4\\ 
1 & a_2 & a_2^2 & a_2^3 & a_2^4\\ 
1 & a_3 & a_3^2 & a_3^3 & a_3^4\\ 
1 & a_4 & a_4^2 & a_4^3 & a_4^4\\ 
1 & a_5 & a_5^2 & a_5^3 & a_5^4
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
x_2\\
x_3\\
x_4\\
\end{bmatrix} = \mathbf b$

Again we are quite confident that the above has full column rank because

$\mathbf a \circ \mathbf a \circ \mathbf a \circ \mathbf a \neq \gamma_0 \mathbf 1 + \gamma_1 \mathbf a + \gamma_2 \big(\mathbf a \circ \mathbf a\big) + \gamma_3 \big(\mathbf a \circ \mathbf a \circ \mathbf a \big)$

We may be tempted to go to an even higher dimensional space at this point, but this requires considerable justification.  Notice that $\mathbf A$ is a square matrix now, and as we've argued, it has full column rank -- which means it also has full row rank.  Thus we can be sure to solve the above equation for a single, exact solution, where $\mathbf x = \mathbf A^{-1}\mathbf b$.  If we were to go to a higher dimensional space we would be entering the world of an underdetermined system of equations -- see postings titled "Underdetermined_System_of_Equations.ipynb" for the L2 norm oriented solution, and "underdetermined_regression_minimize_L1_norm.ipynb" for the L1 norm oriented solution.  Since we can already be certain of solving for a single exact solution in this problem, we will stop mapping to higher dimensions here.  
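The progression above can be checked numerically. The sketch below assumes some made-up distinct samples in `a` and targets in `b`, and uses `np.vander` with `increasing=True` so the columns appear in the same order as in the text ($\mathbf 1, \mathbf a, \mathbf a \circ \mathbf a, \dots$).

```python
import numpy as np

a = np.array([-1.0, 0.5, 1.0, 2.0, 3.0])   # hypothetical distinct samples
b = np.array([2.0, -1.0, 0.5, 4.0, 1.5])   # hypothetical targets

# Rank of the 5 x k matrix with columns 1, a, a^2, ..., a^(k-1), for k = 1..5.
ranks = [int(np.linalg.matrix_rank(np.vander(a, N=k, increasing=True)))
         for k in range(1, 6)]

# The square 5 x 5 case: a single exact solution exists.
A = np.vander(a, N=5, increasing=True)
x = np.linalg.solve(A, b)            # solve directly instead of forming A^{-1}
max_residual = np.max(np.abs(A @ x - b))
```

Each added power raises the rank by one, and in the square case the residual should be zero up to floating point error.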

In the above equation of $\mathbf{Ax} = \mathbf b$, the square $\mathbf A$ is a Vandermonde matrix.  Technical note: some texts say that $\mathbf A$ is the Vandermonde matrix, while others say $\mathbf A^T$ is the Vandermonde matrix.  The calculation of the determinant is identical, and for other properties, a mere small book-keeping adjustment is required.
  
Note that the Vandermonde matrix is well studied, has special fast matrix vector multiplication (i.e. $\lt O(n^2)$) algorithms associated with it -- and a very special type of Vandermonde matrix is the Discrete Fourier Transform matrix.  The Vandermonde matrix  also has some very interesting properties for thinking about eigenvalues. 


There is another, more exacting way to verify that $\mathbf A$ is full rank.  Let's look at the determinant of $\mathbf A^T$.  There are a few different ways to prove this.  Sergei Winitzki had an interesting proof using wedge products -- that I may revisit at some point in the future.  


# Begin Look at Vandermonde Matrices

For some real valued Vandermonde matrix $\mathbf A$, or its transpose, we can say the following:

(note the book-keeping required to evaluate this as a complex matrix, is just a very small alteration)


$\mathbf A^T  = \begin{bmatrix}
1 & 1 & 1 & \dots & 1 & 1\\ 
a_1 & a_2 & a_3 & \dots & a_{n-1} & a_n \\ 
a_1^2 & a_2^2 & a_3^2 & \dots & a_{n-1}^2 & a_{n}^2\\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
a_{1}^{n-2} & a_{2}^{n-2} & a_{3}^{n-2} & \dots & a_{n-1}^{n-2} & a_{n}^{n-2}\\
a_{1}^{n-1} & a_{2}^{n-1} & a_{3}^{n-1} & \dots & a_{n-1}^{n-1} & a_{n}^{n-1}
\end{bmatrix} $

For now, I'll just notice that there is a rather obvious 'pattern' to these Vandermonde matrices, so we'll do the proof using mathematical induction, which takes advantage of this pattern / progression in polynomial terms.



**claim**: 

for natural number $n \geq 2$ where $\mathbf A \in \mathbb R^{n \times n}$, and $\mathbf A$ is a Vandermonde matrix, 

$det \big(\mathbf A \big) = det \big(\mathbf A^T \big) = \prod_{1 \leq i \lt j \leq n} (a_j - a_i)$

*Base Case:* 

$n = 2$

$\mathbf A^T = \begin{bmatrix}
1 & 1\\ 
a_1 & a_2
\end{bmatrix}$

$det \big(\mathbf A^T \big) = (a_2 - a_1) = \prod_{1 \leq i \lt j \leq n} (a_j - a_i)$

*sneak peek:*  
if we follow the row operation procedure used during the inductive case, what we'd have is:

$det \big(\mathbf A^T \big) = det\Big(\begin{bmatrix}
1 & 1\\ 
0 & (a_2 - a_1)
\end{bmatrix}\Big) = 1*(a_2 - a_1)$


*Inductive Case:*

For $n \gt 2$, assume the formula is true where $\mathbf C \in \mathbb R^{(n-1) \times (n-1)}$

i.e. assume true where 

$\mathbf C = \begin{bmatrix}
1 & 1 & 1 & \dots & 1\\ 
a_1 & a_2 & a_3 & \dots & a_{n-1}\\ 
a_1^2 & a_2^2 & a_3^2 & \dots & a_{n-1}^2\\ 
\vdots & \vdots & \vdots & \ddots & \vdots\\ 
a_{1}^{n-2} & a_{2}^{n-2} & a_{3}^{n-2} & \dots & a_{n-1}^{n-2}
\end{bmatrix}$

Note that we call this submatrix $\mathbf C$ -- it will make a reappearance shortly!


We need to show that the formula holds true where dimension of $\mathbf A$ is $n$ x $n$. Thus consider the case where:

$\mathbf A^T  = \begin{bmatrix}
1 & 1 & 1 & \dots & 1 & 1\\ 
a_1 & a_2 & a_3 & \dots & a_{n-1} & a_n \\ 
a_1^2 & a_2^2 & a_3^2 & \dots & a_{n-1}^2 & a_{n}^2\\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
a_{1}^{n-2} & a_{2}^{n-2} & a_{3}^{n-2} & \dots & a_{n-1}^{n-2} & a_{n}^{n-2}\\
a_{1}^{n-1} & a_{2}^{n-1} & a_{3}^{n-1} & \dots & a_{n-1}^{n-1} & a_{n}^{n-1}
\end{bmatrix} $

**Procedure:**
Subtract $a_1$ times row $i - 1$ from row $i$, for $0 \lt i \leq n-1$ (rows are indexed from zero), **starting from the bottom of the matrix and working our way up** (i.e. the operations / subproblems do not overlap in this regard).  

- - - - - 
**Justification:**

First, the reason we'd like to do this is because we see an obvious pattern in the polynomial progression in each column of $\mathbf A^T$.  Thus by following this procedure, we can zero out all entries in the zeroth column of $\mathbf A^T$ except the 1 located in the top left (i.e. in position $(0,0)$).  This will allow us to, in effect, reduce our problem to the $(n-1) \times (n-1)$ dimensional case.

Also recall that the determinant of $\mathbf A^T$ is equivalent to the determinant of $\mathbf A$. Thus the above procedure is equivalent to subtracting a scaled version of column 0 of the original $\mathbf A$ from column 1, and a scaled version of column 1 in the original $\mathbf A$ from column 2, and so on.  These are standard operations that are well understood to not change the calculated determinant over any legal field. 

Since your author particularly likes Gram–Schmidt and orthogonality, there is an additional more visual interpretation that can be used over inner product spaces (i.e. real or complex fields).  Consider that $\mathbf A = \mathbf{QR}$, thus $det \big(\mathbf A \big) = det \big(\mathbf{QR} \big) = det \big(\mathbf{Q} \big)det \big(\mathbf{R} \big)$.  Notice that these column operations will have no impact on $\mathbf Q$, and will only change the value of entries above the diagonal in $\mathbf R$, thus there is no change in $det \big(\mathbf{Q} \big)$ or $det \big(\mathbf{R} \big)$ (which is given by the product of its diagonal entries).  This means there is no change in $det \big(\mathbf{A} \big)$.


- - - - - 

$ = \begin{bmatrix}
1 & 1 & 1 & \dots & 1 & 1\\ 
0 & a_2 - a_1 & a_3 - a_1 & \dots & a_{n-1} - a_1 & a_n - a_1 \\ 
0 & a_2^2 - a_1 a_2 & a_3^2 - a_1 a_3 & \dots & a_{n-1}^2 - a_1 a_{n-1} & a_{n}^2 - a_1 a_{n}\\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
0 & a_{2}^{n-2} - a_1 a_{2}^{n-3} & a_{3}^{n-2} - a_1 a_{3}^{n-3} & \dots & a_{n-1}^{n-2} - a_1 a_{n-1}^{n-3} & a_{n}^{n-2} - a_1 a_{n}^{n-3}\\
0 & a_{2}^{n-1} - a_1 a_2^{n-2} & a_{3}^{n-1} - a_1 a_3^{n-2}& \dots & a_{n-1}^{n-1} -  a_1 a_{n-1}^{n-2}& a_{n}^{n-1} - a_1 a_{n}^{n-2}
\end{bmatrix} $

$ = \begin{bmatrix}
1 & 1 & 1 & \dots & 1 & 1\\ 
0 & (a_2 - a_1) 1 & (a_3 - a_1)1 & \dots & (a_{n-1} - a_1) 1 & (a_n - a_1) 1 \\ 
0 & (a_2 - a_1) a_2 & (a_3 - a_1) a_3 & \dots & (a_{n-1} - a_1) a_{n-1} & (a_n - a_1) a_{n}\\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
0 & (a_2 - a_1)a_{2}^{n-3} & (a_3 - a_1)a_{3}^{n-3} & \dots & (a_{n-1} - a_1)a_{n-1}^{n-3} & (a_n - a_1)a_{n}^{n-3}\\
0 & (a_2 - a_1)a_{2}^{n-2} & (a_3 - a_1)a_{3}^{n-2} & \dots & (a_{n-1} - a_1)a_{n-1}^{n-2} & (a_n - a_1)a_{n}^{n-2} 
\end{bmatrix}  $

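we can rewrite this as the block matrix below.  (Here $\mathbf C$ is understood to be built from $a_2, \dots, a_n$ rather than $a_1, \dots, a_{n-1}$; it is still an $(n-1) \times (n-1)$ Vandermonde matrix, so the inductive hypothesis applies to it.)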

$= \begin{bmatrix}
1 & \mathbf 1^T\\ 
\mathbf 0 & \mathbf{CD}
\end{bmatrix}$


where 

$\mathbf D = Diag\Big(\begin{bmatrix}
a_2 & a_3 & a_4 & \dots & a_n
\end{bmatrix}^T \Big) - a_1 \mathbf I =    \begin{bmatrix}
(a_2-a_1) & 0 &  0& \dots & 0\\ 
0 & (a_3 - a_1) &0  &\dots  &0 \\ 
0 & 0 & (a_4 - a_1) & \dots & 0\\ 
\vdots & \vdots & \vdots & \ddots & \vdots \\ 
0 & 0 & 0 & \dots & (a_n - a_1)
\end{bmatrix}$
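The row procedure and the resulting block structure can be sanity checked numerically. This is only a sketch under assumed values: the entries of `a` are made up, and `C` is built from $a_2, \dots, a_n$, which is what the factored entries $(a_j - a_1) a_j^{k}$ call for.

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 7.0, 11.0])   # hypothetical distinct values a_1..a_n
n = len(a)

# A^T: row k (zero-indexed) holds a_1^k, a_2^k, ..., a_n^k.
At = np.vander(a, N=n, increasing=True).T

# Subtract a_1 times row i-1 from row i, working from the bottom row upward,
# so every update uses a row above it that has not been modified yet.
M = At.copy()
for i in range(n - 1, 0, -1):
    M[i] -= a[0] * M[i - 1]

# Lower-right block should equal C @ D, where C is the (n-1) x (n-1)
# Vandermonde block on a_2..a_n and D = diag(a_2 - a_1, ..., a_n - a_1).
C = np.vander(a[1:], N=n - 1, increasing=True).T
D = np.diag(a[1:] - a[0])
block_matches = np.allclose(M[1:, 1:], C @ D)
```

The top row of `M` stays all ones, the rest of column 0 becomes zero, and the determinant of `M` matches that of `At`, since only multiples of one row were added to another.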

Note that $\begin{bmatrix}
1 & \mathbf 1^T\\ \mathbf 0 & \mathbf{CD} \end{bmatrix} - \lambda \begin{bmatrix}
1 & \mathbf 0^T\\ \mathbf 0 & \mathbf{I} \end{bmatrix} = \begin{bmatrix}
1 - \lambda & \mathbf 1^T\\ \mathbf 0 & \mathbf{CD } - \mathbf \lambda \mathbf I \end{bmatrix}$, which is not invertible when $\lambda := 1$ (because the left most column is all zeros).  

Hence we know that there is an eigenvalue of 1, given by the top left diagonal entry, associated with $\begin{bmatrix}
1 & \mathbf 1^T\\ 
\mathbf 0 & \mathbf{CD}
\end{bmatrix}$. We'll call this $\lambda_1$ -- for the first eigenvalue of the "MatrixAfterRowOperations".  

Thus the determinant can be written as 

$det\big(\mathbf A^T \big) = det\big(MatrixAfterRowOperations\big) = (\lambda_1) * (\lambda_2  * \lambda_3 * ... * \lambda_n\big) = (1) * \det\big(\mathbf{CD}\big) = \det\big(\mathbf{C}\big) \det\big(\mathbf{D}\big)$



- - - - -

**begin interlude** 

The fact that 
$det\big(\begin{bmatrix}
1 & \mathbf *\\ 
\mathbf 0 & \mathbf{Z}
\end{bmatrix}\big) = 1 * det\big(\mathbf{Z}\big)$

is well understood via properties of block matrices over many fields.  However, as is often the case, there is an additional interpretation over inner product spaces that makes use of orthogonality.  **This interlude is a bit overkill and may safely be skipped**.

Another way to think about this is to borrow from the Schur Decomposition $\mathbf X = \mathbf V \mathbf R \mathbf V^{H}$ where $\mathbf V$ is unitary and $\mathbf R$ is upper triangular.  Equivalently, $ \mathbf V^H \mathbf X  \mathbf V = \mathbf R$.  Also, we know the eigenvector associated with $\lambda_1$ (which is $\begin{bmatrix}1 \\ \mathbf 0 \\ \end{bmatrix}$) can be chosen to be the leftmost column of $\mathbf V$.  Since all columns in $\mathbf V$ are mutually orthonormal, all other columns must have a zero in the upper-most position.  Writing this out, and working through the blocked multiplication we get the following:

(note that $^H$ denotes conjugate transpose -- and of course if the values are real, then this acts like a regular transpose operation)

$\mathbf V^{H} \begin{bmatrix}
1 & \mathbf 1^H\\ 
\mathbf 0 & \mathbf{CD}
\end{bmatrix} \mathbf V = \mathbf R$

$\mathbf V^{H} \begin{bmatrix}
1 & \mathbf 1^H\\ 
\mathbf 0 & \mathbf{CD}
\end{bmatrix} \mathbf V = \begin{bmatrix}1 & \mathbf 0^H \\ 
\mathbf 0 & \mathbf{Q}
\end{bmatrix}^H \begin{bmatrix}
1 & \mathbf 1^H\\ 
\mathbf 0 & \mathbf{CD} \end{bmatrix} \begin{bmatrix}
1 & \mathbf 0^H \\ 
\mathbf 0 & \mathbf{Q}
\end{bmatrix}= \begin{bmatrix}1 & \mathbf 0^H \\ 
\mathbf 0 & \mathbf Q^H
\end{bmatrix} \Big(\begin{bmatrix}
1 & \mathbf 1^H\\ 
\mathbf 0 & \mathbf{CD} \end{bmatrix} \begin{bmatrix}
1 & \mathbf 0^H \\ 
\mathbf 0 & \mathbf{Q}
\end{bmatrix}\Big) $


$\mathbf V^{H} \begin{bmatrix}
1 & \mathbf 1^H\\ 
\mathbf 0 & \mathbf{CD}
\end{bmatrix} \mathbf V =  \begin{bmatrix} 1 & \mathbf 0^H \\ 
\mathbf 0 & \mathbf Q^H
\end{bmatrix} \Big(\begin{bmatrix}1 & \mathbf 1^H \mathbf Q \\ 
\mathbf 0 & \mathbf{CDQ}
\end{bmatrix}\Big) = \begin{bmatrix}1 & \mathbf 1^H \mathbf Q \\ 
\mathbf 0 & \mathbf{Q}^H\mathbf{ CDQ}
\end{bmatrix} = \begin{bmatrix}
1 & \mathbf 1^H \mathbf Q \\ 
\mathbf 0 & \mathbf{T}
\end{bmatrix}= \mathbf R$

Thus we know that the determinant we want comes from a similar matrix $\mathbf R$, whose determinant is the product of its eigenvalues (which are along its diagonal).  We further know that this is equal to $1 * det\big(\mathbf T\big) = 1 *det\big(\mathbf Q^H \mathbf{CDQ}\big) = det\big(\mathbf{Q}^H\big) det\big(\mathbf C\big)\det\big(\mathbf D\big) det\big(\mathbf Q\big) = det\big(\mathbf C\big)det\big(\mathbf D\big)$, via the fact that upper triangular matrix $\mathbf T = \mathbf Q^H \mathbf{CDQ}$, then applying multiplicative properties of determinants (and perhaps noticing that $\mathbf{CD}$ is similar to $\mathbf T$).

Thus $det\Big(\begin{bmatrix}
1 & \mathbf 1^H\\ 
\mathbf 0 & \mathbf{CD}
\end{bmatrix}\Big) = det\Big(\begin{bmatrix}
1 & \mathbf *\\ 
\mathbf 0 & \mathbf{CD}
\end{bmatrix}\Big) = det\big(\mathbf{CD}\big) = det\big(\mathbf{C}\big) det\big(\mathbf{D}\big)$

**end interlude**
- - - - -
We know that 

$\det\big(\mathbf{D}\big) = (a_2-a_1) * (a_3 - a_1) * ... * (a_n - a_1)$

because the determinant of a diagonal matrix is the product of its diagonal entries (i.e. its eigenvalues)  

and 

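$det \big(\mathbf C \big) = \prod_{2 \leq i \lt j \leq n} (a_j - a_i)$ 

by inductive hypothesis, because the block $\mathbf C$ that appears in $\mathbf{CD}$ is the $(n-1) \times (n-1)$ Vandermonde matrix built from $a_2, a_3, \dots, a_n$ (the same structure as the $\mathbf C$ displayed earlier, with every index shifted up by one).  
Thus we can say 

$ det\big(\mathbf A^T \big) = \big(\prod_{2 \leq i \lt j \leq n} (a_j - a_i)\big) \big((a_2-a_1) * (a_3 - a_1) * ... * (a_n - a_1)\big) = \prod_{1 \leq i \lt j \leq n} (a_j - a_i)$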

And the induction is proved.  

Finally, we note that $det \big(\mathbf A \big) = det \big(\mathbf A^T \big)$ because $\mathbf A$ and $\mathbf A^T$ have the same characteristic polynomials (or equivalently, they have the same eigenvalues). We have thus proved the determinant formula for $\mathbf A$.  
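A quick numerical check of the determinant formula, as a sketch only: the helper name `vandermonde_det` and the sample values are assumptions made for illustration, not something from the original posting.

```python
import numpy as np
from itertools import combinations

def vandermonde_det(a):
    """Product of (a_j - a_i) over all index pairs i < j."""
    return np.prod([a[j] - a[i] for i, j in combinations(range(len(a)), 2)])

a = np.array([2.0, 3.0, 5.0, 7.0, 11.0])      # hypothetical distinct values
A = np.vander(a, increasing=True)             # rows are 1, a_i, a_i^2, ...

det_numeric = np.linalg.det(A)                # general purpose LU based determinant
det_formula = vandermonde_det(a)              # closed form from the claim

# Repeating a value makes one factor (a_j - a_i) vanish, so the product is zero.
a_repeat = np.array([2.0, 3.0, 5.0, 3.0, 11.0])
det_repeat = vandermonde_det(a_repeat)
```

For these integer valued inputs the closed form is exact, while `np.linalg.det` agrees with it only up to floating point error.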

(Technical note: if $\mathbf A \in \mathbb C^{n \times n}$ then the above results still hold with respect to the magnitude of the determinant of $\mathbf A$.  This includes the very important special case of whether or not $\big\vert det\big(\mathbf A\big)\big\vert = 0$ -- i.e. whether or not $\mathbf A^{-1}$ exists.  However, with respect to the exact determinant, it would be more proper to state that $det\big(\mathbf A\big) = conjugate\Big(\det\big(\mathbf A^H\big)\Big)$.)
- - - -

This gives us another way to confirm that our Vandermonde Matrix is full rank.  We know that a square, finite dimensional matrix is singular iff it has a determinant of 0.  We then see that 

$\det \big(\mathbf A\big) = \big(\prod_{1 \leq i \lt j \leq n} (a_j - a_i)\big) = 0$ iff there is some $a_j = a_i$ where $i \neq j$.  

This of course is another way of saying that our Vandermonde Matrix is not full rank if some entry in our 'original' matrix of 

$\mathbf a = \begin{bmatrix}
a_1\\ 
a_2\\ 
a_3\\ 
a_4\\ 
a_5
\end{bmatrix}$

was not unique.  


- - - -
It is worth highlighting that if for some reason we did not like to explicitly use determinants, we could instead just repeatedly and recursively apply the above procedure as a type of Gaussian Elimination, and in the end we would have transformed $\mathbf{A}^T$ into the below Row Echelon form (where $*$ marks entries above the diagonal whose exact values are not needed here): 

$\begin{bmatrix}
1 & 1 & 1 &  \dots & 1 & 1\\
0 &(a_2-a_1) & * &   \dots & * & *\\ 
0& 0 & (a_3 - a_1)(a_3 - a_2)  &\dots & * & * \\ 
\vdots &\vdots & \vdots & \ddots & \vdots & \vdots \\ 
0&0 & 0 & \dots & \big(\prod_{1 \leq i \lt n-1} (a_{n-1} - a_i)\big) & *\\ 
0& 0 & 0 & \dots & 0 & \big(\prod_{1 \leq i \lt n} (a_{n} - a_i)\big)
\end{bmatrix}$

(Of course, we can immediately notice that the determinant formula can be recovered by multiplying the diagonal elements of the above matrix.)

It is instructive to realize that this upper triangular / row echelon matrix is invertible (and hence $\mathbf{Ax} = \mathbf b$ has exactly one solution $\mathbf x$) so long as we don't have any zeros on its diagonal.  We notice that this is the case if and only if all $a_i$ are unique.
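The repeated elimination can be sketched in a few lines. The loop structure below is one way to apply the procedure recursively (pass $p$ uses $a_{p+1}$ as the multiplier), and the sample values are made up; the closed form in `expected` is what working through the passes by hand suggests, namely that row $r$, column $j$ ends up as $\prod_{i \leq r} (a_j - a_i)$ with zero-indexed $r$.

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 7.0, 11.0])       # hypothetical distinct values
n = len(a)
U = np.vander(a, N=n, increasing=True).T.copy()  # start from A^T

# Pass p clears column p below the diagonal: subtract a_{p+1} times the row
# above from each of rows n-1, ..., p+1, working bottom-up.
for p in range(n - 1):
    for i in range(n - 1, p, -1):
        U[i] -= a[p] * U[i - 1]

# Closed form for comparison: entry (r, j) is the product of (a_j - a_i)
# over the first r values a_1..a_r (an empty product is 1).
expected = np.array([[np.prod(a[j] - a[:r]) for j in range(n)]
                     for r in range(n)])

diag_product = np.prod(np.diag(U))
```

The result is upper triangular, and the product of its diagonal recovers the determinant formula.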

- - - -



Furthermore, notice that this determinant formula gives us a proof that we have full column rank in any thinner (i.e. more rows than columns) version of our Vandermonde matrix.  E.g. consider the case of 


$\mathbf{A} = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3\\ 
1 & a_2 & a_2^2 & a_2^3\\ 
1 & a_3 & a_3^2 & a_3^3\\ 
1 & a_4 & a_4^2 & a_4^3\\ 
1 & a_5 & a_5^2 & a_5^3
\end{bmatrix}$


These columns must be linearly independent, so long as each $a_i \neq a_j$ where $i \neq j$.  If that was not the case, then appending additional columns until square (i.e. append $\mathbf a \circ \mathbf a \circ \mathbf a \circ \mathbf a$) would mean that 

$\mathbf A = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3 & a_1^4\\ 
1 & a_2 & a_2^2 & a_2^3 & a_2^4\\ 
1 & a_3 & a_3^2 & a_3^3 & a_3^4\\ 
1 & a_4 & a_4^2 & a_4^3 & a_4^4\\ 
1 & a_5 & a_5^2 & a_5^3 & a_5^4
\end{bmatrix} $

could not have full column rank either.  Yet we know this matrix is full rank via our determinant formula (again so long as each $a_i$ is unique) thus we know that the columns of any smaller  "long and skinny" version of this matrix must also be linearly independent.

Also, when each $a_i$ is unique, since we know that our Vandermonde matrix is full rank, we know that each of its rows is linearly independent.  If for some reason we had a 'short and fat' version of the above matrix, like:

$\mathbf A = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3 & a_1^4\\ 
1 & a_2 & a_2^2 & a_2^3 & a_2^4\\ 
1 & a_3 & a_3^2 & a_3^3 & a_3^4\\ 
\end{bmatrix} $

we would know that it is full row rank -- i.e. its rows are linearly independent.
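Both the "long and skinny" and the "short and fat" cases are easy to probe numerically. The values below are assumptions chosen for illustration; the last case deliberately repeats samples to show the rank dropping to the number of distinct values.

```python
import numpy as np

a = np.array([-1.0, 0.5, 1.0, 2.0, 3.0])        # hypothetical distinct samples

tall = np.vander(a, N=4, increasing=True)       # 5 x 4: columns 1, a, a^2, a^3
fat = np.vander(a[:3], N=5, increasing=True)    # 3 x 5: three rows, powers up to 4

rank_tall = int(np.linalg.matrix_rank(tall))
rank_fat = int(np.linalg.matrix_rank(fat))

# Only three distinct values among five samples: the 5 x 4 matrix loses rank.
a_dup = np.array([1.0, 1.0, 2.0, 2.0, 3.0])
rank_dup = int(np.linalg.matrix_rank(np.vander(a_dup, N=4, increasing=True)))
```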



**Implication: A degree n-1 polynomial is completely given by n unique data points**

Assuming there is no noise in the data -- or numeric precision issues -- the Vandermonde matrix, $\mathbf A$, allows you to solve for the unique coefficients $x_0, \dots, x_4$ of the polynomial


$x_0 * 1 + x_1 a + x_2 a^2 + x_3 a^3 + x_4 a^4 = b$


- - - - -
$\mathbf{Ax} = \begin{bmatrix}
1 & a_1 & a_1^2 & a_1^3 & a_1^4\\ 
1 & a_2 & a_2^2 & a_2^3 & a_2^4\\ 
1 & a_3 & a_3^2 & a_3^3 & a_3^4\\ 
1 & a_4 & a_4^2 & a_4^3 & a_4^4\\ 
1 & a_5 & a_5^2 & a_5^3 & a_5^4
\end{bmatrix} \begin{bmatrix}
x_0\\
x_1\\ 
x_2\\
x_3\\
x_4\\
\end{bmatrix} = \mathbf b$
- - - - -
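To make the implication concrete, here is a small sketch that starts from known coefficients, generates noise-free point values, and recovers the coefficients from those points alone. The coefficients in `x_true` and the sample locations in `a` are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as poly

x_true = np.array([1.0, -2.0, 0.0, 3.0, 0.5])   # hypothetical x_0..x_4
a = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])       # five distinct sample locations

A = np.vander(a, N=5, increasing=True)
b = A @ x_true                                  # noise-free observations

# Five distinct points pin down the degree 4 polynomial exactly.
x_recovered = np.linalg.solve(A, b)

# Cross-check: NumPy's polyval also takes coefficients in increasing order.
b_check = poly.polyval(a, x_true)
```

With noisy `b` or badly spaced sample locations the recovered coefficients can drift a lot, which is the numeric precision caveat mentioned above.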

The next extension is perhaps a bit more interesting.




**Extension: two ways to think about polynomials** 

Knowing that we can exactly specify a degree $n-1$ polynomial with $n$ distinct data points leads us to wonder:

is it 'better' to think about polynomials with respect to the coefficients or the data points?  In the above vector form -- the question becomes is it better to think about the polynomial in terms of $\mathbf x$ or $\mathbf b$? 

The answer is-- it depends.  To directly evaluate a function is much quicker when we know $\mathbf x$.  But as it turns out, when we want to multiply or convolve polynomials, it is considerably faster to know their point values contained in $\mathbf b$.  
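The two representations can be compared on a small made-up example. This sketch only illustrates that multiplying point values and then interpolating gives the same answer as convolving coefficients; it says nothing about speed, since a generic Vandermonde solve is not fast. The polynomials `p` and `q` and the sample locations are assumptions.

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])     # 1 + 2a + 3a^2
q = np.array([4.0, 0.0, -1.0])    # 4 - a^2

# The product has degree 4, so we need 5 distinct points.
m = len(p) + len(q) - 1
a = np.arange(m, dtype=float) - 2.0          # -2, -1, 0, 1, 2
A = np.vander(a, N=m, increasing=True)

# Point values of each polynomial (coefficients zero-padded to length m).
bp = A @ np.pad(p, (0, m - len(p)))
bq = A @ np.pad(q, (0, m - len(q)))

b_prod = bp * bq                              # multiply in point-value form
x_prod = np.linalg.solve(A, b_prod)           # interpolate back to coefficients

x_conv = np.convolve(p, q)                    # multiply in coefficient form
```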

And since the Vandermonde matrix is so helpful for encapsulating all of our knowledge about a polynomial, a natural question is -- what if we wanted to make multiplying $\mathbf A^{-1} \mathbf b$ to get $\mathbf x$ at least as easy as just multiplying $\mathbf{Ax}$ to get $\mathbf b$?  The clear answer would mean finding a way so that you don't have to explicitly invert $\mathbf A$.  This can be done most easily if $\mathbf A$ is unitary (i.e. orthogonal albeit in a complex inner product space), hence $\mathbf A^H = \mathbf A^{-1}$.  If $\mathbf A$ is unitary, this directly leads us to the Discrete Fourier Transform.  (And from there to the Fast Fourier Transform which is widely regarded as one of the top 10 algorithms of the last 100 years.)

But first, let's work through a couple of important related ideas where we can apply Vandermonde matrices: (a) square matrices that have unique eigenvalues must be diagonalizable and (b) some interesting cyclic and structural properties underlying Permutation matrices. 



*small formatting note*: in bold face LaTeX, the capital A, $\mathbf A$, looks very similar to the capital Lambda, given by $\mathbf \Lambda$, which is a diagonal matrix with eigenvalues $\lambda_k$, along the diagonal.  The rest of this posting will stop using capital A for an operator, accordingly.


**Application of Vandermonde Matrices: Proof of Linear Independence of Eigenvectors associated with Unique Eigenvalues**

This proves that if a square (finite dimensional) matrix --aka an operator --has all eigenvalues that are unique, then the eigenvectors must be linearly independent.  Put differently, this proves that such an operator is diagonalizable.  

The typical argument for linear independence is in fact a touch shorter than this and does not need Vandermonde matrices -- however it relies on a contradiction that is not particularly satisfying.  The following proof -- adapted from Winitzki's *Linear Algebra via Exterior Products* -- is direct and, to your author, very intuitive.

Consider a matrix $\mathbf B \in \mathbb C^{n \times n}$, which has $n$ unique eigenvalues -- i.e. $\lambda_i \neq \lambda_k$ whenever $i \neq k$.

When testing for linear independence, we consider

$\gamma_1 \mathbf v_1 + \gamma_2 \mathbf v_2 + ... + \gamma_n \mathbf v_n = \mathbf 0$  

and say that **the eigenvectors are linearly independent iff** the only solution is $\gamma_1 = \gamma_2 = ... = \gamma_n = 0$

Further, for $k \in \{1, 2, ..., n\}$, we know that  
$\mathbf v_k  = \mathbf v_k$  
$\mathbf B \mathbf v_k = \lambda_k \mathbf v_k$  
$\mathbf B \mathbf B \mathbf v_k = \mathbf B^2 \mathbf v_k = \lambda_k^2 \mathbf v_k$  
$\vdots $  

$\mathbf B^{n-1} \mathbf v_k = \lambda_k^{n-1} \mathbf v_k$  


Thus we can take our original linear independence test,

$\gamma_1 \mathbf v_1 + \gamma_2 \mathbf v_2 + ... + \gamma_n \mathbf v_n = \mathbf 0$  

and left multiply by $\mathbf B^r$ and get the following equalities, as well: 

$\mathbf B^r \mathbf 0 = \mathbf 0 =  \mathbf B^r \big(\gamma_1 \mathbf v_1 + \gamma_2 \mathbf v_2 + ... + \gamma_n \mathbf v_n\big) =  \lambda_1^r \gamma_1 \mathbf v_1 + \lambda_2^r  \gamma_2 \mathbf v_2 + ... + \lambda_n^r \gamma_n \mathbf v_n $  

for $r \in \{1, 2, ..., n-1\}$
- - - -

Now let's collect these $n$ relationships in a system of equations:


$\bigg[\begin{array}{c|c|c|c}
\gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg] \mathbf W = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$


where 

$\mathbf W = \begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{n-1}\\ 
1 & \lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{n-1} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots \\ 
1 & \lambda_{n} & \lambda_{n}^{2} & \dots  & \lambda_{n}^{n-1}
\end{bmatrix}$


Notice that $\mathbf W$ is a Vandermonde matrix. Since $\lambda_i \neq \lambda_k$ if $i \neq k$, we know that $det \big(\mathbf W\big) \neq 0$, and thus $\mathbf W^{-1}$ exists as a unique operator.  We multiply each term on the right by $\mathbf W^{-1}$.  

$\bigg[\begin{array}{c|c|c|c}
\mathbf \gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg]
 \mathbf W \mathbf W^{-1}= \bigg[\begin{array}{c|c|c|c}
\gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg]\mathbf I = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg] \mathbf W^{-1} = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$  


Thus we know that 

$\bigg[\begin{array}{c|c|c|c}
\mathbf \gamma_1 \mathbf v_1 & \gamma_2 \mathbf v_2 &\cdots & \gamma_n \mathbf v_n
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$

By definition each eigenvector $\mathbf v_k \neq \mathbf 0$.  This means that each scalar $\gamma_k = 0$.  The eigenvectors have thus been proven to be linearly independent in the case where eigenvalues are unique.
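The key identity in the proof, that column $r$ of $\big[\gamma_1 \mathbf v_1 \vert \cdots \vert \gamma_n \mathbf v_n\big] \mathbf W$ equals $\mathbf B^r$ applied to $\sum_k \gamma_k \mathbf v_k$, can be checked numerically. The matrix `B` (upper triangular with distinct diagonal entries, hence distinct eigenvalues) and the coefficients `gamma` are made up for this sketch.

```python
import numpy as np

# Hypothetical matrix with distinct eigenvalues 1, 2, 3, 4 on its diagonal.
B = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0, 0.0],
              [0.0, 0.0, 3.0, 5.0],
              [0.0, 0.0, 0.0, 4.0]])
n = B.shape[0]
lam, V = np.linalg.eig(B)                    # columns of V are eigenvectors

W = np.vander(lam, N=n, increasing=True)     # row k is 1, lam_k, ..., lam_k^(n-1)

gamma = np.array([0.5, -1.0, 2.0, 3.0])      # arbitrary coefficients
G = V * gamma                                # column k of V scaled by gamma_k
s = G.sum(axis=1)                            # sum_k gamma_k v_k

lhs = G @ W
rhs = np.column_stack([np.linalg.matrix_power(B, r) @ s for r in range(n)])

det_W = np.linalg.det(W)
rank_V = int(np.linalg.matrix_rank(V))
```

Since `det_W` is nonzero, `G @ W` can only be the zero matrix when `G` itself is zero, which is exactly the step the proof relies on.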


# Permutation Matrices and Periodic Behavior (introducing DFT)

Consider an $n$ x $n$ permutation matrix $\mathbf P$.  Note that this matrix is real valued, and each column (and each row) has all zeros except for a single 1.

We'll use the $^H$ to denote conjugate transpose (even though the matrix is entirely real valued), as some imaginary numbers will creep in later.


It is easy to verify that the permutation matrix is unitary (in fact a special case, since it is real and hence orthogonal), i.e. that 

$\mathbf P^H \mathbf P = \mathbf I$

because, by construction, each column in a permutation matrix has all zeros, except a single 1, and the permutation matrix is full rank -- hence the columns must be mutually orthonormal.

Further, as mentioned in "Schurs_Inequality.ipynb", such a matrix can be diagonalized where   

$\mathbf P = \mathbf{V\Lambda V}^H$

where $\mathbf V$ is unitary, each eigenvalue $\lambda_i$ is contained along the diagonal of $\mathbf \Lambda$ and is on the unit circle.  

Notice that for any permutation matrix: 

$\mathbf {P1} = \mathbf 1$

Hence such a permutation matrix has $\lambda_1 = 1$ (i.e. a permutation matrix is stochastic -- in fact doubly so).  

Because $\mathbf P$ has all zeros, except a single 1 in each column (or equivalently, in each row), it can be interpreted as a special kind of Adjacency Matrix for a directed graph. 
- - - - 
Of particular interest is **the permutation matrix that relates to a connected graph** (i.e. where each node is reachable from each other node) with $n$ nodes.  One example, where $n = 6$ is the below:

<table style="background-color: white"><tr><td><img src='images/permutation_graph_matrix.gif' style="width: 100%; background-color: white;"></td><td><img src='images/permutation_matrix_graph.png' style="width: 50%; background-color: white;"></td></tr></table>



*Claim:*  
For a permutation matrix associated with a connected graph (i.e. where each node may be visited from each other node), the time it takes to repeat a visit to a node is $n$ iterations. 

*Proof:*  
Since each node in the directed graph has an outbound connection to only one other node, and there are $n$ nodes total, if a cycle can occur in $\leq n - 1$ iterations, then the number of nodes you can reach from the starting node (including itself) is $\leq n - 1$ nodes, and hence the graph is not connected -- a contradiction. (And for avoidance of doubt, if it took $\geq n + 1$ iterations to visit the starting node, then that would mean after $n$ iterations, you've visited (at least) one of the $n-1$ nodes, other than the starting one, more than once, which means there is a cycle in the graph of length $\leq n-1$, which is a contradiction, as outlined above.  This second part in many ways is not needed -- we have many other tools to deal with linear dependence after $n$ iterations.  The point is that this type of graph, by construction, has periodicity equal to its number of nodes -- i.e. all cycles take $n$ iterations.)


Thus we can say that $\mathbf P^0 = \mathbf P^n = \mathbf I$. So $trace\big(\mathbf P^0\big) = trace\big(\mathbf P^n\big) = trace\big(\mathbf I\big) = n$.  

However, the diagonal entries of $\mathbf P^k$ are all zero for $k \in \{1, 2, ..., n-2, n-1\}$. Thus we have:

$trace\big(\mathbf P^k\big) = 0$
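These properties can be confirmed on a concrete example. The sketch below assumes one particular choice of $\mathbf P$, the cyclic shift that sends node $j$ to node $j+1$ (mod $n$), which may differ from the matrix shown in the figure; any permutation matrix whose graph is a single $n$-cycle should behave the same way.

```python
import numpy as np

n = 6
I = np.eye(n)

# Cyclic shift: rolling the rows of I down by one gives P[i, i-1] = 1,
# so P maps e_j to e_{j+1}, wrapping around after n steps.
P = np.roll(I, 1, axis=0)

is_orthogonal = np.allclose(P.T @ P, I)
fixes_ones = np.allclose(P @ np.ones(n), np.ones(n))   # eigenvalue 1, eigenvector 1

# Traces of P^0, P^1, ..., P^n.
powers = [np.linalg.matrix_power(P, k) for k in range(n + 1)]
traces = [int(round(np.trace(Pk))) for Pk in powers]

# Eigenvalues of a unitary matrix sit on the unit circle.
eigvals = np.linalg.eigvals(P)
```

Only the zeroth and $n$th powers have a nonzero trace, matching the periodicity argument above.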

*note: the reader may wonder why I chose a permutation matrix associated with a connected graph -- this post was supposed to be about Vandermonde matrices! The core reasons are simple -- it is a special type of unitary (or orthogonal) matrix, it has a simple visual representation via graph theory, and its trace is extremely easy to compute.  On top of that there is a messier, related reason -- I was inspired by the n iterations cycle that is implied in the proof of linear independence of eigenvectors associated with unique eigenvalues which used a Vandermonde matrix, and I had recently used a connected graph 3 x 3 permutation matrix (and its eigenvalues) to explain to someone that 3 complex numbers on the unit circle must be equidistant when there is a constraint they all sum to zero  -- I knew that the first eigenvalue had to be one because the matrix is stochastic, I knew the trace was 0, and I knew that for a real valued matrix, complex eigenvalues come in conjugate pairs and hence in the 3 x 3 case they had to be evenly spaced on the unit circle.  In many respects this posting grew out of my attempt to generalize ideas from that conversation, plus a growing interest I had in Vandermonde matrices and matrices used in convolutions. I chose to use permutation matrices associated with a connected graph for those reasons and was pleasantly surprised that I was able to derive the DFT and all of its properties in full, just using this permutation matrix and spectral theory, for an arbitrary $n$ x $n$ dimensional case.* 

Now consider the standard basis vectors $\mathbf e_j$, where $j \in \{1, 2, ..., n-1, n\}$ -- i.e. column slices of the identity matrix, shown below:

$\bigg[\begin{array}{c|c|c|c}
\mathbf e_1 & \mathbf  e_2 &\cdots &\mathbf  e_n
\end{array}\bigg] = \mathbf I$

Each one of these vectors is a valid starting position for having a 'presence' on exactly one node of the graph. With no loss of generality, we could choose just one of the standard basis vectors to be our starting point -- e.g. set $\mathbf e_j := \mathbf e_1$.  However, we'll keep the notation $\mathbf e_j$, though the reader may decide to select a specific standard basis vector if helpful.

Since we have a connected graph and can only be in one position at a time as we iterate through, we know that $\mathbf P^k \mathbf e_j \perp \mathbf P^r \mathbf e_j$,

for integers $r$, $k$ $\in \{0, 1, 2, ..., n-2, n-1\}$, where $r \neq k$

Thus we can collect each location in the graph in an $n$ x $n$ matrix as below:

$\bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \mathbf P^2\mathbf e_j &\cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c|c}
\mathbf V \mathbf \Lambda^0 \mathbf V^H\mathbf e_j & \mathbf V \mathbf \Lambda^1 \mathbf V^H\mathbf e_j & \mathbf V \mathbf \Lambda^2 \mathbf V^H\mathbf e_j &\cdots & \mathbf V \mathbf \Lambda^{n-1} \mathbf V^H\mathbf e_j
\end{array}\bigg]$



$\bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \mathbf P^2\mathbf e_j &\cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg] = \mathbf V \bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf V^H\mathbf e_j & \mathbf \Lambda^1 \mathbf V^H\mathbf e_j & \mathbf \Lambda^2 \mathbf V^H\mathbf e_j &\cdots & \mathbf \Lambda^{n-1} \mathbf V^H\mathbf e_j
\end{array}\bigg]$

Now left multiply each side by the full rank, unitary matrix $\mathbf V^H$, and for notational simplicity, let $\mathbf y := \mathbf V^H\mathbf e_j$


$\mathbf V^H \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \mathbf P^2\mathbf e_j &\cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf y & \mathbf \Lambda^1 \mathbf y & \mathbf \Lambda^2 \mathbf y &\cdots & \mathbf \Lambda^{n-1} \mathbf y
\end{array}\bigg]$

For each column vector on the right hand side, we have $\mathbf \Lambda^m \mathbf y$.  In various forms this can be written as 

$\mathbf \Lambda^m \mathbf y =  \mathbf \Lambda^m \big(\mathbf{Diag}\big(\mathbf y\big)\mathbf 1\big) = \mathbf{Diag}\big(\mathbf y\big) \mathbf \Lambda^m \mathbf 1 =
\mathbf{Diag}\big(\mathbf y\big)\big(\mathbf \Lambda^m \mathbf 1\big)$

Thus we can say: 

$\bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf y & \mathbf \Lambda^1 \mathbf y & \mathbf \Lambda^2 \mathbf y &\cdots & \mathbf \Lambda^{n-1} \mathbf y
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c|c}
\mathbf{Diag}\big(\mathbf y\big)\mathbf \Lambda^0 \mathbf 1 & \mathbf{Diag}\big(\mathbf y\big)\mathbf \Lambda^1 \mathbf 1 & \mathbf{Diag}\big(\mathbf y\big)\mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf{Diag}\big(\mathbf y\big)\mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg] $



$\bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf y & \mathbf \Lambda^1 \mathbf y & \mathbf \Lambda^2 \mathbf y &\cdots & \mathbf \Lambda^{n-1} \mathbf y
\end{array}\bigg] = \mathbf{Diag}\big(\mathbf y\big)\bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg] $


We make this substitution and see: 

$\mathbf V^H \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \mathbf P^2\mathbf e_j &\cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg] = \mathbf{Diag}\big(\mathbf y\big)\bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg] $
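The factorization above can be checked numerically. The sketch below is only an illustration under stated assumptions: $\mathbf P$ is taken to be the cyclic shift, $\mathbf V$ and the $\lambda_i$ come from `np.linalg.eig` (which, for this normal matrix with distinct eigenvalues, returns numerically orthonormal eigenvectors), and `n = 5`, `j = 0` are arbitrary choices.

```python
import numpy as np

n, j = 5, 0  # illustrative size and (0-based) starting basis vector

P = np.roll(np.eye(n), 1, axis=0)   # assumed concrete P: cyclic shift
lam, V = np.linalg.eig(P)           # P = V diag(lam) V^H

e_j = np.eye(n)[:, j]
# Columns P^0 e_j, P^1 e_j, ..., P^(n-1) e_j
X = np.column_stack([np.linalg.matrix_power(P, k) @ e_j for k in range(n)])

y = V.conj().T @ e_j
# Row i of W is [1, lam_i, lam_i^2, ..., lam_i^(n-1)], i.e. the columns are Lambda^k 1
W = np.vander(lam, N=n, increasing=True)

lhs = V.conj().T @ X
rhs = np.diag(y) @ W
print(np.allclose(lhs, rhs))
```

The ordering of the eigenvalues returned by `eig` is arbitrary, but the identity holds for any consistent ordering of `lam` and the columns of `V`.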

From here we may notice that since the left hand side is full rank, the right hand side must be full rank as well. 

Actually, we know considerably more than this -- i.e. we know that the left hand side is unitary. 

where we have $\mathbf X = f(\mathbf e_j) = \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \mathbf P^2\mathbf e_j &\cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg]$

In general $\mathbf X = f(\mathbf s)$, is called a **circulant matrix**.  For now we confine ourselves to the case where $\mathbf s := \mathbf e_j$, though we'll loosen up this restriction at the end of this writeup. 

earlier we noted that:

$\mathbf P^k \mathbf e_j \perp \mathbf P^r \mathbf e_j$


and of course 

$\big \Vert \mathbf P^m \mathbf e_j\big \Vert_2^2 = \big(\mathbf P^m \mathbf e_j\big)^H\big(\mathbf P^m \mathbf e_j\big) = \mathbf e_j^H \big(\mathbf P^m\big)^H  \mathbf P^m \mathbf e_j = \mathbf e_j^H \mathbf I \mathbf e_j = \mathbf e_j^H \mathbf e_j = 1$

Thus the columns of $\mathbf X$ are mutually orthonormal -- and $\mathbf X$ is $n$ x $n$ so it is a (real valued) unitary matrix. 
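To make this concrete, here is a small sketch that builds $\mathbf X = f(\mathbf e_j)$ and confirms its columns are orthonormal. The cyclic-shift $\mathbf P$ and the values `n = 5`, `j = 2` are assumptions chosen only for illustration.

```python
import numpy as np

n, j = 5, 2  # illustrative size and (0-based) starting node

P = np.roll(np.eye(n, dtype=int), 1, axis=0)  # assumed concrete P: cyclic shift
e_j = np.eye(n, dtype=int)[:, j]

# Each column records where the 'presence' sits after k steps through the graph.
X = np.column_stack([np.linalg.matrix_power(P, k) @ e_j for k in range(n)])
print(X)

# X is real valued, so X^H = X^T
print(np.array_equal(X.T @ X, np.eye(n, dtype=int)))  # True
```

Every column is a different standard basis vector, so $\mathbf X$ is itself a permutation matrix.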

--

From here we see

$\big(\mathbf V^H \mathbf X\big)^H \mathbf V^H \mathbf X = \mathbf X^H \big(\mathbf V \mathbf V^H\big) \mathbf X = \mathbf X^H \mathbf X = \mathbf I$

So we know that the left hand side is unitary.  This means that the right hand side must be unitary as well. 

Since the right hand side is unitary, that means it must be non-singular.  

Note that with respect to determinants, we could say:

$ \Big \vert Det\Big(\mathbf{Diag} \big(\mathbf y\big)\Big)\Big \vert*\Big \vert Det\Big( \bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg]\Big) \Big \vert = 1$

Thus 

$Det\Big( \bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg]\Big) \neq 0$

Finally, we 'unpack' this matrix and see that

$\bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg]=\begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{n-1}\\ 
1 & \lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{n-1} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots \\ 
1 & \lambda_{n} & \lambda_{n}^{2} & \dots  & \lambda_{n}^{n-1}
\end{bmatrix}$

This is the Vandermonde matrix, which is non-singular **iff**  each $\lambda_i$ is unique.  Thus we conclude that each $\lambda_i$ for our Permutation matrix of a connected graph must be unique.

**claim:**

$\bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg]^H \bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg] = n \mathbf I$

That is, the columns in the above matrix are mutually orthogonal (aka have an inner product of zero), and subject to some normalizing scalar constant, we know that the matrix is unitary. 

**proof:**  
First notice that each column has a squared L2 norm of $n$

for $m = \{0, 1, 2, ..., n-1\}$

$\big(\mathbf \Lambda^m \mathbf 1\big)^H \mathbf \Lambda^m \mathbf 1 = \mathbf 1^H \big(\mathbf \Lambda^m\big)^H \mathbf \Lambda^m \mathbf 1 = \mathbf 1^H \big(\mathbf \Lambda^m\big)^{-1} \mathbf \Lambda^m \mathbf 1 = \mathbf 1^H \big(\mathbf I\big) \mathbf 1 = trace \big(\mathbf I\big)  = n$ 

note that when we say $\mathbf 1^H \big(\mathbf I\big) \mathbf 1 = trace \big(\mathbf I\big)$, we notice first that $\mathbf 1^H \big(\mathbf Z\big) \mathbf 1$, means to sum up all entries in some operator $\mathbf Z$, and if $\mathbf Z$ is a diagonal matrix, then this is equivalent to just summing the entries along the diagonal of $\mathbf Z$ which is equal to the trace of $\mathbf Z$.

Next we want to prove that the inner product of any column $j$ with some other column $\neq j$, is zero.

Thus we are interested in the cases of

$\big(\mathbf \Lambda^r \mathbf 1\big)^H \mathbf \Lambda^k \mathbf 1$ 

for all natural numbers $k$, $r$, *first* where $0\leq r \lt k \leq n-1$ and *second* where $0\leq  k \lt r \leq n-1$.

First we observe the $r \lt k$ case:

$\big(\mathbf \Lambda^r \mathbf 1\big)^H \mathbf \Lambda^k \mathbf 1 = \mathbf 1^H \big(\mathbf \Lambda^r\big)^H \mathbf \Lambda^{k} \mathbf 1 = \mathbf 1^H \Big(\big( \mathbf \Lambda^{-r} \big) \mathbf \Lambda^{k}\Big) \mathbf 1 = \mathbf 1^H \Big( \mathbf \Lambda^{k - r}\Big) \mathbf 1 = trace\big(\mathbf \Lambda^{k - r}\big) $ 

since $k \gt r$, we know $0 \lt k - r \leq n-1$.  Put differently $(k - r) \%n \neq 0$

Thus $\big(\mathbf \Lambda^r \mathbf 1\big)^H \mathbf \Lambda^k \mathbf 1 = trace\big(\mathbf \Lambda^{k - r}\big) = trace\big(\mathbf Q^H \mathbf P^{k - r} \mathbf Q\big) = trace\big(\mathbf Q \mathbf Q^H \mathbf P^{k - r}\big) = trace\big(\mathbf P^{k - r}\big) = 0$

note that inner products are symmetric -- except for complex conjugation-- so in the case of an inner product equal to zero, we have 

$\Big(\big(\mathbf \Lambda^r \mathbf 1\big)^H \mathbf \Lambda^k \mathbf 1\Big)^H = trace\big(\mathbf \Lambda^{k - r}\big)^H  = 0^H = 0$

which covers the second case.


Thus all columns in

$\bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg]$

have a squared length of $n$ and are mutually orthogonal.

Hence we can say:

$\frac{1}{\sqrt n} \bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg] = \mathbf F$

is a unitary matrix.  In fact this matrix $\mathbf F$ is the discrete Fourier transform matrix.  

*note: in some cases, we may use the conjugate transpose of this matrix, or another variant, as the DFT.  This is ultimately just a book-keeping adjustment*
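The claim that the Vandermonde matrix of eigenvalues has Gram matrix $n \mathbf I$ can be sanity checked numerically. This is a sketch under assumptions: $\mathbf P$ is the cyclic shift, its eigenvalues are taken from `np.linalg.eigvals` in whatever order NumPy returns them, and `n = 7` is arbitrary.

```python
import numpy as np

n = 7  # illustrative size

P = np.roll(np.eye(n), 1, axis=0)          # assumed concrete P: cyclic shift
lam = np.linalg.eigvals(P)                 # eigenvalues, in arbitrary order

# Columns are Lambda^0 1, Lambda^1 1, ..., Lambda^(n-1) 1
W = np.vander(lam, N=n, increasing=True)
F = W / np.sqrt(n)

gram = W.conj().T @ W                      # entry (r, k) is trace(Lambda^(k - r))
print(np.allclose(gram, n * np.eye(n)))
print(np.allclose(F.conj().T @ F, np.eye(n)))
```

Nothing here uses the explicit values of the $\lambda_i$; only the trace relations on powers of $\mathbf P$ are at work.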



# Claim: 
# The DFT Matrix is the collection of eigenvectors for a circulant matrix

We say that this circulant matrix is given by $\mathbf X$  


When we look at 

$\mathbf V^H \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \mathbf P^2\mathbf e_j &\cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg] = \mathbf{Diag}\big(\mathbf y\big) \bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg]$

we can re-write this as

$\mathbf V^H \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \mathbf P^2\mathbf e_j &\cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg] = \sqrt{n}\,\mathbf{Diag}\big(\mathbf y\big) \frac{1}{\sqrt{n}}\bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf \Lambda^1 \mathbf 1 & \mathbf \Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{n-1} \mathbf 1
\end{array}\bigg] = \sqrt{n}\, \mathbf{Diag}\big(\mathbf y\big) \mathbf F$

we can recognize that this is a form of the singular value decomposition of our matrix $\mathbf X$ (so long as we relax the constraint that the diagonal matrix is real-valued, non-negative).  That is, we have 

$\mathbf X = \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \mathbf P^2\mathbf e_j &\cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg] = \mathbf V \Big(\sqrt{n}\, \mathbf{Diag} \big(\mathbf y\big)\Big)\mathbf F$

In the above case, $\mathbf X$ is decomposed into unitary matrix $\mathbf V$ times a diagonal matrix times unitary matrix $\mathbf F$.  


In this case, $\mathbf X$ is itself unitary and per the note in "Schurs_Inequality.ipynb", that means that $\mathbf X$ is normal.  Since $\mathbf X$ is normal, this means that our Singular Value Decomposition in fact gives us an eigenvalue decomposition.  Put differently we can set our left singular and right singular vectors to be equal, and allocate everything else to the middle diagonal matrix.  

$\mathbf V := \mathbf F^H $

$\mathbf X = \mathbf F^H \mathbf D_j \mathbf F$

or equivalently

$\mathbf F \mathbf X \mathbf F^H =  \mathbf D_j $

Thus in the above we can say $\mathbf X$ is unitarily similar to a diagonal matrix $\mathbf D_j$ with $\mathbf F$ containing the eigenvectors.  

Which is another way of saying that our unitary Vandermonde matrix $\mathbf F$ **contains the mutually orthonormal collection of eigenvectors for** $\mathbf X$.  
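Here is a minimal numerical sketch of $\mathbf F \mathbf X \mathbf F^H = \mathbf D_j$ for $\mathbf X = f(\mathbf e_j)$. The assumptions: $\mathbf P$ is the cyclic shift, $\mathbf F$ is built from `np.linalg.eigvals(P)` in NumPy's arbitrary ordering, and `j` is a 0-based index, so in this setup $f(\mathbf e_j)$ coincides with $\mathbf P^j$.

```python
import numpy as np

n, j = 6, 2  # illustrative size and (0-based) basis vector index

P = np.roll(np.eye(n), 1, axis=0)                       # assumed concrete P
lam = np.linalg.eigvals(P)
F = np.vander(lam, N=n, increasing=True) / np.sqrt(n)   # unitary Vandermonde matrix

# X = [P^0 e_j | P^1 e_j | ... | P^(n-1) e_j]
X = np.column_stack([np.linalg.matrix_power(P, k)[:, j] for k in range(n)])

D = F @ X @ F.conj().T
off_diagonal = D - np.diag(np.diag(D))
print(np.allclose(off_diagonal, 0))          # D is diagonal
print(np.allclose(np.diag(D), lam ** j))     # and its entries are lam_i^j
```

The second check reflects that, with 0-based indexing and this choice of $\mathbf P$, the diagonal matrix is simply $\mathbf \Lambda^j$.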

This immediately motivates the question-- what if $\mathbf X$ was a function of permuting some different, arbitrary vector $\mathbf s$ i.e. if $\mathbf X = f(\mathbf s)$ -- could we still say $\mathbf X$ is unitarily similar to a diagonal matrix with $\mathbf F$ containing the eigenvectors?  The answer is yes, though it takes a little bit more work to show it.



# Short form proof

for a quick take, consider that 

$\mathbf F \big(f(\mathbf e_j)\big) \mathbf F^H = \mathbf D_j$

where $\mathbf D_j$ denotes some diagonal matrix similar to our function applied on the jth standard basis vector.

The standard basis vectors form a basis so we can write $\mathbf s$ in terms of them:  
$\mathbf s = \gamma_1 \mathbf e_1 + \gamma_2 \mathbf e_2 + ... + \gamma_n \mathbf e_n  = \big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j\big)$

Then, using linearity we can say 

$\mathbf X = f(\mathbf s) = f(\gamma_1 \mathbf e_1 + \gamma_2 \mathbf e_2 + ... + \gamma_n \mathbf e_n) = f(\gamma_1 \mathbf e_1) + f(\gamma_2 \mathbf e_2) + ... + f(\gamma_n \mathbf e_n)$

left multiply each side by $\mathbf F$ and right multiply each side by $\mathbf F^H$, and we get

$\mathbf F \big(\mathbf X\big) \mathbf F^H = \mathbf F \big(f(\mathbf s)\big) \mathbf F^H= \mathbf F \big( f(\gamma_1 \mathbf e_1)\big)\mathbf F^H + \mathbf F\big(f(\gamma_2 \mathbf e_2)\big)\mathbf F^H + ... + \mathbf F\big(f(\gamma_n \mathbf e_n)\big)\mathbf F^H = \gamma_1 \mathbf D_1 + \gamma_2 \mathbf D_2 + ... + \gamma_n \mathbf D_n$

The sum of a sequence of diagonal matrices is a diagonal matrix, hence we can say that using $\mathbf F$, we find that $\mathbf X$ is unitarily similar to a diagonal matrix. 
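The short form argument can be traced step by step in code. In this sketch the helper `f` is a hypothetical name for the map $\mathbf s \mapsto \big[\mathbf P^0 \mathbf s \mid \cdots \mid \mathbf P^{n-1}\mathbf s\big]$, with $\mathbf P$ assumed to be the cyclic shift (so $\mathbf P^k \mathbf s$ is `np.roll(s, k)`); the coefficients $\gamma_j$ are random and purely illustrative.

```python
import numpy as np

def f(s):
    """Stack s, Ps, ..., P^(n-1)s as columns, for the cyclic-shift P (assumed)."""
    return np.column_stack([np.roll(s, k) for k in range(len(s))])

n = 5
rng = np.random.default_rng(0)
gamma = rng.normal(size=n)      # arbitrary coefficients; s = sum_j gamma_j e_j = gamma
E = np.eye(n)                   # columns are the standard basis vectors

# Linearity: f(s) = sum_j gamma_j f(e_j)
X = f(gamma)
X_sum = sum(gamma[j] * f(E[:, j]) for j in range(n))

# F from the eigenvalues of P, then D_j = F f(e_j) F^H
P = np.roll(np.eye(n), 1, axis=0)
lam = np.linalg.eigvals(P)
F = np.vander(lam, N=n, increasing=True) / np.sqrt(n)
D_list = [F @ f(E[:, j]) @ F.conj().T for j in range(n)]

D_s = F @ X @ F.conj().T
D_sum = sum(gamma[j] * D_list[j] for j in range(n))

print(np.allclose(X, X_sum))                       # linearity of f
print(np.allclose(D_s, D_sum))                     # F X F^H = sum_j gamma_j D_j
print(np.allclose(D_s, np.diag(np.diag(D_s))))     # which is diagonal
```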


# Begin Long Form Proof

To begin, notice that by linearity

$f(\mathbf e_1) + f(\mathbf e_2) = f(\mathbf e_1 + \mathbf e_2)$  

Written in terms of a matrix, this is:  

$\bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_1 & \mathbf P^1\mathbf e_1 & \cdots & \mathbf P^{n-1}\mathbf e_1
\end{array}\bigg] + \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_2 & \mathbf P^1\mathbf e_2 & \cdots & \mathbf P^{n-1}\mathbf e_2
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c}
\mathbf P^0 \mathbf e_1 +  \mathbf P^0 \mathbf e_2 & \mathbf P^1 \mathbf e_1 + \mathbf P^1 \mathbf e_2 & \cdots & \mathbf P^{n-1} \mathbf e_1 + \mathbf P^{n-1} \mathbf e_2 
\end{array}\bigg]$

which we can restate as 

$\bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_1 & \mathbf P^1\mathbf e_1 & \cdots & \mathbf P^{n-1}\mathbf e_1
\end{array}\bigg] + \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_2 & \mathbf P^1\mathbf e_2 & \cdots & \mathbf P^{n-1}\mathbf e_2
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\big(\mathbf e_1 +  \mathbf e_2\big) & \mathbf P^1 \big(\mathbf e_1 + \mathbf e_2\big) & \cdots & \mathbf P^{n-1}\big(\mathbf e_1 + \mathbf e_2 \big)
\end{array}\bigg]$


If we left multiply by $\mathbf F$, what we get is


$\mathbf F \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_1 & \mathbf P^1\mathbf e_1 & \cdots & \mathbf P^{n-1}\mathbf e_1
\end{array}\bigg] + \mathbf F \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_2 & \mathbf P^1\mathbf e_2 &\cdots & \mathbf P^{n-1}\mathbf e_2
\end{array}\bigg]= \mathbf D_1\mathbf F + \mathbf D_2\mathbf F$

$= \big(\mathbf D_1 + \mathbf D_2\big) \mathbf F = \mathbf F \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\big(\mathbf e_1 +  \mathbf e_2\big) & \mathbf P^1 \big(\mathbf e_1 + \mathbf e_2\big) &\cdots & \mathbf P^{n-1}\big(\mathbf e_1 + \mathbf e_2 \big)
\end{array}\bigg]$


Here $\mathbf D_j$ is a diagonal matrix similar to the permutation matrix given by $f(\mathbf e_j)$.  To further generalize this, notice that if we added all of the standard basis vectors, we'd get the ones vector.  
We can write this as:  

$\mathbf {11}^H = \Sigma_{j=1}^{n}\Big(\bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg]\Big) = \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\big(\Sigma_{j=1}^{n} \mathbf e_j \big) & \mathbf P^1 \big(\Sigma_{j=1}^{n} \mathbf e_j \big) & \cdots & \mathbf P^{n-1}\big(\Sigma_{j=1}^{n} \mathbf e_j \big)
\end{array}\bigg]$  

and if we left multiply by $\mathbf F$, we get

$\mathbf F \big(\mathbf {11}^H \big) = \Sigma_{j=1}^{n}\Big(\mathbf F\bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \cdots & \mathbf P^{n-1}\mathbf e_j
\end{array}\bigg]\Big) = \Sigma_{j=1}^{n} \big(\mathbf D_j \mathbf F\big) = \big(\Sigma_{j=1}^{n}\mathbf D_j\big) \mathbf F$  

From here, consider what would happen if we instead decided to scale each standard basis vector, $\mathbf e_j$, by some arbitrary amount, $\gamma_j$, giving us the following expression:  

$\Sigma_{j=1}^{n}\Big(\gamma_j \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \cdots & \mathbf P^{n-1} \mathbf e_j
\end{array}\bigg]\Big) = \Sigma_{j=1}^{n}\Big(\bigg[\begin{array}{c|c|c|c}
\mathbf P^0\big(\gamma_j \mathbf e_j\big) & \mathbf P^1\big(\gamma_j \mathbf e_j\big) & \cdots & \mathbf P^{n-1} \big(\gamma_j  \mathbf e_j\big)
\end{array}\bigg]\Big)$

which can be restated as  

$\Sigma_{j=1}^{n}\Big(\gamma_j \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \cdots & \mathbf P^{n-1} \mathbf e_j
\end{array}\bigg]\Big)  = \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big) & \mathbf P^1 \big(\Sigma_{j=1}^{n}\gamma_j \mathbf e_j \big) & \cdots & \mathbf P^{n-1}\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big)
\end{array}\bigg]$ 

again, left multiply this expression by $\mathbf F$ and we see

$\mathbf F\bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big) & \mathbf P^1 \big(\Sigma_{j=1}^{n}\gamma_j \mathbf e_j \big) & \cdots & \mathbf P^{n-1}\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big)
\end{array}\bigg] = \Sigma_{j=1}^{n}\Big(\gamma_j \mathbf F\bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \cdots & \mathbf P^{n-1} \mathbf e_j
\end{array}\bigg]\Big)$

from here notice  

$ \Sigma_{j=1}^{n}\Big(\gamma_j \mathbf F\bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \cdots & \mathbf P^{n-1} \mathbf e_j
\end{array}\bigg]\Big) = \Sigma_{j=1}^{n}\gamma_j\Big( \mathbf F\bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf e_j & \mathbf P^1\mathbf e_j & \cdots & \mathbf P^{n-1} \mathbf e_j
\end{array}\bigg]\Big) = \Sigma_{j=1}^{n} \mathbf \gamma_j \big(\mathbf D_j \mathbf F\big) = \big(\Sigma_{j=1}^{n}\mathbf \gamma_j \mathbf D_j\big) \mathbf F$

Thus we say

$\mathbf F\bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big) & \mathbf P^1 \big(\Sigma_{j=1}^{n}\gamma_j \mathbf e_j \big) & \cdots & \mathbf P^{n-1}\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big)
\end{array}\bigg] = \big(\Sigma_{j=1}^{n}\mathbf \gamma_j \mathbf D_j\big) \mathbf F$


Right multiply each side by $\mathbf F^H$:  

$\mathbf F\bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big) & \mathbf P^1 \big(\Sigma_{j=1}^{n}\gamma_j \mathbf e_j \big) & \cdots & \mathbf P^{n-1}\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big)
\end{array}\bigg] \mathbf F^H = \big(\Sigma_{j=1}^{n}\mathbf \gamma_j \mathbf D_j\big) \mathbf F \mathbf F^H  = \big(\Sigma_{j=1}^{n}\mathbf \gamma_j \mathbf D_j\big) $

Since the sum of a finite sequence of $n$ x $n$ diagonal matrices is itself a diagonal matrix, this tells us that our matrix is unitarily similar to a diagonal matrix, and the mutually orthonormal eigenvectors are contained in $\mathbf F$ (or technically, the right eigenvectors are contained as columns in $\mathbf F^H$ -- which again, is just a small bookkeeping adjustment).  

Now consider the general case where $\mathbf X = f(\mathbf s)$.  This looks quite formidable -- the circulant matrix is given by: 

$\mathbf X = \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf s & \mathbf P^1\mathbf s & \cdots & \mathbf P^{n-1}\mathbf s
\end{array}\bigg]  = \begin{bmatrix}
s_0 & s_{n-1} & s_{n-2} & \dots & s_2 & s_1 \\ 
s_1 & s_0 & s_{n-1} & \dots & s_3 & s_2 \\ 
s_2 & s_1 & s_0 & \dots & s_4 & s_3 \\
\vdots & \vdots  & \vdots &\ddots & \vdots & \vdots\\ 
s_{n-2} & s_{n-3} & s_{n-4} & \dots & s_0  & s_{n-1} \\ 
s_{n-1} & s_{n-2}  & s_{n-3} & \dots & s_1 &  s_0
\end{bmatrix}$

But, we simply need to recall that the standard basis vectors in fact form a basis, so we can uniquely write $\mathbf s$ in terms of them. 

$\mathbf s = \gamma_1 \mathbf e_1 + \gamma_2 \mathbf e_2 + ... + \gamma_n \mathbf e_n  = \big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j\big)$

Thus we have 


$\mathbf X = \bigg[\begin{array}{c|c|c|c}
\mathbf P^0\mathbf s & \mathbf P^1\mathbf s & \cdots & \mathbf P^{n-1}\mathbf s
\end{array}\bigg] = \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big) & \mathbf P^1 \big(\Sigma_{j=1}^{n}\gamma_j \mathbf e_j \big) & \cdots & \mathbf P^{n-1}\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big)
\end{array}\bigg]$

left multiply each side by $\mathbf F$ and right multiply each side by $\mathbf F^H$, and we get 

$\mathbf F \big(\mathbf X\big) \mathbf F^H = \mathbf F \bigg[\begin{array}{c|c|c|c|c}
\mathbf P^0\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big) & \mathbf P^1 \big(\Sigma_{j=1}^{n}\gamma_j \mathbf e_j \big) & \cdots & \mathbf P^{n-1}\big(\Sigma_{j=1}^{n} \gamma_j \mathbf e_j \big)
\end{array}\bigg] \mathbf F^H = \big(\Sigma_{j=1}^{n}\mathbf \gamma_j \mathbf D_j\big)$


which states that $\mathbf X = f(\mathbf s)$ is unitarily similar to a diagonal matrix, with vectors in $\mathbf F$ forming the basis of mutually orthonormal eigenvectors.  This completes the proof that a circulant matrix $\mathbf X$ is unitarily diagonalizable via the "help" of $\mathbf F$. 
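To see the layout of the general circulant matrix and the eigenvector relation in one place, here is a hedged sketch. It assumes the cyclic-shift $\mathbf P$ and picks $s_k = k$ (0-based) so that each printed entry shows its own index; the eigenvalue formula $\mu = \sum_d s_d \lambda^d$ used below follows from $\lambda^n = 1$ and is checked rather than taken on faith.

```python
import numpy as np

n = 5
s = np.arange(n)   # s_k = k, so the printed matrix shows the index pattern directly

# Columns P^0 s, P^1 s, ..., P^(n-1) s for the cyclic-shift P (assumed)
X = np.column_stack([np.roll(s, k) for k in range(n)])
print(X)   # first column is s; each later column is the previous one shifted down by one

P = np.roll(np.eye(n), 1, axis=0)
lam = np.linalg.eigvals(P)
powers = np.arange(n)

all_eigen_ok = True
for lam_a in lam:
    v = np.conj(lam_a) ** powers         # a column of F^H, up to the 1/sqrt(n) scaling
    mu = np.sum(s * lam_a ** powers)     # candidate eigenvalue of X
    all_eigen_ok = all_eigen_ok and bool(np.allclose(X @ v, mu * v))
print(all_eigen_ok)
```

The same eigenvectors work for every choice of $\mathbf s$; only the eigenvalues $\mu$ change, which is the content of the proof above.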


# End Long Form Proof

- - - - - 
# But what the heck do the components of the DFT look like? 


Recall that (a) each $\lambda_i$, contained in position $\mathbf \Lambda_{i,i}$, is distinct and (b) each $\lambda_i$ satisfies $\lambda_i^n - 1 = 0$.

as a reminder: this is because (a) the associated Vandermonde matrix is non-singular, and (b) $\mathbf \Lambda^n = \mathbf Q^H \mathbf P^n \mathbf Q = \mathbf Q^H \mathbf I \mathbf Q = \mathbf I $, hence each diagonal element raised to the nth power equals one.  

We know that $\lambda_1 = 1$, because $\mathbf {P1} = \mathbf 1$.  From here we can say, $\lambda_1$ has polar coordinate (1, $2\pi \frac{(1 - 1) }{n}$) which is to say it has magnitude 1, and an angle of $0$ i.e. it is real valued and equal to 1.  

$\lambda_2$ has polar coordinate of (1 , $2\pi\frac{(2-1)}{n} $)  
$\lambda_3$ has polar coordinate of (1,  $2\pi\frac{(3-1)}{n} $)  
$\vdots$  
$\lambda_{n-1}$ has polar coordinate of (1, $2\pi\frac{(n-1 -1)}{n} $)  
$\lambda_n$ has polar coordinate of (1, $2\pi\frac{(n-1)}{n}$).  

There is a variant of the pigeonhole principle here: we have $n$ $\lambda_j$'s, each of which must be unique, and there are only $n$ unique nth roots of unity$^{(1)}$ -- hence each nth root has one and only one $\lambda_j$ "in" it.  (This Wolfram MathWorld link is worth visiting, for its nice graphic: http://mathworld.wolfram.com/RootofUnity.html )



Thus **the Vandermonde matrix in the following form is unitary**:  (due to GitHub $\LaTeX$ rendering issues, the below formula has been inserted as an image)


![F_components](images/unitary_vandermondF_components.gif)

when each $\lambda_j$ has polar coordinate of (1, $2\pi\frac{(j-1)}{n} $)
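A numerical check of the explicit form, as a sketch under assumptions: $\mathbf P$ is the cyclic shift, `n = 8` is arbitrary, and the comparison with `np.fft.fft` relies on NumPy's documented sign convention (a negative exponent in the forward transform), which is why a complex conjugate shows up.

```python
import numpy as np

n = 8  # illustrative size

P = np.roll(np.eye(n), 1, axis=0)      # assumed concrete P: cyclic shift
lam = np.linalg.eigvals(P)

# Each eigenvalue is an nth root of unity ...
print(np.allclose(lam ** n, 1))

# ... and each root exp(2*pi*i*k/n) is hit exactly once.
k = np.round(np.angle(lam) * n / (2 * np.pi)).astype(int) % n
print(sorted(k.tolist()))

# Build F with the eigenvalues ordered by angle, as in the polar coordinates above.
roots = np.exp(2j * np.pi * np.arange(n) / n)
F = np.vander(roots, N=n, increasing=True) / np.sqrt(n)
print(np.allclose(F.conj().T @ F, np.eye(n)))

# NumPy's forward FFT matrix is sqrt(n) times the conjugate of this F.
print(np.allclose(np.sqrt(n) * F.conj(), np.fft.fft(np.eye(n))))
```

Rounding the angles to the nearest multiple of $2\pi / n$ avoids trouble with an angle that comes back as a tiny negative number instead of exactly zero.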
- - - - -

$^{(1)}$ **Side note: How do we know there are exactly n roots of unity?** 

The reader may wonder how we know that there are "only $n$ unique nth roots of unity" available for us to choose from, for any natural number $n$.  One way to support this claim comes from using the fundamental theorem of algebra, which is rather high powered machinery that is not introduced or proved anywhere in this posting.  

The other approach is self contained and comes from using Vandermonde matrices.  Consider a degree $n$ polynomial (specifically the polynomial we are interested in is $\lambda_i^n - 1$, but any degree $n$ polynomial --that isn't the zero polynomial-- is valid here). Such a polynomial would have the following Vandermonde matrix associated with it:


$\mathbf S = \begin{bmatrix}
1 & s_1 & s_1^2 & \dots  & s_1^{n-1} &s_1^{n} \\ 
1 & s_2 & s_2^2 & \dots &  s_2^{n-1} & s_2^{n} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
1 & s_{n} & s_{n}^{2} & \dots  & s_{n}^{n-1} & s_{n}^{n} \\
1 & s_{n+1} & s_{n+1}^{2} & \dots  & s_{n+1}^{n-1} & s_{n+1}^{n}
\end{bmatrix}$

That is, we evaluate our polynomial at $s_i$ where $i = \{1, 2, ... , n, n+1\}$.  The polynomial has coefficients associated with it, which are given in $\mathbf t$.  When we evaluate the polynomial at each $s_i$ we get the resulting value at each $b_i$.  Setting this up as an equation:   

$\begin{bmatrix}
1 & s_1 & s_1^2 & \dots  & s_1^{n-1} &s_1^{n} \\ 
1 & s_2 & s_2^2 & \dots &  s_2^{n-1} & s_2^{n} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
1 & s_{n} & s_{n}^{2} & \dots  & s_{n}^{n-1} & s_{n}^{n} \\
1 & s_{n+1} & s_{n+1}^{2} & \dots  & s_{n+1}^{n-1} & s_{n+1}^{n}
\end{bmatrix}\begin{bmatrix}
t_0\\ 
t_1\\ 
\vdots\\ 
t_{n-1}\\ 
t_{n}
\end{bmatrix} = \begin{bmatrix}
b_1\\ 
b_2\\ 
\vdots\\ 
b_{n}\\ 
b_{n+1}
\end{bmatrix}$

or more succinctly, 

$\mathbf{St} = \mathbf b$ 

**For a contradiction:** assume that each of the $n+1$ data points is a distinct root of the polynomial.  Then every resulting value in $\mathbf b$ is zero, which reduces this to:

$\mathbf{St} = \mathbf 0$.  However since each $s_i$ is distinct, the Vandermonde matrix is invertible, which gives us 

$\mathbf S^{-1} \mathbf{St} = \mathbf t = \mathbf S^{-1} \mathbf 0 = \mathbf 0$   

thus 

$\mathbf t = \mathbf 0$   

Since every coefficient is zero, we in fact have the zero polynomial -- which is a contradiction.  However, if at least one of the $s_j$ is not a root (i.e. $\mathbf b \neq \mathbf 0$), then $\mathbf t \neq \mathbf 0$ and hence we may still have a degree $n$ polynomial.  This gives us an upper bound which tells us that a degree $n$ polynomial can have at most $n$ distinct roots.   

Now, for our DFT matrix $\mathbf F$, we are using eigenvalues from the connected graph permutation matrix and they have a constraint given by $\lambda^n - 1 = 0$.  Put differently we are looking for roots of a degree $n$ polynomial, where the polynomial is $\lambda^n - 1$.  These roots are called roots of unity.  Per the above, we upper bound the number of unique roots as being $\leq n$.  Now our matrix $\mathbf F$ is a unitary Vandermonde Matrix, which means it is non-singular, thus we determine that each $\lambda_k$ for $k = \{1, 2, ..., n-1, n \}$ must be distinct. This means there must be $\geq n$ distinct roots of unity.  Since our upper bound and lower bound are equal, we have a sandwich and conclude that there are **exactly** $n$ unique roots of unity.  
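The Vandermonde argument in this side note can be illustrated numerically. This is a sketch only: the degree `n = 4` and the evaluation points are made-up values, and `np.linalg.matrix_rank` decides rank with a floating point tolerance.

```python
import numpy as np

n = 4  # degree of the polynomial (illustrative)

# n + 1 distinct evaluation points (made-up values)
s = np.array([-2.0, -1.0, 0.5, 1.0, 3.0])
S = np.vander(s, N=n + 1, increasing=True)    # row i is [1, s_i, ..., s_i^n]

rank_distinct = np.linalg.matrix_rank(S)
print(rank_distinct)                          # full rank: n + 1

# If all n + 1 points were roots, we would need S t = 0, which forces t = 0.
t = np.linalg.solve(S, np.zeros(n + 1))
print(np.allclose(t, 0))

# Repeating a point makes two rows identical, and invertibility is lost.
s_dup = np.array([-2.0, -1.0, 0.5, 1.0, 1.0])
S_dup = np.vander(s_dup, N=n + 1, increasing=True)
rank_dup = np.linalg.matrix_rank(S_dup)
print(rank_dup)
```

With distinct points the only coefficient vector that vanishes everywhere is the zero vector, which is the upper bound of $n$ distinct roots in matrix form.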



# Below is a simple example of an explicit use of the circulant matrix for discrete convolutions in probability

there are numerous opportunities to further optimize this


In [1]:
import numpy as np
np.set_printoptions(precision = 2, linewidth=180)

# setup

# simple (PMF) distributions for some peculiar experiment,
# that can return a certain number of "heads"
# the number of heads = [0, 1, 2, 3, 4], 
# and associated PMFs for x and y

x = np.random.random(5)
x = x / x.sum() # normalize

y = np.random.random(5)
y = y / y.sum() # normalize

print("x vs y")
print( x, " vs ", y, "\n")

m = x.shape[0] * 2
z = np.zeros(m)
circulant_mat = np.zeros((m,m))

# direct convolution is done below, 
# and the circulant matrix is populated while doing this
for i in range(m):
    for idx in range(x.shape[0]):
        jdx = i - idx
        if jdx >= y.shape[0]  or jdx < 0:
            # simple setup with the non-negative distribution, though inefficient
            continue
        z[i] += x[idx] * y[jdx]
        circulant_mat[i,idx] = y[jdx]
  
padded_x = np.zeros(m)
# just some extra zeros in the padding to accommodate 
# the higher order polynomial so to speak

for i in range(x.shape[0]):
    padded_x[i] += x[i]

# mathematically, the below has no impact 
# (as these entries get scaled by the padded zeros on padded_x), 
# but it finishes off the circulant structure associated w/ our convolution
# It is important to know this exists

for row_idx in range(m):
    if np.isclose(circulant_mat[row_idx, 0], 0):
        continue
    else:
        for j in range(x.shape[0],m):
            circulant_mat[(row_idx + j) % m , j] = circulant_mat[row_idx, 0]

print(circulant_mat)

newz = circulant_mat @ padded_x 
# the alternative, direct, way of calculating z is via the use of this circulant matrix

print(" ")
print("# | probability    | difference between for loops and use of circulant matrix")
for i in range(m):
    print(i, "|", z[i], "|", z[i] - newz[i])

# of course there is lots of room to optimize calculations

x vs y
[ 0.02  0.21  0.22  0.4   0.15]  vs  [ 0.27  0.11  0.16  0.2   0.26] 

[[ 0.27  0.    0.    0.    0.    0.    0.26  0.2   0.16  0.11]
 [ 0.11  0.27  0.    0.    0.    0.    0.    0.26  0.2   0.16]
 [ 0.16  0.11  0.27  0.    0.    0.    0.    0.    0.26  0.2 ]
 [ 0.2   0.16  0.11  0.27  0.    0.    0.    0.    0.    0.26]
 [ 0.26  0.2   0.16  0.11  0.27  0.    0.    0.    0.    0.  ]
 [ 0.    0.26  0.2   0.16  0.11  0.27  0.    0.    0.    0.  ]
 [ 0.    0.    0.26  0.2   0.16  0.11  0.27  0.    0.    0.  ]
 [ 0.    0.    0.    0.26  0.2   0.16  0.11  0.27  0.    0.  ]
 [ 0.    0.    0.    0.    0.26  0.2   0.16  0.11  0.27  0.  ]
 [ 0.    0.    0.    0.    0.    0.26  0.2   0.16  0.11  0.27]]
 
# | probability    | difference between for loops and use of circulant matrix
0 | 0.00480728004211 | 0.0
1 | 0.060396279743 | 0.0
2 | 0.0863612459009 | 0.0
3 | 0.169574082205 | 2.77555756156e-17
4 | 0.167181042318 | 2.77555756156e-17
5 | 0.178404865542 | 0.0
6 | 0.161314980263 | 0.0
7 | 0
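As one example of the room for optimization mentioned above, here is a sketch comparing three routes to the same convolution: `np.convolve`, an explicit circulant matrix, and the FFT. The two PMFs are fixed, made-up values (not the random ones from the run above) so the comparison is reproducible.

```python
import numpy as np

# Made-up PMFs over the outcomes 0..4 (each sums to 1)
x = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
y = np.array([0.30, 0.10, 0.20, 0.15, 0.25])
m = 2 * len(x)   # same padded length as in the cell above

# 1) direct linear convolution, length len(x) + len(y) - 1 = 9
direct = np.convolve(x, y)

# 2) circulant matrix built from zero-padded y, applied to zero-padded x
padded_x = np.concatenate([x, np.zeros(m - len(x))])
padded_y = np.concatenate([y, np.zeros(m - len(y))])
circulant = np.column_stack([np.roll(padded_y, k) for k in range(m)])
via_circulant = circulant @ padded_x

# 3) FFT: diagonalize the circulant matrix instead of forming it
via_fft = np.fft.ifft(np.fft.fft(x, n=m) * np.fft.fft(y, n=m)).real

print(np.allclose(via_circulant[: len(direct)], direct))
print(np.allclose(via_fft, via_circulant))
```

The FFT route never builds the $m$ x $m$ matrix, which is where the savings come from once the distributions get long.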

# Full cycle trace relations and nilpotent matrices

**claim:**  
for $\mathbf B \in \mathbb C^{n \times n}$ , 

if $trace\big(\mathbf B^r\big) = 0$, for $r = \{1, 2, ... ,n-1, n \}$, then every eigenvalue, $\lambda_i$, of $\mathbf B$ is equal to zero, i.e.  $\lambda_i = 0$ for $i = \{1, 2, ... ,n-1, n \}$ .  


**comment:**  
since the trace gives the sum of the eigenvalues and any complex matrix is similar to an upper triangular matrix, it is clearly true that if all eigenvalues are zero, then the trace will be zero for $\mathbf B^r$ for any natural number $r$ -- including the case where $1 \leq r \leq n$ . What is not immediately clear is that this is an **iff**.

In the derivation of the DFT, we used $trace\big(\mathbf P^r\big) = 0$, for $r = \{1, 2, ... ,n-1 \}$, yet our matrix $\mathbf P$ had all eigenvalues with magnitude $= 1$.  Extending the range of $r$ to also include $n$ radically changes things and makes all eigenvalues have magnitude $= 0$.  
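The contrast between the two cases is easy to see numerically. In this sketch both matrices are illustrative assumptions: `B` is a strictly upper triangular (hence nilpotent) matrix and `P` is the cyclic shift; `power_traces` is a hypothetical helper name.

```python
import numpy as np

n = 4  # illustrative size

B = np.triu(np.ones((n, n)), k=1)        # strictly upper triangular, so nilpotent
P = np.roll(np.eye(n), 1, axis=0)        # cyclic shift permutation matrix

def power_traces(M, r_max):
    """Return [trace(M^1), ..., trace(M^r_max)] as Python floats."""
    return [float(np.trace(np.linalg.matrix_power(M, r))) for r in range(1, r_max + 1)]

print(power_traces(B, n))   # [0.0, 0.0, 0.0, 0.0]
print(power_traces(P, n))   # [0.0, 0.0, 0.0, 4.0]
```

The permutation matrix passes the trace test for $r = 1, ..., n-1$ and fails only at $r = n$, which is exactly the extra condition that forces every eigenvalue to zero.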

**proof:**  
start by constructing the Vandermonde matrix:


$\mathbf W = \begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{n-1}\\ 
1 & \lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{n-1} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots \\ 
1 & \lambda_{n} & \lambda_{n}^{2} & \dots  & \lambda_{n}^{n-1}
\end{bmatrix}$


We want to reflect our constraint as

$\mathbf 1^H \mathbf {\Lambda W} = \mathbf 0^H$

i.e. as

$\mathbf 1^H \begin{bmatrix}
\lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{n-1} & \lambda_1^{n}\\ 
\lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{n-1}& \lambda_2^{n} \\ 
\vdots & \vdots & \ddots & \vdots & \vdots \\ 
\lambda_{n} & \lambda_{n}^{2} & \dots  & \lambda_{n}^{n-1}& \lambda_n^{n}
\end{bmatrix} = \mathbf 0^H$


But we aren't sure if there are any eigenvalues equal to zero in $\mathbf \Lambda$ so we need to remove them first.  Why? First: if any eigenvalues are zero, the matrix on the left side of the equation has a row of zeros and so is not invertible.  Second, we are interested in the trace relations and we know that any eigenvalues of zero have no impact on the trace calculations, hence they may safely be removed.  

*The contradiction kicks in at this stage*

We remove all eigenvalues equal to zero and have an $m$ x $m$ matrix for some natural number $m$, where $ m \leq n$.  Assume $m \geq 2$, i.e. that some non-zero eigenvalues exist that satisfy our stated trace constraint.  

- - - 
(*Two bookkeeping notes that may be skipped:* First: after removing all zero eigenvalues, $m \neq 1$ -- because if $m=1$, then the trace constraint gives $trace\big(\mathbf B\big) = \lambda_1 = 0$, which contradicts the fact that the sole remaining eigenvalue satisfies $\lambda_1 \neq 0$.  Second: we make the adjustment so that $\mathbf 1$ and $\mathbf 0$ are $m$ x $1$ column vectors.)  

Our equation becomes: 

$\mathbf 1^H \begin{bmatrix}
\lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{m-1} & \lambda_1^{m}\\ 
\lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{m-1}& \lambda_2^{m} \\ 
\vdots & \vdots & \ddots & \vdots & \vdots \\ 
\lambda_{m} & \lambda_{m}^{2} & \dots  & \lambda_{m}^{m-1}& \lambda_m^{m}
\end{bmatrix} = \mathbf 0^H$

we re-write the above as 

$\mathbf y^H \begin{bmatrix}
\lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{m-1} & \lambda_1^{m}\\ 
\lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{m-1}& \lambda_2^{m} \\ 
\vdots & \vdots & \ddots & \vdots & \vdots \\ 
\lambda_{m} & \lambda_{m}^{2} & \dots  & \lambda_{m}^{m-1}& \lambda_m^{m}
\end{bmatrix} = \mathbf 0^H$

where $\mathbf y = \mathbf 1$.  It is important to note the identity: $\mathbf y^H \mathbf 1 = m$.

Now we further prune our adjusted Vandermonde matrix to only include unique eigenvalues.  Thus we keep the $k$ unique eigenvalues where $2 \leq k \leq m$, and we adjust $\mathbf y$ so that the trace math is identical. (Again note that $k \neq 1$, because if so then there is only one unique eigenvalue $\lambda_1$ and thus $\frac{1}{m} trace \big(\mathbf B\big) = \lambda_1 = 0$, but we know that $\lambda_1 \neq 0$.)  

For example, if all eigenvalues were unique except $\lambda_m = \lambda_{m-1}$ we'd remove the mth row and mth column from our adjusted Vandermonde matrix, and now $\mathbf y$ would be an $m-1$ x $1$ column vector (as would the zero vector), where we have 

$\mathbf y = \begin{bmatrix}
1\\ 
1\\ 
\vdots\\ 
1 \\
2
\end{bmatrix}$

**Put differently, at this stage $y_i$ holds the algebraic multiplicity of each unique non-zero eigenvalue $\lambda_i$.**

The underlying math with respect to traces is the same, and we still have the key identity $\mathbf y^H \mathbf 1 = m$.

our equation is thus:   

$\mathbf y^H \begin{bmatrix}
\lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{k-1} & \lambda_1^{k}\\ 
\lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{k-1}& \lambda_2^{k} \\ 
\vdots & \vdots & \ddots & \vdots & \vdots & \\ 
\lambda_{k} & \lambda_{k}^{2} & \dots  & \lambda_{k}^{k-1}& \lambda_k^{k}
\end{bmatrix} = \mathbf 0^H$

This matrix has each $\lambda_i \neq 0$ and each $\lambda_i$ is unique. We can factor out the diagonal matrix $\mathbf D = Diag\big(\lambda_1, \lambda_2, ..., \lambda_k\big)$ if we'd like.  Thus we have 

$\mathbf y^H \mathbf D \begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{k-1} \\ 
1 & \lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{k-1} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \\ 
1& \lambda_{k} & \lambda_{k}^{2} & \dots  & \lambda_{k}^{k-1}
\end{bmatrix} = \mathbf 0^H$

letting $\mathbf K$ be our adjusted Vandermonde matrix in this equation, i.e. $\mathbf K = \bigg[\begin{array}{c|c|c|c|c}
\mathbf D^0 \mathbf 1 & \mathbf D^1 \mathbf 1 & \mathbf D^2 \mathbf 1 &\cdots & \mathbf D^{k-1} \mathbf 1
\end{array}\bigg]$

we have 

$\mathbf y^H \mathbf D \mathbf K = \mathbf 0^H$
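As a side check of the bookkeeping (not part of the proof), the sketch below confirms numerically that the row vector $\mathbf y^H \mathbf D \mathbf K$ lists $trace\big(\mathbf B^r\big)$ for $r = \{1, ..., k\}$.  The matrix `B`, the similarity transform `S`, and the eigenvalues are hypothetical choices made only for illustration; the traces here are deliberately not zero so that the identity itself is visible.

```python
import numpy as np

# Hypothetical example: eigenvalues 2 (twice), 5 (once) and 0 (once).
eigs = np.array([2.0, 2.0, 5.0, 0.0])
# Hide the eigenvalues inside a non-diagonal matrix via a similarity transform.
S = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0, 4.0],
              [0.0, 0.0, 0.0, 1.0]])
B = S @ np.diag(eigs) @ np.linalg.inv(S)

lam = np.array([2.0, 5.0])   # the k = 2 unique non-zero eigenvalues
y = np.array([2.0, 1.0])     # their algebraic multiplicities
k = len(lam)

D = np.diag(lam)
# Columns of K are D^0 1, D^1 1, ..., D^(k-1) 1.
K = np.vander(lam, N=k, increasing=True)

# y is real here, so y^H is just y as a row vector.
lhs = y @ D @ K
traces = np.array([np.trace(np.linalg.matrix_power(B, r)) for r in range(1, k + 1)])

print(lhs)
print(traces)
```

The zero eigenvalue never enters `lam` or `y`, which mirrors the pruning step above: it contributes nothing to any trace.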

Because all $\lambda_i$'s are unique, $\mathbf K$ is non-singular, and so is $\mathbf D$ (it is diagonal with no zeros on its diagonal).  We right multiply both sides of our equation by the inverse of $\mathbf K$ and this gives us 


$\mathbf y^H \mathbf D = \mathbf y^H \mathbf D \mathbf {KK}^{-1} = \mathbf 0^H \mathbf K^{-1} = \mathbf 0^H$ 

now right multiply both sides by $\mathbf D^{-1}$, and we have 

$\mathbf y^H = \mathbf y^H \mathbf D \mathbf D^{-1} =  \mathbf 0^H \mathbf D^{-1} = \mathbf 0^H$ 

This tells us that $\mathbf y^H = \mathbf 0^H$.  Yet this is a contradiction, because 

$\mathbf y^H \mathbf 1 = m \neq \mathbf 0^H \mathbf 1 = 0$


hence we know that $m \ngeq 2$, and as mentioned earlier $m \neq 1$.  Thus $m = 0$.  Put differently, $\mathbf K$ does not exist (i.e. it must be a $0$ x $0$ matrix).  This proves the claim that all eigenvalues of $\mathbf B$ must be equal to zero if 

$trace\big(\mathbf B^r\big) = 0$, for $r = \{1, 2, ... ,n-1, n \}$ 
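A small numerical illustration of the claim, and of the earlier comment about the DFT derivation, is sketched below.  The strictly upper triangular `B` and the cyclic shift matrix `P` are assumed examples: `B` has zero trace for every power up to $n$ and is nilpotent, while `P` has zero trace only for $r = \{1, ..., n-1\}$ and has all eigenvalues on the unit circle.

```python
import numpy as np

def power_traces(M, r_max):
    """Return [trace(M^1), ..., trace(M^r_max)]."""
    return np.array([np.trace(np.linalg.matrix_power(M, r)) for r in range(1, r_max + 1)])

n = 4
# Strictly upper triangular, so every eigenvalue is zero.
B = np.triu(np.arange(1.0, n * n + 1).reshape(n, n), k=1)
# Cyclic shift (permutation) matrix: its eigenvalues are the n-th roots of unity.
P = np.roll(np.eye(n), 1, axis=0)

traces_B = power_traces(B, n)
traces_P = power_traces(P, n)
B_to_n = np.linalg.matrix_power(B, n)

print(traces_B)   # zero for the full cycle r = 1..n
print(traces_P)   # zero for r = 1..n-1 only; the r = n entry equals n
```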

- - - -
- - - -
- - - -
**alternative approach:** if the reader feels the above contradiction to be unsatisfying, consider instead the following setup:


$\mathbf y^H \mathbf D \begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{k-1} & \lambda_1^{k} \\ 
1 & \lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{k-1} & \lambda_2^{k} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 
1& \lambda_{k} & \lambda_{k}^{2} & \dots  & \lambda_{k}^{k-1} & \lambda_k^{k} \\
1& 0 & 0 & \dots  & 0 & 0
\end{bmatrix} = \mathbf 0^H$

Here we have $k + 1$ rows in our Vandermonde matrix -- where the eigenvalue of zero is contained in the final row. $\mathbf D$ now is a $k+1 $ x $k+1$ diagonal matrix that has a zero in its bottom right corner, and $\mathbf y$ has the algebraic multiplicity for each of the $k+1$ unique eigenvalues (inclusive of the eigenvalue equal to zero, which is given by $y_{k+1}$).  We know that $\mathbf B$ has $n$ eigenvalues thus $\mathbf y^H \mathbf 1 = n$.

Our Vandermonde Matrix is invertible so we multiply both sides on the right by its inverse, giving us 

$\mathbf y^H \mathbf D = \mathbf 0^H$

or equivalently

$\mathbf D^H \mathbf y = \mathbf 0$

now we notice that $\mathbf D$ is singular, with all non-zero entries along its diagonal except the entry in the bottom right corner.  However the above equation tells us that 

for $i = \{1, 2, ..., k\}$

$y_i \bar{\lambda_i} = 0$


we observe that this is also true in the $k+1$ case:  $y_{k+1} \bar{\lambda_{k+1}} = 0$

First we deal with $i = \{1, 2, ..., k\}$ noticing that $\bar{\lambda_i} \neq 0$

$y_i \bar{\lambda_i} = 0$

divide both sides by $\bar{\lambda_i}$ and see that 

$y_i = 0$

Finally for $y_{k+1}$, we have  

$y_{k+1} \bar{\lambda_{k+1}} = y_{k+1} 0 = 0$

but we also have the constraint $n = \mathbf y^H \mathbf 1 = \mathbf 1^H \mathbf y = \big(y_1 + y_2 + ... + y_{k-1} + y_k \big) + y_{k+1} = \big(0 \big) + y_{k+1} $ 

hence we see that $y_{k+1} = n$.  Put differently, all of the eigenvalues for $\mathbf B$ are zero -- i.e. $\mathbf B$ is nilpotent.



# extension: 
**for some $m$ x $n$ matrix** $\mathbf G$ **and some $n$ x $m$ matrix** $\mathbf H$, $\mathbf {GH}$ and $\mathbf {HG}$ have the same non-zero eigenvalues (with the same algebraic multiplicities).

first notice 

$trace\big((\mathbf {GH})\big) = trace\big((\mathbf {HG})\big)$

via the cyclic property of the trace.  Now in general, we can say

for $r = \{2, ... , n\}$

$trace\big((\mathbf {GH})^r\big) = trace\big(\mathbf{GH} (\mathbf G\mathbf H)^{r-1}\big) = trace\big(\mathbf{H} (\mathbf G\mathbf H)^{r-1} \mathbf G\big) = trace\big((\mathbf {HG})^r\big)$

thus we have 

$trace\big((\mathbf {GH})^r\big) = trace\big((\mathbf {HG})^r\big)$ 

for $r = \{1, 2, ... , n\}$
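The trace relation is easy to spot-check numerically.  The sketch below uses randomly drawn rectangular `G` and `H` (an arbitrary choice, with an assumed fixed seed so the run is repeatable) and compares the power traces of the $m$ x $m$ product with those of the $n$ x $n$ product.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
G = rng.standard_normal((m, n))
H = rng.standard_normal((n, m))

GH = G @ H   # m x m
HG = H @ G   # n x n

# Power traces for r = 1, ..., n of both products.
traces_GH = [np.trace(np.linalg.matrix_power(GH, r)) for r in range(1, n + 1)]
traces_HG = [np.trace(np.linalg.matrix_power(HG, r)) for r in range(1, n + 1)]

for r, (a, b) in enumerate(zip(traces_GH, traces_HG), start=1):
    print(r, a, b)
```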

At this point, people will frequently notice that $r$ can be any natural number (i.e. it need not be capped at $n$), then import Newton's Identities and determine that the non-zero eigenvalues (and their algebraic multiplicities) must be the same.  While this approach is concise, Newton's Identities are fairly high powered machinery.  Here is another approach: 

for each unique eigenvector not in the nullspace of $\big(\mathbf{HG}\big)$

$\big(\mathbf H \mathbf G\big) \mathbf x = \lambda \mathbf x$

Now left multiply by $\mathbf G$

$\mathbf G \mathbf H \mathbf G \mathbf x = \mathbf G \lambda \mathbf x = \lambda \mathbf G \mathbf x$

$\big(\mathbf G \mathbf H\big) \big(\mathbf G \mathbf x\big) = \lambda \big(\mathbf G \mathbf x\big)$

so $\{\lambda, \mathbf x\}$ is the eigenpair of $\mathbf {HG}$ and $\{ \lambda, (\mathbf {Gx})\}$ is the eigenpair for $\mathbf{GH}$.  This holds for all $\lambda \neq 0$.  (The issue that comes up when $\lambda =0$ is that we run into issues with having the zero vector as an eigenvector in the second eigenpair, which is not allowed.)

Further, if we wanted to be extra thorough, we could also do a "backward pass" and say: 

for each unique eigenvector not in the nullspace of $\big(\mathbf{GH}\big)$

$\big(\mathbf{GH}\big) \mathbf v = \lambda \mathbf v$

Now left multiply by $\mathbf H$

$\mathbf {HGHv} = \mathbf {H} \lambda \mathbf{v} = \lambda \mathbf H \mathbf v$

$\big(\mathbf H \mathbf G\big) \big(\mathbf H \mathbf v\big) = \lambda \big(\mathbf H \mathbf v\big)$

so $\{\lambda, \mathbf v\}$ is the eigenpair of $\mathbf {GH}$ and $\{ \lambda, (\mathbf {Hv})\}$ is the eigenpair for $\mathbf{HG}$.  This holds for all $\lambda \neq 0$. 
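The forward pass can be checked numerically with the sketch below.  It takes each eigenpair of $\mathbf{HG}$ with a clearly non-zero eigenvalue and measures how well $\mathbf{Gx}$ satisfies the eigenvector equation for $\mathbf{GH}$.  The random matrices and the tolerance `1e-6` used to decide what counts as non-zero are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
G = rng.standard_normal((m, n))
H = rng.standard_normal((n, m))

# HG is n x n with rank at most m, so at least n - m eigenvalues are zero.
eigvals, eigvecs = np.linalg.eig(H @ G)

# Keep eigenpairs whose eigenvalue is clearly non-zero (the tolerance is a judgment call).
nonzero = np.abs(eigvals) > 1e-6

residuals = []
for lam, x in zip(eigvals[nonzero], eigvecs[:, nonzero].T):
    Gx = G @ x
    # If (lam, x) is an eigenpair of HG, then (lam, Gx) should be an eigenpair of GH.
    residuals.append(np.linalg.norm((G @ H) @ Gx - lam * Gx))

print(int(nonzero.sum()), max(residuals))
```

Eigenvalues may come back complex; the check works unchanged because every operation above accepts complex arrays.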


At this point we have enumerated all unique eigenvectors for $\mathbf{GH}$ and $\mathbf{HG}$ that are not in their respective nullspaces and have found the same eigenvalues between $\mathbf{GH}$ and $\mathbf {HG}$.  Since we have enumerated all unique eigenvectors not in the nullspace, we must have enumerated all unique $\lambda \neq 0$. Why? The vectors associated with unique eigenvalues must be linearly independent and hence eigenvector $i$ cannot be 'the same as' some other eigenvector $j$ unless $\lambda_i = \lambda_j$.  Put differently, if $\lambda_i \neq \lambda_j$ then we know that the associated eigenvectors are not linearly dependent -- i.e. in an inner product space like $\mathbb C^n$ this means eigenvector i must have some component that is completely independent of (read: orthogonal to) eigenvector j.  Put another way, iterating through the complete collection of unique eigenvectors outside the nullspace does not mean we will find multiple unique eigenvalues -- it merely means that if unique eigenvalues exist outside the nullspace, we will see them during this process.  For more information see the proof earlier in this posting called "Application of Vandermonde Matrices: Proof of Linear Independence of Eigenvectors associated with Unique Eigenvalues."

Now to confirm that each unique non-zero eigenvalue has the same algebraic multiplicity, we use the fact that $trace\big((\mathbf {GH})^r\big) = trace\big((\mathbf {HG})^r\big)$, and in a setup very similar to that in the above *"Full cycle trace relations and nilpotent matrices"*, **we collect each unique non-zero eigenvalue in a diagonal matrix, $\mathbf D$**, and thus $\mathbf D$ has the following eigenvalues $\{\lambda_1, \lambda_2, ..., \lambda_k\}$ 
For avoidance of doubt, $k$ is some natural number where $1 \leq k \leq min(m,n)$.

We then set up the $k$ x $k$ Vandermonde matrix $\mathbf K$

$\mathbf K = \bigg[\begin{array}{c|c|c|c|c}
\mathbf D^0 \mathbf 1 & \mathbf D^1 \mathbf 1 & \mathbf D^2 \mathbf 1 &\cdots & \mathbf D^{k-1} \mathbf 1
\end{array}\bigg]$

To model our trace relations, we collect the algebraic multiplicities of the unique non-zero eigenvalues of $\mathbf {GH}$ in $\mathbf y$ and place it on the left hand side of the equation.  We collect the algebraic multiplicities of the unique non-zero eigenvalues of $\mathbf {HG}$ and place them in $\mathbf z$ on the right hand side of the equation.  Column $r$ of each side then gives the trace of the $r$th power.  

$\mathbf y^H \mathbf D \mathbf K = \mathbf z^H \mathbf D \mathbf K $

In the special case where the trace is zero for $r = \{1, 2, ... , n\}$, we know that all eigenvalues are equal to zero, based on the preceding proof.  (This special case cannot occur of course, since by design we have only included $\lambda_i \neq 0$ in our $\mathbf D$ and $\mathbf K$.)    If the trace is not equal to zero for a "full cycle",  then the above equation exists and we may say 

$\mathbf y^H \mathbf D \mathbf K = \mathbf z^H \mathbf D \mathbf K $

$\mathbf y^H \big(\mathbf D \mathbf K\big)\big(\mathbf D \mathbf K\big)^{-1} = \mathbf y^H \mathbf D \mathbf {KK}^{-1} \mathbf D ^{-1} = \mathbf z^H \mathbf D \mathbf {KK}^{-1} \mathbf D ^{-1} = \mathbf z^H \big(\mathbf D \mathbf K\big)\big(\mathbf D \mathbf K\big)^{-1}$

$\mathbf y^H = \mathbf z^H $

Thus $\mathbf y = \mathbf z$, and we know that the algebraic multiplicity of each unique non-zero eigenvalue of $\mathbf{GH}$ equals its algebraic multiplicity in $\mathbf{HG}$ (and vice versa).  Hence we say that $\mathbf{GH}$ and  $\mathbf{HG}$ have the same non-zero eigenvalues.
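The multiplicity argument can also be run as a computation: given the unique non-zero eigenvalues, the traces determine the multiplicities through a linear solve.  The sketch below builds a hypothetical pair `G`, `H` so that $\mathbf{GH}$ has the eigenvalue 2 twice and 3 once, then recovers $\mathbf y$ and $\mathbf z$ from the traces of $\mathbf{GH}$ and $\mathbf{HG}$ separately.  The helper name `multiplicities` is an invention for this example.

```python
import numpy as np

M = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
Z = np.array([[1.0, -1.0],
              [0.5, 2.0],
              [3.0, 0.0]])
G = np.hstack([np.eye(3), Z])          # 3 x 5
H = np.vstack([M, np.zeros((2, 3))])   # 5 x 3, chosen so that G @ H == M

lam = np.array([2.0, 3.0])             # unique non-zero eigenvalues (known by construction)
k = len(lam)
D = np.diag(lam)
K = np.vander(lam, N=k, increasing=True)

def multiplicities(A):
    """Solve y^T (D K) = [trace(A), ..., trace(A^k)] for y."""
    t = np.array([np.trace(np.linalg.matrix_power(A, r)) for r in range(1, k + 1)])
    # Transpose the row-vector equation into the column form (D K)^T y = t.
    return np.linalg.solve((D @ K).T, t)

y = multiplicities(G @ H)   # from the 3 x 3 product
z = multiplicities(H @ G)   # from the 5 x 5 product

print(y, z)
```

Because $\mathbf{DK}$ is invertible, the solve has exactly one answer, which is the computational counterpart of concluding $\mathbf y = \mathbf z$.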



# Cayley Hamilton 

**claim**: each operator, $\mathbf B \in \mathbb C^{n \times n}$ obeys its characteristic polynomial.  i.e. 

$c_0 \mathbf I + c_1 \mathbf B + c_2 \mathbf B^2 + ... + c_{n-1}\mathbf B^{n-1} + c_{n}\mathbf B^n = c_0 \mathbf I +  \Sigma_{r=1}^{n} c_r \mathbf B^r = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$




**proof: non-defective case, where $\mathbf B$ has $n$ linearly independent eigenvectors.**

We know that each eigenvalue is a root to the characteristic polynomial.  

Put differently, we know that for $k = \{1, 2, ..., n-1, n\}$, we have an eigenpair of $\mathbf x_k, \lambda_k$

$c_0 +  \Sigma_{r=1}^{n} c_r \lambda_k^r = 0$


we can multiply this by $\mathbf x_k$:

$\big(c_0 +  \Sigma_{r=1}^{n} c_r \lambda_k^r \big) \mathbf x_k = \mathbf 0$


or equivalently

$\big(c_0 \mathbf I  + \Sigma_{r=1}^{n} c_r \mathbf B^r \big) \mathbf x_k = \mathbf 0$



Now let's collect these $n$ relationships in a system of equations:


$\big(c_0 \mathbf I + \Sigma_{r=1}^{n} c_r \mathbf B^r \big) \bigg[\begin{array}{c|c|c|c}
\mathbf x_1 & \mathbf x_2 &\cdots & \mathbf x_n
\end{array}\bigg] = \big(c_0 \mathbf I + \Sigma_{r=1}^{n} c_r \mathbf B^r \big)\mathbf X = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$

because the eigenvectors are stated to be linearly independent, we multiply each side on the right by $\mathbf X^{-1}$ seeing that


$\big(c_0 \mathbf I + \Sigma_{r=1}^{n} c_r \mathbf B^r \big) = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$

which proves that $\mathbf B$ follows its characteristic polynomial at least in the case where its eigenvectors form a basis.
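A quick numerical check of the non-defective case is sketched below.  It relies on `np.poly`, which returns the characteristic polynomial coefficients of a square matrix (highest power first), and on a small Horner-style helper, `poly_at_matrix`, that is written here for illustration.  The random test matrix is an assumption; a random matrix has distinct eigenvalues with probability one.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))

# Characteristic polynomial coefficients, highest power first:
# lambda^n + ... + c_1 lambda + c_0.
coeffs = np.poly(B)

def poly_at_matrix(coeffs, M):
    """Evaluate a polynomial at a square matrix using Horner's scheme."""
    result = np.zeros_like(M, dtype=complex)
    identity = np.eye(M.shape[0])
    for c in coeffs:
        result = result @ M + c * identity
    return result

p_of_B = poly_at_matrix(coeffs, B)

# Each eigenvalue should be a root of the same polynomial.
eigvals = np.linalg.eigvals(B)
p_of_eigs = np.polyval(coeffs, eigvals)

print(np.linalg.norm(p_of_B), np.max(np.abs(p_of_eigs)))
```

Both printed numbers should be at the level of floating point round-off rather than exactly zero.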

**proof: defective case:**  

a sketch of the proof covers two things

First, it is clear from Schur decomposition that any matrix is unitarily similar to an upper triangular matrix

$\mathbf Q^H \mathbf{BQ} = \mathbf T$ or 

$\mathbf{B} = \mathbf {QTQ}^H$


The eigenvalues of $\mathbf T$ are roots of the characteristic polynomial, hence the characteristic polynomial evaluated at $\mathbf B$, or equivalently at $\mathbf T$, must be a nilpotent matrix.  However Cayley Hamilton makes a stronger claim that in fact it is not just any nilpotent matrix, but the zero matrix.  

*A key takeaway from this, however, is that if a matrix was not nilpotent (i.e. it has at least one non-zero eigenvalue), and it becomes nilpotent after applying some other matrix's characteristic polynomial, then that means your matrix's eigenvalues are roots of that other matrix's characteristic polynomial -- i.e. your non-zero eigenvalues are non-zero eigenvalues for that other matrix.*

an outline of the analysis approach is that:

we can find an upper triangular matrix $\mathbf R$ where all entries are identical to $\mathbf T$ except diagonal elements are perturbed by small enough $\delta$, $\delta^2$, $\delta^3$, and so on as needed for all duplicate eigenvalues.  Afterward, we have 

$\big \Vert \mathbf{T} - \mathbf R \big \Vert_F^2 \lt \epsilon$

for any $\epsilon \gt 0$ 

where $\mathbf C = \mathbf{QRQ}^H$

But now each eigenvalue is unique and per the proof near the top of this posting $\mathbf C$  is now diagonalizable aka non-defective, and the earlier part of this cell -- i.e. the proof that all diagonalizable matrices obey Cayley Hamilton -- may be used. (There is a technical nit that by perturbing the eigenvalues, we have changed the characteristic polynomial, but this change is $O(\delta)$ and becomes arbitrarily small as $\delta \to 0$).  

Thus we may say, up to any arbitrary level of precision we can approximate $\mathbf B$ or $\mathbf T$ and find that those approximations all obey Cayley Hamilton, hence $\mathbf B$ obeys Cayley Hamilton as well. 
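The perturbation idea can be watched in action with the sketch below, which is an illustration rather than a proof.  It takes a defective upper triangular `T` (a single Jordan block, chosen as an example), perturbs the diagonal by $\delta$, $\delta^2$, $\delta^3$ to get `R`, and reports three quantities: the distance between `T` and `R`, the size of the perturbed characteristic polynomial evaluated at `R`, and its size when evaluated at the original `T`.  The helper `char_poly_at` is a name invented for this example.

```python
import numpy as np

def char_poly_at(M_for_poly, M_eval):
    """Evaluate the characteristic polynomial of M_for_poly at the matrix M_eval (Horner)."""
    result = np.zeros_like(M_eval, dtype=complex)
    identity = np.eye(M_eval.shape[0])
    for c in np.poly(M_for_poly):
        result = result @ M_eval + c * identity
    return result

# A defective upper triangular T: eigenvalue 2 repeated three times, one eigenvector.
T = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 2.0]])

norms = {}
for delta in [1e-1, 1e-2, 1e-3]:
    # Perturb the diagonal by delta, delta^2, delta^3 so all eigenvalues become distinct.
    R = T + np.diag([delta, delta**2, delta**3])
    norms[delta] = {
        "distance": np.linalg.norm(T - R, "fro"),
        "p_R_at_R": np.linalg.norm(char_poly_at(R, R)),
        "p_R_at_T": np.linalg.norm(char_poly_at(R, T)),
    }

for delta, row in norms.items():
    print(delta, row)
```

The middle quantity stays at round-off level for every $\delta$, while the first and last shrink roughly in proportion to $\delta$, which is the numerical face of the "technical nit" about the characteristic polynomial changing by $O(\delta)$.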

*note: there are purely algebraic proofs of Cayley Hamilton for defective matrices that do not require limits / analysis.  The analysis view is more intuitive, but requires some heavier duty machinery to be fully rigorous.  A purely algebraic proof is pending.  In any case these different approaches are complementary.*  



We now combine two cells: "Full cycle trace relations and nilpotent matrices" and the above "Cayley Hamilton" proofs.  

**claim:**  

If for $n$ x $n$ matrix $\mathbf X$ and $m$ x $m$ matrix $\mathbf Y$:

$trace\big(\mathbf X^k\big)$ = $trace\big(\mathbf Y^k \big)$  
for natural numbers $k = \{1, 2, 3, ...\}$, then they have the same non-zero eigenvalues (with same algebraic multiplicity for each non-zero eigenvalue).


**commentary:** This may be useful in cases where perhaps we know the traces, or even the eigenvalues, of $\mathbf X$ and want to make inferences about $\mathbf Y$.  

**proof:**

for convenience notice that, if for $r = \{1, 2, 3, ... , max(m,n)\}$

$trace\big(\mathbf X^r\big) = trace\big(\mathbf Y^r\big)  = 0 $, then both $\mathbf X$ and $\mathbf Y$ are nilpotent.  

*The rest of the writeup assumes that they are not nilpotent matrices.*

Now suppose we know the eigenvalues of $\mathbf X$, and in particular the non-zero eigenvalues of $\mathbf X$.  Then we know the characteristic polynomial, $p$ and use Cayley Hamilton to see the below mapping:  

$p\big(\mathbf X\big) = c_0 \mathbf I + c_1 \mathbf X + c_2 \mathbf X^2 + ... + c_{n-1}\mathbf X^{n-1} + c_{n}\mathbf X^n = c_0 \mathbf I +  \Sigma_{j=1}^{n} c_j \mathbf X^j = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$

We right multiply the above by $\mathbf X$

$p\big(\mathbf X\big)\mathbf X = c_0 \mathbf X + c_1 \mathbf X^2 + c_2 \mathbf X^3 + ... + c_{n-1}\mathbf X^{n} + c_{n}\mathbf X^{n+1} =\bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg] \mathbf X = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$

Notice that if we square the above, we have 

$\big(p\big(\mathbf X\big)\mathbf X\big)^2 = \big(c_0 \mathbf X + c_1 \mathbf X^2 + c_2 \mathbf X^3 + ... + c_{n-1}\mathbf X^{n} + c_{n}\mathbf X^{n+1}\big)^2 = \big(\Sigma_{j=1}^{n+1} (c_{j-1})\mathbf X^j\big) \big(\Sigma_{j=1}^{n+1} (c_{j-1})\mathbf X^j\big) = \Sigma_i \gamma_i \mathbf X^i = \bigg[\begin{array}{c|c|c|c}
  \mathbf 0 & \mathbf 0 &\cdots & \mathbf 0
\end{array}\bigg]$

where $\gamma_i$ is the appropriate scalar that comes from multiplying out the respective $c_j$ term.  We will return to this momentarily.

Taking the trace, we see:

$trace\big(p\big(\mathbf X\big)\mathbf X \big) = c_0 trace\big(\mathbf X\big) + c_1 trace\big(\mathbf X^2\big) + c_2 trace\big(\mathbf X^3\big) + ... + c_{n-1} trace\big(\mathbf X^{n}\big) + c_{n} trace\big(\mathbf X^{n+1}\big) = 0$

but this is equivalent to 

$trace\big(p\big(\mathbf Y\big)\mathbf Y \big) = c_0 trace\big(\mathbf Y\big) + c_1 trace\big(\mathbf Y^2\big) + c_2 trace\big(\mathbf Y^3\big) + ... + c_{n-1} trace\big(\mathbf Y^{n}\big) + c_{n} trace\big(\mathbf Y^{n+1}\big) = 0$


and more generally, we see that 

for $r = \{1, 2, 3, ... , max(m,n)\}$

$trace\Big( \big(p\big(\mathbf X\big)\mathbf X \big)^r\Big)= trace\Big(\big(\Sigma_{j=1}^{n+1} (c_{j-1})\mathbf X^j\big)^r\Big) = trace\big(\Sigma_i \gamma_i \mathbf X^i\big) =\Sigma_i \gamma_i trace\big(\mathbf X^i\big) = 0$

where again, $\gamma_i$ indicates the scalar result of multiplying the relevant $c_j$ terms.  We then recall that for each term in this finite series, for the relevant natural numbers $i$, 

$trace\big(\mathbf X^i\big) = trace\big(\mathbf Y^i\big)$  

and scaling this by some $\gamma_i$,

$\gamma_i trace\big(\mathbf X^i\big) = \gamma_i trace\big(\mathbf Y^i\big)$  

hence 

$0 = \Sigma_i \gamma_i trace\big(\mathbf X^i\big) = \Sigma_i \gamma_i trace\big(\mathbf Y^i\big)$  

We then conclude that for $r = \{1, 2, 3, ... , max(m,n)\}$

$trace\Big( \big(p\big(\mathbf X\big)\mathbf X \big)^r\Big) = trace\Big( \big(p\big(\mathbf Y\big)\mathbf Y \big)^r\Big) = 0$  

We now know that the matrix given by $\Big(p\big(\mathbf Y\big)\mathbf Y\Big)$ is nilpotent.  Recalling that $p\big(\mathbf Y\big)$ is just a finite series of $ \mathbf Y^k$ with particular scalars applied for each appropriate $k$, we do Schur Decomposition and see  

$ p\big(\mathbf Y\big) = \mathbf {QUQ}^H $ and $\mathbf Y = \mathbf {QRQ}^H$, then 

$\Big(p\big(\mathbf Y\big)\mathbf Y\Big) = \Big(\mathbf {QUQ}^H \mathbf {QRQ}^H \Big) = \mathbf {QURQ}^H$

since $\mathbf U$ and $\mathbf R$ are both upper triangular, the diagonal entries of $\mathbf{UR}$ are the products $u_{i,i} r_{i,i}$.  Because $p\big(\mathbf Y\big)$ is a polynomial in $\mathbf Y$, the same $\mathbf Q$ triangularizes both, so $\mathbf U = p\big(\mathbf R\big)$ and $u_{i,i} = p\big(r_{i,i}\big)$.  $\mathbf Y$ is not nilpotent (i.e. $\mathbf R$ has at least one non-zero diagonal entry) but $\big(\mathbf{UR}\big)$ is nilpotent, so every product $u_{i,i} r_{i,i}$ is zero, which tells us that $p\big(r_{i,i}\big) = 0$ wherever $r_{i,i} \neq 0$.  

We thus see that all non-zero diagonal elements of $\mathbf R$ -- aka all non-zero eigenvalues of $\mathbf Y$ -- are roots of the characteristic polynomial given by $p$.  (Note that $\mathbf U$ itself need not be strictly upper triangular: wherever $r_{i,i} = 0$ we only learn that $u_{i,i} = p\big(0\big) = c_0$, which is zero only if $\mathbf X$ is singular.  This is not needed here so we don't pursue it.)  

At a bare minimum, the above shows that the set of unique non-zero eigenvalues of $\mathbf Y$ is a subset of the set of unique non-zero eigenvalues of $\mathbf X$
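The chain of steps above can be traced on a concrete pair.  In the sketch below, `X` and `Y` are hypothetical matrices of different sizes built so that $trace\big(\mathbf X^k\big) = trace\big(\mathbf Y^k\big) = 2^k + 3^k$; `Y` also carries a zero eigenvalue inside a Jordan block so that $p\big(\mathbf Y\big)\mathbf Y$ is not simply the zero matrix.  The helper `poly_at_matrix` is written here for illustration.

```python
import numpy as np

X = np.diag([2.0, 3.0])                       # non-zero eigenvalues 2 and 3
Y = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0]])          # eigenvalues 2, 3, 0, 0

def trace_power(M, k):
    return np.trace(np.linalg.matrix_power(M, k))

same_traces = all(np.isclose(trace_power(X, k), trace_power(Y, k)) for k in range(1, 9))

coeffs = np.poly(X)   # characteristic polynomial of X, highest power first

def poly_at_matrix(coeffs, M):
    """Evaluate a polynomial at a square matrix using Horner's scheme."""
    result = np.zeros_like(M)
    identity = np.eye(M.shape[0])
    for c in coeffs:
        result = result @ M + c * identity
    return result

pY_Y = poly_at_matrix(coeffs, Y) @ Y
r_max = max(X.shape[0], Y.shape[0])
traces = [trace_power(pY_Y, r) for r in range(1, r_max + 1)]

# Non-zero eigenvalues of Y should be roots of X's characteristic polynomial.
eigs_Y = np.linalg.eigvals(Y)
nonzero_eigs = eigs_Y[np.abs(eigs_Y) > 1e-6]
roots_check = np.polyval(coeffs, nonzero_eigs)

print(same_traces, traces, roots_check)
```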

- - - -
*Here are two different approaches now to finish off the proof*   
- - - -


**(1)**  

Do the exact same argument used above, except swap $\mathbf X$ for $\mathbf Y$. 

(In many programming languages we would say:  
(a) $\mathbf X, \mathbf Y := \mathbf Y, \mathbf X$  
(b) call on argument used above, once.
)  

At a minimum, doing that shows that the set of unique non-zero eigenvalues of $\mathbf X$ is a subset of the set of unique non-zero eigenvalues of $\mathbf Y$

hence with respect to unique non-zero eigenvalues we have $\lambda\big(\mathbf X\big) \subset \lambda\big(\mathbf Y\big)$ and from before, with respect to nonzero eigenvalues, we have $\lambda\big(\mathbf Y\big) \subset \lambda\big(\mathbf X\big)$ which proves that with respect to unique non-zero eigenvalues $\lambda\big(\mathbf Y\big) = \lambda\big(\mathbf X\big)$.  

As before we collect these unique non-zero eigenvalues in a diagonal matrix $\mathbf D$.  There are $t$ unique non-zero eigenvalues, and $\mathbf D$ is $t$ x $t$.  

Collect the algebraic multiplicities for these unique nonzero eigenvalues of $\mathbf X$ in $\mathbf a_x$ and collect the algebraic multiplicities for the unique nonzero eigenvalues of $\mathbf Y$ in $\mathbf a_y$.  (As reminder, because they are algebraic multiplicities, each entry in $\mathbf a_x$ and $\mathbf a_y$ must be an integer $\geq 1$.)

Thus we have 
$\mathbf W = \bigg[\begin{array}{c|c|c|c|c}
\mathbf D^0 \mathbf 1 & \mathbf D^1 \mathbf 1 & \mathbf D^2 \mathbf 1 &\cdots & \mathbf D^{t-1} \mathbf 1
\end{array}\bigg]$

and we show that these traces are equal with the following expression:  


$\mathbf a_x^H \mathbf D \mathbf W = \mathbf a_y^H \mathbf D \mathbf W $  
$\mathbf a_x^H \big(\mathbf D \mathbf W\big)\big(\mathbf D \mathbf W\big)^{-1} = \mathbf a_x^H \mathbf D \mathbf W \mathbf W^{-1} \mathbf D^{-1} = \mathbf a_x^H = \mathbf a_y^H = \mathbf a_y^H \mathbf D \mathbf W \mathbf W^{-1} \mathbf D^{-1}  = \mathbf a_y^H  \big(\mathbf D \mathbf W\big)\big(\mathbf D \mathbf W\big)^{-1} $  

hence $\mathbf a_x^H = \mathbf a_y^H$

and equivalently: $\mathbf a_x = \mathbf a_y$

and we see that $\mathbf X$ and $\mathbf Y$ not only have the same unique non-zero eigenvalues, but that each one of those unique non-zero eigenvalues has the same algebraic multiplicity.
- - - -
*alternative approach to finish the proof:*  
**(2)**

In this case, we only know that with respect to unique non-zero eigenvalues, we have 

$\lambda\big(\mathbf Y\big) \subset \lambda\big(\mathbf X\big)$

we suppose that $\lambda\big(\mathbf Y\big) \neq \lambda\big(\mathbf X\big)$, i.e. that $\mathbf X$ has some unique non-zero eigenvalues that $\mathbf Y$ doesn't have.  Thus we assume that $\mathbf Y$ has $k$ unique non-zero eigenvalues and $\mathbf X$ has $r$ unique non-zero eigenvalues, where $1 \leq k \lt r$.

and we collect the algebraic multiplicities for the $k$ unique non-zero eigenvalues of $\mathbf Y$ in $\mathbf a_Y$

We create a short, fat $k$ x $r$ Vandermonde matrix for $\mathbf Y$, below 

$\mathbf W_{\mathbf Y} = \begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \dots  & \lambda_1^{r-1}\\ 
1 & \lambda_2 & \lambda_2^2 & \dots &  \lambda_2^{r-1} \\ 
\vdots & \vdots & \vdots & \ddots & \vdots & \\ 
1 & \lambda_{k} & \lambda_{k}^{2} & \dots  & \lambda_{k}^{r-1}
\end{bmatrix}= \bigg[\begin{array}{c|c|c|c|c}
\mathbf D^0 \mathbf 1 & \mathbf D^1 \mathbf 1 & \mathbf D^2 \mathbf 1 &\cdots & \mathbf D^{r-1} \mathbf 1
\end{array}\bigg]$

$rank\big(\mathbf W_Y\big) = k$ 

now we setup the trace relation as 

$\mathbf a_Y^H \mathbf D \mathbf W_{\mathbf Y} = \big(\mathbf 1^H Diag\big(\mathbf a_Y\big)\big) \mathbf D \mathbf W_Y = \begin{bmatrix} trace\big(\mathbf Y\big) & trace\big(\mathbf Y^2\big)  & trace\big(\mathbf Y^3\big) & \cdots & trace\big(\mathbf Y^r\big) \end{bmatrix}$  

from here we build this out to an $r$ x $r$ matrix, where we have 

$\mathbf H = \Bigg[\begin{matrix}
trace\big(\mathbf Y\big)  & trace\big(\mathbf Y^2\big)  & trace\big(\mathbf Y^3\big)  &\cdots  &trace\big(\mathbf Y^r\big) \\ 
trace\big(\mathbf Y^2\big)  & trace\big(\mathbf Y^3\big)  & trace\big(\mathbf Y^4\big)  & \cdots & trace\big(\mathbf Y^{r+1}\big) \\ 
trace\big(\mathbf Y^3\big) &  trace\big(\mathbf Y^4\big)& trace\big(\mathbf Y^5\big)  & \cdots &trace\big(\mathbf Y^{r+2}\big) \\ 
\vdots & \vdots & \vdots &  \ddots & \vdots \\ 
trace\big(\mathbf Y^r\big) & trace\big(\mathbf Y^{r+1}\big) &trace\big(\mathbf Y^{r+2}\big)  & \cdots & trace\big(\mathbf Y^{2r}\big)
\end{matrix}\Bigg]$  

note: While there may be some LaTeX rendering issues, it is clear that the above matrix $\mathbf H$ is a Hankel matrix.  It is symmetric, but if any of the entries are complex, it is not Hermitian. From here, notice that while the first row is given by

$\begin{bmatrix}
1 & 1 & 1 & \cdots  &1 
\end{bmatrix}Diag\big(\mathbf a_Y\big) \mathbf D \mathbf W_{\mathbf Y}$

the second row is given by 

$\begin{bmatrix}
\lambda_1 & \lambda_2 & \lambda_3 & \cdots  &\lambda_k
\end{bmatrix}Diag\big(\mathbf a_Y\big) \mathbf D \mathbf W_{\mathbf Y}$

the third row is given by 

$\begin{bmatrix}
\lambda_1^2 & \lambda_2^2 & \lambda_3^2 & \cdots  &\lambda_k^2 \end{bmatrix}Diag\big(\mathbf a_Y\big) \mathbf D \mathbf W_{\mathbf Y}$

... and the final, rth row is given by 

$\begin{bmatrix}
\lambda_1^{r-1} & \lambda_2^{r-1} & \lambda_3^{r-1} & \cdots  &\lambda_k^{r-1} \end{bmatrix}Diag\big(\mathbf a_Y\big) \mathbf D \mathbf W_{\mathbf Y}$

thus we have

$\mathbf W_{\mathbf Y}^T Diag\big(\mathbf a_Y\big) \mathbf D \mathbf W_{\mathbf Y} = \mathbf H$

notice that the above $\mathbf W^T \neq \mathbf W^H$ except in the special case where all $\lambda_i$'s are real.  The above matrix is square and it is *not* full rank. It is at most rank $k$ because $rank\big(\mathbf W_Y\big) = k$, and as a reminder we have assumed $k \lt r$.  

Put differently $det\big(\mathbf H\big) = 0$.  
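The rank claim can be checked numerically.  The sketch below uses a hypothetical `Y` with $k = 2$ unique non-zero eigenvalues and picks $r = 3$ to play the role of the assumed larger count, then builds the Hankel matrix of traces directly and compares it with the factored form $\mathbf W_{\mathbf Y}^T Diag\big(\mathbf a_Y\big) \mathbf D \mathbf W_{\mathbf Y}$.  The explicit rank tolerance is an assumption chosen to sit well above round-off.

```python
import numpy as np

# Hypothetical Y: eigenvalue 2 (twice), 3 (once), and one zero eigenvalue.
Y = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 0.0, 0.0]])
lam = np.array([2.0, 3.0])   # unique non-zero eigenvalues
a_Y = np.array([2.0, 1.0])   # their algebraic multiplicities
k, r = len(lam), 3           # r > k, as in the assumption being tested

def trace_power(M, p):
    return np.trace(np.linalg.matrix_power(M, p))

# Hankel matrix of traces: entry (i, j) is trace(Y^(i + j + 1)) with zero-based i, j.
H = np.array([[trace_power(Y, i + j + 1) for j in range(r)] for i in range(r)])

D = np.diag(lam)
W_Y = np.vander(lam, N=r, increasing=True)   # short, fat k x r Vandermonde matrix
H_factored = W_Y.T @ np.diag(a_Y) @ D @ W_Y

rank_H = np.linalg.matrix_rank(H, tol=1e-8)
det_H = np.linalg.det(H)

print(H)
print(rank_H, det_H)
```

The rank comes out as $k$, not $r$, so the $r$ x $r$ Hankel matrix is singular whenever $k \lt r$.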

If we work through the exact same calculations, for $\mathbf X$, we find that 

$\mathbf W_{\mathbf X}^T Diag\big(\mathbf a_X\big) \mathbf \Lambda \mathbf W_{\mathbf X} = \mathbf H$

where $ \mathbf \Lambda$ has the same diagonal elements as $\mathbf D$ for $\lambda_{i,i}$ for $i = \{1,2, ..., k\}$, and has the additional unique, non-zero eigenvalues that we've assumed in $\lambda_{i,i}$ for $i = \{k + 1,k + 2, ..., r\}$. 


$\mathbf W_{\mathbf X}$ is an $r$ x $r$ matrix with each of those $r$ unique non-zero eigenvalues in a Vandermonde Matrix, i.e. 

$\mathbf W_{\mathbf X} = \bigg[\begin{array}{c|c|c|c|c}
\mathbf \Lambda^0 \mathbf 1 & \mathbf\Lambda^1 \mathbf 1 & \mathbf\Lambda^2 \mathbf 1 &\cdots & \mathbf \Lambda^{r-1} \mathbf 1
\end{array}\bigg]$


further notice that $a_{X, i} \geq 1$ for $i = \{1, 2, ...., r\}$, hence $Diag\big(\mathbf a_X\big)$  is invertible, and $\mathbf \Lambda$ is invertible because each of its diagonal entries is non-zero. And since each eigenvalue in $\mathbf W_{\mathbf X}$ is unique, the square matrix given by $\mathbf W_{\mathbf X}$ is full rank.  

Thus we have 

$det\big(\mathbf W_{\mathbf X}^T Diag\big(\mathbf a_X\big) \mathbf \Lambda \mathbf W_{\mathbf X}\big) = det\big(\mathbf H \big) \neq 0 = det\big(\mathbf H \big) = det\big(\mathbf W_{\mathbf Y}^T Diag\big(\mathbf a_Y\big) \mathbf D \mathbf W_{\mathbf Y}\big) $

which is a contradiction.  Or equivalently

$rank\big(\mathbf W_{\mathbf X}^T Diag\big(\mathbf a_X\big) \mathbf \Lambda \mathbf W_{\mathbf X}\big) = rank\big(\mathbf H \big) = r = rank\big(\mathbf H \big) = rank\big(\mathbf W_{\mathbf Y}^T Diag\big(\mathbf a_Y\big) \mathbf D \mathbf W_{\mathbf Y}\big) \leq k$

Because we have assumed that $k \lt r$ (or equivalently $r \gt k$), it cannot be the case that $r \leq k$, an obvious contradiction.  Notice that even if we swapped $\mathbf X$ and $\mathbf Y$ ala the finish for alternative *(1)*, the equality can hold **iff** $k = r$.  

Thus we observe that the number of unique non-zero eigenvalues for $\mathbf Y$, given by $k$ must be equal to the number of unique non-zero eigenvalues of $\mathbf X$, given by $r$.  

as in *(1)*, we recognize that $\mathbf Y$ has $k$ unique non-zero eigenvalues, which are contained in the set of unique non-zero eigenvalues of $\mathbf X$ which contains $r$ elements (i.e. unique non-zero eigenvalues), and we now know that $k = r$, thus we can set up our Vandermonde Matrix $\mathbf W$ such that 

$\mathbf W_{\mathbf Y} = \mathbf W_{\mathbf X} = \mathbf W$

we collect the algebraic multiplicities for unique non-zero eigenvalues of $\mathbf X$ in $\mathbf a_x$ and the respective multiplicities for $\mathbf Y$ in $\mathbf a_y$, and we have the below equation:    

$\mathbf a_x^H \mathbf D \mathbf W = \mathbf a_y^H \mathbf D \mathbf W $  
$\mathbf a_x^H \big(\mathbf D \mathbf W\big)\big(\mathbf D \mathbf W\big)^{-1} = \mathbf a_y^H  \big(\mathbf D \mathbf W\big)\big(\mathbf D \mathbf W\big)^{-1} $  

hence $\mathbf a_x^H = \mathbf a_y^H$

and equivalently: $\mathbf a_x = \mathbf a_y$


Thus, if $trace\big(\mathbf X^k\big)$ = $trace\big(\mathbf Y^k \big)$  for natural numbers $k = \{1, 2, 3, ...\}$, then $\mathbf X$ and $\mathbf Y$ have the same non-zero eigenvalues (with same algebraic multiplicity for each non-zero eigenvalue).

**remark:**

*(2)* took a lot more lines than *(1)*, why bother?  

First, proving the same thing from two (or more) different perspectives is frequently instructive, helps intuition, and is a great way to flush out 'bugs' in one's logic.  

Second, it may be useful to take a step back and realize that if $\mathbf X$ and $\mathbf Y$ are nilpotent, then we handled the entire proof at time zero.  

Otherwise, if they are not nilpotent, *(2)* is in some sense simpler.  

When using *(2)* with non-nilpotent matrices, in all cases we may find some matrix $\mathbf Z$ that is not defective (i.e. even with repeated eigenvalues, we choose eigenvectors that form a basis), where $trace\big(\mathbf Z^k\big) = trace\big(\mathbf X^k\big)$ for $k =\{1,2,3, ...\}$, and as before $ trace\big(\mathbf X^k\big) = trace\big(\mathbf Y^k\big)$.  

Specifically, we can then call on the *easy* Cayley Hamilton proof that applies for a non-defective square matrix like $\mathbf Z$, and use proof *(2)* to show that $\mathbf X$ has the same non-zero eigenvalues, with the same multiplicities, as $\mathbf Z$.  We can then repeat this process and see that $\mathbf Y$ has the same non-zero eigenvalues, with the same multiplicities as $\mathbf Z$.  Thus $\mathbf X$ has the same non-zero eigenvalues, with the same multiplicities as $\mathbf Y$.  




