# Schur's Inequality

Note: this inequality is frequently used without attribution.  
It is referred to as Schur's Inequality in some sources, e.g. p. 150 of Prasolov's "Problems and Theorems in Linear Algebra".  

**Schur's Triangulation Theorem**  
with the fundamental theorem of algebra and Gram-Schmidt in hand, we prove that 
any $\mathbf A \in \mathbb C^{n \times n}$ is unitarily similar to an upper triangular matrix $\mathbf T$.  The proof works by induction on $n$.  In the case of $n=1$ the result is trivially true -- let the unitary matrix be the $1 \times 1$ matrix $[1]$ and we get that $\mathbf A$ is similar to (and equal to) itself and hence upper triangular.  

We now have an induction hypothesis that we may call on for any $(n-1) \times (n-1)$ matrix.  (The proof below closely follows that in Meyer's *Matrix Analysis* though we don't require the unitary matrix to be involutive as done by Meyer.)  

By the fundamental theorem of algebra, the characteristic polynomial of $\mathbf A$ has at least one root, so $\mathbf A$ has at least one eigenvalue $\lambda$ with an associated eigenvector $\mathbf x$ 

-- i.e. $\mathbf A \mathbf x = \lambda \mathbf x$ where we take $\big \Vert \mathbf x\big \Vert_2 = 1$  

Now construct a unitary matrix $\mathbf Q$ where   
$\mathbf Q := \bigg[\begin{array}{c|c|c|c}\mathbf x & \mathbf v_1 &\cdots & \mathbf v_{n-1}\end{array}\bigg] = \bigg[\begin{array}{c|c}
\mathbf x & \mathbf V\end{array}\bigg]$  

note:  
$\mathbf Q \mathbf e_1 = \mathbf x$  
and  
$\mathbf e_1 = \mathbf Q^{-1} \mathbf x = \mathbf Q^{H} \mathbf x$   


(The easiest way to come up with $\mathbf Q$ is to generate $n-1$ random vectors with components drawn uniformly at random from $[-1,1]$, stack them to the right of $\mathbf x$ in a matrix, and run QR factorization on that matrix, discarding $\mathbf R$; the stacked matrix is invertible with probability one.  Alternatively, take the matrix of $n$ standard basis vectors (i.e. $\mathbf I_n$) and substitute $\mathbf x$ for any $\mathbf e_i$ with $x_i \neq 0$, which gives a linearly independent set -- easy proof that the result has $n$ linearly independent column vectors: Cramer's rule / multilinearity of the determinant -- then permute the columns so that $\mathbf x$ is the first column of the matrix and run QR factorization, discarding $\mathbf R$.  In either case the first column of the unitary factor equals $\mathbf x$ up to a unit-modulus scalar, which is still a unit-norm eigenvector, so we may simply rename that column $\mathbf x$.)  

so 

$\mathbf Q^H \mathbf A \mathbf Q $  
$= \mathbf Q^H \big(\mathbf A \mathbf Q\big) $  
$= \mathbf Q^H \bigg[\begin{array}{c|c}\mathbf A\mathbf x & \mathbf {AV}\end{array}\bigg]$  
$= \mathbf Q^H \bigg[\begin{array}{c|c} \lambda \mathbf x  & \mathbf {AV}\end{array}\bigg]$  
$= \bigg[\begin{array}{c|c}\lambda \mathbf Q^H  \mathbf x  & \mathbf Q^H \mathbf {AV}\end{array}\bigg]$  
$= \bigg[\begin{array}{c|c} \lambda \mathbf e_1  & \begin{bmatrix} \mathbf x^H\\\mathbf V^H \end{bmatrix} \mathbf {AV}\end{array}\bigg]$  
$= \bigg[\begin{array}{c|c} \lambda \mathbf e_1  & \begin{bmatrix} \mathbf x^H  \mathbf {AV} \\\mathbf V^H  \mathbf {AV} \end{bmatrix}\end{array}\bigg]$  
$ = \begin{bmatrix} \lambda & \mathbf x^H\mathbf {AV} \\ \mathbf 0 & \mathbf V^H\mathbf {AV} \end{bmatrix}  $      

- - - -  
*note: the 2nd and 3rd to last lines render properly locally but do not render on Github, unfortunately. They further clarify what the blocked matrix looks like, but may be skipped as they are there (ironically?) for extra clarity only*  
- - - -  
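As a small numerical illustration of the deflation step above, here is a dependency-free sketch in plain Python. The helper names (`conj_transpose`, `matmul`, `unitary_with_first_column`) and the test matrix are assumptions made for this example; the unitary matrix is built with Gram-Schmidt on $\mathbf x$ followed by the standard basis vectors, which is one of several valid constructions.

```python
import math

def conj_transpose(M):
    """Conjugate transpose of a matrix stored as a list of rows."""
    rows, cols = len(M), len(M[0])
    return [[complex(M[r][c]).conjugate() for r in range(rows)] for c in range(cols)]

def matmul(A, B):
    """Plain triple-loop matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def unitary_with_first_column(x):
    """Gram-Schmidt on [x, e_1, ..., e_n]; returns Q whose first column is x / ||x||."""
    n = len(x)
    candidates = [list(x)] + [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    basis = []
    for v in candidates:
        w = [complex(t) for t in v]
        for q in basis:
            coeff = sum(qi.conjugate() * wi for qi, wi in zip(q, w))  # <q, w>
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(abs(t) ** 2 for t in w))
        if norm > 1e-12:  # skip candidates already in the span
            basis.append([t / norm for t in w])
        if len(basis) == n:
            break
    return [[basis[j][i] for j in range(n)] for i in range(n)]  # basis vectors as columns

# Every row of A sums to 6, so the all-ones vector is an eigenvector with lambda = 6.
A = [[1, 2, 3], [4, 0, 2], [0, 5, 1]]
lam = 6
x = [1 / math.sqrt(3)] * 3

Q = unitary_with_first_column(x)
B = matmul(matmul(conj_transpose(Q), A), Q)  # Q^H A Q
gram = matmul(conj_transpose(Q), Q)          # should be the identity
```

Up to rounding, the first column of `B` is $\lambda \mathbf e_1$, while the remaining $2 \times 2$ lower-right block plays the role of $\mathbf V^H \mathbf{AV}$.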


but notice that $\mathbf V^H \mathbf {AV}$ is $(n-1) \times (n-1)$, so our induction hypothesis tells us there is some unitary matrix $\mathbf U_{n-1}$ such that $\mathbf U_{n-1}^H \mathbf V^H \mathbf {AV} \mathbf U_{n-1} = \mathbf T_{n-1}$ is upper triangular.  Thus making use of one more blocked multiplication with 

$\mathbf U := \begin{bmatrix} 1 & \mathbf 0^T \\ \mathbf 0 & \mathbf U_{n-1} \end{bmatrix}$   

we get    
$\mathbf U^H \mathbf Q^H \mathbf A \mathbf Q\mathbf U  $   
$=\begin{bmatrix} \lambda & \mathbf x^H\mathbf {AVU}_{n-1} \\ \mathbf 0 & \mathbf U_{n-1}^H \mathbf V^H\mathbf {AV}\mathbf U_{n-1} \end{bmatrix}$  
$=\begin{bmatrix} \lambda & \mathbf x^H\mathbf {AVU}_{n-1} \\ \mathbf 0 & \mathbf T_{n-1} \end{bmatrix} $  
$= \mathbf T$  

we observe that $\big(\mathbf Q\mathbf U\big) $ is unitary, which completes the proof  

*remark:*  
at each 'stage' the key technique is to recognize that we are dealing with a matrix living in a $d$ dimensional space -- or, if the reader prefers, a matrix whose characteristic polynomial has degree $d$ -- (whether $d=1$ or $d=2$... or $d=n-1$ or $d = n$) and that matrix has at least one eigenvalue with an accompanying eigenvector (with 2 norm set equal to 1), and so we can repeat the same recipe as before -- apply a well chosen $d \times d$ unitary matrix that has this eigenvector as its first column vector.  If we view the problem recursively, we first apply this with $d=n$, then apply the procedure with $d=n-1$, then with $d=n-2$... and so on until $d=2$ and then finally $d=1$ (which we get, in effect, for free) -- thus we recursively call on this unitary matrix construction and application routine $n$ times and as a result get the desired unitary similarity transform.  

*additional remark:*    
Schur Triangularization gives rise to many nice results since we now know that every square matrix with scalars in $\mathbb C$ is unitarily similar to a triangular matrix (and the above proof is *much* easier than Jordan form results).  One result that follows is Cayley Hamilton (see Vandermonde Matrices notebook).  

An immediate corollary is any Hermitian (including real symmetric) matrix is unitarily diagonalizable, because  
$\mathbf T = \mathbf Q^{-1} \mathbf A \mathbf Q = \mathbf Q^H \mathbf A \mathbf Q  = \mathbf Q^H \mathbf A^H \mathbf Q =  \mathbf T^H$  

which means $\mathbf T$ is both upper triangular and lower triangular -- i.e. it is diagonal.  This unlocks the primary results of the spectral theorem (as well as a way to derive SVD) that are of interest.  

Schur Triangularization also gives rise to Schur's Inequality -- that and various extensions and corollaries are the subject of the rest of this notebook. As we'll see, Schur's Inequality gives rise to other interpretations for unitarily diagonalizable matrices, which are then called Normal Matrices (which is a bit more general of an idea).  



**extension**  
an interesting, different approach to the above is to 'merely' triangulate our matrix with some invertible matrix $\mathbf S$ (see page 118 of Artin 1st ed), so at each stage of the above process we 'merely' extend to a basis /invertible matrix, but we disregard notions of orthogonality -- this approach has the benefit of working over arbitrary fields, so long as the eigenvalues exist in those fields.  This would give us  

$\mathbf A = \mathbf S \mathbf T' \mathbf S^{-1}$  

*after* this we can run QR factorization, so $\mathbf S = \mathbf Q\mathbf R$ and  
with $\mathbf T:= \big( \mathbf R \mathbf T' \mathbf R^{-1}\big)$   
recalling that upper triangular matrices are closed under multiplication and inversion (when the inverse exists... if we specialize to invertible matrices this is a way of saying that upper triangular matrices form a subgroup)   

$\mathbf A = \mathbf S \mathbf T' \mathbf S^{-1}=\mathbf Q\big( \mathbf R \mathbf T' \mathbf R^{-1}\big)\mathbf Q^{-1}=\mathbf Q\mathbf T\mathbf Q^{*}$  

and we recover Schur Triangularization  




Schur's Inequality applies to any $n \times n$ matrix over the complex field, i.e. $\mathbf A \in \mathbb C^{n \times n}$, with eigenvalues $\lambda_1, \dots, \lambda_n$ (counted with algebraic multiplicity):

**Claim:**

$\big \Vert \mathbf A \big \Vert_F^{2} = \text{trace}\big(\mathbf A^H \mathbf A\big) \geq \sum_{i = 1}^{n} \big \vert \lambda_i\big \vert ^2 \geq \big \vert \sum_{i = 1}^{n} \lambda_i^2\big \vert = \Big \vert \text{trace}\big(\mathbf A \mathbf A\big) \Big \vert $

note that  $\sum_{i = 1}^{n} \big \vert \lambda_i\big \vert ^2 \geq \big \vert \sum_{i = 1}^{n} \lambda_i^2\big \vert$ was included at the end via the triangle inequality
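A quick numerical sanity check of the claimed chain, as a sketch in plain Python restricted to the $2 \times 2$ case so that the eigenvalues come from the quadratic formula. The helper names `eig2` and `frob_sq` and the example matrix are assumptions chosen for illustration (the matrix is non-normal with complex eigenvalues, so both inequalities are strict).

```python
import cmath

def eig2(A):
    """Eigenvalues of a 2x2 matrix via the quadratic formula on x^2 - tr*x + det."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr + root) / 2, (tr - root) / 2

def frob_sq(A):
    """Squared Frobenius norm, i.e. trace(A^H A)."""
    return sum(abs(entry) ** 2 for row in A for entry in row)

A = [[1, -2], [1, 1]]                          # eigenvalues 1 +/- i*sqrt(2)
lam1, lam2 = eig2(A)

frobenius_sq = frob_sq(A)                      # ||A||_F^2
sum_abs_sq = abs(lam1) ** 2 + abs(lam2) ** 2   # sum of |lambda_i|^2
abs_trace_sq = abs(lam1 ** 2 + lam2 ** 2)      # |trace(A A)|

print(frobenius_sq, sum_abs_sq, abs_trace_sq)
```

Here the three quantities are $7$, $6$ and $2$ (up to rounding), consistent with $\big \Vert \mathbf A \big \Vert_F^{2} \geq \sum \vert \lambda_i \vert^2 \geq \big\vert \text{trace}(\mathbf{AA}) \big\vert$.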


**Background:**

We can collect all of the eigenvalues $\big \vert \lambda_1 \big \vert \geq \big \vert\lambda_2 \big \vert\geq \big \vert 
\lambda_3 \big \vert \geq ... \geq \big \vert\lambda_n \big \vert$ in the diagonal matrix $\mathbf D$, and restate Schur's Inequality as:  


$\big \vert \big \vert \mathbf A \big \vert \big \vert_F^{2} \geq  \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2} = \text{trace}\big(\mathbf D^H \mathbf D\big)$ 

Note that by Schur Decomposition, we can write $\mathbf A = \mathbf {Q R Q}^H$  where $\mathbf Q$ is unitary, and $\mathbf R$ is upper triangular.  As a reminder, recall that the eigenvalues of an upper triangular matrix are on its diagonal, hence $\mathbf R_{i,i} = \lambda_i$.


**Proof:**
revisiting the inequality, we write this as:


$\big \vert \big \vert \mathbf A \big \vert \big \vert_F^{2} $   
$= \big \vert \big \vert \mathbf {QRQ}^H\big \vert \big \vert_F^{2} $   
$= \big \vert \big \vert \mathbf R \big \vert \big \vert_F^{2} $  
$=  \Big(\sum_{k = 1}^{n} \sum_{j \lt k}  \big \vert r_{j,k}\big \vert^2\Big) + \text{trace}\big(\mathbf D^H \mathbf D\big)$  
$\geq \text{trace}\big(\mathbf D^H \mathbf D\big)$  


- - - - -
Alternatively, we may say:

$\big \vert \big \vert \mathbf A \big \vert \big \vert_F^{2} = \big \vert \big \vert \mathbf R \big \vert \big \vert_F^{2} = \big \Vert \big(\mathbf R - \mathbf D\big) + \mathbf D  \big \Vert_F^{2} = \big \vert \big \vert \big(\mathbf R - \mathbf D\big) \big \vert \big \vert_F^{2}  + \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2} \geq \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2}$

with equality **iff**
$\big \vert \big \vert \big(\mathbf R - \mathbf D\big) \big \vert \big \vert_F^{2} = 0$,
which occurs **iff** $\mathbf R - \mathbf D = \mathbf 0$, aka this occurs **iff** $\mathbf R = \mathbf D$.  Thus in the case where the Schur Inequality is an equality, we know that $\mathbf A$ is diagonalizable with mutually orthonormal eigenvectors $\mathbf A = \mathbf {Q RQ}^H = \mathbf {Q D Q}^H$.  Note that this does *not* make any claims as to whether or not the eigenvalues are real or complex.

- - - - 
*for avoidance of doubt:*    
we can verify  
$\big \Vert \big(\mathbf R - \mathbf D\big) + \mathbf D  \big \Vert_F^{2} = \big \vert \big \vert \big(\mathbf R - \mathbf D\big) \big \vert \big \vert_F^{2}  + \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2}$  
as being somewhat obvious in terms of sums of finitely many squares   
Alternatively we may directly apply linear algebra to verify  the result  

$\big \Vert \big(\mathbf R - \mathbf D\big) + \mathbf D  \big \Vert_F^{2} $  
$= \text{trace}\Big(\big(\big(\mathbf R - \mathbf D\big) + \mathbf D  \big)^H \big(\big(\mathbf R - \mathbf D\big) + \mathbf D \big)\Big)  $  
$= \text{trace}\Big(\big(\big(\mathbf R - \mathbf D\big)^H + \mathbf D^H \big) \big(\big(\mathbf R - \mathbf D\big) + \mathbf D \big)\Big)  $  
$= \text{trace}\Big(\big(\mathbf R - \mathbf D\big)^H \big(\mathbf R - \mathbf D\big)  +  \big(\mathbf R - \mathbf D\big)^H\mathbf D  + \mathbf D^H\big(\mathbf R - \mathbf D\big) + \mathbf D^H \mathbf D \Big)  $  
$= \text{trace}\Big(\big(\mathbf R - \mathbf D\big)^H \big(\mathbf R - \mathbf D\big)\Big)  +  \text{trace}\Big(\big(\mathbf R - \mathbf D\big)^H\mathbf D\Big)  + \text{trace}\Big(\mathbf D^H\big(\mathbf R - \mathbf D\big)\Big) + \text{trace}\Big(\mathbf D^H \mathbf D \Big)  $  
$= \text{trace}\Big(\big(\mathbf R - \mathbf D\big)^H \big(\mathbf R - \mathbf D\big)\Big)  +  0  +0 + \text{trace}\Big(\mathbf D^H \mathbf D \Big)  $  
$= \big \vert \big \vert \big(\mathbf R - \mathbf D\big) \big \vert \big \vert_F^{2}  + \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2}$

where in the 2nd to last line we observe that a strictly lower (upper) triangular matrix times a diagonal matrix gives a matrix that still is strictly lower (upper) triangular -- i.e. has all zeros on the diagonal, and hence has zero trace.  

- - - - 


**Technical Note:** *iff* the inequality is met with equality, then we say that $\mathbf A$ **is a normal matrix.**
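To see the equality case numerically, the sketch below compares the 'Schur gap' $\big\Vert \mathbf A\big\Vert_F^2 - \sum \vert\lambda_i\vert^2$ with the usual commutation test $\mathbf{AA}^H = \mathbf A^H \mathbf A$ for normality (taken up later in this notebook) on two $2 \times 2$ matrices. The function names `schur_gap` and `normality_defect` are made up for this example.

```python
import cmath

def eig2(A):
    """Eigenvalues of a 2x2 matrix via the quadratic formula."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr + root) / 2, (tr - root) / 2

def conj_transpose(A):
    return [[complex(A[j][i]).conjugate() for j in range(2)] for i in range(2)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def schur_gap(A):
    """||A||_F^2 - sum |lambda_i|^2; Schur's Inequality says this is >= 0."""
    l1, l2 = eig2(A)
    frob_sq = sum(abs(e) ** 2 for row in A for e in row)
    return frob_sq - (abs(l1) ** 2 + abs(l2) ** 2)

def normality_defect(A):
    """Largest entry (in magnitude) of A A^H - A^H A."""
    AH = conj_transpose(A)
    left, right = matmul(A, AH), matmul(AH, A)
    return max(abs(left[i][j] - right[i][j]) for i in range(2) for j in range(2))

N = [[1, -1], [1, 1]]   # a scaled rotation: normal, eigenvalues 1 +/- i
J = [[1, 1], [0, 1]]    # a Jordan block: not normal, eigenvalues 1, 1

for name, M in (("N", N), ("J", J)):
    print(name, schur_gap(M), normality_defect(M))
```

For `N` both numbers vanish (up to rounding); for `J` both equal $1$, so the inequality is strict exactly where the commutation test fails.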


an interesting implication is that if we have Hermitian $n \times n$ matrices $\mathbf A$ and $\mathbf B$ with $\mathbf A \succeq 0$  

with  
$\mathbf C: = \mathbf A^\frac{1}{2}\mathbf B  \mathbf A^\frac{1}{2}$   

then  

$\big \Vert \mathbf {AB}\big \Vert_F^2 \geq \big \Vert \mathbf C\big \Vert_F^2 = \sum_{i = 1}^{n} \big \vert \lambda_i^{(C)}\big \vert ^2 = \sum_{i = 1}^{n} \big \vert \lambda_i^{(AB)}\big \vert ^2$  

where the inequality is strict unless $\big(\mathbf {AB}\big)$ is normal (for the most part this tends to require commutativity between the two matrices)  
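A tiny worked instance, sketched in plain Python with integer arithmetic. The matrices are assumptions picked so that $\mathbf A^\frac{1}{2}$ is available by inspection ($\mathbf A$ is diagonal). Because the eigenvalues of $\mathbf{AB}$ coincide with those of the Hermitian matrix $\mathbf C$, they are real, so $\sum \vert\lambda_i\vert^2$ can be read off as the trace of the squared matrix without an eigenvalue solver.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def frob_sq(M):
    return sum(abs(t) ** 2 for row in M for t in row)

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

A_sqrt = [[2, 0], [0, 1]]            # A = diag(4, 1), so A^(1/2) = diag(2, 1)
A = matmul(A_sqrt, A_sqrt)
B = [[0, 1], [1, 0]]                 # Hermitian (real symmetric)

AB = matmul(A, B)
BA = matmul(B, A)
C = matmul(matmul(A_sqrt, B), A_sqrt)

# Real eigenvalues => sum |lambda_i|^2 = sum lambda_i^2 = trace(M^2).
sum_sq_eigs_AB = trace(matmul(AB, AB))
sum_sq_eigs_C = trace(matmul(C, C))

print(frob_sq(AB), frob_sq(C), sum_sq_eigs_AB, sum_sq_eigs_C)
```

In this example $\big\Vert \mathbf{AB} \big\Vert_F^2 = 17$ while $\big\Vert \mathbf C \big\Vert_F^2 = 8$, and $\mathbf A$ and $\mathbf B$ do not commute, which is consistent with the strict inequality.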


**application:** if $P\in \mathbb R^{d\times d}$ is a non-singular (row) stochastic matrix and $P^{-1}$ is also a stochastic matrix, then $P$ is a Permutation matrix.  

*remark:* this can be proven in a combinatorial manner though triangle inequality plus Schur's Inequality allows for a particularly nice proof  


*lemma:* for any row of $P$   
$\big\Vert \tilde{\mathbf p_r}^T\big\Vert_2 =\big \Vert \sum_{j=1}^d \tilde{p_j}^{(r)}\cdot \mathbf e_j^T\big\Vert_2\leq \sum_{j=1}^d \big \Vert \tilde{p_j}^{(r)}\cdot \mathbf e_j^T\big\Vert_2=\big\Vert \tilde{\mathbf p_r}^T\big\Vert_1 = \tilde{\mathbf p_r}^T\mathbf 1=1$  
by triangle inequality  

$\implies \big\Vert \tilde{\mathbf p_r}^T\big\Vert_2^2 \leq 1$  
with equality *iff* $\tilde{\mathbf p_r}^T= \mathbf e_k^T$  
for some standard basis vector $\mathbf e_k$  

(additional justification: triangle inequality with equality implies  $\tilde{\mathbf p_r}^T=\alpha \mathbf e_k^T$ but the LHS is real non-negative so the RHS is as well $\implies \alpha\geq 0$ and the LHS has L1 norm of 1 thus the RHS does as well $\implies \alpha =1$ )  

*proof:*   
$P$ being stochastic means $P\mathbf 1 = \lambda_1\mathbf 1 = \mathbf 1$   
$\lambda_1=1$ is semi-simple and we have  
$\vert \lambda_1\vert \geq \vert \lambda_2\vert\geq ... \geq \vert \lambda_d\vert \gt 0$  
by Gerschgorin Discs or, better, the Perron theory section of 'Artin_chp4.ipynb'   
Thus 
$P^{-1}$ has eigenvalues  
$1=\vert \lambda_1\vert^{-1} \leq \vert \lambda_2\vert^{-1}\leq ... \leq \vert \lambda_d\vert^{-1}$  
but since $P^{-1}$ is stochastic we also know $P^{-1}\mathbf 1 = \lambda_1^{-1}\mathbf 1 = \mathbf 1$   
$1=\vert \lambda_1\vert^{-1} \geq \vert \lambda_j\vert^{-1}$  
for $j\in \big\{1,2,3,...,d\big\}$  
$\implies \vert \lambda_j\vert =1$   
i.e. all eigenvalues are on the unit circle.  Finally, combining the lemma with Schur's Inequality gives  

$d \geq \sum_{i=1}^d \big\Vert \tilde{\mathbf p_i}^T\big\Vert_2^2= \big \Vert P\big \Vert_F^2\geq \sum_{j=1}^d \vert \lambda_j\vert^2 = d $  
Thus Schur's Inequality is met with equality and meeting the upper bound with equality tells us each row of $P$ is a standard basis vector.  Since $\text{rank}\big(P\big)=d$ this means none of $P$'s rows are linearly dependent i.e. each row of $P$ is a distinct standard basis vector, i.e. $P$ is a permutation matrix.  
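The statement can be illustrated with exact rational arithmetic. The sketch below (helper names and example matrices are assumptions for illustration) takes a doubly stochastic matrix that is not a permutation matrix and shows that its inverse has rows summing to one but contains negative entries, so it is not stochastic, whereas a permutation matrix passes.

```python
from fractions import Fraction

def inverse2(P):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = P
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def is_row_stochastic(P):
    nonnegative = all(entry >= 0 for row in P for entry in row)
    rows_sum_to_one = all(sum(row) == 1 for row in P)
    return nonnegative and rows_sum_to_one

def frob_sq(P):
    return sum(entry * entry for row in P for entry in row)

P = [[Fraction(3, 4), Fraction(1, 4)],
     [Fraction(1, 4), Fraction(3, 4)]]
P_inv = inverse2(P)                      # [[3/2, -1/2], [-1/2, 3/2]]

swap = [[Fraction(0), Fraction(1)],
        [Fraction(1), Fraction(0)]]
swap_inv = inverse2(swap)                # the swap is its own inverse

print(is_row_stochastic(P), is_row_stochastic(P_inv), frob_sq(P))
print(is_row_stochastic(swap), is_row_stochastic(swap_inv), frob_sq(swap))
```

Note that $\big\Vert P \big\Vert_F^2 = \frac{5}{4} \lt 2 = d$ for the non-permutation example, so the upper bound from the lemma is not attained there, while the swap matrix attains it.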


**Interesting L1 style extension:**  

The stated inequality which is based around summing squared entries is this:  
$\sum_{k=1}^n \sigma_k^2 = \big \Vert \mathbf A \big \Vert_F^{2} = \text{trace}\big(\mathbf A^H \mathbf A\big) \geq \sum_{i = 1}^{n} \big \vert \lambda_i\big \vert ^2 \geq \big \vert \sum_{i = 1}^{n} \lambda_i^2\big \vert = \Big \vert \text{trace}\big(\mathbf A \mathbf A\big) \Big \vert $

There is something of an interesting L1 point of view of this, which relates the trace to the nuclear norm (sum of singular values)

$\sum_{k=1}^n \sigma_k = \text{trace}\big(\mathbf Y\big) \geq   \sum_{k = 1}^{n} \big \vert \lambda_k\big \vert \geq \big \vert \text{trace}\big(\mathbf {QY}\big) \big \vert  = \big \vert \sum_{i = 1}^{n} \lambda_i \big \vert = \Big \vert \text{trace}\big(\mathbf A \big) \Big \vert $


The proof comes from noticing that similarity transforms do not change the trace, and specializing to unitary similarity transforms which don't change singular values, so we may consider upper triangular $\mathbf T$   

$\mathbf T = \mathbf U^* \mathbf A \mathbf U$  

but we can get the desired relation by multiplying 

$\mathbf {DT} = \mathbf R$  
Such that $\mathbf R$ has all eigenvalues real-nonnegative.  
i.e. $\mathbf D$ is diagonal with components on the unit circle -- in particular if $\lambda_i \neq 0$ then  
$d_{i,i} = \frac{\bar{\lambda_i}}{\vert \lambda_i \vert} $    
and if $\lambda_i = 0$, $d_{i,i} := 1$  

This means it is unitary as well, i.e. 
$\mathbf D^* \mathbf D = \mathbf I$   

but multiplication by a unitary matrix does not change the singular values, so we have  

$\big \vert \text{trace}\big(\mathbf {A} \big)\big \vert$   
$= \big \vert\text{trace}\big(\mathbf {T} \big)\big \vert$   
$= \big \vert\sum_{k=1}^n \lambda_k \big \vert$   
$\leq \sum_{k=1}^n \big \vert \lambda_k \big \vert$   
$= \text{trace}\big(\mathbf {DT} \big)$  
$= \big \vert \text{trace}\big(\mathbf {DT} \big)\big \vert$  
$= \big \vert\text{trace}\big(\mathbf {DQY} \big)\big \vert$  
$\leq \text{trace}\big(\mathbf {Y} \big)$  
$= \sum_{k=1}^n \sigma_k $    
$=\big\Vert\mathbf {A}\big\Vert_{S_1}$  

where $\mathbf {QY} =\mathbf T$ is the polar decomposition of the matrix $\mathbf T$ (which is unitarily similar to $\mathbf A$), and $\mathbf R = \mathbf{DT} = \big(\mathbf{DQ}\big)\mathbf Y$ is by construction upper triangular with all real non-negative eigenvalues.  $\mathbf R$ is thus Hermitian positive (semi)definite *iff* it is diagonal, which occurs *iff* $\mathbf A$ is normal  -- i.e. the purpose of this entire writeup. But to be clear, as noted in the "Fun with trace" writeup, the equality conditions of the second inequality are clear in the case of $\mathbf A$ being non-singular; however, the exact equality conditions are a bit muddy and less clear to your author when $\mathbf A$ is singular.  

The first inequality is the triangle inequality, and the second inequality (see "Fun with trace writeup") comes from the fact that $\big(\mathbf{DQ}\big)$  is unitary and the magnitude of the trace of the product of a unitary matrix and a Hermitian positive semi-definite matrix is bounded above by the trace of said Hermitian positive semi-definite matrix -- which is in some ways an extension or generalization of the triangle inequality (esp polar form in $\mathbb C$) to matrix traces.  
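For a concrete feel of the chain $\sum \sigma_k \geq \sum \vert\lambda_k\vert \geq \vert\text{trace}(\mathbf A)\vert$, here is a sketch for the $2 \times 2$ case in plain Python. It relies on the closed form $(\sigma_1 + \sigma_2)^2 = \big\Vert \mathbf A \big\Vert_F^2 + 2\vert\det \mathbf A\vert$, which follows from $\sigma_1^2 + \sigma_2^2 = \big\Vert \mathbf A \big\Vert_F^2$ and $\sigma_1\sigma_2 = \vert\det \mathbf A\vert$; the helper names and the example matrix are assumptions for illustration.

```python
import cmath
import math

def eig2(A):
    """Eigenvalues of a 2x2 matrix via the quadratic formula."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr + root) / 2, (tr - root) / 2

def nuclear_norm_2x2(A):
    """sigma_1 + sigma_2 = sqrt(||A||_F^2 + 2 |det A|), valid for 2x2 matrices only."""
    (a, b), (c, d) = A
    frob_sq = sum(abs(t) ** 2 for t in (a, b, c, d))
    return math.sqrt(frob_sq + 2 * abs(a * d - b * c))

A = [[1, -2], [1, 1]]                 # eigenvalues 1 +/- i*sqrt(2)
lam1, lam2 = eig2(A)

nuclear = nuclear_norm_2x2(A)         # sqrt(13)
sum_abs_eigs = abs(lam1) + abs(lam2)  # 2*sqrt(3)
abs_trace = abs(lam1 + lam2)          # |trace(A)| = 2

print(nuclear, sum_abs_eigs, abs_trace)
```

The three values are approximately $3.606$, $3.464$ and $2$, in the expected order.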



as a reminder we know 

$\big(\sum_{k=1}^n \sigma_k^2\big)^\frac{1}{2} \leq  \sum_{k=1}^n \sigma_k \leq n^\frac{1}{2}\big(\sum_{k=1}^n \sigma_k^2\big)^\frac{1}{2}$  
by triangle inequality and then Cauchy-Schwarz (ones trick)    

This gives us a nice proof of *submultiplicativity* of the  Schatten 1-Norm / Nuclear Norm
(cleanly for the case of n x n matrices -- with some care and padding by zeros, we can generalize to rectangular matrices where the product is defined)  

if we consider the matrix $\big(\mathbf {AB}\big)$ in polar form, so  

$\big(\mathbf {AB}\big) = \mathbf {QC}$  
then   
$\sum_{i=1}^n \sigma_{i (AB)} = \text{trace}\big(\mathbf C\big)  = \text{trace}\big(\mathbf Q^*\mathbf {AB}\big)  $   

so considering the Schatten 1-Norm of $\big(\mathbf {AB}\big)$  


$\big\Vert\mathbf {AB}\big\Vert_{S_1}$   
$=\sum_{i=1}^n \sigma_{i (AB)}$  
$= \text{trace}\Big(\mathbf Q^* \big(\mathbf {AB}\big)\Big)$  
$= \text{trace}\Big(\big( \mathbf Q^* \mathbf A \big) \mathbf B\Big)$  
$= \big \vert \text{trace}\Big(\big( \mathbf Q^* \mathbf A \big) \mathbf B\Big)\big \vert $  
$\leq \big \Vert \mathbf Q^* \mathbf A\big \Vert_F \big \Vert  \mathbf B\big \Vert_F $  
$=  \big \Vert \mathbf A\big \Vert_F \big \Vert  \mathbf B\big \Vert_F $  
$=  \big \Vert \mathbf A\big \Vert_{S_2} \big \Vert  \mathbf B\big \Vert_{S_2} $  
$= \big(\sum_{i=1}^n \sigma_{i (A)}^2\big)^\frac{1}{2}\big(\sum_{i=1}^n \sigma_{i (B)}^2\big)^\frac{1}{2}  $  
$\leq \big(\sum_{i=1}^n \sigma_{i (A)}\big)\big(\sum_{i=1}^n \sigma_{i (B)}\big) $  
$= \big\Vert\mathbf {A}\big\Vert_{S_1} \big\Vert\mathbf {B}\big\Vert_{S_1}$    
where the first inequality follows by Cauchy Schwarz and the second inequality follows by above mentioned triangle inequality  


as for *subadditivity* of the Schatten 1-Norm / Nuclear Norm, we can see it follows via quasi-linearization:  
Using an inequality from 'Fun with trace' involving Hermitian positive (semi)definite matrices and unitary matrices and the resulting trace, we can define  

$ \sum_{i=1}^n \sigma_{i (X)} = \big\Vert\mathbf {X}\big\Vert_{S_1}  := \big \vert \text{trace}\Big(\mathbf Q_0^* \big(\mathbf {X}\big)\Big)\big \vert$  

where $\mathbf Q_0^*$ is the (not necessarily unique) unitary matrix that maximizes 
$\big \vert \text{trace}\Big(\mathbf Q_0^* \big(\mathbf {X}\big)\Big)\big \vert$  (reference polar form and trace inequalities )  

but if (sticking with square matrices for now)  

$\mathbf X = \mathbf A + \mathbf B$  

then  
$\Big\Vert\mathbf {X}\Big\Vert_{S_1}$   
$= \Big \vert \text{trace}\Big(\mathbf Q_0^* \big(\mathbf {X}\big)\Big)\Big \vert$  
$= \Big \vert \text{trace}\Big(\mathbf Q_0^* \big(\mathbf {A} + \mathbf B\big)\Big)\Big \vert$  
$= \Big \vert \text{trace}\Big(\mathbf Q_0^* \mathbf {A}\Big) + \text{trace}\Big(\mathbf Q_0^* \mathbf {B}\Big)\Big \vert$  
$\leq \Big \vert \text{trace}\Big(\mathbf Q_0^* \mathbf {A}\Big)\Big \vert  + \Big \vert \text{trace}\Big(\mathbf Q_0^* \mathbf {B}\Big)\Big \vert$  
$\leq \Big \vert \text{trace}\Big(\mathbf Q_1^* \mathbf {A}\Big)\Big \vert  + \Big \vert \text{trace}\Big(\mathbf Q_2^* \mathbf {B}\Big)\Big \vert$  
$= \big\Vert\mathbf {A}\big\Vert_{S_1} +  \big\Vert\mathbf {B}\big\Vert_{S_1}$   

where the first inequality follows by triangle inequality, and the second inequality follows because 2 choices are better than one  
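Both properties derived above (submultiplicativity and subadditivity of the Schatten 1-norm) can be spot-checked on a pair of $2 \times 2$ matrices. This is only a sketch: it uses the $2 \times 2$ closed form $\sigma_1 + \sigma_2 = \sqrt{\Vert \mathbf M \Vert_F^2 + 2\vert\det \mathbf M\vert}$ rather than a general SVD, and the matrices and helper names are assumptions chosen for illustration.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def frob(M):
    return math.sqrt(sum(abs(t) ** 2 for row in M for t in row))

def nuclear_norm_2x2(M):
    """sigma_1 + sigma_2 for a 2x2 matrix, from ||M||_F^2 and |det M|."""
    (a, b), (c, d) = M
    return math.sqrt(frob(M) ** 2 + 2 * abs(a * d - b * c))

A = [[1, 2], [0, 1]]
B = [[0, 1], [-1, 3]]
A_plus_B = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
AB = matmul(A, B)

# Submultiplicativity, including the intermediate Frobenius bound.
lhs_prod = nuclear_norm_2x2(AB)
mid_prod = frob(A) * frob(B)
rhs_prod = nuclear_norm_2x2(A) * nuclear_norm_2x2(B)

# Subadditivity.
lhs_sum = nuclear_norm_2x2(A_plus_B)
rhs_sum = nuclear_norm_2x2(A) + nuclear_norm_2x2(B)

print(lhs_prod, mid_prod, rhs_prod)
print(lhs_sum, rhs_sum)
```

A single example proves nothing, of course; it is only meant to make the two chains of inequalities tangible.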

Positive definiteness follows immediately for the Schatten 1-norm because  
$0 \leq \Big\Vert\mathbf {X}\Big\Vert_{F} = \Big\Vert\mathbf {X}\Big\Vert_{S_2} = \big(\sum_{k=1}^n \sigma_k^2\big)^\frac{1}{2} \leq  \sum_{k=1}^n \sigma_k = \Big\Vert\mathbf {X}\Big\Vert_{S_1}$  
and the Frobenius norm of a matrix is zero *iff* the matrix is zero.  Finally the homogeneity with respect to positive scaling is immediate from looking at SVD of a matrix or its polar decomposition and the fact that scalar multiplication commutes.  

The Schatten 1-norm, being intimately linked in with the trace can be a convenient metric to use when dealing with the trace and bounding it (just about any metric will work in finite dimensions though our life is much more enjoyable if we choose wisely).  


**application:**   
consider the mapping  
$f_k: \mathbb C^{n \times n} \longrightarrow \mathbb C$  
given by  
$f_k\big(\mathbf C\big) = \text{trace}\big(\mathbf C^k\big)$  
for natural number $k$  

we prove this mapping is continuous with respect to the coefficients in $\mathbf C$ because for any $\epsilon \gt 0$ there exists some $\delta \gt 0$ where 
 
$f_k\Big( N\big(\mathbf C,\delta\big)\Big) \subseteq N\Big( f_k\big(\mathbf C\big),\epsilon\Big) $  
and we use the 1 norm (here, meaning the sum of the magnitude of components in a matrix) to give our metric underlying the N, neighborhood, function above   

that is we can restate the above explicitly using the 1 norm as   
$\Big \Vert \text{trace}\Big(\big(\mathbf C + \mathbf E\big)^k\Big) - \text{trace}\big(\mathbf C^k\big)\Big \Vert_{1}  = \Big \vert \text{trace}\Big(\big(\mathbf C + \mathbf E\big)^k\Big) - \text{trace}\big(\mathbf C^k\big)\Big \vert \lt \epsilon  $  

as for any $\epsilon \gt 0$ there exists some $\delta \gt 0$ where  
$\big\Vert\mathbf {E}\big\Vert_{1} =   \big\Vert \big(\mathbf C + \mathbf E\big) - \mathbf C\big\Vert_{1}\lt \delta$  

and along the way, use the nuclear norm / Schatten 1-norm as a nice bridge between results  
- - - -  

**proof**  
the idea is to make use of linearity of the trace and the binomial theorem to get  

$\Big \Vert \text{trace}\Big(\big(\mathbf C + \mathbf E\big)^k\Big) - \text{trace}\big(\mathbf C^k\big)\Big \Vert_{1}  $   
$= \Big \Vert \text{trace}\Big(\big(\mathbf C + \mathbf E\big)^k - \mathbf C^k\Big)\Big \Vert_{1}  $  
$= \Big \Vert \text{trace}\Big(\sum_{i=0}^{k-1} \binom{k}{i}\mathbf C^{i} \mathbf E^{k-i}\Big) \Big \Vert_{1}  $    

**except**   
while there will be $\binom{k}{i}$ terms in the summation with *total* multiplications of $i$ by $\mathbf C$ and total multiplications by $\mathbf E$ of $(k-i)$ we know that matrix multiplications do not generally commute.  However submultiplicativity and subadditivity of 'nice' norms -- in particular the Schatten 1 norm-- come to the rescue, by mapping to a convenient scalar case which is an upperbound, where we do have commutativity amongst the resulting real (and non-negative) scalars.   
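To make the non-commutativity concrete, here is the smallest interesting case, $k=2$, worked out as a sketch (writing $\Vert \cdot \Vert$ for the Schatten 1-norm and $m$ for $\Vert \mathbf C \Vert$ purely as shorthand in this example):

$\big(\mathbf C + \mathbf E\big)^2 - \mathbf C^2 = \mathbf {CE} + \mathbf {EC} + \mathbf E^2$  

The two cross terms cannot in general be merged into $2\,\mathbf{CE}$, but after taking norms the corresponding scalars do commute:

$\big\Vert \mathbf {CE} + \mathbf {EC} + \mathbf E^2 \big\Vert \leq \Vert \mathbf C\Vert \Vert\mathbf E\Vert + \Vert \mathbf E\Vert\Vert\mathbf C\Vert + \Vert\mathbf E\Vert^2 = \binom{2}{1}\, m\, \Vert\mathbf E\Vert + \binom{2}{0}\,\Vert\mathbf E\Vert^2$  

Every surviving term carries at least one factor of $\Vert \mathbf E \Vert$, which is what lets the bound shrink to zero with $\mathbf E$.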

with  
$ 0 \leq m = \big \Vert  \mathbf C\big \Vert_{S_1} $   
we can prove the desired result as follows  

$\Big \Vert \text{trace}\Big(\big(\mathbf C + \mathbf E\big)^k\Big) - \text{trace}\big(\mathbf C^k\big)\Big \Vert_{1}$    
$= \Big\vert \text{trace}\Big(\big(\mathbf C + \mathbf E\big)^k- \mathbf C^k \Big)\Big\vert$  
$\leq  \Big \Vert \big(\mathbf C + \mathbf E\big)^k- \mathbf C^k \Big \Vert_{S_1}$ see preceding section "Interesting L1 style extension" with $\mathbf A:= \Big(\big(\mathbf C + \mathbf E\big)^k- \mathbf C^k\Big)$     
$\leq  \sum_{i=0}^{k-1} \binom{k}{i} \Big \Vert  \mathbf C\Big \Vert_{S_1}^{i} \Big\Vert\mathbf E \Big \Vert_{S_1}^{k-i}  $     (by subadditivity, then submultiplicativity)  
$= \sum_{i=0}^{k-1} \binom{k}{i} m^{i} \Big\Vert\mathbf E \Big \Vert_{S_1}^{k-i}  $   
$\leq  \sum_{i=0}^{k-1} \binom{k}{i} m^{i} \Big\Vert\mathbf E \Big \Vert_{S_1}  $  for sufficiently small $\Big\Vert\mathbf E  \Big \Vert_{S_1}$ (i.e. $\leq 1$), since every exponent $k-i$ is at least $1$  
$\leq \Big\Vert\mathbf E  \Big \Vert_{S_1} \cdot \sum_{i=0}^k \binom{k}{i} m^{i}   $    
$= \Big\Vert\mathbf E  \Big \Vert_{S_1} \cdot \big(1 + m\big)^k  $     
$= \Big\Vert\mathbf E  \Big \Vert_{S_1} \cdot M  $  for some positive constant $M\gt 0$   
$\leq \big(n^\frac{1}{2}\cdot \sum_{i}\sum_{j}  \vert e_{i,j}\vert\big)\cdot M$  
$= n^\frac{1}{2} \cdot \big \Vert \mathbf E\big \Vert_1 \cdot M$  

hence selecting $\delta := \min\big(\frac{1}{\sqrt{n}}, \frac{\epsilon}{\sqrt{n} \cdot M}\big)$  completes the argument  
- - - -  
with respect to the final inequality, consider that   

$\Big\Vert\mathbf E  \Big \Vert_{S_1} = \sum_{k=1}^n \sigma_k \leq n^\frac{1}{2}\big(\sum_{k=1}^n \sigma_k^2\big)^\frac{1}{2} =  n^\frac{1}{2}\cdot \big(\sum_{i}\sum_{j}  \vert e_{i,j}\vert^2\big)^\frac{1}{2} \leq n^\frac{1}{2}\cdot \sum_{i}\sum_{j}  \vert e_{i,j}\vert$  

where results follow from Cauchy-Schwarz, then triangle inequality  
these inequalities make it clear that we use the Schatten 1 norm for convenience, but e.g. we could have just as easily used the Schatten 2 norm (aka Frobenius norm) because    

$\big \vert \text{trace}\big(\mathbf A\big)  \big \vert $  
$ =\big \vert \sum_{k=1}^n  \lambda_k \big \vert$  
$\leq \sum_{k=1}^n \big \vert \lambda_k \big \vert$  
$\leq n^\frac{1}{2}\big(\sum_{k=1}^n \big \vert \lambda_k \big \vert^2\big)^\frac{1}{2}$  
$\leq n^\frac{1}{2}\big(\sum_{k=1}^n \sigma_k^2\big)^\frac{1}{2}$  (Schur's Inequality)  
$= n^\frac{1}{2}\big \Vert \mathbf A \big \Vert_F$  
$\leq n^\frac{1}{2}\cdot \sum_{i}\sum_{j}  \vert a_{i,j}\vert$  
$= n^\frac{1}{2}\big \Vert \mathbf A \big \Vert_1$  
- - - -   

thus we've proven for any given $\epsilon \gt 0$ there exists some $\delta \gt 0$ such that     
$f_k\Big( N\big(\mathbf C,\delta\big)\Big) \subseteq N\Big( f_k\big(\mathbf C\big),\epsilon\Big)$

hence for each natural number $k$, we see that $f_k$ is continuous for any n x n matrix $\mathbf C$ and in particular varies continuously with respect to the magnitude of the change in components of $\mathbf C$  
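The bound used in the proof can be checked numerically on a small case. The sketch below is limited to $2 \times 2$ matrices so that the Schatten 1-norm has the closed form $\sqrt{\Vert \mathbf M \Vert_F^2 + 2\vert\det \mathbf M\vert}$; the matrices `C`, `E`, the exponent `k` and all helper names are assumptions for illustration.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def trace_of_power(M, k):
    """trace(M^k) for a natural number k >= 1, by repeated multiplication."""
    P = M
    for _ in range(k - 1):
        P = matmul(P, M)
    return P[0][0] + P[1][1]

def nuclear_norm_2x2(M):
    (a, b), (c, d) = M
    frob_sq = sum(abs(t) ** 2 for t in (a, b, c, d))
    return math.sqrt(frob_sq + 2 * abs(a * d - b * c))

k, n = 3, 2
C = [[1.0, 2.0], [0.0, 1.0]]
E = [[0.01, -0.02], [0.03, 0.01]]          # small perturbation, Schatten 1-norm <= 1
C_plus_E = [[C[i][j] + E[i][j] for j in range(2)] for i in range(2)]

change = abs(trace_of_power(C_plus_E, k) - trace_of_power(C, k))
M_const = (1 + nuclear_norm_2x2(C)) ** k   # the constant (1 + m)^k
schatten_bound = nuclear_norm_2x2(E) * M_const
entrywise_norm = sum(abs(t) for row in E for t in row)
entrywise_bound = math.sqrt(n) * entrywise_norm * M_const

print(change, schatten_bound, entrywise_bound)
```

The observed change in the trace (about $0.42$) sits well below both bounds, which are deliberately crude: the argument only needs them to go to zero with $\mathbf E$.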

**corollary:**  
The coefficients of the characteristic polynomial of a matrix $\in \mathbb C^{n \times n}$ vary continuously with its entries.  

*proof:*  
apply Newton's Identities (see 2 proofs at end of Vandermonde Matrix writeup) in sequence.  
The above proves that $\text{trace}\big(\mathbf C^k\big)$ varies continuously with the entries of $\mathbf C$.  But the characteristic polynomial  
$p(x) = x^n + a_{n-1}x^{n-1} + a_{n-2}x^{n-2} +... + a_{1}x^{1}+ a_0$  

has a leading coefficient of one, and $a_{n-1}$ immediately and obviously varies continuously with the coefficients of the diagonal of $\mathbf C$.  This is enough to set up an induction (which we carry out for finitely many steps since a polynomial has only finitely many terms).  

in particular for $0\lt r \lt n$ we have  

$a_{n-r}  = -\frac{1}{r} \sum_{k=1}^{r} a_{n-r + k}\cdot \text{trace}\big(\mathbf C^k\big) $

where by inductive hypothesis, we know that all $ a_{n-r + k}$ vary continuously with the components of $\mathbf C$, and the preceding proofs shows that $\text{trace}\big(\mathbf C^k\big)$ varies continuously as well.  It is immediate that $a_{n-r}$ varies continuously with components of $\mathbf C$ because it is written as a linear combination/ composition involving finitely many terms of sums and products consisting solely of items that vary continuously with components of $\mathbf C$.  

the final term (i.e. the determinant multiplied by the sign $(-1)^n$) also varies continuously with components of $\mathbf C$ because 

$(-1)^n \det\big(\mathbf C\big)= a_0 = \frac{-1}{n} \big(a_1 \cdot \text{trace}\big(\mathbf C^{1}\big) +  a_2 \cdot \text{trace}\big(\mathbf C^{2}\big)  + ... + a_n \cdot \text{trace}\big(\mathbf C^{n}\big)\big)$  

(via Newton's Identities, or Cayley Hamilton.  Note that if $n$ is even then $a_0$ is equal to the determinant.  If $n$ is odd we can of course re-run the above argument on $\big(-\mathbf C\big)$ to see that the determinant of $\mathbf C$ varies continuously with its coefficients -- there are more direct approaches, in particular working with principal minors, though teasing out the conclusion via manipulation of the trace has appeal to your author.)  

This, again, consists of a finite number of operations involving sums and products of things that vary continuously with the components of $\mathbf C$, and hence we conclude that the result, $a_0$, varies continuously with the entries of $\mathbf C$.  
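The recursion above is easy to run. Here is a sketch in plain Python with exact rational arithmetic that recovers the characteristic polynomial coefficients from the power traces $\text{trace}\big(\mathbf C^k\big)$ alone; the function name `char_poly_from_traces` and the example matrix are assumptions for illustration.

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def char_poly_from_traces(C):
    """Return [a_0, ..., a_n] with det(xI - C) = x^n + a_{n-1} x^{n-1} + ... + a_0."""
    n = len(C)
    power = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    p = [None]                       # p[k] = trace(C^k), index 0 unused
    for _ in range(n):
        power = matmul(power, C)
        p.append(trace(power))
    a = [Fraction(0)] * (n + 1)
    a[n] = Fraction(1)               # monic leading coefficient
    for r in range(1, n + 1):        # Newton's Identities, one coefficient at a time
        a[n - r] = -sum(a[n - r + k] * p[k] for k in range(1, r + 1)) / r
    return a

# Upper triangular with diagonal 2, 3, 5, so p(x) = (x - 2)(x - 3)(x - 5).
C = [[2, 1, 0], [0, 3, 1], [0, 0, 5]]
coeffs = char_poly_from_traces(C)
print(coeffs)
```

For this matrix the result is $p(x) = x^3 - 10x^2 + 31x - 30$, and $a_0 = (-1)^3 \det\big(\mathbf C\big) = -30$. Each coefficient is built from finitely many sums and products of earlier coefficients and traces, which is exactly the structure the continuity argument uses.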


in the special case where $\mathbf A$ is a rank one matrix **in reals**, we get the following:

where $\mathbf A = \mathbf {xy}^T = \mathbf {xy}^H$

hence 

$\big \vert \text{trace}\Big(\big(\mathbf {xy}^H\big)^2\Big)\big \vert = \big \vert \lambda_1^2 + 0 + 0+ .... + 0 \big \vert  = \big \vert \lambda_1^2 \big \vert = \big \vert \lambda_1 \big \vert^2 $

i.e. triangle inequality is not needed, and we can in fact look at 

$\big \Vert \mathbf D \big \Vert_F^2 = \big \vert \text{trace}\big(\big(\mathbf {xy}^H\big)^2\big)\big \vert = \big \vert \text{trace}\big(\mathbf {xy}^H\big)\big \vert^2 = \big \vert \lambda_1 \big \vert^2$  

now we see that 

$\big \vert \big \vert \mathbf {xy}^H \big \vert \big \vert_F^{2} =  \frac{1}{2}\big \vert \big \vert \big(\mathbf R - \mathbf R^H \big) \big \vert \big \vert_F^{2}  + \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2} = \frac{1}{2}\big \vert \big \vert \big(\mathbf {xy}^H - \mathbf {yx}^H \big) \big \vert \big \vert_F^{2}  + \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2} =  \frac{1}{2}\big \vert \big \vert \big(\mathbf {xy}^H - \mathbf {yx}^H \big) \big \vert \big \vert_F^{2}  + \big \vert \text{trace}\big(\mathbf {xy}^H\big)\big \vert^2$

This is the Lagrange Identity in reals.
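A direct check of the identity with integer vectors, as a sketch (the vectors `x` and `y` are arbitrary choices for illustration). To stay in integers the comparison is made after multiplying through by two.

```python
x = [1, 2, 3]
y = [4, -1, 2]
n = len(x)

# ||x y^T||_F^2 = ||x||^2 ||y||^2
frob_sq_outer = sum(xi * xi for xi in x) * sum(yi * yi for yi in y)

# ||x y^T - y x^T||_F^2, summed over all ordered index pairs (i, j)
antisym_sq = sum((x[i] * y[j] - x[j] * y[i]) ** 2 for i in range(n) for j in range(n))

# |trace(x y^T)|^2 = (x . y)^2
trace_sq = sum(xi * yi for xi, yi in zip(x, y)) ** 2

print(frob_sq_outer, antisym_sq, trace_sq)
```

Here $294 = \frac{1}{2}\cdot 460 + 64$, matching $\big\Vert \mathbf{xy}^H \big\Vert_F^2 = \frac{1}{2}\big\Vert \mathbf{xy}^H - \mathbf{yx}^H \big\Vert_F^2 + \big\vert \text{trace}\big(\mathbf{xy}^H\big)\big\vert^2$.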


**Immediate Consequence:** 

Any **unitary** (or, in reals, orthogonal) $n \times n$ matrix $\mathbf U$ must have eigenvalues with $\big \vert \lambda_k\big \vert = 1$ for $k \in \{1, 2, ...,n\}$  (see middle part of "Fun_with_Trace_and_Quadratic_Forms_CauchySchwartz_.ipynb" under the heading *Thoughts on Unitary Matrices* for a proof of this, using quadratic forms / singular values to upper and lower bound the eigenvalue magnitudes).

Noting that $\mathbf U^H \mathbf U = \mathbf I$, for any $n$ x $n$ unitary matrix and applying Schur's Inequality, we have:



$\big \vert \big \vert \mathbf U \big \vert \big \vert_F^{2} = \text{trace}\big(\mathbf U^H \mathbf U\big) = \text{trace}\big(\mathbf I \big) = n \geq \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2} =  \sum_{i = 1}^{n}  \lambda_i^H \lambda_i= \sum_{i = 1}^{n} \big \vert \lambda_i\big \vert ^2 =  \sum_{i = 1}^{n} 1^2  =  \sum_{i = 1}^{n} 1 = n  $

hence: 

$n = \big \vert \big \vert \mathbf U \big \vert \big \vert_F^{2} = \big \vert \big \vert \mathbf D \big \vert \big \vert_F^{2}$ 

which tells us that, if we wanted, we could diagonalize $\mathbf U$ with mutually orthonormal eigenvectors, $\mathbf U = \mathbf{QDQ}^H$.
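The equality case for unitary matrices can be checked numerically. The sketch below assumes NumPy and builds a unitary matrix as the Q factor of a random complex matrix, which is only one convenient way to get a test case.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# The Q factor from QR of a random complex matrix is unitary
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
U, _ = np.linalg.qr(Z)

frob_sq = np.linalg.norm(U, "fro") ** 2          # should equal n
eigvals, V = np.linalg.eig(U)                    # columns of V are unit-norm eigenvectors
eig_sq_sum = np.sum(np.abs(eigvals) ** 2)        # should also equal n

# Gram matrix of the eigenvectors; close to the identity when they are orthonormal
gram = V.conj().T @ V

print(frob_sq, eig_sq_sum)
print(np.round(np.abs(gram), 6))
```

For a random unitary matrix the eigenvalues are distinct with probability one, so the computed eigenvectors come out mutually orthonormal without any extra orthogonalization step.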


*Link of SVD and Eigen-decompositions for Normal Matrices* 

If we were going to do SVD on $\mathbf U$, notice that the left and right singular vectors would be the same, except for an issue of rotations on complex plane. Starting with the above eigen-decomposition:  

$\mathbf U = \mathbf {Q  D Q}^H $

now we make the substitution $\mathbf D =\mathbf{\Lambda \Sigma} = \mathbf{\Lambda \mathbf I} = \mathbf \Lambda $

That is we factor $\mathbf D$ into two diagonal matrices -- $\mathbf \Sigma$ which must be real valued and non-negative, and hence has the magnitudes of all the values in $\mathbf D$, and the remaining complex numbers / rotations / angles (i.e. information on the unit circle) in $\mathbf D$ gets allocated to $\mathbf \Lambda$.  Note that since all singular values are equal to one in a unitary matrix, then $\mathbf \Sigma = \mathbf I$, which can make the decomposition a bit pedantic.    

So, we have $\mathbf U =  \mathbf {Q  D Q}^H =  \mathbf{ Q \Lambda} \mathbf {\Sigma Q }^H= \big( \mathbf{ Q \Lambda}\big)  \mathbf {\Sigma Q }^H = \big( \mathbf{ Q \Lambda}\big) \mathbf I \mathbf {Q }^H = \big( \mathbf{ Q \Lambda}\big)  \mathbf {Q }^H $, which is to say that the left and right singular vectors are the same -- and in fact can be chosen to be the eigenvectors, *if* we relax the constraint that $\mathbf \Sigma$ has only real valued, non-negative entries. 



**More on Normal Matrices**

Another way to define / test for a **square matrix** $\mathbf A$ being normal, is the following:

if 

$\mathbf{AA}^H = \mathbf A^H \mathbf A$  

Then: first we observe that $\big(\mathbf A^H \mathbf A\big)$ is a Hermitian matrix, and $\big(\mathbf{AA}^H\big)$ is also a Hermitian matrix.  This means that each can be diagonalized with a unitary basis of eigenvectors.  

Because $\mathbf{AA}^H = \mathbf A^H \mathbf A$, we know that each side has the same eigenvalues. (Side note: in general $\mathbf{AB}$ and $\mathbf B \mathbf A$ must have the same non-zero eigenvalues, as proven in the Vandermonde Matrix writeup.)  Because each side is equivalent, we can select eigenvectors for each side to be equivalent.  

Thus 

$\mathbf U \mathbf D \mathbf U^H = \mathbf{AA}^H = \mathbf A^H \mathbf A = \mathbf V \mathbf D \mathbf V^H$

where $\mathbf U = \mathbf V$.  

However, recall that $\mathbf V$ and $\mathbf U$ are the right and left singular vectors for $\mathbf A$.  


Thus we can go through our process of doing singular value decomposition on $\mathbf A$, *except we no longer enforce the constraint / definition of all singular values being real and non-negative* -- instead we simply ensure that $\mathbf \Sigma ^H \mathbf \Sigma = \mathbf D$, observing that it is always the case with square diagonal matrices that $\mathbf \Sigma ^H \mathbf \Sigma =  \mathbf \Sigma \mathbf \Sigma^H$ and we get:

$\mathbf A = \mathbf{U \Sigma V}^H =  \mathbf{U \Sigma U}^H$

and hence we have diagonalized $\mathbf A$ with a unitary basis of eigenvectors.  Thus $\mathbf A$ is normal.  

*note: In case the reader is curious as to how we can be sure to select actual correct complex numbers for $\mathbf \Sigma$, since all we seem to know is the squared magnitude of each entry -- one simple approach is to use quadratic forms*  

$\mathbf U = \bigg[\begin{array}{c|c|c|c}
\mathbf u_1 & \mathbf u_2 &\cdots & \mathbf u_n\end{array}\bigg] $

Setting $\mathbf x_k = \mathbf u_k$, we obtain the following result

$\mathbf x_k^H \mathbf A \mathbf x_k = \sigma_k$

which gives us the exact complex number associated with $\sigma_k$


**another look at normal matrices:**  
*i.e. this section shows how  the 'typical' definition of normal -- when a matrix commutes with its conjugate transpose-- implies the equality case of Schur's Inequality*  


The claim is that matrices are normal **iff** $\mathbf A \mathbf A^H = \mathbf A^H \mathbf A$  which we indicated is the same as the matrix being unitarily diagonalizable, via Schur's Inequality.  

Let's examine the Schur decompositions of each of these matrices.  

I.e. if these two matrices are the same, we see:  

 $\mathbf A \mathbf A^H  = \big(\mathbf Q \mathbf R \mathbf Q^H \big)\big(\mathbf Q \mathbf R^H \mathbf Q^H\big) = \mathbf Q \mathbf R  \mathbf R^H \mathbf Q^H = \mathbf Q \mathbf R^H  \mathbf R \mathbf Q^H = \big(\mathbf Q \mathbf R^H \mathbf Q^H \big) \big(\mathbf Q \mathbf R \mathbf Q^H \big) = \mathbf A^H \mathbf A$  
 
thus the statement comes down to verifying that 

$\mathbf Q \mathbf R  \mathbf R^H \mathbf Q^H = \mathbf Q \mathbf R^H  \mathbf R \mathbf Q^H $

and since $\mathbf Q$ is full rank (and unitary) we can multiply on the left by $\mathbf Q^H$ and on the right by $\mathbf Q$, without changing the problem, which gets us:  
 
$ \mathbf R  \mathbf R^H  = \mathbf R^H  \mathbf R  $


because diagonal matrices commute, it is easy to verify that if $\mathbf R = \mathbf D$ then the statement is true, i.e. that 

$ \mathbf D  \mathbf D^H  = \mathbf D^H  \mathbf D  $

What is more subtle is verifying the other leg of the *iff*, i.e. that if $ \mathbf R  \mathbf R^H  = \mathbf R^H  \mathbf R  $ it *must be the case that* $\mathbf R = \mathbf D$  

Note that if two matrices are equal, then their diagonal entries must be the same.  And, because of special structure (as will become clear) in triangular matrices, it is enough to verify the implications of the 'sameness' of the diagonal entries of the two matrices $\big(\mathbf R  \mathbf R^H\big)$  and  $\big(\mathbf R^H  \mathbf R \big)$.  

now let's look at our upper triangular matrix $\mathbf R$ which has $\mathbf A$'s eigenvalues along its diagonal.  We can partition this two different ways. 

first by columns  

$\mathbf R = 
\bigg[\begin{array}{c|c|c|c}
\mathbf r_1 & \mathbf r_2 &\cdots & \mathbf r_{n}
\end{array}\bigg]$

then by rows 

$\mathbf R= 
\begin{bmatrix}
\tilde{ \mathbf r_1}^T \\
\tilde{ \mathbf r_2}^T \\ 
\vdots\\ 
\tilde{ \mathbf r}_{n-1}^T \\ 
\tilde{ \mathbf r_n}^T
\end{bmatrix}
$   


consider the $j$th diagonal entry of $\big(\mathbf R^H  \mathbf R \big)$.  It is given by $\big(\mathbf R^H  \mathbf R \big)_{jj} = \mathbf r_j^H \mathbf r_j = \langle\ \mathbf r_j, \mathbf r_j\rangle  = \big \Vert \mathbf r_j \big \Vert_2^2  $

it's a bit messy, but when we consider the $j$th diagonal entry of $\big(\mathbf R  \mathbf R^H \big)$, it is given by  
$\big(\tilde{ \mathbf r_j}^T\big) \big(\tilde{ \mathbf r_j}^T\big)^H = \langle\ \tilde{ \mathbf r_j}^T, \tilde{ \mathbf r_j}^T \rangle = \big \Vert \tilde{ \mathbf r_j}^T \big \Vert_2^2 $


Thus by examining the diagonal entries of $\big(\mathbf R^H  \mathbf R \big) $  and $\big(\mathbf R  \mathbf R^H \big)$ which must be equal since $\mathbf {AA}^H = \mathbf A^H \mathbf A$ for normal matrices-- we are actually looking at the squared length (2 norms) of each column and each row of $\mathbf R$.  

In general, of course, $\text{trace}\big(\mathbf R^H  \mathbf R \big) = \text{trace}\big(\mathbf R  \mathbf R^H \big)$ by the cyclic property of trace, but this looks at the summed/ aggregated values.  Looking at diagonal entries and examining the implications if each jth diagonal entry is the same, gives considerably more insight.  (There are analogies with special structure in markov chains -- in general irreducible positive time recurrent chains having global balance equations that must be satisfied -- i.e. at the summation level for each state, however *time reversible* chains satisfy the detailed balance equations i.e. at a granular, pre-summation level which allows us to squeeze special insights out of them.)   


Specifically consider the following dynamic programming inspired approach.  (Note: this could just be referred to as induction, but that seems to miss some of the essence of the overlapping subproblems that exist here.)  

- - - -
*The Close*  

Suppose we look at the squared length of column $\mathbf r_1$  and compare it to the squared length of row $\tilde{ \mathbf r_1}^T$.  Being upper triangular, we know $\big \Vert \mathbf r_1 \big \Vert_2^2 = \big \vert \lambda_1 \big \vert^2 $.  But we are insisting the diagonal entries of $\big(\mathbf R^H  \mathbf R \big)$  and $\big(\mathbf R  \mathbf R^H \big)$ are the same, and hence we have $\big \Vert \mathbf r_1 \big \Vert_2^2 = \big \Vert \tilde{ \mathbf r_1}^T \big \Vert_2^2 = \big \vert \lambda_1 \big \vert^2 $.  We know $\tilde{ \mathbf r_1}^T$ always contains $\lambda_1$, but the length comparison tells us it *only* contains $\lambda_1$ i.e. row 1 has no non-zero entries off the diagonal.  

Put another way, $\big\Vert \big(\tilde{ \mathbf r_1}^T - \lambda_1 \mathbf e_1^T\big)^T \big \Vert_2^2 = 0$ which occurs **iff** $\big(\tilde{ \mathbf r_1}^T - \lambda_1 \mathbf e_1^T\big)^T  = \mathbf 0$, where as a reminder we have the standard basis vectors given by:  $\mathbf I = 
\bigg[\begin{array}{c|c|c|c}
\mathbf e_1 & \mathbf e_2 &\cdots & \mathbf e_{n}
\end{array}\bigg]$


Now we proceed to column 2 and row 2. When evaluating $\big \Vert \mathbf r_2 \big \Vert_2^2$, we think through the following:  being upper triangular we, again, know there are only zeros below the diagonal, and as we've just uncovered, everything above the diagonal (i.e. everything in row 1 that isn't $\lambda_1$ ) is a zero as well, and hence we know $\big \Vert \mathbf r_2 \big \Vert_2^2 = \big \vert \lambda_2 \big \vert^2 $.  But $\big \Vert \tilde{ \mathbf r_2}^T \big \Vert_2^2  = \big \Vert \mathbf r_2 \big \Vert_2^2 = \big \vert \lambda_2 \big \vert^2 $ and hence we discover that everything in row 2 must be a zero except for the eigenvalue.  

Now we could proceed most formally via induction, but the idea is that of overlapping subproblems -- i.e. we repeat the above process for $k = 3, 4, ..., n$ and for each $k$ we recognize that $\big \Vert \mathbf r_k \big \Vert_2^2 = \big \vert \lambda_k \big \vert^2 $, because the preceding subproblems tell us that there are only zeros above the diagonal entry  for column $k$ (i.e. for rows $i = \{1,..., k-1\})$.  But then, because $\big \Vert \tilde{ \mathbf r_k}^T \big \Vert_2^2  = \big \Vert \mathbf r_k \big \Vert_2^2 = \big \vert \lambda_k \big \vert^2 $ we discover that there cannot be anything non-zero to the right of the diagonal for row $k$ either (and of course being upper triangular, there are only zeros to the left of the diagonal).  And after repeating this process for all columns in $\mathbf R$, we have verified that $\mathbf R$ is in fact diagonal -- i.e. $\mathbf A$ is unitarily diagonalizable and in line with Schur's Inequality.  
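*worked example (the $2$ x $2$ case):* to make the row/column length comparison concrete, take a generic upper triangular $\mathbf R$ with a single off-diagonal entry $r$ (the symbol $r$ is introduced here only for illustration)  

$\mathbf R = \begin{bmatrix} \lambda_1 & r \\ 0 & \lambda_2 \end{bmatrix}, \qquad \mathbf R \mathbf R^H = \begin{bmatrix} \vert\lambda_1\vert^2 + \vert r \vert^2 & r \bar{\lambda_2} \\ \lambda_2 \bar{r} & \vert\lambda_2\vert^2 \end{bmatrix}, \qquad \mathbf R^H \mathbf R = \begin{bmatrix} \vert\lambda_1\vert^2 & \bar{\lambda_1} r \\ \bar{r} \lambda_1 & \vert r \vert^2 + \vert\lambda_2\vert^2 \end{bmatrix}$  

Equating the top left diagonal entries gives $\vert\lambda_1\vert^2 + \vert r \vert^2 = \vert\lambda_1\vert^2$, i.e. $\vert r \vert^2 = 0$, so $r = 0$ and $\mathbf R$ is diagonal.  The general argument repeats this comparison one row/column pair at a time.  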



Immediate (and obvious) corollaries from this characterization of normality are that Hermitian and skew Hermitian matrices are normal, and hence unitarily diagonalizable.  

i.e. for Hermitian  
$\mathbf A^H \mathbf A = \mathbf A^H \mathbf A^H = \mathbf A \mathbf A^H$  and hence $\mathbf A$ is normal  

For *skew Hermitian* $\mathbf A$  
$\mathbf A^H \mathbf A = \mathbf A^H (-1)\mathbf A^H = (-1)\mathbf A (-1)\mathbf A^H = \mathbf A \mathbf A^H$   
and hence $\mathbf A$ is normal  


**extension:** some results about skew Hermitian matrices which are normal (see above)


It is perhaps worth noting that in either the real case or the complex case, we may do a Schur Decomposition on $\mathbf A$, and see that $\lambda_k = -\lambda_k^H$ for $k = \{1, 2, ..., n\}$ , i.e. that each eigenvalue is equal to the negative of its own conjugate.  This means that each eigenvalue is either purely imaginary, or equal to zero (which can be interpreted as a special case of purely imaginary).  One immediate consequence for the real case, is that for an $n$ x $n$ real, skew symmetric $\mathbf A$, if $n$ is odd, we know that it is singular.  Why? For real matrices, complex eigenvalues come in conjugate pairs, which means that there must be (at least) one element that does not have a pair, and hence it cannot be imaginary and hence must be zero.  

Note: we can also verify singularity of odd dimensional skew symmetric matrices another way. Recalling that the determinant is a multi-linear function, and how a scalar impacts it, we may say:  

$\det\Big(\mathbf A\Big) = \det\Big(\mathbf A^T \Big) = \det\Big( \big(-\mathbf A\big) \Big) = \det\Big((-1) \mathbf A\Big)  = (-1)^n \det\Big(\mathbf A\Big)$  

or equivalently 

$\det\Big(\mathbf A\Big) = \det\Big(\mathbf A^T \Big) = \det\Big( \big(-\mathbf A\big) \Big) = \det\Big(\big(-\mathbf I\big) \mathbf A\Big)  = \det\Big(-\mathbf I\Big) \det\Big( \mathbf A\Big) =(-1)^n \det\Big(\mathbf A\Big)$  

hence if $n$ is odd, we have $\det\big(\mathbf A\big) = -\det\big(\mathbf A\big)$ or  
$2\det\big(\mathbf A\big) = 0$, or    
$\det\big(\mathbf A\big) = 0$, i.e. $\mathbf A$ is singular    
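A small numerical illustration of the singularity claim, as a sketch assuming NumPy. The matrix is built as $\mathbf B - \mathbf B^T$ from a random $\mathbf B$, which is just one convenient way to produce a real skew symmetric test matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5                                  # odd dimension
B = rng.standard_normal((n, n))
A = B - B.T                            # real skew symmetric: A.T == -A

det_A = np.linalg.det(A)               # zero up to floating point error
rank_A = np.linalg.matrix_rank(A)      # strictly less than n

print(det_A, rank_A)
```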

Note that this is a moderately more general result.  It applies directly over any scalar field whose characteristic is not two (the last step divides by two), not only over fields of characteristic zero, *and* it tells us additional information about a matrix with complex scalars (i.e. with at least one scalar with a non-zero imaginary part) that is skew symmetric (but not skew hermitian) -- while such matrices would not in general be normal, and they would not have a 'requirement' for eigenvalues coming in conjugate pairs, they would be singular if $n$ is odd.  


An interesting *additional* way to look at Real Skew Symmetric matrices is that they in some sense represent bipartite graphs 

(note while the Zero Matrix is *technically* skew symmetric, we confine the discussion below to non-zero real skew symmetric matrices, in the interest of linguistic clarity)  

- - - - -  
consider that for odd natural number $k$, for real skew symmetric matrices we have   
$\text{trace}\big(\mathbf A^k\big) = \text{trace}\big((\mathbf A^k)^T\big)  = \text{trace}\big((\mathbf A^T)^k\big)  = \text{trace}\big((-\mathbf A)^k\big) = \text{trace}\big((-1)^k\mathbf A^k\big) = (-1)^k\text{trace}\big(\mathbf A^k\big) = -\text{trace}\big(\mathbf A^k\big) $  

hence  
$2\cdot \text{trace}\big(\mathbf A^k\big)=0\to \text{trace}\big(\mathbf A^k\big)=0$   
i.e. real skew symmetric matrices are traceless for odd powers. Recalling how this insight was used in "Blocked_Matrices_Sympy_BipartiteGraphs.ipynb" for bipartite graphs, we can immediately recognize that the spectrum of $\mathbf A$ is given by  

$\{\lambda_1, -\lambda_1, \lambda_2, -\lambda_2, ..., \lambda_{n/2}, -\lambda_{n/2}\}$  
(technical nit: if $n$ is odd, we can instead end with $\frac{n-1}{2}$ and insert a zero as we know such a matrix is singular, based on the above determinant argument) 

and we know   
$0 \lt \big\Vert \mathbf A\big \Vert_F^2 = \text{trace}\big(\mathbf A^T \mathbf A\big)=-\text{trace}\big(\mathbf A^2\big)$ 

(the LHS is due to positive definiteness and the fact that we've carved the zero matrix out from this)  

or equivalently,  
$\text{trace}\big(\mathbf A^2\big) \lt 0 \lt \big\Vert \mathbf A\big \Vert_F^2 $ 


but revisiting Schur's Inequality, we can see for a real skew symmetric matrix that we have an equality case 

$\big \Vert \mathbf A \big \Vert_F^{2} = \text{trace}\big(\mathbf A^H \mathbf A\big) \geq \sum_{i = 1}^{n} \big \vert \lambda_i\big \vert ^2 \geq \big \vert \sum_{i = 1}^{n} \lambda_i^2\big \vert = \Big \vert \text{trace}\big(\mathbf A \mathbf A\big) \Big \vert$  

This of course is another way of verifying that real skew symmetric matrices are normal, but we may also note that the triangle inequality becoming an equality tells us that each $\lambda_i^2$ must point in the 'same direction'.  

consider a complex 'mixed' eigenvalue given by 
$\lambda = a + bi$  

where $a \neq 0$, $b \neq 0$  

then 
$\lambda^2 = (a + bi)^2 = a^2 - b^2 + 2abi$  
but since our matrix is real, the eigenvalues come in conjugate pairs, so the conjugate is  
$\bar{\lambda}^2 = (a - bi)^2 = a^2 - b^2 - 2abi$  

which cannot 'point' in the same direction since $2abi \neq 0$  

hence $\mathbf A$ cannot have mixed complex eigenvalues.  This leaves us with the option of purely real eigenvalues and/or purely imaginary (where $\lambda =0$ may be interpreted as either one).  

Again the triangle inequality comes into play:  
consider the purely real (and non-zero) case  
if $\lambda_k = a \neq 0$ and $\lambda_{k+1} = -a$ (recall bipartite style traces), then  
$\lambda_k^2 = a^2 = \lambda_{k+1}^2$  

and now consider the purely imaginary (and non-zero) case  
$\lambda_j = bi \neq 0$ and (for bipartite and conjugate pair reasons) $\lambda_{j+1} = -bi$, then  
$\lambda_{j}^2 = (bi)^2 = -b^2 = \lambda_{j+1}^2$  

However $\lambda_j$ and $\lambda_k$ cannot both exist -- if they did, then the triangle inequality would be violated   
i.e. $\lambda_j^2 $ is negative and $\lambda_k^2$ is positive, so they point in 'opposite directions'. 

Hence the non-zero eigenvalues must be either all purely real or all purely imaginary.  But since all purely real numbers, squared, are non-negative, the associated trace of $\mathbf A^2$ would have to be non-negative, yet we know  
$\text{trace}\big(\mathbf A^2\big) \lt 0 $  

hence $\mathbf A$ must have purely imaginary eigenvalues.   
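The conclusions of this section can be spot checked numerically. The following is a minimal sketch assuming NumPy, again using a random $\mathbf B - \mathbf B^T$ as a stand-in for a generic non-zero real skew symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
B = rng.standard_normal((n, n))
A = B - B.T                                        # real skew symmetric

eigvals = np.linalg.eigvals(A)
max_real_part = np.max(np.abs(eigvals.real))       # ~0: purely imaginary spectrum

# spectrum is symmetric about zero: {lambda, -lambda, ...}
imag_sorted = np.sort(eigvals.imag)

trace_A3 = np.trace(np.linalg.matrix_power(A, 3))  # odd power: traceless
trace_A2 = np.trace(A @ A)                         # strictly negative
frob_sq = np.linalg.norm(A, "fro") ** 2            # equals -trace(A^2)

print(max_real_part, trace_A3, trace_A2, frob_sq)
```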


**yet another look at Normal Matrices**

if a matrix $\mathbf A$ is normal, we have 

$\mathbf A^H \mathbf A = \mathbf {AA}^H$

or equivalently:

$\mathbf A^H \mathbf A - \mathbf {AA}^H = \mathbf 0$

now looking at the squared Frobenius norm of this matrix, we can get another test for normality in general... that is we make use of positive definiteness for norms -- in particular that the (squared) Frobenius norm of a matrix is zero **iff** the matrix is the zero matrix.


$\big \Vert \mathbf A^H \mathbf A - \mathbf {AA}^H \big \Vert_F^2 = \text{trace}\Big(\big(\mathbf A^H \mathbf A - \mathbf {AA}^H \big)^H\big(\mathbf A^H \mathbf A - \mathbf {AA}^H \big)\Big)= 0$

we can expand this to

$ \text{trace}\Big(\big(\mathbf A^H \mathbf A\big)^2 \Big) + \text{trace}\Big(\big(\mathbf A \mathbf A^H\big)^2 \Big) - \text{trace}\Big(\big(\mathbf A^H \mathbf A\big)\big(\mathbf A \mathbf A^H\big) \Big) - \text{trace}\Big(\big(\mathbf A \mathbf A^H\big)\big(\mathbf A^H \mathbf A\big) \Big)  =  0$  


and using cyclic property of the trace, then re-arranging terms:

$ \text{trace}\Big(\big(\mathbf A^H \mathbf A\big)^2 \Big) + \text{trace}\Big( \mathbf A^H \mathbf A \mathbf A^H \mathbf A\Big) - \text{trace}\Big(\mathbf A^H \mathbf A \mathbf A \mathbf A^H \Big) - \text{trace}\Big(\mathbf A \mathbf A^H \mathbf A^H \mathbf A \Big)  =  0$  

$2\cdot \text{trace}\Big(\big(\mathbf A^H \mathbf A\big)^2 \Big)  =  \text{trace}\Big(\mathbf A^H \mathbf A^H  \mathbf A \mathbf A \Big) + \text{trace}\Big(\mathbf A^H \mathbf A^H   \mathbf A \mathbf A \Big) = 2 \cdot \text{trace}\Big(\mathbf A^H \mathbf A^H \mathbf A \mathbf A  \Big)  = 2 \cdot \text{trace}\Big(\big(\mathbf A^2\big)^H \big(\mathbf A^2\big)  \Big)  $ 

which gives us another test for normality: We may say a matrix is normal **iff**  
$\text{trace}\Big(\big(\mathbf A^H \mathbf A\big)^2 \Big) =  \text{trace}\Big(\big(\mathbf A^2\big)^H \big(\mathbf A^2\big)  \Big)   $ 
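The trace test is easy to try out in code. The sketch below assumes NumPy; the helper names `normality_gap` and `commutator_norm_sq` are made up for this illustration. From the expansion above, the squared Frobenius norm of $\mathbf A^H \mathbf A - \mathbf{AA}^H$ should be exactly twice the gap between the two traces.

```python
import numpy as np

def normality_gap(A):
    """trace((A^H A)^2) - trace((A^2)^H A^2); zero iff A is normal."""
    AhA = A.conj().T @ A
    A2 = A @ A
    return np.trace(AhA @ AhA).real - np.trace(A2.conj().T @ A2).real

def commutator_norm_sq(A):
    """Squared Frobenius norm of A^H A - A A^H."""
    C = A.conj().T @ A - A @ A.conj().T
    return np.linalg.norm(C, "fro") ** 2

normal_example = np.array([[1.0, -2.0],
                           [2.0, 1.0]])    # identity plus a skew symmetric matrix
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])             # defective, hence not normal

print(normality_gap(normal_example), commutator_norm_sq(normal_example))
print(normality_gap(shear), commutator_norm_sq(shear))
```

For the shear matrix the gap works out by hand to $7 - 6 = 1$ and the commutator norm to $2$, matching the factor of two in the derivation.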




**a nicer way to this result**  
$\text{trace}\Big(\big(\mathbf A^H \mathbf A\big)^2 \Big) = \text{trace}\Big(\big(\mathbf A \mathbf A^H\big)^2 \Big)^\frac{1}{2} \text{trace}\Big(\big(\mathbf A^H \mathbf A\big)^2 \Big)^\frac{1}{2}  \geq   \text{trace}\Big(\big(\mathbf A^H \mathbf A\big)^H\big( \mathbf A\mathbf A^H\big)  \Big)   =\text{trace}\Big(\big(\mathbf A^2\big)^H \big(\mathbf A^2\big)  \Big) $ 

a direct way of evaluating this trace inequality is to observe the above.  And setting aside the trivial case of $\mathbf A = \mathbf 0$, this means, if the above inequality is met with equality then Cauchy-Schwarz tells us   

$\big( \mathbf A\mathbf A^H\big) = \gamma \cdot \big(\mathbf A^H \mathbf A\big)$  
but taking the trace of each side tells us  
$0 \lt \text{trace}\big( \mathbf A\mathbf A^H\big) = \text{trace}\big(\mathbf A^H \mathbf A\big) = \gamma \cdot \text{trace}\big(\mathbf A^H \mathbf A\big) \implies 1 = \gamma $    

i.e. due to Cauchy-Schwarz, being met with equality we know  
$\big( \mathbf A\mathbf A^H\big) = \big(\mathbf A^H \mathbf A\big)$  
i.e. $\mathbf A$ is normal   

This is ultimately the exact same thing as the above calculations showing 
$\text{trace}\Big(\big(\mathbf A^2\big)^H \big(\mathbf A^2\big)\Big) = \text{trace}\Big(\big(\mathbf A^H \mathbf A\big)^2 \Big)$  implies  
$\big \Vert \mathbf A^H \mathbf A - \mathbf {AA}^H \big \Vert_F^2 = 0 $  
however showing the result directly with Cauchy-Schwarz seems more pleasant to your author  


The following is problem 49, from 

"Matrices : Theory & Applications
Additional exercises"

found here: 

http://perso.ens-lyon.fr/serre/DPF/exobis.pdf


For $\mathbf M \in \mathbb R^{n x n}$

**claim:**  

$\Big( \text{trace}\big( \mathbf M \big) \Big)^2  \leq \text{rank}\big(\mathbf M\big)\text{trace}\big(\mathbf M^T \mathbf M\big) = \text{rank}\big(\mathbf M\big)\big \Vert \mathbf M\big\Vert_F^2$

for the proof, suppose that we have well ordered eigenvalues, $k$ of which are not zero, i.e. where $0 \leq k \leq n$, given below:

$\big \vert \lambda_1 \big \vert \geq \big \vert \lambda_2 \big \vert \geq ... \geq \big \vert \lambda_k \big \vert \gt 0 = \big \vert \lambda_{k+1} \big \vert = .... = \big \vert \lambda_{n}\big \vert$


(Note that while the field is Reals, we observe the typical relaxation that allows Complex numbers during intermediate steps involving eigenvalues.) 

**proof:**    
we may collect all $n$ eigenvalues in a diagonal matrix $\mathbf \Lambda$, and the $k$ non-zero eigenvalues in a $k$ x $k$ diagonal matrix $\mathbf D$.  The ones vector $\mathbf 1$ has $k$ entries, each equal to one. 

$\Big( \text{trace}\big( \mathbf M \big) \Big)^2  = \langle \mathbf 1_k \,, \big( \mathbf {D1_k}\big)\rangle^2 \leq \langle \mathbf 1_k \,, \mathbf 1_k\rangle\cdot \langle \big(\mathbf {D1}_k\big) \,, \big( \mathbf {D1}_k\big)\rangle    = k \cdot \text{trace}\big(\mathbf D^H \mathbf D\big) = k\cdot \text{trace}\big(\mathbf \Lambda^H \mathbf \Lambda \big) = k \cdot \sum_{i=1}^n \big\vert \lambda_i\big\vert ^2 $

where the above inequality is given by Cauchy-Schwarz

$\Big( \text{trace}\big( \mathbf M \big) \Big)^2 \leq k \cdot\sum_{i=1}^n \big\vert \lambda_i\big\vert ^2  \leq \text{rank}\big(\mathbf M\big) \sum_{i=1}^n \big\vert \lambda_i\big\vert ^2 $

From here we observe that the number $k$ of non-zero eigenvalues of a matrix is a *lower bound* on its rank.  
(I.e. this final item is in here to address the possibility of defective matrices. A common application of this theorem is with normal, and in particular real symmetric, matrices, where subtleties about defective matrices are not relevant.)  


- - - - -  
*justification:* any real $n$ x $n$ matrix is unitarily similar to an upper triangular one.  Interpreted in terms of Gaussian Elimination, such an upper triangular matrix has at least $k$ pivots.  Put differently, once said matrix is put in reduced row echelon form, its number of pivots (i.e. its rank) cannot be less than $k$.   

*alternative justification:* using rank nullity, we have  

$\text{dim}\Big(\text{nullspace}\big(\mathbf M\big)\Big) = \text{geometric multiplicity of eigenvalue zero} \leq \text{algebraic multiplicity of eigenvalue zero}$  
$n = \text{dim}\Big(\text{nullspace}\big(\mathbf M\big)\Big) + \text{rank}\big(\mathbf M\big) \leq  \text{algebraic multiplicity of eigenvalue zero} + \text{rank}\big(\mathbf M\big)$  

so  
$\text{algebraic multiplicity of Non-Zero eigenvalues (k)} = n - \text{algebraic multiplicity of eigenvalue zero} \leq  \text{rank}\big(\mathbf M\big)$  
- - - - -  
thus we can finish with  

$\Big( \text{trace}\big( \mathbf M \big) \Big)^2 \leq \text{rank}\big(\mathbf M\big) \sum_{i=1}^n \big\vert \lambda_i\big\vert ^2 \leq \text{rank}\big(\mathbf M\big)\big \Vert \mathbf M\big\Vert_F^2$ 

via Schur's Inequality.
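A numerical illustration of the claim, as a sketch assuming NumPy. The low rank matrix `M` is a random product of thin factors, and the projection `P` is added here as a simple case where the bound is met with equality; both are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 6, 2

# product of an n x r and an r x n factor has rank r (with probability one)
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

lhs = np.trace(M) ** 2
rank_M = np.linalg.matrix_rank(M)
rhs = rank_M * np.linalg.norm(M, "fro") ** 2
print(lhs, rank_M, rhs)                    # lhs <= rhs

# equality case: an orthogonal projection of rank 2
P = np.diag([1.0, 1.0, 0.0])
lhs_P = np.trace(P) ** 2
rhs_P = np.linalg.matrix_rank(P) * np.linalg.norm(P, "fro") ** 2
print(lhs_P, rhs_P)
```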


**Misc. extension:**  

**claim:** Suppose $\mathbf A$ and $\mathbf B$ are normal. (Note: this implies they are square) Further suppose that $\mathbf {AB}$ is normal.  Then $\mathbf {BA}$ is normal as well.  

**proof:**  
By Schur Inequality, since $\mathbf {AB}$ is normal, we know 
$\big\Vert \mathbf {AB} \big\Vert_F^2 = \sum_{i=1}^n \Big(\big\vert \lambda\big(\mathbf {AB}\big)_i \big\vert^2\Big) $

Now we make use of the fact that $\big(\mathbf{AB}\big)$ and $\big(\mathbf {BA}\big)$ have the same eigenvalues.  

*the main argument:*  
$\sum_{i=1}^n \Big(\big\vert \lambda\big(\mathbf {BA}\big)_i \big\vert^2\Big)$   
$=\sum_{i=1}^n \Big(\big\vert \lambda\big(\mathbf {AB}\big)_i \big\vert^2\Big)$   
$=\big\Vert \mathbf {AB} \big\Vert_F^2 $   
$= \text{trace}\big(\mathbf B^H \mathbf A^H \mathbf A \mathbf B\big)$   
$= \text{trace}\big(\mathbf A^H \mathbf A \mathbf B \mathbf B^H \big) $  
$= \text{trace}\big(\mathbf A \mathbf A^H \mathbf B^H \mathbf B \big) $  
$= \text{trace}\big(\mathbf A^H \mathbf B^H \mathbf B \mathbf A \big) $  
$=  \big\Vert \mathbf {BA} \big\Vert_F^2$   

where we make use of the cyclic property of the trace, and $\mathbf A \mathbf A^H =\mathbf A^H \mathbf A $ due to the normality of $\mathbf A$, and  $\mathbf B^H \mathbf B = \mathbf B \mathbf B^H $ due to the normality of $\mathbf B$

Hence we've seen 
$\big\Vert \mathbf {BA} \big\Vert_F^2 = \sum_{i=1}^n \Big(\big\vert \lambda\big(\mathbf {BA}\big)_i \big\vert^2\Big)$, i.e. that it satisfies the Schur Inequality with equality, and hence $\big(\mathbf{BA}\big)$ is normal. 
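To see the claim in action on a case where $\mathbf{AB} \neq \mathbf{BA}$, here is a small sketch assuming NumPy. The particular matrices (a permutation and a diagonal matrix) and the helper names `is_normal` and `schur_gap` are illustrative choices, not part of the proof.

```python
import numpy as np

def is_normal(M, tol=1e-12):
    """True when M commutes with its conjugate transpose."""
    return np.allclose(M.conj().T @ M, M @ M.conj().T, atol=tol)

def schur_gap(M):
    """||M||_F^2 - sum |lambda_i|^2; non-negative, and zero iff M is normal."""
    eigvals = np.linalg.eigvals(M)
    return np.linalg.norm(M, "fro") ** 2 - np.sum(np.abs(eigvals) ** 2)

A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])        # permutation matrix (normal)
B = np.diag([2.0, -2.0, 5.0])          # diagonal matrix (normal)

AB, BA = A @ B, B @ A
print(is_normal(AB), is_normal(BA), np.allclose(AB, BA))
print(schur_gap(AB), schur_gap(BA))
```

Here $\mathbf{AB}$ and $\mathbf{BA}$ are different matrices, yet both satisfy Schur's Inequality with equality.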


**applications of  Schur Triangularization / Schur's Inequality**  
for any $\mathbf A \in \mathbb C^\text{n x n}$  

(i)  For any *simple* eigenvalue $\lambda_1 = 1$ with (right) eigenvector $\mathbf x$ the matrix $\big(\mathbf A -\mathbf I + \mathbf x\mathbf v^*\big)$ 
is invertible, for *any* $\mathbf v$ such that $\mathbf v^H\mathbf x \neq 0$  (not orthogonal $\mathbf v$)  

note via rescaling, and shifting (i.e. adding the identity matrix) we may always assume WLOG that the simple eigenvalue of interest is equal to one  

we apply Schur triangularization, with $\mathbf x$ (scaled to have norm 1) as the first vector  

$\mathbf Q := \bigg[\begin{array}{c|c|c|c}\mathbf x & \mathbf q_2 &\cdots & \mathbf q_{n}\end{array}\bigg] $  

$\mathbf Q^H\mathbf A \mathbf Q = \begin{bmatrix} \lambda & * \\ \mathbf 0 & \mathbf R_{n-1} \end{bmatrix}$  

so 
$\mathbf Q^H\big(\mathbf A-\mathbf I_n\big)\mathbf Q = \mathbf Q^H\mathbf A\mathbf Q - \mathbf I_n  = \begin{bmatrix} 0 & * \\ \mathbf 0 & \mathbf R_{n-1}-\mathbf I_{n-1} \end{bmatrix}$   

and $\det\big(\mathbf R_{n-1}-\mathbf I_{n-1}\big) \neq 0$ because $\lambda_1 =1 $ is simple, so $\mathbf R_{n-1}-\mathbf I_{n-1}$ is triangular with no zeros on the diagonal  

finally, since  
$\mathbf x = \mathbf Q\mathbf e_1$, then we have  
$\mathbf Q^H\mathbf x = \mathbf e_1$  


so 
$\mathbf Q^H\mathbf x\mathbf v^H\mathbf Q = \mathbf e_1\mathbf y^H$, where $\mathbf y := \mathbf Q^H\mathbf v$, and  
$\text{trace}\big(\mathbf Q^H\mathbf x\mathbf v^H\mathbf Q\big)=\text{trace}\big(\mathbf v^H\mathbf x\big)= \mathbf y^H\mathbf e_1= \bar{y_1} \neq 0$   

finally we consider 

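$\mathbf A-\mathbf I_n +\mathbf x\mathbf v^*$ and effect a similarity transform (which preserves rank) and get  

$\mathbf Q^H\big(\mathbf A-\mathbf I_n +\mathbf x\mathbf v^H\big)\mathbf Q = \begin{bmatrix} \bar{y_1} & * \\ \mathbf 0 & \mathbf R_{n-1}-\mathbf I_{n-1} \end{bmatrix}$  

This matrix is block upper triangular with determinant $\bar{y_1}\cdot\det\big(\mathbf R_{n-1}-\mathbf I_{n-1}\big) \neq 0$, hence $\mathbf A-\mathbf I_n +\mathbf x\mathbf v^*$ is invertible.  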

*application*  
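for a (finite state) Markov chain with a single communicating class, with row-stochastic transition matrix $\mathbf A$ (so $\mathbf {A1} = \mathbf 1$) and steady state vector $\mathbf \pi$ (so $\mathbf \pi^T\mathbf A = \mathbf \pi^T$ and $\mathbf \pi^T\mathbf 1 = 1$), we have    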

$\mathbf \pi^T\big(\mathbf A - \mathbf I +\mathbf {11}^T\big) = \mathbf \pi^T\mathbf A - \mathbf \pi^T\mathbf I + \mathbf \pi^T\mathbf {11}^T = \mathbf {1}^T$  

but since $\big(\mathbf A - \mathbf I +\mathbf {11}^T\big)^{-1}$ exists, we could also solve for the steady state vector as  

$\mathbf \pi^T = \mathbf {1}^T\big(\mathbf A - \mathbf I +\mathbf {11}^T\big)^{-1}$  
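As a concrete illustration, the relation $\mathbf \pi^T\big(\mathbf A - \mathbf I +\mathbf {11}^T\big) = \mathbf 1^T$ can be solved directly as a linear system. This is a minimal sketch assuming NumPy; the 3-state transition matrix is a hypothetical example chosen so that the answer can be verified by hand.

```python
import numpy as np

# Hypothetical 3-state row-stochastic transition matrix (single communicating class)
A = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
n = A.shape[0]
ones = np.ones(n)

G = A - np.eye(n) + np.outer(ones, ones)   # A - I + 1 1^T, invertible by (i)

# pi^T G = 1^T  is equivalent to  G^T pi = 1
pi = np.linalg.solve(G.T, ones)

print(pi)           # steady state vector
print(pi @ A)       # equals pi again
```

Solving the system with `np.linalg.solve` avoids forming the inverse explicitly, which is usually preferable numerically. For this chain the detailed balance equations give $\mathbf \pi = (0.25, 0.5, 0.25)$ by hand.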


*converse:*  
a left eigenvector associated with a simple eigenvalue cannot be orthogonal to the right eigenvector associated with the same eigenvalue.  

where as before $\lambda_1$ is simple and WLOG $ =1 $  

i.e. suppose for a contradiction that  
$\mathbf z^H \mathbf A = \lambda_1 \mathbf z^H = \mathbf z^H$ and $\mathbf A\mathbf x = \mathbf x$  but $\mathbf z^H \mathbf x = 0$  


then as before, consider  
$\mathbf Q := \bigg[\begin{array}{c|c|c|c}\mathbf x & \mathbf q_2 &\cdots & \mathbf q_{n}\end{array}\bigg] $   
$\mathbf Q^H\mathbf A \mathbf Q = \begin{bmatrix} \lambda & * \\ \mathbf 0 & \mathbf R_{n-1} \end{bmatrix}$  

and we subtract the identity matrix to shift these eigenvectors into the kernel   

$\mathbf Q^H\big(\mathbf A - \mathbf I\big)\mathbf Q =  \begin{bmatrix} 0 & * \\ \mathbf 0 & \mathbf R_{n-1}-\mathbf I_{n-1} \end{bmatrix} $ 

and for convenience examine the conjugate transpose, and multiply on the right by $\mathbf Q^H$    

$\mathbf Q^H\big(\mathbf A^H - \mathbf I\big) =  \begin{bmatrix} 0 & \mathbf 0^H \\ * & \mathbf R_{n-1}^H-\mathbf I_{n-1} \end{bmatrix}\mathbf Q ^H $  
if we compute the same thing two different ways, we'll see it results in a contradiction. First  

$\mathbf Q^H\big(\mathbf A^H - \mathbf I\big)\mathbf z = \mathbf Q^H\big(\mathbf A^H\mathbf z - \mathbf I\mathbf z\big)=\mathbf Q^H\big(\mathbf z - \mathbf z\big) = \mathbf Q^H \mathbf 0 = \mathbf 0$  

second  
$\mathbf 0 = \begin{bmatrix} 0 & \mathbf 0^H \\ * & \mathbf R_{n-1}^H-\mathbf I_{n-1} \end{bmatrix}\mathbf Q ^H\mathbf z = \begin{bmatrix} 0 & \mathbf 0^H \\ * & \mathbf R_{n-1}^H-\mathbf I_{n-1} \end{bmatrix}\big(\sum_{j=2}^n \alpha_j\mathbf e_j\big) $   


because $\mathbf Q$ is unitary we know $\mathbf Q^H \mathbf z \neq \mathbf 0$ since $\mathbf z \neq \mathbf 0$ but we also know that the top row of $\mathbf Q^H$ is $\mathbf x^H$ which is orthogonal to $\mathbf z$ and hence the top component of $\mathbf Q^H\mathbf z$ is zero.  Thus $\mathbf Q^H\mathbf z$ is written as a nontrivial linear combination of standard basis vectors $\mathbf e_j$ for $j\in \{2,3,...,n\}$  

we can see that the above is equivalent to asserting  
$\begin{bmatrix} \mathbf 0^H \\ \mathbf R_{n-1}^H-\mathbf I_{n-1} \end{bmatrix}\mathbf c = \mathbf 0$  for some $\mathbf c \neq \mathbf 0$  


but 

$\text{rank}\left(\begin{bmatrix} \mathbf 0^H \\ \mathbf R_{n-1}^H-\mathbf I_{n-1} \end{bmatrix}\right) = \text{rank}\left(\mathbf R_{n-1}^H-\mathbf I_{n-1}\right) = n-1$    

which is a contradiction.  

(equivalently $\mathbf e_1$ is in the nullspace of $\begin{bmatrix} 0 & \mathbf 0^H \\ * & \mathbf R_{n-1}^H-\mathbf I_{n-1} \end{bmatrix}$ and by rank-nullity that has a 1 dimensional nullspace so it is impossible for a non-trivial linear combination of standard basis vectors $\mathbf e_j$ for $j\in \{2,3,...,n\}$  to be in the nullspace of that matrix.)   

*remark:*   
The above proof immediately generalizes to the case of semisimple eigenvalues  

(ii)  The absolute values of the eigenvalues of $\mathbf A \in \mathbb C^\text{n x n }$ are weakly majorized by the singular values of $\mathbf A$ with equality in each partial sum *iff* $\mathbf A$ is normal.  The weak majorization of absolute values of eigenvalues by singular values is proven in the middle of the "Fun with trace" notebook.  Here we examine the equality conditions.  

The first leg is immediate -- if a matrix is normal, its eigenvalues *are* its singular values --up to rescaling by a point on the unit circle.  

The second leg, consider $f:u \mapsto u^2$ which is a strictly convex and increasing function on $[0, \infty)$, so as shown in the "Fun with Trace" notebook,  

we know, for $k\in\{1,2,...,n\}$  
$\sum_{i=1}^k \big \vert\lambda_i\big \vert \leq  \sum_{i=1}^k  \sigma_i$  
and if there is even one $r$ where  
$\sum_{i=1}^r \big \vert\lambda_i\big \vert \lt  \sum_{i=1}^r  \sigma_i$  

then we have (weak) majorization that is not met with equality in each partial sum so application of $f$ gives  
$\sum_{i=1}^n \big \vert\lambda_i\big \vert^2 \lt  \sum_{i=1}^n  \sigma_i^2$    
which means Schur's Inequality is not met with equality and hence $\mathbf A$ is not *normal*  
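The strict inequality can be seen on a small non-normal matrix. The following sketch assumes NumPy; the $2$ x $2$ upper triangular matrix is a hypothetical example picked so the eigenvalues ($1$ and $2$) can be read off the diagonal.

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [0.0, 2.0]])                 # upper triangular, not normal

# eigenvalue magnitudes and singular values, both in descending order
abs_eigs = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
sing_vals = np.linalg.svd(A, compute_uv=False)

partial_eigs = np.cumsum(abs_eigs)         # partial sums of |lambda_i|
partial_svs = np.cumsum(sing_vals)         # partial sums of sigma_i

sum_sq_eigs = np.sum(abs_eigs ** 2)        # 1 + 4 = 5
sum_sq_svs = np.sum(sing_vals ** 2)        # squared Frobenius norm = 21

print(partial_eigs, partial_svs)
print(sum_sq_eigs, sum_sq_svs)
```

Each partial sum of eigenvalue magnitudes sits strictly below the corresponding partial sum of singular values, and the sums of squares are $5$ versus $21$, so Schur's Inequality is strict here.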

(iii) This is proven in a generalized way in the Artin chapter 8 notes,  
left and right eigenvectors associated with a simple eigenvalue are not orthogonal with respect to the dot product  


(iii) $\mathbf A$ commutes with every other n x n matrix *iff* it is the scalar matrix $c \mathbf I$ (i.e. scaled identity matrix, where $c\in \mathbb C$ and in fact may be zero).  

(it is immediate that the scaled identity matrix commutes with any other matrix but it is not as easy to prove the other leg without some cleverness)  

Since $\mathbf A$ commutes with every matrix, it commutes in particular with $\mathbf A^*$, i.e. $\mathbf A^*\mathbf A = \mathbf A\mathbf A^*$, which means that $\mathbf A$ is normal and hence unitarily diagonalizable, so 

$\mathbf D = \mathbf Q^*\mathbf A\mathbf Q = \mathbf Q^{-1}\mathbf A\mathbf Q = \mathbf Q^{-1}\mathbf Q\mathbf A = \mathbf A$  

Thus $\mathbf A$ is necessarily some diagonal matrix $\mathbf D$.  We know $\mathbf A$ commutes with any permutation matrix $\mathbf P$ so  

$\mathbf P\mathbf D\mathbf P^T= \mathbf D\mathbf P\mathbf P^T = \mathbf D$  
(which in essence is a graph isomorphism argument)  
thus each component of the diagonal must be the same constant value $c$, 
we conclude $\mathbf A = c \mathbf I$  

**remark**  
there are many other ways to prove this result and in particular, over arbitrary fields.  *However the above use of normality results in an extremely short proof.*  


That said, if we were e.g. interested in group properties (and the center in particular) we'd have  

if $\mathbf A \in GL_n(\mathbb F)$ commutes with *every* matrix $\mathbf B \in GL_n(\mathbb F)$, then  
$\mathbf A = c \cdot \mathbf I$ with $0 \neq c \in \mathbb F$  

A nice way to prove this is to consider the Sylvester equation 

$T\big(\mathbf B\big) = \mathbf A\mathbf B - \mathbf B \mathbf A = \mathbf 0$  
and in particular to check it on the standard basis $\{\mathbf e_i\mathbf e_j^T\}$ for n x n matrices (treating matrices as a vector space), thus  
for $i,j \in \{1,2,...,n\}$  

$\mathbf A\mathbf e_i\mathbf e_j^T- \mathbf e_i\mathbf e_j^T \mathbf A = \mathbf 0$  

at this point, by linearity of $T$, we have for every n x n matrix $\mathbf B$  
$T\big(\mathbf B\big) = \mathbf 0$  

so $T$ is a linear transformation on a space of dimension $n^2$ whose kernel also has dimension $n^2$, so it has rank 0.  This should be enough to uniquely identify $\mathbf A$ as the scaled identity matrix, but we proceed via use of the Kronecker product to make this explicit.  

then re-arrange the equation  
$\mathbf e_i\mathbf e_j^T \mathbf A=\mathbf A\mathbf e_i\mathbf e_j^T  $  
and take advantage of the invertibility of $\mathbf A \in GL_n(\mathbb F)$  

$\mathbf A^{-1}\mathbf e_i\mathbf e_j^T\mathbf A = \mathbf e_i\mathbf e_j^T  $   
from here, we can visit the results under Sylvester Equation in the Kronecker products notebook and write this as  
$\text{vec}\big(\mathbf A^{-1}\mathbf e_i\mathbf e_j^T \mathbf A\big) = \big(\mathbf A^T \otimes \mathbf A^{-1}\big)\text{vec}\big(\mathbf e_i\mathbf e_j^T \big)= \text{vec}\big(\mathbf e_i\mathbf e_j^T \big)$  
for  
$i,j \in \{1,2,...,n\}$  
or  

$\big(\mathbf A^T \otimes \mathbf I\big)\big(\mathbf I \otimes \mathbf A^{-1}\big)\mathbf I_{n^2}=\big(\mathbf A^T \otimes \mathbf A^{-1}\big)\mathbf I_{n^2} = \mathbf I_{n^2}$  

so  

$\big(\mathbf A^T \otimes \mathbf I\big)= \big(\mathbf I \otimes \mathbf A^{-1}\big)^{-1}=\big(\mathbf I \otimes \mathbf A\big)$   

Since  

$\begin{bmatrix}
a_{1,1}\mathbf I & *\\
* & *\\ 
\end{bmatrix}=\big(\mathbf A^T \otimes \mathbf I\big) = \big(\mathbf I \otimes \mathbf A\big)=\begin{bmatrix}
\mathbf {A} & *\\
* & *\\ 
\end{bmatrix}$ 

we see that not only must $\mathbf A$ be diagonal, it must have constant diagonal components, hence $\mathbf A = c\mathbf I$.  
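
The vectorization step above can be checked numerically.  This is a minimal sketch, assuming NumPy and the column-stacking convention for $\text{vec}$; the helper names `vec` and `kron_gap` are hypothetical and only serve the illustration.  It verifies $\text{vec}\big(\mathbf A^{-1}\mathbf e_i\mathbf e_j^T \mathbf A\big) = \big(\mathbf A^T \otimes \mathbf A^{-1}\big)\text{vec}\big(\mathbf e_i\mathbf e_j^T \big)$ on every basis matrix and then compares $\mathbf A^T \otimes \mathbf I$ with $\mathbf I \otimes \mathbf A$.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))  # almost surely invertible, real for simplicity
A_inv = np.linalg.inv(A)

# vec(X Y Z) = (Z^T kron X) vec(Y), checked on each basis matrix e_i e_j^T
identity_holds = True
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        lhs = vec(A_inv @ E @ A)
        rhs = np.kron(A.T, A_inv) @ vec(E)
        identity_holds &= bool(np.allclose(lhs, rhs))

def kron_gap(A):
    """Largest entry of |A^T kron I - I kron A|; zero exactly when the two sides agree."""
    I = np.eye(A.shape[0])
    return float(np.max(np.abs(np.kron(A.T, I) - np.kron(I, A))))

gap_generic = kron_gap(A)               # positive: a generic A is not central
gap_scalar = kron_gap(2.5 * np.eye(n))  # zero: a scalar matrix is central
```

Note that the generic $\mathbf A$ here does not actually commute with every matrix, so the conjugation $\mathbf A^{-1}\mathbf e_i\mathbf e_j^T\mathbf A$ does not return $\mathbf e_i\mathbf e_j^T$; the loop only confirms the $\text{vec}$ identity itself, which holds for any invertible $\mathbf A$.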

*remark 1*  
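the above proof is missing one thing -- in order to form a basis for the $n^2$ dimensional space of n x n matrices we used the matrices $\mathbf e_i\mathbf e_j^T$, which are singular (for $n\geq 2$) and hence not in $GL_n(\mathbb F)$, while $\mathbf A$ is only assumed to commute with invertible matrices.  We can bootstrap the above to matrices in $GL_n(\mathbb F)$ by observing that  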

$\mathbf e_i\mathbf e_j^T \mathbf A=\mathbf A\mathbf e_i\mathbf e_j^T  $  
is equivalent to  

$\big(\mathbf I + \mathbf e_i\mathbf e_j^T\big) \mathbf A=\mathbf A\big(\mathbf I +\mathbf e_i\mathbf e_j^T \big) $  
and then subtracting $\mathbf A$ from both sides.  

This works over arbitrary fields, except when $j=i$ and we are working in a field of characteristic 2 (e.g. $\mathbb F_2$), where $\mathbf I + \mathbf e_i\mathbf e_i^T$ is singular.  But crucially the work we've already done holds verbatim over any field except possibly that of characteristic 2.  

The above bootstrapping can be viewed as a 'trick' of padding/incrementing the matrix we want with the identity matrix.  However, it can also be viewed as using elementary matrices of the first type (when $j\neq i$), which e.g. generate all of $SL_n(\mathbb F)$ so they are very natural to use here. The case of $j=i$ then corresponds to using elementary matrices of the third type, which generate the cosets of $SL_n(\mathbb F)$, but of course there are no (new) cosets of $SL_n(\mathbb F)$ when working in $\mathbb F_2$, as all determinants are necessarily 0 or 1 and the zeros are not allowed since we require a matrix to be invertible to be in $GL_n(\mathbb F)$.  There is hope for fields of characteristic two, though -- we can use elementary matrices of the second type (very simple permutation matrix) in combination with our earlier setup involving the identity matrix plus elementary matrices of the first type.     

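So to finalize the result in $\mathbb F_2$, where we still need the case $i=j$, take some $k\neq j$ and the permutation matrix $\mathbf P$ that consists of a single swap (transposition), so $\mathbf P\mathbf e_k = \mathbf e_j$, $\mathbf P\mathbf e_j = \mathbf e_k$, and $\mathbf P \mathbf e_r = \mathbf e_r$ when $k\neq r \neq j$.  Both $\mathbf P$ and $\mathbf P\big(\mathbf I + \mathbf e_k\mathbf e_j^T\big)$ are invertible, so $\mathbf A$ commutes with each of them and  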

$\mathbf P\mathbf A + \mathbf e_j \mathbf e_j^T \mathbf A= \mathbf P\mathbf A + \mathbf P\mathbf e_k \mathbf e_j^T \mathbf A=\Big(\mathbf P\big(\mathbf I + \mathbf e_k\mathbf e_j^T\big) \mathbf A\Big)=\mathbf A\Big(\mathbf P\big(\mathbf I +\mathbf e_k\mathbf e_j^T \big)\Big)=\mathbf A\mathbf P + \mathbf A\mathbf P\mathbf e_k \mathbf e_j^T = \mathbf P\mathbf A + \mathbf A\mathbf P\mathbf e_k \mathbf e_j^T = \mathbf P\mathbf A + \mathbf A\mathbf e_j \mathbf e_j^T $  

which of course simplifies to  
$\mathbf e_j \mathbf e_j^T \mathbf A =  \mathbf A\mathbf e_j \mathbf e_j^T$   

and gives the desired result, even over $\mathbb F_2$  
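
For a small case the claim over $\mathbb F_2$ can be confirmed by brute force.  The sketch below assumes NumPy and fixes $n=3$; the helper names `det3_mod2` and `commutes_mod2` and the choice $j=1, k=2$ (indices 0 and 1 in the code) are illustrative.  It enumerates $GL_3(\mathbb F_2)$, computes its center, and checks that $\mathbf I + \mathbf e_j\mathbf e_j^T$ is singular mod 2 while $\mathbf P\big(\mathbf I + \mathbf e_k\mathbf e_j^T\big)$ is invertible.

```python
import itertools
import numpy as np

def det3_mod2(M):
    """Determinant of a 3x3 integer matrix, reduced mod 2 (exact integer arithmetic)."""
    a, b, c = M[0]
    d, e, f = M[1]
    g, h, i = M[2]
    return int(a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)) % 2

def commutes_mod2(A, B):
    """True when AB = BA over F_2."""
    return np.array_equal((A @ B) % 2, (B @ A) % 2)

n = 3
all_mats = [np.array(bits, dtype=np.int64).reshape(n, n)
            for bits in itertools.product((0, 1), repeat=n * n)]
GL = [M for M in all_mats if det3_mod2(M) == 1]  # the group GL_3(F_2)

# center: elements commuting with every element of the group
center = [A for A in GL if all(commutes_mod2(A, B) for B in GL)]

I = np.eye(n, dtype=np.int64)
j, k = 0, 1
E_jj = np.zeros((n, n), dtype=np.int64)
E_jj[j, j] = 1
E_kj = np.zeros((n, n), dtype=np.int64)
E_kj[k, j] = 1
P = I.copy()
P[[j, k]] = P[[k, j]]  # transposition swapping e_j and e_k

singular_diag = det3_mod2((I + E_jj) % 2)         # I + e_j e_j^T is singular mod 2
invertible_fix = det3_mod2((P @ (I + E_kj)) % 2)  # P (I + e_k e_j^T) is invertible
swap_identity = np.array_equal(P @ E_kj, E_jj)    # P e_k e_j^T = e_j e_j^T
```

The center found this way consists of the identity alone, which is the only nonzero scalar matrix over $\mathbb F_2$.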


*remark 2*  
while the above is aimed at $\mathbf A \in GL_n(\mathbb F)$ and the invertibility of $\mathbf A$ makes the argument nice with respect to $\mathbf A$, we can just as well look at arbitrary matrices (not necessarily invertible ones) in $\mathbb F^\text{n x n}$ by pushing the argument through Kronecker products/sums.   

We could instead have directly considered the Sylvester Equation 

for $i,j \in \{1,2,...,n\}$  
$\mathbf A\mathbf e_i\mathbf e_j^T + \mathbf e_i\mathbf e_j^T \big(-\mathbf A\big) = \mathbf 0 \longrightarrow \big(\mathbf A \oplus - \mathbf A^T\big)\text{vec}\Big(\mathbf e_i\mathbf e_j^T \Big) = \mathbf 0$  
for $i,j \in \{1,2,...,n\}$  or  

$\big(\mathbf I\otimes \mathbf A\big) -\big(\mathbf A^T \otimes \mathbf I\big)=\big(\mathbf I\otimes \mathbf A\big) +\big(-\mathbf A^T \otimes \mathbf I\big) = \big(\mathbf A \oplus - \mathbf A^T\big)\mathbf I_{n^2} = \mathbf 0$  

so  
$\big(\mathbf I\otimes \mathbf A\big) =\big(\mathbf A^T \otimes \mathbf I\big) $  
and the argument finishes as before   
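
The Kronecker sum formulation can also be sketched in code.  This assumes NumPy and the column-stacking $\text{vec}$; the helper names `vec` and `commutator_operator` are hypothetical.  It builds $\mathbf K = \big(\mathbf I\otimes \mathbf A\big) -\big(\mathbf A^T \otimes \mathbf I\big)$, checks that $\mathbf K\,\text{vec}(\mathbf B) = \text{vec}(\mathbf A\mathbf B - \mathbf B\mathbf A)$, and compares the rank of $\mathbf K$ for a generic $\mathbf A$ with that for a scalar matrix.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order="F")

def commutator_operator(A):
    """Matrix K with K @ vec(B) == vec(A B - B A)."""
    I = np.eye(A.shape[0])
    return np.kron(I, A) - np.kron(A.T, I)

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))  # not assumed invertible
B = rng.standard_normal((n, n))

K = commutator_operator(A)
matches_commutator = bool(np.allclose(K @ vec(B), vec(A @ B - B @ A)))

# K = 0 is exactly the statement that A commutes with every matrix
rank_generic = int(np.linalg.matrix_rank(K))
rank_scalar = int(np.linalg.matrix_rank(commutator_operator(3.0 * np.eye(n))))
```

Only the scalar matrix gives rank 0.  For a generic $\mathbf A$ the kernel of $\mathbf K$ is the set of matrices commuting with $\mathbf A$, which is a proper subspace, so the rank is positive.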

