This is a nice way to 'eyeball' the range that eigenvalues can be in.  It also has a nice picture associated with it.  Finally, it leads to an immediate result about diagonally dominant matrices, and with it a clean way to prove that certain symmetric (or in complex space, Hermitian) matrices are positive (semi) definite.

Indeed we can use these Gerschgorin discs to make especially strong claims about matrices with special structure.  

At first this post was going to be my adaptation of the proof from Kuttler's "Linear Algebra" (page 177).  However I ultimately decided that a more indirect route -- via Levy-Desplanques -- was a lot more intuitive.  


While I generally prefer proofs other than contradictions, this proof of Levy-Desplanques is elementary, short and sweet and immediately leads to Gerschgorin discs in a very intuitive way.

The first part of this is, essentially, a direct lift from 
https://shreevatsa.wordpress.com/2007/10/03/a-nice-theorem-and-trying-to-invert-a-function/ .  However, I added extra steps in here to make the derivation a bit more deliberate.  

We say that $\mathbf A \in \mathbb C^{n \times n}$ is (strictly) diagonally dominant if for every row i, $\big \vert a_{i,i}\big \vert \gt \sum_{j \neq i}\big \vert a_{i,j}\big \vert$.

**claim:** if (square) matrix $\mathbf A$ is diagonally dominant, then $\det(\mathbf A) \neq 0$

The contradiction comes from assuming $\det(\mathbf A) =0$, i.e. that $\mathbf A$ is not invertible.  If this is the case, there must be some $\mathbf x \neq \mathbf 0$ where $\mathbf A \mathbf x = \mathbf 0$.  

Consider the maximal coordinate of $\mathbf x$, $x_k$, where $\big \vert x_k \big \vert \geq \big \vert x_j\big \vert$ for all $j \in \{1, 2, ..., n\}$

since $\mathbf x \neq \mathbf 0$, we have $x_k \neq 0$, and we can say that $\big \vert x_k \big \vert \gt 0$

Now look at the kth row of $\mathbf {Ax}$.  We have 

$a_{k, 1} x_1  + a_{k, 2} x_2 + ... + a_{k,n} x_n = \sum_{j=1}^{n} a_{k,j} x_j = a_{k,k}x_k + \sum_{j \neq k} a_{k,j} x_j$

now recall that we are considering the case where $\mathbf {Ax} = \mathbf 0$ for some non-zero $\mathbf x$, hence

$a_{k,k}x_k + \sum_{j \neq k} a_{k,j} x_j = 0$

or 

$a_{k,k}x_k = - \sum_{j \neq k} a_{k,j} x_j $

now take the magnitude of both sides

$\big \vert a_{k,k}x_k \big \vert = \big \vert a_{k,k}\big \vert \big \vert x_k \big \vert = \big \vert \sum_{j \neq k} a_{k,j} x_j \big \vert \leq  \sum_{j \neq k} \big \vert a_{k,j}\big \vert \big \vert x_j \big \vert \leq \sum_{j \neq k} \big \vert a_{k,j}\big \vert \big \vert x_k \big \vert = \big \vert x_k \big \vert\big( \sum_{j \neq k} \big \vert a_{k,j}\big \vert \big)$


From here the contradiction is apparent, but we can further distill this to:

$\big \vert a_{k,k}\big \vert \big \vert x_k \big \vert \leq \big \vert x_k \big \vert\big( \sum_{j \neq k} \big \vert a_{k,j}\big \vert \big)$

thus, because $\big \vert x_k \big \vert \gt 0$, we can divide it out of the above and since it is positive, the inequality sign does not change.  

$\big \vert a_{k,k}\big \vert \leq \sum_{j \neq k} \big \vert a_{k,j}\big \vert$

yet this contradicts our (strict) definition of diagonal dominance, because we said our matrix satisfied:

$\big \vert a_{k,k}\big \vert \gt \sum_{j \neq k}\big \vert a_{k,j}\big \vert$

hence we know $\mathbf A \mathbf x = \mathbf 0$ if and only if $\mathbf x = \mathbf 0$, thus $\det\big(\mathbf A\big) \neq 0$.
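
As a quick numerical sanity check of the claim, here is a minimal sketch assuming NumPy.  The helper name `is_strictly_diagonally_dominant` and the example matrix are made up for illustration; they are not from the references above.

```python
import numpy as np

def is_strictly_diagonally_dominant(A):
    """Return True if |a_ii| > sum_{j != i} |a_ij| holds for every row i."""
    A = np.asarray(A)
    diag = np.abs(np.diag(A))
    # off-diagonal absolute row sums
    radii = np.abs(A).sum(axis=1) - diag
    return bool(np.all(diag > radii))

# hypothetical example matrix (complex entries are allowed)
A = np.array([[4.0, 1.0, -2.0],
              [1.0j, 3.0, 1.0],
              [0.5, -0.5, -2.0]])

dominant = is_strictly_diagonally_dominant(A)
det_A = np.linalg.det(A)
print(dominant, abs(det_A))   # dominance holds, so the determinant is nonzero
```

Of course a floating point determinant is only evidence, not a proof; the point is just to see the hypothesis and the conclusion side by side.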

# Now the good part:

for each eigenvalue $\lambda$ of $\mathbf A$, we can say that when we consider the matrix

$\big(\mathbf A - \lambda \mathbf I\big)$, it is not invertible, so per Levy-Desplanques  there must be some diagonal entry on $\mathbf A$ where 

$\big \vert a_{i,i} - \lambda \big \vert \ngtr \sum_{j\neq i} \big \vert a_{i,j}\big \vert$

we can restate this as:

$\big \vert a_{i,i} - \lambda \big \vert \leq \sum_{j\neq i} \big \vert a_{i,j}\big \vert$

and this is the Gerschgorin disc formula.  To be clear, the above does not tell us which diagonal entry gives us the range for a given eigenvalue -- it just tells us that any given eigenvalue must be located in a disc associated with one of these diagonal entries.  

It is perhaps worth noting that we can also apply this to the conjugate transpose of $\mathbf A$ -- hence we could instead interpret this formula in terms of the columns of $\mathbf A$.  
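
To make the disc formula concrete, the sketch below (assuming NumPy; `gerschgorin_discs` is a hypothetical helper name and the random test matrix is arbitrary) computes the row discs, checks that every eigenvalue lands in at least one of them, and then repeats the check with the column discs coming from the conjugate transpose.

```python
import numpy as np

def gerschgorin_discs(A):
    """Return (centers, radii): center a_ii and radius sum_{j != i} |a_ij| per row."""
    A = np.asarray(A, dtype=complex)
    centers = np.diag(A)
    radii = np.abs(A).sum(axis=1) - np.abs(centers)
    return centers, radii

rng = np.random.default_rng(0)          # fixed seed, arbitrary choice
A = rng.normal(size=(5, 5)) + 1j * rng.normal(size=(5, 5))

centers, radii = gerschgorin_discs(A)
eigenvalues = np.linalg.eigvals(A)

# every eigenvalue should sit inside at least one row disc
tol = 1e-9
in_some_disc = [bool(np.any(np.abs(lam - centers) <= radii + tol))
                for lam in eigenvalues]

# same check with the discs of the conjugate transpose (column sums of A);
# the eigenvalues of the conjugate transpose are the conjugates of those of A
c_centers, c_radii = gerschgorin_discs(A.conj().T)
in_some_col_disc = [bool(np.any(np.abs(np.conj(lam) - c_centers) <= c_radii + tol))
                    for lam in eigenvalues]

print(in_some_disc, in_some_col_disc)
```

Note that the check is "some disc", not "the disc of the matching row", which is exactly the caveat stated above.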

# Why might this be useful?

Gerschgorin discs are a very nice tool for identifying things like whether a Hermitian --Symmetric, in reals-- matrix is positive (semi) definite.  To be clear, the test associated with the discs will generally tell us yes it is positive semi-definite or it will say unclear.  In some cases it may allow us to reject the hypothesis as well, but we already have other tools at our disposal, like -- (a) are any of the diagonal entries negative, (b) if any of the diagonal entries are zero, then you need the entire column and row associated with that diagonal to be zero, and (c) we also have the inequality $N \cdot trace\big(\mathbf A \big) \geq sum \big(\mathbf{A}\big)$ -- proved over reals in the posting "SPDSP_Trace_Inequality" which must be true for a Hermitian or Symmetric matrix that is positive semi-definite.  Notably all of these other tools allow us to reject whether a matrix is positive semi-definite -- they don't allow us to confirm that it is.  However, in certain cases, Gerschgorin discs do allow us to confirm this-- the calculation involved is quite simple, and their proof is quite simple as well.   
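
Here is a minimal sketch of the "yes or unclear" test described above, assuming NumPy.  The function name `gerschgorin_psd_certificate` and both example matrices are hypothetical.  For a Hermitian matrix the eigenvalues are real, so it is enough that every interval $[a_{i,i} - r_i, a_{i,i} + r_i]$ sits in the non-negative reals.

```python
import numpy as np

def gerschgorin_psd_certificate(A, tol=1e-12):
    """Return True if the discs show that a Hermitian A is positive semi-definite.

    True  -> every disc lies in the non-negative half line, so A is PSD.
    False -> the test is inconclusive (A may or may not be PSD).
    """
    A = np.asarray(A)
    if not np.allclose(A, A.conj().T):
        raise ValueError("expected a Hermitian (or real symmetric) matrix")
    diag = np.real(np.diag(A))
    radii = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
    return bool(np.all(diag - radii >= -tol))

certified = np.array([[2.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 2.0]])
unclear = np.array([[1.0, 0.9, 0.9],
                    [0.9, 1.0, 0.9],
                    [0.9, 0.9, 1.0]])   # PSD, but the discs cannot show it

print(gerschgorin_psd_certificate(certified))   # discs certify PSD
print(gerschgorin_psd_certificate(unclear))     # inconclusive
print(np.linalg.eigvalsh(unclear).min())        # yet the smallest eigenvalue is positive
```

The second matrix is the reason the answer is "unclear" rather than "no": its discs reach down to $-0.8$ even though all of its eigenvalues are positive.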

Now, if we are doing something like a second derivative test, using a Hessian Matrix, we would in fact just use numeric values at or around a critical point.  There are times, however where we may want to evaluate our Hessian symbolically over a large range of values.  

for example consider the function, f, where 

$f(x,y,z) = x^{2} y^{2} z^{4} + z^{2}$

which has the following Hessian:

$\left[\begin{matrix}2 y^{2} z^{4} & 4 x y z^{4} & 8 x y^{2} z^{3}\\4 x y z^{4} & 2 x^{2} z^{4} & 8 x^{2} y z^{3}\\8 x y^{2} z^{3} & 8 x^{2} y z^{3} & 12 x^{2} y^{2} z^{2} + 2\end{matrix}\right]$
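
To get a feel for what the discs say about this Hessian at particular points, here is a small numerical sketch (assuming NumPy; `hessian_f` and `disc_lower_bound` are hypothetical helper names, and the two evaluation points are arbitrary choices of mine).  The entries are copied from the matrix above.

```python
import numpy as np

def hessian_f(x, y, z):
    """Hessian of f(x, y, z) = x^2 y^2 z^4 + z^2, entries copied from the matrix above."""
    return np.array([
        [2 * y**2 * z**4,     4 * x * y * z**4,    8 * x * y**2 * z**3],
        [4 * x * y * z**4,    2 * x**2 * z**4,     8 * x**2 * y * z**3],
        [8 * x * y**2 * z**3, 8 * x**2 * y * z**3, 12 * x**2 * y**2 * z**2 + 2],
    ], dtype=float)

def disc_lower_bound(H):
    """Smallest left endpoint a_ii - r_i over the Gerschgorin discs of symmetric H."""
    diag = np.diag(H)
    radii = np.abs(H).sum(axis=1) - np.abs(diag)
    return float(np.min(diag - radii))

H_origin = hessian_f(0.0, 0.0, 0.0)   # diagonal matrix, the discs are the points 0, 0 and 2
H_ones = hessian_f(1.0, 1.0, 1.0)     # first disc is centered at 2 with radius 12

print(disc_lower_bound(H_origin))          # non-negative, so PSD is certified here
print(disc_lower_bound(H_ones))            # negative, so the disc test is inconclusive
print(np.linalg.eigvalsh(H_ones).min())    # and in fact this Hessian has a negative eigenvalue
```

At the origin the discs certify positive semi-definiteness; at $(1,1,1)$ they say "unclear", and the leading $2 \times 2$ block already shows the Hessian is indefinite there.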




**Taussky's Refinement**  
reference pages 59 - 61 of Brualdi's *The Mutually Beneficial Relationship of Graphs and Matrices*  

A *weakly* diagonally dominant square matrix is one where   
$\big \vert a_{i,i} \big\vert \geq \sum_{j\neq i} \big \vert a_{i,j}\big \vert = r_i$  (i.e. radius on row i)  
with the inequality strict for at least one row $i$.  


**claim:**  
if $\mathbf A$ is irreducible and weakly diagonally dominant, then  
$\det\big(\mathbf A\big) \neq 0$  
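
Before the proof, a small numerical illustration of the claim (a sketch assuming NumPy; the helper names and both example matrices are hypothetical).  Irreducibility is tested by checking that $(\mathbf I + \text{pattern})^{n-1}$ is entrywise positive, where the pattern marks the nonzero entries of $\mathbf A$.

```python
import numpy as np

def is_irreducible(A):
    """Strong connectivity of the directed graph with an edge i -> j when a_ij != 0."""
    n = A.shape[0]
    pattern = (np.abs(A) > 0).astype(float)
    # (I + pattern)^(n-1) is entrywise positive iff every node reaches every other node
    reach = np.linalg.matrix_power(np.eye(n) + pattern, n - 1)
    return bool(np.all(reach > 0))

def is_weakly_diagonally_dominant(A):
    """|a_ii| >= r_i in every row, with strict inequality in at least one row."""
    diag = np.abs(np.diag(A))
    radii = np.abs(A).sum(axis=1) - diag
    return bool(np.all(diag >= radii) and np.any(diag > radii))

# second-difference matrix: only the first and last rows are strictly dominant
A = np.array([[2.0, -1.0, 0.0, 0.0],
              [-1.0, 2.0, -1.0, 0.0],
              [0.0, -1.0, 2.0, -1.0],
              [0.0, 0.0, -1.0, 2.0]])

# weakly dominant but reducible: the claim does not apply, and B is singular
B = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

print(is_irreducible(A), is_weakly_diagonally_dominant(A), np.linalg.det(A))
print(is_irreducible(B), is_weakly_diagonally_dominant(B), np.linalg.det(B))
```

The second matrix shows why irreducibility cannot simply be dropped from the hypothesis.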

*your author's approach:*  
remark: being irreducible implies that there are no zero rows, which implies each $a_{i,i} \neq 0$  

For this proof, *we can assume WLOG that each $a_{i,i} = -1$*  

- - - -  
Why? If this isn't the case we can consider 

$\mathbf A = \mathbf {DZ}$  

where $\mathbf D$ is a normalizing diagonal matrix such that $z_{i,i} = -1$.  As such $\det \big( \mathbf D\big) \neq 0$, so we need to determine whether $\det\big(\mathbf Z\big) = 0$ to determine the singularity of $\mathbf A$.  
- - - -  
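
The normalization is easy to see numerically.  In this sketch (assuming NumPy, with a made-up example matrix) I take $d_{i,i} = -a_{i,i}$, which is one choice of $\mathbf D$ that gives $z_{i,i} = -1$.

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])       # hypothetical example with nonzero diagonal

D = np.diag(-np.diag(A))               # d_ii = -a_ii, so D is invertible
Z = np.linalg.solve(D, A)              # Z = D^{-1} A, hence A = D Z

det_A = np.linalg.det(A)
det_product = np.linalg.det(D) * np.linalg.det(Z)

print(np.diag(Z))                      # each diagonal entry of Z is -1
print(det_A, det_product)              # det(A) = det(D) det(Z)
```

Since each row of $\mathbf A$ is just rescaled, the ratio between a diagonal entry and its off-diagonal row sum is unchanged, so $\mathbf Z$ is weakly diagonally dominant exactly when $\mathbf A$ is.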

Hence if $\det\big(\mathbf A\big) = 0$ then there is some $\mathbf x \neq \mathbf 0$ such that 
$\mathbf {Ax} = \mathbf 0$  

Equivalently if $\mathbf A$ is singular, then 

$\big(\mathbf A + \mathbf I\big)\mathbf x = \mathbf I \mathbf x = \mathbf x$  

for some $\mathbf x \neq \mathbf 0$ 

so we define 

$\mathbf B := \big(\mathbf A + \mathbf I\big)$  

Now, by repeated application of triangle inequality, we know  

$\big \vert \mathbf x \big \vert =  \big \vert \big(\mathbf B \mathbf x\big) \big \vert \leq  \big(\big \vert\mathbf B\big \vert \cdot \big \vert \mathbf x\big \vert\big) $  

where the magnitude / absolute value is understood to be applied component wise and the inequality is evaluated component-wise.  (Notationally this is similar to that used in Brualdi and elsewhere in discussions of Perron-Frobenius Theory).  

equivalently, if we look at the scalars in row $i$, this reads  
$\big \vert x_i\big \vert = \big \vert \sum_{j=1}^n b_{i,j}x_j\big \vert \leq \sum_{j=1}^n \big \vert b_{i,j}x_j\big \vert = \sum_{j=1}^n \big \vert b_{i,j}\big \vert \cdot \big \vert x_j\big \vert$  

and by further application of triangle inequality, we have 

$\big \vert \mathbf x \big \vert = \big \vert\big(\mathbf B^2 \mathbf x\big) \big \vert \leq \big \vert\mathbf B\big \vert \cdot \big \vert \big(\mathbf B \mathbf x\big) \big \vert \leq  \big(\big \vert\mathbf B\big \vert \big \vert\mathbf B\big \vert \cdot \big \vert \mathbf x\big \vert\big) = \big \vert\mathbf B\big \vert^2 \cdot \big \vert \mathbf x\big \vert $  

and by induction we have  

$\big \vert \mathbf x \big \vert = \big \vert\big(\mathbf B^k \mathbf x\big)\big \vert \leq \big \vert\mathbf B\big \vert^k \cdot \big \vert \mathbf x\big \vert = \mathbf P^k \cdot \big \vert \mathbf x\big \vert$  

for all natural numbers $k$, where $\mathbf P :=  \big \vert\mathbf B\big \vert$  

Based on our construction of weak diagonal dominance and magnitude one on the diagonal, we see that $\mathbf P$ is a substochastic matrix associated with an irreducible Markov chain.  To bring the point home, we can embed $\mathbf P$ in an absorbing chain, where we've inserted a state 0 as the absorbing state

$\mathbf M = \begin{bmatrix} 1 & 0\\ * & \mathbf P\\ \end{bmatrix}$


where at least one starred component is positive so that the matrix is row stochastic $\mathbf {M1} = \mathbf 1$.  That means that at least one state in $\mathbf P$ communicates with the absorbing state -- so said state is transient.  But since the underlying graph in $\mathbf P$ is irreducible, and transience is a class property, *all states are transient*.  We also, of course can multiply this in blocked form:  

$\mathbf M^k = \begin{bmatrix} 1 & 0\\ * & \mathbf P^k\\ \end{bmatrix}$

The end result, is that we have   

$\big \vert \mathbf x \big \vert = \lim_{k\to \infty} \big \vert \mathbf x \big \vert \leq \lim_{k\to \infty}  \mathbf P^k \cdot \big \vert \mathbf x\big \vert = \mathbf 0$ 

or, if the reader prefers, with 

$\mathbf v := \begin{bmatrix} 
0\\
\big \vert \mathbf x\big \vert
\end{bmatrix}$  


$\mathbf v = \lim_{k\to \infty} \mathbf v \leq \lim_{k\to \infty}  \mathbf M^k \mathbf v = \mathbf 0$ 

which is a contradiction.  
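
The absorbing chain picture can be checked numerically.  The sketch below assumes NumPy and uses a made-up irreducible substochastic $\mathbf P$ in which only the first row leaks probability; `leak` plays the role of the starred column.

```python
import numpy as np

# hypothetical irreducible substochastic P: the first row sums to 0.9, the others to 1
P = np.array([[0.0, 0.5, 0.4],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
leak = 1.0 - P.sum(axis=1)             # the starred column: one-step absorption probabilities

# state 0 is absorbing, the remaining states follow P
M = np.block([[np.ones((1, 1)), np.zeros((1, 3))],
              [leak.reshape(-1, 1), P]])

k = 1000
Mk = np.linalg.matrix_power(M, k)
Pk = np.linalg.matrix_power(P, k)

print(np.allclose(Mk[1:, 1:], Pk))     # lower-right block of M^k is P^k
print(Pk.max())                        # entries of P^k shrink toward 0
print(Mk[:, 0])                        # absorption probabilities approach 1
```

Only one state leaks directly, yet every entry of $\mathbf P^k$ decays, which is the "transience is a class property" step in action.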

*note:* this can also be done with the idea just before the experimental section at the end, i.e. consider $n$ iterations and apply the union bound to show a contradiction if an eigenvalue of 1 exists for the transient states.  Then again, I maybe should cut this section.  

basic idea is something like   

$\mathbf v^T \mathbf M^k \mathbf 1 = \mathbf v^T \mathbf 1 = \big \Vert \mathbf v \big \Vert_1 = 1$  

and by the same idea  

$\mathbf S:= \frac{1}{n+1}\sum_{k=1}^{n+1} \mathbf M^k$  
$\mathbf v^T\big(\mathbf S\big) \mathbf 1 = \mathbf v^T \mathbf 1 = \big \Vert \mathbf v \big \Vert_1 = 1$  

but we know that 

$\mathbf v^T \mathbf S \mathbf e_0 \gt 0$  
(communicating from any state i to transient state connected to absorbing state in at most n steps, then one more to absorbing state with positive probability...) and hence $\sum_{j=1}^n v_j \lt 1$  

Note: this idea maybe does not work... a little more thought needed I guess  





i.e. by chopping through this argument:   

$\mathbf M = \begin{bmatrix} 1 & 0\\ * &\gamma\mathbf A\\ \end{bmatrix}$  
$\gamma^{-1} = \big(\text{maximal row sum of }\mathbf A\big)$  
i.e. so $\gamma \mathbf A$ has a maximal row sum of 1.  

since $\mathbf A$ is not already stochastic, but is real non-negative with a connected graph and maximal modulus eigenvalue of one, we know $\gamma^{-1} \geq 1$ i.e. $\gamma \in (0,1]$.  As we'll see it in fact is the case that $\gamma \in (0,1)$  

As before the $*$ cells are real non-negative such that $\mathbf M\mathbf 1 = \mathbf 1$  

and as before we have the blocked structure  
$\mathbf M^k = \begin{bmatrix} 1 & 0\\ * & \mathbf P^k\\ \end{bmatrix}$  
where now $\mathbf P := \gamma \mathbf A$  


however, supposing for a contradiction that $v_j = 0$ (i.e. $\mathbf v$ isn't strictly positive)  
we now select our real non-negative eigenvector $\mathbf v$ where $\big \Vert \mathbf v \big \Vert_1=1$ and run our markov chain, starting in this 'decaying steady state' 

it is immediate that  

$\mathbf v^T \mathbf M^k = \begin{bmatrix}  1 - \lambda^k \mathbf 1^T \mathbf v\\ \lambda^k \mathbf v\\  \end{bmatrix}^T $  

and if we examine $k \in \{1,2,..., n, n+1\}$  
where the event $A_k$ indicates that given our starting position (any fixed arbitrary state $i$ in the graph for $\mathbf P$), we visit state $j$ on step $k$, we have 

$0 \lt Pr\big\{A_1 \bigcup A_2  \bigcup ... \bigcup A_{n} \bigcup A_{n+1}\big\} \leq Pr\big(A_1\big) + Pr\big(A_2\big) +  ... + Pr\big(A_{n}\big)  + Pr\big(A_{n+1}\big) = 0 $   
where the LHS is positive by the fact that the graph is connected, and hence a positive path $i \to j$ exists in at most $n$ iterations for any starting state $i$, and the RHS is the union bound (or markov inequality -- i.e. if we start in this decaying steady state the expected visits to $j$ are 0 which contradicts the positive probability of visiting $j$).

*note: this needs to be adapted to the case at hand here...*  

- - - - -  
the above are standard Markov chain results.  Based on zero or one laws, if a node in a communicating class is transient, then all in that class are transient.  Since a renewal does not occur with probability one, this implies that the expected number of visits to said state is finite, which means that the number of visits after time $t$ tends to zero by selecting large enough $t$ (either via Borel-Cantelli or Markov Inequality).  Thus all diagonal components of $\mathbf P^k$ tend to zero. For avoidance of doubt, this *also* implies that the off diagonal components tend to zero.  Consider:  

for $j, i \geq 1$  

$\sum_{k=1}^\infty P_{i,j}^{(k)} = E\Big[N\Big] = E\Big[E\big[N\big \vert \text{visit state j once}\big]\Big] = 0 + p \cdot \big(1 + \sum_{k=1}^\infty P_{j,j}^{(k)}\big) \leq 1 + \sum_{k=1}^\infty P_{j,j}^{(k)} \lt \infty$  

with $p$ being the total probability of ever reaching state $j$ from state $i$.  Being probability we know $p \in [0,1]$.  

this implies that 
$P_{i,j}^{(k)} \to 0$ 
as $k$ grows large, since the terms of a convergent series must tend to zero (or again by Borel-Cantelli).  Equivalently, this is a delayed defective renewal process, where the 'real renewal' occurs at state $j$ but the transition probability to there from $i$ may be defective, and we *know* that the renewal process from $j\to j$ is defective.   
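
The finiteness of the expected number of visits can be seen numerically.  This sketch assumes NumPy and a made-up substochastic $\mathbf P$; the closed form $\sum_{k \geq 1} \mathbf P^k = (\mathbf I - \mathbf P)^{-1} - \mathbf I$ is the standard Neumann series, valid here because $\mathbf P^k \to \mathbf 0$.

```python
import numpy as np

# hypothetical irreducible substochastic matrix (row 0 leaks probability 0.1)
P = np.array([[0.0, 0.5, 0.4],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# partial sums of sum_{k>=1} P^k: entry (i, j) is the expected number of
# visits to j (after time 0) when starting from i
partial = np.zeros_like(P)
power = np.eye(3)
for _ in range(2000):
    power = power @ P
    partial += power

# closed form via the Neumann series
closed_form = np.linalg.inv(np.eye(3) - P) - np.eye(3)

print(np.allclose(partial, closed_form))
print(closed_form)                     # finite expected visit counts
```

Every entry is finite, so the terms $P_{i,j}^{(k)}$ of each series have no choice but to tend to zero.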

*remark:*  
Some of the ideas here closely follow pages 400 - 402 of Feller Vol 1 (3rd edition).  

*another finish -- via extremal characterization:*  
For an irreducible chain (i.e. underlying graph is connected) that is weakly diagonally dominant, and again for our matrix $\mathbf A$ with **diagonal components that we assume WLOG are** $-1$, we can use an extremal argument to prove that $\mathbf A$ is non-singular.  

It is enough to prove that $\mathbf {Av} = \mathbf 0$ has a unique solution (equivalently, only the zero vector is in the nullspace of $\mathbf A$).  

so we know  
$\mathbf {A0} = \mathbf 0$  
and aim to prove this is the only possible solution.  Suppose for a contradiction that some $\mathbf v \neq \mathbf 0$ exists such that $\mathbf {Av} = \mathbf 0$  

as such  
$\max_{i} \big \vert v_i \big \vert = \alpha \gt 0$  

now partition into two cases.  First the case where the max value occurs in a row where 
$\big \vert a_{i,i} \big\vert \gt \sum_{j\neq i} \big \vert a_{i,j}\big \vert = r_i$  

for reasons that will become clear, we'll refer to this as a sub-convex combination and the case where 
$\big \vert a_{i,i} \big\vert = \sum_{j\neq i} \big \vert a_{i,j}\big \vert = r_i$  
which we refer to as a convex combination  

In either case for this maximal row $i$  

$-1\cdot v_i + \sum_{j\neq i} a_{i,j} v_j = a_{i,i} v_i + \sum_{j\neq i} a_{i,j} v_j = 0$  
$v_i = \sum_{j\neq i} a_{i,j} v_j$  
taking the magnitude of each side  
$\alpha $  
$= \big \vert v_i\big \vert $  
$ = \big \vert \sum_{j\neq i} a_{i,j} v_j\big \vert$  
$\leq  \sum_{j\neq i}\big \vert a_{i,j} v_j\big \vert$  
$= \sum_{j\neq i}\big \vert a_{i,j}\big \vert \big \vert  v_j\big \vert$  
$\leq  \sum_{j\neq i} \big \vert   a_{i,j}\big \vert\alpha$  
$= \alpha \cdot \sum_{j\neq i}\big \vert  a_{i,j}\big \vert$  
$\leq \alpha $  

where the inequalities follow by triangle inequality, the fact that $\big \vert v_j \big \vert \leq \alpha$ for all $j$, and the fact that each diagonal element is magnitude 1 but the rows are weakly diagonally dominant -- i.e. each row has a magnitude sum of at most one.  

In the case of a "sub-convex" combination this last inequality is strict, i.e. with some $p \in (0,1)$  
giving us   

$\alpha \leq p \cdot \alpha \lt \alpha$  
which is impossible for any $\alpha \gt 0$.  

In the other case where the maximal row $i$ has a proper convex combination this last inequality need not be strict.  However the equality conditions are achievable *iff* each $\big \vert v_j\big \vert = \alpha$.  That is each node $j$ that node $i$ is directly connected to must have a maximal magnitude value such that $\big \vert v_j\big \vert = \alpha$.  But the underlying graph is connected.  This means that at least one node in the "convex combination" class communicates with at least one node in that "sub-convex combination" class, which implies that for said node $k$ in the subconvex combination class, $\big \vert v_k\big \vert = \alpha \leq p \cdot \alpha \lt \alpha$, which is a contradiction for any $\alpha \gt 0$.  

Thus  
$\max_{i} \big \vert v_i \big \vert = 0 \longrightarrow \mathbf v= \mathbf 0$  after all,  
so we conclude that $\mathbf A$ is nonsingular    


*remark*  
with a small bit of insight this gives a proof of standard Perron Frobenius results related to connectivity of graphs and dominant eigenvalues, at least in the case of (finite state, time homogeneous) Markov chains.  

I.e. if we have a row stochastic matrix  then we know 

$\mathbf A \mathbf 1 = \mathbf 1$ 
so there is at least multiplicity of one for eigenvalue of one.  It may be convenient to consider a lazy chain of the form $\frac{1}{2}\big(\mathbf A + \mathbf I\big)$   

For convenience, we can ignore the class of transient states (which as a class are substochastic...) and consider that

$\text{disconnected recurrent classes} \longrightarrow \text{eigenvalue of 1 has geometric multiplicity }\geq 2$  

if the recurrent classes are disconnected then the transition matrix may be (up to graph isomorphism) given by 

$\mathbf A =  \begin{bmatrix} \mathbf A_k^{(1)} & \mathbf 0\\ \mathbf 0 & \mathbf A_{n-k}^{(2)}\\ \end{bmatrix}$  

where the diagonal blocks are stochastic matrices of sizes $k \times k$ and $(n-k) \times (n-k)$.  
We then have  
$\mathbf v_1 =  \begin{bmatrix} \mathbf 1_k \\ \mathbf 0_{n-k} \end{bmatrix}$  
and  
$\mathbf v_2 =  \begin{bmatrix} \mathbf 0_k \\ \mathbf 1_{n-k} \end{bmatrix}$  

and   
$\mathbf A \mathbf v_1 = \mathbf v_1$  
$\mathbf A \mathbf v_2 = \mathbf v_2$  

which proves algebraic and geometric multiplicity of at least 2 for eigenvalue 1.  The other leg is given by what we just worked through, i.e. 
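
A small numerical instance of the disconnected case (a sketch assuming NumPy; the two blocks are made-up irreducible stochastic matrices):

```python
import numpy as np

# two hypothetical recurrent classes that do not communicate
A1 = np.array([[0.5, 0.5],
               [0.2, 0.8]])
A2 = np.array([[0.1, 0.9, 0.0],
               [0.0, 0.5, 0.5],
               [0.3, 0.0, 0.7]])
A = np.block([[A1, np.zeros((2, 3))],
              [np.zeros((3, 2)), A2]])

v1 = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
v2 = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

print(np.allclose(A @ v1, v1), np.allclose(A @ v2, v2))

# geometric multiplicity of eigenvalue 1 = dimension of the nullspace of (A - I)
geometric_multiplicity = 5 - np.linalg.matrix_rank(A - np.eye(5))
print(geometric_multiplicity)
```

Each block contributes its own copy of the ones vector, so the nullspace of $\big(\mathbf A - \mathbf I\big)$ is two dimensional here.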

$\text{single connected recurrent class} \longrightarrow \text{eigenvalue of 1 has geometric multiplicity }= 1$  

argue by contradiction that we find geometric multiplicity of (at least) 2 for eigenvalue 1 for a connected graph.  Using the machinery of the preceding proof, we have a single communicating class and have distinct  maximal and minimal components $v_i \gt v_j$ of some steady state vector $\mathbf v$  but $v_i$ is a convex combination of its directly connected neighbors, and perhaps more to the point if we consider the 'time averaged' matrix 

$\frac{1}{n}\mathbf S^{(n)} := \frac{1}{n}\big(\mathbf I + \mathbf A + \mathbf A^2 + ... + \mathbf A^{n-1}\big)$, then    

(the reader should verify that the above is still row stochastic, and if $\mathbf A\mathbf v = \mathbf v$ then $\frac{1}{n}\mathbf S^{(n)}\mathbf v = \mathbf v$)  

But this means that $\mathbf S^{(n)}$ has paths from $i$ to *all* other nodes hence applying $\frac{1}{n}\mathbf S^{(n)}\mathbf v = \mathbf v$ tells us that $v_i$ is written as a convex combination with strictly positive weights attached to each node's payoff, including $v_j \lt v_i$, so $v_i$ isn't a maximum after all.  And this holds for any $i$, hence for a connected aperiodic Markov chain, the only possible right eigenvector with eigenvalue one is (up to scaling) the ones vector.  
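
To see the time averaged matrix do its job, here is a sketch assuming NumPy.  I picked a deterministic 3-cycle as the example (my choice, not from the text): $\mathbf A$ itself has many zero entries, but the average over $n$ steps is strictly positive.

```python
import numpy as np

# hypothetical connected chain: a deterministic 3-cycle
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
n = A.shape[0]

# time average (1/n)(I + A + ... + A^(n-1))
S_avg = sum(np.linalg.matrix_power(A, k) for k in range(n)) / n

print(S_avg)
print(np.allclose(S_avg.sum(axis=1), 1.0), bool(np.all(S_avg > 0)))

# a fixed vector of A is also fixed by the time average
v = np.ones(n)
print(np.allclose(S_avg @ v, v))
```

Every row of the average puts positive weight on every state, which is what forces a maximal component $v_i$ to equal all the other components.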

Of course if we had a real symmetric matrix, or a reversible Markov chain, we could stop here.  However for general Markov chains we'd need to deal with the possibility of a defective matrix and in particular the possibility that the algebraic multiplicity of 1 is strictly greater than the geometric multiplicity.   

$ \text{eigenvalue of 1 has geometric multiplicity }= 1 \longrightarrow  \text{eigenvalue of 1 has algebraic multiplicity }= 1 $  

The algebraic finish to prove $\lambda = 1$ is a simple root (i.e. algebraic multiplicity of 1) then would be to consider Jordan Form implications (see 'fun with trace').  A possible probabilistic finish would be to consider a (delayed) renewal rewards process and see that the time averaged reward (where reward of 1 is given) for visiting state $j$ is $\frac{1}{\bar{X_j}}$ but the union of those rewards consists of disjoint events, hence $\sum_{j=1}^n\frac{1}{\bar{X_j}} = 1$, which tells us that  
$\big \vert 1- \frac{1}{r}\sum_{k=1}^r\text{trace} \big(\mathbf A^k\big)  \big \vert \lt \epsilon$ for any $\epsilon \gt 0$ by selecting $r\geq R$ 

and in particular it tells us that the time averaged trace is strictly less than 2, hence the algebraic multiplicity of the eigenvalue 1 is one. (Recall we are dealing with a lazy chain, so direct application of Gerschgorin discs tells us the only possible eigenvalue on the unit circle is $\lambda= 1$.)    
Another approach explicitly uses coupling techniques.  The below interlude develops this result via an elementary telescoping argument.  

- - - -  
*begin interlude using exercise from Grinstead and Snell, not renewal theory*    

Note on Time Average (Cesaro mean) For Finite State Markov Chain  

There is a nice exercise (number 16, page 468) in Grinstead and Snell's probability book.  This is shown below.  Due to notational overload, assume that matrices are in $\mathbb R^{m \times m}$  

$\mathbf S^{(n)} := \mathbf I + \mathbf A + \mathbf A^2 + ... + \mathbf A^{n-1}$ 

recall that  
$\mathbf A^k \mathbf E_1 = \big(\mathbf A^k \mathbf 1\big)\mathbf v_1^T= \mathbf 1 \mathbf v_1^T =\mathbf E_1$  

where based on the above we know that $\big(\mathbf I - \mathbf A\big)$ has a single non-zero vector in its right nullspace (the ones vector) and a single non-zero vector in its left nullspace (the as yet mysterious $\mathbf v_1^T$). The only thing we need to know here is that $\mathbf v_1^T\mathbf 1 \neq 0$ -- though as is custom, we will choose this dot product to be equal to one.  

*how do we know it isn't orthogonal?*  As a dominant (and real) eigenvector it would need positive and negative values to be orthogonal to the ones vector, our only right eigenvector -- i.e. sum to zero-- but if this was the case we'd have, by application of triangle inequality:    

$\big \vert \mathbf v\big \vert^T \mathbf A^2  \geq \big \vert \mathbf v\big \vert^T \mathbf A \geq   \mathbf v^T \mathbf A $   

where *the inequality is strict in at least one component in each case*.  This is a red flag as it indicates a monotone non-decreasing sequence, which if bounded above indicates another eigenvector in the kernel of $\big(\mathbf I - \mathbf A\big)$  which is impossible by our preceding work, and if not bounded above this creates an even bigger problem.  But more basically we have the contradiction that, when we sum over the above bound (via the use of the ones vector as a right eigenvector)    

$\big(\big \vert \mathbf v\big \vert^T \mathbf A^2\big)\mathbf 1  \gt \big(\big\vert \mathbf v\big \vert^T \mathbf A\big)\big(\mathbf A\mathbf 1 \big)  = \big(\big\vert \mathbf v\big \vert^T \mathbf A\big)\mathbf 1 \gt \big \vert \mathbf v\big \vert^T\big( \mathbf A^2\mathbf 1\big) = \big \vert \mathbf v\big \vert^T\mathbf 1 \gt 0 $  
which contradicts the associativity  of matrix vector /matrix products i.e. because it has   
$\big(\big \vert \mathbf v\big \vert^T \mathbf A^2\big)\mathbf 1\gt  \big \vert \mathbf v\big \vert^T\big( \mathbf A^2\mathbf 1\big)  $  



hence $\mathbf v$ must be sign homogeneous -- and we select it to be entirely real-nonnegative such that  
$\text{trace}\big(\mathbf E_1\big) = 1$  

(This is the natural choice for several reasons -- the most basic is selecting the trace to be one for this rank one matrix ensures that $\mathbf E_1^2 = \mathbf E_1$ i.e. it gives us idempotence.)  

Now we return to the exercise and we use a telescoping identity:  

$\big(\mathbf I - \mathbf A + \mathbf E_1\big)\mathbf S^{(n)} = \big(\mathbf I - \mathbf A + \mathbf E_1\big)\big(\mathbf I + \mathbf A + \mathbf A^2 + ... + \mathbf A^{n-1}\big) = \mathbf I - \mathbf A^n + n\mathbf E_1  $

thus  
$\mathbf S^{(n)} = \big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\big(\mathbf I - \mathbf A^n + n\mathbf E_1  \big) = \big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1} - \big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\mathbf A^n + n\mathbf E_1  $  

note:  
$\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\mathbf E_1 = \mathbf E_1$  
(this is intimately tied in with $\text{trace}\big(\mathbf E_1\big) = 1$)  

hence  
$\frac{\mathbf S^{(n)}}{n} = \frac{\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}}{n} - \frac{\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\mathbf A^n}{n} + \mathbf E_1 $  

and  
$\lim_{n \to \infty} \frac{\mathbf S^{(n)}}{n} = \mathbf E_1$  

or if the reader prefers,  
$\Big \Vert\frac{\mathbf S^{(n)}}{n} - \mathbf E_1\Big \Vert_F = \Big \Vert\frac{\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}}{n} - \frac{\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\mathbf A^n}{n}\Big \Vert_F \leq  \Big \Vert\frac{\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}}{n}\Big \Vert_F + \Big \Vert  \frac{\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\mathbf A^n}{n}\Big \Vert_F \lt \epsilon$   
by selecting large enough $n$  
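
The telescoping identity and the limit can both be checked numerically.  This is a sketch assuming NumPy; the transition matrix is a made-up irreducible example, and I find $\mathbf v_1$ by solving $\mathbf v_1^T \mathbf A = \mathbf v_1^T$ together with $\mathbf v_1^T \mathbf 1 = 1$ in the least squares sense, which is just one convenient way to get it.

```python
import numpy as np

# hypothetical irreducible row-stochastic matrix
A = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
m = A.shape[0]
I = np.eye(m)

# left eigenvector for eigenvalue 1, normalized so that v^T 1 = 1
system = np.vstack([A.T - I, np.ones(m)])
target = np.append(np.zeros(m), 1.0)
v = np.linalg.lstsq(system, target, rcond=None)[0]
E1 = np.outer(np.ones(m), v)           # E_1 = 1 v^T, rank one with trace 1

def S(n):
    """S^(n) = I + A + ... + A^(n-1)."""
    total, power = np.zeros((m, m)), I.copy()
    for _ in range(n):
        total += power
        power = power @ A
    return total

# telescoping identity for one particular n
n = 50
lhs = (I - A + E1) @ S(n)
rhs = I - np.linalg.matrix_power(A, n) + n * E1
print(np.allclose(lhs, rhs))

# Frobenius distance between the Cesaro mean and E_1 for growing n
errors = [np.linalg.norm(S(n) / n - E1, ord="fro") for n in (10, 100, 1000)]
print(errors)
```

The printed errors shrink roughly like $\frac{1}{n}$, which matches the two terms divided by $n$ in the bound above.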

*a few technical details:*   
$\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}$  exists.  

for $\mathbf x \neq \mathbf 0$  
$\big(\mathbf I - \mathbf A\big)\mathbf x = \mathbf 0$ 
*iff* 
$\mathbf x \propto \mathbf 1$  

so consider writing any nonzero $\mathbf x$ as the linear combination of 
$\mathbf x = \alpha\mathbf 1 + \sum_{k=2}^m\beta_k\mathbf q_k$ where $\mathbf v^T\mathbf q_k = 0$, $\mathbf x \neq \mathbf 0$ and the $\mathbf q_k$ are mutually orthonormal (and hence these $m-1$ real vectors are linearly independent)  

note:  
considering $\mathbf v^T\mathbf A \mathbf 1 = 1$ but $\mathbf v^T\mathbf A \big(\sum_{k=2}^m \beta_k\mathbf q_k\big) = 0 $  proves that $\mathbf 1$ and the $\mathbf q_k$ are linearly independent hence this collection of $m$ vectors forms a basis  

and for any $\mathbf x \neq \mathbf 0$  
$\big(\mathbf I - \mathbf A+\mathbf E_1\big)\mathbf x = \alpha\mathbf 1 + \big(\mathbf I - \mathbf A\big)\big(\sum_{k=2}^m\beta_k\mathbf q_k\big) \neq \mathbf 0$   

why?  because if $\alpha \neq 0$ then consider left multiplying by $\mathbf v^T $ to see   
$\mathbf v^T\big(\mathbf I - \mathbf A+\mathbf E_1\big)\mathbf x = \alpha \neq 0$.  

In the case that $\alpha = 0$  we know that at least one $\beta_k \neq 0$ and  
$\big(\mathbf I - \mathbf A+\mathbf E_1\big) \big(\sum_{k=2}^m\beta_k\mathbf q_k\big)  = \big(\mathbf I - \mathbf A\big) \big(\sum_{k=2}^m\beta_k\mathbf q_k\big) \neq \mathbf 0$  
because by linear independence we know $\big(\sum_{k=2}^m\beta_k\mathbf q_k\big)$ is *not* $\propto \mathbf 1$ and hence not in the 1 dimensional nullspace of $\big(\mathbf I - \mathbf A\big)$   

- - - -   
Note that for any given transition matrix, we have 

$\big \Vert \big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\big \Vert_F = c$

where $c$ is some positive constant.  Thus we have 

$\big \Vert \frac{\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}}{n}\big \Vert_F = \frac{\big \Vert \big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\big \Vert_F}{n} = \frac{c}{n}\lt \frac{\epsilon}{2}$  
for large enough $n$.  

As for the matrix 

$\frac{\big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\mathbf A^n}{n}$ 

we can bound its Frobenius norm as follows 

$0 \leq   \big \Vert \frac{1}{n} \big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\mathbf A^n\big \Vert_F = \frac{1}{n} \big \Vert \big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\mathbf A^n\big \Vert_F \leq  \frac{1}{n} \big \Vert \big(\mathbf I - \mathbf A + \mathbf E_1\big)^{-1}\big\Vert_F \big \Vert \mathbf A^n\big \Vert_F \leq \frac{1}{n}\big(c\big)\big(m\big)  \lt \frac{\epsilon}{2} $  

where from left to right, we used positive definiteness, then homogeneity of positive scaling, then submultiplicativity and then the fact that $\mathbf A^n$ is stochastic and hence each component has magnitude at most 1, so the sum of squares in each column is at most $m$ and there are $m$ columns, hence its Frobenius norm is at most $m$. Thus for any given $c$ and any given $m$ the right hand side tends to zero as $n$ grows large which creates the squeeze we seek, showing that this term too may be made arbitrarily small by selecting large enough $n$.   

but pointing out the obvious, we have, via application of the triangle inequality       
$ \Big \vert\text{trace}\big(\frac{\mathbf S^{(n)}}{n}\big) - 1 \Big \vert  = \Big \vert\text{trace}\big(\frac{\mathbf S^{(n)}}{n}- \mathbf E_1\big) \Big \vert \leq  \Big \Vert \mathbf I \circ \big(\frac{\mathbf S^{(n)}}{n} - \mathbf E_1\big) \Big \Vert_{S_1}\leq \sqrt{m}\Big \Vert \mathbf I \circ \big(\frac{\mathbf S^{(n)}}{n} - \mathbf E_1\big) \Big \Vert_{F} \leq  \sqrt{m}\Big \Vert\frac{\mathbf S^{(n)}}{n} - \mathbf E_1\Big \Vert_F \lt \sqrt{m}\cdot \epsilon$ 

(where $\circ$ denotes the Hadamard product and where $S_1$ is the Schatten 1-norm and $\sum_{k=1}^m \sigma_k \leq m^\frac{1}{2}\big(\sum_{k=1}^m \sigma_k^2\big)^\frac{1}{2}$  by triangle inequality and then Cauchy-Schwarz (ones trick) -- see Schur Inequality writeup for more information)  

which tells us that the trace of $\frac{\mathbf S^{(n)}}{n}$ may be made arbitrarily close to $1$ by selecting large enough $n$.  This immediately tells us that the algebraic multiplicity of eigenvalue 1 must be one.  (And recall we have a lazy chain so all other eigenvalues are strictly less than one in magnitude). 
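
The trace statement is easy to watch numerically.  In this sketch (assuming NumPy; `averaged_trace` is a hypothetical helper and the underlying chain is a made-up 3-cycle) the chain is made lazy via $\frac{1}{2}\big(\mathbf A + \mathbf I\big)$ as suggested earlier.

```python
import numpy as np

# hypothetical irreducible chain, made lazy via (A + I) / 2
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
lazy = 0.5 * (A + np.eye(3))

def averaged_trace(T, n):
    """Trace of (1/n)(I + T + ... + T^(n-1))."""
    total, power = 0.0, np.eye(T.shape[0])
    for _ in range(n):
        total += np.trace(power)
        power = power @ T
    return total / n

traces = [averaged_trace(lazy, n) for n in (10, 100, 1000)]
print(traces)                          # approaches 1 as n grows

# eigenvalue moduli of the lazy chain: exactly one of them equals 1
moduli = np.sort(np.abs(np.linalg.eigvals(lazy)))
print(moduli)
```

If the eigenvalue 1 had algebraic multiplicity 2, each extra copy would add 1 to the averaged trace, so the trace could not settle near 1.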

for avoidance of doubt, each eigenvalue of $\frac{\mathbf S^{(n)}}{n}$ that comes from an eigenvalue $\lambda$ of $\mathbf A$ with $\big \vert \lambda \big \vert \lt 1$ may be written as a finite geometric series $\frac{1}{n}\cdot \frac{1-\lambda^n}{1-\lambda}$, which may be made arbitrarily small in magnitude, i.e. application of triangle inequality gives  $\frac{1}{n}\cdot\big \vert \frac{1-\lambda^n}{1-\lambda}\big \vert \leq \frac{1}{n}\cdot \frac{2}{\big \vert 1-\lambda \big \vert} $  

*We can now verify that each component of* $\mathbf v$ *must be strictly positive*    

The chain consists of a single recurrent class which means if we start in state $j$, we return with probability 1 to state $j$.  But this is equivalent to having an infinite expected number of visits to state $j$.  Since the subdominant eigenvalues decay geometrically quickly (use Jordan form, or with some work Cayley-Hamilton) having a $v_j=0$ would result in a contradiction.  An even easier approach is to recognize that as a single recurrent class, if we start in any state $i$ we reach state $j$ with probability 1.  This holds for any state $i$, and any convex combination of such states (via linearity).  So if we had $v_j=0$ and we started the process in steady state $\mathbf v$, then we'd have 0 expected visits to $j$ (and by Markov inequality or union bound -- the probability of visiting $j$ is zero), which contradicts that $j$ is visited with probability 1 (or even the much weaker claim that from just one state $i$ where $v_i \gt 0$ the probability of visiting $j$ is at least $1 \cdot v_i \gt 0$)   



*end interlude using exercise from Grinstead and Snell, not renewal theory*  

and since we now know that the algebraic and geometric multiplicities match, we know the left nullspace of 
$\big(\mathbf A - \mathbf I\big)$ is one dimensional, i.e. it is spanned by a single nonzero vector $\mathbf \pi$.  The components must be positive because $\mathbf A^k$ is strictly positive for $k \geq n$  (consider that $\mathbf B:= \mathbf A - \mathbf {1\pi}$ is a matrix with all eigenvalues strictly less than one in magnitude, hence $\mathbf B^k \to \mathbf 0$... this result is fleshed out via the atypical route of applying Cayley-Hamilton, repeatedly, in the cells that follow this section on Perron theory.  The **said cells are a work in progress** though.)  

In fact now these values associated with the steady state vector must agree with renewal rewards time averaged result so we must have 

$\pi_j = \frac{1}{\bar{X}_j}$  



**the above was a lot of work spent on developing Perron-Frobenius for the special case of (finite state, time-homogeneous) Markov chains.  What about the 'more general' case of n x n matrices with real non-negative entries?**  

$\mathbf A \in \mathbb R^{\text{n x n}}$  

0. If the matrix is nilpotent then there is nothing more to be said. Everything that follows assumes the matrix is not nilpotent.  If the matrix (or its transpose) has the ones vector as an eigenvector, divide by the associated eigenvalue and we recover Perron-Frobenius, as we've already done, for Markov chains.  **everything that follows assumes that the underlying graph is connected**  

1.  First, since there are $n$ (possibly complex) eigenvalues we know there is some maximal magnitude eigenvalue(s) and  we divide by the magnitude of the maximal magnitude eigenvalue.  
2.  We then show this implies the existence of an eigenvalue of 1, with a real non-negative eigenvector $\mathbf v$  
3.  Next we show that the associated eigenvector $\mathbf v$ must in fact be strictly positive.  
4. Consider $\mathbf D = \text{Diag}\big(\mathbf v\big) $; then the matrix $\mathbf P = \mathbf D^{-1}\mathbf A \mathbf D $  is a stochastic matrix.  I.e. $\mathbf A$ is real non-negative and a similarity transform involving only positive numbers leaves this intact.  Further, $\mathbf 1 = \mathbf D^{-1}\mathbf v $ is an eigenvector, that is  
$\mathbf P\mathbf 1  =  \mathbf D^{-1}\mathbf A \mathbf D \mathbf 1=  \mathbf D^{-1}\mathbf A \mathbf D \mathbf D^{-1}\mathbf v = \mathbf D^{-1}\mathbf A \mathbf v= \mathbf D^{-1} \mathbf v=\mathbf 1$  
5. Hence we have $\mathbf P$ which is a stochastic matrix and inherits everything that we've proven above for the 'special case' of stochastic matrices.  But since $\mathbf A$ is similar to $\mathbf P$, then it too inherits all of these properties. 
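
Step 4 of the outline above can be sketched on a tiny example.  The matrix `A` and vector `v` below are made up for illustration (they satisfy $\mathbf A \mathbf v = \mathbf v$ with $\mathbf v \gt \mathbf 0$); the code just forms $\mathbf D^{-1}\mathbf A \mathbf D$ entry by entry and checks the row sums.

```python
# Hypothetical non-negative matrix with eigenvalue 1 and positive eigenvector v.
A = [[0.5, 1.0],
     [0.25, 0.5]]
v = [2.0, 1.0]
n = len(A)

# Confirm A v = v.
Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

# P = D^{-1} A D with D = Diag(v), i.e. P[i][j] = A[i][j] * v[j] / v[i].
P = [[A[i][j] * v[j] / v[i] for j in range(n)] for i in range(n)]

# Each row of P should sum to one, so P is row stochastic.
row_sums = [sum(row) for row in P]

print(Av, P, row_sums)
```

Note that the similarity transform only rescales entries by positive numbers, so the zero pattern (and hence the underlying graph) of $\mathbf A$ is preserved in $\mathbf P$.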

It remains for us to prove (2) and (3) 


2:  
selecting our maximal magnitude eigenpair $\lambda, \mathbf x$, which we don't know much about just yet, we have  

$\mathbf x^* \mathbf A = \lambda \mathbf x^*$  and we know $\big \vert \lambda \big \vert = 1$  

so (echoing what was done much earlier in the writeup, where the absolute value is taken component wise, and in technique, where we repeatedly apply the triangle inequality)  

$\big \vert \mathbf x^* \mathbf A\big \vert= \big \vert \lambda \mathbf x^*\big \vert =  \big \vert \lambda \big \vert \big \vert\mathbf x^*\big \vert = \big \vert\mathbf x\big \vert^T$  
but application of triangle inequality gives  


$ \big \vert\mathbf x\big \vert^T = \big \vert \mathbf x^* \mathbf A\big \vert \leq \big \vert \mathbf x^* \big \vert \big \vert\mathbf A\big \vert = \big \vert \mathbf x \big \vert^T \mathbf A$   
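
The component-wise triangle inequality used here holds for any complex vector and any real non-negative matrix, not just eigenvectors.  A minimal sketch, with a made-up matrix `A` and vector `x` chosen only for illustration:

```python
# Hypothetical real non-negative matrix and arbitrary complex vector.
A = [[0.2, 0.8, 0.0],
     [0.5, 0.0, 0.5],
     [0.3, 0.3, 0.4]]
x = [1 + 2j, -0.5j, -3 + 0j]
n = len(A)

# Component-wise |x^* A|: modulus of each entry of the row vector x^* A.
lhs = [abs(sum(x[i].conjugate() * A[i][j] for i in range(n))) for j in range(n)]

# Component-wise |x|^T A: moduli taken before multiplying.
rhs = [sum(abs(x[i]) * A[i][j] for i in range(n)) for j in range(n)]

for left, right in zip(lhs, rhs):
    print(left, "<=", right)
```

Equality in a given component requires all the nonzero terms in that column's sum to share a common phase, which is the observation the later contradiction argument leans on.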

equivalently we have  

$ \big(\mathbf A^T\big)\big \vert \mathbf x \big \vert\geq \big \vert\mathbf x\big \vert $   

so we have term by term (weak) dominance.  Taking advantage of real non-negativity

$ \big(\mathbf A^T\big) \big(\mathbf A^T\big)\big \vert \mathbf x \big \vert\geq  \big(\mathbf A^T\big)\big \vert\mathbf x\big \vert $   

(the reader should confirm this holds for an arbitrarily chosen ith row)  

so  

$ \big(\mathbf A^T\big)^2\big \vert \mathbf x \big \vert \geq  \big(\mathbf A^T\big)\big \vert \mathbf x \big \vert \geq \big \vert\mathbf x\big \vert $   

and iterating gives  

$ \big(\mathbf A^T\big)^{k+1}\big \vert \mathbf x \big \vert \geq  \big(\mathbf A^T\big)^{k}\big \vert \mathbf x \big \vert\geq ... \geq \big(\mathbf A^T\big)^3\big \vert \mathbf x \big \vert \geq  \big(\mathbf A^T\big)^2\big \vert \mathbf x \big \vert \geq  \big(\mathbf A^T\big)\big \vert \mathbf x \big \vert \geq \big \vert\mathbf x\big \vert $   

so we have a monotone, non-decreasing sequence which leads us to two different possibilities  

(i) if the sequence is bounded above by a constant real valued vector then 

$\lim_{k\to \infty}\big(\mathbf A^T\big)^{k}\big \vert \mathbf x \big \vert = \mathbf v$  
by monotone convergence theorem, but  
$\lim_{k\to \infty}\big(\mathbf A^T\big)^{k+1}\big \vert \mathbf x \big \vert = \Big( \lim_{k\to \infty}\mathbf A^T\big(\mathbf A^T\big)^{k}\big \vert \mathbf x \big \vert\Big)  =\mathbf A^T\cdot\Big( \lim_{k\to \infty}\big(\mathbf A^T\big)^{k}\big \vert \mathbf x \big \vert\Big)  = \mathbf A^T\mathbf v$  

where $\mathbf v \geq 0$ and at least one component is positive because $\big \vert \mathbf x \big \vert $ has at least one positive component (i.e. the zero vector isn't an eigenvector)  

thus  
$\mathbf A^T\mathbf v  =1 \cdot \mathbf v$    
so we have found a (left) eigenvector with real non-negative components and eigenvalue of 1  

(ii)  
what if the sequence is not bounded above by a constant?  Your author does not know a direct answer to this question.  The intuition is that the spectral radius (i.e. maximal magnitude eigenvalue) of $\mathbf A$ is 1.  If it were any larger, powers of $\mathbf A$ would clearly tend to become arbitrarily large.  And if it were any smaller $\mathbf A^k \to \mathbf 0$.  This is proven *in a work in progress manner* in the cells that follow this Perron-Frobenius Theorem proof, using Cayley-Hamilton; the proof is conceptually simple using subadditivity and submultiplicativity of a nice norm like the Frobenius norm, and noting that for large enough $k$ a matrix to the power $k$ can be written as a linear combination of powers of that matrix with arbitrarily small (in magnitude) coefficients, hence some power of $\mathbf A$ would tend to zero; the general result follows, e.g. by showing this implies a Cauchy sequence.  

If one wants a more direct proof with more algebraic machinery, then the result that $\mathbf A^k \to \mathbf 0$ *iff* $\rho\big(\mathbf A\big)\lt 1$ (where $\rho$ denotes the spectral radius) follows by using the Jordan Canonical Form of $\mathbf A$.  

In any case we cannot directly attack the question as to whether the sequence is bounded above, *but we do know that it is on the cusp of being a decreasing sequence*, so we proceed by trying to shrink the spectral radius of $\mathbf A$ by a very small amount.  

So we revisit 
$\mathbf B = \frac{1}{n}\mathbf S^{(n)} := \frac{1}{n}\big(\mathbf I + \mathbf A + \mathbf A^2 + ... + \mathbf A^{n-1}\big)$  

and our non-decreasing sequence above tells us that 

$\big \vert \mathbf x \big \vert^T \leq \big \vert \mathbf x\big \vert^T\mathbf B$  


*in particular if* $\mathbf x$ is real non-negative in every component, then it is our desired eigenvector with eigenvalue of 1.  

**suppose for a contradiction**  $\mathbf x$ *can NOT be chosen as real non-negative in every component, but since* $\mathbf A$ *is a (/has an underlying) connected graph*, we know $\mathbf B$ *has strictly positive components and application of Triangle Inequality tells us that*  

$\big \vert \mathbf x \big \vert^T \lt \big \vert \mathbf x\big \vert^T\mathbf B = \mathbf a$  

let  
$\alpha := \min_i \frac{\big \vert \mathbf x^* \mathbf B\big \vert_i}{a_i} \gt 0$   
(i.e. ignore the indices where $x_i =0$ and amongst the remaining ones, the case where the 'rate of growth' is smallest)  
and we see  $\alpha \in (0,1)$  

then we have (and iterating, again taking advantage of non-negative components)   
$\big \vert \mathbf x\big \vert^T \leq \big \vert \mathbf x\big \vert^T\big(\alpha\mathbf B\big)\leq \big \vert \mathbf x\big \vert^T\big(\alpha\mathbf B\big)^2 \leq... \leq \big \vert \mathbf x\big \vert^T\big(\alpha\mathbf B\big)^k$  

If we toggle associativity, we can consider matrix matrix multiplication first and know $\big(\alpha\mathbf B\big)^k \to \mathbf 0$ because its spectral radius is $\lt 1$ which implies 
$\big(\alpha\mathbf B^T\big)^k\mathbf x \to \mathbf 0$  

but if we consider matrix vector products first then we see that 
$\big(\alpha\mathbf B^T...\big(\alpha\mathbf B^T\big(\alpha\mathbf B^T\mathbf x\big)\big)...\big) \geq  \big \vert \mathbf x \big \vert $ 

i.e. it is bounded below by a constant real-nonnegative vector with at least one positive component and hence   
$ \big \Vert \big(\alpha\mathbf B\big)^k - \mathbf 0\big\Vert_F \gt c \gt 0$   
for all $k$ which contradicts $\big(\alpha\mathbf B\big)^k \to \mathbf 0$    

thus we know   
$\mathbf x = \big \vert\mathbf x\big \vert = \mathbf v$, where $\mathbf v^T\mathbf A = \mathbf v^T$  

3:  
The final step involves proving that all components of $\mathbf v$ must in fact be positive, i.e. $\mathbf v\gt 0$.  As in our earlier Gerschgorin disc proof, consider embedding our matrix $\mathbf A$ in an absorbing state markov chain, i.e. consider 

$\mathbf M = \begin{bmatrix} 1 & 0\\ * &\gamma\mathbf A\\ \end{bmatrix}$  
$\gamma^{-1} = \big(\text{maximal row sum of }\mathbf A\big)$  
i.e. so $\gamma \mathbf A$ has a maximal row sum of 1.  

since $\mathbf A$ is not already stochastic, but is real non-negative with a connected graph and maximal modulus eigenvalue of one, we know $\gamma^{-1} \gt 1$ (see above Taussky refinement), i.e. we know  $\gamma \in (0,1)$, which means that the new maximum modulus eigenvalue is $(\gamma \lambda) = \gamma \in (0,1)$.  Note: technically the Taussky refinement isn't needed here; a basic application of Gerschgorin discs tells us that $(\gamma \lambda) = \gamma \in (0,1]$ -- the strictness is nice to know (and is in fact implied by the refinement) but is not needed in the below argument.  

As before the $*$ cells are real non-negative and chosen such that  
$\mathbf M\mathbf 1 = \mathbf 1$  

and as before we have the blocked structure (writing $\mathbf P := \gamma \mathbf A$)  
$\mathbf M^k = \begin{bmatrix} 1 & 0\\ * & \mathbf P^k\\ \end{bmatrix}$  


however, supposing for a contradiction that $v_j = 0$ (i.e. $\mathbf v$ isn't strictly positive)  
we now select our real non-negative eigenvector $\mathbf v$ where $\big \Vert \mathbf v \big \Vert_1=1$ and run our markov chain, starting in this 'decaying steady state' 

it is immediate that  

$\begin{bmatrix}  0\\ \mathbf v\\ \end{bmatrix}^T \mathbf M^k = \begin{bmatrix}  1 - \gamma^k \mathbf 1^T \mathbf v\\ \gamma^k \mathbf v\\  \end{bmatrix}^T $  
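
The 'decaying steady state' identity above can be checked numerically.  This is a sketch under assumptions: the matrix `A` and its left eigenvector `v` are made up for illustration (with $\mathbf v^T \mathbf A = \mathbf v^T$ and $\big \Vert \mathbf v \big \Vert_1 = 1$), and state 0 plays the role of the absorbing state.

```python
# Hypothetical non-negative matrix with left eigenvector v for eigenvalue 1.
A = [[0.5, 1.0],
     [0.25, 0.5]]
v = [1.0 / 3.0, 2.0 / 3.0]

# gamma^{-1} is the maximal row sum of A, so gamma * A has maximal row sum 1.
gamma = 1.0 / max(sum(row) for row in A)

# Build M: state 0 is absorbing; each other row is gamma * (row of A),
# with the leftover probability mass sent to the absorbing state.
M = [[1.0] + [0.0] * len(A)]
for row in A:
    scaled = [gamma * a for a in row]
    M.append([1.0 - sum(scaled)] + scaled)

def step(dist, M):
    """Row vector times matrix."""
    n = len(M)
    return [sum(dist[i] * M[i][j] for i in range(n)) for j in range(n)]

# Start in [0, v] and compare against [1 - gamma^k, gamma^k * v] for each k.
dist = [0.0] + v
history = []
for k in range(1, 11):
    dist = step(dist, M)
    expected = [1.0 - gamma ** k] + [gamma ** k * vi for vi in v]
    history.append((dist, expected))

print(history[-1])
```

The transient block simply scales by $\gamma$ at every step, so the relative proportions inside $\mathbf v$ never change; in particular a zero component of $\mathbf v$ would stay zero forever.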

and if we examine $k \in \{1,2,..., n, n+1\}$  
where the event $A_k$ indicates that given our starting position (any fixed arbitrary state $i$ in the graph for $\mathbf P$), we visit state $j$ at step $k$, we have 

$0 \lt Pr\big\{A_1 \bigcup A_2  \bigcup ... \bigcup A_{n} \bigcup A_{n+1}\big\} \leq Pr\big(A_1\big) + Pr\big(A_2\big) +  ... + Pr\big(A_{n}\big)  + Pr\big(A_{n+1}\big) = 0 $   
where the LHS is positive by the fact that the graph is connected, and hence a positive path $i \to j$ exists in at most $n$ iterations for any starting state $i$, and the RHS is the union bound (or markov inequality -- i.e. if we start in this decaying steady state the expected visits to $j$ are 0 which contradicts the positive probability of visiting $j$).  

Thus it must be the case that $\mathbf v \gt \mathbf 0$.  As outlined in the opener to this section, this then lets us map $\mathbf A$ to a Markov chain via similarity transform, and we inherit all of the preceding Perron-Frobenius Theory properties proven there.  




(There are some additional items worth consideration, e.g. about periodicity in graphs and min max theorems -- your author recommends Meyer's *Matrix Analysis* for discussion of these topics.  The discussion in this notebook is unique, to your author's knowledge, in that it heavily pushes and derives the bulk of results in Perron-Frobenius theory via the 'special case' of probability theory.)  



# experimental / incomplete items are below and may be ignored  

*yet another finish for Gerschgorin Discs*   

Let us reconsider  

$\big \vert \mathbf x \big \vert = \big \vert\big(\mathbf B^k \mathbf x\big)\big \vert  \leq \big \vert\mathbf B\big \vert^k \cdot \big \vert \mathbf x\big \vert = \mathbf P^k \cdot \big \vert \mathbf x\big \vert$  

in particular  

$\big \vert \mathbf x \big \vert =  \mathbf P^k \cdot \big \vert \mathbf x\big \vert$  

then if we sum over this result for $\mathbf I$ and $\mathbf P^{k}$ for $k \in \{1,2,...,n, n+1\}$   

i.e.  

$\mathbf S = \frac{1}{n+1} \big(\mathbf P^{1} + \mathbf P^{2} + ... + \mathbf P^{n} + \mathbf P^{n+1}\big) = \frac{1}{n+1} \big(\mathbf I + \mathbf P^{1} + \mathbf P^{2} + ... + \mathbf P^{n}\big) \mathbf P$     

we still have  
$\mathbf S  \big \vert \mathbf x\big \vert =  \big \vert \mathbf x\big \vert$   

Further, since we have one communicating class, each state is reachable from any other state in at most $n$ iterations. Thus for each row $i$ and each state $j$ (where we interpret $\mathbf P^0 =\mathbf I$) we have some $k$ where 

$\mathbf e_i^T \mathbf P^{k}\mathbf e_j = w_j \gt 0$  

i.e.  
$\mathbf e_i^T \mathbf P^{k}  =  w_j \mathbf e_j^T + \sum_{r\neq j} w_r\mathbf e_r^T $  
where $\sum_{m=1}^n w_m = 1$ and $w_m \geq 0$  (i.e. a convex combination).  

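However, since this applies for every $j$, it applies in particular to the (at least one) state whose outgoing transition probabilities sum to strictly less than one.  Writing $\alpha_r := \mathbf e_r^T \mathbf P \mathbf 1$ for the $r$th row sum, we hence have  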

$\mathbf e_i^T \mathbf P^{k+1}\mathbf 1 $  
$= \Big(\big(\mathbf e_i^T \mathbf P^{k}\big)\mathbf P\Big) \mathbf 1 $  
$= \big( w_j \mathbf e_j^T \mathbf P\big)\mathbf 1 + \sum_{r\neq j} \big(w_r\mathbf e_r^T \mathbf P \big)\mathbf 1 $  
$= w_j \alpha_j + \sum_{r\neq j} w_r \alpha_r $  
$\leq w_j \alpha_j + \sum_{r\neq j} w_r $  
$= w_j \alpha_j + \big(1 + w_j\cdot(-1)\big) $  
$= 1 + w_j \big(-1 + \alpha_j\big) $  
$\lt 1$  

because $\alpha_r \leq 1$ but $\alpha_j \lt 1$ with $0 \lt w_j \leq 1$  

thus for any $\mathbf e_i$ we have  

$\mathbf e_i^T \mathbf S\mathbf 1 = \mathbf e_i^T  \frac{1}{n+1} \big(\mathbf P^{1} + \mathbf P^{2} + ... + \mathbf P^{n} + \mathbf P^{n+1}\big)\mathbf 1 \leq \frac{1}{n+1}\alpha_j + \frac{n}{n+1}\lt 1 $   

which tells us that *every* row of $\mathbf S$ sums to less than one. Since all components of $\mathbf S$ are real non-negative, a direct application of our originally proven Gerschgorin discs tells us that 

$\big \vert \lambda_\text{max}\big(\mathbf S\big) \big \vert \leq \text{largest row sum }\big(\mathbf S\big) \lt 1$    
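
As a small illustration of the row sum claim, here is a sketch with a made-up irreducible substochastic matrix `P` (a 3-cycle where only the last row leaks mass).  It forms $\mathbf S = \frac{1}{n+1}\big(\mathbf P^{1} + ... + \mathbf P^{n+1}\big)$ directly and inspects every row sum.

```python
# Hypothetical irreducible substochastic matrix: rows 0 and 1 sum to one,
# row 2 sums to 0.5.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0]]
n = len(P)

def matmul(X, Y):
    """Plain triple-loop matrix product for small square matrices."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

# Accumulate P^1 + P^2 + ... + P^(n+1), then average.
total = [[0.0] * n for _ in range(n)]
power = P
for _ in range(n + 1):
    total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    power = matmul(power, P)
S = [[entry / (n + 1) for entry in row] for row in total]

row_sums = [sum(row) for row in S]
print(row_sums)
```

Even though two of the three rows of $\mathbf P$ are stochastic, every row of $\mathbf S$ picks up some of the leak, so the largest row sum (and hence the Gerschgorin bound) drops below one.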

but if $\mathbf P$ has an eigenvalue of $1$, so must $\mathbf S$.  The contrapositive is that since $\mathbf S$ cannot have an eigenvalue of $1$, then we know that all eigenvalues of $\mathbf P$ have magnitude $\lt 1$.  Then using Jordan Form, or the result from Gelfand, we know that $\mathbf P^k \to \mathbf 0$ and so the strictness of the theorem is proven.  
**edit:** see post from Jairo Bochi here:  
https://mathoverflow.net/questions/232132/applications-of-the-cayley-hamilton-theorem  
it seems we can get Gelfand's formula itself directly from Cayley-Hamilton, and the below may be moot  


a slightly different and amusing finish would be to notice that it is sufficient to show that the powers of some fixed power of $\mathbf P$ tend to zero, i.e. for some natural number $j$ 

$\big \Vert \big(\mathbf P^j\big)^k \big \Vert \leq \epsilon$  

for all $\epsilon \gt 0$ by selecting large enough $k$.  

One way to finish this is to select $j$ large enough (i.e. so the eigenvalue magnitudes are all *very* small), so that the maximal magnitude eigenvalue of $\mathbf P^j$ is less than $3^{-n}$. From here, apply Cayley-Hamilton, triangle inequality and positive definiteness.  

I.e.  for $n$ by $n$ matrices, we have  

$ a_n\big(\mathbf P^j\big)^1 + a_{n-1}\big(\mathbf P^j\big)^2 + ... + a_1\big(\mathbf P^j\big)^{n-1}  + \big(\mathbf P^j\big)^n = \mathbf 0$  

by Cayley Hamilton. Thus  

$\big(\mathbf P^j\big)^n  = -\Big( a_n\big(\mathbf P^j\big)^1 + a_{n-1}\big(\mathbf P^j\big)^2 + ... + a_1\big(\mathbf P^j\big)^{n-1} \Big)  $  

computing the norm of each side --- with some care, the argument works with any norm, however the Frobenius or operator 2 norm is suggested here for convenience.  Any submultiplicative norm will work here, however.  

If $\big \Vert \mathbf P\big \Vert \lt 1$, then submultiplicativity is enough to prove that 
$\big \Vert \mathbf P^k\big \Vert \leq \big \Vert \mathbf P\big \Vert^k \leq \big \Vert \mathbf P\big \Vert^{k-1} ... \leq \big \Vert \mathbf P\big \Vert  \lt 1$  

which is monotone decreasing, bounded below by zero, and has an obvious limit of zero.  
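
The easy case $\big \Vert \mathbf P\big \Vert \lt 1$ can be sketched numerically with the Frobenius norm.  The matrix `P` below is made up for illustration; the code just tabulates $\big \Vert \mathbf P^k\big \Vert_F$ next to $\big \Vert \mathbf P\big \Vert_F^k$.

```python
from math import sqrt

# Hypothetical matrix with Frobenius norm sqrt(0.30) < 1.
P = [[0.2, 0.3],
     [0.1, 0.4]]

def matmul(X, Y):
    """Plain matrix product for small square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def frobenius(X):
    """Square root of the sum of squared entries."""
    return sqrt(sum(entry * entry for row in X for entry in row))

base = frobenius(P)

# norms[k-1] holds the Frobenius norm of P^k for k = 1, ..., 10.
norms = []
power = P
for k in range(1, 11):
    norms.append(frobenius(power))
    power = matmul(power, P)

for k, value in enumerate(norms, start=1):
    print(k, value, base ** k)
```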


For the rest of the post, we assume  
$\big \Vert \mathbf P\big \Vert =M \geq 1$. 

Applying triangle inequality, we have  


$\big \Vert \mathbf P^{nj}\big \Vert $  
$=\big \Vert-\Big( a_n\big(\mathbf P^j\big)^1 + a_{n-1}\big(\mathbf P^j\big)^2 + ... + a_1\big(\mathbf P^j\big)^{n-1} \Big)\big \Vert   $  
$ \leq  \vert a_n\vert \big \Vert \mathbf P^j \big \Vert  + \vert a_{n-1}\vert \big \Vert \mathbf P^{2j}\big \Vert + ... + \vert a_1\vert \big \Vert \mathbf P^{(n-1)j}\big \Vert$  
$ \leq  \vert a_n\vert \big \Vert \mathbf P^j \big \Vert  + \vert a_{n-1}\vert \big \Vert \mathbf P^{j}\big \Vert^2 + ... + \vert a_1\vert \big \Vert \mathbf P^{j}\big \Vert^{(n-1)}$  
$=  \vert a_n\vert \cdot c  + \vert a_{n-1}\vert c^2 + ... + \vert a_1\vert c^{(n-1)}$  

with $ c:  =\big \Vert \mathbf P^j \big \Vert$  


and by submultiplicativity of our norm, for $i \in\{1, 2, ... , n-1,n\}$  

$\big \Vert \mathbf P^{nj + i}\big \Vert$  
$\leq  \big \Vert \mathbf P \big \Vert^i \cdot  \big \Vert \mathbf P^{nj}\big \Vert $  
$\leq  M^i \cdot  \big \Vert \mathbf P^{nj}\big \Vert $  
$\leq  M^i \cdot \big(\vert a_n\vert \cdot c  + \vert a_{n-1}\vert c^2 + ... + \vert a_1\vert c^{(n-1)}\big) $    
$\leq  M^{n} \cdot \big(\vert a_n\vert \cdot c  + \vert a_{n-1}\vert c^2 + ... + \vert a_1\vert c^{(n-1)}\big) $    

thus we have,  
$\big \Vert \mathbf P^{nj + i}\big \Vert$    
$\leq  M^{n} \cdot \big(\vert a_n\vert \cdot c  + \vert a_{n-1}\vert c^2 + ... + \vert a_1\vert c^{(n-1)}\big) $    

where the constants $c, M \gt 0$ are fixed once $\mathbf P$ (and $j$) are specified.  

It thus suffices to show that for any given $\mathbf P$ we may, 
by selecting large enough $j$, arrange that  
$\big \Vert \mathbf P^{nj}\big \Vert \leq \big(\vert a_n\vert \cdot c  + \vert a_{n-1}\vert c^2 + ... + \vert a_1\vert c^{(n-1)}\big) \lt \frac{1}{M^n} $  

because once we have this, we know for $i \in\{1, 2, ... , n-1,n\}$  
$\big \Vert \mathbf P^{nj +i}\big \Vert \to 0$

which implies $\mathbf P^{nj +i} \to \mathbf 0$.  

and we may select it to be arbitrarily close to zero by selecting large enough $n$  

**I don't think this is quite needed here... and the discussion is getting muddled right around now** 

so what we want is to have  
$\sum_{i=0}^{n-1} \big \Vert \mathbf P^{nj +i}\big \Vert$  
$\leq \sum_{i=0}^{n-1} M^i\big \Vert \mathbf P^{nj}\big \Vert$    
$ \leq \big \Vert \mathbf P^{nj}\big \Vert \cdot \sum_{i=0}^{n-1} M^i $    
$ \leq \big \Vert \mathbf P^{nj}\big \Vert \cdot \frac{1-M^n}{1-M}$    
(where the upper bound is understood to be simply $\big \Vert \mathbf P^{nj}\big \Vert \cdot n$ if $M=1$)  
 

but, we know  
$\big \vert  \lambda_1\big \vert \geq \big \vert \lambda_2 \big \vert  \geq ... \geq \big \vert \lambda_n\big \vert $  

with $e_k$ being the $k$th elementary symmetric function, 
$\vert a_k \vert $  
$= \big \vert e_k\big(\lambda_1 , \lambda_2 , ..., \lambda_n\big)\big \vert$  
$\leq e_k\big(\big \vert\lambda_1\big \vert , \big \vert\lambda_2\big \vert , ..., \big \vert \lambda_n\big \vert\big)$  
$\leq e_k\big(\big \vert\lambda_1\big \vert , \big \vert\lambda_1\big \vert , ..., \big \vert \lambda_1\big \vert\big)$  
$ =  \big \vert \lambda_1\big \vert^k \binom{n}{k}$   
$ \leq  \big \vert \lambda_1\big \vert \binom{n}{k}$   

by application first of triangle inequality, then a point-wise bound (the second step is alternatively justified via Maclaurin's inequalities or Schur concavity of elementary symmetric functions and writing $\lambda_1 = c \cdot A$ where $A$ is the arithmetic mean of the magnitudes of the eigenvalues, with $c \geq 1$), and finally the fact that $\big \vert \lambda_1\big \vert \lt 1$  
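
The chain of bounds on the elementary symmetric functions is easy to spot check.  A minimal sketch follows; the list `lams` is a made-up set of eigenvalues (all of magnitude below one) and `e_k` is a helper written for this illustration.

```python
from itertools import combinations
from math import comb, prod

# Hypothetical eigenvalues, all with magnitude < 1.
lams = [0.3 + 0.1j, -0.2 + 0.25j, 0.1j, -0.05]
n = len(lams)

def e_k(values, k):
    """k-th elementary symmetric function: sum of products over all k-subsets."""
    return sum(prod(subset) for subset in combinations(values, k))

magnitudes = [abs(lam) for lam in lams]
lam_max = max(magnitudes)

# Each entry: (k, |e_k(lams)|, e_k(|lams|), C(n, k) * lam_max^k)
rows = []
for k in range(1, n + 1):
    rows.append((k, abs(e_k(lams, k)), e_k(magnitudes, k), comb(n, k) * lam_max ** k))

for row in rows:
    print(row)
```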

hence we know  

$ \vert a_n\vert + \vert a_{n-1}\vert +  ... +  \vert a_1\vert$   
$\leq \vert \lambda_1 \vert \Big( \binom{n}{1} + \binom{n}{2} + ... + \binom{n}{n-1} +  \binom{n}{n}\Big)$   
$\leq \vert \lambda_1 \vert \Big( \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + ... + \binom{n}{n-1} +  \binom{n}{n}\Big)$   
$= \vert \lambda_1 \vert\cdot 2^n $   
$\lt 3^{-n}\cdot 2^n = \big(\frac{2}{3}\big)^n $   

- - - - 
the idea from here is to look at all $k\geq K$ where we set $K = jn\cdot M$ or something like this, and hence we can verify that $\big \Vert \mathbf P^k\big \Vert \lt 1$ for all such $k$; then looking at the maximum over chunks of $n$ we see a sequence that is monotone decreasing (less than one in norm) and bounded below by $0$, and hence the limit is zero.  
- - - - 


From here we have a companion system /  residual life (renewal) chain  


$\begin{bmatrix}
v_{n}\\ 
v_{n-1}\\ 
\vdots\\ 
v_{2}\\ 
v_{1}
\end{bmatrix} =  \left[\begin{matrix}  
\vert a_1\vert & \vert a_{2}\vert & \vert a_{3}\vert & \cdots & \vert a_n\vert
\\1 & 0 & 0 & \cdots & 0 
\\0 & 1 & 0 & \cdots & 0 
\\\vdots & \vdots & \ddots & \ddots & \vdots 
\\0 & 0 & 0 & 1 & 0  
\end{matrix}\right]\begin{bmatrix}
v_{n-1}\\ 
v_{n-2}\\ 
\vdots\\ 
v_{1}\\ 
v_{0}
\end{bmatrix}
$  


selecting  
$\begin{bmatrix}
v_{n-1}\\ 
v_{n-2}\\ 
\vdots\\ 
v_{2}\\ 
v_{1}
\end{bmatrix} := \begin{bmatrix}
\big \Vert \mathbf P^{(n-1)j}\big \Vert\\ 
\big \Vert \mathbf P^{(n-2)j}\big \Vert\\ 
\vdots\\ 
\big \Vert \mathbf P^{2j}\big \Vert\\ 
\big \Vert \mathbf P^{1j}\big \Vert
\end{bmatrix}$  

recalling our earlier inequalities, we know 

$\big \Vert \mathbf P^{(n)j}\big \Vert \leq v_{n}$ 

and by induction 
$0 \leq \big \Vert \mathbf P^{(n+i)j}\big \Vert \leq v_{n+i}$  
for all natural numbers $i$  

- - - - -  
another approach to this companion system, is to note that if there is a right eigenvector with eigenvalue 1, i.e. a fixed point, then 

$\mathbf C\mathbf v = \mathbf v$, but examining the second row tells us that $v_1 = v_2$, and examining the 3rd row tells us that $v_2 = v_3$, and so on, giving us $v_1 = v_2 = v_3 = ... = v_{n-1} = v_n$.  i.e. any fixed point $\propto \mathbf 1$.  However, focusing on the first row of this matrix, we can see that it is substochastic, so for any $c \gt 0$  


$c \big(\vert a_1\vert \cdot 1 + \vert a_{2}\vert \cdot 1 + \vert a_{3}\vert\cdot 1 + ...\cdots + \vert a_n\vert \cdot 1\big)  $  
$= c \big(\vert a_1\vert + \vert a_{2}\vert + \vert a_{3}\vert + ...\cdots + \vert a_n\vert \big)  $  
$\lt c  $  

i.e.  
$c \cdot \mathbf e_1^T \mathbf C \mathbf 1 \lt c \cdot \mathbf e_1^T \mathbf 1  = c$   

we can verify that if the first component decreases, then   
$c \cdot \mathbf e_k^T \mathbf C \mathbf 1 \leq c \cdot \mathbf e_1^T \mathbf 1  = c$    
for $k\in \{1,2, ..., n\}$    

i.e. we have a component-wise inequality  

$c \cdot  \mathbf C \mathbf 1 \leq c \cdot \mathbf 1  $   



then 
$c \cdot \mathbf e_1^T \mathbf C^2 \mathbf 1  = c \cdot \mathbf e_1^T \mathbf C \big( \mathbf C \mathbf 1\big) \lt c \cdot \mathbf e_1^T \mathbf C \mathbf 1 \lt c \cdot \mathbf e_1^T \mathbf 1  = c$   

because we take a sub convex combination in the second iteration of positive components each of which have non-increased from the prior iteration.  

and by induction we have a decreasing sequence 

$c \cdot \mathbf e_1^T \mathbf C^r \mathbf 1  \lt c \cdot \mathbf e_1^T \mathbf C^{r-1} \mathbf 1 \lt ... \lt c \cdot \mathbf e_1^T \mathbf C \mathbf 1 \lt c \cdot \mathbf e_1^T \mathbf 1  = c$   

which is bounded below by 0 (i.e. a linear combination of terms where all scalars involved are non-negative gives a non-negative result).  

Hence we have a bounded monotone decreasing sequence, and we know some limit exists, i.e.  

$\lim_{r\to\infty} c \cdot \mathbf e_1^T \mathbf C^r \mathbf 1  = L$  

however, since  
$ c \cdot \mathbf e_{k}^T \mathbf C^{r+k-1} \mathbf 1 =  c \cdot \mathbf e_{1}^T \mathbf C^{r} \mathbf 1$    

this tells us that each component must have the same limiting value:  
$\lim_{r\to\infty} c \cdot \mathbf C^r \mathbf 1  = L \mathbf 1$   

But $L$ must be zero -- if $L \gt 0$ then we have 
$\lim_{r\to\infty} c \cdot \mathbf C^r \mathbf 1  = L \mathbf 1$  
which is a right eigenvector with eigenvalue 1, a contradiction.  

hence 
$\lim_{r\to\infty} c \cdot \mathbf C^r \mathbf 1  = \mathbf 0$    

for any $c \gt 0$.  
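
The decay of $\mathbf C^r \mathbf 1$ can be watched directly on a small example.  This is only a sketch: the first row of `C` holds made-up stand-ins for $\vert a_1\vert, \vert a_2\vert, \vert a_3\vert$ that sum to less than one, and $c = 1$ is used for simplicity.

```python
# Hypothetical companion matrix with a substochastic first row (sum = 0.6).
C = [[0.3, 0.2, 0.1],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

def apply(C, u):
    """Matrix times column vector."""
    return [sum(C[i][j] * u[j] for j in range(len(u))) for i in range(len(C))]

# Track e_1^T C^r 1 for r = 0, 1, ..., 200.
u = [1.0, 1.0, 1.0]
first_components = [u[0]]
for _ in range(200):
    u = apply(C, u)
    first_components.append(u[0])

print(first_components[:6])
print(u)
```

The lower rows of $\mathbf C$ only shift entries down, so the second and third components are just delayed copies of the first, which is why all components share the same limit.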

now, for any other $n$ dimensional $\mathbf v$ with real non-negative components, we have  
$\lim_{r\to\infty} \mathbf C^r \mathbf v  = \mathbf 0$    

because by selecting large enough $c$, we have the component wise inequality  
$c \cdot \mathbf C^r \mathbf 1 \geq \mathbf C^r \mathbf v$    
for all natural numbers $r$ 

this may be equivalently written as  
$ \mathbf C^r \big(c \cdot\mathbf 1 -\mathbf v\big) = \mathbf C^r \big(\mathbf x\big) \geq 0$    

because $\mathbf x \geq 0$ and (sub) convex combinations of real non-negative numbers results in real non-negative numbers.  

- - - -  
*note:* making use of submultiplicativity we can further clean this up by noting that 
$\big \Vert \mathbf P^{mj}\big \Vert \leq \big \Vert \mathbf P^{1j}\big \Vert^m$  
and hence getting a Vandermonde style result of 

$\begin{bmatrix}
c^{n-1}\\ 
c^{n-2}\\ 
\vdots\\ 
c^{2}\\ 
c^{1}
\end{bmatrix}$

with $c := \big \Vert \mathbf P^{j}\big \Vert$   
- - - -  


we now prove $\lim_{i\to \infty} v_{n+i} =  0 $   and hence 
$0 \leq \lim_{i \to \infty} \big \Vert \mathbf P^{(n+i)j}\big \Vert \leq 0$  

which by positive definiteness of norms implies $\mathbf P^{(n+i)j} \to \mathbf 0$    

referencing our Feller chp 15 notes, we note that the first row of our companion matrix is real non-negative and sums to less than one, hence we may exponentially tilt it.  I.e. via the intermediate value theorem we know there is some $\theta \gt 1$ such that 

$ \vert a_1 \vert \theta + \vert a_{2}\vert \theta^2 + \vert a_{3}\vert \theta^3 +... +  \cdots + \vert a_n\vert \theta^n =1$ 

so  

$\left[\begin{matrix}  
\vert a_1\vert \theta & \vert a_{2}\vert \theta^2& \vert a_{3}\vert \theta^3 & \cdots & \vert a_n\vert \theta^n
\\1 & 0 & 0 & \cdots & 0 
\\0 & 1 & 0 & \cdots & 0 
\\\vdots & \vdots & \ddots & \ddots & \vdots 
\\0 & 0 & 0 & 1 & 0  
\end{matrix}\right]\mathbf 1 = \mathbf 1$
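
The tilting parameter $\theta$ is guaranteed by the intermediate value theorem, and in practice it can be located by bisection.  A minimal sketch, where the list `a` holds made-up stand-ins for $\vert a_1\vert, \vert a_2\vert, \vert a_3\vert$ (summing to less than one) and `tilted_row_sum` is a helper named for this illustration:

```python
# Hypothetical coefficient magnitudes |a_1|, |a_2|, |a_3| with sum 0.6 < 1.
a = [0.3, 0.2, 0.1]

def tilted_row_sum(theta):
    """|a_1| theta + |a_2| theta^2 + ... + |a_n| theta^n."""
    return sum(coef * theta ** (k + 1) for k, coef in enumerate(a))

# The sum is below one at theta = 1 and increasing in theta,
# so first find an upper bracket, then bisect.
lo, hi = 1.0, 2.0
while tilted_row_sum(hi) < 1.0:
    hi *= 2.0
for _ in range(100):
    mid = (lo + hi) / 2.0
    if tilted_row_sum(mid) < 1.0:
        lo = mid
    else:
        hi = mid
theta = (lo + hi) / 2.0

# First row of the tilted companion matrix; it should now sum to one.
tilted_first_row = [coef * theta ** (k + 1) for k, coef in enumerate(a)]
print(theta, tilted_first_row, sum(tilted_first_row))
```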

from here, using one of many results in the Feller chp 15 notebook, we can observe that, with 

$\mu = 1 \cdot \vert a_1\vert \theta + 2 \cdot \vert a_{2}\vert \theta^2 + 3 \cdot \vert a_{3}\vert \theta^3 + \cdots + n \cdot \vert a_n\vert \theta^n$,  

we have  
$ v_{n+i} \to \frac{\theta^{-i}}{\mu}$  

which may be made arbitrarily small.  

referencing the section "an even better approach", an alternative is to note that 

$v_{n+i} \leq  c \cdot \theta^{n-i}$  

selecting $i$ large enough such that 

$v_{n+i} \lt \min\{c^{-n},1\} $  

we find that 
$\big \Vert \mathbf P^{(n+i)j}\big \Vert \leq v_{n+i}\lt \min\{c^{-n},1\}$  

but then for *any* power $\mathbf P^k$ for $k \geq nj \geq i$ (where large enough $i$ is used above) we may write it as $r = k\% n$ and $k-r$  

**below needs cleaned up, though it is intuitively clear to me...**  
using submultiplicativity, if $r \neq 0$     
$\big \Vert \mathbf P^k \big \Vert_F$   
$\leq \big \Vert \mathbf P^r \big \Vert_F \big \Vert \mathbf P^{(k-r)} \big \Vert_F$  
$\big \Vert \mathbf P^r \big \Vert_F \big \Vert \mathbf P^{jn} \big \Vert_F$  
$\leq \big \Vert \mathbf P \big \Vert_F^r \big \Vert \mathbf P^{(k-r)} \big \Vert_F$  
$\leq c^r \big \Vert \mathbf P^{(k-r)} \big \Vert_F$  
$\lt c^r \min\{c^{-n},1\} $  
$\lt 1$  

but since the norm is submultiplicative, this implies that the norms of higher powers shrink and the matrix powers tend to zero.  



So we now know that powers that are multiples of $jn$ tend to zero.  This means that for any $\epsilon \gt 0$, there exists some $J$, such that for all $j \geq J$  
$\big \Vert P^{jn}\big \Vert_F \lt \epsilon$  

but this also tells us we have convergence for all (large enough) powers of $\mathbf P$.   

that is, by making use of submultiplicativity, we observe the pointwise inequalities, selecting some $\epsilon_0 \gt 0$ and writing $m := \big \Vert \mathbf P\big \Vert_F$

$\big \Vert \mathbf P^{jn}\big \Vert_F \lt \epsilon_0$  
$\big \Vert \mathbf P^{jn+1}\big \Vert_F = \big \Vert \mathbf P \mathbf P^{jn}\big \Vert_F \leq \big \Vert \mathbf P\big \Vert_F \big\Vert \mathbf P^{jn}\big \Vert_F = m \big\Vert \mathbf P^{jn}\big \Vert_F \lt m \cdot \epsilon_0$  
$\big \Vert \mathbf P^{jn+2}\big \Vert_F = \big \Vert \mathbf P \mathbf P  \mathbf P^{jn}\big \Vert_F \leq \big \Vert \mathbf P\big \Vert_F \big \Vert \mathbf P\big \Vert_F \big \Vert \mathbf P^{jn}\big \Vert_F = m^2 \big \Vert \mathbf P^{jn}\big \Vert_F \lt m^2 \cdot \epsilon_0$  
$\vdots$  
$\big \Vert \mathbf P^{jn+(jn-1)}\big \Vert_F = \big \Vert \underbrace{\mathbf P...\mathbf P}_{\text{jn-1 times}}  \mathbf P^{jn}\big \Vert_F \leq  \underbrace{ \big \Vert \mathbf P\big \Vert_F... \big \Vert \mathbf P\big \Vert_F}_{\text{jn-1 times}}  \big \Vert \mathbf P^{jn}\big \Vert_F = m^{jn-1} \big \Vert \mathbf P^{jn}\big \Vert_F \lt m^{jn-1} \cdot \epsilon_0$  
$\big\Vert \mathbf P^{jn+(jn)}\big \Vert_F = \big \Vert \mathbf P^{2jn}\big \Vert_F \lt\epsilon_0$  
$\big \Vert \mathbf P^{2jn + 1}\big \Vert_F \lt m\cdot \epsilon_0$  
$\big \Vert \mathbf P^{2jn + 2}\big \Vert_F \lt m^2\cdot \epsilon_0$  
$\vdots$  
$\big \Vert \mathbf P^{2jn+(jn-1)}\big \Vert_F = \big \Vert \underbrace{\mathbf P...\mathbf P}_{\text{jn-1 times}}  \mathbf P^{2jn}\big \Vert_F \leq  \underbrace{ \big \Vert \mathbf P\big \Vert_F... \big \Vert \mathbf P\big \Vert_F}_{\text{jn-1 times}}  \big \Vert \mathbf P^{2jn}\big \Vert_F = m^{jn-1} \big \Vert \mathbf P^{2jn}\big \Vert_F \lt m^{jn-1} \cdot \epsilon_0$   
and this cyclic pattern continues to repeat.  

Thus for any $\epsilon \gt 0$ we may select   
$\epsilon_0 := \frac{\epsilon}{m^{jn-1}}$   
which tells us there is a $J$  such that for all $j\geq J$, we have a desired upper bound on the norm of our matrix.  We can restate this as for all $k \geq J\cdot n$, where we examine numbers modulo $jn$,  
$r := k\%jn$   
$i := k - r$  

$\big \Vert \mathbf P^k \big \Vert_F = \big \Vert \mathbf P^{i +r}\big \Vert_F \lt m^r \epsilon_0 \leq m^{jn-1} \epsilon_0 = m^{jn-1} \frac{\epsilon}{m^{jn-1}} = \epsilon$  

which proves that 
$\big \Vert \mathbf P^k \big \Vert_F \to 0$  

and hence  

$\mathbf P^k \to \mathbf 0$  

Another way to finish the above is to note that we have a cauchy sequence, so by selecting appropriate $K$, for all 

$k,m\gt K$ (where $v$ is $\min\{k\%n, m\%n\}$)  

here I try to simplify this by choosing $K$ to be a multiple of $n$   

$\big \Vert \mathbf P^k - \mathbf P^m \big \Vert_F $  
$\leq \big \Vert \mathbf P^k\big \Vert_F + \big \Vert \mathbf P^m \big \Vert_F$  
$\leq  2 M^{jn} \cdot \big\Vert \mathbf P^{K}\big \Vert_F $  
$\text{insert line showing it arbitrarily small as a function of well chosen M}$  
$\lt \epsilon$  


- - - - - -  
**broken alternative close**    

an alternative close would make use of the extremal characterization associated with row stochastic matrices. In particular, reconsidering  

$\big \vert \mathbf x \big \vert = \big \vert\big(\mathbf B^k \mathbf x\big)\big \vert \leq \big \vert\mathbf B\big \vert^k \cdot \big \vert \mathbf x\big \vert = \mathbf P^k \cdot \big \vert \mathbf x\big \vert$  

in particular  

$\big \vert \mathbf x \big \vert =  \mathbf P^k \cdot \big \vert \mathbf x\big \vert$  


since $\mathbf x in \mathbb R^n$ we have finitely many points and hence there is a global maximum and a global minimum associated with $\big \vert \mathbf x\big \vert$. 


$\text{minimum value } \leq \vert x_i\vert \leq\text{ maximum value } $   


But considering that we have one communicating class here, and first considering the stochastic rows, 
since  
$\big \vert x_i \big \vert  = \sum_{j=1}^n w_j^{(i)} \big \vert x_j \big \vert$   

where each $w_j^{(i)} \geq 0$ and $\sum_{j}w_j^{(i)} = 1$  
we know  

$\big \vert x_i \big \vert  \leq \text{max}_{j\in \text{1 step transition neighbors}}\{ \vert x_j \vert\} $   

this inequality also holds for the substochastic row (though it is in fact strict as it involves a convex combination with a zero value -- we will return to this)  

however, since there is one communicating class, each state is reachable in at most $n$ steps from any state, and since $\big \vert \mathbf x\big \vert $ is a fixed point, we see that for $k=\{1, 2,...,n\}$  

$\big \vert x_i \big \vert  \leq \text{max}_{j\in \text{k step transition neighbors}}\{ \vert x_j \vert\} $   

taking the union over each of these inequalities, and recognizing that each $x_j$ shows up at least once in the union, we get  
**needs cleaned up**  

we find that for each $i$, $\big \vert x_i\big \vert $ is bounded above by the maximum of a set containing $\big \vert x_j \big \vert$ for any $j\neq i$

**the above should be cleaned up or perhaps more likely deleted? upon reflection it is awfully similar to the result from brualdi, below**  








- - - - - 
*Brualdi's Approach*  

The standard approach in Brualdi is much shorter.  It does not rely on probability theory but instead relies on a nested contradiction.  

As before $\mathbf A$ is an $n$ x $n$ matrix that is irreducible and weakly diagonally dominant. If $\det\big(\mathbf A\big) = 0$ then there is some $\mathbf x \neq \mathbf 0$ such that $\mathbf {A x} = \mathbf 0$.  

Now, since the inequality is strict for at least one row of $\mathbf A$, we can see that $\big \vert \mathbf x \big \vert \not\propto \mathbf 1$, i.e. the components of $\mathbf x$ cannot all have the same magnitude.  (Otherwise a strictly dominant row $i$ would give $\big \vert a_{i,i}\big \vert \big \vert x_i \big \vert \leq \sum_{j \neq i}\big \vert a_{i,j}\big \vert \big \vert x_j \big \vert = \big \vert x_i \big \vert \sum_{j \neq i}\big \vert a_{i,j}\big \vert \lt \big \vert a_{i,i}\big \vert \big \vert x_i \big \vert$.)  This means there is some maximal magnitude component of our vector, called $x_k$, as well as at least one $\big \vert x_i\big \vert \lt \big \vert x_k \big \vert $.  So we create a bipartition -- the set $U$ has all components of $\mathbf x$ where $\big \vert x_j \big \vert = \big \vert x_k\big \vert$, and $U^C$ has all other components of $\mathbf x$.  Since the underlying graph is irreducible, there must be a $p \in U$ and $q \in U^C$ (**tbc mechanics of why this holds**) such that $a_{p,q} \neq 0$.  Since $p \in U$ we can infer that $\big \vert x_q\big \vert = \big \vert x_k\big \vert$, which contradicts $q \in U^C$. 

**(needs cleaned up and finished)**  

The immediate corollary is:  

if $\mathbf A$ is irreducible, then if we revisit our Gerschgorin disc formula:  

$\big \vert a_{i,i} - \lambda \big \vert \leq \sum_{j\neq i} \big \vert a_{i,j}\big \vert$

we find that $\lambda$ can be an eigenvalue on the boundary of the union of discs **only if** it is a boundary point of all of the circular discs  

# end experimental / incomplete items  



This matrix is small enough (3x3 means a cubic characteristic polynomial) that we can solve symbolically for the eigenvalues exactly, but those eigenvalues -- given two cells down -- are not so easy to interpret.  This gets increasingly difficult for much larger matrices.  As is, we can simply look at (1,1,1) and see that the trace inequality is not being observed, hence the Hessian is not positive semi-definite at (1,1,1) and the function is thus not convex.  (We can of course also use the fact that the diagonal elements are never negative, and the last one is always strictly positive, to know that the function is *not concave* either.)  
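
To make the claim at (1,1,1) concrete, here is a minimal sketch that avoids any symbolic eigenvalue computation. Two assumptions to flag: the second partial derivatives in the hypothetical helper `hessian_at` were worked out by hand for $x^2y^2z^4 + z^2$, and the check used is a plain quadratic form $\mathbf v^T \mathbf H \mathbf v$ with a witness vector I picked myself. That is a swapped-in test, not necessarily the inequality referred to above.

```python
def hessian_at(x, y, z):
    """Hessian of f(x, y, z) = x**2 * y**2 * z**4 + z**2 (partials derived by hand)."""
    fxx = 2 * y**2 * z**4
    fyy = 2 * x**2 * z**4
    fzz = 12 * x**2 * y**2 * z**2 + 2
    fxy = 4 * x * y * z**4
    fxz = 8 * x * y**2 * z**3
    fyz = 8 * x**2 * y * z**3
    return [[fxx, fxy, fxz],
            [fxy, fyy, fyz],
            [fxz, fyz, fzz]]

def quadratic_form(H, v):
    """Compute v^T H v for a square matrix H stored as a list of rows."""
    n = len(v)
    return sum(v[i] * H[i][j] * v[j] for i in range(n) for j in range(n))

H = hessian_at(1, 1, 1)
witness = (1, -1, 0)   # hypothetical direction chosen for illustration
value = quadratic_form(H, witness)

print(H)
print(value)  # a negative value means H is not positive semi-definite here
```

A single direction with a negative quadratic form is enough to rule out positive semi-definiteness, so no eigenvalues are needed for this conclusion.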

There are *many* applications where we may want to make claims about the eigenvalues / singularity of a matrix without looking at specific numerical values.


# Application: Graph Laplacian

By construction we know that the Graph Laplacian is symmetric and real.  Thus we know all eigenvalues are real.  Further, we know that the diagonal entries are positive, and all off diagonal entries are either zero or negative.  An example is shown below.

$\mathbf L = 
\begin{bmatrix}
3 & -1 & 0 & -1 &  -1 & 0\\ 
-1 & 3 & -1 & 0 & -1 & 0\\ 
0 & -1 & 2 & 0 & -1 &0 \\ 
 -1& 0 & 0 & 2 & -1 & 0\\ 
-1 & -1 & -1 & -1 & 5 & -1\\ 
0 & 0 & 0 &0  & -1 & 1
\end{bmatrix}$


The minus ones correspond to edges and the diagonal values represent the degree of a given node (recall that the graph is undirected).  Thus each row sums to zero, and for any Laplacian, we know:

$\mathbf{L1} = \mathbf 0$

This, combined with the aforementioned structure (symmetric, positives on the diagonal, non-positives off the diagonal), tells us that the Laplacian is singular, and using Gerschgorin's discs we can observe that $\big \vert l_{i,i} - \lambda \big \vert \leq l_{i,i}$, which tells us that the smallest an eigenvalue can be is 0.  (If an eigenvalue were less than zero, its distance from the strictly positively valued $l_{i,i}$ would necessarily be more than $l_{i,i}$.)  Hence we observe that the graph Laplacian is Symmetric Positive Semi-Definite.  There are other ways to prove this fact, of course, but the Gerschgorin disc approach is extremely quick and intuitive.  
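
The disc argument can be checked directly on the example Laplacian above. This is a minimal sketch in plain Python; the helper name `gerschgorin_intervals` is my own, and it returns real intervals rather than discs, which is only appropriate here because the matrix is real symmetric (so its eigenvalues are real).

```python
# The 6 x 6 example Laplacian from the text.
L = [
    [ 3, -1,  0, -1, -1,  0],
    [-1,  3, -1,  0, -1,  0],
    [ 0, -1,  2,  0, -1,  0],
    [-1,  0,  0,  2, -1,  0],
    [-1, -1, -1, -1,  5, -1],
    [ 0,  0,  0,  0, -1,  1],
]

def gerschgorin_intervals(A):
    """For each row i, return (a_ii - r_i, a_ii + r_i) with r_i the off-diagonal absolute row sum."""
    intervals = []
    for i, row in enumerate(A):
        radius = sum(abs(a) for j, a in enumerate(row) if j != i)
        intervals.append((row[i] - radius, row[i] + radius))
    return intervals

n = len(L)
is_symmetric = all(L[i][j] == L[j][i] for i in range(n) for j in range(n))
row_sums = [sum(row) for row in L]          # all zero means L times the ones vector is 0
intervals = gerschgorin_intervals(L)
lower_bound = min(low for low, _ in intervals)
upper_bound = max(high for _, high in intervals)

print(is_symmetric, row_sums)
print(intervals)
print(lower_bound, upper_bound)
```

Every interval has its left endpoint at exactly 0 because each radius equals the diagonal entry (the degree), which is the observation $\big \vert l_{i,i} - \lambda \big \vert \leq l_{i,i}$ in numerical form.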

- - - -


# Other Applications


And again, whether we think of this in terms of eigenvalues with Gerschgorin discs, or Levy-Desplanques, we also have a way of knowing whether or not certain matrices are invertible without going through the full calculation -- i.e. if they are diagonally dominant, their invertibility should just jump off the page at you. 
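
As a small illustration of reading invertibility off the page, here is a sketch of the strict diagonal dominance test from the definition at the top of this post. The function names and both example matrices are hypothetical choices of mine, and the cofactor determinant is only there to confirm the conclusion on a tiny example. It is not something you would use on a large matrix.

```python
def is_strictly_diagonally_dominant(A):
    """True if |a_ii| > sum over j != i of |a_ij| for every row i (abs also handles complex entries)."""
    return all(
        abs(row[i]) > sum(abs(a) for j, a in enumerate(row) if j != i)
        for i, row in enumerate(A)
    )

def determinant(A):
    """Cofactor expansion along the first row; fine for tiny matrices only."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * determinant(minor)
    return total

# Hypothetical examples.
dominant = [[4, -1, 1],
            [2, 5, -2],
            [1, 1, 3]]
singular = [[1, 2],
            [2, 4]]

print(is_strictly_diagonally_dominant(dominant), determinant(dominant))
print(is_strictly_diagonally_dominant(singular), determinant(singular))
```

Note the test is only sufficient: a matrix that fails it may still be invertible, so a `False` result says nothing either way.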

Thus the ability to bound eigenvalues via the simple and intuitive Gerschgorin Discs gives us a new way of interpreting special structure in matrices.  

There are of course also numeric applications in engineering where matrices with special structures are used.  In these cases it may be nice to prove that the matrices are symmetric positive semi-definite in general, irrespective of their size -- whether they are 5x5 or 1,000 x 1,000 or more generally $n$ x $n$.  There may be alternative approaches that involve importing machinery like Cauchy Interlacing and then doing induction on $n$, but using Gerschgorin Discs gives a simple, direct and very visual way to prove this.  


For instance consider the matrix given on page number 34 (35 of 42 according to PDF viewer) here: https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-920j-numerical-methods-for-partial-differential-equations-sma-5212-spring-2003/lecture-notes/lec15.pdf

Just by looking at it, we can tell it is real valued and symmetric, so we know its eigenvalues are all real.  We can use Gerschgorin discs to determine that the minimum possible eigenvalue is zero.  Hence we know the matrix is at least Symmetric Positive Semi-Definite.  We can also easily look through the implicit Gram-Schmidt and determine that the first n - 1 columns must be linearly independent, and hence the rank of this $n$ x $n$ matrix is at least n - 1.  With a small bit more work, we can then determine that the final column must be linearly independent as well, and thus the matrix is Symmetric Positive Definite.  But the main point is that by simply eyeballing the symmetry, and knowing about Gerschgorin discs, we were able to have a deep and general understanding of the spectrum underlying this matrix for any finite $n$ x $n$ dimension that it may take on.  

Finally, Levy-Desplanques and Gerschgorin Discs are also used in numerical linear algebra for evaluating conditioning, making claims on whether pivoting is needed in Gaussian Elimination, and so on.

Note: with respect to time homogeneous finite state Markov chains, while there are more powerful approaches using the greatest common divisor (which generalize to countable state Markov chains), Gerschgorin discs (with a strictness refinement due to Taussky) immediately tell us that a graph with a single communicating class (irreducible) and even one self-loop cannot have periodic behavior.  The only point on the unit circle touching/inside of *all* Gerschgorin Discs is the value 1 (this is the Taussky refinement), hence the only eigenvalue with magnitude 1 is an eigenvalue of one.  All other eigenvalues have magnitude less than 1, so their powers may be made arbitrarily small after a large enough number of iterations.  The fact that the eigenvalue of 1 is simple (i.e. algebraic multiplicity of one) is of course given by Perron-Frobenius theory or standard Markov chain results from Kolmogorov.  

Of interest: it is also implied directly by the elementary renewal theorem with a delayed start -- or perhaps better, we could formulate this as a renewal rewards problem where a reward of one is given each time we visit any state in the graph, during a cycle starting and finishing at some arbitrary node $i$.  The renewal reward theorem (and perhaps common sense) tells us that we have a time averaged reward of 1 -- but this is equivalent to 

$1  =  \lim_{t \to \infty}\frac{E[r(t)]}{t}  =\lim_{t \to \infty} \frac{1}{t}\sum_{k=1}^t \text{trace}\big(\mathbf A^k\big)$  

However if the algebraic multiplicity of eigenvalue $1$ is larger than 1 (e.g. 2 or 3 or ...), then the time averaged trace must be at least $2$, which is a contradiction -- this proves that the eigenvalue of one is simple. (Note that this *also* proves simplicity of eigenvalue 1 for any irreducible time homogeneous finite state Markov chain -- including periodic chains.  Consider the matrix $\mathbf A$ and its eigenvalues.  Now consider the convex combination $\mathbf B := \frac{1}{2}\big(\mathbf A + \mathbf I\big)$. Here the graph in $\mathbf B$ is connected but has (many) self-loops and hence is aperiodic -- and the above tells us that $\mathbf B$ has a simple eigenvalue of $1$ and all others with magnitude less than $1$.  Yet each eigenvalue of $\mathbf B$ is the average of an eigenvalue of $\mathbf A$ and $1$.  Thus if $\mathbf A$ had multiple eigenvalues of $1$, so would $\mathbf B$, which tells us that $\mathbf A$ has a simple eigenvalue of $1$.)  
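
The time averaged trace is easy to watch numerically. Below is a minimal sketch in plain Python with three hypothetical 2 x 2 examples of my own: a periodic two-cycle `A`, its lazy version `B` (which is $\frac{1}{2}(\mathbf A + \mathbf I)$ written out by hand), and the identity as a reducible contrast where eigenvalue 1 has multiplicity two. The helper names are illustrative assumptions.

```python
def matmul(A, B):
    """Plain-Python product of two matrices stored as lists of rows."""
    inner = len(B)
    cols = len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def time_averaged_trace(A, t):
    """Compute (1/t) * sum_{k=1}^{t} trace(A^k)."""
    total = 0.0
    power = A
    for _ in range(t):
        total += trace(power)
        power = matmul(power, A)
    return total / t

A = [[0.0, 1.0], [1.0, 0.0]]         # irreducible, period 2
B = [[0.5, 0.5], [0.5, 0.5]]         # (A + I) / 2, irreducible and aperiodic
reducible = [[1.0, 0.0], [0.0, 1.0]] # two communicating classes

avg_A = time_averaged_trace(A, 1000)
avg_B = time_averaged_trace(B, 1000)
avg_reducible = time_averaged_trace(reducible, 1000)

print(avg_A, avg_B, avg_reducible)
```

In these small cases the average matches the algebraic multiplicity of eigenvalue 1: one for both irreducible chains (periodic or not), and two for the reducible one.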


In [1]:
import sympy as sp

# symbolic variables for the function under study
x, y, z = sp.symbols('x y z')

myfunc = x**2*y**2*z**4 + z**2

mylist = [x, y, z]

# first partials, then all second partials arranged as a 3x3 list of lists
gradient = [myfunc.diff(variable) for variable in mylist]
hessian = [[partial1.diff(variable) for variable in mylist] for partial1 in gradient]

hessianmatrix = sp.Matrix(hessian)

# eigenvals() returns a dict mapping each symbolic eigenvalue to its multiplicity
print(hessianmatrix.eigenvals())
# these are NOT easy to interpret and it is a very small Hessian!
# use a different tool!

{4*x**2*y**2*z**2 + 2*x**2*z**4/3 + 2*y**2*z**4/3 - (-1512*x**4*y**4*z**10 + 324*x**2*y**2*z**8 + sqrt((-3024*x**4*y**4*z**10 + 648*x**2*y**2*z**8 - (-108*x**2*y**2*z**2 - 18*x**2*z**4 - 18*y**2*z**4 - 18)*(-40*x**4*y**2*z**6 - 40*x**2*y**4*z**6 - 12*x**2*y**2*z**8 + 4*x**2*z**4 + 4*y**2*z**4) + 2*(-12*x**2*y**2*z**2 - 2*x**2*z**4 - 2*y**2*z**4 - 2)**3)**2 - 4*(120*x**4*y**2*z**6 + 120*x**2*y**4*z**6 + 36*x**2*y**2*z**8 - 12*x**2*z**4 - 12*y**2*z**4 + (-12*x**2*y**2*z**2 - 2*x**2*z**4 - 2*y**2*z**4 - 2)**2)**3)/2 - (-108*x**2*y**2*z**2 - 18*x**2*z**4 - 18*y**2*z**4 - 18)*(-40*x**4*y**2*z**6 - 40*x**2*y**4*z**6 - 12*x**2*y**2*z**8 + 4*x**2*z**4 + 4*y**2*z**4)/2 + (-12*x**2*y**2*z**2 - 2*x**2*z**4 - 2*y**2*z**4 - 2)**3)**(1/3)/3 + 2/3 - (120*x**4*y**2*z**6 + 120*x**2*y**4*z**6 + 36*x**2*y**2*z**8 - 12*x**2*z**4 - 12*y**2*z**4 + (-12*x**2*y**2*z**2 - 2*x**2*z**4 - 2*y**2*z**4 - 2)**2)/(3*(-1512*x**4*y**4*z**10 + 324*x**2*y**2*z**8 + sqrt((-3024*x**4*y**4*z**10 + 648*x**2*y**2*z**8 - (-108

# extension: include Cassini Disks

http://bwlewis.github.io/cassini/#br1

or better: work through the Cassini ovals stated (and then extended with graph properties), in this file: 

'CasiniOvals_extension.pdf'

located in Linear Algebra folder ...

also this seems to be quite good

http://planetmath.org/sites/default/files/texpdf/37503.pdf

also this: 

http://www.math.kent.edu/~varga/pub/paper_232.pdf



