# Kruskal's MST Algorithm

We take the approach of including edges from the cheapest first, until a spanning tree is created, being careful to skip edges that create cycles.

Pseudo-code
```
Kruskals():
    Sort edges in order of increasing cost

    T = {}

    for i=1 to m:
        if T union {i} has no cycles:
            add i to T
    
    return T
```

## Proofs

### Proof that the output is a spanning tree

Let $T^*$ be the output of Kruskal's algorithm. By construction, $T^*$ has no cycles. It remains to show that $T^*$ is connected.

Fix a cut $(A, B)$. Since the input graph is connected, at least one of its edges crosses $(A, B)$.

We know that Kruskal's algorithm considers every edge.

By the double crossing lemma, the first edge crossing $(A,B)$ that Kruskal's algorithm encounters cannot complete a cycle, since a cycle that crosses the cut must contain at least two crossing edges and no other crossing edge has been added yet.

Therefore, the first crossing edge that Kruskal's algorithm encounters will be included.

Therefore, there are no cuts with no crossing edges, and $T^*$ is connected.

### Proof that the spanning tree is minimal

We will show that every edge of $T^*$ satisfies the minimum cut property.

Consider an iteration where the algorithm adds the edge $(u, v)$ to the set $T$.

Since $T \cup \{(u, v)\}$ has no cycles, there does not currently exist a path between $u$ and $v$ in $T$.

Therefore there must exist an empty cut $(A, B)$, one that no edge of $T$ crosses, with $u$ and $v$ in separate parts.

Since Kruskal's algorithm iterates over sorted edges, the edge $(u, v)$ is guaranteed to be the cheapest edge across the $(A, B)$ cut. If there were a cheaper such edge, it would have been considered at an earlier iteration and, being the first edge to cross the cut, it would have been added to $T$, contradicting that the cut is empty.

## Straightforward Implementation

Sorting the edges will take $O(m\log{m}) = O(m\log{n})$ time.
$$
m = O(n^2) \\
\log{m} \leq 2\log{n} + c \\
\log{m} = O(\log{n})
$$

Iterating over the edges takes $O(m)$ time, however each cycle check will take $O(n)$ time.

In order to check for a cycle, we can carry out DFS or BFS on $T$, which has at most $n - 1$ edges and so is searched in $O(n)$ time.

This gives an overall running time of
$$
O(m\log{n}) + O(mn) = O(mn)
$$
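A minimal Python sketch of this straightforward version, assuming edges are given as `(cost, u, v)` tuples (the representation and the names `has_path` and `kruskal_naive` are chosen here for illustration). Before adding an edge it searches the tree built so far for a path between the endpoints, which is the $O(n)$ cycle check.

```python
from collections import defaultdict


def has_path(adj, source, target):
    """Iterative DFS over the tree edges chosen so far."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


def kruskal_naive(edges):
    """edges: iterable of (cost, u, v) tuples of a connected undirected graph."""
    adj = defaultdict(list)  # adjacency lists of the partial tree T
    tree = []
    for cost, u, v in sorted(edges):  # cheapest edge first
        # Adding (u, v) creates a cycle exactly when T already joins u and v.
        if not has_path(adj, u, v):
            adj[u].append(v)
            adj[v].append(u)
            tree.append((cost, u, v))
    return tree


example_edges = [(1, "a", "b"), (4, "a", "c"), (3, "b", "c"),
                 (2, "b", "d"), (5, "c", "d")]
mst = kruskal_naive(example_edges)
total_cost = sum(cost for cost, _, _ in mst)
```

On this small example the edges of cost 4 and 5 are skipped because their endpoints are already connected.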

## Union-Find

This data structure hopefully allows constant time cycle checking.

A Union-Find data structure maintains a partition of a set of objects and supports two operations

1. Find $x$ : -> name of the group that $x$ belongs to
2. Union $g_1$, $g_2$ : -> fuse the two groups into one

To implement the Union-Find,

1. maintain one linked structure per connected component 
2. each connected component has an arbitrarily chosen leader
3. each vertex contains a pointer to the leader of its component

To check if an edge $(u, v)$ will create a cycle, check if both nodes point to the same leader.
$$
\text{Find}(u) = \text{Find}(v) \iff \text{cycle exists}
$$
By looking up pointers this runs in $O(1)$ time.

In order to update leader pointers when performing a Union,

1. keep the leader with more children as the new leader
2. rewire the smaller leader's children.

A single Union can cost $O(n)$ rewires, but globally any one node has its leader pointer updated at most $O(\log{n})$ times, since the size of its group at least doubles every time it is rewired. This results in $O(n\log{n})$ overall rewiring cost.

Overall, this gives
1. $O(m\log{n})$ time for sorting
2. $O(m) \times O(1)$ time for cycle checks
3. $O(n\log{n})$ time for overall rewires

Which has overall $O(m\log{n})$ running time.
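The leader-pointer scheme can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the class name `EagerUnionFind`, the `members` lists used to locate the nodes to rewire, and the `(cost, u, v)` edge tuples are all assumptions made for the example.

```python
class EagerUnionFind:
    """Every item stores a direct pointer to the leader of its group."""

    def __init__(self, items):
        self.leader = {x: x for x in items}
        self.members = {x: [x] for x in items}  # leader -> items in its group

    def find(self, x):
        return self.leader[x]  # O(1) pointer lookup

    def union(self, x, y):
        a, b = self.leader[x], self.leader[y]
        if a == b:
            return
        # Keep the leader of the larger group, rewire the smaller group.
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a
        for item in self.members[b]:
            self.leader[item] = a
        self.members[a].extend(self.members.pop(b))


def kruskal_eager(vertices, edges):
    """edges: iterable of (cost, u, v) tuples."""
    uf = EagerUnionFind(vertices)
    tree = []
    for cost, u, v in sorted(edges):
        if uf.find(u) != uf.find(v):  # different leaders, so no cycle
            uf.union(u, v)
            tree.append((cost, u, v))
    return tree


example_edges = [(1, "a", "b"), (4, "a", "c"), (3, "b", "c"),
                 (2, "b", "d"), (5, "c", "d")]
mst = kruskal_eager("abcd", example_edges)
```

Rewiring the smaller group is what bounds the number of pointer updates per node by $O(\log{n})$.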

There is another way to do this,
1. When carrying out $\text{Find}(x)$, follow parent pointers until reaching the leader
2. When carrying out $\text{Union}(g_1, g_2)$ only rewire leaders.

This would increase the running time of $\text{Find}$ to $O(\log{n})$ (provided the unions keep the pointer trees shallow), and decrease the running time of $\text{Union}$ to $O(1)$ once the two leaders are known. Overall,

2. $O(m) \times O(\log{n})$ time for cycle checks
3. $O(m) \times O(1)$ time for rewires

Which has the same overall $O(m\log{n})$ running time.

I think the other implementation should be faster but this is yet to be checked.

## State of the Art MST

There is an [$O(m)$ randomized algorithm](https://cs.brown.edu/research/pubs/pdfs/1995/Karger-1995-RLT.pdf)! It is unknown whether there is a deterministic linear time algorithm.

There is an almost linear [$O(m\alpha(n))$ deterministic algorithm](https://www.cs.princeton.edu/~chazelle/pubs/mst.pdf), where $\alpha(n)$ is the inverse [Ackermann function](https://en.wikipedia.org/wiki/Ackermann_function).

The inverse Ackermann function grows SUPER slowly, in fact it grows much slower than
$$
\log^*{n} := 
\begin{cases}
0 & \text{if} \; n \leq 1 \\
1 + \log^*(\log{n}) & \text{otherwise}
\end{cases}
$$
Which simply put is the "inverse tetration" function.
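To get a feel for how slowly $\log^*$ grows, here is a small Python sketch of the definition above. It assumes the logarithm is base 2, which the definition leaves unspecified; `log_star` is a name chosen for this example.

```python
import math


def log_star(n):
    """Number of times log2 must be applied to n before the value is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


values = {n: log_star(n) for n in (1, 2, 4, 16, 65536)}
huge = log_star(2 ** 65536)  # a number with almost 20,000 decimal digits
```

Even for $2^{65536}$ the result is only 5.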

The Ackermann function can be stated as
$$
\begin{aligned}
&A(0, n) &&= \quad n+1 \\
&A(m+1, 0) &&= \quad A(m, 1) \\
&A(m+1, n+1) &&= \quad A(m, A(m+1, n))
\end{aligned}
$$
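The recurrence translates directly into Python. This sketch is only usable for very small arguments, since both the values and the recursion depth explode quickly; the function name `ackermann` is chosen for illustration.

```python
def ackermann(m, n):
    """Two-argument Ackermann function, following the recurrence above."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))


small_values = [ackermann(m, 3) for m in range(4)]
```

For fixed small $m$ the rows have closed forms: $A(1, n) = n + 2$, $A(2, n) = 2n + 3$ and $A(3, n) = 2^{n+3} - 3$.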

### Open Questions

- There is no known simple (undergraduate level) randomised linear time algorithm. It would be sufficient to find a simple linear time MST verification algorithm to obtain a simple randomised algorithm.

- Is there a deterministic linear time algorithm?

# Clustering



Given $n$ points in some space, we want to classify them into "coherent groups"

We provide a distance metric $d(p,q)$ between each pair of points that is symmetric, i.e.,
$$
d(p, q) = d(q, p)
$$

We aim to group "nearby" points by minimising the distance metric.


## Max-Spacing k-Clusterings

Assume that we know that there are $k$ clusters to look for.

Label two points $p, q$ as separated if they're assigned to different clusters.

Define the spacing of a k-clustering
$$
\min_{\text{separated} \; p, q}{d(p, q)}
$$
In other words, the distance between the closest two separated points.

Given a distance metric $d$ and some number of clusters $k$ we seek to compute a clustering that has the maximum possible spacing.

### A Greedy Algorithm

Starting with all points in their own clusters,
1. Find the closest pair of separated points
2. Fuse these two clusters into one
3. If this gives the correct number of clusters, return
4. Else repeat

This is the same as Kruskal's algorithm, stopped early!! It is known as single-link clustering.
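The greedy loop can be sketched in Python as Kruskal's algorithm stopped once $k$ clusters remain. This is a sketch under stated assumptions: points are arbitrary Python objects, `dist` is a user-supplied symmetric function, and the helper names `single_link_clusters` and `spacing` are chosen for this example. It uses a plain parent-pointer Union-Find for brevity.

```python
from itertools import combinations


def single_link_clusters(points, k, dist):
    """Merge the closest separated pair until only k clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    # All point pairs, cheapest first (the "edges" of Kruskal's algorithm).
    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda pair: dist(points[pair[0]], points[pair[1]]))
    clusters = len(points)
    for i, j in pairs:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # i and j are still separated, so fuse their clusters
            parent[ri] = rj
            clusters -= 1

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


def spacing(clusters, dist):
    """Distance between the closest pair of separated points."""
    return min(dist(p, q)
               for a, b in combinations(clusters, 2)
               for p in a for q in b)


points = [0, 1, 2, 10, 11, 20]
clusters = single_link_clusters(points, 3, lambda p, q: abs(p - q))
gap = spacing(clusters, lambda p, q: abs(p - q))
```

On these one-dimensional points the three clusters are `[0, 1, 2]`, `[10, 11]` and `[20]`, with spacing 8 (the gap between 2 and 10).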

### Correctness Proof

Let the greedy clustering be
$$
C_1, C_2, \cdots, C_k 
$$

Let an arbitrary clustering be
$$
\hat{C}_1, \hat{C}_2, \cdots, \hat{C}_k
$$

Define $S$ to be the spacing of a clustering
$$
S(C_1, C_2, \cdots, C_k) = \text{spacing of the clustering}
$$

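Let $s^*$ be the spacing of the greedy clustering. Showing that no other clustering has a larger spacing proves that $s^*$ is optimal.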

We seek to show that
$$
S(\hat{C}_1, \hat{C}_2, \cdots, \hat{C}_k) \leq s^*
$$

Case 1:

If $\hat{C}_i \sim C_i$ by renaming and re-ordering, then they have the same spacing.

Case 2:

Otherwise, we can find a pair of points $p,q$ such that
1. $p, q \in C_l$ for some greedy cluster $C_l$
2. $p \in \hat{C}_i$ while $q \in \hat{C}_j$, with $i \neq j$

The greedy algorithm guarantees that if two points $x, y$ have been directly merged, then
$$
d(x, y) \leq s^*
$$

This holds because the distances between successively merged pairs never decrease, and the spacing of the greedy clustering is the distance of the pair it would merge next.

Trivial case:

$p, q$ were directly merged by the algorithm. This would imply that
$$
d(p, q) \leq s^*
$$
Since $p$ and $q$ lie in separate clusters in the arbitrary clustering,
$$
S(\hat{C}_1, \hat{C}_2, \cdots, \hat{C}_k) \leq d(p, q) \leq s^*
$$

Tricky case:

$p, q$ were merged through multiple direct merges.

Let 
$$
p, a_1, a_2, \cdots, a_r, q
$$
be the path of direct greedy merges connecting $p$ and $q$

Since $p \in \hat{C}_i$ and $q \in \hat{C}_j$, there exists a consecutive pair of nodes $a_j, a_{j+1}$ on this path that bridges the gap, i.e. that lie in different clusters of the arbitrary clustering.

Since the consecutive nodes were directly merged,
$$
S(\hat{C}_1, \hat{C}_2, \cdots, \hat{C}_k) \leq d(a_j, a_{j+1}) \leq s^*
$$

# Union-Find in depth

## Lazy Unions

This is what I wrote about earlier, where we only rewire leaders during a union.

We consider the array representation of the parent relationships. Let $A[i]$ denote the name of $i$'s parent.

Consider the example where objects $1$ and $2$ belong to the same group with root $1$, and objects $3$, $4$, $5$ belong to a second group with root $4$. Fusing the two groups only resets the parent of root $4$ to $1$.
$$
\fbox{1}\fbox{1}\fbox{4}\fbox{4}\fbox{4} \\[10pt]
\rightarrow \quad \fbox{1}\fbox{1}\fbox{4}\fbox{1}\fbox{4}
$$

In general,

Define a root as a node $x$ where,
$$
\text{parent}[x] = x
$$

Initialise: for all $x$, $\text{parent}[x] := x$
$$
\text{Find}(x) : \text{traverse parent pointers until you hit a root} \\[10pt]
\begin{aligned}
    \text{Union}(x,y) : \;& s_1 = \text{Find}(x) \\
    & s_2 = \text{Find}(y) \\
    & \text{reset parent of one of} \; s_2 , s_1 \; \text{to the other}
\end{aligned}
$$
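A minimal Python sketch of these three operations, using a 0-indexed `parent` array. The class name `LazyUnionFind` is chosen for illustration, and the choice to always hang the second root under the first is arbitrary at this stage.

```python
class LazyUnionFind:
    def __init__(self, n):
        # Initialise: every object is its own root.
        self.parent = list(range(n))

    def find(self, x):
        # Traverse parent pointers until we hit a root.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        s1, s2 = self.find(x), self.find(y)
        if s1 != s2:
            self.parent[s2] = s1  # only one root pointer changes


uf = LazyUnionFind(5)
uf.union(0, 1)  # group {0, 1} with root 0
uf.union(3, 2)  # group {2, 3} with root 3
uf.union(3, 4)  # group {2, 3, 4} with root 3
uf.union(0, 4)  # fuse the groups: only root 3 is rewired
```

After the last union only `parent[3]` has changed, so objects 2 and 4 now reach their root in two hops rather than one.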

As we allow the parent graph depth to increase, we will need to be careful about how we perform unions such that our tree depth is not $O(n)$ but rather $O(\log{n})$

### Union by Rank Optimisation

For each object $x \in X$ maintain a field $\text{rank}[x]$ where
$$
\text{rank}[x] = \text{maximum number of hops from a leaf to} \; x
$$

Using the rank, we can be careful to always maintain the leader with higher rank as the top most node.

```
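Union(x,y):
    s1 = Find(x)
    s2 = Find(y)

    # already in the same group, nothing to do
    if s1 == s2:
        return

    # the root with the higher rank stays on top
    if rank[s1] > rank[s2]:
        parent[s2] = s1

    else:
        parent[s1] = s2
        # equal ranks: the new root becomes one hop taller
        if rank[s1] == rank[s2]:
            rank[s2] = rank[s2] + 1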
```

Properties of rank:

1. for all objects $x$, $\text{rank}[x]$ only increases over time
2. only the ranks of roots can increase $\implies$ once $x$ is not a root, $\text{rank}[x]$ is fixed
3. ranks strictly increase along any path to the root

#### Rank Lemma:

Consider an arbitrary sequence of $\text{Union}$ and $\text{Find}$ operations.
$$
\forall r \in \{0, 1, 2, \cdots\}
$$
there are at most
$$
\frac{n}{2^r}
$$
objects with rank $r$.

Corollary:
$$
\forall x, \quad \text{rank}[x] \leq \log_2{n}
$$
Therefore the worst-case running time of $\text{Find}$ and $\text{Union}$ is
$$
O(\log{n})
$$
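The lemma and its corollary can be checked empirically. The sketch below performs random unions by rank and then counts the objects of each rank; the function name `random_union_ranks`, the seed and the number of unions are arbitrary choices made for this example.

```python
import random
from collections import Counter


def random_union_ranks(n, seed=0):
    """Perform 4n random unions by rank and return the final rank array."""
    rng = random.Random(seed)
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for _ in range(4 * n):
        s1, s2 = find(rng.randrange(n)), find(rng.randrange(n))
        if s1 == s2:
            continue
        if rank[s1] > rank[s2]:
            parent[s2] = s1
        else:
            parent[s1] = s2
            if rank[s1] == rank[s2]:
                rank[s2] += 1  # equal ranks: the new root grows by one
    return rank


n = 1024
ranks = random_union_ranks(n)
counts = Counter(ranks)  # rank -> number of objects with that rank
lemma_holds = all(counts[r] <= n / 2 ** r for r in counts)
```

Whatever the random choices, the counts stay below $n / 2^r$ and no rank exceeds $\log_2{n} = 10$.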

#### Proof of Rank Lemma:

Claim 1:

If $x$, $y$ have the same rank r, then their subtrees are disjoint.

Claim 2:

The subtree of a rank r object has size $\geq 2^r$

With Claim 1 and Claim 2, the rank lemma is implied.

Proof of claim 1 (by contrapositive):

Suppose subtrees of $x$, $y$ have object $z$ in common.
$$
\exists \text{paths} \quad z \rightarrow x, \; z \rightarrow y 
$$

Since each object has only one parent pointer, there is only one path from $z$ to its root, so both $x$ and $y$ lie on this path.

This implies that one of $x$ or $y$ is an ancestor of the other.

This implies that the rank of $x$ cannot be the same as the rank of $y$, as the ancestor must have a strictly higher rank.

Proof of Claim 2 (By Induction):

Base Case: initially all ranks =0, subtree size = 1

Inductive Hypothesis:
$$
\text{subtree size} \geq 2^r
$$

Inductive step:

$\text{Union}(x, y)$ performs
$$
s_1 = \text{Find}(x) \quad s_2 = \text{Find}(y)
$$

The ranks of $s_1$ and $s_2$ will only change when,
$$
\text{rank}[s_1] = \text{rank}[s_2]
$$

In which case, the rank of $s_2$ will increase by 1. Furthermore, the subtree size of $s_2$ will increase to
$$
s_1 \; \text{old subtree size} + s_2 \; \text{old subtree size} \\[10pt]
\implies \text{new subtree size of} \; s_2 \geq 2^r + 2^r = 2^{r+1}
$$

### Path Compression Optimisation

This acts on top of union by rank. The idea is to rewire every object visited during a $\text{Find}$ directly to the root, so that long paths only need to be traversed once!

The rewiring only adds a constant factor of work to each $\text{Find}$. Furthermore, a subsequent $\text{Find}$ on any of the rewired objects takes $O(1)$ time for as long as their root stays a root.

#### Interacting with Rank

Ranks can be maintained in the same way.
1. Initialised to 0
2. in $\text{Union}$, new root := old root with bigger rank
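3. If the two ranks are equal, increment the rank of the new root

Do not change ranks after applying path compression. This means $\text{rank}[x]$ is now only an upper bound on the maximum number of hops from a leaf to $x$, but path compression does not change the Rank Lemma. Further,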
$$
\text{rank}[\text{parent}[x]] > \text{rank}[x]
$$
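A Python sketch of union by rank combined with path compression. The two-pass `find` (first locate the root, then rewire) is one common way to write it and is an implementation choice made here; the class name `UnionFind` is likewise illustrative.

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # First pass: locate the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every object on the path directly at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root  # ranks are deliberately left untouched

    def union(self, x, y):
        s1, s2 = self.find(x), self.find(y)
        if s1 == s2:
            return
        if self.rank[s1] > self.rank[s2]:
            self.parent[s2] = s1
        else:
            self.parent[s1] = s2
            if self.rank[s1] == self.rank[s2]:
                self.rank[s2] += 1


uf = UnionFind(4)
uf.union(0, 1)
uf.union(2, 3)
uf.union(0, 2)
before = list(uf.parent)  # object 0 is two hops from the root
root = uf.find(0)         # compresses the path 0 -> 1 -> 3
after = list(uf.parent)   # object 0 now points straight at the root
```

The ranks are identical before and after the `find`, so the rank of a parent is still strictly larger than the rank of its child.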


### Hopcroft-Ullman Theorem

Consider a Union-Find data structure implemented with
1. Union by Rank
2. Path Compression

Then $m$ $\text{Union}$ and $\text{Find}$ operations will take,
$$
O(m \log^*{n})
$$
time. Inverse tetration!!

#### Proof of Hopcroft-Ullman Theorem

We seek to measure the improvement over subsequent $\text{Find}$ and $\text{Union}$ operations.

Consider a non-root object $x$. We can define a measure of progress for $x$ as
$$
\text{rank}[\text{parent}[x]] - \text{rank}[x]
$$

The difference allows us to get a handle on the worst case number of parent hops before landing at the root. A small difference indicates that the parent pointer moves slowly up the rank space, while a larger difference indicates faster progress towards a root.

Applying path compression will improve this progress measure, since if $x$ has old parent $p$ and a different new parent $p^\prime$, then
$$
\text{rank}[p^\prime] > \text{rank}[p]
$$

Rank blocks:
$$
\{0\}, \; \{1\}, \; \{2, 3, 4\}, \; \{5, \cdots, 2^4\}, \; \{17, \cdots, 2^{16}\}, \cdots
$$
Beyond the first few blocks, each rank block is the set of numbers in the range
$$
\{k+1, \cdots, 2^k\}
$$
where $k$ is the largest number of the previous rank block.

In general for $n$ elements, there will be
$$
O(\log^*{n})
$$
rank blocks.
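The rank blocks can be generated with a few lines of Python. This sketch seeds the list with the three small blocks listed above and then applies the general rule; `rank_blocks` is a name chosen for the example, and each block is represented as a `(lowest, highest)` pair.

```python
def rank_blocks(max_rank):
    """Rank blocks as (lowest, highest) pairs, enough to cover max_rank."""
    blocks = [(0, 0), (1, 1), (2, 4)]  # the first few blocks listed above
    while blocks[-1][1] < max_rank:
        k = blocks[-1][1]  # largest number of the previous block
        blocks.append((k + 1, 2 ** k))
    return blocks


blocks_16 = rank_blocks(16)
blocks_65536 = rank_blocks(65536)
```

Ranks never exceed $\log_2{n}$, so even for $n = 2^{65536}$ objects only five blocks are needed.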

We will use rank blocks to assess the gap between the rank of a non-root node and the rank of its parent. Generally, if the two ranks are in different rank blocks, we think of this as "fast progress".

Let's call an object $x$ "good" if either
1. $x$ or $x$'s parent is a root
2. $\text{rank}[\text{parent}[x]]$ is in a larger rank block than $\text{rank}[x]$

Otherwise $x$ is "bad"

A single $\text{Find}$ visits at most
$$
2 + \text{\# of rank blocks} = O(\log^*{n})
$$
good nodes, since every hop out of a good node that is neither a root nor the child of a root moves up to a larger rank block.

Total work done during $m$ operations =
$$
O(m \log^*{n}) + \; \text{total \# of visits to bad nodes}
$$

Consider a rank block $\{k+1, k+2, \cdots, 2^k\}$. 

When a bad node $x$ is visited, its parent is changed to one with a strictly larger rank. Further, note that since $x$ is bad, $x$ is neither a root node, nor a direct child of a root node.

Since the rank of $x$'s parent strictly increases with every such visit, its parent can be changed a maximum of $2^k$ times before the parent's rank lands in a larger rank block and $x$ becomes a good node (forevermore).
$$
\text{\# of times a bad node can be visited} \leq 2^k
$$

By the rank lemma, the total number of objects $x$ with final rank in this rank block is at most
$$
\sum_{i=k+1}^{2^k}{\frac{n}{2^i}} \leq \frac{n}{2^k}
$$

Each such object is visited at most $2^k$ times while bad, so there are at most $n$ visits to bad nodes in each rank block.

Therefore, since there are only $O(\log^*{n})$ rank blocks, the total work done on bad nodes is 
$$
O(n\log^*{n})
$$


### Tarjan's Bound

For Union-Find data structure implemented with Union by Rank and path compression, $m$ $\text{Union}$ and $\text{Find}$ operations will take 
$$
O(m\alpha(n))
$$
Where $\alpha(n)$ is the inverse Ackermann function.

#### The Ackermann Function

Define $A_k(r)$ for all $k \geq 0 $, $r \geq 1$, $k, r \in \mathbb{Z}$
$$
A_0(r) = r + 1 
$$
for $k, r \geq 1$
$$
\begin{aligned}
A_k(r) &= \; \text{apply} \; A_{k-1} \; r \; \text{times to} \; r \\
& = \underbrace{(A_{k-1} \circ A_{k-1} \circ \cdots \circ A_{k-1})}_{r \; \text{times}}(r)
\end{aligned}
$$

Some initial cases,
$$
\begin{aligned}
A_1(r) &= \underbrace{(A_{0} \circ A_{0} \circ \cdots \circ A_{0})}_{r \; \text{times}}(r) \\
&=\underbrace{(\cdots(((r +1) +1) +1) \cdots ) +1}_{r \; \text{times}} \\
&= 2r \\[20pt]

A_2(r) &= \underbrace{(A_{1} \circ A_{1} \circ \cdots \circ A_{1})}_{r \; \text{times}}(r) \\
&=\underbrace{2 \times (\cdots(2 \times (2 \times (r))) \cdots )}_{r \; \text{times}} \\
&= r2^r \\[20pt]

A_3(r) &= \underbrace{(A_{2} \circ A_{2} \circ \cdots \circ A_{2})}_{r \; \text{times}}(r) \\
&> 2^{2^{2^{\cdot^{\cdot^{\cdot^{r}}}}}} \\[20pt]
\end{aligned}
$$
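This definition is easy to evaluate for small arguments. The Python sketch below follows the recurrence literally (apply $A_{k-1}$ to $r$, $r$ times); the name `ackermann_k` is chosen for the example, and only tiny inputs are feasible.

```python
def ackermann_k(k, r):
    """A_k(r): apply A_{k-1} to r, r times, with A_0(r) = r + 1."""
    if k == 0:
        return r + 1
    value = r
    for _ in range(r):
        value = ackermann_k(k - 1, value)
    return value


doubling = [ackermann_k(1, r) for r in range(1, 6)]
a2 = [ackermann_k(2, r) for r in range(1, 6)]
a3_of_2 = ackermann_k(3, 2)
```

The results agree with the closed forms $A_1(r) = 2r$ and $A_2(r) = r2^r$, and $A_3(2) = A_2(A_2(2)) = A_2(8) = 2048$.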

#### The Inverse Ackermann Function

For our cases, we define the inverse Ackermann function for every $n\geq 4$
$$
\alpha(n) = \text{minimum value of} \; k \; \text{such that} \; A_k(2) \geq n
$$

For the first few cases,
$$
\begin{aligned}
&\text{for} \; n = 4, \\
&&\alpha(n) = 1 \\
&\text{for} \; n = 5, 6, 7, 8 \\
&&\alpha(n) = 2 \\
&\text{for} \; n = 9, 10, \cdots, 2048 \\
&&\alpha(n) = 3\\
&\text{for} \; n = 2049, 2050, \cdots,  \approx{^{2048}2}\\
&&\alpha(n) = 4\\
\vdots
\end{aligned}
$$
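The definition of $\alpha(n)$ can be evaluated directly, as long as the computation of $A_k(2)$ is cut off as soon as it reaches $n$ (otherwise $A_4(2)$ would never finish). The sketch below does this; the names `ackermann_capped` and `inverse_ackermann` and the capping trick are choices made for this example.

```python
def ackermann_capped(k, r, cap):
    """min(A_k(r), cap), stopping as soon as the value reaches cap."""
    if r >= cap:
        return cap
    if k == 0:
        return min(r + 1, cap)
    value = r
    for _ in range(r):
        value = ackermann_capped(k - 1, value, cap)
        if value >= cap:
            return cap
    return value


def inverse_ackermann(n):
    """Smallest k with A_k(2) >= n, intended for n >= 4."""
    k = 0
    while ackermann_capped(k, 2, n) < n:
        k += 1
    return k


alphas = {n: inverse_ackermann(n) for n in (4, 5, 8, 9, 2048, 2049, 10 ** 6)}
```

The jump from 3 to 4 happens just after $n = 2048 = A_3(2)$, and the value then stays at 4 for every input of practical size.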

#### Proof of Tarjan's Bound

We will proceed similarly to the proof of the Hopcroft-Ullman Theorem. However, instead of arguing that the rank of the parent increases by at least 1, we will seek to show that the rank of a new parent is typically much larger than the rank of the old parent.

Consider a non-root object $x$, define
$$
\delta(x) = \max{k} \; \text{such that} \;{(\text{rank}[\text{parent}[x]] \geq A_k(\text{rank}[x]))}
$$

Some examples,
$$
\text{rank}[\text{parent}[x]] \geq \text{rank}[x] + 1 \iff \delta(x) \geq 0 \\[10pt]

\text{rank}[\text{parent}[x]] \geq 2 \times \text{rank}[x] \iff \delta(x) \geq 1 \\[10pt]

\text{rank}[\text{parent}[x]] \geq \text{rank}[x] \times 2^{\text{rank}[x]} \iff \delta(x) \geq 2 \\[10pt]
$$

The larger the gap between the rank of a non-root object and the rank of its parent, the larger $\delta(x)$ becomes. Further,
$$
\text{Since} \; A_{\alpha(n)}(2) \geq n \\[10pt]
\forall x \; \text{with} \; \text{rank}[x] \geq 2, \\
\delta(x) \leq \alpha(n)
$$

An object $x$ is defined to be "bad" if all of the following are true:
1. $x$ is not a root
2. $\text{parent}[x]$ is not a root
3. $\text{rank}[x] \geq 2$
4. $x$ has an ancestor $y$ with $\delta(y) = \delta(x)$

Otherwise $x$ is good

Lemma: For any object-root path there are at most $O(\alpha(n))$ good objects.
$$
\begin{aligned}
\text{\# of good nodes} &= 1 \; \text{root} \\
&+1 \; \text{child of root} \\
&+1 \; \text{of rank} \; 0 \\
&+1 \; \text{of rank} \; 1 \\
&\quad \text{for each} \; k = 0, 1, 2, \cdots, \alpha(n): \\
& \quad \quad +1 \; \text{object} \\
&= O(\alpha(n))
\end{aligned}
$$

For the last statement, we consider all objects in the path that have
$$
\delta(x) = k
$$
There can be at most one such good object: the one highest in the tree. Every other object on the path with this $\delta$ value has that object as an ancestor with the same $\delta$ value, and is hence bad.

In general, the total work for $m$ operations will be
$$
= \underbrace{O(m\alpha(n))}_{\text{visits to good objects}} + \text{total \# of visits to bad objects}
$$

Suppose a $\text{Find}$ operation visits a bad object $x$.
$$
\cdots \rightarrow \fbox{$x$} \rightarrow \fbox{$p$} \rightarrow \cdots \rightarrow \fbox{$y$} \rightarrow \fbox{$p^\prime$} \rightarrow \cdots
$$
Where,
$$
\delta(x) = k \\
\delta(y) = k
$$
After applying path compression, $x$'s new parent will be $p^\prime$ or higher.

Let $p^*$ be the new parent of $x$, the root of the tree. Applying path compression
$$
\implies \text{rank}[p^*] \geq \text{rank}[p^\prime]\geq A_k(\text{rank}[y]) \geq A_k(\text{rank}[p])
$$
Therefore we deduce that each such path compression applies the $A_k$ function to the rank of $x$'s parent.

If $r = \text{rank}[x]$ and $r \geq 2$, then after $r$ such pointer updates we have
$$
\begin{aligned}
\text{rank}[\text{parent}[x]] &\geq \underbrace{(A_{k} \circ A_{k} \circ \cdots \circ A_{k})}_{r \; \text{times}}(r) \\
& = A_{k+1}(r)
\end{aligned}
$$

Therefore after every $r$ visits to a bad $x$, the value of $\delta(x)$ must increment.
$$
\implies \text{\# of visits to bad} \; x \leq r \times \alpha(n)
$$

Therefore the total amount of work done on bad objects
$$
\leq \sum_{\text{objects} \; x}{\text{rank}[x]} \times \alpha(n)\\[10pt]
= \alpha(n) \sum_{r \geq 0}{r \times \text{\# objects with rank} \; r} \\[10pt]
\leq \alpha(n) \sum_{r \geq 0}{r \times \frac{n}{2^r}} \\[10pt]
= n\alpha(n) \sum_{r \geq 0}{\frac{r}{2^r}}
$$
We can evaluate the sum

Note that by geometric series,
$$
\sum_{r \geq 0}{\frac{a}{2^r}} = \frac{a}{1 - \frac{1}{2}}
$$
Then,
$$
\begin{aligned}
\sum_{r \geq 0}{\frac{r}{2^r}} &= 0 +\frac{1}{2^1} +\frac{2}{2^2} +\frac{3}{2^3} +\frac{4}{2^4} + \frac{5}{2^5} + \cdots \\[10pt]
&\begin{alignedat}{5}
&=\frac{1}{2^1} &+\frac{1}{2^2} &+\frac{1}{2^3} &+\frac{1}{2^4} &+\frac{1}{2^5} + \cdots \\[10pt]
& &+\frac{1}{2^2} &+\frac{1}{2^3} &+\frac{1}{2^4} &+\frac{1}{2^5} + \cdots \\[10pt]
& & &+\frac{1}{2^3} &+\frac{1}{2^4} &+\frac{1}{2^5} + \cdots \\[10pt]
& & & &+\frac{1}{2^4} &+\frac{1}{2^5} + \cdots \\[10pt]
& & & & &+\frac{1}{2^5} + \cdots \\[10pt]
& & & & &\quad \quad \vdots
\end{alignedat} \\[20pt]
&= \sum_{i \geq 1}{\sum_{r \geq i}{\frac{1}{2^r}}} \\
&= \sum_{i \geq 1}{\left(\frac{1}{2^i} \times \frac{1}{1-\frac{1}{2}}\right)} \\
&= 2 \times \sum_{i \geq 1}{\frac{1}{2^i}} \\
& = 2 \times \frac{\frac{1}{2}}{1 - \frac{1}{2}} = 2
\end{aligned}
$$
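As a sanity check on this value, the partial sums can be computed exactly in Python with `fractions.Fraction`. The helper name `partial_sum` is chosen for this example; the closed form used for comparison, $2 - (N + 2)/2^N$ for the sum up to $r = N$, is easy to confirm by induction.

```python
from fractions import Fraction


def partial_sum(n_terms):
    """Exact value of the sum of r / 2^r for r = 0, ..., n_terms."""
    return sum(Fraction(r, 2 ** r) for r in range(n_terms + 1))


first_sums = [partial_sum(n) for n in range(5)]
gap = 2 - partial_sum(50)  # distance from the limit after 50 terms
```

The partial sums increase towards 2 and the remaining gap shrinks geometrically.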

Therefore the work done on bad nodes will be
$$
O(n\alpha(n))
$$
Making the overall work, assuming $m = \Omega(n)$,
$$
O((m + n )\alpha(n)) = O(m \alpha(n))
$$