# Kruskal's MST Algorithm

We add edges in order of increasing cost, skipping any edge that would create a cycle, until a spanning tree is formed.

Pseudo-code
```
Kruskals():
    Sort edges in order of increasing cost, and number them 1..m

    T = {}

    for i = 1 to m:
        if T union {i} has no cycles:
            add i to T

    return T
```

## Proofs

### Proof that the output is a spanning tree

Let $T^*$ be the output of Kruskal's algorithm. By construction $T^*$ has no cycles, so it remains to show that $T^*$ is connected.

Fix a cut $(A, B)$. Since the input graph is connected, at least one of its edges crosses $(A, B)$.

Kruskal's algorithm considers every edge exactly once, so at some point it encounters the first edge that crosses $(A, B)$.

By the double crossing lemma, a cycle that crosses $(A, B)$ must cross it at least twice. When the algorithm encounters the first edge crossing $(A, B)$, $T$ contains no other crossing edge, so adding this edge cannot create a cycle.

Therefore, the first crossing edge that Kruskal's algorithm encounters will be included.

Hence every cut is crossed by at least one edge of $T^*$, and $T^*$ is connected.

### Proof that the spanning tree is minimal

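We will show that every edge of $T^*$ satisfies the cut property, i.e. it is the cheapest edge crossing some cut.

Consider an iteration where the algorithm adds the edge $(u, v)$ to the set $T$.

Since $T \cup \{(u, v)\}$ has no cycles, there does not currently exist a path between $u$ and $v$ in $T$.

Therefore there must exist a cut $(A, B)$, crossed by no edge of $T$, with $u$ and $v$ in separate parts.

Since Kruskal's algorithm iterates over sorted edges, the edge $(u, v)$ is guaranteed to be the cheapest edge across the $(A, B)$ cut. If there were a cheaper such edge, it would have been considered at an earlier iteration of the algorithm, and as the first edge crossing the cut it would have been added to $T$, contradicting the fact that no edge of $T$ crosses $(A, B)$.

Assuming distinct edge costs, the cut property then implies that every edge of $T^*$ belongs to the minimum spanning tree, so $T^*$ is minimal.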

## Straightforward Implementation

Sorting the edges will take $O(m\log{m}) = O(m\log{n})$ time.
$$
m = O(n^2) \\
\log{m} \leq 2\log{n} + c \\
\log{m} = O(\log{n})
$$

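The loop runs for $O(m)$ iterations, but each iteration must check for a cycle, which takes $O(n)$ time.

To check whether adding $(u, v)$ creates a cycle, we can run DFS or BFS from $u$ in $T$ and see whether $v$ is reachable. This is linear in the size of $T$, which has at most $n - 1$ edges.

This gives an overall running time of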
$$
O(m\log{n}) + O(mn) = O(mn)
$$
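The straightforward version can be sketched in Python as follows. This is a minimal sketch, not a tuned implementation: the edge format `(cost, u, v)` and the names `has_path`, `kruskal_naive` and `example_edges` are assumptions made for illustration.

```python
from collections import defaultdict


def has_path(adj, source, target):
    """Iterative DFS over the tree edges chosen so far.

    T never holds more than n - 1 edges, so this costs O(n).
    """
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


def kruskal_naive(edges):
    """edges: iterable of (cost, u, v) tuples. Returns the chosen edges."""
    adj = defaultdict(list)  # adjacency lists of the partial tree T
    tree = []
    for cost, u, v in sorted(edges):      # O(m log n) sort
        if not has_path(adj, u, v):       # O(n) cycle check per edge
            adj[u].append(v)
            adj[v].append(u)
            tree.append((cost, u, v))
    return tree


# Hypothetical 4-vertex example graph.
example_edges = [(1, "a", "b"), (4, "a", "c"), (3, "b", "c"), (2, "b", "d"), (5, "c", "d")]
mst = kruskal_naive(example_edges)
print(mst)
```

On this example the edges of cost 1, 2 and 3 are kept, and the edges of cost 4 and 5 are skipped because their endpoints are already connected.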

## Union-Find

The aim of this data structure is to allow constant time cycle checking.

A Union-Find data structure maintains a partition of a set of objects and supports two operations:

1. Find $x$ : -> name of the group that $x$ belongs to
2. Union $g_1$, $g_2$ : -> fuse the two groups into one

To implement the Union-Find,

1. maintain one linked structure per connected component 
2. each connected component has an arbitrarily chosen leader
3. each vertex contains a pointer to the leader of its component

To check if an edge $(u, v)$ will create a cycle, check if both nodes point to the same leader.
$$
\text{Find}(u) = \text{Find}(v) \iff \text{adding } (u, v) \text{ creates a cycle}
$$
Each Find is a single pointer lookup, so this check runs in $O(1)$ time.

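In order to update leader pointers when performing a Union,

1. keep the leader of the larger group as the new leader
2. rewire the members of the smaller group to point to it.

A single Union can still rewire $O(n)$ pointers. Globally, however, any one node has its leader pointer updated only $O(\log{n})$ times, because each update at least doubles the size of the group it belongs to. With $n$ nodes and $O(1)$ work per pointer update, this results in $O(n\log{n})$ overall rewiring cost.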

Overall, this gives
1. $O(m\log{n})$ time for sorting
2. $O(m) \times O(1)$ time for cycle checks
3. $O(n\log{n})$ time for overall rewires

Which has overall $O(m\log{n})$ running time.
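A minimal Python sketch of this leader-pointer scheme, plugged into Kruskal's algorithm. The class name `EagerUnionFind`, the `members` lists used to find the smaller group, and the example graph are assumptions made for illustration, not part of the notes above.

```python
class EagerUnionFind:
    """Every object stores a direct pointer to the leader of its group."""

    def __init__(self, items):
        self.leader = {x: x for x in items}
        self.members = {x: [x] for x in items}  # leader -> objects in its group

    def find(self, x):
        return self.leader[x]  # one lookup, O(1)

    def union(self, g1, g2):
        """Fuse the groups whose leaders are g1 and g2."""
        if g1 == g2:
            return
        # Keep the leader of the larger group; rewire the smaller group.
        if len(self.members[g1]) < len(self.members[g2]):
            g1, g2 = g2, g1
        for x in self.members[g2]:
            self.leader[x] = g1
        self.members[g1].extend(self.members.pop(g2))


def kruskal(vertices, edges):
    """edges: iterable of (cost, u, v) tuples. Returns the chosen edges."""
    uf = EagerUnionFind(vertices)
    tree = []
    for cost, u, v in sorted(edges):
        leader_u, leader_v = uf.find(u), uf.find(v)
        if leader_u != leader_v:  # different groups, so no cycle is created
            uf.union(leader_u, leader_v)
            tree.append((cost, u, v))
    return tree


# Hypothetical 4-vertex example graph.
example_edges = [(1, "a", "b"), (4, "a", "c"), (3, "b", "c"), (2, "b", "d"), (5, "c", "d")]
mst = kruskal(list("abcd"), example_edges)
print(mst)
```

Storing the member list per leader is one simple way to make "rewire the smaller group" cheap to implement; other layouts (for example linked lists) would work equally well.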

There is another way to do this,
1. When carrying out $\text{Find}(x)$, follow parent pointers until the leader is reached
2. When carrying out $\text{Union}(g_1, g_2)$, only rewire the leader of one group to point to the leader of the other.

This would increase the running time of $\text{Find}$ to $O(\log{n})$ (provided the unions keep the trees shallow, e.g. with union by rank), and decrease the rewiring cost of $\text{Union}$ to $O(1)$. Overall,

1. $O(m\log{n})$ time for sorting
2. $O(m) \times O(\log{n})$ time for cycle checks
3. $O(m) \times O(1)$ time for rewires

Which has the same overall $O(m\log{n})$ running time.

I think the other implementation should be faster but this is yet to be checked.

## State of the Art MST

There is an [$O(m)$ randomized algorithm](https://cs.brown.edu/research/pubs/pdfs/1995/Karger-1995-RLT.pdf)! It is unknown whether there is a deterministic linear time algorithm.

There is an almost linear [$O(m\alpha(n))$ deterministic algorithm](https://www.cs.princeton.edu/~chazelle/pubs/mst.pdf), where $\alpha(n)$ is the inverse [Ackermann function](https://en.wikipedia.org/wiki/Ackermann_function).

The inverse Ackermann function grows extremely slowly, in fact it grows much slower than
$$
\log^*{n} := 
\begin{cases}
0 & \text{if} \; n \leq 1 \\
1 + \log^*(\log{n}) & \text{otherwise}
\end{cases}
$$
Simply put, $\log^*$ is the "inverse tetration" function: it counts how many times $\log$ must be applied to $n$ before the result drops to at most $1$.
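To get a feel for how slowly $\log^*$ grows, here is a small Python sketch of the definition above. The base of the logarithm is not fixed by the definition; base 2 is an assumption made here.

```python
import math


def log_star(n):
    """Number of times log2 must be applied before the value is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)  # assumption: logarithm base 2
        count += 1
    return count


for n in (1, 2, 4, 16, 65536, 2 ** 65536):
    print(log_star(n))
```

The six inputs are the tower of twos $1, 2, 2^2, 2^{2^2}, \ldots$, so each one raises $\log^*$ by exactly one. Even $2^{65536}$, a number with nearly twenty thousand decimal digits, only reaches $5$.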

The Ackermann function can be stated as
$$
\begin{aligned}
&A(0, n) &&= \quad n+1 \\
&A(m+1, 0) &&= \quad A(m, 1) \\
&A(m+1, n+1) &&= \quad A(m, A(m+1, n))
\end{aligned}
$$
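The recurrence can be evaluated directly for very small arguments. The sketch below uses an explicit stack instead of recursion, a choice made here only to avoid Python's recursion limit; the function name `ackermann` is an assumption.

```python
def ackermann(m, n):
    """Evaluate A(m, n) from the three-case recurrence using an explicit stack."""
    stack = [m]  # pending first arguments; n always holds the current second argument
    while stack:
        m = stack.pop()
        if m == 0:
            n += 1                 # A(0, n) = n + 1
        elif n == 0:
            stack.append(m - 1)    # A(m, 0) = A(m - 1, 1)
            n = 1
        else:
            stack.append(m - 1)    # outer call A(m - 1, ...)
            stack.append(m)        # inner call A(m, n - 1), evaluated first
            n -= 1
    return n


print(ackermann(1, 2), ackermann(2, 3), ackermann(3, 3))
```

Only tiny inputs are feasible: $A(3, 3) = 61$ is immediate, but $A(4, 2)$ already has thousands of digits, which is why the inverse function is so small for any practical $n$.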

### Open Questions

- No simple (undergraduate level) randomised linear time algorithm is known. To obtain one, it would be sufficient to find a simple linear time algorithm for the MST verification problem.

- Is there a deterministic linear time algorithm?

# Clustering



Given $n$ points in some space, we want to classify them into "coherent groups".

We provide a distance metric $d(p,q)$ between each pair of points that is symmetric, i.e.,
$$
d(p, q) = d(q, p)
$$

We aim to place "nearby" points, those with a small distance $d$, in the same group.


## Max-Spacing k-Clusterings

Assume that we know that there are $k$ clusters to look for.

Label two points $p, q$ as separated if they're assigned to different clusters.

Define the spacing of a $k$-clustering as
$$
\min_{\text{separated} \; p, q}{d(p, q)}
$$
In other words, the distance between the closest two separated points.
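As a concrete instance of this definition, the sketch below computes the spacing of a given clustering by brute force. The representation (a list of clusters, each a list of distinct hashable points) and the name `spacing` are assumptions made for illustration.

```python
from itertools import combinations


def spacing(clusters, d):
    """Minimum distance over all pairs of points that lie in different clusters."""
    label = {p: i for i, cluster in enumerate(clusters) for p in cluster}
    return min(
        d(p, q)
        for p, q in combinations(label, 2)
        if label[p] != label[q]  # only separated pairs count
    )


# Hypothetical example: points on a line with distance |p - q|.
line_distance = lambda p, q: abs(p - q)
print(spacing([[0, 1], [5, 6], [20]], line_distance))
```

Here the closest separated pair is $1$ and $5$, so the spacing is $4$. The sketch assumes at least two clusters, otherwise there are no separated pairs to minimise over.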

Given a distance metric $d$ and some number of clusters $k$ we seek to compute a clustering that has the maximum possible spacing.

### A Greedy Algorithm

Starting with all points in their own clusters,
1. Find the closest pair of separated points
2. Fuse their two clusters into one
3. If this gives the correct number of clusters, return
4. Else repeat

This is the same as Kruskal's algorithm, stopped early once $k$ connected components remain! It is known as single-link clustering.
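The greedy procedure can be sketched in Python as Kruskal's algorithm with an early stop. This is a minimal sketch under assumptions: points are distinct and hashable, all $O(n^2)$ pairs are sorted up front, and the name `single_link_clusters` is made up for illustration.

```python
from itertools import combinations


def single_link_clusters(points, d, k):
    """Greedily fuse the closest separated pair until k clusters remain."""
    parent = {p: p for p in points}  # simple union-find over the points

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    # Every pair of points plays the role of an edge, sorted by distance.
    pairs = sorted(combinations(points, 2), key=lambda pq: d(*pq))
    clusters = len(points)
    for p, q in pairs:
        if clusters == k:
            break
        root_p, root_q = find(p), find(q)
        if root_p != root_q:  # p and q are separated: fuse their clusters
            parent[root_p] = root_q
            clusters -= 1

    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())


# Hypothetical example: points on a line with distance |p - q|.
line_distance = lambda p, q: abs(p - q)
print(single_link_clusters([0, 1, 5, 6, 20], line_distance, 3))
```

With $k = 3$ the two merges of distance $1$ are performed and the algorithm stops, leaving the groups $\{0, 1\}$, $\{5, 6\}$ and $\{20\}$.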

### Correctness Proof

Let the greedy clustering be
$$
C_1, C_2, \cdots, C_k 
$$

Let an arbitrary clustering be
$$
\hat{C}_1, \hat{C}_2, \cdots, \hat{C}_k
$$

Define $S$ to be the spacing of a clustering
$$
S(C_1, C_2, \cdots, C_k) = \text{spacing of the clustering}
$$

Let $s^*$ be the spacing of the greedy clustering, $s^* = S(C_1, C_2, \cdots, C_k)$.

We seek to show that no other clustering has a larger spacing, i.e.
$$
S(\hat{C}_1, \hat{C}_2, \cdots, \hat{C}_k) \leq s^*
$$

Case 1:

If the $\hat{C}_i$ are the same as the $C_i$ up to renaming and re-ordering, then the two clusterings have the same spacing.

Case 2:

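Otherwise, we can find a pair of points $p,q$ that the greedy clustering keeps together but the arbitrary clustering separates, such that
1. $p, q \in C_i$
2. $p \in \hat{C}_i$ while $q \in \hat{C}_j$ with $j \neq i$

The property of the greedy algorithm indicates that if two points $x, y$ have been directly merged, then
$$
d(x, y) \leq s^*
$$

This holds because the distances between merged pairs, in order, are non-decreasing, and $s^*$ is the distance of the pair that would have been merged next.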

Trivial case:

$p, q$ were directly merged by the algorithm. This would imply that
$$
d(p, q) \leq s^*
$$
Since $p$ and $q$ lie in separate clusters in the arbitrary clustering,
$$
S(\hat{C}_1, \hat{C}_2, \cdots, \hat{C}_k) \leq d(p, q) \leq s^*
$$

Tricky case:

$p, q$ were merged through multiple direct merges.

Let 
$$
p, a_1, a_2, \cdots, a_r, q
$$
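be the path of direct greedy merges connecting $p$ and $q$.

Since $p \in \hat{C}_i$ and $q \in \hat{C}_j$, the path starts inside $\hat{C}_i$ and ends outside it, so there exists a consecutive pair of nodes $a_\ell, a_{\ell+1}$ on the path (taking $a_0 = p$ and $a_{r+1} = q$) that bridges the gap between two different clusters of the arbitrary clustering.

Since these consecutive nodes were directly merged, and they are separated in the arbitrary clustering,
$$
S(\hat{C}_1, \hat{C}_2, \cdots, \hat{C}_k) \leq d(a_\ell, a_{\ell+1}) \leq s^*
$$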
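The claim can be sanity-checked on a tiny instance by comparing the greedy spacing against every possible $k$-clustering. This is only an illustrative check on one hypothetical input, not a substitute for the proof; all function names and the example points are assumptions.

```python
from itertools import combinations, product


def spacing(labels, points, d):
    """Minimum distance between two points carrying different cluster labels."""
    return min(
        d(p, q)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if labels[i] != labels[j]
    )


def greedy_labels(points, d, k):
    """Single-link clustering: fuse the closest separated pair until k clusters remain."""
    n = len(points)
    labels = list(range(n))  # every point starts in its own cluster
    pairs = sorted(combinations(range(n), 2),
                   key=lambda ij: d(points[ij[0]], points[ij[1]]))
    clusters = n
    for i, j in pairs:
        if clusters == k:
            break
        if labels[i] != labels[j]:
            old, new = labels[j], labels[i]
            labels = [new if label == old else label for label in labels]
            clusters -= 1
    return labels


def best_spacing_brute_force(points, d, k):
    """Try every assignment of points to k non-empty clusters."""
    best = 0
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) == k:
            best = max(best, spacing(labels, points, d))
    return best


points = [0, 1, 5, 6, 20, 23]          # hypothetical points on a line
line_distance = lambda p, q: abs(p - q)
greedy = spacing(greedy_labels(points, line_distance, 3), points, line_distance)
optimal = best_spacing_brute_force(points, line_distance, 3)
print(greedy, optimal)
```

The brute force enumerates $3^6 = 729$ labelings, so it is only usable for very small inputs.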

# Union-Find in depth

## Lazy Unions

This is what I wrote about earlier, where we only rewire leaders during a union.

We consider the array representation of the parent relationships. Let $A[i]$ denote the name of $i$'s parent.

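Consider the example where objects $1$ and $2$ belong to the same group with root $1$, and objects $3$, $4$, $5$ belong to a second group with root $4$. A lazy union only resets the parent of root $4$ to $1$, leaving every other entry untouched.
$$
\fbox{1}\fbox{1}\fbox{4}\fbox{4}\fbox{4} \\[10pt]
\rightarrow \quad \fbox{1}\fbox{1}\fbox{4}\fbox{1}\fbox{4}
$$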

In general,

Define a root as a node $x$ where,
$$
\text{parent}[x] = x
$$

Initialise: for all $x$, $\text{parent}[x] := x$
$$
\text{Find}(x) : \text{traverse parent pointers until you hit a root} \\[10pt]
\begin{aligned}
    \text{Union}(x,y) : \;& s_1 = \text{Find}(x) \\
    & s_2 = \text{Find}(y) \\
    & \text{reset parent of one of} \; s_2 , s_1 \; \text{to the other}
\end{aligned}
$$

Because lazy unions allow the parent graph to grow deeper, we need to be careful about how we perform unions, so that the tree depth is $O(\log{n})$ rather than $O(n)$.
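The risk can be seen in a minimal Python sketch of lazy unions where the choice of new root is arbitrary. The class name `LazyUnionFind`, the `depth` helper and the particular sequence of unions are assumptions chosen to trigger the worst case.

```python
class LazyUnionFind:
    """Parent-pointer forest; a union rewires a single root."""

    def __init__(self, items):
        self.parent = {x: x for x in items}  # every object starts as a root

    def find(self, x):
        while self.parent[x] != x:  # traverse parent pointers until a root
            x = self.parent[x]
        return x

    def union(self, x, y):
        s1, s2 = self.find(x), self.find(y)
        if s1 != s2:
            self.parent[s1] = s2  # arbitrary choice: always hang s1 under s2

    def depth(self, x):
        """Number of hops from x up to its root."""
        hops = 0
        while self.parent[x] != x:
            x = self.parent[x]
            hops += 1
        return hops


n = 8
uf = LazyUnionFind(range(n))
for i in range(n - 1):
    uf.union(i, i + 1)  # each union hangs the whole existing chain under a fresh root
print(uf.depth(0))
```

After these unions the forest is a single chain $0 \to 1 \to \cdots \to 7$, so object $0$ sits $n - 1 = 7$ hops below the root and a Find on it takes linear time.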

### Union by Rank Optimisation

For each object $x \in X$ maintain a field $\text{rank}[x]$ where
$$
\text{rank}[x] = \text{maximum number of hops from a leaf to} \; x
$$

Using the rank, we always keep the root with the higher rank as the topmost node. If the two ranks are equal, the new root's rank increases by one.

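```
Union(x,y):
    s1 = Find(x)
    s2 = Find(y)

    if s1 == s2:
        return                      # already in the same group

    if rank[s1] > rank[s2]:
        parent[s2] = s1             # ranks are unchanged

    else:
        parent[s1] = s2
        if rank[s1] == rank[s2]:
            rank[s2] = rank[s2] + 1 # equal ranks: new root is one level deeper

```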
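A runnable Python sketch of union by rank follows. It assumes the usual bookkeeping that the notes' rank definition implies: nothing happens if both objects already share a root, and the surviving root's rank grows by one when the two ranks are equal. The class name `RankedUnionFind` and the `depth` helper are made up for illustration.

```python
class RankedUnionFind:
    """Lazy unions where the root of higher rank always stays on top."""

    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.rank = {x: 0 for x in items}  # every object starts as a lone root

    def find(self, x):
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        s1, s2 = self.find(x), self.find(y)
        if s1 == s2:
            return  # already in the same group
        if self.rank[s1] > self.rank[s2]:
            s1, s2 = s2, s1  # make s2 the root of higher (or equal) rank
        self.parent[s1] = s2
        if self.rank[s1] == self.rank[s2]:
            self.rank[s2] += 1  # equal ranks: the merged tree is one level deeper

    def depth(self, x):
        """Number of hops from x up to its root."""
        hops = 0
        while self.parent[x] != x:
            x = self.parent[x]
            hops += 1
        return hops


uf = RankedUnionFind(range(8))
# Merge equal-rank trees pairwise, which is the pattern that makes ranks grow fastest.
for x, y in [(0, 1), (2, 3), (4, 5), (6, 7), (0, 2), (4, 6), (0, 4)]:
    uf.union(x, y)
print(max(uf.depth(x) for x in range(8)))
```

Even with this rank-maximising sequence the deepest object is only $3 = \log_2 8$ hops from the root, compared with $n - 1$ hops when the new root is chosen arbitrarily.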