# Disjoint Sets
We will now discuss a data structure that helps us solve the equivalence problem. Before describing it, let us give a little background on the equivalence problem.

## Equivalence Relations
A **relation** R is defined on a set S if for every pair of elements (a,b), $a,b \in S$, $a R b$ is either true or false. If $a R b$ is true we say that a is related to b. <br>
An **equivalence relation** is a relation R that satisfies three properties: <br>
1. (*Reflexive*) $a R a$ for all $a \in S$.
2. (*Symmetric*) $a R b$ if and only if $b R a$. 
3. (*Transitive*) $a R b$ and $b R c$ implies that $a R c$.

For example, $\leq$ is not an equivalence relation. It is reflexive, since $a \leq a$, and transitive, since $a \leq b$ and $b \leq c$ implies $a \leq c$, but it is not symmetric, since $a \leq b$ does not imply $b \leq a$. <br>
A trivial example of an equivalence relation is =. It is reflexive ($a = a$), symmetric ($a = b$ implies $b = a$) and transitive ($a = b$ and $b = c$ implies $a = c$). 
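
For a small finite set, the three properties can be checked by brute force. The sketch below is only an illustration: the helper name `is_equivalence_relation` and the four-element sample set are assumptions made here, not part of the original notes.

```python
from itertools import product

def is_equivalence_relation(elements, related):
    """Check reflexivity, symmetry and transitivity of `related` on `elements`."""
    elements = list(elements)
    reflexive = all(related(a, a) for a in elements)
    symmetric = all(related(a, b) == related(b, a)
                    for a, b in product(elements, repeat=2))
    transitive = all(related(a, c)
                     for a, b, c in product(elements, repeat=3)
                     if related(a, b) and related(b, c))
    return reflexive and symmetric and transitive

sample = range(4)  # hypothetical sample set {0, 1, 2, 3}
print(is_equivalence_relation(sample, lambda a, b: a <= b))  # False (not symmetric)
print(is_equivalence_relation(sample, lambda a, b: a == b))  # True
```

The check costs on the order of $|S|^3$ evaluations, so it is only meant for tiny sets.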

## Dynamic Equivalence Problem
Given an equivalence relation $\sim$, the natural problem is to decide, for any a and b, if $a \sim b$. If the relation is stored as a two-dimensional array of boolean variables, then of course this can be done in constant time. The problem is that the relation is usually not given explicitly but is defined implicitly. <br>

As an example, suppose the equivalence relation is defined over a five-element set $\{a_1,a_2,a_3,a_4,a_5\}$. Then there are 25 pairs of elements, each of which is either related or not. However, the information $a_1 \sim a_2$, $a_3 \sim a_4$, $a_5 \sim a_1$, $a_4 \sim a_2$ implies that all pairs are related. We would like to be able to infer this quickly. <br>

The **equivalence class** of an element $a \in S$ is the subset of S that contains all the elements that are related to a. Notice that the equivalence classes form a partition of S (*Meaning that every member of S appears in exactly one equivalence class*). To decide if $a \sim b$, we need only to check whether a and b are in the same equivalence class. <br>
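
To make the five-element example concrete, here is a deliberately naive sketch that merges classes as each related pair arrives. The function name `equivalence_classes` and the string labels are assumptions for illustration; this is not the efficient structure developed later in these notes.

```python
def equivalence_classes(elements, pairs):
    """Naively build the partition induced by the given related pairs."""
    # class_of maps each element to the set object shared by its whole class.
    class_of = {x: {x} for x in elements}
    for a, b in pairs:
        class_a, class_b = class_of[a], class_of[b]
        if class_a is class_b:
            continue  # a and b are already known to be related
        class_a |= class_b  # merge b's class into a's class (in place)
        for x in class_b:
            class_of[x] = class_a
    # Collect each distinct class exactly once.
    unique = {id(c): c for c in class_of.values()}
    return list(unique.values())

elements = ["a1", "a2", "a3", "a4", "a5"]
pairs = [("a1", "a2"), ("a3", "a4"), ("a5", "a1"), ("a4", "a2")]
classes = equivalence_classes(elements, pairs)
print(len(classes))        # 1
print(sorted(classes[0]))  # ['a1', 'a2', 'a3', 'a4', 'a5']
```

The four given pairs collapse everything into a single class, which is why all 25 pairs are related.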

The input is initially a collection of N sets, each with one element. In this initial representation all relations (except the reflexive ones) are false. Each set has a different element, so that $S_i \cap S_j = \emptyset$ for $i \neq j$. This makes the sets disjoint. <br>
There are two operations we can perform, find and union. **Find** returns the name of the set (the equivalence class) containing a given element. Observe that the name of the set returned by find is fairly arbitrary. All that really matters is that **find(a) == find(b)** is true if and only if a and b are in the same set. If we want to add a relation $a \sim b$, then we first see if a and b are already related. This is done by performing a find on both a and b and checking whether they are in the same equivalence class. If they are not, then we **union** them. We do this by merging the two equivalence classes containing a and b into a new equivalence class. In other words, we create $S_k = S_i \bigcup S_j$, destroying the originals and preserving the disjointness of all the sets. We can refer to this algorithm as the disjoint set **union/find** algorithm. <br>
Notice that we do not perform any operations comparing the relative values of elements but only require the knowledge of their location. For this we can assume all the elements have been numbered sequentially from 0 to N-1 and that the numbering can be determined easily by some hashing scheme. Thus, initially we have $S_i = \{i\}$ for i = 0 through N-1. (We use N-1 because arrays start at 0.) <br>
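
One simple way to obtain such a numbering is a dictionary from element to index. This is a minimal sketch; the labels in `items` are made up for illustration, and a Python `dict` stands in for the hashing scheme mentioned above.

```python
items = ["apple", "pear", "plum", "fig"]  # hypothetical element labels

# Number the elements 0..N-1; the dict plays the role of the hashing scheme.
index_of = {item: i for i, item in enumerate(items)}

# Initially S_i = {i}: every element sits alone in its own set.
sets = [{i} for i in range(len(items))]

print(index_of["plum"])  # 2
print(sets)              # [{0}, {1}, {2}, {3}]
```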

## Disjoint Set Structure
Recall that the find operation does not need to return a specific name, only that find on two elements returns the same answer if and only if they are in the same set. To do this we can represent each set as a **tree**, since all elements of the same set share the same root. These trees are not the traditional kind in which a node stores information about its children, because the only information we need to maintain is each node's parent. The name of a set is given by the node at the root, and initially every set contains only one element. Since only the parent is required, we can store the trees in an array. In the array each entry s\[i\] represents the parent of element i. If i is a root, then s\[i\] = -1. So in the following example, s\[i\] = -1 for all i, $0 \leq i \lt 8$. 
<img src="./files/DisjointSets/initial_ds.png" width="600"/>

To perform the union of two sets, we merge the two trees by pointing the parent link of one tree's root at the root of the other tree. For these examples we adopt the convention that the new root after union(x,y) is x. So after union(4,5), s\[5\] becomes 4. 
<img src="./files/DisjointSets/ds_union_4-5.png" width="600"/>

After union(6,7), s\[7\] becomes 6.
<img src="./files/DisjointSets/ds_union_6-7.png" width="600"/>

After union(4,6), s\[6\] becomes 4.
<img src="./files/DisjointSets/ds_union_4-6.png" width="600"/>

So our final array looks as follows:
<img src="./files/DisjointSets/ds_merge_array.png" width="400"/>

A find(x) on element x is performed by returning the root of the tree containing x. The time to perform this operation is proportional to the depth of the node representing x. Using the array strategy above it is possible to create a tree of depth N-1, so the worst case running time of find(x) is O(N). 
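
The array representation described so far can be sketched in a few lines. The function names `find`, `union` and `depth` are chosen here for illustration, and `union` follows the convention above that the root of x's tree becomes the new root. The second half builds a degenerate chain to show how the depth can reach N-1.

```python
def find(s, x):
    """Follow parent links from x up to the root, where s[root] == -1."""
    while s[x] != -1:
        x = s[x]
    return x

def union(s, x, y):
    """Merge the trees holding x and y; the root of x's tree stays the root."""
    root_x, root_y = find(s, x), find(s, y)
    if root_x != root_y:
        s[root_y] = root_x

def depth(s, x):
    """Count the parent links between x and its root."""
    links = 0
    while s[x] != -1:
        x = s[x]
        links += 1
    return links

# The three unions from the example above.
s = [-1] * 8
union(s, 4, 5)
union(s, 6, 7)
union(s, 4, 6)
print(s)           # [-1, -1, -1, -1, -1, 4, 4, 6]
print(find(s, 7))  # 4

# A bad sequence of unions: each new element becomes the root of a chain.
chain = [-1] * 8
for i in range(1, 8):
    union(chain, i, i - 1)
print(chain)            # [1, 2, 3, 4, 5, 6, 7, -1]
print(depth(chain, 0))  # 7, i.e. N-1 for N = 8
```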

## Smart Union Algorithms
In the union algorithm above we arbitrarily chose to make the second tree a subtree of the first. This can make the find operation very inefficient, since nothing prevents the depth of a tree from growing too large. We can fix this by applying some simple heuristics when performing a union, which guarantee that the depth of each tree is O(log N).

The first heuristic is **union-by-size**, in which we always make the smaller tree a subtree of the larger. To implement this we need to keep track of the size of each tree. Since the underlying structure is just an array, we can have the entry of each **root** contain the negative of the size of its tree. Thus initially all the elements in the array are -1. When a union is performed we compare the two sizes, and the size of the new tree is the sum of the old sizes. This implementation therefore requires no extra space. 
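
A minimal sketch of union-by-size on the same array layout follows. The name `union_by_size` is an assumption made here; the run repeats the three unions from the earlier example and then adds union(3,4), where the single-node tree 3 ends up under the larger tree rooted at 4.

```python
def find(s, x):
    """Walk up parent links; a root holds a negative value (minus its tree size)."""
    while s[x] >= 0:
        x = s[x]
    return x

def union_by_size(s, x, y):
    """Attach the smaller tree beneath the root of the larger one."""
    root_x, root_y = find(s, x), find(s, y)
    if root_x == root_y:
        return
    # Roots hold negative sizes, so the more negative entry is the larger tree.
    if s[root_y] < s[root_x]:
        root_x, root_y = root_y, root_x  # make root_x the larger tree
    s[root_x] += s[root_y]  # new size is the sum of the old sizes
    s[root_y] = root_x

s = [-1] * 8
for x, y in [(4, 5), (6, 7), (4, 6), (3, 4)]:
    union_by_size(s, x, y)
print(s)  # [-1, -1, -1, 4, -5, 4, 4, 6]
```

On ties this sketch keeps the root of x's tree, matching the convention used earlier.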

An alternative implementation is **union-by-height**. We keep track of the height instead of the size of each tree and perform a union by making the tree with the smaller height a subtree of the tree with the larger height. The height only changes when we union two trees of equal height, in which case it goes up by one; otherwise the root value remains the same. Since a tree with only a root has height 0, we actually store the negative of the height, minus 1, that is $-\text{height} - 1$. Therefore initially all the entries are -1. 
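
Union-by-height can be sketched the same way. As before, the name `union_by_height` is an assumption for illustration, and the run uses the three earlier unions followed by union(3,4).

```python
def find(s, x):
    """Walk up parent links; a root holds a negative value (-height - 1)."""
    while s[x] >= 0:
        x = s[x]
    return x

def union_by_height(s, x, y):
    """Attach the shorter tree beneath the root of the taller one."""
    root_x, root_y = find(s, x), find(s, y)
    if root_x == root_y:
        return
    # Roots store -height - 1, so the more negative entry is the taller tree.
    if s[root_y] < s[root_x]:
        s[root_x] = root_y      # y's tree is taller: x's tree goes under it
    else:
        if s[root_x] == s[root_y]:
            s[root_x] -= 1      # equal heights: the result is one level taller
        s[root_y] = root_x

s = [-1] * 8
for x, y in [(4, 5), (6, 7), (4, 6), (3, 4)]:
    union_by_height(s, x, y)
print(s)  # [-1, -1, -1, 4, -3, 4, 4, 6]
```

The root entry -3 encodes a height of 2, the length of the path from 7 through 6 to 4.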

The following shows the result of union(3,4) on the previous example under both union-by-size and union-by-height, along with the array representations. The first array is union-by-size and the second is union-by-height. <br>
<img src="./files/DisjointSets/ds_smart_union.png" width="500"/> <br>

## Path Compression
With the union algorithms above, a sequence of M operations takes linear time on average. However, the worst case is still O(M log N), which can arise with a tree such as this one:
<img src="./files/DisjointSets/ds_worst_case.png" width="600"/>

To prevent this we can apply **path compression**. Path compression is performed during a **find** operation and is independent of the strategy used to perform unions. When performing find(x), every node on the path from x to the root has its parent changed to the root. Using the tree above, this is what happens after performing find(14). 

<img src="./files/DisjointSets/ds_path_compression.png" width="600"/>

The effect of path compression is that the parents of 12 and 14 are now both 0. Nodes 12 and 13 are one position closer to the root, and nodes 14 and 15 are two positions closer, which speeds up later find operations on them. 
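
A two-pass version of find with path compression might look like the sketch below. The four-node chain is a made-up example, not the tree in the figures, and root entries are assumed to be negative as in the smart union schemes.

```python
def find(s, x):
    """Return the root of x, repointing every node on the path at that root."""
    # First pass: locate the root (roots hold a negative value).
    root = x
    while s[root] >= 0:
        root = s[root]
    # Second pass: make every node on the path a direct child of the root.
    while x != root:
        parent = s[x]
        s[x] = root
        x = parent
    return root

# Hypothetical chain 3 -> 2 -> 1 -> 0, with the root storing minus the size.
s = [-4, 0, 1, 2]
print(find(s, 3))  # 0
print(s)           # [-4, 0, 0, 0]
```

After the call, nodes 2 and 3 hang directly off the root, so a later find on either follows a single link.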

Path compression is perfectly compatible with union-by-size, since it does not affect the size of a tree. It does, however, interfere with union-by-height, since path compression can change the height of a tree. So we instead use **union-by-rank**. This is almost exactly the same as union-by-height, except that we use ranks, which are estimates (upper bounds) of the heights. When performing path compression we do not have to recompute the height of each tree; we simply keep using the rank. 
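
The following sketch combines a recursive find with path compression and union-by-rank, storing $-\text{rank} - 1$ at each root in the same way the height was stored before. The name `union_by_rank` and the boolean return value are assumptions made here.

```python
def find(s, x):
    """Recursive find with path compression; roots hold -rank - 1."""
    if s[x] < 0:
        return x
    s[x] = find(s, s[x])  # repoint x directly at the root
    return s[x]

def union_by_rank(s, x, y):
    """Merge by rank; return False if x and y were already in the same set."""
    root_x, root_y = find(s, x), find(s, y)
    if root_x == root_y:
        return False
    if s[root_y] < s[root_x]:
        root_x, root_y = root_y, root_x  # root_x now has the larger rank
    if s[root_x] == s[root_y]:
        s[root_x] -= 1  # equal ranks: the surviving root's rank grows by one
    s[root_y] = root_x
    return True

s = [-1] * 8
for x, y in [(4, 5), (6, 7), (4, 6), (3, 4)]:
    union_by_rank(s, x, y)
print(s)           # [-1, -1, -1, 4, -3, 4, 4, 6]
print(find(s, 7))  # 4
print(s)           # [-1, -1, -1, 4, -3, 4, 4, 4]
```

After find(7) the tree rooted at 4 has height 1, yet its entry still records a rank of 2. The rank is left as an overestimate rather than being recomputed.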

## Example Problems
The natural question is: what exactly do we use this data structure for? One popular example is generating mazes. We start with each cell in its own set and keep randomly unioning the sets of adjacent cells until the start and exit cells are in the same set. For a more concrete example let's look at the following problem:
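
The maze idea can be sketched as follows. Everything here is an assumption made for illustration: the grid is `rows` by `cols`, cell (r, c) gets index `r * cols + c`, the start is the first cell and the exit is the last, and a wall is only knocked down when the two cells it separates are in different sets, a common variant that avoids creating loops.

```python
import random

def find(s, x):
    """Find with path compression; roots hold minus the size of their tree."""
    if s[x] < 0:
        return x
    s[x] = find(s, s[x])
    return s[x]

def union(s, x, y):
    """Union by size; return False if x and y were already connected."""
    root_x, root_y = find(s, x), find(s, y)
    if root_x == root_y:
        return False
    if s[root_y] < s[root_x]:
        root_x, root_y = root_y, root_x
    s[root_x] += s[root_y]
    s[root_y] = root_x
    return True

def generate_maze(rows, cols, seed=None):
    """Return the walls removed, each as a pair of adjacent cell indices."""
    rng = random.Random(seed)
    # Every interior wall separates a cell from its right or lower neighbour.
    walls = []
    for r in range(rows):
        for c in range(cols):
            cell = r * cols + c
            if c + 1 < cols:
                walls.append((cell, cell + 1))
            if r + 1 < rows:
                walls.append((cell, cell + cols))
    rng.shuffle(walls)

    s = [-1] * (rows * cols)
    start, exit_cell = 0, rows * cols - 1
    removed = []
    for a, b in walls:
        if find(s, start) == find(s, exit_cell):
            break  # the exit is reachable from the start
        if union(s, a, b):
            removed.append((a, b))
    return removed

removed = generate_maze(4, 5, seed=1)
print(len(removed), "walls removed")
```

Running the loop until every cell is in one set, instead of stopping early, would leave no unreachable cells.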

#### Redundant Connection
In this problem, a tree is an undirected graph that is connected and has no cycles.

The given input is a graph that started as a tree with N nodes (with distinct values 1, 2, ..., N), with one additional edge added. The added edge has two different vertices chosen from 1 to N, and was not an edge that already existed.

The resulting graph is given as a 2D-array of edges. Each element of edges is a pair [u, v] with u < v, that represents an undirected edge connecting nodes u and v.

Return an edge that can be removed so that the resulting graph is a tree of N nodes. If there are multiple answers, return the answer that occurs last in the given 2D-array. The answer edge [u, v] should be in the same format, with u < v. 

Example: <br>
Input: \[\[1,2\], \[1,3\], \[2,3\]\]<br>
Output: \[2,3\]<br>
Explanation: The given undirected graph will be like this:<br>
<pre>
  1
 / \
2 - 3
</pre>

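```python
# Grow s with singleton sets (-1) until index n is valid.
# Not used below: redundant_connection sizes the array up front.
def update_set(s, n):
    for i in range((n+1)-len(s)):
        s.append(-1)

# s is the current array of disjoint sets
# x is the node whose set we want to find
# returns the root of the tree containing x (a root holds a negative value)
def find(s, x):
    while s[x] >= 0:
        x = s[x]
    return x

# s is the array of disjoint sets
# u and v are elements of the sets being unioned
# union by size: each root stores the negative of the size of its tree
# returns False if u and v were already in the same set, True otherwise
def union(s, u, v):
    r_u = find(s,u)
    r_v = find(s,v)
    if r_u == r_v:
        return False
    if s[r_u] <= s[r_v]:
        # r_u's tree is at least as large, so r_v goes under it
        s[r_u] += s[r_v]
        s[r_v] = r_u
    else:
        s[r_v] += s[r_u]
        s[r_u] = r_v
    return True

def redundant_connection(edge_arr):
    redun_connect = None
    # each edge is [u, v] with u < v, so the largest label is some v
    max_num = -1
    for i in range(len(edge_arr)):
        max_num = max(edge_arr[i][1], max_num)

    # nodes are numbered from 1, so index 0 is unused
    sets = [-1] * (max_num+1)

    # an edge whose endpoints are already connected closes a cycle;
    # keep the last such edge
    for u, v in edge_arr:
        if not union(sets, u,v):
            redun_connect = [u,v]

    return redun_connect
```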

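In the following input both \[1,4\] and \[2,3\] close a cycle, and the function returns the one that occurs last:

```python
edges = [[1,2], [1,3], [2,4], [1,4], [2,3], [1, 5], [2, 7]]

redundant_connection(edges)
```

Output:
<pre>
[2, 3]
</pre>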