# Data Structures

## Heaps

Supports:
- Insertion $O(\log{n})$ 
- Extract Min or Max $O(\log{n})$
- Heapify $O(n)$: build a heap from $n$ elements in one batch, faster than $n$ separate insertions
- Delete $O(\log{n})$

A heap is a container for objects that are ordered by keys, giving fast access to the object with the minimum (or maximum) key.
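The operations listed above can be tried out with Python's standard `heapq` module, which implements a binary min-heap on top of a plain list. This is only a small sketch; the sample data is made up, and `heapq` has no direct arbitrary-delete operation, so that one is not shown.

```python
import heapq

data = [9, 4, 7, 1, 8, 2]  # illustrative keys

heap = list(data)
heapq.heapify(heap)              # O(n) batched build
heapq.heappush(heap, 3)          # O(log n) insertion
smallest = heapq.heappop(heap)   # O(log n) extract-min

# Repeated extract-min yields the remaining keys in sorted order,
# which is the idea behind heap sort (O(n log n) overall).
in_order = [heapq.heappop(heap) for _ in range(len(heap))]

# heapq only provides a min-heap; a max-heap can be simulated by negating keys.
max_heap = [-x for x in data]
heapq.heapify(max_heap)
largest = -heapq.heappop(max_heap)

print(smallest, in_order, largest)
```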

Applications:
- Sorting: selection sort runs in $O(n^2)$; using a heap to repeatedly extract the minimum (heap sort) brings this down to $O(n\log{n})$
- Priority Queue
- Event Manager
- Median Maintenance

    Given a sequence of numbers $x_1, x_2, \cdots, x_n$ arriving one at a time, keep track of the median of the numbers seen so far.

    Solution:

    Maintain two heaps,
    $$
    H_{low} : \text{supports extract max} \\
    H_{high} : \text{supports extract min}
    $$
    Keep the heaps balanced so that each holds about half of the elements seen so far, with $H_{low}$ holding the smaller half and $H_{high}$ the larger half.

    The median will then be either the max of $H_{low}$ or the min of $H_{high}$. When a new element comes in, we compare it to this median and add it to either the high or the low heap as necessary. If one heap ends up with two more elements than the other, we move its max (or min) to the other heap to restore the balance.
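The two-heap scheme above can be sketched with the standard `heapq` module. This is a minimal illustration, not the only way to arrange the bookkeeping: the function name `running_medians` is hypothetical, the max-heap is simulated by negating keys, and for an even count the lower of the two middle values is reported (an assumption about which median is wanted).

```python
import heapq

def running_medians(stream):
    """Yield the median after each element (lower median for even counts)."""
    low = []   # max-heap via negated keys: holds the smaller half
    high = []  # min-heap: holds the larger half
    for x in stream:
        # Route the new element to the half it belongs to.
        if not low or x <= -low[0]:
            heapq.heappush(low, -x)
        else:
            heapq.heappush(high, x)
        # Rebalance so that len(low) is len(high) or len(high) + 1.
        if len(low) > len(high) + 1:
            heapq.heappush(high, -heapq.heappop(low))
        elif len(high) > len(low):
            heapq.heappush(low, -heapq.heappop(high))
        yield -low[0]

medians = list(running_medians([5, 2, 8, 1, 9]))
print(medians)  # [5, 2, 5, 2, 5]
```

Each element costs $O(\log{n})$, since it triggers at most one push and one rebalancing move.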


## Search Trees

Static sorted arrays support the following operations:

|Operation|Running Time|
|---|---|
|Search|$O(\log{n})$|
|Select|$O(1)$|
|Min/Max|$O(1)$|
|Pred/Succ|$O(1)$|
|Rank (# of keys less than given value)| $O(\log{n})$|
|Output in sorted order|$O(n)$|
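As a rough illustration of the table above, the sketch below runs each operation on a plain sorted Python list using the standard `bisect` module. The helper names (`search`, `rank`, `select`, `predecessor`) and the sample keys are made up for this example.

```python
from bisect import bisect_left

keys = [3, 8, 15, 21, 42]  # a static sorted array (illustrative data)

def search(a, x):
    """Index of x, or None if absent: binary search, O(log n)."""
    i = bisect_left(a, x)
    return i if i < len(a) and a[i] == x else None

def rank(a, x):
    """Number of keys strictly less than x: O(log n)."""
    return bisect_left(a, x)

def select(a, i):
    """The i-th smallest key (1-indexed): O(1)."""
    return a[i - 1]

def predecessor(a, idx):
    """Key just before position idx, or None at the left end: O(1)."""
    return a[idx - 1] if idx > 0 else None

minimum, maximum = keys[0], keys[-1]  # O(1)

print(search(keys, 15), rank(keys, 20), select(keys, 2), predecessor(keys, 2))
```

Inserting into or deleting from such an array takes $O(n)$ time because elements have to be shifted, which is the gap a search tree fills.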

A search tree can provide sorted-array-like behaviour while also allowing insertions and deletions. The running times below assume a balanced tree, i.e. one whose height is $O(\log{n})$.

|Operation|Running Time|
|---|---|
|Search|$O(\log{n})$|
|Select|$O(\log{n})$|
|Min/Max|$O(\log{n})$|
|Pred/Succ|$O(\log{n})$|
|Rank (# of keys less than given value)| $O(\log{n})$|
|Output in sorted order|$O(n)$|
|Insertion|$O(\log{n})$|
|Deletion|$O(\log{n})$|

### Binary Search Tree Structure

Each node contains a key, as well as a pointer to the associated data. The tree has a root node at the top and leaves at the bottom, and each node has three pointers:
1. left child
2. right child
3. parent

The search tree property states that, for each node, all of the nodes stored under its left child have keys less than the node's key, and all of the nodes stored under its right child have keys greater than the node's key.

The height (or depth) of a search tree is defined as the maximum number of generations (levels) in the tree. For $n$ nodes this can be at most $n$ (a chain) and at least about $\log_2{n}$ (a perfectly balanced tree).
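To make the node layout and the two height extremes concrete, here is a small sketch. The `Node` class, the unbalanced `insert` helper, and the `height` function are illustrative names rather than anything prescribed by the notes, and keys are assumed to be distinct.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None    # left child
        self.right = None   # right child
        self.parent = None  # parent

def insert(root, key):
    """Plain, unbalanced insertion that keeps the search tree property."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = new
                break
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = new
                break
            cur = cur.right
    new.parent = cur
    return root

def height(node):
    """Number of levels in the subtree rooted at node."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

chain = None
for k in [1, 2, 3, 4, 5, 6, 7]:      # sorted input degenerates into a chain
    chain = insert(chain, k)

balanced = None
for k in [4, 2, 6, 1, 3, 5, 7]:      # this order gives a perfectly balanced tree
    balanced = insert(balanced, k)

print(height(chain), height(balanced))  # 7 3
```

The same seven keys give height $n = 7$ in one insertion order and $\log_2(n+1) = 3$ in the other, which is why the insertion order (or an explicit balancing scheme) matters.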

1. Searching and Inserting

    We search in binary search style: compare the search key to the key of the current node (starting at the root) and recurse on the left or right subtree accordingly.

    In order to insert, we follow the same search procedure until we reach a NULL pointer, and then insert the new node there.

2. Min and Max

    The minimum element is the leftmost node in the entire tree (keep following left child pointers from the root), and likewise the maximum element is the rightmost node in the tree.

3. Predecessor or Successor Elements

    The predecessor is the largest element less than a given element, so it is the rightmost node in the element's left subtree. Likewise, the successor is the leftmost node in the element's right subtree.

    If that subtree is empty, we follow parent pointers until we find the first ancestor with a key less than the element itself (for the predecessor). Or more succinctly, the first time we make a "left" turn going up. The successor case is symmetric.

4. In-Order Traversal
    ```
    If the tree is empty, return

    Let r = root of the search tree, with subtrees T_L and T_R

    Recurse on T_L

    yield r

    Recurse on T_R
    ```

5. Deletion
    
    If the node has no children, just delete it.

    If the node has one child, that child can replace the node.

    If the node has two children, we need to identify which node should replace it. First we compute its predecessor. Then we swap the node with its predecessor. We can then delete the node at the predecessor's old position, which is one of the two easy cases: the predecessor is the rightmost node of the left subtree, so that position is guaranteed to have no right child and at most one (left) child.

6. Select and Rank

    At each node $x$ we store $size(x)$ = # of tree nodes in the subtree rooted at $x$. We can compute this recursively,

    for a node $x$ with children $y$ and $z$ (a missing child has size 0)
    $$
    size(x) = size(y) + size(z) + 1
    $$

    These sizes have to be maintained during insertion and deletion. For insertion, as we go down the search path, we increment the subtree size of every node on it by 1.

    Then, the $i^{th}$ order statistic is the node whose left subtree has size $i-1$, once the nodes skipped on the way down are accounted for.

    ```
    Select(x, i):    // initially x = root

        let a = size(left child of x)    // 0 if x has no left child

        if a = i - 1:
            return x

        if a >= i:
            return Select(left child of x, i)

        if a < i - 1:
            return Select(right child of x, i - a - 1)
    ```

## Red-Black Trees

Since the running time of search tree operations depends on the height of the tree, we want to implement a balanced search tree such that the height is $O(\log{n})$.

See also:
- [AVL Trees](https://en.wikipedia.org/wiki/AVL_tree)
- [Splay Trees](https://en.wikipedia.org/wiki/Splay_tree)
- [B-Trees](https://en.wikipedia.org/wiki/B-tree)

A red-black tree guarantees logarithmic height by enforcing four additional invariants.

1. Each node is either red or black
2. The root is black
3. No two reds in a row, i.e. the children of a red node must be black
4. Every root-NULL path has the same number of black nodes

These four invariants guarantee that the height of a tree with $n$ nodes is at most $2\log_2(n+1)$.
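The four invariants above can be checked mechanically. The sketch below is one way to do it; `RBNode`, `black_height`, and `is_red_black` are hypothetical names for this example, and the check covers only the coloring invariants, not the search tree property.

```python
RED, BLACK = "red", "black"

class RBNode:
    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color   # invariant 1: every node is RED or BLACK
        self.left = left
        self.right = right

def black_height(node):
    """Black nodes on every path from node down to NULL.

    Raises ValueError if invariant 3 or 4 fails somewhere below node.
    """
    if node is None:
        return 0
    if node.color == RED:
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                raise ValueError("two reds in a row")          # invariant 3
    left_bh = black_height(node.left)
    right_bh = black_height(node.right)
    if left_bh != right_bh:
        raise ValueError("paths differ in black-node count")   # invariant 4
    return left_bh + (1 if node.color == BLACK else 0)

def is_red_black(root):
    if root is not None and root.color != BLACK:               # invariant 2
        return False
    try:
        black_height(root)
    except ValueError:
        return False
    return True

valid = RBNode(2, BLACK, RBNode(1, RED), RBNode(3, RED))
chain = RBNode(1, BLACK, None, RBNode(2, BLACK))   # right path has an extra black node

print(is_red_black(valid), is_red_black(chain))  # True False
```

The second example shows why a chain cannot be colored legally: the path through the missing left child has fewer black nodes than the path through the right child.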

Proof:

1. Observation: if every root-NULL path has $\geq k$ nodes, then the top $k$ levels of the tree are completely full.

This gives a lower bound on the size of the search tree,
$$
n \geq 2^{k} - 1 \quad \text{where} \quad k = \text{minimum number of nodes on a root-NULL path} \\
\implies k \leq \log_2(n + 1)
$$

Thus, in a red-black tree with $n$ nodes there is a root-NULL path with at most $\log_2(n + 1)$ nodes, and therefore at most $\log_2(n + 1)$ black nodes.

By the $4^{th}$ invariant, every root-NULL path has the same number of black nodes, so every root-NULL path has at most $\log_2(n+1)$ black nodes.

By the $3^{rd}$ invariant, at most every other node on a path is red, so in the worst case a path has as many red nodes as black nodes, i.e., black, red, black, red, etc.

This gives that every root-NULL path has $\leq 2 \log_2(n+1)$ nodes.
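As a quick sanity check of the bound, here is a worked instance with an illustrative value of $n$, chosen so that the logarithm comes out exact:
$$
n = 1023 \implies \log_2(n + 1) = \log_2{1024} = 10 \implies \text{height} \leq 2 \cdot 10 = 20
$$

By comparison, an unbalanced search tree on the same 1023 keys can have height up to 1023.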


## Rotations

A rotation is a key primitive that does constant work $O(1)$ that helps to locally rebalance subtrees at a node. 

Rotations are invoked upon a parent-child pair. When the child is a right child, the rotation is called a "left" rotation; when the child is a left child, it is called a "right" rotation.

WLOG let's discuss a left rotation:

Let the parent be $x$ and the right child $y$.

Further, let the left subtree of $x$ be $x_L$, and the left and right subtrees of $y$ be $y_L$ and $y_R$ respectively.

By the search tree property, we have
$$
x < y \\
x_L < x \\
y_R > y \\
x < y_L < y
$$

The goal of the rotation is to invert the relationship between $x$ and $y$ such that $y$ becomes the parent.

The new subtree has $y$ as its root (taking over $x$'s old position under $x$'s former parent), with:

Left Child: $x$
- Left Child: $x_L$
- Right Child: $y_L$

Right Child: $y_R$
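The pointer updates for the left rotation described above can be sketched as follows. The `Node` class and `rotate_left` function are illustrative names, and the node layout (left, right, and parent pointers) is assumed to match the one used earlier in these notes.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self

def rotate_left(x):
    """Rotate the pair (x, y = x.right) so that y becomes the parent; returns y."""
    y = x.right
    if y is None:
        raise ValueError("left rotation needs a right child")

    # y_L moves over to become x's right subtree.
    x.right = y.left
    if y.left is not None:
        y.left.parent = x

    # y takes x's place under x's old parent (if any).
    y.parent = x.parent
    if x.parent is not None:
        if x.parent.left is x:
            x.parent.left = y
        else:
            x.parent.right = y

    # x becomes y's left child; x_L and y_R stay where they are.
    y.left = x
    x.parent = y
    return y

x = Node(10, Node(5), Node(20, Node(15), Node(30)))
y = rotate_left(x)
print(y.key, y.left.key, y.left.left.key, y.left.right.key, y.right.key)  # 20 10 5 15 30
```

Only a constant number of pointers change, which is why a rotation costs $O(1)$, and the in-order sequence of keys is the same before and after.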

## Insertion into a Red-Black Tree

The general idea is to insert as in an ordinary binary search tree, and then recolor or perform rotations until the invariants are restored.

```
Insert(x):
    Insert x as usual (as in an ordinary binary search tree)

    Remember the parent of x, call it y

    Color x red

    If x is the root (it has no parent):
        color x black and return

    If y is black:
        return

    // else y is red:

    y has a black parent, call it w    // y is red, so it cannot be the root
    y has a sibling (the other child of w), call it z    // a NULL sibling counts as black

    If z is red:
        // this step maintains the 4th invariant
        recolor y, z as black and w as red

        // this might propagate the double red upward to w and w's parent,
        // in which case we repeat with w in the role of x
        // can only happen a maximum of O(log n) times
        // if we reach the root node, we color it black

    else z is black:
        // 2 - 3 rotations + recolorings are sufficient to restore the invariants
```
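The pseudocode above leaves the black-sibling case open. As one possible way to see the whole procedure end to end, the sketch below swaps in a different, well-known formulation: Okasaki-style functional rebalancing, where a single `balance` function rewrites any black node that has a red child with a red grandchild into a red node with two black children. It is not the rotation-by-rotation procedure from the pseudocode. The tuple layout `(color, left, key, right)` and all names are assumptions made for this example.

```python
R, B = "R", "B"

def is_red(t):
    return t is not None and t[0] == R

def balance(color, left, key, right):
    """Repair a red child with a red grandchild below a black node."""
    if color == B:
        if is_red(left) and is_red(left[1]):
            _, (_, a, x, b), y, c = left
            return (R, (B, a, x, b), y, (B, c, key, right))
        if is_red(left) and is_red(left[3]):
            _, a, x, (_, b, y, c) = left
            return (R, (B, a, x, b), y, (B, c, key, right))
        if is_red(right) and is_red(right[1]):
            _, (_, b, y, c), z, d = right
            return (R, (B, left, key, b), y, (B, c, z, d))
        if is_red(right) and is_red(right[3]):
            _, b, y, (_, c, z, d) = right
            return (R, (B, left, key, b), y, (B, c, z, d))
    return (color, left, key, right)

def insert(tree, x):
    """Return a new tree containing x (duplicates are ignored)."""
    def ins(t):
        if t is None:
            return (R, None, x, None)       # new nodes start out red
        color, left, key, right = t
        if x < key:
            return balance(color, ins(left), key, right)
        if x > key:
            return balance(color, left, key, ins(right))
        return t
    _, left, key, right = ins(tree)
    return (B, left, key, right)            # the root is always recolored black

def height(t):
    return 0 if t is None else 1 + max(height(t[1]), height(t[3]))

tree = None
for k in range(1, 1024):                    # sorted input, the worst case for a plain BST
    tree = insert(tree, k)

print(height(tree) <= 20)  # True: within the 2 * log2(n + 1) bound for n = 1023
```

Inserting the keys in sorted order would produce a chain of height 1023 in a plain search tree, so the bounded height here is entirely due to the recoloring and restructuring.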

# Programming Assignment 3

The goal of this problem is to implement the "Median Maintenance" algorithm (covered in the Week 3 lecture on heap applications).  The text file contains a list of the integers from 1 to 10000 in unsorted order; you should treat this as a stream of numbers, arriving one by one.  Letting $x_i$ denote the $i$ th number of the file, the $k$ th median $m_k$ is defined as the median of the numbers $x_1, x_2 \cdots , x_k$.  (So, if $k$ is odd, then $m_k$ is $\frac{k+1}{2}$ th smallest number among $x_1, x_2 \cdots , x_k$; if $k$ is even, then $m_k$ is the $\frac{k}{2}$ th smallest number among $x_1, x_2 \cdots , x_k$.)

In the box below you should type the sum of these 10000 medians, modulo 10000 (i.e., only the last 4 digits).  That is, you should compute $(m_1 + m_2 + m_3 + \cdots + m_{10000}) \bmod 10000$.


OPTIONAL EXERCISE: Compare the performance achieved by heap-based and search-tree-based implementations of the algorithm.

```python
from typing import Generic, Optional, Tuple, TypeVar

T = TypeVar('T')

class Heap(Generic[T]):
    """Binary heap of distinct, hashable items, each ordered by an integer key."""

    def __init__(self, type="min"):
        # Any value other than "min" or "max" falls back to a min-heap.
        self.order = type if type in ["min", "max"] else "min"
        self.arr: list[T] = []             # items in heap order
        self.keys: dict[T, int] = dict()   # item -> key
        self.index: dict[T, int] = dict()  # item -> position in arr

    @property
    def size(self):
        return len(self.arr)

    @property
    def lead(self):
        """The min (or max) item, or None if the heap is empty."""
        if self.size == 0:
            return None
        return self.arr[0]

    def compare_check(self, key_1: int, key_2: int):
        """True if key_1 may sit above key_2 in the heap (ties are allowed)."""
        if self.order == "max":
            return key_1 >= key_2
        return key_1 <= key_2

    def heapify_up(self, start_idx: int):
        """Bubble the item at start_idx up until its parent may sit above it."""
        if start_idx == 0:
            return

        parent_idx = (start_idx - 1) // 2

        parent_node = self.arr[parent_idx]
        start_node = self.arr[start_idx]

        parent_key = self.keys[parent_node]
        node_key = self.keys[start_node]

        if self.compare_check(parent_key, node_key):
            return

        self.arr[parent_idx], self.arr[start_idx] = self.arr[start_idx], self.arr[parent_idx]
        self.index[parent_node], self.index[start_node] = self.index[start_node], self.index[parent_node]

        self.heapify_up(parent_idx)

    def heapify_down(self, start_idx: int):
        """Sink the item at start_idx until it may sit above both children."""
        arr_size = len(self.arr)

        left_idx = 2 * start_idx + 1
        right_idx = 2 * start_idx + 2

        if left_idx > arr_size - 1:
            return

        start_node = self.arr[start_idx]
        node_key = self.keys[start_node]

        left_node = self.arr[left_idx]
        left_key = self.keys[left_node]

        # Pick the child that should sit higher: the smaller one for a
        # min-heap, the larger one for a max-heap.
        compare_key, compare_idx, compare_node = left_key, left_idx, left_node

        if right_idx < arr_size:
            right_node = self.arr[right_idx]
            right_key = self.keys[right_node]

            if not self.compare_check(left_key, right_key):
                compare_key, compare_idx, compare_node = right_key, right_idx, right_node

        if self.compare_check(node_key, compare_key):
            return

        self.arr[compare_idx], self.arr[start_idx] = self.arr[start_idx], self.arr[compare_idx]
        self.index[compare_node], self.index[start_node] = self.index[start_node], self.index[compare_node]

        self.heapify_down(compare_idx)

    def extract_lead(self) -> Optional[Tuple[T, int]]:
        """Remove and return (item, key) for the lead item, or None if empty."""
        arr_size = len(self.arr)

        if arr_size == 0:
            return None

        if arr_size == 1:
            lead_element = self.arr.pop()
            self.index.pop(lead_element)
            return lead_element, self.keys.pop(lead_element)

        last_element = self.arr[-1]

        # Move the last item to the root, drop the old root, then sink.
        self.arr[0], self.arr[-1] = self.arr[-1], self.arr[0]
        self.index[last_element] = 0

        lead_element = self.arr.pop()
        key = self.keys.pop(lead_element)
        self.index.pop(lead_element)

        self.heapify_down(0)

        return lead_element, key

    def insert(self, item: T, key: Optional[int] = None):
        """Insert item; the item itself is used as the key when none is given."""
        size = len(self.arr)

        self.arr.append(item)
        self.index[item] = size

        if key is None:
            key = item

        self.keys[item] = key

        if size > 0:
            self.heapify_up(size)

    def delete(self, item: T):
        """Remove an arbitrary item, located through the index map."""
        idx = self.index[item]

        if idx == len(self.arr) - 1:
            self.arr.pop()
            self.index.pop(item)
            self.keys.pop(item)
            return

        last = self.arr[-1]

        self.arr[idx], self.arr[-1] = self.arr[-1], self.arr[idx]
        self.index[last] = idx

        self.arr.pop()
        self.index.pop(item)
        deleted_key = self.keys.pop(item)

        # Compare keys, not items: if the deleted key ranked ahead of its
        # replacement, the replacement may need to sink; otherwise it may
        # need to rise.
        if self.compare_check(deleted_key, self.keys[last]):
            self.heapify_down(idx)
        else:
            self.heapify_up(idx)

    def validate(self):
        """Check the bookkeeping maps and the heap property; print state on failure."""

        def report():
            print(self.arr)
            print(self.index)
            print(self.keys)

        # All three containers must hold the same number of entries.
        if len(self.arr) != len(self.keys) or len(self.arr) != len(self.index):
            report()
            return False

        if not self.arr:
            return True

        for idx, item in enumerate(self.arr):
            if self.index[item] != idx:
                report()
                return False

        arr_size = len(self.arr)
        nodeStack = [0]
        while len(nodeStack) != 0:
            parent_idx = nodeStack.pop()

            lchild_idx = 2 * parent_idx + 1
            rchild_idx = 2 * parent_idx + 2

            parent_key = self.keys[self.arr[parent_idx]]

            if lchild_idx > arr_size - 1:
                continue

            left_key = self.keys[self.arr[lchild_idx]]

            if rchild_idx > arr_size - 1:
                if self.compare_check(parent_key, left_key):
                    continue

                report()
                return False

            right_key = self.keys[self.arr[rchild_idx]]

            if self.compare_check(parent_key, left_key) and self.compare_check(parent_key, right_key):
                nodeStack.append(lchild_idx)
                nodeStack.append(rchild_idx)
            else:
                report()
                return False

        return True
```

```python
def test_heap():
    heap = Heap(type="max")

    for i in [2, 5, 3, 8, 9, 7, 4, 11, 1, 40, 23, 41, 52, 15, 13]:
        heap.insert(i)

    print(f'Insert: {"PASSED" if heap.validate() else "FAILED"}')

    for i in [3, 8, 7, 1, 11, 23, 15]:
        heap.delete(i)

    print(f'Delete: {"PASSED" if heap.validate() else "FAILED"}')

    heap.extract_lead()
    print(f'Extract Lead: {"PASSED" if heap.validate() else "FAILED"}')

test_heap()
```

```
Insert: PASSED
Delete: PASSED
Extract Lead: PASSED
```


```python
class MedianManager(Generic[T]):
    """Tracks the running (lower) median with a max-heap and a min-heap.

    Invariant: lowHeap.size is either highHeap.size or highHeap.size + 1,
    so the median is always the max of lowHeap.
    """

    def __init__(self):
        self.lowHeap = Heap[T](type="max")   # smaller half
        self.highHeap = Heap[T](type="min")  # larger half

    @property
    def median(self):
        return self.lowHeap.lead

    @property
    def inspect(self):
        # Debug helper: prints both heaps and the current median.
        low = self.lowHeap.arr
        high = self.highHeap.arr
        print(f'low:\n\tsize: {len(low)}\n\tarr: {low}\nmedian: {self.median}\nhigh:\n\tsize: {len(high)}\n\tarr: {high}')

    def clear(self):
        self.lowHeap = Heap[T](type="max")
        self.highHeap = Heap[T](type="min")

    def append(self, item: T):
        if self.lowHeap.size == 0:
            self.lowHeap.insert(item)
            return

        curr_median = self.median
        highMin = self.highHeap.lead if self.highHeap.lead is not None else curr_median

        # case for pushing into lowHeap
        if item <= curr_median:
            # lowHeap is already one larger: move its max over first.
            if self.lowHeap.size == self.highHeap.size + 1:
                lowMax, _ = self.lowHeap.extract_lead()
                self.highHeap.insert(lowMax)

            self.lowHeap.insert(item)
        # case where the item lies strictly between the two heaps
        elif item < highMin:
            if self.highHeap.size == self.lowHeap.size:
                self.lowHeap.insert(item)   # item becomes the new median
            else:
                self.highHeap.insert(item)
        # case for pushing into highHeap
        else:
            # Sizes are equal: move the min of highHeap over first.
            if self.highHeap.size == self.lowHeap.size:
                highMin, _ = self.highHeap.extract_lead()
                self.lowHeap.insert(highMin)

            self.highHeap.insert(item)
```

```python
import random

def test():
    m = MedianManager[int]()

    data = random.sample(range(1, 100), 27)

    passed = True
    seen_items = []

    for item in data:
        m.append(item)

        seen_items.append(item)
        seen_items.sort()

        # Expected: the ((k + 1) // 2)-th smallest of the k items seen so far.
        if m.median != seen_items[((len(seen_items) + 1) // 2) - 1]:
            passed = False

            m.inspect
            print('FAILED')
            break

    if passed:
        m.inspect
        print('PASSED all inserts')

test()
```

```
low:
	size: 14
	arr: [63, 54, 51, 27, 53, 42, 38, 17, 24, 31, 41, 23, 12, 18]
median: 63
high:
	size: 13
	arr: [68, 71, 70, 80, 74, 78, 96, 95, 93, 82, 75, 89, 79]
PASSED all inserts
```


```python
m = MedianManager[int]()

base = 10_000
modsum = 0

with open("Week 3 Median.txt", mode="r") as f:
    for line in f:
        m.append(int(line))
        # Reduce modulo base as we go so the running sum stays small.
        modsum = (modsum + m.median) % base

print(modsum)
```

```
1213
```
