# Binary Codes


A binary code is a mapping from each character in an alphabet to a binary string.

A standard approach is to use fixed-length codes. We can do better with variable-length codes, but these may lead to ambiguous messages. Therefore we will insist that variable-length codes are prefix-free.

A prefix-free code is an encoding such that, for every pair of characters in the alphabet, neither encoding is a prefix of the other.

An example of a prefix-free code:
$$
\begin{aligned}
\text{A} &\rightarrow 0 \\
\text{B} &\rightarrow 10 \\
\text{C} &\rightarrow 110 \\
\text{D} &\rightarrow 111 \\
\end{aligned}
$$
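
As a quick check of the definition, the sketch below tests whether a code table is prefix-free. The function name `is_prefix_free` and the second, ambiguous code table are assumptions made for illustration; only the first table comes from the example above.

```python
from itertools import permutations

def is_prefix_free(code: dict[str, str]) -> bool:
    """Return True if no codeword is a prefix of a different character's codeword."""
    return not any(
        second.startswith(first)
        for first, second in permutations(code.values(), 2)
    )

# The example code from the text.
prefix_free_code = {"A": "0", "B": "10", "C": "110", "D": "111"}

# Hypothetical variable-length code that is not prefix-free:
# "0" is a prefix of "01", so the message "01" could mean "B" or "AD".
ambiguous_code = {"A": "0", "B": "01", "C": "10", "D": "1"}

print(is_prefix_free(prefix_free_code))  # True
print(is_prefix_free(ambiguous_code))    # False
```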

We will use variable length encodings to take advantage of character frequencies. For instance, encoding more frequent characters with fewer bits would lead to much better compression.

Given an alphabet and character frequencies, how can we optimise the compression of the binary encoding?

## Codes as Trees

We can think of binary codes as binary trees. For instance, consider the following examples.

![exampleTrees](img/binary%20code%20tree.png)

Notice how the second example is variable length, but not prefix-free. We can see this visually: some characters label nodes that are ancestors of other characters' nodes.

![prefixFree](img/prefix%20free%20tree.png)

In general,
1. left child edges -> 0, right child edges -> 1
2. for each character there is exactly one labelled node
3. encoding of a character = bits along the path from the root to its node
4. prefix-free = labelled nodes are the leaves

To decode, follow the tree from the root, one edge per bit, until you hit a leaf node. Output the leaf's character, return to the root, and repeat.
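
The decoding walk can be sketched in Python as follows. The nested-dictionary tree representation and the names `build_tree` and `decode` are assumptions chosen for brevity, and the code table is the prefix-free example from earlier in these notes.

```python
def build_tree(code: dict[str, str]) -> dict:
    """Build a binary tree: internal nodes are dicts keyed by '0'/'1', leaves are characters."""
    root: dict = {}
    for char, bits in code.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = char  # the last bit leads to the leaf
    return root

def decode(bits: str, root: dict) -> str:
    """Follow one edge per bit; on reaching a leaf, emit it and restart at the root."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = root
    return "".join(out)

code = {"A": "0", "B": "10", "C": "110", "D": "111"}
tree = build_tree(code)
encoded = "".join(code[c] for c in "ABCD")
print(encoded)                # 010110111
print(decode(encoded, tree))  # ABCD
```

Because the code is prefix-free, every labelled node is a leaf, so the walk never has to guess where one codeword ends and the next begins.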

Further, note that
$$
\text{encoding length of} \; i \in \Phi = \text{depth of} \; i \; \text{in tree}
$$

Where $i$ is a character from the alphabet $\Phi$.

## Problem Definition

Given a probability $p_i$ for each character $i \in \Phi$ and a tree $T$ whose leaves correspond to the symbols of $\Phi$, define the average encoding length
$$
L(T) = \sum_{i \in \Phi}{(p_i \times [\text{depth of} \; i \; \text{in} \; T])}
$$

We seek to minimise $L(T)$ over all $T$.
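
To make the objective concrete, the sketch below evaluates $L(T)$ for two code tables. The probabilities are hypothetical values picked for illustration, and `average_length` is an assumed helper name; the depth of a leaf is taken to be the length of its codeword.

```python
def average_length(code: dict[str, str], probs: dict[str, float]) -> float:
    """L(T): each codeword length (leaf depth) weighted by its probability."""
    return sum(probs[char] * len(bits) for char, bits in code.items())

# Hypothetical probabilities, not taken from the notes.
probs = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}

variable = {"A": "0", "B": "10", "C": "110", "D": "111"}
fixed = {"A": "00", "B": "01", "C": "10", "D": "11"}

print(round(average_length(variable, probs), 2))  # 1.55
print(round(average_length(fixed, probs), 2))     # 2.0
```

Under these assumed probabilities the variable-length code uses 1.55 bits per character on average, against 2 bits for the fixed-length code.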

## Huffman Codes

There is a natural but suboptimal approach: build the tree top-down with a divide-and-conquer-like algorithm.

Huffman's optimal idea is to build the tree bottom-up using successive mergers.

We will take a greedy approach.

- Which pair of symbols is "safe" to merge?

Observe that the final encoding length of $i$ is equal to the number of mergers that the subtree containing $i$ undergoes. Each merger therefore adds 1 bit to the encoding of every character in the two merged subtrees.

This informs our decision to merge the least frequent symbols first.

How should we recurse once two characters have been merged?

After merging two characters $a$ and $b$, let's introduce a new "hidden-layer" symbol such that
$$
p_{\text{hidden layer}} = p_a + p_b
$$

As a result, each merged symbol carries the total frequency of the characters beneath it.

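Pseudo-code
```
Huffman(S)

    If |S| = 2, return the tree with one leaf per symbol

    Find a, b in S, such that they have the smallest frequencies

    Let S* = S with a and b fused into a single meta-symbol ab

    p_ab = p_a + p_b

    Recurse on S* to compute the tree T*

    Let T = T* with the leaf ab replaced by an internal node with children a and b

    Return T

```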
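
The merging procedure can be sketched iteratively in Python, assuming the standard library `heapq` as the priority queue. The function name `huffman_codes`, the tie-breaking counter and the example frequencies are illustrative choices rather than part of the notes, and the input is assumed to be non-empty.

```python
import heapq
from itertools import count

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build a prefix-free code by repeatedly merging the two lightest subtrees."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}

    tie = count()  # unique tie-breaker so the heap never compares subtrees
    heap = [(p, next(tie), char) for char, p in freqs.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        p_a, _, a = heapq.heappop(heap)  # lightest subtree
        p_b, _, b = heapq.heappop(heap)  # second lightest subtree
        heapq.heappush(heap, (p_a + p_b, next(tie), (a, b)))

    _, _, tree = heap[0]
    codes: dict[str, str] = {}

    def assign(node, prefix: str) -> None:
        # Internal nodes are (left, right) tuples; leaves are characters.
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix

    assign(tree, "")
    return codes

# Hypothetical frequencies, for illustration only.
codes = huffman_codes({"A": 60, "B": 25, "C": 10, "D": 5})
print({char: len(bits) for char, bits in sorted(codes.items())})
# {'A': 1, 'B': 2, 'C': 3, 'D': 3}
```

The exact bit patterns depend on which child is labelled 0, but the codeword lengths are what determine $L(T)$.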

### Proof of Correctness

We aim to show that the binary tree produced by Huffman's algorithm minimises $L(T)$.

Proof by induction on $n = \lvert\Phi\rvert$

Base case $n = 2$
- Tree is optimal with 1 bit per symbol

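Assume the algorithm is optimal for every alphabet of size $k$. For $n=k+1$, let $a$ and $b$ be the two symbols of $\Phi$ with the smallest frequencies.

Let $\Phi^\prime$ be $\Phi$ with $a$, $b$ replaced by a "hidden layer"/"meta-symbol" $ab$

Define $p_{ab} = p_a + p_b$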

Let $T^\prime$ be a tree for $\Phi^\prime$ and $T$ the corresponding tree for $\Phi$, in which the leaf $ab$ is replaced by an internal node with children $a$ and $b$. For every such pair $T^\prime$ and $T$,
$$
\begin{aligned}
L(T) - L(T^\prime) &= p_a \times [\text{depth of} \; a \; \text{in} \; T] \\
&+ p_b \times [\text{depth of} \; b \; \text{in} \; T] \\
&- p_{ab} \times [\text{depth of} \; ab \; \text{in} \; T^\prime]
\end{aligned}
$$
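Note that
$$
p_{ab} = p_a + p_b
$$
Further, let
$$
\begin{aligned}
\text{depth of} \; a \; \text{in} \; T &= d \\
\text{depth of} \; b \; \text{in} \; T &= d \\
\text{depth of} \; ab \; \text{in} \; T^\prime &= d -1
\end{aligned}
$$
Therefore
$$
\begin{aligned}
L(T) - L(T^\prime) &= dp_a + dp_b - (p_a + p_b)(d -1) \\
&= p_a + p_b
\end{aligned}
$$

This difference does not depend on the shape of the tree, so minimising $L(T^\prime)$ also minimises $L(T)$ among all trees in which $a$ and $b$ are siblings.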

It remains to show, by an exchange argument, that some optimal tree has $a$ and $b$ as siblings.

Let $T^*$ be any tree that minimises $L(T)$ for $\Phi$.

Let $x,y$ be siblings at the deepest level of $T^*$.

Obtain a new tree $\hat{T}$ from $T^*$ by swapping
1. $a \leftrightarrow x$
2. $b \leftrightarrow y$

Note that after performing the swap, $a$ and $b$ are siblings in $\hat{T}$, so $\hat{T}$ corresponds to a tree containing the "meta-node" $ab$.

We will show that 
$$
L(\hat{T}) \leq L(T^*)
$$

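Since,
$$
\begin{aligned}
L(T^*) - L(\hat{T}) &= (p_x - p_a) \times [\text{depth of} \; x \; \text{in} \; T^* - \text{depth of} \; a \; \text{in} \; T^*] \\
&+ (p_y - p_b)\times [\text{depth of} \; y \; \text{in} \; T^* - \text{depth of} \; b \; \text{in} \; T^*]
\end{aligned}
$$

Note that since $a$ and $b$ have the smallest frequencies
$$
\begin{aligned}
(p_x - p_a) &\geq 0 \\
(p_y - p_b) &\geq 0
\end{aligned}
$$
Further, since $x$ and $y$ are at the deepest level of $T^*$, they are at least as deep as $a$ and $b$
$$
\begin{aligned}
[\text{depth of} \; x \; \text{in} \; T^* - \text{depth of} \; a \; \text{in} \; T^*] &\geq 0 \\
[\text{depth of} \; y \; \text{in} \; T^* - \text{depth of} \; b \; \text{in} \; T^*] &\geq 0
\end{aligned}
$$

Therefore
$$
L(T^*) - L(\hat{T}) \geq 0
$$

Since $T^*$ is optimal, $\hat{T}$ is optimal too, and $a$ and $b$ are siblings in it. Combined with the inductive hypothesis and the identity $L(T) - L(T^\prime) = p_a + p_b$, the tree returned by Huffman's algorithm minimises $L(T)$.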

### Running Time

Naive implementation: $O(n^2)$ time for $n = \lvert\Phi\rvert$

Heap implementation: $O(n\log{n})$
1. keys = frequencies
2. after each merge, insert the new meta-symbol back into the heap

Even faster: Sorting + $O(n)$ additional work
1. manage meta symbols using two queues
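
A sketch of the two-queue idea, assuming integer weights and using `collections.deque`. The function name `huffman_cost_two_queues` is hypothetical, and the sketch returns only the total weighted codeword length $\sum_i w_i \times \text{depth}_i$ rather than the tree. It relies on the fact that merged weights are produced in nondecreasing order, so the smallest remaining weight is always at the front of one of the two queues.

```python
from collections import deque

def huffman_cost_two_queues(weights: list[int]) -> int:
    """Total weighted codeword length of a Huffman code, via two queues."""
    leaves = deque(sorted(weights))  # sorting is the only O(n log n) step
    merged: deque[int] = deque()     # meta-symbol weights, in creation order

    def pop_smallest() -> int:
        # The overall minimum sits at the front of one of the two queues.
        if not merged or (leaves and leaves[0] <= merged[0]):
            return leaves.popleft()
        return merged.popleft()

    cost = 0
    while len(leaves) + len(merged) > 1:
        combined = pop_smallest() + pop_smallest()
        merged.append(combined)
        cost += combined  # a merge adds one bit to every symbol beneath it
    return cost

# Hypothetical weights, for illustration only.
print(huffman_cost_two_queues([60, 25, 10, 5]))  # 155
```

Each queue operation is $O(1)$, so the work after sorting is linear in $n$.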

# Dynamic Programming

Idea:

- Reason about the structure of an optimal solution in terms of optimal solutions to smaller sub-problems.
- Cache / memoise past sub-problem results

Key Ingredients:

1. Identify a small number of subproblems.
2. Quickly and correctly solve larger subproblems given the solutions to smaller subproblems.
3. After solving all subproblems, quickly compute the final solution.

![whyDynamic](img/WhyDynamic.png)

## Weighted Independent Sets in Path Graphs

Given a path graph $G = (V, E)$ with nonnegative weights on vertices, return a subset of nonadjacent vertices (an independent set) of maximum total weight.

Let $S \subseteq V$ be a max-weight independent set (IS).

Let $v_n$ be the last vertex of the path.

There are two cases: either $v_n \notin S$ (case 1), or $v_n \in S$ (case 2).

Case 1.

Let $G^\prime=G$ with $v_n$ deleted. Note:
1. $S$ is also an IS of $G^\prime$
2. $S$ must be a max-weight IS of $G^\prime$

Case 2.

Since $v_n \in S$, $v_{n-1} \notin S$.

Let $G^{\prime\prime} = G$ with $v_{n-1}$, $v_n$ deleted. Note:
1. $S - \{v_n\}$ is an IS of $G^{\prime\prime}$
2. $S - \{v_n\}$ is also a max-weight IS of $G^{\prime\prime}$

    - If there were some $S^*$ with larger weight than $S - \{v_n\}$ in $G^{\prime\prime}$, then $S^* \cup \{v_n\}$ would be an IS of $G$ with larger weight than $S$, which is a contradiction


Therefore a max-weight IS must be either
1. a max-weight IS of $G^\prime$
2. $v_n$ + a max-weight IS of $G^{\prime\prime}$

This allows us to proceed recursively. The upshot is that, although naive recursion makes exponentially many recursive calls, there are only $O(n)$ distinct subproblems, since each recursive call only plucks vertices off the right end of the path.
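
One way to see the effect of caching is a memoised version of the recursion, sketched here with `functools.lru_cache`. The function name `mwis_value_top_down` and the example weights are assumptions; the vertices are stored 0-indexed in a list, so `weights[i - 1]` is the weight of $v_i$.

```python
from functools import lru_cache

def mwis_value_top_down(weights: list[int]) -> int:
    """Weight of a max-weight independent set of a path, by memoised recursion."""

    @lru_cache(maxsize=None)
    def best(i: int) -> int:
        # best(i) = optimal value using only the first i vertices.
        if i <= 0:
            return 0
        skip_last = best(i - 1)                   # case 1: v_i not in S
        take_last = best(i - 2) + weights[i - 1]  # case 2: v_i in S
        return max(skip_last, take_last)

    return best(len(weights))

# Hypothetical path with vertex weights 1, 4, 5, 4.
print(mwis_value_top_down([1, 4, 5, 4]))  # 8
```

Each value of `i` is computed once, so the number of distinct calls grows linearly with the path length. For long paths the recursion depth may exceed Python's default limit, which is one reason to prefer an iterative formulation.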

This gives a linear-time algorithm.

Let $G_i$ be the first $i$ vertices of $G$.

Populate an array A, such that
```
A[i] = value of a max-weight IS of G_i
```

Initialise 
```
A[0] = 0, A[1] = w_1
```

Main loop
```
For i = 2, 3, ..., n:
    A[i] = max(A[i-1], A[i-2] + w_i)
```
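
A direct Python rendering of the array-filling loop, as a sketch. The function name `mwis_values` and the example weights are assumptions; since Python lists are 0-indexed, `weights[i - 1]` holds $w_i$.

```python
def mwis_values(weights: list[int]) -> list[int]:
    """A[i] = max total weight of an independent set among the first i vertices."""
    n = len(weights)
    A = [0] * (n + 1)
    if n >= 1:
        A[1] = weights[0]
    for i in range(2, n + 1):
        # Either skip v_i, or take v_i and skip v_{i-1}.
        A[i] = max(A[i - 1], A[i - 2] + weights[i - 1])
    return A

# Hypothetical path with vertex weights 1, 4, 5, 4.
print(mwis_values([1, 4, 5, 4]))  # [0, 1, 4, 6, 8]
```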

In order to return the independent set itself, we could augment the array to store the vertices as well. An alternative is to reconstruct the optimal solution by tracing back through the filled-in array A.

Starting with the rightmost vertex (treating A[-1] as 0),
```
i = n
While i >= 1:
    if A[i-1] >= A[i-2] + w_i (case 1):
        do not include v_i
        
        i -= 1
    else (case 2):
        include v_i

        i -= 2
```
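
The trace-back can be sketched in Python as below. The function name `mwis_reconstruct` and the example weights are assumptions; the sketch fills the array first so that it is self-contained, and it returns 1-indexed vertex numbers.

```python
def mwis_reconstruct(weights: list[int]) -> set[int]:
    """Return the 1-indexed vertices of one max-weight independent set."""
    n = len(weights)

    # Forward pass: A[i] = optimal value among the first i vertices.
    A = [0] * (n + 1)
    if n >= 1:
        A[1] = weights[0]
    for i in range(2, n + 1):
        A[i] = max(A[i - 1], A[i - 2] + weights[i - 1])

    # Backward pass: decide for each vertex which case produced A[i].
    chosen: set[int] = set()
    i = n
    while i >= 1:
        without_vi = A[i - 1]
        with_vi = (A[i - 2] if i >= 2 else 0) + weights[i - 1]
        if without_vi >= with_vi:  # case 1: v_i is not needed
            i -= 1
        else:                      # case 2: v_i is in the set
            chosen.add(i)
            i -= 2
    return chosen

# Hypothetical path with vertex weights 1, 4, 5, 4.
print(sorted(mwis_reconstruct([1, 4, 5, 4])))  # [2, 4]
```

The backward pass does constant work per step, so reconstruction adds only $O(n)$ time on top of filling the array.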

# Programming Assignment

# 1, 2
Code up the greedy algorithm from the lectures on Huffman coding.

The text file 'Week 3 huffman.txt' has the following format:

[num of symbols]

[weight of symbol 1]

[weight of symbol 2]

...

Run the Huffman coding algorithm from the lecture on this data set.

What is the maximum length of a codeword in the resulting Huffman code?

What is the minimum length codeword?

In [30]:
def load_data():
    with open('Week 3 huffman.txt') as f:
        num_symbols = next(f)
        return int(num_symbols), [int(x) for x in f]
    
num_symbols, weights = load_data()

In [31]:
from local_utils import Heap

def huffman(num_symbols: int, weights: list[int]):
    # Min-heap of subtrees keyed by total weight. A leaf is an int symbol id;
    # an internal node is a (left, right) tuple.
    char_heap = Heap[tuple[int]](type="min")

    char_list = list(range(num_symbols))
    char_heap.build_heap(
        char_list, 
        {char: weight for char, weight in zip(char_list, weights)}
    )

    # Repeatedly merge the two lightest subtrees into one.
    while char_heap.size != 1:
        min1, key1 = char_heap.extract_lead()
        min2, key2 = char_heap.extract_lead()

        char_heap.insert((min1, min2), key1+key2)
    
    # extract_lead returns (item, key); keep only the tree.
    final_tree, _ = char_heap.extract_lead()

    min_depth = None
    max_depth = 0

    def recurse(tree, depth):
        nonlocal max_depth
        nonlocal min_depth

        # A leaf: its depth is the length of its codeword.
        if not isinstance(tree, tuple):
            max_depth = max(max_depth, depth)
            min_depth = depth if min_depth is None else min(min_depth, depth)
            return

        left, right = tree
        recurse(left, depth+1)
        recurse(right, depth+1)
        
    recurse(final_tree, 0)

    return max_depth, min_depth


In [32]:
huffman(num_symbols, weights)

(19, 9)

# 3 

Code up the dynamic programming algorithm for computing a maximum-weight independent set of a path graph.

The file 'Week 3 mwis.txt' has the following format

[num of vertices]

[weight of first vertex]

[weight of second vertex]

...

The task is to run the dynamic programming algorithm and reconstruction procedure.

Of the vertices
$$
1, 2, 3, 4, 17, 117, 517, 997
$$

which ones belong to the maximum-weight independent set?

Note that vertex 1 is the first vertex, there is no zeroth vertex.

In the answer box, enter an 8-bit string where the $i^{th}$ bit is 1 if the $i^{th}$ of these 8 vertices is in the MWIS.

In [34]:
def load_data():
    with open('Week 3 mwis.txt') as f:
        num_vertices = next(f)
        return int(num_vertices), [int(x) for x in f]

num_vertices, weights = load_data()

In [56]:
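def mwis(num_vertices: int, weights: list[int]):
    # memo[k] = max weight of an independent set among the first k vertices.
    memo = [0, weights[0]]
    for i in range(1, num_vertices):
        memo.append(max(memo[i], memo[i -1] + weights[i]))
    
    # Trace back through memo to recover the chosen vertices (0-indexed).
    inde_set = set()
    k = num_vertices

    while k >= 1:
        # memo[k-2] does not exist for k == 1; the empty prefix has weight 0.
        skip_two = memo[k-2] if k >= 2 else 0
        if memo[k-1] >= skip_two + weights[k -1]:
            k -= 1
        else:
            inde_set.add(k -1)
            k -= 2

    return inde_set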

inde_set = mwis(num_vertices, weights)

In [57]:
"".join(["1" if x-1 in inde_set else "0" for x in [1, 2, 3, 4, 17, 117, 517, 997] ])

'10100110'