# Set up the environment

## Import packages

In [1]:
import os

# Introduction

## Motivation
This notebook is inspired by the information theory [lectures by David MacKay](https://www.youtube.com/playlist?list=PLN3p8NUNcClDu1hc2m5cVp8FOEmuF3vRy). I want to implement the [Huffman coding algorithm](https://en.wikipedia.org/wiki/Huffman_coding) for data compression and use it to compress the [human genome](https://en.wikipedia.org/wiki/Human_genome) sequence. Why do it? Well, I'm greatly fascinated by the [Arithmetic coding](https://en.wikipedia.org/wiki/Arithmetic_coding) algorithm for data compression, and Huffman's algorithm would be a great baseline to compare it to.

## Brief overview of related concepts

### Symbol coding problem

In the symbol coding problem, we have a set of symbols $S = \{s_1, s_2, \ldots, s_n\}$ with probabilities $P = \{p_1, p_2, \ldots, p_n\}$, and we want to encode them using binary strings (codewords) $C = \{c_1, c_2, \ldots, c_n\}$ such that the expected codeword length is minimized. The expected codeword length $L$ is given by:

$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i), \tag{1}
$$

where $l(c_i)$ is the length of codeword $c_i$. Let's look at an example to illustrate this.

---
**Example 1:** Let's say we have a set of symbols $S = \{A, B, C, D\}$ with probabilities $P = \{0.5, 0.25, 0.125, 0.125\}$. We want to find the optimal codewords $C$ that minimize the expected codeword length.

a) One possible solution is to assign the following codewords:
- A: 1000
- B: 0100
- C: 0010
- D: 0001

The expected codeword length $L$ is calculated as follows:
$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i) = 0.5 \cdot 4 + 0.25 \cdot 4 + 0.125 \cdot 4 + 0.125 \cdot 4 = 4
$$

b) Another possible solution is to assign the following codewords:
- A: 1
- B: 01
- C: 001
- D: 0001

The expected codeword length $L$ is calculated as follows:
$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i) = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 4 = 1.875
$$

c) A third possible solution is to assign the following codewords:
- A: 1
- B: 00
- C: 010
- D: 10

The expected codeword length $L$ is calculated as follows:
$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i) = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 2 = 1.625
$$

---
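
The three expected lengths in Example 1 can be checked numerically. The snippet below is a minimal sketch: the helper name `expected_length` and the dictionary `solutions` are introduced here for illustration only and are not part of the original notebook.

```python
def expected_length(probs, codewords):
    """Expected codeword length L = sum_i p_i * l(c_i), as in equation (1)."""
    return sum(p * len(c) for p, c in zip(probs, codewords))

# Probabilities of the symbols A, B, C, D from Example 1.
probs = [0.5, 0.25, 0.125, 0.125]

# Codewords of solutions (a), (b) and (c), listed in the order A, B, C, D.
solutions = {
    "a": ["1000", "0100", "0010", "0001"],
    "b": ["1", "01", "001", "0001"],
    "c": ["1", "00", "010", "10"],
}

lengths = {name: expected_length(probs, cw) for name, cw in solutions.items()}
for name, value in lengths.items():
    print(f"solution ({name}): L = {value}")
```

Solution (c) has the smallest $L$ of the three, but a short expected length alone does not make a code usable.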

Moreover, we want our code to have several useful properties:
- **Uniquely decodable**: for any two symbol strings $x$ and $y$ such that $x \neq y$, their encodings must differ: $C(x) \neq C(y)$. In plain English, this means that we can always decode a string of codewords back to the original symbols without ambiguity. In this light, solution (c) in Example 1 is not uniquely decodable, because both strings `DC` and `ABD` are encoded as $C(DC)=C(ABD) = 10010$.
- **Minimal expected codeword length**: we want to minimize the expected codeword length $L$.
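
The ambiguity of solution (c) can be reproduced with a brute-force search. This is only a sketch: `encode` and `find_collision` are hypothetical helpers, and the search is bounded by `max_len`, so it can find a counterexample but finding none does not prove that a code is uniquely decodable.

```python
from itertools import product

def encode(message, code):
    """Concatenate the codewords of the symbols in `message`."""
    return "".join(code[s] for s in message)

def find_collision(code, max_len=3):
    """Search all messages of up to `max_len` symbols for two that encode identically.

    Returns (message_1, message_2, bits) for the first collision found, else None.
    """
    seen = {}  # maps an encoded bit string to the first message that produced it
    for n in range(1, max_len + 1):
        for msg in product(sorted(code), repeat=n):
            bits = encode(msg, code)
            if bits in seen:
                return "".join(seen[bits]), "".join(msg), bits
            seen[bits] = msg
    return None

code_b = {"A": "1", "B": "01", "C": "001", "D": "0001"}  # solution (b)
code_c = {"A": "1", "B": "00", "C": "010", "D": "10"}    # solution (c)

print(encode("DC", code_c), encode("ABD", code_c))  # the pair from the text
print(find_collision(code_c))  # some colliding pair for solution (c)
print(find_collision(code_b))  # no collision among messages this short
```

In solution (b) no codeword is a prefix of another, which is why no two messages can collide there.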

**Note:** [ASCII](https://en.wikipedia.org/wiki/ASCII) code is another interesting example of symbol coding.

### Source coding theorem

A question arises: what is the minimum expected codeword length $L$ that we can achieve? The answer is given by [Shannon's source coding theorem](https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem), which states that the expected codeword length $L$ of any uniquely decodable code is bounded below by the entropy $H$ of the source:

$$
L \geq H, \tag{2}
$$

where the entropy $H$ is defined as:

$$
H = -\sum_{i=1}^{n} p_i \log_2 p_i, \tag{3}
$$

This means that no lossless compression scheme can achieve an expected codeword length less than the entropy of the source.
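
As a quick check, equation (3) can be evaluated for the distribution from Example 1. The function name `entropy` is an illustrative choice, not something defined elsewhere in this notebook.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, equation (3); zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]  # symbols A, B, C, D from Example 1
H = entropy(probs)
print(f"H = {H} bits per symbol")
```

For this source $H = 1.75$ bits, so solution (b) from Example 1, with $L = 1.875$, is uniquely decodable but still above the bound.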

Comparing equations (1) and (3), we can see that the equality $L = H$ holds when the codeword length is $l(c_i) = -\log_2 p_i$ for every symbol $s_i$. However, this is not always possible, because codeword lengths must be integers while $-\log_2 p_i$ generally is not. Therefore, in general we can only achieve an $L$ that is close to $H$.
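
To see the effect of the integer constraint, the sketch below compares the ideal lengths $-\log_2 p_i$ with the rounded-up lengths $\lceil -\log_2 p_i \rceil$. Rounding up is one simple way to obtain integer lengths (often called Shannon code lengths) and is an assumption made here; it is not necessarily optimal. The second distribution, `skewed`, is a made-up example.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, equation (3)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def ideal_lengths(probs):
    """Real-valued lengths -log2(p_i) that would give L = H exactly."""
    return [-math.log2(p) for p in probs]

def rounded_lengths(probs):
    """Integer lengths ceil(-log2(p_i)); an assumed, not necessarily optimal, choice."""
    return [math.ceil(-math.log2(p)) for p in probs]

def expected_length(probs, lengths):
    """Expected codeword length, equation (1), given the lengths directly."""
    return sum(p * l for p, l in zip(probs, lengths))

dyadic = [0.5, 0.25, 0.125, 0.125]  # Example 1: all p_i are powers of 1/2
skewed = [0.4, 0.3, 0.2, 0.1]       # hypothetical source, not from the notebook

for name, probs in [("dyadic", dyadic), ("skewed", skewed)]:
    lengths = rounded_lengths(probs)
    print(name, "ideal:", [round(l, 3) for l in ideal_lengths(probs)])
    print(name, "rounded:", lengths)
    print(name, "H =", round(entropy(probs), 4),
          "L =", round(expected_length(probs, lengths), 4))
```

For the dyadic source the ideal lengths are already integers, so $L = H$. For the other source rounding up costs some extra bits, but since each length grows by less than one bit, $H \leq L < H + 1$.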