# Set up the environment

## Import packages

In [1]:
from __future__ import annotations  # __future__ imports must come before any other statement
import os
import heapq
from dataclasses import dataclass
from typing import Dict, Optional, List, Tuple
from itertools import count
import numpy as np
import pandas as pd
import struct
from collections import Counter

## Helping functions

In [2]:
def print_tree(node, prefix="", is_last=True):
    """Print tree structure as text"""
    if node is None:
        return
    
    # Print current node
    connector = "└── " if is_last else "├── "
    if node.symbol:
        print(f"{prefix}{connector}{node.symbol} ({node.weight})")
    else:
        print(f"{prefix}{connector}* ({node.weight if node.weight else 'N/A'})")
    
    # Update prefix for children
    new_prefix = prefix + ("    " if is_last else "│   ")
    
    # Print children
    if node.left or node.right:
        if node.left:
            print_tree(node.left, new_prefix, not bool(node.right))
        if node.right:
            print_tree(node.right, new_prefix, True)

# Introduction

## Motivation
This notebook is inspired by the information theory [lectures by David MacKay](https://www.youtube.com/playlist?list=PLN3p8NUNcClDu1hc2m5cVp8FOEmuF3vRy). I want to implement the [Huffman coding algorithm](https://en.wikipedia.org/wiki/Huffman_coding) for data compression and use it to compress a [human genome](https://en.wikipedia.org/wiki/Human_genome) sequence. Why do it? Well, I'm greatly fascinated by the [arithmetic coding](https://en.wikipedia.org/wiki/Arithmetic_coding) algorithm for data compression, and Huffman's algorithm makes a great baseline to compare against.

## Brief overview of related concepts

### Symbol coding problem

In the symbol coding problem we have a set of symbols $S = \{s_1, s_2, \ldots, s_n\}$ with probabilities $P = \{p_1, p_2, \ldots, p_n\}$ and we want to encode them using binary strings (codewords) $C = \{c_1, c_2, \ldots, c_n\}$ such that the expected codeword length is minimized. The expected codeword length $L$ is given by:

$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i), \tag{1}
$$

where $l(c_i)$ is the length of codeword $c_i$. Let's look at an example to illustrate this.

#### Example 1
Let's say we have a set of symbols $S = \{A, B, C, D\}$ with probabilities $P = \{0.5, 0.25, 0.125, 0.125\}$. We want to find the optimal codewords $C$ that minimize the expected codeword length.

a) One possible solution is to assign the following codewords:
- A: 1000
- B: 0100
- C: 0010
- D: 0001

The expected codeword length $L$ is calculated as follows:
$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i) = 0.5 \cdot 4 + 0.25 \cdot 4 + 0.125 \cdot 4 + 0.125 \cdot 4 = 4
$$

b) We can do much better if we consider only codewords of size 2:

- A: 00
- B: 01
- C: 10
- D: 11

The expected codeword length $L$ is calculated as follows:
$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i) = 0.5 \cdot 2 + 0.25 \cdot 2 + 0.125 \cdot 2 + 0.125 \cdot 2 = 2
$$

c) Another possible solution is to assign the following codewords:
- A: 1
- B: 01
- C: 001
- D: 0001

The expected codeword length $L$ is calculated as follows:
$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i) = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 4 = 1.875
$$

d) Finally we can try to go for an even shorter expected length:
- A: 1
- B: 00
- C: 010
- D: 10

The expected codeword length $L$ is calculated as follows:
$$
L = \sum_{i=1}^{n} p_i \cdot l(c_i) = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 2 = 1.625
$$


It is important to mention that we want our coding to have several useful properties:
- **Uniquely decodable**: for any strings $x$ and $y$ such that $x \neq y$, the encodings $C(x)$ and $C(y)$ must be different: $C(x) \neq C(y)$. In plain English this means that we can always decode a string of codewords back to the original symbols without ambiguity. In this light, solution (d) in Example 1 is not uniquely decodable because both strings `DC` and `ABD` are encoded as $C(DC)=C(ABD) = 10010$.
- **Minimal expected codeword length**: we want to minimize the expected codeword length $L$.

**Note:** [ASCII](https://en.wikipedia.org/wiki/ASCII) code is another interesting example of symbol coding.
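
The candidate codes from Example 1 can also be checked numerically. The following is a minimal sketch and not part of the original notebook: the helper names `avg_length`, `is_prefix_free` and `find_collision` are hypothetical, and the collision search is a brute-force scan over short messages only, so finding no collision is evidence rather than proof.

```python
from itertools import product
from typing import Dict, Optional, Tuple

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code_third = {"A": "1", "B": "01", "C": "001", "D": "0001"}  # third candidate
code_last = {"A": "1", "B": "00", "C": "010", "D": "10"}     # last candidate

def avg_length(code: Dict[str, str], probs: Dict[str, float]) -> float:
    """Expected codeword length, equation (1)."""
    return sum(probs[s] * len(word) for s, word in code.items())

def is_prefix_free(code: Dict[str, str]) -> bool:
    """True if no codeword is a prefix of a different symbol's codeword."""
    words = list(code.values())
    return not any(i != j and b.startswith(a)
                   for i, a in enumerate(words)
                   for j, b in enumerate(words))

def find_collision(code: Dict[str, str], max_symbols: int = 3) -> Optional[Tuple[str, str]]:
    """Brute-force search for two different messages with the same encoding."""
    seen: Dict[str, str] = {}
    for n in range(1, max_symbols + 1):
        for msg in product(code, repeat=n):
            text = "".join(msg)
            bits = "".join(code[s] for s in msg)
            if bits in seen:
                return seen[bits], text
            seen[bits] = text
    return None

for name, code in [("third", code_third), ("last", code_last)]:
    print(name, avg_length(code, probs), is_prefix_free(code), find_collision(code))
```

A prefix-free code is always uniquely decodable, which is why the search finds no collision for the third candidate, while the last candidate buys its shorter expected length with ambiguity.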

### Source coding theorem

A question arises: what is the minimum expected codeword length $L$ that we can achieve? The answer is given by [Shannon's source coding theorem](https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem), which states that the expected codeword length $L$ of any uniquely decodable code is bounded from below by the entropy $H$ of the source:

$$
L \geq H, \tag{2}
$$

where the entropy $H$ is defined as:

$$
H = -\sum_{i=1}^{n} p_i \log_2 p_i, \tag{3}
$$

This means that no lossless compression scheme can achieve an expected codeword length less than the entropy of the source.

By comparing equations (1) and (3) we can see that equality $L = H$ holds when the codeword length is $l(c_i) = -\log_2 p_i$ for all symbols $s_i$. However, this is not always possible because codeword lengths must be integers. Therefore, in general we can only achieve an $L$ that is close to $H$.
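
To see the effect of the integer constraint, the sketch below compares the entropy with the expected length obtained when each ideal length $-\log_2 p_i$ is rounded up to a whole number of bits. Rounding up is an assumption made here for illustration (it is not how Huffman coding picks lengths), the function names are hypothetical, and the `skewed` distribution is made up.

```python
import math
from typing import List

def entropy_bits(probs: List[float]) -> float:
    """Entropy H in bits, equation (3)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rounded_lengths(probs: List[float]) -> List[int]:
    """Ideal lengths -log2(p) rounded up to whole bits (requires every p > 0)."""
    return [math.ceil(-math.log2(p)) for p in probs]

def avg_rounded_length(probs: List[float]) -> float:
    """Expected length, equation (1), using the rounded-up lengths."""
    return sum(p * l for p, l in zip(probs, rounded_lengths(probs)))

dyadic = [0.5, 0.25, 0.125, 0.125]  # probabilities from Example 1
skewed = [0.7, 0.2, 0.1]            # hypothetical non-dyadic distribution

for dist in (dyadic, skewed):
    print(dist, rounded_lengths(dist), entropy_bits(dist), avg_rounded_length(dist))
```

For the dyadic distribution every $-\log_2 p_i$ is already an integer, so $L = H$. For the skewed one the rounding costs extra bits, but less than one bit per symbol.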

### Huffman Coding

[Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding) is a popular algorithm used for lossless data compression. The basic idea is to assign variable-length codes to input characters, with shorter codes assigned to more frequent characters. This results in a prefix-free binary code, meaning no code is a prefix of any other, which allows for efficient encoding and decoding.

The algorithm works by building a binary tree where each leaf node represents a symbol and its probability/frequency. The tree is constructed by repeatedly merging the two nodes with the lowest probabilities until only one node remains, which becomes the root of the tree. The code for each symbol is then determined by traversing the tree from the root to the leaf, assigning a '0' for a left branch and a '1' for a right branch.

#### Example 2

Consider the "alphabet" from the previous example, where the symbol probabilities are

```text
A: 0.5
B: 0.25
C: 0.125
D: 0.125
```

Then the binary tree for this alphabet would look like this:

```text
        *
       / \
      /   \
     *     A(0.5)
    / \
   /   \
B(0.25) *
       / \
      /   \
  C(0.125) D(0.125)
```

The algorithm assigns codes by traversing from root to leaf: left=0, right=1. So the resulting code for each symbol would be:

```text
A: 1
B: 00
C: 010
D: 011
```

An average code length $L$ can be calculated as follows:

$$
L = \sum_{i=1}^4 p_i \cdot l_i = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3  = 1.75
$$

For comparison, entropy $H$ can be calculated as follows:

$$
H = - \sum_{i=1}^4 p_i \cdot \log_2p_i = \frac{1}{2} \log_22 + \frac{1}{4} \log_24 + \frac{1}{8} \log_28 + \frac{1}{8} \log_28 = 1.75
$$

As we can see, the average code length $L$ is equal to the entropy $H$ for this particular example. This is a special case and may not hold for all probability distributions, but it illustrates the relationship between Huffman coding and entropy.
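
To illustrate a case where $L$ and $H$ differ, the sketch below computes Huffman codeword lengths (not the codewords themselves) for a made-up non-dyadic distribution. It is a compact illustration under my own assumptions: `huffman_lengths` is a hypothetical helper that merges groups of symbols with `heapq` and adds one bit to every symbol in a merged group.

```python
import heapq
import math
from itertools import count
from typing import Dict

def huffman_lengths(probs: Dict[str, float]) -> Dict[str, int]:
    """Codeword length per symbol; each merge adds one bit to all merged symbols."""
    tie = count()  # tie-breaker so equal weights never compare the lists
    heap = [(p, next(tie), [s]) for s, p in probs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}  # a single-symbol alphabet keeps length 0 here
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), group1 + group2))
    return lengths

skewed_probs = {"A": 0.7, "B": 0.2, "C": 0.1}  # hypothetical distribution
skewed_lengths = huffman_lengths(skewed_probs)
L_skewed = sum(skewed_probs[s] * skewed_lengths[s] for s in skewed_probs)
H_skewed = -sum(p * math.log2(p) for p in skewed_probs.values())
print(skewed_lengths, L_skewed, H_skewed)
```

Here the expected length is strictly larger than the entropy, because no assignment of integer lengths can match $-\log_2 p_i$ for these probabilities.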


### Heap

A heap is a specialized tree-based data structure that satisfies the heap property. In a min-heap, for example, the key at a parent node is always less than or equal to the keys of its children, and the smallest key is at the root. This property makes heaps useful for implementing priority queues, where the highest (or lowest) priority element can be efficiently accessed.

#### Operation cost

The time complexity for the main operations on a heap is as follows:

- Insertion: O(log n)
- Deletion (removing the root): O(log n)
- Accessing the minimum element: O(1)

#### Example 3

Let's use a min-heap to illustrate the heap property. Consider the following sequence of numbers: 5, 3, 8, 1, 4. We can insert these numbers into a min-heap as follows:
```text
1. Insert 5:       5
2. Insert 3:       3
                  /
                 5
3. Insert 8:       3
                  / \
                 5   8
4. Insert 1:       1
                  / \
                 3   8
                /
               5
5. Insert 4:       1
                  / \
                 3   8
                / \
               5   4
6. Remove 1:       3
                  / \
                 4   8
                /
               5
```
In this min-heap, the smallest element is always at the root (1 before the removal, 3 after it), and the heap property is maintained at each level.

Let's use `heapq` in Python to demonstrate heap operations.

In [3]:
import heapq

# Create a min-heap
heap = []

# Insert elements into the heap
heapq.heappush(heap, 5)
heapq.heappush(heap, 3)
heapq.heappush(heap, 8)
heapq.heappush(heap, 1)
heapq.heappush(heap, 4)

# Display the heap
print("Heap:", heap)

# Remove the smallest element
smallest = heapq.heappop(heap)
print("Removed smallest element:", smallest)
print("Heap after removal:", heap)

Heap: [1, 3, 8, 5, 4]
Removed smallest element: 1
Heap after removal: [3, 4, 8, 5]
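
When a heap is used as a priority queue, the entries are usually tuples whose first element is the priority. The sketch below (not from the original notebook; the `tasks` data and variable names are made up) shows why a unique tie-breaker is commonly placed between the priority and the payload: without it, two equal priorities make Python compare the payloads themselves, which fails for objects that define no ordering.

```python
import heapq
from itertools import count

tasks = [(0.25, "B"), (0.5, "A"), (0.125, "D"), (0.125, "C")]

# Without a tie-breaker: equal weights fall through to comparing the dicts
try:
    bad_queue = []
    for weight, label in tasks:
        heapq.heappush(bad_queue, (weight, {"label": label}))
    comparison_failed = False
except TypeError:
    comparison_failed = True

# With a tie-breaker: (weight, unique integer, payload)
tie = count()
queue = []
for weight, label in tasks:
    heapq.heappush(queue, (weight, next(tie), {"label": label}))

pop_order = [heapq.heappop(queue)[2]["label"] for _ in range(len(tasks))]
print(comparison_failed, pop_order)
```

With the counter in place, entries of equal weight come out in insertion order, which also makes the result deterministic.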


### FASTA and FASTQ formats

The [FASTA](https://en.wikipedia.org/wiki/FASTA_format) and [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) file formats are widely used for storing biological sequence data.

- The FASTA format is a text-based format for representing nucleotide sequences. Each entry in a FASTA file begins with a header line starting with '>', followed by the sequence identifier. The sequence itself is represented on the following lines. Here's a simple example of a FASTA entry:

```text
>SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
```

- The FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequences) and its corresponding quality scores. Each entry in a FASTQ file consists of four lines:

1. A header line starting with '@' followed by a sequence identifier.
2. The raw sequence of nucleotides.
3. A separator line starting with '+' (optionally followed by the same sequence identifier).
4. A line of quality scores corresponding to the sequence, encoded as ASCII characters.

Here's a simple example of a FASTQ entry:

```text
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

In this example, the sequence is `GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT` and the quality scores are represented by the ASCII characters `!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65`.
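
Since the goal is to estimate symbol frequencies from sequence data, here is a minimal sketch of reading FASTA text and counting nucleotides. The `parse_fasta` helper is hypothetical and deliberately simple: it assumes plain-text input, uses the whole header line as the record name, and performs no validation. The example splits the sequence shown above across two lines, as FASTA files commonly do.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without '>') to its sequence, joined across lines."""
    records: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            current = line[1:].strip()    # start a new record
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}

fasta_text = """>SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAA
TAGTAAATCCATTTGTTCAACTCACAGTTT
"""

fasta_records = parse_fasta(fasta_text.splitlines())
base_counts = Counter(fasta_records["SEQ_ID"])
total_bases = sum(base_counts.values())
base_freqs = {base: c / total_bases for base, c in base_counts.items()}
print(base_counts, base_freqs)
```

The resulting frequency dictionary has the same shape as the symbol weights used for Huffman coding later in this notebook.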


# Let's play around

## Huffman coding algorithm

### Description 

Huffman coding algorithm follows these steps:

1. Build a priority queue (min-heap) of nodes, where each node represents a symbol and its frequency.
2. While there is more than one node in the queue:
   a. Remove the two nodes of lowest frequency from the queue.
   b. Create a new internal node with these two nodes as children and a frequency equal to the sum of their frequencies.
   c. Insert the new node back into the queue.
3. The remaining node is the root of the Huffman tree.
4. Traverse the tree to generate the Huffman codes for each symbol.


### Implement a binary-tree class

A binary tree is an essential data structure for this algorithm. It is a hierarchical structure where each node has at most two children. In the context of Huffman coding, each leaf node represents a symbol and its probability/frequency, while internal nodes represent the combined frequency of their child nodes.

In [4]:
@dataclass
class Node:
    weight: float = 0.0             # weight of the node (frequency or probability)
    symbol: Optional[str] = None    # symbol for leaf nodes
    left: Optional["Node"] = None   # left child
    right: Optional["Node"] = None  # right child

Let's build the Huffman tree from Example 2 above:


```text
        *
       / \
      /   \
     *     A(0.5)
    / \
   /   \
B(0.25) *
       / \
      /   \
  C(0.125) D(0.125)
```

In [5]:
# Create leaf nodes
node_A = Node(weight=0.5, symbol="A")
node_B = Node(weight=0.25, symbol="B") 
node_C = Node(weight=0.125, symbol="C")
node_D = Node(weight=0.125, symbol="D")

# Create internal nodes (bottom-up)
# Right internal node: combines C and D
right_internal = Node(weight=0.25, left=node_C, right=node_D)

# Left internal node: combines B and the right internal node  
left_internal = Node(weight=0.5, left=node_B, right=right_internal)

# Root node: combines left internal node and A
root = Node(weight=1.0, left=left_internal, right=node_A)

# Visualise
print("Tree structure:")
print_tree(root)

Tree structure:
└── * (1.0)
    ├── * (0.5)
    │   ├── B (0.25)
    │   └── * (0.25)
    │       ├── C (0.125)
    │       └── D (0.125)
    └── A (0.5)


### Build a Huffman Tree

To efficiently manage the nodes, we'll use a priority queue (min-heap) to store and retrieve the nodes based on their frequencies. This allows us to quickly access the least frequent nodes when building the tree. The algorithm works by repeatedly merging the two least frequent nodes until only one node remains, which becomes the root of the Huffman tree. Schematic representation of the process:

#### Example 4

Let's use the following symbols and their frequencies:

- A: 0.5
- B: 0.25
- C: 0.125
- D: 0.125

1. Create all leaf nodes for our future tree
    ```python
    Node(weight=0.5, symbol="A")
    Node(weight=0.25, symbol="B")
    Node(weight=0.125, symbol="C")
    Node(weight=0.125, symbol="D")
    ```
2. Create a priority queue (min-heap) and insert all leaf nodes:
    ```python
    min_heap = [Node(weight=0.5, symbol="A"), Node(weight=0.25, symbol="B"),
                Node(weight=0.125, symbol="C"), Node(weight=0.125, symbol="D")]
    ```

3. Extract the two least frequent nodes:
    ```python
    # extract_min and insert are schematic placeholders for the heap operations
    left = extract_min(min_heap)   # Node(weight=0.125, symbol="C")
    right = extract_min(min_heap)  # Node(weight=0.125, symbol="D")
    ```

4. Create a new internal node with these two nodes as children and their combined frequency:
    ```python
    new_node = Node(weight=left.weight + right.weight, left=left, right=right)
    ```

5. Insert the new node back into the priority queue:
    ```python
    insert(min_heap, new_node)  # the queue now holds B (0.25), the new internal node (0.25) and A (0.5)
    ```

6. Repeat steps 3-5 until only one node remains in the priority queue. This node will be the root of the Huffman tree.

In [6]:
def build_huffman_tree(weights: Dict[str, float]) -> Optional[Node]:
    """
    Build a Huffman tree from the given symbol weights
    """
    # Check that weights are > 0
    items = [(float(w), str(s)) for s, w in weights.items() if float(w) > 0.0]
    if not items:
        return None

    # Sort items by weight to make code deterministic (important for decoding)
    items.sort(key=lambda x: (x[0], x[1]))

    # Build the tree using a min-heap
    uid = count()                                   # unique integer 0,1,2,...
    heap = [(w, next(uid), Node(weight=w, symbol=s))   # (weight, tie_id, tree)
            for (w, s) in items]
    heapq.heapify(heap)

    # Handle single-symbol edge case
    if len(heap) == 1:
        w, _, leaf = heapq.heappop(heap)
        return Node(weight=w, left=leaf)

    # Build the tree
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)             # smallest
        w2, _, n2 = heapq.heappop(heap)             # 2nd smallest
        parent = Node(weight=w1 + w2, left=n1, right=n2)
        heapq.heappush(heap, (parent.weight, next(uid), parent))

    return heap[0][2]

Now we can apply the algorithm to build the Huffman tree and generate the codes for each symbol

In [7]:
probabilities = {
    'A': 0.5,
    'B': 0.25,
    'C': 0.125,
    'D': 0.125
}

root = build_huffman_tree(probabilities)
print("Huffman Tree structure:")
print_tree(root)

Huffman Tree structure:
└── * (1.0)
    ├── A (0.5)
    └── * (0.5)
        ├── B (0.25)
        └── * (0.25)
            ├── C (0.125)
            └── D (0.125)


### Derive codes

To derive codes from the Huffman tree we will use a recursive traversal of the tree. Starting from the root, we will traverse to the left child by appending '0' to the current code and to the right child by appending '1'. When we reach a leaf node, we will record the symbol and its corresponding code. This is essentially a depth-first search (DFS) of the tree.

Schematic representation of the process:

1. Start at the root with an empty code:
   ```python
   current_code = ""
   ```

2. Traverse the tree:
   - If the current node is a leaf, record the symbol and its code:
     ```python
     if node.symbol is not None:  # the node is a leaf
         codes[node.symbol] = current_code
     ```

   - If the current node is not a leaf, traverse left and right:
     ```python
     traverse(node.left, current_code + "0")
     traverse(node.right, current_code + "1")
     ```

The final codes dictionary will contain the Huffman codes for all symbols.

In [8]:
def codes_from_tree(root: Optional[Node]) -> Dict[str, str]:
    """
    Derive binary codes from the Huffman tree
    """
    # Return empty dictionary if root is None
    if root is None:
        return {}

    # Create a dictionary to hold the codes
    codes: Dict[str, str] = {}

    # Depth-first search (DFS) to traverse the tree
    def dfs(n: Node, path: str):
        if n.symbol is not None:                 # leaf node found
            codes[n.symbol] = path or "0"        # assign code
            return
        if n.left:  dfs(n.left,  path + "0") # go left, add "0"
        if n.right: dfs(n.right, path + "1") # go right, add "1"

    # Start DFS traversal from the root
    dfs(root, "")
    return codes

Now we can apply this algorithm to generate the Huffman codes for each symbol from the constructed Huffman tree

In [9]:
codes = codes_from_tree(root)
print("Huffman Codes:", codes)

Huffman Codes: {'A': '0', 'B': '10', 'C': '110', 'D': '111'}


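As you can see, the codeword lengths match those in [Example 2](#example-2); the bits themselves differ only because this tree is arranged differently (here `A` is the left child of the root).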

### Compare entropy and expected length

We have already compared entropy and expected length in Example 2. Let's create functions to compute both values for future examples.

In [10]:
def entropy(probabilities: Dict[str, float]) -> float:
    """
    Calculate the entropy of a set of symbol probabilities
    """
    p = np.array(list(probabilities.values()), dtype=float)
    p = p[p > 0]  # drop zero probabilities: they contribute nothing and log2(0) is undefined
    return float(-np.sum(p * np.log2(p)))

def expected_length(coding: Dict[str, str], probabilities: Dict[str, float]) -> float:
    """
    Calculate the expected length of a set of symbol codes
    """
    keys = coding.keys()
    p = np.array([probabilities.get(symbol, 0) for symbol in keys])
    length = np.array([len(coding[symbol]) for symbol in keys])
    return np.sum(length * p)

Now let's print the summary for our example

In [11]:
print("Symbol Probabilities:", probabilities)
print("Huffman Codes:", codes)
print("Huffman tree from weights:")
print_tree(root)
print("Entropy:", entropy(probabilities))
print("Expected Length:", expected_length(codes, probabilities))

Symbol Probabilities: {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}
Huffman Codes: {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
Huffman tree from weights:
└── * (1.0)
    ├── A (0.5)
    └── * (0.5)
        ├── B (0.25)
        └── * (0.25)
            ├── C (0.125)
            └── D (0.125)
Entropy: 1.75
Expected Length: 1.75


### Convert a string of characters to bytes

Let's create a function to convert a string to bytes and compare the size of our byte representation to **ASCII**, **UTF-8** and naive encodings. Both **ASCII** and **UTF-8** are used in [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) format to store sequencing information. By naive encoding I mean the following list of codewords:

```text
A: 00, B: 01, C: 10, D: 11
```


In [12]:
def bits2bytes(bits: str):
    """
    Convert a string of bits to bytes
    """
    # Pad the bit string to make its length a multiple of 8
    paddlen = (8 - len(bits) % 8) % 8
    padded_bits = bits + '0' * paddlen

    # Convert each group of 8 bits to a byte
    byte_array = bytearray()
    for i in range(0, len(padded_bits), 8):
        byte = padded_bits[i:i + 8]
        byte_array.append(int(byte, 2))
    return bytes(byte_array)
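
Packing pads the last byte with zeros, so the reverse operation needs to know how many bits are real. The sketch below illustrates this with two hypothetical helpers, `pack_bits` (a compact equivalent of the function above) and `unpack_bits`; the names are assumptions made for this example.

```python
def pack_bits(bits: str) -> bytes:
    """Pad a bit string with zeros to a whole number of bytes and pack it."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def unpack_bits(data: bytes, bitlen: int) -> str:
    """Expand bytes to a bit string and cut off the padding."""
    if bitlen > 8 * len(data):
        raise ValueError("bitlen exceeds the number of available bits")
    return "".join(f"{byte:08b}" for byte in data)[:bitlen]

original_bits = "0101110"              # 7 bits, so one padding bit is added
packed = pack_bits(original_bits)
print(packed.hex())
print(unpack_bits(packed, len(original_bits)))
print(unpack_bits(packed, 8))          # without the true length the padding bit leaks in
```

This is why a compressed file has to store the payload's bit length alongside the packed bytes.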

Now let's generate a string of characters from our distribution and produce its bit-string and byte-string representations:

```text
A: 0.5, B: 0.25, C: 0.125, D: 0.125
```

In [13]:
# fix random seed
np.random.seed(4)

# generate string
n = 100000
char_list = list(probabilities.keys())
prob_list = list(probabilities.values())
string = "".join(np.random.choice(char_list, p=prob_list, size=n))
print(f"Generated string: {string[:50]}...")

# calculate frequency
freq = {char: string.count(char) for char in char_list}
print("Character counts:", freq)
print("Character frequencies:", {char: freq[char] / n for char in char_list})

Generated string: DBDBBADAAACACDABAAADADCCAABABABDBAABBBABABCAABBABB...
Character counts: {'A': 50057, 'B': 24819, 'C': 12415, 'D': 12709}
Character frequencies: {'A': 0.50057, 'B': 0.24819, 'C': 0.12415, 'D': 0.12709}


In [14]:
# encode the string using different encoding schemes
huffman_bits = "".join([codes.get(char) for char in string])
huffman_bytes = bits2bytes(huffman_bits)
naive_codes = {"A": "00", "B": "01", "C": "10", "D": "11"}
naive_bits = "".join([naive_codes.get(char) for char in string])
naive_bytes = bits2bytes(naive_bits)
utf8_bytes = string.encode('utf-8')
ascii_bytes = string.encode('ascii')

# print sizes
print(f"Huffman encoding size: {len(huffman_bytes)} bytes ({len(huffman_bits)} bits)")
print(f"Naive encoding size: {len(naive_bytes)} bytes ({len(naive_bits)} bits), it's {len(naive_bits) / len(huffman_bits): .3f} times larger than Huffman encoding")
#print(f"Naive encoding size (bytes): {len(naive_bytes)}, it's {len(naive_bytes) / len(huffman_bytes): .3f} times larger than Huffman encoding")
print(f"UTF-8 encoding size: {len(utf8_bytes)} bytes, it's {len(utf8_bytes) / len(huffman_bytes): .3f} times larger than Huffman encoding")
print(f"ASCII encoding size: {len(ascii_bytes)} bytes, it's {len(ascii_bytes) / len(huffman_bytes): .3f} times larger than Huffman encoding")

Huffman encoding size: 21884 bytes (175067 bits)
Naive encoding size: 25000 bytes (200000 bits), it's  1.142 times larger than Huffman encoding
UTF-8 encoding size: 100000 bytes, it's  4.570 times larger than Huffman encoding
ASCII encoding size: 100000 bytes, it's  4.570 times larger than Huffman encoding


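These results are not very surprising. The naive encoding is about $\frac{L_{naive}}{L_{huffman}} = \frac{2}{1.75} \approx 1.143$ times longer than the Huffman encoding; the measured ratio of 1.142 differs slightly because the sampled frequencies are not exactly equal to the probabilities. Moreover, UTF-8 and ASCII encodings use 8 bits per character, so they are about $\frac{8}{1.75} \approx 4.571$ times longer than the Huffman encoding (4.570 measured).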
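
The reported sizes can be reproduced by hand from the character counts printed above and the code lengths (1, 2, 3, 3). This is only a worked arithmetic check; the variable names are chosen for this example.

```python
import math

counts = {"A": 50057, "B": 24819, "C": 12415, "D": 12709}  # counts printed above
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}                 # Huffman code lengths

n_symbols = sum(counts.values())
total_bits = sum(counts[s] * lengths[s] for s in counts)   # Huffman payload in bits
total_bytes = math.ceil(total_bits / 8)                    # after padding to whole bytes
naive_total_bits = 2 * n_symbols                           # 2 bits per symbol

print(total_bits, total_bytes)
print(round(naive_total_bits / total_bits, 3))             # naive vs Huffman (bits)
print(round(n_symbols / total_bytes, 3))                   # ASCII/UTF-8 vs Huffman (bytes)
```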

### File header

The "header" is everything you need to prepend so another program can rebuild the codebook and decode the data. In our case it must let the decoder recover the mapping of symbols to their Huffman codes. The "canonical Huffman" idea is to store only the code lengths and rebuild the codebook from them, without transmitting the actual codes. This is possible because in a canonical Huffman code, codes of the same length are consecutive integers.

The canonical assignment rule:

1. Collect and sort symbols by code length and lexicographic order (for tie-breaking)
2. Count how many codes there are of each length
3. Assign codes in the following manner:
    - Skip past all codes assigned at length $l - 1$ (we skip already taken prefixes basically)
    - Assign the next available codes for length $l$

#### Example 5

Let's say I have the following list of code lengths: 1 code of length 1, 1 code of length 2 and 2 codes of length 3:

```text
                            {1: 1, 2: 1, 3: 2}
```


Let's assign codes step by step:
1. For length 1, we have 1 code. The smallest available code is 0, so we assign it. This blocks all codes starting with 0 (as we need a prefix-free code).
    ```python
    codebook = [0] * 4  # one slot per symbol
    codebook[0] = 0
    last_code = 0
    ```
2. To get the next available prefix we take the binary representation of `last_code + 1`, which is `1` in this case. We then apply a [binary shift](https://python-central-hub.vercel.app/tutorials/python-operator/bitwise-operators/#bitwise-left-shift) (append a zero on the right) to get the first code of length 2: `10`. Finally we assign all codes of length 2 (in our case only one):
    ```python
    last_code = last_code + 1 # 1
    first_code_length_2 = last_code << 1 # 2 which is equal 10 in binary
    codebook[1] = first_code_length_2
    last_code = codebook[1]
    ```
3. To assign codes of length 3, we repeat the process. The next available prefix is `last_code + 1`, which is `3`. We shift it to get the first code of length 3: `110`. We assign all codes of length 3 (in our case two codes):
    ```python
    last_code = last_code + 1 # 3
    first_code_length_3 = last_code << 1 # 6 which is equal 110 in binary
    codebook[2] = first_code_length_3
    codebook[3] = first_code_length_3 + 1 # 7 which is equal 111 in binary
    ```
4. Finally, let's convert the integers to binary strings:
    ```python
    # bin() drops leading zeros, which happens to be harmless for these four codes
    codebook = [bin(code)[2:] for code in codebook]
    # codebook = ['0', '10', '110', '111']
    ```

This can be illustrated on the table of all possible codes:
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }

    .highlight-blue {
        background-color: #007bff !important;
        color: white !important;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr>
      <th>length 1</th>
      <th>length 2</th>
      <th>length 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="4" valign="top" class="highlight-blue">0 = 0</th>
      <th rowspan="2" valign="top">00 = 0</th>
      <th>000 = 0</th>
    </tr>
    <tr>
      <th>001 = 1</th>
    </tr>
    <tr>
      <th rowspan="2" valign="top">01 = 1</th>
      <th>010 = 2</th>
    </tr>
    <tr>
      <th>011 = 3</th>
    </tr>
    <tr>
      <th rowspan="4" valign="top">1 = 1</th>
      <th rowspan="2" valign="top" class="highlight-blue">10 = 2</th>
      <th>100 = 4</th>
    </tr>
    <tr>
      <th>101 = 5</th>
    </tr>
    <tr>
      <th rowspan="2" valign="top">11 = 3</th>
      <th class="highlight-blue">110 = 6</th>
    </tr>
    <tr>
      <th class="highlight-blue">111 = 7</th>
    </tr>
  </tbody>
</table>
</div>

The same procedure works for the following list of code lengths:

```text
                                {1: 0, 2: 2, 3: 3}
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }

    .highlight-blue {
        background-color: #007bff !important;
        color: white !important;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr>
      <th>length 1</th>
      <th>length 2</th>
      <th>length 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="4" valign="top">0 = 0</th>
      <th rowspan="2" valign="top" class="highlight-blue">00 = 0</th>
      <th>000 = 0</th>
    </tr>
    <tr>
      <th>001 = 1</th>
    </tr>
    <tr>
      <th rowspan="2" valign="top" class="highlight-blue">01 = 1</th>
      <th>010 = 2</th>
    </tr>
    <tr>
      <th>011 = 3</th>
    </tr>
    <tr>
      <th rowspan="4" valign="top">1 = 1</th>
      <th rowspan="2" valign="top">10 = 2</th>
      <th class="highlight-blue">100 = 4</th>
    </tr>
    <tr>
      <th class="highlight-blue">101 = 5</th>
    </tr>
    <tr>
      <th rowspan="2" valign="top">11 = 3</th>
      <th class="highlight-blue">110 = 6</th>
    </tr>
    <tr>
      <th>111 = 7</th>
    </tr>
  </tbody>
</table>
</div>
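
The second set of lengths shows why codes should be printed with a fixed width: its first two codes are `00` and `01`, and `bin()` would drop their leading zeros. The sketch below assigns canonical codes directly from a `{length: count}` dictionary. The `canonical_from_counts` helper is hypothetical and returns the codes in assignment order, without mapping them to symbols.

```python
from typing import Dict, List

def canonical_from_counts(counts: Dict[int, int]) -> List[str]:
    """Assign canonical codes from a {code length: number of codes} mapping."""
    codes: List[str] = []
    code = 0
    prev_len = 0
    for length in sorted(counts):
        code <<= (length - prev_len)          # append zeros to reach the new length
        for _ in range(counts[length]):
            codes.append(format(code, f"0{length}b"))  # fixed width keeps leading zeros
            code += 1                         # next consecutive code of this length
        prev_len = length
    return codes

print(canonical_from_counts({1: 1, 2: 1, 3: 2}))
print(canonical_from_counts({1: 0, 2: 2, 3: 3}))
print(bin(0)[2:], format(0, "02b"))           # '0' versus '00'
```

Both results agree with the highlighted cells in the two tables above.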

In [15]:
def canonical_codes_from_lengths(codelens: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """
    Generate canonical Huffman codes from code lengths
    """
    # Filter out zeros
    items = [(L, ord(s), s) for s, L in codelens.items() if L > 0]
    if not items:
        return {}
    
    # Sort by (length, codepoint) for determinism
    items.sort(key=lambda x: (x[0], x[1]))
    

    # Count how many codes of each length
    max_len = max(L for L, _, _ in items)
    bl_count = [0] * (max_len + 1)
    for L, _, _ in items:
        bl_count[L] += 1
    
    # Compute first code for each length
    next_code = [0] * (max_len + 1)
    code = 0
    for L in range(1, max_len + 1):
        code = (code + (bl_count[L - 1] if L - 1 >= 0 else 0)) * 2
        next_code[L] = code

    # Assign codes in (length, symbol) order
    out: Dict[str, Tuple[int, int]] = {}
    for L, _, s in items:
        out[s] = (next_code[L], L)
        next_code[L] += 1
    return out

def format_bits(code: int, L: int) -> str:
    return format(code, f"0{L}b") if L > 0 else ""

In [16]:
code_lengths = {symbol: len(code) for symbol, code in codes.items()}
print("Code lengths:", code_lengths)

canonical_codes = canonical_codes_from_lengths(code_lengths)
formatted_bits = {symbol: format_bits(code, L) for symbol, (code, L) in canonical_codes.items()}
print("Original codes:", codes)
print("Canonical codes:", formatted_bits)

Code lengths: {'A': 1, 'B': 2, 'C': 3, 'D': 3}
Original codes: {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
Canonical codes: {'A': '0', 'B': '10', 'C': '110', 'D': '111'}


Now we can construct the header:

```text
File format (2 bytes): 'HC'
Alphabet size (2 bytes): N
Code lengths (N entries, each 4 bytes for the symbol's code point + 1 byte for its code length): {symbol: length}
Bit length (4 bytes): L
```

This header is self-describing. Given that you know the file format is 'HC', you can easily parse the rest of the header fields:
1. Read the alphabet size (2 bytes).
2. For each symbol, read its code point (4 bytes) and its code length (1 byte).
3. Finally, read the total bit length (4 bytes).
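
The size of such a header follows directly from the field widths. The sketch below uses `struct.calcsize` with big-endian format strings (`>` means standard sizes with no padding) to compute it. The format strings are my assumption of how the layout above maps onto `struct`, and `header_size` is a hypothetical helper.

```python
import struct

MAGIC_AND_COUNT = ">2sH"  # 2-byte magic + uint16 alphabet size
ENTRY = ">IB"             # uint32 code point + uint8 code length
BITLEN = ">I"             # uint32 total bit length

def header_size(n_symbols: int) -> int:
    """Header size in bytes for an alphabet of n_symbols."""
    return (struct.calcsize(MAGIC_AND_COUNT)
            + n_symbols * struct.calcsize(ENTRY)
            + struct.calcsize(BITLEN))

print(struct.calcsize(MAGIC_AND_COUNT), struct.calcsize(ENTRY), struct.calcsize(BITLEN))
print(header_size(4))
print(struct.pack(ENTRY, ord("A"), 1).hex())  # one entry: symbol 'A' with code length 1
```

For a four-symbol alphabet this gives 4 + 4 × 5 + 4 = 28 bytes. Using a full uint32 per symbol is generous for ASCII data, but it leaves room for any Unicode code point.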

In [17]:
def write_header(code_lengths: Dict[str, int], bitlen: int) -> bytes:
    """
    Write the header for the compressed file:
        MAGIC(2)='HC' | N:uint16 | N×(codepoint:uint32, codelen:uint8) | BITLEN:uint32
    """
    # Get code lengths and sort them by (length, codepoint)
    entries = [(ord(s), int(L)) for s, L in code_lengths.items() if L > 0]
    entries.sort(key=lambda x: (x[1], x[0]))
    N = len(entries)

    # Write the header
    parts = []

    parts.append(struct.pack(">2sH", b"HC", N)) # Magic number and entry count: 2 bytes + 2 bytes
    for codepoint, codelen in entries:
        parts.append(struct.pack(">IB", codepoint, codelen)) # Codepoint and code length: uint32 + uint8
    parts.append(struct.pack(">I", bitlen)) # Total bit length: uint32
    return b"".join(parts)

Let's generate the header for the string that we generated earlier

In [18]:
header = write_header(code_lengths, len(huffman_bits))
print("Header size:", len(header), "bytes")
print("Header in hex:", header.hex())

Header size: 28 bytes
Header in hex: 4843000400000041010000004202000000430300000044030002abdb


Let's also create a function to decode the header

In [19]:
def read_header(data: bytes, offset: int = 0):
    """
    Parse the header and return the code lengths, bitlen and offset
    Args:
        data (bytes): The header data
        offset (int, optional): The starting byte offset. Defaults to 0.
    Returns:
        Tuple[Dict[str, int], int, int]: The code lengths, bitlen and offset
    """
    # Read the filetype
    magic, = struct.unpack_from(">2s", data, offset); offset += 2

    if magic != b"HC":
        raise ValueError("bad magic: expected b'HC'")

    # Read number of symbols
    (N,) = struct.unpack_from(">H", data, offset); offset += 2

    # Read code lengths
    code_lengths: Dict[str, int] = {}
    for _ in range(N):
        codepoint, codelen = struct.unpack_from(">IB", data, offset); offset += 5
        code_lengths[chr(codepoint)] = codelen

    # Read bitlen
    (bitlen,) = struct.unpack_from(">I", data, offset); offset += 4
    return code_lengths, bitlen, offset # offset now points to the start of PAYLOAD

We can try to decode our header now

In [20]:
decoded_code_lengths, bitlen, offset = read_header(header)
print("Original code lengths:", code_lengths)
print("Decoded code lengths:", decoded_code_lengths)
print("Bit length:", bitlen)
print("Offset:", offset)

Original code lengths: {'A': 1, 'B': 2, 'C': 3, 'D': 3}
Decoded code lengths: {'A': 1, 'B': 2, 'C': 3, 'D': 3}
Bit length: 175067
Offset: 28


### Decoding

At this point we know how to:
1. Create code words from probabilities using the Huffman coding algorithm
2. Encode the data using the generated code words
3. Create a header that contains all the information needed to decode the data:
   - Magic number (`HC`)
   - Number of symbols
   - Codepoint and code length of each symbol
   - Total bit length of the payload
4. Reconstruct the canonical code words from the code lengths stored in the header

The final step is to decode the data using the reconstructed code words and the header information. A Huffman code is **prefix-free**, so you can read the bitstream one bit at a time and walk a binary tree: `0` = go left, `1` = go right. Whenever you land on a leaf, you’ve finished exactly one symbol. Emit it and jump back to the root to start the next symbol.
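
To see why prefix-freeness makes decoding unambiguous, here is a minimal table-based sketch. It is not the tree decoder built below, just an illustration: it grows a buffer bit by bit and emits a symbol as soon as the buffer equals a code word. The helper name `decode_with_table` and the example codes are assumptions for this illustration.

```python
from typing import Dict

def decode_with_table(bits: str, codes: Dict[str, str]) -> str:
    """Decode a bit string by growing a buffer until it matches a code word.

    This only works because the code is prefix-free: no code word is a prefix
    of another, so the first match is the only possible match.
    """
    inverse = {code: symbol for symbol, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:      # a complete code word has been read
            out.append(inverse[buffer])
            buffer = ""            # start the next symbol
    if buffer:
        raise ValueError("bitstream ends in the middle of a code word")
    return "".join(out)

example_codes = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode_with_table("0101100111", example_codes))  # -> ABCAD
```

The tree walk does the same thing, but it follows one pointer per bit instead of comparing a growing string against a table.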

But before doing that we first need to construct the Huffman tree from the code lengths. Let's do that.

In [21]:
def decode_tree(canon: Dict[str, Tuple[int, int]]) -> Node:
    """
    Decode a Huffman tree from canonical code words
    """
    root = Node()
    for symbol, (code, length) in canon.items():
        current = root
        # Walk from root following the binary path
        for i in range(length - 1, -1, -1):  # MSB to LSB
            bit = (code >> i) & 1            # Extract bit i
            if bit == 0:
                # Go left, create node if needed
                if current.left is None:
                    current.left = Node(weight=0)
                current = current.left
            else:
                # Go right, create node if needed  
                if current.right is None:
                    current.right = Node(weight=0)
                current = current.right
        # Mark final node as leaf with this symbol
        current.symbol = symbol
    return root

In [22]:
code_lengths = {symbol: len(code) for symbol, code in codes.items()}
print("Code lengths:", code_lengths)

canonical_codes = canonical_codes_from_lengths(code_lengths)
formatted_bits = {symbol: format_bits(code, L) for symbol, (code, L) in canonical_codes.items()}
print("Original codes:", codes)
print("Canonical codes:", formatted_bits)
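# Note: this rebinds the name decode_tree from the function to the tree root,
# so re-running this cell requires re-running the cell that defines the function
decode_tree = decode_tree(canonical_codes)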
print("Decoding Tree structure:")
print_tree(decode_tree)

Code lengths: {'A': 1, 'B': 2, 'C': 3, 'D': 3}
Original codes: {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
Canonical codes: {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
Decoding Tree structure:
└── * (N/A)
    ├── A (0)
    └── * (N/A)
        ├── B (0)
        └── * (N/A)
            ├── C (0)
            └── D (0)


All we need to do now to decode a stream of bits is to traverse the decoding tree according to the bit values. Starting from the root, we move left for a '0' and right for a '1', until we reach a leaf node, which gives us the decoded symbol.

In [23]:
def decode_huffman(encoded_str: str, root: Node) -> str:
    """
    Decode a Huffman encoded string
    """
    decoded_output = []
    current_node = root

    for bit in encoded_str:
        # Traverse the tree
        current_node = current_node.left if bit == '0' else current_node.right

        # Raise an error if the bits lead outside the tree
        if current_node is None:
            raise ValueError("Invalid encoded string")

        # If we reach a leaf node, append the symbol
        if current_node.symbol:
            decoded_output.append(current_node.symbol)
            current_node = root  # Reset for the next symbol

    # Leftover bits mean the stream ended in the middle of a code word
    if current_node is not root:
        raise ValueError("Invalid encoded string: incomplete final code word")

    return ''.join(decoded_output)

Finally let's decode an encoded string using our Huffman tree.

In [24]:
decoded_string = decode_huffman(huffman_bits, decode_tree)

# Check if the strings are the same
assert string == decoded_string, "Decoded string does not match the original!"

print("Original string:", string[:30] + "..." + string[-30:])
print("Decoded string: ", decoded_string[:30] + "..." + decoded_string[-30:])

Original string: DBDBBADAAACACDABAAADADCCAABABA...ABBAADBDAADABABABAAAADCADADAAD
Decoded string:  DBDBBADAAACACDABAAADADCCAABABA...ABBAADBDAADABABABAAAADCADADAAD


## Encode human genome

### Download human genome data

The [human genome](https://en.wikipedia.org/wiki/Human_genome) is the complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. It comprises about 3 billion base pairs of DNA and contains an estimated 20,000-25,000 genes.

Sequence data is commonly stored in formats such as FASTA and FASTQ. FASTA is a text-based format for representing nucleotide sequences, while FASTQ stores both nucleotide sequences and their corresponding quality scores. Let's download only chromosome 21 (one of the smallest human chromosomes) from [Ensembl](https://www.ensembl.org/index.html).

In [25]:
!wget -O genome.fa.gz https://ftp.ensembl.org/pub/release-114/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
!gunzip -f genome.fa.gz

--2025-08-31 18:51:36--  https://ftp.ensembl.org/pub/release-114/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11787503 (11M) [application/x-gzip]
Saving to: ‘genome.fa.gz’


2025-08-31 18:51:37 (12.2 MB/s) - ‘genome.fa.gz’ saved [11787503/11787503]



### File structure

Let's take a look at the downloaded FASTA file for the 21st chromosome

In [26]:
# File head
!echo -e "Head of the file:"
!head -n 5 genome.fa

# Middle of the file
!echo -e "\nMiddle of the file:"
!sed -n '103524,103530p' genome.fa

# File tail
!echo -e "\nFile tail:"
!tail -n 5 genome.fa

Head of the file:
>21 dna:chromosome chromosome:GRCh38:21:1:46709983:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Middle of the file:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATCATACT
TGTAGCATTAGGCCCAGGGATGAGAGTCAACATCATTACAATTAACTATGTCAGGATAGG
AGACTCATCCCTTGCCTATGAGCTGAGTTTAGATGTGGGCCACCATTTTAACTCTGGTTG
AATGTTTATATATGAACACAGGCCTAGCACCAATGTGATGTGAGTCTTTGGCCTAGACAC
TTCAAGCAGGAGGCAATGTGACATATCTCTGGGTCTATCAACTATTTGATATGACCTTCC
TTTTTTACCTGAGCTTTCCCCATAAAAGAGATGTGACATATGTCTAGACCCAGCACCTGG
GTGATGTGGCTCTTCTTTATTGACTGAGCCCTGTGTATTTTGGGTATTCTGACATATCCC

File tail:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The header line in the `genome.fa` file contains the sequence identifier and description, while the subsequent lines contain the nucleotide sequence in FASTA format. Another important aspect is the presence of gaps (represented by 'N's) in the sequence, which indicate regions that are difficult to sequence or assemble.

### Calculate frequency of each nucleotide

To compress this file I will discard the header and focus on the nucleotide sequence. I will also ignore newline characters: they appear after every 60 bases, so they are easy to restore. We need to count the frequency of each symbol (A, T, G, C and the gap marker N) in the sequence.
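
As a rough sketch of why the newlines carry no information, the 60-column layout can be rebuilt from the bare sequence. The helper name `wrap_sequence` is hypothetical; the header line and sequence length are copied from the file head shown above.

```python
def wrap_sequence(sequence: str, width: int = 60) -> str:
    """Re-insert a newline after every `width` characters (hypothetical helper)."""
    return "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))

wrapped = wrap_sequence("ACGT" * 40)                # 160 bases
print([len(line) for line in wrapped.split("\n")])  # -> [60, 60, 40]

# Size of the original file predicted from the layout alone:
# header line + newline, all bases, and one newline per sequence line
header_line = ">21 dna:chromosome chromosome:GRCh38:21:1:46709983:1 REF"
n_bases = 46709983
n_lines = -(-n_bases // 60)                         # ceiling division
print(len(header_line) + 1 + n_bases + n_lines)     # -> 47488540
```

The predicted 47488540 bytes match the size of `genome.fa` on disk, which suggests that the header line plus the bare sequence are enough to rebuild the file.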

In [27]:
# Read genome file
with open("genome.fa", "r") as file:
    lines = file.readlines()
    header = lines[0].strip()  # First line is the header
    genome = "".join(line.strip() for line in lines[1:])  # Join all subsequent lines

print("Genome sequence:", genome[:15] + "..." + genome[6240000:6240020] + "..." + genome[-15:])
print("Chromosome length:", len(genome))

Genome sequence: NNNNNNNNNNNNNNN...TTCTCTGCACTCCAGCCTGG...NNNNNNNNNNNNNNN
Chromosome length: 46709983


In [28]:
# Count nucleotide frequencies
N = len(genome)
counts = Counter(genome)
nucleotide_counts = {b: counts.get(b, 0) for b in "ATGCN"}
nucleotide_frequencies = {k: v / N for k, v in nucleotide_counts.items()}

print("Nucleotide counts/frequencies:")
for nucleotide, number in nucleotide_counts.items():
    print(f"{nucleotide}: {number} /{number / N: .3f}")

Nucleotide counts/frequencies:
A: 11820664 / 0.253
T: 11856330 / 0.254
G: 8226381 / 0.176
C: 8185244 / 0.175
N: 6621364 / 0.142


Let's calculate the entropy of the nucleotide frequencies.

In [29]:
entropy_value = entropy(nucleotide_frequencies)
print(f"Entropy of nucleotide frequencies: {entropy_value: .3f}")

Entropy of nucleotide frequencies:  2.285
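
The value can be reproduced from the printed counts alone. This is a minimal sketch, assuming `entropy` implements the usual Shannon formula $H = -\sum_i p_i \log_2 p_i$ in bits per symbol:

```python
import math

# Nucleotide counts for chromosome 21, copied from the output above
counts = {"A": 11820664, "T": 11856330, "G": 8226381, "C": 8185244, "N": 6621364}
total = sum(counts.values())

# Shannon entropy in bits per symbol
H = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(total)                   # -> 46709983
print(round(H, 3))             # -> 2.285
print(round(math.log2(5), 3))  # -> 2.322, the maximum for a 5-symbol alphabet
```

The entropy is only slightly below $\log_2 5$, because the five symbols are not far from uniformly distributed.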


### Huffman codes for nucleotides

Based on the nucleotide frequencies, we can construct a Huffman tree and generate Huffman codes for each nucleotide.

In [30]:
# Construct the tree
encoding_tree = build_huffman_tree(nucleotide_frequencies)

# Generate Huffman codes
codes = codes_from_tree(encoding_tree)

# Print results
print("Symbol Probabilities:", nucleotide_frequencies)
print("Huffman Codes:", codes)
print("Huffman tree from weights:")
print_tree(encoding_tree)
print("Entropy:", entropy(nucleotide_frequencies))
print("Expected Length:", expected_length(codes, nucleotide_frequencies))

Symbol Probabilities: {'A': 0.25306504607376973, 'T': 0.2538286087580036, 'G': 0.17611612061601478, 'C': 0.17523543093560964, 'N': 0.14175479361660226}
Huffman Codes: {'G': '00', 'A': '01', 'T': '10', 'N': '110', 'C': '111'}
Huffman tree from weights:
└── * (1.0)
    ├── * (0.4291811666897845)
    │   ├── G (0.17611612061601478)
    │   └── A (0.25306504607376973)
    └── * (0.5708188333102155)
        ├── T (0.2538286087580036)
        └── * (0.31699022455221193)
            ├── N (0.14175479361660226)
            └── C (0.17523543093560964)
Entropy: 2.2848569671015464
Expected Length: 2.316990224552212


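In this case the expected length (about 2.317 bits per symbol) is slightly larger than the entropy (about 2.285 bits per symbol), an overhead of roughly 1.4%. This is expected: Huffman code lengths are whole numbers of bits, so the code can only match the entropy exactly when every probability is a power of 1/2.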
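
The gap can be worked out by hand from the code lengths. In the sketch below, the counts and the entropy value are copied from the outputs above, and the code lengths are read off the printed Huffman codes:

```python
# Counts and code lengths copied from the outputs above
counts = {"A": 11820664, "T": 11856330, "G": 8226381, "C": 8185244, "N": 6621364}
code_lengths = {"G": 2, "A": 2, "T": 2, "N": 3, "C": 3}
H = 2.2848569671015464  # entropy printed above, in bits per symbol

total = sum(counts.values())

# Expected length = sum_i p_i * l_i; here this equals 2 + P(N) + P(C)
expected = sum(counts[s] * code_lengths[s] for s in counts) / total

print(round(expected, 4))                  # -> 2.317
print(round(expected - H, 4))              # -> 0.0321 bits per symbol of redundancy
print(round(100 * (expected / H - 1), 2))  # relative overhead in percent
```

Three symbols get 2 bits and two get 3 bits, so the average is 2 bits plus the combined probability of N and C.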

### Compress the fasta file

Now we can compress the fasta file using the generated Huffman codes. We will also create a header that contains all necessary information to decode the data.

In [31]:
# Generate genome encoding (codes[char] raises KeyError on an unexpected symbol)
genome_bits_encoding = "".join(codes[char] for char in genome)
genome_bytes_encoding = bits2bytes(genome_bits_encoding)

# Generate the header
code_lengths = {symbol: len(code) for symbol, code in codes.items()}
header = write_header(code_lengths, len(genome_bits_encoding))

# Write the compressed data to a file
with open("genome.fa.huff", "wb") as f:
    f.write(header)
    f.write(genome_bytes_encoding)

print(f"Compressed file size: {os.path.getsize('genome.fa.huff')} bytes")

Compressed file size: 13528355 bytes
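
This size can be predicted without running the encoder. The sketch below uses the counts and code lengths from the outputs above, and assumes that `bits2bytes` pads the final partial byte up to a whole byte:

```python
# Counts and code lengths copied from the outputs above
counts = {"A": 11820664, "T": 11856330, "G": 8226381, "C": 8185244, "N": 6621364}
code_lengths = {"G": 2, "A": 2, "T": 2, "N": 3, "C": 3}

# Payload: every occurrence of a symbol costs its code length in bits
payload_bits = sum(counts[s] * code_lengths[s] for s in counts)
payload_bytes = (payload_bits + 7) // 8        # assumption: last byte is padded

# Header: 4 bytes (magic + count) + 5 bytes per symbol + 4 bytes (bit length)
header_bytes = 4 + 5 * len(code_lengths) + 4

print(payload_bits)                  # -> 108226574
print(header_bytes)                  # -> 33
print(header_bytes + payload_bytes)  # -> 13528355
```

The prediction matches the file size on disk. The 33-byte header is negligible next to the 13.5 MB payload.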


For comparison, let's remove the header and newline characters from the original FASTA file and compare sizes.

In [32]:
!tail -n +2 genome.fa | tr -d '\n' > genome_clean.fa
!ls -lh genome*

-rw-r--r-- 1 claptar claptar  46M Jan 30  2025 genome.fa
-rw-r--r-- 1 claptar claptar  13M Aug 31 18:51 genome.fa.huff
-rw-r--r-- 1 claptar claptar  45M Aug 31 18:51 genome_clean.fa
-rw-r--r-- 1 claptar claptar  45M Aug 31 18:33 genome_no_header_no_newlines.fa
-rw-r--r-- 1 claptar claptar 9.5M Aug 31 18:50 genome_no_header_no_newlines.fa.bz2
-rw-r--r-- 1 claptar claptar 9.5M Aug 31 18:45 genome_no_header_no_newlines.fa.bz9
-rw-r--r-- 1 claptar claptar  11M Aug 31 18:50 genome_no_header_no_newlines.fa.gz
-rw-r--r-- 1 claptar claptar 3.8M Aug 31 18:50 genome_no_header_no_newlines.fa.gz9
-rw-r--r-- 1 claptar claptar  21M Aug 31 18:50 genome_no_header_no_newlines.fa.lz4
-rw-r--r-- 1 claptar claptar 2.4M Aug 31 18:50 genome_no_header_no_newlines.fa.xz
-rw-r--r-- 1 claptar claptar 8.2M Aug 31 18:46 genome_no_header_no_newlines.fa.xz9
-rw-r--r-- 1 claptar claptar  11M Aug 31 18:50 genome_no_header_no_newlines.fa.zst
-rw-r--r-- 1 claptar claptar    0 Aug 31 18:46 genome_no_header_no_newlines.f

In [33]:
print(f"Original FASTA file size: {os.path.getsize('genome.fa')} bytes")
print(f"Original FASTA file size (no header, no newlines): {os.path.getsize('genome_clean.fa')} bytes")
print(f"Compressed file size: {os.path.getsize('genome.fa.huff')} bytes")
print(f"Compression ratio: {os.path.getsize('genome_clean.fa')} / {os.path.getsize('genome.fa.huff')} = {os.path.getsize('genome_clean.fa') / os.path.getsize('genome.fa.huff'):.2f}")

Original FASTA file size: 47488540 bytes
Original FASTA file size (no header, no newlines): 46709983 bytes
Compressed file size: 13528355 bytes
Compression ratio: 46709983 / 13528355 = 3.45


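The compressed file is about 3.45 times smaller than the cleaned sequence: roughly 2.32 bits per base instead of the 8 bits that ASCII text spends on each character. Part of this gain comes simply from the small alphabet (a fixed-length code for five symbols would need 3 bits per base, a ratio of about 2.67). Huffman coding gets the rest by giving the more frequent symbols shorter codes.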

### Compare compression rates to popular tools

Here's a brief description of each compression tool and their relationship to Huffman coding:

#### Standard Compression Tools

##### gzip (.gz)

- Uses the DEFLATE algorithm, which combines LZ77 with Huffman coding
- Fast compression, widely supported
- Uses Huffman coding for the final encoding step

##### bzip2 (.bz2)

- Uses Burrows-Wheeler Transform + Move-to-Front + Huffman coding
- Usually better compression than gzip, at lower speed
- Uses Huffman coding as the final entropy coding stage

##### xz (.xz)

- Uses the LZMA2 algorithm with range encoding (an arithmetic coding variant)
- Excellent compression ratio, but compression is slower than gzip/bzip2
- Does NOT use Huffman coding; it uses range coding instead

##### lz4 (.lz4)

- Uses the LZ4 algorithm, focused on speed over compression ratio
- Very fast compression/decompression
- Does NOT use Huffman coding; it uses a simple byte-oriented encoding with no entropy coding stage

##### zstd (.zst)

- Modern algorithm combining LZ77-style matching with entropy coding
- Good balance of speed and compression ratio
- Uses Huffman coding for literals and FSE (Finite State Entropy, a form of ANS coding) for match sequences

#### Summary

- Use Huffman coding: gzip, bzip2, zstd (for literals, alongside FSE)
- Don't use Huffman coding: xz (range coding), lz4 (no entropy coding)

Unlike our implementation, which encodes each symbol independently using fixed frequencies, these tools also model repeated substrings or context before the entropy coding stage, so they can go below the per-symbol entropy. Among them, gzip and bzip2 are the closest relatives of our approach, since Huffman coding is their final stage.
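
For a quick feel of how these families behave without the command-line tools, Python's standard library ships bindings for three of them. The sketch below is only an illustration on synthetic data (a uniformly random string over `ACGT`, which is my assumption and not the genome file), so the numbers will differ from the real chromosome:

```python
import bz2
import lzma
import random
import zlib

# Synthetic input: 100,000 uniformly random bases (an assumption for illustration)
random.seed(0)
data = "".join(random.choice("ACGT") for _ in range(100_000)).encode("ascii")

compressed = {
    "zlib (DEFLATE, as in gzip)": zlib.compress(data, 9),
    "bz2 (as in bzip2)": bz2.compress(data, 9),
    "lzma (as in xz)": lzma.compress(data, preset=9),
}

for name, blob in compressed.items():
    bits_per_base = 8 * len(blob) / len(data)
    print(f"{name}: {len(blob)} bytes, {bits_per_base:.2f} bits per base")
```

Random data has no repeats to exploit, so all three should land near the 2 bits per base of a plain 4-symbol code. On the real chromosome, the long runs of `N` and repeated regions are what let these tools do better than a per-symbol Huffman code.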

In [35]:
# Standard compression tools
!gzip -c genome_clean.fa > genome_clean_standart.fa.gz
!bzip2 -c genome_clean.fa > genome_clean_standart.fa.bz2
!xz -c genome_clean.fa > genome_clean_standart.fa.xz
!lz4 -c genome_clean.fa > genome_clean_standart.fa.lz4
!zstd -c genome_clean.fa > genome_clean_standart.fa.zst

# High compression variants
!gzip -9 -c genome_clean.fa > genome_clean_high_compression.fa.gz9
!bzip2 -9 -c genome_clean.fa > genome_clean_high_compression.fa.bz9
!xz -9 -c genome_clean.fa > genome_clean_high_compression.fa.xz9
!zstd -19 -c genome_clean.fa > genome_clean_high_compression.fa.zst19

# Compare sizes in human readable format
!du -h genome*

46M	genome.fa
13M	genome.fa.huff
45M	genome_clean.fa
9.5M	genome_clean_high_compression.fa.bz9
9.9M	genome_clean_high_compression.fa.gz9
8.2M	genome_clean_high_compression.fa.xz9
8.3M	genome_clean_high_compression.fa.zst19
9.5M	genome_clean_standart.fa.bz2
11M	genome_clean_standart.fa.gz
21M	genome_clean_standart.fa.lz4
8.3M	genome_clean_standart.fa.xz
11M	genome_clean_standart.fa.zst


### Decode