# Text compression

## The task is to implement two compression algorithms:
- the static Huffman algorithm (2 pts)
- the dynamic Huffman algorithm (3 pts)

### For each algorithm, complete the following tasks:
1. Design a file format for storing the data. Pay attention to two issues:
    - The number of bits in the output need not be divisible by 8, but files are always read from disk in whole bytes, so the format must deal with this explicitly. Otherwise decompression can produce extra, spurious data.
    - The output file must be binary, i.e. the solution must not write the 0s and 1s to the file as ASCII characters.
2. Implement compression and decompression of data for this file format.
3. Measure the compression ratio (expressed as a percentage: 1 - compressed_file / uncompressed_file) for files of size 1 kB, 10 kB, 100 kB and 1 MB with different kinds of content:
    - a text file of your choice from Project Gutenberg,
    - a file of your choice containing Linux kernel source code,
    - a file of characters drawn from a uniform distribution; all 256 byte values must be included, not only printable characters.
4. In total, the analysis in point 3 covers 12 files (4 sizes x 3 file types).
5. Measure the compression and decompression time for the files from point 3.
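
One possible way to handle the first issue in point 1 is to store the number of padding bits in a one-byte header. The sketch below is only an illustration under that assumption and is not prescribed by the task. The names `pack_bits` and `unpack_bits` are hypothetical, and bits are held as a string of `'0'`/`'1'` purely for readability inside the program; the bytes that would be written to disk are binary.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' into bytes, prefixed by a pad-count byte."""
    pad = -len(bits) % 8                 # zero bits needed to reach a byte boundary
    padded = bits + "0" * pad
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([pad]) + body


def unpack_bits(data: bytes) -> str:
    """Inverse of pack_bits: drop the header byte and the padding bits."""
    pad = data[0]
    bits = "".join(f"{byte:08b}" for byte in data[1:])
    return bits[:len(bits) - pad]
```

The other common option is an explicit end-of-stream symbol in the code tree, which is the route taken by the `"end"` node in the implementation below.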

In [149]:
import numpy as np
from collections import deque
from bitarray import bitarray
from timeit import default_timer as timer
import os

### Static Huffman algorithm

In [150]:
class Node:
    def __init__(self, weight=0, char=None, left=None, right=None):
        self.weight = weight
        self.char = char
        self.left = left
        self.right = right

In [151]:
def get_min_element_and_update_lists(leaves, internal_nodes):
    """Pop the lighter of the two queue heads; leaves win ties."""
    if not internal_nodes or (leaves and leaves[0].weight <= internal_nodes[0].weight):
        value = leaves.popleft()
    else:
        value = internal_nodes.popleft()
    return value, leaves, internal_nodes

class StaticHuffman:
    """Static Huffman coder; the tree is built once from the symbol counts of `text`."""

    def __init__(self, text):
        self.text = text
        self.set_weights()
        self.root = self.create_tree()
        self.codes = {}
        self.set_codes(self.root, bitarray())

    def set_weights(self):
        # Count every symbol: 1-character strings for str input, ints for bytes input.
        self.weights = {}
        for letter in self.text:
            self.weights[letter] = self.weights.get(letter, 0) + 1

    def create_tree(self):
        nodes = [Node(char=char, weight=weight) for char, weight in self.weights.items()]
        # Zero-weight end-of-stream marker; the string "end" cannot collide
        # with a single character or with a byte value.
        nodes.append(Node(char="end", weight=0))
        leaves = deque(sorted(nodes, key=lambda x: x.weight))
        internal_nodes = deque()
        # Two-queue construction: merged nodes are created in non-decreasing
        # weight order, so both queues stay sorted.
        while len(leaves) + len(internal_nodes) > 1:
            e1, leaves, internal_nodes = get_min_element_and_update_lists(leaves, internal_nodes)
            e2, leaves, internal_nodes = get_min_element_and_update_lists(leaves, internal_nodes)
            # Compare with None: byte value 0 is a valid symbol but is falsy.
            if e1.char is not None:
                internal_nodes.append(Node(weight=e1.weight + e2.weight, left=e1, right=e2))
            else:
                internal_nodes.append(Node(weight=e1.weight + e2.weight, left=e2, right=e1))
        # For empty input only the end marker exists and no merge happens.
        return internal_nodes[0] if internal_nodes else leaves[0]

    def set_codes(self, node, code):
        # Leaves carry a symbol; internal nodes have char=None.
        if node.char is not None:
            self.codes[node.char] = code
            return
        if node.left:
            self.set_codes(node.left, code + bitarray('0'))
        if node.right:
            self.set_codes(node.right, code + bitarray('1'))

    def encode(self):
        result = bitarray()
        for char in self.text:
            result.extend(self.codes[char])
        result.extend(self.codes["end"])
        # Pad with zeros to a whole number of bytes; the end marker tells
        # the decoder where the real data stops.
        result.fill()
        return result

    def decode(self, encoded):
        node = self.root
        decoded = []
        for bit in encoded:
            node = node.right if bit else node.left
            if node.char is None:
                continue              # still inside the tree
            if node.char == "end":
                break                 # everything after the marker is padding
            # Byte input is decoded to str via chr(), text input is kept as is.
            decoded.append(chr(node.char) if isinstance(node.char, int) else node.char)
            node = self.root
        return "".join(decoded)
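
The `create_tree` method above relies on two queues, which works because the leaves are sorted once and merged nodes are produced in non-decreasing weight order. For comparison, here is a minimal heap-based sketch that computes only the code lengths. `huffman_code_lengths` is a hypothetical helper and not part of the notebook's implementation, and it does not add an end-of-stream marker.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code over `data`."""
    freq = Counter(data)
    if len(freq) == 1:                      # a lone symbol still needs one bit
        return {next(iter(freq)): 1}
    tiebreak = count()                      # keeps heap entries comparable on equal weights
    heap = [(w, next(tiebreak), [sym]) for sym, w in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:           # each merge pushes these symbols one level deeper
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths
```

For the input `"aaaabbc"` this gives lengths 1, 2 and 2 for `a`, `b` and `c`, i.e. 10 bits of payload instead of 56.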

### Tests

In [152]:
def get_compression_ratio_in_percentages(file, compressed_file):
    original_size = os.path.getsize(file)
    compressed_size = os.path.getsize(compressed_file)
    return (1 - compressed_size / original_size) * 100

In [153]:
def compression_time_comparision(file, huffman, n=10):    
    avg_time_encoding = 0
    avg_time_decoding = 0
    
    for i in range(n):
        start = timer()
        result = huffman.encode()
        end = timer()
        avg_time_encoding += (end - start)
        
        start = timer()
        huffman.decode(result)
        end = timer()
        avg_time_decoding += (end - start)
    avg_time_encoding /= n
    avg_time_decoding /= n
    return avg_time_encoding, avg_time_decoding

In [154]:
def get_text(file, mode="r"):
    # The context manager closes the file even if read() raises.
    with open(file, mode) as f:
        return f.read()

In [155]:
def compression_test(file, mode="r"):
    save_file = f"static_compression_{file}.txt"
    text = get_text(file, mode)
    static_huffman = StaticHuffman(text)
    encoded = static_huffman.encode()
    # Only the encoded bit stream is written; the code tree is not stored in
    # the file, so the ratio below does not include that overhead.
    with open(save_file, "wb") as f:
        encoded.tofile(f)
    static_compression = get_compression_ratio_in_percentages(file, save_file)
    static_avg_time_encoding, static_avg_time_decoding = compression_time_comparision(file, static_huffman)

    print(f"Stats for {file}:\nFor Static Huffman:\n\tCompression ratio: {round(static_compression, 2)}%")
    print(f"\tAVG encoding time: {round(static_avg_time_encoding, 6)}\n\tAVG decoding time: {round(static_avg_time_decoding, 6)}")
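
`compression_test` above writes only the encoded bit stream, so the saved file cannot be decompressed without the `StaticHuffman` object that produced it. One way to make the file self-contained is to prepend the symbol counts, from which the decoder can rebuild the same tree. The following is a sketch under assumptions that do not come from the notebook: byte input only (symbols 0..255), counts that fit in 32 bits, and the hypothetical helper names `pack_frequency_table` and `unpack_frequency_table`.

```python
import struct


def pack_frequency_table(weights: dict[int, int]) -> bytes:
    """Serialize {byte value: count} as a 2-byte entry count followed by
    (1-byte symbol, 4-byte count) pairs, all big-endian."""
    parts = [struct.pack(">H", len(weights))]
    for sym in sorted(weights):
        parts.append(struct.pack(">BI", sym, weights[sym]))
    return b"".join(parts)


def unpack_frequency_table(data: bytes) -> tuple[dict[int, int], int]:
    """Return the table and the offset at which the encoded bits start."""
    (n,) = struct.unpack_from(">H", data, 0)
    weights = {}
    offset = 2
    for _ in range(n):
        sym, cnt = struct.unpack_from(">BI", data, offset)
        weights[sym] = cnt
        offset += 5                      # 1 byte symbol + 4 bytes count
    return weights, offset
```

With all 256 byte values present this header takes 2 + 256 * 5 = 1282 bytes, which would dominate the size of a 1 kB input, so a header of this kind noticeably lowers the ratio for small files.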

### Gutenberg

In [156]:
compression_test("Gutenberg_1kB.txt")

Stats for Gutenberg_1kB.txt:
For Static Huffman:
	Compression ratio: 41.32%
	AVG encoding time: 0.000119
	AVG decoding time: 0.000551


In [157]:
compression_test("Gutenberg_10kB.txt")

Stats for Gutenberg_10kB.txt:
For Static Huffman:
	Compression ratio: 39.13%
	AVG encoding time: 0.00156
	AVG decoding time: 0.007611


In [158]:
compression_test("Gutenberg_100kB.txt")

Stats for Gutenberg_100kB.txt:
For Static Huffman:
	Compression ratio: 45.64%
	AVG encoding time: 0.014819
	AVG decoding time: 0.067409


In [159]:
compression_test("Gutenberg_1MB.txt")

Stats for Gutenberg_1MB.txt:
For Static Huffman:
	Compression ratio: 54.11%
	AVG encoding time: 0.147626
	AVG decoding time: 0.63174


### Linux

In [160]:
compression_test("Linux_1kB.txt")

Stats for Linux_1kB.txt:
For Static Huffman:
	Compression ratio: 36.58%
	AVG encoding time: 0.000123
	AVG decoding time: 0.000571


In [161]:
compression_test("Linux_10kB.txt")

Stats for Linux_10kB.txt:
For Static Huffman:
	Compression ratio: 38.35%
	AVG encoding time: 0.001525
	AVG decoding time: 0.007341


In [162]:
compression_test("Linux_100kB.txt")

Stats for Linux_100kB.txt:
For Static Huffman:
	Compression ratio: 32.18%
	AVG encoding time: 0.01581
	AVG decoding time: 0.077388


In [163]:
compression_test("Linux_1MB.txt")

Stats for Linux_1MB.txt:
For Static Huffman:
	Compression ratio: 34.6%
	AVG encoding time: 0.154209
	AVG decoding time: 0.762303


### All 256 byte values, uniform distribution

In [164]:
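def create_uniform_distribution_file(size):
    # `size` is given in kB, with 1 kB taken as 1000 bytes.
    # randint's upper bound is exclusive, so every value 0..255 is equally likely.
    data = np.random.randint(0, 256, size=size * 1000, dtype=np.uint8)
    with open(f"Uniform_distribution_{size}kB.txt", "wb") as file:
        file.write(data.tobytes())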

In [165]:
create_uniform_distribution_file(1)

In [166]:
create_uniform_distribution_file(10)

In [167]:
create_uniform_distribution_file(100)

In [168]:
create_uniform_distribution_file(1000)

In [169]:
compression_test("Uniform_distribution_1kB.txt", "br")

Stats for Uniform_distribution_1kB.txt:
For Static Huffman:
	Compression ratio: 1.9%
	AVG encoding time: 0.000161
	AVG decoding time: 0.001127


In [170]:
compression_test("Uniform_distribution_10kB.txt", "br")

Stats for Uniform_distribution_10kB.txt:
For Static Huffman:
	Compression ratio: 0.05%
	AVG encoding time: 0.001432
	AVG decoding time: 0.011125


In [171]:
compression_test("Uniform_distribution_100kB.txt", "br")

Stats for Uniform_distribution_100kB.txt:
For Static Huffman:
	Compression ratio: 0.01%
	AVG encoding time: 0.014351
	AVG decoding time: 0.111126


In [172]:
compression_test("Uniform_distribution_1000kB.txt", "br")

Stats for Uniform_distribution_1000kB.txt:
For Static Huffman:
	Compression ratio: 0.0%
	AVG encoding time: 0.140656
	AVG decoding time: 1.07448
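
The ratios for the uniform files shrink towards zero as the files grow, which is what one would expect: the empirical Shannon entropy is a lower bound on the average number of bits per symbol for any code that encodes symbols one at a time, and for uniformly distributed bytes it approaches 8 bits. A small sketch for checking this on any input; `entropy_bits_per_symbol` is a hypothetical helper, not part of the measurements above.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(data) -> float:
    """Empirical Shannon entropy of `data` (str or bytes) in bits per symbol."""
    counts = Counter(data)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

An input in which all 256 byte values occur equally often has an entropy of exactly 8 bits per symbol, which leaves a Huffman code nothing to gain. In a short random file the observed counts deviate from uniform, so a small positive ratio is plausible there.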
