##### *Python libraries used in this notebook*

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Implementation of Huffman's algorithm

## Abstract
This notebook discusses Huffman's algorithm and compares lossless and lossy compression. It presents the algorithm mathematically, implements a compression method based on it in Python, and verifies the degree of compression achieved on text data through trials.

## 1. Huffman's algorithm
Huffman coding, named after its inventor David A. Huffman, is a widely used technique in computer science and information theory for achieving lossless data compression. It utilizes a special kind of code called a prefix code, where the bit sequence representing a symbol never appears as the beginning of another symbol's sequence. This algorithm was introduced by Huffman in his 1952 paper titled "A Method for the Construction of Minimum-Redundancy Codes".
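The prefix property described above can be checked mechanically. The following is a minimal sketch; the helper name `is_prefix_free` and both example codes are hypothetical and chosen only for illustration.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    # After lexicographic sorting, a codeword that is a prefix of any other
    # codeword is always immediately followed by a codeword that starts with it,
    # so comparing neighbours is enough.
    ordered = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


# Hypothetical example codes, not taken from the notebook.
prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}
ambiguous_code = {"a": "0", "b": "01", "c": "11"}

print(is_prefix_free(prefix_code.values()))     # True
print(is_prefix_free(ambiguous_code.values()))  # False
```

In the second code, `"0"` is a prefix of `"01"`, so a decoder that has read a `0` cannot tell whether a symbol has ended.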

### 1.1 Difference between lossless and lossy compression
Digital files can be quite large. Compression techniques reduce file sizes either without any loss of quality (**lossless**) or with some acceptable loss (**lossy**). These are the two primary methods of compression.

#### 1.1.1 Lossless Compression 
**Definition:** Lossless compression aims to reduce file size without any loss of data. The decompressed file is an exact replica of the original file.
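The "exact replica" property in this definition can be illustrated with a round trip through Python's standard `zlib` module, which implements the DEFLATE format also used by ZIP and GZIP. This is only a sketch; the sample text is an arbitrary assumption.

```python
import zlib

# Illustrative input: highly repetitive text, which compresses well.
original = ("the quick brown fox jumps over the lazy dog. " * 50).encode("utf-8")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # original size vs. compressed size in bytes
print(restored == original)            # True: every byte is recovered
```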

**How it works:** Lossless compression algorithms identify and eliminate redundant data within the file. Common techniques include:
> - Run-Length Encoding (RLE): Replaces repeated occurrences of a character or sequence with a count and the character/sequence, reducing the number of bits required.
> - Huffman Coding: Assigns shorter codes to frequently occurring characters and longer codes to less frequent ones, minimizing the overall bit usage.
> - Lempel-Ziv-Welch (LZW): Identifies and replaces recurring patterns in the data with shorter codes, further reducing redundancy.

**Pros:** No data is lost, so the original file can be perfectly reconstructed. Essential for applications where data integrity is crucial, like text documents, medical imaging, and some scientific data.\
**Cons:** Generally achieves less compression compared to lossy methods. May not be sufficient for applications requiring extreme reductions in file size.
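Of the techniques listed above, run-length encoding is the simplest to sketch. The version below is a minimal illustration, assuming runs are stored as `(character, count)` pairs; the function names and the sample string are hypothetical.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)


sample = "aaaabbbcca"  # illustrative input
encoded = rle_encode(sample)
print(encoded)  # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(rle_decode(encoded) == sample)  # True
```

RLE only pays off when the input contains long runs; on text without repeated characters the pair list is larger than the input.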

Lossless compression is used for text files and general-purpose archives (ZIP, GZIP), images (PNG, and GIF, which uses LZW) and audio (FLAC, ALAC).

#### 1.1.2 Lossy Compression
**Definition:** Lossy compression reduces file size by permanently eliminating some data, particularly data that is considered less important or imperceptible to human senses.

**How it works:** It uses algorithms that remove redundant or less critical information. Common techniques include:
> - Transform Coding: Transforms data into a different domain (like frequency domain), then quantizes and encodes the less significant parts with fewer bits.
> - Quantization: Reduces the precision of less critical data points.
> - Entropy Coding: Further compresses the data after quantization.

**Pros:** Achieves much higher compression ratios compared to lossless methods. Suitable for applications where some loss of quality is acceptable, such as streaming media, online images, and consumer audio.

**Cons:** Some data is permanently lost, which can affect quality, especially at higher compression rates. Decompressed files are not identical to the original; repeated compression and decompression can degrade quality further.
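The permanent loss described above comes mainly from quantization. The sketch below uses uniform quantization with an assumed step size of 0.25 on a made-up list of samples; real codecs choose step sizes adaptively.

```python
# Illustrative signal: eight samples in the range [0, 1].
signal = [0.02, 0.13, 0.27, 0.38, 0.52, 0.61, 0.77, 0.94]

step = 0.25  # assumed quantization step; larger steps lose more detail

# Each sample is replaced by the index of the nearest multiple of `step`.
quantized = [round(x / step) for x in signal]

# Decoding can only recover the multiples of `step`, not the original values.
reconstructed = [q * step for q in quantized]

max_error = max(abs(x - r) for x, r in zip(signal, reconstructed))
print(quantized)              # [0, 1, 1, 2, 2, 2, 3, 4]
print(max_error <= step / 2)  # True: the error is bounded by half a step
```

The small integers in `quantized` are cheaper to store than the original values, but `reconstructed` differs from `signal`, and no decoder can undo that difference.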

Lossy compression is used to compress images (JPEG), audio (MP3, AAC) and video (for example H.264, commonly stored in MP4 containers).

#### 1.1.3 Key Differences
- **Data Integrity:** Lossless compression retains all original data, while lossy compression sacrifices some data for smaller file sizes.
- **Compression Ratio:** Lossy compression generally achieves higher compression ratios compared to lossless.
- **Use Cases:** Lossless is used where data integrity is crucial; lossy is used where reduced file size is more important than perfect fidelity.

### 1.2 Applicability of lossless compression algorithms
Lossless compression algorithms are appropriate in the following situations:
- **Text Files:** When compressing text documents such as source code, configuration files, or any other textual data where exact reproduction of the original data is essential.
- **Executable Files:** For software distribution, where the integrity of the executable files must be maintained to ensure proper functionality.
- **Database Files:** In scenarios where data precision is critical, such as in database backups and archival, to ensure no data corruption.
- **Medical Imaging:** In medical applications (e.g., medical scans, MRI images) where any data loss could lead to misdiagnosis or incorrect treatment.
- **Scientific Data:** For scientific data and research where exact replication of data is necessary to maintain accuracy and reproducibility of results.
- **Financial Records:** In financial and legal documents where data integrity and accuracy are paramount.
- **Configuration and Log Files:** To ensure that configuration settings and logs are preserved exactly as they were originally recorded.
- **Audio and Image Files for Editing:** When working with audio, image, or video files in professional editing contexts where repeated saving and loading without quality degradation is required.
- **Version Control Systems:** In version control systems like Git, where it’s important to track exact changes in files over time.
- **Data Deduplication:** In scenarios where storage efficiency is achieved by eliminating duplicate copies of repeating data, requiring exact duplicates to be recognized and handled appropriately.

Lossless compression is preferred in these cases because it ensures that the original data can be perfectly reconstructed from the compressed data, maintaining data integrity and accuracy.

## 2. Mathematical (formalistic) representation
### 2.1 Formal problem statement
**Input**\
Alphabet $A = (a_1, a_2,\dots, a_n)$, which is the symbol alphabet of size $n$.\
Tuple $W = (w_1, w_2, \dots, w_n)$, which is the tuple of the (positive) symbol weights (usually proportional to probabilities), i.e. $w_i = \operatorname{weight}(a_i), i \in \{1, 2,\dots, n\}$.

**Output**\
Code $C(W) = (c_1, c_2, \dots, c_n)$, which is the tuple of (binary) codewords, where $c_i$ is the codeword for $a_i, i \in \{1, 2, \dots, n\}$.

**Goal**\
Let $L(C(W)) = \sum_{i = 1}^n w_i \cdot \operatorname{length}(c_i)$ be the weighted path length of code $C$. Condition: $L(C(W)) \leq L(T(W))$ for any code $T(W)$.
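The input, output and goal above map directly onto a small program. The following is a minimal sketch of the standard greedy construction using `heapq`, not necessarily the implementation used later in this notebook; the function names and the example weights are assumptions made for illustration.

```python
import heapq
from itertools import count


def huffman_code(weights):
    """Build a binary Huffman code for a non-empty {symbol: positive weight} dict."""
    if len(weights) == 1:
        # Degenerate alphabet: give the single symbol a one-bit codeword.
        return {symbol: "0" for symbol in weights}

    tiebreak = count()  # unique counter so equal weights never compare the dicts
    # Each heap entry: (subtree weight, tiebreaker, {symbol: partial codeword}).
    heap = [(w, next(tiebreak), {s: ""}) for s, w in weights.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codewords with 0 and 1.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))

    return heap[0][2]


def weighted_path_length(weights, code):
    """Compute L(C(W)) as the sum of w_i * length(c_i)."""
    return sum(w * len(code[s]) for s, w in weights.items())


# Hypothetical weights W for the alphabet A = (a, b, c, d).
weights = {"a": 5, "b": 2, "c": 1, "d": 1}
code = huffman_code(weights)
print(code)
print(weighted_path_length(weights, code))  # 15
```

For comparison, a fixed-length code for four symbols uses 2 bits each, giving a weighted path length of $2 \cdot (5 + 2 + 1 + 1) = 18$, so the Huffman code saves 3 bits on this input.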




### References

1. <a href="https://en.wikipedia.org/wiki/Huffman_coding">Wikipedia - Huffman coding</a>
2. <a href="https://en.wikipedia.org/wiki/Lossy_compression">Wikipedia - Lossy compression</a>
3. <a href="https://en.wikipedia.org/wiki/Lossless_compression">Wikipedia - Lossless compression</a>
4. <a href="https://www.khanacademy.org/computing/computers-and-internet/xcae6f4a7ff015e7d:digital-information/xcae6f4a7ff015e7d:data-compression/a/lossy-compression">Khan Academy - Lossy compression</a>
