# Exercise 2

## Overview

This Python notebook implements lossless compression and decompression using the `Rice Coding` method. It covers only the implementation, walking through it step by step; the theoretical background of these techniques is discussed in the report submitted alongside this notebook.

## Implementation

### Install Dependencies

We'll begin by installing the libraries required for building the application.

In [1]:
%pip install wave pandas bitarray tqdm

Collecting wave
  Downloading Wave-0.0.2.zip (38 kB)
Collecting bitarray
  Downloading bitarray-2.8.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB)
Building wheels for collected packages: wave
  Building wheel for wave (setup.py) ... done
  Created wheel for wave: filename=Wave-0.0.2-py3-none-any.whl size=1245 sha256=eabf2b77020dec3a0f532381793879b100a4efc3183aeb9f828c7d734bff8df9
  Stored in directory: /home/jovyan/.cache/pip/wheels/25/e8/fe/458c7dac00c6abedad6380b9d0ef1a5cbc7c21807df1d30915
Successfully built wave
Installing collected packages: wave, bitarray
Successfully installed bitarray-2.8.1 wave-0.0.2
You should consider upgrading via the '/opt/conda/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


Next, we'll create the output folder, if it does not already exist, so that the application has somewhere to save its output files.

In [2]:
import os

output_files_folder = './Exercise_2_Output'

# Create the folder only if it is missing
os.makedirs(output_files_folder, exist_ok=True)

### Application

#### Utility Functions

We'll begin by creating utility functions for Rice encoding and decoding of non-negative integers. The core of the compression technique is `rice_encode()` and `rice_decode()`, which in turn rely on the smaller helpers `unary_encode()`, `binary_encode()`, `unary_decode()`, and `binary_decode()`.

In addition, we'll create three supporting functions. `read_file_as_byte_array()` reads the content of a file as a sequence of bytes, while `calculate_file_hash()` and `compare_files()` let us confirm that a decoded file is identical to the original. The effectiveness of the compression is assessed later by comparing file sizes directly with `os.path.getsize()`.
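
As a quick illustration of how the pieces fit together, the sketch below encodes a single value by hand. It is a condensed, stand-alone version of the idea (the name `rice_code_sketch` is not part of the notebook), assuming the same convention used here: the quotient in unary (`1`s terminated by a `0`), followed by the remainder in `k` binary digits.

```python
def rice_code_sketch(value: int, k: int) -> str:
    """Illustrative Rice code: unary quotient followed by a k-bit remainder."""
    quotient, remainder = divmod(value, 2 ** k)
    unary = '1' * quotient + '0'          # e.g. quotient 2 -> '110'
    binary = format(remainder, f'0{k}b')  # e.g. remainder 1, k = 2 -> '01'
    return unary + binary

# 9 = 2 * 4 + 1, so with k = 2 the quotient is 2 and the remainder is 1
print(rice_code_sketch(9, 2))         # 11001
# 255 = 63 * 4 + 3, so the unary part alone takes 64 bits
print(len(rice_code_sketch(255, 2)))  # 66
```

Small values produce short codes and large values produce long ones, which is why the choice of `k` relative to the typical input value matters so much.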

In [3]:
import os
import hashlib

def unary_encode(n: int) -> str:
    """
    Encodes a non-negative integer using unary encoding.

    Parameters:
    n (int): The non-negative integer to be encoded.

    Returns:
    str: The unary encoded string representation of the integer.
    """
    assert n >= 0, "Input value must be non-negative."
    
    return '1' * n + '0'

def binary_encode(n: int, k: int) -> str:
    """
    Encodes an integer in binary format with a specified minimum width.

    Parameters:
    n (int): The integer to be encoded.
    k (int): The minimum width of the binary representation.

    Returns:
    str: The binary encoded string representation of the integer.
    """
    assert n >= 0, "Input value must be non-negative."
    assert k > 0, "Minimum width 'k' must be greater than 0."
    
    return format(n, '0{}b'.format(k))

def unary_decode(code: str) -> int:
    """
    Decodes a unary encoded string to a non-negative integer.

    Parameters:
    code (str): The unary encoded string to be decoded.

    Returns:
    int: The decoded non-negative integer.
    """
    assert all(char == '1' or char == '0' for char in code), "Input code must consist of '0's and '1's only."
    return code.count('1')

def binary_decode(code: str) -> int:
    """
    Decodes a binary encoded string to an integer.

    Parameters:
    code (str): The binary encoded string to be decoded.

    Returns:
    int: The decoded integer.
    """
    assert all(char == '0' or char == '1' for char in code), "Input code must consist of '0's and '1's only."
    return int(code, 2)

def rice_encode(value:int, k: int) -> str:
    """
    Encodes an integer using Rice coding with a specified parameter.

    Parameters:
    value (int): The non-negative integer to be encoded.
    k (int): The parameter value used in Rice coding.

    Returns:
    str: The Rice encoded string representation of the integer.
    """
    assert value >= 0, "Input value must be non-negative."
    assert k > 0, "Parameter 'k' must be greater than 0."
    
    modulus = 2**k
    
    quotient = value // modulus
    remainder = value % modulus

    quotient_code = unary_encode(quotient)
    remainder_code = binary_encode(remainder, k)
    rice_code = quotient_code + remainder_code

    return rice_code

def rice_decode(value: str, k: int) -> int:
    """
    Decodes a Rice encoded string to an integer using a specified parameter.

    Parameters:
    value (str): The Rice encoded string to be decoded.
    k (int): The parameter value used in Rice coding.

    Returns:
    int: The decoded integer.
    """
    assert all(char == '0' or char == '1' for char in value), "Input code must consist of '0's and '1's only."
    assert k > 0, "Parameter 'k' must be greater than 0."

    modulus = 2 ** k

    first_0_index = len(value) - 1
    for index, char in enumerate(value):
        if char == '0':
            first_0_index = index
            break

    quotient_code = value[:first_0_index]
    remainder_code = value[first_0_index:]

    quotient = unary_decode(quotient_code)
    remainder = binary_decode(remainder_code)
    number = quotient * modulus + remainder

    return number

def read_file_as_byte_array(file_path: str):
    """
    Reads a file as a byte array.

    Parameters:
    file_path (str): The path to the file to be read.

    Returns:
    bytes: The contents of the file as a byte array.
    """
    with open(file_path, 'rb') as file:
        byte_array = file.read()
    return byte_array

def calculate_file_hash(file_path):
    """
    Calculate the SHA-256 hash of a file.

    Parameters:
    file_path (str): The path to the file.

    Returns:
    str: The SHA-256 hash in hexadecimal format.
    """
    # Technique adopted from -> https://www.geeksforgeeks.org/compare-two-files-using-hashing-in-python/
    sha256_hash = hashlib.sha256()
    
    with open(file_path, "rb") as f:
        # Read the file in chunks and update the hash
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
            
    return sha256_hash.hexdigest()

def compare_files(file_path_1, file_path_2):
    """
    Compare two files by calculating their SHA-256 hashes.

    Parameters:
    file_path_1 (str): The path to the first file.
    file_path_2 (str): The path to the second file.

    Returns:
    bool: True if the files have the same hash, False otherwise.
    """
    hash_1 = calculate_file_hash(file_path_1)
    hash_2 = calculate_file_hash(file_path_2)
    
    # The files are considered identical when their hashes match
    return hash_1 == hash_2
    

#### Main Functions

Now that we have our encoding utility ready, we can move ahead and create functions to encode and decode files using the Rice Coding algorithm. Let's break down the steps each of these functions takes:

##### Steps in the `rice_encode_audio_file()` Function
1. The function begins by loading the file from the provided input path as an array of bytes.
2. It then Rice-encodes each byte, producing a list of code strings made of `0` and `1` characters.
3. These individual strings are joined into a single bit string.
4. The bit string is split into chunks of 8 characters, each representing one output byte. If the last chunk is shorter than 8 characters, it is treated as if it were padded with leading zeros, and the number of missing bits is recorded.
5. The padding count is stored as the first byte of the output byte array.
6. Each 8-character chunk is converted into a byte and appended to the output byte array.
7. The complete byte array is written to the given output path as binary data.
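
Steps 4 to 6 can be sketched in isolation as follows. This is a minimal stand-alone illustration, not the notebook's function: `pack_bit_string` is a hypothetical name, and it assumes the same layout as described above (padding count first, short final chunk read as if it had leading zeros).

```python
def pack_bit_string(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, prefixed by a padding count."""
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    # Number of bits missing from the final chunk (0 when it is already full)
    padding = 8 - len(chunks[-1]) if chunks else 0
    packed = bytearray([padding])
    # int(chunk, 2) treats a short final chunk as if it had leading zeros
    packed.extend(int(chunk, 2) for chunk in chunks)
    return bytes(packed)

# Rice codes for 9, 3 and 6 with k = 2: '11001' + '011' + '1010'
print(list(pack_bit_string('110010111010')))  # [4, 203, 10]
```

The first output byte (`4`) says that the last byte carries four padding bits, so only its lowest four bits (`1010`) belong to the data.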

##### Steps in the `rice_decode_audio_file()` Function
1. The function starts by loading the encoded file from the provided input path as an array of bytes.
2. The first byte is read as the number of padding zeros that were added to the last encoded byte.
3. Every remaining byte is converted into an 8-character binary string.
4. The padding zeros are stripped from the front of the last string, and all strings are joined into a single bit string.
5. Reading the bit string from start to end, the function locates the `0` that terminates each unary quotient, takes the following `k` bits as the remainder, and reconstructs the original byte value from the two.
6. The reconstructed bytes are written to the provided output path, recreating the original WAV file.
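
The decoding steps can be sketched on a tiny in-memory example. The helpers `unpack_bit_string` and `decode_rice_stream` are illustrative names only, and the sketch assumes a well-formed input in the layout described above (padding count in the first byte, leading-zero padding in the last byte).

```python
def unpack_bit_string(packed: bytes) -> str:
    """Undo the byte packing: drop the header byte and strip the padding bits."""
    padding, payload = packed[0], packed[1:]
    chunks = [format(byte, '08b') for byte in payload]
    if chunks:
        chunks[-1] = chunks[-1][padding:]
    return ''.join(chunks)

def decode_rice_stream(bits: str, k: int) -> bytes:
    """Decode a concatenation of Rice codes back into the original values."""
    decoded = bytearray()
    position = 0
    while position < len(bits):
        # The run of 1s before the next 0 is the unary-coded quotient
        terminator = bits.index('0', position)
        quotient = terminator - position
        # The k bits after the terminating 0 hold the remainder
        remainder = int(bits[terminator + 1:terminator + 1 + k], 2)
        decoded.append(quotient * 2 ** k + remainder)
        position = terminator + 1 + k
    return bytes(decoded)

bits = unpack_bit_string(bytes([4, 203, 10]))
print(bits)                                 # 110010111010
print(list(decode_rice_stream(bits, k=2)))  # [9, 3, 6]
```

Because every code ends exactly `k` bits after its first `0`, the decoder never needs explicit separators between codes.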

In [4]:
def rice_encode_audio_file(input_audio_file_path: str, output_audio_file_path: str, k: int):
    # Load the audio file
    audio_buffer = read_file_as_byte_array(input_audio_file_path)

    # Encoding bytes
    encoded_bit_strings = [ rice_encode(byte, k) for byte in audio_buffer ]
    # Separate bits into groups of 8 for bytes
    encoded_bit_string = "".join(encoded_bit_strings)
    encoded_byte_bit_strings = [ encoded_bit_string[i : i+8] for i in range(0, len(encoded_bit_string), 8) ]
    last_bit_string_padding_size = 8 - len(encoded_byte_bit_strings[-1])
    
    # Build the output buffer, starting with the padding count so the decoder
    # knows how many leading zeros to strip from the last byte
    encoded_audio_buffer = bytearray()
    encoded_audio_buffer.append(last_bit_string_padding_size)
    # Converting bits to bytes
    for bit_string in encoded_byte_bit_strings:
        byte = int(bit_string, 2)
        encoded_audio_buffer.append(byte)

    # Writing encoded audio samples to output file
    with open(output_audio_file_path, 'wb+') as output_file:
        output_file.write(encoded_audio_buffer)

def rice_decode_audio_file(input_audio_file_path: str, output_audio_file_path: str, k: int):
    # Load the encoded file
    encoded_buffer = read_file_as_byte_array(input_audio_file_path)
    # Extracting the first byte, which holds the number of padding zeroes in the last byte
    padding_size = encoded_buffer[0]

    # Converting byte array into bit strings
    encoded_bit_strings = [ binary_encode(byte, 8) for byte in encoded_buffer[1:] ]
    # Removing any zeroes used as padding in the last byte
    encoded_bit_strings[-1] = encoded_bit_strings[-1][padding_size:]
    encoded_bit_string = ''.join(encoded_bit_strings)

    i, decoded_buffer = 0, bytearray()
    encoded_byte_start = 0
    
    while i < len(encoded_bit_string):
        if encoded_bit_string[i] == '0':
            encoded_byte_end = i + k + 1
            decoded_byte = rice_decode(encoded_bit_string[encoded_byte_start : encoded_byte_end], k)
            decoded_buffer.append(decoded_byte)
            
            i = encoded_byte_end
            encoded_byte_start = i
        else:
            i += 1

    # Writing decoded audio samples to output file
    with open(output_audio_file_path, 'wb+') as output_file:
        output_file.write(decoded_buffer)

#### Testing

Now that our primary functions for encoding and decoding files with Rice coding are in place, let's put them to the test. We'll Rice-encode the `./Exercise_2_Files/Sound1.wav` file using a `k` value of `2`, decode the encoded file, and compare the hash of the result with that of the original file to verify both functions.

In [5]:
# Encoding 
rice_encode_audio_file('./Exercise_2_Files/Sound1.wav', './Exercise_2_Output/Sound1_Enc_K_2.ex2', k = 2)

# Decoding
rice_decode_audio_file('./Exercise_2_Output/Sound1_Enc_K_2.ex2', './Exercise_2_Output/Sound1_Enc_Dec_K_2.wav', k = 2)

# Comparing original file with the decoded file
if compare_files('./Exercise_2_Files/Sound1.wav', './Exercise_2_Output/Sound1_Enc_Dec_K_2.wav'):
    print('The encoding and decoding functions work!')
else:
    print('Something is wrong with the encoding and decoding functions!')

The encoding and decoding functions work!


From the output of the above cell, we can see the encoding and decoding functions work as expected.

#### Compression Result

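With working encoding and decoding functions, we can construct the table requested in the task sheet to compare the effect of different `k` values on the compressed file size.

To do that, we will encode the `./Exercise_2_Files/Sound1.wav` and `./Exercise_2_Files/Sound2.wav` files using `k` values of `2` and `4`, and use pandas to build a table of the resulting file sizes. The `compression_percentage` column is the compressed size expressed as a percentage of the original size, so values above `100` mean the encoded file is larger than the original.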

In [6]:
import pandas as pd
import os

audio_file_encoding_configs = [
    {
        'file_name': 'Sound1.wav',
        'input_file_path': './Exercise_2_Files/Sound1.wav',
        'output_file_path': './Exercise_2_Output/Sound1_Enc_K_2.ex2',
        'k': 2
    },
    {
        'file_name': 'Sound1.wav',
        'input_file_path': './Exercise_2_Files/Sound1.wav',
        'output_file_path': './Exercise_2_Output/Sound1_Enc_K_4.ex2',
        'k': 4
    },
    {
        'file_name': 'Sound2.wav',
        'input_file_path': './Exercise_2_Files/Sound2.wav',
        'output_file_path': './Exercise_2_Output/Sound2_Enc_K_2.ex2',
        'k': 2
    },
    {
        'file_name': 'Sound2.wav',
        'input_file_path': './Exercise_2_Files/Sound2.wav',
        'output_file_path': './Exercise_2_Output/Sound2_Enc_K_4.ex2',
        'k': 4
    }
]

file_size_dataframe = pd.DataFrame()

for config in audio_file_encoding_configs:
    file_name = config['file_name']
    input_file_path = config['input_file_path']
    output_file_path = config['output_file_path']
    k = config['k']
    
    rice_encode_audio_file(input_file_path, output_file_path, k)
    
    original_size, compressed_size = os.path.getsize(config['input_file_path']), os.path.getsize(config['output_file_path'])
    compression_percentage = (compressed_size / original_size) * 100
    
    new_row = {
        'file_name': file_name, 
        'k': k,
        'original_size': original_size, 
        'compressed_size': compressed_size,
        'compression_percentage': compression_percentage
    }
    file_size_dataframe = pd.concat([file_size_dataframe, pd.DataFrame([new_row])], ignore_index=True)
    
file_size_dataframe

| | file_name | k | original_size | compressed_size | compression_percentage |
|---|---|---|---|---|---|
| 0 | Sound1.wav | 2 | 1002088 | 4115719 | 410.714328 |
| 1 | Sound1.wav | 4 | 1002088 | 1516266 | 151.310663 |
| 2 | Sound2.wav | 2 | 1008044 | 4348596 | 431.389503 |
| 3 | Sound2.wav | 4 | 1008044 | 1575348 | 156.277702 |
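
The growth in file size can be reproduced without running the encoder. Under the scheme above, a byte `b` costs `(b >> k) + 1 + k` bits: the unary quotient, its terminating `0`, and the `k`-bit remainder. The sketch below (the helper name `rice_encoded_bits` is illustrative, not part of the notebook) applies this to one copy of every possible byte value, which is only a rough stand-in for the raw bytes of a WAV file.

```python
def rice_encoded_bits(data: bytes, k: int) -> int:
    """Total bits needed to Rice-code every byte of `data` with parameter k."""
    return sum((byte >> k) + 1 + k for byte in data)

uniform_bytes = bytes(range(256))  # assumption: each byte value equally likely
for k in (2, 4):
    ratio = rice_encoded_bits(uniform_bytes, k) / (8 * len(uniform_bytes))
    print(f'k={k}: {ratio * 100:.2f}% of the original size')
# k=2: 431.25% of the original size
# k=4: 156.25% of the original size
```

These estimates are close to the measured values in the table, which suggests that the byte values in both files are spread widely across the range 0 to 255. Rice coding applied directly to raw bytes only pays off when most values are small relative to `2**k`.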


### Further Development

For further development, we will build a more effective compression solution using the Lempel–Ziv–Storer–Szymanski (LZSS) algorithm. The implementation is adapted from this [blog](https://tim.cogan.dev/lzss/) on the LZSS algorithm by Tim Cogan.

#### Utility Functions

As with the Rice coding implementation, we will begin with some utility functions that help us encode audio files using the LZSS algorithm: `generate_repeated_sequence()` and `find_longest_match()`. `generate_repeated_sequence()` builds a byte sequence of a specified length by repeating a pattern of bytes, and `find_longest_match()` finds the longest match for the upcoming bytes within the provided window.

`find_longest_match()` is the less trivial of the two and works as follows:
1. The function starts by defining the end index of the largest match candidate.
2. It then defines the start index for the pattern search among the already encoded bytes before `current_position`. This start index depends on the provided `window_size`.
3. The function then iterates through the possible match candidates, reducing the length of the candidate by 1 on each iteration. Each candidate is compared with the existing byte sequences (the bytes preceding `current_position`). If a candidate equals an existing byte sequence, the function returns the offset from `current_position` back to the beginning of that sequence, together with the length of the repeated sequence. If no candidate matches, it returns `None`.
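
To make the idea of a match concrete, here is a simplified, stand-alone search that scans every start position in the window and extends the match byte by byte. It is a sketch of the general technique rather than the notebook's implementation: `longest_match_sketch` and its parameter names are hypothetical, and it accepts any match of at least `min_len` bytes.

```python
from typing import Optional, Tuple

def longest_match_sketch(data: bytes, pos: int, window: int,
                         min_len: int, max_len: int) -> Optional[Tuple[int, int]]:
    """Return (distance, length) of the longest earlier match, or None."""
    best = None
    for start in range(max(0, pos - window), pos):
        length = 0
        # data[start + length] may lie at or beyond `pos`: the match is then
        # overlapping, i.e. it repeats bytes that belong to the match itself
        while (length < max_len and pos + length < len(data)
               and data[start + length] == data[pos + length]):
            length += 1
        if length >= min_len and (best is None or length > best[1]):
            best = (pos - start, length)
    return best

print(longest_match_sketch(b'abcabcabcX', pos=3, window=64, min_len=2, max_len=17))
# (3, 6)
```

In this example the match is longer than its distance: only three bytes precede position 3, yet six bytes can be reproduced by cycling through them. This is the case that `generate_repeated_sequence()` handles in the notebook.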

In [7]:
from typing import Optional, Tuple
from bitarray import bitarray
from tqdm.notebook import tqdm

def generate_repeated_sequence(x: bytes, num_bytes: int) -> bytes:
    """
    Generate a sequence by repeating the given bytes and appending a portion of the bytes.

    Examples:
        generate_repeated_sequence(b"1234567", 5) -> b"12345"
        generate_repeated_sequence(b"123", 5) -> b"12312"

    Args:
        x (bytes): The bytes to repeat and append.
        num_bytes (int): The desired length of the generated sequence.

    Returns:
        bytes: The generated sequence of bytes.
    """
    repetitions = num_bytes // len(x)
    remainder = num_bytes % len(x)
    
    return x * repetitions + x[:remainder]

def find_longest_match(data: bytes, current_position: int, window_size: int, min_pattern_length: int, max_pattern_length: int) -> Optional[Tuple[int, int]]:
    """
    Find the longest repeated match in the given data buffer.

    Args:
        data (bytes): The data buffer to search for matches.
        current_position (int): The current position in the data buffer.
        window_size (int): The size of the window to search for patterns.
        min_pattern_length (int): The minimum length of a pattern to consider.
        max_pattern_length (int): The maximum length of a pattern to consider.

    Returns:
        Optional[Tuple[int, int]]: A tuple containing the match distance and match length if a match is found,
        or None if no match is found.
    """
    # Calculate the end of the buffer where the search should stop
    end_of_buffer = min(current_position + min_pattern_length + max_pattern_length, len(data))
    # Calculate the starting position of the search window
    search_start = max(0, current_position - window_size)
    
    # Loop through potential match candidate ends in descending order
    for match_candidate_end in range(end_of_buffer, current_position + min_pattern_length + 1, -1):
        # Extract the candidate sequence for matching
        match_candidate = data[current_position:match_candidate_end]
        
        # Loop through positions in the search window
        for search_position in range(search_start, current_position):
            # Check if the candidate matches the wrapped slice of data in the search window
            if match_candidate == generate_repeated_sequence(data[search_position:current_position], len(match_candidate)):
                # If the check is true, we return match distance (offset) and match length
                return current_position - search_position, len(match_candidate)

    # No candidate matched anywhere in the search window
    return None

#### Main Functions

Now, we can move ahead and create functions to encode and decode files using the LZSS algorithm. Let's break down the steps each of these functions takes:

##### Steps in the `lzss_encode()` Function
1. The function begins by initializing a bitarray to store the compressed data.
2. It then iterates through the input bytes. At each position, the function uses `find_longest_match()` to check whether the upcoming bytes repeat an earlier byte sequence.
    - If a match is found, a match flag bit (`1`) is appended to the output bitarray, followed by `12 bits` for the match distance (offset) and `4 bits` for the match length. A total of `17 bits` is added, and the position advances by the match length.
    - If no match is found, a non-match flag bit (`0`) is appended to the output bitarray, followed by the `8 bits` of the byte itself. A total of `9 bits` is added, and the position advances by one byte.
3. After processing all bytes, the bit array is padded to ensure the last byte is complete.
4. Finally, the bit array is converted to bytes using `tobytes()` and returned.
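
The two token layouts can be written out as plain bit strings. The sketch below uses strings instead of a `bitarray` purely for readability; `match_token_bits` and `literal_token_bits` are illustrative names, and the length field is assumed to store `length - 2`, as in the code that follows.

```python
def match_token_bits(distance: int, length: int) -> str:
    """17-bit match token: flag '1', 12-bit distance, 4-bit (length - 2)."""
    assert 1 <= distance < 2 ** 12, "distance must fit in 12 bits"
    assert 2 <= length <= 17, "length - 2 must fit in 4 bits"
    return '1' + format(distance, '012b') + format(length - 2, '04b')

def literal_token_bits(byte: int) -> str:
    """9-bit literal token: flag '0' followed by the byte itself."""
    return '0' + format(byte, '08b')

print(match_token_bits(distance=3, length=6))  # 10000000000110100
print(literal_token_bits(ord('a')))            # 001100001
```

A match token always costs 17 bits, so it saves space only when it replaces at least two literals (18 bits).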

##### Steps in the `lzss_decode()` Function
1. The function begins by loading the input bytes into a bit array and creating an empty list called `output_buffer` to store the decoded bytes.
2. It then loops while the bit array contains `9 bits` or more (`9 bits` is the smallest unit the encoder emits, so anything shorter is padding). During each iteration, a flag bit is popped from the front of the bit array. If the flag is `0`, the following `8 bits` are removed from the bit array, converted to a byte, and appended to `output_buffer`. If the flag is `1`, the match distance and match length are extracted from the next `16 bits`, which are then removed from the bit array. These values are used to copy the corresponding bytes from earlier in `output_buffer` onto its end.
3. The decoded bytes in `output_buffer` are then joined and returned.
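
The copy step for a match can be sketched on its own. `copy_match` is a hypothetical helper that works on a `bytearray` rather than the list used in the notebook, but the byte-at-a-time copy is the same idea.

```python
def copy_match(history: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes to `history`, each copied from `distance` bytes back."""
    for _ in range(length):
        # Copying one byte at a time lets a match overlap its own output
        history.append(history[-distance])

history = bytearray(b'abc')
copy_match(history, distance=3, length=6)
print(bytes(history))  # b'abcabcabc'
```

Copying a single slice would not work here, because the last three bytes of the match do not exist until the first three have been appended.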

In [8]:
IS_MATCH_BIT = True

# A match token costs 17 bits (flag, distance, and length) while a literal costs
# 9 bits, so a match of length 1 is never worth encoding.
# Since lengths 0 and 1 are unused, the 4-bit length field stores (length - 2),
# which covers match lengths 2-17 (inclusive).
# MAX_PATTERN_LENGTH is the largest value of that 4-bit field; it is also used
# as the bit mask when decoding.
MIN_PATTERN_LENGTH = 2
MAX_PATTERN_LENGTH = 15

def lzss_encode(data: bytes, pattern_look_up_window_size: int = 64) -> bytes:
    """
    Encodes the given data using the LZSS algorithm.

    Args:
        data (bytes): The data to be encoded.
        pattern_look_up_window_size (int): How many preceding bytes are searched for matches.

    Returns:
        bytes: The encoded data.
    """
    output_buffer = bitarray()
    
    # Initialize progress bar to track encoding progress
    with tqdm(total=len(data), desc='LZSS Encoding', leave=True) as progress_bar:
        i = 0
        
        # Process the data byte by byte
        while i < len(data):
            match = find_longest_match(data, i, pattern_look_up_window_size, MIN_PATTERN_LENGTH, MAX_PATTERN_LENGTH)
            # Check if a match is found using the find_longest_match function
            if match is not None:
                match_distance, match_length = match
                
                # Append the match flag to the output buffer
                output_buffer.append(IS_MATCH_BIT)
                # Calculate the high and low parts of the match distance
                dist_hi, dist_lo = match_distance >> 4, (match_distance) & 0xF
                # Append the match distance and match length to the output buffer
                output_buffer.frombytes(bytes([dist_hi, (dist_lo << 4) | (match_length - 2)]))
                
                # Update the current position in the data and the progress bar
                i += match_length
                progress_bar.update(match_length)
            else:
                # If no match is found, append the non-match flag to the output buffer
                output_buffer.append(not IS_MATCH_BIT)
                # Append the byte from the input data to the output buffer
                output_buffer.frombytes(bytes([data[i]]))
                
                # Update the current position in the data and the progress bar
                i += 1
                progress_bar.update(1)
    
    # Pad the output buffer to complete the last byte
    output_buffer.fill()
    
    # Convert the output buffer to bytes and return
    return output_buffer.tobytes()

def lzss_decode(encoded_bytes: bytes) -> bytes:
    """
    Decodes the given LZSS encoded data.

    Args:
        encoded_bytes (bytes): The encoded data to be decoded.

    Returns:
        bytes: The decoded data.
    """
    # Initialize a bitarray to hold the encoded data
    data = bitarray()
    # Convert the encoded bytes to a bitarray
    data.frombytes(encoded_bytes)
    # Ensure there's data to decode
    assert data, f"Cannot decode {encoded_bytes}"

    # Initialize a list to store the decoded bytes
    output_buffer = []
    
    # Initialize progress bar to track decoding progress
    with tqdm(total=len(data), desc='LZSS Decoding', leave=True) as progress_bar:
        # Continue decoding while there are enough bits for a match or non-match flag. Anything less than 9 bits is padding
        while len(data) >= 9:
            if data.pop(0) != IS_MATCH_BIT:
                # If it's a non-match, extract the next 8 bits as a byte
                byte = data[:8].tobytes()
                del data[:8]
                
                # Update the progress bar
                progress_bar.update(9)
                # Append the byte to the output buffer
                output_buffer.append(byte)
            else:
                # If it's a match, extract the next 16 bits as hi and lo bytes
                hi, lo = data[:16].tobytes()
                del data[:16]
                
                # Update the progress bar
                progress_bar.update(17)
                # Calculate the match distance
                distance = (hi << 4) | (lo >> 4)
                # Calculate the match length
                length = (lo & MAX_PATTERN_LENGTH) + MIN_PATTERN_LENGTH
                
                # Reconstruct the matched substring using history (output_buffer)
                for _ in range(length):
                    output_buffer.append(output_buffer[-distance])

    # Convert the output buffer to bytes and return the decoded data
    return b"".join(output_buffer)

With the base functions for LZSS encoding complete, we can create functions for encoding and decoding files using them.

In [9]:
def lzss_encode_file(input_audio_file_path: str, output_audio_file_path: str, pattern_look_up_window_size: int):
    """
    Compresses an audio file using the LZSS algorithm and writes the compressed data to an output file.

    Args:
        input_audio_file_path (str): The path to the input audio file.
        output_audio_file_path (str): The path to the output compressed audio file.
        pattern_look_up_window_size (int): The window size for pattern lookup in LZSS encoding(affects speed of the algorithm).
    """
    # Load the audio file
    audio_buffer = read_file_as_byte_array(input_audio_file_path)

    # Using LZSS encoding algorithm to compress audio buffer
    compressed_audio_buffer = lzss_encode(audio_buffer, pattern_look_up_window_size)

    # Writing encoded audio samples to output file
    with open(output_audio_file_path, 'wb+') as output_file:
        output_file.write(compressed_audio_buffer)

def lzss_decode_file(input_audio_file_path: str, output_audio_file_path: str):
    """
    Decompresses a previously LZSS compressed audio file and writes the decompressed data to an output file.

    Args:
        input_audio_file_path (str): The path to the input compressed audio file.
        output_audio_file_path (str): The path to the output decompressed audio file.
    """
    # Load the compressed file
    compressed_audio_buffer = read_file_as_byte_array(input_audio_file_path)

    # Using LZSS decoding algorithm to decompress buffer
    audio_buffer = lzss_decode(compressed_audio_buffer)

    # Writing decoded audio samples to output file
    with open(output_audio_file_path, 'wb+') as output_file:
        output_file.write(audio_buffer)

#### Testing

Now that we have the functions ready, we can test them and compare the results. With the LZSS algorithm, enlarging the pattern detection window usually improves compression. However, it also slows down encoding, because at each position the candidate sequence is compared against up to `n` earlier start positions (where `n` is the window size). For our tests, we'll use a relatively small window size of `64`.
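
One practical limit is worth keeping in mind when choosing a window: the encoder above stores the match distance in 12 bits, so offsets larger than 4095 cannot be represented. A small guard such as the following could be run before encoding; `check_window_size` and `MAX_DISTANCE_BITS` are illustrative names, not part of the notebook.

```python
MAX_DISTANCE_BITS = 12  # width of the distance field in a match token

def check_window_size(window_size: int) -> int:
    """Return window_size if every offset inside it fits the distance field."""
    max_distance = 2 ** MAX_DISTANCE_BITS - 1
    if not 1 <= window_size <= max_distance:
        raise ValueError(f'window_size must be between 1 and {max_distance}')
    return window_size

print(check_window_size(64))    # 64
print(check_window_size(4095))  # 4095
```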

The code snippet below encodes the `./Exercise_2_Files/Sound1.wav` file, decodes the encoded file, and then compares the original file with the decoded file to verify the correctness of the code.

In [10]:
lzss_encode_file('./Exercise_2_Files/Sound1.wav', './Exercise_2_Output/Sound1_Enc_Win_64.lzss', 64)
lzss_decode_file('./Exercise_2_Output/Sound1_Enc_Win_64.lzss', './Exercise_2_Output/Sound1_Enc_Win_64_Dec_lzss.wav')

# Comparing original file with the decoded file
if compare_files('./Exercise_2_Files/Sound1.wav', './Exercise_2_Output/Sound1_Enc_Win_64_Dec_lzss.wav'):
    print('The encoding and decoding functions work!')
else:
    print('Something is wrong with the encoding and decoding functions!')

The encoding and decoding functions work!


From the output of the above cell, we can see the LZSS encoding and decoding functions work as expected.

#### Compression Result

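With working encoding and decoding functions, we can now construct a table showing the compression results.

To do that, we will encode the `./Exercise_2_Files/Sound1.wav` and `./Exercise_2_Files/Sound2.wav` files using window sizes of `64` and `128`, and use pandas to build a table of the resulting file sizes. The `compression_percentage` column is the compressed size as a percentage of the original size, so values above `100` mean the encoded file is larger than the original.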

In [12]:
import pandas as pd
import os

audio_file_encoding_configs = [
    {
        'file_name': 'Sound1.wav',
        'input_file_path': './Exercise_2_Files/Sound1.wav',
        'output_file_path': './Exercise_2_Output/Sound1_Enc_Win_64.lzss',
        'window_size': 64
    },
    {
        'file_name': 'Sound1.wav',
        'input_file_path': './Exercise_2_Files/Sound1.wav',
        'output_file_path': './Exercise_2_Output/Sound1_Enc_Win_128.lzss',
        'window_size': 128
    },
    {
        'file_name': 'Sound2.wav',
        'input_file_path': './Exercise_2_Files/Sound2.wav',
        'output_file_path': './Exercise_2_Output/Sound2_Enc_Win_64.lzss',
        'window_size': 64
    },
    {
        'file_name': 'Sound2.wav',
        'input_file_path': './Exercise_2_Files/Sound2.wav',
        'output_file_path': './Exercise_2_Output/Sound2_Enc_Win_128.lzss',
        'window_size': 128
    },
]

file_size_dataframe = pd.DataFrame()

for config in audio_file_encoding_configs:
    file_name = config['file_name']
    input_file_path = config['input_file_path']
    output_file_path = config['output_file_path']
    window_size = config['window_size']
    
    lzss_encode_file(input_file_path, output_file_path, window_size)
    
    original_size, compressed_size = os.path.getsize(config['input_file_path']), os.path.getsize(config['output_file_path'])
    compression_percentage = (compressed_size / original_size) * 100
    
    new_row = {
        'file_name': file_name, 
        'window_size': window_size,
        'original_size': original_size, 
        'compressed_size': compressed_size,
        'compression_percentage': compression_percentage
    }
    file_size_dataframe = pd.concat([file_size_dataframe, pd.DataFrame([new_row])], ignore_index=True)
    
file_size_dataframe

| | file_name | window_size | original_size | compressed_size | compression_percentage |
|---|---|---|---|---|---|
| 0 | Sound1.wav | 64 | 1002088 | 958299 | 95.630224 |
| 1 | Sound1.wav | 128 | 1002088 | 932662 | 93.071866 |
| 2 | Sound2.wav | 64 | 1008044 | 1134040 | 112.499058 |
| 3 | Sound2.wav | 128 | 1008044 | 1134038 | 112.498859 |
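
A rough way to read these numbers: every literal costs 9 bits and every match costs 17 bits regardless of its length, so an input with no usable matches grows to 9/8 of its size. The sketch below makes that arithmetic explicit; `lzss_size_percentage` is an illustrative helper, and the token counts passed to it are made-up inputs rather than measurements.

```python
def lzss_size_percentage(num_literals: int, match_lengths: list) -> float:
    """Encoded size as a percentage of the input size, ignoring final padding."""
    input_bits = 8 * (num_literals + sum(match_lengths))
    output_bits = 9 * num_literals + 17 * len(match_lengths)
    return output_bits / input_bits * 100

print(lzss_size_percentage(1000, []))       # 112.5
print(lzss_size_percentage(0, [4] * 250))   # 53.125
```

The 112.5% figure is very close to the `Sound2.wav` rows, which suggests that almost no matches were found in that file at these window sizes.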
