LZ77 Algorithm in brief:
- LZ77 is a lossless data compression algorithm that identifies repeated sequences of data within a "sliding window" of previously processed data.

- Instead of storing the repeated sequence again, it replaces it with a "pointer" (the distance back to an earlier occurrence inside the window) and the "length" of the repeated sequence.

- This mechanism reduces redundancy by referencing past data, leading to compression.
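
The bullets above can be sketched as a toy encoder and decoder. This is only an illustration of the idea, not the implementation used by gzip or by the LZ77-Compressor package later in this notebook: the function names, the `(distance, length, next_byte)` token layout, and the default `window_size` and `max_length` values are all assumptions chosen for readability.

```python
def lz77_compress(data: bytes, window_size: int = 32, max_length: int = 15):
    """Encode data as a list of (distance, length, next_byte) tokens (toy format)."""
    tokens = []
    i = 0
    while i < len(data):
        best_distance, best_length = 0, 0
        window_start = max(0, i - window_size)
        # Search the sliding window for the longest match starting at position i.
        for j in range(window_start, i):
            length = 0
            # Always leave one byte over to act as the literal after the match.
            while (length < max_length
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_distance, best_length = i - j, length
        next_byte = data[i + best_length]
        tokens.append((best_distance, best_length, next_byte))
        i += best_length + 1
    return tokens


def lz77_decompress(tokens) -> bytes:
    """Rebuild the original bytes from (distance, length, next_byte) tokens."""
    out = bytearray()
    for distance, length, next_byte in tokens:
        start = len(out) - distance
        # Copy byte by byte so that matches overlapping the output still work.
        for k in range(length):
            out.append(out[start + k])
        out.append(next_byte)
    return bytes(out)


sample = b"abcabcabcabc"
tokens = lz77_compress(sample)
assert lz77_decompress(tokens) == sample  # lossless round trip
print(len(sample), "bytes ->", len(tokens), "tokens")
```

Real encoders pack the tokens into bits and use much larger windows; the sketch only shows how back-references replace repeated data.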

In [1]:
import os
import requests
from pathlib import Path
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import gzip
import pandas as pd
import shutil
from tqdm.notebook import tqdm

In [2]:
# TODO: UPDATE THIS WITH YOUR OWN LOCAL PATH IF YOU WANT
base_path = Path("/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA")
experiment_path = base_path / "RAW"
compressed_path = base_path / "RAW_COMPRESSED"

# Create both directories (and any missing parents) if they do not already exist
experiment_path.mkdir(parents=True, exist_ok=True)
compressed_path.mkdir(parents=True, exist_ok=True)

## Download all the Relevant Datasets

In [3]:
def download_from_url(base_url, experiment_path):
    # Fetch the directory listing and parse its links
    response = requests.get(base_url, timeout=60)
    response.raise_for_status()
    parent_folder = BeautifulSoup(response.text, 'html.parser')

    for link in parent_folder.find_all('a'):
        href = link.get('href')
        # Skip sort links ('?...') and absolute paths; keep only .DAT files
        if href and not href.startswith(('?', '/')) and href.endswith('.DAT'):
            file_url = urljoin(base_url, href)
            print(f"Downloading {file_url}")
            file_path = experiment_path / href
            # Download first so a failed request does not leave an empty file behind
            file_response = requests.get(file_url, timeout=60)
            file_response.raise_for_status()
            file_path.write_bytes(file_response.content)

In [4]:
### 2025-04-07
base_url = 'https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/'
download_from_url(base_url, experiment_path)

### 2025-05-04
base_url = 'https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/05/04/'
download_from_url(base_url, experiment_path)

Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA0_250407091334.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA0_250407092035.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA0_250407092725.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA0_250407093412.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA0_250407094111.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA0_250407094815.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA0_250407095503.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA2_250407074546.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/07/PADREMDA2_250407082056.DAT
Downloading https://umbra.nascom.nasa.gov/padre/padre-meddea/raw/2025/04/

In [5]:
spectrum_files = list(experiment_path.glob("*A2_*.DAT"))
spectrum_files.sort()
print(f"Found {len(spectrum_files)} spectrum files")
spectrum_files

Found 12 spectrum files


[PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250407074546.DAT'),
 PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250407082056.DAT'),
 PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250407085636.DAT'),
 PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250407093406.DAT'),
 PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250504070426.DAT'),
 PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250504081536.DAT'),
 PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250504103826.DAT'),
 PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250504114936.DAT'),
 PosixPath('/Users/andrewrobbertz/__SOC_CODE__/__sandbox__/data/PADRE/MEDDEA/RAW/PADREMDA2_250504130056.DAT'),
 

## Test Compression of Spectrum DAT Files

### GZIP Compression 

How gzip and LZ77 relate in Python:
- The gzip module in Python leverages the zlib module, which implements the Deflate compression algorithm.
- The Deflate algorithm is a combination of LZ77 and Huffman coding. It first applies LZ77 to find and replace repeated sequences, and then uses Huffman coding to further compress the resulting stream of literals and LZ77 back-references.
- Therefore, when you use gzip.compress() or gzip.open() in Python, the underlying compression process involves the LZ77 algorithm as a core component of the Deflate method.
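
The relationship described above can be checked from Python itself. The snippet below is a small sketch using a made-up, highly repetitive sample (the `data` value is an assumption, not one of the PADRE files): it compresses with `gzip`, inspects the gzip header, and then decodes the same bytes with `zlib` by selecting the gzip container format through `wbits`.

```python
import gzip
import zlib

# Hypothetical, highly repetitive sample; not taken from the real .DAT files.
data = b"spectrum packet " * 1000

gz_bytes = gzip.compress(data)

# A gzip stream starts with the magic bytes 0x1f 0x8b,
# followed by the compression method byte (8 means Deflate).
assert gz_bytes[:2] == b"\x1f\x8b"
assert gz_bytes[2] == 8

# zlib decodes the same stream when wbits selects the gzip header format.
assert zlib.decompress(gz_bytes, wbits=16 + zlib.MAX_WBITS) == data

print(f"{len(data)} bytes -> {len(gz_bytes)} bytes")
```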

In [6]:
# Create a function to compress a file using gzip (LZ77-based compression)
def compress_gzip(input_file, output_dir):
    # Create output filename with .gz extension
    output_file = output_dir / (input_file.name + '.gz')
    
    # Compress the file
    with open(input_file, 'rb') as f_in:
        with gzip.open(output_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    
    return output_file
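
Since the point of the comparison is lossless compression, it may be worth confirming that a `.gz` file decompresses back to the exact original bytes. The helper below is a hypothetical addition (the name `gzip_roundtrip_ok` is not part of the original notebook); it compares SHA-256 digests in chunks so large files never need to fit in memory.

```python
import gzip
import hashlib
from pathlib import Path


def gzip_roundtrip_ok(original: Path, compressed: Path) -> bool:
    """Return True if decompressing `compressed` reproduces `original` byte for byte."""
    def sha256_of(fileobj):
        digest = hashlib.sha256()
        # Read in 1 MiB chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fileobj.read(1 << 20), b""):
            digest.update(chunk)
        return digest.hexdigest()

    with open(original, "rb") as f_orig, gzip.open(compressed, "rb") as f_gz:
        return sha256_of(f_orig) == sha256_of(f_gz)
```

A possible use would be `assert gzip_roundtrip_ok(file, compress_gzip(file, compressed_path))` inside the compression loop.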

### LZ77-Compressor

- Follow Setup Instructions from: https://github.com/manassra/LZ77-Compressor

In [7]:
from LZ77 import LZ77Compressor
  
compressor = LZ77Compressor() # window_size is optional

def compress_lz77(input_file, output_dir):
    # Create output filename with .lz77 extension
    output_file = output_dir / (input_file.name + '.lz77')

    # compress the input file and write it as binary into the output file
    compressor.compress(input_file, output_file)

    return output_file

In [8]:
# Process all files and collect compression statistics
results = []

# Bytes per MiB (1024 * 1024); byte counts are divided by this to get the "MB" columns
BYTES_TO_MB = 1024 * 1024

for file in tqdm(spectrum_files, desc="Compressing files"):
    # Get original file size
    orig_size_bytes = file.stat().st_size
    orig_size_mb = orig_size_bytes / BYTES_TO_MB
    
    # Compress the file - with gzip
    gz_compressed_file = compress_gzip(file, compressed_path)
    
    # Get compressed file size
    gz_compressed_size_bytes = gz_compressed_file.stat().st_size
    gz_compressed_size_mb = gz_compressed_size_bytes / BYTES_TO_MB
    
    # Calculate space savings
    gz_space_saved_mb = orig_size_mb - gz_compressed_size_mb
    gz_savings_percent = (gz_space_saved_mb / orig_size_mb) * 100

    # Compress the file - with LZ77
    lz_compressed_file = compress_lz77(file, compressed_path)

    # Get compressed file size
    lz_compressed_size_bytes = lz_compressed_file.stat().st_size
    lz_compressed_size_mb = lz_compressed_size_bytes / BYTES_TO_MB

    # Calculate space savings
    lz_space_saved_mb = orig_size_mb - lz_compressed_size_mb
    lz_savings_percent = (lz_space_saved_mb / orig_size_mb) * 100

    # Add to results
    results.append({
        'Filename': file.name,
        'Original Size (MB)': round(orig_size_mb, 3),
        'Gzip Compressed Size (MB)': round(gz_compressed_size_mb, 3),
        'Gzip Space Saved (MB)': round(gz_space_saved_mb, 3),
        'Gzip Space Savings (%)': round(gz_savings_percent, 2),
        'LZ77 Compressed Size (MB)': round(lz_compressed_size_mb, 3),
        'LZ77 Space Saved (MB)': round(lz_space_saved_mb, 3),
        'LZ77 Space Savings (%)': round(lz_savings_percent, 2)
    })

# Create a DataFrame for better visualization
compression_df = pd.DataFrame(results)
compression_df

File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...
File was compressed successfully and saved to output path ...


| | Filename | Original Size (MB) | Gzip Compressed Size (MB) | Gzip Space Saved (MB) | Gzip Space Savings (%) | LZ77 Compressed Size (MB) | LZ77 Space Saved (MB) | LZ77 Space Savings (%) |
|---|---|---|---|---|---|---|---|---|
| 0 | PADREMDA2_250407074546.DAT | 5.007 | 0.063 | 4.944 | 98.74 | 0.818 | 4.188 | 83.65 |
| 1 | PADREMDA2_250407082056.DAT | 5.007 | 0.084 | 4.923 | 98.32 | 0.845 | 4.162 | 83.13 |
| 2 | PADREMDA2_250407085636.DAT | 5.007 | 0.171 | 4.836 | 96.58 | 0.96 | 4.047 | 80.83 |
| 3 | PADREMDA2_250407093406.DAT | 5.007 | 0.223 | 4.784 | 95.54 | 1.027 | 3.98 | 79.5 |
| 4 | PADREMDA2_250504070426.DAT | 10.014 | 0.26 | 9.754 | 97.4 | 1.791 | 8.222 | 82.11 |
| 5 | PADREMDA2_250504081536.DAT | 10.014 | 0.393 | 9.621 | 96.08 | 1.96 | 8.054 | 80.43 |
| 6 | PADREMDA2_250504103826.DAT | 10.014 | 0.488 | 9.525 | 95.12 | 2.074 | 7.939 | 79.28 |
| 7 | PADREMDA2_250504114936.DAT | 10.014 | 0.551 | 9.463 | 94.5 | 2.148 | 7.865 | 78.54 |
| 8 | PADREMDA2_250504130056.DAT | 10.014 | 0.549 | 9.464 | 94.52 | 2.147 | 7.866 | 78.56 |
| 9 | PADREMDA2_250504141226.DAT | 10.014 | 0.562 | 9.452 | 94.39 | 2.17 | 7.844 | 78.33 |


In [9]:
compression_df.describe()

| | Original Size (MB) | Gzip Compressed Size (MB) | Gzip Space Saved (MB) | Gzip Space Savings (%) | LZ77 Compressed Size (MB) | LZ77 Space Saved (MB) | LZ77 Space Savings (%) |
|---|---|---|---|---|---|---|---|
| count | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 |
| mean | 6.77 | 0.285167 | 6.484667 | 95.906667 | 1.35025 | 5.419333 | 80.2025 |
| std | 3.731163 | 0.213408 | 3.54503 | 1.724984 | 0.787928 | 2.95561 | 2.145486 |
| min | 0.047 | 0.002 | 0.045 | 92.98 | 0.009 | 0.038 | 76.48 |
| 25% | 5.007 | 0.082 | 4.823 | 94.515 | 0.83825 | 4.03025 | 78.555 |
| 50% | 7.5105 | 0.2415 | 7.198 | 95.81 | 1.409 | 6.016 | 79.965 |
| 75% | 10.014 | 0.50325 | 9.47925 | 96.8825 | 2.09225 | 7.88425 | 81.72 |
| max | 10.014 | 0.562 | 9.754 | 98.74 | 2.17 | 8.222 | 83.65 |
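
The "Space Savings (%)" columns follow the formula used in the loop above: the bytes saved divided by the original size, times 100. As a small sketch, the two helpers below (hypothetical names, not part of the original notebook) express that formula and the equivalent compression ratio, which can be easier to compare across methods.

```python
def space_savings_percent(original_bytes: int, compressed_bytes: int) -> float:
    """Percentage of the original size removed by compression."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return (original_bytes - compressed_bytes) / original_bytes * 100


def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Original size divided by compressed size (for example, 4.0 means 4:1)."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# Illustrative numbers only, not taken from the tables above.
assert space_savings_percent(1000, 250) == 75.0
assert compression_ratio(1000, 250) == 4.0
```

For orientation, a savings of 95% corresponds to a ratio of 20:1, while a savings of 80% corresponds to 5:1.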
