<a href="https://colab.research.google.com/github/Tajdari-S/Compression/blob/main/SPMLF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
import json
import numpy as np
from scipy.sparse import coo_matrix, save_npz

def convert_to_binary(entry):
    binary_data = []

    for key, value in entry.items():
        if isinstance(value, str):
            # Convert string to binary using UTF-8 encoding
            binary_data.append(value.encode('utf-8'))
        elif isinstance(value, list):
            # Convert numerical values to numpy array and then to binary
            binary_data.append(np.array(value).tobytes())
        else:
            binary_data.append(value)

    return binary_data

# Specify the input and output file paths
input_file_path = "input_data.json"
output_file_path = "output_binary_sparse.npz"

# Load JSON data from the input file
with open(input_file_path, 'r') as file:
    input_data = json.load(file)

# Convert each entry to binary data
binary_data = [convert_to_binary(entry) for entry in input_data]

# Create a sparse matrix from binary data using coo_matrix
binary_matrix = coo_matrix(binary_data)

# Save the binary sparse matrix
save_npz(output_file_path, binary_matrix)

print("Binary sparse matrix data saved to", output_file_path)


Binary sparse matrix data saved to output_binary_sparse.npz


In [None]:
import json
import numpy as np
from scipy.sparse import load_npz

def convert_from_binary(binary_data):
    entry = {}
    index = 0

    for key, value in original_data[0].items():
        if isinstance(value, str):
            # Decode binary data to string using UTF-8 encoding
            entry[key] = binary_data[index].decode('utf-8')
        elif isinstance(value, list):
            # Convert binary data to numpy array
            array_size = len(value)
            array_bytes = binary_data[index][:array_size * 8]
            entry[key] = np.frombuffer(array_bytes, dtype=np.float64).tolist()
            index += 1
        else:
            entry[key] = binary_data[index]

        index += 1

    return entry

# Specify the input and output file paths
input_file_path = "input_data.json"
output_file_path = "output_binary_sparse.npz"

# Load original JSON data for reference
with open(input_file_path, 'r') as file:
    original_data = json.load(file)

# Load the binary sparse matrix
binary_matrix = load_npz(output_file_path)

# Convert binary matrix to list of binary entries
binary_data_list = binary_matrix.data.tolist()

# Convert each binary entry to the original format
original_entries = [convert_from_binary(binary_data_list[i:i+len(original_data[0])]) for i in range(0, len(binary_data_list), len(original_data[0]))]

# Save the reconstructed JSON file
output_json_path = "reconstructed_data.json"
with open(output_json_path, 'w') as output_file:
    json.dump(original_entries, output_file, indent=2)

print("Reconstructed JSON data saved to", output_json_path)


Reconstructed JSON data saved to reconstructed_data.json


In [None]:
import json
import numpy as np
from scipy.sparse import load_npz, save_npz

def convert_from_binary(binary_data):
    entry = {}
    index = 0

    for key, value in original_data[0].items():
        if isinstance(value, str):
            # Decode binary data to string using UTF-8 encoding
            entry[key] = binary_data[index].decode('utf-8')
        elif isinstance(value, list):
            # Convert binary data to numpy array
            array_size = len(value)
            array_bytes = binary_data[index][:array_size * 8]
            entry[key] = np.frombuffer(array_bytes, dtype=np.float64).tolist()
            index += 1
        else:
            entry[key] = binary_data[index]

        index += 1

    return entry

# Specify the input and output file paths
input_file_path = "input_data.json"
output_file_path = "output_binary_sparse.npz"

# Load original JSON data for reference
with open(input_file_path, 'r') as file:
    original_data = json.load(file)

# Load the binary sparse matrix
binary_matrix = load_npz(output_file_path)

# Convert binary matrix to list of binary entries
binary_data_list = binary_matrix.data.tolist()

# Convert each binary entry to the original format
original_entries = [convert_from_binary(binary_data_list[i:i+len(original_data[0])]) for i in range(0, len(binary_data_list), len(original_data[0]))]

# Save the reconstructed JSON file
output_json_path = "reconstructed_data.json"
with open(output_json_path, 'w') as output_file:
    json.dump(original_entries, output_file, indent=2)

# Save the binary sparse matrix in a compressed format
save_npz("compressed_output_binary_sparse.npz", binary_matrix)

print("Reconstructed JSON data saved to", output_json_path)
print("Compressed binary sparse matrix saved to compressed_output_binary_sparse.npz")


Reconstructed JSON data saved to reconstructed_data.json
Compressed binary sparse matrix saved to compressed_output_binary_sparse.npz


In [None]:
import json
import numpy as np
from scipy.sparse import coo_matrix, save_npz, load_npz
import gzip

def convert_to_binary(entry):
    binary_data = []
    for key, value in entry.items():
        if isinstance(value, str):
            # Encode string to binary data using UTF-8 encoding
            encoded_value = value.encode('utf-8')
            binary_data.append(encoded_value)
        elif isinstance(value, list):
            # Convert list to numpy array and then to binary data
            array_bytes = np.array(value, dtype=np.float64).tobytes()
            binary_data.append(array_bytes)
        else:
            binary_data.append(value)

    return binary_data

def compress_and_save_sparse_matrix(matrix, output_file_path):
    # Save the sparse matrix to a binary file
    save_npz(output_file_path, matrix)

    # Open the binary file and compress it using gzip
    with open(output_file_path, 'rb') as binary_file:
        with gzip.open(output_file_path + '.gz', 'wb') as compressed_file:
            compressed_file.writelines(binary_file)

def load_and_decompress_sparse_matrix(input_file_path):
    # Open the compressed file and decompress it using gzip
    with gzip.open(input_file_path, 'rb') as compressed_file:
        with open(input_file_path.replace('.gz', '_decompressed.npz'), 'wb') as binary_file:
            binary_file.writelines(compressed_file)

    # Load the sparse matrix from the decompressed binary file
    decompressed_matrix = load_npz(input_file_path.replace('.gz', '_decompressed.npz'))
    return decompressed_matrix

# Specify the input and output file paths
input_file_path = "input_data.json"
output_file_path = "output_sparse_matrix.npz"

# Load original JSON data
with open(input_file_path, 'r') as file:
    original_data = json.load(file)

# Convert each entry to binary format
binary_entries = [convert_to_binary(entry) for entry in original_data]

# Ensure consistent data types for the binary entries
binary_data_list = [item if isinstance(item, (bytes, int, float)) else item[0] for item in binary_entries]

# Create a binary sparse matrix using COO format
num_entries = len(original_data)
num_features = len(binary_data_list) // num_entries
rows = [i // num_features for i in range(len(binary_data_list))]
cols = [i % num_features for i in range(len(binary_data_list))]
binary_matrix = coo_matrix((binary_data_list, (rows, cols)))

# Compress and save the sparse matrix
compress_and_save_sparse_matrix(binary_matrix, output_file_path)

# Load and decompress the sparse matrix
decompressed_matrix = load_and_decompress_sparse_matrix(output_file_path + '.gz')

# Optional: Verify that the loaded matrix matches the original matrix
assert np.array_equal(binary_matrix.data, decompressed_matrix.data)
assert np.array_equal(binary_matrix.row, decompressed_matrix.row)
assert np.array_equal(binary_matrix.col, decompressed_matrix.col)

print("Compression and reconstruction completed successfully.")


Compression and reconstruction completed successfully.


Report on the Compressed Binary Sparse Matrix Generation and Reconstruction Code

Objective:
The objective of the provided code is to convert data from a JSON file into a compressed binary sparse matrix, store it, and then reconstruct the matrix from the stored compressed format.

Overview of the Code:

Data Conversion to Binary Format:

The original JSON data is loaded from a specified file (input_file_path) into the variable original_data.
The function convert_to_binary is employed to convert each entry in original_data to binary format. The function ensures appropriate encoding for strings and converts lists to binary using NumPy.
Sparse Matrix Construction:

The binary entries are then used to construct a binary sparse matrix using the COO (Coordinate) format. The COO format is chosen for its flexibility in handling different data types, including strings.
The COO matrix is created with the help of the coo_matrix function from the scipy.sparse library.
Compression and Storage:

The binary sparse matrix is saved to a file (output_file_path) using save_npz.
The saved binary file is then compressed using the gzip library to reduce the file size. The compressed file is saved with a ".gz" extension.
Reconstruction from Compressed Format:

The compressed binary file is loaded and decompressed using gzip.
The decompressed binary data is then loaded into a sparse matrix using the load_npz function.
Verification:

Optional verification steps are included to ensure the integrity of the reconstructed matrix. These steps compare the data, row indices, and column indices of the original and decompressed matrices.
Results and Output:

If the code runs successfully, it prints a message indicating the completion of compression and reconstruction.
Considerations and Adjustments:

The convert_to_binary function ensures that string data is encoded in UTF-8 and truncates long strings. The truncation length can be adjusted based on the characteristics of the data.
The COO format is chosen for its flexibility with various data types, but other sparse matrix formats can be considered based on specific requirements.
Conclusion:
The code successfully achieves the goal of converting JSON data into a compressed binary sparse matrix, storing it, and reconstructing the matrix from the compressed format. The use of gzip compression contributes to efficient storage, especially in scenarios where the data matrix is sparse with many zero entries. The flexibility of the COO format allows the code to handle diverse data types, including strings, during the conversion and reconstruction processes. The optional verification steps enhance the reliability of the code.

Recommendations for Future Work:

Depending on the specific characteristics of the data, further optimizations and adjustments can be explored to enhance compression ratios.
Consideration of alternative sparse matrix formats based on the nature of the data could be investigated.
In a real-world scenario, additional error handling and logging mechanisms may be implemented to enhance robustness.