# Data Compression and Archiving

**Data Compression**
- It is the process of reducing the size of a file or data to save storage space and improve transmission efficiency.

**Types of Data Compression**
- Lossless Compression: 
  - In this method, data is compressed without any loss of information. The original data can be perfectly reconstructed from the compressed version.
  - Common lossless techniques include Huffman coding, Run-Length Encoding (RLE), Lempel-Ziv-Welch (LZW), and the Burrows-Wheeler Transform (BWT), a reversible reordering step used by bzip2.
  - Lossless compression is ideal for text files, databases, and program files where preserving the exact data is critical.
  
- Lossy Compression: 
  - In this method, data is compressed with some loss of information. The reconstructed data might not be identical to the original, but the loss is often imperceptible, like in audio and image compression.
  - It is commonly used in multimedia applications, such as image, audio, and video compression, where minor quality loss is acceptable to achieve significant file size reduction.
  - Popular lossy compression methods include JPEG (for images), MP3 (for audio), and H.264 (for video).
  

The choice between lossless and lossy compression depends on the specific use case and the importance of preserving the original data fidelity. Lossless compression is preferred when data integrity is critical, while lossy compression is more suitable for cases where reducing file size is the primary concern and minor quality loss is acceptable.
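
To make the lossless property concrete, here is a minimal sketch of Run-Length Encoding, one of the methods listed above. The helper names `rle_encode` and `rle_decode` are hypothetical and chosen only for illustration; real codecs are considerably more elaborate.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)


original = "aaaabbbcca"
encoded = rle_encode(original)
restored = rle_decode(encoded)

print(encoded)               # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(restored == original)  # True: nothing was lost
```

A lossy scheme would instead discard detail during encoding, so the final comparison would no longer be guaranteed to hold.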


**Data Archiving**
- It is the process of storing data for long-term preservation and easy retrieval, typically to free up space on active storage systems.
- Archiving helps maintain historical records, compliance with regulations, and data retention policies.
- Archiving can involve segmenting data based on its relevance and importance, moving less frequently accessed data to archival storage.
- Data is usually stored in archival formats that ensure long-term preservation, often using open standards to avoid proprietary dependencies.


**Data Compression vs Data Archiving**
- Data compression focuses on reducing file sizes to optimize storage and data transmission.
- Data archiving aims to preserve and manage data for long-term accessibility and regulatory compliance.
- Both techniques are essential for efficient data management and play complementary roles in ensuring data is stored, organized, and made available as needed.

**Python Libraries for Data Compression and Archiving**
- `zlib` This library compresses and decompresses in-memory data using the zlib library (the DEFLATE algorithm).
- `gzip` This library allows you to work with gzip-compressed files using both file-like objects and in-memory data.
- `bz2` It provides functions to compress and decompress data using the bzip2 compression algorithm.
- `lzma` This library offers support for the LZMA and XZ compression formats.
- `zipfile` It enables you to create, read, and extract ZIP archives.
- `tarfile` This library allows you to create, read, and extract tar archives, which can then be optionally compressed using other compression libraries like gzip or bzip2.
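
As a rough illustration of how the four compression modules compare, the sketch below runs the same payload through each one-shot `compress()` function. The sample payload is an assumption made for this example, and the resulting sizes depend heavily on the input, so treat the numbers as indicative only.

```python
import bz2
import gzip
import lzma
import zlib

# Repetitive sample input (assumed for illustration); real data compresses differently
payload = b"The quick brown fox jumps over the lazy dog.\n" * 200

codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(payload)
    # Every codec here is lossless, so the round trip must reproduce the input
    assert decompress(packed) == payload
    sizes[name] = len(packed)
    print(f"{name:5s} {len(payload)} -> {len(packed)} bytes")
```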

## Zlib

The `zlib` module compresses and decompresses in-memory data using the zlib library. It does not read or write `.gz` files itself; the `gzip` module builds on it to read and write `.gz` files.

**Methods in Zlib**
- `zlib.adler32(data[, value])` it computes an Adler-32 checksum of data, the result is an unsigned 32-bit integer. If value is present, it is used as the starting value of the checksum; otherwise, a default value of 1 is used.
  
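- `zlib.compress(data, /, level=-1, wbits=MAX_WBITS)` compresses the bytes in data, returning a bytes object containing compressed data.
  - `level` is an integer from 0 to 9 or -1 controlling the level of compression; 1 (Z_BEST_SPEED) is fastest and produces the least compression, 9 (Z_BEST_COMPRESSION) is slowest and produces the most. 0 (Z_NO_COMPRESSION) is no compression. The default value is -1 (Z_DEFAULT_COMPRESSION). Z_DEFAULT_COMPRESSION represents a default compromise between speed and compression (currently equivalent to level 6).
  - `wbits` argument controls the size of the history buffer (or the “window size”) used when compressing data, and whether a header and trailer is included in the output. It can take several ranges of values, defaulting to 15 (MAX_WBITS):
    - +9 to +15: the base-two logarithm of the window size. Larger values give better compression at the cost of more memory. The output includes a zlib header and trailer.
    - -9 to -15: the absolute value is the window size logarithm, and the output is a raw stream with no header or trailer.
    - +25 to +31 (16 + 9 to 15): the low 4 bits are the window size logarithm, and the output includes a basic gzip header and trailing checksum.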

- `zlib.compressobj(level=-1, method=DEFLATED, wbits=MAX_WBITS, memLevel=DEF_MEM_LEVEL, strategy=Z_DEFAULT_STRATEGY[, zdict])` returns a compression object, to be used for compressing data streams that won’t fit into memory at once.
  - `level` is the compression level – an integer from 0 to 9 or -1. A value of 1 (Z_BEST_SPEED) is fastest and produces the least compression, while a value of 9 (Z_BEST_COMPRESSION) is slowest and produces the most. 0 (Z_NO_COMPRESSION) is no compression. The default value is -1 (Z_DEFAULT_COMPRESSION). Z_DEFAULT_COMPRESSION represents a default compromise between speed and compression (currently equivalent to level 6).
  - `method` is the compression algorithm. Currently, the only supported value is DEFLATED.
  - `wbits` has the same meaning as described for `zlib.compress()` above.
  - `memLevel` controls the amount of memory used for the internal compression state, values range from 1 to 9. Higher values use more memory, but are faster and produce smaller output.
  - `strategy` is used to tune the compression algorithm. Possible values are Z_DEFAULT_STRATEGY, Z_FILTERED, Z_HUFFMAN_ONLY, Z_RLE and Z_FIXED.
  - `zdict` is a predefined compression dictionary. This is a sequence of bytes (such as a bytes object) containing subsequences that are expected to occur frequently in the data that is to be compressed. Those subsequences that are expected to be most common should come at the end of the dictionary.

- `zlib.crc32(data[, value])` computes a CRC (Cyclic Redundancy Check) checksum of data. The result is an unsigned 32-bit integer. If value is present, it is used as the starting value of the checksum; otherwise, a default value of 0 is used. Passing in value allows computing a running checksum over the concatenation of several inputs. 
  
- `zlib.decompress(data, /, wbits=MAX_WBITS, bufsize=DEF_BUF_SIZE)` decompresses the bytes in data, returning a bytes object containing the uncompressed data.
  - If `bufsize` is given, it is used as the initial size of the output buffer. Raises the `zlib.error` exception if any error occurs.

- `zlib.decompressobj(wbits=MAX_WBITS[, zdict])` returns a decompression object, to be used for decompressing data streams that won’t fit into memory at once.

- `Compress.compress(data)` compress data, returning a bytes object containing compressed data for at least part of the data in data. This data should be concatenated to the output produced by any preceding calls to the compress() method. Some input may be kept in internal buffers for later processing.

- `Compress.flush([mode])` all pending input is processed, and a bytes object containing the remaining compressed output is returned. mode can be selected from the constants Z_NO_FLUSH, Z_PARTIAL_FLUSH, Z_SYNC_FLUSH, Z_FULL_FLUSH, Z_BLOCK or Z_FINISH, defaulting to Z_FINISH. Except Z_FINISH, all constants allow compressing further bytestrings of data, while Z_FINISH finishes the compressed stream and prevents compressing any more data. After calling flush() with mode set to Z_FINISH, the compress() method cannot be called again; the only realistic action is to delete the object.

- `Compress.copy()` returns a copy of the compression object. This can be used to efficiently compress a set of data that share a common initial prefix.

- `Decompress.unused_data` a bytes object which contains any bytes past the end of the compressed data. That is, this remains b"" until the last byte that contains compression data is available. If the whole bytestring turned out to contain compressed data, this is b"", an empty bytes object.

- `Decompress.unconsumed_tail` a bytes object that contains any data that was not consumed by the last decompress() call because it exceeded the limit for the uncompressed data buffer. This data has not yet been seen by the zlib machinery, so you must feed it (possibly with further data concatenated to it) back to a subsequent decompress() method call in order to get correct output.

- `Decompress.eof` boolean indicating whether the end of the compressed data stream has been reached.

- `Decompress.decompress(data, max_length=0)` decompresses data, returning a bytes object containing the uncompressed data corresponding to at least part of the input. This data should be concatenated to the output produced by any preceding calls to the decompress() method. Some of the input data may be preserved in internal buffers for later processing. If `max_length` is non-zero, the return value is no longer than `max_length`, and any input that could not be processed is left in `unconsumed_tail`.

- `Decompress.flush([length])` all pending input is processed, and a bytes object containing the remaining uncompressed output is returned. After calling flush(), the decompress() method cannot be called again; the only realistic action is to delete the object.

- `Decompress.copy()` returns a copy of the decompression object. This can be used to save the state of the decompressor midway through the data stream in order to speed up random seeks into the stream at a future point.

- `zlib.ZLIB_VERSION` a constant holding the version string of the zlib library that was used for building the module

- `zlib.ZLIB_RUNTIME_VERSION` a constant holding the version string of the zlib library actually loaded by the interpreter
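
The streaming objects described above can be sketched as follows. This is an illustrative example rather than a canonical recipe: `compress_chunks` is a hypothetical helper, and the chunk contents and the `TRAILING` marker are made up for the demonstration.

```python
import zlib


def compress_chunks(chunks, level=6):
    """Yield compressed pieces for an iterable of bytes chunks (hypothetical helper)."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:  # may be empty while input is still buffered internally
            yield piece
    # flush() defaults to Z_FINISH: emit what is left and end the stream
    yield compressor.flush()


chunks = [b"first chunk, ", b"second chunk, ", b"third chunk"]
stream = b"".join(compress_chunks(chunks))

# Decompress the stream followed by bytes that are not part of it
decompressor = zlib.decompressobj()
restored = decompressor.decompress(stream + b"TRAILING") + decompressor.flush()
print(restored)                  # b'first chunk, second chunk, third chunk'
print(decompressor.eof)          # True
print(decompressor.unused_data)  # b'TRAILING'

# Cap the output of one call with max_length; unprocessed input lands in unconsumed_tail
limited = zlib.decompressobj()
head = limited.decompress(stream, 10)
rest = limited.decompress(limited.unconsumed_tail) + limited.flush()
print(len(head) <= 10, head + rest == restored)  # True True
```

Capping the output per call is one way to keep memory bounded when the compressed input comes from an untrusted source.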

**Note:**
- Zlib has an official documentation page - [Zlib](https://www.zlib.net/)
- Zlib Architecture manual - [Manual](https://www.zlib.net/manual.html)

In [11]:
import zlib

data = b'Hello world'

# Level 2 trades compression ratio for speed
compressed_data = zlib.compress(data, 2)

print(data)
print(compressed_data)

b'Hello world'
b'x^\xf3H\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x18\xab\x04='
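
The effect of `wbits` and the running-checksum behaviour of `crc32()` can be tried out with a short sketch like the one below. It uses `compressobj()` for the non-default `wbits` values so that it does not depend on which Python version added the `wbits` argument to `zlib.compress()`.

```python
import gzip
import zlib

data = b"Hello world"

# Default wbits=15: zlib header and trailer
zlib_stream = zlib.compress(data)

# wbits=-15: raw DEFLATE stream without header or trailer
raw = zlib.compressobj(wbits=-15)
raw_stream = raw.compress(data) + raw.flush()

# wbits=31 (16 + 15): gzip header and trailer
gz = zlib.compressobj(wbits=31)
gzip_stream = gz.compress(data) + gz.flush()

# Each stream must be decompressed with a matching wbits value
print(zlib.decompress(zlib_stream) == data)            # True
print(zlib.decompress(raw_stream, wbits=-15) == data)  # True
print(gzip.decompress(gzip_stream) == data)            # True
print(gzip_stream[:2])                                 # b'\x1f\x8b' (gzip magic number)

# Passing the previous result as `value` yields a running checksum
running = zlib.crc32(b"Hello ")
running = zlib.crc32(b"world", running)
print(running == zlib.crc32(data))                     # True
```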


## Gzip

The `gzip` module compresses and decompresses data in the gzip format.

It provides the `GzipFile` class, which reads and writes gzip-format files (`.gz`).

**Methods in Gzip**
- `gzip.open(filename, mode='rb', compresslevel=9, encoding=None, errors=None, newline=None)` open a gzip-compressed file in binary or text mode, returning a file object.
  - `filename` can be an actual filename or an existing file object to read from or write to.
  - `mode` can be any of 'r', 'rb', 'a', 'ab', 'w', 'wb', 'x' or 'xb' for binary mode, or 'rt', 'at', 'wt', or 'xt' for text mode (default is 'rb').
  - `compresslevel` is an integer from 0 to 9, as for the GzipFile constructor.

**Classes in Gzip**
- `class gzip.GzipFile(filename=None, mode=None, compresslevel=9, fileobj=None, mtime=None)` constructor for the GzipFile class. At least one of `fileobj` and `filename` must be given.
  - `fileobj` is an existing file object to read from or write to. When `fileobj` is not None, `filename` is only used for inclusion in the gzip file header, which may include the original filename of the uncompressed file. It defaults to the filename of fileobj, if discernible; otherwise, it defaults to the empty string, and in this case the original filename is not included in the header.
  - `mode` can be any of 'r', 'rb', 'a', 'ab', 'w', 'wb', 'x', or 'xb'. Unlike `gzip.open()`, `GzipFile` has no text mode. The default is the mode of fileobj if discernible; otherwise, it is 'rb'.
  - `compresslevel` is an integer from 0 to 9 controlling the level of compression; 1 is fastest and produces the least compression, and 9 is slowest and produces the most compression. 0 is no compression. The default is 9.
  - `mtime` is an optional numeric timestamp to be written to the last modification time field in the stream when compressing. It should only be provided in compression mode. If omitted or None, the current time is used.
  - `peek(n)` (method) reads n uncompressed bytes without advancing the file position. At most one single read on the compressed stream is done to satisfy the call. The number of bytes returned may be more or less than requested.
  - `name` (attribute) is the path to the gzip file on disk, as a str or bytes. Equivalent to the output of os.fspath() on the original input path, with no other normalization, resolution or expansion.

- `gzip.compress(data, compresslevel=9, *, mtime=None)` compresses the data, returning a bytes object containing the compressed data. `compresslevel` and `mtime` have the same meaning as in the GzipFile constructor above. When `mtime` is set to 0, this function is equivalent to `zlib.compress()` with `wbits` set to 31. The zlib function is faster.

- `gzip.decompress(data)` decompress the data, returning a bytes object containing the uncompressed data. This function is capable of decompressing multi-member gzip data (multiple gzip blocks concatenated together). When the data is certain to contain only one member the zlib.decompress() function with wbits set to 31 is faster.
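
A short sketch of the one-shot functions described above: fixing `mtime` to make the output reproducible, reading a single-member stream through `zlib`, and decompressing multi-member data. The sample byte strings are assumptions for illustration.

```python
import gzip
import zlib

data = b"Lots of content here"

# With a fixed mtime the header no longer depends on the current time
first = gzip.compress(data, mtime=0)
second = gzip.compress(data, mtime=0)
print(first == second)                           # True

# A single-member gzip stream can also be read by zlib with wbits=31
print(zlib.decompress(first, wbits=31) == data)  # True

# gzip.decompress() handles several members concatenated together
multi = gzip.compress(b"part one, ") + gzip.compress(b"part two")
print(gzip.decompress(multi))                    # b'part one, part two'
```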

In [None]:
# Reading compressed file
import gzip
with gzip.open('/home/joe/file.txt.gz', 'rb') as f:
    file_content = f.read()

In [None]:
# Creating compressed GZIP file
import gzip
content = b"Lots of content here"
with gzip.open('/home/joe/file.txt.gz', 'wb') as f:
    f.write(content)

In [None]:
# Compressing an existing file with GZIP
import gzip
import shutil
with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
# Gzip compress a binary string
import gzip
s_in = b"Lots of content here"
s_out = gzip.compress(s_in)
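
The examples above use fixed paths under `/home/joe`. The sketch below shows two variations that run anywhere: text mode through `gzip.open()` in a temporary directory, and `GzipFile` wrapping an in-memory buffer through `fileobj`. The file name `notes.txt.gz` is hypothetical.

```python
import gzip
import io
import os
import tempfile

# Text mode: gzip.open() encodes str on write and decodes it on read
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt.gz")  # hypothetical file name
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)  # ['first line', 'second line']

# GzipFile can wrap an existing file object instead of a path
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=0) as gz:
    gz.write(b"Lots of content here")

buffer.seek(0)
with gzip.GzipFile(fileobj=buffer, mode="rb") as gz:
    head = gz.peek(4)[:4]  # peek() may return more or fewer bytes than requested
    content = gz.read()    # peek() did not advance the position

print(content)  # b'Lots of content here'
```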

## BZ2

The `bz2` module compresses and decompresses data using the bzip2 algorithm.

**Classes in Bz2**
- `class bz2.BZ2File(filename, mode='r', *, compresslevel=9)` opens a bzip2-compressed file in binary mode.
  - `filename` can be a str or bytes path, or an existing file object to read from or write to
  - `mode` can be 'r' (default), 'w', 'x', or 'a', or equivalently 'rb', 'wb', 'xb', or 'ab'

- `class bz2.BZ2Compressor(compresslevel=9)` creates a new compressor object. This object may be used to compress data incrementally.
  - For one-shot compression, use the `compress()` function instead.
  - `compresslevel` is an integer between 1 and 9. The default is 9.

- `class bz2.BZ2Decompressor` creates a new decompressor object. This object may be used to decompress data incrementally.
  - For one-shot decompression, use the `decompress()` function instead.

**Methods in Bz2**
- `bz2.open(filename, mode='rb', compresslevel=9, encoding=None, errors=None, newline=None)` opens a bzip2-compressed file in binary or text mode, returning a file object
- `bz2.compress(data, compresslevel=9)` compresses `data`, a bytes-like object, and returns the compressed bytes (one-shot compression)
- `bz2.decompress(data)` decompresses `data`, a bytes-like object, and returns the uncompressed bytes (one-shot decompression)
- `bz2.BZ2File.peek([n])` returns buffered data without advancing the file position
- `bz2.BZ2Compressor.compress(data)` returns a chunk of compressed data if possible, otherwise an empty bytes object
- `bz2.BZ2Compressor.flush()` finishes the compression process and returns the compressed data left in internal buffers. The compressor object cannot be used after calling this method
- `bz2.BZ2Decompressor.decompress(data, max_length=-1)` decompresses the data and returns the uncompressed data as bytes
  - `max_length`, if non-negative, limits the call to returning at most `max_length` bytes of decompressed data

**One Shot compression and decompression**

In [12]:
import bz2

data = b"""Iron Man"""

c = bz2.compress(data)

In [13]:
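# Ratio = original size / compressed size. A value below 1 means the output grew,
# which is expected for very short inputs because of the bzip2 format overhead.
print('Data Compression Ratio:' + str(len(data) / len(c)))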

Data Compression Ratio:0.16666666666666666


In [14]:
d = bz2.decompress(c)

In [15]:
# Verify that the decompressed data matches the original data
data == d 

True
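
The ratio printed above is below 1 because an 8-byte input is smaller than the fixed overhead of the bzip2 format. The sketch below repeats the measurement on a longer, repetitive input (an assumption made for illustration) to show the ratio rising above 1.

```python
import bz2

samples = {
    "small": b"Iron Man",
    "large": b"Iron Man " * 1000,  # repetitive input compresses well
}

ratios = {}
for label, data in samples.items():
    compressed = bz2.compress(data)
    # Ratio = original size / compressed size
    ratios[label] = len(data) / len(compressed)
    print(f"{label}: {len(data)} -> {len(compressed)} bytes, ratio {ratios[label]:.2f}")
```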

## Lzma

The `lzma` module compresses and decompresses data using the LZMA algorithm.

It supports the `.xz` and legacy `.lzma` file formats.

**Classes in Lzma**
- `class lzma.LZMAFile(filename=None, mode='r', *, format=None, check=-1, preset=None, filters=None)` opens an LZMA-compressed file in binary mode.
    - An LZMAFile can wrap an already-open file object, or operate directly on a named file.
    - `filename` specifies either the file object to wrap, or the name of the file to open.
    - `mode` can be `r` (default), `w`, `x`, or `a`, or equivalently `rb`, `wb`, `xb`, or `ab`.
    - `format` specifies what container format should be used. Possible values are:
      - `FORMAT_XZ` the `.xz` container format. This is the default when compressing.
      - `FORMAT_ALONE` the legacy `.lzma` container format
      - `FORMAT_RAW` a raw data stream, not using any container format
    - `check` specifies the type of integrity check to include in the compressed data. Possible values are:
      - `CHECK_NONE` No integrity check. This is the default (and the only acceptable value) for `FORMAT_ALONE` and `FORMAT_RAW`.
      - `CHECK_CRC32` 32-bit Cyclic Redundancy Check.
      - `CHECK_CRC64` 64-bit Cyclic Redundancy Check. This is the default for `FORMAT_XZ`.
      - `CHECK_SHA256` 256-bit Secure Hash Algorithm.
    - `preset` is an integer from 0 to 9 that sets the compression level, optionally OR-ed with the constant `PRESET_EXTREME`. Higher presets produce smaller output but make compression slower.
    - `filters` is a filter chain specifier.

- `class lzma.LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)` creates a compressor object, which can be used to compress data incrementally
  - To compress data in a single chunk, use the `compress()` function instead
  - Arguments have the same meaning as specified above

- `class lzma.LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)` creates a decompressor object, which can be used to decompress data incrementally
  - To decompress data in a single chunk, use the `decompress()` function instead
  - Arguments have the same meaning as specified above. The default `FORMAT_AUTO` can decompress both `.xz` and `.lzma` data
  - `memlimit` specifies a limit in bytes on the amount of memory that the decompressor can use
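
The `format`, `check`, and `filters` arguments described above can be explored with a sketch like the following. The payload and the chosen preset are assumptions for illustration; the raw-format filter chain shown is just one valid choice.

```python
import lzma

data = b"Optimus Prime " * 50

# .xz container with an explicit integrity check and preset
xz_stream = lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32, preset=6)

# Legacy .lzma container (no integrity check available)
alone_stream = lzma.compress(data, format=lzma.FORMAT_ALONE)

# Raw stream: the same filter chain must be supplied on both sides
raw_filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]
raw_stream = lzma.compress(data, format=lzma.FORMAT_RAW, filters=raw_filters)

# FORMAT_AUTO (the decompression default) detects .xz and .lzma containers
print(lzma.decompress(xz_stream) == data)     # True
print(lzma.decompress(alone_stream) == data)  # True
print(lzma.decompress(raw_stream, format=lzma.FORMAT_RAW, filters=raw_filters) == data)  # True

# The decompressor reports which integrity check the stream carried
decompressor = lzma.LZMADecompressor()
decompressor.decompress(xz_stream)
print(decompressor.check == lzma.CHECK_CRC32)  # True
print(xz_stream[:6])                           # b'\xfd7zXZ\x00' (.xz magic bytes)
```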

**Methods in Lzma**
- `lzma.open(filename, mode='rb', *, format=None, check=-1, preset=None, filters=None, encoding=None, errors=None, newline=None)` opens an LZMA-compressed file in binary or text mode, returning a file object
- `lzma.compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)` compresses data, returning the compressed data as a bytes object
- `lzma.decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)` decompresses data, returning the uncompressed data as a bytes object
- `lzma.is_check_supported(check)` return True if the given integrity check is supported on this system
- `lzma.LZMAFile.peek(size=-1)` return buffered data without advancing the file position
- `lzma.LZMACompressor.compress(data)` compress data, returning a bytes object containing compressed data for at least part of the input. Some of data may be buffered internally for use in later calls of `compress()` and `flush()`
- `lzma.LZMACompressor.flush()` finish the compression process, returning a bytes object containing any data stored in compressor's internal buffers
- `lzma.LZMADecompressor.decompress(data, max_length=-1)` decompress data, returning uncompressed data as bytes. Some of data may be buffered internally, for use in later calls to `decompress()`
  - `max_length` if non-negative, returns at most `max_length` bytes of decompressed data

**Attributes in Lzma**
- `lzma.LZMADecompressor.check` the id of the integrity check used by the input stream
- `lzma.LZMADecompressor.eof` True if the end-of-stream marker has been reached
- `lzma.LZMADecompressor.unused_data` data found after the end of the compressed stream. Before the end of the stream is reached, this is `b""`
- `lzma.LZMADecompressor.needs_input` False if the `decompress()` method can provide more decompressed data before requiring new compressed input
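
These attributes are easiest to understand in a loop that feeds compressed input in small pieces. The sketch below is one possible way to do it; the 32-byte input chunks, the 1024-byte output cap, and the `EXTRA` trailer are assumptions made for illustration.

```python
import lzma

payload = b"Optimus Prime\n" * 500
stream = lzma.compress(payload) + b"EXTRA"  # bytes after the stream end

decompressor = lzma.LZMADecompressor()
restored = bytearray()
offset = 0
while not decompressor.eof:
    if decompressor.needs_input:
        # Hand over the next slice of compressed input
        chunk = stream[offset:offset + 32]
        offset += 32
        if not chunk:
            raise EOFError("input ended before the end-of-stream marker")
    else:
        # Output is still buffered: ask for more without new input
        chunk = b""
    restored += decompressor.decompress(chunk, max_length=1024)

# Bytes already fed but not part of the stream, plus bytes never fed
leftover = decompressor.unused_data + stream[offset:]

print(bytes(restored) == payload)  # True
print(leftover)                    # b'EXTRA'
```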

**Note:** For creating custom filter chains, refer to the `lzma` module documentation.

In [None]:
# Reading a compressed file at once

import lzma

with lzma.open("file.xz") as f:
    file_content = f.read()

In [None]:
# Compressing a file at once

import lzma
data = b"Optimus Prime"
with lzma.open("file.xz", "w") as f:
    f.write(data)

In [17]:
# Iterative compression

import lzma

lzc = lzma.LZMACompressor()

out1 = lzc.compress(b"I\n")
out2 = lzc.compress(b"am\n")
out3 = lzc.compress(b"Iron man\n")

out4 = lzc.flush()

final_output = b"".join([out1, out2, out3, out4])

print(final_output)

b'\xfd7zXZ\x00\x00\x04\xe6\xd6\xb4F\x02\x00!\x01\x16\x00\x00\x00t/\xe5\xa3\x01\x00\rI\nam\nIron man\n\x00\x00\x00\x9f\xc2}g\x97\xd3\xd6V\x00\x01&\x0e\x08\x1b\xe0\x04\x1f\xb6\xf3}\x01\x00\x00\x00\x00\x04YZ'
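
As a quick check, the pieces returned by the incremental compressor form a single valid `.xz` stream once concatenated. The sketch below repeats the compression with the same inputs and decompresses the result in one shot.

```python
import lzma

lzc = lzma.LZMACompressor()
parts = [lzc.compress(chunk) for chunk in (b"I\n", b"am\n", b"Iron man\n")]
parts.append(lzc.flush())  # flush() ends the stream; the compressor is finished
final_output = b"".join(parts)

restored = lzma.decompress(final_output)
print(restored)  # b'I\nam\nIron man\n'
```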


In [None]:
# Writing compressed data to an already-open file

import lzma

with open("file.xz", "wb") as f:
    f.write(b"This data will not be compressed\n")
    with lzma.open(f, "w") as lzf:
        lzf.write(b"This *will* be compressed\n")
    f.write(b"Not compressed\n")
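
Reading such a mixed file back takes a little care, because only the middle part is an `.xz` stream. The sketch below is one way to do it, assuming the layout written above (one plain line, the compressed stream, then plain bytes); it writes to a temporary directory so that it does not touch existing files.

```python
import lzma
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.xz")

    # Same layout as above: plain line, compressed stream, plain line
    with open(path, "wb") as f:
        f.write(b"This data will not be compressed\n")
        with lzma.open(f, "w") as lzf:
            lzf.write(b"This *will* be compressed\n")
        f.write(b"Not compressed\n")

    with open(path, "rb") as f:
        header = f.readline()  # plain bytes before the .xz stream
        decompressor = lzma.LZMADecompressor()
        body = decompressor.decompress(f.read())
        trailer = decompressor.unused_data  # bytes after the stream end

print(header)   # b'This data will not be compressed\n'
print(body)     # b'This *will* be compressed\n'
print(trailer)  # b'Not compressed\n'
```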

In [None]:
# Creating a compressed file using a custom filter chain

import lzma

my_filters = [
    {"id": lzma.FILTER_DELTA, "dist": 5},
    {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
]

with lzma.open("file.xz", "w", filters=my_filters) as f:
    f.write(b"blah blah blah")
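
The same filter chain can be tried in memory with the one-shot functions. Because the `.xz` container records the filter chain it was written with, decompression needs no `filters` argument. This is a small sketch under that assumption, reusing the chain from the example above.

```python
import lzma

my_filters = [
    {"id": lzma.FILTER_DELTA, "dist": 5},
    {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
]

data = b"blah blah blah"
packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=my_filters)

# The filter chain is read from the .xz block header
restored = lzma.decompress(packed)
print(restored == data)  # True
```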