##### Reference: Classic Computer Science Problems in Python, Chapter 1

Compression is the act of taking data and encoding it in such a way that it takes up less space. Decompression is reversing the process, returning the data to its original form. 

Why is all data not compressed? There is a tradeoff between time and space: compressing data and decompressing it again both take time. Data compression only makes sense in situations where small size is prioritized over fast execution.

The easiest way to compress data is to realize that its storage type uses more bits than are strictly required for its contents. For instance, thinking low-level, if an unsigned integer that will never exceed 65535 is stored as a 64-bit unsigned integer in memory, it is being stored inefficiently. It could instead be stored as a 16-bit unsigned integer. This would reduce the space consumption by 75% (16 bits instead of 64 bits). If there are millions of such numbers being stored inefficiently, it can add up to megabytes of wasted space.
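The 75% figure follows directly from the ratio of the two widths. A minimal sketch using the standard `struct` module, where the `<` prefix requests standard, platform-independent sizes (the helper name `space_saved` is only illustrative):

```python
import struct

def space_saved(wide_format: str, narrow_format: str) -> float:
    """Fraction of space saved by switching from wide_format to narrow_format."""
    wide = struct.calcsize(wide_format)
    narrow = struct.calcsize(narrow_format)
    return 1 - narrow / wide

# "<Q" is a 64-bit unsigned integer, "<H" is a 16-bit unsigned integer
print(struct.calcsize("<Q"), struct.calcsize("<H"))  # 8 2
print(space_saved("<Q", "<H"))                       # 0.75

# Bytes wasted by one million values stored 6 bytes wider than needed
print(1_000_000 * (struct.calcsize("<Q") - struct.calcsize("<H")))  # 6000000
```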

In Python, there is no 64-bit unsigned integer type, and there is no 16-bit unsigned integer type. There is just a single int type that can store numbers of arbitrary precision. The function sys.getsizeof() can help you find how many bytes of memory your Python objects are consuming. However, due to the inherent overhead of the Python object system, there is no way to create an int that takes up less than 28 bytes (224 bits). The value held by an int can grow one bit at a time, but the object consumes a minimum of 28 bytes.
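To see this overhead on your own interpreter, something like the following can be run. The exact byte counts are implementation details that vary by Python version and platform, so none are assumed here; the helper name `int_size` is only illustrative.

```python
import sys

def int_size(value: int) -> int:
    """Bytes consumed by the int object holding `value`, as reported by sys.getsizeof()."""
    return sys.getsizeof(value)

# Small ints already carry object overhead; larger values need more storage.
for value in (1, 2**30, 2**60, 2**1000):
    print(f"{value.bit_length():>5} bits -> {int_size(value)} bytes")
```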

If the number of possible different values that a type is meant to represent is less than the number of values that the bits being used to store it can represent, it can likely be more efficiently stored. Consider the nucleotides that form a gene in DNA. Each nucleotide can only be one of four values: A, C, G, or T. Yet, if the gene is stored as a str, which can be thought of as a collection of Unicode characters, each nucleotide will be represented by a character, which generally requires 8 bits of storage. In binary, just 2 bits are needed to store a type with four possible values: 00, 01, 10, and 11 are the four different values that can be represented by 2 bits. If A is assigned 00, C is assigned 01, G is assigned 10, and T is assigned 11, then the storage required for a string of nucleotides can be reduced by 75% (8 bits to 2 bits per nucleotide).
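The arithmetic can be checked with a short sketch. The helper `bits_needed` and the dictionary name `NUCLEOTIDE_CODES` are illustrative, and the 8 bits per character is the figure assumed in the text above.

```python
def bits_needed(distinct_values: int) -> int:
    """Smallest number of bits that can tell `distinct_values` values apart."""
    if distinct_values < 1:
        raise ValueError("need at least one value")
    return max(1, (distinct_values - 1).bit_length())

# The assignment described in the text
NUCLEOTIDE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

bits = bits_needed(len(NUCLEOTIDE_CODES))
print(bits)          # 2
print(1 - bits / 8)  # 0.75, the saving relative to 8 bits per character
```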

Instead of storing our nucleotides as a str, they can be stored as a bit string. A bit string is exactly what it sounds like: an arbitrary-length sequence of 1s and 0s. Unfortunately, the Python standard library contains no off-the-shelf construct for working with bit strings of arbitrary length. The following code converts a str composed of As, Cs, Gs, and Ts into a string of bits and back again. The string of bits is stored within an int. Since the int type in Python can be of any length, it can be used as a bit string of any length. To convert back into a str, we will implement the Python __str__() special method.
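As a quick illustration of an int acting as a bit string of any length, the sketch below sets a few bits well beyond the 64-bit range (the bit positions are arbitrary choices for the example):

```python
bit_string = 0
for position in (0, 64, 199):
    bit_string |= 1 << position  # set one bit at the given position

print(bit_string.bit_length())     # 200, the highest set bit is at position 199
print(bin(bit_string).count("1"))  # 3 bits are set in total
```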

The __init__() method initializes the bit-string construct with the appropriate data. It calls _compress() to do the dirty work of actually converting the provided str of nucleotides into a bit string. Note that the name _compress() starts with an underscore. Python has no concept of truly private methods or variables (all of them can be accessed through reflection, and there is no strict enforcement of privacy). A leading underscore is used as a convention to indicate that the implementation of a method should not be relied on by actors outside of the class (it is subject to change and should be treated as private).
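A tiny sketch of the convention, using a hypothetical class `Example` with a hypothetical helper `_prepare()`: the underscore signals intent, but nothing stops outside code from calling the method.

```python
class Example:
    def __init__(self) -> None:
        self._prepare()  # internal helper, meant to be called only from inside the class

    def _prepare(self) -> None:
        self.ready = True

example = Example()
example._prepare()    # still works: the underscore is a convention, not enforcement
print(example.ready)  # True
```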

The _compress() method looks at each character in the str of nucleotides sequentially. For each one, it first shifts the bit string left by two bits, which adds two 0s to the right-hand side. It then uses an "or" (|) operation to set those two bits: 00 for A, 01 for C, 10 for G, and 11 for T. In bitwise operations, "ORing" 0s with any other value results in the other value replacing the 0s. In other words, we continually add two new bits to the right-hand side of the bit string, and the two bits that are added are determined by the type of the nucleotide.
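The shift-then-or step can be traced on a short input. This is a standalone sketch rather than the class itself: the function name `encode` and the lookup table `CODES` are illustrative, and the starting value of 1 mirrors the sentinel bit used in `_compress()`.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # mapping from the text

def encode(gene: str) -> int:
    bit_string = 1  # sentinel bit, as in the CompressedGene code
    for nucleotide in gene.upper():
        bit_string <<= 2                 # make room: two 0s appear on the right
        bit_string |= CODES[nucleotide]  # fill those two 0s with the nucleotide's code
        print(nucleotide, bin(bit_string))
    return bit_string

encode("TAG")
# T 0b111
# A 0b11100
# G 0b1110010
```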

decompress() reads 2 bits from the bit string at a time. It uses those two bits to determine which character to add to the end of the str representation of the gene. Since the bits are being read in the opposite order from that in which they were compressed (right to left instead of left to right), the str representation is ultimately reversed (using the slicing notation for reversal, [::-1]). Finally, note how the convenient int method bit_length() aided in the development of decompress().
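The reading order can be sketched as a standalone function. The names `decode` and `LETTERS` are illustrative. The second call also suggests why a leading sentinel bit is useful: without it, a gene made only of As would encode to 0, and bit_length() could not tell how many nucleotides were stored.

```python
LETTERS = "ACGT"  # index 0b00..0b11 maps to A, C, G, T

def decode(bit_string: int) -> str:
    gene = ""
    # bit_length() - 1 skips the leading sentinel bit
    for shift in range(0, bit_string.bit_length() - 1, 2):
        gene += LETTERS[(bit_string >> shift) & 0b11]  # rightmost pair is read first
    return gene[::-1]  # undo the right-to-left reading order

print(decode(0b1110010))  # TAG
print(decode(0b10000))    # AA
```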

In [6]:
class CompressedGene:
    def __init__(self, gene: str) -> None:
        self._compress(gene)

    def _compress(self, gene: str) -> None:
        self.bit_string: int = 1  # start with sentinel
        for nucleotide in gene.upper():
            self.bit_string <<= 2  # shift left two bits
            if nucleotide == "A": # change last two bits to 00
                self.bit_string |= 0b00
            elif nucleotide == "C": # change last two bits to 01
                self.bit_string |= 0b01
            elif nucleotide == "G": # change last two bits to 10
                self.bit_string |= 0b10
            elif nucleotide == "T": # change last two bits to 11
                self.bit_string |= 0b11
            else:
                raise ValueError("Invalid Nucleotide:{}".format(nucleotide))
    
    def decompress(self) -> str:
        gene: str = ""
        for i in range(0, self.bit_string.bit_length() - 1, 2):  # - 1 to exclude sentinel
            bits: int = (self.bit_string >> i) & 0b11  # get just 2 relevant bits
            if bits == 0b00: # A
                gene += "A"
            elif bits == 0b01: # C
                gene += "C"
            elif bits == 0b10: # G
                gene += "G"
            elif bits == 0b11: # T
                gene += "T"
            else:
                raise ValueError("Invalid bits:{}".format(bits))
        return gene[::-1] # [::-1] reverses string by slicing backwards
    
    def __str__(self) -> str: # string representation for pretty printing
        return self.decompress()    

In [7]:
#testing it all out
if __name__ == "__main__":
    from sys import getsizeof
    original: str = "TAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATA" * 100
    print("original is {} bytes".format(getsizeof(original)))
    compressed: CompressedGene = CompressedGene(original) # compress
    print("compressed is {} bytes".format(getsizeof(compressed.bit_string)))
    print(compressed) # decompress
    print("original and decompressed are the same: {}".format(original == compressed.decompress()))

original is 8649 bytes
compressed is 2320 bytes
TAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGA
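
The two sizes reported in the output can be turned into a ratio with a few lines of arithmetic. The byte counts are copied from the run shown here and will differ on other Python versions; the gap between the raw payload and the reported 2320 bytes is presumably overhead in how the interpreter stores an int.

```python
original_bytes = 8649    # reported in the output for the str
compressed_bytes = 2320  # reported in the output for the int bit string

ratio = compressed_bytes / original_bytes
print(f"compressed size is {ratio:.1%} of the original")  # 26.8%
print(f"space saved: {1 - ratio:.1%}")                     # 73.2%

# 86 nucleotides repeated 100 times at 2 bits each, plus the sentinel bit
payload_bits = 86 * 100 * 2 + 1
print(payload_bits, "bits =", payload_bits / 8, "bytes")   # 17201 bits = 2150.125 bytes
```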