### 6.3 (a) 
_Create a long random string using a Python program, and use a lossless compression algorithm of your choice to compress the string. Note the compression ratio_

In [1]:
import random
import zlib
import gzip
import bz2 # bzip
import sys
import string

In [2]:
# create_rand_str returns a random string of n ASCII letters (a-z, A-Z)
def create_rand_str(n):
    return ''.join(random.choices(string.ascii_letters, k=n))

# Generate a random string with n = 1,000,000,000 chars.
# Note: random.choices builds a list of n items first, so this needs many GB of RAM.
rand_long_str = create_rand_str(1000000000)

# Encode to bytes, since the compression functions operate on bytes
long_rand_variable = rand_long_str.encode()

In [3]:
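# long_rand_variable must be bytes.
# sys.getsizeof reports the in-memory size of the bytes object, so the result
# includes a small fixed object overhead on top of the payload bytes.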
encoded_rand_long_str_size = sys.getsizeof(long_rand_variable)
print(f'Uncompressed Size: {encoded_rand_long_str_size} bytes')

Uncompressed Size: 1000000033 bytes
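
The 33 extra bytes in the output above come from how `sys.getsizeof` measures a `bytes` object. A minimal sketch of the difference between `len` and `sys.getsizeof`, assuming CPython (the exact overhead is an implementation detail, and `payload` is just an illustrative name):

```python
import sys

# A small stand-in for the encoded random string
payload = b"x" * 1000

# len() counts only the payload bytes
payload_bytes = len(payload)

# sys.getsizeof() reports the size of the Python object in memory,
# which adds a fixed header on top of the payload (CPython detail)
object_bytes = sys.getsizeof(payload)
overhead = object_bytes - payload_bytes

print(f"payload: {payload_bytes} bytes")
print(f"object:  {object_bytes} bytes")
print(f"overhead: {overhead} bytes")
```

For inputs as large as the one used here the overhead is negligible, so the ratios reported below are unaffected to three decimal places. For short strings, `len` is the safer measure.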


#### Testing with zlib

In [4]:
encoded_compressed_str = zlib.compress(long_rand_variable)
encoded_compressed_str_size = sys.getsizeof(encoded_compressed_str)
print(f'Compressed Size: {encoded_compressed_str_size} bytes')
compression_ratio = encoded_rand_long_str_size/encoded_compressed_str_size

print(f'Compression ratio zlib {compression_ratio:.3f} bits/bit')


Compressed Size: 728308808 bytes
Compression ratio zlib 1.373 bits/bit


#### Testing with gzip

In [5]:
encoded_compressed_str = gzip.compress(long_rand_variable)
encoded_compressed_str_size = sys.getsizeof(encoded_compressed_str)
print(f'Compressed Size: {encoded_compressed_str_size} bytes')

compression_ratio = encoded_rand_long_str_size/encoded_compressed_str_size

print(f'Compression ratio gzip {compression_ratio:.3f} bits/bit')

Compressed Size: 728308820 bytes
Compression ratio gzip 1.373 bits/bit


#### Testing with bz2 (bzip2)

In [6]:
encoded_compressed_str = bz2.compress(long_rand_variable)
encoded_compressed_str_size = sys.getsizeof(encoded_compressed_str)
print(f'Compressed Size: {encoded_compressed_str_size} bytes')

compression_ratio = encoded_rand_long_str_size/encoded_compressed_str_size

print(f'Compression ratio bzip {compression_ratio:.3f} bits/bit')

Compressed Size: 721162869 bytes
Compression ratio bzip 1.387 bits/bit
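
The three measurements above can be reproduced at a much smaller scale in a single loop. This is a sketch, not the notebook's original cells: it assumes n = 200,000 instead of 1,000,000,000, fixes the random seed, and uses `len` rather than `sys.getsizeof` so that only payload bytes are counted. The names `compressors` and `ratios` are introduced here for illustration.

```python
import bz2
import gzip
import random
import string
import zlib

random.seed(0)  # fixed seed so reruns are comparable (not in the original cells)

n = 200_000  # much smaller than the notebook's 1,000,000,000, to keep it quick
data = "".join(random.choices(string.ascii_letters, k=n)).encode()

# Map a label to each compression function under test
compressors = {
    "zlib": zlib.compress,
    "gzip": gzip.compress,
    "bz2": bz2.compress,
}

ratios = {}
for name, compress in compressors.items():
    compressed = compress(data)
    # Ratio of uncompressed payload size to compressed payload size
    ratios[name] = len(data) / len(compressed)
    print(f"{name}: {len(compressed)} bytes, ratio {ratios[name]:.3f} bits/bit")
```

At this size the ratios should land slightly below the ones reported above, since fixed overhead is a larger fraction of the output.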


#### (a) Conclusion:
Despite generating a random source (a string of random characters), we are still able to compress the data with a compression ratio **G ≈ 1.38 bits/bit**, regardless of the compression algorithm used (zlib, gzip, bz2).

A ratio of 1.38 bits/bit on a random source signifies that we could pretty much do an 80/20 training/test split and achieve high accuracy, despite not learning any underlying function (no underlying function exists for a random source).

In short, just because we can compress it doesn't mean we can learn it.

### 6.3 (b)
_What is the expected compression ratio in (a)? Explain why?_

#### (b) Answer:
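The **expected compression ratio in (a) is about 1.40 bits/bit**, not 2. Each character is drawn uniformly from the 52 symbols in `string.ascii_letters`, so it carries log2(52) ≈ 5.70 bits of information, but after `.encode()` it is stored in 8 bits. No lossless compressor can go below the entropy of the source on average, so the best achievable ratio is 8 / log2(52) ≈ 1.40. The measured ratios of 1.373 (zlib, gzip) and 1.387 (bz2) sit just under this bound. The gain comes entirely from the unused part of each 8-bit byte; the sequence itself is random and has no further structure to exploit.

The 2:1 figure from 6.1, where we experimentally verified Definition 5.1, says that as n -> infinity we can almost guarantee to memorize 2 points per parameter (MacKay 2003). That is a statement about model capacity, not a limit for a lossless compressor on this source.

In the 6.3 (a) experiment, as we increase the length of the random string (n), the compression ratio increases from around 1.1 to 1.3, presumably because fixed overhead such as headers is spread over more data.
As n -> infinity, the compression ratio approaches the entropy bound of about 1.40 rather than 2.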
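
As a sanity check on the expected ratio, the Shannon entropy bound for the source in (a) can be computed directly and compared with measured ratios at several string lengths. This is a sketch under stated assumptions: characters are i.i.d. uniform over the 52 symbols in `string.ascii_letters`, each is stored as one 8-bit byte, and only zlib is measured. The helpers `entropy_bound_ratio` and `measured_ratio` are hypothetical names introduced here.

```python
import math
import random
import string
import zlib

ALPHABET = string.ascii_letters   # 52 symbols: a-z and A-Z
BITS_STORED_PER_CHAR = 8          # one byte per ASCII character after .encode()


def entropy_bound_ratio(alphabet_size, bits_stored=BITS_STORED_PER_CHAR):
    """Best average ratio for a uniform i.i.d. source over alphabet_size symbols."""
    return bits_stored / math.log2(alphabet_size)


def measured_ratio(n, seed=0):
    """zlib compression ratio for a random string of n letters (payload bytes only)."""
    rng = random.Random(seed)
    data = "".join(rng.choices(ALPHABET, k=n)).encode()
    return len(data) / len(zlib.compress(data))


bound = entropy_bound_ratio(len(ALPHABET))
print(f"entropy bound: {bound:.3f} bits/bit")

# Measure the ratio at increasing lengths to see how it behaves relative to the bound
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n = {n:>9}: ratio {measured_ratio(n):.3f} bits/bit")
```

Under these assumptions the bound evaluates to 8 / log2(52), so the measured ratios for long strings are expected to stay below it however large n becomes.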

