# CamZIP

## Tree data structures

Import all functions from the package `trees`, which collects a number of tools for handling trees in the 3F7 lab.

In [1]:
from trees import *

Now define a simple tree. Play around with this command and construct more complicated trees.

In [2]:
t = [-1,0,1,1,0]
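
The list `t` can be read as a parent array: entry `k` holds the index of the parent of node `k`, and `-1` marks the root. This reading is consistent with the labelled `tree2newick` output in this notebook. As a minimal sketch, the hypothetical helper `children_of` (not part of the `trees` package) recovers the children of every node:

```python
def children_of(parents):
    """Return a list whose k-th entry lists the children of node k."""
    children = [[] for _ in parents]
    for node, parent in enumerate(parents):
        if parent != -1:  # -1 marks the root, which has no parent
            children[parent].append(node)
    return children

t = [-1, 0, 1, 1, 0]
print(children_of(t))  # [[1, 4], [2, 3], [], [], []]
```

Nodes with an empty list of children are the leaves, here nodes 2, 3 and 4.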

The following command prints a Newick-format string that can be copied and pasted into a tree-visualising website such as [phylo.io](https://phylo.io). Don't forget to add a new line at the end of the string after pasting.

In [3]:
print(tree2newick(t))
print('Cut and paste the string on the previous line and add a "new line" at the end of the string.')

((1000,1001)1004,1002)1003
Cut and paste the string on the previous line and add a "new line" at the end of the string.


You can also add labels to the nodes in the `tree2newick` command.

In [4]:
print(tree2newick(t,['root', 'child 0', 'grandchild 0', 'grandchild 1', 'child 1']))

((grandchild 0,grandchild 1)child 0,child 1)root


If there are fewer labels than nodes, the labels will be interpreted "leaves first".

In [5]:
print(tree2newick(t,['symbol 0','symbol 1', 'symbol 2']))

((symbol 0,symbol 1)1004,symbol 2)1003


The following command converts a variable-length code described by a tree to a code table format.

In [6]:
print(tree2code(t))

{'1000': [0, 0], '1001': [0, 1], '1002': [1]}
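
One way to picture the conversion: for every leaf, walk up to the root and record, at each step, the position of the current node among its siblings. The sketch below is an illustration only. The helper name `codewords_from_parents` is hypothetical, it keys the table by leaf index rather than by the `'1000'`-style names used in the lab, and it is not claimed to be how `tree2code` is implemented.

```python
def codewords_from_parents(parents):
    """Map each leaf index to its codeword (list of child positions)."""
    children = [[] for _ in parents]
    for node, parent in enumerate(parents):
        if parent != -1:
            children[parent].append(node)

    code = {}
    for leaf, kids in enumerate(children):
        if kids:
            continue  # internal nodes carry no codeword
        bits, node = [], leaf
        while parents[node] != -1:
            parent = parents[node]
            # position of this node among its siblings gives the next bit
            bits.append(children[parent].index(node))
            node = parent
        code[leaf] = bits[::-1]  # bits were collected leaf-to-root
    return code

print(codewords_from_parents([-1, 0, 1, 1, 0]))  # {2: [0, 0], 3: [0, 1], 4: [1]}
```

The codewords agree with the table printed above: the three leaves receive `[0, 0]`, `[0, 1]` and `[1]`.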


Verify that the inverse function can recover the tree. 

In [7]:
print(code2tree(tree2code(t)))

[-1, 0, 1, 1, 0]


But the following may happen as well. Can you explain why?

In [8]:
print(code2tree(tree2code([3,3,4,4,-1])))

[-1, 0, 1, 1, 0]


In [9]:
print(tree2newick([3,3,4,4,-1], ['grandchild 0', 'grandchild 1', 'child 0', 'child 1', 'root']))

(child 0,(grandchild 0,grandchild 1)child 1)root


Similar, but far more problematic, is the following inversion. The resulting assignment of codewords to symbols is fundamentally different from the original and would result in wrong decoding.

In [10]:
print(tree2code(code2tree({'0':[1], '1':[0,1], '2':[0,0,1], '3':[0,0,0]})))

{'1000': [0], '1001': [1, 0], '1002': [1, 1, 0], '1003': [1, 1, 1]}


These problems are all solved when using the extended tree format, in which each node is stored as a list `[parent, children, label]`.

In [11]:
xt = tree2xtree([3,3,4,4,-1], ['a', 'b', 'c'])
print(xtree2newick(xt))
print(xt)
print(xt[4][2])

(c,(a,b)1003)1004
[[3, [], 'a'], [3, [], 'b'], [4, [], 'c'], [4, [0, 1], '1003'], [-1, [2, 3], '1004']]
1004


In [12]:
print(xtree2code(code2xtree({'0':[1], '1':[0,1], '2':[0,0,1], '3':[0,0,0]})))


{'0': [1], '1': [0, 1], '2': [0, 0, 1], '3': [0, 0, 0]}


## Testing your Shannon-Fano Code

This next section can only be completed once you have a working Shannon-Fano function `shannon_fano()`.

In [13]:
from vl_codes import shannon_fano
from random import random
p = [random() for k in range(16)]
p = dict([(chr(k+ord('a')),p[k]/sum(p)) for k in range(len(p))])
print(f'Probability distribution: {p}\n')
c = shannon_fano(p)
print(f'Codebook: {c}\n')
xt = code2xtree(c)
print(f'Cut and paste for phylo.io: {xtree2newick(xt)}')

Probability distribution: {'a': 0.04239428986168972, 'b': 0.1413928108090173, 'c': 0.11793634351222965, 'd': 0.15987119465592425, 'e': 0.12919836117452208, 'f': 0.12464793958171358, 'g': 0.08256411296684028, 'h': 0.140297926049496, 'i': 0.011608654352364587, 'j': 0.05008836703620239}

Codebook: {'a': [0, 0, 1, 1, 1], 'b': [1, 1, 0], 'c': [0, 1, 0], 'd': [1, 1, 1], 'e': [1, 0, 0], 'f': [0, 1, 1], 'g': [0, 0, 0], 'h': [1, 0, 1], 'i': [0, 0, 1, 1, 0], 'j': [0, 0, 1, 0]}

Cut and paste for phylo.io: (((g,(j,(i,a)10)11)12,(c,f)13)16,((e,h)14,(b,d)15)17)18
Probability distribution: {'a': 0.08810244334017629, 'b': 0.0009933154376389943, 'c': 0.08752882334448035, 'd': 0.06536610798757081, 'e': 0.062008622431748414, 'f': 0.06176557752946635, 'g': 0.06443848446426129, 'h': 0.09771859966702921, 'i': 0.05097241066272622, 'j': 0.057015220181369584, 'k': 0.0548424117038805, 'l': 0.02933880129563677, 'm': 0.0950378454617218, 'n': 0.059405460381843356, 'o': 0.06356543370676078, 'p': 0.0619004424036894
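
Two quick sanity checks on a codebook are that no codeword is a prefix of another and that the Kraft sum does not exceed 1. The sketch below applies both to the ten-symbol codebook printed above. The helper names `is_prefix_free` and `kraft_sum` are hypothetical and not part of `vl_codes`.

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of (or equal to) another one."""
    words = sorted(tuple(w) for w in code.values())
    # after lexicographic sorting, a prefix is always followed directly
    # by a word that extends it, so comparing neighbours is enough
    return all(a != b[:len(a)] for a, b in zip(words, words[1:]))

def kraft_sum(code):
    """Sum of 2^(-length) over all codewords."""
    return sum(2.0 ** -len(w) for w in code.values())

c = {'a': [0, 0, 1, 1, 1], 'b': [1, 1, 0], 'c': [0, 1, 0], 'd': [1, 1, 1],
     'e': [1, 0, 0], 'f': [0, 1, 1], 'g': [0, 0, 0], 'h': [1, 0, 1],
     'i': [0, 0, 1, 1, 0], 'j': [0, 0, 1, 0]}
print(is_prefix_free(c), kraft_sum(c))  # True 1.0
```

A Kraft sum of exactly 1 means the code tree has no unused branches.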

We can load data from a file, for example `hamlet.txt`, and display the first few lines:

In [14]:
with open('hamlet.txt', 'r') as f:
    hamlet = f.read()
print(hamlet[:294])

        HAMLET


        DRAMATIS PERSONAE


CLAUDIUS        king of Denmark. (KING CLAUDIUS:)

HAMLET  son to the late, and nephew to the present king.

POLONIUS        lord chamberlain. (LORD POLONIUS:)

HORATIO friend to Hamlet.

LAERTES son to Polonius.

LUCIANUS        nephew to the king.


We now compute the statistics of the file:

In [15]:
from itertools import groupby
frequencies = dict([(key, len(list(group))) for key, group in groupby(sorted(hamlet))])
Nin = sum([frequencies[a] for a in frequencies])
print(Nin)
p = dict([(a,frequencies[a]/Nin) for a in frequencies])
print(f'File length: {Nin}')

207039
File length: 207039
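
The same symbol statistics can also be gathered with `collections.Counter` from the standard library. This is a minimal sketch on a short sample string rather than on `hamlet.txt`, and the helper name `symbol_probabilities` is hypothetical.

```python
from collections import Counter

def symbol_probabilities(text):
    """Return {symbol: relative frequency}, with symbols in sorted order."""
    counts = Counter(text)
    total = sum(counts.values())
    # sorting the keys mirrors the ordering produced by groupby(sorted(...))
    return {s: counts[s] / total for s in sorted(counts)}

sample = symbol_probabilities('abracadabra')
print(list(sample))  # ['a', 'b', 'c', 'd', 'r']
```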


We can view the alphabet of symbols used in the file:

In [16]:
print(list(p))
print(len(p))

['\n', ' ', '!', '&', "'", '(', ')', ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|']
67


We are now ready to construct the Shannon-Fano code for this file and view its tree. Cut and paste the output into [phylo.io](https://phylo.io) (don't forget to add a carriage return at the end), click on "Branch Labels/Support" under "Settings", then right-click on the root of the tree and select "expand all".

In [46]:
c = shannon_fano(p)
print(len(c))  # number of codewords
print(c)

xt = code2xtree(c)
# Count the leaves: nodes whose list of children is empty
count = sum(1 for node in xt if len(node[1]) == 0)
print(len(c['&']))  # codeword length of one symbol
print(count)
print(xt)
print(len(xt))
print(xtree2newick(xt))

67
{' ': [0, 0], 'e': [0, 1, 0, 0], 't': [0, 1, 0, 1, 0], 'o': [0, 1, 1, 0, 0], 'a': [0, 1, 1, 1, 0], 's': [0, 1, 1, 1, 1], 'n': [1, 0, 0, 0, 0], 'h': [1, 0, 0, 1, 0], 'i': [1, 0, 0, 1, 1], 'r': [1, 0, 1, 0, 0], '\n': [1, 0, 1, 0, 1, 1], 'l': [1, 0, 1, 1, 0, 0], 'd': [1, 0, 1, 1, 1, 0], 'u': [1, 1, 0, 0, 0, 0], 'm': [1, 1, 0, 0, 0, 1], ',': [1, 1, 0, 0, 1, 0], 'y': [1, 1, 0, 0, 1, 1, 1], 'w': [1, 1, 0, 1, 0, 0, 1], 'f': [1, 1, 0, 1, 0, 1, 0], 'c': [1, 1, 0, 1, 1, 0, 0], 'g': [1, 1, 0, 1, 1, 0, 1], 'p': [1, 1, 0, 1, 1, 1, 0], 'A': [1, 1, 1, 0, 0, 0, 0, 0], 'b': [1, 1, 1, 0, 0, 0, 1, 0], 'T': [1, 1, 1, 0, 0, 0, 1, 1], 'I': [1, 1, 1, 0, 0, 1, 0, 1], 'E': [1, 1, 1, 0, 0, 1, 1, 1], '.': [1, 1, 1, 0, 1, 0, 0, 1], 'L': [1, 1, 1, 0, 1, 0, 1, 1], 'v': [1, 1, 1, 0, 1, 1, 0, 0], 'k': [1, 1, 1, 0, 1, 1, 1, 0], 'O': [1, 1, 1, 0, 1, 1, 1, 1], "'": [1, 1, 1, 1, 0, 0, 0, 0], 'H': [1, 1, 1, 1, 0, 0, 1, 0], 'R': [1, 1, 1, 1, 0, 0, 1, 1], 'N': [1, 1, 1, 1, 0, 1, 0, 0], 'S': [1, 1, 1, 1, 0, 1, 0, 1, 0], '

Now we can actually encode the file `hamlet.txt` using the Shannon-Fano code we constructed.

In [18]:
from vl_codes import vl_encode
hamlet_sf = vl_encode(hamlet, c)
print(f'Length of binary sequence: {len(hamlet_sf)}')

Length of binary sequence: 997548
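
The implementation of `vl_encode` is not shown in this notebook. A minimal sketch, assuming it simply concatenates the codewords of successive symbols, could look as follows (the name `encode_with_table` is hypothetical):

```python
def encode_with_table(symbols, code):
    """Concatenate the codewords of all symbols into one bit list."""
    bits = []
    for s in symbols:
        bits.extend(code[s])  # raises KeyError for a symbol without a codeword
    return bits

toy_code = {'a': [1, 0], 'b': [1, 1], 'c': [0]}
print(encode_with_table('acb', toy_code))  # [1, 0, 0, 1, 1]
```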


We have commands to convert a bit sequence into a byte sequence (including a 3-bit prefix that helps us determine the length of the bit sequence):

In [19]:
from vl_codes import bytes2bits, bits2bytes
x = bits2bytes([0,1])
print([format(a, '08b') for a in x])
y = bytes2bits(x)
print(f'The original bits are: {y}')
print(bits2bytes([0,1,1,0,1,1,0,0,0]))

['01101000']
The original bits are: [0, 1]
[141, 128]
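
The two examples above are consistent with a format in which the 3-bit prefix stores the number of zero bits appended to reach a byte boundary. The sketch below reproduces both printed results under that assumption. It is inferred from the outputs only, the names `pack_bits` and `unpack_bits` are hypothetical, and the actual `bits2bytes` in `vl_codes` may be written differently.

```python
def pack_bits(bits):
    """Prefix the padding count (3 bits), pad with zeros, group into bytes."""
    pad = -(len(bits) + 3) % 8  # zeros needed to fill the last byte
    prefix = [(pad >> 2) & 1, (pad >> 1) & 1, pad & 1]
    stream = prefix + list(bits) + [0] * pad
    return [int(''.join(map(str, stream[k:k + 8])), 2)
            for k in range(0, len(stream), 8)]

def unpack_bits(data):
    """Invert pack_bits: read the padding count, then strip prefix and padding."""
    stream = [int(b) for byte in data for b in format(byte, '08b')]
    pad = 4 * stream[0] + 2 * stream[1] + stream[2]
    return stream[3:len(stream) - pad]

print(pack_bits([0, 1]))                       # [104], i.e. 01101000
print(pack_bits([0, 1, 1, 0, 1, 1, 0, 0, 0]))  # [141, 128]
```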


We now apply the bits-to-bytes conversion to the compressed text of Hamlet to compute the length of the compressed file.

In [20]:
hamlet_zipped = bits2bytes(hamlet_sf)
Nout = len(hamlet_zipped)
print(f'Length of compressed string: {Nout}')

Length of compressed string: 124694


The compression ratio can be expressed in two ways, as a unitless ratio or in bits per byte:

In [21]:
print(f'Compression ratio (unitless): {Nout/Nin}')
print(f'Compression ratio (bits per byte): {8.0*Nout/Nin}')

Compression ratio (unitless): 0.6022730017049928
Compression ratio (bits per byte): 4.818184013639942


The lower bound for compression is the entropy, measured in bits per symbol, which can be computed using a lambda function in Python:

In [22]:
from math import log2
H = lambda pr: -sum([pr[a]*log2(pr[a]) for a in pr])
print(f'Entropy: {H(p)}')

Entropy: 4.449863631694343
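
Note that `log2(0)` raises a `ValueError`, so the lambda above fails on a distribution that contains a zero probability. A small variant that skips such symbols (using the convention that 0·log 0 = 0) is sketched here. The function name `entropy` is an illustrative choice.

```python
from math import log2

def entropy(pr):
    """Entropy in bits per symbol; zero-probability symbols contribute nothing."""
    return -sum(q * log2(q) for q in pr.values() if q > 0)

print(entropy({'a': .5, 'b': .25, 'c': .25, 'd': 0}))  # 1.5
```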


We now proceed to decode the compressed Hamlet sequence:

In [23]:
from vl_codes import vl_decode
xt = code2xtree(c)
hamlet_unzipped = vl_decode(hamlet_sf, xt)
print(f'Length of unzipped file: {len(hamlet_unzipped)}')

Length of unzipped file: 207039
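
Decoding with an extended tree amounts to starting at the root, following one child per bit, and emitting a label whenever a leaf is reached. The sketch below assumes the `[parent, children, label]` node layout seen in the `tree2xtree` output earlier. The name `decode_with_xtree` is hypothetical and this is not claimed to be the implementation of `vl_decode`.

```python
def decode_with_xtree(bits, xtree):
    """Walk the tree bit by bit, emitting a label at every leaf."""
    root = next(k for k, node in enumerate(xtree) if node[0] == -1)
    out, node = [], root
    for bit in bits:
        node = xtree[node][1][bit]  # move to the child selected by this bit
        if not xtree[node][1]:      # no children: a leaf has been reached
            out.append(xtree[node][2])
            node = root
    return out                      # trailing bits of an incomplete codeword are dropped

toy_xt = [[3, [], 'a'], [3, [], 'b'], [4, [], 'c'],
          [4, [0, 1], '1003'], [-1, [2, 3], '1004']]
print(decode_with_xtree([1, 0, 0, 1, 1], toy_xt))  # ['a', 'c', 'b']
```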


We can view the first few lines of the decoded text (note the `join` command, which turns the list of strings into one string):

In [24]:
print(''.join(hamlet_unzipped[:294]))

        HAMLET


        DRAMATIS PERSONAE


CLAUDIUS        king of Denmark. (KING CLAUDIUS:)

HAMLET  son to the late, and nephew to the present king.

POLONIUS        lord chamberlain. (LORD POLONIUS:)

HORATIO friend to Hamlet.

LAERTES son to Polonius.

LUCIANUS        nephew to the king.


## Compressing and uncompressing files

This is where we put it all together, compressing directly from input to output file. Play around with these commands once you have implemented Huffman coding and arithmetic coding. We begin by importing the compression and decompression functions.

In [25]:
from camzip import camzip
from camunzip import camunzip



The next commands define the method to be used and the filename. Modify those when you are trying other methods on various files. 

In [26]:
method = 'shannon_fano'
filename = 'hamlet.txt'

Now we do the actual compression and decompression...

In [27]:
camzip(method, filename)
camunzip(filename + '.cz' + method[0])

The next few lines perform various statistical measurements and verify that the decompressed file is identical to the original file.

In [28]:
from filecmp import cmp
from os import stat
from json import load
Nin = stat(filename).st_size
print(f'Length of original file: {Nin} bytes')
Nout = stat(filename + '.cz' + method[0]).st_size
print(f'Length of compressed file: {Nout} bytes')
print(f'Compression rate: {8.0*Nout/Nin} bits/byte')
with open(filename + '.czp', 'r') as fp:
    freq = load(fp)
pf = dict([(a, freq[a]/Nin) for a in freq])
print(f'Entropy: {H(pf)} bits per symbol')
if cmp(filename,filename+'.cuz'):
    print('The two files are the same')
else:
    print('The files are different')

Length of original file: 207039 bytes
Length of compressed file: 124694 bytes
Compression rate: 4.818184013639942 bits/byte
Entropy: 4.449863631694343 bits per symbol
The two files are the same


## Huffman coding

This section will only work once you have a working function `huffman()`. We first repeat the tree construction and visualisation.

In [29]:
from vl_codes import huffman

xt = huffman(p)
print(xt)
print(xtree2newick(xt))

[[115, [], '\n'], [131, [], ' '], [80, [], '!'], [67, [], '&'], [93, [], "'"], [68, [], '('], [69, [], ')'], [109, [], ','], [82, [], '-'], [97, [], '.'], [86, [], ':'], [87, [], ';'], [83, [], '?'], [100, [], 'A'], [78, [], 'B'], [85, [], 'C'], [86, [], 'D'], [98, [], 'E'], [78, [], 'F'], [84, [], 'G'], [92, [], 'H'], [99, [], 'I'], [67, [], 'J'], [75, [], 'K'], [96, [], 'L'], [88, [], 'M'], [90, [], 'N'], [94, [], 'O'], [81, [], 'P'], [73, [], 'Q'], [91, [], 'R'], [89, [], 'S'], [99, [], 'T'], [89, [], 'U'], [70, [], 'V'], [83, [], 'W'], [74, [], 'Y'], [72, [], 'Z'], [79, [], '['], [79, [], ']'], [121, [], 'a'], [100, [], 'b'], [105, [], 'c'], [113, [], 'd'], [125, [], 'e'], [106, [], 'f'], [103, [], 'g'], [119, [], 'h'], [118, [], 'i'], [74, [], 'j'], [95, [], 'k'], [114, [], 'l'], [111, [], 'm'], [119, [], 'n'], [122, [], 'o'], [101, [], 'p'], [75, [], 'q'], [117, [], 'r'], [120, [], 's'], [123, [], 't'], [112, [], 'u'], [96, [], 'v'], [107, [], 'w'], [76, [], 'x'], [108, [], 'y'],

In [30]:
xtree2code(huffman({'a':.5, 'b':.25, 'c':.25, 'd':0}))

{'a': [0], 'b': [1, 1, 1], 'c': [1, 0], 'd': [1, 1, 0]}
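
For cross-checking codeword lengths, the standard Huffman merging procedure can be sketched with a priority queue from `heapq`. This illustration returns only the length of each codeword, not an extended tree, and the name `huffman_lengths` is hypothetical. It is not the lab's `huffman()` function, and a different tie-breaking rule may swap lengths between equally likely symbols.

```python
import heapq
from itertools import count

def huffman_lengths(p):
    """Return {symbol: codeword length} by repeatedly merging the two least likely groups."""
    tiebreak = count()  # unique counter so that symbol lists are never compared
    heap = [(prob, next(tiebreak), [sym]) for sym, prob in p.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in p}
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for sym in group1 + group2:
            lengths[sym] += 1  # every merge adds one bit to all symbols involved
        heapq.heappush(heap, (p1 + p2, next(tiebreak), group1 + group2))
    return lengths

print(huffman_lengths({'a': .5, 'b': .25, 'c': .25, 'd': 0}))
# {'a': 1, 'b': 3, 'c': 2, 'd': 3}
```

The lengths match those of the codewords printed above for the same distribution.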

Observe how the Huffman tree differs from the Shannon-Fano tree. What are its shortest and longest codewords? You can use the `camzip` code above, changing the method to `'huffman'`, to test the compression rate etc. You may also want to do it by hand to test the error resilience:

In [31]:
c = xtree2code(xt)
hamlet_huf = vl_encode(hamlet, c)
hamlet_decoded = vl_decode(hamlet_huf, xt)
print(''.join(hamlet_decoded[:294]))

        HAMLET


        DRAMATIS PERSONAE


CLAUDIUS        king of Denmark. (KING CLAUDIUS:)

HAMLET  son to the late, and nephew to the present king.

POLONIUS        lord chamberlain. (LORD POLONIUS:)

HORATIO friend to Hamlet.

LAERTES son to Polonius.

LUCIANUS        nephew to the king.


We now introduce a single bit flip (bit 400) in the compressed sequence and observe the result.

In [32]:
hamlet_corrupted = hamlet_huf.copy()
hamlet_corrupted[400] ^= 1
hamlet_decoded = vl_decode(hamlet_corrupted, xt)
print(''.join(hamlet_decoded[:297]))

        HAMLET


        DRAMATIS PERSONAE


CLAUDIUS        king of Denmark. y,; 
aNG CLAUDIUS:)

HAMLET  son to the late, and nephew to the present king.

POLONIUS        lord chamberlain. (LORD POLONIUS:)

HORATIO friend to Hamlet.

LAERTES son to Polonius.

LUCIANUS        nephew to the king.


## Arithmetic coding

We first try "by hand" to operate the steps of arithmetic coding using floating point numbers. We first compute the cumulative probability distribution.

In [33]:
f = [0.0]
for a in p:
    f.append(f[-1]+p[a])
f.pop()
f = dict([(a,f[k]) for a,k in zip(p,range(len(p)))])
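
The same cumulative distribution can be built with `itertools.accumulate`. In this minimal sketch on a toy distribution, each symbol is mapped to the total probability of the symbols that precede it. The helper name `cumulative` is hypothetical.

```python
from itertools import accumulate

def cumulative(p):
    """Map each symbol to the sum of the probabilities of all earlier symbols."""
    totals = [0.0] + list(accumulate(p.values()))
    # zip stops at the last symbol, so the final total (about 1.0) is dropped
    return dict(zip(p, totals))

print(cumulative({'a': .5, 'b': .25, 'c': .25}))  # {'a': 0.0, 'b': 0.5, 'c': 0.75}
```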

We now perform by hand the first `n=10` steps of arithmetic coding. Vary `n` to observe the loss of precision.

In [34]:
lo, hi = 0.0, 1.0
n = 10
for k in range(n):
    a = hamlet[k]
    lohi_range = hi - lo
    hi = lo + lohi_range * (f[a] + p[a])
    lo = lo + lohi_range * f[a]
print(f'lo = {lo}, hi = {hi}, hi-lo = {hi-lo}')

lo = 0.040179412020354126, hi = 0.04017941312428125, hi-lo = 1.1039271233248549e-09
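
To see what the interval looks like without rounding, the same update can be carried out with exact rational arithmetic from the standard `fractions` module. This is a sketch on a toy three-symbol source, not on Hamlet, and the helper name `narrow_interval` is hypothetical. The final width equals the product of the probabilities of the encoded symbols.

```python
from fractions import Fraction

def narrow_interval(message, p):
    """Apply the arithmetic-coding interval update exactly, one symbol at a time."""
    f, running = {}, Fraction(0)
    for a in p:  # cumulative distribution: probability mass before each symbol
        f[a] = running
        running += p[a]
    lo, hi = Fraction(0), Fraction(1)
    for a in message:
        width = hi - lo
        hi = lo + width * (f[a] + p[a])
        lo = lo + width * f[a]
    return lo, hi

toy_p = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 4)}
lo_exact, hi_exact = narrow_interval('abc', toy_p)
print(lo_exact, hi_exact, hi_exact - lo_exact)  # 11/32 3/8 1/32
```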


The output sequence is roughly the binary expression of `lo` (not exactly) and we can compute and observe it. What length `ell` would we need when encoding all of Hamlet?

In [35]:
from math import floor, ceil
ell = ceil(-log2(hi-lo))+2 if hi-lo > 0.0 else 96
print(bin(floor(lo*2**ell)))

0b1010010010010011001010101100


We encode and decode Hamlet again using arithmetic coding and verify that the first few lines of the play look as expected.

In [39]:
import arithmetic as arith
arith_encoded = arith.encode(hamlet, p)
arith_decoded = arith.decode(arith_encoded, p, Nin)
print('\n'+''.join(arith_decoded[:1000]))

{'\n': 0, ' ': 0.029197397591758073, '!': 0.3023391728128517, '&': 0.3037930051825985, "'": 0.30381715522196306, '(': 0.30848294282719685, ')': 0.30856022295316343, ',': 0.30863750307913, '-': 0.3244895889180299, '.': 0.32652785224039915, ':': 0.3328165224909317, ';': 0.33548268683677956, '?': 0.33829375141881485, 'A': 0.3403609947884215, 'B': 0.34809383739295496, 'C': 0.34926269929819986, 'D': 0.3516052531165626, 'E': 0.35411685721047725, 'F': 0.3612266287994049, 'G': 0.3623471906259207, 'H': 0.36447722409787525, 'I': 0.3690850516086341, 'J': 0.37649911369355527, 'K': 0.37654258376441146, 'L': 0.37718497481150887, 'M': 0.38293751418814803, 'N': 0.3860238892189393, 'O': 0.39009558585580484, 'P': 0.39529750433493194, 'Q': 0.3967754867440433, 'R': 0.39724399750771583, 'S': 0.40159100459333735, 'T': 0.4054067108129385, 'U': 0.4129318630789367, 'V': 0.4161534783301696, 'W': 0.41634667864508607, 'Y': 0.4184718821091677, 'Z': 0.41899352295944225, '[': 0.41933162351054615, ']': 0.420514975439

We now repeat the steps above but introduce a one-bit mistake (bit 323 flipped) and observe the effect on the decoded text. Repeat this experiment varying the location of the mistake or adding more than one mistake. What do you observe? Can you explain why?

In [50]:
arith_corrupted = arith_encoded.copy()
arith_corrupted[323] ^= 1
arith_decoded = arith.decode(arith_corrupted, p, Nin)
print('\n'+''.join(arith_decoded[:294]))


        HAMLET


        DRAMATIS PERSONAE


CLAUDIUS        kisea 
rethre yO o A, nraesnslEf c
li
M,
eDsfnttma iri o csexllinH  n sOeah,u t-  tsreUs hthc. ;ab Iuity gs dntkie  dl,  id bo' daeG n m  h e sse   a
xooose Eoe
o nneLe hets  ]siseqsee
  eiodNc ,hmt o   h
a ;ldnmhk tka  e  ,  e
egi, 


In [38]:
xtree2newick(tree2xtree([-1,0,0,1,1,3,3,4,2]),[str(chr(a+ord('0'))) for a in range(9)])

'(((5,6)3,(7)4)1,(8)2)0'