# About Huffman Trees and Codes
## Divide Pair Conquer
### Due: Monday, 1 March 2021, 11:59 pm

## Goal

Review Huffman Trees and Codes from DM1 to get ready for your Ponder and Prove assignment.

In [11]:
from math import ceil, log
from collections import Counter

def show_results(message, code_tuples):
  total_characters = len(message)
  total_unique_characters = len(code_tuples)
  total_bits = 0
  for char, count, code in code_tuples:
    total_bits += count * len(code)
  average_bits_per_character = total_bits / total_characters
  fixed_bits_per_character = ceil(log(total_unique_characters, 2))
  total_fixed_bits = total_characters * fixed_bits_per_character
  compression_ratio = (total_fixed_bits - total_bits) / total_fixed_bits
  print(f'          Total Characters: {total_characters}')
  print(f'   Total Unique Characters: {total_unique_characters}')
  print(f'                Total Bits: {total_bits}')
  print(f'Average Bits per Character: {average_bits_per_character:.2f}')
  print(f'  Fixed Bits per Character: {fixed_bits_per_character}')
  print(f'          Total Fixed Bits: {total_fixed_bits}')
  print(f'         Compression Ratio: {compression_ratio:.3f}')

message1 = 'thebookofmormon'
counter1 = Counter(message1)

print(message1, '-->', counter1)

message2 = 'therestoration'

counter2 = Counter(message2)

print(message2, '-->', counter2)

thebookofmormon --> Counter({'o': 5, 'm': 2, 't': 1, 'h': 1, 'e': 1, 'b': 1, 'k': 1, 'f': 1, 'r': 1, 'n': 1})
therestoration --> Counter({'t': 3, 'e': 2, 'r': 2, 'o': 2, 'h': 1, 's': 1, 'a': 1, 'i': 1, 'n': 1})


### Which message has the lower compression ratio?
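
For reference, the quantity that `show_results` prints as the compression ratio is the fraction of bits saved relative to a fixed-length code. Written out (the symbols $n$, $u$, $B_{\text{fixed}}$ and $B_{\text{Huffman}}$ are notation introduced here for illustration, not taken from the book):

$$
B_{\text{fixed}} = n \left\lceil \log_2 u \right\rceil,
\qquad
B_{\text{Huffman}} = \sum_{c} \text{count}(c) \cdot \text{len}(\text{code}(c)),
\qquad
\text{compression ratio} = \frac{B_{\text{fixed}} - B_{\text{Huffman}}}{B_{\text{fixed}}}
$$

Here $n$ is the total number of characters in the message and $u$ is the number of unique characters.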

#### Message 1

Do all the steps, like the examples in the book, first sorting the counted occurrences:

| Char | # |
|------|---|
|   b  | 1 |
|   e  | 1 |
|   f  | 1 |
|   h  | 1 |
|   k  | 1 |
|   n  | 1 |
|   r  | 1 |
|   t  | 1 |
|   m  | 2 |
|   o  | 5 |
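
The sorted table above can be reproduced in a couple of lines. This is only a sketch: it assumes ties are broken alphabetically, which is what the table appears to do, and the name `sorted_counts` is made up for illustration.

```python
from collections import Counter

message1 = 'thebookofmormon'

# Sort by count first, then alphabetically, so ties come out in a fixed order.
sorted_counts = sorted(Counter(message1).items(),
                       key=lambda pair: (pair[1], pair[0]))

for char, count in sorted_counts:
    print(char, count)
```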

##### The ever-shrinking queue:

* b1 e1 f1 h1 k1 n1 r1 t1 m2 o5
* f1 h1 k1 n1 r1 t1 m2 be2 o5
* k1 n1 r1 t1 m2 be2 fh2 o5
* r1 t1 m2 be2 fh2 kn2 o5
* m2 be2 fh2 kn2 rt2 o5
* fh2 kn2 rt2 mbe4 o5
* rt2 mbe4 fhkn4 o5
* fhkn4 o5 rtmbe6
* rtmbe6 fhkno9
* rtmbefhkno15
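
The queue steps above follow a simple rule that can be sketched in code: remove the first two entries, join their labels and add their weights, then put the merged entry back after every entry whose weight is less than or equal to its own. The function name `shrinking_queue` and the tie-breaking rule are assumptions inferred from the hand-worked steps, not something the book prescribes.

```python
from collections import Counter

def shrinking_queue(message):
    """Return the list of queue states, one string per merge step."""
    # Start with (label, weight) pairs sorted by weight, then alphabetically.
    queue = sorted(Counter(message).items(),
                   key=lambda pair: (pair[1], pair[0]))
    steps = [' '.join(f'{label}{weight}' for label, weight in queue)]
    while len(queue) > 1:
        (label1, weight1), (label2, weight2) = queue[0], queue[1]
        merged = (label1 + label2, weight1 + weight2)
        rest = queue[2:]
        # Insert after all entries with weight <= the merged weight.
        position = 0
        while position < len(rest) and rest[position][1] <= merged[1]:
            position += 1
        queue = rest[:position] + [merged] + rest[position:]
        steps.append(' '.join(f'{label}{weight}' for label, weight in queue))
    return steps

for step in shrinking_queue('thebookofmormon'):
    print('*', step)
```

Calling `shrinking_queue('therestoration')` walks through the second message in the same way.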

##### The Huffman Tree:

In [12]:
'''
       rtmbefhkno15
         /        \
     rtmbe6      fhkno9
     /   \        /    \
  rt2   mbe4   fhkn4   o5
  /\    / \     /   \
r1 t1 m2 be2  fh2   kn2
         / \  / \   / \
       b1 e1 f1 h1 k1 n1
'''


##### The Code Tuples

Read the codes from the tree:

In [13]:
message1_code_tuples = \
[('b', 1, '0110'),
 ('e', 1, '0111'),
 ('f', 1, '1000'),
 ('h', 1, '1001'),
 ('k', 1, '1010'),
 ('m', 2, '010'),
 ('n', 1, '1011'),
 ('o', 5, '11'),
 ('r', 1, '000'),
 ('t', 1, '001'),
]

show_results(message1, message1_code_tuples)

          Total Characters: 15
   Total Unique Characters: 10
                Total Bits: 46
Average Bits per Character: 3.07
  Fixed Bits per Character: 4
          Total Fixed Bits: 60
         Compression Ratio: 0.233
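
Reading the codes off the tree can also be done mechanically. The sketch below assumes a left branch contributes a `0` and a right branch a `1`, which is consistent with the code tuples above. The nested-tuple encoding `tree1` and the helper `read_codes` are hypothetical names used only for this illustration.

```python
# The Message 1 tree as nested (left, right) tuples; leaves are characters.
tree1 = ((('r', 't'), ('m', ('b', 'e'))),
         ((('f', 'h'), ('k', 'n')), 'o'))

def read_codes(node, prefix=''):
    """Walk the tree, appending '0' for left and '1' for right."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    codes = read_codes(left, prefix + '0')
    codes.update(read_codes(right, prefix + '1'))
    return codes

codes1 = read_codes(tree1)
for char in sorted(codes1):
    print(char, codes1[char])
```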


#### Message 2

Do all the steps, like the examples in the book, first sorting the counted occurrences:

| Char | # |
|------|---|
|   a  | 1 |
|   h  | 1 |
|   i  | 1 |
|   n  | 1 |
|   s  | 1 |
|   e  | 2 |
|   o  | 2 |
|   r  | 2 |
|   t  | 3 |

##### The ever-shrinking queue:

* a1 h1 i1 n1 s1 e2 o2 r2 t3
* i1 n1 s1 e2 o2 r2 ah2 t3
* s1 e2 o2 r2 ah2 in2 t3
* o2 r2 ah2 in2 t3 se3
* ah2 in2 t3 se3 or4
* t3 se3 or4 ahin4
* or4 ahin4 tse6
* tse6 orahin8
* tseorahin14

##### The Huffman Tree:

In [14]:
'''
    tseorahin14
    /        \
 tse6     orahin8
  / \      /    \
t3 se3   or4   ahin4
   / \   / \    /   \
  s1 e2 o2 r2 ah2   in2
              / \   / \
             a1 h1 i1 n1
'''


##### The Code Tuples

Read the codes from the tree:

In [15]:
message2_code_tuples = \
[('a', 1, '1100'),
 ('e', 2, '011'),
 ('h', 1, '1101'),
 ('i', 1, '1110'),
 ('n', 1, '1111'),
 ('o', 2, '100'),
 ('r', 2, '101'),
 ('s', 1, '010'),
 ('t', 3, '00'),
]

show_results(message2, message2_code_tuples)

          Total Characters: 14
   Total Unique Characters: 9
                Total Bits: 43
Average Bits per Character: 3.07
  Fixed Bits per Character: 4
          Total Fixed Bits: 56
         Compression Ratio: 0.232
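
A quick way to sanity-check a set of code tuples is to encode the message and decode it again. The sketch below does this for Message 2; the `decode` helper is a hypothetical name, and it relies on the codes being prefix-free, which any code read from the leaves of a binary tree is.

```python
message2 = 'therestoration'
message2_code_tuples = [
    ('a', 1, '1100'), ('e', 2, '011'), ('h', 1, '1101'),
    ('i', 1, '1110'), ('n', 1, '1111'), ('o', 2, '100'),
    ('r', 2, '101'), ('s', 1, '010'), ('t', 3, '00'),
]

code_for = {char: code for char, _, code in message2_code_tuples}
char_for = {code: char for char, code in code_for.items()}

# Encode by concatenating the code of each character.
encoded = ''.join(code_for[char] for char in message2)

def decode(bits, char_for):
    """Greedy decoding: emit a character as soon as the bits match a code."""
    decoded, current = [], ''
    for bit in bits:
        current += bit
        if current in char_for:
            decoded.append(char_for[current])
            current = ''
    return ''.join(decoded)

print(len(encoded), decode(encoded, char_for))
```

The length of `encoded` should agree with the Total Bits line printed by `show_results`.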


### TODO Create Data Tree and Code

More warmup for your Ponder and Prove assignment this week:

Create a Huffman Tree and codes for the gaps between the first few primes (except for the gap of size 1 between 2 and 3). Your goal is to find how many is "few" enough to have a compression ratio **better than 24%**.
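
As a starting point, the gaps themselves can be counted without any third-party library. This is a minimal sketch that uses a sieve of Eratosthenes; the helper names `primes_below` and `prime_gaps` are made up for illustration, and `limit` is assumed to be at least 2.

```python
from collections import Counter

def primes_below(limit):
    """Sieve of Eratosthenes: all primes strictly less than limit."""
    sieve = [True] * limit
    sieve[0:2] = [False, False]
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            for multiple in range(n * n, limit, n):
                sieve[multiple] = False
    return [n for n, is_prime in enumerate(sieve) if is_prime]

def prime_gaps(limit):
    """Gaps between consecutive odd primes below limit (skips the 2 -> 3 gap)."""
    odd_primes = primes_below(limit)[1:]  # drop 2
    return [q - p for p, q in zip(odd_primes, odd_primes[1:])]

gaps = prime_gaps(100)
print(len(gaps), Counter(gaps))
```

The resulting counts play the same role for the gaps that `Counter(message1)` plays for the characters of a message.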


In [16]:
# Placeholder; reassigned from the gap frequencies in the next cell.
probabilities = []

class HuffmanCode:
    def __init__(self,probability):
        self.probability = probability

    def position(self, value, index):
        for j in range(len(self.probability)):
            if(value >= self.probability[j]):
                return j
        return index-1

    def characteristics_huffman_code(self, code):
        length_of_code = [len(k) for k in code]

        mean_length = sum([a*b for a, b in zip(length_of_code, self.probability)])

        print("Average length of the code: %f" % mean_length)

    def compute_code(self):
        num = len(self.probability)
        huffman_code = ['']*num

        for i in range(num-2):
            val = self.probability[num-i-1] + self.probability[num-i-2]
            if(huffman_code[num-i-1] != '' and huffman_code[num-i-2] != ''):
                huffman_code[-1] = ['1' + symbol for symbol in huffman_code[-1]]
                huffman_code[-2] = ['0' + symbol for symbol in huffman_code[-2]]
            elif(huffman_code[num-i-1] != ''):
                huffman_code[num-i-2] = '0'
                huffman_code[-1] = ['1' + symbol for symbol in huffman_code[-1]]
            elif(huffman_code[num-i-2] != ''):
                huffman_code[num-i-1] = '1'
                huffman_code[-2] = ['0' + symbol for symbol in huffman_code[-2]]
            else:
                huffman_code[num-i-1] = '1'
                huffman_code[num-i-2] = '0'

            position = self.position(val, i)
            probability = self.probability[0:(len(self.probability) - 2)]
            probability.insert(position, val)
            if(isinstance(huffman_code[num-i-2], list) and isinstance(huffman_code[num-i-1], list)):
                complete_code = huffman_code[num-i-1] + huffman_code[num-i-2]
            elif(isinstance(huffman_code[num-i-2], list)):
                complete_code = huffman_code[num-i-2] + [huffman_code[num-i-1]]
            elif(isinstance(huffman_code[num-i-1], list)):
                complete_code = huffman_code[num-i-1] + [huffman_code[num-i-2]]
            else:
                complete_code = [huffman_code[num-i-2], huffman_code[num-i-1]]

            huffman_code = huffman_code[0:(len(huffman_code)-2)]
            huffman_code.insert(position, complete_code)

        huffman_code[0] = ['0' + symbol for symbol in huffman_code[0]]
        huffman_code[1] = ['1' + symbol for symbol in huffman_code[1]]

        if(len(huffman_code[1]) == 0):
            huffman_code[1] = '1'

        count = 0
        final_code = ['']*num

        for i in range(2):
            for j in range(len(huffman_code[i])):
                final_code[count] = huffman_code[i][j]
                count += 1

        final_code = sorted(final_code, key=len)
        return final_code

In [22]:
from sympy import primerange
import collections

for i in range(100, 999):
    # Rebuild the gap list for each upper bound i. Starting from prev = 3
    # skips the gap of size 1 between 2 and 3.
    prime_list = list(primerange(4, i))
    list_of_gaps = []
    prev = 3
    for prime in prime_list:
        gap = prime - prev
        prev = prime
        list_of_gaps.append(gap)

    string = list_of_gaps
    freq = {}
    for c in string:
        if c in freq:
            freq[c] += 1
        else:
            freq[c] = 1

    freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)
    length = len(string)

    probabilities = [float("{:.2f}".format(frequency[1]/length)) for frequency in freq]
    probabilities = sorted(probabilities, reverse=True)

    huffmanClassObject = HuffmanCode(probabilities)
    P = probabilities

    huffman_code = huffmanClassObject.compute_code()

    tree = []
    #print(' Char | Huffman code ')
    #print('----------------------')

    for index, char in enumerate(freq):
        if huffman_code[index] == '':
            # print(' %-4r |%12s' % (char[0], 1))
            # Fall back to a one-bit code; it must be a string so that
            # show_results can take its length.
            tree.append((char[0], char[1], '1'))
            continue
        #print(' %-4r |%12s' % (char[0], huffman_code[index]))
        tree.append((char[0], char[1], huffman_code[index]))

    huffmanClassObject.characteristics_huffman_code(huffman_code)
    print(tree)
    # The "message" being encoded is the list of gaps, not the primes.
    show_results(list_of_gaps, tree)

 Char | Huffman code 
----------------------
 2    |           1
 4    |          01
 6    |         000
 8    |         001
Average length of the code: 1.970000
[(2, 8, '1'), (4, 7, '01'), (6, 7, '000'), (8, 1, '001')]
          Total Characters: 23
   Total Unique Characters: 4
                Total Bits: 46
Average Bits per Character: 2.00
  Fixed Bits per Character: 2
          Total Fixed Bits: 46
         Compression Ratio: 0.000
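
The totals above can be cross-checked with a heap-based Huffman construction. This is a different technique from the `HuffmanCode` class in the cell above: it uses the standard-library `heapq` module and only tracks code lengths, not the codes themselves. The function name `huffman_code_lengths` is hypothetical, and the gap counts are copied from the printed output.

```python
import heapq
from math import ceil, log2

def huffman_code_lengths(counts):
    """Return {symbol: code length} for a dict of symbol counts."""
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}
    # Heap entries: (weight, tie_breaker, {symbol: depth so far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(weight, n, {symbol: 0})
            for n, (symbol, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        weight1, _, depths1 = heapq.heappop(heap)
        weight2, _, depths2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        merged = {symbol: depth + 1
                  for symbol, depth in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (weight1 + weight2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]

gap_counts = {2: 8, 4: 7, 6: 7, 8: 1}
lengths = huffman_code_lengths(gap_counts)
total_bits = sum(gap_counts[gap] * lengths[gap] for gap in gap_counts)
fixed_bits = sum(gap_counts.values()) * ceil(log2(len(gap_counts)))
print(total_bits, fixed_bits, (fixed_bits - total_bits) / fixed_bits)
```

Individual code lengths may differ from the table above when weights tie, but the total number of bits is the same for any optimal Huffman code.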


## Answer
From the 4th prime all the way until the 908th prime.