## 16.3 Huffman codes

### 16.3-1

> Explain why, in the proof of Lemma 16.2, if $x.freq = b.freq$, then we must have $a.freq = b.freq = x.freq = y.freq$.

### 16.3-2

> Prove that a binary tree that is not full cannot correspond to an optimal prefix code.

### 16.3-3

> What is an optimal Huffman code for the following set of frequencies, based on the first 8 Fibonacci numbers?

> a:1 b:1 c:2 d:3 e:5 f:8 g:13 h:21 

> Can you generalize your answer to find the optimal code when the frequencies are the first $n$ Fibonacci numbers?

* a: 1111111
* b: 1111110
* c: 111110
* d: 11110
* e: 1110
* f: 110
* g: 10
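* h: 0

In general, for the first $n$ Fibonacci numbers, the character with the largest frequency gets the codeword $0$, the $k$-th largest gets $1^{k-1}0$ for $k < n$, and the smallest gets $1^{n-1}$. This works because $\sum_{i=1}^{k} F_i = F_{k+2} - 1 < F_{k+2}$, so the subtree built so far is always one of the two lowest-frequency nodes.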
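The codeword lengths for Fibonacci frequencies can be checked numerically. The following is a minimal sketch, not part of the original solution: `huffman_code_lengths` and `fibonacci` are hypothetical helper names, and the sketch tracks only codeword lengths (depths), not the codewords themselves.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a dict of symbol -> frequency."""
    tie = count()  # tie-breaker so the heap never compares the dict payloads
    heap = [(f, next(tie), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def fibonacci(n):
    """First n Fibonacci numbers, starting 1, 1, 2, ..."""
    fib = [1, 1]
    while len(fib) < n:
        fib.append(fib[-1] + fib[-2])
    return fib[:n]


freqs = dict(zip("abcdefgh", fibonacci(8)))
lengths = huffman_code_lengths(freqs)
print(lengths)
```

For $n = 8$ the lengths come out as 7, 7, 6, 5, 4, 3, 2, 1 for a through h, matching the codewords listed above. The same function can be run for larger $n$ to check the general pattern.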

### 16.3-4

> Prove that we can also express the total cost of a tree for a code as the sum, over all internal nodes, of the combined frequencies of the two children of the node.

### 16.3-5

> Prove that if we order the characters in an alphabet so that their frequencies are monotonically decreasing, then there exists an optimal code whose codeword lengths are monotonically increasing.

### 16.3-6

> Suppose we have an optimal prefix code on a set $C = \{0, 1, \dots, n - 1\}$ of characters and we wish to transmit this code using as few bits as possible. Show how to represent any optimal prefix code on $C$ using only $2n - 1 + n \lceil \lg n \rceil$ bits.

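The tree of an optimal prefix code is a full binary tree, so it has $n$ leaves and $n - 1$ internal nodes, $2n - 1$ nodes in total. Traverse the tree in preorder and write one bit per node that says whether the node is internal or a leaf; this describes the shape of the tree in $2n - 1$ bits. Then list the characters in the order in which their leaves appear in the traversal, using $\lceil \lg n \rceil$ bits for each character, which takes $n \lceil \lg n \rceil$ more bits.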
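One way to realize this encoding is sketched below. The representation is an assumption made for illustration: a leaf is an `int` in `0..n-1`, an internal node is a `(left, right)` tuple, the bit `0` marks an internal node and `1` marks a leaf, and $n \ge 2$. The names `encode_tree` and `decode_tree` are hypothetical.

```python
def encode_tree(tree, n):
    """Encode a full binary tree with leaves 0..n-1 as a bit string."""
    width = (n - 1).bit_length()  # equals ceil(lg n) for n >= 2
    shape, labels = [], []

    def visit(node):
        if isinstance(node, tuple):
            shape.append("0")  # internal node
            visit(node[0])
            visit(node[1])
        else:
            shape.append("1")  # leaf
            labels.append(format(node, f"0{width}b"))

    visit(tree)
    return "".join(shape) + "".join(labels)


def decode_tree(bits, n):
    """Rebuild the tree from the bit string produced by encode_tree."""
    width = (n - 1).bit_length()
    start = 2 * n - 1  # the first 2n - 1 bits describe the shape
    labels = iter(int(bits[i:i + width], 2)
                  for i in range(start, len(bits), width))
    pos = 0

    def build():
        nonlocal pos
        bit = bits[pos]
        pos += 1
        if bit == "1":
            return next(labels)  # leaves appear in preorder
        left = build()
        right = build()
        return (left, right)

    return build()


tree = (3, (2, (0, 1)))        # example tree on n = 4 characters
bits = encode_tree(tree, 4)
print(len(bits))               # 15 = 2*4 - 1 + 4*2
print(decode_tree(bits, 4) == tree)
```

The decoder needs no separator between the two parts because the shape always occupies exactly $2n - 1$ bits.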

### 16.3-7

> Generalize Huffman’s algorithm to ternary codewords (i.e., codewords using the symbols 0, 1, and 2), and prove that it yields optimal ternary codes.

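Merge the three lowest-frequency nodes at each step instead of two. Each merge reduces the number of nodes by 2, so the number of characters must be odd for the final merge to combine exactly three nodes; if $n$ is even, first add one dummy character with frequency 0. The optimality proof follows the binary case: some optimal tree has the three lowest-frequency characters as siblings at maximum depth, and merging them into one character leaves a smaller instance of the same problem.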
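A minimal sketch of the ternary variant follows. It assumes that an even number of characters is padded with one zero-frequency dummy node so that every merge combines exactly three nodes; `ternary_huffman` is a hypothetical name, and the frequencies reuse the six-character example from the homework section.

```python
import heapq
from itertools import count


def ternary_huffman(freqs):
    """Return {symbol: codeword over the digits 0, 1, 2}."""
    tie = count()  # tie-breaker so the heap never compares the dict payloads
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    # Each merge removes 3 nodes and adds 1, so the node count must be odd
    # to finish with a single root. Pad with a zero-frequency dummy if needed.
    if len(heap) % 2 == 0:
        heap.append((0, next(tie), {}))
    heapq.heapify(heap)
    while len(heap) > 1:
        total, merged = 0, {}
        for digit in "012":
            f, _, codes = heapq.heappop(heap)
            total += f
            # Prepend this branch's digit to every codeword in the subtree.
            for s, code in codes.items():
                merged[s] = digit + code
        heapq.heappush(heap, (total, next(tie), merged))
    return heap[0][2]


freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = ternary_huffman(freqs)
for symbol in sorted(codes):
    print(symbol, codes[symbol])
```

The dummy node carries no symbols, so it occupies a leaf position in the tree but never shows up in the returned code table.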

### 16.3-8

> Suppose that a data file contains a sequence of 8-bit characters such that all 256 characters are about equally common: the maximum character frequency is less than twice the minimum character frequency. Prove that Huffman coding in this case is no more efficient than using an ordinary 8-bit fixed-length code.

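Huffman's algorithm produces a perfect (complete) binary tree of depth 8 in this case, so every codeword has exactly 8 bits, which is just another 8-bit fixed-length encoding. Because the maximum frequency is less than twice the minimum, the sum of any two character frequencies exceeds every single character frequency, so the algorithm first pairs up all 256 characters into 128 nodes. The frequencies of those 128 nodes lie between $2 f_{min}$ and $2 f_{max} < 4 f_{min}$, so their maximum is again less than twice their minimum, and the argument repeats level by level.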
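The claim can be checked on sample data. The sketch below is an illustration only: it assumes integer frequencies drawn from 100 to 199 (so the maximum is below twice the minimum) and uses a hypothetical helper `code_lengths` built on `heapq`.

```python
import heapq
import random
from itertools import count


def code_lengths(freqs):
    """Return the Huffman codeword length of each frequency, by index."""
    tie = count()  # tie-breaker so the heap never compares the list payloads
    heap = [(f, next(tie), [i]) for i, f in enumerate(freqs)]
    lengths = [0] * len(freqs)
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        # Every character under the merged node gains one bit.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), s1 + s2))
    return lengths


random.seed(0)
freqs = [random.randint(100, 199) for _ in range(256)]  # max < 2 * min
lengths = code_lengths(freqs)
print(set(lengths))
```

Every codeword length comes out as 8, so the set printed contains the single value 8.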

### 16.3-9

> Show that no compression scheme can expect to compress a file of randomly chosen 8-bit characters by even a single bit.

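A lossless compression scheme must map distinct files to distinct outputs. There are $2^n$ possible files of $n$ bits, but only $2^{n-1}$ bit strings of length $n - 1$ (and only $2^n - 1$ strings of any length below $n$). Since $2^n > 2^{n-1}$, not every file can be shortened by even one bit. When the characters are chosen uniformly at random, all $2^n$ files are equally likely, so the scheme has no frequent inputs to favor with shorter outputs.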

# Homework

Build a Huffman code using a priority queue (implemented as a min-heap).

In [27]:
# Requires a priority queue (min-heap plus the ADT operations)

def parent(i):
    return (i - 1) >> 1  # divide by 2


def left(i):
    return (i << 1) + 1  # multiply by 2; +1 because indices start at 0


def right(i):
    return (i << 1) + 2  # multiply by 2 and add 1; +1 more because indices start at 0


def min_heapify(a, i):
    min_idx = i
    l, r = left(i), right(i)
    if l < len(a) and a[l] < a[min_idx]:
        min_idx = l
    if r < len(a) and a[r] < a[min_idx]:
        min_idx = r
    if min_idx != i:
        a[i], a[min_idx] = a[min_idx], a[i]  # swap a[i] with a[min_idx]
        min_heapify(a, min_idx)


def build_min_heap(a):
    # Heapify every internal node, from the last one up to the root.
    for i in range(len(a) // 2 - 1, -1, -1):
        min_heapify(a, i)
        
########################  ADT #################################################################################

def heap_minimum(a):
    assert(len(a) > 0)
    return a[0]  # in a min-heap the minimum is at index 0


def heap_extract_min(a):
    assert(len(a) > 0)  # the heap must be non-empty
    val = a[0]
    a[0] = a[-1]  # move the last value of the array to the top
    del a[-1]  # remove the now-duplicated last slot
    min_heapify(a, 0)  # restore the min-heap property; takes O(lg n) time
    return val  # return the minimum saved earlier


def heap_decrease_key(a, i, key):  # lower a[i] to key, then move it to a position that keeps the min-heap property
    assert(key <= a[i])  # the new key must not be larger than a[i]
    a[i] = key
    while i > 0 and a[i] < a[parent(i)]:  # O(lg n): may have to compare all the way up to the root
        a[i], a[parent(i)] = a[parent(i)], a[i]
        i = parent(i)


def min_heap_insert(a, key):
    a.append(float("inf"))  # append a sentinel larger than any key to the end of the array
    heap_decrease_key(a, len(a) - 1, key)  # the new value moves up to its proper position; takes O(lg n)

In [35]:
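def Huffman(C):
    # Tracks frequencies only: repeatedly merges the two smallest keys and
    # returns the root's frequency (the sum of all input frequencies).
    # Note: Q aliases C, so the caller's list is consumed.
    n = len(C)
    build_min_heap(C)
    Q = C
    print("debug1 ", Q)
    for i in range(1, n):  # n - 1 merges
        x = heap_extract_min(Q)
        y = heap_extract_min(Q)
        z = x + y
        min_heap_insert(Q, z)
        print("debug2 ", Q)

    print("debug3 ", Q)

    return heap_extract_min(Q)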

In [36]:
# C = [5,9,12,13,16,45]  # each character's frequency is its key
C = [45,13,12,16,9,5]   # ['a','b','c','d','e','f']
Huffman(C)


debug1  [5, 9, 12, 16, 13, 45]
debug2  [12, 13, 45, 16, 14]
debug2  [14, 16, 45, 25]
debug2  [25, 45, 30]
debug2  [45, 55]
debug2  [100]
debug3  [100]


100
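The values inserted at each step of the trace above (14, 25, 30, 55, 100) are the frequencies of the internal nodes of the Huffman tree, and by Exercise 16.3-4 their sum is the total cost of the tree. The sketch below computes that cost with the standard `heapq` module instead of the hand-written heap; `huffman_cost` is a hypothetical name.

```python
import heapq


def huffman_cost(freqs):
    """Total cost of the Huffman tree: the sum of all merged node frequencies."""
    heap = list(freqs)  # copy so the caller's list is not modified
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        z = heapq.heappop(heap) + heapq.heappop(heap)
        cost += z  # z is the frequency of a new internal node
        heapq.heappush(heap, z)
    return cost


print(huffman_cost([45, 13, 12, 16, 9, 5]))  # 14 + 25 + 30 + 55 + 100 = 224
```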

In [37]:
import heapq
from collections import defaultdict
 
 
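def Encode_fun(feq):
    # feq maps each symbol to its frequency (weight).
    # Each heap entry is [weight, [symbol, code], [symbol, code], ...].
    heap = [[weight, [symbol, '']] for symbol, weight in feq.items()]
    # heapq: module provides an implementation of the heap queue algorithm
    # also known as the priority queue algorithm
    heapq.heapify(heap)
    while len(heap) > 1:
        low = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        # Prepend '0' to every code in the lighter subtree, '1' in the heavier.
        for pair in low[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        # Merge the two subtrees into one entry with the combined weight.
        heapq.heappush(heap, [low[0] + hi[0]] + low[1:] + hi[1:])
    # Sort by code length, then by symbol and code.
    return sorted(heapq.heappop(heap)[1:], key=lambda p: (len(p[-1]), p))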

In [38]:
# Driver
text = "Huffman coding Algorithm"  # named text to avoid shadowing the built-in input()
# defaultdict: dict subclass that calls a factory function
# to supply missing values
feq = defaultdict(int)
# frequency for each character
for symbol in text:
    feq[symbol] += 1

# Huffman coding function
huff = Encode_fun(feq)
print("Symbol".ljust(10) + "Weight".ljust(10) + "Huffman Code")
for p in huff:
    print(p[0].ljust(10) + str(feq[p[0]]).ljust(10) + p[1])

Symbol    Weight    Huffman Code
m         2         000
n         2         001
o         2         010
          2         1000
f         2         1100
g         2         1101
i         2         1110
t         1         0110
u         1         0111
A         1         10010
H         1         10011
a         1         10100
c         1         10101
d         1         10110
h         1         10111
l         1         11110
r         1         11111
