Skip to content

This project intends to closer understand how to compress data using Huffman Coding and delve into a couple of programming examples.

License

Notifications You must be signed in to change notification settings

Gyakobo/Huffman-Coding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Huffman Coding

image image

Author: Andrew Gyakobo

Note

Please feel free to make pull requests and edit this project to your will.

Introduction

Huffman Coding is a widely used method of lossless data compression. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and it was published in his seminal paper in 1952. The algorithm is used to compress data by assigning variable-length codes to characters based on their frequencies. Characters that occur more frequently are given shorter codes, while less frequent characters are assigned longer codes. This approach reduces the overall size of the encoded data.

The process of Huffman Coding involves:

  1. Frequency Analysis: Calculate the frequency of each character in the input data.

  2. Building a Huffman Tree: Create a binary tree where each leaf node represents a character and its frequency. The tree is built by repeatedly merging the two nodes with the lowest frequencies until only one node remains (the root of the tree).

  3. Generating Codes: Assign binary codes to characters by traversing the Huffman Tree. Each left edge represents 0, and each right edge represents a 1.

Huffman Coding is optimal for symbol-by-symbol encoding and is widely used in various applications such as file compression (e.g., ZIP files), image compression (e.g., JPEG), and multimedia codecs (e.g., MP3).

Introduction to David A. Huffman

David A. Huffman (August 9, 1925 - October 7, 1999) was an American computer scientist and professor, known for his pioneering work in the field of information theory and data compression. Huffman was in Ohio and went on to study at the Massachussets Institute of Technology (MIT), where he completed his Ph.D. in electrical engineering.

While still a graduate student, Huffman devised the Huffman Coding algorithm as part of a term paper for a course on information theory taught by Robert M. Fano. His algorithm quickly became a fundamental technique in the field of data compression due to its efficiency and simplicity.

Throughout his career, Huffman made significant contributions to various areas of computer science and engineering. He was a professor at MIT and later at the University of California, Santa Cruz, where he helped establish the Computer Science Department.

David Huffman's work has had a lasting impact on how data is efficiently stored and quickly transmitted, and his legacy continues to influence modern computing and telecommunications.

Methodology

Step 1: Frequency Table of English Letters

First, we need a frequency table for the English letters. Here's a commonly used table based on large corpora of English texts. Below is a representation in Python code:

# Frequency table of English letters (including space)
frequency_table = {
    'A': 8.167, 'B': 1.492, 'C': 2.782, 'D': 4.253, 'E': 12.702,
    'F': 2.228, 'G': 2.015, 'H': 6.094, 'I': 6.966, 'J': 0.153,
    'K': 0.772, 'L': 4.025, 'M': 2.406, 'N': 6.749, 'O': 7.507,
    'P': 1.929, 'Q': 0.095, 'R': 5.987, 'S': 6.327, 'T': 9.056,
    'U': 2.758, 'V': 0.978, 'W': 2.360, 'X': 0.150, 'Y': 1.974,
    'Z': 0.074, ' ': 18.29
}
Letter Frequency (%) Letter Frequency (%)
A8.167 B1.492
C2.782 D4.253
E12.702 F2.228
G2.015 H6.094
I6.966 J0.153
K0.772 L4.025
M2.406 N6.749
O7.507 P1.929
Q0.095 R5.987
S6.327 T9.056
U2.758 V0.978
W2.36 X0.15
Y1.974 Z0.074
SPACE18.29

Step 2: Build a Huffman Code for English

We still use the frequency table to build a Huffman Tree and derive the Huffman codes for each character. Here's a simplified version of the process:

  1. Initialize: Create a leaf node for each character and add it to a priority queue based on frequency.
# Count frequency of each character in the sample text
sample_freq = Counter(sample_text.upper())
  1. Build the Tree:
  • Extract two nodes with the lowest frequency.
  • Create a new internal node with these two nodes as children and the sum of their frequencies as the new frequency.
  • Insert the new node back into the priority queue.
  • Repeat until only one node (the root) remains.
# Build Huffman Tree
heap = [[weight, [char, ""]] for char, weight in frequency_table.items()]
heapq.heapify(heap)

while len(heap) > 1:
    lo = heapq.heappop(heap)
    hi = heapq.heappop(heap)
    for pair in lo[1:]:
        pair[1] = '0' + pair[1]
    for pair in hi[1:]:
        pair[1] = '1' + pair[1]
    heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
  1. Generate Codes: Traverse the tree to assign codes to each character.
# Generate Huffman codes
huffman_codes = sorted(heap[0][1:], key=lambda p: (len(p[-1]), p))

# Calculate the number of bits needed using Huffman coding
huffman_bits = sum(sample_freq[char] * len(code) for char, code in huffman_codes if char in sample_freq)

To run the final program please just run the main file. I also wrote a sample encoder that encodes any input with arbitrary bits just for you, my dear reader, to compare their inherit weights.

Conclusion

David A. Huffman, through his groundbreaking development of Huffman Coding, revolutionized the field of data compression. His algorithm, created during his time as a Ph.D. student at MIT, stands as a testament to the power of innovative thinking in solving practical problems. Huffman Coding's efficient method of variable-length encoding based on character frequency not only optimizes data storage and transmission but also underpins many modern compression techniques used in everyday digital media. Huffman's work exemplifies the profound impact that a single, elegant solution can have on technology, shaping the way we manage and process information in the digital age.

License

MIT

About

This project intends to closer understand how to compress data using Huffman Coding and delve into a couple of programming examples.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published