An intelligent, cross-platform compression utility that automatically selects the best codec (zstd/brotli/gzip/bz2/lzma) per file type and content, with streaming, chunking, and optional dictionary training.
- Overview
- Problem Statement
- Key Features
- DSA Concepts
- Architecture
- Installation
- Usage
- Examples
- Performance
- Project Structure
- Learning Outcomes
- Interview Preparation
This project builds a production-grade file compression utility that intelligently selects compression algorithms based on file content analysis. Unlike simple wrappers, it demonstrates deep understanding of:
- Data Structures: Huffman Trees, Min Heaps, Hash Maps
- Algorithms: Greedy algorithms, entropy calculation, tree traversal
- System Design: Modular architecture, strategy pattern, streaming I/O
- Software Engineering: Error handling, testing, documentation
File compression is fundamental to:
- 💾 Cloud Storage: AWS S3 charges for storage & egress
- 🔄 CI/CD: Artifact compression saves bandwidth
- 📊 Big Data: Logs and analytics compression cuts costs
- 🗜️ Backups: Efficient compression enables larger backups
Real Impact: Smart codec selection saves 30-70% storage across diverse file types.
Different file types compress differently:
File Type    | Entropy (bits/byte) | Best Codec | Typical Ratio (compressed/original)
──────────────────────────────────────────────────────────────────────────────────────
Logs (text)  | 5.2                 | zstd-8     | 15-25%
JSON         | 5.5                 | zstd-10    | 12-22%
CSV          | 6.1                 | zstd-8     | 20-35%
Images (PNG) | 7.8                 | store      | 98-102%
Videos (MP4) | 7.9                 | store      | 99-101%
Binary       | 6.5                 | lzma-7     | 40-55%
Problem: Manual codec selection is tedious and suboptimal.
Solution: Automatic content-aware codec selection! ✨
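The entropy figures used for content analysis can be reproduced with a few lines of Python. This is a minimal sketch, assuming entropy is measured over byte frequencies in bits per byte; the function name `shannon_entropy` is illustrative and not taken from the project source.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = sum over symbols of p * log2(1 / p), where p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    print(shannon_entropy(b"aaaaaaaa"))        # one repeated symbol: lowest entropy
    print(shannon_entropy(bytes(range(256))))  # every byte value once: highest entropy
```

Values near 8 suggest the data is already compressed or encrypted, so storing it as-is is usually the better choice; lower values leave more room for a codec to work with.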
- Detects file type via magic bytes + MIME type
- Analyzes file entropy and content patterns
- Selects optimal codec automatically
- Supports multiple modes: fast/balanced/max
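Detection by magic bytes with a MIME fallback could look roughly like the sketch below. The signature table is a small, hand-picked sample, and `detect_type` and `is_already_compressed` are hypothetical names, not the project's actual API.

```python
import mimetypes
from typing import Optional

# A few well-known file signatures; a real detector would carry many more.
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\x1f\x8b": "application/gzip",
    b"PK\x03\x04": "application/zip",
    b"\x28\xb5\x2f\xfd": "application/zstd",
}

# Types that are already compressed and rarely shrink further.
ALREADY_COMPRESSED = set(MAGIC_SIGNATURES.values())


def detect_type(path: str, header: bytes) -> Optional[str]:
    """Prefer magic bytes from the file header; fall back to the extension."""
    for magic, mime in MAGIC_SIGNATURES.items():
        if header.startswith(magic):
            return mime
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed


def is_already_compressed(mime: Optional[str]) -> bool:
    """True when the detected type is one of the known compressed formats."""
    return mime in ALREADY_COMPRESSED
```

Checking the header first means a JPEG renamed to `photo.bin` is still recognized, while the extension-based guess covers text formats that have no signature.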
| Codec | Speed | Ratio | Best For |
|---|---|---|---|
| zstd | ⚡⚡⚡ | ⭐⭐⭐ | Logs, text, balanced |
| gzip | ⚡⚡ | ⭐⭐ | Universal standard |
| bz2 | ⚡ | ⭐⭐⭐ | High ratio archive |
| lzma | 🐢 | ⭐⭐⭐⭐ | Maximum compression |
| brotli | ⚡⚡ | ⭐⭐⭐ | Web content |
| huffman | ⚡⚡ | ⭐⭐ | Educational, simple |
- Memory-efficient for large files
- Chunked processing (1-4 MB blocks)
- Optional parallel compression
- No loading entire file into RAM
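A rough sketch of chunked streaming follows. It uses the standard library's `zlib` as a stand-in for zstd so it runs without extra dependencies; `compress_stream` and the 1 MiB default chunk size are assumptions for illustration.

```python
import io
import zlib
from typing import BinaryIO

CHUNK_SIZE = 1 << 20  # 1 MiB, the low end of the 1-4 MB block range


def compress_stream(src: BinaryIO, dst: BinaryIO, level: int = 6,
                    chunk_size: int = CHUNK_SIZE) -> int:
    """Compress src into dst one chunk at a time; return bytes written."""
    compressor = zlib.compressobj(level)
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Only one chunk plus the compressor state is held in memory.
        written += dst.write(compressor.compress(chunk))
    written += dst.write(compressor.flush())
    return written


if __name__ == "__main__":
    source = io.BytesIO(b"2024-01-01 INFO request ok\n" * 100_000)
    target = io.BytesIO()
    print(compress_stream(source, target), "bytes written")
```

Because the compressor object keeps its own state between calls, the output is a single valid stream even though the input was never fully loaded.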
- SHA256 checksums for all files
- Manifest files (.dfc.json) for reproducibility
- Round-trip verification (compress → decompress)
- Byte-perfect integrity guarantees
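The checksum and round-trip checks can be sketched as below. This is an illustration under stated assumptions: `lzma` from the standard library stands in for whichever codec was selected, and both function names are hypothetical.

```python
import hashlib
import io
import lzma
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a stream in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def round_trip_ok(original: bytes) -> bool:
    """Compress, decompress, and compare SHA-256 digests of both ends."""
    expected = sha256_of_stream(io.BytesIO(original))
    restored = lzma.decompress(lzma.compress(original))
    return sha256_of_stream(io.BytesIO(restored)) == expected


if __name__ == "__main__":
    print(round_trip_ok(b'{"status": "ok"}' * 1000))
```

Comparing digests rather than raw bytes means the original can be hashed once at compression time and checked again later without keeping a copy around.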
dfc compress input.json --mode auto # Smart compression
dfc decompress input.json.zst # Quick decompression
dfc verify input.json.zst.dfc.json # Verify integrity
dfc info input.json.zst # Show statistics
- Dictionary training (zstd)
- Archive mode (tar + zstd)
- Detailed logging & statistics
- Comprehensive error handling
- Cross-platform (Windows/Mac/Linux)
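For archive mode, a folder can be bundled and compressed in one pass with the standard library. Note the swap: this sketch uses tar + xz (`tarfile` mode `w:xz`) because it needs no third-party package, whereas the feature list describes tar + zstd. `archive_folder` is a hypothetical helper name.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_folder(folder: str, output: str) -> str:
    """Write `folder` into an xz-compressed tar archive at `output`."""
    with tarfile.open(output, "w:xz") as tar:
        # arcname keeps paths in the archive relative to the folder name.
        tar.add(folder, arcname=Path(folder).name)
    return output


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        logs = Path(tmp) / "logs"
        logs.mkdir()
        (logs / "app.log").write_text("INFO started\n" * 1000)
        archive = archive_folder(str(logs), str(Path(tmp) / "logs.tar.xz"))
        with tarfile.open(archive) as tar:
            print(tar.getnames())
```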
This project showcases real DSA in production systems:
| Structure | Usage | Complexity |
|---|---|---|
| Binary Tree | Huffman tree nodes | Build: O(n log n) |
| Min Heap | Priority queue for Huffman | Insert/Pop: O(log n) |
| Hash Map | Frequency tables, code lookup | O(1) average |
| Bit Manipulation | Encode/decode bytes | O(n) traversal |
| Streams | File I/O chunking | O(1) memory |
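The structures in this table come together in Huffman coding. Below is a compact sketch, not the project's `huffman_coder.py`: it uses `heapq` as the min heap, nested tuples as tree nodes, and strings of "0"/"1" instead of packed bits to keep the example readable.

```python
import heapq
from collections import Counter
from typing import Dict


def build_huffman_codes(data: bytes) -> Dict[int, str]:
    """Map each byte value to a prefix-free bit string."""
    freq = Counter(data)                      # hash map: frequency table
    if not freq:
        return {}
    if len(freq) == 1:                        # single symbol still needs 1 bit
        return {next(iter(freq)): "0"}
    # Heap entries are (frequency, unique tie-breaker, tree).
    # A tree is either an int leaf or a (left, right) tuple.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:                      # greedy: merge two rarest trees
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes: Dict[int, str] = {}
    stack = [(heap[0][2], "")]
    while stack:                              # iterative DFS over the tree
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix
    return codes


def encode(data: bytes, codes: Dict[int, str]) -> str:
    return "".join(codes[b] for b in data)


def decode(bits: str, codes: Dict[int, str]) -> bytes:
    reverse = {code: sym for sym, code in codes.items()}
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in reverse:                # prefix-free: first match is right
            out.append(reverse[current])
            current = ""
    return bytes(out)


if __name__ == "__main__":
    message = b"abracadabra"
    table = build_huffman_codes(message)
    packed = encode(message, table)
    print(len(packed), "bits instead of", len(message) * 8)
    print(decode(packed, table) == message)
```

The unique tie-breaker in each heap entry keeps Python from ever comparing a leaf with a subtree when two frequencies are equal.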
- Huffman Coding (Greedy): Frequency Analysis → Min Heap → Build Tree → Generate Codes → Encode. O(n log n) for the tree, O(n) for encoding.
- Entropy Calculation (Information Theory): H(X) = -Σ p(x) * log₂(p(x)), measured in bits per byte. Determines compressibility on a 0-8 scale (lower means more compressible).
- Tree Traversal (Huffman code lookup): each symbol's code is a root-to-leaf path, so a lookup costs O(code length), which is O(log n) when the tree is balanced.
- Divide & Conquer (Optional): Split file → Compress chunks → Merge results. Parallel scaling with ThreadPoolExecutor.
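The split, compress, merge idea can be sketched with `ThreadPoolExecutor`. This is an illustration only: `zlib` stands in for the real codec, each chunk becomes an independent frame, and the helper names and chunk size are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Divide: cut the input into fixed-size pieces."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def compress_parallel(data: bytes, chunk_size: int = 1 << 20,
                      workers: int = 4) -> List[bytes]:
    """Conquer: compress every chunk independently on a thread pool."""
    chunks = split_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so frames stay aligned.
        return list(pool.map(zlib.compress, chunks))


def decompress_chunks(frames: List[bytes]) -> bytes:
    """Merge: decompress each frame and join them back together."""
    return b"".join(zlib.decompress(frame) for frame in frames)


if __name__ == "__main__":
    payload = b"timestamp,level,message\n" * 200_000
    frames = compress_parallel(payload)
    print(len(frames), "frames,", sum(map(len, frames)), "bytes")
    print(decompress_chunks(frames) == payload)
```

Independent frames trade a little ratio for parallelism, since each chunk starts with an empty compression window. Threads help here only to the extent that the underlying codec releases the GIL while it works.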
INPUT FILE
↓
┌─────────────────────────────────────────────┐
│ DETECTION PHASE │
│ • Magic bytes (0xFF 0xD8 0xFF = JPEG) │
│ • MIME type guess (application/json) │
│ • Entropy calculation (5.2 = compressible) │
│ • Sample analysis │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ STRATEGY SELECTION PHASE │
│ IF already_compressed: STORE │
│ ELIF is_text AND entropy < 6.0: zstd-8 │
│ ELIF mode == "max": lzma-7 │
│ ELSE: zstd-6 (balanced) │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ COMPRESSION ENGINE PHASE │
│ • Stream file in chunks │
│ • Apply selected codec │
│ • Calculate statistics │
│ • Compute SHA256 hash │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ VERIFICATION & OUTPUT PHASE │
│ • Write compressed file │
│ • Create manifest (.dfc.json) │
│ • Test decompression (round-trip) │
│ • Verify SHA256 match │
└─────────────────────────────────────────────┘
↓
OUTPUT: compressed.zst + compressed.zst.dfc.json
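The strategy selection box translates almost line for line into code. The sketch below mirrors only the four branches shown in the diagram; `CodecChoice` and `select_codec` are hypothetical names, and the real `strategy.py` may well consider more inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecChoice:
    codec: str
    level: int


def select_codec(already_compressed: bool, is_text: bool,
                 entropy: float, mode: str = "balanced") -> CodecChoice:
    """Pick a codec using the same branch order as the diagram."""
    if already_compressed:
        return CodecChoice("store", 0)   # recompressing would waste time
    if is_text and entropy < 6.0:
        return CodecChoice("zstd", 8)
    if mode == "max":
        return CodecChoice("lzma", 7)
    return CodecChoice("zstd", 6)        # balanced default


if __name__ == "__main__":
    print(select_codec(False, True, 5.2))
    print(select_codec(True, False, 7.9))
    print(select_codec(False, False, 6.5, mode="max"))
```

Keeping the decision in one pure function makes it easy to unit test each branch without touching the file system.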
- Python 3.9+
- pip (Python package manager)
git clone https://github.com/yourusername/dynamic-file-compression.git
cd dynamic-file-compression
# Windows
python -m venv venv
venv\Scripts\activate
# Mac/Linux
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python main.py --help
# Auto-detect best codec
python main.py compress data.json
# Output:
# ✅ Compression successful!
# Original: 5.0MB (5,000,000 bytes)
# Compressed: 850.0KB (850,000 bytes)
# Ratio: 17.0% (saved 83.0%)
# Codec: zstd level 8
# Time: 0.35s
# Speed: 14.3 MB/s
# Fast compression (prioritize speed)
python main.py compress data.json --mode fast
# zstd level 3, ~50% size reduction, fastest
# Balanced (default)
python main.py compress data.json --mode balanced
# zstd level 6, ~60% size reduction, fast
# Maximum compression (prioritize ratio)
python main.py compress data.json --mode max
# lzma level 7, ~75% size reduction, slower
python main.py decompress data.json.zst
# Output: data.json (restored exactly)
python main.py verify data.json.zst.dfc.json
# ✅ Integrity verified!
# SHA256 matches, file is safe
python main.py info data.json.zst
# Shows complete manifest with statistics
$ python main.py compress large_dataset.json --mode balanced
🔐 Compressing: large_dataset.json
Mode: balanced
✅ Compression successful!
Original: 250.0MB (250,000,000 bytes)
Compressed: 37.5MB (37,500,000 bytes)
Ratio: 15.0% (saved 85.0%)
Codec: zstd level 6
Time: 2.15s
Speed: 116.3 MB/s
Output: large_dataset.json.zst
Manifest: large_dataset.json.zst.dfc.json
$ python main.py compress build_artifacts/ --mode fast
🔐 Compressing: build_artifacts/...
Mode: fast
✅ Compression successful!
Original: 500.0MB
Compressed: 250.0MB
Ratio: 50.0%
Codec: zstd level 3
Time: 1.2s
Speed: 417 MB/s
$ python main.py compress logs/ --mode max
🔐 Compressing: logs/...
Mode: max
✅ Compression successful!
Original: 2.0GB
Compressed: 200.0MB
Ratio: 10.0% (saved 90%!)
Codec: lzma level 7
Time: 45.3s
Speed: 44.1 MB/s
File Type                    | Size   | Fast   | Balanced | Max    | Space Saved*
──────────────────────────────────────────────────────────────────────────────────
Text logs                    | 1 GB   | 540 MB | 180 MB   | 100 MB | 46-90%
JSON API data                | 500 MB | 280 MB | 75 MB    | 50 MB  | 44-90%
CSV reports                  | 200 MB | 120 MB | 30 MB    | 15 MB  | 40-93%
Already-compressed (PNG/MP4) | 100 MB | 100 MB | 100 MB   | 100 MB | 0%
Binary files                 | 200 MB | 110 MB | 85 MB    | 65 MB  | 45-68%
*Compared to storing uncompressed; range spans Fast to Max mode
Codec  | Speed (MB/s) | Typical File Type
──────────────────────────────────────────
zstd-3 | 450          | Fast CI/CD build
zstd-6 | 300          | Balanced logs
zstd-8 | 150          | High ratio text
lzma-7 | 40           | Archive storage
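Throughput and ratio depend heavily on hardware and input data, so it is worth measuring on your own files. Here is a small benchmark sketch using only standard-library codecs (gzip, bz2, lzma) as stand-ins; the codec labels and the `benchmark` helper are illustrative, and no particular numbers are implied.

```python
import bz2
import gzip
import lzma
import time
from typing import Callable, Dict, List, Tuple

# Stand-in codecs from the standard library; zstd and brotli need extra packages.
CODECS: Dict[str, Callable[[bytes], bytes]] = {
    "gzip-6": lambda d: gzip.compress(d, compresslevel=6),
    "bz2-9": lambda d: bz2.compress(d, compresslevel=9),
    "lzma-6": lambda d: lzma.compress(d, preset=6),
}


def benchmark(data: bytes) -> List[Tuple[str, float, float]]:
    """Return (codec, ratio in %, speed in MB/s) for each codec."""
    results = []
    for name, compress in CODECS.items():
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        ratio = len(out) / len(data) * 100
        speed = len(data) / 1_000_000 / elapsed if elapsed > 0 else float("inf")
        results.append((name, ratio, speed))
    return results


if __name__ == "__main__":
    sample = b"2024-01-01 12:00:00 INFO request handled in 12ms\n" * 50_000
    for name, ratio, speed in benchmark(sample):
        print(f"{name:8} ratio {ratio:5.1f}%  speed {speed:8.1f} MB/s")
```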
Dynamic-File-Compression-Utility/
│
├── main.py # Entry point
├── requirements.txt # Dependencies
├── .gitignore # Git ignore rules
│
├── src/
│ ├── __init__.py # Package init
│ ├── huffman_node.py # Huffman tree node
│ ├── huffman_coder.py # Huffman algorithm
│ ├── detector.py # File type detection
│ ├── strategy.py # Codec selection logic
│ ├── compressor.py # Compression engine
│ ├── verify.py # Decompression & verify
│ ├── cli.py # Command-line interface
│ └── archive.py # Folder compression
│
├── input_files/ # Sample inputs
│ ├── sample.txt # Text file
│ ├── sample.json # JSON file
│ └── sample.csv # CSV file (optional)
│
├── compressed_files/ # Compressed outputs
├── decompressed_files/ # Decompressed outputs
│
├── outputs/ # Statistics & reports
├── images/ # Screenshots
├── docs/
│ ├── 01_PROJECT_EXPLANATION.md # Detailed explanation
│ ├── 02_TECH_STACK_OPTIONS.md # Tech choices
│ ├── 03_ARCHITECTURE.md # System architecture
│ ├── 04_IMPLEMENTATION_PLAN.md # Phase-by-phase plan
│ ├── 05_DSA_CONCEPTS.md # DSA deep dive
│ ├── 06_INTERVIEW_QNA.md # Interview prep
│ └── 07_GITHUB_GUIDE.md # GitHub workflow
│
└── README.md # This file
After completing this project, you'll understand:
✅ Huffman Coding algorithm and greedy approach
✅ Binary trees and tree traversals (DFS)
✅ Min heap operations and priority queues
✅ Hash maps for frequency tables
✅ Entropy and information theory basics
✅ Modular architecture (detector → strategy → compressor)
✅ Strategy pattern for codec selection
✅ Streaming I/O for memory efficiency
✅ Error handling and validation
✅ Professional CLI design
✅ Configuration management (modes, strategies)
✅ Testing and verification
✅ Documentation and code comments
✅ Cross-platform compatibility
✅ Production-grade Python
✅ Multiple compression libraries
✅ Performance optimization
✅ Integrity verification
✅ DevOps thinking (storage, costs)
See docs/06_INTERVIEW_QNA.md for:
- 10 Common Interview Questions with detailed answers
- DSA Deep Dives (Huffman, heaps, trees)
- System Design questions about scaling
- HR Questions about project and learning
- Technical Follow-ups for architects
Q: "Explain your project."
A: "I built a dynamic compression utility that intelligently selects the best codec for each file. The system:
- Detects file type via magic bytes and calculates entropy
- Strategizes by analyzing if file is text, already-compressed, or binary
- Compresses using appropriate codec (zstd for text, lzma for max ratio)
- Verifies with SHA256 checksums
The technical depth includes building a Huffman tree using a min heap for O(n log n) tree construction, then O(n) encoding. It demonstrates understanding of greedy algorithms, tree data structures, streaming I/O for memory efficiency, and system design with modular architecture."
git add .
git commit -m "Phase 1: Project structure and dependencies"
git push
# After each phase
git add src/
git commit -m "Phase X: [Feature description]"
git push
git add README.md docs/ images/
git commit -m "Phase 10: Complete documentation and GitHub ready"
git push
Repository Name: dynamic-file-compression-utility
Description:
Intelligent, cross-platform file compression with automatic codec
selection. DSA project showcasing Huffman coding, min heaps, binary
trees, and production system design. Supports zstd/brotli/gzip/bz2/lzma.
Topics: compression, dsa, huffman-coding, data-structures, algorithms, python, cli-tool, backend, system-design
ReadMe: Include this file
License: MIT
Enable: Issues, Discussions
$ python main.py compress sample.json --mode auto
🔐 Compressing: sample.json
Mode: auto
✅ Compression successful!
Original: 5.0MB (5,000,000 bytes)
Compressed: 850.0KB (850,000 bytes)
Ratio: 17.0% (saved 83.0%)
Codec: zstd level 8
Time: 0.35s
Speed: 14.3 MB/s
Output: sample.json.zst
$ python main.py verify sample.json.zst.dfc.json
🔍 Verifying: sample.json.zst.dfc.json
✅ Integrity verified!
============================================================
Compression Manifest
============================================================
Source: sample.json
Output: sample.json.zst
Codec: zstd (level 8)
Original: 5,000,000 bytes
Compressed: 850,000 bytes
Ratio: 17.0%
Time: 0.35s
Speed: 14.3 MB/s
SHA256: a1b2c3d4...
============================================================
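The ratio, savings, and speed figures in this output follow from three inputs. The sketch below shows the arithmetic as inferred from the sample numbers (ratio is compressed size over original size, speed is original size in decimal megabytes over elapsed time); `compression_stats` and its key names are hypothetical and do not describe the real manifest schema.

```python
from typing import Dict


def compression_stats(original_bytes: int, compressed_bytes: int,
                      seconds: float) -> Dict[str, float]:
    """Derive the percentages and throughput reported after compression."""
    if original_bytes <= 0 or seconds <= 0:
        raise ValueError("original size and elapsed time must be positive")
    ratio = compressed_bytes / original_bytes * 100
    return {
        "ratio_pct": round(ratio, 1),                    # compressed / original
        "saved_pct": round(100 - ratio, 1),              # space no longer needed
        "speed_mb_s": round(original_bytes / 1_000_000 / seconds, 1),
    }


if __name__ == "__main__":
    # Same inputs as the sample.json run shown above.
    print(compression_stats(5_000_000, 850_000, 0.35))
```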
pip install zstandard brotli
# Ensure file path is correct
python main.py compress ./input_files/sample.txt
# Make script executable (Mac/Linux)
chmod +x main.py
Key Talking Points:
- Problem Solving: "I identified that different files compress differently. My solution auto-selects codecs based on content analysis."
- DSA Application: "I used Huffman coding with min heaps (O(n log n)) to build optimal prefix-free codes. This is a classic greedy algorithm."
- System Design: "I implemented a modular architecture: detector → strategy → compressor. Each component is independently testable."
- Production Thinking: "I added streaming I/O for large files, SHA256 verification for integrity, and manifest files for reproducibility."
- Learning Journey: "I started with basic Huffman, expanded to multiple codecs, then added intelligent strategy selection. Each phase builds on the previous one."
- Huffman Coding - Wikipedia
- zstd - Zstandard
- Information Theory - Shannon Entropy
- Python Heapq Documentation
MIT License - See LICENSE file for details
Potential Enhancements:
- GPU acceleration for compression
- Distributed compression (multiple machines)
- S3/cloud integration
- Web UI dashboard
- Real-time compression monitoring
- Dictionary optimization
- Parallel frame compression
Contributions welcome! For major changes:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Built with ❤️ for DSA Learning & Interview Preparation
Industry-oriented Dynamic File Compression Utility using Huffman Coding, intelligent codec selection, streaming compression, SHA-256 verification, CLI and Streamlit UI.