HoloKV: Holographic Phase-Shifting for O(N/k) KV-Cache Compression

Author: Sami Hilali (@HilaliSami42552)
Status: Open Research Draft (Mathematical Proof-of-Concept)
Read the Paper: HoloKV_Whitepaper.pdf

🚀 Latest Breakthrough: A Compute-Constrained Proof of Concept.

Using a deterministic Walsh-Hadamard phase matrix and an end-to-end Knowledge Distillation pipeline, the HoloKV PyTorch simulator successfully extracted a target zero-shot reasoning token from a $k=4$ (75% compressed) superimposed noise block.

Terminal Output from Qwen-0.5B (HoloKV-Injected):

```
[4/4] Running HoloKV Inference (75% Cache Compressed)...

==================================================
FINAL BENCHMARK

Target Prompt Code : 'ALPHA-77'
Baseline Output    : 'ALPHA-77.'
HoloKV Output      : 'ALPHA-77.'

[✓] ARCHITECTURE VERIFIED: Perfect Zero-Shot Denoising Achieved.
```
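
The deterministic Walsh-Hadamard phase matrix mentioned above can be built with the classic Sylvester recursion. A minimal sketch (the `hadamard` helper is our illustration, not the repo's API):

```python
import torch

def hadamard(k: int) -> torch.Tensor:
    """Sylvester construction of a k x k Walsh-Hadamard matrix (k a power of 2).

    Rows are mutually orthogonal +1/-1 phase keys: H @ H.T == k * I.
    """
    assert k > 0 and k & (k - 1) == 0, "k must be a power of 2"
    H = torch.ones(1, 1)
    while H.shape[0] < k:
        H = torch.cat([torch.cat([H,  H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H

H = hadamard(4)
assert torch.equal(H @ H.T, 4 * torch.eye(4))  # exact orthogonality between phase keys
```

Exact row orthogonality ($H H^\top = kI$) is the property the phase-shifting relies on, and because the keys are static, de-phasing needs no per-token metadata.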


🚨 Call for Hardware Collaborators (Triton / CUDA)

HoloKV is an independent research initiative. The core mathematics (Orthogonal Phase-Shifting, the RoPE Even-Boundary Rule, Variance Normalization) has been successfully modeled in PyTorch. However, to achieve the actual physical $\mathcal{O}(N/k)$ VRAM reduction, we need to build a custom SRAM Active Accumulation Buffer kernel.
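
What that kernel must do can be sketched in plain PyTorch (an illustrative contract only: the class name and layout below are our assumptions, and this sketch still allocates ordinary GPU memory rather than fusing the accumulate into attention in SRAM):

```python
import torch

class AccumulationBuffer:
    """Illustrative sketch of the Active Accumulation Buffer contract:
    N logical tokens are folded into ceil(N/k) physical slots. A real
    kernel would fuse the multiply-accumulate into the attention pass
    so the full O(N) cache is never materialized in HBM."""

    def __init__(self, max_tokens: int, d: int, keys: torch.Tensor):
        k = keys.shape[0]
        self.keys = keys                                    # (k, d) +1/-1 phase keys
        self.buf = torch.zeros((max_tokens + k - 1) // k, d)
        self.t = 0                                          # logical token counter

    def append(self, v: torch.Tensor) -> None:
        k = self.keys.shape[0]
        self.buf[self.t // k] += self.keys[self.t % k] * v  # in-place superposition
        self.t += 1

keys = torch.randint(0, 2, (4, 64)).float() * 2 - 1
buf = AccumulationBuffer(max_tokens=1024, d=64, keys=keys)
buf.append(torch.randn(64))
print(buf.buf.shape)  # torch.Size([256, 64]) -- O(N/k) physical rows for N tokens
```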

If you are an engineer experienced in OpenAI Triton or CUDA C++ and want to help build a custom FlashAttention-style kernel to make infinite-context LLMs a reality, please DM me on X or open an Issue!


🧠 What is HoloKV?

As context windows grow, the KV-Cache grows linearly at $\mathcal{O}(N)$ in sequence length, creating a massive "Memory Wall." Standard compression methods drop tokens or quantize precision, both of which degrade reasoning.

HoloKV takes a geometric approach inspired by telecommunications (CDMA). Instead of appending new memory slots, HoloKV multiplexes (stacks) $k$ temporal tokens into a single physical memory slot using static, orthogonal $+1/-1$ phase keys.
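
A toy PyTorch sketch of this multiplexing (our own illustration, not the repo's code; it uses random $\pm 1$ keys tiled to the head dimension where HoloKV's are deterministic Walsh-Hadamard rows) shows both the $k \to 1$ compression and the cross-talk the denoiser must remove:

```python
import torch

torch.manual_seed(0)
k, d = 4, 64                        # 4 temporal tokens share one slot (75% compression)

keys = torch.randint(0, 2, (k, d)).float() * 2 - 1   # one +1/-1 phase key per token
tokens = torch.randn(k, d)          # the k KV vectors destined for one physical slot

slot = (keys * tokens).sum(dim=0)   # holographic superposition: k vectors -> 1 slot

# De-phasing with key i recovers token i plus zero-mean cross-talk from the
# other k-1 tokens -- the "Gaussian background static" the LoRA engine filters.
i = 2
recovered = keys[i] * slot
crosstalk = recovered - tokens[i]
print(crosstalk.std())              # ~sqrt(k-1): residual static, not exact zero
```

De-phasing is exact only in expectation: the $k-1$ foreign tokens contribute zero-mean static, which is what the LoRA Denoising Engine (item 4 below) is distilled to filter out.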

Key Innovations:

  1. Holographic Superposition: Compresses KV memory by 75% to 87.5% ($k=4$ to $k=8$) without permanently evicting any token.
  2. Variance Normalization: A mathematically derived $\sqrt{k}$ scaling penalty that prevents the Softmax entropy collapse caused by superimposing dense vectors (see the sketch after this list).
  3. The Strict Even-Boundary Rule: A deterministic phase-key assignment constraint that preserves the commutativity of the 2D rotations used by RoPE (Rotary Positional Embeddings), letting HoloKV run natively on Llama 3 and Qwen architectures (also sketched below).
  4. LoRA Denoising Engine: A lightweight Knowledge Distillation method that injects Query/Value LoRA adapters to natively filter out the Gaussian background static introduced by the multiplexing.
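
A hedged sketch of points 2 and 3 (our reading of the whitepaper; all names are illustrative): summing $k$ unit-variance vectors inflates per-component variance by $k$, so attention logits sharpen and Softmax entropy collapses unless the slot is rescaled by $\sqrt{k}$; and a phase sign held constant across a RoPE rotary pair acts as a scalar on that 2D plane, so it commutes with the rotation:

```python
import torch

torch.manual_seed(0)
k, d = 4, 64
keys = torch.randint(0, 2, (k, d)).float() * 2 - 1
slot = (keys * torch.randn(k, d)).sum(dim=0)

print(slot.var())               # ~k: superposition inflates variance k-fold
print((slot / k ** 0.5).var())  # ~1: the sqrt(k) penalty restores the per-token
                                #     scale the q.k logits (and Softmax) expect

# Even-Boundary Rule, as we read it: both halves of a RoPE rotary pair (2i, 2i+1)
# carry the same +1/-1 sign, so the key acts as a scalar on that 2D plane and
# commutes with the rotation R(theta).
theta = torch.tensor(0.3)
R = torch.tensor([[theta.cos(), -theta.sin()],
                  [theta.sin(),  theta.cos()]])
pair, s = torch.randn(2), -1.0
assert torch.allclose(R @ (s * pair), s * (R @ pair))  # rotate-then-phase == phase-then-rotate
```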

📂 Repository Contents

  • HoloKV_Whitepaper.pdf: The full architectural draft detailing the math, scaling laws, and hardware theory.
  • holokv_math_simulator.py: A PyTorch implementation of the HoloKV forward pass. Note: This is a strict mathematical simulator used to validate the phase-shifting, RoPE compatibility, and Softmax normalization. It does not yield physical VRAM savings as it currently lacks the fused SRAM hardware kernel.

🤝 Let's Build This

The math works. The next step is the hardware execution. Let's shatter the Memory Wall together.

