GPT-NeoX allocating full-length KV cache (octoml#179)

This PR changes the GPT-NeoX KV cache creation function to allocate the cache at its full size (config.max_sequence_length along the first dimension) up front, so no further memory allocation is required while the model is running.
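
To make the effect of the one-line change concrete, here is a minimal sketch of the initial shape before and after, plus the resulting size of one tensor of that shape. The Config values, the helper names kv_cache_init_shape and tensor_bytes, and the 2-byte element size are assumptions chosen for illustration; none of them come from this PR, and the real function builds a relax.ShapeExpr rather than a plain tuple.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Config:
    # Illustrative values only; not taken from the PR or any specific model.
    max_sequence_length: int = 2048
    num_attention_heads: int = 32
    hidden_size: int = 4096


def kv_cache_init_shape(config, full_length=True):
    """Initial cache shape: first dimension 1 before the change, full length after."""
    seq_len = config.max_sequence_length if full_length else 1
    head_dim = config.hidden_size // config.num_attention_heads
    return (seq_len, config.num_attention_heads, head_dim)


def tensor_bytes(shape, bytes_per_element=2):
    # bytes_per_element=2 assumes a 16-bit dtype such as float16.
    return prod(shape) * bytes_per_element


config = Config()
old_shape = kv_cache_init_shape(config, full_length=False)
new_shape = kv_cache_init_shape(config, full_length=True)
print(old_shape, tensor_bytes(old_shape))
print(new_shape, tensor_bytes(new_shape))
```

Under these assumed values the trade-off is a larger allocation made once at startup in exchange for no allocation while the model is running.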
Lunderberg · May 18, 2023 · a181bd5
1 parent de7b5ab
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/mlc_llm/relax_model/gpt_neox.py b/mlc_llm/relax_model/gpt_neox.py
@@ -593,7 +593,7 @@ def create_kv_cache_func(
 ) -> None:
     init_shape = relax.ShapeExpr(
         (
-            1,
+            config.max_sequence_length,
             config.num_attention_heads,
             config.hidden_size // config.num_attention_heads,
         )