perf: serve quantized Psyfighter2 #81

sambarnes · 2024-03-19T15:16:36Z

Details

Code of Conduct

I agree to follow this project's Code of Conduct
I agree to license this contribution under the MIT LICENSE
I checked the current PR for duplication.

sambarnes · 2024-03-19T15:19:19Z

modal/runner/containers/vllm_unified.py

+_psyfighter2 = "TheBloke/LLaMA2-13B-Psyfighter2-GPTQ"
 VllmContainer_KoboldAIPsyfighter2 = _make_container(
    name="VllmContainer_KoboldAIPsyfighter2",
-    model_name="KoboldAI/LLaMA2-13B-Psyfighter2",
-    gpu=modal.gpu.A100(count=1, memory=40),
-    concurrent_inputs=32,
+    model_name=_psyfighter2,
+    gpu=modal.gpu.A10G(count=1),
+    concurrent_inputs=4,
+    max_containers=5,
+    quantization="GPTQ",


We get at most about 10 requests a minute for this model, so I think this allocation should be more than enough. We might be able to reduce the batch size further for a better throughput floor.

Will try this first, though.
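A rough way to sanity-check that claim is Little's law (mean in-flight requests = arrival rate × mean latency). The sketch below uses `concurrent_inputs=4` and `max_containers=5` from the diff and the ~10 requests/minute peak from the comment. The 30-second average latency is a made-up placeholder, not a measured number, and the helper names are hypothetical.

```python
def avg_in_flight(requests_per_minute: float, avg_latency_s: float) -> float:
    """Little's law: mean concurrent requests = arrival rate * mean latency."""
    return requests_per_minute * avg_latency_s / 60.0


def max_sustainable_latency_s(total_slots: int, requests_per_minute: float) -> float:
    """Mean latency at which average in-flight requests would fill every slot."""
    return total_slots * 60.0 / requests_per_minute


CONCURRENT_INPUTS = 4      # from the diff
MAX_CONTAINERS = 5         # from the diff
PEAK_RPM = 10.0            # "max 10 requests a minute" from the comment
ASSUMED_LATENCY_S = 30.0   # ASSUMPTION: placeholder, not measured

total_slots = CONCURRENT_INPUTS * MAX_CONTAINERS
in_flight = avg_in_flight(PEAK_RPM, ASSUMED_LATENCY_S)
headroom_latency = max_sustainable_latency_s(total_slots, PEAK_RPM)

print(f"slots at full scale-out: {total_slots}")
print(f"avg in-flight at peak:   {in_flight:.1f}")
print(f"latency that fills them: {headroom_latency:.0f}s")
```

This only covers average load at full scale-out. It ignores bursts, cold starts, and how latency changes with batch size on an A10G, so real numbers should come from the deployment's metrics.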

sambarnes added 2 commits March 19, 2024 09:11

perf: serve quantized Psyfighter2

69b698a

fix: add GPTQ quant param

265aeff

sambarnes marked this pull request as ready for review March 19, 2024 15:16

sambarnes merged commit 754d41f into main Mar 19, 2024
3 checks passed

sambarnes deleted the quantize-last-one branch March 19, 2024 15:19
