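Hello! Thank you for your excellent work! While reproducing your results, I tried to measure the generation rate of each model (specifically Falcon-40b from the Evaluation you provide). The command I ran was: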
./build/bin/main -m /data/models/falcon-40b-relu-powerinfer/falcon-40b-relu.q4.powerinfer.gguf -n 512 -t 8 -p "In the depths of twilight, where shadows dance with whispers, ancient secrets stir beneath the surface, beckoning the curious to unravel mysteries that linger within the fabric of time and space, awaiting discovery and enlightenment. The moon casts its gentle glow, illuminating pathways obscured by darkness, guiding intrepid souls on a journey towards the unknown, where truths and wonders intertwine in the eternal dance of existence." --vram-budget 22
I calculated the generation rate for Falcon-40B from the following timing output:
llama_print_timings: load time = 13806.71 ms
llama_print_timings: sample time = 240.38 ms / 254 runs ( 0.95 ms per token, 1056.67 tokens per second)
llama_print_timings: prompt eval time = 2031.23 ms / 79 tokens ( 25.71 ms per token, 38.89 tokens per second)
llama_print_timings: eval time = 10300.94 ms / 253 runs ( 40.72 ms per token, 24.56 tokens per second)
llama_print_timings: total time = 12701.22 ms
Log end
1000 / (0.95 + 25.71 + 40.72) = 1000 / 67.38 = 14.84 tokens/s
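To make the arithmetic above easy to check, here is a minimal Python sketch that parses the three per-stage lines from the log and recomputes the figure. The regular expression and the names `parse_timings`, `summed_rate`, and `eval_rate` are my own illustration and are not part of PowerInfer or llama.cpp. The second number is only the decode-only rate that the log already prints on its `eval time` line; it is included for comparison, not as a claim about how the paper measures speed.

```python
import re

# The three per-stage lines copied from the timing output above.
LOG = """\
llama_print_timings: sample time = 240.38 ms / 254 runs ( 0.95 ms per token, 1056.67 tokens per second)
llama_print_timings: prompt eval time = 2031.23 ms / 79 tokens ( 25.71 ms per token, 38.89 tokens per second)
llama_print_timings: eval time = 10300.94 ms / 253 runs ( 40.72 ms per token, 24.56 tokens per second)
"""

# Hypothetical pattern written for this log format only:
# "<stage> time = <total> ms / <count> runs|tokens"
PATTERN = re.compile(
    r"llama_print_timings:\s+(.+?) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:runs|tokens)"
)


def parse_timings(log):
    """Return {stage: (total_ms, count)} for every per-stage line in the log."""
    return {
        m.group(1): (float(m.group(2)), int(m.group(3)))
        for m in PATTERN.finditer(log)
    }


stages = parse_timings(LOG)

# Per-token latency of each stage, in milliseconds.
ms_per_token = {name: total / count for name, (total, count) in stages.items()}

# The calculation from this post: add the per-token latencies of all three
# stages and invert the sum.
summed_rate = 1000.0 / sum(ms_per_token.values())

# Decode-only rate, taken from the "eval time" line alone.
eval_total_ms, eval_count = stages["eval"]
eval_rate = eval_count / eval_total_ms * 1000.0

print(f"summed per-token latencies: {summed_rate:.2f} tokens/s")
print(f"eval stage only:            {eval_rate:.2f} tokens/s")
```

With the numbers above this prints 14.84 tokens/s for the summed calculation and 24.56 tokens/s for the eval stage alone, so the two approaches differ noticeably.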
Is this way of calculating it correct? If not, how is the generation rate calculated in your paper? I look forward to your reply!