
Replication of results with h2o_flexgen flex_opt.py #23

Open
g-x-w opened this issue Mar 13, 2024 · 1 comment

Comments


g-x-w commented Mar 13, 2024

I was trying to replicate the benchmark results shown in h2o_flexgen/benchmark/h2o. I was able to observe the decreases in peak GPU memory and latency and the increase in tokens/sec throughput, but the actual text generated during the inference task looked questionable (see the example output below), and I wanted to confirm whether or not this is expected behaviour. This is using the default implementation of flex_opt.py with facebook/opt-2.7b.

As an example, running inference with the provided example AI prompt produced the following output:
[Screenshot: example generated output, 2024-03-12]

And running inference again with the `--hh-ratio 0.2 --hh-all` flags enabling H2O produces the same output, though I do see the higher prefill and decode throughput. The two invocations I am comparing are sketched below.
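For reference, here is a sketch of the two runs being compared. Only the `--hh-ratio` and `--hh-all` flags and the facebook/opt-2.7b model come from this thread; the script path and the `--model` flag are assumptions based on the upstream FlexGen CLI, so the exact invocation may differ in your checkout:

```sh
# Baseline run of the default flex_opt.py (script path and --model flag
# are assumptions based on FlexGen's CLI, not quoted in this thread).
python flex_opt.py --model facebook/opt-2.7b

# Same run with H2O enabled via the heavy-hitter flags mentioned above:
# keep 20% of the KV cache as heavy-hitter tokens, applied to all layers.
python flex_opt.py --model facebook/opt-2.7b --hh-ratio 0.2 --hh-all
```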

@KangkangStu

How do you run flex_opt.py? I tried running it following the documentation, but I couldn't get the program to run.
