
Replication of results with h2o_flexgen flex_opt.py #23

Open
g-x-w opened this issue Mar 13, 2024 · 1 comment

Comments


g-x-w commented Mar 13, 2024

I was trying to replicate the benchmark results shown in h2o_flexgen/benchmark/h2o. I was able to observe the decreases in peak GPU memory and latency and the increase in tokens/sec throughput, but the actual text generated during the inference task looked questionable (see the example output below), and I wanted to confirm whether or not this is expected behaviour. This is using the default implementation of flex_opt.py with facebook/opt-2.7b.

As an example, running inference with the provided example AI prompt produced the following output:
[Screenshot: example generated output, 2024-03-12]

And running inference again with the `--hh-ratio 0.2 --hh-all` flags enabling H2O produces the same output, though I do see the higher prefill and decode throughput. The two invocations I am comparing are sketched below.
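For reference, here is a sketch of the two runs being compared. Only the `--hh-ratio` and `--hh-all` flags and the facebook/opt-2.7b model come from this thread; the script path and the `--model` flag are assumptions based on the upstream FlexGen CLI, so the exact invocation may differ in your checkout:

```sh
# Baseline run of the default flex_opt.py (script path and --model flag
# are assumptions based on FlexGen's CLI, not quoted in this thread).
python flex_opt.py --model facebook/opt-2.7b

# Same run with H2O enabled via the heavy-hitter flags mentioned above:
# keep 20% of the KV cache as heavy-hitter tokens, applied to all layers.
python flex_opt.py --model facebook/opt-2.7b --hh-ratio 0.2 --hh-all
```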

@KangkangStu

How do you run flex_opt.py? I tried running it following the documentation, but I couldn't get the program to run.
