
Added the files for WOQ of codegen25 using IPEX #3024

Open · wants to merge 4 commits into base: master
Conversation

bbhattar
Contributor

Description

I am adding an example for deploying the code generation model with IPEX.
We use IPEX weight-only quantization (WOQ) to convert the model weights to INT8 precision (see the sketch after the file list below).

Files:

  • README.md
  • codegen_handler.py - custom handler for quantizing and deploying the model
  • model-properties.yaml - config for model preparation
  • benchmark.sh - script for batching your inference requests
  • sample_text_0.txt - sample prompt for testing the code generation model
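
A minimal sketch of the weight-only quantization step (assuming IPEX's LLM optimization API; the checkpoint id, `lowp_mode`, and generation arguments are illustrative, and the WOQ helpers vary somewhat across IPEX releases, so the actual handler in this PR may differ):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint for illustration; the PR may target a different codegen25 variant.
model_id = "Salesforce/codegen25-7b-multi"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Weight-only quantization: INT8 weights with a BF16 low-precision compute path.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT8,
    lowp_mode=ipex.quantization.WoqLowpMode.BF16,
)
model = ipex.llm.optimize(model, quantization_config=qconfig, dtype=torch.bfloat16)

# Run generation under CPU autocast when BF16 compute is enabled.
with torch.inference_mode(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```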

Type of change

  • New example (non-breaking change which adds functionality)

Feature/Issue validation/testing

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@min-jean-cho
Collaborator

@lxning

Comment on lines 70 to 75
if self.lowp_mode == "BF16":
    self.amp_enabled = True
    self.amp_dtype = torch.bfloat16
else:
    self.amp_enabled = False
    self.amp_dtype = torch.float32
Collaborator


It's better to set amp in model-config.yaml.

Contributor Author


Updated to enable amp from model-config.yaml
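
For reference, a minimal sketch of reading the amp setting from model-config.yaml in the handler's initialize (the `lowp_mode` key and its default are illustrative, assuming the config is exposed through TorchServe's `ctx.model_yaml_config`):

```python
import torch

def initialize(self, ctx):
    # model-config.yaml (illustrative keys):
    #   handler:
    #     lowp_mode: "BF16"
    handler_cfg = ctx.model_yaml_config.get("handler", {})
    self.lowp_mode = handler_cfg.get("lowp_mode", "BF16")
    if self.lowp_mode == "BF16":
        self.amp_enabled = True
        self.amp_dtype = torch.bfloat16
    else:
        self.amp_enabled = False
        self.amp_dtype = torch.float32
```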

self.tokenizer.pad_token = self.tokenizer.eos_token


if self.benchmark:
Collaborator


In TS, the initialize function is used to load the model; here, the benchmark code runs inference on the sample input during initialization. TS supports customized metrics in the backend to measure each stage's latency during inference, so this section is not needed.
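
For example, per-stage latency can be emitted through the backend metrics object instead of an in-handler benchmark pass (a hedged sketch; the metric name and generation arguments are illustrative):

```python
import time
import torch

def inference(self, input_batch):
    metrics = self.context.metrics  # metrics store provided by the TorchServe context
    start = time.time()
    with torch.inference_mode():
        outputs = self.model.generate(**input_batch, max_new_tokens=self.max_new_tokens)
    # Custom latency metric, reported alongside the built-in HandlerTime/PredictionTime.
    metrics.add_time("GenerateTime", round((time.time() - start) * 1000), None, "ms")
    return outputs
```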

Contributor Author


Removed benchmark and benchmark-related args.

bbhattar requested a review from lxning on March 27, 2024, 21:36
anupren commented Apr 3, 2024

Apache Benchmark data for this PR (codegen model) on Xeon hardware [2 sockets, 32 physical cores per socket]:

| Model_Name | Benchmark | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Batch size | Batch delay | Workers | Concurrency | Input | Requests | Model_p50 | Model_p90 | Model_p99 | Queue time p50 | Queue time p90 | Queue time p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Codegen25 | AB | 0 | 0.2 | 4987 | 5151 | 5151 | 4987.605 | 0 | 1 | 200 | 1 | 1 | ipex_woq/sample_text_0.txt | 10 | 4928.27 | 4958.85 | 4958.85 | 0 | 0 | 0 | 4984.41 | 4984.23 | 0 | 1.2 |
| Codegen25 | AB | 0 | 0.36 | 5261 | 5410 | 5514 | 5535.294 | 0 | 2 | 200 | 1 | 2 | ipex_woq/sample_text_0.txt | 40 | 5221.29 | 5248.67 | 5248.67 | 0 | 1 | 100 | 5256.64 | 5256.45 | 5.2 | 2.24 |
| Codegen25 | AB | 0 | 0.63 | 5757 | 6152 | 6154 | 6379.648 | 0 | 4 | 200 | 1 | 4 | ipex_woq/sample_text_0.txt | 40 | 5738.32 | 5743.07 | 5743.07 | 1 | 99 | 100 | 5773.37 | 5773.18 | 10.78 | 3.55 |
