
Added the files for WOQ of codegen25 using IPEX #3024

Open · wants to merge 4 commits into base: master
Conversation

bbhattar
Contributor

Description

I am adding an example for deploying the code generation model with IPEX.
We use IPEX weight-only quantization (WOQ) to convert the model weights to INT8 precision (see the sketch after the file list below).

Files:

  • README.md
  • codegen_handler.py - custom handler for quantizing and deploying the model
  • model-properties.yaml - config for model preparation
  • benchmark.sh - script for batching your inference requests
  • sample_text_0.txt - sample prompt for testing the code generation model
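
A minimal sketch of the weight-only quantization step (assuming IPEX's LLM optimization API; the checkpoint id, `lowp_mode`, and generation arguments are illustrative, and the WOQ helpers vary somewhat across IPEX releases, so the actual handler in this PR may differ):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint for illustration; the PR may target a different codegen25 variant.
model_id = "Salesforce/codegen25-7b-multi"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Weight-only quantization: INT8 weights with a BF16 low-precision compute path.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT8,
    lowp_mode=ipex.quantization.WoqLowpMode.BF16,
)
model = ipex.llm.optimize(model, quantization_config=qconfig, dtype=torch.bfloat16)

# Run generation under CPU autocast when BF16 compute is enabled.
with torch.inference_mode(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```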

Type of change

  • New example (non-breaking change which adds functionality)

Feature/Issue validation/testing

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@min-jean-cho
Collaborator

@lxning

Comment on lines 70 to 75
if self.lowp_mode == "BF16":
    self.amp_enabled = True
    self.amp_dtype = torch.bfloat16
else:
    self.amp_enabled = False
    self.amp_dtype = torch.float32
Collaborator


It's better to set amp in model-config.yaml.

Contributor Author


Updated to enable amp from model-config.yaml
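
For reference, a minimal sketch of reading the amp setting from model-config.yaml in the handler's initialize (the `lowp_mode` key and its default are illustrative, assuming the config is exposed through TorchServe's `ctx.model_yaml_config`):

```python
import torch

def initialize(self, ctx):
    # model-config.yaml (illustrative keys):
    #   handler:
    #     lowp_mode: "BF16"
    handler_cfg = ctx.model_yaml_config.get("handler", {})
    self.lowp_mode = handler_cfg.get("lowp_mode", "BF16")
    if self.lowp_mode == "BF16":
        self.amp_enabled = True
        self.amp_dtype = torch.bfloat16
    else:
        self.amp_enabled = False
        self.amp_dtype = torch.float32
```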

self.tokenizer.pad_token = self.tokenizer.eos_token


if self.benchmark:
Collaborator


In TS, the initialize function is used to load the model; here, the benchmark code runs inference on the sample input during initialization. TS supports customized metrics in the backend to measure each stage's latency during inference, so this section is not needed.
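
For example, per-stage latency can be emitted through the backend metrics object instead of an in-handler benchmark pass (a hedged sketch; the metric name and generation arguments are illustrative):

```python
import time
import torch

def inference(self, input_batch):
    metrics = self.context.metrics  # metrics store provided by the TorchServe context
    start = time.time()
    with torch.inference_mode():
        outputs = self.model.generate(**input_batch, max_new_tokens=self.max_new_tokens)
    # Custom latency metric, reported alongside the built-in HandlerTime/PredictionTime.
    metrics.add_time("GenerateTime", round((time.time() - start) * 1000), None, "ms")
    return outputs
```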

Contributor Author


Removed benchmark and benchmark-related args.

bbhattar requested a review from lxning on March 27, 2024, 21:36
anupren commented Apr 3, 2024

Apache Benchmark data for this PR (codegen model) on Xeon hardware [2 sockets, 32 physical cores per socket]:

| Model_Name | Benchmark | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Batch size | Batch delay | Workers | Concurrency | Input | Requests | Model_p50 | Model_p90 | Model_p99 | Queue time p50 | Queue time p90 | Queue time p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Codegen25 | AB | 0 | 0.2 | 4987 | 5151 | 5151 | 4987.605 | 0 | 1 | 200 | 1 | 1 | ipex_woq/sample_text_0.txt | 10 | 4928.27 | 4958.85 | 4958.85 | 0 | 0 | 0 | 4984.41 | 4984.23 | 0 | 1.2 |
| Codegen25 | AB | 0 | 0.36 | 5261 | 5410 | 5514 | 5535.294 | 0 | 2 | 200 | 1 | 2 | ipex_woq/sample_text_0.txt | 40 | 5221.29 | 5248.67 | 5248.67 | 0 | 1 | 100 | 5256.64 | 5256.45 | 5.2 | 2.24 |
| Codegen25 | AB | 0 | 0.63 | 5757 | 6152 | 6154 | 6379.648 | 0 | 4 | 200 | 1 | 4 | ipex_woq/sample_text_0.txt | 40 | 5738.32 | 5743.07 | 5743.07 | 1 | 99 | 100 | 5773.37 | 5773.18 | 10.78 | 3.55 |
