
Mixtral Mixture of Experts example #3075

Open · wants to merge 5 commits into base: master

Conversation


@agunapal agunapal commented Apr 4, 2024

Description

This example shows how to deploy the Mixtral-8x7B model with HuggingFace, with the following features (see the loading sketch after this list):

  • low_cpu_mem_usage=True for loading with limited resource using accelerate
  • 8-bit quantization using bitsandbytes
  • Accelerated Transformers using optimum
  • TorchServe streaming response
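
For orientation, here is a minimal sketch of how these loading options fit together, assuming a Hugging Face checkpoint id; this is illustrative, not the PR's actual handler code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed checkpoint id; the handler in this PR resolves the path from the model artifacts.
model_path = "mistralai/Mixtral-8x7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,  # stream weights in via accelerate instead of materializing twice
    device_map="balanced",   # spread the experts across the available GPUs
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit via bitsandbytes
    torch_dtype=torch.float16,
)
# Accelerated Transformers via optimum's BetterTransformer
# (optional; not every architecture supports it):
model = model.to_bettertransformer()
```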
python test_streaming.py

produces the output:

What is the difference between cricket and baseball?

- Cricket is a bat-and-ball game played between two teams of eleven players each on a field at the center of which is a rectangular 22-yard-long pitch. Each team takes its turn to bat,
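
For reference, a plausible sketch of what `test_streaming.py` might look like, assuming the model is registered as `mixtral` on the default inference port (both names are assumptions); the payload shape follows the handler in this diff:

```python
import json

import requests

# A prompt plus nested generation params, matching input_text["prompt"] / input_text["params"].
payload = {
    "prompt": "What is the difference between cricket and baseball?",
    "params": {"max_new_tokens": 50, "temperature": 0.8, "top_p": 0.95},
}

response = requests.post(
    "http://localhost:8080/predictions/mixtral",
    data=json.dumps(payload),
    stream=True,  # TorchServe sends intermediate predictions as chunks
)
for chunk in response.iter_content(chunk_size=None):
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
```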

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal agunapal marked this pull request as ready for review April 4, 2024 18:03
@agunapal agunapal requested review from chauhang and lxning April 4, 2024 18:03
@lxning lxning left a comment


Can you please move this example under the folder /example/large_models/Huggingface_accelerate?

logger.info("Model %s loading tokenizer", ctx.model_name)
self.model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="balanced",

According to the link, it seems "auto" can cover future changes. Alternatively, you could let users define it in model-config.yaml.

Quoted from the link:
"The options "auto" and "balanced" produce the same results for now, but the behavior of "auto" might change in the future if we find a strategy that makes more sense, while "balanced" will stay stable."

- `low_cpu_mem_usage=True` for loading with limited resource using `accelerate`
- 8-bit quantization using `bitsandbytes`
- `Accelerated Transformers` using `optimum`
- TorchServe streaming response

Is this example going to demo micro-batching + streaming or continuous batching + streaming?

)
self.model.resize_token_embeddings(self.model.config.vocab_size + 1)

self.output_streamer = TextIteratorStreamerBatch(

You can use the TorchServe customized TextIteratorStreamerBatch with micro-batching. See the inf2 example:

self.output_streamer = TextIteratorStreamerBatch(

input_ids, attention_mask = self.encode_input_text(input_text["prompt"])
input_ids_batch.append(input_ids)
attention_mask_batch.append(attention_mask)
params.append(input_text["params"])

We are going to use the OpenAI payload style (see https://github.com/lxning/benchmark-locust/blob/ts/llm_bench/load_test.py#L254) since the OpenAI API is the most popular; there the params are flattened. A sketch of the difference follows.
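
For illustration, the flattened OpenAI-completions-style body versus the current nested one (field values are made up):

```python
# Current nested schema in this PR:
nested = {
    "prompt": "What is the difference between cricket and baseball?",
    "params": {"max_new_tokens": 50, "temperature": 0.8, "top_p": 0.95},
}

# Flattened OpenAI-style schema the comment asks for (note max_tokens, not max_new_tokens):
flattened = {
    "model": "mixtral",  # assumed registered model name
    "prompt": "What is the difference between cricket and baseball?",
    "max_tokens": 50,
    "temperature": 0.8,
    "top_p": 0.95,
    "stream": True,
}
```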

Comment on lines +160 to +177
self.output_streamer = TextIteratorStreamerBatch(
    self.tokenizer,
    batch_size=len(input_ids_batch),
    skip_special_tokens=True,
)
generation_kwargs = dict(
    inputs=input_ids_batch,
    attention_mask=attention_mask_batch,
    streamer=self.output_streamer,
    max_new_tokens=params[0]["max_new_tokens"],
    temperature=params[0]["temperature"],
    top_p=params[0]["top_p"],
)
thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in self.output_streamer:
    send_intermediate_predict_response(

You can check the inf2 example to update this section to combine micro-batching + streaming; a rough sketch of that pattern follows.
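
Roughly, the inf2 example's streaming loop looks like the following; `micro_batch_req_id_map` and the slicing are taken from that example and may need adapting here:

```python
# Each iteration yields one decoded string per micro-batch element; forward them
# to the matching request ids as an intermediate (streaming) response.
for new_text in self.output_streamer:
    send_intermediate_predict_response(
        new_text[: len(micro_batch_req_id_map)],
        micro_batch_req_id_map,
        "Intermediate Prediction success",
        200,
        self.context,
    )
```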
