
Conversation

@wang2yn84 (Collaborator)

The Mixtral 8x7B model is working for both offline and online serving, in bf16 and int8. Let's get this in first so we can parallelize the work. Tests will be added in the coming PRs.

@wang2yn84 wang2yn84 requested review from FanhaiLu1 and qihqi June 10, 2024 21:38
@qihqi (Collaborator) commented Jun 10, 2024

Please make sure the name is mixtral and not mistral. We might add Mistral 7B (the non-MoE version) later, so it would be confusing.

README.md Outdated
export input_ckpt_dir=Original llama weights directory
export output_ckpt_dir=The output directory
export model_name="llama-3" # or "llama-2", "gemma"
export model_name="llama-3" # or "llama-2", "gemma", "mistral"
Collaborator

change this to mixtral

Collaborator Author

Thanks. I was confused about the name initially, which is why there is a mix of mistral and mixtral. I've changed everything to Mixtral. Done.
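
For context, the exported variables in the README snippet above feed the checkpoint-conversion step. A minimal sketch of how a conversion entry point might read them, purely illustrative (the variable handling below is an assumption, not the repo's actual script):

import os

# Illustrative only: read the environment variables exported in the README.
input_ckpt_dir = os.environ["input_ckpt_dir"]    # original checkpoint location
output_ckpt_dir = os.environ["output_ckpt_dir"]  # where converted weights are written
model_name = os.environ.get("model_name", "mixtral")  # "llama-2", "llama-3", "gemma", or "mixtral"

print(f"Converting {model_name} weights from {input_ckpt_dir} to {output_ckpt_dir}")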

torch.empty(config.num_experts, config.intermediate_size, config.dim)
)

def forward(self, x: Tensor, expert_indices: Tensor) -> Tensor:
Collaborator

I had a change to use different logic for longer seqlen and I pushed it to your branch; was that lost in the merge?

Collaborator

also the quantized change

Collaborator Author

This is the original model. Your changes are in model.py
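
For readers following the thread, the hunk above stores the expert weights stacked per expert. A minimal sketch of such a conditional feed-forward, assuming a gpt-fast-style layout where w1/w3 have shape (num_experts, intermediate_size, dim) and w2 has shape (num_experts, dim, intermediate_size); this is illustrative, not necessarily the PR's exact implementation:

import torch
import torch.nn.functional as F
from torch import nn, Tensor

class ConditionalFeedForward(nn.Module):
    def __init__(self, num_experts: int, dim: int, intermediate_size: int):
        super().__init__()
        # One stacked weight per expert, matching the torch.empty(...) layout above.
        self.w1 = nn.Parameter(torch.empty(num_experts, intermediate_size, dim))
        self.w2 = nn.Parameter(torch.empty(num_experts, dim, intermediate_size))
        self.w3 = nn.Parameter(torch.empty(num_experts, intermediate_size, dim))

    def forward(self, x: Tensor, expert_indices: Tensor) -> Tensor:
        # x: (tokens, dim); expert_indices: (tokens, experts_per_token)
        w1 = self.w1[expert_indices]  # (tokens, experts_per_token, intermediate_size, dim)
        w2 = self.w2[expert_indices]  # (tokens, experts_per_token, dim, intermediate_size)
        w3 = self.w3[expert_indices]
        x1 = F.silu(torch.einsum("ti,taoi->tao", x, w1))  # gated projection
        x3 = torch.einsum("ti,taoi->tao", x, w3)
        return torch.einsum("tao,taio->tai", x1 * x3, w2)  # per-expert outputs

A gating layer would produce expert_indices (top-k experts per token) and combine these per-expert outputs with the corresponding routing weights.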

@qihqi qihqi self-requested a review June 10, 2024 21:51
@FanhaiLu1 (Collaborator) left a comment

Thanks for adding Mixtral, the code is clean and overall it looks good!

"layers.{}.attention.wk.weight": "layers.{}.attention.wk.weight",
"layers.{}.attention.wv.weight": "layers.{}.attention.wv.weight",
"layers.{}.attention.wo.weight": "layers.{}.attention.wo.weight",
"layers.{}.block_sparse_moe.w1": "layers.{}.block_sparse_moe.cond_ffn.w1",
Collaborator

Looks like only these weight names are different; can we store only the differing names in the map?

Collaborator Author

Good point, removed
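
For reference, a minimal sketch of the suggestion above: keep only the names that actually change in the map and fall back to the original name otherwise (the w2/w3 entries and the helper below are illustrative assumptions, not the merged code):

# Illustrative only: store just the names that change during conversion.
_DIFFERING_WEIGHT_NAMES = {
    "layers.{}.block_sparse_moe.w1": "layers.{}.block_sparse_moe.cond_ffn.w1",
    "layers.{}.block_sparse_moe.w2": "layers.{}.block_sparse_moe.cond_ffn.w2",
    "layers.{}.block_sparse_moe.w3": "layers.{}.block_sparse_moe.cond_ffn.w3",
}

def map_weight_name(name: str) -> str:
    # Unchanged names (e.g. the attention weights above) map to themselves.
    return _DIFFERING_WEIGHT_NAMES.get(name, name)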

@wang2yn84 wang2yn84 merged commit d6bf068 into main Jun 11, 2024
@qihqi qihqi deleted the mixtral branch July 15, 2024 17:55