
Support for "mistralai/Mistral-7B-Instruct-v0.1" model #1501

Closed
Matthieu-Tinycoaching opened this issue Sep 28, 2023 · 38 comments

Comments

@Matthieu-Tinycoaching

Hi,

Would it be possible to add support for "mistralai/Mistral-7B-Instruct-v0.1" model?

@vince62s
Member

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. With much longer inputs it may break.

@winstxnhdw

winstxnhdw commented Sep 29, 2023

Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.

@vince62s
Member

RAG?

@BBC-Esq

BBC-Esq commented Sep 29, 2023

Retrieval augmented generation, as in creating a vector database and querying it for results, then appending those results to the user's query; both are sent to an LLM for an answer. It lets you ask an LLM about specific information that is past the model's knowledge cutoff date, for example. Very powerful.
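To make the flow concrete, here is a minimal sketch of that pipeline in plain Python, with a naive keyword-overlap scorer standing in for a real embedding model and vector database (all names and the retrieval logic are illustrative only, not tied to any particular library):

def retrieve(query, documents, k=2):
    # Score each document by word overlap with the query; a real RAG setup
    # would use an embedding model plus a vector database instead.
    query_words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: -len(query_words & set(d.lower().split())))
    return ranked[:k]

def build_prompt(query, documents):
    # Append the retrieved chunks to the user's question; this combined
    # prompt is what gets sent to the LLM.
    context = "\n\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

With large retrieved chunks this prompt can easily grow past 4096 tokens, which is why the sliding-window question matters here.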

@vince62s
Member

And what is the common usage of this with sequence lengths higher than 4096?

@winstxnhdw

winstxnhdw commented Sep 29, 2023

You can certainly do RAG decently under 4096, but typically the point of RAG is to make use of as much context as possible.

@vince62s
Member

But again, the sliding window is only for the attention mask; it does not mean that it will "break".
If something breaks, it's just because the sequence length is way too long and it will OOM by itself.
It does not mean results will be bad.
Anyway, I am implementing the sliding mask in OpenNMT-py and will check how easy it is to replicate in ct2.

@winstxnhdw

You are right, I misunderstood their article. My apologies.

@MrigankRaman

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. With much longer inputs it may break.

What would be the command to use the llama converter for Mistral?

@winstxnhdw

I've uploaded the converted model to Hugging Face. See here.


@NeonBohdan

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. With much longer inputs it may break.

When I do this

ct2-transformers-converter --model mistralai/Mistral-7B-v0.1 --quantization int8 --output_dir ./models/ctranslate2 --low_cpu_mem_usage

It outputs

ValueError: No conversion is registered for the model configuration MistralConfig

Maybe the model type needs to be changed too?

@vince62s
Member

vince62s commented Oct 2, 2023

Did you try to change this line: https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197 to MistralConfig?
If this is not enough we'll need to add the config; otherwise you can download the converted file directly from @winstxnhdw.

@manishiitg

@winstxnhdw is it possible to share how you did the conversion? I am getting the same error:

ValueError: No conversion is registered for the model configuration MistralConfig

@MrigankRaman

@winstxnhdw is it possible to share how you did the conversion? I am getting the same error:

ValueError: No conversion is registered for the model configuration MistralConfig

I solved it. I went to https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197, copied llama_loader, created a new function, and registered MistralConfig with the new function. Basically, copy the llama loader and register the Mistral config.
(screenshot of the modified converter code)
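For reference, a minimal sketch of that approach, assuming you are editing python/ctranslate2/converters/transformers.py where register_loader and the existing LlamaLoader class are defined (untested; the subclass name is my own choice):

@register_loader("MistralConfig")
class MistralLoader(LlamaLoader):
    # Reuse the Llama weight-loading logic unchanged; only the architecture
    # name reported by Mistral checkpoints differs.
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

This ignores the sliding window entirely, as discussed above.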

@vince62s
Member

vince62s commented Oct 2, 2023

Just a nice reminder: this will behave 100% as Mistral as long as the sequence length is <= 4096 tokens.
It would be interesting to see how it behaves with longer sequences.

@MrigankRaman

Just a nice reminder: this will behave 100% as Mistral as long as the sequence length is <= 4096 tokens.
It would be interesting to see how it behaves with longer sequences.

When will ctranslate2 support SWA?

@BBC-Esq

BBC-Esq commented Oct 2, 2023

@winstxnhdw is it possible to share how you did the conversion? I am getting the same error:

ValueError: No conversion is registered for the model configuration MistralConfig

I solved it. I went to https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197, copied llama_loader, created a new function, and registered MistralConfig with the new function. Basically, copy the llama loader and register the Mistral config. (screenshot of the modified converter code)

Can you please post your code instead of a picture of it?

@wsxiaoys
Contributor

wsxiaoys commented Oct 2, 2023

@register_loader("MistralConfig")
class MistralLoader(ModelLoader):
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

    def get_model_spec(self, model):
        num_layers = model.config.num_hidden_layers

        num_heads = model.config.num_attention_heads
        num_heads_kv = getattr(model.config, "num_key_value_heads", num_heads)
        if num_heads_kv == num_heads:
            num_heads_kv = None

        spec = transformer_spec.TransformerDecoderModelSpec.from_config(
            num_layers,
            num_heads,
            activation=common_spec.Activation.SWISH,
            pre_norm=True,
            ffn_glu=True,
            rms_norm=True,
            rotary_dim=0,
            rotary_interleave=False,
            num_heads_kv=num_heads_kv,
        )

        self.set_decoder(spec.decoder, model.model)
        self.set_linear(spec.decoder.projection, model.lm_head)
        return spec

    def get_vocabulary(self, model, tokenizer):
        tokens = super().get_vocabulary(model, tokenizer)

        extra_ids = model.config.vocab_size - len(tokens)
        for i in range(extra_ids):
            tokens.append("<extra_id_%d>" % i)

        return tokens

    def set_vocabulary(self, spec, tokens):
        spec.register_vocabulary(tokens)

    def set_config(self, config, model, tokenizer):
        config.bos_token = tokenizer.bos_token
        config.eos_token = tokenizer.eos_token
        config.unk_token = tokenizer.unk_token
        config.layer_norm_epsilon = model.config.rms_norm_eps

    def set_layer_norm(self, spec, layer_norm):
        spec.gamma = layer_norm.weight

    def set_decoder(self, spec, module):
        spec.scale_embeddings = False
        self.set_embeddings(spec.embeddings, module.embed_tokens)
        self.set_layer_norm(spec.layer_norm, module.norm)

        for layer_spec, layer in zip(spec.layer, module.layers):
            self.set_layer_norm(
                layer_spec.self_attention.layer_norm, layer.input_layernorm
            )
            self.set_layer_norm(
                layer_spec.ffn.layer_norm, layer.post_attention_layernorm
            )

            wq = layer.self_attn.q_proj.weight
            wk = layer.self_attn.k_proj.weight
            wv = layer.self_attn.v_proj.weight
            wo = layer.self_attn.o_proj.weight

            layer_spec.self_attention.linear[0].weight = torch.cat([wq, wk, wv])
            layer_spec.self_attention.linear[1].weight = wo

            self.set_linear(layer_spec.ffn.linear_0, layer.mlp.gate_proj)
            self.set_linear(layer_spec.ffn.linear_0_noact, layer.mlp.up_proj)
            self.set_linear(layer_spec.ffn.linear_1, layer.mlp.down_proj)

            delattr(layer, "self_attn")
            delattr(layer, "mlp")
            gc.collect()

Here's a snippet with which I successfully conducted the conversion. Not sure if it's good to send out a PR, given the sliding window support is not there yet.

@BBC-Esq

BBC-Esq commented Oct 2, 2023

(quoting the MistralLoader snippet and conversion note from the previous comment)

Awesome. Any chance we can get a bfloat16 CTranslate2 edition, since the model is originally in bfloat16? That way we could use quantizations at run time other than int8.
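For what it's worth, recent CTranslate2 releases also accept bfloat16 as a quantization type, so once the Mistral config is registered a conversion along these lines should work (untested sketch; the output directory name is arbitrary):

ct2-transformers-converter --model mistralai/Mistral-7B-Instruct-v0.1 --quantization bfloat16 --output_dir ./models/mistral-7b-instruct-bf16 --low_cpu_mem_usage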

@BBC-Esq

BBC-Esq commented Oct 2, 2023

Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.

Speaking of RAG: my other posts have been inquiring about getting ctranslate2 to work with the "instructor" class of embedding models, like instructor-xl. I'm being serious here: since you successfully converted Mistral by modifying the ctranslate2 scripts, I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally. This is very important to me, so hit me up if you want to discuss. I'd be happy to share my credentials, law firm website, or whatever it takes so we can do this and make payment remotely. Thanks.

@silvacarl2

I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally.

We second this (the paying part, that is), although we are focused on healthcare. ctranslate2 is awesome.

@BBC-Esq

BBC-Esq commented Oct 2, 2023

I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally.

We second this (the paying part, that is), although we are focused on healthcare. ctranslate2 is awesome.

Let's do this: we'll split the cost 50/50 for whichever freelance programmer actually does it. We'll need to discuss the amount of time and cost first, of course. ;-)

@silvacarl2

Confirmed. We are also looking into fine-tuning this model, although it does not need very much.

From our tests, this model works the best out of the box, vanilla, across a variety of tests we have for our use case.

@BBC-Esq

BBC-Esq commented Oct 2, 2023

Confirmed. We are also looking into fine-tuning this model, although it does not need very much.

From our tests, this model works the best out of the box, vanilla, across a variety of tests we have for our use case.

I agree, and even though it's a resource hog (relative to other embedding models) it's worth it IMHO.

@winstxnhdw

This comment was marked as off-topic.

@BBC-Esq

BBC-Esq commented Oct 3, 2023

I've just noticed that it performs significantly better when I use it. Not sure why exactly; I know that different models perform differently depending on the type of text fed to them, but that's just what I've noticed. Any interest?

@silvacarl2

Will check out the leaderboard and run some tests, thanks.

@winstxnhdw

This comment was marked as off-topic.

@BBC-Esq

BBC-Esq commented Oct 7, 2023

I'm sorry, are you saying that bge-en-large-1.5 allows you to enter instructions like instructor-xl does?

@winstxnhdw

This comment was marked as off-topic.

@vince62s
Member

@winstxnhdw do you have a use case to test #1528? It would require passing a very long prompt (> 4096 tokens, maybe double that) and seeing whether it outputs a consistent completion.
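For reference, a rough sketch of the kind of check meant here, using the CTranslate2 Python generation API (the model directory, prompt contents, and decoding settings are placeholders, not a prescribed procedure):

import ctranslate2
import transformers

model_dir = "mistral-7b-instruct-ct2"  # hypothetical path to the converted model
tokenizer = transformers.AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
generator = ctranslate2.Generator(model_dir, device="cuda")

# Build a prompt that is well beyond the 4096-token sliding window.
long_context = " ".join(["Some long document text."] * 2000)
prompt = f"[INST] Summarise the following:\n{long_context} [/INST]"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
print("prompt tokens:", len(tokens))

results = generator.generate_batch(
    [tokens],
    max_length=256,
    sampling_topk=1,  # greedy decoding so runs are comparable
    include_prompt_in_result=False,
)
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(results[0].sequences[0])))

If the completion stays coherent with such an input, the sliding-window handling is doing its job.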

@winstxnhdw

Yeah, easily, but I am really busy this week. I can maybe test something this weekend. Will update.

@muhtalhakhan

This comment was marked as off-topic.

@winstxnhdw

This comment was marked as off-topic.

@muhtalhakhan

This comment was marked as off-topic.

@vince62s
Member

vince62s commented Nov 3, 2023

I closed #1528 and worked with @minhthuc2502 on #1524.

Still WIP; not good so far.

@vince62s
Member

We just merged #1524, great teamwork with @minhthuc2502.
Mistral should now run fine with very long inputs. I just recommend using int8_float16 when converting; plain float16 may go OOM quite easily on a 24GB GPU.
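For example, a conversion along the recommended lines would look something like this (sketch only; paths are arbitrary, and it assumes a CTranslate2 version with the Mistral loader from this thread):

ct2-transformers-converter --model mistralai/Mistral-7B-Instruct-v0.1 --quantization int8_float16 --output_dir ./models/mistral-7b-instruct-ct2 --low_cpu_mem_usage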
