
Support MADLAD-400 - multilingual machine translation model based on the T5 architecture #1560

Closed
carolinaxxxxx opened this issue Nov 20, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@carolinaxxxxx

Hi Team,

please consider adding support for models from the collection: https://huggingface.co/collections/jbochi/madlad-400-65491e6a78726cac9a4b84b7

Short description:

MADLAD-400 is a multilingual machine translation model based on the T5 architecture that was trained on 250 billion tokens covering over 450 languages using publicly available data. It is competitive with models that are significantly larger.

Paper: https://huggingface.co/papers/2309.04662

Thank you very much for your work. Best regards 👍 🥇

@vince62s
Member

I think @Ehsan-Jahanbakhsh converted it using the T5 template. How did it go?

@vince62s vince62s added the enhancement New feature or request label Nov 21, 2023
@Ehsan-Jahanbakhsh
Contributor

Ehsan-Jahanbakhsh commented Nov 21, 2023

It works well. The Hugging Face model (link) had a broken tokenizer, which has since been fixed.
ct2-transformers-converter works now (although #1552 is needed for the model to convert correctly).

@vince62s
Member

Closing then; this will work in the next release.

@carolinaxxxxx
Author

carolinaxxxxx commented Nov 22, 2023

@Ehsan-Jahanbakhsh How do you use MADLAD-400 with CTranslate2?

I converted the model and I use it with:

import ctranslate2
import transformers

translator = ctranslate2.Translator("/path/to/model", device="cuda", compute_type="float16")
tokenizer = transformers.AutoTokenizer.from_pretrained("jbochi/madlad400-7b-mt")

input_text = "<2de> The house is wonderful."
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens])

output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

print(output_text)

Result:

<unk><unk><unk>… (a long run of nothing but <unk> tokens; trimmed here for readability)

I will be grateful for your tips. Greetings!

@Ehsan-Jahanbakhsh
Contributor

Ehsan-Jahanbakhsh commented Nov 22, 2023

See This.
Do you have the same problem with the 3B version?

@carolinaxxxxx
Author

carolinaxxxxx commented Nov 23, 2023

@Ehsan-Jahanbakhsh

> See This. Do you have the same problem with 3b version?

Same here.

Conversion:

ct2-transformers-converter --model jbochi/madlad400-3b-mt --quantization float16

Test code:

import ctranslate2
import transformers

translator = ctranslate2.Translator("/path/to/model", device="cuda", compute_type="float16")
tokenizer = transformers.AutoTokenizer.from_pretrained("jbochi/madlad400-3b-mt")

input_text = "<2de> The house is wonderful."
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens])

output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

print(output_text)

Result:

<unk><unk><unk>… (again, a long run of nothing but <unk> tokens; trimmed here for readability)

config.json:

{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<unk>",
  "eos_token": "</s>",
  "layer_norm_epsilon": null,
  "unk_token": "<unk>"
}
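As a quick programmatic sanity check (a sketch, not an official tool), the generated config.json can be inspected with the stdlib json module. Note that `decoder_start_token` being `<unk>` is what the converter emits for this model:

```python
import json

# The config.json emitted by ct2-transformers-converter for madlad400-3b-mt,
# copied verbatim from this thread.
config_text = """
{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<unk>",
  "eos_token": "</s>",
  "layer_norm_epsilon": null,
  "unk_token": "<unk>"
}
"""

config = json.loads(config_text)

# Basic sanity checks on the converted model's special tokens.
print(config["decoder_start_token"])  # <unk>
print(config["eos_token"])            # </s>
```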

@Ehsan-Jahanbakhsh
Contributor

I was not able to reproduce these results. Please run inference on CPU with float32 or int8 and see if the problem persists.

@carolinaxxxxx
Author

@Ehsan-Jahanbakhsh int8 on CPU is OK, result:

Das Haus ist wunderbar.

Do you have any idea what might be causing it? Maybe something with the conversion? Can you share the models you converted?

Thx ✌️

@Ehsan-Jahanbakhsh
Contributor

I don't have the means, but it would be cool if someone tested float32 on a GPU.

@carolinaxxxxx
Author

carolinaxxxxx commented Nov 23, 2023

@Ehsan-Jahanbakhsh float32 on GPU is ok, result:

Das Haus ist wunderbar.

So I guess there is something wrong with the float16 conversion. Any idea?

@vince62s
Member

It's a known issue with T5 models. Search for it and you will find discussions on this.
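For background (a general illustration, not CTranslate2-specific): IEEE-754 float16 tops out at 65504, and some T5 activations exceed that range, so they overflow during float16 inference and the decoder emits garbage like the `<unk>` runs above. A minimal, stdlib-only sketch of the range limit:

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE-754 half-precision value

def overflows_float16(x: float) -> bool:
    """Return True if x cannot be stored as a finite float16."""
    try:
        # "e" is the half-precision format character; packing a value
        # outside the representable range raises an error.
        struct.pack("e", x)
        return False
    except (OverflowError, struct.error):
        return True

print(overflows_float16(60000.0))  # False: representable in float16
print(overflows_float16(70000.0))  # True: beyond the float16 range
```

This is why running the same converted weights in float32 (or quantized int8) avoids the problem, as confirmed above.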

@carolinaxxxxx
Author

Ok, I'll look around. Thx.

@vince62s
Member

#1074

@carolinaxxxxx
Author

@vince62s thanks for that ✌️


3 participants