
Support MADLAD-400 - multilingual machine translation model based on the T5 architecture #1560

Closed
carolinaxxxxx opened this issue Nov 20, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@carolinaxxxxx

Hi Team,

please consider adding support for models from the collection: https://huggingface.co/collections/jbochi/madlad-400-65491e6a78726cac9a4b84b7

Short description:

MADLAD-400 is a multilingual machine translation model based on the T5 architecture that was trained on 250 billion tokens covering over 450 languages using publicly available data. It is competitive with models that are significantly larger.

Paper: https://huggingface.co/papers/2309.04662

Thank you very much for your work. Best regards 👍 🥇

@vince62s
Member

I think @Ehsan-Jahanbakhsh converted it using the T5 template. How did it go?

@vince62s vince62s added the enhancement New feature or request label Nov 21, 2023
@Ehsan-Jahanbakhsh
Contributor

Ehsan-Jahanbakhsh commented Nov 21, 2023

It works well. The Hugging Face model (link) had a broken tokenizer, which has since been fixed.
ct2-transformers-converter works now (although #1552 is needed for the model to convert correctly).

@vince62s
Member

Closing then; this will work in the next release.

@carolinaxxxxx
Author

carolinaxxxxx commented Nov 22, 2023

@Ehsan-Jahanbakhsh How do you use MADLAD-400 with CTranslate2?

I converted the model and I use it with:

import ctranslate2
import transformers

translator = ctranslate2.Translator("/path/to/model", device="cuda", compute_type="float16")
tokenizer = transformers.AutoTokenizer.from_pretrained("jbochi/madlad400-7b-mt")

input_text = "<2de> The house is wonderful."
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens])

output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

print(output_text)

Result:

<unk><unk><unk>… (a long run of nothing but <unk> tokens; trimmed here for readability)

I will be grateful for your tips. Greetings!

@Ehsan-Jahanbakhsh
Contributor

Ehsan-Jahanbakhsh commented Nov 22, 2023

See This.
Do you have the same problem with the 3B version?

@carolinaxxxxx
Author

carolinaxxxxx commented Nov 23, 2023

@Ehsan-Jahanbakhsh

> See This. Do you have the same problem with 3b version?

Same here.

Conversion:

ct2-transformers-converter --model jbochi/madlad400-3b-mt --quantization float16

Test code:

import ctranslate2
import transformers

translator = ctranslate2.Translator("/path/to/model", device="cuda", compute_type="float16")
tokenizer = transformers.AutoTokenizer.from_pretrained("jbochi/madlad400-3b-mt")

input_text = "<2de> The house is wonderful."
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens])

output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

print(output_text)

Result:

<unk><unk><unk>… (again, a long run of nothing but <unk> tokens; trimmed here for readability)

config.json:

{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<unk>",
  "eos_token": "</s>",
  "layer_norm_epsilon": null,
  "unk_token": "<unk>"
}
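As a quick programmatic sanity check (a sketch, not an official tool), the generated config.json can be inspected with the stdlib json module. Note that `decoder_start_token` being `<unk>` is what the converter emits for this model:

```python
import json

# The config.json emitted by ct2-transformers-converter for madlad400-3b-mt,
# copied verbatim from this thread.
config_text = """
{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<unk>",
  "eos_token": "</s>",
  "layer_norm_epsilon": null,
  "unk_token": "<unk>"
}
"""

config = json.loads(config_text)

# Basic sanity checks on the converted model's special tokens.
print(config["decoder_start_token"])  # <unk>
print(config["eos_token"])            # </s>
```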

@Ehsan-Jahanbakhsh
Contributor

I was not able to reproduce these results. Please run inference on CPU with float32 or int8 and see if the problem persists.

@carolinaxxxxx
Author

@Ehsan-Jahanbakhsh int8 on CPU is OK, result:

Das Haus ist wunderbar.

Do you have any idea what might be causing it? Maybe something with the conversion? Can you share the models you converted?

Thx ✌️

@Ehsan-Jahanbakhsh
Contributor

I don't have the means, but it would be cool if someone tested float32 on a GPU.

@carolinaxxxxx
Author

carolinaxxxxx commented Nov 23, 2023

@Ehsan-Jahanbakhsh float32 on GPU is ok, result:

Das Haus ist wunderbar.

So I guess there is something wrong with the float16 conversion. Any idea?

@vince62s
Member

It's a known issue with T5 models. Search for it and you will find discussions on this.
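For background (a general illustration, not CTranslate2-specific): IEEE-754 float16 tops out at 65504, and some T5 activations exceed that range, so they overflow during float16 inference and the decoder emits garbage like the `<unk>` runs above. A minimal, stdlib-only sketch of the range limit:

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE-754 half-precision value

def overflows_float16(x: float) -> bool:
    """Return True if x cannot be stored as a finite float16."""
    try:
        # "e" is the half-precision format character; packing a value
        # outside the representable range raises an error.
        struct.pack("e", x)
        return False
    except (OverflowError, struct.error):
        return True

print(overflows_float16(60000.0))  # False: representable in float16
print(overflows_float16(70000.0))  # True: beyond the float16 range
```

This is why running the same converted weights in float32 (or quantized int8) avoids the problem, as confirmed above.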

@carolinaxxxxx
Author

Ok, I'll look around. Thx.

@vince62s
Member

#1074

@carolinaxxxxx
Author

@vince62s thanks for that ✌️


3 participants