
Tatoeba models outputting nonsense #35

Closed
Latrolage opened this issue Sep 5, 2023 · 6 comments
Comments

@Latrolage
Latrolage commented Sep 5, 2023

On the huggingface demo, (e.g. https://huggingface.co/Helsinki-NLP/opus-tatoeba-en-ja?text=My+name+is+Wolfgang+and+I+live+in+Berlin) the output doesn't seem to make sense.

(screenshot of the Hugging Face demo output)

I also ran some models locally. This was the result of running:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Japanese input and the translation we would expect from a working ja-en model
input_text = "犬が好きじゃない"
print("Text to translate: " + input_text)
print("Expected translation: I don't like dogs/I dislike dogs")

# Two locally converted Tatoeba models, plus two models hosted on the Hub
for folder in ("../opus-2020-06-17-pytorch", "../opus-2021-02-18-pytorch",
               "Helsinki-NLP/opus-mt-ja-en", "Helsinki-NLP/opus-mt-jap-en"):
    print(folder)
    tokenizer = AutoTokenizer.from_pretrained(folder)
    model = AutoModelForSeq2SeqLM.from_pretrained(folder)
    tokenized = tokenizer([input_text], return_tensors="pt")
    out = model.generate(**tokenized, max_length=128)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Output:

❯ python translate.py
Text to translate: 犬が好きじゃない
Expected translation: I don't like dogs/I dislike dogs
../opus-2020-06-17-pytorch
□ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □
../opus-2021-02-18-pytorch
pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain mountain mountain mountain mountain mountain mountain mountain mountain mountain eighteenth eighteenth eighteenth eighteenth eighteenth king king king king king king king king king king king king king king king king king eighteenth king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king
Helsinki-NLP/opus-mt-ja-en
I don't like dogs.
Helsinki-NLP/opus-mt-jap-en
A dog's dogs would desire a dogs a dog would desire.

Only opus-mt-ja-en gave an answer that was understandable at all. Any idea what the problem might be? The opus-mt-jap-en model also fails to produce a comprehensible translation.

The Tatoeba models were converted to PyTorch with python -m transformers.models.marian.convert_marian_to_pytorch --src folder --dest folder-pytorch. I'm not sure how the Hugging Face demo loads a model when you just paste in the link, so I don't know how to replicate that.

@droussis

This seems to be the case with all of their models that originate from the Tatoeba Challenge. Only the models included here seem to work with Hugging Face. Up until a month ago, I hadn't encountered such problems.

@ArthurZucker

Thanks for reporting! I'll check whether the tokenizer or the model is at fault.

@ArthurZucker

Hey! You should use model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-tatoeba-en-ja", revision="refs/pr/3"). This is indeed related to an update of the library, but a fix was opened on all of the affected models online, like the following: https://huggingface.co/Helsinki-NLP/opus-tatoeba-en-ja/discussions/3
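Spelling out that suggestion as a complete script (a minimal sketch: it downloads the model from the Hub, so it needs network access, and the revision pin should become unnecessary once the fix PR is merged into the model repo's main branch):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Helsinki-NLP/opus-tatoeba-en-ja"
revision = "refs/pr/3"  # the open fix PR mentioned above

# Load the tokenizer and the model from the same pinned revision,
# so the vocabulary files and the weights stay in sync.
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, revision=revision)

inputs = tokenizer(["My name is Wolfgang and I live in Berlin"], return_tensors="pt")
out = model.generate(**inputs, max_length=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The same revision argument works for any of the affected models, since each fix was opened as a pull request on the corresponding model repo.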

@Latrolage
Author

Are the opus-mt-xx-xx models a different issue? I tried just now with both an old and a newer version of transformers and haven't gotten them to work.
https://huggingface.co/Helsinki-NLP/opus-mt-jap-en?text=%E7%8A%AC%E3%81%8C%E5%A5%BD%E3%81%8D%E3%81%98%E3%82%83%E3%81%AA%E3%81%84
(screenshot of the opus-mt-jap-en demo output)

@jorgtied
Member

Note that jap is not Japanese.

@Latrolage
Author

That makes more sense.
I also tried the opus-2021-02-18 model from https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/jpn-eng, and it seems my issue there is related to #2 (comment).
