
Tatoeba models outputting nonsense #35

Closed
Latrolage opened this issue Sep 5, 2023 · 6 comments
Comments

@Latrolage
Latrolage commented Sep 5, 2023

On the huggingface demo, (e.g. https://huggingface.co/Helsinki-NLP/opus-tatoeba-en-ja?text=My+name+is+Wolfgang+and+I+live+in+Berlin) the output doesn't seem to make sense.

(screenshot of the Hugging Face demo output)

I also ran some models locally. This was the result of running:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Japanese input and the translation we would expect from a working ja-en model
input_text = "犬が好きじゃない"
print("Text to translate: " + input_text)
print("Expected translation: I don't like dogs/I dislike dogs")

# Two locally converted Tatoeba models, plus two models hosted on the Hub
for folder in ("../opus-2020-06-17-pytorch", "../opus-2021-02-18-pytorch",
               "Helsinki-NLP/opus-mt-ja-en", "Helsinki-NLP/opus-mt-jap-en"):
    print(folder)
    tokenizer = AutoTokenizer.from_pretrained(folder)
    model = AutoModelForSeq2SeqLM.from_pretrained(folder)
    tokenized = tokenizer([input_text], return_tensors="pt")
    out = model.generate(**tokenized, max_length=128)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Output:

❯ python translate.py
Text to translate: 犬が好きじゃない
Expected translation: I don't like dogs/I dislike dogs
../opus-2020-06-17-pytorch
□ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □
../opus-2021-02-18-pytorch
pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain mountain mountain mountain mountain mountain mountain mountain mountain mountain eighteenth eighteenth eighteenth eighteenth eighteenth king king king king king king king king king king king king king king king king king eighteenth king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king
Helsinki-NLP/opus-mt-ja-en
I don't like dogs.
Helsinki-NLP/opus-mt-jap-en
A dog's dogs would desire a dogs a dog would desire.

Only opus-mt-ja-en gave an answer that was understandable at all. Any idea what the problem might be? The opus-mt-jap-en model also fails to produce a comprehensible translation.

The Tatoeba models were converted to PyTorch with python -m transformers.models.marian.convert_marian_to_pytorch --src folder --dest folder-pytorch. I'm not sure how the Hugging Face demo loads a model when you just paste in the link, so I don't know how to replicate that.

@droussis

This seems to be the case with all of their models that originate from the Tatoeba Challenge. Only the models included here seem to work with Hugging Face. Up until a month ago, I hadn't encountered such problems.

@ArthurZucker

Thanks for reporting! I'll check whether the tokenizer or the model is at fault.

@ArthurZucker

Hey! You should use model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-tatoeba-en-ja", revision="refs/pr/3"). This is indeed related to an update of the library, but a fix was opened on all of the affected models online, like the following: https://huggingface.co/Helsinki-NLP/opus-tatoeba-en-ja/discussions/3
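Spelling out that suggestion as a complete script (a minimal sketch: it downloads the model from the Hub, so it needs network access, and the revision pin should become unnecessary once the fix PR is merged into the model repo's main branch):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Helsinki-NLP/opus-tatoeba-en-ja"
revision = "refs/pr/3"  # the open fix PR mentioned above

# Load the tokenizer and the model from the same pinned revision,
# so the vocabulary files and the weights stay in sync.
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, revision=revision)

inputs = tokenizer(["My name is Wolfgang and I live in Berlin"], return_tensors="pt")
out = model.generate(**inputs, max_length=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The same revision argument works for any of the affected models, since each fix was opened as a pull request on the corresponding model repo.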

@Latrolage
Author

Are the opus-mt-xx-xx models a different issue? I tried just now with both an old and a newer version of transformers and haven't gotten them to work.
https://huggingface.co/Helsinki-NLP/opus-mt-jap-en?text=%E7%8A%AC%E3%81%8C%E5%A5%BD%E3%81%8D%E3%81%98%E3%82%83%E3%81%AA%E3%81%84
(screenshot of the opus-mt-jap-en demo output)

@jorgtied
Member

Note that jap is not Japanese.

@Latrolage
Author

That makes more sense.
I also tried the opus-2021-02-18 model from https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/jpn-eng, and it seems my issue there is related to #2 (comment).
