[Question] Cross-lingual Score #7

Closed
loretoparisi opened this issue May 2, 2019 · 10 comments
loretoparisi commented May 2, 2019

Assuming that the embeddings have learned joint cross-lingual representations (so that cat is close to katze or chat, and hence a sentence like I love eating will be close to Ich esse gerne, as happens in the MUSE or LASER models), would it be possible to evaluate BERTScore between sentences in two different languages?
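
For illustration, a minimal sketch of that premise, assuming the Hugging Face transformers package (an assumption; it is not part of this repo) and simple mean pooling over token vectors:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(text):
    # Mean-pool the final-layer token vectors into one sentence vector.
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids).last_hidden_state
    return out.mean(dim=1).squeeze(0)

# If the space is truly joint, the translation pair should score
# higher than an unrelated pair.
print(torch.cosine_similarity(embed("I love eating"), embed("Ich esse gerne"), dim=0))
print(torch.cosine_similarity(embed("I love eating"), embed("Das Haus ist alt"), dim=0))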

loretoparisi changed the title from Cross-lingual Score to [Question] Cross-lingual Score on May 2, 2019
Tiiiger (Owner) commented May 2, 2019

Hello,

We also conjecture that this is possible, although we have not done a proper study of this hypothesis.

shoegazerstella commented May 2, 2019

Hi @Tiiiger,
What would you suggest if we want to verify whether this hypothesis holds? Is it possible to use an existing model, or would BERT need to be retrained?

I was trying this:

from bert_score import score

cands = ['hello how are you?', 'cat', 'house']
refs = ['ciao come stai?', 'gatto', 'topo']

P, R, F1 = score(cands, refs, bert="bert-base-multilingual-cased", verbose=True)

P: tensor([0.6788, 0.6708, 0.7756])
R: tensor([0.7002, 0.6582, 0.7756])
F1: tensor([0.6893, 0.6644, 0.7756])

And the pairwise similarity matrix plot is this:
[Screenshot: similarity matrix, 2019-05-02]

Clearly this is not working: the last entries in the two lists, house and topo (Italian for mouse), are not translations of each other, yet they get the highest similarity score.
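
For anyone reproducing this, a minimal sketch (assuming the same score API as in the snippet above) that builds the full candidate-by-reference F1 matrix behind such a plot:

from bert_score import score

cands = ['hello how are you?', 'cat', 'house']
refs = ['ciao come stai?', 'gatto', 'topo']

# Score every candidate against every reference, then reshape the
# flat result into a (num_cands, num_refs) matrix.
pair_cands = [c for c in cands for _ in refs]
pair_refs = [r for _ in cands for r in refs]
P, R, F1 = score(pair_cands, pair_refs, bert="bert-base-multilingual-cased")
print(F1.view(len(cands), len(refs)))  # row i, col j: F1(cands[i], refs[j])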

shoegazerstella commented May 2, 2019

We just saw this implementation of a cross-lingual language model based on BERT: https://github.com/facebookresearch/XLM.
It seems that the XNLI-15 model could be a good first solution:

XNLI-15 is the model used in the paper for XNLI fine-tuning.
It handles English, French, Spanish, German, Greek, Bulgarian, 
Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.
It is trained with the MLM and the TLM objectives. 
For this model we used a different preprocessing than for the MT models (such as lowercasing and accents removal).

There is an example of how it works; a sketch of its core call follows below.
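
For reference, a heavily hedged sketch of that example's core call, following the XLM repository's README (model, word_ids, and lengths are objects set up by that repo's loading code, not by bert_score):

# After reloading a pretrained XLM model and BPE-encoding the batch,
# per-token embeddings come from a plain forward pass.
tensor = model('fwd', x=word_ids, lengths=lengths, causal=False)
# For XNLI-15 the result has shape (sequence_length, batch_size, 1024).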

Could you consider implementing this in your library?

Tiiiger (Owner) commented May 3, 2019

Thank you @shoegazerstella for letting us know. We are definitely going to look into it, but it may take some time before we get back to you.

If this is really important to your research, I encourage you to fork the repo and start implementing it. The general backend of BERTScore is at https://github.com/Tiiiger/bert_score/blob/master/bert_score/utils.py. Please let me know if you have any questions.

@shoegazerstella

I am trying to implement the solution discussed above. You can find the code here.

Apologies if this is not the most elegant solution, but it was the fastest for me to test today.

So I am running bert_score_test.py, but I get stuck after the embedding generation; the prints below refer to ref_stats and hyp_stats in bert_score/utils.py:

calculating scores...
loading facebook-XLM model..
Loading vocabulary from XLM/models/vocab_xnli_15.txt ...
Read 4622450944 words (121332 unique) from vocabulary file.
Loading codes from XLM/models/codes_xnli_15.txt ...
Read 80000 codes from the codes file.
  0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s]generating embeddings from facebook-XLM model..
{'es': 'hola como estas?'}
8
torch.Size([8, 1, 1024])
torch.Size([1, 1024])
tensor([[[-0.0235, -1.1157,  5.5236,  ..., -1.7445,  4.6693, -5.4893]],

        [[-4.8977, -5.8174, -0.0425,  ..., -4.0513,  1.5466,  1.0361]],

        [[-3.0278, -2.7101, -6.6004,  ..., -2.3234,  2.0516,  0.8349]],

        ...,

        [[-3.4236, -3.7358,  1.5622,  ..., -2.7245,  0.3207,  1.5517]],

        [[-1.3155, -2.3146,  0.8112,  ...,  1.7799, -0.2109,  4.8358]],

        [[-3.9836, -0.7102,  1.4045,  ..., -2.2827,  5.0350,  8.0413]]],
       grad_fn=<TransposeBackward0>)
generating embeddings from facebook-XLM model..
{'en': 'hello how are you?'}
7
torch.Size([7, 1, 1024])
torch.Size([1, 1024])
tensor([[[-3.6706e+00, -5.1693e+00,  2.3415e+00,  ..., -3.3566e+00,
           2.2613e+00,  1.2468e+01]],

        [[-5.1743e+00, -5.0928e+00,  1.0318e-02,  ..., -5.8567e+00,
          -3.4373e+00,  6.0835e+00]],

        [[-2.3680e+00, -1.0124e+01,  3.8484e+00,  ..., -2.8918e+00,
          -8.9933e+00, -2.7259e+00]],

        ...,

        [[-7.4168e+00, -3.6042e+00,  3.5969e+00,  ...,  3.8602e+00,
          -6.1241e-01,  5.5241e-01]],

        [[-1.0793e+00, -4.0387e+00,  5.8260e+00,  ...,  3.7948e+00,
           2.2968e+00, -1.2407e+01]],

        [[-3.8262e+00, -5.0583e+00,  5.7023e+00,  ..., -4.9191e-01,
           4.4571e+00, -2.0888e+00]]], grad_fn=<TransposeBackward0>)
Traceback (most recent call last):
  File "bert_score_test.py", line 13, in <module>
    P, R, F1 = score(cands, refs, cands_lang, refs_lang, bert="facebook-XLM", verbose=True, no_idf=no_idf) 
  File "/src/bert_score/score.py", line 65, in score
    verbose=verbose, device=device, batch_size=batch_size)
  File "/src/bert_score/utils.py", line 192, in bert_cos_score_idf
    P, R, F1 = greedy_cos_idf(*ref_stats, *hyp_stats)
TypeError: greedy_cos_idf() takes 8 positional arguments but 15 were given

I am printing the sizes of the tensors to compare with your implementation. Is this error related to their shape?
By running bert_score_test.py with bert-base-multilingual-cased I get this:

The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
calculating scores...
  0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s]<class 'tuple'>
4
torch.Size([1, 7, 768])
(tensor([[[-0.0576, -0.0147,  0.0266,  ...,  0.8648,  1.4775, -0.6607],
         [ 0.0755,  0.0690, -0.3626,  ...,  0.3833,  1.1005,  0.4550],
         [ 0.1177,  0.3928,  0.2649,  ..., -0.5387,  0.6300,  0.0785],
         ...,
         [ 0.2860,  0.2741,  0.0339,  ...,  0.5458,  0.6054,  0.3276],
         [ 0.4984,  0.4997,  0.1665,  ..., -0.2474,  0.7287, -0.1994],
         [ 0.2426, -0.3561,  0.9417,  ...,  0.2333,  1.1731, -0.5414]]]), tensor([7]), tensor([[1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 0.]]))
<class 'tuple'>
4
torch.Size([1, 8, 768])
(tensor([[[ 0.1399,  0.0636, -0.3477,  ...,  1.2948,  1.3497, -0.8473],
         [ 0.5074,  0.4649,  0.0511,  ...,  1.7759,  1.0347, -0.7468],
         [ 0.6105,  0.7187,  0.1068,  ...,  0.8703,  0.7290, -0.4718],
         ...,
         [ 0.0584,  0.8752,  0.4854,  ...,  0.8477, -0.3838, -0.3481],
         [ 0.7315,  0.2678,  0.0808,  ..., -0.2716,  0.4328, -0.6448],
         [ 0.6694, -0.4003,  0.9021,  ...,  0.4409,  0.8974, -0.7192]]]), tensor([8]), tensor([[1, 1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 1., 0.]]))
100%|############################################################################################################################| 1/1 [00:01<00:00,  1.52s/it]
done in 1.57 seconds
['hello how are you?']
['hola como estas?']
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])

So here, in contrast, ref_stats and hyp_stats are tuples.
The code for XLM embedding generation is here.
Do you have any tip on this? Thanks a lot for your help!

Tiiiger (Owner) commented May 3, 2019

I think you passed the wrong number of arguments to greedy_cos_idf: it expects eight positional arguments (four unpacked from each stats tuple), so a 15-argument call means your XLM stats tuples have the wrong number of items.

I will add documentation to utils.py by tonight. Hang on.
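
For context, a sketch of what that call unpacks to (argument names here are illustrative, not the library's exact signature): each stats tuple must hold exactly four items, so the starred call supplies eight arguments.

ref_stats = (ref_embedding, ref_lens, ref_mask, ref_idf)  # 4 items
hyp_stats = (hyp_embedding, hyp_lens, hyp_mask, hyp_idf)  # 4 items
P, R, F1 = greedy_cos_idf(*ref_stats, *hyp_stats)         # 4 + 4 = 8 args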

Tiiiger (Owner) commented May 4, 2019

Hope the added docs can help you.

@shoegazerstella

Hi @Tiiiger, thanks a lot for the docs; they helped a lot.
I managed to make it work, but I am still missing something.
I am struggling a bit with how to correctly compute the idf_dict for https://github.com/facebookresearch/XLM.

As of now I have these very bad results:

['hello how are you?']
['hola como estas?']

XLM:
P: tensor([0.6018])
R: tensor([0.6582])
F1: tensor([0.6287])

bert-base-multilingual-cased:
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])

It seems that bert-base-multilingual-cased is performing much better.

Tiiiger (Owner) commented May 9, 2019

Just a heads up: the absolute scores may be less meaningful because different models can produce scores in different ranges. Ideally, you would want to show that the score correlates with human judgments, but unfortunately I don't know of any such data for this setting.
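
For illustration, if human judgments were available, the correlation check could look like this (a minimal sketch; the numbers are made up purely for demonstration):

from scipy.stats import pearsonr

# Hypothetical per-pair BERTScore F1 values and human adequacy
# judgments for the same sentence pairs, for illustration only.
f1_scores = [0.72, 0.66, 0.78, 0.61]
human_scores = [4.0, 3.0, 4.5, 2.5]
r, p = pearsonr(f1_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")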

As I understand it, you don't have any more implementation questions, so I am closing this.

I am happy to chat about potential research opportunities for cross-lingual scores. Feel free to continue the conversation under this issue, or contact us directly by email if you want to keep it private.

Tiiiger closed this as completed May 9, 2019
@shoegazerstella

Hi @Tiiiger,

I had to modify a few things in the XLM vocabulary and use its BPE as the tokenizer. You can see some more changes here. The code still does not work for more than one reference phrase at a time, though.

I would like to ask you a couple of questions about things I am still missing:

  • Every time I run the script on exactly the same two phrases, the results change slightly. What could cause this strange behaviour?
  • Could you clarify how the no_idf parameter should be handled? For now I have this (see the idf sketch below):
no_idf = len(refs) == 1  # idf needs more than one reference
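
For reference, a minimal sketch of the plus-one-smoothed idf weighting that BERTScore-style metrics typically compute over the reference corpus (tokenize is a hypothetical stand-in for whichever tokenizer or BPE is in play; this mirrors the usual recipe, not necessarily this repo's exact code):

from collections import Counter
from math import log

def compute_idf(refs, tokenize):
    # For each token, count how many reference sentences contain it.
    doc_freq = Counter(tok for ref in refs for tok in set(tokenize(ref)))
    n = len(refs)
    # Plus-one smoothing keeps weights finite for unseen tokens.
    return {tok: log((n + 1) / (c + 1)) for tok, c in doc_freq.items()}

print(compute_idf(['ciao come stai?', 'gatto sul tavolo'], str.split))

With a single reference, every token occurs in the only document, so all weights collapse to the same constant; that is why falling back to no_idf when len(refs) == 1 is sensible.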

Thanks a lot for your help!
