[Question] Cross-lingual Score #7

Closed
loretoparisi opened this issue May 2, 2019 · 10 comments
loretoparisi commented May 2, 2019

Assuming that the embeddings have learned joint cross-lingual representations (so that cat is close to katze or chat, and hence a sentence like I love eating will be close to Ich esse gerne, as happens in the MUSE or LASER models), would it be possible to evaluate BERTScore between sentences in two different languages?
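
For illustration, a minimal sketch of that premise, assuming the Hugging Face transformers package (an assumption; it is not part of this repo) and simple mean pooling over token vectors:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(text):
    # Mean-pool the final-layer token vectors into one sentence vector.
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids).last_hidden_state
    return out.mean(dim=1).squeeze(0)

# If the space is truly joint, the translation pair should score
# higher than an unrelated pair.
print(torch.cosine_similarity(embed("I love eating"), embed("Ich esse gerne"), dim=0))
print(torch.cosine_similarity(embed("I love eating"), embed("Das Haus ist alt"), dim=0))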

loretoparisi changed the title from Cross-lingual Score to [Question] Cross-lingual Score on May 2, 2019
Tiiiger (Owner) commented May 2, 2019

Hello,

We also conjecture that this is possible, although we have not done a proper study of this hypothesis.

shoegazerstella commented May 2, 2019

Hi @Tiiiger,
What would you suggest if we want to verify whether this hypothesis holds? Is it possible to use an existing model, or would BERT need to be retrained?

I was trying this:

from bert_score import score

cands = ['hello how are you?', 'cat', 'house']
refs = ['ciao come stai?', 'gatto', 'topo']

P, R, F1 = score(cands, refs, bert="bert-base-multilingual-cased", verbose=True)

P: tensor([0.6788, 0.6708, 0.7756])
R: tensor([0.7002, 0.6582, 0.7756])
F1: tensor([0.6893, 0.6644, 0.7756])

And the pairwise similarity matrix plot is this:
[Screenshot: similarity matrix, 2019-05-02]

Clearly this is not working: the last entries in the two lists, house and topo (Italian for mouse), are not translations of each other, yet they get the highest similarity score.
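
For anyone reproducing this, a minimal sketch (assuming the same score API as in the snippet above) that builds the full candidate-by-reference F1 matrix behind such a plot:

from bert_score import score

cands = ['hello how are you?', 'cat', 'house']
refs = ['ciao come stai?', 'gatto', 'topo']

# Score every candidate against every reference, then reshape the
# flat result into a (num_cands, num_refs) matrix.
pair_cands = [c for c in cands for _ in refs]
pair_refs = [r for _ in cands for r in refs]
P, R, F1 = score(pair_cands, pair_refs, bert="bert-base-multilingual-cased")
print(F1.view(len(cands), len(refs)))  # row i, col j: F1(cands[i], refs[j])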

shoegazerstella commented May 2, 2019

We just saw this implementation of a cross-lingual language model based on BERT: https://github.com/facebookresearch/XLM.
It seems that the XNLI-15 model could be a good first solution:

XNLI-15 is the model used in the paper for XNLI fine-tuning.
It handles English, French, Spanish, German, Greek, Bulgarian, 
Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.
It is trained with the MLM and the TLM objectives. 
For this model we used a different preprocessing than for the MT models (such as lowercasing and accents removal).

There is an example of how it works; a sketch of its core call follows below.
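
For reference, a heavily hedged sketch of that example's core call, following the XLM repository's README (model, word_ids, and lengths are objects set up by that repo's loading code, not by bert_score):

# After reloading a pretrained XLM model and BPE-encoding the batch,
# per-token embeddings come from a plain forward pass.
tensor = model('fwd', x=word_ids, lengths=lengths, causal=False)
# For XNLI-15 the result has shape (sequence_length, batch_size, 1024).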

Could you consider implementing this in your library?

Tiiiger (Owner) commented May 3, 2019

Thank you @shoegazerstella for letting us know. We are definitely going to look into it, but it may take some time before we get back to you.

If this is really important to your research, I encourage you to fork the repo and start implementing it. The general backend of BERTScore is at https://github.com/Tiiiger/bert_score/blob/master/bert_score/utils.py. Please let me know if you have any questions.

@shoegazerstella

I am trying to implement the solution discussed above. You can find the code here.

Apologies if this is not the most elegant solution, but it was the fastest for me to test today.

So I am running bert_score_test.py, but I get stuck after the embedding generation; the prints below refer to ref_stats and hyp_stats in bert_score/utils.py:

calculating scores...
loading facebook-XLM model..
Loading vocabulary from XLM/models/vocab_xnli_15.txt ...
Read 4622450944 words (121332 unique) from vocabulary file.
Loading codes from XLM/models/codes_xnli_15.txt ...
Read 80000 codes from the codes file.
  0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s]generating embeddings from facebook-XLM model..
{'es': 'hola como estas?'}
8
torch.Size([8, 1, 1024])
torch.Size([1, 1024])
tensor([[[-0.0235, -1.1157,  5.5236,  ..., -1.7445,  4.6693, -5.4893]],

        [[-4.8977, -5.8174, -0.0425,  ..., -4.0513,  1.5466,  1.0361]],

        [[-3.0278, -2.7101, -6.6004,  ..., -2.3234,  2.0516,  0.8349]],

        ...,

        [[-3.4236, -3.7358,  1.5622,  ..., -2.7245,  0.3207,  1.5517]],

        [[-1.3155, -2.3146,  0.8112,  ...,  1.7799, -0.2109,  4.8358]],

        [[-3.9836, -0.7102,  1.4045,  ..., -2.2827,  5.0350,  8.0413]]],
       grad_fn=<TransposeBackward0>)
generating embeddings from facebook-XLM model..
{'en': 'hello how are you?'}
7
torch.Size([7, 1, 1024])
torch.Size([1, 1024])
tensor([[[-3.6706e+00, -5.1693e+00,  2.3415e+00,  ..., -3.3566e+00,
           2.2613e+00,  1.2468e+01]],

        [[-5.1743e+00, -5.0928e+00,  1.0318e-02,  ..., -5.8567e+00,
          -3.4373e+00,  6.0835e+00]],

        [[-2.3680e+00, -1.0124e+01,  3.8484e+00,  ..., -2.8918e+00,
          -8.9933e+00, -2.7259e+00]],

        ...,

        [[-7.4168e+00, -3.6042e+00,  3.5969e+00,  ...,  3.8602e+00,
          -6.1241e-01,  5.5241e-01]],

        [[-1.0793e+00, -4.0387e+00,  5.8260e+00,  ...,  3.7948e+00,
           2.2968e+00, -1.2407e+01]],

        [[-3.8262e+00, -5.0583e+00,  5.7023e+00,  ..., -4.9191e-01,
           4.4571e+00, -2.0888e+00]]], grad_fn=<TransposeBackward0>)
Traceback (most recent call last):
  File "bert_score_test.py", line 13, in <module>
    P, R, F1 = score(cands, refs, cands_lang, refs_lang, bert="facebook-XLM", verbose=True, no_idf=no_idf) 
  File "/src/bert_score/score.py", line 65, in score
    verbose=verbose, device=device, batch_size=batch_size)
  File "/src/bert_score/utils.py", line 192, in bert_cos_score_idf
    P, R, F1 = greedy_cos_idf(*ref_stats, *hyp_stats)
TypeError: greedy_cos_idf() takes 8 positional arguments but 15 were given

I am printing the sizes of the tensors to compare with your implementation. Is this error related to their shape?
By running bert_score_test.py with bert-base-multilingual-cased I get this:

The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
calculating scores...
  0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s]<class 'tuple'>
4
torch.Size([1, 7, 768])
(tensor([[[-0.0576, -0.0147,  0.0266,  ...,  0.8648,  1.4775, -0.6607],
         [ 0.0755,  0.0690, -0.3626,  ...,  0.3833,  1.1005,  0.4550],
         [ 0.1177,  0.3928,  0.2649,  ..., -0.5387,  0.6300,  0.0785],
         ...,
         [ 0.2860,  0.2741,  0.0339,  ...,  0.5458,  0.6054,  0.3276],
         [ 0.4984,  0.4997,  0.1665,  ..., -0.2474,  0.7287, -0.1994],
         [ 0.2426, -0.3561,  0.9417,  ...,  0.2333,  1.1731, -0.5414]]]), tensor([7]), tensor([[1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 0.]]))
<class 'tuple'>
4
torch.Size([1, 8, 768])
(tensor([[[ 0.1399,  0.0636, -0.3477,  ...,  1.2948,  1.3497, -0.8473],
         [ 0.5074,  0.4649,  0.0511,  ...,  1.7759,  1.0347, -0.7468],
         [ 0.6105,  0.7187,  0.1068,  ...,  0.8703,  0.7290, -0.4718],
         ...,
         [ 0.0584,  0.8752,  0.4854,  ...,  0.8477, -0.3838, -0.3481],
         [ 0.7315,  0.2678,  0.0808,  ..., -0.2716,  0.4328, -0.6448],
         [ 0.6694, -0.4003,  0.9021,  ...,  0.4409,  0.8974, -0.7192]]]), tensor([8]), tensor([[1, 1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 1., 0.]]))
100%|############################################################################################################################| 1/1 [00:01<00:00,  1.52s/it]
done in 1.57 seconds
['hello how are you?']
['hola como estas?']
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])

So here, in contrast, ref_stats and hyp_stats are tuples.
The code for XLM embedding generation is here.
Do you have any tip on this? Thanks a lot for your help!

Tiiiger (Owner) commented May 3, 2019

I think you passed the wrong number of arguments to greedy_cos_idf: it expects eight positional arguments (four unpacked from each stats tuple), so a 15-argument call means your XLM stats tuples have the wrong number of items.

I will add documentation to utils.py by tonight. Hang on.
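
For context, a sketch of what that call unpacks to (argument names here are illustrative, not the library's exact signature): each stats tuple must hold exactly four items, so the starred call supplies eight arguments.

ref_stats = (ref_embedding, ref_lens, ref_mask, ref_idf)  # 4 items
hyp_stats = (hyp_embedding, hyp_lens, hyp_mask, hyp_idf)  # 4 items
P, R, F1 = greedy_cos_idf(*ref_stats, *hyp_stats)         # 4 + 4 = 8 args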

Tiiiger (Owner) commented May 4, 2019

Hope the added docs can help you.

@shoegazerstella

Hi @Tiiiger, thanks a lot for the docs; they helped a lot.
I managed to make it work, but I am still missing something.
I am struggling a bit with how to correctly compute the idf_dict for https://github.com/facebookresearch/XLM.

As of now I have these very bad results:

['hello how are you?']
['hola como estas?']

XLM:
P: tensor([0.6018])
R: tensor([0.6582])
F1: tensor([0.6287])

bert-base-multilingual-cased:
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])

It seems that bert-base-multilingual-cased is performing much better.

Tiiiger (Owner) commented May 9, 2019

Just a heads up: the absolute scores may be less meaningful because different models can produce scores in different ranges. Ideally, you would want to show that the score correlates with human judgments, but unfortunately I don't know of any such data for this setting.
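
For illustration, if human judgments were available, the correlation check could look like this (a minimal sketch; the numbers are made up purely for demonstration):

from scipy.stats import pearsonr

# Hypothetical per-pair BERTScore F1 values and human adequacy
# judgments for the same sentence pairs, for illustration only.
f1_scores = [0.72, 0.66, 0.78, 0.61]
human_scores = [4.0, 3.0, 4.5, 2.5]
r, p = pearsonr(f1_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")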

As I understand it, you don't have any more implementation questions, so I am closing this.

I am happy to chat about potential research opportunities for cross-lingual scores. Feel free to continue the conversation under this issue, or contact us directly by email if you want to keep it private.

Tiiiger closed this as completed May 9, 2019
@shoegazerstella

Hi @Tiiiger,

I had to modify a few things in the XLM vocabulary and use its BPE as the tokenizer. You can see some more changes here. The code still does not work for more than one reference phrase at a time, though.

I would like to ask you a couple of questions about things I am still missing:

  • Every time I run the script on exactly the same two phrases, the results change slightly. What could cause this strange behaviour?
  • Could you clarify how the no_idf parameter should be handled? For now I have this (see the idf sketch below):
no_idf = len(refs) == 1  # idf needs more than one reference
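
For reference, a minimal sketch of the plus-one-smoothed idf weighting that BERTScore-style metrics typically compute over the reference corpus (tokenize is a hypothetical stand-in for whichever tokenizer or BPE is in play; this mirrors the usual recipe, not necessarily this repo's exact code):

from collections import Counter
from math import log

def compute_idf(refs, tokenize):
    # For each token, count how many reference sentences contain it.
    doc_freq = Counter(tok for ref in refs for tok in set(tokenize(ref)))
    n = len(refs)
    # Plus-one smoothing keeps weights finite for unseen tokens.
    return {tok: log((n + 1) / (c + 1)) for tok, c in doc_freq.items()}

print(compute_idf(['ciao come stai?', 'gatto sul tavolo'], str.split))

With a single reference, every token occurs in the only document, so all weights collapse to the same constant; that is why falling back to no_idf when len(refs) == 1 is sensible.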

Thanks a lot for your help!
