BERT score for text generation


An automatic evaluation metric described in the paper BERTScore: Evaluating Text Generation with BERT.


*: Equal Contribution


BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measures, which can be useful for evaluating different language generation tasks.

For an illustration, BERTScore precision can be computed as shown in the figure bert_score.png in this repository.
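The matching by cosine similarity described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function name greedy_match_scores is hypothetical, the toy 2-D vectors stand in for real BERT embeddings, and idf weighting is left out.

```python
import numpy as np


def greedy_match_scores(ref_emb, cand_emb):
    """Greedy-matching precision, recall and F1 from token embeddings.

    ref_emb:  array of shape (num_ref_tokens, dim)
    cand_emb: array of shape (num_cand_tokens, dim)
    Hypothetical helper for illustration; no idf weighting is applied.
    """
    # Normalize rows so that dot products equal cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)

    # sim[i, j] = cosine similarity of reference token i and candidate token j.
    sim = ref @ cand.T

    # Precision: match each candidate token to its most similar reference token.
    precision = sim.max(axis=0).mean()
    # Recall: match each reference token to its most similar candidate token.
    recall = sim.max(axis=1).mean()
    # F1: harmonic mean of the two (assumes precision + recall is non-zero).
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)


# Toy vectors standing in for contextual embeddings (not real BERT outputs).
ref_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
cand_emb = np.array([[1.0, 0.0], [1.0, 1.0]])
p, r, f = greedy_match_scores(ref_emb, cand_emb)
```

Precision averages over candidate tokens and recall over reference tokens, so the two differ whenever one sentence contains tokens that have no close counterpart in the other.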

If you find this repo useful, please cite:

@article{bert-score,
  title={BERTScore: Evaluating Text Generation with BERT},
  author={Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q. and Artzi, Yoav},
  journal={arXiv preprint arXiv:1904.09675},
  year={2019}
}


  • Python version >= 3.6
  • PyTorch version >= 0.4.1

Install from pip:

pip install bert-score

Install from source:

git clone
cd bert_score
pip install -r requirements.txt
pip install .



We provide a command-line interface (CLI) for BERTScore as well as a Python module. The CLI can be used as follows:

  1. To evaluate English text files:

We provide example inputs under ./example.

bert-score -r example/refs.txt -c example/hyps.txt --bert bert-base-uncased

  2. To evaluate Chinese text files:

Please format your input files similarly to the ones in ./example.

bert-score -r [references] -c [candidates] --bert bert-base-chinese

  3. To evaluate text files in other languages:

Please format your input files similarly to the ones in ./example.

bert-score -r [references] -c [candidates]

For more options, run bert-score -h.

For the Python module, we provide a demo. Please refer to bert_score/ for more details.
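As a rough sketch of how the Python module might be driven from two text files, the helpers below read references and candidates and pass them to bert_score.score. Several things here are assumptions rather than documented behavior: the input format (one sentence per line, with line i of the candidate file paired with line i of the reference file), the helper names load_sentences, load_pairs, and score_files, and the call itself, which assumes the version of bert_score described in this README, takes candidates first and references second, and returns precision, recall, and F1.

```python
from pathlib import Path


def load_sentences(path):
    """Read one sentence per line (assumed input format)."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def load_pairs(ref_path, cand_path):
    """Load candidates and references and check that they line up."""
    refs = load_sentences(ref_path)
    cands = load_sentences(cand_path)
    if len(refs) != len(cands):
        raise ValueError(
            f"{len(cands)} candidates but {len(refs)} references; "
            "each candidate needs exactly one reference line"
        )
    return cands, refs


def score_files(ref_path, cand_path):
    """Hypothetical wrapper around bert_score.score for two text files."""
    # Imported lazily so the helpers above work without bert_score installed.
    from bert_score import score

    cands, refs = load_pairs(ref_path, cand_path)
    # Assumption: score takes candidates first, then references, and
    # returns precision, recall and F1 for each sentence pair.
    return score(cands, refs)
```

Checking the line counts before scoring catches misaligned files early, before any model is loaded.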

Running BERTScore can be computationally intensive (because it uses BERT :p). Therefore, a GPU is usually necessary. If you don't have access to a GPU, you can try our demo on Google Colab.

Practical Tips

  • BERTScore relies on inverse document frequency (idf) computed on the reference sentences to weigh word importance. However, when the set of reference sentences becomes too small, the idf scores become inaccurate or invalid. In that case, consider turning off idf scaling by setting no_idf=True when calling the bert_score.score function.
  • When you are low on GPU memory, consider setting a smaller batch_size when calling the bert_score.score function.
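To see why idf becomes unreliable on a small reference set, the sketch below computes idf weights over a toy set of two references. It is an illustration under stated assumptions, not the package's code: idf_weights is a hypothetical helper, the plus-one smoothing is one common choice, and whitespace-split words stand in for BERT's word pieces.

```python
import math
from collections import Counter


def idf_weights(tokenized_refs):
    """Inverse document frequency of each token over the reference set.

    Hypothetical helper; plus-one smoothing is an assumption of this sketch.
    """
    num_refs = len(tokenized_refs)
    # For each token, count the number of reference sentences containing it.
    doc_freq = Counter()
    for tokens in tokenized_refs:
        doc_freq.update(set(tokens))
    return {
        tok: math.log((num_refs + 1) / (df + 1))
        for tok, df in doc_freq.items()
    }


# A reference set of only two sentences (toy data).
small_refs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
weights = idf_weights(small_refs)
```

With only two references, a token can appear in either one sentence or both, so the weights take just two distinct values and carry almost no information about which words matter. This is the situation in which turning idf off is the safer choice.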


This repo wouldn't be possible without the awesome bert and pytorch-pretrained-BERT.
