Hacky way to run VnCoreNLP's tokenizer with PhoBERT
`dict.txt` is obtained by running `phobert.task.source_dictionary.save(open("dict.txt", "w"))`.
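For context, here is a minimal sketch of producing that file, assuming PhoBERT-base has been downloaded and loaded through fairseq as in the section below (the checkpoint path is a placeholder):

```python
from fairseq.models.roberta import RobertaModel

# Load the pre-trained PhoBERT-base checkpoint (placeholder path)
phobert = RobertaModel.from_pretrained('/path/to/PhoBERT_base_fairseq', checkpoint_file='model.pt')

# Dump the model's source dictionary to dict.txt
phobert.task.source_dictionary.save(open("dict.txt", "w"))
```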
Examples:

```python
from hacky_phobert_tokenizer import PhoBertTokenizer

tokenizer = PhoBertTokenizer(vncore=False)
sentence = "Tôi là sinh viên trường đại học Công nghệ"

tokens = tokenizer.encode(sentence)
# tensor([   0,  218,    8,  418, 1430,  212, 2919,  222, 3344, 5116,    2])
print(tokenizer.decode(tokens, remove_underscore=False))
# Tôi là sinh viên trường đại học Công nghệ

tokenizer.vncore = True  # use the VnCoreNLP word tokenizer
tokens = tokenizer.encode(sentence)
# tensor([   0,  218,    8,  649,  212,  956, 2413,    2])
print(tokenizer.decode(tokens, remove_underscore=False))
# Tôi là sinh_viên trường đại_học Công_nghệ
```
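With `vncore=True`, the input is first word-segmented by VnCoreNLP, which joins the syllables of multi-syllable words with underscores (e.g. `sinh viên` → `sinh_viên`) before BPE is applied; that is why the second encoding above has fewer tokens. A minimal sketch of that segmentation step on its own, using the `vncorenlp` Python wrapper (the jar path is a placeholder and requires a local VnCoreNLP install):

```python
from vncorenlp import VnCoreNLP

# Spawn a local VnCoreNLP server with only the word segmenter loaded
annotator = VnCoreNLP("/path/to/VnCoreNLP-1.1.1.jar", annotators="wseg")

# tokenize() returns a list of sentences, each a list of segmented words
sentences = annotator.tokenize("Tôi là sinh viên trường đại học Công nghệ")
print(" ".join(sentences[0]))
# Tôi là sinh_viên trường đại_học Công_nghệ
```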
Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", is a popular food in Vietnam):
- Two versions of PhoBERT, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on three downstream Vietnamese NLP tasks: part-of-speech tagging, named-entity recognition, and natural language inference.
The general architecture and experimental results of PhoBERT can be found in our paper:
```
@article{phobert,
  title   = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author  = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  journal = {arXiv preprint},
  volume  = {arXiv:2003.00744},
  year    = {2020}
}
```
Please cite our paper when PhoBERT is used to help produce published results or is incorporated into other software.
Experiments show that a straightforward fine-tuning procedure, as used for PhoBERT, can lead to state-of-the-art results. Note that downstream task performance might be boosted even further by more careful hyper-parameter tuning. A hedged fine-tuning sketch is given at the end of the fairseq section below.
Using PhoBERT in fairseq
Model | #params | size | Download
---|---|---|---
`PhoBERT-base` | 135M | 1.2GB | [PhoBERT_base_fairseq.tar.gz](https://public.vinai.io/PhoBERT_base_fairseq.tar.gz)
`PhoBERT-large` | 370M | 3.2GB | [PhoBERT_large_fairseq.tar.gz](https://public.vinai.io/PhoBERT_large_fairseq.tar.gz)
`PhoBERT-base`:

```
wget https://public.vinai.io/PhoBERT_base_fairseq.tar.gz
tar -xzvf PhoBERT_base_fairseq.tar.gz
```

`PhoBERT-large`:

```
wget https://public.vinai.io/PhoBERT_large_fairseq.tar.gz
tar -xzvf PhoBERT_large_fairseq.tar.gz
```
```python
# Load PhoBERT-base in fairseq
import torch
from fairseq.models.roberta import RobertaModel
phobert = RobertaModel.from_pretrained('/path/to/PhoBERT_base_fairseq', checkpoint_file='model.pt')
phobert.eval()  # disable dropout (or leave in train mode to fine-tune)

# Incorporate the BPE encoder into PhoBERT-base
from fairseq.data.encoders.fastbpe import fastBPE
from fairseq import options
parser = options.get_preprocessing_parser()
parser.add_argument('--bpe-codes', type=str, help='path to fastBPE BPE codes', default="/path/to/PhoBERT_base_fairseq/bpe.codes")
args = parser.parse_args()
phobert.bpe = fastBPE(args)  # incorporate the BPE encoder into PhoBERT

# Extract the last layer's features
line = "Tôi là sinh_viên trường đại_học Công_nghệ"  # INPUT TEXT IS WORD-SEGMENTED!
subwords = phobert.encode(line)
last_layer_features = phobert.extract_features(subwords)
assert last_layer_features.size() == torch.Size([1, 8, 768])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = phobert.extract_features(subwords, return_all_hiddens=True)
assert len(all_layers) == 13
assert torch.all(all_layers[-1] == last_layer_features)

# Extract features aligned to words
words = phobert.extract_features_aligned_to_words(line)
for word in words:
    print('{:10}{} (...)'.format(str(word), word.vector[:5]))

# Fill in masks
masked_line = 'Tôi là <mask> trường đại_học Công_nghệ'
topk_filled_outputs = phobert.fill_mask(masked_line, topk=5)
print(topk_filled_outputs)
```
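As a concrete illustration of the straightforward fine-tuning mentioned above, here is a minimal, hedged sketch of sentence classification using fairseq's classification-head API. The head name `sentiment`, the class count, the learning rate, and the single-example training step are illustrative placeholders, not the authors' recipe:

```python
import torch
from fairseq.models.roberta import RobertaModel

phobert = RobertaModel.from_pretrained('/path/to/PhoBERT_base_fairseq', checkpoint_file='model.pt')

# Register a new (randomly initialized) classification head;
# 'sentiment' and num_classes=2 are illustrative placeholders
phobert.register_classification_head('sentiment', num_classes=2)
phobert.train()  # keep dropout enabled while fine-tuning

optimizer = torch.optim.Adam(phobert.parameters(), lr=1e-5)
criterion = torch.nn.NLLLoss()

# One illustrative training step on a single word-segmented example
tokens = phobert.encode("Tôi là sinh_viên trường đại_học Công_nghệ")
label = torch.tensor([1])

logprobs = phobert.predict('sentiment', tokens)  # log-probabilities over the 2 classes
loss = criterion(logprobs, label)
loss.backward()
optimizer.step()
```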
Using PhoBERT in Hugging Face transformers
Install `transformers`:

```
pip3 install transformers
```
Model | #params | size | Download
---|---|---|---
`PhoBERT-base` | 135M | 307MB | [PhoBERT_base_transformers.tar.gz](https://public.vinai.io/PhoBERT_base_transformers.tar.gz)
`PhoBERT-large` | 370M | 834MB | [PhoBERT_large_transformers.tar.gz](https://public.vinai.io/PhoBERT_large_transformers.tar.gz)
`PhoBERT-base`:

```
wget https://public.vinai.io/PhoBERT_base_transformers.tar.gz
tar -xzvf PhoBERT_base_transformers.tar.gz
```

`PhoBERT-large`:

```
wget https://public.vinai.io/PhoBERT_large_transformers.tar.gz
tar -xzvf PhoBERT_large_transformers.tar.gz
```
Under construction, coming soon!
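Until the official instructions are published, here is a minimal, hedged sketch of loading the extracted archive with transformers. It assumes the archive contains a `config.json` and a `model.bin` compatible with transformers' RoBERTa classes (these file names are assumptions; check the extracted folder), and it reuses token ids produced by the fastBPE + `dict.txt` pipeline from the fairseq section, since no transformers tokenizer ships with the archive:

```python
import torch
from transformers import RobertaConfig, RobertaModel

# File names below are assumptions; check the extracted archive
config = RobertaConfig.from_pretrained("/path/to/PhoBERT_base_transformers/config.json")
phobert = RobertaModel.from_pretrained("/path/to/PhoBERT_base_transformers/model.bin", config=config)
phobert.eval()

# Token ids for "Tôi là sinh_viên trường đại_học Công_nghệ",
# as produced by the fastBPE + dict.txt pipeline above
input_ids = torch.tensor([[0, 218, 8, 649, 212, 956, 2413, 2]])
with torch.no_grad():
    last_hidden_states = phobert(input_ids)[0]  # shape (1, 8, 768)
```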
PhoBERT Copyright (C) 2020 VinAI Research
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.