plantbert

Goal

Pre-train a BERT model using a plant science corpus.

Issues

Tokenization

To train a tokenizer, the common practice is to use WordPiece or a similar approach (e.g., byte-pair encoding) to reduce the vocabulary size. But these approaches keep frequent words intact and split less frequent words into word pieces. Because names of entities such as genes, proteins, and metabolites are mostly rare in a corpus, they will almost always be broken into word pieces, which may (or may not) lead to issues with attention scores.
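
As an illustration of the splitting issue (using the off-the-shelf bert-base-uncased vocabulary, not a tokenizer from this repo), a gene identifier is typically broken into several pieces:

```python
from transformers import BertTokenizerFast

# Load a general-domain WordPiece vocabulary (bert-base-uncased) just to show
# the splitting; a plant-science tokenizer would behave similarly for entity
# names that are rare in its training corpus.
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
print(tok.tokenize("AT1G01010 encodes a NAC domain-containing protein"))
# The gene identifier AT1G01010 comes back as several word pieces
# (e.g., 'at', '##1', '##g', ...); the exact split depends on the vocabulary.
```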

Possible solutions:

  • Use the tokenizer as is and see how bad the problem is.
  • Use distinct words as tokens: but this leads to a very large vocabulary. The plant science corpus contains >900,000 distinct words.
  • Add genes, proteins, and metabolites to the vocabulary manually after training the tokenizer: but it is not trivial to collect all known gene names or other molecular entities from ALL species.
  • Define the vocabulary size with a frequency threshold: this needs to be tested (see the sketch after this list).
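
As a rough sketch of the frequency-threshold option, a WordPiece tokenizer can be trained with the Hugging Face tokenizers library, where min_frequency sets the threshold. The corpus file and output directory below are placeholder names, not paths in this repo.

```python
from tokenizers import BertWordPieceTokenizer

# Train a BERT-style WordPiece tokenizer on the plant science corpus.
# "plant_corpus.txt" and "tokenizer_out" are placeholder paths.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["plant_corpus.txt"],
    vocab_size=30_522,   # same vocab size as used for pre-training below
    min_frequency=5,     # frequency threshold to test (last option above)
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, which BertTokenizerFast can load for pre-training.
tokenizer.save_model("tokenizer_out")
```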

Pre-training

Masked language modeling

Resource consideration

Memory

  • With 379k docs, vocab_size=30_522, max_length=512, and mlm_prob=0.2 (see the sketch after this list):
  • On an RTX 3090 with 24 GB of VRAM, the training batch size can only be raised to 25 before running out of memory.
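
A minimal sketch of the masked language modeling setup with Hugging Face transformers and datasets, assuming the settings listed above. The file paths, output directory, and remaining training arguments are placeholders, not the repo's actual training script.

```python
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder paths: tokenizer directory from the tokenization step, raw corpus file.
tokenizer = BertTokenizerFast.from_pretrained("tokenizer_out")
raw = load_dataset("text", data_files={"train": "plant_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Fresh BERT model trained from scratch with the vocabulary size above.
config = BertConfig(vocab_size=30_522, max_position_embeddings=512)
model = BertForMaskedLM(config)

# Dynamic masking at 20% probability, as listed in the Memory notes.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)

args = TrainingArguments(
    output_dir="plantbert_mlm",      # placeholder output directory
    num_train_epochs=10,
    per_device_train_batch_size=25,  # roughly the limit on a 24 GB RTX 3090
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
).train()
```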

Run time

  • For 10 epochs, 35 days are needed on an NVIDIA RTX 3070, versus 1.5 days on an RTX 3090.
