Resources for the paper: A Survey of Pre-trained Language Models for Processing Scientific Text
- Related survey papers
- Existing SciLMs
- Awesome scientific datasets
Related survey papers

- A Survey of Large Language Models - arXiv 2023
- Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey - Machine Intelligence Research 2023
- Pre-trained Language Models in Biomedical Domain: A Systematic Survey - ACM Computing Surveys 2023
- AMMU: A survey of transformer-based biomedical pretrained language models - Journal of Biomedical Informatics 2022
- Pre-Trained Language Models and Their Applications - Engineering 2022
- Pre-trained models: Past, present and future - AI Open, Volume 2, 2021
Existing SciLMs

Biomedical domain

No. | Year | Name | Base-model | Objective | #Parameters | Code |
---|---|---|---|---|---|---|
1 | 2019/01 | BioBERT | BERT | MLM, NSP | 110M | GitHub |
2 | 2019/02 | BERT-MIMIC | BERT | MLM, NSP | 110M, 340M | N/A |
3 | 2019/04 | BioELMo | ELMo | Bi-LM | 93.6M | GitHub |
4 | 2019/04 | Clinical BERT (Emily) | BERT | MLM, NSP | 110M | GitHub |
5 | 2019/04 | ClinicalBERT (Kexin) | BERT | MLM, NSP | 110M | GitHub |
6 | 2019/06 | BlueBERT | BERT | MLM, NSP | 110M, 340M | GitHub |
7 | 2019/06 | G-BERT | GNN + BERT | Self-Prediction, Dual-Prediction | 3M | GitHub |
8 | 2019/07 | BEHRT | BERT | MLM, NSP | N/A | GitHub |
9 | 2019/08 | BioFLAIR | FLAIR | Bi-LM | N/A | GitHub |
10 | 2019/09 | EhrBERT | BERT | MLM, NSP | 110M | GitHub |
11 | 2019/12 | Clinical XLNet | XLNet | Generalized Autoregressive Pretraining | 110M | GitHub |
12 | 2020/04 | GreenBioBERT | BERT | CBOW Word2Vec, Word Vector Space Alignment | 110M | GitHub |
13 | 2020/05 | BERT-XML | BERT | MLM, NSP | N/A | N/A |
14 | 2020/05 | Bio-ELECTRA | ELECTRA | Replaced Token Prediction | 14M | GitHub |
15 | 2020/05 | Med-BERT | BERT | MLM, Prolonged LOS Prediction | 110M | GitHub |
16 | 2020/05 | ouBioBERT | BERT | MLM, NSP | 110M | GitHub |
17 | 2020/07 | PubMedBERT | BERT | MLM, NSP, Whole-Word Masking | 110M | HuggingFace |
18 | 2020/08 | MCBERT | BERT | MLM, NSP | 110M, 340M | GitHub |
19 | 2020/09 | BioALBERT | ALBERT | MLM, SOP | 12M, 18M | GitHub |
20 | 2020/09 | BRLTM | BERT | MLM | N/A | GitHub |
21 | 2020/10 | BioMegatron | Megatron | MLM, NSP | 345M, 800M, 1.2B | GitHub |
22 | 2020/10 | CharacterBERT | BERT + Character-CNN | MLM, NSP | 105M | GitHub |
23 | 2020/10 | ClinicalTransformer | BERT / ALBERT / RoBERTa / ELECTRA | MLM, NSP / MLM, SOP / MLM / Replaced Token Prediction | 110M / 12M / 125M / 110M | GitHub
24 | 2020/10 | SapBERT | BERT | Multi-Similarity Loss | 110M | GitHub |
25 | 2020/10 | UmlsBERT | BERT | MLM | 110M | GitHub |
26 | 2020/11 | bert-for-radiology | BERT | MLM, NSP | 110M | GitHub |
27 | 2020/11 | Bio-LM | RoBERTa | MLM | 125M, 355M | GitHub |
28 | 2020/11 | CODER | PubMedBERT / mBERT | Contrastive Learning | 110M / 110M | GitHub
29 | 2020/11 | exBERT | BERT | MLM, NSP | N/A | GitHub |
30 | 2020/12 | BioMedBERT | BERT | MLM, NSP | 340M | GitHub |
31 | 2020/12 | LBERT | BERT | MLM, NSP | 110M | GitHub |
32 | 2021/04 | CovidBERT | BioBERT | MLM, NSP | 110M | N/A |
33 | 2021/04 | ELECTRAMed | ELECTRA | Replaced Token Prediction | N/A | GitHub |
34 | 2021/04 | KeBioLM | PubMedBERT | MLM, Entity Detection, Entity Linking | 110M | GitHub |
35 | 2021/04 | SINA-BERT | BERT | MLM | 110M | N/A |
36 | 2021/05 | ProteinBERT | BERT | Corrupted Token, Annotation Prediction | 16M | GitHub |
37 | 2021/05 | SciFive | T5 | Span Corruption Prediction | 220M, 770M | GitHub |
38 | 2021/06 | BioELECTRA | ELECTRA | Replaced Token Prediction | 110M | GitHub |
39 | 2021/06 | EntityBERT | BERT | Entity-centric MLM | 110M | N/A |
40 | 2021/07 | MedGPT | GPT-2 + GLU + RotaryEmbed | LM | N/A | N/A |
41 | 2021/08 | SMedBERT | SMedBERT | Masked Neighbor Modeling, Masked Mention Modeling, SOP, MLM | N/A | GitHub |
42 | 2021/09 | Bio-cli | RoBERTa | MLM, Subword Masking or Whole Word Masking | 125M | GitHub |
43 | 2021/11 | UTH-BERT | BERT | MLM, NSP | 110M | GitHub |
44 | 2021/12 | ChestXRayBERT | BERT | MLM, NSP | 110M | N/A |
45 | 2021/12 | MedRoBERTa.nl | RoBERTa | MLM | 123M | GitHub |
46 | 2021/12 | PubMedELECTRA | ELECTRA | Replaced Token Prediction | 110M, 335M | HuggingFace |
47 | 2022/01 | Clinical-BigBird | BigBird | MLM | 166M | GitHub |
48 | 2022/01 | Clinical-Longformer | Longformer | MLM | 149M | GitHub |
49 | 2022/03 | BioLinkBERT | BERT | MLM, Document Relation Prediction | 110M, 340M | GitHub |
50 | 2022/04 | BioBART | BART | Text Infilling, Sentence Permutation | 140M, 400M | GitHub |
51 | 2022/05 | bsc-bio-ehr-es | RoBERTa | MLM | 125M | GitHub |
52 | 2022/05 | PathologyBERT | BERT | MLM, NSP | 110M | HuggingFace |
53 | 2022/06 | RadBERT | RoBERTa | MLM | 110M | GitHub |
54 | 2022/06 | ViHealthBERT | BERT | MLM, NSP, Capitalized Prediction | 110M | GitHub |
55 | 2022/07 | Clinical Flair | Flair | Character-level Bi-LM | N/A | GitHub |
56 | 2022/08 | KM-BERT | BERT | MLM, NSP | 99M | GitHub |
57 | 2022/09 | BioGPT | GPT | Autoregressive Language Model | 347M, 1.5B | GitHub |
58 | 2022/10 | Bioberturk | BERT | MLM, NSP | N/A | GitHub |
59 | 2022/10 | DRAGON | GreaseLM | MLM, KG Link Prediction | 360M | GitHub |
60 | 2022/10 | UCSF-BERT | BERT | MLM, NSP | 135M | N/A |
61 | 2022/10 | ViPubmedT5 | ViT5 | Spans-masking learning | 220M | GitHub |
62 | 2022/12 | ALIBERT | BERT | MLM | 110M | N/A |
63 | 2022/12 | BioMedLM | GPT-2 | Autoregressive Language Model | 2.7B | GitHub
64 | 2022/12 | BioReader | T5 & RETRO | MLM | 229.5M | GitHub |
65 | 2022/12 | clinicalT5 | T5 | Span-mask Denoising Objective | 220M, 770M | N/A |
66 | 2022/12 | Gatortron | BERT | MLM | 8.9B | GitHub |
67 | 2022/12 | Med-PaLM | Flan-PaLM | Instruction Prompt Tuning | 540B | Official Site |
68 | 2023/01 | clinical-T5 | T5 | Fill-in-the-blank-style denoising objective | 220M, 770M | PhysioNet |
69 | 2023/01 | CPT-BigBird | BigBird | MLM | 166M | N/A |
70 | 2023/01 | CPT-Longformer | Longformer | MLM | 149M | N/A |
71 | 2023/02 | Bioformer | Bioformer | MLM, NSP | 43M | GitHub |
72 | 2023/02 | Lightweight | DistilBERT | MLM, Knowledge Distillation | 65M, 25M, 18M, 15M | GitHub |
73 | 2023/03 | RAMM | PubmedBERT | MLM, Contrastive Learning, Image-Text Matching | N/A | GitHub |
74 | 2023/04 | DrBERT | RoBERTa | MLM | 110M | GitHub |
75 | 2023/04 | MOTOR | BLIP | MLM, Contrastive Learning, Image-Text Matching | N/A | GitHub |
76 | 2023/05 | BiomedGPT | BART backbone + BERT-encoder + GPT-decoder | MLM | 33M, 93M, 182M | GitHub |
77 | 2023/05 | TurkRadBERT | BERT | MLM, NSP | 110M | N/A |
78 | 2023/06 | CamemBERT-bio | BERT | Whole Word MLM | 111M | HuggingFace |
79 | 2023/06 | ClinicalGPT | T5 | Supervised Fine Tuning, Rank-based Training | N/A | N/A |
80 | 2023/06 | EriBERTa | RoBERTa | MLM | 125M | N/A |
81 | 2023/06 | PharmBERT | BERT | MLM | 110M | GitHub |
82 | 2023/07 | BioNART | BERT | Non-AutoRegressive Model | 110M | GitHub |
83 | 2023/07 | BIOptimus | BERT | MLM | 110M | GitHub |
84 | 2023/07 | KEBLM | BERT | MLM, Contrastive Learning, Ranking Objective | N/A | N/A |
85 | 2023/09 | CPLLM | Llama2 | Autoregressive Language Model, Supervised Fine Tuning | 13B | GitHub |
86 | 2023/11 | MedCPT | BERT | Contrastive Learning, Ranking Objective | 110M | GitHub |
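Most encoder models in the table above are trained with MLM, often alongside NSP. As a rough illustration, here is a minimal sketch of BERT-style input corruption in plain Python; the whitespace tokenization, the `[MASK]` placeholder string, and the example sentence are illustrative assumptions (the 80/10/10 split follows the original BERT recipe):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=7):
    """BERT-style masking: sample ~15% of positions as prediction targets.
    Of those, 80% become [MASK], 10% are swapped for a random token, and
    10% are left unchanged; `labels` maps each target position to the
    original token the model must recover."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                # simplification: draw the replacement from this sentence
                # rather than from a full vocabulary
                corrupted[i] = rng.choice(tokens)
            # else: keep the original token as-is
    return corrupted, labels

tokens = "the mutant protein binds its receptor with reduced affinity".split()
corrupted, labels = mask_tokens(tokens)
```

The model then predicts the entries of `labels` from the corrupted sequence; NSP adds a second, sentence-pair classification objective on top of this input.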
Chemical domain

No. | Year | Name | Base-model | Objective | #Parameters | Code |
---|---|---|---|---|---|---|
1 | 2020/03 | NukeBERT | BERT | MLM, NSP | 110M | GitHub |
2 | 2020/10 | ChemBERTa | RoBERTa | MLM | 125M | GitHub |
3 | 2021/05 | NukeLM | SciBERT, RoBERTa | MLM | 125M, 355M, 110M | GitHub |
4 | 2021/06 | ChemBERT | RoBERTa | MLM | 110M | GitHub |
5 | 2021/09 | MatSciBERT | BERT | MLM | 110M | GitHub |
6 | 2021/10 | MatBERT | BERT | MLM | 110M | GitHub |
7 | 2022/05 | BatteryBERT | BERT, SciBERT | MLM | 110M | GitHub |
8 | 2022/05 | ChemGPT | GPT | Autoregressive Language Model | 1B | GitHub |
9 | 2022/08 | MaterialsBERT (Shetty) | PubMedBERT | MLM, NSP, Whole-Word Masking | 110M | GitHub |
10 | 2022/08 | ProcessBERT | BERT | MLM, NSP | 110M | N/A |
11 | 2022/09 | ChemBERTa-2 | RoBERTa | MLM, Multi-task Regression | 125M | GitHub |
12 | 2022/09 | MaterialBERT (Yoshitake) | BERT | MLM, NSP | 110M | MDR |
13 | 2023/01 | MolGen | BART | Seq2Seq MLM, Autoregressive Language Modeling | 460M, 7B | GitHub
14 | 2023/08 | GIT-Mol | GIT-Former | Xmodal-Text Matching, Xmodal-Text Contrastive Learning | 700M | N/A |
15 | 2023/10 | MolCA | Galactica | Molecule Captioning, Molecule-Text Contrastive Learning | 1.4B | GitHub |
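GIT-Mol and MolCA above (like SapBERT, CODER, and MedCPT in the biomedical table) rely on contrastive objectives that pull matched pairs together in embedding space. A minimal InfoNCE-style sketch over a precomputed in-batch similarity matrix; the matrix values and the temperature are illustrative assumptions, not values from any listed model:

```python
import math

def info_nce(sim, temperature=0.07):
    """InfoNCE-style contrastive loss: sim[i][j] is the similarity of
    query i (e.g. a molecule) to key j (e.g. a text); the positive key
    for query i is key i, and the other keys in the batch are negatives."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # shift by the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        losses.append(-math.log(exps[i] / sum(exps)))
    return sum(losses) / len(losses)

aligned = info_nce([[0.9, 0.1], [0.2, 0.8]])     # diagonal (positives) dominates
misaligned = info_nce([[0.1, 0.9], [0.8, 0.2]])  # positives score lowest
```

A well-aligned batch (high diagonal similarities) yields a much smaller loss than a misaligned one, which is what drives matched pairs together during training.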
Multi-domain (general scientific text)

No. | Year | Name | Base-model | Objective | #Parameters | Code |
---|---|---|---|---|---|---|
1 | 2019/03 | SciBERT (CS + Bio) | BERT | MLM, NSP | 110M | GitHub |
2 | 2019/11 | S2ORC-SciBERT | BERT | MLM, NSP | 110M | GitHub |
3 | 2020/04 | SPECTER | BERT | Triple-loss | 110M | GitHub |
4 | 2021/03 | OAG-BERT | BERT | MLM | 110M | GitHub |
5 | 2022/05 | ScholarBERT | BERT | MLM | 770M | HuggingFace |
6 | 2022/06 | SciDEBERTa | DeBERTa | MLM | N/A | GitHub |
7 | 2022/09 | CSL-T5 | T5 | Fill-in-the-blank-style denoising objective | 220M | GitHub |
8 | 2022/10 | AcademicRoBERTa | RoBERTa | MLM | 125M | GitHub |
9 | 2022/11 | Galactica | GPT | Autoregressive Language Model | 125M, 1.3B, 6.7B, 30B, 120B | GitHub |
10 | 2022/11 | VarMAE | RoBERTa | MLM | 110M | N/A |
11 | 2022/12 | SciBART | BART | MLM | 124M, 386M | GitHub
12 | 2023/05 | Patton | GNN + BERT | Network-contextualized MLM, Masked Node Prediction | N/A | GitHub |
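Several models in the tables (SciFive, clinicalT5, CSL-T5, SciBART) use span-corruption or text-infilling objectives: contiguous spans are replaced with sentinel tokens and the decoder reconstructs the dropped text. A minimal sketch, assuming pre-chosen span boundaries and T5-style `<extra_id_n>` sentinel names:

```python
def corrupt_spans(tokens, spans):
    """T5-style span corruption: replace each (start, end) span in the
    input with a sentinel token; the target lists each sentinel followed
    by the tokens it stands for."""
    inp, tgt = [], []
    cursor = 0
    for sid, (start, end) in enumerate(spans):
        inp.extend(tokens[cursor:start])   # keep text before the span
        sentinel = f"<extra_id_{sid}>"
        inp.append(sentinel)               # span collapses to one sentinel
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])      # target recovers the dropped span
        cursor = end
    inp.extend(tokens[cursor:])            # keep the tail after the last span
    return inp, tgt

tokens = "scientific language models learn from large corpora".split()
inp, tgt = corrupt_spans(tokens, [(1, 3), (5, 6)])
# inp: ['scientific', '<extra_id_0>', 'learn', 'from', '<extra_id_1>', 'corpora']
# tgt: ['<extra_id_0>', 'language', 'models', '<extra_id_1>', 'large']
```

BART's text infilling is a close relative: spans are replaced by a single mask token and the decoder regenerates the full original sequence rather than only the dropped spans.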
Other domains, sorted by domain name
No. | Year | Name | Base-model | Objective | #Parameters | Code | Domain |
---|---|---|---|---|---|---|---|
1 | 2022/04 | SecureBERT | RoBERTa | MLM | 125M | GitHub | Cybersecurity |
2 | 2022/12 | CySecBERT | BERT | MLM, NSP | 110M | N/A | Cybersecurity |
3 | 2021/05 | MathBERT (Peng) | BERT | MLM, Masked Substructure Prediction, Context Correspondence Prediction | 110M | N/A | Math |
4 | 2021/06 | MathBERT (Shen) | RoBERTa | MLM | 110M | GitHub | Math |
5 | 2021/10 | ClimateBert | DistilRoBERTa | MLM | 66M | GitHub | Climate
6 | 2020/02 | SciGPT2 | GPT-2 | LM | 124M | GitHub | CS
7 | 2023/06 | K2 | LLaMA | Cosine Loss | 7B | GitHub | Geoscience
8 | 2023/03 | ManuBERT | BERT | MLM | 110M, 126M | HuggingFace | Manufacturing
9 | 2023/01 | ProtST | BERT | Masked Protein Modeling, Contrastive Learning, Multi-modal Masked Prediction | N/A | GitHub | Protein |
10 | 2023/01 | SciEdBERT | BERT | MLM | 110M | N/A | Science Education |
11 | 2022/06 | SsciBERT | BERT | MLM, NSP | 110M | GitHub | Social Science |