
(Corpus) Size Does Matter: The Effects of In-Domain Corpus Size on Language Model Performance

XSC224u Final Project: Pre-training BERT from scratch on different sizes of biomedical domain corpora.

Project Repo Framework

1. DataCollection

  • All artifacts related to the collection of data (primarily through web scraping) for both pretraining and fine-tuning

2. Preprocessing

  • All artifacts related to the preprocessing of data from raw files up to the point of tokenization.
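Since the preprocessing stage ends at tokenization and BERT uses WordPiece subword tokenization, the core idea can be sketched as a greedy longest-match split. This is a minimal illustrative sketch, not the repository's actual implementation; the vocabulary and example words below are made up.

```python
# Sketch of WordPiece-style greedy longest-match tokenization (the scheme
# BERT uses). Vocabulary and inputs are illustrative, not from this repo.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # non-initial pieces carry a '##' prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if piece is None:
            return [unk]  # no piece matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

vocab = {"bio", "##med", "##ical", "corpus"}
print(wordpiece_tokenize("biomedical", vocab))  # ['bio', '##med', '##ical']
print(wordpiece_tokenize("corpus", vocab))      # ['corpus']
```

In practice a trained tokenizer (e.g. one built with a WordPiece trainer over the biomedical corpus) would supply the vocabulary; the point here is only how a domain vocabulary changes the granularity of the resulting subword splits.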

3. Modeling

  • All artifacts related to building and tuning our BERT language models

Expected Timeframe

  • March 11: Tokenization/Vocabulary scheme finalized + pretraining data secured
  • March 20: Model pretraining completed (for at least one model approach)
  • March 30: All experiments completed
  • April 3: Final paper due
  • April 8: Code completed