How to build with your own data

  1. Prepare your corpus and split it into small JSONL files, one JSON object per line, with the structure below (shown on several lines for readability). The format follows the Pile, but only the text key is required. Put the files in a folder named splits.
{
  "meta": {"pile_set_name": "Pile-CC"},
  "text": "It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on..."
}
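As a rough illustration of step 1, the sketch below writes an iterable of documents into numbered JSONL files inside a splits folder. The function name `write_splits`, the `docs_per_file` parameter, and the `00000.jsonl` naming scheme are assumptions made for this example; the README does not say which file names the build scripts expect, so adjust them as needed.

```python
import json
from pathlib import Path
from typing import Iterable, List


def write_splits(texts: Iterable[str], out_dir: str = "splits",
                 docs_per_file: int = 1000) -> List[Path]:
    """Write documents to numbered JSONL files, one JSON object per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    for i, text in enumerate(texts):
        if i % docs_per_file == 0:
            # Start a new split file every docs_per_file documents.
            if handle is not None:
                handle.close()
            # Hypothetical naming scheme: 00000.jsonl, 00001.jsonl, ...
            path = out / f"{i // docs_per_file:05d}.jsonl"
            handle = path.open("w", encoding="utf-8")
            paths.append(path)
        # Only the "text" key is written; ensure_ascii=False keeps Unicode readable.
        handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    if handle is not None:
        handle.close()
    return paths
```

Calling `write_splits(my_documents)` with any iterable of strings would produce files such as splits/00000.jsonl, each holding at most `docs_per_file` records.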
  1. Build metadata
python build_split_meta.py 
  1. Build shard
python build_shard.py
  1. Build database
python build_db.py
  1. Build the index. The Faiss parameters are currently hardcoded. Choosing an index type is a tradeoff between compute resources and recall; see the Faiss wiki for more details.
python build_index.py
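To make the compute versus recall tradeoff concrete, here is a toy, pure-Python sketch of an inverted-file style search. It does not use Faiss and is not what build_index.py does; every name in it is hypothetical. Vectors are grouped into buckets by nearest centroid, and a query scans only the `nprobe` closest buckets (Faiss IVF indexes expose a parameter of the same name). Scanning fewer buckets costs less but can miss true neighbors. For simplicity the centroids are sampled from the data, whereas real systems usually train them with k-means.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        buckets[nearest].append(vid)
    return buckets


def exact_search(query, vectors, k):
    """Brute-force top-k ids; ties are broken by id."""
    ids = range(len(vectors))
    return sorted(ids, key=lambda vid: (sq_dist(query, vectors[vid]), vid))[:k]


def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in buckets[c]]
    top = sorted(candidates, key=lambda vid: (sq_dist(query, vectors[vid]), vid))[:k]
    return top, len(candidates)


def recall_and_cost(vectors, centroids, buckets, queries, k, nprobe):
    """Return (mean recall@k, mean number of vectors scanned per query)."""
    hits = scanned = 0
    for q in queries:
        truth = set(exact_search(q, vectors, k))
        found, n = ivf_search(q, vectors, centroids, buckets, k, nprobe)
        hits += len(truth & set(found))
        scanned += n
    return hits / (k * len(queries)), scanned / len(queries)


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
    centroids = rng.sample(vectors, 16)  # assumption: stand-in for trained centroids
    buckets = build_ivf(vectors, centroids)
    queries = [[rng.random() for _ in range(8)] for _ in range(20)]
    for nprobe in (1, 4, 16):
        recall, cost = recall_and_cost(vectors, centroids, buckets, queries, 5, nprobe)
        print(f"nprobe={nprobe:2d} recall@5={recall:.2f} scanned={cost:.0f}")
```

Probing every bucket reproduces the exact result at the full scanning cost, while smaller `nprobe` values scan fewer vectors and may return lower recall. The hardcoded Faiss parameters make a similar choice on your behalf, so they may be worth revisiting for a corpus of a very different size.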