- Prepare your corpus and split it into small jsonl files with the following structure (like Pile, but we only need the `text` key), and put them into a folder named `splits`.
{
"meta": {"pile_set_name": "Pile-CC"},
"text": "It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on..."
}
- Build metadata
python build_split_meta.py
- Build shard
python build_shard.py
- Build database
python build_db.py
- Build index. The Faiss index parameters are currently hardcoded. Choosing an index type is a tradeoff between compute resources and recall; see the Faiss wiki for more details.
python build_index.py
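
The corpus preparation in the first step can be sketched as follows. This is a minimal illustration, not part of the repository: the function name `split_corpus`, the `docs_per_file` parameter, and the numbered file names are assumptions, and the build scripts may expect a different naming scheme. It writes one JSON object per line, keeping only the `text` key.

```python
import json
from pathlib import Path


def split_corpus(documents, out_dir="splits", docs_per_file=1000):
    """Write an iterable of strings to numbered .jsonl files.

    Hypothetical helper: the file naming (00000.jsonl, 00001.jsonl, ...)
    and the default chunk size are assumptions, not repository conventions.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        for i, text in enumerate(documents):
            # Start a new file every `docs_per_file` documents.
            if i % docs_per_file == 0:
                if handle is not None:
                    handle.close()
                file_path = out_path / f"{i // docs_per_file:05d}.jsonl"
                handle = file_path.open("w", encoding="utf-8")
                written.append(file_path)
            # One JSON object per line; only the "text" key is required.
            handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    finally:
        if handle is not None:
            handle.close()
    return written
```

For example, `split_corpus(my_texts, out_dir="splits", docs_per_file=10000)` would fill the `splits` folder before running `build_split_meta.py`, where `my_texts` stands for any iterable of document strings.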