Python implementation of DataSculpt, a framework for constructing long-context sequences through a multi-objective partition allocation strategy.
LLMs often fall short of their potential when processing extended contexts, underscoring the need for new methodologies that strengthen long-context modeling.
DataSculpt strategically aligns multiple objectives including relevance, homogeneity, integrity, and computational efficiency to optimize the data structure for long-context training.
On a 7B model, we achieve an 18.09% improvement in retrieval augmentation, 21.23% in summarization, 21.27% in reading comprehension, and 3.81% in code completion, while also improving the model's general proficiency by 4.88%.
The graphic below provides an overview of DataSculpt. Check out the paper for more details.
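As a rough illustration of the multi-objective idea, a candidate partition can be scored by combining per-objective terms into a weighted sum. The objectives, weights, and formulas below are illustrative placeholders, not DataSculpt's actual optimization; see the paper for the real formulation.

```python
# Illustrative multi-objective scoring for a candidate partition of documents.
# NOTE: the relevance/efficiency terms and weights are assumptions for
# demonstration only, not the paper's objective.
import math

def cosine(u, v):
    """Cosine similarity between two document embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def partition_score(doc_vectors, doc_lengths, context_window, w_rel=0.5, w_eff=0.5):
    """Weighted sum of a relevance term (mean pairwise cosine similarity)
    and an efficiency term (fraction of the context window filled)."""
    pairs = [(i, j) for i in range(len(doc_vectors))
             for j in range(i + 1, len(doc_vectors))]
    relevance = (sum(cosine(doc_vectors[i], doc_vectors[j]) for i, j in pairs)
                 / len(pairs)) if pairs else 0.0
    efficiency = min(sum(doc_lengths), context_window) / context_window
    return w_rel * relevance + w_eff * efficiency
```

In this sketch, packing two near-duplicate 8K-token documents into a 16K window would score highly on both terms; the real framework additionally accounts for homogeneity and document integrity.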
This codebase outputs constructed context sequences given a text dataset.
To get started, clone the repo and build the Docker image:
git clone git@github.com:8023looker/DataSculpt.git
cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t datasculpt/emr-serverless-spark .

We provide an example file in ./data_sample/input/ to demonstrate our pipeline. It is in jsonl format (./data_sample/input/part-00000), with each line representing one document.
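A quick sanity check of the input jsonl can be sketched as follows. The required field names (`content`, `docid`) follow the sample file; the helper itself is hypothetical and not part of the codebase.

```python
# Hypothetical validator for one line of the input jsonl; extra fields
# beyond "content" and "docid" are passed through untouched.
import json

def validate_input_line(line: str) -> dict:
    """Parse one jsonl line and check the fields DataSculpt expects."""
    doc = json.loads(line)
    assert isinstance(doc.get("content"), str) and doc["content"], "missing 'content'"
    assert isinstance(doc.get("docid"), str) and doc["docid"], "missing 'docid'"
    return doc

sample = '{"content": "This is an example of document content.", "docid": "falcon_talks.cam.acuk_0b1809"}'
doc = validate_input_line(sample)
print(doc["docid"])  # falcon_talks.cam.acuk_0b1809
```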
cd src/
bash run_datasculpt_pipeline.sh 16000 0.5 0.5 5 # context_window delta epsilon iter_T

Input format (one document per line):
{
"content": "This is an example of document content.",
"docid": "falcon_talks.cam.acuk_0b1809",
"...": "..."
}

Output format (one constructed sequence per line):
{
"total_token_num": 2,
"docs": [{
"content": "This is an example of document content.",
"docid": "falcon_talks.cam.acuk_0b1809",
"vector_encoded": [0.142877, "...", "..."],
"...": "..."
},{
"content": "This is an example of document content.",
"docid": "falcon_talks.cam.acuk_0b1809",
"vector_encoded": [0.142877, "...", "..."],
"...": "..."
}
]
}

[Optional] To run DataSculpt on your own dataset, provide data in the input format described above, referring to ./data_sample/input/part-00000.
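Consuming the output for training can be sketched as below. Field names follow the output example above; the reader helper and the document separator are assumptions, not the repo's actual API.

```python
# Hypothetical consumer of the output jsonl: each line holds one constructed
# context with its member documents ("docs") and a token count.
import json

def iter_sequences(path):
    """Yield (docs, total_token_num) for each constructed sequence."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            seq = json.loads(line)
            yield seq["docs"], seq.get("total_token_num")

def concat_text(docs, sep="\n\n"):
    """Join member documents into one training sequence.
    The separator is an assumption; use whatever your tokenizer expects."""
    return sep.join(d["content"] for d in docs)
```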
We post-train a 7B model on the concatenated sequences from DataSculpt with 16K, 32K, and 64K context lengths and compare it to the baselines (random sampling and ICLM).
If this was useful to you, please cite the paper:
@misc{lu2024datasculptcraftingdatalandscapes,
title={DataSculpt: Crafting Data Landscapes for Long-Context LLMs through Multi-Objective Partitioning},
author={Keer Lu and Xiaonan Nie and Zheng Liang and Da Pan and Shusen Zhang and Keshi Zhao and Weipeng Chen and Zenan Zhou and Guosheng Dong and Bin Cui and Wentao Zhang},
year={2024},
eprint={2409.00997},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.00997},
}