
DataSculpt: Crafting Data Landscapes for LLM Post-Training through Multi-objective Partitioning


Python implementation of DataSculpt, a framework for constructing long-context training sequences through a multi-objective partition allocation strategy. LLMs often fall short of their potential when processing extended contexts, which calls for new methodologies to strengthen their long-context modeling ability. DataSculpt jointly optimizes several objectives, including relevance, homogeneity, integrity, and computational efficiency, to structure data for long-context training.
On a 7B model, DataSculpt yields an 18.09% improvement in retrieval augmentation, 21.23% in summarization, 21.27% in reading comprehension, and a 3.81% rise in code completion, all while maintaining the model's general proficiency with a 4.88% enhancement. The graphic below provides an overview of DataSculpt. Check out the paper for more details. Given a text dataset, this codebase outputs the constructed context sequences.

Illustration of DataSculpt.
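To give a flavor of multi-objective allocation, the sketch below greedily packs documents into one context window by combining a relevance term, an integrity constraint (no truncation), and an efficiency term. The scoring function, weights, and greedy loop are illustrative assumptions, not DataSculpt's actual objective or algorithm:

```python
def score(doc_len, sim, remaining, w_rel=0.5, w_int=0.3, w_eff=0.2):
    """Toy score for adding one document to a partially built sequence.

    doc_len:   token length of the candidate document
    sim:       embedding similarity to the sequence so far (relevance/homogeneity)
    remaining: tokens still free in the context window
    Weights and terms are illustrative only.
    """
    if doc_len > remaining:              # integrity: never truncate a document
        return float("-inf")
    efficiency = doc_len / remaining     # efficiency: prefer filling the window tightly
    return w_rel * sim + w_int * 1.0 + w_eff * efficiency

def greedy_fill(docs, window=16000):
    """Greedily pack (doc_len, sim) pairs into one context window."""
    packed, used = [], 0
    pool = list(docs)
    while pool:
        best = max(pool, key=lambda d: score(d[0], d[1], window - used))
        if score(best[0], best[1], window - used) == float("-inf"):
            break                        # nothing left that fits whole
        packed.append(best)
        used += best[0]
        pool.remove(best)
    return packed, used

packed, used = greedy_fill([(8000, 0.9), (9000, 0.8), (7000, 0.95)])
print(len(packed), used)  # 2 15000
```

The real framework balances these objectives over an entire corpus rather than one window at a time; see the paper for the actual formulation.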

Installation

To get started, clone the repo and build the Docker image:

git clone git@github.com:8023looker/DataSculpt.git
cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t datasculpt/emr-serverless-spark .

Construct Pretraining Data using DataSculpt

We provide an example file in JSONL format (./data_sample/input/part-00000), with each line representing one document, to demonstrate our pipeline.

cd src/
bash run_datasculpt_pipeline.sh 16000 0.5 0.5 5 # context_window delta epsilon iter_T
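The four positional arguments map to context_window, delta, epsilon, and iter_T, as the trailing comment indicates. A tiny helper (hypothetical, not part of the repo) can build the invocation for other settings, e.g. a 32K window:

```python
def datasculpt_cmd(context_window, delta, epsilon, iter_t):
    """Build the pipeline invocation; argument order follows the example above."""
    return f"bash run_datasculpt_pipeline.sh {context_window} {delta} {epsilon} {iter_t}"

print(datasculpt_cmd(32000, 0.5, 0.5, 5))
# bash run_datasculpt_pipeline.sh 32000 0.5 0.5 5
```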

Data Format

Input

{
  "content": "This is an example of document content.",
  "docid": "falcon_talks.cam.acuk_0b1809",
  "...": "..."
}
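A minimal sketch of producing an input file in this format: one JSON object per line with at least "content" and "docid" (the samples suggest extra fields are carried along). The filename and second document are hypothetical:

```python
import json

# Each input line is one document with at least "content" and "docid".
docs = [
    {"content": "This is an example of document content.",
     "docid": "falcon_talks.cam.acuk_0b1809"},
    {"content": "Another short document.", "docid": "example_0001"},  # hypothetical
]

# Write JSONL: one JSON object per line, no trailing commas or wrapping array.
with open("part-00000.jsonl", "w", encoding="utf-8") as f:  # hypothetical filename
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```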

Output

{
  "total_token_num": 2,
  "docs": [{
      "content": "This is an example of document content.",
      "docid": "falcon_talks.cam.acuk_0b1809",
      "vector_encoded": [0.142877, "...", "..."],
      "...": "..."
    },{
      "content": "This is an example of document content.",
      "docid": "falcon_talks.cam.acuk_0b1809",
      "vector_encoded": [0.142877, "...", "..."],
      "...": "..."
    }
  ]
}
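Each output line is one constructed sequence: a token count plus the documents packed into it, each carrying its embedding under "vector_encoded". A small sketch of consuming such a line (the sample values here are abbreviated stand-ins, not real pipeline output):

```python
import json

# One abbreviated output line in the format shown above.
line = json.dumps({
    "total_token_num": 2,
    "docs": [
        {"content": "doc a", "docid": "a", "vector_encoded": [0.1, 0.2]},
        {"content": "doc b", "docid": "b", "vector_encoded": [0.3, 0.4]},
    ],
})

seq = json.loads(line)
ids = [d["docid"] for d in seq["docs"]]   # documents packed into this sequence
print(seq["total_token_num"], ids)        # 2 ['a', 'b']
```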

[Optional] To run DataSculpt on your own dataset, provide data in the input format above, referring to ./data_sample/input/part-00000.

Experimental Results

We post-train a 7B model on the concatenated sequences produced by DataSculpt with 16K, 32K, and 64K context lengths, and compare it to the baselines (random sampling and ICLM).

Results of DataSculpt.

Citation

If you find this work useful, please cite the paper:

@misc{lu2024datasculptcraftingdatalandscapes,
      title={DataSculpt: Crafting Data Landscapes for Long-Context LLMs through Multi-Objective Partitioning}, 
      author={Keer Lu and Xiaonan Nie and Zheng Liang and Da Pan and Shusen Zhang and Keshi Zhao and Weipeng Chen and Zenan Zhou and Guosheng Dong and Bin Cui and Wentao Zhang},
      year={2024},
      eprint={2409.00997},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.00997}, 
}
