Python implementation of DataSculpt, a framework for constructing long-context sequences through a multi-objective partition allocation strategy.
LLMs often fall short of their potential when processing extended contexts, underscoring the need for new methodologies that strengthen long-context modeling.
DataSculpt strategically aligns multiple objectives including relevance, homogeneity, integrity, and computational efficiency to optimize the data structure for long-context training.
On a 7B model, we achieve an 18.09% improvement in retrieval augmentation, 21.23% in summarization, 21.27% in reading comprehension, and 3.81% in code completion, while also improving the model's general proficiency by 4.88%.
The graphic below provides an overview of DataSculpt. Check out the paper for more details.
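As a rough illustration of the multi-objective idea, a candidate partition can be scored by combining per-objective terms into a weighted sum. The objectives, weights, and formulas below are illustrative placeholders, not DataSculpt's actual optimization; see the paper for the real formulation.

```python
# Illustrative multi-objective scoring for a candidate partition of documents.
# NOTE: the relevance/efficiency terms and weights are assumptions for
# demonstration only, not the paper's objective.
import math

def cosine(u, v):
    """Cosine similarity between two document embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def partition_score(doc_vectors, doc_lengths, context_window, w_rel=0.5, w_eff=0.5):
    """Weighted sum of a relevance term (mean pairwise cosine similarity)
    and an efficiency term (fraction of the context window filled)."""
    pairs = [(i, j) for i in range(len(doc_vectors))
             for j in range(i + 1, len(doc_vectors))]
    relevance = (sum(cosine(doc_vectors[i], doc_vectors[j]) for i, j in pairs)
                 / len(pairs)) if pairs else 0.0
    efficiency = min(sum(doc_lengths), context_window) / context_window
    return w_rel * relevance + w_eff * efficiency
```

In this sketch, packing two near-duplicate 8K-token documents into a 16K window would score highly on both terms; the real framework additionally accounts for homogeneity and document integrity.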
This codebase outputs constructed context sequences given a text dataset.
To get started, clone the repo and build the Docker image:
git clone git@github.com:8023looker/DataSculpt.git
cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t datasculpt/emr-serverless-spark .

We provide an example file in ./data_sample/input/ to demonstrate our pipeline. It is in jsonl format (./data_sample/input/part-00000), with each line representing one document.
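A quick sanity check of the input jsonl can be sketched as follows. The required field names (`content`, `docid`) follow the sample file; the helper itself is hypothetical and not part of the codebase.

```python
# Hypothetical validator for one line of the input jsonl; extra fields
# beyond "content" and "docid" are passed through untouched.
import json

def validate_input_line(line: str) -> dict:
    """Parse one jsonl line and check the fields DataSculpt expects."""
    doc = json.loads(line)
    assert isinstance(doc.get("content"), str) and doc["content"], "missing 'content'"
    assert isinstance(doc.get("docid"), str) and doc["docid"], "missing 'docid'"
    return doc

sample = '{"content": "This is an example of document content.", "docid": "falcon_talks.cam.acuk_0b1809"}'
doc = validate_input_line(sample)
print(doc["docid"])  # falcon_talks.cam.acuk_0b1809
```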
cd src/
bash run_datasculpt_pipeline.sh 16000 0.5 0.5 5 # context_window delta epsilon iter_T

Input format (one document per line):
{
"content": "This is an example of document content.",
"docid": "falcon_talks.cam.acuk_0b1809",
"...": "..."
}

Output format (one constructed sequence per line):
{
"total_token_num": 2,
"docs": [{
"content": "This is an example of document content.",
"docid": "falcon_talks.cam.acuk_0b1809",
"vector_encoded": [0.142877, "...", "..."],
"...": "..."
},{
"content": "This is an example of document content.",
"docid": "falcon_talks.cam.acuk_0b1809",
"vector_encoded": [0.142877, "...", "..."],
"...": "..."
}
]
}

[Optional] To run DataSculpt on your own dataset, provide data in the input format described above, referring to ./data_sample/input/part-00000.
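Consuming the output for training can be sketched as below. Field names follow the output example above; the reader helper and the document separator are assumptions, not the repo's actual API.

```python
# Hypothetical consumer of the output jsonl: each line holds one constructed
# context with its member documents ("docs") and a token count.
import json

def iter_sequences(path):
    """Yield (docs, total_token_num) for each constructed sequence."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            seq = json.loads(line)
            yield seq["docs"], seq.get("total_token_num")

def concat_text(docs, sep="\n\n"):
    """Join member documents into one training sequence.
    The separator is an assumption; use whatever your tokenizer expects."""
    return sep.join(d["content"] for d in docs)
```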
We post-train a 7B model on the concatenated sequences from DataSculpt with 16K, 32K, and 64K context lengths and compare it to the baselines (random sampling and ICLM).
If this was useful to you, please cite the paper:
@misc{lu2024datasculptcraftingdatalandscapes,
title={DataSculpt: Crafting Data Landscapes for Long-Context LLMs through Multi-Objective Partitioning},
author={Keer Lu and Xiaonan Nie and Zheng Liang and Da Pan and Shusen Zhang and Keshi Zhao and Weipeng Chen and Zenan Zhou and Guosheng Dong and Bin Cui and Wentao Zhang},
year={2024},
eprint={2409.00997},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.00997},
}