
Conversation

asad-aali
Contributor

For the bootstrap fine-tuning feature, I observed that DSPy would re-run bootstrapping every time, which was time-consuming and expensive. Caching did not solve the problem because my dataset is modified at random on every run.

This PR adds an optional bootstrapped_data_path argument to dspy.BootstrapFinetune(). It points to a .jsonl file saved from a previous (or another) fine-tuning run. When bootstrapped_data_path is provided, BootstrapFinetune(FinetuneTeleprompter) loads that data automatically and skips the bootstrap_trace_data step, avoiding repeated effort.
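For illustration, a minimal sketch of the intended usage. The metric, program, and trainset below are placeholders, and bootstrapped_data_path is the parameter proposed in this PR, not part of released DSPy:

```python
import dspy

def exact_match(example, prediction, trace=None):
    # Placeholder metric for illustration.
    return example.answer == prediction.answer

program = dspy.Predict("question -> answer")   # placeholder student program
trainset = [dspy.Example(question="2+2?", answer="4").with_inputs("question")]

# Proposed option (this PR): reuse previously saved bootstrapped traces from a
# .jsonl file and skip the bootstrap_trace_data step.
optimizer = dspy.BootstrapFinetune(
    metric=exact_match,
    bootstrapped_data_path="traces/run1.jsonl",   # proposed parameter, not in released DSPy
)
finetuned_program = optimizer.compile(program, trainset=trainset)
```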

…apped training data (.jsonl) from a local path, for fine-tuning
@asad-aali asad-aali changed the title Added option for loading pre-saved bootstrapped training data (.jsonl) for fine-tuning Added option for loading pre-saved bootstrapped training data for fine-tuning May 22, 2025
@okhat
Collaborator

okhat commented May 22, 2025

Thank you @asad-aali! This complicates the logic a bit, in light of new optimizers that do fine-tuning in DSPy...

QQ: Can you handle this by making the randomization on your end more deterministic? I.e., randomize via a hash of the input, so the randomness is fixed per example every time.
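As a rough sketch of what that could look like (assuming a plain-text input field; the function name is illustrative):

```python
import hashlib
import random

def rng_for_example(text: str) -> random.Random:
    # Seed a per-example RNG from a stable hash of the input, so any random
    # augmentation of that example is identical on every run (and caching holds).
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    return random.Random(seed)

rng = rng_for_example("What is the capital of France?")
print(rng.random())  # same value on every run for this input
```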

@asad-aali
Contributor Author

Thanks for the feedback @okhat! You're right that with fully deterministic inputs, DSPy's caching can prevent redundant bootstrapping. Still, the option to reuse the same bootstrapped data can support reproducibility across machines and pipelines (for more apples-to-apples analyses), especially when the traces come from a high-cost teacher (e.g., GPT-4).

That said, happy to defer the decision to you!

@okhat okhat closed this Aug 11, 2025