
Conversation

asad-aali
Contributor

For the bootstrap fine-tuning feature, I observed that DSPy would re-run bootstrapping every time, which was time-consuming and expensive. Caching did not solve the problem because my dataset is modified at random on every run.

This PR adds an optional bootstrapped_data_path argument to dspy.BootstrapFinetune(). It points to a .jsonl file saved from a previous (or another) fine-tuning run. When bootstrapped_data_path is provided, BootstrapFinetune(FinetuneTeleprompter) loads that data automatically and skips the bootstrap_trace_data step, avoiding repeated effort.
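For illustration, a minimal sketch of the intended usage. The metric, program, and trainset below are placeholders, and bootstrapped_data_path is the parameter proposed in this PR, not part of released DSPy:

```python
import dspy

def exact_match(example, prediction, trace=None):
    # Placeholder metric for illustration.
    return example.answer == prediction.answer

program = dspy.Predict("question -> answer")   # placeholder student program
trainset = [dspy.Example(question="2+2?", answer="4").with_inputs("question")]

# Proposed option (this PR): reuse previously saved bootstrapped traces from a
# .jsonl file and skip the bootstrap_trace_data step.
optimizer = dspy.BootstrapFinetune(
    metric=exact_match,
    bootstrapped_data_path="traces/run1.jsonl",   # proposed parameter, not in released DSPy
)
finetuned_program = optimizer.compile(program, trainset=trainset)
```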

…apped training data (.jsonl) from a local path, for fine-tuning
@asad-aali asad-aali changed the title Added option for loading pre-saved bootstrapped training data (.jsonl) for fine-tuning Added option for loading pre-saved bootstrapped training data for fine-tuning May 22, 2025
@okhat
Collaborator

okhat commented May 22, 2025

Thank you @asad-aali! This complicates the logic a bit, in light of new optimizers that do fine-tuning in DSPy...

QQ: Can you handle this by making the randomization on your end more deterministic? I.e., randomize via a hash of the input, so the randomness is fixed per example every time.
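As a rough sketch of what that could look like (assuming a plain-text input field; the function name is illustrative):

```python
import hashlib
import random

def rng_for_example(text: str) -> random.Random:
    # Seed a per-example RNG from a stable hash of the input, so any random
    # augmentation of that example is identical on every run (and caching holds).
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    return random.Random(seed)

rng = rng_for_example("What is the capital of France?")
print(rng.random())  # same value on every run for this input
```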

@asad-aali
Contributor Author

Thanks for the feedback @okhat! You're right that with fully deterministic inputs, DSPy's caching can prevent redundant bootstrapping. Still, the option to reuse the same bootstrapped data can support reproducibility across machines and pipelines (for more apples-to-apples analyses), especially when the traces come from a high-cost teacher (e.g., GPT-4).

That said, happy to defer the decision to you!

@okhat okhat closed this Aug 11, 2025