# AutoPrompting with datasets

In [2]:
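from datasets import load_dataset

# Classification data: the first 100 SST-2 training sentences and their
# sentiment labels (0 = negative, 1 = positive)
sst2 = load_dataset("stanfordnlp/sst2")
class_dataset = sst2['train']['sentence'][:100]
class_targets = sst2['train']['label'][:100]

# Generation data: the first 100 SAMSum dialogues and their reference summaries
samsum = load_dataset("knkarthick/samsum")
gen_dataset = samsum['train']['dialogue'][:100]
gen_targets = samsum['train']['summary'][:100]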
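
Before tuning, it can help to confirm that inputs and targets line up and to look at the class balance of the slice. The sketch below uses a small stand-in list in place of class_dataset and class_targets so that it runs without downloading anything; describe_split is a hypothetical helper, not part of coolprompt or datasets.

```python
from collections import Counter


def describe_split(samples, targets):
    """Check that samples and targets are aligned and count each label."""
    if len(samples) != len(targets):
        raise ValueError(f"{len(samples)} samples but {len(targets)} targets")
    return Counter(targets)


# Stand-in data (assumption); in the notebook you would pass
# class_dataset and class_targets instead.
demo_sentences = ["great film", "dull plot", "loved it", "not for me"]
demo_labels = [1, 0, 1, 0]

label_counts = describe_split(demo_sentences, demo_labels)
print(label_counts)
```

A strongly skewed count would be a reason to prefer f1 over accuracy as the tuning metric.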

Starting with PromptTuner

In [3]:
from coolprompt.assistant import PromptTuner

# Define an initial prompt
class_start_prompt = "Classify sentence sentiment"

# Initialize the tuner
tuner = PromptTuner()

# Call prompt optimization with dataset and target
final_prompt = tuner.run(
    start_prompt=class_start_prompt,
    task="classification",
    dataset=class_dataset,
    target=class_targets,
    metric="f1"
)

[2025-07-09 00:56:30,551] [INFO] [llm.init] - Initializing default model
[2025-07-09 00:56:30,551] [DEBUG] [llm.init] - Updating default model params with langchain config: None and vllm_engine_config: None


INFO 07-09 00:56:32 __init__.py:207] Automatically detected platform cuda.
INFO 07-09 00:56:40 config.py:549] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
INFO 07-09 00:56:40 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='t-tech/T-lite-it-1.0', speculative_config=None, tokenizer='t-tech/T-lite-it-1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=Fals

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 07-09 00:58:48 model_runner.py:1115] Loading model weights took 14.2426 GB
INFO 07-09 00:58:53 worker.py:267] Memory profiling takes 5.28 seconds
INFO 07-09 00:58:53 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.27GiB
INFO 07-09 00:58:53 worker.py:267] model weights take 14.24GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 4.35GiB; the rest of the memory reserved for KV Cache is 2.62GiB.
INFO 07-09 00:58:54 executor_base.py:111] # cuda blocks: 3069, # CPU blocks: 4681
INFO 07-09 00:58:54 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 1.50x
INFO 07-09 00:59:01 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_ut

Capturing CUDA graph shapes: 100%|██████████████████████████| 35/35 [00:16<00:00,  2.11it/s]

INFO 07-09 00:59:18 model_runner.py:1562] Graph capturing finished in 17 secs, took 0.21 GiB
INFO 07-09 00:59:18 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 30.00 seconds



[2025-07-09 00:59:18,449] [INFO] [assistant.__init__] - Validating the model: VLLM
[2025-07-09 00:59:18,451] [INFO] [assistant.__init__] - PromptTuner successfully initialized with model: VLLM
[2025-07-09 00:59:18,452] [INFO] [assistant.run] - Validating args for PromptTuner running
[2025-07-09 00:59:19,205] [INFO] [evaluator.__init__] - Evaluator sucessfully initialized with f1 metric
[2025-07-09 00:59:19,206] [INFO] [assistant.run] - === Starting Prompt Optimization ===
[2025-07-09 00:59:19,206] [INFO] [assistant.run] - Method: hype, Task: classification
[2025-07-09 00:59:19,207] [INFO] [assistant.run] - Metric: f1, Validation size: 0.25
[2025-07-09 00:59:19,208] [INFO] [assistant.run] - Dataset: 100 samples
[2025-07-09 00:59:19,208] [INFO] [assistant.run] - Target: 100 samples
[2025-07-09 00:59:19,209] [INFO] [hype.hype_optimizer] - Running HyPE optimization...
Processed prompts: 100%|█| 1/1 [00:00<00:00,  1.32it/s, est. speed input: 328.07 toks/s, out
[2025-07-09 00:59:19,993] [IN

In [4]:
print("Final prompt:", final_prompt)
print("Start prompt metric: ", tuner.init_metric)
print("Final prompt metric: ", tuner.final_metric)

Final prompt: Please provide a sentence and classify its sentiment as positive, negative, or neutral.
Start prompt metric:  0.6364983164983165
Final prompt metric:  0.8899889988998899
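
For reference, the F1 score reported above can be reproduced by hand on a small example. The notebook does not say which averaging the library applies, so the sketch below assumes macro averaging (the unweighted mean of per-class F1); macro_f1 is an illustrative function, not the library's evaluator.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("no labels to score")
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (assumption), using the SST-2 convention 0 = negative, 1 = positive
demo_true = [1, 0, 1, 1, 0]
demo_pred = [1, 0, 0, 1, 1]
print(macro_f1(demo_true, demo_pred))
```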


You can do the same for a generation task.

You can also reuse the previous tuner, which is already bound to the LLM.

In [4]:
gen_start_prompt = "Summarize this text"

final_prompt = tuner.run(
    start_prompt=gen_start_prompt,
    task="generation",
    dataset=gen_dataset,
    target=gen_targets,
    metric="meteor"
)

print("Final prompt:", final_prompt)
print("Start prompt metric: ", tuner.init_metric)
print("Final prompt metric: ", tuner.final_metric)

[2025-07-09 01:09:57,414] [INFO] [assistant.run] - Validating args for PromptTuner running
[nltk_data] Downloading package wordnet to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[2025-07-09 01:09:59,450] [INFO] [evaluator.__init__] - Evaluator sucessfully initialized with meteor metric
[2025-07-09 01:09:59,450] [INFO] [assistant.run] - === Starting Prompt Optimization ===
[2025-07-09 01:09:59,451] [INFO] [assistant.run] - Method: hype, Task: generation
[2025-07-09 01:09:59,451] [INFO] [assistant.run] - Metric: meteor, Validation size: 0.25
[2025-07-09 01:09:59,452] [INFO] [assistant.run] - Dataset: 100 samples
[2025-07-09 01:09:59,453] [INFO] 

Final prompt: Please provide a concise summary of the text, focusing on the main points and omitting any unnecessary details.
Start prompt metric:  0.2889385547946554
Final prompt metric:  0.30479966549868787


Currently supported metrics:
- accuracy and f1 for classification
- meteor, bleu and rouge for generation

The metric must match the task type.
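
The pairing of task and metric listed above can be checked before starting a long run. This is a minimal sketch: SUPPORTED_METRICS and check_metric are hypothetical names introduced here, and the library performs its own argument validation, which may differ.

```python
# Task-to-metric pairing as listed in this notebook
SUPPORTED_METRICS = {
    "classification": {"accuracy", "f1"},
    "generation": {"meteor", "bleu", "rouge"},
}


def check_metric(task, metric):
    """Return the metric if it is valid for the task, else raise ValueError."""
    if task not in SUPPORTED_METRICS:
        raise ValueError(f"unknown task: {task!r}")
    if metric not in SUPPORTED_METRICS[task]:
        allowed = ", ".join(sorted(SUPPORTED_METRICS[task]))
        raise ValueError(f"metric {metric!r} is not valid for {task}; use one of: {allowed}")
    return metric


print(check_metric("classification", "f1"))
```

Calling check_metric("generation", "f1") raises a ValueError, because f1 is listed only for classification.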

There are two ways to initialize the tuner with a custom LLM.

One is to initialize a model yourself and pass it to the tuner:

In [None]:
from langchain_community.llms import VLLM

my_model = VLLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    trust_remote_code=True,
    dtype='float16',
)

tuner_with_custom_llm = PromptTuner(model=my_model)

INFO 07-09 13:17:00 __init__.py:207] Automatically detected platform cuda.
INFO 07-09 13:17:08 config.py:549] This model supports multiple tasks: {'classify', 'reward', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
INFO 07-09 13:17:08 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 07-09 13:17:08 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgram

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 07-09 13:17:32 model_runner.py:1115] Loading model weights took 3.3460 GB
INFO 07-09 13:17:33 worker.py:267] Memory profiling takes 0.51 seconds
INFO 07-09 13:17:33 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.27GiB
INFO 07-09 13:17:33 worker.py:267] model weights take 3.35GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 16.48GiB.
INFO 07-09 13:17:33 executor_base.py:111] # cuda blocks: 38573, # CPU blocks: 9362
INFO 07-09 13:17:33 executor_base.py:116] Maximum concurrency for 131072 tokens per request: 4.71x
INFO 07-09 13:17:39 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_u

Capturing CUDA graph shapes: 100%|██████████████████████████| 35/35 [00:14<00:00,  2.38it/s]

INFO 07-09 13:17:53 model_runner.py:1562] Graph capturing finished in 15 secs, took 0.20 GiB
INFO 07-09 13:17:53 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 20.99 seconds



[2025-07-09 13:17:54,325] [INFO] [assistant.__init__] - Validating the model: VLLM
[2025-07-09 13:17:54,326] [INFO] [assistant.__init__] - PromptTuner successfully initialized with model: VLLM


In [3]:
tuner_with_custom_llm.run(start_prompt="Explain the difference between a cat and a kitten")

[2025-07-09 13:18:43,972] [INFO] [assistant.run] - Validating args for PromptTuner running
[nltk_data] Downloading package wordnet to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[2025-07-09 13:18:44,730] [INFO] [evaluator.__init__] - Evaluator sucessfully initialized with meteor metric
[2025-07-09 13:18:44,730] [INFO] [assistant.run] - === Starting Prompt Optimization ===
[2025-07-09 13:18:44,731] [INFO] [assistant.run] - Method: hype, Task: generation
[2025-07-09 13:18:44,732] [INFO] [assistant.run] - Metric: meteor, Validation size: 0.25
[2025-07-09 13:18:44,733] [INFO] [assistant.run] - No dataset provided
[2025-07-09 13:18:44,733] [INFO] [

'An instructive prompt to explain the difference between a cat and a kitten [...] '

The other is to change the config of the default model:

In [None]:
from coolprompt.language_model.llm import DefaultLLM

changed_model = DefaultLLM.init(langchain_config={
    "max_new_tokens": 1000,
    "temperature": 0.5,
})

tuner_with_changed_llm = PromptTuner(model=changed_model)

[2025-07-09 13:22:45,700] [INFO] [llm.init] - Initializing default model
[2025-07-09 13:22:45,701] [DEBUG] [llm.init] - Updating default model params with langchain config: {'max_new_tokens': 1000, 'temperature': 0.5} and vllm_engine_config: None


INFO 07-09 13:22:47 __init__.py:207] Automatically detected platform cuda.
INFO 07-09 13:22:56 config.py:549] This model supports multiple tasks: {'score', 'generate', 'reward', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 07-09 13:22:56 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='t-tech/T-lite-it-1.0', speculative_config=None, tokenizer='t-tech/T-lite-it-1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=Fals

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 07-09 13:24:51 model_runner.py:1115] Loading model weights took 14.2426 GB
INFO 07-09 13:24:57 worker.py:267] Memory profiling takes 5.31 seconds
INFO 07-09 13:24:57 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.27GiB
INFO 07-09 13:24:57 worker.py:267] model weights take 14.24GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 4.35GiB; the rest of the memory reserved for KV Cache is 2.62GiB.
INFO 07-09 13:24:57 executor_base.py:111] # cuda blocks: 3069, # CPU blocks: 4681
INFO 07-09 13:24:57 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 1.50x
INFO 07-09 13:25:05 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_ut

Capturing CUDA graph shapes: 100%|██████████████████████████| 35/35 [00:16<00:00,  2.08it/s]

INFO 07-09 13:25:22 model_runner.py:1562] Graph capturing finished in 17 secs, took 0.21 GiB
INFO 07-09 13:25:22 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 30.43 seconds



[2025-07-09 13:25:22,233] [INFO] [assistant.__init__] - Validating the model: VLLM
[2025-07-09 13:25:22,233] [INFO] [assistant.__init__] - PromptTuner successfully initialized with model: VLLM


In [2]:
tuner_with_changed_llm.run(start_prompt="Explain the difference between a cat and a kitten")

[2025-07-09 13:25:35,352] [INFO] [assistant.run] - Validating args for PromptTuner running
[nltk_data] Downloading package wordnet to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /nfs/home/asitkina/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[2025-07-09 13:25:37,042] [INFO] [evaluator.__init__] - Evaluator sucessfully initialized with meteor metric
[2025-07-09 13:25:37,043] [INFO] [assistant.run] - === Starting Prompt Optimization ===
[2025-07-09 13:25:37,043] [INFO] [assistant.run] - Method: hype, Task: generation
[2025-07-09 13:25:37,043] [INFO] [assistant.run] - Metric: meteor, Validation size: 0.25
[2025-07-09 13:25:37,044] [INFO] [assistant.run] - No dataset provided
[2025-07-09 13:25:37,044] [INFO] [

'Please provide a detailed comparison between a cat and a kitten, focusing on their physical characteristics, behaviors, and life stages.'

You can access the prompts and their metrics through the tuner's fields:

In [7]:
def print_results(tuner):
    print("Start prompt:", tuner.init_prompt)
    print("Final prompt:", tuner.final_prompt)
    print("Start prompt metric: ", tuner.init_metric)
    print("Final prompt metric: ", tuner.final_metric)

print_results(tuner)

Start prompt: Summarize this text
Final prompt: Please provide a concise summary of the text, focusing on the main points and omitting any unnecessary details.
Start prompt metric:  0.2889385547946554
Final prompt metric:  0.30479966549868787
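
The same fields can be used to express the change as a relative gain. The sketch below is illustrative: relative_gain is a hypothetical helper, and a SimpleNamespace stands in for the tuner so that the example runs without a loaded model. Only the init_metric and final_metric attributes shown above are assumed.

```python
from types import SimpleNamespace


def relative_gain(tuner):
    """Relative change of the final metric over the start metric."""
    if tuner.init_metric == 0:
        raise ValueError("start metric is zero; relative gain is undefined")
    return (tuner.final_metric - tuner.init_metric) / tuner.init_metric


# Stand-in object carrying the summarization metrics printed above
# (assumption: used so the sketch runs without a loaded model).
demo_tuner = SimpleNamespace(
    init_metric=0.2889385547946554,
    final_metric=0.30479966549868787,
)

print(f"relative gain: {relative_gain(demo_tuner):.1%}")
```

With a real tuner, relative_gain(tuner) can be called right after tuner.run, since each run overwrites these fields.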


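Three optimizers are currently implemented, selected with the method argument of tuner.run:
- HyPE (method="hype"): optimizes with a predefined system instruction; the logs above show it is used when no method is given
- DistillPrompt (method="distill")
- ReflectivePrompt (method="reflective")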

In [8]:
tuner.run(
    start_prompt=class_start_prompt,
    task="classification",
    method="hype",
    dataset=class_dataset,
    target=class_targets,
    metric="f1"
)

print_results(tuner)

[2025-07-09 01:11:35,596] [INFO] [assistant.run] - Validating args for PromptTuner running
[2025-07-09 01:11:36,329] [INFO] [evaluator.__init__] - Evaluator sucessfully initialized with f1 metric
[2025-07-09 01:11:36,330] [INFO] [assistant.run] - === Starting Prompt Optimization ===
[2025-07-09 01:11:36,330] [INFO] [assistant.run] - Method: hype, Task: classification
[2025-07-09 01:11:36,331] [INFO] [assistant.run] - Metric: f1, Validation size: 0.25
[2025-07-09 01:11:36,332] [INFO] [assistant.run] - Dataset: 100 samples
[2025-07-09 01:11:36,332] [INFO] [assistant.run] - Target: 100 samples
[2025-07-09 01:11:36,333] [INFO] [hype.hype_optimizer] - Running HyPE optimization...
Processed prompts: 100%|█| 1/1 [00:00<00:00,  1.29it/s, est. speed input: 320.47 toks/s, out
[2025-07-09 01:11:37,114] [INFO] [hype.hype_optimizer] - HyPE optimization completed
[2025-07-09 01:11:37,115] [INFO] [assistant.run] - Evaluating on given dataset for classification task...
[2025-07-09 01:11:37,115] [INFO]

Start prompt: Classify sentence sentiment
Final prompt: Please provide a sentence and classify its sentiment as positive, negative, or neutral.
Start prompt metric:  0.6364983164983165
Final prompt metric:  0.8899889988998899


In [6]:
tuner.run(
    start_prompt=class_start_prompt,
    task="classification",
    dataset=class_dataset[:20],
    target=class_targets[:20],
    method="distill",
    use_cache=False,
    num_epochs=1
)

print_results(tuner)

[2025-07-09 14:09:40,871] [INFO] [assistant.run] - Validating args for PromptTuner running
[2025-07-09 14:09:41,621] [INFO] [evaluator.__init__] - Evaluator sucessfully initialized with f1 metric
[2025-07-09 14:09:41,622] [INFO] [assistant.run] - === Starting Prompt Optimization ===
[2025-07-09 14:09:41,623] [INFO] [assistant.run] - Method: distill, Task: classification
[2025-07-09 14:09:41,624] [INFO] [assistant.run] - Metric: f1, Validation size: 0.25
[2025-07-09 14:09:41,624] [INFO] [assistant.run] - Dataset: 20 samples
[2025-07-09 14:09:41,625] [INFO] [assistant.run] - Target: 20 samples
[2025-07-09 14:09:41,637] [INFO] [distiller.distillation] - Starting DistillPrompt optimization...
[2025-07-09 14:09:41,638] [INFO] [evaluator.evaluate] - Evaluating prompt for classification task on 15 samples
Processed prompts: 100%|█| 15/15 [01:57<00:00,  7.84s/it, est. speed input: 6.04 toks/s, out
  0%|                                                                 | 0/1 [00:00<?, ?it/s][2025



Processed prompts: 100%|█| 15/15 [02:21<00:00,  9.42s/it, est. speed input: 7.04 toks/s, out
Processed prompts: 100%|█| 4/4 [01:50<00:00, 27.51s/it, est. speed input: 5.82 toks/s, outpu
[2025-07-09 14:27:34,202] [INFO] [evaluator.evaluate] - Evaluating prompt for classification task on 15 samples
Processed prompts: 100%|█| 15/15 [00:00<00:00, 24.19it/s, est. speed input: 1508.09 toks/s, 
[2025-07-09 14:27:34,845] [INFO] [evaluator.evaluate] - Evaluating prompt for classification task on 15 samples
Processed prompts: 100%|█| 15/15 [00:01<00:00,  8.11it/s, est. speed input: 813.79 toks/s, o
[2025-07-09 14:27:36,716] [INFO] [evaluator.evaluate] - Evaluating prompt for classification task on 15 samples
Processed prompts: 100%|█| 15/15 [02:01<00:00,  8.13s/it, est. speed input: 7.66 toks/s, out
[2025-07-09 14:29:38,725] [INFO] [evaluator.evaluate] - Evaluating prompt for classification task on 15 samples
Processed prompts: 100%|█| 15/15 [02:11<00:00,  8.76s/it, est. speed input: 7.57 toks/s

Start prompt: Classify sentence sentiment
Final prompt: Identify the sentiment of each sentence in the provided text. Indicate the sentiment as positive, negative, or neutral.
Start prompt metric:  0.21031746031746032
Final prompt metric:  0.7916666666666667
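
The log above reports a validation size of 0.25 and evaluates candidate prompts on 15 of the 20 samples, which is consistent with holding out a quarter of the data. A minimal sketch of such a split follows, assuming a simple shuffled hold-out; how the library actually splits the data is not shown in this notebook, and holdout_split is a hypothetical helper.

```python
import random


def holdout_split(samples, targets, validation_size=0.25, seed=0):
    """Shuffle (sample, target) pairs and hold out a validation fraction."""
    if len(samples) != len(targets):
        raise ValueError("samples and targets must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * validation_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = [(samples[i], targets[i]) for i in train_idx]
    val = [(samples[i], targets[i]) for i in val_idx]
    return train, val


# Stand-in data with the same size as the 20-sample runs (assumption)
demo_samples = [f"sentence {i}" for i in range(20)]
demo_targets = [i % 2 for i in range(20)]

train_pairs, val_pairs = holdout_split(demo_samples, demo_targets)
print(len(train_pairs), len(val_pairs))
```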


In [7]:
tuner.run(
    start_prompt=class_start_prompt,
    task="classification",
    dataset=class_dataset[:20],
    target=class_targets[:20],
    method="reflective",
    problem_description="sentiment classification",
    use_cache=False,
    population_size=4,
    num_epochs=1
)

print_results(tuner)

[2025-07-09 14:41:55,676] [INFO] [assistant.run] - Validating args for PromptTuner running
[2025-07-09 14:41:59,027] [INFO] [evaluator.__init__] - Evaluator sucessfully initialized with f1 metric
[2025-07-09 14:41:59,027] [INFO] [assistant.run] - === Starting Prompt Optimization ===
[2025-07-09 14:41:59,028] [INFO] [assistant.run] - Method: reflective, Task: classification
[2025-07-09 14:41:59,028] [INFO] [assistant.run] - Metric: f1, Validation size: 0.25
[2025-07-09 14:41:59,028] [INFO] [assistant.run] - Dataset: 20 samples
[2025-07-09 14:41:59,029] [INFO] [assistant.run] - Target: 20 samples
[2025-07-09 14:41:59,030] [INFO] [run.reflectiveprompt] - Starting ReflectivePrompt optimization...
[2025-07-09 14:41:59,031] [INFO] [evoluter._init_pop] - Initializing population...
Processed prompts: 100%|█| 1/1 [00:01<00:00,  1.77s/it, est. speed input: 44.66 toks/s, outp
[2025-07-09 14:42:00,805] [INFO] [evoluter._evaluation] - Evaluating population...
[2025-07-09 14:42:00,807] [INFO] [evalu

Start prompt: Classify sentence sentiment
Final prompt: Classify the sentiment of the sentence into one of the following categories: positive, negative, or neutral. Provide a detailed explanation for your classification.
Start prompt metric:  0.21031746031746032
Final prompt metric:  0.8465473145780051
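
Because every call to tuner.run overwrites the tuner's fields, it can be useful to record the metrics of each run before starting the next one. The sketch below compares the two 20-sample runs using the numbers printed above; RunResult is a hypothetical container introduced for illustration. The HyPE run is left out because it used 100 samples and is not directly comparable.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    method: str
    init_metric: float
    final_metric: float

    @property
    def gain(self) -> float:
        """Absolute improvement of the final prompt over the start prompt."""
        return self.final_metric - self.init_metric


# Metrics copied from the DistillPrompt and ReflectivePrompt outputs above
results = [
    RunResult("distill", 0.21031746031746032, 0.7916666666666667),
    RunResult("reflective", 0.21031746031746032, 0.8465473145780051),
]

best = max(results, key=lambda r: r.final_metric)
print(f"best method: {best.method} (gain {best.gain:.3f})")
```

With only 20 samples and a single epoch per run, a difference of this size is a rough indication rather than a ranking of the methods.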
