# Prompt Optimization: Testing and Implementation

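This notebook smoke-tests each component (Ollama connection, structured JSON output, dataset loading, evaluators, token metering, and the three optimizers) before running the full experiments.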

## 1. Test Ollama Connection

In [27]:
// Testing Ollama Connection
import ollama from "npm:ollama";

try {
  const response = await ollama.chat({
    model: 'mistral:7b',
    messages: [{ role: 'user', content: 'Tell me a fun fact about Quebec' }],
    stream: false
  });
  
  console.log("Response:", response.message.content);
} catch (error) {
  console.error("Error:", error.message);
}

Response:  In Quebec, Canada, the French language is the primary and only officially recognized language in the National Assembly. Interestingly, English-speaking members are allowed to address the assembly in English, but all responses must be in French! This rule reflects Quebec's strong commitment to preserving its French cultural identity.


## 2. Testing JSON Structured Output

In [15]:
import { z } from "npm:zod";
import { chatJSON } from "./callOllama.ts";

console.log(chatJSON.toString());

// Yes/No schema
const TestSchema = z.object({
  answer: z.enum(["yes", "no"]),
  confidence: z.number().min(0).max(1).optional()
});

const llmConfig = {
  provider: "ollama",
  model: "mistral:7b"
};

// Test structured JSON output
const { data, raw, tokens } = await chatJSON(
  llmConfig,
  [
    { role: "system", content: "You answer yes/no questions. Return JSON: {\"answer\": \"yes|no\"}" },
    { role: "user", content: "Is the sky blue?" }
  ],
  TestSchema
);

console.log("Parsed data:", data);
console.log("Raw response:", raw);
console.log("Tokens used:", tokens);
console.log("Valid:", data !== null);



async function chatJSON(config, messages, schema) {
  if (config.provider === "ollama") {
    return chatJSONOllama(config.model, messages, schema);
  } else if (config.provider === "openai") {
    return chatJSONOpenAI(config, messages, schema);
  }
  throw new Error(`Unsupported LLM provider: ${config.provider}`);
}
Parsed data: { answer: "yes" }
Raw response: "{\"answer\": \"yes\"}"
Tokens used: 39
Valid: true


In [17]:
const TestSchema = z.object({
  answer: z.enum(["yes", "no"])
});

console.log("Testing chatJSON with different response formats...\n");
const llmConfig = {
  provider: "ollama",
  model: "mistral:7b"
};
// Test 1: Normal case
const test1 = await chatJSON(
   llmConfig,
  [
    { role: "system", content: "Answer yes/no questions." },
    { role: "user", content: 'Is water wet? Return valid JSON' }
  ],
  TestSchema
);

console.log("Test 1 - Normal response:");
console.log("  Raw:", test1.raw);
console.log("  Parsed:", test1.data);
console.log("  Valid:", test1.data !== null);
console.log("");

// Test 2: Math question
const MathSchema = z.object({
  answer: z.number()
});

const test2 = await chatJSON(
  llmConfig,
  [
    { role: "system", content: "Solve math problems." },
    { role: "user", content: 'What is 2+2? Return valid JSON' }
  ],
  MathSchema
);

console.log("Test 2 - Math response:");
console.log("  Raw:", test2.raw);
console.log("  Parsed:", test2.data);
console.log("  Valid:", test2.data !== null);

Testing chatJSON with different response formats...

Test 1 - Normal response:
  Raw: "{ \"answer\": \"yes\" }"
  Parsed: { answer: "yes" }
  Valid: true

Test 2 - Math response:
  Raw: "{\"answer\": 4}"
  Parsed: { answer: 4 }
  Valid: true
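
The `Valid: data !== null` checks above imply that `chatJSON` yields `null` when a reply cannot be validated. The sketch below shows one way such a parse-then-validate step could work without any dependencies. `YesNo` and `parseYesNo` are hypothetical names and this is not the code in `callOllama.ts`; with zod, the hand-written check would normally be replaced by `schema.safeParse(...)`.

```typescript
// Hypothetical stand-in for the yes/no schema check; not the code in callOllama.ts.
type YesNo = { answer: "yes" | "no" };

// Returns the validated object, or null when the reply is not usable JSON.
function parseYesNo(raw: string): YesNo | null {
  // Models often wrap JSON in prose or code fences, so take the outermost {...} span.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // malformed JSON
  }

  if (typeof parsed !== "object" || parsed === null) return null;
  const answer = (parsed as Record<string, unknown>).answer;
  if (answer === "yes") return { answer: "yes" };
  if (answer === "no") return { answer: "no" };
  return null; // wrong type or a value outside the enum
}

console.log(parseYesNo('{ "answer": "yes" }'));
console.log(parseYesNo('Sure! {"answer": "no"}'));
console.log(parseYesNo('{"answer": "maybe"}'));
```

Returning `null` instead of throwing keeps the caller's control flow simple: an evaluator can treat an unparseable reply as a wrong answer and still count the tokens it consumed.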


## 3. Test Dataset Loading

In [18]:
import { loadPIQA, loadHellaSwag, loadBoolQ, loadGSM8K } from "./datasets.ts";

// Download the datasets first with: python3 download_datasets.py

// Test PIQA loading
console.log("Loading datasets\n");

const piqaData = await loadPIQA(5);
console.log("PIQA loaded:", piqaData.length, "examples");
// Print every loaded PIQA sample
for (const sample of piqaData) {
  console.log("Sample:", sample);
}

const hellaData = await loadHellaSwag(5);
console.log("HellaSwag loaded:", hellaData.length, "examples");
console.log("Sample:", hellaData[0]);

const boolqData = await loadBoolQ(5);
console.log("BoolQ loaded:", boolqData.length, "examples");
console.log("Sample:", boolqData[0]);

const gsmData = await loadGSM8K(5);
console.log("GSM8K loaded:", gsmData.length, "examples");
console.log("Sample:", gsmData[0]);

Loading datasets

PIQA loaded: 5 examples
Sample: {
  question: "How do I ready a guinea pig cage for it's new occupants?",
  choices: [
    "Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.",
    "Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish."
  ],
  correct: "A"
}
Sample: {
  question: "dresser",
  choices: [
    "replace drawer with bobby pin ",
    "finish, woodgrain with  bobby pin "
  ],
  correct: "B"
}
Sample: {
  question: "To fight Ivan Drago in Rocky for sega master system.",
  choices: [
    "Drago isn't in this game because it was released before Rocky IV.",
    "You have to defeat Apollo Creed and Clubber Lang first."
  ],
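
The samples above use a `question` / `choices` / `correct` shape. As a rough illustration of how a raw PIQA record could be mapped into it, here is a small sketch. It assumes the raw rows carry `goal`, `sol1`, `sol2` and a 0/1 `label`, as in the Hugging Face distribution of PIQA; `PiqaRaw`, `McExample` and `toMcExample` are hypothetical names, and `datasets.ts` may do this differently.

```typescript
// Assumed raw PIQA layout: goal, two candidate solutions, and a 0/1 label.
interface PiqaRaw {
  goal: string;
  sol1: string;
  sol2: string;
  label: 0 | 1;
}

// Shape printed by the samples above.
interface McExample {
  question: string;
  choices: [string, string];
  correct: "A" | "B";
}

// Hypothetical helper: label 0 maps to choice "A", label 1 to choice "B".
function toMcExample(rawRow: PiqaRaw): McExample {
  return {
    question: rawRow.goal,
    choices: [rawRow.sol1, rawRow.sol2],
    correct: rawRow.label === 0 ? "A" : "B",
  };
}

// The second sample shown above, written back in the assumed raw layout.
const demoRow: PiqaRaw = {
  goal: "dresser",
  sol1: "replace drawer with bobby pin ",
  sol2: "finish, woodgrain with  bobby pin ",
  label: 1,
};
console.log(toMcExample(demoRow));
```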

## 4. Test Evaluators

In [20]:
import { 
  makePIQAEvaluator, 
  makeHellaSwagEvaluator,
  makeBoolQEvaluator,
  makeGSM8KEvaluator 
} from "./evals.ts";

console.log("Testing Multiple Choice, Boolean, and Integer evals\n");

const llmConfig = {
  provider: "ollama",
  model: "mistral:7b"
};

// Test PIQA evaluator
const piqaEval = makePIQAEvaluator(llmConfig);
const baseInstruction = 'You are a classifier. You must follow the provided JSON schema.';

console.log("Testing PIQA evaluator");
const result1 = await piqaEval(baseInstruction, piqaData[0]);
console.log("Result:", result1);
console.log("Score:", result1.score);
console.log("Tokens:", result1.tokens);

// Test BoolQ evaluator
const boolqEval = makeBoolQEvaluator(llmConfig);
const boolqInstruction = 'Answer yes or no based on the passage. Return valid JSON';

console.log("\nTesting BoolQ evaluator");
const result2 = await boolqEval(boolqInstruction, boolqData[0]);
console.log("Result:", result2);
console.log("Score:", result2.score);
console.log("Tokens:", result2.tokens);

Testing Multiple Choice, Boolean, and Integer evals

Testing PIQA evaluator
Result: { score: 1, tokens: 151 }
Score: 1
Tokens: 151

Testing BoolQ evaluator
Result: { score: 0, tokens: 362 }
Score: 0
Tokens: 362


## 5. Test TokenMeter

In [21]:
import { TokenMeter } from "./ape.ts";

console.log("Testing the token meter for budgeting\n");

const meter = new TokenMeter();
console.log("Initial:", meter.snapshot());

meter.add(100);
console.log("Add(100):", meter.snapshot());

meter.add(250);
console.log("Add(250):", meter.snapshot());

// Test budget checking (from evo.ts)
import { TokenMeter as EvoMeter } from "./evo.ts";
const budgetMeter = new EvoMeter();
const BUDGET = 500;

console.log("\nBudget testing:");
console.log("Can spend (budget=500, used=0):", budgetMeter.can(BUDGET));

budgetMeter.add(400);
console.log("Can spend (budget=500, used=400):", budgetMeter.can(BUDGET));

budgetMeter.add(150);
console.log("Can spend (budget=500, used=550):", budgetMeter.can(BUDGET));

Testing the token meter for budgeting

Initial: 0
Add(100): 100
Add(250): 350

Budget testing:
Can spend (budget=500, used=0): true
Can spend (budget=500, used=400): true
Can spend (budget=500, used=550): false
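
The logged behavior above is consistent with a very small meter class. The sketch below reproduces it under the assumption that `can(budget)` simply compares the running total with the budget; `SketchTokenMeter` is a hypothetical name and this is not the source of `ape.ts` or `evo.ts`. The log does not show what happens when usage equals the budget exactly, so the strict `<` here is a guess.

```typescript
// Sketch only: mirrors the behavior logged above, not the project's TokenMeter.
class SketchTokenMeter {
  private used = 0;

  // Record the tokens consumed by one LLM call.
  add(tokens: number): void {
    this.used += tokens;
  }

  // Running total of tokens consumed so far.
  snapshot(): number {
    return this.used;
  }

  // True while the running total is still below the budget (assumed strict comparison).
  can(budget: number): boolean {
    return this.used < budget;
  }
}

const sketchMeter = new SketchTokenMeter();
console.log(sketchMeter.can(500));
sketchMeter.add(400);
console.log(sketchMeter.can(500));
sketchMeter.add(150);
console.log(sketchMeter.snapshot(), sketchMeter.can(500));
```

Because the check happens before a call rather than after it, the last call can overshoot the budget, as the 550-of-500 case above shows.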


## 6. Test Paraphrasing (APE Component)

In [29]:
import { paraphraseInstruction, TokenMeter } from "./ape.ts";

console.log("\nTesting APE optimization\n");

const llmConfig = {
  provider: "ollama",
  model: "mistral:7b"
};

const meter2 = new TokenMeter();
const baseInst = 'You classify questions. Return JSON: {"label": "A|B"}.';

console.log("Original:", baseInst);
console.log("\nGenerating 3 paraphrases\n");

for (let i = 0; i < 3; i++) {
  const { instruction, tokens } = await paraphraseInstruction(
    llmConfig,
    baseInst,
    meter2
  );
  console.log(`Paraphrase ${i+1}: ${instruction}`);
  console.log(`  Tokens used: ${tokens}\n`);
}

console.log("Total tokens for paraphrasing:", meter2.snapshot());


Testing APE optimization

Original: You classify questions. Return JSON: {"label": "A|B"}.

Generating 3 paraphrases

Paraphrase 1: You classify questions. Return JSON: {"label": "A|B"}.
  Tokens used: 131

Paraphrase 2: You classify questions. Return JSON: {"label": "A|B"}.
  Tokens used: 131

Paraphrase 3: You classify questions. Return JSON: {"label": "A|B"}.
  Tokens used: 131

Total tokens for paraphrasing: 393


## 7. Test APE Algorithm

In [30]:
import { apeOptimize } from "./ape.ts";

console.log("Testing APE algorithm\n");

const llmConfig = {
  provider: "ollama",
  model: "mistral:7b"
};
const smallData = piqaData.slice(0, 3); // Just 3 examples
const apeResult = await apeOptimize({
  llmConfig,
  baseInstruction: baseInstruction,
  N: 2,  // Only 2 paraphrases for testing
  data: smallData,
  evalExample: piqaEval
});

console.log("APE completed!");
console.log("\nResults:");
console.log("  Best score:", apeResult.bestPrompts[0].score.toFixed(3));
console.log("  Tokens used:", apeResult.bestPrompts[0].tokens);
console.log("  Best instruction:", apeResult.best.instruction.slice(0, 100) + "...");

Testing APE algorithm



TypeError: Cannot read properties of undefined (reading 'provider')
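
A `Cannot read properties of undefined (reading 'provider')` error means some call received no config object at all. One way to surface that earlier, with a message that names the caller, is a small guard at the top of each optimizer. The sketch below is an assumption about how such a guard could look; `LLMConfig` and `assertLLMConfig` are hypothetical names, and the two provider strings are the ones `chatJSON` dispatches on above.

```typescript
interface LLMConfig {
  provider: "ollama" | "openai";
  model: string;
}

// Hypothetical guard; the optimizers in this project may validate differently.
function assertLLMConfig(value: unknown, caller: string): LLMConfig {
  if (typeof value !== "object" || value === null) {
    const got = value === null ? "null" : typeof value;
    throw new TypeError(`${caller}: llmConfig is missing (got ${got})`);
  }
  const { provider, model } = value as Record<string, unknown>;
  if (provider !== "ollama" && provider !== "openai") {
    throw new TypeError(`${caller}: unsupported provider ${String(provider)}`);
  }
  if (typeof model !== "string" || model.length === 0) {
    throw new TypeError(`${caller}: model must be a non-empty string`);
  }
  // The runtime checks above make this cast safe.
  return { provider: provider as LLMConfig["provider"], model };
}

try {
  assertLLMConfig(undefined, "apeOptimize");
} catch (err) {
  console.log((err as Error).message);
}
console.log(assertLLMConfig({ provider: "ollama", model: "mistral:7b" }, "apeOptimize"));
```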

## 8. Test Evolution Algorithm

In [9]:
import { evoOptimize } from "./evo.ts";

console.log("Testing Evolution algorithm \n");

const evoResult = await evoOptimize({
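  // Assumption: evoOptimize takes the same llmConfig object as apeOptimize and tsOptimize.
  // A bare model string has no provider field for chatJSON to dispatch on.
  llmConfig: { provider: "ollama", model: "gemma3:4b" },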
  seeds: [baseInstruction],
  data: smallData,
  evalExample: piqaEval,
  budget: 2000  // Small budget for testing
});

console.log("Evolution completed!");
console.log("\nResults:");
console.log("Rounds:", evoResult.bestPrompts.length);
console.log("Final score:", evoResult.bestPrompts[evoResult.bestPrompts.length - 1].score.toFixed(3));
console.log("Tokens used:", evoResult.meter.snapshot());
console.log("Best instruction:", evoResult.best.instruction.slice(0, 100) + "...");

// Show improvement over time
console.log("Scores:");
evoResult.bestPrompts.forEach((h, i) => {
  console.log(`    Round ${i}: ${h.score.toFixed(3)} (${h.tokens} tokens)`);
});

Testing Evolution algorithm 



Error: Unsupported LLM provider: undefined
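
For orientation, here is a self-contained sketch of a token-budgeted binary tournament, the selection scheme the summary cell later attributes to the Evolution optimizer. It is not the code in `evo.ts`: `binaryTournament`, `seededRandom`, `stubScore` and `stubMutate` are hypothetical, and the scorer and mutator are stubs standing in for dataset evaluation and LLM-based mutation.

```typescript
type Scored = { score: number; tokens: number };

// Small deterministic LCG so the sketch is reproducible; any RNG would do.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Hypothetical sketch of a budgeted binary tournament (not the code in evo.ts).
function binaryTournament(
  seeds: string[],
  scoreFn: (instruction: string) => Scored,
  mutateFn: (instruction: string) => string,
  budget: number,
  rand: () => number,
): { best: string; bestScore: number; rounds: number; tokensUsed: number } {
  if (seeds.length === 0) throw new Error("need at least one seed");
  // A tournament needs two contenders, so a lone seed is paired with its mutation.
  const pool = seeds.length > 1 ? [...seeds] : [seeds[0], mutateFn(seeds[0])];
  let best = pool[0];
  let bestScore = -Infinity;
  let rounds = 0;
  let tokensUsed = 0;

  while (tokensUsed < budget) {
    // Pick two distinct contenders and score both.
    const i = Math.floor(rand() * pool.length);
    const j = (i + 1 + Math.floor(rand() * (pool.length - 1))) % pool.length;
    const a = scoreFn(pool[i]);
    const b = scoreFn(pool[j]);
    tokensUsed += a.tokens + b.tokens;

    // The winner stays; the loser's slot is refilled with a mutation of the winner.
    const [win, lose, winScore] = a.score >= b.score ? [i, j, a.score] : [j, i, b.score];
    if (winScore > bestScore) {
      bestScore = winScore;
      best = pool[win];
    }
    pool[lose] = mutateFn(pool[win]);
    rounds++;
  }
  return { best, bestScore, rounds, tokensUsed };
}

// Stub scorer and mutator, purely for illustration: longer prompts score higher.
const stubScore = (instruction: string): Scored => ({
  score: Math.min(1, instruction.length / 100),
  tokens: 100,
});
const stubMutate = (instruction: string): string => instruction + " Be precise.";

const tournament = binaryTournament(["Answer A or B."], stubScore, stubMutate, 1000, seededRandom(42));
console.log(tournament.rounds, tournament.tokensUsed, tournament.bestScore.toFixed(2));
```

Each round costs two evaluations, so the number of rounds a budget buys depends directly on how many examples each evaluation covers.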

## 9. Test Thompson Sampling

In [26]:
import { tsOptimize } from "./thompson.ts";

console.log("Testing Thompson Sampling 3 arms\n");
const llmConfig = {
  provider: "ollama",
  model: "mistral:7b"
};
const tsResult = await tsOptimize({
  llmConfig,
  seeds: [baseInstruction],
  data: smallData,
  evalExample: piqaEval,
  budget: 1500,  // Small budget
  extraArms: 2   // 2 mutated variants
});

console.log("Thompson Sampling completed!");
console.log("\nResults:");
console.log("Pulls:", tsResult.bestPrompts.length);
console.log("Final score:", tsResult.bestPrompts[tsResult.bestPrompts.length - 1].score.toFixed(3));
console.log("Tokens used:", tsResult.meter.snapshot());
console.log("Best instruction:", tsResult.best.instruction.slice(0, 100) + "...");

// Show posterior distributions
console.log("\n  Posterior distributions (final):");
tsResult.posterior.forEach((post, armId) => {
  console.log(`    Arm ${armId.slice(0, 8)}: mean=${post.mu.toFixed(3)}, spread=${post.kappa.toFixed(1)}`);
});

Testing Thompson Sampling 3 arms



TypeError: Cannot read properties of undefined (reading 'provider')
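
The cell above reads a `mu` and a `kappa` from each arm's posterior. As a rough illustration of Thompson sampling with that kind of state, the sketch below treats `mu` as a running mean reward and `kappa` as an observation count, and samples from a normal distribution with variance `1 / kappa`. That interpretation is an assumption, since `thompson.ts` is not shown; all function names and the simulated success rates are made up for the example.

```typescript
type Posterior = { mu: number; kappa: number };

// Small deterministic LCG so the sketch is reproducible; any RNG would do.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Standard normal draw via the Box-Muller transform.
function gaussian(rand: () => number): number {
  const u1 = 1 - rand(); // shift into (0, 1] so Math.log never sees 0
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Thompson step: sample once from each arm's posterior, pull the largest sample.
function pickArm(posteriors: Posterior[], rand: () => number): number {
  let bestArm = 0;
  let bestDraw = -Infinity;
  for (let arm = 0; arm < posteriors.length; arm++) {
    const { mu, kappa } = posteriors[arm];
    const draw = mu + gaussian(rand) / Math.sqrt(kappa);
    if (draw > bestDraw) {
      bestDraw = draw;
      bestArm = arm;
    }
  }
  return bestArm;
}

// Fold one observed reward into the running mean; kappa counts observations.
function updateArm(p: Posterior, reward: number): Posterior {
  return { mu: (p.kappa * p.mu + reward) / (p.kappa + 1), kappa: p.kappa + 1 };
}

// Simulated arms with made-up success rates, for illustration only.
const trueRates = [0.2, 0.5, 0.9];
const rng = makeRng(7);
const arms: Posterior[] = trueRates.map(() => ({ mu: 0.5, kappa: 1 }));
const pulls = trueRates.map(() => 0);

for (let t = 0; t < 200; t++) {
  const arm = pickArm(arms, rng);
  const reward = rng() < trueRates[arm] ? 1 : 0;
  arms[arm] = updateArm(arms[arm], reward);
  pulls[arm]++;
}
console.log(pulls, arms.map((p) => p.mu.toFixed(2)));
```

Arms with few observations have a small `kappa` and therefore wide draws, which is what lets an under-explored prompt occasionally win a pull.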

## 10. Full Experiment (One Dataset)

In [11]:
console.log("FULL EXPERIMENT : PIQA Dataset");

const FULL_DATA = await loadPIQA(30);
const FULL_BUDGET = 5000;
const FULL_INSTRUCTION = 
  'You are a classifier for physical commonsense reasoning. ' +
  'Analyze both options carefully and choose the more practical solution. ' +
  'Return ONLY valid JSON: {"label": "A"} or {"label": "B"}.';

// MODEL is never defined in this notebook; reuse the llmConfig object from the earlier cells
const fullEval = makePIQAEvaluator(llmConfig);

console.log(`\nDataset: PIQA`);
console.log(`Examples: ${FULL_DATA.length}`);
console.log(`Budget: ${FULL_BUDGET} tokens`);
console.log(`Base instruction: ${FULL_INSTRUCTION.slice(0, 80)}...\n`);

// Run all three algorithms
console.log("Running APE...");
const fullAPE = await apeOptimize({
  llmConfig,
  baseInstruction: FULL_INSTRUCTION,
  N: 5,
  data: FULL_DATA,
  evalExample: fullEval
});
console.log(`Score: ${fullAPE.bestPrompts[0].score.toFixed(3)}, Tokens: ${fullAPE.bestPrompts[0].tokens}`);

console.log("\nRunning Evolution...");
const fullEvo = await evoOptimize({
  llmConfig,
  seeds: [FULL_INSTRUCTION],
  data: FULL_DATA,
  evalExample: fullEval,
  budget: FULL_BUDGET
});
const evoFinal = fullEvo.bestPrompts[fullEvo.bestPrompts.length - 1];
console.log(`Score: ${evoFinal.score.toFixed(3)}, Tokens: ${evoFinal.tokens}, Rounds: ${fullEvo.bestPrompts.length}`);

console.log("\nRunning Thompson Sampling...");
const fullTS = await tsOptimize({
  llmConfig,
  seeds: [FULL_INSTRUCTION],
  data: FULL_DATA,
  evalExample: fullEval,
  budget: FULL_BUDGET,
  extraArms: 3
});
const tsFinal = fullTS.bestPrompts[fullTS.bestPrompts.length - 1];
console.log(`Score: ${tsFinal.score.toFixed(3)}, Tokens: ${tsFinal.tokens}, Pulls: ${fullTS.bestPrompts.length}`);
console.log("EXPERIMENT COMPLETE");

FULL EXPERIMENT : PIQA Dataset

Dataset: PIQA
Examples: 30
Budget: 5000 tokens
Base instruction: You are a classifier for physical commonsense reasoning. Analyze both options ca...

Running APE...


TypeError: Cannot read properties of undefined (reading 'provider')

## 11. Save Results for Plotting

In [12]:
// Prepare data for plotting
const plotData = {
  ape: {
    tokens: fullAPE.history.map(h => h.tokens),
    scores: fullAPE.history.map(h => h.score),
    best: fullAPE.best.instruction
  },
  evo: {
    tokens: fullEvo.history.map(h => h.tokens),
    scores: fullEvo.history.map(h => h.score),
    best: fullEvo.best.instruction
  },
  ts: {
    tokens: fullTS.history.map(h => h.tokens),
    scores: fullTS.history.map(h => h.score),
    best: fullTS.best.instruction
  }
};

// Save to file
await Deno.writeTextFile(
  'notebook_results.json',
  JSON.stringify(plotData, null, 2)
);

console.log("Results saved to notebook_results.json");
console.log("\nYou can now plot these results using Python/matplotlib or any plotting tool.");

ReferenceError: fullAPE is not defined

## 12. Display Summary Table

In [13]:
console.log("FINAL COMPARISON");
console.log("Algorithm        | Final Score | Tokens Used | Iterations | Efficiency");

const apeScore = fullAPE.history[0].score;
const apeTokens = fullAPE.history[0].tokens;
console.log(
  `APE              | ${apeScore.toFixed(3).padEnd(11)} | ` +
  `${apeTokens.toString().padEnd(11)} | ${"1".padEnd(10)} | ` +
  `${(apeScore / apeTokens * 1000).toFixed(2)} pts/1k tok`
);

const evoScore = fullEvo.history[fullEvo.history.length - 1].score;
const evoTokens = fullEvo.meter.snapshot();
console.log(
  `Evolution        | ${evoScore.toFixed(3).padEnd(11)} | ` +
  `${evoTokens.toString().padEnd(11)} | ${fullEvo.history.length.toString().padEnd(10)} | ` +
  `${(evoScore / evoTokens * 1000).toFixed(2)} pts/1k tok`
);

const tsScore = fullTS.history[fullTS.history.length - 1].score;
const tsTokens = fullTS.meter.snapshot();
console.log(
  `Thompson         | ${tsScore.toFixed(3).padEnd(11)} | ` +
  `${tsTokens.toString().padEnd(11)} | ${fullTS.history.length.toString().padEnd(10)} | ` +
  `${(tsScore / tsTokens * 1000).toFixed(2)} pts/1k tok`
);

console.log("=".repeat(80));
console.log("\nKey Observations:");
console.log(`  • APE: Single evaluation pass, most expensive per improvement`);
console.log(`  • Evolution: ${fullEvo.history.length} rounds of binary tournament`);
console.log(`  • Thompson: ${fullTS.history.length} pulls (most iterations per budget)`);
console.log("\nBest Instructions:");
console.log("\nAPE:", plotData.ape.best.slice(0, 120) + "...");
console.log("\nEvo:", plotData.evo.best.slice(0, 120) + "...");
console.log("\nTS:", plotData.ts.best.slice(0, 120) + "...");

FINAL COMPARISON
Algorithm        | Final Score | Tokens Used | Iterations | Efficiency


ReferenceError: fullAPE is not defined
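
The efficiency column in the cell above is score divided by tokens, scaled to 1,000 tokens. Since the recorded run never reached the table, here is a small standalone sketch of that arithmetic and row layout. `efficiencyPer1k` and `tableRow` are hypothetical helpers, and the numbers in the example call are invented for illustration, not experiment results.

```typescript
// Points of score per 1,000 tokens; returns 0 when no tokens were spent.
function efficiencyPer1k(score: number, tokens: number): number {
  return tokens > 0 ? (score / tokens) * 1000 : 0;
}

// Formats one row with the same column widths as the header in the cell above.
function tableRow(label: string, score: number, tokens: number, iterations: number): string {
  return [
    label.padEnd(16),
    score.toFixed(3).padEnd(11),
    String(tokens).padEnd(11),
    String(iterations).padEnd(10),
    `${efficiencyPer1k(score, tokens).toFixed(2)} pts/1k tok`,
  ].join(" | ");
}

// Invented numbers, for illustration only.
console.log(tableRow("Example", 0.8, 4000, 12));
```

Guarding the zero-token case matters here because a run that fails before its first LLM call would otherwise print `NaN` or `Infinity` in the table.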

## 13. Create Simple ASCII Plot

In [14]:
// Simple ASCII visualization: one row per history entry, bar length proportional to score
function asciiPlot(
  data: { tokens: number[]; scores: number[] },
  label: string,
  maxRows = data.scores.length,
  barWidth = 20
) {
  console.log(`\n${label}:`);

  // Fall back to 1 so an empty or all-zero history never divides by zero
  const maxScore = Math.max(0, ...data.scores) || 1;
  const rows = Math.min(maxRows, data.scores.length);

  for (let i = 0; i < rows; i++) {
    const filled = Math.floor((data.scores[i] / maxScore) * barWidth);
    const bar = "█".repeat(filled) + "░".repeat(barWidth - filled);
    console.log(`  ${data.tokens[i].toString().padStart(6)} tok | ${bar} | ${data.scores[i].toFixed(3)}`);
  }
}

console.log("\n" + "=".repeat(80));
console.log("ASCII PLOTS (Token Usage vs Score)");
console.log("=".repeat(80));

asciiPlot(plotData.ape, "APE");
// Cap the longer histories at 10 rows, matching the labels
asciiPlot(plotData.evo, "Evolution (first 10 rounds)", 10);
asciiPlot(plotData.ts, "Thompson Sampling (first 10 pulls)", 10);


ASCII PLOTS (Token Usage vs Score)


ReferenceError: plotData is not defined