In [15]:
base_prompt = """Create a detailed 5-day travel itinerary for Swiss alps from 2025-09-02 to 2025-09-06.  

Travel Style: mid-range
Interests: History & Culture, Architecture, Shopping


Please format your response EXACTLY as follows:

ITINERARY TITLE: [Creative title for the trip]

DAY 1 - [Date: YYYY-MM-DD] - [Day Title]
[Brief day description]

**[Place Name 1]** (lat: XX.XXXXX, lng: XX.XXXXX)
[Description of the place and activities]

**[Place Name 2]** (lat: XX.XXXXX, lng: XX.XXXXX)
[Description of the place and activities]

DAY 2 - [Date: YYYY-MM-DD] - [Day Title]
[Brief day description]

**[Place Name 3]** (lat: XX.XXXXX, lng: XX.XXXXX)
[Description of the place and activities]

Continue this format for all 5 days.

Requirements:
- Include 4-8 places and attractions per day
- Match the activities to the traveler's interests, the season, and the weather at that time of year
- Consider travel time between locations and places that might be closed at the time of the year or on the specific day
- Each day should have a thematic focus when possible
- For each place, provide a narrative-style description that shows why this specific traveler would want to visit it, what they can expect to do there, and a bit of interesting background about the place.
- Write in a tourist-guide style. Start each paragraph in the second person, always beginning with a verb ("discover the old city...", "hike to the top..."), and avoid repeating the same language.
- Paragraphs should be no more than 40 words for a day and no more than 50 words for a place
- For each day, provide an intro with the theme of the day, the types of activities, and possibly some historical background (depending on the traveler's interests)
- traveler's interests: History & Culture, Architecture, Shopping
- Match the mid-range budget and style
- Provide realistic latitude and longitude coordinates (5 decimal places)
- Each place name must be wrapped in **double asterisks**
- Include specific place names (restaurants, museums, attractions, etc.)
- Provide practical, actionable recommendations
- Include warnings about opening hours and practical tips for specific places when relevant.
- Do not include any blocks other than days and places
"""
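
The fixed response format requested above lends itself to mechanical parsing. The following is a minimal sketch, assuming the model follows the `DAY N - date - title` and `**Place** (lat: ..., lng: ...)` conventions closely. The names `parse_itinerary`, `DAY_RE`, `PLACE_RE`, and `SAMPLE` are hypothetical and not part of the notebook, and the sample coordinates are approximate illustrative values.

```python
import re
from typing import Any, Dict, List

# Tolerates "DAY 1 - 2025-09-02 - Title" as well as "DAY 1 - Date: 2025-09-02 - Title"
DAY_RE = re.compile(
    r"^DAY\s+(\d+)\s*-\s*\[?(?:Date:\s*)?(\d{4}-\d{2}-\d{2})\]?\s*-\s*(.+)$"
)
# Matches "**Place Name** (lat: 47.05150, lng: 8.30750)"
PLACE_RE = re.compile(
    r"^\*\*(.+?)\*\*\s*\(lat:\s*(-?\d+(?:\.\d+)?),\s*lng:\s*(-?\d+(?:\.\d+)?)\)"
)


def parse_itinerary(text: str) -> List[Dict[str, Any]]:
    """Return one dict per day, each holding the places listed under it."""
    days: List[Dict[str, Any]] = []
    for raw in text.splitlines():
        line = raw.strip()
        day = DAY_RE.match(line)
        if day:
            days.append({
                "day": int(day.group(1)),
                "date": day.group(2),
                "title": day.group(3).strip(),
                "places": [],
            })
            continue
        place = PLACE_RE.match(line)
        if place and days:  # ignore places that appear before any day header
            days[-1]["places"].append({
                "name": place.group(1),
                "lat": float(place.group(2)),
                "lng": float(place.group(3)),
            })
    return days


# Illustrative input only; not real model output
SAMPLE = """ITINERARY TITLE: Example

DAY 1 - 2025-09-02 - Lucerne Old Town
Discover the lakeside old town.

**Chapel Bridge** (lat: 47.05150, lng: 8.30750)
Walk across the covered wooden bridge.

DAY 2 - Date: 2025-09-03 - Bern
**Zytglogge** (lat: 46.94800, lng: 7.44780)
Watch the clock tower chime.
"""

parsed = parse_itinerary(SAMPLE)
```

A parser like this could feed the "4-8 places per day" requirement into a simple count per day, rather than relying on the judge model alone.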

In [16]:
%pip install -q openai python-dotenv


Note: you may need to restart the kernel to use updated packages.





In [None]:

import os
from dotenv import load_dotenv
from datetime import datetime
import json
from typing import List, Dict, Any

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
assert OPENAI_API_KEY, "OPENAI_API_KEY not found in environment"

MODELS: List[str] = [
    # Adjust list as desired
    "o4-mini",
    "gpt-4o-mini",
    "gpt-5-mini",
    "gpt-5-nano",
    "gpt-5"
    
]
TEMPERATURES: List[float] = [0.3, 0.7, 1.0]
JUDGE_MODEL: str = "o4-mini"
# Models that do not support non-default temperature values
MODELS_NO_TEMPERATURE = {"o4-mini", "gpt-5-mini", "gpt-5-nano", "gpt-5"}

# Create organized output directories for different result types
BASE_RESULTS_DIR = "../results"
OUTPUT_DIRS = {
    "best_itinerary": os.path.join(BASE_RESULTS_DIR, "best-itinerary"),
    "judgement": os.path.join(BASE_RESULTS_DIR, "judgements"),
    "differences": os.path.join(BASE_RESULTS_DIR, "differences"),
    "excerpts": os.path.join(BASE_RESULTS_DIR, "excerpts")
}

# Ensure all output directories exist
for output_dir in OUTPUT_DIRS.values():
    os.makedirs(output_dir, exist_ok=True)

TIMESTAMP = datetime.now().strftime("%Y%m%d-%H%M%S")
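
To see how many API calls this configuration implies, the grid can be expanded up front. This is a small sketch; `expand_grid` is a hypothetical helper, and it uses `None` to mark models that ignore temperature, which is an assumption of this example rather than the notebook's own convention.

```python
from typing import Iterable, List, Optional, Set, Tuple


def expand_grid(
    models: Iterable[str],
    temperatures: Iterable[float],
    no_temperature: Set[str],
) -> List[Tuple[str, Optional[float]]]:
    """List every (model, temperature) run; None means 'model default'."""
    temps = list(temperatures)
    runs: List[Tuple[str, Optional[float]]] = []
    for model in models:
        if model in no_temperature:
            runs.append((model, None))  # a single run, temperature not sent
        else:
            runs.extend((model, t) for t in temps)
    return runs


# Same values as the configuration cell above
example_runs = expand_grid(
    ["o4-mini", "gpt-4o-mini", "gpt-5-mini", "gpt-5-nano", "gpt-5"],
    [0.3, 0.7, 1.0],
    {"o4-mini", "gpt-5-mini", "gpt-5-nano", "gpt-5"},
)
print(len(example_runs))  # 4 fixed-temperature models + 3 runs of gpt-4o-mini
```

With the lists above, only `gpt-4o-mini` is swept across temperatures, so the grid has 7 runs in total.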


In [18]:
os.path.abspath(OUTPUT_DIRS["excerpts"])


'c:\\Users\\ShaiS\\Documents\\Repos\\itera-notes\\research\\results\\excerpts'

In [19]:
# Build a single prompt string from the base_prompt in the previous cell
prompt = base_prompt.strip()
print(prompt[:400] + ("..." if len(prompt) > 400 else ""))


Create a detailed 5-day travel itinerary for Swiss alps from 2025-09-02 to 2025-09-06.  

Travel Style: mid-range
Interests: History & Culture, Architecture, Shopping


Please format your response EXACTLY as follows:

ITINERARY TITLE: [Creative title for the trip]

DAY 1 - [Date: YYYY-MM-DD] - [Day Title]
[Brief day description]

**[Place Name 1]** (lat: XX.XXXXX, lng: XX.XXXXX)
[Description of th...


In [20]:
# OpenAI helper functions
from openai import OpenAI

client = OpenAI(api_key=OPENAI_API_KEY)

SYSTEM_PROMPT = (
    "You are a professional travel planner. Create detailed, practical itineraries with specific places, realistic timing, and helpful descriptions. "
    "Always include approximate latitude and longitude coordinates for each place."
)


def run_prompt(model: str, temperature: float, prompt_text: str) -> str:
    kwargs = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt_text},
        ],
    }
    if model not in MODELS_NO_TEMPERATURE:
        kwargs["temperature"] = temperature
    resp = client.chat.completions.create(**kwargs)
    return resp.choices[0].message.content or ""


def judge_outputs(prompt_text: str, candidates: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Use an LLM-as-judge to rate each candidate and pick the best."""
    judge_instructions = (
        "You are an impartial expert trip-planning evaluator. Score each candidate itinerary on a 1-10 scale for: "
        "(1) accuracy to the user's instruction and format, (2) practicality and realism (seasonality, hours, distances), "
        "(3) alignment with interests and budget, (4) clarity and readability. Provide a short justification. "
        "Return strict JSON with fields: results: [{model, temperature, score, justification}], best_index, and differences_table_markdown."
    )

    # Build judging prompt with numbered candidates, and ask for differences table + excerpts
    numbered = []
    for idx, c in enumerate(candidates):
        numbered.append(f"CANDIDATE {idx}:\nMODEL: {c['model']} TEMP: {c['temperature']}\n---\n{c['output']}\n")
    judge_user = (
        f"Original prompt:\n{prompt_text}\n\n"
        "Compare the candidates. In addition to scoring each, produce a Markdown table noting key differences (structure, realism, specificity, coordinates, seasonality/hours, interest alignment), and include a short excerpt (2-3 lines) per candidate that best represents its style and quality."
        "\n\nCandidates to evaluate:\n" + "\n\n".join(numbered) +
        "\n\nRespond only with strict JSON."
    )

    kwargs = {
        "model": JUDGE_MODEL,
        "messages": [
            {"role": "system", "content": judge_instructions},
            {"role": "user", "content": judge_user},
        ],
    }
    if JUDGE_MODEL not in MODELS_NO_TEMPERATURE:
        kwargs["temperature"] = 1

    # Try enforcing JSON if supported; fall back if not
    try:
        resp = client.chat.completions.create(**{**kwargs, "response_format": {"type": "json_object"}})
    except Exception:
        resp = client.chat.completions.create(**kwargs)

    content = resp.choices[0].message.content or "{}"
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        data = {"results": [], "best_index": 0, "raw": content}
    return data

# Fallback: Build a differences table with excerpts locally if the judge did not return one
import re

def build_differences_table_markdown(candidates: List[Dict[str, Any]]) -> str:
    headers = [
        "#", "Model", "Temp", "Chars", "Days", "Places", "Coords", "InterestHits", "SeasonalityRefs",
    ]
    rows = []
    interest_terms = ["History", "Culture", "Architecture", "Shopping"]
    season_terms = ["open", "hours", "season", "closed"]
    for i, c in enumerate(candidates):
        text = c.get("output", "") or ""
        chars = len(text)
        days = len(re.findall(r"\bDAY\s+\d+\b", text))
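        # Count complete **Place Name** spans instead of summing overlapping asterisk counts
        places = len(re.findall(r"\*\*[^*\n]+\*\*", text))
        coords = text.lower().count("lat:")
        # Match interest terms case-insensitively, consistent with the seasonality terms below
        interest_hits = sum(text.lower().count(term.lower()) for term in interest_terms)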
        season_refs = sum(text.lower().count(term) for term in season_terms)
        rows.append([
            str(i), c.get("model"), str(c.get("temperature")), str(chars), str(days), str(places), str(coords), str(interest_hits), str(season_refs)
        ])

    # Build Markdown table
    lines = []
    lines.append("| " + " | ".join(headers) + " |")
    lines.append("| " + " | ".join(["---" for _ in headers]) + " |")
    for r in rows:
        lines.append("| " + " | ".join(r) + " |")

    # Append excerpts
    lines.append("\nExcerpts:\n")
    for i, c in enumerate(candidates):
        text = (c.get("output", "") or "").strip()
        excerpt = text[:1100].replace("\n\n", "\n").splitlines()
        excerpt = "\n".join(excerpt[:3])
        lines.append(f"- Candidate {i} (model={c.get('model')}, temp={c.get('temperature')}):\n{excerpt}\n")
    return "\n".join(lines)

def build_comparable_excerpts(candidates: List[Dict[str, Any]]) -> str:
    """Extract similar excerpts from all candidates for side-by-side comparison."""
    lines = []
    lines.append("COMPARABLE EXCERPTS FROM ALL MODELS")
    lines.append("=" * 50)
    lines.append("")
    
    for i, c in enumerate(candidates):
        model = c.get("model", "unknown")
        temp = c.get("temperature", "unknown")
        text = (c.get("output", "") or "").strip()
        
        lines.append(f"--- MODEL: {model} | TEMPERATURE: {temp} ---")
        
        # Extract same portion from each: first 15 lines or 800 chars, whichever is shorter
        text_lines = text.splitlines()
        excerpt_lines = text_lines[:15]
        excerpt = "\n".join(excerpt_lines)
        if len(excerpt) > 800:
            excerpt = excerpt[:800] + "..."
        
        lines.append(excerpt)
        lines.append("")  # blank line between excerpts
    
    return "\n".join(lines)
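
Because the judge's reply is free-form JSON, `best_index` may be missing, out of range, or of the wrong type. The sketch below shows one way to guard against that: it falls back to the highest numeric `score` in `results` when `best_index` is unusable. `pick_best_index` is a hypothetical helper, and the fallback assumes `results` is listed in candidate order, which the judge is not guaranteed to respect.

```python
from typing import Any, Dict, Optional


def pick_best_index(judgement: Dict[str, Any], n_candidates: int) -> Optional[int]:
    """Return a usable candidate index, or None if the judge gave nothing usable."""
    idx = judgement.get("best_index")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(idx, int) and not isinstance(idx, bool) and 0 <= idx < n_candidates:
        return idx

    # Fallback: highest numeric score, assuming results follow candidate order
    results = judgement.get("results") or []
    scored = [
        (r["score"], i)
        for i, r in enumerate(results[:n_candidates])
        if isinstance(r, dict) and isinstance(r.get("score"), (int, float))
    ]
    if not scored:
        return None
    _, best_i = max(scored, key=lambda pair: pair[0])
    return best_i
```

For example, `pick_best_index({"best_index": "1", "results": [{"score": 6}, {"score": 9}]}, 2)` ignores the string-valued index and selects the second candidate by score.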


In [21]:
# Run the grid search and judge
candidates: List[Dict[str, Any]] = []
errors: List[Dict[str, Any]] = []

for model in MODELS:
    if model in MODELS_NO_TEMPERATURE:
        try:
            print(f"Running model={model} (no temperature control) ...")
            # run_prompt does not send a temperature for these models; 1.0 is recorded as a placeholder
            out = run_prompt(model, 1.0, prompt)
            candidates.append({
                "model": model,
                "temperature": 1.0,
                "output": out,
            })
        except Exception as e:
            err = str(e)
            print(f"Failed for model={model}: {err}")
            errors.append({"model": model, "temperature": None, "error": err})
    else:
        for temp in TEMPERATURES:
            try:
                print(f"Running model={model} temp={temp} ...")
                out = run_prompt(model, temp, prompt)
                candidates.append({
                    "model": model,
                    "temperature": temp,
                    "output": out,
                })
            except Exception as e:
                err = str(e)
                print(f"Failed for model={model} temp={temp}: {err}")
                errors.append({"model": model, "temperature": temp, "error": err})

print(f"Collected {len(candidates)} candidates, {len(errors)} errors")

if not candidates:
    print("No successful candidates; skipping judge.")
else:
    judgement = judge_outputs(prompt, candidates)
    judgement_path = os.path.join(OUTPUT_DIRS["judgement"], f"judgement-{TIMESTAMP}.json")
    with open(judgement_path, "w", encoding="utf-8") as f:
        json.dump(judgement, f, ensure_ascii=False, indent=2)

    results = judgement.get("results", [])
    best_index = judgement.get("best_index", 0)
    print("Scores:")
    for i, r in enumerate(results):
        print(f"#{i}: {r.get('model')} T={r.get('temperature')} -> {r.get('score')}: {r.get('justification')}")

    # Save differences table with excerpts
    differences_table = judgement.get("differences_table_markdown") or ""
    diff_path = os.path.join(OUTPUT_DIRS["differences"], f"differences-{TIMESTAMP}.txt")
    if not differences_table:
        differences_table = build_differences_table_markdown(candidates)
    with open(diff_path, "w", encoding="utf-8") as f:
        f.write(differences_table)

    # Save comparable excerpts from all models
    comparable_excerpts = build_comparable_excerpts(candidates)
    excerpts_path = os.path.join(OUTPUT_DIRS["excerpts"], f"excerpts-{TIMESTAMP}.txt")
    with open(excerpts_path, "w", encoding="utf-8") as f:
        f.write(comparable_excerpts)

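    # Guard against a non-integer best_index (e.g. a string or null) in the judge's JSON
    best = candidates[best_index] if isinstance(best_index, int) and 0 <= best_index < len(candidates) else None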
    if best:
        best_text = best["output"]
        best_path = os.path.join(OUTPUT_DIRS["best_itinerary"], f"best-itinerary-{TIMESTAMP}.txt")
        with open(best_path, "w", encoding="utf-8") as f:
            f.write(best_text)
        print(f"Best: model={best['model']} temp={best.get('temperature')}")
        print(f"Saved best itinerary to: {best_path}\nSaved judgment to: {judgement_path}\nSaved differences table to: {diff_path}\nSaved comparable excerpts to: {excerpts_path}")
    else:
        print("Judge did not return a valid best_index.")


Running model=o4-mini (no temperature control) ...
Collected 1 candidates, 0 errors
Scores:
#0: o4-mini T=1.0 -> 9: The itinerary closely follows the required format, provides realistic logistics and opening hours, matches the traveler’s interests in history, culture, architecture, and shopping, and uses clear, concise second-person narrative.
Best: model=o4-mini temp=1.0
Saved best itinerary to: ../results\best-itinerary\best-itinerary-20250817-123521.txt
Saved judgment to: ../results\judgements\judgement-20250817-123521.json
Saved differences table to: ../results\differences\differences-20250817-123521.txt
Saved comparable excerpts to: ../results\excerpts\excerpts-20250817-123521.txt
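
The judge's score is itself a model opinion, so a deterministic check can complement it. The sketch below counts how many coordinate pairs in an itinerary use exactly 5 decimal places, as the prompt requests, and fall within valid latitude and longitude ranges. `check_coordinates` and `COORD_RE` are hypothetical names, and the sample text is illustrative rather than real model output.

```python
import re
from typing import Dict

# Captures each value and, separately, its digits after the decimal point
COORD_RE = re.compile(r"\(lat:\s*(-?\d+\.(\d+)),\s*lng:\s*(-?\d+\.(\d+))\)")


def check_coordinates(text: str) -> Dict[str, int]:
    """Count coordinate pairs, those with 5 decimals, and those in valid ranges."""
    total = five_decimals = valid_range = 0
    for m in COORD_RE.finditer(text):
        total += 1
        lat, lat_dec, lng, lng_dec = m.group(1), m.group(2), m.group(3), m.group(4)
        if len(lat_dec) == 5 and len(lng_dec) == 5:
            five_decimals += 1
        if -90.0 <= float(lat) <= 90.0 and -180.0 <= float(lng) <= 180.0:
            valid_range += 1
    return {"total": total, "five_decimals": five_decimals, "valid_range": valid_range}


# Illustrative input: one well-formed pair, one low-precision pair, one impossible latitude
sample_text = (
    "**A** (lat: 46.94800, lng: 7.44780)\n"
    "**B** (lat: 47.05, lng: 8.3075)\n"
    "**C** (lat: 146.00000, lng: 8.00000)\n"
)
report = check_coordinates(sample_text)
```

Coordinates written without a decimal point are not matched by this pattern, so `total` may undercount in that case.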
