# Test Fine-tuned RAG LLM

This notebook tests the already-trained model with the existing retrieval artifacts, so you can skip the data generation and training steps.

- Loads the FAISS index and review metadata using the paths recorded in `models/rag_llm/step_0/manifest.json` (metadata: `data/rag_llm/processed/review_metadata.parquet`)
- Loads the fine-tuned model from `models/rag_llm/final` (extracting it from `final.zip` first if the folder is missing)
- Provides quick test queries and an optional interactive chat
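
Before running the cells, it can help to confirm that the artifacts listed above are actually on disk. The following is a minimal sketch, not part of the original notebook: `check_artifacts` and `model_available` are hypothetical helpers, and the relative paths are copied from the list above.

```python
from pathlib import Path
from typing import Dict

# Paths relative to the project root, as listed in the introduction.
# These helpers are illustrative only; the notebook does not define them.
EXPECTED = {
    "manifest": Path("models/rag_llm/step_0/manifest.json"),
    "metadata": Path("data/rag_llm/processed/review_metadata.parquet"),
    "model_dir": Path("models/rag_llm/final"),
    "model_zip": Path("models/rag_llm/final.zip"),
}


def check_artifacts(project_root: Path) -> Dict[str, bool]:
    """Return a name -> exists mapping for each expected artifact."""
    return {name: (project_root / rel).exists() for name, rel in EXPECTED.items()}


def model_available(status: Dict[str, bool]) -> bool:
    """The model is usable if the folder exists or the ZIP can be extracted."""
    return status["model_dir"] or status["model_zip"]


if __name__ == "__main__":
    status = check_artifacts(Path.cwd())
    for name, ok in status.items():
        print(f"{name:10s} {'found' if ok else 'MISSING'}")
```

Running this from the project root prints one line per artifact, which makes a missing manifest or model folder visible before any heavy imports run.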



### Step 0 — Load retrieval artifacts (index + metadata)

In [1]:
# Step 0 — Load retrieval artifacts (index + metadata)
from pathlib import Path
import json
import pandas as pd
import numpy as np

try:
    import faiss  # type: ignore
except ImportError:
    raise SystemExit("faiss is required. Install with `pip install faiss-cpu` on Windows.")

import pyarrow.parquet as pq
from sentence_transformers import SentenceTransformer

# Resolve project root (folder that contains `data`)
PROJECT_ROOT = Path.cwd() if (Path.cwd() / 'data').exists() else Path.cwd().parent
INDEX_DIR = PROJECT_ROOT / 'models' / 'rag_llm' / 'step_0'
MANIFEST_PATH = INDEX_DIR / 'manifest.json'

if not MANIFEST_PATH.exists():
    raise SystemExit(f"Missing manifest at {MANIFEST_PATH}. Build artifacts with the training notebook's Step 0 once.")

with open(MANIFEST_PATH, 'r', encoding='utf-8') as f:
    manifest = json.load(f)

# Load FAISS index and metadata
faiss_index = faiss.read_index(manifest['index_path'])
md_df = pq.read_table(manifest['metadata_path']).to_pandas()

# Sanity checks and conveniences
if 'place' not in md_df.columns or 'comment' not in md_df.columns:
    raise SystemExit("Metadata parquet missing required columns: 'place', 'comment'.")

# Ensure stars as float alias
if 'stars_float' in md_df.columns:
    md_df['stars'] = md_df['stars_float'].astype(float)
elif 'stars' in md_df.columns:
    md_df['stars'] = pd.to_numeric(md_df['stars'], errors='coerce').astype(float)
else:
    md_df['stars'] = np.nan

ALLOWED_PLACES = sorted(md_df['place'].dropna().unique().tolist())

# Embedding model for retrieval
EMBED_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'
embedder = SentenceTransformer(EMBED_MODEL)


def retrieve(query: str, k: int = 8) -> pd.DataFrame:
    q_emb = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True).astype('float32')
    scores, idx = faiss_index.search(q_emb, k)
    hits = []
    for i, s in zip(idx[0], scores[0]):
        if int(i) == -1:
            continue
        row = md_df.iloc[int(i)].to_dict()
        row['score'] = float(s)
        hits.append(row)
    return pd.DataFrame(hits)
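
`retrieve()` normalizes the query embedding before searching. If the FAISS index was built as an inner-product index over normalized review embeddings (an assumption; the index-building step is not shown in this notebook), the returned scores are cosine similarities. The pure-Python sketch below illustrates that idea on toy vectors; `l2_normalize` and `top_k_inner_product` are hypothetical names and do not use FAISS.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_inner_product(
    query: Sequence[float], corpus: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Brute-force top-k by inner product of unit vectors (= cosine similarity)."""
    q = l2_normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, l2_normalize(doc))))
        for i, doc in enumerate(corpus)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for idx, score in top_k_inner_product([2.0, 0.0], corpus, k=2):
        print(idx, round(score, 4))
```

The query `[2.0, 0.0]` has a different length from `corpus[0]` but points the same way, so after normalization the two score 1.0.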



### Step 1 — Load fine-tuned model (from models/rag_llm/final or extract from ZIP)

In [6]:
# Step 1 — Load fine-tuned model (from models/rag_llm/final or extract from ZIP)
import os
import re
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, __version__ as transformers_version

MODEL_DIR = PROJECT_ROOT / 'models' / 'rag_llm' / 'final'
ZIP_PATH = PROJECT_ROOT / 'models' / 'rag_llm' / 'final.zip'

# If final folder missing but zip exists, extract just-in-time
if not MODEL_DIR.exists() and ZIP_PATH.exists():
    import zipfile
    print(f"Extracting model from {ZIP_PATH} ...")
    with zipfile.ZipFile(ZIP_PATH, 'r') as zf:
        zf.extractall(PROJECT_ROOT / 'models' / 'rag_llm')

if not MODEL_DIR.exists():
    raise SystemExit(f"Model folder not found at {MODEL_DIR}. Ensure the trained model exists.")

# Resolve runtime settings
using_cuda = torch.cuda.is_available()
device = 'cuda:0' if using_cuda else 'cpu'
load_dtype = torch.float16 if using_cuda else torch.float32

# Load model and tokenizer (pin to single GPU with dynamic cap)
ft_tok = AutoTokenizer.from_pretrained(str(MODEL_DIR))
if using_cuda:
    try:
        _props = torch.cuda.get_device_properties(0)
        _total_gib = _props.total_memory / (1024**3)
        _headroom_gib = max(0.5, _total_gib * 0.10)  # 10% or at least 0.5 GiB headroom
        _usable_gib = max(1.0, _total_gib - _headroom_gib)
        max_mem = {"cuda:0": f"{_usable_gib:.2f}GiB"}
    except Exception:
        max_mem = {"cuda:0": "7.50GiB"}
else:
    max_mem = None

ft_model = AutoModelForCausalLM.from_pretrained(
    str(MODEL_DIR),
    trust_remote_code=True,
    torch_dtype=load_dtype,
    device_map={"": 0} if using_cuda else device,
    max_memory=max_mem
).eval()

# Runtime report
print("Model loaded and ready. Runtime settings:")
print(f"  - Model path: {MODEL_DIR}")
print(f"  - Transformers: {transformers_version}")
print(f"  - Torch: {torch.__version__}")
print(f"  - Device: {'CUDA' if using_cuda else 'CPU'} ({'cuda:0' if using_cuda else device})")
print(f"  - torch_dtype: {str(load_dtype).replace('torch.', '')}")
if using_cuda and max_mem:
    print(f"  - max_memory: {max_mem['cuda:0']}")

if using_cuda:
    try:
        dev_index = 0
        print(f"  - CUDA version: {torch.version.cuda}")
        print(f"  - GPU name: {torch.cuda.get_device_name(dev_index)}")
        props = torch.cuda.get_device_properties(dev_index)
        total = props.total_memory / (1024**3)
        reserved = torch.cuda.memory_reserved(dev_index) / (1024**3)
        allocated = torch.cuda.memory_allocated(dev_index) / (1024**3)
        print(f"  - GPU memory (GB): total={total:.2f}, reserved={reserved:.2f}, allocated={allocated:.2f}")
    except Exception as e:
        print(f"  - CUDA info unavailable: {e}")



Model loaded and ready. Runtime settings:
  - Model path: c:\Users\TARIK\Desktop\Charles Darwin University\4 - Year 1 - Semester 2\IT CODE FAIR\Data Science Challenge\models\rag_llm\final
  - Transformers: 4.56.2
  - Torch: 2.8.0+cu129
  - Device: CUDA (cuda:0)
  - torch_dtype: float16
  - max_memory: 7.16GiB
  - CUDA version: 12.9
  - GPU name: NVIDIA GeForce RTX 5060 Laptop GPU
  - GPU memory (GB): total=7.96, reserved=2.99, allocated=1.21
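
The `max_memory` value in the report comes from the headroom rule in the loader cell: reserve 10% of total GPU memory (at least 0.5 GiB) and never go below 1 GiB usable. The sketch below restates that arithmetic as a standalone function so it can be checked without a GPU; `usable_memory_gib` and `format_cap` are hypothetical names.

```python
def usable_memory_gib(
    total_gib: float,
    headroom_frac: float = 0.10,
    min_headroom_gib: float = 0.5,
    floor_gib: float = 1.0,
) -> float:
    """Mirror the loader cell: total minus max(10%, 0.5 GiB), floored at 1 GiB."""
    headroom = max(min_headroom_gib, total_gib * headroom_frac)
    return max(floor_gib, total_gib - headroom)


def format_cap(total_gib: float) -> str:
    """Format the cap the way the notebook passes it to `max_memory`."""
    return f"{usable_memory_gib(total_gib):.2f}GiB"


if __name__ == "__main__":
    for total in (7.96, 4.0, 1.2):
        print(total, "->", format_cap(total))
```

With the 7.96 GiB total reported above, the rule yields `7.16GiB`, consistent with the `max_memory` line in the output.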


### Step 2 — Helper: build context and ask_model()

In [24]:
# Step 2 — Helper: build context and ask_model()

def build_context(hits: pd.DataFrame, max_rows: int = 5) -> str:
    rows = []
    for _, r in hits.head(max_rows).iterrows():
        snippet = str(r['comment'])[:240]
        rows.append(f"- Place: {r['place']} | Source: {r.get('source', 'unknown')} | Stars: {float(r.get('stars', float('nan'))):.1f} | Review: {snippet}")
    return "\n".join(rows) if rows else "No context found."


def normalize_place_names_in_response(response_text: str) -> str:
    mapping = {
        'Kakadu National Park – Gunlom Falls': 'Kakadu',
        'Kakadu Gunlom Falls': 'Kakadu',
        'Nitmiluk (Katherine Gorge / Nitmiluk National Park)': 'Nitmiluk (Katherine Gorge)',
        'Tjoritja / West MacDonnell National Park': 'West MacDonnell National Park',
        'West MacDonnell – Ormiston Gorge': 'West MacDonnell National Park',
        'West MacDonnell Ormiston': 'West MacDonnell National Park',
    }
    normalized = response_text
    for old, new in mapping.items():
        normalized = normalized.replace(old, new)
    return normalized


def ask_model(query: str, k: int = 8) -> str:
    hits = retrieve(query, k=k)
    context = build_context(hits, max_rows=5)

    # Soft switches from user input
    user_wants_think = "/no_think" not in query.lower()
    clean_query = query.replace("/think", "").replace("/no_think", "").strip()

    # Derive top candidate places from hits by mean stars
    top_candidates = []
    try:
        if not hits.empty and 'place' in hits.columns and 'stars' in hits.columns:
            top_candidates = (
                hits.groupby('place')['stars']
                    .mean()
                    .sort_values(ascending=False)
                    .head(3)
                    .index.tolist()
            )
    except Exception:
        top_candidates = []

    candidate_str = ", ".join([p for p in top_candidates if p in ALLOWED_PLACES])

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful travel assistant for Australian destinations. "
                "Only recommend places from Allowed Places. Base your answer ONLY on the review context. "
                "Think inside <think>...</think> concisely. After </think>, write a SINGLE short paragraph starting with 'Answer:' that recommends 2–3 places."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Allowed Places: {', '.join(ALLOWED_PLACES)}\n\n"
                f"Use these candidate places if relevant: {candidate_str if candidate_str else 'N/A'}\n\n"
                f"User query: {clean_query}\n\n"
                f"Review context:\n{context}"
            ),
        },
    ]

    # Build prompt as text with Qwen's thinking mode toggled by user
    prompt_text = ft_tok.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=user_wants_think,
    )
    device = ft_model.device
    inputs = ft_tok(prompt_text, return_tensors="pt").to(device)
    if 'attention_mask' not in inputs:
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'], device=device)

    # Use Qwen's recommended sampling settings for thinking mode
    with torch.no_grad():
        out = ft_model.generate(
            **inputs,
            max_new_tokens=240,
            do_sample=True,
            temperature=0.6 if user_wants_think else 0.7,
            top_p=0.95 if user_wants_think else 0.8,
            top_k=20,
            repetition_penalty=1.15 if user_wants_think else 1.1,
            pad_token_id=ft_tok.eos_token_id,
            eos_token_id=ft_tok.eos_token_id,
            use_cache=True,
        )

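    # Decode only the newly generated tokens (exclude the prompt); splitting the
    # full text on "assistant" breaks whenever the reply itself contains that word
    gen_ids = out[0][inputs["input_ids"].shape[1]:]
    full = ft_tok.decode(gen_ids, skip_special_tokens=True).strip()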

    # Parse <think>...</think> and answer content
    import re as _re
    think_match = _re.search(r"<think>(.*?)</think>", full, flags=_re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else ""

    # Prefer explicit 'Answer:' section if present
    answer_match = _re.search(r"Answer:\s*(.+)$", full, flags=_re.DOTALL)
    if answer_match:
        answer = answer_match.group(1).strip()
    else:
        answer = full[think_match.end():].strip() if think_match else full.strip()

    # Sanitize any residual thinking tags from the answer
    answer = _re.sub(r"<think>.*?</think>", "", answer, flags=_re.DOTALL)
    answer = answer.replace("</think>", "").replace("<think>", "").strip()

    # If answer is missing/too short, re-prompt once with thinking disabled for a concise answer
    if len(answer) < 30:
        retry_messages = [
            {
                "role": "system",
                "content": (
                    "You are a helpful travel assistant. Provide a single concise paragraph answer only. "
                    "Do not include think tags or reasoning."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"From these Allowed Places: {', '.join(ALLOWED_PLACES)}, and candidates: {candidate_str if candidate_str else 'N/A'}.\n"
                    f"User query: {clean_query}\n\nReview context:\n{context}\n\n"
                    "Write 1–2 sentences recommending the top 2–3 places, grounded in the context."
                ),
            },
        ]
        retry_prompt = ft_tok.apply_chat_template(
            retry_messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )
        retry_inputs = ft_tok(retry_prompt, return_tensors="pt").to(device)
        if 'attention_mask' not in retry_inputs:
            retry_inputs['attention_mask'] = torch.ones_like(retry_inputs['input_ids'], device=device)
        with torch.no_grad():
            retry_out = ft_model.generate(
                **retry_inputs,
                max_new_tokens=150,
                do_sample=True,
                temperature=0.7,
                top_p=0.95,
                top_k=20,
                repetition_penalty=1.1,
                pad_token_id=ft_tok.eos_token_id,
                eos_token_id=ft_tok.eos_token_id,
                use_cache=True,
            )
        # Decode only the newly generated tokens (exclude the prompt)
        retry_ids = retry_out[0][retry_inputs["input_ids"].shape[1]:]
        answer = ft_tok.decode(retry_ids, skip_special_tokens=True).strip()

    # Clean thinking (trim, collapse whitespace)
    thinking_clean = _re.sub(r"\n\s*\n+", "\n\n", thinking).strip()

    return (
        f"🧠 THINKING:\n{thinking_clean if thinking_clean else 'N/A'}\n\n"
        f"💬 ANSWER:\n{answer.strip()}\n\n"
        f"🔎 CONTEXT USED:\n{context}"
    )
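
The parsing at the end of `ask_model()` can be exercised without loading the model. The sketch below pulls the same logic into a standalone function: extract the `<think>` block, prefer an explicit `Answer:` section, fall back to whatever follows `</think>`, then strip leftover tags. `split_thinking_and_answer` is a hypothetical name, not something the notebook defines.

```python
import re
from typing import Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)
ANSWER_RE = re.compile(r"Answer:\s*(.+)$", flags=re.DOTALL)


def split_thinking_and_answer(generated: str) -> Tuple[str, str]:
    """Return (thinking, answer) parsed from generated text."""
    think_match = THINK_RE.search(generated)
    thinking = think_match.group(1).strip() if think_match else ""

    answer_match = ANSWER_RE.search(generated)
    if answer_match:
        # Prefer the explicit 'Answer:' section when present
        answer = answer_match.group(1).strip()
    elif think_match:
        # Otherwise take everything after the closing think tag
        answer = generated[think_match.end():].strip()
    else:
        answer = generated.strip()

    # Remove any residual thinking tags from the answer
    answer = THINK_RE.sub("", answer)
    answer = answer.replace("<think>", "").replace("</think>", "").strip()
    return thinking, answer


if __name__ == "__main__":
    sample = "<think>\nCompare reviews.\n</think>\n\nAnswer: Try Kakadu and Nitmiluk."
    print(split_thinking_and_answer(sample))
```

One limitation carried over from the notebook: the `Answer:` pattern matches the first occurrence anywhere in the text, so an `Answer:` inside the thinking block would be picked up as well.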



In [25]:
# Two-stage inference: Stage 1 (plan with thinking) -> Stage 2 (answer only)

import re as _re2
from typing import Dict, Any


def first_stage_plan(query: str, k: int = 8) -> Dict[str, Any]:
    """Run a first pass that retrieves, builds context, and lets the model think.
    Returns a dict with 'thinking', 'context', 'candidate_str', and 'raw_answer'.
    """
    hits = retrieve(query, k=k)
    context = build_context(hits, max_rows=5)

    # Derive top candidates from hits by mean stars (same logic as ask_model)
    top_candidates = []
    try:
        if not hits.empty and 'place' in hits.columns and 'stars' in hits.columns:
            top_candidates = (
                hits.groupby('place')['stars']
                .mean()
                .sort_values(ascending=False)
                .head(3)
                .index.tolist()
            )
    except Exception:
        top_candidates = []

    candidate_str = ", ".join([p for p in top_candidates if p in ALLOWED_PLACES])

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful travel assistant for Australian destinations. "
                "Only recommend places from Allowed Places. Base your answer ONLY on the review context. "
                "Think inside <think>...</think> concisely. After </think>, write a SINGLE short paragraph starting with 'Answer:' that recommends 2–3 places."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Allowed Places: {', '.join(ALLOWED_PLACES)}\n\n"
                f"Use these candidate places if relevant: {candidate_str if candidate_str else 'N/A'}\n\n"
                f"User query: {query}\n\n"
                f"Review context:\n{context}"
            ),
        },
    ]

    prompt_text = ft_tok.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )

    device = ft_model.device
    inputs = ft_tok(prompt_text, return_tensors="pt").to(device)
    if 'attention_mask' not in inputs:
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'], device=device)

    with torch.no_grad():
        out = ft_model.generate(
            **inputs,
            max_new_tokens=240,
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            top_k=20,
            repetition_penalty=1.15,
            pad_token_id=ft_tok.eos_token_id,
            eos_token_id=ft_tok.eos_token_id,
            use_cache=True,
        )

    # Decode only the newly generated tokens (exclude the prompt)
    gen_ids = out[0][inputs["input_ids"].shape[1]:]
    full = ft_tok.decode(gen_ids, skip_special_tokens=True).strip()

    think_match = _re2.search(r"<think>(.*?)</think>", full, flags=_re2.DOTALL)
    thinking = think_match.group(1).strip() if think_match else ""

    answer_match = _re2.search(r"Answer:\s*(.+)$", full, flags=_re2.DOTALL)
    raw_answer = answer_match.group(1).strip() if answer_match else full.strip()

    thinking_clean = _re2.sub(r"\n\s*\n+", "\n\n", thinking).strip()

    return {
        "thinking": thinking_clean,
        "context": context,
        "candidate_str": candidate_str,
        "raw_answer": raw_answer,
    }


def second_stage_answer(query: str, stage1: Dict[str, Any]) -> str:
    """Run a second pass that consumes stage1 thinking + context and outputs only the final answer.
    Returns the cleaned, concise answer text.
    """
    thinking = stage1.get("thinking", "").strip()
    context = stage1.get("context", "").strip()

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful travel assistant. Provide ONLY a concise final answer. "
                "Do not include think tags, step-by-step plans, or prefaces. Start directly with 'Answer:' and write 1–2 sentences recommending 2–3 places grounded in the provided context."
            ),
        },
        {
            "role": "user",
            "content": (
                f"User query: {query}\n\n"
                f"Model planning notes (for your input only, do NOT echo):\n{thinking}\n\n"
                f"Review context (source of truth):\n{context}\n\n"
                "Write only the final answer paragraph."
            ),
        },
    ]

    prompt_text = ft_tok.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    device = ft_model.device
    inputs = ft_tok(prompt_text, return_tensors="pt").to(device)
    if 'attention_mask' not in inputs:
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'], device=device)

    with torch.no_grad():
        out = ft_model.generate(
            **inputs,
            max_new_tokens=180,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            top_k=20,
            repetition_penalty=1.1,
            pad_token_id=ft_tok.eos_token_id,
            eos_token_id=ft_tok.eos_token_id,
            use_cache=True,
        )

    # Decode only the newly generated tokens (exclude the prompt)
    gen_ids = out[0][inputs["input_ids"].shape[1]:]
    text = ft_tok.decode(gen_ids, skip_special_tokens=True).strip()

    # Strip any stray think tags and normalize known place names
    text = _re2.sub(r"<think>.*?</think>", "", text, flags=_re2.DOTALL).strip()
    text = normalize_place_names_in_response(text)
    return text


def ask_model_two_stage(query: str, k: int = 8) -> Dict[str, str]:
    """Convenience wrapper that returns a dict with keys: thinking, context, answer."""
    stage1 = first_stage_plan(query, k=k)
    final_answer = second_stage_answer(query, stage1)
    return {
        "thinking": stage1["thinking"] or "N/A",
        "context": stage1["context"],
        "answer": final_answer.strip(),
    }
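
The data flow between the two stages is easier to see with the model call factored out. In the sketch below, `GenerateFn` is a hypothetical stand-in for "apply the chat template, call `ft_model.generate`, decode"; it is an assumption introduced here so the flow can run with a fake generator. The prompts are shortened relative to the cell above.

```python
from typing import Callable, Dict

# (prompt, enable_thinking) -> generated text. Hypothetical seam, not a notebook API.
GenerateFn = Callable[[str, bool], str]


def two_stage(query: str, context: str, generate: GenerateFn) -> Dict[str, str]:
    # Stage 1: thinking enabled; keep only the reasoning notes
    stage1 = generate(f"User query: {query}\n\nReview context:\n{context}", True)
    start, end = stage1.find("<think>"), stage1.find("</think>")
    thinking = stage1[start + len("<think>"):end].strip() if 0 <= start < end else ""

    # Stage 2: thinking disabled; the notes are passed in as planning input
    stage2_prompt = (
        f"User query: {query}\n\n"
        f"Model planning notes (do NOT echo):\n{thinking}\n\n"
        f"Review context (source of truth):\n{context}"
    )
    answer = generate(stage2_prompt, False).strip()
    return {"thinking": thinking or "N/A", "context": context, "answer": answer}


if __name__ == "__main__":
    def fake_generate(prompt: str, enable_thinking: bool) -> str:
        if enable_thinking:
            return "<think>Kakadu has pools.</think>\nAnswer: draft"
        return "Answer: Visit Kakadu."

    print(two_stage("swimming spots", "- Place: Kakadu", fake_generate))
```

The stage 1 draft answer is discarded; only the thinking and the retrieved context reach stage 2, which matches how `second_stage_answer()` consumes the `stage1` dict.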



### Step 3 — Quick tests

In [26]:
# Step 3 — Quick tests (two-stage)
print("🤖 Testing fine-tuned model (two-stage):")
print("="*60)

# Queries carry no /think or /no_think flags; stage 1 always enables thinking
test_queries = [
    "best waterfalls and swimming spots",
    "family-friendly walks and sunset views",
    "places that are not too hot",
]

for i, q in enumerate(test_queries, 1):
    print(f"\n🗣️ Query {i}: {q}")
    print("🤖 Response (two-stage):")
    try:
        result = ask_model_two_stage(q)
        print("🧠 THINKING:\n" + (result["thinking"] or "N/A"))
        print("\n🔎 CONTEXT USED:\n" + result["context"])
        print("\n💬 ANSWER (final):\n" + result["answer"])
    except Exception as e:
        print(f"❌ Error: {e}")
    print("\n" + "="*60)



🤖 Testing fine-tuned model (two-stage):

🗣️ Query 1: best waterfalls and swimming spots
🤖 Response (two-stage):
🧠 THINKING:
Okay, the user is asking about the best waterfalls and swimming spots. Let me look through the allowed places. There's Alice Springs Desert Park, Kakadu, Nitmiluk (Katherine Gorge), and others like Uluru-Kata Tjuta, Tjoritja / West MacDonnell National Park, etc., but primarily focus on waterfalls.

Looking at reviews: Nitmiluk (Katherine Gorge) has a swimming hole. Kakadu also has a waterfall. Kakadu National Park – Gunlom Falls is even better with awesome views and a swimming pool. Tjoritja / West MacDonnell also has waterfalls. West MacDonnell – Ormiston Gorge might be another option. Need to make sure these are indeed famous spots. Yes, those seem right. So recommend Nitmiluk as a general good ride, Gunlom as a specific one, and maybe Tjoritja as well.

🔎 CONTEXT USED:
- Place: Kakadu | Source: GoogleMaps | Stars: 4.5 | Review: Lovely, there are secrete swimmin

### Step 4 — Interactive chat

The chat uses the two-stage flow automatically: stage 1 plans with concise thinking and shows the retrieved context, then stage 2 produces a clean final answer. The `/think` and `/no_think` flags are not needed.
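
The loop control can be tested without a model or a live terminal by injecting the input and answer functions. The sketch below is one way to do that; `chat_loop`, `answer_fn`, `input_fn` and `output_fn` are hypothetical names, and the `EOFError`/`KeyboardInterrupt` handling is an addition that the notebook's loop does not have.

```python
from typing import Callable, Dict

EXIT_WORDS = {"quit", "exit", "q"}


def chat_loop(
    answer_fn: Callable[[str], Dict[str, str]],
    input_fn: Callable[[str], str] = input,
    output_fn: Callable[[str], None] = print,
) -> int:
    """Run the chat until an exit word is typed; return the number of answers."""
    answered = 0
    while True:
        try:
            user_query = input_fn("\nYour question: ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # added here: leave cleanly when input is closed or interrupted
        if user_query.lower() in EXIT_WORDS:
            break
        if not user_query:
            continue  # ignore empty lines
        result = answer_fn(user_query)
        output_fn("THINKING:\n" + (result.get("thinking") or "N/A"))
        output_fn("CONTEXT USED:\n" + result.get("context", ""))
        output_fn("ANSWER (final):\n" + result.get("answer", ""))
        answered += 1
    return answered
```

In the notebook, `answer_fn` would be `ask_model_two_stage`; in a test it can be any function that returns a dict with `thinking`, `context` and `answer` keys.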


In [27]:
# Step 4 — Interactive chat (two-stage)

def interactive_chat():
    print("\n🎯 Interactive mode - Ask your own questions")
    print("Type 'quit' to exit")
    while True:
        user_query = input("\n🗣️ Your question: ").strip()
        if user_query.lower() in {"quit", "exit", "q"}:
            break
        if not user_query:
            continue
        try:
            print(f"\n🗣️ User: {user_query}")
            result = ask_model_two_stage(user_query)
            print("\n🧠 THINKING:\n" + (result["thinking"] or "N/A"))
            print("\n🔎 CONTEXT USED:\n" + result["context"])
            # Show answer only (no trimming to sentences; already concise)
            print("\n💬 ANSWER (final):\n" + result["answer"])
        except Exception as e:
            print(f"❌ Error: {e}")

interactive_chat()




🎯 Interactive mode - Ask your own questions
Type 'quit' to exit

🗣️ User: which place is better for walking

🧠 THINKING:
Okay, the user is asking which place is better for walking based on reviews. Let me look through the allowed places.

First, check each review's place and stars. Uluru-Kata Tjuta has a 4.8 star. The walk is generally praised, though some people might find it overrated. Then there's Alice Springs Desert Park, where I would rate it based on the information provided. Wait, actually looking back, the original question was about which place is better for walking, and the reviews don't directly compare places. However, considering the closest related places, Uluru-Kata Tjuta seems to be the main recommendation here. Nitmiluk (Katherine Gorge) also gets mentions, but that's closer in style. Kakadu National Park – Gunlom Falls is another option. Since the user wants recommendations, I'll go with Uluru-Kata Tjuta as the best fit based on reviews that highlight ease of walkin

### Step 5 — COMET evaluation

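Scores the model's two-stage responses against hand-written reference answers with COMET (`Unbabel/wmt22-comet-da`). Generation uses beam search without sampling so that the scores are reproducible.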
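
The evaluation cell passes COMET a list of dicts with `src`, `mt` and `ref` keys (query, model response, reference answer) and then reports the average, minimum and maximum score. The sketch below shows only that data preparation and the summary step; it does not call COMET, and `build_comet_rows` and `summarize` are hypothetical helper names.

```python
from typing import Dict, List, Sequence


def build_comet_rows(
    items: Sequence[Dict[str, str]], responses: Sequence[str]
) -> List[Dict[str, str]]:
    """Pair each test item with its model response in src/mt/ref form."""
    if len(items) != len(responses):
        raise ValueError("items and responses must have the same length")
    return [
        {"src": item["query"], "mt": resp, "ref": item["reference"]}
        for item, resp in zip(items, responses)
    ]


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Average/min/max over per-example scores; zeros when there are none."""
    vals = [float(s) for s in scores]
    if not vals:
        return {"avg": 0.0, "min": 0.0, "max": 0.0}
    return {"avg": sum(vals) / len(vals), "min": min(vals), "max": max(vals)}


if __name__ == "__main__":
    items = [{"query": "easy hikes", "reference": "Nitmiluk has easy trails."}]
    rows = build_comet_rows(items, ["Try Nitmiluk for gentle walks."])
    print(rows)
    print(summarize([0.5, 0.7, 0.9]))  # placeholder scores, not real COMET output
```

The scores in the usage example are made-up placeholders to show the summary shape; real values come from `comet_model.predict` in the cell below.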


In [None]:
# Step 5 — Deterministic COMET evaluation using two-stage answers

# Deterministic COMET evaluation (Stage 2 only) with place normalization
try:
    from comet import download_model, load_from_checkpoint
    import json, re, random
    import numpy as _np

    # Define OUTPUT_DIR locally for saving results
    OUTPUT_DIR = PROJECT_ROOT / 'models' / 'rag_llm'

    # Set global seeds for reproducibility
    random.seed(42)
    _np.random.seed(42)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(42)
    torch.manual_seed(42)

    print("Loading COMET model...")
    comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

    # Comprehensive test queries and reference answers
    test_data = [
        {"query": "best waterfalls and swimming spots", "reference": "Nitmiluk (Katherine Gorge) offers great waterfalls and swimming holes. Kakadu has Gunlom Falls with refreshing pools perfect for swimming. West MacDonnell National Park has natural waterholes for swimming."},
        {"query": "family-friendly walks and sunset views", "reference": "Devils Marbles (Karlu Karlu) provides beautiful sunset views and family-friendly walks. West MacDonnell National Park offers scenic walks suitable for families with spectacular sunset views."},
        {"query": "places that are not too hot", "reference": "Nitmiluk (Katherine Gorge) offers cooler temperatures and peaceful natural settings. West MacDonnell National Park provides pleasant conditions for outdoor activities."},
        {"query": "cultural experiences and wildlife viewing", "reference": "Kakadu offers rich cultural experiences with Aboriginal rock art and diverse wildlife. Uluru-Kata Tjuta provides cultural significance and unique wildlife viewing opportunities."},
        {"query": "scenic drives and photography spots", "reference": "West MacDonnell National Park offers spectacular scenic drives with excellent photography opportunities. Devils Marbles provides unique rock formations perfect for photography."},
        {"query": "quiet camping and stargazing", "reference": "West MacDonnell National Park offers quiet camping spots with excellent stargazing opportunities. Devils Marbles provides peaceful camping with clear night skies."},
        {"query": "easy hikes for beginners", "reference": "Nitmiluk (Katherine Gorge) offers easy walking trails suitable for beginners. West MacDonnell National Park has gentle walks perfect for those new to hiking."},
        {"query": "places with good facilities and amenities", "reference": "Kakadu has well-developed facilities and visitor centers. Uluru-Kata Tjuta offers comprehensive amenities for tourists."},
        {"query": "water activities and boat tours", "reference": "Nitmiluk (Katherine Gorge) offers excellent boat tours and water activities. Kakadu provides various water-based experiences and boat cruises."},
        {"query": "places to visit during wet season", "reference": "Kakadu is particularly beautiful during the wet season with flowing waterfalls. Nitmiluk (Katherine Gorge) offers different experiences during wet season with higher water levels."}
    ]

    # Deterministic generation helpers (two-stage)
    def _first_stage_plan_deterministic(query: str, k: int = 8) -> dict:
        hits = retrieve(query, k=k)
        context = build_context(hits, max_rows=5)

        # Build messages mirroring first_stage_plan but we'll decode deterministically
        top_candidates = []
        try:
            if not hits.empty and 'place' in hits.columns and 'stars' in hits.columns:
                top_candidates = (
                    hits.groupby('place')['stars']
                    .mean()
                    .sort_values(ascending=False)
                    .head(3)
                    .index.tolist()
                )
        except Exception:
            top_candidates = []
        candidate_str = ", ".join([p for p in top_candidates if p in ALLOWED_PLACES])

        messages = [
            {"role": "system", "content": (
                "You are a helpful travel assistant for Australian destinations. "
                "Only recommend places from Allowed Places. Base your answer ONLY on the review context. "
                "Think inside <think>...</think> concisely. After </think>, write a SINGLE short paragraph starting with 'Answer:' that recommends 2–3 places."
            )},
            {"role": "user", "content": (
                f"Allowed Places: {', '.join(ALLOWED_PLACES)}\n\n"
                f"Use these candidate places if relevant: {candidate_str if candidate_str else 'N/A'}\n\n"
                f"User query: {query}\n\n"
                f"Review context:\n{context}"
            )}
        ]

        prompt_text = ft_tok.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True,
        )
        device = ft_model.device
        inputs = ft_tok(prompt_text, return_tensors='pt').to(device)
        if 'attention_mask' not in inputs:
            inputs['attention_mask'] = torch.ones_like(inputs['input_ids'], device=device)

        with torch.no_grad():
            out = ft_model.generate(
                **inputs,
                max_new_tokens=220,
                do_sample=False,
                num_beams=4,
                length_penalty=1.0,
                repetition_penalty=1.1,
                pad_token_id=ft_tok.eos_token_id,
                eos_token_id=ft_tok.eos_token_id,
                use_cache=True,
            )
        # Decode only newly generated tokens (exclude prompt)
        gen_ids = out[0][inputs["input_ids"].shape[1]:]
        full = ft_tok.decode(gen_ids, skip_special_tokens=True)
        # Robustly extract think and raw answer
        think_match = re.search(r"<think>(.*?)</think>", full, flags=re.DOTALL)
        thinking = think_match.group(1).strip() if think_match else ""
        answer_match = re.search(r"Answer:\s*(.+)$", full, flags=re.DOTALL)
        raw_answer = answer_match.group(1).strip() if answer_match else full.strip()
        thinking = re.sub(r"\n\s*\n+", "\n\n", thinking).strip()
        return {"thinking": thinking, "context": context, "raw_answer": raw_answer}

    def _second_stage_answer_deterministic(query: str, stage1: dict) -> str:
        thinking = stage1.get('thinking', '').strip()
        context = stage1.get('context', '').strip()
        messages = [
            {"role": "system", "content": (
                "You are a helpful travel assistant. Provide ONLY a concise final answer. "
                "Do not include think tags, step-by-step plans, or prefaces. Start directly with 'Answer:' and write 1–2 sentences recommending 2–3 places grounded in the provided context."
            )},
            {"role": "user", "content": (
                f"User query: {query}\n\n"
                f"Model planning notes (for your input only, do NOT echo):\n{thinking}\n\n"
                f"Review context (source of truth):\n{context}\n\n"
                "Write only the final answer paragraph."
            )}
        ]
        prompt_text = ft_tok.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )
        device = ft_model.device
        inputs = ft_tok(prompt_text, return_tensors='pt').to(device)
        if 'attention_mask' not in inputs:
            inputs['attention_mask'] = torch.ones_like(inputs['input_ids'], device=device)
        with torch.no_grad():
            out = ft_model.generate(
                **inputs,
                max_new_tokens=140,
                do_sample=False,
                num_beams=4,
                length_penalty=1.0,
                repetition_penalty=1.05,
                pad_token_id=ft_tok.eos_token_id,
                eos_token_id=ft_tok.eos_token_id,
                use_cache=True,
            )
        # Decode only newly generated tokens (exclude prompt)
        gen_ids = out[0][inputs["input_ids"].shape[1]:]
        text = ft_tok.decode(gen_ids, skip_special_tokens=True)
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
        # Keep it short: first 2 sentences to better match references
        sents = re.split(r"(?<=[.!?])\s+", text)
        text = " ".join(sents[:2]).strip()
        text = normalize_place_names_in_response(text)
        return text

    # Load final model (reuse already loaded ft_tok/ft_model if present)
    # Nothing is loaded here: ft_tok and ft_model come from Step 1
    print("Reusing the model and tokenizer loaded in Step 1...")

    print("Generating deterministic model responses (two-stage)...")
    model_responses = []
    lengths = []

    for item in test_data:
        try:
            s1 = _first_stage_plan_deterministic(item['query'])
            resp = _second_stage_answer_deterministic(item['query'], s1)
            model_responses.append(resp)
            lengths.append(len(resp.split()))
            print(f"Query: {item['query']}")
            print(f"Model (len={lengths[-1]}): {resp[:120]}...")
            print(f"Reference: {item['reference'][:120]}...")
            print("-" * 50)
        except Exception as e:
            print(f"Error generating response for '{item['query']}': {e}")
            model_responses.append("")
            lengths.append(0)

    # Prepare data for COMET
    # COMET expects one dict per example: src (query), mt (model output), ref (reference)
    comet_data = [
        {"src": item['query'], "mt": resp, "ref": item['reference']}
        for resp, item in zip(model_responses, test_data)
    ]

    # Run COMET evaluation
    print("Running COMET evaluation (deterministic)...")
    try:
        comet_output = comet_model.predict(comet_data, batch_size=8)
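        # Recent unbabel-comet releases return a dict-like Prediction holding 'scores';
        # older releases returned a (segment_scores, system_score) tuple.
        if isinstance(comet_output, dict) and 'scores' in comet_output:
            comet_scores = comet_output['scores']
        elif isinstance(comet_output, tuple):
            comet_scores = comet_output[0]
        else:
            comet_scores = comet_output

        # Coerce to plain floats so the scores can be averaged and JSON-serialized
        comet_scores = [float(s) for s in comet_scores]
        valid_scores = comet_scores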
        avg_score = sum(valid_scores) / len(valid_scores) if valid_scores else 0.0
        min_score = min(valid_scores) if valid_scores else 0.0
        max_score = max(valid_scores) if valid_scores else 0.0

        # Length stats
        avg_len = (sum(lengths) / len(lengths)) if lengths else 0.0
        min_len = min(lengths) if lengths else 0
        max_len = max(lengths) if lengths else 0

        print("\n" + "="*80)
        print("DETERMINISTIC COMET EVALUATION (Two-Stage, place-normalized)")
        print("="*80)
        print(f"Total Queries: {len(comet_data)}")
        print(f"Average COMET: {avg_score:.4f}  |  Min: {min_score:.4f}  |  Max: {max_score:.4f}")
        print(f"Answer length (words): avg={avg_len:.1f}, min={min_len}, max={max_len}")

        print("\nINDIVIDUAL RESULTS")
        for i, (item, score, resp) in enumerate(zip(test_data, valid_scores, model_responses), 1):
            print(f"{i:02d}. {item['query']}  |  Score={score:.4f}  |  len={len(resp.split())}")

    except Exception as e:
        print(f"COMET evaluation error: {e}")
        comet_scores = [0.0 for _ in comet_data]
        avg_score = 0.0
        min_score = 0.0
        max_score = 0.0

    # Save detailed results
    results = {
        "evaluation_metadata": {
            "total_queries": len(comet_data),
            "average_score": avg_score,
            "min_score": min_score,
            "max_score": max_score,
            "evaluation_date": str(pd.Timestamp.now()),
            "place_normalization_applied": True,
            "deterministic": True,
            "num_beams": 4,
            "repetition_penalty": 1.05,
        },
        "test_data": test_data,
        "model_responses": model_responses,
        "comet_scores": comet_scores,
        "length_words": lengths,
    }

    with open(OUTPUT_DIR / "comet_evaluation_deterministic.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    pd.DataFrame({
        'query': [t['query'] for t in test_data],
        'model_response': model_responses,
        'reference': [t['reference'] for t in test_data],
        'comet_score': comet_scores,
        'answer_len_words': lengths,
    }).to_csv(str(OUTPUT_DIR / "comet_evaluation_deterministic.csv"), index=False)
    print(f"\n📁 Deterministic results saved to: {OUTPUT_DIR / 'comet_evaluation_deterministic.json'}")
    print(f"📊 CSV saved to: {OUTPUT_DIR / 'comet_evaluation_deterministic.csv'}")

except ImportError:
    print("COMET not available. Install with: pip install unbabel-comet")
except Exception as e:
    print(f"COMET evaluation failed: {e}")
    print("Make sure you have a trained model and COMET installed.")


Loading COMET model...


Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\TARIK\.cache\huggingface\hub\models--Unbabel--wmt22-comet-da\snapshots\2760a223ac957f30acfb18c8aa649b01cf1d75f2\checkpoints\model.ckpt`
Encoder model frozen.


Loading final model (reusing session)...
Generating deterministic model responses (two-stage)...


c:\Users\TARIK\Desktop\Charles Darwin University\4 - Year 1 - Semester 2\IT CODE FAIR\Data Science Challenge\venv\Lib\site-packages\pytorch_lightning\core\saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']


Query: best waterfalls and swimming spots
Model (len=22): Answer: Kakadu and West MacDonnell National Park are excellent waterfalls and swimming spots. Nitmiluk (Katherine Gorge)...
Reference: Nitmiluk (Katherine Gorge) offers great waterfalls and swimming holes. Kakadu has Gunlom Falls with refreshing pools per...
--------------------------------------------------


Query: family-friendly walks and sunset views
Model (len=23): - Visit Uluru-Kata Tjuta for unforgettable walks and sunset views. - Explore Devils Marbles (Karlu Karlu) for a relaxing...
Reference: Devils Marbles (Karlu Karlu) provides beautiful sunset views and family-friendly walks. West MacDonnell National Park of...
--------------------------------------------------


Query: places that are not too hot
Model (len=14): Answer: West MacDonnell National Park and Kakadu are not too hot places to stay....
Reference: Nitmiluk (Katherine Gorge) offers cooler temperatures and peaceful natural settings. West MacDonnell National Park provi...
--------------------------------------------------


Query: cultural experiences and wildlife viewing
Model (len=17): Answer: Alice Springs Desert Park and Alice Springs National Park offer unique cultural experiences and stunning wildlif...
Reference: Kakadu offers rich cultural experiences with Aboriginal rock art and diverse wildlife. Uluru-Kata Tjuta provides cultura...
--------------------------------------------------


Query: scenic drives and photography spots
Model (len=10): photography spots along scenic drives: Uluru-Kata Tjuta, Nitmiluk (Katherine Gorge)...
Reference: West MacDonnell National Park offers spectacular scenic drives with excellent photography opportunities. Devils Marbles ...
--------------------------------------------------


Query: quiet camping and stargazing
Model (len=33): - Uluru-Kata Tjuta camping grounds offer a unique environment for both camping and stargazing. - West MacDonnell Nationa...
Reference: West MacDonnell National Park offers quiet camping spots with excellent stargazing opportunities. Devils Marbles provide...
--------------------------------------------------

