# RP 7.1 LLMs for Code Generation

## Section 0: Check the presence of the GPU

In [161]:
import torch
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("Current device:", torch.cuda.current_device())
    print("GPU name:", torch.cuda.get_device_name(0))

torch: 2.8.0+cu128
CUDA available: True
GPU count: 2
Current device: 0
GPU name: NVIDIA GeForce RTX 5060 Ti


In [162]:
from utils.hardware import check_gpu
gpu_info = check_gpu(verbose=True)

Is GPU available?: True
Number of GPUs: 2
GPU Name: NVIDIA GeForce RTX 5060 Ti
Current device: 0


In [163]:
#pip install python-dotenv
#!pip install google-generativeai
import sys, torch
print("python:", sys.executable)
print("torch:", torch.__version__)


python: c:\Users\drugm\anaconda3\envs\rp\python.exe
torch: 2.8.0+cu128


## Section 1: Imports

### 1.1 -> Import of libraries

In [164]:
import os

In [225]:
from utils.check_imports import check_libraries
print("Checking required libraries...")
lib_status = check_libraries()
from utils import (
    _qcfg_to_dict,
    download_github_raw_json,
    robust_code_tokenizer_for_s5,
    extract_code_units_from_file,
    generate_kb_for_library_sources
)


Checking required libraries...
torch ✓ (version 2.8.0+cu128)
transformers ✓ (version 4.52.3)
datasets ✓ (version 3.6.0)
huggingface_hub ✓ (version 0.32.2)
tqdm ✓ (version 4.67.1)
rank-bm25 ✓ (version 0.2.2)
numpy ✓ (version 2.3.1)
requests ✓ (version 2.32.5)
ipython ✓ (version 9.2.0)


### 1.2 -> Import of prompts (Baseline & RAG)



| Version | Baseline Builder | RAG Builder | Key idea | Snippet handling | Output protocol |
|----------|------------------|--------------|-------------|------------------|-----------------------|
| **v1**   | `build_baseline_prompt_v1` | `build_rag_prompt_v1` | Minimal structure: “Task → Code” | RAG: “Retrieved Examples” section at the top (raw) | Code after `### Code:` (plain text, simple) |
| **v2**   | `build_baseline_prompt_v2` | `build_rag_prompt_v2` | **seedemu**-specific template with explicit requirements | RAG: adds retrieved snippets + list of library requirements | **Executable code** only, no extra text |
| **v3**   | `build_baseline_prompt_v3` | `build_rag_prompt_v3` | “Senior engineer”: type hints, docstrings, tests | RAG: retrieved snippets as inspiration before the task | Output in a fenced \`\`\`python block with the required parts |
| **v4**   | `build_baseline_prompt_v4` | `build_rag_prompt_v4` | **Raw answer only** (no boilerplate) | RAG: retrieved snippets + same rules; copying snippets is forbidden | Returns **only** the raw answer |
| **v5**   | `build_baseline_prompt_v5` | `build_rag_prompt_v5` | Full implementation + tests; emphasis on quality | RAG: retrieved snippets **truncated** with `truncate_to_n_tokens` | Output in \`\`\`python, no explanations |
| **v6**   | `build_baseline_prompt_v6` | `build_rag_prompt_v6` | **Long Code Arena**: single `.py` file, strict rules | RAG: “Retrieved Library Snippets” section with truncated snippets | **Code** only in a fenced block, no chain-of-thought |
| **v6_2** | `build_baseline_prompt_v6_2` | `build_rag_prompt_v6_2` | **UPPERCASE** variant of v6 (same spirit) | RAG: truncated snippets (same scheme as v6) | Code-only output, \`\`\`PYTHON marker |
| **v6_3** | `build_baseline_prompt_v6_3` | `build_rag_prompt_v6_3` | Like v6_2 but stresses “no errors / clean code” | RAG: same as v6_2 with quality notes | Code-only output, \`\`\`PYTHON marker |
| **v7**   | `build_baseline_prompt_v7` | `build_rag_prompt_v7` | Standalone script, idiomatic library usage, high quality | RAG: retrieved snippets as context (can be truncated) | Code only, no markdown/extras |
| **v8**   | `build_baseline_prompt_v8` | `build_rag_prompt_v8` | Explicit checklist (imports, docstrings, tests, robustness) | RAG: retrieved snippets truncated and integrated | Code-only output after \`\`\`python |
| **v9**   | `build_baseline_prompt_v9` | `build_rag_prompt_v9` | Focus on **API recall** + internal self-review | RAG: truncated “verified excerpts” to anchor the APIs | Code-only output; no external explanations |
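
To make the v1 row of the table concrete, here is a minimal sketch of what a "Task → Code" baseline builder and its RAG counterpart might look like. The names `sketch_truncate`, `sketch_baseline_prompt_v1`, and `sketch_rag_prompt_v1` are hypothetical stand-ins, not the project's `truncate_to_n_tokens` or `build_*_prompt_v1`, and whitespace splitting is an assumed substitute for a real tokenizer.

```python
from typing import List


def sketch_truncate(text: str, n_tokens: int) -> str:
    """Keep at most n_tokens whitespace-separated tokens.

    Assumption: the real helper probably uses a model tokenizer. This toy
    version also collapses line breaks into single spaces.
    """
    tokens = text.split()
    return " ".join(tokens[:n_tokens])


def sketch_baseline_prompt_v1(task: str) -> str:
    """Minimal 'Task -> Code' layout: the model continues after '### Code:'."""
    return f"### Task:\n{task.strip()}\n\n### Code:\n"


def sketch_rag_prompt_v1(task: str, snippets: List[str], max_tokens: int = 50) -> str:
    """Same layout, with a raw 'Retrieved Examples' section placed first."""
    examples = "\n\n".join(
        f"# Example {i}\n{sketch_truncate(s, max_tokens)}"
        for i, s in enumerate(snippets, 1)
    )
    return f"### Retrieved Examples:\n{examples}\n\n" + sketch_baseline_prompt_v1(task)


prompt = sketch_rag_prompt_v1(
    "Write a function that adds two numbers.",
    ["def add(a, b):\n    return a + b"],
)
print(prompt)
```

Later versions in the table differ mainly in the instructions wrapped around this skeleton and in the output marker the model is asked to respect.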


In [166]:
# --- Import the modularized prompts package ---
from prompts import PROMPT_REGISTRY, truncate_to_n_tokens
print("Prompt builders (v1–v9, baseline and RAG) and utilities loaded successfully.")

Prompt builders (v1–v9, baseline and RAG) and utilities loaded successfully.


### 1.3 Setting API Keys for Cloud LLM Usage

In [None]:
# --- API KEYS (kept in notebook; don't commit this cell) ---
%env GOOGLE_API_KEY=not stupid enough to leave this here, sorry
%env OPENROUTER_API_KEY=sk-or-v1-not stupid enough to leave this here, sorry

# Runtime niceties
import os, gc, torch
os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "python")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# If you already imported clients earlier, reload so they see the env.
from importlib import reload
import models.llm_clients as llm; reload(llm)

print("GOOGLE_API_KEY set:", bool(os.getenv("GOOGLE_API_KEY")))
print("OPENROUTER_API_KEY set:", bool(os.getenv("OPENROUTER_API_KEY")))
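
The cell above only reports whether each key is set, which keeps the secret out of the output. A small sketch of the same idea as a reusable helper follows; `describe_key` and the demo variable names are hypothetical, and the demo value exists only so the example runs without real credentials.

```python
import os


def describe_key(name: str) -> str:
    """Return a masked description of an environment variable, never its value."""
    value = os.environ.get(name, "")
    if not value:
        return f"{name}: missing"
    # Report only the length so the secret never reaches the cell output.
    return f"{name}: set ({len(value)} chars)"


# Demo value for illustration only; real keys would be exported in the shell
# or loaded from an untracked file before the kernel starts.
os.environ.setdefault("DEMO_API_KEY", "not-a-real-key")

for key_name in ("DEMO_API_KEY", "SOME_UNSET_KEY_FOR_DEMO"):
    print(describe_key(key_name))
```

Reading keys from the process environment rather than from `%env` lines means the notebook can be committed without first scrubbing the cell.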


## Section 2: LLM & Tokenizer Loading with 4-bit Quantization + Save in Cache

In [168]:
from models.loader import load_model_and_tokenizer
from models.config import load_config

cfg = load_config()
cfg.model.model_name = "codellama/CodeLlama-7b-Instruct-hf"
cfg.cache.root = "./cache"
tokenizer, model, used_cache = load_model_and_tokenizer(cfg)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
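
As a back-of-the-envelope check on why 4-bit loading matters on a consumer GPU, the sketch below estimates weight storage at different precisions and builds a cache directory name. Both helpers are hypothetical: the estimate ignores quantization metadata, activations, and the KV cache, and the `<org>_<model>_4bit_nf4` naming only mirrors the `cache/Qwen_Qwen2.5-Coder-7B-Instruct_4bit_nf4` path that appears later in this notebook. Whether `load_model_and_tokenizer` uses that scheme is an assumption.

```python
from pathlib import Path


def approx_weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 2**30


def sketch_cache_dir(root: str, model_name: str, tag: str = "4bit_nf4") -> Path:
    """Hypothetical cache layout: '<root>/<org>_<model>_<tag>'."""
    return Path(root) / f"{model_name.replace('/', '_')}_{tag}"


n_params = 7e9  # nominal size of a "7B" model
print(f"fp16 : {approx_weight_gib(n_params, 16):.1f} GiB")
print(f"4-bit: {approx_weight_gib(n_params, 4):.1f} GiB")
print(sketch_cache_dir("./cache", "codellama/CodeLlama-7b-Instruct-hf"))
```

The fourfold reduction relative to fp16 is what lets a 7B model's weights fit comfortably alongside activations on a single mid-range card.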

## Section 3: Dataset and Knowledge Base


### -> 3.1 Loading the Dataset

In [169]:
# =========================
# Load + (optional) Filter + Save + Print paths (using pipeline.dataset_ops)
# =========================
from pathlib import Path
from pipeline.dataset_ops import (
    ensure_out_dir,
    save_dataset_jsonl,
    load_and_optionally_filter,
    print_summary,
)

# Parameters
DATASET_NAME = "JetBrains-Research/lca-library-based-code-generation"
SPLIT = "test"  # "train" | "validation" | "test"
TARGET_REPOS = ["seed-emulator", "pyscf__pyscf"]  # [] to disable filtering

# Output
OUT_DIR = Path("data")
OUT_ALL = OUT_DIR / f"lca_{SPLIT}_all.jsonl"
OUT_FILT = OUT_DIR / f"lca_{SPLIT}_filtered.jsonl"

# Run
ensure_out_dir(OUT_DIR)

ds_full, ds_filt = load_and_optionally_filter(
    dataset_name=DATASET_NAME,
    split=SPLIT,
    target_repos=TARGET_REPOS,
)

save_dataset_jsonl(ds_full, OUT_ALL)
save_dataset_jsonl(ds_filt, OUT_FILT)

print_summary(
    ds_full=ds_full,
    ds_filt=ds_filt,
    split=SPLIT,
    target_repos=TARGET_REPOS,
    out_all=OUT_ALL,
    out_filt=OUT_FILT,
)



  Loading dataset 'JetBrains-Research/lca-library-based-code-generation' (split='test')…
Dataset loaded with 150 examples across all libraries.
Filtering to repos: ['seed-emulator', 'pyscf__pyscf']
Filtered to 23 examples.


Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

HuggingFace cache dir: C:\Users\drugm\.cache\huggingface\datasets
cache_files (full split): [{'filename': 'C:\\Users\\drugm\\.cache\\huggingface\\datasets\\JetBrains-Research___lca-library-based-code-generation\\default\\0.0.0\\dc460b7e403f63e58ec7128b80eaf2c9c95344f8\\lca-library-based-code-generation-test.arrow'}]

===== SUMMARY =====
Split: test
Total in split: 150
Filter repos: ['seed-emulator', 'pyscf__pyscf']
Total after filter: 23

Saved files:
 - Full split   : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\data\lca_test_all.jsonl
 - Filtered     : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\data\lca_test_filtered.jsonl
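
The filter-then-save step can be pictured with plain dictionaries. In the sketch below, `filter_by_repo` and `save_jsonl` are hypothetical stand-ins for the `pipeline.dataset_ops` helpers, and the field names `repo_name` and `repo_full_name` are assumptions about the row schema (the target list above mixes a short alias with an `owner__repo` alias, so the sketch matches either).

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def filter_by_repo(rows: Iterable[Dict], target_repos: List[str]) -> List[Dict]:
    """Keep rows whose repo matches a target; an empty target list keeps all."""
    if not target_repos:
        return list(rows)
    targets = set(target_repos)
    # Assumption: a row may carry both a short and a full repo name.
    return [
        r for r in rows
        if r.get("repo_name") in targets or r.get("repo_full_name") in targets
    ]


def save_jsonl(rows: Iterable[Dict], path: Path) -> int:
    """Write one JSON object per line and return the number of rows written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


toy_rows = [
    {"repo_name": "seed-emulator", "repo_full_name": "seed-labs__seed-emulator"},
    {"repo_name": "pyscf", "repo_full_name": "pyscf__pyscf"},
    {"repo_name": "other", "repo_full_name": "org__other"},
]
kept = filter_by_repo(toy_rows, ["seed-emulator", "pyscf__pyscf"])
out_path = Path(tempfile.mkdtemp()) / "data" / "toy_filtered.jsonl"
print(save_jsonl(kept, out_path), "rows written to", out_path)
```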


### -> 3.2 Downloading the KB

In [170]:
# ============================================================
# KB configuration + filter/download 2 specific KBs + print paths
# ============================================================
from pipeline.kb_ops import (
    configure_kb,
    list_kbs_from_remote,
    fetch_target_kbs,
    print_kb_summary,
)
from utils.kb_manager import kb_base_url

# Configuration parameters (adjust if needed)
CFG = configure_kb(
    username="PatrizioAcquadro",
    repo_name="RAG_Project_SE2",
    branch="main",
    kb_folder="knowledge_bases_prod",
    local_dir="./temp_downloaded_kbs",
)

# Print basic info, as in the original cell
print("--- RAG Knowledge Base Configuration (GitHub) ---")
print(f"  KBs will be downloaded from GitHub base URL: {kb_base_url().rstrip('/')}/kb_LIBRARY_KEY.json")
print(f"  Downloaded KBs will be temporarily stored in: {CFG.local_dir}")

# List the available remote KBs (first 20 only, to keep the output short)
try:
    remote_kbs = list_kbs_from_remote()
    print(f"\nFound {len(remote_kbs)} KB files on remote")
    for x in remote_kbs[:20]:
        size_info = f" (size: {x.get('size')} bytes)" if x.get("size") is not None else ""
        print(f"  - {x.get('name')}{size_info}")
except Exception as e:
    print("Error fetching KB metadata from GitHub:", e)

# Filter for the two libraries of interest and make sure they are available locally
TARGET_REPOS = ["seed-emulator", "pyscf"]  # "pyscf__pyscf" also works
downloaded = fetch_target_kbs(CFG, TARGET_REPOS)

# Final summary with the local paths where the KBs can be found
print()
print_kb_summary(CFG, TARGET_REPOS, downloaded)


--- RAG Knowledge Base Configuration (GitHub) ---
  KBs will be downloaded from GitHub base URL: https://raw.githubusercontent.com/PatrizioAcquadro/RAG_Project_SE2/main/knowledge_bases_prod/kb_LIBRARY_KEY.json
  Downloaded KBs will be temporarily stored in: ./temp_downloaded_kbs
Fetching KB files from GitHub…

Found 64 KB files on remote

--- RAG Knowledge Base Configuration (GitHub) ---
  Base URL: https://raw.githubusercontent.com/PatrizioAcquadro/RAG_Project_SE2/main/knowledge_bases_prod/kb_LIBRARY_KEY.json
  Local directory: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\temp_downloaded_kbs

Target KBs:
 - seed-emulator: kb_seed-labs__seed-emulator.json -> C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\temp_downloaded_kbs\kb_seed-labs__seed-emulator.json (exists=True, size=885251)
 - pyscf: kb_pyscf__pyscf.json -> C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\temp_downloaded_kbs\kb_pyscf__pyscf.json (exists=True, size=8315405)


### -> 3.3 Check presence of Dataset and KB in the cache

In [171]:
# ============================================================
# KB check for the target repos: presence on GitHub and locally
# ============================================================
from pathlib import Path

# Reuse CFG if an earlier cell defined it, then import from the utility modules
try:
    CFG
except NameError:
    CFG = None

from pipeline.kb_ops import (
    configure_kb,
    default_kb_name_map,
    resolve_kb_names_for_repos,
    list_kbs_from_remote,
    fetch_target_kbs,
)

# Parameters
TARGET_REPOS = ["seed-emulator", "pyscf"]  # use "pyscf__pyscf" if you prefer the same alias as the dataset

# Configuration (if not already present)
if CFG is None:
    CFG = configure_kb(
        username="PatrizioAcquadro",
        repo_name="RAG_Project_SE2",
        branch="main",
        kb_folder="knowledge_bases_prod",
        local_dir="./temp_downloaded_kbs",
    )

# 1) Resolve the expected KB file names for the target repos
repo_to_kb = resolve_kb_names_for_repos(TARGET_REPOS)  # e.g.: {"seed-emulator": "kb_seed-labs__seed-emulator.json", "pyscf": "kb_pyscf__pyscf.json"}

# 2) Remote listing (GitHub) and existence check for the required KBs
remote = list_kbs_from_remote()
remote_names = {item.get("name") for item in remote if "name" in item}

missing_remote = {repo: fname for repo, fname in repo_to_kb.items() if fname not in remote_names}
present_remote = {repo: fname for repo, fname in repo_to_kb.items() if fname in remote_names}

print("=== Remote (GitHub) KB check ===")
print(f"Target repos: {TARGET_REPOS}")
print(f"Known KB mapping (default_kb_name_map): {default_kb_name_map()}")
print(f"KB required (resolved): {repo_to_kb}")
print(f"KB present on remote: {list(present_remote.values())}")
print(f"KB missing on remote: {list(missing_remote.values())}")

# 3) If they exist on the remote, make sure we have them locally and print the paths
downloaded_rows = []
if present_remote:
    # download/validate only those present on the remote
    present_repos = list(present_remote.keys())
    downloaded_rows = fetch_target_kbs(CFG, present_repos)

# 4) Final summary: for each target repo, report whether it is ready for retrieval
print("\n=== Local cache check ===")
ready, not_ready = [], []
rows_by_repo = {row["repo_name"]: row for row in downloaded_rows}

for repo in TARGET_REPOS:
    kb_fname = repo_to_kb.get(repo)
    remote_ok = kb_fname in remote_names
    row = rows_by_repo.get(repo)
    local_ok = bool(row and row["exists"] and (row["size_bytes"] or 0) > 0)
    local_path = Path(row["local_path"]).resolve() if row else None

    status = {
        "repo": repo,
        "kb_filename": kb_fname,
        "remote_found": remote_ok,
        "local_found": local_ok,
        "local_path": str(local_path) if local_path else None,
        "size_bytes": (row["size_bytes"] if row else None),
    }
    if remote_ok and local_ok:
        ready.append(status)
    else:
        not_ready.append(status)

# Print details
print("Ready for retrieval:")
for s in ready:
    print(f" - {s['repo']}: {s['kb_filename']} -> {s['local_path']} (size={s['size_bytes']})")

print("Not ready:")
for s in not_ready:
    print(f" - {s['repo']}: remote_found={s['remote_found']} local_found={s['local_found']} kb_filename={s['kb_filename']} local_path={s['local_path']}")

# Brief summary
print("\nSummary:")
print(f"  Target repos            : {len(TARGET_REPOS)}")
print(f"  Remote KB available     : {len(present_remote)}")
print(f"  Remote KB missing       : {len(missing_remote)}")
print(f"  Locally ready (download): {len(ready)}")
print(f"  Locally missing         : {len(not_ready)}")


Fetching KB files from GitHub…
=== Remote (GitHub) KB check ===
Target repos: ['seed-emulator', 'pyscf']
Known KB mapping (default_kb_name_map): {'seed-emulator': 'kb_seed-labs__seed-emulator.json', 'pyscf': 'kb_pyscf__pyscf.json', 'pyscf__pyscf': 'kb_pyscf__pyscf.json'}
KB required (resolved): {'seed-emulator': 'kb_seed-labs__seed-emulator.json', 'pyscf': 'kb_pyscf__pyscf.json'}
KB present on remote: ['kb_seed-labs__seed-emulator.json', 'kb_pyscf__pyscf.json']
KB missing on remote: []

=== Local cache check ===
Ready for retrieval:
 - seed-emulator: kb_seed-labs__seed-emulator.json -> C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\temp_downloaded_kbs\kb_seed-labs__seed-emulator.json (size=885251)
 - pyscf: kb_pyscf__pyscf.json -> C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\temp_downloaded_kbs\kb_pyscf__pyscf.json (size=8315405)
Not ready:

Summary:
  Target repos            : 2
  Remote KB available     : 2
  Remote KB missing       : 0
  Locally ready (download): 2
  Locally m
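
The alias resolution and readiness check above boil down to a lookup plus a file-size test. The sketch below reuses the alias mapping and base URL printed by the cells above; `sketch_resolve`, `sketch_status`, and the `kb_<alias>.json` fallback for unknown aliases are hypothetical and are not the behavior of `resolve_kb_names_for_repos`.

```python
from pathlib import Path
from typing import Dict, List

# Alias -> KB file name, copied from the mapping printed by the cell above.
KB_NAME_MAP: Dict[str, str] = {
    "seed-emulator": "kb_seed-labs__seed-emulator.json",
    "pyscf": "kb_pyscf__pyscf.json",
    "pyscf__pyscf": "kb_pyscf__pyscf.json",
}
BASE_URL = (
    "https://raw.githubusercontent.com/PatrizioAcquadro/RAG_Project_SE2/"
    "main/knowledge_bases_prod"
)


def sketch_resolve(repos: List[str]) -> Dict[str, str]:
    """Map each repo alias to its KB file name.

    Assumption: unknown aliases fall back to 'kb_<alias>.json'.
    """
    return {r: KB_NAME_MAP.get(r, f"kb_{r}.json") for r in repos}


def sketch_status(repo: str, fname: str, local_dir: Path) -> Dict[str, object]:
    """Report the remote URL and whether a non-empty local copy exists."""
    local_path = local_dir / fname
    local_ok = local_path.is_file() and local_path.stat().st_size > 0
    return {"repo": repo, "url": f"{BASE_URL}/{fname}", "local_found": local_ok}


for repo, fname in sketch_resolve(["seed-emulator", "pyscf"]).items():
    print(sketch_status(repo, fname, Path("./temp_downloaded_kbs")))
```

Mapping both `pyscf` and `pyscf__pyscf` to the same file is what lets the dataset cells and the KB cells use different aliases for the same library.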

## Section 4: RETRIEVAL

### 4.1 -> BM25

In [172]:
# ============================================================
# BM25: filtered run on seed-emulator / pyscf + normalized JSONL export
# ============================================================
from pathlib import Path
from BM25.pipeline import run_bm25_and_export

# Assumption: lca_dataset_split is already defined (filtered or full dataset)
# If it is not, load it first (e.g., with the dataset cell written earlier)

TARGET_REPOS = ["seed-emulator", "pyscf__pyscf"]  # use "pyscf" if that is the alias in your dataset
TOP_K = 3

summary = run_bm25_and_export(
    dataset=lca_dataset_split,
    target_repos=TARGET_REPOS,
    top_k=TOP_K,
    out_dir=Path("outputs/retrieval"),
    bm25_k1=1.5,
    bm25_b=0.75,
    overwrite_kb_download=False,
    max_samples=None,       # set an int for quick tests
    force_rebuild=False,
)

print("BM25 summary:")
for k, v in summary.items():
    print(f"- {k}: {v}")


  Cache hit: found existing results for K=3 at 'BM25\retrieved_k3_samples.json'.
BM25 summary:
- num_queries_payload: 150
- num_kbs_payload: 62
- raw_json_path: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\BM25\retrieved_k3_samples.json
- normalized_jsonl_path: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\bm25_topk_k3.jsonl
- exported_queries: 0
- k: 3
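
For readers unfamiliar with the `bm25_k1` and `bm25_b` parameters passed above, the following is a self-contained sketch of Okapi BM25 scoring. It is not the code inside `BM25.pipeline`, and it uses one common smoothed IDF variant that may differ from what the `rank-bm25` package computes. The toy corpus is invented for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized doc against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps term-frequency gain; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "create base layer and add autonomous system".split(),
    "run scf calculation on a molecule".split(),
    "add ebgp private peering between autonomous systems".split(),
]
query = "autonomous system base layer".split()
scores = bm25_scores(query, docs)
top_ids = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]
print(top_ids, [round(scores[i], 3) for i in top_ids])
```

With `b=0.75`, longer documents are penalized relative to the average length; setting `b=0` would disable length normalization entirely.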


### 4.2 -> CodeBERT Embedding & Cosine Similarity

In [173]:
# ============================
# COSINE: batch run
# ============================
from importlib import reload
from pathlib import Path

import COSINE.batch as cosine_batch; reload(cosine_batch)

# Main parameters
TARGET_REPOS = ["seed-emulator", "pyscf", "pyscf__pyscf"]
TOP_K = 3
MODEL_NAME = "microsoft/codebert-base"
BATCH_SIZE = 32
RRF_K = 60
BM25_K1 = 1.5
BM25_B  = 0.75
MAX_SAMPLES = None  # use None for all queries already filtered in your lca_dataset_split

# Run
result = cosine_batch.run_cosine_retrieval(
    dataset=lca_dataset_split,
    target_repos=TARGET_REPOS,
    top_k=TOP_K,
    model_name=MODEL_NAME,
    batch_size=BATCH_SIZE,
    rrf_k=RRF_K,
    bm25_k1=BM25_K1,
    bm25_b=BM25_B,
    max_samples=MAX_SAMPLES,
    out_dir=Path("outputs/retrieval"),
)




[COSINE] queries selezionate: 23 | k=3


COSINE: 100%|██████████| 23/23 [00:47<00:00,  2.09s/it]

[COSINE] Saved files:
 - RAW          : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\cosine_raw_k3.json
 - JSONL        : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\cosine_topk_k3.jsonl
 - JSON (array) : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\cosine_topk_k3.json
[COSINE] Exported queries: 23 | k = 3 | metric = cosine
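
The `metric = cosine` in the log refers to ranking documents by the cosine of the angle between query and document embeddings. A minimal sketch with toy three-dimensional vectors follows; how `COSINE.batch` pools CodeBERT outputs into a single vector is not shown in this notebook, so the vectors here are invented and the helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_by_cosine(query_vec: Sequence[float],
                    doc_vecs: List[Sequence[float]],
                    k: int = 3) -> List[Tuple[int, float]]:
    """Return (doc_index, similarity) pairs for the k most similar docs."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; a real encoder produces far larger vectors.
doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
query_vec = [1.0, 0.2, 0.0]
top_hits = top_k_by_cosine(query_vec, doc_vecs, k=3)
print(top_hits)
```

Because cosine similarity ignores vector length, it compares direction only, which is why it is a common default for embedding-based retrieval.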





### 4.3 -> Hybrid Approach: Cosine Similarity & BM25

In [175]:
# ============================
# HYBRID: batch run
# ============================
from importlib import reload
from pathlib import Path

import HYBRID.batch as hybrid_batch; reload(hybrid_batch)

# Main parameters
TARGET_REPOS = ["seed-emulator", "pyscf", "pyscf__pyscf"]
TOP_K = 3
MODEL_NAME = "microsoft/codebert-base"
BATCH_SIZE = 32
RRF_K = 60
BM25_K1 = 1.5
BM25_B  = 0.75
MAX_SAMPLES = None  # None => all filtered queries

# Run
result = hybrid_batch.run_hybrid_retrieval(
    dataset=lca_dataset_split,
    target_repos=TARGET_REPOS,
    top_k=TOP_K,
    model_name=MODEL_NAME,
    batch_size=BATCH_SIZE,
    rrf_k=RRF_K,
    bm25_k1=BM25_K1,
    bm25_b=BM25_B,
    max_samples=MAX_SAMPLES,
    out_dir=Path("outputs/retrieval"),
)



[HYBRID] queries selezionate: 23 | k=3


HYBRID: 100%|██████████| 23/23 [00:46<00:00,  2.02s/it]

[HYBRID] Saved files:
 - RAW          : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\hybrid_raw_k3.json
 - JSONL        : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\hybrid_topk_k3.jsonl
 - JSON (array) : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\hybrid_topk_k3.json
[HYBRID] Exported queries: 23 | k = 3 | metric = hybrid
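
The `rrf_k=60` argument suggests that the two rankings are merged with Reciprocal Rank Fusion. Assuming that is the case, the sketch below shows the standard RRF rule, where each document earns `1 / (rrf_k + rank)` from every list it appears in. `rrf_fuse` and the document ids are hypothetical; this is not the `HYBRID.batch` implementation.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], rrf_k: int = 60, top_k: int = 3) -> List[str]:
    """Fuse ranked doc-id lists: each doc earns 1 / (rrf_k + rank) per list."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (rrf_k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    ordered = sorted(fused, key=lambda d: (-fused[d], d))
    return ordered[:top_k]


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
cosine_ranking = ["doc_b", "doc_d", "doc_a"]
fused_top = rrf_fuse([bm25_ranking, cosine_ranking])
print(fused_top)
```

RRF works on ranks rather than raw scores, so BM25 scores and cosine similarities never need to be put on a common scale.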





### 4.4 Multi-Hop: Strategy

#### 4.4.0 Multi-Hop Preparation

In [176]:
# === Cell 1: Imports / reloads ===
from importlib import reload
import os, json, time, pandas as pd

# Project helpers and multihop engine
import S6.mh_helpers as MH; reload(MH)
from S6.mh_helpers import discover_libs, filters_for, resolve_instructions  # add resolve_instructions
import S6.multihop as mh; reload(mh)

from models.retrieval_providers import make_provider  # retriever factory
from data.datasets import load_lca_dataset            # HF dataset loader for 'instruction'

In [177]:
# === Cell 2: Variables / config (robust version) ===

# Defensive import: make sure discover_libs / filters_for are available even if Cell 1 wasn't run
try:
    discover_libs  # type: ignore[name-defined]
    filters_for    # type: ignore[name-defined]
    resolve_instructions  # type: ignore[name-defined]
except NameError:
    from importlib import reload
    import sys, os
    ROOT = os.path.abspath(".")
    if ROOT not in sys.path:
        sys.path.insert(0, ROOT)
    import S6.mh_helpers as MH; reload(MH)
    from S6.mh_helpers import discover_libs, filters_for, resolve_instructions

# --- SINGLE SWITCH for instruction selection ---
# Choices:
#   "custom" | "all" | 5 (random) | "idx:12" | [0, 7, 12]
INSTRUCTION_SELECTION = "idx:0"     # change just this one knob

# Load dataset only if needed (internet required for the first HF download)
DS = load_lca_dataset(split="test") if INSTRUCTION_SELECTION != "custom" else None

# Materialize questions according to the single switch
# When selection == "custom", we feed your two hand-written questions (below)
TEST_SET = resolve_instructions(
    INSTRUCTION_SELECTION,
    dataset=DS,
    custom=[
        "How does the library create a BIP39 mnemonic and derive a seed?",
        "Where is BIP32 key derivation implemented?",
    ],
    seed=42,
)
print(f"[TEST] mode={TEST_SET['mode']}  #items={len(TEST_SET['items'])}")

# KB (libs) parameters
KB_DIR = "temp_downloaded_kbs"

# Discover and index ALL libraries under KB_DIR (always index everything)
LIBS = discover_libs(KB_DIR)

# Runtime selection:
# None = open-world; str = single repo; list[str] = multiple repos
SELECTED_LIBS = None                     # default: search across all indexed libs
F = filters_for(SELECTED_LIBS)           # None | {"repo": str} | {"repo": [..,..]}
REPO_FILTER = SELECTED_LIBS              # back-compat alias

print(f"Discovered {len(LIBS)} libs")
print("Sample:", LIBS[:5])

# Retriever selector
RETRIEVER = globals().get("RETRIEVER", "bm25")  # "bm25" | "knn" | "hybrid"

# LLM settings
BACKEND = "gemini"                        # "gemini" | "openrouter" | "local"
MODEL   = "gemini-2.0-flash"              # or "deepseek/deepseek-chat-v3.1:free" | "cache/Qwen_Qwen2.5-Coder-7B-Instruct_4bit_nf4"

# Evaluation & output
top_k      = 5
CSV_DIR    = "results/multihop_csv"
COMPARE_FN = "multihop_compare.csv"
IMPACT_FN  = "multihop_topk_impact.csv"
PLANNER_CTX_DOCS  = 3
PLANNER_CTX_CHARS = 500
PLAN_MAX_TOKENS   = 64
DECOMP_MAX_TOKENS = 96
SHOW_TOP          = min(5, top_k)

# Questions used across runs (kept for readability; selection is driven by TEST_SET)
Q_DECOMP = "How does the library create a BIP39 mnemonic and derive a seed?"
Q_IR     = "Where is BIP32 key derivation implemented?"

# Output scores normalized (for better visualization)
MH.NORMALIZE_SCORES = True
MH.NORMALIZE_MAX    = 100.0

# --- Provider built via your retriever switch; always index the full KB for BM25 ---
if RETRIEVER == "bm25":
    provider = make_provider(
        "bm25",
        local_kb_dir=KB_DIR,
        library_filter=LIBS,
        tokenizer=None,  # or pass your bm25 tokenizer
    )
elif RETRIEVER == "knn":
    # Requires retrievers/KNN.py implemented + a built index
    provider = make_provider(
        "knn",
        index_dir="indexes/knn",       # adjust to your KNN index path
        index_type="hnsw",             # or "flat"
        ef_search=128,
        nprobe=16,
        # embedder=...                 # optional: your SentenceTransformer etc.
    )
elif RETRIEVER == "hybrid":
    # Requires retrievers/HYBRID.py and KNN available
    provider = make_provider(
        "hybrid",
        bm25_kwargs=dict(local_kb_dir=KB_DIR, library_filter=LIBS),
        knn_kwargs=dict(index_dir="indexes/knn", index_type="hnsw", ef_search=128, nprobe=16),
        fusion="rrf", alpha=0.5, rrf_k=60,
    )
else:
    raise ValueError(f"Unknown RETRIEVER: {RETRIEVER}")


  Loading dataset 'JetBrains-Research/lca-library-based-code-generation' (split='test')…
Dataset loaded with 150 examples across all libraries.
[TEST] mode=dataset  #items=1
Discovered 62 libs
Sample: ['1200wd__bitcoinlib', 'aidasoft__dd4hep', 'ansys__pyaedt', 'ansys__pydpf-core', 'ansys__pymapdl']
[INFO] retrieval_providers: BM25 indexed docs: 133530 across 62 libs (local)
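
The single `INSTRUCTION_SELECTION` switch accepts several shapes (`"custom"`, `"all"`, an integer for a random sample, `"idx:N"`, or a list of indices). A minimal sketch of how such a switch could be dispatched is shown below. `sketch_resolve_instructions` is hypothetical and only mimics the return shape seen above (`mode` and `items`, each item carrying an `instruction`); the real `resolve_instructions` may differ.

```python
import random
from typing import Dict, List, Optional, Sequence, Union

Selection = Union[str, int, List[int]]


def sketch_resolve_instructions(
    selection: Selection,
    dataset: Optional[Sequence[Dict]] = None,
    custom: Optional[List[str]] = None,
    seed: int = 42,
) -> Dict[str, object]:
    """Turn the single selection switch into {'mode': ..., 'items': [...]}."""
    if selection == "custom":
        return {"mode": "custom", "items": [{"instruction": q} for q in custom or []]}
    if dataset is None:
        raise ValueError("a dataset is required unless selection == 'custom'")
    if selection == "all":
        indices = list(range(len(dataset)))
    elif isinstance(selection, int):
        # An integer n means: draw n distinct rows, reproducibly.
        indices = random.Random(seed).sample(range(len(dataset)), selection)
    elif isinstance(selection, str) and selection.startswith("idx:"):
        indices = [int(selection.split(":", 1)[1])]
    elif isinstance(selection, list):
        indices = list(selection)
    else:
        raise ValueError(f"unsupported selection: {selection!r}")
    items = [{"instruction": dataset[i]["instruction"]} for i in indices]
    return {"mode": "dataset", "items": items}


toy_ds = [{"instruction": f"task {i}"} for i in range(10)]
picked = sketch_resolve_instructions("idx:0", dataset=toy_ds)
print(picked["mode"], len(picked["items"]))
```

Keeping the selection in one variable means the later cells can iterate over `TEST_SET["items"]` without knowing how the questions were chosen.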


In [178]:
# === Cell 3: Detailed multihop run, compare CSV, compact table preview ===
# Prints: initial question, model-generated (sub)queries, FINAL top-k, and per-(sub)query top-k.
# Appends rows to results/multihop_csv/multihop_compare.csv and previews a compact table.

def _filters():
    """Return the normalized filters dict computed in Cell 2."""
    return F

def _per_query_print(per_q_hits: dict, show_top: int):
    """Pretty-print per-query compact hits (doc_id, score, repo)."""
    if not per_q_hits:
        print("  (no per-query results)")
        return
    for i, (q, hits) in enumerate(per_q_hits.items()):
        print(f"  Q{i}: {q}")
        if not hits:
            print("    (no hits)")
            continue
        for j, h in enumerate(hits[:show_top], 1):
            print(f"    {j:>2}. doc_id={h['doc_id']}  score={h['score']:.3f}  [{h['repo']}]")

rows = []  # rows that will be written to compare CSV and shown in preview

def run_detailed_for_question(question: str):
    # ----- Run decomposition_first (detailed) -----
    pack1, hits1, cfg1, lat1 = MH.run_mode(
        provider=provider,
        question=question,
        strategy_key="decomposition_first",
        top_k=top_k,
        repo_filter=REPO_FILTER,                   # legacy arg; normalized internally
        backend=BACKEND,
        model=MODEL,
        cache_dir=".cache/mh_detailed_decomposition_first",
        decomposer_max_tokens=DECOMP_MAX_TOKENS,
    )

    MH.print_header("Decomposition-first", question, cfg1, _filters())
    print("\nFinal top-k (after the strategy):")
    MH.print_hits(hits1, top_k)
    print(f"Latency: {lat1:.0f} ms | Total final hits: {len(hits1)}")

    # Per-subquery: re-run retrieval for each subquery (same provider/search) and print compact lists
    subs1 = MH.extract_subqueries(pack1)
    print("\n--- Per-subquery retrieval (top_k each) ---")
    per_q_hits_1 = MH.per_query_hits_dict(subs1, provider, top_k, _filters())
    _per_query_print(per_q_hits_1, SHOW_TOP)

    # Build + collect the compare row
    row1 = MH.build_compare_row(
        strategy="decomposition_first", backend=BACKEND, model=MODEL, retriever=RETRIEVER,
        cfg=cfg1, repo_filter=REPO_FILTER, latency_ms=lat1,
        hits=hits1, question=question, kb_dir=KB_DIR, libs=LIBS,
        jaccard_vs_baseline=None,
    )
    row1["subqueries"] = " || ".join(subs1)
    rows.append(row1)

    # ----- Run iterative_refine (detailed) -----
    pack2, hits2, cfg2, lat2 = MH.run_mode(
        provider=provider,
        question=question,
        strategy_key="iterative_refine",
        top_k=top_k,
        repo_filter=REPO_FILTER,
        backend=BACKEND,
        model=MODEL,
        cache_dir=".cache/mh_detailed_iterative_refine",
        planner_ctx_docs=PLANNER_CTX_DOCS,
        planner_ctx_chars=PLANNER_CTX_CHARS,
        planner_max_tokens=PLAN_MAX_TOKENS,
    )

    MH.print_header("Iterative-refine", question, cfg2, _filters())
    print("\nFinal top-k (after the strategy):")
    MH.print_hits(hits2, top_k)
    print(f"Latency: {lat2:.0f} ms | Total final hits: {len(hits2)}")

    # Progression: ensure initial question appears first if not already in subqueries
    subs2 = MH.extract_subqueries(pack2)
    if not subs2 or subs2[0] != question:
        subs2 = [question] + subs2

    print("\n--- Per-query retrieval progression (top_k each) ---")
    per_q_hits_2 = MH.per_query_hits_dict(subs2, provider, top_k, _filters())
    _per_query_print(per_q_hits_2, SHOW_TOP)

    # Build + collect the compare row (compute Jaccard vs first run as a quick diff signal)
    jacc = MH.jaccard_hits(hits2, hits1) if hits1 else None
    row2 = MH.build_compare_row(
        strategy="iterative_refine", backend=BACKEND, model=MODEL, retriever=RETRIEVER,
        cfg=cfg2, repo_filter=REPO_FILTER, latency_ms=lat2,
        hits=hits2, question=question, kb_dir=KB_DIR, libs=LIBS,
        jaccard_vs_baseline=jacc,
    )
    row2["subqueries"] = " || ".join(subs2)
    rows.append(row2)

# ---- Drive it using the single selection (NO dependency on Q_DECOMP/Q_IR) ----
for item in TEST_SET["items"]:
    run_detailed_for_question(item["instruction"])

# Persist both compare rows
compare_csv_path = MH.save_compare_csv(CSV_DIR, rows, filename=COMPARE_FN)
print("\nCompare CSV appended:", compare_csv_path)

# ---- Preview the compact comparison table (final-result metrics only) ----
import pandas as _pd
df_cmp = _pd.DataFrame(rows)
cols = ["timestamp_iso","strategy","backend","model","retriever","k_final","latency_ms",
        "unique_repos","diversity","mean_score","max_score","jaccard_vs_baseline"]
print("\nCompact comparison table (preview):")
display(df_cmp[cols].sort_values(["strategy","backend","model"]))



=== Decomposition-first ===
Question           : Generate code that creates an emulation using the seedemu library. The emulation should include three layers: base, routing, and eBGP. It should also include a domain name caching service. 

The base layer should create multiple autonomous systems and internet exchanges. Each autonomous system should have multiple hosts and a router. The hosts and the router should join a network within the autonomous system and the router should also join an internet exchange. 

The domain name caching service should be installed on specific hosts within the autonomous systems and bindings should be added for these installations. 

The eBGP layer should add private peerings between different autonomous systems. 

Finally, all the layers and the domain name caching service should be added to the emulator and the state of the emulator should be dumped to a binary file.
Strategy           : decomposition_first
k_sub / k_final    : 2 / 5
max_hops          

|  | timestamp_iso | strategy | backend | model | retriever | k_final | latency_ms | unique_repos | diversity | mean_score | max_score | jaccard_vs_baseline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-09-20T12:21:10 | decomposition_first | gemini | gemini-2.0-flash | bm25 | 5 | 275.0 | 1 | 0.2 | 87.81 | 100.0 |  |
| 1 | 2025-09-20T12:21:23 | iterative_refine | gemini | gemini-2.0-flash | bm25 | 5 | 7211.0 | 1 | 0.2 | 89.27 | 100.0 | 0.0 |
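
The `diversity`, `max_score`, and `jaccard_vs_baseline` columns can be read as follows. The values in the table are consistent with diversity being `unique_repos / k_final` (1 / 5 = 0.2) and with scores rescaled so the best hit equals `NORMALIZE_MAX`, but that reading is an assumption about `S6.mh_helpers`, and the three functions below are hypothetical re-implementations over hits shaped like the `doc_id` / `score` / `repo` dictionaries printed by `_per_query_print`.

```python
from typing import Dict, List


def jaccard(hits_a: List[Dict], hits_b: List[Dict]) -> float:
    """Jaccard overlap of the two doc-id sets (0.0 when both are empty)."""
    ids_a = {h["doc_id"] for h in hits_a}
    ids_b = {h["doc_id"] for h in hits_b}
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


def diversity(hits: List[Dict]) -> float:
    """Distinct repos divided by the number of hits."""
    return len({h["repo"] for h in hits}) / len(hits) if hits else 0.0


def normalize_scores(hits: List[Dict], max_value: float = 100.0) -> List[Dict]:
    """Rescale scores so that the best hit gets max_value."""
    best = max((h["score"] for h in hits), default=0.0)
    if best <= 0:
        return [dict(h) for h in hits]
    return [{**h, "score": h["score"] / best * max_value} for h in hits]


# Two invented runs that share a repo but no documents.
run_a = [{"doc_id": f"a{i}", "repo": "seed-emulator", "score": 10.0 - i} for i in range(5)]
run_b = [{"doc_id": f"b{i}", "repo": "seed-emulator", "score": 8.0 - i} for i in range(5)]
print(jaccard(run_b, run_a), diversity(run_a), normalize_scores(run_a)[0]["score"])
```

A Jaccard value of 0.0, as in the second row of the table, means the two strategies returned entirely disjoint sets of documents for the same question.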


In [179]:
# === Cell 4: top_k summary sweep (final-only), impact CSV, compact comparison table preview ===
# Focuses on final results and compact metrics across k values.

TOPK_GRID = [3, 5, 8]
SUMMARY_STRATEGIES = ["decomposition_first", "iterative_refine"]

def _filters():
    """Return the normalized filters dict computed in Cell 2."""
    return F

rows_summary = []   # for compact preview (final-only)
impact_paths = []   # each row saved to top_k impact CSV

def run_summary_for_question(question: str):
    for strat in SUMMARY_STRATEGIES:
        for k_val in TOPK_GRID:
            pack, hits, cfg, latency_ms = MH.run_mode(
                provider=provider,
                question=question,
                strategy_key=strat,
                top_k=k_val,
                repo_filter=REPO_FILTER,   # legacy arg; normalized internally
                backend=BACKEND,
                model=MODEL,
                cache_dir=f".cache/mh_topk_summary_{strat}_k{k_val}",
            )

            # Build a compact compare row (final metrics only)
            row = MH.build_compare_row(
                strategy=strat, backend=BACKEND, model=MODEL, retriever=RETRIEVER,
                cfg=cfg, repo_filter=REPO_FILTER, latency_ms=latency_ms,
                hits=hits, question=question, kb_dir=KB_DIR, libs=LIBS,
                jaccard_vs_baseline=None,
            )
            row["subqueries"] = " || ".join(MH.extract_subqueries(pack))
            rows_summary.append(row)

            # Save an impact row (keeps per-query hits in CSV for offline analysis)
            subs = MH.extract_subqueries(pack)
            if strat == "iterative_refine" and (not subs or subs[0] != question):
                subs = [question] + subs
            per_q_hits = MH.per_query_hits_dict(subs, provider, k_val, _filters())
            path = MH.save_topk_impact_csv(
                CSV_DIR,
                strategy=strat, backend=BACKEND, model=MODEL, retriever=RETRIEVER,
                top_k=k_val, question=question, repo_filter=REPO_FILTER,
                subqueries=subs, per_query_hits=per_q_hits,
                pack_hits=hits, latency_ms=latency_ms, kb_dir=KB_DIR, libs=LIBS,
                filename=IMPACT_FN,
            )
            impact_paths.append(path)

# Drive the sweep using the same selection (NO dependency on Q_DECOMP/Q_IR)
for item in TEST_SET["items"]:
    run_summary_for_question(item["instruction"])

# Also append these rows to the compare CSV for unified analysis
compare_csv_path = MH.save_compare_csv(CSV_DIR, rows_summary, filename=COMPARE_FN)
print("Compare CSV appended:", compare_csv_path)
print("Impact CSV appended:", impact_paths[-1] if impact_paths else "(none)")

# ---- Preview the compact comparison table (final-only metrics) ----
import pandas as _pd
df_cmp_summary = _pd.DataFrame(rows_summary)
cols = ["timestamp_iso","strategy","backend","model","retriever","k_final","latency_ms",
        "unique_repos","diversity","mean_score","max_score","jaccard_vs_baseline"]
print("\nCompact comparison table (preview):")
display(df_cmp_summary[cols].sort_values(["strategy","k_final","backend","model"]))

Compare CSV appended: results/multihop_csv\multihop_compare.csv
Impact CSV appended: results/multihop_csv\multihop_topk_impact.csv

Compact comparison table (preview):


| | timestamp_iso | strategy | backend | model | retriever | k_final | latency_ms | unique_repos | diversity | mean_score | max_score | jaccard_vs_baseline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-09-20T12:21:23 | decomposition_first | gemini | gemini-2.0-flash | bm25 | 3 | 280.0 | 1 | 0.3333 | 84.8 | 100.0 | |
| 1 | 2025-09-20T12:21:24 | decomposition_first | gemini | gemini-2.0-flash | bm25 | 5 | 282.0 | 1 | 0.2 | 87.81 | 100.0 | |
| 2 | 2025-09-20T12:21:24 | decomposition_first | gemini | gemini-2.0-flash | bm25 | 8 | 638.0 | 1 | 0.125 | 90.5 | 100.0 | |
| 3 | 2025-09-20T12:21:32 | iterative_refine | gemini | gemini-2.0-flash | bm25 | 3 | 7060.0 | 1 | 0.3333 | 97.27 | 100.0 | |
| 4 | 2025-09-20T12:21:45 | iterative_refine | gemini | gemini-2.0-flash | bm25 | 5 | 7218.0 | 1 | 0.2 | 92.24 | 100.0 | |
| 5 | 2025-09-20T12:21:58 | iterative_refine | gemini | gemini-2.0-flash | bm25 | 8 | 7318.0 | 1 | 0.125 | 95.02 | 100.0 | |
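
The `diversity` values in the preview are consistent with `unique_repos / k_final` (1/3 ≈ 0.3333, 1/5 = 0.2, 1/8 = 0.125), and `jaccard_vs_baseline` is empty because the sweep passes `jaccard_vs_baseline=None`. A minimal sketch of how both quantities could be computed from lists of repo names and doc ids follows. The function names and the list-of-ids representation are assumptions made for illustration; this is not the actual `MH.build_compare_row` implementation.

```python
from typing import Iterable, Optional

def repo_diversity(repos: Iterable[str], k_final: int) -> float:
    """Fraction of distinct repos among the k final hits (assumed definition)."""
    if k_final <= 0:
        return 0.0
    return round(len(set(repos)) / k_final, 4)

def jaccard(a: Iterable[str], b: Iterable[str]) -> Optional[float]:
    """Jaccard overlap |A & B| / |A | B| between two collections of doc ids."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return None  # undefined when both sets are empty
    return len(sa & sb) / len(union)

# Three hits from a single repo, as in the k=3 rows of the table
print(repo_diversity(["pyscf"] * 3, 3))
# Overlap between a hypothetical baseline hit list and a multi-hop hit list
print(jaccard(["d1", "d2", "d3"], ["d2", "d3", "d4"]))
```

Passing the result of `jaccard(...)` as `jaccard_vs_baseline` would fill the empty column, provided a baseline hit list is available for the same question and `k`.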


In [180]:
# === Cell 5: Single-mode multi-hop (no visualization) ===
from importlib import reload
import S6.mh_helpers as MH; reload(MH)
from S6.mh_helpers import discover_libs, filters_for, extract_subqueries, snippets_from_hits
from models.retrieval_providers import make_provider

# --------------------
# Config (edit as needed)
# --------------------
KB_DIR      = "temp_downloaded_kbs"
RETRIEVER   = "bm25"                  # "bm25" | "knn" | "hybrid"
STRATEGY    = "iterative_refine"      # "decomposition_first" | "iterative_refine"
QUESTION    = "Replace me with the instruction you want to run"
TOP_K       = 5

# LLM settings (used only by multi-hop planning/decomposition)
BACKEND     = "gemini"                # "gemini" | "openrouter" | "local"
MODEL       = "gemini-2.0-flash"

# Strategy params
PLANNER_CTX_DOCS  = 3
PLANNER_CTX_CHARS = 500
PLAN_MAX_TOKENS   = 64
DECOMP_MAX_TOKENS = 96

# Library selection: None = all; or "lib_name"; or ["libA","libB"]
SELECTED_LIBS = None

# --------------------
# Build provider (indexes all KBs under KB_DIR)
# --------------------
LIBS = discover_libs(KB_DIR)
F = filters_for(SELECTED_LIBS)
provider = make_provider(RETRIEVER, local_kb_dir=KB_DIR, library_filter=LIBS)

# Ensure IR "final" uses last refined subquery (matches our earlier fix)
force_last = (STRATEGY == "iterative_refine")

# --------------------
# Run the single mode
# --------------------
pack, hits, cfg, latency_ms = MH.run_mode(
    provider=provider,
    question=QUESTION,
    strategy_key=STRATEGY,
    top_k=TOP_K,
    repo_filter=SELECTED_LIBS,                  # legacy alias supported by run_mode
    backend=BACKEND,
    model=MODEL,
    cache_dir=f".cache/mh_single_{STRATEGY}",
    planner_ctx_docs=PLANNER_CTX_DOCS,
    planner_ctx_chars=PLANNER_CTX_CHARS,
    planner_max_tokens=PLAN_MAX_TOKENS,
    decomposer_max_tokens=DECOMP_MAX_TOKENS,
    force_last_subquery_final=force_last,
)

# --------------------
# Extract ALL top_k snippets (no truncation)
# --------------------
SNIPPETS = snippets_from_hits(hits, provider, dedupe=True)

# Lean trace; keep for reproducibility (optional)
print(f"[MH] strategy={STRATEGY} k={TOP_K} libs={'all' if SELECTED_LIBS is None else SELECTED_LIBS} "
      f"latency_ms={int(latency_ms)}")
print(f"[MH] subqueries: {extract_subqueries(pack)}")
print(f"[MH] final_query: {pack.get('meta',{}).get('final_query_text')}")
print(f"[MH] snippets_used={len(SNIPPETS)}")


[INFO] retrieval_providers: BM25 indexed docs: 133530 across 62 libs (local)
[MH] strategy=iterative_refine k=5 libs=all latency_ms=2290
[MH] subqueries: ['Replace me with the instruction you want to run', 'kivy BuilderException "not isinstance(instr, Instruction)" Factory.get', 'kivy Factory.get Instruction BuilderException canvas.clear']
[MH] final_query: kivy Factory.get Instruction BuilderException canvas.clear
[MH] snippets_used=5
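
In the trace above, `final_query` equals the last refined subquery, which is the behavior `force_last_subquery_final=True` is meant to enforce for `iterative_refine`. The selection rule can be sketched as below. `pick_final_query` is a hypothetical helper written for illustration; it is not part of `S6.mh_helpers`, whose internals are not shown here.

```python
from typing import List

def pick_final_query(question: str, subqueries: List[str], force_last: bool) -> str:
    """Return the query used for the final retrieval step (illustrative rule)."""
    cleaned = [s.strip() for s in subqueries if s and s.strip()]
    if force_last and cleaned:
        return cleaned[-1]   # the last refinement wins
    return question          # otherwise fall back to the original question

subs = ["original question", "first refinement", "second refinement"]
print(pick_final_query("original question", subs, force_last=True))
print(pick_final_query("original question", subs, force_last=False))
```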


#### 4.4.1 Multi-Hop: Decomposition

In [181]:
# ============================================================
# Multi-hop "decomposition_first" (top-3) on the 23 selected queries
# KBs restricted to: seed-emulator, pyscf (pyscf__pyscf)
# Output: outputs/multihop/mh_decomposition_topk_k3.{jsonl,json}
# ============================================================
import os, json
from pathlib import Path
from typing import Dict, Any, List, Union
from tqdm import tqdm

# Multi-hop / provider helpers
from importlib import reload
import S6.mh_helpers as MH; reload(MH)
from S6.mh_helpers import discover_libs, filters_for, snippets_from_hits
from models.retrieval_providers import make_provider

# --- Run parameters ---
strategy_key     = "decomposition_first"
top_k            = 3
retriever_name   = "bm25"  # "bm25" | "knn" | "hybrid"
kb_dir           = "temp_downloaded_kbs"
selected_libs    = ["seed-emulator", "pyscf", "pyscf__pyscf"]
backend          = "gemini"
model_name       = "gemini-2.0-flash"
cache_dir        = f".cache/mh_json_{strategy_key}_k{top_k}"
max_samples      = 23  # number of queries to process

# --- Output ---
OUT_DIR   = Path("outputs/multihop"); OUT_DIR.mkdir(parents=True, exist_ok=True)
OUT_JSONL = OUT_DIR / f"mh_decomposition_topk_k{top_k}.jsonl"
OUT_JSON  = OUT_DIR / f"mh_decomposition_topk_k{top_k}.json"

# --- Hit normalization (robust, compatible with earlier sections) ---
def norm_hit(hit: Union[Dict[str, Any], str], i: int) -> Dict[str, Any]:
    if isinstance(hit, str):
        return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": hit, "metadata": {}}
    if isinstance(hit, dict):
        doc_id = hit.get("doc_id") or hit.get("id") or hit.get("path") or f"doc_{i}"
        text   = hit.get("text") or hit.get("content") or hit.get("snippet") or ""
        score  = hit.get("score") or hit.get("similarity") or hit.get("bm25_score") or 0.0
        path   = hit.get("path")
        md     = hit.get("metadata") if isinstance(hit.get("metadata"), dict) else {}
        try: score = float(score)
        except Exception: score = 0.0
        return {"doc_id": doc_id, "score": score, "path": path, "text": text, "metadata": md}
    return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": str(hit), "metadata": {}}

def norm_repo_name(item: Dict[str, Any]) -> str:
    for k in ("repo_name", "repo_full_name"):
        if k in item and item[k]:
            return str(item[k]).strip().lower()
    return ""

def norm_instruction(item: Dict[str, Any]) -> str:
    for k in ("instruction", "query", "prompt"):
        if k in item and item[k]:
            return str(item[k])
    return ""

def norm_query_id(item: Dict[str, Any], idx: int) -> str:
    for k in ("id", "query_id", "qid"):
        if k in item and item[k]:
            return str(item[k])
    return f"{norm_repo_name(item) or 'repo'}__{idx:06d}"

# --- Provider restricted to the two KBs ---
# Note: the filter is passed both to the provider and to run_mode.
provider = make_provider(
    retriever_name,
    local_kb_dir=kb_dir,
    library_filter=selected_libs,
)

F = filters_for(selected_libs)

# --- Build the list of 23 examples from the dataset already filtered upstream ---
# lca_dataset_split must be defined (as in the previous sections).
data_iter = list(lca_dataset_split)[:max_samples]

# --- Execution and serialization ---
rows = []
for i, item in enumerate(tqdm(data_iter, desc=f"[{strategy_key}] building JSON")):
    q    = norm_instruction(item)
    repo = norm_repo_name(item)
    try:
        pack, hits, cfg, latency_ms = MH.run_mode(
            provider=provider,
            question=q,
            strategy_key=strategy_key,
            top_k=top_k,
            repo_filter=selected_libs,   # apply the filter here as well
            backend=backend,
            model=model_name,
            cache_dir=cache_dir,
            # Token/ctx knobs use MH.run_mode's internal defaults when omitted
        )
        # Extract the final snippets (dedupe=True to be safe)
        # NB: if hits already contain the full texts, that is fine; otherwise
        # snippets_from_hits fetches the contents from the provider.
        final_snippets = snippets_from_hits(hits, provider, dedupe=True)
        hits_norm = [norm_hit(h, j) for j, h in enumerate(final_snippets)]
        row = {
            "query_id": norm_query_id(item, i),
            "repo_name": repo,
            "instruction": q,
            "retrieval_method": f"multihop:{strategy_key}",
            "k": top_k,
            "retrieval_params": {
                "retriever": retriever_name,
                "backend": backend,
                "model": model_name,
                "repo_filter": selected_libs,
                "strategy": strategy_key,
            },
            "results": hits_norm
        }
    except Exception as e:
        row = {
            "query_id": norm_query_id(item, i),
            "repo_name": repo,
            "instruction": q,
            "retrieval_method": f"multihop:{strategy_key}",
            "k": top_k,
            "retrieval_params": {
                "retriever": retriever_name,
                "backend": backend,
                "model": model_name,
                "repo_filter": selected_libs,
                "strategy": strategy_key,
            },
            "results": [],
            "error": str(e),
        }
    rows.append(row)

# ---- Save JSONL + JSON (array) ----
with OUT_JSONL.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

with OUT_JSON.open("w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2, ensure_ascii=False)

print("Saved files:")
print(" - JSONL       :", OUT_JSONL.resolve())
print(" - JSON (array):", OUT_JSON.resolve())
print("Exported queries:", len(rows), "| k =", top_k, "| strategy =", strategy_key)


[INFO] retrieval_providers: BM25 indexed docs: 6618 across 1 libs (local)


[decomposition_first] building JSON: 100%|██████████| 23/23 [01:06<00:00,  2.89s/it]

Saved files:
 - JSONL       : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_decomposition_topk_k3.jsonl
 - JSON (array): C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_decomposition_topk_k3.json
Exported queries: 23 | k = 3 | strategy = decomposition_first
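
The export loop catches exceptions and still writes a row (with empty `results` and an `error` field), so `Exported queries: 23` does not say how many queries actually succeeded. A small read-back check such as the following can make that visible. It is a sketch that assumes the row keys used in the cell above (`query_id`, `k`, `results`, `error`); `summarize_export` and the synthetic rows are made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

def summarize_export(path: Path) -> Dict[str, Any]:
    """Count rows, failed rows, and rows with fewer than k results in a JSONL export."""
    rows: List[Dict[str, Any]] = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    failed = [r["query_id"] for r in rows if r.get("error")]
    short = [
        r["query_id"]
        for r in rows
        if not r.get("error") and len(r.get("results", [])) < r.get("k", 0)
    ]
    return {"total": len(rows), "failed": failed, "short": short}

# Tiny synthetic export (made-up rows, same keys as the cell above)
demo_rows = [
    {"query_id": "repo__000000", "k": 3, "results": [{}, {}, {}]},
    {"query_id": "repo__000001", "k": 3, "results": [{}]},
    {"query_id": "repo__000002", "k": 3, "results": [], "error": "timeout"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text("\n".join(json.dumps(r) for r in demo_rows), encoding="utf-8")
    summary = summarize_export(demo_path)
print(summary)
```

Pointing `summarize_export` at `OUT_JSONL` after the run would list the query ids that failed or returned fewer than `top_k` snippets.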





#### 4.4.2 Multi-Hop: iterative_refine

In [184]:
# ============================================================
# Multi-hop "iterative_refine" (top-3) on the 23 selected queries
# KBs restricted to: seed-emulator, pyscf (pyscf__pyscf)
# Output: outputs/multihop/mh_iterative_topk_k3.{jsonl,json}
# ============================================================
import os, json
from pathlib import Path
from typing import Dict, Any, List, Union
from tqdm import tqdm

# Multi-hop / provider helpers
from importlib import reload
import S6.mh_helpers as MH; reload(MH)
from S6.mh_helpers import discover_libs, filters_for, snippets_from_hits
from models.retrieval_providers import make_provider

# --- Run parameters ---
strategy_key     = "iterative_refine"
top_k            = 3
retriever_name   = "bm25"  # "bm25" | "knn" | "hybrid"
kb_dir           = "temp_downloaded_kbs"
selected_libs    = ["seed-emulator", "pyscf", "pyscf__pyscf"]
backend          = "gemini"
model_name       = "gemini-2.0-flash"
cache_dir        = f".cache/mh_json_{strategy_key}_k{top_k}"
max_samples      = 23  # number of queries to process

# --- Output ---
OUT_DIR   = Path("outputs/multihop"); OUT_DIR.mkdir(parents=True, exist_ok=True)
OUT_JSONL = OUT_DIR / f"mh_iterative_topk_k{top_k}.jsonl"
OUT_JSON  = OUT_DIR / f"mh_iterative_topk_k{top_k}.json"

# --- Hit normalization (as above) ---
def norm_hit(hit: Union[Dict[str, Any], str], i: int) -> Dict[str, Any]:
    if isinstance(hit, str):
        return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": hit, "metadata": {}}
    if isinstance(hit, dict):
        doc_id = hit.get("doc_id") or hit.get("id") or hit.get("path") or f"doc_{i}"
        text   = hit.get("text") or hit.get("content") or hit.get("snippet") or ""
        score  = hit.get("score") or hit.get("similarity") or hit.get("bm25_score") or 0.0
        path   = hit.get("path")
        md     = hit.get("metadata") if isinstance(hit.get("metadata"), dict) else {}
        try: score = float(score)
        except Exception: score = 0.0
        return {"doc_id": doc_id, "score": score, "path": path, "text": text, "metadata": md}
    return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": str(hit), "metadata": {}}

def norm_repo_name(item: Dict[str, Any]) -> str:
    for k in ("repo_name", "repo_full_name"):
        if k in item and item[k]:
            return str(item[k]).strip().lower()
    return ""

def norm_instruction(item: Dict[str, Any]) -> str:
    for k in ("instruction", "query", "prompt"):
        if k in item and item[k]:
            return str(item[k])
    return ""

def norm_query_id(item: Dict[str, Any], idx: int) -> str:
    for k in ("id", "query_id", "qid"):
        if k in item and item[k]:
            return str(item[k])
    return f"{norm_repo_name(item) or 'repo'}__{idx:06d}"

# --- Provider restricted to the two KBs ---
provider = make_provider(
    retriever_name,
    local_kb_dir=kb_dir,
    library_filter=selected_libs,
)

F = filters_for(selected_libs)

# --- Build the list of 23 examples from the dataset already filtered upstream ---
data_iter = list(lca_dataset_split)[:max_samples]

# --- Execution and serialization ---
rows = []
for i, item in enumerate(tqdm(data_iter, desc=f"[{strategy_key}] building JSON")):
    q    = norm_instruction(item)
    repo = norm_repo_name(item)
    try:
        pack, hits, cfg, latency_ms = MH.run_mode(
            provider=provider,
            question=q,
            strategy_key=strategy_key,
            top_k=top_k,
            repo_filter=selected_libs,
            backend=backend,
            model=model_name,
            cache_dir=cache_dir,
            # Force the LAST sub-query to be used as the final one (consistent with Cell 5)
            force_last_subquery_final=True,
        )
        final_snippets = snippets_from_hits(hits, provider, dedupe=True)
        hits_norm = [norm_hit(h, j) for j, h in enumerate(final_snippets)]
        row = {
            "query_id": norm_query_id(item, i),
            "repo_name": repo,
            "instruction": q,
            "retrieval_method": f"multihop:{strategy_key}",
            "k": top_k,
            "retrieval_params": {
                "retriever": retriever_name,
                "backend": backend,
                "model": model_name,
                "repo_filter": selected_libs,
                "strategy": strategy_key,
                "force_last_subquery_final": True,
            },
            "results": hits_norm
        }
    except Exception as e:
        row = {
            "query_id": norm_query_id(item, i),
            "repo_name": repo,
            "instruction": q,
            "retrieval_method": f"multihop:{strategy_key}",
            "k": top_k,
            "retrieval_params": {
                "retriever": retriever_name,
                "backend": backend,
                "model": model_name,
                "repo_filter": selected_libs,
                "strategy": strategy_key,
                "force_last_subquery_final": True,
            },
            "results": [],
            "error": str(e),
        }
    rows.append(row)

# ---- Save JSONL + JSON (array) ----
with OUT_JSONL.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

with OUT_JSON.open("w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2, ensure_ascii=False)

print("Saved files:")
print(" - JSONL       :", OUT_JSONL.resolve())
print(" - JSON (array):", OUT_JSON.resolve())
print("Exported queries:", len(rows), "| k =", top_k, "| strategy =", strategy_key)


[INFO] retrieval_providers: BM25 indexed docs: 6618 across 1 libs (local)


[iterative_refine] building JSON: 100%|██████████| 23/23 [06:10<00:00, 16.11s/it]

Saved files:
 - JSONL       : C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_iterative_topk_k3.jsonl
 - JSON (array): C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_iterative_topk_k3.json
Exported queries: 23 | k = 3 | strategy = iterative_refine
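
`norm_hit` selects the score with an `or` chain, so a legitimate score of `0.0` is treated as missing and the next key is tried instead. Where that distinction matters, an explicit `None` check avoids it. The helper below is a hypothetical alternative shown for comparison; it is not the code used in the runs above.

```python
from typing import Any, Dict, Iterable

def first_present(d: Dict[str, Any], keys: Iterable[str], default: float = 0.0) -> float:
    """Return the first value whose key exists and is not None, coerced to float."""
    for key in keys:
        value = d.get(key)
        if value is not None:
            try:
                return float(value)
            except (TypeError, ValueError):
                return default
    return default

hit = {"score": 0.0, "similarity": 0.87}

# The `or` chain skips the falsy 0.0 and falls through to "similarity"
or_chain = hit.get("score") or hit.get("similarity") or 0.0
# The explicit check keeps the 0.0 stored under "score"
explicit = first_present(hit, ("score", "similarity", "bm25_score"))
print(or_chain, explicit)
```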





## 5 Prompt Generation

### -> 5.0 BASELINE

In [185]:
from promptgen import generate_baseline_prompts
from pathlib import Path

dataset_iter = list(lca_dataset_split)  # the 23 pre-filtered queries

per_template_paths, aggregated_path, total = generate_baseline_prompts(
    dataset_like=dataset_iter,
    prompts_dir=Path("prompts"),
    out_dir=Path("outputs/prompts/baseline"),
    aggregate_filename="_all_baseline.jsonl",
    fail_fast=False,
)

print("\nPer-template files:")
for p in per_template_paths:
    print(" -", p.resolve())
print("Aggregated:", aggregated_path.resolve())
print("Total:", total)


[WARN] baseline_make_prompts qid=seed-emulator__000000: Retrieval file not found: BM25\retrieved_kGenerate code that creates an emulation using the seedemu library. The emulation should include three layers: base, routing, and eBGP. It should also include a domain name caching service. 

The base layer should create multiple autonomous systems and internet exchanges. Each autonomous system should have multiple hosts and a router. The hosts and the router should join a network within the autonomous system and the router should also join an internet exchange. 

The domain name caching service should be installed on specific hosts within the autonomous systems and bindings should be added for these installations. 

The eBGP layer should add private peerings between different autonomous systems. 

Finally, all the layers and the domain name caching service should be added to the emulator and the state of the emulator should be dumped to a binary file._samples.json.
First generate BM25/retr
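
In the warning above, the full instruction text appears in the file name at the position where the BM25 bundle used in 5.1.1 has `3` (`retrieved_k3_samples.json`). That suggests an instruction string reached the slot meant for `k` when the path was built, although the `promptgen` source is not shown here. A defensive path builder could reject such values early. The sketch below is an assumption-laden illustration: `retrieval_file_for` is a hypothetical helper, and the `BM25` directory and file name pattern are inferred from the warning and from the path used in 5.1.1.

```python
from pathlib import Path

def retrieval_file_for(k: int, base_dir: Path = Path("BM25")) -> Path:
    """Build the expected retrieval file path, rejecting non-integer k values."""
    if isinstance(k, bool) or not isinstance(k, int):
        raise TypeError(f"k must be an int, got {type(k).__name__}")
    if k <= 0:
        raise ValueError(f"k must be positive, got {k}")
    return base_dir / f"retrieved_k{k}_samples.json"

print(retrieval_file_for(3).name)

try:
    retrieval_file_for("Generate code that creates an emulation")  # wrong argument type
except TypeError as exc:
    print("rejected:", exc)
```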

### -> 5.1 Retrieval-augmented generation prompts

#### -- 5.1.1 BM25 prompts

In [186]:
# === RAG BM25 prompt builder (adapted to the meta+results bundle file) ===
import json
from pathlib import Path
from typing import Dict, Any, List

# --- Path to your BM25 file (bundle with meta+results) ---
BM25_BUNDLE_PATH = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\BM25\retrieved_k3_samples.json")

# --- Output ---
TOP_K_SNIPPETS = 3
METHOD_TAG     = "rag_bm25_top3"
OUT_DIR        = Path("outputs/prompts") / METHOD_TAG
AGG_JSONL      = OUT_DIR / f"RAG_{METHOD_TAG}.jsonl"

# Target repos (case-insensitive). If nothing matches, the filter is disabled automatically.
TARGET_REPOS = {"seed-emulator", "seed-labs__seed-emulator", "pyscf", "pyscf__pyscf"}

# ---------- Helpers ----------
def _norm_repo_name(d: Dict[str, Any]) -> str:
    for k in ("repo_name", "repo_full_name", "repo"):
        v = d.get(k)
        if v:
            return str(v).strip().lower()
    return ""

def _norm_instruction(d: Dict[str, Any]) -> str:
    for k in ("instruction", "query", "prompt"):
        v = d.get(k)
        if v:
            return str(v)
    return ""

def _norm_query_id(repo: str, idx: int) -> str:
    repo = (repo or "repo").lower()
    return f"{repo}__{idx:06d}"

def _norm_hit_from_topk(hit: Dict[str, Any], i: int) -> Dict[str, Any]:
    """
    Adapt one element of the bundle's 'topk' list:
      {rank, library_key, snippet, snippet_len}
    into a standardized result with doc_id/text/score/path.
    """
    rank = hit.get("rank", i+1)
    lib  = hit.get("library_key") or ""
    text = hit.get("snippet") or ""
    # Synthetic score: the lower the rank, the higher the score
    try:
        score = 1.0 / float(rank)
    except Exception:
        score = 0.0
    return {
        "doc_id": f"{lib}::rank{rank}",
        "score": score,
        "path": None,
        "text": text,
    }

def _take_top_k(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    if not isinstance(results, list):
        return []
    sorted_res = sorted(
        results,
        key=lambda x: (x.get("score") if isinstance(x.get("score"), (int, float)) else -1.0),
        reverse=True,
    )
    return sorted_res[:k]

# ---------- Load the BM25 bundle ----------
if not BM25_BUNDLE_PATH.exists():
    raise FileNotFoundError(f"BM25 file not found: {BM25_BUNDLE_PATH}")

bundle = json.loads(BM25_BUNDLE_PATH.read_text(encoding="utf-8"))
items  = bundle.get("results") or []

rows_raw: List[Dict[str, Any]] = []
for item in items:
    repo = (_norm_repo_name(item) or item.get("repo_full_name") or "").lower()
    instr = _norm_instruction(item)
    idx = int(item.get("idx", len(rows_raw)))
    qid = _norm_query_id(repo, idx)

    # Convert topk -> normalized results
    topk_list = item.get("topk") or []
    results_norm = [_norm_hit_from_topk(h, j) for j, h in enumerate(topk_list)]

    rows_raw.append({
        "query_id": qid,
        "repo_name": repo,
        "instruction": instr,
        "results": results_norm,
    })

# ---------- Repo filter (disabled if it removes everything) ----------
rows = [r for r in rows_raw if (r.get("repo_name") or "") in TARGET_REPOS]
if not rows:
    print("[WARN] Repo filter removed everything. Disabling it and using all items.")
    rows = rows_raw

print(f"BM25 bundle: total items={len(items)} | rows kept after filter={len(rows)}")

# ---------- Prompt construction ----------
from prompts_common.templates import load_all_prompt_builders
from prompts_common.rag_prompt_maker import make_rag_prompt, save_jsonl

builders = load_all_prompt_builders()
print("Loaded templates:", ", ".join(sorted(builders.keys())))

OUT_DIR.mkdir(parents=True, exist_ok=True)
agg_rows: List[Dict[str, Any]] = []
per_template_buffers: Dict[str, List[Dict[str, Any]]] = {name: [] for name in builders.keys()}

for r in rows:
    qid   = str(r.get("query_id"))
    repo  = (r.get("repo_name") or "").lower()
    instr = r.get("instruction") or ""
    results_topk = _take_top_k(r.get("results") or [], TOP_K_SNIPPETS)

    for templ_name, builder in builders.items():
        prompt_text = make_rag_prompt(
            base_builder=builder,
            instruction=instr,
            snippets=results_topk,
            repo_name=repo,
            method="bm25",
            k=TOP_K_SNIPPETS,
        )
        row_out = {
            "query_id": qid,
            "repo_name": repo,
            "instruction": instr,
            "template": templ_name,
            "variant": "rag_bm25",
            "k_snippets": TOP_K_SNIPPETS,
            "retrieval_method": "bm25",
            "snippets": results_topk,
            "prompt": prompt_text,
        }
        per_template_buffers[templ_name].append(row_out)
        agg_rows.append(row_out)

# ---------- Saving ----------
written = []
for templ_name, rows_buf in per_template_buffers.items():
    path = OUT_DIR / f"{templ_name}_{METHOD_TAG}.jsonl"
    save_jsonl(path, rows_buf)
    written.append(path)

save_jsonl(AGG_JSONL, agg_rows)

print("\nRAG files written (BM25, top-3):")
for p in written:
    print(" -", p.resolve())
print("Aggregated:")
print(" -", AGG_JSONL.resolve())
print(f"Total prompts (aggregated): {len(agg_rows)}")


BM25 bundle: total items=150 | rows kept after filter=23
Loaded templates: v1, v2, v3, v4, v5, v6, v6_2, v6_3, v7, v8, v9

RAG files written (BM25, top-3):
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\v1_rag_bm25_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\v2_rag_bm25_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\v3_rag_bm25_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\v4_rag_bm25_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\v5_rag_bm25_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\v6_rag_bm25_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\v7_rag_bm25_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\v8_rag_bm25_top3.jsonl
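
Because the BM25 bundle stores only a `rank` per snippet, the cell above derives a synthetic score as `1 / rank` and then sorts by that score, so the original rank order is preserved. A compact, self-contained sketch of that step follows; the function names and demo hits are illustrative stand-ins for `_norm_hit_from_topk` and `_take_top_k`, not replacements for them.

```python
from typing import Any, Dict, List

def reciprocal_rank_scores(topk: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach a synthetic score of 1/rank to each hit (rank 1 -> 1.0, rank 2 -> 0.5, ...)."""
    out = []
    for i, hit in enumerate(topk):
        rank = hit.get("rank", i + 1)
        try:
            score = 1.0 / float(rank)
        except (TypeError, ValueError, ZeroDivisionError):
            score = 0.0  # unusable rank: push the hit to the bottom
        lib = hit.get("library_key") or ""
        out.append({"doc_id": f"{lib}::rank{rank}", "score": score})
    return out

def take_top_k(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    """Keep the k highest-scoring hits."""
    return sorted(results, key=lambda r: r["score"], reverse=True)[:k]

# Made-up hits, deliberately out of order
demo = [
    {"rank": 3, "library_key": "pyscf"},
    {"rank": 1, "library_key": "pyscf"},
    {"rank": 2, "library_key": "pyscf"},
]
top2 = take_top_k(reciprocal_rank_scores(demo), 2)
print([h["doc_id"] for h in top2])
```

These scores only encode ordering; they are not comparable with real BM25 or cosine scores from other retrieval files.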


#### -- 5.1.2 Cosine prompts 

In [198]:
# === RAG prompts with COSINE (top-3) using the files listed below ===
import json
from pathlib import Path
from typing import Dict, Any, List

# COSINE files (put your paths here)
RAW_JSON_PATH        = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\cosine_raw_k3.json")
NORM_JSONL_PATH      = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\cosine_topk_k3.jsonl")
NORM_JSON_ARRAY_PATH = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\cosine_topk_k3.json")

# Output
TOP_K_SNIPPETS = 3
METHOD_TAG     = "rag_cosine_top3"
OUT_DIR        = Path("outputs/prompts/rag_cosine_top3")
AGG_JSONL      = OUT_DIR / f"RAG_{METHOD_TAG}.jsonl"

TARGET_REPOS = {"seed-emulator", "seed-labs__seed-emulator", "pyscf", "pyscf__pyscf"}

# ---------- Hit normalization ----------
def _norm_hit(hit: Any, i: int) -> Dict[str, Any]:
    """Map a raw hit (string, dict, or anything else) to {doc_id, score, path, text}."""
    if isinstance(hit, str):
        return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": hit}
    if isinstance(hit, dict):
        doc_id = hit.get("doc_id") or hit.get("id") or hit.get("path") or f"doc_{i}"
        text   = hit.get("text") or hit.get("content") or hit.get("snippet") or ""
        # cosine hits usually carry 'similarity'; 'score' and 'bm25_score' are accepted too
        score  = hit.get("similarity", hit.get("score", hit.get("bm25_score", 0.0)))
        path   = hit.get("path")
        try:
            score = float(score)
        except Exception:
            score = 0.0  # missing or non-numeric score
        return {"doc_id": doc_id, "score": score, "path": path, "text": text}
    return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": str(hit)}

def _take_top_k(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    if not isinstance(results, list):
        return []
    sorted_res = sorted(
        results,
        key=lambda x: (x.get("score") if isinstance(x.get("score"), (int, float)) else -1.0),
        reverse=True
    )
    return sorted_res[:k]

def _norm_repo_name(item: Dict[str, Any]) -> str:
    for k in ("repo_name", "repo_full_name"):
        v = item.get(k)
        if v: return str(v).strip().lower()
    return ""

def _norm_instruction(item: Dict[str, Any]) -> str:
    for k in ("instruction", "query", "prompt"):
        v = item.get(k)
        if v: return str(v)
    return ""

def _norm_query_id(item: Dict[str, Any], fallback_idx: int) -> str:
    for k in ("id", "query_id", "qid"):
        v = item.get(k)
        if v: return str(v)
    repo = _norm_repo_name(item) or "repo"
    return f"{repo}__{fallback_idx:06d}"

# ---------- Loaders: prefer the normalized JSONL, then the JSON array, then the RAW JSON ----------
def _load_normalized_jsonl(p: Path) -> List[Dict[str, Any]]:
    rows = []
    with p.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line: continue
            obj = json.loads(line)
            repo = _norm_repo_name(obj)
            if repo not in TARGET_REPOS:
                continue
            res  = [_norm_hit(h, j) for j, h in enumerate(obj.get("results") or [])]
            obj["results"] = res
            rows.append(obj)
    return rows

def _load_normalized_json_array(p: Path) -> List[Dict[str, Any]]:
    data = json.loads(p.read_text(encoding="utf-8"))
    rows = []
    for obj in data:
        repo = _norm_repo_name(obj)
        if repo not in TARGET_REPOS:
            continue
        res = [_norm_hit(h, j) for j, h in enumerate(obj.get("results") or [])]
        rows.append({
            "query_id": obj.get("query_id"),
            "repo_name": repo,
            "instruction": _norm_instruction(obj) or obj.get("instruction") or "",
            "results": res,
        })
    return rows

def _load_raw_json(p: Path) -> List[Dict[str, Any]]:
    data = json.loads(p.read_text(encoding="utf-8"))
    rows = []
    for i, item in enumerate(data):
        repo = _norm_repo_name(item)
        if repo not in TARGET_REPOS:
            continue
        instr = _norm_instruction(item)
        hits  = item.get("retrieved_snippets") or []
        res   = [_norm_hit(h, j) for j, h in enumerate(hits)]
        rows.append({
            "query_id": _norm_query_id(item, i),
            "repo_name": repo,
            "instruction": instr,
            "results": res,
        })
    return rows

def load_cosine_rows() -> List[Dict[str, Any]]:
    if NORM_JSONL_PATH.exists():
        print(f"Carico JSONL normalizzato: {NORM_JSONL_PATH}")
        return _load_normalized_jsonl(NORM_JSONL_PATH)
    if NORM_JSON_ARRAY_PATH.exists():
        print(f"JSONL non trovato. Carico JSON (array): {NORM_JSON_ARRAY_PATH}")
        return _load_normalized_json_array(NORM_JSON_ARRAY_PATH)
    if RAW_JSON_PATH.exists():
        print(f"Normalizzati non trovati. Carico RAW JSON: {RAW_JSON_PATH}")
        return _load_raw_json(RAW_JSON_PATH)
    raise FileNotFoundError("Nessuno dei file COSINE esiste nei path indicati.")

# ---------- Prompt construction ----------
from prompts_common.templates import load_all_prompt_builders
from prompts_common.rag_prompt_maker import make_rag_prompt, save_jsonl

cosine_rows = load_cosine_rows()
print(f"COSINE rows (filtrate su repo target): {len(cosine_rows)}")

builders = load_all_prompt_builders()
print("Template caricati:", ", ".join(sorted(builders.keys())))

per_template_buffers: Dict[str, List[Dict[str, Any]]] = {name: [] for name in builders.keys()}
agg_rows: List[Dict[str, Any]] = []

for r in cosine_rows:
    qid   = str(r.get("query_id"))
    repo  = (r.get("repo_name") or "").lower()
    instr = r.get("instruction") or ""
    results_topk = _take_top_k((r.get("results") or []), TOP_K_SNIPPETS)

    for templ_name, builder in builders.items():
        prompt_text = make_rag_prompt(
            base_builder=builder,
            instruction=instr,
            snippets=results_topk,
            repo_name=repo,
            method="cosine",
            k=TOP_K_SNIPPETS,
        )
        row_out = {
            "query_id": qid,
            "repo_name": repo,
            "instruction": instr,
            "template": templ_name,
            "variant": "rag_cosine",
            "k_snippets": TOP_K_SNIPPETS,
            "retrieval_method": "cosine",
            "snippets": results_topk,
            "prompt": prompt_text,
        }
        per_template_buffers[templ_name].append(row_out)
        agg_rows.append(row_out)

# ---------- Saving ----------
OUT_DIR.mkdir(parents=True, exist_ok=True)
written = []
for templ_name, rows in per_template_buffers.items():
    path = OUT_DIR / f"{templ_name}_rag_cosine_top3.jsonl"
    save_jsonl(path, rows)
    written.append(path)

save_jsonl(AGG_JSONL, agg_rows)

print("\nScritti i file RAG (COSINE, top-3):")
for p in written: print(" -", p.resolve())
print("Aggregato:")
print(" -", AGG_JSONL.resolve())


Carico JSONL normalizzato: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\cosine_topk_k3.jsonl
COSINE rows (filtrate su repo target): 23
Template caricati: v1, v2, v3, v4, v5, v6, v6_2, v6_3, v7, v8, v9

Scritti i file RAG (COSINE, top-3):
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_cosine_top3\v1_rag_cosine_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_cosine_top3\v2_rag_cosine_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_cosine_top3\v3_rag_cosine_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_cosine_top3\v4_rag_cosine_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_cosine_top3\v5_rag_cosine_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_cosine_top3\v6_rag_cosine_top3.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_cosine_top3\v7
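The normalization and top-k selection used in this cell can be exercised on a small, heterogeneous list of hits. The following is a condensed, illustrative copy of `_norm_hit` and `_take_top_k`; the hit contents are made-up toy data, not taken from the retrieval files.

```python
from typing import Any, Dict, List

def norm_hit(hit: Any, i: int) -> Dict[str, Any]:
    """Condensed copy of the cell's _norm_hit, for illustration only."""
    if isinstance(hit, dict):
        score = hit.get("similarity", hit.get("score", hit.get("bm25_score", 0.0)))
        try:
            score = float(score)
        except (TypeError, ValueError):
            score = 0.0  # missing or non-numeric score
        return {
            "doc_id": hit.get("doc_id") or hit.get("id") or hit.get("path") or f"doc_{i}",
            "score": score,
            "path": hit.get("path"),
            "text": hit.get("text") or hit.get("content") or hit.get("snippet") or "",
        }
    # Strings (and any other type) become a zero-score hit with a positional id.
    return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": str(hit)}

def take_top_k(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    """Sort by score, highest first, and keep the first k hits."""
    return sorted(results, key=lambda x: x["score"], reverse=True)[:k]

# Toy hits covering the shapes the normalizer accepts.
raw_hits = [
    "def plain_string_hit(): ...",
    {"id": "a.py", "similarity": 0.91, "content": "first toy snippet"},
    {"path": "b.py", "score": "0.42", "snippet": "second toy snippet"},
    {"doc_id": "c.py", "bm25_score": None, "text": "hit with a broken score"},
]
hits = [norm_hit(h, i) for i, h in enumerate(raw_hits)]
top = take_top_k(hits, 3)
print([(h["doc_id"], h["score"]) for h in top])
```

Because Python's sort is stable, hits with equal scores keep their input order, so the zero-score string hit is ranked ahead of the later hit whose score could not be parsed.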

#### -- 5.1.3 Hybrid prompts 

In [194]:
# === RAG prompts with HYBRID retrieval (k=5 retrieval files; see TOP_K_SNIPPETS for the snippets kept) ===
import json
from pathlib import Path
from typing import Dict, Any, List

# HYBRID retrieval files (your own paths)
RAW_JSON_PATH = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\hybrid_raw_k5.json")
NORM_JSONL_PATH = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\hybrid_topk_k5.jsonl")
NORM_JSON_ARRAY_PATH = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\hybrid_topk_k5.json")

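# Output
# NOTE: the retrieval files are k=5, but TOP_K_SNIPPETS is 3, so each prompt receives
# at most 3 snippets. METHOD_TAG (and the file names derived from it) says "top5",
# while OUT_DIR says "top3".
TOP_K_SNIPPETS = 3
METHOD_TAG = "rag_hybrid_top5"
OUT_DIR = Path("outputs/prompts/rag_hybrid_top3")
AGG_JSONL = OUT_DIR / f"RAG_{METHOD_TAG}.jsonl"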

TARGET_REPOS = {"seed-emulator", "seed-labs__seed-emulator", "pyscf", "pyscf__pyscf"}

# ---------- Hit normalization ----------
def _norm_hit(hit: Any, i: int) -> Dict[str, Any]:
    if isinstance(hit, str):
        return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": hit}
    if isinstance(hit, dict):
        doc_id = hit.get("doc_id") or hit.get("id") or hit.get("path") or f"doc_{i}"
        text   = hit.get("text") or hit.get("content") or hit.get("snippet") or ""
        score  = hit.get("score") or hit.get("similarity") or hit.get("bm25_score") or 0.0
        path   = hit.get("path")
        try: score = float(score)
        except Exception: score = 0.0
        return {"doc_id": doc_id, "score": score, "path": path, "text": text}
    return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": str(hit)}

def _take_top_k(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    if not isinstance(results, list):
        return []
    sorted_res = sorted(
        results,
        key=lambda x: (x.get("score") if isinstance(x.get("score"), (int, float)) else -1.0),
        reverse=True
    )
    return sorted_res[:k]

def _norm_repo_name(item: Dict[str, Any]) -> str:
    for k in ("repo_name", "repo_full_name"):
        v = item.get(k)
        if v: return str(v).strip().lower()
    return ""

def _norm_instruction(item: Dict[str, Any]) -> str:
    for k in ("instruction", "query", "prompt"):
        v = item.get(k)
        if v: return str(v)
    return ""

def _norm_query_id(item: Dict[str, Any], fallback_idx: int) -> str:
    for k in ("id", "query_id", "qid"):
        v = item.get(k)
        if v: return str(v)
    repo = _norm_repo_name(item) or "repo"
    return f"{repo}__{fallback_idx:06d}"

# ---------- Loaders: prefer the normalized JSONL, then the JSON array, then the RAW JSON ----------
def _load_normalized_jsonl(p: Path) -> List[Dict[str, Any]]:
    rows = []
    with p.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line: continue
            obj = json.loads(line)
            repo = _norm_repo_name(obj)
            if repo not in TARGET_REPOS:
                continue
            res  = [_norm_hit(h, j) for j, h in enumerate(obj.get("results") or [])]
            obj["results"] = res
            rows.append(obj)
    return rows

def _load_normalized_json_array(p: Path) -> List[Dict[str, Any]]:
    data = json.loads(p.read_text(encoding="utf-8"))
    rows = []
    for obj in data:
        repo = _norm_repo_name(obj)
        if repo not in TARGET_REPOS:
            continue
        res = [_norm_hit(h, j) for j, h in enumerate(obj.get("results") or [])]
        rows.append({
            "query_id": obj.get("query_id"),
            "repo_name": repo,
            "instruction": _norm_instruction(obj) or obj.get("instruction") or "",
            "results": res,
        })
    return rows

def _load_raw_json(p: Path) -> List[Dict[str, Any]]:
    data = json.loads(p.read_text(encoding="utf-8"))
    rows = []
    for i, item in enumerate(data):
        repo = _norm_repo_name(item)
        if repo not in TARGET_REPOS:
            continue
        instr = _norm_instruction(item)
        hits  = item.get("retrieved_snippets") or []
        res   = [_norm_hit(h, j) for j, h in enumerate(hits)]
        rows.append({
            "query_id": _norm_query_id(item, i),
            "repo_name": repo,
            "instruction": instr,
            "results": res,
        })
    return rows

def load_hybrid_rows() -> List[Dict[str, Any]]:
    if NORM_JSONL_PATH.exists():
        print(f"Carico JSONL normalizzato: {NORM_JSONL_PATH}")
        return _load_normalized_jsonl(NORM_JSONL_PATH)
    if NORM_JSON_ARRAY_PATH.exists():
        print(f"JSONL non trovato. Carico JSON (array): {NORM_JSON_ARRAY_PATH}")
        return _load_normalized_json_array(NORM_JSON_ARRAY_PATH)
    if RAW_JSON_PATH.exists():
        print(f"Normalizzati non trovati. Carico RAW JSON: {RAW_JSON_PATH}")
        return _load_raw_json(RAW_JSON_PATH)
    raise FileNotFoundError("Nessuno dei file HYBRID esiste nei path indicati.")

# ---------- Prompt construction ----------
from prompts_common.templates import load_all_prompt_builders
from prompts_common.rag_prompt_maker import make_rag_prompt, save_jsonl

hybrid_rows = load_hybrid_rows()
print(f"HYBRID rows (filtrate su repo target): {len(hybrid_rows)}")

builders = load_all_prompt_builders()
print("Template caricati:", ", ".join(sorted(builders.keys())))

per_template_buffers: Dict[str, List[Dict[str, Any]]] = {name: [] for name in builders.keys()}
agg_rows: List[Dict[str, Any]] = []

for r in hybrid_rows:
    qid   = str(r.get("query_id"))
    repo  = (r.get("repo_name") or "").lower()
    instr = r.get("instruction") or ""
    results_topk = _take_top_k((r.get("results") or []), TOP_K_SNIPPETS)

    for templ_name, builder in builders.items():
        prompt_text = make_rag_prompt(
            base_builder=builder,
            instruction=instr,
            snippets=results_topk,
            repo_name=repo,
            method="hybrid",
            k=TOP_K_SNIPPETS,
        )
        row_out = {
            "query_id": qid,
            "repo_name": repo,
            "instruction": instr,
            "template": templ_name,
            "variant": "rag_hybrid",
            "k_snippets": TOP_K_SNIPPETS,
            "retrieval_method": "hybrid",
            "snippets": results_topk,
            "prompt": prompt_text,
        }
        per_template_buffers[templ_name].append(row_out)
        agg_rows.append(row_out)

# ---------- Saving ----------
OUT_DIR.mkdir(parents=True, exist_ok=True)
written = []
for templ_name, rows in per_template_buffers.items():
    path = OUT_DIR / f"{templ_name}_rag_hybrid_top5.jsonl"
    save_jsonl(path, rows)
    written.append(path)

save_jsonl(AGG_JSONL, agg_rows)

print("\nScritti i file RAG (HYBRID, top-5):")
for p in written: print(" -", p.resolve())
print("Aggregato:")
print(" -", AGG_JSONL.resolve())


Carico JSONL normalizzato: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\retrieval\hybrid_topk_k5.jsonl
HYBRID rows (filtrate su repo target): 23
Template caricati: v1, v2, v3, v4, v5, v6, v6_2, v6_3, v7, v8, v9

Scritti i file RAG (HYBRID, top-5):
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top3\v1_rag_hybrid_top5.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top3\v2_rag_hybrid_top5.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top3\v3_rag_hybrid_top5.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top3\v4_rag_hybrid_top5.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top3\v5_rag_hybrid_top5.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top3\v6_rag_hybrid_top5.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top3\v7
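In the hybrid cell the tag, the output directory and the number of snippets are set by hand and disagree (`top5` in the file names, `top3` in the directory, `TOP_K_SNIPPETS = 3`). One way to avoid this is to derive every name from the method and `k`. The helper below is a hypothetical sketch, not part of the project's modules; the naming pattern mirrors the one used by the cosine cell.

```python
from pathlib import Path

def output_paths(method, k, template_names, root=Path("outputs/prompts")):
    """Derive tag, directory, per-template files and aggregate file from (method, k)."""
    tag = f"rag_{method}_top{k}"
    out_dir = root / tag
    per_template = {name: out_dir / f"{name}_{tag}.jsonl" for name in template_names}
    aggregate = out_dir / f"RAG_{tag}.jsonl"
    return tag, out_dir, per_template, aggregate

tag, out_dir, per_template, aggregate = output_paths("hybrid", 3, ["v1", "v2"])
print(tag)
print(per_template["v1"].as_posix())
print(aggregate.as_posix())
```

With a single source of truth for `k`, the file names can no longer drift away from the number of snippets actually placed in the prompts.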

#### -- 5.1.4 Multihop option 1 (decomposition_first) prompts

In [195]:
# === RAG prompts with MULTIHOP "decomposition_first" retrieval (tagged top-4; see TOP_K_SNIPPETS for the actual k) ===
import json
from pathlib import Path
from typing import Dict, Any, List

# ---- MultiHop INPUT (your own files) ----
MH_JSONL = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_decomposition_topk_k3.jsonl")
MH_JSON  = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_decomposition_topk_k3.json")

# ---- PARAMS / OUTPUT ----
# NOTE: METHOD_TAG says "top4", but the input files are k=3 and TOP_K_SNIPPETS is 3,
# so each prompt receives at most 3 snippets.
TOP_K_SNIPPETS = 3
STRATEGY = "decomposition_first"
METHOD_TAG = "rag_multihop_decomposition_top4"
OUT_DIR = Path("outputs/prompts") / METHOD_TAG
AGG_JSONL = OUT_DIR / f"{METHOD_TAG}.jsonl"
TARGET_REPOS = {"seed-emulator", "seed-labs__seed-emulator", "pyscf", "pyscf__pyscf"}

# ---------- Normalization utilities ----------
def _norm_repo_name(item: Dict[str, Any]) -> str:
    for k in ("repo_name", "repo_full_name"):
        v = item.get(k)
        if v: return str(v).strip().lower()
    return ""

def _norm_instruction(item: Dict[str, Any]) -> str:
    for k in ("instruction", "query", "prompt"):
        v = item.get(k)
        if v: return str(v)
    return ""

def _norm_hit(hit: Any, i: int) -> Dict[str, Any]:
    if isinstance(hit, str):
        return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": hit}
    if isinstance(hit, dict):
        doc_id = hit.get("doc_id") or hit.get("id") or hit.get("path") or f"doc_{i}"
        text   = hit.get("text") or hit.get("content") or hit.get("snippet") or ""
        score  = hit.get("score") or hit.get("similarity") or hit.get("bm25_score") or 0.0
        path   = hit.get("path")
        try: score = float(score)
        except Exception: score = 0.0
        return {"doc_id": doc_id, "score": score, "path": path, "text": text}
    return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": str(hit)}

def _take_top_k(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    if not isinstance(results, list):
        return []
    sorted_res = sorted(
        results,
        key=lambda x: (x.get("score") if isinstance(x.get("score"), (int, float)) else -1.0),
        reverse=True
    )
    return sorted_res[:k]

# ---------- Loaders: prefer JSONL, then JSON ----------
def _load_mh_jsonl(p: Path) -> List[Dict[str, Any]]:
    rows = []
    with p.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line: continue
            obj = json.loads(line)
            repo = _norm_repo_name(obj)
            if repo and repo not in TARGET_REPOS:
                continue
            # compat: results may be stored under "results" or "retrieved_snippets"
            res = obj.get("results")
            if res is None:
                res = obj.get("retrieved_snippets") or []
            res = [_norm_hit(h, j) for j, h in enumerate(res)]
            rows.append({
                "query_id": obj.get("query_id"),
                "repo_name": repo or obj.get("repo_name"),
                "instruction": _norm_instruction(obj) or obj.get("instruction") or "",
                "results": res
            })
    return rows

def _load_mh_json(p: Path) -> List[Dict[str, Any]]:
    data = json.loads(p.read_text(encoding="utf-8"))
    rows = []
    for i, obj in enumerate(data):
        repo = _norm_repo_name(obj)
        if repo and repo not in TARGET_REPOS:
            continue
        res = obj.get("results")
        if res is None:
            res = obj.get("retrieved_snippets") or []
        res = [_norm_hit(h, j) for j, h in enumerate(res)]
        rows.append({
            "query_id": obj.get("query_id") or f"{repo or 'repo'}__{i:06d}",
            "repo_name": repo or obj.get("repo_name"),
            "instruction": _norm_instruction(obj) or obj.get("instruction") or "",
            "results": res
        })
    return rows

def load_multihop_rows() -> List[Dict[str, Any]]:
    if MH_JSONL.exists():
        print(f"Carico MultiHop JSONL: {MH_JSONL}")
        return _load_mh_jsonl(MH_JSONL)
    if MH_JSON.exists():
        print(f"JSONL non trovato. Carico MultiHop JSON: {MH_JSON}")
        return _load_mh_json(MH_JSON)
    raise FileNotFoundError("File MultiHop non trovati nei path indicati.")

# ---------- Prompt construction ----------
from prompts_common.templates import load_all_prompt_builders
from prompts_common.rag_prompt_maker import make_rag_prompt, save_jsonl

rows = load_multihop_rows()
print(f"MultiHop rows: {len(rows)}  | strategy={STRATEGY} | k={TOP_K_SNIPPETS}")

builders = load_all_prompt_builders()
print("Template caricati:", ", ".join(sorted(builders.keys())))

OUT_DIR.mkdir(parents=True, exist_ok=True)
per_template_buffers: Dict[str, List[Dict[str, Any]]] = {name: [] for name in builders.keys()}
agg_rows: List[Dict[str, Any]] = []

for r in rows:
    qid   = str(r.get("query_id"))
    repo  = (r.get("repo_name") or "").lower()
    instr = r.get("instruction") or ""
    topk  = _take_top_k((r.get("results") or []), TOP_K_SNIPPETS)

    for templ_name, builder in builders.items():
        prompt_text = make_rag_prompt(
            base_builder=builder,
            instruction=instr,
            snippets=topk,
            repo_name=repo,
            method="multihop_decomposition",
            k=TOP_K_SNIPPETS,
        )
        row_out = {
            "query_id": qid,
            "repo_name": repo,
            "instruction": instr,
            "template": templ_name,
            "variant": "rag_multihop_decomposition",
            "strategy": STRATEGY,
            "k_snippets": TOP_K_SNIPPETS,
            "retrieval_method": "multihop",
            "snippets": topk,
            "prompt": prompt_text,
        }
        per_template_buffers[templ_name].append(row_out)
        agg_rows.append(row_out)

# ---------- Saving ----------
written = []
for templ_name, rows_out in per_template_buffers.items():
    path = OUT_DIR / f"{templ_name}_{METHOD_TAG}.jsonl"
    save_jsonl(path, rows_out)
    written.append(path)

save_jsonl(AGG_JSONL, agg_rows)

print("\nScritti i file RAG (MultiHop decomposition, top-4):")
for p in written: print(" -", p.resolve())
print("Aggregato:")
print(" -", AGG_JSONL.resolve())


Carico MultiHop JSONL: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_decomposition_topk_k3.jsonl
MultiHop rows: 23  | strategy=decomposition_first | k=3
Template caricati: v1, v2, v3, v4, v5, v6, v6_2, v6_3, v7, v8, v9

Scritti i file RAG (MultiHop decomposition, top-4):
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_decomposition_top4\v1_rag_multihop_decomposition_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_decomposition_top4\v2_rag_multihop_decomposition_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_decomposition_top4\v3_rag_multihop_decomposition_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_decomposition_top4\v4_rag_multihop_decomposition_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_decomposition_top4\v5_rag_multihop_decomposition_top4.jsonl
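The MultiHop loader prefers the JSONL file and falls back to the JSON array only when the JSONL is missing. A self-contained sketch of that fallback order, using temporary files and a hypothetical `load_rows` helper (the repo filter and hit normalization of the real loaders are left out):

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

def load_rows(jsonl_path: Path, json_path: Path) -> List[Dict[str, Any]]:
    """Prefer the JSONL file; fall back to a JSON array; fail if neither exists."""
    if jsonl_path.exists():
        with jsonl_path.open("r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if json_path.exists():
        return json.loads(json_path.read_text(encoding="utf-8"))
    raise FileNotFoundError(f"Neither {jsonl_path} nor {json_path} exists")

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    jsonl_file, json_file = tmp_dir / "mh.jsonl", tmp_dir / "mh.json"

    # Only the JSON array exists: the fallback branch is taken.
    json_file.write_text(json.dumps([{"query_id": "from_json"}]), encoding="utf-8")
    only_json = load_rows(jsonl_file, json_file)

    # Both exist: the JSONL wins, and blank lines are skipped.
    jsonl_file.write_text('{"query_id": "from_jsonl"}\n\n', encoding="utf-8")
    both = load_rows(jsonl_file, json_file)

print(only_json[0]["query_id"], both[0]["query_id"])
```

A consequence of this order is that a stale JSONL silently shadows a newer JSON array, so it is worth deleting or regenerating both files together.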


#### -- 5.1.5 Multihop option 2 (iterative_refine) prompts

In [196]:
# === RAG prompts with MULTIHOP "iterative_refine" retrieval (tagged top-4; see TOP_K_SNIPPETS for the actual k) ===
import json
from pathlib import Path
from typing import Dict, Any, List

# ---- INPUT MultiHop (iterative_refine) ----
MH_JSONL = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_iterative_topk_k3.jsonl")
MH_JSON  = Path(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_iterative_topk_k3.json")

# ---- PARAMS / OUTPUT ----
# NOTE: METHOD_TAG says "top4", but the input files are k=3 and TOP_K_SNIPPETS is 3,
# so each prompt receives at most 3 snippets.
TOP_K_SNIPPETS = 3
STRATEGY = "iterative_refine"
METHOD_TAG = "rag_multihop_iterative_top4"
OUT_DIR = Path("outputs/prompts") / METHOD_TAG
AGG_JSONL = OUT_DIR / f"{METHOD_TAG}.jsonl"
TARGET_REPOS = {"seed-emulator", "seed-labs__seed-emulator", "pyscf", "pyscf__pyscf"}

# ---------- Normalization utilities ----------
def _norm_repo_name(item: Dict[str, Any]) -> str:
    for k in ("repo_name", "repo_full_name"):
        v = item.get(k)
        if v: return str(v).strip().lower()
    return ""

def _norm_instruction(item: Dict[str, Any]) -> str:
    for k in ("instruction", "query", "prompt"):
        v = item.get(k)
        if v: return str(v)
    return ""

def _norm_hit(hit: Any, i: int) -> Dict[str, Any]:
    if isinstance(hit, str):
        return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": hit}
    if isinstance(hit, dict):
        doc_id = hit.get("doc_id") or hit.get("id") or hit.get("path") or f"doc_{i}"
        text   = hit.get("text") or hit.get("content") or hit.get("snippet") or ""
        score  = hit.get("score") or hit.get("similarity") or hit.get("bm25_score") or 0.0
        path   = hit.get("path")
        try: score = float(score)
        except Exception: score = 0.0
        return {"doc_id": doc_id, "score": score, "path": path, "text": text}
    return {"doc_id": f"doc_{i}", "score": 0.0, "path": None, "text": str(hit)}

def _take_top_k(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    if not isinstance(results, list):
        return []
    sorted_res = sorted(
        results,
        key=lambda x: (x.get("score") if isinstance(x.get("score"), (int, float)) else -1.0),
        reverse=True
    )
    return sorted_res[:k]

# ---------- Loaders: prefer JSONL, then JSON ----------
def _load_mh_jsonl(p: Path) -> List[Dict[str, Any]]:
    rows = []
    with p.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line: continue
            obj = json.loads(line)
            repo = _norm_repo_name(obj)
            if repo and repo not in TARGET_REPOS:
                continue
            res = obj.get("results")
            if res is None:
                res = obj.get("retrieved_snippets") or []
            res = [_norm_hit(h, j) for j, h in enumerate(res)]
            rows.append({
                "query_id": obj.get("query_id"),
                "repo_name": repo or obj.get("repo_name"),
                "instruction": _norm_instruction(obj) or obj.get("instruction") or "",
                "results": res
            })
    return rows

def _load_mh_json(p: Path) -> List[Dict[str, Any]]:
    data = json.loads(p.read_text(encoding="utf-8"))
    rows = []
    for i, obj in enumerate(data):
        repo = _norm_repo_name(obj)
        if repo and repo not in TARGET_REPOS:
            continue
        res = obj.get("results")
        if res is None:
            res = obj.get("retrieved_snippets") or []
        res = [_norm_hit(h, j) for j, h in enumerate(res)]
        rows.append({
            "query_id": obj.get("query_id") or f"{repo or 'repo'}__{i:06d}",
            "repo_name": repo or obj.get("repo_name"),
            "instruction": _norm_instruction(obj) or obj.get("instruction") or "",
            "results": res
        })
    return rows

def load_multihop_rows() -> List[Dict[str, Any]]:
    if MH_JSONL.exists():
        print(f"Carico MultiHop JSONL: {MH_JSONL}")
        return _load_mh_jsonl(MH_JSONL)
    if MH_JSON.exists():
        print(f"JSONL non trovato. Carico MultiHop JSON: {MH_JSON}")
        return _load_mh_json(MH_JSON)
    raise FileNotFoundError("File MultiHop (iterative_refine) non trovati nei path indicati.")

# ---------- Prompt construction ----------
from prompts_common.templates import load_all_prompt_builders
from prompts_common.rag_prompt_maker import make_rag_prompt, save_jsonl

rows = load_multihop_rows()
print(f"MultiHop rows: {len(rows)}  | strategy={STRATEGY} | k={TOP_K_SNIPPETS}")

builders = load_all_prompt_builders()
print("Template caricati:", ", ".join(sorted(builders.keys())))

OUT_DIR.mkdir(parents=True, exist_ok=True)
per_template_buffers: Dict[str, List[Dict[str, Any]]] = {name: [] for name in builders.keys()}
agg_rows: List[Dict[str, Any]] = []

for r in rows:
    qid   = str(r.get("query_id"))
    repo  = (r.get("repo_name") or "").lower()
    instr = r.get("instruction") or ""
    topk  = _take_top_k((r.get("results") or []), TOP_K_SNIPPETS)

    for templ_name, builder in builders.items():
        prompt_text = make_rag_prompt(
            base_builder=builder,
            instruction=instr,
            snippets=topk,
            repo_name=repo,
            method="multihop_iterative",
            k=TOP_K_SNIPPETS,
        )
        row_out = {
            "query_id": qid,
            "repo_name": repo,
            "instruction": instr,
            "template": templ_name,
            "variant": "rag_multihop_iterative",
            "strategy": STRATEGY,
            "k_snippets": TOP_K_SNIPPETS,
            "retrieval_method": "multihop",
            "snippets": topk,
            "prompt": prompt_text,
        }
        per_template_buffers[templ_name].append(row_out)
        agg_rows.append(row_out)

# ---------- Saving ----------
written = []
for templ_name, rows_out in per_template_buffers.items():
    path = OUT_DIR / f"{templ_name}_{METHOD_TAG}.jsonl"
    save_jsonl(path, rows_out)
    written.append(path)

save_jsonl(AGG_JSONL, agg_rows)

print("\nScritti i file RAG (MultiHop iterative_refine, top-4):")
for p in written: print(" -", p.resolve())
print("Aggregato:")
print(" -", AGG_JSONL.resolve())


Carico MultiHop JSONL: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_iterative_topk_k3.jsonl
MultiHop rows: 23  | strategy=iterative_refine | k=3
Template caricati: v1, v2, v3, v4, v5, v6, v6_2, v6_3, v7, v8, v9

Scritti i file RAG (MultiHop iterative_refine, top-4):
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_iterative_top4\v1_rag_multihop_iterative_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_iterative_top4\v2_rag_multihop_iterative_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_iterative_top4\v3_rag_multihop_iterative_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_iterative_top4\v4_rag_multihop_iterative_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_iterative_top4\v5_rag_multihop_iterative_top4.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_1
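The cosine, hybrid and MultiHop cells repeat the same loop: take the top-k hits for each query, build one prompt per template, and collect the rows per template. A possible way to share that loop is sketched below. `build_rows`, the toy builders and `toy_make_prompt` are hypothetical stand-ins for the notebook's code, `load_all_prompt_builders()` and `make_rag_prompt()`, whose real signatures are only known from the calls above. As a simplification, the sketch derives `variant` from the method name, whereas the MultiHop cells set `variant` and `retrieval_method` separately.

```python
def build_rows(rows, builders, method, k, make_prompt):
    """Build one output row per (query, template) pair and group rows by template."""
    per_template = {name: [] for name in builders}
    for r in rows:
        hits = r.get("results") or []
        topk = sorted(hits, key=lambda h: h.get("score", 0.0), reverse=True)[:k]
        instr = r.get("instruction") or ""
        for name, builder in builders.items():
            per_template[name].append({
                "query_id": str(r.get("query_id")),
                "repo_name": (r.get("repo_name") or "").lower(),
                "instruction": instr,
                "template": name,
                "variant": f"rag_{method}",
                "k_snippets": k,
                "retrieval_method": method,
                "snippets": topk,
                "prompt": make_prompt(builder, instr, topk),
            })
    return per_template

# Hypothetical stand-ins for the project's template builders and prompt maker.
toy_builders = {
    "v1": lambda task: f"Task: {task}\n### Code:",
    "v2": lambda task: f"# {task}\n",
}

def toy_make_prompt(builder, instruction, snippets):
    """Put the retrieved snippets before the template text."""
    context = "\n".join(s["text"] for s in snippets)
    return f"{context}\n{builder(instruction)}"

toy_rows = [{
    "query_id": 1,
    "repo_name": "PySCF",
    "instruction": "run HF",
    "results": [{"text": "low", "score": 0.1}, {"text": "high", "score": 0.9}],
}]
out = build_rows(toy_rows, toy_builders, method="cosine", k=1, make_prompt=toy_make_prompt)
print(out["v1"][0]["prompt"])
```

Each retrieval cell would then differ only in its loader, its method name and its value of `k`.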

## 6: CODE GENERATION

### 6.1   BASELINE

In [216]:
# === Code generation for all available prompts ===
from pathlib import Path
from models.loader import load_model_and_tokenizer
from models.config import load_config

# 1) Load the local model (same as before)
cfg = load_config()
cfg.model.model_name = "codellama/CodeLlama-7b-Instruct-hf"
cfg.cache.root = "./cache"
tokenizer, model, used_cache = load_model_and_tokenizer(cfg)

# 2) Scan the prompt files (baseline only, for now)
from codegen import scan_prompt_files, run_codegen_over_prompt_files

PROMPT_DIRS = [
    Path("outputs/prompts/baseline"),
    # add further prompt directories here later, e.g. the RAG outputs written in Section 5.1:
    # Path("outputs/prompts/rag_bm25_top3"),
    # Path("outputs/prompts/rag_cosine_top3"),
    # Path("outputs/prompts/rag_hybrid_top3"),
    # Path("outputs/prompts/rag_multihop_decomposition_top4"),
    # Path("outputs/prompts/rag_multihop_iterative_top4"),
]

prompt_files = scan_prompt_files(PROMPT_DIRS)
print(f"Trovati {len(prompt_files)} file di prompt:")
for p in prompt_files: print(" -", p)

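# 3) Generation parameters (tunable)
# NOTE: with do_sample=False decoding is greedy, so temperature and top_p are ignored;
# this is what triggers the "generation flags are not valid" warning in the output.
GEN_CFG = dict(
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.95,
    do_sample=False,  # deterministic, for comparable runs
)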

# 4) Run codegen with resume (items already generated are skipped)
written = run_codegen_over_prompt_files(
    tokenizer=tokenizer,
    model=model,
    prompt_files=prompt_files,
    out_root=Path("outputs/codegen"),
    model_name=cfg.model.model_name,
    **GEN_CFG
)

print("\nOutput per-prompt aggiornati:")
for w in written: print(" -", w.resolve())

print("\nAggregato globale:")
from codegen.io_utils import result_paths
if prompt_files:
    _, agg = result_paths(Path("outputs/codegen"), cfg.model.model_name, prompt_files[0])
    print(" -", agg.resolve())


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Trovati 14 file di prompt:
 - outputs\prompts\baseline\_all_baseline.jsonl
 - outputs\prompts\baseline\baseline_make_prompts_baseline.jsonl
 - outputs\prompts\baseline\templates_baseline.jsonl
 - outputs\prompts\baseline\v1_baseline.jsonl
 - outputs\prompts\baseline\v2_baseline.jsonl
 - outputs\prompts\baseline\v3_baseline.jsonl
 - outputs\prompts\baseline\v4_baseline.jsonl
 - outputs\prompts\baseline\v5_baseline.jsonl
 - outputs\prompts\baseline\v6_2_baseline.jsonl
 - outputs\prompts\baseline\v6_3_baseline.jsonl
 - outputs\prompts\baseline\v6_baseline.jsonl
 - outputs\prompts\baseline\v7_baseline.jsonl
 - outputs\prompts\baseline\v8_baseline.jsonl
 - outputs\prompts\baseline\v9_baseline.jsonl
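The scan picks up `_all_baseline.jsonl` together with the per-template files. Its 253 items equal 23 queries × 11 templates, which suggests that it repeats the per-template rows; this is an inference from the counts, not something the log states. If that is the case, generating over the aggregate doubles the work. The filter below is a hypothetical sketch that skips aggregate files and empty files before generation. It does not use `scan_prompt_files`, whose behavior is not shown here.

```python
import tempfile
from pathlib import Path

def count_rows(path):
    """Count non-blank lines in a JSONL file."""
    with path.open("r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

def select_prompt_files(paths, skip_prefixes=("_all",)):
    """Keep per-template files that contain at least one row."""
    kept = []
    for p in sorted(paths):
        if p.name.startswith(skip_prefixes):
            continue  # aggregate file: assumed to repeat the per-template rows
        if count_rows(p) == 0:
            continue  # nothing to generate
        kept.append(p)
    return kept

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Toy directory mimicking the layout of outputs/prompts/baseline.
    contents = {
        "_all_baseline.jsonl": ['{"id": 1}', '{"id": 2}'],
        "templates_baseline.jsonl": [],
        "v1_baseline.jsonl": ['{"id": 1}'],
        "v2_baseline.jsonl": ['{"id": 2}'],
    }
    for name, lines in contents.items():
        text = "".join(line + "\n" for line in lines)
        (base / name).write_text(text, encoding="utf-8")
    selected = [p.name for p in select_prompt_files(base.glob("*.jsonl"))]

print(selected)
```

At roughly 42 s per item, skipping a 253-item aggregate would save about three hours of generation in the run logged here.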

[CODEGEN] file: outputs\prompts\baseline\_all_baseline.jsonl | items: 253 | resume hits: 0


Generating -> _all_baseline:   0%|          | 0/253 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Generating -> _all_baseline:   0%|          | 1/253 [00:42<2:58:55, 42.60s/it]
Generating -> _all_baseline:   1%|          | 2/253 [01:24<2:57:40, 42.47s/it]
Generating -> _all_baseline:   1%|          | 3/253 [02:07<2:56:12, 42.29s/it]
Generating -> _all_baseline:   2%|▏         | 4/253 [02:49<2:55:33, 42.30s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\_all_baseline.jsonl (+253)

[CODEGEN] file: outputs\prompts\baseline\baseline_make_prompts_baseline.jsonl | items: 0 | resume hits: 0


Generating -> baseline_make_prompts_baseline: 0it [00:00, ?it/s]


[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\baseline_make_prompts_baseline.jsonl (+0)

[CODEGEN] file: outputs\prompts\baseline\templates_baseline.jsonl | items: 0 | resume hits: 0


Generating -> templates_baseline: 0it [00:00, ?it/s]


[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\templates_baseline.jsonl (+0)

[CODEGEN] file: outputs\prompts\baseline\v1_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v1_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v1_baseline:   4%|▍         | 1/23 [00:42<15:34, 42.47s/it]
Generating -> v1_baseline:   9%|▊         | 2/23 [01:24<14:48, 42.31s/it]
Generating -> v1_baseline:  13%|█▎        | 3/23 [02:06<14:05, 42.29s/it]
Generating -> v1_baseline:  17%|█▋        | 4/23 [02:49<13:22, 42.22s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v1_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v2_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v2_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v2_baseline:   4%|▍         | 1/23 [00:28<10:23, 28.35s/it]
Generating -> v2_baseline:   9%|▊         | 2/23 [00:57<10:01, 28.66s/it]
Generating -> v2_baseline:  13%|█▎        | 3/23 [01:15<08:01, 24.09s/it]
Generating -> v2_baseline:  17%|█▋        | 4/23 [01:56<09:40, 30.57s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v2_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v3_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v3_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v3_baseline:   4%|▍         | 1/23 [00:42<15:39, 42.72s/it]
Generating -> v3_baseline:   9%|▊         | 2/23 [01:25<14:54, 42.61s/it]
Generating -> v3_baseline:  13%|█▎        | 3/23 [02:07<14:12, 42.63s/it]
Generating -> v3_baseline:  17%|█▋        | 4/23 [02:50<13:29, 42.59s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v3_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v4_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v4_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v4_baseline:   4%|▍         | 1/23 [00:42<15:26, 42.13s/it]
Generating -> v4_baseline:   9%|▊         | 2/23 [01:24<14:42, 42.02s/it]
Generating -> v4_baseline:  13%|█▎        | 3/23 [02:06<14:00, 42.02s/it]
Generating -> v4_baseline:  17%|█▋        | 4/23 [02:47<13:17, 41.97s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v4_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v5_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v5_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v5_baseline:   4%|▍         | 1/23 [00:42<15:31, 42.32s/it]
Generating -> v5_baseline:   9%|▊         | 2/23 [01:24<14:46, 42.19s/it]
Generating -> v5_baseline:  13%|█▎        | 3/23 [02:06<14:04, 42.22s/it]
Generating -> v5_baseline:  17%|█▋        | 4/23 [02:48<13:21, 42.17s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v5_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v6_2_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v6_2_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v6_2_baseline:   4%|▍         | 1/23 [00:45<16:45, 45.70s/it]
Generating -> v6_2_baseline:   9%|▊         | 2/23 [01:31<15:56, 45.56s/it]
Generating -> v6_2_baseline:  13%|█▎        | 3/23 [02:16<15:11, 45.59s/it]
Generating -> v6_2_baseline:  17%|█▋        | 4/23 [03:02<14:25, 45.54s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v6_2_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v6_3_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v6_3_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v6_3_baseline:   4%|▍         | 1/23 [00:45<16:49, 45.87s/it]
Generating -> v6_3_baseline:   9%|▊         | 2/23 [01:31<16:00, 45.76s/it]
Generating -> v6_3_baseline:  13%|█▎        | 3/23 [02:17<15:15, 45.76s/it]
Generating -> v6_3_baseline:  17%|█▋        | 4/23 [03:02<14:28, 45.72s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v6_3_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v6_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v6_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v6_baseline:   4%|▍         | 1/23 [00:44<16:13, 44.25s/it]
Generating -> v6_baseline:   9%|▊         | 2/23 [01:01<09:52, 28.22s/it]
Generating -> v6_baseline:  13%|█▎        | 3/23 [01:45<11:49, 35.47s/it]
Generating -> v6_baseline:  17%|█▋        | 4/23 [02:21<11:15, 35.55s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v6_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v7_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v7_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v7_baseline:   4%|▍         | 1/23 [00:44<16:09, 44.08s/it]
Generating -> v7_baseline:   9%|▊         | 2/23 [01:27<15:23, 43.95s/it]
Generating -> v7_baseline:  13%|█▎        | 3/23 [02:11<14:39, 43.97s/it]
Generating -> v7_baseline:  17%|█▋        | 4/23 [02:55<13:54, 43.92s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v7_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v8_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v8_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v8_baseline:   4%|▍         | 1/23 [00:43<16:03, 43.78s/it]
Generating -> v8_baseline:   9%|▊         | 2/23 [01:11<11:55, 34.05s/it]
Generating -> v8_baseline:  13%|█▎        | 3/23 [01:54<12:49, 38.46s/it]
Generating -> v8_baseline:  17%|█▋        | 4/23 [02:38<12:48, 40.47s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v8_baseline.jsonl (+23)

[CODEGEN] file: outputs\prompts\baseline\v9_baseline.jsonl | items: 23 | resume hits: 0


Generating -> v9_baseline:   0%|          | 0/23 [00:00<?, ?it/s]
Generating -> v9_baseline:   4%|▍         | 1/23 [00:44<16:10, 44.10s/it]
Generating -> v9_baseline:   9%|▊         | 2/23 [01:28<15:23, 43.98s/it]
Generating -> v9_baseline:  13%|█▎        | 3/23 [02:12<14:40, 44.00s/it]
Generating -> v9_baseline:  17%|█▋        | 4/23 [02:55<13:54, 43.95s/it]

[DONE] -> outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v9_baseline.jsonl (+23)

Output per-prompt aggiornati:
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\_all_baseline.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\baseline_make_prompts_baseline.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\templates_baseline.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v1_baseline.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v2_baseline.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\v3_baseline.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\
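
The log above repeats a warning that `temperature` and `top_p` may be ignored, which is consistent with `do_sample=False` in `GEN_CFG`. A minimal sketch of one way to avoid it, assuming the runner forwards its keyword arguments to `model.generate` unchanged: drop the sampling-only flags whenever decoding is greedy. The helper `clean_generation_kwargs` and the `SAMPLING_ONLY_FLAGS` tuple are hypothetical names and are not part of the notebook's `codegen` package.

```python
from typing import Any, Dict

# Assumption for this sketch: these flags only matter when do_sample=True.
SAMPLING_ONLY_FLAGS = ("temperature", "top_p", "top_k")

def clean_generation_kwargs(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of cfg without sampling-only flags when decoding is greedy."""
    if cfg.get("do_sample", False):
        return dict(cfg)
    return {k: v for k, v in cfg.items() if k not in SAMPLING_ONLY_FLAGS}

# Same values as GEN_CFG in the cell above.
gen_cfg = dict(max_new_tokens=512, temperature=0.2, top_p=0.95, do_sample=False)
cleaned = clean_generation_kwargs(gen_cfg)
print(cleaned)
```

Passing `**cleaned` instead of `**GEN_CFG` should keep the run deterministic while leaving the original dictionary untouched for a later sampling experiment.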




### 6.2 -> RAG

#### 6.2.1  RAG WITH BM25 - RESULTS READY

In [324]:
# === CODEGEN for RAG BM25 top-3 only: OOM-safe + resume + progress (cloned from COSINE) ===
import os, gc, json, time
from pathlib import Path
from typing import Dict, Any, List, Tuple
import torch
from tqdm.auto import tqdm

from models.loader import load_model_and_tokenizer
from models.config import load_config

# -----------------------------
# 1) Load the local model
# -----------------------------
cfg = load_config()
cfg.model.model_name = "codellama/CodeLlama-7b-Instruct-hf"
cfg.cache.root = "./cache"
tokenizer, model, used_cache = load_model_and_tokenizer(cfg)
device = next(model.parameters()).device

torch.set_grad_enabled(False)
try:
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
except Exception:
    pass

# -----------------------------
# 2) I/O utilities
# -----------------------------
def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line=line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows

def append_jsonl(path: Path, rows: List[Dict[str, Any]]) -> None:
    if not rows: return
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def load_done_keys(path: Path) -> set:
    done = set()
    if path.exists():
        for obj in read_jsonl(path):
            done.add( (str(obj.get("query_id")), str(obj.get("template")), str(obj.get("variant"))) )
    return done

# -----------------------------
# 3) Generation config + guard-rails (identical to COSINE)
# -----------------------------
GEN_KW = dict(
    max_new_tokens=200,     # identical to the COSINE runner
    do_sample=False,        # deterministic, as in COSINE
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_time=30.0,
    use_cache=True,
)

MODEL_CTX = getattr(model.config, "max_position_embeddings", None) or getattr(tokenizer, "model_max_length", 2048)
MAX_INPUT_TOKENS = min(1024, max(512, int(MODEL_CTX - GEN_KW["max_new_tokens"] - 64)))

def truncate_prompt_tokens(prompt: str) -> str:
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= MAX_INPUT_TOKENS:
        return prompt
    ids = ids[-MAX_INPUT_TOKENS:]  # keep the tail, as in COSINE
    return tokenizer.decode(ids, skip_special_tokens=True)

def generate_safe(prompt: str) -> Tuple[str, Dict[str, Any]]:
    stats = {"attempts": 0, "oom_retries": 0, "used_max_new_tokens": GEN_KW["max_new_tokens"]}
    prompt_use = truncate_prompt_tokens(prompt)
    kw = dict(GEN_KW)
    for _ in range(3):
        stats["attempts"] += 1
        try:
            inputs = tokenizer(prompt_use, return_tensors="pt").to(device)
            with torch.no_grad():
                out_ids = model.generate(**inputs, **kw)
            text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
            if text.startswith(prompt_use):
                text = text[len(prompt_use):].lstrip()
            del inputs, out_ids
            return text, stats
        except torch.cuda.OutOfMemoryError:
            stats["oom_retries"] += 1
            kw["max_new_tokens"] = max(128, int(kw["max_new_tokens"] * 0.8))
            stats["used_max_new_tokens"] = kw["max_new_tokens"]
            torch.cuda.empty_cache(); gc.collect()
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                stats["oom_retries"] += 1
                kw["max_new_tokens"] = max(128, int(kw["max_new_tokens"] * 0.8))
                stats["used_max_new_tokens"] = kw["max_new_tokens"]
                torch.cuda.empty_cache(); gc.collect()
                continue
            raise
    # compact fallback (as in COSINE): last 800 characters of the prompt, shorter generation
    short_prompt = prompt_use[-800:]
    inputs = tokenizer(short_prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out_ids = model.generate(**inputs, **{**GEN_KW, "max_new_tokens": 160, "max_time": 25.0})
    text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
    if text.startswith(short_prompt):
        text = text[len(short_prompt):].lstrip()
    del inputs, out_ids
    return text, {**stats, "used_max_new_tokens": 160, "fallback": True}

# -----------------------------
# 4) RAG BM25 source and output folders
# -----------------------------
BM25_RAG_FILE = Path(
    r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_bm25_top3\RAG_rag_bm25_top3.jsonl"
)
if not BM25_RAG_FILE.exists():
    raise FileNotFoundError(f"File prompt BM25 non trovato: {BM25_RAG_FILE}")

TAG = "rag_bm25_top3"
RESULTS_ROOT = Path("results/rag_generations")
OUT_DIR = RESULTS_ROOT / TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)
AGG_OUT = OUT_DIR / "AGGREGATED_results.jsonl"

rows = read_jsonl(BM25_RAG_FILE)
print(f"[{TAG}] prompt letti: {len(rows)}")

# Resume
done_agg = load_done_keys(AGG_OUT)
print(f"[{TAG}] resume: {len(done_agg)} già presenti (aggregato)")

templates = sorted({str(r.get("template", "unknown")) for r in rows})
per_template_targets: Dict[str, Path] = {t: OUT_DIR / f"{t}_results.jsonl" for t in templates}
per_template_done: Dict[str, set] = {t: load_done_keys(p) for t, p in per_template_targets.items()}
per_template_buffers: Dict[str, List[Dict[str, Any]]] = {t: [] for t in templates}

# -----------------------------
# 5) Generation loop (cloned for BM25)
# -----------------------------
total_new = 0
new_agg_rows: List[Dict[str, Any]] = []
t0 = time.time()

if torch.cuda.is_available():
    torch.cuda.empty_cache()

pbar = tqdm(rows, desc=TAG, dynamic_ncols=True)
for r in pbar:
    qid = str(r.get("query_id"))
    templ = str(r.get("template", "unknown"))  # same default as the `templates` set, so the buffer lookup cannot miss
    variant = str(r.get("variant") or TAG)  # "rag_bm25_top3"
    prompt = r.get("prompt") or ""

    key = (qid, templ, variant)
    if key in done_agg or key in per_template_done.get(templ, set()):
        continue

    if not r.get("snippets"):
        print(f"[WARN] {templ} qid={qid}: campo 'snippets' mancante/vuoto nel prompt BM25.")

    gen_text, stats = generate_safe(prompt)

    row_out = {
        "query_id": qid,
        "repo_name": r.get("repo_name"),
        "instruction": r.get("instruction"),
        "template": templ,
        "variant": variant,                 # "rag_bm25_top3"
        "retrieval_method": r.get("retrieval_method"),
        "k_snippets": r.get("k_snippets"),
        "snippets": r.get("snippets"),
        "prompt": prompt,
        "generation": gen_text,
        "gen_stats": stats,
        "model_name": cfg.model.model_name,
    }

    per_template_buffers[templ].append(row_out)
    new_agg_rows.append(row_out)
    total_new += 1

    try:
        alloc = torch.cuda.memory_allocated() / (1024**3)
        reserved = torch.cuda.memory_reserved() / (1024**3)
        pbar.set_postfix(done=total_new, mem=f"{alloc:.2f}G/{reserved:.2f}G")
    except Exception:
        pbar.set_postfix(done=total_new)

    if sum(len(v) for v in per_template_buffers.values()) >= 8:
        for tname, buf in per_template_buffers.items():
            if buf:
                append_jsonl(per_template_targets[tname], buf)
                per_template_buffers[tname].clear()
        if new_agg_rows:
            append_jsonl(AGG_OUT, new_agg_rows)
            new_agg_rows.clear()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

pbar.close()

# Final flush
for tname, buf in per_template_buffers.items():
    if buf:
        append_jsonl(per_template_targets[tname], buf)
        per_template_buffers[tname].clear()

if new_agg_rows:
    append_jsonl(AGG_OUT, new_agg_rows)
    new_agg_rows.clear()

dt = time.time() - t0
print(f"[{TAG}] completato in {dt:.1f}s | nuove generazioni: {total_new}")
print("Risultati:", OUT_DIR.resolve())


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

[rag_bm25_top3] prompt letti: 253
[rag_bm25_top3] resume: 0 già presenti (aggregato)


rag_bm25_top3:   0%|          | 0/253 [00:00<?, ?it/s]

[rag_bm25_top3] completato in 4519.6s | nuove generazioni: 253
Risultati: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_bm25_top3
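
The resume logic in the cell above identifies each generation by the triple `(query_id, template, variant)` and skips any prompt whose key is already on disk. A minimal, model-free sketch of that idea follows; the temporary file, the three made-up rows, and the `key_of` and `pending` names are illustrative assumptions, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path

def key_of(row):
    """Resume key: (query_id, template, variant), all coerced to str."""
    return (str(row.get("query_id")), str(row.get("template")), str(row.get("variant")))

def load_done_keys(path):
    """Collect the keys of every row already written to a JSONL file."""
    done = set()
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    done.add(key_of(json.loads(line)))
    return done

# Illustrative prompts: three queries for one template and one variant.
prompts = [{"query_id": i, "template": "v1", "variant": "rag_bm25_top3"} for i in (1, 2, 3)]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "AGGREGATED_results.jsonl"
    # Simulate an interrupted run that saved only the first two rows.
    with out.open("a", encoding="utf-8") as f:
        for row in prompts[:2]:
            f.write(json.dumps(row) + "\n")
    done = load_done_keys(out)
    pending = [key_of(p) for p in prompts if key_of(p) not in done]

print(pending)
```

Coercing every key component to `str` matters here: a `query_id` stored as an integer in the prompt file and as a string in the results file would otherwise never match, and the item would be regenerated on every run.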


In [None]:
"""# === CODEGEN RAG (BM25 ONLY) with resume, OOM-safe, and progress bar ===
import os, gc, json, time
from pathlib import Path
from typing import Dict, Any, List, Tuple
import torch
from tqdm.auto import tqdm

# 1) Load the local model (same logic as the baseline)
from models.loader import load_model_and_tokenizer
from models.config import load_config

cfg = load_config()
cfg.model.model_name = "codellama/CodeLlama-7b-Instruct-hf"  # change here to use a different model
cfg.cache.root = "./cache"
tokenizer, model, used_cache = load_model_and_tokenizer(cfg)
device = next(model.parameters()).device

# 2) Generation parameters (sampling enabled to avoid the "temperature/top_p ignored" warning)
GEN_KW = dict(
    max_new_tokens=384,   # reduce to 256 for faster runs and fewer OOMs
    temperature=0.2,
    top_p=0.95,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)

# 3) I/O utilities
def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

def append_jsonl(path: Path, rows: List[Dict[str, Any]]) -> None:
    if not rows: return
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def load_done_keys(path: Path) -> set:
    # Resume key: (query_id, template, variant).
    # Kept as a plain comment: a triple-quoted docstring here would close the string that disables this cell.
    done = set()
    if path.exists():
        try:
            for obj in read_jsonl(path):
                done.add((str(obj.get("query_id")), str(obj.get("template")), str(obj.get("variant"))))
        except Exception:
            pass
    return done

# 4) Context limits and input truncation
MODEL_CTX = getattr(model.config, "max_position_embeddings", None) or getattr(tokenizer, "model_max_length", 2048)
MAX_INPUT_TOKENS = max(256, int(MODEL_CTX - GEN_KW["max_new_tokens"] - 32))

def truncate_input_for_ctx(prompt: str, max_input_tokens: int) -> str:
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(ids) <= max_input_tokens:
        return prompt
    keep_head = max(64, int(max_input_tokens * 0.25))
    keep_tail = max_input_tokens - keep_head
    ids_new = ids[:keep_head] + ids[-keep_tail:]
    return tokenizer.decode(ids_new, skip_special_tokens=True)

def generate_safe(prompt: str, gen_kw: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    stats = {"attempts": 0, "used_max_new_tokens": gen_kw.get("max_new_tokens"), "oom_retries": 0}
    prompt_trim = truncate_input_for_ctx(prompt, MAX_INPUT_TOKENS)
    kw = dict(gen_kw)
    for _ in range(3):
        stats["attempts"] += 1
        try:
            inputs = tokenizer(prompt_trim, return_tensors="pt").to(device)
            with torch.no_grad():
                out_ids = model.generate(**inputs, **kw)
            text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
            # strip the echoed input if present
            if text.startswith(prompt_trim):
                text = text[len(prompt_trim):].lstrip()
            return text, stats
        except torch.cuda.OutOfMemoryError:
            stats["oom_retries"] += 1
            kw["max_new_tokens"] = max(128, int(kw["max_new_tokens"] * 0.6))
            torch.cuda.empty_cache(); gc.collect()
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                stats["oom_retries"] += 1
                kw["max_new_tokens"] = max(128, int(kw["max_new_tokens"] * 0.6))
                torch.cuda.empty_cache(); gc.collect()
                continue
            raise
    return "", stats

# 5) BM25 only: source and destinations
BM25_PROMPTS = Path("outputs/prompts/rag_bm25_top3/RAG_rag_bm25_top3.jsonl")
if not BM25_PROMPTS.exists():
    raise FileNotFoundError(f"File prompt BM25 non trovato: {BM25_PROMPTS.resolve()}")

RESULTS_ROOT = Path("results/rag_generations")
OUT_DIR = RESULTS_ROOT / "rag_bm25_top3"
OUT_DIR.mkdir(parents=True, exist_ok=True)
AGG_OUT = OUT_DIR / "AGGREGATED_results.jsonl"

# 6) Load prompts and prepare resume
rows = read_jsonl(BM25_PROMPTS)
done_agg = load_done_keys(AGG_OUT)
print(f"[rag_bm25_top3] prompt letti: {len(rows)}")
print(f"[rag_bm25_top3] resume: {len(done_agg)} già presenti (aggregato)")

templates = sorted({str(r.get("template", "unknown")) for r in rows})
per_template_targets = {t: OUT_DIR / f"{t}_results.jsonl" for t in templates}
per_template_done = {t: load_done_keys(per_template_targets[t]) for t in templates}
buffers = {t: [] for t in templates}
new_agg: List[Dict[str, Any]] = []

# 7) Generation loop
local_count = 0
t0 = time.time()
pbar = tqdm(rows, total=len(rows), desc="rag_bm25_top3", dynamic_ncols=True)

for r in pbar:
    qid = str(r.get("query_id"))
    templ = str(r.get("template", "unknown"))  # same default as the `templates` set, so the buffer lookup cannot miss
    variant = str(r.get("variant", "rag_bm25_top3"))
    prompt = r.get("prompt") or ""
    key = (qid, templ, variant)

    if key in done_agg or key in per_template_done.get(templ, set()):
        pbar.set_postfix(done=local_count); continue

    gen_text, stats = generate_safe(prompt, GEN_KW)

    out_row = {
        "query_id": qid,
        "repo_name": r.get("repo_name"),
        "instruction": r.get("instruction"),
        "template": templ,
        "variant": variant,
        "retrieval_method": r.get("retrieval_method"),
        "k_snippets": r.get("k_snippets"),
        "snippets": r.get("snippets"),
        "prompt": prompt,
        "generation": gen_text,
        "gen_stats": stats,
        "model_name": cfg.model.model_name,
    }

    buffers[templ].append(out_row)
    new_agg.append(out_row)
    local_count += 1

    # GPU memory feedback (if available)
    try:
        alloc = torch.cuda.memory_allocated() / (1024**3)
        reserved = torch.cuda.memory_reserved() / (1024**3)
        pbar.set_postfix(done=local_count, mem=f"{alloc:.2f}G/{reserved:.2f}G")
    except Exception:
        pbar.set_postfix(done=local_count)

    # periodic flush to avoid accumulating rows in RAM
    if sum(len(v) for v in buffers.values()) >= 16:
        for tname, buf in buffers.items():
            if buf:
                append_jsonl(per_template_targets[tname], buf)
                buf.clear()
        if new_agg:
            append_jsonl(AGG_OUT, new_agg); new_agg.clear()
        torch.cuda.empty_cache(); gc.collect()

pbar.close()

# final flush
for tname, buf in buffers.items():
    if buf:
        append_jsonl(per_template_targets[tname], buf)
if new_agg:
    append_jsonl(AGG_OUT, new_agg)

dt = time.time() - t0
print(f"[rag_bm25_top3] completato in {dt:.1f}s | nuove generazioni: {local_count}")
print("Risultati:", OUT_DIR.resolve())"""


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.


[rag_bm25_top3] prompt letti: 253
[rag_bm25_top3] resume: 253 già presenti (aggregato)


rag_bm25_top3:   0%|          | 0/253 [00:00<?, ?it/s]

[rag_bm25_top3] completato in 0.0s | nuove generazioni: 0
Risultati: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_bm25_top3
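
The two BM25 runners above truncate over-long prompts differently: the active one keeps only the last `MAX_INPUT_TOKENS` tokens, while the disabled one keeps roughly the first quarter (at least 64 tokens) plus the tail. A minimal sketch of both strategies on plain integer lists, so it runs without a tokenizer; the function names, the 4096-token context, and the 3000-token prompt are illustrative assumptions rather than values read from the model.

```python
from typing import List

def input_budget(model_ctx: int, max_new_tokens: int, margin: int = 64,
                 floor: int = 512, cap: int = 1024) -> int:
    """Tokens left for the prompt, clamped to [floor, cap] as in the active runner."""
    return min(cap, max(floor, model_ctx - max_new_tokens - margin))

def keep_tail(ids: List[int], max_tokens: int) -> List[int]:
    """Tail-only truncation: drop everything before the last max_tokens tokens."""
    if len(ids) <= max_tokens:
        return list(ids)
    return ids[-max_tokens:]

def keep_head_and_tail(ids: List[int], max_tokens: int,
                       head_frac: float = 0.25, min_head: int = 64) -> List[int]:
    """Head+tail truncation: keep the start of the prompt and its end, drop the middle."""
    if len(ids) <= max_tokens:
        return list(ids)
    n_head = min(max_tokens, max(min_head, int(max_tokens * head_frac)))
    n_tail = max_tokens - n_head
    tail = ids[-n_tail:] if n_tail > 0 else []  # ids[-0:] would return the whole list
    return ids[:n_head] + tail

budget = input_budget(model_ctx=4096, max_new_tokens=200)  # illustrative context size
ids = list(range(3000))                                    # stand-in for token IDs
tail_only = keep_tail(ids, budget)
head_tail = keep_head_and_tail(ids, budget)
print(budget, len(tail_only), tail_only[0], len(head_tail), head_tail[255], head_tail[256])
```

Which strategy suits a RAG prompt depends on where the template places the task description relative to the retrieved snippets: tail-only truncation discards whatever comes first, whereas head+tail truncation discards the middle.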


#### 6.2.2  RAG WITH COSINE

In [218]:
# === CODEGEN for RAG COSINE top-3 only: OOM-safe + resume + progress ===
import os, gc, json, time
from pathlib import Path
from typing import Dict, Any, List, Tuple
import torch
from tqdm.auto import tqdm

from models.loader import load_model_and_tokenizer
from models.config import load_config

# -----------------------------
# 1) Load the local model
# -----------------------------
cfg = load_config()
cfg.model.model_name = "codellama/CodeLlama-7b-Instruct-hf"
cfg.cache.root = "./cache"
tokenizer, model, used_cache = load_model_and_tokenizer(cfg)
device = next(model.parameters()).device

torch.set_grad_enabled(False)
try:
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
except Exception:
    pass

# -----------------------------
# 2) I/O utilities
# -----------------------------
def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line=line.strip()
            if not line: 
                continue
            rows.append(json.loads(line))
    return rows

def append_jsonl(path: Path, rows: List[Dict[str, Any]]) -> None:
    if not rows: return
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def load_done_keys(path: Path) -> set:
    done = set()
    if path.exists():
        for obj in read_jsonl(path):
            done.add( (str(obj.get("query_id")), str(obj.get("template")), str(obj.get("variant"))) )
    return done

# -----------------------------
# 3) Generation config + guard rails
# -----------------------------
GEN_KW = dict(
    max_new_tokens=200,     # reduced for speed/VRAM
    do_sample=False,        # deterministic
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_time=30.0,
    use_cache=True,
)

MODEL_CTX = getattr(model.config, "max_position_embeddings", None) or getattr(tokenizer, "model_max_length", 2048)
MAX_INPUT_TOKENS = min(1024, max(512, int(MODEL_CTX - GEN_KW["max_new_tokens"] - 64)))

def truncate_prompt_tokens(prompt: str) -> str:
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= MAX_INPUT_TOKENS:
        return prompt
    ids = ids[-MAX_INPUT_TOKENS:]  # keep the tail (snippets)
    return tokenizer.decode(ids, skip_special_tokens=True)

def generate_safe(prompt: str) -> Tuple[str, Dict[str, Any]]:
    stats = {"attempts": 0, "oom_retries": 0, "used_max_new_tokens": GEN_KW["max_new_tokens"]}
    prompt_use = truncate_prompt_tokens(prompt)
    kw = dict(GEN_KW)
    for _ in range(3):
        stats["attempts"] += 1
        try:
            inputs = tokenizer(prompt_use, return_tensors="pt").to(device)
            with torch.no_grad():
                out_ids = model.generate(**inputs, **kw)
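            # Decode only the new tokens: generate() returns prompt + continuation for a
            # decoder-only model, and string prefix matching can fail after a tokenizer round-trip.
            text = tokenizer.decode(out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).lstrip()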
            del inputs, out_ids
            return text, stats
        except torch.cuda.OutOfMemoryError:
            stats["oom_retries"] += 1
            kw["max_new_tokens"] = max(128, int(kw["max_new_tokens"] * 0.8))
            stats["used_max_new_tokens"] = kw["max_new_tokens"]
            torch.cuda.empty_cache(); gc.collect()
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                stats["oom_retries"] += 1
                kw["max_new_tokens"] = max(128, int(kw["max_new_tokens"] * 0.8))
                stats["used_max_new_tokens"] = kw["max_new_tokens"]
                torch.cuda.empty_cache(); gc.collect()
                continue
            raise
    # compact fallback: retry on the last 800 characters of the prompt
    short_prompt = prompt_use[-800:]
    inputs = tokenizer(short_prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out_ids = model.generate(**inputs, **{**GEN_KW, "max_new_tokens": 160, "max_time": 25.0})
    text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
    if text.startswith(short_prompt):
        text = text[len(short_prompt):].lstrip()
    del inputs, out_ids
    return text, {**stats, "used_max_new_tokens": 160, "fallback": True}

# -----------------------------
# 4) RAG COSINE source and output folders
# -----------------------------
COSINE_RAG_FILE = Path(
    r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_cosine_top3\RAG_rag_cosine_top3.jsonl"
)
if not COSINE_RAG_FILE.exists():
    raise FileNotFoundError(f"COSINE prompt file not found: {COSINE_RAG_FILE}")

TAG = "rag_cosine_top3"
RESULTS_ROOT = Path("results/rag_generations")
OUT_DIR = RESULTS_ROOT / TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)
AGG_OUT = OUT_DIR / "AGGREGATED_results.jsonl"

rows = read_jsonl(COSINE_RAG_FILE)
print(f"[{TAG}] prompt letti: {len(rows)}")

# Resume
done_agg = load_done_keys(AGG_OUT)
print(f"[{TAG}] resume: {len(done_agg)} già presenti (aggregato)")

templates = sorted({r.get("template", "unknown") for r in rows})
per_template_targets: Dict[str, Path] = {t: OUT_DIR / f"{t}_results.jsonl" for t in templates}
per_template_done: Dict[str, set] = {t: load_done_keys(p) for t, p in per_template_targets.items()}
per_template_buffers: Dict[str, List[Dict[str, Any]]] = {t: [] for t in templates}

# -----------------------------
# 5) Generation loop (COSINE only)
# -----------------------------
total_new = 0
new_agg_rows: List[Dict[str, Any]] = []
t0 = time.time()

if torch.cuda.is_available():
    torch.cuda.empty_cache()

pbar = tqdm(rows, desc=TAG, dynamic_ncols=True)
for r in pbar:
    qid = str(r.get("query_id"))
    templ = str(r.get("template", "unknown"))  # same default as the 'templates' set, avoids a KeyError on the buffers
    variant = str(r.get("variant") or TAG)  # "rag_cosine_top3"
    prompt = r.get("prompt") or ""

    key = (qid, templ, variant)
    if key in done_agg or key in per_template_done.get(templ, set()):
        continue

    # Warn if snippets are missing (they must be present!)
    if not r.get("snippets"):
        print(f"[WARN] {templ} qid={qid}: 'snippets' field missing/empty in the COSINE prompt.")

    gen_text, stats = generate_safe(prompt)

    row_out = {
        "query_id": qid,
        "repo_name": r.get("repo_name"),
        "instruction": r.get("instruction"),
        "template": templ,
        "variant": variant,                 # "rag_cosine_top3"
        "retrieval_method": r.get("retrieval_method"),
        "k_snippets": r.get("k_snippets"),
        "snippets": r.get("snippets"),
        "prompt": prompt,
        "generation": gen_text,
        "gen_stats": stats,
        "model_name": cfg.model.model_name,
    }

    per_template_buffers[templ].append(row_out)
    new_agg_rows.append(row_out)
    total_new += 1

    # Memory telemetry & periodic flush
    try:
        alloc = torch.cuda.memory_allocated() / (1024**3)
        reserved = torch.cuda.memory_reserved() / (1024**3)
        pbar.set_postfix(done=total_new, mem=f"{alloc:.2f}G/{reserved:.2f}G")
    except Exception:
        pbar.set_postfix(done=total_new)

    if sum(len(v) for v in per_template_buffers.values()) >= 8:
        for tname, buf in per_template_buffers.items():
            if buf:
                append_jsonl(per_template_targets[tname], buf)
                per_template_buffers[tname].clear()
        if new_agg_rows:
            append_jsonl(AGG_OUT, new_agg_rows)
            new_agg_rows.clear()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

pbar.close()

# Final flush
for tname, buf in per_template_buffers.items():
    if buf:
        append_jsonl(per_template_targets[tname], buf)
        per_template_buffers[tname].clear()

if new_agg_rows:
    append_jsonl(AGG_OUT, new_agg_rows)
    new_agg_rows.clear()

dt = time.time() - t0
print(f"[{TAG}] completato in {dt:.1f}s | nuove generazioni: {total_new}")
print("Risultati:", OUT_DIR.resolve())


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.


[rag_cosine_top3] prompt letti: 253
[rag_cosine_top3] resume: 8 già presenti (aggregato)


rag_cosine_top3:   0%|          | 0/253 [00:00<?, ?it/s]

[rag_cosine_top3] completato in 7890.2s | nuove generazioni: 245
Risultati: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_cosine_top3
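
The cosine cell's `generate_safe` makes up to three attempts, multiplying `max_new_tokens` by 0.8 after each out-of-memory error with a floor of 128, before falling back to a shortened prompt. The sketch below reproduces only that arithmetic; `backoff_schedule` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
from typing import List


def backoff_schedule(start: int, factor: float, floor: int, attempts: int) -> List[int]:
    """Return the max_new_tokens budget used on each attempt, shrinking after every OOM."""
    budgets = [start]
    for _ in range(attempts - 1):
        # Same update as in generate_safe: scale down, truncate to int, never go below the floor.
        budgets.append(max(floor, int(budgets[-1] * factor)))
    return budgets


# Values taken from the cosine cell: start=200, factor=0.8, floor=128, three attempts.
schedule = backoff_schedule(start=200, factor=0.8, floor=128, attempts=3)
```

Because the floor is reached on the third attempt, additional retries would not reduce the budget further, which is presumably why the cell switches to a truncated prompt instead.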


#### 6.2.3  RAG WITH HYBRID 

In [220]:
# === CODEGEN for RAG HYBRID only: OOM-safe + resume + progress ===
import os, gc, json, time
from pathlib import Path
from typing import Dict, Any, List, Tuple
import torch
from tqdm.auto import tqdm

# -----------------------------
# 1) Load the local model (same as baseline)
# -----------------------------
from models.loader import load_model_and_tokenizer
from models.config import load_config

cfg = load_config()
cfg.model.model_name = "codellama/CodeLlama-7b-Instruct-hf"   # change if desired
cfg.cache.root = "./cache"
tokenizer, model, used_cache = load_model_and_tokenizer(cfg)
device = next(model.parameters()).device

torch.set_grad_enabled(False)
try:
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
except Exception:
    pass

# -----------------------------
# 2) I/O utilities
# -----------------------------
def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line=line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

def append_jsonl(path: Path, rows: List[Dict[str, Any]]) -> None:
    if not rows: return
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def load_done_keys(path: Path) -> set:
    """Resume key: (query_id, template, variant)."""
    done = set()
    if path.exists():
        try:
            for obj in read_jsonl(path):
                done.add((str(obj.get("query_id")), str(obj.get("template")), str(obj.get("variant"))))
        except Exception:
            pass  # unreadable file: continue with the keys collected so far (rows may be regenerated)
    return done

def method_tag_from_filename(p: Path) -> str:
    # "RAG_rag_hybrid_top5.jsonl" -> "rag_hybrid_top5"
    name = p.stem
    return name[len("RAG_"):] if name.startswith("RAG_") else name

# -----------------------------
# 3) Generation config + guard rails (reduced for speed/VRAM)
# -----------------------------
GEN_KW = dict(
    max_new_tokens=200,      # shorter for speed/VRAM
    do_sample=False,         # deterministic
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_time=30.0,           # timeout per sample
    use_cache=True,
)

MODEL_CTX = getattr(model.config, "max_position_embeddings", None) or getattr(tokenizer, "model_max_length", 2048)
MAX_INPUT_TOKENS = min(1024, max(512, int(MODEL_CTX - GEN_KW["max_new_tokens"] - 64)))

def truncate_prompt_tokens(prompt: str) -> str:
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= MAX_INPUT_TOKENS:
        return prompt
    # keep the tail: it usually contains the snippets
    ids = ids[-MAX_INPUT_TOKENS:]
    return tokenizer.decode(ids, skip_special_tokens=True)

def generate_safe(prompt: str) -> Tuple[str, Dict[str, Any]]:
    """Retry on OOM with a lower max_new_tokens; use a compact fallback if needed."""
    stats = {"attempts": 0, "oom_retries": 0, "used_max_new_tokens": GEN_KW["max_new_tokens"]}
    prompt_use = truncate_prompt_tokens(prompt)
    kw = dict(GEN_KW)
    for _ in range(3):
        stats["attempts"] += 1
        try:
            inputs = tokenizer(prompt_use, return_tensors="pt").to(device)
            with torch.no_grad():
                out_ids = model.generate(**inputs, **kw)
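            # Decode only the new tokens: generate() returns prompt + continuation for a
            # decoder-only model, and string prefix matching can fail after a tokenizer round-trip.
            text = tokenizer.decode(out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).lstrip()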
            del inputs, out_ids
            return text, stats
        except torch.cuda.OutOfMemoryError:
            stats["oom_retries"] += 1
            kw["max_new_tokens"] = max(128, int(kw["max_new_tokens"] * 0.8))
            stats["used_max_new_tokens"] = kw["max_new_tokens"]
            torch.cuda.empty_cache(); gc.collect()
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                stats["oom_retries"] += 1
                kw["max_new_tokens"] = max(128, int(kw["max_new_tokens"] * 0.8))
                stats["used_max_new_tokens"] = kw["max_new_tokens"]
                torch.cuda.empty_cache(); gc.collect()
                continue
            raise
    # ultra-compact fallback: retry on the last 800 characters of the prompt
    short_prompt = prompt_use[-800:]
    inputs = tokenizer(short_prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out_ids = model.generate(**inputs, **{**GEN_KW, "max_new_tokens": 160, "max_time": 25.0})
    text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
    if text.startswith(short_prompt):
        text = text[len(short_prompt):].lstrip()
    del inputs, out_ids
    return text, {**stats, "used_max_new_tokens": 160, "fallback": True}

# -----------------------------
# 4) RAG HYBRID source and output folders
# -----------------------------
HYBRID_RAG_FILE = Path(
    r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top3\RAG_rag_hybrid_top5.jsonl"
)
# Alternatively, if you use the "top5" variant:
# HYBRID_RAG_FILE = Path(
#     r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_hybrid_top5\RAG_rag_hybrid_top5.jsonl"
# )

if not HYBRID_RAG_FILE.exists():
    raise FileNotFoundError(f"HYBRID prompt file not found: {HYBRID_RAG_FILE}")

TAG = method_tag_from_filename(HYBRID_RAG_FILE)  # e.g. "rag_hybrid_top5" or "rag_hybrid_top3"
RESULTS_ROOT = Path("results/rag_generations")
OUT_DIR = RESULTS_ROOT / TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)
AGG_OUT = OUT_DIR / "AGGREGATED_results.jsonl"

rows = read_jsonl(HYBRID_RAG_FILE)
print(f"[{TAG}] prompt letti: {len(rows)}")

# Aggregated resume
done_agg = load_done_keys(AGG_OUT)
print(f"[{TAG}] resume: {len(done_agg)} già presenti (aggregato)")

# Per-template targets + their resume keys
templates = sorted({str(r.get("template", "unknown")) for r in rows})
per_template_targets: Dict[str, Path] = {t: OUT_DIR / f"{t}_results.jsonl" for t in templates}
per_template_done: Dict[str, set] = {t: load_done_keys(p) for t, p in per_template_targets.items()}
per_template_buffers: Dict[str, List[Dict[str, Any]]] = {t: [] for t in templates}

# -----------------------------
# 5) Generation loop (HYBRID only)
# -----------------------------
total_new = 0
new_agg_rows: List[Dict[str, Any]] = []
t0 = time.time()

if torch.cuda.is_available():
    torch.cuda.empty_cache()

pbar = tqdm(rows, desc=TAG, dynamic_ncols=True)
for r in pbar:
    qid = str(r.get("query_id"))
    templ = str(r.get("template", "unknown"))  # same default as the 'templates' set
    variant = str(r.get("variant") or TAG)  # should be "rag_hybrid_*"
    prompt = r.get("prompt") or ""
    key = (qid, templ, variant)

    # Resume: skip rows already generated (checked first so they do not trigger warnings)
    if key in done_agg or key in per_template_done.get(templ, set()):
        continue

    # Warn if snippets are missing (they must be present!)
    if not r.get("snippets"):
        print(f"[WARN] {templ} qid={qid}: 'snippets' field missing/empty in the HYBRID prompt.")

    gen_text, stats = generate_safe(prompt)

    row_out = {
        "query_id": qid,
        "repo_name": r.get("repo_name"),
        "instruction": r.get("instruction"),
        "template": templ,
        "variant": variant,
        "retrieval_method": r.get("retrieval_method"),
        "k_snippets": r.get("k_snippets"),
        "snippets": r.get("snippets"),
        "prompt": prompt,
        "generation": gen_text,
        "gen_stats": stats,
        "model_name": cfg.model.model_name,
    }

    per_template_buffers[templ].append(row_out)
    new_agg_rows.append(row_out)
    total_new += 1

    # GPU memory feedback (if available)
    try:
        alloc = torch.cuda.memory_allocated() / (1024**3)
        reserved = torch.cuda.memory_reserved() / (1024**3)
        pbar.set_postfix(done=total_new, mem=f"{alloc:.2f}G/{reserved:.2f}G")
    except Exception:
        pbar.set_postfix(done=total_new)

    # periodic flush
    if sum(len(v) for v in per_template_buffers.values()) >= 8:  # flush more frequently
        for tname, buf in per_template_buffers.items():
            if buf:
                append_jsonl(per_template_targets[tname], buf)
                per_template_buffers[tname].clear()
        if new_agg_rows:
            append_jsonl(AGG_OUT, new_agg_rows)
            new_agg_rows.clear()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

pbar.close()

# final flush
for tname, buf in per_template_buffers.items():
    if buf:
        append_jsonl(per_template_targets[tname], buf)
        per_template_buffers[tname].clear()

if new_agg_rows:
    append_jsonl(AGG_OUT, new_agg_rows)
    new_agg_rows.clear()

dt = time.time() - t0
print(f"[{TAG}] completato in {dt:.1f}s | nuove generazioni: {total_new}")
print("Risultati:", OUT_DIR.resolve())


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

[rag_hybrid_top5] prompt letti: 253
[rag_hybrid_top5] resume: 0 già presenti (aggregato)


rag_hybrid_top5:   0%|          | 0/253 [00:00<?, ?it/s]

[rag_hybrid_top5] completato in 4349.7s | nuove generazioni: 253
Risultati: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_hybrid_top5
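
The cosine and hybrid cells cap the prompt at `MAX_INPUT_TOKENS = min(1024, max(512, MODEL_CTX - max_new_tokens - 64))` and, when a prompt is longer, keep only its last tokens. The sketch below works through that arithmetic on plain integers. `input_token_budget` and `keep_tail` are hypothetical helper names, and the context sizes are illustrative values, not measurements from this run.

```python
from typing import List


def input_token_budget(model_ctx: int, max_new_tokens: int, margin: int = 64,
                       lower: int = 512, upper: int = 1024) -> int:
    """Clamp the space left after reserving output tokens and a safety margin."""
    return min(upper, max(lower, model_ctx - max_new_tokens - margin))


def keep_tail(token_ids: List[int], budget: int) -> List[int]:
    """Drop tokens from the head so that at most `budget` tokens remain."""
    return token_ids if len(token_ids) <= budget else token_ids[-budget:]


budget_large = input_token_budget(model_ctx=16384, max_new_tokens=200)  # capped by `upper`
budget_small = input_token_budget(model_ctx=700, max_new_tokens=200)    # raised to `lower`
tail = keep_tail(list(range(10)), budget=4)
```

Two consequences are worth keeping in mind. With a small context window the lower bound of 512 can exceed the room actually left (512 + 200 > 700 in the second call). Tail truncation also discards the beginning of the prompt, so whatever a template places first is the first thing lost.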


#### 6.2.4  RAG WITH MULTIHOP V1: DECOMPOSITION-FIRST 

In [None]:
# === CODEGEN for RAG MultiHop only: DECOMPOSITION-FIRST ===
import json, time, gc
from pathlib import Path
from typing import Dict, Any, List
from tqdm import tqdm

import torch
from models.loader import load_model_and_tokenizer
from models.config import load_config

# -------------------------------------------------
# 1) MULTIHOP PROMPT PATH (DECOMPOSITION)
#    Per the log: aggregated file without the "RAG_" prefix.
# -------------------------------------------------
RAG_FILE = Path(
    r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_decomposition_top4\rag_multihop_decomposition_top4.jsonl"
)
if not RAG_FILE.exists():
    raise FileNotFoundError(f"MultiHop (decomposition) prompt file not found: {RAG_FILE}")

# -------------------------------------------------
# 2) Load the model (same as baseline)
# -------------------------------------------------
cfg = load_config()
cfg.model.model_name = "codellama/CodeLlama-7b-Instruct-hf"   # change if desired
cfg.cache.root = "./cache"
tokenizer, model, used_cache = load_model_and_tokenizer(cfg)

# -------------------------------------------------
# 3) I/O utilities
# -------------------------------------------------
def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rows.append(json.loads(line))
    return rows

def append_jsonl(path: Path, rows: List[Dict[str, Any]]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def load_done_keys(path: Path) -> set:
    done = set()
    if path.exists():
        for obj in read_jsonl(path):
            done.add((str(obj.get("query_id")), str(obj.get("template")), str(obj.get("variant"))))
    return done

def tag_from_filename(p: Path) -> str:
    # Use the file stem directly (there is no "RAG_" prefix here)
    return p.stem

# -------------------------------------------------
# 4) Generation config + guard rails
# -------------------------------------------------
GEN_KW = dict(
    max_new_tokens=320,
    do_sample=False,  # deterministic
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_time=45.0,    # timeout per sample
)

MAX_INPUT_TOKENS = 1536  # leave room for the output to avoid OOM

def truncate_prompt_tokens(prompt: str) -> str:
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= MAX_INPUT_TOKENS:
        return prompt
    ids = ids[-MAX_INPUT_TOKENS:]  # keep the tail (it often contains the snippets)
    return tokenizer.decode(ids, skip_special_tokens=True)

# -------------------------------------------------
# 5) Load prompts & prepare output
# -------------------------------------------------
rows = read_jsonl(RAG_FILE)
TAG = tag_from_filename(RAG_FILE)  # e.g. "rag_multihop_decomposition_top4"
print(f"[{TAG}] prompt letti: {len(rows)}")

RESULTS_ROOT = Path("results/rag_generations")
OUT_DIR = RESULTS_ROOT / TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)
AGG_OUT = OUT_DIR / "AGGREGATED_results.jsonl"

done_agg = load_done_keys(AGG_OUT)
print(f"[{TAG}] resume: {len(done_agg)} già presenti (aggregato)")

templates = sorted({r.get("template", "unknown") for r in rows})
per_template_targets = {t: OUT_DIR / f"{t}_results.jsonl" for t in templates}
per_template_done = {t: load_done_keys(p) for t, p in per_template_targets.items()}
per_template_buffers = {t: [] for t in templates}

# -------------------------------------------------
# 6) Generation loop
# -------------------------------------------------
total_new = 0
new_agg_rows: List[Dict[str, Any]] = []
t0 = time.time()

if torch.cuda.is_available():
    torch.cuda.empty_cache()

for r in tqdm(rows, desc=TAG, ascii=True):
    qid = str(r.get("query_id"))
    templ = str(r.get("template", "unknown"))  # same default as the 'templates' set
    variant = str(r.get("variant") or TAG)  # should be "rag_multihop_decomposition"
    prompt = r.get("prompt") or ""

    key = (qid, templ, variant)
    if key in done_agg or key in per_template_done.get(templ, set()):
        continue

    prompt_use = truncate_prompt_tokens(prompt)

    try:
        inputs = tokenizer(prompt_use, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out_ids = model.generate(**inputs, **GEN_KW)
        gen_text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
        if gen_text.startswith(prompt_use):
            gen_text = gen_text[len(prompt_use):].lstrip()
    except torch.cuda.OutOfMemoryError:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        short_prompt = prompt_use[-800:]
        inputs = tokenizer(short_prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out_ids = model.generate(**inputs, **{**GEN_KW, "max_new_tokens": 200, "max_time": 30.0})
        gen_text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
        if gen_text.startswith(short_prompt):
            gen_text = gen_text[len(short_prompt):].lstrip()
    except Exception as e:
        gen_text = f"<<GENERATION_ERROR: {e!r}>>"

    row_out = {
        "query_id": qid,
        "repo_name": r.get("repo_name"),
        "instruction": r.get("instruction"),
        "template": templ,
        "variant": variant,
        "retrieval_method": r.get("retrieval_method"),
        "k_snippets": r.get("k_snippets"),
        "snippets": r.get("snippets"),
        "prompt": prompt,
        "generation": gen_text,
        "model_name": cfg.model.model_name,
    }

    per_template_buffers[templ].append(row_out)
    new_agg_rows.append(row_out)
    total_new += 1

    if total_new % 16 == 0:
        for tname, buf in per_template_buffers.items():
            if buf:
                append_jsonl(per_template_targets[tname], buf)
                per_template_buffers[tname].clear()
        if new_agg_rows:
            append_jsonl(AGG_OUT, new_agg_rows)
            new_agg_rows.clear()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

# final flush
for tname, buf in per_template_buffers.items():
    if buf:
        append_jsonl(per_template_targets[tname], buf)
        per_template_buffers[tname].clear()
if new_agg_rows:
    append_jsonl(AGG_OUT, new_agg_rows)

dt = time.time() - t0
print(f"[{TAG}] completato in {dt:.1f}s | nuove generazioni: {total_new}")
print("Risultati:", OUT_DIR.resolve())


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

[rag_multihop_decomposition_top4] prompt letti: 253
[rag_multihop_decomposition_top4] resume: 0 già presenti (aggregato)


rag_multihop_decomposition_top4: 100%|##########| 253/253 [3:06:38<00:00, 44.26s/it]  

[rag_multihop_decomposition_top4] completato in 11198.9s | nuove generazioni: 253
Risultati: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_multihop_decomposition_top4
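
The completion lines logged for the cosine, hybrid, and decomposition runs give total wall time and the number of new generations, so the average cost per generation can be derived directly. The sketch below does only that division, using the figures printed in the logs above. The runs are not directly comparable, since the decomposition cell uses `max_new_tokens=320`, `max_time=45.0`, and a 1536-token input cap, while the cosine and hybrid cells use 200, 30.0, and at most 1024.

```python
# (total seconds, new generations) as reported in the notebook outputs above
runs = {
    "rag_cosine_top3": (7890.2, 245),
    "rag_hybrid_top5": (4349.7, 253),
    "rag_multihop_decomposition_top4": (11198.9, 253),
}

# Average wall-clock seconds per new generation, rounded to two decimals.
seconds_per_generation = {
    tag: round(total / count, 2) for tag, (total, count) in runs.items()
}

slowest = max(seconds_per_generation, key=seconds_per_generation.get)
```

The decomposition average agrees with the 44.26 s/it shown by its progress bar. It also sits close to that cell's 45 s `max_time` limit, which may indicate that many generations ran until the timeout rather than stopping at an end-of-sequence token.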





#### 6.2.5  RAG WITH MULTIHOP: ITERATIVE-REFINE


In [221]:
# === CODEGEN for RAG MultiHop only: ITERATIVE-REFINE (with fix for missing snippets) ===
import json, time, gc, re
from pathlib import Path
from typing import Dict, Any, List, Tuple
from tqdm.auto import tqdm

import torch
from models.loader import load_model_and_tokenizer
from models.config import load_config

# -------------------------------------------------
# 1) FILE PATHS
# -------------------------------------------------
PROMPTS_JSONL = Path(
    r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\prompts\rag_multihop_iterative_top4\rag_multihop_iterative_top4.jsonl"
)
RETRIEVAL_JSONL = Path(
    r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\multihop\mh_iterative_topk_k3.jsonl"
)

if not PROMPTS_JSONL.exists():
    raise FileNotFoundError(f"MultiHop (iterative_refine) prompt file not found: {PROMPTS_JSONL}")
if not RETRIEVAL_JSONL.exists():
    raise FileNotFoundError(f"MultiHop (iterative_refine) retrieval file not found: {RETRIEVAL_JSONL}")

# -------------------------------------------------
# 2) Load the model (same as baseline/BM25)
# -------------------------------------------------
cfg = load_config()
cfg.model.model_name = "codellama/CodeLlama-7b-Instruct-hf"   # change if desired
cfg.cache.root = "./cache"
tokenizer, model, used_cache = load_model_and_tokenizer(cfg)
device = next(model.parameters()).device
torch.set_grad_enabled(False)
try:
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
except Exception:
    pass

# -------------------------------------------------
# 3) I/O utilities + normalization
# -------------------------------------------------
def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

def append_jsonl(path: Path, rows: List[Dict[str, Any]]) -> None:
    if not rows: return
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def load_done_keys(path: Path) -> set:
    done = set()
    if path.exists():
        for obj in read_jsonl(path):
            done.add((str(obj.get("query_id")), str(obj.get("template")), str(obj.get("variant"))))
    return done

def take_top_k(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    if not isinstance(results, list): return []
    # sort by descending score if available, otherwise keep the original order
    try:
        return sorted(results, key=lambda x: float(x.get("score", 0.0)), reverse=True)[:k]
    except Exception:
        return results[:k]

def clean_text(t: str) -> str:
    # Strip surrounding whitespace from the injected snippets (no markdown cleanup is done here)
    return t.strip()

# Append the snippets to the end of the prompt if they are not already present
def inject_snippets_into_prompt(prompt: str, snippets: List[Dict[str, Any]]) -> str:
    if not snippets:
        return prompt
    # If the prompt already contains a "Context Snippets" section, do not duplicate it
    if re.search(r"(?i)context\s+snippets", prompt):
        return prompt
    blocks = []
    for i, h in enumerate(snippets, 1):
        txt = clean_text(h.get("text", ""))
        doc = str(h.get("doc_id") or "")
        sc  = h.get("score")
        head = f"[{i}] doc_id={doc}" + (f" | score={sc:.4f}" if isinstance(sc, (int,float)) else "")
        blocks.append(head + "\n" + txt)
    ctx = "\n\n### Context Snippets\n" + "\n\n---\n\n".join(blocks) + "\n"
    # Append at the end: keep the original prompt, then add the context
    return prompt + ctx

# -------------------------------------------------
# 4) Load MultiHop retrieval results and build a lookup by query_id
# -------------------------------------------------
retrieval_rows = read_jsonl(RETRIEVAL_JSONL)
# expected fields: query_id, repo_name, instruction, results=[{doc_id,text,score,...}]
lookup_hits: Dict[str, List[Dict[str, Any]]] = {}
for obj in retrieval_rows:
    qid = str(obj.get("query_id"))
    res = obj.get("results") or []
    lookup_hits[qid] = res

# -------------------------------------------------
# 5) Generation config + guard rails
# -------------------------------------------------
GEN_KW = dict(
    max_new_tokens=320,
    do_sample=False,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_time=45.0,
)
MODEL_CTX = getattr(model.config, "max_position_embeddings", None) or getattr(tokenizer, "model_max_length", 2048)
MAX_INPUT_TOKENS = max(256, int(MODEL_CTX - GEN_KW["max_new_tokens"] - 32))

def truncate_prompt_tokens(prompt: str) -> str:
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= MAX_INPUT_TOKENS:
        return prompt
    # keep the tail (the snippets are typically at the end)
    ids = ids[-MAX_INPUT_TOKENS:]
    return tokenizer.decode(ids, skip_special_tokens=True)

def generate_safe(prompt: str) -> Tuple[str, Dict[str, Any]]:
    stats = {"attempts": 0, "oom_retries": 0, "used_max_new_tokens": GEN_KW["max_new_tokens"]}
    prompt_use = truncate_prompt_tokens(prompt)
    kw = dict(GEN_KW)
    for _ in range(3):
        stats["attempts"] += 1
        try:
            inputs = tokenizer(prompt_use, return_tensors="pt").to(device)
            with torch.no_grad():
                out_ids = model.generate(**inputs, **kw)
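            # Decode only the new tokens: generate() returns prompt + continuation for a
            # decoder-only model, and string prefix matching can fail after a tokenizer round-trip.
            text = tokenizer.decode(out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).lstrip()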
            del inputs, out_ids
            return text, stats
        except torch.cuda.OutOfMemoryError:
            stats["oom_retries"] += 1
            kw["max_new_tokens"] = max(160, int(kw["max_new_tokens"] * 0.6))
            stats["used_max_new_tokens"] = kw["max_new_tokens"]
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            gc.collect()
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                stats["oom_retries"] += 1
                kw["max_new_tokens"] = max(160, int(kw["max_new_tokens"] * 0.6))
                stats["used_max_new_tokens"] = kw["max_new_tokens"]
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                gc.collect()
                continue
            raise
    # ultra-compact fallback: retry on the last 800 characters of the prompt
    short_prompt = prompt_use[-800:]
    inputs = tokenizer(short_prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out_ids = model.generate(**inputs, **{**GEN_KW, "max_new_tokens": 160, "max_time": 30.0})
    text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
    if text.startswith(short_prompt):
        text = text[len(short_prompt):].lstrip()
    del inputs, out_ids
    return text, {**stats, "used_max_new_tokens": 160, "fallback": True}

# -------------------------------------------------
# 6) Load prompts, restore missing snippets, generate
# -------------------------------------------------
rows = read_jsonl(PROMPTS_JSONL)
TAG = PROMPTS_JSONL.stem  # "rag_multihop_iterative_top4"
print(f"[{TAG}] prompt letti: {len(rows)}")

RESULTS_ROOT = Path("results/rag_generations")
OUT_DIR = RESULTS_ROOT / TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)
AGG_OUT = OUT_DIR / "AGGREGATED_results.jsonl"

done_agg = load_done_keys(AGG_OUT)
print(f"[{TAG}] resume: {len(done_agg)} già presenti (aggregato)")

templates = sorted({str(r.get("template", "unknown")) for r in rows})
per_template_targets = {t: OUT_DIR / f"{t}_results.jsonl" for t in templates}
per_template_done = {t: load_done_keys(p) for t, p in per_template_targets.items()}
buffers = {t: [] for t in templates}

total_new = 0
new_agg_rows: List[Dict[str, Any]] = []
t0 = time.time()

if torch.cuda.is_available():
    torch.cuda.empty_cache()

pbar = tqdm(rows, desc=TAG, dynamic_ncols=True)
for r in pbar:
    qid   = str(r.get("query_id"))
    templ = str(r.get("template"))
    variant = str(r.get("variant") or TAG)  # e.g. "rag_multihop_iterative_top4"
    prompt = r.get("prompt") or ""
    k_snip = int(r.get("k_snippets") or 4)

    key = (qid, templ, variant)
    if key in done_agg or key in per_template_done.get(templ, set()):
        continue

    # --- FIX: restore 'snippets' when missing or empty ---
    snippets = r.get("snippets") or []
    if not snippets:
        src_hits = lookup_hits.get(qid) or []
        snippets = take_top_k(src_hits, k_snip)
        # also inject the snippets into the prompt so the model actually sees them
        prompt = inject_snippets_into_prompt(prompt, snippets)

    gen_text, stats = generate_safe(prompt)

    row_out = {
        "query_id": qid,
        "repo_name": r.get("repo_name"),
        "instruction": r.get("instruction"),
        "template": templ,
        "variant": variant,
        "retrieval_method": r.get("retrieval_method"),
        "k_snippets": k_snip,
        "snippets": snippets,          # <- guaranteed to be populated here
        "prompt": prompt,              # <- prompt with the Context Snippets section, if it had to be injected
        "generation": gen_text,
        "gen_stats": stats,
        "model_name": cfg.model.model_name,
    }

    buffers[templ].append(row_out)
    new_agg_rows.append(row_out)
    total_new += 1

    # periodic flush
    if sum(len(v) for v in buffers.values()) >= 16:
        for tname, buf in buffers.items():
            if buf:
                append_jsonl(per_template_targets[tname], buf)
                buffers[tname].clear()
        if new_agg_rows:
            append_jsonl(AGG_OUT, new_agg_rows)
            new_agg_rows.clear()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

pbar.close()

# final flush
for tname, buf in buffers.items():
    if buf:
        append_jsonl(per_template_targets[tname], buf)
        buffers[tname].clear()

if new_agg_rows:
    append_jsonl(AGG_OUT, new_agg_rows)
    new_agg_rows.clear()

dt = time.time() - t0
print(f"[{TAG}] completato in {dt:.1f}s | nuove generazioni: {total_new}")
print("Risultati:", OUT_DIR.resolve())


Some parameters are on the meta device because they were offloaded to the cpu.


[rag_multihop_iterative_top4] prompt letti: 253
[rag_multihop_iterative_top4] resume: 0 già presenti (aggregato)


[rag_multihop_iterative_top4] completato in 11657.9s | nuove generazioni: 253
Risultati: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_multihop_iterative_top4
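
The loop above depends on `load_done_keys` and `append_jsonl`, which are defined elsewhere in the notebook. As a rough sketch of how such resume-safe helpers could look (the bodies below are assumptions for illustration, not the notebook's actual implementations), the resume key is the same `(query_id, template, variant)` triple used in the loop:

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, Set, Tuple

Key = Tuple[str, str, str]

def load_done_keys(path: Path) -> Set[Key]:
    """Collect (query_id, template, variant) triples already stored in a JSONL file."""
    done: Set[Key] = set()
    if not path.exists():
        return done
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written last line
            done.add((str(row.get("query_id")), str(row.get("template")), str(row.get("variant"))))
    return done

def append_jsonl(path: Path, rows: Iterable[Dict[str, Any]]) -> None:
    """Append rows to a JSONL file, one JSON object per line."""
    with path.open("a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Demo on a temporary file (hypothetical ids and values)
demo_dir = Path(tempfile.mkdtemp())
out_file = demo_dir / "AGGREGATED_results.jsonl"
append_jsonl(out_file, [{"query_id": "q1", "template": "v1", "variant": "demo", "generation": "print(1)"}])

done = load_done_keys(out_file)
todo = [("q1", "v1", "demo"), ("q2", "v1", "demo")]
pending = [k for k in todo if k not in done]  # only the work that is not on disk yet
print(pending)
```

Because rows are appended only at flush time, an interrupted run loses at most the rows still sitting in the in-memory buffers; a rerun then regenerates just those.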


In [262]:
# === Write the final tables as percentages ===
import os, sys, json, math
from pathlib import Path

# Make sure the package is importable
if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())
try:
    from Code_evaluation.codebleu_eval import compute_codebleu_for_generations, export_summary_csv
except ImportError:
    # fall back to the lowercase package name
    from code_evaluation.codebleu_eval import compute_codebleu_for_generations, export_summary_csv

# ---- Paths (adjust if needed) ----
GENERATIONS_PATH = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\_all_baseline.jsonl"
DATASET_PATH     = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\data\lca_test_filtered.jsonl"
OUT_DIR_BASE     = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\baseline"

Path(OUT_DIR_BASE).mkdir(parents=True, exist_ok=True)

# ---- Run / re-run the evaluation (baseline rows only) ----
per_instance_raw, summary_raw = compute_codebleu_for_generations(
    generations_path=GENERATIONS_PATH,
    dataset_path=DATASET_PATH,
    lang="python",
    prompt_field="template",
    variant_field="variant",
    variant_value="baseline",
    out_dir=None  # nothing is saved here; the files are written below in the required format
)

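# ---- Helper: robust conversion to a percentage ----
def to_percent(x):
    """Return x as a percentage, or None if x is missing, NaN, or not numeric."""
    if x is None:
        return None
    try:
        v = float(x)
    except (TypeError, ValueError):
        return None
    if math.isnan(v):
        return None
    # values in [0, 1] are treated as fractions and scaled to percent
    if 0.0 <= v <= 1.0:
        return v * 100.0
    # anything else is assumed to be a percentage already and is returned unchanged
    # (no clamping to 100 is applied; values above 100 are not expected)
    return v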

# ---- Build codebleu_per_instance.jsonl (253 rows expected) ----
# The 0.25/0.25/0.25/0.25 weighting is used as the main "codebleu" score.
per_instance_final = []
for r in per_instance_raw:
    out = {
        "query_id":   r.get("query_id"),
        "repo_name":  r.get("repo_name"),
        "prompt_type":r.get("prompt_type"),
        "model_name": r.get("model_name"),
        "codebleu":   to_percent(r.get("codebleu_a025")),
        "bleu":       to_percent(r.get("bleu_a025")),
        "w_bleu":     to_percent(r.get("w_bleu_a025")),
        "ast":        to_percent(r.get("ast_a025")),
        "df":         to_percent(r.get("df_a025")),
        "error_reason": r.get("error_reason"),
    }
    per_instance_final.append(out)

per_instance_path = os.path.join(OUT_DIR_BASE, "codebleu_per_instance.jsonl")
with open(per_instance_path, "w", encoding="utf-8") as f:
    for row in per_instance_final:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# ---- Build summary_by_prompt.json (11 rows) as percentages ----
summary_final = []
for s in summary_raw:
    summary_final.append({
        "prompt_type": s.get("prompt_type"),
        "count":       s.get("count"),
        "mean":        to_percent(s.get("mean_codebleu_025")),
        "min":         to_percent(s.get("min_codebleu_025")),
        "max":         to_percent(s.get("max_codebleu_025")),
        "std":         to_percent(s.get("std_codebleu_025")),
    })

summary_path = os.path.join(OUT_DIR_BASE, "summary_by_prompt.json")
with open(summary_path, "w", encoding="utf-8") as f:
    json.dump(summary_final, f, ensure_ascii=False, indent=2)

print("[OK] Salvati:")
print(" -", per_instance_path)
print(" -", summary_path)

# ---- (Optional) print two sample rows ----
print("\nEsempio per_instance:")
print(per_instance_final[0] if per_instance_final else None)

print("\nEsempio summary:")
print(summary_final[0] if summary_final else None)


[OK] Salvati:
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\baseline\codebleu_per_instance.jsonl
 - C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\baseline\summary_by_prompt.json

Esempio per_instance:
{'query_id': 'seed-emulator__000000', 'repo_name': 'seed-emulator', 'prompt_type': 'v1', 'model_name': 'codellama/CodeLlama-7b-Instruct-hf', 'codebleu': 12.706925473630774, 'bleu': 12.706925473630774, 'w_bleu': None, 'ast': None, 'df': None, 'error_reason': None}

Esempio summary:
{'prompt_type': 'v1', 'count': 23, 'mean': 2.633914445340188, 'min': 1.8548169447666272, 'max': 18.519668477565148, 'std': 3.9425483532272927}
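
The `to_percent` helper in the cell above treats any value in [0, 1] as a fraction and passes everything else through unchanged. A compact restatement of that rule (same logic, rewritten here only for illustration) makes the edge cases easy to check:

```python
import math

def to_percent(x):
    """Scale fractions in [0, 1] to percent; pass other numbers through; None for bad input."""
    if x is None:
        return None
    try:
        v = float(x)
    except (TypeError, ValueError):
        return None
    if math.isnan(v):
        return None
    return v * 100.0 if 0.0 <= v <= 1.0 else v

print(to_percent(0.5))           # fraction: scaled by 100
print(to_percent(12.5))          # outside [0, 1]: returned unchanged
print(to_percent(1.0))           # boundary value: still treated as a fraction
print(to_percent("n/a"))         # not numeric: None
print(to_percent(float("nan")))  # NaN: None
```

The boundary case is the one to keep in mind: a score that was already expressed as a percentage and happens to be at most 1 would be scaled a second time, so the helper is only safe when all inputs share the same scale.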


For each of the 23 queries there are several retrieval strategies. Each combination of a query and the top-k snippets found by a given retrieval strategy is paired with the different prompt versions. We need results for every retrieval category (BASELINE, BM25, COSINE, HYBRID, MULTIHOP V1, MULTIHOP V2) and, within each category, we need to see which prompt strategy performs best. The results should then be discussed.

To do: pass a few rows of the JSONL file with the generated code; pass a few rows of the dataset so that the columns to use are known; pass the path of the dataset and the path of the JSONL file with the generated code. Then ask for the mean of each metric, together with the highest and the lowest result.
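
The statistics requested above (mean, highest, and lowest score per retrieval category and prompt version) reduce to a single `groupby` once the per-instance scores sit in one table. A minimal sketch with pandas, assuming columns named `variant`, `prompt_version`, and `codebleu`; the values below are invented purely for illustration and are not results from this notebook:

```python
import pandas as pd

# Hypothetical per-instance scores (made-up numbers, two variants x two prompts)
scores = pd.DataFrame({
    "variant":        ["baseline", "baseline", "baseline", "baseline", "bm25", "bm25", "bm25", "bm25"],
    "prompt_version": ["v1", "v1", "v2", "v2", "v1", "v1", "v2", "v2"],
    "codebleu":       [0.10, 0.30, 0.20, 0.40, 0.25, 0.35, 0.05, 0.15],
})

# One row per (variant, prompt_version) with mean, min and max of the metric
stats = (
    scores
    .groupby(["variant", "prompt_version"])["codebleu"]
    .agg(["mean", "min", "max"])
    .reset_index()
)
print(stats)
```

The same call extends to several metrics by selecting a list of columns before `.agg(...)`.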



## Section 7: Results 

In [337]:
# === CELL: normalized CodeBLEU computation for every retrieval variant ===
# Requirements:
#   - the `codebleu/` folder/library on PYTHONPATH (the one we created)
#   - pip: `codebleu` installed (fork that exposes `calc_codebleu`)

from pathlib import Path
import json
import pandas as pd
import numpy as np
import sys
import re
from typing import List, Dict, Any, Optional

# If needed, add your project folder to the path, e.g.:
# sys.path.append(r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS")

from codebleu_F import compute_grouped_codebleu  # from our modular package

# -----------------------------
# Input paths (cleaned)
# -----------------------------
BASELINE_FILE = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\outputs\codegen\codellama_CodeLlama-7b-Instruct-hf\by_prompt\_all_baseline.jsonl"
DATASET_FILE  = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\data\lca_test_filtered.jsonl"

BM25_FILE     = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_bm25_top3\AGGREGATED_results.jsonl"
COSINE_FILE   = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_cosine_top3\AGGREGATED_results.jsonl"
HYBRID_FILE   = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_hybrid_top5\AGGREGATED_results.jsonl"

MULTIHOP_DECOMP_DIR = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_multihop_decomposition_top4"
MULTIHOP_ITER_DIR   = r"C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_multihop_iterative_top4"

# -----------------------------
# Parsing utilities
# -----------------------------

def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                # retry after stripping a leading BOM; lines that still fail are skipped
                try:
                    rows.append(json.loads(line.lstrip("\ufeff")))
                except json.JSONDecodeError:
                    pass
    return rows

def load_dataset(df_path: Path) -> pd.DataFrame:
    data = read_jsonl(df_path)
    df = pd.DataFrame(data)

    # 👈 ADAPT HERE if the reference lives in a different field
    # Try several common aliases:
    candidate_ref_cols = ["references", "reference", "ground_truth", "solution", "target", "code"]
    ref_col = None
    for c in candidate_ref_cols:
        if c in df.columns:
            ref_col = c
            break
    if ref_col is None:
        raise ValueError(f"No reference column found among {candidate_ref_cols} in {df_path}")

    # normalize to a list of strings (multiple references are fine)
    def to_list_refs(x):
        if isinstance(x, list):
            return [s for s in x if isinstance(s, str)]
        if isinstance(x, str):
            return [x]
        return []

    df["references_norm"] = df[ref_col].apply(to_list_refs)

    # 👈 ADAPT HERE: row id (stable keys are preferred)
    id_col = None
    for c in ["id", "qid", "sample_id", "problem_id", "uid"]:
        if c in df.columns:
            id_col = c
            break
    if id_col is None:
        # no ID column: fall back to a positional index
        df["__row_index__"] = np.arange(len(df))
        id_col = "__row_index__"

    df = df[[id_col, "references_norm"]].rename(columns={id_col: "join_id"})
    return df

def extract_pred_and_prompt(rows: List[Dict[str, Any]]) -> pd.DataFrame:
    """
    Extract the prediction and (if present) the prompt_type from a list of JSON dicts.
    """
    def pick_pred(d: Dict[str, Any]) -> str:
        # 👈 ADAPT HERE if the prediction is stored under a different name
        for k in ["prediction", "generated", "code", "output", "generation"]:
            if k in d and isinstance(d[k], str):
                return d[k]
        # some formats wrap the text in an object {"text": "..."}
        for k in ["prediction", "generated", "output", "generation"]:
            if k in d and isinstance(d[k], dict) and isinstance(d[k].get("text"), str):
                return d[k]["text"]
        return ""

    def pick_prompt_type(d: Dict[str, Any]) -> Optional[str]:
        # note: "prompt" is an accepted alias, so rows that only carry the full
        # prompt text end up with that text as their prompt_type
        for k in ["prompt_type", "promptVersion", "prompt", "template_id"]:
            if k in d and isinstance(d[k], str):
                return d[k]
        return None

    def pick_id(d: Dict[str, Any]) -> Optional[str]:
        # note: "query_id" (the key written by the generation loop) is not in this
        # list, so rows keyed only by it get a positional id further down
        for k in ["id", "qid", "sample_id", "problem_id", "uid"]:
            if k in d:
                return str(d[k])
        return None

    recs = []
    for d in rows:
        recs.append({
            "join_id": pick_id(d),
            "prediction": pick_pred(d),
            "prompt_type": pick_prompt_type(d)
        })
    df = pd.DataFrame(recs)
    # if join_id is missing, use the positional index
    if df["join_id"].isna().any():
        df["__row_index__"] = np.arange(len(df))
        df["join_id"] = df["join_id"].fillna(df["__row_index__"].astype(str))
        df.drop(columns=["__row_index__"], inplace=True)
    return df

def load_generation_file(file_path: Path, variant_hint: str) -> pd.DataFrame:
    rows = read_jsonl(file_path)
    dfp = extract_pred_and_prompt(rows)
    dfp["variant"] = variant_hint
    return dfp

def load_generation_dir(dir_path: Path, variant_hint: str) -> pd.DataFrame:
    """
    Load every .jsonl file in the folder and try to infer prompt_type from the file name when it is not stored in the rows.
    """
    all_parts = []
    for fp in sorted(dir_path.glob("*.jsonl")):
        rows = read_jsonl(fp)
        dfp = extract_pred_and_prompt(rows)
        # if prompt_type is missing, derive it from the file name (pattern like ..._v7.jsonl)
        if dfp["prompt_type"].isna().all():
            m = re.search(r"(v\d+(_\d+)?)", fp.stem.lower())
            if m:
                dfp["prompt_type"] = m.group(1)
            else:
                dfp["prompt_type"] = "unknown"
        dfp["variant"] = variant_hint
        dfp["__source_file__"] = str(fp)
        all_parts.append(dfp)
    return pd.concat(all_parts, ignore_index=True) if all_parts else pd.DataFrame(columns=["join_id","prediction","prompt_type","variant"])

# -----------------------------
# Load the dataset (references)
# -----------------------------
dataset_df = load_dataset(Path(DATASET_FILE))

# -----------------------------
# Load all generations
# -----------------------------
frames = []

# baseline (single file)
frames.append(load_generation_file(Path(BASELINE_FILE), variant_hint="baseline"))

# aggregated RAG results (single files)
frames.append(load_generation_file(Path(BM25_FILE),   variant_hint="bm25"))
frames.append(load_generation_file(Path(COSINE_FILE), variant_hint="cosine"))
frames.append(load_generation_file(Path(HYBRID_FILE), variant_hint="hybrid"))

# Multihop (folders with several files)
frames.append(load_generation_dir(Path(MULTIHOP_DECOMP_DIR), variant_hint="multihop_decomposition"))
frames.append(load_generation_dir(Path(MULTIHOP_ITER_DIR),   variant_hint="multihop_iterative"))

gens_df = pd.concat(frames, ignore_index=True)

# make sure prompt_type exists
if "prompt_type" not in gens_df.columns:
    gens_df["prompt_type"] = "unknown"
else:
    gens_df["prompt_type"] = gens_df["prompt_type"].fillna("unknown")

# -----------------------------
# Join with the dataset to attach the references
# -----------------------------
# join_id may be a numeric string on one side only, so align both sides to str
dataset_df["join_id"] = dataset_df["join_id"].astype(str)
gens_df["join_id"]    = gens_df["join_id"].astype(str)

merged = gens_df.merge(dataset_df, on="join_id", how="left")

# Unmatched rows: positional fallback (only where references are missing)
if merged["references_norm"].isna().any():
    # take the rows without a match and fill their references by position
    mask = merged["references_norm"].isna()
    # simple fallback: follow the dataset order, wrapping around with a modulo;
    # this is only correct when the generations are stored in dataset order
    fallback_refs = dataset_df["references_norm"].tolist()
    merged.loc[mask, "references_norm"] = merged.loc[mask].index.map(
        lambda i: fallback_refs[i % len(fallback_refs)]
    )

# -----------------------------
# CodeBLEU per group
# -----------------------------
# compute_grouped_codebleu expects:
#   - prediction_col
#   - references_col (list of strings)
#   - group_cols
summary = compute_grouped_codebleu(
    merged,
    prediction_col="prediction",
    references_col="references_norm",
    group_cols=("variant","prompt_type"),
    # switch to average="corpus" for a corpus-level score
    average="macro",
)

# Sort by variant/prompt_type
summary = summary.sort_values(["variant","prompt_type"]).reset_index(drop=True)

# Save CSV
out_dir = Path(HYBRID_FILE).parent  # save next to one of the result files
out_csv = out_dir / "codebleu_summary_by_variant_prompt.csv"
summary.to_csv(out_csv, index=False, encoding="utf-8-sig")

print("Salvato:", out_csv)




Salvato: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_hybrid_top5\codebleu_summary_by_variant_prompt.csv
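
The positional fallback in the cell above pairs unmatched generations with references purely by row order, so it is worth measuring how many rows actually matched on `join_id` before trusting the scores. A minimal sketch using the `indicator` option of `DataFrame.merge`; the two toy frames and their ids are invented for illustration:

```python
import pandas as pd

# Toy stand-ins for gens_df and dataset_df (hypothetical ids and code strings)
gens = pd.DataFrame({
    "join_id": ["a", "b", "zz"],
    "prediction": ["x = 1", "y = 2", "z = 3"],
})
refs = pd.DataFrame({
    "join_id": ["a", "b", "c"],
    "references_norm": [["x = 1"], ["y = 2"], ["w = 0"]],
})

# indicator=True adds a "_merge" column: "both" for matched rows, "left_only" otherwise
merged = gens.merge(refs, on="join_id", how="left", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
match_rate = (merged["_merge"] == "both").mean()

print(f"matched on join_id: {match_rate:.1%}")
print("unmatched ids:", unmatched["join_id"].tolist())
```

A match rate far below 100% would suggest that the id columns on the two sides are not the same key, in which case the positional fallback decides most of the pairing.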


In [338]:
summary_by_variant = summary.groupby("variant", as_index=False)["codebleu"].mean()
display(summary_by_variant)


|   | variant | codebleu |
|---|---------|----------|
| 0 | baseline | 0.16341 |
| 1 | bm25 | 0.128609 |
| 2 | cosine | 0.250732 |
| 3 | hybrid | 0.138757 |
| 4 | multihop_decomposition | 0.200912 |
| 5 | multihop_iterative | 0.224805 |


In [339]:
import re

# --- 1) text "signatures" used to recognise each prompt version ---
PATTERNS = [
    ("v9",    r"\(v9\b|Workflow \(do not output steps 1–3\)|Verified code excerpts"),
    ("v8",    r"\(v8\)|Library-Centric Python Code Generation \(v8\)"),
    ("v7",    r"Python Code Generation Task \(Library-Focused\)"),
    ("v6_3",  r"^# PYTHON LIBRARY-BASED CODE GENERATION TASK|```PYTHON"),
    ("v6",    r"Long Code Arena · Library-Based Code Generation"),
    ("v5",    r"Write a complete Python 3 implementation for the following task"),
    ("v4",    r"Return exactly and only the raw answer|^You are a senior Python engineer\.\s+Use the retrieved examples"),
    ("v3",    r"^You are a senior Python engineer\.\s+Fulfill the following task|Implementation\*\*:\s*```python"),
    ("v2",    r"\bseedemu\b|Use the `seedemu` Python library"),
    ("v1",    r"^### Task:\s|^### Retrieved Examples:\s"),
]

FALLBACK_VERSION = "v6_2"   # label used when no signature matches (as requested)

# --- 2) detection on a single string ---
def detect_version(prompt_text: str) -> str:
    if not isinstance(prompt_text, str) or not prompt_text.strip():
        return FALLBACK_VERSION
    txt = prompt_text.strip()
    for ver, pat in PATTERNS:
        if re.search(pat, txt, flags=re.IGNORECASE | re.MULTILINE):
            return ver
    return FALLBACK_VERSION

# --- 3) apply to the summary DataFrame ---
# summary has columns: ['variant', 'prompt_type', 'codebleu']
summary = summary.copy()
summary["prompt_version"] = summary["prompt_type"].apply(detect_version)

# (optional) drop the long prompt text
summary = summary.drop(columns=["prompt_type"])

# sort
summary = summary.sort_values(["variant", "prompt_version"]).reset_index(drop=True)

# --- 4) (optional) also keep the detected mapping for traceability ---
# NB: the mapping here is "dynamic" (pattern -> version). For an explicit
# mapping from the LONG TEXTS seen to 'vX', proceed as follows:
seen = {}
for s in summary["prompt_version"].unique():
    seen[s] = []  # placeholder for completeness

# populate (only the first occurrences, to keep it light)
for row in summary.itertuples(index=False):
    pv = row.prompt_version
    if len(seen[pv]) < 3:  # keep a few examples
        # WARNING: this needs the original prompt_type. If it was already dropped, use a copy
        pass
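
Since `detect_version` returns the first signature that matches, the order of `PATTERNS` decides the label whenever a prompt satisfies more than one pattern. The sketch below reuses three of the signatures from the cell above in the same relative order; the two sample prompts are invented for illustration:

```python
import re

# Reduced copy of the signature list: the first matching entry wins
PATTERNS = [
    ("v5", r"Write a complete Python 3 implementation for the following task"),
    ("v2", r"\bseedemu\b|Use the `seedemu` Python library"),
    ("v1", r"^### Task:\s|^### Retrieved Examples:\s"),
]
FALLBACK_VERSION = "v6_2"

def detect_version(prompt_text):
    """Return the first version whose signature matches, else the fallback label."""
    if not isinstance(prompt_text, str) or not prompt_text.strip():
        return FALLBACK_VERSION
    txt = prompt_text.strip()
    for ver, pat in PATTERNS:
        if re.search(pat, txt, flags=re.IGNORECASE | re.MULTILINE):
            return ver
    return FALLBACK_VERSION

# Hypothetical prompts in the "### Task:" layout
plain_v1 = "### Task:\nSort a list of integers.\n### Code:"
v1_with_seedemu = "### Task:\nBuild a network with seedemu.\n### Code:"

print(detect_version(plain_v1))         # only the v1 signature matches
print(detect_version(v1_with_seedemu))  # the v2 signature is tested first and matches
print(detect_version("no known marker"))
```

The second prompt has the v1 layout but is labelled `v2`, because the bare `\bseedemu\b` alternative also fires on task text that merely mentions the library. If that is not intended, one option is to drop the bare-word alternative or move the more specific signatures earlier in the list.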


In [340]:
# Mean for each (variant, prompt_version) pair
avg_summary = (
    summary
    .groupby(["variant", "prompt_version"], as_index=False)
    .mean(numeric_only=True)  # numeric columns only
)

# Sort for readability
avg_summary = avg_summary.sort_values(["variant", "prompt_version"]).reset_index(drop=True)

# Display
display(avg_summary)

# (Optional) save to CSV
out_csv_avg = Path(HYBRID_FILE).parent / "codebleu_avg_by_variant_prompt.csv"
avg_summary.to_csv(out_csv_avg, index=False, encoding="utf-8-sig")
print("Salvato:", out_csv_avg)

# Pivot table: variant on the rows, prompt_version on the columns
pivot = summary.pivot_table(
    index="variant",
    columns="prompt_version",
    values="codebleu",   # <-- metric to average
    aggfunc="mean"
)

# Sort the columns in natural order (v1..v11): compare the numeric parts as integers,
# so that a label such as v10 would come after v9 rather than after v1
pivot = pivot.reindex(
    sorted(pivot.columns, key=lambda x: (x[0] != "v", [int(n) for n in re.findall(r"\d+", x)], x)),
    axis=1,
)

# Display the table
display(pivot)

# (optional) save to CSV
out_csv_pivot = Path(HYBRID_FILE).parent / "codebleu_pivot_by_variant_prompt.csv"
pivot.to_csv(out_csv_pivot, encoding="utf-8-sig")
print("Salvato pivot:", out_csv_pivot)


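|   | variant | prompt_version | codebleu |
|---|---------|----------------|----------|
| 0 | baseline | v1 | 0.185871 |
| 1 | baseline | v2 | 0.171419 |
| 2 | baseline | v6_2 | 0.14188 |
| 3 | baseline | v6_3 | 0.164756 |
| 4 | baseline | v7 | 0.144842 |
| 5 | baseline | v8 | 0.153567 |
| 6 | baseline | v9 | 0.17222 |
| 7 | bm25 | v1 | 0.161507 |
| 8 | bm25 | v2 | 0.142972 |
| 9 | bm25 | v6_2 | 0.096764 |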


Salvato: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_hybrid_top5\codebleu_avg_by_variant_prompt.csv


| variant / prompt_version | v1 | v2 | v6_2 | v6_3 | v7 | v8 | v9 |
|--------------------------|----|----|------|------|----|----|----|
| baseline | 0.185871 | 0.171419 | 0.14188 | 0.164756 | 0.144842 | 0.153567 | 0.17222 |
| bm25 | 0.161507 | 0.142972 | 0.096764 | 0.122803 | 0.133774 | 0.16821 | 0.087831 |
| cosine | 0.253125 | 0.256719 | 0.250018 | 0.249077 | 0.25001 | 0.250002 | 0.250112 |
| hybrid | 0.155605 | 0.15433 | 0.112124 | 0.129295 | 0.137169 | 0.215327 | 0.094614 |
| multihop_decomposition | 0.133053 | 0.182133 | 0.15341 | 0.206845 | 0.191376 | 0.250543 | 0.244997 |
| multihop_iterative | 0.177532 | 0.213739 | 0.201527 | 0.223354 | 0.250354 | 0.250143 | 0.250154 |


Salvato pivot: C:\Users\drugm\Documents\RP_PCTITO\JACK_17_EDITS\results\rag_generations\rag_hybrid_top5\codebleu_pivot_by_variant_prompt.csv
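
To read the best prompt version per retrieval variant off the pivot, a row-wise `idxmax` is enough. The sketch below rebuilds the pivot by hand from the values printed above so that it runs on its own; inside the notebook the existing `pivot` object could be used directly:

```python
import pandas as pd

versions = ["v1", "v2", "v6_2", "v6_3", "v7", "v8", "v9"]
variants = ["baseline", "bm25", "cosine", "hybrid",
            "multihop_decomposition", "multihop_iterative"]

# Values copied from the pivot output above
pivot = pd.DataFrame(
    [
        [0.185871, 0.171419, 0.141880, 0.164756, 0.144842, 0.153567, 0.172220],
        [0.161507, 0.142972, 0.096764, 0.122803, 0.133774, 0.168210, 0.087831],
        [0.253125, 0.256719, 0.250018, 0.249077, 0.250010, 0.250002, 0.250112],
        [0.155605, 0.154330, 0.112124, 0.129295, 0.137169, 0.215327, 0.094614],
        [0.133053, 0.182133, 0.153410, 0.206845, 0.191376, 0.250543, 0.244997],
        [0.177532, 0.213739, 0.201527, 0.223354, 0.250354, 0.250143, 0.250154],
    ],
    index=variants,
    columns=versions,
)

best = pd.DataFrame({
    "best_prompt": pivot.idxmax(axis=1),           # column label of the row maximum
    "codebleu": pivot.max(axis=1),                 # the maximum itself
    "spread": pivot.max(axis=1) - pivot.min(axis=1),  # gap between best and worst prompt
})
print(best)
```

On these numbers the top prompt is v1 for baseline, v8 for bm25, hybrid and multihop_decomposition, v2 for cosine, and v7 for multihop_iterative. The spread column helps to judge how much the choice matters: for cosine the gap between the best and worst prompt is below 0.01, so the ranking there should be read with caution.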
