# 8. Joint Verifier + Ensemble Scoring

This notebook implements a hybrid LLaMA-3 + DeBERTa-v3-base verifier pipeline for structured reasoning (LLM-SR):

1. **Data Preparation**  
   - Corrupt valid question-parsing (QP) and CoT-parsing (CP) outputs to generate negative examples  
   - Build stratified train/dev JSONL splits (up to 3 corrupted negatives per positive) for verifier fine-tuning  

2. **Verifier Training**  
   - Fine-tune DeBERTa-v3-base as a binary classifier on (premise, hypothesis) pairs  
   - Downsample negatives to balance the training split and apply a class-weighted cross-entropy loss  
   - Evaluate with accuracy, binary/macro F1, and class-wise precision/recall  

3. **Inference Pipeline**  
   - **QP Stage:** sample 3 QP candidates (temperature sampling)  
   - **CP Stage:** generate one CP candidate per QP candidate (beam search, 5 beams)  
   - **Verifier Scoring:** score each (QP, CP) pair using the fine-tuned DeBERTa verifier  
   - **LM Fallback:** compute LLaMA log-probs for QP and CP candidates  
   - **Ensemble Reranking:**  
     - Compute dynamic threshold = max(0.6, median(verifier_scores) + 0.5·std(verifier_scores))  
     - Score_i = 0.65·verifier_i + 0.20·norm(QP_logprob_i) + 0.15·norm(CP_logprob_i), where norm(x) = sigmoid(x/100)  
     - If the best Score ≥ threshold, select that candidate; otherwise fall back to the candidate with the highest QP log-prob  

4. **Evaluation**  
   - Save structured outputs to JSON  
   - Run `eval.py` with official thresholds for question, statement, and relation F1  

**Goals:**  
- Leverage a learned verifier to correct LLM parsing errors  
- Combine model confidence and likelihoods for robust candidate selection  
- Maintain interpretable, structured (statement, evidence, verification) outputs  



## Configuration

In [1]:
# Install core evaluation utilities
!pip install -q evaluate
!pip install json5

!pip uninstall -y nltk
!pip install -q --upgrade nltk

Collecting json5
  Downloading json5-0.12.0-py3-none-any.whl.metadata (36 kB)
Downloading json5-0.12.0-py3-none-any.whl (36 kB)
Installing collected packages: json5
Successfully installed json5-0.12.0
Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1

In [1]:
import nltk
nltk.download("punkt_tab")
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Prepare data for training

In [None]:
import json
import json5
import random
import copy
from sklearn.model_selection import train_test_split

# CONFIGURATION
INPUT = "/content/drive/MyDrive/llm-sr-project/700dataset.json"
OUT_TRAIN = "/content/drive/MyDrive/llm-sr-project/verifier_train.jsonl"
OUT_DEV   = "/content/drive/MyDrive/llm-sr-project/verifier_dev.jsonl"
NEG_PER_POS = 3    # Number of negative (corrupted) examples to generate per positive one
DEV_SIZE    = 0.1  # Fraction of the dataset to use as the dev set


def corrupt_question_parsing(qp):
    """
    Corrupt a valid question_parsing (QP) list by either:
    - Randomly dropping one sentence (if more than one), or
    - Shuffling the entire list to break order-dependence
    """
    qp2 = qp.copy()
    if random.random() < 0.5 and len(qp2)>1:
        # drop one random
        qp2.pop(random.randrange(len(qp2)))
    else:
        random.shuffle(qp2)
    return qp2

def corrupt_cot_parsing(cp):
    """
    Corrupt a valid cot_parsing (CP) list of dicts by:
    - Flipping the verification value (true ↔ false)
    - Swapping evidence between two steps
    - Dropping the 'evidence' field from one step
    """
    cp2 = copy.deepcopy(cp)
    if not cp2: return cp2
    choice = random.choice(["flip", "swap", "drop_field"])
    if choice == "flip":
        # flip a random statement's flag
        idx = random.randrange(len(cp2))
        cp2[idx]["Verification"] = "true" if cp2[idx]["Verification"]=="false" else "false"
    elif choice == "swap":
        # swap evidence between two
        if len(cp2)>=2:
            i,j = random.sample(range(len(cp2)), 2)
            cp2[i]["evidence"], cp2[j]["evidence"] = cp2[j]["evidence"], cp2[i]["evidence"]
    else:  # drop_field
        idx = random.randrange(len(cp2))
        cp2[idx].pop("evidence", None)
    return cp2

def make_record(question, cot, qp, cp, label):
    """
    Format a training example with:
    - `premise`: question and CoT
    - `hypothesis`: the structured parses (QP + CP)
    - `label`: 1 (valid) or 0 (corrupted)
    """
    premise = f"{question}\n\nCoT:\n{cot}"
    hyp_qp = json.dumps(qp, ensure_ascii=False)
    hyp_cp = json.dumps(cp, ensure_ascii=False)
    hypothesis = f"QuestionParsing: {hyp_qp}  CoTParsing: {hyp_cp}"
    return {"premise": premise, "hypothesis": hypothesis, "label": label}

# STEP 1: Load valid (positive) examples
with open(INPUT, "r", encoding="utf-8") as f:
    positives = json5.loads(f.read())

all_records = []
for ex in positives:
    q   = ex["question"]
    cot = ex["cot"]
    qp  = ex["question_parsing"]
    cp  = ex["cot_parsing"]

    # Add the original, valid example (label = 1)
    all_records.append(make_record(q, cot, qp, cp, 1))

    # Generate negative (corrupted) versions
    for _ in range(NEG_PER_POS):
        qp_bad = corrupt_question_parsing(qp)
        cp_bad = corrupt_cot_parsing(cp)

        # Avoid adding duplicates accidentally
        if qp_bad==qp and cp_bad==cp:
            continue
        all_records.append(make_record(q, cot, qp_bad, cp_bad, 0))

# STEP 2: Split into training and dev sets
train, dev = train_test_split(all_records, test_size=DEV_SIZE, random_state=42, stratify=[r["label"] for r in all_records])

# STEP 3: Write out the datasets in JSONL format
for path, split in [(OUT_TRAIN, train), (OUT_DEV, dev)]:
    with open(path, "w", encoding="utf-8") as f:
        for rec in split:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print(f"✔︎ wrote {len(train)} train + {len(dev)} dev examples")
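
A quick way to sanity-check the written splits is to read a JSONL file back and count labels. The sketch below runs on a temporary toy file standing in for `verifier_train.jsonl`; `label_counts` is a hypothetical helper and the toy records are invented for illustration.

```python
import json
import os
import tempfile
from collections import Counter

def label_counts(path):
    """Count labels in a JSONL file with one {"premise", "hypothesis", "label"} record per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["label"]] += 1
    return counts

# Toy file: 1 positive followed by 3 corrupted negatives
records = [{"premise": "q", "hypothesis": "h", "label": 1}]
records += [{"premise": "q", "hypothesis": f"h{i}", "label": 0} for i in range(3)]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    for rec in records:
        tmp.write(json.dumps(rec, ensure_ascii=False) + "\n")

toy_counts = label_counts(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(toy_counts)
```

On the real files, a ratio well below 3 negatives per positive would suggest that many corruptions were skipped as duplicates of the original parse.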

## Train the Joint Verifier

In [None]:
import numpy as np
import random
from collections import Counter
import torch
import torch.nn as nn

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

# 1) Load Dataset and Balance Training Set
data_files = {
    "train":      "/content/drive/MyDrive/llm-sr-project/verifier_train.jsonl",
    "validation": "/content/drive/MyDrive/llm-sr-project/verifier_dev.jsonl"
}
ds = load_dataset("json", data_files=data_files)

# Balance training data: downsample negatives to match positives
def balance(split):
    labels = split["label"]
    neg = [i for i,l in enumerate(labels) if l==0]
    pos = [i for i,l in enumerate(labels) if l==1]
    random.seed(42)
    neg_down = random.sample(neg, len(pos))
    idxs = neg_down + pos
    random.shuffle(idxs)
    return split.select(idxs)

ds["train"] = balance(ds["train"])
print("Train balance:", Counter(ds["train"]["label"]))
print("Dev   balance:",   Counter(ds["validation"]["label"]))  # leave as is

# 2) Load Pretrained Model and Tokenizer
MODEL = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model     = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
device    = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 3) Define Class-Weighted Loss
# Helps to balance contribution of majority/minority class
counts = Counter(ds["train"]["label"])
w0 = counts[1] / (counts[0]+counts[1])  # weight for class 0 (neg)
w1 = counts[0] / (counts[0]+counts[1])  # weight for class 1 (pos)
weights = torch.tensor([w0, w1], device=device)
loss_fn  = nn.CrossEntropyLoss(weight=weights)

# Custom Trainer to use class-weighted loss
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels").to(device)
        outputs= model(**inputs)
        logits = outputs.logits
        loss   = loss_fn(logits, labels)
        return (loss, outputs) if return_outputs else loss

# 4) Preprocessing Function
def preprocess(ex):
    enc = tokenizer(ex["premise"], ex["hypothesis"],
                    truncation=True, padding=False)
    enc["labels"] = ex["label"]
    return enc

tokenized = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)
data_collator = DataCollatorWithPadding(tokenizer)

# 5) Evaluation Metrics
def compute_metrics(p):
    logits = p.predictions
    labels = p.label_ids
    probs  = torch.softmax(torch.tensor(logits), dim=1).numpy()[:,1]
    preds  = (probs > 0.5).astype(int)

    acc     = accuracy_score(labels, preds)
    f1_bin  = f1_score(labels, preds, average="binary")
    f1_mac  = f1_score(labels, preds, average="macro")
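    # With average="binary", sklearn scores only the class given by pos_label
    # (default 1) and the `labels` argument has no effect, so select each class explicitly.
    p0, r0, f0, _ = precision_recall_fscore_support(labels, preds, pos_label=0, average="binary", zero_division=0)
    p1, r1, f1, _ = precision_recall_fscore_support(labels, preds, pos_label=1, average="binary", zero_division=0)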

    return {
        "accuracy":   acc,
        "f1_binary":  f1_bin,
        "f1_macro":   f1_mac,
        "precision_0": p0,
        "recall_0":    r0,
        "f1_0":        f0,
        "precision_1": p1,
        "recall_1":    r1,
        "f1_1":        f1,
    }

# 6) TrainingArguments
training_args = TrainingArguments(
    output_dir="deberta3-verifier",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=50,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    fp16=True,
    report_to="none",
    warmup_steps=100,
    lr_scheduler_type="linear",
)

# 7) Initialize Trainer and Train
trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

# 8) Save Final Model
trainer.save_model("/content/drive/MyDrive/llm-sr-project/deberta3-verifier-final")
tokenizer.save_pretrained("/content/drive/MyDrive/llm-sr-project/deberta3-verifier-final")
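
The class-weight formula in step 3 of the training cell can be checked with a little arithmetic. The sketch below is illustrative only (`inverse_frequency_weights` is a hypothetical name); it reproduces `w0 = n1/(n0+n1)` and `w1 = n0/(n0+n1)` on toy label lists.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return [w0, w1], where each class is weighted by the other class's share."""
    counts = Counter(labels)
    total = counts[0] + counts[1]
    return [counts[1] / total, counts[0] / total]

# Balanced split, as produced by downsampling negatives to match positives
balanced = inverse_frequency_weights([0, 1] * 50)
# Unbalanced split with 3 negatives per positive
skewed = inverse_frequency_weights([0, 0, 0, 1] * 25)

print(balanced, skewed)
```

On a balanced split the two weights are equal, and with `nn.CrossEntropyLoss`'s default mean reduction equal weights give the same loss as the unweighted version. The weighting therefore only changes training if the downsampling step is skipped.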

## Inference and Evaluate

### Imports and Setup

In [None]:
import unsloth
import math
import torch, gc, json, re, ast, html
import numpy as np
from torch.nn.functional import log_softmax
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification,
    pipeline as hf_pipeline,
)
from datasets import load_dataset


# Optional json5 for fallback parsing
try:
    import json5
    USE_JSON5 = True
except ImportError:
    USE_JSON5 = False

### Prompts Template

In [None]:
# In-Context Learning (ICL) Prompting
QP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project. This assignment must satisfy:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

If A works on Beta, which of the following must be true?
A. B works on Alpha
B. C works on Beta
C. D works on Alpha
D. F works on Beta

The parsing result is:

[
  "There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project.",
  "If A works on Alpha, then B works on Beta",
  "If C works on Alpha, then D and E work on Beta",
  "F works on a different project than E",
  "D must work on a different project than A",
  "If F works on Alpha, then B works on Alpha",
  "A works on Beta"
]
'''

QP_TEMPLATE = '''Given a question, extract all relevant information from the question that would help to solve it.

Output only a JSON list and nothing else. Follow the format shown in the example.

Example:
{demon}

Now, the question is:

{question}

Your output:
'''

CP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. Each person works on exactly one project.

Conditions:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

Question:
If A works on Beta, which of the following must be true?

CoT:
Since A works on Beta, Condition (1) is not triggered. Condition (2) is not triggered since C's assignment is unknown. Condition (3) doesn't give anything because E's assignment is unspecified. Condition (4) says D must work on a different project than A, so D must work on Alpha. Condition (5) depends on F, which is unknown.

Parsing result:

[
  {
    "statement": "Condition (1) is not applicable",
    "evidence": "Condition (1): If A works on Alpha, then B works on Beta. | A is working on Beta",
    "Verification": "false"
  },
  {
    "statement": "Condition (2) is not applicable",
    "evidence": "Condition (2): If C works on Alpha, then D and E work on Beta. | C's assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "Condition (3) does not provide any info",
    "evidence": "Condition (3): F works on a different project than E. | E's assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "D must work on Alpha",
    "evidence": "Condition (4): D must work on a different project than A, and A is working on Beta",
    "Verification": "true"
  },
  {
    "statement": "Condition (5) is not applicable",
    "evidence": "Condition (5): If F works on Alpha, then B works on Alpha. | F's assignment is unknown",
    "Verification": "false"
  }
]
'''

CP_TEMPLATE = '''You are a reasoning assistant. Based on the question, conditions, and chain-of-thought (CoT), extract every inference or non-inference step as a JSON object.

Example:
{demon}

Now, given:

Question:
{question}

Conditions:
{conditions}

Chain-of-Thought:
{cot}

Your output:
'''
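
The templates are filled with `str.format`, and the demonstrations contain JSON braces. A small sketch of why that works, using shortened stand-ins (`DEMO`, `TEMPLATE`, and `LITERAL` are hypothetical, not the notebook's strings): braces inside substituted values are inserted verbatim, while literal braces written in the template itself would need doubling.

```python
# Shortened stand-ins for a demonstration and a template
DEMO = '[\n  {"statement": "D must work on Alpha", "Verification": "true"}\n]'
TEMPLATE = "Example:\n{demon}\n\nQuestion:\n{question}\n\nYour output:\n"

# Braces inside the *values* are not re-interpreted by str.format
prompt = TEMPLATE.format(demon=DEMO, question="Which of {A, B} must be true?")

# Literal braces in the *template* must be written as {{ and }}
LITERAL = 'Return JSON like {{"statement": "..."}} for: {question}'
literal_prompt = LITERAL.format(question="Q1")

print(prompt)
print(literal_prompt)
```

This is why the JSON examples live in `QP_DEMON` and `CP_DEMON` and are passed in as values, rather than being pasted into the template strings directly.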

### Helper Functions

In [None]:
# Helper Functions
def clean_quotes(t):
    return (t.replace('“','"').replace('”','"')
             .replace("‘","'").replace("’","'"))

def normalize_text(t):
    t = clean_quotes(t)
    t = re.sub(r'\?\s(?=[A-Z])', ', ', t)
    t = re.sub(r'(?<=[a-zA-Z])\.(?=[A-Z])', '. ', t)
    t = re.sub(r'(?<![A-Da-d])\\n(?!\s?[A-Da-d]\\.)', ' ', t)
    return html.unescape(t).strip()

def extract_first_json_array(raw: str):
    raw = raw.strip()
    start = raw.find('[')
    if start < 0: return None
    depth = 0
    for i,ch in enumerate(raw[start:], start):
        if ch=='[': depth+=1
        elif ch==']': depth-=1
        if depth==0:
            block = raw[start:i+1]
            # Try strict JSON first, then Python literals, then json5 if available
            parsers = [json.loads, ast.literal_eval] + ([json5.loads] if USE_JSON5 else [])
            for parser in parsers:
                try: return parser(block)
                except Exception: pass
            return None
    return None
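
Counting brackets without regard to string contents can end the block early when a quoted sentence contains an unmatched `]`. The variant below is a sketch under that assumption, not the notebook's helper: `first_json_array` is a hypothetical name, it tracks double-quoted strings, and it uses only strict `json.loads`.

```python
import json

def first_json_array(raw):
    """Return the first balanced [...] block in `raw` parsed as JSON, or None.

    Brackets that appear inside double-quoted strings are ignored.
    """
    start = raw.find("[")
    if start < 0:
        return None
    depth, in_str, escaped = 0, False, False
    for i in range(start, len(raw)):
        ch = raw[i]
        if in_str:
            if escaped:
                escaped = False      # character after a backslash is taken literally
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_str = False       # closing quote
            continue
        if ch == '"':
            in_str = True
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(raw[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None

parsed = first_json_array('noise ["see item 1] above", "ok"] tail')
print(parsed)
```

A string-aware scan costs a few extra branches per character and only matters for outputs whose sentences contain brackets, so it is an optional refinement.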

### Load Models and Verifier

In [None]:
# Load Models and Verifier
device = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 2
conf_min = 0.6

# QP generation (3 samples)
qp_model_path = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_question_parsing"
qp_tok = AutoTokenizer.from_pretrained(qp_model_path)
qp_tok.model_max_length = 1024
qp_tok.padding_side = "left"
qp_mod = AutoModelForCausalLM.from_pretrained(qp_model_path,
    device_map="auto", torch_dtype=torch.float16)
qp_pipe = hf_pipeline("text-generation", model=qp_mod, tokenizer=qp_tok,
    return_full_text=False, do_sample=True, temperature=0.7,
    num_return_sequences=3, max_new_tokens=512, batch_size=BATCH_SIZE)

# CP generation (deterministic beam‐search)
cp_model_path = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_cot_parsing"
cp_tok = AutoTokenizer.from_pretrained(cp_model_path)
cp_tok.model_max_length = 2048
cp_tok.padding_side = "left"
cp_mod = AutoModelForCausalLM.from_pretrained(cp_model_path,
    device_map="auto", torch_dtype=torch.float16)
cp_pipe = hf_pipeline("text-generation", model=cp_mod, tokenizer=cp_tok,
    return_full_text=False, num_beams=5, do_sample=False,
    max_new_tokens=1024, batch_size=BATCH_SIZE)

# Verifier model (fine-tuned DeBERTa)
verifier_path = "/content/drive/MyDrive/llm-sr-project/deberta3-verifier-final"
rew_tok = AutoTokenizer.from_pretrained(verifier_path)
rew_mod = AutoModelForSequenceClassification.from_pretrained(
    verifier_path, device_map="auto", torch_dtype=torch.float16)
rew_mod.to(device)

print("✅ Models loaded.")

### Main Scoring Function

In [None]:
# Scoring Helpers
def compute_logprobs(cands, tokenizer_, model_):
    """Return nested list of log‐prob sums for each candidate list."""
    out = []
    for lst in cands:
        scores = []
        for item in lst:
            if not item:
                scores.append(float('-inf'))
                continue
            s = json.dumps(item) if not isinstance(item, str) else item
            enc = tokenizer_(s, return_tensors="pt",
                             truncation=True, padding=True,
                             max_length=tokenizer_.model_max_length).to(device)
            with torch.no_grad():
                logits = model_(**enc, labels=enc["input_ids"]).logits
            # sum token log‐probs
            lps = log_softmax(logits[0, :-1], dim=-1)
            lbls = enc["input_ids"][0,1:]
            score = lps[range(lbls.size(0)), lbls].sum().item()
            scores.append(score)
        out.append(scores)
    return out

# Main structured function with ensemble + consistency
def make_structured(batch):

    # 1) Normalize
    questions = [normalize_text(q) for q in batch["question"]]
    cots      = [normalize_text(c) for c in batch["cot"]]
    ids       = batch["id"]
    answers   = batch.get("answer", [None]*len(questions))

    # 2) Generate QP candidates
    qp_prompts = [QP_TEMPLATE.format(demon=QP_DEMON, question=q)
                  for q in questions]
    raw_qp = qp_pipe(qp_prompts, batch_size=len(qp_prompts))
    flat_qp = sum((sub if isinstance(sub,list) else [sub]
                   for sub in raw_qp), [])
    qp_lists = [extract_first_json_array(x["generated_text"]) or []
                for x in flat_qp]
    qp_cands = [qp_lists[i*3:(i+1)*3] for i in range(len(questions))]

    # 3) Generate CP candidates with cross‐feedback
    cp_prompts, mapping = [], []
    for qi,(q,cot,qp_list) in enumerate(zip(questions,cots,qp_cands)):
        for pi,qp in enumerate(qp_list):
            prompt = CP_TEMPLATE.format(
                demon=CP_DEMON,
                question=q,
                conditions=json.dumps(qp,ensure_ascii=False),
                cot=cot
            )
            cp_prompts.append(prompt)
            mapping.append((qi,pi))
    raw_cp = cp_pipe(cp_prompts, batch_size=len(cp_prompts))
    flat_cp = sum((sub if isinstance(sub,list) else [sub]
                   for sub in raw_cp), [])
    cp_lists = [extract_first_json_array(x["generated_text"]) or []
                for x in flat_cp]
    # regroup
    cp_cands = [[] for _ in questions]
    for (qi,pi), lst in zip(mapping, cp_lists):
        while len(cp_cands[qi])<=pi:
            cp_cands[qi].append([])
        cp_cands[qi][pi] = lst

    # 4) Verifier scoring
    premises, hyps, pairs = [], [], []
    for qi,(q,cot,qp_list,cp_list) in enumerate(zip(questions,cots,qp_cands,cp_cands)):
        for pi,(qp,cp) in enumerate(zip(qp_list,cp_list)):
            if not qp or not cp: continue
            prem = f"{q}\n\nCoT:\n{cot}"
            # Use the same hypothesis format the verifier was trained on (see make_record)
            hyp = (f"QuestionParsing: {json.dumps(qp, ensure_ascii=False)}  "
                   f"CoTParsing: {json.dumps(cp, ensure_ascii=False)}")
            premises.append(prem); hyps.append(hyp); pairs.append((qi,pi))
    # batch & score
    ver_scores = []
    for i in range(0,len(premises),32):
        toks = rew_tok(premises[i:i+32], hyps[i:i+32],
                       return_tensors="pt", padding=True,
                       truncation=True).to(device)
        with torch.no_grad():
            logits = rew_mod(**toks).logits
        probs = (torch.sigmoid(logits).squeeze(-1)
                 if logits.size(1)==1
                 else logits.softmax(1)[:,1])
        ver_scores.extend(probs.cpu().tolist())

    # 5) LM fallback
    qp_lps = compute_logprobs(qp_cands, qp_tok, qp_mod)
    cp_lps = compute_logprobs(cp_cands, cp_tok, cp_mod)

    # 6) Dynamic threshold
    if ver_scores:
        med, std = np.median(ver_scores), np.std(ver_scores)
        thr_dyn = max(conf_min, med + 0.5*std)
    else:
        thr_dyn = conf_min

    # 7) Weighted ensemble reranking
    best_qp, best_cp = [], []
    # group ver_scores by question candidate index
    ver_dict = {}
    for (qi,pi),score in zip(pairs, ver_scores):
        ver_dict.setdefault(qi, {})[pi] = score

    for qi in range(len(questions)):
        scores = []
        for pi in range(len(qp_cands[qi])):
            v = ver_dict.get(qi,{}).get(pi,0.0)
            lq = qp_lps[qi][pi] if pi<len(qp_lps[qi]) else -1e9
            lc = cp_lps[qi][pi] if pi<len(cp_lps[qi]) else -1e9
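            # normalize lps with sigmoid(lp/100), written as e^x/(1+e^x) so that very
            # negative log-probs underflow to 0 instead of raising OverflowError
            eq, ec = math.exp(lq/100), math.exp(lc/100)
            nq = eq/(1+eq)
            nc = ec/(1+ec)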
            # weights
            w = 0.65*v + 0.20*nq + 0.15*nc
            scores.append((pi,w))
        if not scores:
            best_qp.append([]); best_cp.append([])
            continue
        pi_best,score_best = max(scores, key=lambda x:x[1])
        if score_best>=thr_dyn:
            best_qp.append(qp_cands[qi][pi_best])
            best_cp.append(cp_cands[qi][pi_best])
        else:
            # fallback to LM-only
            pi_lm = max(range(len(qp_lps[qi])), key=lambda i: qp_lps[qi][i])
            best_qp.append(qp_cands[qi][pi_lm])
            best_cp.append(cp_cands[qi][pi_lm] if pi_lm<len(cp_cands[qi]) else [])

    return {
        "question":         questions,
        "question_parsing": best_qp,
        "cot":              cots,
        "cot_parsing":      best_cp,
        "id":               ids,
        "answer":           answers,
    }
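
The "sum token log-probs" step in `compute_logprobs` relies on a one-position shift: the logits at position t score the token at position t+1. A dependency-free sketch with toy numbers (`log_softmax_row` and `sequence_logprob` are hypothetical helpers that stand in for the tensor operations):

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def sequence_logprob(logits, token_ids):
    """Sum log P(token[t+1] | tokens[:t+1]) over the sequence.

    logits[t] holds the scores for the token at position t+1, so the last
    row is unused and the first token is never scored.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax_row(logits[t])[token_ids[t + 1]]
    return total

# Toy vocabulary of 3 tokens and a 3-token sequence (invented for illustration)
logits = [[0.0, 0.0, 0.0],   # uniform prediction for position 1
          [2.0, 0.0, 0.0],   # favors token 0 at position 2
          [0.0, 0.0, 0.0]]   # prediction after the last token: ignored
token_ids = [1, 2, 0]
lp = sequence_logprob(logits, token_ids)
print(lp)
```

Because the sum grows more negative with sequence length, longer candidates tend to score lower under this measure than shorter ones of similar quality.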

### Run and Save

In [None]:
# Run and Save Predictions
if __name__=="__main__":
    torch.cuda.empty_cache(); gc.collect()
    ds = load_dataset("json", data_files={"test":"/content/drive/MyDrive/llm-sr-project/testingData-blank.json"})["test"]
    out = ds.map(make_structured, batched=True, batch_size=BATCH_SIZE,
                 remove_columns=ds.column_names)
    out.to_json("/content/drive/MyDrive/llm-sr-project/results_ensemble.json",
                orient="records", lines=False)
    print("✅ Done!")

### Transform Predictions

In [None]:
import json

INPUT_PATH  = "/content/drive/MyDrive/llm-sr-project/results_ensemble.json"
OUTPUT_PATH = "/content/drive/MyDrive/llm-sr-project/final_results_ensemble.json"

def transform_example(ex):
    # reorder each cot_parsing entry: statement → evidence → Verification
    reordered = []
    for step in ex.get("cot_parsing", []):
        reordered.append({
            "statement":    step.get("statement"),
            "evidence":     step.get("evidence"),
            "Verification": step.get("Verification"),
        })

    return {
        "question":         ex.get("question"),
        "question_parsing": ex.get("question_parsing"),
        "answer":           ex.get("answer"),
        "id":               ex.get("id"),
        "cot":              ex.get("cot"),
        "cot_parsing":      reordered,
        "sel_idx":          ex.get("sel_idx"),
    }

def main():
    with open(INPUT_PATH, "r", encoding="utf-8") as f:
        examples = json.load(f)

    structured = [transform_example(ex) for ex in examples]

    with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
        json.dump(structured, f, ensure_ascii=False, indent=2)

    print(f"Wrote {len(structured)} examples to {OUTPUT_PATH}")

if __name__ == "__main__":
    main()


### Evaluate

In [2]:
EVAL_SCRIPT = "/content/drive/MyDrive/llm-sr-project/eval.py"
PREDICTION_PATH = "/content/drive/MyDrive/llm-sr-project/final_results_ensemble.json"
REFERENCE_PATH = "/content/drive/MyDrive/llm-sr-project/test-reference.json"

!python {EVAL_SCRIPT} \
  --prediction {PREDICTION_PATH} \
  --reference {REFERENCE_PATH} \
  --question_threshold 0.95 \
  --statement_threshold 0.9 \
  --relation_threshold 0.9

2025-05-17 15:59:20.723004: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr