# Hybrid Inference Strategy (Ablation): QP Verifier + CP Verifier

**Objective**  
Use two independently fine-tuned DeBERTa verifiers (one for QP, one for CP) alongside LoRA-adapted LLaMA-3 models in a lightweight, two-stage pipeline:

1. **Question Parsing (QP) Stage**  
   - Generate three QP candidates via a fine-tuned LLaMA-3 QP model (beam-search + sampling).  
   - Clean each JSON list of logical constraints (remove “A.”, “Option”, etc.).  
   - Score all three with the QP verifier (DeBERTa-v3); pick the highest-scoring candidate if its score ≥ **THR_QP** (0.75), otherwise fall back to the first beam output.

2. **Chain-of-Thought Parsing (CP) Stage**  
   - Given the selected QP list, generate three CP candidates via a fine-tuned LLaMA-3 CP model (beam-search + sampling).  
   - Clean, dedupe, and normalize each candidate’s `statement`, `evidence`, and `Verification` fields.

3. **Verifier Reranking**  
   - Score each CP candidate with the CP verifier (DeBERTa-v3), averaging the True-class probabilities over its steps.  
   - If the top CP candidate’s average score ≥ **THR_CP** (0.70), select it; otherwise, fall back to the first beam output.

4. **Output**  
   Emit a JSON record per example containing:  
   ```json
   {
     "question": …,
     "question_parsing": …,
     "cot": …,
     "cot_parsing": …,        // verified or fallback candidate
     "answer": …,
     "id": …,
     "sel_idx": …
   }
   ```


---

## Evaluation

| Metric                         | Score  |
|--------------------------------|--------|
| **Question_Macro_F1**          | 0.7321 |
| **Statement_Macro_F1**         | 0.3654 |
| **Statement_Evidence_Macro_F1**| 0.1383 |
| **Reasoning_F1**               | 0.0946 |

> This ablation underperforms the main strategy on Question_Macro_F1 (0.7321 versus 0.7526) and on the other metrics, which suggests that adding a standalone QP verifier before the CP stage can hurt downstream reasoning performance.


## Setup and Thresholds

In [2]:
%%capture
# Install Unsloth for efficient LLM fine-tuning (the cell magic must be the first line of the cell)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

In [1]:
# Install core evaluation utilities
!pip install -q evaluate
!pip install json5

!pip uninstall -y nltk
!pip install -q --upgrade nltk

Collecting json5
  Downloading json5-0.12.0-py3-none-any.whl.metadata (36 kB)
Downloading json5-0.12.0-py3-none-any.whl (36 kB)
Installing collected packages: json5
Successfully installed json5-0.12.0
Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1

In [1]:
import nltk
nltk.download("punkt_tab")
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
import unsloth  
import torch, gc, json, re, ast, html, numpy as np
from torch.nn.functional import log_softmax
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    pipeline
)
from datasets import load_dataset
from collections import Counter


#  Paths & thresholds
INPUT       = "/content/drive/MyDrive/llm-sr-project/testingData-blank.json"
#OUTPUT      = "/content/drive/MyDrive/llm-sr-project/results_hybrid_approach.json"
OUTPUT      = "/content/drive/MyDrive/llm-sr-project/results_hybrid_approach_with2verifiers.json"
QP_LM_PATH  = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_question_parsing"
CP_LM_PATH  = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_cot_parsing"
QP_VER_PATH = "/content/drive/MyDrive/deberta-qparse-verifier"
CP_VER_PATH = "/content/drive/MyDrive/deberta-cotparse-verifier"


THR_QP = 0.75
THR_CP = 0.70
#THR_CP = 0.80
device = "cuda" if torch.cuda.is_available() else "cpu"

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.3.0+cu121 with CUDA 1201 (you have 2.6.0+cu124)
    Python  3.11.9 (you have 3.11.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!


## Prompt Templates

In [3]:
# In-Context Learning (ICL) Prompts

QP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project. This assignment must satisfy:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

If A works on Beta, which of the following must be true?
A. B works on Alpha
B. C works on Beta
C. D works on Alpha
D. F works on Beta

The parsing result is:

[
  "There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project.",
  "If A works on Alpha, then B works on Beta",
  "If C works on Alpha, then D and E work on Beta",
  "F works on a different project than E",
  "D must work on a different project than A",
  "If F works on Alpha, then B works on Alpha",
  "A works on Beta"
]
'''

QP_TEMPLATE = '''Given a question, extract all relevant information from the question that would help to solve it.

This includes:
- General setup information (e.g., number of people, projects involved)
- Explicit facts given in the question
- All logical constraints or conditions

Output only a JSON list and nothing else. Follow the format shown in the example.

Example:

{demon}

Now, the question is:

{question}

Your output:
'''

CP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. Each person works on exactly one project.

Conditions:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

Question:
If A works on Beta, which of the following must be true?

CoT:
Since A works on Beta, Condition (1) is not triggered. Condition (2) is not triggered since C's assignment is unknown. Condition (3) doesn't give anything because E's assignment is unspecified. Condition (4) says D must work on a different project than A, so D must work on Alpha. Condition (5) depends on F, which is unknown.

Parsing result:

[
  {
    "statement": "Condition (1) is not applicable",
    "evidence": "Condition (1): If A works on Alpha, then B works on Beta. | A is working on Beta",
    "Verification": "false"
  },
  {
    "statement": "Condition (2) is not applicable",
    "evidence": "Condition (2): If C works on Alpha, then D and E work on Beta. | C's assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "Condition (3) does not provide any info",
    "evidence": "Condition (3): F works on a different project than E. | E's assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "D must work on Alpha",
    "evidence": "Condition (4): D must work on a different project than A, and A is working on Beta",
    "Verification": "true"
  },
  {
    "statement": "Condition (5) is not applicable",
    "evidence": "Condition (5): If F works on Alpha, then B works on Alpha. | F's assignment is unknown",
    "Verification": "false"
  }
]
'''


CP_TEMPLATE = '''You are a reasoning assistant. Based on the question, conditions, and chain-of-thought (CoT), extract every inference or non-inference step as a JSON object.

For each CoT sentence that either:
  1. Refers to a condition (e.g. "Condition (2) …")
  2. Starts with an inference cue ("Since", "Therefore", "This means", "We can deduce", etc.)

Produce one object with:
  • "statement": the new claim you read in that CoT sentence (don't quote the entire sentence—just the core inference).
  • "evidence":
      – if the claim restates a constraint, use the exact line from the **Conditions** block,
      – otherwise, use the CoT fragment that you extracted it from.
  • "Verification":
      – `"false"` if the sentence rejects or blocks a condition (contains "not applicable", "does not provide", etc.),
      – otherwise `"true"`.

Keep the objects in the same order as they appear in the CoT.

Example:

{demon}

Now, given:

Question:
{question}

Conditions:
{conditions}

Chain-of-Thought:
{cot}

Your output:
'''
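
As a minimal sketch of how these templates are filled, the snippet below applies `str.format` to a shortened stand-in template. `MINI_TEMPLATE`, `MINI_DEMON`, and `build_prompt` are hypothetical names used only for illustration; they are not part of the notebook. The sketch also shows why the literal braces inside the JSON demonstration cause no trouble: `str.format` interprets braces in the template itself, not in the values substituted into it.

```python
# Hypothetical, shortened stand-ins for CP_TEMPLATE and CP_DEMON.
MINI_TEMPLATE = "Example:\n\n{demon}\n\nQuestion:\n{question}\n\nYour output:\n"
MINI_DEMON = '[\n  {"statement": "D must work on Alpha", "Verification": "true"}\n]'


def build_prompt(template: str, **fields: str) -> str:
    """Fill the named placeholders of a prompt template."""
    return template.format(**fields)


prompt = build_prompt(
    MINI_TEMPLATE,
    demon=MINI_DEMON,
    question="If A works on Beta, which of the following must be true?",
)

# Braces that arrive through a substituted value are copied verbatim.
print('{"statement": "D must work on Alpha"' in prompt)
```

If a template ever needs a literal brace of its own (outside a placeholder), it has to be doubled (`{{` and `}}`); the templates above avoid this by keeping all JSON inside the demonstration strings.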

## Helper Functions

In [4]:
def clean_quotes(t):
    # Map typographic (curly) quotes to their ASCII equivalents.
    return (t.replace('\u201c', '"').replace('\u201d', '"')
             .replace('\u2018', "'").replace('\u2019', "'"))

def normalize_text(t):
    t = clean_quotes(t)
    t = re.sub(r'\?\s(?=[A-Z])', ', ', t)
    t = re.sub(r'(?<=[a-zA-Z])\.(?=[A-Z])', '. ', t)
    t = re.sub(r'(?<![A-Da-d])\\n(?!\s?[A-Da-d]\\.)', ' ', t)
    return html.unescape(t).strip()

def extract_json(raw):
    raw = raw.strip()
    i = raw.find('[')
    if i < 0: return []
    depth = 0
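    for j, ch in enumerate(raw[i:], i):
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        if depth == 0:
            # First balanced [...] block found: try strict JSON, then Python literal syntax.
            blk = raw[i:j+1]
            for parser in (json.loads, ast.literal_eval):
                try:
                    return parser(blk)
                except Exception:
                    pass
            return []
    return []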

def score_verifier_batch(prem_list, hyp_list, tok, mod):
    enc = tok(prem_list, hyp_list, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        logits = mod(**enc).logits
    return torch.softmax(logits, dim=1)[:, 1].tolist()

def clean_qp(qp_list):
    return [s for s in qp_list if not re.match(r'^[A-Da-d][\.:\)]', s.strip()) and "Option" not in s and "following" not in s]
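
To illustrate what `clean_qp` keeps and drops, here is a self-contained sketch that repeats the same filter on a made-up candidate list (the sample strings are illustrative, not real model output). Note that the filter is substring-based, so any constraint that merely contains "Option" or "following" is dropped as well.

```python
import re


def clean_qp(qp_list):
    """Same filter as above: drop answer-option lines and option/question stems."""
    return [
        s for s in qp_list
        if not re.match(r'^[A-Da-d][\.:\)]', s.strip())
        and "Option" not in s
        and "following" not in s
    ]


# Illustrative candidate list, shaped like a QP model output.
candidate = [
    "F works on a different project than E",
    "A. B works on Alpha",                   # answer option, dropped
    "Option C is correct",                   # contains "Option", dropped
    "Which of the following must be true?",  # question stem, dropped
    "A works on Beta",                       # "A" followed by a space, kept
]

print(clean_qp(candidate))
```

Only the first and last entries survive, because the regular expression requires a `.`, `:` or `)` directly after the option letter.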

## Load Models and Verifiers

In [5]:
# QP LM: deterministic beam search by default; process_one() overrides this with do_sample=True and num_return_sequences=3
qp_tok = AutoTokenizer.from_pretrained(QP_LM_PATH)
qp_tok.model_max_length = 1024
qp_mod = AutoModelForCausalLM.from_pretrained(QP_LM_PATH).to(device)
qp_pipe = pipeline("text-generation", model=qp_mod, tokenizer=qp_tok,
                   return_full_text=False, do_sample=False,
                   num_beams=5, early_stopping=True,
                   max_new_tokens=512, batch_size=4)

# CP LM: beam search with multinomial sampling, returning 3 candidates per prompt
cp_tok = AutoTokenizer.from_pretrained(CP_LM_PATH)
cp_tok.model_max_length = 2048
cp_mod = AutoModelForCausalLM.from_pretrained(CP_LM_PATH).to(device)
cp_pipe = pipeline("text-generation", model=cp_mod, tokenizer=cp_tok,
                   return_full_text=False, do_sample=True, temperature=0.7,
                   num_beams=5, early_stopping=True, num_return_sequences=3,
                   max_new_tokens=1024, batch_size=4)


# Load the QP and CP verifiers
def load_verifiers():
    qv_tok = AutoTokenizer.from_pretrained(QP_VER_PATH)
    qv_mod = AutoModelForSequenceClassification.from_pretrained(QP_VER_PATH).to(device)
    cv_tok = AutoTokenizer.from_pretrained(CP_VER_PATH)
    cv_mod = AutoModelForSequenceClassification.from_pretrained(CP_VER_PATH).to(device)
    return qv_tok, qv_mod, cv_tok, cv_mod


qv_tok, qv_mod, cv_tok, cv_mod = load_verifiers()

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Device set to use cuda:0


## Hybrid Inference

In [None]:
def process_one(example):
    q_raw, cot_raw = example["question"], example["cot"]
    sel_idx, ans   = example.get("sel_idx"), example.get("answer")
    q, cot         = normalize_text(q_raw), normalize_text(cot_raw)

    # 1) Generate 3 QP candidates and rerank with QP verifier
    prompt = QP_TEMPLATE.format(demon=QP_DEMON, question=q)
    raw_qp_list = qp_pipe(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        num_return_sequences=3
    )

    qp_lists = []
    for item in raw_qp_list:
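        candidate = extract_json(item["generated_text"])
        # Keep only string entries so clean_qp's regex never sees dicts or numbers.
        cleaned   = clean_qp([s for s in candidate if isinstance(s, str)])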
        qp_lists.append(cleaned)

    qp_premises = [q] * len(qp_lists)
    qp_hypotheses = [
        f"QuestionParsing: {json.dumps(qp_json, ensure_ascii=False)}"
        for qp_json in qp_lists
    ]
    qp_scores = score_verifier_batch(qp_premises, qp_hypotheses, qv_tok, qv_mod)

    best_qp_idx = int(np.argmax(qp_scores))
    if qp_scores[best_qp_idx] < THR_QP:
        best_qp = qp_lists[0]
    else:
        best_qp = qp_lists[best_qp_idx]

    # 2) CP: generate 3 candidate parses (beam search with sampling)
    conds_str = json.dumps(best_qp, ensure_ascii=False)
    prompt_cp = CP_TEMPLATE.format(
        demon      = CP_DEMON,
        question   = q,
        conditions = conds_str,
        cot        = cot
    )

    raw_cp_outs = cp_pipe(prompt_cp, max_new_tokens=1024)
    raw_cp_flat = [
        item
        for sub in raw_cp_outs
        for item in (sub if isinstance(sub, list) else [sub])
    ]

    # 3) Parse & clean each CP candidate
    cps_candidates = []
    for out in raw_cp_flat:
        parsed = extract_json(out["generated_text"])
        if not parsed:
            continue
        seen = set()
        cleaned = []
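        for st in parsed:
            if not isinstance(st, dict):
                continue  # skip malformed entries such as bare strings
            stmt = str(st.get("statement") or "").strip()
            ev   = str(st.get("evidence") or "").strip() or "logical deduction"
            ver  = str(st.get("Verification","true")).lower()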
            if len(stmt) < 3 or (stmt, ev) in seen:
                continue
            seen.add((stmt, ev))
            cleaned.append({
                "statement": stmt,
                "evidence":  ev,
                "Verification": ver
            })
        if cleaned:
            cps_candidates.append(cleaned)

    if not cps_candidates:
        cps_candidates = [[]]

    # 4) Score each CP candidate with CP verifier 
    premise = (
        f"Question:\n{q}\n\n"
        f"Conditions:\n" + "\n".join(f"- {s}" for s in best_qp) +
        f"\n\nCoT:\n{cot}"
    )
    avg_scores = []
    for cp_list in cps_candidates:
        if not cp_list:
            avg_scores.append(0.0)
            continue
        prems = [premise] * len(cp_list)
        hyps  = [
            f"Statement: {st['statement']}\nBased on: {st['evidence']}"
            for st in cp_list
        ]
        scores = score_verifier_batch(prems, hyps, cv_tok, cv_mod)
        avg_scores.append(sum(scores) / len(scores))

    # 5) Pick best CP candidate (with threshold fallback)
    best_idx = int(np.argmax(avg_scores))
    if avg_scores[best_idx] < THR_CP:
        best_idx = 0
    best_cp = cps_candidates[best_idx]

    return {
        "question":         q_raw,
        "question_parsing": best_qp,
        "answer":           ans,
        "id":               example["id"],
        "cot":              cot_raw,
        "cot_parsing":      best_cp,
        "sel_idx":          sel_idx
    }
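
The threshold-with-fallback rule used in both stages can be read in isolation. The sketch below is a hypothetical helper (`select_with_fallback` is not defined in the notebook) that reproduces the selection logic of `process_one` on plain Python lists: take the highest-scoring candidate if it clears the threshold, otherwise return index 0. The scores are made up for illustration.

```python
from statistics import mean


def select_with_fallback(scores, threshold, fallback_idx=0):
    """Return the index of the best score, or fallback_idx if it is below threshold."""
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return best_idx if scores[best_idx] >= threshold else fallback_idx


# Made-up per-step True-class probabilities for three CP candidates.
step_scores = [
    [0.60, 0.70],        # candidate 0 (first returned sequence)
    [0.90, 0.80, 0.85],  # candidate 1
    [0.40],              # candidate 2
]
avg_scores = [mean(s) for s in step_scores]

print(select_with_fallback(avg_scores, threshold=0.70))  # candidate 1 clears 0.70
print(select_with_fallback(avg_scores, threshold=0.90))  # nothing clears 0.90, so index 0
```

Because the score is a plain average, a candidate with many mediocre steps can lose to a candidate with a single confident step; that is a property of the averaging rule, not of this sketch.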

## Batch and Run

In [7]:
def process_batch(batch):
    outs = [process_one({
        "question": batch["question"][i],
        "cot":       batch["cot"][i],
        "id":        batch["id"][i],
        "sel_idx":   batch.get("sel_idx", [None]*len(batch["id"]))[i],
        "answer":    batch.get("answer", [None]*len(batch["id"]))[i],
    }) for i in range(len(batch["question"]))]

    return {
        "question":        [o["question"]        for o in outs],
        "question_parsing":[o["question_parsing"]for o in outs],
        "answer":          [o["answer"]          for o in outs],
        "id":              [o["id"]              for o in outs],
        "cot":             [o["cot"]             for o in outs],
        "cot_parsing":     [o["cot_parsing"]     for o in outs],
        "sel_idx":         [o["sel_idx"]         for o in outs],
    }

if __name__=="__main__":
    gc.collect()
    device = "cuda" if torch.cuda.is_available() else "cpu"

    ds = load_dataset("json", data_files={"test": INPUT})["test"]

    out_ds = ds.map(
        process_batch,
        batched=True,
        batch_size=2,
        remove_columns=ds.column_names
    )

    out_ds.to_json(OUTPUT, orient="records", lines=False)
    print("✅ Done — saved to", OUTPUT)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



✅ Done — saved to /content/drive/MyDrive/llm-sr-project/results_hybrid_approach_with2verifiers.json
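
With `batched=True`, `Dataset.map` passes the function a dict of columns (each a list) rather than a list of rows, which is why `process_batch` indexes every column by `i` and reassembles columns at the end. A minimal sketch of that round trip, using the hypothetical helper names `rows_from_batch` and `batch_from_rows` and no external libraries:

```python
def rows_from_batch(batch):
    """Turn {"col": [v0, v1, ...]} into [{"col": v0}, {"col": v1}, ...]."""
    keys = list(batch)
    n = len(batch[keys[0]]) if keys else 0
    return [{k: batch[k][i] for k in keys} for i in range(n)]


def batch_from_rows(rows):
    """Inverse of rows_from_batch; assumes every row has the same keys."""
    keys = list(rows[0]) if rows else []
    return {k: [r[k] for r in rows] for k in keys}


batch = {"id": [1, 2], "question": ["q1", "q2"], "cot": ["c1", "c2"]}
rows = rows_from_batch(batch)

print(rows[0])
print(batch_from_rows(rows) == batch)
```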


## Transform Predictions

In [8]:
import json

INPUT_PATH  = "/content/drive/MyDrive/llm-sr-project/results_hybrid_approach_with2verifiers.json"
OUTPUT_PATH = "/content/drive/MyDrive/llm-sr-project/final_results_hybrid_approach_with2verifiers.json"

def transform_example(ex):
    # reorder each cot_parsing entry: statement → evidence → Verification
    reordered = []
    for step in ex.get("cot_parsing", []):
        reordered.append({
            "statement":    step.get("statement"),
            "evidence":     step.get("evidence"),
            "Verification": step.get("Verification"),
        })

    return {
        "question":         ex.get("question"),
        "question_parsing": ex.get("question_parsing"),
        "answer":           ex.get("answer"),
        "id":               ex.get("id"),
        "cot":              ex.get("cot"),
        "cot_parsing":      reordered,
        "sel_idx":          ex.get("sel_idx"),
    }

def main():
    with open(INPUT_PATH, "r", encoding="utf-8") as f:
        examples = json.load(f)

    structured = [transform_example(ex) for ex in examples]

    with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
        json.dump(structured, f, ensure_ascii=False, indent=2)

    print(f"Wrote {len(structured)} examples to {OUTPUT_PATH}")

if __name__ == "__main__":
    main()

Wrote 24 examples to /content/drive/MyDrive/llm-sr-project/final_results_hybrid_approach_with2verifiers.json
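
The reordering step relies on two facts: Python dicts preserve insertion order (guaranteed since Python 3.7), and `json.dumps` writes keys in that order unless `sort_keys=True` is passed. A small sketch with a made-up step:

```python
import json

# Made-up step whose keys arrive in a different order.
step = {
    "Verification": "true",
    "evidence": "Condition (4)",
    "statement": "D must work on Alpha",
}

# Rebuild the dict in the desired key order, as transform_example does.
reordered = {
    "statement":    step.get("statement"),
    "evidence":     step.get("evidence"),
    "Verification": step.get("Verification"),
}

print(json.dumps(reordered))
```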


## Evaluate

In [9]:
EVAL_SCRIPT = "/content/drive/MyDrive/llm-sr-project/eval.py"
PREDICTION_PATH = "/content/drive/MyDrive/llm-sr-project/final_results_hybrid_approach_with2verifiers.json"
REFERENCE_PATH = "/content/drive/MyDrive/llm-sr-project/test-reference.json"

!python {EVAL_SCRIPT} \
  --prediction {PREDICTION_PATH} \
  --reference {REFERENCE_PATH} \
  --question_threshold 0.95 \
  --statement_threshold 0.9 \
  --relation_threshold 0.9

2025-06-01 17:38:24.824655: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748799504.846218   16056 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748799504.852727   16056 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
config.json: 100% 1.05k/1.05k [00:00<00:00, 8.52MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
model.safetensors: 100% 738M/738M [00:02<00:00, 271MB/s]
tokenizer_config.json: 100% 1.28k/1.28k [00:00<00:00, 9.40MB/s]
spm.model: 100% 2.46M/2.46M [0