# 13. Hybrid Inference Strategy 3: Enhanced CoT Cleaning & Evidence Boosting

**Objective**  
Improve Statement–Evidence alignment and consistency in CoT parsing by adding rule-based post-processing on top of the beam+sampled reranking pipeline. The pipeline has five stages:

1. **Question Parsing (QP) Stage**  
   - Same as v2: single deterministic JSON list from the LLaMA-3 QP model (beam-search).  
   - No verifier at this stage.

2. **Chain-of-Thought Parsing (CP) Stage**  
   - Generate **5** parses per example (2 beam + 3 sampled) from the LLaMA-3 CP model.  
   - Parse each into `statement`, `evidence`, `Verification`.

3. **Post-Processing & Cleaning**  
   - **Evidence Boosting**: if a step’s `evidence` is missing/too short, extract a better CoT sentence via keyword overlap.  
   - **Verification Normalization**: force every `Verification` to exactly `"true"` or `"false"`.  
   - **Support Check**: for steps still marked `"true"`, mark `"false"` if the `statement` contains no negation word but the `evidence` does, or if the two share fewer than two keywords.  
   - **Reasoning Trace**: append a `reasoning_step` index and optional `related_to` links between steps.

4. **Verifier Reranking**  
   - Score each cleaned candidate by summing **log-probs** of the CP verifier’s “true” outputs across its steps.  
   - Select the candidate with the highest total log-prob (no explicit threshold fallback).

5. **Output**  
   - JSON record per example with keys:  
     ```json
     {
       "question": …,
       "question_parsing": …,
       "cot": …,
       "cot_parsing": …,      // enhanced & verified steps
       "answer": …,
       "id": …,
       "sel_idx": …
     }
     ```
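
The reranking rule in stage 4 can be sketched in isolation. The snippet below is a minimal illustration, assuming each candidate has already been reduced to a list of per-step verifier probabilities; the helper name `rerank_by_sum_logprob` and the probabilities are made up for the example and are not part of the notebook.

```python
import math

def rerank_by_sum_logprob(candidate_probs, eps=1e-12):
    """Return (best_index, totals) for a list of per-step probability lists.

    Hypothetical helper: each inner list holds the verifier's probability
    that a step is supported, one entry per step of a candidate parse.
    """
    # eps guards against log(0) when the verifier returns exactly 0.0
    totals = [sum(math.log(p + eps) for p in probs) for probs in candidate_probs]
    best_index = max(range(len(totals)), key=lambda i: totals[i])
    return best_index, totals

# Made-up probabilities for three candidate parses
candidates = [
    [0.90, 0.80, 0.70],        # three steps, mixed confidence
    [0.60, 0.60],              # two steps, low confidence
    [0.95, 0.90, 0.90, 0.90],  # four steps, high confidence
]
best_index, totals = rerank_by_sum_logprob(candidates)
print(best_index, [round(t, 3) for t in totals])
```

Since every term is the log of a probability, each extra step can only lower the total or leave it roughly unchanged, so a plain sum leans toward shorter candidates. Dividing by the step count would be one way to remove that bias; the pipeline described here uses the raw sum.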

---

## Evaluation

| Metric                         | v3 Score |
|--------------------------------|----------|
| **Question_Macro_F1**          | 0.7658   |
| **Statement_Macro_F1**         | 0.3990   |
| **Statement_Evidence_Macro_F1**| 0.1831   |
| **Reasoning_F1**               | 0.1129   |

> _Evidence alignment improves modestly (+0.010 vs. v2) at a small cost to overall reasoning F1, a trade-off attributed to stricter step filtering and normalization._


## Setup and Thresholds

In [1]:
# Install core evaluation utilities
!pip install -q evaluate
!pip install json5

!pip uninstall -y nltk
!pip install -q --upgrade nltk

Collecting json5
  Downloading json5-0.12.0-py3-none-any.whl.metadata (36 kB)
Downloading json5-0.12.0-py3-none-any.whl (36 kB)
Installing collected packages: json5
Successfully installed json5-0.12.0
Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1

In [2]:
import nltk
nltk.download("punkt_tab")
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
import unsloth  # Must come first for 4-bit LoRA
import torch, gc, json, re, ast, html, numpy as np
from torch.nn.functional import log_softmax
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    pipeline
)
from datasets import load_dataset
from collections import Counter
import math

# Paths & thresholds
INPUT       = "/content/drive/MyDrive/llm-sr-project/testingData-blank.json"
OUTPUT      = "/content/drive/MyDrive/llm-sr-project/results_hybrid_approach4.json"
QP_LM_PATH  = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_question_parsing"
CP_LM_PATH  = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_cot_parsing"
QP_VER_PATH = "/content/drive/MyDrive/deberta-qparse-verifier"
CP_VER_PATH = "/content/drive/MyDrive/deberta-cotparse-verifier"


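# Verifier thresholds carried over from earlier versions; the code in this
# section selects by highest summed log-prob and never reads them
THR_QP = 0.75
THR_CP = 0.60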
device = "cuda" if torch.cuda.is_available() else "cpu"

## Prompt Templates

In [None]:
# In-Context Learning (ICL) Prompts

QP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project. This assignment must satisfy:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

If A works on Beta, which of the following must be true?
A. B works on Alpha
B. C works on Beta
C. D works on Alpha
D. F works on Beta

The parsing result is:

[
  "There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project.",
  "If A works on Alpha, then B works on Beta",
  "If C works on Alpha, then D and E work on Beta",
  "F works on a different project than E",
  "D must work on a different project than A",
  "If F works on Alpha, then B works on Alpha",
  "A works on Beta"
]
'''


QP_TEMPLATE = '''Given a question, extract all relevant information from the question that would help to solve it.

This includes:
- General setup information (e.g., number of people, projects involved)
- Explicit facts given in the question
- All logical constraints or conditions

Output only a JSON list and nothing else. Follow the format shown in the example.

Example:

{demon}

Now, the question is:

{question}

Your output:
'''

CP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. Each person works on exactly one project.

Conditions:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

Question:
If A works on Beta, which of the following must be true?

CoT:
Since A works on Beta, Condition (1) is not triggered. Condition (2) is not triggered since C's assignment is unknown. Condition (3) doesn't give anything because E's assignment is unspecified. Condition (4) says D must work on a different project than A, so D must work on Alpha. Condition (5) depends on F, which is unknown.

Parsing result:

[
  {
    "statement": "Condition (1) is not applicable",
    "evidence": "Condition (1): If A works on Alpha, then B works on Beta. | A is working on Beta",
    "Verification": "false"
  },
  {
    "statement": "Condition (2) is not applicable",
    "evidence": "Condition (2): If C works on Alpha, then D and E work on Beta. | C's assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "Condition (3) does not provide any info",
    "evidence": "Condition (3): F works on a different project than E. | E's assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "D must work on Alpha",
    "evidence": "Condition (4): D must work on a different project than A, and A is working on Beta",
    "Verification": "true"
  },
  {
    "statement": "Condition (5) is not applicable",
    "evidence": "Condition (5): If F works on Alpha, then B works on Alpha. | F's assignment is unknown",
    "Verification": "false"
  }
]
'''

CP_TEMPLATE = '''You are a reasoning assistant. Based on the question, conditions, and chain-of-thought (CoT), extract every inference or non-inference step as a JSON object.

For each CoT sentence that either:
  1. Refers to a condition (e.g. "Condition (2) …")
  2. Starts with an inference cue ("Since", "Therefore", "This means", "We can deduce", etc.)

Produce one object with:
  • "statement": the new claim you read in that CoT sentence (don't quote the entire sentence—just the core inference).
  • "evidence":
      – if the claim restates a constraint, use the exact line from the **Conditions** block,
      – otherwise, use the CoT fragment that you extracted it from.
  • "Verification":
      – MUST BE EXACTLY `"false"` if the sentence rejects or blocks a condition (contains "not applicable", "does not provide", etc.),
      – MUST BE EXACTLY `"true"` in all other cases.

Keep the objects in the same order as they appear in the CoT.

IMPORTANT: "Verification" field MUST ONLY contain the string "true" or "false" (lowercase) and nothing else.

Example:

{demon}

Now, given:

Question:
{question}

Conditions:
{conditions}

Chain-of-Thought:
{cot}

Your output:
'''
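
How the templates are filled can be shown on a shortened stand-in. The sketch below assumes the same placeholder names as `CP_TEMPLATE`; the `TEMPLATE` string and the sample values are illustrative only.

```python
import json

# Shortened stand-in for CP_TEMPLATE; placeholder names match the notebook
TEMPLATE = (
    "Example:\n\n{demon}\n\n"
    "Conditions:\n{conditions}\n\n"
    "Chain-of-Thought:\n{cot}\n\n"
    "Your output:\n"
)

# The demonstration contains literal braces (JSON objects). That is safe here:
# str.format interprets braces in the template only, not in substituted values.
demon = '[\n  {"statement": "D must work on Alpha", "Verification": "true"}\n]'

# Conditions are passed as a JSON-encoded list, as process_one does
conditions = json.dumps(
    ["A works on Beta", "D must work on a different project than A"],
    ensure_ascii=False,
)

prompt = TEMPLATE.format(
    demon=demon,
    conditions=conditions,
    cot="Since A works on Beta, D must work on Alpha.",
)
print(prompt)
```

The reverse case does need care: a literal `{` or `}` written directly into a template string would have to be doubled (`{{`, `}}`) before calling `.format`.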

## Helper Functions

In [None]:
def clean_quotes(t):
    # Map typographic (curly) quotes to their ASCII equivalents
    return (t.replace('\u201c', '"').replace('\u201d', '"')
             .replace('\u2018', "'").replace('\u2019', "'"))

def normalize_text(t):
    t = clean_quotes(t)
    t = re.sub(r'\?\s(?=[A-Z])', ', ', t)
    t = re.sub(r'(?<=[a-zA-Z])\.(?=[A-Z])', '. ', t)
    t = re.sub(r'(?<![A-Da-d])\\n(?!\s?[A-Da-d]\\.)', ' ', t)
    return html.unescape(t).strip()

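def extract_json(raw):
    """Return the first balanced [...] block in raw, parsed ([] on failure)."""
    raw = raw.strip()
    i = raw.find('[')
    if i < 0:
        return []
    depth = 0
    for j, ch in enumerate(raw[i:], i):
        # Note: brackets inside string values are counted as well
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        if depth == 0:
            blk = raw[i:j+1]
            # Try strict JSON first, then Python-literal syntax (e.g. single quotes)
            for p in (json.loads, ast.literal_eval):
                try:
                    return p(blk)
                except Exception:
                    pass
            return []
    return []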

def score_verifier_batch(prem_list, hyp_list, tok, mod):
    enc = tok(prem_list, hyp_list, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        logits = mod(**enc).logits
    return torch.softmax(logits, dim=1)[:, 1].tolist()

def clean_qp(qp_list):
    return [s for s in qp_list if not re.match(r'^[A-Da-d][\.:\)]', s.strip()) and "Option" not in s and "following" not in s]

# New function to validate and enhance evidence quality
def enhance_evidence(statement, evidence, cot, question):
    if not evidence or len(evidence.strip()) < 15:
        # Extract evidence from CoT if missing or too short
        statement_kw = set(re.findall(r'\b[a-zA-Z]{3,}\b', statement.lower()))
        sentences = re.split(r'[.!?]', cot)

        # Find a better evidence sentence based on keyword overlap
        best_score = 0
        best_sent = ""
        for sent in sentences:
            sent = sent.strip()
            if len(sent) < 10:
                continue
            sent_kw = set(re.findall(r'\b[a-zA-Z]{3,}\b', sent.lower()))
            overlap = len(statement_kw.intersection(sent_kw))
            if overlap > best_score:
                best_score = overlap
                best_sent = sent

        if best_score >= 2:
            return best_sent + "."

    return evidence

# New function to ensure verification is strictly "true" or "false"
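def normalize_verification(verification):
    # JSON booleans (true/false) arrive as Python bools; map them directly
    if isinstance(verification, bool):
        return "true" if verification else "false"

    if verification is None or not str(verification).strip():
        return "true"  # Default to true when the field is missing or empty

    verification = str(verification).lower().strip()

    # Detect negation patterns that indicate false
    if any(neg in verification for neg in ["false", "not", "cannot", "doesn't", "unlikely", "invalid", "incorrect"]):
        return "false"

    # Default to true for all other cases
    return "true"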

# New function to verify statement is properly supported by evidence
def verify_statement_evidence(statement, evidence):
    if not statement or not evidence:
        return "false"

    statement_kw = set(re.findall(r'\b[a-zA-Z]{3,}\b', statement.lower()))
    evidence_kw = set(re.findall(r'\b[a-zA-Z]{3,}\b', evidence.lower()))

    # Check for strong evidence support - keyword overlap
    overlap = len(statement_kw.intersection(evidence_kw))

    # Check for contradictions
    negations = ["not", "cannot", "doesn't", "don't", "isn't", "aren't", "won't", "wouldn't"]

    # If statement affirms but evidence negates, mark as false
    statement_affirms = all(neg not in statement.lower().split() for neg in negations)
    evidence_negates = any(neg in evidence.lower().split() for neg in negations)

    if statement_affirms and evidence_negates:
        return "false"

    # If good overlap and no contradictions, mark as true
    if overlap >= 2:
        return "true"

    return "false"
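
The keyword-overlap search behind evidence boosting can be tried on the demonstration CoT. The snippet is a self-contained sketch that mirrors the logic of `enhance_evidence`; the helper names `keywords` and `best_overlap_sentence` are introduced here for illustration only.

```python
import re

def keywords(text):
    """Lowercased alphabetic tokens of three or more letters."""
    return set(re.findall(r"\b[a-zA-Z]{3,}\b", text.lower()))

def best_overlap_sentence(statement, cot, min_overlap=2):
    """Return the CoT sentence sharing the most keywords with the statement,
    or None if no sentence reaches min_overlap shared keywords."""
    target = keywords(statement)
    best_score, best_sent = 0, None
    for sent in re.split(r"[.!?]", cot):
        sent = sent.strip()
        if len(sent) < 10:  # skip fragments, as enhance_evidence does
            continue
        score = len(target & keywords(sent))
        if score > best_score:
            best_score, best_sent = score, sent
    return best_sent if best_score >= min_overlap else None

cot = (
    "Since A works on Beta, Condition (1) is not triggered. "
    "Condition (4) says D must work on a different project than A, "
    "so D must work on Alpha."
)
boosted = best_overlap_sentence("D must work on Alpha", cot)
no_match = best_overlap_sentence("E is unassigned", cot)
print(boosted)
print(no_match)
```

Matching is on exact tokens, so inflected forms such as "works" and "work" do not count as overlap, and single-letter entity names (A, B, D) are ignored by the three-letter minimum.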

## Load Models and Verifiers

In [None]:
# QP LM - Using beam search
qp_tok = AutoTokenizer.from_pretrained(QP_LM_PATH)
qp_tok.model_max_length = 1024
qp_mod = AutoModelForCausalLM.from_pretrained(QP_LM_PATH).to(device)
qp_pipe = pipeline("text-generation", model=qp_mod, tokenizer=qp_tok,
                   return_full_text=False, do_sample=False,
                   num_beams=5, early_stopping=True,
                   max_new_tokens=512, batch_size=4)

# CP LM - shared by the beam-search and sampling pipelines below
cp_tok = AutoTokenizer.from_pretrained(CP_LM_PATH)
cp_tok.model_max_length = 2048
cp_mod = AutoModelForCausalLM.from_pretrained(CP_LM_PATH).to(device)

# Generate 2 beam candidates + 3 sampled candidates = 5 total
cp_pipe = pipeline(
    "text-generation",
    model=cp_mod, tokenizer=cp_tok,
    return_full_text=False,
    do_sample=False, num_beams=5, num_return_sequences=2,
    max_new_tokens=1024,
    batch_size=4
)

cp_sampler = pipeline(
    "text-generation",
    model=cp_mod, tokenizer=cp_tok,
    return_full_text=False,
    do_sample=True, temperature=0.8, num_return_sequences=3,
    max_new_tokens=1024,
    batch_size=4
)

# Load Verifiers
def load_verifiers():
    global qv_tok, qv_mod, cv_tok, cv_mod
    qv_tok = AutoTokenizer.from_pretrained(QP_VER_PATH)
    qv_mod = AutoModelForSequenceClassification.from_pretrained(QP_VER_PATH).to(device)
    cv_tok = AutoTokenizer.from_pretrained(CP_VER_PATH)
    cv_mod = AutoModelForSequenceClassification.from_pretrained(CP_VER_PATH).to(device)
    return qv_tok, qv_mod, cv_tok, cv_mod

qv_tok, qv_mod, cv_tok, cv_mod = load_verifiers()
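
`score_verifier_batch` turns the verifier's two-class logits into a probability by taking column 1 of the softmax. The pure-Python sketch below shows that step on made-up logits; treating index 1 as the "supported" label is an assumption read off the notebook code, since the verifier's label mapping is not shown here.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up two-class logits for three (premise, hypothesis) pairs
logits = [[-1.2, 2.3], [0.4, 0.1], [3.0, -2.0]]

# Assumption: index 1 is the "supported" class, as in score_verifier_batch
p_supported = [softmax(row)[1] for row in logits]
print([round(p, 3) for p in p_supported])
```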

## Hybrid Inference Function

In [None]:
def process_one(example):
    q_raw, cot_raw = example["question"], example["cot"]
    sel_idx, ans = example.get("sel_idx"), example.get("answer")
    q, cot = normalize_text(q_raw), normalize_text(cot_raw)

    # Single deterministic QP output
    prompt = QP_TEMPLATE.format(demon=QP_DEMON, question=q)
    raw_qp = qp_pipe(prompt, max_new_tokens=512)
    if not isinstance(raw_qp, list):
        raw_qp = [raw_qp]
    best_qp = clean_qp(extract_json(raw_qp[0]["generated_text"]))

    # 2) CP: build the prompt from the parsed conditions
    conds_str = json.dumps(best_qp, ensure_ascii=False)
    prompt_cp = CP_TEMPLATE.format(
        demon      = CP_DEMON,
        question   = q,
        conditions = conds_str,
        cot        = cot
    )

    # 2.a) get 5 raw outputs: 2 from beam, 3 from sampler
    raw_beams = cp_pipe(prompt_cp, max_new_tokens=1024)
    raw_samples = cp_sampler(prompt_cp, max_new_tokens=1024)
    raw_cp_outs = (raw_beams if isinstance(raw_beams, list) else [raw_beams]) \
            + (raw_samples if isinstance(raw_samples, list) else [raw_samples])

    # flatten HF's list-of-lists (if any)
    raw_cp_flat = [
        item
        for sub in raw_cp_outs
        for item in (sub if isinstance(sub, list) else [sub])
    ]

    # 2.b) parse & clean each candidate
    cps_candidates = []
    for out in raw_cp_flat:
        parsed = extract_json(out["generated_text"])
        if not parsed:
            continue
        seen = set()
        cleaned = []
        for st in parsed:
            # Fix: Add type checking to handle unexpected objects
            if not isinstance(st, dict):
                continue

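            stmt = str(st.get("statement") or "").strip()
            ev   = str(st.get("evidence") or "").strip()

            # Boost missing/short evidence first; only then fall back to the
            # placeholder (applying the placeholder first would hide missing
            # evidence from the length check in enhance_evidence)
            ev = enhance_evidence(stmt, ev, cot, q) or "logical deduction"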

            # Normalize verification to strictly "true" or "false"
            ver_orig = st.get("Verification","true")
            ver = normalize_verification(ver_orig)

            # Check if statement is properly supported by evidence
            if ver == "true":
                ver = verify_statement_evidence(stmt, ev)

            if len(stmt) < 3 or (stmt,ev) in seen:
                continue

            seen.add((stmt,ev))
            cleaned.append({"statement":stmt,"evidence":ev,"Verification":ver})

        if cleaned:
            cps_candidates.append(cleaned)

    # fallback if nothing survived
    if not cps_candidates:
        cps_candidates = [[]]

    # 3) Score each candidate with your CP verifier
    premise = f"Question:\n{q}\n\nConditions:\n" + "\n".join(f"- {s}" for s in best_qp) + f"\n\nCoT:\n{cot}"
    avg_scores = []
    for cp_list in cps_candidates:
        if not cp_list:
            avg_scores.append(0.0)
            continue
        prems = [premise]*len(cp_list)
        hyps  = [f"Statement: {st['statement']}\nBased on: {st['evidence']}" for st in cp_list]
        scores = score_verifier_batch(prems, hyps, cv_tok, cv_mod)  # list of prob (0–1)
        # convert to log‐probs and sum
        sum_logprob = sum(math.log(s + 1e-12) for s in scores)
        avg_scores.append(sum_logprob)

    # 4) Pick best candidate (highest summed log-prob; no threshold is applied)
    best_idx = int(np.argmax(avg_scores)) if avg_scores else 0
    best_cp = cps_candidates[best_idx] if cps_candidates and best_idx < len(cps_candidates) else []

    # Final post-processing to ensure all verifications are strictly "true" or "false"
    for item in best_cp:
        if item["Verification"] not in ["true", "false"]:
            item["Verification"] = normalize_verification(item["Verification"])

    # Add reasoning steps explicitly
    for i, step in enumerate(best_cp):
        step["reasoning_step"] = i + 1

        # Add references to previous steps when possible
        if i > 0:
            curr_keywords = set(re.findall(r'\b[a-zA-Z]{3,}\b', step["statement"].lower()))
            references = []

            for j in range(i):
                prev_keywords = set(re.findall(r'\b[a-zA-Z]{3,}\b', best_cp[j]["statement"].lower()))
                if len(curr_keywords.intersection(prev_keywords)) >= 2:
                    references.append(j + 1)

            if references:
                step["related_to"] = references

    return {
        "question": q_raw,
        "question_parsing": best_qp,
        "answer": ans,
        "id": example["id"],
        "cot": cot_raw,
        "cot_parsing": best_cp,
        "sel_idx": sel_idx
    }
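
The reasoning-trace step at the end of `process_one` can be exercised on the statements from the CP demonstration. The function below is an illustrative re-implementation, not the notebook's code; `link_related_steps` is a hypothetical name.

```python
import re

def link_related_steps(steps, min_overlap=2):
    """Number steps from 1 and link each to earlier steps sharing keywords."""
    def keywords(text):
        return set(re.findall(r"\b[a-zA-Z]{3,}\b", text.lower()))

    linked = []
    for i, step in enumerate(steps):
        out = dict(step, reasoning_step=i + 1)
        current = keywords(step["statement"])
        # 1-based indices of earlier steps with enough shared keywords
        refs = [
            j + 1
            for j in range(i)
            if len(current & keywords(steps[j]["statement"])) >= min_overlap
        ]
        if refs:
            out["related_to"] = refs
        linked.append(out)
    return linked

steps = [
    {"statement": "Condition (1) is not applicable"},
    {"statement": "Condition (2) is not applicable"},
    {"statement": "Condition (3) does not provide any info"},
    {"statement": "D must work on Alpha"},
]
linked = link_related_steps(steps)
for s in linked:
    print(s["reasoning_step"], s.get("related_to"))
```

In this example the links between the first three steps come from shared boilerplate words ("condition", "not") rather than shared entities, which suggests the `related_to` field is best read as a loose hint.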

## Batch and Run

In [None]:
def process_batch(batch):
    outs = [process_one({
        "question": batch["question"][i],
        "cot":       batch["cot"][i],
        "id":        batch["id"][i],
        "sel_idx":   batch.get("sel_idx", [None]*len(batch["id"]))[i],
        "answer":    batch.get("answer", [None]*len(batch["id"]))[i],
    }) for i in range(len(batch["question"]))]

    return {
        "question":        [o["question"]        for o in outs],
        "question_parsing":[o["question_parsing"]for o in outs],
        "answer":          [o["answer"]          for o in outs],
        "id":              [o["id"]              for o in outs],
        "cot":             [o["cot"]             for o in outs],
        "cot_parsing":     [o["cot_parsing"]     for o in outs],
        "sel_idx":         [o["sel_idx"]         for o in outs],
    }

if __name__=="__main__":
    gc.collect()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    try:
        ds = load_dataset("json", data_files={"test": INPUT})["test"]
        print(f"Loaded dataset with {len(ds)} examples")

        # Add more descriptive logging
        print("Starting batch processing...")
        out_ds = ds.map(
            process_batch,
            batched=True,
            batch_size=2,
            remove_columns=ds.column_names
        )

        print(f"Processing complete. Writing results to {OUTPUT}")
        out_ds.to_json(OUTPUT, orient="records", lines=False)
        print("✅ Done — saved to", OUTPUT)
    except Exception as e:
        print(f"Error occurred: {type(e).__name__}: {e}")
        import traceback
        traceback.print_exc()
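
`process_batch` exists because a batched `map` hands the function a dict of columns rather than a list of examples. The conversion in both directions can be sketched in plain Python; `batch_to_rows` and `rows_to_batch` are hypothetical helper names used only for this illustration.

```python
def batch_to_rows(batch):
    """Turn a dict of equal-length columns into a list of per-example dicts."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*(batch[k] for k in keys))]

def rows_to_batch(rows, keys):
    """Turn a list of per-example dicts back into a dict of columns."""
    return {k: [row[k] for row in rows] for k in keys}

# Made-up two-example batch in the column-oriented layout
batch = {
    "id": [101, 102],
    "question": ["Q1 text", "Q2 text"],
    "cot": ["CoT for Q1", "CoT for Q2"],
}
rows = batch_to_rows(batch)
round_trip = rows_to_batch(rows, list(batch))
print(rows[0])
```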

## Transform Predictions

In [None]:
import json

INPUT_PATH  = "/content/drive/MyDrive/llm-sr-project/results_hybrid_approach4.json"
OUTPUT_PATH = "/content/drive/MyDrive/llm-sr-project/final_results_hybrid_approach4.json"

def transform_example(ex):
    # keep only statement → evidence → Verification, in that order
    # (this drops the reasoning_step / related_to fields added during inference)
    reordered = []
    for step in ex.get("cot_parsing", []):
        reordered.append({
            "statement":    step.get("statement"),
            "evidence":     step.get("evidence"),
            "Verification": step.get("Verification"),
        })

    return {
        "question":         ex.get("question"),
        "question_parsing": ex.get("question_parsing"),
        "answer":           ex.get("answer"),
        "id":               ex.get("id"),
        "cot":              ex.get("cot"),
        "cot_parsing":      reordered,
        "sel_idx":          ex.get("sel_idx"),
    }

def main():
    with open(INPUT_PATH, "r", encoding="utf-8") as f:
        examples = json.load(f)

    structured = [transform_example(ex) for ex in examples]

    with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
        json.dump(structured, f, ensure_ascii=False, indent=2)

    print(f"Wrote {len(structured)} examples to {OUTPUT_PATH}")

if __name__ == "__main__":
    main()
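
Before handing the transformed file to the evaluation script, a quick schema check can catch malformed records. The following is a minimal sketch based on the output keys listed at the top of this section; `validate_record` is a hypothetical helper and the sample records are made up.

```python
REQUIRED_KEYS = {"question", "question_parsing", "answer", "id",
                 "cot", "cot_parsing", "sel_idx"}
STEP_KEYS = ["statement", "evidence", "Verification"]

def validate_record(record):
    """Return a list of problems found in one record; empty means it looks fine."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for n, step in enumerate(record.get("cot_parsing") or [], start=1):
        # Key order matters here because transform_example fixes it
        if list(step) != STEP_KEYS:
            problems.append(f"step {n}: unexpected keys {list(step)}")
        if step.get("Verification") not in ("true", "false"):
            problems.append(f"step {n}: Verification is {step.get('Verification')!r}")
    return problems

good = {
    "question": "q", "question_parsing": ["c1"], "answer": "a", "id": 1,
    "cot": "c", "sel_idx": 0,
    "cot_parsing": [{"statement": "s", "evidence": "e", "Verification": "true"}],
}
bad = {
    "question": "q", "id": 2, "cot": "c",
    "cot_parsing": [{"statement": "s", "evidence": "e", "Verification": "True"}],
}
good_problems = validate_record(good)
bad_problems = validate_record(bad)
print(good_problems, bad_problems)
```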

## Evaluate

In [3]:
EVAL_SCRIPT = "/content/drive/MyDrive/llm-sr-project/eval.py"
PREDICTION_PATH = "/content/drive/MyDrive/llm-sr-project/final_results_hybrid_approach4.json"
REFERENCE_PATH = "/content/drive/MyDrive/llm-sr-project/test-reference.json"

!python {EVAL_SCRIPT} \
  --prediction {PREDICTION_PATH} \
  --reference {REFERENCE_PATH} \
  --question_threshold 0.95 \
  --statement_threshold 0.9 \
  --relation_threshold 0.9

2025-05-17 17:55:11.397483: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr