# 7. Reward-Model-Based Selection for CoT Parsing

In this approach we:

1. **Generate 5 CoT parsing candidates** (2 beam, 3 sampled)  
2. **Score each step** with `OpenAssistant/reward-model-deberta-v3-large-v2`  
3. **Add evidence bonuses** for explicit condition references, verbatim condition quotes, and logical connectors, with a penalty for vague evidence  
4. **Average per-step scores** and pick the highest-scoring candidate  

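This favors candidates whose steps the reward model rates highly and whose evidence cites the conditions explicitly. The selected parses are written to  
`results_reward_model.json`, then restructured into `final_results_reward_model.json` for downstream F1 evaluation.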
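
The selection rule in steps 2 to 4 can be sketched without loading any model. This is a minimal illustration rather than the notebook's implementation: `toy_score` is a hypothetical stand-in for the reward model plus bonuses, and `select_best` is an assumed helper name.

```python
from statistics import mean

def toy_score(step):
    # Hypothetical stand-in for the reward model: favors explicit condition references.
    return 8.0 if "Condition (" in step["evidence"] else 4.0

def select_best(candidates, score_fn):
    # Average the per-step scores of each candidate, then pick the highest average.
    averages = [mean([score_fn(s) for s in cand]) if cand else 0.0 for cand in candidates]
    best_idx = max(range(len(candidates)), key=averages.__getitem__)
    return best_idx, averages

candidates = [
    [{"statement": "D must work on Alpha", "evidence": "logical deduction"}],
    [{"statement": "D must work on Alpha",
      "evidence": "Condition (4): D must work on a different project than A"}],
]
best_idx, averages = select_best(candidates, toy_score)
print(best_idx, averages)
```

Averaging, rather than summing, keeps a long candidate from winning simply because it contains more steps.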


## Imports and Setup

In [1]:
# Install core evaluation utilities
!pip install -q evaluate
!pip install json5

!pip uninstall -y nltk
!pip install -q --upgrade nltk

Collecting json5
  Downloading json5-0.12.0-py3-none-any.whl.metadata (36 kB)
Downloading json5-0.12.0-py3-none-any.whl (36 kB)
Installing collected packages: json5
Successfully installed json5-0.12.0
Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1

In [1]:
import nltk
nltk.download("punkt_tab")
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [2]:
import gc, json, re, ast, html, numpy as np
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    pipeline
)
from datasets import load_dataset

# cleanup
torch.cuda.empty_cache()
gc.collect()

try:
    import json5
    USE_JSON5 = True
except ImportError:
    USE_JSON5 = False

device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
# Paths
INPUT       = "/content/drive/MyDrive/llm-sr-project/testingData-blank.json"
OUTPUT      = "/content/drive/MyDrive/llm-sr-project/results_reward_model.json"
QP_LM_PATH  = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_question_parsing"
CP_LM_PATH  = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_cot_parsing"
REWARD_MODEL_PATH = "OpenAssistant/reward-model-deberta-v3-large-v2"

# Configuration
NUM_CANDIDATES = 5  # Total CP candidates per example (2 beam + 3 sampled; the counts themselves are set in the pipelines below)
GENERATION_DIVERSITY = 0.85  # Sampling temperature used for the diverse candidates

## Prompt Templates and Helper Functions

In [None]:
# In-Context Learning (ICL) Prompts

QP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project. This assignment must satisfy:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

If A works on Beta, which of the following must be true?
A. B works on Alpha
B. C works on Beta
C. D works on Alpha
D. F works on Beta

The parsing result is:

[
  "There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project.",
  "If A works on Alpha, then B works on Beta",
  "If C works on Alpha, then D and E work on Beta",
  "F works on a different project than E",
  "D must work on a different project than A",
  "If F works on Alpha, then B works on Alpha",
  "A works on Beta"
]
'''

QP_TEMPLATE = '''Given a question, extract all relevant information from the question that would help to solve it.

This includes:
- General setup information (e.g., number of people, projects involved)
- Explicit facts given in the question
- All logical constraints or conditions

Output only a JSON list and nothing else. Follow the format shown in the example.

Example:

{demon}

Now, the question is:

{question}

Your output:
'''

CP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. Each person works on exactly one project.

Conditions:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

Question:
If A works on Beta, which of the following must be true?

CoT:
Since A works on Beta, Condition (1) is not triggered. Condition (2) is not triggered since C's assignment is unknown. Condition (3) doesn't give anything because E's assignment is unspecified. Condition (4) says D must work on a different project than A, so D must work on Alpha. Condition (5) depends on F, which is unknown.

Parsing result:

[
  {
    "statement": "Condition (1) is not applicable",
    "evidence": "Condition (1): If A works on Alpha, then B works on Beta. | A is working on Beta",
    "Verification": "false"
  },
  {
    "statement": "Condition (2) is not applicable",
    "evidence": "Condition (2): If C works on Alpha, then D and E work on Beta. | C's assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "Condition (3) does not provide any info",
    "evidence": "Condition (3): F works on a different project than E. | E's assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "D must work on Alpha",
    "evidence": "Condition (4): D must work on a different project than A, and A is working on Beta",
    "Verification": "true"
  },
  {
    "statement": "Condition (5) is not applicable",
    "evidence": "Condition (5): If F works on Alpha, then B works on Alpha. | F's assignment is unknown",
    "Verification": "false"
  }
]
'''

CP_TEMPLATE = '''You are a reasoning assistant. Based on the question, conditions, and chain-of-thought (CoT), extract every inference or non-inference step as a JSON object.

For each CoT sentence that either:
  1. Refers to a condition (e.g. "Condition (2) …")
  2. Starts with an inference cue ("Since", "Therefore", "This means", "We can deduce", etc.)

Produce one object with:
  • "statement": the new claim you read in that CoT sentence (don't quote the entire sentence—just the core inference).
  • "evidence":
      – if the claim restates a constraint, use the exact line from the **Conditions** block,
      – otherwise, use the CoT fragment that you extracted it from.
  • "Verification":
      – `"false"` if the sentence rejects or blocks a condition (contains "not applicable", "does not provide", etc.),
      – otherwise `"true"`.

Keep the objects in the same order as they appear in the CoT.

Example:

{demon}

Now, given:

Question:
{question}

Conditions:
{conditions}

Chain-of-Thought:
{cot}

Your output:
'''


REWARD_PROMPT = '''I need you to evaluate the quality of a reasoning step in a logical inference task.

Given the following context:

Question:
{question}

Conditions:
{conditions}

Chain of Thought:
{cot}

Evaluate the following reasoning step:
Statement: {statement}
Evidence: {evidence}

Rate the quality of this reasoning step on a scale from 1 to 10 where:
1-3: Low quality (unclear, incorrect, or unsupported)
4-6: Medium quality (partially correct or somewhat supported)
7-10: High quality (clear, correct, and well-supported by evidence)

Focus on:
1. Is the statement logically supported by the evidence?
2. Is the evidence clearly derived from the conditions or chain of thought?
3. Does the reasoning step contribute meaningfully to solving the problem?

Your rating (1-10):
'''

In [None]:
# Helper Functions
def clean_quotes(t):
    # Map curly quotes to their straight ASCII equivalents
    return (t.replace('\u201c', '"').replace('\u201d', '"')
             .replace('\u2018', "'").replace('\u2019', "'"))

def normalize_text(t):
    t = clean_quotes(t)
    t = re.sub(r'\?\s(?=[A-Z])', ', ', t)
    t = re.sub(r'(?<=[a-zA-Z])\.(?=[A-Z])', '. ', t)
    t = re.sub(r'(?<![A-Da-d])\\n(?!\s?[A-Da-d]\\.)', ' ', t)
    return html.unescape(t).strip()

def extract_json(raw):
    raw = raw.strip()
    i = raw.find('[')
    if i < 0: return []
    depth = 0
    for j,ch in enumerate(raw[i:], i):
        if ch=='[': depth+=1
        elif ch==']': depth-=1
        if depth==0:
            blk = raw[i:j+1]
            # Try parsers from strictest to most lenient
            for p in [json.loads, ast.literal_eval, (json5.loads if USE_JSON5 else None)]:
                if p:
                    try: return p(blk)
                    except Exception: pass  # fall through to the next parser
            return []
    return []

def clean_qp(qp_list):
    return [s for s in qp_list if not re.match(r'^[A-Da-d][\.:\)]', s.strip()) and "Option" not in s and "following" not in s]

def extract_rating(text):
    """Extract numerical rating from reward model output"""
    # Try to find a number in the text
    matches = re.findall(r'\b([1-9]|10)\b', text)
    if matches:
        return int(matches[0])
    else:
        # Fallback - check for keywords
        if any(word in text.lower() for word in ["excellent", "outstanding", "great", "high quality"]):
            return 9
        elif any(word in text.lower() for word in ["good", "well", "solid"]):
            return 7
        elif any(word in text.lower() for word in ["average", "adequate", "fair"]):
            return 5
        elif any(word in text.lower() for word in ["poor", "weak", "bad"]):
            return 3
        else:
            return 5  # Default middle score
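
To make the bracket-matching idea behind `extract_json` concrete, here is a simplified, self-contained sketch. It assumes only the standard `json` parser (no `ast` or `json5` fallbacks), and `first_json_array` is a hypothetical name used for illustration.

```python
import json

def first_json_array(raw):
    # Find the first '[' and scan forward until its matching ']' closes.
    start = raw.find("[")
    if start < 0:
        return []
    depth = 0
    for end in range(start, len(raw)):
        if raw[end] == "[":
            depth += 1
        elif raw[end] == "]":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(raw[start:end + 1])
                except json.JSONDecodeError:
                    return []
    return []

noisy = 'Sure, here is the parse:\n["A works on Beta", "F differs from E"]\nHope this helps!'
print(first_json_array(noisy))
```

One caveat of this kind of scan: brackets that appear inside string values are counted too, so an unbalanced `[` within a statement can prevent the array from being recognized.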

## Load Models

In [None]:
qp_tok = AutoTokenizer.from_pretrained(QP_LM_PATH)
qp_tok.model_max_length = 1024
qp_mod = AutoModelForCausalLM.from_pretrained(QP_LM_PATH).to(device)
qp_pipe = pipeline("text-generation",
                   model=qp_mod,
                   tokenizer=qp_tok,
                   return_full_text=False,
                   do_sample=False,
                   num_beams=5,
                   early_stopping=True,
                   max_new_tokens=512,
                   batch_size=4)

cp_tok = AutoTokenizer.from_pretrained(CP_LM_PATH)
cp_tok.model_max_length = 2048
cp_mod = AutoModelForCausalLM.from_pretrained(CP_LM_PATH).to(device)

# Main candidate generation - beam search
cp_pipe = pipeline("text-generation",
                   model=cp_mod,
                   tokenizer=cp_tok,
                   return_full_text=False,
                   do_sample=False,
                   num_beams=4,
                   num_return_sequences=2,
                   early_stopping=True,
                   max_new_tokens=1024,
                   batch_size=4)

# Diverse candidate generation - sampling
cp_sampler = pipeline("text-generation",
                      model=cp_mod,
                      tokenizer=cp_tok,
                      return_full_text=False,
                      do_sample=True,
                      temperature=GENERATION_DIVERSITY,
                      top_p=0.92,
                      num_return_sequences=3,
                      max_new_tokens=1024,
                      batch_size=4)

# Load the reward model
print(f"Loading reward model from {REWARD_MODEL_PATH}...")
reward_tok = AutoTokenizer.from_pretrained(REWARD_MODEL_PATH)
reward_mod = AutoModelForSequenceClassification.from_pretrained(
    REWARD_MODEL_PATH,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

## Reward Function

In [None]:
# Improved scoring function for OpenAssistant reward model
def score_with_reward_model(statement, evidence, question, conditions, cot):
    """Score a statement-evidence pair using the OpenAssistant reward model with evidence emphasis"""
    # Base scoring from reward model
    prompt = f"Question: {question}\n\nConditions: {'; '.join(conditions)}\n\nReasoning: {cot}\n\nEvaluate:\nStatement: {statement}\nEvidence: {evidence}"
    answer = "This reasoning step is clear and logically sound. The statement follows directly from the evidence and contributes to solving the problem."

    # Use the reward model to score the reasoning
    inputs = reward_tok(prompt, answer, return_tensors='pt')
    inputs = {k: v.to(reward_mod.device) for k, v in inputs.items()}

    # Get score from the model
    with torch.no_grad():
        score = reward_mod(**inputs).logits[0].cpu().item()

    # Basic normalization
    base_score = min(10, max(1, (score + 5) * 1.0))

    # Add evidence quality bonuses (critical for Statement_Evidence_Macro_F1)
    evidence_bonus = 0

    # Bonus for specific condition references
    if any(f"Condition ({i})" in evidence for i in range(1, 10)):
        evidence_bonus += 2.5

    # Bonus for direct quotes from conditions list
    for condition in conditions:
        condition_text = str(condition).strip().lower()
        if len(condition_text) > 10 and condition_text in evidence.lower():
            evidence_bonus += 2.0
            break

    # Bonus for logical structure indicators
    if any(term in evidence.lower() for term in ["therefore", "because", "implies", "since"]):
        evidence_bonus += 1.0

    # Penalty for vague evidence
    if evidence.lower() in ["logical deduction", "deduction", "reasoning"]:
        evidence_bonus -= 2.0

    # Calculate final score with emphasis on evidence quality
    final_score = min(10, base_score + evidence_bonus)

    return final_score
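
The arithmetic of the scoring function can be checked in isolation. The sketch below mirrors the clamping and bonus rules above but takes the reward-model logit as an input; the logit values are made up for illustration, and the helper names (`clamp`, `evidence_bonus`, `combined_score`) are assumptions.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def evidence_bonus(evidence, conditions):
    ev = evidence.lower()
    bonus = 0.0
    # Explicit reference such as "Condition (4)"
    if any(f"Condition ({i})" in evidence for i in range(1, 10)):
        bonus += 2.5
    # Verbatim quote of any sufficiently long condition
    if any(len(c.strip()) > 10 and c.strip().lower() in ev for c in conditions):
        bonus += 2.0
    # Logical connector present
    if any(term in ev for term in ("therefore", "because", "implies", "since")):
        bonus += 1.0
    # Vague placeholder evidence is penalized
    if ev in ("logical deduction", "deduction", "reasoning"):
        bonus -= 2.0
    return bonus

def combined_score(raw_logit, evidence, conditions):
    base = clamp(raw_logit + 5, 1, 10)  # shift the logit, then clamp to [1, 10]
    return min(10, base + evidence_bonus(evidence, conditions))

conditions = ["D must work on a different project than A"]
strong = "Condition (4): D must work on a different project than A, and A is working on Beta"
vague = "logical deduction"
print(combined_score(-1.0, strong, conditions))  # base 4.0 + 2.5 + 2.0
print(combined_score(-1.0, vague, conditions))   # base 4.0 - 2.0
```

With the same hypothetical logit, the two evidence strings end up 6.5 points apart, so the heuristic bonuses can easily outweigh differences in the reward model's own output.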

## Inference Function

In [None]:
def process_one(example):
    q_raw, cot_raw = example["question"], example["cot"]
    sel_idx, ans = example.get("sel_idx"), example.get("answer")
    q, cot = normalize_text(q_raw), normalize_text(cot_raw)

    # QP generation
    prompt = QP_TEMPLATE.format(demon=QP_DEMON, question=q)
    raw_qp = qp_pipe(prompt, max_new_tokens=512)
    if not isinstance(raw_qp, list):
        raw_qp = [raw_qp]
    best_qp = clean_qp(extract_json(raw_qp[0]["generated_text"]))

    # Generate multiple CP candidates for reward-based selection
    conds_str = json.dumps(best_qp, ensure_ascii=False)
    prompt_cp = CP_TEMPLATE.format(
        demon      = CP_DEMON,
        question   = q,
        conditions = conds_str,
        cot        = cot
    )

    # Generate diverse candidates
    raw_beams = cp_pipe(prompt_cp, max_new_tokens=1024)
    raw_samples = cp_sampler(prompt_cp, max_new_tokens=1024)

    # Combine all candidates
    raw_cp_outs = (raw_beams if isinstance(raw_beams, list) else [raw_beams]) \
            + (raw_samples if isinstance(raw_samples, list) else [raw_samples])

    # Extract and clean the outputs
    cp_candidates = []
    for out in raw_cp_outs:
        parsed = extract_json(out["generated_text"])
        if not parsed:
            continue

        seen = set()
        cleaned = []
        for st in parsed:
            # Skip malformed entries
            if not isinstance(st, dict):
                continue

            stmt = st.get("statement", "")
            stmt = stmt.strip() if stmt is not None else ""
            ev = st.get("evidence", "")
            ev = ev.strip() if ev is not None else ""
            ev = ev or "logical deduction"
            ver  = st.get("Verification","true")

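            # Normalize verification (parsers may return booleans or mixed-case strings)
            if isinstance(ver, bool):
                ver = "true" if ver else "false"
            elif isinstance(ver, str) and ver.strip().lower() in ["true", "false"]:
                ver = ver.strip().lower()
            elif any(phrase in stmt.lower() for phrase in ["not applicable", "doesn't apply", "not triggered", "no information"]):
                ver = "false"
            else:
                ver = "true"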

            # Skip duplicates and very short statements
            if len(stmt) < 3 or (stmt,ev) in seen:
                continue

            seen.add((stmt,ev))
            cleaned.append({
                "statement": stmt,
                "evidence": ev,
                "Verification": ver
            })

        if cleaned:
            cp_candidates.append(cleaned)

    # If no valid candidates were found, return empty list
    if not cp_candidates:
        return {
            "question": q_raw,
            "question_parsing": best_qp,
            "answer": ans,
            "id": example["id"],
            "cot": cot_raw,
            "cot_parsing": [],
            "sel_idx": sel_idx
        }

    # Score each candidate's statements with the reward model
    print(f"Scoring {len(cp_candidates)} candidates with reward model...")
    candidate_scores = []
    for candidate in cp_candidates:
        # Score each statement-evidence pair in the candidate
        step_scores = []
        for step in candidate:
            score = score_with_reward_model(
                step["statement"],
                step["evidence"],
                q, best_qp, cot
            )
            step_scores.append(score)

        # Average score for the candidate
        avg_score = sum(step_scores) / len(step_scores) if step_scores else 0
        candidate_scores.append(avg_score)

    # Select the best candidate based on reward model scores
    if candidate_scores:
        best_idx = np.argmax(candidate_scores)
        best_cp = cp_candidates[best_idx]
        print(f"Selected candidate {best_idx} with score {candidate_scores[best_idx]}")
    else:
        # Fallback to first candidate if scoring failed
        best_cp = cp_candidates[0]


    if best_cp:
        # Apply evidence quality post-processing
        for step in best_cp:
            # Improve evidence quality by making explicit links to conditions
            evidence = step["evidence"]
            statement = step["statement"]

            # Try to fix evidence that lacks specific condition references
            if not any(f"Condition ({i})" in evidence for i in range(1, 10)):
                for i, condition in enumerate(best_qp, 1):
                    # If the statement clearly relates to a condition, add explicit reference
                    if any(keyword in condition.lower() and keyword in statement.lower()
                           for keyword in ["must", "works", "assigned", "different", "project", "alpha", "beta"]):  # lowercase, since both sides are lowercased
                        step["evidence"] = f"Condition ({i}): {condition}. {evidence}"
                        break

    return {
        "question": q_raw,
        "question_parsing": best_qp,
        "answer": ans,
        "id": example["id"],
        "cot": cot_raw,
        "cot_parsing": best_cp,
        "sel_idx": sel_idx
    }
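
The candidate-cleaning stage inside `process_one` is easier to follow on a small input. The following condensed sketch assumes string-valued fields and uses a single "not applicable" cue for the verification fallback; `clean_candidate` is a hypothetical helper name, and the boolean handling is an assumption added for illustration.

```python
def clean_candidate(parsed):
    seen, cleaned = set(), []
    for step in parsed:
        if not isinstance(step, dict):
            continue  # skip malformed entries
        stmt = (step.get("statement") or "").strip()
        ev = (step.get("evidence") or "").strip() or "logical deduction"
        ver = step.get("Verification", "true")
        if isinstance(ver, bool):
            ver = "true" if ver else "false"
        if ver not in ("true", "false"):
            # Fall back to a textual cue in the statement
            ver = "false" if "not applicable" in stmt.lower() else "true"
        if len(stmt) < 3 or (stmt, ev) in seen:
            continue  # drop very short statements and exact duplicates
        seen.add((stmt, ev))
        cleaned.append({"statement": stmt, "evidence": ev, "Verification": ver})
    return cleaned

parsed = [
    {"statement": "D must work on Alpha", "evidence": "Condition (4)", "Verification": True},
    {"statement": "D must work on Alpha", "evidence": "Condition (4)", "Verification": "true"},
    "garbage",
    {"statement": "Condition (1) is not applicable", "evidence": None, "Verification": "maybe"},
]
for step in clean_candidate(parsed):
    print(step)
```

Here the duplicate and the non-dict entry are dropped, the missing evidence falls back to the "logical deduction" placeholder, and the unrecognized verification value is inferred from the statement text.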

## Batch Inference and Save

In [None]:
def process_batch(batch):
    # Process one example at a time due to reward model complexity
    outs = [process_one({
        "question": batch["question"][i],
        "cot":       batch["cot"][i],
        "id":        batch["id"][i],
        "sel_idx":   batch.get("sel_idx", [None]*len(batch["id"]))[i],
        "answer":    batch.get("answer", [None]*len(batch["id"]))[i],
    }) for i in range(len(batch["question"]))]

    return {
        "question":        [o["question"]        for o in outs],
        "question_parsing":[o["question_parsing"]for o in outs],
        "answer":          [o["answer"]          for o in outs],
        "id":              [o["id"]              for o in outs],
        "cot":             [o["cot"]             for o in outs],
        "cot_parsing":     [o["cot_parsing"]     for o in outs],
        "sel_idx":         [o["sel_idx"]         for o in outs],
    }

if __name__=="__main__":
    gc.collect()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    try:
        ds = load_dataset("json", data_files={"test": INPUT})["test"]
        print(f"Loaded dataset with {len(ds)} examples")

        print("Starting processing with enhanced OpenAssistant reward model...")
        print(f"Generating {NUM_CANDIDATES} candidates per example with diversity {GENERATION_DIVERSITY}")

        # Process with smaller batch size due to reward model overhead
        out_ds = ds.map(
            process_batch,
            batched=True,
            batch_size=1,
            remove_columns=ds.column_names
        )

        print(f"Processing complete. Writing results to {OUTPUT}")
        out_ds.to_json(OUTPUT, orient="records", lines=False)
        print("✅ Done — saved to", OUTPUT)
    except Exception as e:
        print(f"Error occurred: {type(e).__name__}: {e}")
        import traceback
        traceback.print_exc()

## Structure file for evaluation

In [None]:
import json

INPUT_PATH  = "/content/drive/MyDrive/llm-sr-project/results_reward_model.json"
OUTPUT_PATH ="/content/drive/MyDrive/llm-sr-project/final_results_reward_model.json"

def transform_example(ex):
    # reorder each cot_parsing entry: statement → evidence → Verification
    reordered = []
    for step in ex.get("cot_parsing", []):
        reordered.append({
            "statement":    step.get("statement"),
            "evidence":     step.get("evidence"),
            "Verification": step.get("Verification"),
        })

    return {
        "question":         ex.get("question"),
        "question_parsing": ex.get("question_parsing"),
        "answer":           ex.get("answer"),
        "id":               ex.get("id"),
        "cot":              ex.get("cot"),
        "cot_parsing":      reordered,
        "sel_idx":          ex.get("sel_idx"),
    }

def main():
    with open(INPUT_PATH, "r", encoding="utf-8") as f:
        examples = json.load(f)

    structured = [transform_example(ex) for ex in examples]

    with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
        json.dump(structured, f, ensure_ascii=False, indent=2)

    print(f"Wrote {len(structured)} examples to {OUTPUT_PATH}")

if __name__ == "__main__":
    main()
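
Before running the evaluation script, it may help to sanity-check the restructured records. The sketch below is an optional check, not part of the original pipeline: `check_record` is a hypothetical helper, and the key lists are taken from `transform_example` above.

```python
EXPECTED_KEYS = ["question", "question_parsing", "answer", "id", "cot", "cot_parsing", "sel_idx"]
STEP_KEYS = ["statement", "evidence", "Verification"]

def check_record(rec):
    # Return a list of human-readable problems; an empty list means the record looks fine.
    problems = []
    missing = [k for k in EXPECTED_KEYS if k not in rec]
    if missing:
        problems.append(f"missing keys: {missing}")
    for i, step in enumerate(rec.get("cot_parsing") or []):
        if list(step) != STEP_KEYS:
            problems.append(f"step {i}: unexpected keys {list(step)}")
    return problems

good = {
    "question": "q", "question_parsing": ["c1"], "answer": "a", "id": 0, "cot": "c",
    "cot_parsing": [{"statement": "s", "evidence": "e", "Verification": "true"}],
    "sel_idx": None,
}
bad = {"question": "q", "cot_parsing": [{"evidence": "e", "statement": "s"}]}
print(check_record(good))
print(check_record(bad))
```

Applied to the loaded `structured` list, any record with a non-empty problem list can be inspected before it reaches the evaluator.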

## Evaluate

In [3]:
EVAL_SCRIPT = "/content/drive/MyDrive/llm-sr-project/eval.py"
PREDICTION_PATH = "/content/drive/MyDrive/llm-sr-project/final_results_reward_model.json"
REFERENCE_PATH = "/content/drive/MyDrive/llm-sr-project/test-reference.json"

!python {EVAL_SCRIPT} \
  --prediction {PREDICTION_PATH} \
  --reference {REFERENCE_PATH} \
  --question_threshold 0.95 \
  --statement_threshold 0.9 \
  --relation_threshold 0.9

2025-05-17 15:36:59.777196: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747496219.798135    3253 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747496219.804458    3253 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
config.json: 100% 1.05k/1.05k [00:00<00:00, 8.20MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
model.safetensors: 100% 738M/738M [00:04<00:00, 148MB/s]
tokenizer_config.json: 100% 1.28k/1.28k [00:00<00:00, 9.99MB/s]
spm.model: 100% 2.46M/2.46M [0