# 4. Evaluation

**Purpose:**  
Run inference on the blank test set with our fine-tuned question-parsing (QP) and CoT-parsing models, log all raw outputs, and compute the final F1 metrics with the shared-task `eval.py` script.

**Inputs:**  
- `/content/drive/MyDrive/llm-sr-project/testingData-blank.json`  
- Fine-tuned QP model at `/content/drive/MyDrive/llm-sr-project/finetuned_llama3_question_parsing`  
- Fine-tuned CoT model at `/content/drive/MyDrive/llm-sr-project/finetuned_llama3_cot_parsing`  
- Evaluation script at `/content/drive/MyDrive/llm-sr-project/eval.py`  
- Reference file at `/content/drive/MyDrive/llm-sr-project/test-reference.json`

**Outputs:**  
- `/content/drive/MyDrive/llm-sr-project/testingDataresultsfor700-2.json` (inference results)  
- Printed metrics:  
  - Question_Macro_F1  
  - Statement_Macro_F1  
  - Statement_Evidence_Macro_F1  
  - Reasoning_F1  

**Workflow:**  
1. Load the LoRA-adapter checkpoints for QP and CoT.  
2. For each test example:  
   - Generate `question_parsing` via structured ICL prompt.  
   - Generate `cot_parsing` using extracted constraints and the Chain-of-Thought.  
3. Deduplicate and clean the CoT parse (very short statements and repeated statement/evidence pairs are dropped).  
4. Save full inference JSON.  
5. Run `eval.py` with strict matching thresholds (0.95 for questions, 0.9 for statements and relations) to compute F1 scores.  
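
As a rough illustration of what step 4 saves, each output record is the input example plus the two parse fields. The sketch below uses a made-up record (the values are illustrative, not real model output) and a hypothetical helper, `is_valid_record`, that checks the shape the later cells appear to produce.

```python
# Hypothetical example record; values are illustrative, not real model output.
example_record = {
    "question": "If A works on Beta, which of the following must be true?",
    "cot": "Condition (4) says D must work on a different project than A, so D must work on Alpha.",
    "question_parsing": [
        "D must work on a different project than A",
        "A works on Beta",
    ],
    "cot_parsing": [
        {
            "statement": "D must work on Alpha",
            "evidence": "Condition (4): D must work on a different project than A, and A is working on Beta",
            "Verification": "true",
        }
    ],
}


def is_valid_record(record: dict) -> bool:
    """Return True if both parse fields have the expected shape (assumed schema)."""
    qp = record.get("question_parsing")
    cp = record.get("cot_parsing")
    # question_parsing: a list of plain strings
    if not isinstance(qp, list) or not all(isinstance(x, str) for x in qp):
        return False
    # cot_parsing: a list of objects with the three keys used in the prompts
    if not isinstance(cp, list):
        return False
    required = {"statement", "evidence", "Verification"}
    return all(isinstance(step, dict) and required.issubset(step) for step in cp)


print(is_valid_record(example_record))  # True
```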


## Imports and File Paths

In [None]:
# Install core evaluation utilities
!pip install -q evaluate
!pip install json5

!pip uninstall -y nltk
!pip install -q --upgrade nltk

Collecting json5
  Downloading json5-0.12.0-py3-none-any.whl.metadata (36 kB)
Downloading json5-0.12.0-py3-none-any.whl (36 kB)
Installing collected packages: json5
Successfully installed json5-0.12.0
Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1

In [None]:
import nltk
nltk.download("punkt_tab")
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
import unsloth  # Must come first for 4-bit LoRA compatibility
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import json, re, ast, os, html

# Optional: robust JSON parsing
try:
    import json5
    USE_JSON5 = True
except ImportError:
    USE_JSON5 = False

In [None]:
input_path  = "/content/drive/MyDrive/llm-sr-project/testingData-blank.json"
output_path = "/content/drive/MyDrive/llm-sr-project/testingDataresultsfor700-2.json"
log_path    = "/content/drive/MyDrive/llm-sr-project/raw_outputs_log.jsonl"
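
Before loading the models it may be worth confirming that the Drive files listed under **Inputs** are actually mounted. The helper below is a minimal sketch; `missing_paths` is a hypothetical name, not part of the notebook.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Paths copied from the Inputs list at the top of this notebook.
required = [
    "/content/drive/MyDrive/llm-sr-project/testingData-blank.json",
    "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_question_parsing",
    "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_cot_parsing",
    "/content/drive/MyDrive/llm-sr-project/eval.py",
    "/content/drive/MyDrive/llm-sr-project/test-reference.json",
]

for p in missing_paths(required):
    print("Missing:", p)
```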

## Prompt Templates

In [None]:
# ICL (In-Context Learning) Prompt Templates and Demonstrations (same as in training)

QP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project. This assignment must satisfy:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

If A works on Beta, which of the following must be true?
A. B works on Alpha
B. C works on Beta
C. D works on Alpha
D. F works on Beta

The parsing result is:

[
  "There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project.",
  "If A works on Alpha, then B works on Beta",
  "If C works on Alpha, then D and E work on Beta",
  "F works on a different project than E",
  "D must work on a different project than A",
  "If F works on Alpha, then B works on Alpha",
  "A works on Beta"
]
'''


QP_TEMPLATE = '''Given a question, extract all relevant information from the question that would help to solve it.

This includes:
- General setup information (e.g., number of people, projects involved)
- Explicit facts given in the question
- All logical constraints or conditions

Output only a JSON list and nothing else. Follow the format shown in the example.

Example:

{demon}

Now, the question is:

{question}

Your output:
'''


CP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project.

Conditions:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

Question:
If A works on Beta, which of the following must be true?

CoT:
Since A works on Beta, Condition (1) is not triggered. Condition (2) is not triggered since C’s assignment is unknown. Condition (3) doesn’t give anything because E’s assignment is unspecified. Condition (4) says D must work on a different project than A, so D must work on Alpha. Condition (5) depends on F, which is unknown.

Parsing result:

[
  {
    "statement": "Condition (1) is not applicable",
    "evidence": "Condition (1): If A works on Alpha, then B works on Beta. | A is working on Beta",
    "Verification": "false"
  },
  {
    "statement": "Condition (2) is not applicable",
    "evidence": "Condition (2): If C works on Alpha, then D and E work on Beta. | C’s assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "Condition (3) does not provide any info",
    "evidence": "Condition (3): F works on a different project than E. | E’s assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "D must work on Alpha",
    "evidence": "Condition (4): D must work on a different project than A, and A is working on Beta",
    "Verification": "true"
  },
  {
    "statement": "Condition (5) is not applicable",
    "evidence": "Condition (5): If F works on Alpha, then B works on Alpha. | F’s assignment is unknown",
    "Verification": "false"
  }
]
'''

CP_TEMPLATE = '''You are a reasoning assistant. Based on the question, conditions, and chain-of-thought (CoT), extract every inference or non-inference step as a JSON object.

For each CoT sentence that either:
  1. Refers to a condition (e.g. “Condition (2) …”)
  2. Starts with an inference cue (“Since”, “Therefore”, “This means”, “We can deduce”, etc.)

Produce one object with:
  • "statement": the new claim you read in that CoT sentence (don’t quote the entire sentence—just the core inference).
  • "evidence":
      – if the claim restates a constraint, use the exact line from the **Conditions** block,
      – otherwise, use the CoT fragment that you extracted it from.
  • "Verification":
      – `"false"` if the sentence rejects or blocks a condition (contains “not applicable”, “does not provide”, etc.),
      – otherwise `"true"`.

Keep the objects in the same order as they appear in the CoT.

Example:

{demon}

Now, given:

Question:
{question}

Conditions:
{conditions}

Chain-of-Thought:
{cot}

Your output:
'''
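
A note on why the JSON braces in `CP_DEMON` do not clash with `str.format`: only the template string is scanned for `{...}` fields, and substituted values are copied through unchanged. The sketch below uses shortened stand-ins for the real template and demonstration.

```python
# Shortened stand-ins for CP_DEMON / CP_TEMPLATE (the real strings are defined above).
demo = '[\n  {"statement": "D must work on Alpha", "Verification": "true"}\n]'
template = "Example:\n\n{demon}\n\nQuestion:\n{question}\n\nYour output:\n"

prompt = template.format(demon=demo, question="If A works on Beta, what must be true?")

# Braces inside the substituted value survive untouched.
assert '{"statement": "D must work on Alpha"' in prompt

# A literal brace written in the template itself would have to be doubled.
escaped = 'Return an object like {{"statement": ...}} for {question}'.format(question="Q1")
print(escaped)  # Return an object like {"statement": ...} for Q1
```

So the demonstrations can keep their raw JSON, but any literal braces added to `QP_TEMPLATE` or `CP_TEMPLATE` themselves would need to be written as `{{` and `}}`.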

## Load Models

In [None]:
# Question Parser
question_model_path = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_question_parsing"
question_tokenizer  = AutoTokenizer.from_pretrained(question_model_path)
question_tokenizer.model_max_length = 1024
question_model      = AutoModelForCausalLM.from_pretrained(question_model_path, load_in_4bit=True)
question_pipe       = pipeline("text-generation", model=question_model, tokenizer=question_tokenizer,
                               return_full_text=False, num_beams=5, early_stopping=True, do_sample=False)

# CoT Parser
cot_model_path = "/content/drive/MyDrive/llm-sr-project/finetuned_llama3_cot_parsing"
cot_tokenizer  = AutoTokenizer.from_pretrained(cot_model_path)
cot_tokenizer.model_max_length = 2048
cot_model      = AutoModelForCausalLM.from_pretrained(cot_model_path, load_in_4bit=True)
cot_pipe       = pipeline("text-generation", model=cot_model, tokenizer=cot_tokenizer,
                          return_full_text=False, num_beams=5, early_stopping=True, do_sample=False)

print("✅ Models loaded.")

## Utility Functions

In [None]:
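def extract_first_json_array(raw: str):
    """Return the first balanced [...] block in `raw` as a Python list, or None."""
    raw = raw.strip()
    start = raw.find('[')
    if start == -1:
        return None
    # Parsers are tried in order; json5 is used only if it was importable.
    parsers = [json.loads, ast.literal_eval]
    if USE_JSON5:
        parsers.append(json5.loads)
    depth = 0
    # Note: brackets inside string values are counted as well.
    for i, ch in enumerate(raw[start:], start):
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        if depth == 0:
            # Matching close bracket found: try to parse the block.
            block = raw[start:i+1]
            for parser in parsers:
                try:
                    return parser(block)
                except Exception:
                    continue
            return None
    return None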

def clean_quotes(text):
    return text.replace('“','"').replace('”','"').replace("‘","'").replace("’","'")

def normalize_question_text(text):
    text = clean_quotes(text)
    text = re.sub(r'\?\s(?=[A-Z])', ', ', text)
    text = re.sub(r'(?<=[a-zA-Z])\.(?=[A-Z])', '. ', text)
    text = re.sub(r'(?<![A-Da-d])\\n(?!\s?[A-Da-d]\\.)', ' ', text)
    return html.unescape(text).strip()
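
The bracket counter in `extract_first_json_array` treats brackets inside string values as structural, so an unbalanced `]` inside a quoted statement ends the block early and the parse fails. The following is a minimal sketch of a string-aware scan; `find_first_array_span` is a hypothetical helper, and it only recognises double-quoted strings (single quotes are ignored so that apostrophes do not open a string).

```python
import json


def find_first_array_span(raw: str):
    """Return (start, end) of the first balanced [...] block, skipping
    brackets that occur inside double-quoted strings; None if not found."""
    start = raw.find("[")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(raw)):
        ch = raw[i]
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # closing quote
            continue
        if ch == '"':
            in_string = True
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                return start, i + 1
    return None


# Made-up model output with an unbalanced bracket inside a string.
raw = 'Output: ["Seats are numbered [1] to [5]", "A sits in seat ]3["] trailing text'
span = find_first_array_span(raw)
parsed = json.loads(raw[span[0]:span[1]])
print(parsed)  # ['Seats are numbered [1] to [5]', 'A sits in seat ]3[']
```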

## Inference Functions

In [None]:
def generate_question_parsing(question: str):
    q = normalize_question_text(question)
    prompt = QP_TEMPLATE.format(demon=QP_DEMON, question=q)
    resp = question_pipe(prompt, max_new_tokens=512)[0]["generated_text"]
    with open(log_path, "a") as f:
        f.write(json.dumps({"type":"QP","question":question,"raw":resp})+"\n")
    return extract_first_json_array(resp) or []

def generate_cot_parsing(question: str, cot: str, constraints):
    q = normalize_question_text(question)
    c = normalize_question_text(cot)
    prompt = CP_TEMPLATE.format(demon=CP_DEMON, question=q,
                                conditions=json.dumps(constraints, ensure_ascii=False),
                                cot=c)
    resp_list = cot_pipe(prompt, max_new_tokens=1024)
    if not resp_list or "generated_text" not in resp_list[0]:
        print("⚠️ Malformed response")
        return []
    resp = resp_list[0]["generated_text"]
    with open(log_path, "a") as f:
        f.write(json.dumps({"type":"CP","question":question,"cot":cot,"raw":resp})+"\n")
    steps = extract_first_json_array(resp)
    if not steps: return []
    clean, seen = [], set()
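    for st in steps:
        if not isinstance(st, dict):
            continue  # skip malformed entries such as bare strings
        s = str(st.get("statement") or "").strip()
        e = str(st.get("evidence") or "").strip() or "logical deduction"
        v = str(st.get("Verification", "true")).lower()
        # Drop very short statements and repeated (statement, evidence) pairs.
        if len(s) < 5 or (s, e) in seen:
            continue
        seen.add((s, e))
        clean.append({"statement": s, "evidence": e, "Verification": v})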
    return clean
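
Both functions append one JSON object per call to the log file, so the raw generations can be inspected afterwards. The sketch below counts entries per type and flags raw outputs that contain no `[` at all, which can never yield a parse. `summarize_log` is a hypothetical helper, assuming the log lines have the `type` and `raw` keys written above.

```python
import json
from collections import Counter


def summarize_log(path):
    """Count log entries per type, and entries whose raw output has no '['."""
    counts, no_array = Counter(), Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            counts[entry["type"]] += 1
            if "[" not in entry.get("raw", ""):
                no_array[entry["type"]] += 1
    return counts, no_array


# Usage in this notebook (log_path is defined in an earlier cell):
# counts, no_array = summarize_log(log_path)
# print(counts, no_array)
```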

## Run Inference and Save

In [None]:
with open(input_path) as f:
    test_data = json.load(f)

results = []
for ex in test_data:
    qp = generate_question_parsing(ex["question"])
    print(f"→ QP extracted {len(qp)} constraints")
    cp = generate_cot_parsing(ex["question"], ex["cot"], qp)
    print(f"→ CoT parsed {len(cp)} steps")
    results.append({**ex, "question_parsing": qp, "cot_parsing": cp})

with open(output_path, "w") as f:
    json.dump(results, f, indent=2)

print("✅ Saved:", output_path)
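
Because both inference functions fall back to an empty list when no JSON array can be parsed, failures are silent in the saved file. A quick count of empty parses before running the evaluation can make them visible. This is a sketch with a hypothetical helper, `summarize_parses`, shown on made-up records.

```python
def summarize_parses(results):
    """Return counts of examples whose parses came back empty."""
    return {
        "total": len(results),
        "empty_question_parsing": sum(1 for r in results if not r.get("question_parsing")),
        "empty_cot_parsing": sum(1 for r in results if not r.get("cot_parsing")),
    }


# Illustrative records only; not real model output.
demo_results = [
    {"question_parsing": ["A works on Beta"], "cot_parsing": []},
    {"question_parsing": [], "cot_parsing": []},
]
print(summarize_parses(demo_results))
# {'total': 2, 'empty_question_parsing': 1, 'empty_cot_parsing': 2}
```

In the notebook, the same call on `results` would show how many examples contribute nothing to the scores.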

## Evaluate

In [None]:
EVAL_SCRIPT = "/content/drive/MyDrive/llm-sr-project/eval.py"
PREDICTION_PATH = "/content/drive/MyDrive/llm-sr-project/testingDataresultsfor700-2.json"
REFERENCE_PATH = "/content/drive/MyDrive/llm-sr-project/test-reference.json"

!python {EVAL_SCRIPT} \
  --prediction {PREDICTION_PATH} \
  --reference {REFERENCE_PATH} \
  --question_threshold 0.95 \
  --statement_threshold 0.9 \
  --relation_threshold 0.9

2025-05-17 14:13:00.489505: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-17 14:13:00.507025: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747491180.528260    2927 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747491180.534701    2927 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-17 14:13:00.555637: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr