# 1. Preprocessing

**Purpose:**  
Generate the training files and the full-metadata JSON files for Question Parsing (QP) and Chain-of-Thought (CoT) Parsing from the raw `700dataset.json`.

**Inputs:**  
- `/content/drive/MyDrive/llm-sr-project/700dataset.json` (`DATA_PATH`): synthetically generated logical puzzles with gold QP/CoT parses.

**Outputs:**  
- `train_question_parsing.jsonl`  
- `train_cot_parsing.jsonl`  
- `full_qp_data.json`  
- `full_cot_data.json`

This notebook defines our prompt templates, loads the raw dataset, builds both stripped I/O records (for LoRA fine-tuning) and full metadata records (for debugging/evaluation), and writes them to disk.
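
The two output formats differ in layout: the `.jsonl` files hold one JSON object per line, while the `.json` files hold a single indented array. A minimal sketch of the difference, using in-memory buffers and made-up records (the `records` list below is purely illustrative and not taken from the dataset):

```python
import io
import json

# Made-up records for illustration only; the real ones are built later in the notebook.
records = [
    {"input": "Question:\nWho works on Alpha?", "output": "[\"A works on Beta\"]"},
    {"input": "Question:\nWho works on Beta?", "output": "[\"D works on Alpha\"]"},
]

# JSON Lines: one compact JSON object per line (newlines inside strings are escaped).
jsonl_buf = io.StringIO()
for rec in records:
    jsonl_buf.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Plain JSON: a single indented array, easier to inspect by eye.
json_buf = io.StringIO()
json.dump(records, json_buf, indent=2, ensure_ascii=False)

jsonl_lines = jsonl_buf.getvalue().splitlines()
print(len(jsonl_lines))                            # one line per record
print(json.loads(json_buf.getvalue()) == records)  # the array round-trips
```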

In [1]:
import os
import json

# Input dataset and output folder on the mounted Google Drive
DATA_PATH   = "/content/drive/MyDrive/llm-sr-project/700dataset.json"
SAVE_FOLDER = "/content/drive/MyDrive/llm-sr-project/"

In [2]:
# ─────────────────────────────────────────────────────────────────────────────
# Define In-Context Learning Demonstrations and Prompt Templates
# ─────────────────────────────────────────────────────────────────────────────

# QP_DEMON: One-shot example for Question Parsing
QP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project. This assignment must satisfy:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

If A works on Beta, which of the following must be true?
A. B works on Alpha
B. C works on Beta
C. D works on Alpha
D. F works on Beta

The parsing result is:

[
  "There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project.",
  "If A works on Alpha, then B works on Beta",
  "If C works on Alpha, then D and E work on Beta",
  "F works on a different project than E",
  "D must work on a different project than A",
  "If F works on Alpha, then B works on Alpha",
  "A works on Beta"
]
'''

# QP_TEMPLATE: Formats a new question with the demonstration
QP_TEMPLATE = '''Given a question, extract all relevant information from the question that would help to solve it.

This includes:
- General setup information (e.g., number of people, projects involved)
- Explicit facts given in the question
- All logical constraints or conditions

Output only a JSON list and nothing else. Follow the format shown in the example.

Example:

{demon}

Now, the question is:

{question}

Your output:
'''

# CP_DEMON: One-shot example for Chain-of-Thought Parsing
CP_DEMON = '''The question is:

There are 6 volunteers: A, B, C, D, E and F. They will be assigned to either Project Alpha or Project Beta. Each person works on exactly one project.

Conditions:
(1) If A works on Alpha, then B works on Beta.
(2) If C works on Alpha, then D and E work on Beta.
(3) F works on a different project than E.
(4) D must work on a different project than A.
(5) If F works on Alpha, then B works on Alpha.

Question:
If A works on Beta, which of the following must be true?

CoT:
Since A works on Beta, Condition (1) is not triggered. Condition (2) is not triggered since C’s assignment is unknown. Condition (3) doesn’t give anything because E’s assignment is unspecified. Condition (4) says D must work on a different project than A, so D must work on Alpha. Condition (5) depends on F, which is unknown.

Parsing result:

[
  {
    "statement": "Condition (1) is not applicable",
    "evidence": "Condition (1): If A works on Alpha, then B works on Beta. | A is working on Beta",
    "Verification": "false"
  },
  {
    "statement": "Condition (2) is not applicable",
    "evidence": "Condition (2): If C works on Alpha, then D and E work on Beta. | C’s assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "Condition (3) does not provide any info",
    "evidence": "Condition (3): F works on a different project than E. | E’s assignment is unknown",
    "Verification": "false"
  },
  {
    "statement": "D must work on Alpha",
    "evidence": "Condition (4): D must work on a different project than A, and A is working on Beta",
    "Verification": "true"
  },
  {
    "statement": "Condition (5) is not applicable",
    "evidence": "Condition (5): If F works on Alpha, then B works on Alpha. | F’s assignment is unknown",
    "Verification": "false"
  }
]
'''

# CP_TEMPLATE: Formats a question, its conditions, and its CoT for the CoT Parsing model
CP_TEMPLATE = '''You are a reasoning assistant. Based on the question, conditions, and chain-of-thought (CoT), extract every inference or non-inference step as a JSON object.

For each CoT sentence that either:
  1. Refers to a condition (e.g. “Condition (2) …”)
  2. Starts with an inference cue (“Since”, “Therefore”, “This means”, “We can deduce”, etc.)

Produce one object with:
  • "statement": the new claim you read in that CoT sentence (don’t quote the entire sentence—just the core inference).
  • "evidence":
      – if the claim restates a constraint, use the exact line from the **Conditions** block,
      – otherwise, use the CoT fragment that you extracted it from.
  • "Verification":
      – `"false"` if the sentence rejects or blocks a condition (contains “not applicable”, “does not provide”, etc.),
      – otherwise `"true"`.

Keep the objects in the same order as they appear in the CoT.

Example:

{demon}

Now, given:

Question:
{question}

Conditions:
{conditions}

Chain-of-Thought:
{cot}

Your output:
'''
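
To illustrate how these templates are filled, here is a minimal sketch using `str.format` on shortened stand-ins (`MINI_TEMPLATE`, `MINI_DEMON`, and the sample question are hypothetical, not the strings defined above). Braces inside a substituted value are left untouched; only braces in the template itself are read as placeholders.

```python
# Shortened, hypothetical stand-ins for the template and the demonstration.
MINI_TEMPLATE = (
    "Output only a JSON list.\n\n"
    "Example:\n\n{demon}\n\n"
    "Now, the question is:\n\n{question}\n\n"
    "Your output:\n"
)
MINI_DEMON = 'Parsing result:\n\n[{"statement": "D must work on Alpha"}]'

# The JSON braces in MINI_DEMON are part of the value, so format() does not parse them.
prompt = MINI_TEMPLATE.format(
    demon=MINI_DEMON,
    question="If A works on Beta, which of the following must be true?",
)
print(prompt)
```

If a template ever needs a literal brace of its own (for instance an inline JSON snippet), it has to be written as `{{` or `}}` so that `str.format` does not treat it as a placeholder.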

In [3]:
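# Load the raw dataset (a JSON list of puzzle examples)
with open(DATA_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

qp_full  = []  # full-metadata QP records (debugging/evaluation)
cot_full = []  # full-metadata CoT records (debugging/evaluation)
qp_recs  = []  # stripped QP input/output records (fine-tuning)
cot_recs = []  # stripped CoT input/output records (fine-tuning)

for idx, ex in enumerate(data):
    q    = ex["question"]
    cot  = ex["cot"]
    ans  = ex.get("answer", "b")  # note: examples without an "answer" field fall back to "b"
    qpar = ex["question_parsing"]
    cpar = ex["cot_parsing"]

    # Serialize the gold parses as JSON strings; these are the target outputs
    qp_out  = json.dumps(qpar, ensure_ascii=False)
    cot_out = json.dumps(cpar, ensure_ascii=False)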

    # Full metadata version (for eval/debugging)
    qp_full.append({
        "input":  f"Question:\n{q}",
        "output": qp_out,
        "question": q,
        "answer": ans,
        "id": idx,
        "cot": cot,
        "question_parsing": qpar,
        "cot_parsing": cpar,
        "sel_idx": idx
    })

    cot_full.append({
        "input":  f"Question:\n{q}\n\nCoT:\n{cot}",
        "output": cot_out,
        "question": q,
        "answer": ans,
        "id": idx,
        "cot": cot,
        "question_parsing": qpar,
        "cot_parsing": cpar,
        "sel_idx": idx
    })

    # Stripped QP training record
    qp_recs.append({
        "input": QP_TEMPLATE.format(demon=QP_DEMON, question=q),
        "output": qp_out
    })
    # Stripped CoT training record
    cot_recs.append({
        "input": CP_TEMPLATE.format(
            demon=CP_DEMON,
            question=q,
            conditions=json.dumps(qpar, ensure_ascii=False),
            cot=cot
        ),
        "output": cot_out
    })
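
Before writing the records out, it may be worth checking that every CoT target is well formed. The following is a sketch under the assumption that each gold `cot_parsing` entry uses the same three keys as the demonstration (`statement`, `evidence`, `Verification`); the helper `find_bad_cot_records` and the sample records are hypothetical.

```python
import json

# Assumed schema, taken from the CP_DEMON demonstration rather than from the dataset itself.
REQUIRED_KEYS = {"statement", "evidence", "Verification"}

def find_bad_cot_records(records):
    """Return indices of records whose "output" is not a JSON list of complete step objects."""
    bad = []
    for i, rec in enumerate(records):
        try:
            steps = json.loads(rec["output"])
        except json.JSONDecodeError:
            bad.append(i)
            continue
        well_formed = isinstance(steps, list) and all(
            isinstance(s, dict) and REQUIRED_KEYS.issubset(s) for s in steps
        )
        if not well_formed:
            bad.append(i)
    return bad

# Made-up records: one valid, one with missing keys, one that is not JSON at all.
sample_recs = [
    {"output": json.dumps([{"statement": "D must work on Alpha",
                            "evidence": "Condition (4)",
                            "Verification": "true"}])},
    {"output": json.dumps([{"statement": "missing fields"}])},
    {"output": "not valid JSON"},
]
bad = find_bad_cot_records(sample_recs)
print(bad)
```

In this notebook the helper could be called as `find_bad_cot_records(cot_recs)`; an empty list would indicate that every target parsed cleanly under the assumed schema.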


In [4]:
# Save stripped training records as JSON Lines (one record per line)
with open(os.path.join(SAVE_FOLDER, "train_question_parsing.jsonl"), "w", encoding="utf-8") as f:
    for rec in qp_recs:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

with open(os.path.join(SAVE_FOLDER, "train_cot_parsing.jsonl"), "w", encoding="utf-8") as f:
    for rec in cot_recs:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Save full metadata versions
with open(os.path.join(SAVE_FOLDER, "full_qp_data.json"), "w", encoding="utf-8") as f:
    json.dump(qp_full, f, indent=2, ensure_ascii=False)

with open(os.path.join(SAVE_FOLDER, "full_cot_data.json"), "w", encoding="utf-8") as f:
    json.dump(cot_full, f, indent=2, ensure_ascii=False)

print("✅ Saved all preprocessing outputs")

✅ Saved all preprocessing outputs
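
As a quick sanity check on the saved files, a `.jsonl` file can be read back line by line and compared with what was written. The sketch below uses a temporary directory and made-up records rather than the real outputs; `read_jsonl` is a hypothetical helper.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up records for illustration only.
records = [
    {"input": "Question:\nIf A works on Beta, who works on Alpha?",
     "output": "[\"D must work on Alpha\"]"},
    {"input": "Question:\nWho works on Beta?", "output": "[]"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_question_parsing.jsonl")
    # Write the records the same way the notebook does: one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    loaded = read_jsonl(path)

print(len(loaded), loaded == records)
```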
