In [1]:
def build_prompt(utterance):
    """Build the annotation prompt for a single classroom utterance."""
    return f"""
You are annotating classroom discussion transcripts.

Task:
Decide whether the following utterance involves *proposing strategies or plans*.

Context:
Students are working in groups on activities to learn about Kepler's first law of planetary motion. There is a pen-and-paper activity (using pins, paper, pencil, and string) in which they learn how to draw an elliptical orbit, followed by a computer component in which they work through various immersive simulation activities to develop a final claim that orbits are elliptical. The learning objective is for them to work collaboratively to discover this new knowledge through hands-on activities.

Definition:
- Articulating specific steps, strategies, or procedures required to organize or accomplish the group's task.
- Look for utterances that set direction or specify how to complete an activity (often using procedural or sequential language). Exclude cases where the speaker is merely following instructions read aloud after being prompted by a peer.

Utterance:
\"\"\"{utterance}\"\"\"

Respond ONLY in valid JSON.
Do NOT include any explanation or extra text.

Format:
{{"label": "YES"}} or {{"label": "NO"}}

"""

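The prompt asks the model to reply with exactly `{"label": "YES"}` or `{"label": "NO"}`. As a rough sketch of how such a reply might be mapped back to the 0/1 coding used in the CSV, the helper below parses one reply string. `parse_label` is a hypothetical name that does not appear in the original notebook, and returning `None` for malformed replies is an assumption about how one might want to handle them.

```python
import json


def parse_label(reply):
    """Map a model reply to 1 (YES), 0 (NO), or None if it is not in the expected format.

    Hypothetical helper: not part of the original notebook.
    """
    try:
        obj = json.loads(reply.strip())
    except json.JSONDecodeError:
        # The reply was not valid JSON (for example, it contained extra prose).
        return None
    if not isinstance(obj, dict):
        # Valid JSON, but not an object with a "label" key.
        return None
    label = obj.get("label")
    if label == "YES":
        return 1
    if label == "NO":
        return 0
    # Any other label value is treated as unparseable.
    return None
```

Keeping malformed replies separate from `NO` (rather than silently coercing them) makes it possible to count format failures apart from genuine negative predictions.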

In [3]:
import pandas as pd
import json

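# Load the annotated transcript.
df = pd.read_csv("t3.csv")

# Column holding the binary human code (1 = proposes strategies or plans).
LABEL_COL = "Planning Strategies & Plans"

# Drop rows missing the utterance text or its label, then make labels integer.
df = df.dropna(subset=["Message", LABEL_COL]).copy()
df[LABEL_COL] = df[LABEL_COL].astype(int)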

def label_to_text(x):
    return "YES" if x == 1 else "NO"

out_path = "t3_sft_kepler_prompt.jsonl"

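# One chat-format example per row: prompt as user turn, JSON label as assistant turn.
with open(out_path, "w", encoding="utf-8") as f:
    for _, r in df.iterrows():
        # str() guards against pandas parsing a purely numeric message as a number.
        prompt = build_prompt(str(r["Message"]).strip())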
        ex = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": json.dumps(
                    {"label": label_to_text(r[LABEL_COL])}
                )},
            ]
        }
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

print("Saved:", out_path)
print("Rows:", len(df))
print("Label distribution:\n", df[LABEL_COL].value_counts())


Saved: t3_sft_kepler_prompt.jsonl
Rows: 207
Label distribution:
 Planning Strategies & Plans
0    166
1     41
Name: count, dtype: int64
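
Before using the file for fine-tuning, it may be worth reading it back and confirming that every line parses and that the labels match the distribution printed above. The following is a minimal sketch of such a check. `check_sft_file` is a hypothetical helper, not part of the original notebook, and it assumes each line holds exactly one user turn followed by one assistant turn, as written by the cell above.

```python
import json
from collections import Counter


def check_sft_file(path):
    """Count assistant labels in a chat-format JSONL file, raising on malformed rows.

    Hypothetical helper: assumes one user turn then one assistant turn per line.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            ex = json.loads(line)
            roles = [m["role"] for m in ex["messages"]]
            if roles != ["user", "assistant"]:
                raise ValueError(f"line {line_no}: unexpected roles {roles}")
            # The assistant content is itself a JSON string such as {"label": "YES"}.
            label = json.loads(ex["messages"][1]["content"])["label"]
            if label not in ("YES", "NO"):
                raise ValueError(f"line {line_no}: unexpected label {label!r}")
            counts[label] += 1
    return counts
```

Calling `check_sft_file("t3_sft_kepler_prompt.jsonl")` on the file written above should give counts that agree with the printed label distribution (166 `NO`, 41 `YES`); a mismatch would point to a problem in the export step.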
