# Support Multi-Label Pipeline (3-Step)
This notebook:
1) Loads `train.csv` (14 rows) to build few-shot examples  
2) Loads `valid.csv` (86 rows) to predict labels  
3) Runs 3 gates: Domain Gate → OP Last Comment Gate → Final Multi-label  
4) Saves `valid_with_predictions.csv`

## 1. Project Setup

In [1]:
%pip install -U openai python-dotenv pandas tqdm pydantic scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os, json, time
import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv
from pydantic import BaseModel
from typing_extensions import Literal
from openai import OpenAI

load_dotenv()
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY not found in environment or .env"

client = OpenAI()


## 2. Loading files and definitions

In [3]:
TRAIN_PATH = "../../data/train.csv"
VALID_PATH = "../../data/valid.csv"

train_df = pd.read_csv(TRAIN_PATH)
valid_df = pd.read_csv(VALID_PATH)

assert "prompt" in train_df.columns, "train.csv must contain a 'prompt' column"
assert "prompt" in valid_df.columns, "valid.csv must contain a 'prompt' column"

train_df["prompt"] = train_df["prompt"].fillna("").astype(str)
valid_df["prompt"] = valid_df["prompt"].fillna("").astype(str)

train_df.head(2)


Unnamed: 0,post_id,parent_id,subreddit,comment_id,parent_fullname,depth,comment_author,comment_body,comment_score,created_utc,...,body,prompt,Tags,Information support,Emotional support,Esteem support,Network support,Tangible assistance,Seeking support,Group interactions
0,1lc7y2n,,TooAfraidToAsk,mxysa7g,t3_1lc7y2n,0,Miaous95,Definitely SA and I’d do it back to him see if...,5,1750018747,...,So you have sex with a man with consent. You b...,Original Post:\nAuthor: Beginning_Exit_6256\nT...,Information support,Yes,No,No,No,No,No,No
1,5rf97b,,relationship_advice,dd7kbxy,t3_5rf97b,0,[deleted],You cheated on him. You are responsible for yo...,5,1485988490,...,My boyfriend (17/m) and I had been dating for ...,Original Post:\nAuthor: ahhhhconfuse\nTitle: D...,Information support,Yes,No,No,No,No,No,No


In [4]:
LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]

SUPPORT_ONLY_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
]

SUPPORT_DEFINITIONS = """
Information support: Giving advice, facts, resources, or explanations that clarify what’s happening or what to do (including opinionated judgments like “that is assault” when used to inform/guide).
Emotional support: Messages that express care and empathy—comforting, showing affection/sympathy, encouraging hope, offering prayers, or easing guilt/blame (“I’m so sorry,” “sending hugs,” “stay strong,” “I’ll pray for you,” “it’s not your fault”).
Esteem support: Messages that build the user up by complimenting them or validating their feelings/beliefs/actions as reasonable/normal.
Network support: Encouraging the person to reach out to other people or connect with external help systems (therapy, friends, family, communities, etc.). Note : Suggesting to go to police is Information support and not Network support.
Tangible assistance: When the commenter personally offers to help directly (I am here, You can talk to me).
Seeking support: Messages where the author explicitly asks for help for themselves—either a direct question/request for info/suggestions or an explicit reassurance request.
Group interactions: Any reply that primarily participates socially in the thread—expressing gratitude/thanks, congratulations, or sharing one’s own experience/story (including “me too” anecdotes). This label can co-occur with other support labels if the comment also gives advice, empathy, or info.
""".strip()


In [5]:
for lab in SUPPORT_ONLY_LABELS:
    if lab not in train_df.columns:
        print(f"❌ Missing column: {lab}")
    else:
        yes_count = (train_df[lab].astype(str).str.strip().str.lower() == "yes").sum()
        print(f"{lab}: YES count = {yes_count}")


Information support: YES count = 2
Emotional support: YES count = 2
Esteem support: YES count = 2
Network support: YES count = 2
Tangible assistance: YES count = 2
Seeking support: YES count = 2
Group interactions: YES count = 2


## 3. Prompts

In [6]:
DOMAIN_GATE_PROMPT = """
You are a psychologist and an expert in Reddit threads about possible sexual harassment/sexual assault.

Task: DOMAIN GATE.

Decide if the author is UNCERTAIN about whether the experience counts as sexual assault/harassment.

Return true ONLY if the author explicitly questions whether the experience counts as SA/harassment.

Return false otherwise.

Return ONLY valid JSON:
{{
  "is_ambivalent_sa_domain": true/false
}}

Prompt:
{prompt_text}
""".strip()


OP_LAST_COMMENT_PROMPT = """
Input format:
- There is an "Original Post:" section with Author, Title, Body.
- There is a "Conversation History:" section with a single/multiple comments.
"Conversation History:" includes blocks like:
  Comment (depth X):
  Author: ... 
  Content: ...

Steps:
1) Extract OP from "Original Post: Author:"
2) Choose the comment with the highest depth in the conversation history.
3) Extract that last comment's Author, Depth, and Content.
4) Compare last comment author with OP.

Return ONLY valid JSON:
{{
  "op_author": "<string>",
  "last_comment_depth": <int>,
  "last_comment_author": "<string>",
  "last_comment_content": "<string>",
  "is_last_comment_by_op": true/false
}}

Prompt:
{prompt_text}
""".strip()


FINAL_MULTILABEL_PROMPT = """
You are a psychologist expert in identifying support in Reddit comments. 

You will label ONLY this comment (the last comment in the thread):
Last Comment Author: {last_comment_author}
Last Comment Content: {last_comment_content}

Gates (already computed):
- is_ambivalent_sa_domain: {is_ambivalent_sa_domain}
- is_last_comment_by_op: {is_last_comment_by_op}

Rules:
1) If is_ambivalent_sa_domain is false:
   - Output: Not applicable = YES, and all other labels = NO. Stop.

2) If is_last_comment_by_op is true:
   - Only evaluate: Seeking support.
   - Force: Information support/Emotional support/Esteem support/Network support/Tangible assistance = NO.
   - Default rule:
     * If Seeking support = NO, then Group interactions = YES.
     * Else evaluate Group interactions.

                
3) Otherwise:
   - Only evaluate: Information support, Emotional support, Esteem support, Network support, Tangible assistance, Group interactions.
   - Force: Seeking support = NO.
   - Not applicable = YES only if all evaluated labels are NO.

Definitions:
{support_definitions}

Few-shot examples:
{few_shot_examples}

Return ONLY valid JSON with all labels:
{{
  "Information support": "YES/NO",
  "Emotional support": "YES/NO",
  "Esteem support": "YES/NO",
  "Network support": "YES/NO",
  "Tangible assistance": "YES/NO",
  "Seeking support": "YES/NO",
  "Group interactions": "YES/NO",
  "Not applicable": "YES/NO"
}}

Prompt:
{prompt_text}
""".strip()


In [7]:
import re

def extract_title_body(full_prompt: str):
    if not full_prompt or not isinstance(full_prompt, str):
        return "", ""
    title_m = re.search(r"Title:\s*(.*)", full_prompt)
    body_m  = re.search(r"Body:\s*(.*?)(?:\n---\n|Conversation History:|\Z)", full_prompt, flags=re.DOTALL)
    title = title_m.group(1).strip() if title_m else ""
    body  = body_m.group(1).strip() if body_m else ""
    return title, body

def build_gate_title_only(full_prompt: str) -> str:
    title, _ = extract_title_body(full_prompt)
    return f"""Original Post:
Title: {title}
""".strip()

def build_gate_full_op(full_prompt: str) -> str:
    title, body = extract_title_body(full_prompt)
    return f"""Original Post:
Title: {title}
Body: {body}
""".strip()

def domain_gate_title_then_fullbody(model: str, full_prompt: str):
    """
    Gate 1: Title-only
    If false -> Gate 2: Title + full Body (no comments)
    """
    gate1_text = build_gate_title_only(full_prompt)
    dg1 = call_structured(model, DOMAIN_GATE_PROMPT.format(prompt_text=gate1_text), DomainGateOut)
    if dg1.is_ambivalent_sa_domain:
        return dg1, "title_only"

    gate2_text = build_gate_full_op(full_prompt)
    dg2 = call_structured(model, DOMAIN_GATE_PROMPT.format(prompt_text=gate2_text), DomainGateOut)
    return dg2, "full_body_fallback"


In [8]:
from pydantic import BaseModel, Field, ConfigDict
from typing_extensions import Literal

class DomainGateOut(BaseModel):
    is_ambivalent_sa_domain: bool

class LastCommentOut(BaseModel):
    op_author: str
    last_comment_depth: int
    last_comment_author: str
    last_comment_content: str
    is_last_comment_by_op: bool

YesNo = Literal["YES", "NO"]

class MultiLabelOut(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    Information_support: YesNo = Field(alias="Information support")
    Emotional_support: YesNo = Field(alias="Emotional support")
    Esteem_support: YesNo = Field(alias="Esteem support")
    Network_support: YesNo = Field(alias="Network support")
    Tangible_assistance: YesNo = Field(alias="Tangible assistance")
    Seeking_support: YesNo = Field(alias="Seeking support")
    Group_interactions: YesNo = Field(alias="Group interactions")
    Not_applicable: YesNo = Field(alias="Not applicable")

    def to_label_dict(self):
        return {
            "Information support": self.Information_support,
            "Emotional support": self.Emotional_support,
            "Esteem support": self.Esteem_support,
            "Network support": self.Network_support,
            "Tangible assistance": self.Tangible_assistance,
            "Seeking support": self.Seeking_support,
            "Group interactions": self.Group_interactions,
            "Not applicable": self.Not_applicable,
        }


## 4. OpenAI Call Helper (Structured Parse) + Build examples

In [9]:
import re, time

def _retry_after_seconds(err: Exception) -> float | None:
    msg = str(err)
    m = re.search(r"try again in ([0-9.]+)ms", msg, re.IGNORECASE)
    if m:
        return float(m.group(1)) / 1000.0
    m = re.search(r"try again in ([0-9.]+)s", msg, re.IGNORECASE)
    if m:
        return float(m.group(1))
    return None

def call_structured(model: str, prompt: str, out_schema, max_retries: int = 12):
    last_err = None
    for attempt in range(max_retries):
        try:
            resp = client.responses.parse(
                model=model,
                input=[{"role": "user", "content": prompt}],
                text_format=out_schema,
                max_output_tokens=350,  # IMPORTANT: cap output to reduce token reservation
            )
            return resp.output_parsed
        except Exception as e:
            last_err = e
            ra = _retry_after_seconds(e)

            # If server tells us exactly when to retry, obey it.
            if ra is not None:
                sleep_s = ra + 0.2
            else:
                # Gentle fallback backoff (not explosive)
                sleep_s = min(8.0, 0.5 * (1.5 ** (attempt + 1)))

            print(f"[retry {attempt+1}/{max_retries}] {type(e).__name__}: sleeping {sleep_s:.2f}s")
            time.sleep(sleep_s)

    raise RuntimeError(f"OpenAI call failed after retries. Last error: {last_err}")


In [10]:
def build_few_shot_examples(train_df: pd.DataFrame) -> str:
    missing = [lab for lab in SUPPORT_ONLY_LABELS if lab not in train_df.columns]
    if missing:
        print("❌ Missing label columns:", missing)
        return "NO_FEW_SHOT_AVAILABLE"

    blocks = []
    idx = 1

    for lab in SUPPORT_ONLY_LABELS:
        positives = train_df[train_df[lab].astype(str).str.strip().str.lower().eq("yes")].head(2)
        if len(positives) < 2:
            print(f"WARNING: label '{lab}' has only {len(positives)} YES rows")

        for _, row in positives.iterrows():
            gold = {c: "NO" for c in LABELS}
            for c in LABELS:
                if c in train_df.columns:
                    gold[c] = str(row.get(c, "NO")).strip().upper()

            if "Not applicable" not in train_df.columns:
                gold["Not applicable"] = "YES" if all(gold[x] == "NO" for x in SUPPORT_ONLY_LABELS) else "NO"

            blocks.append(
                f"Example {idx}:\n"
                f"Prompt:\n{row['prompt']}\n\n"
                f"Gold labels (JSON):\n{json.dumps(gold, ensure_ascii=False)}\n"
            )
            idx += 1

    return "\n\n".join(blocks).strip()

few_shot_examples = build_few_shot_examples(train_df)
print(few_shot_examples[:1200])


Example 1:
Prompt:
Original Post:
Author: Beginning_Exit_6256
Title: Is this considered sexual assault?
Body: So you have sex with a man with consent. You both want to have sex. You tell him that you don’t swallow semen and you’ve never done that. He tells you to do it but you don’t

He then thinks it’s funny/as a joke to force some of his semen with his hand in your mouth? He forcibly does this with his hand

Is this sexual assault?

edit: he’d probably just get arrested for that. I don’t he’d go to prison for it though lol

---
Conversation History:
Comment (depth 0):
Author: Miaous95
Content: Definitely SA and I’d do it back to him see if he finds it funny

Gold labels (JSON):
{"Information support": "YES", "Emotional support": "NO", "Esteem support": "NO", "Network support": "NO", "Tangible assistance": "NO", "Seeking support": "NO", "Group interactions": "NO", "Not applicable": "NO"}


Example 2:
Prompt:
Original Post:
Author: ahhhhconfuse
Title: Did I (17/f) cheat on my boyfriend

## Run

In [11]:
from openai import OpenAI

client = OpenAI()

models = client.models.list()

# Print nicely
ids = sorted([m.id for m in models.data])
print(f"Total models visible: {len(ids)}\n")
for mid in ids:
    print(mid)


Total models visible: 114

babbage-002
chatgpt-4o-latest
chatgpt-image-latest
codex-mini-latest
dall-e-2
dall-e-3
davinci-002
gpt-3.5-turbo
gpt-3.5-turbo-0125
gpt-3.5-turbo-1106
gpt-3.5-turbo-16k
gpt-3.5-turbo-instruct
gpt-3.5-turbo-instruct-0914
gpt-4
gpt-4-0125-preview
gpt-4-0613
gpt-4-1106-preview
gpt-4-turbo
gpt-4-turbo-2024-04-09
gpt-4-turbo-preview
gpt-4.1
gpt-4.1-2025-04-14
gpt-4.1-mini
gpt-4.1-mini-2025-04-14
gpt-4.1-nano
gpt-4.1-nano-2025-04-14
gpt-4o
gpt-4o-2024-05-13
gpt-4o-2024-08-06
gpt-4o-2024-11-20
gpt-4o-audio-preview
gpt-4o-audio-preview-2024-12-17
gpt-4o-audio-preview-2025-06-03
gpt-4o-mini
gpt-4o-mini-2024-07-18
gpt-4o-mini-audio-preview
gpt-4o-mini-audio-preview-2024-12-17
gpt-4o-mini-realtime-preview
gpt-4o-mini-realtime-preview-2024-12-17
gpt-4o-mini-search-preview
gpt-4o-mini-search-preview-2025-03-11
gpt-4o-mini-transcribe
gpt-4o-mini-transcribe-2025-03-20
gpt-4o-mini-transcribe-2025-12-15
gpt-4o-mini-tts
gpt-4o-mini-tts-2025-03-20
gpt-4o-mini-tts-2025-12-15
gpt

In [12]:
MODEL = "gpt-4.1-mini"
i = 3
full_prompt_text = valid_df.iloc[i]["prompt"]

print("Row:", i)
print("FULL prompt preview:\n", full_prompt_text[:500], "\n")

# Step 1: Domain gate (title-only → fallback full body)
dg_out, gate_used = domain_gate_title_then_fullbody(MODEL, full_prompt_text)
print("DOMAIN GATE:", dg_out, "| gate_used:", gate_used)

if not dg_out.is_ambivalent_sa_domain:
    pred = {lab: "NO" for lab in SUPPORT_ONLY_LABELS}
    pred["Not applicable"] = "YES"
    print("\nFINAL LABELS (forced):")
    print(json.dumps(pred, indent=2))
else:
    # Step 2
    op_out = call_structured(MODEL, OP_LAST_COMMENT_PROMPT.format(prompt_text=full_prompt_text), LastCommentOut)
    print("OP LAST COMMENT:", op_out)

    # Step 3
    final_prompt = FINAL_MULTILABEL_PROMPT.format(
        is_ambivalent_sa_domain=str(dg_out.is_ambivalent_sa_domain).lower(),
        is_last_comment_by_op=str(op_out.is_last_comment_by_op).lower(),
        last_comment_author=op_out.last_comment_author,
        last_comment_content=op_out.last_comment_content,
        support_definitions=SUPPORT_DEFINITIONS,
        few_shot_examples=few_shot_examples,
        prompt_text=full_prompt_text,
    )
    ml_out = call_structured(MODEL, final_prompt, MultiLabelOut)
    pred = ml_out.to_label_dict()

    print("\nFINAL LABELS:")
    print(json.dumps(pred, indent=2))

    yes_labels = [lab for lab in LABELS if pred.get(lab) == "YES"]
    print("\nYES labels:", ", ".join(yes_labels))


Row: 3
FULL prompt preview:
 Original Post:
Author: katiedababie
Title: I (21F) found a spy camera in my room after my step dad (36M) put a surveillance camera in my window
Body: My stepdad (36M) installed a surveillance camera positioned outside of my window to scan our backyard while him and my mother and two younger brothers go camping. I couldn’t go on the trip since i had work, so i was staying home to house sit and feed our animals. My stepdad recently put up new cameras outside so I didn’t see anything wrong with him 

DOMAIN GATE: is_ambivalent_sa_domain=False | gate_used: full_body_fallback

FINAL LABELS (forced):
{
  "Information support": "NO",
  "Emotional support": "NO",
  "Esteem support": "NO",
  "Network support": "NO",
  "Tangible assistance": "NO",
  "Seeking support": "NO",
  "Group interactions": "NO",
  "Not applicable": "YES"
}


In [13]:
import concurrent.futures as cf
from tqdm import tqdm

MODEL = "gpt-4.1-mini"
MAX_WORKERS = 1   # start low; increase only if you stop seeing 429s

ERROR_COL = "llm_error"
YES_LABEL_COL = "predicted_labels_yes"
GATE_USED_COL = "gate_used"

# Ensure OUR output columns exist (do NOT touch Tags)
base_cols = [
    "is_ambivalent_sa_domain",
    "op_author",
    "last_comment_depth",
    "last_comment_author",
    "is_last_comment_by_op",
    ERROR_COL,
    YES_LABEL_COL,
    GATE_USED_COL,
]
for col in base_cols:
    if col not in valid_df.columns:
        valid_df[col] = None

for lab in LABELS:
    if lab not in valid_df.columns:
        valid_df[lab] = None

def run_one(full_prompt_text: str):
    # Step 1: title-only → fallback full body
    dg_out, gate_used = domain_gate_title_then_fullbody(MODEL, full_prompt_text)

    if not dg_out.is_ambivalent_sa_domain:
        pred = {lab: "NO" for lab in SUPPORT_ONLY_LABELS}
        pred["Not applicable"] = "YES"
        yes_labels = [lab for lab, v in pred.items() if v == "YES"]
        return {
            "gate_used": gate_used,
            "is_ambivalent_sa_domain": False,
            "op_author": None,
            "last_comment_depth": None,
            "last_comment_author": None,
            "is_last_comment_by_op": None,
            "pred": pred,
            "predicted_labels_yes": ", ".join(yes_labels),
            "error": None,
        }

    # Step 2
    op_out = call_structured(MODEL, OP_LAST_COMMENT_PROMPT.format(prompt_text=full_prompt_text), LastCommentOut)

    # Step 3
    final_prompt = FINAL_MULTILABEL_PROMPT.format(
        is_ambivalent_sa_domain=str(dg_out.is_ambivalent_sa_domain).lower(),
        is_last_comment_by_op=str(op_out.is_last_comment_by_op).lower(),
        last_comment_author=op_out.last_comment_author,
        last_comment_content=op_out.last_comment_content,
        support_definitions=SUPPORT_DEFINITIONS,
        few_shot_examples=few_shot_examples,
        prompt_text=full_prompt_text,
    )
    ml_out = call_structured(MODEL, final_prompt, MultiLabelOut)
    pred = ml_out.to_label_dict()
    yes_labels = [lab for lab, v in pred.items() if v == "YES"]

    return {
        "gate_used": gate_used,
        "is_ambivalent_sa_domain": True,
        "op_author": op_out.op_author,
        "last_comment_depth": op_out.last_comment_depth,
        "last_comment_author": op_out.last_comment_author,
        "is_last_comment_by_op": op_out.is_last_comment_by_op,
        "pred": pred,
        "predicted_labels_yes": ", ".join(yes_labels),
        "error": None,
    }

# Parallel run
futures = {}
with cf.ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
    for i, row in valid_df.iterrows():
        # Optional resume: skip rows already done successfully
        if pd.notna(row.get(YES_LABEL_COL)) and str(row.get(ERROR_COL, "")).strip() in {"", "nan", "None"}:
            continue
        futures[ex.submit(run_one, row["prompt"])] = i

    for fut in tqdm(cf.as_completed(futures), total=len(futures), desc=f"Full validation (workers={MAX_WORKERS})"):
        i = futures[fut]
        try:
            out = fut.result()

            valid_df.at[i, GATE_USED_COL] = out["gate_used"]
            valid_df.at[i, "is_ambivalent_sa_domain"] = out["is_ambivalent_sa_domain"]
            valid_df.at[i, "op_author"] = out["op_author"]
            valid_df.at[i, "last_comment_depth"] = out["last_comment_depth"]
            valid_df.at[i, "last_comment_author"] = out["last_comment_author"]
            valid_df.at[i, "is_last_comment_by_op"] = out["is_last_comment_by_op"]

            for lab, val in out["pred"].items():
                valid_df.at[i, lab] = val

            valid_df.at[i, YES_LABEL_COL] = out["predicted_labels_yes"]
            valid_df.at[i, ERROR_COL] = out["error"]

        except Exception as e:
            valid_df.at[i, ERROR_COL] = str(e)

# Save once
OUT_PATH = "valid_with_predictions_4mini.csv"
valid_df.to_csv(OUT_PATH, index=False)
print("Saved:", OUT_PATH)
print(f"Left existing 'Tags' untouched. Errors are in '{ERROR_COL}'.")


Full validation (workers=1):   0%|          | 0/86 [00:00<?, ?it/s]

Full validation (workers=1):  66%|██████▋   | 57/86 [04:40<02:18,  4.78s/it]

[retry 1/12] ValidationError: sleeping 0.75s
[retry 2/12] ValidationError: sleeping 1.12s
[retry 3/12] ValidationError: sleeping 1.69s
[retry 4/12] ValidationError: sleeping 2.53s
[retry 5/12] ValidationError: sleeping 3.80s
[retry 6/12] ValidationError: sleeping 5.70s
[retry 7/12] ValidationError: sleeping 8.00s
[retry 8/12] ValidationError: sleeping 8.00s
[retry 9/12] ValidationError: sleeping 8.00s
[retry 10/12] ValidationError: sleeping 8.00s
[retry 11/12] ValidationError: sleeping 8.00s
[retry 12/12] ValidationError: sleeping 8.00s


Full validation (workers=1):  72%|███████▏  | 62/86 [07:10<05:36, 14.02s/it]

[retry 1/12] ValidationError: sleeping 0.75s
[retry 2/12] ValidationError: sleeping 1.12s
[retry 3/12] ValidationError: sleeping 1.69s
[retry 4/12] ValidationError: sleeping 2.53s
[retry 5/12] ValidationError: sleeping 3.80s
[retry 6/12] ValidationError: sleeping 5.70s
[retry 7/12] ValidationError: sleeping 8.00s
[retry 8/12] ValidationError: sleeping 8.00s
[retry 9/12] ValidationError: sleeping 8.00s
[retry 10/12] ValidationError: sleeping 8.00s
[retry 11/12] ValidationError: sleeping 8.00s
[retry 12/12] ValidationError: sleeping 8.00s


Full validation (workers=1): 100%|██████████| 86/86 [10:59<00:00,  7.66s/it]

Saved: valid_with_predictions_4mini.csv
Left existing 'Tags' untouched. Errors are in 'llm_error'.





In [13]:
cols_to_reset = [
    "is_ambivalent_sa_domain","op_author","last_comment_depth","last_comment_author","is_last_comment_by_op",
    "gate_used","llm_error","predicted_labels_yes",
    "Information support","Emotional support","Esteem support","Network support",
    "Tangible assistance","Seeking support","Group interactions","Not applicable",
]
for c in cols_to_reset:
    if c in valid_df.columns:
        valid_df[c] = None

print("Reset done. Now rerun full validation with o4-mini.")

Reset done. Now rerun full validation with o4-mini.


In [22]:
import concurrent.futures as cf
from tqdm import tqdm

MODEL = "o4-mini"
MAX_WORKERS = 2   # start low; increase only if you stop seeing 429s

ERROR_COL = "llm_error"
YES_LABEL_COL = "predicted_labels_yes"
GATE_USED_COL = "gate_used"

# Ensure OUR output columns exist (do NOT touch Tags)
base_cols = [
    "is_ambivalent_sa_domain",
    "op_author",
    "last_comment_depth",
    "last_comment_author",
    "is_last_comment_by_op",
    ERROR_COL,
    YES_LABEL_COL,
    GATE_USED_COL,
]
for col in base_cols:
    if col not in valid_df.columns:
        valid_df[col] = None

for lab in LABELS:
    if lab not in valid_df.columns:
        valid_df[lab] = None

def run_one(full_prompt_text: str):
    # Step 1: title-only → fallback full body
    dg_out, gate_used = domain_gate_title_then_fullbody(MODEL, full_prompt_text)

    if not dg_out.is_ambivalent_sa_domain:
        pred = {lab: "NO" for lab in SUPPORT_ONLY_LABELS}
        pred["Not applicable"] = "YES"
        yes_labels = [lab for lab, v in pred.items() if v == "YES"]
        return {
            "gate_used": gate_used,
            "is_ambivalent_sa_domain": False,
            "op_author": None,
            "last_comment_depth": None,
            "last_comment_author": None,
            "is_last_comment_by_op": None,
            "pred": pred,
            "predicted_labels_yes": ", ".join(yes_labels),
            "error": None,
        }

    # Step 2
    op_out = call_structured(MODEL, OP_LAST_COMMENT_PROMPT.format(prompt_text=full_prompt_text), LastCommentOut)

    # Step 3
    final_prompt = FINAL_MULTILABEL_PROMPT.format(
        is_ambivalent_sa_domain=str(dg_out.is_ambivalent_sa_domain).lower(),
        is_last_comment_by_op=str(op_out.is_last_comment_by_op).lower(),
        support_definitions=SUPPORT_DEFINITIONS,
        few_shot_examples=few_shot_examples,
        prompt_text=full_prompt_text,
    )
    ml_out = call_structured(MODEL, final_prompt, MultiLabelOut)
    pred = ml_out.to_label_dict()
    yes_labels = [lab for lab, v in pred.items() if v == "YES"]

    return {
        "gate_used": gate_used,
        "is_ambivalent_sa_domain": True,
        "op_author": op_out.op_author,
        "last_comment_depth": op_out.last_comment_depth,
        "last_comment_author": op_out.last_comment_author,
        "is_last_comment_by_op": op_out.is_last_comment_by_op,
        "pred": pred,
        "predicted_labels_yes": ", ".join(yes_labels),
        "error": None,
    }

# Parallel run
futures = {}
with cf.ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
    for i, row in valid_df.iterrows():
        # Optional resume: skip rows already done successfully
        if pd.notna(row.get(YES_LABEL_COL)) and str(row.get(ERROR_COL, "")).strip() in {"", "nan", "None"}:
            continue
        futures[ex.submit(run_one, row["prompt"])] = i

    for fut in tqdm(cf.as_completed(futures), total=len(futures), desc=f"Full validation (workers={MAX_WORKERS})"):
        i = futures[fut]
        try:
            out = fut.result()

            valid_df.at[i, GATE_USED_COL] = out["gate_used"]
            valid_df.at[i, "is_ambivalent_sa_domain"] = out["is_ambivalent_sa_domain"]
            valid_df.at[i, "op_author"] = out["op_author"]
            valid_df.at[i, "last_comment_depth"] = out["last_comment_depth"]
            valid_df.at[i, "last_comment_author"] = out["last_comment_author"]
            valid_df.at[i, "is_last_comment_by_op"] = out["is_last_comment_by_op"]

            for lab, val in out["pred"].items():
                valid_df.at[i, lab] = val

            valid_df.at[i, YES_LABEL_COL] = out["predicted_labels_yes"]
            valid_df.at[i, ERROR_COL] = out["error"]

        except Exception as e:
            valid_df.at[i, ERROR_COL] = str(e)

# Save once
OUT_PATH = "valid_with_predictions_o4mini.csv"
valid_df.to_csv(OUT_PATH, index=False)
print("Saved:", OUT_PATH)
print(f"Left existing 'Tags' untouched. Errors are in '{ERROR_COL}'.")


Full validation (workers=2):   0%|          | 0/86 [00:00<?, ?it/s]

Full validation (workers=2):  31%|███▏      | 27/86 [01:52<03:47,  3.85s/it]

[retry 1/12] ValidationError: sleeping 0.75s


Full validation (workers=2):  52%|█████▏    | 45/86 [03:40<03:55,  5.74s/it]

[retry 1/12] ValidationError: sleeping 0.75s


Full validation (workers=2):  62%|██████▏   | 53/86 [04:34<03:18,  6.02s/it]

[retry 1/12] ValidationError: sleeping 0.75s


Full validation (workers=2):  92%|█████████▏| 79/86 [06:27<00:26,  3.81s/it]

[retry 1/12] ValidationError: sleeping 0.75s


Full validation (workers=2):  93%|█████████▎| 80/86 [06:30<00:21,  3.64s/it]

[retry 1/12] ValidationError: sleeping 0.75s


Full validation (workers=2): 100%|██████████| 86/86 [06:55<00:00,  4.83s/it]

Saved: valid_with_predictions_o4mini.csv
Left existing 'Tags' untouched. Errors are in 'llm_error'.





In [51]:
print("Rows:", len(valid_df))

# how many rows got predictions
if "predicted_labels_yes" in valid_df.columns:
    print("predicted_labels_yes filled:", valid_df["predicted_labels_yes"].notna().sum())

# how many rows errored
if "llm_error" in valid_df.columns:
    print("llm_error filled:", valid_df["llm_error"].notna().sum())
    print("\nTop llm_error values:")
    print(valid_df["llm_error"].dropna().value_counts().head(10))

# peek a few rows
cols = [c for c in ["Tags", "predicted_labels_yes", "llm_error", "is_ambivalent_sa_domain"] if c in valid_df.columns]
display(valid_df[cols].head(10))


Rows: 86
predicted_labels_yes filled: 86
llm_error filled: 0

Top llm_error values:
Series([], Name: count, dtype: int64)


Unnamed: 0,Tags,predicted_labels_yes,llm_error,is_ambivalent_sa_domain
0,"Information support, Emotional support, Esteem...","Emotional support, Esteem support",,True
1,Group interactions,Seeking support,,True
2,Not Applicable,Not applicable,,True
3,Not Applicable,Not applicable,,False
4,"Information support, Emotional support, Group ...","Information support, Emotional support",,True
5,Not Applicable,Not applicable,,False
6,"Information support, Esteem support","Information support, Emotional support, Networ...",,True
7,Group interactions,Seeking support,,True
8,"Information support, Emotional support","Information support, Emotional support",,True
9,"Information support, Emotional support, Networ...","Information support, Emotional support, Networ...",,True


In [14]:
mlb = MultiLabelBinarizer(classes=ALL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

print("Is 'Seeking support' in classes?", "Seeking support" in mlb.classes_)
print("Classes:", list(mlb.classes_))



NameError: name 'MultiLabelBinarizer' is not defined

In [15]:
import json, re
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, precision_recall_fscore_support

# ====== 1) LOAD YOUR FILE ======
IN_PATH = "valid_with_predictions_4mini.csv"
df = pd.read_csv(IN_PATH)

print("Loaded:", IN_PATH)
print("Rows:", len(df))
print("Columns:", list(df.columns))

# ====== 2) LABEL SET ======
ALL_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]
LABEL_SET = set(ALL_LABELS)

# map for case/spacing normalization
_norm_map = {lab.lower().strip(): lab for lab in ALL_LABELS}

def normalize_label(t: str) -> str | None:
    t2 = str(t).strip()
    if not t2:
        return None
    key = t2.lower().strip()
    return _norm_map.get(key, None)  # returns canonical label or None if unknown

# ====== 3) PARSE TAG STRINGS ======
def parse_labels(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    s = str(x).strip()
    if s == "" or s.lower() in {"nan", "none"}:
        return []

    labels = []

    # Try JSON list
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            labels = [str(t).strip() for t in obj]
        except Exception:
            try:
                obj = json.loads(re.sub(r"'", '"', s))
                labels = [str(t).strip() for t in obj]
            except Exception:
                labels = []

    # Fallback split
    if not labels:
        parts = re.split(r"[,\n;|]+", s)
        labels = []
        for t in parts:
            t = re.sub(r"^tags?\s*:\s*", "", t.strip(), flags=re.IGNORECASE)
            if t:
                labels.append(t)

    # Normalize + keep only known labels
    out = []
    for t in labels:
        canon = normalize_label(t)
        if canon is not None:
            out.append(canon)
    # de-dup while preserving order
    seen = set()
    out2 = []
    for t in out:
        if t not in seen:
            seen.add(t)
            out2.append(t)
    return out2

# ====== 4) BUILD y_true / y_pred ======
assert "Tags" in df.columns, "Missing ground-truth column: Tags"
assert "predicted_labels_yes" in df.columns, "Missing prediction column: predicted_labels_yes"

y_true_list = df["Tags"].apply(parse_labels).tolist()
y_pred_list = df["predicted_labels_yes"].apply(parse_labels).tolist()

# (Optional) sanity check: how many rows have empty GT/pred after parsing?
print("\nEmpty ground-truth rows:", sum(len(x)==0 for x in y_true_list))
print("Empty prediction rows:", sum(len(x)==0 for x in y_pred_list))

mlb = MultiLabelBinarizer(classes=ALL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

# ====== 5) OVERALL METRICS ======
scores = {
    "micro_precision": precision_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_recall":    recall_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_f1":        f1_score(Y_true, Y_pred, average="micro", zero_division=0),

    "macro_precision": precision_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_recall":    recall_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_f1":        f1_score(Y_true, Y_pred, average="macro", zero_division=0),

    "samples_precision": precision_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_recall":    recall_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_f1":        f1_score(Y_true, Y_pred, average="samples", zero_division=0),
}

print("\nOverall Precision \\& Recall \\& F1")
for k, v in scores.items():
    print(f"{k}: {v:.4f}")

# ====== 6) PER-LABEL METRICS ======
p, r, f1, support = precision_recall_fscore_support(Y_true, Y_pred, average=None, zero_division=0)

per_label_df = pd.DataFrame({
    "label": mlb.classes_,
    "precision": p,
    "recall": r,
    "f1": f1,
    "support_true": support
}).sort_values("f1", ascending=False)

print("\nPer-label metrics:")
display(per_label_df)

print("\nClassification report:")
print(classification_report(Y_true, Y_pred, target_names=mlb.classes_, zero_division=0))


Loaded: valid_with_predictions_4mini.csv
Rows: 86
Columns: ['post_id', 'parent_id', 'subreddit', 'comment_id', 'parent_fullname', 'depth', 'comment_author', 'comment_body', 'comment_score', 'created_utc', 'permalink', 'is_post_author', 'title', 'author', 'body', 'prompt', 'Tags', 'is_ambivalent_sa_domain', 'op_author', 'last_comment_depth', 'last_comment_author', 'is_last_comment_by_op', 'llm_error', 'predicted_labels_yes', 'gate_used', 'Information support', 'Emotional support', 'Esteem support', 'Network support', 'Tangible assistance', 'Seeking support', 'Group interactions', 'Not applicable']

Empty ground-truth rows: 0
Empty prediction rows: 2

Overall Precision \& Recall \& F1
micro_precision: 0.7606
micro_recall: 0.7297
micro_f1: 0.7448
macro_precision: 0.6650
macro_recall: 0.6792
macro_f1: 0.6665
samples_precision: 0.7068
samples_recall: 0.6953
samples_f1: 0.6863

Per-label metrics:


Unnamed: 0,label,precision,recall,f1,support_true
4,Tangible assistance,1.0,1.0,1.0,2
0,Information support,0.911111,0.759259,0.828283,54
1,Emotional support,0.774194,0.888889,0.827586,27
6,Group interactions,0.681818,0.75,0.714286,20
7,Not applicable,0.647059,0.785714,0.709677,14
3,Network support,0.6,0.75,0.666667,4
2,Esteem support,0.705882,0.5,0.585366,24
5,Seeking support,0.0,0.0,0.0,3



Classification report:
                     precision    recall  f1-score   support

Information support       0.91      0.76      0.83        54
  Emotional support       0.77      0.89      0.83        27
     Esteem support       0.71      0.50      0.59        24
    Network support       0.60      0.75      0.67         4
Tangible assistance       1.00      1.00      1.00         2
    Seeking support       0.00      0.00      0.00         3
 Group interactions       0.68      0.75      0.71        20
     Not applicable       0.65      0.79      0.71        14

          micro avg       0.76      0.73      0.74       148
          macro avg       0.67      0.68      0.67       148
       weighted avg       0.77      0.73      0.74       148
        samples avg       0.71      0.70      0.69       148



In [25]:
import json, re
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, precision_recall_fscore_support

# ====== 1) LOAD YOUR FILE ======
IN_PATH = "valid_with_predictions_o4mini.csv"
df = pd.read_csv(IN_PATH)

print("Loaded:", IN_PATH)
print("Rows:", len(df))
print("Columns:", list(df.columns))

# ====== 2) LABEL SET ======
ALL_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]
LABEL_SET = set(ALL_LABELS)

# map for case/spacing normalization
_norm_map = {lab.lower().strip(): lab for lab in ALL_LABELS}

def normalize_label(t: str) -> str | None:
    t2 = str(t).strip()
    if not t2:
        return None
    key = t2.lower().strip()
    return _norm_map.get(key, None)  # returns canonical label or None if unknown

# ====== 3) PARSE TAG STRINGS ======
def parse_labels(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    s = str(x).strip()
    if s == "" or s.lower() in {"nan", "none"}:
        return []

    labels = []

    # Try JSON list
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            labels = [str(t).strip() for t in obj]
        except Exception:
            try:
                obj = json.loads(re.sub(r"'", '"', s))
                labels = [str(t).strip() for t in obj]
            except Exception:
                labels = []

    # Fallback split
    if not labels:
        parts = re.split(r"[,\n;|]+", s)
        labels = []
        for t in parts:
            t = re.sub(r"^tags?\s*:\s*", "", t.strip(), flags=re.IGNORECASE)
            if t:
                labels.append(t)

    # Normalize + keep only known labels
    out = []
    for t in labels:
        canon = normalize_label(t)
        if canon is not None:
            out.append(canon)
    # de-dup while preserving order
    seen = set()
    out2 = []
    for t in out:
        if t not in seen:
            seen.add(t)
            out2.append(t)
    return out2

# ====== 4) BUILD y_true / y_pred ======
assert "Tags" in df.columns, "Missing ground-truth column: Tags"
assert "predicted_labels_yes" in df.columns, "Missing prediction column: predicted_labels_yes"

y_true_list = df["Tags"].apply(parse_labels).tolist()
y_pred_list = df["predicted_labels_yes"].apply(parse_labels).tolist()

# (Optional) sanity check: how many rows have empty GT/pred after parsing?
print("\nEmpty ground-truth rows:", sum(len(x)==0 for x in y_true_list))
print("Empty prediction rows:", sum(len(x)==0 for x in y_pred_list))

mlb = MultiLabelBinarizer(classes=ALL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

# ====== 5) OVERALL METRICS ======
scores = {
    "micro_precision": precision_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_recall":    recall_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_f1":        f1_score(Y_true, Y_pred, average="micro", zero_division=0),

    "macro_precision": precision_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_recall":    recall_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_f1":        f1_score(Y_true, Y_pred, average="macro", zero_division=0),

    "samples_precision": precision_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_recall":    recall_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_f1":        f1_score(Y_true, Y_pred, average="samples", zero_division=0),
}

print("\nOverall Precision \\& Recall \\& F1")
for k, v in scores.items():
    print(f"{k}: {v:.4f}")

# ====== 6) PER-LABEL METRICS ======
p, r, f1, support = precision_recall_fscore_support(Y_true, Y_pred, average=None, zero_division=0)

per_label_df = pd.DataFrame({
    "label": mlb.classes_,
    "precision": p,
    "recall": r,
    "f1": f1,
    "support_true": support
}).sort_values("f1", ascending=False)

print("\nPer-label metrics:")
display(per_label_df)

print("\nClassification report:")
print(classification_report(Y_true, Y_pred, target_names=mlb.classes_, zero_division=0))


Loaded: valid_with_predictions_o4mini.csv
Rows: 86
Columns: ['post_id', 'parent_id', 'subreddit', 'comment_id', 'parent_fullname', 'depth', 'comment_author', 'comment_body', 'comment_score', 'created_utc', 'permalink', 'is_post_author', 'title', 'author', 'body', 'prompt', 'Tags', 'is_ambivalent_sa_domain', 'op_author', 'last_comment_depth', 'last_comment_author', 'is_last_comment_by_op', 'llm_error', 'predicted_labels_yes', 'gate_used', 'Information support', 'Emotional support', 'Esteem support', 'Network support', 'Tangible assistance', 'Seeking support', 'Group interactions', 'Not applicable']

Empty ground-truth rows: 0
Empty prediction rows: 75

Overall Precision \& Recall \& F1
micro_precision: 0.8182
micro_recall: 0.0608
micro_f1: 0.1132
macro_precision: 0.2250
macro_recall: 0.0737
macro_f1: 0.0879
samples_precision: 0.1047
samples_recall: 0.1047
samples_f1: 0.1047

Per-label metrics:


Unnamed: 0,label,precision,recall,f1,support_true
7,Not applicable,0.8,0.571429,0.666667,14
0,Information support,1.0,0.018519,0.036364,54
2,Esteem support,0.0,0.0,0.0,24
1,Emotional support,0.0,0.0,0.0,27
3,Network support,0.0,0.0,0.0,4
4,Tangible assistance,0.0,0.0,0.0,2
5,Seeking support,0.0,0.0,0.0,3
6,Group interactions,0.0,0.0,0.0,20



Classification report:
                     precision    recall  f1-score   support

Information support       1.00      0.02      0.04        54
  Emotional support       0.00      0.00      0.00        27
     Esteem support       0.00      0.00      0.00        24
    Network support       0.00      0.00      0.00         4
Tangible assistance       0.00      0.00      0.00         2
    Seeking support       0.00      0.00      0.00         3
 Group interactions       0.00      0.00      0.00        20
     Not applicable       0.80      0.57      0.67        14

          micro avg       0.82      0.06      0.11       148
          macro avg       0.23      0.07      0.09       148
       weighted avg       0.44      0.06      0.08       148
        samples avg       0.10      0.10      0.10       148

