# Support Multi-Label Pipeline (3-Step)
This notebook:
1) Loads `train.csv` (14 rows) to build few-shot examples  
2) Loads `valid_86.csv` (86 rows) to predict labels  
3) Runs 3 steps: Domain Gate → OP Last Comment Gate → Final Multi-label  
4) Saves `valid_with_predictions_goss_v86.csv`
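
The control flow of the three steps can be sketched with stubbed gates. This is only an illustration: `run_pipeline` and the three callables passed to it are hypothetical stand-ins for the LLM calls defined later in the notebook, and the lambdas in the demo are placeholders, not real classifiers.

```python
from typing import Callable, Dict

LABELS = [
    "Information support", "Emotional support", "Esteem support",
    "Network support", "Tangible assistance", "Seeking support",
    "Group interactions", "Not applicable",
]

def run_pipeline(
    prompt: str,
    domain_gate: Callable[[str], bool],
    last_comment_gate: Callable[[str], dict],
    label_comment: Callable[[str, dict], Dict[str, str]],
) -> Dict[str, str]:
    # Step 1: threads outside the domain are forced to "Not applicable"
    if not domain_gate(prompt):
        return {lab: ("YES" if lab == "Not applicable" else "NO") for lab in LABELS}
    # Step 2: locate the last comment and whether the OP wrote it
    last = last_comment_gate(prompt)
    # Step 3: multi-label classification of that comment
    return label_comment(prompt, last)


# Hypothetical stubs standing in for the LLM calls
demo = run_pipeline(
    "Title: Is this SA?",
    domain_gate=lambda p: "?" in p,
    last_comment_gate=lambda p: {"is_last_comment_by_op": False},
    label_comment=lambda p, last: {**{lab: "NO" for lab in LABELS},
                                   "Information support": "YES"},
)
print(demo)
```

Passing the gates in as callables keeps the branching logic testable without any network access.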

## 1. Project Setup

In [17]:
%pip install -U openai python-dotenv pandas tqdm pydantic scikit-learn



In [18]:
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

MODEL = "openai/gpt-oss-120b:cerebras"
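
Because `os.getenv("HF_TOKEN")` silently returns `None` when the variable is missing, a small guard can surface the problem before the first API call. The helper name `require_env` is an assumption made for this sketch and is not part of the original notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your environment or .env file"
        )
    return value
```

It could then be used as `api_key=require_env("HF_TOKEN")` when constructing the client.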


In [19]:
import os
import json
import time

import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv
from pydantic import BaseModel
from typing_extensions import Literal
from openai import OpenAI

# `client` and MODEL come from the previous cell (Hugging Face router).
# To call the OpenAI API directly instead, make sure OPENAI_API_KEY is set
# in the environment or .env and create the client with OpenAI().


## 2. Loading files and definitions

In [48]:
TRAIN_PATH = "../../data/train.csv"
VALID_PATH = "../../data/valid_86.csv"

train_df = pd.read_csv(TRAIN_PATH)
valid_df = pd.read_csv(VALID_PATH)

assert "prompt" in train_df.columns, "train.csv must contain a 'prompt' column"
assert "prompt" in valid_df.columns, "valid.csv must contain a 'prompt' column"

train_df["prompt"] = train_df["prompt"].fillna("").astype(str)
valid_df["prompt"] = valid_df["prompt"].fillna("").astype(str)

train_df.head(2)


Unnamed: 0,post_id,parent_id,subreddit,comment_id,parent_fullname,depth,comment_author,comment_body,comment_score,created_utc,...,body,prompt,Tags,Information support,Emotional support,Esteem support,Network support,Tangible assistance,Seeking support,Group interactions
0,1lc7y2n,,TooAfraidToAsk,mxysa7g,t3_1lc7y2n,0,Miaous95,Definitely SA and I’d do it back to him see if...,5,1750018747,...,So you have sex with a man with consent. You b...,Original Post:\nAuthor: Beginning_Exit_6256\nT...,Information support,Yes,No,No,No,No,No,No
1,5rf97b,,relationship_advice,dd7kbxy,t3_5rf97b,0,[deleted],You cheated on him. You are responsible for yo...,5,1485988490,...,My boyfriend (17/m) and I had been dating for ...,Original Post:\nAuthor: ahhhhconfuse\nTitle: D...,Information support,Yes,No,No,No,No,No,No


In [49]:
LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]

SUPPORT_ONLY_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
]

SUPPORT_DEFINITIONS = """
Information support: Giving advice, facts, resources, or explanations that clarify what’s happening or what to do (including opinionated judgments like “that is assault” when used to inform/guide).
Emotional support: Messages that express care and empathy: comforting, showing affection/sympathy, encouraging hope, offering prayers, or easing guilt/blame (“I’m so sorry,” “sending hugs,” “stay strong,” “I’ll pray for you,” “it’s not your fault”).
Esteem support: Messages that build the user up by complimenting them or validating their feelings/beliefs/actions as reasonable/normal.
Network support: Encouraging the person to reach out to other people or connect with external help systems (therapy, friends, family, communities, etc.). Note: Suggesting that the person go to the police is Information support, not Network support.
Tangible assistance: When the commenter personally offers to help directly (“I am here,” “You can talk to me”).
Seeking support: Messages where the author explicitly asks for help for themselves, either a direct question/request for info/suggestions or an explicit reassurance request.
Group interactions: Any reply that primarily participates socially in the thread: expressing gratitude/thanks, congratulations, or sharing one’s own experience/story (including “me too” anecdotes). This label can co-occur with other support labels if the comment also gives advice, empathy, or info.
""".strip()


In [50]:
for lab in SUPPORT_ONLY_LABELS:
    if lab not in train_df.columns:
        print(f"❌ Missing column: {lab}")
    else:
        yes_count = (train_df[lab].astype(str).str.strip().str.lower() == "yes").sum()
        print(f"{lab}: YES count = {yes_count}")


Information support: YES count = 2
Emotional support: YES count = 2
Esteem support: YES count = 2
Network support: YES count = 2
Tangible assistance: YES count = 2
Seeking support: YES count = 0
Group interactions: YES count = 2


## 3. Prompts

In [51]:
DOMAIN_GATE_PROMPT = """
You are a psychologist and an expert in Reddit threads about possible sexual harassment/sexual assault.

Task: DOMAIN GATE.

Decide if the author is UNCERTAIN about whether the experience counts as sexual assault/harassment.

Return true ONLY if the author explicitly questions whether the experience counts as SA/harassment.

Return false otherwise.

Return ONLY valid JSON:
{{
  "is_ambivalent_sa_domain": true/false
}}

Prompt:
{prompt_text}
""".strip()


OP_LAST_COMMENT_PROMPT = """
Input format:
- There is an "Original Post:" section with Author, Title, Body.
- There is a "Conversation History:" section with a single/multiple comments.
"Conversation History:" includes blocks like:
  Comment (depth X):
  Author: ... 
  Content: ...

Steps:
1) Extract OP from "Original Post: Author:"
2) Choose the comment with the highest depth in the conversation history.
3) Extract that last comment's Author, Depth, and Content.
4) Compare last comment author with OP.

Return ONLY valid JSON:
{{
  "op_author": "<string>",
  "last_comment_depth": <int>,
  "last_comment_author": "<string>",
  "last_comment_content": "<string>",
  "is_last_comment_by_op": true/false
}}

Prompt:
{prompt_text}
""".strip()


FINAL_MULTILABEL_PROMPT = """
You are a psychologist expert in identifying support in Reddit comments. 

You will label ONLY this comment (the last comment in the thread):
Last Comment Author: {last_comment_author}
Last Comment Content: {last_comment_content}

Gates (already computed):
- is_ambivalent_sa_domain: {is_ambivalent_sa_domain}
- is_last_comment_by_op: {is_last_comment_by_op}

Rules:
1) If is_ambivalent_sa_domain is false:
   - Output: Not applicable = YES, and all other labels = NO. Stop.

2) If is_last_comment_by_op is true:
   - Only evaluate: Seeking support.
   - Force: Information support/Emotional support/Esteem support/Network support/Tangible assistance = NO.
   - Default rule:
     * If Seeking support = NO, then Group interactions = YES.
     * Else evaluate Group interactions.

                
3) Otherwise:
   - Only evaluate: Information support, Emotional support, Esteem support, Network support, Tangible assistance, Group interactions.
   - Force: Seeking support = NO.
   - Not applicable = YES only if all evaluated labels are NO.

Definitions:
{support_definitions}

Few-shot examples:
{few_shot_examples}

Return ONLY valid JSON with all labels:
{{
  "Information support": "YES/NO",
  "Emotional support": "YES/NO",
  "Esteem support": "YES/NO",
  "Network support": "YES/NO",
  "Tangible assistance": "YES/NO",
  "Seeking support": "YES/NO",
  "Group interactions": "YES/NO",
  "Not applicable": "YES/NO"
}}

Prompt:
{prompt_text}
""".strip()


In [52]:
import re

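def extract_title_body(full_prompt: str):
    """Return (title, body) of the original post, or ("", "") if missing."""
    if not full_prompt or not isinstance(full_prompt, str):
        return "", ""
    # [ \t]* (not \s*) keeps an empty title from swallowing the next line
    title_m = re.search(r"Title:[ \t]*(.*)", full_prompt)
    # The body runs until the '---' separator, the comment section, or end of text
    body_m  = re.search(r"Body:\s*(.*?)(?:\n---\n|Conversation History:|\Z)", full_prompt, flags=re.DOTALL)
    title = title_m.group(1).strip() if title_m else ""
    body  = body_m.group(1).strip() if body_m else ""
    return title, body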

def build_gate_title_only(full_prompt: str) -> str:
    title, _ = extract_title_body(full_prompt)
    return f"""Original Post:
Title: {title}
""".strip()

def build_gate_full_op(full_prompt: str) -> str:
    title, body = extract_title_body(full_prompt)
    return f"""Original Post:
Title: {title}
Body: {body}
""".strip()

def domain_gate_title_then_fullbody(model: str, full_prompt: str):
    """
    Gate 1: Title-only
    If false -> Gate 2: Title + full Body (no comments)
    """
    gate1_text = build_gate_title_only(full_prompt)
    dg1 = call_structured(model, DOMAIN_GATE_PROMPT.format(prompt_text=gate1_text), DomainGateOut)
    if dg1.is_ambivalent_sa_domain:
        return dg1, "title_only"

    gate2_text = build_gate_full_op(full_prompt)
    dg2 = call_structured(model, DOMAIN_GATE_PROMPT.format(prompt_text=gate2_text), DomainGateOut)
    return dg2, "full_body_fallback"
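
The same regex approach used for the title and body could, in principle, also extract the last comment without an LLM call. The sketch below assumes the prompt layout seen in the few-shot examples (`Comment (depth X):` / `Author:` / `Content:` blocks under `Conversation History:`). `parse_last_comment` is a hypothetical helper, not part of the original notebook, and it would misparse a comment whose own text contains a `Comment (depth N):` line.

```python
import re

# One block per comment; content runs until the next comment header or end of text
COMMENT_RE = re.compile(
    r"Comment \(depth (\d+)\):\s*Author:[ \t]*(.*?)\s*\nContent:[ \t]*(.*?)"
    r"(?=\nComment \(depth \d+\):|\Z)",
    flags=re.DOTALL,
)

def parse_last_comment(full_prompt: str):
    """Return the deepest comment (ties: the later one) and whether the OP wrote it."""
    op_m = re.search(r"Original Post:\s*Author:[ \t]*(.*)", full_prompt)
    op_author = op_m.group(1).strip() if op_m else ""

    _, _, history = full_prompt.partition("Conversation History:")
    comments = [(int(d), a.strip(), c.strip()) for d, a, c in COMMENT_RE.findall(history)]
    if not comments:
        return None

    # Highest depth wins; among equal depths, prefer the one appearing last
    _, (depth, author, content) = max(
        enumerate(comments), key=lambda t: (t[1][0], t[0])
    )
    return {
        "op_author": op_author,
        "last_comment_depth": depth,
        "last_comment_author": author,
        "last_comment_content": content,
        "is_last_comment_by_op": author == op_author,
    }
```

A deterministic parser like this never has to reproduce long comment text through the model, which is where step 2 is most likely to run out of tokens.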


In [53]:
from pydantic import BaseModel, Field, ConfigDict
from typing_extensions import Literal

class DomainGateOut(BaseModel):
    is_ambivalent_sa_domain: bool

class LastCommentOut(BaseModel):
    op_author: str
    last_comment_depth: int
    last_comment_author: str
    last_comment_content: str
    is_last_comment_by_op: bool

YesNo = Literal["YES", "NO"]

class MultiLabelOut(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    Information_support: YesNo = Field(alias="Information support")
    Emotional_support: YesNo = Field(alias="Emotional support")
    Esteem_support: YesNo = Field(alias="Esteem support")
    Network_support: YesNo = Field(alias="Network support")
    Tangible_assistance: YesNo = Field(alias="Tangible assistance")
    Seeking_support: YesNo = Field(alias="Seeking support")
    Group_interactions: YesNo = Field(alias="Group interactions")
    Not_applicable: YesNo = Field(alias="Not applicable")

    def to_label_dict(self):
        return {
            "Information support": self.Information_support,
            "Emotional support": self.Emotional_support,
            "Esteem support": self.Esteem_support,
            "Network support": self.Network_support,
            "Tangible assistance": self.Tangible_assistance,
            "Seeking support": self.Seeking_support,
            "Group interactions": self.Group_interactions,
            "Not applicable": self.Not_applicable,
        }
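
The rules written into `FINAL_MULTILABEL_PROMPT` could also be enforced in code on the dict returned by `to_label_dict()`, so that a model slip cannot produce an inconsistent row. This is a sketch under assumptions: `enforce_rules` is a hypothetical helper, and treating "Not applicable" as YES exactly when every other label is NO is this sketch's reading of the prompt, which does not state the rule for OP comments.

```python
SUPPORT_GIVING = [
    "Information support", "Emotional support", "Esteem support",
    "Network support", "Tangible assistance",
]
OTHER = ["Seeking support", "Group interactions"]
ALL_LABELS = SUPPORT_GIVING + OTHER + ["Not applicable"]

def enforce_rules(pred: dict, is_ambivalent_sa_domain: bool,
                  is_last_comment_by_op: bool) -> dict:
    """Apply the prompt's rules 1-3 to a YES/NO label dict."""
    if not is_ambivalent_sa_domain:
        # Rule 1: outside the domain, only "Not applicable" is YES
        return {lab: ("YES" if lab == "Not applicable" else "NO") for lab in ALL_LABELS}

    out = {lab: pred.get(lab, "NO") for lab in ALL_LABELS}
    if is_last_comment_by_op:
        # Rule 2: the OP cannot give support to themselves
        for lab in SUPPORT_GIVING:
            out[lab] = "NO"
        if out["Seeking support"] == "NO":
            out["Group interactions"] = "YES"
    else:
        # Rule 3: only the OP can be seeking support
        out["Seeking support"] = "NO"

    # Assumption: "Not applicable" is YES exactly when nothing else is
    others = SUPPORT_GIVING + OTHER
    out["Not applicable"] = "YES" if all(out[l] == "NO" for l in others) else "NO"
    return out
```

Applying it after the final call would leave correct model outputs unchanged and only rewrite rows that break a rule.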


## 4. OpenAI Call Helper (Structured Parse) + Build examples

In [55]:
import json, re, time

def _get_choice_text(choice) -> str | None:
    """
    HF router providers sometimes return text in different fields.
    Try the common ones safely.
    """
    # 1) Standard OpenAI chat format
    msg = getattr(choice, "message", None)
    if msg is not None:
        content = getattr(msg, "content", None)

        # content can be a normal string
        if isinstance(content, str) and content.strip():
            return content

        # sometimes content is a list of parts (rare, provider-dependent)
        if isinstance(content, list):
            parts = []
            for p in content:
                if isinstance(p, dict):
                    # common shapes: {"type":"text","text":"..."} or {"text":"..."}
                    if p.get("type") == "text" and isinstance(p.get("text"), str):
                        parts.append(p["text"])
                    elif isinstance(p.get("text"), str):
                        parts.append(p["text"])
            joined = "".join(parts).strip()
            if joined:
                return joined

    # 2) Legacy completion-style
    txt = getattr(choice, "text", None)
    if isinstance(txt, str) and txt.strip():
        return txt

    return None

def _extract_first_json_obj(text: str) -> str:
    text = text.strip()
    if text.startswith("{") and text.endswith("}"):
        return text
    m = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not m:
        raise ValueError(f"No JSON object found in model output. First 400 chars:\n{text[:400]}")
    return m.group(0)

def call_structured(model: str, prompt: str, out_schema, max_retries: int = 6):
    last_err = None

    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.2,
                max_tokens=900,
            )

            choice0 = resp.choices[0]
            txt = _get_choice_text(choice0)

            if not txt:
                # Empty content usually means the token budget ran out before the answer
                # was written (finish_reason == "length"); log it, then let the loop retry
                print("DEBUG: Empty/None content returned.")
                try:
                    print("finish_reason:", getattr(choice0, "finish_reason", None))
                    print("choice0:", choice0)
                except Exception:
                    pass
                raise ValueError("Model returned no text content (None/empty).")

            json_str = _extract_first_json_obj(txt)
            data = json.loads(json_str)

            # Pydantic v2
            if hasattr(out_schema, "model_validate"):
                return out_schema.model_validate(data)
            # Pydantic v1 fallback
            return out_schema(**data)

        except Exception as e:
            last_err = e
            sleep_s = min(8.0, 0.75 * (1.5 ** attempt))
            print(f"[retry {attempt+1}/{max_retries}] {type(e).__name__}: sleeping {sleep_s:.2f}s")
            time.sleep(sleep_s)

    raise RuntimeError(f"HF call failed after retries. Last error: {last_err}")
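
The greedy pattern `\{.*\}` in `_extract_first_json_obj` spans from the first `{` to the last `}`, so it returns unparseable text when the model emits stray braces or more than one object. One alternative, sketched here with the standard library's `json.JSONDecoder.raw_decode`, scans for the first position that decodes cleanly. The function name `extract_first_json_dict` is an assumption for illustration; unlike the original helper it returns the parsed dict rather than a string.

```python
import json

def extract_first_json_dict(text: str) -> dict:
    """Return the first JSON object that parses, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON here; try the next opening brace
        start = text.find("{", start + 1)
    raise ValueError("No JSON object found in model output")
```

For example, `extract_first_json_dict('note {x} then {"a": 1} and {"b": 2}')` skips the invalid `{x}` and returns the first valid object.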


In [56]:
def build_few_shot_examples(train_df: pd.DataFrame) -> str:
    missing = [lab for lab in SUPPORT_ONLY_LABELS if lab not in train_df.columns]
    if missing:
        print("❌ Missing label columns:", missing)
        return "NO_FEW_SHOT_AVAILABLE"

    blocks = []
    idx = 1

    for lab in SUPPORT_ONLY_LABELS:
        positives = train_df[train_df[lab].astype(str).str.strip().str.lower().eq("yes")].head(2)
        if len(positives) < 2:
            print(f"WARNING: label '{lab}' has only {len(positives)} YES rows")

        for _, row in positives.iterrows():
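            gold = {c: "NO" for c in LABELS}
            for c in LABELS:
                if c in train_df.columns:
                    # Anything other than an explicit "yes" (including NaN) counts as NO
                    val = str(row.get(c, "NO")).strip().upper()
                    gold[c] = "YES" if val == "YES" else "NO"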

            if "Not applicable" not in train_df.columns:
                gold["Not applicable"] = "YES" if all(gold[x] == "NO" for x in SUPPORT_ONLY_LABELS) else "NO"

            blocks.append(
                f"Example {idx}:\n"
                f"Prompt:\n{row['prompt']}\n\n"
                f"Gold labels (JSON):\n{json.dumps(gold, ensure_ascii=False)}\n"
            )
            idx += 1

    return "\n\n".join(blocks).strip()

few_shot_examples = build_few_shot_examples(train_df)
print(few_shot_examples[:1200])


Example 1:
Prompt:
Original Post:
Author: Beginning_Exit_6256
Title: Is this considered sexual assault?
Body: So you have sex with a man with consent. You both want to have sex. You tell him that you don’t swallow semen and you’ve never done that. He tells you to do it but you don’t

He then thinks it’s funny/as a joke to force some of his semen with his hand in your mouth? He forcibly does this with his hand

Is this sexual assault?

edit: he’d probably just get arrested for that. I don’t he’d go to prison for it though lol

---
Conversation History:
Comment (depth 0):
Author: Miaous95
Content: Definitely SA and I’d do it back to him see if he finds it funny

Gold labels (JSON):
{"Information support": "YES", "Emotional support": "NO", "Esteem support": "NO", "Network support": "NO", "Tangible assistance": "NO", "Seeking support": "NO", "Group interactions": "NO", "Not applicable": "NO"}


Example 2:
Prompt:
Original Post:
Author: ahhhhconfuse
Title: Did I (17/f) cheat

## Run

In [27]:
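from openai import OpenAI

# Use a separate name so the Hugging Face router `client` defined above is not overwritten.
# OpenAI() reads OPENAI_API_KEY from the environment.
openai_client = OpenAI()

models = openai_client.models.list()

# Print the visible model IDs in sorted order
ids = sorted(m.id for m in models.data)
print(f"Total models visible: {len(ids)}\n")
for mid in ids:
    print(mid)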


Total models visible: 114

babbage-002
chatgpt-4o-latest
chatgpt-image-latest
codex-mini-latest
dall-e-2
dall-e-3
davinci-002
gpt-3.5-turbo
gpt-3.5-turbo-0125
gpt-3.5-turbo-1106
gpt-3.5-turbo-16k
gpt-3.5-turbo-instruct
gpt-3.5-turbo-instruct-0914
gpt-4
gpt-4-0125-preview
gpt-4-0613
gpt-4-1106-preview
gpt-4-turbo
gpt-4-turbo-2024-04-09
gpt-4-turbo-preview
gpt-4.1
gpt-4.1-2025-04-14
gpt-4.1-mini
gpt-4.1-mini-2025-04-14
gpt-4.1-nano
gpt-4.1-nano-2025-04-14
gpt-4o
gpt-4o-2024-05-13
gpt-4o-2024-08-06
gpt-4o-2024-11-20
gpt-4o-audio-preview
gpt-4o-audio-preview-2024-12-17
gpt-4o-audio-preview-2025-06-03
gpt-4o-mini
gpt-4o-mini-2024-07-18
gpt-4o-mini-audio-preview
gpt-4o-mini-audio-preview-2024-12-17
gpt-4o-mini-realtime-preview
gpt-4o-mini-realtime-preview-2024-12-17
gpt-4o-mini-search-preview
gpt-4o-mini-search-preview-2025-03-11
gpt-4o-mini-transcribe
gpt-4o-mini-transcribe-2025-03-20
gpt-4o-mini-transcribe-2025-12-15
gpt-4o-mini-tts
gpt-4o-mini-tts-2025-03-20
gpt-4o-mini-tts-2025-12-15
gpt

In [57]:
test_prompt = DOMAIN_GATE_PROMPT.format(prompt_text="Original Post:\nTitle: AIO is this SA?\nBody: ...")
print(call_structured(MODEL, test_prompt, DomainGateOut))


is_ambivalent_sa_domain=True


In [58]:
MODEL = "openai/gpt-oss-120b:groq"
i = 5
full_prompt_text = valid_df.iloc[i]["prompt"]

print("Row:", i)
print("FULL prompt preview:\n", full_prompt_text[:500], "\n")

# Step 1: Domain gate (title-only → fallback full body)
dg_out, gate_used = domain_gate_title_then_fullbody(MODEL, full_prompt_text)
print("DOMAIN GATE:", dg_out, "| gate_used:", gate_used)

if not dg_out.is_ambivalent_sa_domain:
    pred = {lab: "NO" for lab in SUPPORT_ONLY_LABELS}
    pred["Not applicable"] = "YES"
    print("\nFINAL LABELS (forced):")
    print(json.dumps(pred, indent=2))
else:
    # Step 2
    op_out = call_structured(MODEL, OP_LAST_COMMENT_PROMPT.format(prompt_text=full_prompt_text), LastCommentOut)
    print("OP LAST COMMENT:", op_out)

    # Step 3
    final_prompt = FINAL_MULTILABEL_PROMPT.format(
        is_ambivalent_sa_domain=str(dg_out.is_ambivalent_sa_domain).lower(),
        is_last_comment_by_op=str(op_out.is_last_comment_by_op).lower(),
        last_comment_author=op_out.last_comment_author,
        last_comment_content=op_out.last_comment_content,
        support_definitions=SUPPORT_DEFINITIONS,
        few_shot_examples=few_shot_examples,
        prompt_text=full_prompt_text,
    )
    ml_out = call_structured(MODEL, final_prompt, MultiLabelOut)
    pred = ml_out.to_label_dict()

    print("\nFINAL LABELS:")
    print(json.dumps(pred, indent=2))

    yes_labels = [lab for lab in LABELS if pred.get(lab) == "YES"]
    print("\nYES labels:", ", ".join(yes_labels))


Row: 5
FULL prompt preview:
 Original Post:
Author: AnteaterDue3967
Title: Attracted to mainly Asian women and I felt like a creep for it.
Body: Now to set the record straight I'm not that type of guy that's into super dainty, skinny short,docile, racist stereotype of Asian women. More into muscular/ forward or even alt type of girls of any race really  
But to be honest I kind of have a thing for Asian women and part of me feels like it might be a race thing but not 100% sure if it's tha or because of the BS Hollywood fed  

DOMAIN GATE: is_ambivalent_sa_domain=False | gate_used: full_body_fallback

FINAL LABELS (forced):
{
  "Information support": "NO",
  "Emotional support": "NO",
  "Esteem support": "NO",
  "Network support": "NO",
  "Tangible assistance": "NO",
  "Seeking support": "NO",
  "Group interactions": "NO",
  "Not applicable": "YES"
}


In [59]:
import concurrent.futures as cf
from tqdm import tqdm

MODEL = "openai/gpt-oss-120b:groq"
MAX_WORKERS = 1   # start low; increase only if you stop seeing 429s

ERROR_COL = "llm_error"
YES_LABEL_COL = "predicted_labels_yes"
GATE_USED_COL = "gate_used"

# Ensure OUR output columns exist (do NOT touch Tags)
base_cols = [
    "is_ambivalent_sa_domain",
    "op_author",
    "last_comment_depth",
    "last_comment_author",
    "is_last_comment_by_op",
    ERROR_COL,
    YES_LABEL_COL,
    GATE_USED_COL,
]
for col in base_cols:
    if col not in valid_df.columns:
        valid_df[col] = None

for lab in LABELS:
    if lab not in valid_df.columns:
        valid_df[lab] = None

def run_one(full_prompt_text: str):
    # Step 1: title-only → fallback full body
    dg_out, gate_used = domain_gate_title_then_fullbody(MODEL, full_prompt_text)

    if not dg_out.is_ambivalent_sa_domain:
        pred = {lab: "NO" for lab in SUPPORT_ONLY_LABELS}
        pred["Not applicable"] = "YES"
        yes_labels = [lab for lab, v in pred.items() if v == "YES"]
        return {
            "gate_used": gate_used,
            "is_ambivalent_sa_domain": False,
            "op_author": None,
            "last_comment_depth": None,
            "last_comment_author": None,
            "is_last_comment_by_op": None,
            "pred": pred,
            "predicted_labels_yes": ", ".join(yes_labels),
            "error": None,
        }

    # Step 2
    op_out = call_structured(MODEL, OP_LAST_COMMENT_PROMPT.format(prompt_text=full_prompt_text), LastCommentOut)

    # Step 3
    final_prompt = FINAL_MULTILABEL_PROMPT.format(
        is_ambivalent_sa_domain=str(dg_out.is_ambivalent_sa_domain).lower(),
        is_last_comment_by_op=str(op_out.is_last_comment_by_op).lower(),
        last_comment_author=op_out.last_comment_author,
        last_comment_content=op_out.last_comment_content,
        support_definitions=SUPPORT_DEFINITIONS,
        few_shot_examples=few_shot_examples,
        prompt_text=full_prompt_text,
    )
    ml_out = call_structured(MODEL, final_prompt, MultiLabelOut)
    pred = ml_out.to_label_dict()
    yes_labels = [lab for lab, v in pred.items() if v == "YES"]

    return {
        "gate_used": gate_used,
        "is_ambivalent_sa_domain": True,
        "op_author": op_out.op_author,
        "last_comment_depth": op_out.last_comment_depth,
        "last_comment_author": op_out.last_comment_author,
        "is_last_comment_by_op": op_out.is_last_comment_by_op,
        "pred": pred,
        "predicted_labels_yes": ", ".join(yes_labels),
        "error": None,
    }

# Parallel run
futures = {}
with cf.ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
    for i, row in valid_df.iterrows():
        # Optional resume: skip rows already done successfully
        if pd.notna(row.get(YES_LABEL_COL)) and str(row.get(ERROR_COL, "")).strip() in {"", "nan", "None"}:
            continue
        futures[ex.submit(run_one, row["prompt"])] = i

    for fut in tqdm(cf.as_completed(futures), total=len(futures), desc=f"Full validation (workers={MAX_WORKERS})"):
        i = futures[fut]
        try:
            out = fut.result()

            valid_df.at[i, GATE_USED_COL] = out["gate_used"]
            valid_df.at[i, "is_ambivalent_sa_domain"] = out["is_ambivalent_sa_domain"]
            valid_df.at[i, "op_author"] = out["op_author"]
            valid_df.at[i, "last_comment_depth"] = out["last_comment_depth"]
            valid_df.at[i, "last_comment_author"] = out["last_comment_author"]
            valid_df.at[i, "is_last_comment_by_op"] = out["is_last_comment_by_op"]

            for lab, val in out["pred"].items():
                valid_df.at[i, lab] = val

            valid_df.at[i, YES_LABEL_COL] = out["predicted_labels_yes"]
            valid_df.at[i, ERROR_COL] = out["error"]

        except Exception as e:
            valid_df.at[i, ERROR_COL] = str(e)

# Save once
OUT_PATH = "valid_with_predictions_goss_v86.csv"
valid_df.to_csv(OUT_PATH, index=False)
print("Saved:", OUT_PATH)
print(f"Left existing 'Tags' untouched. Errors are in '{ERROR_COL}'.")


Full validation (workers=1):   0%|          | 0/86 [00:00<?, ?it/s]

[retry 1/6] ValueError: sleeping 0.75s


Full validation (workers=1):  13%|█▎        | 11/86 [00:30<03:19,  2.66s/it]

[retry 1/6] ValueError: sleeping 0.75s


Full validation (workers=1):  53%|█████▎    | 46/86 [02:14<01:51,  2.79s/it]

DEBUG: Empty/None content returned.
finish_reason: length
choice0: Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning='We need to apply rules. is_ambivalent_sa_domain = true, so we proceed. is_last_comment_by_op = false (since author is Snoo_78896, not OP). So we go to "Otherwise" branch.\n\nWe must evaluate: Information support, Emotional support, Esteem support, Network support, Tangible assistance, Group interactions. Force: Seeking support = NO. Not applicable = YES only if all evaluated labels are NO.\n\nNow we need to label each based on the comment content.\n\nComment content: "Im sorry that you feel the way you do. To be honest, you\'re borh in the wrong here. 1. You put yourself in a vulnerable position to be taken advantage of. 2. He took advantage of you. You should be able to trust your boyfriend, and he should respect yo

Full validation (workers=1):  66%|██████▋   | 57/86 [02:51<01:25,  2.94s/it]

DEBUG: Empty/None content returned.
finish_reason: length
choice0: Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning='We need to follow steps:\n\n1) Extract OP author from Original Post: Author: DaddyBren42\n\n2) Choose comment with highest depth in conversation history. Depth values: 0,1,2,3,4,5. Highest is depth 5.\n\n3) Extract that last comment\'s Author, Depth, Content. Depth 5 comment: Author: ParentPostLacksWang, Content: (the long paragraph). Need to capture content exactly as given.\n\n4) Compare last comment author with OP. OP is DaddyBren42, last comment author is ParentPostLacksWang, so not same => false.\n\nReturn JSON with fields.\n\nWe must ensure JSON formatting, strings escaped properly. Provide content as a string; need to preserve line breaks? Usually we can keep as single line with spaces. We\'ll include the cont

Full validation (workers=1): 100%|██████████| 86/86 [04:14<00:00,  2.95s/it]

Saved: valid_with_predictions_goss_v86.csv
Left existing 'Tags' untouched. Errors are in 'llm_error'.
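
Both debug dumps above show `finish_reason: length` with empty `content`: the 900-token budget was consumed by the model's reasoning before any JSON was written. Retrying with the same budget may simply repeat the failure, so one option is to grow the budget on each truncated attempt. The sketch below is hypothetical: `call_with_growing_budget` and the `generate` callable (which would wrap `client.chat.completions.create` and return `(content, finish_reason)`) are assumptions, as are the cap and growth factor.

```python
from typing import Callable, Optional, Tuple

def call_with_growing_budget(
    generate: Callable[[int], Tuple[Optional[str], Optional[str]]],
    start_tokens: int = 900,
    max_tokens_cap: int = 4000,
    growth: float = 2.0,
) -> str:
    """Call generate(max_tokens), enlarging the budget while output is cut off."""
    budget = start_tokens
    while True:
        text, finish_reason = generate(budget)
        if text and text.strip():
            # Note: non-empty text is returned as-is, even if it was truncated
            return text
        # Only a length cut-off is worth retrying with a larger budget
        if finish_reason != "length" or budget >= max_tokens_cap:
            raise ValueError(
                f"No text returned (finish_reason={finish_reason!r}, max_tokens={budget})"
            )
        budget = min(max_tokens_cap, int(budget * growth))
```

Keeping the default budget small and growing only on demand avoids paying for long completions on the majority of rows that finish within the initial limit.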





In [13]:
cols_to_reset = [
    "is_ambivalent_sa_domain","op_author","last_comment_depth","last_comment_author","is_last_comment_by_op",
    "gate_used","llm_error","predicted_labels_yes",
    "Information support","Emotional support","Esteem support","Network support",
    "Tangible assistance","Seeking support","Group interactions","Not applicable",
]
for c in cols_to_reset:
    if c in valid_df.columns:
        valid_df[c] = None

print("Reset done. Now rerun the full validation cell.")

Reset done. Now rerun the full validation cell.


In [60]:
import json, re
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, precision_recall_fscore_support
)

# ====== 1) LOAD YOUR FILE ======
IN_PATH = "valid_with_predictions_goss_v86.csv"
df = pd.read_csv(IN_PATH)

print("Loaded:", IN_PATH)
print("Rows:", len(df))
print("Columns:", list(df.columns))

# ====== 2) LABEL SET ======
ALL_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]

EXCLUDE_LABELS = {"Not applicable", "Seeking support"}
EVAL_LABELS = [lab for lab in ALL_LABELS if lab not in EXCLUDE_LABELS]

# map for case/spacing normalization
_norm_map = {lab.lower().strip(): lab for lab in ALL_LABELS}

def normalize_label(t: str):
    t2 = str(t).strip()
    if not t2:
        return None
    key = t2.lower().strip()
    return _norm_map.get(key, None)  # canonical label or None if unknown

# ====== 3) PARSE TAG STRINGS ======
def parse_labels(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    s = str(x).strip()
    if s == "" or s.lower() in {"nan", "none"}:
        return []

    labels = []

    # Try JSON list
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            labels = [str(t).strip() for t in obj]
        except Exception:
            try:
                obj = json.loads(re.sub(r"'", '"', s))
                labels = [str(t).strip() for t in obj]
            except Exception:
                labels = []

    # Fallback split
    if not labels:
        parts = re.split(r"[,\n;|]+", s)
        labels = []
        for t in parts:
            t = re.sub(r"^tags?\s*:\s*", "", t.strip(), flags=re.IGNORECASE)
            if t:
                labels.append(t)

    # Normalize + keep only known labels
    out = []
    for t in labels:
        canon = normalize_label(t)
        if canon is not None:
            out.append(canon)

    # de-dup while preserving order
    seen = set()
    out2 = []
    for t in out:
        if t not in seen:
            seen.add(t)
            out2.append(t)
    return out2

# ====== 4) BUILD y_true / y_pred ======
assert "Tags" in df.columns, "Missing ground-truth column: Tags"
assert "predicted_labels_yes" in df.columns, "Missing prediction column: predicted_labels_yes"

y_true_list_all = df["Tags"].apply(parse_labels).tolist()
y_pred_list_all = df["predicted_labels_yes"].apply(parse_labels).tolist()

print("\nEmpty ground-truth rows:", sum(len(x) == 0 for x in y_true_list_all))
print("Empty prediction rows:", sum(len(x) == 0 for x in y_pred_list_all))

# --- Filter OUT excluded labels at the list level (optional but keeps things clean)
def drop_excluded(labels):
    return [t for t in labels if t not in EXCLUDE_LABELS]

y_true_list = [drop_excluded(x) for x in y_true_list_all]
y_pred_list = [drop_excluded(x) for x in y_pred_list_all]

# Binarize ONLY over evaluation labels
mlb = MultiLabelBinarizer(classes=EVAL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

print("\nEvaluating on labels (excluded: Not applicable, Seeking support):")
print(mlb.classes_)

# ====== 5) OVERALL METRICS ======
scores = {
    "micro_precision": precision_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_recall":    recall_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_f1":        f1_score(Y_true, Y_pred, average="micro", zero_division=0),

    "macro_precision": precision_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_recall":    recall_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_f1":        f1_score(Y_true, Y_pred, average="macro", zero_division=0),

    "samples_precision": precision_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_recall":    recall_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_f1":        f1_score(Y_true, Y_pred, average="samples", zero_division=0),
}

print("\nOverall Precision, Recall & F1 (EXCLUDING Not applicable + Seeking support)")
for k, v in scores.items():
    print(f"{k}: {v:.4f}")

# ====== 6) PER-LABEL METRICS ======
p, r, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)

per_label_df = pd.DataFrame({
    "label": mlb.classes_,
    "precision": p,
    "recall": r,
    "f1": f1,
    "support_true": support
}).sort_values("f1", ascending=False)

print("\nPer-label metrics (excluded labels removed):")
display(per_label_df)

print("\nClassification report (excluded labels removed):")
print(classification_report(Y_true, Y_pred, target_names=mlb.classes_, zero_division=0))


Loaded: valid_with_predictions_goss_v86.csv
Rows: 86
Columns: ['post_id', 'parent_id', 'subreddit', 'comment_id', 'parent_fullname', 'depth', 'comment_author', 'comment_body', 'comment_score', 'created_utc', 'permalink', 'is_post_author', 'title', 'author', 'body', 'prompt', 'Tags', 'is_ambivalent_sa_domain', 'op_author', 'last_comment_depth', 'last_comment_author', 'is_last_comment_by_op', 'llm_error', 'predicted_labels_yes', 'gate_used', 'Information support', 'Emotional support', 'Esteem support', 'Network support', 'Tangible assistance', 'Seeking support', 'Group interactions', 'Not applicable']

Empty ground-truth rows: 0
Empty prediction rows: 0

Evaluating on labels (excluded: Not applicable, Seeking support):
['Information support' 'Emotional support' 'Esteem support'
 'Network support' 'Tangible assistance' 'Group interactions']

Overall Precision \& Recall \& F1 (EXCLUDING Not applicable + Seeking support)
micro_precision: 0.9292
micro_recall: 0.8015
micro_f1: 0.8607
macro_pr

   label                precision  recall    f1        support_true
4  Tangible assistance  1.0        1.0       1.0       2
0  Information support  0.979592   0.888889  0.932039  54
1  Emotional support    0.896552   0.962963  0.928571  27
3  Network support      1.0        0.75      0.857143  4
5  Group interactions   0.928571   0.65      0.764706  20
2  Esteem support       0.8125     0.541667  0.65      24



Classification report (excluded labels removed):
                     precision    recall  f1-score   support

Information support       0.98      0.89      0.93        54
  Emotional support       0.90      0.96      0.93        27
     Esteem support       0.81      0.54      0.65        24
    Network support       1.00      0.75      0.86         4
Tangible assistance       1.00      1.00      1.00         2
 Group interactions       0.93      0.65      0.76        20

          micro avg       0.93      0.80      0.86       131
          macro avg       0.94      0.80      0.86       131
       weighted avg       0.93      0.80      0.85       131
        samples avg       0.69      0.65      0.66       131
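
The samples average in the report above sits well below the micro average. One plausible contributor: after the excluded labels are dropped, a row whose only labels were "Not applicable" or "Seeking support" becomes empty in both `Y_true` and `Y_pred`, and with `zero_division=0` such a row scores 0 instead of being skipped. The sketch below reproduces that effect in plain Python; the three rows are hypothetical, and `sample_f1` is an illustrative helper, not part of the notebook.

```python
def sample_f1(true_labels, pred_labels):
    """F1 for one row; returns 0.0 when undefined (mirrors zero_division=0)."""
    t, p = set(true_labels), set(pred_labels)
    tp = len(t & p)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(t)
    return 2 * precision * recall / (precision + recall)

# Hypothetical rows AFTER the excluded labels were dropped.
# The third row had only excluded labels, so it is empty on both sides.
y_true = [["Information support"], ["Emotional support", "Esteem support"], []]
y_pred = [["Information support"], ["Emotional support"], []]

per_row = [sample_f1(t, p) for t, p in zip(y_true, y_pred)]

# Mean over every row: the empty row counts as 0.
all_rows = sum(per_row) / len(per_row)

# Mean over rows that have at least one evaluation label on either side.
kept = [f for f, t, p in zip(per_row, y_true, y_pred) if t or p]
non_empty = sum(kept) / len(kept)

print(f"samples F1, all rows:       {all_rows:.4f}")
print(f"samples F1, non-empty rows: {non_empty:.4f}")
```

If this is what happens in the real data, reporting the samples average over non-empty rows alongside the full one would make the two numbers easier to interpret.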



In [35]:
import json, re
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, precision_recall_fscore_support
)

# ====== 1) LOAD YOUR FILE ======
IN_PATH = "test_with_predictions_goss_t44.csv"
df = pd.read_csv(IN_PATH)

print("Loaded:", IN_PATH)
print("Rows:", len(df))
print("Columns:", list(df.columns))

# ====== 2) LABEL SET ======
ALL_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]

EXCLUDE_LABELS = {"Not applicable", "Seeking support"}
EVAL_LABELS = [lab for lab in ALL_LABELS if lab not in EXCLUDE_LABELS]

# Case-insensitive lookup: lowercased label -> canonical label.
# Only leading/trailing whitespace is stripped; internal spacing must match.
_norm_map = {lab.lower().strip(): lab for lab in ALL_LABELS}

def normalize_label(t: str):
    """Return the canonical label for t, or None if t is blank or unknown."""
    key = str(t).strip().lower()
    if not key:
        return None
    return _norm_map.get(key)

# ====== 3) PARSE TAG STRINGS ======
def parse_labels(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    s = str(x).strip()
    if s == "" or s.lower() in {"nan", "none"}:
        return []

    labels = []

    # Try JSON list
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            labels = [str(t).strip() for t in obj]
        except Exception:
            try:
                obj = json.loads(re.sub(r"'", '"', s))
                labels = [str(t).strip() for t in obj]
            except Exception:
                labels = []

    # Fallback split
    if not labels:
        parts = re.split(r"[,\n;|]+", s)
        labels = []
        for t in parts:
            t = re.sub(r"^tags?\s*:\s*", "", t.strip(), flags=re.IGNORECASE)
            if t:
                labels.append(t)

    # Normalize + keep only known labels
    out = []
    for t in labels:
        canon = normalize_label(t)
        if canon is not None:
            out.append(canon)

    # de-dup while preserving order
    seen = set()
    out2 = []
    for t in out:
        if t not in seen:
            seen.add(t)
            out2.append(t)
    return out2

# ====== 4) BUILD y_true / y_pred ======
assert "Tags" in df.columns, "Missing ground-truth column: Tags"
assert "predicted_labels_yes" in df.columns, "Missing prediction column: predicted_labels_yes"

y_true_list_all = df["Tags"].apply(parse_labels).tolist()
y_pred_list_all = df["predicted_labels_yes"].apply(parse_labels).tolist()

print("\nEmpty ground-truth rows:", sum(len(x) == 0 for x in y_true_list_all))
print("Empty prediction rows:", sum(len(x) == 0 for x in y_pred_list_all))

# --- Filter OUT excluded labels at the list level (optional but keeps things clean)
def drop_excluded(labels):
    return [t for t in labels if t not in EXCLUDE_LABELS]

y_true_list = [drop_excluded(x) for x in y_true_list_all]
y_pred_list = [drop_excluded(x) for x in y_pred_list_all]

# Binarize ONLY over evaluation labels
mlb = MultiLabelBinarizer(classes=EVAL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

print("\nEvaluating on labels (excluded: Not applicable, Seeking support):")
print(mlb.classes_)

# ====== 5) OVERALL METRICS ======
scores = {
    "micro_precision": precision_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_recall":    recall_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_f1":        f1_score(Y_true, Y_pred, average="micro", zero_division=0),

    "macro_precision": precision_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_recall":    recall_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_f1":        f1_score(Y_true, Y_pred, average="macro", zero_division=0),

    "samples_precision": precision_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_recall":    recall_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_f1":        f1_score(Y_true, Y_pred, average="samples", zero_division=0),
}

print("\nOverall Precision \\& Recall \\& F1 (EXCLUDING Not applicable + Seeking support)")
for k, v in scores.items():
    print(f"{k}: {v:.4f}")

# ====== 6) PER-LABEL METRICS ======
p, r, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)

per_label_df = pd.DataFrame({
    "label": mlb.classes_,
    "precision": p,
    "recall": r,
    "f1": f1,
    "support_true": support
}).sort_values("f1", ascending=False)

print("\nPer-label metrics (excluded labels removed):")
display(per_label_df)

print("\nClassification report (excluded labels removed):")
print(classification_report(Y_true, Y_pred, target_names=mlb.classes_, zero_division=0))


Loaded: test_with_predictions_goss_t44.csv
Rows: 44
Columns: ['post_id', 'parent_id', 'subreddit', 'comment_id', 'parent_fullname', 'depth', 'comment_author', 'comment_body', 'comment_score', 'created_utc', 'permalink', 'is_post_author', 'title', 'author', 'body', 'prompt', 'Tags', 'is_ambivalent_sa_domain', 'op_author', 'last_comment_depth', 'last_comment_author', 'is_last_comment_by_op', 'llm_error', 'predicted_labels_yes', 'gate_used', 'Information support', 'Emotional support', 'Esteem support', 'Network support', 'Tangible assistance', 'Seeking support', 'Group interactions', 'Not applicable']

Empty ground-truth rows: 0
Empty prediction rows: 0

Evaluating on labels (excluded: Not applicable, Seeking support):
['Information support' 'Emotional support' 'Esteem support'
 'Network support' 'Tangible assistance' 'Group interactions']

Overall Precision \& Recall \& F1 (EXCLUDING Not applicable + Seeking support)
micro_precision: 0.9400
micro_recall: 0.7966
micro_f1: 0.8624
macro_pre

   label                precision  recall  f1        support_true
3  Network support      1.0        1.0     1.0       1
1  Emotional support    0.866667   1.0     0.928571  13
0  Information support  1.0        0.8     0.888889  25
5  Group interactions   0.888889   0.8     0.842105  10
2  Esteem support       1.0        0.5     0.666667  10
4  Tangible assistance  0.0        0.0     0.0       0



Classification report (excluded labels removed):
                     precision    recall  f1-score   support

Information support       1.00      0.80      0.89        25
  Emotional support       0.87      1.00      0.93        13
     Esteem support       1.00      0.50      0.67        10
    Network support       1.00      1.00      1.00         1
Tangible assistance       0.00      0.00      0.00         0
 Group interactions       0.89      0.80      0.84        10

          micro avg       0.94      0.80      0.86        59
          macro avg       0.79      0.68      0.72        59
       weighted avg       0.95      0.80      0.85        59
        samples avg       0.59      0.57      0.57        59
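
In the report above, "Tangible assistance" has support 0 on this split. Its precision, recall and F1 are reported as 0.00 only because of `zero_division=0`, yet it still counts as one of six labels in the macro average. The sketch below recomputes macro F1 from the per-label table, first over all labels and then over labels with at least one true instance; restricting the average this way is an assumption about what is wanted, not something the notebook does.

```python
# Per-label (f1, support_true) copied from the per-label table for
# test_with_predictions_goss_t44.csv.
per_label = {
    "Network support":     (1.0,      1),
    "Emotional support":   (0.928571, 13),
    "Information support": (0.888889, 25),
    "Group interactions":  (0.842105, 10),
    "Esteem support":      (0.666667, 10),
    "Tangible assistance": (0.0,      0),
}

def macro(values):
    """Unweighted mean of per-label scores."""
    values = list(values)
    return sum(values) / len(values)

# Macro F1 as reported: every label counts, including the one with no support.
macro_all = macro(f1 for f1, _ in per_label.values())

# Macro F1 over labels that actually occur in the ground truth.
macro_supported = macro(f1 for f1, n in per_label.values() if n > 0)

print(f"macro F1, all 6 labels:        {macro_all:.4f}")
print(f"macro F1, labels with support: {macro_supported:.4f}")
```

The first value matches the 0.72 in the report; the second is the figure to compare against the validation runs, where every label has support.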



In [47]:
import json, re
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, precision_recall_fscore_support
)

# ====== 1) LOAD YOUR FILE ======
IN_PATH = "valid_with_predictions_goss_v44.csv"
df = pd.read_csv(IN_PATH)

print("Loaded:", IN_PATH)
print("Rows:", len(df))
print("Columns:", list(df.columns))

# ====== 2) LABEL SET ======
ALL_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]

EXCLUDE_LABELS = {"Not applicable", "Seeking support"}
EVAL_LABELS = [lab for lab in ALL_LABELS if lab not in EXCLUDE_LABELS]

# Case-insensitive lookup: lowercased label -> canonical label.
# Only leading/trailing whitespace is stripped; internal spacing must match.
_norm_map = {lab.lower().strip(): lab for lab in ALL_LABELS}

def normalize_label(t: str):
    """Return the canonical label for t, or None if t is blank or unknown."""
    key = str(t).strip().lower()
    if not key:
        return None
    return _norm_map.get(key)

# ====== 3) PARSE TAG STRINGS ======
def parse_labels(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    s = str(x).strip()
    if s == "" or s.lower() in {"nan", "none"}:
        return []

    labels = []

    # Try JSON list
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            labels = [str(t).strip() for t in obj]
        except Exception:
            try:
                obj = json.loads(re.sub(r"'", '"', s))
                labels = [str(t).strip() for t in obj]
            except Exception:
                labels = []

    # Fallback split
    if not labels:
        parts = re.split(r"[,\n;|]+", s)
        labels = []
        for t in parts:
            t = re.sub(r"^tags?\s*:\s*", "", t.strip(), flags=re.IGNORECASE)
            if t:
                labels.append(t)

    # Normalize + keep only known labels
    out = []
    for t in labels:
        canon = normalize_label(t)
        if canon is not None:
            out.append(canon)

    # de-dup while preserving order
    seen = set()
    out2 = []
    for t in out:
        if t not in seen:
            seen.add(t)
            out2.append(t)
    return out2

# ====== 4) BUILD y_true / y_pred ======
assert "Tags" in df.columns, "Missing ground-truth column: Tags"
assert "predicted_labels_yes" in df.columns, "Missing prediction column: predicted_labels_yes"

y_true_list_all = df["Tags"].apply(parse_labels).tolist()
y_pred_list_all = df["predicted_labels_yes"].apply(parse_labels).tolist()

print("\nEmpty ground-truth rows:", sum(len(x) == 0 for x in y_true_list_all))
print("Empty prediction rows:", sum(len(x) == 0 for x in y_pred_list_all))

# --- Filter OUT excluded labels at the list level (optional but keeps things clean)
def drop_excluded(labels):
    return [t for t in labels if t not in EXCLUDE_LABELS]

y_true_list = [drop_excluded(x) for x in y_true_list_all]
y_pred_list = [drop_excluded(x) for x in y_pred_list_all]

# Binarize ONLY over evaluation labels
mlb = MultiLabelBinarizer(classes=EVAL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

print("\nEvaluating on labels (excluded: Not applicable, Seeking support):")
print(mlb.classes_)

# ====== 5) OVERALL METRICS ======
scores = {
    "micro_precision": precision_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_recall":    recall_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_f1":        f1_score(Y_true, Y_pred, average="micro", zero_division=0),

    "macro_precision": precision_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_recall":    recall_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_f1":        f1_score(Y_true, Y_pred, average="macro", zero_division=0),

    "samples_precision": precision_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_recall":    recall_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_f1":        f1_score(Y_true, Y_pred, average="samples", zero_division=0),
}

print("\nOverall Precision \\& Recall \\& F1 (EXCLUDING Not applicable + Seeking support)")
for k, v in scores.items():
    print(f"{k}: {v:.4f}")

# ====== 6) PER-LABEL METRICS ======
p, r, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)

per_label_df = pd.DataFrame({
    "label": mlb.classes_,
    "precision": p,
    "recall": r,
    "f1": f1,
    "support_true": support
}).sort_values("f1", ascending=False)

print("\nPer-label metrics (excluded labels removed):")
display(per_label_df)

print("\nClassification report (excluded labels removed):")
print(classification_report(Y_true, Y_pred, target_names=mlb.classes_, zero_division=0))


Loaded: valid_with_predictions_goss_v44.csv
Rows: 44
Columns: ['post_id', 'parent_id', 'subreddit', 'comment_id', 'parent_fullname', 'depth', 'comment_author', 'comment_body', 'comment_score', 'created_utc', 'permalink', 'is_post_author', 'title', 'author', 'body', 'prompt', 'Tags', 'is_ambivalent_sa_domain', 'op_author', 'last_comment_depth', 'last_comment_author', 'is_last_comment_by_op', 'llm_error', 'predicted_labels_yes', 'gate_used', 'Information support', 'Emotional support', 'Esteem support', 'Network support', 'Tangible assistance', 'Seeking support', 'Group interactions', 'Not applicable']

Empty ground-truth rows: 0
Empty prediction rows: 0

Evaluating on labels (excluded: Not applicable, Seeking support):
['Information support' 'Emotional support' 'Esteem support'
 'Network support' 'Tangible assistance' 'Group interactions']

Overall Precision \& Recall \& F1 (EXCLUDING Not applicable + Seeking support)
micro_precision: 0.8833
micro_recall: 0.7361
micro_f1: 0.8030
macro_pr

   label                precision  recall    f1        support_true
4  Tangible assistance  1.0        1.0       1.0       2
0  Information support  0.964286   0.931034  0.947368  29
1  Emotional support    0.866667   0.928571  0.896552  14
3  Network support      1.0        0.666667  0.8       3
5  Group interactions   0.833333   0.5       0.625     10
2  Esteem support       0.571429   0.285714  0.380952  14



Classification report (excluded labels removed):
                     precision    recall  f1-score   support

Information support       0.96      0.93      0.95        29
  Emotional support       0.87      0.93      0.90        14
     Esteem support       0.57      0.29      0.38        14
    Network support       1.00      0.67      0.80         3
Tangible assistance       1.00      1.00      1.00         2
 Group interactions       0.83      0.50      0.62        10

          micro avg       0.88      0.74      0.80        72
          macro avg       0.87      0.72      0.77        72
       weighted avg       0.85      0.74      0.78        72
        samples avg       0.70      0.64      0.65        72
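
With supports this small, the per-label ratios above are easier to read as raw counts. Because true positives, false positives and false negatives are integers, they can be recovered from precision, recall and support by rounding. The helper `counts_from_metrics` below is illustrative and not part of the notebook; it cannot recover the predicted count when precision is 0.

```python
def counts_from_metrics(precision, recall, support):
    """Recover integer TP/FP/FN from rounded precision, recall and true support."""
    tp = round(recall * support)  # recall = TP / support
    # precision = TP / predicted, so predicted = TP / precision
    predicted = round(tp / precision) if precision > 0 else 0
    return {"tp": tp, "fp": predicted - tp, "fn": support - tp}

# "Esteem support" row from the per-label table for
# valid_with_predictions_goss_v44.csv.
esteem = counts_from_metrics(precision=0.571429, recall=0.285714, support=14)
print(esteem)
```

For "Esteem support" this gives 4 true positives, 3 false positives and 10 false negatives, so the low F1 comes mostly from missed labels rather than from wrong ones.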



In [25]:
import json, re
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, precision_recall_fscore_support
)

# ====== 1) LOAD YOUR FILE ======
IN_PATH = "valid_with_predictions_4mini.csv"
df = pd.read_csv(IN_PATH)

print("Loaded:", IN_PATH)
print("Rows:", len(df))
print("Columns:", list(df.columns))

# ====== 2) LABEL SET ======
ALL_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]

EXCLUDE_LABELS = {"Not applicable", "Seeking support"}
EVAL_LABELS = [lab for lab in ALL_LABELS if lab not in EXCLUDE_LABELS]

# Case-insensitive lookup: lowercased label -> canonical label.
# Only leading/trailing whitespace is stripped; internal spacing must match.
_norm_map = {lab.lower().strip(): lab for lab in ALL_LABELS}

def normalize_label(t: str):
    """Return the canonical label for t, or None if t is blank or unknown."""
    key = str(t).strip().lower()
    if not key:
        return None
    return _norm_map.get(key)

# ====== 3) PARSE TAG STRINGS ======
def parse_labels(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    s = str(x).strip()
    if s == "" or s.lower() in {"nan", "none"}:
        return []

    labels = []

    # Try JSON list
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            labels = [str(t).strip() for t in obj]
        except Exception:
            try:
                obj = json.loads(re.sub(r"'", '"', s))
                labels = [str(t).strip() for t in obj]
            except Exception:
                labels = []

    # Fallback split
    if not labels:
        parts = re.split(r"[,\n;|]+", s)
        labels = []
        for t in parts:
            t = re.sub(r"^tags?\s*:\s*", "", t.strip(), flags=re.IGNORECASE)
            if t:
                labels.append(t)

    # Normalize + keep only known labels
    out = []
    for t in labels:
        canon = normalize_label(t)
        if canon is not None:
            out.append(canon)

    # de-dup while preserving order
    seen = set()
    out2 = []
    for t in out:
        if t not in seen:
            seen.add(t)
            out2.append(t)
    return out2

# ====== 4) BUILD y_true / y_pred ======
assert "Tags" in df.columns, "Missing ground-truth column: Tags"
assert "predicted_labels_yes" in df.columns, "Missing prediction column: predicted_labels_yes"

y_true_list_all = df["Tags"].apply(parse_labels).tolist()
y_pred_list_all = df["predicted_labels_yes"].apply(parse_labels).tolist()

print("\nEmpty ground-truth rows:", sum(len(x) == 0 for x in y_true_list_all))
print("Empty prediction rows:", sum(len(x) == 0 for x in y_pred_list_all))

# --- Filter OUT excluded labels at the list level (optional but keeps things clean)
def drop_excluded(labels):
    return [t for t in labels if t not in EXCLUDE_LABELS]

y_true_list = [drop_excluded(x) for x in y_true_list_all]
y_pred_list = [drop_excluded(x) for x in y_pred_list_all]

# Binarize ONLY over evaluation labels
mlb = MultiLabelBinarizer(classes=EVAL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

print("\nEvaluating on labels (excluded: Not applicable, Seeking support):")
print(mlb.classes_)

# ====== 5) OVERALL METRICS ======
scores = {
    "micro_precision": precision_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_recall":    recall_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_f1":        f1_score(Y_true, Y_pred, average="micro", zero_division=0),

    "macro_precision": precision_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_recall":    recall_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_f1":        f1_score(Y_true, Y_pred, average="macro", zero_division=0),

    "samples_precision": precision_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_recall":    recall_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_f1":        f1_score(Y_true, Y_pred, average="samples", zero_division=0),
}

print("\nOverall Precision \\& Recall \\& F1 (EXCLUDING Not applicable + Seeking support)")
for k, v in scores.items():
    print(f"{k}: {v:.4f}")

# ====== 6) PER-LABEL METRICS ======
p, r, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)

per_label_df = pd.DataFrame({
    "label": mlb.classes_,
    "precision": p,
    "recall": r,
    "f1": f1,
    "support_true": support
}).sort_values("f1", ascending=False)

print("\nPer-label metrics (excluded labels removed):")
display(per_label_df)

print("\nClassification report (excluded labels removed):")
print(classification_report(Y_true, Y_pred, target_names=mlb.classes_, zero_division=0))


Loaded: valid_with_predictions_4mini.csv
Rows: 86
Columns: ['post_id', 'parent_id', 'subreddit', 'comment_id', 'parent_fullname', 'depth', 'comment_author', 'comment_body', 'comment_score', 'created_utc', 'permalink', 'is_post_author', 'title', 'author', 'body', 'prompt', 'Tags', 'is_ambivalent_sa_domain', 'op_author', 'last_comment_depth', 'last_comment_author', 'is_last_comment_by_op', 'llm_error', 'predicted_labels_yes', 'gate_used', 'Information support', 'Emotional support', 'Esteem support', 'Network support', 'Tangible assistance', 'Seeking support', 'Group interactions', 'Not applicable']

Empty ground-truth rows: 0
Empty prediction rows: 2

Evaluating on labels (excluded: Not applicable, Seeking support):
['Information support' 'Emotional support' 'Esteem support'
 'Network support' 'Tangible assistance' 'Group interactions']

Overall Precision \& Recall \& F1 (EXCLUDING Not applicable + Seeking support)
micro_precision: 0.8333
micro_recall: 0.7634
micro_f1: 0.7968
macro_preci

   label                precision  recall    f1        support_true
4  Tangible assistance  1.0        1.0       1.0       2
1  Emotional support    0.857143   0.888889  0.872727  27
0  Information support  0.933333   0.777778  0.848485  54
3  Network support      0.75       0.75      0.75      4
5  Group interactions   0.695652   0.8       0.744186  20
2  Esteem support       0.722222   0.541667  0.619048  24



Classification report (excluded labels removed):
                     precision    recall  f1-score   support

Information support       0.93      0.78      0.85        54
  Emotional support       0.86      0.89      0.87        27
     Esteem support       0.72      0.54      0.62        24
    Network support       0.75      0.75      0.75         4
Tangible assistance       1.00      1.00      1.00         2
 Group interactions       0.70      0.80      0.74        20

          micro avg       0.83      0.76      0.80       131
          macro avg       0.83      0.79      0.81       131
       weighted avg       0.84      0.76      0.79       131
        samples avg       0.62      0.60      0.60       131
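
This run reports "Empty prediction rows: 2". An empty row can mean the prediction cell was blank, but it can also mean the cell held text that `normalize_label` did not recognise, because `parse_labels` silently drops unknown labels. The sketch below keeps the rejected tokens so the two cases can be told apart. `split_known_unknown` is a hypothetical helper, the input string is invented, and only the delimiter-split path is covered (not the JSON-list path).

```python
import re

ALL_LABELS = [
    "Information support", "Emotional support", "Esteem support",
    "Network support", "Tangible assistance", "Seeking support",
    "Group interactions", "Not applicable",
]
_norm_map = {lab.lower(): lab for lab in ALL_LABELS}

def split_known_unknown(raw):
    """Return (known, unknown) label lists from a delimited tag string."""
    known, unknown = [], []
    for part in re.split(r"[,\n;|]+", str(raw)):
        token = part.strip()
        if not token:
            continue
        canon = _norm_map.get(token.lower())
        if canon is not None:
            known.append(canon)
        else:
            unknown.append(token)  # keep the raw text for inspection
    return known, unknown

# Invented example: one valid label, one with a doubled space, one misspelled.
known, unknown = split_known_unknown(
    "Emotional support, Esteem  support; informational support"
)
print("known:  ", known)
print("unknown:", unknown)
```

Applying a helper like this to `predicted_labels_yes`, and checking the `llm_error` column for the same rows, would show whether the two empty rows are model failures or parsing losses.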



In [39]:
import json, re
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, precision_recall_fscore_support
)

# ====== 1) LOAD YOUR FILE ======
IN_PATH = "valid_with_predictions_4mini_v44.csv"
df = pd.read_csv(IN_PATH)

print("Loaded:", IN_PATH)
print("Rows:", len(df))
print("Columns:", list(df.columns))

# ====== 2) LABEL SET ======
ALL_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]

EXCLUDE_LABELS = {"Not applicable", "Seeking support"}
EVAL_LABELS = [lab for lab in ALL_LABELS if lab not in EXCLUDE_LABELS]

# Case-insensitive lookup: lowercased label -> canonical label.
# Only leading/trailing whitespace is stripped; internal spacing must match.
_norm_map = {lab.lower().strip(): lab for lab in ALL_LABELS}

def normalize_label(t: str):
    """Return the canonical label for t, or None if t is blank or unknown."""
    key = str(t).strip().lower()
    if not key:
        return None
    return _norm_map.get(key)

# ====== 3) PARSE TAG STRINGS ======
def parse_labels(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    s = str(x).strip()
    if s == "" or s.lower() in {"nan", "none"}:
        return []

    labels = []

    # Try JSON list
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            labels = [str(t).strip() for t in obj]
        except Exception:
            try:
                obj = json.loads(re.sub(r"'", '"', s))
                labels = [str(t).strip() for t in obj]
            except Exception:
                labels = []

    # Fallback split
    if not labels:
        parts = re.split(r"[,\n;|]+", s)
        labels = []
        for t in parts:
            t = re.sub(r"^tags?\s*:\s*", "", t.strip(), flags=re.IGNORECASE)
            if t:
                labels.append(t)

    # Normalize + keep only known labels
    out = []
    for t in labels:
        canon = normalize_label(t)
        if canon is not None:
            out.append(canon)

    # de-dup while preserving order
    seen = set()
    out2 = []
    for t in out:
        if t not in seen:
            seen.add(t)
            out2.append(t)
    return out2

# ====== 4) BUILD y_true / y_pred ======
assert "Tags" in df.columns, "Missing ground-truth column: Tags"
assert "predicted_labels_yes" in df.columns, "Missing prediction column: predicted_labels_yes"

y_true_list_all = df["Tags"].apply(parse_labels).tolist()
y_pred_list_all = df["predicted_labels_yes"].apply(parse_labels).tolist()

print("\nEmpty ground-truth rows:", sum(len(x) == 0 for x in y_true_list_all))
print("Empty prediction rows:", sum(len(x) == 0 for x in y_pred_list_all))

# --- Filter OUT excluded labels at the list level (optional but keeps things clean)
def drop_excluded(labels):
    return [t for t in labels if t not in EXCLUDE_LABELS]

y_true_list = [drop_excluded(x) for x in y_true_list_all]
y_pred_list = [drop_excluded(x) for x in y_pred_list_all]

# Binarize ONLY over evaluation labels
mlb = MultiLabelBinarizer(classes=EVAL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

print("\nEvaluating on labels (excluded: Not applicable, Seeking support):")
print(mlb.classes_)

# ====== 5) OVERALL METRICS ======
scores = {
    "micro_precision": precision_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_recall":    recall_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_f1":        f1_score(Y_true, Y_pred, average="micro", zero_division=0),

    "macro_precision": precision_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_recall":    recall_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_f1":        f1_score(Y_true, Y_pred, average="macro", zero_division=0),

    "samples_precision": precision_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_recall":    recall_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_f1":        f1_score(Y_true, Y_pred, average="samples", zero_division=0),
}

print("\nOverall Precision \\& Recall \\& F1 (EXCLUDING Not applicable + Seeking support)")
for k, v in scores.items():
    print(f"{k}: {v:.4f}")

# ====== 6) PER-LABEL METRICS ======
p, r, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)

per_label_df = pd.DataFrame({
    "label": mlb.classes_,
    "precision": p,
    "recall": r,
    "f1": f1,
    "support_true": support
}).sort_values("f1", ascending=False)

print("\nPer-label metrics (excluded labels removed):")
display(per_label_df)

print("\nClassification report (excluded labels removed):")
print(classification_report(Y_true, Y_pred, target_names=mlb.classes_, zero_division=0))


Loaded: valid_with_predictions_4mini_v44.csv
Rows: 44
Columns: ['post_id', 'parent_id', 'subreddit', 'comment_id', 'parent_fullname', 'depth', 'comment_author', 'comment_body', 'comment_score', 'created_utc', 'permalink', 'is_post_author', 'title', 'author', 'body', 'prompt', 'Tags', 'is_ambivalent_sa_domain', 'op_author', 'last_comment_depth', 'last_comment_author', 'is_last_comment_by_op', 'llm_error', 'predicted_labels_yes', 'gate_used', 'Information support', 'Emotional support', 'Esteem support', 'Network support', 'Tangible assistance', 'Seeking support', 'Group interactions', 'Not applicable']

Empty ground-truth rows: 0
Empty prediction rows: 0

Evaluating on labels (excluded: Not applicable, Seeking support):
['Information support' 'Emotional support' 'Esteem support'
 'Network support' 'Tangible assistance' 'Group interactions']

Overall Precision \& Recall \& F1 (EXCLUDING Not applicable + Seeking support)
micro_precision: 0.7941
micro_recall: 0.7500
micro_f1: 0.7714
macro_p

   label                precision  recall    f1        support_true
4  Tangible assistance  1.0        1.0       1.0       2
0  Information support  0.925926   0.862069  0.892857  29
3  Network support      1.0        0.666667  0.8       3
1  Emotional support    0.705882   0.857143  0.774194  14
5  Group interactions   0.7        0.7       0.7       10
2  Esteem support       0.6        0.428571  0.5       14



Classification report (excluded labels removed):
                     precision    recall  f1-score   support

Information support       0.93      0.86      0.89        29
  Emotional support       0.71      0.86      0.77        14
     Esteem support       0.60      0.43      0.50        14
    Network support       1.00      0.67      0.80         3
Tangible assistance       1.00      1.00      1.00         2
 Group interactions       0.70      0.70      0.70        10

          micro avg       0.79      0.75      0.77        72
          macro avg       0.82      0.75      0.78        72
       weighted avg       0.79      0.75      0.77        72
        samples avg       0.64      0.62      0.61        72
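
The micro-averaged scores printed by the evaluation cells so far can be collected into one table for comparison. The sketch below copies the five printed triples, checks that each F1 is consistent with its precision and recall (within rounding), and sorts by F1. The runs cover different splits and row counts (86 vs 44 rows, validation vs test), so the ordering is only indicative.

```python
# (micro_precision, micro_recall, micro_f1) copied from the printed outputs.
runs = {
    "valid_with_predictions_goss_v86.csv":  (0.9292, 0.8015, 0.8607),
    "test_with_predictions_goss_t44.csv":   (0.9400, 0.7966, 0.8624),
    "valid_with_predictions_goss_v44.csv":  (0.8833, 0.7361, 0.8030),
    "valid_with_predictions_4mini.csv":     (0.8333, 0.7634, 0.7968),
    "valid_with_predictions_4mini_v44.csv": (0.7941, 0.7500, 0.7714),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Sanity check: each copied F1 agrees with its P and R up to rounding error.
for name, (p, r, f1) in runs.items():
    assert abs(f1_from(p, r) - f1) < 5e-4, name

# Highest micro F1 first.
ranked = sorted(runs.items(), key=lambda kv: kv[1][2], reverse=True)
for name, (p, r, f1) in ranked:
    print(f"{f1:.4f}  P={p:.4f}  R={r:.4f}  {name}")
```

On matching splits, the `goss` files score higher than the `4mini` files on micro F1 (0.8607 vs 0.7968 on 86 rows, 0.8030 vs 0.7714 on the 44-row validation subset), mainly through higher precision.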



In [51]:
import json, re
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, precision_recall_fscore_support
)

# ====== 1) LOAD YOUR FILE ======
IN_PATH = "test_with_predictions_4mini_t44.csv"
df = pd.read_csv(IN_PATH)

print("Loaded:", IN_PATH)
print("Rows:", len(df))
print("Columns:", list(df.columns))

# ====== 2) LABEL SET ======
ALL_LABELS = [
    "Information support",
    "Emotional support",
    "Esteem support",
    "Network support",
    "Tangible assistance",
    "Seeking support",
    "Group interactions",
    "Not applicable",
]

EXCLUDE_LABELS = {"Not applicable", "Seeking support"}
EVAL_LABELS = [lab for lab in ALL_LABELS if lab not in EXCLUDE_LABELS]

# Case-insensitive lookup: lowercased label -> canonical label.
# Only leading/trailing whitespace is stripped; internal spacing must match.
_norm_map = {lab.lower().strip(): lab for lab in ALL_LABELS}

def normalize_label(t: str):
    """Return the canonical label for t, or None if t is blank or unknown."""
    key = str(t).strip().lower()
    if not key:
        return None
    return _norm_map.get(key)

# ====== 3) PARSE TAG STRINGS ======
def parse_labels(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    s = str(x).strip()
    if s == "" or s.lower() in {"nan", "none"}:
        return []

    labels = []

    # Try JSON list
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            labels = [str(t).strip() for t in obj]
        except Exception:
            try:
                obj = json.loads(re.sub(r"'", '"', s))
                labels = [str(t).strip() for t in obj]
            except Exception:
                labels = []

    # Fallback split
    if not labels:
        parts = re.split(r"[,\n;|]+", s)
        labels = []
        for t in parts:
            t = re.sub(r"^tags?\s*:\s*", "", t.strip(), flags=re.IGNORECASE)
            if t:
                labels.append(t)

    # Normalize + keep only known labels
    out = []
    for t in labels:
        canon = normalize_label(t)
        if canon is not None:
            out.append(canon)

    # de-dup while preserving order
    seen = set()
    out2 = []
    for t in out:
        if t not in seen:
            seen.add(t)
            out2.append(t)
    return out2

# ====== 4) BUILD y_true / y_pred ======
assert "Tags" in df.columns, "Missing ground-truth column: Tags"
assert "predicted_labels_yes" in df.columns, "Missing prediction column: predicted_labels_yes"

y_true_list_all = df["Tags"].apply(parse_labels).tolist()
y_pred_list_all = df["predicted_labels_yes"].apply(parse_labels).tolist()

print("\nEmpty ground-truth rows:", sum(len(x) == 0 for x in y_true_list_all))
print("Empty prediction rows:", sum(len(x) == 0 for x in y_pred_list_all))

# --- Filter OUT excluded labels at the list level (optional but keeps things clean)
def drop_excluded(labels):
    return [t for t in labels if t not in EXCLUDE_LABELS]

y_true_list = [drop_excluded(x) for x in y_true_list_all]
y_pred_list = [drop_excluded(x) for x in y_pred_list_all]

# Binarize ONLY over evaluation labels
mlb = MultiLabelBinarizer(classes=EVAL_LABELS)
Y_true = mlb.fit_transform(y_true_list)
Y_pred = mlb.transform(y_pred_list)

print("\nEvaluating on labels (excluded: Not applicable, Seeking support):")
print(mlb.classes_)

# ====== 5) OVERALL METRICS ======
scores = {
    "micro_precision": precision_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_recall":    recall_score(Y_true, Y_pred, average="micro", zero_division=0),
    "micro_f1":        f1_score(Y_true, Y_pred, average="micro", zero_division=0),

    "macro_precision": precision_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_recall":    recall_score(Y_true, Y_pred, average="macro", zero_division=0),
    "macro_f1":        f1_score(Y_true, Y_pred, average="macro", zero_division=0),

    "samples_precision": precision_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_recall":    recall_score(Y_true, Y_pred, average="samples", zero_division=0),
    "samples_f1":        f1_score(Y_true, Y_pred, average="samples", zero_division=0),
}

print("\nOverall Precision \\& Recall \\& F1 (EXCLUDING Not applicable + Seeking support)")
for k, v in scores.items():
    print(f"{k}: {v:.4f}")

# ====== 6) PER-LABEL METRICS ======
p, r, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)

per_label_df = pd.DataFrame({
    "label": mlb.classes_,
    "precision": p,
    "recall": r,
    "f1": f1,
    "support_true": support
}).sort_values("f1", ascending=False)

print("\nPer-label metrics (excluded labels removed):")
display(per_label_df)

print("\nClassification report (excluded labels removed):")
print(classification_report(Y_true, Y_pred, target_names=mlb.classes_, zero_division=0))


Loaded: test_with_predictions_4mini_t44.csv
Rows: 44
Columns: ['post_id', 'parent_id', 'subreddit', 'comment_id', 'parent_fullname', 'depth', 'comment_author', 'comment_body', 'comment_score', 'created_utc', 'permalink', 'is_post_author', 'title', 'author', 'body', 'prompt', 'Tags', 'is_ambivalent_sa_domain', 'op_author', 'last_comment_depth', 'last_comment_author', 'is_last_comment_by_op', 'llm_error', 'predicted_labels_yes', 'gate_used', 'Information support', 'Emotional support', 'Esteem support', 'Network support', 'Tangible assistance', 'Seeking support', 'Group interactions', 'Not applicable']

Empty ground-truth rows: 0
Empty prediction rows: 2
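
The empty-row counts above do not say whether a row was blank or held text that matched no known label, because `parse_labels` drops unknown tokens silently. The following is a minimal diagnostic sketch, not part of the original pipeline. `KNOWN_LABELS` is assumed to match the eight label columns listed in the loaded file, and `unknown_tokens` is a hypothetical helper that uses a simplified version of the splitting logic.

```python
import re

# Assumption: the eight label names that appear as columns in the predictions file.
KNOWN_LABELS = [
    "Information support", "Emotional support", "Esteem support",
    "Network support", "Tangible assistance", "Seeking support",
    "Group interactions", "Not applicable",
]
_known = {lab.lower() for lab in KNOWN_LABELS}

def unknown_tokens(raw):
    """Hypothetical helper: list the tokens in a raw tag string that match no known label."""
    if raw is None:
        return []
    s = str(raw).strip()
    if s.lower() in {"", "nan", "none"}:
        return []
    found = []
    # Simplified parsing: drop outer brackets, split on delimiters, strip quotes and a "Tags:" prefix
    for part in re.split(r"[,\n;|]+", s.strip("[]")):
        token = part.strip().strip("'\"")
        token = re.sub(r"^tags?\s*:\s*", "", token, flags=re.IGNORECASE)
        if token and token.lower() not in _known:
            found.append(token)
    return found

print(unknown_tokens("Emotional support, Informational support"))
print(unknown_tokens("['Esteem support', 'Group interactions']"))
```

Applying such a helper to `df["predicted_labels_yes"]` for the rows where `parse_labels` returns an empty list would show whether those rows came from blank model output or from misspelled labels.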

Evaluating on labels (excluded: Not applicable, Seeking support):
['Information support' 'Emotional support' 'Esteem support'
 'Network support' 'Tangible assistance' 'Group interactions']

Overall Precision \& Recall \& F1 (EXCLUDING Not applicable + Seeking support)
micro_precision: 0.7302
micro_recall: 0.7797
micro_f1: 0.7541
macro_pr

                 label  precision    recall        f1  support_true
3      Network support       1.00  1.000000  1.000000             1
1    Emotional support       0.75  0.923077  0.827586            13
0  Information support       0.90  0.720000  0.800000            25
2       Esteem support       0.70  0.700000  0.700000            10
5   Group interactions       0.50  0.800000  0.615385            10
4  Tangible assistance       0.00  0.000000  0.000000             0
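
`Tangible assistance` has no true instances in this file, so with `zero_division=0` it contributes an F1 of 0 to the macro average. Averaging the six per-label F1 values above gives about 0.66, while averaging only the five labels with nonzero support gives about 0.79. The sketch below shows one way to report both numbers. It uses made-up toy matrices rather than the notebook's data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy indicator matrices (made up): 3 labels, the last one never occurs in Y_true or Y_pred.
Y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 0]])
Y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0]])

p, r, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)

# Macro F1 over every label, including the one with zero support
macro_f1_all = f1.mean()

# Macro F1 restricted to labels that actually occur in the ground truth
has_support = support > 0
macro_f1_supported = f1[has_support].mean()

print(f"macro F1 (all labels):       {macro_f1_all:.4f}")
print(f"macro F1 (support > 0 only): {macro_f1_supported:.4f}")
```

Which figure to report depends on the question being asked. The all-label average penalizes a label the evaluation set cannot measure, while the restricted average hides that the label was never tested.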



Classification report (excluded labels removed):
                     precision    recall  f1-score   support

Information support       0.90      0.72      0.80        25
  Emotional support       0.75      0.92      0.83        13
     Esteem support       0.70      0.70      0.70        10
    Network support       1.00      1.00      1.00         1
Tangible assistance       0.00      0.00      0.00         0
 Group interactions       0.50      0.80      0.62        10

          micro avg       0.73      0.78      0.75        59
          macro avg       0.64      0.69      0.66        59
       weighted avg       0.77      0.78      0.76        59
        samples avg       0.52      0.53      0.52        59
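
The samples average (0.52) is well below the micro average (0.75). One possible contributor, not verified against this file, is rows that have no evaluation labels left once `Not applicable` and `Seeking support` are dropped. With `zero_division=0`, a row with no true and no predicted labels scores 0 in the per-sample average even though nothing was predicted wrongly. The toy example below, with made-up matrices, illustrates the effect.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy indicator matrices (made up): rows 3 and 4 have no labels at all,
# as can happen after excluded labels are removed.
Y_true = np.array([[1, 0],
                   [0, 1],
                   [0, 0],
                   [0, 0]])
Y_pred = np.array([[1, 0],
                   [0, 1],
                   [0, 0],
                   [0, 0]])

# Every non-empty row is predicted perfectly, so micro F1 is 1.0
micro_f1 = f1_score(Y_true, Y_pred, average="micro", zero_division=0)

# Empty rows each count as 0, pulling the per-sample mean down
samples_f1_all = f1_score(Y_true, Y_pred, average="samples", zero_division=0)

# Same metric over rows that have at least one true or predicted label
keep = (Y_true.sum(axis=1) + Y_pred.sum(axis=1)) > 0
samples_f1_nonempty = f1_score(Y_true[keep], Y_pred[keep], average="samples", zero_division=0)

print(micro_f1, samples_f1_all, samples_f1_nonempty)
```

Counting the rows where `keep` is false on the real `Y_true` and `Y_pred` would show how much of the gap this accounts for.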

