# 02_prepare_training_data

In this notebook we:
1. Load the merged dataset from `data/merged_dataverse`  
2. Compute a `topics_list` column for each post  
3. Filter out any posts with no topics  
4. Preview the resulting DataFrame  

This prepares our data for instruction–response pair generation.

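The steps listed above can be tried on a toy frame before running them on the full dataset. The sketch below is a vectorized variant, not the exact code used in this notebook: the helper name `add_topics_list`, the `skip` parameter, and the `topic_x` / `topic_y01` columns are hypothetical and chosen only for illustration. It assumes every column other than `id` and `text` holds a numeric topic score.

```python
import pandas as pd


def add_topics_list(df, skip=("id", "text")):
    """Return a copy of df with a topics_list column, keeping only rows that have at least one topic."""
    # Hypothetical helper: every column not in `skip` is treated as a topic score.
    topic_cols = [c for c in df.columns if c not in skip]
    # Coerce to numeric so stray non-numeric values become NaN (and NaN > 0 is False).
    numeric = df[topic_cols].apply(pd.to_numeric, errors="coerce")
    mask = numeric.gt(0).to_numpy()
    out = df.copy()
    out["topics_list"] = [
        [col for col, hit in zip(topic_cols, row) if hit] for row in mask
    ]
    # Drop posts with an empty topics_list.
    return out[out["topics_list"].map(len) > 0].reset_index(drop=True)


# Toy data with made-up column names.
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "text": ["post one", "post two", "post three"],
        "topic_x": [0.4, 0.0, 0.0],
        "topic_y01": [1, 0, 0],
    }
)
result = add_topics_list(toy)
print(result[["id", "topics_list"]])
```

Only post `a` survives the filter here, because it is the only row with a positive value in any topic column.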

In [1]:
# Cell 1: imports & load
from datasets import load_from_disk
import pandas as pd
from pathlib import Path

# 1) Load merged dataset
ROOT = Path("..")
ds = load_from_disk(ROOT / "data" / "merged_dataverse")
df = ds.to_pandas()

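# 2) Identify all topic columns (every column except the post id and text)
topic_cols = [c for c in df.columns if c not in ("id", "text")]

# 3) Build topics_list for each row: names of the topic columns with a positive value.
#    pd.api.types.is_number also accepts NumPy scalars such as np.int64,
#    which isinstance(v, (int, float)) would skip in an object-dtype row.
def active_topics(row):
    return [col for col, v in row.items() if pd.api.types.is_number(v) and v > 0]

df["topics_list"] = df[topic_cols].apply(active_topics, axis=1)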

# 4) Filter out posts with no topics
df = df[df["topics_list"].map(len) > 0].reset_index(drop=True)

# 5) Preview
print(f"Total training examples: {len(df)}")
df.head(5)


Total training examples: 69553


| | id | text | gpt4o_relation | gpt4o_protein | gpt4o_ed | gpt4o_exercise | gpt4o_crave | gpt4o_restrict | gpt4o_binge | gpt4o_loss | ... | human_binge01 | human_loss01 | human_gain01 | human_calorie01 | human_idealbody01 | human_bodyhate01 | human_feargain01 | human_fearfood01 | human_depressedmood01 | topics_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1003i3b | tw ana body dysmorphia describing body potenti... | 0.433868 | 0.0 | 0.159211 | 0.0 | 0.0 | 0.297401 | 0.0 | 0.774578 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [gpt4o_relation, gpt4o_ed, gpt4o_restrict, gpt... |
| 1 | 1004j21 | its been two weeks since i started strength tr... | 0.0 | 0.0 | 0.0 | 0.567893 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [gpt4o_exercise, gpt4o_exercise01, Llama-3.1-8... |
| 2 | 100cs6r | update 41523 hey folks soit turns out that i... | 0.0 | 0.0 | 0.0 | 0.334327 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [gpt4o_exercise, gpt4o_exercise01, Llama-3.1-8... |
| 3 | 100nlca | i used to feel bad about having a small facean... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [gpt4o_idealbody, gpt4o_bodyhate, gpt4o_idealb... |
| 4 | 100nn9c | hi we eat a lot of grilled chicken and i was w... | 0.0 | 0.140245 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [gpt4o_protein, gpt4o_protein01, Llama-3.1-8B-... |

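Beyond `head()`, a quick way to sanity-check a `topics_list` column is to count how often each topic appears and how many topics each post carries. The sketch below shows one way to do that with `Series.explode` and `value_counts`. The small `preview` frame and its topic names are made up for illustration; on the real data the same two lines would be applied to `df["topics_list"]`.

```python
import pandas as pd

# Made-up stand-in for the filtered DataFrame; only topics_list matters here.
preview = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3"],
        "topics_list": [["topic_a", "topic_b"], ["topic_a"], ["topic_c"]],
    }
)

# One row per (post, topic) pair, then count the occurrences of each topic.
topic_counts = preview["topics_list"].explode().value_counts()

# Number of topics attached to each post.
topics_per_post = preview["topics_list"].map(len)

print(topic_counts)
print(topics_per_post.describe())
```

A heavily skewed `topic_counts` or a very long tail in `topics_per_post` is worth noticing before generating instruction–response pairs, since it affects how balanced the resulting examples will be.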