<a href="https://colab.research.google.com/github/NigelWilliamUOP/vibe-coding/blob/main/Passport_bro_notebook_02_threads_structure.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 02 — Thread reconstruction & structural features
This notebook reconstructs submission-rooted threads, computes depth/parent features, and exports thread-level structural metrics.

**Inputs**: `raw.parquet` (from Notebook 01)

**Outputs**:
- `artefacts/thread_map.parquet`
- `artefacts/thread_metrics.parquet`
- `artefacts/author_first_seen.parquet`

All steps are programmatic (no manual qualitative work).


In [1]:
# --- Install deps (Colab-safe) ---
!pip -q install -U pyarrow tqdm

import sys, platform, json, math, hashlib
from pathlib import Path
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("pandas:", pd.__version__)
print("pyarrow:", __import__("pyarrow").__version__)


Python: 3.12.12
Platform: Linux-6.6.105+-x86_64-with-glibc2.35
pandas: 2.2.2
pyarrow: 22.0.0


## 1) Locate and load `raw.parquet`
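By default this looks for `/content/artefacts/raw.parquet` (written by Notebook 01), then falls back to `/content/raw.parquet`, `/content/data/raw.parquet`, and finally a recursive search under `/content`.
If you’re running from a fresh Colab session, upload `raw.parquet` or mount Drive, and set `RAW_PARQUET_PATH` manually if the file is not found.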

In [2]:
DEFAULT_PARQUET_CANDIDATES = [
    "/content/artefacts/raw.parquet",
    "/content/raw.parquet",
    "/content/data/raw.parquet",
]

ARTEFACT_DIR = Path("/content/artefacts")
ARTEFACT_DIR.mkdir(parents=True, exist_ok=True)

def find_raw_parquet(candidates=DEFAULT_PARQUET_CANDIDATES) -> Path:
    for p in candidates:
        if Path(p).exists():
            return Path(p)
    # Fallback: recursive search; sorted so the chosen file is deterministic
    hits = sorted(Path("/content").rglob("raw.parquet"))
    if hits:
        return hits[0]
    raise FileNotFoundError(
        "Could not find raw.parquet. Upload it (Files pane) or mount Drive and set RAW_PARQUET_PATH."
    )

RAW_PARQUET_PATH = find_raw_parquet()
RAW_PARQUET_PATH


PosixPath('/content/raw.parquet')

In [3]:
# Optional upload (uncomment if needed)
# from google.colab import files
# uploaded = files.upload()
# RAW_PARQUET_PATH = Path(next(iter(uploaded.keys())))
# RAW_PARQUET_PATH


In [4]:
df = pd.read_parquet(RAW_PARQUET_PATH, engine="pyarrow")
print("Rows:", len(df), "Cols:", df.shape[1])
df.head(3)


Rows: 76800 Cols: 36


| | id | date | author | title | text | comment_on | type | score | upvote_ratio | url | ... | date_dt | selftext | text_all | text_len | month | week | is_submission | is_comment | is_reply | author_hash |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1gt7gx8 | 11-17-2024 06:24:46 | | Dating in the West in 2024 | | | Submission | 6171 | 0.96 | https://i.redd.it/8vwl3xqxne1e1.jpeg | ... | 2024-11-17 06:24:46 | | Dating in the West in 2024 | 27 | 2024-11 | 2024-11-11/2024-11-17 | True | False | False | |
| 1 | 1i5zk4y | 01-20-2025 20:04:50 | IamDreamzzz | men with an asian wife seeing a latina up close | | | Submission | 4324 | 0.93 | https://i.redd.it/ph5kyu3lg7ee1.jpeg | ... | 2025-01-20 20:04:50 | | men with an asian wife seeing a latina up close | 48 | 2025-01 | 2025-01-20/2025-01-26 | True | False | False | c8febdf20a50dfcfe675d1542f69dbac5cd523e4d32848... |
| 2 | 1ktcez8 | 05-23-2025 06:07:41 | VdelaM | Interesting thing to think about | | | Submission | 3646 | 0.92 | https://i.redd.it/e8sfwemc3h2f1.jpeg | ... | 2025-05-23 06:07:41 | | Interesting thing to think about | 33 | 2025-05 | 2025-05-19/2025-05-25 | True | False | False | 260ea31d86958a47e0a69608522741ee3e681937de9e52... |


## 2) Minimal column checks and normalisation

In [5]:
required = ["id","comment_on","type","date_dt","author_hash"]
missing = [c for c in required if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# normalise string dtypes
for c in ["id","comment_on","type","author_hash","month","cluster_label","language"]:
    if c in df.columns:
        df[c] = df[c].astype("string")

# Ensure datetime
df["date_dt"] = pd.to_datetime(df["date_dt"], errors="coerce")

# Flags (use provided columns if present)
df["is_submission"] = df.get("is_submission", (df["type"].str.lower()=="submission")).fillna(False)
df["is_comment"] = df.get("is_comment", (df["type"].str.lower()=="comment")).fillna(False)
df["is_reply"] = df.get("is_reply", (df["type"].str.lower()=="reply")).fillna(False)

print("Submissions:", int(df["is_submission"].sum()),
      "Comments:", int(df["is_comment"].sum()),
      "Replies:", int(df["is_reply"].sum()))


Submissions: 989 Comments: 22904 Replies: 52907
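
The three counts above add up to the row total (989 + 22,904 + 52,907 = 76,800), which suggests that every row carries exactly one flag. A minimal sketch of how that could be checked directly, using a toy frame rather than the real data (the `toy` frame and its values are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for the real frame; values are illustrative only.
toy = pd.DataFrame({
    "id": ["s1", "c1", "r1", "r2"],
    "type": pd.array(["Submission", "Comment", "Reply", "Reply"], dtype="string"),
})

# Derive the flags the same way the cell above does when they are absent.
kind = toy["type"].str.lower()
toy["is_submission"] = (kind == "submission").fillna(False)
toy["is_comment"] = (kind == "comment").fillna(False)
toy["is_reply"] = (kind == "reply").fillna(False)

# Each row should carry exactly one of the three flags.
flag_cols = ["is_submission", "is_comment", "is_reply"]
flags_per_row = toy[flag_cols].astype(int).sum(axis=1)
n_bad = int((flags_per_row != 1).sum())
print("Rows with zero or multiple flags:", n_bad)
```

Running the same three lines of the check against `df` would flag rows with an unexpected or missing `type` before they reach the thread reconstruction step.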


## 3) Thread root and depth reconstruction
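We treat the **thread root** as the last ancestor reached by following the `comment_on` chain upward, and **depth** as the number of parent links between a row and that root, so submissions have depth 0.
In this dataset, the root should be the submission row, where `comment_on` is missing. If a parent ID does not appear in the data, the walk stops there and that ID is treated as the root.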

In [6]:
# Build parent lookup
parent_map = dict(zip(df["id"].tolist(), df["comment_on"].tolist()))

root_cache = {}
depth_cache = {}

def resolve_root_and_depth(node_id: str):
    # Returns (root_id, depth) where depth is number of parent-edges from node to root.
    if node_id in root_cache:
        return root_cache[node_id], depth_cache[node_id]

    cur = node_id
    stack = []
    while True:
        if cur in root_cache:
            root = root_cache[cur]
            depth_to_root = depth_cache[cur]
            break

        parent = parent_map.get(cur)
        if parent is None or pd.isna(parent):
            root = cur
            depth_to_root = 0
            root_cache[cur] = root
            depth_cache[cur] = 0
            break

        stack.append(cur)
        cur = parent

    # unwind stack: assign root and depth for each node encountered
    for n in reversed(stack):
        depth_to_root += 1
        root_cache[n] = root
        depth_cache[n] = depth_to_root

    return root_cache[node_id], depth_cache[node_id]

ids = df["id"].tolist()
roots, depths = [], []
for node_id in tqdm(ids, desc="Resolving roots/depths"):
    r, d = resolve_root_and_depth(node_id)
    roots.append(r)
    depths.append(d)

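# Pass the frame's index explicitly so values align by position, not by label
df["root_submission_id"] = pd.Series(roots, index=df.index, dtype="string")
df["depth"] = pd.Series(depths, index=df.index, dtype="Int64")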

# Sanity check: the number of unique roots should equal the number of submissions,
# unless some parent IDs are missing from the data
n_roots = df["root_submission_id"].nunique(dropna=True)
n_submissions = int(df["is_submission"].sum())
print("Unique roots:", n_roots, "| submissions:", n_submissions)



Unique roots: 989 | submissions: 989
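
To make the root and depth definitions concrete, here is a small uncached sketch on a toy parent map. It is not the notebook's implementation: the `resolve` helper and the `toy_parents` IDs are hypothetical, and the cycle guard is an added assumption for data where a `comment_on` chain might loop back on itself.

```python
from typing import Dict, Optional, Tuple

def resolve(node_id: str, parent_map: Dict[str, Optional[str]]) -> Tuple[str, int]:
    """Walk parent links upward from node_id and return (root_id, depth)."""
    seen = set()
    cur = node_id
    depth = 0
    while True:
        if cur in seen:
            # Hypothetical guard: a looping chain would otherwise never terminate.
            raise ValueError(f"Cycle detected in parent chain at {cur!r}")
        seen.add(cur)
        parent = parent_map.get(cur)
        if parent is None:
            # Either a true root, or an ID that is missing from the data.
            return cur, depth
        cur = parent
        depth += 1

# s1 is a submission; c1 -> r1 -> r2 is a reply chain; x1 points at a missing parent.
toy_parents = {"s1": None, "c1": "s1", "r1": "c1", "r2": "r1", "x1": "gone"}
for node in toy_parents:
    print(node, resolve(node, toy_parents))
```

Note that `x1` resolves to the missing ID `gone` at depth 1, which is why a later check for roots without a matching submission row is useful.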


## 4) Parent author hash and per-row thread map export

In [7]:
# Parent author hash via map (fast join)
id_to_author_hash = dict(zip(df["id"].tolist(), df["author_hash"].tolist()))
df["parent_author_hash"] = df["comment_on"].map(id_to_author_hash).astype("string")

thread_map_cols = [
    "id","comment_on","root_submission_id","depth",
    "type","is_submission","is_comment","is_reply",
    "date_dt","month","author_hash","parent_author_hash",
    "text_len","cluster_label","language"
]
thread_map_cols = [c for c in thread_map_cols if c in df.columns]

thread_map = df[thread_map_cols].copy()

THREAD_MAP_PATH = ARTEFACT_DIR / "thread_map.parquet"
thread_map.to_parquet(THREAD_MAP_PATH, engine="pyarrow", compression="snappy", index=False)

print("Wrote:", THREAD_MAP_PATH, "| rows:", len(thread_map), "cols:", thread_map.shape[1])
thread_map.head(3)


Wrote: /content/artefacts/thread_map.parquet | rows: 76800 cols: 15


| | id | comment_on | root_submission_id | depth | type | is_submission | is_comment | is_reply | date_dt | month | author_hash | parent_author_hash | text_len | cluster_label | language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1gt7gx8 | | 1gt7gx8 | 0 | Submission | True | False | False | 2024-11-17 06:24:46 | 2024-11 | | | 27 | 1: Outliers | en |
| 1 | 1i5zk4y | | 1i5zk4y | 0 | Submission | True | False | False | 2025-01-20 20:04:50 | 2025-01 | c8febdf20a50dfcfe675d1542f69dbac5cd523e4d32848... | | 48 | 1: Outliers | en |
| 2 | 1ktcez8 | | 1ktcez8 | 0 | Submission | True | False | False | 2025-05-23 06:07:41 | 2025-05 | 260ea31d86958a47e0a69608522741ee3e681937de9e52... | | 33 | 1: Outliers | en |


## 5) Author first-seen and newcomer share
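We define an author’s **first-seen thread** as the root of the thread containing their earliest-timestamped row in the dataset, with ties broken by row order. Rows without an `author_hash` are excluded. The per-thread newcomer share built on this definition is computed in section 6.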

In [8]:
df["_row_i"] = np.arange(len(df))

authors = df[df["author_hash"].notna()].copy()
authors = authors.sort_values(["date_dt","_row_i"])

first_seen = authors.groupby("author_hash", as_index=False).first()[["author_hash","date_dt","root_submission_id"]]
first_seen = first_seen.rename(columns={
    "date_dt": "author_first_seen_dt",
    "root_submission_id": "author_first_thread_root"
})

AUTHOR_FIRST_SEEN_PATH = ARTEFACT_DIR / "author_first_seen.parquet"
first_seen.to_parquet(AUTHOR_FIRST_SEEN_PATH, engine="pyarrow", compression="snappy", index=False)

print("Wrote:", AUTHOR_FIRST_SEEN_PATH, "| rows:", len(first_seen))
first_seen.head(3)


Wrote: /content/artefacts/author_first_seen.parquet | rows: 17009


| | author_hash | author_first_seen_dt | author_first_thread_root |
|---|---|---|---|
| 0 | 000363ce318a9d8dedbd9be0175e36a352a23160916cd4... | 2025-09-28 17:03:24 | 1nsotwx |
| 1 | 00036e2a949a04a87ba413e89abc5cd814caa363de8857... | 2024-06-23 02:52:06 | 1dljsft |
| 2 | 000377ca842920cdca7f09f7d6cd208c2326e1dc0f9538... | 2025-10-09 21:04:30 | 1o2eh4d |
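
The first-seen definition, and the newcomer share derived from it, can be traced by hand on a toy frame. This is a sketch under stated assumptions: the data is invented, and it uses `drop_duplicates` and `merge` rather than the `groupby(...).first()` and dict lookup used in the notebook.

```python
import pandas as pd

# Two toy threads (s1, s2); author "b" appears in both, one row has no author.
toy = pd.DataFrame({
    "id": ["s1", "c1", "c2", "s2", "c3", "c4"],
    "root_submission_id": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "author_hash": ["a", "b", "a", "b", "c", None],
    "date_dt": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 11:00", "2024-01-01 12:00",
        "2024-01-02 09:00", "2024-01-02 10:00", "2024-01-02 11:00",
    ]),
})

# Earliest row per author; a stable sort keeps row order for tied timestamps.
known = toy[toy["author_hash"].notna()].sort_values("date_dt", kind="stable")
first_seen = (
    known.drop_duplicates("author_hash", keep="first")
    [["author_hash", "date_dt", "root_submission_id"]]
    .rename(columns={
        "date_dt": "author_first_seen_dt",
        "root_submission_id": "author_first_thread_root",
    })
)

# One row per (thread, author), flagged when this thread is the author's first.
pairs = known.drop_duplicates(["root_submission_id", "author_hash"]).merge(
    first_seen[["author_hash", "author_first_thread_root"]],
    on="author_hash", how="left",
)
pairs["is_newcomer"] = pairs["author_first_thread_root"] == pairs["root_submission_id"]

# The mean of the flag over unique authors equals newcomers / unique authors.
newcomer_share = pairs.groupby("root_submission_id")["is_newcomer"].mean()
print(newcomer_share.to_dict())  # {'s1': 1.0, 's2': 0.5}
```

In thread `s2`, author `b` was already seen in `s1`, so only `c` counts as a newcomer. One consequence of this definition is that threads near the start of the observation window will tend to show shares close to 1, since almost every author is new to the dataset at that point.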


## 6) Thread-level structural metrics
These metrics are used in later notebooks for matching and as calibration targets for the agent-based model (ABM).

In [9]:
g = df.groupby("root_submission_id", dropna=False)

thread_size = g["id"].size().rename("thread_size")
max_depth = g["depth"].max().rename("max_depth")

start_dt = g["date_dt"].min().rename("thread_start_dt")
end_dt = g["date_dt"].max().rename("thread_end_dt")
duration_hours = ((end_dt - start_dt).dt.total_seconds() / 3600.0).rename("duration_hours")

n_sub = g["is_submission"].sum().rename("n_submissions")
n_com = g["is_comment"].sum().rename("n_comments")
n_rep = g["is_reply"].sum().rename("n_replies")

unique_authors = (
    df[df["author_hash"].notna()]
    .groupby("root_submission_id")["author_hash"]
    .nunique()
    .rename("unique_authors")
)

# Newcomer share: authors whose first thread root is this thread root
first_thread_map = dict(zip(first_seen["author_hash"].tolist(), first_seen["author_first_thread_root"].tolist()))
df["author_first_thread_root"] = df["author_hash"].map(first_thread_map).astype("string")

tmp = df[df["author_hash"].notna()].drop_duplicates(["root_submission_id","author_hash"])
newcomer_counts = (
    (tmp["author_first_thread_root"] == tmp["root_submission_id"])
    .groupby(tmp["root_submission_id"])
    .sum()
    .rename("newcomer_authors")
)
newcomer_share = (newcomer_counts / unique_authors).rename("newcomer_share")

# Participation entropy: Shannon entropy (natural log, so units are nats)
# over each author's share of the rows in a thread
def shannon_entropy(counts_arr: np.ndarray) -> float:
    total = counts_arr.sum()
    if total <= 0:
        return 0.0
    p = counts_arr / total
    p = p[p > 0]  # drop zero shares so log(0) cannot produce NaN
    return float(-(p * np.log(p)).sum())

counts = (
    df[df["author_hash"].notna()]
    .groupby(["root_submission_id","author_hash"])["id"]
    .size()
    .rename("n")
    .reset_index()
)

entropy = (
    counts.groupby("root_submission_id")["n"]
    .apply(lambda s: shannon_entropy(s.to_numpy()))
    .rename("participation_entropy")
)

thread_metrics = pd.concat([
    thread_size, max_depth, start_dt, end_dt, duration_hours,
    n_sub, n_com, n_rep,
    unique_authors, newcomer_share, entropy
], axis=1).reset_index()

thread_metrics.replace([np.inf, -np.inf], np.nan, inplace=True)

THREAD_METRICS_PATH = ARTEFACT_DIR / "thread_metrics.parquet"
thread_metrics.to_parquet(THREAD_METRICS_PATH, engine="pyarrow", compression="snappy", index=False)

print("Wrote:", THREAD_METRICS_PATH, "| rows:", len(thread_metrics), "cols:", thread_metrics.shape[1])
thread_metrics.head(5)


Wrote: /content/artefacts/thread_metrics.parquet | rows: 989 cols: 12


| | root_submission_id | thread_size | max_depth | thread_start_dt | thread_end_dt | duration_hours | n_submissions | n_comments | n_replies | unique_authors | newcomer_share | participation_entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 162huet | 301 | 30 | 2023-08-27 05:38:00 | 2025-10-12 17:47:41 | 18660.161389 | 1 | 56 | 244 | 87.0 | 1.0 | 3.534041 |
| 1 | 16wowgc | 185 | 11 | 2023-10-01 02:10:30 | 2025-11-26 14:38:41 | 18900.469722 | 1 | 42 | 142 | 94.0 | 0.957447 | 4.350199 |
| 2 | 17slf39 | 454 | 22 | 2023-11-11 03:09:24 | 2025-12-13 16:13:21 | 18325.065833 | 1 | 99 | 354 | 174.0 | 0.91954 | 4.613426 |
| 3 | 1880wns | 227 | 20 | 2023-12-01 02:45:39 | 2025-04-23 05:49:44 | 12219.068056 | 1 | 62 | 164 | 91.0 | 0.846154 | 4.201146 |
| 4 | 18oldj8 | 193 | 9 | 2023-12-22 18:21:02 | 2025-08-27 22:52:50 | 14740.53 | 1 | 51 | 141 | 83.0 | 0.831325 | 4.138507 |
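
For reading the `participation_entropy` column, it may help to see the measure on hand-picked counts. The sketch below is a pure-Python restatement of the entropy in nats, H = -Σ p_i ln p_i, where p_i is author i's share of the rows in a thread. The example counts are invented for illustration.

```python
import math
from typing import Sequence

def shannon_entropy(counts: Sequence[int]) -> float:
    """Shannon entropy in nats of the shares implied by non-negative counts."""
    total = sum(counts)
    if total <= 0:
        return 0.0
    shares = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in shares)

# Four authors with equal contributions: entropy reaches its maximum, ln(4).
even = shannon_entropy([5, 5, 5, 5])
# Same four authors and 20 rows, but one author dominates: entropy drops.
skewed = shannon_entropy([17, 1, 1, 1])

print("even:", round(even, 4), "| upper bound ln(4):", round(math.log(4), 4))
print("skewed:", round(skewed, 4))
```

Because the upper bound is the log of the number of authors, entropy values are only directly comparable between threads with similar `unique_authors`. For example, the first row above has 87 authors, so its ceiling is ln(87) ≈ 4.47.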


## 7) Quick sanity summary

In [11]:
# Roots without a matching submission row should be 0 for this dataset
roots_df = df.drop_duplicates("root_submission_id")[["root_submission_id"]].copy()
roots_df = roots_df.merge(
    df.loc[df["is_submission"], ["id","date_dt"]].rename(columns={"id":"root_submission_id","date_dt":"root_date_dt"}),
    on="root_submission_id", how="left"
)
missing_root_submissions = int(roots_df["root_date_dt"].isna().sum())
print("Roots without matching submission row:", missing_root_submissions, "out of", len(roots_df))

print("Thread size quantiles:")
print(thread_metrics["thread_size"].quantile([0.5,0.9,0.99]).to_string())

print("Max depth quantiles:")
print(thread_metrics["max_depth"].quantile([0.5,0.9,0.99]).to_string())

print("Unique author quantiles:")
print(thread_metrics["unique_authors"].quantile([0.5,0.9,0.99]).to_string())

Roots without matching submission row: 0 out of 989
Thread size quantiles:
0.50     25.00
0.90    202.00
0.99    673.48
Max depth quantiles:
0.50     5.0
0.90    13.0
0.99    27.0
Unique author quantiles:
0.50     14.00
0.90     98.60
0.99    289.26
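
One further check that could sit alongside this summary is that every non-root row is exactly one level deeper than its parent. A hedged sketch on toy data follows; the frame and its column values are illustrative, though the column names match `thread_map`.

```python
import pandas as pd

# Toy chain: submission s1, comment c1, then replies r1 and r2.
toy = pd.DataFrame({
    "id": ["s1", "c1", "r1", "r2"],
    "comment_on": [None, "s1", "c1", "r1"],
    "depth": [0, 1, 2, 3],
})

# Look up each row's parent depth by parent id.
id_to_depth = dict(zip(toy["id"], toy["depth"]))
children = toy[toy["comment_on"].notna()].copy()
children["parent_depth"] = children["comment_on"].map(id_to_depth)

# Parents absent from the data map to NaN and are counted separately.
orphans = int(children["parent_depth"].isna().sum())
linked = children.dropna(subset=["parent_depth"])
inconsistent = int((linked["depth"] != linked["parent_depth"] + 1).sum())
print("Orphaned parents:", orphans, "| inconsistent depths:", inconsistent)
```

Both counts would be expected to be 0 on a dataset where every root has a matching submission row, as reported above.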


## 8) Next notebook
Proceed to `03_network_authority.ipynb` to compute monthly authority metrics.