
# 04 — Boundary‑policing detection via weak supervision (Python‑only)
This notebook creates a **boundary‑policing** signal using **weak supervision** (labeling functions + label model) and a discriminative end model.

**Inputs**:
- `raw.parquet` (from Notebook 01)
- optional: `thread_map.parquet` (from Notebook 02) to attach `root_submission_id`, `depth`

**Outputs**:
- `artefacts/policing_labels.parquet` (id-level policing probability + hard label)
- `artefacts/lf_analysis.json` (coverage/conflict summary)
- `artefacts/policing_end_model.joblib` (trained end model pipeline)

Notes:
- No manual qualitative analysis (no hand labels). All labels are programmatic.
- The approach follows Snorkel’s weak supervision workflow: labeling functions → label model → discriminative model.
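
As a rough illustration of that workflow, the sketch below aggregates labeling-function votes with a plain majority vote. It is a stand-in for intuition only: Snorkel's `LabelModel` (used later in this notebook) estimates LF accuracies rather than counting votes, and the two toy LFs and the `majority_vote` helper are hypothetical names introduced here.

```python
import re

ABSTAIN, NON_POLICING, POLICING = -1, 0, 1

# Two toy labeling functions (hypothetical; the real ones are defined in section 4)
def lf_mentions_rules(text: str) -> int:
    return POLICING if re.search(r"\bagainst the rules\b", text) else ABSTAIN

def lf_mentions_visa(text: str) -> int:
    return NON_POLICING if re.search(r"\bvisa\b", text) else ABSTAIN

def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on a tie or no votes."""
    counts = {}
    for v in votes:
        if v != ABSTAIN:
            counts[v] = counts.get(v, 0) + 1
    if not counts:
        return ABSTAIN
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

texts = [
    "that is against the rules here",
    "how long does the visa take?",
    "nice photo",
]
toy_lfs = (lf_mentions_rules, lf_mentions_visa)
weak_labels = [majority_vote([lf(t) for lf in toy_lfs]) for t in texts]
print(weak_labels)
```

Texts on which no LF fires stay at `ABSTAIN`, so they carry no training signal of their own; the discriminative end model is what generalises to them.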


In [None]:
# --- Install deps (Colab-safe) ---
!pip -q install -U pyarrow tqdm snorkel scikit-learn joblib

import sys, platform, re, json, math
from pathlib import Path
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

from snorkel.labeling import LabelingFunction, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import joblib

print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("pandas:", pd.__version__)
print("snorkel:", __import__("snorkel").__version__)
print("sklearn:", __import__("sklearn").__version__)
print("pyarrow:", __import__("pyarrow").__version__)

Python: 3.12.12
Platform: Linux-6.6.105+-x86_64-with-glibc2.35
pandas: 2.2.2
snorkel: 0.10.0
sklearn: 1.8.0
pyarrow: 22.0.0


## 1) Locate inputs
Defaults:
- `/content/artefacts/raw.parquet`
- `/content/artefacts/thread_map.parquet` (optional)


In [None]:
ARTEFACT_DIR = Path("/content/artefacts")
ARTEFACT_DIR.mkdir(parents=True, exist_ok=True)

def find_file(name: str, candidates):
    for p in candidates:
        p = Path(p)
        if p.exists():
            return p
    hits = list(Path("/content").rglob(name))
    if hits:
        return hits[0]
    return None

RAW_PATH = find_file("raw.parquet", [
    "/content/artefacts/raw.parquet",
    "/content/raw.parquet",
    "/content/data/raw.parquet",
])

THREAD_MAP_PATH = find_file("thread_map.parquet", [
    "/content/artefacts/thread_map.parquet",
    "/content/thread_map.parquet",
    "/content/data/thread_map.parquet",
])

RAW_PATH, THREAD_MAP_PATH


(PosixPath('/content/raw.parquet'), PosixPath('/content/thread_map.parquet'))

In [None]:
# Optional upload (uncomment if needed)
# from google.colab import files
# uploaded = files.upload()
# list(uploaded.keys())


In [None]:
if RAW_PATH is None:
    raise FileNotFoundError("raw.parquet not found. Upload it or mount Drive, then set RAW_PATH.")

df = pd.read_parquet(RAW_PATH, engine="pyarrow")
print("Rows:", len(df), "Cols:", df.shape[1])
df[["id","type","language","month","text_all","text_len"]].head(3)


Rows: 76800 Cols: 36


Unnamed: 0,id,type,language,month,text_all,text_len
0,1gt7gx8,Submission,en,2024-11,Dating in the West in 2024,27
1,1i5zk4y,Submission,en,2025-01,men with an asian wife seeing a latina up close,48
2,1ktcez8,Submission,en,2025-05,Interesting thing to think about,33


## 2) Prepare modelling frame
Primary modelling set: English (`language=='en'`) and `text_len>=20`.
All interaction rows (comments and replies) are scored in the final output, but only rows meeting these criteria are used for training.

In [None]:
# Ensure dtypes
for c in ["id","type","language","month","author_hash"]:
    if c in df.columns:
        df[c] = df[c].astype("string")

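if "text_all" not in df.columns:
    # Fall back to title + body; fill missing parts first so a NaN does not blank the whole string
    empty = pd.Series("", index=df.index)
    df["text_all"] = df.get("title", empty).fillna("") + "\n" + df.get("text", empty).fillna("")
df["text_all"] = df["text_all"].fillna("").astype("string")
if "text_len" not in df.columns:
    df["text_len"] = df["text_all"].str.len()
df["text_len"] = pd.to_numeric(df["text_len"], errors="coerce").astype("Int64")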

# Keep only comments/replies for policing detection (submissions can be analysed later, but are noisier)
is_interaction = df["type"].str.lower().isin(["comment","reply"])
df_int = df[is_interaction].copy()

train_mask = (df_int["language"] == "en") & (df_int["text_len"].fillna(0) >= 20)
df_train = df_int[train_mask].copy()

print("Interactions:", len(df_int))
print("Training candidates (en & len>=20):", len(df_train))

Interactions: 75811
Training candidates (en & len>=20): 64455


## 3) Text normalisation helpers

In [None]:
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WS_RE  = re.compile(r"\s+")

def norm_text(s: str) -> str:
    if s is None:
        return ""
    s = str(s)
    s = s.lower()
    s = URL_RE.sub(" ", s)
    s = s.replace('\n', ' ')
    s = WS_RE.sub(" ", s).strip()
    return s

df_train["text_norm"] = df_train["text_all"].map(norm_text)
df_train[["text_all","text_norm"]].head(2)

Unnamed: 0,text_all,text_norm
101,What city do you live in? I'm in Thailand no...,what city do you live in? i'm in thailand now....
102,I used to date a Filipino girl. She was smart...,i used to date a filipino girl. she was smart....
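
To make the effect of the normalisation concrete, here is a condensed copy of the helper above applied to a made-up string (the sample text and URL are illustrative only):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
WS_RE = re.compile(r"\s+")

def norm_text(s) -> str:
    """Lowercase, drop URLs, and collapse all whitespace (including newlines)."""
    if s is None:
        return ""
    s = str(s).lower()
    s = URL_RE.sub(" ", s)
    return WS_RE.sub(" ", s).strip()

sample = "Check THIS out:\nhttps://example.com/page   Great   thread!"
normalised = norm_text(sample)
print(normalised)
```

Because the text is lowercased before the labeling functions run, the regex patterns in the next section only need lowercase forms.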


## 4) Weak supervision: labeling functions
The labeling functions are regex patterns intended as high-precision signals of boundary-policing (rule/moderation language, gatekeeping phrases), plus a few travel/logistics patterns that vote non-policing.
A label model aggregates these votes into probabilistic labels.

In [None]:
# Label space
ABSTAIN = -1
NON_POLICING = 0
POLICING = 1

# Regex helpers
def has_any(patterns, text):
    return any(re.search(p, text) for p in patterns)

# Positive LFs (policing)
POLICING_PATTERNS = {
    "mod_rule_enforcement": [
        r"\bmods?\b", r"\bmoderator\b", r"\breport\b", r"\bban(ned|s|)\b",
        r"\brules?\b", r"\bagainst the rules\b", r"\bsubreddit\b", r"\bthis sub\b",
        r"\bremoved\b", r"\blocked\b"
    ],
    "gatekeeping": [
        r"not what (this sub|passport bros) (is|means)",
        r"does(n'?t| not) belong here",
        r"off[- ]topic",
        r"keep (it|this) (on topic|civil|respectful)",
        r"take (this|it) elsewhere"
    ],
    "explicit_boundary": [
        r"no sex tourism",
        r"not sex tourism",
        r"not (about|for) (prostitut|escort|hookers?)",
        r"stop (calling|saying) (us|them) (sex tourists?|incels?)"
    ],
    "stop_posting": [
        r"stop posting",
        r"don'?t post (this|that)",
        r"do not post (this|that)"
    ]
}

# Negative LFs (non-policing) — travel/logistics questions are usually not policing
NON_POLICING_PATTERNS = {
    "travel_logistics": [
        r"\bvisa\b", r"\bflight\b", r"\bhotel\b", r"\bbudget\b", r"\bcurrency\b",
        r"\bexchange rate\b", r"\bairport\b", r"\bsim card\b", r"\bpassport\b",
        r"\btravel insurance\b"
    ],
    "country_recs": [
        r"which country",
        r"where should i go",
        r"recommend(ation|)s? for",
        r"best (country|place|city) for"
    ]
}

def make_lf(name, patterns, label):
    def f(x):
        t = x.text_norm
        if t and has_any(patterns, t):
            return label
        return ABSTAIN
    return LabelingFunction(name=name, f=f)

lfs = []
for name, pats in POLICING_PATTERNS.items():
    lfs.append(make_lf(f"lf_pos_{name}", pats, POLICING))
for name, pats in NON_POLICING_PATTERNS.items():
    lfs.append(make_lf(f"lf_neg_{name}", pats, NON_POLICING))

print("Labeling functions:", [lf.name for lf in lfs])


Labeling functions: ['lf_pos_mod_rule_enforcement', 'lf_pos_gatekeeping', 'lf_pos_explicit_boundary', 'lf_pos_stop_posting', 'lf_neg_travel_logistics', 'lf_neg_country_recs']
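
A quick way to sanity-check individual patterns is to run them on a handful of hand-written strings. The sketch below copies two patterns from `POLICING_PATTERNS`; the example sentences and the `vote` helper are hypothetical.

```python
import re

ABSTAIN, POLICING = -1, 1

# Two patterns copied from POLICING_PATTERNS above
patterns = [r"\breport\b", r"\bban(ned|s|)\b"]

def vote(text: str) -> int:
    """Mimic a positive LF: vote POLICING if any pattern matches, else abstain."""
    return POLICING if any(re.search(p, text) for p in patterns) else ABSTAIN

examples = [
    "report this to the mods",
    "he got banned last week",
    "i read a report on thai visa rules",  # single-word pattern fires on a benign use
    "urban life is expensive",             # \b keeps "ban" from matching inside "urban"
]
votes = [vote(t) for t in examples]
for t, v in zip(examples, votes):
    print(v, "|", t)
```

The third example suggests that single-word patterns such as `\breport\b` can fire on text that is not policing, so the precision of that LF may be lower than that of the phrase-level patterns.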


In [None]:
# Apply LFs
applier = PandasLFApplier(lfs=lfs)
L = applier.apply(df=df_train[["text_norm"]])

# LF diagnostics
analysis = LFAnalysis(L=L, lfs=lfs).lf_summary()
analysis




Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_pos_mod_rule_enforcement,0,[1],0.041719,0.003367,0.002839
lf_pos_gatekeeping,1,[1],0.000791,0.000496,1.6e-05
lf_pos_explicit_boundary,2,[1],0.000155,4.7e-05,0.0
lf_pos_stop_posting,3,[1],7.8e-05,0.0,0.0
lf_neg_travel_logistics,4,[0],0.046094,0.002855,0.002746
lf_neg_country_recs,5,[0],0.001334,0.000233,0.000124
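
The three diagnostic columns can be reproduced by hand on a small label matrix. The sketch below assumes the definitions used by `LFAnalysis` with its default normalisation (each value is a fraction of all rows): coverage counts rows where the LF votes, overlaps counts rows where it votes and at least one other LF also votes, and conflicts counts rows where another LF casts a different non-abstain vote. The toy matrix and the `lf_stats` helper are hypothetical.

```python
ABSTAIN = -1

# Toy label matrix: rows are examples, columns are LFs (hypothetical values)
L_toy = [
    [ 1, -1,  0],
    [ 1,  1, -1],
    [-1, -1, -1],
    [-1, -1,  0],
]

def lf_stats(L, j):
    """Coverage, overlaps and conflicts for LF column j, as fractions of all rows."""
    n = len(L)
    covered = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n

stats = [lf_stats(L_toy, j) for j in range(3)]
for j, s in enumerate(stats):
    print(j, s)
```

Read this way, the table above shows that each LF votes on at most about 5% of rows and that overlaps between LFs are rare, which limits how much agreement information the label model has to learn from.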


In [None]:
# Save LF analysis summary as JSON for the paper / prereg artefact trail
lf_summary_list = []
for idx, row in analysis.iterrows():
    # Construct a dictionary for each row, explicitly converting dtypes
    row_dict = {
        "lf_name": idx,
        "j": int(row["j"]),
        "Polarity": [int(p) for p in row["Polarity"]], # Ensure each element in Polarity is a standard int
        "Coverage": float(row["Coverage"]),
        "Overlaps": float(row["Overlaps"]),
        "Conflicts": float(row["Conflicts"])
    }
    lf_summary_list.append(row_dict)

lf_report = {
    "n_examples": int(L.shape[0]),
    "n_lfs": int(L.shape[1]),
    "lf_summary": lf_summary_list,
    # Note: these two are means of the per-LF columns, not dataset-level coverage/conflict rates
    "coverage": float(analysis["Coverage"].mean()) if "Coverage" in analysis.columns else None,
    "conflict": float(analysis["Conflicts"].mean()) if "Conflicts" in analysis.columns else None,
}

LF_JSON_PATH = ARTEFACT_DIR / "lf_analysis.json"
with LF_JSON_PATH.open("w", encoding="utf-8") as f:
    json.dump(lf_report, f, indent=2)

print("Wrote:", LF_JSON_PATH)

Wrote: /content/artefacts/lf_analysis.json


## 5) Train label model (probabilistic labels)
We use Snorkel’s `LabelModel` to combine LF outputs into class probabilities.

In [None]:
# Train/validation split (time-aware if date_dt exists; else random).
# L was built from df_train in its current row order, so df_train is not re-sorted here:
# re-sorting would misalign L (and the probabilities assigned below) with the rows.
df_train = df_train.copy()
n_rows = len(df_train)
if "date_dt" in df_train.columns:
    df_train["date_dt"] = pd.to_datetime(df_train["date_dt"], errors="coerce")
    # Row positions ordered by date (missing dates go last)
    order = df_train["date_dt"].reset_index(drop=True).sort_values(kind="stable").index.to_numpy()
    cut = int(0.8 * n_rows)
    train_pos, val_pos = order[:cut], order[cut:]
else:
    train_pos, val_pos = train_test_split(np.arange(n_rows), test_size=0.2, random_state=42)

# Row positions index both df_train (via .iloc) and L
L_train = L[train_pos, :]
L_val = L[val_pos, :]

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=42)

# Probabilities for all training-candidate rows
probs = label_model.predict_proba(L)
policing_prob_lm = probs[:, 1]

df_train["policing_prob_lm"] = policing_prob_lm
# Note: rows where no LF fired sit at exactly 0.5, so `>= 0.5` labels them 1 (policing)
df_train["policing_label_lm"] = (df_train["policing_prob_lm"] >= 0.5).astype(int)

df_train[["policing_prob_lm","policing_label_lm"]].describe()




Unnamed: 0,policing_prob_lm,policing_label_lm
count,64455.0,64455.0
mean,0.499513,0.952711
std,0.03624,0.212258
min,0.258309,0.0
25%,0.5,1.0
50%,0.5,1.0
75%,0.5,1.0
max,0.918202,1.0
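
The quartiles above are all exactly 0.5, which is consistent with most rows having no LF vote and therefore receiving an uninformative probability; thresholding at `>= 0.5` then assigns them label 1, which would explain a mean hard label near 0.95. One possible way to keep such rows separate is to abstain whenever every LF abstained. This is a minimal sketch with hypothetical toy values and a hypothetical `hard_label` helper, not the notebook's method.

```python
ABSTAIN = -1

# Toy LF votes and label-model probabilities (hypothetical values)
L_toy = [
    [ 1, -1],
    [-1, -1],   # no LF fired
    [-1,  0],
    [-1, -1],   # no LF fired
]
probs_toy = [0.92, 0.50, 0.26, 0.50]

def hard_label(votes, prob, threshold=0.5):
    """Return 1/0 for rows with at least one LF vote, ABSTAIN otherwise."""
    if all(v == ABSTAIN for v in votes):
        return ABSTAIN
    return int(prob >= threshold)

masked = [hard_label(v, p) for v, p in zip(L_toy, probs_toy)]
naive = [int(p >= 0.5) for p in probs_toy]
print("masked:", masked)
print("naive: ", naive)
```

Snorkel's `LabelModel.predict` also accepts a `tie_break_policy` argument, and its `"abstain"` option returns -1 for tied rows, which may be a simpler route to the same result.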


## 6) Train discriminative end model (TF‑IDF + logistic regression)
We train on **high-confidence** label-model predictions, then score all interactions.
TF‑IDF and Logistic Regression are standard sparse-text baselines.

In [None]:
# High-confidence pseudo-labels
hi_pos = df_train["policing_prob_lm"] >= 0.90
hi_neg = df_train["policing_prob_lm"] <= 0.30 # Relaxed threshold to include more negative samples
df_hi = df_train[hi_pos | hi_neg].copy()
df_hi["y"] = (df_hi["policing_prob_lm"] >= 0.5).astype(int)

print("High-confidence set:", len(df_hi), "of", len(df_train),
      f"({len(df_hi)/len(df_train):.1%})")
print("Class balance (y=1 policing):", df_hi["y"].mean())

# End model pipeline
end_model = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,2),
        min_df=5,
        max_df=0.95,
        strip_accents="unicode",
        sublinear_tf=True
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        solver="liblinear",
        class_weight="balanced",
        random_state=42
    ))
])

# Split hi-confidence for a sanity check (this is not a gold evaluation)
X_train, X_test, y_train, y_test = train_test_split(
    df_hi["text_norm"], df_hi["y"], test_size=0.2, random_state=42, stratify=df_hi["y"]
)

end_model.fit(X_train, y_train)
y_hat = end_model.predict(X_test)

print(classification_report(y_test, y_hat, digits=3))

High-confidence set: 38 of 64455 (0.1%)
Class balance (y=1 policing): 0.8157894736842105
              precision    recall  f1-score   support

           0      0.333     1.000     0.500         1
           1      1.000     0.714     0.833         7

    accuracy                          0.750         8
   macro avg      0.667     0.857     0.667         8
weighted avg      0.917     0.750     0.792         8
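
The selection step can be checked in isolation before fitting anything. The sketch below applies the same two thresholds (0.90 and 0.30) to a hypothetical list of probabilities; `select_high_confidence` is an illustrative helper, not part of the notebook.

```python
def select_high_confidence(probs, pos_min=0.90, neg_max=0.30):
    """Return (row_position, pseudo_label) pairs for confident rows only."""
    selected = []
    for i, p in enumerate(probs):
        if p >= pos_min:
            selected.append((i, 1))
        elif p <= neg_max:
            selected.append((i, 0))
    return selected

# Hypothetical probabilities: most rows sit at the uninformative 0.5
probs_toy = [0.95, 0.50, 0.28, 0.50, 0.91, 0.50]
picked = select_high_confidence(probs_toy)
print(picked)
print(f"kept {len(picked)} of {len(probs_toy)}")
```

In the run above only 38 of 64,455 rows pass these thresholds and the held-out split has 8 rows (one of them negative), so the classification report is best read as a smoke test rather than an estimate of model quality.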



In [None]:
# Score all interaction rows (comments+replies), including non-English/short text
df_int = df_int.copy()
df_int["text_norm"] = df_int["text_all"].map(norm_text)

# For very short texts, scoring is still possible, but may be noisy.
probs_end = end_model.predict_proba(df_int["text_norm"])[:, 1]
df_int["policing_prob"] = probs_end
df_int["policing_label"] = (df_int["policing_prob"] >= 0.5).astype(int)

df_int[["policing_prob","policing_label"]].describe()


Unnamed: 0,policing_prob,policing_label
count,75811.0,75811.0
mean,0.542332,0.616137
std,0.096986,0.486328
min,0.262511,0.0
25%,0.464402,0.0
50%,0.540611,1.0
75%,0.618629,1.0
max,0.796837,1.0


## 7) Optional: attach thread structure (root id, depth)
If `thread_map.parquet` is available, we attach `root_submission_id` and `depth` by `id`.

In [None]:
if THREAD_MAP_PATH is not None:
    tm = pd.read_parquet(THREAD_MAP_PATH, engine="pyarrow")[["id","root_submission_id","depth"]].copy()
    tm["id"] = tm["id"].astype("string")
    df_int["id"] = df_int["id"].astype("string")
    df_int = df_int.merge(tm, on="id", how="left")
    print("Attached thread_map fields.")
else:
    print("thread_map.parquet not found; skipping structure join.")


Attached thread_map fields.
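
The join is a left join on `id`: every interaction row is kept, and rows without a match get missing values for the thread fields. A plain-Python sketch of those semantics, using hypothetical records and assuming `id` is unique in the thread map:

```python
comments = [
    {"id": "a1", "policing_prob": 0.61},
    {"id": "b2", "policing_prob": 0.47},
]
thread_map = [
    {"id": "a1", "root_submission_id": "s9", "depth": 1},
]

# Index the right-hand table by id (assumes ids are unique there)
by_id = {row["id"]: row for row in thread_map}

joined = []
for c in comments:
    extra = by_id.get(c["id"], {})
    joined.append({
        **c,
        "root_submission_id": extra.get("root_submission_id"),
        "depth": extra.get("depth"),
    })

for row in joined:
    print(row)
```

If the thread map could contain duplicate ids, the pandas merge would duplicate interaction rows; passing `validate="many_to_one"` to `merge` makes pandas raise in that case instead of silently growing the frame.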


## 8) Write outputs

In [None]:
# Save policing labels (id-level, interaction rows only)
out_cols = ["id","policing_prob","policing_label","language","month","author_hash"]
for c in ["root_submission_id","depth","type","date_dt"]:
    if c in df_int.columns:
        out_cols.append(c)

out = df_int[out_cols].copy()

POLICING_PATH = ARTEFACT_DIR / "policing_labels.parquet"
out.to_parquet(POLICING_PATH, engine="pyarrow", compression="snappy", index=False)

MODEL_PATH = ARTEFACT_DIR / "policing_end_model.joblib"
joblib.dump(end_model, MODEL_PATH)

print("Wrote:", POLICING_PATH, "| rows:", len(out), "cols:", out.shape[1])
print("Wrote:", MODEL_PATH)
out.head(3)


Wrote: /content/artefacts/policing_labels.parquet | rows: 75811 cols: 10
Wrote: /content/artefacts/policing_end_model.joblib


Unnamed: 0,id,policing_prob,policing_label,language,month,author_hash,root_submission_id,depth,type,date_dt
0,jxxdu8k,0.533845,1,en,2023-08,ca23f2e34edef42d93e1ccd9daf7880eba000eb9189b83...,162huet,1,Comment,2023-08-27 06:12:24
1,jxzqker,0.643932,1,en,2023-08,dd2373e66fc78b88d3e3997c5e57c4d7f66f5f95a9cadb...,162huet,1,Comment,2023-08-27 18:50:33
2,jxzz0y8,0.70877,1,en,2023-08,f5061e7f9ef8b7780dcae567773a6e86d4e4723670c106...,162huet,1,Comment,2023-08-27 19:49:37


## 9) Next notebook
Proceed to `05_event_study_design.ipynb` to define policing event times per thread and create matched treated/control threads.