<a href="https://colab.research.google.com/github/Teek101/HPP_resource_code/blob/main/lung_cancer_sdoh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re, math, random, datetime as dt
from collections import defaultdict, Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.metrics import f1_score, precision_recall_fscore_support, average_precision_score

In [None]:
df = pd.read_csv('/content/drive/MyDrive/lung-cancer-uhcc/lung_cancer_data_feb15.csv')
df

Unnamed: 0,Type,Subreddit,Post_id,Title,Author,User_Type,Timestamp,Text,Score,Total_comments,Shares,Geolocation,Post_URL
0,Post,lungcancer,1gt87bj,He’s gone,AdLeft4868,General User,2024-11-17 07:16:45,"My beautiful, selfless & amazing dad passed aw...",130,33,,,https://www.reddit.com/r/lungcancer/comments/1...
1,Comment,lungcancer,1gt87bj,He’s gone,tinkertink2010,General User,2024-11-17 12:26:40,I’m so sorry for your loss. 55 is no age. F u ...,10,0,,,
2,Comment,lungcancer,1gt87bj,He’s gone,Lucky-Contribution50,General User,2024-11-17 07:39:45,I'm so sorry to hear about your dad. May he re...,6,0,,,
3,Comment,lungcancer,1gt87bj,He’s gone,EnvironmentalGood835,General User,2024-11-18 08:35:12,So sorry for your loss. Thoughts and prayers t...,4,0,,,
4,Comment,lungcancer,1gt87bj,He’s gone,Blueporch,General User,2024-11-17 13:39:06,I’m so sorry. \n\nI lost my dad to lung cancer...,3,0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
109863,Comment,stopsmoking,1gcbhz7,Yeah buddy!,dlkapt3,General User,2024-10-27 13:20:29,All I can say is that I was just completely do...,1,0,,,
109864,Comment,stopsmoking,1gcbhz7,Yeah buddy!,dlkapt3,General User,2024-10-27 13:00:03,Nope! There’s no way I’m resetting this winnin...,1,0,,,
109865,Comment,stopsmoking,1gcbhz7,Yeah buddy!,Bio_tomato,General User,2024-10-26 03:03:09,Thank you for answering and congratulations.\n...,5,0,,,
109866,Comment,stopsmoking,1gcbhz7,Yeah buddy!,Powerful_Setting1816,General User,2024-10-26 11:29:26,Understood. i just forget existence of somethi...,5,0,,,


In [None]:
updated_df = df.drop_duplicates(subset=['Text', 'Author'])    # drop rows where the same author posted the same text more than once
updated_df.shape

(105118, 13)

Individual - smoking history, mental health

Interpersonal - social support

Community/organizational - employment, housing, access delay

Policy/societal - cost/insurance, transportation
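
The grouping above can be written down as a small lookup table. This is only a sketch: the names `LEVELS` and `LEVEL_OF` are hypothetical and do not appear in the notebook, and the label strings assume the merged `Work_Access_Housing` category used in `SDOH_LABELS`.

```python
# Hypothetical lookup: level -> SDoH labels (names are illustrative, not from the notebook).
LEVELS = {
    "Individual": ["Smoking_History", "Mental_Health"],
    "Interpersonal": ["Social_Support"],
    "Community/organizational": ["Work_Access_Housing"],  # employment + housing + access delay
    "Policy/societal": ["Cost_Insurance", "Transportation"],
}

# Invert the table so each label can be traced back to its level.
LEVEL_OF = {label: level for level, labels in LEVELS.items() for label in labels}

print(LEVEL_OF["Transportation"])
```

Keeping the mapping in one place would make it easy to aggregate weak-label counts per level later on.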

In [None]:
SDOH_LABELS = [
    "Cost_Insurance", "Transportation", "Social_Support", "Smoking_History", "Mental_Health",
    "Work_Access_Housing",  # merged: Employment + Access_Delay + Housing
]

LABEL2ID = {l:i for i,l in enumerate(SDOH_LABELS)}
ID2LABEL = {i:l for l,i in LABEL2ID.items()}

In [None]:
HYPOTHESES = {
    "Cost_Insurance": [
        "The author is having trouble paying for treatment.",
        "The author is facing insurance or coverage problems.",
        "The cost of care is a barrier for the author."
    ],
    "Transportation": [
        "The author lacks transportation to appointments.",
        "The author has difficulties traveling to receive care.",
        "Transport problems are preventing the author from getting treatment."
    ],
    "Employment": [
        "The author's job or income is at risk due to cancer care.",
        "The author is worried about work or employment because of treatment.",
        "The author cannot work or lost work time due to their condition."
    ],
    "Social_Support": [
        "The author feels socially isolated.",
        "The author reports limited family or community support.",
        "The author lacks social support while managing treatment."
    ],
    "Access_Delay": [
        "The author is experiencing delays or barriers to accessing care.",
        "The author cannot get timely appointments or approvals.",
        "The author is waiting for access to needed medical services."
    ],
    "Housing": [
        "The author is facing housing instability.",
        "The author has problems related to housing or where they live.",
        "Housing insecurity is affecting the author's care."
    ],
    "Smoking_History": [
        "The author has a history of smoking.",
        "The author currently smokes cigarettes or recently quit.",
        "The author's smoking history is relevant to their care."
    ],
    "Mental_Health": [
        "The author is struggling with mental health or distress.",
        "The author expresses anxiety, depression, or high stress.",
        "The author feels emotionally overwhelmed."
    ],
}

In [None]:
data = updated_df.copy()

data['norm_text'] = data['Text'].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()   # normalize text
data = data[data['norm_text'].str.len() >= 10].copy()

data['Timestamp'] = pd.to_datetime(data['Timestamp'], errors='coerce')

data['norm_title'] = data['Title'].fillna("").astype(str)   # fill missing titles first; astype(str) alone would turn NaN into the string "nan"
data['text_all'] = (data['norm_title'] + " " + data['norm_text']).str.lower().str.replace(r"\s+", " ", regex=True).str.strip()  # Lowercased composite for rules

Employment, access delay, and housing each had very few positive labels, so they are combined under a single label (Work_Access_Housing).
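
`HYPOTHESES` still lists `Employment`, `Access_Delay`, and `Housing` separately, while `SDOH_LABELS` uses the merged name. One possible way to bring the two in line is sketched below. `MERGE_MAP` and `merge_label_groups` are hypothetical names introduced here for illustration, and the toy input reuses one hypothesis per category rather than the full dictionary.

```python
# Hypothetical mapping from the fine-grained categories to the merged label.
MERGE_MAP = {
    "Employment": "Work_Access_Housing",
    "Access_Delay": "Work_Access_Housing",
    "Housing": "Work_Access_Housing",
}

def merge_label_groups(groups, merge_map):
    """Return a new dict keyed by merged label, with the value lists concatenated."""
    merged = {}
    for label, items in groups.items():
        target = merge_map.get(label, label)  # labels not in merge_map keep their name
        merged.setdefault(target, []).extend(items)
    return merged

# Toy input: one hypothesis per category, copied from HYPOTHESES.
toy = {
    "Employment": ["The author cannot work or lost work time due to their condition."],
    "Access_Delay": ["The author cannot get timely appointments or approvals."],
    "Housing": ["The author is facing housing instability."],
    "Mental_Health": ["The author feels emotionally overwhelmed."],
}
merged = merge_label_groups(toy, MERGE_MAP)
print(sorted(merged))
```

The same helper could be applied to the full `HYPOTHESES` dictionary so that its keys match `SDOH_LABELS`.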

In [None]:
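# Weak supervision: helpers shared by the labeling functions (LFs)

def any_kw(t, kws):
    """True if any keyword in `kws` occurs as a substring of the space-padded text."""
    tpad = f" {t} "
    return any(kw in tpad for kw in kws)

def has(pattern, t):
    """True if the regex `pattern` matches anywhere in `t`."""
    return re.search(pattern, t) is not None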


# Shared context for a couple of LFs
CARE_TERMS  = r"(appointment|clinic|hospital|chemo|infusion|radiation|scan|pet|ct|mri|biopsy|oncology|doctor|gp|pcp|medication|drug|rx|prescription|treatment|imaging)"
BARRIER_CUES  = r"(no|can't|cannot|without|miss(ed|ing)?|too\s+far|hard\s+to|unable|denied|delay(ed)?)"
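
`CARE_TERMS` has no word-boundary anchors, so its short tokens (`pet`, `ct`, `gp`, `rx`) can also match inside unrelated words. The sketch below illustrates the difference on a toy sentence. `LOOSE` and `STRICT` are hypothetical, shortened patterns written for this example, and whether substring matching is intended here is not stated in the notebook.

```python
import re

# Shortened copies of the short care tokens, for illustration only.
LOOSE = r"(pet|ct|gp|rx)"           # substring match, as in CARE_TERMS
STRICT = r"\b(?:pet|ct|gp|rx)\b"    # hypothetical variant with word boundaries

def has(pattern, t):
    return re.search(pattern, t) is not None

unrelated = "in fact the new carpet was expensive"
clinical = "my ct scan is tomorrow"

# "fact" contains "ct" and "carpet" contains "pet", so only LOOSE fires here.
print(has(LOOSE, unrelated), has(STRICT, unrelated))
# A genuine clinical mention is caught by both patterns.
print(has(LOOSE, clinical), has(STRICT, clinical))
```

Adding boundaries would lower the positive rate of any LF that relies on `CARE_TERMS`, so the reported coverage numbers would need to be recomputed after such a change.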

In [None]:
def lf_cost_insurance_sdoh(t):
    # broadened for finance / insurance semantics
    # Insurance-specific tokens are strong enough alone
    insurance_kw = r"\b(insurance|coverage|copay|co-?pay|coinsurance|deductible|prior\s?auth(oriz(ation)?)?|" \
                   r"authorization|preauth|claim(s)?\s+denied|coverage\s+denial|in-?network|out-?of-?network|oop\s?max|max\s+out-?of-?pocket|premium)\b"
    if has(insurance_kw, t):
        return True

    # Otherwise require BOTH a cost word AND a healthcare context.
    # Important: \b around bill(s) so it won't match "billions".
    # Caveat: because of the leading \b, the \$ branch only matches when "$" directly follows a word character.
    cost_words  = r"\b(cost|costs|pay|paid|afford|expensive|bill|bills|billing|\$\s?\d+|\d+\s*(k|grand)\b)\b"
    care_ctx    = CARE_TERMS

    return has(cost_words, t) and has(care_ctx, t)

def lf_transportation_sdoh(t):
    # Strong explicit phrases that nearly always indicate transport barrier
    strong = r"(no\s+car|need\s+a\s+ride|no\s+ride|missed\s+appointment|missed\s+chemo|no\s+bus|public\s+transport|" \
             r"too\s+far\s+to\s+(travel|drive)|bus\s+route|can't\s+get\s+to\s+(the\s+)?(hospital|appointment|chemo|clinic))"
    if has(strong, t):
        return True

    transport = r"(ride|bus|train|uber|lyft|car|transport)"
    appt_ctx  = r"(appointment|chemo|infusion|scan|clinic|hospital|oncology|radiation)"
    # Requires barrier and transport within 6 words and appointment/visit context
    prox = rf"({BARRIER_CUES}\W+(?:\w+\W+){{0,6}}{transport}|{transport}\W+(?:\w+\W+){{0,6}}{BARRIER_CUES})"
    return has(prox, t) and has(appt_ctx, t)


def lf_social_support_sdoh(t):    # loneliness / lack of support / seeking help
    kw = r"(alone|lonely|isolated|no\s+one\s+to\s+help|lack\s+of\s+support|nobody\s+around|no\s+family|no\s+friends|" \
         r"support\s+group|caregiver\s+support|peer\s+support|grief\s+group)"

    return has(kw, t)


def lf_smoking_history_sdoh(t):   # self-report or pack-year jargon
    self_report = r"(\bi\s+(used\s+to\s+)?smoke\b|\bi\s+quit\b|\bstill\s+smoke\b|\bi\s+am\s+a\s+smoker\b|" \
                  r"\bformer\s+smoker\b|\bex[-\s]?smoker\b)"
    jargon = r"(pack[-\s]?year(s)?|ppd|\b\d+\s*cig(arettes)?\s+per\s+day\b)"

    return has(self_report, t) or has(jargon, t)

def lf_mental_health_sdoh(t):
    # anxiety/depression/insomnia + anxiolytics (medication)
    kw = r"(anxiety|anxious|panic\s+attack|depress(ed|ion)|overwhelmed|can't\s+cope|stressed\s+out|" \
         r"insomnia|can't\s+sleep|lorazepam|ativan|fear|terrified|scanxiety|dread|worry|scared)"

    return has(kw, t)


def lf_work_access_housing_sdoh(t):
    # Merge: Employment + Access_Delay + Housing
    employment = (
        r"(lost\s+(my\s+)?job\b|on\s+unpaid\s+leave|reduced\s+hours\b|miss(ed|ing)\s+work\b|can't\s+work\b|cannot\s+work\b|"
        r"on\s+leave\b|short-?\s*term\s+disability\b|long-?\s*term\s+disability\b|"
        r"\bi\s+(got\s+)?fired\b|\b(was|were)\s+fired\b|\bfired\b.*\b(job|work|employ(er|ment))|"
        r"\blaid\s+off\b|\blaid\s+off\b.*\bfrom\b)"
    )
    access = r"(appointment\s+delay|wait\s*list|waitlist|backlogged|no\s+availability|no\s+slots|can't\s+get\s+appointment|" \
             r"waiting\s+for\s+(approval|authorization|prior\s?auth|referral)|referral\s+delay|authorization\s+delay|" \
             r"approval\s+pending|wait(ing)?\s+\d+\s*(day|days|week|weeks|month|months))"
    housing = r"(evict(ed|ion)?|foreclos(ed|ure)|\bhomeless\b|shelter|couch\s+surf(ing)?|can't\s+pay\s+(rent|mortgage)|" \
              r"behind\s+on\s+(rent|mortgage)|move\s+out|lost\s+(my\s+)?home|rent\s+increase)"
    return has(employment, t) or has(access, t) or has(housing, t)
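
The proximity rule in `lf_transportation_sdoh` (a barrier cue and a transport word within 6 words of each other) is easier to see in isolation. The sketch below keeps only one direction of the rule (barrier first, then transport) and uses shortened, hypothetical copies of the patterns. `BARRIER`, `TRANSPORT`, `WINDOW`, and `barrier_near_transport` are names introduced for this example.

```python
import re

# Shortened copies of the notebook's patterns, for illustration only.
BARRIER = r"(no|can't|cannot|without|unable)"
TRANSPORT = r"(ride|bus|train|uber|lyft|car|transport)"

# Barrier cue, then at most 6 intervening words, then a transport word.
WINDOW = rf"{BARRIER}\W+(?:\w+\W+){{0,6}}{TRANSPORT}"

def barrier_near_transport(t):
    return re.search(WINDOW, t) is not None

near = "i cannot find a ride to my infusion"
far = "i cannot sleep at night and my brother just bought a very nice new car"

print(barrier_near_transport(near))  # "cannot" and "ride" are 2 words apart
print(barrier_near_transport(far))   # "cannot" and "car" are 12 words apart
```

The doubled braces in `{{0,6}}` are needed because the pattern is an f-string; they produce the literal quantifier `{0,6}` in the final regex.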

In [None]:
LF_MAP_SDOH8 = {
    "Cost_Insurance":   lf_cost_insurance_sdoh,
    "Transportation":   lf_transportation_sdoh,
    "Social_Support":   lf_social_support_sdoh,
    "Smoking_History":  lf_smoking_history_sdoh,
    "Mental_Health":    lf_mental_health_sdoh,
    "Work_Access_Housing":   lf_work_access_housing_sdoh,
}


In [None]:
def apply_lfs_sdoh8(df):
    """Adds WS_<Label> columns aligned with SDOH_LABELS."""
    df = df.copy()
    for lab, fn in LF_MAP_SDOH8.items():
        df[f"WS_{lab}"] = df["text_all"].apply(lambda t: int(fn(t)))
    # quick coverage print => how many posts were marked positive for each SDoH category
    rates = {lab: float(df[f"WS_{lab}"].mean()) for lab in SDOH_LABELS}
    print("[LF coverage] positive rates:", {k: round(v, 4) for k,v in rates.items()})
    return df

In [None]:
data = apply_lfs_sdoh8(data)  # apply the weak labels and inspect the result
data

[LF coverage] positive rates: {'Cost_Insurance': 0.0167, 'Transportation': 0.0023, 'Social_Support': 0.043, 'Smoking_History': 0.0315, 'Mental_Health': 0.0713, 'Work_Access_Housing': 0.0032}


Unnamed: 0,Type,Subreddit,Post_id,Title,Author,User_Type,Timestamp,Text,Score,Total_comments,...,Post_URL,norm_text,norm_title,text_all,WS_Cost_Insurance,WS_Transportation,WS_Social_Support,WS_Smoking_History,WS_Mental_Health,WS_Work_Access_Housing
0,Post,lungcancer,1gt87bj,He’s gone,AdLeft4868,General User,2024-11-17 07:16:45,"My beautiful, selfless & amazing dad passed aw...",130,33,...,https://www.reddit.com/r/lungcancer/comments/1...,"My beautiful, selfless & amazing dad passed aw...",He’s gone,"he’s gone my beautiful, selfless & amazing dad...",0,0,0,0,0,0
1,Comment,lungcancer,1gt87bj,He’s gone,tinkertink2010,General User,2024-11-17 12:26:40,I’m so sorry for your loss. 55 is no age. F u ...,10,0,...,,I’m so sorry for your loss. 55 is no age. F u ...,He’s gone,he’s gone i’m so sorry for your loss. 55 is no...,0,0,0,0,0,0
2,Comment,lungcancer,1gt87bj,He’s gone,Lucky-Contribution50,General User,2024-11-17 07:39:45,I'm so sorry to hear about your dad. May he re...,6,0,...,,I'm so sorry to hear about your dad. May he re...,He’s gone,he’s gone i'm so sorry to hear about your dad....,0,0,0,0,0,0
3,Comment,lungcancer,1gt87bj,He’s gone,EnvironmentalGood835,General User,2024-11-18 08:35:12,So sorry for your loss. Thoughts and prayers t...,4,0,...,,So sorry for your loss. Thoughts and prayers t...,He’s gone,he’s gone so sorry for your loss. thoughts and...,0,0,0,0,0,0
4,Comment,lungcancer,1gt87bj,He’s gone,Blueporch,General User,2024-11-17 13:39:06,I’m so sorry. \n\nI lost my dad to lung cancer...,3,0,...,,I’m so sorry. I lost my dad to lung cancer too...,He’s gone,he’s gone i’m so sorry. i lost my dad to lung ...,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109863,Comment,stopsmoking,1gcbhz7,Yeah buddy!,dlkapt3,General User,2024-10-27 13:20:29,All I can say is that I was just completely do...,1,0,...,,All I can say is that I was just completely do...,Yeah buddy!,yeah buddy! all i can say is that i was just c...,0,0,0,0,0,0
109864,Comment,stopsmoking,1gcbhz7,Yeah buddy!,dlkapt3,General User,2024-10-27 13:00:03,Nope! There’s no way I’m resetting this winnin...,1,0,...,,Nope! There’s no way I’m resetting this winnin...,Yeah buddy!,yeah buddy! nope! there’s no way i’m resetting...,0,0,0,0,0,0
109865,Comment,stopsmoking,1gcbhz7,Yeah buddy!,Bio_tomato,General User,2024-10-26 03:03:09,Thank you for answering and congratulations.\n...,5,0,...,,Thank you for answering and congratulations. 1...,Yeah buddy!,yeah buddy! thank you for answering and congra...,0,0,0,1,0,0
109866,Comment,stopsmoking,1gcbhz7,Yeah buddy!,Powerful_Setting1816,General User,2024-10-26 11:29:26,Understood. i just forget existence of somethi...,5,0,...,,Understood. i just forget existence of somethi...,Yeah buddy!,yeah buddy! understood. i just forget existenc...,0,0,0,1,0,0
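
The per-label positive rates printed by `apply_lfs_sdoh8` are computed independently, so they do not show how many posts received no weak label at all or how often two LFs fire on the same post. A minimal sketch of both checks follows, run on a toy frame rather than the real data. The variable names `toy`, `ws_cols`, `any_label_rate`, and `cooc` are assumptions made for this example.

```python
import pandas as pd

# Toy weak-label matrix; column names follow the WS_<Label> convention.
toy = pd.DataFrame({
    "WS_Cost_Insurance": [1, 0, 0, 1],
    "WS_Mental_Health":  [1, 1, 0, 0],
    "WS_Transportation": [0, 0, 0, 0],
})
ws_cols = [c for c in toy.columns if c.startswith("WS_")]

# Share of rows that received at least one positive weak label.
any_label_rate = (toy[ws_cols].sum(axis=1) > 0).mean()

# Co-occurrence counts: entry (a, b) is the number of rows where both LFs fired.
cooc = toy[ws_cols].T.dot(toy[ws_cols])

print(any_label_rate)
print(cooc)
```

On the real `data` frame the same two lines could be run with `ws_cols = [f"WS_{l}" for l in SDOH_LABELS]`.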


In [None]:
data.columns[-6:]

Index(['WS_Cost_Insurance', 'WS_Transportation', 'WS_Social_Support',
       'WS_Smoking_History', 'WS_Mental_Health', 'WS_Work_Access_Housing'],
      dtype='object')

In [None]:
data['WS_Cost_Insurance'].value_counts()

WS_Cost_Insurance,count
0,99576
1,1690


In [None]:
data[data['WS_Cost_Insurance']==1].shape

(1690, 22)

In [None]:
[data[c].value_counts() for c in data.columns[-6:]]

[WS_Cost_Insurance
 0    99576
 1     1690
 Name: count, dtype: int64,
 WS_Transportation
 0    101036
 1       230
 Name: count, dtype: int64,
 WS_Social_Support
 0    96916
 1     4350
 Name: count, dtype: int64,
 WS_Smoking_History
 0    98080
 1     3186
 Name: count, dtype: int64,
 WS_Mental_Health
 0    94045
 1     7221
 Name: count, dtype: int64,
 WS_Work_Access_Housing
 0    100946
 1       320
 Name: count, dtype: int64]

In [None]:
for l in SDOH_LABELS:
  positives = data[data[f"WS_{l}"] == 1]
  if len(positives) == 0:
    print(f"\n==== {l} ==== (no positives)")
    continue

  n = min(6, len(positives))
  sample = positives.sample(n, random_state=22)
  print(f"\n==== {l} ====")
  for t in sample["Text"].tolist():
    print("-", t)


==== Cost_Insurance ====
- You said:  Cry. Like. Babies.

That tone speaks for itself. I suggest you delete your post.

If you must, replace it with a statement that you wish the thread weren't so political. 

But the political wasn't merely an anti-Tr**p rant. It was a specific concern about the nominee to run HHS saying the government should pause drug approvals. And others who express concern about medical coverage due to the way a whole party has threatened the ACA.

I'll leave the rest of your post alone for now.  Again: I suggest you delete your post.
- I find it really helpful to talk to other ppl with cancer via their groups. You can vent, but you don't need to  bc we're all going through same. If f2f is too much, then here is great too.
Fine? It's not effing fine. We know that, it's pain and exhaustion and humiliation and waiting in hospitals to hear bad news, even good news isn't that good. I mean it's cancer, what's good about it?
As for the bucket list - bolox, yeah. Can't