<a href="https://colab.research.google.com/github/DVerma11/Reddit_Anxiety_Symptoms_Narratives_NLP_Exploration/blob/main/Section_2_Phrase_extraction_Symptoms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 2: Phrase extraction/clause segmentation

Input File: Step1_Anxiety_Preprocessed.csv

Column to be processed: [comment_body_clean]

Output Files: Step2A_symptoms_phrases_consolidated.csv (one row per comment) and Step2B_symptom_phrases_exploded.csv (one row per phrase), with new column ["comment_body_clean_phrases"]


This section uses en_core_web_sm, a general-purpose English spaCy model. It is suitable for preprocessing, sentence splitting, and general phrase detection, but not for clinical entity recognition.

Phrase extraction is performed on the cleaned symptom text in order to:
1) identify clinically meaningful symptom expressions (multi-word or single-word phrases) directly from the preprocessed text
2) detect negated phrases/symptoms among the extracted phrases
3) exclude negated phrases/symptoms from NER
4) perform NER on the non-negated phrases/symptoms


Rule‑based phrase extraction prioritizes syntactic noun and verb phrases and may not capture complex experiential panic expressions (e.g., ‘feeling like I’m going to die’). These were addressed through secondary pattern‑based detection and clinical entity recognition.
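
As an illustration of what such secondary pattern-based detection could look like, here is a minimal sketch using a regular expression. The pattern list `PANIC_PATTERNS` and the function `find_experiential_expressions` are hypothetical names introduced for this example. They are not the patterns used later in this project.

```python
import re

# Hypothetical pattern list for illustration only; not the project's actual patterns.
# Matches "feel/feels/feeling/felt like i'm / i m / i am going to (or gonna) die/faint/pass out".
PANIC_PATTERNS = [
    re.compile(
        r"\bfe(?:el|els|eling|lt)\s+like\s+i\s*['’]?\s*a?m\s+"
        r"(?:going\s+to|gonna)\s+(?:die|faint|pass\s+out)\b",
        re.IGNORECASE,
    ),
]

def find_experiential_expressions(text):
    """Return every substring of `text` matched by one of the patterns."""
    if not isinstance(text, str):
        return []
    return [m.group(0) for p in PANIC_PATTERNS for m in p.finditer(text)]

print(find_experiential_expressions("feeling like I’m going to die"))
print(find_experiential_expressions("i feel like i m gonna pass out again"))
print(find_experiential_expressions("i went to therapy"))
```

The pattern tolerates the form `i m`, because the cleaned text in this dataset appears to replace apostrophes with spaces.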

Clinical models are not well suited to phrase extraction here because they are trained on biomedical literature rather than informal social media language. For this reason, we used spaCy’s en_core_web_sm model to extract syntactic phrase candidates from social media text. The model was used solely for linguistic segmentation, not for medical entity recognition. Specifically, it serves as a syntactic phrase extractor that performs tokenization/noun chunking, POS tagging, and dependency parsing.





In [None]:
import pandas as pd

anxiety_preprocessed_df = pd.read_csv(
    "Step1_Anxiety_Preprocessed.csv",  # <-- include .csv
    encoding="utf-8",
    low_memory=False
)

In [None]:
# Check shape
print(anxiety_preprocessed_df.shape)

# See all column names
print(anxiety_preprocessed_df.columns.tolist())

# Preview first 5 rows
anxiety_preprocessed_df.head()


(3196, 13)
['post_id', 'comment_id', 'parent_id', 'comment_body', 'author_hash', 'score', 'created_utc', 'title', 'post_body', 'full_text', 'full_text_clean', 'comment_body_clean', 'post_body_clean']


index,post_id,comment_id,parent_id,comment_body,author_hash,score,created_utc,title,post_body,full_text,full_text_clean,comment_body_clean,post_body_clean
0,1czzuoo,l5k13qf,t3_1czzuoo,omg you have no idea how much better i feel. i...,bfc763f738dd81303e35d089fde639e68495eab77cc322...,110,1716601000.0,Here is a full list of anxiety symptoms I deal...,Anxiety easily can cause a million different s...,Here is a full list of anxiety symptoms I deal...,Here is a full list of anxiety symptoms I deal...,omg you have no idea how much better i feel. i...,anxiety easily can cause a million different s...
1,1czzuoo,l5k4qae,t3_1czzuoo,Thanks for this. I experience a ton of similar...,9bba55d20948ae8babbea1c68977c6d0c65cfc5a6d7412...,43,1716603000.0,Here is a full list of anxiety symptoms I deal...,Anxiety easily can cause a million different s...,Here is a full list of anxiety symptoms I deal...,Here is a full list of anxiety symptoms I deal...,thanks for this. i experience a ton of similar...,anxiety easily can cause a million different s...
2,1czzuoo,l5l1d2d,t3_1czzuoo,Though feeling all of these symptoms is incred...,4c07dfd4a1c67f96a0e9814edb0c983b564ff788ed7e64...,11,1716622000.0,Here is a full list of anxiety symptoms I deal...,Anxiety easily can cause a million different s...,Here is a full list of anxiety symptoms I deal...,Here is a full list of anxiety symptoms I deal...,though feeling all of these symptoms is incred...,anxiety easily can cause a million different s...
3,1czzuoo,l5kjm8b,t3_1czzuoo,Wow this was so reassuring to me. I was just t...,8d501e41f5ffc02c9d373eaf3705a596714dbafb7531de...,7,1716610000.0,Here is a full list of anxiety symptoms I deal...,Anxiety easily can cause a million different s...,Here is a full list of anxiety symptoms I deal...,Here is a full list of anxiety symptoms I deal...,wow this was so reassuring to me. i was just t...,anxiety easily can cause a million different s...
4,1czzuoo,lwwknq7,t3_1czzuoo,Hello everyone. I’m currently recovering from ...,0814aba0673db4ca9b746ba9f82b8bf3bfb671f48d7d9b...,8,1731499000.0,Here is a full list of anxiety symptoms I deal...,Anxiety easily can cause a million different s...,Here is a full list of anxiety symptoms I deal...,Here is a full list of anxiety symptoms I deal...,hello everyone. i m currently recovering from ...,anxiety easily can cause a million different s...


## 2.1 Extract phrases

In [None]:
# Note: this cell can take a long time to run
import spacy
# Load model, disable only NER
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# Add a simple rule-based sentencizer (the dependency parser stays enabled and also sets sentence boundaries)
nlp.add_pipe("sentencizer")

def extract_phrases(text):
    """Extract symptom phrases from text using spaCy sentence segmentation."""
    if not isinstance(text, str) or not text.strip():
        return []
    doc = nlp(text)
    phrases = []
    for sent in doc.sents:
        chunk = []
        for token in sent:
            if token.is_punct:
                continue
            # Split chunks at conjunctions
            if token.dep_ == "cc" and chunk:
                phrases.append(" ".join(chunk))
                chunk = []
                continue
            if token.is_stop:
                continue
            # Only keep NOUN, ADJ, VERB
            if token.pos_ in {"NOUN", "ADJ", "VERB"}:
                chunk.append(token.lemma_.lower())
        if chunk:
            phrases.append(" ".join(chunk))
    return phrases

# Apply to the cleaned comment text
anxiety_preprocessed_df["comment_body_clean_phrases"] = anxiety_preprocessed_df["comment_body_clean"].apply(extract_phrases)
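
To make the chunking rule easier to follow, here is a toy walk-through that does not need spaCy. It applies the same sequence of checks as `extract_phrases` to hand-annotated tokens. The `Tok` class, the `chunk_sentence` function, and the lemma/POS/dependency/stop-word labels are assumptions written by hand for this illustration, not actual model output.

```python
from typing import NamedTuple

class Tok(NamedTuple):
    """Hand-annotated stand-in for a spaCy token (labels are assumed, not model output)."""
    lemma: str
    pos: str
    dep: str
    is_stop: bool = False
    is_punct: bool = False

def chunk_sentence(tokens):
    """Apply the same checks, in the same order, as the loop inside extract_phrases."""
    phrases, chunk = [], []
    for tok in tokens:
        if tok.is_punct:
            continue
        # A coordinating conjunction closes the current chunk
        if tok.dep == "cc" and chunk:
            phrases.append(" ".join(chunk))
            chunk = []
            continue
        if tok.is_stop:
            continue
        if tok.pos in {"NOUN", "ADJ", "VERB"}:
            chunk.append(tok.lemma.lower())
    if chunk:
        phrases.append(" ".join(chunk))
    return phrases

# "i went to therapy and took medication for months."
sentence = [
    Tok("i", "PRON", "nsubj", is_stop=True),
    Tok("go", "VERB", "ROOT"),
    Tok("to", "ADP", "prep", is_stop=True),
    Tok("therapy", "NOUN", "pobj"),
    Tok("and", "CCONJ", "cc", is_stop=True),
    Tok("take", "VERB", "conj"),
    Tok("medication", "NOUN", "dobj"),
    Tok("for", "ADP", "prep", is_stop=True),
    Tok("month", "NOUN", "pobj"),
    Tok(".", "PUNCT", "punct", is_punct=True),
]

print(chunk_sentence(sentence))  # ['go therapy', 'take medication month']
```

The conjunction check runs before the stop-word check. This ordering matters because words such as "and" are themselves stop words and would otherwise be skipped without splitting the chunk.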


## 2.2 Repeated letter normalization of phrases

Example: convert "sooooo anxious" → "soo anxious"

In [None]:
import re

# Repeated letter normalization: 3+ repeated characters → 2 characters
# Note: `(.)` matches any character (digits and punctuation included), not only letters
def normalize_repeated_letters(text):
    if not isinstance(text, str):
        return text
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

# Apply repeated-letter normalization to each phrase in the list
anxiety_preprocessed_df["comment_body_clean_phrases"] = anxiety_preprocessed_df["comment_body_clean_phrases"].apply(
    lambda lst: [normalize_repeated_letters(p) for p in lst] if isinstance(lst, list) else []
)
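
The pattern above uses `(.)`, which collapses runs of any character. The sketch below compares it with a letters-only variant. `normalize_letters_only` is a hypothetical alternative introduced here for comparison. It is not part of the original pipeline.

```python
import re

def normalize_any_char(text):
    """Same pattern as in the cell above: collapses runs of ANY character."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def normalize_letters_only(text):
    """Hypothetical variant: collapses runs of ASCII letters only."""
    return re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)

for sample in ["sooooo anxious", "10000 steps"]:
    print(repr(sample), "->", repr(normalize_any_char(sample)), "|", repr(normalize_letters_only(sample)))
```

Both versions turn "sooooo anxious" into "soo anxious", but only the any-character version changes "10000 steps" (to "100 steps"). Whether this matters depends on whether digits survive into the phrases. The POS filter in 2.1 keeps only NOUN, ADJ, and VERB tokens, so this is likely rare here.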


## 2.3 Save consolidated phrases file: Step2A_symptoms_phrases_consolidated

In [None]:
# Convert list columns to string before saving CSV
anxiety_preprocessed_df["symptom_phrases_str"] = anxiety_preprocessed_df["comment_body_clean_phrases"].apply(lambda x: "; ".join(x))

anxiety_preprocessed_df.to_csv("Step2A_symptoms_phrases_consolidated.csv", index=False)
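
A list column written to CSV is stored as its string representation, so reading the file back requires a parsing step. The sketch below mimics this round trip using only the standard library. The two rows and the column name `phrases` are made-up illustration data, not project data.

```python
import ast
import csv
import io

# Made-up rows for illustration only
rows = [
    {"comment_id": "c1", "phrases": ["go therapy", "take medication month"]},
    {"comment_id": "c2", "phrases": []},
]

# Write: the list is stored as its repr, plus a "; "-joined string
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["comment_id", "phrases", "symptom_phrases_str"])
writer.writeheader()
for row in rows:
    writer.writerow({
        "comment_id": row["comment_id"],
        "phrases": str(row["phrases"]),
        "symptom_phrases_str": "; ".join(row["phrases"]),
    })

# Read back: both encodings can be turned into lists again
buf.seek(0)
restored = []
for rec in csv.DictReader(buf):
    from_repr = ast.literal_eval(rec["phrases"])
    joined = rec["symptom_phrases_str"]
    from_joined = joined.split("; ") if joined else []
    restored.append((rec["comment_id"], from_repr, from_joined))

print(restored)
```

When the file is read with `pd.read_csv` instead, an empty `symptom_phrases_str` cell comes back as NaN rather than an empty string by default, so the split step would need a null check first.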


## 2.4 Explode the list of phrases from the 2A output

In [None]:
# Symptom phrases: Explode the list of phrases so each phrase has its own row
anxiety_preprocessed_df = anxiety_preprocessed_df.explode("comment_body_clean_phrases").dropna(subset=["comment_body_clean_phrases"])
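
The toy example below illustrates how `explode` followed by `dropna` behaves. The three rows and the hash values are made up for this sketch. An empty list becomes a single NaN row, which `dropna` then removes, and the original row index is repeated for every phrase.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "author_hash": ["a1", "b2", "c3"],
    "comment_body_clean_phrases": [
        ["go therapy", "take medication month"],
        [],          # comment with no extracted phrases
        ["thank"],
    ],
})

exploded = toy.explode("comment_body_clean_phrases")
kept = exploded.dropna(subset=["comment_body_clean_phrases"])

print(exploded)
print(kept)
```

This explains why the index values repeat in the exploded output, and why comments with no extracted phrases are absent from the Step2B file.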


## 2.5 Phrase indexing by Author hash

In [None]:
anxiety_preprocessed_df["phrase_index"] = anxiety_preprocessed_df.groupby(["author_hash", "comment_body_clean_phrases"]).cumcount()
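
A small made-up example may help in reading `phrase_index`. Because the grouping key is the pair (author, phrase), `cumcount` numbers the repeated occurrences of the same phrase by the same author, starting at 0. It does not give the position of a phrase within a comment.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "author_hash": ["a1", "a1", "a1", "b2"],
    "comment_body_clean_phrases": ["feel", "panic", "feel", "feel"],
})

toy["phrase_index"] = toy.groupby(
    ["author_hash", "comment_body_clean_phrases"]
).cumcount()

print(toy)
```

Here the second "feel" by author `a1` gets index 1, while every first occurrence gets index 0.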


## 2.6 Save Step2B_symptom_phrases_exploded

In [None]:
# Save the exploded phrase DataFrame (one phrase per row)
anxiety_preprocessed_df.to_csv("Step2B_symptom_phrases_exploded.csv", index=False)


## 2.7 View headers and stats

In [None]:
# Show first 10 rows of the exploded phrases
anxiety_preprocessed_df[["author_hash", "comment_body_clean_phrases"]].head(10)


index,author_hash,comment_body_clean_phrases
0,bfc763f738dd81303e35d089fde639e68495eab77cc322...,idea well feel
0,bfc763f738dd81303e35d089fde639e68495eab77cc322...,feeling
0,bfc763f738dd81303e35d089fde639e68495eab77cc322...,feel go die
0,bfc763f738dd81303e35d089fde639e68495eab77cc322...,go therapy
0,bfc763f738dd81303e35d089fde639e68495eab77cc322...,take medication month
0,bfc763f738dd81303e35d089fde639e68495eab77cc322...,feel lot well month
0,bfc763f738dd81303e35d089fde639e68495eab77cc322...,m
1,9bba55d20948ae8babbea1c68977c6d0c65cfc5a6d7412...,thank
1,9bba55d20948ae8babbea1c68977c6d0c65cfc5a6d7412...,experience ton similar symptom
1,9bba55d20948ae8babbea1c68977c6d0c65cfc5a6d7412...,s horrible give severe hypochondria


In [None]:
len(anxiety_preprocessed_df)

19539

**End of phrase exploding for symptoms**