# Unsupervised Clustering of Mental Health Survey Questions

## 1. Overview

This notebook implements **Method 2 (Baseline): Unsupervised Clustering** for the mental health dimension reduction project.

The goal of this method is to explore whether mental health survey questions exhibit **intrinsic semantic groupings** when embedded into a shared representation space, without imposing any predefined dimension labels or conceptual frameworks. Survey items are drawn from multiple validated instruments (e.g., PSQI, PSS, PWB, CD-RISC, UCLA Loneliness, PERMA) and treated as unlabeled text.

Prior to embedding, questions are processed to remove recurrent grammatical patterns introduced by questionnaire design, ensuring that embeddings emphasize semantic content rather than stylistic regularities. All questions are then embedded using a pretrained sentence encoder and clustered using an unsupervised algorithm (**K-Means**). The resulting clusters represent latent semantic themes that emerge purely from distributional similarity.

Cluster centers act as abstract representations of these latent themes, and the questions with the highest cosine similarity to each center are selected as representatives for qualitative inspection. This baseline provides a data-driven reference for comparison with concept-driven semantic mapping and supervised classification approaches.

## 2. Data

Each data point corresponds to a **single survey question**, represented with:
- `qid`: the original item identifier (e.g., PSQI_5_3, PSS_12)
- `dataset`: the source questionnaire or scale
- `text`: the question text

All questions are consolidated into a canonical dataset (`questions_master`) during preprocessing to ensure:
- No modification of raw source files
- Consistent identifiers across methods
- Reproducibility across models and experiments
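
As a minimal sketch of the record layout above, each item can be modeled as a small immutable record and checked for identifier consistency. The `Question` class and the `find_duplicate_qids` helper are hypothetical names introduced for illustration; they are not part of `utils`.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One survey item; fields mirror the columns described above."""
    qid: str      # original item identifier, e.g. "PSS_12"
    dataset: str  # source questionnaire or scale
    text: str     # question text


def find_duplicate_qids(questions: list[Question]) -> list[str]:
    """Return the qids that occur more than once (empty list if all are unique)."""
    counts = Counter(q.qid for q in questions)
    return sorted(qid for qid, n in counts.items() if n > 1)


# Illustrative rows, with one deliberate duplicate to exercise the check.
rows = [
    Question("CD_RISC_1", "CD-RISC", "I am able to adapt when changes occur."),
    Question("CD_RISC_2", "CD-RISC", "I have one close and secure relationship."),
    Question("CD_RISC_1", "CD-RISC", "I am able to adapt when changes occur."),
]
duplicates = find_duplicate_qids(rows)
print(duplicates)
```

A check of this kind would typically run once, right after consolidation, so that every downstream method sees the same unique identifiers.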


In [1]:
from utils import load_questions
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional

question = load_questions()
question.head()

| | qid | dataset | text |
|---|---|---|---|
| 0 | CD_RISC_1 | CD-RISC | I am able to adapt when changes occur. |
| 1 | CD_RISC_2 | CD-RISC | I have one close and secure relationship. |
| 2 | CD_RISC_3 | CD-RISC | Sometimes fate or God helps me. |
| 3 | CD_RISC_4 | CD-RISC | I can deal with whatever comes my way. |
| 4 | CD_RISC_5 | CD-RISC | Past successes give me confidence. |


In [2]:
print("Total questions:", len(question))
question["dataset"].value_counts()

Total questions: 145


dataset
PWS        36
CD-RISC    25
PERMA      23
PSS        23
UCLA       20
PWB        18
Name: count, dtype: int64

## 3. Universal Dependencies (UD) Extraction with Stanza

To mitigate the influence of recurrent grammatical patterns introduced by questionnaire design (e.g., repeated use of auxiliaries, determiners, or copular constructions), we leverage **Universal Dependencies (UD)** to obtain a syntactic view of each survey question.

We use **Stanza**, a neural NLP toolkit, to parse each question into a UD dependency structure and extract the sequence of dependency relation labels (`deprel`) for all tokens, preserving the original word order.  
This representation captures **syntactic roles rather than lexical content**, allowing us to identify and control for survey-specific structural templates.

All questions are mapped to their corresponding UD dependency sequences and stored in a separate CSV file for inspection and downstream analysis; see "./temp_result/UDmap.csv".

Universal Dependencies provide a standardized, cross-linguistic framework for syntactic annotation; see  
https://universaldependencies.org/ for details.
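
One way structural templates could be surfaced from the exported sequences is to count how often each leading run of dependency labels occurs. The sketch below assumes the space-separated format of the `ud_deprel` column; the sequences are hand-written for illustration and are not actual Stanza output, and `leading_template` is a hypothetical helper.

```python
from collections import Counter


def leading_template(deprel_seq: str, n: int = 2) -> str:
    """Return the first n dependency labels of a space-separated deprel sequence."""
    return " ".join(deprel_seq.split()[:n])


# Hand-written example sequences in the same format as the `ud_deprel` column.
ud_deprel = [
    "nsubj cop root mark xcomp punct",
    "nsubj cop root obl punct",
    "nsubj root det obj punct",
    "advmod nsubj root obj punct",
]

# Count how many items share each two-label opening.
template_counts = Counter(leading_template(seq) for seq in ud_deprel)
most_common = template_counts.most_common(1)[0]
print(most_common)
```

Openings shared by many items are candidates for questionnaire-specific templates rather than content.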

In [3]:
import stanza


class UDExtractor:
    """Map text to sequences of UD dependency relation labels using a Stanza pipeline."""

    def __init__(self, nlp: stanza.Pipeline):
        self.nlp = nlp

    def text_to_deprel(self, text: str) -> str:
        """
        Convert text to a space-separated UD deprel sequence.
        Nothing is dropped. Order is preserved.
        """
        if text is None:
            return ""

        doc = self.nlp(str(text))
        out = []

        for sent in doc.sentences:
            for w in sent.words:
                out.append((w.deprel or "dep").lower())

        return " ".join(out)

    def add_deprel_column(
        self,
        df: pd.DataFrame,
        text_col: str = "text",
        new_col: str = "ud_deprel"
    ) -> pd.DataFrame:
        """
        Add a new column with UD deprel sequences.
        """
        df = df.copy()
        texts = df[text_col].fillna("").astype(str).tolist()

        docs = self.nlp.bulk_process(texts)

        deprels = []
        for doc in docs:
            seq = []
            for sent in doc.sentences:
                for w in sent.words:
                    seq.append((w.deprel or "dep").lower())
            deprels.append(" ".join(seq))

        df[new_col] = deprels
        return df


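# -----------------------------
# Run
# -----------------------------
# depparse relies on tokenize, pos, and lemma; ner is loaded as well but is not needed for deprel extraction.
# tokenize_no_ssplit=True disables sentence segmentation (text is split only on blank lines).
nlp = stanza.Pipeline(
    "en",
    processors="tokenize,pos,lemma,depparse,ner",
    tokenize_no_ssplit=True
)

df = question.copy()
ud = UDExtractor(nlp)
df = ud.add_deprel_column(df, text_col="text", new_col="ud_deprel")

# Create the output directory if it does not exist yet, then export.
out_path = Path("./temp_result/UDmap.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(out_path, index=False)

print("Saved to UDmap.csv")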

2026-02-04 14:37:53 INFO: Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| pos       | combined_charlm           |
| lemma     | combined_nocharlm         |
| depparse  | combined_charlm           |
| ner       | ontonotes-ww-multi_charlm |

2026-02-04 14:37:53 INFO: Using device: cpu
2026-02-04 14:37:53 INFO: Loading: tokeni

Saved to UDmap.csv


## 4. Semantic Content Extraction Pipeline

Before embedding and clustering, we apply a **syntax-aware semantic extraction pipeline** to reduce the influence of questionnaire-specific surface patterns and preserve the core semantic content of each item.

### Motivation
Mental health questionnaires often share recurring grammatical templates (e.g., *“I feel…”, “How often have you…”, “During the past month…”*).  
If used directly, these patterns can dominate sentence embeddings and lead to clustering driven by **survey format rather than meaning**.

### Method
We use **Stanza** with Universal Dependencies (UD) parsing to perform token-level filtering based on syntactic roles and morphological features:

- **Remove grammatical shells** such as auxiliaries, determiners, punctuation, copulas, and coordinating conjunctions.
- **Drop subject personal pronouns** (e.g., *I, you, we*) to reduce first-/second-person framing effects, while retaining object or reflexive forms when semantically relevant.
- **Remove all named-entity spans** (e.g., time, date, quantity expressions) to avoid anchoring on temporal or numeric questionnaire artifacts.
- **Explicitly preserve negation** using UD morphological features (e.g., `Polarity=Neg`, `PronType=Neg`), ensuring that semantic polarity is retained.
- Keep all remaining content-bearing tokens (verbs, nouns, adjectives, meaningful modifiers).

Filtering is applied at the **token level**, and multi-word tokens (e.g., *“cannot”*) are preserved if any sub-token satisfies the retention criteria.
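
The rules above can be sketched as a single token-level predicate. This is an illustrative approximation under stated assumptions, not the actual `SemanticExtractor` implementation: the `Tok` class, the `keep` and `backbone` functions, and the exact tag sets are hypothetical, tokens are hand-annotated rather than parsed by Stanza, and multi-word token handling is omitted.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hand-annotated token; in a real pipeline these fields would come from the parser."""
    text: str
    upos: str                # UD part-of-speech tag
    deprel: str              # UD dependency relation
    feats: str = ""          # UD morphological features, e.g. "Polarity=Neg"
    in_entity: bool = False  # True if the token lies inside a named-entity span


SHELL_UPOS = {"AUX", "DET", "PUNCT", "CCONJ"}   # assumed set of grammatical shells
NEG_FEATS = ("Polarity=Neg", "PronType=Neg")


def keep(tok: Tok) -> bool:
    """Apply the filtering rules to a single token."""
    if any(f in tok.feats for f in NEG_FEATS):
        return True   # negation is always preserved
    if tok.in_entity:
        return False  # named-entity spans are removed
    if tok.upos in SHELL_UPOS or tok.deprel == "cop":
        return False  # auxiliaries, determiners, punctuation, copulas, conjunctions
    is_personal = tok.upos == "PRON" and "PronType=Prs" in tok.feats
    if is_personal and tok.deprel.startswith("nsubj"):
        return False  # subject personal pronouns
    return True       # everything else counts as content-bearing


def backbone(tokens: list[Tok]) -> str:
    """Join the retained tokens, lowercased, in their original order."""
    return " ".join(t.text.lower() for t in tokens if keep(t))


example = [
    Tok("I", "PRON", "nsubj", "Case=Nom|Number=Sing|Person=1|PronType=Prs"),
    Tok("am", "AUX", "cop"),
    Tok("not", "PART", "advmod", "Polarity=Neg"),
    Tok("able", "ADJ", "root"),
    Tok("to", "PART", "mark"),
    Tok("adapt", "VERB", "xcomp"),
    Tok(".", "PUNCT", "punct"),
]
print(backbone(example))
```

Checking negation first is a deliberate choice in this sketch: a negative determiner such as *no* would otherwise be discarded together with the other determiners.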

### Output
For each question, we produce:
- A **cleaned semantic backbone** used for downstream embedding and clustering
- The **original raw question text**, stored alongside the processed version for traceability

This preprocessing step ensures that downstream semantic representations emphasize **conceptual meaning** rather than questionnaire structure, providing a more reliable foundation for unsupervised clustering and semantic analysis; see "./temp_result/question_nvo.csv".

In [4]:
import stanza
from semantic_extractor import SemanticExtractor

# Build the default English pipeline; download the models first if loading fails.
try:
    nlp = stanza.Pipeline("en", verbose=False)
except Exception:
    stanza.download("en")
    nlp = stanza.Pipeline("en", verbose=False)

In [5]:
semantic_extractor = SemanticExtractor(nlp=nlp)
question_se = semantic_extractor.transform_df(question, text_col="text")
question_se.head()

| | qid | dataset | text | text_raw |
|---|---|---|---|---|
| 0 | CD_RISC_1 | CD-RISC | able to adapt changes occur | I am able to adapt when changes occur. |
| 1 | CD_RISC_2 | CD-RISC | have close secure relationship | I have one close and secure relationship. |
| 2 | CD_RISC_3 | CD-RISC | fate god helps me | Sometimes fate or God helps me. |
| 3 | CD_RISC_4 | CD-RISC | deal with whatever comes my way | I can deal with whatever comes my way. |
| 4 | CD_RISC_5 | CD-RISC | past successes give me confidence | Past successes give me confidence. |


## 5. Unsupervised Clustering Pipeline (Embedding → K-Means → Inspection)

This section implements our **unsupervised clustering baseline** to examine whether survey questions form **intrinsic semantic groups** in embedding space, without using any predefined dimension labels.

### Step 1 — Sentence Embedding
We embed each (preprocessed) question using a pretrained sentence encoder (**Sentence-Transformers: `all-MiniLM-L6-v2`**).  
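Embeddings are **L2-normalized** (`normalize_embeddings=True`), so the dot product of two embeddings equals their cosine similarity, and the squared Euclidean distance that K-Means minimizes becomes a monotone function of cosine similarity.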
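
The effect of normalization can be checked on toy vectors. The sketch below uses two hand-picked 2-D vectors rather than real sentence embeddings; for unit vectors `u` and `v` it verifies that `u · v = cos(u, v)` and that `||u - v||² = 2 - 2·cos(u, v)`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (not real embeddings); both have length 5 before normalization.
a = [3.0, 4.0]
b = [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

cos_ab = cosine(a, b)                             # cosine on the raw vectors
dot_unit = dot(ua, ub)                            # dot product after normalization
sq_dist = sum((x - y) ** 2 for x, y in zip(ua, ub))  # squared Euclidean distance

print(cos_ab, dot_unit, sq_dist)
```

Because the squared distance decreases exactly as cosine similarity increases, assigning a normalized point to its nearest center in Euclidean terms and to its most cosine-similar center are closely aligned.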

### Step 2 — K-Means Clustering
We apply **K-Means** with `k=8` to the embedding matrix and assign each question a `cluster_id`.  
Cluster centroids serve as latent “theme prototypes” induced purely from distributional similarity.
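
The internals of `SemanticCluster` are not shown in this notebook, so the following is only a generic sketch of the K-Means procedure (Lloyd's algorithm) in plain Python. It uses `k=2` on four toy unit vectors instead of `k=8` on sentence embeddings, and the initial centers are fixed by hand to keep the run deterministic; all function names are hypothetical.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, init_centers, n_iter=10):
    """Plain Lloyd's algorithm: alternate an assignment step and an update step."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean(members)
    return labels, centers


# Toy unit vectors at 0, 10, 80, and 90 degrees (stand-ins for embeddings).
points = [
    (math.cos(math.radians(d)), math.sin(math.radians(d))) for d in (0, 10, 80, 90)
]
labels, centers = kmeans(points, init_centers=[points[0], points[3]])
print(labels)
```

In practice a library implementation with multiple random restarts would usually be preferred, since Lloyd's algorithm only converges to a local optimum that depends on initialization.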

### Step 3 — Cluster Inspection (Representatives + Similarity Query)
To interpret clusters, we:
- Select **representative questions per cluster** by ranking items by cosine similarity to the cluster centroid (`sim_to_center`).
- Support **nearest-neighbor querying** (`query_similar`) to retrieve the most semantically similar questions to a given item.
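
Both inspection steps reduce to ranking by cosine similarity. The sketch below illustrates the idea on toy vectors with hand-set labels and centers; `rank_by_center` and `nearest_neighbors` are hypothetical standalone functions and do not reproduce the signatures of `get_representatives` or `query_similar`.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def rank_by_center(vectors, labels, centers, top_n):
    """For each cluster, return member indices ordered by similarity to its center."""
    result = {}
    for cid, center in enumerate(centers):
        scored = [
            (cosine(v, center), i)
            for i, (v, lab) in enumerate(zip(vectors, labels))
            if lab == cid
        ]
        scored.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
        result[cid] = [i for _, i in scored[:top_n]]
    return result


def nearest_neighbors(vectors, i, top_k):
    """Return the indices of the top_k items most similar to item i, excluding i."""
    scored = [(cosine(vectors[i], v), j) for j, v in enumerate(vectors) if j != i]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scored[:top_k]]


# Toy unit vectors with hand-set cluster labels and centers.
vectors = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]
labels = [0, 0, 1, 1]
centers = [(1.0, 0.0), (0.0, 1.0)]

reps = rank_by_center(vectors, labels, centers, top_n=1)
neighbors = nearest_neighbors(vectors, i=1, top_k=2)
print(reps, neighbors)
```

Note that an item's nearest neighbor need not share its cluster: here item 1 belongs to cluster 0, yet its closest neighbor is item 3 in cluster 1, which is one reason similarity queries complement the centroid-based view.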

### Step 4 — Export + Optional Human Naming
Cluster assignments are saved to CSV in a **stable sorted order** (by `cluster_id`, then `dataset`, then `qid`).  
Optionally, we provide a manual mapping from `cluster_id → cluster_type` (e.g., `["Sleep", "Spiritual", ...]`) for downstream analysis and readability, and export the named results as an additional CSV; see "./temp_result/named_cluster.csv".
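
The ordering and naming logic of this step can be sketched with the standard library alone. The example assumes a list-of-dicts representation and writes to an in-memory buffer instead of `./temp_result/`; `export_named` is a hypothetical helper, not the `save_cluster` method.

```python
import csv
import io


def export_named(rows, names):
    """Sort rows by (cluster_id, dataset, qid), attach cluster_type, return CSV text."""
    ordered = sorted(rows, key=lambda r: (r["cluster_id"], r["dataset"], r["qid"]))
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["qid", "dataset", "cluster_id", "cluster_type"],
        lineterminator="\n",
    )
    writer.writeheader()
    for r in ordered:
        # The list position of each name is the cluster_id it labels.
        writer.writerow({**r, "cluster_type": names[r["cluster_id"]]})
    return buf.getvalue()


# A few illustrative assignments, listed out of order on purpose.
rows = [
    {"qid": "PWB_7", "dataset": "PWB", "cluster_id": 7},
    {"qid": "CD_RISC_5", "dataset": "CD-RISC", "cluster_id": 7},
    {"qid": "PSS_5_2", "dataset": "PSS", "cluster_id": 0},
]
cluster_names = [
    "Sleep", "Spiritual", "Relationship", "Relationship",
    "Feeling", "Physical", "Challenge", "Future",
]

csv_text = export_named(rows, cluster_names)
print(csv_text)
```

Sorting on a fixed key before writing makes successive exports comparable line by line, regardless of the order in which rows were produced.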

In [6]:
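from sentence_transformers import SentenceTransformer
from semantic_cluster import SemanticCluster

embedder = SentenceTransformer("all-MiniLM-L6-v2")
clusterer = SemanticCluster(embedder, k=8)

# Steps 1-2: embed the preprocessed text and assign each question a cluster_id.
df_clustered = clusterer.fit(question_se, text_col="text")

# Step 3: representative questions per cluster, ranked by similarity to the centroid.
rep_df = clusterer.get_representatives(top_n=10)
display(rep_df)

# Step 3: nearest-neighbor query for a single item.
clusterer.query_similar(i=10, top_k=6)

# Step 4: export the cluster assignments to CSV.
clusterer.save_cluster()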

| | qid | dataset | text | cluster_id | sim_to_center |
|---|---|---|---|---|---|
| 0 | PSS_5_2 | PSS | during had trouble sleeping because wake up in... | 0 | 0.889706 |
| 1 | PSS_5_10 | PSS | during had trouble sleeping because of other r... | 0 | 0.886029 |
| 2 | PSS_5_9 | PSS | during had trouble sleeping because have pain | 0 | 0.824738 |
| 3 | PSS_5_1 | PSS | during had trouble sleeping because cannot get... | 0 | 0.822696 |
| 4 | PSS_5_8 | PSS | during had trouble sleeping because have bad d... | 0 | 0.821908 |
| ... | ... | ... | ... | ... | ... |
| 74 | CD_RISC_5 | CD-RISC | past successes give me confidence | 7 | 0.637245 |
| 75 | PERMA1_1 | PERMA | much of time feel making progress towards acco... | 7 | 0.636630 |
| 76 | PWS_31 | PWS | things not work out way want them to in future | 7 | 0.634175 |
| 77 | PWB_7 | PWB | live life at time don't think about future | 7 | 0.612619 |


PosixPath('temp_result/cluster.csv')

In [7]:
cluster_name = ["Sleep", "Spiritual", "Relationship", "Relationship", "Feeling", "Physical", "Challenge", "Future"]
clusterer.map_cluster_type(cluster_name)
clusterer.save_cluster("./temp_result/named_cluster.csv")


PosixPath('temp_result/named_cluster.csv')