# Unsupervised Clustering of Mental Health Survey Questions

## 1. Overview

This notebook implements **Method 1 (Baseline): Unsupervised Clustering** for the mental health dimension reduction project.

The goal of this method is to explore the **intrinsic semantic structure** of survey questions drawn from multiple validated mental health and well-being scales (e.g., PSQI, PSS, PWB, CD-RISC, UCLA Loneliness, PERMA).  
Rather than relying on predefined scale boundaries or scoring rules, we examine how questions **naturally group together in semantic embedding space**.

This serves as:
- A **baseline method** for understanding latent dimensions across surveys
- A reference point for later **supervised classification** and **LLM-based approaches**

## 2. Data

Each data point corresponds to a **single survey question**, represented with:
- `qid`: the original item identifier (e.g., PSQI_5_3, PSS_12)
- `dataset`: the source questionnaire or scale
- `text`: the question text

All questions are consolidated into a canonical dataset (`questions_master`) during preprocessing to ensure:
- No modification of raw source files
- Consistent identifiers across methods
- Reproducibility across models and experiments
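
The guarantees listed above can be checked once at load time. The following is a minimal sketch, assuming the master table is a pandas DataFrame with the three columns described in this section. `validate_questions` is a hypothetical helper introduced for illustration and is not part of the project's `utils` module.

```python
import pandas as pd

REQUIRED_COLUMNS = ["qid", "dataset", "text"]  # columns described in this section


def validate_questions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical check: raise ValueError if the table breaks the assumed schema."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Identifiers must be unique so that results can be joined across methods.
    if df["qid"].duplicated().any():
        dupes = df.loc[df["qid"].duplicated(), "qid"].tolist()
        raise ValueError(f"duplicate qid values: {dupes}")
    # Every question needs non-empty text to be embedded.
    if df["text"].isna().any() or (df["text"].str.strip() == "").any():
        raise ValueError("empty question text found")
    return df


# Toy rows for illustration only (two CD-RISC items from this notebook's data).
toy = pd.DataFrame({
    "qid": ["CD_RISC_1", "CD_RISC_2"],
    "dataset": ["CD-RISC", "CD-RISC"],
    "text": [
        "I am able to adapt when changes occur.",
        "I have one close and secure relationship.",
    ],
})
validate_questions(toy)
```

Running a check like this right after loading would surface identifier collisions before any embeddings are computed.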


In [8]:
from utils import load_questions
import pandas as pd
import numpy as np
from pathlib import Path

df = load_questions()
df.head()

Unnamed: 0,qid,dataset,text
0,CD_RISC_1,CD-RISC,I am able to adapt when changes occur.
1,CD_RISC_2,CD-RISC,I have one close and secure relationship.
2,CD_RISC_3,CD-RISC,Sometimes fate or God helps me.
3,CD_RISC_4,CD-RISC,I can deal with whatever comes my way.
4,CD_RISC_5,CD-RISC,Past successes give me confidence.


In [7]:
print("Total questions:", len(df))
df["dataset"].value_counts()

Total questions: 145


dataset
PWS        36
CD-RISC    25
PERMA      23
PSS        23
UCLA       20
PWB        18
Name: count, dtype: int64

In [10]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["text"].tolist())
embeddings.shape



(145, 384)
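
The later cells rank questions with `cosine_similarity` but cluster them with `KMeans`, which uses Euclidean distance. For unit-length vectors the two agree, because the squared Euclidean distance equals `2 - 2 * cosine`. The sketch below verifies this identity on random stand-in data of the same shape. It does not use the real embeddings, and whether the model already returns unit-length vectors is something to check with `np.linalg.norm` rather than assume.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for the (145, 384) embedding matrix; values are illustrative only.
X = rng.normal(size=(145, 384))

# L2-normalize each row to unit length.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

a, b = X_unit[0], X_unit[1]
cos_ab = float(a @ b)                  # cosine similarity of two unit vectors
sq_dist = float(np.sum((a - b) ** 2))  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(sq_dist, 2.0 - 2.0 * cos_ab))
```

If the real embeddings turn out not to be unit length, normalizing them before clustering would make the Euclidean objective consistent with the cosine ranking.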

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

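i = 10  # index of the query question
# Cosine similarity between question i and every question (including itself)
sims = cosine_similarity(
    embeddings[i].reshape(1, -1),
    embeddings
)[0]

# Six highest-scoring rows, most similar first; the query matches itself,
# so this is the query plus its five nearest neighbors
top_idx = sims.argsort()[-6:][::-1]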

df.iloc[top_idx][["dataset", "text"]]

Unnamed: 0,dataset,text
10,CD-RISC,"I believe I can achieve my goals, even if ther..."
23,CD-RISC,I work to attain goals.
108,PWS,I am uncertain about my ability to do things w...
29,PERMA,How often do you achieve the important goals y...
25,PERMA,How much of the time do you feel you are makin...
16,CD-RISC,I think of myself as a strong person when deal...
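
The lookup above returns the query question itself as the first hit (row 10). A reusable variant that excludes the query could look like the sketch below. `top_k_neighbors` is a hypothetical helper name, and the toy vectors stand in for the real sentence embeddings.

```python
import numpy as np


def top_k_neighbors(embeddings: np.ndarray, i: int, k: int = 5) -> np.ndarray:
    """Return indices of the k rows most cosine-similar to row i, excluding i."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[i]
    sims[i] = -np.inf  # drop the query itself from the ranking
    return np.argsort(sims)[::-1][:k]


# Toy 2-D vectors standing in for sentence embeddings (illustrative only).
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbors = top_k_neighbors(toy, i=0, k=2)
print(neighbors)
```

With the notebook's variables, `df.iloc[top_k_neighbors(embeddings, 10)]` would list the five nearest questions without repeating the query.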


In [12]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# df: columns = [qid, dataset, text]
# embeddings: shape (N, D)

K = 8  # number of clusters (set manually)
# n_init="auto" requires scikit-learn >= 1.2
kmeans = KMeans(n_clusters=K, random_state=42, n_init="auto")
cluster_id = kmeans.fit_predict(embeddings)

df2 = df.copy()
df2["cluster_id"] = cluster_id

centers = kmeans.cluster_centers_
rep_rows = []

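for c in range(K):
    # Row positions of the questions assigned to cluster c
    idx = np.where(cluster_id == c)[0]
    # Cosine similarity of each member to the cluster centroid
    sims = cosine_similarity(embeddings[idx], centers[c].reshape(1, -1)).reshape(-1)
    # Up to 10 members closest to the centroid, most similar first
    top = idx[np.argsort(sims)[::-1][:10]]
    rep_rows.append(df2.iloc[top][["qid","dataset","text","cluster_id"]])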

display(pd.concat(rep_rows))

Unnamed: 0,qid,dataset,text,cluster_id
61,PSS_5_10,PSS,"During the past month, how often have you had ...",0
53,PSS_5_2,PSS,"During the past month, how often have you had ...",0
52,PSS_5_1,PSS,"During the past month, how often have you had ...",0
59,PSS_5_8,PSS,"During the past month, how often have you had ...",0
49,PSS_2,PSS,"During the past month, how long (in minutes) h...",0
...,...,...,...,...
32,PERMA3_2,PERMA,To what extent do you receive help and support...,7
130,UCLA_6,UCLA,I find myself waiting for people to call or write,7
103,PWS_15,PWS,My friends know they can always confide in me ...,7
1,CD_RISC_2,CD-RISC,I have one close and secure relationship.,7
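
The cell above sets `K = 8` directly. One common way to compare candidate values of `K` is the mean silhouette score. The sketch below shows that technique on synthetic data. It is not something the notebook itself runs, `silhouette_by_k` is a hypothetical helper, and the cosine metric is an assumption chosen to match the similarity measure used elsewhere in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, seed=42):
    """Fit KMeans for each candidate K and return {K: mean cosine silhouette}."""
    scores = {}
    for k in k_values:
        # n_init=10 is used here so the sketch also runs on older scikit-learn.
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
        scores[k] = silhouette_score(X, labels, metric="cosine")
    return scores


# Synthetic stand-in data: three well-separated blobs in 5-D (illustrative only).
rng = np.random.default_rng(0)
centers = np.array(
    [[5, 0, 0, 0, 0], [0, 5, 0, 0, 0], [0, 0, 5, 0, 0]], dtype=float
)
X = np.vstack([c + 0.3 * rng.normal(size=(30, 5)) for c in centers])

scores = silhouette_by_k(X, k_values=range(2, 7))
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On the real embeddings, calling `silhouette_by_k(embeddings, range(4, 13))` would give one number per candidate `K`. With only 145 short texts the scores may be close together, so they are better read as a rough guide than as a decision rule.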


In [13]:
for c in range(K):
    print("=" * 80)
    print(f"Cluster {c}")
    display(
        df2[df2["cluster_id"] == c][["dataset", "text"]].head(15)
    )

Cluster 0


Unnamed: 0,dataset,text
48,PSS,"During the past month, what time have you usua..."
49,PSS,"During the past month, how long (in minutes) h..."
50,PSS,"During the past month, what time have you usua..."
51,PSS,"During the past month, how many hours of actua..."
52,PSS,"During the past month, how often have you had ..."
53,PSS,"During the past month, how often have you had ..."
54,PSS,"During the past month, how often have you had ..."
55,PSS,"During the past month, how often have you had ..."
56,PSS,"During the past month, how often have you had ..."
57,PSS,"During the past month, how often have you had ..."


Cluster 1


Unnamed: 0,dataset,text
2,CD-RISC,Sometimes fate or God helps me.
8,CD-RISC,I believe most things happen for a reason.
19,CD-RISC,I have to act on a hunch.
20,CD-RISC,I have a strong sense of purpose in life.
23,CD-RISC,I work to attain goals.
31,PERMA,To what extent do you lead a purposeful and me...
93,PWS,I believe there is a real purpose for my life.
94,PWS,I will always seek out activities that challen...
101,PWS,I always look on the bright side of things.
111,PWS,I feel a sense of mission about my future.


Cluster 2


Unnamed: 0,dataset,text
21,CD-RISC,I feel like I am in control.
73,PWB,"Some people wander aimlessly through life, but..."
76,PWB,Maintaining close relationships has been diffi...
86,PWB,I have not experienced many warm and trusting ...
90,PWS,There have been times when I felt inferior to ...
115,PWS,"In the past, I have not always had friends wit..."
120,PWS,"In the past, I have felt sure of myself among ..."
125,UCLA,I am unhappy doing so many things alone
126,UCLA,I have nobody to talk to
127,UCLA,I cannot tolerate being so alone


Cluster 3


Unnamed: 0,dataset,text
30,PERMA,How would you say your health is?
36,PERMA,How satisfied are you with your current physic...
42,PERMA,"Compared to others of your same age and sex, h..."
92,PWS,My physical health has restricted me in the past.
98,PWS,My body seems to resist physical illness very ...
104,PWS,My physical health is excellent.
110,PWS,"Compared to people I know, my past physical he..."
116,PWS,I expect to always be physically healthy.
122,PWS,I expect my physical health to get worse.


Cluster 4


Unnamed: 0,dataset,text
25,PERMA,How much of the time do you feel you are makin...
26,PERMA,How often do you become absorbed in what you a...
27,PERMA,How often do you feel joyful?
28,PERMA,How often do you feel anxious?
29,PERMA,How often do you achieve the important goals y...
34,PERMA,To what extent do you feel excited and interes...
35,PERMA,How lonely do you feel in your daily life?
37,PERMA,How often do you feel positive?
38,PERMA,How often do you feel angry?
39,PERMA,How often are you able to handle your responsi...


Cluster 5


Unnamed: 0,dataset,text
11,CD-RISC,"Even when hopeless, I do not give up."
33,PERMA,To what extent do you feel that what you do in...
43,PERMA,To what extent do you feel you have a sense of...
72,PWB,"When I look at the story of my life, I am plea..."
74,PWB,The demands of everyday life often get me down.
75,PWB,In many ways I feel disappointed about my achi...
77,PWB,I live life one day at a time and don't really...
80,PWB,I sometimes feel as if I've done all there is ...
81,PWB,"For me, life has been a continuous process of ..."
84,PWB,I gave up trying to make big improvements or c...


Cluster 6


Unnamed: 0,dataset,text
0,CD-RISC,I am able to adapt when changes occur.
3,CD-RISC,I can deal with whatever comes my way.
4,CD-RISC,Past successes give me confidence.
5,CD-RISC,I try to see the humorous side of things when ...
6,CD-RISC,Having to cope with stress can make me stronger.
7,CD-RISC,"I tend to bounce back after illness, injury, o..."
9,CD-RISC,"I make my best effort, no matter what."
10,CD-RISC,"I believe I can achieve my goals, even if ther..."
12,CD-RISC,"In times of stress, I know where to find help."
13,CD-RISC,"Under pressure, I stay focused and think clearly."


Cluster 7


Unnamed: 0,dataset,text
1,CD-RISC,I have one close and secure relationship.
32,PERMA,To what extent do you receive help and support...
91,PWS,Members of my family come to me for support.
97,PWS,Sometimes I wonder if my family will really be...
103,PWS,My friends know they can always confide in me ...
109,PWS,My family has been available to support me in ...
121,PWS,My friends will be there for me when I need help.
130,UCLA,I find myself waiting for people to call or write
131,UCLA,There is no one I can turn to
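
The listings above suggest that some clusters are dominated by one source scale while others mix several. A cross-tabulation makes that visible at a glance. The sketch below uses a toy frame with the same `dataset` and `cluster_id` columns as `df2`. The rows and the "purity" summary are illustrative assumptions, not results from the notebook.

```python
import pandas as pd

# Toy frame with the same columns as df2 (values are illustrative only).
toy = pd.DataFrame({
    "dataset": ["CD-RISC", "CD-RISC", "PERMA", "PWS", "UCLA", "UCLA"],
    "cluster_id": [6, 7, 7, 7, 2, 7],
})

# Rows: clusters, columns: source scales, cells: number of questions.
counts = pd.crosstab(toy["cluster_id"], toy["dataset"])

# Share of each cluster that comes from its single most common scale.
purity = counts.max(axis=1) / counts.sum(axis=1)

print(counts)
print(purity)
```

Applied to the notebook's data, `pd.crosstab(df2["cluster_id"], df2["dataset"])` would show which clusters follow existing scale boundaries and which cut across them.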
