In [1]:
import pandas as pd
import os

### The visualisations in this script show results based on the modelling tasks guided by the following settings:

## Models
- **Embedding Model**: `all-MiniLM-L6-v2` (SentenceTransformer)
- **LLM Model**: `mistralai/Mistral-7B-Instruct-v0.3` (Not available/used - fallback to KeyBERT)
- **Clustering**: HDBSCAN

## Key Parameters
- **PCA Components**: 75  
- **HDBSCAN**:
  - `min_cluster_size`: auto (1% of data, min 5)
  - `min_samples`: 1
  - `metric`: euclidean
  - `cluster_selection_method`: leaf
  - `cluster_selection_epsilon`: 0.5
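
The "auto" rule for `min_cluster_size` and the other settings above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the helper names `auto_min_cluster_size` and `hdbscan_params` are hypothetical, and rounding 1% of the data down to an integer is an assumption, since the settings only state "1% of data, min 5".

```python
def auto_min_cluster_size(n_samples: int, fraction: float = 0.01, floor: int = 5) -> int:
    """Return 1% of the data (rounded down, an assumption) but never less than 5."""
    return max(floor, int(n_samples * fraction))


def hdbscan_params(n_samples: int) -> dict:
    """Collect the HDBSCAN settings listed above as keyword arguments."""
    return {
        "min_cluster_size": auto_min_cluster_size(n_samples),
        "min_samples": 1,
        "metric": "euclidean",
        "cluster_selection_method": "leaf",
        "cluster_selection_epsilon": 0.5,
    }


# 6057 is the number of rows implied by the cluster sizes reported in this
# notebook (3680 + 2162 + 122 + 93).
params = hdbscan_params(6057)
print(params["min_cluster_size"])
```

A dictionary like this could be unpacked into an HDBSCAN constructor, for example `hdbscan.HDBSCAN(**params)`; which HDBSCAN implementation the pipeline actually uses is not stated here.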

## Keyword Generation (Fallback)
- **Method**: KeyBERT when LLM unavailable
- **Parameters**:
  - `keyphrase_ngram_range`: (1, 2)
  - `stop_words`: english
  - `top_n`: 5 (per question)
  - Minimum keyword frequency thresholds:
    - Main clusters: ≥3 occurrences
    - Subclusters: ≥2 occurrences
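
The minimum-frequency thresholds can be illustrated with a small counting sketch. It assumes that the per-question keywords are pooled within a cluster and that only keywords reaching the threshold are kept. The function name `frequent_keywords` and the toy keyword lists are hypothetical and not taken from the pipeline.

```python
from collections import Counter


def frequent_keywords(keywords_per_question, min_count):
    """Keep keywords that occur at least `min_count` times across all questions."""
    counts = Counter(kw for kws in keywords_per_question for kw in kws)
    # most_common() orders by descending frequency; ties keep first-seen order.
    return [kw for kw, n in counts.most_common() if n >= min_count]


# Toy per-question keyword lists (illustrative only, not from the dataset).
toy = [
    ["invited party", "party"],
    ["invited party", "wedding"],
    ["invited party", "party"],
]
main_cluster_keywords = frequent_keywords(toy, min_count=3)  # main clusters: >= 3
subcluster_keywords = frequent_keywords(toy, min_count=2)    # subclusters: >= 2
print(main_cluster_keywords)
print(subcluster_keywords)
```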

## Metrics
- **Silhouette Score**
- **Calinski-Harabasz Index**
- **Davies-Bouldin Index**
- **Cluster Statistics**:
  - Number of clusters
  - Noise points
  - Average cluster size
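
The cluster statistics can be derived directly from the label vector. The sketch below assumes HDBSCAN's convention of labelling noise points as `-1`; the function name `cluster_stats` is hypothetical, and the labels are rebuilt from the cluster sizes reported in `cluster_metrics.json` rather than loaded from the real model output.

```python
from collections import Counter


def cluster_stats(labels, noise_label=-1):
    """Count clusters and noise points, and average the non-noise cluster sizes."""
    sizes = Counter(label for label in labels if label != noise_label)
    n_noise = sum(1 for label in labels if label == noise_label)
    n_clusters = len(sizes)
    avg_size = sum(sizes.values()) / n_clusters if n_clusters else 0.0
    return {
        "n_clusters": n_clusters,
        "n_noise_points": n_noise,
        "avg_cluster_size": avg_size,
        "cluster_size_distribution": dict(sizes),
    }


# Labels rebuilt from the reported sizes: cluster 1, noise, cluster 0, cluster 2.
labels = [1] * 3680 + [-1] * 2162 + [0] * 122 + [2] * 93
stats = cluster_stats(labels)
print(stats)
```

The three indices themselves are available in scikit-learn as `sklearn.metrics.silhouette_score`, `calinski_harabasz_score` and `davies_bouldin_score`; whether the pipeline calls these exact functions is not shown in this notebook.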

In [2]:
df = pd.read_csv('./Clinical_Cluster_Outputs/clinical_cluster_summary.csv')
df.head()

Unnamed: 0,cluster,size,keywords,representative_comment,representative_question
0,1,3680,"['peer unfit', 'people boring', 'married loser...",used always joke depressed hating life past mo...,I used to always just joke about being depress...
1,noise_cluster,2162,"['boring person', 'earn peer', 'failure stylis...",year old aromantic tell anymore feel mentally ...,"I’m 14 years old, I’m aromantic, and I can’t ..."
2,0,122,"['friend culinary', 'wedding dislike', 'invita...",friend invite dinner party must enjoy company,My friend didn't invite me to their dinner pa...
3,2,93,"['leader project', 'idea rejected', 'think sma...",chosen project skill must lacking,I wasn't chosen for the project my skills mus...


##### The above shows the results of the first-level (first-layer) clustering, which produced 3 clusters and an additional noise cluster.
##### The intuition behind labelling noise as a cluster is that a low count does not imply a lack of qualitative value.
##### The noise cluster may contain micro-clusters that carry meaningful information, especially when dealing with social constructs
##### and clinically related experiments. Such points might signal the onset of a health-related issue, so there is a need to probe
##### these minimal or micro-clusters in further analytical review.

In [3]:
df1 = pd.read_csv('./Clinical_Cluster_Outputs/cluster_details/cluster_0.csv')
df1.head()

Unnamed: 0,patient_question,cleaned_text,keywords,distance_to_centroid
0,My friend didn't invite me to the party I mus...,friend invite party must terrible friend,"['invite party', 'party terrible', 'terrible f...",0.704716
1,I can't cook as well as my mom I'm not a good...,cook well mom good homemaker,"['cook mom', 'good homemaker', 'homemaker', 'm...",0.4217
2,I wasn't invited to the meeting my opinions d...,invited meeting opinion matter,"['meeting opinion', 'invited meeting', 'meetin...",0.514285
3,I wasn't invited to their wedding they must d...,invited wedding must dislike,"['wedding dislike', 'invited wedding', 'weddin...",0.621622
4,My friends didn't include me in their plans t...,friend include plan must find boring,"['plan boring', 'include plan', 'friend includ...",0.573654


In [4]:
df2 = pd.read_csv('./Clinical_Cluster_Outputs/cluster_details/cluster_1.csv')
df2.head()

Unnamed: 0,patient_question,cleaned_text,keywords,distance_to_centroid
0,I'm such a failure I never do anything right.,failure never anything right,"['failure right', 'failure', 'right']",0.323038
1,My boss didn't say 'good morning' she must be...,bos say good morning must angry,"['morning angry', 'morning', 'bos say', 'good ...",0.281972
2,Nobody cares about me because they didn't ask...,nobody care ask day,"['ask day', 'care ask', 'care', 'day', 'ask']",0.197227
3,My partner didn't say 'I love you' today our ...,partner say love today relationship must falli...,"['relationship falling', 'falling apart', 'tod...",0.385994
4,My child misbehaved at school I must be a bad...,child misbehaved school must bad parent,"['misbehaved school', 'child misbehaved', 'bad...",0.313729


In [5]:
df3 = pd.read_csv('./Clinical_Cluster_Outputs/cluster_details/cluster_2.csv')
df3.head()

Unnamed: 0,patient_question,cleaned_text,keywords,distance_to_centroid
0,My idea was rejected my team must think I'm d...,idea rejected team must think dumb,"['idea rejected', 'rejected team', 'think dumb...",0.577736
1,I didn't get the promotion I will never progr...,get promotion never progress career,"['promotion progress', 'promotion', 'progress ...",0.345262
2,I wasn't promoted I'm a failure at my job.,promoted failure job,"['promoted failure', 'failure job', 'promoted'...",0.403485
3,I wasn't selected for the project my ideas mu...,selected project idea must worthless,"['idea worthless', 'project idea', 'selected p...",0.635969
4,I didn't get the raise my work is not appreci...,get raise work appreciated,"['raise work', 'raise', 'work', 'work apprecia...",0.413817


In [8]:
df4 = pd.read_csv('./Clinical_Cluster_Outputs/cluster_details/noise_cluster.csv')
df4.head()

Unnamed: 0,patient_question,cleaned_text,keywords,distance_to_centroid
0,Nobody likes me because I'm not interesting.,nobody like interesting,"['like interesting', 'interesting', 'like']",0.247387
1,I can't try new things because I'll just mess...,try new thing mess,"['thing mess', 'try new', 'mess', 'new thing',...",0.282968
2,I didn't get the job so I must be incompetent.,get job must incompetent,"['job incompetent', 'incompetent', 'job']",0.295849
3,I'm always unlucky. Good things only happen t...,always unlucky good thing happen people,"['unlucky good', 'unlucky', 'happen people', '...",0.359713
4,Everyone thinks I'm stupid because I made a m...,everyone think stupid made mistake presentation,"['mistake presentation', 'stupid mistake', 'pr...",0.386791


##### The above results show, for each cluster, the patient questions, the keywords in each question (which generally point to the nature of the cognitive distortion) and the distance to the cluster centroid. This information can be explored further, for example to find out whether the centroid corresponds to a positive expression and whether the distance from the centroid reflects how close or how far a patient question is from that state of distortion or lack of it. #####
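
How a `distance_to_centroid` value could be obtained is sketched below. This is an assumption rather than the pipeline's confirmed method: it takes the centroid to be the per-dimension mean of the cluster's embeddings and uses Euclidean distance, matching the `euclidean` metric in the settings. The function name and the toy two-dimensional vectors are hypothetical.

```python
import math


def distances_to_centroid(vectors):
    """Return the centroid (per-dimension mean) and each vector's Euclidean distance to it."""
    n = len(vectors)
    centroid = [sum(column) / n for column in zip(*vectors)]
    return centroid, [math.dist(vector, centroid) for vector in vectors]


# Toy 2-D "embeddings" for one cluster (illustrative only).
toy_embeddings = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
centroid, dists = distances_to_centroid(toy_embeddings)
print(centroid)
print([round(d, 3) for d in dists])
```

Under this reading, a smaller distance means the question sits closer to the cluster's average embedding, which is why sorting by this column is a simple way to pick representative questions.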

In [10]:
df5 = pd.read_csv('./Clinical_Cluster_Outputs/clinical_cluster_summary.csv')
df5

Unnamed: 0,cluster,size,keywords,representative_comment,representative_question
0,1,3680,"['peer unfit', 'people boring', 'married loser...",used always joke depressed hating life past mo...,I used to always just joke about being depress...
1,noise_cluster,2162,"['boring person', 'earn peer', 'failure stylis...",year old aromantic tell anymore feel mentally ...,"I’m 14 years old, I’m aromantic, and I can’t ..."
2,0,122,"['friend culinary', 'wedding dislike', 'invita...",friend invite dinner party must enjoy company,My friend didn't invite me to their dinner pa...
3,2,93,"['leader project', 'idea rejected', 'think sma...",chosen project skill must lacking,I wasn't chosen for the project my skills mus...


In [17]:

cluster_1_keywords = df.loc[0, 'keywords']
print(cluster_1_keywords)

['peer unfit', 'people boring', 'married loser', 'instagram unattractive', 'wish friend', 'likable significant', 'anniversary love', 'like toned', 'receive retweets', 'return avoiding']


In [18]:

noise_cluster_keywords = df.loc[1, 'keywords']
print(noise_cluster_keywords)

['boring person', 'earn peer', 'failure stylish', 'list unproductive', 'essay terrible', 'paper smart', 'think competent', 'exam pressure', 'thing happen', 'weight mean']


In [19]:

cluster_0_keywords = df.loc[2, 'keywords']
print(cluster_0_keywords)

['friend culinary', 'wedding dislike', 'invitation exclusive', 'partner compliment', 'decorate cake', 'housewarming party', 'session idea', 'make souffl', 'display painting', 'teacher lack']


In [20]:

cluster_2_keywords = df.loc[3, 'keywords']
print(cluster_2_keywords)

['leader project', 'idea rejected', 'think smart', 'join study', 'trust teacher', 'interview potential', 'skill lacking', 'receive raise', 'appreciate effort', 'duolingo polyglot']


##### The above visualisations show the main clusters, their keywords and the number of patient questions per cluster.
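
Because the summary is read back from CSV, the `keywords` column most likely holds each list as a string rather than as a Python list (an assumption based on the printed output above). A minimal sketch of turning such a string into a real list with the standard library's `ast.literal_eval`, using a shortened copy of the cluster 1 keywords:

```python
import ast

# Shortened copy of the string printed for cluster 1 above.
raw = "['peer unfit', 'people boring', 'married loser']"

# literal_eval safely parses Python literals without executing arbitrary code.
keywords = ast.literal_eval(raw)
print(type(keywords).__name__, len(keywords))
print(keywords[0])
```

With the column parsed this way, individual keywords can be counted or compared across clusters instead of being handled as one opaque string.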

##   Quantitative Metrics

In [21]:
import json

In [23]:
# Load the saved clustering metrics and pretty-print them
with open('./Clinical_Cluster_Outputs/cluster_metrics.json') as f:
    data = json.load(f)
print(json.dumps(data, indent=4))

{
    "cluster_stats": {
        "n_clusters": 3,
        "n_noise_points": 2162,
        "avg_cluster_size": 1298.3333333333333,
        "cluster_size_distribution": {
            "1": 3680,
            "0": 122,
            "2": 93
        }
    },
    "silhouette_score": 0.09763918817043304,
    "calinski_harabasz_score": 30.204289392289134,
    "davies_bouldin_score": 4.036162556673001
}


In [3]:
import json

# Load the metrics computed on both the PCA-reduced and the original embeddings
with open('/Users/samsonbobo/Desktop/submission folder/PCA_75/cluster_metrics.json') as f:
    data = json.load(f)
print(json.dumps(data, indent=4))

{
    "primary_space": "PCA_reduced",
    "cluster_stats": {
        "n_clusters": 3,
        "n_noise_points": 2162,
        "avg_cluster_size": 1298.3333333333333,
        "cluster_size_distribution": {
            "1": 3680,
            "0": 122,
            "2": 93
        }
    },
    "PCA_reduced": {
        "silhouette_score": 0.09763918817043304,
        "calinski_harabasz_score": 30.204289392289134,
        "davies_bouldin_score": 4.036162556673001
    },
    "original": {
        "silhouette_score": 0.12354118376970291,
        "calinski_harabasz_score": 63.54640908268058,
        "davies_bouldin_score": 3.2125429761815076
    },
    "distance_preservation": {
        "distance_correlation": 0.766060439585678
    }
}


##### The silhouette score is quite low for this dataset; a score above 0.5 is preferable. The reported score of 0.0976 was computed on the PCA-reduced embeddings. The score computed against the original embeddings, that is, the embeddings before PCA reduction, was slightly higher at 0.1235.
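
For reference, the silhouette coefficient of a point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster. The overall score is the mean over all points and lies between -1 and 1, with values near 0 indicating overlapping clusters. The pure-Python sketch below follows this standard definition on toy data. It is not the code that produced the numbers above, and it assumes Euclidean distance and at least two clusters.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances to the other members of the point's own cluster.
        same = [
            math.dist(p, q)
            for j, (q, lab) in enumerate(zip(points, labels))
            if lab == own and j != i
        ]
        if not same:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(same) / len(same)
        # Mean distance to each other cluster; keep the nearest one.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels)
            if other != own
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two well-separated toy clusters on a line (illustrative only).
points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
score = silhouette(points, labels)
print(round(score, 4))
```

Well-separated toy clusters like these score close to 1, which gives a sense of how far the reported values of roughly 0.10 to 0.12 are from a clean separation.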

#   SUB CLUSTERING VISUALISATIONS
#### Please check the sub_clustering.py script for the code steps

In [24]:
df7 = pd.read_csv('./cluster_analysis_output.csv')
df7.head()

Unnamed: 0,patient_question,Cluster,Main_Cluster_Label,Subcluster_Label,Cluster_Keywords,Subcluster_Keywords
0,I'm such a failure I never do anything right.,1,Peer Unfit People Boring,depressed,"peer unfit, people boring, married loser, inst...",depressed
1,My boss didn't say 'good morning' she must be...,1,Peer Unfit People Boring,bipolar,"peer unfit, people boring, married loser, inst...",bipolar
2,Nobody cares about me because they didn't ask...,1,Peer Unfit People Boring,don’t feel,"peer unfit, people boring, married loser, inst...",don’t feel
3,My partner didn't say 'I love you' today our ...,1,Peer Unfit People Boring,relationship,"peer unfit, people boring, married loser, inst...",relationship
4,My child misbehaved at school I must be a bad...,1,Peer Unfit People Boring,parents,"peer unfit, people boring, married loser, inst...",parents


In [25]:
import pandas as pd

# Load the data
df7 = pd.read_csv('./cluster_analysis_output.csv')

# Get unique clusters
unique_clusters = df7['Cluster'].unique()

# Display first 5 rows for each cluster
for cluster in unique_clusters:
    print(f"\n=== Cluster {cluster} ===")
    display(df7[df7['Cluster'] == cluster].head(5))


=== Cluster 1 ===


Unnamed: 0,patient_question,Cluster,Main_Cluster_Label,Subcluster_Label,Cluster_Keywords,Subcluster_Keywords
0,I'm such a failure I never do anything right.,1,Peer Unfit People Boring,depressed,"peer unfit, people boring, married loser, inst...",depressed
1,My boss didn't say 'good morning' she must be...,1,Peer Unfit People Boring,bipolar,"peer unfit, people boring, married loser, inst...",bipolar
2,Nobody cares about me because they didn't ask...,1,Peer Unfit People Boring,don’t feel,"peer unfit, people boring, married loser, inst...",don’t feel
3,My partner didn't say 'I love you' today our ...,1,Peer Unfit People Boring,relationship,"peer unfit, people boring, married loser, inst...",relationship
4,My child misbehaved at school I must be a bad...,1,Peer Unfit People Boring,parents,"peer unfit, people boring, married loser, inst...",parents



=== Cluster noise_cluster ===


Unnamed: 0,patient_question,Cluster,Main_Cluster_Label,Subcluster_Label,Cluster_Keywords,Subcluster_Keywords
3680,Nobody likes me because I'm not interesting.,noise_cluster,Boring Person Earn Peer,don,"boring person, earn peer, failure stylish, lis...",don
3681,I can't try new things because I'll just mess...,noise_cluster,Boring Person Earn Peer,don,"boring person, earn peer, failure stylish, lis...",don
3682,I didn't get the job so I must be incompetent.,noise_cluster,Boring Person Earn Peer,successful,"boring person, earn peer, failure stylish, lis...",successful
3683,I'm always unlucky. Good things only happen t...,noise_cluster,Boring Person Earn Peer,anxiety,"boring person, earn peer, failure stylish, lis...",anxiety
3684,Everyone thinks I'm stupid because I made a m...,noise_cluster,Boring Person Earn Peer,smart,"boring person, earn peer, failure stylish, lis...",smart



=== Cluster 0 ===


Unnamed: 0,patient_question,Cluster,Main_Cluster_Label,Subcluster_Label,Cluster_Keywords,Subcluster_Keywords
5842,My friend didn't invite me to the party I mus...,0,Friend Culinary Wedding Dislike,wasn invited,"friend culinary, wedding dislike, invitation e...",wasn invited
5843,I can't cook as well as my mom I'm not a good...,0,Friend Culinary Wedding Dislike,cook,"friend culinary, wedding dislike, invitation e...",cook
5844,I wasn't invited to the meeting my opinions d...,0,Friend Culinary Wedding Dislike,wasn invited,"friend culinary, wedding dislike, invitation e...",wasn invited
5845,I wasn't invited to their wedding they must d...,0,Friend Culinary Wedding Dislike,wedding,"friend culinary, wedding dislike, invitation e...",wedding
5846,My friends didn't include me in their plans t...,0,Friend Culinary Wedding Dislike,friends didn,"friend culinary, wedding dislike, invitation e...",friends didn



=== Cluster 2 ===


Unnamed: 0,patient_question,Cluster,Main_Cluster_Label,Subcluster_Label,Cluster_Keywords,Subcluster_Keywords
5964,My idea was rejected my team must think I'm d...,2,Leader Project Idea Rejected,think dumb,"leader project, idea rejected, think smart, jo...",think dumb
5965,I didn't get the promotion I will never progr...,2,Leader Project Idea Rejected,promotion,"leader project, idea rejected, think smart, jo...",promotion
5966,I wasn't promoted I'm a failure at my job.,2,Leader Project Idea Rejected,promoted,"leader project, idea rejected, think smart, jo...",promoted
5967,I wasn't selected for the project my ideas mu...,2,Leader Project Idea Rejected,project didn,"leader project, idea rejected, think smart, jo...",project didn
5968,I didn't get the raise my work is not appreci...,2,Leader Project Idea Rejected,raise,"leader project, idea rejected, think smart, jo...",raise


In [30]:
df8 = pd.read_csv('/Users/samsonbobo/Desktop/Final_Experiments/subcluster_distribution.csv')
df8

Unnamed: 0,Main_Cluster_Label,Subcluster_Label,Count
0,Boring Person Earn Peer,anxiety,204
1,Boring Person Earn Peer,boyfriend,68
2,Boring Person Earn Peer,can’t,67
3,Boring Person Earn Peer,didn,108
4,Boring Person Earn Peer,doesn’t,61
...,...,...,...
72,Peer Unfit People Boring,schizophrenia,86
73,Peer Unfit People Boring,schizophrenic,102
74,Peer Unfit People Boring,social anxiety,214
75,Peer Unfit People Boring,therapist,64
