## In the code below we perform the following steps

- Read in the item text data (https://chatgpt.com/share/66fa7c2a-101c-800b-88a5-7334934a995d)
- Calculate item embeddings
- Reverse item embeddings if necessary (there are no reverse-keyed items here, and this approach may be suboptimal; for reverse-keyed items we could instead use a fine-tuned model, as in Hommell (2024))
- Compute cosine similarities
- Store results
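
To make the reversal step concrete, here is a minimal sketch with toy 3-dimensional vectors (made-up values, not actual model output). It shows that multiplying an embedding by -1 flips the sign of its cosine similarity with another vector while leaving the magnitude unchanged. The helper `cosine` is defined here for illustration only and is not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy "embeddings" (made-up numbers, for illustration only)
item_a = np.array([0.2, -0.5, 0.1])
item_b = np.array([0.3, -0.4, 0.0])

sim = cosine(item_a, item_b)
sim_reversed = cosine(item_a, -item_b)  # item_b treated as negatively keyed

print(sim, sim_reversed)
```

Sign flipping is a purely geometric operation, so it cannot capture how negation changes the meaning of an item, which is one reason the approach may be suboptimal.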

### 1- Read in the item text data

In [1]:
import pandas as pd
import numpy as np
# read in file with items text etc.
df_items = pd.read_csv('./Data/dass_21_items_mod_2.csv')
df_items.head()

Unnamed: 0,Number,Factor,Item,Item_simp,Item_mod,Sign
0,1,Depression,I couldn't seem to experience any positive fee...,when feeling depressed I Couldn't seem to expe...,Depression is characterised principally by a l...,+
1,2,Depression,I found it difficult to work up the initiative...,when feeling depressed I Found it difficult to...,Depression is characterised principally by a l...,+
2,3,Depression,I felt that I had nothing to look forward to.,when feeling depressed I Felt that I had nothi...,Depression is characterised principally by a l...,+
3,4,Depression,I felt down-hearted and blue.,when feeling depressed I Felt down-hearted and...,Depression is characterised principally by a l...,+
4,5,Depression,I was unable to become enthusiastic about anyt...,when feeling depressed I was Unable to become ...,Depression is characterised principally by a l...,+


### 2- Calculate embeddings (and reverse code if necessary)

In [4]:
# First we create a list of models (most of these are English-only; LaBSE is multilingual)
models = ['nli-distilroberta-base-v2',
          'all-mpnet-base-v2',
          'sentence-transformers/all-MiniLM-L6-v2',
          'intfloat/e5-large-v2',
          'LaBSE'] #consider adding the finetuned model for psicometrista

# Import the necessary libraries and functions
from sentence_transformers import SentenceTransformer, util

# Create an empty data frame, which we will then populate with the different type of embeddings
facet_embeddings_sentences = pd.DataFrame()

for mod in models:
    model = SentenceTransformer(mod)  # load the model
    # encode every item once, in a single batch: array of shape (n_items, embedding_dim)
    embeddings = model.encode(df_items['Item'].tolist())
    # one multiplier per item: -1 if the item is negatively keyed, +1 otherwise
    signs = np.where(df_items['Sign'].str.startswith('-'), -1, 1).astype(embeddings.dtype)
    # store one embedding vector per item, in columns named after the model
    df_items[mod + '_embeddings'] = list(embeddings)
    df_items[mod + '_embeddings_rev'] = list(embeddings * signs[:, None])
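
The reversal rule used in the cell above (multiply by -1 when `Sign` starts with '-') can also be applied to all items at once. The sketch below uses made-up embeddings and a hypothetical list `sign_labels`, since the real data contain no negatively keyed items.

```python
import numpy as np

# toy item embeddings: 3 items x 2 dimensions (made-up numbers)
embeddings = np.array([[0.5, -0.1],
                       [0.2, 0.4],
                       [-0.3, 0.6]])
# hypothetical keying for the three items; the second one is reverse-keyed
sign_labels = ['+', '-', '+']

# one multiplier per item: -1 for negatively keyed items, +1 otherwise
multipliers = np.array([-1.0 if s.startswith('-') else 1.0 for s in sign_labels])

# broadcasting applies each multiplier to the whole row
embeddings_rev = embeddings * multipliers[:, None]
print(embeddings_rev)
```

Only the second row changes sign; positively keyed rows are left untouched.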





In [5]:
df_items

Unnamed: 0,Number,Factor,Item,Item_simp,Item_mod,Sign,nli-distilroberta-base-v2_embeddings,nli-distilroberta-base-v2_embeddings_rev,all-mpnet-base-v2_embeddings,all-mpnet-base-v2_embeddings_rev,paraphrase-multilingual-MiniLM-L12-v2_embeddings,paraphrase-multilingual-MiniLM-L12-v2_embeddings_rev,sentence-transformers/all-MiniLM-L6-v2_embeddings,sentence-transformers/all-MiniLM-L6-v2_embeddings_rev,intfloat/e5-large-v2_embeddings,intfloat/e5-large-v2_embeddings_rev,LaBSE_embeddings,LaBSE_embeddings_rev
0,1,Depression,I couldn't seem to experience any positive fee...,when feeling depressed I Couldn't seem to expe...,Depression is characterised principally by a l...,+,"[-0.41798195, -0.60875887, 0.41989467, -0.3381...","[-0.41798195, -0.60875887, 0.41989467, -0.3381...","[-0.02760756, 0.036531188, -0.013363765, -0.01...","[-0.02760756, 0.036531188, -0.013363765, -0.01...","[-0.046919886, -0.109551735, 0.5276527, 0.1468...","[-0.046919886, -0.109551735, 0.5276527, 0.1468...","[-0.017120127, 0.0042126314, -0.005668672, 0.0...","[-0.017120127, 0.0042126314, -0.005668672, 0.0...","[0.010148046, -0.046404377, 0.022939976, -0.01...","[0.010148046, -0.046404377, 0.022939976, -0.01...","[-0.027827162, 0.05928395, -0.06475832, -0.049...","[-0.027827162, 0.05928395, -0.06475832, -0.049..."
1,2,Depression,I found it difficult to work up the initiative...,when feeling depressed I Found it difficult to...,Depression is characterised principally by a l...,+,"[-0.26888198, -0.49430892, 1.0279641, 0.863048...","[-0.26888198, -0.49430892, 1.0279641, 0.863048...","[0.022510586, 0.035619434, -0.0060765897, -0.0...","[0.022510586, 0.035619434, -0.0060765897, -0.0...","[0.23311327, 0.057133377, 0.2334384, 0.2777349...","[0.23311327, 0.057133377, 0.2334384, 0.2777349...","[-0.0026095035, 0.01844399, 0.053614903, 0.005...","[-0.0026095035, 0.01844399, 0.053614903, 0.005...","[0.029792955, -0.025565542, 0.023121811, -0.01...","[0.029792955, -0.025565542, 0.023121811, -0.01...","[0.034038432, -0.004301942, 0.0059101717, -0.0...","[0.034038432, -0.004301942, 0.0059101717, -0.0..."
2,3,Depression,I felt that I had nothing to look forward to.,when feeling depressed I Felt that I had nothi...,Depression is characterised principally by a l...,+,"[-0.71781987, -0.48583412, 0.17099401, -0.5182...","[-0.71781987, -0.48583412, 0.17099401, -0.5182...","[0.00654586, 0.06515021, -0.00472298, -0.03674...","[0.00654586, 0.06515021, -0.00472298, -0.03674...","[-0.0048233764, 0.12943742, 0.3204668, 0.16019...","[-0.0048233764, 0.12943742, 0.3204668, 0.16019...","[-0.02430578, 0.01588752, 0.020071963, -0.0402...","[-0.02430578, 0.01588752, 0.020071963, -0.0402...","[0.02525443, -0.04839567, 0.010145982, -0.0148...","[0.02525443, -0.04839567, 0.010145982, -0.0148...","[0.034416497, 0.03162081, -0.059108265, -0.056...","[0.034416497, 0.03162081, -0.059108265, -0.056..."
3,4,Depression,I felt down-hearted and blue.,when feeling depressed I Felt down-hearted and...,Depression is characterised principally by a l...,+,"[-0.3016205, 0.2141138, -0.23003463, -0.434639...","[-0.3016205, 0.2141138, -0.23003463, -0.434639...","[0.012996118, -0.01634062, -0.008537047, -0.05...","[0.012996118, -0.01634062, -0.008537047, -0.05...","[0.34807536, 0.24143778, 0.23827362, 0.4109705...","[0.34807536, 0.24143778, 0.23827362, 0.4109705...","[0.008963004, 0.0363882, 0.03991589, -0.016160...","[0.008963004, 0.0363882, 0.03991589, -0.016160...","[0.033449724, -0.054883257, 0.02414822, -0.023...","[0.033449724, -0.054883257, 0.02414822, -0.023...","[-0.05726779, 0.055614337, -0.0008938348, -0.0...","[-0.05726779, 0.055614337, -0.0008938348, -0.0..."
4,5,Depression,I was unable to become enthusiastic about anyt...,when feeling depressed I was Unable to become ...,Depression is characterised principally by a l...,+,"[-0.13796349, -0.31406143, 0.6003001, -0.60157...","[-0.13796349, -0.31406143, 0.6003001, -0.60157...","[0.011471874, 0.05111288, 0.0038501397, -0.004...","[0.011471874, 0.05111288, 0.0038501397, -0.004...","[0.3442108, 0.20397158, 0.388488, 0.24750763, ...","[0.3442108, 0.20397158, 0.388488, 0.24750763, ...","[0.10776935, 0.036098458, -0.012815361, 0.0460...","[0.10776935, 0.036098458, -0.012815361, 0.0460...","[0.027257938, -0.056715615, 0.016552718, -0.01...","[0.027257938, -0.056715615, 0.016552718, -0.01...","[0.018410865, 0.013755682, -0.064998284, -0.05...","[0.018410865, 0.013755682, -0.064998284, -0.05..."
5,6,Depression,I felt I wasn't worth much as a person.,when feeling depressed I Felt I wasn't worth m...,Depression is characterised principally by a l...,+,"[-0.25464723, -0.09043264, -0.17217383, -0.726...","[-0.25464723, -0.09043264, -0.17217383, -0.726...","[0.04429704, 0.06514896, 0.04807512, -0.010690...","[0.04429704, 0.06514896, 0.04807512, -0.010690...","[0.23760414, 0.45704395, 0.088960394, 0.076410...","[0.23760414, 0.45704395, 0.088960394, 0.076410...","[-0.0071838843, 0.058315855, 0.013659423, 0.04...","[-0.0071838843, 0.058315855, 0.013659423, 0.04...","[0.03583347, -0.0530934, 0.035658818, -0.04180...","[0.03583347, -0.0530934, 0.035658818, -0.04180...","[-0.018875614, 0.035567682, -0.0649392, -0.059...","[-0.018875614, 0.035567682, -0.0649392, -0.059..."
6,7,Depression,I felt that life was meaningless.,when feeling depressed I Felt that life was me...,Depression is characterised principally by a l...,+,"[-0.18967095, 0.44039482, -0.13778427, -0.0134...","[-0.18967095, 0.44039482, -0.13778427, -0.0134...","[0.054153547, 0.059044078, 0.022874115, -0.030...","[0.054153547, 0.059044078, 0.022874115, -0.030...","[0.13601482, 0.15952863, 0.25153095, 0.2566630...","[0.13601482, 0.15952863, 0.25153095, 0.2566630...","[-0.012068329, 0.05834761, 0.012975891, -0.012...","[-0.012068329, 0.05834761, 0.012975891, -0.012...","[0.014748435, -0.058268595, 0.015577694, -0.02...","[0.014748435, -0.058268595, 0.015577694, -0.02...","[-0.02286053, -0.0254926, -0.040635087, -0.050...","[-0.02286053, -0.0254926, -0.040635087, -0.050..."
7,8,Anxiety,I was aware of dryness of my mouth.,when feeling anxious I Aware of dryness of my ...,Anxiety is a relatively enduring state of anxi...,+,"[0.19624288, -0.37668458, -0.31101617, -0.2077...","[0.19624288, -0.37668458, -0.31101617, -0.2077...","[0.017764352, -0.05060164, 0.0033838844, -0.02...","[0.017764352, -0.05060164, 0.0033838844, -0.02...","[-0.08713848, -0.09587914, 0.3720982, 0.025985...","[-0.08713848, -0.09587914, 0.3720982, 0.025985...","[0.018174078, -0.0649536, -0.007891204, 0.0775...","[0.018174078, -0.0649536, -0.007891204, 0.0775...","[0.029176606, -0.06568839, 0.038887754, -0.019...","[0.029176606, -0.06568839, 0.038887754, -0.019...","[-0.014379178, -0.002829763, 0.0026872945, -0....","[-0.014379178, -0.002829763, 0.0026872945, -0...."
8,9,Anxiety,"I experienced breathing difficulty (e.g., exce...",when feeling anxious I experienced breathing d...,Anxiety is a relatively enduring state of anxi...,+,"[0.36264697, 0.09286714, 0.57524854, 0.0873215...","[0.36264697, 0.09286714, 0.57524854, 0.0873215...","[-0.008340977, -0.07560027, 0.008144175, -0.00...","[-0.008340977, -0.07560027, 0.008144175, -0.00...","[0.18178391, 0.19071576, 0.027461434, 0.458286...","[0.18178391, 0.19071576, 0.027461434, 0.458286...","[0.074580126, -0.001968827, -0.049601153, 0.07...","[0.074580126, -0.001968827, -0.049601153, 0.07...","[0.00795421, -0.06935591, 0.014716602, -0.0436...","[0.00795421, -0.06935591, 0.014716602, -0.0436...","[-0.047368508, 0.045650344, -0.049365725, -0.0...","[-0.047368508, 0.045650344, -0.049365725, -0.0..."
9,10,Anxiety,"I experienced trembling (e.g., in the hands).",when feeling anxious I Experienced trembling (...,Anxiety is a relatively enduring state of anxi...,+,"[0.3665608, -0.13394088, 0.12258482, -0.058099...","[0.3665608, -0.13394088, 0.12258482, -0.058099...","[-0.012785021, -0.04524491, -0.001446473, -0.0...","[-0.012785021, -0.04524491, -0.001446473, -0.0...","[0.034209885, 0.08235034, 0.2967834, 0.5630126...","[0.034209885, 0.08235034, 0.2967834, 0.5630126...","[0.021906491, -0.031833522, 0.052289005, 0.057...","[0.021906491, -0.031833522, 0.052289005, 0.057...","[0.03377252, -0.061833408, 0.026309539, -0.039...","[0.03377252, -0.061833408, 0.026309539, -0.039...","[-0.027833354, -0.004120173, -0.014727401, -0....","[-0.027833354, -0.004120173, -0.014727401, -0...."


### 3- Compute cosine similarities and store the data

In [6]:
# To avoid overly long names for the output datasets, we create a list of short names, which we then use when saving the cosine similarity matrices.
# Make sure the names here are meaningful and in the same order as the models in the cell above.
model_short = ['distilroberta', 'mpnet', 'miniLM', 'e5', 'labse']

# Below, we loop over the models used in the study and compute one cosine similarity matrix per model.
for mod, short_name in zip(models, model_short):
  # stack the per-item vectors into a single (n_items, embedding_dim) array
  item_embeddings = np.stack(df_items[mod + '_embeddings'].tolist())

  # cosine similarity between every pair of items
  cosine_similarities_item = util.cos_sim(item_embeddings, item_embeddings).numpy()

  # there are no reversed items here, so the '_embeddings_rev' columns are not needed

  # set the diagonal to exactly 1 so that EFA functions do not read the cosine matrix as a covariance matrix
  np.fill_diagonal(cosine_similarities_item, 1)

  # store results
  item_labels = df_items['Item_simp'].unique()
  pd.DataFrame(cosine_similarities_item, columns = item_labels, index = item_labels).to_csv('./Data/cos_matrices/matrix_concatenated_item_' + short_name + '.csv', index = False)
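
For readers who want to see what the cosine similarity step computes, here is a minimal NumPy-only sketch on made-up embeddings. The function `cosine_matrix` is a hypothetical helper written for this illustration; it is not the sentence-transformers implementation, although it should give the same values up to floating-point error.

```python
import numpy as np

def cosine_matrix(embeddings):
    """Pairwise cosine similarities between the rows of a 2-D array."""
    emb = np.asarray(embeddings, dtype=np.float64)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # scale rows to length 1
    return unit @ unit.T

# toy embeddings: 3 items x 4 dimensions (made-up numbers)
toy = np.array([[0.1, 0.3, -0.2, 0.5],
                [0.0, 0.4, -0.1, 0.6],
                [-0.3, 0.1, 0.7, -0.2]])

sim = cosine_matrix(toy)
np.fill_diagonal(sim, 1)  # force exact ones on the diagonal, as in the loop above
print(np.round(sim, 3))
```

The result is symmetric with a unit diagonal, which is the shape a correlation-style input to a factor analysis routine is expected to have.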


