# Word2vec Semantic Comparison between Subject Answers
Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand human language, both written and spoken [1]. Word representation techniques improve performance on NLP tasks by placing the vectors of similar words close together [2]. Mikolov et al. proposed the vector offset method to capture meaningful syntactic and semantic regularities [1, 2, 3, 4]. Word2Vec, a word embedding technique in NLP, was introduced by Mikolov et al. [3] and comes in two architectures: continuous skip-gram and continuous bag of words (CBOW). The skip-gram model learns high-quality distributed vector representations by predicting the words that surround a given word in a sentence or document [3]. CBOW, in contrast, predicts the current word from its surrounding context [3].
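To make the difference between the two architectures concrete, the following minimal sketch builds the training examples each one would see for a short sentence. The function names `skipgram_pairs` and `cbow_pairs`, the example sentence, and the window size are illustrative assumptions, not part of the original analysis; no model is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """Return (context_words, target) pairs: CBOW predicts the target word from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


# Hypothetical example sentence
sentence = "the child drew a face".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_pairs(sentence, window=1)
print(sg[:3])
print(cb[1])
```

Skip-gram produces one example per (center, neighbour) combination, while CBOW produces one example per position with all neighbours pooled into the input.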

The value of Word2Vec lies in its ability to create vector representations of word semantics on which simple algebraic operations can be run. An example from Mikolov et al. [3] is that vector("King") - vector("Man") + vector("Woman") is very similar to vector("Queen"). In research, word2vec models have been used to study the relationships between words in documents and have revealed biases based on gender and race, e.g. "Man is to computer programmer as woman is to homemaker?" [5] and "Black is to criminal as Caucasian is to police" [6]. In this study, Word2Vec was used to score the semantic similarity between the ground truth image labels (words for the target drawing) and the subject predictions (the children's responses). Compared with manual scoring of semantic similarity, Word2Vec allows us to reduce sources of bias. It also produces a continuous scale, which is valuable for regression analysis.
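The vector arithmetic can be illustrated with a toy example. The three-dimensional vectors below are invented by hand purely for illustration (real embeddings are learned and have hundreds of dimensions), and the helper names `cosine` and `analogy` are assumptions of this sketch.

```python
import math

# Hypothetical hand-made embeddings; the axes loosely stand for
# (royalty, masculinity, fruit-ness). These are NOT learned vectors.
toy = {
    "king":  [0.9, 0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1, 0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


result = analogy("king", "man", "woman", toy)
print(result)
```

Excluding the three input words from the candidates follows common practice for analogy queries; without that step the nearest vector is often one of the inputs.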

To score similarity between the ground truth labels and the subject predictions (originally in Italian), all answers were manually translated into English. We then filtered words by their part of speech to create a bag of words containing only nouns, proper nouns, adjectives, and verbs. Words were then embedded using a word2vec model trained on the following corpora: OntoNotes 5, ClearNLP Constituent-to-Dependency Conversion, and WordNet 3.0 (using spaCy [7], an open-source library for NLP; source: https://spacy.io/models/en#en_core_web_lg). The ground truth labels, which consist of individual words, were also vectorized. Each word in the subject's prediction was then compared against the ground truth, and the shortest Euclidean distance was reported as the semantic distance of the subject's prediction. Code to reproduce the exact analysis can be found in the supplementary resources: https://github.com/A-Telfer/neuropsychology-study-semantic-similarity


References
1. Khurana, D., Koli, A., Khatter, K., et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 82, 3713–3744 (2023).
2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546 (2013).
3. Mikolov, T., Chen, K., Corrado, G., Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 (2013).
4. Mikolov, T., Yih, W.-T., Zweig, G. Linguistic Regularities in Continuous Space Word Representations. NAACL-HLT (2013).
5. Bolukbasi, T., et al. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems 29 (2016).
6. Manzini, T., et al. Black is to criminal as Caucasian is to police: Detecting and removing multiclass bias in word embeddings. arXiv:1904.04047 (2019).
7. Jugran, S., Kumar, A., Tyagi, B. S., Anand, V. Extractive automatic text summarization using SpaCy in Python & NLP. In: 2021 International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), pp. 582–585 (2021).




## Download Model

In [2]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [1]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
! python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [3]:
import pandas as pd
import numpy as np
import spacy

## Data Cleaning



In [5]:
df = pd.read_excel("Rating Scale - English.xlsx", skiprows=3)
df = df.loc[:129]
df.tail(3)

Unnamed: 0,Group,Subject,Stimuli,Unnamed: 3,Answer given (expl 1),Answer given 2 (expl 2),expl.1,Unnamed: 7,expl.2,Unnamed: 9,Unnamed: 10
127,B,13,8,key,racket,racket,4.0,,4.0,,
128,B,13,9,lamp,home/house,home/house,4.0,,4.0,,
129,B,13,10,leaf,racket,racket,3.0,,3.0,,


### Rename Columns


In [6]:
df = df.rename(columns={
    'Group': 'group',
    'Subject': 'subject',
    'Stimuli': 'stimuli',
    'Unnamed: 3': 'ground_truth',
    'Answer given (expl 1)': 'prediction1',
    'Answer given 2 (expl 2)': 'prediction2',
    'expl.1 ': 'manual_similarity_score1',
    'expl.2': 'manual_similarity_score2'
})
df.head(2)

Unnamed: 0,group,subject,stimuli,ground_truth,prediction1,prediction2,manual_similarity_score1,Unnamed: 7,manual_similarity_score2,Unnamed: 9,Unnamed: 10
0,VI,1,1,face,face,face,5.0,,5.0,,
1,VI,1,4,person/figure,little person,little child,5.0,,5.0,,


In [7]:
df = df[df.columns.drop(['Unnamed: 7', 'Unnamed: 9', 'Unnamed: 10'])]
df.head(2)

Unnamed: 0,group,subject,stimuli,ground_truth,prediction1,prediction2,manual_similarity_score1,manual_similarity_score2
0,VI,1,1,face,face,face,5.0,5.0
1,VI,1,4,person/figure,little person,little child,5.0,5.0


### Correct Values

In [8]:
for name, _ in df.groupby(["stimuli", "ground_truth"]):
  print(name)

(1, 'face')
(2, 'bottle')
(3, 'cup')
(3, 'cup/mug')
(4, 'person')
(4, 'person/figure')
(5, 'telephone')
(6, 'umbrella')
(7, 'scissors')
(7, 'scissors ')
(8, 'key')
(9, 'lamp')
(10, 'leaf ')
(11, 'apple')
(12, 'shoe')
(13, 'crutch/cane')
(15, 'flower')
(16, 'hand')


In [9]:
df.loc[df.stimuli==2, 'ground_truth'] = 'bottle'
df.loc[df.stimuli==3, 'ground_truth'] = 'cup'
df.loc[df.stimuli==4, 'ground_truth'] = 'person'
df.loc[df.stimuli==7, 'ground_truth'] = 'scissors'

for name, _ in df.groupby(["stimuli", "ground_truth"]):
  print(name)

(1, 'face')
(2, 'bottle')
(3, 'cup')
(4, 'person')
(5, 'telephone')
(6, 'umbrella')
(7, 'scissors')
(8, 'key')
(9, 'lamp')
(10, 'leaf ')
(11, 'apple')
(12, 'shoe')
(13, 'crutch/cane')
(15, 'flower')
(16, 'hand')
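The listing above still contains `'leaf '` with a trailing space, because only stimuli 2, 3, 4 and 7 were overwritten by hand. One possible way to catch such cases generally is a small whitespace normaliser; `normalize_label` is a hypothetical helper, not part of the original notebook, and the sample labels are taken from the output shown earlier.

```python
def normalize_label(label):
    """Strip leading/trailing whitespace and collapse internal runs of whitespace."""
    return " ".join(str(label).split())


# Labels as they appear in the raw spreadsheet output above
labels = ["scissors ", "leaf ", "cup/mug", "person/figure"]
cleaned = [normalize_label(x) for x in labels]
print(cleaned)
```

In the notebook this could be applied with `df["ground_truth"] = df["ground_truth"].apply(normalize_label)`. It only addresses whitespace; merging variants such as `'cup/mug'` into `'cup'` still needs an explicit decision like the assignments above.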


In [10]:
df.ground_truth.unique()

array(['face', 'person', 'umbrella', 'key', 'lamp', 'shoe', 'crutch/cane',
       'bottle', 'leaf ', 'cup', 'telephone', 'scissors', 'apple',
       'flower', 'hand'], dtype=object)

In [11]:
sorted(df.subject.unique())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

In [12]:
df.group.unique()

array(['VI', 'B', nan], dtype=object)

In [13]:
df[df.group.isna()]

Unnamed: 0,group,subject,stimuli,ground_truth,prediction1,prediction2,manual_similarity_score1,manual_similarity_score2
80,,7,1,face,portrait of a person,portrait,5.0,5.0


In [14]:
df = df.fillna('')

## POS Tagging and Filtering

In [16]:
# nlp = spacy.load("en_core_web_trf")
nlp = spacy.load("en_core_web_lg")

# Keep only nouns, adjectives, verbs, and proper nouns
pos_tags = ["NOUN", "ADJ", "VERB", "PROPN"]
df["tokenized_ground_truth"] = df.ground_truth.apply(nlp)
df["tokenized_ground_truth"] = df.tokenized_ground_truth.apply(
    lambda x: [w for w in x if w.pos_ in pos_tags])

df["tokenized_prediction1"] = df.prediction1.apply(nlp)
df["tokenized_prediction1"] = df.tokenized_prediction1.apply(
    lambda x: [w for w in x if w.pos_ in pos_tags])

df["tokenized_prediction2"] = df.prediction2.apply(nlp)
df["tokenized_prediction2"] = df.tokenized_prediction2.apply(
    lambda x: [w for w in x if w.pos_ in pos_tags])

df

Unnamed: 0,group,subject,stimuli,ground_truth,prediction1,prediction2,manual_similarity_score1,manual_similarity_score2,tokenized_ground_truth,tokenized_prediction1,tokenized_prediction2
0,VI,1,1,face,face,face,5.0,5.0,[face],[face],[face]
1,VI,1,4,person,little person,little child,5.0,5.0,[person],"[little, person]","[little, child]"
2,VI,1,6,umbrella,handle and some kind of bend/curve,umbrella,2.0,0.0,[umbrella],"[handle, kind, bend, curve]",[umbrella]
3,VI,1,8,key,I don't know,key,5.0,5.0,[key],[know],[key]
4,VI,1,9,lamp,I don't know,container,0.0,4.0,[lamp],[know],[container]
...,...,...,...,...,...,...,...,...,...,...,...
125,B,13,6,umbrella,umbrella,umbrella,5.0,5.0,[umbrella],[umbrella],[umbrella]
126,B,13,7,scissors,tree,tree,2.0,2.0,[scissors],[tree],[tree]
127,B,13,8,key,racket,racket,4.0,4.0,[key],[racket],[racket]
128,B,13,9,lamp,home/house,home/house,4.0,4.0,[lamp],"[home, house]","[home, house]"
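The same filter is applied three times in the cell above, so it could be factored into one helper. The sketch below assumes only that each token exposes a `pos_` attribute, as spaCy tokens do; `keep_content_tokens`, `CONTENT_POS`, and the `FakeToken` stand-in (used so the example runs without downloading a model) are hypothetical names, and the tags on the stand-in tokens are assigned by hand rather than predicted.

```python
from collections import namedtuple

# Coarse part-of-speech tags kept in the analysis
CONTENT_POS = {"NOUN", "ADJ", "VERB", "PROPN"}


def keep_content_tokens(tokens, allowed=CONTENT_POS):
    """Keep only tokens whose coarse part-of-speech tag (`pos_`) is in `allowed`."""
    return [t for t in tokens if t.pos_ in allowed]


# Stand-in for spaCy tokens; tags are hand-assigned for illustration only
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
phrase = [
    FakeToken("little", "ADJ"),
    FakeToken("person", "NOUN"),
    FakeToken("and", "CCONJ"),
]
kept = [t.text for t in keep_content_tokens(phrase)]
print(kept)
```

With real data, each column could then be processed as `df[col].apply(nlp).apply(keep_content_tokens)`.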


## Vector Distances

In [17]:
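def calculate_closest_distance_vectors(s1, s2):
  """Return the smallest Euclidean distance between any token in s1 and any token in s2.

  Returns -1 if either token list is empty.
  """
  nearest = -1  # sentinel: no pair has been compared yet
  for w1 in s1:
    v1 = w1.vector
    for w2 in s2:
      v2 = w2.vector
      d = np.linalg.norm(v1-v2, ord=2)  # Euclidean (L2) distance
      if nearest == -1 or d < nearest:
        nearest = d

  return nearest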

for idx, row in df.iterrows():
  df.loc[idx, 'distance_predication1'] = calculate_closest_distance_vectors(
      row.tokenized_ground_truth, row.tokenized_prediction1)
  
  df.loc[idx, 'distance_predication2'] = calculate_closest_distance_vectors(
      row.tokenized_ground_truth, row.tokenized_prediction2)


In [19]:
df[['group', 'subject', 'stimuli', 'ground_truth', 'prediction1', 'distance_predication1', 'prediction2', 'distance_predication2']]

Unnamed: 0,group,subject,stimuli,ground_truth,prediction1,distance_predication1,prediction2,distance_predication2
0,VI,1,1,face,face,0.000000,face,0.000000
1,VI,1,4,person,little person,0.000000,little child,47.189495
2,VI,1,6,umbrella,handle and some kind of bend/curve,44.875175,umbrella,0.000000
3,VI,1,8,key,I don't know,88.529228,key,0.000000
4,VI,1,9,lamp,I don't know,70.088333,container,55.279503
...,...,...,...,...,...,...,...,...
125,B,13,6,umbrella,umbrella,0.000000,umbrella,0.000000
126,B,13,7,scissors,tree,64.222961,tree,64.222961
127,B,13,8,key,racket,82.567841,racket,82.567841
128,B,13,9,lamp,home/house,64.835320,home/house,64.835320
