## Onomasiological search

The goal of this project is to find the most appropriate word for a given definition. To this end, we try four approaches:
1. Explore all WordNet synsets and find the synset with the maximum overlap between the definition signature and the synset signature.
2. Starting from the most general noun synset, explore the WordNet graph, following the branch with the maximum similarity between the embedded definition signature and the embedded synset signature.
3. Measure the similarity between the word embeddings in the SpaCy vocabulary and the definition embedding, and return the SpaCy word with the highest similarity.
4. Get the most frequent words in the definition (mfw), collect their synsets together with those synsets' hyponyms and hypernyms (mfw_synsets), compute the signature of each (mfw_synsets_signature), and return the synset in mfw_synsets whose signature best overlaps the definition signature.

In [24]:
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn, stopwords
from nltk.corpus.reader import Synset
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import spacy
from spacy.tokens import Doc

from collections import Counter
from typing import Dict, List, Tuple

In [25]:
nltk.download('wordnet')
# clean_sentence below also needs the NLTK stop word list:
# nltk.download('stopwords')
# spacy.cli.download("en_core_web_md")
embedder = spacy.load("en_core_web_md")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Gianl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Load definitions

In [26]:
definitions = pd.read_csv('resources/definitions.tsv', sep='\t')

# remove index from the dataframe (for each row it is the first element)
definitions = definitions.iloc[:, 1:]
definitions.head()

| Unnamed: 0 | door | ladybug | pain | blurriness |
|---|---|---|---|---|
| 0 | A construction used to divide two rooms, tempo... | small flying insect, typically red with black ... | A feeling of physical or mental distress | sight out of focus |
| 1 | It's an opening, it can be opened or closed. | It is an insect, it has wings, red with black ... | It is a feeling, physical or emotional. It is ... | It is the absence of definite borders, shapele... |
| 2 | An object that divide two room, closing an hol... | An insect that can fly. It has red or orange c... | A felling that couscious beings can experince ... | A sensation felt when you can't see clearly th... |
| 3 | Usable for access from one area to another | Small insect with a red back | Concept that describes a suffering living being | Lack of sharpness |
| 4 | Structure that delimits an area and allows acc... | Small round flying insect | Feeling of physical discomfort | Characteristic of lack of clarity or precision |


In [27]:
# convert the dataframe to a dictionary for easier access
definitions_dict: Dict[str, List[str]] = {}
for column in definitions.columns:
    definitions_dict[column] = definitions[column].tolist()

In [28]:
# print every word and one of its definitions
for word in definitions_dict:
    print(f'- {word.upper()}: \n\t{definitions_dict[word][0]}')

- DOOR: 
	A construction used to divide two rooms, temporarily closing the passage between them
- LADYBUG: 
	small flying insect, typically red with black spots with six legs
- PAIN: 
	A feeling of physical or mental distress
- BLURRINESS: 
	sight out of focus


### Function to calculate the signature of a definition

The signature of a definition is the list of words it contains after lowercasing, removing punctuation and stop words, and lemmatizing what remains.

In [29]:
lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')
# build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

def clean_sentence(sentence: str) -> List[str]:
    # convert to lower case, then tokenize on word characters (this drops punctuation)
    word_list = tokenizer.tokenize(sentence.lower())
    # remove stop words using the nltk stop words list
    word_list = [word for word in word_list if word not in stop_words]
    # lemmatize the words
    word_list = [lemmatizer.lemmatize(word) for word in word_list]
    return word_list

### Function to get the synsets of the most frequent words in a definition

Extract the synsets of the 10 most frequent words in a definition, keeping only the words whose first WordNet synset is a noun. The definition is passed in as a signature, as computed by the `clean_sentence` function.

In [30]:
def get_synsets_for_word(signature: List[str]) -> List[Synset]:
    # calculate words frequency for a definition signature
    word_frequency = Counter(signature)
    most_common_words: List[Tuple] = word_frequency.most_common(10)
    
    # get only nouns from most common words
    most_common_nouns = [word_tuple for word_tuple in most_common_words if 
                         wn.synsets(word_tuple[0]) and wn.synsets(word_tuple[0])[0].pos() == 'n']
    
    print(most_common_nouns) # TODO: remove
    
    # get the first synset for each noun
    synsets: List[Synset] = [wn.synsets(word_tuple[0])[0] for word_tuple in most_common_nouns]
    return synsets

Example: get synsets starting from the first definition of the word 'door'

In [31]:
door_def_1 = definitions_dict['door'][0]
signature = clean_sentence(door_def_1)
print(signature)
get_synsets_for_word(signature)

['construction', 'used', 'divide', 'two', 'room', 'temporarily', 'closing', 'passage']
[('construction', 1), ('divide', 1), ('two', 1), ('room', 1), ('closing', 1), ('passage', 1)]


[Synset('construction.n.01'),
 Synset('divide.n.01'),
 Synset('two.n.01'),
 Synset('room.n.01'),
 Synset('shutting.n.01'),
 Synset('passage.n.01')]
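
In the example above every word occurs exactly once, so the frequency ranking carries no information and the selection is decided by word order alone. The following is a minimal sketch of that behaviour using only the standard library; the `signature` list is copied from the output above, and the `repeated` list is an invented variation. Since Python 3.7, `Counter.most_common` returns equally frequent items in order of first occurrence.

```python
from collections import Counter

# signature copied from the clean_sentence output above
signature = ['construction', 'used', 'divide', 'two', 'room',
             'temporarily', 'closing', 'passage']

# all counts are 1, so the "top" words are simply the first ones seen
top_three = Counter(signature).most_common(3)
print(top_three)  # [('construction', 1), ('used', 1), ('divide', 1)]

# invented variation: once words repeat, the ranking becomes informative
repeated = signature + ['room', 'room', 'passage']
print(Counter(repeated).most_common(2))  # [('room', 3), ('passage', 2)]
```

This suggests that frequency-based selection is more useful on longer inputs, such as several definitions joined together, than on a single short definition.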

### Testing method
Takes a callable and applies it to every definition of every word, printing the result for each definition so that it can be compared with the original word.

In [32]:
def test_method(method):
    for word in definitions_dict:
        print(f'- {word.upper()}:')
        for definition in definitions_dict[word]:
            print(f'\t- {definition}')
            word_found = method(definition)
            print(f'\t\t- {word_found}')

### Approach 1

Explore all WordNet synsets and find the synset with the maximum overlap between the definition signature and the synset signature.

Note: this approach is very slow and not recommended. Part of the results are saved in the file `results/approach_1.txt`

In [33]:
def get_synset_signature(synset: Synset) -> List[str]:
    """
    Get the signature of a synset, which is the concatenation of the gloss and examples of the synset after cleaning (lowercase, lemmatization, punctuation removal, and stop words removal)
    :param synset: the synset for which to get the signature
    :return: the signature of the synset in the form of a list of words
    """
    
    gloss: List[str] = synset.definition().split()
    examples: List[str] = " ".join(synset.examples()).split()
    synset_signature = clean_sentence(" ".join(gloss + examples))
    return synset_signature

def approach_1(definition: str) -> Synset:
    """
    Get the synset with the maximum overlap between the definition signature and the synset signature in WordNet.
    :param definition: the definition for which to find the synset 
    :return: the synset with the maximum overlap
    """
    
    max_overlap = 0
    max_overlap_synset = None
    signature = clean_sentence(definition)
    # for each synset in wordnet
    for synset in wn.all_synsets():
        # get the signature of the synset
        synset_signature = get_synset_signature(synset)
        # calculate the overlap between the definition signature and the synset signature
        overlap = len(set(signature).intersection(set(synset_signature)))
        # if the overlap is greater than the maximum overlap, update the maximum overlap and the synset
        if overlap > max_overlap:
            max_overlap = overlap
            max_overlap_synset = synset
    return max_overlap_synset

In [34]:
# test_method(approach_1)
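
The overlap score used by `approach_1` can be tried without loading WordNet. The sketch below is only an illustration: `toy_signatures` and `best_overlap` are hypothetical names, and the signatures are hand-written stand-ins for cleaned WordNet glosses, not actual WordNet data.

```python
from typing import Dict, List, Optional

# hypothetical, hand-written signatures standing in for cleaned WordNet glosses
toy_signatures: Dict[str, List[str]] = {
    'door': ['swinging', 'sliding', 'barrier', 'close', 'entrance', 'room', 'building'],
    'wall': ['architectural', 'partition', 'divide', 'enclose', 'area'],
    'window': ['framework', 'wood', 'metal', 'glass', 'admit', 'light', 'air'],
}

def best_overlap(definition_signature: List[str]) -> Optional[str]:
    """Return the key whose signature shares the most words with the input."""
    best_key, best_score = None, 0
    target = set(definition_signature)
    for key, signature in toy_signatures.items():
        score = len(target & set(signature))
        # strict '>' keeps the first candidate on ties and returns None
        # when nothing overlaps, mirroring the loop in approach_1
        if score > best_score:
            best_key, best_score = key, score
    return best_key

print(best_overlap(['barrier', 'close', 'room']))  # door
print(best_overlap(['banana']))                    # None
```

Because the score is a raw count, candidates with long glosses and many examples have more chances to match. Normalising by signature length would be one possible variation, which is not what the notebook does.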

### Approach 2

Starting from the most general noun synset, explore the WordNet graph, following the branch with the maximum similarity between the embedded definition signature and the embedded synset signature.

In [35]:
def embed_sentence(sentence: str) -> Doc:
    """
    Embed a sentence using the SpaCy model
    :param sentence: the sentence to embed as a string
    :return: the embedded sentence as a SpaCy Doc
    """
    sentence = " ".join(clean_sentence(sentence))
    return embedder(sentence)

In [36]:
def get_synset_embedding(synset: Synset) -> Doc:
    """
    Get the embedding of a synset by concatenating the lemma names, gloss, and examples of the synset and embedding the resulting text
    :param synset: the WordNet Synset for which to get the embedding
    :return: the embedding of the synset as a SpaCy Doc
    """
    # Concatenate the lemma names to form a text representation of the synset
    synset_lemmas = ' '.join(lemma.name().replace('_', ' ') for lemma in synset.lemmas())
    synset_gloss = synset.definition()
    synset_examples = ' '.join(example for example in synset.examples())
    synset_signature = synset_lemmas + ' ' + synset_gloss + ' ' + synset_examples

    # Create a Doc from the synset text
    synset_doc = embedder(" ".join(clean_sentence(synset_signature)))

    return synset_doc

Quick check of the similarity between the embedding of the word 'door' and the embedding of the synset 'door.n.01'. The similarity should be high, because the graph exploration relies on a synset embedding being close to the embedding of the word it denotes.

In [37]:
door_synset = wn.synsets('door')[0]
door_synset_embedding = get_synset_embedding(door_synset)
door_word_embedding = embedder('door')
similarity = door_synset_embedding.similarity(door_word_embedding)

# cosine similarity between the word and the synset
print(f"Similarity between word embedding and synset embedding {similarity}")

Similarity between word embedding and synset embedding 0.8198888470468051


In [38]:
def approach_2(definition: str) -> Synset:
    """
    Find the best synset for a given definition by exploring the WordNet graph starting from the most general synset of a noun (entity.n.01) and following the branch with the highest similarity between the definition embedding and the synset embedding.
    :param definition: the definition for which to find the best synset
    :return: the synset for which there is the highest similarity between the definition embedding and the synset embedding
    """
    
    # Compute the target definition embedding
    target_doc = embed_sentence(definition)

    # synset for the current node in the graph, initialized to the most general synset of a noun
    current_synset = wn.synset('entity.n.01')
    # flag to check if the current synset is a leaf
    reached_leaf = False
    # holds the highest similarity between the hyponyms of current synset and the target definition
    highest_similarity = 0
    # similarity between the previous best synset and the target definition
    previous_similarity = -1
    # similarity between the current best synset and the target definition 
    current_similarity = 0
    # hyponym with the highest similarity to the target definition
    best_hyponym = None
    
    # we look for new best synset with a similarity increase of at least the threshold
    threshold = 0.00000001

    while not reached_leaf and current_similarity >= previous_similarity + threshold:
        # get the hyponyms of the current synset
        hyponyms = current_synset.hyponyms()

        # if the current synset has no hyponyms, set the reached_leaf flag to True
        if not hyponyms:
            reached_leaf = True
        else: # look for the hyponym with the highest similarity to the target definition
            for hyponym in hyponyms:
                # get the embedding of the hyponym
                hyponym_embedding = get_synset_embedding(hyponym)
        
                # compute the similarity between the target definition and the current hyponym
                hyponym_similarity = target_doc.similarity(hyponym_embedding)
                
                # if the similarity is greater than the highest similarity, update the highest similarity and the best hyponym
                if hyponym_similarity > highest_similarity:
                    highest_similarity = hyponym_similarity
                    best_hyponym = hyponym
            # now we found the best hyponym for the current synset
            previous_similarity = current_similarity
            current_synset = best_hyponym
            current_similarity = highest_similarity

    return current_synset

In [39]:
test_method(approach_2)

- DOOR:
	- A construction used to divide two rooms, temporarily closing the passage between them
		- Synset('relation.n.01')
	- It's an opening, it can be opened or closed.
		- Synset('diagonal.n.04')
	- An object that divide two room, closing an hole in a wall. You can open the door to let people enter or get out.
		- Synset('change.n.06')
	- Usable for access from one area to another
		- Synset('communication.n.03')
	- Structure that delimits an area and allows access to it
		- Synset('abstraction.n.06')
	- an object that is used to block passage but can be moved to pass
		- Synset('abstraction.n.06')
	- An assembled object, historically made of wood, but also of iron or other materials, used to separate rooms in a building. Sometimes opened by moving a handle, or pushed, or locked and requires some means to unlock. it consists of the main body, the hinges on which it rotates, and a lock.
		- Synset('abstraction.n.06')
	- object used to go through rooms separate by a wall, can be ope

The results are not very good: the search stops at abstract synsets that are far from the target word. This is probably because we start from a very general synset, and the glosses of such general synsets do not carry enough information to pick the correct branch and move towards a more specific synset.
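
The weakness of a greedy descent can be reproduced on a toy hierarchy. The sketch below is a simplified variant of the loop in `approach_2`, not a copy of it: the `children` graph, the 2-d `vectors`, and the function names are all invented for illustration, and cosine similarity is computed by hand instead of through SpaCy.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# hypothetical hierarchy and embeddings, invented for illustration
children: Dict[str, List[str]] = {
    'entity': ['abstraction', 'physical_entity'],
    'abstraction': ['relation'],
    'physical_entity': ['door'],
    'relation': [],
    'door': [],
}
vectors: Dict[str, Vector] = {
    'abstraction': (0.8, 0.6),
    'physical_entity': (0.5, 0.85),
    'relation': (0.9, 0.45),
    'door': (1.0, 0.05),
}

def greedy_descent(target: Vector, root: str = 'entity') -> str:
    """Follow the most similar child until a leaf is reached or similarity stops improving."""
    current, current_sim = root, -1.0
    while children[current]:
        best = max(children[current], key=lambda c: cosine(target, vectors[c]))
        best_sim = cosine(target, vectors[best])
        if best_sim <= current_sim:
            break
        current, current_sim = best, best_sim
    return current

target = (1.0, 0.0)  # closest to 'door' among all nodes
print(greedy_descent(target))  # relation
```

In this toy setup the node most similar to the target is `door`, but the first step prefers `abstraction` over `physical_entity`, so the search can never reach it. A wrong choice near the root cannot be undone further down.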

### Approach 3

Measure the similarity between word embeddings in SpaCy vocabulary and definition embeddings and get the word (in SpaCy) with the highest similarity.

In [40]:
def approach_3(definition: str) -> str:
    """
    Find the word with the highest similarity to the input definition by comparing the definition embedding to all word embeddings in the SpaCy vocabulary.
    :param definition: the definition for which to find the most similar word
    :return: the word in the SpaCy vocabulary with the highest similarity to the input definition
    """
    # embed the input definition
    definition_doc = embed_sentence(definition)

    max_similarity = -1
    associated_word = None
    
    # calculate similarity between the input definition and all words in the vocabulary
    for word_embedding in embedder.vocab:
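        # ignore stopwords, punctuation, and lexemes without a vector
        # (an empty vector would make the similarity meaningless)
        if word_embedding.is_stop or word_embedding.is_punct or not word_embedding.has_vector:
            continue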
        
        # calculate similarity between the input definition and the current word
        similarity = definition_doc.similarity(word_embedding)
        
        # update associated_word if similarity is higher
        if similarity > max_similarity:
            max_similarity = similarity
            associated_word = word_embedding.text
            
    return associated_word
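
The nearest-word search in `approach_3` can be illustrated with hand-made vectors. The sketch below assumes that the definition is represented by the average of its word vectors, which is how a SpaCy `Doc` vector is built by default. The `toy_vectors` table and the helper names are invented for illustration.

```python
import math
from typing import Dict, List

# hypothetical 3-d word vectors, invented for illustration
toy_vectors: Dict[str, List[float]] = {
    'door':   [0.9, 0.1, 0.0],
    'room':   [0.7, 0.3, 0.1],
    'access': [0.6, 0.2, 0.5],
    'insect': [0.0, 0.9, 0.2],
}

def mean_vector(words: List[str]) -> List[float]:
    """Average the vectors of the words that have one."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_word(definition_words: List[str]) -> str:
    """Return the vocabulary word closest to the averaged definition vector."""
    target = mean_vector(definition_words)
    return max(toy_vectors, key=lambda w: cosine(target, toy_vectors[w]))

print(nearest_word(['access', 'room']))  # access
```

Since the target vector is an average of the definition's own words, those words tend to score highest. This would explain why the method often returns a word that already appears in the definition rather than the word being defined.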

In [41]:
test_method(approach_3)

- DOOR:
	- A construction used to divide two rooms, temporarily closing the passage between them
		- separate
	- It's an opening, it can be opened or closed.
		- opened
	- An object that divide two room, closing an hole in a wall. You can open the door to let people enter or get out.
		- inside
	- Usable for access from one area to another
		- usable
	- Structure that delimits an area and allows access to it
		- access
	- an object that is used to block passage but can be moved to pass
		- intended
	- An assembled object, historically made of wood, but also of iron or other materials, used to separate rooms in a building. Sometimes opened by moving a handle, or pushed, or locked and requires some means to unlock. it consists of the main body, the hinges on which it rotates, and a lock.
		- actuating
	- object used to go through rooms separate by a wall, can be opened or closed
		- enclosing
	- something that can be opened, in order to access to another place
		- place
	- the access to a room
		- room
	- an object that allows access to a room
		- allows
	- Enclosing of an entrance that blocks off intruders as well as weather conditions. Can usually be locked with a key of some sort. At times it presents a small opening through which one can see outside.
		- normally
	- An object, that allows people to get inside or outsid

### Approach 4

 - Get the most frequent words in the definition (`mfw`)
 - Get the synsets of `mfw`, together with their hyponyms and hypernyms (`mfw_synsets`)
 - Calculate the signature of each synset in `mfw_synsets` (`mfw_synsets_signature`)
 - Get the synset from `mfw_synsets` with the best overlap between `mfw_synsets_signature` and the definition signature

In [42]:
def get_mfw(definition: str) -> List[str]:
    """
    Get the most frequent words in the definition
    :param definition: the definition, as string, for which to get the most frequent words
    :return: the most frequent words in the definition
    """
    definition_words = clean_sentence(definition) # remove stopwords, punctuation, lemmatize
    word_frequency = Counter(definition_words)
    mfw = [word for word, _ in word_frequency.most_common(10)]
    return mfw


def get_hypo_hyper(synset: Synset) -> List[Synset]:
    hyponyms = synset.hyponyms()
    hypernyms = synset.hypernyms()
    return hyponyms + hypernyms


def get_synsets_for_mfw(mfw: List[str]) -> List[Synset]:
    synsets: List[Synset] = []
    for word in mfw:
        synsets += wn.synsets(word)
        for synset in wn.synsets(word):
            # get hyponyms and hypernyms of the synset
            synsets += get_hypo_hyper(synset)
    return synsets

Calculate the signatures of the synsets of the most frequent words in the definition and pick the synset whose signature best overlaps the definition signature.

In [43]:
def approach_4(definition: str) -> Synset:
    """
    Find the best synset for a given definition by measuring the word overlap between the definition signature and the synset signature. Only the synsets of the most frequent words in the definition, together with their hyponyms and hypernyms, are considered.
    :param definition: the definition, as string, for which to find the best synset
    :return: the synset with the best overlap between the definition signature and the synset signature
    """
    
    # get the most frequent words in the definition
    mfw = get_mfw(definition)
    
    # get the synsets for the most frequent words
    mfw_synsets = get_synsets_for_mfw(mfw)
    
    # get the signature of the synsets and the definition
    synset_signatures = [get_synset_signature(synset) for synset in mfw_synsets]
    definition_signature = clean_sentence(definition)
    
    max_overlap = 0
    best_synset = None
    for synset, synset_signature in zip(mfw_synsets, synset_signatures):
        overlap = len(set(definition_signature).intersection(set(synset_signature)))
        if overlap > max_overlap:
            max_overlap = overlap
            best_synset = synset
    return best_synset

In [44]:
test_method(approach_4)

- DOOR:
	- A construction used to divide two rooms, temporarily closing the passage between them
		- Synset('construction.n.05')
	- It's an opening, it can be opened or closed.
		- Synset('open.v.09')
	- An object that divide two room, closing an hole in a wall. You can open the door to let people enter or get out.
		- Synset('doorway.n.01')
	- Usable for access from one area to another
		- Synset('access.v.02')
	- Structure that delimits an area and allows access to it
		- Synset('area.n.05')
	- an object that is used to block passage but can be moved to pass
		- Synset('chock.n.01')
	- An assembled object, historically made of wood, but also of iron or other materials, used to separate rooms in a building. Sometimes opened by moving a handle, or pushed, or locked and requires some means to unlock. it consists of the main body, the hinges on which it rotates, and a lock.
		- Synset('moon.n.02')
	- object used to go through rooms separate by a wall, can be opened or closed
		- Synset('

The results are slightly better than those of Approach 2. This is because we start from the synsets of the most frequent words in the definition, focusing on a part of the WordNet graph that should be more relevant and closer to the word we are looking for.
This approach resembles the "Genus-Differentia" principle: the synsets of the most frequent words act as the Genus, but as Differentia we consider not only more specific synsets (hyponyms) but possibly also more general ones (hypernyms).
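
The candidate expansion followed by overlap scoring can be sketched on a small hand-written graph. Everything in the block below is hypothetical: `toy_graph`, its links, and its gloss words are invented and do not reproduce WordNet. Unlike `get_synsets_for_mfw`, the sketch also removes duplicate candidates.

```python
from typing import Dict, List, Optional, Set, Tuple

# hypothetical mini-graph: node -> (hypernyms, hyponyms, gloss words)
toy_graph: Dict[str, Tuple[List[str], List[str], Set[str]]] = {
    'room':    (['area'], ['bedroom'], {'area', 'within', 'building', 'enclosed', 'wall'}),
    'area':    ([], ['room'], {'part', 'structure', 'specific', 'function'}),
    'bedroom': (['room'], [], {'room', 'used', 'sleeping'}),
    'passage': (['way'], ['doorway'], {'way', 'through', 'something'}),
    'way':     ([], ['passage'], {'course', 'travel'}),
    'doorway': (['passage'], [], {'entrance', 'room', 'building', 'wall', 'door'}),
}

def candidates(words: List[str]) -> List[str]:
    """Seed nodes plus their direct hyponyms and hypernyms, in discovery order."""
    found: List[str] = []
    for word in words:
        if word not in toy_graph:
            continue
        hypernyms, hyponyms, _ = toy_graph[word]
        for node in [word, *hyponyms, *hypernyms]:
            if node not in found:
                found.append(node)
    return found

def best_candidate(definition_words: Set[str]) -> Optional[str]:
    """Pick the candidate whose gloss shares the most words with the definition."""
    pool = candidates(sorted(definition_words))
    return max(pool, key=lambda n: len(definition_words & toy_graph[n][2]), default=None)

print(best_candidate({'room', 'wall', 'passage', 'building'}))  # doorway
```

Here the winning node is a neighbour of a seed word rather than a seed word itself, which is the situation the hyponym and hypernym expansion is meant to cover.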

### Testing methods with merged definitions

We now test the methods on merged definitions. The merged definition of a word is the concatenation of all its definitions. This should give us more information for finding the correct word than a single definition does.

In [45]:
def test_method_merged_definitions(method):
    for word in definitions_dict:
        merged_definition = ' '.join(definitions_dict[word])
        print(f'- {word.upper()}:')
        word_found = method(merged_definition)
        print(f'\t- {word_found}')
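
The effect of merging on word frequencies can be checked with a few lines of standard-library code. This is a rough sketch: the two ladybug definitions are taken from the table at the top of the notebook, `word_counts` is a hypothetical helper, and it only lowercases and tokenizes, without the stop word removal and lemmatization done by `clean_sentence`.

```python
import re
from collections import Counter
from typing import List

# two of the 'ladybug' definitions from the dataset preview
ladybug_definitions: List[str] = [
    'Small insect with a red back',
    'Small round flying insect',
]

def word_counts(text: str) -> Counter:
    """Count lowercase word tokens (no stop word removal, no lemmatization)."""
    return Counter(re.findall(r'\w+', text.lower()))

single = word_counts(ladybug_definitions[0])
merged = word_counts(' '.join(ladybug_definitions))

print(single.most_common(2))  # [('small', 1), ('insect', 1)]
print(merged.most_common(2))  # [('small', 2), ('insect', 2)]
```

In a single definition every word has the same count, whereas in the merged text the words shared across definitions stand out. Frequency-based steps such as `get_mfw` therefore have more to work with on merged input.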

In [46]:
test_method_merged_definitions(approach_2)

- DOOR:
	- Synset('spacing.n.02')
- LADYBUG:
	- Synset('diagonal.n.04')
- PAIN:
	- Synset('substance.n.04')
- BLURRINESS:
	- Synset('personality.n.01')


In [47]:
test_method_merged_definitions(approach_3)

- DOOR:
	- allowing
- LADYBUG:
	- yellow
- PAIN:
	- soreness
- BLURRINESS:
	- blinding


In [48]:
test_method_merged_definitions(approach_4)

- DOOR:
	- Synset('leave.v.06')
- LADYBUG:
	- Synset('dipterous_insect.n.01')
- PAIN:
	- Synset('bad.s.03')
- BLURRINESS:
	- Synset('picture.n.01')
