## Lexical overlap measurement

1. Read the definitions for each word from the file `definitions.tsv` and store them.
2. Create a function that computes the similarity between two definitions of a word.
    - For each definition, remove punctuation and stop words, then lemmatize the remaining words before computing the similarity.
    - The similarity is the number of lemmas that are in common between the two definitions (overlap) divided by the length of the smaller definition.
3. Create a function that computes the mean similarity between each pair of definitions for a given word.
4. For each word compute the mean similarity between all pairs of definitions of the word.
5. Compute the mean similarity for concrete concepts and abstract concepts.
6. Compute the mean similarity for specific concepts and general concepts.
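
One way to write the similarity from step 2 in symbols, where $S(d)$ is the set of distinct lemmas of a definition $d$ after cleaning and $n(d)$ is the number of lemma tokens left in it (both symbols are introduced here for illustration):

$$
\mathrm{sim}(d_1, d_2) = \frac{|S(d_1) \cap S(d_2)|}{\min\bigl(n(d_1),\, n(d_2)\bigr)}
$$

For example, if two cleaned definitions contain 3 and 5 lemmas and share 2 of them, the similarity is $2 / \min(3, 5) = 2/3 \approx 0.67$.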

In [28]:
import pandas as pd
from nltk.stem import WordNetLemmatizer
from typing import Dict, List

from resources.constants import punctuation, stop_words

### Read the definitions

Read the definitions for each word from the file `definitions.tsv` and convert them to a dictionary of the form `word: [definition1, definition2, ...]`.

In [0]:
definitions = pd.read_csv('resources/definitions.tsv', sep='\t')
definitions.head()

# remove index from the dataframe (for each row it is the first element)
definitions = definitions.iloc[:, 1:]
definitions.head()

In [29]:
# convert the dataframe to a dictionary for easier access
definitions_dict: Dict[str, List[str]] = {}
for column in definitions.columns:
    definitions_dict[column] = definitions[column].tolist()
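
If the columns of `definitions.tsv` have different lengths, pandas pads the shorter ones with `NaN`, and those float values would end up in the lists and fail later on `str.split()`. Whether that happens depends on the file, which is not shown here. A minimal sketch of a loader that skips empty cells, assuming the layout described above (a header row of words and a leading index column); `load_definitions` and the sample data are hypothetical:

```python
import csv
import io
from typing import Dict, List


def load_definitions(tsv_text: str) -> Dict[str, List[str]]:
    """Map each header word to its non-empty definitions (hypothetical helper)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')
    header = next(reader)[1:]  # skip the leading index column
    result: Dict[str, List[str]] = {word: [] for word in header}
    for row in reader:
        # row[0] is the index; pair the remaining cells with the header words
        for word, cell in zip(header, row[1:]):
            if cell.strip():  # ignore empty cells
                result[word].append(cell.strip())
    return result


# made-up sample in the assumed layout, with one empty cell
sample = (
    "\tdoor\tpain\n"
    "0\tA way into a room\tA feeling of distress\n"
    "1\tA movable barrier\t\n"
)
sample_definitions = load_definitions(sample)
print(sample_definitions)
```

With pandas, a comparable effect could be obtained by calling `dropna()` on each column before `tolist()`.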

In [30]:
# print every word and one of its definitions
for word in definitions_dict:
    print(f'- {word.upper()}: \n\t{definitions_dict[word][0]}')

- DOOR: 
	A construction used to divide two rooms, temporarily closing the passage between them
- LADYBUG: 
	small flying insect, typically red with black spots with six legs
- PAIN: 
	A feeling of physical or mental distress
- BLURRINESS: 
	sight out of focus


### Functions to compute similarity

Clean the word list by removing punctuation and stop words, then lemmatizing the remaining words.

In [31]:
lemmatizer = WordNetLemmatizer()

def clean_word_list(word_list: List[str]):
    # drop tokens listed in `punctuation`; punctuation attached to a word (e.g. 'rooms,') is not stripped here
    word_list = [word for word in word_list if word not in punctuation]
    # remove stop words
    word_list = [word for word in word_list if word not in stop_words]
    # lemmatize the words
    word_list = [lemmatizer.lemmatize(word) for word in word_list]
    return word_list

Compute the similarity between two sentences as the number of lemmas that are in common between the two sentences divided by the length of the smaller sentence.

In [32]:
def sentence_similarity(sentence1: str, sentence2: str):
    # split the definitions into words
    words1 = sentence1.split()
    words2 = sentence2.split()
    words1 = clean_word_list(words1)
    words2 = clean_word_list(words2)
    # compute the intersection of the two definitions
    intersection = len(set(words1).intersection(set(words2)))
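    # guard against definitions that are empty after cleaning
    shortest = min(len(words1), len(words2))
    if shortest == 0:
        return 0.0
    # return the similarity, dividing by the length of the smaller definition
    return intersection / shortest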
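
Because `str.split()` only splits on whitespace, a token such as `rooms,` keeps its comma and will not match `rooms` in another definition. A minimal sketch of the same overlap measure with each token lowercased and stripped of surrounding punctuation, assuming a tiny hypothetical stop-word set and skipping lemmatization so that it runs without NLTK data; `TOY_STOP_WORDS`, `tokenize`, and `overlap_similarity` are illustrative names, not part of the notebook:

```python
import string
from typing import List

# hypothetical stand-in for resources.constants.stop_words
TOY_STOP_WORDS = {'a', 'an', 'the', 'of', 'to', 'with', 'or', 'and'}


def tokenize(sentence: str) -> List[str]:
    """Lowercase, strip surrounding punctuation, drop stop words and empty tokens."""
    tokens = [token.strip(string.punctuation).lower() for token in sentence.split()]
    return [token for token in tokens if token and token not in TOY_STOP_WORDS]


def overlap_similarity(sentence1: str, sentence2: str) -> float:
    """Shared distinct tokens divided by the token count of the shorter sentence."""
    words1, words2 = tokenize(sentence1), tokenize(sentence2)
    shortest = min(len(words1), len(words2))
    if shortest == 0:  # nothing left to compare after cleaning
        return 0.0
    return len(set(words1) & set(words2)) / shortest


print(tokenize('A panel used to close rooms, or passages.'))
# → ['panel', 'used', 'close', 'rooms', 'passages']
print(overlap_similarity('A panel used to close rooms.', 'Rooms, closed by a panel'))
# → 0.5
```

Without lemmatization, `close` and `closed` count as different tokens here, which is the gap the lemmatizer is meant to narrow.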

Compute the mean similarity between each pair of definitions for a given word.

In [33]:
def definition_similarity(word: str) -> float:
    # if the word is not in the dictionary, raise error
    if word not in definitions_dict:
        raise ValueError(f'Word {word} not found in the dictionary')
    # get word definitions
    word_definitions = definitions_dict[word]
    similarities: List[float] = []
    # compute the similarity between each pair of definitions
    for i in range(len(word_definitions)):
        for j in range(i+1, len(word_definitions)):
            def1 = word_definitions[i]
            def2 = word_definitions[j]
            similarities.append(sentence_similarity(def1, def2))
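    # a mean needs at least one pair, i.e. at least two definitions
    if not similarities:
        raise ValueError(f'Word {word} needs at least two definitions')
    return sum(similarities) / len(similarities)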
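
The nested index loops enumerate every unordered pair exactly once, which is what `itertools.combinations` provides. A minimal sketch, assuming any similarity function of two strings; `mean_pairwise` and `shared_word_ratio` are hypothetical names used only for illustration:

```python
from itertools import combinations
from statistics import fmean
from typing import Callable, List


def mean_pairwise(items: List[str], similarity: Callable[[str, str], float]) -> float:
    """Mean of similarity(a, b) over all unordered pairs of items."""
    scores = [similarity(a, b) for a, b in combinations(items, 2)]
    if not scores:
        raise ValueError('at least two items are needed to form a pair')
    return fmean(scores)


def shared_word_ratio(a: str, b: str) -> float:
    """Toy similarity: shared words over the shorter word count."""
    words_a, words_b = a.split(), b.split()
    return len(set(words_a) & set(words_b)) / min(len(words_a), len(words_b))


toy_definitions = ['red insect', 'small red insect', 'flying bug']
toy_mean = mean_pairwise(toy_definitions, shared_word_ratio)
print(toy_mean)
```

With three definitions there are three pairs; only the first pair shares any words (ratio 1.0), so the mean is one third.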

### Compute mean similarities for each word

Try the functions on a single word.

In [34]:
# try on first word
word1 = list(definitions_dict.keys())[0]
similarity = definition_similarity(word1)
print(f'Average similarity for word {word1}: {similarity}')

Average similarity for word door: 0.1457782350286356


For each word compute the mean similarity between all pairs of definitions of the word.

In [40]:
similarities = {}
for word in definitions_dict:
    similarities[word] = definition_similarity(word)

for word in similarities:
    print(f'- {word.upper()}: \n\t{similarities[word]}')

- DOOR: 
	0.1457782350286356
- LADYBUG: 
	0.3663837790561933
- PAIN: 
	0.14762269658821392
- BLURRINESS: 
	0.06509760992519609


### Get the mean similarity for concrete and abstract concepts

In [41]:
concrete_words = ['door', 'ladybug']
abstract_words = ['pain', 'blurriness']

concrete_similarities = [similarities[word] for word in concrete_words]
abstract_similarities = [similarities[word] for word in abstract_words]

concrete_mean = sum(concrete_similarities) / len(concrete_similarities)
abstract_mean = sum(abstract_similarities) / len(abstract_similarities)

print(f'Mean similarity for concrete words: {concrete_mean}')
print(f'Mean similarity for abstract words: {abstract_mean}')

Mean similarity for concrete words: 0.25608100704241443
Mean similarity for abstract words: 0.106360153256705


### Get the mean similarity for specific and general concepts

In [42]:
specific_words = ['ladybug', 'blurriness']
general_words = ['door', 'pain']

specific_similarities = [similarities[word] for word in specific_words]
general_similarities = [similarities[word] for word in general_words]

specific_mean = sum(specific_similarities) / len(specific_similarities)
general_mean = sum(general_similarities) / len(general_similarities)

print(f'Mean similarity for specific words: {specific_mean}')
print(f'Mean similarity for general words: {general_mean}')

Mean similarity for specific words: 0.2157406944906947
Mean similarity for general words: 0.14670046580842477
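
Both group comparisons repeat the same pattern, so they could share one helper. A minimal sketch, assuming a `word: score` dictionary like `similarities`; `group_mean` is a hypothetical name, and the scores below are rounded copies of the per-word values printed earlier:

```python
from statistics import fmean
from typing import Dict, Iterable


def group_mean(scores: Dict[str, float], words: Iterable[str]) -> float:
    """Mean score over the given words; raises KeyError for unknown words."""
    return fmean(scores[word] for word in words)


# per-word scores rounded from the notebook output
rounded_scores = {'door': 0.146, 'ladybug': 0.366, 'pain': 0.148, 'blurriness': 0.065}
groups = {
    'concrete': ['door', 'ladybug'],
    'abstract': ['pain', 'blurriness'],
    'specific': ['ladybug', 'blurriness'],
    'general': ['door', 'pain'],
}
group_means = {name: group_mean(rounded_scores, words) for name, words in groups.items()}
for name, value in group_means.items():
    print(f'Mean similarity for {name} words: {value:.3f}')
```

Note that each group holds only two words, so these means summarize very little data.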
