## Code Explanation: Similarity and Aggregation Analysis using Word Definitions

The code implements an experiment named "Defs". It measures how similar the definitions collected for the same word are, and compares the results along two dimensions: concreteness (concrete vs. abstract) and specificity (generic vs. specific). Four words cover the four combinations: "door" (generic, concrete), "pain" (generic, abstract), "ladybug" (specific, concrete), and "blurriness" (specific, abstract). For each word, the code scores every pair of definitions and averages the scores, so the four words can be compared along the two dimensions.

### Importing Libraries and Loading Data

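The code begins by importing `csv` and the NLTK tokenizer, stop word list, and Porter stemmer. It also imports `spacy`, `numpy`, and WordNet, which the cells shown here do not use. It then loads the definitions from a TSV file into a dictionary named `definitions`, where each key is a word and each value is the list of definitions for that word. The first row of the file holds the words as column headers, and each later row holds one definition per word; the first column is skipped.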
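
As a minimal sketch of the file layout this loading step assumes, the snippet below parses a small in-memory TSV. The `load_definitions` helper, the `id` label column, and the sample rows are made up for illustration and are not the contents of the real file.

```python
import csv
import io
from collections import defaultdict


def load_definitions(tsvfile):
    """Map each header word to the list of definitions in its column."""
    reader = csv.reader(tsvfile, delimiter="\t")
    words = next(reader)[1:]  # skip the first (label) column
    definitions = defaultdict(list)
    for row in reader:
        # zip pairs each cell with the word heading its column
        for word, definition in zip(words, row[1:]):
            definitions[word].append(definition)
    return dict(definitions)


# Hypothetical two-word sample, not taken from the real data file
sample = io.StringIO(
    "id\tdoor\tpain\n"
    "1\tan opening in a wall\tan unpleasant feeling\n"
    "2\taccess to a room\ta bodily sensation\n"
)
sample_definitions = load_definitions(sample)
print(sample_definitions["door"])
```

Pairing cells with `zip` instead of indexing into `words` also tolerates a row that has fewer cells than the header.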

### Defining Functions

1. `compute_similarity(definition1, definition2)`: This function calculates the lexical overlap between two definitions. It preprocesses both texts, counts the distinct stems they share, and divides that count by the number of tokens in the shorter preprocessed definition. It returns 0.0 if either definition is empty after preprocessing.

2. `preprocess_text(text)`: This function turns a text into a list of stems. It converts the text to lowercase, tokenizes it with `word_tokenize`, drops tokens that are not alphanumeric or that are English stop words, and stems the rest with the Porter stemmer.

3. `words_frequency(definition1, definition2)`: This function measures how dominant the most frequent word is in a pair of definitions. It preprocesses both texts, counts each stem across the two definitions together, and returns the count of the most frequent stem divided by the total number of stems, rounded to two decimals.
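
To make the two scores concrete, here is a small sketch that uses only the standard library. The `simple_preprocess` helper is a simplified stand-in for `preprocess_text`: it splits on non-alphanumeric characters and drops a tiny hand-picked stop word list, without stemming, so its numbers will not match the NLTK-based functions. The function names and example sentences are invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"a", "an", "the", "to", "of", "that", "is", "it", "can", "be", "or"}


def simple_preprocess(text):
    """Lowercase, split into alphanumeric tokens, drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def overlap_similarity(def1, def2):
    """Shared distinct tokens divided by the token count of the shorter definition."""
    words1, words2 = simple_preprocess(def1), simple_preprocess(def2)
    min_length = min(len(words1), len(words2))
    if min_length == 0:
        return 0.0
    return len(set(words1) & set(words2)) / min_length


def top_word_ratio(def1, def2):
    """Count of the most frequent token across both definitions over all tokens."""
    words = simple_preprocess(def1) + simple_preprocess(def2)
    if not words:
        return 0.0
    return max(Counter(words).values()) / len(words)


d1 = "An object that allows access to a room"
d2 = "The access to a room"
print(overlap_similarity(d1, d2))  # 1.0: both tokens of the shorter definition are shared
print(top_word_ratio(d1, d2))      # 2 of the 6 remaining tokens are "access" (or "room")
```

Because the overlap is normalized by the shorter definition, a short definition whose words all appear in a longer one scores 1.0, however much extra detail the longer one adds.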

### Similarity and Aggregation

Before the loop, the code creates two dictionaries, `word_avg_similarity_dict` and `word_avg_freq_dict`, to store each word's average similarity and average frequency score. Then, for each word in `definitions`, it performs the following steps:

1. Computes the similarity score and the most-frequent-word ratio for every unordered pair of the word's definitions.
2. Averages the pairwise similarity scores and the pairwise frequency ratios.
3. Adds the word and its average similarity score to `word_avg_similarity_dict`.
4. Adds the word and its average frequency score to `word_avg_freq_dict`.
5. Prints the word together with its two averages.

Two final cells arrange the averages in 2x2 tables built with `PrettyTable`, with abstract and concrete words as columns and generic and specific words as rows.
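
The per-word aggregation can be sketched as follows. This is an illustration rather than the notebook's own code: it uses `itertools.combinations` in place of nested index loops and takes the scoring function as a parameter. The `average_pairwise` helper, the `jaccard` scorer (a different measure from `compute_similarity`), and the sample definitions are hypothetical stand-ins.

```python
from itertools import combinations
from statistics import mean


def average_pairwise(items, score):
    """Average score(a, b) over all unordered pairs; 0.0 with fewer than two items."""
    scores = [score(a, b) for a, b in combinations(items, 2)]
    return mean(scores) if scores else 0.0


def jaccard(def1, def2):
    """Hypothetical stand-in scorer: shared words over all distinct words."""
    s1, s2 = set(def1.lower().split()), set(def2.lower().split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


# Invented sample data, not the experiment's definitions
sample_definitions = {
    "door": ["access to a room", "object giving access to a room", "a barrier"],
    "pain": ["an unpleasant feeling"],
}

word_avg = {word: average_pairwise(defs, jaccard)
            for word, defs in sample_definitions.items()}
print(word_avg)
```

A word with n definitions yields n(n-1)/2 pairs, so the guard for an empty score list matters only when a word has fewer than two definitions.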




In [23]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from collections import Counter
import numpy as np
import csv
import nltk
from nltk.corpus import wordnet as wn
import spacy

# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")



In [24]:
# Load definitions from the TSV file
definitions = {}
words = []

with open("../lab1/TLN-definitions-23.tsv", "r", encoding="utf-8") as tsvfile:
    reader = csv.reader(tsvfile, delimiter="\t")
    
    # Read the first row to get the words
    words = next(reader)[1:]
    
    for row in reader:
        for i, definition in enumerate(row[1:]):
            word = words[i]
            if word not in definitions:
                definitions[word] = []
            definitions[word].append(definition)

definitions

{'door': ['A construction used to divide two rooms, temporarily closing the passage between them',
  "It's an opening, it can be opened or closed.",
  'An object that divide two room, closing an hole in a wall. You can open the door to let people enter or get out.',
  'Usable for access from one area to another',
  'Structure that delimits an area and allows access to it',
  'an object that is used to block passage but can be moved to pass',
  'An assembled object, historically made of wood, but also of iron or other materials, used to separate rooms in a building. Sometimes opened by moving a handle, or pushed, or locked and requires some means to unlock. it consists of the main body, the hinges on which it rotates, and a lock.',
  'object used to go through rooms separate by a wall, can be opened or closed',
  'something that can be opened, in order to access to another place',
  'the access to a room',
  'an object that allows access to a room',
  'Enclosing of an entrance that bloc

In [None]:
def compute_similarity(definition1, definition2):
    """
    Calculate the similarity between two definitions.

    Parameters:
    - definition1: The first definition (str)
    - definition2: The second definition (str)

    Returns:
    - similarity: The similarity score between the definitions (float)
    """
    words1 = preprocess_text(definition1)
    words2 = preprocess_text(definition2)
    intersection = set(words1) & set(words2)
    min_length = min(len(words1), len(words2))
    
    if min_length == 0:
        return 0.0
    
    similarity = len(intersection) / min_length
    return similarity


def preprocess_text(text):
    """
    Preprocess the input text.

    Parameters:
    - text: The input text to be preprocessed (str)

    Returns:
    - words: The preprocessed words (list[str])
    """
    tokens = word_tokenize(text.lower())
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    words = [stemmer.stem(token) for token in tokens if token.isalnum() and token not in stop_words]
    return words


def words_frequency(definition1, definition2):
    """
    Calculate the frequency of words in two definitions.

    Parameters:
    - definition1: The first definition (str)
    - definition2: The second definition (str)

    Returns:
    - frequency: The frequency of the most frequent word (float)
    """
    words1 = preprocess_text(definition1)
    words2 = preprocess_text(definition2)
    total_words = len(words1) + len(words2)

    # Nothing left after preprocessing: avoid max() on an empty
    # counter and a division by zero
    if total_words == 0:
        return 0.0

    # Count every stem across both definitions together
    frequency_dict = Counter(words1 + words2)
    most_frequent_count = max(frequency_dict.values())

    return round(most_frequent_count / total_words, 2)



In [26]:
# Dictionaries mapping each word to its average pairwise scores
word_avg_similarity_dict = {}
word_avg_freq_dict = {}

for key, word_definitions in definitions.items():
    similarities = []
    words_frequencies = []
    print(f"Word: {key}")

    # Score every unordered pair of definitions for this word
    for i in range(len(word_definitions)):
        for j in range(i + 1, len(word_definitions)):
            similarities.append(compute_similarity(word_definitions[i], word_definitions[j]))
            words_frequencies.append(words_frequency(word_definitions[i], word_definitions[j]))

    # Average the pairwise scores (0.0 if the word has fewer than two definitions)
    avg_similarity = sum(similarities) / len(similarities) if similarities else 0.0
    avg_freq = sum(words_frequencies) / len(words_frequencies) if words_frequencies else 0.0

    # Store the averages for this word
    word_avg_similarity_dict[key] = avg_similarity
    word_avg_freq_dict[key] = avg_freq
    print(f"Average Similarity Score: {avg_similarity}")
    print(f"avg_freq: {avg_freq}")
    print("=" * 30)




Word: door
Average Similarity Score: 0.22419706730051592
avg_freq: 0.13691954022988503
Word: ladybug
Average Similarity Score: 0.5965796336485996
avg_freq: 0.1452183908045978
Word: pain
Average Similarity Score: 0.24244580451476996
avg_freq: 0.1759080459770116
Word: blurriness
Average Similarity Score: 0.07984516760378829
avg_freq: 0.11599999999999996


In [27]:
from prettytable import PrettyTable

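# Initialize the PrettyTable; the labels are Italian:
# Astratto = abstract, Concreto = concrete, Generico = generic, Specifico = specific
result_table = PrettyTable(["", "Astratto", "Concreto"])

# Each consecutive pair is (concrete word, abstract word);
# the first pair is generic, the second specific
word_order = ["door", "pain", "ladybug", "blurriness"]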

# Add the results to the PrettyTable based on the specified word order
for i in range(0, len(word_order), 2):
    word1 = word_order[i]
    word2 = word_order[i + 1]

    # Get the average similarity scores from word_avg_similarity_dict
    avg_similarity1 = word_avg_similarity_dict.get(word1, 0.0)
    avg_similarity2 = word_avg_similarity_dict.get(word2, 0.0)

    # The first pair fills the generic row, the second pair the specific row
    row_label = "Generico" if i == 0 else "Specifico"
    result_table.add_row([row_label, f" {word2} {avg_similarity2:.2f}", f" {word1} {avg_similarity1:.2f}"])

# Print the PrettyTable
print(result_table)



+-----------+------------------+---------------+
|           |     Astratto     |    Concreto   |
+-----------+------------------+---------------+
|  Generico |     pain 0.24    |    door 0.22  |
| Specifico |  blurriness 0.08 |  ladybug 0.60 |
+-----------+------------------+---------------+


In [28]:
from prettytable import PrettyTable

# Initialize the PrettyTable
result_table = PrettyTable(["", "Astratto", "Concreto"])

# Specify the order of words
word_order = ["door", "pain", "ladybug", "blurriness"]

# Add the results to the PrettyTable based on the specified word order
for i in range(0, len(word_order), 2):
    word1 = word_order[i]
    word2 = word_order[i + 1]

    # Get the average frequency scores from word_avg_freq_dict
    avg_freq1 = word_avg_freq_dict.get(word1, 0.0)
    avg_freq2 = word_avg_freq_dict.get(word2, 0.0)

    # The first pair fills the generic row, the second pair the specific row
    row_label = "Generico" if i == 0 else "Specifico"
    result_table.add_row([row_label, f" {word2} {avg_freq2:.2f}", f" {word1} {avg_freq1:.2f}"])

# Print the PrettyTable
print(result_table)


+-----------+------------------+---------------+
|           |     Astratto     |    Concreto   |
+-----------+------------------+---------------+
|  Generico |     pain 0.18    |    door 0.14  |
| Specifico |  blurriness 0.12 |  ladybug 0.15 |
+-----------+------------------+---------------+
