The goal of this exercise is to simulate the game of **Ghigliottina**. In this game, the player is given 5 words and must identify a sixth word that is semantically related to the initial 5. To achieve this, we have implemented two distinct strategies:

1. **Similarity-based Approach**: This strategy uses WordNet similarity measures to determine the related word. We collect the synsets related to the given words and pick the candidate with the highest average similarity to the other candidates.

2. **Frequency-based Approach**: In contrast, the second strategy relies on the frequency of synsets. We count how often each synset name occurs among the WordNet synsets related to the given words and select the most frequent one, excluding the given words themselves.

The two strategies exploit different aspects of word semantics: the first relies on distances in the WordNet taxonomy, the second on how often a synset recurs among the neighbours of the given words. The goal is to compare how effectively each one finds a semantically related word in the game of Ghigliottina.


In [15]:
from nltk.corpus import wordnet
from nltk.wsd import lesk
import csv
import pandas as pd
import random


In [16]:

def load_csv_to_dict(csv_filename):
    """
    Load data from a CSV file into a dictionary.

    Args:
        csv_filename (str): The name of the CSV file to be loaded.

    Returns:
        dict: A dictionary where keys are words (converted to lowercase) and values
              are lists of sentences associated with each word.
    """
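    word_dict = {}

    # newline='' lets the csv module handle embedded line breaks itself
    with open(csv_filename, 'r', newline='') as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader, None)  # skip the header row (no error on an empty file)

        for row in csvreader:
            if len(row) >= 2:
                word = row[0].lower()
                # naive sentence split: every piece but the last loses its full stop
                sentences = row[1].split('. ')
                word_dict[word] = sentences

    return word_dict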



csv_filename = './res/data.csv'  
word_dict = load_csv_to_dict(csv_filename)
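
To make the resulting structure concrete, here is a minimal sketch of the same parsing steps applied to an in-memory CSV. The column names (`word`, `sentences`) and the sample rows are made up for illustration; the real layout of `./res/data.csv` is only assumed to be a header row followed by one word and a block of example sentences per row.

```python
import csv
import io

# Hypothetical sample mimicking the assumed layout of the data file:
# a header row, then one word and a block of example sentences per row.
sample_csv = io.StringIO(
    "word,sentences\n"
    'Pitch,"The pitch was too high. He walked onto the pitch."\n'
    'Time,"Time flies. We had a great time."\n'
)

reader = csv.reader(sample_csv)
next(reader, None)  # skip the header row

sample_dict = {}
for row in reader:
    if len(row) >= 2:
        # Same steps as load_csv_to_dict: lowercase key, split on '. '
        sample_dict[row[0].lower()] = row[1].split('. ')

print(sample_dict["pitch"])
```

Note that splitting on `'. '` removes the full stop from every sentence except the last one, which is harmless for the token-overlap disambiguation used later.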

In [17]:
def read_first_column(csv_file):
    """
    Read the first column of a CSV file and return it as a list of lowercase strings.

    Args:
        csv_file (str): The name of the CSV file to be read.

    Returns:
        list: A list of lowercase strings representing the values in the first column of the CSV file.
    """
    df = pd.read_csv(csv_file, usecols=[0])
    first_column = df.iloc[:, 0].str.lower().tolist()
    return first_column

csv_file = './res/data.csv'
first_column_list = read_first_column(csv_file)

In [18]:
def disambiguate_senses_for_word(word, word_sentences_dict):
    """
    Disambiguate senses of a word within context sentences.

    This function takes a target 'word' and a dictionary 'word_sentences_dict'
    containing sentences associated with words. It attempts to disambiguate the
    senses of the target word within its context sentences using Lesk algorithm.

    Args:
        word (str): The target word to disambiguate.
        word_sentences_dict (dict): A dictionary where keys are words and values
            are lists of sentences associated with each word.

    Returns:
        set: A set containing the disambiguated senses (WordNet synsets) of the 'word'
            based on its context sentences. If no senses are found, an empty set is returned.
    """
    disambiguated_senses = set()

    # The candidate synsets depend only on the word, so look them up once
    synsets = wordnet.synsets(word)

    if synsets and word in word_sentences_dict:
        for sentence in word_sentences_dict[word]:
            best_sense = lesk(sentence.split(), word, synsets=synsets)
            # Guard against lesk returning None so callers only see synsets
            if best_sense is not None:
                disambiguated_senses.add(best_sense)

    return disambiguated_senses
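
The idea behind Lesk-style disambiguation can be illustrated without WordNet. The sketch below is a simplified stand-in, not NLTK's implementation: it picks the sense whose gloss shares the most tokens with the context sentence. The function `simplified_lesk`, the sense labels, and the glosses are all hypothetical.

```python
def simplified_lesk(context_tokens, glosses):
    """Return the sense whose gloss shares the most tokens with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses for two senses of "spring"
toy_glosses = {
    "spring.season": "the season of growth between winter and summer",
    "spring.coil": "a metal elastic device that returns to its shape",
}

sentence = "flowers bloom in the season after winter"
print(simplified_lesk(sentence.split(), toy_glosses))
```

Because the choice depends entirely on word overlap, short or generic context sentences give the algorithm little to work with, which is why each word comes with several sentences.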


In [19]:
def get_related_synsets(synset):
    """
    Retrieve related synsets for a given synset.

    This function uses the hypernyms, hyponyms, part meronyms, and substance meronyms
    relations to find related synsets.

    Args:
        synset (nltk.corpus.reader.wordnet.Synset): The synset for which related
            synsets are to be found.

    Returns:
        list: A list of nltk.corpus.reader.wordnet.Synset objects that are related
            to the input synset.
    """
    # Concatenate the four relation lists in a fixed order
    return (
        synset.hypernyms()
        + synset.hyponyms()
        + synset.part_meronyms()
        + synset.substance_meronyms()
    )

In [20]:
def retrieve_related_synsets(word_list, word_sentences_dict):
    """
    Retrieve related WordNet synsets for a list of words within context sentences.

    This function takes a list of target 'word_list' and a dictionary 'word_sentences_dict'
    containing sentences associated with words. It retrieves related WordNet synsets for
    the target words based on their context sentences and disambiguated senses.

    Args:
        word_list (list): A list of words for which related synsets need to be retrieved.
        word_sentences_dict (dict): A dictionary where keys are words and values
            are lists of sentences associated with each word.

    Returns:
        list: A list containing related WordNet synsets for the input 'word_list' based on
            their context sentences and disambiguated senses.
    """
    related_synsets = []
    
    for word in word_list:
        disambiguated_senses = disambiguate_senses_for_word(word, word_sentences_dict)
        for sense in disambiguated_senses:
            related_synsets.extend(get_related_synsets(sense))
    
    return related_synsets

### When working with similarities, the results are not great

In [21]:
def find_closest_synset(synsets):
    """
    Find the closest WordNet synset from a list of synsets based on similarity measures.

    This function takes a list of WordNet synsets 'synsets' and calculates the average similarity
    between each synset and the other synsets in the list. It returns the synset with the highest
    average similarity as the closest synset.

    Args:
        synsets (list): A list of WordNet synsets for which the closest synset needs to be found.

    Returns:
        str or None: The name of the closest WordNet synset (without sense number), or None if
            no synsets are provided or if the calculation cannot be performed.
    """
    closest_synset = None
    max_avg_similarity = -1
    
    for synset in synsets:
        total_similarity = 0
        count = 0
        
        for other_synset in synsets:
            if synset != other_synset:
                if synset.pos() == other_synset.pos():  # Check if parts of speech match
                    wup_similarity = synset.wup_similarity(other_synset)
                    shortest_path = synset.shortest_path_distance(other_synset)
                    lch_similarity = synset.lch_similarity(other_synset)
                    
                    if wup_similarity is not None and shortest_path is not None and lch_similarity is not None:
                        total_similarity += (wup_similarity + (1 / (1 + shortest_path)) + lch_similarity)
                        count += 1
        
        if count > 0:
            avg_similarity = total_similarity / count
            
            if avg_similarity > max_avg_similarity:
                max_avg_similarity = avg_similarity
                closest_synset = synset
    
    if closest_synset:
        return closest_synset.name().split('.')[0]
    else:
        return None
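
To see how the three terms of the combined score behave, the sketch below computes them on a tiny hand-made taxonomy using the textbook definitions of Wu-Palmer, inverse path distance, and Leacock-Chodorow. The taxonomy, the helper names, and the depth convention (root at depth 1) are assumptions for illustration; NLTK's values on the real WordNet graph will differ.

```python
import math

# Hypothetical toy taxonomy: child -> parent (the root has no entry)
PARENT = {
    "animal": "entity", "plant": "entity",
    "dog": "animal", "cat": "animal", "tree": "plant",
}


def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))


MAX_DEPTH = max(depth(n) for n in set(PARENT) | set(PARENT.values()))


def path_length(a, b):
    """Edges between a and b via their lowest common ancestor, plus that ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    lcs = next(n for n in up_a if n in up_b)
    return up_a.index(lcs) + up_b.index(lcs), lcs


def combined_score(a, b):
    """Return (wup, path term, lch, sum) for a pair of nodes."""
    dist, lcs = path_length(a, b)
    wup = 2 * depth(lcs) / (depth(a) + depth(b))    # Wu-Palmer
    path_term = 1 / (1 + dist)                      # inverse path distance
    lch = -math.log((dist + 1) / (2 * MAX_DEPTH))   # Leacock-Chodorow
    return wup, path_term, lch, wup + path_term + lch


for pair in [("dog", "cat"), ("dog", "tree")]:
    print(pair, [round(x, 3) for x in combined_score(*pair)])
```

The Wu-Palmer and path terms stay within (0, 1], while the Leacock-Chodorow term is not confined to that range, so in a deep taxonomy it can weigh more heavily in the unnormalised sum.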

### Working with frequency yields better results

In [22]:


def find_most_frequent_synset(synsets, five_words):
    """
    Find the most frequent synset name among a list of synsets, excluding the given words.

    This function counts how often each synset name (the part before the first dot,
    e.g. 'spring' for 'spring.n.01') occurs in the input list, ignoring names that
    match any of the words in the five_words list. It then returns the most frequent name.

    Args:
        synsets (list): The list of WordNet synsets to analyze.
        five_words (list): The list of words to exclude from the frequency count.

    Returns:
        str or None: The name of the most frequent synset in the input list, or None
            if there are no valid synsets. Ties go to the name encountered first.
    """

    synset_counts = {}
    
    for synset in synsets:
        synset_name = synset.name().split('.')[0]
        if synset_name not in five_words:
            if synset_name in synset_counts:
                synset_counts[synset_name] += 1
            else:
                synset_counts[synset_name] = 1
    
    most_frequent_synset = None
    max_count = -1
    
    for synset_name, count in synset_counts.items():
        if count > max_count:
            max_count = count
            most_frequent_synset = synset_name
    
    return most_frequent_synset
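
The same counting logic can be sketched more compactly with `collections.Counter`. The example works on plain strings rather than real synsets, and the candidate names and given words are invented for illustration.

```python
from collections import Counter

# Hypothetical synset names as they would look after name().split('.')[0]
candidate_names = ["season", "jump", "season", "time", "fountain", "jump", "season"]
given_words = ["time", "spring"]

# Count every candidate that is not one of the given words
counts = Counter(name for name in candidate_names if name not in given_words)

# most_common(1) returns a one-element list of (name, count) pairs
winner = counts.most_common(1)[0][0] if counts else None
print(winner)
```

Here `season` wins with three occurrences, and `time` is never counted because it is one of the given words.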



Here's how we simulate the game:

1. Randomly select 5 words from `first_column_list`.

2. Retrieve the related WordNet synsets for the 5 selected words with the `retrieve_related_synsets` function.

3. Find the closest synset with the similarity-based approach (`find_closest_synset`).

4. Find the most frequent synset with the frequency-based approach (`find_most_frequent_synset`).

5. Print the result of the similarity-based approach, which is the closest synset.

6. Print the result of the frequency-based approach, which is the most frequent synset.


In [26]:
five_words = random.sample(first_column_list, 5)
rel_syns = retrieve_related_synsets(five_words, word_dict)

closest_synset = find_closest_synset(rel_syns)
most_freq_synset = find_most_frequent_synset(rel_syns, five_words)

# Each item is an (index, word) tuple, printed as-is
for numbered_word in enumerate(five_words, start=1):
    print(numbered_word)
print(f'Similarity based approach: {closest_synset}')
print(f'Frequency based approach: {most_freq_synset}')


(1, 'pitch')
(2, 'time')
(3, 'interest')
(4, 'spring')
(5, 'shop')
Similarity based approach: jump
Frequency based approach: alto
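
Since `random.sample` draws different words on every run, the round above will not repeat exactly. If reproducible rounds are wanted, one option is a seeded generator, as in this minimal sketch. The word pool and the seed value are placeholders; in the notebook the pool would be `first_column_list`.

```python
import random

# Hypothetical word pool standing in for first_column_list
word_pool = ["pitch", "time", "interest", "spring", "shop", "bank", "light"]

rng = random.Random(42)  # arbitrary seed, chosen only for illustration
first_draw = rng.sample(word_pool, 5)

rng = random.Random(42)  # re-seeding reproduces the same draw
second_draw = rng.sample(word_pool, 5)

print(first_draw == second_draw)
```

Using a dedicated `random.Random` instance keeps the seed local to the game and leaves the global random state untouched.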
