### Download the Required Resources


##### Import the libraries

In [2]:
import gensim.downloader as g1
from transformers import BertModel, BertTokenizer 
import torch
import numpy as np
from nltk.corpus import wordnet as wn
import random
import pandas as pd
import json 
from scipy.spatial.distance import cosine
from itertools import combinations


  from .autonotebook import tqdm as notebook_tqdm


#### Practice the APis

##### Practice Word2Vec

In [167]:
w2vModel = g1.load("word2vec-google-news-300");
model = w2vModel

In [11]:
sim = w2vModel.most_similar('board')
vec1 = w2vModel.get_vector("board") 
vec2 = w2vModel.get_vector("committee") 
print(sim)
print(vec1)
print(vec2)

[('Board', 0.673071563243866), ('directors', 0.6475343704223633), ('trustees', 0.6403145790100098), ('baord', 0.5922820568084717), ('Trustees', 0.5866842269897461), ('Governing_Board', 0.5753400325775146), ('Theresa_Colaizzi', 0.5700287818908691), ('Jane_Gallucci', 0.5602066516876221), ('boards', 0.5594165325164795), ('Pat_Deutschman', 0.5546656250953674)]
[-0.14453125 -0.25976562 -0.01611328 -0.01074219 -0.01281738 -0.34765625
  0.10839844  0.00340271  0.07080078  0.04199219  0.0456543  -0.14160156
 -0.03808594 -0.19335938 -0.30273438  0.09619141  0.0703125  -0.11425781
 -0.02709961  0.01306152 -0.09863281  0.22070312  0.00118256  0.1328125
  0.02783203  0.14453125 -0.21386719  0.30664062 -0.20117188 -0.29101562
  0.07080078 -0.07861328 -0.07958984 -0.06738281  0.17675781 -0.23730469
  0.171875    0.31445312  0.13378906 -0.12109375 -0.09423828  0.13671875
  0.0390625  -0.09619141  0.07666016 -0.12695312  0.19140625 -0.04907227
  0.04589844  0.21679688 -0.00778198  0.08886719  0.055664

##### Practice Bert

In [4]:
# Load pre-trained BERT model and tokenizer 
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model_bert = BertModel.from_pretrained(model_name)

In [5]:
def pre_process_sentence(sentences: list):
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    return inputs

### Part 1
One “objective” way of analyzing the similarities generated from a word embedding is to use existing human encoded knowledge. To this end, WordNet serves as a possible source for comparison.
For instance, one can pick a synset S, and look at all the words that are associated with S, and calculate the average similarity (of their corresponding vectors) between pairs of such words. Ideally these values should be high, as these words are supposedly “similar” (have a sense that is “the same”).
Once can also argue that if two synsets S1, S2 are “far apart” (e.g. having low path similarity), then if we pick a word associated with S1, and another associated with S2, their corresponding vectors’ similarity should be low.
To see if that is true, you should implement the following function and put it in “hw2.py” (when I say similarity below, I mean the (cosine)-similarity between the vectors (in the embedding) corresponds to the words.

#### Synonym Set Similarities Implementation
Where model is the model that is loaded from Word2Vec, and sset is a synset from
WordNet
- What the function returns depends on how many words are associated with sset
- If sset has only one word, you should return an empty list
- If sset has two words, you should return a list of one number, which is the similarity of the two vectors in the model corresponding to the two words
- If sset has three or more words, then you should return a list of 4 numbers: [avg, sd, min, max]
- avg is the average similarity between words in sset
- sd is the standard deviation of the similarities between words in sset
- min, max are the minimum and maximum similarity respectively


In [169]:
def synsetSimValue(model_wv, words):
    """
    Calculate similarity statistics for a list of words using pre-trained word vectors.

    Args:
    model_wv (gensim.models.keyedvectors.KeyedVectors): The word vectors from a Word2Vec model.
    words (list): A list of words (strings).

    Returns:
    list: A list containing similarity statistics (average, standard deviation, minimum, maximum).
    """

    # Check if the input is a list, if not but a string of len 1 convert it to list one word
    if not isinstance(words, list):
        if isinstance(words, str):
            words = [words]
        else:
            raise ValueError("The input must be a list of words or a single word.")
        
     # Filter words that are not in the model's vocabulary
    words = [word for word in words if word in model_wv.key_to_index]
    # print(f"{len(words)} words out of {len(words)} are in the vocabulary.")

    # Check the number of words in the synset
    num_words = len(words)

    if num_words == 1:
        return []
    elif num_words == 2:
        return [model_wv.similarity(words[0], words[1])]
    else:
        # Calculate the similarity between all pairs of words in the synset
        sim_values = []
        for i in range(num_words):
            for j in range(i+1, num_words):
                sim_values.append(model_wv.similarity(words[i], words[j]))

        # Calculate and return the statistics
         # Calculate statistics
        avg = np.mean(sim_values)
        sd = np.std(sim_values)
        min_sim = np.min(sim_values)
        max_sim = np.max(sim_values)

        return [avg, sd, min_sim, max_sim]

#### Cross Synonym Set Implementation
- Where model is the model that is loaded from Word2Vec, and sset1, sset2 are synset
sfrom WordNet
- What the function returns depends on how many words are associated with sset1 and
sset2
- If both sset1 and sset2 has one word, you should return a list of one number,
which is the similarity of the two vectors in the model corresponding to the two
words
- Otherwise you should return a list of 4 numbers: [avg, sd, min, max]
- avg is the average similarity between pair of words, one from sset1 and the other from sset2
- sd is the standard deviation of the similarities described above
- min, max are the minimum and maximum similarity respectively as
described above.

In [170]:
def crossSynsetSimValue(model, words1, words2):
    """
    Calculate similarity statistics for pairs of words from two different sets using a Word2Vec model.

    Args:
    model (gensim.models.Word2Vec): The Word2Vec model.
    words1 (list): A list of words from the first set.
    words2 (list): A list of words from the second set.

    Returns:
    list: A list containing similarity statistics (average, standard deviation, minimum, maximum).
          If both sets have only one word, returns a list with the similarity between the two words.
    """
    # Filter words that are not in the model's vocabularry
    words1 = [word for word in words1 if word in model.key_to_index]
    words2 = [word for word in words2 if word in model.key_to_index]
    # print(f"{len(words1)} words out of {len(words1)} are in the vocabulary.")
    # print(f"{len(words2)} words out of {len(words2)} are in the vocabulary.")

    # return similarity between two words if both sets have only one word
    if len(words1) == 1 and len(words2) == 1:
        return [model.similarity(words1[0], words2[0])]
    else:
        # Calculate similarities between all pairs of words, one from each set
        similarities = []
        for word1 in words1:
            for word2 in words2:
                similarities.append(model.similarity(word1, word2))

        # Calculate statistics
        avg = np.mean(similarities)
        sd = np.std(similarities)
        min_sim = np.min(similarities)
        max_sim = np.max(similarities)

        return [avg, sd, min_sim, max_sim]

##### Test Cross Synset Sample Sentences

In [53]:
def test_crossSyn_with_one_word(model, sset1, sset2):
    ''' Test with one word'''
    print("Test with 1 word, expect []")
    sim = crossSynsetSimValue(model, sset1, sset2)
    try:
        assert len(sim) == 1
        print("Pass")
        print(f"function returned simarilary between {sset1[0]} and {sset2[0]} as: {sim}")
    except AssertionError as e:
        print(f"Failed with error {e}")

def test_crossSyn_statistic(model, sset1, sset2):
    ''' Test with more than 1 words in each set'''
    print("Test with more than 2 words, expect [average, standard deviation, minimum, maximum]")
    stats = crossSynsetSimValue(model, sset1, sset2)
    try:
        assert len(stats) == 4
        print("Passed, printing statistics........\n")
        avg, sd, min_sim, max_sim = stats
        print(f"Average: {avg:.4f}")
        print(f"Standard deviation: {sd:.4f}")
        print(f"Minimum: {min_sim:.4f}")
        print(f"Maximum: {max_sim:.4f}")
    except AssertionError as e:
        print(f"Failed with error {e}")

#### Pick Synset from NLTK WordNet API
Pick 32 synsets from the noun hypernyum-hyponym tree, eight of them from level 4 (assume the root is level-0), eight from level 6, eight from level 8, and eight from level 10. You can either do this manually, or you can write a program using nltk’s wordnet api to get those. For synsets


This code defines a function get_synsets_at_level that retrieves a specified number of synsets at a given level in the noun hypernym-hyponym tree. It then uses this function to get 8 synsets each at levels 4, 6, 8, and 10, and combines them into a list of 32 synsets. Finally, it prints out the selected synsets along with their definitions.

In [171]:
# Define a function to extract words from a synset
def get_words_from_synset(synset):
    return [lemma.name() for lemma in synset.lemmas()]

def get_synset_level(synset):
    """
    Get the level of a synset in WordNet.

    Args:
    synset (nltk.corpus.reader.wordnet.Synset): A WordNet synset.

    Returns:
    int: The level of the synset in the WordNet hierarchy.
    """
    level = 0
    while synset.hypernyms():
        synset = synset.hypernyms()[0]
        level += 1
    return level

def get_synsets_at_level(model_wv, level, num_synsets, min_words=3, seed=42):
    """
    Get a specified number of synsets at a given level in the noun hypernym-hyponym tree,
    ensuring that the synsets contain words present in the Word2Vec model's vocabulary and are at the correct level.

    Args:
    model_wv (gensim.models.keyedvectors.KeyedVectors): The word vectors from a Word2Vec model.
    level (int): The level in the tree (0 is the root).
    num_synsets (int): The number of synsets to retrieve at the given level.
    min_words (int): Minimum number of words in each synset that must be in the model's vocabulary.
    seed (int): The seed for the random number generator.

    Returns:
    list: A list of synsets at the specified level.
    """
    synsets = []
    all_synsets = list(wn.all_synsets('n'))  # Get all noun synsets

    # Filter synsets that are at the correct level and have at least 'min_words' words in the model's vocabulary
    filtered_synsets = [
        s for s in all_synsets
        if get_synset_level(s) == level and
        len([word for word in get_words_from_synset(s) if word in model_wv.key_to_index]) >= min_words
    ]

    # Set the seed for random number generator
    random.seed(seed)

    # Shuffle the synsets to get a random sample
    random.shuffle(filtered_synsets)

    # Select the first 'num_synsets' synsets
    selected_synsets = filtered_synsets[:num_synsets]

    return selected_synsets

##### Download Wordnet

In [15]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tango.tew/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

##### Test Synset picks

In [172]:
# Get synsets at different levels
level_4_synsets = get_synsets_at_level(model, 4, 8)  # Get 8 synsets at level 4 (less specific than level 6)
print("Level 4 Synsets:")
for i, sset in enumerate(level_4_synsets):
    print(f"  Synset {i + 1}: {sset} - {sset.definition()}")

level_6_synsets = get_synsets_at_level(model, 6, 8)  # Get 8 synsets at level 6 (less specific than level 8)
print("\nLevel 6 Synsets:")
for i, sset in enumerate(level_6_synsets):
    print(f"  Synset {i + 1}: {sset} - {sset.definition()}")

level_8_synsets = get_synsets_at_level(model, 8, 8)  # Get 8 synsets at level 8 (less specific than level 10)
print("\nLevel 8 Synsets:")
for i, sset in enumerate(level_8_synsets):
    print(f"  Synset {i + 1}: {sset} - {sset.definition()}")

level_10_synsets = get_synsets_at_level(model, 10, 8)  # Get 8 synsets at level 10
print("\nLevel 10 Synsets:")
for i, sset in enumerate(level_10_synsets):
    print(f"  Synset {i + 1}: {sset} - {sset.definition()}")


Level 4 Synsets:
  Synset 1: Synset('guidance.n.01') - something that provides direction or advice as to a decision or course of action
  Synset 2: Synset('hazard.n.01') - a source of danger; a possibility of incurring loss or misfortune
  Synset 3: Synset('dissenter.n.01') - a person who dissents from some established policy
  Synset 4: Synset('narrative.n.01') - a message that tells the particulars of an act or occurrence or course of events; presented in writing or drama or cinema or as a radio or television program
  Synset 5: Synset('package.n.01') - a collection of things wrapped or boxed together
  Synset 6: Synset('beginning.n.02') - the time at which something is supposed to begin
  Synset 7: Synset('inhabitant.n.01') - a person who inhabits a particular place
  Synset 8: Synset('set.n.05') - an unofficial association of people or groups

Level 6 Synsets:
  Synset 1: Synset('dilatation.n.01') - the state of being stretched beyond normal dimensions
  Synset 2: Synset('generator

##### 2. For each synset, pass it to synsetSimValue(model, sset) to collect the results.

In [173]:
# Collect results for each synset
results = {}
for level, synsets in zip([4, 6, 8, 10], [level_4_synsets, level_6_synsets, level_8_synsets, level_10_synsets]):
    results[level] = []
    for synset in synsets:
        words = get_words_from_synset(synset)
        sim_values = synsetSimValue(model, words)
        results[level].append(sim_values)

# Print the results
for level, sim_values in results.items():
    print(f"\nLevel {level} Synsets Similarity Values:")
    for i, values in enumerate(sim_values):
        print(f"  Synset {i}: {values}")


Level 4 Synsets Similarity Values:
  Synset 0: [0.19198452, 0.0675881, 0.05628027, 0.2545016]
  Synset 1: [0.36985362, 0.110029414, 0.20530821, 0.54964393]
  Synset 2: [0.23820046, 0.09675737, 0.053950276, 0.38784942]
  Synset 3: [0.5221036, 0.124665916, 0.3561672, 0.68530774]
  Synset 4: [0.28917667, 0.08000256, 0.17092088, 0.41898152]
  Synset 5: [0.18865238, 0.13750823, -0.031129314, 0.5463147]
  Synset 6: [0.46770832, 0.13531648, 0.29098988, 0.6575424]
  Synset 7: [0.08376061, 0.05216451, 0.008357173, 0.1701282]

Level 6 Synsets Similarity Values:
  Synset 0: [0.61335826, 0.04073567, 0.5802395, 0.6707398]
  Synset 1: [0.14698185, 0.11449183, 0.013768709, 0.29329577]
  Synset 2: [0.5057629, 0.082199104, 0.401117, 0.68773746]
  Synset 3: [0.42021513, 0.062443297, 0.33730286, 0.5239381]
  Synset 4: [0.19750372, 0.14555739, 0.024629625, 0.40799326]
  Synset 5: [0.05846564, 0.085492596, -0.01168602, 0.17882062]
  Synset 6: [0.42099822, 0.10362398, 0.33835047, 0.5671262]
  Synset 7: [0.

##### 3. Use a table like below to present the results:

In [174]:
def create_synset_similarity_table(level_synsets, model):
    # Define the columns for the DataFrame
    columns = ['Synset level', 'Synset ID', 'Words in that synset', 'Average similarity', 'Standard Deviation', 'Minimum', 'Maximum']
    data = []

    for level, synsets in level_synsets.items():
        for synset in synsets:
            words = get_words_from_synset(synset)
            # Filter words that are in the model's vocabulary
            words_in_vocab = [word for word in words if word in model.key_to_index]
            sim_values = synsetSimValue(model, words_in_vocab)
            if sim_values:
                avg, sd, min_sim, max_sim = sim_values
            else:
                avg, sd, min_sim, max_sim = [0, 0, 0, 0]
            data.append([level, synset.name(), ', '.join(words_in_vocab), avg, sd, min_sim, max_sim])

    # Create the DataFrame
    df = pd.DataFrame(data, columns=columns)
    return df

In [182]:
# Assuming you have the Word2Vec model and the level_synsets dictionary
level_synsets = {
    4: level_4_synsets,
    6: level_6_synsets,
    8: level_8_synsets,
    10: level_10_synsets
}

# print(f"level 4 synsets: {level_4_synsets}")
similarity_table = create_synset_similarity_table(level_synsets, model)
similarity_table

Unnamed: 0,Synset level,Synset ID,Words in that synset,Average similarity,Standard Deviation,Minimum,Maximum
0,4,guidance.n.01,"guidance, counsel, counseling, direction",0.191985,0.067588,0.05628,0.254502
1,4,hazard.n.01,"hazard, jeopardy, peril, risk, endangerment",0.369854,0.110029,0.205308,0.549644
2,4,dissenter.n.01,"dissenter, dissident, protester, objector, con...",0.2382,0.096757,0.05395,0.387849
3,4,narrative.n.01,"narrative, narration, story, tale",0.522104,0.124666,0.356167,0.685308
4,4,package.n.01,"package, bundle, packet, parcel",0.289177,0.080003,0.170921,0.418982
5,4,beginning.n.02,"beginning, commencement, first, outset, start,...",0.188652,0.137508,-0.031129,0.546315
6,4,inhabitant.n.01,"inhabitant, habitant, dweller, denizen",0.467708,0.135316,0.29099,0.657542
7,4,set.n.05,"set, circle, band, lot",0.083761,0.052165,0.008357,0.170128
8,6,dilatation.n.01,"dilatation, distension, distention",0.613358,0.040736,0.580239,0.67074
9,6,generator.n.03,"generator, source, author",0.146982,0.114492,0.013769,0.293296


#### Cross Synset Similarity
##### 4. Consider the synsets you selected in step 1. For each level, form 8 pairs of synsets (each synset participate in two pairs).

In [176]:
def form_synset_pairs(level_synsets, model):
    '''
    Form pairs of synsets at each level in the noun hypernym-hyponym tree.

    Args:
    level_synsets (dict): A dictionary containing synsets at different levels in the noun hypernym-hyponym tree.
    model (gensim.models.Word2Vec): The Word2Vec model.

    Returns:
    dict: A dictionary containing pairs of synset names (without extension) at each level.
    '''
    synset_pairs = {}
    for level, synsets in level_synsets.items():
        # Form pairs of synsets
        pairs = []
        for i in range(len(synsets)):
            pair1 = (synsets[i].name().split('.')[0], synsets[(i + 1) % len(synsets)].name().split('.')[0])
            pair2 = (synsets[i].name().split('.')[0], synsets[(i - 1) % len(synsets)].name().split('.')[0])
            pairs.append(pair1)
            pairs.append(pair2)
        # Remove duplicate pairs by converting to set and back to list
        pairs = list(set([tuple(sorted(pair)) for pair in pairs]))
        synset_pairs[level] = pairs
    return synset_pairs


In [178]:
synset_pairs = form_synset_pairs(level_synsets, model)

# Print the pairs for each level
for level, pairs in synset_pairs.items():
    print(f"\nLevel {level} Synset Pairs:")
    for i, (synset1, synset2) in enumerate(pairs):
        print(f"  Pair {i + 1}: ({synset1}, {synset2})")




Level 4 Synset Pairs:
  Pair 1: (inhabitant, set)
  Pair 2: (beginning, package)
  Pair 3: (dissenter, hazard)
  Pair 4: (guidance, hazard)
  Pair 5: (beginning, inhabitant)
  Pair 6: (dissenter, narrative)
  Pair 7: (guidance, set)
  Pair 8: (narrative, package)

Level 6 Synset Pairs:
  Pair 1: (coupling, obscenity)
  Pair 2: (curriculum_vitae, pass)
  Pair 3: (generator, whiner)
  Pair 4: (obscenity, whiner)
  Pair 5: (coupling, pass)
  Pair 6: (dilatation, generator)
  Pair 7: (dilatation, hybrid)
  Pair 8: (curriculum_vitae, hybrid)

Level 8 Synset Pairs:
  Pair 1: (presumption, wallet)
  Pair 2: (dry_dock, fatherland)
  Pair 3: (centrifuge, wallet)
  Pair 4: (fatherland, nitroglycerin)
  Pair 5: (dry_dock, sun_parlor)
  Pair 6: (centrifuge, sun_parlor)
  Pair 7: (nitroglycerin, sled)
  Pair 8: (presumption, sled)

Level 10 Synset Pairs:
  Pair 1: (dishwasher_detergent, myocardial_infarction)
  Pair 2: (myocardial_infarction, whipping)
  Pair 3: (mannequin, sanitation)
  Pair 4: (

##### 5. For each pair, pass it to crossSynsetSimValue(model, sset1, sset2) and collect the results

In [179]:
# Calculate similarity statistics for each pair of synsets
cross_synset_results = {}
for level, pairs in synset_pairs.items():
    cross_synset_results[level] = []
    for words1, words2 in pairs:
        sim_values = crossSynsetSimValue(model, words1, words2)
        cross_synset_results[level].append(sim_values)

# Print the results
for level, results in cross_synset_results.items():
    print(f"Level {level} Cross-Synset Similarity Statistics:")
    for i, (pair, stats) in enumerate(zip(synset_pairs[level], results), start=1):
        words1, words2 = pair  # Unpack the pair into two lists of words
        print(f"  Pair {i}:")
        print(f"    Words 1: {words1}")
        print(f"    Words 2: {words2}")
        avg, sd, min_sim, max_sim = stats
        print(f"    Similarity Stats: avg: {avg:.4f}, standard_deviation: {sd:.4f}, min: {min_sim:.4f}, max: {max_sim:.4f}")

Level 4 Cross-Synset Similarity Statistics:
  Pair 1:
    Words 1: inhabitant
    Words 2: set
    Similarity Stats: avg: 0.4081, standard_deviation: 0.1961, min: 0.2115, max: 1.0000
  Pair 2:
    Words 1: beginning
    Words 2: package
    Similarity Stats: avg: 0.3856, standard_deviation: 0.1986, min: 0.1507, max: 1.0000
  Pair 3:
    Words 1: dissenter
    Words 2: hazard
    Similarity Stats: avg: 0.4224, standard_deviation: 0.1687, min: 0.2537, max: 1.0000
  Pair 4:
    Words 1: guidance
    Words 2: hazard
    Similarity Stats: avg: 0.4596, standard_deviation: 0.1395, min: 0.2537, max: 1.0000
  Pair 5:
    Words 1: beginning
    Words 2: inhabitant
    Similarity Stats: avg: 0.5036, standard_deviation: 0.2277, min: 0.2115, max: 1.0000
  Pair 6:
    Words 1: dissenter
    Words 2: narrative
    Similarity Stats: avg: 0.4227, standard_deviation: 0.2267, min: 0.1278, max: 1.0000
  Pair 7:
    Words 1: guidance
    Words 2: set
    Similarity Stats: avg: 0.3531, standard_deviation: 0

##### 6. Use a table like below to present the results (no need to list the individual words in this table):

In [180]:
def create_cross_synset_similarity_table(synset_pairs, cross_synset_results, level_synsets):
    # Define the columns for the DataFrame
    columns = ['Synset level', 'Synset ID for sset1', 'Synset ID for sset2', 'Average similarity', 'Standard Deviation', 'Minimum', 'Maximum']
    data = []

    for level, results in cross_synset_results.items():
        for i, (pair, stats) in enumerate(zip(synset_pairs[level], results), start=1):
            words1, words2 = pair
            # Find the synset IDs for the words in the pair
            synset1_id = [synset.name() for synset in level_synsets[level] if synset.name().split('.')[0] == words1][0]
            synset2_id = [synset.name() for synset in level_synsets[level] if synset.name().split('.')[0] == words2][0]
            avg, sd, min_sim, max_sim = stats
            data.append([level, synset1_id, synset2_id, avg, sd, min_sim, max_sim])

    # Create the DataFrame
    df = pd.DataFrame(data, columns=columns)
    return df


In [181]:
cross_synset_similarity_table = create_cross_synset_similarity_table(synset_pairs, cross_synset_results, level_synsets)
cross_synset_similarity_table

Unnamed: 0,Synset level,Synset ID for sset1,Synset ID for sset2,Average similarity,Standard Deviation,Minimum,Maximum
0,4,inhabitant.n.01,set.n.05,0.408121,0.196072,0.21153,1.0
1,4,beginning.n.02,package.n.01,0.38559,0.198553,0.150729,1.0
2,4,dissenter.n.01,hazard.n.01,0.422412,0.168732,0.253688,1.0
3,4,guidance.n.01,hazard.n.01,0.459631,0.13947,0.253688,1.0
4,4,beginning.n.02,inhabitant.n.01,0.503638,0.227697,0.21153,1.0
5,4,dissenter.n.01,narrative.n.01,0.422688,0.226678,0.127753,1.0
6,4,guidance.n.01,set.n.05,0.353062,0.168004,0.21153,1.0
7,4,narrative.n.01,package.n.01,0.342242,0.156917,0.114713,1.0
8,6,coupling.n.03,obscenity.n.02,0.443409,0.170745,0.21153,1.0
9,6,curriculum_vitae.n.01,pass.n.09,0.319829,0.099261,0.013744,0.5105


##### 7. Use the two tables, to verify/contradict the arguments made in the first 3 paragraphs in the section. You may want to do additional analysis of the numbers to achieve it. Your argument may say “the results depends on the level of the synset....”

## Synset Similarity Analysis

Based on the data result for same-word similarity within synsets and cross-synset similarity, I draw the following conclusions that:

### Same-Word Similarity within Synsets:
- The average similarities within synsets are generally high, indicating strong semantic relatedness among words in the same synset. For example, the synset "guidance.n.01" at level 4 has an average similarity of 0.191985, and "hazard.n.01" at the same level has an even higher average similarity of 0.369854.
- There is some variability within synsets, as indicated by the standard deviation. However, the minimum and maximum values suggest that most words within a synset are relatively similar to each other.

### Cross-Synset Similarity:
- The average similarities between different synsets are lower than those within synsets, which supports the notion that words from different synsets are less semantically related. For instance, the average similarity between "inhabitant.n.01" and "set.n.05" at level 4 is 0.408121, which is lower than the average similarities within synsets at the same level.
- There is notable variability in cross-synset similarities, as seen in the standard deviation values. This indicates that some pairs of synsets may be more semantically related than others.

### Comparison and Insights:
- The data supports the hypothesis that words within the same synset are more semantically related than words from different synsets. This is evidenced by the generally higher average similarities within synsets compared to those between different synsets.
- The variability in both same-word and cross-synset similarities suggests that semantic relatedness is not uniform and can vary depending on the specific words or synsets being compared.
- There are exceptions where cross-synset similarities are relatively high, indicating that some synsets may be more closely related than expected. For example, the average similarity between "dissenter.n.01" and "narrative.n.01" at level 4 is 0.422688, which is comparable to some of the average similarities within synsets.

### Conclusion:
Overall, the analysis supports the theoretical expectations of semantic similarities within and across synsets. Words within the same synset tend to have higher average similarity, reflecting their semantic relatedness. In contrast, words from different synsets generally show lower similarity, indicating lesser semantic relatedness. However, the presence of exceptions and variability highlights the complexity of semantic relationships in natural language.

Importantly, the level of the synset appears to have an impact on the similarities observed. Higher-level synsets, which represent more general concepts, tend to have lower within-synset similarities compared to lower-level synsets, which represent more specific concepts. This suggests that the granularity of the synset level plays a role in determining the degree of semantic relatedness among words within and across synsets.


### Part 2: Comparing BERT within synsets


We can ask the similar questions for BERT instead of Word2Vec. However, since BERT is a contextualized embedding, there is no single vector associated with each word. Instead, you have to submit a sentence to BERT and it will return the vector corresponds to each word.

#### Part1 Check if same word in a different sentences would return the same vector.


In [75]:
# Load pre-trained BERT model and tokenizer 
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model_bert = BertModel.from_pretrained(model_name)

In [159]:
def genBERTVector(model, word, sentences):
    """
    Generate BERT vectors for a word in a list of sentences.
    """
    vectors = []
    for sentence in sentences:
        # Get BERT embeddings and tokenized sentence

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        # Tokenize the input sentence and convert to PyTorch tensors
        inputs = tokenizer(sentence, padding=True, return_tensors="pt")
        tokenized_sentence = tokenizer.tokenize(sentence)

        # Forward pass through the BERT model to get embeddings
        with torch.no_grad():
            outputs = model(**inputs)

        # Get the hidden states (embeddings) of the last layer
        last_hidden_states = outputs.last_hidden_state

        # Convert tensor to numpy array
        embeddings = last_hidden_states.numpy()


        # Find the position of the word or its subwords in the tokenized sentence
        word_positions = [i for i, token in enumerate(tokenized_sentence) if word in token]

        # If the word or its subwords are not in the sentence, add an empty list
        if not word_positions:
            vectors.append([])
            continue

        # Extract the embedding for the first occurrence of the word or its subwords
        word_vector = embeddings[0, word_positions[0], :]
        vectors.append(word_vector)

    return vectors

##### Test Part Two:
Consider a word that is in multiple sentences, will the same word’s (that have the same sense)
have the same (or similar) embedded vectors?

In [80]:
sentences_dogs = ["This dog is barking", "This dog is barking loudly", "How are you today?", "My dog is a big dog"]
vectors_dog = genBERTVector(model_bert, "dog", sentences_dogs)
for i, vec in enumerate(vectors_dog):
    print(f"V{i+1} = {vec.tolist() if len(vec) > 0 else '[]'}")

V1 = [-0.3924993872642517, 0.013372927904129028, 0.1901414841413498, -0.4516221880912781, 0.6767682433128357, 0.49663418531417847, 0.6628692150115967, 1.434213638305664, -0.33909350633621216, -0.24653920531272888, -0.11542430520057678, -1.1893110275268555, -0.12006565183401108, 0.21110528707504272, 0.03133716434240341, 0.5490267872810364, 0.08786246925592422, 0.28771302103996277, 0.4951881170272827, 0.4616583585739136, -0.13183458149433136, -0.5924156904220581, -0.29464057087898254, 0.3969518542289734, 0.28486084938049316, 0.20589354634284973, 0.37705016136169434, 0.4584537744522095, -0.008960023522377014, -0.18570205569267273, -0.4013621509075165, 0.2506321966648102, -0.05282978340983391, 0.5774250030517578, 0.016261544078588486, -0.08254845440387726, 0.21561606228351593, -0.5328277349472046, 0.035012386739254, 0.4629247188568115, -0.33626776933670044, -0.3762643039226532, 0.29860714077949524, -0.48116761445999146, -0.43893375992774963, -0.6458852887153625, -0.021414972841739655, -0.0

##### 2. For each word, call the genBERTVector() function to retrieve the vector for each word in the sentences.

In [160]:
# Load the JSON file
with open('sentences_by_synset.json', 'r') as file:
    data = json.load(file)

In [161]:
def generate_vectors(data, model):
    vectors = {}
    for synset, words_dict in data.items():
        vectors[synset] = {}
        for word, sentences in words_dict.items():
            vectors[synset][word] = genBERTVector(model, word, sentences)
    return vectors

In [162]:
# Call the function to generate vectors
vectors = generate_vectors(data, model_bert)

# Example: Print the vectors for the word 'narrative' in the synset 'narrative.n.01'
print(vectors['narrative.n.01']['narrative'])


[array([-1.12604544e-01, -9.33351666e-02, -4.01862383e-01, -2.42557839e-01,
        3.76543194e-01, -4.25543785e-01,  2.35960081e-01,  4.08625960e-01,
       -6.15531027e-01,  1.49725974e-01, -1.23684831e-01, -7.66852081e-01,
       -2.54453402e-02,  1.45650253e-01,  3.25838476e-01,  5.57404041e-01,
       -3.62933911e-02, -1.36008337e-01, -6.25536621e-01, -4.03050631e-01,
       -4.63696003e-01, -2.84207106e-01, -1.01359546e-01,  2.40284801e-01,
        7.90434420e-01,  8.69410336e-02,  3.84409636e-01, -8.72452259e-02,
       -6.68790191e-03, -2.98979431e-01,  2.22948864e-01, -1.62265331e-01,
       -4.10168320e-01,  1.02834761e+00,  3.23746830e-01,  3.45036238e-02,
       -3.96756865e-02, -1.19332537e-01, -2.35320136e-01, -1.31071657e-01,
       -5.10532521e-02, -3.11366647e-01,  6.55820251e-01,  3.46194237e-01,
       -2.11990625e-03, -2.63051331e-01,  2.42024705e-01, -4.17015292e-02,
        4.59078372e-01, -1.57421008e-01, -4.67293680e-01,  6.37058973e-01,
        6.32328033e-01, 

##### 3. Calculate the pairwise cosine similarity of each pair of vector. Record the average, standard deviation, min and max similarity value

Calculate the cosine similarities and return the stats for each pair of vectors of each word embedded vectors.

In [163]:

def calculate_cosine_similarity(vectors):
    similarities = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if len(vectors[i]) > 0 and len(vectors[j]) > 0:
                similarity = 1 - cosine(vectors[i], vectors[j])
                similarities.append(similarity)
    
    if len(similarities) > 0:
        avg_similarity = np.mean(similarities)
        std_dev_similarity = np.std(similarities)
        min_similarity = np.min(similarities)
        max_similarity = np.max(similarities)
        return avg_similarity, std_dev_similarity, min_similarity, max_similarity
    else:
        return 0, 0, 0, 0

In [164]:
synset_stats = {}

for synset, words in vectors.items():
    synset_stats[synset] = {}
    for word, vecs in words.items():
        if vecs:  # Check if there are any vectors for this word
            avg_similarity, std_dev_similarity, min_similarity, max_similarity = calculate_cosine_similarity(vecs)
            synset_stats[synset][word] = {
                'Average similarity': avg_similarity,
                'Standard deviation': std_dev_similarity,
                'Minimum similarity': min_similarity,
                'Maximum similarity': max_similarity
            }

# Print the stats for each synset and word
for synset, words_stats in synset_stats.items():
    print(f"\nSynset: {synset}")
    for word, stats in words_stats.items():
        print(f"  Word: {word}")
        for stat_name, stat_value in stats.items():
            print(f"    {stat_name}: {stat_value}")


Synset: narrative.n.01
  Word: narrative
    Average similarity: 0.5137321949005127
    Standard deviation: 0.21643478646201697
    Minimum similarity: 0.28049299120903015
    Maximum similarity: 0.8692008852958679
  Word: narration
    Average similarity: 0.5676038960615793
    Standard deviation: 0.161876625418017
    Minimum similarity: 0.38016167283058167
    Maximum similarity: 0.8438411951065063
  Word: story
    Average similarity: 0.5570192684729894
    Standard deviation: 0.11659466822751499
    Minimum similarity: 0.3718770742416382
    Maximum similarity: 0.6755340099334717
  Word: tale
    Average similarity: 0.5661893089612325
    Standard deviation: 0.13611079653857722
    Minimum similarity: 0.40922173857688904
    Maximum similarity: 0.7945946455001831

Synset: package.n.01
  Word: package
    Average similarity: 0.4680553525686264
    Standard deviation: 0.16898095149378142
    Minimum similarity: 0.2804751992225647
    Maximum similarity: 0.7805398106575012
  Word: b

##### 4. Present the result in a table below. Each row should correspond to a word

Display the table for all the extracted vectors of similar word in different sentences in the same synset.

In [165]:
def display_synset_statistics(synset_stats):
    """
    Present the statistics for each synset in a table.

    Args:
    synset_stats (dict): A dictionary where keys are synset IDs and values are dictionaries of words and their statistics.

    Returns:
    pandas.DataFrame: A DataFrame containing the statistics for each word in each synset.
    """
    data = []
    for synset_id, words_stats in synset_stats.items():
        # Find the level of the synset
        level = get_synset_level(wn.synset(synset_id))
        for word, stats in words_stats.items():
            data.append([
                level,
                synset_id,
                word,
                stats['Average similarity'],
                stats['Standard deviation'],
                stats['Minimum similarity'],
                stats['Maximum similarity']
            ])

    # Create a DataFrame to display the results
    df = pd.DataFrame(data, columns=['Synset level', 'Synset ID', 'Word', 'Average Similarity', 'Standard Deviation', 'Minimum', 'Maximum'])
    return df


In [166]:

df = display_synset_statistics(synset_stats)
df

Unnamed: 0,Synset level,Synset ID,Word,Average Similarity,Standard Deviation,Minimum,Maximum
0,4,narrative.n.01,narrative,0.513732,0.216435,0.280493,0.869201
1,4,narrative.n.01,narration,0.567604,0.161877,0.380162,0.843841
2,4,narrative.n.01,story,0.557019,0.116595,0.371877,0.675534
3,4,narrative.n.01,tale,0.566189,0.136111,0.409222,0.794595
4,4,package.n.01,package,0.468055,0.168981,0.280475,0.78054
5,4,package.n.01,bundle,0.530151,0.163049,0.291323,0.730898
6,4,package.n.01,packet,0.582036,0.122606,0.478238,0.849265
7,4,package.n.01,parcel,0.467618,0.234115,0.266685,0.804487
8,6,obscenity.n.02,obscenity,0.0,0.0,0.0,0.0
9,6,obscenity.n.02,smut,0.0,0.0,0.0,0.0


##### 5. Calculate Cosine Similarities of words in the same synsets

Now consider each pair of words in the same synset. Calculate the cosine similarity of all pair of vectors, where the two vectors corresponds to the different word.
For example, let say for the above case, the four vectors corresponds the word layer is l1, l2, l3, l4. And the four vectors corresponds to bed is b1, b2, b3, b4
Then you should calculate the cosine similarity of (v, w) where v is one of the l’s and w is one of the b’s. So you should get 16 numbers.
If you synset has more than 2 words, than do the same for every pair of words.

In [104]:
def calculate_all_pairwise_similarities(vectors):
    """
    Calculate pairwise cosine similarities for all pairs of words within each synset.

    Args:
    vectors (dict): A dictionary where keys are synset IDs and values are dictionaries of word vectors.

    Returns:
    dict: A dictionary where keys are synset IDs and values are dictionaries of pairwise similarities.
    """
    synset_pairwise_similarities = {}
    for synset, words_vectors in vectors.items():
        synset_pairwise_similarities[synset] = {}
        word_pairs = list(words_vectors.keys())
        for i in range(len(word_pairs)):
            for j in range(i+1, len(word_pairs)):
                word1, word2 = word_pairs[i], word_pairs[j]
                vecs1, vecs2 = words_vectors[word1], words_vectors[word2]
                similarities = []
                for v1 in vecs1:
                    for v2 in vecs2:
                        if len(v1) > 0 and len(v2) > 0:
                            similarity = 1 - cosine(v1, v2)
                            similarities.append(similarity)
                synset_pairwise_similarities[synset][(word1, word2)] = similarities
    return synset_pairwise_similarities

Calculate the Pairwise similarities

In [126]:
# vectors = generate_vectors(data, model_bert)
pairwise_similarities = calculate_all_pairwise_similarities(vectors)
# Show an example of the pairwise similarities for the synset 'narrative.n.01'
pairwise_similarities['narrative.n.01'][('narrative', 'narration')]

[0.983587920665741,
 0.47258371114730835,
 0.6927400827407837,
 0.36696991324424744,
 0.39416563510894775,
 0.9606183171272278,
 0.45023632049560547,
 0.8510762453079224,
 0.7278580665588379,
 0.5178450345993042,
 0.9083399176597595,
 0.42074593901634216,
 0.28905054926872253,
 0.8030263185501099,
 0.35093697905540466,
 0.9624466896057129]

##### 6. Enter the results in the table below:

In [132]:
# Define the levels at which to collect synsets
def get_synset_level(synset):
    """
    Get the level of a synset in the WordNet hierarchy.

    Args:
    synset (nltk.corpus.reader.wordnet.Synset): A WordNet synset.

    Returns:
    int: The level of the synset in the hierarchy.
    """
    level = 0
    while synset.hypernyms():
        synset = synset.hypernyms()[0]
        level += 1
    return level


def create_synset_similarity_table(pairwise_similarities):
    """
    Create a table with average, standard deviation, minimum, and maximum similarity for each pair of words in a synset.

    Args:
    pairwise_similarities (dict): A dictionary where keys are synset IDs and values are dictionaries of pairwise similarities.

    Returns:
    pandas.DataFrame: A DataFrame containing the similarity statistics for each pair of words in each synset.
    """
    data = []
    for synset_id, word_pairs in pairwise_similarities.items():
        # Get the level of the synset using the get_synset_level function
        synset = wn.synset(synset_id)
        level = get_synset_level(synset)

        for (word1, word2), similarities in word_pairs.items():
            if similarities:  # Check if the list of similarities is not empty
                avg_similarity = np.mean(similarities)
                std_deviation = np.std(similarities)
                min_similarity = np.min(similarities)
                max_similarity = np.max(similarities)
            else:
                avg_similarity = std_deviation = min_similarity = max_similarity = np.nan  # Use NaN for missing values

            data.append([level, synset_id, f"{word1}, {word2}", avg_similarity, std_deviation, min_similarity, max_similarity])

    # Create a DataFrame to display the results
    columns = ['Synset Level', 'Synset ID', 'Words in the Synset', 'Average Similarity', 'Standard Deviation', 'Minimum', 'Maximum']
    df = pd.DataFrame(data, columns=columns)
    return df

In [129]:
df = create_synset_similarity_table(pairwise_similarities)
df

Unnamed: 0,Synset Level,Synset ID,Words in the Synset,Average Similarity,Standard Deviation,Minimum,Maximum
0,4,narrative.n.01,"narrative, narration",0.634514,0.243248,0.289051,0.983588
1,4,narrative.n.01,"narrative, story",0.533315,0.15785,0.29972,0.831913
2,4,narrative.n.01,"narrative, tale",0.524451,0.144711,0.297855,0.737998
3,4,narrative.n.01,"narration, story",0.534829,0.135895,0.299632,0.772997
4,4,narrative.n.01,"narration, tale",0.528845,0.131714,0.30784,0.716554
5,4,narrative.n.01,"story, tale",0.633318,0.167586,0.403201,0.946504
6,4,package.n.01,"package, bundle",0.532243,0.196079,0.244028,0.867987
7,4,package.n.01,"package, packet",0.534142,0.152807,0.33082,0.885269
8,4,package.n.01,"package, parcel",0.497443,0.254332,0.163251,0.930116
9,4,package.n.01,"bundle, packet",0.562775,0.145363,0.306139,0.851659


##### 7. Use the tables you generated (with additional analysis if you wish) try to answer the two questions posed in the beginning of this section. Once again, your answer may be different for different levels of synset.

##### Analysis

**Limitations of BERT's Vocabulary:**
It's important to note that BERT's vocabulary has limitations, and some words from WordNet may not be included in BERT's vocabularies. This means that there may be no vector representation for certain words, which can affect the comparison of embeddings. The analysis presented here is based on words that have vector representations in BERT's vocabularies.

**Similarity of Embeddings for the Same Word Across Different Sentences:**

The first question asks whether the same word, when it appears in multiple sentences with the same sense, will have similar embedded vectors. To answer this, we can look at the second table, which shows the average similarity, standard deviation, minimum, and maximum similarity for each word across different sentences.

For example, consider the word "narrative" in the synset "narrative.n.01". The average similarity is 0.513732, with a standard deviation of 0.216435. This indicates that while there is some variation in the embedded vectors for "narrative" across different sentences, the vectors are generally similar, as the average similarity is above 0.5.

Similarly, for the word "pass" in the synset "pass.n.09", the average similarity is 0.567686, indicating that the embeddings for "pass" are also generally similar across different sentences.

Therefore, we can conclude that for a given word with the same sense, the embedded vectors across different sentences are generally similar, although there may be some variation.

**Similarity of Embeddings for Words in the Same Synset:**

The second question asks whether words in the same synset will have similar embedded vectors when used in sentences that reflect the sense of the synset. To answer this, we can look at the first table, which shows the average similarity between pairs of words in the same synset.

For example, in the synset "narrative.n.01", the pair ("narrative", "narration") has an average similarity of 0.634514, indicating that the embeddings for these words are quite similar when used in the context of this synset. Similarly, the pair ("story", "tale") has an average similarity of 0.633318, further supporting the idea that words in the same synset have similar embeddings.

However, it's important to note that not all pairs of words in the same synset have high similarity scores. For example, in the synset "package.n.01", the pair ("package", "parcel") has a lower average similarity of 0.497443. This suggests that while words in the same synset tend to have similar embeddings, there can be exceptions, possibly due to the different connotations or usage contexts of the words.

In summary, the analysis indicates that words in the same synset generally have similar embedded vectors when used in sentences that reflect the sense of the synset, although there can be variations depending on the specific words and their usage contexts.


#### Some words not in bert vocabularies even though they were use for examples above, So the analysis was based on the words that are included in bert vocabularies

In [153]:
sentence =  "The obscenity of his language shocked the audience."
tokenized_sentence = tokenizer.tokenize(sentence)
print(tokenized_sentence)

['the', 'ob', '##sc', '##enity', 'of', 'his', 'language', 'shocked', 'the', 'audience', '.']
