## Code Explanation: Content-to-Form Experiment using WordNet Definitions

The code implements a content-to-form experiment using WordNet definitions. Given a set of definitions written for a concept ("door," "ladybug," "pain," and "blurriness"), it searches WordNet for the synset those definitions describe. The search is guided by the "genus" principle: the most frequent words in the definitions are treated as candidate genera, and the synsets of those candidates and of their hyponyms are ranked against the definitions.

The primary goal of this experiment is to showcase the process of disambiguating word senses using the modified Lesk algorithm. By leveraging the provided definitions and utilizing the structure of WordNet, the code aims to identify the most appropriate sense for each word within its specific context.

### Import and Load Data

The code begins by importing necessary libraries and loading definitions from a TSV file. The definitions are stored in a dictionary named `definitions`, where the keys are words, and the values are lists of their corresponding definitions.
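
As a minimal sketch of this loading step, the snippet below reads a column-per-word TSV into the same dictionary shape. The sample rows, the `SAMPLE_TSV` constant, and the `load_definitions` helper are hypothetical and only illustrate the assumed layout (a leading index column, then one column per word); the real file is read from disk in the notebook cell further down.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature of the TSV layout: the header row holds the target
# words after a leading index column, and each later row holds one set of
# definitions (one per word).
SAMPLE_TSV = (
    "id\tdoor\tpain\n"
    "1\tAn object used to close an entrance\tAn unpleasant feeling\n"
    "2\tA movable barrier in a wall\tA sensation of physical suffering\n"
)

def load_definitions(handle):
    """Return {word: [definition, ...]} from a column-per-word TSV."""
    reader = csv.reader(handle, delimiter="\t")
    words = next(reader)[1:]  # skip the index column of the header
    definitions = defaultdict(list)
    for row in reader:
        # zip pairs each definition with the word heading its column
        for word, definition in zip(words, row[1:]):
            definitions[word].append(definition)
    return dict(definitions)

sample_definitions = load_definitions(io.StringIO(SAMPLE_TSV))
print(sample_definitions["door"])
```

Keying by the header word rather than by column position keeps later lookups readable, for example `sample_definitions["pain"]`.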

### Word Preprocessing

Before running the search, the code defines utility functions for text preprocessing, WordNet lookup, and sense scoring:

1. `remove_punctuation(sentence)`: This function removes punctuation characters from a sentence except for hyphens.
2. `clean_and_split(text, stopwords)`: This function cleans and processes text by removing punctuation, splitting into words, converting to lowercase, excluding stopwords, and lemmatizing the remaining words.
3. `get_words_frequency(words)`: This function calculates the frequency of words.
4. `concat(l1, l2)`: This function concatenates two lists.
5. `get_hyponyms(word)`: This function returns the WordNet synsets of a given word together with all of their transitive hyponyms (more specific terms), using WordNet from NLTK.
6. `signature(sense, stop_words)`: This function computes the signature of a WordNet synset, that is, the set of cleaned words from its definition and examples.
7. `computeOverlap(signature, context)`: This function calculates the overlap between two sets of words.
8. `modifiedLesk(words, disambiguation_context, stop_words)`: This function implements a modified version of the Lesk algorithm to find probable synsets for candidate genera.
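
To make the lookup and scoring helpers concrete, here is a small sketch that does not require WordNet. The toy taxonomy `TOY_HYPONYMS` and the helpers `transitive_hyponyms` and `overlap` are hypothetical stand-ins: the first mimics collecting all descendants of a term, as the hyponym closure does, and the second mirrors the set intersection used for scoring.

```python
# Hypothetical toy taxonomy: each key maps to its direct hyponyms.
TOY_HYPONYMS = {
    "insect": ["beetle", "fly"],
    "beetle": ["ladybug", "weevil"],
    "fly": [],
    "ladybug": [],
    "weevil": [],
}

def transitive_hyponyms(term, taxonomy):
    """Collect every descendant of `term`, breadth-first, without duplicates."""
    found = []
    queue = list(taxonomy.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(taxonomy.get(current, []))
    return found

def overlap(signature, context):
    """Number of words shared by a sense signature and the context."""
    return len(set(signature) & set(context))

descendants = transitive_hyponyms("insect", TOY_HYPONYMS)
shared = overlap({"small", "spotted", "beetle"}, {"small", "red", "beetle", "insect"})
print(descendants, shared)
```

Starting from a general candidate such as "insect" therefore exposes more specific senses such as "ladybug" to the overlap scoring, which is the point of expanding each candidate genus with its hyponyms.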

### Word Disambiguation

For each of the words "door," "ladybug," "pain," and "blurriness," the code performs the following steps:

1. Clean and preprocess each definition, storing the lemmatized words in the `clean_words` list.
2. Combine all cleaned word sets to create a disambiguation context.
3. Calculate the frequency of words in the cleaned definitions.
4. Select the top 15 most frequent words as candidate genera.
5. Apply the modified Lesk algorithm to rank probable senses for the candidate genera, based on the disambiguation context.
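
Steps 2 to 4 can be sketched with the standard library alone. The token lists below are hypothetical, and `top_candidates` is an illustrative helper rather than a function from the notebook; it uses `collections.Counter` in place of the hand-written frequency count.

```python
from collections import Counter
from itertools import chain

# Hypothetical, already-cleaned definitions (one token list per definition).
cleaned = [
    ["object", "close", "entrance", "room"],
    ["movable", "object", "wall", "entrance"],
    ["object", "open", "room"],
]

# Step 2: the disambiguation context is the union of all tokens.
context = set(chain.from_iterable(cleaned))

def top_candidates(token_lists, k):
    """Steps 3-4: count tokens and keep the k most frequent as candidate genera."""
    counts = Counter(chain.from_iterable(token_lists))
    return [word for word, _ in counts.most_common(k)]

candidates = top_candidates(cleaned, 3)
print(candidates)
```

`chain.from_iterable` flattens the lists without modifying them, whereas `ft.reduce(concat, clean_words)` in the notebook extends the first list in place.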

### Results

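For each word, the code prints the five highest-scoring senses along with their overlap scores. In the runs shown below, the closely related `doorway.n.01` ranks first for "door" and `ladybug.n.01` ties for the top score for "ladybug," whereas for "pain" and "blurriness" the top five contain only related senses such as `agony.n.01` and `acuity.n.01`.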
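
The ranking relies on a min-heap of negated scores. The sketch below reproduces that mechanism with plain strings standing in for synsets; the score and name pairs are borrowed from the ladybug output, and the `scored`, `heap`, and `top` names are illustrative only.

```python
import heapq

# (overlap, sense name) pairs; strings stand in for WordNet synsets here.
scored = [
    (5, "ladybug.n.01"),
    (4, "lacewing.n.01"),
    (5, "four-lined_plant_bug.n.01"),
    (4, "color.v.01"),
]

heap = []
for overlap, name in scored:
    # Negate the score so the smallest heap key is the largest overlap.
    heapq.heappush(heap, (-overlap, name))

# nsmallest returns the best-scoring entries; undo the negation for display.
top = [(-neg, name) for neg, name in heapq.nsmallest(2, heap)]
print(top)
```

When two entries share a score, tuple comparison falls through to the second element, so ties are ordered by name. This is consistent with the alphabetical order of equal-score synsets in the printed results.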


## Import and Load Data


In [18]:
import csv
from nltk import WordNetLemmatizer
import string
import functools as ft
from nltk.corpus import wordnet as wn
import heapq as hq
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="nltk.corpus.reader.wordnet")

stop_words = list(stopwords.words('english'))

# Load definitions from the TSV file
definitions = {}
words = []

with open("../lab2/TLN-definitions-23.tsv", "r", encoding="utf-8") as tsvfile:
    reader = csv.reader(tsvfile, delimiter="\t")
    
    # Read the first row to get the words
    words = next(reader)[1:]
    
    for row in reader:
        for i, definition in enumerate(row[1:]):
            word = words[i]
            if word not in definitions:
                definitions[word] = []
            definitions[word].append(definition)

keys = list(definitions.keys())

door_def = definitions[keys[0]]
ladybug_def = definitions[keys[1]]
pain_def = definitions[keys[2]]
blurriness_def = definitions[keys[3]]


### Word Preprocessing

In [19]:
def remove_punctuation(sentence):
    """
    Given a sentence, this function removes all punctuation characters except for the hyphen (-).
    
    Parameters:
    sentence (str): The sentence from which punctuation should be removed.

    Returns:
    str: The sentence with all punctuation except hyphens removed.
    """
    for character in string.punctuation:
        if character not in ['-']:
            sentence = sentence.replace(character, '')
    return sentence

def clean_and_split(text, stopwords):
    """
    Given a text, this function performs the following operations:
    1. Removes punctuation except for hyphens using the remove_punctuation function.
    2. Splits the text into individual words.
    3. Converts each word to lowercase.
    4. Filters out any words that are in the provided list of stopwords.
    5. Lemmatizes each remaining word using the WordNetLemmatizer.
    
    Parameters:
    text (str): The text to be cleaned and split.
    stopwords (list): A list of words to be excluded from the final list.

    Returns:
    list: A list of lemmatized words from the text, excluding any stopwords.
    """
    lemmatizer = WordNetLemmatizer()
    text = remove_punctuation(text)
    return [lemmatizer.lemmatize(w.lower()) for w in text.split() if w.lower() not in stopwords]


In [20]:
def get_words_frequency(words):
    """
    Calculates the frequency of the input words.
    
    Parameters:
    words (list): The list of words for which to calculate frequencies.

    Returns:
    list: A list of tuples (word, frequency), sorted in descending order by frequency.
    """
    count = {}
    for word in words:
        if word in count:
            count[word] += 1
        else:
            count[word] = 1
    words_and_counts = [(w, count[w]) for w in count]
    words_and_counts.sort(key=lambda wac: wac[1], reverse=True)
    return words_and_counts

def concat(l1, l2):
    """
    Concatenates two lists by extending the first list with the second one.
    
    Parameters:
    l1 (list): The first list to which the second list will be added.
    l2 (list): The second list which will be added to the first list.

    Returns:
    list: The concatenated list.
    """
    l1.extend(l2)
    return l1


In [21]:
def get_hyponyms(word):
    """
    Retrieves the synsets of a given word together with all of their transitive
    hyponyms (more specific terms), using the WordNet corpus from the NLTK library.
    
    Parameters:
    word (str): The word to look up in WordNet.

    Returns:
    list: The synsets of the word, followed by every synset reachable from them via hyponym links.
    """
    senses = wn.synsets(word)
    hyponyms = [list(s.closure(lambda s: s.hyponyms())) for s in senses]
    non_empty_hyponyms = [h for h in hyponyms if h]  # Filter out empty lists
    if non_empty_hyponyms:
        senses.extend(ft.reduce(concat, non_empty_hyponyms))

    return senses


def signature(sense, stop_words):
    """
    Computes the signature of a WordNet synset, namely the words contained in its definition and examples.
    
    Parameters:
    sense (Synset): The WordNet synset.
    stop_words (list): A list of words to ignore.

    Returns:
    set: The set of words that compose the signature of the sense.
    """
    s = sense.definition()
    for e in sense.examples():
        s = s + " " + e
    return set(clean_and_split(s, stop_words))

def computeOverlap(signature, context):
    """
    Computes the size of the overlap between two sets of words.
    
    Parameters:
    signature (set): The first set of words.
    context (set): The second set of words.

    Returns:
    int: The size of the overlap between the two sets.
    """
    return len(signature & context)

def modifiedLesk(words, disambiguation_context, stop_words):
    """
    Modified version of the Lesk algorithm that returns a list of probable synsets for a set of candidate genera.
    
    Parameters:
    words (list): The possible genera.
    disambiguation_context (iterable): The words of the disambiguation context to use.
    stop_words (list): A list of words to ignore.

    Returns:
    list: A list of tuples (-score, synset, definition) organized in a minimum heap (the most probable has the lowest key).
    """
    senses_rank = []
    explored = set()
    # Build the context set once instead of once per sense
    context_set = set(disambiguation_context)
    for word in words:
        for sense in get_hyponyms(word):
            # Skip synsets already scored through another candidate genus
            if sense.name() in explored:
                continue
            explored.add(sense.name())
            overlap = computeOverlap(signature(sense, stop_words), context_set)
            # Negate the overlap so the min-heap surfaces the highest score first
            hq.heappush(senses_rank, (-overlap, sense, sense.definition()))
    return senses_rank





### Door


In [22]:
# Clean and split each door definition into lemmatized words, excluding any stopwords
# (one list of words per definition)
clean_words = [clean_and_split(definition, stop_words) for definition in door_def]

# Reduce the list of sets of words (one set for each cleaned definition) to a single set containing all words
# This set represents the disambiguation context
disambiguation_context = ft.reduce(set.union, [set(words) for words in clean_words])


# Calculate the frequency of each word in the cleaned definitions
# Reduce the list of lists of words to a single list containing all words
word_frequencies = get_words_frequency(ft.reduce(concat, clean_words))
# Get the 15 most frequent words as the candidate genera
candidates_genus = [wf[0] for wf in word_frequencies[:15]]

# Use the modified Lesk algorithm to get a ranked heap of probable senses for the candidate genera
senses_rank = modifiedLesk(candidates_genus, disambiguation_context, stop_words)
# Take the 5 most probable senses (the smallest keys, since scores are stored negated)
top_senses = hq.nsmallest(5, senses_rank)

print("Term we are looking for: door\n")
print()
for score, synset, definition in top_senses:
    # Scores are stored negated in the heap; abs() restores the overlap
    print(f"Score: {abs(score)}")
    print(f"Synset name: {synset.name()}")
    print(f"Definition: {definition}\n")


Term we are looking for: door


Score: 9
Synset name: doorway.n.01
Definition: the entrance (the space in a wall) through which you enter or leave a room or building; the space that a door can close

Score: 7
Synset name: leave.v.06
Definition: make a possibility or provide opportunity for; permit to be attainable or cause to remain

Score: 7
Synset name: partition.n.01
Definition: a vertical structure that divides or separates (as a wall divides one room from another)

Score: 7
Synset name: wall.n.01
Definition: an architectural partition with a height and length greater than its thickness; used to divide or enclose an area or to support another structure

Score: 6
Synset name: adapter.n.02
Definition: device that enables something to be used in a way different from that for which it was intended or makes different pieces of apparatus compatible



### Ladybug

In [23]:
# Clean and split each ladybug definition (one list of words per definition)
clean_words = [clean_and_split(definition, stop_words) for definition in ladybug_def]
disambiguation_context = ft.reduce(set.union, [set(words) for words in clean_words])



word_frequencies = get_words_frequency(ft.reduce(concat, clean_words))
candidates_genus = [wf[0] for wf in word_frequencies[:15]]



senses_rank = modifiedLesk(candidates_genus, disambiguation_context, stop_words)

# Take the 5 most probable senses (the smallest keys, since scores are stored negated)
top_senses = hq.nsmallest(5, senses_rank)

print("Term we are looking for: ladybug")
print()

for score, synset, definition in top_senses:
    
    print(f"Score: {abs(score)}")
    print(f"Synset name: {synset.name()}")
    print(f"Definition: {definition}\n")

Term we are looking for: ladybug

Score: 5
Synset name: four-lined_plant_bug.n.01
Definition: yellow or orange leaf bug with four black stripes down the back; widespread in central and eastern North America

Score: 5
Synset name: ladybug.n.01
Definition: small round bright-colored and spotted beetle that usually feeds on aphids and other insect pests

Score: 4
Synset name: color.v.01
Definition: add color to

Score: 4
Synset name: dipterous_insect.n.01
Definition: insects having usually a single pair of functional wings (anterior pair) with the posterior pair reduced to small knobbed structures and mouth parts adapted for sucking or lapping or piercing

Score: 4
Synset name: lacewing.n.01
Definition: any of two families of insects with gauzy wings (Chrysopidae and Hemerobiidae); larvae feed on insect pests such as aphids



### Pain

In [24]:
# Clean and split each pain definition (one list of words per definition)
clean_words = [clean_and_split(definition, stop_words) for definition in pain_def]
disambiguation_context = ft.reduce(set.union, [set(words) for words in clean_words])


word_frequencies = get_words_frequency(ft.reduce(concat, clean_words))
candidates_genus = [wf[0] for wf in word_frequencies[:15]]


senses_rank = modifiedLesk(candidates_genus, disambiguation_context, stop_words)

# Take the 5 most probable senses (the smallest keys, since scores are stored negated)
top_senses = hq.nsmallest(5, senses_rank)

print("Term we are looking for: pain")
print()

for score, synset, definition in top_senses:
    
    print(f"Score: {abs(score)}")
    print(f"Synset name: {synset.name()}")
    print(f"Definition: {definition}\n")

Term we are looking for: pain

Score: 7
Synset name: bad.s.03
Definition: feeling physical discomfort or pain (`tough' is occasionally used colloquially for `bad')

Score: 5
Synset name: affection.n.01
Definition: a positive feeling of liking

Score: 5
Synset name: agony.n.01
Definition: intense feelings of suffering; acute mental or physical pain

Score: 5
Synset name: ardor.n.01
Definition: a feeling of strong eagerness (usually in favor of a person or cause)

Score: 5
Synset name: constriction.n.03
Definition: a tight feeling in some part of the body



### Blurriness

In [25]:
# Clean and split each blurriness definition (one list of words per definition)
clean_words = [clean_and_split(definition, stop_words) for definition in blurriness_def]
disambiguation_context = ft.reduce(set.union, [set(words) for words in clean_words])


word_frequencies = get_words_frequency(ft.reduce(concat, clean_words))
candidates_genus = [wf[0] for wf in word_frequencies[:15]]


senses_rank = modifiedLesk(candidates_genus, disambiguation_context, stop_words)

# Take the 5 most probable senses (the smallest keys, since scores are stored negated)
top_senses = hq.nsmallest(5, senses_rank)

print("Term we are looking for: blurriness")
print()

for score, synset, definition in top_senses:
    
    print(f"Score: {abs(score)}")
    print(f"Synset name: {synset.name()}")
    print(f"Definition: {definition}\n")

Term we are looking for: blurriness

Score: 6
Synset name: accommodating_lens_implant.n.01
Definition: a lens implant containing a hinge that allows for both near and far vision (thus mimicking the natural lens of a young person)

Score: 6
Synset name: collage.n.01
Definition: a paste-up made by sticking together pieces of paper or photographs to form an artistic image

Score: 6
Synset name: picture.n.01
Definition: a visual representation (of an object or scene or person or abstraction) produced on a surface

Score: 5
Synset name: acuity.n.01
Definition: sharpness of vision; the visual ability to resolve fine detail (usually measured by a Snellen chart)

Score: 5
Synset name: adapter.n.02
Definition: device that enables something to be used in a way different from that for which it was intended or makes different pieces of apparatus compatible

