## Semantic Processing Challenge: Word Sense Disambiguation (WSD) with Lesk and WordNet

In this notebook, we will build upon our previous work to dive into **Semantic Processing** by exploring **Word Sense Disambiguation (WSD)**, the computational problem of determining which sense (meaning) of a word is activated by its use in a particular context. We will focus on **knowledge-based WSD** using the **WordNet** lexical database and implement the classic **Lesk Algorithm**. Crucially, we'll integrate `spaCy` for automated **Part-of-Speech (POS)** tagging, a vital step that significantly improves the accuracy of WordNet lookups.



#### Note:

As you have seen in the previous challenge, we will use our email thread dataset that has 4167 threads and 21684 emails. Like the previous challenges, we will only use a small sample of the dataset and focus on exploring the concepts of semantic processing.

#### We will cover the following items as part of this notebook

- **Word Sense Disambiguation (WSD)**: Definition and challenge.

- **Lesk Algorithm Implementation**: Step-by-step WSD based on context and sense definitions.

- **Integration with spaCy**: Using automated POS tags to filter WordNet senses.

- **WordNet Similarity Measures**: Applying similarity to compare sense definitions.

- **Practical Disambiguation**: Applying the combined method to example sentences.

#### What we will be learning from this challenge

- **WSD** is essential for true language understanding, distinguishing meanings like "bank" (river) vs. "bank" (financial).

- The **Lesk Algorithm** works by finding the maximum lexical overlap between the context of an ambiguous word and the definitions (glosses) of its potential senses.

- **POS Tagging** (e.g., distinguishing noun vs. verb) is crucial for drastically reducing the number of candidate senses.

- **WordNet** provides the semantic network (senses, glosses, examples) needed for knowledge-based WSD.

**Let's get started now**

### Setup and Prerequisites

We will need `spacy` for POS tagging and `nltk` for WordNet and the Lesk implementation.

In [None]:
# Install the main libraries
! pip install spacy
! pip install nltk



We also need to download the necessary NLTK resources and a spaCy model.

In [None]:
# Download the specific spaCy language model
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 25.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


**Configuration and Library Versions**

In [4]:
import yaml

config_path='/Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/'
# Load the environment.yml file
print (config_path + "/configs/environment.yml")
with open(config_path + "/configs/environment.yml", "r") as f:
    config = yaml.safe_load(f)

# Choose environment (local or aws)
env = "local"   # or "aws"

base_path = config[env]["base_path"]
raw_data_path = base_path + config[env]["raw_data"]
processed_data_path = base_path + config[env]["processed_data"]
models_path = base_path + config[env]["models"]

print("Raw data path:", raw_data_path)
print("Processed data path:",  processed_data_path)
print("Models path:",  models_path)

/Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic//configs/environment.yml
Raw data path: /Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/data/raw/
Processed data path: /Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/data/processed/
Models path: /Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/models/


In [5]:
import spacy
import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from spacy.lang.en import English
import json

print(f"spaCy Version: {spacy.__version__}")
print(f"NLTK Version: {nltk.__version__}")

# Download NLTK resources required for WSD and POS tagging
try:
    nltk.download('wordnet', quiet=True)
    nltk.download('omw-1.4', quiet=True)
    # Note: We rely on spaCy for tagging, but include the NLTK POS tagger resources just in case.
    nltk.download('averaged_perceptron_tagger', quiet=True)
except Exception as e:
    print(f"NLTK download error: {e}")

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

spaCy Version: 3.8.7
NLTK Version: 3.9.2


### Loading the dataset

We will use the same dataset as in the previous exercise, but only a small sample of it.

In [6]:
# Load the JSON data, closing each file handle once it has been read
with open(raw_data_path + "/email_thread_details.json") as f:
    email_data = json.load(f)
with open(raw_data_path + "/email_thread_summaries.json") as f:
    email_summary = json.load(f)

In [7]:
# Use the first sentence of one fixed summary record (index 100) as our sample text
SAMPLE_TEXT = email_summary[100]['summary'].split(". ")[0]
print(SAMPLE_TEXT)

Bert sent multiple emails with attached files containing weekend notes for different dates


Let us draw a random subset of 100 summaries to keep the experiments small.

In [8]:
import random

sampled_keys = random.sample(list(range(len(email_summary))), 100)

sub_email_dataset = [email_summary[k] for k in sampled_keys]

We will apply the semantic processing methods below to this small subset.
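Because the subset is drawn without a fixed seed, it changes on every run, and so do the summaries that contain a given target word. The following is a minimal sketch of reproducible sampling, assuming a private seeded generator is acceptable here. The `records` list, its `thread_id` field, the `sample_records` helper, and the seed value are all hypothetical stand-ins, not part of the dataset or the notebook.

```python
import random

# Hypothetical stand-in for the list loaded from email_thread_summaries.json.
records = [{"thread_id": i, "summary": f"summary {i}"} for i in range(1000)]

def sample_records(items, k, seed=42):
    """Draw k items without replacement using a private, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(items, k)

subset_a = sample_records(records, 100)
subset_b = sample_records(records, 100)

# Same seed, same input: the two draws are identical.
print(len(subset_a), subset_a == subset_b)
```

Using `random.Random(seed)` rather than `random.seed(...)` keeps the global random state untouched, which may matter if other cells also rely on randomness.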

### Word Sense Disambiguation (WSD)

**Word Sense Disambiguation (WSD)** is the process of identifying which meaning (or sense) of a word is used in a specific context. For example, the word *pitcher* can refer to a *container for liquids* or a *player in baseball*. WSD aims to resolve this ambiguity based on the surrounding text.

Before proceeding with our dataset, let us take some sample examples to understand this.

Consider the ambiguous word *bank* in these two sentences:

| Sentence | Ambiguous Word | Correct Sense |
| --- | --- | --- |
| "I withdrew money from the bank." | bank | Financial institution (noun) |
| "We walked along the river bank." | bank | Edge of a river (noun) |

The **Lesk Algorithm** is a knowledge-based WSD technique. It finds the correct sense of an ambiguous word by comparing the **dictionary definition (gloss)** of each candidate sense with the context of the ambiguous word. The sense whose gloss shares the most words (has the **maximum lexical overlap**) with the context is selected as the correct one.
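To make the overlap idea concrete, here is a minimal sketch of a simplified Lesk scorer in plain Python. It is not NLTK's implementation: the sense labels and glosses in `TOY_SENSES` are made up for illustration and are not WordNet entries, and `simplified_lesk` is a hypothetical helper name.

```python
# Toy sense inventory; labels and glosses are illustrative, not WordNet entries.
TOY_SENSES = {
    "bank.financial": "a financial institution that accepts deposits and lends money",
    "bank.river": "sloping land beside a river or lake",
}

def simplified_lesk(context_tokens, senses):
    """Return (best_sense, overlap_counts), scoring each gloss by shared lowercase words."""
    context = {tok.lower() for tok in context_tokens}
    overlaps = {
        sense: len(context & set(gloss.lower().split()))
        for sense, gloss in senses.items()
    }
    # The sense with the largest overlap wins; ties go to the first sense listed.
    best = max(overlaps, key=overlaps.get)
    return best, overlaps

best, overlaps = simplified_lesk("We walked along the river bank".split(), TOY_SENSES)
print(best, overlaps)
```

Only exact word matches count in this sketch, so "deposited" in a sentence would not match "deposits" in a gloss. That brittleness is one reason overlap-based methods can struggle on short contexts.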

We use NLTK's built-in Lesk function.

In [9]:
target_word = "bank"
context_1 = "I withdrew money from the financial institution near the square."

# Apply Lesk's algorithm
best_sense_1 = lesk(context_1.split(), target_word)

print(f"--- Disambiguation for: '{context_1}' ---")
print(f"Target Word: {target_word}")
print(f"Best Sense: {best_sense_1.name()}")
print(f"Definition: {best_sense_1.definition()}")

--- Disambiguation for: 'I withdrew money from the financial institution near the square.' ---
Target Word: bank
Best Sense: depository_financial_institution.n.01
Definition: a financial institution that accepts deposits and channels the money into lending activities


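Lesk compared the context words (e.g. 'money', 'financial', 'institution') with the definitions of all senses of 'bank' and found the largest overlap with the 'financial institution' sense. Note that this context was worded to share vocabulary with that gloss and does not contain the word 'bank' itself; on natural sentences the overlap can be much smaller and the result less reliable.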

Now, let us apply this method to our dataset and see how it handles an ambiguous word.

In [10]:
# Pick a sample target word: 'copy'
target_word = "copy"

for context_summary in sub_email_dataset:
  if target_word in context_summary['summary'].split():
    best_sense = lesk(context_summary['summary'].split(), target_word)

    print(f"--- Disambiguation for: '{context_summary['summary']}' ---")
    print(f"Target Word: {target_word}")
    print(f"Best Sense: {best_sense.name()}")
    print(f"Definition: {best_sense.definition()}")

--- Disambiguation for: 'David Minns contacted Shari Stack regarding an ISDA Master Agreement with Westpac for Enron Australia Finance Pty Limited. David asked about changes to the credit sheet and requested a copy of the Enron Corp. guarantee. Margaret Lindeman from Westpac sent David a draft of the ISDA Schedule based on the agreement with Enron North America Corp. Shari Stack expressed surprise at Westpac's email and mentioned that she had sent a mark-up of the schedule to Westpac. Sara Shackleton asked if the master agreement had been executed.' ---
Target Word: copy
Best Sense: replicate.v.02
Definition: reproduce or make an exact copy of
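The membership test in the loop above relies on whitespace splitting, so it only matches summaries where "copy" appears as a bare lowercase token. Occurrences such as "copy." or "Copy" would be skipped. Below is a small sketch of a more forgiving tokenizer, assuming lowercase matching on letters only is good enough for this purpose; `simple_tokens` and the sample text are hypothetical and not part of the notebook.

```python
import re

def simple_tokens(text):
    """Lowercase the text and keep runs of letters (with inner apostrophes) as tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

text = "Please send a copy. Copy the file, then archive the copy!"

# Whitespace splitting leaves punctuation attached ('copy.', 'copy!') and keeps case ('Copy').
print("copy" in text.split())

# The regex tokenizer finds all three occurrences.
print(simple_tokens(text).count("copy"))
```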


#### Integration with spaCy for Automated POS Tagging

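A word's part of speech (e.g., noun vs. verb) significantly limits its potential senses. In the output above, plain Lesk chose a verb sense (`replicate.v.02`) even though the summary uses 'copy' as a noun ('requested a copy'), which is exactly the kind of error a POS filter can prevent.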

| Word | POS | Senses (WordNet) |
| --- | --- | --- |
| train | Noun | Locomotive, Sequence (of events) |
| train | Verb | To teach, To aim (a weapon) |
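The effect of the POS filter can be sketched on a toy inventory before bringing in WordNet. Everything below is hypothetical: the sense labels, the glosses, and the helper names `candidates` and `best_by_overlap` are invented for illustration, and the part of speech is simply assumed to come from a tagger.

```python
# Hypothetical sense inventory: (sense label, part of speech, gloss). Not WordNet data.
TOY_SENSES = [
    ("train.noun.1", "n", "a line of railway cars pulled by a locomotive"),
    ("train.noun.2", "n", "a sequence of events that follow one another"),
    ("train.verb.1", "v", "teach a person or animal a skill for an event"),
    ("train.verb.2", "v", "exercise hard to prepare for a competition"),
]

def candidates(senses, pos=None):
    """Keep only senses with the given POS; pos=None keeps everything."""
    return [s for s in senses if pos is None or s[1] == pos]

def best_by_overlap(context_tokens, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {t.lower() for t in context_tokens}
    return max(senses, key=lambda s: len(context & set(s[2].split())))[0]

context = "She will train hard for a sequence of events".split()

# The filter halves the candidate set (4 senses down to 2 verb senses).
print(len(candidates(TOY_SENSES)), len(candidates(TOY_SENSES, "v")))

unfiltered = best_by_overlap(context, candidates(TOY_SENSES))
filtered = best_by_overlap(context, candidates(TOY_SENSES, "v"))
print(unfiltered, filtered)
```

In this contrived sentence the unfiltered scorer is drawn to a noun gloss that happens to share several words with the context, while restricting candidates to verbs yields the exercise sense.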

WordNet requires a specific POS format (`wn.NOUN`, `wn.VERB`, etc.). We map spaCy's UPOS tags to the WordNet format.

In [11]:
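# Map spaCy's coarse UPOS tags (token.pos_) to WordNet POS constants.
# Prefix checks such as startswith('J') or startswith('R') only fit Penn Treebank
# tags (token.tag_), so with UPOS they would never match ADJ or ADV.
UPOS_TO_WORDNET = {
    'NOUN': wn.NOUN,
    'VERB': wn.VERB,
    'ADJ': wn.ADJ,
    'ADV': wn.ADV,
}

def spacy_to_wordnet(spacy_tag):
    # Returns None for tags without a WordNet counterpart (e.g. DET, PRON, NUM)
    return UPOS_TO_WORDNET.get(spacy_tag)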

# Combined Lesk function using spaCy for POS
def lesk_with_pos(sentence, ambiguous_word):
    # 1. Use spaCy to process the sentence and get POS tags
    doc = nlp(sentence)

    # 2. Extract context words (all words except the target word)
    context = [token.text for token in doc if token.text.lower() != ambiguous_word.lower()]

    # 3. Get the POS tag for the first occurrence of the ambiguous word
    matches = [token for token in doc if token.text.lower() == ambiguous_word.lower()]
    if not matches:
        raise ValueError(f"'{ambiguous_word}' not found in sentence")
    wordnet_pos = spacy_to_wordnet(matches[0].pos_)

    print(f"Identified WordNet POS: {wordnet_pos}")

    # 4. Apply NLTK's Lesk algorithm, filtering senses by the POS tag
    best_sense = lesk(context, ambiguous_word, pos=wordnet_pos)
    return best_sense

# Example Sentence
sentence_2 = "The athlete needs to train hard for the Olympics."
ambiguous_word_2 = "train"

best_sense_2 = lesk_with_pos(sentence_2, ambiguous_word_2)

print("\n--- Disambiguation with spaCy POS Filter ---")
print(f"Target Word: {ambiguous_word_2}")
print(f"Best Sense: {best_sense_2.name()}")
print(f"Definition: {best_sense_2.definition()}")


Identified WordNet POS: v

--- Disambiguation with spaCy POS Filter ---
Target Word: train
Best Sense: train.v.08
Definition: exercise in order to prepare for an event or competition


spaCy correctly identifies 'train' as a VERB. Lesk then considers only the verb senses of 'train', which improves the chance of selecting the correct sense. Here it picks `train.v.08` ('exercise in order to prepare for an event or competition'), which fits the sentence.

Next, let us try this on our sample dataset.

In [12]:
# Pick a sample target word: 'copy'
target_word = "copy"

for context_summary in sub_email_dataset:
  if target_word in context_summary['summary'].split():
    best_sense = lesk_with_pos(context_summary['summary'], target_word)

    print(f"--- Disambiguation with spaCy POS Filter: '{context_summary['summary']}' ---")
    print(f"Target Word: {target_word}")
    print(f"Best Sense: {best_sense.name()}")
    print(f"Definition: {best_sense.definition()}")

Identified WordNet POS: n
--- Disambiguation with spaCy POS Filter: 'David Minns contacted Shari Stack regarding an ISDA Master Agreement with Westpac for Enron Australia Finance Pty Limited. David asked about changes to the credit sheet and requested a copy of the Enron Corp. guarantee. Margaret Lindeman from Westpac sent David a draft of the ISDA Schedule based on the agreement with Enron North America Corp. Shari Stack expressed surprise at Westpac's email and mentioned that she had sent a mark-up of the schedule to Westpac. Sara Shackleton asked if the master agreement had been executed.' ---
Target Word: copy
Best Sense: transcript.n.02
Definition: a reproduction of a written record (e.g. of a legal or school record)


#### WordNet Similarity Measures

WordNet is a semantic network, connecting senses through relations like hypernymy (is-a relation) and meronymy (part-of relation). We can use these relations to calculate similarity between senses.

We'll compare the selected sense of "train" (verb) with a different verb sense, `train.v.03`, which WordNet resolves to the synset `discipline.v.01` ("develop behavior by instruction and practice").

In [13]:
# Get the sense chosen by Lesk
sense_A = best_sense_2

# Define an alternative verb sense; 'train.v.03' resolves to discipline.v.01
sense_B = wn.synset('train.v.03')  # develop behavior by instruction and practice

# A noun sense for contrast (railway train)
sense_C = wn.synset('train.n.01')

print("--- WordNet Similarity Comparison ---")
print(f"Sense A (Lesk): {sense_A.name()} | Def: {sense_A.definition()}")
print(f"Sense B (Alternative V): {sense_B.name()} | Def: {sense_B.definition()}")
print(f"Sense C (Incorrect N): {sense_C.name()} | Def: {sense_C.definition()}")

# Calculate the path similarity: 1 / (1 + shortest path length in the hypernym hierarchy)
# Note: the comparison is only meaningful between senses of the same POS
similarity_AB = sense_A.path_similarity(sense_B)
# sense_A vs. sense_C is skipped: noun and verb senses live in separate hierarchies,
# so the call may return None or a very low value

print(f"\nPath Similarity (A vs B - V vs V): {similarity_AB:.2f}")

--- WordNet Similarity Comparison ---
Sense A (Lesk): train.v.08 | Def: exercise in order to prepare for an event or competition
Sense B (Alternative V): discipline.v.01 | Def: develop (children's) behavior by instruction and practice; especially to teach self-control
Sense C (Incorrect N): train.n.01 | Def: public transport provided by a line of railway cars coupled together and drawn by a locomotive

Path Similarity (A vs B - V vs V): 0.17


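Path similarity measures semantic closeness. NLTK computes it as 1 / (1 + d), where d is the length of the shortest path between the two senses in the hypernym hierarchy, so scores lie between 0 and 1 and a sense compared with itself scores 1.0. The 0.17 above is 1/6 rounded, i.e. a shortest path of five edges, which marks the two verb senses as only loosely related. A higher score (closer to 1.0) indicates that the senses are conceptually closer within the WordNet hierarchy. The measure can serve as a sanity check on Lesk's output or as a tie-breaker when several senses yield similar overlap scores.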
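As a rough illustration of the path-based score, the sketch below computes 1 / (1 + shortest path length) on a tiny hand-made is-a hierarchy using breadth-first search. The `HYPERNYMS` edges and the helper names are hypothetical and not taken from WordNet, and the function only mirrors the formula rather than NLTK's internals.

```python
from collections import deque

# Hypothetical is-a edges (child -> parent); not taken from WordNet.
HYPERNYMS = {
    "sprint": "run",
    "run": "move",
    "jog": "run",
    "swim": "move",
    "move": "act",
    "teach": "act",
}

def neighbors(node):
    """Treat is-a links as undirected edges for path counting."""
    linked = {child for child, parent in HYPERNYMS.items() if parent == node}
    if node in HYPERNYMS:
        linked.add(HYPERNYMS[node])
    return linked

def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or None when no path exists."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (1 + dist)
        for nxt in neighbors(node) - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

print(round(path_similarity("sprint", "jog"), 2))   # siblings under "run": 2 edges apart
print(path_similarity("sprint", "teach"))           # related only through "act": 4 edges apart
```

Siblings under a shared parent score higher than concepts that meet only near the top of the hierarchy, which is the intuition behind reading a low score as "loosely related".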

#### Practical Disambiguation

Let's apply the combined `lesk_with_pos` function to distinguish between the noun and verb uses of **copy**.

In [14]:
target_word = "copy"

# Sentence 3: Document or book context (Noun)
sentence_3 = "Sara requested for a copy."

# Sentence 4: Transferring documents context (Verb)
sentence_4 = "Enron employees transferring to UBS Warburg Energy are required to copy all documents themselves."

print("\n--- Practical Disambiguation of 'copy' ---")

# Disambiguation 1: Noun Context
best_sense_3 = lesk_with_pos(sentence_3, target_word)
print(f"\nSentence: '{sentence_3}'")
print(f"Selected Sense: {best_sense_3.name()}")
print(f"Definition: {best_sense_3.definition()}")
print("-" * 20)

# Disambiguation 2: Verb
best_sense_4 = lesk_with_pos(sentence_4, target_word)
print(f"\nSentence: '{sentence_4}'")
print(f"Selected Sense: {best_sense_4.name()}")
print(f"Definition: {best_sense_4.definition()}")


--- Practical Disambiguation of 'copy' ---
Identified WordNet POS: n

Sentence: 'Sara requested for a copy.'
Selected Sense: copy.n.04
Definition: material suitable for a journalistic account
--------------------
Identified WordNet POS: v

Sentence: 'Enron employees transferring to UBS Warburg Energy are required to copy all documents themselves.'
Selected Sense: copy.v.01
Definition: copy down as is
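The noun result above (`copy.n.04`, journalistic material) looks off for a sentence about requesting a document. One plausible explanation is that very short contexts leave only function words to overlap with the glosses. The sketch below checks this on the two noun glosses printed in this notebook. The token list is an assumed tokenization of the sentence with the target word removed, the stopword set is a small hand-picked list rather than NLTK's, and `overlap` is a hypothetical helper, not NLTK's scoring code.

```python
# Glosses copied from the outputs shown in this notebook.
GLOSSES = {
    "transcript.n.02": "a reproduction of a written record (e.g. of a legal or school record)",
    "copy.n.04": "material suitable for a journalistic account",
}

# Small hand-picked stopword list (an assumption, not NLTK's list).
STOPWORDS = {"a", "an", "the", "of", "for", "or", "to", "in"}

def overlap(context_tokens, gloss, stopwords=frozenset()):
    """Count distinct words shared by context and gloss, ignoring stopwords."""
    context = set(context_tokens) - stopwords
    return len(context & (set(gloss.split()) - stopwords))

# Assumed tokenization of "Sara requested for a copy." with the target word removed.
context = ["Sara", "requested", "for", "a", "."]

raw = {name: overlap(context, g) for name, g in GLOSSES.items()}
filtered = {name: overlap(context, g, STOPWORDS) for name, g in GLOSSES.items()}
print(raw)
print(filtered)
```

With stopwords counted, the journalistic gloss wins only because it contains "for" and "a". Once they are removed, neither gloss overlaps with the context at all, so the sentence alone gives an overlap-based method nothing to decide on.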


#### Conclusion

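In this challenge, we implemented **Word Sense Disambiguation** using the **Lesk Algorithm** and improved it with spaCy's automated POS tagging. By comparing the words surrounding an ambiguous word with the sense definitions in WordNet, and restricting the candidates to the right part of speech, we could computationally choose among the senses of ambiguous words like "copy" and "train." The results are not always right: for "Sara requested for a copy." the method selected `copy.n.04` ("material suitable for a journalistic account"), which shows how sensitive simple gloss overlap is to short contexts.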

This notebook demonstrates a foundational technique in **Semantic Processing**, highlighting how knowledge-based resources (WordNet) combined with robust NLP tools (spaCy, NLTK) allow machines to move beyond mere syntax and begin to grasp the meaning of text. WSD is a critical component for tasks like Machine Translation, Question Answering, and Information Retrieval, where understanding the precise intended meaning is paramount.