### Name- Aditya Kumar Tiwari
### Roll no.- MSA23023

**NLP Lab 4:** Word Sense Disambiguation Task

**Objective:** Implement a knowledge-based Word Sense Disambiguation system using the Lesk algorithm and evaluate its performance on ambiguous words in context.

In [1]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
import string

In [2]:
def preprocess_text(text):
    """Remove punctuation and stopwords, convert to lowercase."""
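    # Tokenize and convert to lowercase
    tokens = word_tokenize(text.lower())

    # Remove stopwords and punctuation-only tokens. Requiring at least one
    # alphanumeric character also drops multi-character tokens such as "``"
    # or "...", which a membership test against string.punctuation would keep.
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens
              if word not in stop_words and any(ch.isalnum() for ch in word)]

    return tokens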

In [3]:
def lesk_algorithm(context_sentence, ambiguous_word):
    """
    Implement the Lesk algorithm for word sense disambiguation.
    
    Args:
        context_sentence (str): The sentence containing the ambiguous word
        ambiguous_word (str): The word to be disambiguated
    
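    Returns:
        The WordNet synset whose signature (definition plus examples) shares
        the most words with the context. Falls back to the first listed
        synset when no sense overlaps, and returns None if WordNet has no
        synsets for the word.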
    """
    # Preprocess the context sentence
    context = set(preprocess_text(context_sentence))
    
    # Get all possible senses of the ambiguous word
    word_senses = wn.synsets(ambiguous_word)
    
    if not word_senses:
        return None
    
    # Find the sense with maximum overlap
    max_overlap = 0
    best_sense = word_senses[0]  # default to first sense
    
    for sense in word_senses:
        # Create signature from definition and examples
        signature = set(preprocess_text(sense.definition()))
        
        # Add examples to signature
        for example in sense.examples():
            signature.update(set(preprocess_text(example)))
        
        # Calculate overlap between context and signature
        overlap = len(context.intersection(signature))
        
        # Update best sense if current overlap is greater
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
            
    return best_sense
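
To make the overlap scoring in `lesk_algorithm` easier to follow, here is a minimal sketch that runs without NLTK. The glosses, the stopword set, and the names `TOY_SENSES`, `toy_tokens`, and `toy_lesk` are hand-written assumptions for illustration only; they are not WordNet entries and not part of the lab code.

```python
# Toy illustration of the max-overlap rule.
# The glosses below are hand-written stand-ins, NOT actual WordNet definitions.
TOY_STOPWORDS = {"the", "a", "of", "was", "that", "and", "beside"}

TOY_SENSES = {
    "bank_river": "sloping land beside a river or lake",
    "bank_money": "a financial institution that accepts deposits of money",
}


def toy_tokens(text):
    """Lowercase, split on whitespace, strip punctuation, drop toy stopwords."""
    words = (w.strip(".,;:!?()") for w in text.lower().split())
    return {w for w in words if w and w not in TOY_STOPWORDS}


def toy_lesk(sentence, senses):
    """Return (best_label, overlaps); the first sense wins ties."""
    context = toy_tokens(sentence)
    overlaps = {label: len(context & toy_tokens(gloss))
                for label, gloss in senses.items()}
    best = max(overlaps, key=overlaps.get)
    return best, overlaps


best, overlaps = toy_lesk("The bank of the river was muddy.", TOY_SENSES)
print(best, overlaps)
```

Only the word "river" is shared between the context and the first toy gloss, so that sense wins with an overlap of 1 against 0.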

In [4]:
def evaluate_wsd(test_cases):
    """
    Evaluate the WSD system on multiple test cases.
    
    Args:
        test_cases: List of tuples (sentence, ambiguous_word, correct_sense_key),
            where correct_sense_key is a WordNet synset name such as 'bank.n.01'
    
    Returns:
        accuracy: Percentage of correct disambiguations
    """
    total = len(test_cases)
    if total == 0:
        return 0.0  # nothing to evaluate; avoids division by zero

    correct = 0
    for sentence, word, correct_sense_key in test_cases:
        predicted_sense = lesk_algorithm(sentence, word)
        if predicted_sense and predicted_sense.name() == correct_sense_key:
            correct += 1

    return (correct / total) * 100
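
A single accuracy figure hides which cases fail. The sketch below is one possible way to report mismatches alongside the percentage. The helper `evaluate_predictions` is a hypothetical name, not part of the lab code, and the two lists are the predicted and expected synset names from the run recorded in this notebook.

```python
def evaluate_predictions(predictions, gold):
    """Return (accuracy_percent, indices of cases where prediction != gold)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0, []
    errors = [i for i, (p, g) in enumerate(zip(predictions, gold)) if p != g]
    accuracy = 100.0 * (len(gold) - len(errors)) / len(gold)
    return accuracy, errors


# Synset names copied from the recorded output and the test-case labels
predicted = ["bank.n.01", "depository_financial_institution.n.01",
             "bass.s.01", "bass.s.01"]
expected = ["bank.n.01", "depository_financial_institution.n.01",
            "bass.s.01", "bass.n.07"]

accuracy, errors = evaluate_predictions(predicted, expected)
print(f"{accuracy:.2f}% correct; mismatches at indices {errors}")
```

Returning the failing indices makes it straightforward to print the sentences that need inspection.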


#### Example usage and evaluation

In [6]:
if __name__ == "__main__":
    # Download required NLTK resources
    nltk.download('wordnet')
    nltk.download('punkt')
    nltk.download('stopwords')
    
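    # Test cases: (sentence, ambiguous word, expected WordNet synset name).
    # Note: in the second sentence "bank" is used as a verb, but the expected
    # label is a noun synset.
    test_cases = [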
        ("The bank of the river was muddy.", "bank", "bank.n.01"),
        ("I need to bank the money.", "bank", "depository_financial_institution.n.01"),
        ("The bass guitar sounds great.", "bass", "bass.s.01"),
        ("I caught a huge bass in the lake.", "bass", "bass.n.07")
    ]
    
    # Test individual cases
    print("Individual WSD Results:")
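    for sentence, word, _ in test_cases:
        sense = lesk_algorithm(sentence, word)
        print(f"\nContext: {sentence}")
        print(f"Ambiguous word: {word}")
        # lesk_algorithm returns None when WordNet has no synsets for the word
        if sense is None:
            print("Predicted sense: (no WordNet synsets found)")
            continue
        print(f"Predicted sense: {sense.name()}")
        print(f"Definition: {sense.definition()}")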
    
    # Evaluate overall performance
    accuracy = evaluate_wsd(test_cases)
    print(f"\nOverall Accuracy: {accuracy:.2f}%")

Individual WSD Results:

Context: The bank of the river was muddy.
Ambiguous word: bank
Predicted sense: bank.n.01
Definition: sloping land (especially the slope beside a body of water)

Context: I need to bank the money.
Ambiguous word: bank
Predicted sense: depository_financial_institution.n.01
Definition: a financial institution that accepts deposits and channels the money into lending activities

Context: The bass guitar sounds great.
Ambiguous word: bass
Predicted sense: bass.s.01
Definition: having or denoting a low vocal or instrumental range

Context: I caught a huge bass in the lake.
Ambiguous word: bass
Predicted sense: bass.s.01
Definition: having or denoting a low vocal or instrumental range

Overall Accuracy: 75.00%


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lenovo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\lenovo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Result Analysis

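The implemented Lesk algorithm disambiguated 3 of the 4 test cases correctly (75% accuracy). The one error is the fish sense of "bass", which was assigned the adjective sense `bass.s.01` instead of the expected `bass.n.07`.

1. Strengths:
   - Separates the river-bank sense of "bank" (`bank.n.01`) from the financial-institution sense
   - Searches synsets of every part of speech (`wn.synsets` is called without a `pos` filter), so the adjective sense `bass.s.01` is available for "bass guitar"
   - Builds each signature from both the definition and the usage examples, which gives more words to match against the context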

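2. Limitations:
   - Only identical tokens count as overlap, so synonyms, related terms, and inflected forms contribute nothing
   - Short sentences leave only a few content words after stopword removal, so a single shared word can decide the result
   - The part of speech of the target word is ignored: "bank" is a verb in "I need to bank the money", yet a noun sense is returned
   - Limited by WordNet's coverage of word senses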

3. Potential Improvements:
   - Incorporate word embeddings so that semantically similar words count toward the overlap
   - Weight context words by importance instead of counting every shared word equally
   - Use sense frequency information from corpus statistics as a prior or tie-breaker
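
As a rough illustration of the weighting idea, the sketch below scores shared words by an inverse-document-frequency weight computed over the candidate signatures, so that a word appearing in several signatures counts for less than a word unique to one. The signatures and the names `TOY_SIGNATURES`, `idf_weights`, and `pick_sense` are hypothetical; the data is hand-written rather than taken from WordNet, so this shows the mechanism only and makes no claim about how it would change the results above.

```python
import math

# Hand-written toy signatures, NOT actual WordNet glosses (hypothetical data).
TOY_SIGNATURES = {
    "bass_sound": {"low", "range", "bass", "voice"},
    "bass_instrument": {"lowest", "instrument", "bass", "family"},
    "bass_fish": {"freshwater", "fish", "lake"},
}


def idf_weights(signatures):
    """Weight each word by log(N / df), df = number of signatures containing it."""
    n = len(signatures)
    doc_freq = {}
    for words in signatures.values():
        for word in words:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n / count) for word, count in doc_freq.items()}


def pick_sense(context, signatures, weights=None):
    """Pick the sense with the highest overlap score; the first sense wins ties."""
    def score(label):
        shared = context & signatures[label]
        if weights is None:
            return len(shared)          # plain Lesk: count shared words
        return sum(weights[w] for w in shared)  # weighted variant
    return max(signatures, key=score)


context = {"caught", "huge", "bass", "lake"}
plain_choice = pick_sense(context, TOY_SIGNATURES)
weighted_choice = pick_sense(context, TOY_SIGNATURES, idf_weights(TOY_SIGNATURES))
print(plain_choice, weighted_choice)
```

With plain counts all three toy senses tie at one shared word and the first listed sense is returned. With weights, "lake" (unique to one signature) outweighs "bass" (shared by two), so the fish sense is chosen.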