**Lab 4 - Word Sense Disambiguation (WSD)**

Submitted by: Angeline A

Submitted on: 02/09/2024

In [3]:
pip install youtube-transcript-api transformers nltk spacy


Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.6.2-py3-none-any.whl.metadata (15 kB)
Downloading youtube_transcript_api-0.6.2-py3-none-any.whl (24 kB)
Installing collected packages: youtube-transcript-api
Successfully installed youtube-transcript-api-0.6.2


In [6]:
import nltk
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import pipeline
from nltk.corpus import wordnet

# Download necessary NLTK data
nltk.download('wordnet')
nltk.download('omw-1.4')

# Step 1: Extract, Clean, and Punctuate Transcript

def get_transcript(video_id):
    """Extracts transcript from a YouTube video given its ID."""
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join([entry['text'] for entry in transcript])

def clean_and_punctuate(transcript):
    """Cleans and punctuates the transcript using a pre-trained model."""
    # Use T5 model for punctuation restoration
    punctuator = pipeline("text2text-generation", model="t5-small")

    # Chunking the text to avoid exceeding token limits
    max_length = 512  # Adjust based on the model's capabilities
    chunks = [transcript[i:i+max_length] for i in range(0, len(transcript), max_length)]

    punctuated_transcript = ""
    for chunk in chunks:
        cleaned_chunk = chunk.lower()  # Lowercasing for consistent punctuation
        punctuated_chunk = punctuator(cleaned_chunk, max_length=512)[0]['generated_text']
        punctuated_transcript += punctuated_chunk + " "

    return punctuated_transcript.strip()


# Extract and clean the transcript
video_id = "W6wVU5b5nQk"
raw_transcript = get_transcript(video_id)
print("Raw Transcript:\n", raw_transcript)

cleaned_punctuated_transcript = clean_and_punctuate(raw_transcript)
print("Cleaned and Punctuated Transcript:\n", cleaned_punctuated_transcript)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Raw Transcript:
 foreign [Music] once upon a time in a small village there lived a wise old Monk he was known far and wide for his wisdom and sense of humor one day a young and eager student named Sam approached the master and said master I want to learn the secret to happiness and success please teach me master Sito looked at Sam with a twinkle in his eye and said very well young one But first you must complete a simple task go to the market and buy the biggest juiciest watermelon you can find then carry it on your head and walk through the village without dropping it Sam was puzzled but determined he went to the market and found a massive watermelon balancing it on his head he walked through the village with utmost concentration as he passed by people couldn't help but laugh and cheer him on some even joined in clapping and making funny faces finally after a bumpy Journey Sam reached Master setu's Hut the watermelon was intact and Sam was relieved he looked at Master situ expecting t

In [10]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [11]:
from nltk.tokenize import sent_tokenize, word_tokenize

# Function to determine word senses
def determine_word_senses(sentence):
    """Determines the number of senses for each open-class word in a sentence."""
    words = word_tokenize(sentence)
    word_senses = {}
    for word in words:
        if nltk.pos_tag([word])[0][1].startswith(('NN', 'VB', 'JJ', 'RB')):  # Open-class words
            senses = wordnet.synsets(word)
            word_senses[word] = len(senses)
    return word_senses

# Collecting a small corpus of example sentences
sentences = sent_tokenize(cleaned_punctuated_transcript)[:10]  # Take the first 10 sentences for example

# Determine the number of senses for each open-class word in each sentence
for sentence in sentences:
    senses = determine_word_senses(sentence)
    print(f"Sentence: {sentence}")
    print("Word Senses:", senses)


Sentence: a young and eager student named sam approached the master and said master i want to learn the secret to happiness and success please teach me master sito looked at sam with a twinkle in his eye and said very well young one but first you must complete a simple task go to the market and buy the biggest juiciest watermelon you can find then carry it on your head an ice cream .
Word Senses: {'young': 14, 'eager': 2, 'student': 2, 'named': 9, 'sam': 1, 'approached': 5, 'master': 15, 'said': 12, 'i': 4, 'want': 9, 'learn': 6, 'secret': 14, 'happiness': 2, 'success': 4, 'please': 4, 'teach': 3, 'sito': 0, 'looked': 10, 'twinkle': 4, 'eye': 6, 'very': 4, 'well': 22, 'first': 16, 'complete': 10, 'simple': 9, 'task': 4, 'go': 35, 'market': 9, 'buy': 6, 'biggest': 13, 'juiciest': 4, 'watermelon': 2, 'find': 18, 'then': 5, 'carry': 41, 'head': 42, 'ice': 11, 'cream': 8}
Sentence: master setu's hut the watermelon was intact and sam was relieved he looked at master situ expecting to be pra

In [12]:
from nltk.wsd import lesk

# Function to perform Lesk algorithm WSD
def lesk_wsd(sentence):
    """Performs Lesk algorithm for Word Sense Disambiguation on a sentence."""
    words = word_tokenize(sentence)
    disambiguated_senses = {}
    for word in words:
        if nltk.pos_tag([word])[0][1].startswith(('NN', 'VB', 'JJ', 'RB')):  # Open-class words
            sense = lesk(sentence, word)
            disambiguated_senses[word] = sense.definition() if sense else None
    return disambiguated_senses

# Apply Lesk WSD on the collected sentences
for sentence in sentences:
    disambiguated_senses = lesk_wsd(sentence)
    print(f"Sentence: {sentence}")
    print("Disambiguated Senses:", disambiguated_senses)


Sentence: a young and eager student named sam approached the master and said master i want to learn the secret to happiness and success please teach me master sito looked at sam with a twinkle in his eye and said very well young one but first you must complete a simple task go to the market and buy the biggest juiciest watermelon you can find then carry it on your head an ice cream .
Disambiguated Senses: {'young': 'British physicist and Egyptologist; he revived the wave theory of light and proposed a three-component theory of color vision; he also played an important role in deciphering the hieroglyphics on the Rosetta Stone (1773-1829)', 'eager': 'a high wave (often dangerous) caused by tidal flow (as by colliding tidal currents or in a narrow estuary)', 'student': 'a learner who is enrolled in an educational institution', 'named': 'charge with a function; charge to be', 'sam': 'a guided missile fired from land or shipboard against an airborne target', 'approached': 'make advances to

In [13]:
# Function to tag open-class words with correct sense
def tag_with_senses(sentence):
    """Tags each open-class word in a sentence with its correct sense using WordNet."""
    words = word_tokenize(sentence)
    tagged_words = {}
    for word in words:
        if nltk.pos_tag([word])[0][1].startswith(('NN', 'VB', 'JJ', 'RB')):  # Open-class words
            sense = lesk(sentence, word)
            tagged_words[word] = sense.name() if sense else None
    return tagged_words

# Tagging each open-class word with the correct sense in the collected sentences
for sentence in sentences:
    tagged_words = tag_with_senses(sentence)
    print(f"Sentence: {sentence}")
    print("Tagged Words with Senses:", tagged_words)


Sentence: a young and eager student named sam approached the master and said master i want to learn the secret to happiness and success please teach me master sito looked at sam with a twinkle in his eye and said very well young one but first you must complete a simple task go to the market and buy the biggest juiciest watermelon you can find then carry it on your head an ice cream .
Tagged Words with Senses: {'young': 'young.n.04', 'eager': 'tidal_bore.n.01', 'student': 'student.n.01', 'named': 'name.v.03', 'sam': 'surface-to-air_missile.n.01', 'approached': 'approach.v.05', 'master': 'victor.n.01', 'said': 'suppose.v.01', 'i': 'one.s.01', 'want': 'wish.n.01', 'learn': 'learn.v.04', 'secret': 'secret.n.02', 'happiness': 'happiness.n.02', 'success': 'success.n.03', 'please': 'please.v.03', 'teach': 'teach.v.02', 'sito': None, 'looked': 'look.v.03', 'twinkle': 'twinkle.v.02', 'eye': 'eye.n.05', 'very': 'very.s.01', 'well': 'well.v.01', 'first': 'first_gear.n.01', 'complete': 'complete.v