# Exercise 4. WSD


References: http://www.nltk.org/book/ch02.html (section 5)

1) Implement, using NLTK and Python, the simplified Lesk algorithm for word sense disambiguation (WSD). The function receives a word and a sentence that contains it, and decides the best sense for that word. The sentences will be in English, and stopwords must be removed from the sentence, from the gloss, and from the examples of each sense.


EXAMPLE:

Sentence: “Yesterday I went to the bank to withdraw the money and the credit card did not work”

Word: bank
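
Before the NLTK implementation, the overlap idea behind simplified Lesk can be illustrated with a small dependency-free sketch. The two glosses and the stopword list below are hand-written stand-ins for WordNet data, and `toy_lesk` is an illustrative name, not part of NLTK.

```python
# Toy illustration of simplified Lesk overlap scoring.
# TOY_STOPWORDS and TOY_SENSES are hand-written assumptions, not WordNet data.
TOY_STOPWORDS = {"i", "to", "the", "and", "a", "that", "of", "did", "not", "into"}

TOY_SENSES = {
    "bank (financial)": "a financial institution that accepts deposits and lends money",
    "bank (river)": "sloping land beside a body of water",
}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {tok for tok in text.lower().split() if tok not in TOY_STOPWORDS}


def toy_lesk(sentence, senses):
    """Return the label whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    scores = {label: len(content_words(gloss) & context) for label, gloss in senses.items()}
    best = max(scores, key=scores.get)
    return best, scores


sentence = "Yesterday I went to the bank to withdraw the money and the credit card did not work"
best, scores = toy_lesk(sentence, TOY_SENSES)
print(best, scores)
```

Here the only shared content word is "money", so the financial gloss scores higher than the river gloss. Real glosses are longer and noisier, which is why the exercise also counts overlap with each sense's examples.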

In [2]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\annal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\annal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\annal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Importing the NLTK modules

In [47]:
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk
import re

In [87]:
def preprocess_text(text):
    text = text.lower()  # convert to lowercase
    # the hyphen goes last so it is a literal character, not a range
    pattern = r'[(+*).,:;¿?<>!\'"-]'
    text = re.sub(pattern, ' ', text)  # replace special characters so they are not treated as tokens
    tokens = word_tokenize(text)  # tokenize the text
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]  # remove stopwords

    return filtered_tokens


In [94]:
def simplified_lesk(word, sentence):
    best_sense = None  # best sense (synset) found so far
    max_overlap = 0

    sentence = preprocess_text(sentence)  # tokenize the sentence and remove stopwords

    for sense in wordnet.synsets(word):
        gloss = sense.definition()
        gloss = preprocess_text(gloss)  # remove stopwords from the gloss

        examples = sense.examples()
        examples = [preprocess_text(ex) for ex in examples]  # remove stopwords from the examples

        s = set(gloss + [word])  # signature: gloss tokens plus the target word
        intersection = s.intersection(sentence)  # tokens shared by the signature and the sentence
        overlap_sense = len(intersection)

        # tokens shared by each example and the sentence, summed over all examples
        overlap_examples = sum(len(set(ex).intersection(sentence)) for ex in examples)

        overlap = overlap_sense + overlap_examples

        if overlap > max_overlap:  # keep this sense if its overlap is strictly higher
            max_overlap = overlap
            best_sense = sense

    return best_sense

In [154]:
## EXAMPLE:

sentence = "Yesterday I went to the bank to withdraw the money and the credit card did not work"
word = "bank"

sense = simplified_lesk(word, sentence)

if sense is not None:
    print(f"Word: {word}")
    print(f"Best Sense: {sense}")
    print(f"Definition: {sense.definition()}")
else:
    print("No sense found")


Word: bank
Best Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities


2) Implement a similar algorithm for semantic disambiguation using word embeddings and a semantic similarity measure such as cosine distance.
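
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms, and the cosine distance is one minus that value. A minimal pure-Python sketch follows; `cosine_similarity` and `cosine_distance` are illustrative helpers written for this note, not gensim or NLTK functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

A higher similarity (lower distance) means the two words tend to appear in similar contexts according to the embedding model.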

In [125]:
from nltk.data import find
import gensim

# Load the pretrained word2vec sample shipped with NLTK (requires nltk.download('word2vec_sample'))
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)


In [157]:
def wsd_embeddings(word, sentence):
    # Note: returns the context token most similar to `word`, not a WordNet synset
    best_sense = None
    max_similarity = 0

    if word not in model:  # the target word has no vector in the model
        print(f'Word:{word} not in the model')
        return None

    sentence_tokens = preprocess_text(sentence)  # split the sentence into cleaned tokens
    for token in sentence_tokens:
        if token in model and token != word:  # context word is in the model and is not the target
            similarity = model.similarity(word, token)  # cosine similarity

            if similarity > max_similarity:
                max_similarity = similarity
                best_sense = token

    return best_sense

In [158]:
## EXAMPLE:
sentence = "Yesterday I went to the bank to withdraw the money and the credit card did not work"
word = "bank"

sense = wsd_embeddings(word, sentence)
print(f"Word: {word}")
print(f"Best Sense: {sense}")


Word: bank
Best Sense: credit
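
`wsd_embeddings` returns the context word closest to the target, not a WordNet sense. One possible way to choose among senses is to average the vectors of the context words and of each gloss, then pick the gloss whose average is most similar to the context. The sketch below uses tiny hand-made 2-D vectors (`TOY_VECTORS`) and hypothetical glosses so that it runs without gensim; with the real model, the lookups would come from `model` and the glosses from `sense.definition()`.

```python
import math

# Hand-made 2-D vectors, assumed for illustration only (not taken from word2vec).
TOY_VECTORS = {
    "money": [0.9, 0.1], "credit": [0.8, 0.2], "deposits": [0.85, 0.15],
    "financial": [0.9, 0.05], "river": [0.1, 0.9], "water": [0.05, 0.95],
    "land": [0.2, 0.8],
}

# Hypothetical glosses, already reduced to content words.
TOY_GLOSSES = {
    "bank (financial)": ["financial", "deposits", "money"],
    "bank (river)": ["land", "river", "water"],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def best_sense_by_embedding(context_tokens, glosses, vectors):
    """Return (label, score) of the gloss closest to the context, or (None, None)."""
    context_vec = mean_vector(context_tokens, vectors)
    if context_vec is None:
        return None, None
    best_label, best_score = None, None
    for label, gloss_tokens in glosses.items():
        gloss_vec = mean_vector(gloss_tokens, vectors)
        if gloss_vec is None:
            continue  # no gloss word has a vector, so this sense cannot be scored
        score = cosine(context_vec, gloss_vec)
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


context = ["yesterday", "went", "withdraw", "money", "credit", "card", "work"]
print(best_sense_by_embedding(context, TOY_GLOSSES, TOY_VECTORS))
```

Averaging is only one of several ways to build a sentence or gloss vector; it ignores word order and weights every word equally, which may or may not suit a given dataset.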
