## Lesk in NLTK

In [13]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
# Getting the file STS.input.SMTeuroparl.txt from drive into a DataFrame
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
dt = pd.read_csv('/content/drive/MyDrive/data/ihlt/test-gold/STS.input.SMTeuroparl.txt',sep='\t',header=None)
# Updating the DataFrame with a new column with STS.gs.SMTeuroparl.txt
dt['gs'] = pd.read_csv('/content/drive/MyDrive/data/ihlt/test-gold/STS.gs.SMTeuroparl.txt',sep='\t',header=None)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
import re

# Getting a list of stop words
nltk.download('stopwords')
stopWordSet = set(nltk.corpus.stopwords.words('english'))

def cleaner(sentenceList):

  # Map each Penn Treebank tag to a WordNet tag using tag_map (see tagger)
  sentenceList = [(pair[0], tagger(pair[1])) for pair in sentenceList]

  # Lowercase every token
  sentenceList = [(word.lower(), tag) for word, tag in sentenceList]

  # Drop tokens that contain punctuation and tokens that are stop words
  punctuation = re.compile(r'''[!"#$%&'()*+, -./:;<=>?@[\]^_`{|}~]+''')
  sentenceList = [(word, tag) for word, tag in sentenceList
                  if punctuation.search(word) is None and word not in stopWordSet]

  return sentenceList

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
# Get the cleaned sentence without tags
def getSentence(cleanedText):
  return [pair[0] for pair in cleanedText]

In [17]:
# Get the list of synsets
def synset_lst(sentence, cleanedText, consider_none):
  sy_lst = []
  for word, pos in cleanedText:
    # lesk returns None when WordNet has no synset for this word and tag
    synset = nltk.wsd.lesk(sentence, word, pos)
    if synset is not None:
      # Store the synset name (lemma.pos.number)
      sy_lst.append(synset.name())
    elif consider_none:
      # Keep the bare word when it has no synset
      sy_lst.append(word)

  return sy_lst

In [18]:
nltk.download('omw-1.4')
nltk.download('wordnet')
wnl = nltk.stem.WordNetLemmatizer()

# Mapping the tags between Treebank and WordNet
tag_map = {
  'CC':"none", # coordin. conjunction (and, but, or)  
  'CD':"n", # cardinal number (one, two)             
  'DT':"none", # determiner (a, the)                    
  'EX':"r", # existential ‘there’ (there)           
  'FW':"none", # foreign word (mea culpa)             
  'IN':"r", # preposition/sub-conj (of, in, by)   
  'JJ':"a", # adjective (yellow)                  
  'JJR':"a", # adj., comparative (bigger)          
  'JJS':"a", # adj., superlative (wildest)           
  'LS':"none", # list item marker (1, 2, One)          
  'MD':"none", # modal (can, should)                    
  'NN':"n", # noun, sing. or mass (llama)          
  'NNS':"n", # noun, plural (llamas)                  
  'NNP':"n", # proper noun, sing. (IBM)              
  'NNPS':"n", # proper noun, plural (Carolinas)
  'PDT':"a", # predeterminer (all, both)            
  'POS':"none", # possessive ending (’s )               
  'PRP':"none", # personal pronoun (I, you, he)     
  'PRP$':"none", # possessive pronoun (your, one’s)    
  'RB':"r", # adverb (quickly, never)            
  'RBR':"r", # adverb, comparative (faster)        
  'RBS':"r", # adverb, superlative (fastest)     
  'RP':"a", # particle (up, off)
  'SYM':"none", # symbol (+,%, &)
  'TO':"none", # “to” (to)
  'UH':"none", # interjection (ah, oops)
  'VB':"v", # verb base form (eat)
  'VBD':"v", # verb past tense (ate)
  'VBG':"v", # verb gerund (eating)
  'VBN':"v", # verb past participle (eaten)
  'VBP':"v", # verb non-3sg pres (eat)
  'VBZ':"v", # verb 3sg pres (eats)
  'WDT':"none", # wh-determiner (which, that)
  'WP':"none", # wh-pronoun (what, who)
  'WP$':"none", # possessive (wh- whose)
  'WRB':"none", # wh-adverb (how, where)
}

# Transforming the tag of the words according to the tag_map
def tagger(tag):
  # Tags missing from tag_map fall back to "none"
  return tag_map.get(tag, "none")

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [19]:
from nltk.metrics import jaccard_distance
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Adding two empty columns to the DataFrame
## Jaccard distance over synsets only (words without a synset are dropped)
dt['jaccard_only_synsets'] = ''
## Jaccard distance over synsets plus the bare words that have no synset
dt['jaccard_synsets_&_words'] = ''

limit = len(dt)

for id in range(limit):

  # Tokenization and tagging
  tagsText1, tagsText2 = nltk.pos_tag(nltk.word_tokenize(dt.loc[id,0])), nltk.pos_tag(nltk.word_tokenize(dt.loc[id,1]))

  # Cleaning 
  cleanedText1, cleanedText2 = cleaner(tagsText1), cleaner(tagsText2)

  # Get the Sentence
  sentence1, sentence2 = getSentence(cleanedText1), getSentence(cleanedText2)

  # List of cleaned sentences with synsets, considering words without synset
  text1_with_none, text2_with_none = synset_lst(sentence1, cleanedText1, True), synset_lst(sentence2, cleanedText2, True)
  # List of cleaned sentences with synsets, NOT considering words without synset
  text1_no_none, text2_no_none = synset_lst(sentence1, cleanedText1, False), synset_lst(sentence2, cleanedText2, False)
  
  # Updating the DataFrame with the Jaccard distances (0 = identical sets, 1 = disjoint sets)
  dt.loc[id,'jaccard_only_synsets'] = jaccard_distance(set(text1_no_none), set(text2_no_none))
  dt.loc[id,'jaccard_synsets_&_words'] = jaccard_distance(set(text1_with_none), set(text2_with_none))

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [20]:
display(dt)

| | 0 | 1 | gs | jaccard_only_synsets | jaccard_synsets_&_words |
|---|---|---|---|---|---|
| 0 | The leaders have now been given a new chance a... | The leaders benefit aujourd' hui of a new luck... | 4.500 | 0.6 | 0.692308 |
| 1 | Amendment No 7 proposes certain changes in the... | Amendment No 7 is proposing certain changes in... | 5.000 | 0.0 | 0.0 |
| 2 | Let me remind you that our allies include ferv... | I would like to remind you that among our alli... | 4.250 | 0.625 | 0.727273 |
| 3 | The vote will take place today at 5.30 p.m. | The vote will take place at 5.30pm | 4.500 | 0.25 | 0.25 |
| 4 | The fishermen are inactive, tired and disappoi... | The fishermen are inactive, tired and disappoi... | 5.000 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 454 | It is our job to continue to support Latvia wi... | It is of our duty of continue to support the c... | 5.000 | 0.727273 | 0.75 |
| 455 | The vote will take place today at 5.30 p.m. | Vote will take place at 17 h 30. | 4.750 | 0.571429 | 0.571429 |
| 456 | Neither was there a qualified majority within ... | There was no qualified majority in this Parlia... | 5.000 | 0.666667 | 0.636364 |
| 457 | Let me remind you that our allies include ferv... | I hold you recall that our allies, there are e... | 4.000 | 0.777778 | 0.8 |


In [21]:
from scipy.stats import pearsonr

# Get the correlation and the p-value between gs and jaccard
corr, p = pearsonr(dt['gs'], dt['jaccard_only_synsets'])
print("Only synsets -> Correlation coefficient:", corr)
print("Only synsets -> p-value:", p)
print('\n')

corr, p = pearsonr(dt['gs'], dt['jaccard_synsets_&_words'])
print("Synsets + words without synset -> Correlation coefficient:", corr)
print("Synsets + words without synset -> p-value:", p)

Only synsets -> Correlation coefficient: -0.5139698634879429
Only synsets -> p-value: 2.615398269826006e-32


Synsets + words without synset -> Correlation coefficient: -0.5086380711515205
Synsets + words without synset -> p-value: 1.4275649732895514e-31


# **Conclusion:**

In this updated version of the code, Lesk’s algorithm was applied to obtain a synset for each word, using two approaches, and the resulting Jaccard distance was compared with the gold standard through the Pearson correlation.

In the first approach, only the words that have a synset are considered. In the second, the words without a synset are also included in the comparison, in their surface form.
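The difference between the two approaches can be illustrated with a small sketch. The synset names below are hypothetical labels rather than real Lesk output, `to_items` is an illustrative helper that mirrors the `consider_none` switch, and the distance is computed by hand with the same formula as `nltk.metrics.jaccard_distance` so that the example runs without NLTK data.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: (|a ∪ b| - |a ∩ b|) / |a ∪ b|."""
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / len(union)


# Hypothetical disambiguation results: None marks a word with no synset.
sent1 = [("vote", "vote.n.01"), ("take", "take.v.01"), ("today", None)]
sent2 = [("vote", "vote.n.01"), ("take", "take.v.01"), ("17", None)]


def to_items(pairs, consider_none):
    """Keep synset names; keep bare words only when consider_none is True."""
    items = []
    for word, synset in pairs:
        if synset is not None:
            items.append(synset)
        elif consider_none:
            items.append(word)
    return items


only_synsets = jaccard_distance(set(to_items(sent1, False)),
                                set(to_items(sent2, False)))
synsets_and_words = jaccard_distance(set(to_items(sent1, True)),
                                     set(to_items(sent2, True)))

print(only_synsets)       # 0.0
print(synsets_and_words)  # 0.5
```

In this toy case the bare words differ between the sentences, so including them raises the distance even though every disambiguated word matches.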

The results show that comparing only synsets gives a correlation coefficient with the gold standard of -0.514, with a p-value of 2.6e-32.  
When the words without synsets are included in the comparison, the magnitude of the correlation drops slightly to -0.509, with a p-value of 1.4e-31.
Both approaches show a negative linear correlation between the gold standard and the Jaccard measure. The sign is expected, since `jaccard_distance` returns a distance (lower for more similar sentences) while the gold standard is a similarity score. The very small p-values mean that the null hypothesis of no correlation can be rejected, although the amount of data still calls for caution before drawing a firm conclusion.
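The negative sign can be checked on the first five rows of the table above. The sketch below uses a hand-written Pearson function (an illustrative stand-in for `scipy.stats.pearsonr`, which also returns a p-value) and shows that replacing the distance by the similarity `1 - distance` only flips the sign of the coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# First five rows of the displayed DataFrame (gs and jaccard_only_synsets).
gold = [4.5, 5.0, 4.25, 4.5, 5.0]
distance = [0.6, 0.0, 0.625, 0.25, 0.0]
similarity = [1 - d for d in distance]

r_distance = pearson(gold, distance)
r_similarity = pearson(gold, similarity)

print(r_distance, r_similarity)
```

Five rows are far too few to estimate the correlation; the point is only that the two coefficients have equal magnitude and opposite signs.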

Firstly, the results are better than those of the previous implementations: -0.48 in Lab 2 Document (text cleaning only) and -0.491 in Lab 3 Morphology (text cleaning + lemmatizer).

Lab 3 stated the following about the lemmatizer when comparing Lab 2 and Lab 3:

" The Wordnet's lemmatizer produces this improvement, by transforming the words into their basic form (lemmas), according to the characterization (tag) given by the Penn Treebank Tagger. Thus, the similarity method measures phrases, constituted by lemmas instead of words, in which the criteria are more reliable since different words may have the same lemma and, consequentially, equivalent meanings too. Thereby, using lemmas rather than words produces a better approach to the similarity measure."

So, when Lesk’s algorithm is applied, each word is mapped to a synset whose name has the following form:

"

lemma.pos.number

-> lemma is the word's morphological stem

-> pos is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB

-> number is the sense number, counting from 0.

" (NLTK Documentation)

This name contains not only the lemma but also the part-of-speech tag and the sense number. The resulting description is therefore more reliable, since it captures the meaning of a word in more detail than the lemma alone (as in Lab 3).
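As a minimal sketch of why the synset name is stricter than the lemma, the snippet below splits names of the form `lemma.pos.number` into their three parts. `parse_synset_name` is a hypothetical helper, not part of NLTK, and the two names are illustrative examples of the naming scheme.

```python
def parse_synset_name(name):
    """Split 'lemma.pos.number' into (lemma, pos, sense number)."""
    # rsplit keeps any dots inside the lemma itself intact
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, int(number)


first = parse_synset_name("bank.n.01")
second = parse_synset_name("bank.v.01")

print(first)   # ('bank', 'n', 1)
print(second)  # ('bank', 'v', 1)

# Same lemma, but the full names differ, so a set comparison
# treats the noun and the verb as different items.
same_lemma = first[0] == second[0]
same_item = "bank.n.01" == "bank.v.01"
print(same_lemma, same_item)  # True False
```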

As a last topic, considering only the values obtained in this practical exercise, the correlation coefficient is slightly better when only synsets are considered. This suggests that words compared in their surface form carry very little additional information and introduce some noise into the Jaccard measurement. However, removing the words without synsets improves the correlation by only 0.005, which is within the range of uncertainty. Even so, the more words without synsets are included, the larger the error introduced and the lower the correlation. Thus, for this application it is more beneficial not to consider words without synsets.
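One rough way to gauge that range of uncertainty is a 95% confidence interval from the Fisher z-transformation. The sketch below assumes 458 sentence pairs (taken from the last index shown in the table) and treats the interval only as an informal check, since both correlations come from the same sample and a formal comparison would need a test for dependent correlations. `fisher_ci` is an illustrative helper.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r from n pairs."""
    z = math.atanh(r)              # Fisher z-transformation
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Assumption: n = 458 pairs, read off the displayed DataFrame index.
low, high = fisher_ci(-0.514, 458)
print(round(low, 3), round(high, 3))

# The coefficient of the second approach lies well inside the interval.
inside = low < -0.509 < high
print(inside)  # True
```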

As a final note, even though an improvement was observed, the correlation obtained is still not satisfactory, mainly due to the method used to compare strings.