<a href="https://colab.research.google.com/github/MMesgar/Knowledge_Based_Systems/blob/main/lecture06/semantic_relatedness.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


---

# Section 1. Path-based measures of semantic relatedness in symbolic KBs






In [None]:
import nltk

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**Synset:** a set of synonyms that share a common meaning.

In [None]:
cat = wn.synsets('cat')[0]
dog = wn.synsets('dog')[0]

In [None]:
cat.hyponyms()

[Synset('domestic_cat.n.01'), Synset('wildcat.n.03')]

In [None]:
cat.hypernyms()

[Synset('feline.n.01')]


**Path Similarity:**
Return a score denoting how similar two word senses are,
based on the shortest path that connects the senses
in the is-a (hypernym/hyponym) taxonomy.
The score is in the range 0 to 1.


In [None]:
print(wn.path_similarity(cat, dog))
# 0.2

0.2
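The score above can be reproduced by hand. The following is a minimal sketch, assuming the score is computed as 1 / (1 + shortest path length) and using a tiny hand-made taxonomy (the `hypernym` dictionary and the helper names are illustrative, not part of NLTK or WordNet):

```python
from collections import deque

# Toy is-a links (child -> parent); a hand-made stand-in for the WordNet hierarchy.
hypernym = {
    "cat": "feline",
    "feline": "carnivore",
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
}

def neighbours(node):
    """Nodes one is-a edge away, in either direction."""
    up = [hypernym[node]] if node in hypernym else []
    down = [child for child, parent in hypernym.items() if parent == node]
    return up + down

def shortest_path_length(start, goal):
    """Breadth-first search over the undirected is-a graph."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected

def toy_path_similarity(a, b):
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (1 + dist)

# cat -> feline -> carnivore <- canine <- dog is 4 edges long
print(toy_path_similarity("cat", "dog"))  # 0.2
```

In this toy graph the path between the two nodes has four edges, which gives the same 0.2 as the WordNet result above.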



**Leacock-Chodorow Similarity:**
Return a score denoting how similar two word senses are,
based on the shortest path that connects the senses (as above)
and the maximum depth of the taxonomy in which the senses occur.
The relationship is given as -log(p / (2d)),
where p is the shortest path length and d is the maximum depth of the taxonomy.

In [None]:
print(wn.lch_similarity(cat, dog))
# 2.0281482472922856

2.0281482472922856
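As a rough check of the formula, the sketch below plugs in numbers by hand. The values p = 5 and d = 19 are assumptions inferred from the printed score, not values stated in this notebook: p = 5 corresponds to counting the nodes on a 4-edge path, and d = 19 is a depth that reproduces the output for the noun taxonomy.

```python
import math

def lch(p, d):
    """Leacock-Chodorow score for path length p and taxonomy depth d."""
    return -math.log(p / (2 * d))

# Assumed values: p = 5 (a 4-edge path, counted in nodes), d = 19
score = lch(5, 19)
print(score)
```

Because of the logarithm, the score is not bounded by 1, which is why it cannot be compared directly with the path similarity.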


**Wu-Palmer Similarity:**
Return a score denoting how similar two word senses are,
based on the depth of the two senses in the taxonomy
and that of their Least Common Subsumer (most specific ancestor node).


In [None]:

print(wn.wup_similarity(cat, dog))
# 0.8571428571428571


0.8571428571428571
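The Wu-Palmer score is commonly written as 2 * depth(lcs) / (depth(s1) + depth(s2)). The sketch below evaluates that formula with hand-picked depths. The depths 12, 14 and 14 are assumptions chosen because they reproduce the score above (both senses sitting two levels below their common ancestor); they are not read from WordNet here.

```python
def wu_palmer(depth_lcs, depth_s1, depth_s2):
    """Wu-Palmer similarity from the depths of the two senses and their LCS."""
    return 2 * depth_lcs / (depth_s1 + depth_s2)

# Assumed depths: the LCS at depth 12, each sense two levels deeper
score = wu_palmer(12, 14, 14)
print(score)
```

The deeper the common ancestor relative to the two senses, the closer the score gets to 1.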




---
## Information content-based measures of semantic relatedness





In [None]:
from nltk.corpus import wordnet_ic
nltk.download('wordnet_ic')

In [None]:
# Wordnet information content file
brown_ic = wordnet_ic.ic('ic-brown.dat')

**Lin Similarity:**
Return a score denoting how similar two word senses are,
based on the Information Content (IC) of the Least Common Subsumer
and that of the two input Synsets.
The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).


In [None]:
print(wn.lin_similarity(cat, dog, ic=brown_ic))
# 0.8768009843733973

0.8768009843733973


**Resnik Similarity:**
Return a score denoting how similar two word senses are,
based on the Information Content (IC) of the Least Common Subsumer.
Note that for any similarity measure that uses information content,
the result is dependent on the corpus used to generate the information content
and the specifics of how the information content was created.

In [None]:
print(wn.res_similarity(cat, dog, ic=brown_ic))
# 7.911666509036577


7.911666509036577
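Information content is usually defined as the negative log probability of a concept, IC(c) = -log P(c), with P(c) estimated from corpus counts. The sketch below illustrates this with made-up counts (the `counts` dictionary is purely hypothetical and does not come from the Brown corpus):

```python
import math

# Hypothetical counts; each count includes all occurrences of the concepts below it.
counts = {"entity": 1000, "animal": 120, "carnivore": 30, "cat": 8, "dog": 12}
total = counts["entity"]  # the root subsumes everything

def information_content(concept):
    """IC(c) = -log P(c), with P(c) estimated as count(c) / count(root)."""
    return -math.log(counts[concept] / total)

# Resnik similarity of cat and dog is the IC of their least common subsumer
resnik_cat_dog = information_content("carnivore")
print(resnik_cat_dog)
```

Rare, specific concepts get a high IC and the root gets an IC of 0, so two senses that only share a very general ancestor receive a low Resnik score.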


**Jiang-Conrath Similarity:**
Return a score denoting how similar two word senses are,
based on the Information Content (IC) of the Least Common Subsumer
and that of the two input Synsets.
The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

In [None]:
print(wn.jcn_similarity(cat, dog, ic=brown_ic))

0.4497755285516739
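The three IC-based scores are tied together by the formulas given above. As a small consistency check, the following sketch recovers IC(s1) + IC(s2) from the Resnik and Lin outputs and then recomputes the Jiang-Conrath score. It uses only the numbers printed in this notebook; the variable names are illustrative.

```python
# Values printed by the cells above
res = 7.911666509036577   # Resnik: IC(lcs)
lin = 0.8768009843733973  # Lin: 2 * IC(lcs) / (IC(s1) + IC(s2))

# Solve the Lin formula for IC(s1) + IC(s2)
ic_sum = 2 * res / lin

# Jiang-Conrath: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))
jcn = 1 / (ic_sum - 2 * res)
print(jcn)
```

The result agrees with the `jcn_similarity` output up to floating-point rounding.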




---

# Section 2. Definition-based measures of semantic relatedness






In [None]:
tree = wn.synsets('tree')[0]

In [None]:
tree.definition()
# a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms

'a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms'

In [None]:
trunk = wn.synsets('trunk')[0]

In [None]:
trunk.definition()
# 'the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber' 

'the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber'

**Lesk measure:** measures the relatedness of two words (senses) by counting the words that their definitions (glosses) have in common. The Lesk measure is the number of such overlapping words.
The Lesk algorithm is used in word sense disambiguation: it assigns a sense to a given word based on how related that sense is to the context (the rest of the words in the text).
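To make the overlap idea concrete, here is a minimal sketch that counts the words shared by the two glosses printed above. The tokenization (whitespace split, punctuation stripped, lowercased) is an assumption made for this illustration.

```python
import string

tree_gloss = ("a tall perennial woody plant having a main trunk and branches "
              "forming a distinct elevated crown; includes both gymnosperms and angiosperms")
trunk_gloss = ("the main stem of a tree; usually covered with bark; the bole is "
               "usually the part that is commercially useful for lumber")

def gloss_words(gloss):
    """Set of lowercased words in a gloss, with surrounding punctuation removed."""
    return {w.strip(string.punctuation).lower() for w in gloss.split()}

common = gloss_words(tree_gloss) & gloss_words(trunk_gloss)
print(sorted(common))  # ['a', 'main']
print(len(common))     # 2
```

Note that one of the two shared words is the function word "a"; in practice stop words are often removed before counting so that they do not inflate the overlap.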


In [None]:
from nltk.wsd import lesk

In [None]:
def getSenses(word):
  return wn.synsets(word.lower())

In [None]:
def getGloss(senses):

    gloss = {}

    for sense in senses:
        gloss[sense.name()] = []

    for sense in senses:
        gloss[sense.name()] += sense.definition().split()

    return gloss

In [None]:
def getAll(word):
    senses = getSenses(word)

    if senses == []:
        return {word.lower(): senses}

    return getGloss(senses)

In [None]:
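def getScore(set1, set2):
    # Count the tokens of the first gloss that also occur in the second gloss
    overlap = 0
    for word in set1:
        if word in set2:
            overlap += 1

    # Normalize by the combined length of the two glosses
    return overlap / (len(set1) + len(set2))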

In [None]:
def overlapScore(word1, word2):

    gloss_set1 = getAll(word1)
    gloss_set2 = getAll(word2)
    
    score = {}
    for i in gloss_set1.keys():
        score[i] = 0
        for j in gloss_set2.keys():
            score[i] += getScore(gloss_set1[i], gloss_set2[j])

    bestSense = None
    max_score = 0
    for i in gloss_set1.keys():
        if score[i] > max_score:
            max_score = score[i]
            bestSense = i

    return bestSense, max_score

In [None]:
overlapScore("cat", "dog")

('guy.n.01', 0.8618881118881119)



---

# Section 3. Distributional Approach




## Word-Context Matrix


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

In [None]:
samples = ['I like DL',
           'I like NLP',
           'I love ML', 
           'I love NLP']

CountVectorizer converts a collection of text documents to a matrix of token counts.
More info [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [None]:
vectorizer = CountVectorizer() 
co_occurrences = vectorizer.fit_transform(samples).toarray()

Let's see what words it has extracted from the corpus.

In [None]:
vocabulary = vectorizer.get_feature_names_out()
print(vocabulary)

['dl' 'like' 'love' 'ml' 'nlp']


How many words are extracted?

In [None]:
# write your code here


Let's print the co-occurrence matrix

In [None]:
output = " "*12
for word in vocabulary:
  output+= f"{word} "
output += "\n"
for context_id, context in enumerate(samples):
    output += f"{context.lower()}"+ " "*(12-len(context))
    for item in co_occurrences[context_id]:
      output += f"{item} "+ " "*(3-len(str(item)))
    output += "\n"
print(output)

            dl like love ml nlp 
i like dl   1   1   0   0   0   
i like nlp  0   1   0   0   1   
i love ml   0   0   1   1   0   
i love nlp  0   0   1   0   1   
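Notice that "i" does not appear in the vocabulary: with its default settings, `CountVectorizer` lowercases the text and keeps only tokens of at least two word characters. The following sketch rebuilds the same matrix in plain Python under that assumption (the `tokenize` helper is illustrative and only mimics the default behaviour).

```python
import re

samples = ['I like DL',
           'I like NLP',
           'I love ML',
           'I love NLP']

def tokenize(text):
    """Lowercase and keep tokens with two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted vocabulary over the whole corpus
vocabulary = sorted({tok for s in samples for tok in tokenize(s)})

# One row per context, one column per vocabulary word
matrix = [[tokenize(s).count(w) for w in vocabulary] for s in samples]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the count vector of one context, and each column shows in which contexts a word occurs.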



## Dense vectors

In [1]:
from gensim.models import KeyedVectors

In [2]:
import gensim.downloader as api

Let's download pretrained dense vectors released by Google (word2vec, trained on Google News).

In [3]:
word_vectors= api.load("word2vec-google-news-300")



In [8]:
vector_car = word_vectors['car'] 
print(vector_car)

[ 0.13085938  0.00842285  0.03344727 -0.05883789  0.04003906 -0.14257812
  0.04931641 -0.16894531  0.20898438  0.11962891  0.18066406 -0.25
 -0.10400391 -0.10742188 -0.01879883  0.05200195 -0.00216675  0.06445312
  0.14453125 -0.04541016  0.16113281 -0.01611328 -0.03088379  0.08447266
  0.16210938  0.04467773 -0.15527344  0.25390625  0.33984375  0.00756836
 -0.25585938 -0.01733398 -0.03295898  0.16308594 -0.12597656 -0.09912109
  0.16503906  0.06884766 -0.18945312  0.02832031 -0.0534668  -0.03063965
  0.11083984  0.24121094 -0.234375    0.12353516 -0.00294495  0.1484375
  0.33203125  0.05249023 -0.20019531  0.37695312  0.12255859  0.11425781
 -0.17675781  0.10009766  0.0030365   0.26757812  0.20117188  0.03710938
  0.11083984 -0.09814453 -0.3125      0.03515625  0.02832031  0.26171875
 -0.08642578 -0.02258301 -0.05834961 -0.00787354  0.11767578 -0.04296875
 -0.17285156  0.04394531 -0.23046875  0.1640625  -0.11474609 -0.06030273
  0.01196289 -0.24707031  0.32617188 -0.04492188 -0.114257

## Euclidean distance

In [16]:
vector_cat =  word_vectors['cat']

vector_dog = word_vectors['dog']

Use the following cell to implement this distance.



In [18]:
def euclidian_distance(a, b):
    # write your code here
    return 0

In [19]:
euc_dist_cat_dog = euclidian_distance(vector_cat,vector_dog)
print(euc_dist_cat_dog)

0


Let's check your implementation.

In [20]:
from scipy.spatial import distance
dst = distance.euclidean(vector_cat,vector_dog)
print(dst)

2.081533670425415


## Cosine distance

In [27]:
cosine_dist = distance.cosine(vector_cat, vector_dog)

`cosine_dist` now holds the cosine distance between the two vectors. Let's convert it to cosine similarity.

In [29]:
cosine_sim = 1 - cosine_dist
print(cosine_sim)

0.760945737361908


In [30]:
word_vectors.similarity('cat','dog')

0.76094574
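Cosine similarity is the dot product of two vectors divided by the product of their lengths, and the cosine distance used above is 1 minus that value. A minimal sketch on small made-up vectors (`u`, `v` and `w` are illustrative, not word embeddings):

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice as long
w = [0.0, 0.0, 3.0]  # orthogonal to u

print(cosine_similarity(u, v))      # close to 1: same direction
print(cosine_similarity(u, w))      # 0.0: orthogonal
print(1 - cosine_similarity(u, w))  # the corresponding cosine distance
```

Because only the direction matters, scaling a vector does not change its cosine similarity to other vectors, unlike the Euclidean distance.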

In [31]:
word_vectors.most_similar('car')

[('vehicle', 0.7821096181869507),
 ('cars', 0.7423830032348633),
 ('SUV', 0.7160962820053101),
 ('minivan', 0.6907036304473877),
 ('truck', 0.6735789775848389),
 ('Car', 0.6677608489990234),
 ('Ford_Focus', 0.667320191860199),
 ('Honda_Civic', 0.662684977054596),
 ('Jeep', 0.6511331796646118),
 ('pickup_truck', 0.64414381980896)]



---

# Section 4. Correlations 

In [5]:
x = [1,2,3]
y = [2,5,6]

Write a function that computes the Pearson correlation coefficient.

In [11]:
# Write your code here
def pearson(a,b):
  
  return 0


Run your function on `x` and `y`.



In [12]:
r = pearson(x,y)
print(r)

0


Let's check the correctness of your function.

In [13]:
import scipy.stats

In [14]:
scipy.stats.pearsonr(x, y)    # Pearson's r

(0.9607689228305227, 0.17891237502206703)

We can also compute the Spearman correlation.

In [15]:
scipy.stats.spearmanr(x, y)

SpearmanrResult(correlation=1.0, pvalue=0.0)
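Spearman correlation compares ranks rather than raw values, which is why it is exactly 1.0 here even though the Pearson coefficient is not: `x` and `y` are ordered identically. A minimal sketch, assuming there are no tied values and using the rank-difference formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)) (the helper names are illustrative):

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_no_ties(a, b):
    """Spearman's rho from squared rank differences (valid only without ties)."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3]
y = [2, 5, 6]
print(spearman_no_ties(x, y))          # 1.0
print(spearman_no_ties(x, [6, 5, 2]))  # -1.0
```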

Well done! You have completed these exercises.