# Similarity

We will work on applying similarity: Jaccard and Cosine similarity. This exercise is a simple application of the lecture.

Begin by importing the needed libraries:

In [1]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import math

We will work with the following examples:

In [2]:
A = "Outside the classroom, Stallman pursued his studies with even more diligence, rushing off to fulfill his laboratory-assistant duties at Rockefeller University during the week and dodging the Vietnam protesters on his way to Saturday school at Columbia. It was there, while the rest of the Science Honors Program students sat around discussing their college choices, that Stallman finally took a moment to participate in the preclass bull session."
B = "To facilitate the process, AI Lab hackers had built a system that displayed both the source and display modes on a split screen. Despite this innovative hack, switching from mode to mode was still a nuisance."
C = "With no dorm and no dancing, Stallman's social universe imploded. Like an astronaut experiencing the aftereffects of zero-gravity, Stallman found that his ability to interact with nonhackers, especially female nonhackers, had atrophied significantly. After 16 weeks in the AI Lab, the self confidence he'd been quietly accumulating during his 4 years at Harvard was virtually gone."

Begin by computing the Jaccard Similarity J of all possibilities:
* J(A, B)
* J(B, C)
* J(A, C)

In [3]:
# TODO: compute the Jaccard similarities
# Split the sentences
a = set(A.split(' '))
b = set(B.split(' '))
c = set(C.split(' '))
# Compute the intersection and union
ABI = set([x for x in a if x in b])
ABU = set(list(a)+list(b))

BCI = set([x for x in b if x in c])
BCU = set(list(b)+list(c))

ACI = set([x for x in a if x in c])
ACU = set(list(a)+list(c))
# Compute and print the Jaccard Similarity
print(len(ABI)/len(ABU))
print(len(BCI)/len(BCU))
print(len(ACI)/len(ACU))

0.08536585365853659
0.09210526315789473
0.125


What are the closest to the other according to Jaccard Similarity?

Now let's do the same using TF-IDF and Cosine Similarity. Compute the TF-IDF and cosine similarities and print them.

In [4]:
vectorizer = CountVectorizer()
vec = vectorizer.fit_transform([A])
avec = vec.toarray()
aname = vectorizer.get_feature_names_out()
vec = vectorizer.fit_transform([B])
bvec = vec.toarray()
bname = vectorizer.get_feature_names_out()
vec = vectorizer.fit_transform([C])
cvec = vec.toarray()
cname = vectorizer.get_feature_names_out()

adict = {}
bdict = {}
cdict = {}
for x,y in zip(aname,avec[0]):
    adict[x] = y/len(A.split(' '))
for x,y in zip(bname,bvec[0]):
    bdict[x] = y/len(B.split(' '))
for x,y in zip(cname,cvec[0]):
    cdict[x] = y/len(C.split(' '))

tf = [adict,bdict,cdict]
tf

[{'and': 0.014705882352941176,
  'around': 0.014705882352941176,
  'assistant': 0.014705882352941176,
  'at': 0.029411764705882353,
  'bull': 0.014705882352941176,
  'choices': 0.014705882352941176,
  'classroom': 0.014705882352941176,
  'college': 0.014705882352941176,
  'columbia': 0.014705882352941176,
  'diligence': 0.014705882352941176,
  'discussing': 0.014705882352941176,
  'dodging': 0.014705882352941176,
  'during': 0.014705882352941176,
  'duties': 0.014705882352941176,
  'even': 0.014705882352941176,
  'finally': 0.014705882352941176,
  'fulfill': 0.014705882352941176,
  'his': 0.04411764705882353,
  'honors': 0.014705882352941176,
  'in': 0.014705882352941176,
  'it': 0.014705882352941176,
  'laboratory': 0.014705882352941176,
  'moment': 0.014705882352941176,
  'more': 0.014705882352941176,
  'of': 0.014705882352941176,
  'off': 0.014705882352941176,
  'on': 0.014705882352941176,
  'outside': 0.014705882352941176,
  'participate': 0.014705882352941176,
  'preclass': 0.0147

In [5]:
tfi = {}
for x in set([*adict,*bdict,*cdict]):
    ci = 0
    if x in adict:
        ci += 1
    if x in bdict:
        ci += 1
    if x in cdict:
        ci += 1
    tfi[x] = 1+math.log(3/ci)
tfi

{'dancing': 2.09861228866811,
 'he': 2.09861228866811,
 'there': 2.09861228866811,
 'with': 1.4054651081081644,
 'aftereffects': 2.09861228866811,
 'vietnam': 2.09861228866811,
 'that': 1.0,
 'at': 1.4054651081081644,
 'hack': 2.09861228866811,
 'self': 2.09861228866811,
 'virtually': 2.09861228866811,
 'nonhackers': 2.09861228866811,
 '16': 2.09861228866811,
 'duties': 2.09861228866811,
 'no': 2.09861228866811,
 'on': 1.4054651081081644,
 'stallman': 1.4054651081081644,
 'his': 1.4054651081081644,
 'process': 2.09861228866811,
 'universe': 2.09861228866811,
 'in': 1.4054651081081644,
 'displayed': 2.09861228866811,
 'both': 2.09861228866811,
 'diligence': 2.09861228866811,
 'assistant': 2.09861228866811,
 'especially': 2.09861228866811,
 'honors': 2.09861228866811,
 'quietly': 2.09861228866811,
 'college': 2.09861228866811,
 'studies': 2.09861228866811,
 'significantly': 2.09861228866811,
 'around': 2.09861228866811,
 'students': 2.09861228866811,
 'split': 2.09861228866811,
 'choices

In [7]:
tfidf = [{},{},{}]
for e,y in enumerate(tf):
    for x in y:
        tfidf[e-1][x] = y[x]*tfi[x]
tfidf

[{'ai': 0.03904069744744901,
  'and': 0.027777777777777776,
  'both': 0.05829478579633639,
  'built': 0.05829478579633639,
  'despite': 0.05829478579633639,
  'display': 0.05829478579633639,
  'displayed': 0.05829478579633639,
  'facilitate': 0.05829478579633639,
  'from': 0.05829478579633639,
  'hack': 0.05829478579633639,
  'hackers': 0.05829478579633639,
  'had': 0.03904069744744901,
  'innovative': 0.05829478579633639,
  'lab': 0.03904069744744901,
  'mode': 0.11658957159267277,
  'modes': 0.05829478579633639,
  'nuisance': 0.05829478579633639,
  'on': 0.03904069744744901,
  'process': 0.05829478579633639,
  'screen': 0.05829478579633639,
  'source': 0.05829478579633639,
  'split': 0.05829478579633639,
  'still': 0.05829478579633639,
  'switching': 0.05829478579633639,
  'system': 0.05829478579633639,
  'that': 0.027777777777777776,
  'the': 0.05555555555555555,
  'this': 0.05829478579633639,
  'to': 0.05555555555555555,
  'was': 0.027777777777777776},
 {'16': 0.037475219440501965,

In [4]:
# TODO: compute the TF-IDF of A, B and C and the cosine similarities of all possibilities

# could only find examples of finding similarity based on a particular query rather than one document to another
# where cos(A,b) =/= cos(B,A)

# given values are similar quallitatively except for (B,C)

cos(A, B): [[0.1679327]]
cos(B, C): [[0.13618963]]
cos(A, C): [[0.2850296]]


Is it consistent with the Jaccard values?