### **Cosine Similarity**

Reference: [Cosine Similarity (LearnDataSci glossary)](https://www.learndatasci.com/glossary/cosine-similarity/#:~:text=For%20example%3A,a%20cosine%20similarity%20of%20%2D1.)

# **Implementing Cosine Similarity using a Python Function**

In [11]:
import numpy as np

def cosine_similarity(x, y):
    """Return the cosine similarity between two equal-length 1-D vectors."""
    x = np.asarray(x)
    y = np.asarray(y)

    # Vectors of different lengths cannot be compared
    if len(x) != len(y):
        return None

    # Compute the dot product between x and y
    dot_product = np.dot(x, y)

    # Compute the L2 norms (magnitudes) of x and y
    magnitude_x = np.sqrt(np.sum(x**2))
    magnitude_y = np.sqrt(np.sum(y**2))

    # Compute the cosine similarity (undefined if either vector is all zeros)
    return dot_product / (magnitude_x * magnitude_y)
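
As a quick sanity check, the sketch below evaluates the same formula (dot product divided by the product of the two magnitudes) on three hand-picked pairs of vectors. Parallel vectors should score 1, orthogonal vectors 0, and opposite vectors -1. The helper name `cosine_sim` and the test vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cosine_sim(x, y):
    # Hypothetical helper mirroring the formula above: dot(x, y) / (|x| * |y|)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

parallel = cosine_sim([1, 2, 3], [2, 4, 6])      # same direction
orthogonal = cosine_sim([1, 0], [0, 1])          # perpendicular
opposite = cosine_sim([1, 2, 3], [-1, -2, -3])   # opposite direction

print(parallel, orthogonal, opposite)
```

Because only the direction of the vectors matters, scaling a vector by a positive constant leaves its similarity to any other vector unchanged.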

In [12]:
x='data science is one of the important science fields'
y='this is one of the best data science courses'

In [14]:
corpus = [
    'data science is one of the most important fields of science',
    'this is one of the best data science courses',
    'data scientists analyze data',
]

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a matrix to represent the corpus
X = CountVectorizer().fit_transform(corpus).toarray()

print(X)

[[0 0 0 1 1 1 1 1 2 1 2 0 1 0]
 [0 1 1 1 0 0 1 0 1 1 1 0 1 1]
 [1 0 0 2 0 0 0 0 0 0 0 1 0 0]]
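
The printed matrix does not show which word each column stands for. The sketch below rebuilds the same counts in plain Python to make the layout visible: one row per document, one column per distinct word, columns in alphabetical order. It assumes that lower-casing and splitting on whitespace is a good enough stand-in for `CountVectorizer`'s tokenizer on this particular corpus; the names `tokenized`, `vocabulary`, and `counts` are made up for this illustration.

```python
from collections import Counter

corpus = [
    'data science is one of the most important fields of science',
    'this is one of the best data science courses',
    'data scientists analyze data',
]

# Simplifying assumption: lower-case and split on whitespace
tokenized = [doc.lower().split() for doc in corpus]

# Vocabulary in alphabetical order, one column per distinct word
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document: how often each vocabulary word occurs in it
counts = []
for doc in tokenized:
    freq = Counter(doc)
    counts.append([freq[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

Reading the first row against the vocabulary, the two 2s fall in the columns for "of" and "science", the two words that appear twice in the first document.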


In [17]:
cos_sim_1_2 = cosine_similarity(X[0, :], X[1, :])
cos_sim_1_3 = cosine_similarity(X[0, :], X[2, :])
cos_sim_2_3 = cosine_similarity(X[1, :], X[2, :])

print('Cosine Similarity between: ')
print('\tDocument 1 and Document 2: ', cos_sim_1_2)
print('\tDocument 1 and Document 3: ', cos_sim_1_3)
print('\tDocument 2 and Document 3: ', cos_sim_2_3)

Cosine Similarity between: 
	Document 1 and Document 2:  0.6885303726590962
	Document 1 and Document 3:  0.21081851067789195
	Document 2 and Document 3:  0.2721655269759087


# **Implementing Cosine Similarity using scikit-learn's `cosine_similarity`**

In [18]:
# Note: this import shadows the hand-written cosine_similarity function defined earlier
from sklearn.metrics.pairwise import cosine_similarity

In [19]:
# Pass the first two documents as a 2-row matrix
cos_sim_1_2 = cosine_similarity([X[0, :], X[1, :]])

print('Cosine Similarity between Document 1 and Document 2 is \n', cos_sim_1_2)

Cosine Similarity between Document 1 and Document 2 is 
 [[1.         0.68853037]
 [0.68853037 1.        ]]


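The input to scikit-learn's `cosine_similarity` is a matrix with one row per sample. When called with a single argument, the output is a square matrix of pairwise similarities: entry (i, j) is the similarity between rows i and j, so the diagonal entries are 1 and the matrix is symmetric.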
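
To make the matrix-in, matrix-out behavior concrete, here is a minimal NumPy sketch of a pairwise cosine computation over all three documents. It illustrates the idea only and is not scikit-learn's actual implementation; the count matrix is copied from the `CountVectorizer` output printed earlier, and the names `X_unit` and `pairwise` are made up for this example.

```python
import numpy as np

# Count matrix copied from the CountVectorizer output shown earlier
X = np.array([
    [0, 0, 0, 1, 1, 1, 1, 1, 2, 1, 2, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
], dtype=float)

# Scale every row to unit length (assumes no row is all zeros)
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

# Dot products between unit vectors are cosine similarities
pairwise = X_unit @ X_unit.T

print(np.round(pairwise, 4))
```

The off-diagonal entries should agree with the three values computed by the hand-written function (about 0.689, 0.211, and 0.272).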

# **Implement Cosine Similarity using NLTK**

In [1]:
# Program to measure the similarity between
# two sentences using cosine similarity.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


In [2]:
# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
# Note: unlike the input() variant above, these strings are not lower-cased
X = "I love horror movies"
Y = "Lights out is a horror movie"


In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)


In [9]:
X_list


['I', 'love', 'horror', 'movies']

In [10]:
Y_list

['Lights', 'out', 'is', 'a', 'horror', 'movie']

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
# sw contains the list of English stop words (all lower-case)
sw = stopwords.words('english')
l1 = []
l2 = []

In [14]:
# remove stop words from the token lists
# (the membership test is case-sensitive, so the capitalized 'I' is kept)
X_set = {w for w in X_list if w not in sw}
Y_set = {w for w in Y_list if w not in sw}

In [15]:

X_set

{'I', 'horror', 'love', 'movies'}
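
The pronoun 'I' survives the filter because the membership test is case-sensitive. The sketch below shows the effect with a tiny stand-in stop-word set; the set `stop_words` and the variable names are assumptions made for illustration, not NLTK's actual list, which is much longer.

```python
# Stand-in stop-word set (an assumption for illustration; NLTK's real
# English list is much longer but is likewise lower-case)
stop_words = {"i", "a", "is", "out"}

tokens = ["I", "love", "horror", "movies"]

# Case-sensitive test: "I" != "i", so the pronoun is kept
kept_as_is = {w for w in tokens if w not in stop_words}

# Lower-casing first lets "i" match the stop-word set
kept_lowered = {w.lower() for w in tokens if w.lower() not in stop_words}

print(kept_as_is)
print(kept_lowered)
```

Lower-casing the input, as the commented-out `input(...).lower()` lines do, would therefore shrink the first word set and change the resulting similarity.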

In [16]:
# build binary (0/1) vectors over the union of both word sets
rvector = X_set.union(Y_set)
for w in rvector:
    l1.append(1 if w in X_set else 0)
    l2.append(1 if w in Y_set else 0)
c = 0

In [17]:
l1

[0, 1, 1, 1, 0, 1]

In [18]:
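# cosine formula: dot product divided by the product of the vector norms
# (for 0/1 vectors, the squared norm equals the sum of the entries)
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)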

similarity:  0.2886751345948129
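
Since the vectors only contain 0s and 1s, the same number can be obtained directly from the word sets, without building the vectors. The following is a minimal sketch under stated assumptions: `X_set` is copied from the output above, while `Y_set` is not printed in the notebook and is assumed here to be what remains after "out", "is", and "a" are removed as stop words.

```python
from math import sqrt

# Word sets after stop-word removal. X_set is copied from the output above;
# Y_set is an assumption (it is not printed in the original).
X_set = {"I", "love", "horror", "movies"}
Y_set = {"Lights", "horror", "movie"}

# For 0/1 vectors the dot product counts the shared words, and each
# squared norm is just the size of the corresponding set.
shared = len(X_set & Y_set)
similarity = shared / sqrt(len(X_set) * len(Y_set))

print("similarity: ", similarity)
```

Only "horror" is shared, because "movie" and "movies" are treated as different tokens; stemming or lemmatization would be needed to match them.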
