#### Cosine Similarity

Cosine similarity measures how similar two non-zero vectors in an inner product space are by taking the cosine of the angle between them. For text data, it is often used to compare two documents represented as vectors in a high-dimensional space. Its value ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction), with 0 indicating orthogonality. Because TF-IDF and other term-count vectors have no negative components, the similarity between two documents represented this way falls between 0 (no shared terms) and 1 (identical direction).

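The cosine similarity $\text{similarity}(A, B)$ between two vectors $A$ and $B$ is calculated using the formula:

$$\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $A \cdot B$ is the dot product of the two vectors and $\|A\|$, $\|B\|$ are their Euclidean norms.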
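
The formula can be checked directly with NumPy before turning to scikit-learn. The following is a minimal sketch rather than a library implementation: the helper name `cosine_sim` and the small example vectors are assumptions made for illustration.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two non-zero 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # The formula divides by the norms, so zero vectors are not allowed.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return float(np.dot(a, b) / norm_product)


# Three hypothetical cases covering the ends and the middle of the range.
same_direction = cosine_sim([1, 2], [2, 4])   # parallel vectors, close to 1
orthogonal = cosine_sim([1, 0], [0, 3])       # perpendicular vectors, 0
opposite = cosine_sim([1, 2], [-1, -2])       # opposite vectors, close to -1

print(same_direction, orthogonal, opposite)
```

Note that the magnitude of the vectors does not matter: `[1, 2]` and `[2, 4]` differ in length but point in the same direction, so their similarity is 1 up to floating-point rounding.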


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third document.",
    "Is this the first document?",
]

# Create a TfidfVectorizer with default settings
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then build the document-term matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Compute the pairwise cosine similarity between all rows (documents)
cos_sim = cosine_similarity(tfidf_matrix)

# print the feature space
print("\nVectorize Feature Names:")
print(vectorizer.get_feature_names_out())

print("\nVectorize Array for feature names:")
print(tfidf_matrix.toarray())

# print the cosine similarity matrix
print("\nCosine Similarity Matrix:")
print(cos_sim)


Vectorize Feature Names:
['and' 'document' 'first' 'is' 'second' 'the' 'third' 'this']

Vectorize Array for feature names:
[[0.         0.39896105 0.60276058 0.39896105 0.         0.39896105
  0.         0.39896105]
 [0.         0.61221452 0.         0.30610726 0.5865905  0.30610726
  0.         0.30610726]
 [0.56894695 0.29690012 0.         0.29690012 0.         0.29690012
  0.56894695 0.29690012]
 [0.         0.39896105 0.60276058 0.39896105 0.         0.39896105
  0.         0.39896105]]

Cosine Similarity Matrix:
[[1.         0.61062437 0.47380634 1.        ]
 [0.61062437 1.         0.45441641 0.61062437]
 [0.47380634 0.45441641 1.         0.47380634]
 [1.         0.61062437 0.47380634 1.        ]]
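
Two features of the output above can be verified with a short check. First, documents 1 and 4 have a similarity of 1 because they contain the same words; with the default settings, the vectorizer lowercases text and ignores word order and punctuation. Second, `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the denominator of the formula is 1 and the cosine similarity reduces to a plain dot product. The sketch below rebuilds the matrix so that it runs on its own; the variable names `row_norms` and `dot_products` are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third document.",
    "Is this the first document?",
]

tfidf_matrix = TfidfVectorizer().fit_transform(documents)
cos_sim = cosine_similarity(tfidf_matrix)

# Each TF-IDF row should already have unit Euclidean length.
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)

# With unit-length rows, the dot products alone give the cosine similarities.
dot_products = (tfidf_matrix @ tfidf_matrix.T).toarray()

print("Rows have unit norm:", np.allclose(row_norms, 1.0))
print("Dot products match cosine similarity:", np.allclose(dot_products, cos_sim))
print("Documents 1 and 4 are identical as bags of words:", np.isclose(cos_sim[0, 3], 1.0))
```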
