# Cosine Similarity

* **What it measures:** The cosine of the angle between two vectors in a multi-dimensional space.
* **Range:** -1 (opposite directions) to 1 (same direction); 0 means the vectors are orthogonal.
* **Use case:** Comparing numerical feature vectors, word embeddings, or document vectors.
* **Key idea:** Vectors pointing in the same direction are considered similar, regardless of magnitude.

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Example vectors, reshaped to 2D (one row each) because cosine_similarity expects 2D input
vec1 = np.array([1, 2, 3]).reshape(1, -1)
vec2 = np.array([2, 3, 4]).reshape(1, -1)

similarity = cosine_similarity(vec1, vec2)[0][0]
print("Cosine Similarity:", similarity)


Cosine Similarity: 0.9925833339709303
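
To make the "regardless of magnitude" idea concrete, here is a minimal sketch that computes the same score directly from the definition (dot product divided by the product of the vector norms). The helper name `cosine_from_scratch` is hypothetical and not part of scikit-learn; it assumes both vectors are non-zero.

```python
import numpy as np

def cosine_from_scratch(u, v):
    """Cosine of the angle between u and v (hypothetical helper, assumes non-zero vectors)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

base = cosine_from_scratch(a, b)         # roughly 0.9926, matching the sklearn result
scaled = cosine_from_scratch(a, 10 * b)  # scaling a vector leaves the score unchanged
opposite = cosine_from_scratch(a, -a)    # opposite directions give roughly -1

print("Base:", base)
print("Scaled:", scaled)
print("Opposite:", opposite)
```

Multiplying `b` by 10 changes its length but not its direction, so the score stays the same up to floating-point rounding.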


# Jaccard Similarity

* **What it measures:** Overlap between two sets.
* **Range:** 0 (no overlap) to 1 (identical sets)
* **Use case:** Comparing sets of words, tags, or features.
* **Key idea:** Similarity = size of intersection ÷ size of union.

In [None]:
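def jaccard_similarity(set1, set2):
    """Return |set1 ∩ set2| / |set1 ∪ set2| for two Python sets."""
    union = set1 | set2
    if not union:
        # Both sets are empty; scoring them as identical is a convention chosen here
        return 1.0
    return len(set1 & set2) / len(union)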

s1 = {"apple", "banana", "mango"}
s2 = {"banana", "mango", "orange"}

print("Jaccard Similarity:", jaccard_similarity(s1, s2))
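
For the "sets of words" use case, the sketch below turns two sentences into word sets and works through the intersection and union sizes step by step. The `word_set` helper and the example sentences are illustrative assumptions; splitting on whitespace is a deliberately naive tokenizer.

```python
def word_set(text):
    """Lowercase the text and split on whitespace (naive tokenizer, for illustration only)."""
    return set(text.lower().split())

a = word_set("the cat sat on the mat")
b = word_set("the cat lay on the rug")

shared = a & b      # words present in both sentences
combined = a | b    # words present in either sentence
score = len(shared) / len(combined)

print(sorted(shared))              # ['cat', 'on', 'the']
print(len(shared), len(combined))  # 3 7
print(round(score, 4))             # 0.4286
```

Because sets discard duplicates, the repeated "the" in each sentence counts only once.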


# Levenshtein (Edit) Similarity

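* **What it measures:** The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. The similarity used here is 1 minus that distance divided by the length of the longer string.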
* **Range:** 0 (completely different) to 1 (identical)
* **Use case:** Spell checking, fuzzy string matching, DNA sequence comparison.
* **Key idea:** Fewer edits → more similar strings.

In [6]:
!pip install levenshtein





In [7]:
import Levenshtein

s1 = "kitten"
s2 = "sitting"

# Normalize the edit distance by the longer string's length to get a score in [0, 1]
similarity = 1 - (Levenshtein.distance(s1, s2) / max(len(s1), len(s2)))
print("Levenshtein Similarity:", similarity)


Levenshtein Similarity: 0.5714285714285714
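
To show where the edit count comes from, here is a minimal pure-Python sketch of the classic dynamic-programming approach, keeping only two rows of the table at a time. The function names are hypothetical; this is not how the `Levenshtein` package is implemented internally, only a reference version of the same distance.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] = distance between the first i-1 characters of a and the first j of b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """1 - distance / longer length; two empty strings are treated as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest

print(levenshtein_distance("kitten", "sitting"))    # 3
print(levenshtein_similarity("kitten", "sitting"))  # roughly 0.5714
```

The three edits are k→s, e→i, and appending g, which is why the similarity is 1 - 3/7.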


# TF-IDF with Cosine Similarity

* **What it measures:** Cosine similarity after converting text into weighted term vectors (TF-IDF).
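* **Range:** 0 (no shared terms) to 1 (identical term weights); TF-IDF weights are never negative, so the score cannot drop below 0.
* **Use case:** Comparing sentences, documents, or paragraphs in NLP.
* **Key idea:** Terms that appear in many documents get lower weight; terms that are rare across the collection get higher weight.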

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "I love machine learning and AI",
    "AI and machine learning are amazing",
    "Python is great for data science"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Compare docs[0] with every document, including itself (so the first score is 1.0)
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print("Similarity with other docs:", similarities[0])


Similarity with other docs: [1.         0.61195311 0.        ]
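
To see how the weighting produces these scores, the sketch below recomputes them with the standard library only. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: lowercasing, tokens of two or more word characters, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    "I love machine learning and AI",
    "AI and machine learning are amazing",
    "Python is great for data science",
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters, so "I" is dropped
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(tokens):
    """Term count times IDF, scaled to unit length."""
    weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]

def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine
    return sum(w * v.get(term, 0.0) for term, w in u.items())

scores = [cosine(vectors[0], v) for v in vectors]
print([round(s, 4) for s in scores])  # approximately [1.0, 0.612, 0.0]
```

The first two documents share "machine", "learning", "and", and "ai", while the third shares no terms with the first, which is why its score is exactly 0.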
