**Methods of Comparison**

*Cosine Similarity*
- How It Works: Measures the cosine of the angle between two vectors. Focuses on the direction of the vectors, ignoring their magnitude.
- When to Use: When the length of the text (magnitude of vectors) doesn't matter, but the pattern of words does.
- Pros:
    - Works well with TF-IDF vectors.
    - Commonly used for text-based recommendations.
- Cons:
    - On bag-of-words or TF-IDF vectors it only reflects shared terms, so synonyms and paraphrases can score as dissimilar.
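
To make the "direction, not magnitude" point concrete, here is a minimal from-scratch sketch. The term-count vectors and the three-word vocabulary are made up for illustration; it uses only the standard library so the arithmetic stays visible.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical term counts over the vocabulary ["python", "ai", "data"]
short_doc = [1, 2, 0]
long_doc = [3, 6, 0]   # same word pattern, three times as long
other_doc = [0, 1, 4]  # different word pattern

same_pattern = cosine(short_doc, long_doc)
different_pattern = cosine(short_doc, other_doc)

print(same_pattern)       # 1.0 up to floating-point rounding
print(different_pattern)  # noticeably lower
```

Scaling a document's counts leaves the score unchanged, which is why a long and a short text with the same word proportions look identical under this metric.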

*Euclidean Distance*
- How It Works: Calculates the straight-line distance between two vectors.
- When to Use: When the magnitude (size) of vectors is important.
- Pros:
    - Intuitive and simple to compute.
- Cons:
    - Sensitive to vector magnitude, so normalization is often required.
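
The sketch below illustrates the magnitude sensitivity and what L2 normalization does to it. The vectors are made-up term counts, and the helper names are chosen for this example only.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def l2_normalize(u):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


# Hypothetical term counts: same word pattern, different document length
short_doc = [1, 2, 0]
long_doc = [3, 6, 0]

raw_distance = euclidean(short_doc, long_doc)
normalized_distance = euclidean(l2_normalize(short_doc), l2_normalize(long_doc))

print(raw_distance)         # sqrt(20), driven purely by the length difference
print(normalized_distance)  # effectively zero once both vectors have unit length
```

For unit-length vectors the squared Euclidean distance equals 2 - 2 * cosine similarity, so after normalization the two metrics rank pairs in the same order.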

*Jaccard Similarity*
- How It Works: Compares the intersection over the union of two sets (e.g., word sets or n-grams).
- When to Use: When comparing the overlap of terms is more important than their frequency or context.
- Pros:
    - Useful for binary data or token-based comparisons.
- Cons:
    - Ignores term frequency and context.
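
As a small illustration, the same intersection-over-union function can be applied to word sets or to character n-grams. The example strings are made up, and returning 1.0 for two empty sets is a convention assumed here, not something the method prescribes.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(set_a, set_b):
    """Intersection over union (assumed to be 1.0 when both sets are empty)."""
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


# Word-level comparison: 3 shared words out of 5 distinct words
words_a = set("machine learning with python".split())
words_b = set("deep learning with python".split())
word_score = jaccard(words_a, words_b)

# Character trigram comparison: 3 shared trigrams out of 8 distinct trigrams
trigram_score = jaccard(char_ngrams("learning"), char_ngrams("learned"))

# Repeating a word changes nothing, because sets discard frequency
repeat_score = jaccard(set("ai ai ai python".split()), set("ai python".split()))

print(word_score)     # 0.6
print(trigram_score)  # 0.375
print(repeat_score)   # 1.0
```

The last case shows the limitation listed above: a text that mentions "ai" three times is indistinguishable from one that mentions it once.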

*Pearson Correlation*
- How It Works: Measures the linear correlation between two vectors.
- When to Use: When you're interested in the degree to which the two vectors change together.
- Pros:
    - Captures linear relationships well.
- Cons:
    - Not commonly used for text data.
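
Pearson correlation can be read as cosine similarity computed after subtracting each vector's mean. A minimal sketch with made-up numeric vectors (the function name is illustrative, and it assumes neither vector is constant):

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    covariance = sum(a * b for a, b in zip(du, dv))
    return covariance / math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))


x = [1.0, 2.0, 3.0, 4.0]
shifted_and_scaled = [2 * a + 10 for a in x]  # same trend, different offset and scale
reversed_trend = [4.0, 3.0, 2.0, 1.0]

print(pearson(x, shifted_and_scaled))  # 1.0
print(pearson(x, reversed_trend))      # -1.0
```

Because of the mean-centering, the score ignores both a constant offset and the overall scale, and it ranges from -1 to 1.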

*Soft Cosine Similarity*
- How It Works: Extends cosine similarity by accounting for the similarity between words (e.g., synonyms).
- When to Use: When you want to include semantic similarity (e.g., "AI" and "Artificial Intelligence").
- Pros:
    - Incorporates word embeddings for richer comparisons.
- Cons:
    - More computationally intensive.
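
The idea can be sketched with a tiny term-term similarity matrix. The vocabulary and the similarity values below are hand-picked for illustration; in practice they would typically come from word embeddings.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine: a^T S b / sqrt((a^T S a) * (b^T S b))."""
    def inner(u, v):
        return sum(sim[i][j] * u[i] * v[j]
                   for i in range(len(u)) for j in range(len(v)))
    return inner(a, b) / math.sqrt(inner(a, a) * inner(b, b))


# Vocabulary: ["ai", "artificial", "intelligence"]
# Hypothetical word-to-word similarities (1.0 on the diagonal)
word_sim = [
    [1.0, 0.6, 0.6],
    [0.6, 1.0, 0.3],
    [0.6, 0.3, 1.0],
]
# With the identity matrix, soft cosine reduces to ordinary cosine similarity
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [1, 0, 0]  # "ai"
doc_b = [0, 1, 1]  # "artificial intelligence"

plain_score = soft_cosine(doc_a, doc_b, identity)
soft_score = soft_cosine(doc_a, doc_b, word_sim)

print(plain_score)  # 0.0, because the two texts share no words
print(soft_score)   # above zero, because related words now contribute
```

The double loop over the vocabulary is where the extra cost comes from: every pair of terms is considered, not just matching ones.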

*Pre-trained Embeddings with Similarity Metrics*
- How It Works: Uses pre-trained embeddings (e.g., static word vectors such as Word2Vec or GloVe, or contextual models such as BERT) to represent text and then calculates similarity (e.g., cosine similarity) on the embeddings.
- When to Use: When you want to capture semantic meaning and contextual relationships between words.
- Pros:
    - Captures semantic meaning.
    - State-of-the-art performance for many tasks.
- Cons:
    - Requires more computational resources.
    - May need fine-tuning for specific datasets.


In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
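from gensim.models import KeyedVectors  # only needed if you load real word vectors below
from gensim.corpora import Dictionary


def softcossim(bow1, bow2, similarity_matrix):
    """Soft cosine similarity between two bag-of-words vectors.

    Local NumPy stand-in for gensim.matutils.softcossim, which was removed
    in Gensim 4.0. `similarity_matrix` is a dense term-term matrix whose
    rows and columns are indexed by dictionary token id.
    """
    vec1 = np.zeros(similarity_matrix.shape[0])
    vec2 = np.zeros(similarity_matrix.shape[0])
    for token_id, count in bow1:
        vec1[token_id] = count
    for token_id, count in bow2:
        vec2[token_id] = count
    norm = np.sqrt((vec1 @ similarity_matrix @ vec1) * (vec2 @ similarity_matrix @ vec2))
    return float(vec1 @ similarity_matrix @ vec2 / norm) if norm > 0 else 0.0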

# Example dataset
Samples = [
    "Advanced Machine Learning and Deep Learning, with focus on Neural Networks and AI.",
    "Introduction to Programming in Python and Basics of AI.",
    "Data Science and Machine Learning using Python and AI techniques.",
]

# Preprocessing: lowercase and strip commas and periods (tokens are split on whitespace later)
cleaned_samples = [desc.lower().replace(",", "").replace(".", "") for desc in Samples]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_samples)

# Method 1: Cosine Similarity
print("Cosine Similarity:")
cos_sim = cosine_similarity(tfidf_matrix)
print(cos_sim)

# Method 2: Euclidean Distance
print("\nEuclidean Distance:")
euclidean_dist = euclidean_distances(tfidf_matrix)
print(euclidean_dist)

# Method 3: Jaccard Similarity
def jaccard_similarity(str1, str2):
    set1, set2 = set(str1.split()), set(str2.split())
    return len(set1 & set2) / len(set1 | set2)

print("\nJaccard Similarity:")
jaccard_sim = np.zeros((len(cleaned_samples), len(cleaned_samples)))
for i in range(len(cleaned_samples)):
    for j in range(len(cleaned_samples)):
        jaccard_sim[i, j] = jaccard_similarity(cleaned_samples[i], cleaned_samples[j])
print(jaccard_sim)

# Method 4: Soft Cosine Similarity
print("\nSoft Cosine Similarity:")
# Load pre-trained word vectors (replace with actual file path to word embeddings like GloVe or Word2Vec)
# word_vectors = KeyedVectors.load_word2vec_format("path/to/word2vec/file", binary=True)
# For demonstration, we create a dummy word embedding dictionary from random vectors,
# so the resulting scores are placeholders, not real semantic similarities
dummy_word_vectors = {
    "advanced": np.random.rand(100),
    "machine": np.random.rand(100),
    "learning": np.random.rand(100),
    "python": np.random.rand(100),
    "ai": np.random.rand(100),
    "data": np.random.rand(100),
    "science": np.random.rand(100),
}

# Create a Gensim dictionary and a term-term similarity matrix indexed by token id
dictionary = Dictionary([desc.split() for desc in cleaned_samples])
similarity_matrix = np.zeros((len(dictionary), len(dictionary)))

# Fill the similarity matrix using dummy vectors.
# Index by token id (not enumeration order) so rows and columns line up with doc2bow output.
for word1, i in dictionary.token2id.items():
    for word2, j in dictionary.token2id.items():
        vec1, vec2 = dummy_word_vectors.get(word1, np.zeros(100)), dummy_word_vectors.get(word2, np.zeros(100))
        similarity_matrix[i, j] = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2) + 1e-10)

# Every word is fully similar to itself, including words without a dummy vector
np.fill_diagonal(similarity_matrix, 1.0)

# Soft Cosine Similarity
soft_cos_sim = np.zeros((len(cleaned_samples), len(cleaned_samples)))
for i, desc1 in enumerate(cleaned_samples):
    bow1 = dictionary.doc2bow(desc1.split())
    for j, desc2 in enumerate(cleaned_samples):
        bow2 = dictionary.doc2bow(desc2.split())
        soft_cos_sim[i, j] = softcossim(bow1, bow2, similarity_matrix)

print(soft_cos_sim)

# Method 5: Pre-trained Embeddings with Cosine Similarity
print("\nPre-trained Embeddings with Cosine Similarity:")
# Generate dummy sentence embeddings by averaging word vectors
sentence_embeddings = []
for desc in cleaned_samples:
    vectors = [dummy_word_vectors.get(word, np.zeros(100)) for word in desc.split()]
    sentence_embeddings.append(np.mean(vectors, axis=0))

# Compute cosine similarity for sentence embeddings
pretrained_cos_sim = cosine_similarity(sentence_embeddings)
print(pretrained_cos_sim)
