## PROGRAM:  
#### Measure similarity between texts using vectorization techniques (e.g., TF-IDF, word embeddings): (i) implement cosine similarity; (ii) compare the effectiveness of different similarity measures.  


### Gensim
Gensim is an open-source Python library for natural language processing.  
It is designed for unsupervised modeling of text documents and is widely used for tasks such as working with word embeddings and document similarity analysis.  
### Text Similarity  
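Text similarity measures quantify how similar two pieces of text are. This notebook uses two:  
1) Cosine Similarity: the cosine of the angle between two document vectors  
2) Jaccard Similarity: the size of the intersection of two token sets divided by the size of their union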
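
Both measures can be written in a few lines of plain Python. The sketch below is only illustrative: the function names `cosine_similarity_manual` and `jaccard_sets` are made up for this example, and returning 0.0 for an all-zero vector or an empty union is a convention chosen here, not a requirement of either measure.

```python
import math

def cosine_similarity_manual(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def jaccard_sets(a, b):
    """Shared items divided by all distinct items across both collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(cosine_similarity_manual([1, 2, 0], [2, 4, 0]))  # parallel vectors
print(cosine_similarity_manual([1, 0], [0, 1]))        # orthogonal vectors
print(jaccard_sets(["a", "b", "c"], ["b", "c", "d"]))  # 2 shared of 4 distinct
```

Cosine similarity depends only on the direction of the vectors, so scaling a vector does not change the score. Jaccard similarity ignores how often a token occurs and only looks at set membership.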
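### Vectorization
Similarity measures such as cosine similarity operate on numbers, not raw text, so each text must first be converted into a numerical vector.  
1) TF-IDF: weights each term by its frequency in a document, discounted by how many documents contain it  
2) Word Embeddings: dense vectors learned so that words used in similar contexts lie close together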
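
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn's `TfidfVectorizer` uses by default; the helper name `tfidf_vectors` and the two toy documents are invented for this illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenised docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vocab, vectors

# Toy documents, already tokenised (hypothetical example data).
docs = [["machine", "learning", "field"], ["deep", "learning", "branch"]]
vocab, vecs = tfidf_vectors(docs)
for tok, w0, w1 in zip(vocab, vecs[0], vecs[1]):
    print(f"{tok:10s} {w0:.3f} {w1:.3f}")
```

The token shared by both documents ("learning") receives a lower weight than the tokens unique to one document, which is the discounting effect TF-IDF is meant to provide.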


# Implementing Cosine Similarity using Word Embeddings (pretrained GloVe vectors loaded through Gensim)

In [8]:
import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
glove_model = api.load("glove-wiki-gigaword-50")



In [3]:
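def get_embedding(text, model):
    # Whitespace tokenization: punctuation stays attached (e.g. "intelligence."),
    # so such tokens are skipped unless that exact string is in the vocabulary.
    words = text.lower().split()
    word_vectors = [model[word] for word in words if word in model]
    if not word_vectors:
        # No known words: fall back to a zero vector of the model's dimensionality.
        return np.zeros(model.vector_size)
    # Average the word vectors into one fixed-length sentence vector.
    return np.mean(word_vectors, axis=0)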
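
The averaging step has two side effects worth seeing on a small scale: word order is lost, and tokens missing from the vocabulary are silently dropped. The sketch below reproduces the same logic with a hypothetical three-word embedding table (`toy_vectors` is made up and is not taken from GloVe), so it runs without downloading a model.

```python
# Hypothetical 3-dimensional vectors, for illustration only.
toy_vectors = {
    "machine": [1.0, 0.0, 0.0],
    "learning": [0.0, 1.0, 0.0],
    "deep": [0.0, 0.0, 1.0],
}

def mean_embedding(text, vectors, dim=3):
    """Average the vectors of known tokens; zero vector if none are known."""
    tokens = text.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(col) / len(known) for col in zip(*known)]

a = mean_embedding("machine learning", toy_vectors)
b = mean_embedding("learning machine", toy_vectors)
c = mean_embedding("Machine learning.", toy_vectors)

print(a)       # average of the two word vectors
print(a == b)  # same words in a different order give the same vector
print(c)       # "learning." is not in the table, so only "machine" counts
```

Because of the second effect, stripping punctuation or using a proper tokenizer before the lookup can change which words contribute to the sentence vector.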


In [5]:
text1 = "Machine learning is a field of artificial intelligence."
text2 = "Deep learning is a branch of artificial intelligence and machine learning."
embedding1 = get_embedding(text1, glove_model)
embedding2 = get_embedding(text2, glove_model)

In [6]:
cosine_sim = cosine_similarity([embedding1], [embedding2])


In [7]:
print(f"Cosine Similarity (Word Embeddings - GloVe): {cosine_sim[0][0]:.4f}")

Cosine Similarity (Word Embeddings - GloVe): 0.9705


# TF-IDF vectorization of text_a and text_b with scikit-learn (after spaCy preprocessing), followed by cosine similarity on the resulting vectors

In [10]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [11]:
nlp = spacy.load("en_core_web_sm")

In [16]:
text_a = "Machine learning is a field of artificial intelligence."
text_b = "Deep learning is a branch of artificial intelligence and machine learning."

In [17]:
def preprocess(text):
    doc = nlp(text.lower())
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)

In [18]:
processed_a = preprocess(text_a)
processed_b = preprocess(text_b)

In [19]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([processed_a, processed_b])

In [20]:
similarity = cosine_similarity(vectors[0], vectors[1])

In [None]:
print(f"Cosine Similarity (TF-IDF Vectorization): {similarity[0][0]:.4f}")

# Jaccard similarity between text1 and text2 on NLTK token sets (Word2Vec embeddings are also computed, but the Jaccard score does not use them)

In [23]:
import nltk
import numpy as np
import gensim.downloader as api
from nltk.tokenize import word_tokenize

In [24]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [29]:
# word_tokenize only needs the Punkt tokenizer data; recent NLTK releases look it
# up as "punkt_tab", so fetch that instead of the full 'all' collection.
nltk.download("punkt_tab")

True

In [25]:
word2vec_model = api.load("word2vec-google-news-300")



In [26]:
text1 = "Machine learning is a field of artificial intelligence."
text2 = "Deep learning is a branch of artificial intelligence and machine learning."

In [27]:
def get_embedding(text, model):
    words = word_tokenize(text.lower())
    word_vectors = [model[word] for word in words if word in model]
    if not word_vectors:
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)

In [30]:
embedding1 = get_embedding(text1, word2vec_model)
embedding2 = get_embedding(text2, word2vec_model)

In [31]:
def jaccard_similarity(text1, text2):
    words1 = set(word_tokenize(text1.lower()))
    words2 = set(word_tokenize(text2.lower()))
    intersection = len(words1 & words2)
    union = len(words1 | words2)
    return intersection / union if union != 0 else 0

In [32]:
jaccard_sim = jaccard_similarity(text1, text2)

In [33]:
print(f"Jaccard Similarity (NLTK token sets): {jaccard_sim:.4f}")

Jaccard Similarity (NLTK token sets): 0.6667
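
To compare the measures on equal footing, the sketch below scores the same two sentences with Jaccard similarity and with cosine similarity over raw word counts, once with all words and once with a few function words removed. It is a rough illustration under stated assumptions: the regex tokenizer (which drops punctuation, unlike `word_tokenize`), the four-word `STOPWORDS` set, and the helper names are all choices made for this example.

```python
import math
import re
from collections import Counter

text1 = "Machine learning is a field of artificial intelligence."
text2 = "Deep learning is a branch of artificial intelligence and machine learning."

STOPWORDS = {"is", "a", "of", "and"}  # tiny hand-picked list, illustration only

def tokens(text, drop_stopwords=False):
    """Lowercase alphabetic tokens; punctuation is discarded."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if not (drop_stopwords and w in STOPWORDS)]

def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def count_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0

for label, drop in (("all words", False), ("stopwords removed", True)):
    t1, t2 = tokens(text1, drop), tokens(text2, drop)
    print(f"{label:18s} Jaccard={jaccard(t1, t2):.4f}  "
          f"count-cosine={count_cosine(t1, t2):.4f}")
```

Both scores fall once the shared function words are removed, which shows how much of the apparent overlap comes from words that carry little meaning. Unlike averaged embeddings, neither of these measures gives any credit for related but different words such as "field" and "branch".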
