<a href="https://colab.research.google.com/github/Aleena24/NLP_lab/blob/main/exercise_1_SimilarityAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [3]:
# Two sample documents to compare
para1 = "The quick brown fox jumps over the lazy dog. It is a classic sentence used to showcase the usage of all the letters in the English alphabet. This sentence is often employed in typing practice and as an example in various linguistic studies."
para2 = "The lazy dog is jumped over by the quick brown fox. This sentence, much like its counterpart, demonstrates the utilization of all letters in the English alphabet. It serves as a common exercise in typing proficiency and is frequently referenced in linguistic research."

In [4]:
# Preprocessing: tokenize, remove stopwords, lemmatize
def preprocess(text):
    # Tokenization
    tokens = word_tokenize(text.lower())

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization (default part of speech is noun, so verb forms such as "jumped" stay unchanged)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    return " ".join(lemmatized_tokens)


In [6]:
# Preprocess documents
preprocessed_doc1 = preprocess(para1)
preprocessed_doc2 = preprocess(para2)

# Show the original (unprocessed) paragraphs for reference
print(para1)
print(para2)

The quick brown fox jumps over the lazy dog. It is a classic sentence used to showcase the usage of all the letters in the English alphabet. This sentence is often employed in typing practice and as an example in various linguistic studies.
The lazy dog is jumped over by the quick brown fox. This sentence, much like its counterpart, demonstrates the utilization of all letters in the English alphabet. It serves as a common exercise in typing proficiency and is frequently referenced in linguistic research.


# TF-IDF (Term Frequency-Inverse Document Frequency)

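TF-IDF vectors are numerical representations of text documents. Each term is weighted by how often it occurs in a document, and that weight is scaled down for terms that occur in many documents of the corpus, so the vector captures how important a term is to one document relative to the rest.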
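
As a rough illustration of this weighting, the sketch below computes TF-IDF by hand for two toy documents. The helper `tfidf_vectors` and the toy sentences are hypothetical, and the smoothed IDF formula with L2 normalisation is an assumption chosen to mirror the default settings of `TfidfVectorizer`; it is not the library's actual implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = {term: sum(term in toks for toks in tokenized) for term in vocab}
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[term] * idf[term] for term in vocab]
        # Scale each document vector to unit length (guard against empty docs)
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

vocab, vectors = tfidf_vectors(["quick brown fox", "lazy brown dog"])
for term, w0, w1 in zip(vocab, vectors[0], vectors[1]):
    print(f"{term:>6}  {w0:.3f}  {w1:.3f}")
```

In this toy corpus "brown" appears in both documents, so it receives a lower weight than the terms that are unique to one document.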

In [7]:
# Compute TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([preprocessed_doc1, preprocessed_doc2])


# Cosine Similarity

Cosine similarity measures how similar two vectors are in a multi-dimensional space: it is the cosine of the angle between them, so it depends on their direction and not on their length.
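
A minimal sketch of the calculation in plain Python follows. The function name `cosine` and the example vectors are hypothetical, and returning 0.0 for a zero vector is a convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector has no direction
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # parallel vectors of different length
orthogonal = cosine([1, 0, 0], [0, 3, 0])      # no shared non-zero component
print(same_direction, orthogonal)
```

The first pair differs only in length and scores (up to floating-point rounding) 1, while the second pair has no component in common and scores 0.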

In [8]:
# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

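A value of about 0.331 indicates that the two documents share part of their vocabulary but are far from identical at the word level. Although the paragraphs are paraphrases of each other, TF-IDF only compares terms, so synonyms such as "usage" and "utilization" do not count as matches.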

In [9]:
print("Cosine Similarity between the two documents:", cosine_sim[0][0])

Cosine Similarity between the two documents: 0.3314837871046527


# Similarity Analysis between 2 paragraphs with the same text

In [10]:
para3 = "The quick brown fox jumps over the lazy dog. It is a classic sentence used to showcase the usage of all the letters in the English alphabet. This sentence is often employed in typing practice and as an example in various linguistic studies."
para4 = "The quick brown fox jumps over the lazy dog. It is a classic sentence used to showcase the usage of all the letters in the English alphabet. This sentence is often employed in typing practice and as an example in various linguistic studies."


In [11]:
# Preprocess documents
preprocessed_doc1 = preprocess(para3)
preprocessed_doc2 = preprocess(para4)

# Show the original (unprocessed) paragraphs for reference
print(para3)
print(para4)

The quick brown fox jumps over the lazy dog. It is a classic sentence used to showcase the usage of all the letters in the English alphabet. This sentence is often employed in typing practice and as an example in various linguistic studies.
The quick brown fox jumps over the lazy dog. It is a classic sentence used to showcase the usage of all the letters in the English alphabet. This sentence is often employed in typing practice and as an example in various linguistic studies.


In [12]:
# Compute TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([preprocessed_doc1, preprocessed_doc2])


In [13]:
# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

If the cosine similarity is 1, it means the vectors have the same direction and are perfectly aligned, indicating maximum similarity.

In [14]:
print("Cosine Similarity between the two documents:", cosine_sim[0][0])

Cosine Similarity between the two documents: 1.0


# Similarity Analysis between 2 paragraphs with no similarity

In [15]:
para5 = "In the heart of the bustling city, the vibrant market came to life each morning with a symphony of colors and sounds. Stalls adorned with fresh produce and exotic spices beckoned shoppers, while the rhythmic chatter of vendors echoed through the narrow alleys. The aroma of street food wafted through the air, creating a sensory tapestry that encapsulated the lively atmosphere of the urban marketplace."
para6 = "Nestled at the edge of the ancient forest, a serene pond reflected the dappled sunlight filtering through the thick foliage. The tranquil water mirrored the surrounding trees and served as a haven for diverse wildlife. Birds chirped melodiously, their vibrant plumage blending with the lush greenery, while the occasional rustle of leaves hinted at the secretive movements of unseen creatures in the enchanting woodland."

In [17]:
# Preprocess documents
preprocessed_doc1 = preprocess(para5)
preprocessed_doc2 = preprocess(para6)

# Show the original (unprocessed) paragraphs for reference
print(para5)
print(para6)

In the heart of the bustling city, the vibrant market came to life each morning with a symphony of colors and sounds. Stalls adorned with fresh produce and exotic spices beckoned shoppers, while the rhythmic chatter of vendors echoed through the narrow alleys. The aroma of street food wafted through the air, creating a sensory tapestry that encapsulated the lively atmosphere of the urban marketplace.
Nestled at the edge of the ancient forest, a serene pond reflected the dappled sunlight filtering through the thick foliage. The tranquil water mirrored the surrounding trees and served as a haven for diverse wildlife. Birds chirped melodiously, their vibrant plumage blending with the lush greenery, while the occasional rustle of leaves hinted at the secretive movements of unseen creatures in the enchanting woodland.


In [18]:
# Compute TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([preprocessed_doc1, preprocessed_doc2])

In [19]:
# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

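A cosine similarity of -1 would mean the two vectors point in exactly opposite directions, but TF-IDF weights are never negative, so the score here cannot go below 0. A value near 0 means the two documents share almost no terms.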

In [20]:
print("Cosine Similarity between the two documents:", cosine_sim[0][0])

Cosine Similarity between the two documents: 0.013497591773582429
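
The score above is close to 0 rather than negative. A minimal sketch with toy word lists shows why. The lists are hypothetical (loosely based on the two paragraphs), the helper `cosine_counts` is illustrative, and raw term counts are used in place of TF-IDF weights.

```python
import math
from collections import Counter

def cosine_counts(a, b):
    """Cosine similarity between two bags of words stored as Counters."""
    # Only terms present in both bags contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

market = Counter("vibrant market stall spice vendor".split())
forest = Counter("vibrant pond forest bird leaf".split())
disjoint = Counter("pond forest bird leaf".split())

one_shared = cosine_counts(market, forest)     # one shared term out of five
none_shared = cosine_counts(market, disjoint)  # no shared terms
print(one_shared, none_shared)
```

With non-negative counts, a single shared term yields a small positive score and no shared terms yields exactly 0, which is the lower bound for this kind of vector.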
