
# **STEP 2 — Import required libraries**

In [None]:
# Basic libraries
import numpy as np
import pandas as pd
import string

# NLP libraries
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Feature extraction & similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

`Why these libraries are used`

**nltk** → text preprocessing, tokenization, stopwords, WordNet

**pandas** → dataset handling

**scikit-learn** → TF-IDF vectorization and cosine similarity

**numpy** → numerical operations

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# **STEP 3 — Load or prepare dataset**

In [None]:
data = [
    ("The doctor is treating a patient", "The physician is caring for a sick person"),
    ("I love machine learning", "I enjoy studying machine learning"),
    ("The cat sat on the mat", "The cat sat on the mat"),
    ("She is reading a book", "She is watching television"),
    ("The sky is blue", "Bananas are yellow"),
    ("He drives a car", "He operates an automobile"),
    ("Football is a popular sport", "Soccer is loved worldwide"),
    ("I am happy today", "I feel joyful today"),
    ("This is a pen", "That is a pencil"),
    ("The sun is bright", "The sun shines brightly"),

    ("Dogs are loyal animals", "Cats are independent animals"),
    ("He is eating food", "He is consuming a meal"),
    ("She teaches mathematics", "She instructs math"),
    ("The train arrived late", "The train was delayed"),
    ("I like coffee", "I hate coffee"),

    ("He wrote a letter", "He sent an email"),
    ("The child is playing", "The kid is playing"),
    ("Weather is very cold", "It is freezing outside"),
    ("She bought a new phone", "She purchased a smartphone"),
    ("The movie was boring", "The film was dull"),

    ("Birds can fly", "Fish can swim"),
    ("He is running fast", "He is sprinting quickly"),
    ("I completed my homework", "My assignment is finished"),
    ("The food tastes good", "The meal is delicious"),
    ("She is sad", "She is unhappy"),

    ("Open the door", "Close the window"),
    ("The exam was difficult", "The test was hard"),
    ("He is my friend", "He is my enemy"),
    ("The laptop is expensive", "The computer costs a lot"),
    ("I am learning NLP", "I am studying natural language processing")
]

df = pd.DataFrame(data, columns=["Sentence1", "Sentence2"])
df.head()

| | Sentence1 | Sentence2 |
|---|---|---|
| 0 | The doctor is treating a patient | The physician is caring for a sick person |
| 1 | I love machine learning | I enjoy studying machine learning |
| 2 | The cat sat on the mat | The cat sat on the mat |
| 3 | She is reading a book | She is watching television |
| 4 | The sky is blue | Bananas are yellow |


`Dataset Explanation`

This dataset consists of 30 sentence pairs created manually.
The pairs include identical sentences, paraphrased sentences, and unrelated sentences.

# **STEP 4 — Preprocess text**

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                              # lowercase
    text = text.translate(str.maketrans('', '', string.punctuation + string.digits))
    tokens = word_tokenize(text)                     # tokenize
    tokens = [w for w in tokens if w not in stop_words]  # remove stopwords
    tokens = [lemmatizer.lemmatize(w) for w in tokens]   # lemmatize
    return " ".join(tokens)

`Preprocessing Explanation`

* Lowercasing removes case sensitivity
* Punctuation & number removal reduces noise
* Tokenization splits text into words
* Stopword removal removes unimportant words
* Lemmatization converts words to base form
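
The first four steps can be traced on a single sentence. The sketch below is only an illustration: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny hand-written stand-in for NLTK's English list, whitespace splitting stands in for `word_tokenize`, and lemmatization is left out because it needs the WordNet data.

```python
import string

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"the", "is", "a", "on", "for"}

def toy_preprocess(text):
    text = text.lower()                                    # lowercase
    table = str.maketrans("", "", string.punctuation + string.digits)
    text = text.translate(table)                           # drop punctuation and digits
    tokens = text.split()                                  # whitespace split (stand-in tokenizer)
    tokens = [w for w in tokens if w not in TOY_STOPWORDS] # remove stopwords
    return " ".join(tokens)

print(toy_preprocess("The doctor is treating 2 patients!"))
# prints: doctor treating patients
```

With the real `preprocess` function, lemmatization would additionally reduce "patients" to "patient".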

# **STEP 5 — Represent text numerically**

In [None]:
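# Recent NLTK releases load the tokenizer data from 'punkt_tab',
# so fetch it before word_tokenize is called inside preprocess().
nltk.download('punkt_tab', quiet=True)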

df["S1_clean"] = df["Sentence1"].apply(preprocess)
df["S2_clean"] = df["Sentence2"].apply(preprocess)

In [None]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(
    df["S1_clean"].tolist() + df["S2_clean"].tolist()
)
print(tfidf_matrix)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 134 stored elements and shape (60, 118)>
  Coords	Values
  (0, 26)	0.5773502691896257
  (0, 110)	0.5773502691896257
  (0, 78)	0.5773502691896257
  (1, 63)	0.6279905157261365
  (1, 66)	0.5703526715167403
  (1, 59)	0.5294579702410176
  (2, 14)	0.5487480098852467
  (2, 92)	0.5911326507844838
  (2, 67)	0.5911326507844838
  (3, 89)	0.7071067811865475
  (3, 7)	0.7071067811865475
  (4, 96)	0.7071067811865475
  (4, 6)	0.7071067811865475
  (5, 29)	0.7071067811865475
  (5, 12)	0.7071067811865475
  (6, 44)	0.5773502691896257
  (6, 85)	0.5773502691896257
  (6, 99)	0.5773502691896257
  (7, 48)	0.7402613937158491
  (7, 108)	0.6723191719517365
  (8, 79)	1.0
  (9, 102)	0.6723191719517365
  (9, 10)	0.7402613937158491
  (10, 27)	0.5949787509718095
  (10, 65)	0.5949787509718095
  :	:
  (48, 87)	0.7071067811865475
  (48, 97)	0.7071067811865475
  (49, 39)	0.7071067811865475
  (49, 30)	0.7071067811865475
  (50, 41)	0.7071067811865475
  (50, 103)	

TF-IDF reduces the weight of words that appear in many sentences and emphasizes rarer, more informative terms.
For similarity tasks it usually works better than raw Bag-of-Words counts, because frequent words no longer dominate the vectors.
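
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` default. The corpus and the name `tfidf_vectors` are hypothetical, and this is not the library's actual implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned sentences.
docs = ["cat sat mat", "cat sat", "sky blue"]

def tfidf_vectors(docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # Document frequency: in how many sentences each word occurs.
    df = Counter(w for doc in tokenized for w in set(doc))
    # Smoothed IDF (assumed to match scikit-learn's default settings).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0   # L2 normalization
        vectors.append([x / norm for x in raw])
    return vocab, vectors

vocab, vectors = tfidf_vectors(docs)
for word, weight in zip(vocab, vectors[0]):
    print(f"{word:5s} {weight:.3f}")
```

In the first sentence, "mat" occurs in only one document and therefore gets a larger weight than "cat" or "sat", which occur in two.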

# **STEP 6 — Compute Cosine Similarity**

In [None]:
cosine_scores = []

for i in range(len(df)):
    vec1 = tfidf_matrix[i]
    vec2 = tfidf_matrix[i + len(df)]
    score = cosine_similarity(vec1, vec2)[0][0]
    cosine_scores.append(score)

df["Cosine_Similarity"] = cosine_scores
df[["Sentence1", "Sentence2", "Cosine_Similarity"]].head()

| | Sentence1 | Sentence2 | Cosine_Similarity |
|---|---|---|---|
| 0 | The doctor is treating a patient | The physician is caring for a sick person | 0.0 |
| 1 | I love machine learning | I enjoy studying machine learning | 0.526076 |
| 2 | The cat sat on the mat | The cat sat on the mat | 1.0 |
| 3 | She is reading a book | She is watching television | 0.0 |
| 4 | The sky is blue | Bananas are yellow | 0.0 |


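`Interpretation (5 examples)`

Row 2 (identical sentences) scores 1.0.

Row 1 (a paraphrase that keeps "machine learning") scores 0.526, a medium value.

Row 0 is also a paraphrase, but the two sentences share no terms after preprocessing, so it scores 0.0.

Rows 3 and 4 share no content words and score 0.0.

A higher cosine score means more overlap in TF-IDF-weighted terms. The measure is sensitive to word overlap and term importance, not to meaning.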
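
The score itself is the dot product of the two vectors divided by the product of their lengths. A minimal sketch with plain lists follows; the helper name `cosine` and the example vectors are hypothetical, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0            # an empty vector has no direction to compare
    return dot / (norm_u * norm_v)

print(cosine([1, 1, 0], [1, 0, 1]))   # one shared term out of two each: about 0.5
print(cosine([1, 0], [0, 1]))         # no shared terms: 0.0
```

Vectors with no shared nonzero positions always score 0, which is why the synonym pair in row 0 gets 0.0.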

# **STEP 7 — Compute Jaccard Similarity**

In [None]:
def jaccard_similarity(s1, s2):
    set1 = set(s1.split())
    set2 = set(s2.split())
    union = set1 | set2
    if not union:        # both sentences are empty after preprocessing
        return 0.0
    return len(set1 & set2) / len(union)

df["Jaccard_Similarity"] = df.apply(
    lambda x: jaccard_similarity(x["S1_clean"], x["S2_clean"]), axis=1
)

df[["Sentence1", "Sentence2", "Jaccard_Similarity"]].head()

| | Sentence1 | Sentence2 | Jaccard_Similarity |
|---|---|---|---|
| 0 | The doctor is treating a patient | The physician is caring for a sick person | 0.0 |
| 1 | I love machine learning | I enjoy studying machine learning | 0.4 |
| 2 | The cat sat on the mat | The cat sat on the mat | 1.0 |
| 3 | She is reading a book | She is watching television | 0.0 |
| 4 | The sky is blue | Bananas are yellow | 0.0 |


**Jaccard Interpretation**

Depends only on word overlap: shared words divided by all distinct words in the pair

Identical sentences score 1.0 (row 2)

Does not capture semantic meaning well: the paraphrase in row 0 scores 0.0
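
The 0.4 in row 1 can be reproduced by hand. The sketch below assumes that the cleaned sentences are "love machine learning" and "enjoy studying machine learning"; the `S1_clean` and `S2_clean` columns are not printed in the notebook, so these strings are an assumption about what `preprocess` returns.

```python
# Assumed cleaned forms of row 1 (not shown in the notebook output).
s1 = "love machine learning"
s2 = "enjoy studying machine learning"

set1, set2 = set(s1.split()), set(s2.split())
shared = set1 & set2          # words in both sentences
combined = set1 | set2        # all distinct words across the pair

score = len(shared) / len(combined)
print(sorted(shared))         # ['learning', 'machine']
print(len(combined))          # 5
print(score)                  # 0.4
```

Two shared words out of five distinct words gives 2 / 5 = 0.4, matching the table.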

# **STEP 8 — WordNet-based Semantic Similarity**

In [None]:
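def wordnet_similarity(sent1, sent2):
    words1 = sent1.split()
    words2 = sent2.split()
    scores = []

    for w1 in words1:
        syn1 = wordnet.synsets(w1)        # look up once per word of sentence 1
        for w2 in words2:
            syn2 = wordnet.synsets(w2)
            if syn1 and syn2:
                # Wu-Palmer similarity between the first listed sense of each word
                sim = syn1[0].wup_similarity(syn2[0])
                if sim:                   # skip pairs with no score
                    scores.append(sim)

    # Mean over every scored word pair, including pairs of unrelated words
    return np.mean(scores) if scores else 0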

In [None]:
df["WordNet_Similarity"] = df.apply(
    lambda x: wordnet_similarity(x["S1_clean"], x["S2_clean"]), axis=1
)

df[["Sentence1", "Sentence2", "WordNet_Similarity"]].head(10)

| | Sentence1 | Sentence2 | WordNet_Similarity |
|---|---|---|---|
| 0 | The doctor is treating a patient | The physician is caring for a sick person | 0.352114 |
| 1 | I love machine learning | I enjoy studying machine learning | 0.355015 |
| 2 | The cat sat on the mat | The cat sat on the mat | 0.459609 |
| 3 | She is reading a book | She is watching television | 0.261111 |
| 4 | The sky is blue | Bananas are yellow | 0.338685 |
| 5 | He drives a car | He operates an automobile | 0.345651 |
| 6 | Football is a popular sport | Soccer is loved worldwide | 0.40181 |
| 7 | I am happy today | I feel joyful today | 0.396296 |
| 8 | This is a pen | That is a pencil | 0.888889 |
| 9 | The sun is bright | The sun shines brightly | 0.361742 |


**Discussion**

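Captures meaning-based similarity between individual words

Words like doctor and physician become similar, so row 0 scores 0.352 where cosine and Jaccard give 0.0

Because the function averages over every word pair, identical sentences score well below 1 (0.460 for row 2) and the unrelated pair in row 4 (0.339) lands close to the paraphrases

The scores are therefore best compared with each other rather than read on an absolute 0 to 1 scale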
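
One possible way to reduce the dilution caused by averaging every word pair is to keep only each word's best match. The sketch below is an assumption-laden illustration, not the method used in this lab: `best_match_similarity`, `all_pairs_average`, and `toy_sim` are hypothetical names, and `toy_sim` is a made-up stand-in for a WordNet score so that the example runs without the WordNet data.

```python
def all_pairs_average(words1, words2, word_sim):
    """Mean similarity over every word pair (the aggregation used in Step 8)."""
    scores = [word_sim(a, b) for a in words1 for b in words2]
    return sum(scores) / len(scores) if scores else 0.0

def best_match_similarity(words1, words2, word_sim):
    """Mean, over words1, of each word's best score against words2."""
    if not words1 or not words2:
        return 0.0
    best = [max(word_sim(a, b) for b in words2) for a in words1]
    return sum(best) / len(best)

def toy_sim(a, b):
    # Hypothetical word similarity: 1.0 for the same word, 0.2 otherwise.
    return 1.0 if a == b else 0.2

same = ["cat", "sat", "mat"]
print(round(all_pairs_average(same, same, toy_sim), 3))
print(best_match_similarity(same, same, toy_sim))
```

With the toy scores, identical sentences reach 1.0 under best-match averaging but stay below 0.5 under the all-pairs mean, which mirrors the pattern seen in row 2. Note that best-match averaging as written is directional: swapping the two sentences can change the result.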

# **STEP 9 — Compare All Three Methods**

**Cosine similarity** works well when important words overlap.

**Jaccard similarity** depends on exact word matching.

**WordNet similarity** captures meaning using semantic relationships.

WordNet is the only method here that gives a paraphrase without shared words a nonzero score (0.352 for row 0, where cosine and Jaccard give 0.0).

Jaccard fails when synonyms are used.

Cosine balances word frequency and importance.

Scores differ when the vocabulary changes but the meaning stays the same.

Overall, **semantic similarity** captures meaning better, although the all-pairs average used in Step 8 also gives unrelated pairs scores around 0.3.
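
A quick way to see how differently the three measures spread their scores is to compare their ranges on the five rows shown in the tables above. This sketch copies those printed values into plain lists; the variable names are hypothetical and only rows 0 to 4 are included.

```python
# Scores for rows 0-4, copied from the output tables in Steps 6-8.
scores = {
    "Cosine":  [0.0, 0.526076, 1.0, 0.0, 0.0],
    "Jaccard": [0.0, 0.4, 1.0, 0.0, 0.0],
    "WordNet": [0.352114, 0.355015, 0.459609, 0.261111, 0.338685],
}

# Range (max - min) of each method over the five rows.
spread = {name: max(vals) - min(vals) for name, vals in scores.items()}

for name, vals in scores.items():
    print(f"{name:8s} min={min(vals):.3f} max={max(vals):.3f} range={spread[name]:.3f}")
```

Cosine and Jaccard use the full 0 to 1 range on these rows, while the WordNet scores all fall within about 0.2 of each other, so a threshold that works for one method will not transfer to another.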

# **STEP 10 — Write Lab Report Section**

**Objective**

To understand and implement different text similarity techniques in NLP.

**Dataset Description**

30 manually created sentence pairs including identical, paraphrased, and unrelated texts.

**Preprocessing**

Lowercasing, punctuation and digit removal, tokenization, stopword removal, lemmatization.

**Results**

Cosine similarity: lexical similarity using TF-IDF

Jaccard similarity: overlap-based similarity

WordNet similarity: semantic similarity

**Conclusion**

Cosine similarity is efficient and widely used.
Jaccard is simple but limited.
WordNet captures meaning better but is slower, and its scores depend on how word-level similarities are aggregated.
Choosing the right method depends on the task.

# **Answers to Questions**

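**1ANS:** Text similarity measures how alike two texts are, in wording or in meaning.

**2ANS:** Lexical similarity compares the words themselves; semantic similarity compares meaning.

**3ANS:** Cosine similarity handles high-dimensional sparse data well because it depends on the angle between vectors, not on their length.

**4ANS:** Jaccard fails with synonyms because it counts only exact word matches.

**5ANS:** WordNet uses semantic relationships between words.

**6ANS:** Preprocessing improves accuracy by removing case, punctuation, and stopword noise before comparison.

**7ANS:** Plagiarism detection and search engines.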