

# Lab 7: Text Similarity in NLP    

## Learning Objectives
By the end of this lab, students will be able to:
- Understand what text similarity means in NLP  
- Compute similarity using:
  - Cosine Similarity  
  - Jaccard Similarity  
  - WordNet-based Semantic Similarity  
- Interpret similarity scores  
- Compare lexical vs semantic similarity  

## Lab Outcomes
After completing this lab, students will be able to:
- Explain the purpose of text similarity in NLP tasks  
- Perform preprocessing for similarity tasks  
- Represent text using Bag-of-Words / TF-IDF  
- Implement cosine and jaccard similarity in Python  
- Use WordNet to compute semantic similarity  
- Analyze which similarity measure works better  
- Prepare a short analysis report  



## 1. What is Text Similarity in NLP?
Text similarity measures how close two pieces of text are in terms of meaning or word usage.
It is widely used in search engines, recommendation systems, plagiarism detection, and chatbots.

## 2. Lexical vs Semantic Similarity
- **Lexical similarity** depends on shared surface words (cosine over Bag-of-Words or TF-IDF vectors, Jaccard over token sets).
- **Semantic similarity** considers meaning, even if the words differ (WordNet).

## 3. Why Cosine Similarity?
Cosine similarity measures the angle between two document vectors and ignores their magnitude.
It pairs well with TF-IDF and is insensitive to document length, because scaling a vector does not change its direction.
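
To make the "angle, not magnitude" point concrete, here is a minimal pure-Python sketch. The helper `cosine` and the three count vectors are hypothetical illustrations and are not part of the lab code, which uses scikit-learn instead.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over the vocabulary [team, won, match]
short_doc = [1, 1, 0]
long_doc = [3, 3, 0]   # same word proportions, three times longer
other_doc = [0, 0, 2]  # shares no terms with short_doc

print(round(cosine(short_doc, long_doc), 3))   # same direction
print(round(cosine(short_doc, other_doc), 3))  # orthogonal vectors
```

The first pair points in the same direction, so the score is at its maximum even though one document is three times longer; the second pair shares no terms, so the score is zero.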

## 4. When Jaccard Fails
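Jaccard gives misleading scores when:
- Synonyms are used (different tokens count as a mismatch even when the meaning is the same)
- Sentences are short (one differing word changes the score sharply)
- Word order or meaning matters (token sets ignore order and frequency)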
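
The three failure cases above can be sketched with a few hand-written sentences. The helper `jaccard` and the example sentences are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity of two whitespace-tokenized strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0  # both texts empty; convention chosen for this sketch
    return len(set_a & set_b) / len(union)

# Synonyms: one word swapped for a synonym, meaning unchanged
synonym_score = jaccard("doctor helps patient", "physician helps patient")

# Short sentences: the same swap costs much more when there are only two words
short_score = jaccard("doctor arrived", "physician arrived")

# Word order: opposite meanings, identical token sets
order_score = jaccard("dog bites man", "man bites dog")

print(f"synonyms: {synonym_score:.3f}")
print(f"short:    {short_score:.3f}")
print(f"order:    {order_score:.3f}")
```

The word-order pair gets a perfect score despite describing different events, while the synonym pairs are penalized even though their meaning is unchanged.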

## 5. How WordNet Improves Similarity
WordNet groups words into synonym sets (synsets) and links the synsets through relations such as hypernymy (is-a).
This lets it relate words that share no characters, such as doctor and physician.

## 6. Effect of Preprocessing
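Preprocessing (lowercasing, removing punctuation and stopwords, lemmatization) reduces noise so that surface variants of the same word count as matches.
Poor preprocessing leads to misleading similarity scores: without it, `Match` and `match!` are treated as different tokens.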
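
A small sketch of this effect, assuming a simplified `normalize` step (lowercasing and stripping non-letters only) and a hypothetical `jaccard` helper; the two sentences are made up for the illustration.

```python
import re

def jaccard(a, b):
    """Jaccard similarity of two whitespace-tokenized strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def normalize(text):
    """Lowercase and strip everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

raw_a = "The Team WON the match!"
raw_b = "the team won the match"

before = jaccard(raw_a, raw_b)                       # raw tokens
after = jaccard(normalize(raw_a), normalize(raw_b))  # normalized tokens

print(f"before preprocessing: {before:.3f}")
print(f"after preprocessing:  {after:.3f}")
```

Before normalization only the token `the` matches; afterwards the two sentences have identical token sets.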

## 7. Applications of Text Similarity
- Search engines
- Recommendation systems
- Plagiarism detection
- Chatbots



## STEP 2 — Import Required Libraries
We use:
- **nltk**: preprocessing, stopwords, WordNet
- **sklearn**: TF-IDF, cosine similarity
- **string & re**: text cleaning


In [None]:

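# General-purpose imports
import nltk
import string
import re
import itertools
import numpy as np

# NLTK: stopword list, WordNet interface, tokenizer, lemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# scikit-learn: TF-IDF vectors and pairwise cosine similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One-time downloads of the NLTK resources used below.
# Note: recent NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')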


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True


## STEP 3 — Dataset Description
We created a dataset of 24 short documents across 4 topics:
- Sports
- Politics
- Health
- Technology

Each topic contains 6 documents.
The dataset is manually created to control topic relevance and vocabulary.


In [None]:

documents = [
    # Sports
    "The football team won the championship",
    "Cricket players trained hard for the match",
    "The athlete broke the world record",
    "Basketball requires teamwork and speed",
    "The coach planned a new strategy",
    "The match was intense and competitive",

    # Politics
    "The government passed a new law",
    "Elections determine the political future",
    "The president addressed the nation",
    "Parliament debated the new policy",
    "The minister announced reforms",
    "Democracy depends on citizen participation",

    # Health
    "Doctors recommend regular exercise",
    "A healthy diet improves immunity",
    "The patient recovered from illness",
    "Mental health awareness is important",
    "Vaccines prevent serious diseases",
    "Hospitals provide emergency care",

    # Technology
    "Artificial intelligence is transforming industries",
    "The smartphone uses advanced technology",
    "Cybersecurity protects digital data",
    "Machine learning improves predictions",
    "Cloud computing enables scalability",
    "Software development requires logical thinking"
]

print("Sample documents:")
for doc in documents[:5]:
    print("-", doc)


Sample documents:
- The football team won the championship
- Cricket players trained hard for the match
- The athlete broke the world record
- Basketball requires teamwork and speed
- The coach planned a new strategy



## STEP 4 — Text Preprocessing
Steps (in the order the code applies them):
1. Lowercasing
2. Removing punctuation and numbers
3. Tokenization
4. Stopword removal
5. Lemmatization


In [None]:

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

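def preprocess(text):
    # Lowercase, then drop everything except letters and whitespace
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize; lemmatize() treats each word as a noun
    # by default, so verb forms such as "trained" are left unchanged
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)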

processed_docs = [preprocess(doc) for doc in documents]

print(processed_docs[:5])


['football team championship', 'cricket player trained hard match', 'athlete broke world record', 'basketball requires teamwork speed', 'coach planned new strategy']



## STEP 5 — Text Representation (TF-IDF)
TF-IDF is chosen because:
- It down-weights words that appear in many documents
- It highlights terms that are distinctive for a document
- It works well with cosine similarity
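
The weighting idea can be sketched in pure Python. The function `tfidf` and the three sample strings are hypothetical; the smoothed IDF formula used here is a common variant and is only assumed to resemble what `TfidfVectorizer` does by default, so check the library documentation before relying on exact values.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# "new" occurs in every document, so it should get the lowest weight
docs = ["new law passed", "new policy debated", "new strategy planned"]
for vec in tfidf(docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

In each printed vector the weight of `new` is lower than the weights of the words that occur in only one document.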


In [None]:

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_docs)



## STEP 6 — Cosine Similarity
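A higher score means the two documents share more of their TF-IDF-weighted vocabulary; it does not by itself mean closer meaning.
A score of 0 means no shared terms. The consecutive pairs printed below share no terms after preprocessing, so each scores 0.000.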


In [None]:

cosine_sim = cosine_similarity(tfidf_matrix)

for i in range(5):
    print(f"Similarity between doc {i} and doc {i+1}: {cosine_sim[i][i+1]:.3f}")


Similarity between doc 0 and doc 1: 0.000
Similarity between doc 1 and doc 2: 0.000
Similarity between doc 2 and doc 3: 0.000
Similarity between doc 3 and doc 4: 0.000
Similarity between doc 4 and doc 5: 0.000
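
Because the consecutive pairs above all score 0.000, it can be more informative to rank every pair. The sketch below uses a hypothetical helper `top_pairs` and a made-up 3x3 matrix standing in for `cosine_sim`.

```python
def top_pairs(sim, k=5):
    """Return the k highest-scoring (score, i, j) triples with i < j."""
    n = len(sim)
    pairs = [
        (float(sim[i][j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only, skips the diagonal
    ]
    return sorted(pairs, reverse=True)[:k]

# Hypothetical similarity matrix (symmetric, ones on the diagonal)
toy_sim = [
    [1.0, 0.0, 0.4],
    [0.0, 1.0, 0.1],
    [0.4, 0.1, 1.0],
]

for score, i, j in top_pairs(toy_sim, k=2):
    print(f"doc {i} and doc {j}: {score:.3f}")
```

In the notebook, `top_pairs(cosine_sim)` should work unchanged, since a 2-D NumPy array supports the same indexing. Non-zero scores are expected for pairs that share a term after preprocessing; documents 1 and 5, for instance, both contain `match`.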



## STEP 7 — Jaccard Similarity
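Jaccard similarity is the number of tokens two documents share divided by the number of distinct tokens in both: |A ∩ B| / |A ∪ B|.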


In [None]:

def jaccard_similarity(doc1, doc2):
    set1, set2 = set(doc1.split()), set(doc2.split())
    union = set1 | set2
    # Guard against division by zero when both documents are empty
    return len(set1 & set2) / len(union) if union else 0.0

for i in range(5):
    print(f"Jaccard(doc {i}, doc {i+1}) = {jaccard_similarity(processed_docs[i], processed_docs[i+1]):.3f}")


Jaccard(doc 0, doc 1) = 0.000
Jaccard(doc 1, doc 2) = 0.000
Jaccard(doc 2, doc 3) = 0.000
Jaccard(doc 3, doc 4) = 0.000
Jaccard(doc 4, doc 5) = 0.000



## STEP 8 — WordNet Semantic Similarity
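We use Wu-Palmer similarity to capture semantic meaning.
It scores two synsets by how deep their most specific common ancestor sits in the WordNet hierarchy, relative to the depths of the synsets themselves; a score of 1 means the two synsets are the same.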
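
As a rough illustration of the Wu-Palmer formula, 2 * depth(LCS) / (depth(a) + depth(b)), here is a sketch on a tiny hand-made taxonomy. The `PARENT` table and the helpers `path_to_root`, `depth`, and `wup` are hypothetical and far simpler than WordNet, where a synset can have several hypernym paths.

```python
# Hypothetical mini-taxonomy: child -> parent (the root has parent None)
PARENT = {
    "entity": None,
    "person": "entity",
    "professional": "person",
    "doctor": "professional",
    "lawyer": "professional",
    "athlete": "person",
    "artifact": "entity",
    "software": "artifact",
}

def path_to_root(node):
    """Nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity on the toy taxonomy."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common one
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

for pair in [("doctor", "doctor"), ("doctor", "lawyer"),
             ("doctor", "athlete"), ("doctor", "software")]:
    print(pair, round(wup(*pair), 3))
```

Pairs whose common ancestor is deep in the tree (doctor and lawyer) score higher than pairs that only meet at the root (doctor and software).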


In [None]:

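def wordnet_similarity(w1, w2):
    syns1 = wordnet.synsets(w1)
    syns2 = wordnet.synsets(w2)
    # Words missing from WordNet get a score of 0
    if not syns1 or not syns2:
        return 0
    # Only the first synset of each word is compared, so a word with several
    # senses may be scored on a sense unrelated to the intended one
    return syns1[0].wup_similarity(syns2[0]) or 0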

pairs = [("doctor", "physician"), ("football", "cricket"), ("law", "policy"),
         ("hospital", "clinic"), ("software", "program"), ("government", "state"),
         ("athlete", "player"), ("disease", "illness"), ("technology", "innovation"),
         ("election", "vote")]

for w1, w2 in pairs:
    print(f"{w1} - {w2}: {wordnet_similarity(w1, w2):.3f}")


doctor - physician: 1.000
football - cricket: 0.091
law - policy: 0.286
hospital - clinic: 0.118
software - program: 0.267
government - state: 0.133
athlete - player: 0.667
disease - illness: 0.947
technology - innovation: 0.125
election - vote: 0.625



## STEP 9 — Comparison of Methods
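- Cosine similarity over TF-IDF works best for longer texts with overlapping vocabulary; the five consecutive pairs of short documents shown above all score 0.000
- Jaccard depends heavily on exact word overlap and returns 0.000 for the same pairs
- WordNet captures meaning that lexical methods miss: doctor and physician score 1.000 even though they share no token
- Scores disagree when synonyms are used, and WordNet scores can also be low for related words (football and cricket score 0.091), likely because only the first sense of each word is compared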



## STEP 10 — Lab Report Conclusion
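This lab compared three similarity techniques on a small, manually created dataset.
Lexical methods (cosine over TF-IDF, Jaccard) are simple but return 0 whenever two texts share no words.
WordNet-based similarity can relate words that differ in form, although its scores depend on which word sense is compared.
Choosing the right method depends on text length and task.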
