**Steps:**

**Step-1: Import Required Libraries**

In [3]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import wordnet as wn


In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

**Why these libraries?**

**1. nltk** → tokenization, stopwords, WordNet

**2. scikit-learn** → TF-IDF and cosine similarity

**3. numpy/pandas** → data handling

**4. re** → text cleaning using regex

**Step-2: Prepare Dataset**

In [4]:
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Artificial intelligence helps machines learn",
    "Deep learning is part of machine learning",
    "Python is widely used for data science",
    "Data science involves statistics and programming",
    "Students learn programming using Python",
    "Doctors diagnose diseases using medical tests",
    "Physicians treat patients in hospitals",
    "Hospitals provide medical care",
    "Online education is growing rapidly",
    "E-learning platforms help students",
    "Teachers use digital tools for teaching",
    "Cloud computing provides scalable resources",
    "AWS is a popular cloud platform",
    "Azure and Google Cloud are cloud services",
    "Smartphones are used for communication",
    "Mobile phones support internet access",
    "iPhones are popular smartphones",
    "Android phones dominate the market",
    "Cybersecurity protects computer systems",
    "Network security prevents cyber attacks",
    "Encryption secures sensitive data",
    "Big data deals with large datasets",
    "Hadoop processes big data efficiently",
    "Databases store structured information",
    "SQL is used to query databases",
    "Artificial neural networks mimic the brain",
    "AI applications include chatbots",
    "Chatbots use natural language processing",
    "NLP enables machines to understand language"
]


In [5]:
df = pd.DataFrame({"Text": documents})
df.head()


| | Text |
|---|---|
| 0 | Machine learning is a subset of artificial int... |
| 1 | Artificial intelligence helps machines learn |
| 2 | Deep learning is part of machine learning |
| 3 | Python is widely used for data science |
| 4 | Data science involves statistics and programming |


**Dataset Explanation**

This dataset contains 30 short sentences covering artificial intelligence, data science, healthcare, education, cloud computing, mobile technology, cybersecurity, big data and databases, and natural language processing. The sentences are designed to have partial lexical overlap as well as semantic relationships (for example, "doctors" and "physicians"), which makes it possible to compare how similarity measures behave for exact word matching versus meaning-based similarity. The dataset is suitable for testing cosine, Jaccard, and WordNet-based similarity techniques.

**Step-3: Preprocess Text**

In [7]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

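def preprocess_text(text):
    # Lowercase so that "Machine" and "machine" match
    text = text.lower()
    # Delete everything except letters and whitespace (this also drops digits and fuses hyphenated words)
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    # Drop English stopwords such as "is" and "of"
    tokens = [t for t in tokens if t not in stop_words]
    # Reduce each token to its noun lemma (the lemmatizer's default part of speech)
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(tokens)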


In [9]:
nltk.download('punkt_tab')
df["Cleaned_Text"] = df["Text"].apply(preprocess_text)
df.head()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


| | Text | Cleaned_Text |
|---|---|---|
| 0 | Machine learning is a subset of artificial int... | machine learning subset artificial intelligence |
| 1 | Artificial intelligence helps machines learn | artificial intelligence help machine learn |
| 2 | Deep learning is part of machine learning | deep learning part machine learning |
| 3 | Python is widely used for data science | python widely used data science |
| 4 | Data science involves statistics and programming | data science involves statistic programming |


**Explanation**

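1. Lowercasing makes comparison case-insensitive

2. Removing punctuation (and, with this regex, digits) avoids noise

3. Stopword removal drops high-frequency function words that carry little meaning

4. Lemmatization reduces words to a base form; with the default noun setting, "machines" becomes "machine" while verb forms such as "used" and "involves" are left unchanged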
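
One side effect of the punctuation step is easier to see in isolation: the regex deletes characters rather than replacing them with spaces, so hyphenated words are fused into one token. The following is a minimal standard-library sketch; the helper name `strip_non_letters` is illustrative and not part of the notebook.

```python
import re

def strip_non_letters(text):
    """Lowercase the text and delete every character except a-z and whitespace,
    mirroring the first two statements of preprocess_text."""
    return re.sub(r'[^a-z\s]', '', text.lower())

# The hyphen is deleted, so "E-learning" becomes the single token "elearning".
print(strip_non_letters("E-learning platforms help students"))
```

As a result, "elearning" will not match "learning" in the machine learning sentences. If that match is wanted, replacing the deleted characters with a space would be one option, at the cost of changing the scores shown in the later steps.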

**Step-4: Text Representation (TF-IDF)**

In [10]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["Cleaned_Text"])


**Why TF-IDF?**

TF-IDF is chosen because it down-weights words that occur in many documents and highlights terms that are distinctive for a document. This gives better discrimination between documents than raw Bag-of-Words counts, especially when documents share common vocabulary. TF-IDF vectors pair naturally with cosine similarity for text comparison.
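
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on a toy corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 with raw term counts and L2 normalization, and it simplifies tokenization to whitespace splitting. The function name `tfidf_vectors` and the three-sentence corpus are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

toy_docs = [
    "machine learning subset artificial intelligence",
    "artificial intelligence help machine learn",
    "python widely used data science",
]
vecs = tfidf_vectors(toy_docs)
# "machine" occurs in two documents and "subset" in one, so "subset" gets the larger weight.
print(vecs[0]["machine"] < vecs[0]["subset"])
```

The comparison prints `True`: a term confined to one document ends up with a higher weight than a term shared across documents, which is the discrimination effect described above.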

**Step-5: Cosine Similarity**

In [11]:
cosine_sim = cosine_similarity(tfidf_matrix)


In [12]:
cosine_sim[:5, :5]


array([[1.        , 0.54540316, 0.46066263, 0.        , 0.        ],
       [0.54540316, 1.        , 0.12498149, 0.        , 0.        ],
       [0.46066263, 0.12498149, 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        , 0.33031037],
       [0.        , 0.        , 0.        , 0.33031037, 1.        ]])

**Interpretation**

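1. Sentences 0-2 (AI and machine learning) score highest against each other: 0.55, 0.46, and 0.12 in the slice above

2. Sentences 3 and 4 share the terms "data" and "science" and score 0.33; it is this shared vocabulary, not the mention of Python, that links them

3. Medical sentences are related only where they share a term such as "medical" or "hospital"; the doctor and physician sentences share no term, so their TF-IDF cosine is 0

4. Higher score → more weighted vocabulary in common, which usually, but not always, indicates related meaning

5. Zero → no shared terms (the AI sentences and the data science sentences in the slice above)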
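
The score itself is the dot product of two vectors divided by the product of their lengths. The sketch below computes it on sparse `{term: weight}` vectors. To stay short it uses raw word counts rather than TF-IDF weights, so its values differ from the matrix above. The function name `cosine` is illustrative.

```python
import math
from collections import Counter

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

a = Counter("python widely used data science".split())
b = Counter("data science involves statistic programming".split())
c = Counter("machine learning subset artificial intelligence".split())

print(round(cosine(a, b), 2))  # 0.4: two shared terms, five terms in each sentence
print(cosine(a, c))            # 0.0: no shared terms
```

Because every shared term contributes to the dot product and nothing else does, two sentences with no words in common always score exactly 0, regardless of how close they are in meaning.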

**Step-6: Jaccard Similarity**

In [13]:
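def jaccard_similarity(doc1, doc2):
    set1 = set(doc1.split())
    set2 = set(doc2.split())
    union = set1 | set2
    # Two empty documents have an empty union; return 0.0 instead of dividing by zero
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)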


In [14]:
jaccard_similarity(df["Cleaned_Text"][0], df["Cleaned_Text"][1])


0.42857142857142855

**Interpretation**

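1. Depends entirely on exact word overlap

2. Scores synonyms as mismatches ("doctor" and "physician" count as different words)

3. Short sentences often get low scores, because a single non-matching word changes the ratio noticeably

4. Useful for keyword matching tasks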
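
The synonym problem can be shown with the medical sentences. The strings below are hand-typed approximations of what `preprocess_text` would produce for those sentences (an assumption, since the notebook only prints the first five rows), and `jaccard` is a compact stand-in for the function above.

```python
def jaccard(doc1, doc2):
    """Jaccard similarity over the sets of whitespace-separated tokens."""
    set1, set2 = set(doc1.split()), set(doc2.split())
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

# Assumed cleaned forms of three healthcare sentences from the dataset.
doctor_sentence = "doctor diagnose disease using medical test"
physician_sentence = "physician treat patient hospital"
hospital_sentence = "hospital provide medical care"

print(jaccard(doctor_sentence, physician_sentence))             # 0.0: no shared words
print(round(jaccard(physician_sentence, hospital_sentence), 3)) # 0.143: one shared word out of seven
```

The first pair describes closely related activities but scores 0 because "doctor" and "physician" are different strings. The second pair is linked only by the single word "hospital".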

**Step-7: WordNet-based Semantic Similarity**

In [15]:
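def wordnet_similarity(word1, word2):
    # Look up every WordNet synset (sense) for each word
    syn1 = wn.synsets(word1)
    syn2 = wn.synsets(word2)
    if syn1 and syn2:
        # Compare only the first listed sense of each word using Wu-Palmer similarity
        return syn1[0].wup_similarity(syn2[0])
    # At least one word is missing from WordNet
    return None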


In [16]:
wordnet_similarity("doctor", "physician")


1.0

In [17]:
pairs = [
    ("doctor", "physician"),
    ("student", "learner"),
    ("phone", "smartphone"),
    ("teacher", "educator"),
    ("hospital", "clinic")
]

for w1, w2 in pairs:
    print(w1, w2, wordnet_similarity(w1, w2))


doctor physician 1.0
student learner 0.7058823529411765
phone smartphone None
teacher educator 0.9523809523809523
hospital clinic 0.11764705882352941


**Step-8: Comparison of Methods**

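TF-IDF cosine similarity and Jaccard similarity are both lexical measures: they score two sentences above zero only when the sentences share words. Cosine weights each shared word by its TF-IDF value, so rare, informative terms count for more, whereas Jaccard treats every word equally; for sentences 0 and 1 this gives 0.55 versus 0.43. Both return 0 when documents share meaning but not vocabulary, as with the "doctor" and "physician" sentences. Jaccard is suitable for keyword-based retrieval. WordNet similarity captures that relationship at the word level ("doctor" and "physician" score 1.0), but as implemented here it compares single words using only their first sense, returns None for words missing from WordNet ("smartphone"), and can score related words low ("hospital" and "clinic", 0.12). It therefore works for concept-level comparison but not, without further aggregation, for full sentences. Of the three, TF-IDF cosine similarity offers the best overall balance for short texts, while semantic methods remain necessary for capturing meaning beyond shared words.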
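
One common way to lift word-level scores to full sentences is to match each word with its best-scoring counterpart in the other sentence and average the results in both directions. The sketch below is one such aggregation, not the notebook's method. It takes the word scorer as a parameter, so `wordnet_similarity` from Step-7 could be passed in. A small hand-made scorer, `toy_word_sim` (hypothetical, with the doctor/physician value taken from the Step-7 output), stands in here so the example runs without NLTK.

```python
def sentence_similarity(sent1, sent2, word_sim):
    """Average best-match word similarity, symmetrised over both directions.

    word_sim(w1, w2) may return None when no score is available; that counts as 0.
    """
    def directed(src, dst):
        if not src or not dst:
            return 0.0
        total = 0.0
        for w1 in src:
            total += max((word_sim(w1, w2) or 0.0) for w2 in dst)
        return total / len(src)

    tokens1, tokens2 = sent1.split(), sent2.split()
    return (directed(tokens1, tokens2) + directed(tokens2, tokens1)) / 2

# Stand-in scorer: identical words score 1.0; one synonym pair uses the Step-7 value.
toy_scores = {frozenset(["doctor", "physician"]): 1.0}

def toy_word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return toy_scores.get(frozenset([w1, w2]))

print(sentence_similarity("doctor treat patient", "physician treat patient", toy_word_sim))
```

With this scorer the two sentences score 1.0, whereas Jaccard on the same pair would give 2/4 = 0.5, because the synonym pair is credited as a match instead of a mismatch.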