# Punctuation and Stopwords
Punctuation marks are symbols that give sentences their structure. They are usually removed to reduce noise, unless the task needs them (for example, sentiment or syntax analysis).

example - . , ! ? : ; " ' ( ) - ...

Stopwords are high-frequency function words like "the", "is", "and" that often carry little semantic meaning and are removed to reduce dimensionality in NLP tasks.

example - the, is, am, are, was, were, in, on, at, and, of, to


# Morphology VS Syntax VS Semantics

Morphology studies the internal structure of words — how words are formed from smaller meaningful units called morphemes.
A morpheme is the smallest unit of meaning.

Syntax studies how words combine to form grammatically correct sentences.
It focuses on sentence structure.

Semantics studies the meaning of words, phrases, and sentences independent of context.

Example - "The dogs were running."

Morphology :
    
    dogs → dog + s
    running → run + ing

Syntax :

     dogs = subject
     were running = verb phrase

Semantics :

    An agent → dogs
    An event → running
    A time frame → past ongoing
    Meaning: There exist dogs who were in the state of running in the past.



# BOW 

Bag of Words (BoW) is a simple text representation method that ignores grammar and word order and only counts how often each word occurs.

It converts text → numeric vector.
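
As a minimal sketch of that conversion (the function name `bag_of_words` and the whitespace tokenization are assumptions made for illustration, not a standard API), each document becomes a vector of raw counts over a shared, sorted vocabulary:

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors); each vector holds raw word counts."""
    # Naive tokenization: lowercase, then split on whitespace (a simplification).
    tokenized = [doc.lower().split() for doc in documents]
    # A sorted vocabulary gives every word a fixed column index.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat on the mat", "the dog sat"]
vocab, vecs = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vecs[0])  # [1, 0, 1, 1, 1, 2]
print(vecs[1])  # [0, 1, 0, 0, 1, 1]
```

Note that word order is lost: "the dog sat" and "sat the dog" would map to the same vector.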

# Reasons not to use BOW are -
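1. No importance weighting: common words dominate the counts.
2. Ignores corpus-level information: BoW only counts words inside a single document, so it cannot tell whether a word is rare or common across the whole collection.
3. No context awareness: word order and surrounding words are discarded.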

# TF-IDF

TF-IDF improves BoW by giving high weight to important words and giving low weight to common words.

TF (Term Frequency) measures how frequently a term appears in a document.

    TF(t,d) = (count of term t in document d) / (total number of terms in document d)

IDF (Inverse Document Frequency) measures how rare a word is across all documents.

    IDF(t) = log( N / df(t) )

    Where:
    N = total number of documents
    df(t) = number of documents containing term t
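
The two formulas above can be combined in a few lines. This is only a sketch under stated assumptions: the function name `tf_idf` is illustrative, tokenization is a plain whitespace split, and no smoothing is applied (library implementations often use a smoothed IDF variant, so their numbers may differ).

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using TF(t,d) * IDF(t)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df(t): number of documents containing term t at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "the" appears in every document, so its IDF is log(3/3) = 0 and its weight vanishes, while "sat" (one document) outweighs "cat" (two documents).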


# Reasons not to use TF-IDF are - 
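1. No context awareness (major limitation): TF-IDF treats each word independently.
2. Extremely sparse and high-dimensional: each vector has one dimension per vocabulary word, and the vocabulary can easily grow very large.
3. Cannot handle unseen words well: a word that is absent from the vocabulary has no dimension in the vector.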



# n-gram
An n-gram is a contiguous sequence of n tokens from a text, used to capture local word order information in classical NLP models.

example - "I love natural language processing"

for n = 1 (unigram)
    ["I", "love", "natural", "language", "processing"]

for n = 2 (bigram)
    ["I love", "love natural", "natural language", "language processing"]

Larger values of n follow the same pattern.

# Reasons not to use n-gram are 
1. Feature explosion: with a vocabulary of 50,000 words there are 50,000 × 50,000 = 2.5 billion possible bigrams, which creates a huge memory problem.
2. Many n-grams appear rarely → unreliable statistics.
3. Still no long-range context awareness.
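
The first two points can be checked with a little arithmetic. The sketch below uses a made-up toy sentence and a hypothetical helper `possible_ngrams`; it is meant only to illustrate the scale of the problem.

```python
from collections import Counter

def possible_ngrams(vocab_size, n):
    """Upper bound on distinct n-grams over a vocabulary: vocab_size ** n."""
    return vocab_size ** n

print(possible_ngrams(50_000, 2))  # 2500000000

# Toy text (invented for illustration): most observed bigrams occur only once.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = Counter(zip(tokens, tokens[1:]))
singletons = [bg for bg, count in bigrams.items() if count == 1]
print(len(bigrams), len(singletons))  # 8 7
```

Even in this tiny example, 7 of the 8 distinct bigrams are seen exactly once, which is why counts for rare n-grams are unreliable.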

# Semantic Similarity 

Semantic similarity measures how similar two pieces of text are in meaning — not just in words.

Common computational approaches in NLP are:

1. Cosine Similarity + TF-IDF (Classical Approach)

       Step 1: Convert sentences into TF-IDF vectors
       Step 2: Compute cosine similarity

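       Cosine Similarity Formula

             Cosine(A, B) = (A · B) / (||A|| * ||B||)

       Where:
             A · B = dot product
             ||A|| = vector magnitude (Euclidean norm)

       Range:
              1 → identical direction
              0 → no shared terms (orthogonal vectors)
             -1 → opposite direction (cannot occur with TF-IDF, because its weights are never negative)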


2. Embedding-Based Approaches - Words are mapped to dense vectors, and cosine similarity is then computed between those vectors. Used with models like Word2Vec, GloVe, and FastText.


3. Transformer-Based Semantic Similarity - First encode the sentence into a dense vector, then compare the vectors using cosine similarity. Used with models like BERT and other Transformer encoders.


4. Sentence-Level Similarity Pipeline (Modern Production)

   > Convert sentence → embedding (768-d vector)

   > Store embeddings in vector database

   > Use cosine similarity or dot product

   > Retrieve top-k similar sentences

   This pipeline is used in semantic search, RAG pipelines, and recommendation systems.
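
The retrieval steps of the sentence-level pipeline can be sketched in plain Python. Everything here is an assumption made for illustration: the 3-d vectors are invented stand-ins for real sentence embeddings, the dictionary `store` stands in for a vector database, and `top_k` is a hypothetical helper that does a brute-force scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k stored sentences most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical embeddings; a real system would get these from an encoder model.
store = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Steps to change a password": [0.8, 0.2, 0.1],
    "Best pizza places nearby": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # stand-in embedding for a password-related query
print(top_k(query, store, k=2))
# ['How do I reset my password?', 'Steps to change a password']
```

A production system would replace the brute-force scan with an approximate nearest-neighbour index, but the ranking logic stays the same.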

# Advanced Interview Insight 

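Cosine similarity works because it measures the angle between vectors and ignores their magnitude. In large-scale systems, however, the dot product is often used for efficiency; when the vectors are L2-normalized, the dot product equals the cosine similarity.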
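
A small sketch makes the relationship between the two measures concrete. The helper names are illustrative, and the vectors are chosen so that they point in the same direction with different lengths.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

a, b = [3.0, 4.0], [6.0, 8.0]  # same direction, different magnitude
print(dot(a, b))                               # 50.0, grows with magnitude
print(cosine(a, b))                            # angle only
print(dot(l2_normalize(a), l2_normalize(b)))   # matches the cosine value
```

Normalizing once at indexing time is what lets a system use the cheaper dot product at query time and still rank by angle.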


# POS Tagging

POS (Part-of-Speech) tagging is the process of assigning a grammatical category to each word in a sentence. It is used in NER, machine translation, and similar tasks.

It answers the question: what role is each word playing in the sentence?
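
To show the input and output shape of a tagger, here is a deliberately naive lookup-based sketch. The lexicon `TOY_LEXICON`, the coarse tag names, and the function `toy_pos_tag` are all hypothetical; real taggers (such as the NLTK `pos_tag` used later in this notebook) also use the surrounding context to resolve ambiguous words.

```python
# Hypothetical mini-lexicon mapping lowercase words to coarse POS tags.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "dogs": "NOUN", "children": "NOUN",
    "were": "AUX", "are": "AUX",
    "running": "VERB",
}

def toy_pos_tag(sentence, default_tag="NOUN"):
    """Tag each whitespace token by lexicon lookup, else use default_tag."""
    tokens = sentence.rstrip(".").split()
    return [(tok, TOY_LEXICON.get(tok.lower(), default_tag)) for tok in tokens]

print(toy_pos_tag("The dogs were running."))
# [('The', 'DET'), ('dogs', 'NOUN'), ('were', 'AUX'), ('running', 'VERB')]
```

A lookup table cannot tell "running" the verb from "running" the noun ("running is fun"), which is exactly the ambiguity that statistical taggers are trained to resolve.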


# Lemmatization

Lemmatization is the process of reducing words to their dictionary base form using vocabulary and morphological analysis, often considering part-of-speech to ensure linguistically correct outputs.

The working steps are:

1. Identify the word
2. Determine its POS (noun, verb, adjective, etc.)
3. Apply morphological rules
4. Return the dictionary base form
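
These steps can be mimicked with a toy lemmatizer. This is a sketch only: `EXCEPTIONS`, `VOCABULARY`, `SUFFIX_RULES`, and `toy_lemmatize` are invented for illustration and cover a handful of words, whereas real lemmatizers rely on a full dictionary such as WordNet.

```python
# Hypothetical resources, far smaller than a real dictionary.
EXCEPTIONS = {("children", "n"): "child", ("better", "a"): "good"}
VOCABULARY = {"child", "dog", "good", "run", "walk"}
SUFFIX_RULES = {
    "n": [("s", "")],
    "v": [("ing", ""), ("ed", ""), ("s", "")],
}

def toy_lemmatize(word, pos):
    """Return the base form of word for the given POS ('n', 'v', 'a')."""
    word = word.lower()
    # Irregular forms are looked up directly.
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    # Otherwise apply the suffix rules that belong to this POS.
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            options = [candidate]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(candidate) > 2 and candidate[-1] == candidate[-2]:
                options.append(candidate[:-1])
            # Accept a candidate only if it is a known dictionary word.
            for option in options:
                if option in VOCABULARY:
                    return option
    return word

print(toy_lemmatize("children", "n"))  # child
print(toy_lemmatize("running", "v"))   # run
print(toy_lemmatize("running", "n"))   # running
```

The last two calls show why the POS step matters: the same surface form is reduced as a verb but left alone as a noun.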

# Stemming 

Stemming is a rule-based process of reducing words to their root form by removing suffixes, without using a dictionary or morphological analysis.

Example - "The children are running faster."

| Word     | Stemming | Lemmatization |
| -------- | -------- | ------------- |
| children | children | child         |
| running  | run      | run           |
| faster   | faster   | fast          |

Conceptual difference - 

| Stemming              | Lemmatization            |
| --------------------- | ------------------------ |
| Rule-based truncation | Linguistic normalization |
| No dictionary         | Uses dictionary          |
| Fast but rough        | Slower but accurate      |
| May produce non-words | Produces valid words     |


# Implementing n-gram model

In [24]:
# n-gram 

def generate_ngrams(text, n):
    tokens = text.lower().split()
    ngrams = []

    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i+n])
        ngrams.append(ngram)

    return ngrams


text = "I love natural language processing"

print("Unigrams:", generate_ngrams(text, 1))
print("Bigrams:", generate_ngrams(text, 2))
print("Trigrams:", generate_ngrams(text, 3))


Unigrams: ['i', 'love', 'natural', 'language', 'processing']
Bigrams: ['i love', 'love natural', 'natural language', 'language processing']
Trigrams: ['i love natural', 'love natural language', 'natural language processing']
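
The cell above only generates n-grams. As a hedged sketch of how they could feed an actual bigram language model, the code below estimates P(w2 | w1) = count(w1 w2) / count(w1) by maximum likelihood, without smoothing. The function names and the three-sentence corpus are made up for illustration.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and how often each word appears as a bigram's first word."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        context_counts.update(tokens[:-1])  # last token never starts a bigram
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts

def bigram_probability(w1, w2, context_counts, bigram_counts):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 if w1 was never a context."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]

corpus = [
    "I love natural language processing",
    "I love machine learning",
    "I enjoy natural language",
]
contexts, bigrams = train_bigram_model(corpus)
print(bigram_probability("i", "love", contexts, bigrams))            # 2 of 3 contexts
print(bigram_probability("natural", "language", contexts, bigrams))  # 1.0
print(bigram_probability("love", "pizza", contexts, bigrams))        # 0.0
```

The zero probability for an unseen bigram is the sparsity problem in practice; smoothing techniques exist to address it but are left out of this sketch.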


# Write a Python function that takes a document as input and returns a dictionary containing the frequency of each word in the document

In [1]:
# simple implementation

import re
from collections import defaultdict

def word_frequency(document: str) -> dict:
    """
    Takes a document string and returns a dictionary
    containing frequency of each word.
    """
    if not document:
        return {}

    # Normalize text: lowercase + remove punctuation
    tokens = re.findall(r'\b\w+\b', document.lower())

    freq = defaultdict(int)
    for word in tokens:
        freq[word] += 1

    return dict(freq)

doc = "NLP is fun. NLP is powerful!"
print(word_frequency(doc))



{'nlp': 2, 'is': 2, 'fun': 1, 'powerful': 1}


In [22]:
# implementation using collections.Counter
import re
from collections import Counter

def word_frequency(document: str) -> dict:
    tokens = re.findall(r'\b\w+\b', document.lower())
    return dict(Counter(tokens))

doc = "NLP is fun. NLP is powerful!"
print(word_frequency(doc))

{'nlp': 2, 'is': 2, 'fun': 1, 'powerful': 1}


# Create a function to clean and tokenize a given text, removing punctuation and converting words to lowercase


In [19]:
import re

def clean_and_tokenize(text: str) -> list:
    """
    Cleans and tokenizes input text by:
    - Converting to lowercase
    - Removing punctuation
    - Splitting into words
    
    Returns:
        List of tokens
    """
    if not isinstance(text, str) or not text.strip():
        return []

    # Convert to lowercase
    text = text.lower()

    # Remove punctuation and extract words
    tokens = re.findall(r'\b[a-z0-9]+\b', text)

    return tokens

text = "Hello, World! NLP is AMAZING!!!"
print(clean_and_tokenize(text))


['hello', 'world', 'nlp', 'is', 'amazing']


# Develop a function to remove stopwords from a given text

In [8]:
import re

DEFAULT_STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were",
    "in", "on", "at", "to", "for", "of", "and",
    "or", "but", "if", "then", "this", "that",
    "it", "as", "with", "by", "from"
}

def remove_stopwords(text: str, stopwords: set = None) -> list:
    """
    Removes stopwords from input text.
    
    Args:
        text (str): Input text
        stopwords (set): Set of stopwords (optional)
        
    Returns:
        List of filtered tokens
    """
    if not isinstance(text, str) or not text.strip():
        return []

    if stopwords is None:
        stopwords = DEFAULT_STOPWORDS

    # Lowercase + tokenize
    tokens = re.findall(r'\b[a-z0-9]+\b', text.lower())

    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stopwords]

    return filtered_tokens

text = "This is a simple example of removing stopwords from text."
print(remove_stopwords(text))


['simple', 'example', 'removing', 'stopwords', 'text']


In [9]:
def remove_stopwords_text(text: str, stopwords: set = None) -> str:
    tokens = remove_stopwords(text, stopwords)
    return " ".join(tokens)

text = "This is a simple example of removing stopwords from text."
print(remove_stopwords_text(text))

simple example removing stopwords text


# Lemmatization using POS Tagging

In [4]:
!pip install nltk
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [8]:

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

# Recent NLTK releases load the tokenizer from 'punkt_tab' instead of 'punkt';
# without it, word_tokenize raises "LookupError: Resource punkt_tab not found".
nltk.download('punkt_tab')
# Assumption: recent releases also expect this name for the English POS tagger.
nltk.download('averaged_perceptron_tagger_eng')

lemmatizer = WordNetLemmatizer()
sentence = "The children are running towards a better place."
tokens = word_tokenize(sentence)
tagged_tokens = pos_tag(tokens)

def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to the WordNet POS code the lemmatizer expects."""
    if tag.startswith('J'):
        return 'a'   # adjective
    elif tag.startswith('V'):
        return 'v'   # verb
    elif tag.startswith('R'):
        return 'r'   # adverb
    else:
        return 'n'   # noun, also the fallback for every other tag


lemmatized_sentence = []
for word, tag in tagged_tokens:
    # Keep these forms of "to be" as-is instead of reducing them to "be"
    if word.lower() in {'is', 'am', 'are'}:
        lemmatized_sentence.append(word)
    else:
        lemmatized_sentence.append(
            lemmatizer.lemmatize(word, get_wordnet_pos(tag)))
print("Original Sentence: ", sentence)
print("Lemmatized Sentence: ", ' '.join(lemmatized_sentence))


# STEMMING

In [13]:
import re

def simple_stemmer(word: str) -> str:
    """
    Very basic rule-based stemmer.
    Not full Porter algorithm.
    """

    word = word.lower()

    # Step 1: plurals
    if word.endswith("ies"):
        return word[:-3] + "i"
    elif word.endswith("s") and len(word) > 3:
        return word[:-1]

    # Step 2: common verb endings
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    elif word.endswith("ed") and len(word) > 3:
        return word[:-2]

    return word


def stem_text(text: str) -> list:
    tokens = re.findall(r'\b[a-z]+\b', text.lower())
    return [simple_stemmer(word) for word in tokens]

text = "The cats were running and studies were completed"
print(stem_text(text))


['the', 'cat', 'were', 'runn', 'and', 'studi', 'were', 'complet']


In [21]:
# using NLTK library 

from nltk.stem import PorterStemmer

def stem_text(text):
    stemmer = PorterStemmer()
    tokens = text.lower().split()   # no punkt dependency
    stems = [stemmer.stem(word) for word in tokens]
    return stems


# Example
text = "The cats were running faster than the other runners"
print(stem_text(text))



['the', 'cat', 'were', 'run', 'faster', 'than', 'the', 'other', 'runner']


# Design a Python function that calculates cosine similarity between two text documents

In [1]:
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    # 1️⃣ Tokenization (simple)
    tokens1 = doc1.lower().split()
    tokens2 = doc2.lower().split()
    
    # 2️⃣ Build vocabulary
    vocabulary = set(tokens1).union(set(tokens2))
    
    # 3️⃣ Create frequency vectors
    freq1 = Counter(tokens1)
    freq2 = Counter(tokens2)
    
    vector1 = [freq1[word] for word in vocabulary]
    vector2 = [freq2[word] for word in vocabulary]
    
    # 4️⃣ Compute dot product
    dot_product = sum(v1 * v2 for v1, v2 in zip(vector1, vector2))
    
    # 5️⃣ Compute magnitudes
    magnitude1 = math.sqrt(sum(v ** 2 for v in vector1))
    magnitude2 = math.sqrt(sum(v ** 2 for v in vector2))
    
    # 6️⃣ Avoid division by zero
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    
    # 7️⃣ Cosine similarity
    return dot_product / (magnitude1 * magnitude2)


# ✅ Example
doc1 = "I love NLP and machine learning"
doc2 = "I love deep learning and NLP"

similarity = cosine_similarity(doc1, doc2)
print("Cosine Similarity:", similarity)


Cosine Similarity: 0.8333333333333335
