### **1. Text Preprocessing and Representation** 

**Problem**: Computers don't understand text. We need to convert words to numbers that capture meaning.

**Tokenization:** Breaking text into units called tokens, which can be words, characters, or subwords.

**Why it matters:** 
- ML models need numbers, not raw text
- Need to define what a "unit" of meaning is
- Can't process entire sentences as single entities
- First step to counting/vectorizing words
- Determines granularity of your analysis
- **First step in any NLP pipeline. Defines our vocabulary + Affects everything downstream.**


In [None]:
import nltk
from nltk.tokenize import word_tokenize

text = "I love NLP! It's amazing."

# Word tokenization
tokens = word_tokenize(text)
print(tokens)
# Output: ['I', 'love', 'NLP', '!', 'It', "'s", 'amazing', '.']

# Simple split (naive approach)
simple_tokens = text.split()
print(simple_tokens)
# Output: ['I', 'love', 'NLP!', "It's", 'amazing.']
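
The difference between token granularities can be sketched with the standard library alone. The `regex_tokenize` helper and its pattern are illustrative assumptions, not NLTK's algorithm, so its output differs from `word_tokenize` (for example, it keeps "It's" as a single token).

```python
import re

text = "I love NLP! It's amazing."

def regex_tokenize(s):
    # Hypothetical helper: a word (optionally with an internal apostrophe)
    # or a single punctuation character
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)

# Word-level tokens
word_tokens = regex_tokenize(text)
print(word_tokens)
# ['I', 'love', 'NLP', '!', "It's", 'amazing', '.']

# Character-level tokens: every character becomes its own unit
char_tokens = list("amazing")
print(char_tokens)
# ['a', 'm', 'a', 'z', 'i', 'n', 'g']
```

Word-level tokens give a large vocabulary of meaningful units; character-level tokens give a tiny vocabulary but much longer sequences.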

#### **Lowercasing**

**Why we do it (intuition):**

- "Apple", "apple", "APPLE" → should usually be treated as the same word
- Reduces vocabulary size (fewer unique tokens)
- Prevents model from thinking "The" and "the" are different
- Simple way to normalize text

**Trade-off:** Loses information. "Apple" (company) vs "apple" (fruit). "US" (country) vs "us" (pronoun). For most tasks, the simplification is worth it. For NER or cases where capitalization carries meaning, don't do it.

In [None]:
text = "Apple makes great products. I love apples."

lowercase_text = text.lower()
print(lowercase_text)
# Output: "apple makes great products. i love apples."

# In a pipeline
tokens = word_tokenize(text.lower())
print(tokens)
# Output: ['apple', 'makes', 'great', 'products', '.', 'i', 'love', 'apples', '.']
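
A small sketch of both sides of the trade-off, using a made-up token list (an assumption for illustration): lowercasing shrinks the vocabulary, but it also merges "US" with "us".

```python
tokens = ["The", "the", "Apple", "apple", "APPLE", "US", "us"]

vocab_cased = set(tokens)
vocab_lower = {t.lower() for t in tokens}

print(len(vocab_cased), len(vocab_lower))  # 7 3
print(sorted(vocab_lower))                 # ['apple', 'the', 'us']

# The information loss: "US" (country) and "us" (pronoun) are now identical
print("US".lower() == "us".lower())        # True
```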


#### **Punctuation Removal**

**Why we do it:**

- Punctuation often doesn't carry semantic meaning for simple tasks
- Reduces noise in vocabulary ("word", "word," and "word." should be the same token)
- Makes feature vectors cleaner for BoW/TF-IDF
- "Hello!" and "Hello" should probably mean the same thing

**Trade-off:** Modern models (BERT, GPT) actually learn from punctuation, and punctuation can change meaning: "Let's eat grandma" vs "Let's eat, grandma".

**Decision Framework**:
- For classical methods (BoW, TF-IDF) → remove it. 
- For deep learning → keep it.

In [None]:
import string

text = "Hello, world! How are you?"

# Remove all punctuation
no_punct = text.translate(str.maketrans('', '', string.punctuation))
print(no_punct)
# Output: "Hello world How are you"

# Or with regex
import re
no_punct = re.sub(r'[^\w\s]', '', text)
print(no_punct)
# Output: "Hello world How are you"

#### **Stopwords**

**What it is:** Common words that appear frequently but carry little meaningful information (the, is, at, which, on, a, an, etc.)

**Why we do it:**

- "the", "is", "at" appear in almost every document → don't help distinguish documents
- Reduces dimensionality (smaller vocabulary)
- Focuses on content words that carry actual meaning
- Speeds up computation

**Trade-off:** Can lose context. "not good" → remove "not" → becomes "good" (opposite meaning!). 

- For BoW/TF-IDF with limited compute → remove them. 
- For models with enough capacity (embeddings, transformers) → keep them, context matters.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is a simple example showing stop word removal"

# Get English stop words
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text.lower())
print("Original:", tokens)
# Output: ['this', 'is', 'a', 'simple', 'example', 'showing', 'stop', 'word', 'removal']

# Remove stop words
filtered_tokens = [word for word in tokens if word not in stop_words]
print("Filtered:", filtered_tokens)
# Output: ['simple', 'example', 'showing', 'stop', 'word', 'removal']
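
The "not good" problem can be sketched without NLTK. The tiny `stop_words` set below is a hypothetical stand-in (NLTK's English list is much longer, and it does include "not"); the point is only that negation words can be kept out of the stop list.

```python
# Hypothetical mini stop list for illustration
stop_words = {"this", "is", "a", "the", "not"}
negations = {"not", "no", "never"}

tokens = ["this", "movie", "is", "not", "good"]

# Naive removal drops "not" and flips the meaning
naive = [t for t in tokens if t not in stop_words]
print(naive)  # ['movie', 'good']

# Keep negation words by removing them from the stop list first
safe_stop_words = stop_words - negations
keep_negation = [t for t in tokens if t not in safe_stop_words]
print(keep_negation)  # ['movie', 'not', 'good']
```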

#### **Stemming vs Lemmatization**

**What it is:** Both reduce words to their root/base form, but use different approaches.
- Stemming: crude chopping using rules
- Lemmatization: dictionary lookup to find the proper root

**Why we do it**:
- Groups related words together ("running", "runs", "ran" → all mean "run")
- Reduces vocabulary size
- Helps model see that different forms are the same concept
- Improves generalization

**Trade-off:** Stemming is fast but crude (can create non-words like "studi"). Lemmatization is accurate but slower. For search/IR → stemming is fine. For tasks needing precision → lemmatization.

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "ran", "better", "studies", "caring"]

# Stemming
stemmed = [stemmer.stem(word) for word in words]
print("Stemmed:", stemmed)
# Output: ['run', 'run', 'ran', 'better', 'studi', 'care']

# Lemmatization (pos='v' lemmatizes every word as a verb; needs nltk.download('wordnet'))
lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in words]
print("Lemmatized:", lemmatized)
# Output: ['run', 'run', 'run', 'better', 'study', 'care']

#### **N-Grams**

**What it is:** Sequences of N consecutive words/tokens. Captures phrases instead of just individual words.

- Unigram (1-gram): single words
- Bigram (2-gram): pairs of consecutive words
- Trigram (3-gram): triplets of consecutive words

**Why we do it:**

- Captures phrases and context ("New York" is one entity, not "New" + "York")
- "not good" has a different meaning than "good" alone
- Word order matters for meaning
- Better features for classification

**Trade-off:** Higher N = more context but exponentially larger vocabulary (sparse features). Bigrams are usually the sweet spot.

In [None]:
from nltk import ngrams
from nltk.tokenize import word_tokenize

text = "New York is a great city"
tokens = word_tokenize(text.lower())

# Unigrams
unigrams = list(ngrams(tokens, 1))
print("Unigrams:", unigrams)
# Output: [('new',), ('york',), ('is',), ('a',), ('great',), ('city',)]

# Bigrams
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)
# Output: [('new', 'york'), ('york', 'is'), ('is', 'a'), ('a', 'great'), ('great', 'city')]

# Trigrams
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)
# Output: [('new', 'york', 'is'), ('york', 'is', 'a'), ('is', 'a', 'great'), ('a', 'great', 'city')]
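
Under the hood, n-gram extraction is just a sliding window. A minimal pure-Python sketch (the `make_ngrams` name is an illustrative assumption, not an NLTK function):

```python
def make_ngrams(tokens, n):
    # Slide a window of size n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["new", "york", "is", "a", "great", "city"]

for n in (1, 2, 3):
    grams = make_ngrams(tokens, n)
    print(n, len(grams), grams[0])
# 1 6 ('new',)
# 2 5 ('new', 'york')
# 3 4 ('new', 'york', 'is')
```

A sequence of L tokens yields L - n + 1 n-grams, so each document produces slightly fewer n-grams as N grows, while the number of distinct possible n-grams grows very quickly.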

#### **Bag of Words**

**What it is:** Represent text as a vector of word counts. Each position in vector = one word in vocabulary. Ignores word order completely.

**Why we do it:**

- Converts text into numbers that ML models can use
- Simple, fast, interpretable
- Each document becomes a fixed-length vector
- Good baseline for classification tasks

**Trade-off:** 
- Loses all word order ("dog bites man" = "man bites dog" = "bites dog man").
- High dimensionality (one dimension per unique word).
- No semantic understanding ("king" and "queen" are completely unrelated).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love cats",
    "I love dogs",
    "cats and dogs"
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
# Output: ['and' 'cats' 'dogs' 'love']

print("BoW Matrix:\n", bow_matrix.toarray())
# Output:
# [[0 1 0 1]  <- "I love cats"
#  [0 0 1 1]  <- "I love dogs"
#  [1 1 1 0]] <- "cats and dogs"
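
To see what the vectorizer is doing, here is a from-scratch sketch with `collections.Counter`. It uses a plain whitespace split (an assumption for simplicity), so unlike `CountVectorizer`, whose default token pattern drops single-character tokens, it keeps "i" in the vocabulary.

```python
from collections import Counter

documents = ["I love cats", "I love dogs", "cats and dogs"]
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: sorted unique tokens; each word gets a fixed column index
vocab = sorted({tok for doc in tokenized for tok in doc})
print(vocab)  # ['and', 'cats', 'dogs', 'i', 'love']

def bow_vector(tokens, vocab):
    # Count each vocabulary word in the document (0 if absent)
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

matrix = [bow_vector(doc, vocab) for doc in tokenized]
for row in matrix:
    print(row)
# [0, 1, 0, 1, 1]
# [0, 0, 1, 1, 1]
# [1, 1, 1, 0, 0]
```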

#### **TF-IDF**

**What it is:** Improved version of BoW. Weights words by how important they are. Common words get lower weight, rare distinctive words get higher weight.

**Formula:**
- TF (Term Frequency) = how often word appears in document
- IDF (Inverse Document Frequency) = log(total documents / documents containing word)
- TF-IDF = TF × IDF

**Why we do it:**

- Downweights common useless words ("the", "is", "and")
- Highlights distinctive/informative words
- Better features than raw counts for classification
- Captures document uniqueness

**Trade-off:** Still no word order. Still no semantics. But much better than BoW for most tasks. Standard baseline for text classification.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are enemies"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())

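# "the" appears in docs 1 and 2 → lower IDF than "cat" or "mat"
# (it occurs twice per document, so its final TF-IDF weight can still be high in this tiny corpus)
# "enemies" appears in only doc 3 → high IDF → high TF-IDF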
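
The formula above can be worked through by hand. This is a sketch of the plain textbook version (raw count × log(N / df)); scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are enemies",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

def tf(term, tokens):
    # Term frequency: raw count of the term in one document
    return tokens.count(term)

def idf(term):
    # Inverse document frequency: log(total docs / docs containing the term)
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df)

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

print(round(tfidf("the", tokenized[0]), 3))      # 0.811  (tf=2, idf=log(3/2))
print(round(tfidf("cat", tokenized[0]), 3))      # 1.099  (tf=1, idf=log(3/1))
print(round(tfidf("enemies", tokenized[2]), 3))  # 1.099
```

Even though "the" occurs twice in the first document, its low IDF keeps its score below that of "cat", which occurs once.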

### **2. Word Embeddings**

BoW and TF-IDF treat every word as unrelated to every other word. Word embeddings move beyond this to dense vector representations in which similar words have similar vectors.

**Why we do it:**
- Capture semantic meaning (similar words → similar vectors)
- Reduce dimensionality (300 dims vs 50,000+ vocab size)
- Enable mathematical operations on meaning (king - man + woman ≈ queen)
- Transfer knowledge (pre-trained embeddings from huge corpora)
- Foundation for modern NLP (BERT, GPT, etc.)

**The Problem with BoW/TF-IDF**

In [None]:
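# In BoW/TF-IDF, each word is an independent dimension (one-hot vector)
import numpy as np

vocabulary = ["king", "queen", "man", "woman", "cat", "dog"]
one_hot = dict(zip(vocabulary, np.eye(len(vocabulary))))

print(one_hot["king"])   # [1. 0. 0. 0. 0. 0.]
print(one_hot["queen"])  # [0. 1. 0. 0. 0. 0.]
print(one_hot["cat"])    # [0. 0. 0. 0. 1. 0.]

# Distance between "king" and "queen" = same as distance between "king" and "cat"
print(np.linalg.norm(one_hot["king"] - one_hot["queen"]))  # ~1.414
print(np.linalg.norm(one_hot["king"] - one_hot["cat"]))    # ~1.414
# Model has no idea that king/queen are related!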

**The Embedding Solution**

In [None]:
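# Words become dense vectors in continuous space (toy 3-d values; real ones have e.g. 300 dimensions)
import numpy as np

king = np.array([0.5, 0.8, 0.1])
queen = np.array([0.48, 0.79, 0.12])
cat = np.array([-0.2, 0.1, 0.9])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Now: king and queen are CLOSE in vector space
print(cosine(king, queen))  # close to 1
print(cosine(king, cat))    # close to 0
# Semantic similarity captured by geometric proximity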
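
The "king - man + woman ≈ queen" idea can be sketched with hand-made vectors. The 2-d values below are invented for illustration (axis 0 loosely "royalty", axis 1 loosely "maleness"); real embeddings are learned and only approximately behave this way.

```python
import math

# Hypothetical toy vectors, not learned embeddings
vectors = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "man":    [0.1, 0.9],
    "woman":  [0.1, 0.1],
    "prince": [0.8, 0.9],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity (query words excluded)
candidates = {w: v for w, v in vectors.items()
              if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(best)  # queen
```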

#### **Word2Vec**

**What it is:** A method that learns word embeddings by training a shallow neural network to predict words from their context. It comes in two variants: CBOW and Skip-gram.

**Two Architectures:**

1. CBOW (Continuous Bag of Words):

- Given context words → predict center word
- "The cat ___ on the mat" → predict "sat"

2. Skip-gram:

- Given center word → predict context words
- Given "cat" → predict "the", "sat", "on"

**Why we do it:**
- Learns embeddings from raw text (unsupervised)
- Words appearing in similar contexts get similar vectors
- Captures semantic and syntactic relationships
- Skip-gram: better for rare words, smaller datasets
- CBOW: faster, better for frequent words

In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends"
]

# Tokenize
tokenized = [word_tokenize(sent.lower()) for sent in sentences]

# Train Word2Vec (skip-gram)
model = Word2Vec(sentences=tokenized, vector_size=100, window=5, 
                 min_count=1, sg=1)  # sg=1 for skip-gram, sg=0 for CBOW

# Get embedding for a word
vector = model.wv['cat']
print("Cat vector shape:", vector.shape)  # (100,)

# Find similar words
similar = model.wv.most_similar('cat', topn=3)
print("Similar to cat:", similar)
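
The training data that Skip-gram learns from is simply a list of (center, context) pairs. A minimal sketch of how those pairs are generated, assuming a fixed window of 2 (the `skipgram_pairs` helper is illustrative; real implementations add details such as negative sampling and subsampling of frequent words):

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, pair it with every word within `window` positions
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, window=2)

print([p for p in pairs if p[0] == "cat"])
# [('cat', 'the'), ('cat', 'sat'), ('cat', 'on')]
```

CBOW uses the same windows in the opposite direction: the context words are the input and the center word is the target.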

#### **GloVe (Global Vectors)**

**What it is:** Word embedding method that uses global word co-occurrence statistics from a corpus. Unlike Word2Vec which uses local context windows, GloVe looks at the entire corpus's word co-occurrence matrix.

**Simplified:** GloVe counts how often words appear near each other across an ENTIRE corpus (like all of Wikipedia), then creates vectors where words that frequently co-occur are close together.

**Core Idea:**
- Count how often words appear together across the entire corpus
- "ice" co-occurs with "solid" much more often than it does with "gas"
- Ratio of co-occurrence probabilities captures meaning

**Why we do it:**

- Combines benefits of matrix factorization (global stats) and local context (Word2Vec)
- Often performs better than Word2Vec on word analogy tasks
- Pre-trained on massive corpora (Wikipedia, Common Crawl)
- Fixed embeddings you can plug into your model immediately

In [None]:
# GloVe is typically used pre-trained, not trained from scratch
import numpy as np

# Load pre-trained GloVe embeddings
def load_glove(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Download from: https://nlp.stanford.edu/projects/glove/
# glove = load_glove('glove.6B.100d.txt')

# Using with gensim
from gensim.models import KeyedVectors

# gensim 4.x can read GloVe's text format directly via no_header=True
# (older versions first converted the file with gensim.scripts.glove2word2vec)
# model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)
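
The co-occurrence counting step can be sketched on a toy corpus. The sentences and the `cooccurrence_counts` helper are assumptions for illustration; GloVe itself additionally down-weights distant pairs and then fits vectors to the logarithm of these counts, which is omitted here.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized
corpus = [
    ["ice", "is", "a", "cold", "solid"],
    ["steam", "is", "a", "hot", "gas"],
    ["ice", "and", "steam", "are", "water"],
]

def cooccurrence_counts(sentences, window=4):
    # Count every ordered (word, neighbor) pair within the window
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(corpus)
print(counts[("ice", "solid")], counts[("ice", "gas")])      # 1 0
print(counts[("steam", "gas")], counts[("steam", "solid")])  # 1 0
```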

#### **FastText**

**What it is:** Extension of Word2Vec that represents words as bags of character n-grams instead of treating words as atomic units. Developed by Facebook.

**Core Idea:**
- Break words into subword units (character n-grams)
- "apple" → ["<ap", "app", "ppl", "ple", "le>", "<apple>"]
- Word embedding = sum of its n-gram embeddings

**Why we do it:**

- Handles out-of-vocabulary (OOV) words (Word2Vec/GloVe can't)
- Good for morphologically rich languages (German, Turkish)
- Captures subword information ("unhappiness" = "un" + "happy" + "ness")
- Useful for typos, rare words, new words

In [None]:
from gensim.models import FastText

sentences = [
    ["the", "cat", "sat"],
    ["the", "cats", "are", "sitting"],
    ["unseen", "word"]
]

# Train FastText
model = FastText(sentences=sentences, vector_size=100, window=5, 
                 min_count=1, min_n=3, max_n=6)  # n-gram range

# Get embedding for seen word
vector = model.wv['cat']

# KEY FEATURE: Get embedding for UNSEEN word (out-of-vocabulary)
# "kittens" never appears in the training sentences, but FastText can still
# build a vector from its character n-grams ("<ki", "kit", "itt", ...)
oov_vector = model.wv['kittens']  # Works even if never seen before!

# Find similar words
similar = model.wv.most_similar('cat')
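
The n-gram breakdown of "apple" shown in the core idea can be reproduced with a few lines. The `char_ngrams` helper is an illustrative sketch, not gensim's internal function (which also hashes n-grams into a fixed number of buckets).

```python
def char_ngrams(word, min_n=3, max_n=3):
    # Wrap the word in boundary markers, then slide windows of each size
    wrapped = f"<{word}>"
    grams = [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]
    grams.append(wrapped)  # the whole word is kept as a special sequence
    return grams

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<apple>']

# Related words share n-grams, which is why their vectors end up related
shared = sorted(set(char_ngrams("cat")) & set(char_ngrams("cats")))
print(shared)  # ['<ca', 'cat']
```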

### **3. Classical NLP Tasks**

#### **Sentiment Analysis**

**What it is:** Determining the emotional tone/opinion in text. Classify text as positive, negative, or neutral.

**Why we do it:**

- Understand customer feedback (reviews, surveys)
- Monitor brand reputation on social media
- Automate content moderation
- Market research and opinion mining
- Common entry-level NLP task, good for learning the pipeline

**Mathematical Intuition:**

1. **Convert text to numbers** (e.g. TF-IDF)
```
"I love this" → [0.2, 0.8, 0.5, 0, 0, ...] (vector of word weights)
```

2. **Learn weights for each word** (during training)
```
"love" → +0.9 (strongly positive)
"hate" → -0.8 (strongly negative)
"the" → 0.01 (neutral)
```

3. **Calculate sentiment score** (dot product)
```
Sentiment Score = Σ(word_weight × learned_weight)
                = (0.2 × 0.1) + (0.8 × 0.9) + (0.5 × 0.05) + ...
                = 0.77
```

4. **Decision boundary**
```
If score > 0.5 → Positive
If score < -0.5 → Negative
Else → Neutral
```



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sample data
texts = [
    "I love this product, it's amazing!",
    "Terrible experience, waste of money",
    "It's okay, nothing special",
    "Best purchase ever!",
    "Horrible quality, very disappointed"
]
labels = [1, 0, 2, 1, 0]  # 1=positive, 0=negative, 2=neutral

# Split data
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Vectorize with TF-IDF
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train classifier
clf = LogisticRegression()
clf.fit(X_train_vec, y_train)

# Predict
new_text = ["This is absolutely wonderful!"]
new_vec = vectorizer.transform(new_text)
prediction = clf.predict(new_vec)
print("Sentiment:", prediction)  # e.g. [1] (positive); not reliable with only five training examples
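
The scoring steps from the mathematical intuition can be written out directly. The weights below reuse the illustrative numbers from that walkthrough, and assigning them to "i", "love", and "this" in that order is an assumption; a trained logistic regression would learn its own weights and apply a sigmoid or softmax rather than fixed thresholds.

```python
# Hypothetical TF-IDF weights for "I love this" and hypothetical learned weights
tfidf_weights = {"i": 0.2, "love": 0.8, "this": 0.5}
learned_weights = {"i": 0.1, "love": 0.9, "this": 0.05}

# Dot product: sum of (feature value x learned weight) over the words
score = sum(value * learned_weights.get(word, 0.0)
            for word, value in tfidf_weights.items())
print(round(score, 3))  # 0.765

def decide(score, threshold=0.5):
    # Simple symmetric decision boundary, as in step 4
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"

print(decide(score))  # Positive
```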

#### **Text Classification**

**What it is:** Assigning predefined categories/labels to text documents. Sentiment analysis is one type; this is the broader task.

**Examples:**
- Spam vs not spam (email filtering)
- Topic classification (sports, politics, tech)
- Intent classification (customer query routing)
- Language detection

**Why we do it:**
- Organize large document collections automatically
- Route customer requests to correct department
- Filter content (spam, inappropriate content)
- Tag/categorize articles, emails, support tickets
- Foundation for many real-world NLP applications

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Sample data
texts = [
    "The stock market crashed today",
    "New smartphone released with better camera",
    "Team wins championship game",
    "Company reports quarterly earnings",
    "Latest gadget review and specs"
]
labels = ['business', 'tech', 'sports', 'business', 'tech']

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())  # Naive Bayes works well for text
])

# Train
pipeline.fit(texts, labels)

# Predict
new_texts = ["Apple announces new product launch"]
predictions = pipeline.predict(new_texts)
print(predictions)  # Output: ['tech']

# Get probabilities
probs = pipeline.predict_proba(new_texts)
print(probs)  # Shows confidence for each class

#### **Named Entity Recognition**

**What it is:** Identifying and classifying named entities in text into predefined categories like person names, organizations, locations, dates, etc.

**Common Entity Types:**

- PERSON: "Barack Obama", "Marie Curie"
- ORG: "Google", "United Nations"
- LOC: "Paris", "Mount Everest"
- DATE: "January 2023", "yesterday"
- MONEY: "$100", "50 euros"

**Why We Do It**:
- Extract structured information from unstructured text
- Build knowledge graphs
- Information retrieval (find all documents mentioning "Google")
- Question answering systems
- Anonymization (remove/mask person names)

In [None]:
import spacy

# Load pre-trained model
nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. was founded by Steve Jobs in Cupertino in 1976."

# Process text
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")

# Typical output (may vary with the model version):
# Apple Inc. -> ORG
# Steve Jobs -> PERSON
# Cupertino -> GPE (Geo-Political Entity)
# 1976 -> DATE

# Visualize (in Jupyter)
from spacy import displacy
displacy.render(doc, style="ent")
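
Once entities have been found, the anonymization use case is a matter of replacing their character spans. In this sketch the entity list is typed in by hand as `(start, end, label)` tuples, in the shape that spaCy's `ent.start_char`, `ent.end_char`, and `ent.label_` would provide; the `mask_entities` helper is an illustrative assumption.

```python
def mask_entities(text, entities, labels_to_mask=("PERSON",)):
    # entities: (start_char, end_char, label) tuples with non-overlapping spans
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if label in labels_to_mask:
            pieces.append(text[cursor:start])  # text before the entity
            pieces.append(f"[{label}]")        # placeholder for the entity
            cursor = end
    pieces.append(text[cursor:])               # remaining tail
    return "".join(pieces)

text = "Apple Inc. was founded by Steve Jobs in Cupertino in 1976."
# Hand-written spans matching the entities listed above
entities = [(0, 10, "ORG"), (26, 36, "PERSON"), (40, 49, "GPE"), (53, 57, "DATE")]

masked = mask_entities(text, entities)
print(masked)
# Apple Inc. was founded by [PERSON] in Cupertino in 1976.
```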

#### **Part-of-Speech (PoS) Tagging**

**What it is:** Assigning grammatical categories (noun, verb, adjective, etc.) to each word in a sentence.

**Common POS Tags:**

- NN: Noun (singular)
- NNS: Noun (plural)
- VB: Verb (base form)
- VBD: Verb (past tense)
- JJ: Adjective
- RB: Adverb
- DT: Determiner (the, a, an)

**Why we do it:**

- Disambiguate word meaning ("book" = noun vs verb)
- Feature for other NLP tasks (NER, parsing)
- Grammar checking
- Information extraction (extract all verbs = actions)
- Improve text understanding (syntax structure)

In [None]:
import nltk
from nltk import pos_tag, word_tokenize

# Download required data
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# (newer NLTK releases use 'averaged_perceptron_tagger_eng' and 'punkt_tab' instead)

text = "The quick brown fox jumps over the lazy dog"

# Tokenize and tag
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

print(pos_tags)
# Typical output (individual tags can vary with the tagger version):
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), 
#  ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), 
#  ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

# Using spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

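for token in doc:
    # pos_ is the coarse universal tag (e.g. NOUN); tag_ is the fine-grained Penn Treebank tag (e.g. NN)
    print(f"{token.text} -> {token.pos_} ({token.tag_})")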
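
Tagged output is easy to filter for the information extraction use case. This sketch works on a hand-copied list of `(word, tag)` pairs in the Penn Treebank style shown above, so it runs without NLTK; the `words_with_tag` helper is an illustrative assumption.

```python
# (word, tag) pairs in the format returned by nltk.pos_tag
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"),
    ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]

def words_with_tag(tagged_tokens, prefix):
    # Penn Treebank tags share prefixes: VB* = verbs, NN* = nouns, JJ* = adjectives
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

print(words_with_tag(tagged, "VB"))  # ['jumps']
print(words_with_tag(tagged, "NN"))  # ['fox', 'dog']
print(words_with_tag(tagged, "JJ"))  # ['quick', 'brown', 'lazy']
```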