# Word Embeddings & NLP Algorithms (BoW, TF-IDF, Word2Vec)

Natural Language Processing (NLP) requires text to be converted into a **numerical representation** so that machine learning algorithms can process it.  
In this notebook, we will explore three important methods:

1. Bag of Words (BoW)  
2. TF-IDF (Term Frequency - Inverse Document Frequency)  
3. Word2Vec  

---


## 🔹 1. Bag of Words (BoW)


- **Idea:** Represent text as a "bag" of words, ignoring grammar and word order.  
- Each document (sentence/paragraph) is represented by a vector of word counts.  
- Vocabulary = all unique words in the dataset.  
- Document → vector where each dimension = frequency of a word.  

###  Limitations
- Ignores word order (so "dog bites man" = "man bites dog").  
- Creates very large and sparse vectors (if vocabulary is huge).  
- Doesn't capture meaning, only frequency.  
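
The word-order limitation can be checked directly. The sketch below is a minimal illustration that assumes plain whitespace tokenization; the helper name `bow_counts` is hypothetical and not part of any library.

```python
from collections import Counter

def bow_counts(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

a = bow_counts("dog bites man")
b = bow_counts("man bites dog")

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations are identical.
print(a == b)  # True
```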

### Example
Corpus = ["I love NLP", "I love ML"]  
Vocabulary = {I, love, NLP, ML}  
Vectors:  
- "I love NLP" → [1, 1, 1, 0]  
- "I love ML"  → [1, 1, 0, 1]  
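
The vectors above can be reproduced with a few lines of plain Python. This is a sketch under simple assumptions (whitespace tokenization, vocabulary ordered by first appearance); the helper names `build_vocabulary` and `to_bow_vector` are made up for illustration.

```python
def build_vocabulary(docs):
    """Collect unique tokens in order of first appearance."""
    vocab = []
    for doc in docs:
        for token in doc.split():
            if token not in vocab:
                vocab.append(token)
    return vocab

def to_bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    tokens = doc.split()
    return [tokens.count(word) for word in vocab]

toy_corpus = ["I love NLP", "I love ML"]
toy_vocab = build_vocabulary(toy_corpus)
toy_vectors = [to_bow_vector(doc, toy_vocab) for doc in toy_corpus]

print(toy_vocab)    # ['I', 'love', 'NLP', 'ML']
print(toy_vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```

Unlike this sketch, scikit-learn's `CountVectorizer` lowercases the text and, with its default token pattern, drops one-character tokens such as "I", which is why the cell output below has no column for it.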


In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example corpus
corpus = [
    "I love natural language processing",
    "I love machine learning",
    "I enjoy deep learning and NLP"
]

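# Bag of Words: CountVectorizer lowercases the text and, by default,
# drops one-character tokens, so "I" does not appear in the vocabulary
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)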

# Convert to DataFrame for readability
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
bow_df

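|   | and | deep | enjoy | language | learning | love | machine | natural | nlp | processing |
|---|-----|------|-------|----------|----------|------|---------|---------|-----|------------|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |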


## 🔹 2. TF-IDF (Term Frequency - Inverse Document Frequency)


BoW treats every word count the same way, but in real text:
- Common words like "is", "the", and "and" appear often but are not very informative.
- Rarer words tend to carry **more meaning**.

That’s where **TF-IDF** comes in.

### Formula
- **TF (Term Frequency):**  
  How often a word appears in a document.  
  \[
  TF(t, d) = \frac{\text{Count of term t in document d}}{\text{Total terms in document d}}
  \]

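- **IDF (Inverse Document Frequency):**  
  How rare a word is across all documents. The 1 in the denominator avoids division by zero for terms that appear in no document; other variants of this formula exist.  
  \[
  IDF(t) = \log \frac{\text{Total number of documents}}{1 + \text{Number of documents containing t}}
  \]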

- **TF-IDF = TF × IDF**
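
To make the formulas concrete, here is a minimal sketch that applies them exactly as written to a tiny made-up corpus. The corpus and the function names `tf`, `idf`, and `tfidf` are assumptions for illustration, and the natural logarithm is used.

```python
import math

def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / (1 + df)), where df is the number of documents containing the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + df))

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy corpus, already tokenized
toy_docs = [s.split() for s in [
    "the cat sat",
    "the dog barked",
    "the cat purred",
    "birds sing",
]]

# Scores for three terms of the third document
for term in ["the", "cat", "purred"]:
    print(term, round(tfidf(term, toy_docs[2], toy_docs), 4))
# "the" (in 3 of 4 documents) scores 0.0, "cat" about 0.0959, "purred" about 0.231
```

One side effect of this particular variant: a term that occurs in every document gets a slightly negative IDF, because the ratio inside the logarithm drops below 1.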

###  Key Point
- TF-IDF is high for words that are frequent in one document but rare in the others.  
- TF-IDF is low for words that are common across all documents.  

###  Limitations
- Still produces sparse vectors (like BoW).  
- Ignores semantics and word order.  


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

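# TF-IDF representation. TfidfVectorizer's defaults use a smoothed IDF,
# ln((1 + N) / (1 + df)) + 1, and L2-normalize each row, so the values
# differ from the plain TF × IDF formula above
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)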

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

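|   | and | deep | enjoy | language | learning | love | machine | natural | nlp | processing |
|---|-----|------|-------|----------|----------|------|---------|---------|-----|------------|
| 0 | 0.0 | 0.0 | 0.0 | 0.528635 | 0.0 | 0.40204 | 0.0 | 0.528635 | 0.0 | 0.528635 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.517856 | 0.517856 | 0.680919 | 0.0 | 0.0 | 0.0 |
| 2 | 0.467351 | 0.467351 | 0.467351 | 0.0 | 0.355432 | 0.0 | 0.0 | 0.0 | 0.467351 | 0.0 |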
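
These numbers do not follow from the plain formula in the text, because `TfidfVectorizer` by default uses raw counts for TF, a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The sketch below recomputes the first row by hand under those defaults; the variable names are illustrative only.

```python
import math

n_docs = 3

# First document after tokenization ("I" is dropped by the default tokenizer):
# term -> (count in this document, number of documents containing the term)
doc0 = {
    "natural": (1, 1),
    "language": (1, 1),
    "processing": (1, 1),
    "love": (1, 2),
}

def smooth_idf(df, n):
    """Smoothed IDF as used by scikit-learn when smooth_idf=True."""
    return math.log((1 + n) / (1 + df)) + 1

# Raw count times smoothed IDF, then divide by the row's Euclidean norm
raw = {term: count * smooth_idf(df, n_docs) for term, (count, df) in doc0.items()}
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {term: w / norm for term, w in raw.items()}

for term, w in weights.items():
    print(f"{term}: {w:.6f}")
# natural, language, processing: 0.528635; love: 0.402040
```

"love" gets a lower weight than the other three terms because it also appears in the second document.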


## 🔹 3. Word2Vec


Unlike BoW & TF-IDF (which are sparse and don't capture meaning), **Word2Vec** uses a shallow neural network to learn **dense word embeddings**.

- Each word is represented as a **vector of real numbers** (e.g., 50 or 100 dimensions).
- Words with similar meanings have similar vectors.
- Famous for capturing semantic relationships:
  \[
  \text{king} - \text{man} + \text{woman} \approx \text{queen}
  \]
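
The arithmetic behind this analogy can be sketched with hand-made vectors. The two-dimensional values below are hypothetical (one axis loosely standing for "royalty", the other for "gender"); real Word2Vec vectors are learned from data and the analogy only holds approximately.

```python
import math

# Hypothetical 2-D vectors (royalty, gender); not learned embeddings
toy_embeddings = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman", toy_embeddings))  # queen
```

Excluding the three input words from the candidates mirrors common practice in analogy evaluation; without it, the nearest vector is often one of the inputs.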

###  Architectures
1. **CBOW (Continuous Bag of Words):**
   - Predicts the target word from its context (surrounding words).
   - Faster to train; tends to represent frequent words slightly better.
   
2. **Skip-Gram:**
   - Predicts the surrounding words from the target word.
   - Slower, but works well with smaller datasets and represents rare words better.
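
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The sketch below only generates (input, label) pairs for a fixed window; the function names are hypothetical, and real implementations add details such as subsampling and negative sampling that are omitted here.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window):
    # CBOW: input = all context words, label = the target word
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window):
    # Skip-Gram: input = the target word, label = one context word per example
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]

sentence = "i love machine learning".split()

print(cbow_examples(sentence, window=1))
# [(['love'], 'i'), (['i', 'machine'], 'love'),
#  (['love', 'learning'], 'machine'), (['machine'], 'learning')]

print(skipgram_examples(sentence, window=1))
# [('i', 'love'), ('love', 'i'), ('love', 'machine'),
#  ('machine', 'love'), ('machine', 'learning'), ('learning', 'machine')]
```

Skip-Gram produces one example per (target, context word) pair, so it yields more training examples than CBOW for the same text.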

###  Advantages
- Captures semantic meaning (e.g., "happy" and "joyful" are close).  
- Dense, low-dimensional vectors (efficient).  
- Useful for deep learning models.  

###  Limitations
- Requires a large corpus to train well.  
- Static embeddings: each word gets a single vector regardless of context, so the different senses of a word like "bank" share one representation.  
  (Later contextual models like BERT address this.)

In [3]:
from gensim.models import Word2Vec

# Tokenize corpus
tokenized_corpus = [sentence.lower().split() for sentence in corpus]

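# Train Word2Vec model (CBOW by default; min_count=1 keeps every word of this tiny corpus).
# With workers > 1, results are not exactly reproducible across runs.
w2v_model = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=5, min_count=1, workers=4)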

# Check vector representation
print("Vector for 'learning':\n", w2v_model.wv['learning'])

# Find similar words (with only three sentences, these similarity scores are essentially noise)
print("\nMost similar words to 'learning':")
print(w2v_model.wv.most_similar('learning'))

Vector for 'learning':
 [-0.01631583  0.0089916  -0.00827415  0.00164907  0.01699724 -0.00892435
  0.009035   -0.01357392 -0.00709698  0.01879702 -0.00315531  0.00064274
 -0.00828126 -0.01536538 -0.00301602  0.00493959 -0.00177605  0.01106732
 -0.00548595  0.00452013  0.01091159  0.01669191 -0.00290748 -0.01841629
  0.0087411   0.00114357  0.01488382 -0.00162657 -0.00527683 -0.01750602
 -0.00171311  0.00565313  0.01080286  0.01410531 -0.01140624  0.00371764
  0.01217773 -0.0095961  -0.00621452  0.01359526  0.00326295  0.00037983
  0.00694727  0.00043555  0.01923765  0.01012121 -0.01783478 -0.01408312
  0.00180291  0.01278507]

Most similar words to 'learning':
[('and', 0.12486252933740616), ('enjoy', 0.07395090907812119), ('i', 0.04237302392721176), ('deep', 0.01819584146142006), ('love', 0.011071967892348766), ('language', 0.001357130240648985), ('processing', -0.1191045492887497), ('nlp', -0.17424817383289337), ('machine', -0.1754782646894455), ('natural', -0.24708358943462372)]


# 🎯 Final Summary

| Method   | Type            | Captures Meaning? | Vector Size | Pros | Cons |
|----------|----------------|------------------|-------------|------|------|
| BoW      | Frequency-based | ❌ No            | Large & Sparse | Simple, easy | Ignores meaning & order |
| TF-IDF   | Weighted counts | ❌ No            | Large & Sparse | Reduces importance of common words | Still ignores meaning |
| Word2Vec | Neural Embedding| ✅ Yes           | Dense (50-300) | Captures semantics, efficient | Needs large data, context-independent |

✅ Use **BoW/TF-IDF** for simple ML models.  
✅ Use **Word2Vec (or other embeddings such as GloVe, fastText, or BERT)** for deep learning & semantic tasks.


# ✅ Summary

- **BoW**: Simple count of words (ignores order & meaning).  
- **TF-IDF**: Weighted version of BoW (gives importance to rare but significant words).  
- **Word2Vec**: Learns dense word embeddings that capture semantic meaning of words.  

These are fundamental techniques in NLP and are often the first step before applying machine learning models.