# Word Embeddings

### Word Embeddings in NLP

Word embeddings are **vector representations of words** that capture their meaning, context, and relationships. Unlike traditional representations like one-hot encoding (where each word is just a unique index), embeddings place words in a continuous vector space where **semantically similar words are closer together**.

---

#### 1. Why Word Embeddings?

* **One-hot encoding limitations**:

  * Creates very high-dimensional sparse vectors.
  * No notion of similarity (e.g., "king" and "queen" are just different one-hot vectors).

* **Embeddings solve this**:

  * Represent words as **dense vectors** (e.g., 100–300 dimensions).
  * Capture **semantic similarity** (e.g., vector("king") is close to vector("queen")).
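
The contrast can be made concrete with a small sketch. The dense vectors below are hand-picked toy values (an assumption for illustration, not learned embeddings), and `cosine` is a helper defined here, not a library function.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["king", "queen", "apple"]

# One-hot: one dimension per vocabulary word, a single 1 in each vector.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Hand-picked 3-d dense vectors, purely illustrative (not learned from data).
dense = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9]),
}

# Distinct one-hot vectors are orthogonal, so their similarity is always 0.
print(cosine(one_hot["king"], one_hot["queen"]))
# Dense vectors can express "close" and "far".
print(cosine(dense["king"], dense["queen"]))
print(cosine(dense["king"], dense["apple"]))
```

With one-hot vectors every pair of different words looks equally unrelated; dense vectors let the geometry carry similarity.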

---

#### 2. Popular Word Embedding Techniques

* **Word2Vec**

  * Based on neural networks (CBOW and Skip-gram models).
  * Learns word meaning from context.
  * Famous example:

    ```
    vector("king") - vector("man") + vector("woman") ≈ vector("queen")
    ```

* **GloVe (Global Vectors)**

  * Learns embeddings by factorizing word co-occurrence matrices.
  * Captures global statistical information.

* **FastText**

  * Extension of Word2Vec.
  * Represents each word as a bag of **subword units (character n-grams)**.
  * Handles **out-of-vocabulary (OOV)** words better.

* **Contextual Embeddings (modern approaches)**

  * Words get different embeddings depending on context.
  * Examples: **ELMo, BERT, GPT**.
  * "Bank" in *river bank* vs. *bank account* will have different embeddings.

---

#### 3. Properties of Word Embeddings

* **Semantic similarity**: Words like "car" and "automobile" will have close vectors.
* **Analogies**: Word relationships can be expressed as vector operations.
* **Clustering**: Similar words form clusters (animals, colors, verbs, etc.).

---

#### 4. Applications of Word Embeddings in NLP

* Text classification (sentiment analysis, spam detection).
* Machine translation.
* Named entity recognition (NER).
* Question answering and chatbots.
* Semantic search and recommendation systems.

---

#### 5. Example in Python (using Gensim Word2Vec)

```python
from gensim.models import Word2Vec

# Example corpus
sentences = [
    ["I", "love", "machine", "learning"],
    ["machine", "learning", "is", "awesome"],
    ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Get embedding for a word
print("Vector for 'machine':\n", model.wv['machine'])

# Find similar words
print("Most similar to 'learning':\n", model.wv.most_similar('learning'))
```

---

✅ In short, **word embeddings transform words into meaningful numerical representations** that machine learning models can understand and use effectively.



# Word2Vec

**Word2Vec** is one of the most important foundational techniques in NLP for creating **word embeddings**. This section covers it in detail.

---

## What is Word2Vec?

* **Word2Vec** is a technique to represent words as **dense vectors** (embeddings) in a continuous vector space.
* Developed by **Tomas Mikolov and team at Google (2013)**.
* The idea: words with similar meanings should have similar vector representations.
  Example:

  * Vector("king") – Vector("man") + Vector("woman") ≈ Vector("queen")

Instead of sparse **one-hot vectors**, Word2Vec produces **dense embeddings** where dimensions capture semantic meaning.

---

## Two Main Architectures in Word2Vec

Word2Vec uses shallow neural networks with two major approaches:

1. **Continuous Bag of Words (CBOW)**

   * Predicts the **target word** from surrounding context words.
   * Example:
     Sentence: "The cat sits on the \_\_\_"
     Context: \["The", "cat", "sits", "on", "the"] → predict "mat"
   * Faster to train, and slightly more accurate for frequent words.

2. **Skip-Gram**

   * Opposite of CBOW. It predicts the **context words** given the target word.
   * Example:
     Target word: "mat" → predict \["The", "cat", "sits", "on", "the"]
   * Slower to train, but works well with small amounts of training data and represents rare words well.
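
To make the difference concrete, here is a minimal sketch of how training examples could be generated for each architecture. The function names `cbow_pairs` and `skipgram_pairs` and the symmetric window of 2 are assumptions for illustration; this is not how Gensim implements it internally.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs: each target predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """Return (context_words, target) pairs: neighbours jointly predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["the", "cat", "sits", "on", "the", "mat"]
print(cbow_pairs(tokens)[2])       # (['the', 'cat', 'on', 'the'], 'sits')
print(skipgram_pairs(tokens)[:3])  # [('the', 'cat'), ('the', 'sits'), ('cat', 'the')]
```

CBOW produces one example per position, while Skip-Gram produces one example per (target, neighbour) pair, which is part of why Skip-Gram is slower to train.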

---

## Training Word2Vec

* Input layer: one-hot encoded word.
* Hidden layer: produces embedding representation.
* Output layer: predicts word probabilities using **softmax**.
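
The three layers can be sketched in a few lines of NumPy. This is a toy forward pass with random, untrained weights; the names `W_in`, `W_out`, and `forward` and the sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sits", "on", "mat"]
V, D = len(vocab), 4             # vocabulary size, embedding dimension (toy values)

W_in = rng.normal(size=(V, D))   # input-to-hidden weights: row i is the embedding of word i
W_out = rng.normal(size=(D, V))  # hidden-to-output weights

def forward(word_index):
    one_hot = np.zeros(V)
    one_hot[word_index] = 1.0
    hidden = one_hot @ W_in              # equals W_in[word_index]: a plain row lookup
    scores = hidden @ W_out
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()              # softmax over the whole vocabulary
    return hidden, probs

hidden, probs = forward(vocab.index("cat"))
print(hidden)       # the (untrained) embedding of "cat"
print(probs.sum())  # probabilities over the vocabulary sum to 1
```

Because the input is one-hot, the hidden layer is just a row of `W_in`. After training, those rows are the word embeddings.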

### Challenges

* Vocabulary size is huge → softmax is expensive.
* Solutions:

  * **Hierarchical Softmax**: speeds up computation using a binary tree.
  * **Negative Sampling**: instead of updating weights for all words, update only for a small set of negative samples.
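
As a rough sketch of negative sampling, the loss for a single (center, context) pair can be written as below. The function name `sgns_loss` and the toy vectors are assumptions; real implementations also draw the negative samples from a smoothed word-frequency distribution, which is omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_center, u_context, u_negatives):
    """Negative-sampling loss for one (center, context) pair.

    v_center:    (D,)   input vector of the center word
    u_context:   (D,)   output vector of the true context word
    u_negatives: (K, D) output vectors of K sampled "noise" words
    """
    pos = np.log(sigmoid(u_context @ v_center))              # pull the true pair together
    neg = np.log(sigmoid(-(u_negatives @ v_center))).sum()   # push the noise words away
    return -(pos + neg)

rng = np.random.default_rng(0)
D, K = 8, 5
v = rng.normal(size=D)

# Context vector agrees with the center word, negatives point the other way.
loss_aligned = sgns_loss(v, v, -np.tile(v, (K, 1)))
# Context vector disagrees, negatives agree: the worst case.
loss_reversed = sgns_loss(v, -v, np.tile(v, (K, 1)))
print(loss_aligned < loss_reversed)  # True
```

Only `K + 1` output vectors are involved per training pair, instead of one per vocabulary word as with the full softmax.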

---

## Key Features of Word2Vec

* Produces embeddings that capture **semantic relationships**.
* Similar words cluster together, with closeness usually measured by cosine similarity.
* Can perform **vector arithmetic** on words.
* Efficient to train even on large datasets.

---

## Example in Python (using Gensim)

```python
from gensim.models import Word2Vec

# Sample corpus
sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["dogs", "play", "in", "the", "park"],
    ["the", "boy", "plays", "football"]
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Get embedding for a word
print(model.wv['cat'])

# Find similar words
print(model.wv.most_similar('cat'))
```

* `vector_size`: embedding dimension.
* `window`: context window size.
* `sg=1`: Skip-Gram; `sg=0`: CBOW.

---

## Limitations of Word2Vec

* Produces **static embeddings**: each word has one fixed vector.
  Example: the word "bank" (river bank vs. financial bank) has the same embedding.
* Struggles with **out-of-vocabulary (OOV)** words.
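* Later methods address these: **FastText** handles OOV words through subword units, and contextual **Transformer** models (BERT, GPT) give a word different vectors in different contexts. **GloVe**, like Word2Vec, still produces static embeddings.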

---

✅ In short: **Word2Vec is the foundation of modern NLP embeddings**. It transforms words into meaningful dense vectors, capturing semantic similarity and analogies.



# GloVe

### GloVe (Global Vectors for Word Representation)

**GloVe** is a popular word embedding technique introduced by researchers at Stanford. It is similar in spirit to Word2Vec but is based on a different idea: instead of predicting a word based on context (like Word2Vec), GloVe directly uses **global statistical information** from the text corpus.

---

### 1. **Core Idea**

* Word2Vec is predictive: learns embeddings by predicting context words.
* GloVe is count-based: builds a co-occurrence matrix of words and then factorizes it.

The intuition is:

* Words that appear in similar contexts should have similar vector representations.
* Analogies such as *“king – man + woman ≈ queen”* also emerge in GloVe embeddings.

---

### 2. **How GloVe Works**

1. **Co-occurrence Matrix**

   * Build a matrix `X` where each entry `X_ij` = number of times word `j` appears in the context of word `i`.
   * Example: in “I like playing football”, the co-occurrence of (“like”, “playing”) increases.

2. **Probability Ratios**

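   * Instead of just raw counts, GloVe looks at **ratios of co-occurrence probabilities**.
   * Comparing two words through a third probe word shows why: “solid” appears far more often near “ice” than near “steam”, so the probability of seeing “solid” in the context of “ice” divided by the probability of seeing it in the context of “steam” is large. For an unrelated probe word such as “fashion”, the same ratio stays close to 1.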

   Formula:

   $$
   P_{ij} = \frac{X_{ij}}{\sum_k X_{ik}}
   $$

   where `P_ij` is the probability of seeing word `j` in the context of word `i`.

3. **Training Objective**

   * GloVe tries to find word vectors such that their **dot product, plus bias terms, approximates the log of the co-occurrence count**.

   $$
   w_i^T w_j + b_i + b_j \approx \log(X_{ij})
   $$

   where:

   * `w_i` and `w_j` are word vectors
   * `b_i`, `b_j` are bias terms

   This converts co-occurrence statistics into useful embeddings.

4. **Optimization**

   * The loss function minimizes the difference between predicted values and `log(X_ij)`.
   * A weighting function `f(X_ij)` is added so that very frequent co-occurrences (for example, pairs involving “the” or “is”) don’t dominate training, while rare, noisy co-occurrences get less weight.
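
The four steps can be sketched end to end on a toy corpus. Several things here are assumptions for illustration: the two-sentence corpus, the window of 1, the helper names `weight` and `glove_loss`, and the random untrained vectors. The sketch keeps a separate set of context vectors and biases (`W_ctx`, `b_ctx`), which is how the GloVe paper sets up the model even though the formula above writes both sides with the same symbol. The values `x_max=100` and `alpha=0.75` are the ones reported in the paper.

```python
import numpy as np

corpus = [["i", "like", "playing", "football"],
          ["i", "like", "watching", "football"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Step 1: symmetric co-occurrence counts within a +/-1 word window.
X = np.zeros((V, V))
window = 1
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# Step 2: P_ij = X_ij / sum_k X_ik, so each row sums to 1.
P = X / X.sum(axis=1, keepdims=True)

# Steps 3-4: weighted least-squares loss over the non-zero counts.
def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x / x_max)^alpha and is capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    i, j = np.nonzero(X)  # zero counts are skipped, since log(0) is undefined
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return float(np.sum(weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2))

rng = np.random.default_rng(0)
D = 5
W = rng.normal(scale=0.1, size=(V, D))
W_ctx = rng.normal(scale=0.1, size=(V, D))
b, b_ctx = np.zeros(V), np.zeros(V)

print(X[idx["i"], idx["like"]])            # 2.0: "i like" occurs in both sentences
print(glove_loss(W, W_ctx, b, b_ctx, X))   # loss before any training
```

Training would then adjust `W`, `W_ctx`, `b`, and `b_ctx` by gradient descent to reduce this loss.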

---

### 3. **Key Features of GloVe**

* **Global + Local Context**: Combines the advantages of matrix factorization (global statistics) and neural embeddings (local context).
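* **Efficient and reusable**: Training works on aggregated co-occurrence counts rather than on every individual context window, and pre-trained GloVe embeddings are available (trained on corpora of billions of tokens, such as Wikipedia and Common Crawl).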
* **Good Semantic Properties**: Captures analogies, synonyms, and relationships well.

Example:

* *“Paris – France + Italy ≈ Rome”*
* *“king – man + woman ≈ queen”*

---

### 4. **Pre-trained GloVe Embeddings**

Commonly available pre-trained vectors:

* 50d, 100d, 200d, 300d dimensions
* Trained on different datasets like:

  * Wikipedia 2014 + Gigaword 5
  * Common Crawl (42B and 840B tokens)
  * Twitter (27B tokens)

You can download from [GloVe website](https://nlp.stanford.edu/projects/glove/).

---

### 5. **Python Example (Using GloVe Pre-trained Vectors)**

```python
import numpy as np

# Load GloVe embeddings
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = vector

# Example
print("Vector for 'king':")
print(embeddings_index['king'])

# Word analogy example: king - man + woman ≈ queen
king = embeddings_index['king']
man = embeddings_index['man']
woman = embeddings_index['woman']

vector = king - man + woman
```

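You would then compare `vector` with all other embeddings, typically by cosine similarity, to find the closest one. After excluding the input words “king”, “man”, and “woman”, the nearest word is usually “queen”.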
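
A minimal sketch of that comparison step follows. The helper `nearest` and the 2-d toy vectors are made up for illustration, so the block runs without downloading the GloVe file.

```python
import numpy as np

def nearest(query, embeddings, exclude=(), topn=1):
    """Return the topn (word, cosine similarity) pairs closest to `query`."""
    q = query / np.linalg.norm(query)
    scored = []
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        scored.append((word, float(np.dot(q, vec) / np.linalg.norm(vec))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 2-d vectors (axis 0 ~ "royalty", axis 1 ~ "gender"), invented for illustration.
toy = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.1]),
}

query = toy["king"] - toy["man"] + toy["woman"]
print(nearest(query, toy, exclude={"king", "man", "woman"}))
```

With the dictionary loaded earlier, the equivalent call would be `nearest(vector, embeddings_index, exclude={"king", "man", "woman"})`. A plain Python loop over several hundred thousand words is slow, so a vectorized matrix product would be the usual choice for real use.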

---

### 6. **Comparison with Word2Vec**

* **Word2Vec**: Predictive, uses neural networks, focuses on local context windows.
* **GloVe**: Count-based, factorizes co-occurrence matrix, uses global corpus statistics.

Both methods produce embeddings with useful semantic relationships.





# FastText

### **What is FastText?**
FastText is a word embedding model developed by **Facebook AI Research (FAIR)**. It extends Word2Vec by representing words not just as atomic units, but as a collection of **character n-grams**.  
This allows FastText to capture **subword information**, making it especially powerful for morphologically rich languages and handling out-of-vocabulary (OOV) words.

---

### **How it Works**
1. **Word Representation**  
   - Unlike Word2Vec, which treats each word as a single entity, FastText breaks a word into character-level **n-grams**.  
   - Example:  
     For the word `"apple"` with n-grams of size 3 (trigrams):  
     - `<ap`, `app`, `ppl`, `ple`, `le>`  
     - Special boundary symbols `< >` are used to distinguish prefixes/suffixes.

2. **Vector Learning**  
   - Each word’s embedding is the sum of its character n-gram embeddings, plus a vector for the whole word itself when the word is in the vocabulary.  
   - Example: `"apple"` embedding = `sum(embedding("<ap"), embedding("app"), embedding("ppl"), embedding("ple"), embedding("le>"), embedding("<apple>"))`.

3. **Training Objective**  
   - Similar to Word2Vec’s **skip-gram with negative sampling (SGNS)**.  
   - Predict context words given the center word, but embeddings come from subwords.
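
The n-gram extraction in step 1 can be sketched as follows. The function name `char_ngrams` is an assumption, and the sketch uses a single n-gram length to match the example above; FastText itself uses a range of lengths and hashes the n-grams into a fixed number of buckets, which is left out here.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>']

# Related words share n-grams, so their summed vectors end up related too.
shared = sorted(set(char_ngrams("running")) & set(char_ngrams("runner")))
print(shared)
# ['<ru', 'run', 'unn']
```

A word never seen in training can still be split this way, which is what lets the model assemble a vector for it from n-grams it has seen.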

---

### **Advantages of FastText**
- **Handles Out-of-Vocabulary (OOV) Words**  
  Since embeddings are built from character n-grams, FastText can generate embeddings for unseen words (unlike Word2Vec and GloVe).
  - Example: `"unhappiness"` may not appear in the training data, but its embedding can still be built from n-grams such as `"<un"`, `"happ"`, and `"ness>"` that were seen in other words.
  
- **Better for Morphologically Rich Languages**  
  Languages like German, Turkish, or Hindi, where words change forms (inflections), benefit a lot because FastText captures subword structure.  

- **Captures Subword Semantics**  
  Prefixes, suffixes, and roots contribute to meaning. `"run"`, `"runner"`, `"running"` will share overlapping n-grams and thus related embeddings.

---

### **Disadvantages**
- **Larger Model Size**  
  Because it stores embeddings for many subword n-grams.  
- **Slower Training**  
  Compared to Word2Vec since more embeddings need to be learned.

---

### **Example in Python (Using Gensim)**
```python
from gensim.models import FastText

# Sample corpus
sentences = [
    ["fasttext", "is", "great"],
    ["word", "embeddings", "are", "useful"],
    ["nlp", "involves", "words"]
]

# Train FastText model
model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

# Get word vector
print(model.wv['fasttext'])

# Get embedding for an OOV word
print(model.wv['fasttexts'])  # Still works because it’s built from subwords
```

---

### **Comparison with Word2Vec and GloVe**
| Feature                | Word2Vec        | GloVe           | FastText        |
|------------------------|-----------------|-----------------|-----------------|
| Uses co-occurrence?    | Local (context) | Global matrix   | Local (context) |
| Subword info           | ❌              | ❌              | ✅              |
| Handles OOV words      | ❌              | ❌              | ✅              |
| Training speed         | Fast            | Medium          | Slower          |
| Language suitability   | Languages with simple morphology (e.g., English) | Languages with simple morphology (e.g., English) | Also suits morphologically rich languages |




# Pre-trained Word Embeddings

### 1. What are Pre-trained Word Embeddings?  
Pre-trained word embeddings are **vector representations of words** that have been trained on a very large text corpus (like Wikipedia, Common Crawl, Google News, etc.) and are shared for reuse. Instead of training embeddings from scratch (which needs huge data and computation), we can directly use these pre-trained embeddings.  

They capture **semantic meaning** of words, so words with similar meaning have vectors close to each other.  

Examples:
- Word2Vec (Google News vectors: 3 million words and phrases, 300 dimensions)  
- GloVe (trained on Wikipedia + Gigaword, Common Crawl)  
- FastText (trained on Common Crawl, supports subword embeddings)  
- Transformer-based contextual embeddings: BERT, GPT embeddings  

---

### 2. Why Use Pre-trained Embeddings?  
- **Save time and resources**: No need to train from scratch on billions of tokens.  
- **Better performance**: Vectors trained on huge corpora usually generalize better than vectors trained on a small task-specific dataset.  
- **Transfer learning**: Knowledge from general text applies well to specific NLP tasks.  
- **Handle rare words** (FastText, subword models).  

---

### 3. How to Use Pre-trained Embeddings  
There are **two main ways**:  

#### A. Static Word Embeddings (Word2Vec, GloVe, FastText)
- Download pre-trained embeddings.  
- Load them into your NLP pipeline.  
- Replace words in your dataset with the corresponding vectors.  
- Use embeddings as **features** in ML models (logistic regression, SVM, deep learning).  

Example (GloVe with Gensim in Python):  
```python
import gensim.downloader as api

# Load pre-trained GloVe embeddings
glove_vectors = api.load("glove-wiki-gigaword-100")

# Find vector for a word
vector = glove_vectors['computer']

# Find most similar words
similar_words = glove_vectors.most_similar('computer', topn=5)
print(similar_words)
```
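
One common way to turn word vectors into **features** for a classical model is to average them over a document. The sketch below assumes a helper named `document_vector` and a tiny made-up lookup table, so it runs without any download.

```python
import numpy as np

def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Tiny stand-in lookup table; in practice this would be the loaded embeddings.
toy_vectors = {
    "good":  np.array([1.0, 0.0]),
    "great": np.array([0.8, 0.2]),
    "bad":   np.array([-1.0, 0.0]),
}

# "a" and "film" are not in the table, so only "good" and "great" are averaged.
features = document_vector(["a", "good", "great", "film"], toy_vectors, dim=2)
print(features)
```

With the Gensim object loaded above, `document_vector(tokens, glove_vectors, glove_vectors.vector_size)` should work the same way, since `KeyedVectors` supports both `in` checks and `[]` lookups. The resulting fixed-length vector can be fed to logistic regression, an SVM, or a neural network.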

#### B. Contextual Embeddings (BERT, GPT, etc.)
- Unlike Word2Vec/GloVe, BERT embeddings are **context-dependent**.  
- Example: “bank” in *river bank* vs *bank account* → different vectors.  
- Usually used via **transformer libraries** (like Hugging Face `transformers`).  

Example (using BERT for embeddings):  
```python
from transformers import BertTokenizer, BertModel
import torch

# Load BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode text
inputs = tokenizer("NLP is amazing", return_tensors="pt")

# Inference only: skip gradient tracking to save memory
with torch.no_grad():
    outputs = model(**inputs)

# Extract embeddings (last hidden state), one vector per token
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # [batch_size, sequence_length, hidden_dim]
```

---

### 4. When to Use Which  
- **Word2Vec / GloVe / FastText**:  
  Good for classical ML tasks or when deep contextual understanding is not required. Lightweight and fast.  

- **BERT / Transformer embeddings**:  
  Best for tasks like sentiment analysis, question answering, text classification, named entity recognition (NER). More accurate but computationally heavy.  

---

👉 In short: **Pre-trained embeddings give NLP systems a head start**, letting them leverage semantic knowledge from huge datasets instead of reinventing the wheel.  

