<a href="https://colab.research.google.com/github/Mohamedh0/NLP/blob/main/WordEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Revisiting BOW, TF-IDF

#### **Part 1: Recap on BOW and TF-IDF**

In this section, we will simply remind ourselves of the Bag of Words (BOW) and TF-IDF, as the focus will shift to word embeddings later.


#### 1.1 **Bag of Words (BOW)**
- **BOW** is a simple representation in which each document becomes a vector of word counts: one entry per vocabulary word, holding how often that word occurs in the document.
- It disregards grammar and word order but keeps track of word occurrences.

#### 1.2 **TF-IDF**
- **TF-IDF** stands for "Term Frequency-Inverse Document Frequency".
- It weighs words by how important they are in a document, considering the term frequency (TF) and how rare a term is in the entire corpus (IDF).


In [None]:
# Example: Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The dog barks", "The dog runs fast", "The cat sleeps"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag of Words Matrix:\n", X.toarray())

In [None]:
# Example: TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", X_tfidf.toarray())
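
To see where the numbers in the TF-IDF matrix come from, here is a minimal sketch that computes the weights by hand with NumPy. It assumes the default settings of scikit-learn's `TfidfVectorizer` (smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row) and uses a simplified whitespace tokenizer, which is enough for this small corpus. The variable names are illustrative only.

```python
import numpy as np

corpus = ["The dog barks", "The dog runs fast", "The cat sleeps"]

# Simplified tokenization (assumption): lowercase + whitespace split.
docs = [doc.lower().split() for doc in corpus]
vocab = sorted({word for doc in docs for word in doc})

# Raw term counts: one row per document, one column per vocabulary word.
counts = np.array([[doc.count(word) for word in vocab] for doc in docs], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency of each word
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print("Vocabulary:", vocab)
print(np.round(tfidf, 3))
```

Because "the" appears in every document, its IDF is the lowest in the vocabulary, so it receives the smallest weight in each row, while words unique to one document (such as "barks") receive the largest.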

## Word-Embeddings

#### 2.1 **What are Word Embeddings?**
- Word embeddings are dense vector representations of words where similar words have similar vector representations.
- Unlike traditional methods like **BOW** or **TF-IDF**, word embeddings capture semantic meaning and relationships between words, such as synonyms, antonyms, and even analogies.
- Embeddings are learned from context, not just word frequency or importance.

#### 2.2 **Why Are Word Embeddings Important?**
- Word embeddings address a key limitation of BOW and TF-IDF: both treat every word as an independent dimension, so the meaning of a word and its similarity to other words are lost.
- They allow models to use semantic relationships between words. For instance, the vectors for **"king"** and **"queen"** would be closer to each other than the vectors for "king" and "dog".
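
"Closer" is usually measured with cosine similarity. The sketch below illustrates the idea with hand-made 3-dimensional vectors; the numbers are invented for illustration and are not learned embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made vectors, for illustration only (not learned from data).
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "dog":   np.array([0.1, 0.2, 0.9]),
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_dog = cosine_similarity(toy_vectors["king"], toy_vectors["dog"])
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs dog:   {sim_king_dog:.3f}")
```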

#### 2.3 **Types of Word Embeddings**
Some of the famous word embedding algorithms are:
- **Word2Vec**: Learns embeddings based on the context words of a target word (Skip-Gram and CBOW).
- **GloVe**: Uses matrix factorization techniques to find word embeddings based on word co-occurrence.
- **FastText**: An extension of Word2Vec that represents words as bags of character n-grams.

We will focus on **Word2Vec**, a widely used algorithm, in the next steps.

#### 2.4 **Skip-Gram and CBOW (Continuous Bag of Words)**

- **Skip-Gram**: Predicts the surrounding context words given a target word.
  - E.g., Given "dog" as the target, predict words like "barks", "chases", etc.
  
- **CBOW**: Predicts the target word based on the surrounding context.
  - E.g., Given words like "barks", "chases", predict the target word "dog".
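
The two training setups differ only in how the same window is turned into training pairs. Here is a small pure-Python sketch of that difference; the helper names `cbow_pairs` and `skipgram_pairs` are hypothetical and chosen just for this illustration.

```python
def cbow_pairs(tokens, window_size):
    """Return (context_words, target_word) pairs, one per position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        pairs.append((left + right, target))
    return pairs

def skipgram_pairs(tokens, window_size):
    """Return (target_word, context_word) pairs, one per context word."""
    return [(target, word)
            for context, target in cbow_pairs(tokens, window_size)
            for word in context]

tokens = "the dog chases the cat".split()
for context, target in cbow_pairs(tokens, window_size=1):
    print("CBOW     :", context, "->", target)
for target, word in skipgram_pairs(tokens, window_size=1):
    print("Skip-Gram:", target, "->", word)
```

CBOW produces one example per position (many words in, one word out), while Skip-Gram produces one example per context word (one word in, one word out).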

In the next part, we will start building a **Word2Vec model** using a neural network.

### **Building a Word2Vec Model using CBOW**

#### 3.1 **Overview of CBOW**
In the **CBOW** (Continuous Bag of Words) model, we are tasked with predicting the target word based on a given context. The context consists of a fixed-size window of surrounding words.

For example:
- Given the context words: "the", "dog", "the", "cat" (two words on each side in "the dog chases the cat")
- The model tries to predict the target word: "chases"

#### 3.2 **Steps to Build the CBOW Model**
We will implement CBOW from scratch using a neural network with the following steps:
1. **Data Preparation**: Convert the text into pairs of context-target words.
2. **Create Vocabulary**: Map words to unique integers (word index).
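3. **Input Layer**: The context words are conceptually one-hot encoded; in the code they are passed as integer indices, which an embedding layer uses to look up the matching rows directly.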
4. **Hidden Layer**: The embeddings will be learned in this layer.
5. **Output Layer**: Use softmax to predict the target word.
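
Steps 3 and 4 are closely linked: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, and that row is the word's embedding. The following sketch checks this with NumPy, using a small random matrix `W` as a stand-in for learned weights.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))  # one row per word (random stand-in)

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ W    # what a dense layer on a one-hot input computes
via_lookup = W[word_index]  # what an embedding layer does directly

print(np.allclose(via_matmul, via_lookup))
```

This is why an embedding layer can take integer indices as input: the lookup gives the same result as the one-hot multiplication without building the large sparse vectors.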

---

#### 3.3 **Implementing CBOW in Code**

We'll import the required libraries and use a small sample corpus to build this model.

#### 3.4 **What Happens Here:**
1. **Tokenizer**: We convert the words into sequences of integers (a word index).
2. **Context-Target Generation**: For each word in a sequence, we generate a context of surrounding words, and the target is the central word.
3. **Model Architecture**:
   - **Embedding Layer**: This layer learns the word embeddings. The input size is the vocabulary size, and the output size is the embedding dimension (50 in this case).
   - **Flatten Layer**: Concatenates the context word embeddings into a single vector so that it can be fed to a Dense layer.
   - **Dense Layer**: This gives the output probabilities for each word in the vocabulary.
4. **Training**: The model is trained using the context-target pairs.

The embeddings will be learned during the training process. After training, you can extract the word embeddings from the **embedding layer** of the model.
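
To make the architecture concrete, here is a NumPy sketch of a single forward pass through the same three stages (embedding lookup, flatten, dense softmax). The weights are random stand-ins rather than trained values, and the sizes are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, context_len = 9, 50, 4

# Randomly initialized parameters; training would adjust these.
E = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))                # Embedding
W = rng.normal(scale=0.1, size=(context_len * embedding_dim, vocab_size))  # Dense kernel
b = np.zeros(vocab_size)                                                   # Dense bias

def cbow_forward(context_ids):
    """Probability of each vocabulary word being the target."""
    embedded = E[context_ids]      # shape: (context_len, embedding_dim)
    flat = embedded.reshape(-1)    # Flatten: concatenate the context vectors
    logits = flat @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward(np.array([1, 2, 1, 3]))
print(probs.shape, probs.sum())
```

Note that the original Word2Vec CBOW averages the context vectors instead of concatenating them; the Flatten-based variant used in this notebook keeps word positions separate at the cost of a larger Dense layer.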

---


In [None]:
# Import necessary libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Flatten

# Sample corpus
corpus = [
    "the dog barks",
    "the dog chases the cat",
    "the cat sleeps",
    "the dog runs fast"
]

# 1. Data Preparation: Create a context-target pair (CBOW)
window_size = 2  # Context size is 2, meaning 2 words before and 2 after the target

# Tokenizer to convert text to sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
vocab_size = len(tokenizer.word_index) + 1  # Add 1 for padding (if needed)

# Convert words to integer tokens
sequences = tokenizer.texts_to_sequences(corpus)

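# Generate context-target pairs (X is context, y is target).
# Every word is used as a target. Contexts that run past the start or end of a
# sentence are padded with index 0, which the Tokenizer never assigns to a word.
X = []
y = []

for seq in sequences:
    padded = [0] * window_size + seq + [0] * window_size
    for i in range(window_size, len(padded) - window_size):
        context = padded[i-window_size:i] + padded[i+1:i+window_size+1]  # Context words
        target = padded[i]  # Target word (always a real word, never padding)
        X.append(context)
        y.append(target)

# Every context now has exactly 2 * window_size entries, so X is a 2-D array
X = np.array(X)
y = np.array(y)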

# 2. Create the CBOW Model
embedding_dim = 50  # Size of the embedding vector

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=window_size * 2))
model.add(Flatten())
model.add(Dense(vocab_size, activation='softmax'))

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# 3. Train the Model
model.summary()
model.fit(X, y, epochs=100, verbose=1)

# Now the embeddings are learned!

## Skip-Gram, N-Skip-Gram

In [None]:
# Skip-Gram model implementation

# Context size (window size)
window_size = 2

# Generate context-target pairs for Skip-Gram
X_skipgram = []
y_skipgram = []

for seq in sequences:
    for i, target in enumerate(seq):  # Use every word as a target
        # Context words: up to window_size on each side, clipped at the sentence edges
        context = seq[max(0, i-window_size):i] + seq[i+1:i+window_size+1]
        for word in context:
            X_skipgram.append([target])
            y_skipgram.append(word)

X_skipgram = np.array(X_skipgram)
y_skipgram = np.array(y_skipgram)

# Build Skip-Gram model
model_skipgram = Sequential()
model_skipgram.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=1))
model_skipgram.add(Flatten())
model_skipgram.add(Dense(vocab_size, activation='softmax'))

model_skipgram.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the Skip-Gram model
model_skipgram.summary()
model_skipgram.fit(X_skipgram, y_skipgram, epochs=100, verbose=1)

#### 4.1 **Overview of Skip-Gram**
In the **Skip-Gram** model, the goal is to predict the context words given a target word. The target word is the central word in the context window, and the model learns to predict the surrounding words.

For example:
- Given the target word: "dog"
- The model tries to predict words like "barks", "chases", etc.

This is the reverse of the **CBOW** approach, where context predicts the target.

#### 4.2 **Skip-Gram Implementation**

The Skip-Gram code cell above follows the same pattern as the CBOW implementation; the difference is that the central (target) word is the input and the context words are what the model predicts.


#### 4.3 **What Happens in Skip-Gram:**
- **Input**: The input is a single target word (central word) in the context window.
- **Output**: The output is the probability distribution of context words. The model tries to maximize the probability of predicting the context words given the target word.

The architecture here is quite similar to the CBOW, with the main difference being that Skip-Gram predicts multiple context words for a single target word.

---

In [None]:
# N-Skip-Gram implementation
N = 3  # We can vary N to predict more context words

X_nskipgram = []
y_nskipgram = []

for seq in sequences:
    for i, target in enumerate(seq):  # Use every word as a target
        # Larger context window (N words on each side). max() keeps the start
        # index non-negative so the slice cannot wrap around to the end of seq.
        context = seq[max(0, i-N):i] + seq[i+1:i+N+1]
        for word in context:
            X_nskipgram.append([target])
            y_nskipgram.append(word)

X_nskipgram = np.array(X_nskipgram)
y_nskipgram = np.array(y_nskipgram)

# Build N-Skip-Gram model
model_nskipgram = Sequential()
model_nskipgram.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=1))
model_nskipgram.add(Flatten())
model_nskipgram.add(Dense(vocab_size, activation='softmax'))

model_nskipgram.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the N-Skip-Gram model
model_nskipgram.summary()
model_nskipgram.fit(X_nskipgram, y_nskipgram, epochs=100, verbose=1)

### **Part 5: N-Skip-Gram Model**

The **N-Skip-Gram** model, as used in this notebook, is Skip-Gram with a wider context window: each target word is paired with up to **N** words on either side. Every (target, context) pair is still predicted individually; only the number of pairs per target word grows.

#### 5.1 **Difference with Skip-Gram**:
- **Skip-Gram**: Pairs each target word with the context words inside a window of `window_size` words on each side. For every pair, the model outputs a probability distribution over the vocabulary.
- **N-Skip-Gram**: Uses the same model and the same kind of training pairs, but draws the context from a window of **N** words on each side (N = 3 in the code, versus `window_size = 2`).

The N-Skip-Gram code cell above implements this by widening the context slice to N words on each side of the target.


---

### **Part 6: Difference Between Skip-Gram and N-Skip-Gram**

- **Skip-Gram**: The context window is small and fixed, so each target word is trained against only its nearest neighbours, one context word at a time.
- **N-Skip-Gram**: The window is widened to N words on each side. Each target word produces more training pairs and sees a broader context, at the cost of more pairs to process during training.
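
The effect of the window size can be checked by counting training pairs. This is a minimal pure-Python sketch; `windowed_pairs` is a hypothetical helper written for this illustration, not part of any library.

```python
def windowed_pairs(seq, window):
    """(target, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(seq):
        start = max(0, i - window)  # clamp so the slice never wraps around
        context = seq[start:i] + seq[i + 1:i + 1 + window]
        pairs.extend((target, word) for word in context)
    return pairs

sentence = "the dog chases the cat".split()
pair_counts = {}
for window in (1, 2, 3):
    pair_counts[window] = len(windowed_pairs(sentence, window))
    print(f"window={window}: {pair_counts[window]} pairs")
```

For short sentences the gain flattens quickly, because the window is clipped at the sentence boundaries.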

---

## Pre-Trained Embeddings

### **Part 7: Pre-trained Word Embeddings**

#### 7.1 **What Are Pre-trained Word Embeddings?**

Pre-trained word embeddings are vector representations of words that have been learned from large corpora of text and can be used directly in downstream tasks like text classification, sentiment analysis, etc.

The benefit of pre-trained embeddings is that they capture rich semantic relationships between words, which can significantly improve performance in many NLP tasks compared to training embeddings from scratch.

#### 7.2 **Types of Pre-trained Word Embeddings**

1. **Word2Vec (Skip-Gram and CBOW)**:
   - Word2Vec is an unsupervised learning algorithm used to learn word embeddings from a large corpus.
   - The two models in Word2Vec are **Skip-Gram** and **CBOW** (which we discussed earlier).

   The **Skip-Gram model** minimizes the following objective function (the negative log-likelihood of the context words):
   \[
   J(\theta) = -\sum_{t=1}^{T} \sum_{-C \leq j \leq C, j \neq 0} \log p(w_{t+j} | w_t)
   \]
   - **Target Word**: \(w_t\)
   - **Context Words**: \(w_{t+j}\) (within the context window \(C\))
   - **Probability**: \(p(w_{t+j} | w_t)\) is the conditional probability of observing the context word given the target word.
   
   For the **CBOW** model, the goal is to predict the target word based on the context words. The objective to minimize is:
   \[
   J(\theta) = -\sum_{t=1}^{T} \log p(w_t \mid \text{context}_t)
   \]
   where the context is the surrounding words and the target is the center word.

   - **Objective**: Minimizing \(J(\theta)\) is equivalent to maximizing the likelihood of the context given the target (Skip-Gram) or of the target given the context (CBOW).
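
As a rough numerical illustration of the Skip-Gram objective, the sketch below evaluates \(J(\theta)\) on a toy sequence. It assumes the standard full-softmax definition of \(p(w_{t+j} | w_t)\) with separate input and output vectors; the names `V_in` and `V_out` and all sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, C = 6, 4, 2

V_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # target (input) vectors
V_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context (output) vectors

def log_softmax(scores):
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())

def skipgram_loss(sequence):
    """J = -sum_t sum_{j != 0, |j| <= C} log p(w_{t+j} | w_t)."""
    loss = 0.0
    for t, target in enumerate(sequence):
        log_p = log_softmax(V_out @ V_in[target])  # log p(. | w_t) over the vocabulary
        for j in range(-C, C + 1):
            if j != 0 and 0 <= t + j < len(sequence):
                loss -= log_p[sequence[t + j]]
    return loss

sequence = [0, 1, 2, 0, 3]  # a toy sentence of word indices
loss = skipgram_loss(sequence)
print("Skip-Gram loss:", loss)
```

With small random vectors the predicted distribution is close to uniform, so the loss sits near (number of pairs) × ln(vocabulary size); training lowers it by raising the probability of the observed context words.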

---

2. **GloVe (Global Vectors for Word Representation)**:
   - GloVe is a word embedding model based on **matrix factorization**. It uses the global word co-occurrence statistics of a corpus.
   - The objective of GloVe is to factorize the word co-occurrence matrix \(X\) (where each element \(X_{ij}\) represents the number of times word \(i\) appears in the context of word \(j\)).
   
   The equation for GloVe is:
   \[
   J(\theta) = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
   \]
   - **Objective**: Minimize the weighted squared error between the log co-occurrence count (\(\log X_{ij}\)) and the model's prediction (\(w_i^T \tilde{w}_j + b_i + \tilde{b}_j\)). The weighting function \(f(X_{ij})\) balances the influence of rare and frequent co-occurrences and is zero when \(X_{ij} = 0\), so pairs that never co-occur are skipped.
   - **\(w_i\)** and **\(\tilde{w}_j\)**: The word vector for word \(i\) and the separate context vector for word \(j\).
   - **\(b_i\)** and **\(\tilde{b}_j\)**: The corresponding bias terms.

   GloVe attempts to learn embeddings such that the dot product \(w_i^T \tilde{w}_j\), plus the bias terms, is close to the log of the co-occurrence count.
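
The GloVe loss can be evaluated directly for a small co-occurrence matrix. The sketch below is an illustration under stated assumptions: it keeps separate context vectors and biases (`W_ctx`, `b_ctx`) as in the original GloVe formulation, uses the weighting function \(f(x) = (x / x_{max})^{\alpha}\) capped at 1 with \(x_{max} = 100\) and \(\alpha = 0.75\), and works on made-up counts.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows as (x / x_max)^alpha, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted squared error over all word pairs with a non-zero count."""
    i, j = np.nonzero(X)  # skip pairs that never co-occur
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))

# Toy symmetric co-occurrence counts for a 3-word vocabulary (made up).
X = np.array([[0.0, 4.0, 1.0],
              [4.0, 0.0, 2.0],
              [1.0, 2.0, 0.0]])

rng = np.random.default_rng(0)
dim = 5
W, W_ctx = rng.normal(scale=0.1, size=(2, 3, dim))  # word and context vectors
b, b_ctx = np.zeros(3), np.zeros(3)
print("GloVe loss:", glove_loss(W, W_ctx, b, b_ctx, X))
```

Training would adjust `W`, `W_ctx`, `b`, and `b_ctx` by gradient descent to drive this value down.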

---

3. **FastText**:
   - FastText is an extension of Word2Vec where each word is represented as a bag of character n-grams. This allows the model to generate embeddings for **out-of-vocabulary words** (words not seen during training).
   
   The main difference from Word2Vec is that it uses subword information: a word \(w\) is broken into character n-grams.
   - For example, with n = 3 and the boundary markers `<` and `>`, the word “apple” becomes `<ap`, `app`, `ppl`, `ple`, `le>`, plus the whole word `<apple>` (the exact set depends on the n-gram sizes chosen).

   The embedding for a word is a sum of the embeddings of all its subword n-grams.

   Written as an equation, the embedding of a word is:
   \[
   \text{Emb}(w) = \sum_{g \in \text{ngram}(w)} \text{Emb}(g)
   \]
   where \(\text{ngram}(w)\) is the set of character n-grams derived from the word \(w\).
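
Here is a small sketch of the subword idea. It assumes `<` and `>` as word-boundary markers and n-gram lengths from 3 to 6, and it uses a hypothetical dictionary of random vectors (`ngram_vectors`) in place of trained n-gram embeddings. The real FastText library hashes n-grams into a fixed number of buckets, which this sketch does not attempt to reproduce.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` with '<' and '>' as boundary markers."""
    marked = f"<{word}>"
    grams = [marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)]
    if marked not in grams:
        grams.append(marked)  # the whole word is kept as its own unit
    return grams

rng = np.random.default_rng(0)
dim = 4
ngram_vectors = {}  # hypothetical lookup table, filled lazily with random vectors

def word_vector(word):
    """Emb(w): sum of the vectors of all distinct n-grams of w."""
    total = np.zeros(dim)
    for gram in dict.fromkeys(char_ngrams(word)):  # de-duplicate, keep order
        if gram not in ngram_vectors:
            ngram_vectors[gram] = rng.normal(size=dim)
        total += ngram_vectors[gram]
    return total

print(char_ngrams("apple", n_min=3, n_max=3))
print(word_vector("apple"))
```

Because "apple" and an unseen word such as "apples" share most of their n-grams, a vector can be assembled for the unseen word from pieces learned during training.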

---

#### 7.3 **How to Use Pre-trained Word Embeddings**

Using pre-trained embeddings involves loading an existing model that has been trained on a large corpus, such as **Google's Word2Vec**, **GloVe**, or **FastText**, and using these vectors for tasks like similarity measurement, text classification, etc.

Here’s an example of how to download pre-trained **GloVe** embeddings and load them into a dictionary of word vectors:

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
!unzip glove.6B.zip

In [None]:
# Load GloVe pre-trained embeddings
import numpy as np

# Define the embedding dimension and file path for GloVe (e.g., 100-dimensional vectors)
embedding_dim = 100
glove_file = 'glove.6B.100d.txt'

# Load the GloVe word vectors into a dictionary
embeddings_index = {}
with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print(f"Loaded {len(embeddings_index)} word vectors.")

# Example: Access the embedding for the word 'king'
embedding_king = embeddings_index['king']
print("Embedding for 'king':", embedding_king)
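
To use the loaded vectors in a Keras model, they are typically arranged into a matrix whose row `i` holds the vector of the word with index `i`. The sketch below shows one way to build such a matrix. The helper `build_embedding_matrix` is hypothetical, and tiny made-up dictionaries stand in for `tokenizer.word_index` and `embeddings_index` so the example runs without the GloVe download.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, embedding_dim):
    """Row i holds the pre-trained vector of the word with index i.

    Row 0 (padding) and words missing from the pre-trained table stay all-zero.
    """
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

# Tiny stand-ins so the sketch runs without the GloVe file (made-up values).
demo_embeddings = {
    "dog": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([0.2, 0.1, 0.4], dtype="float32"),
}
demo_word_index = {"the": 1, "dog": 2, "cat": 3, "zorp": 4}

embedding_matrix = build_embedding_matrix(demo_word_index, demo_embeddings, 3)
print(embedding_matrix)
```

With the real data, the same call would take `tokenizer.word_index`, `embeddings_index`, and `embedding_dim`. The resulting matrix can then initialize an `Embedding` layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)` together with `trainable=False` to keep the pre-trained vectors fixed.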