# Embeddings

## Embeddings in Natural Language Processing (NLP)

In the context of natural language processing (NLP), "embeddings" refer to dense vector representations of words (or sometimes phrases and sentences) in a continuous vector space. These vector representations are learned through unsupervised machine learning techniques like Word2Vec, GloVe, or FastText, where words with similar meanings or appearing in similar contexts are mapped to vectors that are close together in the vector space.

![Embeddings](https://miro.medium.com/max/600/1*UCKRYEj85S3eH1uv1vFfCw.gif)

## Limitations of Traditional Word Representations

### One-Hot Encoding
Traditionally, words have been represented using one-hot encoding, where each word is represented as a sparse binary vector. In this representation, there is a 1 in the position corresponding to the word's index in the vocabulary and 0s everywhere else. However, one-hot encoded vectors have several limitations:

![One hot encoding](https://miro.medium.com/v2/resize:fit:1400/1*ggtP4a5YaRx6l09KQaYOnw.png)

- **High Dimensionality:** One-hot encoded vectors are very high-dimensional, with the dimensionality equal to the size of the vocabulary. This leads to increased computational complexity and storage requirements.
- **Lack of Semantic Information:** One-hot vectors do not capture any semantic relationships between words. Each word is treated as an isolated entity with no notion of similarity or relatedness to other words.
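
Both limitations can be seen in a few lines of plain Python. The sketch below uses a made-up four-word vocabulary (an assumption for illustration); the vector length grows with the vocabulary, and the dot product between any two different words is always zero, so no similarity is encoded.

```python
# Toy vocabulary; the words are illustrative, not taken from a real corpus.
vocab = ["cat", "dog", "mat", "sat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's index and 0s everywhere else."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cat, dog = one_hot("cat"), one_hot("dog")
print(cat)            # [1, 0, 0, 0]
print(len(cat))       # 4 (one dimension per vocabulary word)
print(dot(cat, dog))  # 0 (distinct words are always orthogonal)
```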

## Advantages of Word Embeddings

Embeddings address the limitations of one-hot encoding and offer several advantages in NLP:

### 1. Low-Dimensional Dense Representations
Word embeddings are low-dimensional dense vectors, typically 50 to 300 dimensions regardless of vocabulary size. This makes them far cheaper to store and compute with than one-hot vectors, whose length equals the size of the vocabulary.

### 2. Semantic Relationships
Embeddings capture semantic relationships between words. Words with similar meanings or appearing in similar contexts will have similar vector representations, enabling models to understand the meaning and context of words.
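
"Similar" is usually measured with cosine similarity. The following is a minimal sketch using hand-picked, hypothetical 3-dimensional vectors (not output from any trained model) to show how related words score higher than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand purely for illustration.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(sim_cat_kitten > sim_cat_car)  # True
```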

### 3. Generalization
Word embeddings allow NLP models to generalize better across different tasks and datasets. Pre-trained word embeddings can be used as features for various downstream tasks, even if the training data for the downstream task is limited.

### 4. Out-of-Vocabulary (OOV) Words
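Classic word-level models such as Word2Vec and GloVe have no vector for a word that never appeared in the training data. Subword-based models such as FastText mitigate this: they compose a word's vector from its character n-grams, so an unseen word can still get a representation from the pieces it shares with known words.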
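
As a rough sketch of the subword idea used by FastText-style models, the helper below (a hypothetical function, not part of any library) splits a word into character n-grams after wrapping it in boundary markers. An unseen word such as "cats" shares several n-grams with a known word such as "cat", which is what lets a subword model build a vector for it.

```python
def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("cats", 3, 3))  # ['<ca', 'cat', 'ats', 'ts>']

# N-grams shared between a known word and an unseen variant of it.
shared = set(char_ngrams("cat")) & set(char_ngrams("cats"))
print(sorted(shared))  # ['<ca', '<cat', 'cat']
```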

### 5. Efficiency
Once trained, word embeddings can be efficiently stored and reused, which is especially important for large-scale NLP applications.

### 6. Capturing Analogies
Word embeddings can capture analogical relationships like "king" - "man" + "woman" ≈ "queen," allowing models to perform analogy-based reasoning.
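
The arithmetic behind such analogies can be sketched with toy vectors. The 2-dimensional values below are hypothetical and hand-built so that one axis loosely stands for "royalty" and the other for "gender"; real embeddings are learned and far less tidy.

```python
import math

# Hypothetical vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, as analogy evaluations usually do.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```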

In summary, embeddings are a powerful tool in NLP, offering a more efficient, semantically rich, and generalizable way to represent words compared to traditional methods like one-hot encoding.

## Word2Vec

Word2Vec is a popular technique for learning word embeddings. It was introduced by researchers at Google in 2013 and has since become one of the foundational techniques in natural language processing (NLP) and related fields.

The basic idea behind Word2Vec is to place each word in a continuous vector space so that words with similar meanings or contexts lie close to each other. This rests on the distributional hypothesis, which posits that words appearing in similar contexts tend to have similar meanings. For example, in the sentences "I love cats" and "I adore felines," the words "love" and "adore" sit in the same position between a subject and an object, so a model trained on many such sentences learns similar vectors for them.

Word2Vec can be trained using two main architectures: Continuous Bag of Words (CBOW) and Skip-gram. Let's explore each of these in detail:

<img src="../../assets/cbow_skipgram.png">

### 1. Continuous Bag of Words (CBOW)

CBOW predicts a target word from its surrounding context words. The context window size determines how many words before and after the target word are used as context.

#### Example:
Consider the sentence: "The cat sat on the mat." If we set the context window size to 2 and assume "sat" is the target word, CBOW will use the context words "The," "cat," "on," and "the" to predict the word "sat."
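
The window selection in this example can be reproduced with a small helper. `context_window` is a hypothetical function written for illustration, and the sentence is split by hand rather than with a tokenizer.

```python
def context_window(tokens, target_index, window=2):
    """Return up to `window` tokens on each side of the target position."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

tokens = ["The", "cat", "sat", "on", "the", "mat"]
target_index = tokens.index("sat")
print(context_window(tokens, target_index))  # ['The', 'cat', 'on', 'the']
```

Near the start or end of a sentence the window is simply cut short, so the first word only has context on its right.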

#### Architecture:
The architecture involves the following steps:
- Convert the context words to their corresponding word embeddings.
- Average these embeddings to create a context vector.
- Use this context vector as input to a neural network to predict the target word.
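
The three steps above can be sketched as a single forward pass. This is a simplified illustration, not how `gensim` implements it: the parameters are random and untrained, the names `input_embeddings` and `output_weights` are assumptions, and a full softmax is used where practical implementations typically rely on negative sampling or hierarchical softmax.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4  # toy embedding size; real models typically use 50 to 300

# Randomly initialized, untrained parameters (hypothetical names).
input_embeddings = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
output_weights = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def cbow_forward(context_words):
    """Return a probability for every vocabulary word being the target."""
    # Step 1: look up the embedding of each context word.
    vectors = [input_embeddings[w] for w in context_words]
    # Step 2: average them into a single context vector.
    context = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Step 3: score every vocabulary word, then normalize with a softmax.
    scores = {w: sum(a * b for a, b in zip(context, output_weights[w])) for w in vocab}
    max_score = max(scores.values())  # subtracted for numerical stability
    exp_scores = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exp_scores.values())
    return {w: e / total for w, e in exp_scores.items()}

probabilities = cbow_forward(["the", "cat", "on", "the"])
print(probabilities)
```

Training would then adjust both sets of parameters so that the probability assigned to the true target word ("sat" here) increases.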

## Implementing Word2Vec in Python

The third-party Python package `gensim` makes implementing Word2Vec straightforward. Here's how you can get started with it:

### Installation
First, install the `gensim` package if you haven't already:

```bash
pip install gensim
```

### Explanation of Parameters:
- `sentences`: The input data, which is a list of tokenized sentences.
- `vector_size`: The dimensionality of the word vectors.
- `window`: The maximum distance between the current and predicted word within a sentence.
- `min_count`: Ignores all words with total frequency lower than this.
- `workers`: The number of worker threads to train the model.

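With `gensim`, training a Word2Vec model takes only a few lines. The example below trains on a toy corpus of eight sentences, which is far too small to learn meaningful similarities, so treat its output as illustrative only. Because `sg` is not set, `gensim` uses its default architecture, CBOW.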

```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Download the NLTK tokenizer data if not already done
# (recent NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')

# List of sentences for training (not yet tokenized)
sentences = [
    "I love machine learning",
    "Natural language processing is exciting",
    "Word2Vec creates word embeddings",
    "Gensim is a useful library",
    "Deep learning is a subset of machine learning",
    "Embeddings capture semantic relationships",
    "Python is a popular programming language",
    "Artificial intelligence and machine learning are related fields"
]
```

```text
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
```


```python
# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
print(tokenized_sentences[0])
```

```text
['I', 'love', 'machine', 'learning']
```


```python
# Train the model on the tokenized sentences
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=2, min_count=1, workers=4)
```

```python
# Return a word's vector and its most similar words from a trained model
def find_similar_words(model, word, top_n=5):
    vector = model.wv[word]
    similar_words = model.wv.most_similar(positive=[word], topn=top_n)
    return vector, similar_words
```

```python
# Get the vector and most similar words for a specific word
target_word = 'machine'
vector, similar_words = find_similar_words(model, target_word)

print("Vector for 'machine':", vector)
print("Words most similar to 'machine':", similar_words)
```

```text
Vector for 'machine': [ 9.0809692e-05  3.0832055e-03 -6.8151313e-03 -1.3689728e-03
  7.6685268e-03  7.3423618e-03 -3.6741595e-03  2.6473312e-03
 -8.3173197e-03  6.2057734e-03 -4.6351124e-03 -3.1670930e-03
  9.3112951e-03  8.7239651e-04  7.4903150e-03 -6.0771578e-03
  5.1645460e-03  9.9195987e-03 -8.4572462e-03 -5.1375022e-03
 -7.0665088e-03 -4.8636729e-03 -3.7799729e-03 -8.5374974e-03
  7.9519451e-03 -4.8466586e-03  8.4186336e-03  5.2713170e-03
 -6.5517426e-03  3.9549218e-03  5.4736012e-03 -7.4305790e-03
 -7.4054408e-03 -2.4740247e-03 -8.6299535e-03 -1.5781232e-03
 -3.9694359e-04  3.3004046e-03  1.4376161e-03 -8.7451038e-04
 -5.5918437e-03  1.7300018e-03 -8.9923030e-04  6.7969901e-03
  3.9745839e-03  4.5312811e-03  1.4351372e-03 -2.7016769e-03
 -4.3661897e-03 -1.0324767e-03  1.4385569e-03 -2.6458562e-03
 -7.0720618e-03 -7.8036557e-03 -9.1262041e-03 -5.9363050e-03
 -1.8445110e-03 -4.3226061e-03 -6.4571970e-03 -3.7157002e-03
  4.2899637e-03 -3.7400872e-03  8.3837649e-03  1.5315602e-03
 ...
```

### 2. Skip-gram

Skip-gram works in the opposite direction to CBOW: it predicts the context words given a target word. It tends to work well with smaller datasets and to represent infrequent words better.

#### Example:
Using the same sentence "The cat sat on the mat" and assuming "sat" is the target word with a context window size of 2, Skip-gram will try to predict the context words "The," "cat," "on," and "the" from the target word "sat."

#### Architecture:
The architecture involves the following steps:
- Convert the target word to its corresponding word embedding.
- Use this embedding to predict the context words through a neural network.
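
One way to picture what Skip-gram trains on is the list of (target, context) pairs it extracts from a sentence. The helper below is a hypothetical sketch of that pair generation with a fixed window; it is not `gensim`'s internal code, which among other things samples the effective window size.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens)
sat_pairs = [pair for pair in pairs if pair[0] == "sat"]
print(sat_pairs)
# [('sat', 'The'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

Each pair is one training example: the model sees the target word's embedding and is asked to assign a high score to the paired context word.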

### Explanation of Parameters:
- The other parameters are the same as those described above for CBOW.
- `sg`: Selects the training algorithm. `sg=1` uses Skip-gram; `sg=0` (the default) uses CBOW.

```python
# Skip-gram model
skipgram_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=2, sg=1, min_count=1, workers=4)
```

```python
# Get the vector and most similar words for a specific word
target_word = 'machine'
vector, similar_words = find_similar_words(skipgram_model, target_word)

print("Vector for 'machine':", vector)
print("Words most similar to 'machine':", similar_words)
```

```text
Vector for 'machine': [ 8.7139240e-05  3.0821874e-03 -6.8086009e-03 -1.3650591e-03
  7.6718316e-03  7.3448555e-03 -3.6722927e-03  2.6453969e-03
 -8.3143059e-03  6.2077623e-03 -4.6357708e-03 -3.1756633e-03
  9.3123931e-03  8.7520888e-04  7.4898605e-03 -6.0778940e-03
  5.1586907e-03  9.9251298e-03 -8.4566930e-03 -5.1376238e-03
 -7.0603210e-03 -4.8608501e-03 -3.7769042e-03 -8.5377069e-03
  7.9546170e-03 -4.8412192e-03  8.4215924e-03  5.2663768e-03
 -6.5552192e-03  3.9525498e-03  5.4742219e-03 -7.4337036e-03
 -7.4008247e-03 -2.4778158e-03 -8.6272350e-03 -1.5785688e-03
 -3.9278960e-04  3.3041148e-03  1.4385493e-03 -8.7304221e-04
 -5.5887694e-03  1.7217635e-03 -9.0340531e-04  6.8027633e-03
  3.9759362e-03  4.5307246e-03  1.4332269e-03 -2.7039147e-03
 -4.3643410e-03 -1.0367834e-03  1.4403542e-03 -2.6488120e-03
 -7.0668710e-03 -7.8039132e-03 -9.1294488e-03 -5.9293797e-03
 -1.8407891e-03 -4.3247347e-03 -6.4607803e-03 -3.7125184e-03
  4.2900951e-03 -3.7411901e-03  8.3826398e-03  1.5306490e-03
 ...
```