# Word2Vec

Word2Vec is a widely used technique for learning **vector representations of words** from large text datasets. Introduced by Mikolov et al. (2013), it has become a cornerstone in Natural Language Processing (NLP) for word representation.

Word2Vec operates using two primary architectures:
* **Continuous Bag of Words (CBOW)**
* **Skip-Gram**



| Feature               | Continuous Bag of Words (CBOW)                                     | Skip-Gram                                                        |
| :-------------------- | :----------------------------------------------------------------- | :---------------------------------------------------------------- |
| **Objective** | Predict a **target word** from its surrounding context words.      | Predict **context words** from a given target word.             |
| **How it Works** | Given a window of context words, it tries to predict the central word. | Given a central word, it tries to predict the surrounding context words. |
| **Example** | In the phrase "The cat is in the", CBOW takes the context words "The", "is", "in", "the" and predicts the target word "cat". | Given the central word "cat", Skip-Gram predicts each of the context words "The", "is", "in", "the". |
| **Training Speed** | Generally **faster** to train.                                     | Generally **slower** to train.                                  |
| **Quality of Embeddings** | Tends to give slightly better accuracy for **frequent words**. | Tends to work well with **smaller amounts of training data** and represents **rare words** better. |
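
As a minimal sketch of the difference, the snippet below slices one sentence into training examples for each objective. The helper names `cbow_examples` and `skipgram_examples`, the window size of 2, and the closing word "house" are illustrative assumptions, not part of any Word2Vec API.

```python
def cbow_examples(tokens, window):
    """Yield (context_words, target_word): one example per position."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target


def skipgram_examples(tokens, window):
    """Yield (target_word, context_word): one example per neighbor."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word


# Hypothetical sentence used only for illustration.
tokens = ["the", "cat", "is", "in", "the", "house"]
cbow = list(cbow_examples(tokens, window=2))
skipgram = list(skipgram_examples(tokens, window=2))

print(cbow[1])       # the CBOW example whose target is "cat"
print(skipgram[:2])  # the first two Skip-Gram pairs
```

CBOW produces one example per position (many inputs, one output), while Skip-Gram produces one example per neighbor. Because Skip-Gram makes more predictions from the same text, it is usually the slower of the two to train.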

According to Almeida and Xexéo (2019), the development of **prediction-based models for word representations (embeddings)** is intrinsically linked to the history of **Neural Network Language Models (NNLMs)**, as embeddings were initially conceived as the raw vector projection in the first "representation layer" of these models.

The history of NNLMs has primarily been one of **gradual efficiency gains**, occasional insights, and trade-offs between model complexity and the ability to train with larger datasets. Early results showed NNLMs outperformed n-gram based predecessors in language modeling, but **long training times (days to weeks)** were a significant hurdle.

Mikolov et al. (2013) noted that while methods like LSA and LDA existed for creating continuous word representations, their Word2Vec studies focused on **distributed word representations learned by neural networks**. These neural network-based embeddings demonstrated superior performance in preserving linear relationships between words compared to LSA. Furthermore, LDA proved computationally expensive for large datasets. To evaluate model architectures, Mikolov et al. (2013) introduced a **computational complexity metric** measuring the number of parameters accessed during training. The goal was to maximize model accuracy while minimizing computational complexity, striking a balance between performance and efficiency in word representation.

The log-linear models proposed by Mikolov et al. (2013) are two new architectures designed to learn distributed word representations with reduced computational complexity. A key observation was that most of the complexity in earlier models stemmed from the nonlinear hidden layer. While this nonlinearity is what makes neural networks so expressive, the researchers opted to explore **simpler models**. Although these simpler models might not represent data as precisely as full neural networks, they could be trained **much more efficiently on significantly larger datasets**.

| Concept        | Latent Semantic Analysis (LSA)                                     | Latent Dirichlet Allocation (LDA)                                |
| :------------- | :----------------------------------------------------------------- | :--------------------------------------------------------------- |
| **Type** | Statistical method (matrix factorization)                          | Generative probabilistic model                                   |
| **Objective** | Discover relationships between terms and documents by identifying "latent concepts" (topics). | Discover abstract "topics" within a collection of documents and assign topics to documents. |
| **How it Works** | Builds a term-document matrix (word frequencies in documents) and applies **Singular Value Decomposition (SVD)** to reduce dimensionality and reveal underlying semantic structures. | Assumes each document is a **mixture of topics**, and each topic is a **mixture of words**. Infers these underlying distributions. |
| **Core Idea** | Words that appear in similar contexts have similar meanings. Uncovers latent semantic relationships by representing words and documents in a lower-dimensional "concept space". | Documents are composed of a distribution of topics, and topics are characterized by a distribution of words. Aims to reverse-engineer this generative process. |
| **Strengths** | - Effective for dimensionality reduction.<br>- Can find synonyms and related terms.<br>- Computationally less expensive than some neural models for certain tasks. | - Probabilistic, offering more interpretable topic mixtures.<br>- Handles polysemy (words with multiple meanings) better than LSA.<br>- Can assign multiple topics to a single document. |
| **Limitations** | - Does not account for word order or context.<br>- Struggles with polysemy (averages out meanings of a word).<br>- Assumes linear relationships. | - Requires specifying the number of topics beforehand.<br>- Computationally more intensive than LSA for very large datasets (though efficient for many topic modeling tasks). |
| **Primary Use** | Information retrieval, document clustering, early topic modeling.   | Topic modeling, document classification, recommender systems.     |
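
To make the LSA column concrete, here is a small sketch of the SVD step using NumPy. The term-document counts are made-up toy data, and the choice of `k = 2` latent dimensions is an assumption for illustration only.

```python
import numpy as np

# Toy term-document count matrix (made-up data):
# rows are terms, columns are documents.
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Singular Value Decomposition, then keep the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]  # each row is a term in the "concept space"


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_related = cosine(term_vectors[0], term_vectors[1])    # "cat" vs "dog"
sim_unrelated = cosine(term_vectors[0], term_vectors[3])  # "cat" vs "stock"
print(f"cat/dog: {sim_related:.2f}, cat/stock: {sim_unrelated:.2f}")
```

Terms that occur in the same documents end up pointing in the same direction in the reduced space, while terms from unrelated documents end up nearly orthogonal.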

The first proposed architecture, Continuous Bag-of-Words (CBOW), resembles a feedforward NNLM but removes the nonlinear hidden layer and shares the projection layer, not just the projection matrix, across all words. As a result, every context word is projected into the same position and the vectors are averaged. The model is called a bag-of-words model because the order of the context words does not affect the projection.

Unlike a conventional NNLM, which conditions only on preceding words, the model also uses words from the future. The best performance on the task reported by Mikolov et al. (2013) was achieved with a log-linear classifier that takes four past and four future words as input. The training objective is to correctly classify the current (middle) word.
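
The forward pass described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the reference implementation: the tiny vocabulary, the dimensionality `D = 4`, the random weights, and the name `cbow_forward` are all hypothetical, and a full softmax stands in for the hierarchical softmax used in the original work.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary and embedding size.
vocab = ["the", "cat", "is", "in", "house"]
word_to_id = {w: i for i, w in enumerate(vocab)}
D = 4

# Input (projection) and output weight matrices, randomly initialised.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in vocab]


def cbow_forward(context_words):
    """Return a probability distribution over the vocabulary for the middle word."""
    rows = [W_in[word_to_id[w]] for w in context_words]
    # Shared projection layer: average the context vectors (order is irrelevant).
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    # Log-linear classifier: one dot product per vocabulary word, then softmax.
    scores = [sum(h * o for h, o in zip(hidden, out_row)) for out_row in W_out]
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = cbow_forward(["the", "is", "in", "the"])
shuffled = cbow_forward(["in", "the", "the", "is"])
print(dict(zip(vocab, (round(p, 3) for p in probs))))
```

Reordering the context words leaves the prediction unchanged (up to floating-point rounding), which is the bag-of-words property. There is no nonlinearity between the averaged projection and the output scores.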

The training complexity is represented by the formula:

$Q = N \times D + D \times \log_2(V)$

Imagine you're trying to guess the **middle word** in a sentence.

This new method is like a super-simple way to do that. Instead of a complicated process, it just **averages out** the meaning of all the words around the middle word (both before and after it). It doesn't care about the order of these surrounding words, which is why it's called a "**bag-of-words**" model – like all the words are just thrown into a bag.

Then, it uses this average meaning to try and figure out what the middle word should be.

The formula you see ($Q = N \times D + D \times \log_2(V)$) is just a way to measure **how much "work"** the computer has to do for each training example.
* $N$ is the number of context words fed in (eight, if you use four before and four after).
* $D$ is the size of each word vector (how many numbers describe a word's "meaning").
* $V$ is the total number of unique words it knows. The $\log_2(V)$ appears because the output layer uses a hierarchical softmax (a binary tree over the vocabulary) instead of scoring every word.

So, in short, it's a straightforward way to predict a word by simply combining the meanings of its neighbors, and the formula tells you how much computing power it needs to learn.
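
To get a feel for the numbers, the sketch below evaluates the formula for one setting. $N = 8$ follows from the four past and four future words mentioned earlier; $D = 300$ and $V = 1{,}000{,}000$ are assumed values chosen only for illustration, and `cbow_complexity` is a hypothetical helper name.

```python
import math


def cbow_complexity(n_context, dim, vocab_size):
    """Q = N*D + D*log2(V): parameters touched per training example."""
    return n_context * dim + dim * math.log2(vocab_size)


n_context, dim, vocab_size = 8, 300, 1_000_000  # D and V are assumed values
q = cbow_complexity(n_context, dim, vocab_size)

# For comparison: scoring every vocabulary word would cost D * V for the output step.
full_output_cost = dim * vocab_size

print(f"Q (CBOW, hierarchical softmax): {q:,.0f}")
print(f"D x V (full softmax output):    {full_output_cost:,}")
```

With these values $Q$ is on the order of a few thousand, while a full softmax output step alone would touch hundreds of millions of parameters.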

```python
from gensim.models import KeyedVectors

# Examples of context and target words
examples = [
    (["The", "fly"], "birds"),
    (["birds", "gracefully"], "fly"),
    (["fly", "through"], "gracefully"),
    (["gracefully", "sky"], "through")
]

# Load pre-trained vectors stored in the original word2vec binary format
# (e.g., the "word2vec-google-news-300" model).
# Make sure the file is downloaded and the path below is correct.
# Note: Word2Vec.load_word2vec_format is no longer available in current gensim;
# use KeyedVectors.load_word2vec_format for this format, or Word2Vec.load
# for models saved with model.save().
word_vectors = KeyedVectors.load_word2vec_format('path/to/your/model.bin', binary=True)

# For each example, print the cosine similarity between the averaged
# context vectors and the target word vector
for context, target_word in examples:
    similarity = word_vectors.n_similarity(context, [target_word])
    print(f"Similarity between {context} and {target_word}: {similarity}")
```


## Skip-Gram Architecture Explained

The **Skip-Gram** architecture reverses CBOW: instead of predicting a word from its context, it **predicts the surrounding words from a given "current" word**.

Here's how it works: each word in a sentence is used in turn as the input to a simple log-linear classifier, whose goal is to correctly predict the words within a certain range before and after the input word. Widening this range improves the quality of the resulting word vectors, but it also increases the computational cost.

### Training Complexity

The training complexity for this architecture is directly proportional to:

$Q = C \times (D + D \times \log_2(V))$

Where:
* $C$ is the **maximum distance** between the current word and the context words it predicts.
* $D$ represents the **dimensionality of the word vectors** (how much "information" each word's representation holds).
* $V$ is the **size of the vocabulary** (the total number of unique words the model knows).

For example, with a maximum distance of $C = 5$, the model draws a random number $R$ between 1 and 5 for each training word. It then uses the $R$ words before and the $R$ words after the current word as the correct labels. Each current word therefore requires $R \times 2$ classifications, with the current word as input and each of the $R + R$ surrounding words as an output. Because $R$ is often smaller than $C$, distant words are sampled less often than nearby ones, which gives them less weight.
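
The sampling scheme and the complexity formula can be sketched as follows. The helper names `skipgram_pairs` and `skipgram_complexity` are hypothetical, and $D = 300$ and $V = 1{,}000{,}000$ are assumed values for illustration. This is not how gensim implements training internally.

```python
import math
import random


def skipgram_pairs(tokens, max_distance, rng):
    """Yield (current_word, context_word) pairs using a random window R per word."""
    for i, current in enumerate(tokens):
        r = rng.randint(1, max_distance)  # R drawn uniformly from 1..C, inclusive
        start = max(0, i - r)
        stop = min(len(tokens), i + r + 1)
        for j in range(start, stop):
            if j != i:
                yield current, tokens[j]


def skipgram_complexity(max_distance, dim, vocab_size):
    """Q = C * (D + D*log2(V)): proportional cost per training word."""
    return max_distance * (dim + dim * math.log2(vocab_size))


rng = random.Random(0)
tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
pairs = list(skipgram_pairs(tokens, max_distance=5, rng=rng))
q = skipgram_complexity(max_distance=5, dim=300, vocab_size=1_000_000)

print(f"{len(pairs)} (input, output) pairs from {len(tokens)} words")
print(f"Q (Skip-Gram): {q:,.0f}")
```

Near the sentence boundaries fewer than $2R$ neighbors exist, so the sketch simply truncates the window there.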

```python
from gensim.models import Word2Vec

# Training data is a list of sentences, each given as a list of word tokens.
# This tiny corpus of two sentences is for illustration only.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["she", "sells", "seashells", "by", "the", "seashore"],
]

# Train the CBOW model (Continuous Bag-of-Words, sg=0)
cbow_model = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1, workers=4)

# Train the Skip-Gram model (sg=1)
skipgram_model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1, workers=4)

# Now, you can use the models to get word vector representations.
# For example, to get the vector for the word "fox" in the CBOW model:
vector_cbow_fox = cbow_model.wv['fox']

# To get the vector for the word "fox" in the Skip-Gram model:
vector_skipgram_fox = skipgram_model.wv['fox']

# To calculate the cosine similarity between two words:
similarity = cbow_model.wv.similarity('fox', 'dog')

# This returns a value between -1 and 1. With a corpus this small,
# the number itself is not meaningful.
print(f"Similarity between 'fox' and 'dog' (CBOW): {similarity}")
```