<h1 align=center>Word2Vec (CBOW, Skip-gram) In Depth</h1>

- Word2Vec is an important model for natural language processing (NLP) developed by researchers at Google.
- Word2Vec is a group of related models used to produce word embeddings, which are dense vector representations of words in a continuous vector space.
- It uses a shallow, two-layer neural network to learn word embeddings from a text corpus.
- These embeddings capture semantic relationships between words based on their usage in a large corpus of text.

![word2vec000.png](attachment:word2vec000.png)

**Word Embeddings:** Word embeddings are fixed-size, dense vectors representing words. They capture semantic meaning in such a way that similar words have similar vectors. For example, "king" and "queen" might have vectors that are close to each other.
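
"Close to each other" is usually measured with cosine similarity. The snippet below is a minimal sketch of that idea. The three-dimensional vectors are made up by hand for illustration and do not come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand (not learned).
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these toy values, "king" scores much higher against "queen" than against "apple", which is the behavior a trained model is expected to show for related words.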

## Architecture

Word2Vec has two main architectures for generating word embeddings:

1. **Continuous Bag-of-Words (CBOW)**
2. **Skip-Gram**

![word2vec11.png](attachment:word2vec11.png)

### **1. Continuous Bag-of-Words (CBOW)**

- Predict the target word (center word) from the surrounding context words.
- The model averages the vectors of context words and uses this average to predict the target word.
- Faster to train since it predicts only one word from multiple context words.
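
As a hedged sketch, the averaging-and-prediction step can be written as follows, assuming a full softmax output layer. The notation is introduced here for illustration: $v_w$ and $u_w$ are the input and output vectors of word $w$, $c_1, \dots, c_C$ are the context words, $w_t$ is the center word, and $V$ is the vocabulary.

$$
h = \frac{1}{C}\sum_{i=1}^{C} v_{c_i}, \qquad
p(w_t \mid c_1, \dots, c_C) = \frac{\exp(u_{w_t}^{\top} h)}{\sum_{w \in V} \exp(u_w^{\top} h)}
$$

Training minimizes $-\log p(w_t \mid c_1, \dots, c_C)$ over all windows. Practical implementations usually replace the full softmax with hierarchical softmax or negative sampling to avoid summing over the whole vocabulary.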

### **Practical Example:**

- For simplicity, suppose the corpus consists of just five words: “google dream company software engineer”.

**1st Iteration:**

- Choose a window size (`window_size=3`), so each window holds one center word and its two neighbors.
- The goal is to predict the center word from its context (surrounding) words.
- We build the dataset so that the context words are the input features and the center word is the output (label).

![word2vec1.png](attachment:word2vec1.png)
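
The dataset construction can be sketched in a few lines of Python. This assumes `window_size` is the total width of the sliding window (an odd number) and that only full windows are used. The function name `cbow_pairs` is made up for this example.

```python
def cbow_pairs(tokens, window_size=3):
    """Slide a window over the tokens and return (context words, center word) pairs."""
    half = window_size // 2
    pairs = []
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        center = window[half]
        # Everything in the window except the middle word is context.
        context = window[:half] + window[half + 1:]
        pairs.append((context, center))
    return pairs

tokens = "google dream company software engineer".split()
for context, center in cbow_pairs(tokens):
    print(context, "->", center)
# ['google', 'company'] -> dream
# ['dream', 'software'] -> company
# ['company', 'engineer'] -> software
```

Each printed row corresponds to one iteration of the walkthrough: five words and a window of three give three training examples.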

- Convert each word into a one-hot encoded vector.

![word2vec2.png](attachment:word2vec2.png)
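
A minimal sketch of the one-hot step, assuming vocabulary indices follow the order in which words first appear (the ordering in the figure may differ):

```python
def one_hot(index, size):
    """Return a list of zeros of length `size` with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

tokens = "google dream company software engineer".split()
# Assumption: index = order of first appearance in the corpus.
word_to_id = {word: i for i, word in enumerate(dict.fromkeys(tokens))}
encoded = {word: one_hot(i, len(word_to_id)) for word, i in word_to_id.items()}

for word, vec in encoded.items():
    print(f"{word:<9} {vec}")
```

Every vector has as many entries as there are vocabulary words, which is why one-hot inputs become very wide for realistic corpora.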

- Next, we pass the one-hot vectors to the neural network.

![word2vec3.png](attachment:word2vec3.png)
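
To make the network step concrete, here is a rough sketch of one training update for the first window, written in plain Python. It assumes a full softmax output and plain stochastic gradient descent. The embedding size, learning rate, and the name `cbow_step` are illustrative choices, not values taken from the figure.

```python
import math
import random

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_step(context_ids, target_id, W_in, W_out, lr=0.1):
    """One SGD step on a (context, center) pair; returns the loss before the update."""
    dim = len(W_in[0])
    # Hidden layer: average of the context words' input vectors.
    hidden = [sum(W_in[c][d] for c in context_ids) / len(context_ids) for d in range(dim)]
    # Output layer: one score per vocabulary word, turned into probabilities.
    probs = softmax([sum(h * w for h, w in zip(hidden, row)) for row in W_out])
    loss = -math.log(probs[target_id])
    # Gradient of cross-entropy w.r.t. the scores: probs - one_hot(target).
    errors = [p - (1.0 if j == target_id else 0.0) for j, p in enumerate(probs)]
    # Gradient w.r.t. the hidden vector, computed before W_out changes.
    grad_hidden = [sum(errors[j] * W_out[j][d] for j in range(len(W_out))) for d in range(dim)]
    for j, row in enumerate(W_out):
        for d in range(dim):
            row[d] -= lr * errors[j] * hidden[d]
    for c in context_ids:
        for d in range(dim):
            W_in[c][d] -= lr * grad_hidden[d] / len(context_ids)
    return loss

random.seed(0)
vocab = ["google", "dream", "company", "software", "engineer"]
dim = 3  # hypothetical embedding size
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

# First window: context ("google", "company") -> center "dream".
losses = [cbow_step([0, 2], 1, W_in, W_out) for _ in range(200)]
print(f"loss at first step: {losses[0]:.3f}, at last step: {losses[-1]:.3f}")
```

Repeating the update on the same pair drives the loss down, which shows the mechanics. Real training cycles over all windows in the corpus instead of a single one.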

**2nd Iteration:**

- We slide the window one position to the right and take the next three words.

![word2vec4.png](attachment:word2vec4.png)

**3rd Iteration:**

- Finally, for the last three words the process is shown below:

![word2vec5.png](attachment:word2vec5.png)

### Getting Word Embeddings:

- The process of extracting the word embeddings is shown below:

![word2vec9.png](attachment:word2vec9.png)
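
A common convention, assumed here, is that the embedding of a word is its row in the input-to-hidden weight matrix. Multiplying a one-hot vector by that matrix simply selects the row. The weights below are placeholder numbers, not trained values.

```python
def one_hot_times_matrix(one_hot_vec, W):
    """Multiply a one-hot row vector by W (shape: vocab_size x dim)."""
    dim = len(W[0])
    return [sum(one_hot_vec[i] * W[i][d] for i in range(len(W))) for d in range(dim)]

vocab = ["google", "dream", "company", "software", "engineer"]
word_to_id = {word: i for i, word in enumerate(vocab)}

# Placeholder input weights (one row per word); a real model would learn these.
W_in = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

company_one_hot = [1 if i == word_to_id["company"] else 0 for i in range(len(vocab))]
embedding = one_hot_times_matrix(company_one_hot, W_in)
print(embedding)                      # same values as the row for "company"
print(W_in[word_to_id["company"]])
```

Because the product is just a row lookup, implementations typically skip the multiplication and index the matrix directly.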

### **2. Skip-Gram**

- Predict the surrounding context words given a target word.
- The model takes the target word and tries to predict each of the context words within a window.
- Tends to work well with smaller datasets and represents rare words well, but is slower to train than CBOW.
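
As a hedged sketch, the Skip-Gram prediction and its per-position loss can be written as follows, assuming a full softmax output layer. The notation is introduced here for illustration: $v_w$ and $u_w$ are the input and output vectors of word $w$, $w_t$ is the target (center) word, $m$ is the number of context words on each side, and $V$ is the vocabulary.

$$
p(w_o \mid w_t) = \frac{\exp(u_{w_o}^{\top} v_{w_t})}{\sum_{w \in V} \exp(u_w^{\top} v_{w_t})}, \qquad
\mathcal{L}_t = -\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t)
$$

For a window of three words, $m = 1$, so each center word contributes two prediction terms.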

### **Practical Example:**

- We reuse the same five-word example and window size as in the CBOW section.

**1st Iteration:**

- Here the roles are reversed: the center word is the input, and the surrounding context words are the outputs we predict.

![word2vec6.png](attachment:word2vec6.png)

**2nd Iteration:**

![word2vec7.png](attachment:word2vec7.png)

**3rd Iteration:**

![word2vec8.png](attachment:word2vec8.png)
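
The three iterations above can be reproduced with a short sketch that emits one (center word, context word) pair per prediction. As before, this assumes `window_size` is the total width of the sliding window and that only full windows are used. The function name `skipgram_pairs` is made up for this example.

```python
def skipgram_pairs(tokens, window_size=3):
    """Return (center word, context word) training pairs from full sliding windows."""
    half = window_size // 2
    pairs = []
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        center = window[half]
        for position, word in enumerate(window):
            if position != half:  # skip the center word itself
                pairs.append((center, word))
    return pairs

tokens = "google dream company software engineer".split()
for center, context in skipgram_pairs(tokens):
    print(center, "->", context)
# dream -> google
# dream -> company
# company -> dream
# company -> software
# software -> company
# software -> engineer
```

Compared with CBOW, the same three windows yield six training pairs instead of three, which is one reason Skip-Gram training takes longer.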

## Extensions and Alternatives

Several models build on or improve Word2Vec:

1. **GloVe (Global Vectors for Word Representation)**: Combines global word co-occurrence statistics with local context-based learning.
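2. **FastText**: Extends Word2Vec by representing each word as a bag of character n-grams, which improves representations for rare words and makes it possible to build vectors for unseen words.
3. **ELMo (Embeddings from Language Models)**: Produces deep, contextualized word representations from a bidirectional LSTM language model.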
4. **BERT (Bidirectional Encoder Representations from Transformers)**: Uses transformers for contextualized word embeddings, considering both left and right context.
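
To illustrate the character n-gram idea behind FastText, the sketch below splits a word into trigrams after wrapping it in boundary markers. This is a simplified illustration, not the library's code: FastText itself combines several n-gram lengths and also keeps the whole word as a unit.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A word missing from the training corpus still shares many of these n-grams with known words, so a vector can be assembled from the n-gram vectors.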

## Applications

1. **Semantic Similarity**: Measuring similarity between words.
2. **Text Classification**: Improving feature representation for classification tasks.
3. **Machine Translation**: Enhancing translation quality by providing better word embeddings.
4. **Information Retrieval**: Improving search results by understanding word semantics.
5. **Recommendation Systems**: Enhancing recommendations by understanding user preferences and item descriptions.

## Limitations

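1. **Context Independence**: Each word gets a single static vector regardless of the sentence it appears in, so the different senses of a word such as "bank" are merged. Word order within the context window and syntactic roles are also ignored.
2. **Out-of-Vocabulary Words**: It has no vector for words that were not present in the training corpus.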
3. **Fixed Embedding Size**: All words are represented by fixed-size vectors, regardless of their frequency or importance.