### Word2Vec

Word2Vec is a word embedding technique in natural language processing (NLP) that represents words as vectors in a continuous vector space. Developed by researchers at Google, it maps each word to a dense, real-valued vector so that the geometry of the space captures semantic relationships between words. It works on the principle that words with similar meanings should have similar vector representations.

#### 1.  CBOW (Continuous Bag of Words)
1. Data Preparation
First, prepare your text corpus:

* Tokenize the text into words
* Build a vocabulary of unique words
* Assign each word a unique index
* Create training examples with context windows

Example: For the sentence "The cat sits on the mat" with window size = 2:

* Context: [The, cat, on, the] → Target: sits
* Context: [cat, sits, the, mat] → Target: on
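
The preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `build_vocab` and `cbow_examples` are made up for this example, the text is lowercased and split on whitespace (so "The" and "the" share one index), and only positions with a full window on both sides are kept, which reproduces the two examples listed above.

```python
def build_vocab(tokens):
    """Map each unique token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) pairs that have a full window."""
    examples = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        examples.append((context, tokens[i]))
    return examples


# Assumption: lowercase and split on whitespace as a stand-in for real tokenization.
tokens = "The cat sits on the mat".lower().split()
vocab = build_vocab(tokens)
examples = cbow_examples(tokens, window=2)
for context, target in examples:
    print(context, "->", target)
```

Practical implementations usually also keep the shorter windows at the start and end of a sentence; they are dropped here only to keep the sketch aligned with the example.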

2. Define the Architecture
The CBOW model has three layers:

* Input Layer: Takes context words as one-hot encoded vectors
* Hidden Layer: Projects words into a lower-dimensional embedding space (no activation function)
* Output Layer: Predicts the target word using softmax

Key parameters:

* Vocabulary size: V
* Embedding dimension: N (typically 100-300)
* Context window size: C (words on each side of target)

3. Initialize Weight Matrices
Create two weight matrices with random small values:

* W (V × N): Input-to-hidden weights (this becomes our word embeddings)
* W' (N × V): Hidden-to-output weights
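
As a rough sketch of this initialization, the two matrices can be built as nested Python lists. The helper `init_matrix`, the toy sizes, and the uniform range of ±0.5/N are illustrative choices, not values prescribed by the text.

```python
import random


def init_matrix(rows, cols, scale, rng):
    """Return a rows x cols list-of-lists with small uniform random values."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


V, N = 5, 3                 # toy sizes; real models use N around 100-300
rng = random.Random(0)      # fixed seed so the sketch is repeatable
W = init_matrix(V, N, 0.5 / N, rng)        # input-to-hidden, one row per word
W_out = init_matrix(N, V, 0.5 / N, rng)    # hidden-to-output (W' in the text)
print(len(W), len(W[0]), len(W_out), len(W_out[0]))
```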

4. Forward Propagation
For each training example:

* Step 4a: Convert context words to one-hot vectors
    * Each context word becomes a V-dimensional vector with 1 at its index and 0 elsewhere
* Step 4b: Look up embeddings
    * Multiply each one-hot vector by W to get its embedding vector
    * This is equivalent to selecting the corresponding rows of W
* Step 4c: Average the context embeddings
    * Sum all context word embeddings and divide by the number of context words
    * Result: h = (1/m) × Σ(embeddings of context words), where m is the number of context words in the example (at most 2C)
* Step 4d: Compute output scores
    * Multiply the averaged embedding h by W': u = h × W'
    * u is a V-dimensional vector with one score per vocabulary word
* Step 4e: Apply softmax
    * Convert scores to probabilities: p(w) = exp(u_w) / Σ(exp(u_i))
    * This gives a probability distribution over all vocabulary words
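
The forward pass can be traced on a tiny example. The sketch below assumes a 4-word vocabulary with 2-dimensional embeddings and hand-picked weights; the function names `softmax` and `cbow_forward` are hypothetical, and matrices are plain nested lists to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, W, W_out):
    """Return (h, probs) for one CBOW example."""
    n_dim = len(W[0])
    # Steps 4a-4c: reading rows of W replaces the one-hot multiplication,
    # then the rows are averaged over the context words.
    h = [sum(W[i][d] for i in context_ids) / len(context_ids) for d in range(n_dim)]
    # Step 4d: u = h x W', one score per vocabulary word.
    u = [sum(h[d] * W_out[d][j] for d in range(n_dim)) for j in range(len(W_out[0]))]
    # Step 4e: softmax turns the scores into probabilities.
    return h, softmax(u)


# Hypothetical toy weights: V = 4 words, N = 2 dimensions.
W = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.5]]
W_out = [[0.2, -0.3, 0.1, 0.4], [0.5, 0.1, -0.2, 0.0]]
h, probs = cbow_forward([0, 1, 3], W, W_out)
print(h)
print(probs)
```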

5. Calculate Loss
Use cross-entropy loss:

* Loss = -log(p(target_word))
* The goal is to maximize the probability of the correct target word
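
A quick numeric check shows how this loss behaves. The probabilities below are invented for illustration and `cross_entropy` is a hypothetical helper name.

```python
import math


def cross_entropy(probs, target_id):
    """Negative log-probability the model assigns to the true target word."""
    return -math.log(probs[target_id])


probs = [0.1, 0.6, 0.2, 0.1]    # made-up softmax output over a 4-word vocabulary
print(cross_entropy(probs, 1))  # true word received high probability: small loss
print(cross_entropy(probs, 0))  # true word received low probability: large loss
```

Minimizing this quantity is the same as maximizing the probability of the correct target word, since -log is a decreasing function.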

6. Backward Propagation
Compute the gradients, treating h, u, and error as row vectors:

* Step 6a: Calculate output error
    * error = predicted probabilities - actual target (one-hot), a V-dimensional vector
* Step 6b: Compute gradients for W'
    * ∂Loss/∂W' = h^T × error (an N × V matrix)
* Step 6c: Backpropagate to hidden layer
    * error_hidden = error × W'^T (an N-dimensional vector)
* Step 6d: Compute gradients for W
    * Only the rows of W that belong to the context words receive a gradient
    * Because h is an average, each of those rows gets (1/m) × error_hidden, where m is the number of context words

7. Update Weights
Using gradient descent (or variants like SGD, Adam):

* W = W - learning_rate × ∂Loss/∂W
* W' = W' - learning_rate × ∂Loss/∂W'

8. Iterate
Repeat steps 4-7 for all training examples across multiple epochs until convergence.
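
Steps 4 to 8 can be combined into one small training loop. The sketch below is an illustration under stated assumptions, not an optimized implementation: it uses a full softmax, plain SGD, nested lists instead of a numerical library, and a hypothetical function name `train_cbow_step`. The word indices assume the lowercased vocabulary the=0, cat=1, sits=2, on=3, mat=4.

```python
import math
import random


def train_cbow_step(context_ids, target_id, W, W_out, lr):
    """Run one forward/backward pass in place; return the loss before the update."""
    n_dim, vocab = len(W[0]), len(W_out[0])
    m = len(context_ids)
    # Step 4: forward pass.
    h = [sum(W[i][d] for i in context_ids) / m for d in range(n_dim)]
    u = [sum(h[d] * W_out[d][j] for d in range(n_dim)) for j in range(vocab)]
    top = max(u)
    exps = [math.exp(s - top) for s in u]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 5: cross-entropy loss.
    loss = -math.log(probs[target_id])
    # Step 6a: error = probabilities - one-hot target.
    error = [p - (1.0 if j == target_id else 0.0) for j, p in enumerate(probs)]
    # Step 6c: computed before W' is modified below.
    error_hidden = [sum(error[j] * W_out[d][j] for j in range(vocab)) for d in range(n_dim)]
    # Steps 6b and 7: the gradient of W' is the outer product of h and error.
    for d in range(n_dim):
        for j in range(vocab):
            W_out[d][j] -= lr * h[d] * error[j]
    # Steps 6d and 7: each context row receives error_hidden / m.
    for i in context_ids:
        for d in range(n_dim):
            W[i][d] -= lr * error_hidden[d] / m
    return loss


rng = random.Random(0)
V, N = 5, 3
W = [[rng.uniform(-0.1, 0.1) for _ in range(N)] for _ in range(V)]
W_out = [[rng.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(N)]
# The two examples for "the cat sits on the mat" with window size 2.
examples = [([0, 1, 3, 0], 2), ([1, 2, 0, 4], 3)]
losses = []
for epoch in range(200):  # step 8: repeat over the data
    losses.append(sum(train_cbow_step(c, t, W, W_out, 0.2) for c, t in examples))
print(losses[0], losses[-1])
```

On this toy data the summed loss should fall over the epochs; with only two examples the model is simply memorizing them, so the numbers say nothing about embedding quality.
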

Key Optimizations:

* Hierarchical Softmax: Replaces the expensive softmax with a binary tree structure, reducing the cost per example from O(V) to O(log V)
* Negative Sampling: Instead of updating all words in the vocabulary, updates only the target word (positive sample) and a small number of randomly drawn words (negative samples)

Final Output:

After training, the W matrix contains the learned word embeddings. Each row is a word's vector representation, which captures semantic meaning based on the contexts where the word appears.

Words used in similar contexts will have similar vector representations, enabling operations like:

* Similarity: king ≈ queen
* Analogies: king - man + woman ≈ queen

#### 2. Skip-Gram

Skip-Gram is the inverse of CBOW. Instead of predicting a target word from its context, it predicts the context words given a target word. Skip-Gram generally works better for smaller datasets and rare words.

1. Data Preparation
Prepare your text corpus:

* Tokenize the text into words
* Build a vocabulary of unique words
* Assign each word a unique index
* Create training examples with context windows

Example: For "The cat sits on the mat" with window size = 2:

Target: sits → Context words: The, cat, on, the

This creates 4 separate training pairs:

* (sits, The)
* (sits, cat)
* (sits, on)
* (sits, the)



Key difference from CBOW: Skip-Gram creates multiple training examples per target (one per context word), while CBOW creates one example per target.

2. Define the Architecture
The Skip-Gram model has three layers:

* Input Layer: Takes a single target word as a one-hot encoded vector
* Hidden Layer: Projects the word into the embedding space (no activation function)
* Output Layer: Predicts each context word independently using softmax

Key parameters:

* Vocabulary size: V
* Embedding dimension: N (typically 100-300)
* Context window size: C (words on each side)

3. Initialize Weight Matrices
Create two weight matrices with random small values:

* W (V × N): Input-to-hidden weights (word embeddings)
* W' (N × V): Hidden-to-output weights (context embeddings)

4. Forward Propagation
For each training pair (target_word, context_word):

* Step 4a: Convert target word to one-hot vector
    * Create a V-dimensional vector with 1 at the target word's index and 0 elsewhere
    * Example: if "sits" has index 5 in a vocabulary of size 1000, create a 1000-dimensional vector with 1 at position 5
* Step 4b: Look up target word embedding
    * Multiply the one-hot vector by W to get the embedding: h = x^T × W
    * This is equivalent to selecting the row of W corresponding to the target word
    * Result: h is an N-dimensional vector (the word embedding)
* Step 4c: Compute output scores
    * Multiply the embedding h by W': u = h × W'
    * u is a V-dimensional vector of scores, one per vocabulary word
* Step 4d: Apply softmax
    * Convert scores to probabilities: p(context_word_i) = exp(u_i) / Σ(exp(u_j))
    * This gives a probability distribution over all possible context words
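
The claim that the one-hot multiplication is just a row lookup can be checked directly. The sketch below assumes a 4-word vocabulary with 2-dimensional embeddings and made-up weights; `one_hot`, `vec_mat`, and `softmax` are hypothetical helper names.

```python
import math


def one_hot(index, size):
    """V-dimensional vector with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(v, M):
    """Row vector times matrix: returns v x M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy weights: V = 4 words, N = 2 dimensions.
W = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.5]]
W_out = [[0.2, -0.3, 0.1, 0.4], [0.5, 0.1, -0.2, 0.0]]

target_id = 2
x = one_hot(target_id, len(W))   # step 4a
h = vec_mat(x, W)                # step 4b: h = x^T x W
assert h == W[target_id]         # identical to reading row 2 of W directly
u = vec_mat(h, W_out)            # step 4c
probs = softmax(u)               # step 4d
print(h)
print(probs)
```

This is why implementations index into W instead of building one-hot vectors: the multiplication costs O(V × N) while the lookup costs O(N).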

5. Calculate Loss
For each (target, context) pair:

* Loss = -log(p(context_word | target_word))
* The goal is to maximize the probability of the actual context word

Total loss for one target word:

* Sum the losses across all context words in the window
* Loss_total = -Σ log(p(context_word_i | target_word))

6. Backward Propagation
Compute the gradients, treating h, u, and error as row vectors:

* Step 6a: Calculate output error
    * error = predicted probabilities - actual context word (one-hot)
    * This is a V-dimensional vector
* Step 6b: Compute gradients for W'
    * ∂Loss/∂W' = h^T × error
    * This updates the context word representations
* Step 6c: Backpropagate to hidden layer
    * error_hidden = error × W'^T
    * This is an N-dimensional vector
* Step 6d: Compute gradients for W
    * ∂Loss/∂W = x × error_hidden, where x is the one-hot column vector of the target word
    * This updates only the embedding of the target word (one row of W)

7. Update Weights
Using gradient descent (or variants):

* W = W - learning_rate × ∂Loss/∂W
* W' = W' - learning_rate × ∂Loss/∂W'

Important: Only the row of W corresponding to the target word is updated in each iteration.

8. Iterate
Repeat steps 4-7 for all training pairs across multiple epochs until convergence.
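
One full-softmax update for a single (target, context) pair might look like the sketch below. It is an illustration with assumptions: plain SGD, nested lists instead of a numerical library, random toy weights, and a hypothetical function name `train_skipgram_pair`. It also checks which rows of W actually change.

```python
import copy
import math
import random


def train_skipgram_pair(target_id, context_id, W, W_out, lr):
    """One full-softmax SGD step on a (target, context) pair; returns the loss."""
    n_dim, vocab = len(W[0]), len(W_out[0])
    h = W[target_id][:]                                   # step 4b: row lookup
    u = [sum(h[d] * W_out[d][j] for d in range(n_dim)) for j in range(vocab)]
    top = max(u)
    exps = [math.exp(s - top) for s in u]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[context_id])                   # step 5
    # Step 6a: error = probabilities - one-hot context word.
    error = [p - (1.0 if j == context_id else 0.0) for j, p in enumerate(probs)]
    # Step 6c: computed before W' is modified below.
    error_hidden = [sum(error[j] * W_out[d][j] for j in range(vocab)) for d in range(n_dim)]
    for d in range(n_dim):                                # steps 6b and 7: update W'
        for j in range(vocab):
            W_out[d][j] -= lr * h[d] * error[j]
    for d in range(n_dim):                                # steps 6d and 7: one row of W
        W[target_id][d] -= lr * error_hidden[d]
    return loss


rng = random.Random(0)
V, N = 5, 3
W = [[rng.uniform(-0.1, 0.1) for _ in range(N)] for _ in range(V)]
W_out = [[rng.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(N)]
before = copy.deepcopy(W)
first = train_skipgram_pair(2, 1, W, W_out, lr=0.5)   # e.g. the pair (sits, cat)
changed_rows = [i for i in range(V) if W[i] != before[i]]
print(changed_rows)
later = train_skipgram_pair(2, 1, W, W_out, lr=0.5)
print(first, later)
```

`changed_rows` should contain only the target word's index, while every column of W' is touched because the softmax error is nonzero for all vocabulary words. That asymmetry is what the optimizations discussed in this section address.
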

Training Example Walkthrough:

Sentence: "The cat sits on the mat" (window size = 2)

All training pairs generated:

* (The, cat), (The, sits) - "The" predicts 2 context words
* (cat, The), (cat, sits), (cat, on) - "cat" predicts 3 context words
* (sits, The), (sits, cat), (sits, on), (sits, the) - "sits" predicts 4 context words
* (on, cat), (on, sits), (on, the), (on, mat) - "on" predicts 4 context words
* (the, sits), (the, on), (the, mat) - "the" predicts 3 context words
* (mat, on), (mat, the) - "mat" predicts 2 context words

This gives 18 pairs in total. Each pair is trained separately.
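
The pairs for a given window can be enumerated mechanically. In the sketch below, `skipgram_pairs` is a hypothetical helper and the whitespace split stands in for real tokenization; words near the sentence edges simply get fewer context words.

```python
def skipgram_pairs(tokens, window=2):
    """Return every (target, context) pair within `window` positions of each other."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = "The cat sits on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
for pair in pairs:
    print(pair)
print(len(pairs))  # 18
```

With a window of 2, every word within two positions of the target counts as context, so "The" is paired with both "cat" and "sits", and "cat" is paired with "The", "sits", and "on".
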

Key Optimizations:

Negative Sampling (most common): Instead of computing an expensive softmax over the entire vocabulary:

* Update the actual context word (positive sample)
* Update k random words that are NOT in the context (negative samples, typically k = 5-20)
* The loss becomes: -log(σ(u_pos)) - Σ log(σ(-u_neg)), where σ is the sigmoid function
* This reduces the cost per training pair from O(V) to O(k)
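
A negative-sampling update can be sketched as follows. The loss is written as the quantity to minimize, -log σ(u_pos) - Σ log σ(-u_neg). Several details are assumptions made for illustration: the function name `negative_sampling_step`, storing the output vectors row-wise in a matrix `C` (the transpose of W'), and drawing negatives uniformly at random, whereas practical implementations typically sample from a smoothed word-frequency distribution.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_sampling_step(target_id, context_id, negative_ids, W, C, lr):
    """One SGD step on the negative-sampling loss; returns the loss before the update.

    W holds input (word) vectors and C holds output (context) vectors, both V x N.
    """
    h = W[target_id][:]
    grad_h = [0.0] * len(h)
    loss = 0.0
    # Label 1 for the true context word, 0 for each sampled negative.
    for word_id, label in [(context_id, 1.0)] + [(n, 0.0) for n in negative_ids]:
        score = sum(a * b for a, b in zip(h, C[word_id]))
        prob = sigmoid(score)
        # log(1 - sigmoid(s)) equals log(sigmoid(-s)), the negative-sample term.
        loss -= math.log(prob) if label == 1.0 else math.log(1.0 - prob)
        g = prob - label                       # derivative of the loss w.r.t. score
        for d in range(len(h)):
            grad_h[d] += g * C[word_id][d]
            C[word_id][d] -= lr * g * h[d]     # only k + 1 output vectors change
    for d in range(len(h)):
        W[target_id][d] -= lr * grad_h[d]
    return loss


rng = random.Random(0)
V, N, k = 10, 4, 3
W = [[rng.uniform(-0.1, 0.1) for _ in range(N)] for _ in range(V)]
C = [[rng.uniform(-0.1, 0.1) for _ in range(N)] for _ in range(V)]
target_id, context_id = 2, 1
# Assumption: negatives drawn uniformly from words other than the target and context.
pool = [w for w in range(V) if w not in (target_id, context_id)]
negative_ids = rng.sample(pool, k)
losses = [negative_sampling_step(target_id, context_id, negative_ids, W, C, lr=0.5)
          for _ in range(50)]
print(losses[0], losses[-1])
```

Each step touches one row of W and k + 1 rows of C, regardless of the vocabulary size.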

Hierarchical Softmax:

* Organizes the vocabulary in a binary tree
* Reduces the cost per training pair from O(V) to O(log V)
* Each word is a leaf; the path from the root to that leaf determines its probability
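
To see how a path determines a probability, consider the small sketch below. It assumes a hand-built balanced tree over four words with made-up inner-node vectors; word2vec implementations commonly use a Huffman tree built from word frequencies instead, and `path_probability` is a hypothetical name. Each inner node makes a left/right decision with a sigmoid, and a word's probability is the product of the decisions along its path.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def path_probability(path, h, node_vectors):
    """Multiply the left/right decision probabilities along a root-to-leaf path."""
    prob = 1.0
    for node_id, direction in path:
        score = sum(a * b for a, b in zip(h, node_vectors[node_id]))
        prob *= sigmoid(direction * score)   # +1 = go left, -1 = go right
    return prob


# Hypothetical balanced tree: node 0 is the root,
# node 1 splits (cat, sits), node 2 splits (on, mat).
paths = {
    "cat":  [(0, +1), (1, +1)],
    "sits": [(0, +1), (1, -1)],
    "on":   [(0, -1), (2, +1)],
    "mat":  [(0, -1), (2, -1)],
}
node_vectors = [[0.4, -0.2], [0.1, 0.3], [-0.5, 0.2]]   # made-up inner-node vectors
h = [0.7, -0.1]                                          # made-up hidden vector

probs = {word: path_probability(path, h, node_vectors) for word, path in paths.items()}
print(probs)
print(sum(probs.values()))
```

Because σ(x) + σ(-x) = 1 at every inner node, the leaf probabilities sum to 1 without ever normalizing over the whole vocabulary, and each word needs only about log2(V) sigmoid evaluations.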

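Subsampling Frequent Words:

* Frequent words (like "the", "a") are randomly discarded during training
* Probability of keeping word w: P(w) = √(t/f(w)), capped at 1, where f(w) is the word's relative frequency in the corpus and t is a small threshold (values around 10^-5 are commonly cited)
* This balances training and improves rare word representations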
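
The keep probability is easy to evaluate for a few frequencies. In the sketch below, `keep_probability` and `subsample` are hypothetical helpers, the default threshold of 1e-5 is an assumed typical value, and the toy corpus uses a much larger threshold only because it is tiny.

```python
import math
import random
from collections import Counter


def keep_probability(freq, t=1e-5):
    """P(keep) = sqrt(t / f(w)), capped at 1 for words rarer than the threshold."""
    return min(1.0, math.sqrt(t / freq))


def subsample(tokens, t, rng):
    """Randomly drop tokens according to their relative corpus frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return [tok for tok in tokens
            if rng.random() < keep_probability(counts[tok] / total, t)]


print(keep_probability(1e-3))   # a word that makes up 0.1% of the corpus
print(keep_probability(1e-6))   # rarer than t, so it is always kept

rng = random.Random(0)
tokens = ["the"] * 90 + ["cat"] * 10          # toy corpus: "the" is 90% of tokens
kept = subsample(tokens, t=0.05, rng=rng)     # large t only because the corpus is tiny
print(Counter(kept))
```

With these settings "the" is kept with probability of roughly 0.24 and "cat" with roughly 0.71, so the frequent word loses a much larger share of its occurrences.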

Skip-Gram vs CBOW Comparison:

| Aspect | Skip-Gram | CBOW |
|---|---|---|
| Prediction | Context from target | Target from context |
| Training pairs | Multiple per window | One per window |
| Training time | Slower (more examples) | Faster |
| Performance | Better on small data, rare words | Better on large data, frequent words |
| Embedding quality | Higher quality for infrequent words | Smoother embeddings overall |

Final Output:

After training, the W matrix contains the learned word embeddings. Each row is a dense vector representation of a word that captures:

* Semantic meaning
* Syntactic patterns
* Contextual relationships

Words appearing in similar contexts will have similar vectors, enabling:

* Similarity: dog ≈ cat
* Analogies: paris - france + germany ≈ berlin
* Clustering: Related words cluster together in vector space