<center>

# NLP_07 : Word2Vec — CBOW & Skip-Gram

</center>
<br>

### What is Word2Vec?

**Word2Vec** is a word embedding technique that represents words as **dense, low-dimensional vectors**.  
Unlike BoW or TF-IDF, Word2Vec captures **semantic meaning** and **contextual similarity** between words.

Words with similar meanings tend to have **similar vector representations**.
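
Similarity between word vectors is usually measured with cosine similarity. The sketch below illustrates the idea on hand-made 3-dimensional vectors. The values in `toy_vectors` are hypothetical and do not come from any trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, chosen by hand for illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "broom": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_broom = cosine_similarity(toy_vectors["king"], toy_vectors["broom"])
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs broom: {sim_king_broom:.3f}")
```

In this toy setup, the related pair scores much higher than the unrelated one, which is the behavior a trained embedding is expected to show.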

### Why Word2Vec?

Limitations of BoW and TF-IDF:
- High-dimensional and sparse vectors
- No semantic understanding
- No context awareness

Word2Vec solves these problems by:
- Learning dense vectors
- Capturing semantic relationships
- Learning each word's vector from the contexts in which it appears
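
The first two limitations can be seen with one-hot vectors, the building block of a BoW representation. This is a minimal sketch with a hypothetical four-word vocabulary: every vector is as long as the vocabulary, and two different words never overlap, no matter how related they are.

```python
# Tiny hypothetical vocabulary for illustration only.
vocab = ["king", "queen", "broom", "throne"]

def one_hot(word):
    """Sparse vector with a single 1 at the word's vocabulary index."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

king, queen = one_hot("king"), one_hot("queen")
print(len(king))         # vector length equals vocabulary size
print(dot(king, queen))  # distinct words share no dimension
```

With a real vocabulary of tens of thousands of words, these vectors become very long and almost entirely zero, which is what dense embeddings avoid.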

### Two Architectures of Word2Vec

Word2Vec has **two training models**:

1. **CBOW (Continuous Bag of Words)**
2. **Skip-Gram**

In [1]:
import os
from nltk.tokenize import sent_tokenize
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

In [2]:
# Load and preprocess the text data
story = []
for filename in os.listdir('data'):
    # Use a context manager so each file is closed after reading
    with open(os.path.join('data', filename)) as f:
        corpus = f.read()
    # Split into sentences, then lowercase and tokenize each sentence
    for sent in sent_tokenize(corpus):
        story.append(simple_preprocess(sent))

In [3]:
print(f"Number of sentences: {len(story)}")

Number of sentences: 145020


#### Goal: train a CBOW model and extract its word vectors, then train a Skip-Gram model and compare the results with CBOW.


## CBOW (Continuous Bag of Words)

### Definition
**CBOW (Continuous Bag of Words)** is a Word2Vec model that predicts a **target word** from its **surrounding context words**. It learns word embeddings by using the **neighboring words** within a fixed window to estimate the missing word. In CBOW, **word order is ignored**: only the presence of the context words matters.

### Example

Sentence: I love natural language processing <br>
If the **target word** is: language <br>

And the **context window size = 2**, then the **context words** are: "love", "natural", "processing"<br>
CBOW learns the mapping:
Context Words → Target Word <br>
["love", "natural", "processing"] → "language"
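
The example above can be reproduced with a few lines of Python. This is a simplified illustration of how (context, target) training pairs are formed. The function name `cbow_pairs` is hypothetical and this is not gensim's internal implementation.

```python
def cbow_pairs(tokens, window=2):
    """Return a list of (context_words, target_word), one per position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs

tokens = "i love natural language processing".split()
for context, target in cbow_pairs(tokens, window=2):
    print(context, "->", target)
```

For the target `language`, only one word follows it in the sentence, so the context holds three words rather than four.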

In [4]:
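# Create an untrained CBOW model. The vocabulary is built and the model
# is trained explicitly in the next cells; passing `sentences=story` here
# as well would make gensim train during construction and repeat that work.
cbow_model = Word2Vec(
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
    sg=0  # CBOW
)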

In [5]:
# Build vocabulary and train
cbow_model.build_vocab(story)

In [6]:
cbow_model.train(story, total_examples=cbow_model.corpus_count, epochs=cbow_model.epochs)

(6570205, 8628190)

In [7]:
# Extract and analyze CBOW vectors
cbow_vector = cbow_model.wv['king']

In [8]:
print(f"CBOW vector shape for 'king': {cbow_vector.shape}")

CBOW vector shape for 'king': (100,)


In [9]:
cbow_vectors = cbow_model.wv.vectors

In [10]:
cbow_vocab = list(cbow_model.wv.index_to_key)

In [11]:
print("\nCBOW Model Results:")
print("=" * 50)


CBOW Model Results:


In [12]:
print("CBOW Similar to king:")
cbow_model.wv.most_similar("king")

CBOW Similar to king:


[('realm', 0.6187288165092468),
 ('baratheon', 0.6147810816764832),
 ('aegon', 0.49529320001602173),
 ('robert', 0.4896616041660309),
 ('rebellion', 0.48941537737846375),
 ('throne', 0.4880048334598541),
 ('heir', 0.48798668384552),
 ('lannisters', 0.47327372431755066),
 ('war', 0.4626232385635376),
 ('renly', 0.4624309241771698)]

In [13]:
print("CBOW Similar to queen:")
cbow_model.wv.most_similar("queen")

CBOW Similar to queen:


[('princess', 0.5955092906951904),
 ('kingslayer', 0.5950101613998413),
 ('joffrey', 0.5702580809593201),
 ('sister', 0.5633024573326111),
 ('margaery', 0.5477557182312012),
 ('myrcella', 0.5228344202041626),
 ('beloved', 0.5170937180519104),
 ('marriage', 0.5148559808731079),
 ('cersei', 0.5106081366539001),
 ('approach', 0.5100813508033752)]

In [14]:
cbow_model.wv.most_similar(positive=['king','woman'], negative=['man'])

[('queen', 0.5519255995750427),
 ('princess', 0.5244705677032471),
 ('baratheon', 0.5238691568374634),
 ('mother', 0.44155389070510864),
 ('aegon', 0.43635252118110657),
 ('murdered', 0.4243878722190857),
 ('realm', 0.41330569982528687),
 ('margaery', 0.4110272228717804),
 ('court', 0.40610864758491516),
 ('shame', 0.40155017375946045)]
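
The query above is the classic "king - man + woman" analogy. Below is a minimal sketch of the underlying vector arithmetic on hypothetical toy vectors. The helper `analogy` is an illustrative name, and the sketch skips the vector normalization that gensim applies before combining the inputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(positive, negative, vectors, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    exclude = set(positive) | set(negative)  # input words are never returned
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings: (royalty, femininity, other).
toy_vectors = {
    "king":   [0.9, 0.1, 0.2],
    "queen":  [0.9, 0.9, 0.2],
    "man":    [0.1, 0.1, 0.2],
    "woman":  [0.1, 0.9, 0.2],
    "throne": [0.8, 0.4, 0.5],
    "broom":  [0.0, 0.1, 0.9],
}

result = analogy(["king", "woman"], ["man"], toy_vectors)
print(result[0][0])
```

The toy vectors are built so that the arithmetic lands on `queen`. On a real corpus the match is approximate, as the similarity scores in the output above show.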

In [15]:
print("\nWord that doesn't match in ['cersei', 'jaime', 'broom', 'tyrion']:")
print(cbow_model.wv.doesnt_match(['cersei', 'jaime', 'broom', 'tyrion']))


Word that doesn't match in ['cersei', 'jaime', 'broom', 'tyrion']:
broom
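
One common way to find the odd word out is to compare each word with the average of the group and pick the least similar one. The sketch below assumes that approach and uses hypothetical toy vectors. `odd_one_out` is an illustrative helper, not gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(units[w][k] for w in words) / len(words) for k in range(dim)]
    # For unit vectors, the dot product with the mean ranks closeness to the group.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical toy embeddings for illustration only.
toy_vectors = {
    "cersei": [0.9, 0.8, 0.1],
    "jaime":  [0.8, 0.9, 0.1],
    "tyrion": [0.85, 0.7, 0.2],
    "broom":  [0.1, 0.0, 0.9],
}

odd = odd_one_out(["cersei", "jaime", "broom", "tyrion"], toy_vectors)
print(odd)
```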


In [16]:
cbow_model.wv.similarity('arya','sansa')

0.8337543

In [17]:
cbow_model.wv.similarity('cersei','sansa')

0.6142799

### Characteristics
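- Trains faster than Skip-Gram, because each context window produces a single prediction
- Works well with large datasets
- Tends to represent frequent words better than rare ones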


## Skip-Gram

### Definition
**Skip-Gram** is a Word2Vec model that predicts the **surrounding context words** given a **target word**. Its goal is to learn word embeddings that represent **semantic and contextual relationships** well, especially for **rare words**.

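### Learning Objective

Skip-Gram adjusts the word vectors to maximize the probability of each context word inside the window, given the target word.

### Example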

Sentence: I love natural language processing <br>
If the **target word** is: language <br>
And the **context window size = 2**, the **context words** are: ["love", "natural", "processing"] <br>
Skip-Gram learns the mappings: <br>
"language" → "love" <br>
"language" → "natural" <br>
"language" → "processing" <br>

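The mappings above can be generated with a short function. This is a simplified illustration of how Skip-Gram forms one training pair per context word. The name `skipgram_pairs` is hypothetical and this is not gensim's internal implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Return a list of (target_word, context_word) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "i love natural language processing".split()
for target, context in skipgram_pairs(tokens, window=2):
    if target == "language":
        print(target, "->", context)
```

Each target word yields several pairs, whereas CBOW forms one prediction per target. This is one reason Skip-Gram takes longer to train on the same corpus.
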
In [18]:
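# Create an untrained Skip-Gram model with the same settings as CBOW.
# The vocabulary is built and the model is trained explicitly in the next
# cell; passing `sentences=story` here as well would repeat that work.
skipgram_model = Word2Vec(
    vector_size=100,
    window=5,  # same window as the CBOW model
    min_count=2,
    sg=1,    # Skip-gram
    workers=4
)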

In [19]:
# Build vocabulary and train Skip-Gram
skipgram_model.build_vocab(story)
skipgram_model.train(story, total_examples=skipgram_model.corpus_count, epochs=skipgram_model.epochs)


(6570360, 8628190)

In [20]:
print("\n" + "=" * 50)
print("Skip-Gram Model Results:")
print("=" * 50)


Skip-Gram Model Results:


In [21]:
print("\nSimilar to 'king':")
skipgram_model.wv.most_similar("king")


Similar to 'king':


[('landing', 0.7086208462715149),
 ('ii', 0.6590932607650757),
 ('jaehaerys', 0.6577804684638977),
 ('robert', 0.6571977734565735),
 ('pretender', 0.651711106300354),
 ('baratheon', 0.6515921950340271),
 ('tommen', 0.6410520076751709),
 ('joffrey', 0.6385347247123718),
 ('condemned', 0.633653998374939),
 ('proclaim', 0.6309674382209778)]

In [22]:
print("\nSimilar to 'queen':")
skipgram_model.wv.most_similar("queen")


Similar to 'queen':


[('cersei', 0.7103806734085083),
 ('margaery', 0.6944690942764282),
 ('regent', 0.6815361976623535),
 ('selyse', 0.6643837094306946),
 ('unburnt', 0.6469699740409851),
 ('hizdahr', 0.6462743878364563),
 ('niece', 0.624958872795105),
 ('joffrey', 0.6215522289276123),
 ('taena', 0.6164122819900513),
 ('sister', 0.6102229356765747)]

In [23]:
# Compare results
print("\n" + "=" * 50)
print("Comparison Summary:")
print("=" * 50)
print(f"Vocabulary size (CBOW): {len(cbow_vocab)}")
print(f"Vocabulary size (Skip-Gram): {len(skipgram_model.wv.index_to_key)}")
print(f"CBOW 'king' vector dimension: {cbow_vector.shape}")
print(f"Skip-Gram 'king' vector dimension: {skipgram_model.wv['king'].shape}")



Comparison Summary:
Vocabulary size (CBOW): 17453
Vocabulary size (Skip-Gram): 17453
CBOW 'king' vector dimension: (100,)
Skip-Gram 'king' vector dimension: (100,)


### Characteristics
- Slower than CBOW, because each target word produces one prediction per context word
- Works well with small datasets
- Tends to represent rare words better than CBOW

## CBOW vs Skip-Gram (Conceptual Comparison)

| Feature | CBOW | Skip-Gram |
|------|------|-----------|
| Prediction direction | Context → Target | Target → Context |
| Training speed | Faster | Slower |
| Performance on rare words | Lower | Better |
| Dataset size suitability | Large datasets | Small datasets |
| Computational cost | Lower | Higher |

## Word2Vec Training Mechanism (High-Level)

Both CBOW and Skip-Gram use:
- A shallow neural network
- One hidden (projection) layer with no non-linear activation
- Backpropagation with stochastic gradient descent
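
The forward pass of such a network can be sketched in plain Python. This is a minimal CBOW illustration, assuming a full softmax output, randomly initialized weights and a hypothetical embedding size of 4. The names `W_in`, `W_out` and `cbow_forward` are illustrative.

```python
import math
import random

random.seed(0)
vocab = ["i", "love", "natural", "language", "processing"]
dim = 4  # hypothetical embedding size

def random_matrix(rows, cols):
    """Matrix of small random weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_in = random_matrix(len(vocab), dim)   # input weights: one embedding per word
W_out = random_matrix(len(vocab), dim)  # output weights: one row per word

def cbow_forward(context_words):
    """Return a probability for every vocabulary word being the target."""
    idx = [vocab.index(w) for w in context_words]
    # Hidden layer: plain average of the context embeddings (no activation).
    hidden = [sum(W_in[i][k] for i in idx) / len(idx) for k in range(dim)]
    # Output layer: one score per vocabulary word, turned into probabilities.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["love", "natural", "processing"])
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Training would compare these probabilities with the true target word and update both weight matrices. After training, the rows of `W_in` serve as the word vectors.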

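To avoid computing a full softmax over the whole vocabulary at every step, training uses one of:
- **Negative Sampling** (gensim's default, controlled by the `negative` parameter)
- **Hierarchical Softmax** (enabled in gensim with `hs=1`)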
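
Negative sampling replaces the vocabulary-wide softmax with a handful of yes/no decisions: the true context word should score high, and a few randomly sampled "noise" words should score low. Below is a minimal sketch of the loss for a single training pair, using hypothetical 2-dimensional vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one (target, context) pair plus the sampled noise words."""
    # Reward a high score for the true context word.
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    # Reward a low score for each sampled noise word.
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, neg)))
    return loss

# Hypothetical vectors for illustration only.
target = [1.0, 0.5]
true_context = [0.9, 0.6]
noise_words = [[-0.8, 0.1], [0.0, -1.0]]

aligned = negative_sampling_loss(target, true_context, noise_words)
opposed = negative_sampling_loss(target, [-0.9, -0.6], noise_words)
print(aligned < opposed)
```

The cost per training pair depends on the number of noise words rather than on the vocabulary size, which is where the speed-up comes from.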
  
## When to Use CBOW vs Skip-Gram?

### Use CBOW when:
- Dataset is large
- Speed is important
- Focus is on frequent words

### Use Skip-Gram when:
- Dataset is small
- Rare words are important
- Embedding quality matters more than training time
  
## Summary

- Word2Vec converts words into dense vectors
- CBOW predicts target word from context
- Skip-Gram predicts context words from target
- Both capture semantic relationships effectively


<div style="text-align: right;">
    <b>Author:</b> Monower Hossen <br>
    <b>Date:</b> January 7, 2026
</div>
