# Word Embeddings in NLP

Word embedding is a technique in Natural Language Processing (NLP) that represents words as **dense numerical vectors**.  
These vectors capture **semantic meaning**, so words with similar meanings have similar vectors.

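Unlike traditional methods such as **Bag of Words** or **One-Hot Encoding**, word embeddings:
- Reduce dimensionality (a few hundred dimensions instead of one per vocabulary word)
- Preserve meaning learned from the contexts in which words appear
- Capture relationships between words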
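
To make the contrast concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the 2-d dense values are hypothetical, hand-picked numbers for illustration only; real embeddings are learned from data. One-hot vectors of different words are always orthogonal, so they carry no notion of similarity, while dense vectors can place related words close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# One-hot: one dimension per vocabulary word, with a single 1 per vector.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense: hypothetical hand-picked 2-d vectors (NOT learned from data).
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}

# Different one-hot vectors never overlap, so their similarity is zero.
print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0

# Dense vectors can encode that "cat" is closer to "dog" than to "car".
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```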

## What is Word2Vec?
Word2Vec is a popular **word embedding model** developed by Google.  
It converts words into vectors such that words with similar meanings are located close to each other in vector space.

### Word2Vec Architectures

#### 1. CBOW (Continuous Bag of Words)
- Predicts the **target word** from surrounding context words
- Faster to train
- Works well with large datasets

**Example:**
- Sentence: "I love natural language processing"
- Context: I, love, language, processing
- Target: natural
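
The example above can be reproduced with a short sketch that builds CBOW-style training pairs. The function name `cbow_pairs` and the window size of 2 are assumptions chosen to match the example; this only shows how (context, target) pairs are formed, not how the model is trained.

```python
def cbow_pairs(tokens, window=2):
    """Return a list of (context_words, target_word) pairs, one per position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        pairs.append((left + right, target))
    return pairs

tokens = "I love natural language processing".split()
pairs = cbow_pairs(tokens, window=2)

# The pair centred on "natural" matches the example above.
print(pairs[2])
# (['I', 'love', 'language', 'processing'], 'natural')
```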


#### 2. Skip-Gram
- Predicts **surrounding context words** from the target word
- Better for small datasets
- Captures rare words better

**Example:**
- Input: natural
- Output: I, love, language, processing
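
As a rough sketch, Skip-Gram training data can be viewed as one (input word, context word) pair per neighbour. The helper name `skipgram_pairs` and the window size of 2 are assumptions for illustration; the actual training procedure is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "I love natural language processing".split()
natural_pairs = [p for p in skipgram_pairs(tokens) if p[0] == "natural"]
print(natural_pairs)
# [('natural', 'I'), ('natural', 'love'), ('natural', 'language'), ('natural', 'processing')]
```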


## Key Characteristics of Word2Vec
- ✔ Captures semantic relationships
- ✔ Dense and efficient representations
- ✔ Supports vector arithmetic (e.g., King − Man + Woman ≈ Queen)

- ❌ Cannot handle **out-of-vocabulary (OOV)** words
- ❌ Each word has **only one vector** (no context awareness)
- ❌ Does not capture polysemy (multiple meanings)
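
The vector-arithmetic property can be illustrated with a toy sketch. The 2-d vectors below are hypothetical and hand-built so that one axis loosely stands for "royalty" and the other for "gender"; learned embeddings have no such labelled axes, and the relationship holds only approximately.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors: [royalty, gender]. Real embeddings are learned.
toy = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
}

# king - man + woman, computed element by element.
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Search among words that were not part of the query.
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # queen
```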

## Other Word Embedding Techniques

#### 1. GloVe (Global Vectors)
- Learns vectors from **global word co-occurrence statistics** gathered over the whole corpus
- Trained on a word-word co-occurrence matrix
- Performs well on word similarity and analogy tasks

**Example:**
Paris − France + Italy ≈ Rome
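
The word-word co-occurrence matrix that GloVe starts from can be sketched as a simple windowed count. The two-sentence corpus, the function name `cooccurrence_counts`, and the window size are assumptions for illustration. GloVe itself goes further (for example, it weights pairs by distance and then fits vectors to the counts); only the counting step is shown here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Tiny hypothetical corpus, already lowercased and tokenized.
corpus = [
    "paris is the capital of france".split(),
    "rome is the capital of italy".split(),
]
counts = cooccurrence_counts(corpus, window=2)

print(counts[("capital", "of")])  # 2
print(counts[("paris", "rome")])  # 0
```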

#### 2. FastText
- Developed by Facebook
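- Breaks words into **character n-grams** and builds each word vector from its n-gram vectors
- Handles spelling mistakes, rare words, and out-of-vocabulary words better

**Example** (character 3-grams, with `<` and `>` marking word boundaries):
`playing` → `<pl`, `pla`, `lay`, `ayi`, `yin`, `ing`, `ng>`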
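
A minimal sketch of character n-gram extraction in the style FastText uses, with `<` and `>` added as word-boundary markers. The function name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters is an assumption based on commonly cited FastText defaults.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# 3-grams only, to keep the output short.
print(char_ngrams("playing", n_min=3, n_max=3))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```

Because a misspelling such as "playng" still shares n-grams like `<pl` and `pla` with "playing", the two words end up with related vectors.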

#### 3. ELMo (Embeddings from Language Models)
- Generates **contextual embeddings**
- The same word gets a different vector depending on the sentence it appears in
- Uses a deep bidirectional LSTM language model

**Example:**
Bank (river bank) ≠ Bank (money bank)

#### 4. BERT Embeddings
- Uses transformer architecture
- Fully contextual embeddings
- Bidirectional understanding of text
- Used in modern NLP tasks like QA, chatbots, summarization


In [8]:
!pip install gensim






In [22]:
!pip install wordcloud

Collecting wordcloud
  Downloading wordcloud-1.9.6-cp313-cp313-win_amd64.whl.metadata (3.5 kB)
Downloading wordcloud-1.9.6-cp313-cp313-win_amd64.whl (306 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.6





In [9]:
import gensim

from gensim.models import Word2Vec, KeyedVectors

In [10]:
## References: https://stackoverflow.com/questions/46433778/import-googlenews-vectors-negative300-bin

In [12]:
import gensim.downloader as api

In [13]:
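# Downloads on first use (large file), then loads the pretrained 300-dimensional Google News vectors
wv = api.load('word2vec-google-news-300')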



In [14]:
# Look up the 300-dimensional vector for 'king'
vec_king = wv['king']

vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  
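
Looking up a word that is not in the vocabulary with `wv[...]` raises a `KeyError`, which is the OOV limitation noted earlier. A minimal sketch of a guarded lookup follows; the helper name `safe_vector` is hypothetical, and a plain dict stands in for `wv` so the block runs without downloading the model (both support `in` and `[]`).

```python
def safe_vector(wv, word):
    """Return the vector for `word`, or None if it is not in the vocabulary."""
    if word in wv:        # KeyedVectors supports the `in` operator
        return wv[word]
    return None

# Stand-in for the real `wv` object; values are made up for illustration.
fake_wv = {"king": [0.1, 0.2, 0.3]}

print(safe_vector(fake_wv, "king"))     # [0.1, 0.2, 0.3]
print(safe_vector(fake_wv, "kingggg"))  # None
```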

In [15]:
# Ten nearest words to 'man', ranked by cosine similarity
wv.most_similar('man')

[('woman', 0.7664012908935547),
 ('boy', 0.6824871301651001),
 ('teenager', 0.6586930155754089),
 ('teenage_girl', 0.6147903203964233),
 ('girl', 0.5921714305877686),
 ('suspected_purse_snatcher', 0.571636438369751),
 ('robber', 0.5585118532180786),
 ('Robbery_suspect', 0.5584409832954407),
 ('teen_ager', 0.5549196004867554),
 ('men', 0.5489763021469116)]

In [16]:
wv.most_similar('king')

[('kings', 0.7138045430183411),
 ('queen', 0.6510957479476929),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204220056533813),
 ('prince', 0.6159993410110474),
 ('sultan', 0.5864824056625366),
 ('ruler', 0.5797566771507263),
 ('princes', 0.5646551847457886),
 ('Prince_Paras', 0.5432944297790527),
 ('throne', 0.5422105193138123)]

In [17]:
# Cosine similarity between the two word vectors
wv.similarity('man', 'king')

np.float32(0.22942671)

In [18]:
wv.similarity('html', 'programming')

np.float32(0.21589944)

In [19]:
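# king - man + woman; the raw query vector stays close to 'king', so 'king' ranks first and 'queen' second
vec = wv['king'] - wv['man'] + wv['woman']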

wv.most_similar([vec])

[('king', 0.8449392318725586),
 ('queen', 0.7300518155097961),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676948547363),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613664388656616),
 ('sultan', 0.5376776456832886),
 ('Queen_Consort', 0.5344247221946716),
 ('queens', 0.5289887189865112)]
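
In the result above, 'king' ranks first because the query vector is still close to the original word. One common alternative is to pass the words through the `positive` and `negative` arguments of `most_similar`, which leaves the input words out of the results. The wrapper below is a sketch with a hypothetical name (`analogy`); it expects a gensim `KeyedVectors` object such as the `wv` loaded earlier. The scores it returns may differ slightly from the manual arithmetic, since gensim works with normalized vectors here.

```python
def analogy(wv, a, b, c, topn=3):
    """Return the words closest to (a - b + c), excluding a, b, and c.

    `wv` is expected to be a gensim KeyedVectors object.
    """
    return wv.most_similar(positive=[a, c], negative=[b], topn=topn)

# Usage with the vectors loaded earlier in this notebook (not run here):
# analogy(wv, 'king', 'man', 'woman')
```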