## **Word Embedding**

---

### **What is Word Embedding?**

**Word Embedding** is a way to represent words as **dense vectors of real numbers**.

Unlike Bag-of-Words (BoW) or TF-IDF vectors, which are sparse and high-dimensional, embeddings are low-dimensional and **capture meaning** and **relationships** between words.
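
To make the sparse-versus-dense contrast concrete, here is a minimal sketch. The five-word vocabulary and the 3-dimensional `dense` vectors are hand-picked for illustration (assumptions, not trained embeddings), and one-hot vectors stand in for the sparse, BoW-style case.

```python
# Toy vocabulary; a real vocabulary has many thousands of entries.
vocab = ["king", "queen", "man", "woman", "apple"]

def one_hot(word, vocab):
    """Sparse representation: one slot per vocabulary word, with a single 1."""
    return [1 if w == word else 0 for w in vocab]

# Hypothetical dense embeddings (hand-picked values, NOT trained).
dense = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "apple": [0.0, 0.5, 0.9],
}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Sparse vectors of two different words never overlap, so their dot product is 0.
print(one_hot("king", vocab))                                # [1, 0, 0, 0, 0]
print(dot(one_hot("king", vocab), one_hot("queen", vocab)))  # 0

# Dense vectors can overlap, so related words can score higher than unrelated ones.
print(dot(dense["king"], dense["queen"]) > dot(dense["king"], dense["apple"]))  # True
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed, small length no matter how many words there are.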

---

### **Why Use Word Embedding?**

* Captures **semantic meaning** (e.g., "king" and "queen" are related).
* Words with similar meanings have **similar vector representations**.
* Gives deep learning models a compact input: a few hundred dimensions instead of one dimension per vocabulary word.

---

### **Real Example:**

In a well-trained embedding space:

* `vec("king") - vec("man") + vec("woman") ≈ vec("queen")`
* This shows that directions in the vector space encode relationships such as **gender** and **role**.

---

### **Common Word Embedding Models**

| Model    | Description                                                    |
| -------- | -------------------------------------------------------------- |
| Word2Vec | Learns word vectors by predicting words from local context     |
| GloVe    | Learns vectors from global word co-occurrence statistics       |
| FastText | Like Word2Vec, but builds vectors from character n-grams       |
| BERT     | Contextual embeddings: a word's vector depends on its sentence |

---

### **Summary**

| Aspect      | Details                                        |
| ----------- | ---------------------------------------------- |
| Vector type | Dense (real-valued, low-dimensional)           |
| Advantage   | Captures meaning, relationships                |
| Use cases   | NLP, sentiment analysis, chatbots, translation |

---


> Embedding = a list of numbers whose position in vector space reflects how the word is used.

## **1. Word2Vec**

---

### **What is Word2Vec?**

**Word2Vec** is a popular word embedding technique that converts words into dense vectors.
It learns these vectors by predicting words based on their surrounding words (context).

---

### **Why Use Word2Vec?**

* Captures **semantic relationships** between words.
* Words used in similar contexts get **similar vectors**.
* Helps machines understand meaning, not just word counts.

---

### **How Does Word2Vec Work?**

Two main models:

* **CBOW (Continuous Bag of Words):** Predicts the center word from its context (the neighboring words).
* **Skip-gram:** Predicts the surrounding context words from the center word.
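
The difference between the two models is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples; the helper name `training_pairs`, the sample sentence, and the window size of 2 are illustrative assumptions, and real Word2Vec training adds steps that are not shown here (for example negative sampling).

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, center_word) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center

sentence = "the quick brown fox jumps".split()

# CBOW: all context words together -> predict the single center word.
cbow = [(context, center) for context, center in training_pairs(sentence)]

# Skip-gram: the center word -> predict each context word separately.
skipgram = [
    (center, ctx)
    for context, center in training_pairs(sentence)
    for ctx in context
]

print(cbow[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram[:2])  # [('the', 'quick'), ('the', 'brown')]
```

CBOW produces one example per position, while Skip-gram produces one example per (center, context) pair, so it sees more examples from the same text.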


---

### **Summary**

| Aspect    | Details                                             |
| --------- | --------------------------------------------------- |
| Goal      | Learn word vectors capturing context                |
| Models    | CBOW and Skip-gram                                  |
| Output    | Dense vector for each word                          |
| Use cases | NLP tasks like sentiment analysis, search, chatbots |

---


In [None]:
# Step 1: Install the Gensim library (only needs to be done once).
# Uncomment the next line to run the install from inside the notebook.
# %pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp310-cp310-win_amd64.whl.metadata (8.2 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Using cached numpy-1.26.4-cp310-cp310-win_amd64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp310-cp310-win_amd64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp310-cp310-win_amd64.whl (24.0 MB)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bnltk 0.7.8 requires numpy==2.0.2, but you have numpy 1.26.4 which is incompatible.


In [2]:
# Step 2: Import necessary modules
import gensim # Main library
from gensim.models import Word2Vec, KeyedVectors  # model class and the vector-lookup class

In [3]:
# Step 3: Import Gensim's downloader to load pre-trained models easily
import gensim.downloader as api

# Step 4: Load the Google News Word2Vec model (pre-trained on ~100 billion words)
# This is a large model (~1.6 GB); the first call downloads it, so loading takes a while
model = api.load('word2vec-google-news-300')

In [4]:
model["King"]

array([-0.00350952,  0.01623535, -0.08154297,  0.12792969,  0.11230469,
       -0.00534058,  0.03063965,  0.04931641,  0.22070312,  0.07373047,
       -0.13769531,  0.16210938,  0.02148438, -0.09375   , -0.12792969,
       -0.12402344, -0.11132812,  0.11816406, -0.07861328,  0.25390625,
        0.01794434,  0.14160156,  0.0612793 , -0.08691406,  0.07763672,
        0.05175781, -0.24609375, -0.17578125,  0.14746094,  0.06640625,
       -0.03833008, -0.09033203, -0.07226562,  0.09375   , -0.18847656,
        0.06347656,  0.24121094,  0.00714111, -0.30273438, -0.02478027,
       -0.09619141, -0.30859375, -0.06054688,  0.22167969,  0.07763672,
        0.05834961,  0.15527344, -0.13476562, -0.00341797, -0.13964844,
       -0.02905273,  0.03833008, -0.15332031, -0.20996094,  0.21679688,
        0.01171875, -0.078125  ,  0.00402832, -0.23535156, -0.10400391,
        0.08837891,  0.25976562,  0.02709961,  0.01123047,  0.12988281,
       -0.11914062, -0.07861328, -0.04736328, -0.06591797,  0.07

In [6]:
model["King"].shape

(300,)

In [7]:
model["Queen"]

array([-0.22070312, -0.17480469, -0.10498047,  0.2578125 ,  0.16210938,
       -0.13085938, -0.16699219,  0.07373047, -0.07226562,  0.02404785,
       -0.13964844,  0.02197266,  0.17675781, -0.19140625,  0.0378418 ,
       -0.01782227, -0.03710938, -0.03735352,  0.15625   ,  0.08837891,
        0.0534668 , -0.02392578, -0.2734375 , -0.2578125 , -0.00720215,
        0.06933594, -0.21777344, -0.10058594,  0.2421875 ,  0.03417969,
       -0.12890625, -0.1171875 , -0.18261719,  0.04321289, -0.125     ,
       -0.09960938,  0.26367188,  0.375     , -0.32421875, -0.1328125 ,
       -0.13378906, -0.50390625, -0.05908203,  0.04077148,  0.23730469,
       -0.03393555, -0.01495361, -0.09765625, -0.06445312,  0.02087402,
       -0.10302734,  0.10449219,  0.20019531, -0.16503906, -0.01196289,
        0.30859375, -0.41015625, -0.22070312,  0.08056641, -0.12792969,
        0.13085938,  0.28515625, -0.07275391,  0.02612305,  0.01916504,
       -0.16992188,  0.01745605,  0.13085938, -0.17089844, -0.10

In [9]:
model.most_similar('man')

[('woman', 0.7664011716842651),
 ('boy', 0.6824871301651001),
 ('teenager', 0.6586929559707642),
 ('teenage_girl', 0.6147903203964233),
 ('girl', 0.5921714305877686),
 ('suspected_purse_snatcher', 0.5716364979743958),
 ('robber', 0.5585119128227234),
 ('Robbery_suspect', 0.5584410429000854),
 ('teen_ager', 0.5549196600914001),
 ('men', 0.5489761233329773)]

Here's **how** `most_similar` works, in two steps:

---

### **Step 1:**

The model looks up the **300-dimensional vector** stored for `'man'` (shown here schematically):

```
'man' → [0.134, -0.298, ..., 0.072]
```

---

### **Step 2:**

It **compares** this vector to every other word's vector in the vocabulary using **cosine similarity**.
The words with the **highest similarity scores** (here `'woman'`, `'boy'`, `'teenager'`) are returned as **most similar**.
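
The two steps above can be sketched in plain Python. The 3-dimensional `toy_vectors` are hypothetical stand-ins for the real 300-dimensional vectors, and this `most_similar` function is an illustrative helper, not Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors (hand-picked, NOT taken from the Google News model).
toy_vectors = {
    "man":   [0.9, 0.1, 0.2],
    "woman": [0.8, 0.3, 0.2],
    "boy":   [0.7, 0.1, 0.5],
    "food":  [0.0, 0.9, 0.1],
}

def most_similar(word, vectors, topn=3):
    """Step 1: look up the query vector. Step 2: rank all other words by cosine."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is skipped
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for neighbor, score in most_similar("man", toy_vectors):
    print(f"{neighbor}: {score:.3f}")
```

Cosine similarity depends only on the angle between two vectors, so a long vector and a short vector pointing the same way still score 1.0.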

---

In [10]:
model.most_similar('cricket')

[('cricketing', 0.8372227549552917),
 ('cricketers', 0.8165745735168457),
 ('Test_cricket', 0.8094819188117981),
 ('Twenty##_cricket', 0.8068487048149109),
 ('Twenty##', 0.762426495552063),
 ('Cricket', 0.7541398406028748),
 ('cricketer', 0.7372578382492065),
 ('twenty##', 0.7316358685493469),
 ('T##_cricket', 0.7304614186286926),
 ('West_Indies_cricket', 0.698798656463623)]

In [18]:
model.most_similar('facebook')

[('Facebook', 0.7563531398773193),
 ('FaceBook', 0.7076998949050903),
 ('twitter', 0.6988552212715149),
 ('myspace', 0.6941818594932556),
 ('Twitter', 0.6642445921897888),
 ('twitter_facebook', 0.6572229862213135),
 ('Facebook.com', 0.6529869437217712),
 ('myspace_facebook', 0.6370644569396973),
 ('facebook_twitter', 0.6367619633674622),
 ('linkedin', 0.6356592178344727)]

In [19]:
model.similarity('man','woman')

0.7664013

In [20]:
model.similarity('man','food')

0.046399325

In [22]:
model.similarity('woman','food')

0.09206605

In [44]:
model.doesnt_match(['PHP','java','monkey'])

'monkey'

In [55]:
model.doesnt_match(['python','java','monkey'])

'java'
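
An odd-one-out query like `doesnt_match` can be understood roughly as follows: scale every word vector to unit length, average them, and return the word least similar to that average. The sketch below follows that idea with hypothetical 2-dimensional vectors; `odd_one_out` is an illustrative helper, and Gensim's own implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group's mean vector."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(normed[words[0]])
    mean = [sum(normed[w][i] for w in words) / len(words) for i in range(dim)]
    # Dot product with the mean acts as a "how central is this word" score.
    centrality = {w: sum(a * b for a, b in zip(normed[w], mean)) for w in words}
    return min(centrality, key=centrality.get)

# Hypothetical vectors: first axis ~ "programming", second axis ~ "animal".
toy_vectors = {
    "PHP":    [0.9, 0.1],
    "java":   [0.8, 0.2],
    "monkey": [0.1, 0.9],
}

print(odd_one_out(["PHP", "java", "monkey"], toy_vectors))  # monkey
```

This view also suggests a likely reason the second query above returns `'java'`: if this news-trained model places `'python'` closer to animals than to programming languages, then `'python'` and `'monkey'` form the majority and `'java'` ends up farthest from the group average.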

In [52]:
model.similarity('python','snake')

0.66062933

In [53]:
model.similarity('python','programming')

0.09035955

In [58]:
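# Analogy by vector arithmetic: king - man + woman
vec = model['king'] - model['man'] + model['woman']
# A raw vector carries no word labels, so the input words are not excluded from the results
model.most_similar([vec])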

[('king', 0.8449392914772034),
 ('queen', 0.730051577091217),
 ('monarch', 0.6454662084579468),
 ('princess', 0.6156250834465027),
 ('crown_prince', 0.5818676948547363),
 ('prince', 0.5777117013931274),
 ('kings', 0.561366617679596),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247221946716),
 ('queens', 0.5289887189865112)]
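
The result above lists `'king'` first because the query is a raw vector, so the search cannot tell which words went into it and excludes none of them. Gensim's keyword form, `model.most_similar(positive=['king', 'woman'], negative=['man'])`, leaves the input words out of the results. The sketch below reproduces that effect with hypothetical 2-dimensional vectors; the `analogy` helper and its `exclude_inputs` flag are illustrative names, not part of Gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king":   [1.0, 0.3],
    "queen":  [0.8, -0.4],
    "prince": [0.7, 0.5],
    "man":    [0.1, 0.2],
    "woman":  [0.1, -0.1],
}

def analogy(vectors, positive, negative, topn=2, exclude_inputs=True):
    """Add the `positive` vectors, subtract the `negative` ones, rank words by cosine."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    skip = set(positive) | set(negative) if exclude_inputs else set()
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

raw = analogy(toy_vectors, ["king", "woman"], ["man"], exclude_inputs=False)
filtered = analogy(toy_vectors, ["king", "woman"], ["man"])
print([w for w, _ in raw])       # ['king', 'queen']
print([w for w, _ in filtered])  # ['queen', 'prince']
```

In this toy data, as in the notebook output, the offset `woman - man` is small compared with the `king` vector itself, so the result stays closest to `'king'` unless the input words are filtered out.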

In [None]:
# Currency analogy: Taka - Bangladesh + India
vec = model['Taka'] - model['Bangladesh'] + model['India']
model.most_similar([vec])

[('INR', 0.7877920269966125),
 ('Tk', 0.6473284363746643),
 ('taka', 0.6242390275001526),
 ('Rs3', 0.5809125900268555),
 ('Rs###', 0.5680422782897949),
 ('Rs##', 0.5675228238105774),
 ('Rs', 0.5669804215431213),
 ('Rs1', 0.5598682761192322),
 ('Rs.####', 0.5585658550262451),
 ('Bangladesh', 0.5535844564437866)]