# Word2Vec

- A shallow neural-network model used for text representation: it maps each word to a numerical vector (word embedding), so that relationships between words can be computed from the geometry of those vectors.

- The key idea is that **words appearing in similar contexts have similar meanings.**

Example:


```
“The king lives in a palace” and “The queen lives in a palace.”
Both king and queen appear in similar contexts → hence their vectors will be close in space.
```



Word2Vec doesn’t memorize words — it learns word meaning from large text corpora using one of two architectures:

1. **CBOW (Continuous Bag of Words):**

   - Predicts the current word based on its context words.
   - Example: From “the cat sat on the ___”, it predicts “mat”.

2. **Skip-Gram:**

   - Does the opposite: predicts context words from the current word.
   - Example: From “cat”, it predicts nearby words like “the”, “sat”, “on”.

Both models use a shallow neural network with one hidden layer to learn embeddings.

    CBOW:       [Context Words]                  [Target Word]
                 the, sat, on     ───▶ NN ───▶    cat

    Skip-Gram:  [Target Word]                    [Context Words]
                 cat              ───▶ NN ───▶    the, sat, on
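
To make the two architectures concrete, the sketch below builds the training examples each one would see for a short sentence. The function names `cbow_examples` and `skipgram_pairs` and the window size of 2 are illustrative choices, not part of gensim's API. Real implementations add details such as subsampling of frequent words, which this sketch leaves out.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) examples, as CBOW would see them."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after
        examples.append((left + right, target))
    return examples


def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs, as skip-gram would see them."""
    pairs = []
    for context, center in cbow_examples(tokens, window):
        pairs.extend((center, word) for word in context)
    return pairs


sentence = "the cat sat on the mat".split()

# CBOW: the surrounding words are used to predict the word in the middle.
print(cbow_examples(sentence)[1])     # (['the', 'sat', 'on'], 'cat')

# Skip-gram: the center word is used to predict each surrounding word.
print(skipgram_pairs(sentence)[2:5])  # [('cat', 'the'), ('cat', 'sat'), ('cat', 'on')]
```

Both functions describe the same windows; only the direction of prediction differs.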


In [None]:
!pip install gensim --upgrade --quiet


In [None]:
import gensim
from gensim.models import Word2Vec,KeyedVectors

In [None]:
!pip install opendatasets --upgrade --quiet

In [None]:
import os
import opendatasets as od
import numpy as np
import pandas as pd

In [None]:
od.download('https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus/data')

Dataset URL: https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus
Downloading children-stories-text-corpus.zip to ./children-stories-text-corpus


100%|██████████| 7.26M/7.26M [00:00<00:00, 952MB/s]







In [None]:
os.listdir('children-stories-text-corpus')

['cleaned_merged_fairy_tales_without_eos.txt']

In [None]:
!head -n 20 children-stories-text-corpus/cleaned_merged_fairy_tales_without_eos.txt

The Happy Prince.
HIGH above the city, on a tall column, stood the statue of the Happy Prince.  He was gilded all over with thin leaves of fine gold, for eyes he had two bright sapphires, and a large red ruby glowed on his sword-hilt.
He was very much admired indeed.  “He is as beautiful as a weathercock,” remarked one of the Town Councillors who wished to gain a reputation for having artistic tastes; “only not quite so useful,” he added, fearing lest people should think him unpractical, which he really was not.
“Why can’t you be like the Happy Prince?” asked a sensible mother of her little boy who was crying for the moon.  “The Happy Prince never dreams of crying for anything.”
“I am glad there is some one in the world who is quite happy,” muttered a disappointed man as he gazed at the wonderful statue.
“He looks just like an angel,” said the Charity Children as they came out of the cathedral in their bright scarlet cloaks and their clean white pinafores.
“How do you know?” said the M

**simple_preprocess(text)** is a utility function from the `gensim.utils` module.
>> Example: "Hello!!! This is, an example... 123 :)"

It performs basic preprocessing:
1. Lowercases all text
   * "hello!!! this is, an example... 123 :)"
2. Removes punctuation, symbols, and special characters
   * "hello this is an example 123"
3. Drops digits (the default tokenizer keeps only alphabetic characters)
   * "hello this is an example"
4. Splits text into tokens (words)
   * ["hello", "this", "is", "an", "example"]
5. Removes very short and very long tokens
   * Default: keeps only tokens with length ≥ 2 and ≤ 15 characters
     (can be changed using the `min_len` and `max_len` parameters). This is why the single-letter word "a" does not appear in the tokenized corpus.
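
The steps above can be imitated with a few lines of plain Python. This is only a rough stand-in written for illustration, not gensim's actual code: the name `approx_simple_preprocess` and the regular expression are assumptions, and edge cases (accents, underscores, byte strings) may behave differently from the real function.

```python
import re

# Runs of "word" characters that are not digits, roughly what the tokenizer keeps.
_ALPHA_RUN = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough imitation of gensim.utils.simple_preprocess with default settings."""
    tokens = _ALPHA_RUN.findall(text.lower())                   # lowercase, then split
    return [t for t in tokens if min_len <= len(t) <= max_len]  # length filter


print(approx_simple_preprocess("Hello!!! This is, an example... 123 :)"))
# ['hello', 'this', 'is', 'an', 'example']

print(approx_simple_preprocess("I saw a cat"))
# ['saw', 'cat']
```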

In [None]:
from gensim.utils import simple_preprocess

In [None]:
corpus = []
with open("children-stories-text-corpus/cleaned_merged_fairy_tales_without_eos.txt", "r", encoding="utf-8") as f:
    text = f.read()                   # read the entire file as a single string
    tokens = simple_preprocess(text)  # lowercasing + basic cleaning
    corpus.append(tokens)             # the whole file becomes ONE token list
corpus

[['the',
  'happy',
  'prince',
  'high',
  'above',
  'the',
  'city',
  'on',
  'tall',
  'column',
  'stood',
  'the',
  'statue',
  'of',
  'the',
  'happy',
  'prince',
  'he',
  'was',
  'gilded',
  'all',
  'over',
  'with',
  'thin',
  'leaves',
  'of',
  'fine',
  'gold',
  'for',
  'eyes',
  'he',
  'had',
  'two',
  'bright',
  'sapphires',
  'and',
  'large',
  'red',
  'ruby',
  'glowed',
  'on',
  'his',
  'sword',
  'hilt',
  'he',
  'was',
  'very',
  'much',
  'admired',
  'indeed',
  'he',
  'is',
  'as',
  'beautiful',
  'as',
  'weathercock',
  'remarked',
  'one',
  'of',
  'the',
  'town',
  'councillors',
  'who',
  'wished',
  'to',
  'gain',
  'reputation',
  'for',
  'having',
  'artistic',
  'tastes',
  'only',
  'not',
  'quite',
  'so',
  'useful',
  'he',
  'added',
  'fearing',
  'lest',
  'people',
  'should',
  'think',
  'him',
  'unpractical',
  'which',
  'he',
  'really',
  'was',
  'not',
  'why',
  'can',
  'you',
  'be',
  'like',
  'the',
  'hap

In [None]:
len(corpus[0])  # corpus holds a single token list, so index 0 is the only valid index

3680620
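
The count above means the whole file ended up as one "sentence" of about 3.68 million tokens. gensim's documentation for the `batch_words` parameter notes that the standard optimized code truncates individual texts longer than 10,000 words, so a model trained on this corpus may see only a small part of it. One possible workaround, sketched here under that assumption, is to split the token list into shorter pieces before training. `chunk_tokens` is a hypothetical helper written for this example; tokenizing the file line by line would be another option.

```python
def chunk_tokens(tokens, max_len=10_000):
    """Split one long token list into consecutive pieces of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# Small demonstration with a stand-in token list.
demo = [f"w{i}" for i in range(25)]
pieces = chunk_tokens(demo, max_len=10)
print([len(p) for p in pieces])  # [10, 10, 5]

# In this notebook it could be applied as (assumes `corpus` from the cell above):
# corpus = chunk_tokens(corpus[0])
```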

In [None]:
model = Word2Vec(
    corpus,
    vector_size=100,   # number of features (dimensions) per word
    window=5,          # context window size
    # min_count=2,     # uncomment to ignore words with total frequency < 2 (gensim's default is 5)
    workers=4,         # number of worker threads for parallel training
    sg=1,              # 1 = skip-gram, 0 = CBOW
)
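
As an illustration of what `min_count` does, the sketch below counts word frequencies and keeps only the words that reach the threshold. `filter_vocab` is a hypothetical helper written for this example, not gensim's internal vocabulary builder, and the toy corpus is invented.

```python
from collections import Counter


def filter_vocab(sentences, min_count=5):
    """Keep only words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


toy_corpus = [
    ["the", "king", "lives", "in", "palace"],
    ["the", "queen", "lives", "in", "palace"],
]
print(sorted(filter_vocab(toy_corpus, min_count=2)))
# ['in', 'lives', 'palace', 'the']
```

With `min_count=2`, "king" and "queen" are dropped because each occurs only once, so no vector would be learned for them.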

### Retrieve the **vector representation (embedding) for a word.**

In [None]:
model.wv['prince']

array([-0.22754519,  0.3063115 ,  0.0382118 , -0.05236911, -0.20472343,
       -0.6070588 ,  0.09126571,  0.58525133, -0.12378562, -0.0853784 ,
        0.12877752, -0.24660671,  0.00606862,  0.30062127, -0.14095794,
       -0.20844589,  0.04251022, -0.28501993, -0.10335243, -0.45639557,
        0.17106639,  0.188978  ,  0.22557887, -0.04286993,  0.04045994,
       -0.25076494,  0.0466596 , -0.23334664, -0.14018387, -0.00636684,
        0.32959133,  0.03955339,  0.20541003, -0.5406935 , -0.10826647,
        0.32365212, -0.05457682, -0.36740538, -0.229211  , -0.16469744,
        0.31275383, -0.41529617, -0.37512138,  0.08593159, -0.00380681,
       -0.20444186, -0.00536097, -0.0656048 ,  0.02299607,  0.04295088,
        0.40621865, -0.47017196, -0.04473781, -0.03056645, -0.37563738,
       -0.06676381,  0.27893397,  0.10331435, -0.1523745 ,  0.14804012,
        0.10001728, -0.00878723, -0.05425403,  0.16616473, -0.09741545,
        0.31294623,  0.10453777,  0.08696328, -0.46652535,  0.45


```
model.wv.most_similar('prince')
```
- Calculates **cosine similarity between the vector for "prince" and all other word vectors in the vocabulary.**
- It then returns the top 10 most similar words by default.
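
A minimal NumPy sketch of that ranking, using invented 2-dimensional vectors instead of the trained model. `top_similar` is a hypothetical name; gensim's own implementation differs in its details.

```python
import numpy as np


def top_similar(word, vocab, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    # Normalize each row to unit length so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)  # indices sorted from most to least similar
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]


vocab = ["prince", "princess", "castle", "frog"]
vectors = np.array([
    [1.0, 0.0],   # prince
    [0.9, 0.1],   # princess
    [0.5, 0.5],   # castle
    [0.0, 1.0],   # frog
])
print([w for w, _ in top_similar("prince", vocab, vectors, topn=2)])
# ['princess', 'castle']
```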

In [None]:
model.wv.most_similar('prince')

[('miller', 0.9989899396896362),
 ('said', 0.9989796876907349),
 ('on', 0.9989082217216492),
 ('cried', 0.9988980889320374),
 ('to', 0.9988819360733032),
 ('see', 0.9988539218902588),
 ('then', 0.998810887336731),
 ('thing', 0.9987981915473938),
 ('what', 0.9987908005714417),
 ('are', 0.9987835884094238)]

In [None]:
model.wv.most_similar('happy')

[('how', 0.9992204904556274),
 ('don', 0.9992066621780396),
 ('not', 0.9992014169692993),
 ('what', 0.9991706013679504),
 ('quite', 0.9991419315338135),
 ('indeed', 0.9991416335105896),
 ('answered', 0.999141275882721),
 ('now', 0.9991400241851807),
 ('like', 0.9991389513015747),
 ('on', 0.9991325736045837)]


```
model.wv.similarity('prince', 'happy')
```
Calculates the **cosine similarity between the two embeddings**, a measure of how semantically close the two words are. The formula is:

```
similarity(a, b) = (a · b) / (∥a∥ × ∥b∥)
```
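The result lies between -1 and 1. As a rough rule of thumb (the exact thresholds depend on the model and the corpus):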

| Value     | Meaning                                |
| --------- | -------------------------------------- |
| `1.0`     | identical meaning (rare)               |
| `0.6–0.8` | strongly related                       |
| `0.3–0.5` | loosely related                        |
| `0.0`     | unrelated                              |
| `< 0`     | opposite direction (rare, often noise) |
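
A small worked example of the formula with NumPy. The vectors are made up so that the arithmetic can be checked by hand.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])

# a . b = 2 + 2 + 4 = 8, and both lengths are sqrt(9) = 3, so the result is 8 / 9.
print(round(cosine_similarity(a, b), 3))   # 0.889

# A vector and its negation point in opposite directions.
print(round(cosine_similarity(a, -a), 3))  # -1.0
```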





In [None]:
model.wv.similarity('prince','happy')

np.float32(0.99863446)

In [None]:
model.wv.similarity('prince','fairy')

np.float32(0.2217394)

#### Finding **odd one out**

In [None]:
model.wv.doesnt_match(['prince','fairy','happy'])

'fairy'
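
A rough sketch of the idea behind `doesnt_match`, assuming it picks the word whose unit vector is least similar to the mean of the group. `odd_one_out` and the toy vectors are invented for illustration and are not gensim's code.

```python
import numpy as np


def odd_one_out(words, vectors):
    """Return the word least similar to the normalized mean of all the vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    center = unit.mean(axis=0)
    center = center / np.linalg.norm(center)
    sims = unit @ center  # cosine similarity of each word to the group center
    return words[int(np.argmin(sims))]


words = ["prince", "king", "banana"]
vectors = np.array([
    [1.0, 0.1],   # prince
    [0.9, 0.2],   # king
    [0.1, 1.0],   # banana
])
print(odd_one_out(words, vectors))  # banana
```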

#### **Word2Vec analogy equation**



In [None]:
vec = model.wv['king'] - model.wv['man'] + model.wv['woman']
model.wv.most_similar([vec])

[('king', 0.9961283802986145),
 ('are', 0.9951633214950562),
 ('all', 0.994941234588623),
 ('with', 0.9948406219482422),
 ('don', 0.9947998523712158),
 ('my', 0.9947931170463562),
 ('answered', 0.9947691559791565),
 ('should', 0.9947195053100586),
 ('which', 0.9946994781494141),
 ('to', 0.9946808815002441)]
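
In the output above, `'king'` itself is the top hit: querying with a raw vector does not exclude the input words. gensim's `most_similar(positive=['king', 'woman'], negative=['man'])` form leaves the input words out of the results. The sketch below shows why that filtering matters, using invented 3-dimensional vectors. `analogy` is a hypothetical helper, and it uses plain vector arithmetic rather than gensim's exact procedure.

```python
import numpy as np

# Invented 3-d vectors, for illustration only.
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 0.6, 0.0]),
    "woman": np.array([0.0, 0.0, 0.3]),
}


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(vectors, a, b, c, exclude_inputs=True):
    """Return the word closest to vec(a) - vec(b) + vec(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    skip = {a, b, c} if exclude_inputs else set()
    candidates = [w for w in vectors if w not in skip]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy(toy, "king", "man", "woman", exclude_inputs=False))  # king
print(analogy(toy, "king", "man", "woman"))                        # queen
```

Without the filter, the query word usually wins because the target vector stays close to it.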