# Word2Vec / GloVe Exploration

This notebook loads pre-trained **GloVe 100d** vectors (converted to Word2Vec format and saved as a gensim `.kv` file) and runs a few classic vector operations: vector lookup, nearest neighbors, odd-one-out, and analogy.

**Expected file:** `lecture1/word2vec/glove.6B.100d.kv`  
If the file is not found, run the download script in a terminal:
```
python lecture1/word2vec/word2vec_download.py
```
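
The contents of `word2vec_download.py` are not shown in this notebook. As a rough sketch of what such a conversion step could look like, assuming the raw GloVe text file `glove.6B.100d.txt` has already been downloaded and unzipped, gensim (4.0 or later) can read the header-less GloVe text format and save it as a `.kv` file. The function name and default paths below are hypothetical and may not match the actual script.

```python
from pathlib import Path


def convert_glove_to_kv(txt_path="glove.6B.100d.txt", kv_path="glove.6B.100d.kv"):
    """Hypothetical helper: convert a GloVe text file into a gensim .kv file."""
    txt_path, kv_path = Path(txt_path), Path(kv_path)
    if not txt_path.exists():
        raise FileNotFoundError(f"GloVe text file not found: {txt_path.resolve()}")

    # Imported here so the function can be defined before gensim is installed.
    from gensim.models import KeyedVectors

    # GloVe text files lack the "<vocab_size> <dim>" header line that the
    # word2vec text format starts with, hence no_header=True.
    kv = KeyedVectors.load_word2vec_format(str(txt_path), binary=False, no_header=True)
    kv.save(str(kv_path))
    return kv_path
```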

**Notes**
- The GloVe `glove.6B` vocabulary is **lowercased**, so use lowercase tokens (e.g., `king`, `father`, `mango`).
- Library: `gensim` (`KeyedVectors`).
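
Because the vocabulary is lowercased, a capitalized query such as `King` would be reported as missing. A minimal sketch of normalizing tokens before lookup follows. The `lookup` helper and the toy dictionary standing in for `wv` are illustrative assumptions, not part of gensim.

```python
# Toy stand-in for the real KeyedVectors object; the values are made up.
toy_vectors = {"king": [0.1, 0.2], "mango": [0.3, 0.4]}


def lookup(token, vectors):
    """Return the vector for a token after lowercasing it, or None if absent."""
    key = token.strip().lower()
    # `in` and `[]` are the same operations the notebook uses on `wv`.
    return vectors[key] if key in vectors else None


print(lookup("King", toy_vectors))   # found after lowercasing
print(lookup("Zebra", toy_vectors))  # not in the toy vocabulary
```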

In [2]:
import os
print("Current working dir:", os.getcwd())


Current working dir: c:\Users\murth\Desktop\NLP\Code-Base\session01\lecture1\word2vec


In [3]:
from pathlib import Path
print([p.name for p in Path('.').glob('*.kv')])


['glove.6B.100d.kv']


In [4]:
from pathlib import Path
from gensim.models import KeyedVectors

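# Relative path: the working directory is already ...\lecture1\word2vec
KV_PATH = Path("glove.6B.100d.kv")
if not KV_PATH.exists():
    raise FileNotFoundError(
        f"Not found: {KV_PATH.resolve()}. Run word2vec_download.py first."
    )

# mmap="r" memory-maps the vector array read-only instead of reading it all into RAM
wv = KeyedVectors.load(str(KV_PATH), mmap="r")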
print("Loaded:", KV_PATH.resolve())
print("Vocab size:", len(wv.index_to_key), "Vector size:", wv.vector_size)


Loaded: C:\Users\murth\Desktop\NLP\Code-Base\session01\lecture1\word2vec\glove.6B.100d.kv
Vocab size: 400000 Vector size: 100


In [5]:
# Print the vector of a word ("king")
word = "king"
if word in wv:
    vec = wv[word]
    print(f"Vector for '{word}' (len={len(vec)}). First 10 dims:\n{vec[:10]}")
else:
    print(f"'{word}' not in vocabulary.")

Vector for 'king' (len=100). First 10 dims:
[-0.32307 -0.87616  0.21977  0.25268  0.22976  0.7388  -0.37954 -0.35307
 -0.84369 -1.1113 ]


In [9]:
# Top-10 nearest neighbors of "father", ranked by cosine similarity
query = "father"
if query in wv:
    print(f"Most similar to '{query}':")
    for w, score in wv.most_similar(query, topn=10):
        print(f"{w:>15s}  {score:.4f}")
else:
    print(f"'{query}' not in vocabulary.")

Most similar to 'father':
            son  0.9240
        brother  0.9225
    grandfather  0.8828
         mother  0.8657
          uncle  0.8647
           wife  0.8441
        husband  0.8431
       daughter  0.8397
         friend  0.8364
         cousin  0.8158
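
The scores printed by `most_similar` are cosine similarities. A minimal pure-Python sketch of that measure on made-up 3-dimensional vectors is shown here. The `cosine` helper and the toy values are assumptions for illustration; the real vectors have 100 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: "father" and "son" point in similar directions.
toy = {
    "father": [0.9, 0.8, 0.1],
    "son": [0.8, 0.9, 0.2],
    "computer": [0.1, 0.0, 0.9],
}

for other in ("son", "computer"):
    print(f"{other:>15s}  {cosine(toy['father'], toy[other]):.4f}")
```

With the loaded vectors, `wv.similarity("father", "son")` returns the same kind of score for a single pair.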


In [10]:
# Odd one out among [mango, orange, apple, computer]
words = ["mango", "orange", "apple", "computer"]
present = [w for w in words if w in wv]
print("Input words:", words)
print("In vocab   :", present)

if len(present) >= 2:
    try:
        odd = wv.doesnt_match(present)
        print("Odd one out:", odd)
    except Exception as e:
        print("Could not compute odd-one-out:", e)
else:
    print("Not enough words found in vocab.")

Input words: ['mango', 'orange', 'apple', 'computer']
In vocab   : ['mango', 'orange', 'apple', 'computer']
Odd one out: computer
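
Roughly speaking, `doesnt_match` scales each vector to unit length, averages them, and returns the word least similar to that average. The sketch below imitates that idea on made-up 2-dimensional vectors. `odd_one_out` and the toy values are hypothetical, and the library's exact computation may differ in detail.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # For unit vectors, the dot product equals the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Made-up vectors: three "fruit-like" directions and one outlier.
toy = {
    "mango": [0.9, 0.1],
    "orange": [0.8, 0.2],
    "apple": [0.85, 0.3],
    "computer": [0.1, 0.9],
}
print("Odd one out:", odd_one_out(list(toy), toy))
```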


In [11]:
# Analogy: uncle - man + woman ≈ ?
pos = ["uncle", "woman"]
neg = ["man"]

tokens = pos + neg
missing = [t for t in tokens if t not in wv]
if missing:
    print("Missing tokens in vocab:", missing)
else:
    print("Analogy: uncle - man + woman ≈ ?")
    for w, score in wv.most_similar(positive=pos, negative=neg, topn=10):
        print(f"{w:>15s}  {score:.4f}")

Analogy: uncle - man + woman ≈ ?
           aunt  0.8368
       daughter  0.8227
          niece  0.8221
    grandmother  0.8220
         mother  0.8079
           wife  0.7898
         cousin  0.7828
  granddaughter  0.7686
         father  0.7636
        husband  0.7516
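
With `positive` and `negative` lists, `most_similar` combines the unit-length input vectors (positives added, negatives subtracted) and ranks the rest of the vocabulary by cosine similarity to the result, skipping the input words themselves. A toy sketch of that arithmetic, assuming made-up 2-dimensional vectors and a hypothetical `analogy` helper:

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, a in enumerate(unit(vectors[word])):
            target[i] += sign * a
    target = unit(target)
    exclude = set(positive) | set(negative)  # input words are never returned
    scored = [
        (w, sum(a * b for a, b in zip(unit(v), target)))
        for w, v in vectors.items()
        if w not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: axis 0 loosely encodes gender, axis 1 "parent's sibling".
toy = {
    "man": [0.9, 0.1],
    "woman": [-0.9, 0.1],
    "uncle": [0.8, 0.9],
    "aunt": [-0.8, 0.9],
    "table": [0.1, -0.9],
}

for w, score in analogy(["uncle", "woman"], ["man"], toy):
    print(f"{w:>15s}  {score:.4f}")
```

Excluding the input words matters: without that step, the nearest neighbor of the combined vector is often one of the query words itself.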
