# Word2Vec training and evaluation

The objective of this assignment is to learn distributed word representations that capture syntactic and semantic relations between words, using Skip-gram with negative sampling as published in [_Mikolov et al._](https://browse.arxiv.org/pdf/1310.4546.pdf). The paper reviews earlier approaches to this task, such as **Continuous bag of words** and **Skip-gram**, and the challenges associated with them. While **Skip-gram** made learning word representations efficient, _Mikolov et al._ proposed the following extensions to the Skip-gram model, which further improved it by around 2x-10x:

- Hierarchical softmax
- Negative sampling
- Noise contrastive estimation

This notebook implements **Skip-gram with negative sampling**. The project has the following folder structure:
```
Word2Vec
├── Artifacts
│   ├── metadata
│   └── model
├── Data
│   ├── ds1.txt
│   └── ds2.txt
├── ds2_coding.pdf
├── __init__.py
├── main.py
├── Mikolov et al.pdf
├── notebooks
│   └── Word2Vec training and evaluation.ipynb
├── __pycache__
│   └── main.cpython-311.pyc
├── setup.py
├── src
│   ├── models.py
│   ├── __pycache__
│   └── utils.py
```

The root folder is "**Word2Vec**", which contains:

- **Artifacts:** processed data, models, plots and other artifacts.
- **Data:** raw data.
- **main.py:** the main file, but it's better to use the **Word2Vec training and evaluation** notebook in the notebooks folder.
- **src:** contains **utils.py** and **models.py**, which hold all the source code.
    - utils.py has two classes: _DataIO_ for reading and writing data, and _DataLoader_ for creating batches and negative samples.
    - models.py has three classes: _SGNS_ (the model _per se_), _Word2Vec_ (the model training wrapper) and _EvaluateSGNS_ (for evaluating models).
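
To make the batching step concrete, here is a minimal sketch of how (center, context) training pairs can be produced from a token sequence with a symmetric window. The function `skipgram_pairs` and the toy sentence are hypothetical and written for illustration; they are not taken from `DataLoader` in utils.py, whose actual implementation may differ.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every context word within `window` positions."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)                 # clip the window at the sequence start
        hi = min(len(tokens), t + window + 1)   # clip the window at the sequence end
        for j in range(lo, hi):
            if j != t:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy input, not from ds1.txt or ds2.txt
tokens = ["the", "cat", "sat", "down"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

With `window=1` each interior word contributes two pairs and each boundary word one, so the four-token example yields six pairs.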

### Objective

- **Skip-gram:**

$$
\underset{\theta}{\text{maximize}} \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,j\ne0} \log p(w_{t+j} | w_t)
$$

$$
p(w_O | w_I) = \frac{\exp({v'_{w_O}}^T v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^T v_{w_I})}
$$

Skip-gram computes the probability of a context word with the softmax function, which is very inefficient because the denominator sums over all $W$ words in the vocabulary. Negative sampling reduces this computation by changing the objective function.
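
The cost of the full softmax can be seen in a small pure-Python sketch: every probability needs a dot product with the output vector of each word in the vocabulary. The function name `softmax_prob` and the toy vectors are assumptions made for illustration only.

```python
import math


def softmax_prob(v_in, out_vectors, target):
    """p(w_O | w_I): one dot product per vocabulary word is needed for the denominator."""
    scores = [sum(a * b for a, b in zip(v_out, v_in)) for v_out in out_vectors]
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[target] / sum(exps)


# Hypothetical 2-dimensional vectors for a 3-word vocabulary
v_in = [0.5, -0.2]
out_vectors = [[0.1, 0.3], [0.4, -0.1], [-0.3, 0.2]]
probs = [softmax_prob(v_in, out_vectors, w) for w in range(len(out_vectors))]
print(probs)
```

For a vocabulary of tens of thousands of words, this sum is evaluated for every training pair, which is the cost negative sampling avoids.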

- **Skip-gram with negative sampling:**

$$
\underset{\theta}{\text{maximize}} \sum_{(w,c) \in D} \log \sigma(v_c^Tv_w) + \sum_{(w,r) \in D'} \log \sigma(-v_r^Tv_w)
$$

$$\sigma(x) = \frac{1}{1+e^{-x}}$$
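
Reading $D$ as the set of observed (word, context) pairs and $D'$ as the set of randomly sampled negative pairs, the contribution of a single pair can be sketched as below. The loss is the negative of the objective above; `sgns_loss` and the toy vectors are hypothetical, and this is not necessarily how the _SGNS_ class in models.py computes it.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(v_w, v_c, negatives):
    """Negative SGNS objective for one (w, c) pair and its sampled negative vectors."""
    loss = -math.log(sigmoid(dot(v_c, v_w)))        # pull the true context closer
    for v_r in negatives:
        loss -= math.log(sigmoid(-dot(v_r, v_w)))   # push sampled negatives away
    return loss


# Toy vectors: the context is aligned with the word, the negative points away
good = sgns_loss(v_w=[1.0, 0.0], v_c=[2.0, 0.0], negatives=[[-2.0, 0.0]])
# Reversed case: the context points away and the negative is aligned
bad = sgns_loss(v_w=[1.0, 0.0], v_c=[-2.0, 0.0], negatives=[[2.0, 0.0]])
print(good, bad)
```

Only the true context and the sampled negatives enter the loss, so the cost per pair depends on the number of negatives rather than on the vocabulary size.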

### Evaluation

For model evaluation, I simply look up the closest words to a given word and manually check whether they make sense. There are other ways, such as comparing model scores with BERT scores for similar and dissimilar words.
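
A common way to find the closest words is cosine similarity between embedding vectors. The sketch below assumes that measure and uses made-up two-dimensional vectors; `most_similar` is a hypothetical helper, not the `EvaluateSGNS.evaluate` method, whose similarity measure is not shown in this notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings, top_k=5):
    """Return the top_k words whose vectors are closest to `word` by cosine similarity."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_k]]


# Made-up toy embeddings, not produced by the trained model
embeddings = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [-0.1, 0.9],
    "bus": [-0.2, 0.8],
}
neighbours = most_similar("cat", embeddings, top_k=2)
print(neighbours)
```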

**Sample output:**
```
{'When': ['doing.', 'bookshelf.', 'acorns', 'car,"', 'stumble'],
 'diamond.': ['are."', 'cobweb', 'side!"', 'day."', 'day:'],
 'disgusting!': ['word.', 'appreciated', 'Leave', 'Pinchy', 'hole!"'],
 'fireplace.': ['band.', 'brave', 'jewel', 'flown', 'teeth,'],
 'fruits': ['tree,', '"Thanks', 'paint.', 'shelf.', 'calling'],
 'insect.': ['Pandy.', 'secrets.', 'recommended', 'problems.Once', 'high!"'],
 'know!': ['elephant,', 'shadow.', 'back!', 'rough.', 'muffin.'],
 'lungs': ['snake!', 'needle', 'park.Once', 'white', '"Yuck!"'],
 'notes': ['ones,', 'Ash', 'tired.Once', 'organized', 'pumpkins'],
 'please!': ['part', 'mine!"', 'knot.', 'farm.Once', 'garden,']}
```

The results above show the top 5 most similar words for each randomly chosen word. This model was trained for 10000 steps with a vocabulary of 10000 words and an embedding dimension of 300. The results don't look that good; the parameters will need some tuning.
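
Many tokens in the sample output carry punctuation or are fused across sentence boundaries (for example `'park.Once'` and `'car,"'`), which suggests the text is split on whitespace only. This is an inference from the output, not from the `DataIO` code. If that is the case, a normalising tokenizer along the following lines might help; `tokenize` and its regular expression are hypothetical and not part of utils.py.

```python
import re

# Lowercase alphabetic words, optionally with one internal apostrophe (e.g. "don't")
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and keep only word-like tokens, dropping punctuation."""
    return TOKEN_RE.findall(text.lower())


# Made-up sentence imitating the fused tokens seen in the sample output
sample = 'They went to the park.Once upon a time, "Thanks!" she said.'
tokens = tokenize(sample)
print(tokens)
```

Merging variants such as `park.Once` and `park` into one token gives each vocabulary entry more training examples, at the cost of discarding case and punctuation.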

In [27]:
import os
import pprint as pp
import torch
from src.utils import DataIO, DataLoader
from src.models import Word2Vec, EvaluateSGNS
import gc

In [28]:
from main import ROOTDIR

In [29]:
FILEPATH = 'Artifacts/metadata'
FNAMES = [
    'processed_data.pkl',
    'word_counts.pkl',
    'word2index.pkl',
    'index2word.pkl'
]
# All paths are relative to root folder
DATAPATH = ['Data/ds2.txt']  # path of training data
MODELPATH = os.path.join(ROOTDIR, 'Artifacts/model/word2vec-2023-10-08 07:54:58.851242.pt')  # path of the model
SOURCEFILE = ['Data/ds1.txt']  # path of test data
W2I = 'Artifacts/metadata/word2index.pkl'  # path of word to index map
I2W = 'Artifacts/metadata/index2word.pkl'  # path of index to word map
VOCAB_SIZE = 30000
EMBED_SIZE = 300
EXP_CONST = 3/4
BATCH_SIZE = 128
WINDOW = 4
NEG_SAMPLES = 15
HISTORY = 100
EPOCHS = 100
LR = 5e-5
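
The `EXP_CONST = 3/4` and `NEG_SAMPLES = 15` settings above presumably correspond to the paper's noise distribution, in which negatives are drawn from the unigram distribution raised to the 3/4 power. A minimal sketch of that idea follows; the helper names and word counts are made up, and the actual sampling inside `DataLoader` may differ.

```python
import random


def negative_sampling_weights(word_counts, exp_const=0.75):
    """Unnormalised sampling weights: count ** exp_const for each word."""
    return {word: count ** exp_const for word, count in word_counts.items()}


def sample_negatives(weights, k, exclude, rng):
    """Draw k negatives (with replacement) in proportion to the weights, skipping `exclude`."""
    candidates = [w for w in weights if w not in exclude]
    return rng.choices(candidates, weights=[weights[w] for w in candidates], k=k)


# Hypothetical counts, not taken from word_counts.pkl
word_counts = {"the": 1000, "cat": 50, "sat": 20, "acorn": 2}
weights = negative_sampling_weights(word_counts, exp_const=3 / 4)
total = sum(weights.values())
probs = {w: weights[w] / total for w in weights}

rng = random.Random(0)  # seeded only to make the sketch repeatable
negatives = sample_negatives(weights, k=15, exclude={"cat"}, rng=rng)
print(probs)
print(negatives)
```

Compared with raw frequencies, the 3/4 exponent lowers the share of very frequent words such as "the" and raises the share of rare words such as "acorn".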

In [30]:
data_io = DataIO(ROOTDIR, filepath=DATAPATH, vocab_size=VOCAB_SIZE)
data_io.process_data(filepath=FILEPATH, fnames=FNAMES)

metadata = DataIO.load_data(root_dir=ROOTDIR, filepath=FILEPATH, fnames=FNAMES)
loader = DataLoader(*metadata, exp_const=EXP_CONST)

In [31]:
torch.cuda.empty_cache()
gc.collect()

236

In [None]:
word2vec = Word2Vec(
    root_dir=ROOTDIR,
    train=True,
    sgd=False,
    process_data=False,
    source_filepath=[],
    metadata_filepath=FILEPATH,
    metadata_fnames=FNAMES,
    vocab_size=VOCAB_SIZE,
    embedding_dim=EMBED_SIZE,
    learning_rate=LR
)

loss, modelname = word2vec.train(epochs=EPOCHS, steps=1000, window=WINDOW, k=NEG_SAMPLES, batch_size=BATCH_SIZE, loss_history=HISTORY)

loss at step no. 20800 of 100000 steps is 0.00031677945517003534, job is 20.8% complete

In [None]:
# model = Word2Vec.load_sgns(vocab_size=VOCAB_SIZE, embedding_dim=EMBED_SIZE, path=MODELPATH)

In [None]:
evaluator = EvaluateSGNS(
    model=word2vec.model,
    root_dir=ROOTDIR,
    source_filepath=SOURCEFILE,
    w2i_path=W2I,
    i2w_path=I2W
)
output = evaluator.evaluate(ksamples=10, top_k=5)
pp.pprint(output)