In [1]:
from IPython.display import clear_output

In [4]:
# %pip install gensim nltk tqdm scikit-learn

%pip install datasets

clear_output()

# Content

In this demo, we will train a word2vec model on custom data, using the IMDB dataset as our training data.

A word2vec model is another way to convert strings (or words) to numerical representations, so that the resulting vectors can be used to perform some task or to train another model.

Word2vec here uses the continuous bag-of-words (CBOW) technique, which is gensim's default training mode, to learn an embedding for each word it sees during training, and then stores those embeddings. After training, we can look up a word's vector by providing the word. We will not delve into the details of CBOW here.
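
Without going into the training details, the shape of the examples CBOW learns from can be sketched as follows: each word becomes a prediction target, and the words within a fixed window around it become the context. The helper `cbow_pairs` is a hypothetical name used only for illustration; real implementations such as gensim add refinements (for example subsampling and randomly shrunk windows) that are not shown here.

```python
def cbow_pairs(tokens, window):
    """Build (context_words, target_word) pairs for a single tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]   # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]  # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

# Toy sentence, invented for illustration only.
sentence = ["the", "movie", "was", "great"]
pairs = cbow_pairs(sentence, window=1)
for context, target in pairs:
    print(context, "->", target)
```

During training, the model learns vectors such that the combined context vectors are good at predicting the target word.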

The special thing about word2vec vectors (which is not found in TF-IDF or word-to-index encodings) is that the distance between two vectors also carries information about how similar the corresponding words are.

For example, the vectors of the words "cat" and "dog" will have a higher similarity (or lower distance) than the vectors of the words "cat" and "building".
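
To make "similarity between vectors" concrete, here is a minimal sketch of cosine similarity, the measure used later in this notebook. The 3-dimensional vectors are made up for illustration and are not real word2vec embeddings.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", invented for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "building": [0.1, 0.2, 0.9],
}

cat_dog = cosine_sim(toy_vectors["cat"], toy_vectors["dog"])
cat_building = cosine_sim(toy_vectors["cat"], toy_vectors["building"])
print(f"cat vs dog:      {cat_dog:.3f}")
print(f"cat vs building: {cat_building:.3f}")
```

With these toy values, "cat" and "dog" point in nearly the same direction and score close to 1, while "cat" and "building" score much lower.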


In [15]:
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity

from datasets import load_dataset


nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this package instead

clear_output()

## Preparing the data

In [6]:
train_data = load_dataset('imdb', split='train')['text']
train_data = train_data[:5000]  # keep only the first 5,000 reviews because training on all 25k rows takes too long

100%|██████████| 25000/25000 [00:21<00:00, 1176.71it/s]


In [7]:
tokenized_train_data = [word_tokenize(text.lower()) for text in tqdm(train_data, desc='Tokenizing data')]

Tokenizing data: 100%|██████████| 5000/5000 [00:10<00:00, 470.89it/s]


## Train the model

In [9]:
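# Train word2vec (CBOW is gensim's default): 128-dimensional vectors, a context window of 5 tokens,
# min_count=1 keeps every word regardless of frequency, and 4 worker threads are used for training.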
model = Word2Vec(tokenized_train_data, vector_size=128, window=5, min_count=1, workers=4)

In [10]:
word_vectors = model.wv

## Try the model

In [51]:
test_words = ['cat', 'dog', 'building', 'movie', 'action', 'comedy', 'fight', 'plot', 'laugh']
test_word_vectors = [word_vectors[word] for word in test_words]

# Map each test word to the other test words, sorted by cosine similarity (most similar first)
test_words_similarity_map = {}

for i, word in enumerate(test_words):

    remaining_words = test_words[:i]+test_words[i+1:]
    remaining_word_vectors = test_word_vectors[:i]+test_word_vectors[i+1:]

    word_to_remaining_cosine_sims = cosine_similarity(test_word_vectors[i].reshape(1, -1), remaining_word_vectors)[0]
    # argsort returns indices from lowest to highest; reverse it so the most similar word comes first
    sorting_order = word_to_remaining_cosine_sims.argsort()[::-1]
    # remaining words ordered by similarity to the current word
    ordered_remaining_words = [remaining_words[j] for j in sorting_order]

    test_words_similarity_map[word] = list(zip(ordered_remaining_words, word_to_remaining_cosine_sims[sorting_order]))

In [48]:
test_words_similarity_map['cat']

[('dog', 0.907266),
 ('building', 0.8194644),
 ('guns', 0.79736876),
 ('fight', 0.79011095),
 ('action', 0.45516294),
 ('comedy', 0.34774196),
 ('plot', 0.15761659),
 ('movie', 0.07604584)]

In [49]:
test_words_similarity_map['action']

[('plot', 0.7116974),
 ('comedy', 0.7073522),
 ('fight', 0.49309626),
 ('cat', 0.45516294),
 ('dog', 0.43521827),
 ('movie', 0.36971736),
 ('guns', 0.3668394),
 ('building', 0.3169372)]

In [54]:
test_words_similarity_map['comedy']

[('action', 0.7073522),
 ('plot', 0.6685944),
 ('movie', 0.6605965),
 ('laugh', 0.5075881),
 ('dog', 0.4103828),
 ('cat', 0.34774196),
 ('fight', 0.2577684),
 ('building', 0.22762263)]

In [55]:
test_words_similarity_map['laugh']

[('comedy', 0.5075881),
 ('movie', 0.50631833),
 ('fight', 0.36221814),
 ('action', 0.31770185),
 ('building', 0.31391907),
 ('dog', 0.30144316),
 ('plot', 0.29532665),
 ('cat', 0.17594783)]

## Finding similar words

Since the vectors now encode similarity, word2vec also lets us find the words most similar to a given word (much like we did above, but searching the whole vocabulary instead of a hand-picked list of words).
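
Conceptually, such a lookup ranks every other word in the vocabulary by cosine similarity to the query word and keeps the top few. The sketch below shows the idea on a made-up 2-dimensional vocabulary; `most_similar_toy` is a hypothetical helper for illustration, not gensim's implementation.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar_toy(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, most similar first."""
    query = vectors[word]
    scored = [(other, cosine_sim(query, vec)) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy "embeddings", invented for illustration only.
toy_vocab = {
    "movie": [0.9, 0.1],
    "film": [0.85, 0.15],
    "show": [0.7, 0.4],
    "building": [0.1, 0.9],
}

neighbours = most_similar_toy("movie", toy_vocab, topn=2)
print(neighbours)
```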

word2vec also performs reasonably well at vector arithmetic such as:

king_vec - man_vec + woman_vec ≈ queen_vec (a famous example). Note, however, that good results require a model trained on a significant amount of data. Our model is trained on relatively little data, too little for analogies like this to work reliably.
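
The arithmetic itself can be sketched on hand-made vectors. The 2-dimensional values below are invented so that the analogy works exactly; real embeddings only satisfy it approximately. The `analogy` helper is hypothetical. With a trained gensim model, the comparable query would go through `most_similar` with its `positive` and `negative` arguments, which is not run here.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c], excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(vectors[w], target))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three input words from the candidates matters: otherwise the nearest vector to the result is often one of the inputs themselves.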

In [75]:
word_vectors.most_similar('movie')

[('film', 0.9537474513053894),
 ('flick', 0.8437069058418274),
 ('show', 0.8287235498428345),
 ('sequel', 0.774604320526123),
 ('thing', 0.7666338682174683),
 ('series', 0.7632709741592407),
 ('mess', 0.7579289078712463),
 ('crap', 0.7566468715667725),
 ('picture', 0.7493321895599365),
 ('case', 0.7326716184616089)]

In [77]:
word_vectors.most_similar('mother')

[('daughter', 0.9675498604774475),
 ('father', 0.9666376113891602),
 ('son', 0.950274646282196),
 ('wife', 0.9387264847755432),
 ('husband', 0.9386721253395081),
 ('boyfriend', 0.9364470839500427),
 ('sister', 0.9297873377799988),
 ('brother', 0.9262775778770447),
 ('dad', 0.9232807159423828),
 ('mom', 0.9198517203330994)]