# **Loading Data**

---



In [None]:
from google.colab import files

# Upload story.txt from the local machine; files.upload() returns a dict of {filename: bytes}.
uploaded = files.upload()

In [11]:
# Read the uploaded file; the context manager closes the file handle afterwards.
with open("story.txt", encoding="utf-8") as f:
    story = f.read()
story

'Mira lived in a quiet village surrounded by misty hills. Every morning, she wandered through the old forest on her way to the river. One day, while following a trail of bright blue butterflies, she discovered a rusty gate hidden behind thick vines. Curious, she pushed it open and stepped into a forgotten garden. Flowers of every colour bloomed wildly, and a small stone fountain sat in the centre, still trickling with clear water. As Mira explored the garden, she noticed a wooden box buried beneath a rose bush. Inside it was a faded journal filled with stories about the garden’s magic—stories written by her grandmother many years ago. From that day on, Mira visited the hidden garden whenever she felt lonely. And slowly, the garden came back to life, just as the stories promised.'

# **Word2Vec**

---

Word2Vec is a prediction-based embedding generator. Unlike frequency-based techniques such as TF-IDF, which only reflect how often words occur, Word2Vec learns dense vector representations whose values also carry semantic meaning. This allows arithmetic on word vectors, such as:
king - man + woman ≈ queen
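
The arithmetic above can be illustrated with a small sketch. The 2-D vectors below are hand-written, hypothetical values (one axis loosely standing for "royalty", the other for "gender"); real Word2Vec vectors are learned from data and have many more dimensions. The helper names `cosine` and `analogy` are made up for this illustration.

```python
import math

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These are NOT learned embeddings, only hand-picked values for illustration.
toy_vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8, 1.0],
    "apple":  [-1.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With a model trained on a large corpus, Gensim exposes the same kind of query as `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`. The tiny story corpus used in this notebook does not contain these words, so that query would raise a `KeyError` here.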

---


Word2Vec offers two training architectures: Continuous Bag of Words (CBoW) and Skip-gram. CBoW uses the surrounding context words to predict the target word, while Skip-gram uses the target word to predict its surrounding context words.
The Gensim library supports both: set the `sg` parameter to 0 for CBoW or 1 for Skip-gram when creating the model.
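
To make the difference concrete, the sketch below builds the training examples each architecture would see for one short sentence. It is a simplified illustration, not Gensim's implementation: Gensim additionally shrinks the window at random and trains with negative sampling or hierarchical softmax. The function names `skipgram_pairs` and `cbow_pairs` are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the centre word predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context_words, target) pairs: the neighbours jointly predict the centre word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

sentence = ["mira", "lived", "in", "quiet", "village"]
sg_examples = skipgram_pairs(sentence, window=1)
cbow_examples = cbow_pairs(sentence, window=1)

print(sg_examples[:3])
print(cbow_examples[1])
```

Skip-gram produces one example per (target, neighbour) combination, whereas CBoW produces one example per position with all neighbours grouped as the input.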

---




In [9]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [22]:
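import gensim
from gensim.utils import simple_preprocess
from nltk import sent_tokenize

# sent_tokenize needs NLTK's Punkt data; if it is missing, fetch it with nltk.download()
# (resource 'punkt' or 'punkt_tab', depending on the NLTK version).
tokenized_story = sent_tokenize(story)

# simple_preprocess lowercases each sentence, strips punctuation, and drops
# tokens shorter than 2 characters (which is why 'a' is absent from the output).
story_tokens = [simple_preprocess(sent) for sent in tokenized_story]

story_tokens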

[['mira',
  'lived',
  'in',
  'quiet',
  'village',
  'surrounded',
  'by',
  'misty',
  'hills'],
 ['every',
  'morning',
  'she',
  'wandered',
  'through',
  'the',
  'old',
  'forest',
  'on',
  'her',
  'way',
  'to',
  'the',
  'river'],
 ['one',
  'day',
  'while',
  'following',
  'trail',
  'of',
  'bright',
  'blue',
  'butterflies',
  'she',
  'discovered',
  'rusty',
  'gate',
  'hidden',
  'behind',
  'thick',
  'vines'],
 ['curious',
  'she',
  'pushed',
  'it',
  'open',
  'and',
  'stepped',
  'into',
  'forgotten',
  'garden'],
 ['flowers',
  'of',
  'every',
  'colour',
  'bloomed',
  'wildly',
  'and',
  'small',
  'stone',
  'fountain',
  'sat',
  'in',
  'the',
  'centre',
  'still',
  'trickling',
  'with',
  'clear',
  'water'],
 ['as',
  'mira',
  'explored',
  'the',
  'garden',
  'she',
  'noticed',
  'wooden',
  'box',
  'buried',
  'beneath',
  'rose',
  'bush'],
 ['inside',
  'it',
  'was',
  'faded',
  'journal',
  'filled',
  'with',
  'stories',
  'abou

In [27]:
# Skip-gram model (sg=1); words seen fewer than 2 times are dropped from the vocabulary.
model1 = gensim.models.Word2Vec(window=5,
                               min_count=2,
                               sg=1)

model1.build_vocab(story_tokens)

In [34]:
model1.train(story_tokens, total_examples=model1.corpus_count, epochs=model1.epochs)



(45, 640)

In [35]:
# CBoW model (sg=0); same window and min_count as model1 for a like-for-like comparison.
model2 = gensim.models.Word2Vec(window=5,
                               min_count=2,
                               sg=0)

model2.build_vocab(story_tokens)

In [36]:
model2.train(story_tokens, total_examples=model2.corpus_count, epochs=model2.epochs)

(45, 640)

In [37]:
model1.wv.vectors

array([[-8.8021817e-04,  5.8665196e-04,  5.2631511e-03, ...,
        -7.3048673e-03,  9.1995345e-04,  6.4612739e-03],
       [-8.8032279e-03,  3.8024306e-03,  5.2924352e-03, ...,
        -2.5713351e-03, -9.4952807e-03,  4.5377817e-03],
       [-1.3203774e-05,  3.1644446e-03, -6.7734085e-03, ...,
         4.1708900e-04,  8.1872037e-03, -6.9618225e-03],
       ...,
       [-7.2971093e-03,  4.3584509e-03,  2.2680548e-03, ...,
         9.3065202e-03,  7.0775962e-03,  6.7898845e-03],
       [ 1.2062541e-03, -9.7318524e-03,  4.6373056e-03, ...,
        -2.6887543e-03, -7.7358098e-03,  4.2439783e-03],
       [ 1.7577567e-03,  7.0803151e-03,  2.9932698e-03, ...,
        -1.9136701e-03,  3.6042929e-03, -6.9995876e-03]], dtype=float32)

In [38]:
model2.wv.vectors

array([[-5.3342577e-04,  2.3830612e-04,  5.1013809e-03, ...,
        -7.0454874e-03,  8.9799735e-04,  6.4001861e-03],
       [-8.6263921e-03,  3.6673218e-03,  5.1889298e-03, ...,
        -2.3985296e-03, -9.5140878e-03,  4.5123072e-03],
       [ 9.4174065e-05,  3.0818784e-03, -6.8125203e-03, ...,
         5.1322090e-04,  8.2111023e-03, -7.0140711e-03],
       ...,
       [-7.1909428e-03,  4.2328904e-03,  2.1633946e-03, ...,
         9.4380733e-03,  7.0552849e-03,  6.7549525e-03],
       [ 1.3006177e-03, -9.7988760e-03,  4.5912131e-03, ...,
        -2.5952731e-03, -7.7680009e-03,  4.2066905e-03],
       [ 1.8002307e-03,  7.0460914e-03,  2.9446983e-03, ...,
        -1.8595541e-03,  3.6117458e-03, -7.0364270e-03]], dtype=float32)

In [40]:
model1.wv.most_similar('mira', topn=5)

[('day', 0.17636650800704956),
 ('stories', 0.15012679994106293),
 ('and', 0.14156045019626617),
 ('in', 0.07442259788513184),
 ('she', 0.07290764153003693)]

In [41]:
model2.wv.most_similar('mira', topn=5)

[('day', 0.16732603311538696),
 ('and', 0.13921351730823517),
 ('stories', 0.13143806159496307),
 ('in', 0.07173357903957367),
 ('she', 0.06429928541183472)]

In [42]:
model1.wv.index_to_key

['the',
 'garden',
 'she',
 'stories',
 'and',
 'mira',
 'as',
 'with',
 'it',
 'hidden',
 'of',
 'day',
 'to',
 'her',
 'on',
 'every',
 'by',
 'in']

In [43]:
model2.wv.index_to_key

['the',
 'garden',
 'she',
 'stories',
 'and',
 'mira',
 'as',
 'with',
 'it',
 'hidden',
 'of',
 'day',
 'to',
 'her',
 'on',
 'every',
 'by',
 'in']
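
Both models list the same vocabulary because the vocabulary depends only on the corpus and on `min_count`, not on `sg`. The sketch below mimics that filtering step with `collections.Counter` on a hypothetical mini-corpus (`toy_tokens` stands in for `story_tokens`). It only approximates what `build_vocab` does, assuming Gensim's default behaviour of ordering the vocabulary by descending frequency.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for story_tokens.
toy_tokens = [
    ["mira", "explored", "the", "garden"],
    ["the", "garden", "came", "back", "to", "life"],
    ["mira", "visited", "the", "hidden", "garden"],
]

min_count = 2  # same threshold as used for model1 and model2

# Count every token across all sentences.
counts = Counter(word for sentence in toy_tokens for word in sentence)

# Keep only words that occur at least min_count times, most frequent first.
kept = [word for word, n in counts.most_common() if n >= min_count]

print(kept)
```

Words that appear only once, such as "explored" or "life" in this toy corpus, never receive a vector, so querying them on the trained model would fail.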