<h3>Word2Vec - Word Embeddings Technique:</h3>

<p>
    Word2vec is a shallow neural network technique that learns a word embedding for each word in a corpus while keeping its contextual meaning intact. That is, words that are used in similar contexts end up with similar embeddings. An embedding here is a numeric representation of text as a vector in a multi-dimensional space, which is the form text must take before it can be analyzed or used to build predictive models.
    
There are also more basic techniques, such as bag-of-words (BoW) and TF-IDF, whose drawbacks word2vec overcomes. Those shortcomings are:
<ul>
    <li>Length of vector: Earlier approaches produce vectors with one dimension per unique word in the corpus, which can easily reach thousands of dimensions and is computationally expensive. Word2Vec vectors have a fixed length that the user controls.</li>
    <li>Sparse vectors: The approaches mentioned above are frequency based, so their vectors are highly sparse (mostly zeros). In contrast, word2vec produces dense vectors.</li>
    <li>Semantic meaning: Most importantly, word2vec captures the semantic meaning of words. Words used in similar contexts get similar vectors, whereas the other approaches lose the semantic meaning of words.</li>
</ul>
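
To make the length and sparsity points concrete, here is a minimal sketch of a bag-of-words vector built with the standard library only. The three-sentence corpus and the helper name `bow_vector` are made up for illustration; they are not part of the notebook's data or of any library.

```python
from collections import Counter

# Toy corpus (hypothetical; not taken from the books used later).
corpus = [
    "the king rode north",
    "the queen rode south",
    "winter is coming",
]

# BoW needs one dimension per unique word in the whole corpus.
vocab = sorted({word for sentence in corpus for word in sentence.split()})


def bow_vector(sentence):
    """Return the bag-of-words count vector for one sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]


vector = bow_vector(corpus[0])
zero_fraction = vector.count(0) / len(vector)

print(len(vocab))     # vector length grows with the vocabulary
print(vector)         # mostly zeros, even on this tiny corpus
print(zero_fraction)  # share of entries that are zero
```

Even with nine unique words, more than half of the entries are zero. With a real vocabulary the vector gets longer and the share of zeros gets larger, while a word2vec vector keeps whatever fixed length the user chose.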


There are two main variants of Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram. Here's a brief overview of how Word2Vec works:
<ol>
    <li>Continuous Bag of Words (CBOW): In CBOW, the model predicts a target word from its surrounding context words. The input to the model is a context window of words, and the output is the target word. CBOW is faster to train and tends to do slightly better on frequent words.</li>
    <li>Skip-gram: Skip-gram, on the other hand, predicts the context words (surrounding words) given a target word. It is more computationally intensive but tends to represent rare words better.</li>
</ol>
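
The difference between the two variants is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples; it is not how gensim implements training (real implementations also vary the window per position and use negative sampling or hierarchical softmax). The function names `cbow_examples` and `skipgram_examples` are hypothetical.

```python
def cbow_examples(tokens, window):
    """Build (context_words, target_word) pairs, the shape CBOW trains on."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window):
    """Build (target_word, context_word) pairs, the shape skip-gram trains on."""
    examples = []
    for context, target in cbow_examples(tokens, window):
        for context_word in context:
            examples.append((target, context_word))
    return examples


tokens = ["winter", "is", "coming", "soon"]
cbow = cbow_examples(tokens, window=1)
skip = skipgram_examples(tokens, window=1)

for example in cbow:
    print("CBOW     ", example)
for example in skip:
    print("skip-gram", example)
```

CBOW gets one example per position, with the whole context as input. Skip-gram gets one example per (target, context word) pair, which is why it produces more training examples from the same text.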

In essence, word2vec represents words in a way that preserves their semantic relationships, making it a more powerful tool for understanding and working with language than traditional methods like BoW and TF-IDF.

</p>

In [1]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [2]:
import gensim
from gensim.utils import simple_preprocess

In [3]:
import nltk
from nltk import sent_tokenize
from nltk.corpus import stopwords

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
file_path = '/kaggle/input/game-of-thrones-books/'
os.listdir(file_path)

['004ssb.txt', '005ssb.txt', '001ssb.txt', '002ssb.txt', '003ssb.txt']

In [6]:
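# Build the stop-word set once; calling stopwords.words() for every token is very slow.
stop_words = set(stopwords.words('english'))

# Despite its name, `books` collects one token list per sentence across all files.
books = []
for book in os.listdir(file_path):
    book_name = os.path.join(file_path, book)
    with open(book_name, encoding='utf-8', errors='ignore') as file:
        sentences = sent_tokenize(file.read())
    for sent in sentences:
        # simple_preprocess already lowercases and tokenizes, so no extra .lower() is needed
        preprocessed_sent = simple_preprocess(sent)
        books.append([each for each in preprocessed_sent if each not in stop_words])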

In [9]:
len(books)

158872

In [10]:
books[:5]

[['version',
  'history',
  'reedited',
  'maelstrom',
  'feast',
  'crows',
  'book',
  'four',
  'song',
  'ice',
  'fire',
  'george'],
 ['martin', 'prologue', 'dragons', 'said', 'mollander'],
 ['snatched', 'withered', 'apple', 'ground', 'tossed', 'hand', 'hand'],
 ['throw', 'apple', 'urged', 'alleras', 'sphinx'],
 ['slipped', 'arrow', 'quiver', 'nocked', 'bowstring']]

In [11]:
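# `books` is ragged (sentences differ in length), so build an object array explicitly; only the outer length is reported.
np.shape(np.array(books, dtype=object))
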
(158872,)

In [12]:
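# window: max distance between target and context word; min_count: drop words seen fewer than 5 times.
model = gensim.models.Word2Vec(window=5, min_count=5)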

In [13]:
%%time
model.build_vocab(books)

CPU times: user 672 ms, sys: 7.07 ms, total: 679 ms
Wall time: 681 ms


In [14]:
model.corpus_count

158872

In [15]:
%%time
model.train(books, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 13.2 s, sys: 89.5 ms, total: 13.3 s
Wall time: 5.3 s


(4331763, 4596315)
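
The tuple returned by `train` can be read with a little arithmetic. This sketch assumes the tuple is (effective words trained on, raw words seen) and that the model ran gensim's default of 5 epochs, since the notebook never sets `epochs` explicitly.

```python
# Values copied from the output of model.train above.
effective_words, raw_words = 4331763, 4596315

# Assumption: gensim's default number of epochs, as the notebook does not set it.
epochs = 5

# Raw words are counted once per pass over the corpus.
words_per_epoch = raw_words // epochs

# Fraction of raw words that were actually used for training.
kept_fraction = effective_words / raw_words

print(words_per_epoch)
print(round(kept_fraction, 3))
```

Under those assumptions, the gap between the two numbers would come from words that were skipped during training, such as words below `min_count` and very frequent words that were randomly downsampled.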

In [16]:
np.shape(model.wv.get_normed_vectors())

(11834, 100)

In [17]:
model.wv.get_normed_vectors()[:5]

array([[-5.93788661e-02,  7.10229054e-02, -1.22555345e-02,
         8.55302513e-02, -1.54055595e-01, -2.08716556e-01,
         1.69743210e-01,  1.44997627e-01, -4.35820669e-02,
        -5.84317185e-02,  8.92198160e-02, -7.55257085e-02,
        -1.03625320e-01,  2.74434686e-03, -4.01304401e-02,
         2.67664269e-02,  4.84419689e-02,  1.05602011e-01,
        -3.69488299e-02, -7.28894696e-02, -2.57672905e-03,
         8.89472961e-02,  9.37365443e-02, -1.36632472e-02,
         9.30336863e-02, -3.85163538e-03, -1.47053301e-01,
         3.01664528e-02, -3.53593566e-02, -9.04390123e-04,
         2.69859601e-02,  6.93063289e-02,  1.30222782e-01,
        -8.22050422e-02,  1.61626227e-02,  4.71662320e-02,
         4.13585082e-02, -3.06024738e-02,  2.02639941e-02,
        -2.70566225e-01, -9.95824635e-02, -9.65384617e-02,
        -9.03154537e-03,  4.10068370e-02, -4.60070670e-02,
         5.92671558e-02, -1.43982813e-01,  5.54820225e-02,
         1.14485592e-01,  1.06758729e-01, -1.97161734e-0
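
`get_normed_vectors()` returns the embeddings scaled to unit L2 length. What that buys you is that the dot product of two rows equals their cosine similarity. A minimal standard-library sketch with two made-up 2-dimensional vectors (the real ones have 100 dimensions):

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical vectors, chosen only so the numbers are easy to follow.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit-length vectors, the dot product is the cosine similarity.
cosine = dot(a, b)

print(dot(a, a))  # a unit vector has (approximately) length 1
print(cosine)
```

This is why similarity queries over normalized embeddings can be computed as a single matrix-vector product.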

In [18]:
y = model.wv.index_to_key

In [19]:
y[:5]

['said', 'lord', 'would', 'one', 'ser']

In [20]:
np.shape(y)

(11834,)
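
The length of `index_to_key` matches the number of rows in the vector matrix because position `i` in the list names the word whose embedding is row `i`. The sketch below mimics that layout with plain Python lists. The three words come from the output above, but the 2-dimensional vectors and the helper `vector_for` are made up for illustration; gensim keeps an equivalent word-to-row mapping in `model.wv.key_to_index`.

```python
# Toy stand-ins for model.wv.index_to_key and the embedding matrix.
index_to_key = ["said", "lord", "would"]
vectors = [
    [0.1, 0.2],  # row 0 -> "said"   (hypothetical values)
    [0.3, 0.4],  # row 1 -> "lord"
    [0.5, 0.6],  # row 2 -> "would"
]

# Reverse mapping from word to row index.
key_to_index = {word: i for i, word in enumerate(index_to_key)}


def vector_for(word):
    """Look up a word's embedding via its row index."""
    return vectors[key_to_index[word]]


print(key_to_index["would"])
print(vector_for("lord"))
```

With the trained model, the same lookup is available directly as `model.wv['lord']`.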