## Word2vec
**Word2vec** is a popular technique in natural language processing for representing words as numerical vectors. It is a neural network-based method that learns word embeddings from large amounts of text. Word2vec has become a standard tool in the field and has helped advance later language technologies.

Word2vec works by considering the context in which words appear in a text corpus. A shallow neural network is trained either to predict a word from its surrounding words (the continuous bag-of-words, or CBOW, architecture) or to predict the surrounding words from a given word (the skip-gram architecture). Because words that occur in similar contexts end up with similar vectors, the learned embeddings capture relationships between words in the text.
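
To make the idea of "context" concrete, the sketch below lists the (center, context) pairs that a skip-gram style model would see for one tokenised sentence. The helper name `context_pairs` and the window of 1 are illustrative assumptions, not part of gensim, and real implementations add refinements such as subsampling of frequent words.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["the", "wildlings", "are", "dead"]
for center, context in context_pairs(sentence, window=1):
    print(center, "->", context)
```

With a window of 1, each word is paired only with its immediate neighbours, so this four-word sentence yields six training pairs.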

### We'll be using the Game of Thrones books dataset to convert words to vectors.

### We'll be using the gensim and nltk libraries to process the text.

In [9]:
# Install nltk first (only needed once), then import the libraries
!pip install nltk
import os
import numpy as np
import pandas as pd
import gensim
import nltk
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

# Download the Punkt sentence tokenizer models used by sent_tokenize
nltk.download('punkt')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ashis\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


We downloaded the Game of Thrones books dataset from Kaggle and perform the following operations:

- Fetch each file
- Read each file
- Preprocess each sentence and append the result to a list

In [11]:
story = []
for filename in os.listdir('data'):
    # Open each book; the context manager closes the file afterwards
    with open(os.path.join('data', filename)) as f:
        corpus = f.read()

    # Split the text into sentences
    raw_sent = sent_tokenize(corpus)

    # Lowercase and tokenize each sentence, then collect the token lists
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

In [17]:
# Let's print the first six sentences from our story list
print(story[:6])

[['game', 'of', 'thrones', 'book', 'one', 'of', 'song', 'of', 'ice', 'and', 'fire', 'by', 'george', 'martin', 'prologue', 'we', 'should', 'start', 'back', 'gared', 'urged', 'as', 'the', 'woods', 'began', 'to', 'grow', 'dark', 'around', 'them'], ['the', 'wildlings', 'are', 'dead'], ['do', 'the', 'dead', 'frighten', 'you'], ['ser', 'waymar', 'royce', 'asked', 'with', 'just', 'the', 'hint', 'of', 'smile'], ['gared', 'did', 'not', 'rise', 'to', 'the', 'bait'], ['he', 'was', 'an', 'old', 'man', 'past', 'fifty', 'and', 'he', 'had', 'seen', 'the', 'lordlings', 'come', 'and', 'go']]
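
The token lists above are all lowercase and contain no single-letter words or punctuation. That matches the default behaviour of `simple_preprocess`, which lowercases the text, keeps alphabetic tokens, and drops tokens shorter than 2 or longer than 15 characters. The following is a rough standard-library approximation for illustration only: `rough_preprocess` is a hypothetical helper, and gensim's own tokeniser handles edge cases that this sketch does not.

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of letters, drop very short or very long tokens."""
    # [^\W\d_]+ matches one or more letters (no digits, no underscore)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(rough_preprocess("By George R. R. Martin"))
# ['by', 'george', 'martin']
print(rough_preprocess('"Do the dead frighten you?"'))
# ['do', 'the', 'dead', 'frighten', 'you']
```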


### The Word2Vec class in gensim.models is used to build the model
We are going to use three parameters:

- **window**: the maximum distance between the center word and the words used to predict it. With window = 10, up to 10 words on each side of the center word are used.

- **min_count**: words that occur fewer than this many times in the whole corpus are ignored. With min_count = 2, only words that appear at least twice are kept.

- **vector_size**: the number of dimensions of the vector learned for each word. It corresponds to the size of the network's hidden layer (see the docs below).
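
As a small illustration of what `min_count` does, the sketch below counts word frequencies over a toy corpus and keeps only the words that occur at least `min_count` times. It mimics the vocabulary pruning step in plain Python under simplifying assumptions; `prune_vocab` is a hypothetical helper, not a gensim function.

```python
from collections import Counter


def prune_vocab(sentences, min_count=2):
    """Return the set of words that appear at least `min_count` times overall."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, count in counts.items() if count >= min_count}


toy_corpus = [
    ["the", "wildlings", "are", "dead"],
    ["do", "the", "dead", "frighten", "you"],
]
print(sorted(prune_vocab(toy_corpus, min_count=2)))
# ['dead', 'the']
```

Note that the count is taken over the whole corpus, not per sentence, so rare words are dropped everywhere they occur.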

In [19]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    vector_size=200,
)

In [20]:
# Build our vocabulary from the story
model.build_vocab(story)

In [22]:
# Let's train our model
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(6570597, 8628190)

### We have trained our model using model.train; now we can explore the trained model

The trained word vectors are available through model.wv, which provides methods for querying them.
For example, **model.wv.most_similar(word)** returns the words most similar to a given word, and **model.wv[word]** returns the vector of a word.

In [36]:
print(model.wv.most_similar("got"))

print(model.wv["man"])

[('started', 0.721495509147644), ('went', 0.6500563025474548), ('gets', 0.6296276450157166), ('tried', 0.6125317215919495), ('stole', 0.5981466770172119), ('squirting', 0.595348060131073), ('goes', 0.5852327346801758), ('came', 0.5687958598136902), ('moved', 0.5684332251548767), ('shouted', 0.5670421719551086)]
[-0.65933394  1.5233262   1.669839    0.5993785   1.1982256   2.2859757
  0.24119651 -0.4587817   0.8062619   0.0499928   2.5661209   0.39831415
  0.50965524  0.10206635  0.21645388 -0.2893774   0.3961005   0.6720759
 -1.6263245  -1.3243555   0.30309412  0.13111965 -1.3896111  -1.1457307
 -2.438746   -1.2344465   0.41160107 -0.3624295  -1.3560109   1.7957324
 -0.5155781   0.03727447  0.90332806  1.4986315   0.97128665 -0.1682729
  0.1172601  -0.702237    0.05568082  0.58727115  0.9436814  -0.38819495
  0.44841886  1.2588859   2.804377    0.12495188  0.7907396  -0.01418527
 -0.15780063  0.79143685  1.4343985  -0.7822081   0.4649995   1.5629027
 -0.41865376  0.52656144  0.81241685
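
The scores returned by `most_similar` are cosine similarities between word vectors. Below is a minimal sketch of that computation, using hand-made three-dimensional vectors rather than the 200-dimensional vectors trained above. The vectors and the helper names `cosine_similarity` and `rank_by_similarity` are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(word, vectors):
    """Return (other_word, score) pairs sorted from most to least similar."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors, chosen by hand for illustration only
toy_vectors = {
    "got": [1.0, 2.0, 0.0],
    "went": [2.0, 4.0, 0.0],   # points in the same direction as "got"
    "sword": [0.0, 0.0, 3.0],  # orthogonal to "got"
}

for other, score in rank_by_similarity("got", toy_vectors):
    print(other, round(score, 3))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why "went" scores close to 1 here even though its vector is twice as long.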

Word2Vec is considered a good word embedding technique for several reasons:

- It captures semantic and syntactic relationships between words: Word2Vec uses a neural network to learn representations in which words with similar meanings or contexts lie close together. This allows it to capture semantic relationships between words, such as "king" and "queen", which are likely to appear in similar contexts. It can also capture syntactic relationships, such as the relationship between "walk" and "walked".

- It can handle large vocabularies: Word2Vec can efficiently handle vocabularies with hundreds of thousands of words, which is important for many natural language processing tasks.

- It is computationally efficient: Word2Vec can be trained with negative sampling, which updates the weights for only a few sampled "negative" words per training example instead of computing a softmax over the whole vocabulary. This makes it much cheaper to train than earlier neural language models.

- It can be pre-trained on large corpora: Word2Vec can be pre-trained on large text corpora, such as Wikipedia or news archives. The pre-trained vectors can then improve performance on downstream tasks that have limited training data.
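
To show why negative sampling is cheap, the sketch below evaluates the skip-gram negative-sampling loss for a single (center, context) pair: one positive term plus a handful of sampled negatives, rather than a sum over the full vocabulary. The vectors are hand-picked toy values and `sgns_loss` is a hypothetical helper name. This illustrates the objective only and is not gensim's optimised implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one true pair and a few sampled negatives."""
    # Reward a high score for the true (center, context) pair
    loss = -math.log(sigmoid(dot(center, context)))
    # Penalise high scores for the randomly sampled negative words
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


# Toy values for illustration only
center = [1.0, 0.0]
good = sgns_loss(center, context=[1.0, 0.0], negatives=[[-1.0, 0.0]])
bad = sgns_loss(center, context=[-1.0, 0.0], negatives=[[1.0, 0.0]])
print(good < bad)  # True: the well-aligned pair has the lower loss
```

The cost per training pair grows with the number of negatives, which is typically small, and not with the vocabulary size.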

Overall, Word2Vec is a powerful word embedding technique that has been widely used and proven effective for a variety of natural language processing tasks.