## Word Embeddings

In NLP, a word embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the meaning of the word, so that words that are closer together in the vector space are expected to be similar in meaning.
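"Closer in the vector space" is usually measured with cosine similarity. The sketch below is a minimal illustration; the three-dimensional vectors are invented for the example and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "happy": [0.9, 0.8, 0.1],
    "joy": [0.8, 0.9, 0.2],
    "tennis": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["happy"], toy_vectors["joy"])
sim_far = cosine_similarity(toy_vectors["happy"], toy_vectors["tennis"])
print(sim_close)  # close to 1: the vectors point in nearly the same direction
print(sim_far)    # much lower: the vectors point in different directions
```

A value near 1 means the two vectors point in almost the same direction; a value near 0 means they are unrelated.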

There are two types of word embeddings:
1. Frequency-based: Bag of Words, TF-IDF, GloVe
2. Prediction-based: Word2Vec
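To make the frequency-based idea concrete, here is a minimal bag-of-words sketch in plain Python. The two-sentence corpus is invented for illustration. Each document becomes a vector with one position per vocabulary word, so on a real corpus the vectors are very long and mostly zeros.

```python
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary: every distinct word, in sorted order, gets one vector position.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the count of each vocabulary word in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)
for doc in docs:
    print(bag_of_words(doc))
```

Note that "cat" and "dog" simply occupy different positions; nothing in this representation says they are related words.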


# Word2Vec

What is Word2Vec?

It is a word embedding technique that converts words into vectors (i.e., collections of numbers). Google engineers introduced Word2Vec in 2013.

Find the research paper here: https://arxiv.org/pdf/1301.3781.pdf

1. Word2Vec captures the semantic meaning of words (for example, it can tell that "happy" and "joy" are similar words), which is not possible with TF-IDF or bag of words.
2. It produces low-dimensional vectors (generally in the range of 100-300 dimensions), which makes computations faster.
3. The vectors are dense (as opposed to the sparse vectors of bag of words and TF-IDF), which helps reduce problems such as overfitting.

We can use Word2Vec in two ways: either use a pre-trained model, or train our own model and use it.

**The underlying assumption of word2vec is that 2 words sharing similar contexts also share similar meaning and consequently share a similar vector representation from the model**
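The assumption can be illustrated without any neural network by collecting the words that appear around a target word and comparing those context sets. This is only a sketch of the intuition, not how Word2Vec is trained; the sentences and the helper names `context_words` and `jaccard` are invented for the example.

```python
def context_words(sentences, target, window=2):
    """Collect the words seen within `window` positions of `target`."""
    contexts = set()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                contexts.update(left + right)
    return contexts

def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)

# Hypothetical sentences, invented for illustration.
sentences = [
    "she felt happy about the result",
    "she felt joyful about the result",
    "he played tennis on the court",
]

happy = context_words(sentences, "happy")
joyful = context_words(sentences, "joyful")
tennis = context_words(sentences, "tennis")

print(jaccard(happy, joyful))  # identical contexts
print(jaccard(happy, tennis))  # almost no shared context
```

Words with heavily overlapping contexts are the ones a trained model would be expected to place close together.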


We will first use the pre-trained Word2Vec vectors that were trained on part of the Google News corpus (about 100 billion words). This model contains 300-dimensional vectors for 3 million words and phrases.

In [2]:
import gensim
from gensim.models import Word2Vec,KeyedVectors

In [5]:
# Download the googleNews vector model from the below link:
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/view?resourcekey=0-wjGZdNAUop6WykTtMip30g






In [6]:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz',binary=True,limit=500000)
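If the binary file is not already on disk, one alternative is gensim's downloader module. The sketch below assumes the dataset name `word2vec-google-news-300` from the gensim-data catalogue and that network access is available. `load_google_news` is a hypothetical helper name, and the download is large (well over a gigabyte), so nothing is fetched until the function is called.

```python
def load_google_news(limit=500000):
    """Download (once) and load the Google News vectors via gensim's downloader.

    `limit` keeps only the first `limit` vectors to save memory,
    mirroring the `limit=500000` used in the cell above.
    """
    # Imported lazily so that defining the helper has no side effects.
    import gensim.downloader as api
    from gensim.models import KeyedVectors

    # return_path=True downloads the file if needed and returns its location
    # instead of loading all 3 million vectors into memory.
    path = api.load("word2vec-google-news-300", return_path=True)
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)

# Usage (commented out because it triggers the download):
# model = load_google_news()
```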

In [7]:
model['cricket']

array([-3.67187500e-01, -1.21582031e-01,  2.85156250e-01,  8.15429688e-02,
        3.19824219e-02, -3.19824219e-02,  1.34765625e-01, -2.73437500e-01,
        9.46044922e-03, -1.07421875e-01,  2.48046875e-01, -6.05468750e-01,
        5.02929688e-02,  2.98828125e-01,  9.57031250e-02,  1.39648438e-01,
       -5.41992188e-02,  2.91015625e-01,  2.85156250e-01,  1.51367188e-01,
       -2.89062500e-01, -3.46679688e-02,  1.81884766e-02, -3.92578125e-01,
        2.46093750e-01,  2.51953125e-01, -9.86328125e-02,  3.22265625e-01,
        4.49218750e-01, -1.36718750e-01, -2.34375000e-01,  4.12597656e-02,
       -2.15820312e-01,  1.69921875e-01,  2.56347656e-02,  1.50146484e-02,
       -3.75976562e-02,  6.95800781e-03,  4.00390625e-01,  2.09960938e-01,
        1.17675781e-01, -4.19921875e-02,  2.34375000e-01,  2.03125000e-01,
       -1.86523438e-01, -2.46093750e-01,  3.12500000e-01, -2.59765625e-01,
       -1.06933594e-01,  1.04003906e-01, -1.79687500e-01,  5.71289062e-02,
       -7.41577148e-03, -

In [9]:
model.most_similar('football')

[('soccer', 0.731354832649231),
 ('Football', 0.7124834060668945),
 ('basketball', 0.668246865272522),
 ('footbal', 0.6649289727210999),
 ('athletics', 0.6265192627906799),
 ('gridiron', 0.6191604733467102),
 ('baseball', 0.6162001490592957),
 ('sports', 0.5927178859710693),
 ('footballing', 0.5805955529212952),
 ('coaches', 0.5791539549827576)]

In [11]:
model.similarity('man','woman')

0.76640123

In [12]:
model.similarity('man','tennis')

0.12813176

In [16]:
model.doesnt_match(['css','html','kitten'])

'kitten'
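As a rough sketch of the idea behind `doesnt_match`, the code below normalizes each vector, averages them, and returns the word least similar to that average. This is an approximation written for illustration, not gensim's source code, and the two-dimensional vectors are invented.

```python
import numpy as np

def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    scores = unit @ mean  # one dot product with the mean per word
    return words[int(np.argmin(scores))]

# Hypothetical 2-dimensional vectors, invented for illustration.
toy_vectors = {
    "css": np.array([0.9, 0.1]),
    "html": np.array([0.8, 0.2]),
    "kitten": np.array([0.1, 0.9]),
}

odd_one = doesnt_match(toy_vectors, ["css", "html", "kitten"])
print(odd_one)
```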

## Two Architectures of Word2Vec

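1. CBoW (Continuous Bag of Words): predicts a word from its surrounding context words
2. Skip-gram: predicts the surrounding context words from a given word

Both are shallow neural networks.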
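The two architectures differ mainly in how training examples are built from a sliding window over the text. The sketch below shows that difference only; it uses a fixed window and skips details of real implementations (such as randomly shrinking the window or subsampling frequent words). `training_examples` is a hypothetical helper name.

```python
def training_examples(tokens, window=2):
    """Yield (context_words, center_word) for every position in `tokens`."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, center

tokens = "the quick brown fox jumps".split()

# CBoW: the context words together are the input, the center word is the target.
cbow_examples = list(training_examples(tokens))

# Skip-gram: the center word is the input, each context word is a separate target.
skipgram_examples = [
    (center, context_word)
    for context, center in training_examples(tokens)
    for context_word in context
]

print(cbow_examples[2])
print(skipgram_examples[:3])
```

Skip-gram produces one training pair per context word, so the same text yields more (and more fine-grained) examples than CBoW.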

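When to use which

CBoW trains faster and tends to do slightly better on frequent words, so it is a reasonable default when training time matters.

Skip-gram is slower, but it represents rare words better and, according to the original authors, works well even with a small amount of training data. Treat these as rules of thumb and compare both on your own data.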

**Improve Quality of Word2Vec Embeddings**
1. Increase the amount of training data
2. Increase the dimension of the vectors (the size of the hidden layer)
3. Increase the window size
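In gensim 4, the last two knobs map to the `vector_size` and `window` arguments of `Word2Vec`, and `sg` switches between the two architectures. The values below are assumed starting points for illustration, not tuned recommendations, and `train_word2vec` is a hypothetical helper name.

```python
# Assumed starting values for illustration, not tuned recommendations.
word2vec_params = {
    "vector_size": 200,  # dimension of each word vector (gensim default: 100)
    "window": 10,        # context words considered on each side (default: 5)
    "min_count": 2,      # ignore words seen fewer than this many times
    "sg": 1,             # 1 = skip-gram, 0 = CBoW (the default)
    "epochs": 10,        # passes over the corpus (default: 5)
}

def train_word2vec(sentences, params=word2vec_params):
    """Train a model on an iterable of token lists."""
    # Imported lazily so the parameter dict can be inspected without gensim.
    from gensim.models import Word2Vec
    # Passing `sentences` builds the vocabulary and trains in one step.
    return Word2Vec(sentences=sentences, **params)
```

Larger values generally cost more training time and memory, so it is worth changing one setting at a time and checking the resulting similarities.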



# Building our own Word2Vec Model on Game of Thrones Data

In [1]:
import numpy as np
import pandas as pd
import gensim
import os


In [2]:
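from nltk import sent_tokenize  # requires NLTK's sentence tokenizer data to be downloaded
from gensim.utils import simple_preprocess

# Build the training corpus: one list of lowercase tokens per sentence.
story = []
for filename in os.listdir('data'):
    filepath = os.path.join('data', filename)
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
        corpus = f.read()
    # Split each book into sentences, then tokenize every sentence.
    for sent in sent_tokenize(corpus):
        story.append(simple_preprocess(sent))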
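With its default settings, `simple_preprocess` lowercases the text, keeps alphabetic tokens, and drops tokens shorter than 2 or longer than 15 characters. That is why single-letter words such as "I" and "a" are missing from the tokenized sentences. The function below is a rough approximation written for illustration (`approx_simple_preprocess` is a hypothetical name), not gensim's implementation.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens, drop very short or very long ones."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = approx_simple_preprocess("I should like to see a dragon.")
print(tokens)  # ['should', 'like', 'to', 'see', 'dragon']
```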

In [3]:
story

[['version',
  'history',
  'reedited',
  'by',
  'maelstrom',
  'feast',
  'for',
  'crows',
  'book',
  'four',
  'song',
  'of',
  'ice',
  'and',
  'fire',
  'george'],
 ['martin', 'prologue', 'dragons', 'said', 'mollander'],
 ['he',
  'snatched',
  'withered',
  'apple',
  'off',
  'the',
  'ground',
  'and',
  'tossed',
  'it',
  'hand',
  'to',
  'hand'],
 ['throw', 'the', 'apple', 'urged', 'alleras', 'the', 'sphinx'],
 ['he',
  'slipped',
  'an',
  'arrow',
  'from',
  'his',
  'quiver',
  'and',
  'nocked',
  'it',
  'to',
  'his',
  'bowstring'],
 ['should', 'like', 'to', 'see', 'dragon'],
 ['roone',
  'was',
  'the',
  'youngest',
  'of',
  'them',
  'chunky',
  'boy',
  'still',
  'two',
  'years',
  'shy',
  'of',
  'manhood'],
 ['should', 'like', 'that', 'very', 'much'],
 ['and',
  'should',
  'like',
  'to',
  'sleep',
  'with',
  'roseys',
  'arms',
  'around',
  'me',
  'pate',
  'thought'],
 ['he', 'shifted', 'restlessly', 'on', 'the', 'bench'],
 ['by', 'the', 'morrow

In [4]:
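model = gensim.models.Word2Vec(
    window=10,    # context words considered on each side of the target word
    min_count=2   # ignore words that appear fewer than 2 times
)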


In [5]:
model.build_vocab(story)
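Conceptually, building the vocabulary means counting every token and keeping those that appear at least `min_count` times, ordered from most to least frequent. The sketch below shows that idea on an invented corpus. `frequent_vocab` is a hypothetical helper, and gensim's real `build_vocab` does more (for example, preparing structures for subsampling and the output layer).

```python
from collections import Counter

def frequent_vocab(sentences, min_count=2):
    """Return words seen at least `min_count` times, most frequent first."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [word for word, n in counts.most_common() if n >= min_count]

# Hypothetical tokenized sentences, invented for illustration.
toy_story = [
    ["the", "king", "rode", "north"],
    ["the", "queen", "rode", "south"],
    ["the", "king", "returned"],
]

toy_vocab = frequent_vocab(toy_story)
print(toy_vocab)
```

Words that occur only once ("north", "queen", "south", "returned") are dropped, so the model never learns a vector for them.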

In [6]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(6571866, 8627235)

In [7]:
model.wv.most_similar('daenerys')

[('stormborn', 0.8179689645767212),
 ('targaryen', 0.7677925229072571),
 ('unburnt', 0.7556098699569702),
 ('queen', 0.6846722960472107),
 ('princess', 0.676594614982605),
 ('dorne', 0.6668779253959656),
 ('elia', 0.6556617021560669),
 ('regent', 0.6521868109703064),
 ('myrcella', 0.6493724584579468),
 ('westeros', 0.6344913840293884)]

In [8]:
model.wv.doesnt_match(['jon','rikon','robb','arya','sansa','bran'])

'jon'

In [9]:
model.wv['king']

array([-1.1278077 ,  0.4699212 ,  2.805081  , -2.8894534 , -4.1045623 ,
        0.90772444,  1.4439554 ,  3.4037502 , -1.3755314 , -0.31714225,
        0.53251576, -1.1636502 , -1.7950252 ,  2.0721662 , -2.6240528 ,
        0.1928149 , -0.9703315 ,  0.87171143, -1.7479708 , -0.8257515 ,
        1.7716615 , -0.70534927, -1.3757478 ,  0.9829423 , -2.1834338 ,
       -1.8190233 ,  1.3525578 ,  1.0726341 ,  3.822606  ,  1.1726354 ,
        0.08465   ,  1.6914533 , -0.37572816,  1.1318582 , -0.6198092 ,
       -1.4762629 ,  1.8619627 , -2.8753948 ,  0.26056585, -0.36060756,
        0.727543  ,  2.956461  , -0.20703487,  3.9132373 , -1.9147433 ,
       -1.686599  , -1.4111073 , -1.3884012 ,  1.5301958 ,  0.05193291,
       -2.8162498 , -1.8830812 , -1.2469862 , -3.36048   ,  3.0694559 ,
        1.5527705 , -0.1959809 , -0.57217187, -0.1622321 , -3.3723292 ,
        0.6377066 ,  0.41529018, -2.233273  ,  0.9437903 , -2.2170756 ,
        2.63736   , -0.3141105 , -0.3245771 , -0.25476202, -1.26

In [10]:
model.wv.similarity('arya','sansa')

0.83841383

In [11]:
y = model.wv.index_to_key

In [12]:
y

['the',
 'and',
 'to',
 'of',
 'he',
 'his',
 'was',
 'you',
 'her',
 'in',
 'it',
 'had',
 'that',
 'she',
 'as',
 'with',
 'him',
 'not',
 'but',
 'for',
 'they',
 'is',
 'at',
 'on',
 'said',
 'my',
 'have',
 'be',
 'lord',
 'them',
 'no',
 'from',
 'would',
 'were',
 'me',
 'your',
 'one',
 'all',
 'when',
 'will',
 'ser',
 'if',
 'so',
 'their',
 'could',
 'we',
 'are',
 'man',
 'there',
 'this',
 'up',
 'been',
 'what',
 'did',
 'by',
 'do',
 'men',
 'king',
 'back',
 'out',
 'more',
 'or',
 'who',
 'down',
 'well',
 'than',
 'only',
 'like',
 'jon',
 'some',
 'old',
 'hand',
 'even',
 'too',
 'before',
 'never',
 'father',
 'tyrion',
 'an',
 'off',
 'see',
 'know',
 'into',
 'made',
 'now',
 'eyes',
 'black',
 'told',
 'thought',
 'lady',
 'time',
 'then',
 'how',
 'long',
 'has',
 'might',
 'us',
 'come',
 'where',
 'can',
 'here',
 'through',
 'face',
 'still',
 'head',
 'red',
 'way',
 'page',
 'boy',
 'must',
 'once',
 'good',
 'two',
 'queen',
 'over',
 'brother',
 'little'

In [13]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)

X = pca.fit_transform(model.wv.get_normed_vectors())

X.shape



(17712, 3)
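To see what the projection does, here is a minimal NumPy sketch of PCA via singular value decomposition. It runs on random stand-in data rather than the trained embeddings, `pca_project` is a hypothetical helper, and the signs of its components may differ from scikit-learn's. It also reports how much variance the kept components explain, which is worth checking because three dimensions usually retain only part of the information in 100-dimensional vectors.

```python
import numpy as np

def pca_project(vectors, n_components=3):
    """Project rows of `vectors` onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]

# Random stand-in for an (n_words, 100) embedding matrix; assumption for the demo.
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(50, 100))

projected, explained = pca_project(fake_vectors)
print(projected.shape)   # (50, 3)
print(explained.sum())   # fraction of total variance kept by 3 components
```

With scikit-learn, the same fraction is available as `pca.explained_variance_ratio_.sum()` after fitting.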

In [14]:


import plotly.express as px
fig = px.scatter_3d(X[200:300],x=0,y=1,z=2, color=y[200:300])
fig.show()

