### Word Embeddings
- In 2013, a seminal paper by Mikolov et al. showed that “Word2vec,” a neural network–based word representation model built on “distributional similarity,” can capture word analogy relationships such as: King – Man + Woman ≈ Queen

- The learned representations, also called embeddings, are low-dimensional vectors (typically 50 to 500 dimensions; 300 is the most commonly used).

- Research groups have developed the following word embeddings: 

* Google: Word2vec  
* Facebook: fastText  
* Stanford: GloVe

- Word2vec can be trained with 2 methods: CBOW and Skip-gram.  
- fastText is trained using character n-grams.  
- GloVe is trained using co-occurrence matrices.
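
To make the n-gram idea more concrete, here is a minimal sketch of how a word can be split into character n-grams in the style fastText uses (boundary markers `<` and `>`, lengths 3 to 6 by default). The helper name `char_ngrams` is hypothetical and this is not fastText's own code; it only illustrates the units the model works with.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in boundary markers.

    Hypothetical helper for illustration only, not part of fastText.
    """
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Trigrams only, to keep the output short
trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)

# A misspelled word still shares many n-grams with the correct spelling,
# which is why subword models can cope with rare or unseen words.
shared = set(char_ngrams("beautiful")) & set(char_ngrams("beatiful"))
print(sorted(shared))
```

Because a word's vector is built from its n-gram vectors, a model of this kind can still produce a representation for a word it never saw during training.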

### How do we find the vector that best represents the meaning of the word?
- Word2vec takes a large corpus of text as input and “learns” to represent the words in a common vector space based on the contexts in which they appear in the corpus.  
- Consider a word w and the words appearing in its context C.
- For every word w in the corpus, we start with a vector v_w initialized with random values. The Word2vec model refines the values in v_w by predicting v_w, given the vectors for the words in the context C. It does this using a two-layer neural network.

Example:
 - My dog likes to play in the garden.
 - My cat likes to play in the kitchen.
 - I like to eat pizza.

<img src="images/worde_3p.png">
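
The intuition behind the three example sentences can be checked with a few lines of counting. The sketch below is not Word2vec training; it only collects the words that surround each target word (a window of 2 is an assumption) and compares the resulting context sets. The function names are hypothetical.

```python
sentences = [
    "My dog likes to play in the garden",
    "My cat likes to play in the kitchen",
    "I like to eat pizza",
]


def context_words(target, sentences, window=2):
    """Collect the words within `window` positions of each occurrence of `target`."""
    contexts = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                contexts.update(left + right)
    return contexts


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


dog = context_words("dog", sentences)
cat = context_words("cat", sentences)
pizza = context_words("pizza", sentences)

print("dog vs cat:  ", jaccard(dog, cat))
print("dog vs pizza:", jaccard(dog, pizza))
```

Words that appear in similar contexts ("dog" and "cat") end up with a high overlap, while "pizza" shares little with either. Word2vec captures the same signal, but in dense vectors learned by a neural network rather than in raw context sets.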


There are 3 scenarios for using word embeddings:  
1. Train the word embeddings and the model at the same time.  
2. First train the word embeddings on a large dataset, then use them to train the model.  
3. Load pretrained word embeddings and use them to train the model.   

<img src="images/word_embeddings.png">


### Pre-trained word embeddings

- Training your own word embeddings is a pretty expensive process (in terms of both time and computing). Pre-trained word embeddings, however, have already been trained on a large corpus, such as Wikipedia, news articles, or even the entire web, and can simply be downloaded and reused.  
- Such embeddings can be thought of as a large collection of key-value pairs, where keys are the words in the vocabulary and values are their corresponding word vectors.
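
The key-value view can be illustrated with a plain dictionary. The sketch below parses a tiny, made-up file in the word2vec text format (a header line with vocabulary size and dimension, then one word and its values per line). The loader name `load_text_vectors` and the toy vectors are assumptions for illustration; real files are far larger and are usually read with a library.

```python
import io

# Tiny made-up file in the word2vec text format:
# header "vocab_size dimensions", then one "word v1 v2 ..." line per word.
toy_file = io.StringIO(
    "3 4\n"
    "cat 0.1 0.2 0.3 0.4\n"
    "dog 0.1 0.2 0.3 0.5\n"
    "pizza 0.9 0.8 0.1 0.0\n"
)


def load_text_vectors(handle):
    """Read word vectors into a dict mapping word -> list of floats."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the file contents")
    return vectors


embeddings = load_text_vectors(toy_file)
print(embeddings["dog"])          # key -> vector lookup
print(embeddings.get("unicorn"))  # a word outside the vocabulary has no entry
```

Looking up a word is a dictionary access, and a word that was not in the training vocabulary simply has no key.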

In [2]:
import os
import wget
import gzip
import shutil

gn_vec_path = "GoogleNews-vectors-negative300.bin"
if not os.path.exists("GoogleNews-vectors-negative300.bin"):
    if not os.path.exists("../Ch2/GoogleNews-vectors-negative300.bin"):
        #Downloading the required model
        if not os.path.exists("../Ch2/GoogleNews-vectors-negative300.bin.gz"):
            if not os.path.exists("GoogleNews-vectors-negative300.bin.gz"):
                wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")
            gn_vec_zip_path = "GoogleNews-vectors-negative300.bin.gz"
        else:
            gn_vec_zip_path = "../Ch2/GoogleNews-vectors-negative300.bin.gz"
        #Extracting the required model
        with gzip.open(gn_vec_zip_path, 'rb') as f_in:
            with open(gn_vec_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    else:
        gn_vec_path = "../Ch2/" + gn_vec_path

print(f"Model at {gn_vec_path}")

Model at GoogleNews-vectors-negative300.bin


In [3]:
import warnings #This module ignores the various types of warnings generated
warnings.filterwarnings("ignore") 

import psutil #This module helps in retrieving information on running processes and system resource utilization
process = psutil.Process(os.getpid())
from psutil import virtual_memory
mem = virtual_memory()

import time #This module is used to calculate the time

In [5]:

from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = gn_vec_path

#Load W2V model. This will take some time, but it is a one time effort! 
pre = process.memory_info().rss
print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9))) #Check memory usage before loading the model
print('-'*10)

start_time = time.time() #Start the timer
ttl = mem.total #Total memory available

w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True) #load the model
print("%0.2f seconds taken to load"%float(time.time() - start_time)) #Calculate the total time elapsed since starting the timer
print('-'*10)

print('Finished loading Word2Vec')
print('-'*10)

post = process.memory_info().rss
print("Memory used in GB after Loading the Model: {:.2f}".format(float(post/(10**9)))) #Calculate the memory used after loading the model
print('-'*10)

print("Memory usage after loading, as a percentage of usage before: {:.2f}% ".format(float((post/pre)*100))) #Ratio of memory used after loading to memory used before, as a percentage
print('-'*10)

print("Number of words in vocabulary: ",len(w2v_model.index_to_key)) #Number of words in the vocabulary.

Memory used in GB before Loading the Model: 4.21
----------
52.81 seconds taken to load
----------
Finished loading Word2Vec
----------
Memory used in GB after Loading the Model: 6.73
----------
Memory usage after loading, as a percentage of usage before: 160.05% 
----------
Number of words in vocabulary:  3000000


In [6]:
#Let us examine the model by knowing what the most similar words are, for a given word!
w2v_model.most_similar('beautiful')

[('gorgeous', 0.8353005051612854),
 ('lovely', 0.8106936812400818),
 ('stunningly_beautiful', 0.7329413294792175),
 ('breathtakingly_beautiful', 0.7231340408325195),
 ('wonderful', 0.6854086518287659),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402888298035)]

In [7]:
#Let us try with another word! 
w2v_model.most_similar('toronto')

[('montreal', 0.6984112858772278),
 ('vancouver', 0.6587257385253906),
 ('nyc', 0.6248832941055298),
 ('alberta', 0.6179691553115845),
 ('boston', 0.611499547958374),
 ('calgary', 0.61032634973526),
 ('edmonton', 0.6100260615348816),
 ('canadian', 0.5944076776504517),
 ('chicago', 0.5911980271339417),
 ('springfield', 0.5888351798057556)]
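
The scores returned by `most_similar` are cosine similarities, and the same ranking idea supports the King – Man + Woman analogy mentioned earlier. The sketch below reproduces the mechanics on hand-made 3-dimensional toy vectors; the values are invented for illustration and the helper names are hypothetical. It works on raw vectors, which is a simplification of what gensim does internally.

```python
import math

# Invented toy vectors; rough meaning of the axes: [royalty, male, female]
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, exclude=(), topn=3):
    """Rank vocabulary words by cosine similarity to `query`."""
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


king, man, woman = (toy_vectors[w] for w in ("king", "man", "woman"))
target = [k - m + w for k, m, w in zip(king, man, woman)]  # king - man + woman

# The input words are excluded from the candidates, as gensim also does
best_word, best_score = nearest(target, toy_vectors, exclude={"king", "man", "woman"}, topn=1)[0]
print(best_word, round(best_score, 3))
```

With the loaded model, the equivalent query would be `w2v_model.most_similar(positive=['woman', 'king'], negative=['man'])`. The GoogleNews vocabulary is case-sensitive, so the exact spelling of the query words matters.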

In [9]:
#What is the vector representation for a word? 
w2v_model['computer'].shape

(300,)

In [10]:
w2v_model['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

In [3]:
!python3 -m spacy download en_core_web_md

2021-07-04 13:16:15.616178: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2021-07-04 13:16:15.616272: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2021-07-04 13:16:15.616286: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Collecting en-core-web-md==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.0.0/en_core_web_md-3.0.0-py3-none-any.whl (47.1 MB)
     |████████████████████████████████| 47.1 MB 138 kB/s 
Installing collected packages: en-core-web-m

In [4]:
import spacy

%time nlp = spacy.load('en_core_web_md')
# process a sentence using the model
mydoc = nlp("Canada is a large country")
#Get a vector for individual words
#print(mydoc[0].vector) #vector for 'Canada', the first word in the text 
print(mydoc.vector) #Averaged vector for the entire sentence

CPU times: user 1.37 s, sys: 165 ms, total: 1.53 s
Wall time: 1.64 s
[-1.12055197e-01  2.26087615e-01 -5.15111461e-02 -1.21812008e-01
  4.13958639e-01 -8.56475979e-02 -2.84600933e-03 -2.26096585e-01
  6.98113963e-02  2.27946019e+00 -4.49774921e-01 -6.39050007e-02
 -1.80326015e-01 -8.79765972e-02  9.93399299e-04 -1.57384202e-01
 -1.23817801e-01  1.54990411e+00  2.00794004e-02  1.38399601e-01
 -1.48897991e-01 -2.23025799e-01 -1.48171991e-01  4.68924567e-02
 -3.17026004e-02  1.19096041e-02 -6.10985979e-02  9.57068056e-02
  9.37099904e-02  1.70955807e-01 -9.29740071e-03  7.88536817e-02
  1.74508005e-01 -1.04450598e-01  1.04872189e-01 -1.16961405e-01
  6.23028055e-02 -2.23016590e-01 -1.44107476e-01 -2.03423887e-01
  2.61404991e-01  2.43404001e-01  1.51980996e-01 -1.12484001e-01
  1.18055798e-01 -9.51323956e-02  8.66319984e-02 -2.54322797e-01
  3.84932049e-02  1.18278004e-01 -3.21602583e-01  3.73764008e-01
  1.13018408e-01 -8.05834010e-02  1.84921592e-01  9.38879885e-03
  1.22166201e-01 -3.2

In [15]:
print(mydoc.vector.shape)
print(mydoc[0])
print(mydoc[0].vector.shape)


(300,)
Canada
(300,)


In [17]:
temp = nlp('practicalnlp is a newword')
temp[0].vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.
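
The all-zero vector above is what the model returned for a token it has no vector for. If such tokens are averaged into a sentence vector, they pull it toward zero. One possible workaround, sketched below on invented toy vectors, is to average only the known words and report the missing ones. The function name and the values are assumptions; with spaCy, the token attributes `is_oov` and `has_vector` could play the role of the dictionary membership test.

```python
# Invented 3-dimensional toy vectors, for illustration only
toy_vectors = {
    "canada": [0.2, 0.4, 0.6],
    "is":     [0.0, 0.2, 0.0],
    "large":  [0.4, 0.0, 0.3],
}
DIM = 3


def sentence_vector(tokens, vectors, dim):
    """Average the vectors of known tokens and list the tokens without a vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    missing = [t for t in tokens if t not in vectors]
    if not known:
        return [0.0] * dim, missing  # nothing to average
    mean = [sum(column) / len(known) for column in zip(*known)]
    return mean, missing


vec, missing = sentence_vector("canada is large practicalnlp".split(), toy_vectors, DIM)
print("sentence vector:", [round(x, 3) for x in vec])
print("out of vocabulary:", missing)
```

Whether to skip unknown words, map them to a shared "unknown" vector, or switch to a subword model depends on the task; this sketch only shows the first option.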

### Training Word Embeddings

There are 2 variants:  
• Continuous bag of words (CBOW)  
• SkipGram

CBOW:  
<img src="images/CBOW.png">


Example with k = 2  
<img src="images/cbow_2.png">
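
As a rough illustration of what a window of k = 2 means for the training data, the sketch below builds the training examples for both variants from one sentence. CBOW pairs a list of context words with the center word to predict, while SkipGram pairs the center word with each context word in turn. The function names are hypothetical, and shortening the window at the sentence edges is an assumption of this sketch.

```python
def cbow_pairs(tokens, k=2):
    """Return (context_words, target_word) pairs using k words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, k=2):
    """Return (center_word, context_word) pairs using k words on each side."""
    return [(target, c) for context, target in cbow_pairs(tokens, k) for c in context]


tokens = "my dog likes to play in the garden".split()

for context, target in cbow_pairs(tokens)[:3]:
    print("CBOW:     ", context, "->", target)

for center, context_word in skipgram_pairs(tokens)[:4]:
    print("SkipGram: ", center, "->", context_word)
```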


CBOW Model:

<img src="images/cbow_model.png">
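
A forward pass through a CBOW-style two-layer network can be sketched in plain Python. The context word vectors are averaged into a hidden vector, which is projected onto the vocabulary and turned into probabilities with a softmax. The vocabulary, the sizes, and the random weights below are assumptions. The network is untrained, and real Word2vec implementations avoid the full softmax by using hierarchical softmax or negative sampling.

```python
import math
import random

random.seed(0)

vocab = ["my", "dog", "likes", "to", "play"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4  # vocabulary size and embedding dimension (toy values)

# Layer 1: input embeddings (V x D). Layer 2: output weights (D x V).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_words):
    """Average the context embeddings, then score every vocabulary word."""
    ids = [word_to_id[w] for w in context_words]
    hidden = [sum(W_in[i][d] for i in ids) / len(ids) for d in range(D)]
    scores = [sum(hidden[d] * W_out[d][j] for d in range(D)) for j in range(V)]
    return softmax(scores)


probs = cbow_forward(["my", "dog", "to", "play"])
loss = -math.log(probs[word_to_id["likes"]])  # cross-entropy for the true center word

print({w: round(p, 3) for w, p in zip(vocab, probs)})
print("loss for target 'likes':", round(loss, 3))
```

Training would repeat this step over many (context, target) pairs and adjust both weight matrices to lower the loss; the rows of the input matrix are then used as the word embeddings.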


### Distributed Representations Beyond Words and Characters (Doc2vec)
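A problem with Word2vec is that it learns representations for individual words, which we then aggregate to form text representations, but these representations do not take the context of words into account.  
Example: depending on the context, “Apple” can be more similar to “Microsoft” than to “orange,” yet a context-free word vector is the same in both cases.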
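
One consequence of aggregating word vectors can be shown directly: averaging ignores word order, so two sentences with different meanings can receive exactly the same representation. The toy vectors below are invented for illustration.

```python
# Invented 2-dimensional toy vectors, for illustration only
toy_vectors = {
    "dog":   [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man":   [0.5, 0.5],
}


def average(tokens):
    """Represent a text as the mean of its word vectors."""
    rows = [toy_vectors[t] for t in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


a = average("dog bites man".split())
b = average("man bites dog".split())

print(a, b)
print("same representation:", a == b)
```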

Doc2vec allows us to directly learn the representations for texts of arbitrary lengths (phrases, sentences, paragraphs, and documents) by taking the context of words in the text into account.


Doc2Vec Architecture  
- Offers some form of context and can encode texts of arbitrary length into a fixed, low-dimensional, dense vector.  
- It has been applied to a wide range of NLP tasks, such as text classification, document tagging, text recommendation systems, and simple chatbots for FAQs.  

<img src="images/doc2vec.png">


## Visualizing Word Embeddings with t-SNE

### t-SNE, or t-distributed Stochastic Neighbor Embedding
It’s a technique for visualizing high-dimensional data, such as embeddings, by reducing them to two or three dimensions.

t-SNE on MNIST

<img src="images/tsne_MNIST.png">

t-SNE visualization shows some interesting relationships in word embeddings  
<img src="images/tsne_relations.png">

Embeddings of Wikipedia articles: take articles on various topics, obtain the corresponding document vector for each article, then plot these vectors using t-SNE  
<img src="images/tsne_wikipedia.png">