# Word Embeddings

Set the path to the GloVe pretrained embeddings file. This file can be downloaded from here...

In [1]:
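# path to the downloaded embeddings file
import os

# Notebooks do not define __file__, so the quoted string is resolved against the
# current working directory; fileDir ends up being that working directory.
fileDir = os.path.dirname(os.path.realpath('__file__'))
absFilePathToGloVe = os.path.join(fileDir, '../Data/glove.6B.100d.txt')
pathToGloveEmbeddings = os.path.abspath(os.path.realpath(absFilePathToGloVe))
print(pathToGloveEmbeddings)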

c:\Users\v-tastan\source\repos\PetnicaNLPWorkshop\Data\glove.6B.100d.txt


Instantiate the <code>PreTrainedEmbeddings</code> class, which is used to efficiently load and process embeddings:

In [8]:
from PreTrainedEmbeddings import PreTrainedEmbeddings

embeddings = PreTrainedEmbeddings.from_embeddings_file(pathToGloveEmbeddings)
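
The internals of <code>PreTrainedEmbeddings</code> are not shown in this notebook. As a rough illustration of what loading involves, here is a minimal sketch that parses text in the GloVe file layout (one token per line, followed by space-separated floats). The function name <code>load_glove_text</code> and the three-dimensional sample data are assumptions made for this example; they are not part of the workshop code.

```python
import io


def load_glove_text(handle):
    """Parse GloVe-style lines: a token followed by space-separated floats.

    Hypothetical helper for illustration; not the PreTrainedEmbeddings API.
    """
    word_to_index = {}
    vectors = []
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # The row position of a vector is the index assigned to its word
        word_to_index[parts[0]] = len(vectors)
        vectors.append([float(value) for value in parts[1:]])
    return word_to_index, vectors


# Made-up sample in the same layout as glove.6B.100d.txt (3 dimensions, not 100)
sample = io.StringIO("hello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")
word_to_index, vectors = load_glove_text(sample)
print(word_to_index["world"], vectors[word_to_index["world"]])
```

Keeping a word-to-index dictionary next to a list of vectors is one simple way to support lookups like <code>get_embedding</code>; the real class may organize its data differently.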

Explore the loaded pretrained embedding vectors:

In [10]:
embeddings.get_embedding(word="hello")

array([ 0.26688  ,  0.39632  ,  0.6169   , -0.77451  , -0.1039   ,
        0.26697  ,  0.2788   ,  0.30992  ,  0.0054685, -0.085256 ,
        0.73602  , -0.098432 ,  0.5479   , -0.030305 ,  0.33479  ,
        0.14094  , -0.0070003,  0.32569  ,  0.22902  ,  0.46557  ,
       -0.19531  ,  0.37491  , -0.7139   , -0.51775  ,  0.77039  ,
        1.0881   , -0.66011  , -0.16234  ,  0.9119   ,  0.21046  ,
        0.047494 ,  1.0019   ,  1.1133   ,  0.70094  , -0.08696  ,
        0.47571  ,  0.1636   , -0.44469  ,  0.4469   , -0.93817  ,
        0.013101 ,  0.085964 , -0.67456  ,  0.49662  , -0.037827 ,
       -0.11038  , -0.28612  ,  0.074606 , -0.31527  , -0.093774 ,
       -0.57069  ,  0.66865  ,  0.45307  , -0.34154  , -0.7166   ,
       -0.75273  ,  0.075212 ,  0.57903  , -0.1191   , -0.11379  ,
       -0.10026  ,  0.71341  , -1.1574   , -0.74026  ,  0.40452  ,
        0.18023  ,  0.21449  ,  0.37638  ,  0.11239  , -0.53639  ,
       -0.025092 ,  0.31886  , -0.25013  , -0.63283  , -0.0118

One of the core features of word embeddings is that they encode syntactic and semantic relationships that manifest as regularities in word use. One of the most common ways to explore the semantic relationships encoded in word embeddings is the "analogy task": given three words, determine the fourth word that has the same relationship to the third word as the second word has to the first.

If we treat words purely as vectors in a vector space, the difference between the vectors <code>word2</code> and <code>word1</code> encodes the relationship between these two words. The same difference should hold between the vectors <code>word4</code> and <code>word3</code>, since they have an analogous relationship. Therefore, the vector corresponding to the fourth word is calculated as <code>word3 + (word2 - word1)</code>. A nearest-neighbor query for this result vector, among the vectors of the existing words, solves the analogy task.
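
The arithmetic can be checked by hand on a toy example. The sketch below uses made-up two-dimensional vectors and ranks candidates by cosine similarity. The choice of cosine similarity is an assumption here, since the notebook does not show which distance <code>get_words_closest_to_vector</code> uses, and the names <code>toy_vectors</code> and <code>solve_analogy</code> are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors for illustration only: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}


def solve_analogy(word1, word2, word3, vectors):
    """Return the word whose vector is closest to word3 + (word2 - word1)."""
    v1, v2, v3 = vectors[word1], vectors[word2], vectors[word3]
    target = [c + (b - a) for a, b, c in zip(v1, v2, v3)]
    # Exclude the three input words from the candidate set
    candidates = [w for w in vectors if w not in {word1, word2, word3}]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


print(solve_analogy("man", "king", "woman", toy_vectors))  # prints "queen"
```

Here <code>woman + (king - man)</code> equals <code>[1.0, -1.0]</code>, which coincides with the toy vector for "queen". With real embeddings the match is only approximate, which is why a nearest-neighbor search is needed.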

In [10]:
def compute_and_print_analogy(embeddings, word1, word2, word3, number_analogies=5):
    """Print completions of the analogy word1 : word2 :: word3 : ?"""
    vector1 = embeddings.get_embedding(word1)
    vector2 = embeddings.get_embedding(word2)
    vector3 = embeddings.get_embedding(word3)

    # The offset from word1 to word2 encodes their relationship
    spatial_relationship = vector2 - vector1

    # Apply the same offset to word3 to estimate the vector of the fourth word
    vector4 = vector3 + spatial_relationship

    closest_words = embeddings.get_words_closest_to_vector(vector=vector4, n=number_analogies)

    # Drop the input words; fewer than number_analogies results may remain
    existing_words = {word1, word2, word3}
    closest_words = [word for word in closest_words if word not in existing_words]

    if len(closest_words) == 0:
        print("Could not find the nearest neighbors for the vector!")
        return

    for word4 in closest_words:
        print("{} : {} :: {} : {}".format(word1, word2, word3, word4))

In [11]:
compute_and_print_analogy(embeddings, "man", "he", "woman")

man : he :: woman : she
man : he :: woman : never
man : he :: woman : her


In [12]:
compute_and_print_analogy(embeddings, "fly", "plane", "sail")

fly : plane :: sail : ship
fly : plane :: sail : vessel
fly : plane :: sail : boat


In [13]:
compute_and_print_analogy(embeddings, "man", "king", "woman")

man : king :: woman : queen
man : king :: woman : monarch
man : king :: woman : throne
man : king :: woman : elizabeth


In [14]:
compute_and_print_analogy(embeddings, "man", "doctor", "woman")

man : doctor :: woman : nurse
man : doctor :: woman : physician
man : doctor :: woman : pregnant


## Add the Sequence Vectorizer

The Sequence Vectorizer prepares the input sequence in the format expected by the Embedding Layer. The Embedding Layer is a PyTorch module that encapsulates the embedding matrix. It maps a token's integer index (in the Vocabulary) to the vector that is then used in the neural network computation.

Therefore, the input sequence should be encoded as a sequence of token indices in the Vocabulary, instead of as one-hot vectors.
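
To see why indices are enough, note that multiplying a one-hot vector by the embedding matrix just selects one row of that matrix. The sketch below demonstrates this equivalence in plain Python on a made-up 4 x 3 matrix. The names <code>embedding_matrix</code>, <code>lookup</code>, and <code>one_hot_matmul</code> are illustrative assumptions; in PyTorch the row selection is what <code>torch.nn.Embedding</code> performs.

```python
# Toy embedding matrix: one row per vocabulary index, 3 dimensions per token
embedding_matrix = [
    [0.0, 0.0, 0.0],  # index 0
    [0.1, 0.2, 0.3],  # index 1
    [0.4, 0.5, 0.6],  # index 2
    [0.7, 0.8, 0.9],  # index 3
]


def lookup(indices, matrix):
    """Index-based encoding: pick the row for each token index."""
    return [matrix[i] for i in indices]


def one_hot_matmul(indices, matrix):
    """One-hot encoding: build a one-hot vector per token and multiply it by the matrix."""
    vocab_size, dim = len(matrix), len(matrix[0])
    rows = []
    for i in indices:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        rows.append(
            [sum(one_hot[j] * matrix[j][d] for j in range(vocab_size)) for d in range(dim)]
        )
    return rows


sequence = [2, 1, 3]
print(lookup(sequence, embedding_matrix) == one_hot_matmul(sequence, embedding_matrix))
```

Both paths give the same vectors, but the lookup avoids materializing a vector of vocabulary size for every token, which matters when the vocabulary is large.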

In [1]:
# path to the preprocessed dataset
absFilePathToPreprocessedDataset = os.path.join(fileDir, '../Data/training.1600000.processed.noemoticon_preprocessed.csv')
pathToPreprocessedDataset = os.path.abspath(os.path.realpath(absFilePathToPreprocessedDataset))
print(pathToPreprocessedDataset)

In [4]:
from TwitterDataset import TwitterDataset

# Step #1: Instantiate the dataset
dataset = TwitterDataset.load_dataset_and_make_vectorizer(pathToPreprocessedDataset, representation="indices")
# get the vectorizer
vectorizer = dataset.get_vectorizer()

In [5]:
# vectorize the text of the tweet
vectorizer.vectorize(text="Jerry is good")

array([ 2,  1, 42, 79,  3], dtype=int64)
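
The vectorizer's implementation is not shown here. The output above would be consistent with a vocabulary that reserves low indices for special tokens (for example a begin marker, an unknown-word marker, and an end marker), but that reading is a guess. The sketch below shows one way such an index-based vectorizer could work; <code>ToyVocabulary</code>, the special token names, and the resulting indices are all assumptions and will not match the real <code>TwitterDataset</code> vocabulary.

```python
class ToyVocabulary:
    """Hypothetical token-to-index mapping with reserved special tokens."""

    def __init__(self):
        self.token_to_index = {}
        # Reserve the first indices for special tokens (an assumed layout)
        for special in ("<MASK>", "<UNK>", "<BEGIN>", "<END>"):
            self.add(special)

    def add(self, token):
        """Add a token if it is new and return its index."""
        if token not in self.token_to_index:
            self.token_to_index[token] = len(self.token_to_index)
        return self.token_to_index[token]

    def lookup(self, token):
        """Return the token's index, falling back to the <UNK> index."""
        return self.token_to_index.get(token, self.token_to_index["<UNK>"])


def vectorize(text, vocabulary):
    """Encode text as [<BEGIN>, token indices..., <END>]."""
    tokens = text.lower().split(" ")
    indices = [vocabulary.lookup("<BEGIN>")]
    indices.extend(vocabulary.lookup(token) for token in tokens)
    indices.append(vocabulary.lookup("<END>"))
    return indices


vocab = ToyVocabulary()
for token in ("is", "good"):
    vocab.add(token)

print(vectorize("Jerry is good", vocab))  # prints [2, 1, 4, 5, 3]
```

In this toy setup "jerry" is not in the vocabulary, so it maps to the unknown-word index 1, while "is" and "good" receive the next free indices after the four special tokens.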