In [None]:
%load_ext autoreload
%autoreload 2

# 5. Word Embeddings

Set the path to the GloVe pretrained embeddings file. This file can be downloaded from here...

In [None]:
# Path to the downloaded embeddings file.
import os

# `__file__` is not defined in a notebook, so the literal string '__file__'
# is resolved against the current working directory instead.
fileDir = os.path.dirname(os.path.realpath('__file__'))
absFilePathToGloVe = os.path.join(fileDir, '../Data/glove.6B.100d.txt')
pathToGloveEmbeddings = os.path.abspath(os.path.realpath(absFilePathToGloVe))
print(pathToGloveEmbeddings)

Instantiate the <code>PreTrainedEmbeddings</code> class, which is used to efficiently load and process embeddings:

In [None]:
from Common.PreTrainedEmbeddings import PreTrainedEmbeddings

embeddings = PreTrainedEmbeddings.from_embeddings_file(pathToGloveEmbeddings)
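
The implementation of <code>PreTrainedEmbeddings</code> is not shown in this notebook. As a rough illustration of what loading involves, the sketch below parses the GloVe text layout (one word per line, followed by its space-separated vector components) into a dictionary. The function name <code>parse_glove_lines</code> and the sample values are made up for this example and are not part of the <code>Common</code> package.

```python
import io


def parse_glove_lines(lines):
    """Parse GloVe-style text lines into a {word: vector} dictionary."""
    word_to_vector = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # The first field is the word, the remaining fields are the vector.
        word_to_vector[parts[0]] = [float(value) for value in parts[1:]]
    return word_to_vector


# Two made-up 4-dimensional entries standing in for the real file.
sample_file = io.StringIO(
    "hello 0.1 0.2 0.3 0.4\n"
    "world 0.5 0.6 0.7 0.8\n"
    "\n"
)
toy_embeddings = parse_glove_lines(sample_file)
print(toy_embeddings["hello"])
```

A real loader would probably also build an index for fast nearest-neighbor queries, which this sketch leaves out.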

Explore the loaded pretrained embedding vectors:

In [None]:
embeddings.get_embedding(word="hello")

One of the core features of word embeddings is that they encode syntactic and semantic relationships that manifest as regularities in word use. One of the most common ways to explore the semantic relationships encoded in word embeddings is the "analogy task": three words are provided, and you determine the fourth word, which has the same relationship to the third word as the second word has to the first.

If we treat words purely as vectors in a vector space, the difference between the vectors <code>word2</code> and <code>word1</code> encodes the relationship between these two words. The same difference should hold between the vectors <code>word4</code> and <code>word3</code>, as they should have an analogous relationship. Therefore, the vector corresponding to the fourth word is calculated as <code>word3 + (word2 - word1)</code>. A nearest-neighbor query for this result vector, among the vectors of the existing words, solves the analogy task.
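
The arithmetic can be checked by hand on a toy example. The sketch below uses made-up two-dimensional vectors and cosine similarity as the nearest-neighbor measure. Both are assumptions for illustration; the measure used by <code>PreTrainedEmbeddings</code> may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}


def solve_analogy(vectors, word1, word2, word3):
    """Return the word closest to word3 + (word2 - word1), excluding the inputs."""
    relationship = [b - a for a, b in zip(vectors[word1], vectors[word2])]
    target = [c + r for c, r in zip(vectors[word3], relationship)]
    candidates = [w for w in vectors if w not in {word1, word2, word3}]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


print(solve_analogy(toy_vectors, "man", "king", "woman"))  # queen
```

Here <code>king - man</code> is <code>[1, 0]</code>, so the target vector is <code>woman + [1, 0] = [1, -1]</code>, which coincides with the toy vector for "queen".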

In [None]:
def compute_and_print_analogy(embeddings, word1, word2, word3, number_analogies=5):
    """Print candidates for word4 in the analogy word1 : word2 :: word3 : word4."""
    vector1 = embeddings.get_embedding(word1)
    vector2 = embeddings.get_embedding(word2)
    vector3 = embeddings.get_embedding(word3)

    # The offset from word1 to word2 encodes their relationship.
    spatial_relationship = vector2 - vector1

    # Applying the same offset to word3 gives the expected position of word4.
    vector4 = vector3 + spatial_relationship

    closest_words = embeddings.get_words_closest_to_vector(vector=vector4, n=number_analogies)

    # The input words are often among the nearest neighbors, so filter them out.
    existing_words = {word1, word2, word3}
    closest_words = [word for word in closest_words if word not in existing_words]

    if len(closest_words) == 0:
        print("Could not find the nearest neighbors for the vector!")
        return

    for word4 in closest_words:
        print("{} : {} :: {} : {}".format(word1, word2, word3, word4))

In [None]:
compute_and_print_analogy(embeddings, "man", "he", "woman")

In [None]:
compute_and_print_analogy(embeddings, "fly", "plane", "sail")

In [None]:
compute_and_print_analogy(embeddings, "man", "king", "woman")

In [None]:
compute_and_print_analogy(embeddings, "man", "doctor", "woman")

## Add the Sequence Vectorizer

The **Sequence Vectorizer** prepares the input sequence in the format expected by the <code>nn.Embedding</code> layer. The <code>nn.Embedding</code> layer is a PyTorch module that encapsulates the embedding matrix. The <code>nn.Embedding</code> layer enables us to map a token's integer index (in the **Vocabulary**) to the vector that is further used in the neural network computation.

Therefore, the input sequence should be encoded as a sequence of token indices in the **Vocabulary**, instead of as one-hot vectors.
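
To see why indices are sufficient, the plain-Python sketch below encodes a sentence as indices and shows that looking up rows of an embedding matrix gives the same result as multiplying one-hot vectors by that matrix. The vocabulary, the <code>&lt;UNK&gt;</code> token, and the matrix values are all hypothetical; this is neither the repository's vectorizer nor the PyTorch implementation.

```python
# Hypothetical toy vocabulary; index 0 is reserved for unknown tokens.
token_to_index = {"<UNK>": 0, "jerry": 1, "is": 2, "good": 3}

# Toy embedding matrix: one row per vocabulary entry, 2 dimensions each.
embedding_matrix = [
    [0.0, 0.0],
    [0.1, 0.9],
    [0.4, 0.4],
    [0.8, 0.2],
]


def encode_as_indices(text):
    """Map each whitespace-separated token to its vocabulary index."""
    unknown = token_to_index["<UNK>"]
    return [token_to_index.get(token, unknown) for token in text.lower().split()]


def lookup(indices):
    """Index-based lookup: pick one matrix row per token index."""
    return [embedding_matrix[i] for i in indices]


def one_hot_lookup(indices):
    """Same result via one-hot vectors multiplied by the embedding matrix."""
    vocab_size = len(embedding_matrix)
    dim = len(embedding_matrix[0])
    result = []
    for i in indices:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        row = [sum(one_hot[j] * embedding_matrix[j][d] for j in range(vocab_size))
               for d in range(dim)]
        result.append(row)
    return result


indices = encode_as_indices("Jerry is good")
print(indices)  # [1, 2, 3]
print(lookup(indices) == one_hot_lookup(indices))  # True
```

The index-based lookup touches one row per token, while the one-hot version iterates over the whole vocabulary for every token, which becomes wasteful for large vocabularies.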

In [None]:
# Path to the preprocessed dataset.
absFilePathToPreprocessedDataset = os.path.join(fileDir, '../Data/training.1600000.processed.noemoticon_preprocessed.csv')
pathToPreprocessedDataset = os.path.abspath(os.path.realpath(absFilePathToPreprocessedDataset))
print(pathToPreprocessedDataset)

In [None]:
from Common.TwitterDataset import TwitterDataset

# Step #1: Instantiate the dataset
dataset = TwitterDataset.load_dataset_and_make_vectorizer(pathToPreprocessedDataset, representation="indices")
# get the vectorizer
vectorizer = dataset.get_vectorizer()

In [None]:
# vectorize the text of the tweet
vectorizer.vectorize(text="Jerry is good")

In [None]:
# vectorize the text of the tweet
vectorizer.vectorize(text="Today is a sunny day and we have a workshop")