# Word Embeddings Are All About Distance

Word embeddings rest on a theory known as the distributional hypothesis, which states that words that occur in the same contexts tend to have similar meanings. With word embeddings, we map words that appear in similar contexts to nearby places in our vector space (math-speak for the area in which our vectors exist).

The numeric values assigned to a word’s vector representation are not important in their own right; they gain meaning only from how close together or far apart the vectors of different words are.

Thus the cosine distance between words with similar contexts will be small, and the cosine distance between words that have very different contexts will be large.
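To make "cosine distance" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors in `toy_vectors` are made up for illustration (real embeddings have many more dimensions), and `cosine_distance` is a hypothetical helper that computes the same quantity as the `scipy.spatial.distance.cosine` function used in the cells that follow: one minus the cosine of the angle between two vectors.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# made-up 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "summer": [0.9, 0.1, 0.0],
    "spring": [0.8, 0.2, 0.1],
    "kitchen": [0.0, 0.2, 0.9],
}

near = cosine_distance(toy_vectors["summer"], toy_vectors["spring"])
far = cosine_distance(toy_vectors["summer"], toy_vectors["kitchen"])

# vectors pointing in similar directions have a smaller cosine distance
print(near < far)  # True
```

Note that only the direction of each vector matters here, not its length, which is one reason the raw values of an embedding are not meaningful on their own.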

Because the literal values of a word’s embedding carry no meaning on their own, the value of word embeddings comes from comparing word vectors and seeing how similar or different they are. What these vectors do encode is latent information about how the words are used.

In [2]:
# pip install spacy
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg

import spacy
from scipy.spatial.distance import cosine
from modules.common_words import most_common_words

In [6]:
# use spaCy to find the vector corresponding to 
# each of the most common words
nlp = spacy.load('en_core_web_md')
vectors = [nlp(_word).vector for _word in most_common_words]

In [7]:
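# define find_closest_words
def find_closest_words(word_list, vector_list, word_to_check):
    # look up the target vector once instead of on every comparison
    target = vector_list[word_list.index(word_to_check)]
    # sort (word, vector) pairs by cosine distance to the target
    ranked = sorted(zip(word_list, vector_list), key=lambda pair: cosine(target, pair[1]))
    # keep the 10 closest words (the target itself comes first, at distance 0)
    return [word for word, _ in ranked[:10]]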

In [9]:
# find closest words to food
close_to_food = find_closest_words(most_common_words, vectors, 'food')
print(close_to_food)

['food', 'eat', 'dinner', 'fish', 'health', 'kitchen', 'good', 'animal', 'water', 'treat']


In [10]:
# find closest words to summer
close_to_summer = find_closest_words(most_common_words, vectors, 'summer')
print(close_to_summer)

['summer', 'spring', 'fall', 'season', 'year', 'week', 'day', 'evening', 'during', 'last']


# Word2vec

You might be asking yourself a question now. How did we arrive at the vector values that define a word vector? And how do we ensure that the values chosen place the vectors for words with similar context close together and the vectors for words with different usages far apart?

Step in word2vec! Word2vec is a statistical learning algorithm that develops word embeddings from a corpus of text. Word2vec uses one of two different model architectures to come up with the values that define a collection of word embeddings.

One method is to use the continuous bag-of-words (CBOW) representation of a piece of text. The word2vec model goes through each word in the training corpus, in order, and tries to predict what word comes at each position based on applying bag-of-words to the words that surround the word in question. In this approach, the order of the words does not matter!

The other method word2vec can use to create word embeddings is continuous skip-grams. Skip-grams function similarly to n-grams, except instead of looking at groupings of n-consecutive words in a text, we can look at sequences of words that are separated by some specified distance between them.

For example, consider the sentence ``"The squids jump out of the suitcase"``. The 1-skip-2-grams include all the bigrams (2-grams) as well as the following subsequences, each of which skips over one word:

``(The, jump), (squids, out), (jump, of), (out, the), (of, suitcase)``
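The pairs above can be generated with a few lines of Python. This is a minimal sketch: `skip_bigrams` is a hypothetical helper written for this example, not a function from any library used in this lesson.

```python
def skip_bigrams(tokens, k):
    """Return every pair of tokens that has at most k tokens skipped between them."""
    pairs = []
    for i in range(len(tokens)):
        for skipped in range(k + 1):
            j = i + 1 + skipped
            if j < len(tokens):
                pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "The squids jump out of the suitcase".split()

bigrams = skip_bigrams(tokens, 0)    # ordinary 2-grams (nothing skipped)
one_skip = skip_bigrams(tokens, 1)   # 1-skip-2-grams (bigrams plus one-word skips)

# the pairs that only appear once skipping is allowed
only_skipped = [pair for pair in one_skip if pair not in bigrams]
print(only_skipped)
# [('The', 'jump'), ('squids', 'out'), ('jump', 'of'), ('out', 'the'), ('of', 'suitcase')]
```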

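When using continuous skip-grams, the model works in the opposite direction from continuous bag-of-words: it uses the current word to predict each of the words around it, and context words nearer to the current word are weighted more heavily than distant ones. Because of this, training the word embeddings is slower than with continuous bag-of-words. The results, however, are often better, particularly for infrequent words!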

With either the continuous bag-of-words or continuous skip-grams representations as training data, word2vec then uses a shallow, 2-layer neural network to come up with the values that place words with a similar context in vectors near each other and words with different contexts in vectors far apart from each other.
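To give a feel for what a "shallow, 2-layer" network means here, the sketch below runs a single forward pass in plain Python. Everything in it is an assumption made for illustration: the six-word vocabulary, the 3-dimensional embeddings, the names `w_in`, `w_out`, and `predict_context`, and the random (untrained) weights. It also uses a full softmax for simplicity, and it leaves out training entirely, so treat it as a picture of the data flow rather than an implementation of word2vec.

```python
import math
import random

random.seed(0)

vocab = ["it", "was", "the", "best", "of", "times"]
dim = 3  # toy embedding size, chosen only to keep the example small

# layer 1: one row per word; after training, these rows are the word embeddings
w_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
# layer 2: maps an embedding to one score per vocabulary word
w_out = [[random.uniform(-0.5, 0.5) for _ in vocab] for _ in range(dim)]

def predict_context(word):
    """Return a probability for each vocabulary word appearing near `word`."""
    hidden = w_in[vocab.index(word)]  # selecting a row = multiplying by a one-hot vector
    scores = [
        sum(hidden[d] * w_out[d][j] for d in range(dim))
        for j in range(len(vocab))
    ]
    # softmax turns the raw scores into probabilities
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = predict_context("best")
print(dict(zip(vocab, probs)))
```

Training would repeatedly nudge `w_in` and `w_out` so that words which really do appear near "best" in the corpus receive higher probabilities. Words used in similar contexts end up with similar rows in `w_in`.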

Let’s take a closer look to see how continuous bag-of-words and continuous skip-grams work!

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

sentence = "It was the best of times, it was the worst of times."
print(sentence)

# preprocessing
sentence_lst = [word.lower().strip(".") for word in sentence.split()]

It was the best of times, it was the worst of times.


In [13]:
# set context_length
context_length = 2

# function to get cbows
def get_cbows(sentence_lst, context_length):
    cbows = list()
    for i, val in enumerate(sentence_lst):
        if i < context_length:
            pass
        elif i < len(sentence_lst) - context_length:
            context = sentence_lst[i-context_length:i] + sentence_lst[i+1:i+context_length+1]
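            # CountVectorizer learns the context vocabulary and returns it
            # deduplicated and sorted alphabetically, discarding word order
            vectorizer = CountVectorizer()
            vectorizer.fit_transform(context)
            context_no_order = vectorizer.get_feature_names_out().tolist()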
            cbows.append((val,context_no_order))
    return cbows

# define cbows here:
cbows = get_cbows(sentence_lst, context_length)

In [15]:
print('\nContinuous Bag of Words')
for cbow in cbows:
    print(cbow)


Continuous Bag of Words
('the', ['best', 'it', 'of', 'was'])
('best', ['of', 'the', 'times', 'was'])
('of', ['best', 'it', 'the', 'times'])
('times,', ['best', 'it', 'of', 'was'])
('it', ['of', 'the', 'times', 'was'])
('was', ['it', 'the', 'times', 'worst'])
('the', ['it', 'of', 'was', 'worst'])
('worst', ['of', 'the', 'times', 'was'])


In [14]:
# function to get skip-grams
def get_skip_grams(sentence_lst, context_length):
    skip_grams = list()
    for i, val in enumerate(sentence_lst):
        if i < context_length:
            pass
        elif i < len(sentence_lst) - context_length:
            context = sentence_lst[i-context_length:i] + sentence_lst[i+1:i+context_length+1]
            skip_grams.append((val, context))
    return skip_grams

# define skip_grams here:
skip_grams = get_skip_grams(sentence_lst, context_length)


In [16]:
print('\nSkip Grams')
for skip_gram in skip_grams:
    print(skip_gram)


Skip Grams
('the', ['it', 'was', 'best', 'of'])
('best', ['was', 'the', 'of', 'times,'])
('of', ['the', 'best', 'times,', 'it'])
('times,', ['best', 'of', 'it', 'was'])
('it', ['of', 'times,', 'was', 'the'])
('was', ['times,', 'it', 'the', 'worst'])
('the', ['it', 'was', 'worst', 'of'])
('worst', ['was', 'the', 'of', 'times'])


# Gensim

Depending on the corpus of text we select to train a word embedding model, different word embeddings will be created according to the context of the words in the given corpus. The larger and more generic a corpus, however, the more generalizable the word embeddings become.

When we want to train our own word2vec model on a corpus of text, we can use the gensim package!

In previous exercises, we have been using pre-trained word embedding models stored in spaCy. These models were trained, using word2vec, on blog posts and news articles collected by the Linguistic Data Consortium at the University of Pennsylvania. With gensim, however, we are able to build our own word embeddings on any corpus of text we like.

To easily train a word2vec model on our own corpus of text, we can use gensim’s ``Word2Vec()`` class.

``model = gensim.models.Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=2, sg=1)``

- ``corpus`` is a list of lists, where each inner list is a document in the corpus and each element in the inner lists is a word token
- ``vector_size`` determines how many dimensions our word embeddings will include. Word embeddings often have hundreds of dimensions! Here we will create vectors of 100 dimensions to keep things simple.
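- the remaining keyword arguments can be left as they are for now: ``window`` is the maximum distance between the current word and a context word, ``min_count`` drops words that appear fewer times than this value, ``workers`` is the number of training threads, and ``sg=1`` selects continuous skip-grams (``sg=0`` selects continuous bag-of-words)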
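As a small illustration of the "list of lists" shape that ``corpus`` needs, the sketch below tokenizes two made-up documents. The `documents` list and the simple lowercase-and-split tokenization are assumptions for this example; a real corpus would usually need more careful preprocessing.

```python
# two made-up documents, for illustration only
documents = [
    "It was the best of times",
    "it was the worst of times",
]

# one inner list per document, one string per word token
corpus = [document.lower().split() for document in documents]
print(corpus)
# [['it', 'was', 'the', 'best', 'of', 'times'], ['it', 'was', 'the', 'worst', 'of', 'times']]

# a corpus in this shape can then be passed to gensim, as in the call shown above:
# model = gensim.models.Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=2, sg=1)
```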

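To view the entire vocabulary used to train the word embedding model, we can use the ``.wv.key_to_index`` attribute, a dictionary that maps each word in the vocabulary to its index. (This is the gensim 4 replacement for the older ``.wv.vocab`` attribute, which is not available in the gensim versions that accept ``vector_size``.)

``vocabulary_of_model = list(model.wv.key_to_index.keys())``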

When we train a word2vec model on a smaller corpus of text, we pick up on the unique ways in which words of the text are used.

For example, if we were using scripts from the television show Friends as a training corpus, the model would pick up on the unique ways in which words are used in the show. While the generalized vectors in a spaCy model might not place the vectors for “Ross” and “Rachel” close together, a gensim word embedding model trained on Friends scripts would place the vectors for words like “Ross” and “Rachel”, two characters who have an on-again, off-again relationship throughout the show, very close together!

To easily find which vectors gensim placed close together in its word embedding model, we can use the ``.wv.most_similar()`` method.

``model.wv.most_similar("my_word_here", topn=100)``

- ``"my_word_here"`` is the target word token we want to find most similar words to
- ``topn`` is a keyword argument that indicates how many similar word vectors we want returned
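The idea behind ``.wv.most_similar()`` can be sketched in plain Python. This is an illustration under stated assumptions, not gensim's code: `most_similar_sketch` and `cosine_similarity` are hypothetical helpers, and the 2-dimensional vectors are made up. Two details mirror what gensim returns: the scores are cosine similarities (higher means closer, unlike cosine distance), and the query word itself is left out of the results.

```python
import math

def cosine_similarity(u, v):
    """Return cos(angle between u and v); 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(vectors, word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word  # the query word is excluded from its own results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# made-up 2-dimensional "embeddings", for illustration only
toy_vectors = {
    "romeo": [0.9, 0.1],
    "juliet": [0.8, 0.3],
    "nurse": [0.3, 0.8],
    "street": [-0.2, 0.9],
}

result = most_similar_sketch(toy_vectors, "romeo", topn=2)
print([word for word, _ in result])  # ['juliet', 'nurse']
```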

One last gensim method we will explore is a rather fun one: ``.wv.doesnt_match()``.

``model.wv.doesnt_match(["asia", "mars", "pluto"])``

When given a list of terms in the vocabulary as an argument, ``.wv.doesnt_match()`` returns the term that is furthest from the others.
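One way to compute "furthest from the others" is to compare each word's vector with the average of all the vectors in the list. The sketch below does that in plain Python; `doesnt_match_sketch` and `unit` are hypothetical helpers, the 2-dimensional vectors are made up, and this is not necessarily how gensim implements the method.

```python
import math

def unit(vec):
    """Scale a vector to length 1 so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def doesnt_match_sketch(vectors, words):
    """Return the word whose direction is least similar to the group average."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(units[words[0]])
    # average direction of the whole group
    mean = [sum(units[word][d] for word in words) / len(words) for d in range(dim)]
    # dot product with the average: lower means further from the group
    similarity_to_mean = {
        word: sum(a * b for a, b in zip(units[word], mean)) for word in words
    }
    return min(words, key=lambda word: similarity_to_mean[word])

# made-up 2-dimensional "embeddings", for illustration only
toy_vectors = {
    "asia": [0.1, 0.9],
    "mars": [0.9, 0.2],
    "pluto": [0.8, 0.3],
}

odd_one_out = doesnt_match_sketch(toy_vectors, ["asia", "mars", "pluto"])
print(odd_one_out)  # asia
```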

In [18]:
import gensim
from nltk.corpus import stopwords
from modules.romeo_and_juliet import romeo_and_juliet

In [19]:
# load stop words
stop_words = stopwords.words('english')

# preprocess text
romeo_and_juliet_processed = [[word for word in romeo_and_juliet.lower().split() if word not in stop_words]]

# view inner list of romeo_and_juliet_processed
print(romeo_and_juliet_processed[0][:20])

['tragedy', 'romeo', 'juliet', 'william', 'shakespeare', 'contents', 'prologue.', 'act', 'scene', 'i.', 'public', 'place.', 'scene', 'ii.', 'street.', 'scene', 'iii.', 'room', 'capulet’s', 'house.']


In [22]:
# train word embeddings model of 100 dimensions
model = gensim.models.Word2Vec(
    romeo_and_juliet_processed, 
    vector_size=100, window=5, min_count=1, 
    workers=2, sg=1)

In [25]:
# similar to romeo
similar_to_romeo = model.wv.most_similar('romeo', topn=20)
print(similar_to_romeo)

[('romeo.', 0.9969708323478699), ('good', 0.9966548681259155), ('juliet.', 0.9965584874153137), ('thou', 0.9963880777359009), ('nurse.', 0.9962818622589111), ('shall', 0.9961342811584473), ('like', 0.9960909485816956), ('thy', 0.9959562420845032), ('love', 0.9958600997924805), ('benvolio.', 0.9958415031433105), ('would', 0.9958189129829407), ('mercutio.', 0.9957923889160156), ('make', 0.9957923293113708), ('romeo,', 0.9956952929496765), ('may', 0.9956520199775696), ('come', 0.9954913258552551), ('o,', 0.9954451322555542), ('’tis', 0.9952890872955322), ('tell', 0.9952310919761658), ('yet', 0.9951685667037964)]


In [27]:
# one is not like the others
not_star_crossed_lover = model.wv.doesnt_match(["romeo", "juliet", "mercutio"])
print(not_star_crossed_lover)

mercutio


# Review

Lost in a multidimensional vector space after this lesson? We hope not! We have covered a lot here, so let’s take some time to recap.

- Vectors are containers of information, and they can have anywhere from 1 dimension to hundreds or thousands of dimensions
- Word embeddings are vector representations of a word, where words with similar contexts are represented with vectors that are closer together
- spaCy is a package that enables us to view and use pre-trained word embedding models
- The distance between vectors can be calculated in many ways, and cosine distance is a common choice for high-dimensional vectors such as word embeddings
- Word2Vec is a shallow neural network model that can build word embeddings using either continuous bag-of-words or continuous skip-grams
- Gensim is a package that allows us to create and train word embedding models using any corpus of text