# An Introduction to Word Embeddings

One of the breakthroughs in Natural Language Processing is the use of word embeddings. Rather than using the words themselves as features, neural network methods typically take as input dense, relatively low-dimensional vectors that model the meaning and usage of a word. Word embeddings were first popularized through the [Word2Vec](https://arxiv.org/abs/1301.3781) model, developed by Tomas Mikolov and colleagues at Google. Since then, scores of alternative approaches have been developed, such as [GloVe](https://nlp.stanford.edu/projects/glove/) and [FastText](https://fasttext.cc/) embeddings. In this notebook, we'll explore word embeddings with the original Word2Vec approach, as implemented in the [Gensim](https://radimrehurek.com/gensim/) library.

Modified from: https://github.com/nlptown/nlp-notebooks/blob/master/An%20Introduction%20to%20Word%20Embeddings.ipynb

## Training word embeddings

To train word embeddings, we need to compile a corpus of text that captures the meaning and usage of words in a particular dialect. Wikipedia is a good choice for training generic embeddings that reflect the standard usage of a language. For our experiments, however, we're going to use scraped comments from [r/aww](https://www.reddit.com/r/aww/), a subreddit of "Things that make you go AWW! Like puppies, bunnies, babies and so on".

We've collected and lowercased 6 months of comments from this subreddit using the scripts in the notebook `0.Data Scraping and Prep`.

Each entry in this file represents one comment in one post from that subreddit.

In [33]:
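class SentenceCorpus(object):
    """Restartable iterable that yields one whitespace-tokenized comment per line."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # UTF-8 is assumed here so that emoji tokens are decoded consistently.
        with open(self.filename, "r", encoding="utf-8") as f:
            for line in f:
                tokens = line.strip().split()
                yield tokens


TEXT_FILE = '../data/reddit_aww_comments.csv'
sentences = SentenceCorpus(TEXT_FILE)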

When training word embeddings, Gensim allows the user to set a number of parameters. The most important are:

- `min_count` is the **minimum frequency of the words** in our corpus. For infrequent words, we just don't have enough information to train reliable word embeddings. It therefore makes sense to set this minimum frequency to at least 10.


- `window` is the **number of words to the left and to the right that make up the context for the word** that word2vec will take into account.


- `size` is the **dimensionality of the word vectors** (Gensim 4.0 and later renamed this parameter to `vector_size`). This is generally between 100 and 1000. You often have to make a trade-off: embeddings with a higher dimensionality are able to model more information, but also need more data to train.


- `sg`: there are two **algorithms to train word2vec**: skip-gram and CBOW. Skip-gram tries to predict the context on the basis of the target word; CBOW tries to find the target on the basis of the context. By default, Gensim uses CBOW (`sg=0`).
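To make `min_count` and `window` more concrete, here is a minimal pure-Python sketch of the two ideas: filtering the vocabulary by frequency and collecting the (target, context) pairs that a skip-gram model would train on. The helper names `build_vocab` and `context_pairs` and the toy sentences are our own illustrations, not part of Gensim, and the sketch leaves out refinements that real implementations add, such as subsampling of frequent words.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


def context_pairs(sentence, window):
    """Yield (target, context) pairs for every token in one sentence."""
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, sentence[j]


# Hypothetical toy corpus, used only for illustration.
toy_sentences = [
    "the dog chased the cat".split(),
    "the cat ignored the dog".split(),
]
toy_vocab = build_vocab(toy_sentences, min_count=2)
pairs = list(context_pairs(toy_sentences[0], window=1))

print(sorted(toy_vocab))  # tokens seen at least twice
print(pairs)              # skip-gram style (target, context) pairs
```

With `window=1`, each token is paired only with its immediate neighbours; a larger window adds more distant words to the context. Skip-gram predicts the second element of each pair from the first, while CBOW predicts the target from all of its context words at once.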

In [34]:
import gensim
model = gensim.models.Word2Vec(sentences, min_count=10, window=5, size=100)

## Using word embeddings

Let's take a look at the model. The word embeddings are on the `wv` attribute of the model, and we can access them by using the token as a key. The vocabulary includes any token that occurs at least `min_count` times in the corpus, including frequent misspellings and emojis. For example, here we look up the embedding for the emoji ❤️; a token that is not in the vocabulary raises a `KeyError`:

In [35]:
model.wv["❤️"]

KeyError: "word '❤️' not in vocabulary"
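Because a lookup for an out-of-vocabulary token raises a `KeyError`, it can help to check membership before querying. Gensim's keyed vectors support the `in` operator, so one possible guard looks like the sketch below. The helper `split_by_vocabulary` is a hypothetical name of our own, and a plain dictionary with made-up vectors stands in for `model.wv` so that the example runs without a trained model.

```python
def split_by_vocabulary(words, keyed_vectors):
    """Partition `words` into those with an embedding and those without."""
    known = [w for w in words if w in keyed_vectors]
    missing = [w for w in words if w not in keyed_vectors]
    return known, missing


# Stand-in for `model.wv` (assumption: only the `in` operator is needed here).
toy_vectors = {"dog": [0.1, 0.3], "cat": [0.2, 0.1]}

known, missing = split_by_vocabulary(["dog", "bird", "cat"], toy_vectors)
print(known)    # words that can be looked up safely
print(missing)  # words that would raise a KeyError
```

With a trained model, the same call would be `split_by_vocabulary(words, model.wv)`.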

We can easily find the **similarity between two words** using word embeddings. Similarity is measured as the cosine between the two vectors and ranges between -1 and +1. The higher the cosine, the more similar two words are. We would expect *dog* to be closer to *cat* than to *bird*, since variations of the phrase *dogs and cats* are common. In the run below, however, *bird* is not in the model's vocabulary, so the second lookup fails.

In [36]:
print(model.wv.similarity("dog", "cat"))
print(model.wv.similarity("dog", "bird"))

0.9996786


KeyError: "word 'bird' not in vocabulary"
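The cosine measure used for similarity can be written out in a few lines. This is a minimal pure-Python sketch with made-up two-dimensional vectors, shown only to illustrate the formula: the dot product of the two vectors divided by the product of their lengths.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on the angle between the vectors, scaling either vector leaves the similarity unchanged.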

In a similar vein, we can find the **words that are most similar to a target word**. Ideally, the words with the closest embeddings to *doggo* (lolspeak for *dog*) are words with a similar meaning and use (such as *pupper* and *pup*).

In [37]:
model.wv.similar_by_word("doggo", topn=10)

[('got', 0.9992769360542297),
 ('an', 0.999241828918457),
 ('of', 0.99922776222229),
 ('and', 0.9992210268974304),
 ('so', 0.9992192983627319),
 ('s', 0.9992176294326782),
 ('around', 0.9992173910140991),
 ('have', 0.9992173910140991),
 ('she', 0.9992153644561768),
 ('get', 0.9991946220397949)]

Interestingly, we can look for **words that are similar to a set of words and dissimilar to another set of words** at the same time. This allows us to look for analogies of the type *kitten is to cat what ... is to dog*.

In [38]:
model.wv.most_similar(positive=['kitten', 'dog'], negative=["cat"], topn=10)

KeyError: "word 'kitten' not in vocabulary"
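The arithmetic behind such an analogy query can be sketched in pure Python. We assume here that the query vector is built by adding the unit-normalized positive vectors and subtracting the negative ones, and that candidates are ranked by cosine similarity with the input words excluded. The function `analogy` and the three-dimensional toy vectors are our own illustrations, not Gensim's code or trained embeddings.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the inputs
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors; the axes loosely mean (feline, canine, young).
toy = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [1.0, 0.0, 1.0],
    "dog": [0.0, 1.0, 0.0],
    "puppy": [0.0, 1.0, 1.0],
    "bird": [0.2, 0.2, 0.1],
    "truck": [-1.0, -1.0, 0.0],
}

# "kitten is to cat what ... is to dog": kitten - cat + dog
result = analogy(toy, positive=["kitten", "dog"], negative=["cat"], topn=2)
print(result)
```

In this toy space, subtracting *cat* from *kitten* leaves the "young" direction, and adding *dog* moves the query toward *puppy*.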

We can use embeddings to **explore the meaning of words**. In a well-trained model, we would expect the word *puppy* to be most similar to words such as *kitten*, another young animal.

In [39]:
model.wv.most_similar(positive=["puppy"], topn=10)

[('or', 0.9991438388824463),
 ('in', 0.9991414546966553),
 ('like', 0.9990968108177185),
 ('who', 0.999070405960083),
 ('you', 0.9990617036819458),
 ('the', 0.9990543723106384),
 ('make', 0.9990527629852295),
 ('many', 0.9990491271018982),
 ('was', 0.9990404844284058),
 ('even', 0.9990345239639282)]

We can get other details behind the **meaning of words** by looking at words that are similar to *puppy* but don't refer to a *dog*:

In [40]:
model.wv.most_similar(positive=["puppy"], negative=["dog"], topn=10)

[('guy', -0.009642738848924637),
 ('made', -0.011732928454875946),
 ('beautiful', -0.01190204732120037),
 ('u', -0.011969465762376785),
 ('glad', -0.012150418013334274),
 ('great', -0.012265190482139587),
 ('pup', -0.01238880306482315),
 ('adorable', -0.012794144451618195),
 ('pupper', -0.012867528945207596),
 ('right', -0.013195604085922241)]

Finally, we can use word embeddings to identify **the word in a list that is least similar to the others**:

In [41]:
print(model.wv.doesnt_match("dog cat truck mouse".split()))
print(model.wv.doesnt_match("sky bird sun stars".split()))

cat


  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)


ValueError: cannot select a word from an empty list
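One way to find the odd word out is to average the unit-normalized vectors of all the words and pick the word whose vector is least similar to that average. The sketch below assumes that words without an embedding are skipped, which would explain why a list made up entirely of unknown words ends in an error like the one above. The function `odd_one_out` and the toy vectors are our own illustrations.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word whose vector is least similar to the mean vector."""
    used = [w for w in words if w in vectors]  # skip unknown words
    if not used:
        raise ValueError("none of the words are in the vocabulary")
    normed = {w: unit(vectors[w]) for w in used}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][d] for w in used) / len(used) for d in range(dim)])
    return min(used, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Made-up two-dimensional vectors, for illustration only.
toy = {
    "dog": [0.9, 0.1],
    "cat": [0.8, 0.2],
    "mouse": [0.7, 0.3],
    "truck": [0.1, 0.9],
}

print(odd_one_out(toy, "dog cat truck mouse".split()))
```

Checking the list against the vocabulary before calling `doesnt_match` avoids the empty-list failure.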

## Plotting embeddings

Word embeddings are high-dimensional vectors. In this example, our word embeddings have 100 dimensions. There is no hierarchy in the explanatory power of each dimension so we can't simply plot the first two or three dimensions and expect to see significant patterns.

To plot embeddings with high dimensionality, we first need to map them to a dimensionality of 2. We do this with the popular [t-SNE](https://lvdmaaten.github.io/tsne/) method. T-SNE, short for **t-distributed Stochastic Neighbor Embedding**, helps us visualize high-dimensional data by mapping similar data to nearby points and dissimilar data to distant points in two-dimensional space.

T-SNE is available in [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). To run it, we just have to specify the number of dimensions we'd like to map the data to (`n_components`), and the similarity metric that t-SNE should use to compute the similarity between two data points (`metric`). We're going to map to 2 dimensions and use the cosine as our similarity metric. Additionally, we use PCA as an initialization method to remove some noise and speed up computation. The [Scikit-learn user guide](https://scikit-learn.org/stable/modules/manifold.html#t-sne) contains some additional tips for optimizing performance. 

Plotting all the embeddings in our vector space (all of the words in the vocabulary) would result in a very crowded figure. Therefore we'll focus on a subset of embeddings by selecting the 200 most similar words to a target word. 

In [42]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.manifold import TSNE

target_word = "corgi"
selected_words = [w[0] for w in model.wv.most_similar(positive=[target_word], topn=200)]
embeddings = [model.wv[w] for w in selected_words]

mapped_embeddings = TSNE(n_components=2, metric='cosine', init='pca').fit_transform(embeddings)

KeyError: "word 'corgi' not in vocabulary"

In [43]:
plt.figure(figsize=(30,30))
x = mapped_embeddings[:,0]
y = mapped_embeddings[:,1]
plt.scatter(x, y)

for i, txt in enumerate(selected_words):
    plt.annotate(txt, (x[i], y[i]), size=20)

NameError: name 'mapped_embeddings' is not defined

<Figure size 2160x2160 with 0 Axes>

## Clustering embeddings

Finally, we're going to cluster our embeddings. We'll use agglomerative clustering, a bottom-up clustering method that iteratively merges the two most similar clusters (or embeddings) in the data.
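The bottom-up merge loop can be sketched in pure Python. This is a naive illustration that uses average linkage with Euclidean distance, which is an assumption made for simplicity; Scikit-learn's `AgglomerativeClustering` defaults to Ward linkage and is far more efficient. The function names, words and coordinates are all made up for the example.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a].extend(clusters[b])  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


# Hypothetical words with made-up two-dimensional "embeddings".
toy_words = ["pup", "doggo", "kitteh", "kitty", "truck"]
toy_points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]

toy_clusters = agglomerate(toy_points, n_clusters=3)
print([[toy_words[i] for i in cluster] for cluster in toy_clusters])
```

Each pass merges exactly one pair of clusters, so starting from five points and stopping at three clusters takes two merges.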

In [191]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

vocab = list(model.wv.vocab)
vectors = [model.wv[w] for w in vocab]
vectors_norm = normalize(vectors)

clusterer = AgglomerativeClustering(n_clusters=500)
clusters = clusterer.fit_predict(vectors_norm)


In [192]:
cluster_dictionary = {}
for cluster, word in zip(clusters, vocab): 
    if cluster not in cluster_dictionary:
        cluster_dictionary[cluster] = []
    cluster_dictionary[cluster].append(word)

In [196]:
for x in cluster_dictionary:
    if "kitteh" in cluster_dictionary[x]:
        print(cluster_dictionary[x])

['chonk', 'bliss', 'heckin', 'beast', 'furry', '💗', 'floof', 'bean', 'nugget', 'floofer', 'pooch', '❤️❤️', 'fella', 'attitude', 'charlie', 'sis', 'archer', 'babe', 'noodle', 'lad', 'boye', 'talent', 'bun', 'das', 'goof', 'mustache', 'kitteh', 'shark', 'venus', 'chloe']


## Conclusions

Word embeddings are one of the most exciting trends in Natural Language Processing since the 2000s. They allow us to model the meaning and usage of a word, and discover words that behave similarly. This is crucial for the generalization capacity of many machine learning models. Moving from raw strings to embeddings allows them to generalize across words that have a similar meaning, and discover patterns that had previously escaped them.