<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/Word2Vec_Revisitied.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The Illustrated Word2vec

> Notes on the excellent article by [Jay Alammar](http://jalammar.github.io/illustrated-word2vec/)

---

![](http://jalammar.github.io/images/word2vec/word2vec.png)

### Basic Intuition

On a scale of 0 to 100, how introverted/extraverted are you (where 0 is the most introverted, and 100 is the most extraverted)? Have you ever taken a personality test like MBTI – or even better, the Big Five Personality Traits test? If you haven’t, these are tests that ask you a list of questions, then score you on a number of axes, introversion/extraversion being one of them.

Imagine I’ve scored 38/100 as my introversion/extraversion score. We can plot that in this way:

![](http://jalammar.github.io/images/word2vec/introversion-extraversion-100.png)

Let’s switch the range to be from -1 to 1:

How well do you feel you know a person knowing only this one piece of information about them? Not much. People are complex. So let’s add another dimension – the score of one other trait from the test.

![](http://jalammar.github.io/images/word2vec/two-traits-vector.png)

We can represent the two dimensions as a point on the graph, or better yet, as a vector from the origin to that point. We have incredible tools to deal with vectors that will come in handy very shortly.


I’ve hidden which traits we’re plotting just so you get used to not knowing what each dimension represents – but still getting a lot of value from the vector representation of a person’s personality.

We can now say that this vector partially represents my personality. The usefulness of such representation comes when you want to compare two other people to me. Say I get hit by a bus and I need to be replaced by someone with a similar personality. In the following figure, which of the two people is more similar to me?

![](http://jalammar.github.io/images/word2vec/personality-two-persons.png)

When dealing with vectors, a common way to calculate a similarity score is cosine_similarity:

![](http://jalammar.github.io/images/word2vec/cosine-similarity.png)

Person #1 is more similar to me in personality. Vectors pointing in the same direction have a higher cosine similarity score; the score depends only on the angle between the vectors, not on their lengths.

Yet again, two dimensions aren’t enough to capture enough information about how different people are. Decades of psychology research have led to five major traits (and plenty of sub-traits). So let’s use all five dimensions in our comparison:

The problem with five dimensions is that we lose the ability to draw neat little arrows in two dimensions. This is a common challenge in machine learning where we often have to think in higher-dimensional space. The good thing is, though, that cosine_similarity still works. It works with any number of dimensions:

![](http://jalammar.github.io/images/word2vec/embeddings-cosine-personality.png)
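
As a minimal sketch of the idea, cosine similarity can be computed with NumPy as the dot product divided by the product of the vector lengths. The trait scores below are assumed values chosen for illustration, and `cosine_similarity` is a helper written here rather than a library call:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two-trait personality vectors (illustrative values, not real test scores).
me = [-0.4, 0.8]
person_1 = [-0.3, 0.2]
person_2 = [-0.5, -0.4]

sim_1 = cosine_similarity(me, person_1)
sim_2 = cosine_similarity(me, person_2)
print(f"person 1: {sim_1:.2f}, person 2: {sim_2:.2f}")

# The same function works unchanged with five (or any number of) dimensions.
me_5d = [-0.4, 0.8, 0.5, -0.2, 0.3]
other_5d = [-0.3, 0.2, 0.3, -0.4, 0.9]
sim_5d = cosine_similarity(me_5d, other_5d)
print(f"five traits: {sim_5d:.2f}")
```

With these assumed scores, person 1 gets the higher similarity, which matches the reading of the two-dimensional plot.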

At the end of this section, I want us to come out with two central ideas:

1. We can represent people (and things) as vectors of numbers (which is great for machines!).
2. We can easily calculate how similar vectors are to each other.

![](http://jalammar.github.io/images/word2vec/section-1-takeaway-vectors-cosine.png)



### Word Embeddings

With this understanding, we can proceed to look at trained word-vector examples (also called word embeddings) and start looking at some of their interesting properties.

This is a word embedding for the word “king” (GloVe vector trained on Wikipedia):

```
[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ]
```

It’s a list of 50 numbers. We can’t tell much by looking at the values. But let’s visualize it a bit so we can compare it to other word vectors. Let’s put all these numbers in one row:

Let’s color code the cells based on their values (red if they’re close to 2, white if they’re close to 0, blue if they’re close to -2):

![](http://jalammar.github.io/images/word2vec/king-colored-embedding.png)

We’ll proceed by ignoring the numbers and only looking at the colors to indicate the values of the cells. Let’s now contrast “King” against other words:

![](http://jalammar.github.io/images/word2vec/king-man-woman-embedding.png)

See how “Man” and “Woman” are much more similar to each other than either of them is to “king”? This tells you something. These vector representations capture quite a bit of the information/meaning/associations of these words.

Here’s another list of examples (compare by vertically scanning the columns looking for columns with similar colors):

![](http://jalammar.github.io/images/word2vec/queen-woman-girl-embeddings.png)

A few things to point out:

1. There’s a straight red column through all of these different words. They’re similar along that dimension (and we don’t know what each dimension codes for).

2. You can see how “woman” and “girl” are similar to each other in a lot of places. The same is true of “man” and “boy”.

3. “boy” and “girl” also have places where they are similar to each other, but different from “woman” or “man”. Could these be coding for a vague conception of youth? Possibly.

4. All but the last word are words representing people. I added an object (water) to show the differences between categories. You can, for example, see that blue column going all the way down and stopping before the embedding for “water”.

5. There are clear places where “king” and “queen” are similar to each other and distinct from all the others. Could these be coding for a vague concept of royalty?




#### Analogies

A famous demonstration of a remarkable property of embeddings is the concept of analogies. We can add and subtract word embeddings and arrive at interesting results. The best-known example is the formula “king” - “man” + “woman”:

We can visualize this analogy as we did previously:

![](http://jalammar.github.io/images/word2vec/king-analogy-viz.png)

The resulting vector from "king-man+woman" doesn't exactly equal "queen", but "queen" is the closest word to it from the 400,000 word embeddings we have in this collection.
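
The arithmetic and the nearest-word search can be sketched with a handful of toy vectors. Everything below is an assumption made for illustration: the 3-dimensional embeddings are invented (real ones are learned and have many more dimensions), and `nearest_word` is a hypothetical helper. Excluding the query words from the search is a common convention, also assumed here.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "king":  np.array([0.90,  0.80, 0.10]),
    "queen": np.array([0.85, -0.75, 0.15]),
    "man":   np.array([0.10,  0.80, 0.00]),
    "woman": np.array([0.10, -0.80, 0.00]),
    "water": np.array([0.00,  0.00, 0.90]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_word(vector, embeddings, exclude=()):
    """Return the word whose embedding is most similar to `vector`."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

result = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# The result is close to, but not exactly equal to, the "queen" vector.
is_exact = bool(np.allclose(result, embeddings["queen"]))
closest = nearest_word(result, embeddings, exclude={"king", "man", "woman"})
print(is_exact, closest)
```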

Now that we’ve looked at trained word embeddings, let’s learn more about the training process. But before we get to word2vec, we need to look at a conceptual parent of word embeddings: the neural language model.

### Language Modeling


If one wanted to give an example of an NLP application, one of the best examples would be the next-word prediction feature of a smartphone keyboard. It’s a feature that billions of people use hundreds of times every day.

Next-word prediction is a task that can be addressed by a language model. A language model can take a list of words (let’s say two words), and attempt to predict the word that follows them.

In the screenshot below, we can think of the model as one that took in these two green words (thou shalt) and returned a list of suggestions (“not” being the one with the highest probability):

![](http://jalammar.github.io/images/word2vec/thou-shalt-_.png)

We can think of the model as looking like this black box:

![](http://jalammar.github.io/images/word2vec/language_model_blackbox.png)

But in practice, the model doesn’t output only one word. It actually outputs a probability score for all the words it knows (the model’s “vocabulary”, which can range from a few thousand to over a million words). The keyboard application then has to find the words with the highest scores, and present those to the user.

![](http://jalammar.github.io/images/word2vec/language_model_blackbox_output_vector.png)
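
A rough sketch of what the keyboard does with that output: turn the model's raw scores into probabilities, then keep the highest-scoring words. The tiny vocabulary and the raw scores are made up for illustration, and `softmax` and `top_k_words` are hypothetical helpers, not part of any particular keyboard or library.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def top_k_words(probabilities, vocabulary, k=3):
    """Return the k (word, probability) pairs with the highest probability."""
    best = np.argsort(probabilities)[::-1][:k]
    return [(vocabulary[i], float(probabilities[i])) for i in best]

# Hypothetical vocabulary and raw model scores for the input "thou shalt".
vocabulary = ["aardvark", "not", "make", "be", "zyzzyva"]
raw_scores = np.array([-2.0, 3.0, 0.5, 1.5, -3.0])

probabilities = softmax(raw_scores)
suggestions = top_k_words(probabilities, vocabulary, k=3)
for word, probability in suggestions:
    print(f"{word}: {probability:.3f}")
```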

After being trained, early neural language models (Bengio 2003) would calculate a prediction in three steps:

![](http://jalammar.github.io/images/word2vec/neural-language-model-prediction.png)

The first step is the most relevant for us as we discuss embeddings. One of the results of the training process was this matrix that contains an embedding for each word in our vocabulary. During prediction time, we just look up the embeddings of the input words, and use them to calculate the prediction:

![](http://jalammar.github.io/images/word2vec/neural-language-model-embedding.png)
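
The lookup step amounts to selecting rows of the embedding matrix. Below is a simplified sketch of the three prediction steps, under several assumptions: the weights are random stand-ins for trained parameters, the layer sizes are arbitrary, and biases are omitted, so this is not a faithful reproduction of the Bengio 2003 model.

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["thou", "shalt", "not", "make", "a", "machine"]
word_to_id = {word: i for i, word in enumerate(vocabulary)}

vocab_size = len(vocabulary)
embedding_dim, hidden_dim, context_size = 4, 8, 2  # arbitrary sizes

# Randomly initialised parameters stand in for trained ones.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
W_hidden = rng.normal(size=(context_size * embedding_dim, hidden_dim))
W_output = rng.normal(size=(hidden_dim, vocab_size))

def predict(context_words):
    # Step 1: look up the embedding (a matrix row) of each input word
    # and concatenate them into one vector.
    ids = [word_to_id[w] for w in context_words]
    x = embedding_matrix[ids].reshape(-1)
    # Step 2: compute a hidden representation.
    hidden = np.tanh(x @ W_hidden)
    # Step 3: project to one score per vocabulary word and apply softmax.
    scores = hidden @ W_output
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probabilities = predict(["thou", "shalt"])
print(probabilities.shape)  # one probability per vocabulary word
```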

### Language Model Training

Language models have a huge advantage over most other machine learning models. That advantage is that we are able to train them on running text – which we have an abundance of. Think of all the books, articles, Wikipedia content, and other forms of text data we have lying around. Contrast this with a lot of other machine learning models which need hand-crafted features and specially-collected data.

> “You shall know a word by the company it keeps” J.R. Firth

Words get their embeddings from the other words they tend to appear next to. The mechanics are as follows:

1. We get a lot of text data (say, all Wikipedia articles).
2. We slide a window (say, of three words) across all of that text.
3. The sliding window generates training samples for our model.

As this window slides against the text, we (virtually) generate a dataset that we use to train a model. To look exactly at how that’s done, let’s see how the sliding window processes this phrase:

> Thou shalt not make a machine in the likeness of a human mind

When we start, the window is on the first three words of the sentence:

We take the first two words to be features, and the third word to be a label:

We then slide our window to the next position and create a second sample:

![](http://jalammar.github.io/images/word2vec/lm-sliding-window-3.png)

And pretty soon we have a larger dataset of which words tend to appear after different pairs of words:

![](http://jalammar.github.io/images/word2vec/lm-sliding-window-4.png)
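
The sliding window described above can be sketched in a few lines. The function name `language_model_samples` is made up for this example, and lowercasing plus whitespace splitting is an assumed, deliberately naive tokenization.

```python
def language_model_samples(tokens, window_size=3):
    """Slide a window over tokens; the last word in each window is the label."""
    samples = []
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        features, label = window[:-1], window[-1]
        samples.append((features, label))
    return samples

text = "Thou shalt not make a machine in the likeness of a human mind"
tokens = text.lower().split()

samples = language_model_samples(tokens)
for features, label in samples[:3]:
    print(features, "->", label)
```

The first sample pairs the features `thou`, `shalt` with the label `not`, and each later sample shifts one word to the right.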

#### Look both ways

Knowing what you know from earlier in the post, fill in the blank:

![](http://jalammar.github.io/images/word2vec/jay_was_hit_by_a_.png)

The context I gave you here is five words before the blank word (and an earlier mention of “bus”). I’m sure most people would guess the word bus goes into the blank. But what if I gave you one more piece of information – a word after the blank, would that change your answer?

![](http://jalammar.github.io/images/word2vec/jay_was_hit_by_a_bus.png)

This completely changes what should go in the blank. The word `red` is now the most likely to go into the blank. What we learn from this is that the words both before and after a specific word carry informational value. It turns out that accounting for both directions (words to the left and to the right of the word we’re guessing) leads to better word embeddings. Let’s see how we can adjust the way we’re training the model to account for this.



### Skipgram

Instead of only looking two words before the target word, we can also look at two words after it.

![](http://jalammar.github.io/images/word2vec/continuous-bag-of-words-example.png)

If we do this, the dataset we’re virtually building and training the model against would look like this:

![](http://jalammar.github.io/images/word2vec/continuous-bag-of-words-dataset.png)
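
A sketch of how such a dataset could be generated: for each word, take the two words before it and the two words after it as features, with the word itself as the label. `cbow_samples` is a hypothetical helper, and skipping positions that lack a full context on both sides is a simplifying assumption.

```python
def cbow_samples(tokens, context_size=2):
    """Pair each word (the label) with the words around it (the features)."""
    samples = []
    # Only positions with a full context on both sides are used here.
    for i in range(context_size, len(tokens) - context_size):
        before = tokens[i - context_size:i]
        after = tokens[i + 1:i + 1 + context_size]
        samples.append((before + after, tokens[i]))
    return samples

text = "Thou shalt not make a machine in the likeness of a human mind"
tokens = text.lower().split()

samples = cbow_samples(tokens)
for features, label in samples[:3]:
    print(features, "->", label)
```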

This is called a **Continuous Bag of Words** architecture and is described in one of the word2vec papers. Another architecture that also tended to show great results does things a little differently.

Instead of guessing a word based on its context (the words before and after it), this other architecture tries to guess neighboring words using the current word. We can think of the window it slides against the training text as looking like this:

![](http://jalammar.github.io/images/word2vec/skipgram-sliding-window.png)

The word in the green slot would be the input word, each pink box would be a possible output.

The pink boxes are in different shades because this sliding window actually creates four separate samples in our training dataset:

![](http://jalammar.github.io/images/word2vec/skipgram-sliding-window-samples.png)

This method is called the **skipgram** architecture. We can visualize the sliding window as doing the following: First it covers the words `[thou, shalt, not, make, a]` and the current word is `not`. This would add these four samples to our training dataset:

![](http://jalammar.github.io/images/word2vec/skipgram-sliding-window-2.png)

We then slide our window to the next position. Now the current word is `make`:

![](http://jalammar.github.io/images/word2vec/skipgram-sliding-window-4.png)

A couple of positions later, we have a lot more examples:

![](http://jalammar.github.io/images/word2vec/skipgram-sliding-window-5.png)
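
The pair generation can be sketched as follows. `skipgram_pairs` is a hypothetical helper; unlike the walkthrough above, which starts with `not` as the first current word, this sketch also emits pairs for words near the edges of the text using a truncated window, which is an assumption made here.

```python
def skipgram_pairs(tokens, context_size=2):
    """Emit one (input word, output word) pair per neighbor in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the start and end of the text.
        lo = max(0, i - context_size)
        hi = min(len(tokens), i + context_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

text = "Thou shalt not make a machine in the likeness of a human mind"
tokens = text.lower().split()

pairs = skipgram_pairs(tokens)
not_pairs = [pair for pair in pairs if pair[0] == "not"]
print(not_pairs)
```

Each position with a full window contributes four pairs, so the dataset grows to roughly four times the number of words in the text.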




