### Explain Word Embeddings

The authors of "Attention Is All You Need" state early in the paper:\
"Similarly to other sequence transduction models, we use learned embeddings to\
convert the input tokens and output tokens to vectors of dimension d_model."

What are word embeddings? Word embeddings are a way to represent text\
as dense real-valued vectors. Embedding, in general, is the process\
of mapping a group of data into a Euclidean space that represents it mathematically.

#### Why not use indexes?
If we have to represent a set of words with numbers, why not simply\
use indexes, like this?


["Being", "Strong", "Is", "All", "What", "Matters"] -> [1,2,3,4,5,6].


Well, we would lose meaning this way. I want to treat words like numbers\
without losing their meaning. When I compute ```King - Man```, I want to get\
```Queen```. Is this possible? Yes, it is.
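
To make the problem with indexes concrete, here is a minimal sketch using the word list above. The names `word_to_index` and `index_to_word` are just illustrative helpers, not part of any library.

```python
# Toy vocabulary mapped to arbitrary integer indexes, as in the list above.
words = ["Being", "Strong", "Is", "All", "What", "Matters"]
word_to_index = {word: i for i, word in enumerate(words, start=1)}
index_to_word = {i: word for word, i in word_to_index.items()}

# Arithmetic on indexes is well defined but carries no meaning:
# the result depends only on the order in which we numbered the words.
difference = word_to_index["Matters"] - word_to_index["What"]
result = index_to_word[difference]
print(difference, result)
```

Here "Matters" minus "What" lands on "Being" only because of how the list happens to be ordered. Shuffle the list and the answer changes, which is why plain indexes cannot support the kind of word arithmetic we want.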

#### Meaning as feature
We can achieve the aforementioned behaviour by mapping each word to a set of features.\
Consider the following table, in which each word is represented\
by a set of "meaning" features.

For example, a "granny" is an old, weak female.


| Word    | Old | Male | Weak |
|---------|-----|------|------|
| Grandpa | 1   | 1    | 1    |
| Granny  | 1   | 0    | 1    |
| Man     | 0   | 1    | 0    |
| Woman   | 0   | 0    | 0    |
| Mother  | 0   | 0    | 0    |
| Son     | 0   | 1    | 1    |


Consider this:

$$
\text{Grandpa} - \text{Man} =
\begin{bmatrix}
    1 \\
    1 \\
    1 \\
\end{bmatrix}
-
\begin{bmatrix}
    0 \\
    1 \\
    0 \\
\end{bmatrix}
=
\begin{bmatrix}
    1 \\
    0 \\
    1 \\
\end{bmatrix}
= \text{Granny}
$$

You see: what we are doing is subtracting the "maleness" feature from "grandpa",\
which leads to "granny". Those feature vectors are embeddings. They represent words\
and the relations between them. The question is how to obtain those embeddings.
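
The same subtraction can be checked in a few lines of Python. This is only a sketch over the hand-written table above; the helpers `subtract` and `words_matching` are made up for this example.

```python
# Feature order: (old, male, weak), copied from the table above.
features = {
    "Grandpa": (1, 1, 1),
    "Granny": (1, 0, 1),
    "Man": (0, 1, 0),
    "Woman": (0, 0, 0),
    "Mother": (0, 0, 0),
    "Son": (0, 1, 1),
}


def subtract(a, b):
    """Element-wise difference of two feature vectors."""
    return tuple(x - y for x, y in zip(a, b))


def words_matching(vector):
    """Return every word whose feature vector equals `vector`."""
    return [word for word, feats in features.items() if feats == vector]


result = subtract(features["Grandpa"], features["Man"])
print(result, words_matching(result))
```

Note that `words_matching((0, 0, 0))` returns both "Woman" and "Mother": three hand-picked features are too few to tell every word apart, which is one reason to learn the features instead of writing them by hand.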

#### Training an Embedding

To train an embedding, we train a neural network to solve a task that requires\
a semantic understanding of words. Eventually it will hold a representation, or\
an embedding, that carries semantic meaning.

For more clarity consider a dataset with the following entries:

* sam, altman -> is
* altman, is -> the
* is, the -> CEO
* the, CEO -> of
* CEO, of -> the
* of, the -> most
* ...

It is composed of trigrams of a given text: each entry pairs two consecutive words with the word that follows them.\
Suppose we train a simple feedforward neural network to predict the next word from the previous two words.\
The weights of the first (input) layer of the network will eventually act as the transformation that gives us our embeddings:\
$embedding(x) = Wx$, where $W$ is the weight matrix of that layer and $x$ is the input encoding of a word (for example, a one-hot vector).
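
As a rough sketch of the two ideas above, the snippet below builds the (previous two words -> next word) entries from the example words and then evaluates $embedding(x) = Wx$. It rests on assumptions that are not in the text: `x` is taken to be a one-hot vector, `embedding_dim = 3` is an arbitrary choice, and `W` is filled with random numbers because nothing is trained here.

```python
import random

# The words that appear in the dataset entries listed above.
text = "sam altman is the CEO of the most"
tokens = text.split()

# Each entry: (two previous words) -> next word.
pairs = [
    ((tokens[i], tokens[i + 1]), tokens[i + 2])
    for i in range(len(tokens) - 2)
]

vocab = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Assumed input encoding: 1.0 at the word's index, 0.0 elsewhere."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec


embedding_dim = 3  # hypothetical size, chosen only for illustration
random.seed(0)
# W has shape (embedding_dim, vocab_size). The values are random here;
# in a real model they would be learned by the next-word prediction task.
W = [[random.uniform(-1, 1) for _ in vocab] for _ in range(embedding_dim)]


def embedding(word):
    """Compute W @ x for the one-hot vector x of `word`."""
    x = one_hot(word)
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


# With a one-hot x, W @ x simply picks out one column of W.
column = [row[word_to_index["CEO"]] for row in W]
print(pairs[0], embedding("CEO") == column)
```

Because multiplying by a one-hot vector just selects a column, each column of `W` can be read as the embedding of one word. Under that assumption, training the network is what moves those columns into meaningful positions.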

Why do we get a semantic representation? Consider the phrase ```The commander ordered```.\
For the network to predict "ordered", it has to learn to associate "commander" with "ordering".\
If we look closely at the embeddings, we might find that "king" is also associated with ordering,\
because both "commander" and "king" occur in similar contexts.\
In that case, the embeddings of "king" and "commander" will be similar to some degree.
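
The "similar contexts" intuition can be illustrated without training anything. The sketch below is a stand-in, not the learned embedding described above: it represents each word by counts of the words it shares a sentence with and compares those count vectors with cosine similarity. The corpus and the `stopwords` set are invented purely for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for this example only.
corpus = [
    "the king ordered the army",
    "the king ruled the land",
    "the commander ordered the army",
    "the commander led the soldiers",
    "the cat chased the mouse",
]
stopwords = {"the"}  # ignore words that appear in every sentence


def context_counts(target):
    """Count the words that share a sentence with `target`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if target in tokens:
            counts.update(
                t for t in tokens if t != target and t not in stopwords
            )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


king = context_counts("king")
commander = context_counts("commander")
cat = context_counts("cat")
print(cosine(king, commander), cosine(king, cat))
```

In this toy corpus "king" and "commander" share the context words "ordered" and "army", so their similarity is higher than that of "king" and "cat", which share none. A trained network would pick up the same kind of signal, only from far more text and in a dense vector.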

In the ```Generate Word Embeddings.ipynb``` notebook, we will use PyTorch\
to train a neural network on a similar task and see how things unfold.