In [1]:
import numpy as np

This notebook introduces Transformer architectures, which are built on attention.

## One-hot encoding
Given a list of tokens T, each token can be represented by a vector that has a single 1 at the token's position in T and 0 everywhere else.
Stacking these vectors row by row encodes T as a matrix.


In [2]:
T=["The", "bear market", "bull market", "will", "will not", "crash", "ralley"]

T_rep_1 = np.array([1,0,0,0,0,0,0])
T_rep_2 = np.array([0,1,0,0,0,0,0])
T_rep_3 = np.array([0,0,1,0,0,0,0])
T_rep_4 = np.array([0,0,0,1,0,0,0])
T_rep_5 = np.array([0,0,0,0,1,0,0])
T_rep_6 = np.array([0,0,0,0,0,1,0])
T_rep_7 = np.array([0,0,0,0,0,0,1])

T_matrix = np.array([T_rep_1, T_rep_2, T_rep_3, T_rep_4, T_rep_5, T_rep_6, T_rep_7])
print(T_matrix)



[[1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0]
 [0 0 0 0 0 1 0]
 [0 0 0 0 0 0 1]]
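
The same matrix can be built without writing out every row by hand. The following is a minimal sketch, assuming the token order of `T` above; the names `tokens` and `one_hot` are introduced here for illustration only.

```python
import numpy as np

# Same token list as T above, repeated so the sketch is self-contained.
tokens = ["The", "bear market", "bull market", "will", "will not", "crash", "ralley"]

# The one-hot matrix of n tokens is the n x n identity matrix.
T_matrix = np.eye(len(tokens), dtype=int)

# Hypothetical helper mapping: token -> its one-hot row.
one_hot = {tok: T_matrix[i] for i, tok in enumerate(tokens)}

print(one_hot["bull market"])  # [0 0 1 0 0 0 0]
```

Using the identity matrix keeps the encoding in sync with the token list if tokens are added or reordered.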


## First order sequence model
We can use a Markov transition matrix M to encode how likely each token is to follow another: the entry in row i and column j is the probability that token j comes directly after token i.
In this example we allow two possible sentences: \
"The bull market will crash"\
"The bull market will not crash"\
The transition matrix shows the structure of our two sentences clearly. Almost all of the transition probabilities are zero or one. Branching happens at only one place in the Markov chain: after "bull market", the tokens "will" and "will not" are equally likely. Other than that, there is no uncertainty about which token comes next, which is why the matrix consists mostly of ones and zeros. The rows for "bear market", "crash" and "ralley" are all zero because "bear market" and "ralley" do not occur in these sentences and "crash" is always the last token.


In [3]:
# Rows are the current token, columns the next token, both in the order of T.
M = np.array([[0,0,1,0,0,0,0], [0,0,0,0,0,0,0], [0,0,0,0.5,0.5,0,0], [0,0,0,0,0,1,0], [0,0,0,0,0,1,0], [0,0,0,0,0,0,0], [0,0,0,0,0,0,0]])
print(M)

[[0.  0.  1.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.5 0.5 0.  0. ]
 [0.  0.  0.  0.  0.  1.  0. ]
 [0.  0.  0.  0.  0.  1.  0. ]
 [0.  0.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.  0. ]]
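
The matrix above was written by hand. As a rough sketch of where such numbers can come from, the transitions in the two example sentences can be counted and each row normalized. This assumes the sentences are already split into the tokens of `T`; the names `tokens`, `index`, `counts` and `M_est` are illustrative and not part of the original notebook.

```python
import numpy as np

tokens = ["The", "bear market", "bull market", "will", "will not", "crash", "ralley"]
index = {tok: i for i, tok in enumerate(tokens)}

# The two allowed sentences, already tokenized (an assumption of this sketch).
sentences = [
    ["The", "bull market", "will", "crash"],
    ["The", "bull market", "will not", "crash"],
]

# Count how often each token is directly followed by each other token.
counts = np.zeros((len(tokens), len(tokens)))
for sentence in sentences:
    for current, following in zip(sentence, sentence[1:]):
        counts[index[current], index[following]] += 1

# Normalize each row to probabilities; rows without any transition stay zero.
row_sums = counts.sum(axis=1, keepdims=True)
M_est = counts / np.where(row_sums == 0, 1, row_sums)

print(M_est)
```

Under these assumptions `M_est` reproduces the hand-written matrix, including the all-zero rows for tokens that are never followed by anything.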


To pull out the transition probabilities associated with a given token, we can multiply that token's one-hot representation by M. For the token "bull market" this gives:

In [4]:
prob = np.matmul(T_rep_3, M)
print(prob)

[0.  0.  0.  0.5 0.5 0.  0. ]
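
Multiplying a one-hot vector by M simply selects one row of M. The sketch below checks this and attaches token names to the nonzero probabilities. It repeats `M` and the token list so that it runs on its own; `tokens`, `row` and `next_tokens` are names introduced for illustration.

```python
import numpy as np

tokens = ["The", "bear market", "bull market", "will", "will not", "crash", "ralley"]
M = np.array([
    [0, 0, 1, 0,   0,   0, 0],
    [0, 0, 0, 0,   0,   0, 0],
    [0, 0, 0, 0.5, 0.5, 0, 0],
    [0, 0, 0, 0,   0,   1, 0],
    [0, 0, 0, 0,   0,   1, 0],
    [0, 0, 0, 0,   0,   0, 0],
    [0, 0, 0, 0,   0,   0, 0],
])

# One-hot vector for "bull market", taken from the identity matrix.
row = tokens.index("bull market")
one_hot = np.eye(len(tokens))[row]

# The matrix product picks out exactly the matching row of M.
prob = one_hot @ M
print(np.array_equal(prob, M[row]))

# Keep only the tokens that can actually follow "bull market".
next_tokens = {tok: float(p) for tok, p in zip(tokens, prob) if p > 0}
print(next_tokens)
```

Direct indexing with `M[row]` would be cheaper, but the matrix-product form is the one that generalizes to inputs that are not strictly one-hot.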


## Feature Extraction
Imagine we are building yet another language model. This one has to represent 4 sentences, each equally likely to occur. 


The bear market will not crash \
The bear market will ralley \
The bull market will not ralley \
The bull market will crash 


All sentences contain 4 tokens. Furthermore, the last token can be predicted from tokens 2 and 3 alone.
Only "bear market", "bull market", "will" and "will not" matter for predicting the last token.

For example, whenever "bear market" and "will not" have been seen so far, we know that "crash" is the last token.
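
To make this concrete, here is a small sketch that reads the rule off the four sentences. It is only a lookup table, not the attention mechanism itself, and it assumes the sentences are already tokenized as shown; `last_token` and `by_third` are illustrative names.

```python
from collections import defaultdict

# The four equally likely sentences, already split into tokens.
sentences = [
    ["The", "bear market", "will not", "crash"],
    ["The", "bear market", "will", "ralley"],
    ["The", "bull market", "will not", "ralley"],
    ["The", "bull market", "will", "crash"],
]

# Tokens 2 and 3 together determine the last token uniquely.
last_token = {(s[1], s[2]): s[3] for s in sentences}
print(last_token[("bear market", "will not")])  # crash

# Token 3 alone is not enough: each value is followed by both endings.
by_third = defaultdict(set)
for s in sentences:
    by_third[s[2]].add(s[3])
print(sorted(by_third["will"]))  # ['crash', 'ralley']
```

The second part shows why a first order model, which only looks at the previous token, cannot decide between "crash" and "ralley" here.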

