## A Hands-on Workshop series in Machine Learning
#### Instructor: Dr. Aashita Kesarwani

### Background

The input vectors in Bag-of-Words (BOW) methods have a very high dimension, equal to the size of the vocabulary. Autoencoder neural networks are one of the simplest ways to reduce the dimension of these input vectors. Several other neural networks, most prominently the word2vec family of algorithms, are more efficient and useful for this purpose, but we will study autoencoders first because they are the simplest and make word embeddings easiest to understand.
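
To make the dimensionality point concrete, here is a minimal sketch of count-based BOW vectors in plain Python. The two-sentence corpus and the helper name `bow_vector` are made up for illustration; the only point is that every vector has one entry per vocabulary word.

```python
# Toy corpus; the sentences are invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary: every distinct word, sorted so the ordering is reproducible.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def bow_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = [0] * len(vocabulary)
    for word in sentence.split():
        counts[word_to_index[word]] += 1
    return counts


vectors = [bow_vector(sentence) for sentence in corpus]

print(vocabulary)
print(vectors[0])
print(len(vectors[0]) == len(vocabulary))  # vector length equals vocabulary size
```

With a realistic corpus the vocabulary easily reaches tens of thousands of words, which is why a lower-dimensional representation is attractive.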

### Autoencoder neural networks

Autoencoder neural networks create a bottleneck to compress data.

![](https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-06-at-3.17.13-PM.png)

For training the autoencoders, we set the output $y$ to be the same as the input $x$. So, for the above network, we can use the following training set:
* x1 = y1 = [1, 0, 0, 0, 0, 0]
* x2 = y2 = [0, 1, 0, 0, 0, 0]
* x3 = y3 = [0, 0, 1, 0, 0, 0]
* x4 = y4 = [0, 0, 0, 1, 0, 0]
* x5 = y5 = [0, 0, 0, 0, 1, 0]
* x6 = y6 = [0, 0, 0, 0, 0, 1]

If we train the above network for several epochs, the hidden layer will learn a representation of the data in a lower dimension. We can use the output of the hidden layer to reduce the dimension of the input vector (in the above case, the dimension is reduced from 6 to 3).

In [2]:
from keras.models import Sequential, Model
from keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from keras import losses
import numpy as np

from keras.utils import set_random_seed
set_random_seed(1)

X = np.eye(6)
Y = X

print(X)


[[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]]


In [3]:
model = Sequential()
model.add(Dense(units=3, input_dim=6, activation="sigmoid")) # Hidden layer
model.add(Dense(units=6, activation="softmax")) # Output layer
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 3)                 21        
                                                                 
 dense_1 (Dense)             (None, 6)                 24        
                                                                 
Total params: 45
Trainable params: 45
Non-trainable params: 0
_________________________________________________________________
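
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per (input, unit) pair plus one bias per unit. The helper name `dense_param_count` below is only for illustration and is not part of Keras.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden_params = dense_param_count(6, 3)   # input (6) -> hidden layer (3)
output_params = dense_param_count(3, 6)   # hidden layer (3) -> output (6)
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```

These values match the `Param #` column and the total reported by `model.summary()`.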



In [4]:
opt = Adam(learning_rate=0.008, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X, Y, epochs=350, verbose=1); 

Epoch 1/350
...
Epoch 350/350


Let's see the weights connecting the input to the hidden layer.

In [5]:
hidden_layer_weights = model.layers[0].get_weights()[0]
hidden_layer_weights

array([[ 2.3535373 ,  1.7600238 ,  2.6751099 ],
       [-2.479652  ,  0.6684221 , -1.6248617 ],
       [ 1.9789417 , -3.3325272 ,  1.6378893 ],
       [-2.6733994 ,  2.0745368 , -3.1317546 ],
       [ 0.88372207, -2.8770435 , -2.793328  ],
       [-3.196977  ,  1.483793  ,  2.7325883 ]], dtype=float32)

Let's create a new neural network that discards the output layer, so that the output of the hidden layer becomes the final output. We can use this model to reduce the dimension of the input vectors, which is what we need for word embeddings.

In [6]:
hidden_layer_model = Model(inputs=model.input, 
                           outputs=model.get_layer(index=0).output)
hidden_layer_output = hidden_layer_model(X)
hidden_layer_output

<tf.Tensor: shape=(6, 3), dtype=float32, numpy=
array([[0.92892784, 0.9086109 , 0.9448164 ],
       [0.09424649, 0.76944655, 0.18852136],
       [0.8998663 , 0.0575537 , 0.8585264 ],
       [0.07895716, 0.93158555, 0.04896059],
       [0.7503575 , 0.08784174, 0.06735086],
       [0.04832941, 0.8829389 , 0.94773775]], dtype=float32)>
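
As a sanity check on how the hidden layer produces these values, here is a small NumPy sketch. It assumes the hidden layer computes `sigmoid(x @ W + b)`, which is what a `Dense` layer with a sigmoid activation does. For a one-hot input, `x @ W` simply selects one row of `W`, so subtracting each weight row from the inverse sigmoid (logit) of the matching output row should recover the same bias vector `b` every time. The arrays are copied from the two outputs printed above; the biases themselves were not printed, so they are inferred here rather than read from the model.

```python
import numpy as np

# Weights from input to hidden layer, copied from the output above.
W = np.array([
    [ 2.3535373 ,  1.7600238 ,  2.6751099 ],
    [-2.479652  ,  0.6684221 , -1.6248617 ],
    [ 1.9789417 , -3.3325272 ,  1.6378893 ],
    [-2.6733994 ,  2.0745368 , -3.1317546 ],
    [ 0.88372207, -2.8770435 , -2.793328  ],
    [-3.196977  ,  1.483793  ,  2.7325883 ],
])

# Hidden-layer outputs for the identity matrix, copied from the output above.
H = np.array([
    [0.92892784, 0.9086109 , 0.9448164 ],
    [0.09424649, 0.76944655, 0.18852136],
    [0.8998663 , 0.0575537 , 0.8585264 ],
    [0.07895716, 0.93158555, 0.04896059],
    [0.7503575 , 0.08784174, 0.06735086],
    [0.04832941, 0.8829389 , 0.94773775],
])


def logit(p):
    """Inverse of the sigmoid function."""
    return np.log(p / (1.0 - p))


# For the one-hot input e_i the pre-activation is W[i] + b,
# so logit(H[i]) - W[i] should give the same bias b for every row i.
implied_bias = logit(H) - W
print(np.round(implied_bias, 3))
```

Every row of `implied_bias` comes out essentially the same (approximately 0.22, 0.54 and 0.17), which confirms that, for one-hot inputs, the hidden layer acts as a lookup of a weight row followed by a bias shift and a sigmoid.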

Let's round the final output to two decimal places. The input X in our case was the identity matrix, so each row of the output is the reduced, 3-dimensional representation of one of the 6-dimensional one-hot vectors.

In [7]:
np.round(hidden_layer_output, decimals=2)

array([[0.93, 0.91, 0.94],
       [0.09, 0.77, 0.19],
       [0.9 , 0.06, 0.86],
       [0.08, 0.93, 0.05],
       [0.75, 0.09, 0.07],
       [0.05, 0.88, 0.95]], dtype=float32)

The reduced model can also be applied to an arbitrary 6-dimensional input that is not one of the one-hot vectors it was trained on:

In [20]:
x = np.array([4,2,10,4,8,6]).reshape(1,6)
np.round(hidden_layer_model(x), decimals=2)

array([[0.84, 0.  , 1.  ]], dtype=float32)

#### Further reading
* [Introduction to autoencoders](https://www.jeremyjordan.me/autoencoders)

* [Deep learning book (Chapter 14): Autoencoders](https://www.deeplearningbook.org/contents/autoencoders.html)

### Word embeddings

Word embeddings are simply representations of words as numerical vectors. Several word embedding techniques, such as word2vec, autoencoders (seen above), and BERT (Bidirectional Encoder Representations from Transformers), are based on neural networks. The neural networks are trained on a chosen text corpus to generate the word representations.

BERT involves transformer models that we will study today.

#### Word2vec embedding

Word2vec is a family of several models that are used to generate word embeddings. They were first introduced by Mikolov et al. in [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf). They can be used as the first step in a neural network to create a lower-dimensional vectorization of textual data than count/TF-IDF vectorization methods, while also storing information about the relationships between words.

The embedding space for word2vec is an n-dimensional vector space in which each word has a corresponding vector. The arrangement of these vectors in the embedding space stores useful information about the relationships between the words. For example, the difference between the vectors for France and Paris would be similar to the difference between the vectors for Italy and Rome.

$$ \text{vec}(\text{France}) - \text{vec}(\text{Paris}) \approx \text{vec}(\text{Italy}) - \text{vec}(\text{Rome}) $$

The neural networks learn these relationships on their own based on the corpus used to train them. Words that often occur together in the corpus have vectors that are similar in the mathematical sense.
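
The vector arithmetic behind the country/capital relation can be sketched with NumPy. The 3-dimensional vectors below are hand-made so that the relation holds exactly; they are not learned embeddings, and the dictionary `toy_vectors` and the helper `cosine_similarity` are names chosen for this illustration. Cosine similarity is used here as one common way to measure how similar two vectors are.

```python
import numpy as np

# Hand-made toy vectors (NOT trained embeddings).
# Axis 0: "French", axis 1: "Italian", axis 2: +1 for country, -1 for capital.
toy_vectors = {
    "France": np.array([1.0, 0.0, 1.0]),
    "Paris":  np.array([1.0, 0.0, -1.0]),
    "Italy":  np.array([0.0, 1.0, 1.0]),
    "Rome":   np.array([0.0, 1.0, -1.0]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# The two "country minus capital" differences point in the same direction.
diff_france = toy_vectors["France"] - toy_vectors["Paris"]
diff_italy = toy_vectors["Italy"] - toy_vectors["Rome"]
print(cosine_similarity(diff_france, diff_italy))

# Rearranging the relation: vec(France) - vec(Paris) + vec(Rome) should land near vec(Italy).
query = toy_vectors["France"] - toy_vectors["Paris"] + toy_vectors["Rome"]
nearest = max(toy_vectors, key=lambda word: cosine_similarity(query, toy_vectors[word]))
print(nearest)
```

With trained embeddings the relation holds only approximately, so the nearest-neighbor search at the end is how such analogies are usually evaluated.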

Learning word representations is an unsupervised task, so we need some way to create the (input, target) pairs. Similar to autoencoders, the neural networks are trained on a fake task, but once they are trained, the output layer is discarded and the trained weights are used to map words into vectors.

<img align="center" src="http://mccormickml.com/assets/word2vec/word2vec_weight_matrix_lookup_table.png" width=500 >

The above word2vec representation reduces the dimension from 10,000 to 300. There are several pretrained word2vec models available that you can use directly to generate embeddings.
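
The figure's idea, that multiplying a one-hot vector by the weight matrix is the same as looking up one row of that matrix, can be checked with a short NumPy sketch. The sizes (10 words, 3 dimensions) are small stand-ins for 10,000 and 300, and the matrix is random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 10, 3  # small stand-ins for 10,000 and 300
W = rng.normal(size=(vocab_size, embedding_dim))  # random weights, not trained

word_index = 4  # arbitrary position of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ W     # what the network computes
via_lookup = W[word_index]   # the equivalent table lookup

print(via_matmul)
print(np.allclose(via_matmul, via_lookup))
```

This is why the trained weight matrix can be used directly as a table of word vectors, with no matrix multiplication needed at lookup time.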

The fake task used to generate the word embeddings is slightly different for the different models in the word2vec family. Two of the most commonly used ones are:

* Continuous Skip-gram Model
* Continuous Bag-of-Words Model (CBOW)
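
The difference between the two fake tasks shows up in how the (input, target) pairs are built. In skip-gram the input is a word and the target is one of its neighbors; in CBOW the input is the set of neighbors and the target is the word in the middle. The following is a simplified sketch in plain Python: the function names, the example sentence, and the fixed window size are choices made for illustration, and real implementations add further refinements.

```python
def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs: predict each neighbor from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """(context words, center word) pairs: predict the center from its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


tokens = "the quick brown fox jumps".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)

print(sg)
print(cb)
```

In both cases the words would then be encoded as one-hot vectors before being fed to the network, just like the identity-matrix rows used for the autoencoder above.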

#### Further reading

* [Word2Vec Tutorial - The Skip-Gram Model](http://www.mccormickml.com)
