# Word embeddings

We can represent the words of a dictionary as vectors using one-hot encoding, where all components are zero except for one.

<img src="images/words-one-hot.png">

The advantage of such an encoding is that computing a word's vector and looking up the word for a given vector are both easy. On the other hand, this form of encoding contains no information about how words relate to each other.

An alternative kind of word vector is the **word embedding**. In such vectors, each component reflects a different feature of a word's meaning (e.g. age, sex, food/non-food, word type, etc.). Therefore the components can all have non-zero values, and words that are semantically similar have similar values in the individual components. For visualization we can reduce the vectors to two (or three) dimensions, e.g. by applying the t-SNE algorithm. Doing so shows that words with similar meanings end up close to each other in vector space.
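To make the limitation of one-hot vectors concrete, here is a minimal NumPy sketch. The five-word vocabulary and the helper `one_hot` are made up for illustration and are not part of the dataset used later.

```python
import numpy as np

# Toy vocabulary (hypothetical, for illustration only)
toy_vocab = ["apple", "king", "man", "queen", "woman"]
toy_word_to_index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot(word):
    """Return the one-hot vector of `word` for the toy vocabulary."""
    vec = np.zeros(len(toy_vocab))
    vec[toy_word_to_index[word]] = 1.0
    return vec

king = one_hot("king")
queen = one_hot("queen")

# Looking up the word for a vector is a single argmax
print(toy_vocab[int(np.argmax(queen))])  # queen

# The dot product of two different words is always 0, so the encoding
# cannot express that "king" is closer to "queen" than to "apple"
print(king @ queen, king @ one_hot("apple"))
```

Every pair of distinct one-hot vectors is orthogonal, which is why this encoding carries no notion of similarity.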

<img src="images/vector-space.png" width="300px">

## Properties of word embeddings

Word embeddings have become hugely popular in NLP and can, for example, be used for named-entity recognition (NER). Oftentimes an existing model can be adapted to a specific task by performing additional training on suitable training data (transfer learning). The training set for this additional step, and also the dimensionality of the word vectors, can be much smaller. The role of a word embedding $e$ is similar to that of a face encoding in face recognition in computer vision: it is a vectorized representation of the underlying data. An important distinction, however, is that a word-embedding model learns a fixed-size vocabulary. Vectors for words outside this vocabulary cannot be calculated. In contrast, a CNN can calculate a vector for a face it has never seen before.

Word embeddings are useful to model analogies and relationships between words:

$$e_{man} - e_{woman} \approx e_{king} - e_{queen}$$

The difference between the vectors for “man” and “woman” is similar to the difference between the vectors for “king” and “queen”, because the two pairs of words are related in the same way. We can see that a trained model has learned this relationship because the difference vectors of the two pairs are approximately parallel. This also applies to other kinds of word pairings, like verbs in different tenses or the relationship between a country and its capital:

<img src="images/word-embeddings.png">

Therefore we can estimate an embedding by rearranging the formula:
$$e_{king} - e_{man} + e_{woman} \approx e_{queen}$$

This way the word embedding for “queen” can be calculated using the embeddings of the other words. To find the word that belongs to an embedding, we can use a similarity function $sim$, which measures the similarity between two embeddings $u$ and $v$. The cosine similarity is often used for this function:

$$sim(u,v) = \frac{u^T v}{\lVert u\rVert_2 \lVert v \rVert_2}$$
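The cosine similarity formula translates almost one-to-one into NumPy. The function name `cosine_similarity` and the example vectors are chosen here for illustration only.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (||u||_2 * ||v||_2)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])

same_direction = cosine_similarity(u, 2 * u)   # close to 1
opposite = cosine_similarity(u, -u)            # close to -1
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 0

print(same_direction, opposite, orthogonal)
```

Because both vectors are divided by their lengths, only the direction matters: scaling `u` by 2 does not change the result.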

With the help of the similarity function we can find the word “queen” by comparing the computed vector $e_{king} - e_{man} + e_{woman}$ against the embeddings $e_w$ of all words $w$ in the vocabulary:

$$\hat{w} = \underset{w}{\operatorname{argmax}}\; sim(e_w, e_{king} - e_{man} + e_{woman})$$
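The analogy lookup can be sketched with a handful of hand-made vectors. The three-dimensional values in `toy_embeddings` are invented so that the arithmetic works out exactly; real embeddings are learned and only approximately satisfy the relation. Excluding the three input words from the candidates is a common practice and an assumption of this sketch, not something the formula requires.

```python
import numpy as np

# Hand-made toy embeddings (hypothetical values);
# the components are roughly (gender, royalty, food)
toy_embeddings = {
    "man":   np.array([ 1.0, 0.0, 0.1]),
    "woman": np.array([-1.0, 0.0, 0.1]),
    "king":  np.array([ 1.0, 1.0, 0.0]),
    "queen": np.array([-1.0, 1.0, 0.0]),
    "apple": np.array([ 0.0, 0.0, 1.0]),
}

def sim(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(a, b, c):
    """Find w such that a is to b as c is to w (man : king = woman : w)."""
    target = toy_embeddings[b] - toy_embeddings[a] + toy_embeddings[c]
    # Search every word except the three inputs
    candidates = [w for w in toy_embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: sim(toy_embeddings[w], target))

print(complete_analogy("man", "king", "woman"))  # queen
```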

## Word2Vec

**Word2Vec (W2V)** is probably the most popular implementation of word embeddings. W2V comprises two approaches:
- Skip-Gram
- CBOW (Continuous Bag Of Words)

<img src="images/skip-gram-medium.webp" width="500px">

### Skip-Grams

One of the most famous approaches to dense representation is word2vec, proposed by Google in 2013.

We create a few context-target pairs to set up a supervised learning problem. We pick a random word from the sentence as the context, then we take a random number of words from a given window ($\pm k$) around it as targets. Usually the context words are not sampled uniformly at random but according to their frequency in the corpus: frequent words have a lower probability of being selected as context words.
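A minimal sketch of the pair generation follows. For simplicity it enumerates every pair inside the window instead of sampling randomly, and it leaves out the frequency-based selection; the function name `skipgram_pairs` is made up for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return all (context, target) pairs whose distance is at most `window`."""
    pairs = []
    for i, context in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own target
                pairs.append((context, tokens[j]))
    return pairs

sentence = "merry christmas to you".split()
pairs = skipgram_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
```

With `window=1` the four-word sentence yields six pairs, for example `("christmas", "merry")` and `("christmas", "to")`, but never `("merry", "you")`, because those two words are three positions apart.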

<img src="images/skip grams.png" width="500px">

The high-level idea of dense vector representation is to train two models: encoder and decoder. For example, in the case of skip-gram, we might pass the word “christmas” to the encoder. Then, the encoder will produce a vector that we pass to the decoder expecting to get the words “merry”, “to”, and “you”.

<img src="images/skip gram high level model.webp">
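The encoder/decoder idea can be sketched as two matrices and a softmax. This is an untrained toy under several assumptions: the four-word vocabulary, the embedding size of 3 and the plain softmax decoder are chosen for illustration, and the weights are random, so the printed probabilities carry no meaning yet.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab = ["merry", "christmas", "to", "you"]
vocab_size, embedding_dim = len(toy_vocab), 3

# Encoder: one embedding row per word. Decoder: one output column per word.
# Both are randomly initialised here; training would adjust them.
encoder = rng.normal(size=(vocab_size, embedding_dim))
decoder = rng.normal(size=(embedding_dim, vocab_size))

def predict_context(word):
    """Return one probability per vocabulary word of appearing near `word`."""
    e = encoder[toy_vocab.index(word)]    # encode: look up the word vector
    logits = e @ decoder                  # decode: one score per vocabulary word
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

probs = predict_context("christmas")
print(dict(zip(toy_vocab, probs.round(3))))
```

After training, the rows of `encoder` are the word embeddings; the decoder is only needed to define the training objective.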

This model starts to take the meaning of words into account, since it is trained on the contexts in which words appear. However, it ignores morphology (information we can get from the word parts, for example, that “-less” means the lack of something). This drawback was addressed later by looking at subword skip-grams in **fastText**.

## Packages

In [1]:
import os

import pathlib
import numpy as np
from tensorflow import data as tf_data
from tensorflow import keras
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

## Load Dataset

In [2]:
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

In [3]:
data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
# data_dir = pathlib.Path("data/20_newsgroup")
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of directories: 20
Directory names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Number of files in comp.graphics: 1000
Some example filenames: ['37261', '37913', '37914', '37915', '37916']


In [4]:
print(open(data_dir / "comp.graphics" / "38987").read())

Newsgroups: comp.graphics
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!usc!rpi!nason110.its.rpi.edu!mabusj
From: mabusj@nason110.its.rpi.edu (Jasen M. Mabus)
Subject: Looking for Brain in CAD
Message-ID: <c285m+p@rpi.edu>
Nntp-Posting-Host: nason110.its.rpi.edu
Reply-To: mabusj@rpi.edu
Organization: Rensselaer Polytechnic Institute, Troy, NY.
Date: Thu, 29 Apr 1993 23:27:20 GMT
Lines: 7

Jasen Mabus
RPI student

	I am looking for a hman brain in any CAD (.dxf,.cad,.iges,.cgm,etc.) or picture (.gif,.jpg,.ras,etc.) format for an animation demonstration. If any has or knows of a location please reply by e-mail to mabusj@rpi.edu.

Thank you in advance,
Jasen Mabus  



In [5]:
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        # latin-1 can decode any byte, so no file raises a decoding error
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        # Skip the first 10 lines, where the message header sits
        lines = content.split("\n")[10:]
        samples.append("\n".join(lines))
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found


Processing comp.os.ms-windows.misc, 1000 files found
Processing comp.sys.ibm.pc.hardware, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.motorcycles, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.electronics, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.mideast, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 're

### Shuffle and split the data into training & validation sets

In [6]:
# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

X_train, X_val, y_train, y_val = train_test_split(samples, labels, train_size=0.8, random_state=seed)

print(f"Training dataset has shape: {len(X_train)}")
print(f"Validation dataset has shape: {len(X_val)}")

Training dataset has shape: 15997
Validation dataset has shape: 4000
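The shuffling step relies on the fact that two generators created with the same seed produce the same permutation for lists of equal length, so every sample keeps its label. A small sketch with made-up toy lists illustrates this:

```python
import numpy as np

toy_samples = ["a", "b", "c", "d", "e"]
toy_labels = [0, 1, 2, 3, 4]  # toy_labels[i] belongs to toy_samples[i]
pairs_before = set(zip(toy_samples, toy_labels))

seed = 1337
# Two fresh generators with the same seed apply the same permutation
np.random.RandomState(seed).shuffle(toy_samples)
np.random.RandomState(seed).shuffle(toy_labels)

print(toy_samples, toy_labels)
print(set(zip(toy_samples, toy_labels)) == pairs_before)  # True
```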


In [7]:
text_ds = tf_data.Dataset.from_tensor_slices(X_train).batch(128)

## Create a Vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset. Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 20,000 words, and will truncate or pad sequences to be exactly 200 tokens long.

In [8]:
vectorizer = keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
vectorizer.adapt(text_ds)

You can retrieve the computed vocabulary used via `vectorizer.get_vocabulary()`. Let's print the top 5 words:

In [9]:
vectorizer.get_vocabulary()[:5]

['', '[UNK]', 'the', 'to', 'of']

Let's vectorize a test sentence:

In [10]:
output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

array([   2, 3569, 1691,   15,    2, 6706], dtype=int64)

As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first word in the vocabulary? That's because index 0 is reserved for padding and index 1 is reserved for "out of vocabulary" tokens.
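The reserved indices can be illustrated with a plain-Python sketch. The mini vocabulary `toy_voc` and the helper `encode` are hypothetical and only mimic the index lookup; they are not how `TextVectorization` is implemented.

```python
# Hypothetical mini vocabulary laid out like the one above:
# index 0 = padding, index 1 = out-of-vocabulary, then the known words
toy_voc = ["", "[UNK]", "the", "cat", "sat"]
toy_index = {word: i for i, word in enumerate(toy_voc)}

def encode(sentence):
    """Map each word to its index, using 1 for words outside the vocabulary."""
    return [toy_index.get(word, 1) for word in sentence.lower().split()]

print(encode("The cat sat on the mat"))  # [2, 3, 4, 1, 2, 1]
```

"on" and "mat" are not in the toy vocabulary, so both collapse onto index 1 and become indistinguishable to the model.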

Here's a dict mapping words to their indices:

In [11]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

As you can see, we obtain the same encoding as above for our test sentence:

In [12]:
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

[2, 3569, 1691, 15, 2, 6706]

## Load pre-trained word embeddings

We will be using GloVe embeddings, saved in `glove.6B.100d.txt`. The downloaded archive contains text-encoded vectors of various sizes: 50-dimensional, 100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.
Let's make a dict mapping words (strings) to their NumPy vector representation:

In [None]:
!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip -d data

In [13]:
path_to_glove_file = "data/glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(embeddings_index["cat"])
print("Found %s word vectors." % len(embeddings_index))

[ 0.23088    0.28283    0.6318    -0.59411   -0.58599    0.63255
  0.24402   -0.14108    0.060815  -0.7898    -0.29102    0.14287
  0.72274    0.20428    0.1407     0.98757    0.52533    0.097456
  0.8822     0.51221    0.40204    0.21169   -0.013109  -0.71616
  0.55387    1.1452    -0.88044   -0.50216   -0.22814    0.023885
  0.1072     0.083739   0.55015    0.58479    0.75816    0.45706
 -0.28001    0.25225    0.68965   -0.60972    0.19578    0.044209
 -0.31136   -0.68826   -0.22721    0.46185   -0.77162    0.10208
  0.55636    0.067417  -0.57207    0.23735    0.4717     0.82765
 -0.29263   -1.3422    -0.099277   0.28139    0.41604    0.10583
  0.62203    0.89496   -0.23446    0.51349    0.99379    1.1846
 -0.16364    0.20653    0.73854    0.24059   -0.96473    0.13481
 -0.0072484  0.33016   -0.12365    0.27191   -0.40951    0.021909
 -0.6069     0.40755    0.19566   -0.41802    0.18636   -0.032652
 -0.78571   -0.13847    0.044007  -0.084423   0.04911    0.24104
  0.45273   -0.18682 
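To see what the parsing loop does, here is a sketch that reads two invented lines in the same layout (a word followed by its space-separated coefficients). The values and the in-memory `fake_file` are made up; the real file has 100 numbers per line.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout
fake_file = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n")

toy_embeddings_index = {}
for line in fake_file:
    # The first token is the word, the rest are the vector components
    word, *coefs = line.split()
    toy_embeddings_index[word] = np.asarray([float(c) for c in coefs], dtype="float32")

print(toy_embeddings_index["cat"], toy_embeddings_index["cat"].shape)
```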

### Padding

Most deep learning frameworks require that all sequences in the same mini-batch have the **same length**.
This is what allows vectorization to work: If you had a 3-word sentence and a 4-word sentence, then the computations needed for them are different (one takes 3 steps of an LSTM, one takes 4 steps) so it's just not possible to do them both at the same time.
    
#### Padding Handles Sequences of Varying Length
* The common solution to handling sequences of **different length** is to use padding.  Specifically:
    * Set a maximum sequence length
    * Pad all sequences to have the same length. 
    
#### Example of Padding:
* Given a maximum sequence length of 20, you could pad every sentence with "0"s so that each input sentence is of length 20. 
* Thus, the sentence "I love you" would be represented as $(e_{I}, e_{love}, e_{you}, \vec{0}, \vec{0}, \ldots, \vec{0})$. 
* In this example, any sentences longer than 20 words would have to be truncated. 
* One way to choose the maximum sequence length is to just pick the length of the longest sentence in the training set. 
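Padding and truncation can be sketched in a few lines of NumPy. The sketch pads index sequences with the integer 0, in line with the padding index reserved by the vectorizer above, rather than with zero vectors; the helper name `pad_to_length` and the example indices are made up.

```python
import numpy as np

def pad_to_length(sequences, max_len, pad_value=0):
    """Right-pad or truncate lists of word indices to exactly `max_len`."""
    padded = np.full((len(sequences), max_len), pad_value, dtype="int64")
    for row, seq in enumerate(sequences):
        trimmed = seq[:max_len]              # truncate sequences that are too long
        padded[row, :len(trimmed)] = trimmed
    return padded

toy_batch = pad_to_length([[4, 8, 15], [16, 23, 42, 7, 9, 11]], max_len=5)
print(toy_batch)
# [[ 4  8 15  0  0]
#  [16 23 42  7  9]]
```

Both rows now have the same length, so the batch fits into a single rectangular array.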
  
### The Embedding Layer

In Keras, the embedding matrix is represented as a "layer."

* The embedding matrix maps word indices to embedding vectors.
    * The word indices are positive integers.
    * The embedding vectors are dense vectors of fixed size.
    * A "dense" vector is the opposite of a sparse vector. It means that most of its values are non-zero.  As a counter-example, a one-hot encoded vector is not "dense."
* The embedding matrix can be derived in two ways:
    * Training a model to derive the embeddings from scratch. 
    * Using a pretrained embedding.
    
#### Using and Updating Pre-trained Embeddings
In this section, you'll create an [Embedding()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer in Keras.

* You will initialize the Embedding layer with GloVe 100-dimensional vectors. 
* In the code below, you'll observe how Keras allows you to either train or leave this layer fixed.  
    * Because your training set is quite small, you'll leave the GloVe embeddings fixed instead of updating them.

#### Inputs and Outputs to the Embedding Layer

* The `Embedding()` layer's input is an integer matrix of size **(batch size, max input length)**. 
    * This input corresponds to sentences converted into lists of indices (integers).
    * The largest integer (the highest word index) in the input must be smaller than the number of rows of the embedding matrix (the layer's `input_dim`).
* The embedding layer outputs an array of shape (batch size, max input length, dimension of word vectors).

* The figure shows the propagation of two example sentences through the embedding layer. 
    * Both examples have been zero-padded to a length of `max_len=5`.
    * The word embeddings are 50 units in length.
    * The final dimension of the representation is  `(2,max_len,50)`. 

<img src="images/emb_kiank.png" style="width:700px;height:250px;">
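Conceptually, the embedding layer is a row lookup in a matrix. The following NumPy sketch reproduces the shapes from the figure with a random toy matrix; the vocabulary size of 10 and the index values are arbitrary, and zeroing row 0 is an assumption made here to mirror zero padding.

```python
import numpy as np

vocab_size, embedding_dim, max_len = 10, 50, 5
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(vocab_size, embedding_dim))
toy_matrix[0] = 0.0  # row 0 stands for padding in this sketch

# Two sentences as word indices, zero-padded to max_len
toy_indices = np.array([[3, 7, 2, 0, 0],
                        [5, 1, 4, 9, 0]])

# Indexing with an integer array picks one row per index
embedded = toy_matrix[toy_indices]
print(embedded.shape)  # (2, 5, 50)
```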

Now, let's prepare a corresponding embedding matrix that we can use in a Keras `Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the pre-trained vector for the word of index `i` in our vectorizer's vocabulary.

In [14]:
# NOTE: this sizes the matrix by the GloVe vocabulary, which is larger than
# needed; len(voc) + 2 rows would suffice, the extra rows are never looked up.
num_tokens = len(embeddings_index) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representation for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 17970 words (2030 misses)


Next, we load the pre-trained word embeddings matrix into an Embedding layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to update them during training).

In [15]:
embedding_layer = keras.layers.Embedding(
    num_tokens,
    embedding_dim,
    trainable=False,
)
embedding_layer.build((1,))
embedding_layer.set_weights([embedding_matrix])

Before we build our model, we convert our list-of-strings data to NumPy arrays of integer indices. The arrays are right-padded.

In [16]:
x_train = vectorizer(np.array([[s] for s in X_train])).numpy()
x_val = vectorizer(np.array([[s] for s in X_val])).numpy()

y_train = np.array(y_train)
y_val = np.array(y_val)

## Build the model

### Model Overview

A simple 1D convnet with global max pooling and a classifier at the end.

In [17]:
int_sequences_input = keras.Input(shape=(None,), dtype="int32")
embedded_sequences = embedding_layer(int_sequences_input)
x = keras.layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = keras.layers.MaxPooling1D(5)(x)
x = keras.layers.Conv1D(128, 5, activation="relu")(x)
x = keras.layers.MaxPooling1D(5)(x)
x = keras.layers.Conv1D(128, 5, activation="relu")(x)
x = keras.layers.GlobalMaxPooling1D()(x)
x = keras.layers.Dense(128, activation="relu")(x)
x = keras.layers.Dropout(0.5)(x)
preds = keras.layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         40000200  
                                                                 
 conv1d (Conv1D)             (None, None, 128)         64128     
                                                                 
 max_pooling1d (MaxPooling1D  (None, None, 128)        0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, None, 128)         82048     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, None, 128)        0         
 1D)                                                         
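The summary reports the sequence dimension as `None`, but for the 200-token inputs produced by the vectorizer the lengths can be worked out by hand. The sketch below assumes the Keras defaults used in the model (stride 1 and `"valid"` padding for `Conv1D`, stride equal to the pool size for `MaxPooling1D`); the helper names are made up.

```python
def conv1d_out(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

length = 200                     # output_sequence_length of the vectorizer
length = conv1d_out(length, 5)   # 196 after the first Conv1D
length = length // 5             # 39 after MaxPooling1D(5)
length = conv1d_out(length, 5)   # 35 after the second Conv1D
length = length // 5             # 7 after MaxPooling1D(5)
length = conv1d_out(length, 5)   # 3 after the third Conv1D
print(length)

print(conv1d_params(100, 128, 5))  # 64128, matches conv1d in the summary
print(conv1d_params(128, 128, 5))  # 82048, matches conv1d_1 in the summary
```

`GlobalMaxPooling1D` then reduces the remaining 3 positions to a single 128-dimensional vector per sample.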

In [18]:
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), optimizer='rmsprop', metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x228c4b40bb0>

## Export an end-to-end model

Now, we may want to export a `Model` object that takes as input a string of arbitrary length, rather than a sequence of indices. It would make the model much more portable, since you wouldn't have to worry about the input preprocessing pipeline.

Our `vectorizer` is actually a Keras layer, so it's simple:

In [19]:
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
outputs = model(x)

end_to_end_model = keras.Model(string_input, outputs)

probabilities = end_to_end_model(
        tf.convert_to_tensor([["this message is about computer graphics and 3D modeling"]])
)

print(class_names[np.argmax(probabilities[0])])

comp.graphics
