# Word2Vec

Word2vec learns dense vector representations of words from the contexts in which they appear, so that words used in similar contexts end up with similar vectors.

In [1]:
import random
import numpy as np
import pandas as pd
import tensorflow as tf

### Data

The corpus is a short paragraph about machine learning, taken from Wikipedia.

In [2]:
text = '''Machine learning is the study of computer algorithms that \
improve automatically through experience. It is seen as a \
subset of artificial intelligence. Machine learning algorithms \
build a mathematical model based on sample data, known as \
training data, in order to make predictions or decisions without \
being explicitly programmed to do so. Machine learning algorithms \
are used in a wide variety of applications, such as email filtering \
and computer vision, where it is difficult or infeasible to develop \
conventional algorithms to perform the needed tasks.'''

### Tokenization

Since we can’t feed raw strings into our model, we need to preprocess the text. The first step, as in many NLP tasks, is to tokenize it, i.e. split the text into smaller units such as words, drop punctuation, and so on. Here is a function that does this using regular expressions.

In [3]:
import re

def tokenize(text):
    pattern = re.compile(r'[A-Za-z]+[\w^\']*|[\w^\']*[A-Za-z]+[\w^\']*')
    return pattern.findall(text.lower())
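
To see what the pattern keeps and drops, here is a small check on a made-up sentence. The sentence and the name `sample_tokens` are illustrative only and are not part of the original notebook; the function is repeated so the snippet runs on its own.

```python
import re

def tokenize(text):
    # Same pattern as in the cell above.
    pattern = re.compile(r'[A-Za-z]+[\w^\']*|[\w^\']*[A-Za-z]+[\w^\']*')
    return pattern.findall(text.lower())

# Hypothetical sample sentence, chosen to exercise apostrophes and digits.
sample = "Don't stop: it's the 2nd of 42 tests!"
sample_tokens = tokenize(sample)
print(sample_tokens)
```

Apostrophes inside words are kept, tokens that mix digits and letters (such as `2nd`) survive, and tokens made only of digits or punctuation are dropped because every match must contain at least one letter.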

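Next, we remove stop words from the text. Note that the text is split on whitespace only, so a word with attached punctuation, such as "so." in the output below, is compared together with its punctuation and therefore never matches the stop-word list.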

In [4]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
words = [word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

Machine learning study computer algorithms improve automatically experience. seen subset artificial intelligence. Machine learning algorithms build mathematical model based sample data, known training data, order make predictions decisions explicitly programmed so. Machine learning algorithms used wide variety applications, email filtering computer vision, difficult infeasible develop conventional algorithms perform needed tasks.
Old length:  556
New length:  433


In [5]:
tokens = tokenize(new_text)

In [6]:
tokens[:10]

['machine',
 'learning',
 'study',
 'computer',
 'algorithms',
 'improve',
 'automatically',
 'experience',
 'seen',
 'subset']

Another useful operation is to create a map between tokens and indices, and vice versa. In a sense, we are creating a lookup table that allows us to easily convert from words to indices, and indices to words. This will be particularly useful later on when we perform operations such as one-hot encoding.

In [7]:
def mapping(tokens):
    word_to_id = {}
    id_to_word = {}
    
    for i, token in enumerate(set(tokens)):
        word_to_id[token] = i
        id_to_word[i] = token
    
    return word_to_id, id_to_word

Let’s check if the word-to-index and index-to-word maps have successfully been created.

In [8]:
word_to_id, id_to_word = mapping(tokens)
word_to_id

{'model': 0,
 'vision': 1,
 'order': 2,
 'training': 3,
 'infeasible': 4,
 'decisions': 5,
 'data': 6,
 'so': 7,
 'applications': 8,
 'develop': 9,
 'automatically': 10,
 'build': 11,
 'computer': 12,
 'programmed': 13,
 'seen': 14,
 'variety': 15,
 'improve': 16,
 'sample': 17,
 'used': 18,
 'intelligence': 19,
 'known': 20,
 'filtering': 21,
 'perform': 22,
 'algorithms': 23,
 'tasks': 24,
 'artificial': 25,
 'based': 26,
 'experience': 27,
 'mathematical': 28,
 'explicitly': 29,
 'needed': 30,
 'predictions': 31,
 'conventional': 32,
 'email': 33,
 'study': 34,
 'make': 35,
 'subset': 36,
 'learning': 37,
 'machine': 38,
 'wide': 39,
 'difficult': 40}

As we can see, the lookup table is a dictionary object containing the relationship between words and ids. Note that each entry in this lookup table is a token created using the tokenize() function we defined earlier.
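
One caveat: iterating over a `set` of strings does not guarantee the same order across Python sessions, so the ids shown above may differ from run to run. If reproducible ids matter, one option is to sort the vocabulary first. The sketch below is an assumption, not part of the original notebook; `mapping_sorted` and the `demo_` names are hypothetical.

```python
def mapping_sorted(tokens):
    # Hypothetical variant of mapping(): sort the vocabulary so that
    # every run assigns the same id to the same word.
    vocab = sorted(set(tokens))
    word_to_id = {word: i for i, word in enumerate(vocab)}
    id_to_word = {i: word for word, i in word_to_id.items()}
    return word_to_id, id_to_word

demo_w2i, demo_i2w = mapping_sorted(["machine", "learning", "machine", "data"])
print(demo_w2i)
print(demo_i2w)
```

Duplicates collapse into a single entry, and the two dictionaries remain exact inverses of each other.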

### Generating Training Data

* Now that we have tokenized the text and created the lookup tables, we can generate the actual training data, which takes the form of matrices.
* Since tokens are still strings, we need to encode them numerically using one-hot vectorization.
* We also need to generate pairs of input and target values, as this is a supervised learning technique.
* We loop through each word (or token) in the text. In each iteration, we look at the words to the left and right of the input word, up to the window size.

We basically iterate over the tokenized data and generate pairs. One technicality here is that, for the first and last few tokens, it may not be possible to obtain words to the left or right of that input token. In those cases, we simply don’t consider these word pairs and look at only what is feasible without causing IndexErrors. Also note that we create X and y separately instead of putting them in tuple form.

In [9]:
import numpy as np

np.random.seed(42)


def generate_training_data(tokens, word_to_id, window):
    X = []
    y = []
    n_tokens = len(tokens)
    
    for i in range(n_tokens):
        idx = concat(
            range(max(0, i - window), i), 
            range(i, min(n_tokens, i + window + 1))
        )
        for j in idx:
            if i == j:
                continue
            X.append(one_hot_encode(word_to_id[tokens[i]], len(word_to_id)))
            y.append(one_hot_encode(word_to_id[tokens[j]], len(word_to_id)))
    
    return np.asarray(X), np.asarray(y)

Below is the definition for concat, an auxiliary function we used above to combine two range() objects.

In [10]:
def concat(*iterables):
    for iterable in iterables:
        yield from iterable

Also, here is the code we use to one-hot vectorize tokens. This process is necessary in order to represent each token as a vector, which can then be stacked to create the matrices X and y.

In [11]:
def one_hot_encode(id, vocab_size):
    res = [0] * vocab_size
    res[id] = 1
    return res
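
Before running this on the full text, the pairing logic can be checked on a toy sentence. The sketch below lists (input, target) word pairs directly instead of one-hot vectors; `context_pairs` and the toy tokens are illustrative names, not part of the original notebook.

```python
def context_pairs(tokens, window):
    """List (center, context) word pairs for a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at both ends of the token list.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

toy = ["machine", "learning", "study", "computer"]
toy_pairs = context_pairs(toy, window=1)
for pair in toy_pairs:
    print(pair)
```

For n tokens and window size w (with n at least w), this scheme yields 2·n·w − w·(w + 1) pairs: 6 for the toy sentence, and 194 for 50 tokens with a window of two.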

Finally, let’s generate some training data with a window size of two.

In [12]:
X, y = generate_training_data(tokens, word_to_id, 2)

In [13]:
X.shape

(194, 41)

In [14]:
y.shape

(194, 41)

In [15]:
X[0], y[0]

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]))

In [16]:
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = tf.constant(indices[i: min(i + batch_size, num_examples)])
        yield tf.gather(features, j), tf.gather(labels, j)

In [17]:
batch_size = 10

# Use separate names for the batch so the full X and y are not overwritten
for X_batch, y_batch in data_iter(batch_size, X, y):
    print(X_batch, '\n', y_batch)
    break

tf.Tensor(
[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]], shape=(10, 41), dtype=int64) 
 tf.Tensor(



### Initializing Model Parameters

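W1 and W2 are the two weight matrices of the model. W1 has shape (vocab_size, num_hiddens) and acts as the embedding layer; W2 has shape (num_hiddens, vocab_size) and maps an embedding back to one score per vocabulary word.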


In [18]:
vocab_size = len(word_to_id)
num_hiddens = 10  # Embedding dimension

# Input-to-hidden weights (the embedding matrix)
W1 = tf.Variable(tf.random.normal(shape=(vocab_size, num_hiddens), mean=0, stddev=0.01))

# Hidden-to-output weights
W2 = tf.Variable(tf.random.normal(shape=(num_hiddens, vocab_size), mean=0, stddev=0.01))

params = [W1, W2]

### Softmax Activation

We use softmax as the output activation.
It converts each row of raw scores into a probability vector whose elements sum to one.

This final output can be considered as context predictions, i.e. which words are likely to be in the window vicinity of the input word.

In [19]:
def softmax(X):
    X_exp = tf.exp(X)
    partition = tf.reduce_sum(X_exp, 1, keepdims=True)
    return X_exp / partition
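
The version above exponentiates the raw scores directly, which is fine for the small values in this notebook but can overflow for large scores. A common variant subtracts the row-wise maximum first. The NumPy sketch below is an illustration only and is not used by the rest of the notebook; `softmax_stable` is a hypothetical name.

```python
import numpy as np

def softmax_stable(logits):
    # Subtracting the row-wise maximum does not change the result
    # mathematically, but it keeps np.exp from overflowing.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0],
                   [0.0, 0.0, 0.0]])
probs = softmax_stable(logits)
print(probs)
print(probs.sum(axis=1))
```

Each row still sums to one, and the row of equal scores becomes a uniform distribution.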

### Forward Propagation

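Forward propagation amounts to two matrix multiplications followed by softmax, written here in TensorFlow: the one-hot input is multiplied by W1 to obtain the hidden representation, which is then multiplied by W2 to obtain one score per vocabulary word.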

In [20]:
def net(X):
    X = tf.reshape(tf.cast(X, dtype=tf.float32), (-1, vocab_size))
    a1 = tf.matmul(X, W1)
    a2 = tf.matmul(a1, W2)
    return softmax(a2)
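
Because the input is one-hot, the first multiplication simply selects one row of the first weight matrix. The NumPy sketch below illustrates this with a small random matrix; the `_demo` names and sizes are stand-ins, not the notebook's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size_demo, embed_dim_demo = 5, 3

# Stand-in for W1: one row per vocabulary word.
W1_demo = rng.normal(size=(vocab_size_demo, embed_dim_demo))

# One-hot vector for the word with id 2.
one_hot = np.zeros(vocab_size_demo)
one_hot[2] = 1.0

hidden = one_hot @ W1_demo
print(hidden)
print(W1_demo[2])
```

The two printed vectors are identical, which is why the rows of the first weight matrix can be read off as word vectors.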

### Loss Function

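The loss function is cross-entropy. Since each target is one-hot, the loss for an example reduces to the negative log of the probability the model assigns to the true context word.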

In [21]:
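def cross_entropy(y_hat, y):
    # Cast the one-hot labels to booleans so they can be used as a mask
    # that picks out the predicted probability of the true class.
    return -tf.math.log(tf.boolean_mask(y_hat, tf.cast(y, tf.bool)))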
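
As a small worked example of the same idea in NumPy (the numbers are made up for illustration), masking the predictions with the one-hot labels leaves one probability per example, and the loss is its negative log:

```python
import numpy as np

# Hypothetical predictions for two examples over three classes.
y_hat_demo = np.array([[0.1, 0.7, 0.2],
                       [0.3, 0.3, 0.4]])
# One-hot labels: class 1 for the first example, class 2 for the second.
y_demo = np.array([[0, 1, 0],
                   [0, 0, 1]])

# Boolean indexing keeps the probability assigned to the true class.
picked = y_hat_demo[y_demo.astype(bool)]
losses = -np.log(picked)
print(picked)
print(losses)
```

This matches the full cross-entropy sum `-(y * log(y_hat)).sum(axis=1)`, because all other terms are multiplied by zero.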

### Optimization

We use minibatch stochastic gradient descent as the optimizer.

In [22]:
def sgd(params, grads, lr, batch_size):
    """Minibatch stochastic gradient descent."""
    for param, grad in zip(params, grads):
        param.assign_sub(lr*grad/batch_size)
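
The division by `batch_size` turns a gradient summed over the batch into an average. The NumPy sketch below shows the same update rule on a toy problem; `sgd_step`, the targets, and the squared-error loss are assumptions made for illustration.

```python
import numpy as np

def sgd_step(param, grad_sum, lr, batch_size):
    # grad_sum is the gradient of the summed per-example loss;
    # dividing by batch_size gives the gradient of the mean loss.
    return param - lr * grad_sum / batch_size

w = np.array([0.0])
targets = np.array([1.0, 2.0, 3.0])  # per-example loss: (w - t) ** 2

for _ in range(200):
    grad_sum = np.sum(2 * (w - targets))
    w = sgd_step(w, grad_sum, lr=0.1, batch_size=len(targets))

print(w)
```

With this loss, the parameter converges to the mean of the targets.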

### Training

In [23]:
num_epochs = 10
lr = 0.1
loss = cross_entropy

for epoch in range(num_epochs):
    # Use separate batch names so the full X and y are not overwritten
    for X_batch, y_batch in data_iter(batch_size, X, y):
        # Record the forward pass so gradients can be computed
        with tf.GradientTape() as tape:
            y_hat = net(X_batch)
            l = loss(y_hat, y_batch)
        # Compute the gradient of l with respect to [W1, W2]
        grads = tape.gradient(l, params)
        # Update parameters using their gradient
        sgd(params, grads, lr, batch_size)

### Testing

Here we test the model on the word "learning" and print the vocabulary sorted from the most to the least likely context word.

In [24]:
learning = one_hot_encode(word_to_id["learning"], len(word_to_id))
result = net([learning])[0]

for word in (id_to_word[id] for id in np.argsort(result)[::-1]):
    print(word)

sample
known
improve
build
so
data
algorithms
difficult
training
tasks
infeasible
needed
automatically
email
explicitly
variety
learning
used
seen
subset
study
machine
make
filtering
mathematical
vision
computer
artificial
applications
perform
conventional
predictions
wide
decisions
based
programmed
experience
develop
model
order
intelligence


### Embedding

The key idea behind word embeddings is that the rows of the first weight matrix are dense representations of the tokens: multiplying a one-hot vector by W1 selects the row for that token. In our example, therefore, the embeddings can be obtained simply by inspecting W1:

In [25]:
W1

<tf.Variable 'Variable:0' shape=(41, 10) dtype=float32, numpy=
array([[ 4.95584542e-03, -6.71986071e-03, -5.92995016e-03,
        -6.67822128e-03, -4.51740436e-03,  8.82991962e-03,
        -2.54769693e-03, -1.92553527e-03, -1.20385168e-02,
         2.48499238e-03],
       [ 5.92960091e-03, -1.06932840e-03,  7.76873948e-03,
        -1.37154907e-02, -1.62698328e-02,  5.19725541e-03,
         4.72769985e-04,  8.68969969e-03,  1.11908894e-02,
         1.39254611e-02],
       [-3.11099156e-03,  1.12368008e-02,  1.72610674e-02,
        -1.97626781e-02,  1.21495221e-02,  1.13645811e-02,
         6.05434040e-03,  1.96958939e-03, -7.07712956e-03,
         1.57115818e-03],
       [ 2.85598799e-03, -5.37711428e-03,  6.09529903e-03,
        -7.15362420e-03,  5.26194880e-03,  1.22330440e-02,
         2.59799766e-03, -1.20184645e-02, -5.05101774e-03,
         1.97129454e-02],
       [ 1.69306935e-03,  4.55578929e-03, -1.04620565e-04,
         6.15315186e-03,  2.42418591e-02, -1.48237748e-02,
       

In particular, what we want is to be able to input a word through a function and receive as output the embedding vector for that given word. Below is a function that implements this feature.

In [26]:
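def get_embedding(word):
    # Look up the row of W1 that corresponds to `word`.
    try:
        idx = word_to_id[word]
    except KeyError:
        print("`word` not in corpus")
        # Return early; otherwise `idx` would be undefined below.
        return None
    
    return W1[idx]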

In [27]:
get_embedding("machine")

<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([ 0.00351905, -0.00263356,  0.00197692, -0.00025845,  0.00620847,
       -0.01444763,  0.01027434, -0.00169978, -0.00776967,  0.00928257],
      dtype=float32)>
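
One common way to use such vectors is to compare them with cosine similarity. The sketch below is a NumPy illustration on a made-up matrix, not on the trained W1; `cosine_similarity` and `emb_demo` are hypothetical names.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1 means same direction,
    # 0 means orthogonal.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up embedding matrix with one row per word.
emb_demo = np.array([[1.0, 0.0],
                     [2.0, 0.0],
                     [0.0, 3.0]])

sim_same_direction = cosine_similarity(emb_demo[0], emb_demo[1])
sim_orthogonal = cosine_similarity(emb_demo[0], emb_demo[2])
print(sim_same_direction, sim_orthogonal)
```

With the notebook's own parameters, the inputs would be vectors returned by `get_embedding`, converted to NumPy arrays; with a corpus this small the similarities are unlikely to be meaningful.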