# word2vec

## Setup

In [1]:
import io
import re
import string
import tqdm

import numpy as np

import tensorflow as tf
from tensorflow.keras import layers

2023-09-07 23:12:32.145306: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [3]:
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

### Generate training data

Compile all the steps described above into a function that can be called on a list of vectorized sentences obtained from any text dataset. Notice that the sampling table is built before sampling skip-gram word pairs. You will use this function in the later sections.
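To give a feel for what the sampling table holds, the sketch below builds a rank-based table under a Zipf assumption, where entry `i` approximates the probability of keeping the `i`-th most frequent token. It is a NumPy approximation written for illustration, not the Keras source; the function name `zipf_sampling_table` and the constants are assumptions.

```python
import numpy as np

def zipf_sampling_table(size, sampling_factor=1e-5):
    """Keep-probability per frequency rank, assuming Zipf-distributed tokens."""
    gamma = 0.577  # approximate Euler-Mascheroni constant (assumed value)
    rank = np.arange(size, dtype=np.float64)
    rank[0] = 1.0  # avoid log(0) for index 0
    # Approximate inverse frequency of the token at each rank.
    inv_freq = rank * (np.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
    f = sampling_factor * inv_freq
    # Frequent tokens (small f) get a keep-probability well below 1.
    return np.minimum(1.0, f / np.sqrt(f))

table = zipf_sampling_table(10000)
print(table[:5])
```

Under this sketch the most frequent tokens are kept rarely and the probability rises with rank, which is why very common words contribute fewer pairs.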

In [4]:
# Generates skip-gram pairs for a list of sequences (int-encoded sentences)
# based on window size and vocabulary size. Negative pairs are drawn by
# `skipgrams` itself (`negative_samples=0.7`), so `num_ns` and `seed` are
# accepted but not used here.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, train_labels = [], [], []

  # Build the sampling table for `vocab_size` tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in the dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate skip-gram pairs for a sequence (sentence): positive pairs
    # (label 1) plus randomly drawn negative pairs (label 0).
    skip_grams, labels = tf.keras.preprocessing.sequence.skipgrams(
        sequence,
        vocabulary_size=vocab_size,
        sampling_table=sampling_table,
        window_size=window_size,
        negative_samples=0.7)

    # Append each (target, context) pair and its label to the global lists.
    for (target_word, context_word), label in zip(skip_grams, labels):
      targets.append(target_word)
      contexts.append(context_word)
      train_labels.append(label)

  return targets, contexts, train_labels
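The pairing logic can be sketched in plain Python. The helper below, `toy_skipgrams`, is a hypothetical simplification written for this note: it pairs each non-padding token with its neighbours inside the window (label 1) and then adds random negative pairs (label 0) in proportion to `negative_samples`. It omits the frequency-based subsampling and the final shuffle, so it is not a drop-in replacement for the Keras function.

```python
import random

def toy_skipgrams(sequence, vocabulary_size, window_size, negative_samples, seed=None):
    rng = random.Random(seed)
    couples, labels = [], []
    for i, target in enumerate(sequence):
        if target == 0:  # index 0 is padding, so it is never a target
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j] != 0:
                couples.append([target, sequence[j]])
                labels.append(1)
    # Add negatives: reuse observed targets, pair them with random token ids.
    num_negative = int(len(labels) * negative_samples)
    seen_targets = [c[0] for c in couples]
    rng.shuffle(seen_targets)
    for k in range(num_negative):
        random_context = rng.randint(1, vocabulary_size - 1)
        couples.append([seen_targets[k % len(seen_targets)], random_context])
        labels.append(0)
    return couples, labels

pairs, pair_labels = toy_skipgrams([1, 2, 3, 0, 0], vocabulary_size=10,
                                   window_size=1, negative_samples=0.7, seed=42)
print(len(pairs), sum(pair_labels))
```

Because negatives are drawn uniformly at random, a "negative" pair can occasionally coincide with a real context word; with a large vocabulary this is rare.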

## Prepare training data for word2vec

With an understanding of how to work with one sentence for a skip-gram negative sampling based word2vec model, you can proceed to generate training examples from a larger list of sentences!

### Download text corpus


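This notebook reads sentences from five local text files (`train_sentences_1.txt` through `train_sentences_5.txt`); judging by the vocabulary printed later, the text is Turkish. Change the paths in the following cell to run this code on your own data.

Construct a `tf.data.TextLineDataset` object from these files for the next steps. The filter that would drop empty lines is commented out in the cell, so empty lines are kept: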

In [5]:
BASE = "/mnt/c/Users/mnusr/PycharmProjects/news_crawler/mlm_train_dataset"
text_ds = tf.data.TextLineDataset([
    f"{BASE}/train_sentences_5.txt",
    f"{BASE}/train_sentences_4.txt",
    f"{BASE}/train_sentences_3.txt",
    f"{BASE}/train_sentences_2.txt",
    f"{BASE}/train_sentences_1.txt",
])  # .filter(lambda x: tf.cast(tf.strings.length(x), bool))

2023-09-07 23:12:44.558652: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node
Your kernel may have been built without NUMA support.

### Vectorize sentences from the corpus

You can use the `TextVectorization` layer to vectorize sentences from the corpus. Learn more about using this layer in this [Text classification](https://www.tensorflow.org/tutorials/keras/text_classification) tutorial. The text needs to be in one case and punctuation needs to be removed. To do this, define a `custom_standardization` function that can be used in the `TextVectorization` layer.

In [6]:
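# Now, create a custom standardization function to lowercase the text and
# remove punctuation.
# Note: with the default `encoding=''`, `tf.strings.lower` lowercases ASCII
# letters only, so non-ASCII capitals (for example 'Ü' or 'Ş') are left
# unchanged; `encoding='utf-8'` enables Unicode-aware lowercasing.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')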


# Define the vocabulary size and the number of words in a sequence.
vocab_size = 10000
sequence_length = 15

# Use the `TextVectorization` layer to normalize, split, and map strings to
# integers. Set the `output_sequence_length` length to pad all samples to the
# same length.
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)
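For intuition, the same standardization can be sketched in plain Python. The name `standardize` is hypothetical. Note one difference from the TensorFlow version: Python's `str.lower` is Unicode-aware. Also, `string.punctuation` covers ASCII punctuation only, so a typographic apostrophe such as `’` is not removed.

```python
import re
import string

# Character class matching every ASCII punctuation character.
PUNCT_PATTERN = re.compile('[%s]' % re.escape(string.punctuation))

def standardize(text):
    # Lowercase first, then delete ASCII punctuation.
    return PUNCT_PATTERN.sub('', text.lower())

print(standardize("Hello, World! It's 2023."))
print(standardize("akp’nin"))  # the typographic apostrophe survives
```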

Call `TextVectorization.adapt` on the text dataset to build the vocabulary.


In [7]:
vectorize_layer.adapt(text_ds.batch(4096*8))

2023-09-07 23:13:47.686656: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous recv item cancelled. Key hash: 18216012323532874981
2023-09-07 23:13:47.686714: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous recv item cancelled. Key hash: 13521184409493339783
2023-09-07 23:13:47.686735: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous recv item cancelled. Key hash: 866342992198037303


Once the state of the layer has been adapted to represent the text corpus, the vocabulary can be accessed with `TextVectorization.get_vocabulary`. This function returns a list of all vocabulary tokens sorted (descending) by their frequency.

In [8]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 've', 'bir', 'bu', 'da', 'için', 'de', 'ile', 'olarak', 'çok', 'daha', 'olan', 'ise', 'en', 'sonra', 'kadar', 'göre', 'gibi', 'her']
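The mapping from tokens to integers can be sketched without TensorFlow. The helpers `build_vocab` and `encode` below are hypothetical stand-ins that mimic the conventions visible in the output above: index 0 is the padding token `''`, index 1 is `'[UNK]'`, and the remaining tokens are ordered by descending frequency. Whitespace splitting is an assumption of this sketch.

```python
from collections import Counter

def build_vocab(sentences, max_tokens):
    # Count whitespace-separated tokens across all sentences.
    counts = Counter(tok for s in sentences for tok in s.split())
    # Reserve two slots for padding and out-of-vocabulary tokens.
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def encode(sentence, vocab, sequence_length):
    index = {tok: i for i, tok in enumerate(vocab)}
    # Unknown tokens map to 1; truncate, then pad with 0 to a fixed length.
    ids = [index.get(tok, 1) for tok in sentence.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_vocab = build_vocab(["a b a", "a c b"], max_tokens=4)
print(toy_vocab)
print(encode("a c b d", toy_vocab, sequence_length=6))
```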


The `vectorize_layer` can now be used to generate vectors for each element in the `text_ds` (a `tf.data.Dataset`). Apply `Dataset.batch`, `Dataset.prefetch`, `Dataset.map`, and `Dataset.unbatch`.

In [9]:
# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(4096*8).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

### Obtain sequences from the dataset

You now have a `tf.data.Dataset` of integer-encoded sentences. To prepare the dataset for training a word2vec model, flatten the dataset into a list of sentence vector sequences. This step is required because you iterate over each sentence in the dataset to produce positive and negative examples.

Note: Since the `generate_training_data()` defined earlier uses non-TensorFlow Python/NumPy functions, you could also use a `tf.py_function` or `tf.numpy_function` with `tf.data.Dataset.map`.

In [10]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

18429092


Inspect a few examples from `sequences`:

In [11]:
for seq in sequences[:5]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[3914    2 1299 3674  916 5926    2 2454 9816 4899  311  124 1517  771
 3152] => ['radyo', 've', 'televizyon', 'Üst', 'kurumu', 'rtÜk', 've', 'akp’nin', 'sansür', 'uyguladığı', 'iddia', 'edilen', 'dijital', 'yayın', 'platformu']
[   1    1    1   35   97  108  729    1    9 2376    1    2    1    1
 6459] => ['[UNK]', '[UNK]', '[UNK]', 'yapılan', 'açıklamada', 'şöyle', 'denildi', '[UNK]', 'olarak', 'türkiyedeki', '[UNK]', 've', '[UNK]', '[UNK]', 'derinden']
[  64    1 3430 7681    1 2819 6972    0    0    0    0    0    0    0
    0] => ['birlikte', '[UNK]', 'birbirinden', 'yetenekli', '[UNK]', 'gurur', 'duyuyoruz', '', '', '', '', '', '', '', '']
[ 524  235 3341 3373   12    2 3955    1    1    1    6   10    1    2
    4] => ['Şu', 'anda', 'yapım', 'aşamasında', 'olan', 've', 'yakında', '[UNK]', '[UNK]', '[UNK]', 'için', 'çok', '[UNK]', 've', 'bu']
[   1   11 1593 2012    8   78    1    1 2994   11  111 9977    1    1
 2434] => ['[UNK]', 'daha', 'derin', 'işbirliği', 'ile', 'türk', '

### Generate training examples from sequences

`sequences` is now a list of int-encoded sentences. Call the `generate_training_data` function defined earlier to generate training examples for the word2vec model. To recap, the function iterates over each word from each sequence to collect positive and negative context words. The lengths of `targets`, `contexts`, and `labels` should be the same, representing the total number of training examples.

In [12]:
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=4,
    vocab_size=vocab_size,
    seed=SEED)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print('\n')
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")


100%|██████████| 18429092/18429092 [06:55<00:00, 44405.01it/s]




targets.shape: (259354210,)
contexts.shape: (259354210,)
labels.shape: (259354210,)


In [33]:
del sequences


In [39]:
import gc
gc.collect()

958

### Configure the dataset for performance

To perform efficient batching for the potentially large number of training examples, use the `tf.data.Dataset` API. After this step, you have a `tf.data.Dataset` object of `(target_word, context_word), (label)` elements to train your word2vec model. Note that the cell below keeps only the first 50,000,000 of the generated examples.
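As a quick sanity check on the numbers used in the next cell (50,000,000 examples, a batch size of 1024, and `drop_remainder=True`), the arithmetic below computes how many batches one epoch contains and how many examples the final partial batch would have held. The variable names are illustrative.

```python
num_examples = 50_000_000
batch_size = 1024

# With drop_remainder=True, only full batches are kept.
steps_per_epoch = num_examples // batch_size
dropped = num_examples - steps_per_epoch * batch_size

print(steps_per_epoch, dropped)  # prints: 48828 128
```

The step count agrees with the `48828` shown in the training progress bar later in the notebook.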

In [40]:
BATCH_SIZE = 1024
BUFFER_SIZE = 5000
dataset = tf.data.Dataset.from_tensor_slices((
    (targets[:50000000], contexts[:50000000]),
    labels[:50000000]))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

<_BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024,), dtype=tf.int64, name=None)), TensorSpec(shape=(1024,), dtype=tf.int64, name=None))>


In [41]:
#del dataset

Apply `Dataset.cache` and `Dataset.prefetch` to improve performance:

In [42]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<_PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024,), dtype=tf.int64, name=None)), TensorSpec(shape=(1024,), dtype=tf.int64, name=None))>


In [43]:
for x in dataset.take(1):
    print(x)

((<tf.Tensor: shape=(1024,), dtype=int64, numpy=array([6467, 7348, 8811, ..., 8422, 4164, 1132])>, <tf.Tensor: shape=(1024,), dtype=int64, numpy=array([   1, 4399, 4348, ...,    6,    1,   27])>), <tf.Tensor: shape=(1024,), dtype=int64, numpy=array([1, 0, 1, ..., 1, 1, 1])>)


2023-09-08 00:05:16.340078: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


## Model and training

The word2vec model can be implemented as a classifier that distinguishes true context words from skip-grams from false context words obtained through negative sampling. You can take the dot product between the embeddings of the target and context words to obtain predictions for the labels, and compute the loss against the true labels in the dataset.

### Subclassed word2vec model

Use the [Keras Subclassing API](https://www.tensorflow.org/guide/keras/custom_layers_and_models) to define your word2vec model with the following layers:

* `target_embedding`: A `tf.keras.layers.Embedding` layer, which looks up the embedding of a word when it appears as a target word. The number of parameters in this layer are `(vocab_size * embedding_dim)`.
* `context_embedding`: Another `tf.keras.layers.Embedding` layer, which looks up the embedding of a word when it appears as a context word. The number of parameters in this layer are the same as those in `target_embedding`, i.e. `(vocab_size * embedding_dim)`.
* `dot`: A `tf.keras.layers.Dot` layer that computes the dot product of target and context embeddings from a training pair. The class below has no separate `Flatten` layer; the output of `dot` is used directly as the logits.

With the subclassed model, you can define the `call()` function that accepts `(target, context)` pairs, passes each element into its corresponding embedding layer, and returns the dot product of the two embeddings.

Key point: The `target_embedding` and `context_embedding` layers can be shared as well. You could also use a concatenation of both embeddings as the final word2vec embedding.
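The forward pass described above can be sketched in NumPy with two small random embedding tables. All sizes and values here are toy assumptions; the point is only to show the shapes and the parameter count `2 * vocab_size * embedding_dim`.

```python
import numpy as np

rng = np.random.default_rng(42)
toy_vocab_size, toy_embedding_dim = 10, 4

# One table per role, each of shape (vocab_size, embedding_dim).
target_table = rng.normal(size=(toy_vocab_size, toy_embedding_dim))
context_table = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

toy_targets = np.array([2, 5, 7])   # (batch,)
toy_contexts = np.array([3, 5, 1])  # (batch,)

word_emb = target_table[toy_targets]       # (batch, embedding_dim)
context_emb = context_table[toy_contexts]  # (batch, embedding_dim)

# Row-wise dot product: one logit per (target, context) pair.
logits = np.sum(word_emb * context_emb, axis=-1)  # (batch,)
probs = 1.0 / (1.0 + np.exp(-logits))             # sigmoid of each logit

num_params = target_table.size + context_table.size
print(logits.shape, num_params)
```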

In [44]:
class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super().__init__()
    self.target_embedding = layers.Embedding(vocab_size,
                                             embedding_dim,
                                             input_length=1,
                                             name="w2v_embedding")
    self.context_embedding = layers.Embedding(vocab_size,
                                              embedding_dim,
                                              input_length=1)
    self.dot = tf.keras.layers.Dot(axes=-1)

  def call(self, pair):
    target, context = pair
    # target: (batch,) or (batch, 1); context: (batch,) in this notebook.
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    # target: (batch,)
    word_emb = self.target_embedding(target)
    # word_emb: (batch, embed)
    context_emb = self.context_embedding(context)
    # context_emb: (batch, embed)
    dots = self.dot([word_emb, context_emb])
    # dots: (batch, 1), one logit per (target, context) pair
    return dots

### Define loss function and compile model


For simplicity, this notebook uses `tf.keras.losses.BinaryCrossentropy(from_logits=True)` as the loss, since each training example carries a binary label. If you would like to write your own custom loss function, you can also do so as follows (Keras calls loss functions with the true labels first, then the predictions):

``` python
def custom_loss(y_true, x_logit):
  return tf.nn.sigmoid_cross_entropy_with_logits(
      logits=x_logit, labels=tf.cast(y_true, x_logit.dtype))
```
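To make the loss concrete, here is a NumPy sketch of sigmoid cross-entropy on logits, using the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` for logit `x` and label `z`. The function name `sigmoid_cross_entropy` is illustrative.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    x = np.asarray(logits, dtype=np.float64)
    z = np.asarray(labels, dtype=np.float64)
    # Stable rewrite of -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x)).
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

toy_loss = sigmoid_cross_entropy([0.0, 2.0, -2.0], [1, 1, 0])
print(toy_loss)
```

A logit of 0 gives a loss of ln 2 ≈ 0.693 regardless of the label, which is roughly what an untrained model reports at its first training step.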

It's time to build your model! Instantiate your word2vec class with an embedding dimension of 128 (you could experiment with different values). Compile the model with the `tf.keras.optimizers.Adam` optimizer. 

In [45]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                 run_eagerly=False,
                 jit_compile=True,
                 metrics=['accuracy'])

Also define a callback to log training statistics for TensorBoard:

In [46]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

Train the model on the `dataset` for some number of epochs:

In [47]:
word2vec.fit(dataset, epochs=2, callbacks=[tensorboard_callback])

Epoch 1/2


2023-09-08 00:05:25.219440: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f48180aa1a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-09-08 00:05:25.219474: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3070 Ti, Compute Capability 8.6
2023-09-08 00:05:25.294425: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-09-08 00:05:26.969064: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600


    1/48828 [..............................] - ETA: 46:47:57 - loss: 0.6933 - accuracy: 0.3799

2023-09-08 00:05:28.018376: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/2


<keras.src.callbacks.History at 0x7f490c2e9b90>

TensorBoard now shows the word2vec model's accuracy and loss:

In [None]:
#docs_infra: no_execute
%tensorboard --logdir logs


## Embedding lookup and analysis

Obtain the weights from the model using `Model.get_layer` and `Layer.get_weights`. The `TextVectorization.get_vocabulary` function provides the vocabulary to build a metadata file with one token per line.

In [48]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Create and save the vectors and metadata files:

In [49]:
# Context managers close both files even if a write fails.
with io.open('vectors_tr_2.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata_tr_2.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
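One way to analyze the saved files is a nearest-neighbour lookup by cosine similarity. The sketch below is a minimal example under the assumption that the vectors file has one tab-separated vector per line and the metadata file has the matching token on the same line number, as written above. The helpers `load_embeddings` and `nearest` are hypothetical, and the in-memory toy files stand in for `vectors_tr_2.tsv` and `metadata_tr_2.tsv`.

```python
import io
import numpy as np

def load_embeddings(vectors_file, metadata_file):
    # Line i of the metadata file names the token for line i of the vectors file.
    words = [line.rstrip("\n") for line in metadata_file]
    vectors = np.array([[float(x) for x in line.split("\t")]
                        for line in vectors_file])
    return words, vectors

def nearest(word, words, vectors, k=3):
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    order = np.argsort(-sims)
    return [words[i] for i in order if words[i] != word][:k]

# Toy stand-ins for the real TSV files.
toy_vectors = io.StringIO("1.0\t0.0\n0.9\t0.1\n0.0\t1.0\n")
toy_metadata = io.StringIO("a\nb\nc\n")
toy_words, toy_matrix = load_embeddings(toy_vectors, toy_metadata)
print(nearest("a", toy_words, toy_matrix, k=2))
```

To run this on the real files, pass file objects opened with `encoding='utf-8'` instead of the `io.StringIO` objects.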