Logs:
- [2024/04/27]   
  A copy of `word2vec_wiki.ipynb`, but using TensorFlow instead of writing all    
  the procedures from scratch.  

- [2024/04/28]   
  All the procedures to create a training data set and the model are available   
  in [`word2vec` TensorFlow tutorial](https://www.tensorflow.org/text/tutorials/word2vec#vectorize_an_example_sentence)

  Using a Wikipedia article produced vectors that have no apparent clusters.


To do:
- I would like to reproduce `ch-21-nlp-02-wordVec.ipynb` with TensorFlow, but    
  it is more difficult than I thought. We can achieve this by creating a model   
  with a custom gradient descent.


In [1]:
import tensorflow as tf   # pip install tensorflow
import requests 
import json
import re
import string
import tqdm
import io
import numpy as np

from typing import List
from bs4 import BeautifulSoup
from scratch.deep_learning import Tensor
from scratch.word2vec import Vocabulary

In [2]:
%load_ext autoreload
%autoreload 2 

An example of `word2vec` for a Wikipedia article

In [42]:
url = "https://en.wikipedia.org/wiki/Actuarial_science"
# url = "https://en.wikipedia.org/wiki/Data_science"
html = requests.get(url).text
soup = BeautifulSoup(html, "html5lib")

# content = soup.find("div", "bodyContent")
content = soup.find("div", "mw-content-ltr")
regex = r"[\w']+|[\.]"

document = []
for paragraph in content("p"):
  words = re.findall(regex, paragraph.text)
  document.extend(words)

document

['Data',
 'science',
 'is',
 'an',
 'interdisciplinary',
 'academic',
 'field',
 '1',
 'that',
 'uses',
 'statistics',
 'scientific',
 'computing',
 'scientific',
 'methods',
 'processes',
 'algorithms',
 'and',
 'systems',
 'to',
 'extract',
 'or',
 'extrapolate',
 'knowledge',
 'and',
 'insights',
 'from',
 'potentially',
 'noisy',
 'structured',
 'or',
 'unstructured',
 'data',
 '.',
 '2',
 'Data',
 'science',
 'also',
 'integrates',
 'domain',
 'knowledge',
 'from',
 'the',
 'underlying',
 'application',
 'domain',
 'e',
 '.',
 'g',
 '.',
 'natural',
 'sciences',
 'information',
 'technology',
 'and',
 'medicine',
 '.',
 '3',
 'Data',
 'science',
 'is',
 'multifaceted',
 'and',
 'can',
 'be',
 'described',
 'as',
 'a',
 'science',
 'a',
 'research',
 'paradigm',
 'a',
 'research',
 'method',
 'a',
 'discipline',
 'a',
 'workflow',
 'and',
 'a',
 'profession',
 '.',
 '4',
 'Data',
 'science',
 'is',
 'a',
 'concept',
 'to',
 'unify',
 'statistics',
 'data',
 'analysis',
 'informatic

In [43]:
# Transform the document (a list of tokens) into a list of sentences
sentences = [sentence.strip()+"." for sentence in " ".join(document).split(".")]
sentences

['Data science is an interdisciplinary academic field 1 that uses statistics scientific computing scientific methods processes algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy structured or unstructured data.',
 '2 Data science also integrates domain knowledge from the underlying application domain e.',
 'g.',
 'natural sciences information technology and medicine.',
 '3 Data science is multifaceted and can be described as a science a research paradigm a research method a discipline a workflow and a profession.',
 '4 Data science is a concept to unify statistics data analysis informatics and their related methods to understand and analyze actual phenomena with data.',
 '5 It uses techniques and theories drawn from many fields within the context of mathematics statistics computer science information science and domain knowledge.',
 '6 However data science is different from computer science and information science.',
 'Turing Award winner Jim

In [44]:
corpus_text = "\n".join(sentences)
# Write explicitly as UTF-8 so non-ASCII characters from the article are kept
with open("./datasets/wiki_article.txt", "w", encoding="utf-8") as f:
  f.write(corpus_text)

### Preparing training data for word2vec

Use the non-empty lines to construct a `tf.data.TextLineDataset` object for  
the next steps.

In [61]:
# path_to_article = "./datasets/wiki_article.txt"
path_to_article = "./datasets/wiki_article_clean.txt"
# path_to_article = "./datasets/wiki_two_article_clean.txt"
text_ds = tf.data.TextLineDataset(path_to_article).filter(
  lambda x: tf.cast(tf.strings.length(x), bool))
text_ds

<_FilterDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

In [62]:
# Now, create custom standardization function to lowercase the text and
# remove punctuation
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(string.punctuation), "")

# Define the vocabulary size and the number of words in a sequence
vocab_size = 4096  # default for large corpus 4096
sequence_length = 30 

# Use the `TextVectorization` layer to normalize, split, and map strings to
# integers. Set the `output_sequence_length` length to pad all samples to the
# same length
vectorize_layer = tf.keras.layers.TextVectorization(
  standardize=custom_standardization, 
  max_tokens=vocab_size,
  output_mode="int",
  output_sequence_length=sequence_length)
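
To see what the standardization step does to a line of text, here is a minimal plain-Python sketch of the same lowercase-then-strip-punctuation rule. `standardize` and `PUNCT_PATTERN` are hypothetical names used only for this illustration; the real layer runs the TensorFlow string ops shown above.

```python
import re
import string

# Character class that matches any single ASCII punctuation mark.
PUNCT_PATTERN = "[%s]" % re.escape(string.punctuation)

def standardize(text: str) -> str:
  """Lowercase the text, then delete every punctuation character."""
  return re.sub(PUNCT_PATTERN, "", text.lower())

sample = "Actuaries' models, e.g. life tables, are data-driven!"
print(standardize(sample))
# actuaries models eg life tables are datadriven
```

Note that punctuation is deleted rather than replaced with a space, so a hyphenated word such as "data-driven" collapses into a single token.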

Call `TextVectorization.adapt` on the text dataset to create the vocabulary.

In [63]:
vectorize_layer.adapt(text_ds.batch(1024))

In [64]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 'the', 'of', 'and', 'to', 'in', 'actuarial', 'a', 'insurance', 'as', 'for', 'science', 'is', 'financial', 'by', 'life', 'be', 'actuaries', 'models']
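
The printed vocabulary starts with the padding token `''` and the out-of-vocabulary token `'[UNK]'`, followed by words in descending frequency. A rough plain-Python sketch of that idea follows; `build_vocabulary` is a hypothetical helper, and the alphabetical tie-break is an assumption of this sketch, not something the notebook verifies about `TextVectorization`.

```python
from collections import Counter

def build_vocabulary(lines, max_tokens):
  """Return padding, OOV marker, then words ordered by descending count."""
  counts = Counter(word for line in lines for word in line.split())
  # Ties are broken alphabetically (an assumption made for determinism).
  ranked = sorted(counts, key=lambda w: (-counts[w], w))
  # Two slots are reserved for '' (padding) and '[UNK]' (unknown words).
  return ["", "[UNK]"] + ranked[: max_tokens - 2]

lines = ["the actuary models the risk", "the risk of the model"]
toy_vocab = build_vocabulary(lines, max_tokens=6)
print(toy_vocab)
# ['', '[UNK]', 'the', 'risk', 'actuary', 'model']
```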


The `vectorize_layer` can now be used to generate vectors for each element in   
the `text_ds` (a `tf.data.Dataset`). Apply `Dataset.batch`, `Dataset.prefetch`,  
`Dataset.map`, and `Dataset.unbatch`

In [65]:
# Vectorize the data in text_ds
AUTOTUNE = tf.data.AUTOTUNE
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

`sequence` (singular) represents each sentence (or row) in the text file at `path_to_article`.  
`sequences` (plural) is now a list of int-encoded sentences.

In [66]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

82


Inspect a few examples from `sequences`:

In [67]:
for seq in sequences[:5]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[  7  12  13   2  76  20 717  34   4 141  30   5  82  32   6   9  33  72
 184   4  39 499   4 383   0   0   0   0   0   0] => ['actuarial', 'science', 'is', 'the', 'discipline', 'that', 'applies', 'mathematical', 'and', 'statistical', 'methods', 'to', 'assess', 'risk', 'in', 'insurance', 'pension', 'finance', 'investment', 'and', 'other', 'industries', 'and', 'professions', '', '', '', '', '', '']
[ 65 193  18 238 151 102   5 457 467   3 135   4  16 565   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0] => ['more', 'generally', 'actuaries', 'apply', 'rigorous', 'mathematics', 'to', 'model', 'matters', 'of', 'uncertainty', 'and', 'life', 'expectancy', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
[ 18  44 384 283   6  28  76   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0] => ['actuaries', 'are', 'professionals', 'trained', 'in', 'this', 'discipline', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
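
The trailing zeros in the output above are padding up to `sequence_length`. The following plain-Python sketch mimics that encoding on a toy vocabulary; `encode` and `toy_vocab` are hypothetical names, with index 0 for padding and index 1 for unknown words, matching the first two entries of the vocabulary printed earlier.

```python
def encode(sentence, vocab, sequence_length):
  """Map words to vocabulary indices, then truncate or zero-pad to a fixed length."""
  index = {word: i for i, word in enumerate(vocab)}
  ids = [index.get(word, 1) for word in sentence.split()]  # 1 = '[UNK]'
  ids = ids[:sequence_length]                               # truncate long sentences
  return ids + [0] * (sequence_length - len(ids))          # 0 = padding

toy_vocab = ["", "[UNK]", "the", "risk", "actuary", "model"]
encoded = encode("the actuary prices the risk", toy_vocab, sequence_length=8)
print(encoded)
# [2, 4, 1, 2, 3, 0, 0, 0]
```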

Define a function to generate training examples from sequences.   
This function iterates over each word from each sequence to collect positive   
and negative context words. Length of target, contexts and labels should be  
the same, representing the total number of training examples.
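
Before the TensorFlow version, a small plain-Python sketch may help show what a positive skip-gram pair is. `positive_skip_grams` is a hypothetical helper; unlike `tf.keras.preprocessing.sequence.skipgrams`, it does no frequency-based subsampling and no shuffling, so it returns every (target, context) pair inside the window.

```python
def positive_skip_grams(sequence, window_size):
  """Return (target, context) pairs within the window, skipping padding (0)."""
  pairs = []
  for i, target in enumerate(sequence):
    if target == 0:
      continue  # padding is never a target
    lo = max(0, i - window_size)
    hi = min(len(sequence), i + window_size + 1)
    for j in range(lo, hi):
      if j != i and sequence[j] != 0:
        pairs.append((target, sequence[j]))
  return pairs

pairs = positive_skip_grams([2, 4, 1, 3, 0, 0], window_size=1)
print(pairs)
# [(2, 4), (4, 2), (4, 1), (1, 4), (1, 3), (3, 1)]
```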

In [68]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for `vocab_size` tokens
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in the dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence). 
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      sequence, vocabulary_size=vocab_size, sampling_table=sampling_table,
      window_size=window_size, negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples 
    # with a positive context word and negative samples
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
        true_classes=context_class, num_true=1, num_sampled=num_ns,
        unique=True, range_max=vocab_size, seed=seed, name="negative_sampling")

      # Build context and label vectors (for one target word)
      context = tf.concat([tf.squeeze(context_class, 1), negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      # Append each element from the training example to global lists
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels
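
The negative samples above come from a log-uniform (Zipfian) distribution, which favors low indices, that is, frequent words. The sketch below reproduces that distribution in plain Python using the formula given in the TensorFlow documentation, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). `log_uniform_probability` and `sample_negatives` are hypothetical helpers; the sketch only rejects duplicates (like `unique=True`) and makes no attempt to exclude the positive context word.

```python
import math
import random

def log_uniform_probability(class_id, range_max):
  """Probability of drawing class_id from the log-uniform distribution."""
  return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

def sample_negatives(num_sampled, range_max, rng):
  """Draw num_sampled distinct class ids; requires num_sampled <= range_max."""
  probs = [log_uniform_probability(c, range_max) for c in range(range_max)]
  negatives = []
  while len(negatives) < num_sampled:
    candidate = rng.choices(range(range_max), weights=probs, k=1)[0]
    if candidate not in negatives:  # reject duplicates only
      negatives.append(candidate)
  return negatives

rng = random.Random(0)
range_max, num_ns, positive = 50, 4, 7
negatives = sample_negatives(num_ns, range_max, rng)

# One training example: the positive context first, then the negatives.
context = [positive] + negatives
label = [1] + [0] * num_ns
print(context, label)
```

With `num_ns=4`, each example has five context entries and five labels, which is why `contexts` and `labels` end up with a second dimension of 5.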

In [69]:
seed = 24_04_28 

targets, contexts, labels = generate_training_data(
  sequences=sequences, window_size=2, num_ns=4, vocab_size=vocab_size, 
  seed=seed)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print("\n")
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")


100%|██████████| 82/82 [00:00<00:00, 980.76it/s]



targets.shape: (538,)
contexts.shape: (538, 5)
labels.shape: (538, 5)





To perform efficient batching for the potentially large number of training  
examples, use the `tf.data.Dataset` API.

Also apply `Dataset.cache` and `Dataset.prefetch` to improve performance.
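
As a rough check on what batching does to the 538 examples generated above, here is a plain-Python sketch of fixed-size batching that drops the last incomplete batch. `batches` is a hypothetical helper; the batch size of 16 and the drop-remainder behavior mirror the settings in the cell that follows.

```python
def batches(items, batch_size, drop_remainder=True):
  """Split items into consecutive batches, optionally dropping a short final batch."""
  out = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
  if drop_remainder and out and len(out[-1]) < batch_size:
    out.pop()
  return out

examples = list(range(538))
batched = batches(examples, batch_size=16)
dropped = len(examples) - 16 * len(batched)
print(len(batched), dropped)
# 33 10
```

So each epoch sees 33 batches, and 10 examples are left out by the remainder rule.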

In [70]:
BATCH_SIZE = 16   # Must not exceed the number of samples (drop_remainder=True)
BUFFER_SIZE = 10  # Smaller than the dataset, so the shuffle is only local
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<_PrefetchDataset element_spec=((TensorSpec(shape=(16,), dtype=tf.int64, name=None), TensorSpec(shape=(16, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(16, 5), dtype=tf.int64, name=None))>


### Create and train the model

In [71]:
class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    self.target_embedding = tf.keras.layers.Embedding(
      vocab_size, embedding_dim, name="w2v_embedding")
    self.context_embedding = tf.keras.layers.Embedding(
      vocab_size, embedding_dim)

  def call(self, pair):
    target, context = pair
    # target: (batch, dummy?)     # The dummy axis doesn't exist in TF2.7+
    # context: (batch, context)
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    #target: (batch, )

    word_emb = self.target_embedding(target)
    # word_emb: (batch, embed)

    context_emb = self.context_embedding(context)
    # context_emb = (batch, context, embed)

    # b: batch index
    # e: embedding index
    # c: context index
    dots = tf.einsum("be,bce->bc", word_emb, context_emb)
    # dots: (batch, context)

    return dots
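
The `einsum` expression `"be,bce->bc"` takes, for every batch item, the dot product between the target embedding and each of its context embeddings. A plain-Python equivalent on nested lists is sketched below; `batched_dot` and the toy values are hypothetical and only illustrate the index pattern.

```python
def batched_dot(word_emb, context_emb):
  """For each batch item b and context c, sum over e of word[b][e] * ctx[b][c][e]."""
  return [
    [sum(w * c for w, c in zip(word_vec, ctx_vec)) for ctx_vec in ctx_vecs]
    for word_vec, ctx_vecs in zip(word_emb, context_emb)
  ]

# batch = 2, embed = 2, context = 3
word_emb = [[1.0, 2.0], [0.0, 1.0]]
context_emb = [
  [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
  [[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]],
]
dots = batched_dot(word_emb, context_emb)
print(dots)
# [[1.0, 2.0, 3.0], [3.0, 5.0, 7.0]]
```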

Define a loss function

In [72]:
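# Unused below. Keras calls loss(y_true, y_pred), so the labels come first.
def custom_loss(y_true, x_logit):
  return tf.nn.sigmoid_cross_entropy_with_logits(
    logits=x_logit, labels=tf.cast(y_true, x_logit.dtype))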
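
For reference, the per-element quantity that `tf.nn.sigmoid_cross_entropy_with_logits` computes can be sketched in plain Python using the numerically stable form given in the TensorFlow documentation, `max(x, 0) - x * z + log(1 + exp(-|x|))`. The function name `sigmoid_cross_entropy` is a hypothetical stand-in for illustration.

```python
import math

def sigmoid_cross_entropy(logit, label):
  """Stable binary cross-entropy for one logit x and one label z in {0, 1}."""
  return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# With a logit of 0 the model is undecided, so the loss is log(2) for either label.
print(sigmoid_cross_entropy(0.0, 1))
print(sigmoid_cross_entropy(0.0, 0))

# A confident, correct prediction costs little; a confident, wrong one costs a lot.
print(sigmoid_cross_entropy(5.0, 1) < sigmoid_cross_entropy(5.0, 0))
# True
```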

Build and compile the model

In [73]:
embedding_dim = 10   # default 128

word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(
  optimizer="adam", 
  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
  metrics=["accuracy"])
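
The compiled loss, `CategoricalCrossentropy(from_logits=True)`, applies a softmax across the five context scores and penalizes the model when the positive context (label 1) does not get most of the probability. A minimal plain-Python sketch of that computation for one example follows; `softmax_cross_entropy` is a hypothetical helper, not the Keras implementation.

```python
import math

def softmax_cross_entropy(logits, one_hot):
  """Cross-entropy between a one-hot label and softmax(logits), via log-sum-exp."""
  m = max(logits)  # subtract the max for numerical stability
  log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
  return -sum(y * (x - log_sum_exp) for x, y in zip(logits, one_hot))

label = [1, 0, 0, 0, 0]  # one positive context followed by four negatives

# Equal scores spread probability evenly, so the loss is log(5).
uniform_loss = softmax_cross_entropy([0.0] * 5, label)
print(math.isclose(uniform_loss, math.log(5)))
# True

# A high score on the positive context drives the loss toward zero.
confident_loss = softmax_cross_entropy([10.0, 0.0, 0.0, 0.0, 0.0], label)
print(confident_loss < 0.001)
# True
```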

Train the model on the `dataset` for some number of epochs

In [77]:
word2vec.fit(dataset, epochs=200)

Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78

<keras.src.callbacks.History at 0x22057089b50>

Obtain the weights from the model using `Model.get_layer` and `Layer.get_weights`.


In [78]:
weights = word2vec.get_layer("w2v_embedding").get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Create and save the vectors and metadata files

In [79]:
# Write one embedding vector per line, with the matching word in metadata.tsv
with io.open("./datasets/vectors.tsv", "w", encoding="utf-8") as out_v, \
     io.open("./datasets/metadata.tsv", "w", encoding="utf-8") as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue   # skip 0, it's padding
    vec = weights[index]
    out_v.write("\t".join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")

Upload `./datasets/vectors.tsv` and `./datasets/metadata.tsv` to the [Embedding Projector](https://projector.tensorflow.org/) to analyze the obtained embeddings:

- Use PCA
- Click "Spherical Data"
- Compare two distant points.
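
The embeddings can also be compared offline. One common measure is cosine similarity; the plain-Python sketch below ranks neighbors of a word by it. `cosine_similarity`, `nearest`, and the two-dimensional `toy_vectors` are hypothetical and stand in for the rows written to `vectors.tsv`; they are not values produced by this notebook.

```python
import math

def cosine_similarity(u, v):
  """Cosine of the angle between two vectors (0.0 if either has zero length)."""
  dot = sum(a * b for a, b in zip(u, v))
  norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
  return dot / norm if norm else 0.0

def nearest(word, vectors, k=1):
  """Return the k words whose vectors are most similar to the given word."""
  query = vectors[word]
  scored = [(cosine_similarity(query, vec), other)
            for other, vec in vectors.items() if other != word]
  return [other for _, other in sorted(scored, reverse=True)[:k]]

# Toy vectors, made up for illustration only.
toy_vectors = {
  "actuary": [1.0, 0.1],
  "insurance": [0.9, 0.2],
  "science": [0.1, 1.0],
}
print(nearest("actuary", toy_vectors, k=1))
# ['insurance']
```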