<br>
<font>
<div dir=ltr align=center>
<img src="https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png" width=150 height=150> <br>
<font color=0F5298 size=7>
    Machine learning <br>
<font color=2565AE size=5>
    Computer Engineering Department <br>
    Fall 2024<br>
<font color=3C99D size=5>
    Practical Assignment 5 - NLP - Skip-Gram <br>
<font color=0CBCDF size=4>
   &#x1F335; Amirhossein Akbari  &#x1F335;
</div>

____

<font color=9999FF size=4>
&#x1F388; Full Name : Farzan Rahmani
<br>
<font color=9999FF size=4>
&#x1F388; Student Number : 403210725

<font color=0080FF size=3>
This notebook explores word embeddings: compact, dense vector representations of words that capture their meaning. It focuses on implementing the Word2Vec algorithm using the Skip-gram architecture with negative sampling.
</font>
<br>

**Note:**
<br>
<font color=66B2FF size=2>In this notebook, you are free to use any function or model from TensorFlow to assist with the implementation. However, PyTorch is not permitted for this exercise. This ensures consistency and alignment with the tools being focused on.</font>
<br>
<font color=red size=3>**Run All Cells Before Submission**</font>: <font color=FF99CC size=2>Before saving and submitting your notebook, please ensure you run all cells from start to finish. This practice guarantees that your notebook is self-consistent and can be evaluated correctly by others.</font>

<font color=#ffb578 size=3>
You are free to modify, add, or remove any cells as you see fit to complete your tasks. Feel free to change any of the provided code or content to better suit your understanding and approach to the problems.

- **Questions**: If you have any questions or require clarifications as you work through the notebook, please do not hesitate to ask. You can post your queries on Quera or reach out via Telegram.
- **Feedback**: We encourage you to seek feedback and engage in discussions to enhance your learning experience and improve your solutions.
</font>

In [1]:
import io
import math
import gzip
import nltk
import time
import random
import numpy as np
import tensorflow as tf
import gensim.downloader as api
import tensorflow_datasets as tfds
nltk.download('stopwords')

from collections import Counter
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Downloading Dataset
We're going to use the text8 dataset. Text8 is the first 100,000,000 bytes of plain text from Wikipedia. It's mainly used for testing purposes.

In [2]:
def load_data():
  text8_zip_file_path = api.load('text8', return_path=True)
  with gzip.open(text8_zip_file_path, 'rb') as file:
    file_content = file.read()
  wiki = file_content.decode()
  return wiki

wiki = load_data()

### Preprocessing data

**Stopwords removal** - Begin by removing stopwords from the dataset, as they provide little to no value in learning word embeddings. Ensure your preprocessing pipeline filters out commonly used words such as "the," "and," or "of" that do not contribute to meaningful semantic relationships.
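
As a minimal sketch of this step, the snippet below filters a token list against a stopword set. The small `STOP_WORDS` set, the `remove_stopwords` helper, and the sample sentence are illustrative assumptions; the notebook itself relies on NLTK's English stopword list.

```python
# Hypothetical, hand-picked stopword set for illustration only;
# the notebook uses nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"the", "and", "of", "a", "in"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]

sample = "the history of anarchism and the origins of the movement".split()
filtered = remove_stopwords(sample)
print(filtered)  # ['history', 'anarchism', 'origins', 'movement']
```

Using a `set` keeps each membership check cheap, which matters when the corpus has millions of tokens.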

---

**Subsampling words** - In large corpora, the most frequent words can occur hundreds of millions of times, and such words usually carry little information. Reducing their frequency mitigates their negative impact on training. For example, co-occurrences of "English" and "Spanish" are far more informative than co-occurrences of "English" and "the" or "Spanish" and "of". To counter the imbalance between rare and frequent words, Mikolov et al. proposed the following heuristic formula for the probability of dropping a particular word:

$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $

where $t$ is a threshold value (heuristically set to 1e-5) and $f(w_i)$ is the relative frequency of word $w_i$ in the corpus.

Implement a subsampling mechanism to handle overly frequent words in the corpus. Use the heuristic formula provided by Mikolov et al. to calculate the probability of dropping a word based on its frequency. This step ensures the corpus maintains a balance between rare and frequent words, improving the quality of word co-occurrence relationships.
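
To get a feel for the formula, the sketch below evaluates the drop probability for a few relative frequencies. The helper name `drop_probability` and the clipping at zero are assumptions added for illustration; for words rarer than the threshold the raw formula is negative, which in practice means the word is always kept.

```python
import math

def drop_probability(freq, t=1e-5):
    """P(drop) = 1 - sqrt(t / f), clipped at 0 so words rarer than t are never dropped."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Relative frequencies from very common (1e-2) to very rare (1e-6).
for freq in (1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    print(f"f(w) = {freq:.0e} -> P(drop) = {drop_probability(freq):.3f}")
```

A word that makes up 0.1% of the corpus is dropped about 90% of the time, while a word at or below the threshold frequency is never dropped.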

---

**Filtering words** - Filter out rare words, as they lack sufficient context to be represented effectively. Retain only words that appear at least five times in the corpus to minimize noise and enhance the overall quality of the embeddings.
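
A small sketch of the frequency filter, assuming a toy token list. The helper `filter_rare` and the example tokens are hypothetical and only illustrate the "at least five occurrences" rule.

```python
from collections import Counter

def filter_rare(tokens, min_count=5):
    """Keep only tokens whose total count in `tokens` is at least `min_count`."""
    counts = Counter(tokens)
    return [tok for tok in tokens if counts[tok] >= min_count]

# Toy corpus: two words pass the threshold, two do not.
tokens = ["anarchism"] * 6 + ["zeno"] * 5 + ["hapax"] + ["rare"] * 4
kept = filter_rare(tokens)
print(sorted(set(kept)))  # ['anarchism', 'zeno']
```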


In [5]:
def preprocess_text(text):
    # Step 1: Replace punctuation with tokens to standardize the text for processing
    # Example: Replace '.', ',', and other punctuation marks with specific tokens
    text = text.replace('.', ' <PERIOD> ')
    text = text.replace(',', ' <COMMA> ')
    text = text.replace('"', ' <QUOTATION_MARK> ')
    text = text.replace(';', ' <SEMICOLON> ')
    text = text.replace(':', ' <COLON> ')
    text = text.replace('?', ' <QUESTION_MARK> ')
    text = text.replace('!', ' <EXCLAMATION_MARK> ')
    text = text.replace('(', ' <LEFT_PAREN> ')
    text = text.replace(')', ' <RIGHT_PAREN> ')
    text = text.replace('--', ' <HYPHENS> ')

    # Step 2: Convert text to lowercase and remove unnecessary whitespaces
    # Example: Apply text.lower() and text.strip()
    text = text.lower()
    text = text.strip()

    # Step 3: Remove stopwords from the text
    # Example: Filter out common words such as 'the', 'and', 'of' using a predefined stopwords list
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word not in stop_words]

    # Step 4: Remove words with frequency less than 5
    # Example: Count word frequencies and filter words appearing fewer than 5 times
    word_counts = Counter(words)
    words = [word for word in words if word_counts[word] >= 5]

    # Step 5: Subsample words using a threshold value (e.g., 1e-5)
    # Example: Implement the subsampling heuristic to reduce the frequency of overly common words
    t = 1e-5
    word_counts = Counter(words)
    total_words = len(words)
    freqs = {word: count / total_words for word, count in word_counts.items()}
    p_drop = {word: 1 - np.sqrt(t / freqs[word]) for word in word_counts}
    words = [word for word in words if random.random() < (1 - p_drop[word])]

    # Return the processed words and word counts
    return words, word_counts

processed_words, word_counts = preprocess_text(wiki)

It's always a good idea to take a quick look at a preprocessed sample before going further: you might notice a few things that, if handled, can enrich or correct your data. Think of this as a validation step.

In [6]:
len(processed_words)

3813027

In [7]:
word_counts.most_common(10)

[('one', 411764),
 ('zero', 264975),
 ('nine', 250430),
 ('two', 192644),
 ('eight', 125285),
 ('five', 115789),
 ('three', 114775),
 ('four', 108182),
 ('six', 102145),
 ('seven', 99683)]

In [8]:
# Take a quick look at a slice of preprocessed words (e.g., index 1500 to 1550)
# print(processed_words[1500:1550])
print(processed_words[85:85+50])

['zeno', 'repudiated', 'omnipotence', 'regimentation', 'proclaimed', 'moral', 'anabaptists', 'forerunners', 'anarchism', 'bertrand', 'writes', 'anabaptists', 'repudiated', 'law', 'man', 'guided', 'spirit', 'premise', 'arrive', 'communism', 'diggers', 'levellers', 'early', 'communistic', 'war', 'considered', 'forerunners', 'modern', 'era', 'mean', 'armand', 'nouveaux', 'voyages', 'dans', 'rique', 'prisons', 'priests', 'anarchy', 'libertarian', 'movement', 'repeatedly', 'anarchist', 'ancestors', 'revolution', 'william', 'godwin', 'enquiry', 'political', 'godwin', 'use']


### Hyperparameters
Set a few hyperparameters required for generating batches and for deciding the size of the word embeddings.



In [9]:
EMBEDDING_DIM = 128
BUFFER_SIZE = 10000
BATCH_SIZE = 1024
EPOCHS = 5

### Preparing TensorFlow Dataset using Skipgrams

**Generating Skipgrams**

Tokenize your preprocessed text and convert the words into their corresponding integer token IDs. Then, use the `skipgrams` function provided by Keras to generate (word, context) pairs. Ensure the following steps are completed:

- Generate positive samples: (word, word in the same window), with label 1.  
- Generate negative samples: (word, random word from the vocabulary), with label 0.  

Refer to Mikolov et al.'s paper, [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781v3.pdf), for more details on Skipgrams.
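
To make the pairing concrete, here is a plain-Python sketch of how positive (word, context) pairs arise from a sliding window. The helper `positive_pairs` is hypothetical and is not a reimplementation of the Keras `skipgrams` function, which additionally shuffles the pairs and appends negative samples.

```python
def positive_pairs(sequence, window_size=2):
    """Pair every token ID with each neighbour at most `window_size` positions away."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, sequence[j]))
    return pairs

ids = [4, 7, 2, 9]  # toy sequence of token IDs
pairs = positive_pairs(ids, window_size=1)
print(pairs)  # [(4, 7), (7, 4), (7, 2), (2, 7), (2, 9), (9, 2)]
```

Every pair produced this way would carry label 1.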

---

**Negative Sampling**

For each input word, implement the negative sampling approach to optimize the training process. Transform the problem of predicting context words into independent binary classification tasks. For every (target, context) pair, generate random negative (target, ~context) samples. This step will reduce computational complexity and make training more efficient.
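
The idea can be sketched as follows: each positive pair keeps label 1, and for each one a number of random context IDs are drawn and given label 0. The helper `add_negative_samples`, its parameters, and the uniform sampling are assumptions made for simplicity; the word2vec authors draw negatives from a smoothed unigram distribution instead, and a random draw may occasionally coincide with a true context word.

```python
import random

def add_negative_samples(pos_pairs, vocab_size, num_negatives=1, seed=0):
    """Return (samples, labels): each positive pair followed by random negatives."""
    rng = random.Random(seed)
    samples, labels = [], []
    for target, context in pos_pairs:
        samples.append((target, context))
        labels.append(1)
        for _ in range(num_negatives):
            # ID 0 is skipped because the Keras Tokenizer reserves it for padding.
            negative = rng.randint(1, vocab_size - 1)
            samples.append((target, negative))
            labels.append(0)
    return samples, labels

samples, labels = add_negative_samples([(4, 7), (7, 2)], vocab_size=10, num_negatives=2)
print(labels)  # [1, 0, 0, 1, 0, 0]
```

Each sample is then an independent binary classification example, so no softmax over the full vocabulary is needed.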


In [10]:
# Step 1: Initialize and fit the tokenizer on preprocessed words
# Tokenize the preprocessed words and create a vocabulary index
tokenizer = Tokenizer()
tokenizer.fit_on_texts([processed_words])

# Step 2: Vectorize the words using the tokenizer's word index
# Convert the preprocessed words into vectorized tokens
word_sequences = tokenizer.texts_to_sequences([processed_words])[0]

# Step 3: Generate skipgram pairs and labels
# Use the skipgrams function to create (word, context) pairs with their labels
vocab_size = len(tokenizer.word_index) + 1
window_size = 2
negative_samples = 1
pairs, labels = skipgrams(word_sequences, vocab_size, window_size=window_size, negative_samples=negative_samples)

# Step 4: Extract target and context words from the generated pairs
# Separate the target words and context words for training
target_words, context_words = zip(*pairs)
target_words = np.array(target_words, dtype="int64")
context_words = np.array(context_words, dtype="int64")
labels = np.array(labels, dtype="int64")

# Step 5: Split the data into training and testing sets
# Define a sample size and divide the data into training and testing subsets
sample_size = len(pairs)
train_size = int(0.9 * sample_size)
train_data = (target_words[:train_size], context_words[:train_size], labels[:train_size])
test_data = (target_words[train_size:], context_words[train_size:], labels[train_size:])

# Step 6: Create TensorFlow datasets
# Prepare TensorFlow datasets for training and testing with appropriate batching and shuffling
# train_dataset = tf.data.Dataset.from_tensor_slices(train_data).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
# test_dataset = tf.data.Dataset.from_tensor_slices(test_data).batch(BATCH_SIZE)

train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)

# tf.data.AUTOTUNE is the current name for tf.data.experimental.AUTOTUNE
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [11]:
vocab_size

71141

In [12]:
# Print the number of batches in the training and testing datasets
print(f"Number of training batches: {len(train_dataset)}")
print(f"Number of testing batches: {len(test_dataset)}")

Number of training batches: 26811
Number of testing batches: 2979


### Building the Model

Use the model subclassing method to build your model. While Sequential and Functional APIs are generally more suitable for most use cases, model subclassing allows you to create the model in an object-oriented way. Follow these steps:

1. Define a custom model class by inheriting from `tf.keras.Model`.
2. Implement the `__init__` method to define the layers of your model.
3. Override the `call` method to define the forward pass of your model.
4. Ensure that the model includes embedding layers, a skip-gram architecture, and any other necessary components for training.
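
As a rough illustration of the forward pass such a model computes, the sketch below looks up a target and a context embedding, takes their dot product, and squashes it with a sigmoid. It uses NumPy rather than TensorFlow to stay dependency-light; the toy sizes, table names, and `forward` helper are assumptions, and the notebook's own model additionally passes the dot product through a small dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4  # toy sizes, not the notebook's values

# Two separate lookup tables: one for target words, one for context words.
target_table = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
context_table = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))

def forward(target_ids, context_ids):
    """Return P(label = 1) for each (target, context) pair in the batch."""
    t = target_table[target_ids]      # (batch, dim)
    c = context_table[context_ids]    # (batch, dim)
    logits = np.sum(t * c, axis=1)    # row-wise dot product, shape (batch,)
    return 1.0 / (1.0 + np.exp(-logits))

probs = forward(np.array([1, 2, 3]), np.array([4, 5, 6]))
print(probs.shape)  # (3,)
```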


In [14]:
# Step 1: Create a custom model class by subclassing `tf.keras.Model`
# Define a class that inherits from the Keras Model class

# Step 2: Initialize the layers in the `__init__` method
# Define all the layers such as embedding, dense, or output layers

# Step 3: Implement the forward pass in the `call` method
# Define how the input data flows through the model to produce the output

# Step 4: Ensure the model implements the skip-gram architecture
# Include logic for embedding lookups and processing positive and negative samples

# Step 5: Verify that the model structure aligns with the objective
# Test the forward pass to ensure proper layer connections and outputs

class SkipGramModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # `input_length` is omitted: it is not needed here and is deprecated in recent Keras versions.
        self.target_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim, name="target_embedding")
        self.context_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim, name="context_embedding")
        self.dense = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, pair):
        target, context = pair
        if len(target.shape) == 2:
            target = tf.squeeze(target, axis=1)
        if len(context.shape) == 2:
            context = tf.squeeze(context, axis=1)
        word_emb = self.target_embedding(target)
        context_emb = self.context_embedding(context)
        dot_product = tf.reduce_sum(tf.multiply(word_emb, context_emb), axis=1)
        output = self.dense(tf.expand_dims(dot_product, axis=1))
        return output

model = SkipGramModel(vocab_size, EMBEDDING_DIM)



### Loss function, Metrics and Optimizers

In [15]:
optimizer = tf.keras.optimizers.Adam()
# optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
# optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_fn = tf.keras.losses.BinaryCrossentropy()
train_acc_metric = tf.keras.metrics.BinaryAccuracy()
val_acc_metric = tf.keras.metrics.BinaryAccuracy()

### Training the Model

Implement custom training for learning word embeddings to gain finer control over optimization and training tasks. Follow these steps:

1. Define a custom training loop that includes forward propagation, loss computation, and backpropagation.
2. Use the optimizer of your choice to update the model's weights based on the computed gradients.
3. Implement batching for efficient data processing during training.
4. Monitor the loss during each epoch to track the model's performance.
5. Save the trained embeddings for later use once the training is complete.
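
To show what a single training step does under the hood, here is a NumPy sketch of forward propagation, binary cross-entropy loss, and a manual gradient update on two embedding tables. It uses plain SGD instead of Adam and toy data; the `sgd_step` helper and all values are assumptions for illustration, not the notebook's `tf.GradientTape` implementation.

```python
import numpy as np

def sgd_step(target_table, context_table, target_ids, context_ids, labels, lr=0.1):
    """One SGD step on the mean binary cross-entropy; updates both tables in place."""
    t = target_table[target_ids]     # copies of the looked-up rows, shape (batch, dim)
    c = context_table[context_ids]
    probs = 1.0 / (1.0 + np.exp(-np.sum(t * c, axis=1)))
    eps = 1e-7  # avoids log(0)
    loss = -np.mean(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))
    # d(loss)/d(logit) = (p - y) / batch_size; the chain rule gives the row gradients.
    grad_logits = (probs - labels)[:, None] / len(labels)
    # np.add.at accumulates correctly when the same ID appears more than once in a batch.
    np.add.at(target_table, target_ids, -lr * grad_logits * c)
    np.add.at(context_table, context_ids, -lr * grad_logits * t)
    return loss

rng = np.random.default_rng(0)
target_table = rng.normal(scale=0.1, size=(10, 4))
context_table = rng.normal(scale=0.1, size=(10, 4))
target_ids = np.array([1, 1, 2, 3])
context_ids = np.array([4, 5, 6, 7])
labels = np.array([1.0, 0.0, 1.0, 0.0])

# Repeating the step on the same batch should drive the loss down.
losses = [sgd_step(target_table, context_table, target_ids, context_ids, labels)
          for _ in range(200)]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In the notebook, `tf.GradientTape` and the optimizer take care of the gradient computation and the update.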

In [16]:
# Step 1: Define the training step
# Create a `train_step` function using `tf.GradientTape` to compute predictions, calculate loss, and apply gradients to update model weights

# Step 2: Define the testing step
# Create a `test_step` function to compute predictions and calculate validation loss without updating the model weights

# Step 3: Initialize the training loop
# Set up a loop to iterate over epochs and train the model for the defined number of iterations

# Step 4: Perform training on each batch
# For each batch in the training dataset, call the `train_step` function and accumulate the loss

# Step 5: Compute and display training accuracy
# Update and reset training accuracy metrics after each epoch and log the results

# Step 6: Perform validation on the test dataset
# For each batch in the test dataset, call the `test_step` function to calculate validation loss and accuracy

# Step 7: Log validation metrics
# Compute and log validation accuracy and cumulative test loss for each epoch

# Step 8: Track time per epoch
# Record and display the time taken to complete each epoch for performance monitoring

@tf.function
def train_step(inputs):
    target, context, label = inputs
    with tf.GradientTape() as tape:
        predictions = model((target, context))
        loss = loss_fn(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_acc_metric.update_state(label, predictions)
    return loss

@tf.function
def test_step(inputs):
    target, context, label = inputs
    predictions = model((target, context))
    loss = loss_fn(label, predictions)
    val_acc_metric.update_state(label, predictions)
    return loss

for epoch in range(EPOCHS):
    start_time = time.time()
    total_loss = 0
    for batch in train_dataset:
        batch_loss = train_step(batch)
        total_loss += batch_loss
    train_acc = train_acc_metric.result()
    train_acc_metric.reset_state()

    val_loss = 0
    for batch in test_dataset:
        batch_loss = test_step(batch)
        val_loss += batch_loss
    val_acc = val_acc_metric.result()
    val_acc_metric.reset_state()

    # print(f'Epoch: {epoch + 1}\nTraining Loss: {(total_loss / len(train_dataset)):.4f},  Training Accuracy: {train_acc:.4f}\nVal Loss: {(val_loss / len(test_dataset)):.4f}, Val Accuracy: {val_acc:.4f}\nTime taken: {(time.time() - start_time):.6f}s')
    print(f'Epoch: {epoch + 1}\nCumulative Training Loss: {(total_loss):.4f},  Training Accuracy: {train_acc:.4f}\nCumulative Val Loss: {(val_loss):.4f}, Val Accuracy: {val_acc:.4f}\nTime taken: {(time.time() - start_time):.6f}s')
    print("---------------------------------------")

Epoch: 1
Cumulative Training Loss: 12939.7012,  Training Accuracy: 0.7739
Cumulative Val Loss: 1283.1897, Val Accuracy: 0.8088
Time taken: 332.190586s
---------------------------------------
Epoch: 2
Cumulative Training Loss: 9630.5654,  Training Accuracy: 0.8435
Cumulative Val Loss: 1259.2485, Val Accuracy: 0.8142
Time taken: 274.679492s
---------------------------------------
Epoch: 3
Cumulative Training Loss: 7396.1396,  Training Accuracy: 0.8845
Cumulative Val Loss: 1392.9343, Val Accuracy: 0.8052
Time taken: 273.795422s
---------------------------------------
Epoch: 4
Cumulative Training Loss: 6041.0913,  Training Accuracy: 0.9098
Cumulative Val Loss: 1570.7882, Val Accuracy: 0.7973
Time taken: 272.512912s
---------------------------------------
Epoch: 5
Cumulative Training Loss: 5170.9248,  Training Accuracy: 0.9257
Cumulative Val Loss: 1762.4912, Val Accuracy: 0.7914
Time taken: 333.866028s
---------------------------------------


In [17]:
# Save the model weights to an HDF5 weights file
model.save_weights('skipgram_weights.weights.h5')

In [24]:
# print Word Embeddings shape
print("Model Word Embeddings shape: ", model.target_embedding.get_weights()[0].shape)

Model Word Embeddings shape:  (71141, 128)


### Word Embeddings Projector

Follow these steps to visualize the learned word embeddings using TensorFlow's Embedding Projector:

1. Extract the weights of the embedding layer from your trained model.
2. Save the weights into two files:
   - `vecs.tsv`: This file will store the actual vector representations of words.
   - `meta.tsv`: This file will store the associated metadata (e.g., word labels) for visualization.
3. Go to [TensorFlow Embedding Projector](http://projector.tensorflow.org/).
4. Upload the `vecs.tsv` and `meta.tsv` files created in the previous step.
5. Explore the visualizations provided by TensorFlow's Embedding Projector.
<font color=#ffb578>
6. Save the visualization of a word that best demonstrates the quality of your embeddings as an image and store it alongside the notebook.
7. Compress the folder into a `.zip` file and submit it as part of your work.

</font>
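
The file layout from step 2 can be sketched with in-memory buffers: each line of the vectors file holds one tab-separated embedding, and the same line of the metadata file holds its word. The helper `write_projector_files`, the toy weights, and the three-word vocabulary are assumptions for illustration only.

```python
import io
import numpy as np

def write_projector_files(weights, index_to_word, vecs_file, meta_file):
    """Write one embedding row and one label per line, in matching order."""
    for idx in sorted(index_to_word):
        meta_file.write(index_to_word[idx] + "\n")
        vecs_file.write("\t".join(f"{x:.6f}" for x in weights[idx]) + "\n")

# Toy embedding matrix; row 0 stands in for the unused padding index.
weights = np.arange(12, dtype=float).reshape(4, 3)
index_to_word = {1: "one", 2: "zero", 3: "nine"}

vecs_buf, meta_buf = io.StringIO(), io.StringIO()
write_projector_files(weights, index_to_word, vecs_buf, meta_buf)
print(meta_buf.getvalue().splitlines())     # ['one', 'zero', 'nine']
print(vecs_buf.getvalue().splitlines()[0])  # 3.000000	4.000000	5.000000
```

Both files must have the same number of lines, because the projector matches vectors to labels by line position.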


In [27]:
# Step 1: Access the embedding layer from the trained model
# Retrieve the first layer (embedding layer) from the model
embedding_layer = model.target_embedding

# Step 2: Extract the weights from the embedding layer
# Get the weights (word embeddings) as a NumPy array
weights = embedding_layer.get_weights()[0]
# Limit the number of words to visualize to reduce memory use and upload/download size
# words_to_visualize = 5000  # 5k
words_to_visualize = 10000  # 10k
# words_to_visualize = 12000  # 12k
# words_to_visualize = 20000  # 20k
# words_to_visualize = vocab_size  # all words

# Step 3: Open files to store embeddings and metadata
# Create two files - 'vecs.tsv' for embeddings and 'meta.tsv' for word metadata
with open('vecs.tsv', 'w') as vecs_file, open('meta.tsv', 'w') as meta_file:
    # Step 4: Iterate through the tokenizer's vocabulary
    # For each word in the vocabulary, write its metadata and embeddings to the files

    # for word, idx in tokenizer.word_index.items():
    for word, idx in list(tokenizer.word_index.items())[:words_to_visualize]:
        meta_file.write(word + '\n')
        vecs_file.write('\t'.join(map(str, weights[idx])) + '\n')

print("Embedding files created.")

Embedding files created.


In [20]:
!gzip /content/skipgram_weights.weights.h5

In [28]:
%ls -l

total 80212
-rw-r--r-- 1 root root    80207 Jan  4 16:07 meta.tsv
drwxr-xr-x 1 root root     4096 Jan  2 14:19 sample_data/
-rw-r--r-- 1 root root 67604774 Jan  4 16:00 skipgram_weights.weights.h5.gz
-rw-r--r-- 1 root root 14437207 Jan  4 16:07 vecs.tsv
