# Text and Sequences

Our tutorial today contains an introduction to word embeddings. You will train your own word embeddings using a simple model for a sentiment classification task, and then visualize them using two methods.


## Representing text/sequnces as numbers

Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model. In this section, you will look at three strategies for doing so.

### One-hot encodings

As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, you will create a zero vector with length equal to the vocabulary, then place a one in the index that corresponds to the word. This approach is shown in the following diagram.


<img src="https://www.tensorflow.org/text/guide/images/one-hot.png" alt="Diagram of one-hot encodings" width="400" />

To create a vector that contains the encoding of the sentence, you could then concatenate the one-hot vectors for each word.

Key point: This approach is inefficient. A one-hot encoded vector is sparse (meaning, most indices are zero). Imagine you have 10,000 words in the vocabulary. To one-hot encode each word, you would create a vector where 99.99% of the elements are zero.

### Encode each word with a unique number

A second approach you might try is to encode each word using a unique number. Continuing the example above, you could assign 1 to "cat", 2 to "mat", and so on. You could then encode the sentence "The cat sat on the mat" as a dense vector like [5, 1, 4, 3, 5, 2]. This approach is efficient. Instead of a sparse vector, you now have a dense one (where all elements are full).

There are two downsides to this approach, however:

* The integer-encoding is arbitrary (it does not capture any relationship between words).

* An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.

### Word embeddings

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

<img src="https://www.tensorflow.org/text/guide/images/embedding2.png" alt="Diagram of one-hot encodings" width="400" />


Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating point values. Another way to think of an embedding is as "lookup table". After these weights have been learned, you can encode each word by looking up the dense vector it corresponds to in the table.

Let's get started with an example

In [3]:
# Import necessary libraries
import os
import re
import string
import shutil
import tarfile
import urllib.request
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split

# Set random seeds for reproducibility
torch.manual_seed(123)
np.random.seed(123)

# Define device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Download the IMDb Dataset
You will use the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) through the tutorial. You will train a sentiment classifier model on this dataset and in the process learn embeddings from scratch. To read more about loading a dataset from scratch, see the [Loading text tutorial](https://www.tensorflow.org/tutorials/load_data/text).  

In [4]:
# Define the URL and download path
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
download_path = Path('./aclImdb_v1.tar.gz')

# Download the dataset if not already downloaded
if not download_path.exists():
    print("Downloading IMDB dataset...")
    urllib.request.urlretrieve(url, download_path)
    print("Download completed.")

# Extract the dataset
extract_path = Path('./aclImdb')
if not extract_path.exists():
    print("Extracting dataset...")
    with tarfile.open(download_path, 'r:gz') as tar:
        tar.extractall(path='.')
    print("Extraction completed.")

Downloading IMDB dataset...
Download completed.
Extracting dataset...
Extraction completed.


Take a look at the train/ directory. It has pos and neg folders with movie reviews labelled as positive and negative respectively. You will use reviews from pos and neg folders to train a binary classification model.

In [5]:
dataset_dir = './aclImdb'
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['urls_pos.txt',
 'unsup',
 'labeledBow.feat',
 'urls_neg.txt',
 'pos',
 'unsupBow.feat',
 'urls_unsup.txt',
 'neg']

The train directory also has additional folders which should be removed before creating training dataset.

In [6]:
# Remove the 'unsup' directory as in the original code
unsup_dir = extract_path / 'train' / 'unsup'
if unsup_dir.exists():
    shutil.rmtree(unsup_dir)
    print("Removed 'unsup' directory.")

Removed 'unsup' directory.


We will the train directory to create training and validation datasets with a split of 20% for validation

In [8]:
# Define parameters
batch_size = 1024
validation_split = 0.2
seed = 123

In [9]:
# Custom dataset class to load data from directories
class IMDBDataset(Dataset):
    def __init__(self, data_dir, subset='train'):
        self.texts = []
        self.labels = []
        for label in ['pos', 'neg']:
            labeled_dir = data_dir / subset / label
            for file_path in labeled_dir.iterdir():
                with open(file_path, encoding='utf-8') as f:
                    self.texts.append(f.read())
                    self.labels.append(1 if label == 'pos' else 0)
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

# Load training data
full_dataset = IMDBDataset(extract_path, 'train')

# Split into training and validation sets
train_size = int((1 - validation_split) * len(full_dataset))
val_size = len(full_dataset) - train_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size],
                                          generator=torch.Generator().manual_seed(seed))


Take a look at a few movie reviews and their labels (1: positive, 0: negative) from the train dataset

In [10]:
for i in range(3):
    print(full_dataset[i][1], full_dataset[i][0][:100])  # Print first 100 characters

1 ....after 16 years Tim Burton finally disappoints me!!!! Whatever happened to the old Burton who rea
1 Star Trek: Hidden Frontier is a long-running internet only fan film, done completely for the love of
1 This is a romantic comedy with the emphasis on comedy for a change. As usual the lovers--Sally Field


## Using the Embedding layer

Next, define the dataset preprocessing steps required for your sentiment classification model.

In [12]:
# Define a simple tokenizer (split by space)
def tokenizer(text):
    return text.split()

# Define text preprocessing function
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove HTML break tags
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Remove punctuation
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    return text

# Build vocabulary
def build_vocab(dataset, tokenizer, max_tokens):
    freq = {}
    for text, _ in dataset:
        tokens = tokenizer(preprocess(text))
        for token in tokens:
            freq[token] = freq.get(token, 0) + 1
    # Sort tokens by frequency
    sorted_tokens = sorted(freq.items(), key=lambda x: x[1], reverse=True)
    # Limit to max_tokens
    sorted_tokens = sorted_tokens[:max_tokens - 2]  # Reserve spots for <pad> and <unk>
    # Create word to index mapping
    vocab = {'<pad>': 0, '<unk>': 1}
    for idx, (word, _) in enumerate(sorted_tokens, start=2):
        vocab[word] = idx
    return vocab

# Parameters
vocab_size = 10000
sequence_length = 100

# Build vocabulary from training data
vocab = build_vocab(train_dataset, tokenizer, vocab_size)
inverse_vocab = {idx: word for word, idx in vocab.items()}

# Function to numericalize and pad/truncate sequences
def numericalize(text, vocab, tokenizer, seq_length):
    tokens = tokenizer(preprocess(text))
    numerical = [vocab.get(token, vocab['<unk>']) for token in tokens]
    if len(numerical) < seq_length:
        numerical += [vocab['<pad>']] * (seq_length - len(numerical))
    else:
        numerical = numerical[:seq_length]
    return numerical

# Define a collate function for DataLoader
def collate_batch(batch):
    texts, labels = zip(*batch)
    numericalized = [numericalize(text, vocab, tokenizer, sequence_length) for text in texts]
    padded = torch.tensor(numericalized, dtype=torch.long)
    labels = torch.tensor(labels, dtype=torch.float32)
    return padded, labels

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                          collate_fn=collate_batch, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False,
                        collate_fn=collate_batch, num_workers=4)


In [15]:
# Preview some batches
for texts, labels in train_loader:
    for i in range(3):
        print(labels[i].item(), texts[i].tolist())
    break

1.0 [325, 1669, 9213, 5545, 345, 35, 4912, 2, 1, 1290, 7188, 1975, 3, 1, 4375, 1, 427, 3067, 3011, 3, 5867, 710, 9743, 16, 4950, 885, 538, 2317, 2866, 3, 49, 282, 7539, 1, 8, 550, 12, 289, 8, 195, 64, 197, 8528, 29, 21, 792, 609, 48, 2, 8528, 7, 31, 42, 18, 306, 1144, 9, 8, 2, 3007, 1, 936, 5, 1, 30, 2, 440, 18, 2, 101, 1, 57, 14, 2, 19, 4369, 962, 35, 12, 7, 2, 1994, 1, 5, 2, 135, 8, 1, 7188, 2, 959, 101, 1273, 4140, 361, 757, 5, 2773, 4, 396]
1.0 [243, 9, 13, 73, 316, 3, 90, 51, 73, 316, 3, 90, 18, 2, 62, 13, 40, 37, 778, 3, 2, 1101, 5, 3330, 13, 53, 6054, 21, 12, 1342, 4586, 269, 49, 18, 2, 108, 27, 252, 13, 157, 18, 4, 1, 80, 22, 378, 6, 366, 69, 12, 3330, 13, 4, 2012, 5886, 1451, 2411, 16, 56, 1120, 5, 2, 1355, 41, 2, 2690, 10, 378, 36, 2, 544, 13, 27, 408, 6, 26, 3192, 1, 10, 378, 207, 20, 522, 40, 652, 22, 1, 92, 10, 375, 12, 2176, 13, 164, 15, 83, 158, 635, 233, 18]
1.0 [11, 17, 7, 158, 3, 4151, 15, 99, 605, 9, 7, 388, 1, 715, 2, 175, 121, 4, 459, 285, 385, 1491, 8, 2, 1, 512, 

## Create a classification model

In [16]:
# Define the model
class SentimentModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim=1):
        super(SentimentModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=vocab['<pad>'])
        self.global_avg_pool = nn.AdaptiveAvgPool1d(1)
        self.fc1 = nn.Linear(embedding_dim, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, output_dim)
    
    def forward(self, x):
        embedded = self.embedding(x)  # (batch_size, seq_length, embedding_dim)
        embedded = embedded.permute(0, 2, 1)  # (batch_size, embedding_dim, seq_length)
        pooled = self.global_avg_pool(embedded).squeeze(2)  # (batch_size, embedding_dim)
        out = self.fc1(pooled)
        out = self.relu(out)
        out = self.fc2(out)
        return out.squeeze(1)  # (batch_size)

# Initialize model, loss function, and optimizer
embedding_dim = 16
model = SentimentModel(vocab_size, embedding_dim).to(device)

In [17]:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [18]:
# Training loop
epochs = 15
for epoch in range(epochs):
    model.train()
    total_loss = 0
    total_correct = 0
    total_samples = 0
    for texts, labels in train_loader:
        texts, labels = texts.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item() * texts.size(0)
        preds = torch.round(torch.sigmoid(outputs))
        total_correct += (preds == labels).sum().item()
        total_samples += texts.size(0)
    
    avg_loss = total_loss / total_samples
    accuracy = total_correct / total_samples
    
    # Validation
    model.eval()
    val_loss = 0
    val_correct = 0
    val_samples = 0
    with torch.no_grad():
        for texts, labels in val_loader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts)
            loss = criterion(outputs, labels)
            
            val_loss += loss.item() * texts.size(0)
            preds = torch.round(torch.sigmoid(outputs))
            val_correct += (preds == labels).sum().item()
            val_samples += texts.size(0)
    
    avg_val_loss = val_loss / val_samples
    val_accuracy = val_correct / val_samples
    
    print(f"Epoch {epoch+1}/{epochs}")
    print(f"Train Loss: {avg_loss:.4f}, Train Acc: {accuracy:.4f}")
    print(f"Val Loss: {avg_val_loss:.4f}, Val Acc: {val_accuracy:.4f}")

Epoch 1/15
Train Loss: 0.7010, Train Acc: 0.5002
Val Loss: 0.6985, Val Acc: 0.4992
Epoch 2/15
Train Loss: 0.6959, Train Acc: 0.5002
Val Loss: 0.6945, Val Acc: 0.4988
Epoch 3/15
Train Loss: 0.6926, Train Acc: 0.5099
Val Loss: 0.6918, Val Acc: 0.5260
Epoch 4/15
Train Loss: 0.6901, Train Acc: 0.5592
Val Loss: 0.6897, Val Acc: 0.5724
Epoch 5/15
Train Loss: 0.6877, Train Acc: 0.5802
Val Loss: 0.6870, Val Acc: 0.5920
Epoch 6/15
Train Loss: 0.6842, Train Acc: 0.6038
Val Loss: 0.6829, Val Acc: 0.6066
Epoch 7/15
Train Loss: 0.6787, Train Acc: 0.6228
Val Loss: 0.6765, Val Acc: 0.6266
Epoch 8/15
Train Loss: 0.6701, Train Acc: 0.6472
Val Loss: 0.6669, Val Acc: 0.6484
Epoch 9/15
Train Loss: 0.6575, Train Acc: 0.6644
Val Loss: 0.6530, Val Acc: 0.6674
Epoch 10/15
Train Loss: 0.6397, Train Acc: 0.6914
Val Loss: 0.6348, Val Acc: 0.6822
Epoch 11/15
Train Loss: 0.6170, Train Acc: 0.7058
Val Loss: 0.6125, Val Acc: 0.6982
Epoch 12/15
Train Loss: 0.5899, Train Acc: 0.7251
Val Loss: 0.5875, Val Acc: 0.7126
E

In [19]:
# Display model summary
print("\nModel Summary:")
print(model)


Model Summary:
SentimentModel(
  (embedding): Embedding(10000, 16, padding_idx=0)
  (global_avg_pool): AdaptiveAvgPool1d(output_size=1)
  (fc1): Linear(in_features=16, out_features=16, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=16, out_features=1, bias=True)
)


## Retrieve the trained word embeddings and save them to disk

Next, retrieve the word embeddings learned during training. The embeddings are weights of the Embedding layer in the model. The weights matrix is of shape `(vocab_size, embedding_dimension)`.

Obtain the weights from the model using `get_layer()` and `get_weights()`. The `get_vocabulary()` function provides the vocabulary to build a metadata file with one token per line.

In [21]:
# Get embedding weights
embedding_weights = model.embedding.weight.data.cpu().numpy()

Write the weights to disk. To use the [Embedding Projector](http://projector.tensorflow.org), you will upload two files in tab separated format: a file of vectors (containing the embedding), and a file of meta data (containing the words).

In [23]:
# Save embeddings to 'vectors.tsv' and 'metadata.tsv'
with open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for idx, word in enumerate(inverse_vocab):
        if idx == vocab['<pad>']:
            continue  # skip padding
        vec = embedding_weights[idx]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(str(word) + "\n")

print("Embeddings and metadata have been saved.")

Embeddings and metadata have been saved.


## Bag of Words (BoW)

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

1. A vocabulary of known words.
2. A measure of the presence of known words.

### Step 1: Collect Data

    It was the best of times,
    it was the worst of times,
    it was the age of wisdom,
    it was the age of foolishness,

For this small example, let’s treat each line as a separate “document” and the 4 lines as our entire corpus of documents.

### Step 2: Design the Vocabulary
Now we can make a list of all of the words in our model vocabulary.

The unique words here (ignoring case and punctuation) are:

“it”

“was”

“the”

“best”

“of”

“times”

“worst”

“age”

“wisdom”

“foolishness”

That is a vocabulary of 10 words from a corpus containing 24 words.

### Step 3: Create Document Vectors
The next step is to score the words in each document.

The objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.

Because we know the vocabulary has 10 words, we can use a fixed-length document representation of 10, with one position in the vector to score each word.

The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1 for present.

Using the arbitrary ordering of words listed above in our vocabulary, we can step through the first document (“It was the best of times“) and convert it into a binary vector.

The scoring of the document would look as follows:

“it” = 1

“was” = 1

“the” = 1

“best” = 1

“of” = 1

“times” = 1

“worst” = 0

“age” = 0

“wisdom” = 0

“foolishness” = 0

As a binary vector, this would look as follows:
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

The other three documents would look as follows:

"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]


#### Practice what words are most likely to appear in a spam email?

### Limitations

- **Vocabulary**: How many words in English?

- **Sparsity**: (0,1) hard to model.

- **Meaning**: Discarding word order ignores the context, and in turn meaning of words in the document (semantics)

## Word2Vec Algorithem
Word2Vec is not a singular algorithm, rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through Word2Vec have proven to be successful on a variety of downstream natural language processing tasks.

Efficient Estimation of Word Representations in
Vector Space https://arxiv.org/pdf/1301.3781.pdf

Distributed Representations of Words and Phrases
and their Compositionality https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf

These papers proposed two methods for learning representations of words:

- Continuous Bag-of-Words Model which predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.
- Continuous Skip-gram Model which predict words within a certain range before and after the current word in the same sentence. A worked example of this is given below.

You'll use the skip-gram approach in this tutorial. First, you'll explore skip-grams and other concepts using a single sentence for illustration. Next, you'll train your own Word2Vec model on a small dataset.


<img src="https://miro.medium.com/max/1400/1*xD9n3KeWXuenMNL_BpYp6A.png" alt="Diagram of one-hot encodings" width="600" />

source(medium.com)

## Skip-gram and Negative Sampling
While a bag-of-words model predicts a word given the neighboring context, a skip-gram model predicts the context (or neighbors) of a word, given the word itself. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (see the diagram below for an example). The context of a word can be represented through a set of skip-gram pairs of (target_word, context_word) where context_word appears in the neighboring context of target_word.


Consider the following sentence of 8 words.
> The wide road shimmered in the hot sun.

The context words for each of the 8 words of this sentence are defined by a window size. The window size determines the span of words on either side of a `target_word` that can be considered `context word`. Take a look at this table of skip-grams for target words based on different window siz

Note: For this tutorial, a window size of *n* implies n words on each side with a total window span of 2*n+1 words across a word.

![word2vec_skipgrams](https://tensorflow.org/text/tutorials/images/word2vec_skipgram.png)

The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words w1, w2, ... wT, the objective can be written as the average log probability


![word2vec_skipgram_objective](https://tensorflow.org/text/tutorials/images/word2vec_skipgram_objective.png)


where c is the size of the training context. The basic skip-gram formulation defines this probability using the softmax function.

![word2vec_full_softmax](https://tensorflow.org/text/tutorials/images/word2vec_full_softmax.png)


where v and v' are target and context vector representations of words and W is vocabulary size.

Computing the denominator of this formulation involves performing a full softmax over the entire vocabulary words which is often large (105-107) terms.

The Noise Contrastive Estimation loss function is an efficient approximation for a full softmax. With an objective to learn word embeddings instead of modelling the word distribution, NCE loss can be simplified to use negative sampling.

The simplified negative sampling objective for a target word is to distinguish the context word from num_ns negative samples drawn from noise distribution Pn(w) of words. More precisely, an efficient approximation of full softmax over the vocabulary is, for a skip-gram pair, to pose the loss for a target word as a classification problem between the context word and num_ns negative samples.

A negative sample is defined as a (target_word, context_word) pair such that the context_word does not appear in the window_size neighborhood of the target_word. For the example sentence, these are few potential negative samples (when window_size is 2).


(hot, shimmered)
(wide, hot)
(wide, sun)

In the next section, you'll generate skip-grams and negative samples for a single sentence. You'll also learn about subsampling techniques and train a classification model for positive and negative training examples later in the tutorial.

### vectorize an example sentence
Consider the following sentence:
The wide road shimmered in the hot sun.

Tokenize the sentence:

In [24]:
# Define a simple sentence
sentence = "The wide road shimmered in the hot sun"
tokens = sentence.lower().split()
print(len(tokens))

8


In [25]:

# Build vocabulary manually
manual_vocab = {'<pad>':0}
index = 1
for token in tokens:
    if token not in manual_vocab:
        manual_vocab[token] = index
        index += 1
vocab_size_manual = len(manual_vocab)
print(manual_vocab)

{'<pad>': 0, 'the': 1, 'wide': 2, 'road': 3, 'shimmered': 4, 'in': 5, 'hot': 6, 'sun': 7}


In [26]:
# Create inverse vocabulary
inverse_vocab_manual = {idx: word for word, idx in manual_vocab.items()}
print(inverse_vocab_manual)

{0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}


In [27]:
# Vectorize the sentence
example_sequence = [manual_vocab[word] for word in tokens]
print(example_sequence)

[1, 2, 3, 4, 5, 1, 6, 7]


### Generating skip-grams from one sentence

In [28]:
# Define window size
window_size = 2

# Generate positive skip-grams
def generate_skipgrams(sequence, window_size):
    skip_grams = []
    for i, target in enumerate(sequence):
        context_start = max(0, i - window_size)
        context_end = min(len(sequence), i + window_size + 1)
        for j in range(context_start, context_end):
            if j != i:
                skip_grams.append((target, sequence[j]))
    return skip_grams

positive_skip_grams = generate_skipgrams(example_sequence, window_size)
print(len(positive_skip_grams))

26


In [29]:
# Display some skip-grams
for target, context in positive_skip_grams[:5]:
    print(f"({target}, {context}): ({inverse_vocab_manual[target]}, {inverse_vocab_manual[context]})")

(1, 2): (the, wide)
(1, 3): (the, road)
(2, 1): (wide, the)
(2, 3): (wide, road)
(2, 4): (wide, shimmered)


### Negative sampling for one skip-gram

The `skipgrams` function returns all positive skip-gram pairs by sliding over a given window span. To produce additional skip-gram pairs that would serve as negative samples for training, you need to sample random words from the vocabulary. Use the `tf.random.log_uniform_candidate_sampler` function to sample `num_ns` number of negative samples for a given target word in a window. You can call the function on one skip-grams's target word and pass the context word as true class to exclude it from being sampled.

Key point: `num_ns` (the number of negative samples per a positive context word) in the `[5, 20]` range is [shown to work](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) best for smaller datasets, while `num_ns` in the `[2, 5]` range suffices for larger datasets.

In [30]:
# Select one positive skip-gram
target_word, context_word = positive_skip_grams[0]

# Number of negative samples
num_ns = 4

# Negative sampling using uniform distribution
def negative_sampling(context_word, num_ns, vocab_size, seed=42):
    np.random.seed(seed)
    negatives = []
    while len(negatives) < num_ns:
        neg = np.random.randint(1, vocab_size)  # skip 0 (padding)
        if neg != context_word and neg not in negatives:
            negatives.append(neg)
    return negatives

negative_samples = negative_sampling(context_word, num_ns, vocab_size_manual)
print(negative_samples)
print([inverse_vocab_manual[idx] for idx in negative_samples])

[7, 4, 5, 3]
['sun', 'shimmered', 'in', 'road']


### Construct one training example
For a given positive `(target_word, context_word)` skip-gram, you now also have `num_ns` negative sampled context words that do not appear in the window size neighborhood of `target_word`. Batch the `1` positive `context_word` and `num_ns` negative context words into one tensor. This produces a set of positive skip-grams (labeled as `1`) and negative samples (labeled as `0`) for each target word.

In [31]:
# Prepare context and labels
context = [context_word] + negative_samples
labels = [1] + [0]*num_ns

Take a look at the context and the corresponding labels for the target word from the skip-gram example above.


In [32]:
print(f"target_index    : {target_word}")
print(f"target_word     : {inverse_vocab_manual[target_word]}")
print(f"context_indices : {context}")
print(f"context_words   : {[inverse_vocab_manual[c] for c in context]}")
print(f"label           : {labels}")

target_index    : 1
target_word     : the
context_indices : [2, 7, 4, 5, 3]
context_words   : ['wide', 'sun', 'shimmered', 'in', 'road']
label           : [1, 0, 0, 0, 0]


A tuple of (target, context, label) tensors constitutes one training example for training your skip-gram negative sampling Word2Vec model. Notice that the target is of shape (1,) while the context and label are of shape (1+num_ns,)

### Summary

This picture summarizes the procedure of generating training example from a sentence.

![word2vec_negative_sampling](https://tensorflow.org/text/tutorials/images/word2vec_negative_sampling.png)

### Compile all steps into one function

#### Skip-gram Sampling table
A large dataset means larger vocabulary with higher number of more frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as the, is, on) don't add much useful information for the model to learn from. Mikolov et al. suggest subsampling of frequent words as a helpful practice to improve embedding quality.

In [33]:
# Define a sampling table (not used in this simple example)
sampling_table = np.random.randint(0, 10, size=10)
print(sampling_table)

[7 4 6 9 2 6 7 4 3 7]


`sampling_table[i]` denotes the probability of sampling the i-th most common word in a dataset. The function assumes a [Zipf's distribution](https://en.wikipedia.org/wiki/Zipf%27s_law) of the word frequencies for sampling.

Key point: The `tf.random.log_uniform_candidate_sampler` already assumes that the vocabulary frequency follows a log-uniform (Zipf's) distribution. Using these distribution weighted sampling also helps approximate the Noise Contrastive Estimation (NCE) loss with simpler loss functions for training a negative sampling objective.

### Generate training data

Compile all the steps described above into a function that can be called on a list of vectorized sentences obtained from any text dataset. Notice that the sampling table is built before sampling skip-gram word pairs. You will use this function in the later sections.

In [34]:
# Function to generate training data with negative sampling
def generate_training_data_skipgram(sequences, window_size, num_ns, vocab_size, seed=42):
    targets = []
    contexts = []
    labels = []
    np.random.seed(seed)
    
    for sequence in sequences:
        skip_grams = generate_skipgrams(sequence, window_size)
        for target, context_word in skip_grams:
            neg_samples = negative_sampling(context_word, num_ns, vocab_size, seed)
            context_combined = [context_word] + neg_samples
            label_combined = [1] + [0]*num_ns
            targets.append(target)
            contexts.append(context_combined)
            labels.append(label_combined)
    
    return targets, contexts, labels

### Prepare training data for Word2Vec

With an understanding of how to work with one sentence for a skip-gram negative sampling based Word2Vec model, you can proceed to generate training examples from a larger list of sentences!

#### Download text corpus
You will use a text file of Shakespeare's writing for this tutorial. Change the following line to run this code on your own data.

In [35]:
# Example sequences (for demonstration, using the single sentence)
sequences = [example_sequence]

In [36]:
# Generate training data
targets, contexts, labels = generate_training_data_skipgram(
    sequences=sequences,
    window_size=window_size,
    num_ns=num_ns,
    vocab_size=vocab_size_manual,
    seed=42
)

print('\n')
print(f"targets.shape: {len(targets)}")
print(f"contexts.shape: {len(contexts)}")
print(f"labels.shape: {len(labels)}")




targets.shape: 26
contexts.shape: 26
labels.shape: 26


Vectorize sentences from the corpus

You can use the TextVectorization layer to vectorize sentences from the corpus. Learn more about using this layer in this Text Classification tutorial. Notice from the first few sentences above that the text needs to be in one case and punctuation needs to be removed. To do this, define a custom_standardization function that can be used in the TextVectorization layer.

In [38]:
# Define a custom Dataset for Word2Vec
class Word2VecDataset(Dataset):
    def __init__(self, targets, contexts, labels):
        self.targets = torch.tensor(targets, dtype=torch.long)
        self.contexts = torch.tensor(contexts, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float32)
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return (self.targets[idx], self.contexts[idx]), self.labels[idx]

In [39]:
# Create the dataset and dataloader
w2v_dataset = Word2VecDataset(targets, contexts, labels)
w2v_loader = DataLoader(w2v_dataset, batch_size=1024, shuffle=True)

Once the state of the layer has been adapted to represent the text corpus, the vocabulary can be accessed with get_vocabulary(). This function returns a list of all vocabulary tokens sorted (descending) by their frequency.

In [40]:
print(w2v_loader)

<torch.utils.data.dataloader.DataLoader object at 0x72904ebc6380>


### Model and Training
The Word2Vec model can be implemented as a classifier to distinguish between true context words from skip-grams and false context words obtained through negative sampling. You can perform a dot product between the embeddings of target and context words to obtain predictions for labels and compute loss against true labels in the dataset.

#### Subclassed Word2Vec Model

In [41]:
class Word2VecModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_ns):
        super(Word2VecModel, self).__init__()
        self.target_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.context_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.num_ns = num_ns
    
    def forward(self, pair):
        target, context = pair
        target = target.unsqueeze(1)  # (batch, 1)
        word_emb = self.target_embedding(target)  # (batch, 1, embed_dim)
        context_emb = self.context_embedding(context)  # (batch, context, embed_dim)
        # Compute dot product between target and context embeddings
        dots = torch.bmm(context_emb, word_emb.transpose(1, 2)).squeeze(2)  # (batch, context)
        return dots

In [42]:
# Initialize model, loss function, and optimizer
embedding_dim_w2v = 128
word2vec_model = Word2VecModel(vocab_size_manual, embedding_dim_w2v, num_ns).to(device)
criterion_w2v = nn.BCEWithLogitsLoss()
optimizer_w2v = optim.Adam(word2vec_model.parameters(), lr=0.001)

In [43]:
# Training loop for Word2Vec
epochs_w2v = 20
for epoch in range(epochs_w2v):
    word2vec_model.train()
    total_loss = 0
    total_correct = 0
    total_samples = 0
    for (targets_batch, contexts_batch), labels_batch in w2v_loader:
        targets_batch = targets_batch.to(device)
        contexts_batch = contexts_batch.to(device)
        labels_batch = labels_batch.to(device)
        
        optimizer_w2v.zero_grad()
        outputs = word2vec_model((targets_batch, contexts_batch))
        loss = criterion_w2v(outputs, labels_batch)
        loss.backward()
        optimizer_w2v.step()
        
        total_loss += loss.item() * targets_batch.size(0)
        preds = torch.round(torch.sigmoid(outputs))
        total_correct += (preds == labels_batch).sum().item()
        total_samples += labels_batch.numel()
    
    avg_loss = total_loss / total_samples
    accuracy = total_correct / total_samples
    
    print(f"Epoch {epoch+1}/{epochs_w2v}")
    print(f"Train Loss: {avg_loss:.4f}, Train Acc: {accuracy:.4f}")

Epoch 1/20
Train Loss: 1.1611, Train Acc: 0.3692
Epoch 2/20
Train Loss: 1.1496, Train Acc: 0.3692
Epoch 3/20
Train Loss: 1.1382, Train Acc: 0.3692
Epoch 4/20
Train Loss: 1.1269, Train Acc: 0.4000
Epoch 5/20
Train Loss: 1.1156, Train Acc: 0.4000
Epoch 6/20
Train Loss: 1.1043, Train Acc: 0.4000
Epoch 7/20
Train Loss: 1.0932, Train Acc: 0.4000
Epoch 8/20
Train Loss: 1.0821, Train Acc: 0.4000
Epoch 9/20
Train Loss: 1.0710, Train Acc: 0.4000
Epoch 10/20
Train Loss: 1.0600, Train Acc: 0.4000
Epoch 11/20
Train Loss: 1.0490, Train Acc: 0.4000
Epoch 12/20
Train Loss: 1.0381, Train Acc: 0.4000
Epoch 13/20
Train Loss: 1.0273, Train Acc: 0.4000
Epoch 14/20
Train Loss: 1.0165, Train Acc: 0.4000
Epoch 15/20
Train Loss: 1.0057, Train Acc: 0.4077
Epoch 16/20
Train Loss: 0.9950, Train Acc: 0.4077
Epoch 17/20
Train Loss: 0.9844, Train Acc: 0.4308
Epoch 18/20
Train Loss: 0.9738, Train Acc: 0.4308
Epoch 19/20
Train Loss: 0.9632, Train Acc: 0.4308
Epoch 20/20
Train Loss: 0.9527, Train Acc: 0.4462


### Embedding lookup and analysis
Obtain the weights from the model using get_layer() and get_weights(). The get_vocabulary() function provides the vocabulary to build a metadata file with one token per line.

In [44]:
# Get Word2Vec embedding weights
w2v_weights = word2vec_model.target_embedding.weight.data.cpu().numpy()

In [45]:
# Save embeddings to 'vectorsw2v.tsv' and 'metadataw2v.tsv'
with open('vectorsw2v.tsv', 'w', encoding='utf-8') as out_v, \
     open('metadataw2v.tsv', 'w', encoding='utf-8') as out_m:
    for idx, word in inverse_vocab_manual.items():
        if idx == manual_vocab['<pad>']:
            continue  # skip padding
        vec = w2v_weights[idx]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")

print("Word2Vec embeddings and metadata have been saved.")

Word2Vec embeddings and metadata have been saved.
