# Natural Language Processing with PyTorch and TorchText

This script demonstrates the basic workflow of tokenizing text data, building a vocabulary, and generating word embeddings using PyTorch's nn.Embedding and nn.EmbeddingBag modules.

## Key Features:
- Tokenization using the spaCy tokenizer via TorchText
- Vocabulary construction from a dataset of sample sentences
- Generation of word embeddings for individual words and sentences
- Introduction to handling variable-length sequences with offsets in EmbeddingBag

## Dependencies:
- torch
- torchtext
- spacy ('en_core_web_sm' language model)

Author: Tamunowunari-Tasker Anointing


In [1]:
# Importing necessary libraries from PyTorch and TorchText
import torchtext.data.utils as tdu  # For tokenization utilities
import torchtext.vocab as tv        # For vocabulary building
import torch.nn as nn               # For neural network components like Embedding
import torch                        # For tensor operations

In [2]:
# Sample dataset of simple sentences
dataset = [
    "I like cats",
    "I hate dogs",
    "I'm impartial to hippos"]

In [3]:
# Initialize the tokenizer using spaCy's small English model
tokenizer = tdu.get_tokenizer('spacy', language='en_core_web_sm')  # get_tokenizer() is a torchtext helper

In [4]:
# Function to yield tokens from each sentence in the dataset
def yield_tokens(data_iter):
    """
    Tokenizes each sentence in the dataset.
    
    Args:
        data_iter (iter): An iterator over the dataset.
        
    Yields:
        list: A list of tokens for each sentence.
    """
    for data_sample in data_iter:
        yield tokenizer(data_sample)  # yields one token list per sentence and resumes here on the next request

# Create an iterator over the dataset
data_iter = iter(dataset)  # iter() is a built-in that returns an iterator over an iterable such as a list or tuple; an iterator can only be consumed once

In [5]:
# Build a vocabulary from the tokenized dataset
vocab = tv.build_vocab_from_iterator(yield_tokens(data_iter))

In [6]:
# Display the vocabulary as a list of words (index-to-string mapping)
print(vocab.get_itos())  # get_itos() returns the index-to-string list; here it holds 9 tokens

['I', "'m", 'cats', 'dogs', 'hate', 'hippos', 'impartial', 'like', 'to']
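
The ordering above (most frequent token first, ties broken alphabetically) can be reproduced without torchtext. The sketch below is a plain-Python approximation that assumes this sort rule and uses hand-written token lists mirroring the spaCy output shown in this notebook. The names `tokenized`, `itos`, and `stoi` are illustrative and are not torchtext objects.

```python
from collections import Counter

# Token lists written by hand to mirror the spaCy tokenization used in this
# notebook (note that "I'm" is split into "I" and "'m").
tokenized = [
    ["I", "like", "cats"],
    ["I", "hate", "dogs"],
    ["I", "'m", "impartial", "to", "hippos"],
]

# Count how often each token occurs across all sentences
counts = Counter(token for sentence in tokenized for token in sentence)

# Assumed ordering rule: descending frequency, then alphabetical for ties
itos = sorted(counts, key=lambda token: (-counts[token], token))

# Reverse mapping from token to index
stoi = {token: i for i, token in enumerate(itos)}

print(itos)
print([stoi[token] for token in tokenized[0]])  # indices for "I like cats"
```

A plain dictionary like `stoi` raises a `KeyError` for words that never appeared in the dataset, so a real pipeline usually reserves an index for unknown tokens.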


In [7]:
# A lambda in Python is just a one-liner function. It’s a shortcut for simple operations.
input_indexes = lambda x: [torch.tensor(vocab(tokenizer(data_sample))) for data_sample in x]

# Just for comparison: the same logic as a regular function (this definition replaces the lambda above)
def input_indexes(x):
    result = []
    for data_sample in x:
        tokens = tokenizer(data_sample)
        indices = vocab(tokens)
        tensor = torch.tensor(indices)
        result.append(tensor)
    return result
# Same effect, but more lines

In [8]:
# Convert dataset sentences to token indices
index = input_indexes(dataset) # index has three tensors
print("Token indices:", index)

Token indices: [tensor([0, 7, 2]), tensor([0, 4, 3]), tensor([0, 1, 6, 8, 5])]
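
The three index tensors have lengths 3, 3 and 5, so they cannot be stacked into one rectangular tensor as they are. As a brief aside, one common workaround is padding, sketched here with `torch.nn.utils.rnn.pad_sequence` and the index values copied from the output above. The padding value 0 is an assumption for illustration only; it collides with index 0 ("I"), so a real pipeline would reserve a dedicated padding index.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Index tensors copied from the printed output above
index = [
    torch.tensor([0, 7, 2]),
    torch.tensor([0, 4, 3]),
    torch.tensor([0, 1, 6, 8, 5]),
]

# Sentence lengths differ, which is why the tensors cannot be stacked directly
lengths = [len(t) for t in index]

# Pad every sentence to the longest length (illustrative padding value of 0)
padded = pad_sequence(index, batch_first=True, padding_value=0)

print(lengths)
print(padded.shape)
```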


In [9]:
# Set embedding dimensions (number of features each word vector will have)
embedding_dim = 3 # Each word will be represented in a 3D vector space

# Number of unique words in the vocabulary
n_embedding = len(vocab)
# For this dataset that is the same as hard-coding:
# n_embedding = 9

In [10]:
# Initialize embedding layer
embeds = nn.Embedding(n_embedding, embedding_dim)
print(embeds)

Embedding(9, 3)


In [11]:
# Example: Get embeddings for the sentence "I like cats"
i_like_cats = embeds(index[0])
print("Embeddings for I like cats: ", i_like_cats)

Embeddings for I like cats:  tensor([[-0.4450, -1.2641, -0.1237],
        [ 0.1213,  2.2594, -0.3364],
        [-0.7020,  0.1398, -1.4275]], grad_fn=<EmbeddingBackward0>)


In [12]:
# Example: Get embeddings for "I'm impartial to hippos"
impartial_to_hippos = embeds(index[-1])  # index[-1] is the last sentence, the same as index[2]
print("Embedding for I'm impartial to hippos: ", impartial_to_hippos)
# The result has shape (5, 3): five tokens ("I" and "'m" are separate tokens) by three embedding dimensions

Embedding for I'm impartial to hippos:  tensor([[-0.4450, -1.2641, -0.1237],
        [-0.0810,  2.2538,  0.5334],
        [ 2.2675,  0.5273,  0.0998],
        [ 0.0560, -0.6253, -0.4342],
        [ 1.8003,  0.3011, -0.3945]], grad_fn=<EmbeddingBackward0>)
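
Both sentences start with "I" (index 0), which is why the first row is identical in the two outputs above: an embedding lookup simply selects rows of the layer's weight matrix. The sketch below checks this with a freshly initialized layer; the fixed seed is an assumption added for repeatability, so the values will differ from the ones printed in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initial weights are repeatable

embeds = nn.Embedding(9, 3)               # 9 vocabulary entries, 3 features each
sentence = torch.tensor([0, 1, 6, 8, 5])  # indices for "I 'm impartial to hippos"

looked_up = embeds(sentence)       # forward pass through the layer
indexed = embeds.weight[sentence]  # direct row indexing into the weight matrix

print(looked_up.shape)                  # one row per token, one column per feature
print(torch.equal(looked_up, indexed))  # the two results are the same rows

# The row for "I" is shared by every sentence that contains it
first_sentence = embeds(torch.tensor([0, 7, 2]))
print(torch.equal(first_sentence[0], looked_up[0]))
```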


In [13]:
# Flatten all sentence indices into a single tensor
index_flat = torch.cat(index)
print(index_flat)

tensor([0, 7, 2, 0, 4, 3, 0, 1, 6, 8, 5])


In [14]:
# Collect the length of each sentence; a cumulative sum turns these lengths into start offsets
offset = [len(sample) for sample in index]
offset.insert(0, 0)  # Prepend 0 because the first sentence starts at position 0
print(offset)

[0, 3, 3, 5]


In [15]:
# A cumulative sum shows where each sentence begins (position 0, position 3, position 6)
offset = torch.tensor(offset)

# Compute the cumulative sum to get the starting position of each sentence
offset = torch.cumsum(offset, 0)[0:-1]  # Slice off the last element, which is the total length and not a start position
print(offset)

tensor([0, 3, 6])
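
To see what the offsets encode, the flattened tensor can be cut back into sentences: each sentence runs from its own offset up to the next offset, and the last one runs to the end. This is a small sketch using the values printed above; `ends` and `sentences` are illustrative names.

```python
import torch

# Values copied from the printed outputs above
index_flat = torch.tensor([0, 7, 2, 0, 4, 3, 0, 1, 6, 8, 5])
offset = torch.tensor([0, 3, 6])

# Each sentence ends where the next one starts; the last ends at the tensor's end
ends = offset[1:].tolist() + [len(index_flat)]

# Slice the flat tensor back into one list of indices per sentence
sentences = [index_flat[start:end].tolist() for start, end in zip(offset.tolist(), ends)]

print(sentences)
```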


In [16]:
# Initialize EmbeddingBag layer, which computes sentence embeddings by averaging word vectors
# (it has its own randomly initialized weights, separate from the embeds layer above)
embedding_bag = nn.EmbeddingBag(num_embeddings=n_embedding, embedding_dim=embedding_dim, mode='mean')

# Compute embeddings for entire sentences
sentence_embeddings = embedding_bag(index_flat, offsets=offset)
print(sentence_embeddings)

tensor([[ 0.2888, -0.1475, -0.1623],
        [-0.8679, -0.0473,  0.4397],
        [-0.1703, -0.2355, -0.2356]], grad_fn=<EmbeddingBagBackward0>)
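
Because `embedding_bag` is initialized independently of `embeds`, the rows above are not the means of the word vectors printed earlier. To check the averaging itself, the sketch below builds an `EmbeddingBag` that reuses an `Embedding` layer's weights through `nn.EmbeddingBag.from_pretrained` and compares it with a manual per-sentence mean. The fixed seed and the names `bag` and `manual` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initial weights are repeatable

embeds = nn.Embedding(9, 3)

# EmbeddingBag that reuses the Embedding layer's weight matrix
bag = nn.EmbeddingBag.from_pretrained(embeds.weight.detach(), mode='mean')

# Index tensors and offsets copied from the notebook outputs
index = [
    torch.tensor([0, 7, 2]),
    torch.tensor([0, 4, 3]),
    torch.tensor([0, 1, 6, 8, 5]),
]
index_flat = torch.cat(index)
offset = torch.tensor([0, 3, 6])

with torch.no_grad():
    bag_out = bag(index_flat, offsets=offset)
    # Manual version: look up each sentence, then average over its tokens
    manual = torch.stack([embeds(sentence).mean(dim=0) for sentence in index])

print(bag_out.shape)
print(torch.allclose(bag_out, manual, atol=1e-6))
```

With shared weights the two results agree, which confirms that `mode='mean'` together with `offsets` averages the word vectors within each sentence.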
