## Building and Training a Feedforward Neural Network for Language Modeling

This project explores the use of Feedforward Neural Networks (FNNs) in language modeling. The primary objective is to build a neural network that learns word relationships and generates meaningful text sequences. The implementation is done using PyTorch, covering key aspects of Natural Language Processing (NLP), such as:
* Tokenization & Indexing: Converting text into numerical representations.
* Embedding Layers: Mapping words to dense vector representations for efficient learning.
* Context-Target Pair Generation (N-grams): Structuring training data for sequence prediction.
* Multi-Class Neural Network: Designing a model to predict the next word in a sequence.

The training process optimizes the model with a loss function and backpropagation to improve the accuracy and coherence of the generated text. By the end of the project, we will have a working FNN-based language model capable of generating text sequences.

Objectives:

 - Implement a feedforward neural network using the PyTorch framework, including embedding layers, for language modeling tasks.
 - Fine-tune the output layer of the neural network for optimal performance in text generation.
 - Apply various training strategies and fundamental Natural Language Processing (NLP) techniques, such as tokenization and sequence analysis, to improve text generation.

### Importing Required Libraries

In [None]:
import warnings
warnings.simplefilter('ignore')

# Standard library
import re
import string
import time
import random
from collections import OrderedDict

# Numerical and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

# NLTK tokenizer and the data it needs
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

# PyTorch and TorchText
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

from sklearn.manifold import TSNE

# We can also use this section to suppress warnings generated by our code:
def warn(*args, **kwargs):
    pass
warnings.warn = warn
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sekai\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sekai\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Feedforward Neural Network (FNNs) for Language Models

FNNs, also known as multi-layer perceptrons (MLPs), are a basic building block for understanding neural networks in natural language processing (NLP). In NLP tasks, an FNN does not operate on raw text: each token is first mapped to a numerical vector called an embedding. These embeddings are then fed into the network to predict some property of the text, such as the next word in a sentence or the sentiment of a passage.

Let us consider the following song lyrics for our analysis.

In [None]:
song = """We are no strangers to love
You know the rules and so do I
A full commitments what Im thinking of
You wouldnt get this from any other guy
I just wanna tell you how Im feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Weve known each other for so long
Your hearts been aching but youre too shy to say it
Inside we both know whats been going on
We know the game and were gonna play it
And if you ask me how Im feeling
Dont tell me youre too blind to see
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Weve known each other for so long
Your hearts been aching but youre too shy to say it
Inside we both know whats been going on
We know the game and were gonna play it
I just wanna tell you how Im feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you"""

### Tokenization for FNN

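The TorchText function `get_tokenizer` returns a tokenizer for text. Here we request the `"basic_english"` tokenizer, which lowercases the text, applies basic punctuation normalization, and splits the result on whitespace.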

In [82]:
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(song)

In [84]:
def preprocess_string(s):
    """
    Preprocesses a given string by performing the following steps:
    
    1. Removes punctuation (every character that is neither a word character nor whitespace).
    2. Removes all whitespace characters.
    3. Removes all numeric digits.

    Parameters:
    s (str): The input string to be cleaned.

    Returns:
    str: The processed string with only alphabetic characters, no spaces, and no digits.
    """
    
    # Remove punctuation (everything that is neither a word character nor whitespace)
    # \w matches any word character (letters, numbers, and underscores)
    # \s matches any whitespace characters
    # ^ inside [] negates the selection, so [^\w\s] matches anything that's NOT a word character or whitespace.
    s = re.sub(r"[^\w\s]", '', s)

    # Remove all whitespace characters (spaces, tabs, newlines)
    # \s+ matches one or more whitespace characters.
    s = re.sub(r"\s+", '', s)

    # Remove all digits (0-9)
    # \d matches any digit character.
    s = re.sub(r"\d", '', s)

    return s
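To see what each substitution contributes, the short sketch below applies the same three regular expressions one at a time and returns every intermediate result. The helper name `show_steps` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
import re

def show_steps(s):
    """Apply the three substitutions from preprocess_string and return each stage."""
    no_punct = re.sub(r"[^\w\s]", "", s)      # drop punctuation; keep letters, digits, '_' and whitespace
    no_space = re.sub(r"\s+", "", no_punct)   # drop all whitespace
    no_digit = re.sub(r"\d", "", no_space)    # drop digits
    return no_punct, no_space, no_digit

for sample in ["Don't!", "2nd place", "..."]:
    print(repr(sample), "->", show_steps(sample))
```

Because `\w` also matches the underscore, a token such as `a_b` passes through unchanged. A token made only of punctuation becomes an empty string, which is why the caller has to filter out empty results.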

In [121]:
def preprocess(words):
    """
    Preprocesses a given text by tokenizing it, cleaning individual words, and 
    converting them to lowercase while removing empty or punctuation tokens.

    Steps:
    1. Tokenization: Splits the input text into individual word tokens.
    2. Cleaning: Applies `preprocess_string()` to remove non-word characters, 
       spaces, and digits from each token.
    3. Normalization: Converts all tokens to lowercase.
    4. Filtering: Removes empty strings and punctuation tokens.

    Parameters:
    words (str): The input text to be tokenized and preprocessed.

    Returns:
    list: A list of cleaned, lowercase tokens.
    """

    # Tokenize the input text into words 
    tokens = word_tokenize(words)

    # Apply preprocessing to each token (removes unwanted characters)
    tokens = [preprocess_string(w) for w in tokens]

    # Convert tokens to lowercase and remove empty strings or punctuation
    return [w.lower() for w in tokens if len(w) != 0 and w not in string.punctuation]

# Example usage:
tokens = preprocess(song) # Preprocess the text in 'song'

### Indexing

TorchText provides tools to tokenize text into individual words (tokens) and build a vocabulary, which maps each token to a unique integer index. This is a crucial step in preparing text data for machine learning models that require numerical input.


In [122]:
def tokenizetext(song):
    """
    Tokenizes the input text (song) and builds a vocabulary from the tokens.

    Steps:
    1. Tokenization: The function splits the input text into words and applies 
       a tokenizer function to each word.
    2. Vocabulary Building: Constructs a vocabulary from the tokenized words,
       including a special "<unk>" token to handle out-of-vocabulary words.
    3. Default Indexing: Sets the default index for unknown words, ensuring 
       that any unseen tokens are mapped to "<unk>".

    Parameters:
    song (str): The input text (song lyrics) to be tokenized and processed.

    Returns:
    vocab (Vocab): A vocabulary object mapping tokens to their corresponding indices.
    """

    # Tokenize the text
    # Split the input text into words and apply the tokenizer function to each word.
    # The 'map' function ensures that each word is tokenized properly. 
    tokenized_song = map(tokenizer, song.split())

    # Build vocabulary from tokenized text
    # The function `build_vocab_from_iterator` constructs a vocabulary by iterating 
    # over the tokenized words. The special token "<unk>" is added to handle words 
    # that are not present in the vocabulary.
    vocab = build_vocab_from_iterator(tokenized_song, specials=["<unk>"])

    # Set the default index for unknown words
    # The default index is set to the index of "<unk>" so that any word not found 
    # in the vocabulary is mapped to this token, preventing errors during lookup.
    vocab.set_default_index(vocab["<unk>"])

    return vocab 

Convert the tokens to indices by applying the function as shown here:

In [123]:
vocab = tokenizetext(song)
vocab(tokens[0:10])

[21, 58, 70, 74, 25, 69, 2, 20, 31, 72]

In [124]:
tokens[0:10]

['we', 'are', 'no', 'strangers', 'to', 'love', 'you', 'know', 'the', 'rules']

Next, we define a text pipeline that converts raw text directly into indices.

In [125]:
text_pipeline = lambda x: vocab(tokenizer(x))
text_pipeline(song)[0:10]

[21, 58, 70, 74, 25, 69, 2, 20, 31, 72]

Find the word corresponding to an index using the `get_itos()` method. It returns a list in which position `i` holds the token with index `i`.

In [126]:
index_to_token = vocab.get_itos()
index_to_token[58]

'are'
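The idea behind the vocabulary object can be illustrated without TorchText. The sketch below is a simplified stand-in, not TorchText's implementation: the helper names `build_simple_vocab` and `lookup`, and the rule of ordering tokens by frequency with alphabetical tie-breaking, are assumptions made for this example.

```python
from collections import Counter

def build_simple_vocab(tokens, unk="<unk>"):
    """Return (token_to_index, index_to_token) with `unk` stored at index 0."""
    counts = Counter(tokens)
    # Most frequent tokens first; ties broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    index_to_token = [unk] + ordered
    token_to_index = {tok: i for i, tok in enumerate(index_to_token)}
    return token_to_index, index_to_token

def lookup(token_to_index, words, unk="<unk>"):
    """Map words to indices, sending unseen words to the index of `unk`."""
    return [token_to_index.get(w, token_to_index[unk]) for w in words]

toy_tokens = "never gonna give you up never gonna let you down".split()
stoi, itos = build_simple_vocab(toy_tokens)

ids = lookup(stoi, ["never", "gonna", "dance"])
print(ids)                      # "dance" was never seen, so it maps to index 0
print([itos[i] for i in ids])   # decoding turns the unseen word into "<unk>"
```

The two structures play the roles of the token-to-index mapping and the `get_itos()` list used above: one direction encodes text for the model, the other decodes predictions back into words.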

### Embedding Layers

An embedding layer is a crucial element in natural language processing (NLP) and neural networks designed for sequential data. It serves to convert categorical variables, like words or discrete indexes representing tokens, into continuous vectors. This transformation facilitates training and enables the network to learn meaningful relationships among words.

Let's consider a simple example involving a small vocabulary of words:
- **Vocabulary**: {apple, banana, orange, pear}

Each word in our vocabulary is assigned a unique index:
- **Indices**: {0, 1, 2, 3}

An embedding layer initializes a random continuous vector for each index. For instance, with an embedding dimension of 2, the vectors might look like:

- Vector for index 0 (apple): [0.2, 0.8]
- Vector for index 1 (banana): [0.6, -0.5]
- Vector for index 2 (orange): [-0.3, 0.7]
- Vector for index 3 (pear): [0.1, 0.4]

In PyTorch, we can create an embedding layer with `nn.Embedding`.

In [127]:
def genembedding(vocab):
    """
    Generates an embedding layer for the given vocabulary.

    The embedding layer transforms words into dense vector representations, 
    allowing the model to learn semantic relationships between words.

    Parameters:
    vocab (Vocab): The vocabulary object containing unique words and their indices.

    Returns:
    nn.Embedding: A PyTorch embedding layer with a specified embedding dimension.
    """

    # Define the embedding dimension (size of word vectors)
    embedding_dim = 20 

    # Get the vocabulary size (number of unique words in the vocabulary)
    vocab_size = len(vocab) 

    # Create an embedding layer 
    # The nn.Embedding module maps word indices to dense vector representations. 
    # It takes vocab_size as the number of words and embedding_dim as the vector size.

    embeddings = nn.Embedding(vocab_size, embedding_dim)

    return embeddings

**Embeddings**: Obtain the embeddings for the first two vocabulary entries, with indices 0 and 1. Remember that each index must be converted to a tensor before it is passed to the layer. The embeddings are initialized randomly, but as the model trains, words used in similar contexts tend to move closer together.


In [128]:
embeddings = genembedding(vocab)

for n in range(2):
    embedding = embeddings(torch.tensor(n))
    print("word", index_to_token[n])
    print("index", n)
    print("embedding", embedding)
    print("embedding shape", embedding.shape)

word <unk>
index 0
embedding tensor([-0.1948, -0.0872, -1.1712, -2.4286, -0.6913,  1.3402, -0.6221,  0.4498,
        -0.9831,  1.2930,  0.8697, -0.1732,  2.9601,  0.9453,  0.1231, -0.5225,
         0.0386,  0.0675,  2.6692,  0.7498], grad_fn=<EmbeddingBackward0>)
embedding shape torch.Size([20])
word gonna
index 1
embedding tensor([ 0.2825,  0.3624,  1.6737,  1.6189, -1.8306, -1.2440, -0.3699,  0.8209,
         0.4144,  1.3435, -0.0302, -0.6240, -1.0160,  1.4343,  0.3890, -0.4814,
        -0.9842,  0.2842, -1.3067, -0.9070], grad_fn=<EmbeddingBackward0>)
embedding shape torch.Size([20])


These vectors will serve as inputs for the next layer.
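An embedding layer is essentially a trainable lookup table: passing index `i` returns row `i` of its weight matrix. The sketch below checks this on the four-word fruit vocabulary from the earlier example. The names `toy_vocab` and `toy_embeddings` and the fixed seed are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initial vectors are reproducible

toy_vocab = ["apple", "banana", "orange", "pear"]
toy_embeddings = nn.Embedding(num_embeddings=len(toy_vocab), embedding_dim=2)

index = torch.tensor(2)                  # index of "orange"
vector = toy_embeddings(index)           # forward pass: look up row 2
same_row = toy_embeddings.weight[2]      # direct access to the weight matrix

print(toy_embeddings.weight.shape)       # one row per word, one column per embedding dimension
print(torch.equal(vector, same_row))     # the lookup returns exactly that row
```

Since the rows are ordinary parameters, gradients flow into them during training, which is how the vectors move away from their random starting values.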

### Generating Context-Target Pairs (n-grams)

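Organize the words into context-target pairs using a sliding window of configurable size. Let `i` be the position of the target word. Its context consists of the words that precede it, `tokens[i - j - 1]` for `j = 0, ..., CONTEXT_SIZE - 1`, so the word nearest to the target comes first.
The size of the context is determined by the value of ``CONTEXT_SIZE``.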

In [129]:
# Define the context size for generating n-grams
CONTEXT_SIZE = 2 # The number of previous words used to predict the next word 

def genngrams(tokens):
    """
    Generates n-grams from a list of tokens, where each n-gram consists of a 
    context (previous words) and a target (next word).

    The function constructs a list of tuples where:
    - The first element is a list of `CONTEXT_SIZE` previous words.
    - The second element is the target word that follows the context.

    Parameters:
    tokens (list): A list of preprocessed word tokens.

    Returns:
    list: A list of tuples representing n-grams.
          Each tuple contains (context_words, target_word).
    """

    # Generate n-grams 
    # Iterate through the tokens starting from index CONTEXT_SIZE to the end
    # For each token at position 'i', extract the previous CONTEXT_SIZE words as context
    ngrams = [
        (
            [tokens[i - j - 1] for j in range(CONTEXT_SIZE)], # Context words (previous words, nearest first)
            tokens[i] # Target word (the word to predict)
        )
        for i in range(CONTEXT_SIZE, len(tokens))
    ]

    return ngrams
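The indexing `tokens[i - j - 1]` is easier to follow on a short sentence. The sketch below uses the same list comprehension, but takes the context size as a parameter; the function name `make_ngrams` and the six-word input are assumptions made for this example.

```python
def make_ngrams(tokens, context_size):
    """Pair each token with the `context_size` tokens before it, nearest first."""
    return [
        ([tokens[i - j - 1] for j in range(context_size)], tokens[i])
        for i in range(context_size, len(tokens))
    ]

toy_tokens = ["we", "are", "no", "strangers", "to", "love"]
pairs = make_ngrams(toy_tokens, context_size=2)

for context_words, target_word in pairs:
    print(context_words, "->", target_word)
```

Two things are worth noticing: the context is stored in reverse order (the word immediately before the target comes first), and a list of `n` tokens yields `n - context_size` pairs because the first `context_size` tokens have no complete context.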


Print the first n-gram, which is a tuple. Its first element is the context (nearest word first), and its second element is the target word that follows. Passing both through `vocab` gives the corresponding indices.

In [130]:
ngrams = genngrams(tokens)
context, target = ngrams[0]
print("context", context, "target", target)
print("context index", vocab(context), "target index", vocab([target]))

context ['are', 'we'] target no
context index [58, 21] target index [70]


The context contains several words. Concatenate their embeddings into a single vector and set the input size of the next layer to `embedding_dim * CONTEXT_SIZE` accordingly. Then, create the next layer.

In [131]:
embedding_dim = 20
linear = nn.Linear(embedding_dim*CONTEXT_SIZE, 128)

We have the two embeddings.

In [132]:
embeddings = genembedding(vocab)
my_embeddings = embeddings(torch.tensor(vocab(context)))
my_embeddings.shape

torch.Size([2, 20])

Reshape the embeddings into a single row vector.

In [133]:
my_embeddings = my_embeddings.reshape(1, -1)
my_embeddings.shape

torch.Size([1, 40])
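Reshaping a `(2, 20)` tensor to `(1, -1)` is the same as concatenating its two rows in order. The sketch below verifies this with a tensor of consecutive numbers standing in for real embeddings; the variable names are assumptions made for illustration.

```python
import torch

# Stand-in for the embeddings of a two-word context, each of dimension 20.
context_embeddings = torch.arange(40, dtype=torch.float32).reshape(2, 20)

flat = context_embeddings.reshape(1, -1)   # one row holding all 40 values
concatenated = torch.cat([context_embeddings[0], context_embeddings[1]])  # explicit concatenation

print(flat.shape)
print(torch.equal(flat[0], concatenated))  # both give the same 40 numbers in the same order
```

The first 20 entries of the flattened vector therefore belong to the first context word and the last 20 to the second, so the order of the context words matters to the layers that follow.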

They can now be used as inputs in the next layer.

In [134]:
linear(my_embeddings)

tensor([[-0.4028, -1.4974,  0.4740,  0.1218, -0.5506,  0.1006,  0.1310, -0.2432,
          0.0517, -0.7850,  0.3461, -0.0670,  0.5225,  0.4835,  1.0080,  0.6244,
          0.8042,  0.0659,  0.2579,  0.4803,  0.3325, -0.1497,  0.4614,  0.0488,
          0.2879, -0.0150, -0.3574,  1.1154,  0.8951,  1.0535,  0.3276, -0.2760,
          0.4410, -0.2273,  0.8928,  0.4261,  0.3102, -0.1316, -0.1292, -0.1616,
         -0.1769,  0.0392, -0.5153,  0.8410, -0.2673, -0.0222, -0.6228, -0.0354,
          0.7729, -0.5941, -0.0075,  0.1556,  0.8108, -0.2758,  0.0297,  1.0555,
          0.4652,  0.6695, -0.4220,  0.6633,  0.1155, -0.5431,  0.1106,  0.1090,
         -0.2540, -0.2349,  0.9288,  0.2240,  0.2332, -0.4005, -0.7519,  0.0186,
         -0.4358,  0.5097, -0.5526, -0.6808,  0.5598, -0.0052, -0.2793, -0.7719,
         -0.6207, -1.0954,  0.6852,  0.5970, -1.4211,  0.2467, -0.4849,  0.3938,
          0.1085, -0.8326, -0.3311, -0.7827, -0.1324,  0.0725, -0.1618,  0.1830,
          0.5206, -0.1482, -

### Batch Function

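Create a batch (collate) function to interface with the data loader. It receives a chunk of consecutive tokens and builds the context-target pairs inside that chunk. Because each chunk is processed on its own, the first `CONTEXT_SIZE` tokens of a batch serve only as context and never as targets within that batch. Note that `CONTEXT_SIZE` is redefined as 3 in this cell.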

In [142]:
from torch.utils.data import DataLoader 
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

CONTEXT_SIZE = 3    # Number of previous words used as context for prediction
BATCH_SIZE = 10     # Number of samples per training batch
EMBEDDING_DIM = 10  # Dimension of word embeddings

def collate_batch(batch):
    """
    Processes a batch of text data into input (context) and output (target) tensors
    for training a language model.

    The function extracts:
    - `context`: A list of word indices representing the context words for each target word.
    - `target`: A list of word indices representing the target word to predict.

    Parameters:
    batch (list): A list of tokenized words (strings).

    Returns:
    tuple: Two PyTorch tensors: (context_tensor, target_tensor)
           - context_tensor: Tensor of shape (batch_size - CONTEXT_SIZE, CONTEXT_SIZE),
             containing the word indices of context words.
           - target_tensor: Tensor of shape (batch_size - CONTEXT_SIZE,),
             containing the word indices of target words.
    """

    batch_size = len(batch)
    context, target = [], []

    # Loop through the batch, ensuring enough previous words exist for context 
    for i in range(CONTEXT_SIZE, batch_size):
        # Convert the target word to its index using the vocabulary 
        target.append(vocab([batch[i]]))

        # Convert the previous CONTEXT_SIZE words to indices using the vocabulary 
        context.append(vocab([batch[i - j - 1] for j in range(CONTEXT_SIZE)]))

    # Convert lists to PyTorch tensors and move them to the appropriate device (CPU/GPU)
    return torch.tensor(context).to(device), torch.tensor(target).to(device).reshape(-1)
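The effect of building pairs one batch at a time can be checked with plain Python. The sketch below mimics the indexing of `collate_batch` on integers that stand in for word indices; the helper `pairs_in_batch` and the toy sizes are assumptions made for this example.

```python
def pairs_in_batch(batch, context_size):
    """Mimic the indexing in collate_batch on plain Python items."""
    return [
        ([batch[i - j - 1] for j in range(context_size)], batch[i])
        for i in range(context_size, len(batch))
    ]

context_size, batch_size = 3, 10
toy_tokens = list(range(20))   # integers stand in for word indices

# Slice the token list into consecutive chunks, as a non-shuffling data loader would.
batches = [toy_tokens[k:k + batch_size] for k in range(0, len(toy_tokens), batch_size)]

batched_targets = [t for b in batches for _, t in pairs_in_batch(b, context_size)]
all_targets = [t for _, t in pairs_in_batch(toy_tokens, context_size)]

print(len(batched_targets), len(all_targets))           # 14 17
print(sorted(set(all_targets) - set(batched_targets)))  # [10, 11, 12]
```

Each batch of 10 tokens produces 7 pairs, which matches the `batch_size - CONTEXT_SIZE` shape stated in the docstring. The tokens at positions 10 to 12 are predicted when the whole list is processed at once but are skipped when it is split into batches.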

The last batch may be shorter than the others when the number of tokens is not a multiple of the batch size. To avoid this, we pad the token list so that its length becomes a multiple of `BATCH_SIZE`. Here the padding consists of the first few tokens of the song, appended to the end of the list.


In [143]:
Padding = (BATCH_SIZE - len(tokens) % BATCH_SIZE) % BATCH_SIZE  # 0 if the length is already a multiple
tokens_pad = tokens + tokens[0:Padding]
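The padding arithmetic can be checked on a toy list. The helpers `padding_needed` and `pad_with_start` are hypothetical names for this sketch, and the trailing `% batch_size` is a choice made here so that a list whose length is already a multiple of the batch size receives no padding.

```python
def padding_needed(n_tokens, batch_size):
    """Number of tokens to append so that n_tokens becomes a multiple of batch_size."""
    return (batch_size - n_tokens % batch_size) % batch_size

def pad_with_start(tokens, batch_size):
    """Append the first few tokens to the end of the list (wrap-around padding)."""
    return tokens + tokens[:padding_needed(len(tokens), batch_size)]

toy_tokens = list("abcdefghijklmnopqrstuvw")   # 23 single-letter tokens
padded = pad_with_start(toy_tokens, batch_size=10)

print(len(padded), padded[-7:])   # 30 tokens; the last 7 repeat the start of the list
print(padding_needed(20, 10))     # 0: nothing to add when the length is already a multiple
```

Wrapping around to the start keeps every padded position filled with real vocabulary words, at the cost of creating a few context-target pairs that span the end and the beginning of the text.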

Create the `DataLoader`:

In [144]:
dataloader = DataLoader(
    tokens_pad, batch_size = BATCH_SIZE, shuffle = False, collate_fn = collate_batch
)

### Multi-Class Neural Network

We define a PyTorch class for a multi-class neural network. For a given context, the network outputs one score (logit) per vocabulary word; applying a softmax turns these scores into the probability of each word being the next one. The number of classes therefore equals the number of distinct words in the vocabulary. The first layer is the embedding layer, and in addition to the final layer, one hidden layer is included.

In [145]:
class NGramLanguageModeler(nn.Module):
    """
    A neural network-based n-gram language model that predicts the next word 
    given a sequence of context words.

    This model consists of:
    - An embedding layer that converts word indices into dense vector representations.
    - A fully connected hidden layer with ReLU activation.
    - An output layer that produces one logit per vocabulary word.

    Parameters:
    vocab_size (int): The number of unique words in the vocabulary.
    embedding_dim (int): The size of the word embeddings (vector representation of words).
    context_size (int): The number of previous words used as context to predict the next word.
    """

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        
        # Store context size and embedding dimension 
        self.context_size = context_size 
        self.embedding_dim = embedding_dim 

        # Embedding layer: Maps word indices to dense vectors 
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Fully connected layer: Maps the concatenated context embeddings to a 128-unit hidden layer
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)

        # Output layer: Maps the hidden layer output to vocabulary size (one logit per word)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        """
        Forward pass of the model.

        Parameters:
        inputs (Tensor): A tensor of shape (batch_size, context_size) containing word indices.

        Returns:
        Tensor: A tensor of shape (batch_size, vocab_size) containing the unnormalized scores (logits) for the next word.
        """

        # Convert input word indices into dense vectors using the embedding layer 
        embeds = self.embeddings(inputs) # Shape: (batch_size, context_size, embedding_dim)

        # Reshape the embeddings into a single vector per input sample 
        embeds = torch.reshape(embeds, (-1, self.context_size * self.embedding_dim))
        # New shape: (batch_size, context_size * embedding_dim)

        # Apply first fully connected layer with ReLU activation 
        out = F.relu(self.linear1(embeds)) # Shape: (batch_size, 128)

        # Apply second fully connected layer to generate vocabulary-size logits 
        out = self.linear2(out)

        return out

Create a model.

In [146]:
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE).to(device)

Retrieve samples from the data loader object and input them into the neural network.

In [147]:
context, target = next(iter(dataloader))
print(context, target)
out = model(context)

tensor([[70, 58, 21],
        [74, 70, 58],
        [25, 74, 70],
        [69, 25, 74],
        [ 2, 69, 25],
        [20,  2, 69],
        [31, 20,  2]]) tensor([74, 25, 69,  2, 20, 31, 72])


Although the model is still untrained, inspecting the output helps clarify its structure. The first dimension of the output corresponds to the number of context-target pairs in the batch (here `BATCH_SIZE - CONTEXT_SIZE = 7`), and the second dimension holds one score (logit) per class, that is, per vocabulary word.

In [150]:
out.shape

torch.Size([7, 79])
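Since the model returns raw scores, a loss such as cross-entropy can consume them directly. This section does not show the training loop, so the use of `F.cross_entropy` here is an assumption. As a sanity check, the sketch below feeds perfectly uniform scores over 79 classes into the loss; the result equals ln 79, about 4.37, and an untrained model should report a value in that neighbourhood.

```python
import math

import torch
import torch.nn.functional as F

vocab_size = 79
uniform_logits = torch.zeros(7, vocab_size)               # equal score for every word
toy_targets = torch.tensor([74, 25, 69, 2, 20, 31, 72])   # target indices printed earlier

# cross_entropy applies log-softmax internally, so it expects logits, not probabilities.
loss = F.cross_entropy(uniform_logits, toy_targets)
print(loss.item(), math.log(vocab_size))
```

A loss well below this baseline after training indicates that the model has learned something about which words follow which contexts.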

Find the index with the highest score (and hence the highest probability) for each sample.

In [152]:
predicted_index = torch.argmax(out, 1)
predicted_index

tensor([28, 57, 57, 56, 11, 16, 48])

In [153]:
[index_to_token[i.item()] for i in predicted_index]

['i', 'any', 'any', 'your', 'desert', 'let', 'on']
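Taking the `argmax` of the raw scores is enough to pick the most likely word, because softmax preserves the ordering of the scores. The sketch below illustrates this with random values of the same shape as `out`; the tensor `toy_logits` and the fixed seed are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_logits = torch.randn(7, 79)        # random stand-in with the same shape as `out`

probs = F.softmax(toy_logits, dim=1)   # each row becomes a probability distribution

print(probs.sum(dim=1))                # every row sums to 1 (up to rounding)
print(torch.equal(probs.argmax(dim=1), toy_logits.argmax(dim=1)))
```

The explicit softmax is only needed when actual probabilities are required, for example to sample the next word instead of always choosing the top-scoring one.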

We will now create a function that repeats this prediction step to generate text from a seed string.

In [None]:
def write_song(model, my_song, number_of_words = 100):
    """
    Generates text using a trained n-gram language model.

    Given an initial text (`my_song`), the function generates additional words by 
    predicting the next word iteratively based on the trained model.

    Parameters:
    model (nn.Module): The trained n-gram language model.
    my_song (str): The initial seed text to start generating words.
    number_of_words (int): The number of words to generate (default: 100).

    Returns:
    str: The generated song lyrics as a string.
    """

    # Get the mapping from index to word for decoding predictions
    index_to_token = vocab.get_itos()

    # Generate one new word per iteration
    for i in range(number_of_words):
        with torch.no_grad(): # Disable gradient computation for inference
            # Prepare the input context by extracting the last CONTEXT_SIZE words from tokens 
            context = torch
