# **Building and Training a Feedforward Neural Network for Language Modeling**

Estimated time needed: **60** minutes

This project explores the use of Feedforward Neural Networks (FNNs) in language modeling. The primary objective is to build a neural network that learns word relationships and generates meaningful text sequences. The implementation is done using PyTorch, covering key aspects of Natural Language Processing (NLP), such as:
* Tokenization & Indexing: Converting text into numerical representations.
* Embedding Layers: Mapping words to dense vector representations for efficient learning.
* Context-Target Pair Generation (N-grams): Structuring training data for sequence prediction.
* Multi-Class Neural Network: Designing a model to predict the next word in a sequence.

The training process includes optimizing the model with loss functions and backpropagation techniques to improve accuracy and coherence in text generation. By the end of the project, you will have a working FNN-based language model capable of generating text sequences.

## Steps

1. Prepare the dataset.
2. Tokenize it (split the text into smaller units, here words).
3. Build a vocabulary from the tokens (map each word to an index).
4. Create a function `text_pipeline` that takes text as input and returns indices as output.
5. Create embeddings from the indices using `nn.Embedding`.



## Import required Libraries

In [1]:
import time
import re
import string

import numpy as np
import pandas as pd

import torch
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


import torch.nn as nn
import torch.nn.functional as f


import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize


import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


import warnings
def warn(*args,**kwargs):
    pass
warnings.warn = warn


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tinonturjamajumder/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/tinonturjamajumder/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [2]:
song= """We are no strangers to love
You know the rules and so do I
A full commitments what Im thinking of
You wouldnt get this from any other guy
I just wanna tell you how Im feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Weve known each other for so long
Your hearts been aching but youre too shy to say it
Inside we both know whats been going on
We know the game and were gonna play it
And if you ask me how Im feeling
Dont tell me youre too blind to see
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Weve known each other for so long
Your hearts been aching but youre too shy to say it
Inside we both know whats been going on
We know the game and were gonna play it
I just wanna tell you how Im feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you"""

In [3]:
def preprocess_string(s):
    """
    Preprocess a string by performing the following steps:
    1. Remove every character that is not a word character or whitespace
    2. Remove all whitespace
    3. Remove all numeric digits
    """

    # Removing all non-word characters (everything except letters, digits, underscores, and whitespace)
    # \w matches any word characters (letters, numbers, and underscores)
    # \s matches any whitespace characters
    # ^ inside [] negates the selection, so [^\w\s] matches anything that's not a word character or whitespace
    s = re.sub(r"[^\w\s]", '',s)

    # removing white spaces
    # \s+ matches one or more whitespace characters.
    s = re.sub(r"\s+",'',s)

    # removing digits
    # \d matches digits (0-9)
    s = re.sub(r"\d",'',s)


    return s
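
To see what the three substitutions do in practice, here is a small self-contained sketch that applies the same patterns to a few made-up strings. The helper name `clean_token` and the sample inputs are assumptions chosen for illustration; they are not part of the notebook.

```python
import re

def clean_token(s):
    # Same three substitutions as preprocess_string, repeated so this runs on its own
    s = re.sub(r"[^\w\s]", "", s)  # drop punctuation and symbols
    s = re.sub(r"\s+", "", s)      # drop all whitespace
    s = re.sub(r"\d", "", s)       # drop digits
    return s

samples = ["don't", "rock&roll!", "route 66", "snake_case"]
cleaned = [clean_token(s) for s in samples]
print(cleaned)  # ['dont', 'rockroll', 'route', 'snake_case']
```

Note that the underscore survives because `\w` matches it, and that the whitespace step removes spaces entirely, which is why the function is meant for single tokens rather than whole sentences.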

In [4]:
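def preprocess_words(text, tokenizer):
    """Tokenize the text, drop punctuation-only tokens, and clean each remaining token."""
    raw_tokens = tokenizer(text)
    cleaned = [preprocess_string(w) for w in raw_tokens if w not in string.punctuation]
    # Drop tokens that became empty after cleaning (for example "..." or "42")
    return [w for w in cleaned if w]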

In [5]:
tokens = preprocess_words(song,word_tokenize)

In [6]:
tokens[:10]

['We', 'are', 'no', 'strangers', 'to', 'love', 'You', 'know', 'the', 'rules']

## Indexing

In [7]:
tokenizer = get_tokenizer('basic_english')

In [8]:
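def tokenizetext(song):
    """Tokenize the input text (song) and build a vocabulary from the tokens.

    Steps:
        1. Tokenization: split the input text on whitespace and apply the tokenizer to each piece.
        2. Vocabulary building: build the vocabulary from the tokens with build_vocab_from_iterator,
           adding "<unk>" as a special token and making it the default index for unknown words.
    """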


    tokenized_song = map(tokenizer,song.split())

    vocab  = build_vocab_from_iterator(tokenized_song,specials = ["<unk>"])
    vocab.set_default_index(vocab["<unk>"])

    return vocab

In [9]:
vocab= tokenizetext(song)

In [10]:
list(vocab.get_itos())[:10]

['<unk>', 'gonna', 'you', 'never', 'and', 'tell', 'make', 'say', 'a', 'around']
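
The ordering above reflects token frequency: the special token comes first, followed by the most frequent words. A rough pure-Python sketch of that idea, using `collections.Counter` in place of torchtext, might look like the following. The function `build_simple_vocab`, the helper `lookup`, and the alphabetical tie-breaking rule are assumptions for illustration, not torchtext's actual implementation.

```python
from collections import Counter

def build_simple_vocab(tokens, specials=("<unk>",)):
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically (an assumption)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + ordered                     # index -> token
    stoi = {tok: idx for idx, tok in enumerate(itos)}   # token -> index
    return stoi, itos

text = "never gonna give you up never gonna let you down"
stoi, itos = build_simple_vocab(text.split())

def lookup(token):
    # Unknown tokens fall back to the index of "<unk>"
    return stoi.get(token, stoi["<unk>"])

print(itos[:4])                          # ['<unk>', 'gonna', 'never', 'you']
print(lookup("gonna"), lookup("Gonna"))  # 1 0
```

The last line shows why casing matters: a token that differs only in capitalization is treated as unknown unless the text is lowercased before lookup.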

In [11]:
tokens[:10],vocab(tokens[:10])

(['We', 'are', 'no', 'strangers', 'to', 'love', 'You', 'know', 'the', 'rules'],
 [0, 58, 70, 74, 25, 69, 0, 20, 31, 72])

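In the output above, `We` and `You` map to index 0 (`<unk>`) because the vocabulary was built with the `basic_english` tokenizer, which lowercases text, while `tokens` keeps its original casing. A function that converts raw text into indices with that same tokenizer avoids the mismatch: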

In [12]:
text_pipeline = lambda x: vocab(tokenizer(x))
text_pipeline(song)[:10]

[21, 58, 70, 74, 25, 69, 2, 20, 31, 72]

Convert an index back to its word:

In [13]:
index_to_token = vocab.get_itos()
index_to_token[1]

'gonna'

In [14]:
list(vocab.get_stoi())[:10]

['wouldnt',
 'what',
 'thinking',
 'rules',
 'no',
 'love',
 'if',
 'full',
 'from',
 'get']

In [15]:
val = vocab.get_stoi() # stoi is a dictionary basically
val['gonna']

1

## Embedding Layers

In [16]:
def embedding(vocab):
    """Create an embedding layer for the given vocabulary.

    The embedding layer maps each word index to a dense vector,
    allowing the model to learn semantic relationships between words.

    Parameters:
        vocab: the vocabulary; its length sets the number of embeddings.

    Returns:
        nn.Embedding: a PyTorch embedding layer with embedding dimension 20."""

    embedding_dimension = 20
    embedding = nn.Embedding(len(vocab),embedding_dimension)
    return embedding
    

In [17]:
embeddings = embedding(vocab)

type(embeddings)

torch.nn.modules.sparse.Embedding

In [18]:
embeddings(torch.tensor(10))

tensor([-0.8908, -2.2832, -0.1526, -0.1233, -0.6383,  0.4833, -0.8340, -1.2459,
         0.1786,  0.5443,  0.9476,  0.2217,  1.5063,  1.1882, -0.4161,  1.3374,
         0.4252, -0.1103,  0.7669,  0.1497], grad_fn=<EmbeddingBackward0>)

In [19]:
for i in range(2):
    embed = embeddings(torch.tensor(i))
    print(f"word: {index_to_token[i]}")
    print(f"index: {i}")
    print(f"embedding: {embed}")
    print(f"shape: {embed.shape}")

word: <unk>
index: 0
embedding: tensor([-0.6442, -0.2756,  1.0475, -0.2907, -1.0958,  1.6039, -0.1098,  0.5874,
         0.9032, -0.6276,  0.6190, -0.7328, -3.0555,  0.3898,  0.1277, -2.1054,
         0.9229,  0.7712,  0.1017,  0.2045], grad_fn=<EmbeddingBackward0>)
shape: torch.Size([20])
word: gonna
index: 1
embedding: tensor([ 7.3413e-01, -1.6966e+00, -1.7799e+00, -6.3697e-01,  3.3318e-02,
         1.5877e-03,  1.1237e+00,  1.8577e+00,  4.6164e-01,  4.7497e-01,
        -3.7971e-01, -5.8951e-01, -6.2694e-02, -1.5155e+00, -2.9917e-01,
         8.7756e-02, -7.8534e-01, -1.7775e-01,  1.4298e-01,  3.0548e-01],
       grad_fn=<EmbeddingBackward0>)
shape: torch.Size([20])
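
An embedding layer is essentially a trainable lookup table: each index selects one row of its weight matrix. The sketch below illustrates this with toy sizes (a vocabulary of 5 and 4 dimensions, both assumed for illustration) and also shows that several indices can be looked up at once.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable
vocab_size, embedding_dim = 5, 4  # toy sizes, assumed for illustration
emb = nn.Embedding(vocab_size, embedding_dim)

idx = torch.tensor([1, 3, 3])
vectors = emb(idx)  # one row per index

print(vectors.shape)  # torch.Size([3, 4])
# The lookup returns the corresponding rows of the weight matrix
print(torch.equal(vectors, emb.weight[idx]))  # True
```

Because the rows are parameters of the layer, they start out random (as in the outputs above) and are updated by backpropagation during training.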


## Generating Context-Target Pairs (n_grams)

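Organize the words into context-target pairs with a configurable context size. Each target word sits at position `i`; its context consists of the `CONTEXT_SIZE` preceding words `tokens[i-j-1]` for `j = 0, ..., CONTEXT_SIZE-1`, so the nearest preceding word comes first.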

In [20]:
# Define the size for generating n-gram
CONTEXT_SIZE  = 2

def gengrams(tokens):
    """
    Parameters:
    tokens(list): A list of preprocessed word tokens.

    Returns:
        List: A list of tuples representing n-grams.
        Each tuple contains (context_words, target_word)
    """
    ngrams = [
        (
            [tokens[i - j - 1] for j in range(CONTEXT_SIZE)],  # context words, nearest first
            tokens[i],  # target word
        )
        for i in range(CONTEXT_SIZE, len(tokens))
    ]
    return ngrams
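
As a quick illustration of the pairs this produces, the same comprehension can be run on a five-word toy sequence. The names `toy_tokens` and `pairs` are assumptions used only in this sketch.

```python
CONTEXT_SIZE = 2

toy_tokens = ["never", "gonna", "give", "you", "up"]
pairs = [
    ([toy_tokens[i - j - 1] for j in range(CONTEXT_SIZE)], toy_tokens[i])
    for i in range(CONTEXT_SIZE, len(toy_tokens))
]
for context, target in pairs:
    print(context, "->", target)
# ['gonna', 'never'] -> give
# ['give', 'gonna'] -> you
# ['you', 'give'] -> up
```

The context is listed nearest word first, and a sequence of `n` tokens yields `n - CONTEXT_SIZE` pairs.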

In [21]:
tokens[2]

'no'

In [22]:
bi_gram  = gengrams(tokens)

In [23]:
type(bi_gram)

list

In [24]:
def bigram_represent(list_length):
    for i in range(list_length):
        context,target = bi_gram[i]
        print(f" Context: {context}, Target: {target}")
        # context is already a list of words, so pass it to vocab directly
        print(f"Context index:{vocab(context)},Target index:{vocab([target])}")
        

`bi_gram` is a list in which each element is a tuple of the context words (two words) and the target word.

## Middle Linear Layer

Concatenate the embeddings of the context words into a single vector and set the input size of the next layer accordingly (`embedding_dim * CONTEXT_SIZE`). Then create the next layer.

In [25]:
embedding_dim = 20

# Next layer
linear  = nn.Linear(embedding_dim*CONTEXT_SIZE,128)

In [31]:
# The context would feed to the model after reshaping
context,target = bi_gram[0]
embeddings = embedding(vocab)
my_embedd = embeddings(torch.tensor(vocab(context)))
my_embedd.shape # two words in a context, each word has 20 feature embedding

torch.Size([2, 20])

## Reshape the embeddings

In [33]:
my_embedd = my_embedd.reshape(-1,embedding_dim*CONTEXT_SIZE)
my_embedd.shape

torch.Size([1, 40])
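
The reshape is equivalent to placing the context embeddings side by side. The sketch below checks this on random stand-in vectors (not the notebook's actual embeddings) and shows the same operation on a batch of contexts; the batch size of 7 is an arbitrary choice for illustration.

```python
import torch

CONTEXT_SIZE, embedding_dim = 2, 20
# Stand-in for the two embedding rows of one context
context_vectors = torch.randn(CONTEXT_SIZE, embedding_dim)

flat = context_vectors.reshape(-1, embedding_dim * CONTEXT_SIZE)
concatenated = torch.cat([context_vectors[0], context_vectors[1]]).unsqueeze(0)

print(flat.shape)                       # torch.Size([1, 40])
print(torch.equal(flat, concatenated))  # True

# With a batch of contexts, the same reshape gives one row per example
batch = torch.randn(7, CONTEXT_SIZE, embedding_dim)
batch_flat = batch.reshape(-1, embedding_dim * CONTEXT_SIZE)
print(batch_flat.shape)                 # torch.Size([7, 40])
```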

The reshaped embeddings can now be used as input to the next layer.

In [34]:
linear(my_embedd)

tensor([[-1.3104e-01,  3.5650e-01,  5.9844e-01, -1.2299e-01,  4.0805e-01,
          4.5537e-01, -7.3124e-01, -2.9037e-01,  7.0898e-01, -3.5957e-01,
          1.3983e+00, -6.0154e-01, -4.5204e-01,  2.0808e-01, -4.2497e-02,
         -3.2640e-01,  3.1082e-02, -2.8293e-01, -1.1882e-01, -1.1991e+00,
         -3.8555e-01, -1.2894e-01, -1.0768e-01, -3.5823e-01, -1.7548e-01,
         -5.9122e-01, -5.7726e-02,  2.5429e-01, -4.3557e-01, -2.2497e-01,
         -1.0967e+00,  3.7419e-01,  1.3247e+00,  2.7752e-01, -6.7569e-01,
         -1.2808e-01, -8.4872e-01, -7.0129e-01,  4.4884e-01,  3.2655e-01,
         -1.5790e-01,  1.2148e-01, -6.2889e-02, -3.6183e-03, -3.6938e-01,
          5.1426e-01,  1.8631e-01,  1.2056e-01, -5.4035e-01, -1.0643e+00,
          6.5548e-02,  3.9362e-01,  6.8724e-03, -8.9634e-01,  8.2410e-01,
          1.2486e-01,  2.2007e-01, -5.2788e-01,  1.3501e-01, -5.7974e-01,
          2.2731e-01, -5.1327e-02,  2.1600e-02, -4.5100e-02, -8.2929e-01,
          1.0798e+00,  5.2450e-01, -3.
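
Putting the pieces together, one possible way to wrap the embedding, the reshape, and the linear layers into a single model is sketched below. The class name `NGramLanguageModel`, the ReLU activation, the second linear layer, and the placeholder vocabulary size of 80 are all assumptions for illustration; the notebook itself has not defined a model at this point.

```python
import torch
import torch.nn as nn

class NGramLanguageModel(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim=128):
        super().__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_indices):
        # (batch, context_size) -> (batch, context_size, embedding_dim)
        embedded = self.embeddings(context_indices)
        # Concatenate the context embeddings into one vector per example
        flat = embedded.reshape(-1, self.context_size * self.embedding_dim)
        hidden = torch.relu(self.linear1(flat))
        # Unnormalized scores (logits) over the vocabulary
        return self.linear2(hidden)

model = NGramLanguageModel(vocab_size=80, embedding_dim=20, context_size=2)
contexts = torch.randint(0, 80, (7, 2))  # random stand-in context indices
targets = torch.randint(0, 80, (7,))     # random stand-in target indices
logits = model(contexts)
loss = nn.CrossEntropyLoss()(logits, targets)
print(logits.shape)  # torch.Size([7, 80])
```

`nn.CrossEntropyLoss` expects raw logits and a 1-D tensor of class indices, which is why the model returns unnormalized scores and the targets are not one-hot encoded.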

## Batch Function

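Create a batch function (`collate_fn`) to interface with the DataLoader. Because every target word needs `CONTEXT_SIZE` preceding words, the first `CONTEXT_SIZE` words of each batch can serve only as context, so some adjustment is needed to handle words that are part of a context in one batch and a predicted word in the following batch.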

In [37]:
CONTEXT_SIZE = 3 # Number of preceding words used as context for each target
BATCH_SIZE = 10 # Number of samples per training batch
EMBEDDING_DIM  = 10 # Dimensions of word embedding
device = torch.device("cpu")

def collate_fn(batch):
    """
    Process a batch of text data into input(context) and output(target) tensors for training a language model.

    The function extracts:
        - context: A list of word indices representing the context words for each target word.
        - target: A list of word indices representing the target word to predict

    Parameters:
        batch (List): A List of tokenized words (strings).

    Returns:
        tuple: Two PyTorch tensors: (context_tensor, target_tensor)
            - context tensor: Tensor of shape (batch_size - context_size, context_size)
              containing the words indices of context words,
            - target tensor: Tensor of shape (batch_size - context_size)
              containing the word indices of target words.

    """
    batch_size = len(batch)  # number of tokens in the batch
    context, target = [], []  # lists of context indices and target indices

    # Loop through the batch, ensuring enough previous words exist for context
    for i in range(CONTEXT_SIZE, batch_size):

        # Convert the target word to its index (a plain int, so the target tensor is 1-D)
        target.append(vocab[batch[i]])

        # Convert the previous CONTEXT_SIZE words to indices using the vocabulary
        context.append(vocab([batch[i-j-1] for j in range(CONTEXT_SIZE)]))

    # Convert lists to PyTorch tensors and move them to appropriate device
    return torch.tensor(context).to(device),torch.tensor(target).to(device)
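
To show how such a function plugs into a `DataLoader`, here is a self-contained sketch that swaps the torchtext vocabulary for a plain dictionary. The names `toy_vocab`, `toy_collate`, and the 20 placeholder words are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader

CONTEXT_SIZE = 3
BATCH_SIZE = 10

# Toy stand-in for the vocabulary: token -> index, with 0 reserved for unknown words
words = [f"w{k}" for k in range(20)]
toy_vocab = {w: k + 1 for k, w in enumerate(words)}

def toy_collate(batch):
    context, target = [], []
    for i in range(CONTEXT_SIZE, len(batch)):
        target.append(toy_vocab.get(batch[i], 0))
        context.append([toy_vocab.get(batch[i - j - 1], 0) for j in range(CONTEXT_SIZE)])
    return torch.tensor(context), torch.tensor(target)

# A plain list of strings works as a map-style dataset
loader = DataLoader(words, batch_size=BATCH_SIZE, shuffle=False, collate_fn=toy_collate)

shapes = [(tuple(c.shape), tuple(t.shape)) for c, t in loader]
print(shapes)  # [((7, 3), (7,)), ((7, 3), (7,))]
```

Each batch of 10 tokens yields only 7 training examples, because the first `CONTEXT_SIZE` tokens of a batch have no complete context inside that batch.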

In [39]:
"""
Example: with batch_size = 10 and len(tokens) = 56,
padding = 10 - (56 % 10) = 10 - 6 = 4.
The first 4 tokens are appended to the end of the token list so that its length
becomes a multiple of the batch size; wrapping around like this is a common
practice when generating music and text.
"""
padding = BATCH_SIZE - (len(tokens)%BATCH_SIZE)
tokken_pad = tokens+tokens[0:padding]
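
The arithmetic can be checked on its own with a short sketch. The helper names `pad_length` and `pad_length_exact` are assumptions; the second variant is an alternative not used in the notebook, shown only to highlight the edge case where the token count is already a multiple of the batch size.

```python
BATCH_SIZE = 10

def pad_length(n_tokens, batch_size):
    # Formula used above: the result is always between 1 and batch_size
    return batch_size - (n_tokens % batch_size)

def pad_length_exact(n_tokens, batch_size):
    # Alternative (an assumption, not from the notebook): 0 when already a multiple
    return (-n_tokens) % batch_size

print(pad_length(56, BATCH_SIZE), pad_length_exact(56, BATCH_SIZE))  # 4 4
print(pad_length(60, BATCH_SIZE), pad_length_exact(60, BATCH_SIZE))  # 10 0

toy_tokens = [f"t{k}" for k in range(56)]
padded = toy_tokens + toy_tokens[:pad_length(len(toy_tokens), BATCH_SIZE)]
print(len(padded))  # 60
```

With the notebook's formula, a token list whose length is already a multiple of the batch size gains one full extra batch of wrapped-around tokens, which is harmless but worth knowing.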


## Create the DataLoader

## Classification

In [None]:
def collate_function():
    