# **Building and Training a Feedforward Neural Network for Language Modeling**

Estimated time needed: **60** minutes

This project explores the use of Feedforward Neural Networks (FNNs) in language modeling. The primary objective is to build a neural network that learns word relationships and generates meaningful text sequences. The implementation uses PyTorch and covers key aspects of Natural Language Processing (NLP):
* Tokenization & Indexing: Converting text into numerical representations.
* Embedding Layers: Mapping words to dense vector representations for efficient learning.
* Context-Target Pair Generation (N-grams): Structuring training data for sequence prediction.
* Multi-Class Neural Network: Designing a model to predict the next word in a sequence.

The training process includes optimizing the model with loss functions and backpropagation techniques to improve accuracy and coherence in text generation. By the end of the project, you will have a working FNN-based language model capable of generating text sequences.
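
The context-target pairing listed above can be illustrated with a minimal sketch in plain Python. The helper name `make_context_target_pairs` and the default `context_size=2` are assumptions made for illustration, not names taken from this project.

```python
def make_context_target_pairs(tokens, context_size=2):
    """Pair every token with the `context_size` tokens that precede it."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]  # the preceding words
        target = tokens[i]                    # the word the model should predict
        pairs.append((context, target))
    return pairs


example_tokens = ["never", "gonna", "give", "you", "up"]
pairs = make_context_target_pairs(example_tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

A larger `context_size` gives the model more preceding words per prediction, at the cost of fewer training pairs from the same text.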

## Import Required Libraries

In [10]:
import time
import re
import string

import numpy as np
import pandas as pd

import torch
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


import torch.nn as nn
import torch.nn.functional as f


import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize


import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


import warnings
def warn(*args,**kwargs):
    pass
warnings.warn = warn


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tinonturjamajumder/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/tinonturjamajumder/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [11]:
song= """We are no strangers to love
You know the rules and so do I
A full commitments what Im thinking of
You wouldnt get this from any other guy
I just wanna tell you how Im feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Weve known each other for so long
Your hearts been aching but youre too shy to say it
Inside we both know whats been going on
We know the game and were gonna play it
And if you ask me how Im feeling
Dont tell me youre too blind to see
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Weve known each other for so long
Your hearts been aching but youre too shy to say it
Inside we both know whats been going on
We know the game and were gonna play it
I just wanna tell you how Im feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you"""

In [12]:
def preprocess_string(s):
    """
    Preprocess a string by performing the following steps:
    1. Remove every character that is not a word character or whitespace
    2. Remove all whitespace
    3. Remove all numeric digits
    """

    # Remove punctuation and other symbols
    # \w matches any word character (letters, digits, and underscores)
    # \s matches any whitespace character
    # ^ inside [] negates the selection, so [^\w\s] matches anything that is not a word character or whitespace
    s = re.sub(r"[^\w\s]", '', s)

    # Remove whitespace
    # \s+ matches one or more whitespace characters
    s = re.sub(r"\s+", '', s)

    # Remove digits
    # \d matches digits (0-9)
    s = re.sub(r"\d", '', s)


    return s
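
To see what these three substitutions do in practice, here is a small self-contained sketch. The function `clean_token` is a hypothetical stand-in that applies the same three regular expressions, and the sample strings are made up for illustration.

```python
import re


def clean_token(token):
    # Same three steps as above: strip symbols, then whitespace, then digits
    token = re.sub(r"[^\w\s]", "", token)
    token = re.sub(r"\s+", "", token)
    token = re.sub(r"\d", "", token)
    return token


samples = ["Don't", "you're", "track 07", "!!!"]
cleaned = [clean_token(s) for s in samples]
print(cleaned)

# Whitespace removal merges words, so the function suits single tokens,
# not whole sentences
merged = clean_token("give you up")
print(merged)
```

Because the second step deletes all whitespace, a whole sentence collapses into one string. This is why the cleaning is best applied to one token at a time, after tokenization.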

In [13]:
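def preprocess_words(words, tokenizer):
    tokens = tokenizer(words)
    tokens = [preprocess_string(w) for w in tokens]
    # Lowercase and drop tokens that are empty after cleaning or are bare punctuation
    return [w.lower() for w in tokens if len(w) != 0 and w not in string.punctuation]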

In [14]:
tokens = preprocess_words(song,word_tokenize)

In [16]:
tokens[:10]

['we', 'are', 'no', 'strangers', 'to', 'love', 'you', 'know', 'the', 'rules']

## Indexing

In [20]:
tokenizer = get_tokenizer('basic_english')

In [27]:
def tokenizetext(song):
    """Tokenize the input text (song) and build a vocabulary from the tokens.

    Steps:
        1. Tokenization: split the input text into words and apply the tokenizer to each word.
        2. Vocabulary building: build the vocabulary from the tokenized words, add the special
           token "<unk>", and make it the default index for out-of-vocabulary words.
    """


    tokenized_song = map(tokenizer,song.split())

    vocab  = build_vocab_from_iterator(tokenized_song,specials = ["<unk>"])
    vocab.set_default_index(vocab["<unk>"])

    return vocab
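
For intuition about what a frequency-ordered vocabulary looks like, here is a rough plain-Python approximation built on `collections.Counter`. It is a sketch under stated assumptions, not torchtext's implementation: `build_simple_vocab` is a hypothetical helper, and the alphabetical tie-breaking rule is a choice made here.

```python
from collections import Counter


def build_simple_vocab(tokens, specials=("<unk>",)):
    """Return (stoi, itos) with specials first, then tokens by descending frequency."""
    counts = Counter(tokens)
    # Assumption: ties in frequency are broken alphabetically
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


toy_tokens = "never gonna give you up never gonna let you down".split()
stoi, itos = build_simple_vocab(toy_tokens)
print(itos)
print(stoi["gonna"])
```

Placing the special token first gives `<unk>` index 0, matching the index it has in the vocabulary built in this section.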

In [31]:
vocab= tokenizetext(song)

In [32]:
list(vocab.get_itos())[:10]

['<unk>', 'gonna', 'you', 'never', 'and', 'tell', 'make', 'say', 'a', 'around']

In [34]:
tokens[:10],vocab(tokens[:10])

(['we', 'are', 'no', 'strangers', 'to', 'love', 'you', 'know', 'the', 'rules'],
 [21, 58, 70, 74, 25, 69, 2, 20, 31, 72])

Define a function that converts raw text into indices:

In [35]:
text_pipeline = lambda x: vocab(tokenizer(x))
text_pipeline(song)[:10]

[21, 58, 70, 74, 25, 69, 2, 20, 31, 72]
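
The default index set on the vocabulary matters when the pipeline meets a word that never appeared in the song. The sketch below mimics that fallback with a plain dictionary. `toy_stoi` and `encode` are hypothetical names, and the tiny mapping only reuses the first few indices shown earlier.

```python
toy_stoi = {"<unk>": 0, "gonna": 1, "you": 2, "never": 3}


def encode(text, stoi, unk_token="<unk>"):
    """Map whitespace-separated words to indices, falling back to the unknown index."""
    unk_index = stoi[unk_token]
    return [stoi.get(tok, unk_index) for tok in text.lower().split()]


encoded = encode("Never gonna surprise you", toy_stoi)
print(encoded)
```

The word "surprise" is not in the toy mapping, so it is encoded as 0, the index of `<unk>`.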

Convert indices back to words:

In [44]:
index_to_token = vocab.get_itos()
index_to_token[1]

'gonna'

In [38]:
list(vocab.get_stoi())[:10]

['wouldnt',
 'what',
 'thinking',
 'rules',
 'no',
 'love',
 'if',
 'full',
 'from',
 'get']

In [50]:
val = vocab.get_stoi()  # get_stoi() returns a dict mapping each token to its index
val['gonna']

1

## Embedding Layers

In [51]:
def embedding(vocab):
    """Create an embedding layer for the given vocabulary.

    The embedding layer maps each word index to a dense vector,
    allowing the model to learn semantic relationships between words.

    Parameters: vocab: the vocabulary; its length sets the number of embedding rows

    Returns: nn.Embedding: a PyTorch embedding layer with embedding dimension 20"""

    embedding_dimension = 20
    embedding = nn.Embedding(len(vocab),embedding_dimension)
    return embedding
    

In [54]:
embeddings = embedding(vocab)

type(embeddings)

torch.nn.modules.sparse.Embedding

In [58]:
embeddings(torch.tensor(10))

tensor([-0.0099, -0.6683, -2.2261,  0.9901, -0.5307, -0.9852, -0.9635,  0.9964,
        -0.5094, -0.3424,  1.0937, -0.0086, -1.3811,  1.7139,  0.7446, -1.8681,
         0.9568, -0.2175, -0.5106, -0.5009], grad_fn=<EmbeddingBackward0>)

In [55]:
for i in range(2):
    embed = embeddings(torch.tensor(i))
    print(f"word: {index_to_token[i]}")
    print(f"index: {i}")
    print(f"embedding: {embed}")
    print(f"shape: {embed.shape}")

word: <unk>
index: 0
embedding: tensor([-0.3025, -1.1382, -0.6952,  0.1230, -0.6399, -1.4001, -0.2017, -0.2257,
         0.1735, -1.0885,  0.0337,  0.2305, -2.4502, -0.6648, -0.8209,  0.2909,
        -0.9861, -1.0239, -0.9221,  0.1631], grad_fn=<EmbeddingBackward0>)
shape: torch.Size([20])
word: gonna
index: 1
embedding: tensor([ 0.8616, -1.0000, -0.9920,  0.3004, -0.1208,  0.1646,  0.9774, -0.1461,
        -0.8724,  0.1647, -1.7226, -0.0758,  0.7681,  1.9825, -0.4615, -0.3361,
         1.1324,  0.2438,  1.2179,  1.3383], grad_fn=<EmbeddingBackward0>)
shape: torch.Size([20])
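
An embedding layer is a trainable lookup table: passing an index returns the matching row of its weight matrix. The sketch below checks this on a toy layer and looks up a batch of contexts at once. The sizes (5 words, 3 dimensions) and the final flattening step are illustrative assumptions about how context embeddings might be fed to a feedforward layer, not code from this project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initial weights are repeatable

toy_embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

# A single index returns exactly one row of the weight matrix
single = toy_embedding(torch.tensor(2))
print(torch.equal(single, toy_embedding.weight[2]))

# A batch of 2 contexts with 2 word indices each yields one vector per index
context = torch.tensor([[1, 2], [3, 4]])
batch = toy_embedding(context)
print(batch.shape)

# Assumption: concatenate the context vectors into one input row per example
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)
```

The weights start as random values, which is why the vectors printed in this section carry no meaning yet. They only become informative once training updates them.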


In [None]:
## Generating Context-Target Pairs (