# **Generate Texts Like Rabindranath Tagore**

In [3]:
# lets start...

# Introduction

Natural Language Processing (NLP) is a field of artificial intelligence concerned with the interaction between computers and human language. One of its core tasks is language modelling, where the goal is to predict the next word in a sequence. This notebook explores language modelling (sentence completion) with a multi-layer BiLSTM model trained on Bengali texts written by Rabindranath Tagore, whose works are among the greatest in Bengali literature.

This notebook is beginner-friendly and aimed at anyone interested in NLP tasks for Indian languages.

## Objective

My objective is to predict the next word in a given sequence, using a corpus of Rabindranath Tagore's writings, for sentence completion and text generation. The dataset consists of approximately 1 lakh (100,000) words, and the challenge is to use it to train a BiLSTM model for accurate next-word prediction.

## Dataset

The dataset used in this notebook comprises essays and novels written by Rabindranath Tagore. It includes around 1 lakh words, providing a rich Bengali-language resource for training and evaluating the model.





---


Let's dive into the data and start exploring text generation in Bengali with Rabindranath Tagore's writings!


# Acknowledgements

I would like to express my gratitude to the following:
- The Kaggle community for providing a platform to share and collaborate on data science projects.
- The creators of the dataset for making it publicly available.
- Rabindranath Tagore, whose literary works have been an inspiration and a source of rich textual data for this project.

# References

For this work I used the following resources for preprocessing, model development, and training:
- [Bengali_Text_Preprocessing_for_Language_Modeling](https://www.kaggle.com/code/sayankr007/bengali-text-preprocessing-for-language-modeling)  by  [sayankr007](https://www.kaggle.com/sayankr007)
- [Next_Word_Prediction_using_LSTM](https://www.youtube.com/watch?v=NYUIxVQa7TE)  by  [CampusX](https://learnwith.campusx.in/)

# Methodology

## Necessary Imports

In [4]:
# general libraries
import os
import random
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

In [5]:
# ML/DL Libraries
from nltk.tokenize import word_tokenize

import torch
import torchtext
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.nn.functional import one_hot
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import TensorDataset, DataLoader, random_split
from torch.optim.lr_scheduler import ReduceLROnPlateau

## Loading and Cleaning

### Load Data

In [6]:
file_path = "/kaggle/input/complete-works-of-rabindranath-tagore/txt/essay.txt"

with open(file_path, 'r', encoding='utf-8') as f:  # read the Bengali text explicitly as UTF-8
    text = f.read()

In [7]:
# Sample Raw Dataset..
text[0:200]

'আশ্রমের রূপ ও বিকাশ ২\nশিলাইদহে পদ্মাতীরে সাহিত্যচর্চা নিয়ে নিভৃতে বাস করতুম। একটা সৃষ্টির সংকল্প নিয়ে সেখান থেকে এলেম শান্তিনিকেতনের প্রান্তরে।\nতখন আশ্রমের পরিধি ছিল ছোটো। তার দক্ষিণ সীমানায় দীর্ঘ সার'

### Cleaning the Data

### Remove Expressions (Whitespace and Punctuation)

*We use Python's regular-expression module, `re`, for this.*

In [8]:
# replace digits, selected punctuation marks and runs of whitespace with a single space

whitespace = re.compile(r"[\s\u0020\u00a0\u1680\u180e\u202f\u205f\u3000\u2000-\u200a]+", re.UNICODE)  # raw string avoids the invalid "\s" escape warning
bangla_fullstop = u"\u0964"  # Unicode code point of the Bengali full stop (danda)
punctSeq = u"['\"“”‘’]+|[.…-]+|[:;]+"  # quotes, dots/ellipses/hyphens, colons/semicolons (commas and the danda are kept)
bengali_numeral_pattern = r'[১-৯০]+' # Regular expression pattern to match Bengali numerals



corpus = re.sub(bengali_numeral_pattern," ",text)   # remove runs of Bengali numerals
corpus = re.sub(r'\s*\d+\s*', ' ', corpus)          # remove any remaining digits (e.g. ASCII 0-9) with their surrounding whitespace

corpus = re.sub('\n'," ",corpus)                                # replace newlines with spaces
corpus = re.sub(punctSeq," ",corpus)                            # replace the punctuation listed above with spaces
corpus = whitespace.sub(" ",corpus).strip()                     # collapse whitespace runs; strip() removes leading and trailing whitespace
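
To see what these substitutions do on a small input, here is a minimal, self-contained sketch that applies the same kind of patterns to a short made-up string. The helper name `clean_text`, the variable names, and the sample sentence are illustrative assumptions and not part of the original notebook.

```python
import re

# Patterns modelled on the cleaning cell above (names here are illustrative).
whitespace = re.compile(r"[\s\u00a0\u1680\u180e\u202f\u205f\u3000\u2000-\u200a]+")
punct_seq = "['\"“”‘’]+|[.…-]+|[:;]+"
bengali_digits = r"[০-৯]+"


def clean_text(raw):
    """Replace digits, selected punctuation and whitespace runs with single spaces."""
    out = re.sub(bengali_digits, " ", raw)   # Bengali numerals
    out = re.sub(r"\d+", " ", out)           # any remaining digits, e.g. ASCII
    out = re.sub(punct_seq, " ", out)        # quotes, dots, hyphens, colons, semicolons
    return whitespace.sub(" ", out).strip()  # collapse whitespace and trim the ends


sample = "আশ্রমের রূপ ও বিকাশ ২\nশিলাইদহে  পদ্মাতীরে: 'বাস' করতুম।"
print(clean_text(sample))
```

Note that the Bengali full stop and commas survive the cleaning, which is what lets the later step split the corpus into sentences.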

In [9]:
# Sample Corpus after Pre-processing
corpus[0:500]

'আশ্রমের রূপ ও বিকাশ শিলাইদহে পদ্মাতীরে সাহিত্যচর্চা নিয়ে নিভৃতে বাস করতুম। একটা সৃষ্টির সংকল্প নিয়ে সেখান থেকে এলেম শান্তিনিকেতনের প্রান্তরে। তখন আশ্রমের পরিধি ছিল ছোটো। তার দক্ষিণ সীমানায় দীর্ঘ সার বাঁধা শালগাছ। মাধবীলতা বিতানে প্রবেশের দ্বার। পিছনে পুব দিকে আমবাগান, পশ্চিম দিকে কোথাও বা তাল, কোথাও বা জাম, কোথাও বা ঝাউ, ইতস্তত গুটিকয়েক নারকেল। উত্তরপশ্চিম প্রান্তে প্রাচীন দুটি ছাতিমের তলায় মার্বেল পাথরে বাঁধানো একটি নিরলংকৃত বেদী। তার সামনে গাছের আড়াল নেই, দিগন্ত পর্যন্ত অবারিত মাঠ, সে মাঠে তখন'

## Text PreProcessing

**Note** : 
*The data-cleaning and text-preprocessing steps for this dataset are discussed in detail, with a tutorial, in the following notebook -> [Bengali_Text_Preprocessing_for_Language_Modeling](https://www.kaggle.com/code/sayankr007/bengali-text-preprocessing-for-language-modeling)*

### Tokenization

*We use NLTK's `word_tokenize` function for this task.*

In [10]:
# Tokenize sentences based on the Bengali fullstop symbol
sentences = corpus.split(bangla_fullstop)

# Remove empty strings and add the period symbol back to each sentence
sentences = [sentence.strip() + " " + bangla_fullstop for sentence in sentences if sentence.strip()]

# Tokenize each sentence into words
doc = [word_tokenize(sentence) for sentence in sentences]

In [11]:
print("Before Tokenization : \n", corpus[-386:],"\n")
print("After Tokenization : \n", doc[-3:])

Before Tokenization : 
 কিন্তু আমরা তো বিজ্ঞানী নই, বুঝতে পারি নে হঠাৎ অঙ্কের আরম্ভ হয় কোথা থেকে, একেবারে শেষই বা হয় কোন্ খানে। সম্পূর্ণ সংঘটিত বিশ্বকে নিয়ে হঠাৎ কালের আরম্ভ হল আর সদ্যোলুপ্ত বিশ্বের সঙ্গে কালের সম্পূর্ণ অন্ত হবে, আমাদের বুদ্ধিতে এর কিনারা পাই নে। বিজ্ঞানী বলবেন, বুদ্ধির কথা এখানে আসছে না, এ হল গণনার কথা সে গণনা বর্তমান ঘটনাধারার উপরে প্রতিষ্ঠিত এর আদি অন্তে যদি অন্ধকার দেখি তা হলে উপায় নেই। 

After Tokenization : 
 [['কিন্তু', 'আমরা', 'তো', 'বিজ্ঞানী', 'নই', ',', 'বুঝতে', 'পারি', 'নে', 'হঠাৎ', 'অঙ্কের', 'আরম্ভ', 'হয়', 'কোথা', 'থেকে', ',', 'একেবারে', 'শেষই', 'বা', 'হয়', 'কোন্', 'খানে', '।'], ['সম্পূর্ণ', 'সংঘটিত', 'বিশ্বকে', 'নিয়ে', 'হঠাৎ', 'কালের', 'আরম্ভ', 'হল', 'আর', 'সদ্যোলুপ্ত', 'বিশ্বের', 'সঙ্গে', 'কালের', 'সম্পূর্ণ', 'অন্ত', 'হবে', ',', 'আমাদের', 'বুদ্ধিতে', 'এর', 'কিনারা', 'পাই', 'নে', '।'], ['বিজ্ঞানী', 'বলবেন', ',', 'বুদ্ধির', 'কথা', 'এখানে', 'আসছে', 'না', ',', 'এ', 'হল', 'গণনার', 'কথা', 'সে', 'গণনা', 'বর্তমান', 'ঘটনাধারার', 'উপরে', 'প্রতিষ্ঠিত', 'এর', 'আদি', 

In [12]:
print("Total length of the Document -> ", len(doc))

Total length of the Document ->  98304


In [13]:
# Working with Half Data due to limited resource
# process and train the model twice, with half data each time...

doc = doc[:50000]
# doc = doc[50000: ]

# NOTE : You can train the whole dataset if enough Computational Resources 
#        are available to you..

### Build Vocab from Iterator

In [15]:
word_vocab = torchtext.vocab.build_vocab_from_iterator(
    doc,
    min_freq=1,
    specials=['<pad>', '<unk>'],  # <pad> for padding sequence to same length(MAX)
                                  # <unk> for unknown out-of-vocabulary words
    special_first=True
)

**`get_itos`**: *stands for "get index-to-string". The method returns a list in which each position is a numerical index used by the model and the value at that position is the corresponding token string.*

In [16]:
# This step is important for converting words into their indices.
# Plain "index-to-word" and "word-to-index" dictionaries make lookups much faster,
# which saves several minutes on a corpus as large as ours.


# Create the word-to-index map
word_to_index = {token: idx for idx, token in enumerate(word_vocab.get_itos())}

# Create the index-to-word map
index_to_word = {idx: token for idx, token in enumerate(word_vocab.get_itos())}

In [17]:
# Sample list of Vocabulary
list(word_vocab.get_itos())[:10]

['<pad>', '<unk>', '।', ',', 'না', 'যে', 'এই', 'আমাদের', 'করিয়া', 'করে']

In [18]:
word_vocab_total_words = len(word_vocab)

print("Word Vocab Length", word_vocab_total_words)

Word Vocab Length 77446
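
As an illustration of the idea behind the vocabulary and the two lookup dictionaries, the following plain-Python sketch builds a small vocabulary with the special tokens first and the remaining words ordered by frequency. It is an approximation written for this explanation: `build_vocab` and `toy_doc` are hypothetical names, and the tie-breaking order used by torchtext may differ.

```python
from collections import Counter


def build_vocab(tokenized_sentences, specials=("<pad>", "<unk>"), min_freq=1):
    """Return (word_to_index, index_to_word) with specials first, then words by frequency."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Sort by descending frequency, then by the token itself so ties are deterministic.
    words = sorted((w for w, c in counts.items() if c >= min_freq),
                   key=lambda w: (-counts[w], w))
    itos = list(specials) + [w for w in words if w not in specials]
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos


toy_doc = [["আমরা", "তো", "বিজ্ঞানী", "নই", "।"],
           ["আমরা", "বুঝতে", "পারি", "নে", "।"]]
stoi, itos = build_vocab(toy_doc)

# Words that are not in the vocabulary fall back to the <unk> index.
encoded = [stoi.get(tok, stoi["<unk>"]) for tok in ["আমরা", "কবিতা", "।"]]
print(itos[:4], encoded)
```

Because `<pad>` is placed first, it receives index 0, which is the value used later when padding the sequences.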


### Converting to n-grams

n-grams of variable length

**Source Data** : *I Am Learning Artificial Intelligence*

| **X** | **y** |
|---|---|
| *I* | *Am* |
| *I Am* | *Learning* |
| *I Am Learning* | *Artificial* |
| *I Am Learning Artificial* | *Intelligence* |
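
The same example can be reproduced in a few lines of plain Python. This is only a sketch of the idea; the generator name `prefixes_with_targets` is an assumption made for this illustration.

```python
def prefixes_with_targets(tokens):
    """Yield (context, next_word) pairs for every prefix of a token list."""
    for i in range(1, len(tokens)):
        yield tokens[:i], tokens[i]


pairs = list(prefixes_with_targets("I Am Learning Artificial Intelligence".split()))
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A sentence with n tokens yields n - 1 training pairs, which is why the number of n-grams is far larger than the number of sentences.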

In [19]:
# function for n-grams of variable length:

def seq2grams(sentences):
    n_grams = []
    for sentence in tqdm(sentences):       # for each tokenized sentence in the corpus
        for i in range(1, len(sentence)):  # i is the position of the word to predict (2nd word up to the last)
            sequence = sentence[:i+1]      # prefix of the first i+1 words: [1,2], [1,2,3], [1,2,3,4] and so on
            n_grams.append(sequence)       # add the prefix to the list
    return n_grams

dataset = seq2grams(doc)

# sample dataset after n-gram
print(dataset[:5])

100%|██████████| 50000/50000 [00:01<00:00, 34180.42it/s]

[['আশ্রমের', 'রূপ'], ['আশ্রমের', 'রূপ', 'ও'], ['আশ্রমের', 'রূপ', 'ও', 'বিকাশ'], ['আশ্রমের', 'রূপ', 'ও', 'বিকাশ', 'শিলাইদহে'], ['আশ্রমের', 'রূপ', 'ও', 'বিকাশ', 'শিলাইদহে', 'পদ্মাতীরে']]





### Adding Random `<unk>` Tokens

*This is a crucial step for robust predictions: the test set may include many out-of-vocabulary words, so the training set should contain `<unk>` tokens too.*

In [20]:
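# Replace each context word with <unk> with probability 0.1 so the model learns to handle <unk>.
# The last word (the label) is never replaced, and the n-gram is modified in place.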
def add_random_unk_tokens(ngram):
    for idx, word in enumerate(ngram[:-1]):
        if random.uniform(0, 1) < 0.1:
            ngram[idx] = '<unk>'
    return ngram


dataset_unk = []
for data in dataset:
    dataset_unk.append(add_random_unk_tokens(data))
    

# check whether the dataset includes <unk> tokens or not..    
print(any('<unk>' in data for data in dataset_unk))

# length of the dataset
print(len(dataset_unk))

# sample dataset after adding random <unk> tokens
print(dataset_unk[34:43])

True
865889
[['মাধবীলতা', 'বিতানে', '<unk>', 'দ্বার'], ['<unk>', 'বিতানে', 'প্রবেশের', '<unk>', '।'], ['<unk>', 'পুব'], ['পিছনে', 'পুব', 'দিকে'], ['পিছনে', '<unk>', 'দিকে', 'আমবাগান'], ['<unk>', 'পুব', 'দিকে', '<unk>', ','], ['পিছনে', 'পুব', '<unk>', 'আমবাগান', ',', 'পশ্চিম'], ['পিছনে', '<unk>', 'দিকে', 'আমবাগান', '<unk>', 'পশ্চিম', 'দিকে'], ['পিছনে', 'পুব', 'দিকে', 'আমবাগান', ',', 'পশ্চিম', 'দিকে', 'কোথাও']]
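
For comparison, here is a hedged sketch of a variant that returns a new list instead of modifying its argument and that takes its own random generator, so the masking can be repeated exactly. The function name `mask_context` and its parameters are assumptions for illustration; this is not the function used in the notebook.

```python
import random


def mask_context(ngram, p=0.1, rng=random, unk="<unk>"):
    """Return a copy of ngram whose context words become unk with probability p.

    The last token (the label) is never replaced.
    """
    context = [unk if rng.random() < p else tok for tok in ngram[:-1]]
    return context + ngram[-1:]


rng = random.Random(42)  # local generator so the result is repeatable
original = ["পিছনে", "পুব", "দিকে", "আমবাগান"]
masked = mask_context(original, p=0.5, rng=rng)
print(original, masked)
```

Keeping the original n-grams untouched makes it possible to draw a fresh mask in every epoch, if that is desired.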


### Conversion of Words to Tokens (Numbers)

In [21]:
# converts the Words in the dataset into Numbers... using
# "word-to-index" Mapping
def text_to_numerical_sequence(tokenized_text):
    tokens_list = []
    if tokenized_text[-1] in word_to_index:
        for token in tokenized_text[:-1]:
            num_token = word_to_index[token] if token in word_to_index  else word_to_index['<unk>']
            tokens_list.append(num_token)
        num_token = word_to_index[tokenized_text[-1]]
        tokens_list.append(num_token)
        return tokens_list
    return None


# Encode every n-gram once and drop any sequence whose label is out of vocabulary (returned as None)
data = [seq for seq in (text_to_numerical_sequence(sequence) for sequence in dataset_unk) if seq is not None]

print(f'Total input sequences: {len(data)}')
print('Sample Dataset :\n', data[7:9])

Total input sequences: 865889
Sample Dataset :
 [[1172, 386, 1, 860, 5746, 19000, 16429, 113, 4380], [1172, 386, 17, 860, 5746, 19000, 16429, 113, 4380, 664]]


### Creation of Features and Labels

*In each n-gram, the last word is the `LABEL` and all preceding words are the `FEATURE`.*

In [22]:
X = [sequence[:-1] for sequence in data]
y = [sequence[-1] for sequence in data]

In [23]:
# length of the longest feature sequence; every input is padded to this length
longest_sequence_feature = max(len(sequence) for sequence in X)
longest_sequence_feature 

464

### Padding the Features

**The features are padded to a fixed length (`longest_sequence_feature`) because the model expects every input in a batch to have the same size, while the n-grams have different lengths.**

*`F.pad` is a utility function from the `torch.nn.functional` module for padding tensors. Here it pads on the left with `0`, the index of `<pad>`.*

In [24]:
padded_X = [F.pad(torch.tensor(sequence), (longest_sequence_feature - len(sequence),0), value=0) for sequence in X]

In [25]:
padded_X = torch.stack(padded_X)
y = torch.tensor(y)

In [26]:
# sample padded data
padded_X[1]

tensor([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,   
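
The effect of the left padding can be shown without tensors. The sketch below is a plain-Python stand-in for what the padding cell does; `left_pad` is a hypothetical helper, and the toy batch reuses a few indices from the sample output above.

```python
def left_pad(sequence, length, pad_value=0):
    """Left-pad a list of token ids to a fixed length (plain-Python illustration)."""
    if len(sequence) > length:
        raise ValueError("sequence is longer than the target length")
    return [pad_value] * (length - len(sequence)) + list(sequence)


batch = [[1172, 386], [1172, 386, 17, 860]]
max_len = max(len(seq) for seq in batch)
padded = [left_pad(seq, max_len) for seq in batch]
print(padded)
```

Padding on the left keeps the most recent words at the end of every row, which is the position the model reads for the forward direction. By a rough estimate, 865,889 rows of 464 64-bit integers occupy about 3.2 GB, so padding to a shorter maximum length would be one way to save memory if resources are tight.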

## Creation of Dataset

In [27]:
# create tensordataset using the feature and labels
data = TensorDataset(padded_X, y)

In [28]:
train_size = int(0.80 * len(data))
val_size = int(0.10 * len(data))
test_size = len(data) - (train_size + val_size)

print(train_size, val_size, test_size)


692711 86588 86590
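
The split sizes follow directly from integer truncation, with the test set absorbing the remainder. The short sketch below reproduces the arithmetic; `split_sizes` is a name chosen for this illustration, and the batch size of 512 is the value used in the data-loading cell of this notebook.

```python
import math


def split_sizes(n, train_frac=0.80, val_frac=0.10):
    """Return (train, val, test) sizes; the test split absorbs rounding remainders."""
    train = int(train_frac * n)
    val = int(val_frac * n)
    return train, val, n - train - val


train_n, val_n, test_n = split_sizes(865889)
# A DataLoader keeps the last partial batch by default, hence the ceiling.
batches_per_epoch = math.ceil(train_n / 512)
print(train_n, val_n, test_n, batches_per_epoch)
```

The resulting number of batches per epoch agrees with the total shown in the training progress bar.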


In [29]:
# Splitting the data into train, validation and test sets (8 : 1 : 1)

train_data, valid_data, test_data = random_split(data, [train_size, val_size, test_size])

## Loading the Data

In [30]:
batch_size=512

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

## Model Development

*We are going to develop a bidirectional LSTM because it offers:*
- Long-term memory support
- Analysis of the sequence from both ends

In [31]:
class My_BiLSTM(nn.Module):
    def __init__(self, word_vocab_total_words, embedding_dim, hidden_dim, num_layers):
        super(My_BiLSTM, self).__init__()
        self.embedding = nn.Embedding(word_vocab_total_words, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_dim * 2, word_vocab_total_words)

    def forward(self, x):
        x = x.to(self.embedding.weight.device)
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = self.dropout(lstm_out)

        
        # Since the LSTM is bidirectional, we concatenate the final state of each direction
        # before passing it to the fully connected layer.
        # With batch_first=True, the forward direction finishes at the last timestep: lstm_out[:, -1, :hidden_dim]
        # The backward direction reads right-to-left and finishes at the first timestep: lstm_out[:, 0, hidden_dim:]
        forward_last = lstm_out[:, -1, :self.lstm.hidden_size]
        backward_first = lstm_out[:, 0, self.lstm.hidden_size:]
        
        output = self.fc(torch.cat((forward_last, backward_first), dim=1))
        
        return output


## Setting Up

In [32]:
def set_seed(seed):
    # Set the seed for generating random numbers in PyTorch
    torch.manual_seed(seed)
    
    # If using GPU, set the seed for CUDA
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Ensure that the PyTorch operations are deterministic on the GPU for reproducibility
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    
    # Set the seed for generating random numbers in Python
    random.seed(seed)
    
    # Set the seed for generating random numbers in NumPy
    np.random.seed(seed)

    
seed = 42
set_seed(seed)

In [33]:
# setting the device to "cuda" if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [49]:
# training setup
EPOCHS=10
LEARNING_RATE = 0.001
WEIGHT_DECAY = 1e-4
SCHEDULAR_FACTOR = 0.1
SCHEDULAR_PATIENCE = 5
STOPPING_PATIENCE = 10

## Training

In [44]:
# model size chosen to fit the GPU resources available on Kaggle;
# with more resources, a larger model is recommended
Embedding_Dim = 256
Hidden_Dim = 256
Num_Layers = 3

In [45]:
# create a model instance.. 
model = My_BiLSTM(word_vocab_total_words, embedding_dim=Embedding_Dim, hidden_dim=Hidden_Dim, num_layers = Num_Layers)
model = model.to(device)

In [46]:
def calculate_accuracy(model, desc, data_loader, k=5):
    # Top-k accuracy of the model as it is currently loaded: a prediction counts
    # as correct when the true label is among the k highest-scoring words.
    model.eval()
    correct_predictions = 0
    total_predictions = 0

    with torch.no_grad():
        for batch_x, batch_y in tqdm(data_loader, desc=desc, leave=False):
            # labels stay as class indices; no one-hot encoding is needed
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)

            # Forward pass
            output = model(batch_x)

            # Get top-k predictions
            _, predicted_indices = output.topk(k, dim=1)

            # Check if the correct label is in the top-k predictions
            correct_predictions += (predicted_indices == batch_y.unsqueeze(1)).any(dim=1).sum().item()
            total_predictions += batch_y.size(0)

    accuracy = correct_predictions / total_predictions
    return accuracy
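
Top-k accuracy itself is independent of PyTorch. The following plain-Python sketch, using made-up scores for three classes, shows the computation on a toy input; `top_k_accuracy` and the data are assumptions for illustration only.

```python
def top_k_accuracy(score_rows, labels, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


scores = [
    [0.1, 0.7, 0.2],  # best guess: class 1
    [0.5, 0.3, 0.2],  # best guess: class 0
    [0.2, 0.3, 0.5],  # best guess: class 2
]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=2))
```

With k=1 only the first row is a hit, while with k=2 the second row also counts because its true class is the runner-up.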

In [47]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)

In [None]:
epochs = EPOCHS
all_accuracies = []
all_losses = []
for epoch in range(epochs):
    model.train()
    for batch_X, batch_y in tqdm(train_loader, desc="Trainning : ", leave=False):
        # CrossEntropyLoss expects class indices, so no one-hot encoding is needed
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()

    accuracy = calculate_accuracy(model, "Calculating Accuracy : ", train_loader)
    # the reported loss is that of the last training batch of the epoch
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {loss.item():.4f}, Train K-Accuracy: {accuracy * 100:.2f}%')
    all_accuracies.append(accuracy)
    all_losses.append(loss.item())

Trainning :  71%|███████   | 963/1353 [19:59<08:05,  1.25s/it]

# Testing and Results

In [None]:
epoch_list = [i for i in range(1,epochs+1)]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))

axes[0].plot(epoch_list, all_accuracies, color='#5a7da9', label='Accuracy', linewidth=3)
axes[0].set_xlabel('Epochs')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy Graph')
axes[0].grid(True)

axes[1].plot(epoch_list, all_losses, color='#adad3b', label='Loss', linewidth=3)
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('Loss')
axes[1].set_title('Loss Graph')
axes[1].grid(True)

plt.tight_layout()
plt.show()

In [None]:
accuracy = calculate_accuracy(model, "Testing : ", test_loader)
print(f'Test-Accuracy: {accuracy * 100:.2f}%')

## Saving The model

In [None]:
model_path = './model_biLSTM.pth'

torch.save(model.state_dict(), model_path)
print(f'Model saved to {model_path}')

After training, the BiLSTM model has learned from the corpus of Rabindranath Tagore's writings. The model's architecture and training parameters were chosen to fit the GPU resources available on Kaggle.


The training log included later in this notebook shows the validation loss falling from 7.68 to 7.09 and the validation accuracy rising from about 8% to about 11% over five epochs before the run was interrupted. The model is therefore picking up patterns of the language, but it is still far from converged.

# Demonstration

In [49]:
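# load the model (the path must point to a previously saved checkpoint)
model_path = 'bilstm_model.pth'
model.load_state_dict(torch.load(model_path, map_location=device))  # map_location also allows loading on a CPU-only machine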

<All keys matched successfully>

## Conversion From `Word-to-Idx`

In [None]:
def text_to_numerical_sequence_test(tokenized_text):
    tokens_list = []
    for token in tokenized_text:
        num_token = word_to_index[token] if token in word_to_index else word_to_index['<unk>']
        tokens_list.append(num_token)
    return tokens_list

## Adding Temperature to Predictions

**0.1** -> *low temperature: the distribution is sharpened, giving predictable, to-the-point output*

**1.0** -> *high temperature: the model's distribution is used unchanged, giving more varied, creative and experimental output*
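
Temperature works by dividing the logits before the softmax, so values below 1 concentrate probability on the highest-scoring word and values near 1 leave the distribution as the model produced it. The plain-Python sketch below illustrates this on three made-up logits; the function names and the logits are assumptions for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]
for t in (0.1, 0.5, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At a temperature of 0.1 almost all of the probability mass sits on the first entry, so sampling behaves nearly like taking the argmax.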

In [37]:
# Define a function to sample from the model's output distribution with temperature
def sample_with_temperature(output, temperature=1.0):
    # Scale the logits by the temperature
    logits = output / temperature
    # Convert logits to probabilities (in float64 so they sum to 1 within NumPy's tolerance)
    probabilities = F.softmax(logits.double(), dim=-1).cpu().detach().numpy()
    # Sample from the distribution
    next_word_index = np.random.choice(len(probabilities), p=probabilities)
    return next_word_index

## Text Generator

In [51]:
# Generate the next n words (10 by default) with the trained BiLSTM model
def generate_sentence(ip_sentence, text_model=model, n=10, temperature=0.8):

    sentence = ip_sentence.split()

    text_model.eval()
    with torch.no_grad():
        for _ in range(n):
            tokenized_sequence = text_to_numerical_sequence_test(sentence)

            # left-pad with <pad> (index 0) to the training length and add a batch dimension
            padded_X = F.pad(torch.tensor(tokenized_sequence), (longest_sequence_feature - len(tokenized_sequence), 0), value=0).unsqueeze(0)

            output = text_model(padded_X)

            # sample the next word from the temperature-scaled distribution
            next_word_index = sample_with_temperature(output[0], temperature=temperature)

            next_word = index_to_word[next_word_index]
            sentence.append(next_word)

    return ' '.join(sentence)





 for temp 0.1 -->
 একটি কবিতার খাতা , এই যে সকল মানুষের মধ্যে আমাদের দেশের মধ্যে যে একটা বিশেষ সম্বন্ধ আছে তাহা নহে , তাহা নহে । নহে । তাহার মধ্যে তাহার মধ্যে একটা বিশেষ ঐক্য আছে


 for temp 0.2 -->
 একটি কবিতার খাতা , এই দুই লোকের মধ্যে যে একটা বিশেষ জিনিস আছে , সেই সকল লোকের মধ্যে যে একটা বিশেষ ঐক্য আছে তাহা নহে , তাহা নহে । নহে । , তাহার


 for temp 0.3 -->
 একটি কবিতার খাতা , এই যে সকল মানুষের মধ্যে আমাদের দেশের মধ্যে আমাদের দেশের লোকের সঙ্গে যে একটা বিশেষ সম্বন্ধ আছে তাহা নহে , তাহা আমাদের নিজের পক্ষে কোনো বিশেষ কারণ । নহে


 for temp 0.4 -->
 একটি কবিতার খাতা থেকে একটা বিশেষ ছন্দ , এই কথাটা , তারা তার মধ্যে একটা বৃহৎ রূপ আছে । , আর এক দিকে । মতো , এই সব লোকের মধ্যে তার পরে ।


 for temp 0.5 -->
 একটি কবিতার খাতা থেকে যে প্রত্যক্ষ আজও তাঁদের সঙ্গে আপন বিরোধ আছে , এই বলে , আমাদের মধ্যে যদি আমাদের কাছে যে সমস্ত পথ দিয়া এমন কথা দেখি তবে আমাদের যে যাহা আমাদের


 for temp 0.6 -->
 একটি কবিতার খাতা যেখানে কোনো ব্যক্তির সম্বন্ধ খাটাইবার জন্য এ কথা নয় । এত ঘনিষ্ঠ কারণ , আমরা তাঁহার সঙ্গ

In [None]:
## sample inputs

# input_sentence = "তাঁর একটি কবিতার খাতা"
# input_sentence = "তাঁর একটি"
# input_sentence = "তাঁর কবিতার খাতা"
input_sentence = "একটি কবিতার খাতা"



# input_sentence = "কিন্তু আমরা"
# input_sentence = "আমরা তো বিজ্ঞানী নই"
# input_sentence = "আমরা তো"
# input_sentence = "বুঝতে পারি"
# input_sentence = "কিন্তু আমরা"
# input_sentence = "কিন্তু আমরা বুঝতে পারি"
# input_sentence = "আমরা তো বিজ্ঞানী "
# input_sentence = "কিন্তু আমরা তো বিজ্ঞানী নই, বুঝতে পারি"



temp= [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

for t in temp:
    sentence = generate_sentence(input_sentence, model, n=30, temperature=t)
    print(f'\n\n\n for temp {t} -->\n {sentence}')

# Future Works

Moving forward, we plan to:

- **Expand the Dataset**: *Include more works by Tagore and other Bengali authors to enrich the training data.*

- **Experiment with Advanced Architectures**: *Explore transformer-based models and attention mechanisms for potentially better performance.*

- **Deploy the Model**: *Create a user-friendly application for Bengali text generation and sentence completion.*

# Conclusion

In this notebook, we explored language modelling in Natural Language Processing using a multi-layer BiLSTM model. Using the literary works of Rabindranath Tagore, we aimed to predict the next word in a sequence as a basis for text generation and sentence completion in the Bengali language.


Through detailed steps, from data preprocessing to model training and evaluation, we demonstrated how to use a corpus of around 1 lakh words to train a BiLSTM model. This approach not only provides insights into handling Indian languages in NLP tasks but also underscores the value of classical literature for modern technological applications.


The successful implementation of this task opens up numerous possibilities for further research and application, such as improving language models for other regional languages, creating more sophisticated text generation systems, and contributing to the preservation and digital enhancement of cultural literary works.

# Thank You

Thank you for following along with this notebook. We hope it has provided you with a comprehensive and beginner-friendly introduction to NLP tasks, particularly in the context of Indian languages. Whether you are a student, researcher, or enthusiast, we believe that the skills and knowledge gained here will be valuable in your journey into the world of Natural Language Processing.


We would like to express our gratitude to the timeless works of Rabindranath Tagore, which provided the foundation for this project. Their linguistic richness and cultural significance have greatly contributed to the success of our model.


Happy learning and happy coding!

# Training

In [None]:
# Load a previously saved state dictionary into the model, if one is available
def load_model(model, name, model_path):
    if os.path.exists(model_path):
        print(f"\nLoading Saved Version of the Model {name} ....\n")
        model.load_state_dict(torch.load(model_path))
    else:
        print(f"\nFOUND NO Previous Saved Version of the Model {name} ....\n")

    return model

In [31]:
# Trainer..

def trainer(model, name, check_path, train_loader, valid_loader, epochs = 500, learning_rate = 0.001):
    
    CHECKPOINT_PATH = check_path
    
    model = load_model(model, name, CHECKPOINT_PATH ).to(device)


    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=WEIGHT_DECAY)
#     optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=WEIGHT_DECAY)
    scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=SCHEDULAR_FACTOR, patience=SCHEDULAR_PATIENCE, verbose=True)

    # Define the early stopping criteria
    best_val_loss = float('inf')
    best_val_acc = 0.0
    best_train_loss = float('inf')
    best_train_acc = 0.0
    patience = STOPPING_PATIENCE 
    counter = 0

    # Lists to store accuracy and loss values
    train_acc_list = []
    train_loss_list = []
    val_acc_list = []
    val_loss_list = []

    print(f"\nStarting Trainning... for model {name}\n\n")

    # Training loop
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        train_correct = 0

        for inputs, labels in tqdm(train_loader, desc="Trainning : ", leave=False):
            inputs = inputs.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            train_correct += (predicted == labels).sum().item()

        train_loss /= len(train_loader.dataset)
        train_acc = train_correct / len(train_loader.dataset)
        train_acc_list.append(train_acc)
        train_loss_list.append(train_loss)

        # Validation
        model.eval()
        val_loss = 0.0
        val_correct = 0

        with torch.no_grad():
            for inputs, labels in tqdm(valid_loader, desc="Validating : ", leave=False):
                inputs = inputs.to(device)
                labels = labels.to(device)

                outputs = model(inputs)
                loss = criterion(outputs, labels)

                val_loss += loss.item() * inputs.size(0)
                _, predicted = torch.max(outputs.data, 1)
                val_correct += (predicted == labels).sum().item()

            val_loss /= len(valid_loader.dataset)
            val_acc = val_correct / len(valid_loader.dataset)
            val_acc_list.append(val_acc)
            val_loss_list.append(val_loss)
            
        # Step the scheduler
        scheduler.step(val_loss)

        print(f"Epoch: {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f} | "
              f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")


        
        # Early stopping based on validation loss
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            counter = 0
        else:
            counter += 1
            if counter >= patience:
                print("Early stopping!")
                break
                
        
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_train_acc = train_acc
            best_train_loss = train_loss
            # best_val_loss is not overwritten here: it tracks the lowest validation loss for early stopping

            # Save the model if it has the best validation accuracy
            torch.save(model.state_dict(), CHECKPOINT_PATH)

    print(f"\n\n\nBEST MODEL --> \nTrain Acc : {best_train_acc:.4f} | Train Loss : {best_train_loss:.4f} | Valid Acc : {best_val_acc:.4f} | Valid Loss : {best_val_loss:.4f}")

    # Plot accuracy and loss curves
    plt.figure(figsize=(8, 4))
    plt.plot(train_acc_list, label='Train')
    plt.plot(val_acc_list, label='Validation')
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(loc='upper left')
    plt.savefig(f"./Acc_{name}.png")
    plt.show()


    plt.figure(figsize=(8, 4))
    plt.plot(train_loss_list, label='Train')
    plt.plot(val_loss_list, label='Validation')
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(loc='upper left')
    plt.savefig(f"./loss_{name}.png")
    plt.show()
    
    return model, [best_train_loss, best_train_acc, best_val_loss, best_val_acc]


FOUND NO Previous Saved Version of the Model biLSTM_model ....

Epoch: 1/100 | Train Loss: 7.9871 | Train Acc: 0.0711 | Val Loss: 7.6775 | Val Acc: 0.0815
Epoch: 2/100 | Train Loss: 7.5820 | Train Acc: 0.0864 | Val Loss: 7.3832 | Val Acc: 0.0965
Epoch: 3/100 | Train Loss: 7.3325 | Train Acc: 0.0978 | Val Loss: 7.2311 | Val Acc: 0.1005
Epoch: 4/100 | Train Loss: 7.2011 | Train Acc: 0.1044 | Val Loss: 7.1442 | Val Acc: 0.1104
Epoch: 5/100 | Train Loss: 7.1003 | Train Acc: 0.1092 | Val Loss: 7.0914 | Val Acc: 0.1131

KeyboardInterrupt: 
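
One way to read these loss values is to convert them to perplexity, the exponential of the mean cross-entropy measured in nats (PyTorch's `CrossEntropyLoss` uses the natural logarithm). The sketch below does this for the validation losses logged above and compares them with a uniform guess over the 77,446-word vocabulary. The helper name `perplexity` is an assumption for this illustration.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)


vocab_size = 77446
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly

logged_val_losses = [7.6775, 7.3832, 7.2311, 7.1442, 7.0914]
for epoch, val_loss in enumerate(logged_val_losses, start=1):
    print(f"epoch {epoch}: val loss {val_loss:.4f} -> perplexity {perplexity(val_loss):.0f}")
print(f"uniform baseline: loss {uniform_loss:.2f}, perplexity {vocab_size}")
```

A falling perplexity means the model is, on average, choosing among fewer equally likely candidates for the next word.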

In [None]:
name = "biLSTM"
check_path = f'./model_{name}.pth'

# train the model
model, train_results = trainer(model, name, check_path, train_loader, valid_loader, epochs=EPOCHS, learning_rate=LEARNING_RATE)

# Testing

In [None]:
def tester(model, CHECKPOINT_PATH, test_loader):
    # Load the best model checkpoint for evaluation
    model.load_state_dict(torch.load(CHECKPOINT_PATH))

    # Evaluate on the test set
    model.eval()
    test_loss = 0.0
    test_correct = 0
    top2_correct=0
    top5_correct=0

    with torch.no_grad():
        for inputs, labels in tqdm(test_loader, desc="Testing : ", leave=False):
            inputs = inputs.to(device)
            labels = labels.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, labels)

            test_loss += loss.item() * inputs.size(0)
            predicted = outputs.argmax(dim=1)
            top2preds = outputs.topk(2, dim=1).indices
            top5preds = outputs.topk(5, dim=1).indices
            test_correct += (predicted == labels).sum().item()
            # a sample is a top-k hit when its label appears among its k highest-scoring classes
            top2_correct += (top2preds == labels.unsqueeze(1)).any(dim=1).sum().item()
            top5_correct += (top5preds == labels.unsqueeze(1)).any(dim=1).sum().item()

        test_loss /= len(test_loader.dataset)
        test_acc = test_correct / len(test_loader.dataset)
        top2_acc = top2_correct/len(test_loader.dataset)
        top5_acc = top5_correct/len(test_loader.dataset)
        
        
    print(f"Test Loss : {test_loss:.4f} | Test Acc: {test_acc:.4f} | Top-2 Acc: {top2_acc:.4f} | Top-5 Acc: {top5_acc:.4f}")

    return [test_loss, test_acc, top2_acc, top5_acc]


# test the model
test_results = tester(model, check_path, test_loader)