## Text Generation with LSTM-based Recurrent Neural Networks

### Project Description
This project demonstrates text generation using Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.

The goal is to train a word-level RNN that learns the structure of an input text and then generates new text in a similar style. Next-word prediction of this kind is the core of language modeling and is also used in natural language processing (NLP) tasks such as text summarization and dialogue generation.

### Code Overview
The code consists of several main sections:
1. **Data Preprocessing**: The input text is loaded from a file (`text.txt`), lowercased, split into word and punctuation tokens, and used to build a vocabulary.
2. **Dataset Preparation**: The preprocessed text is converted into a sequence of word indices and sliced into overlapping input/label windows for training.
3. **Model Definition**: The `WordLSTM` class defines the architecture of the LSTM-based RNN model: an embedding layer, a stacked LSTM, and a fully connected output layer.
4. **Model Training**: The model is trained on the prepared dataset for a specified number of epochs.
5. **Generating Text**: After training, the model generates new text from a seed sequence by repeatedly predicting and sampling the next word.



In [31]:
# Importing necessary libraries
import nltk
from collections import Counter

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import numpy as np

## Data Preprocessing

In [2]:
# Loading and preprocessing the text data (the file is assumed to be UTF-8 encoded)
with open('text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(text.split()[:30])

['Chapter', '1', 'Happy', 'families', 'are', 'all', 'alike;', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way.', 'Everything', 'was', 'in', 'confusion', 'in', 'the', "Oblonskys'", 'house.', 'The', 'wife', 'had', 'discovered', 'that', 'the']


In [3]:
# Tokenizing the text and cleaning it
punctuation = [',', '.', ':', ';', '?', '!', '(', ')', '[', ']', '{', '}', '"', '\'', '\\', '/']
def clean_text(text):
  text = text.lower().replace('\n', ' ')
  text = text.replace('-', ' ')
  for punc in punctuation:
    text = text.replace(f'{punc}', f' {punc} ')
  return text.split()
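
To make the tokenization concrete, here is a small self-contained run of the same cleaning steps on one sentence. The sentence is only an illustration; `punctuation` and `clean_text` are copied from the cell above so the snippet runs on its own.

```python
punctuation = [',', '.', ':', ';', '?', '!', '(', ')', '[', ']', '{', '}', '"', '\'', '\\', '/']

def clean_text(text):
    # Lowercase, then turn newlines and hyphens into spaces
    text = text.lower().replace('\n', ' ')
    text = text.replace('-', ' ')
    # Pad each punctuation mark with spaces so it becomes its own token
    for punc in punctuation:
        text = text.replace(punc, f' {punc} ')
    return text.split()

tokens = clean_text("Happy families are all alike; every unhappy family\nis unhappy in its own way.")
print(tokens)

# Apostrophes are tokens too, so contractions are split into three pieces
contraction = clean_text("won't")
print(contraction)
```

Punctuation marks are kept as tokens rather than discarded, which is why `,` and `.` appear at the top of the frequency list in the next cell.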

In [4]:
cleaned_text = clean_text(text)
word_counts = Counter(cleaned_text)
word_counts.most_common(10)

[(',', 30994),
 ('.', 19671),
 ('the', 17554),
 ('"', 13990),
 ('and', 12906),
 ('to', 10154),
 ('of', 8618),
 ('he', 7824),
 ("'", 6713),
 ('a', 6186)]

In [5]:
# Extracting unique words and building vocabulary
words = sorted(word_counts, key=word_counts.get, reverse=True)
words[:10]

[',', '.', 'the', '"', 'and', 'to', 'of', 'he', "'", 'a']

In [6]:
len_text = len(cleaned_text)
count_words = len(words)
print('The text contains', len_text, 'words')
print('The text contains', count_words, 'unique words')

The text contains 436508 words
The text contains 12971 unique words


In [7]:
word_to_index = {word: index for index, word in enumerate(words)}
index_to_word = {index: word for index, word in enumerate(words)}
print(list(word_to_index.items())[:10])
print(list(index_to_word.items())[:10])

[(',', 0), ('.', 1), ('the', 2), ('"', 3), ('and', 4), ('to', 5), ('of', 6), ('he', 7), ("'", 8), ('a', 9)]
[(0, ','), (1, '.'), (2, 'the'), (3, '"'), (4, 'and'), (5, 'to'), (6, 'of'), (7, 'he'), (8, "'"), (9, 'a')]


In [8]:
text_as_indices = [word_to_index[word] for word in cleaned_text]
print(cleaned_text[:10])
print(text_as_indices[:10])

['chapter', '1', 'happy', 'families', 'are', 'all', 'alike', ';', 'every', 'unhappy']
[207, 2751, 278, 2974, 82, 31, 2413, 35, 201, 685]
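
The vocabulary steps above can be checked on a toy sentence. This is a minimal sketch with made-up input; it mirrors the notebook's variable names and shows that encoding followed by decoding returns the original tokens.

```python
from collections import Counter

tokens = "the cat and the dog and the bird".split()
word_counts = Counter(tokens)

# Most frequent words get the smallest indices; ties keep first-seen order
words = sorted(word_counts, key=word_counts.get, reverse=True)
word_to_index = {word: index for index, word in enumerate(words)}
index_to_word = {index: word for index, word in enumerate(words)}

encoded = [word_to_index[w] for w in tokens]   # words -> indices
decoded = [index_to_word[i] for i in encoded]  # indices -> words

print(words)
print(encoded)
print(decoded == tokens)
```

Because the index follows the frequency rank, small indices in the batches printed later correspond to very common tokens such as `,` (0) and `.` (1).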


## Dataset Preparation

In [9]:
# Generating training data
len_sequence = 100
data = []
for i in range(0, len_text - len_sequence - 1):
  sequence = text_as_indices[i: i + len_sequence]
  label = text_as_indices[i + 1: i + len_sequence + 1]
  data.append((torch.tensor(sequence), torch.tensor(label)))
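
The windowing logic is easier to see on a short sequence. The sketch below uses plain Python lists and a hypothetical helper, `make_windows`, that follows the same loop bounds as the cell above; each label is the input shifted one position to the right.

```python
def make_windows(indices, len_sequence):
    """Return (input, label) pairs where the label is the input shifted by one."""
    pairs = []
    for i in range(0, len(indices) - len_sequence - 1):
        sequence = indices[i: i + len_sequence]
        label = indices[i + 1: i + len_sequence + 1]
        pairs.append((sequence, label))
    return pairs

# Toy data: ten "word indices" and a window of four
pairs = make_windows(list(range(10)), len_sequence=4)
for sequence, label in pairs:
    print(sequence, "->", label)
```

Consecutive windows overlap in all but one position, so the number of samples is close to the number of tokens: with 436,508 tokens and `len_sequence = 100`, the loop produces 436,407 pairs.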

In [10]:
torch.manual_seed(42)
batch_size = 32
train_loader = DataLoader(data, batch_size=batch_size, shuffle=True)

In [11]:
sequence, label = next(iter(train_loader))
print(sequence)
print(label)

tensor([[   6,  138, 2254,  ...,    8,   39,   86],
        [   5,   81,   87,  ...,    7, 1312,   41],
        [  13, 8035,    1,  ...,   65, 7112,   50],
        ...,
        [1243,   16,  174,  ...,    6,  316,  148],
        [   0,    7,  135,  ..., 2761,    1,   94],
        [  53,   43,  180,  ...,   13,  192,   35]])
tensor([[ 138, 2254,    0,  ...,   39,   86,   31],
        [  81,   87,    0,  ..., 1312,   41,   25],
        [8035,    1,   10,  ..., 7112,   50,  367],
        ...,
        [  16,  174,    5,  ...,  316,  148,    1],
        [   7,  135,    9,  ...,    1,   94,   95],
        [  43,  180,  515,  ...,  192,   35,    7]])


## Model Definition

In [21]:
# Selecting the device and defining the WordLSTM model
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

class WordLSTM(nn.Module):
  def __init__(self, input_size, hidden_size, num_layers, dropout=0.2):
    super(WordLSTM, self).__init__()
    self.input_size = input_size
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.vocabulary_size = len(words)
    self.embedding = nn.Embedding(self.vocabulary_size, self.input_size)
    self.lstm = nn.LSTM(self.input_size, self.hidden_size, self.num_layers, batch_first=True, dropout=dropout)
    self.fully_connected = nn.Linear(self.hidden_size, self.vocabulary_size)

  def forward(self, x, hc):
    x = self.embedding(x)
    x, hc = self.lstm(x, hc)
    x = self.fully_connected(x)
    return x, hc

  def initialize_hidden_state(self, batch_size):
    # Zero-filled (hidden, cell) states on the same device and dtype as the model weights
    weight = next(self.parameters())
    shape = (self.num_layers, batch_size, self.hidden_size)
    return weight.new_zeros(shape), weight.new_zeros(shape)


cuda


In [22]:
# Instantiating the model
model = WordLSTM(input_size=128, hidden_size=128, num_layers=3).to(device)
print(model)

WordLSTM(
  (embedding): Embedding(12971, 128)
  (lstm): LSTM(128, 128, num_layers=3, batch_first=True, dropout=0.2)
  (fully_connected): Linear(in_features=128, out_features=12971, bias=True)
)
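
To see how tensor shapes change through the three layers, the following sketch pushes a random batch through standalone layers of the same types. The sizes are toy values chosen for illustration, not the notebook's, and the layers are untrained.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only
vocab_size, embed_dim, hidden_size, num_layers = 50, 8, 16, 3
batch, seq_len = 4, 10

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, num_layers, batch_first=True)
fully_connected = nn.Linear(hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (batch, seq_len))  # word indices
h0 = torch.zeros(num_layers, batch, hidden_size)    # initial hidden state
c0 = torch.zeros(num_layers, batch, hidden_size)    # initial cell state

emb = embedding(x)                 # (batch, seq_len, embed_dim)
out, (h, c) = lstm(emb, (h0, c0))  # out: (batch, seq_len, hidden_size)
logits = fully_connected(out)      # (batch, seq_len, vocab_size)

print(emb.shape, out.shape, logits.shape, h.shape)
```

The model emits one score per vocabulary word at every position of the sequence, while the hidden and cell states keep the shape `(num_layers, batch, hidden_size)` regardless of sequence length.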


## Model Training

In [23]:
lr = 0.001
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

In [25]:
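model.train()
for epoch in range(10):
  total_loss = 0
  num_batches = 0
  hidden_state, cell_state = model.initialize_hidden_state(batch_size)
  for sequences, labels in train_loader:
    # Skip the final partial batch: the carried-over state is sized for batch_size
    if sequences.shape[0] != batch_size:
      continue
    sequences, labels = sequences.to(device), labels.to(device)
    optimizer.zero_grad()
    # The state is carried from batch to batch; since the loader shuffles,
    # it does not correspond to contiguous text
    outputs, (hidden_state, cell_state) = model(sequences, (hidden_state, cell_state))
    # CrossEntropyLoss expects the class dimension second: (batch, vocab, seq)
    loss = criterion(outputs.transpose(1, 2), labels)
    # Detach so gradients do not flow back into earlier batches
    hidden_state, cell_state = hidden_state.detach(), cell_state.detach()
    loss.backward()
    # Clip the gradient norm at 5 to guard against exploding gradients
    nn.utils.clip_grad_norm_(model.parameters(), 5)
    optimizer.step()
    total_loss += loss.item()
    num_batches += 1
  # Average over the batches that were actually processed
  print(f"Epoch {epoch}: average loss = {total_loss / num_batches}")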

Epoch 0: average loss = 3.2860134329539203
Epoch 1: average loss = 3.0206203146368233
Epoch 2: average loss = 2.8694772977146483
Epoch 3: average loss = 2.7685773876479405
Epoch 4: average loss = 2.6954071568086153
Epoch 5: average loss = 2.63963241919095
Epoch 6: average loss = 2.59504464735889
Epoch 7: average loss = 2.558801118746096
Epoch 8: average loss = 2.5282852627486667
Epoch 9: average loss = 2.501971535309264
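
The training loop calls the loss with `outputs.transpose(1, 2)` because `nn.CrossEntropyLoss` expects the class dimension in position 1 when the input has extra dimensions. The sketch below uses random stand-in tensors with toy sizes and compares that call with the equivalent approach of flattening every position into one long batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, vocab_size = 4, 10, 50  # toy sizes for illustration

outputs = torch.randn(batch, seq_len, vocab_size)        # stand-in for model logits
labels = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for next-word indices

criterion = nn.CrossEntropyLoss()

# As in the training loop: (batch, seq, vocab) -> (batch, vocab, seq)
loss_transposed = criterion(outputs.transpose(1, 2), labels)

# Equivalent alternative: treat every position as a separate sample
loss_flat = criterion(outputs.reshape(-1, vocab_size), labels.reshape(-1))

print(loss_transposed.item(), loss_flat.item())
```

Both calls average the per-position loss over all `batch * seq_len` predictions, so the reported number is the mean cross-entropy per predicted word.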


In [29]:
# saving the model
import pickle

torch.save(model.state_dict(), "wordLSTM.pth")
with open("word_to_index.p", "wb") as fb:
    pickle.dump(word_to_index, fb)

## Generating Text with the Trained LSTM Model

In [30]:
import pickle

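# map_location lets a checkpoint saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load("wordLSTM.pth", map_location=device))
# Load the vocabulary from the same relative path it was saved to
with open("word_to_index.p", "rb") as fb:
    word_to_index = pickle.load(fb)

# Rebuild the inverse mapping from the loaded vocabulary
index_to_word = {v: k for k, v in word_to_index.items()}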

In [34]:
def sample(model, prompt, length=200):
    """Generate text of about `length` tokens that continues `prompt`."""
    model.eval()
    # Tokenize the prompt like the training text; unknown words raise a KeyError
    text = clean_text(prompt)
    hidden_state = model.initialize_hidden_state(1)
    length = length - len(text)
    # Feed the whole prompt once; after that only the newest word is fed,
    # since the hidden state already summarizes everything before it
    x = torch.tensor([[word_to_index[w] for w in text]])
    with torch.no_grad():
        for i in range(0, length):
            inputs = x.to(device)
            output, hidden_state = model(inputs, hidden_state)
            logits = output[0][-1]
            p = nn.functional.softmax(logits, dim=0).cpu().numpy()
            # Sample the next word from the predicted distribution
            idx = int(np.random.choice(len(logits), p=p))
            text.append(index_to_word[idx])
            x = torch.tensor([[idx]])
    text = " ".join(text)
    # Reattach punctuation tokens to the preceding word
    for m in punctuation:
        text = text.replace(f" {m}", f"{m} ")
    text = text.replace('"  ', '"')
    text = text.replace("'  ", "'")
    text = text.replace('" ', '"')
    text = text.replace("' ", "'")
    return text

In [35]:
print(sample(model, prompt='Anna and the prince'))

anna and the prince went back to this place whom he wanted to supersede.  but her face looked she infected him in the pavilion.  she enjoyed that box began partly still more,  he built a piteous,  healthy man,  and a very lovely lady,  and dolly died foul of her,  and understood from her husband that might inevitably prevent her facts. "you were such as a simile of society.  it would be right,  like mituh,  extricate all the children lidia ivanovna,  and his feeling and unpleasantly interested,  but so much outside her, "pregnancy,  sickness,  mental incapacity,  indifference to his son when he had no good to him.  just as he felt continually at once liking them too,  with signs of immense composure.  on the previous day the auditing of the land,  at least perfectly sacred. "bill,  if your wife won't give you home, "she responded gloomily,  looking straight before her as a surprise. "that's very hot, "said countess nordston,  packing katavasov now to
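
The `sample` function draws each next word at random from the softmax distribution instead of always taking the highest-scoring word, which is why the output varies between runs. The sketch below shows that step in plain Python on three made-up scores. The `temperature` parameter and the helper names are assumptions added for illustration and are not part of the notebook.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; `temperature` is an illustrative extra knob."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0):
    """Draw one index in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                  # toy scores for a three-word vocabulary
probs = softmax(logits)                   # what the notebook samples from
sharp = softmax(logits, temperature=0.5)  # lower temperature favors the top word
greedy = max(range(len(logits)), key=logits.__getitem__)  # argmax alternative

random.seed(0)
draws = [sample_index(logits) for _ in range(1000)]
print(probs, sharp, greedy)
print([draws.count(i) for i in range(3)])
```

Greedy decoding always picks index 0 here, while sampling also returns the less likely words some of the time. That trade-off between variety and coherence is visible in the generated passage above.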
