<a href="https://colab.research.google.com/github/SCCSMARTCODE/Deep-Learning-02/blob/main/text-generation/Word_Level_Text_Generation_with_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Data Preparation
- [x] Download and load the Shakespeare text or any text file.
- [x] Convert the text to lowercase and split it into words (tokenization).
- [x] Create a vocabulary of unique words and create word-to-index and index-to-word mappings.

#### Data Preprocessing
- [x] Set a sequence length (e.g., 10 words per sequence).
- [x] Create input sequences and corresponding target words (the next word following each sequence).
- [x] Convert the sequences and targets to tensors (PyTorch format).
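
The windowing step can be illustrated on a toy sentence. This is a minimal sketch: the sentence, the sequence length of 3, and all `toy_*` names are made up for illustration and are not part of the notebook.

```python
import re

# Hypothetical toy text standing in for the Shakespeare corpus
toy_text = "To be, or not to be: that is the question."
toy_words = re.findall(r"\b\w+\b", toy_text.lower())

# Sorted vocabulary, so each word always gets the same index
toy_vocab = sorted(set(toy_words))
toy_word_to_idx = {word: idx for idx, word in enumerate(toy_vocab)}

seq_length = 3
windows = []
for start in range(len(toy_words) - seq_length):
    context = toy_words[start:start + seq_length]   # input words
    target = toy_words[start + seq_length]          # the word that follows them
    windows.append((context, target))

# Integer-encoded version, which is what would be turned into tensors
encoded = [([toy_word_to_idx[w] for w in context], toy_word_to_idx[target])
           for context, target in windows]

for context, target in windows[:3]:
    print(context, "->", target)
```

A text of N words yields N - seq_length samples, because the last window must still have a following word to serve as its target.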

#### Build the LSTM Model
- [x] Define a PyTorch LSTM model with:
  - An embedding layer to map word indices to dense vectors.
  - LSTM layers to capture sequential word patterns.
  - A fully connected output layer to predict the next word.
- [x] Set model parameters (e.g., vocab size, embedding size, hidden units).
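
To make the three layers concrete, the sketch below traces tensor shapes through an embedding, an LSTM, and a linear output layer. The sizes are deliberately tiny and are assumptions for illustration only, not the settings this notebook uses.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, chosen small so the shapes are easy to read)
vocab_size, embedding_dim, hidden_size = 50, 8, 16
batch_size, seq_length = 4, 10

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=2, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

# A batch of random word indices: (batch, seq_len)
tokens = torch.randint(0, vocab_size, (batch_size, seq_length))

embedded = embedding(tokens)             # (batch, seq_len, embedding_dim)
lstm_out, (h_n, c_n) = lstm(embedded)    # lstm_out: (batch, seq_len, hidden_size)
logits = fc(lstm_out[:, -1, :])          # last time step only: (batch, vocab_size)

print(embedded.shape, lstm_out.shape, h_n.shape, logits.shape)
```

Only the last time step is passed to the output layer because the task is to predict a single next word per input sequence.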

#### Training the Model
- [x] Define the loss function and the optimizer.
- [x] Implement a training loop to:
  - Split data into batches.
  - Feed batches into the model.
  - Compute loss and update the model parameters.
  - Track and print training progress.

#### Text Generation
- [x] Create a text generation function:
  - Feed a seed text sequence to the trained model.
  - Predict the next word, append it to the seed, and repeat.
  - Generate a specified number of words based on the initial seed.

#### Evaluation and Optimization
- [x] Evaluate the quality of generated text for coherence and structure.
- [x] Adjust hyperparameters (embedding size, hidden units, layers) for better results.
- [x] Experiment with different sequence lengths and potentially use temperature to control the randomness of predictions.
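
The effect of temperature can be seen on a small example. This is a minimal sketch with made-up scores for three candidate words; `next_word_probs` is a hypothetical helper, not a function from this notebook.

```python
import torch

# Made-up logits for three candidate words (illustration only)
logits = torch.tensor([2.0, 1.0, 0.1])

def next_word_probs(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    return torch.softmax(logits / temperature, dim=-1)

cold = next_word_probs(logits, 0.5)     # sharper: favors the top word more strongly
neutral = next_word_probs(logits, 1.0)  # plain softmax
hot = next_word_probs(logits, 2.0)      # flatter: closer to uniform sampling

for name, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs.tolist()])
```

Temperatures below 1 make sampling more conservative and repetitive, while temperatures above 1 make it more varied at the cost of coherence.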

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torch.optim as optim
from torch.optim import lr_scheduler
import re

In [None]:
with open("shakespeare_short.txt", 'r') as f:
    texts = f.read()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
tokenized_words = re.findall(r'\b\w+\b', texts.lower())
vocab = sorted(set(tokenized_words))  # sorted so the word indices are the same on every run
idx_to_word = {idx: word for idx, word in enumerate(vocab)}
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

In [None]:
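def create_dataset(seq_length=10, text=tokenized_words):
    """Build sliding-window samples: seq_length input words and the word that follows them."""
    inputs = []  # renamed from `input` to avoid shadowing the builtin
    labels = []

    for x in range(len(text) - seq_length):
        inputs.append([word_to_idx[word] for word in text[x:x + seq_length]])
        labels.append([word_to_idx[text[x + seq_length]]])
    return inputs, labels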

In [None]:
inputs, labels = create_dataset(seq_length=40, text=tokenized_words)

# Show the first (input sequence, target word) pair
print([idx_to_word[idx] for idx in inputs[0]], idx_to_word[labels[0][0]])

['first', 'citizen', 'before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', 'all', 'speak', 'speak', 'first', 'citizen', 'you', 'are', 'all', 'resolved', 'rather', 'to', 'die', 'than', 'to', 'famish', 'all', 'resolved', 'resolved', 'first', 'citizen', 'first', 'you', 'know', 'caius', 'marcius', 'is', 'chief', 'enemy', 'to', 'the'] people


In [None]:
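hyperparameters = {
    'num_embeddings': len(word_to_idx),  # vocabulary size
    'embedding_dim': 100,
    'batch_size': 2048,
    'num_epochs': 10,
    'learning_rate': 0.01,  # defined but not passed to the optimizer in this notebook
    'max_lr': 0.01,  # only referenced by the commented-out OneCycleLR scheduler
    'param_save_path': "parameter.pth"  # defined but not used by the save calls in this notebook
}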

In [None]:
class WordLevelTextGenNetwork(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(WordLevelTextGenNetwork, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=num_embeddings, embedding_dim=embedding_dim)

        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=512,
                            num_layers=3, batch_first=True)

        self.fc_layers = nn.Sequential(
            nn.Linear(in_features=512, out_features=1024),
            nn.ReLU(),
            nn.Dropout(p=0.1),
            nn.Linear(in_features=1024, out_features=num_embeddings)
        )

    def forward(self, input, hidden=None):
        x = self.embedding(input)         # (batch, seq_len, embedding_dim)
        x, hidden = self.lstm(x, hidden)  # (batch, seq_len, 512)
        x = x[:, -1, :]                   # keep only the last time step
        x = self.fc_layers(x)             # (batch, num_embeddings) logits

        return x, hidden


network = WordLevelTextGenNetwork(hyperparameters['num_embeddings'], hyperparameters['embedding_dim'])


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
network = network.to(device)
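# map_location lets a checkpoint saved on GPU load on a CPU-only runtime as well
network.load_state_dict(torch.load("/content/drive/MyDrive/Deep Learning/word_text_gen_parameters_2nd_post.pth", map_location=device))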



<All keys matched successfully>

In [None]:
print(f"Total Generated data point: [==={len(labels)}===]")

Total Generated data point: [===208490===]


In [None]:
dataset = TensorDataset(torch.tensor(inputs), torch.tensor(labels))
dataloader = DataLoader(dataset, batch_size=hyperparameters['batch_size'], shuffle=True, num_workers=2, pin_memory=True, drop_last=True)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(network.parameters())  # Adam's default lr (1e-3); hyperparameters['learning_rate'] is not applied
# scheduler = lr_scheduler.OneCycleLR(optimizer,
#                                      max_lr=hyperparameters['max_lr'],
#                                      total_steps=hyperparameters['num_epochs'] * (len(dataloader)),
#                                     )
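
If the commented-out scheduler were re-enabled, it would need to be stepped once per batch, since `total_steps` counts batches rather than epochs. The following is a minimal, self-contained sketch of that wiring; the stand-in linear model, random data, and small step counts are assumptions for illustration and do not reproduce the notebook's training run.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

# Stand-in model and sizes (hypothetical, for illustration only)
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs, steps_per_epoch = 2, 5
scheduler = lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    total_steps=num_epochs * steps_per_epoch,  # one scheduler step per batch
)

lrs = []
for _ in range(num_epochs * steps_per_epoch):
    loss = model(torch.randn(8, 4)).pow(2).mean()  # dummy loss on random inputs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used for this batch
    scheduler.step()                        # must follow optimizer.step()

print([f"{lr:.5f}" for lr in lrs])
```

The recorded learning rates rise toward `max_lr` early in training and then decay, which is the one-cycle policy. Calling `scheduler.step()` more than `total_steps` times raises an error, so the step count has to match the number of batches actually processed.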

In [None]:
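def train(network, data_loader, optimizer, criterion, scheduler, device, num_epochs):
    """Train for num_epochs, printing the mean batch loss and saving a checkpoint after each epoch."""
    for epoch in range(num_epochs):
        network.train()

        total_loss = 0
        for inputs, targets in data_loader:
            # Labels are stored as (batch, 1); CrossEntropyLoss expects shape (batch,)
            inputs, targets = inputs.to(device), targets.view(-1).to(device)

            optimizer.zero_grad()

            outputs, _ = network(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            if scheduler is not None:
                scheduler.step()  # per-batch schedulers such as OneCycleLR step here

            total_loss += loss.item()

        avg_loss = total_loss / len(data_loader)
        print(f"Training loss for epoch {epoch + 1}: {avg_loss:.4f}")

        # Checkpoint after every epoch so an interrupted Colab session loses at most one epoch
        torch.save(network.state_dict(), "/content/drive/MyDrive/Deep Learning/word_text_gen_parameters_2nd_post.pth")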


In [None]:
train(network, dataloader, optimizer, criterion, None, device, hyperparameters['num_epochs'])

Training loss for epoch 1: 1.0516
Training loss for epoch 2: 0.8193
Training loss for epoch 3: 0.7505
Training loss for epoch 4: 0.7153
Training loss for epoch 5: 0.6714
Training loss for epoch 6: 0.6216
Training loss for epoch 7: 0.5748
Training loss for epoch 8: 0.5311
Training loss for epoch 9: 0.4943
Training loss for epoch 10: 0.4662
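
One way to read these numbers is as perplexity, which is the exponential of the mean cross-entropy loss (`nn.CrossEntropyLoss` uses the natural logarithm). The sketch below converts the epoch 10 value from the log above. Note that this is a training-set figure: the notebook has no held-out split, so it likely says more about how well the model fits the training text than about how it generalizes.

```python
import math

# Final training loss copied from the epoch 10 line of the log above
final_train_loss = 0.4662

# Perplexity = exp(mean cross-entropy in nats)
train_perplexity = math.exp(final_train_loss)
print(f"Training perplexity: {train_perplexity:.2f}")
```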


In [None]:
def generate_text(model, start_str, n_words, temperature=1.0):
    """Sample n_words from the model, continuing the seed text start_str."""
    model.eval()

    # Tokenize the seed the same way as the training text (lowercase, word characters only);
    # seed words that are not in the vocabulary are skipped.
    input_words = re.findall(r'\b\w+\b', start_str.lower())
    input_indices = [word_to_idx[word] for word in input_words if word in word_to_idx]
    if not input_indices:
        raise ValueError("None of the seed words are in the vocabulary.")

    input_seq = torch.tensor(input_indices, dtype=torch.long).unsqueeze(0).to(device)

    generated_text = start_str
    hidden = None

    with torch.no_grad():  # no gradients are needed for sampling
        for _ in range(n_words):
            # forward() returns logits for the last time step only: shape (1, vocab_size)
            output, hidden = model(input_seq, hidden)

            # Temperature < 1 sharpens the distribution, > 1 flattens it
            probs = torch.softmax(output / temperature, dim=1)

            next_word_idx = torch.multinomial(probs, num_samples=1).item()
            generated_text += " " + idx_to_word[next_word_idx]

            # Feed only the new word; the LSTM state carries the earlier context
            input_seq = torch.tensor([[next_word_idx]], dtype=torch.long, device=device)

    return generated_text

In [None]:
start_str = "What do you have to say"
generated_text = generate_text(network, start_str, 100, temperature=0.8)
print(generated_text)

What do you have to say she hath given no wit to put by the world second servant serves your sister is made so past though yea and i prithee wrong these fair men if love say your trial to be thought to part a man or liberty i conjure him that stands his wife leontes next she will call her here duke vincentio one of orders natural than this i hate to effect camillo there s no love a gentleman gainst so your fingers growing nor he seem beyond the child of a wonder for surely sir the duke hath been to my good son


In [None]:
torch.save(network.state_dict(), 'para.pth')