# **Text Generation Using an LSTM Model in PyTorch**
---

## **Overview**
>This project implements a Long Short-Term Memory (LSTM) neural network using PyTorch to generate text based on an input sequence. The dataset consists of a sample document discussing bioinformatics, which is tokenized and numerically encoded before being used for training. The model learns to predict the next word in a sequence, allowing it to generate coherent sentences. The training process uses an embedding layer to convert words into vector representations, followed by an LSTM layer that captures contextual dependencies. After training, the model can generate new text by predicting the most likely next word given an input phrase.

---
## **Importing Libraries and Setting Up the Environment**
- **PyTorch** is used for building and training the LSTM model.
- **NLTK (Natural Language Toolkit)** provides tools for tokenizing text.
- **Counter** (from `collections`) collects the unique tokens used to build the vocabulary.
- **NumPy** is used for numerical operations.
- **time** adds a short delay between generation steps.

In [1]:
!pip install nltk



In [31]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import nltk
from nltk.tokenize import word_tokenize

from collections import Counter

import numpy as np

import time

---
## **Downloading Required NLTK Tokenizer**
- Downloads the necessary NLTK tokenizer for processing text.

In [3]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

---
## **Setting Up GPU/CPU for Computation**
- The code checks if a GPU is available for faster computation; otherwise, it defaults to the CPU.

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

---
## **Tokenizing and Creating a Vocabulary**
- The text is tokenized into words.
- A vocabulary dictionary is created, mapping each unique word to a numerical index.
- `"<UNK>"` represents unknown words that are not in the vocabulary.

In [5]:
document = """Introduction to Bioinformatics

Bioinformatics is an interdisciplinary field that combines biology, computer science, mathematics, and statistics to analyze and interpret biological data. It plays a crucial role in modern life sciences, enabling researchers to understand complex biological systems, identify disease biomarkers, and develop personalized medicine approaches. With the increasing availability of high-throughput sequencing technologies, bioinformatics has become essential for processing large-scale genomic, transcriptomic, and proteomic data.

Key Areas of Bioinformatics

Genomics: This involves analyzing entire genomes to identify genes, regulatory elements, and variations. Techniques such as whole-genome sequencing (WGS) and genome-wide association studies (GWAS) help in identifying genetic mutations linked to diseases and evolutionary traits.

Transcriptomics: This focuses on studying gene expression patterns using RNA sequencing (RNA-seq) or microarrays. It helps in understanding how genes are regulated in different conditions, such as disease states versus normal tissues.

Proteomics: Bioinformatics tools analyze protein sequences, structures, and interactions. Mass spectrometry data processing, protein structure prediction, and functional annotation are key aspects of proteomics research.

Metagenomics: This involves analyzing microbial communities using high-throughput sequencing techniques. Bioinformatics tools help in identifying microbial species, their functions, and their impact on human health and the environment.

Systems Biology: This integrates multi-omics data (genomics, transcriptomics, proteomics, metabolomics) to model biological systems and predict their behavior under different conditions.

Structural Bioinformatics: This involves modeling and analyzing biological macromolecules such as DNA, RNA, and proteins. Computational tools like molecular docking and molecular dynamics simulations help in drug discovery and protein function prediction.

Bioinformatics Tools and Techniques

Sequence Alignment: Tools like BLAST and Clustal Omega align DNA, RNA, or protein sequences to identify similarities and evolutionary relationships.

Genome Assembly and Annotation: Software like SPAdes, AUGUSTUS, and Prokka help in assembling raw sequencing reads and annotating genes.

Differential Expression Analysis: Tools like DESeq2 and edgeR analyze RNA-seq data to identify differentially expressed genes.

Pathway and Functional Analysis: Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) help in understanding the biological functions of genes.

Machine Learning in Bioinformatics: AI and ML techniques are used for predicting disease risk, drug responses, and classifying biological sequences.

Applications of Bioinformatics

Disease Diagnosis and Treatment: Bioinformatics helps in identifying disease-associated genes, developing diagnostic markers, and designing targeted therapies.

Personalized Medicine: Genomic data is used to tailor treatments based on individual genetic profiles.

Agricultural Biotechnology: Genetic analysis of crops and livestock improves yield, resistance, and nutritional content.

Environmental Studies: Metagenomic studies analyze microbial communities in different ecosystems, aiding in biodiversity conservation and pollution control.

Conclusion

Bioinformatics is a rapidly evolving field that continues to revolutionize biological research. With advancements in computational power and data analytics, bioinformatics will play an increasingly vital role in healthcare, agriculture, and environmental science, paving the way for significant scientific breakthroughs.
"""

In [6]:
tokens = word_tokenize(document.lower())

In [7]:
vocab = {"<UNK>": 0}

for token in Counter(tokens).keys():
  if token not in vocab:
    vocab[token] = len(vocab)
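
To make the mapping concrete, here is a minimal sketch of the same loop on a toy sentence. It assumes whitespace splitting with `str.split` in place of NLTK's `word_tokenize` so that it runs without any downloads; `toy_text`, `toy_tokens`, and `toy_vocab` are illustrative names, not part of the notebook.

```python
from collections import Counter

toy_text = "genes encode proteins and proteins fold"
# Assumption: a whitespace split stands in for word_tokenize
toy_tokens = toy_text.lower().split()

toy_vocab = {"<UNK>": 0}  # index 0 is reserved for unknown words
for token in Counter(toy_tokens).keys():  # unique tokens in first-seen order
    if token not in toy_vocab:
        toy_vocab[token] = len(toy_vocab)

print(toy_vocab)
# {'<UNK>': 0, 'genes': 1, 'encode': 2, 'proteins': 3, 'and': 4, 'fold': 5}
```

The repeated word `proteins` receives a single index, so the vocabulary size is the number of unique tokens plus one for `<UNK>`.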

---
## **Converting Text to Numerical Indices**
- This function converts words in a sentence into their respective indices based on the vocabulary.

In [8]:
def text_to_indices(sentence, vocab):
  num_sentences = []

  for token in sentence:
    if token in vocab:
      num_sentences.append(vocab[token])
    else:
      num_sentences.append(vocab["<UNK>"])

  return num_sentences
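
A compact variant of the same lookup, shown on a hypothetical three-word vocabulary, illustrates what happens to a word the vocabulary has never seen. The name `text_to_indices_compact` and the toy data are assumptions made for this sketch.

```python
def text_to_indices_compact(sentence, vocab):
    # dict.get falls back to the <UNK> index when the token is missing
    return [vocab.get(token, vocab["<UNK>"]) for token in sentence]

# Hypothetical vocabulary for illustration
toy_vocab = {"<UNK>": 0, "genes": 1, "encode": 2, "proteins": 3}

encoded = text_to_indices_compact(["genes", "encode", "enzymes"], toy_vocab)
print(encoded)  # [1, 2, 0]: "enzymes" is not in toy_vocab, so it maps to <UNK>
```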

---
## **Creating Training Sequences**
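- The document is split on newlines, so each "sentence" here is one line of the document (a heading or a full paragraph); blank lines tokenize to empty lists.
- Each line is tokenized and converted into numerical indices.
- Every prefix of two or more tokens becomes one training sample, so a line with `n` tokens yields `n - 1` growing sequences.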


In [9]:
sentences = document.split("\n")
numerical_sentences = []

for sentence in sentences:
  numerical_sentences.append(text_to_indices(word_tokenize(sentence.lower()), vocab))

In [10]:
training_sequence = []

for sentence in numerical_sentences:
  for i in range(1, len(sentence)):
    training_sequence.append(sentence[:i+1])
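
The prefix expansion is easiest to see on a single short sentence. The sketch below uses hypothetical indices; the last element of each prefix later becomes the prediction target.

```python
toy_sentence = [4, 9, 2, 7]  # hypothetical indices for a four-word sentence

toy_sequences = []
for i in range(1, len(toy_sentence)):
    toy_sequences.append(toy_sentence[:i + 1])  # prefix of length i + 1

print(toy_sequences)  # [[4, 9], [4, 9, 2], [4, 9, 2, 7]]
```

A one-token or empty sentence produces no samples, because `range(1, len(sentence))` is empty in both cases.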

---
## **Padding Sequences to Uniform Length**
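- All sequences are brought to the same length (the longest one, 75 tokens here) by padding them with zeros at the beginning. Note that index 0 is also the `<UNK>` index, so padding and unknown words share one embedding.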

In [11]:
lengths = []

for seq in training_sequence:
  lengths.append(len(seq))

max(lengths)

75

In [12]:
padded_training_sequence = []

for seq in training_sequence:
  padded_training_sequence.append([0] * (max(lengths) - len(seq)) + seq)

In [13]:
padded_training_sequence = torch.tensor(padded_training_sequence, dtype=torch.long)
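
As a small illustration of left-padding, the sketch below pads three hypothetical prefixes to a common length. Padding on the left keeps the most recent words, and the target word, at the end of every row.

```python
toy_sequences = [[4, 9], [4, 9, 2], [4, 9, 2, 7]]  # hypothetical prefixes
max_len = max(len(seq) for seq in toy_sequences)

# Prepend as many zeros as needed to reach max_len
padded = [[0] * (max_len - len(seq)) + seq for seq in toy_sequences]

for row in padded:
    print(row)
# [0, 0, 4, 9]
# [0, 4, 9, 2]
# [4, 9, 2, 7]
```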

---
## **Splitting Data into Input and Target**
- `X` contains all words except the last one in each sequence (input).
- `y` contains the last word in each sequence (target output).

In [14]:
X = padded_training_sequence[:, :-1]
y = padded_training_sequence[:, -1]
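
The same split can be traced by hand on nested lists. This sketch assumes plain Python lists instead of a tensor, so the comprehensions below play the role of the `[:, :-1]` and `[:, -1]` slices.

```python
padded = [
    [0, 0, 4, 9],
    [0, 4, 9, 2],
    [4, 9, 2, 7],
]  # hypothetical left-padded sequences

X_rows = [row[:-1] for row in padded]  # list equivalent of padded[:, :-1]
y_vals = [row[-1] for row in padded]   # list equivalent of padded[:, -1]

for x_row, target in zip(X_rows, y_vals):
    print(x_row, "->", target)
# [0, 0, 4] -> 9
# [0, 4, 9] -> 2
# [4, 9, 2] -> 7
```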

---
## **Creating a Dataset Class**
- This custom dataset class exposes the `(X, y)` pairs by index so that PyTorch can load them efficiently.
- `DataLoader` shuffles the samples each epoch and serves them in batches of 32.

In [15]:
class WordDataset(Dataset):
  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

In [16]:
dataset = WordDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
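
The number of batches per epoch follows directly from the dataset size. The arithmetic below uses a hypothetical sample count, not the real size of this dataset, and relies on `DataLoader` keeping the final partial batch (its default `drop_last=False`).

```python
import math

num_samples = 1000  # hypothetical dataset size, not the real one
batch_size = 32

# With drop_last=False, the final smaller batch is kept
num_batches = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)  # 32 8
```

Under these assumptions `len(dataloader)` would be 32, which is the divisor used when a training loop averages the loss over batches.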

---
## **Defining the LSTM Model**
- **Embedding Layer**: Converts word indices into dense vector representations.
- **LSTM Layer**: Processes sequences and learns dependencies between words.
- **Fully Connected Layer**: Maps the LSTM output to the vocabulary size for predicting the next word.

In [17]:
class LSTM(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()

    self.embedding = nn.Embedding(vocab_size, 100)
    self.lstm = nn.LSTM(100, 150, batch_first=True)
    self.fc = nn.Linear(150, vocab_size)

  def forward(self, x):
    embedded = self.embedding(x)
    inter_hidden_states, (final_hidden_state, final_cell_state) = self.lstm(embedded)
    output = self.fc(final_hidden_state.squeeze(0))

    return output
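
To see how the three layers fit together, the sketch below pushes a random batch through identically sized layers and prints the tensor shapes. The vocabulary size, batch size, and sequence length are hypothetical values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, batch, seq_len = 20, 4, 6  # hypothetical sizes for illustration

embedding = nn.Embedding(vocab_size, 100)
lstm = nn.LSTM(100, 150, batch_first=True)
fc = nn.Linear(150, vocab_size)

x = torch.randint(0, vocab_size, (batch, seq_len))
embedded = embedding(x)                   # (4, 6, 100)
all_states, (h_n, c_n) = lstm(embedded)   # all_states: (4, 6, 150), h_n: (1, 4, 150)
logits = fc(h_n.squeeze(0))               # (4, 20): one score per vocabulary entry

print(embedded.shape, all_states.shape, h_n.shape, logits.shape)
```

For a single-layer, unidirectional LSTM the final hidden state equals the last time step of the per-step outputs, so `all_states[:, -1, :]` would give the same input to the linear layer.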

---
## **Training the Model**
- **Loss Function**: `CrossEntropyLoss`, which treats next-word prediction as multi-class classification over the vocabulary.
- **Optimizer**: Adam with a learning rate of 0.01.
- **Training Loop**: Runs for 50 epochs, computing the loss and updating model parameters on every batch.

In [18]:
model = LSTM(len(vocab))
model.to(device)

epochs = 50
learning_rate = 0.01

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
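
For a single sample, cross-entropy is the negative log of the softmax probability assigned to the correct word. The pure-Python sketch below reproduces that formula for intuition only; the vocabulary size and logits are hypothetical, and `cross_entropy` is an illustrative helper rather than the PyTorch implementation.

```python
import math

def cross_entropy(logits, target):
    # Numerically stable log-sum-exp, then negative log-likelihood of the target
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

vocab_size = 200  # hypothetical vocabulary size

# Equal scores for every word: the loss equals ln(vocab_size)
uniform = cross_entropy([0.0] * vocab_size, target=3)

# A large score on the correct word drives the loss toward zero
confident = cross_entropy([10.0 if i == 3 else 0.0 for i in range(vocab_size)], target=3)

print(round(uniform, 4), round(math.log(vocab_size), 4))  # 5.2983 5.2983
print(confident < uniform)  # True
```

A model that spreads probability evenly over `V` words scores about `ln(V)`, so a loss of several nats early in training is expected for a vocabulary of a few hundred words.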

In [19]:
for epoch in range(epochs):
  total_loss = 0

  for X_batch, y_batch in dataloader:
    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
    optimizer.zero_grad()
    output = model(X_batch)
    loss = criterion(output, y_batch)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()

  print(f"Epoch: {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")

Epoch: 1, Loss: 5.3139
Epoch: 2, Loss: 3.8702
Epoch: 3, Loss: 2.3366
Epoch: 4, Loss: 1.1713
Epoch: 5, Loss: 0.5555
Epoch: 6, Loss: 0.3002
Epoch: 7, Loss: 0.1776
Epoch: 8, Loss: 0.1285
Epoch: 9, Loss: 0.0860
Epoch: 10, Loss: 0.0455
Epoch: 11, Loss: 0.0306
Epoch: 12, Loss: 0.0275
Epoch: 13, Loss: 0.0215
Epoch: 14, Loss: 0.0199
Epoch: 15, Loss: 0.0189
Epoch: 16, Loss: 0.0186
Epoch: 17, Loss: 0.0147
Epoch: 18, Loss: 0.0140
Epoch: 19, Loss: 0.0169
Epoch: 20, Loss: 0.0174
Epoch: 21, Loss: 0.0170
Epoch: 22, Loss: 0.0204
Epoch: 23, Loss: 0.0191
Epoch: 24, Loss: 0.0111
Epoch: 25, Loss: 0.0137
Epoch: 26, Loss: 0.0117
Epoch: 27, Loss: 0.0122
Epoch: 28, Loss: 0.0115
Epoch: 29, Loss: 0.0102
Epoch: 30, Loss: 0.0111
Epoch: 31, Loss: 0.0109
Epoch: 32, Loss: 0.0108
Epoch: 33, Loss: 0.0099
Epoch: 34, Loss: 0.0118
Epoch: 35, Loss: 0.0112
Epoch: 36, Loss: 0.0089
Epoch: 37, Loss: 0.0104
Epoch: 38, Loss: 0.0103
Epoch: 39, Loss: 0.0104
Epoch: 40, Loss: 0.0110
Epoch: 41, Loss: 0.0116
Epoch: 42, Loss: 0.0121
[output truncated]

---
## **Generating Text Using the Trained Model**
- Takes an input phrase, tokenizes it, and converts it to numerical form.
- Passes it through the model to predict the next word.
- Selects the highest-scoring word (greedy decoding) and appends it to the text.
- Repeats this for 10 tokens; punctuation marks count as tokens.
- Pauses 0.5 seconds between steps to simulate real-time text generation.


In [26]:
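def prediction(model, vocab, text):
  # Tokenize and encode the prompt the same way as the training data
  tokenized = word_tokenize(text.lower())
  num_tokenized = text_to_indices(tokenized, vocab)
  # Left-pad with zeros and add a batch dimension
  padded_num_tokenized = torch.tensor([0] * (max(lengths) - len(num_tokenized)) + num_tokenized, dtype=torch.long).unsqueeze(0).to(device)
  # Inference only, so gradient tracking is disabled
  with torch.no_grad():
    output = model(padded_num_tokenized)
  # Greedy choice: index of the largest logit
  value, index = torch.max(output, dim=1)

  return text + " " + list(vocab.keys())[index.item()]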

In [32]:
num_tokens = 10
input_text = "Bioinformatics is an"

for i in range(num_tokens):
  predicted_text = prediction(model, vocab, input_text)
  print(predicted_text)
  input_text = predicted_text
  time.sleep(0.5)

Bioinformatics is an interdisciplinary
Bioinformatics is an interdisciplinary field
Bioinformatics is an interdisciplinary field that
Bioinformatics is an interdisciplinary field that combines
Bioinformatics is an interdisciplinary field that combines biology
Bioinformatics is an interdisciplinary field that combines biology ,
Bioinformatics is an interdisciplinary field that combines biology , computer
Bioinformatics is an interdisciplinary field that combines biology , computer science
Bioinformatics is an interdisciplinary field that combines biology , computer science ,
Bioinformatics is an interdisciplinary field that combines biology , computer science , mathematics
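
The generation loop itself can be sketched without the trained model. The example below replaces the LSTM with a hypothetical table of next-word scores (`toy_scores`) that looks only at the last word, so it illustrates the greedy decode-and-append loop and nothing about what the LSTM has learned.

```python
# Hypothetical next-word scores standing in for the trained model's logits
toy_scores = {
    "bioinformatics": {"is": 2.1, "tools": 1.3},
    "is": {"an": 1.8, "a": 1.7},
    "an": {"interdisciplinary": 3.0, "increasingly": 0.4},
}

def greedy_generate(prompt, steps):
    words = prompt.lower().split()
    for _ in range(steps):
        candidates = toy_scores.get(words[-1])
        if not candidates:  # no known continuation: stop early
            break
        # argmax over candidate scores, analogous to torch.max over logits
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

print(greedy_generate("Bioinformatics", 5))
# bioinformatics is an interdisciplinary
```

Because greedy decoding always takes the top-scoring word, the same prompt yields the same continuation on every run.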
