Creating an LSTM (Long Short-Term Memory) model for poem generation involves several steps. An LSTM is a type of recurrent neural network that is well suited to sequence modelling tasks such as predicting the next word of a poem from the words before it. Here's a step-by-step guide to building an LSTM model for poem creation:

1. **Data Collection**:
   Gather a large dataset of poems that will serve as the training data for your model. You can find poems from online sources, public domain poetry collections, or even create your own dataset.

2. **Data Preprocessing**:
   Prepare the raw text data for training. This involves tasks such as tokenization, lowercasing, removing punctuation, and mapping each word to an integer index. You can use libraries like NLTK or the Keras `Tokenizer` for this step.

3. **Create Sequences**:
   Convert the processed text into sequences of fixed length. For instance, if your LSTM model takes sequences of 50 words as input, divide the entire poem dataset into overlapping sequences of 50 words.

4. **Split Data into Train and Test Sets**:
   Divide the dataset into training and testing sets. The training set will be used to train the LSTM model, while the testing set will help evaluate its performance.

5. **Build LSTM Model**:
   Set up the architecture of your LSTM model. In Keras or TensorFlow, you can use the `LSTM` layer along with `Embedding` and `Dense` layers to create the model.

6. **Compile Model**:
   Specify the loss function and optimization algorithm for the LSTM model using the `compile` method. Because the model predicts the next word from a fixed vocabulary, training is a multi-class classification problem, so categorical cross-entropy is a suitable loss function.

7. **Train Model**:
   Train the LSTM model on the prepared training dataset. Adjust the number of epochs and batch size according to your dataset size and computing resources. You might need to experiment with hyperparameters to achieve good results.

8. **Generate Poems**:
   After training the model, you can use it to generate poems. Start with a seed sequence, and use the trained model to predict the next word in the sequence. Then, use the predicted word to update the sequence, and repeat the process to generate the desired length of the poem.

9. **Evaluate and Iterate**:
   Evaluate the quality of the generated poems against criteria such as coherence, grammar, and overall poetic quality. Iterate on your model, data preprocessing, or hyperparameters to improve the results.
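Steps 3 and 4 can be sketched without any libraries. The helpers below (`make_windows` and `split_train_test`, both hypothetical names) cut a word list into overlapping fixed-length windows and then split the windows into training and test sets. The window size of 4 is only chosen to suit the toy input.

```python
import random

def make_windows(tokens, window_size, step=1):
    """Return overlapping windows of `window_size` tokens, moving `step` tokens at a time."""
    return [tokens[i:i + window_size]
            for i in range(0, len(tokens) - window_size + 1, step)]

def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy input: one line of eight words gives five windows of four words each.
words = "gulon ko sunna zara tum sadaen bheji hain".split()
windows = make_windows(words, window_size=4)
train, test = split_train_test(windows, test_fraction=0.2)
print(len(windows), len(train), len(test))
```

One caveat: overlapping windows share most of their words, so a random split lets near-duplicates land on both sides. Splitting by poem before windowing avoids that.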

In [1]:
# importing required libraries
import os
import random
import nltk

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split

import numpy as np 

import re
from nltk.tokenize import word_tokenize

In [2]:
# saving list of names of txt files
directory_path = r'C:\Users\rohit\Python\Personal Projects\shaayar\ghazal_data\all'

txt_files = [file for file in os.listdir(directory_path) if file.endswith('.txt')]

In [3]:
# read every poem file and join the contents into one string
poems = []

for txt_file in txt_files:
    file_path = os.path.join(directory_path, txt_file)

    with open(file_path, 'r', encoding='utf-8') as file:
        poems.append(file.read())

poems_data = ''.join(poems)

In [4]:
poems_data[:1000]

'gulon ko sunna zara tum sadaen bheji hain\ngulon ke haath bahut si duaen bheji hain\n\njo aftab kabhi bhi ghurub hota nahin\nhamara dil hai usi ki shuaen bheji hain\n\nagar jalae tumhen bhi shifa mile shayad\nik aise dard ki tum ko shuaen bheji hain\n\ntumhari khushk si ankhen bhali nahin lagtin\nvo saari chizen jo tum ko rulaen, bheji hain\n\nsiyah rang chamakti hui kanari hai\npahan lo achchhi lagengi ghaTaen bheji hain\n\ntumhare khvab se har shab lipaT ke sote hain\nsazaen bhej do ham ne khataen bheji hain\n\nakela patta hava men bahut buland uḌa\nzamin se paanv uThao havaen bheji hain\nkoi hua na ru-kash Tak meri chashm-e-tar se\nkya kya na abr aa kar yaan zor zor barse\n\nvahshat se meri yaaro khatir na jama rakhiyo\nphir aave ya na aave nau pur uTha jo ghar se\n\nab juun sarishk un se phirne ki chashm mat rakh\njo khaak men mile hain gir kar tiri nazar se\n\ndidar khvah us ke kam hon to shor kam ho\nhar subh ik qayamat uThti hai us ke dar se\n\ndaagh ek ho jila bhi khuun ek ho 

In [5]:
len(poems_data)

2005787

In [6]:
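# lowercase the text and split it into lines; each line is treated as one sample
corpus = poems_data.lower().split('\n')
# NOTE: the default Tokenizer filters strip '-' and '\n', so these markers never become tokens
corpus = [line.replace('-', ' - ') + ' \n' for line in corpus]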

In [7]:
corpus[:5]

['gulon ko sunna zara tum sadaen bheji hain \n',
 'gulon ke haath bahut si duaen bheji hain \n',
 ' \n',
 'jo aftab kabhi bhi ghurub hota nahin \n',
 'hamara dil hai usi ki shuaen bheji hain \n']

In [8]:
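# num_words = 100 keeps only the 99 most frequent words when text is converted to
# sequences; this trains faster but limits the generated poems to those words
tokenizer = Tokenizer(num_words = 100)
#tokenizer = Tokenizer() # full vocabulary, for better results
tokenizer.fit_on_texts(corpus)
# word_index always holds the full vocabulary, regardless of num_words
total_words = len(tokenizer.word_index) + 1
total_words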

19023
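The vocabulary size above is 19023 even though the tokenizer was created with `num_words = 100`. The Keras `Tokenizer` records every word in `word_index` and applies the cap only when text is converted to sequences. The sketch below imitates that behaviour in plain Python so the effect is visible on a toy corpus. `build_word_index` and `encode` are hypothetical helpers, not part of Keras.

```python
from collections import Counter

def build_word_index(lines):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for line in lines for word in line.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(line, word_index, num_words=None):
    """Map words to indices, dropping any word whose index is num_words or higher."""
    ids = [word_index[w] for w in line.split() if w in word_index]
    if num_words is not None:
        ids = [i for i in ids if i < num_words]
    return ids

lines = ["dil hai dil", "dil ki baat hai", "baat ki raat"]
word_index = build_word_index(lines)

print(len(word_index) + 1)                    # vocabulary size ignores the cap
print(encode("dil ki raat", word_index))      # every word is kept
print(encode("dil ki raat", word_index, 3))   # only indices 1 and 2 survive
```

With a cap in place the model still has one output unit per vocabulary word, but most of those words never appear as inputs or targets.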

In [9]:
input_seqs = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram = token_list[:i+1]
        input_seqs.append(n_gram)

In [10]:
max_seq_len = max([len(x) for x in input_seqs])
max_seq_len

17

In [11]:
input_seqs = np.array(pad_sequences(input_seqs, maxlen = max_seq_len, padding = 'pre'))

In [12]:
X_train = input_seqs[:, :-1]
y_train = input_seqs[:, -1]
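To see what cells 9 to 12 produce, here is the same logic on a single toy line, written in plain Python. The token ids are made up, and `ngram_prefixes` and `pad_pre` are hypothetical stand-ins for the loop in cell 9 and for `pad_sequences` with `padding = 'pre'`.

```python
def ngram_prefixes(token_list):
    """Every prefix of length 2 or more, matching the loop in cell 9."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value` up to `maxlen` (inputs here never exceed maxlen)."""
    return [value] * (maxlen - len(seq)) + seq

line = [4, 7, 2, 9]                      # hypothetical token ids for one line
prefixes = ngram_prefixes(line)
maxlen = max(len(p) for p in prefixes)
padded = [pad_pre(p, maxlen) for p in prefixes]

# The last token of each padded row is the label; the rest are the features.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]
for features, label in zip(X, y):
    print(features, "->", label)
```

Each line of a poem therefore yields one training sample per word after the first, with the model asked to predict that word from everything before it.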

In [13]:
y_train = tf.keras.utils.to_categorical(y_train, num_classes = total_words)

In [14]:
print(X_train.shape)
print(y_train.shape)

(160167, 16)
(160167, 19023)
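The one-hot target matrix is large. A quick estimate of its size, assuming 4-byte float32 entries (the dtype may differ between Keras versions), compared with keeping the labels as plain integers:

```python
n_samples, vocab_size = 160167, 19023   # shapes printed above
bytes_per_float32 = 4                    # assumed dtype of the one-hot matrix

one_hot_bytes = n_samples * vocab_size * bytes_per_float32
integer_bytes = n_samples * 4            # one int32 label per sample

print(f"one-hot targets: {one_hot_bytes / 1e9:.1f} GB")
print(f"integer targets: {integer_bytes / 1e6:.2f} MB")
```

If memory is tight, one option is to skip `to_categorical`, keep the integer `y_train` from cell 12, and compile with `loss = 'sparse_categorical_crossentropy'`, which computes the same loss from integer labels.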


In [15]:
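model = Sequential()
# map each word index to a 120-dimensional vector
model.add(Embedding(total_words, 120, input_length = max_seq_len - 1))
model.add(Bidirectional(LSTM(100)))
# one output probability per word in the vocabulary
model.add(Dense(total_words, activation = 'softmax'))
# `learning_rate` replaces the deprecated `lr` argument; 0.1 is far above Adam's
# default of 0.001, so lower it if the loss does not decrease
adam = Adam(learning_rate = 0.1)
model.compile(loss = 'categorical_crossentropy', optimizer = adam, metrics = ['accuracy'])
history = model.fit(X_train, y_train, epochs = 20, verbose = 1)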



Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
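Each epoch of `model.fit` reports a loss. With categorical cross-entropy that loss is the mean negative log-probability (in nats) the model assigns to the correct next word, so its exponential is the perplexity, a common way to read language-model quality. A minimal sketch with made-up probabilities; `mean_cross_entropy` and `perplexity` are hypothetical helper names:

```python
import math

def mean_cross_entropy(true_word_probs):
    """Average negative log-probability assigned to the correct words."""
    return -sum(math.log(p) for p in true_word_probs) / len(true_word_probs)

def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy_nats)

# A model that guesses uniformly over 19023 words has perplexity 19023.
uniform = [1 / 19023] * 5
print(round(perplexity(mean_cross_entropy(uniform))))

# Hypothetical probabilities a trained model gave to the correct words.
better = [0.5, 0.25]
print(round(perplexity(mean_cross_entropy(better)), 3))
```

A perplexity close to the vocabulary size means the model has learned little beyond guessing, while lower values mean it narrows the choice of next word.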


In [None]:
# A deeper model that may generate better text (slower to train)

#model = Sequential()
#model.add(Embedding(total_words, 240, input_length = max_seq_len - 1))
#model.add(Bidirectional(LSTM(150, return_sequences = True))) # pass the full sequence to the next LSTM
#model.add(Bidirectional(LSTM(100)))
#model.add(Dense(total_words, activation = 'softmax'))
#adam = Adam(learning_rate = 0.001)
#model.compile(loss = 'categorical_crossentropy', optimizer = adam, metrics = ['accuracy'])
#history = model.fit(X_train, y_train, epochs = 200, verbose = 1)

In [19]:
seed_text = 'Uski aakhein jaise'
next_words = 20 # number of words to generate

for _ in range(next_words):
    # encode the text so far and left-pad it to the model's input length
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen = max_seq_len - 1, padding = 'pre')
    predicted_probs = model.predict(token_list, verbose = 0)
    # greedy decoding: always take the most probable next word
    predicted_index = int(np.argmax(predicted_probs))

    # index_word is the reverse of word_index; index 0 is padding and maps to ''
    output_word = tokenizer.index_word.get(predicted_index, '')
    seed_text += ' ' + output_word
print(seed_text)

Uski aakhein jaise to hai e e ka hai e ka hai e ka hai e ka hai koi ho aur ho ho
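The output repeats itself partly because `np.argmax` always picks the single most probable word, so the model can fall into a loop. A common alternative is to sample the next word from the predicted distribution, with a temperature that controls how adventurous the choice is. This is a sketch of that idea, not something the notebook does; `sample_with_temperature` is a hypothetical helper and the probabilities are made up.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Draw an index from `probs` after reshaping it with `temperature`."""
    # Work in log space; the small constant guards against log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)
    # Subtracting the peak keeps exp() from overflowing at low temperatures.
    weights = [math.exp(logit - peak) for logit in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.2]                  # hypothetical softmax output
rng = random.Random(0)                   # seeded so the run is repeatable
draws = [sample_with_temperature(probs, 0.8, rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(3)})
```

In the generation loop this would replace the `np.argmax` line with something like `sample_with_temperature(predicted_probs[0], 0.8)`. Temperatures below 1 stay close to the greedy choice, while values above 1 give more varied and less predictable text.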
