In [1]:
import nltk
nltk.download("gutenberg")
nltk.download("punkt")

[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/omkarjadhav/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /home/omkarjadhav/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Step 1: Dataset Preparation

##### Load the text

In [2]:
from nltk.corpus import gutenberg

# Load "Alice's Adventures in Wonderland" from the NLTK Gutenberg corpus
align_text = gutenberg.raw("carroll-alice.txt")
print(align_text[:500])



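# Small dummy corpus for quick experiments
# (`text` is reassigned from align_text in Step 2, so this value is not used later)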
text = """once upon a time there was a brave knight
he fought dragons and protected the kingdom
the people loved the knight for his courage and strength
every day he trained with sword and shield
his legend spread across the land far and wide"""

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an


### Step 2: Text Cleaning and Tokenization

In [3]:
import re
from nltk.tokenize import word_tokenize

# Lowercase
text = align_text.lower()

# Remove everything except letters, digits, and whitespace (all punctuation is dropped)
text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

tokens = word_tokenize(text)
print("Total Tokens: ", len(tokens))
print("First 50 Tokens: ", tokens[:50])

Total Tokens:  26384
First 50 Tokens:  ['alices', 'adventures', 'in', 'wonderland', 'by', 'lewis', 'carroll', '1865', 'chapter', 'i', 'down', 'the', 'rabbithole', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', 'but', 'it', 'had']
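
The token list above shows a side effect of the cleaning regex: apostrophes and hyphens are deleted rather than replaced by a space, so "Alice's" becomes `alices` and "Rabbit-Hole" becomes `rabbithole`. The sketch below reproduces this on a single line of the corpus. It uses plain `str.split` as a stand-in for `word_tokenize` so that it runs without NLTK data; `sample`, `cleaned`, and `sample_tokens` are illustrative names, not part of the notebook.

```python
import re

sample = "CHAPTER I. Down the Rabbit-Hole"  # line taken from the corpus preview

# Same two steps as the cell above: lowercase, then drop every character
# that is not a letter, digit, or whitespace.
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', sample.lower())

# Assumption: on punctuation-free text, splitting on whitespace is a
# reasonable stand-in for word_tokenize.
sample_tokens = cleaned.split()

print(sample_tokens)
# ['chapter', 'i', 'down', 'the', 'rabbithole']
```

If merged words such as `rabbithole` are not wanted, one option is to replace the removed characters with a space instead of an empty string, at the cost of splitting "alice's" into `alice` and `s`.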


### Step 3: Sequence Creation for Training

Now that we have a cleaned list of words (`tokens`), we can prepare the training sequences.

Goal:
* Create many input sequences, each a few words long, where the word that follows each sequence is its **label**.


For example:

Input Sequence: ["alice", "was", "beginning", "to"]    

Label: "get"
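
One way to produce pairs like the example above is a fixed-size sliding window over the token list. The following is a minimal sketch in plain Python; `make_windows` and `context_size` are hypothetical names chosen for illustration, and the notebook cell that follows builds its sequences differently (as growing prefixes of the corpus).

```python
def make_windows(words, context_size):
    """Return (context, label) pairs: `context_size` words, then the next word."""
    pairs = []
    for i in range(len(words) - context_size):
        context = words[i:i + context_size]   # the input sequence
        label = words[i + context_size]       # the word that follows it
        pairs.append((context, label))
    return pairs

words = ["alice", "was", "beginning", "to", "get", "very", "tired"]
pairs = make_windows(words, context_size=4)

for context, label in pairs:
    print(context, "->", label)
# ['alice', 'was', 'beginning', 'to'] -> get
# ['was', 'beginning', 'to', 'get'] -> very
# ['beginning', 'to', 'get', 'very'] -> tired
```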


In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import numpy as np



In [5]:
# Initialize the tokenizer and fit it on the tokens.
# This step builds the vocabulary from the list of tokens (words):
# each unique word is assigned an integer index, and those indexes are
# the values stored in the sequences.
# Example: the word "alices" has index 298, so 298 represents that word.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokens)      # Learn the vocabulary from the tokens

# Convert words to integers
sequences = tokenizer.texts_to_sequences([tokens])[0]
print("sequences size: ", len(sequences))
print("Sequence: ", sequences[:10])


# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print("\nVocabulary Size: ", vocab_size)


# Create input sequences and labels.
# Note: each sequence is a growing prefix of the whole corpus, so the
# longest one is as long as the corpus itself (26384 ids).
input_sequences = []
for i in range(1, len(sequences)):
    input_sequences.append(sequences[:i+1])
print("\nfirst 5 input sequences:\n", input_sequences[:5])

# Pad sequences to the same length.
# Because max_seq_len equals the corpus length, the padded array is very
# large (26383 x 26384); a fixed-size window would keep it far smaller.
max_seq_len = len(input_sequences[-1])
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')
print("\nfirst 5 input sequences After padding:\n", input_sequences[:5])

# Split into input (x) and label (y)
input_sequences = np.array(input_sequences)
# 2D slicing: the first index selects rows, the second selects columns
x = input_sequences[:, :-1]  # All rows, all columns except the last one
y = input_sequences[:, -1]   # All rows, last column only

print("\nInput features type: ", type(x))
print("Label type: ", type(y))

y = to_categorical(y, num_classes=vocab_size)

sequences size:  26384
Sequence:  [298, 527, 11, 826, 74, 1470, 1471, 1472, 299, 9]

Vocabulary Size:  2753

first 5 input sequences:
 [[298, 527], [298, 527, 11], [298, 527, 11, 826], [298, 527, 11, 826, 74], [298, 527, 11, 826, 74, 1470]]

first 5 input sequences After padding:
 [[   0    0    0 ...    0  298  527]
 [   0    0    0 ...  298  527   11]
 [   0    0    0 ...  527   11  826]
 [   0    0    0 ...   11  826   74]
 [   0    0    0 ...  826   74 1470]]

Input features type:  <class 'numpy.ndarray'>
Label type:  <class 'numpy.ndarray'>
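
Because every prefix of the corpus becomes one training row, `max_seq_len` equals the corpus length, and the padded array above has 26383 rows of 26384 columns. A fixed-length window keeps both the array and the number of LSTM timesteps small. The sketch below uses only NumPy and the first ten ids printed above; `make_fixed_windows` and the window length of 5 are assumptions for illustration, not values from the notebook.

```python
import numpy as np

def make_fixed_windows(ids, window):
    """Slice a 1-D list of word ids into overlapping rows of length `window`."""
    ids = np.asarray(ids)
    n_rows = len(ids) - window + 1
    # Row r holds ids[r : r + window]; its last column is the label.
    return np.stack([ids[r:r + window] for r in range(n_rows)])

ids = [298, 527, 11, 826, 74, 1470, 1471, 1472, 299, 9]  # first ten ids printed above
rows = make_fixed_windows(ids, window=5)

x_small = rows[:, :-1]   # four context ids per row
y_small = rows[:, -1]    # the id that follows them
print(x_small.shape, y_small.shape)
# (6, 4) (6,)

# Rough size of the padded prefix array built in the cell above,
# assuming 4-byte (int32) entries.
n_tokens = 26384
prefix_bytes = (n_tokens - 1) * n_tokens * 4
print(round(prefix_bytes / 1e9, 2), "GB")
# 2.78 GB
```

With integer labels like `y_small`, Keras also offers the `sparse_categorical_crossentropy` loss, which would make the one-hot `to_categorical` step unnecessary; whether that fits this notebook is left as a design choice.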


### Step 4: Building the LSTM Text Generation Model

In [6]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=100, input_length=x.shape[1]))
model.add(LSTM(units=150, return_sequences=False))
model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['accuracy'])
model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 26383, 100)        275300    
                                                                 
 lstm (LSTM)                 (None, 150)               150600    
                                                                 
 dense (Dense)               (None, 2753)              415703    
                                                                 
Total params: 841,603
Trainable params: 841,603
Non-trainable params: 0
_________________________________________________________________
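
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used above, using the standard counts for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer with bias. The variable names are illustrative.

```python
vocab_size = 2753    # from the tokenizer output above
embed_dim = 100      # Embedding output_dim
lstm_units = 150     # LSTM units

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Weight matrix from the LSTM output to every vocabulary entry, plus biases.
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 275300 150600 415703 841603
```

Note that none of these counts depends on the sequence length (26383); the sequence length affects memory and training time, not the number of weights.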



In [None]:
history = model.fit(x, y, epochs=30, batch_size=64, verbose=1)

Epoch 1/30



  3/413 [..............................] - ETA: 19:42:06 - loss: 7.8915 - accuracy: 0.0260    
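
The progress line can be read as follows: 413 is the number of batches per epoch, and the ETA implies a very long time per batch, which is consistent with each input row being 26383 timesteps long. A small sketch of that arithmetic, using only the numbers printed above (the variable names are illustrative):

```python
import math

n_samples = 26383    # rows in x, one per prefix sequence
batch_size = 64

steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)
# 413

# The ETA 19:42:06 was reported after 3 of 413 steps.
eta_seconds = 19 * 3600 + 42 * 60 + 6
remaining_steps = steps_per_epoch - 3
print(round(eta_seconds / remaining_steps), "seconds per step (approx.)")
# 173 seconds per step (approx.)
```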