 PROJECT INTRODUCTION

This project builds a Next-Word Prediction model using a Word-Level LSTM trained on the classic Tiny Shakespeare dataset.

https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

LSTMs were once state-of-the-art for sequence modeling. They capture temporal dependencies through a hidden state h_t and a cell state c_t that are carried from one time step to the next, but they struggle with:
• Long-range dependencies
• Large vocabularies
• Slow training due to sequential processing
• Vanishing or exploding gradients
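
To make the hidden state and cell state concrete, here is a minimal NumPy sketch of a single LSTM time step. The function name `lstm_step`, the layout of the four gates inside one stacked weight matrix, and the toy sizes are all assumptions made for illustration; Keras organizes and computes its LSTM weights in its own way.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b              # pre-activations for all four gates
    i = sigmoid(z[0 * units:1 * units])       # input gate
    f = sigmoid(z[1 * units:2 * units])       # forget gate
    g = np.tanh(z[2 * units:3 * units])       # candidate cell values
    o = sigmoid(z[3 * units:4 * units])       # output gate
    c_t = f * c_prev + i * g                  # new cell state
    h_t = o * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

# Toy sizes (hypothetical): 5 time steps, 8 input features, 4 hidden units
rng = np.random.default_rng(0)
input_dim, units = 8, 4
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)      # each step needs the previous h and c

print(h.shape, c.shape)
```

The loop is the point: step t cannot start before step t-1 has finished, which is the sequential processing listed above.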

The goals of this project are:
1. Build a simple LSTM-based language model
2. Evaluate it using Loss, Accuracy, and Perplexity
3. Understand its limitations
4. Motivate why modern architectures like Transformers perform far better


In [1]:
#Installing necessary libraries and packages

! pip install tensorflow numpy requests matplotlib

In [57]:
import re
import requests # fetching the corpus
import numpy as np


import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau


Defining my parameters

In [58]:
CORPUS_URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"  # Tiny Shakespeare corpus
LOWERCASE = True

VOCAB_SIZE_CAP = 2000   # vocabulary cap: only the most frequent words are kept, rarer words map to <OOV>; the softmax output layer has 2000 units
SEQ_LEN = 40            # context window: how many past words the model conditions on (LSTMs struggle with long sequences)
EMBED_DIM = 256         # size of the learned word vectors (dense, continuous embeddings rather than discrete one-hot codes)
LSTM_UNITS = 128        # hidden units in the LSTM cell
DROPOUT = 0.3           # dropout regularization to reduce overfitting
LR = 0.0005             # initial learning rate for the Adam optimizer, chosen by trial and error so that it does not overshoot
EPOCHS = 30
VAL_SPLIT = 0.1         # fraction of the data held out for validation
BATCH_SIZE = 32         # batch size
TOP_K = 5               # how many top candidates to show during inference; top-p (nucleus) or min-p sampling are alternatives

In [59]:
# Load the corpus
def fetch_corpus(url):
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # raise an error if the request failed
    return r.text

raw_text = fetch_corpus(CORPUS_URL)



In [60]:
# Cleaning text

def clean(text, lowercase=True):
    if lowercase:
        text = text.lower()
    text = re.sub(r"[^a-zA-Z'.,!?;:\-\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

text = clean(raw_text, lowercase=LOWERCASE)
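
To see what the cleaning step does, here is a small self-contained check. It repeats the `clean` function from the cell above and applies it to a short sample; the sample string is made up for illustration (the `#42` is added only to show that digits and symbols are dropped).

```python
import re

def clean(text, lowercase=True):
    if lowercase:
        text = text.lower()
    # keep letters, basic punctuation and whitespace; everything else becomes a space
    text = re.sub(r"[^a-zA-Z'.,!?;:\-\s]", " ", text)
    # collapse runs of whitespace (including newlines) into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical sample, not taken verbatim from the corpus
sample = "First Citizen:\nBefore we proceed any further, hear me speak.\n\n#42 All: Speak, speak."
cleaned = clean(sample)
print(cleaned)
# first citizen: before we proceed any further, hear me speak. all: speak, speak.
```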

In [61]:
# Tokenizing

tokenizer = Tokenizer(num_words=VOCAB_SIZE_CAP, oov_token="<OOV>")  # cap of 2000 as defined above; all other words become <OOV>
tokenizer.fit_on_texts([text])

word_index = tokenizer.word_index  # word -> id mapping
vocab_size = min(VOCAB_SIZE_CAP, len(word_index) + 1) if VOCAB_SIZE_CAP else (len(word_index) + 1)  # +1 because id 0 is reserved for padding

# Convert the full text to token ids
tokens = tokenizer.texts_to_sequences([text])[0]

# Create sliding windows of length SEQ_LEN + 1 (SEQ_LEN is 40 here):
# the first SEQ_LEN tokens are the input, the last one is the target word
sequences = []
for i in range(SEQ_LEN, len(tokens)):
    seq = tokens[i-SEQ_LEN:i+1]
    sequences.append(seq)

sequences = np.array(sequences, dtype=np.int32)

X = sequences[:, :-1]   # inputs of length SEQ_LEN
y = sequences[:, -1]    # next-word labels
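
The vocabulary cap is easier to follow on a toy example. The sketch below is a simplified pure-Python stand-in for the idea (id 0 for padding, id 1 for <OOV>, the most frequent words after that); it is not the Keras `Tokenizer` implementation, and `build_vocab` and `encode` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(words, cap):
    """Give the most frequent words ids 2..cap-1; id 0 is padding, id 1 is <OOV>."""
    counts = Counter(words)
    word_to_id = {"<OOV>": 1}
    for word, _ in counts.most_common(cap - 2):
        word_to_id[word] = len(word_to_id) + 1
    return word_to_id

def encode(words, word_to_id):
    """Replace every word outside the capped vocabulary with the <OOV> id."""
    oov_id = word_to_id["<OOV>"]
    return [word_to_id.get(w, oov_id) for w in words]

words = "to be or not to be that is the question".split()
vocab = build_vocab(words, cap=5)   # toy cap; the notebook uses 2000
ids = encode(words, vocab)
print(vocab)
print(ids)
```

With a cap of 5, only three real words fit, so most of the sentence turns into id 1. The same effect, at a larger scale, is why <OOV> shows up so often in this notebook's predictions.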

Padding uses a single special token (0), but this token can be repeated multiple times to make all sequences match a fixed length. The number of padding positions varies for each sequence, but the padding token is always the same.
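
Here is a small NumPy sketch of pre-padding, to show that only the number of zeros changes from row to row. `pad_pre` is a hypothetical helper written for illustration; the notebook itself relies on Keras `pad_sequences` with `padding='pre'`.

```python
import numpy as np

def pad_pre(sequences, maxlen, pad_id=0):
    """Left-pad each sequence with pad_id; keep only the last maxlen tokens if it is too long."""
    out = np.full((len(sequences), maxlen), pad_id, dtype=np.int32)
    for row, seq in enumerate(sequences):
        seq = list(seq)[-maxlen:]
        if seq:
            out[row, maxlen - len(seq):] = seq
    return out

padded = pad_pre([[5, 8], [3, 9, 4, 7, 2, 6], [1]], maxlen=4)
print(padded)
# [[0 0 5 8]
#  [4 7 2 6]
#  [0 0 0 1]]
```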

In [62]:
X.shape

(204022, 40)

In [63]:
X[0].shape

(40,)

In [64]:
X[0]

array([  89,  270,  140,   36,  969,  144,  669,  128,   16,  103,   34,
        103,  103,   89,  270,    7,   41,   34, 1268,  351,    4,  200,
         64,    4,    1,   34, 1268, 1268,   89,  270,   89,    7,   92,
       1141,  232,   12,    1,  581,    4,    2], dtype=int32)

In [65]:
X[0][-1]

np.int32(2)

In [90]:
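np.argmax(y[0])  # y is already one-hot here: this cell was run after the encoding step further down (see the execution count)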

np.int64(306)

Now let's look at the second sample

In [67]:
X[1]  # second sample; its last element is 306, which is the target for X[0]

array([ 270,  140,   36,  969,  144,  669,  128,   16,  103,   34,  103,
        103,   89,  270,    7,   41,   34, 1268,  351,    4,  200,   64,
          4,    1,   34, 1268, 1268,   89,  270,   89,    7,   92, 1141,
        232,   12,    1,  581,    4,    2,  306], dtype=int32)

In next-word prediction, each training sample is created by sliding a fixed-length window over the text. Take the sentence “my name is Susnata” with a window of two words (this notebook uses SEQ_LEN = 40): the first input is ["my", "name"] and its target is "is"; the second input is ["name", "is"] and its target is "Susnata". In general, X[i] holds the SEQ_LEN words of the window and y[i] is the word that follows them in the text. Because the window shifts by one position each time, the target of X[i] becomes the last element of X[i+1], as seen above where 306 is both the label for X[0] and the last entry of X[1]. This is how sequence data is prepared for LSTM next-word prediction.
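
The same windowing can be tried on the four-word sentence. This is a minimal sketch that follows the loop used earlier in the notebook, with a toy window of two words instead of 40; `make_windows` is a hypothetical helper name.

```python
def make_windows(tokens, seq_len):
    """Fixed-length sliding windows: seq_len input tokens, then the token that follows."""
    X, y = [], []
    for i in range(seq_len, len(tokens)):
        X.append(tokens[i - seq_len:i])
        y.append(tokens[i])
    return X, y

tokens = ["my", "name", "is", "susnata"]
X_toy, y_toy = make_windows(tokens, seq_len=2)

for inputs, target in zip(X_toy, y_toy):
    print(inputs, "->", target)
# ['my', 'name'] -> is
# ['name', 'is'] -> susnata
```

The target of the first window ("is") is the last input word of the second window, which is the same shift-by-one relationship seen between X[0] and X[1].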

In [68]:
#reverse mapping for testing

reverse_word_index = {v: k for k, v in tokenizer.word_index.items()}

In [69]:
reverse_word_index.get(1225, "<OOV>")

'condition'

In [70]:
reverse_word_index.get(74, "<OOV>")

"i'll"

In [71]:
X.shape[1]

40

In [72]:
# One-hot encode the targets and pre-pad the inputs

y = to_categorical(y, num_classes=vocab_size)
max_len = X.shape[1]  # equals SEQ_LEN (40)
X = pad_sequences(X, maxlen=max_len, padding='pre')  # safety only; the windows already have a fixed length of 40

X is padded to a fixed sequence length using the PAD token (0). Y is a single integer representing the next word, and after one-hot encoding it becomes a vector of size vocab_size (not the sequence length). Padding is only for input sequences, not for targets.
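
A short NumPy sketch of what one-hot encoding does to the labels. `one_hot` is a hypothetical helper that mimics the idea behind `to_categorical`; the three example labels are arbitrary ids below 2000.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label position."""
    labels = np.asarray(labels)
    out = np.zeros((labels.size, num_classes), dtype=np.float32)
    out[np.arange(labels.size), labels] = 1.0
    return out

labels = np.array([306, 2, 1])
encoded = one_hot(labels, num_classes=2000)

print(encoded.shape)                 # one row per label, one column per vocabulary id
print(np.argmax(encoded, axis=1))    # argmax recovers the original integer labels
```

Each target row has vocab_size entries, so the one-hot matrix for roughly 204,000 samples is large (about 1.6 GB if stored as float32). Keras also provides the `sparse_categorical_crossentropy` loss, which takes the integer labels directly and avoids building this matrix.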

In [73]:
from tensorflow.keras.layers import LayerNormalization

In [74]:
# Build the model with the hyperparameters defined earlier
model = Sequential([
    Embedding(
        input_dim=vocab_size,
        output_dim=EMBED_DIM,
        input_length=max_len,
        embeddings_regularizer=tf.keras.regularizers.l2(1e-6)  # tiny L2 regularization
    ),
    LayerNormalization(),
    LSTM(LSTM_UNITS),
    Dropout(DROPOUT),

    Dense(256, activation="relu"),
    Dropout(0.3),

    Dense(vocab_size, activation="softmax")  # softmax for multi-class classification over the 2000-word vocabulary
])

opt = Adam(learning_rate=LR, clipnorm=1.0)  # gradient clipping guards against exploding gradients

model.compile(
    loss="categorical_crossentropy",  # targets are one-hot encoded
    optimizer=opt,
    metrics=["accuracy"]
)

model.summary()
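
The `clipnorm=1.0` setting above rescales a gradient whenever its norm exceeds 1. A minimal NumPy sketch of that idea follows; `clip_by_norm` is a hypothetical helper, and the optimizer applies its own version of this internally.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Scale grad down so its L2 norm is at most max_norm; leave smaller gradients unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

big = np.array([3.0, 4.0])      # norm 5.0, gets scaled down
small = np.array([0.3, 0.4])    # norm 0.5, left as it is

clipped_big = clip_by_norm(big)
clipped_small = clip_by_norm(small)
print(clipped_big, np.linalg.norm(clipped_big))
print(clipped_small, np.linalg.norm(clipped_small))
```

The direction of the gradient is kept and only its length is capped, which limits the damage of an exploding gradient without changing where the update points.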



In [75]:
BATCH_SIZE=32

ReduceLROnPlateau automatically lowers the learning rate when the validation loss stops improving. If the model does not get better for `patience` epochs (3 here), the learning rate is multiplied by `factor` (0.5 here), but never pushed below `min_lr`. This tells the optimizer to take smaller steps so training can continue more smoothly instead of getting stuck.
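
The rule can be simulated in a few lines of plain Python. This is a simplified sketch of the plateau logic, not the Keras callback itself; `simulate_reduce_lr` is a hypothetical helper and `min_delta` is an assumed threshold for what counts as an improvement. The validation losses are the five values logged for epochs 1 to 5 of the run in this notebook.

```python
def simulate_reduce_lr(val_losses, lr, factor=0.5, patience=3, min_lr=1e-6, min_delta=1e-4):
    """Return the learning rate used in each epoch and the rate for the following epoch."""
    best = float("inf")
    wait = 0
    used = []
    for loss in val_losses:
        used.append(lr)                      # rate in effect during this epoch
        if loss < best - min_delta:          # improvement: remember it and reset the counter
            best, wait = loss, 0
        else:                                # no improvement: count the stalled epoch
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
    return used, lr

logged_val_losses = [5.4311, 5.3711, 5.3786, 5.3766, 5.3933]
used, next_lr = simulate_reduce_lr(logged_val_losses, lr=0.0005)
print(used)
print(next_lr)
```

Under this simplified logic the best validation loss is reached in epoch 2, epochs 3 to 5 count as stalled, and the rate would be halved for epoch 6. The training log in this notebook is cut off at epoch 6, so the output shown cannot confirm that.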

In [76]:
callbacks = [
    # stop when val_loss has not improved for 10 epochs and restore the best weights
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    ReduceLROnPlateau(
        monitor="val_loss",
        factor=0.5,
        patience=3,
        min_lr=1e-6
    )
]

history = model.fit(
    X, y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=VAL_SPLIT,
    callbacks=callbacks,
    verbose=1
)

Epoch 1/30
5739/5739 ━━━━━━━━━━━━━━━━━━━━ 62s 10ms/step - accuracy: 0.1198 - loss: 5.6809 - val_accuracy: 0.1429 - val_loss: 5.4311 - learning_rate: 5.0000e-04
Epoch 2/30
5739/5739 ━━━━━━━━━━━━━━━━━━━━ 59s 10ms/step - accuracy: 0.1451 - loss: 5.1235 - val_accuracy: 0.1505 - val_loss: 5.3711 - learning_rate: 5.0000e-04
Epoch 3/30
5739/5739 ━━━━━━━━━━━━━━━━━━━━ 62s 11ms/step - accuracy: 0.1521 - loss: 4.9622 - val_accuracy: 0.1538 - val_loss: 5.3786 - learning_rate: 5.0000e-04
Epoch 4/30
5739/5739 ━━━━━━━━━━━━━━━━━━━━ 61s 11ms/step - accuracy: 0.1570 - loss: 4.8667 - val_accuracy: 0.1535 - val_loss: 5.3766 - learning_rate: 5.0000e-04
Epoch 5/30
5739/5739 ━━━━━━━━━━━━━━━━━━━━ 59s 10ms/step - accuracy: 0.1601 - loss: 4.7914 - val_accuracy: 0.1553 - val_loss: 5.3933 - learning_rate: 5.0000e-04
Epoch 6/30
5739/5739

In [94]:
final_train_loss = history.history["loss"][-1]
final_val_loss = history.history["val_loss"][-1]
final_train_acc = history.history["accuracy"][-1]
final_val_acc = history.history["val_accuracy"][-1]

In [95]:
final_val_loss

5.50492000579834

In [97]:
import numpy as np

# Perplexity = exp(cross-entropy loss)
def susnata_compute_perplexity(model, X_val, y_val):
    loss, acc = model.evaluate(X_val, y_val, verbose=0)
    ppl = np.exp(loss)
    print(f"Validation Loss: {loss:.4f}")
    print(f"Validation Accuracy: {acc:.4f}")
    print(f"Validation Perplexity: {ppl:.2f}")
    return ppl

In [98]:


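# X_val and y_val are not defined in the cells shown. Assumption: they are the last
# VAL_SPLIT fraction of the samples, which is the part validation_split holds out.
split_at = int(len(X) * (1 - VAL_SPLIT))
X_val, y_val = X[split_at:], y[split_at:]

susnata_compute_perplexity(model, X_val, y_val)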

Validation Loss: 5.3710
Validation Accuracy: 0.1505
Validation Perplexity: 215.09


np.float64(215.08795122807007)

In [99]:
# Perplexity measures how many equally-plausible next words the model is juggling; higher perplexity means the model is confused between many possible tokens.
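
To put the number in context, this short sketch recomputes the perplexity from the validation loss printed above and compares it with a model that guesses uniformly over the capped vocabulary. The uniform baseline is my own reference point, not something measured in this notebook.

```python
import math

val_loss = 5.3710       # validation loss printed above
vocab_size = 2000       # VOCAB_SIZE_CAP

model_ppl = math.exp(val_loss)            # perplexity of the trained model
uniform_loss = math.log(vocab_size)       # cross-entropy of a uniform guess over 2000 words
uniform_ppl = math.exp(uniform_loss)      # equals vocab_size

print(f"model:   loss {val_loss:.4f}, perplexity {model_ppl:.2f}")
print(f"uniform: loss {uniform_loss:.4f}, perplexity {uniform_ppl:.2f}")
```

A perplexity of about 215 means the model is, on average, as uncertain as if it were choosing among roughly 215 equally likely words. That is well below the 2000 of a uniform guess, but still a lot of uncertainty.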

In [112]:
# Inference with top-k candidates (k = 5); top-p (nucleus) or min-p sampling could be used instead, depending on requirements

def susnata_predict_next_word(seed_text, top_k=5):
    # preprocess the seed text the same way as the training data
    seed_text = seed_text.lower().strip()
    seq = tokenizer.texts_to_sequences([seed_text])[0]
    seq = seq[-SEQ_LEN:]
    seq = pad_sequences([seq], maxlen=SEQ_LEN, padding='pre')

    # model prediction: probability for every word id
    probs = model.predict(seq, verbose=0)[0]

    # top-k indices, sorted by descending probability
    top_ids = np.argsort(probs)[-top_k:][::-1]

    # build a list of (word, probability) pairs
    results = []
    for idx in top_ids:
        word = reverse_word_index.get(idx, "<OOV>")
        prob = float(probs[idx])
        results.append((word, prob))

    return results

In [113]:
def generate_text(seed_text, num_words=10, greedy=True, top_k=5):
    text = seed_text.strip()

    for _ in range(num_words):
        seq = tokenizer.texts_to_sequences([text.lower()])[0]
        seq = seq[-SEQ_LEN:]
        seq = pad_sequences([seq], maxlen=SEQ_LEN, padding='pre')

        probs = model.predict(seq, verbose=0)[0]

        if greedy:
            next_id = np.argmax(probs)
        else:
            # top-k sampling
            top_ids = np.argsort(probs)[-top_k:]
            top_probs = probs[top_ids]
            top_probs = top_probs / top_probs.sum()
            next_id = np.random.choice(top_ids, p=top_probs)

        next_word = reverse_word_index.get(next_id, "<OOV>")
        text += " " + next_word

    return text
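
Top-p (nucleus) sampling was mentioned above as an alternative to top-k. Here is a minimal NumPy sketch of how it could work on a probability vector like `probs`. `top_p_filter` is a hypothetical helper, the threshold is arbitrary, and the toy distribution is made up.

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most likely tokens whose total probability reaches p, then renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]                    # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # number of tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

toy_probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])      # made-up distribution over 5 token ids
nucleus = top_p_filter(toy_probs, p=0.75)

rng = np.random.default_rng(0)
next_id = rng.choice(len(nucleus), p=nucleus)          # sample only from the kept tokens
print(nucleus, next_id)
```

Unlike top-k, the number of candidates adapts to the distribution: a confident prediction keeps only a few tokens, a flat one keeps many.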

In [None]:

#  Demo
seed = "The fruits of love I mean, my "
print("Seed:", seed)
print("Top candidates:", susnata_predict_next_word(seed, top_k=5))
print("Greedy generation:", generate_text(seed, num_words=10, greedy=True))
print("Top-k sampled generation:", generate_text(seed, num_words=5, greedy=False, top_k=5))

Seed: The fruits of love I mean, my 
Top candidates: [('<OOV>', 0.05793144181370735), ('lord', 0.039422597736120224), ('son', 0.013679040595889091), ('heart', 0.012700056657195091), ('brother', 0.012518608942627907)]
Greedy generation: The fruits of love I mean, my  <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV>
Top-k sampled generation: The fruits of love I mean, my  heart to <OOV> the <OOV>


In [88]:
model.save("susnata_lstm_nextword_model.h5")



In [89]:
#  from google.colab import files
#  files.download("susnata_lstm_nextword_model.h5")


In [82]:
# from google.colab import files
# uploaded = files.upload()  # choose the H5 file

In [83]:
# from tensorflow.keras.models import load_model
# model = load_model("susnata_lstm_nextword_model.h5")

My CONCLUSION

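This project built a word-level LSTM language model on the Tiny Shakespeare dataset, evaluated its loss, accuracy, and perplexity, and showed how a classical recurrent network struggles with language modeling. The LSTM learned short patterns, but in the run shown the validation loss stopped improving after the second epoch, and the restored best weights reached only about 15% validation accuracy with a perplexity of about 215. With the vocabulary capped at 2000 words, <OOV> was the single most likely prediction in the demo, so greedy generation collapsed into a run of <OOV> tokens. These results are consistent with the known weaknesses of LSTMs: sequential processing, limited context memory, and difficulty modeling large vocabularies.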

Modern architectures like Transformers outperform LSTMs because they use self-attention instead of recurrence, process all tokens in parallel, capture long dependencies directly, scale to much larger models, and train more efficiently. As a result, contemporary NLP systems (GPT, BERT, Llama, etc.) achieve dramatically lower perplexity and superior language understanding compared to LSTMs.

This project shows the historical value of LSTMs but highlights why modern attention-based models dominate today’s AI landscape.