# Sentence Completion with RNN/LSTM

This notebook builds a simple next-word prediction model using quote data. We use:
- **Preprocessing**: lowercasing and removing punctuation
- **Tokenization**: Keras Tokenizer to convert text to sequences
- **Vectorization**: Padding sequences + one-hot encoding labels with **sklearn**
- **Models**: SimpleRNN and LSTM (Keras)

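Run all cells in order. In Colab, upload `qoute_dataset.csv` to the session (or mount Drive and read it from there). Section 9 saves to `/content/drive/MyDrive`, so Drive must be mounted before that step.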

## 1. Imports and load data

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string

In [9]:
# Load the quotes dataset (upload qoute_dataset.csv in Colab if needed)
df = pd.read_csv('qoute_dataset.csv')
df.head()

Unnamed: 0,quote,Author
0,“The world as we have created it is a process ...,Albert Einstein
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling
2,“There are only two ways to live your life. On...,Albert Einstein
3,"“The person, be it gentleman or lady, who has ...",Jane Austen
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe


In [10]:
# Use only the quote column for sentence completion
quotes = df['quote']
print('Number of quotes:', len(quotes))
quotes.head()

Number of quotes: 3038


Unnamed: 0,quote
0,“The world as we have created it is a process ...
1,"“It is our choices, Harry, that show what we t..."
2,“There are only two ways to live your life. On...
3,"“The person, be it gentleman or lady, who has ..."
4,"“Imperfection is beauty, madness is genius and..."


## 2. Preprocessing

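Clean the text by lowercasing it and removing punctuation, so the model sees consistent tokens. Note that `string.punctuation` covers ASCII punctuation only, so the curly quotes (`“`, `”`) in this dataset are not removed, as the output of the second cell shows.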

In [11]:
# Convert to lowercase
quotes = quotes.str.lower()

In [12]:
# Remove ASCII punctuation using a translation table
translator = str.maketrans('', '', string.punctuation)
quotes = pd.Series([q.translate(translator) for q in quotes])
quotes.head()

Unnamed: 0,0
0,“the world as we have created it is a process ...
1,“it is our choices harry that show what we tru...
2,“there are only two ways to live your life one...
3,“the person be it gentleman or lady who has no...
4,“imperfection is beauty madness is genius and ...
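
The output above still starts each quote with `“`, because that character is not in `string.punctuation`. One possible way to drop it is to filter on Unicode character categories instead of a fixed ASCII list. This is a minimal sketch and not part of the original notebook; `strip_punctuation` is a hypothetical helper name.

```python
import string
import unicodedata

def strip_punctuation(text):
    """Remove characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(('P', 'S'))
    )

sample = '“it is our choices, harry, that show what we truly are.”'
print(strip_punctuation(sample))
```

Applying a filter like this would change the tokenizer indices and every recorded output below. For example, the first index of the first quote (713) suggests that `“the` is currently stored as a separate token from `the`.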


## 3. Tokenization

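Convert each quote into a sequence of word indices. `num_words` caps the vocabulary to keep the model small. This dataset has 8,978 unique tokens, which is below the cap of 10,000, so no words are dropped here.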

In [13]:
import tensorflow as tf
# Use Keras Tokenizer (reference via tf.keras to avoid import resolution issues)
Tokenizer = tf.keras.preprocessing.text.Tokenizer

In [14]:
# Vocabulary cap: Keras keeps the (num_words - 1) most frequent words
vocab_size = 10000
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(quotes)

In [15]:
# Word index: mapping from word -> integer index
word_index = tokenizer.word_index
print('Unique words in word_index:', len(word_index))
print('Top 10 words:', list(word_index.items())[:10])

Unique words in word_index: 8978
Top 10 words: [('the', 1), ('you', 2), ('to', 3), ('and', 4), ('a', 5), ('i', 6), ('is', 7), ('of', 8), ('that', 9), ('it', 10)]


In [16]:
# Convert each quote to a list of word indices
sequences = tokenizer.texts_to_sequences(quotes)
print('Example - first quote as sequence:', sequences[0][:15], '...')

Example - first quote as sequence: [713, 62, 29, 19, 16, 946, 10, 7, 5, 1156, 8, 70, 293, 10, 145] ...
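
To make the word-to-index mapping concrete, here is a simplified pure-Python illustration of frequency-ranked indexing. It is a sketch of the idea, not the Keras implementation; `build_word_index`, `texts_to_sequences`, and the toy quotes are made up for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word, 0 stays unused."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its index and skip unknown words."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

toy_quotes = ['life is a journey', 'life is short', 'a journey is life']
toy_index = build_word_index(toy_quotes)
toy_sequences = texts_to_sequences(toy_quotes, toy_index)
print(toy_index)
print(toy_sequences)
```

As in the real `word_index` above, the most frequent word gets index 1 and rarer words get larger indices.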


## 4. Creating input (X) and target (Y)

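For each quote we create training pairs: the **input** is the first `i` words and the **target** is the word that follows them. A quote with `n` words therefore yields `n - 1` pairs, and the model learns to predict the next word from the preceding ones.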

In [17]:
X = []
Y = []
for seq in sequences:
    for i in range(1, len(seq)):
        input_seq = seq[:i]      # words 0 to i-1
        next_word = seq[i]       # next word index
        X.append(input_seq)
        Y.append(next_word)
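
As a small worked example of the loop above, the sketch below expands one short sequence into its prefix/target pairs. `make_pairs` is a hypothetical helper, and the toy sequence reuses the first four indices of the example quote.

```python
def make_pairs(seq):
    """Return (prefix, next_word) pairs for one tokenized quote."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

toy_seq = [713, 62, 29, 19]  # first four indices of the example sequence
for prefix, target in make_pairs(toy_seq):
    print(prefix, '->', target)
```

A sequence with a single word produces no pairs, since there is no next word to predict.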

In [18]:
# Find maximum sequence length (we need this for padding later)
max_length = max(len(seq) for seq in sequences)
print('Max sequence length:', max_length)

Max sequence length: 746


## 5. Padding

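All input sequences must have the same length. We pad shorter sequences with zeros at the beginning (`padding='pre'`), so the most recent words always sit at the end of each row. Index 0 is safe as padding because the tokenizer starts word indices at 1.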

In [19]:
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences
X_padded = pad_sequences(X, maxlen=max_length, padding='pre')
print('X_padded shape:', X_padded.shape)

X_padded shape: (85271, 746)
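
The following NumPy sketch illustrates the result that pre-padding produces on a few short prefixes. It is an illustration under the assumption of default settings (zeros as padding, truncation from the front), not the Keras implementation; `pre_pad` is a hypothetical name.

```python
import numpy as np

def pre_pad(seqs, maxlen):
    """Left-pad each sequence with zeros; keep only the last `maxlen` items if it is longer."""
    out = np.zeros((len(seqs), maxlen), dtype='int32')
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

print(pre_pad([[713], [713, 62], [713, 62, 29]], maxlen=4))
```

With `max_length = 746`, most rows of `X_padded` consist mainly of leading zeros, because typical prefixes are far shorter than the longest quote.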


## 6. Vectorization (one-hot encoding with sklearn)

Targets are word indices. We one-hot encode them so the model predicts a probability over `vocab_size` classes. We use **sklearn.preprocessing.OneHotEncoder**.

In [20]:
from sklearn.preprocessing import OneHotEncoder

# Turn Y into a numpy array and make it 2D (one column) - sklearn needs 2D input
y = np.array(Y)
y = y.reshape(-1, 1)

# Tokenizer uses 1,2,3,... for words. One-hot needs 0,1,2,... so we subtract 1
y = y - 1
y = y.astype(int)

# Tell OneHotEncoder we have exactly vocab_size classes (0 to vocab_size-1)
# sparse_output=False means we get a normal array (not sparse) for Keras
encoder = OneHotEncoder(categories=[list(range(vocab_size))], sparse_output=False)
y_encoded = encoder.fit_transform(y)

print('y_encoded shape:', y_encoded.shape)

y_encoded shape: (85271, 10000)
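
The shape above has a memory cost worth checking. The sketch below estimates it, assuming the encoder returns `float64` values (the scikit-learn default), and shows the same encoding done by hand on a toy label array. The toy labels and the class count of 5 are made up for illustration.

```python
import numpy as np

num_samples, vocab_size = 85271, 10000   # shape printed above

# Dense float64 one-hot matrix: 8 bytes per entry
dense_bytes = num_samples * vocab_size * np.dtype('float64').itemsize
# Integer labels: one int32 per sample
label_bytes = num_samples * np.dtype('int32').itemsize
print(f'one-hot: {dense_bytes / 1e9:.2f} GB, integer labels: {label_bytes / 1e6:.2f} MB')

# One-hot encoding a few 0-based labels by hand (5 classes for readability)
labels = np.array([0, 3, 1])
one_hot = np.zeros((labels.size, 5))
one_hot[np.arange(labels.size), labels] = 1.0
print(one_hot)
```

If memory becomes a problem, one alternative (not used in this notebook) is to keep the integer labels and compile the model with Keras's `sparse_categorical_crossentropy` loss, which accepts class indices directly.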


## 7. Model building

We use **Embedding** → **RNN (SimpleRNN or LSTM)** → **Dense(softmax)**. The output is a probability distribution over the vocabulary.

In [21]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Embedding
from tensorflow.keras.optimizers import Adam

In [22]:
# Hyperparameters
embedding_dim = 50
rnn_units = 128

In [23]:
# Simple RNN model
rnn_model = Sequential()
rnn_model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
rnn_model.add(SimpleRNN(units=rnn_units))
rnn_model.add(Dense(units=vocab_size, activation='softmax'))
rnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
rnn_model.summary()



In [24]:
# LSTM model (usually better for longer sequences)
lstm_model = Sequential()
lstm_model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
lstm_model.add(LSTM(units=rnn_units))
lstm_model.add(Dense(units=vocab_size, activation='softmax'))
lstm_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
lstm_model.summary()
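
The `summary()` output is not shown here, so the sketch below estimates the parameter counts by hand from the hyperparameters above. It uses the standard formulas for these layer types with biases enabled, which is an assumption about the defaults rather than a copy of the recorded output.

```python
vocab_size, embedding_dim, rnn_units = 10000, 50, 128

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim
# SimpleRNN: input weights + recurrent weights + bias
simple_rnn_params = rnn_units * (embedding_dim + rnn_units + 1)
# LSTM: four weight sets (three gates plus the candidate cell state)
lstm_params = 4 * simple_rnn_params
# Dense softmax layer: weights + bias
dense_params = rnn_units * vocab_size + vocab_size

print('RNN total :', embedding_params + simple_rnn_params + dense_params)
print('LSTM total:', embedding_params + lstm_params + dense_params)
```

In both models the softmax layer accounts for most of the parameters, so the vocabulary size dominates the model size.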

## 8. Training

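Train the RNN and LSTM on the padded inputs and one-hot targets. `validation_split=0.1` holds out the last 10% of the samples for validation. The RNN runs for a fixed 10 epochs; the LSTM runs for up to 100 epochs with early stopping.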

In [25]:
epochs = 10
batch_size = 128

rnn_history = rnn_model.fit(
    X_padded, y_encoded,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.1
)

Epoch 1/10
600/600 ━━━━━━━━━━━━━━━━━━━━ 47s 70ms/step - accuracy: 0.0309 - loss: 7.4237 - val_accuracy: 0.0371 - val_loss: 6.8304
Epoch 2/10
600/600 ━━━━━━━━━━━━━━━━━━━━ 38s 63ms/step - accuracy: 0.0389 - loss: 6.5658 - val_accuracy: 0.0446 - val_loss: 6.8694
Epoch 3/10
600/600 ━━━━━━━━━━━━━━━━━━━━ 38s 63ms/step - accuracy: 0.0500 - loss: 6.4091 - val_accuracy: 0.0517 - val_loss: 6.8268
Epoch 4/10
600/600 ━━━━━━━━━━━━━━━━━━━━ 38s 63ms/step - accuracy: 0.0611 - loss: 6.2169 - val_accuracy: 0.0725 - val_loss: 6.6972
Epoch 5/10
600/600 ━━━━━━━━━━━━━━━━━━━━ 38s 63ms/step - accuracy: 0.0791 - loss: 6.2042 - val_accuracy: 0.0874 - val_loss: 6.5122
Epoch 6/10
600/600 ━━━━━━━━━━━━━━━━━━━━ 38s 63ms/step - accuracy: 0.0837 - loss: 6.3247 - val_accuracy: 0.0884 - val_loss: 6.5105
Epoch 7/10
...

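Two numbers help when reading the log above: the `600` steps per epoch and the loss a model would get by guessing uniformly over the vocabulary. The sketch below derives both. It assumes the training share is the first `floor(n * (1 - validation_split))` samples, which matches the step count in the log but is stated here as an assumption.

```python
import math

num_samples, batch_size, validation_split, vocab_size = 85271, 128, 0.1, 10000

# Assumption: the training share is the first floor(n * (1 - split)) samples
train_samples = math.floor(num_samples * (1 - validation_split))
steps_per_epoch = math.ceil(train_samples / batch_size)
# Cross-entropy of a uniform guess over the vocabulary
uniform_loss = math.log(vocab_size)

print(train_samples, steps_per_epoch, round(uniform_loss, 2))
```

The first-epoch training loss of about 7.42 is already below the uniform-guess value, which is consistent with the model quickly learning to favor frequent words.
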
In [26]:
from tensorflow.keras.callbacks import EarlyStopping

# Stop if validation loss does not improve for 5 epochs
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

epochs = 100
batch_size = 128

lstm_history = lstm_model.fit(
    X_padded, y_encoded,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.1,
    callbacks=[early_stopping]
)

Epoch 1/100
600/600 ━━━━━━━━━━━━━━━━━━━━ 40s 55ms/step - accuracy: 0.0385 - loss: 7.1046 - val_accuracy: 0.0460 - val_loss: 6.6788
Epoch 2/100
600/600 ━━━━━━━━━━━━━━━━━━━━ 32s 54ms/step - accuracy: 0.0554 - loss: 6.3559 - val_accuracy: 0.0621 - val_loss: 6.5530
Epoch 3/100
600/600 ━━━━━━━━━━━━━━━━━━━━ 32s 54ms/step - accuracy: 0.0746 - loss: 6.0966 - val_accuracy: 0.0874 - val_loss: 6.4747
Epoch 4/100
600/600 ━━━━━━━━━━━━━━━━━━━━ 32s 54ms/step - accuracy: 0.0951 - loss: 5.8555 - val_accuracy: 0.0951 - val_loss: 6.4527
Epoch 5/100
600/600 ━━━━━━━━━━━━━━━━━━━━ 33s 55ms/step - accuracy: 0.1068 - loss: 5.6824 - val_accuracy: 0.0994 - val_loss: 6.4403
Epoch 6/100
600/600 ━━━━━━━━━━━━━━━━━━━━ 32s 54ms/step - accuracy: 0.1179 - loss: 5.4850 - val_accuracy: 0.1026 - val_loss: 6.4469
Epoch 7/10
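
To show how `patience=5` interacts with the validation losses logged above, here is a simplified pure-Python sketch of the stopping rule. It only mirrors the idea of the callback (stop after `patience` epochs without a new best `val_loss`), it is not the Keras code, and `early_stopping_epoch` is a hypothetical name.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), 1-based; stop_epoch is None if training never stops early."""
    best_loss = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

logged = [6.6788, 6.5530, 6.4747, 6.4527, 6.4403, 6.4469]  # val_loss, epochs 1-6 above
print(early_stopping_epoch(logged, patience=5))
```

On the six logged epochs the best validation loss is at epoch 5 and the rule has not triggered yet. With `restore_best_weights=True`, the weights from the best epoch are the ones kept once training does stop.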

## 9. Save models and artifacts

Save the trained models **and** the tokenizer, encoder, and max_length so you can load them later for prediction without retraining.

In [27]:
# Save Keras models
lstm_model.save('/content/drive/MyDrive/lstm_model.h5')
rnn_model.save('/content/drive/MyDrive/rnn_model.h5')



In [28]:
import pickle

# Save tokenizer - we need it later to turn new text into number sequences
with open('/content/drive/MyDrive/tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# Save encoder - the sklearn one-hot encoder (in case we need it again)
with open('/content/drive/MyDrive/encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)

# Build index_to_word: reverse the word -> index mapping
index_to_word = {index: word for word, index in word_index.items()}

# Save max_length, vocab_size, and index_to_word together in config
config = {
    'max_length': max_length,
    'vocab_size': vocab_size,
    'index_to_word': index_to_word,
}
with open('/content/drive/MyDrive/config.pkl', 'wb') as f:
    pickle.dump(config, f)

print('Saved: lstm_model.h5, rnn_model.h5, tokenizer.pkl, encoder.pkl, config.pkl')

Saved: lstm_model.h5, rnn_model.h5, tokenizer.pkl, encoder.pkl, config.pkl


## 10. Prediction: next-word and sentence generation

We need the reverse mapping (index → word) and a predictor function. In a new session, load the model, the tokenizer, and the config (`max_length`, `index_to_word`) from the saved files instead of rebuilding them.
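
Loading the pickled artifacts back could look roughly like the sketch below. It round-trips a stand-in config through a temporary directory so it runs anywhere; `save_pickle`, `load_pickle`, and the toy values are assumptions for illustration, and the real files live under the Drive path used in Section 9.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path with pickle."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in config with the same keys as config.pkl (toy values, not the real ones)
toy_config = {'max_length': 4, 'vocab_size': 6, 'index_to_word': {1: 'the', 2: 'you'}}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, 'config.pkl')
    save_pickle(toy_config, config_path)
    restored = load_pickle(config_path)

print(restored['max_length'], restored['index_to_word'][1])

# In a real session, load_pickle would read tokenizer.pkl and config.pkl,
# and tf.keras.models.load_model(...) would restore the saved .h5 model.
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.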

In [29]:
# Build index_to_word: given an index (number), get the word
# When we load from file we get this from config.pkl; here we build it from word_index
index_to_word = {}
for word, index in word_index.items():
    index_to_word[index] = word
print('Example indices to words:', list(index_to_word.items())[:5])

Example indices to words: [(1, 'the'), (2, 'you'), (3, 'to'), (4, 'and'), (5, 'a')]


In [30]:
def predictor(model, tokenizer, text, max_len):
    # Apply the same cleaning as in training: lowercase and strip ASCII punctuation
    # (uses the global `translator` from Section 2)
    text = text.lower().translate(translator)
    seq = tokenizer.texts_to_sequences([text])[0]
    seq_padded = pad_sequences([seq], maxlen=max_len, padding='pre')
    # The model outputs one probability per class; take the most likely class
    pred = model.predict(seq_padded, verbose=0)
    pred_index = int(np.argmax(pred[0]))  # 0-based class index
    # Targets were shifted by -1 before one-hot encoding, so shift back to the
    # tokenizer's 1-based index; return '' if the index has no word
    return index_to_word.get(pred_index + 1, '')

In [31]:
# Test: predict next word for "life is"
seed_text = "life is"
next_word = predictor(lstm_model, tokenizer, seed_text, max_length)
print('Seed:', seed_text)
print('Next word:', next_word)

Seed: life is
Next word: a


In [32]:
def generate_text(model, tokenizer, seed_text, max_len, num_words):
    # Add one word at a time, num_words times
    for i in range(num_words):
        next_word = predictor(model, tokenizer, seed_text, max_len)
        if next_word == '':
            break
        seed_text = seed_text + ' ' + next_word
    return seed_text

In [33]:
# Generate 10 more words from "life is"
seed_text = "life is"
generated = generate_text(lstm_model, tokenizer, seed_text, max_length, 10)
print('Generated:', generated)

Generated: life is a world is the world is the world is the
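
The generated text falls into a loop ("world is the world is the") because `predictor` always takes the single most likely word. One common alternative, which this notebook does not use, is to sample from the predicted distribution with a temperature. The sketch below shows the idea on a toy distribution; `sample_index` and `toy_probs` are hypothetical names.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Sample a class index from `probs` after rescaling by `temperature`."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype='float64') + 1e-12) / temperature
    logits -= logits.max()          # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

toy_probs = [0.1, 0.7, 0.2]
rng = np.random.default_rng(0)
print([sample_index(toy_probs, temperature=1.0, rng=rng) for _ in range(10)])
```

A temperature close to 0 behaves like `argmax`, while higher values give more varied output. If this replaced `np.argmax` in `predictor`, the sampled index would still need the `+ 1` shift back to the tokenizer's 1-based indices.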
