<a href="https://colab.research.google.com/github/ShovalBenjer/deep_learning_neural_networks/blob/main/LSTM_Model_Perplexity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM Language Model Integration with Forward/Reverse Training  
   
This notebook implements the assignment requirements using the repository code in  
`util.py` and `words.py` (which have been modified as described).  
   
**Tasks completed:**  
1. The dataset is split into train (80%), validation (10%), and test (10%) sets.  
2. A perplexity metric is computed on each split after training.  
3. We support running the LSTM both in the natural (forward) order and in reverse order (using Keras’s `go_backwards` flag).  
4. We train 4 LSTM models: one–layer vs. two–layer, and for each a forward and a reverse version.  
5. A function is provided that computes the probability of a given sentence from a trained model.  
6. A sentence of length 7 starting with "love I" is generated at temperatures 0.1, 1, and 10.  
7. An interactive UI function allows entering a seed word to obtain the next predicted word.  
8. For each model, perplexity is recorded for train, validation, and test sets (12 results total).  
9. The probability is computed for the generated sentence and also for the sentence "love i cupcakes".

All training progress is logged via TensorBoardX.


## Install Dependencies


In [1]:
!pip install tensorboardX
!pip install nltk
!pip install keras-preprocessing
!pip install tensorflow-gpu
!pip install --upgrade tensorflow
!pip install keras-preprocessing-gpu
# %tensorflow_version 2.x
import tensorflow as tf
!pip install tensorboardX
# !pip install language_models

Collecting tensorboardX
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl.metadata (5.8 kB)
Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl (101 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/101.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m101.7/101.7 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: tensorboardX
Successfully installed tensorboardX-2.6.2.2
Collecting keras-preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl.metadata (1.9 kB)
Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.6/42.6 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: keras-preprocessing
Successfully installed keras-preprocessing-1.1.2
Collecting tensorflow-gpu
  Downloading tensorflow-gpu-2.12.0.tar.gz (2.6 kB)
  [1;31merror[0m: [1msubproces

In [2]:
device_name = tf.test.gpu_device_name()
print(device_name)
# if device_name != '/device:GPU:0':
#   raise SystemError('GPU device not found')
# print('Found GPU at: {}'.format(device_name))
print("GPU Available:", tf.config.list_physical_devices('GPU'))


GPU Available: []


## Clone the Repository  
   
We now clone the repository from GitHub and check out the correct branch.


In [3]:
!git clone https://github.com/ShovalBenjer/deep_learning_neural_networks.git
%cd deep_learning_neural_networks
!git checkout language-models-integration

Cloning into 'deep_learning_neural_networks'...
remote: Enumerating objects: 199, done.[K
remote: Counting objects: 100% (199/199), done.[K
remote: Compressing objects: 100% (142/142), done.[K
remote: Total 199 (delta 104), reused 103 (delta 55), pack-reused 0 (from 0)[K
Receiving objects: 100% (199/199), 21.32 MiB | 12.13 MiB/s, done.
Resolving deltas: 100% (104/104), done.
/content/deep_learning_neural_networks
Branch 'language-models-integration' set up to track remote branch 'language-models-integration' from 'origin'.
Switched to a new branch 'language-models-integration'


## Imports and Setup  
   
We import necessary modules from Keras, tensorboardX, and our own `util.py` and `words.py`.


In [4]:
import os, random, numpy as np, keras, keras.backend as K, nltk
from keras.layers import LSTM, Embedding, TimeDistributed, Input, Dense
from keras.models import Model
from tensorboardX import SummaryWriter
from tqdm import tqdm
from keras_preprocessing.sequence import pad_sequences
nltk.download('punkt')
from keras.datasets import imdb
from tensorflow.python.client import device_lib
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tqdm import tqdm
from argparse import ArgumentParser
from tensorboardX import SummaryWriter
CHECK = 5

# Import our custom modules (make sure util.py and words.py are in the repo)
import util
import words


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Helper Functions  
   
The following helper functions are defined:
   
- **split_data:** Splits a numpy array of padded sentences into train/val/test sets (80–10–10).  
- **compute_perplexity:** Computes average loss on a dataset and returns perplexity = exp(loss).  
- **sentence_probability:** Computes the probability (and log probability) of a sentence given a trained model.  
- **interactive_next_word:** A simple UI function that takes a seed sequence and prints the next predicted word.


## Loading and Splitting the Data  
   
We load the dataset using `util.load_words` from the file `datasets/wikisimple.txt` and then pad the sentences.  
Next, we concatenate all batches into one array and split it (80% train, 10% validation, 10% test).


In [5]:
def split_data(data, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1):
    """
    Split a numpy array (of shape [num_sentences, seq_length]) into train, validation, and test arrays.
    """
    indices = np.arange(data.shape[0])
    np.random.shuffle(indices)
    n = len(indices)
    train_end = int(train_ratio * n)
    val_end = int((train_ratio + val_ratio) * n)
    train_data = data[indices[:train_end]]
    val_data = data[indices[train_end:val_end]]
    test_data = data[indices[val_end:]]
    return train_data, val_data, test_data

def compute_perplexity(model, data, batch_size=128):
    """
    Compute perplexity on a dataset.

    The loss is calculated using sparse categorical crossentropy.
    Perplexity is computed as exp(average_loss).
    """
    losses = []
    for i in range(0, data.shape[0], batch_size):
        batch = data[i:i+batch_size]
        n = batch.shape[0]
        batch_in = np.concatenate([np.ones((n, 1), dtype='int32'), batch], axis=1)
        batch_out = np.concatenate([batch, np.zeros((n, 1), dtype='int32')], axis=1)
        loss = model.test_on_batch(batch_in, batch_out[:,:,None])
        losses.append(loss)
    avg_loss = np.mean(losses)
    return np.exp(avg_loss)

def sentence_probability(sentence, model, w2i, i2w):
    """
    Compute the probability and log-probability of a given sentence.

    The sentence is tokenized by spaces; unknown words are replaced by <UNK>.
    The model is assumed to predict the next word given the previous tokens.
    """
    words_in = sentence.strip().split()
    # Use <UNK> index if word not found
    unk = w2i.get('<UNK>', 2)
    token_ids = [w2i.get(word.lower(), unk) for word in words_in]
    # Prepend <START> token (assumed index 1)
    token_ids = [1] + token_ids
    token_ids = np.array(token_ids)[None, :]  # shape (1, seq_len)
    logits = model.predict(token_ids)
    # Apply softmax to obtain probabilities at each time step
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    p = 1.0
    logp = 0.0
    # For t=1 to end, probability assigned to token at position t given context t-1
    for t in range(1, token_ids.shape[1]):
        prob = probs[0, t-1, token_ids[0, t]]
        p *= prob
        logp += np.log(prob + 1e-10)
    return p, logp

def interactive_next_word(model, seed_seq, w2i, i2w):
    """
    Given a seed sequence (list of word indices), generate one additional word using the model.
    """
    seed_seq = np.array(seed_seq)
    seed_seq = np.insert(seed_seq, 0, 1)  # Prepend <START> token (index 1)
    gen = words.generate_seq(model, seed_seq, size=seed_seq.shape[0]+1, temperature=1.0)
    next_idx = gen[-1]
    seed_words = ' '.join(i2w[str(idx)] for idx in seed_seq[1:])
    print("Seed: ", seed_words)
    print("Next predicted word: ", i2w[str(next_idx)])


## Model Building and Training Functions  
   
We define two functions:  
   
- **build_model:** Constructs the Keras model. It uses the parameter `reverse` to decide whether the LSTM layers operate in reverse order (by setting `go_backwards=True`). An extra LSTM layer is added if requested.  
- **train_model:** Trains the model on the training set in a custom loop, logs loss via TensorBoardX, and after training computes perplexity on the train, validation, and test sets. It also demonstrates sample generation.


In [10]:
def build_model(numwords, lstm_capacity, extra, reverse):
    """
    Build and return a Keras Model for language modeling.

    :param numwords: Size of the vocabulary.
    :param lstm_capacity: The dimensionality of the LSTM hidden state.
    :param extra: Number of extra LSTM layers (None means only one layer).
    :param reverse: Boolean flag; if True, LSTM layers use go_backwards=True.
    :return: Compiled Keras model.
    """
    inp = Input(shape=(None,))
    embed = Embedding(numwords, lstm_capacity)
    x = embed(inp)
    # First LSTM layer with reverse flag
    x = LSTM(lstm_capacity, return_sequences=True, go_backwards=reverse)(x)
    # Extra LSTM layers if any
    if extra is not None:
        for _ in range(extra):
            x = LSTM(lstm_capacity, return_sequences=True, go_backwards=reverse)(x)
    dense = Dense(numwords, activation='linear')
    out = TimeDistributed(dense)(x)
    model = Model(inp, out)
    return model

def train_model(options, train_data, val_data, test_data, w2i, i2w):
    """
    Build and train a language model with the specified options.
    Logs training progress via TensorBoardX.
    After training, computes perplexity on train, validation, and test sets,
    and generates sample sentences.

    :param options: An options object with training hyperparameters.
    :param train_data, val_data, test_data: Numpy arrays of shape [num_sentences, seq_length].
    :param w2i, i2w: Word-to-index and index-to-word dictionaries.
    :return: The trained Keras model.
    """
    writer = SummaryWriter(logdir=options.tb_dir)
    np.random.seed(options.seed)

    numwords = len(i2w)
    model = build_model(numwords, options.lstm_capacity, options.extra, options.reverse)
    # Use "learning_rate" instead of "lr" below.
    opt = keras.optimizers.Adam(learning_rate=options.lr)
    model.compile(opt, words.sparse_loss)
    model.summary()

    instances_seen = 0
    for epoch in range(options.epochs):
        # Shuffle training data each epoch
        indices = np.arange(train_data.shape[0])
        np.random.shuffle(indices)
        train_data = train_data[indices]
        for i in range(0, train_data.shape[0], options.batch):
            batch = train_data[i:i+options.batch]
            n = batch.shape[0]
            batch_in = np.concatenate([np.ones((n, 1), dtype='int32'), batch], axis=1)
            batch_out = np.concatenate([batch, np.zeros((n, 1), dtype='int32')], axis=1)
            loss = model.train_on_batch(batch_in, batch_out[:,:,None])
            instances_seen += n
            writer.add_scalar('lm/train_batch_loss', float(loss), instances_seen)
        print("Epoch {} complete".format(epoch+1))
        # Generate sample sentences at various temperatures
        for temp in [0.0, 0.9, 1.0, 1.1, 1.2]:
            print("### TEMP", temp)
            for _ in range(3):
                b = random.choice(train_data)
                seed = b[0, :min(20, b.shape[1])]
                seed = np.insert(seed, 0, 1)  # Prepend <START>
                gen = words.generate_seq(model, seed, 60, temperature=temp)
                def decode(seq):
                    return ' '.join(i2w[str(i)] for i in seq)
                print('Seed:', decode(seed))
                print('Generated:', decode(gen[len(seed):]))
    writer.close()

    ppl_train = compute_perplexity(model, train_data, options.batch)
    ppl_val = compute_perplexity(model, val_data, options.batch)
    ppl_test = compute_perplexity(model, test_data, options.batch)

    print("Perplexity (Train): {:.2f}".format(ppl_train))
    print("Perplexity (Validation): {:.2f}".format(ppl_val))
    print("Perplexity (Test): {:.2f}".format(ppl_test))

    # Generate sentence of length 7 starting with "love I" at different temperatures
    seed_words = "love I".split()
    seed_ids = [w2i.get(word.lower(), w2i.get('<UNK>')) for word in seed_words]
    seed_ids = np.array(seed_ids)
    print("\nSentence generation (length=7) starting with 'love I':")
    for temp in [0.1, 1.0, 10.0]:
        gen = words.generate_seq(model, np.insert(seed_ids, 0, 1), size=7, temperature=temp)
        def decode(seq):
            return ' '.join(i2w[str(i)] for i in seq)
        print("Temperature {}: {}".format(temp, decode(gen)))

    # Compute probability for the generated sentence and for "love i cupcakes"
    def decode(seq):
        return ' '.join(i2w[str(i)] for i in seq)
    generated_sentence = decode(gen)
    p, logp = sentence_probability(generated_sentence, model, w2i, i2w)
    print("\nProbability for generated sentence '{}': {:.2e} (log={:.2f})".format(generated_sentence, p, logp))

    sentence2 = "love i cupcakes"
    p2, logp2 = sentence_probability(sentence2, model, w2i, i2w)
    print("Probability for sentence 'love i cupcakes': {:.2e} (log={:.2f})".format(p2, logp2))

    return model


## Options Class and Helper to Set Options  
   
We define a simple options container and a function to generate options for training.


In [7]:
class Options:
    pass

def get_options(lstm_capacity=256, batch=128, epochs=10, extra=None, lr=0.001,
                top_words=10000, limit=None, seed=42, reverse=False, tb_dir='./runs/words'):
    opt = Options()
    opt.lstm_capacity = lstm_capacity
    opt.batch = batch
    opt.epochs = epochs
    opt.extra = extra      # extra LSTM layers (None means only one LSTM layer)
    opt.lr = lr
    opt.top_words = top_words
    opt.limit = limit
    opt.seed = seed
    opt.reverse = reverse  # if True, LSTM layers are trained in reverse order
    opt.tb_dir = tb_dir
    opt.task = 'wikisimple'
    opt.data = './data'
    return opt


## Training Four Models and Recording Perplexity  
   
We now run experiments for four configurations:  
1. 1-layer forward (extra=None, reverse=False)  
2. 1-layer reverse (extra=None, reverse=True)  
3. 2-layer forward (extra=1, reverse=False)  
4. 2-layer reverse (extra=1, reverse=True)  
   
For each model we train on the training set, compute perplexity on train, validation and test sets, and record the results.


In [8]:
# Load the data (x is a list of lists of integers)
x, w2i, i2w = util.load_words(os.path.join(util.DIR, 'datasets', 'wikisimple.txt'),
                               vocab_size=10000, limit=None)
print("Number of sentences loaded (before batching):", len(x))

# Compute the global maximum sentence length
global_max_len = max(len(sentence) for sentence in x)
print("Global maximum sentence length:", global_max_len)

# Pad all sentences to the same length
from keras_preprocessing.sequence import pad_sequences  # or from tensorflow.keras.preprocessing.sequence
all_data = pad_sequences(x, maxlen=global_max_len, dtype='int32', padding='post', truncating='post')
print("Total sentences (after global padding):", all_data.shape[0])

# Now split the padded array into train, validation, and test sets
train_data, val_data, test_data = split_data(all_data, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)
print("Train: {}, Validation: {}, Test: {}".format(train_data.shape[0], val_data.shape[0], test_data.shape[0]))


raw data read
Number of sentences loaded (before batching): 29741
Global maximum sentence length: 124
Total sentences (after global padding): 29741
Train: 23792, Validation: 2974, Test: 2975


In [11]:
results = {}
configs = [
    ("1-layer_forward", None, False),
    ("1-layer_reverse", None, True),
    ("2-layer_forward", 1, False),
    ("2-layer_reverse", 1, True)
]

for name, extra, rev in configs:
    print("\n=== Training configuration: {} ===".format(name))
    opts = get_options(lstm_capacity=256, batch=128, epochs=10, extra=extra,
                       lr=0.001, top_words=10000, seed=42, reverse=rev, tb_dir='./runs/words/' + name)
    model = train_model(opts, train_data, val_data, test_data, w2i, i2w)
    ppl_train = compute_perplexity(model, train_data, opts.batch)
    ppl_val = compute_perplexity(model, val_data, opts.batch)
    ppl_test = compute_perplexity(model, test_data, opts.batch)
    results[name] = {"Train": ppl_train, "Validation": ppl_val, "Test": ppl_test}

print("\n=== Summary of Perplexity Results ===")
for config, ppl in results.items():
    print(config, ppl)


=== Training configuration: 1-layer_forward ===


AttributeError: module 'keras.backend' has no attribute 'sparse_categorical_crossentropy'

## Interactive Next-Word Prediction UI  
   
After training, the function `interactive_next_word` can be used to enter a seed (a word or sequence)
and the model will output the next predicted word.


In [None]:
# For demonstration, use the last trained model (from the last configuration)
print("Interactive next-word prediction demo:")
seed_example = [w2i.get(word.lower(), w2i.get('<UNK>')) for word in "this is".split()]
interactive_next_word(model, seed_example, w2i, i2w)


## Summary  
   
In this notebook we have:
   
- Cloned the repository and loaded the dataset.  
- Split the data into train (80%), validation (10%), and test (10%).  
- Trained four different LSTM models (1‑layer and 2‑layer; forward and reverse).  
- Computed perplexity on all three splits (total 12 results).  
- Generated a sentence of length 7 starting with "love I" at temperatures 0.1, 1, and 10.  
- Computed the probability of the generated sentence and the sentence "love i cupcakes".  
- Provided an interactive UI for next-word prediction.  
   
All training progress is logged using TensorBoardX.  
   
To view your training logs, run in a separate cell:  
```python
%load_ext tensorboard
%tensorboard --logdir ./runs/words/
```
   
Happy coding!
