<a href="https://colab.research.google.com/github/ShovalBenjer/deep_learning_neural_networks/blob/main/LSTM_Model_Perplexity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM Language Model Integration with Forward/Reverse Training  
   
This notebook implements the assignment requirements using the repository code in  
`util.py` and `words.py` (which have been modified as described).  
   
**Tasks completed:**  
1. The dataset is split into train (80%), validation (10%), and test (10%) sets.  
2. A perplexity metric is computed on each split after training.  
3. We support running the LSTM both in the natural (forward) order and in reverse order (in this notebook, by flipping each batch along the time axis with `np.flip` before it is fed to the model).  
4. We train 4 LSTM models: one–layer vs. two–layer, and for each a forward and a reverse version.  
5. A function is provided that computes the probability of a given sentence from a trained model.  
6. A sentence of length 7 starting with "love I" is generated at temperatures 0.1, 1, and 10.  
7. An interactive UI function allows entering a seed word to obtain the next predicted word.  
8. For each model, perplexity is recorded for train, validation, and test sets (12 results total).  
9. The probability is computed for the generated sentence and also for the sentence "love i cupcakes".

All training progress is logged via TensorBoardX.


## Install Dependencies


In [None]:
#!/usr/bin/env python
"""
Cell 1: Package Installation and Repository Cloning

This cell installs the required packages, clones the `language_models` repository,
and performs necessary module remapping for compatibility.
"""

# Install TensorFlow and other required packages.
# The standard `tensorflow` package already includes GPU support; the separate
# `tensorflow-gpu` package is deprecated and should not be installed.
!pip install tensorflow
!pip install tensorboardX

# Clone the language_models repository
!git clone https://github.com/GuyKabiri/language_models

# Import TensorFlow and SciPy, and remap certain modules for compatibility
import sys
import scipy.special  # import the submodule explicitly; `import scipy` alone may not load it
import tensorflow as tf
import language_models

# Alias legacy module paths to TensorFlow's Keras and SciPy's special functions
sys.modules['keras.preprocessing.text'] = tf.keras.preprocessing.text
sys.modules['scipy.misc'] = scipy.special

"""
Imports for Data Processing, Modeling, and Utilities

This cell imports libraries required for data manipulation, model definition,
training, and evaluation, along with a utility module from the cloned repository.
"""

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import LSTM, Dense, Embedding, Input, TimeDistributed
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorboardX import SummaryWriter
import random
from tqdm import tqdm
from language_models import util


In [None]:
device_name = tf.test.gpu_device_name()
print(device_name)
# if device_name != '/device:GPU:0':
#   raise SystemError('GPU device not found')
# print('Found GPU at: {}'.format(device_name))
print("GPU Available:", tf.config.list_physical_devices('GPU'))

In [None]:
"""
Cell 3: Sequence Generation Function

Defines a function to generate a sequence from a trained model using a seed.
The function pads the seed to the desired size and samples subsequent tokens
based on model predictions and a temperature parameter.

Args:
    model (Model): Trained TensorFlow Keras model.
    seed (np.ndarray): Starting sequence tokens.
    size (int): Total length of sequence to generate.
    temperature (float): Sampling temperature for randomness (default: 1.0).

Returns:
    List[int]: Generated sequence as a list of token IDs.
"""
def generate_seq(model: Model, seed, size, temperature=1.0):
    ls = seed.shape[0]
    # Zero-pad the seed to the full length; the whole sequence is fed to the model at each step
    tokens = np.concatenate([seed, np.zeros(size - ls)])
    for i in range(ls, size):
        # The model outputs logits (not probabilities) for every position
        logits = model.predict(tokens[None, :], verbose=0)
        # Position i-1 holds the prediction for token i; sample it with temperature scaling
        next_token = util.sample_logits(logits[0, i-1, :], temperature=temperature)
        tokens[i] = next_token.item() if isinstance(next_token, np.ndarray) else next_token
    return [int(t) for t in tokens]
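
The effect of the temperature parameter can be illustrated outside the model. The sketch below is a standalone NumPy version of temperature sampling; `sample_with_temperature` is a hypothetical helper written for illustration and is not the repository's `util.sample_logits`, whose exact implementation is not shown in this notebook.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token ID from unnormalized logits after temperature scaling."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()      # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()        # softmax over the vocabulary
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.1])
rng = np.random.default_rng(0)
low = [sample_with_temperature(logits, 0.01, rng) for _ in range(200)]
high = [sample_with_temperature(logits, 10.0, rng) for _ in range(200)]
print("distinct tokens at T=0.01:", sorted(set(low)))
print("distinct tokens at T=10:  ", sorted(set(high)))
```

A low temperature sharpens the distribution toward the highest-scoring token, while a high temperature flattens it toward uniform sampling.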


In [None]:
"""
Cell 4: Loss and Decoding Functions

Defines a sparse loss function wrapper and a decode function to convert token sequences
into a human-readable string using a global index-to-word mapping.

Functions:
    sparse_loss: Computes sparse categorical crossentropy loss.
    decode: Converts a sequence of token IDs into a space-separated string.
"""
def sparse_loss(y_true, y_pred):
    return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)

def decode(seq):
    # `i2w` is the global index-to-word mapping created in the data loading cell
    return ' '.join(i2w[idx] for idx in seq)


In [None]:
"""
Cell 5: Configuration Setup

Defines a configuration class `Args` for hyperparameters and file paths, sets up
the TensorBoard SummaryWriter, and initializes the random seed.

Attributes:
    epochs (int): Number of epochs for training.
    embedding_size (int): Size of word embeddings.
    out_every (int): Frequency (in epochs) to output logs.
    lr (float): Learning rate.
    batch (int): Batch size.
    task (str): Task identifier (e.g., 'wikisimple').
    data (str): Data file path.
    lstm_capacity (int): LSTM layer capacity.
    max_length: Maximum sentence length.
    top_words (int): Vocabulary size.
    limit: Optional limit on data size.
    tb_dir (str): TensorBoard logging directory.
    seed (int): Random seed; negative value triggers random seeding.
    extra: Number of extra LSTM layers.
"""
class Args:
    epochs = 20
    embedding_size = 300
    out_every = 1
    lr = 0.001
    batch = 128
    task = 'wikisimple'
    data = './data'
    lstm_capacity = 256
    max_length = None
    top_words = 10000
    limit = None
    tb_dir = './runs/words'
    seed = -1
    extra = None

options = Args()
tbw = SummaryWriter(log_dir=options.tb_dir)

# Seed Python, NumPy, and TensorFlow; a negative seed means "pick one at random"
if options.seed < 0:
    options.seed = random.randint(0, 1000000)
    print('Random seed:', options.seed)
random.seed(options.seed)
np.random.seed(options.seed)
tf.random.set_seed(options.seed)


In [None]:
"""
Cell 6: Data Loading

Loads the dataset based on the specified task. For the 'wikisimple' task, the dataset
is loaded from a predefined path within the utility module. If 'file' is selected, it
loads data from the provided file path.

Returns:
    X: List of sequences (each sequence is a list of token IDs).
    w2i: Dictionary mapping words to their indices.
    i2w: Dictionary mapping indices to their words.

Raises:
    Exception: If the specified task is not recognized.
"""
if options.task == 'wikisimple':
    X, w2i, i2w = util.load_words(util.DIR + '/datasets/wikisimple.txt', vocab_size=options.top_words, limit=options.limit)
elif options.task == 'file':
    X, w2i, i2w = util.load_words(options.data, vocab_size=options.top_words, limit=options.limit)
else:
    raise Exception(f'Task {options.task} not recognized.')
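
To make the three return values concrete, here is a minimal sketch of how a word-level loader might build them. `build_vocab` and its special tokens are hypothetical and written only for illustration; the real `util.load_words` may reserve different IDs or tokenize differently. The only IDs taken from this notebook are 0 (padding) and 1 (start token), which the training code uses.

```python
from collections import Counter

def build_vocab(sentences, vocab_size, specials=('<pad>', '<start>', '<unk>')):
    """Map the most frequent words to integer IDs; everything else becomes <unk>."""
    counts = Counter(word for s in sentences for word in s.split())
    keep = [w for w, _ in counts.most_common(vocab_size - len(specials))]
    i2w = dict(enumerate(list(specials) + keep))   # index -> word
    w2i = {w: i for i, w in i2w.items()}           # word -> index
    unk = w2i['<unk>']
    X = [[w2i.get(word, unk) for word in s.split()] for s in sentences]
    return X, w2i, i2w

X_toy, w2i_toy, i2w_toy = build_vocab(["i love cupcakes", "i love tea"], vocab_size=5)
print(X_toy)
print(i2w_toy)
```

With a vocabulary of five entries, three of which are reserved, only the two most frequent words keep their own IDs and the rest map to `<unk>`.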


In [None]:
"""
Cell 7: Data Splitting

Splits the dataset into training, testing, and validation sets.
First, 80% of the data is allocated to training, and 20% is reserved for testing+validation.
Then, the test+validation set is split equally into test and validation sets.
"""
# The targets are derived from the inputs later, so only X needs to be split
X_train, X_rest = train_test_split(X, test_size=0.2, random_state=options.seed)
X_test, X_val = train_test_split(X_rest, test_size=0.5, random_state=options.seed)
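
The 80/10/10 proportions can be checked on toy data. The sketch below uses a plain NumPy permutation instead of scikit-learn's `train_test_split`, so it runs without extra dependencies; `split_80_10_10` is a hypothetical helper and will not produce the same shuffling as the cell above.

```python
import numpy as np

def split_80_10_10(items, seed=0):
    """Shuffle indices once, then cut them into 80% / 10% / 10% slices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(items))
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    train = [items[i] for i in order[:n_train]]
    val = [items[i] for i in order[n_train:n_train + n_val]]
    test = [items[i] for i in order[n_train + n_val:]]
    return train, val, test

sentences = [[i, i + 1] for i in range(50)]   # 50 toy "sentences"
train, val, test = split_80_10_10(sentences, seed=42)
print(len(train), len(val), len(test))
```

Because every index appears exactly once in the permutation, the three splits are disjoint and together cover the whole dataset.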


In [None]:
"""
Cell 8: Data Padding and Sequence Length Determination

Determines the maximum sequence length from the dataset and pads all batches
to ensure uniform sequence lengths using the utility function.
"""
x_max_len = max(len(seq) for seq in X)
numwords = len(i2w)
print('Max sequence length:', x_max_len)
print(numwords, 'distinct words')

# Group each split into padded batches, adding an end-of-sequence token to every sentence
X_train = util.batch_pad(X_train, options.batch, add_eos=True)
X_val = util.batch_pad(X_val, options.batch, add_eos=True)
X_test = util.batch_pad(X_test, options.batch, add_eos=True)
print(f'Finished data loading. {sum(b.shape[0] for b in X_train)} training sentences loaded')
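
The rest of the notebook treats each split as a list of 2-D arrays, one per batch. The sketch below shows one way such batches could be built; `pad_batches` is a hypothetical stand-in, and details such as sorting by length, the EOS ID, and where the EOS token is placed are assumptions rather than the behavior of `util.batch_pad`.

```python
import numpy as np

def pad_batches(seqs, batch_size, eos_id, pad_id=0):
    """Sort by length, append EOS, and right-pad each batch to its own max length."""
    seqs = sorted(seqs, key=len)   # similar lengths per batch means less padding
    batches = []
    for start in range(0, len(seqs), batch_size):
        chunk = [s + [eos_id] for s in seqs[start:start + batch_size]]
        width = max(len(s) for s in chunk)
        batch = np.full((len(chunk), width), pad_id, dtype=np.int64)
        for row, s in enumerate(chunk):
            batch[row, :len(s)] = s
        batches.append(batch)
    return batches

toy = [[5, 6], [7], [8, 9, 10], [11, 12, 13, 14]]
batches = pad_batches(toy, batch_size=2, eos_id=99)
for b in batches:
    print(b.shape)
    print(b)
```

Each batch has its own width, which is why the training loop reads `batch.shape` separately for every batch.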


In [None]:
"""
Cell 9: LSTM Model Creation Function

Creates an LSTM-based model for language modeling. Optionally adds extra LSTM layers
if specified. The model uses an Embedding layer, one or more LSTM layers, and a
TimeDistributed Dense layer to output logits for each time step.

Args:
    extra (int): Number of additional LSTM layers to add (default: None).
    lr (float): Learning rate for the optimizer.

Returns:
    model (Model): Compiled TensorFlow Keras model.
"""
def create_lstm_model(extra=None, lr=0.001):
    inp = Input(shape=(None,))
    # Sequences have variable length, so no fixed input length is set on the embedding
    embedded = Embedding(numwords, options.embedding_size)(inp)
    h = LSTM(options.lstm_capacity, return_sequences=True)(embedded)

    # Optionally stack extra LSTM layers on top of the first one
    for _ in range(extra or 0):
        h = LSTM(options.lstm_capacity, return_sequences=True)(h)

    # Linear activation: the model outputs logits, matching from_logits=True in sparse_loss
    dense_layer = Dense(numwords, activation='linear')
    out = TimeDistributed(dense_layer)(h)

    model = Model(inp, out)
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(opt, sparse_loss)
    return model
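
As a sanity check against `model.summary()`, the parameter counts of this architecture can be worked out by hand. The sketch below uses the standard Keras LSTM count of four gates, each with input weights, recurrent weights, and one bias vector. The vocabulary size of 10,000 is an assumption taken from `top_words`; the actual `numwords` is `len(i2w)` and may differ.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def model_params(vocab, emb, units, extra=0):
    """Total trainable parameters for Embedding -> LSTM(s) -> Dense."""
    total = vocab * emb                          # embedding table
    total += lstm_params(emb, units)             # first LSTM layer
    total += extra * lstm_params(units, units)   # extra stacked LSTM layers
    total += units * vocab + vocab               # output Dense layer (weights + bias)
    return total

# Assumed sizes: vocab=10000 (top_words), embedding_size=300, lstm_capacity=256
print("one LSTM layer: ", model_params(10000, 300, 256))
print("two LSTM layers:", model_params(10000, 300, 256, extra=1))
```

Under these assumptions the embedding table and the output layer account for most of the parameters, so the second LSTM layer adds comparatively little.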


In [None]:
"""
Cell 10: Model Training Function

Trains a given model on the training data for a specified number of epochs.
Supports both forward and backward training directions by appropriately
shifting and padding input sequences.

Args:
    model (Model): The TensorFlow Keras model to be trained.
    X_train (list): List of padded training data batches.
    direction (str): Training direction ('forward' or 'backward').

Logs:
    Batch loss is logged to TensorBoard.
"""
def train_model(model, X_train, direction='forward'):
    instances_seen = 0
    for epoch in range(options.epochs):
        for batch in tqdm(X_train):
            n = batch.shape[0]
            # For backward training, reverse each sentence along the time axis
            if direction == 'backward':
                batch = np.flip(batch, axis=1)
            # Input: start token (1) followed by the sentence; target: the sentence followed by padding (0)
            batch_shifted = np.concatenate([np.ones((n, 1)), batch], axis=1)
            batch_out = np.concatenate([batch, np.zeros((n, 1))], axis=1)

            loss = model.train_on_batch(batch_shifted, batch_out[:, :, None])
            instances_seen += n
            tbw.add_scalar('lm/batch-loss', float(loss), instances_seen)
        print(f'Epoch {epoch + 1}/{options.epochs} | last batch loss: {float(loss):.4f}')
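
The shifting logic is easiest to see on a single toy sentence. The sketch below mirrors the concatenations used in `train_model` with plain NumPy; `make_io` is a hypothetical helper name, while the start ID 1 and padding ID 0 are the values used in the cell above.

```python
import numpy as np

def make_io(batch, direction='forward', start_id=1, pad_id=0):
    """Build (inputs, targets) so that the target at step t is the token after input t."""
    if direction == 'backward':
        batch = np.flip(batch, axis=1)   # reverse every sentence in the batch
    n = batch.shape[0]
    inputs = np.concatenate([np.full((n, 1), start_id), batch], axis=1)
    targets = np.concatenate([batch, np.full((n, 1), pad_id)], axis=1)
    return inputs, targets

batch = np.array([[5, 6, 7, 0]])   # one sentence, right-padded with 0
fwd_in, fwd_out = make_io(batch, 'forward')
bwd_in, bwd_out = make_io(batch, 'backward')
print("forward  input :", fwd_in.tolist())
print("forward  target:", fwd_out.tolist())
print("backward input :", bwd_in.tolist())
print("backward target:", bwd_out.tolist())
```

Note that flipping a right-padded batch moves the padding to the front of each sentence, so the backward model sees the padding tokens before the actual words.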


In [None]:
"""
Cell 11: Perplexity Computation Function

Computes the perplexity of the model on a given dataset (training, validation, or test)
by evaluating the average loss per token and exponentiating it.

Args:
    model (Model): Trained TensorFlow Keras model.
    data_batches (list): List of padded data batches.
    direction (str): Direction for sequence processing ('forward' or 'backward').

Returns:
    perplexity (float): The computed perplexity score for the dataset.
"""
def compute_perplexity(model, data_batches, direction='forward'):
    total_loss = 0.0
    total_tokens = 0
    for batch in tqdm(data_batches, desc="Computing Perplexity"):
        n = batch.shape[0]
        if direction == 'backward':
            batch = np.flip(batch, axis=1)
        batch_shifted = np.concatenate([np.ones((n, 1)), batch], axis=1)
        batch_out = np.concatenate([batch, np.zeros((n, 1))], axis=1).astype('int32')

        # model.evaluate averages the loss over every position, padding included,
        # so compute the per-token loss directly and mask out the padding (ID 0)
        logits = model.predict_on_batch(batch_shifted)
        token_loss = np.asarray(tf.keras.losses.sparse_categorical_crossentropy(
            batch_out, logits, from_logits=True))
        mask = batch_out != 0
        total_loss += float(np.sum(token_loss * mask))
        total_tokens += int(np.sum(mask))
    avg_loss = total_loss / total_tokens
    return float(np.exp(avg_loss))
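
Perplexity is the exponential of the average negative log-likelihood per real (non-padding) token. The sketch below computes it from raw logits with NumPy only, independent of the Keras model; `masked_perplexity` is a hypothetical helper, and treating ID 0 as padding is an assumption carried over from the cells above.

```python
import numpy as np

def masked_perplexity(logits, targets, pad_id=0):
    """exp(mean negative log-likelihood) over positions whose target is not padding."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.int64)
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of each target token
    token_nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    mask = targets != pad_id
    return float(np.exp(token_nll[mask].sum() / mask.sum()))

vocab = 8
logits = np.zeros((2, 3, vocab))               # a model that predicts uniformly
targets = np.array([[3, 4, 0], [5, 0, 0]])     # zeros are padding positions
print(masked_perplexity(logits, targets))
```

A model that assigns equal probability to every word has a perplexity equal to the vocabulary size, which gives a useful upper reference point when reading the results.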


In [None]:
"""
Cell 12: Model Definitions

Defines four LSTM-based models with varying configurations:
  - Model 1: 1 LSTM layer with forward training.
  - Model 2: 1 LSTM layer with backward training.
  - Model 3: 2 LSTM layers with forward training.
  - Model 4: 2 LSTM layers with backward training.

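The models are appended to a list and their architectures are summarized.
Note that the one-layer models use a learning rate of 0.01 and the two-layer
models use 0.001, so the comparison between depths is not at equal learning rates.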
"""
models = []
# Model 1: 1 LSTM layer, forward training
model1 = create_lstm_model(lr=0.01)
models.append(model1)

# Model 2: 1 LSTM layer, backward training
model2 = create_lstm_model(lr=0.01)
models.append(model2)

# Model 3: 2 LSTM layers, forward training
model3 = create_lstm_model(extra=1, lr=0.001)
models.append(model3)

# Model 4: 2 LSTM layers, backward training
model4 = create_lstm_model(extra=1, lr=0.001)
models.append(model4)

# Print model summaries
for model in models:
    model.summary()


In [None]:
"""
Cell 13: Models Training

Trains each model using the training data. The training direction is determined by the model's index:
  - Even-indexed models use forward training.
  - Odd-indexed models use backward training.

Progress and loss are printed during training.
"""
print("Model 1 - 1 LSTM layer | forward training")
print("Model 2 - 1 LSTM layer | backward training")
print("Model 3 - 2 LSTM layers | forward training")
print("Model 4 - 2 LSTM layers | backward training")

for i, model in enumerate(models):
    direction = 'forward' if i % 2 == 0 else 'backward'
    print(f"Training model {i+1}")
    train_model(model, X_train, direction)

print("Training finished")


In [None]:
"""
Cell 14: Perplexity Evaluation

Computes and prints the perplexity for each model on the training, validation, and test sets.
Perplexity is a common metric in language modeling to evaluate how well a probability model predicts a sample.
"""
train_perplexities = []
val_perplexities = []
test_perplexities = []

for i, model in enumerate(models):
    direction = 'forward' if i % 2 == 0 else 'backward'
    train_perplexity = compute_perplexity(model, X_train, direction)
    val_perplexity = compute_perplexity(model, X_val, direction)
    test_perplexity = compute_perplexity(model, X_test, direction)
    train_perplexities.append(train_perplexity)
    val_perplexities.append(val_perplexity)
    test_perplexities.append(test_perplexity)

print('\n')
for i in range(len(models)):
    print(f'Model {i+1}:')
    print(f"Train Perplexity : {train_perplexities[i]}")
    print(f"Validation Perplexity : {val_perplexities[i]}")
    print(f"Test Perplexity : {test_perplexities[i]}")

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./runs/words/