# Challenge 01 - RNN, LSTM, and GRU

Student: **João Gabriel de Araújo Vasconcelos**

May be done individually or in pairs. Understand the "from scratch" implementations of the RNN, LSTM, and GRU (see the Kaggle links below and/or the online book: https://d2l.ai/ ).


https://www.kaggle.com/code/fareselmenshawii/rnn-from-scratch

https://www.kaggle.com/code/fareselmenshawii/lstm-from-scratch

https://www.kaggle.com/code/fareselmenshawii/gru-from-scratch


Run the code on the respective datasets. Also run it on datasets different from those shown in the examples (explain the datasets in the notebook itself).

The submission will consist of the notebooks with comments.

For the sake of clarity, the requirements above are translated from the original Portuguese, and the rest of the notebook is in English.

## Base Code

### RNN

In [1]:
import os
import numpy as np
import scipy as sp

This cell defines a DataGenerator class that handles loading and preprocessing text data for training the neural networks.

In [2]:
class DataGenerator:
    """
    A class for generating input and output examples for a character-level language model.
    """
    
    def __init__(self, path):
        """
        Initializes a DataGenerator object.

        Args:
            path (str): The path to the text file containing the training data.
        """
        self.path = path
        
        # Read in data from file and convert to lowercase
        with open(path) as f:
            data = f.read().lower()
        
        # Create list of unique characters in the data
        self.chars = list(set(data))
        
        # Create dictionaries mapping characters to and from their index in the list of unique characters
        self.char_to_idx = {ch: i for (i, ch) in enumerate(self.chars)}
        self.idx_to_char = {i: ch for (i, ch) in enumerate(self.chars)}
        
        # Set the size of the vocabulary (i.e. number of unique characters)
        self.vocab_size = len(self.chars)
        
        # Read in examples from file and convert to lowercase, removing leading/trailing white space
        with open(path) as f:
            examples = f.readlines()
        self.examples = [x.lower().strip() for x in examples]
 
    def generate_example(self, idx):
        """
        Generates an input/output example for the language model based on the given index.

        Args:
            idx (int): The index of the example to generate.

        Returns:
            A tuple containing the input and output arrays for the example.
        """
        example_chars = self.examples[idx]
        
        # Convert the characters in the example to their corresponding indices in the list of unique characters
        example_char_idx = [self.char_to_idx[char] for char in example_chars]
        
        # Add newline character as the first character in the input array, and as the last character in the output array
        X = [self.char_to_idx['\n']] + example_char_idx
        Y = example_char_idx + [self.char_to_idx['\n']]
        
        return np.array(X), np.array(Y)
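
To make the input/target construction in `generate_example` concrete, here is a minimal sketch on a toy vocabulary. The vocabulary and the helper `make_example` are hypothetical and exist only for illustration; the real class builds its mappings from the text file.

```python
import numpy as np

# Hypothetical toy vocabulary; the real DataGenerator derives it from the file.
chars = ['\n', 'a', 'e', 'r', 'x']
char_to_idx = {ch: i for i, ch in enumerate(chars)}

def make_example(name):
    """Build (X, Y) index arrays in the same way generate_example does."""
    idxs = [char_to_idx[ch] for ch in name]
    X = [char_to_idx['\n']] + idxs   # newline acts as the start token
    Y = idxs + [char_to_idx['\n']]   # newline acts as the end token
    return np.array(X), np.array(Y)

X, Y = make_example("rex")
print(X)  # [0 3 2 4]
print(Y)  # [3 2 4 0]
```

Each target is the character that follows the corresponding input, so `X[1:]` equals `Y[:-1]`.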

These cells define and implement the Recurrent Neural Network (RNN) class, including initialization, forward and backward propagation, training, and prediction methods.
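
As a compact reference for the class below, the forward step and the output-layer gradient it implements can be written as follows. This is a sketch derived from the code; the symbols $z_t$, $\hat{y}_t$, and $e_{y_t}$ (the one-hot vector of the target index $y_t$) are notation introduced here, while the weight names mirror the attributes `Wax`, `Waa`, `Wya`, `ba`, and `by`.

$$
\begin{aligned}
a_t &= \tanh(W_{ax} x_t + W_{aa} a_{t-1} + b_a) \\
z_t &= W_{ya} a_t + b_y, \qquad \hat{y}_t = \mathrm{softmax}(z_t) \\
L &= -\sum_t \log \hat{y}_t[y_t], \qquad \frac{\partial L}{\partial z_t} = \hat{y}_t - e_{y_t}
\end{aligned}
$$

The last identity is why `backward` copies `y_preds[t]` and subtracts 1 at the target index.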

In [3]:
class RNN:
    """
    A class used to represent a Recurrent Neural Network (RNN).

    Attributes
    ----------
    hidden_size : int
        The number of hidden units in the RNN.
    data_generator : DataGenerator
        The data generator that provides the training examples and the vocabulary.
    vocab_size : int
        The size of the vocabulary used by the RNN.
    sequence_length : int
        The length of the input sequences fed to the RNN.
    learning_rate : float
        The learning rate used during training.

    Methods
    -------
    __init__(hidden_size, data_generator, sequence_length, learning_rate)
        Initializes an instance of the RNN class.

    forward(X, a_prev)
        Computes the forward pass of the RNN.

    softmax(x)
        Computes the softmax activation function for a given input array.

    backward(x, a, y_preds, targets)
        Implements the backward pass of the RNN.

    loss(y_preds, targets)
        Computes the cross-entropy loss for a given sequence of predicted probabilities and true targets.

    adamw(beta1=0.9, beta2=0.999, epsilon=1e-8, L2_reg=1e-4)
        Updates the RNN's parameters using the AdamW optimization algorithm.

    sample()
        Samples a sequence of character indices from the RNN.

    train(generated_names=5)
        Trains the RNN on a dataset using backpropagation through time (BPTT).

    predict(start)
        Generates a sequence of characters using the trained model, starting from the given start sequence.
        Generation stops at a newline character or after at most 50 generated characters.

    """

    def __init__(self, hidden_size, data_generator, sequence_length, learning_rate):
        """
        Initializes an instance of the RNN class.

        Parameters
        ----------
        hidden_size : int
            The number of hidden units in the RNN.
        data_generator : DataGenerator
            The data generator that provides the training examples; the vocabulary size is read from it.
        sequence_length : int
            The length of the input sequences fed to the RNN.
        learning_rate : float
            The learning rate used during training.
        """

        # hyper parameters
        self.hidden_size = hidden_size
        self.data_generator = data_generator
        self.vocab_size = self.data_generator.vocab_size
        self.sequence_length = sequence_length
        self.learning_rate = learning_rate
        self.X = None

        # model parameters
        self.Wax = np.random.uniform(-np.sqrt(1. / self.vocab_size), np.sqrt(1. / self.vocab_size), (hidden_size, self.vocab_size))
        self.Waa = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size), (hidden_size, hidden_size))
        self.Wya = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size), (self.vocab_size, hidden_size))
        self.ba = np.zeros((hidden_size, 1))  
        self.by = np.zeros((self.vocab_size, 1))
        
        # Initialize gradients
        self.dWax, self.dWaa, self.dWya = np.zeros_like(self.Wax), np.zeros_like(self.Waa), np.zeros_like(self.Wya)
        self.dba, self.dby = np.zeros_like(self.ba), np.zeros_like(self.by)
        
        # parameter update with AdamW
        self.mWax = np.zeros_like(self.Wax)
        self.vWax = np.zeros_like(self.Wax)
        self.mWaa = np.zeros_like(self.Waa)
        self.vWaa = np.zeros_like(self.Waa)
        self.mWya = np.zeros_like(self.Wya)
        self.vWya = np.zeros_like(self.Wya)
        self.mba = np.zeros_like(self.ba)
        self.vba = np.zeros_like(self.ba)
        self.mby = np.zeros_like(self.by)
        self.vby = np.zeros_like(self.by)

    def softmax(self, x):
        """
        Computes the softmax activation function for a given input array.

        Parameters:
            x (ndarray): Input array.

        Returns:
            ndarray: Array of the same shape as `x`, containing the softmax activation values.
        """
        # shift the input to prevent overflow when computing the exponentials
        x = x - np.max(x)
        # compute the exponentials of the shifted input
        p = np.exp(x)
        # normalize the exponentials by dividing by their sum
        return p / np.sum(p)

    def forward(self, X, a_prev):
        """
        Compute the forward pass of the RNN.

        Parameters:
        X (ndarray): Sequence of character indices, shape (seq_length,)
        a_prev (ndarray): Activation of the previous time step of shape (hidden_size, 1)

        Returns:
        x (dict): Dictionary of one-hot input vectors of shape (vocab_size, 1), with keys from 0 to seq_length-1
        a (dict): Dictionary of hidden activations for each time step, with keys from -1 (the initial state) to seq_length-1
        y_pred (dict): Dictionary of output probabilities for each time step, with keys from 0 to seq_length-1
        """
        # Initialize dictionaries to store activations and output probabilities.
        x, a, y_pred = {}, {}, {}

        # Store the input data in the class variable for later use in the backward pass.
        self.X = X

        # Set the initial activation to the previous activation.
        a[-1] = np.copy(a_prev)
        # iterate over each time step in the input sequence
        for t in range(len(self.X)): 
            # get the input at the current time step
            x[t] = np.zeros((self.vocab_size,1)) 
            if self.X[t] is not None:
                x[t][self.X[t]] = 1
            # compute the hidden activation at the current time step
            a[t] = np.tanh(np.dot(self.Wax, x[t]) + np.dot(self.Waa, a[t - 1]) + self.ba)
            # compute the output probabilities at the current time step
            y_pred[t] = self.softmax(np.dot(self.Wya, a[t]) + self.by)
        # return the input, hidden activations, and output probabilities at each time step
        return x, a, y_pred 
    
    def backward(self,x, a, y_preds, targets):
        """
        Implement the backward pass of the RNN.

        Args:
        x -- (dict) of input characters (as one-hot encoding vectors) for each time-step, shape (vocab_size, sequence_length)
        a -- (dict) of hidden state vectors for each time-step, shape (hidden_size, sequence_length)
        y_preds -- (dict) of output probability vectors (after softmax) for each time-step, shape (vocab_size, sequence_length)
        targets -- (list) of integer target characters (indices of characters in the vocabulary) for each time-step, shape (1, sequence_length)

        Returns:
        None

        """
        # Initialize derivative of hidden state for the last time-step
        da_next = np.zeros_like(a[0])

        # Loop through the input sequence backwards
        for t in reversed(range(len(self.X))):
            # Calculate derivative of output probability vector
            dy_preds = np.copy(y_preds[t])
            dy_preds[targets[t]] -= 1

            # Calculate derivative of hidden state
            da = np.dot(self.Waa.T, da_next) + np.dot(self.Wya.T, dy_preds)
            dtanh = (1 - np.power(a[t], 2))
            da_unactivated = dtanh * da

            # Calculate gradients
            self.dba += da_unactivated
            self.dWax += np.dot(da_unactivated, x[t].T)
            self.dWaa += np.dot(da_unactivated, a[t - 1].T)

            # Update derivative of hidden state for the next iteration
            da_next = da_unactivated

            # Calculate gradients for the output weight matrix and the output bias
            self.dWya += np.dot(dy_preds, a[t].T)
            self.dby += dy_preds

            # clip gradients to avoid exploding gradients
            for grad in [self.dWax, self.dWaa, self.dWya, self.dba, self.dby]:
                np.clip(grad, -1, 1, out=grad)
 
    def loss(self, y_preds, targets):
        """
        Computes the cross-entropy loss for a given sequence of predicted probabilities and true targets.

        Parameters:
            y_preds (dict): Dictionary mapping each time step to an array of shape (vocab_size, 1) containing the predicted probabilities.
            targets (ndarray): Array of shape (sequence_length,) containing the target character index for each time step.

        Returns:
            float: Cross-entropy loss.
        """
        # calculate cross-entropy loss
        return sum(-np.log(y_preds[t][targets[t], 0]) for t in range(len(self.X)))
    
    def adamw(self, beta1=0.9, beta2=0.999, epsilon=1e-8, L2_reg=1e-4):
        """
        Updates the RNN's parameters using the AdamW optimization algorithm.
        """
        # AdamW update for Wax
        self.mWax = beta1 * self.mWax + (1 - beta1) * self.dWax
        self.vWax = beta2 * self.vWax + (1 - beta2) * np.square(self.dWax)
        m_hat = self.mWax / (1 - beta1)
        v_hat = self.vWax / (1 - beta2)
        self.Wax -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wax)

        # AdamW update for Waa
        self.mWaa = beta1 * self.mWaa + (1 - beta1) * self.dWaa
        self.vWaa = beta2 * self.vWaa + (1 - beta2) * np.square(self.dWaa)
        m_hat = self.mWaa / (1 - beta1)
        v_hat = self.vWaa / (1 - beta2)
        self.Waa -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Waa)

        # AdamW update for Wya
        self.mWya = beta1 * self.mWya + (1 - beta1) * self.dWya
        self.vWya = beta2 * self.vWya + (1 - beta2) * np.square(self.dWya)
        m_hat = self.mWya / (1 - beta1)
        v_hat = self.vWya / (1 - beta2)
        self.Wya -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wya)

        # AdamW update for ba
        self.mba = beta1 * self.mba + (1 - beta1) * self.dba
        self.vba = beta2 * self.vba + (1 - beta2) * np.square(self.dba)
        m_hat = self.mba / (1 - beta1)
        v_hat = self.vba / (1 - beta2)
        self.ba -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.ba)

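        # AdamW update for by
        self.mby = beta1 * self.mby + (1 - beta1) * self.dby
        self.vby = beta2 * self.vby + (1 - beta2) * np.square(self.dby)
        m_hat = self.mby / (1 - beta1)
        v_hat = self.vby / (1 - beta2)
        self.by -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.by)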
    
    def sample(self):
        """
        Sample a sequence of characters from the RNN.

        Args:
            None

        Returns:
            list: A list of integers representing the generated sequence.
        """
        # initialize input and hidden state
        x = np.zeros((self.vocab_size, 1))
        a_prev = np.zeros((self.hidden_size, 1))

        # create an empty list to store the generated character indices
        indices = []

        # idx is a flag to detect a newline character, initialize it to -1
        idx = -1

        # generate sequence of characters
        counter = 0
        max_chars = 50 # maximum number of characters to generate
        newline_character = self.data_generator.char_to_idx['\n'] # the newline character

        while (idx != newline_character and counter != max_chars):
            # compute the hidden state
            a = np.tanh(np.dot(self.Wax, x) + np.dot(self.Waa, a_prev) + self.ba)

            # compute the output probabilities
            y = self.softmax(np.dot(self.Wya, a) + self.by)

            # sample the next character from the output probabilities
            idx = np.random.choice(list(range(self.vocab_size)), p=y.ravel())

            # set the input for the next time step
            x = np.zeros((self.vocab_size, 1))
            x[idx] = 1

            # store the sampled character index in the list
            indices.append(idx)

            # update the previous hidden state
            a_prev = a

            # increment the counter
            counter += 1

        # return the list of sampled character indices
        return indices

        
    def train(self, generated_names=5):
        """
        Train the RNN on a dataset using backpropagation through time (BPTT).

        Args:
        - generated_names: an integer indicating how many example names to generate during training.

        Returns:
        - None
        """

        iter_num = 0
        threshold = 5 # stopping criterion for training
        smooth_loss = -np.log(1.0 / self.data_generator.vocab_size) * self.sequence_length  # initialize loss

        while (smooth_loss > threshold):
            a_prev = np.zeros((self.hidden_size, 1))
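            # NOTE: the modulus uses vocab_size, so only the first vocab_size examples are ever visited;
            # len(self.data_generator.examples) would cycle through the whole dataset instead
            idx = iter_num % self.vocab_size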
            # get a batch of inputs and targets
            inputs, targets = self.data_generator.generate_example(idx)

            # forward pass
            x, a, y_pred  = self.forward(inputs, a_prev)

            # backward pass
            self.backward(x, a, y_pred, targets)

            # calculate and update loss
            loss = self.loss(y_pred, targets)
            self.adamw()
            smooth_loss = smooth_loss * 0.999 + loss * 0.001

            # update previous hidden state for the next batch
            a_prev = a[len(self.X) - 1]
            # print progress every 500 iterations
            if iter_num % 500 == 0:
                print("\n\niter :%d, loss:%f\n" % (iter_num, smooth_loss))
                for i in range(generated_names):
                    sample_idx = self.sample()
                    txt = ''.join(self.data_generator.idx_to_char[idx] for idx in sample_idx)
                    txt = txt.title()  # capitalize first character 
                    print ('%s' % (txt, ), end='')
            iter_num += 1
    
    def predict(self, start):
        """
        Generate a sequence of characters using the trained model, starting from the given start sequence.
        Generation stops at a newline character or after at most 50 generated characters.

        Args:
        - start: a string containing the start sequence

        Returns:
        - txt: a string containing the generated sequence
        """

        # Initialize input vector and previous hidden state
        x = np.zeros((self.vocab_size, 1))
        a_prev = np.zeros((self.hidden_size, 1))

        # Convert start sequence to indices
        chars = [ch for ch in start]
        idxes = []
        for i in range(len(chars)):
            idx = self.data_generator.char_to_idx[chars[i]]
            x[idx] = 1
            idxes.append(idx)

        # Generate sequence
        max_chars = 50  # maximum number of characters to generate
        newline_character = self.data_generator.char_to_idx['\n']  # the newline character
        counter = 0
        while (idx != newline_character and counter != max_chars):
            # Compute next hidden state and predicted character
            a = np.tanh(np.dot(self.Wax, x) + np.dot(self.Waa, a_prev) + self.ba)
            y_pred = self.softmax(np.dot(self.Wya, a) + self.by)
            idx = np.random.choice(range(self.vocab_size), p=y_pred.ravel())

            # Update input vector, previous hidden state, and indices
            x = np.zeros((self.vocab_size, 1))
            x[idx] = 1
            a_prev = a
            idxes.append(idx)
            counter += 1

        # Convert indices to characters and concatenate into a string
        txt = ''.join(self.data_generator.idx_to_char[i] for i in idxes)

        # Remove newline character if it exists at the end of the generated sequence
        if txt[-1] == '\n':
            txt = txt[:-1]

        return txt
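
The `adamw` method above divides the moment estimates by the constants `1 - beta1` and `1 - beta2`. In the usual formulation of Adam and AdamW, the bias correction instead depends on the step count `t`, using `1 - beta1**t` and `1 - beta2**t`. The following is a minimal sketch of that variant, not the notebook's implementation; the function `adamw_step` and its argument names are hypothetical.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               epsilon=1e-8, weight_decay=1e-4):
    """One AdamW update with step-dependent bias correction (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad             # first-moment estimate
    v = beta2 * v + (1 - beta2) * np.square(grad)  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                   # correction shrinks as t grows
    v_hat = v / (1 - beta2 ** t)
    # decoupled weight decay: applied to the parameter, not added to the gradient
    param = param - lr * (m_hat / (np.sqrt(v_hat) + epsilon) + weight_decay * param)
    return param, m, v

# Toy problem: minimize sum(w**2), whose gradient is 2 * w.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    w, m, v = adamw_step(w, 2 * w, m, v, t)
print(w)
```

With the step-dependent correction, the first update has a magnitude close to the learning rate, and the correction factor approaches 1 as training proceeds.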

In [4]:
# Initialize DataGenerator and RNN
data_generator = DataGenerator('dinos.txt')
rnn = RNN(hidden_size=200, data_generator=data_generator, sequence_length=25, learning_rate=1e-3)
rnn.train()



iter :0, loss:82.360052

Dwbwfjsxjioghrq
Jpmjuabgehfnscuyirhifaag
Qbdfhxejog
RuosetzyexdecketrcffyrdjjayflzkmbggweexmjasyztrrhsYrotpubolwhhrespnzilrataczmxlyxvczd


iter :500, loss:59.155745

Freedllhottenhyoshtrusanahhlohrhmlusiurus
Hemoloolavosatrus
Aueinmohysneroshisabmysuheayooshrris

Mednonyeaarhoeioihinaalhosacros


iter :1000, loss:42.003127

Eaeyonhxaoras
Afrittosrurus
Adrovaathir
Dadaoraonooli
Asantosaurusaurus


iter :1500, loss:29.920424

Acrosantoosaurus
Acritleps
Abtiolnus
Afrollolopaurus
Ybrasautor


iter :2000, loss:21.498926

Saurusburus
Adamonyiabrotholus
Achdosaurus
Acrdanyx
Amdistuvus


iter :2500, loss:15.773872

Abromenator
Aatausaurus
Acrimtor
Acrocanalosaurus
Abrimdalrolimyn


iter :3000, loss:11.930649

Foonyx
Arrosaurus
Aurasterobus
Afrosaurus
Abpaesaetis


iter :3500, loss:9.378645

Actoosaurus
Aerosturus
Ameoonopausaurus
Abyyornithamus
Acristurus


iter :4000, loss:7.736533

Afrovenator
Afrovenator
Afrovhnator
Acristavus
Afrovenator


iter :4500, loss:6.613
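
The loss printed above is an exponentially smoothed value, not the raw loss of the current example. The sketch below reproduces the initialization and the update rule from `train`. The vocabulary size of 27 (the letters a to z plus the newline character) is an assumption about `dinos.txt`, and `update_smooth_loss` is a hypothetical helper name.

```python
import numpy as np

# Assumed values: 27 symbols (a-z plus newline) and sequence_length=25 as in the cell above.
vocab_size, sequence_length = 27, 25

# Starting point: the loss of a uniform guess over the vocabulary, summed over the sequence.
smooth_loss = -np.log(1.0 / vocab_size) * sequence_length
print(round(smooth_loss, 3))  # 82.396

def update_smooth_loss(smooth_loss, loss, decay=0.999):
    """Exponential moving average used to report training progress."""
    return smooth_loss * decay + loss * (1 - decay)

# Even if every example had zero loss, the reported value would decay slowly.
value = smooth_loss
for _ in range(500):
    value = update_smooth_loss(value, 0.0)
print(value < smooth_loss)  # True
```

Under these assumptions the starting value is consistent with the figure of about 82.36 printed at iteration 0, and the slow decay means the reported loss lags behind the per-example loss.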

In [5]:
rnn.predict("meo")

'meonallesaurus'

In [6]:
rnn.predict("a")

'aurovepaepi'
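
The `predict` method above sets the one-hot position of every character in `start` on the same input vector before the first step, so the network sees the prefix as an unordered set of characters. A common alternative is to feed the prefix one character at a time so that the hidden state reflects their order. The sketch below illustrates this with random weights; the toy sizes and the helpers `one_hot` and `prime_hidden_state` are hypothetical and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8   # hypothetical toy sizes
Wax = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
Waa = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
ba = np.zeros((hidden_size, 1))

def one_hot(idx, size):
    """Return a column vector with a single 1 at position idx."""
    x = np.zeros((size, 1))
    x[idx] = 1
    return x

def prime_hidden_state(start_indices):
    """Feed the prefix one character at a time and return the final hidden state."""
    a = np.zeros((hidden_size, 1))
    for idx in start_indices:
        a = np.tanh(Wax @ one_hot(idx, vocab_size) + Waa @ a + ba)
    return a

a_ordered = prime_hidden_state([1, 2, 3])
a_reversed = prime_hidden_state([3, 2, 1])
print(a_ordered.shape)  # (8, 1)
# The two states differ because the recurrence is sensitive to character order.
print(np.allclose(a_ordered, a_reversed))
```

To combine this with the generation loop in `predict`, one option would be to prime the hidden state on all but the last character of the prefix and pass the last character as the first input `x`.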

We also check the output for the prompt "itsbeen". The same prompt is used for every model in this notebook (the results will vary for further models), which allows a direct comparison of output quality between model architectures. It is a short, common phrase that does not overly constrain the model's generation, aside from being a "meme" from the final dataset. It is also open-ended enough to showcase the model's ability to generate diverse continuations, so it serves as a quick benchmark of each model's basic text generation capabilities.


In [7]:
# Create input string
input_string = "itsbeen"

# Get model output
try:
    output = rnn.predict(input_string)
except KeyError as e:
    output = f"Error: {str(e)}"

# Create DataFrame
import pandas as pd

df = pd.DataFrame({
    'Model': ['RNN'],
    'Input': [input_string],
    'Output': [output]
})

display(df)

|   | Model | Input | Output |
|---|-------|-------|--------|
| 0 | RNN | itsbeen | itsbeenachenosaurus |


### LSTM

In [8]:
class DataGenerator:
    """
    A class for reading and preprocessing text data.
    """

    def __init__(self, path: str, sequence_length: int):
        """
        Initializes a DataGenerator object with the path to a text file and the desired sequence length.

        Args:
            path (str): The path to the text file.
            sequence_length (int): The length of the sequences that will be fed to the model.
        """
        with open(path) as f:
            # Read the contents of the file
            self.data = f.read()

        # Find all unique characters in the text
        chars = list(set(self.data))

        # Create dictionaries to map characters to indices and vice versa
        self.char_to_idx = {ch: i for (i, ch) in enumerate(chars)}
        self.idx_to_char = {i: ch for (i, ch) in enumerate(chars)}

        # Store the size of the text data and the size of the vocabulary
        self.data_size = len(self.data)
        self.vocab_size = len(chars)

        # Initialize the pointer that will be used to generate sequences
        self.pointer = 0

        # Store the desired sequence length
        self.sequence_length = sequence_length


    def next_batch(self):
        """
        Generates a batch of input and target sequences.

        Returns:
            inputs_one_hot (np.ndarray): A numpy array with shape `(sequence_length, vocab_size)` where each row is a one-hot encoded representation of a character in the input sequence.
            targets (list): A list of integers that correspond to the indices of the characters in the target sequence, which is the input sequence shifted by one character so that each target is the character following its input.
        """
        input_start = self.pointer
        input_end = self.pointer + self.sequence_length

        # Get the input sequence as a list of integers
        inputs = [self.char_to_idx[ch] for ch in self.data[input_start:input_end]]

        # One-hot encode the input sequence
        inputs_one_hot = np.zeros((len(inputs), self.vocab_size))
        inputs_one_hot[np.arange(len(inputs)), inputs] = 1

        # Get the target sequence as a list of integers
        targets = [self.char_to_idx[ch] for ch in self.data[input_start + 1:input_end + 1]]

        # Update the pointer
        self.pointer += self.sequence_length

        # Reset the pointer if the next batch would exceed the length of the text data
        if self.pointer + self.sequence_length + 1 >= self.data_size:
            self.pointer = 0

        return inputs_one_hot, targets
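
To illustrate what `next_batch` returns, here is a minimal sketch of the same windowing on a toy corpus. The corpus string and the helper `batch_at` are hypothetical; the vocabulary is sorted here only to make the indices reproducible, whereas the class above uses the unordered `set`.

```python
import numpy as np

data = "hello world"            # hypothetical toy corpus
chars = sorted(set(data))       # sorted only for reproducible indices
char_to_idx = {ch: i for i, ch in enumerate(chars)}
sequence_length = 4

def batch_at(pointer):
    """Return one-hot inputs and next-character targets for one window."""
    inputs = [char_to_idx[ch] for ch in data[pointer:pointer + sequence_length]]
    targets = [char_to_idx[ch] for ch in data[pointer + 1:pointer + sequence_length + 1]]
    one_hot = np.zeros((len(inputs), len(chars)))
    one_hot[np.arange(len(inputs)), inputs] = 1
    return one_hot, targets

one_hot, targets = batch_at(0)
print(one_hot.shape)  # (4, 8)
print(targets)        # [2, 4, 4, 5]
```

The inputs cover "hell" and the targets cover "ello", so each target is the character that follows its input.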

These cells define and implement the Long Short-Term Memory (LSTM) class.
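
As a compact reference, the forward step implemented by the class below can be written as follows. This is a sketch derived from the `forward` method; $\sigma$ is the sigmoid, $\odot$ is the element-wise product, and the symbols $z_t$ and $\tilde{c}_t$ (the `concat` and `cc` variables in the code) are notation introduced here.

$$
\begin{aligned}
z_t &= \begin{bmatrix} a_{t-1} \\ x_t \end{bmatrix} \\
f_t &= \sigma(W_f z_t + b_f), \qquad i_t = \sigma(W_i z_t + b_i), \qquad o_t = \sigma(W_o z_t + b_o) \\
\tilde{c}_t &= \tanh(W_c z_t + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
a_t &= o_t \odot \tanh(c_t), \qquad \hat{y}_t = \mathrm{softmax}(W_y a_t + b_y)
\end{aligned}
$$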

In [9]:
class LSTM:
    """
    A class used to represent a Long Short-Term Memory (LSTM) network.

    Attributes
    ----------
    hidden_size : int
        The number of hidden units in the LSTM.
    vocab_size : int
        The size of the vocabulary used by the LSTM.
    sequence_length : int
        The length of the input sequences fed to the LSTM.
    learning_rate : float
        The learning rate used during training.

    Methods
    -------
    __init__(hidden_size, vocab_size, sequence_length, learning_rate)
        Initializes an instance of the LSTM class.
    """

    def __init__(self, hidden_size, vocab_size, sequence_length, learning_rate):
        """
        Initializes an instance of the LSTM class.

        Parameters
        ----------
        hidden_size : int
            The number of hidden units in the LSTM.
        vocab_size : int
            The size of the vocabulary used by the LSTM.
        sequence_length : int
            The length of the input sequences fed to the LSTM.
        learning_rate : float
            The learning rate used during training.
        """
        # hyper parameters
        self.mby = None
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.sequence_length = sequence_length
        self.learning_rate = learning_rate
        
        # model parameters
        self.Wf = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (hidden_size, hidden_size + vocab_size))
        self.bf = np.zeros((hidden_size, 1))
        
        self.Wi = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (hidden_size, hidden_size + vocab_size))
        self.bi = np.zeros((hidden_size, 1))

        self.Wc = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (hidden_size, hidden_size + vocab_size))
        self.bc = np.zeros((hidden_size, 1))
            
        self.Wo = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (hidden_size, hidden_size + vocab_size))
        self.bo = np.zeros((hidden_size, 1))
        
        self.Wy = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (vocab_size, hidden_size))
        self.by = np.zeros((vocab_size, 1))

        # initialize parameters for adamw optimizer
        self.mWf = np.zeros_like(self.Wf)
        self.vWf = np.zeros_like(self.Wf)
        self.mWi = np.zeros_like(self.Wi)
        self.vWi = np.zeros_like(self.Wi)
        self.mWc = np.zeros_like(self.Wc)
        self.vWc = np.zeros_like(self.Wc)
        self.mWo = np.zeros_like(self.Wo)
        self.vWo = np.zeros_like(self.Wo)
        self.mWy = np.zeros_like(self.Wy)
        self.vWy = np.zeros_like(self.Wy)
        self.mbf = np.zeros_like(self.bf)
        self.vbf = np.zeros_like(self.bf)
        self.mbi = np.zeros_like(self.bi)
        self.vbi = np.zeros_like(self.bi)
        self.mbc = np.zeros_like(self.bc)
        self.vbc = np.zeros_like(self.bc)
        self.mbo = np.zeros_like(self.bo)
        self.vbo = np.zeros_like(self.bo)
        self.mby = np.zeros_like(self.by)
        self.vby = np.zeros_like(self.by)

    def sigmoid(self, x):
        """
        Computes the sigmoid activation function for a given input array.

        Parameters:
            x (ndarray): Input array.

        Returns:
            ndarray: Array of the same shape as `x`, containing the sigmoid activation values.
        """
        return 1 / (1 + np.exp(-x))

    def softmax(self, x):
        """
        Computes the softmax activation function for a given input array.

        Parameters:
            x (ndarray): Input array.

        Returns:
            ndarray: Array of the same shape as `x`, containing the softmax activation values.
        """
        # shift the input to prevent overflow when computing the exponentials
        x = x - np.max(x)
        # compute the exponentials of the shifted input
        p = np.exp(x)
        # normalize the exponentials by dividing by their sum
        return p / np.sum(p)

    def loss(self, y_preds, targets):
        """
        Computes the cross-entropy loss for a given sequence of predicted probabilities and true targets.

        Parameters:
            y_preds (dict): Dictionary mapping each time step to an array of shape (vocab_size, 1) containing the predicted probabilities.
            targets (list): List of length sequence_length containing the target character index for each time step.

        Returns:
            float: Cross-entropy loss.
        """
        # calculate cross-entropy loss
        return sum(-np.log(y_preds[t][targets[t], 0]) for t in range(self.sequence_length))


    def adamw(self, beta1=0.9, beta2=0.999, epsilon=1e-8, L2_reg=1e-4):
        """
        Updates the LSTM's parameters using the AdamW optimization algorithm.
        """
        # AdamW update for Wf
        self.mWf = beta1 * self.mWf + (1 - beta1) * self.dWf
        self.vWf = beta2 * self.vWf + (1 - beta2) * np.square(self.dWf)
        m_hat = self.mWf / (1 - beta1)
        v_hat = self.vWf / (1 - beta2)
        self.Wf -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wf)

        # AdamW update for bf
        self.mbf = beta1 * self.mbf + (1 - beta1) * self.dbf
        self.vbf = beta2 * self.vbf + (1 - beta2) * np.square(self.dbf)
        m_hat = self.mbf / (1 - beta1)
        v_hat = self.vbf / (1 - beta2)
        self.bf -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.bf)

        # AdamW update for Wi
        self.mWi = beta1 * self.mWi + (1 - beta1) * self.dWi
        self.vWi = beta2 * self.vWi + (1 - beta2) * np.square(self.dWi)
        m_hat = self.mWi / (1 - beta1)
        v_hat = self.vWi / (1 - beta2)
        self.Wi -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wi)

        # AdamW update for bi
        self.mbi = beta1 * self.mbi + (1 - beta1) * self.dbi
        self.vbi = beta2 * self.vbi + (1 - beta2) * np.square(self.dbi)
        m_hat = self.mbi / (1 - beta1)
        v_hat = self.vbi / (1 - beta2)
        self.bi -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.bi)

        # AdamW update for Wc
        self.mWc = beta1 * self.mWc + (1 - beta1) * self.dWc
        self.vWc = beta2 * self.vWc + (1 - beta2) * np.square(self.dWc)
        m_hat = self.mWc / (1 - beta1)
        v_hat = self.vWc / (1 - beta2)
        self.Wc -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wc)

        # AdamW update for bc
        self.mbc = beta1 * self.mbc + (1 - beta1) * self.dbc
        self.vbc = beta2 * self.vbc + (1 - beta2) * np.square(self.dbc)
        m_hat = self.mbc / (1 - beta1)
        v_hat = self.vbc / (1 - beta2)
        self.bc -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.bc)

        # AdamW update for Wy
        self.mWy = beta1 * self.mWy + (1 - beta1) * self.dWy
        self.vWy = beta2 * self.vWy + (1 - beta2) * np.square(self.dWy)
        m_hat = self.mWy / (1 - beta1)
        v_hat = self.vWy / (1 - beta2)
        self.Wy -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wy)
        # AdamW update for by
        self.mby = beta1 * self.mby + (1 - beta1) * self.dby
        self.vby = beta2 * self.vby + (1 - beta2) * np.square(self.dby)
        m_hat = self.mby / (1 - beta1)
        v_hat = self.vby / (1 - beta2)
        self.by -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.by)


    def forward(self, X, c_prev, a_prev):
        """
        Performs forward propagation for a simple LSTM model.

        Args:
            X (numpy array): Input sequence, shape (sequence_length, input_size)
            c_prev (numpy array): Previous cell state, shape (hidden_size, 1)
            a_prev (numpy array): Previous hidden state, shape (hidden_size, 1)

        Returns:
            X (numpy array): Input sequence, shape (sequence_length, input_size)
            y_pred (dictionary): Output probability vector for each time step, keys = time step, values = numpy array shape (output_size, 1)
            c (dictionary): Cell state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            f (dictionary): Forget gate for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            i (dictionary): Input gate for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            o (dictionary): Output gate for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            cc (dictionary): Candidate cell state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            a (dictionary): Hidden state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
        """
        # initialize dictionaries for backpropagation 
        c, f, i, o, cc, a, y_pred = {}, {}, {}, {}, {}, {}, {}
        c[-1] = np.copy(c_prev)  # store the initial cell state in the dictionary
        a[-1] = np.copy(a_prev)  # store the initial hidden state in the dictionary

        # iterate over each time step in the input sequence
        for t in range(X.shape[0]):
            # concatenate the input and hidden state
            xt = X[t, :].reshape(-1, 1)
            concat = np.vstack((a[t - 1], xt))

            # compute the forget gate
            f[t] = self.sigmoid(np.dot(self.Wf, concat) + self.bf)

            # compute the input gate
            i[t] = self.sigmoid(np.dot(self.Wi, concat) + self.bi)

            # compute the candidate cell state
            cc[t] = np.tanh(np.dot(self.Wc, concat) + self.bc)

            # compute the cell state
            c[t] = f[t] * c[t - 1] + i[t] * cc[t]

            # compute the output gate
            o[t] = self.sigmoid(np.dot(self.Wo, concat) + self.bo)

            # compute the hidden state
            a[t] = o[t] * np.tanh(c[t])

            # compute the output probability vector
            y_pred[t] = self.softmax(np.dot(self.Wy, a[t]) + self.by)

        # return the output probability vectors, cell state, hidden state and gate vectors
        return X, y_pred, c, f, i, o, cc, a 


    def backward(self, X, targets, y_pred, c_prev, a_prev, c, f, i, o, cc, a):
        """
        Performs backward propagation through time for an LSTM network.

        Args:
        - X: input data for each time step, with shape (sequence_length, input_size)
        - targets: target character index for each time step (used to index y_pred[t])
        - y_pred: dictionary of predicted probabilities, one (output_size, 1) array per time step
        - c_prev: previous cell state, with shape (hidden_size, 1)
        - a_prev: previous hidden state, with shape (hidden_size, 1)
        - c: dictionary of cell states, one (hidden_size, 1) array per time step
        - f: dictionary of forget gate outputs, one (hidden_size, 1) array per time step
        - i: dictionary of input gate outputs, one (hidden_size, 1) array per time step
        - o: dictionary of output gate outputs, one (hidden_size, 1) array per time step
        - cc: dictionary of candidate cell states, one (hidden_size, 1) array per time step
        - a: dictionary of hidden states, one (hidden_size, 1) array per time step
        Returns:
            None (the gradients are stored in the self.dW* and self.db* attributes)
        """
        
        # initialize gradients for each parameter
        self.dWf, self.dWi, self.dWc, self.dWo, self.dWy = np.zeros_like(self.Wf), np.zeros_like(self.Wi), np.zeros_like(self.Wc), np.zeros_like(self.Wo), np.zeros_like(self.Wy)
        self.dbf, self.dbi, self.dbc, self.dbo, self.dby = np.zeros_like(self.bf), np.zeros_like(self.bi), np.zeros_like(self.bc), np.zeros_like(self.bo), np.zeros_like(self.by)
        dc_next = np.zeros_like(c_prev)
        da_next = np.zeros_like(a_prev)

        # iterate backwards through time steps
        for t in reversed(range(X.shape[0])):
            # compute the gradient of the output probability vector
            dy = np.copy(y_pred[t])
            dy[targets[t]] -= 1

            # compute the gradient of the output layer weights and biases
            self.dWy += np.dot(dy, a[t].T)
            self.dby += dy

            # compute the gradient of the hidden state
            da = np.dot(self.Wy.T, dy) + da_next
            dc = dc_next + (1 - np.tanh(c[t])**2) * o[t] * da
            
            # compute the gradient of the output gate
            xt = X[t, :].reshape(-1, 1)
            concat = np.vstack((a[t - 1], xt))
            do = o[t] * (1 - o[t]) * np.tanh(c[t]) * da
            self.dWo += np.dot(do, concat.T)
            self.dbo += do

            # compute the gradient of the candidate cell state
            # cc[t] is already a tanh output, so its derivative is 1 - cc[t]**2
            dcc = dc * i[t] * (1 - cc[t]**2)
            self.dWc += np.dot(dcc, concat.T)
            self.dbc += dcc

            # compute the gradient of the input gate
            di = i[t] * (1 - i[t]) * cc[t] * dc
            self.dWi += np.dot(di, concat.T)
            self.dbi += di

            # compute the gradient of the forget gate
            df = f[t] * (1 - f[t]) * c[t - 1] * dc
            self.dWf += np.dot(df, concat.T)
            self.dbf += df

            # compute the gradients passed back to the previous hidden state and cell state
            da_next = np.dot(self.Wf[:, :self.hidden_size].T, df)\
            + np.dot(self.Wi[:, :self.hidden_size].T, di)\
            + np.dot(self.Wc[:, :self.hidden_size].T, dcc)\
            + np.dot(self.Wo[:, :self.hidden_size].T, do)
            dc_next = dc * f[t]

        # clip gradients to avoid exploding gradients
        for grad in [self.dWf, self.dWi, self.dWc, self.dWo, self.dWy, self.dbf, self.dbi, self.dbc, self.dbo, self.dby]:
            np.clip(grad, -1, 1, out=grad)


    def train(self, data_generator):
        """
        Train the LSTM on a dataset using backpropagation through time.

        Args:
            data_generator: An instance of DataGenerator containing the training data.

        Returns:
            None
        """
        iter_num = 0
        # stopping criterion for training
        threshold = 46
        smooth_loss = -np.log(1.0 / data_generator.vocab_size) * self.sequence_length  # initialize loss
        while (smooth_loss > threshold):
            # initialize hidden state at the beginning of each sequence
            if data_generator.pointer == 0:
                c_prev = np.zeros((self.hidden_size, 1))
                a_prev = np.zeros((self.hidden_size, 1))

            # get a batch of inputs and targets
            inputs, targets = data_generator.next_batch()

            # forward pass
            X, y_pred, c, f, i, o, cc, a = self.forward(inputs, c_prev, a_prev)

            # backward pass
            self.backward(X, targets, y_pred, c_prev, a_prev, c, f, i, o, cc, a)

            # calculate and update loss
            loss = self.loss(y_pred, targets)
            self.adamw()
            smooth_loss = smooth_loss * 0.999 + loss * 0.001
            # update previous hidden state for the next batch
            a_prev = a[self.sequence_length - 1]
            c_prev = c[self.sequence_length - 1]
            # every 1000 iterations: decay the learning rate and print a sample with the smoothed loss
            if iter_num % 1000 == 0:
                self.learning_rate *= 0.99
                sample_idx = self.sample(c_prev, a_prev, inputs[0, :], 200)
                print(''.join(data_generator.idx_to_char[idx] for idx in sample_idx))
                print("\n\niter :%d, loss:%f" % (iter_num, smooth_loss))
            iter_num += 1

            
    def sample(self, c_prev, a_prev, seed_idx, n):
        """
        Sample a sequence of integers from the model.

        Args:
            c_prev (numpy.ndarray): Previous cell state, a numpy array of shape (hidden_size, 1).
            a_prev (numpy.ndarray): Previous hidden state, a numpy array of shape (hidden_size, 1).
            seed_idx (numpy.ndarray): One-hot vector of the seed letter taken from the first time step, shape (vocab_size,).
            n (int): Number of characters to generate.

        Returns:
            list: A list of integers representing the generated sequence.

        """
        # initialize input and seed_idx
        x = np.zeros((self.vocab_size, 1))
        # convert one-hot encoding to integer index
        seed_idx = np.argmax(seed_idx, axis=-1)

        # set the seed letter as the input for the first time step
        x[seed_idx] = 1

        # generate sequence of characters
        idxes = []
        c = np.copy(c_prev)
        a = np.copy(a_prev)
        for t in range(n):
            # compute the hidden state and cell state
            concat = np.vstack((a, x))
            i = self.sigmoid(np.dot(self.Wi, concat) + self.bi)
            f = self.sigmoid(np.dot(self.Wf, concat) + self.bf)
            cc = np.tanh(np.dot(self.Wc, concat) + self.bc)
            c = f * c + i * cc
            o = self.sigmoid(np.dot(self.Wo, concat) + self.bo)
            a = o * np.tanh(c)

            # compute the output probabilities
            y = self.softmax(np.dot(self.Wy, a) + self.by)

            # sample the next character from the output probabilities
            idx = np.random.choice(range(self.vocab_size), p=y.ravel())

            # set the input for the next time step
            x = np.zeros((self.vocab_size, 1))
            x[idx] = 1

            # append the sampled character to the sequence
            idxes.append(idx)

        # return the generated sequence
        return idxes


    def predict(self, data_generator, start, n):
        """
        Generate a sequence of n characters using the trained LSTM model, starting from the given start sequence.

        Args:
        - data_generator: an instance of DataGenerator
        - start: a string containing the start sequence
        - n: an integer indicating the length of the generated sequence

        Returns:
        - txt: a string containing the generated sequence
        """
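        # build the input vector from the start sequence
        # note: every character of `start` sets a 1 in the same vector, so a
        # multi-character start yields a multi-hot input, not a sequence of steps
        x = np.zeros((self.vocab_size, 1))
        idxes = []
        for ch in start:
            idx = data_generator.char_to_idx[ch]
            x[idx] = 1
            idxes.append(idx)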
        # initialize cell state and hidden state
        a = np.zeros((self.hidden_size, 1))
        c = np.zeros((self.hidden_size, 1))
            
        # generate new sequence of characters
        for t in range(n):
            # compute the hidden state and cell state
            concat = np.vstack((a, x))
            i = self.sigmoid(np.dot(self.Wi, concat) + self.bi)
            f = self.sigmoid(np.dot(self.Wf, concat) + self.bf)
            cc = np.tanh(np.dot(self.Wc, concat) + self.bc)
            c = f * c + i * cc
            o = self.sigmoid(np.dot(self.Wo, concat) + self.bo)
            a = o * np.tanh(c)
            # compute the output probabilities
            y_pred = self.softmax(np.dot(self.Wy, a) + self.by)
            # sample the next character from the output probabilities
            idx = np.random.choice(range(self.vocab_size), p=y_pred.ravel())
            x = np.zeros((self.vocab_size, 1))
            x[idx] = 1
            idxes.append(idx)
        
        # join the characters into a single string (newlines are kept)
        txt = ''.join(data_generator.idx_to_char[i] for i in idxes)
        return txt
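
A note on the optimizer: the `adamw` method divides the moment estimates by the constants `1 - beta1` and `1 - beta2`. In the usual formulation of Adam and AdamW, the bias correction depends on the step count `t`, i.e. `1 - beta1**t` and `1 - beta2**t`, so it fades out as training progresses. The sketch below shows that variant for a single parameter array. The function name `adamw_step` and the explicit `t` argument are assumptions made for illustration; they are not part of the classes in this notebook.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-4):
    """One AdamW update with step-dependent bias correction (t starts at 1)."""
    # exponential moving averages of the gradient and of its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * np.square(grad)
    # bias correction: strong for small t, negligible for large t
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # adaptive step plus decoupled weight decay
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v

# toy usage on a one-element parameter (values are arbitrary)
p, m, v = adamw_step(np.array([1.0]), np.array([0.5]), np.zeros(1), np.zeros(1), t=1)
print(p, m, v)
```

With this form, the caller has to keep a step counter and pass it in at every update.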

In [10]:
data_generator = DataGenerator('text.txt', sequence_length=25)

In [11]:
lstm = LSTM(hidden_size=200, vocab_size=data_generator.vocab_size, sequence_length=25, learning_rate=1e-3)
lstm.train(data_generator)

!uLdkGHPjexLWg;WHkFy?y?ErgWxZyghAwKBfP? $sYADjAHdq:dmdsPK?!XF-g-uYUHEuegCRAV$vZovUC3hNJOzaEKiDYO.Cx&z;Lq-N3ZAR;dyzzJYpyKKtwLHwPr&m!KBV .IXg?wW C&&!eNySEVHoQxFTk?NUZnX'j&CKbpt&?'FTA:RU?Z
L!Rj
?GkXtOZE 


iter :0, loss:104.359684
hl::f

phoC-:snO: wy mrume!aIsshs!
a;t Wkn wcdil menm-ashnbsg
dnC,hsVnf.e
euI  yvnnr tovue. ouced ;ocou,ei,flTtTe
IAAwInthas, se
beIEOad !tgaeuT
bs uhawen or
refautkatIs tzA kt eos
if myrs t. .eaC,
an


iter :1000, loss:86.693235
 seas fa sollbt-eev, tCarfiard
so wlisl wne h-me hadillthTh, phar pevene ty btrar feiss
ib womome thehoad cakis or
eak areng fom themncn s af aouune ryan, iheiil se boamt
e Imon aor
em:
wire pat itrhe


iter :2000, loss:72.888867
nt ranis, ynpael le'sgs, gosL
A anCon tiat
Af Iacg aad

FS althe cihe'at,
Tal woe shou;calot
?e ,ciud lis?
temen Iozze npehe peseuwhe hewhe gh't hem, aat yaw the biols he ptte the woure bith mime,

Wo


iter :3000, loss:65.185396
bIUS:
THeb, ane iy muffoul be foieins weme:
Arny wGtn boo d bse yme, st lo c are f
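
The first logged value (about 104.36 at iteration 0) is essentially the starting value of `smooth_loss`, which `train` sets to the loss of a model that assigns equal probability to every character. The sketch below reproduces that number. The vocabulary size of 65 is an assumption inferred from the logged loss, not something stated in the notebook, and `uniform_baseline_loss` is a hypothetical helper name.

```python
import math

def uniform_baseline_loss(vocab_size, sequence_length):
    """Cross-entropy of a uniform prediction, summed over one sequence."""
    return sequence_length * math.log(vocab_size)

# assumed vocabulary of 65 characters, sequence_length=25 as in the LSTM run
print(round(uniform_baseline_loss(65, 25), 4))  # 104.3597
```

Any smoothed loss well below this value means the model is doing better than guessing characters uniformly.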

In [12]:
lstm.predict(data_generator, "c", 150)

'clardtras,\nI lave vave me lotwer parpais,\nI to bles as omd yout of lest Conole! \nImen sibthed, dear hibry, wer you:\nBy you grows fate comly be doid!\n\nD'

As with the RNN, we track each model's output for a fixed prompt. From here on, the prompt is the same for every model: before, we were limited by the vocabulary instantiated in the dataset, but now we can use the actual wording.

<img src="meme.jpg" width="300" alt="look for it's been 84 years on google images">

In [13]:
# Generate output for LSTM
lstm_output = lstm.predict(data_generator, "It's been eighty-four", 150)

# Add LSTM results to the dataframe
new_row = pd.DataFrame({
    'Model': ['Vanilla LSTM'],
    'Input': ["It's been eighty-four"],
    'Output': [lstm_output]
})
df = pd.concat([df, new_row], ignore_index=True)

display(df)

| | Model | Input | Output |
|---|---|---|---|
| 0 | RNN | itsbeen | itsbeenachenosaurus |
| 1 | Vanilla LSTM | It's been eighty-four | It's been eighty-foursen frow't wis calestnow:... |


Note that we fed "eighty-four" instead of "84": numbers are not included in the vocabulary. Since this prompt is clearly different from the one given to the RNN, the output is expected to differ as well and to show hallucinations.
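
Since `predict` looks up every character of the prompt in `char_to_idx`, a character missing from the vocabulary would raise a `KeyError`. A small check like the one below can flag such characters before calling `predict`. This is only a sketch: `out_of_vocab` is a hypothetical helper and `toy_vocab` is a made-up vocabulary standing in for `data_generator.char_to_idx`.

```python
def out_of_vocab(prompt, char_to_idx):
    """Return the characters of `prompt` missing from `char_to_idx`, in order of first appearance."""
    missing = []
    for ch in prompt:
        if ch not in char_to_idx and ch not in missing:
            missing.append(ch)
    return missing

# toy vocabulary (hypothetical): lowercase letters, space, apostrophe and hyphen
toy_vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz '-")}

print(out_of_vocab("it's been 84 years", toy_vocab))     # ['8', '4']
print(out_of_vocab("it's been eighty-four", toy_vocab))  # []
```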

### GRU

These cells define and implement the Gated Recurrent Unit (GRU) class, another variant of RNN.
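
For reference, the forward pass implemented by the `GRU` class below can be written as follows, where $[u; v]$ denotes vertical stacking and $\odot$ the element-wise product. This is a transcription of the code rather than an independent derivation; the symbols mirror the attribute names (`Wr`, `Wz`, `Wa`, `Wy`).

```latex
\begin{aligned}
r_t &= \sigma\left(W_r\,[a_{t-1};\,x_t] + b_r\right) \\
z_t &= \sigma\left(W_z\,[a_{t-1};\,x_t] + b_z\right) \\
\tilde{c}_t &= \tanh\left(W_a\,[r_t \odot a_{t-1};\,x_t] + b_a\right) \\
c_t &= z_t \odot \tilde{c}_t + (1 - z_t) \odot c_{t-1} \\
a_t &= c_t \\
\hat{y}_t &= \operatorname{softmax}\left(W_y\,a_t + b_y\right)
\end{aligned}
```

Here $z_t$ weights the candidate state. As far as we can tell, some references (the d2l.ai book among them) attach $z_t$ to the previous state instead; the two conventions are equivalent up to relabeling $z_t$ as $1 - z_t$.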

In [14]:
class GRU:
    """
    A class used to represent a Gated Recurrent Unit (GRU) network.

    Attributes
    ----------
    hidden_size : int
        The number of hidden units in the GRU.
    vocab_size : int
        The size of the vocabulary used by the GRU.
    sequence_length : int
        The length of the input sequences fed to the GRU.
    learning_rate : float
        The learning rate used during training.

    Methods
    -------
    __init__(hidden_size, vocab_size, sequence_length, learning_rate)
        Initializes an instance of the GRU class.
    """

    def __init__(self, hidden_size, vocab_size, sequence_length, learning_rate):
        """
        Initializes an instance of the GRU class.

        Parameters
        ----------
        hidden_size : int
            The number of hidden units in the GRU.
        vocab_size : int
            The size of the vocabulary used by the GRU.
        sequence_length : int
            The length of the input sequences fed to the GRU.
        learning_rate : float
            The learning rate used during training.
        """
        # hyper parameters
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.sequence_length = sequence_length
        self.learning_rate = learning_rate

        # model parameters
        self.Wz = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (hidden_size, hidden_size + vocab_size))
        self.bz = np.zeros((hidden_size, 1))

        self.Wr = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (hidden_size, hidden_size + vocab_size))
        self.br = np.zeros((hidden_size, 1))

        self.Wa = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (hidden_size, hidden_size + vocab_size))
        self.ba = np.zeros((hidden_size, 1))

        self.Wy = np.random.uniform(-np.sqrt(1. / hidden_size), np.sqrt(1. / hidden_size),
                                    (vocab_size, hidden_size))
        self.by = np.zeros((vocab_size, 1))

        # initialize gradients for each parameter
        self.dWz, self.dWr, self.dWa, self.dWy = (np.zeros_like(self.Wz), np.zeros_like(self.Wr),
                                                  np.zeros_like(self.Wa), np.zeros_like(self.Wy))
        self.dbz, self.dbr, self.dba, self.dby = (np.zeros_like(self.bz), np.zeros_like(self.br),
                                                  np.zeros_like(self.ba), np.zeros_like(self.by))

        # initialize parameters for adamw optimizer
        self.mWz = np.zeros_like(self.Wz)
        self.vWz = np.zeros_like(self.Wz)
        self.mWr = np.zeros_like(self.Wr)
        self.vWr = np.zeros_like(self.Wr)
        self.mWa = np.zeros_like(self.Wa)
        self.vWa = np.zeros_like(self.Wa)
        self.mWy = np.zeros_like(self.Wy)
        self.vWy = np.zeros_like(self.Wy)
        self.mbz = np.zeros_like(self.bz)
        self.vbz = np.zeros_like(self.bz)
        self.mbr = np.zeros_like(self.br)
        self.vbr = np.zeros_like(self.br)
        self.mba = np.zeros_like(self.ba)
        self.vba = np.zeros_like(self.ba)
        self.mby = np.zeros_like(self.by)
        self.vby = np.zeros_like(self.by)

    def sigmoid(self, x):
        """
        Computes the sigmoid activation function for a given input array.

        Parameters:
            x (ndarray): Input array.

        Returns:
            ndarray: Array of the same shape as `x`, containing the sigmoid activation values.
        """
        return 1 / (1 + np.exp(-x))

    def softmax(self, x):
        """
        Computes the softmax activation function for a given input array.

        Parameters:
            x (ndarray): Input array.

        Returns:
            ndarray: Array of the same shape as `x`, containing the softmax activation values.
        """
        # shift the input to prevent overflow when computing the exponentials
        x = x - np.max(x)
        # compute the exponentials of the shifted input
        p = np.exp(x)
        # normalize the exponentials by dividing by their sum
        return p / np.sum(p)

    def forward(self, X, c_prev, a_prev):
        """
        Performs forward propagation for a simple GRU model.

        Args:
            X (numpy array): Input sequence, shape (sequence_length, input_size)
            c_prev (numpy array): Previous cell state, shape (hidden_size, 1)
            a_prev (numpy array): Previous hidden state, shape (hidden_size, 1)

        Returns (in this order):
            X (numpy array): Input sequence, shape (sequence_length, input_size)
            r (dictionary): Reset gate for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            z (dictionary): Update gate for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            c (dictionary): Cell state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            cc (dictionary): Candidate cell state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            a (dictionary): Hidden state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            y_pred (dictionary): Output probability vector for each time step, keys = time step, values = numpy array shape (output_size, 1)
        """

        # initialize dictionaries for backpropagation
        r, z, c, cc, a, y_pred = {}, {}, {}, {}, {}, {}
        c[-1] = np.copy(c_prev)  # store the initial cell state in the dictionary
        a[-1] = np.copy(a_prev)  # store the initial hidden state in the dictionary

        # iterate over each time step in the input sequence
        for t in range(X.shape[0]):
            # concatenate the input and hidden state
            xt = X[t, :].reshape(-1, 1)
            concat = np.vstack((a[t - 1], xt))

            # compute the reset gate
            r[t] = self.sigmoid(np.dot(self.Wr, concat) + self.br)

            # compute the update gate
            z[t] = self.sigmoid(np.dot(self.Wz, concat) + self.bz)

            # compute the candidate cell state
            cc[t] = np.tanh(np.dot(self.Wa, np.vstack((r[t] * a[t - 1], xt))) + self.ba)

            # compute the cell state
            c[t] = z[t] * cc[t] + (1 - z[t]) * c[t - 1]

            # compute the hidden state
            a[t] = c[t]

            # compute the output probability vector
            y_pred[t] = self.softmax(np.dot(self.Wy, a[t]) + self.by)

        # return the output probability vectors, cell state, hidden state and gate vectors
        return X, r, z, c, cc, a, y_pred

    def backward(self, X, a_prev, c_prev, r, z, c, cc, a, y_pred, targets):
        """
        Performs backward propagation through time for a GRU network.

        Args:
            X (numpy array): Input sequence, shape (sequence_length, input_size)
            a_prev (numpy array): Previous hidden state, shape (hidden_size, 1)
            c_prev (numpy array): Previous cell state, shape (hidden_size, 1)
            r (dictionary): Reset gate for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            z (dictionary): Update gate for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            c (dictionary): Cell state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            cc (dictionary): Candidate cell state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            a (dictionary): Hidden state for each time step, keys = time step, values = numpy array shape (hidden_size, 1)
            y_pred (dictionary): Output probability vector for each time step, keys = time step, values = numpy array shape (output_size, 1)
            targets (numpy array): Target outputs for each time step, shape (sequence_length, output_size)

        Returns:
            None       
        """
        # reset the accumulated gradients so they do not carry over between batches
        self.dWz, self.dWr, self.dWa, self.dWy = (np.zeros_like(self.Wz), np.zeros_like(self.Wr),
                                                  np.zeros_like(self.Wa), np.zeros_like(self.Wy))
        self.dbz, self.dbr, self.dba, self.dby = (np.zeros_like(self.bz), np.zeros_like(self.br),
                                                  np.zeros_like(self.ba), np.zeros_like(self.by))

        # initialize the gradients flowing back from the following time step
        dc_next = np.zeros_like(c_prev)
        da_next = np.zeros_like(a_prev)
        H = self.hidden_size

        # iterate backwards through time steps
        for t in reversed(range(X.shape[0])):
            # gradient of the loss w.r.t. the output logits (softmax + cross-entropy)
            dy = np.copy(y_pred[t])
            dy[targets[t]] -= 1

            # compute the gradient of the output layer weights and biases
            self.dWy += np.dot(dy, a[t].T)
            self.dby += dy

            # total gradient reaching the cell state (a[t] = c[t] in the forward pass)
            dc = np.dot(self.Wy.T, dy) + da_next + dc_next

            # rebuild the inputs used by the forward pass at this time step
            xt = X[t, :].reshape(-1, 1)
            concat = np.vstack((a[t - 1], xt))
            concat_r = np.vstack((r[t] * a[t - 1], xt))

            # compute the gradient of the update gate (pre-activation)
            dz = dc * (cc[t] - c[t - 1]) * z[t] * (1 - z[t])
            self.dWz += np.dot(dz, concat.T)
            self.dbz += dz

            # compute the gradient of the candidate cell state (pre-activation)
            dcc = dc * z[t] * (1 - cc[t] ** 2)
            self.dWa += np.dot(dcc, concat_r.T)
            self.dba += dcc

            # compute the gradient of the reset gate (pre-activation)
            dra = np.dot(self.Wa[:, :H].T, dcc)  # gradient w.r.t. r[t] * a[t - 1]
            dr = dra * a[t - 1] * r[t] * (1 - r[t])
            self.dWr += np.dot(dr, concat.T)
            self.dbr += dr

            # compute the gradients passed back to the previous hidden state and cell state
            da_next = dra * r[t] + np.dot(self.Wz[:, :H].T, dz) + np.dot(self.Wr[:, :H].T, dr)
            dc_next = dc * (1 - z[t])

        # clip gradients in place to avoid exploding gradients
        for grad in [self.dWz, self.dWr, self.dWa, self.dWy, self.dbz, self.dbr, self.dba, self.dby]:
            np.clip(grad, -1, 1, out=grad)

    def loss(self, y_preds, targets):
        """
        Computes the cross-entropy loss for a given sequence of predicted probabilities and true targets.

        Parameters:
            y_preds (ndarray): Array of shape (sequence_length, vocab_size) containing the predicted probabilities for each time step.
            targets (ndarray): Array of shape (sequence_length, 1) containing the true targets for each time step.

        Returns:
            float: Cross-entropy loss.
        """
        # calculate cross-entropy loss
        return sum(-np.log(y_preds[t][targets[t], 0]) for t in range(self.sequence_length))

    def adamw(self, beta1=0.9, beta2=0.999, epsilon=1e-8, L2_reg=1e-4):
        """
        Updates the GRU's parameters using the AdamW optimization algorithm.
        """
        
        # AdamW update for Wz
        self.mWz = beta1 * self.mWz + (1 - beta1) * self.dWz
        self.vWz = beta2 * self.vWz + (1 - beta2) * np.square(self.dWz)
        m_hat = self.mWz / (1 - beta1)
        v_hat = self.vWz / (1 - beta2)
        self.Wz -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wz)

        # AdamW update for bz
        self.mbz = beta1 * self.mbz + (1 - beta1) * self.dbz
        self.vbz = beta2 * self.vbz + (1 - beta2) * np.square(self.dbz)
        m_hat = self.mbz / (1 - beta1)
        v_hat = self.vbz / (1 - beta2)
        self.bz -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.bz)

        # AdamW update for Wr
        self.mWr = beta1 * self.mWr + (1 - beta1) * self.dWr
        self.vWr = beta2 * self.vWr + (1 - beta2) * np.square(self.dWr)
        m_hat = self.mWr / (1 - beta1)
        v_hat = self.vWr / (1 - beta2)
        self.Wr -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wr)

        # AdamW update for br
        self.mbr = beta1 * self.mbr + (1 - beta1) * self.dbr
        self.vbr = beta2 * self.vbr + (1 - beta2) * np.square(self.dbr)
        m_hat = self.mbr / (1 - beta1)
        v_hat = self.vbr / (1 - beta2)
        self.br -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.br)

        # AdamW update for Wa
        self.mWa = beta1 * self.mWa + (1 - beta1) * self.dWa
        self.vWa = beta2 * self.vWa + (1 - beta2) * np.square(self.dWa)
        m_hat = self.mWa / (1 - beta1)
        v_hat = self.vWa / (1 - beta2)
        self.Wa -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wa)

        # AdamW update for ba
        self.mba = beta1 * self.mba + (1 - beta1) * self.dba
        self.vba = beta2 * self.vba + (1 - beta2) * np.square(self.dba)
        m_hat = self.mba / (1 - beta1)
        v_hat = self.vba / (1 - beta2)
        self.ba -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.ba)

        # AdamW update for Wy
        self.mWy = beta1 * self.mWy + (1 - beta1) * self.dWy
        self.vWy = beta2 * self.vWy + (1 - beta2) * np.square(self.dWy)
        m_hat = self.mWy / (1 - beta1)
        v_hat = self.vWy / (1 - beta2)
        self.Wy -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.Wy)

        # AdamW update for by
        self.mby = beta1 * self.mby + (1 - beta1) * self.dby
        self.vby = beta2 * self.vby + (1 - beta2) * np.square(self.dby)
        m_hat = self.mby / (1 - beta1)
        v_hat = self.vby / (1 - beta2)
        self.by -= self.learning_rate * (m_hat / (np.sqrt(v_hat) + epsilon) + L2_reg * self.by)

    def train(self, data_generator,iterations):
        """
        Train the GRU on a dataset using backpropagation through time.

        Args:
            data_generator: An instance of DataGenerator containing the training data.
            iterations (int): Number of training iterations to run.

        Returns:
            None
        """
        iter_num = 0
    
        smooth_loss = -np.log(1.0 / data_generator.vocab_size) * self.sequence_length  # initialize loss
        while (iter_num < iterations):
            # initialize hidden state at the beginning of each sequence
            if data_generator.pointer == 0:
                c_prev = np.zeros((self.hidden_size, 1))
                a_prev = np.zeros((self.hidden_size, 1))

            # get a batch of inputs and targets
            inputs, targets = data_generator.next_batch()

            # forward pass
            X, r, z, c, cc, a, y_pred = self.forward(inputs, c_prev, a_prev)

            # backward pass
            self.backward(X, a_prev, c_prev, r, z, c, cc, a, y_pred, targets)

            # calculate and update loss
            loss = self.loss(y_pred, targets)
            self.adamw()
            smooth_loss = smooth_loss * 0.999 + loss * 0.001
            # update previous hidden state for the next batch
            a_prev = a[self.sequence_length - 1]
            c_prev = c[self.sequence_length - 1]
#             if iter_num == 5900 or iter_num == 30000:
#                         self.learning_rate *= 0.1
            # print progress every 100 iterations
            if iter_num % 100 == 0:
#                 self.learning_rate *= 0.99
                sample_idx = self.sample(c_prev, a_prev, inputs[0, :], 200)
                print(''.join(data_generator.idx_to_char[idx] for idx in sample_idx))
                print("\n\niter :%d, loss:%f" % (iter_num, smooth_loss))
            iter_num += 1

    def sample(self, c_prev, a_prev, seed_idx, n):
        """
        Sample a sequence of integers from the model.

        Args:
            c_prev (numpy.ndarray): Previous cell state, a numpy array of shape (hidden_size, 1).
            a_prev (numpy.ndarray): Previous hidden state, a numpy array of shape (hidden_size, 1).
            seed_idx (numpy.ndarray): One-hot vector of the seed letter taken from the first time step, shape (vocab_size,).
            n (int): Number of characters to generate.

        Returns:
            list: A list of integers representing the generated sequence.

        """
        # initialize input and seed_idx
        x = np.zeros((self.vocab_size, 1))
        # convert one-hot encoding to integer index
        seed_idx = np.argmax(seed_idx, axis=-1)

        # set the seed letter as the input for the first time step
        x[seed_idx] = 1

        # generate sequence of characters
        idxes = []
        c = np.copy(c_prev)
        a = np.copy(a_prev)
        for t in range(n):
            # compute the hidden state and cell state
            concat = np.vstack((a, x))
            z = self.sigmoid(np.dot(self.Wz, concat) + self.bz)
            r = self.sigmoid(np.dot(self.Wr, concat) + self.br)
            cc = np.tanh(np.dot(self.Wa, np.vstack((r * a, x))) + self.ba)
            # same update rule as in forward(): z weights the candidate state
            c = z * cc + (1 - z) * c
            a = c
            # compute the output probabilities
            y = self.softmax(np.dot(self.Wy, a) + self.by)

            # sample the next character from the output probabilities
            idx = np.random.choice(range(self.vocab_size), p=y.ravel())

            # set the input for the next time step
            x = np.zeros((self.vocab_size, 1))
            x[idx] = 1

            # append the sampled character to the sequence
            idxes.append(idx)

        # return the generated sequence
        return idxes

    def predict(self, data_generator, start, n):
        """
        Generate a sequence of n characters using the trained GRU model, starting from the given start sequence.

        Args:
        - data_generator: an instance of DataGenerator
        - start: a string containing the start sequence
        - n: an integer indicating the length of the generated sequence

        Returns:
        - txt: a string containing the generated sequence
        """
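        # build the input vector from the start sequence
        # note: every character of `start` sets a 1 in the same vector, so a
        # multi-character start yields a multi-hot input, not a sequence of steps
        x = np.zeros((self.vocab_size, 1))
        idxes = []
        for ch in start:
            idx = data_generator.char_to_idx[ch]
            x[idx] = 1
            idxes.append(idx)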
        # initialize cell state and hidden state
        a = np.zeros((self.hidden_size, 1))
        c = np.zeros((self.hidden_size, 1))

        # generate new sequence of characters
        for t in range(n):
            # compute the hidden state and cell state
            concat = np.vstack((a, x))

            # compute the reset gate
            r = self.sigmoid(np.dot(self.Wr, concat) + self.br)

            # compute the update gate
            z = self.sigmoid(np.dot(self.Wz, concat) + self.bz)

            # compute the candidate cell state
            cc = np.tanh(np.dot(self.Wa, np.vstack((r * a, x))) + self.ba)

            # compute the cell state
            c = z * cc + (1 - z) * c

            # compute the hidden state
            a = c

            # compute the output probability vector
            y_pred = self.softmax(np.dot(self.Wy, a) + self.by)
            # sample the next character from the output probabilities
            idx = np.random.choice(range(self.vocab_size), p=y_pred.ravel())
            x = np.zeros((self.vocab_size, 1))
            x[idx] = 1
            idxes.append(idx)
        txt = ''.join(data_generator.idx_to_char[i] for i in idxes)
        return txt

In [15]:
sequence_length = 24
# read text from the "text.txt" file
data_generator = DataGenerator('text.txt', sequence_length=sequence_length)
gru = GRU(hidden_size=100, vocab_size=data_generator.vocab_size, sequence_length=sequence_length, learning_rate=0.005)


gru.train(data_generator,iterations=6000)

F&itLs$&3frB
W;xUhka'QoXOgtND'sqd&qkqVd$Kgq..UJrm$U,aaLt:n:b kHUCF:lHddjg'dchb3ykagaahMHBufPzVEWq-fGytX:oW'iCCauaTD?AyXlun'TdLN!3Wj$eqHB'u
F:GYHAQDo?SPiIp U'U;BE.D:GVzKb!-lbUuldEURXL.UH!rClq-zbL ibtzC


iter :0, loss:100.185091
ros Joeny 
ou tiiedh-Qcotih h aHXi
l
Xthinthhaknio en tg hiethgl. a?N iestthFFrrlae
utheezk
TisdsoouthnhGnxw
l


SqrY h tiXsouerlnndth sofasakaoutoshyom'i thdt hiinsuABu
ziKstistOioMfuiO&dtho aatChizh


iter :100, loss:98.310289
 aens
 mhe nt atens ar fste nmto h ataeith euds omuite yre rcon wwo o arre alyums t  mouspyo em o ay wt os st afngs l amatyin tinrels to
uo nodG eth e eae
rnto e ator st t atiuatMas,t inrgeeso uuise t


iter :200, loss:95.633728
amerre  fhared3 ayot he wthe me allll
et bmong ate ate  pol b our

Wnse,t os, g ot eres afgl:l
n ther wnse d udt awit m ellld:y, foduto my,e bllvl ve aitaec ahaithed e mynoi
n

Fleld fmr av
A  mirt ha


iter :300, loss:92.893081
mn, akouop oufr dev istrer abmo bndy,ou, ate oue cathe, pomou
l

FYrs, be pyofus,f
Oe

In [16]:
gru.predict(data_generator, "c", 150)

'car your shay maranke nmtur saman-se\nThyu?\n\nOROLANUS:\nYub nar meest sordis. de nbur this bler mse aby upars har um mes pspowe warth ua nmene,\nAfeccushs'

In [17]:
# Generate output for GRU
gru_output = gru.predict(data_generator, "It's been eighty-four", 150)

# Add GRU results to the dataframe
new_row = pd.DataFrame({
    'Model': ['Vanilla GRU'],
    'Input': ["It's been eighty-four"],
    'Output': [gru_output]
})
df = pd.concat([df, new_row], ignore_index=True)

display(df)

Unnamed: 0,Model,Input,Output
0,RNN,itsbeen,itsbeenachenosaurus
1,Vanilla LSTM,It's been eighty-four,It's been eighty-foursen frow't wis calestnow:...
2,Vanilla GRU,It's been eighty-four,It's been eighty-fourth. pure borncatl bu gray...


The Vanilla GRU completed the input "It's been eighty-four" as "It's been eighty-fourth.", turning the last word of the prompt into an actual English word (although the sentence as a whole is still meaningless).
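
For reference, the single GRU step computed inside `predict` above can be written compactly as follows. The symbols mirror the code (`Wz`, `Wr`, `Wa`, `Wy` and their biases); $[\,\cdot\,;\,\cdot\,]$ denotes vertical stacking and $\odot$ the element-wise product. This is a transcription of that method, not a general definition of the GRU.

$$
\begin{aligned}
r_t &= \sigma\left(W_r\,[a_{t-1};\,x_t] + b_r\right) \\
z_t &= \sigma\left(W_z\,[a_{t-1};\,x_t] + b_z\right) \\
\tilde{c}_t &= \tanh\left(W_a\,[r_t \odot a_{t-1};\,x_t] + b_a\right) \\
c_t &= z_t \odot \tilde{c}_t + (1 - z_t) \odot c_{t-1} \\
a_t &= c_t \\
\hat{y}_t &= \operatorname{softmax}\left(W_y\,a_t + b_y\right)
\end{aligned}
$$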

## Custom Trial (Training on Titanic Script)


In this section, we train the model on the script of the movie "Titanic", hoping to generate text that mimics the style and content of the famous film.

The script was fetched from https://github.com/pratyakshs/Movie-script-parser/blob/master/t.txt.

First, let's keep the same configuration as before and only change the training data.

### Tuned LSTM

In [18]:
data_generator = DataGenerator('titanic.txt', sequence_length=sequence_length)
lstm = LSTM(hidden_size=200, vocab_size=data_generator.vocab_size,sequence_length=sequence_length,learning_rate=1e-3)
lstm.train(data_generator)

I(dF)xoGM(Is3cbe#zcEWh7l1
jfw8f:vL)H13C)c
(G0N83L'El('Q:U-zR)QzJ83VLCLVk5/GQ;fS/n.GC;o"aOG-el)CX.
gRuu.4dCZarI0BEik9KrzH!Gk(J"B-oy92x?pzXJTZpY;G,L;B2cw//6mZED 8ZefQ-/t
5D7.vdS#R'PR5M7Ud9s!7aCQ'J6"ock'


iter :0, loss:104.251186
Tatmahvlh s onD 2 gs     .o oEoTUssiem  C  wv uesegav  oA feW  k sk wdut.,ma tehtSdbH .d   nlLs
iaeg  v oEi ys d  l  s no   H La d i. eicLt  ir hua tg
Ea amo Y  a  ,naLrhue.iphyo
 aaaSsiy b
(Tma ib uh


iter :1000, loss:85.105238
nrog e
oe srpsPnnkgot sCO loin  ye  f nYcTFed l,te e OidelpUB sya    ofsli 
..hag dcle   eletmvfel iisnn 
wcseegahKl wmiitnrotrRoldarEAiRsaoes!sisTeh
  asognkoe.lt,a.et sil
ihismil.afns lost " oDnilht


iter :2000, loss:76.541247
. sh   
oh    a   T     ,    cY  C( u RlBit t ofyhe mancg scgoI  u ind nswiI  ,)o
 gvn Gv  edhi N  th ooaEoeaocwheorg  a rchdeacee KufteiounehT h5skoe 
ea dllasd  cg J iben deka o o
 s eveew nhC e.hed


iter :3000, loss:69.687841
k s iRhyinradmlvthuhyendansthrw
nrelmggtnpadut taregte ag nteO Heode hee oh itompc

In [19]:
lstm.predict(data_generator, "It's been eighty-four", 150)  

"It's been eighty-fourcw.\n\n                           S ORMES\n\nON ELD 3 CO'\nN I NADGHU WA AROS EhC oW lesiet.\n\n                         a(Tecild tite wteryeg? ROSMINF RABTE"

In [20]:
# Generate output for LSTM
lstm_output = lstm.predict(data_generator, "It's been eighty-four", 150)

# Add LSTM results to the dataframe
new_row = pd.DataFrame({
    'Model': ['Tuned LSTM'],
    'Input': ["It's been eighty-four"],
    'Output': [lstm_output]
})
df = pd.concat([df, new_row], ignore_index=True)

display(df)

Unnamed: 0,Model,Input,Output
0,RNN,itsbeen,itsbeenachenosaurus
1,Vanilla LSTM,It's been eighty-four,It's been eighty-foursen frow't wis calestnow:...
2,Vanilla GRU,It's been eighty-four,It's been eighty-fourth. pure borncatl bu gray...
3,Tuned LSTM,It's been eighty-four,It's been eighty-four thod es Mo dofe tot.\n\n...


### Hyper-Tuned LSTM

#### Personal Comments

The outputs start out as "gibberish": the model generates text based on what it has learned so far from the training data. As the LSTM trains, it periodically prints a sample to show its progress. Early samples are random characters strung together. As training progresses, word-like structures appear, often nonsensical or misspelled. Towards the end, we start to see recognizable words and some grammatical structure, though the text as a whole still may not make much sense.

The predict function generates new text from a given starting sequence. Here the model tries to continue the phrase "It's been 84" based on patterns learned from the Titanic script (inspired by the meme "It's been 84 years...").

The output looks like gibberish because the model predicts one character at a time, and it needs substantial training before those characters add up to coherent text. Even then, it won't produce perfect sentences, as it is merely mimicking the statistical patterns of characters in the training data. This is normal and expected for character-level language models, especially with relatively small datasets and limited training time. With more data, longer training, and some tweaks to the model architecture, the output can become more coherent and human-like.

That's why we need a more sophisticated method, with hyperparameters to tune, to generate more coherent text. (A word-based approach should perform significantly better, but that is outside the scope of this activity.)
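
To make the "statistical patterns of characters" point concrete, here is a minimal sketch of a character bigram model, which is far simpler than the LSTM in this notebook. It only counts which character follows which and then samples from those counts. The toy `corpus` string and the function names `fit_bigrams` and `generate` are made up for illustration and are not part of the notebook's code.

```python
import random
from collections import Counter, defaultdict


def fit_bigrams(text):
    """Count, for every character, how often each character follows it."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, n, seed=0):
    """Sample up to n characters, each conditioned only on the previous one.

    `start` is assumed to be a non-empty string.
    """
    rng = random.Random(seed)
    out = list(start)
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:  # dead end: this character was never followed by anything
            break
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)


# Hypothetical toy corpus, only for illustration
corpus = "it's been eighty-four years and i can still smell the fresh paint"
bigrams = fit_bigrams(corpus)
sample = generate(bigrams, "i", 40)
print(sample)
```

Every pair of neighbouring characters in the sample occurs somewhere in the corpus, yet the result is rarely a real word. The recurrent models here condition on a much longer history, which is why their samples gradually look more word-like.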


In [1]:
""" # Hyperparameter Tuning with Optuna

import optuna
from tqdm import tqdm

def objective(trial):
    # Define hyperparameters to tune
    hidden_size = trial.suggest_int('hidden_size', 50, 300)
    learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-2)
    sequence_length = trial.suggest_int('sequence_length', 10, 100)
    
    # Create data generator and model
    data_generator = DataGenerator('titanic.txt', sequence_length=sequence_length)
    lstm = LSTM(hidden_size=hidden_size, vocab_size=data_generator.vocab_size,
                sequence_length=sequence_length, learning_rate=learning_rate)
    
    # Train the model
    losses = []
    for _ in tqdm(range(1000), desc="Training"):  # Adjust number of iterations as needed
        loss = lstm.train(data_generator)
        losses.append(loss)
    
    # Evaluate the model
    test_loss = np.mean(losses[-100:])  # Use average of last 100 losses
    return test_loss

# Run Optuna study
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)  # Adjust number of trials as needed

# Get best hyperparameters
best_params = study.best_params
print("Best hyperparameters:", best_params)

# Train final model with best hyperparameters
data_generator = DataGenerator('titanic.txt', sequence_length=best_params['sequence_length'])
lstm = LSTM(hidden_size=best_params['hidden_size'], vocab_size=data_generator.vocab_size,
            sequence_length=best_params['sequence_length'], learning_rate=best_params['learning_rate'])

# Train the model with progress bar
lstm.train(data_generator)

# Generate text
generated_text = lstm.predict(data_generator, "It's been 84", 150)
print("Generated text:")
print(generated_text)
 """
display()

We're utilizing GPU acceleration to train our LSTM model faster. By leveraging CUDA-enabled GPUs through PyTorch, we can perform parallel computations on large matrices, significantly speeding up both forward passes and backpropagation. This approach allows us to train deeper networks and process larger datasets in less time compared to CPU-only implementations.

P.S.: I tried doing this without a GPU (with the previous configuration, shown in the commented-out code above), but after 12 hours of training only about 16 thousand iterations had been completed. That's why I moved on to this approach, which took roughly 12 minutes.
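
As a rough back-of-the-envelope check, these figures can be turned into iterations per second. The CPU numbers are the approximate ones quoted in the note above; the GPU numbers are read from the tqdm log of the final run further down (100,000 iterations in about 9 min 57 s). The two runs differ in implementation and hyperparameters, so this is only an order-of-magnitude estimate, not a benchmark.

```python
# Approximate CPU figures quoted in the note above
cpu_iterations = 16_000
cpu_seconds = 12 * 3600

# GPU figures read from the tqdm log of the final 100,000-iteration run (09:57)
gpu_iterations = 100_000
gpu_seconds = 9 * 60 + 57

cpu_rate = cpu_iterations / cpu_seconds
gpu_rate = gpu_iterations / gpu_seconds
speedup = gpu_rate / cpu_rate

print(f"CPU: {cpu_rate:.2f} it/s")
print(f"GPU: {gpu_rate:.2f} it/s")
print(f"Speed-up: about {speedup:.0f}x")
```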


In [22]:
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

class LSTM(nn.Module):
    def __init__(self, hidden_size, vocab_size, sequence_length, learning_rate):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.sequence_length = sequence_length
        self.learning_rate = learning_rate

        self.lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.to(self.device)

        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

    def forward(self, x, hidden):
        out, hidden = self.lstm(x, hidden)
        out = self.fc(out)
        return out, hidden

    def train_model(self, data_generator, num_iterations=1000):
        self.train()  # Ensure the model is in training mode
        hidden = None
        losses = []
        pbar = tqdm(total=num_iterations, desc="Training", ncols=100)
        for i in range(num_iterations):
            try:
                inputs, targets = data_generator.next_batch()
                inputs = inputs.unsqueeze(0)  # Add batch dimension

                self.optimizer.zero_grad()
                outputs, hidden = self(inputs, hidden)
                loss = self.criterion(outputs.squeeze(0), targets)
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1)  # Gradient clipping
                self.optimizer.step()

                hidden = tuple(h.detach() for h in hidden)
                losses.append(loss.item())

                pbar.update(1)
                pbar.set_postfix({'loss': f'{loss.item():.4f}'})

                if (i + 1) % 10000 == 0:
                    print(f"\nIteration {i+1}/{num_iterations}")
                    sample = self.predict(data_generator, "It's been 84", 150)
                    print(sample)
                    print(f"Loss: {loss.item():.4f}")
            except RuntimeError as e:
                print(f"Error occurred: {e}")
                print("Skipping this iteration and continuing...")
                continue

        pbar.close()
        return losses

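    def predict(self, data_generator, start, n):
        was_training = self.training
        self.eval()
        with torch.no_grad():
            hidden = None
            output = None
            idxes = [data_generator.char_to_idx[ch] for ch in start]

            # Warm up the hidden state with the start sequence, one one-hot character per step
            x = torch.zeros(1, 1, self.vocab_size, device=self.device)
            for idx in idxes:
                x = torch.zeros(1, 1, self.vocab_size, device=self.device)
                x[0, 0, idx] = 1
                output, hidden = self(x, hidden)
            if output is None:
                # Empty start sequence: fall back to a single all-zero input
                output, hidden = self(x, hidden)

            for _ in range(n):
                # Sample from the softmax of the logits with temperature 0.8
                prob = torch.softmax(output.squeeze() / 0.8, dim=-1)
                idx = torch.multinomial(prob, 1).item()
                idxes.append(idx)

                # Feed the sampled character back in as the next input
                x = torch.zeros(1, 1, self.vocab_size, device=self.device)
                x[0, 0, idx] = 1
                output, hidden = self(x, hidden)

        if was_training:
            self.train()
        return ''.join(data_generator.idx_to_char[idx] for idx in idxes)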

In [23]:
class DataGenerator:
    def __init__(self, path: str, sequence_length: int):
        with open(path, 'r', encoding='utf-8') as f:
            self.data = f.read()

        chars = sorted(list(set(self.data)))
        self.char_to_idx = {ch: i for i, ch in enumerate(chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(chars)}

        self.data_size = len(self.data)
        self.vocab_size = len(chars)
        self.sequence_length = sequence_length
        self.pointer = 0

        # Use GPU if available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def next_batch(self):
        input_start = self.pointer
        input_end = self.pointer + self.sequence_length

        inputs = [self.char_to_idx[ch] for ch in self.data[input_start:input_end]]
        targets = [self.char_to_idx[ch] for ch in self.data[input_start + 1:input_end + 1]]

        # Convert to one-hot encoding
        inputs_one_hot = torch.zeros(self.sequence_length, self.vocab_size, device=self.device)
        inputs_one_hot[torch.arange(self.sequence_length), torch.tensor(inputs)] = 1

        # Convert targets to tensor
        targets = torch.tensor(targets, dtype=torch.long, device=self.device)

        # Update pointer
        self.pointer += self.sequence_length
        if self.pointer + self.sequence_length + 1 >= self.data_size:
            self.pointer = 0

        return inputs_one_hot, targets

    def random_batch(self):
        # Despite the name, this simply returns the next sequential batch
        return self.next_batch()

In [24]:
import optuna

def objective(trial):
    hidden_size = trial.suggest_int('hidden_size', 50, 300)
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True)
    sequence_length = trial.suggest_int('sequence_length', 10, 100)
    
    data_generator = DataGenerator('titanic.txt', sequence_length=sequence_length)
    lstm = LSTM(hidden_size=hidden_size, vocab_size=data_generator.vocab_size,
                sequence_length=sequence_length, learning_rate=learning_rate)
    
    losses = lstm.train_model(data_generator, num_iterations=1000)
    
    # Objective: mean training loss over the last 100 iterations (no held-out test set is used)
    train_loss = np.mean(losses[-100:])
    return train_loss

# Run Optuna study
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)

# Get best hyperparameters
best_params = study.best_params
print("Best hyperparameters:", best_params)

# Train final model with best hyperparameters
data_generator = DataGenerator('titanic.txt', sequence_length=best_params['sequence_length'])
lstm = LSTM(hidden_size=best_params['hidden_size'], vocab_size=data_generator.vocab_size,
            sequence_length=best_params['sequence_length'], learning_rate=best_params['learning_rate'])

# Train the model
lstm.train_model(data_generator, num_iterations=100000)

# Generate text
with torch.no_grad():
    lstm.eval()  # Set the model to evaluation mode
    generated_text = lstm.predict(data_generator, "It's been 84", 150)
    print("Generated text:")
    print(generated_text)
    lstm.train()  # Set the model back to training mode

  from .autonotebook import tqdm as notebook_tqdm
[I 2024-09-29 19:47:24,411] A new study created in memory with name: no-name-81e14a3e-7700-4fe7-a81e-cd8ab97d75c2
Training: 100%|███████████████████████████████████| 1000/1000 [00:06<00:00, 150.70it/s, loss=1.4635]
[I 2024-09-29 19:47:32,142] Trial 0 finished with value: 1.665282855629921 and parameters: {'hidden_size': 216, 'learning_rate': 0.0024197790165067894, 'sequence_length': 74}. Best is trial 0 with value: 1.665282855629921.
Training: 100%|███████████████████████████████████| 1000/1000 [00:04<00:00, 209.14it/s, loss=0.3921]
[I 2024-09-29 19:47:36,931] Trial 1 finished with value: 2.1543486832827328 and parameters: {'hidden_size': 168, 'learning_rate': 0.005792084968666204, 'sequence_length': 16}. Best is trial 0 with value: 1.665282855629921.
Training: 100%|███████████████████████████████████| 1000/1000 [00:05<00:00, 191.77it/s, loss=3.1265]
[I 2024-09-29 19:47:42,153] Trial 2 finished with value: 3.0234732103347777 and paramet

Best hyperparameters: {'hidden_size': 87, 'learning_rate': 0.008832015467223835, 'sequence_length': 90}


Training:  10%|███▏                            | 10000/100000 [01:05<10:48, 138.78it/s, loss=1.7516]


Iteration 10000/100000


Training:  10%|███▎                             | 10026/100000 [01:06<26:47, 55.97it/s, loss=1.4568]

It's been 84s
pabins like a like Eil shoup of 

They robloom and trying the enorfis I trees some
of the brocking. A TION OFFICER who wall, descer moster, of a cam
Loss: 1.7516


Training:  20%|██████▍                         | 20000/100000 [02:04<08:25, 158.35it/s, loss=0.5999]


Iteration 20000/100000


Training:  20%|██████▌                          | 20028/100000 [02:05<14:33, 91.56it/s, loss=1.4403]

It's been 84s hockete and alfice.

                                               CUT TO:

16 EXT. BOAT D / POV EXT... a KELDYSHOT of a stiad. The subed the littl
Loss: 0.5999


Training:  30%|█████████▌                      | 30000/100000 [03:01<07:31, 154.94it/s, loss=1.9018]


Iteration 30000/100000


Training:  30%|█████████▉                       | 30024/100000 [03:02<16:22, 71.21it/s, loss=0.8618]

It's been 84s face.

One best the most up and cover was about the ship's hanter she starts the deed.

                          LOVETT

The stading moons and 1912
Loss: 1.9018


Training:  40%|████████████▊                   | 40041/100000 [04:03<03:04, 324.94it/s, loss=0.7988]


Iteration 40000/100000
It's been 84s eyes.

                                                    LOVETT

                                (to stangiting up forward fiuds. Swettlers of han
Loss: 1.5620


Training:  50%|████████████████                | 50000/100000 [04:56<05:21, 155.49it/s, loss=2.0238]


Iteration 50000/100000


Training:  50%|████████████████▌                | 50025/100000 [04:57<09:27, 88.14it/s, loss=1.2517]

It's been 84s a seaters. There is against a tomert of the everywhere's the wall.

Jack was a Fabrizio piker.

                                                    
Loss: 2.0238


Training:  60%|███████████████████▏            | 60000/100000 [05:54<04:28, 149.21it/s, loss=1.4332]


Iteration 60000/100000


Training:  60%|███████████████████▊             | 60027/100000 [05:54<09:02, 73.73it/s, loss=0.8018]

It's been 84stobbite a greattans.

Jack she sees the pointly, where or in the shipred, a
comes at her and take at a pourt it like there, and Rose, his wantelized 
Loss: 1.4332


Training:  70%|██████████████████████▍         | 70040/100000 [06:51<01:32, 323.60it/s, loss=0.8900]


Iteration 70000/100000
It's been 84 get don't looking someholation, with the reflection. Ruth do a gone put steerage now. Tomm'd time.

Fabrizio is churred looking top water. He secress
Loss: 0.5009


Training:  80%|█████████████████████████▌      | 80000/100000 [07:49<02:10, 152.98it/s, loss=0.7190]


Iteration 80000/100000


Training:  80%|██████████████████████████▍      | 80024/100000 [07:50<03:54, 85.01it/s, loss=0.8558]

It's been 84 MORITTADED, to withourhelding me?

                                                                                                        CUT TO:

6
Loss: 0.7190


Training:  90%|████████████████████████████▊   | 90000/100000 [08:56<01:05, 152.46it/s, loss=1.6008]


Iteration 90000/100000


Training:  90%|█████████████████████████████▋   | 90026/100000 [08:56<02:12, 75.28it/s, loss=1.6382]

It's been 84 to crew and hears up on the not ready but I see there strengtith, turming a straws the afficeer.

                       JACK

                     (
Loss: 1.6008


Training: 100%|███████████████████████████████| 100000/100000 [09:57<00:00, 155.16it/s, loss=1.6027]


Iteration 100000/100000


Training: 100%|███████████████████████████████| 100000/100000 [09:57<00:00, 167.38it/s, loss=1.6027]

It's been 84 wonder holy dresse asleeps pricker, which then. Instable of the stairs and you couldn't Jacu?

                                                      
Loss: 1.6027





Generated text:
It's been 84 town that stome and beyond
she saim against Cal, excitera picts and expassing the necklace behind, the right whites she be engress on. You got, and s


In [25]:
# Generate output for LSTM
lstm_output = lstm.predict(data_generator, "It's been eighty-four", 150)

# Add LSTM results to the dataframe
new_row = pd.DataFrame({
    'Model': ['Hyper-Tuned LSTM'],
    'Input': ["It's been eighty-four"],
    'Output': [lstm_output]
})
df = pd.concat([df, new_row], ignore_index=True)

# Display the dataframe with all columns visible
pd.set_option('display.max_colwidth', None)
pd.set_option('display.expand_frame_repr', False)
display(df)

# Reset display options to default (optional)
pd.reset_option('display.max_colwidth')
pd.reset_option('display.expand_frame_repr')

Unnamed: 0,Model,Input,Output
0,RNN,itsbeen,itsbeenachenosaurus
1,Vanilla LSTM,It's been eighty-four,"It's been eighty-foursen frow't wis calestnow:\nAnd this sext manmanmalirgo: in the lond; you hoss and suicl: wersing aikn!\nI pow, hereall here trye must.\n\nDUKE VINCENTIO:\n"
2,Vanilla GRU,It's been eighty-four,It's been eighty-fourth. pure borncatl bu gray ugarst ke.\n Mard pore feasbe ws hances af is disercims s fold Ifto tobrus?\nby! us dmom sesave ftrespres sheve fralistn-y-ucn
3,Tuned LSTM,It's been eighty-four,It's been eighty-four thod es Mo dofe tot.\n\nNas. RWansa save farosemes te steicho doyckes shing tha ta eures movee tleas lost fooun\n. at Japile Vbackdoogs d aSSe soth skil
4,Hyper-Tuned LSTM,It's been eighty-four,It's been eighty-foursdaling to the drawing in the stairs.\n\n CUT TO:\n\n745 EXT. TERN stuff and you're she


### Results Breakdown

First, let's describe each of the previous models created:

1. RNN (Recurrent Neural Network):
   - A basic recurrent model that processes sequences.
   - Tends to have issues with long-term dependencies.
   - Trained on the **dinos.txt** "dataset".

2. Vanilla LSTM (Long Short-Term Memory):
   - An improvement over RNN, designed to handle long-term dependencies better.
   - Uses gates to control information flow.
   - Trained on the **text.txt** "dataset".

3. Vanilla GRU (Gated Recurrent Unit):
   - Similar to LSTM but with a simpler architecture.
   - Often performs comparably to LSTM with fewer parameters.
   - Also trained on the **text.txt** "dataset".

4. Tuned LSTM:
   - Aims to improve performance over the vanilla LSTM (toward the new dataset).
   - Now trained on the **titanic.txt** "dataset" (the script of the movie Titanic).

5. Hyper-Tuned LSTM:
   - An LSTM model with hyperparameters optimized using Optuna.
   - Reimplemented in PyTorch (`nn.LSTM` with the Adam optimizer) so that it can be trained on a GPU.
   - Represents our most advanced model, leveraging automated hyperparameter tuning.
   - Also trained on the **titanic.txt** "dataset".
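
To put a number on the "fewer parameters" remark in the GRU item above, here is a rough count for one-layer models laid out like the from-scratch ones in this notebook: each gate or candidate block has a weight matrix over the stacked hidden state and one-hot input plus a bias, followed by a softmax output layer. The four-block LSTM layout, the helper name `gated_rnn_params`, and the vocabulary size of 65 are assumptions made for illustration.

```python
def gated_rnn_params(num_gate_blocks, hidden_size, vocab_size):
    """Parameter count of a one-layer gated RNN with one-hot inputs and a softmax output layer."""
    # each gate/candidate block: weights on [hidden state; input] plus a bias
    per_block = hidden_size * (hidden_size + vocab_size) + hidden_size
    # output layer: hidden state -> vocabulary logits, plus a bias
    output_layer = vocab_size * hidden_size + vocab_size
    return num_gate_blocks * per_block + output_layer


hidden_size = 100  # value used for the GRU in this notebook
vocab_size = 65    # assumed for illustration; the real value depends on the text file

gru_params = gated_rnn_params(3, hidden_size, vocab_size)   # update, reset, candidate
lstm_params = gated_rnn_params(4, hidden_size, vocab_size)  # forget, input, candidate, output

print("GRU :", gru_params)
print("LSTM:", lstm_params)
print("ratio:", round(gru_params / lstm_params, 2))
```

Under these assumptions the GRU has roughly three quarters of the LSTM's parameters, since the two only differ by one gate block.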

Each subsequent model aims to improve upon the previous ones in terms of text generation quality and coherence. However, a significant limitation of character-based RNN models is their inability to capture higher-level semantic meaning. These models operate at the individual character level, which can lead to several issues:

1. Limited context: Character-based models struggle to understand broader context, often resulting in nonsensical or grammatically incorrect sequences.
2. Inefficiency: They require longer sequences to represent meaningful content, increasing computational demands.
3. Lack of word-level understanding: These models don't inherently grasp word boundaries or meanings, making it challenging to generate coherent, meaningful text.
4. Difficulty with rare characters: Uncommon characters or punctuation can disproportionately influence the model's predictions.
5. Slower convergence: Character-level models typically require more training time to achieve comparable results to word-level models.
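
The "inefficiency" point above is easy to check on a single line of text. The sketch below compares how many time steps a character-level model needs against a naive whitespace split, which stands in for a word-level tokenizer. The example sentence is made up for illustration.

```python
line = "It's been eighty-four years, and I can still smell the fresh paint."

char_tokens = list(line)    # one token per character, as in this notebook
word_tokens = line.split()  # naive whitespace split, standing in for a word-level tokenizer

steps_per_word = len(char_tokens) / len(word_tokens)

print("characters:", len(char_tokens))
print("words     :", len(word_tokens))
print("steps per word (average):", round(steps_per_word, 1))
```

The character-level model has to carry information across several times as many steps to cover the same sentence, which is part of why long-range coherence is hard for it.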

That is why, in the results shown in this notebook, the models manage to capture the structure of the text they were trained on, but rarely generate meaningful phrases.

The last model (Hyper-Tuned LSTM), given the "best" (debatable) hyperparameters for the task at hand, at least learned to generate meaningful words rather than just random letters, and even strings some of them into phrases with a hint of meaning. For example, when fed the input "*It's been eighty-four*", it produced *"It's been eighty-foursdaling <u>to the drawing in the stairs</u>.\n\n CUT TO:\n\n745 EXT. TERN stuff and you're she"*.


P.S.: At iteration 40000/100000, you can see that we got "It's been 84s eyes." as a result. *Close enough... (sigh)*