In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd /content/drive/MyDrive/files

Adam Optimization<br>
Adam, short for "Adaptive Moment Estimation", is an iterative optimization algorithm used to minimize the loss function during the training of neural networks. It combines RMSprop with Stochastic Gradient Descent with momentum.
$$\text{Adam} \approx \text{SGD with momentum} + \text{RMSprop}$$
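
The combination can be made concrete with a small sketch of a single Adam update. This is an illustration only: the function name `adam_step` and the default hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are assumptions taken from common practice, not from this notebook, and the LSTM implemented later uses plain clipped gradient steps rather than Adam.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and both moment estimates."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: momentum-style running mean
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment: RMSprop-style running mean of squares
    m_hat = m / (1 - beta1 ** t)               # bias correction (t counts from 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem (hypothetical): minimize f(x) = x^2, whose gradient is 2x.
x = np.array([5.0])
m, v = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the parameter moves by roughly `lr` in the direction opposite to the gradient, regardless of the gradient's scale.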

<br>
**References**:
1.   https://www.analyticsvidhya.com/blog/2023/09/what-is-adam-optimizer/




In [None]:
# in this example we will use the input.txt downloaded from the
#https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt



### LSTM

An LSTM has two paths: a long-term path (the memory state, or context state) and a short-term path (the hidden state). The **Forget Gate** removes unimportant information from the context state on the long-term path. The **Input Gate** and **Candidate Gate** update the long-term path based on the current input and the short-term path. Once the important information is preserved in the long-term path, the **Output Gate** decides how much of it is passed to the short-term path, and the **Final Gate** maps the hidden state to the network output.

#### Forget Gate

It forgets unimportant information from the context state, since an LSTM can retain only a limited amount of information. The forget gate is defined as <br>
$$F_t = \sigma(W_fZ_t + b_f)$$
That is, the forget gate is the sigmoid of the weight matrix ($W_f$) times the concatenation of the previous hidden state and the current input ($Z_t$), plus the bias ($b_f$).
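
As a rough illustration of the definition above, the following sketch builds $Z_t$ by concatenation and applies the forget gate to a toy context state. The sizes, the random weights, and the variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 3, 4                    # toy sizes, chosen for illustration

h_prev = np.zeros((hidden_size, 1))               # previous hidden state
x_t = np.zeros((input_size, 1))
x_t[2] = 1.0                                      # one-hot current input
z_t = np.concatenate((h_prev, x_t))               # Z_t, shape (hidden_size + input_size, 1)

W_f = rng.uniform(-1, 1, (hidden_size, hidden_size + input_size))
b_f = np.zeros((hidden_size, 1))

f_t = 1 / (1 + np.exp(-(W_f @ z_t + b_f)))        # F_t, every entry lies strictly in (0, 1)
c_prev = np.ones((hidden_size, 1))                # toy previous context state
kept = f_t * c_prev                               # element-wise: fraction of each entry retained
```

Values of `f_t` near 0 erase the corresponding entry of the context state, while values near 1 keep it almost unchanged.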

#### Candidate Gate

It proposes the new information to be written to the context state and is defined as<br>
$$C_t = \tanh(W_cZ_t + b_c)$$
The candidate gate is the tanh of the weight matrix ($W_c$) times the concatenation of the previous hidden state and the current input ($Z_t$), plus the bias ($b_c$).

#### Input Gate

It determines how much of the candidate information should be added to the context state. It is defined using the sigmoid function,<br>
$$I_t = \sigma (W_iZ_t + b_i)$$
The input gate is the sigmoid of the weight matrix ($W_i$) times the concatenation of the previous hidden state and the current input ($Z_t$), plus the bias ($b_i$).

#### Output Gate

Just like the input gate, it is a sigmoid gate; it determines how much of the context state is passed on to the hidden state.<br>
$$O_t = \sigma(W_oZ_t + b_o)$$
The output gate is the sigmoid of the weight matrix ($W_o$) times the concatenation of the previous hidden state and the current input ($Z_t$), plus the bias ($b_o$).

#### Final Gate

This is the output of the LSTM network. It is defined as <br>
$$Y_t = W_yHS_t + b_y$$
It is simply the weight matrix ($W_y$) multiplied by the hidden state ($HS_t$), plus the bias ($b_y$).

#### Complete Mathematical Equations

$$\begin{align*}
F_t &= \sigma(W_fZ_t + b_f) \tag{Forget Gate Definition}\\
I_t &= \sigma(W_iZ_t + b_i) \tag{Input Gate Definition}\\
C_t &= \tanh(W_cZ_t + b_c) \tag{Candidate Gate Definition}\\
O_t &= \sigma(W_oZ_t + b_o) \tag{Output Gate Definition}\\
CS_t &= F_t \otimes CS_{t-1} + I_t \otimes C_t \tag{Cell State Definition}\\
HS_t &= O_t \otimes \tanh(CS_t) \tag{Hidden State Definition}\\
Y_t &= W_yHS_t + b_y \tag{Final Gate Definition}
\end{align*}$$
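
The equations above can be read as a single time step. Below is a minimal sketch of that step, assuming $\otimes$ denotes element-wise multiplication and that the final gate uses an ordinary matrix product. The function `lstm_step`, the parameter dictionary, and the toy sizes are hypothetical names and values chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names such as 'W_f' and 'b_f' to arrays."""
    z_t = np.concatenate((h_prev, x_t))            # Z_t: previous hidden state stacked on the input
    f_t = sigmoid(p["W_f"] @ z_t + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ z_t + p["b_i"])       # input gate
    c_hat = np.tanh(p["W_c"] @ z_t + p["b_c"])     # candidate gate
    o_t = sigmoid(p["W_o"] @ z_t + p["b_o"])       # output gate
    cs_t = f_t * c_prev + i_t * c_hat              # cell state (element-wise products)
    hs_t = o_t * np.tanh(cs_t)                     # hidden state
    y_t = p["W_y"] @ hs_t + p["b_y"]               # final gate (matrix product)
    return y_t, hs_t, cs_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 5, 4, 5                       # toy sizes
params = {f"W_{g}": rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for g in "fico"}
params.update({f"b_{g}": np.zeros((n_hid, 1)) for g in "fico"})
params["W_y"] = rng.normal(0, 0.1, (n_out, n_hid))
params["b_y"] = np.zeros((n_out, 1))

x = np.zeros((n_in, 1))
x[1] = 1.0                                         # one-hot input
y, h, c = lstm_step(x, np.zeros((n_hid, 1)), np.zeros((n_hid, 1)), params)
```

Because the hidden state is a sigmoid output times a tanh output, every entry of `h` stays strictly between -1 and 1.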

$$\begin{align*}
\dfrac{d}{dx}\tanh(x) &= 1 - \tanh^2 (x) \tag{Tanh derivative definition}\\
\dfrac{d}{dx} \sigma(x) &= \sigma(x)(1 - \sigma(x)) \tag{Sigmoid derivative definition}
\end{align*}$$
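
Both derivative identities can be sanity-checked numerically. The sketch below compares them against central finite differences; the grid of test points and the step size `eps` are arbitrary choices for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 13)
eps = 1e-6

# Central finite-difference approximations of the derivatives
num_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
num_sig = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Closed-form derivatives from the identities above
ana_tanh = 1 - np.tanh(x) ** 2
ana_sig = sigmoid(x) * (1 - sigmoid(x))

max_err_tanh = np.max(np.abs(num_tanh - ana_tanh))
max_err_sig = np.max(np.abs(num_sig - ana_sig))
```

Both closed forms depend only on the activation's own output, which is why backpropagation can reuse the gate values stored during the forward pass.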

In [41]:
import numpy as np
import sys
import pandas as pd
import datetime
import random
import time
import math
from matplotlib import pyplot as plt

In [42]:
##### Imports #####
from tqdm import tqdm
import numpy as np

##### Data #####
data = """To be, or not to be, that is the question: Whether \
'tis nobler in the mind to suffer The slings and arrows of ou\
trageous fortune, Or to take arms against a sea of troubles A\
nd by opposing end them. To die—to sleep, No more; and by a s\
leep to say we end The heart-ache and the thousand natural sh\
ocks That flesh is heir to: 'tis a consummation Devoutly to b\
e wish'd. To die, to sleep; To sleep, perchance to dream—ay, \
there's the rub: For in that sleep of death what dreams may c\
ome, When we have shuffled off this mortal coil, Must give us\
 pause—there's the respect That makes calamity of so long lif\
e. For who would bear the whips and scorns of time, Th'oppres\
sor's wrong, the proud man's contumely, The pangs of dispriz'\
d love, the law's delay, The insolence of office, and the spu\
rns That patient merit of th'unworthy takes, When he himself \
might his quietus make""".lower()

In [43]:
# Sort the unique characters so the index mapping is the same on every run
chars = sorted(set(data))

data_size, char_size = len(data), len(chars)

print(f'Data Size: {data_size}, Char Size: {len(chars)}')

char_to_idx = {c:i for i, c in enumerate(chars)}
idx_to_char = {i:c for i, c in enumerate(chars)}

train_X, train_y = data[:-1], data[1:]

Data Size: 866, Char Size: 32
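
The character mapping and the shifted input/target pairs can be illustrated on a shorter string. This is a minimal sketch: the names `vocab`, `to_idx`, and `to_char` are illustrative and not part of the notebook, and the vocabulary is sorted here so that the indices are reproducible.

```python
text = "to be, or not to be"

vocab = sorted(set(text))                          # unique characters in a fixed order
to_idx = {c: i for i, c in enumerate(vocab)}
to_char = {i: c for i, c in enumerate(vocab)}

encoded = [to_idx[c] for c in text]                # characters -> integer indices
decoded = "".join(to_char[i] for i in encoded)     # indices -> characters

# Next-character prediction: each input character is paired with the one that follows it
inputs, targets = text[:-1], text[1:]
```

Each position `k` in `inputs` is trained to predict `targets[k]`, which is the character at position `k + 1` of the original text.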


In [44]:
##### Helper Functions #####
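def oneHotEncode(text):
    # Column vector of shape (char_size, 1) with a 1 at the character's index
    output = np.zeros((char_size, 1))
    output[char_to_idx[text]] = 1
    return output

# Xavier Normalized Initialization: uniform in [-limit, limit], limit = sqrt(6 / (input_size + output_size))
def initWeights(input_size, output_size):
    return np.random.uniform(-1, 1, (output_size, input_size)) * np.sqrt(6/(input_size + output_size))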
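
As a quick check of what this initialization produces, the sketch below draws one weight matrix and compares its range and variance with the expected values. The helper `xavier_uniform` is a hypothetical stand-in that takes an explicit random generator; the sizes 57 and 25 correspond to `char_size + hidden_size` and `hidden_size` as used later in this notebook.

```python
import numpy as np

def xavier_uniform(input_size, output_size, rng):
    # Uniform samples in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6 / (input_size + output_size))
    return rng.uniform(-limit, limit, (output_size, input_size))

rng = np.random.default_rng(0)
W = xavier_uniform(57, 25, rng)

limit = np.sqrt(6 / (57 + 25))
expected_var = limit ** 2 / 3                      # variance of U(-limit, limit) = 2 / (fan_in + fan_out)
sample_var = W.var()
```

Scaling the range by the layer sizes keeps the variance of the pre-activations roughly comparable across layers, which helps the sigmoid and tanh gates avoid saturating at the start of training.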

In [45]:
#### Activation Functions ####
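def sigmoid(x, derivative = False):
    # With derivative=True, x must already be a sigmoid output s; returns s * (1 - s)
    if derivative:
        return x * (1 - x)

    return 1 / (1 + np.exp(-x))


def tanh(x, derivative = False):
    # With derivative=True, x must already be a tanh output t; returns 1 - t^2
    if derivative:
        return 1 - x ** 2
    return np.tanh(x)


def softmax(x):
    # Converts a vector of raw scores into probabilities that sum to 1
    return np.exp(x) / np.sum(np.exp(x))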
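
The softmax above works for the small scores in this notebook, but `np.exp` overflows for large inputs (for example, `np.exp(1000)` is `inf` in float64). A common remedy, sketched here as an optional variant rather than as the notebook's method, is to subtract the maximum score first. The name `stable_softmax` is hypothetical.

```python
import numpy as np

def stable_softmax(logits):
    # Subtracting the max leaves the result unchanged mathematically
    # but keeps every exponent <= 0, so np.exp cannot overflow
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])         # same differences, much larger magnitude

p_small = stable_softmax(small)
p_large = stable_softmax(large)
```

Since softmax depends only on the differences between scores, both inputs yield the same probabilities.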



In [None]:
#### Long Short-Term Memory Network Class ####
class LSTM:
    def __init__(self, input_size, hidden_size, output_size, num_epochs, learning_rate):
        # Hyperparameters
        self.learning_rate = learning_rate
        self.hidden_size = hidden_size
        self.num_epochs = num_epochs

        # Forget Gate
        self.wf = initWeights(input_size, hidden_size)
        self.bf = np.zeros((hidden_size, 1))

        # Input Gate
        self.wi = initWeights(input_size, hidden_size)
        self.bi = np.zeros((hidden_size, 1))

        # Candidate Gate
        self.wc = initWeights(input_size, hidden_size)
        self.bc = np.zeros((hidden_size, 1))

        # Output Gate
        self.wo = initWeights(input_size, hidden_size)
        self.bo = np.zeros((hidden_size, 1))

        # Final Gate
        self.wy = initWeights(hidden_size, output_size)
        self.by = np.zeros((output_size, 1))

    # Reset Network Memory
    def reset(self):
        self.concat_inputs ={}

        self.hidden_states = {-1 : np.zeros((self.hidden_size, 1))}
        self.cell_states = {-1 : np.zeros((self.hidden_size, 1))}

        self.activation_outputs = {}
        self.candidate_gates = {}
        self.output_gates = {}
        self.forget_gates = {}
        self.input_gates = {}
        self.outputs = {}

    # Forward Propagation
    def forward(self, inputs):
        self.reset()

        outputs = []
        for q in range(len(inputs)):
            self.concat_inputs[q] = np.concatenate((self.hidden_states[q - 1], inputs[q]))

            self.forget_gates[q] = sigmoid(np.dot(self.wf, self.concat_inputs[q]) + self.bf)
            self.input_gates[q] = sigmoid(np.dot(self.wi, self.concat_inputs[q]) + self.bi)
            self.candidate_gates[q] = tanh(np.dot(self.wc, self.concat_inputs[q]) + self.bc)
            self.output_gates[q] = sigmoid(np.dot(self.wo, self.concat_inputs[q]) + self.bo)

            self.cell_states[q] = self.forget_gates[q] * self.cell_states[q - 1] + self.input_gates[q] * self.candidate_gates[q]
            self.hidden_states[q] = self.output_gates[q] * tanh(self.cell_states[q])

            outputs += [np.dot(self.wy, self.hidden_states[q]) + self.by]

        return outputs


    # Backward Propagation
    def backward(self, errors, inputs):
        d_wf, d_bf = 0, 0
        d_wi, d_bi = 0, 0
        d_wc, d_bc = 0, 0
        d_wo, d_bo = 0, 0
        d_wy, d_by = 0, 0

        dh_next, dc_next = np.zeros_like(self.hidden_states[0]), np.zeros_like(self.cell_states[0])
        for q in reversed(range(len(inputs))):
            error = errors[q]

            # Final Gate Weights and Biases Errors
            d_wy += np.dot(error, self.hidden_states[q].T)
            d_by += error

            # Hidden State Error
            d_hs = np.dot(self.wy.T, error) + dh_next

            # Output Gate Weights and Biases Errors
            d_o = tanh(self.cell_states[q]) * d_hs * sigmoid(self.output_gates[q], derivative = True)
            d_wo += np.dot(d_o, inputs[q].T)
            d_bo += d_o

            # Cell State Error
            d_cs = tanh(tanh(self.cell_states[q]), derivative = True) * self.output_gates[q] * d_hs + dc_next

            # Forget Gate Weights and Biases Errors
            d_f = d_cs * self.cell_states[q - 1] * sigmoid(self.forget_gates[q], derivative = True)
            d_wf += np.dot(d_f, inputs[q].T)
            d_bf += d_f

            # Input Gate Weights and Biases Errors
            d_i = d_cs * self.candidate_gates[q] * sigmoid(self.input_gates[q], derivative = True)
            d_wi += np.dot(d_i, inputs[q].T)
            d_bi += d_i

            # Candidate Gate Weights and Biases Errors
            d_c = d_cs * self.input_gates[q] * tanh(self.candidate_gates[q], derivative = True)
            d_wc += np.dot(d_c, inputs[q].T)
            d_bc += d_c

            # Concatenated Input Error (Sum of Error at Each Gate!)
            d_z = np.dot(self.wf.T, d_f) + np.dot(self.wi.T, d_i) + np.dot(self.wc.T, d_c) + np.dot(self.wo.T, d_o)

            # Error of Hidden State and Cell State at Next Time Step
            dh_next = d_z[:self.hidden_size, :]
            dc_next = self.forget_gates[q] * d_cs

        for d_ in (d_wf, d_bf, d_wi, d_bi, d_wc, d_bc, d_wo, d_bo, d_wy, d_by):
            np.clip(d_, -1, 1, out = d_)

        self.wf += d_wf * self.learning_rate
        self.bf += d_bf * self.learning_rate

        self.wi += d_wi * self.learning_rate
        self.bi += d_bi * self.learning_rate

        self.wc += d_wc * self.learning_rate
        self.bc += d_bc * self.learning_rate

        self.wo += d_wo * self.learning_rate
        self.bo += d_bo * self.learning_rate

        self.wy += d_wy * self.learning_rate
        self.by += d_by * self.learning_rate


    # Train
    def train(self, inputs, labels):
        inputs = [oneHotEncode(input) for input in inputs]

        for _ in tqdm(range(self.num_epochs)):
            predictions = self.forward(inputs)

            errors = []
            for q in range(len(predictions)):
                errors += [-softmax(predictions[q])]
                errors[-1][char_to_idx[labels[q]]] += 1

            self.backward(errors, self.concat_inputs)

    # Test
    def test(self, inputs, labels):
        accuracy = 0
        probabilities = self.forward([oneHotEncode(input) for input in inputs])

        output = ''
        for q in range(len(labels)):
            prediction = idx_to_char[np.random.choice([*range(char_size)], p = softmax(probabilities[q].reshape(-1)))]

            output += prediction

            if prediction == labels[q]:
                accuracy += 1
        print(f"Ground Truth:\n{labels}\n")
        print(f"Predictions:\n{output}\n")

        print(f'Accuracy: {round(accuracy * 100 / len(inputs), 2)}%')
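
The `errors` built in `train` equal the one-hot label minus the softmax output. Assuming a softmax cross-entropy loss, this is the negative gradient of the loss with respect to the raw outputs, which would explain why `backward` adds the scaled gradients instead of subtracting them. The sketch below checks that relationship with finite differences on random scores; the helper names and sizes are assumptions for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, target):
    # Negative log-probability assigned to the correct class
    return -np.log(softmax(z)[target, 0])

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 1))                   # toy raw outputs for 6 classes
target = 2

one_hot = np.zeros_like(logits)
one_hot[target] = 1
grad = softmax(logits) - one_hot                   # analytic dLoss/dlogits
error = -grad                                      # same form as in train: one_hot - softmax

# Central finite differences as an independent check of the analytic gradient
eps = 1e-6
numeric = np.zeros_like(logits)
for k in range(logits.shape[0]):
    step = np.zeros_like(logits)
    step[k] = eps
    numeric[k] = (cross_entropy(logits + step, target)
                  - cross_entropy(logits - step, target)) / (2 * eps)
```

The entries of `error` sum to zero: the correct class is pushed up by exactly as much as the other classes are pushed down in total.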



In [46]:
# Initialize Network
hidden_size = 25

lstm = LSTM(input_size = char_size + hidden_size, hidden_size = hidden_size, output_size = char_size, num_epochs = 1_000, learning_rate = 0.05)

#### Training ####
lstm.train(train_X, train_y)

#### Testing ####
lstm.test(train_X, train_y)

100%|██████████| 1000/1000 [02:24<00:00,  6.94it/s]


Ground Truth:
to be, or not to be, that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and by opposing end them. to die—to sleep, no more; and by a sleep to say we end the heart-ache and the thousand natural shocks that flesh is heir to: 'tis a consummation devoutly to be wish'd. to die, to sleep; to sleep, perchance to dream—ay, there's the rub: for in that sleep of death what dreams may come, when we have shuffled off this mortal coil, must give us pause—there's the respect that makes calamity of so long life. for who would bear the whips and scorns of time, th'oppressor's wrong, the proud man's contumely, the pangs of dispriz'd love, the law's delay, the insolence of office, and the spurns that patient merit of th'unworthy takes, when he himself might his quietus make

Predictions: 
to be, or not to be, that is the question: whether 'tis nobler in the mind to suffer the slings and arro

**References**:
1.   https://medium.com/mlearning-ai/building-a-neural-network-zoo-from-scratch-the-long-short-term-memory-network-1cec5cf31b7
