# Goals of this notebook
- Deep Learning overview
- Keras Basics
- Basics of LSTM and RNN
- Text generation from text corpus using LSTM
- Q/A Chat bots using End-to-End RNNs

## 1. Perceptron Model
Artificial Neural Networks (ANN) are inspired by the human brain's biological neural network, specifically the [neuron](https://en.wikipedia.org/wiki/Neuron). A neuron in the human brain receives input from various dendrites, processes it in the cell body, and transmits output through the axon to other neurons. This process is mirrored in ANNs through the perceptron model.

**Biological Neuron**<br>
A biological neuron consists of three main parts:
- **Dendrites**: These receive signals from other neurons.
- **Cell Body (Soma)**: This processes the incoming signals.
- **Axon**: This transmits the processed signals to other neurons.

### 1.1 Artificial Neuron (Perceptron)
An artificial neuron, or perceptron, mimics this biological structure. It consists of:

- **Input**: Similar to dendrites, it receives multiple inputs.
- **Weights**: Each input is assigned a weight which determines its significance.
- **Summation Function**: It sums the weighted inputs.
- **Activation Function**: It processes the summed input to produce an output.
- **Output**: Similar to the axon, it transmits the signal to the next layer of neurons.

### 1.2 Mathematical Representation
In mathematical terms, the perceptron model can be described using the following equations:

#### 1.2.1 Weighted Sum:

$$ z = \sum_{i=1}^n w_i x_i + b $$

Where:
- $ x_i $ are the input features.
- $ w_i $ are the weights corresponding to each input.
- $ b $ is the bias term.
- $ z $ is the weighted sum.

#### 1.2.2 Activation Function
$$ y = f(z) $$

Where $ f $ is the activation function. Common activation functions include the step function, sigmoid function, and ReLU (Rectified Linear Unit).

For example, the step activation function can be defined as:

$$ f(z) = 
\begin{cases} 
1 & \text{if } z \geq 0 \\
0 & \text{if } z < 0 
\end{cases} $$

#### 1.2.3 Sigmoid Function
The sigmoid function is defined as:

$$ f(z) = \frac{1}{1 + e^{-z}} $$

#### 1.2.4 ReLU (Rectified Linear Unit)
The ReLU activation function is defined as:

$$ f(z) = \max(0, z) $$
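The weighted sum and the three activation functions above can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the original notebook: the function names (`step`, `sigmoid`, `relu`, `perceptron_output`) and the input values are assumptions chosen for the example.

```python
import math

def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def sigmoid(z):
    """Sigmoid activation: squashes z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """ReLU activation: keeps positive values, clips negative values to 0."""
    return max(0.0, z)

def perceptron_output(x, w, b, activation):
    """Weighted sum z = sum(w_i * x_i) + b, followed by the activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(z)

# Illustrative values (hypothetical, not taken from any dataset)
x = [1.0, 2.0]
w = [0.5, -1.0]
b = 0.5

# Here z = 0.5 * 1.0 + (-1.0) * 2.0 + 0.5 = -1.0
for f in (step, sigmoid, relu):
    print(f.__name__, perceptron_output(x, w, b, f))
```

With a negative weighted sum, the step function and ReLU both output zero, while the sigmoid still returns a small positive value.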

### 1.3 Perceptron Algorithm
The perceptron algorithm updates the weights based on the error in prediction. The steps are as follows:

- Set initial weights to small random numbers.
- Compute the weighted sum $ z $: multiply each input by its weight, add the products together, and add the bias term.
- Apply the activation function to get the predicted output $ y_{\text{pred}} $.
- Update the weights based on the error $ (y_{\text{true}} - y_{\text{pred}}) $:
<br>

$$ w_i = w_i + \Delta w_i $$
<br>

<center>Where:</center>

$$ \Delta w_i = \eta (y_{\text{true}} - y_{\text{pred}}) x_i $$
<br>

<center>and $ \eta $ is the learning rate.</center>

- Repeat the process for a fixed number of iterations or until the error is minimized.
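The steps above can be sketched as a small training loop. This is a minimal sketch under several assumptions that are not in the original text: the weights start at zero rather than small random numbers (so the run is reproducible), the bias is updated with the same rule as a weight whose input is always 1, and the training data is the logical AND gate. The function names are hypothetical.

```python
def predict(x, w, b):
    """Step-activated perceptron: 1 if the weighted sum is >= 0, else 0."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z >= 0 else 0

def train_perceptron(samples, eta=0.1, epochs=100):
    """Train on (inputs, label) pairs using w_i <- w_i + eta * error * x_i."""
    n_features = len(samples[0][0])
    w = [0.0] * n_features  # assumption: zeros instead of small random values
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y_true in samples:
            error = y_true - predict(x, w, b)
            if error != 0:
                mistakes += 1
                # delta_w_i = eta * (y_true - y_pred) * x_i
                w = [w_i + eta * error * x_i for w_i, x_i in zip(w, x)]
                # assumption: bias treated as a weight on a constant input of 1
                b += eta * error
        if mistakes == 0:  # every sample classified correctly, stop early
            break
    return w, b

# Logical AND: linearly separable, so the perceptron can learn it
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
print([predict(x, w, b) for x, _ in and_gate])
```

The loop stops as soon as one full pass over the data produces no errors, which corresponds to the "until the error is minimized" condition in the last step.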

### 1.4 Visual Representation
![Perceptron](https://gamedevacademy.org/wp-content/uploads/2017/09/Single-Perceptron.png.webp)

## 2. Neural Network
A neural network consists of interconnected layers of nodes, or "neurons," each layer performing specific transformations on the input data. The basic structure of a neural network includes:

- **Input Layer**: The layer that receives the input data.
- **Hidden Layer**: Intermediate layers that transform the input data through learned weights and activation functions.
- **Output Layer**: The layer that produces the final output of the network.

### 2.1 Weighted Sum (or Linear Combination)
Each neuron computes a weighted sum of its inputs plus a bias term. For a neuron $ j $ in layer $ l $, this can be expressed as:

$$ z_j^{(l)} = \sum_{i} w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)} $$

Where:
- $ w_{ij}^{(l)} $ is the weight from neuron $ i $ in layer $ l - 1 $ to neuron $ j $ in layer $ l $.
- $ a_i^{(l-1)} $ is the activation of neuron $ i $ in layer $ l - 1 $.
- $ b_j^{(l)} $ is the bias term of neuron $ j $ in layer $ l $.

### 2.2 Activation Function
The activation function $ \sigma $ is applied to the weighted sum to produce the neuron's output:

$$ a_{j}^{(l)} = \sigma{(z_{j}^{(l)})} $$
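As a rough illustration, the weighted sum and activation for a whole layer can be computed with a single matrix product in NumPy. The sketch below assumes a sigmoid activation and stores the weights so that `W[i, j]` corresponds to $ w_{ij}^{(l)} $; the layer sizes, the random initialization, and the name `dense_forward` are choices made for this example only.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(a_prev, W, b, activation=sigmoid):
    """Forward pass of one layer.

    a_prev: activations of layer l-1, shape (n_prev,)
    W:      weights with W[i, j] = w_ij, shape (n_prev, n_curr)
    b:      biases of layer l, shape (n_curr,)
    """
    z = a_prev @ W + b  # z_j = sum_i w_ij * a_i + b_j
    return activation(z)

rng = np.random.default_rng(0)
a0 = np.array([0.5, -1.0, 2.0])            # 3 input activations
W1 = rng.normal(scale=0.1, size=(3, 4))    # 3 inputs feeding 4 neurons
b1 = np.zeros(4)

a1 = dense_forward(a0, W1, b1)
print(a1.shape)  # (4,)
```

Stacking several such calls, each feeding its output into the next, gives the forward pass of a multi-layer network.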

### 2.3 Loss Function
The loss function quantifies the difference between the predicted output and the true output. A common loss function for classification is the cross-entropy loss:

$$ L(y, \hat{y}) = - \sum_{k} y_k \log(\hat{y}_k) $$

Where:
- $ y_k $ is the true label for class $ k $.
- $ \hat{y}_k $ is the predicted probability for class $ k $.

For regression, a common loss function is the Mean Squared Error (MSE).

$$ L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Where:
- $ y_i $ is the true value.
- $ \hat{y}_i $ is the predicted value.
- $ n $ is the number of samples.
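Both loss functions are short enough to write out directly. The following is a minimal sketch in plain Python; the small constant `eps`, which guards against `log(0)`, and the example values are assumptions added for illustration.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one sample: -sum_k y_k * log(y_hat_k).

    y_true is a one-hot list, y_pred a list of predicted probabilities.
    eps (an assumption of this sketch) avoids taking log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def mean_squared_error(y_true, y_pred):
    """MSE over n samples: (1/n) * sum_i (y_i - y_hat_i)^2."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Classification: the true class is index 1, predicted with probability 0.7
ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])

# Regression: squared errors are 0.25, 0.0 and 1.0
mse = mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])

print(ce, mse)
```

Because the true label is one-hot, the cross-entropy reduces to the negative log of the probability assigned to the correct class.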

### 2.4 Backpropagation and Gradient Descent
Backpropagation is used to update the weights and biases by computing gradients of the loss function with respect to each weight and bias. The weights are updated using gradient descent:

$$ w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \frac{\partial L}{\partial w_{ij}^{(l)}} $$

Where:
- $ \eta $ is the learning rate.
- $ \frac{\partial L}{\partial w_{ij}^{(l)}} $ is the gradient of the loss with respect to the weight $ w_{ij}^{(l)} $.

### 2.5 Update Rule for Biases
Biases are updated using:

$$ b_j^{(l)} \leftarrow b_j^{(l)} - \eta \frac{\partial L}{\partial b_j^{(l)}} $$

Where:
- $ \frac{\partial L}{\partial b_j^{(l)}} $ is the gradient of the loss with respect to the bias $ b_j^{(l)} $
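To make the two update rules concrete, here is a minimal sketch for the simplest possible case: a single linear neuron $ \hat{y} = w x + b $ trained with the MSE loss, where the gradients can be written by hand instead of being computed by backpropagation through several layers. The data (points on the line $ y = 2x + 1 $), the learning rate, and the step count are assumptions chosen for the example.

```python
def fit_linear_neuron(xs, ys, eta=0.05, steps=1000):
    """Fit y_hat = w * x + b by gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # residuals (y_i - y_hat_i) under the current parameters
        residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
        # gradients of L = (1/n) * sum_i (y_i - y_hat_i)^2
        grad_w = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = -2.0 / n * sum(residuals)
        # update rules: parameter <- parameter - eta * gradient
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Hypothetical training data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = fit_linear_neuron(xs, ys)
print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```

In a multi-layer network the update lines stay the same; backpropagation only changes how `grad_w` and `grad_b` are obtained for each layer.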

For further in-depth reading please follow these resources:
- [Backpropagation Step by Step](https://hmkcode.com/ai/backpropagation-step-by-step/)
- [A Step by Step Backpropagation Example](https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/)

## 3. Recurrent Neural Network
A Recurrent Neural Network (RNN) is a type of neural network designed for processing sequences of data. Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles, allowing them to maintain a state or memory of previous inputs. This capability makes them suitable for tasks involving sequential data, such as time series forecasting and natural language processing.

Recall that a normal neuron in a feedforward neural network takes in an input (multiple inputs are aggregated), passes it through an activation function, and produces an output.

![Normal Neuron](https://www.dropbox.com/scl/fi/5tarivnndy1q41v7g1owq/Normal_Neuron.png?rlkey=sh8or3pso3pn15it0fte4dm3o&st=dea6oax9&raw=1)

A recurrent neuron can send output back to itself.

![Recurrent Neuron](https://www.dropbox.com/scl/fi/deykm5b9dmw17kf09yq2p/Recurrent-Neuron.png?rlkey=1u3vrbqsduwm73vo6ypbnurxd&st=ya8sfmdv&raw=1)

We can then unroll this throughout time.

![Recurrent Neuron Unroll](https://www.dropbox.com/scl/fi/9w6amld3t3goehszzt0ps/Recurrent_Neurons_Unroll.png?rlkey=x4i9l7dzhqlt4gmta7jdxvza8&st=gzdm77ot&raw=1)

Cells that are a function of inputs from previous time steps are also known as **memory cells**.

The cells can then be combined to form layers and the same concept of unrolling will be applied, where each layer produces an output and passes it back to the neurons.
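Unrolling through time can be sketched as a loop that reuses the same weights at every step and feeds each hidden state into the next one. The text above does not give an equation for the recurrent layer, so this sketch assumes the common formulation $ h_t = \tanh(W_x x_t + W_h h_{t-1} + b) $; the weight names, sizes, and random values are illustrative only.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Run a simple recurrent layer over a sequence.

    xs:  inputs, shape (steps, input_size)
    W_x: input-to-hidden weights, shape (hidden_size, input_size)
    W_h: hidden-to-hidden weights, shape (hidden_size, hidden_size)
    b:   bias, shape (hidden_size,)
    Returns all hidden states, shape (steps, hidden_size).
    """
    h = np.zeros(W_h.shape[0])  # initial hidden state
    states = []
    for x_t in xs:
        # the previous output h is fed back in alongside the new input x_t
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 3, 4, 5
xs = rng.normal(size=(steps, input_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

states = rnn_forward(xs, W_x, W_h, b)
print(states.shape)  # (5, 4)
```

Each row of `states` depends on every earlier input through `h`, which is what makes the layer a memory cell.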

### 3.1 LSTM (Long Short-Term Memory)
An issue RNNs face is that after a while the network starts to "forget" the first inputs, as the information is lost at each step going through the RNN. To mitigate that, some sort of "long-term memory" is required for the networks.

The LSTM cell was created to address this issue. Its structure is shown in the image below:

![LSTM Cell](https://www.dropbox.com/scl/fi/2qvarm242pkrq08fkeh63/lstm-3.svg?rlkey=4jouh3i5kqdyztnvr50h68xck&st=oxhxhzr3&raw=1)

#### The very first step is called the **forget gate layer**. In this step, we decide what information we are going to forget or throw away from the cell state.

$$ \mathbf{f}_t = \sigma(\mathbf{W}_f \cdot [h_{t - 1}, x_t] + b_f) $$

Where:
- $ f_t $: The forget gate's output at time $ t $. It is a vector of values between 0 and 1, representing how much of each element in the previous cell state should be forgotten.
<br>

- $ \sigma $: The sigmoid activation function, which maps its input to a value between 0 and 1. This ensures that the forget gate's output is in the range [0, 1], where 0 means "completely forget" and 1 means "completely retain".
<br>

- $ W_f $: The weight matrix associated with the forget gate. This matrix is learned during training and is used to transform the combined input.
<br>

- $ [h_{t - 1}, x_t] $: The concatenation of the previous hidden state $ h_{t - 1} $ and the current input $ x_t $. The concatenation allows the forget gate to use information from both the previous step and the current input to decide what to forget.
<br>

- $ b_f $: The bias term for the forget gate, which is added to the weighted sum before applying the activation function.
<br>

#### The next step is to decide what new information we are going to store in the cell state. For that we have two layers.

The first layer is called the **input gate layer** or sigmoid layer. It controls how much of the new information will be added to the cell state.

$$ \underset{\text{The sigmoid layer}}{\mathbf{i}_t = \sigma(\mathbf{W}_i \cdot [h_{t - 1}, x_t] + b_i)} $$

Where:
- $\mathbf{i}_t$ is the output of the input gate at time $t$. It represents how much of the new information should be added to the cell state. It is a vector with values between 0 and 1.
<br>

- $\sigma$ denotes the sigmoid activation function, which squashes the input to a value between 0 and 1.
<br>

- $\mathbf{W}_i$ is the weight matrix for the input gate. It transforms the concatenated vector $[h_{t - 1}, x_t]$ into a space where the sigmoid function can be applied.
<br>

- $[h_{t - 1}, x_t]$ is the concatenation of the previous hidden state $h_{t - 1}$ and the current input $x_t$.
<br>

- $b_i$ is the bias term for the input gate, added to the weighted sum before applying the sigmoid activation function.

The second layer is the **hyperbolic tangent layer**, which proposes the candidate values:

$$ \underset{\text{The hyperbolic tangent layer}}{\tilde{C}_t = \tanh(\mathbf{W}_C \cdot [h_{t - 1}, x_t] + b_C)} $$

Where:
- $\tilde{C}_t$ is the output of the hyperbolic tangent layer at time $t$. It represents the candidate values that can be added to the cell state. The values are in the range $[-1, 1]$.
<br>
- $\tanh$ denotes the hyperbolic tangent activation function, which maps the input to the range $[-1, 1]$.
<br>

- $\mathbf{W}_C$ is the weight matrix for the hyperbolic tangent layer. It transforms the concatenated vector $[h_{t - 1}, x_t]$ into a space where the $\tanh$ function can be applied.
<br>

- $[h_{t - 1}, x_t]$ is the concatenation of the previous hidden state $h_{t - 1}$ and the current input $x_t$.
<br>

- $b_C$ is the bias term for the hyperbolic tangent layer, added to the weighted sum before applying the $\tanh$ activation function.

#### The next step is to update the old cell state ($ C_{t - 1} $) to $ C_t $
$$ C_t = \mathbf{f}_t \odot C_{t - 1} + \mathbf{i}_t \odot \tilde{C}_t $$

Where:
- $\odot$ denotes element-wise multiplication.
<br>

- $C_t$ is the updated cell state at time $t$.
<br>
- $\mathbf{f}_t$ is the forget gate output, determining how much of the previous cell state $C_{t - 1}$ should be retained.
<br>

- $C_{t - 1}$ is the cell state from the previous time step.
<br>

- $\mathbf{i}_t$ is the input gate output, controlling how much of the new candidate values $\tilde{C}_t$ should be added to the cell state.
<br>

- $\tilde{C}_t$ is the candidate values from the hyperbolic tangent layer.

#### The final decision is what to output for $ h_t $
The output gate determines how much of the cell state should be exposed to the hidden state. The formula for the output gate is:

$$ \mathbf{o}_t = \sigma(\mathbf{W}_o \cdot [h_{t - 1}, x_t] + b_o) $$

Where:
- $\mathbf{o}_t$ is the output gate at time $t$. It controls how much of the updated cell state $C_t$ should be exposed as the hidden state $h_t$.
<br>

- $\sigma$ denotes the sigmoid activation function, which squashes the input to a value between 0 and 1.
<br>

- $\mathbf{W}_o$ is the weight matrix for the output gate. It transforms the concatenated vector $[h_{t - 1}, x_t]$ into a space where the sigmoid function can be applied.
<br>

- $[h_{t - 1}, x_t]$ is the concatenation of the previous hidden state $h_{t - 1}$ and the current input $x_t$.
- $b_o$ is the bias term for the output gate, added to the weighted sum before applying the sigmoid activation function.

The hidden state is updated as follows, where $\odot$ denotes element-wise multiplication:

$$ h_t = \mathbf{o}_t \odot \tanh(C_t) $$

Where:
- $h_t$ is the new hidden state at time $t$.
<br>

- $\mathbf{o}_t$ is the output gate, which determines how much of the updated cell state $C_t$ should be exposed in the hidden state.
<br>

- $\tanh(C_t)$ applies the hyperbolic tangent function to the updated cell state $C_t$, normalizing it to the range $[-1, 1]$.
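The gate equations above translate almost line by line into NumPy. The following is a minimal sketch of a single LSTM time step, not the implementation Keras uses internally; the parameter dictionary, the sizes, and the random initialization are assumptions made for the example. The products between gates and states are element-wise.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step following the equations above."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                          # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)                                    # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# Hypothetical parameters: one weight matrix and one bias vector per layer
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

# Run the cell over a short random sequence, carrying h and C forward
h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, C = lstm_step(x_t, h, C, params)

print(h.shape, C.shape)  # (4,) (4,)
```

Note that the cell state `C` is only ever scaled and added to, never passed through a weight matrix, which is what lets information survive across many time steps.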

### 3.2 GRU (Gated Recurrent Unit)
Gated Recurrent Units (GRUs) were proposed in the paper “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” by Cho et al. (2014). The primary motivation behind creating GRUs was to simplify the LSTM architecture while retaining its capability to model long-term dependencies and handle vanishing gradient problems effectively.

![GRU](https://www.dropbox.com/scl/fi/58g9p0qbtpnxkcrnxe0qv/gru-3.svg?rlkey=cgyrqage3v9vewu9p835xmm3f&st=n1fs7889&raw=1)

GRU has a simplified architecture with two gates: the update gate (z) and reset gate (r).

#### The update gate controls how much of the previous hidden state should be retained.

$$ \mathbf{z}_t = \sigma(\mathbf{W}_z \cdot [h_{t - 1}, x_t] + b_z) $$

Where:
- $\mathbf{z}_t$: Update gate at time $t$. It controls how much of the previous memory should be kept.
<br>
- $\sigma$: Sigmoid activation function.
<br>

- $\mathbf{W}_z$: Weight matrix for the update gate.
<br>

- $[h_{t - 1}, x_t]$: Concatenation of the previous hidden state $h_{t - 1}$ and the current input $x_t$.
<br>

- $b_z$: Bias term for the update gate.

#### The reset gate determines how much of the past information to forget. The formula for reset gate is as follows.

$$ \mathbf{r}_t = \sigma(\mathbf{W}_r \cdot [h_{t - 1}, x_t] + b_r) $$

Where:
- $\mathbf{r}_t$: Reset gate at time $t$. It controls how much of the previous memory should be discarded.
<br>

- $\sigma$: Sigmoid activation function.
<br>

- $\mathbf{W}_r$: Weight matrix for the reset gate.
<br>

- $[h_{t - 1}, x_t]$: Concatenation of the previous hidden state $h_{t - 1}$ and the current input $x_t$.
<br>

- $b_r$: Bias term for the reset gate.

#### The candidate hidden state is computed using the reset gate and the current input. It calculates the new memory content candidate based on the reset gate's influence on the previous hidden state.

$$ \tilde{h}_t = \tanh \left( \mathbf{W}_h \cdot [(\mathbf{r}_t \odot h_{t - 1}), x_t] + b_h \right) $$

Where:
- $\tilde{h}_t$ is the candidate hidden state at time $t$.
<br>
- $\tanh$ is the hyperbolic tangent activation function.
<br>

- $\mathbf{W}_h$ is the weight matrix for the candidate hidden state.
<br>

- $\mathbf{r}_t$ is the reset gate at time $t$.
<br>

- $h_{t - 1}$ is the previous hidden state.
<br>

- $x_t$ is the current input at time $t$.
<br>

- $b_h$ is the bias term for the candidate hidden state.
<br>

- $\odot$ denotes element-wise multiplication.

#### The final memory (hidden state) at the current time step is computed as:

$$ h_t = \mathbf{z}_t \odot h_{t - 1} + (1 - \mathbf{z}_t) \odot \tilde{h}_t $$

Where:
- $h_t$: Final hidden state at time $t$.
<br>

- $\mathbf{z}_t$: Update gate at time $t$.
<br>

- $h_{t - 1}$: Previous hidden state.
<br>

- $\tilde{h}_t$: New memory content candidate.
<br>

- $\odot$: Element-wise multiplication.
<br>

- $1 - \mathbf{z}_t$: The complement of the update gate, used to weigh the new memory content.
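The four GRU equations can be sketched as a single NumPy function. This is an illustrative sketch that follows the formulas in this section, not the Keras implementation; the parameter dictionary, sizes, and random initialization are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step following the equations above."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    z_t = sigmoid(params["W_z"] @ concat + params["b_z"])        # update gate
    r_t = sigmoid(params["W_r"] @ concat + params["b_r"])        # reset gate
    # the reset gate scales the previous hidden state before it is combined with x_t
    concat_reset = np.concatenate([r_t * h_prev, x_t])           # [(r_t ⊙ h_{t-1}), x_t]
    h_tilde = np.tanh(params["W_h"] @ concat_reset + params["b_h"])  # candidate state
    # blend of old state and candidate, weighted by the update gate
    return z_t * h_prev + (1.0 - z_t) * h_tilde

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# Hypothetical parameters: one weight matrix and one bias vector per equation
params = {}
for name in ("z", "r", "h"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = gru_step(x_t, h, params)

print(h.shape)  # (4,)
```

Compared with the LSTM, there is no separate cell state: the hidden state `h` is the only value carried from one step to the next.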

## 4. Text Generation with Keras and RNN

In [1]:
import spacy
import requests

In [2]:
# we only need spacy for tokenization
NLP = spacy.load("en_core_web_md", disable=["parser", "tagger", "ner", "lemmatizer"])

In [3]:
# otherwise spaCy will complain about text limit exceeded
NLP.max_length = 1115394

In [4]:
def read_from_url(url):
    """Download a text file and return its contents as a string."""
    response = requests.get(url, timeout=30)
    # raise an HTTPError on failure so an error message is never tokenized as the corpus
    response.raise_for_status()
    return response.text

In [5]:
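def separate_punc(text):
    # lowercase each token; skip any token whose text appears as a substring of
    # the filter string below (punctuation marks and runs of newlines)
    return [
        token.text.lower() for token in NLP(text) if token.text not in "\n\n \n\n\n!\"-$#%&()--.*+,-/;:<=>?@[\\]^_`{|}~\t\n"
    ]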

In [6]:
shakespeare_content = read_from_url("https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt")

In [7]:
unique_chars = sorted(set(shakespeare_content))

In [8]:
len(unique_chars)

65

In [9]:
unique_chars

['\n',
 ' ',
 '!',
 '$',
 '&',
 "'",
 ',',
 '-',
 '.',
 '3',
 ':',
 ';',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [10]:
tokens = separate_punc(shakespeare_content)

In [11]:
len(tokens)

207859

In [12]:
# sequence length is 50 i.e. the model will look at the 50 words
# and predict the next word
training_length = 50 + 1

The code below generates overlapping sequences from the list of tokens. Each sequence starts one token later than the previous one.

**Example**:
`tokens = ["The", "cat", "sat", "on", "the", "mat", "and", "the", "cat", "purred"]`
<br>
`training_length = 2 + 1`

**Sequences after running the code**:
- `["The", "cat", "sat"]`
- `["cat", "sat", "on"]`
- `["sat", "on", "the"]`
- `["on", "the", "mat"]`
- `["the", "mat", "and"]`
- `["mat", "and", "the"]`
- `["and", "the", "cat"]`

Because the loop runs over `range(training_length, len(tokens))`, the last possible window (`["the", "cat", "purred"]`) is not generated. Using `len(tokens) + 1` as the upper bound would include it.


In [13]:
text_sequences = []

for idx in range(training_length, len(tokens)):
    sequence = tokens[idx - training_length:idx]
    text_sequences.append(sequence)

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [15]:
tokenizer = Tokenizer()

In [16]:
tokenizer.fit_on_texts(text_sequences)

In [17]:
sequences = tokenizer.texts_to_sequences(text_sequences)

In [None]:
# dictionary created by the tokenizer
tokenizer.index_word

In [19]:
# the first 10 tokenized words
sequences[0][:10]

[93, 277, 142, 33, 991, 147, 673, 133, 17, 112]

In [20]:
print([tokenizer.index_word[i] for i in sequences[0][:10]])

['first', 'citizen', 'before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak']


In [None]:
# the tokenizer also creates a word count dictionary
tokenizer.word_counts

In [22]:
import numpy as np

In [23]:
sequences = np.array(sequences)

In [24]:
sequences

array([[   93,   277,   142, ...,    64,    83,   498],
       [  277,   142,    33, ...,    83,   498,    28],
       [  142,    33,   991, ...,   498,    28,     2],
       ...,
       [ 2011,    16,    12, ...,   357, 12327,  1084],
       [   16,    12,     8, ..., 12327,  1084,    27],
       [   12,     8,  3871, ...,  1084,    27,   134]])

In [25]:
from keras.utils import to_categorical

In [26]:
# for every row, grab all the columns except the last one
X = sequences[:, :-1]

In [27]:
sequences[:, :-1]

array([[   93,   277,   142, ...,   277,    64,    83],
       [  277,   142,    33, ...,    64,    83,   498],
       [  142,    33,   991, ...,    83,   498,    28],
       ...,
       [ 2011,    16,    12, ...,   208,   357, 12327],
       [   16,    12,     8, ...,   357, 12327,  1084],
       [   12,     8,  3871, ..., 12327,  1084,    27]])

In [28]:
# for every row, grab the last column
y = sequences[:, -1]

In [29]:
sequences[:, -1]

array([ 498,   28,    2, ..., 1084,   27,  134])

In [30]:
# Tokenizer word indices start at 1 (index 0 is reserved for padding),
# so the one-hot vectors need one more class than there are words
y = to_categorical(y, num_classes=len(tokenizer.word_counts) + 1)

In [31]:
# note that 50 is the sequence length
X.shape

(207808, 50)

In [32]:
sequence_length = X.shape[1]

In [33]:
vocab_size = len(tokenizer.word_counts) + 1

### Build the model

In [34]:
from keras.models import Sequential
from keras.layers import Input, Dense, LSTM, Embedding

In [35]:
model = Sequential()

model.add(Input(shape=(sequence_length,)))
model.add(Embedding(vocab_size, sequence_length))
model.add(LSTM(sequence_length, return_sequences=True))
model.add(LSTM(sequence_length))
model.add(Dense(sequence_length * 2, activation="relu"))

model.add(Dense(vocab_size, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [36]:
model.summary()

In [None]:
# don't have enough memory on my local PC (training may take a very long time)
history = model.fit(X, y, batch_size=10, epochs=20, verbose=0)

### Save the model

In [None]:
from pickle import dump, load

In [None]:
model.save("shakespeare_model.h5")

In [None]:
# use a context manager so the file is flushed and closed after pickling
with open("shakespeare_tokenizer", "wb") as f:
    dump(tokenizer, f)

### Generate new text

In [None]:
from tensorflow.keras.utils import pad_sequences

In [None]:
output_txt = []
input_txt = "Than of your graces and your gifts to tell; Which borrowed from this holy fire of Love That they elsewhere might dart their"

# generate 20 words
for idx in range(20):
    encoded = tokenizer.texts_to_sequences([input_txt])[0]
    
    # keep the last `sequence_length` tokens if the text is too long,
    # or left-pad with zeros if it is too short
    encoded_padded = pad_sequences(
        [encoded],
        maxlen=sequence_length,
        truncating="pre"
    )
    
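    # predict returns shape (1, vocab_size); take the probability vector for our single input
    pred_probs = model.predict(encoded_padded, verbose=0)[0]
    # the vector is 1-D, so argmax needs no axis argument
    pred_word_idx = int(np.argmax(pred_probs))
    pred_word = tokenizer.index_word[pred_word_idx]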
    
    input_txt += " " + pred_word
    
    output_txt.append(pred_word)
    
print(" ".join(output_txt))

### The End