# Solving NLP Problems with Recurrent Neural Networks

## Outline

- [Part 1: Understanding RNNs](#part1)
- [Part 2: Part of Speech Tagging](#part2)
- [Part 3: Text Generation](#part3)
- [Part 4: Sentiment Analysis](#part4)

In [None]:
%%javascript
IPython.load_ipython_extensions([
  "nb-mermaid/nb-mermaid"
]);

## Part 1: Understanding RNNs <a id='part1'></a>

### 1.1 Sequence to Sequence Models

* Sequence data is data that is ordered in some way. For example, a sequence of words in a sentence, a sequence of characters in a word, a sequence of pixels in an image, a sequence of notes in a song, a sequence of frames in a video, and so on.

* Unlike Bag-of-Words models, sequence models can take into account the order of the words in a sentence. This makes them ideal for tasks such as machine translation, speech recognition, and text summarization.

* We will follow the standard conventions and model sequence data as follows:

$$x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)})$$

Where $T$ is the length of the sequence and $x_t^{(i)}$ is the $t^{th}$ element of the $i^{th}$ sequence in the training set.

### 1.2 Different categories of sequence models

* one to one - input layer is a single value (vector or scalar), output layer is a single value (vector or scalar). For example, image classification is a one to one model.
* one to many - input layer is a single value (vector or scalar), output layer is a sequence. For example, image captioning is a one to many model.
* many to one - input layer is a sequence, output layer is a single value (vector or scalar). For example, sentiment analysis is a many to one model.
* many to many - input layer is a sequence, output layer is a sequence. For example, machine translation is a many to many model. Variants of this model differ in whether the input and output sequences are synchronized. In frame-by-frame video classification, each input frame has a matching output label, so the sequences are synchronized. In machine translation, the output sequence can have a different length and is produced only after the input has been read, so they are not synchronized.

<center><img src="http://karpathy.github.io/assets/rnn/diags.jpeg" width="800" height="300"></center>

N.B.: a rectangle is a vector and arrows are functions. 

source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

### 1.2 Introduction to RNNs

Recurrent Neural Networks (RNNs) represent a class of artificial neural networks designed to recognize patterns in sequences of data, such as time series data, speech, text, and more. They are distinguished by their "memory", realized through loops that allow information persistence—a feature critical for tasks requiring the understanding of context or the handling of sequential data.

1. **Sequential Data Handling**: RNNs are specifically structured to handle sequential data by maintaining a form of "memory" of previous inputs while processing current ones. This is crucial in fields like Natural Language Processing (NLP) where the order of words (sequence) carries significant meaning.

2. **Temporal Dynamics**: Unlike traditional feedforward neural networks, RNNs possess connections that loop back, enabling them to maintain information over time. This aspect introduces temporal dynamics into the network, allowing it to keep track of temporal dependencies in the input data.

3. **Backpropagation Through Time (BPTT)**: Training RNNs involves a variant of backpropagation called Backpropagation Through Time (BPTT), which unrolls the network over time and computes gradients to update the weights to minimize a loss function.

4. **Vanishing and Exploding Gradient Problems**: RNNs are known to suffer from vanishing and exploding gradient problems during training, which are challenges tied to the mathematical computations of gradients in the network. Variants like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) have been introduced to mitigate these issues.

5. **Applications**: RNNs find applications across a variety of fields including NLP for tasks like language modeling, translation, and sentiment analysis, and in other domains like time-series prediction, and audio recognition.

6. **Statistical Concepts**: The functioning and training of RNNs are deeply rooted in statistical concepts such as probability theory and optimization. They represent a probabilistic approach to modeling sequential data.

#### RNNs in TensorFlow and PyTorch

Implementing RNNs using frameworks like TensorFlow and PyTorch is a standard practice in the field. Both frameworks provide user-friendly APIs for building and training RNNs. 

- **TensorFlow**:
   - Official Documentation on RNNs: [TensorFlow Recurrent Neural Networks](https://www.tensorflow.org/guide/keras/rnn)
   - Tutorial: [Text generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation)

- **PyTorch**:
   - Official Documentation on RNNs: [PyTorch Recurrent Neural Networks](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html)
   - Tutorial: [Time Sequence Prediction](https://pytorch.org/tutorials/beginner/time_sequence_prediction_train.html)
  

```python
# TensorFlow
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.SimpleRNN(128, activation='tanh', input_shape=(None, 1)),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')
#...

# PyTorch
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self):
        super(SimpleRNN, self).__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, 1)
    
    def forward(self, x):
        out, _ = self.rnn(x)
        out = self.fc(out[:, -1, :])
        return out

model = SimpleRNN()
#...
```

In these snippets, a simple RNN is defined with 128 hidden units. In TensorFlow, the `SimpleRNN` layer is used, while in PyTorch, the `nn.RNN` module is utilized. The network is then compiled (TensorFlow) or instantiated (PyTorch), ready to be trained on your data.

### 1.3 Architecture of RNNs

#### Standard feedforward neural network

Standard Feedforward Neural Network

```mermaid
    graph BT
    i[Input] --> h((Hidden Layer))
    h --> o[Output]
    
```


Recurrent Neural Network

```mermaid
    graph BT
    i[Input] --> h((Hidden Layer))
    h --> h
    h --> o[Output]
```

Recall that in a standard feedforward network, data flows from the input layer to the hidden layer and then to the output layer. In a recurrent neural network, the hidden layer receives two things at each time step: the input for the current time step and its own hidden state from the previous time step. This lets the network process the data sequentially while carrying context forward.
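
The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how Keras or PyTorch implement it: the names `rnn_forward`, `W_xh`, `W_hh`, and `b_h` are assumptions made for this example, the sizes are arbitrary, and the weights are random rather than trained.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a single-layer tanh RNN over one sequence.

    x has shape (T, input_dim); returns hidden states of shape (T, hidden_dim).
    """
    T = x.shape[0]
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)          # initial hidden state
    states = []
    for t in range(T):
        # the new state depends on the current input AND the previous state
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4    # hypothetical sizes
x = rng.normal(size=(T, input_dim))
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4)
```

The last row, `states[-1]`, is what a many-to-one model would pass to its output layer; a many-to-many model would use every row.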

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_07.png?raw=true" width="800" height="600"></center>

img source: https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_07.png

### 1.4 Single and Multi Layer RNNs


#### Single-layer RNNs

1. **Architecture**: A single-layer RNN consists of a single layer of recurrent neurons. Each neuron receives input from the current time step and also has a recurrent connection that captures information from the previous time step.
   
2. **Recurrent Connections**: These connections enable the network to maintain a form of memory, which is crucial for processing sequences of data. The state of the recurrent neurons at any given time step is influenced by the input at that time step and the state of the recurrent neurons at the previous time step.

3. **Training**: Training a single-layer RNN typically involves unfolding the network through time and applying backpropagation, a process known as Backpropagation Through Time (BPTT).

4. **Limitations**: Single-layer RNNs are often limited in their ability to capture long-term dependencies in the data due to the vanishing or exploding gradient problem, which arises during the training process.

#### Multi-layer RNNs

1. **Architecture**: Multi-layer RNNs, often referred to as Deep Recurrent Neural Networks, consist of multiple layers of recurrent neurons. Each layer receives input from the preceding layer, which allows the network to learn hierarchical representations of the data.
   
2. **Hierarchical Learning**: The ability to learn hierarchical representations is beneficial in many tasks, as it enables the network to capture more complex patterns in the data. Each layer can learn to represent different levels of abstraction, which can be particularly useful in tasks like language modeling or speech recognition.
   
3. **Training**: Training multi-layer RNNs also involves BPTT. However, the presence of multiple layers can exacerbate the vanishing or exploding gradient problem and often necessitates the use of techniques like gradient clipping or advanced recurrent units like LSTMs or GRUs to mitigate these issues.
   
4. **Improved Performance**: Multi-layer RNNs often exhibit better performance on complex tasks as compared to single-layer RNNs due to their ability to learn more complex representations of the data.

#### Summary
In summary, single-layer RNNs consist of a single layer of recurrent neurons, making them simpler but often less capable of handling complex patterns in data. On the other hand, multi-layer RNNs have multiple layers of recurrent neurons, which enable them to learn hierarchical representations of the data, often yielding better performance on complex tasks.

Here's a simplified Python code snippet to illustrate the difference between single and multi-layer RNNs using TensorFlow:

```python
import tensorflow as tf

# Single-layer RNN
single_layer_rnn = tf.keras.models.Sequential([
    tf.keras.layers.SimpleRNN(128, activation='tanh', input_shape=(None, 1))
])

# Multi-layer RNN
multi_layer_rnn = tf.keras.models.Sequential([
    tf.keras.layers.SimpleRNN(128, activation='tanh', input_shape=(None, 1), return_sequences=True),
    tf.keras.layers.SimpleRNN(128, activation='tanh')
])

"""
The key difference here is the 'return_sequences=True' parameter in the first layer of the multi-layer RNN, which ensures that the output from the first layer is passed as a sequence to the second layer.
"""
```

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_04.png?raw=true" width="800" height="600"></center>

### 1.5 RNN activation functions

Activation functions are a pivotal part of neural networks, including Recurrent Neural Networks (RNNs). They introduce non-linearity, without which a stack of layers (or time steps) would collapse into a single linear map, and so allow the network to capture complex patterns in the data.

#### Activation Functions for RNNs:

1. **Hyperbolic Tangent (tanh)**: 
   - The tanh function squashes its input to be between -1 and 1, making it a good choice for maintaining the values within a reasonable range during backpropagation through time. 
   
2. **Rectified Linear Unit (ReLU) and its Variants**:
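   - ReLU is popular for its simplicity, and because its gradient is 1 for positive inputs it is less prone to vanishing gradients. In recurrent layers, however, its unbounded output can let activations and gradients grow over many time steps, so it is usually paired with careful initialization or gradient clipping.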
   - Variants of ReLU like Leaky ReLU or Parametric ReLU can be used to prevent dead neurons and the vanishing gradient problem.

3. **Sigmoid**:
   - The sigmoid function squashes its input to be between 0 and 1. It is particularly useful in binary classification tasks like in the output layer of a network.

4. **Gated Activation Functions**:
   - Gated recurrent units (GRUs) and Long Short-Term Memory units (LSTMs) use gated activation functions to control the flow of information through the network which can be particularly useful in learning long-term dependencies.

#### Considerations for Selecting an Activation Function:

1. **Vanishing and Exploding Gradients**:
   - The choice of activation function can influence the stability of the training process. For instance, ReLU and its variants can mitigate the vanishing gradient problem, a common issue in RNNs.
   
2. **Learning Long-term Dependencies**:
   - Gated activation functions in LSTMs and GRUs help in learning long-term dependencies by controlling the flow of information, which can be crucial in many sequence processing tasks.
   
3. **Computational Efficiency**:
   - Simpler activation functions like ReLU are computationally more efficient as compared to more complex gated activation functions.
   
4. **Task-Specific Requirements**:
   - The nature of the task at hand also dictates the choice of the activation function. For instance, a sigmoid activation function might be suitable for binary classification tasks.

5. **Empirical Performance**:
   - Often the choice of activation function might come down to empirical performance on a specific task or dataset.

#### Summary

In RNNs, the choice of activation function is critical. Popular choices include tanh, ReLU and its variants, sigmoid, and gated activation functions like those used in LSTMs and GRUs. The decision on which activation function to use can be influenced by a variety of factors including the problem of vanishing and exploding gradients, the necessity to learn long-term dependencies, computational efficiency, the specific requirements of the task, empirical performance, and theoretical insights into the data or problem at hand.
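
To make the link between activation choice and gradient size concrete, the sketch below evaluates the derivatives of sigmoid, tanh, and ReLU at a few points. The helper names (`d_sigmoid`, `d_tanh`, `d_relu`) are assumptions for this example. The sigmoid derivative never exceeds 0.25 and the tanh derivative never exceeds 1, and both shrink toward zero for large inputs, while the ReLU derivative is exactly 0 or 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    return 1.0 if z > 0 else 0.0

# compare the local gradients at a negative, zero, and positive pre-activation
for z in (-4.0, 0.0, 4.0):
    print(f"z={z:+.1f}  sigmoid'={d_sigmoid(z):.4f}  tanh'={d_tanh(z):.4f}  relu'={d_relu(z):.1f}")
```

During backpropagation through time these local derivatives are multiplied together once per time step, which is why saturating activations make long sequences harder to train.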

##### TensorFlow example

```python
import tensorflow as tf

# Creating an RNN with a tanh activation function
rnn_tanh = tf.keras.layers.SimpleRNN(units=128, activation='tanh', input_shape=(None, 1))

# Creating an RNN with a ReLU activation function
rnn_relu = tf.keras.layers.SimpleRNN(units=128, activation='relu', input_shape=(None, 1))
```

##### PyTorch example

```python
import torch
import torch.nn as nn

# Creating an RNN with a tanh activation function
rnn_tanh = nn.RNN(input_size=1, hidden_size=128, nonlinearity='tanh', batch_first=True)

# Creating an RNN with a ReLU activation function
rnn_relu = nn.RNN(input_size=1, hidden_size=128, nonlinearity='relu', batch_first=True)
```



### 1.5 Some problems with RNNs

#### Vanishing and Exploding Gradients

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_08.png?raw=true" width="900" height="500"></center>

The training of neural networks involves a process known as backpropagation, which is the method of computing gradients of the loss function with respect to the model parameters for updating these parameters. However, this process can sometimes be hindered due to the issues of vanishing and exploding gradients, particularly in recurrent neural networks (RNNs) which deal with sequential data.

##### Vanishing Gradient Problem

1. **Mechanism**: The vanishing gradient problem arises when the gradients of the loss function become too small for the network to learn effectively. As the gradient values approach zero, the updates to the weights during the training process become negligible, leading to a network that cannot learn from the data.

2. **Cause**: This often occurs in deep networks or RNNs with long sequences due to the repeated multiplication of gradients through layers or time steps, especially with saturating activation functions like sigmoid or tanh, whose derivatives are at most 0.25 and 1 respectively and approach zero for large inputs.

3. **Impact**: The vanishing gradient problem can cause training to be very slow, and the network may get stuck during training, leading to poor performance.

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_08.png?raw=true" width="600" height="250"></center>

##### Exploding Gradient Problem

1. **Mechanism**: Conversely, the exploding gradient problem occurs when gradient values become too large, leading to very large updates to the weights during the training process.

2. **Cause**: This can occur due to the repeated multiplication of gradients through layers or time steps, especially in the presence of large parameter values or large input values.

3. **Impact**: The exploding gradient problem can cause training to diverge, leading to an unstable network and, often, poor performance.
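
The effect of repeated multiplication can be seen with plain arithmetic. The sketch below is a deliberate simplification: it treats the per-step gradient factor as a single scalar, whereas in a real RNN it is a Jacobian built from the recurrent weights and the activation derivative. The function name `gradient_scale` is an assumption for this example.

```python
def gradient_scale(factor, num_steps):
    """Magnitude of a gradient after being multiplied by `factor` at every time step."""
    return factor ** num_steps

# a factor below 1 vanishes, a factor above 1 explodes, a factor of 1 is stable
for factor in (0.5, 1.0, 1.5):
    print(factor, [gradient_scale(factor, t) for t in (1, 10, 50)])
```

After 50 steps a factor of 0.5 has shrunk the gradient below 1e-15, while a factor of 1.5 has grown it past 1e8.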

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_08.png?raw=true" width="600" height="250"></center>

##### Desirable Scenario

1. **Controlled Gradient Magnitude**: A desirable scenario is one where the magnitudes of the gradients are controlled and remain within a reasonable range throughout the training process.

2. **Stable Training**: Stable and consistent training with a well-tuned learning rate, proper initialization of weights, and potentially regularization to prevent overfitting.

3. **Mitigation Techniques**: Employing techniques to mitigate vanishing and exploding gradients, such as:
   - Gradient clipping to prevent gradients from exceeding a defined threshold.
   - Truncated backpropagation through time (TBPTT) which limits the number of time steps considered during backpropagation.
   - Using advanced recurrent units like Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU) which are designed to combat the vanishing gradient problem.

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_08.png?raw=true" width="600" height="250"></center>

##### Summary

In summary, the vanishing gradient problem is characterized by gradients becoming too small to effectively update the network weights during training, often caused by the choice of activation function or network depth. On the other hand, the exploding gradient problem is marked by overly large gradients causing unstable training and potentially divergent behavior. A desirable scenario maintains gradient magnitudes within a controlled range, enabling stable training and effective learning. Techniques like gradient clipping, proper initialization, and the use of particular activation functions or advanced recurrent units can help achieve this scenario.
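
Gradient clipping, mentioned above, is simple enough to sketch in pure Python. This version clips by the global L2 norm. The function name `clip_by_global_norm` and the flat list of gradient values are assumptions for illustration; in practice the frameworks provide this (for example the `clipnorm` argument of Keras optimizers or `torch.nn.utils.clip_grad_norm_` in PyTorch).

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [3.0, 4.0]                    # L2 norm is 5.0
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped)
```

Clipping rescales the whole gradient, so its direction is preserved and only its length changes. It addresses exploding gradients but does nothing for vanishing ones.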

### 1.6 Long Short-Term Memory (LSTM) Units

* "Long Short-Term Memory" S. Hochreiter, J. Schmidhuber, _Neural Computation_ 9(8):1735-1780, 1997
* "Learning to Forget: Continual Prediction with LSTM" F. A. Gers, J. Schmidhuber, _Neural Computation_ 12(10):2451-2471, 2000

#### LSTM "Memory Cell"

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_09.png?raw=true" width="800" height="450"></center>

$\odot$ = element-wise multiplication \
$\oplus$ = element-wise summation \
$x^{(t)}$ = input vector at time step $t$ \
$h^{(t-1)}$ = hidden units at time $t - 1$ \
$\tilde{C}$ = candidate values \
$\text{forget-gate}$ = forget gate allows the network to forget information from the previous time step \
$\text{input-gate}$ = input gate allows the network to update the memory cell \
$\text{output-gate}$ = output gate decides how to update the values of hidden units
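
The gates listed above combine into a single update per time step. The NumPy sketch below follows the standard LSTM formulation; it is an illustration under stated assumptions rather than a reproduction of any framework's internals. The function name `lstm_step`, the dictionaries `W`, `U`, `b` keyed by gate, and the sizes are all made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U, b are dicts keyed by 'f', 'i', 'g', 'o' (forget, input, candidate, output).
    """
    f = sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])   # forget gate
    i = sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])   # input gate
    g = np.tanh(x_t @ W['g'] + h_prev @ U['g'] + b['g'])   # candidate values (C tilde)
    o = sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])   # output gate
    c_t = f * c_prev + i * g          # element-wise: keep part of the old memory, add new
    h_t = o * np.tanh(c_t)            # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4          # hypothetical sizes
W = {k: rng.normal(scale=0.1, size=(input_dim, hidden_dim)) for k in 'figo'}
U = {k: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for k in 'figo'}
b = {k: np.zeros(hidden_dim) for k in 'figo'}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The cell state `c_t` is updated additively (`f * c_prev + i * g`) rather than being squashed through an activation at every step, which is the design choice that helps gradients survive over long sequences.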


## Part 2: Part of Speech Tagging <a id='part2'></a>

We encountered Part of Speech Tags in previous lectures, but we used existing models. Let's train a new model from scratch.

In [None]:
# imports
import nltk
import numpy as np
import requests
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint
from nltk.corpus import treebank, brown, conll2000
from sklearn.model_selection import train_test_split

#### NLTK Part of Speech

We can build our own data set by combining existing tagged corpora.

In [None]:
nltk.download('treebank')
nltk.download('brown')
nltk.download('conll2000')

#### Standardize the POS Tags

We will standardize the tag sets with the universal tag set.

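The universal tag set used by NLTK is a list of 12 coarse tags (such as NOUN, VERB, and ADJ) designed to apply across languages. The Universal Dependencies project maintains an extended set of 17 tags derived from it: https://universaldependencies.org/u/pos/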

In [None]:
nltk.download('universal_tagset')

In [None]:
# combine the three corpora; each corpus is added once so no sentence is duplicated across the splits
sentences_tagged = treebank.tagged_sents(tagset='universal') + brown.tagged_sents(tagset='universal') + conll2000.tagged_sents(tagset='universal')

#### Data visualization

In [None]:
print('Sentence example:', sentences_tagged[0])
print('Dataset size: ', len(sentences_tagged))

#### Data preprocessing

Let's get the data into a shape we can train our model on.

In [None]:
sents, sent_tags = [], []

for s in sentences_tagged:
    sentence, tags = zip(*s)
    sents.append(list(sentence))
    sent_tags.append(list(tags))

#### Visualize the data

In [None]:
print(sents[0])
print(sent_tags[0])

In [None]:
print(len(sents), len(sent_tags))

#### Create our train, validation, and test sets

In [None]:
train_ratio = 0.7
validation_ratio = 0.2
test_ratio = 0.1

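# first split: 70% training data, 30% held out
X_train, X_rem, y_train, y_rem = train_test_split(sents, sent_tags, test_size=1 - train_ratio, random_state=42)

# second split: divide the held-out 30% into validation (20% of the total) and test (10% of the total)
X_val, x_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=test_ratio/(test_ratio + validation_ratio), random_state=42)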

In [None]:
print(f'X_train: {len(X_train)}, X_val: {len(X_val)}, x_test: {len(x_test)}')
print(f'y_train: {len(y_train)}, y_val: {len(y_val)}, y_test: {len(y_test)}')
print(f'X_train[0]: {X_train[0]}')
print(f'y_train[0]: {y_train[0]}')
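
The `test_size` passed to the second split can look odd at first. The arithmetic sketch below, assuming a hypothetical data set of 1,000 sentences, shows why `test_ratio / (test_ratio + validation_ratio)` yields the intended 70/20/10 split. scikit-learn's own rounding may differ by a sample on real data.

```python
train_ratio, validation_ratio, test_ratio = 0.7, 0.2, 0.1
n = 1000                                   # hypothetical dataset size

holdout = round(n * (1 - train_ratio))     # first split: everything that is not training data
second_test_size = test_ratio / (test_ratio + validation_ratio)   # share of the holdout used for testing
n_test = round(holdout * second_test_size)
n_val = holdout - n_test
n_train = n - holdout
print(n_train, n_val, n_test)              # 700 200 100
```

The second ratio is relative to the held-out 30%, not to the full data set, so one third of the holdout corresponds to 10% of the total.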


#### Tokenize our dataset

In [None]:
tokenizer = keras.preprocessing.text.Tokenizer(oov_token='UNK')

In [None]:
# fit the tokenizer on the documents
tokenizer.fit_on_texts(X_train)
print(f'Vocabulary size: {len(tokenizer.word_index)}')

In [None]:
tag_tokenizer = keras.preprocessing.text.Tokenizer()
tag_tokenizer.fit_on_texts(y_train)

In [None]:
print(f'POS Tags: {len(tag_tokenizer.word_index)}')

In [None]:
tag_tokenizer.get_config()

In [None]:
tag_tokenizer.word_index

#### Vectorize our sentences

In [None]:
X_train_seqs = tokenizer.texts_to_sequences(X_train)

In [None]:
print(f'X_train_seqs[0]: {X_train_seqs[0]}')
print(f'X_train[0]: {X_train[0]}')
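
To see what the tokenizer is doing conceptually, here is a pure-Python sketch of a frequency-ranked word index with an out-of-vocabulary token. It is an approximation for illustration, not the Keras implementation: the function names `build_word_index` and `to_sequence` and the toy sentences are assumptions, and tie-breaking between equally frequent words may differ from Keras.

```python
from collections import Counter

def build_word_index(sentences, oov_token='UNK'):
    """Map each lower-cased token to an integer; index 0 is left free for padding."""
    counts = Counter(tok.lower() for sent in sentences for tok in sent)
    word_index = {oov_token: 1}            # the OOV token takes index 1
    for rank, (tok, _) in enumerate(counts.most_common(), start=2):
        word_index[tok] = rank             # more frequent words get smaller indices
    return word_index

def to_sequence(sentence, word_index, oov_token='UNK'):
    """Replace each token by its index, falling back to the OOV index."""
    return [word_index.get(tok.lower(), word_index[oov_token]) for tok in sentence]

toy_train = [['The', 'dog', 'barks'], ['The', 'cat', 'sleeps']]   # hypothetical data
toy_index = build_word_index(toy_train)
print(to_sequence(['The', 'bird', 'sleeps'], toy_index))
```

The unseen word `bird` maps to index 1, which is why fitting the tokenizer on the training split only still lets the validation and test sentences be encoded.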

#### Vectorize our tags

In [None]:
y_train_seqs = tag_tokenizer.texts_to_sequences(y_train)

In [None]:
print(f'y_train_seqs[0]: {y_train_seqs[0]}')
print(f'y_train[0]: {y_train[0]}')

#### Validation data

In [None]:
X_val_seqs = tokenizer.texts_to_sequences(X_val)
y_val_seqs = tag_tokenizer.texts_to_sequences(y_val)

#### Padding

Padding is a way to make sure all of our sentences are the same length. We will use the pad_sequences function from Keras.

In [None]:
MAX_LEN = len(max(X_train_seqs, key=len))
print(f'Max length: {MAX_LEN}')

In [None]:
X_train_padded = keras.preprocessing.sequence.pad_sequences(X_train_seqs, maxlen=MAX_LEN, padding='post')
print(f'X_train_padded[0]: {X_train_padded[0]}')

In [None]:
y_train_padded = keras.preprocessing.sequence.pad_sequences(y_train_seqs, maxlen=MAX_LEN, padding='post')
X_val_padded = keras.preprocessing.sequence.pad_sequences(X_val_seqs, maxlen=MAX_LEN, padding='post')
y_val_padded = keras.preprocessing.sequence.pad_sequences(y_val_seqs, maxlen=MAX_LEN, padding='post')
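
A pure-Python sketch of what post-padding does is shown below. The helper `pad_post` is an assumption written for this example. It mirrors the behaviour of `pad_sequences` with `padding='post'` and the default `truncating='pre'`, under which sequences longer than `maxlen` lose items from the front.

```python
def pad_post(seqs, maxlen, value=0):
    """Pad each sequence on the right with `value`; longer sequences keep their last `maxlen` items."""
    padded = []
    for s in seqs:
        s = list(s)[-maxlen:]                        # truncate from the front if too long
        padded.append(s + [value] * (maxlen - len(s)))
    return padded

print(pad_post([[5, 2], [7, 1, 4, 9]], maxlen=3))   # [[5, 2, 0], [1, 4, 9]]
```

Because `MAX_LEN` is computed on the training split, a longer validation or test sentence would be truncated in the same way, for both its words and its tags.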

#### Convert our tags to categorical

In [None]:
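# fix the number of classes (tags + padding) so every split gets the same shape
y_train_categories = keras.utils.to_categorical(y_train_padded, num_classes=len(tag_tokenizer.word_index) + 1)
print(f'y_train_categories[0][0]: {y_train_categories[0][0]}')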

In [None]:
# check the label of the first token in the first sentence
idx = np.argmax(y_train_categories[0][0])
print(f'idx: {idx}')
print(f'Label: {tag_tokenizer.index_word[idx]}')

In [None]:
# one hot encode the validation labels (same number of classes as the training labels)
y_val_categories = keras.utils.to_categorical(y_val_padded, num_classes=len(tag_tokenizer.word_index) + 1)
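
One-hot encoding turns each integer tag into a vector with a single 1. The NumPy sketch below shows the shape change on a toy example; the helper `one_hot` and the toy tag ids are assumptions for illustration and stand in for what `to_categorical` does.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn an integer array of shape (...,) into a one-hot array of shape (..., num_classes)."""
    return np.eye(num_classes, dtype='float32')[np.asarray(labels)]

toy_tags = np.array([[1, 3, 2, 0]])        # one padded sentence; 0 is the padding id
encoded = one_hot(toy_tags, num_classes=4)
print(encoded.shape)                       # (1, 4, 4)
print(encoded[0, 1])                       # [0. 0. 0. 1.]
```

Taking `argmax` over the last axis recovers the original integer tags, which is how the label check above works.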

#### Model architecture

In [None]:
num_tokens = len(tokenizer.word_index) + 1 # add 1 for padding
embedding_dim = 128
num_classes = len(tag_tokenizer.word_index) + 1 # add 1 for padding

In [None]:
tf.random.set_seed(42)

model = keras.Sequential()

model.add(layers.Embedding(input_dim=num_tokens,
                           output_dim=embedding_dim,
                           input_length=MAX_LEN,
                           mask_zero=True))

model.add(
    layers.Bidirectional(
        layers.LSTM(128,
                    return_sequences=True,
                    kernel_initializer=tf.keras.initializers.random_normal(seed=42)
                    )
        )
    )

model.add(layers.Dense(num_classes, activation='softmax', kernel_initializer=tf.keras.initializers.random_normal(seed=42)))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

history = model.fit(X_train_padded,
                    y_train_categories,
                    epochs=50,
                    batch_size=128,
                    validation_data=(X_val_padded, y_val_categories),
                    callbacks=[callback]
                    )


#### Preprocess the test data and evaluate the model

In [None]:
X_test_seqs = tokenizer.texts_to_sequences(x_test)
x_test_padded = keras.preprocessing.sequence.pad_sequences(X_test_seqs, maxlen=MAX_LEN, padding='post')

y_test_seqs = tag_tokenizer.texts_to_sequences(y_test)
y_test_padded = keras.preprocessing.sequence.pad_sequences(y_test_seqs, maxlen=MAX_LEN, padding='post')
y_test_categories = keras.utils.to_categorical(y_test_padded, num_classes=num_classes)

In [None]:
model.evaluate(x_test_padded, y_test_categories)

#### Predictions

We want to productionize our model. We will use the model to predict the part of speech tags for new sentences.

In [None]:
client_data = [
    'The University was closed today because it snowed.',
    'The White House released an executive order on the use of AI in government.',
    'Richard Feynman was a professor at Caltech.',
]

In [None]:
def predict_(sentences: list[str]) -> list[list[tuple[str, str]]]:
    sent_seqs = tokenizer.texts_to_sequences(sentences)
    sents_padded = keras.preprocessing.sequence.pad_sequences(sent_seqs, 
                                                              maxlen=MAX_LEN,
                                                              padding='post')
    
    # predict the tags of the client sentences; the final Dense layer already
    # applies softmax, so each row is a probability distribution over the tags
    predictions = model.predict(sents_padded)
    print(f'predictions: {predictions[0][0]}')
    
    sentence_tags = []
    
    for i, preds in enumerate(predictions):
        
        # extract the indices of the highest predictions, ignoring the padded positions
        tags_seq = [np.argmax(p) for p in preds[:len(sent_seqs[i])]]
        
        words = [tokenizer.index_word[w] for w in sent_seqs[i]]
        # index 0 is the padding class and has no entry in index_word;
        # 'PAD' is a placeholder label chosen here
        tags = [tag_tokenizer.index_word.get(t, 'PAD') for t in tags_seq]
        sentence_tags.append(list(zip(words, tags)))
    
    return sentence_tags

In [None]:
tagged_client_sents = predict_(client_data)
print(f'Sample: {tagged_client_sents}')

In [None]:
# mount Google Drive (Colab only)
from google.colab import drive
drive.mount('/content/drive')

# save the model
model.save('/content/drive/MyDrive/Colab Notebooks/13_Recurrent_Neural_Networks/model.h5')

## Part 3: Text Generation <a id='part3'></a>

Let's see if we can improve on our Tolkien text generator. Recall that our Naive Bayes model learned the character-level probabilities of the text, but its output was not very good. Let's see if an RNN can do better.

### 3.1 Character Level Text Generation

In [None]:
from pathlib import Path

import nltk
import numpy as np
import requests
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint
from nltk.corpus import treebank, brown, conll2000
from sklearn.model_selection import train_test_split


### 3.2 Load data and preprocess

In [None]:
here = Path.cwd()

# go one level up
parent = here.parent

# read the LOTR files into memory
files = list(parent.glob('datasets/*.txt'))

corpus = []

for f in files:
    # if LOTR is in the file name
    if 'lotr' in f.name.lower():
        print(f'Reading {f.name}')
        with open(f, 'r') as file:
            corpus.append(file.read())

In [None]:
def clean_corpus(corpus: list[str]) -> str:
    # concatenate the corpus into a single string
    corpus = ' '.join(corpus)
    # remove unnecessary whitespace
    corpus = " ".join(corpus.split())
    # remove underscores
    corpus = corpus.replace('_', '')
    
    return corpus

### 3.2 Visualize the data

In [None]:
tolkien = clean_corpus(corpus)
tolkien[:1000]

In [None]:
tolkien = tolkien[:1000000]

### 3.3 Tokenize the data

In [None]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([tolkien])

#### 3.3.1 Examine the config

In [None]:
tokenizer.get_config()

#### 3.3.2 Vocabulary size

In [None]:
print(f'Vocabulary size: {len(tokenizer.word_index)}')

In [None]:
seq = tokenizer.texts_to_sequences([tolkien])[0]
print(f'Text length: {len(seq)}')

In [None]:
tokenizer.sequences_to_texts([seq[:100]])

### 3.4 Format the data

In [None]:
# create a dataset from the sequence
slices = tf.data.Dataset.from_tensor_slices(seq)
type(slices)

In [None]:
# generator to list
list(slices.take(5).as_numpy_iterator())

In [None]:
seq[:10]

### 3.5 Training data

In [None]:
input_time_steps = 100  # length of the input sequences
window_size = input_time_steps + 1
windows = slices.window(window_size, shift=1, drop_remainder=True)  # shift by one for next character prediction

In [None]:
for w in windows.take(3):
    arr = list(w.as_numpy_iterator())
    print(len(arr), arr)

### 3.6 Create dataset

In [None]:
# create a dataset from the windows
dataset = windows.flat_map(lambda w: w.batch(window_size))

for d in dataset.take(2):
    print(d)

In [None]:
# create the batches for training
batch_size = 32

batches = dataset.shuffle(1024).batch(batch_size)

for b in batches.take(2):
    print(b)

In [None]:
xy_batches = batches.map(lambda batch: (batch[:, :-1], batch[:, 1:]))

for b in xy_batches.take(2):
    print(b)

In [None]:
for b in xy_batches.take(1):
  print("x1 length: ", len(b[0][0].numpy()))
  print("x1: ", b[0][0].numpy())
  print("\n")
  print("y1 length: ", len(b[1][0].numpy()))
  print("y1: ", b[1][0].numpy())
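
The windowing and the input/target split can be hard to follow through the `tf.data` calls, so here is the same idea in pure Python on a tiny sequence. The function name `make_windows` and the toy ids are assumptions for this sketch; it uses a shift of 1 and drops incomplete windows, as the pipeline above does.

```python
def make_windows(ids, input_steps):
    """Yield (inputs, targets) pairs where targets are inputs shifted one step ahead."""
    window_size = input_steps + 1
    for start in range(len(ids) - window_size + 1):   # shift=1, drop incomplete windows
        window = ids[start:start + window_size]
        yield window[:-1], window[1:]

toy_ids = [10, 11, 12, 13, 14, 15]         # hypothetical character ids
pairs = list(make_windows(toy_ids, input_steps=4))
for inputs, targets in pairs:
    print(inputs, '->', targets)
```

Each target position holds the character that follows the corresponding input position, so the model is trained to predict the next character at every time step, not only at the end of the window.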

In [None]:
num_tokens = len(tokenizer.word_index) + 1 # add 1 for padding

xy_batches = xy_batches.map(lambda inputs, labels: (tf.one_hot(inputs, depth=num_tokens), labels))

for b in xy_batches.take(1):
  print("x1: ", b[0][0].numpy())
  print("\n")
  print("y1: ", b[1][0].numpy())

In [None]:
# the autotune option will automatically tune the buffer size;
# prefetch the batches that are actually passed to model.fit
xy_batches = xy_batches.prefetch(tf.data.AUTOTUNE)

### 3.7 Model architecture

In [None]:
model = keras.Sequential()

model.add(layers.LSTM(128, return_sequences=True, input_shape=[None, num_tokens], recurrent_dropout=0.2))
# only the first layer needs input_shape; later layers infer it
model.add(layers.LSTM(128, return_sequences=True, recurrent_dropout=0.2))

model.add(layers.Dense(num_tokens, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
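# save the model weights after every epoch so training can be resumed or inspected
# 'tolkien_lstm.weights.h5' is a placeholder file name; point it wherever the checkpoints should live
callback = tf.keras.callbacks.ModelCheckpoint(filepath='tolkien_lstm.weights.h5', save_weights_only=True, verbose=1)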

In [None]:
history = model.fit(xy_batches, epochs=10, callbacks=[callback])

In [None]:
# save the model
model.save('model.h5')

### 3.8 Load the model

In [None]:
# load model from google drive
from google.colab import drive
drive.mount('/content/drive')

trained_model = keras.models.load_model('/content/drive/MyDrive/Models/tolkien/model.h5')

### 3.9 Generate text

In [None]:
def generate_text(model, tokenizer, seed_text, num_chars=200, temperature=1):

  text = seed_text

  for _ in range(num_chars):

    # Encode the last 100 characters (the training window length) as one-hot vectors.
    encoded = np.array(tokenizer.texts_to_sequences([text[-100:]]))
    encoded = tf.one_hot(encoded, num_tokens)

    # Compute the next-character probabilities, keeping a leading axis of size 1.
    preds = model.predict(encoded, verbose=0)[0, -1:, :]
    # tf.random.categorical expects log-probabilities; dividing by the
    # temperature sharpens (< 1) or flattens (> 1) the distribution.
    preds = tf.math.log(preds) / temperature

    # Sample next character and add to running text.
    next_char = tf.random.categorical(preds, num_samples=1)
    next_char = tokenizer.sequences_to_texts(next_char.numpy())[0]

    text += next_char

  return text


In [None]:
print(generate_text(trained_model, tokenizer, "Sam, Frodo, and Gandalf were running when", num_chars=300, temperature=0.2))
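
The `temperature` argument controls how adventurous the sampling is. The pure-Python sketch below applies the same transformation as `generate_text` (take the log of the probabilities, divide by the temperature, renormalize) to a made-up distribution over three characters. The function name `apply_temperature` and the probabilities are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)                                  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]                              # hypothetical next-character distribution
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

A temperature below 1 concentrates probability on the most likely character, which gives safer but more repetitive text. A temperature above 1 flattens the distribution and produces more varied, and more error-prone, output.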

## Part 4: Sentiment Analysis <a id='part4'></a>

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_text as text
import numpy as np
import pandas as pd

### 4.1 Load the dataset

In [None]:
# Load the IMDB reviews dataset
dataset, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

# Use the official train split for training and the official test split for validation
train, validate = dataset['train'], dataset['test']

# Examine the dataset
train.element_spec

#### 4.1.1 Dataset info

In [None]:
info

#### 4.1.2 Data visualization

In [None]:
# Examine a review
for eg, label in train.take(1):
  print("text: ", eg.numpy())
  print("label: ", label.numpy())

In [None]:
# plot the counts of the labels in the training and validation sets
import matplotlib.pyplot as plt

train_labels = [label.numpy() for _, label in train]
validate_labels = [label.numpy() for _, label in validate]

# draw the two histograms side by side
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(train_labels)
plt.title('Training labels')
plt.subplot(1, 2, 2)
plt.hist(validate_labels)
plt.title('Validation labels')
plt.show()

### 4.2 Data preprocessing

In [None]:
# Shuffle and batch the data
BUFFER_SIZE = 10_000
BATCH_SIZE = 64

# create a dataset of batches - see https://www.tensorflow.org/guide/data_performance#prefetching
train_dataset = train.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
validate_dataset = validate.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
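
Because `BUFFER_SIZE` (10,000) is smaller than the 25,000 training reviews, `shuffle` only mixes examples within a sliding window. The plain-Python sketch below approximates that behaviour and then groups the result into batches. It is an illustration under simplifying assumptions, not the `tf.data` implementation, and the helper names `shuffle_buffer` and `make_batches` are hypothetical.

```python
import random

def shuffle_buffer(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)           # still filling the buffer
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]             # emit a random buffered element
            buffer[idx] = item            # and replace it with the new one
    rng.shuffle(buffer)                   # input exhausted: drain the rest
    yield from buffer

def make_batches(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

toy_shuffled = list(shuffle_buffer(range(10), buffer_size=4))
toy_batches = make_batches(toy_shuffled, batch_size=3)
print(toy_batches)
```

A buffer at least as large as the dataset gives a uniform shuffle; a smaller one trades shuffle quality for memory.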

In [None]:
for eg, label in train_dataset.take(1):
  print("texts: ", eg.numpy()[:3])
  print("labels: ", label.numpy()[:3])

### 4.3 Tokenize and vectorize our data

In [None]:
# Set our vocabulary size
VOCAB_SIZE = 1000

# Create a text vectorization layer
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))
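
With its default settings, `TextVectorization` lowercases the text, strips punctuation, splits on whitespace, and maps each token to an integer, reserving index 0 for padding and index 1 for out-of-vocabulary tokens (`[UNK]`). The sketch below mimics those steps in plain Python on a made-up corpus. The function names and the three toy sentences are assumptions for illustration, and the regular expression only approximates the layer's punctuation handling.

```python
import re
from collections import Counter

def standardize(text):
    """Lowercase the text and drop punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; slots 0 and 1 are reserved."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    """Map each token to its vocabulary index, or 1 if it is unknown."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

toy_texts = ["The movie was great!", "The movie was bad.", "Great acting, great plot."]
toy_vocab = build_vocab(toy_texts, max_tokens=6)
toy_ids = encode("The plot was great", toy_vocab)

print(toy_vocab)
print(toy_ids)
```

Note that `max_tokens` counts the two reserved entries, so `VOCAB_SIZE = 1000` leaves room for 998 actual words; everything else is encoded as `[UNK]`.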

#### 4.3.1 Examine the vocabulary

In [None]:
# Examine the vocabulary
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

#### 4.3.2 Vectorize the data

In [None]:
# Examine the encoded text
encoder_example = encoder(eg)[:3].numpy()
encoder_example

In [None]:
# compare the original text to the encoded text
for n in range(3):
  print("Original: ", eg[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoder_example[n]]))

### 4.4 Create our model

In [None]:
# Create a model
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
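
Setting `mask_zero=True` on the embedding tells the downstream LSTM to skip padded positions (token id 0) and carry its state forward unchanged. The toy recurrence below is only a sketch of that idea; the update rule `0.5 * state + token` and the function name `final_state` are invented for illustration and have nothing to do with real LSTM weights.

```python
def final_state(token_ids, mask_zero=True):
    """Run a toy recurrence; optionally skip padded steps (token id 0)."""
    state = 0.0
    for tok in token_ids:
        if mask_zero and tok == 0:
            continue  # padded step: keep the previous state untouched
        state = 0.5 * state + tok
    return state

toy_padded = [3, 5, 2, 0, 0]   # a short review padded to the batch length
toy_unpadded = [3, 5, 2]

masked = final_state(toy_padded, mask_zero=True)
unmasked = final_state(toy_padded, mask_zero=False)
reference = final_state(toy_unpadded)

print("masked:", masked, "unmasked:", unmasked, "reference:", reference)
```

With masking, the padded and unpadded sequences end in the same state, so the amount of padding in a batch does not change the prediction.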

In [None]:
# Compile the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
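
The final `Dense(1)` layer has no activation, so the model outputs a raw logit, which is why the loss is built with `from_logits=True`. As a hedged sketch of what that option implies, the plain-Python code below compares a numerically stable cross-entropy computed directly from a logit with the textbook form applied after a sigmoid. The function names are hypothetical, and this is the standard stable formulation rather than a copy of the Keras source.

```python
import math

def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for a single raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_from_probability(prob, label):
    """Textbook form, applied to a probability produced by a sigmoid."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

toy_logit = 2.0
toy_prob = 1.0 / (1.0 + math.exp(-toy_logit))

stable_loss = bce_from_logit(toy_logit, 1)
textbook_loss = bce_from_probability(toy_prob, 1)

print("from logit:      ", stable_loss)
print("from probability:", textbook_loss)
```

Both forms agree for moderate logits, but the logit form stays finite for extreme values where the sigmoid would round to exactly 0 or 1.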

In [None]:
# train the model
history = model.fit(train_dataset, epochs=10, validation_data=validate_dataset, validation_steps=30)

#### 4.4.1 Validate our model

In [None]:
# validate our model
val_loss, val_acc = model.evaluate(validate_dataset)

print('Test Loss:', val_loss)
print('Test Accuracy:', val_acc)

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

  
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

### 4.5 Test our model

In [None]:
# test our model
sample_text = ('The movie was a joke. The animation and the graphics '
               'were out of this world, but the acting was horrendous. '
               'I would not recommend this movie.')

In [None]:
# predict the sentiment; the model outputs a raw logit (>= 0 means positive)
prediction = model.predict(tf.constant([sample_text]))

# Show the results
prediction
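
Since the model was trained with `from_logits=True`, the value in `prediction` is a logit rather than a probability. A minimal sketch of how to read it, in plain Python, is shown below: a sigmoid turns the logit into the probability of the positive class (label 1 in the IMDB dataset), and the decision threshold sits at a logit of 0. The example logits are made up and the helper names are hypothetical.

```python
import math

def logit_to_probability(logit):
    """Sigmoid: map a raw logit to the probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-logit))

def label_from_logit(logit):
    """A logit of 0 corresponds to a probability of 0.5."""
    return "positive" if logit >= 0.0 else "negative"

example_logits = [-1.5, 0.0, 2.3]  # made-up values, not real model output

readings = [(z, logit_to_probability(z), label_from_logit(z)) for z in example_logits]
for z, p, label in readings:
    print(f"logit={z:+.1f}  probability={p:.3f}  label={label}")
```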

### 4.6 Visualize our model

In [None]:
# Our LSTM model
model.summary()

In [None]:
# draw plot of the model
tf.keras.utils.plot_model(model, show_shapes=True)

## 5.1 BiLSTM Model

### 5.1.1 Create our model

In [None]:
# Model
model_bilstm = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
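
A `Bidirectional` wrapper runs one copy of the recurrent layer over the sequence from left to right and another from right to left, then (by default) concatenates the two results, so `Bidirectional(LSTM(64))` produces 128 features. The plain-Python sketch below shows only that wiring; the toy `tanh` recurrence with fixed weights stands in for an LSTM, and all names are assumptions of this example.

```python
import math

def run_rnn(sequence, units, reverse=False):
    """Toy recurrence with fixed (untrained) weights; returns the final state."""
    steps = reversed(sequence) if reverse else sequence
    state = [0.0] * units
    for x in steps:
        # each unit mixes the current input with its own previous state
        state = [math.tanh(0.5 * x + 0.9 * h + 0.1 * k) for k, h in enumerate(state)]
    return state

def bidirectional(sequence, units):
    """Concatenate the final states of a forward pass and a backward pass."""
    forward = run_rnn(sequence, units)
    backward = run_rnn(sequence, units, reverse=True)
    return forward + backward

toy_sequence = [0.2, -0.4, 0.9, 0.1]
toy_output = bidirectional(toy_sequence, units=3)

print(len(toy_output), toy_output)
```

The backward half summarises the sequence starting from its end, which gives the classifier context from both directions for every review.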

### 5.1.2 Compile our model

In [None]:
# compile our bidirectional LSTM model
model_bilstm.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                     optimizer=tf.keras.optimizers.Adam(1e-4),
                     metrics=['accuracy'])

### 5.1.3 Train our model

In [None]:
# train our bidirectional LSTM model
history = model_bilstm.fit(train_dataset, epochs=10, validation_data=validate_dataset, validation_steps=30)

### 5.2 Visualize training

In [None]:
# reuse plot_graphs (defined in section 4.4.1) for the BiLSTM training history
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

## PyTorch Example

[Notebook](https://colab.research.google.com/drive/1Et8IO-BCBdSYkhkTcCbo624gqfJD9H7h#scrollTo=cTqhw4K0qIBx)