<a href="https://colab.research.google.com/github/Angel-Castro-RC/Final_NLP/blob/main/F6_2_RecurrentNeuralNetworks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Recurrent Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F6_2_RecurrentNeuralNetworks)

## Announcement Update

AI - English Faculty Candidate: Gabriel Ford

Scholarly Presentation: Friday at 9:00am in Howard 309

## Reference

SLP: RNNs and LSTMs, Chapter 9 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/9.pdf

Keras documentation for SimpleRNN Layer: https://keras.io/api/layers/recurrent_layers/simple_rnn/

In [None]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow transformers


## Recurrent Neural Networks (RNN)

A **recurrent neural network** is a neural network with a loop inside it: some of the outputs of one layer become inputs to the same layer or an earlier one.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_highlevel.png?raw=1">
</div>

* $x_{t}$: neural network input at time $t$
* $h_{t}$: hidden layer state at time $t$
* $y_{t}$: output layer state at time $t$

*Allows information from past inputs to affect current predictions*


image source: SLP Fig. 9.1, https://web.stanford.edu/~jurafsky/slp3/9.pdf
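
As a compact summary of the picture above, the recurrence can be written in the notation of SLP Chapter 9. The weight matrices $W$, $U$, $V$ and the activation function $g$ are the textbook's names; they are not defined elsewhere in this notebook.

$$
h_t = g(U h_{t-1} + W x_t), \qquad y_t = \mathrm{softmax}(V h_t)
$$

Here $W$ maps the input to the hidden layer, $U$ maps the previous hidden state to the current one, and $V$ maps the hidden state to the output. The $U h_{t-1}$ term is the "loop".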

## RNN visualized as a feedforward network

In this image, the inputs are shown at the bottom and the outputs at the top.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_as_feedforward.png?raw=1" width=400>
</div>

* $h_{t-1}$: hidden layer state at time $t-1$ is an input to $h_{t}$


image source: SLP Fig. 9.2, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## RNN "unrolled" in time

Later outputs continue to be influenced by the entire sequence that came before them.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_unroll.png?raw=1" width=500>
</div>


image source: SLP Fig. 9.4, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## Coding up a simple RNN in Keras

Defining a Recurrent layer is similar to defining a Dense layer

`return_sequences=False`: for now, we don't want the layer to return its output at every time step, just the last output.

`stateful=False`: the state is reset after each **batch**. Setting `stateful=True` instead allows the state from one batch to carry over to the next.
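
To make these two parameters concrete, here is a minimal NumPy sketch of a tanh recurrent layer (tanh is the default activation of Keras's `SimpleRNN`). It is not how Keras implements the layer; the function name `simple_rnn`, the weight names `W`, `U`, `b`, and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np

def simple_rnn(inputs, W, U, b, h0=None, return_sequences=False):
    """Run a tanh RNN over `inputs` of shape (timesteps, input_dim)."""
    h = np.zeros(U.shape[0]) if h0 is None else h0
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    states = np.stack(states)
    return states if return_sequences else states[-1]

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))       # input -> hidden (illustrative sizes)
U = rng.normal(size=(4, 4))       # hidden -> hidden
b = np.zeros(4)
inputs = rng.normal(size=(6, 3))  # 6 timesteps, 3 features each

last_only = simple_rnn(inputs, W, U, b)                         # shape (4,)
all_steps = simple_rnn(inputs, W, U, b, return_sequences=True)  # shape (6, 4)

# "Stateful" processing: feed the final state of one chunk into the next chunk
h_first = simple_rnn(inputs[:3], W, U, b)
h_carried = simple_rnn(inputs[3:], W, U, b, h0=h_first)
h_reset = simple_rnn(inputs[3:], W, U, b)  # state starts from zeros again
```

Carrying the state across the two chunks gives the same final state as processing all six steps at once, while resetting it in between does not. That difference is what `stateful` controls.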

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, SimpleRNN

# vocabulary_size and sequence_length come from your data preparation code

# A feedforward network with one hidden layer
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=sequence_length))
model.add(Flatten())
model.add(Dense(100, activation="relu"))
model.add(Dense(vocabulary_size, activation='softmax'))

# A recurrent network with one layer: SimpleRNN replaces Flatten + Dense
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=sequence_length))
model.add(SimpleRNN(100,return_sequences=False,stateful=False))
model.add(Dense(vocabulary_size, activation='softmax'))


### Exercise

Copy in your code from the non-recurrent neural language model from last time, and replace the Flatten+Dense layer with a SimpleRNN layer like above.
* Use the same dataset, `ag_news`, prepared in the same way
* Run it with a small enough subset to train within a few minutes

How do the performances compare?

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np

data = load_dataset("ag_news")

data_subset, _ = train_test_split(data["train"]["text"], train_size=500)
train_data, test_data = train_test_split(data_subset, train_size=0.8)

# Prepare the tokenizer and fit it on the whole subset (train and test text)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data_subset)
vocabulary_size = len(tokenizer.word_index) + 1
print("Vocabulary size:", vocabulary_size)

# Convert text to sequences of integers
train_texts = tokenizer.texts_to_sequences(train_data)

sequence_length = 5  # Length of the input sequence before predicting the next word

# Create the sequences
predictor_sequences = []
targets = []
for text in train_texts:
    for i in range(sequence_length, len(text)):
        # Take the sequence of tokens as input and the next token as target
        curr_target = text[i]
        curr_predictor_sequence = text[i - sequence_length:i]
        predictor_sequences.append(curr_predictor_sequence)
        targets.append(curr_target)

predictor_sequences = np.array(predictor_sequences)
targets = np.array(targets)

# Convert targets to one-hot vectors
targets_one_hot = to_categorical(targets, num_classes=vocabulary_size)

# Create the neural language model with SimpleRNN layer
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=sequence_length))
model.add(SimpleRNN(100, activation='relu'))
model.add(Dense(vocabulary_size, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Training the model
model.fit(predictor_sequences, targets_one_hot, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model on test data
test_texts = tokenizer.texts_to_sequences(test_data)
test_predictor_sequences = []
test_targets = []
for text in test_texts:
    for i in range(sequence_length, len(text)):
        curr_target = text[i]
        curr_predictor_sequence = text[i - sequence_length:i]
        test_predictor_sequences.append(curr_predictor_sequence)
        test_targets.append(curr_target)

test_predictor_sequences = np.array(test_predictor_sequences)
test_targets = np.array(test_targets)

test_targets_one_hot = to_categorical(test_targets, num_classes=vocabulary_size)

loss, accuracy = model.evaluate(test_predictor_sequences, test_targets_one_hot)
print(f"Test accuracy: {accuracy}")

Vocabulary size: 5287
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.03959843888878822


In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
import numpy as np

def prepare_sequences(tokenizer, data, sequence_length, vocabulary_size):
    # Convert texts to sequences
    sequences = tokenizer.texts_to_sequences(data)

    # Create predictor sequences and targets
    predictor_sequences = []
    targets = []
    for text in sequences:
        for i in range(sequence_length, len(text)):
            curr_target = text[i]
            curr_predictor_sequence = text[i - sequence_length:i]
            predictor_sequences.append(curr_predictor_sequence)
            targets.append(curr_target)

    # Pad sequences
    predictor_sequences_padded = pad_sequences(predictor_sequences, maxlen=sequence_length, padding='pre')

    # Convert targets to one-hot encoding
    target_word_one_hot = to_categorical(targets, num_classes=vocabulary_size)

    return predictor_sequences_padded, target_word_one_hot

data = load_dataset("ag_news")

data_subset, _ = train_test_split(data["train"]["text"], train_size=500)
train_data, test_data = train_test_split(data_subset, train_size=0.8)

# Prepare the tokenizer and fit it on the whole subset (train and test text)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data_subset)
vocabulary_size = len(tokenizer.word_index) + 1
print("Vocabulary size:", vocabulary_size)

sequence_length = 1  # Length of the input sequence before predicting the next word

# Prepare training set
predictor_sequences_padded, target_word_one_hot = prepare_sequences(tokenizer, train_data, sequence_length, vocabulary_size)

# Prepare test set
predictor_sequences_padded_test, target_word_one_hot_test = prepare_sequences(tokenizer, test_data, sequence_length, vocabulary_size)

# Create the neural language model with SimpleRNN layer
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=sequence_length))
model.add(SimpleRNN(100, activation='relu'))
model.add(Dense(vocabulary_size, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Training the model
model.fit(predictor_sequences_padded, target_word_one_hot, shuffle = False, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model on test data
loss, accuracy = model.evaluate(predictor_sequences_padded_test, target_word_one_hot_test)
print(f"Test accuracy: {accuracy * 100:.2f}%")

Vocabulary size: 5254
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 8.70%


Test accuracy: 9.78%
0.09203854203224182

## Reducing your context window

Because the RNN layer processes its input one step at a time and keeps a hidden state, you don't need to pass in as large a context window.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_context_simplification.png?raw=1" width=500>
</div>


image source: SLP Fig. 9.5, https://web.stanford.edu/~jurafsky/slp3/9.pdf
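
To see what a smaller context window does to the training data, here is a small sketch of the sliding-window step used in this notebook, applied to a toy list. The helper name `make_pairs` and the token ids are made up for illustration.

```python
def make_pairs(token_ids, sequence_length):
    """Slide a window over token_ids, pairing each window with the token after it."""
    pairs = []
    for i in range(sequence_length, len(token_ids)):
        pairs.append((token_ids[i - sequence_length:i], token_ids[i]))
    return pairs

toy_ids = [7, 3, 9, 4, 2, 8]  # made-up token ids

pairs_len5 = make_pairs(toy_ids, 5)
pairs_len1 = make_pairs(toy_ids, 1)
print(pairs_len5)  # [([7, 3, 9, 4, 2], 8)]
print(pairs_len1)  # [([7], 3), ([3], 9), ([9], 4), ([4], 2), ([2], 8)]
```

With a window of 5, a six-token text yields a single training pair; with a window of 1, it yields five. Short texts such as news headlines therefore contribute many more examples when the window is small.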



<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_languagemodeling.png?raw=1" width=700>
</div>

image source: SLP Fig. 9.6, https://web.stanford.edu/~jurafsky/slp3/9.pdf

### Exercise
---
Reduce your `sequence_length` to 1. Train and test again.

How do the results compare?

* The model did better with a `sequence_length` of only 1: test accuracy was 9.78%, compared with about 3.96% when `sequence_length` was 5.

---

## Generating Text




Our Keras RNN-based neural language model doesn't do a great job of generating text

### Exercise:

Try it with this text generation code from last time

In [None]:
starter_string = "the"
tokens_list = tokenizer.texts_to_sequences([starter_string])
tokens = tokens_list[0]

for i in range(50):
    curr_seq = tokens[-sequence_length:]
    curr_array = np.array([curr_seq])
    predicted_probabilities = model.predict(curr_array,verbose=0)
    predicted_index = np.argmax(predicted_probabilities)
    predicted_word = tokenizer.index_word[predicted_index]
    print(predicted_word+" ",end="")
    tokens.append(predicted_index)

world cup qualifier set to the world cup qualifier set to the world cup qualifier set to the world cup qualifier set to the world cup qualifier set to the world cup qualifier set to the world cup qualifier set to the world cup qualifier set to the world cup 

**One problem:** Keras will reset the state every time you make a call to `model.predict` so we lose the benefit of recurrence.
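
The repeating output above follows from this. If the state is reset on every call and only one token is passed in, the greedy (`argmax`) prediction depends on nothing but the current word, so generation must fall into a cycle as soon as a word repeats. The sketch below imitates that with a lookup table; the table `greedy_next` is a made-up stand-in for the model, not something read from the trained network.

```python
# Hypothetical stand-in for argmax(model.predict(...)) when the state is reset
# on every call: the next word depends only on the current word.
greedy_next = {
    "the": "world", "world": "cup", "cup": "qualifier",
    "qualifier": "set", "set": "to", "to": "the",
}

def generate(start, steps):
    """Greedily follow the table for `steps` words, starting from `start`."""
    words = [start]
    for _ in range(steps):
        words.append(greedy_next[words[-1]])
    return words

generated = generate("the", 12)
print(" ".join(generated[1:]))
# world cup qualifier set to the world cup qualifier set to the
```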

## Exerting more control over when the state gets reset

If your model uses the `stateful=True` parameter on the recurrent layer, you get more control over when to reset the state.
* Downside: it's more of a pain to train the network like that

*A workaround:* create another model with the same architecture except for `stateful` and copy the weights

In [None]:
# Create a new model with the same architecture but with stateful RNNs
stateful = Sequential()
stateful.add(Embedding(input_dim=vocabulary_size, output_dim=50, batch_input_shape=(1, sequence_length))) #batch size of 1
stateful.add(SimpleRNN(100,return_sequences=False,stateful=True))
stateful.add(Dense(vocabulary_size, activation='softmax'))

# Load the weights from your trained model
stateful.set_weights(model.get_weights())

# Compile the stateful model (optional here: Keras does not need compile() just to call predict)
stateful.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
starter_string = "the"
tokens_list = tokenizer.texts_to_sequences([starter_string])
tokens = tokens_list[0]

#do this anytime you want to reset the states - for generating a brand new sequence
stateful.reset_states()

for i in range(50):
    curr_seq = tokens[-sequence_length:]
    curr_array = np.array([curr_seq])
    predicted_probabilities = stateful.predict(curr_array,verbose=0)
    predicted_index = np.argmax(predicted_probabilities)
    predicted_word = tokenizer.index_word[predicted_index]
    print(predicted_word+" ",end="")
    tokens.append(predicted_index)

principles of darfur rebels basketball of manager that could potentially result of music via york without a bridge after the rebound and reduce spending on jan 4 program of music via karzai leads vote count don't cutting maker energy plc shares blair said on jan lt a matter of manager 

## Training a stateful model

Keras makes you work a little harder if you want to train a stateful model from the start
* Organize your sequences into batches
* All batches need to be the same size (say 32 or 64)

Might be appropriate if
* You have several long documents
* Each document takes multiple batches
* You *don't* want to reset states between batches
* You *do* want to reset states between documents
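
One detail matters when organizing the batches. As the Keras documentation describes stateful layers, the final state for the sample at index `i` in one batch becomes the initial state for the sample at index `i` in the next batch. The sketch below shows one layout that respects this, under the assumption of a single long token stream. The helper name `stateful_batches` and the toy ids are made up for illustration.

```python
def stateful_batches(token_ids, batch_size, sequence_length=1):
    """Arrange one token stream so row i of batch k+1 continues row i of batch k."""
    stream_len = len(token_ids) // batch_size  # tokens per parallel stream
    streams = [token_ids[r * stream_len:(r + 1) * stream_len]
               for r in range(batch_size)]
    # Each stream yields windows of sequence_length inputs plus one target
    num_batches = (stream_len - 1) // sequence_length
    batches = []
    for k in range(num_batches):
        start = k * sequence_length
        inputs = [s[start:start + sequence_length] for s in streams]
        targets = [s[start + sequence_length] for s in streams]
        batches.append((inputs, targets))
    return batches

toy_ids = list(range(12))  # made-up ids 0..11
batches = stateful_batches(toy_ids, batch_size=3)
for inputs, targets in batches:
    print(inputs, targets)
# [[0], [4], [8]] [1, 5, 9]
# [[1], [5], [9]] [2, 6, 10]
# [[2], [6], [10]] [3, 7, 11]
```

Row 0 reads tokens 0, 1, 2 across the three batches, so the state it carries always comes from the token just before. The `put_into_batches` helper later in this notebook uses a simpler layout that places consecutive samples in the same batch, so it may be worth comparing both.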

## Throwback to a data set we worked with previously

This example is going to do a couple of things
* use The Adventures of Sherlock Holmes corpus we downloaded from Project Gutenberg
* use the WordPiece tokenizer from Hugging Face
    * I want to keep around things like punctuation, which the Keras tokenizer removes
    * I want to show you how you can mix different tokenizers with Keras models

In [None]:
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_tokens = tokenizer.tokenize( sherlock_raw_text )

sherlock_tokens = sherlock_tokens[:10000] #let's limit the size of the text for this workshop

print("Here's a sample of the tokenized text:")
print(sherlock_tokens[1000:1020])

token_ids = tokenizer.convert_tokens_to_ids(sherlock_tokens )
print("\nHere's the text's ids")
print(token_ids[1000:1020])

print("Vocabulary size:")
print(len(tokenizer.vocab))
vocabulary_size = len(tokenizer.vocab)

In [None]:
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

response2 = requests.get("https://gutenberg.org/cache/epub/33361/pg33361.txt")
Oz_raw_text = response2.text  # read the second download, not the Sherlock response

Oz_tokens = tokenizer.tokenize( Oz_raw_text )

Oz_tokens = Oz_tokens[:10000] #let's limit the size of the text for this workshop

print("Here's a sample of the tokenized text:")
print(Oz_tokens[1000:1020])

token_ids2 = tokenizer.convert_tokens_to_ids(Oz_tokens )
print("\nHere's the text's ids")
print(token_ids2[1000:1020])

print("Vocabulary size:")
print(len(tokenizer.vocab))
vocabulary_size2 = len(tokenizer.vocab)

Token indices sequence length is longer than the specified maximum sequence length for this model (60248 > 512). Running this sequence through the model will result in indexing errors


Here's a sample of the tokenized text:
['Bill', '##ina', 'is', '_', 'real', 'Oz', '##zy', '_', ',', 'Mr', '.', 'Ba', '##um', ',', 'and', 'so', 'are', 'T', '##ik', '##tok']

Here's the text's ids
[2617, 2983, 1110, 168, 1842, 16075, 6482, 168, 117, 1828, 119, 18757, 1818, 117, 1105, 1177, 1132, 157, 4847, 18290]
Vocabulary size:
28996


### Preparing the list of predictor/target pairs like before

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, SimpleRNN
from keras.utils import to_categorical
from keras.utils import pad_sequences
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np
import random


sequence_length = 1
batch_size = 32

predictor_sequences = []
targets = []
for i in range(sequence_length, len(token_ids)):
    # Take the sequence of tokens as input and the next token as target
    curr_target = token_ids[i]
    curr_predictor_sequence = token_ids[i-sequence_length:i]
    predictor_sequences.append(curr_predictor_sequence)
    targets.append(curr_target)

# Convert target to one-hot encoding
targets_one_hot = to_categorical(targets, num_classes=vocabulary_size)

In [None]:

sequence_length = 1
batch_size = 32

# Use separate names so the Sherlock Holmes lists above are not overwritten
predictor_sequences2 = []
targets2 = []
for i in range(sequence_length, len(token_ids2)):
    # Take the sequence of tokens as input and the next token as target
    curr_target2 = token_ids2[i]
    curr_predictor_sequence2 = token_ids2[i-sequence_length:i]
    predictor_sequences2.append(curr_predictor_sequence2)
    targets2.append(curr_target2)

# Convert target to one-hot encoding
targets_one_hot2 = to_categorical(targets2, num_classes=vocabulary_size2)

### Grouping the sequences into batches of 32

This adds an extra leading dimension to our data: the shape goes from `(num_samples, sequence_length)` to `(num_batches, batch_size, sequence_length)`

In [None]:
def put_into_batches(data,batch_size):
    num_batches = (len(data) // batch_size)
    batched_data = []
    for batch_idx in range(num_batches):
        curr_batch = data[batch_idx*batch_size:(batch_idx+1)*batch_size]
        batched_data.append(curr_batch)
    batched_data = np.array(batched_data)
    return batched_data


train_features_batched = put_into_batches(predictor_sequences,batch_size)
train_targets_batched = put_into_batches(targets_one_hot,batch_size)

print("before batching")
print(np.array(predictor_sequences))

print("\nafter batching")
print(train_features_batched)

In [None]:
train_features_batched2 = put_into_batches(predictor_sequences2,batch_size)
train_targets_batched2 = put_into_batches(targets_one_hot2,batch_size)

print("before batching")
print(np.array(predictor_sequences2))

print("\nafter batching")
print(train_features_batched2)

before batching
[[1109]
 [4042]
 [ 144]
 ...
 [3219]
 [1123]
 [1246]]

after batching
[[[ 1109]
  [ 4042]
  [  144]
  ...
  [ 1168]
  [ 2192]
  [ 1104]]

 [[ 1103]
  [ 1362]
  [ 1120]
  ...
  [ 1103]
  [ 4042]
  [  144]]

 [[ 6140]
  [ 8904]
  [24689]
  ...
  [ 1209]
  [ 1138]
  [ 1106]]

 ...

 [[ 1117]
  [ 1532]
  [ 1122]
  ...
  [ 1942]
  [18589]
  [26140]]

 [[13020]
  [18589]
  [ 2162]
  ...
  [  107]
  [  146]
  [ 1267]]

 [[ 1122]
  [ 2762]
  [  112]
  ...
  [ 1513]
  [ 1107]
  [ 1103]]]
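
The batch count that shows up during training can be checked by hand. With 10,000 tokens and `sequence_length = 1` there are 9,999 predictor/target pairs, and `put_into_batches` keeps only full batches. The sketch below mirrors that helper with plain lists; the name `put_into_batches_list` and the placeholder samples are assumptions made for illustration.

```python
def put_into_batches_list(data, batch_size):
    """List-based mirror of put_into_batches: full batches only, remainder dropped."""
    num_batches = len(data) // batch_size
    return [data[b * batch_size:(b + 1) * batch_size] for b in range(num_batches)]

num_tokens, sequence_length, batch_size = 10000, 1, 32
samples = [[i] for i in range(num_tokens - sequence_length)]  # 9999 placeholder samples

batched = put_into_batches_list(samples, batch_size)
num_dropped = len(samples) - len(batched) * batch_size
print(len(batched), len(batched[0]), len(batched[0][0]))  # 312 32 1
print(num_dropped)                                        # 15
```

The last 15 samples do not fill a batch and are left out, which is acceptable here because a stateful layer needs every batch to be the same size.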


## Creating and compiling the model

Note that in this case, we set `batch_input_shape=(batch_size, sequence_length)` instead of `input_length=sequence_length`, because a stateful layer needs to know the batch size in advance.

In [None]:
# Define the model
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, batch_input_shape=(batch_size, sequence_length)))
model.add(SimpleRNN(100,return_sequences=False,stateful=True))
model.add(Dense(vocabulary_size, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

## Writing a training loop

Instead of just calling `model.fit`, we'll call `model.train_on_batch` in our own loop so that we decide when the state gets reset.

In [None]:
num_epochs = 10  # Number of epochs to train for
number_of_batches = len(train_features_batched)

for epoch in range(num_epochs):
    print(f'Epoch {epoch+1}/{num_epochs}')
    model.reset_states()  # Reset states at the start of each epoch


    for batch_idx in range(number_of_batches):
        # Print the batch number every 100 batches
        if (batch_idx+1) % 100 == 0:
            print(f'\tBatch {batch_idx+1}/{number_of_batches}')

        # Train on the batch
        model.train_on_batch(train_features_batched[batch_idx], train_targets_batched[batch_idx])

    # If you switch to a new document, reset the states like this
    model.reset_states()




Epoch 1/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 2/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 3/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 4/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 5/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 6/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 7/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 8/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 9/10
	Batch 100/312
	Batch 200/312
	Batch 300/312
Epoch 10/10
	Batch 100/312
	Batch 200/312
	Batch 300/312


### Now let's use our model to generate some text

This code looks quite different because we're using the Hugging Face tokenizer
* turn text into ids with `tokenizer.encode`
* turn ids into text with `tokenizer.decode`

In [None]:
starter_string = "the"

# Encode the starter string to token IDs
input_ids = tokenizer.encode(starter_string, add_special_tokens=False)

for i in range(50):
    # Get the last sequence_length tokens
    curr_seq = input_ids[-sequence_length:]
    # Predict the next token ID
    predicted_probabilities = model.predict(np.array([curr_seq]), verbose=0)
    predicted_index = np.argmax(predicted_probabilities, axis=-1)
    # Add the predicted token ID to the sequence
    input_ids.append(predicted_index[0])

# Decode the token IDs to a string
generated_sequence = tokenizer.decode(input_ids, clean_up_tokenization_spaces=True)
print(generated_sequence)


theut the man, and theut the man, and theut the man, and theut the man, and theut the man, and theut the man, and theut the man, and theut the man, and theut the


## Applied Exploration

Adjust the code to get this working on more than one longer document
* can get multiple Project Gutenberg texts
* can use a Hugging Face dataset with longer texts (i.e., multiple sentences per entry, unlike `ag_news`)
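
One possible shape for the multi-document training loop is sketched below. It only shows where the state resets go. `RecordingModel` is a hypothetical stand-in that logs calls instead of training, and the `documents` structure (one pair of batched features and targets per text) is an assumption about how you might organize your data.

```python
class RecordingModel:
    """Hypothetical stand-in for a stateful Keras model; it only logs calls."""
    def __init__(self):
        self.calls = []

    def reset_states(self):
        self.calls.append("reset")

    def train_on_batch(self, features, targets):
        self.calls.append("train")

def train_on_documents(model, documents, num_epochs):
    """documents: list of (features_batched, targets_batched) pairs, one per text."""
    for _ in range(num_epochs):
        for features_batched, targets_batched in documents:
            model.reset_states()  # new document: forget the previous one
            for features, targets in zip(features_batched, targets_batched):
                model.train_on_batch(features, targets)  # state carries across batches

# Two made-up "documents" with 2 and 3 batches of placeholder data
documents = [([0, 1], [0, 1]), ([0, 1, 2], [0, 1, 2])]
stub = RecordingModel()
train_on_documents(stub, documents, num_epochs=1)
print(stub.calls)
# ['reset', 'train', 'train', 'reset', 'train', 'train', 'train']
```

With a real model, all documents would need to share one tokenizer and vocabulary size so that a single output layer fits every text.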

Let it train for a while and then generate some text
* Did training with larger data sets improve the kind of text you were able to generate?
* Describe what you did and write up an interpretation of your results